# ordered_momentum_for_asynchronous_sgd__228338e2.pdf

Ordered Momentum for Asynchronous SGD

Chang-Wei Shi Yi-Rui Yang Wu-Jun Li National Key Laboratory for Novel Software Technology, School of Computer Science, Nanjing University, Nanjing, China {shicw, yangyr}@smail.nju.edu.cn, liwujun@nju.edu.cn

Distributed learning is essential for training large-scale deep models. Asynchronous SGD (ASGD) and its variants are commonly used distributed learning methods, particularly in scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have found that naively incorporating momentum into ASGD can impede the convergence. In this paper, we propose a novel method called ordered momentum (Or Mo) for ASGD. In Or Mo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of Or Mo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis of ASGD with momentum without dependence on the maximum delay. Empirical results demonstrate that Or Mo can achieve better convergence performance compared with ASGD and other asynchronous methods with momentum.

1 Introduction

Many machine learning problems can be formulated as optimization problems of the following form:

min w Rd F(w) = Eξ D [f(w; ξ)] , (1)

where w denotes the model parameter, d is the dimension of the parameter, D represents the distribution of the training instances and f(w; ξ) denotes the loss on the training instance ξ.

Stochastic gradient descent (SGD) [28] and its variants [6, 12] are widely employed to solve the problem in (1). At each iteration, SGD uses one stochastic gradient or a mini-batch of stochastic gradients as an estimate of the full gradient to update the model parameter. In practice, momentum [26, 36, 24, 33] is often incorporated into SGD as a crucial technique for faster convergence and better generalization performance. Many popular machine learning libraries, such as Tensor Flow [1] and Py Torch [25], include SGD with momentum (SGDm) as one of the optimizers.

Due to the rapid increase in the sizes of both models and datasets in recent years, a single machine is often insufficient to complete the training task of machine learning models within a reasonable time. Distributed learning [42, 37] aims to distribute the computations across multiple machines (workers) to accelerate the training process. Because of its necessity for training large-scale machine learning models, distributed learning has become a hot research topic in recent years. Existing distributed learning methods can be categorized into two main types: synchronous distributed learning (SDL) methods [19, 35, 29, 38, 44, 39] and asynchronous distributed learning (ADL) methods [2, 27, 8, 47, 17, 43]. In SDL methods, faster workers that have completed the computation must wait idly for

Corresponding author.

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

the other slower workers in each communication round. Hence, the speed of SDL methods is often hindered by slow workers. In contrast, faster workers do not necessarily wait idly for the other slower workers in ADL methods, because ADL methods require aggregating information from only one worker or a subset of workers in each communication round. Representative ADL methods include asynchronous SGD (ASGD) and its variants [2, 8, 46, 30, 49, 48, 7, 3, 31, 22, 13]. In ASGD, once a worker finishes its gradient computation, the parameter (typically on the server) is immediately updated using this gradient through an SGD step, without waiting for other workers.

Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training [33]. In SDL methods, momentum is extensively utilized across various domains, including decentralized algorithms [18, 45], communication compression algorithms [19, 29, 38, 40, 34, 41], infrequent communication algorithms [44, 39, 40], and federated learning algorithms [21, 32]. However, in ADL methods, some works [23, 9] have found that naively incorporating momentum into ASGD may decrease the convergence rate or even result in divergence. To tackle this challenge, some more sophisticated methods have been proposed to incorporate momentum into ASGD. The works in [23, 9] recommend tuning the momentum coefficient to enhance convergence performance when naively incorporating momentum into ASGD. The work in [9] proposes shifted momentum, which maintains local momentum on each worker. Inspired by Nesterov s accelerated gradient, the work in [4] proposes SMEGA2, which leverages the momentum to estimate the future parameter. However, the process of tuning the momentum coefficient in [23, 9] is time-consuming and yields limited improvement in practice. Although shifted momentum and SMEGA2 can achieve better empirical convergence performance than the method which naively incorporates momentum into ASGD, both of them lack theoretical convergence analysis.

In this paper, we propose a novel method, called ordered momentum (Or Mo), for asynchronous SGD. The main contributions of this paper are outlined as follows:

Or Mo incorporates momentum into ASGD by organizing the gradients in order based on their iteration indexes.

We theoretically prove the convergence of Or Mo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis for ASGD with momentum without dependence on the maximum delay.

Empirical results demonstrate that Or Mo can achieve better convergence performance compared with ASGD and other asynchronous methods with momentum.

2 Preliminary

In this paper, we use to denote the L2 norm. For a positive integer n, we use [n] to denote the set {0, 1, 2, . . . , n 1}. f(w; ξ) denotes the stochastic gradient computed over the training instance ξ and model parameter w. In this paper, we focus on the widely used Parameter Server framework [15], where the server is responsible for storing and updating the model parameter and the workers are responsible for sampling training instances and computing stochastic gradients. For simplicity, we assume that each worker samples one training instance for gradient computation each time. The analysis of mini-batch sampling on each worker follows a similar approach.

One of the most representative methods for distributing SGD across multiple workers is Synchronous SGD (SSGD) [20, 10]. Distributed SGD (DSGD), as presented in Algorithm 1, unifies SSGD and ASGD within a single framework [13]. The waiting set C in Algorithm 1 is a collection of workers (indexes) that are awaiting the server to send the latest parameter. The only difference between SSGD and ASGD is the communication scheduler associated with the waiting set. SSGD corresponds to DSGD with a synchronous communication scheduler, while ASGD corresponds to DSGD with an asynchronous communication scheduler. We use gkt ite(kt,t) to denote the stochastic gradient f(wite(kt,t); ξkt), where kt is the index of the worker whose gradient participates in the parameter update at iteration t and ξkt denotes a training instance sampled on worker kt. The function ite(k, t) denotes the iteration index of the latest parameter sent to worker k before iteration t, where k [K] and t [T]. The delay of the gradient gkt ite(kt,t) is defined as τt = t ite(kt, t). When K = 1, DSGD degenerates to vanilla SGD, i.e., ite(kt, t) t.

Algorithm 1 Distributed SGD

1: Server: 2: Input: number of workers K, number of iterations T, learning rate η; 3: Initialization: initial parameter w0, waiting set C = ; 4: Send the initial parameter w0 to all workers; 5: for t = 0 to T 1 do 6: Receive a stochastic gradient gkt ite(kt,t) from some worker kt;

7: Update the parameter wt+1 = wt ηgkt ite(kt,t);

8: Add the worker kt to the waiting set C = C {kt}; 9: Execute the communication scheduler: Option I: (Synchronous) only when all the workers are in the waiting set, i.e., C = [K], send the parameter wt+1 to the workers in C and set C to ; Option II: (Asynchronous) once the waiting set is not empty, i.e., C = , immediately send the parameter wt+1 to the worker in C and set C to ; 10: end for 11: Notify all workers to stop; 12: Worker k : (k [K]) 13: repeat 14: Wait until receiving the parameter w from the server; 15: Randomly sample ξk D and then compute the stochastic gradient gk = f(w; ξk); 16: Send the stochastic gradient gk to the server; 17: until receive server s notification to stop

In ASGD, the latest parameter wt+1 will be immediately sent back to the worker after the server updates the parameter at each iteration. The function ite(k, t) in ASGD can be formulated as follows:

ite(k, t) =

0 t = 0, k [K], t t > 0, k = kt 1, ite(k, t 1) t > 0, k = kt 1,

where k [K], t [T].

In SSGD, there is a barrier in the synchronous communication scheduler since the latest parameter wt+1 will be sent back to the workers only when all the workers are in the waiting set. The function ite(k, t) in SSGD can be formulated as ite(k, t) = t

K K, where k [K], t [T] and is the floor function.

Remark 1. In existing works [10, 22], SSGD is often presented in the form of mini-batch SGD:

ws+1 = ws η

k [K] f( ws; ξk), (2)

where s [S] and S denotes the number of iterations. Here, all workers aggregate their stochastic gradients to obtain the mini-batch gradient 1 K P

k [K] f( ws; ξk), which is then used to update the parameter. To unify SSGD and ASGD into a single framework in Algorithm 1, we reformulate SSGD in the form of mini-batch SGD in (2). Specifically, one update using a mini-batch gradient computed over K training instances in (2) is split into K updates, each using a stochastic gradient over a single training instance. Letting η = η K , T = KS and w0 = w0, the sequence {ws K}s [S] in SSGD in Algorithm 1 matches { ws}s [S] in (2).

3 Ordered Momentum

In this section, we first propose a new reformulation of SSGD with momentum, which inspires the design of ordered momentum (Or Mo) for ASGD. Then, we present the details of Or Mo, including the algorithm and convergence analysis.

3.1 Reformulation of SSGD with Momentum

The widely used SGD with momentum (SGDm) [26] can be expressed as follows:

ws+1 = ws β us η |Bs|

ξ Bs f ( ws; ξ) , (3)

us+1 = β us + η |Bs|

ξ Bs f ( ws; ξ) , (4)

where u0 = 0, β [0, 1), s [S] and S denotes the number of iterations. β is the momentum coefficient. us represents the Polyak s momentum. 1 |Bs| P

ξ Bs f ( ws; ξ) denotes the stochastic gradient computed over the sampled training instance set Bs, which contains either a single training instance or a mini-batch of training instances sampled from D. (3) denotes the parameter update step and (4) denotes the momentum update step. When β = 0, SGDm degenerates to (mini-batch) SGD.

Since SSGD can be presented in the form of mini-batch SGD as depicted in Remark 1, it s straightforward to implement SSGD with momentum (SSGDm) as follows:

ws+1 = ws β us η

k [K] f( ws; ξk),

us+1 = β us + η

k [K] f( ws; ξk), (5)

where u0 = 0, β [0, 1), s [S] and S denotes the number of iterations. Here, the server aggregates the stochastic gradients from all the workers to obtain the mini-batch gradient 1

K P k [K] f( ws; ξk), which is then used to update both the parameter and the momentum in (5).

To gain insights from SSGDm on incorporating momentum into ASGD, we reformulate SSGDm in (5) to fit into the framework of Algorithm 1. Similar to the reformulation in Remark 1, the updates using a mini-batch gradient computed over K training instances in (5) are split into K updates, each using a stochastic gradient over a single training instance. The corresponding implementation details of SSGDm are presented in Algorithm 3 in Appendix B. In this way, the update rules of SSGDm in (5) can be reformulated as follows:

2 = wt βut K | t, wt K t,

2 = βut K | t, ut K t,

wt+1 = wt+ 1

ut+1 = ut+ 1

where u0 = 0, gkt t

K K = f(w t

K K; ξkt) and t [T]. We give the following proposition about the relationship between the sequences in (5) and those in (6). The proof details can be found in Appendix C.1.1.

Proposition 1. Letting η = η K , T = KS and w0 = w0, the sequences {ws K}s [S] and {us K}s [S] in (6) are equivalent to { ws}s [S] and { us}s [S] in (5), respectively.

We investigate how the momentum term ut+1 evolves during the iterations in (6). For t K and t [T], ut+1 can be formulated as:

k [K] ηgk i K

where the superscript of the scalar β indicates the exponent. For t < K, ut+1 = β0 Pt j=0 ηgkj 0 . Figure 1 shows u10 as an example when K = 4. We define {ηg0 i K, ηg1 i K, , ηg K 1 i K } as the i-th

Figure 1: An example of the momentum term u10 in SSGDm when K = 4. The gradients shown in red indicate those having not arrived at the server. In this case, u10 = β2 ηg0 0 + ηg1 0 + ηg2 0 + ηg3 0 + β1 ηg0 4 + ηg1 4 + ηg2 4 + ηg3 4 + β0 ηg0 8 + ηg3 8 .

(scaled) gradient group, which contains K gradients scaled by the learning rate η. The order of the gradient groups is based on the iteration indexes of their corresponding gradients. Though some gradients may be missing because they have not yet arrived at the server, the momentum is a weighted sum of the gradients from the first several gradient groups. Hence, the momentum in SSGDm is referred to as an ordered momentum. Specifically, the gradients in the i-th gradient group are weighted by β t

K i in the momentum ut+1, where i [ t

K + 1]. We refer to the gradient group whose gradients are weighted by β0 as the latest gradient group, which contains the latest gradients. For ut+1 in SSGDm, the latest gradient group corresponds to the t

K -th gradient group.

Due to the barrier in the synchronous communication scheduler in SSGDm as presented in Algorithm 3, the gradients in SSGDm consistently arrive at the server in the order of their iteration indexes. The arriving gradient always belongs to the latest gradient group at each iteration. Thus, maintaining such an ordered momentum in SSGDm is straightforward. As shown in line 13 of Algorithm 3, the scaled gradient ηgkt ite(kt,t) is always added to the momentum with a weight of β0 at each iteration. However, for ASGD, since the gradients arrive at the server out of order, it s not trivial to incorporate such an ordered momentum. To address this problem, we propose a solution in the following subsection.

3.2 Or Mo for ASGD

In this subsection, we introduce our novel method called ordered momentum (Or Mo) for ASGD, and present it in Algorithm 2.

Firstly, we define the (scaled) gradient groups in Or Mo for ASGD. Due to the differences in the communication scheduler, the iteration indexes of the parameters used to compute the gradients in ASGD differ from those in SSGD (SSGDm). Specifically, the sequence of gradients computed in SSGD (SSGDm) can be formulated as:

g0 0, g1 0, , g K 1 0 , g0 K, g1 K, , g K 1 K , g0 2K, g1 2K, , g K 1 2K , . (7)

In contrast, the sequence of gradients computed in ASGD is given by:

g0 0, g1 0, , g K 1 0 , gk0 1 , gk1 2 , , gk K 1 K , gk K K+1, gk K+1 K+2 , , gk2K 1 2K , . (8)

Thus, the i-th (scaled) gradient group in Or Mo for ASGD is defined as: n ηg k(i 1)K (i 1)K+1, ηg k(i 1)K+1 (i 1)K+2, , ηgki K 1 i K o ,

where i 1. The 0-th (scaled) gradient group in Or Mo is {ηg0 0, ηg1 0, , ηg K 1 0 }. Despite the difference in the gradients iteration indexes, each gradient group in Or Mo for ASGD also contains K gradients scaled by the learning rate η, similar to that in SSGDm as discussed in Subsection 3.1.

We use It+1 to denote the index of the latest gradient group of ut+1 in Or Mo. The iteration index of the latest gradient in ut+1 can be t at most. Since the gradient with iteration index t belongs to the t

K -th gradient group, the latest gradient group for ut+1 should be the t

K -th gradient group, i.e., It+1 t

K , t [T]. ut+1 is the weighted sum of the gradients from the first It+1 + 1 gradient groups, where some gradients may be missing because they have not yet arrived at the server. The gradients in the i-th gradient group are weighted by β t

K i in the momentum ut+1, where i [It+1 + 1] and the superscript of the scalar β indicates the exponent. Figure 2 shows an example of u10 in Or Mo when K = 4.

For the t-th iteration in Or Mo, the server performs the following operations:

Algorithm 2 Or Mo

1: Server: 2: Input: number of workers K, number of iterations T, learning rate η, momentum coefficient β [0, 1); 3: Initialization: initial parameter w0, momentum u0 = 0, index of the latest gradient group I0 = 0, waiting set C = ; 4: Send the initial parameter w0 and its iteration index 0 to all workers; 5: for t = 0 to T 1 do 6: if the waiting set C is empty and t

K > It then 7: wt+ 1

2 = wt βut, ut+ 1

2 = βut, It+1 = It + 1; 8: else 9: wt+ 1

2 = wt, ut+ 1

2 = ut, It+1 = It; 10: end if 11: Receive a stochastic gradient gkt ite(kt,t) and its iteration index ite(kt, t) from some worker kt and then calculate ite(kt,t)

K (i.e., the index of the gradient group that gkt ite(kt,t) belongs to);

12: Update the momentum ut+1 = ut+ 1

2 + βIt+1 ite(kt,t)

K ηgkt ite(kt,t) ;

13: Update the parameter wt+1 = wt+ 1

2 1 βIt+1 ite(kt,t)

1 β ηgkt ite(kt,t) ;

14: Add the worker kt to the waiting set C = C {kt}; 15: Execute the asynchronous communication scheduler: once the waiting set is not empty, i.e., C = , immediately send the parameter wt+1 and its iteration index t + 1 to the worker in C and set C to ; 16: end for 17: Notify all workers to stop; 18: Worker k : (k [K]) 19: repeat 20: Wait until receiving the parameter wt and its iteration index t from the server; 21: Randomly sample ξk D and then compute the stochastic gradient gk t = f(wt ; ξk); 22: Send the stochastic gradient gk t and its iteration index t to the server; 23: until receive server s notification to stop

Figure 2: An example of the momentum term u10 in Or Mo when K = 4. The gradients shown in red indicate those having not arrived at the server. In this case, u10 = β3 ηg0 0 + ηg1 0 + ηg2 0 + ηg3 0 +

β2 ηgk0 1 + ηgk1 2 + ηgk2 3 + β1 ηgk5 6 + ηgk7 8 + β0 ηgk8 9 .

If the parameter with iteration index t that satisfies t

K > It has been sent to some worker, update the parameter using the momentum and multiply the momentum with β: wt+ 1

2 = wt βut, ut+ 1

2 = βut, It+1 = It + 1.

In this way, the momentum changes the index of its latest gradient group to t

K and gets ready to accommodate the new gradient with iteration index t.

Receive a stochastic gradient gkt ite(kt,t) and its iteration index ite(kt, t) from some worker kt and calculate ite(kt,t)

K , which is the index of the gradient group that gkt ite(kt,t) belongs to.

Update the momentum: ut+1 = ut+ 1

2 + βIt+1 ite(kt,t)

K ηgkt ite(kt,t) .

Since the weight of the scaled gradients from the latest gradient group in the momentum is β0, the weight of the gradients from the ite(kt,t)

K -th gradient group should be βIt+1 ite(kt,t)

Or Mo updates the momentum by adding the scaled gradient ηgkt ite(kt,t) into the momentum

with a weight of βIt+1 ite(kt,t)

Update the parameter: wt+1 = wt+ 1

2 1 βIt+1 ite(kt,t)

1 β ηgkt ite(kt,t) .

The update rule of the parameter in Or Mo is motivated by that in SSGDm, as presented in Algorithm 3. In SSGDm, the scaled gradient ηgkt ite(kt,t) is always added to the momentum with a weight of β0. At the current iteration, this scaled gradient updates the parameter with a coefficient β0. In subsequent iterations, this scaled gradient in the momentum updates the parameter with the coefficients β, β2, β3, . By the time this scaled gradient is weighted by βIt+1 ite(kt,t)

K in the momentum of SSGDm, it has already updated the

parameter for It+1 ite(kt,t)

K + 1 steps, with a total coefficient PIt+1 ite(kt,t)

K j=0 βj. In Or Mo, for the scaled gradient ηgkt ite(kt,t) which is added to the momentum with a weight of

βIt+1 ite(kt,t)

K , we compensate for the missed It+1 ite(kt,t)

K + 1 steps compared with

SSGDm and update the parameter with the coefficient 1 βIt+1 ite(kt,t)

1 β at the current iteration. The design of the parameter update rule is crucial for the derivation of Lemma 2, which is further supported by the ablation study in Appendix A.3. Add the worker kt to the waiting set C = C {kt} and execute the asynchronous communication scheduler. Remark 2. Compared to ASGD, the additional communication overhead introduced by the iteration index in Or Mo is negligible since the iteration index is only a scalar. Remark 3. When the momentum coefficient β is set to 0, Or Mo degenerates to ASGD in Algorithm 1. If the asynchronous communication scheduler in line 15 of Algorithm 2 is replaced by a synchronous communication scheduler: only when all the workers are in the waiting set, i.e., C = [K], send the parameter wt+1 and the iteration index t + 1 to the workers in C and set C to , Or Mo degenerates to SSGDm in Algorithm 3.

3.3 Convergence Analysis

In this section, we prove the convergence of Or Mo in Algorithm 2 for non-convex problems. We only present the main results here. The proof details can be found in Appendix C.

We make the following assumptions, which are widely used in distributed learning [47, 43, 41, 22]. Assumption 1. For any stochastic gradient f(w; ξ), we assume that it satisfies:

Eξ D [ f(w; ξ)] = F(w), Eξ D f(w; ξ) F(w) 2 σ2, w Rd.

Assumption 2. For any stochastic gradient f(w; ξ), we assume that it satisfies:

Eξ D f(w; ξ) 2 G2, w Rd.

Assumption 3. F(w) is L-smooth (L > 0):

F(w) F(w ) + F(w )T (w w ) + L

2 w w 2, w, w Rd.

Assumption 4. The objective function F(w) is lower bounded by F : F(w) F , w Rd.

Firstly, we define the auxiliary sequence {ˆut}t 1 for the momentum: ˆu1 = P

k [K] ηgk 0, and

( βˆut + ηgkt 1 t K | (t 1),

ˆut + ηgkt 1 t K (t 1),

for t 1. Lemma 1. For any t 0, the gap between ut+1 and ˆut+1 can be formulated as follows:

ˆut+1 ut+1 = X

k [K],k =kt β t

K ηgk ite(k,t). (9)

Then, we define the auxiliary sequence { ˆwt}t 1 for the parameter: ˆw1 = w0 P k [K] ηgk 0, and

( ˆwt βˆut ηgkt 1 t K | (t 1),

ˆwt ηgkt 1 t K (t 1), for t 1. Lemma 2. For any t 0, the gap between wt+1 and ˆwt+1 can be formulated as follows:

ˆwt+1 wt+1 = X

k [K],k =kt

1 β ηgk ite(k,t). (10)

Then, we define another auxiliary sequence {ˆyt}t 1: ˆy1 = ˆw1 βw0

1 β , and ˆyt+1 = ˆyt η 1 β gkt 1 t , for t 1. Lemma 3. For any t 1, the gap between ˆyt and ˆwt can be formulated as follows:

ˆyt ˆwt = β 1 β ˆut. (11)

Theorem 1. With Assumptions 1, 2, 3 and 4, letting η = min{ 1 β

2KL, (1 β) 1 2

(LT ) 1 2 σ , (1 β) 5 3 1 3

(LKG) 2 3 T 1 3 }, Algo-

rithm 2 has the following convergence rate:

t=1 E F(wt) 2 O

where = F(w0) F and T K.

Many works [46, 30, 7, 13, 22] consider delay-adaptive methods for ASGD. The key insight of these methods is to penalize the gradients with large delays and reduce their contribution to the parameter update. Or Mo is orthogonal to these delay-adaptive methods. Concretely, we can replace the constant learning rate η in Algorithm 2 with a delay-adaptive learning rate ηt, which is dependent on the delay of the gradient τt. Inspired by [13], we adopt the following delay-adaptive learning rate ηt:

min{η, 1 4Lτt } τt > 2K.

The convergence of Or Mo with the above delay-adaptive learning rate (called Or Mo-DA) is guaranteed by Theorem 2.

Theorem 2. With Assumptions 1, 3 and 4, letting η = min{ (1 β)2

T Lσ2 }, Or Mo-DA has the following convergence rate:

E F( w T ) 2 O

where = F(w0) F and w T is randomly chosen from {w0, w1, , w T 1} according to a probability distribution which is related to the delay-adaptive learning rates.

The proof details can be found in Appendix C.3. Compared with Theorem 1, Theorem 2 removes the dependence on Assumption 2 (bounded gradient) and provides a better convergence bound. Remark 4. We focus on the scenario where the training instances across all workers are independent and identically distributed (i.i.d.) from D. This scenario commonly appears in the data-center setup for distributed training [5], where all workers have access to the full training dataset. Our analysis for the i.i.d. scenario can also provide insights into the analysis in a non-i.i.d. scenario [22], which will be studied in future work. Remark 5. Most existing theoretical analyses of ASGD [16, 49, 3, 20] rely on the maximum delay

τmax (e.g., O( q

T ) in [20]), where τmax = maxt [T ] τt. However, since ASGD can still perform well even when the maximum delay is extremely large (τmax K) in practice, these theoretical analyses don t accurately reflect the true behavior of ASGD. The most closely related works to this work are [13, 22], which analyze ASGD without relying on the maximum delay. But the works in [13, 22] do not consider momentum. To the best of our knowledge, this is the first work to establish the convergence guarantee of ASGD with momentum without relying on the maximum delay.

Table 1: Empirical results of different methods on CIFAR10 dataset.

Number of Workers 16 (hom.) 64 (hom.) 16 (het.) 64 (het.) Methods Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy

ASGD 0.06 0.00 89.77 0.11 0.40 0.02 83.14 0.55 0.06 0.00 89.73 0.19 0.38 0.01 83.94 0.21 naive ASGDm 0.20 0.07 88.15 1.70 0.44 0.06 82.39 1.79 0.58 0.86 73.23 31.61 0.78 0.77 68.75 29.51 shifted momentum 0.08 0.01 90.23 0.27 0.38 0.00 83.72 0.29 0.10 0.02 89.95 0.32 0.37 0.01 83.99 0.23 SMEGA2 0.05 0.01 90.60 0.42 0.23 0.04 86.82 0.69 0.04 0.01 90.88 0.25 0.22 0.07 86.89 1.42 Or Mo 0.04 0.01 90.95 0.27 0.15 0.02 88.03 0.28 0.04 0.00 91.01 0.10 0.16 0.03 87.76 0.57 Or Mo-DA 0.03 0.01 91.17 0.18 0.16 0.02 88.03 0.33 0.03 0.01 91.28 0.37 0.15 0.02 88.08 0.38

Table 2: Empirical results of different methods on CIFAR100 dataset.

Number of Workers 16 (hom.) 64 (hom.) 16 (het.) 64 (het.) Methods Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy

ASGD 0.51 0.01 66.16 0.36 0.96 0.03 61.61 0.59 0.51 0.01 65.94 0.39 0.95 0.03 61.74 0.30 naive ASGDm 0.54 0.01 65.46 0.20 1.03 0.05 59.96 0.90 0.53 0.00 65.69 0.42 0.97 0.06 61.13 1.02 shifted momentum 0.47 0.01 66.37 0.14 0.82 0.01 63.55 0.32 0.47 0.00 66.28 0.14 0.82 0.04 63.28 0.66 SMEGA2 0.41 0.00 67.32 0.22 0.69 0.00 64.16 0.12 0.40 0.01 67.29 0.16 0.68 0.02 64.12 0.53 Or Mo 0.41 0.01 67.56 0.34 0.56 0.00 65.48 0.17 0.40 0.01 67.71 0.33 0.58 0.02 65.43 0.35 Or Mo-DA 0.40 0.00 67.72 0.21 0.56 0.01 65.79 0.12 0.04 0.00 67.82 0.20 0.57 0.01 65.82 0.30

4 Experiments

In this section, we evaluate the performance of Or Mo, Or Mo-DA and other baseline methods. All the experiments are implemented based on the Parameter Server framework [15]. Our distributed platform is conducted with Docker. Each Docker container corresponds to either a server or a worker. All the methods are implemented with Py Torch 1.3.

The baseline methods include ASGD, naive ASGDm which naively incorporates momentum into ASGD [23], shifted momentum [9] and SMEGA2 [4]. The details of naive ASGDm are shown in Algorithm 4. In Or Mo-DA, for a gradient with a large delay satisfying τt > 2K, its corresponding learning rate will be multiplied by 1

τt . We evaluate these methods by training Res Net20 model [11] on CIFAR10 and CIFAR100 datasets [14]. The number of workers is set to 16 and 64. The batch size on each worker is set to 64. The momentum coefficient is set to 0.9. Each experiment is repeated 5 times. The experiments are conducted under two settings:

Setting I [homogeneous (hom.)]: each worker has similar computing capabilities, which ensures comparable average time for gradient computations. Setting II [heterogeneous (het.)]: some workers ( 1

16 of all) are designated as slow workers, with an average computation time that is 10 times longer than that of the others.

For the CIFAR10 dataset, the weight decay is set to 0.0001 and the model is trained with 160 epochs. The learning rate is multiplied by 0.1 at the 80-th and 120-th epoch, as suggested in [11]. The experiments for the CIFAR10 dataset are conducted on NVIDIA RTX 2080 Ti GPUs. For the CIFAR100 dataset, the weight decay is set to 0.0005 and the model is trained with 200 epochs. The learning rate is multiplied by 0.2 at the 60-th, 120-th and 160-th epoch, as suggested in [40]. The experiments for the CIFAR100 dataset are conducted on NVIDIA V100 GPUs.

Table 1 and Table 2 show the empirical results of different methods. Figure 3 and Figure 4 show the test accuracy curves of different methods. We also present the test accuracy curves of SSGDm for reference in Figure 3 and Figure 4. Training loss curves can be found in Appendix A.1. Compared with other asynchronous methods, Or Mo and Or Mo-DA can achieve better training loss and test accuracy. As shown in Figure 3, naive ASGDm occasionally fails to converge under Setting II. We can find that naively incorporating momentum into ASGD will impede its convergence. Due to the existence of slow workers, the maximum delay under Setting II is far greater than that under Setting I. For example, when training on the CIFAR10 dataset with K = 64, the maximum delay under Setting II is about 30000, while it s around 300 under Setting I. Or Mo and Or Mo-DA perform well under both settings, which aligns with our theoretical results without dependence on the maximum delay.

Figure 5 presents the training curves of Or Mo and SSGDm with respect to wall-clock time. As shown in Figure 5(b), Or Mo can be 8 times faster than SSGDm under Setting II since the training speed of SSGDm is hindered by slow workers. In contrast, slow workers have a limited impact on Or Mo s training speed. Even under Setting I where each worker possesses similar computing

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(a) K = 16 (hom.)

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(b) K = 64 (hom.)

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(c) K = 16 (het.)

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(d) K = 64 (het.)

Figure 3: Test accuracy curves on CIFAR10 with different numbers of workers.

0 25 50 75 100 125 150 175 200 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(a) K = 16 (hom.)

0 25 50 75 100 125 150 175 200 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(b) K = 64 (hom.)

0 25 50 75 100 125 150 175 200 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(c) K = 16 (het.)

0 25 50 75 100 125 150 175 200 Epochs

Test Accuracy

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(d) K = 64 (het.)

Figure 4: Test accuracy curves on CIFAR100 with different numbers of workers.

0 100 200 300 400 500 600 Wall-Clock Time (s)

Training Loss

SSGDm Or Mo

0 100 200 300 400 500 600 Wall-Clock Time (s)

Test Accuracy

SSGDm Or Mo

(a) homogeneous

0 250 500 750 1000 1250 1500 1750 2000

Wall-Clock Time (s)

Training Loss

SSGDm Or Mo

0 250 500 750 1000 1250 1500 1750 2000

Wall-Clock Time (s)

Test Accuracy

SSGDm Or Mo

(b) heterogeneous

Figure 5: Training curves with respect to wall-clock time on CIFAR10 when K = 16.

capability, Or Mo can still be more than twice as fast as SSGDm, as shown in Figure 5(a). This advantage arises because the computation time of each worker varies within a certain range even under the homogeneous setting and some workers must wait for others to finish gradient computations in SSGDm.

5 Conclusion

In this paper, we propose a novel method named ordered momentum (Or Mo) for asynchronous SGD. We theoretically prove the convergence of Or Mo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis of ASGD with momentum without dependence on the maximum delay. Empirical results demonstrate that Or Mo can achieve state-of-the-art performance.

Acknowledgment

This work is supported by National Key R&D Program of China (No. 2020YFA0713900), NSFC Project (No. 12326615), Major Key Project of Pengcheng Laboratory (No. PCL2024A06) and Key R&D Project of Jiangsu Province (No. BE2023652).

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 265 283, 2016.

[2] Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873 881, 2011.

[3] Yossi Arjevani, Ohad Shamir, and Nathan Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. In Proceedings of the International Conference on Algorithmic Learning Theory, pages 111 132, 2020.

[4] Refael Cohen, Ido Hakimi, and Assaf Schuster. SMEGA2: distributed asynchronous deep neural network training with a single momentum buffer. In Proceedings of the International Conference on Parallel Processing, pages 1 10, 2022.

[5] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1232 1240, 2012.

[6] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646 1654, 2014.

[7] Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 803 812, 2018.

[8] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. An asynchronous minibatch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740 3754, 2016.

[9] Niv Giladi, Mor Shpigel Nacson, Elad Hoffer, and Daniel Soudry. At stability s edge: How to adjust hyperparameters to preserve minima selection in asynchronous training of neural networks? In Proceedings of the International Conference on Learning Representations, 2020.

[10] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. ar Xiv preprint ar Xiv:1706.02677, 2017.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016.

[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315 323, 2013.

[13] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Sharper convergence guarantees for asynchronous SGD for distributed and federated learning. In Advances in Neural Information Processing Systems, pages 17202 17215, 2022.

[14] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.

[15] Mu Li, David G. Andersen, Alexander J. Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19 27, 2014.

[16] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737 2745, 2015.

[17] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the International Conference on Machine Learning, pages 3043 3052, 2018.

[18] Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, and Martin Jaggi. Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data. In Proceedings of the International Conference on Machine Learning, pages 6654 6665, 2021.

[19] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In Proceedings of the International Conference on Learning Representations, 2018.

[20] Ji Liu and Ce Zhang. Distributed learning systems with first-order methods. Foundations and Trends in Databases, 9(1):1 100, 2020.

[21] Wei Liu, Li Chen, Yunfei Chen, and Wenyi Zhang. Accelerating federated learning via momentum gradient descent. IEEE Transactions on Parallel and Distributed Systems, 31(8):1754 1766, 2020.

[22] Konstantin Mishchenko, Francis R. Bach, Mathieu Even, and Blake E. Woodworth. Asynchronous SGD beats minibatch SGD under arbitrary delays. In Advances in Neural Information Processing Systems, pages 420 433, 2022.

[23] Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and Christopher Ré. Asynchrony begets momentum, with an application to deep learning. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing, pages 997 1004, 2016.

[24] Yurii Nesterov. Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2013.

[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024 8035, 2019.

[26] Boris Polyak. Some methods of speeding up the convergence of iteration methods. Ussr Computational Mathematics and Mathematical Physics, 4:1 17, 12 1964.

[27] Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693 701, 2011.

[28] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400 407, 1951.

[29] Chang-Wei Shi, Shen-Yi Zhao, Yin-Peng Xie, Hao Gao, and Wu-Jun Li. Global momentum compression for sparse communication in distributed learning. ar Xiv preprint ar Xiv:1905.12948, 2019.

[30] Suvrit Sra, Adams Wei Yu, Mu Li, and Alex Smola. Adadelay: Delay adaptive distributed stochastic optimization. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 957 965, 2016.

[31] Sebastian U. Stich, Amirkeivan Mohtashami, and Martin Jaggi. Critical parameters for scalable distributed learning with large batches and asynchronous updates. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 4042 4050, 2021.

[32] Jianhui Sun, Xidong Wu, Heng Huang, and Aidong Zhang. On the role of server momentum in federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 15164 15172, 2024.

[33] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, pages 1139 1147, 2013.

[34] Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 1-bit adam: Communication efficient large-scale training with adam s convergence speed. In Proceedings of the International Conference on Machine Learning, pages 10118 10129, 2021.

[35] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pages 7663 7673, 2018.

[36] Paul Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506 531, 1998.

[37] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S Rellermeyer. A survey on distributed machine learning. ACM Computing Surveys, 53(2):1 33, 2020.

[38] Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, pages 14236 14245, 2019.

[39] Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael G. Rabbat. Slowmo: Improving communication-efficient distributed SGD with slow momentum. In Proceedings of the International Conference on Learning Representations, 2020.

[40] Cong Xie, Shuai Zheng, Oluwasanmi Koyejo, Indranil Gupta, Mu Li, and Haibin Lin. CSER: Communication-efficient SGD with error reset. In Advances in Neural Information Processing Systems, pages 12593 12603, 2020.

[41] An Xu and Heng Huang. Detached error feedback for distributed SGD with random sparsification. In Proceedings of the International Conference on Machine Learning, pages 24550 24575, 2022.

[42] Tao Yang, Xinlei Yi, Junfeng Wu, Ye Yuan, Di Wu, Ziyang Meng, Yiguang Hong, Hong Wang, Zongli Lin, and Karl H Johansson. A survey of distributed optimization. Annual Reviews in Control, 47:278 305, 2019.

[43] Yi-Rui Yang and Wu-Jun Li. BASGD: buffered asynchronous SGD for byzantine learning. In Proceedings of the International Conference on Machine Learning, pages 11751 11761, 2021.

[44] Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of the International Conference on Machine Learning, pages 7184 7193, 2019.

[45] Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui Xu, and Wotao Yin. Decentlam: Decentralized momentum sgd for large-batch deep training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3029 3039, 2021.

[46] Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Staleness-aware async-sgd for distributed deep learning. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2350 2356, 2016.

[47] Shen-Yi Zhao and Wu-Jun Li. Fast asynchronous parallel stochastic gradient descent: A lockfree approach with convergence guarantee. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2379 2385, 2016.

[48] Shen-Yi Zhao, Gong-Duo Zhang, and Wu-Jun Li. Lock-free optimization for non-convex problems. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2935 2941, 2017.

[49] Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhiming Ma, and Tie-Yan Liu. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the International Conference on Machine Learning, pages 4120 4129, 2017.

0 20 40 60 80 100 120 140 160 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(a) K = 16 (hom.)

0 20 40 60 80 100 120 140 160 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(b) K = 64 (hom.)

0 20 40 60 80 100 120 140 160 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(c) K = 16 (het.)

0 20 40 60 80 100 120 140 160 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(d) K = 64 (het.)

Figure 6: Training loss curves of different methods on CIFAR10 dataset with different numbers of workers.

0 25 50 75 100 125 150 175 200 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(a) K = 16 (hom.)

0 25 50 75 100 125 150 175 200 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(b) K = 64 (hom.)

0 25 50 75 100 125 150 175 200 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(c) K = 16 (het.)

0 25 50 75 100 125 150 175 200 Epochs

Training Loss

SSGDm ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(d) K = 64 (het.)

Figure 7: Training loss curves of different methods on CIFAR100 dataset with different numbers of workers.

Table 3: Empirical results of naive ASGDm with different β when training Res Net20 on CIFAR10 dataset.

Number of workers 16 (hom.) 64 (hom.) 16 (het.) 64 (het.) Algorithm (β) Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy

naive ASGDm (0.1) 0.06 0.01 89.85 0.24 0.38 0.01 83.93 0.25 0.06 0.00 89.95 0.19 0.38 0.01 83.76 0.34 naive ASGDm (0.3) 0.06 0.00 89.91 0.20 0.36 0.02 84.23 0.49 0.05 0.01 90.26 0.05 0.35 0.02 84.43 0.22 naive ASGDm (0.6) 0.07 0.00 90.39 0.24 0.38 0.02 83.87 0.28 0.06 0.01 90.56 0.13 0.37 0.02 84.07 0.38 naive ASGDm (0.9) 0.20 0.07 88.15 1.70 0.44 0.06 82.39 1.79 0.58 0.86 73.23 31.61 0.78 0.77 68.75 29.51 Or Mo (0.9) 0.04 0.01 90.95 0.27 0.15 0.02 88.03 0.28 0.04 0.00 91.01 0.10 0.16 0.03 87.76 0.57

A More Experimental Results

A.1 Loss Curves

Figure 6 and Figure 7 show the training loss curves of Res Net20 model.

A.2 Tuning β for Naive ASGDm

Following the suggestion in [23, 9], we conduct experiments to tune the momentum coefficient β for naive ASGDm and present the results when training Res Net20 on CIFAR10 in Table 3. While tuning the momentum coefficient can enhance the performance of naive ASGDm, hyperparameter tuning is quite time-consuming and costly. In contrast, Or Mo achieves better performance using the commonly used momentum value of 0.9, without requiring extensive tuning.

A.3 Ablation Study

An ablation study is also conducted to justify the parameter update rule in line 13 of Algorithm 2. We replace the update rule in line 13 of Algorithm 2 with a vanilla SGD step, wt+1 = wt+ 1

2 ηgkt ite(kt,t), and name it Or Mo (vanilla SGD step). The comparison between the experimental results of Or Mo and Or Mo (vanilla SGD step) are presented in Table 4 and Figure 8.

Table 4: Empirical results of Or Mo and Or Mo (vanilla SGD step) when training Res Net20 on CIFAR10 dataset.

Number of Workers 16 (hom.) 64 (hom.) 16 (het.) 64 (het.) Methods Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy Training Loss Test Accuracy

Or Mo (vanilla SGD step) 0.07 0.02 90.32 0.45 0.27 0.07 86.08 1.33 0.07 0.01 90.23 0.32 0.27 0.07 86.10 1.71 Or Mo 0.04 0.01 90.95 0.27 0.15 0.02 88.03 0.28 0.04 0.00 91.01 0.10 0.16 0.03 87.76 0.57

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

Or Mo (vanilla SGD step) Or Mo

(a) K = 16 (hom.)

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

Or Mo (vanilla SGD step) Or Mo

(b) K = 64 (hom.)

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

Or Mo (vanilla SGD step) Or Mo

(c) K = 16 (het.)

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

Or Mo (vanilla SGD step) Or Mo

(d) K = 64 (het.)

Figure 8: Test accuracy curves when training Res Net20 model on CIFAR10 dataset with different numbers of worker number.

Table 5: Test accuracy of different methods when training Res Net18 on CIFAR10 dataset.

homogeneous heterogeneous ASGD 91.45 91.52 naive ASGDm 93.74 93.10 shifted momentum 94.02 94.20 SMEGA2 93.72 93.36 Or Mo 94.32 94.26 Or Mo-DA 94.50 94.03

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(a) homogeneous

0 20 40 60 80 100 120 140 160 Epochs

Test Accuracy

ASGD naive ASGDm shifted momentum SMEGA2

Or Mo Or Mo-DA

(b) heterogeneous

Figure 9: Test accuracy curves when training Res Net18 model on CIFAR10 dataset.

A.4 Experimental Results on Res Net18 Model

Table 5 and Figure 9 show the performance of different methods when training the Res Net18 model on the CIFAR10 dataset. The number of workers is set to 8. In the homogeneous setting, each worker has similar computing capabilities. In the heterogeneous setting, one worker is designated as the slow worker, whose average computation time is 10 times longer than that of the others.

B Algorithm Details

Algorithm 3 and Algorithm 4 show the details of SSGDm and naive ASGDm.

Algorithm 3 SSGDm

1: Server: 2: Input: number of workers K, number of iterations T, learning rate η, momentum coefficient β [0, 1); 3: Initialization: initial parameter w0, momentum u0 = 0, waiting set C = ; 4: Send the initial parameter w0 to all workers; 5: for t = 0 to T 1 do 6: if the waiting set C is empty then 7: wt+ 1

2 = wt βut, ut+ 1

2 = βut; 8: else 9: wt+ 1

2 = wt, ut+ 1

2 = ut; 10: end if 11: Receive a stochastic gradient gkt ite(kt,t) from some worker kt;

12: Update the parameter wt+1 = wt+ 1

2 ηgkt ite(kt,t);

13: Update the momentum ut+1 = ut+ 1

2 + ηgkt ite(kt,t);

14: Add the worker kt to the waiting set C = C {kt}; 15: Execute the synchronous communication scheduler: only when all the workers are in the waiting set, i.e., C = [K], send the parameter wt+1 to the workers in C and set C to ; 16: end for 17: Notify all workers to stop; 18: Worker k : (k [K]) 19: repeat 20: Wait until receiving the parameter w from the server; 21: Randomly sample ξk D and then compute the stochastic gradient gk = f(w; ξk); 22: Send the stochastic gradient gk to the server; 23: until receive server s notification to stop

Algorithm 4 naive ASGDm

1: Server: 2: Input: number of workers K, number of iterations T, learning rate η, momentum coefficient β [0, 1); 3: Initialization: initial parameter w0, momentum u0 = 0, waiting set C = ; 4: Send the initial parameter w0 to all workers; 5: for t = 0 to T 1 do 6: Receive a stochastic gradient gkt ite(kt,t) from some worker kt;

7: Update the momentum ut+1 = βut + ηgkt ite(kt,t) 8: Update the parameter wt+1 = wt ut+1 9: Add the worker kt to the waiting set C = C {kt}; 10: Execute the asynchronous communication scheduler: once the waiting set is not empty, i.e., C = , immediately send the parameter wt+1 to the worker in C and set C to ; 11: end for 12: Notify all workers to stop; 13: Worker k : (k [K]) 14: repeat 15: Wait until receiving the parameter w from the server; 16: Randomly sample ξk D and then compute the stochastic gradient gk = f(w; ξk); 17: Send the stochastic gradient gk to the server; 18: until receive server s notification to stop

C Proof Details

C.1 Reformulation of SSGDm

C.1.1 Proof of Proposition 1

Proof. It s easy to verify that {kt, kt+1, , kt+K 1} = [K] in (6), where K | t.

Base case: for s = 0, w0 = w0 and u0 = u0.

Inductive hypothesis: for some arbitrary integer s 0, assume that ws = ws K, us = us K.

Inductive step:

us +1 = β us + η

k [K] f( ws ; ξk) = βus K + η X

k [K] f(ws K; ξk) = u(s +1)K,

ws +1 = ws β us η

k [K] f( ws ; ξk) = ws K βus K η X

k [K] f(ws K; ξk)

= w(s +1)K.

We can conclude that ws = ws K and us = us K for any s [S].

C.2.1 Proof of Lemma 1

Lemma 1 can be viewed as a special case of Lemma 5, and its proof is completed by substituting each delay-adaptive learning rate ˆηk,t in the proof of Lemma 5 with η for all k [K] and t [T].

C.2.2 Proof of Lemma 2

Lemma 2 can be viewed as a special case of Lemma 6, and its proof is completed by substituting each delay-adaptive learning rate ˆηk,t in the proof of Lemma 6 with η for all k [K] and t [T].

C.2.3 Proof of Lemma 3

Lemma 3 can be viewed as a special case of Lemma 7, and its proof is completed by substituting each delay-adaptive learning rate ˆηk,t in the proof of Lemma 7 with η for all k [K] and t [T].

Lemma 4. With Assumption 2, the gap between ˆyt and ˆwt can be bounded:

E ˆyt ˆwt 2 β2η2K2G2

(1 β)4 , t 1. (12)

Proof. For any t 1, ˆut can be formulated as follows:

ˆut = β t+K 2

k [K] ηgk 0

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ηgkj 1 j

ˆyt ˆwt 2 = β2

(1 β)2 ˆut 2

k [K] ηgk 0

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ηgkj 1 j

k [K] β t+K 2

K + P t+K 2

K s=1 Pmin{s K,t 1} j=(s 1)K+1 β t+K 2

K s, then we have

ˆyt ˆwt 2 = β2

k [K] ηgk 0

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ηgkj 1 j

= β2q2 t (1 β)2

min{s K,t 1} X

qt ηgkj 1 j

β2qt (1 β)2

k [K] β t+K 2

K ηgk 0 2 +

min{s K,t 1} X

j=(s 1)K+1 β t+K 2

K s ηgkj 1 j 2

E ˆyt ˆwt 2 β2qt (1 β)2

k [K] β t+K 2

K E ηgk 0 2 +

min{s K,t 1} X

j=(s 1)K+1 β t+K 2

K s E ηgkj 1 j 2

k [K] β t+K 2

min{s K,t 1} X

j=(s 1)K+1 β t+K 2

= β2η2G2q2 t (1 β)2 β2η2K2G2

C.2.4 Proof of Theorem 1

EF (ˆyt+1) F (ˆyt) η 1 β F(ˆyt), Egkt 1 t + Lη2

2(1 β)2 E gkt 1 t 2

F(ˆyt) η 1 β F(ˆyt), F(wt) + Lη2

2(1 β)2 E gkt 1 t F(wt) 2

2(1 β)2 F(wt) 2

F(ˆyt) η 1 β F(ˆyt), F(wt) + Lη2σ2

2(1 β)2 + Lη2

2(1 β)2 F(wt) 2

F(ˆyt), F(wt) = F(ˆyt) F(wt) + F(wt), F(wt)

= F(ˆyt) F(wt), F(wt) F(wt) 2

2 F(ˆyt) F(wt) 2 + 1

2 F(wt) 2 F(wt) 2

2 ˆyt wt 2 1

E ˆyt wt 2 2E ˆyt ˆwt 2 + 2E ˆwt wt 2 2β2η2K2G2

(1 β)4 + 2η2K2G2

(1 β)2 2η2K2G2

EF(ˆyt+1) EF(ˆyt) η 1 β E F(ˆyt), F(wt) + Lη2σ2

2(1 β)2 + Lη2

2(1 β)2 E F(wt) 2

EF(ˆyt) + ηL2

2(1 β)E ˆyt wt 2 + Lη2

2(1 β)2 η 2(1 β)

E F(wt) 2 + Lη2σ2

2KL EF(ˆyt) + ηL2

2(1 β)E ˆyt wt 2 η 4(1 β)E F(wt) 2 + Lη2σ2

EF(ˆyt) η 4(1 β)E F(wt) 2 + Lη2σ2

2(1 β)2 + η3K2G2L2

Summing up the above equation from t = 1 to t = T, we can get that

t=1 E F(wt) 2 4(1 β) [EF(ˆy1) F ]

1 β + 4η2K2G2L2

Since ˆy1 w0 = 1 1 β ( ˆw1 w0) = η 1 β P k [K] gk 0, we have

EF(ˆy1) F(w0) η 1 β E F(w0), X

k [K] gk 0 + Lη2

1 β F(w0) 2 + Lη2

1 β F(w0) 2 + Lη2

k [K] gk 0 K F(w0) + K F(w0)

F(w0) + LK2η2

F(w0) 2 + KLσ2η2

F(w0) + KLσ2η2

The last inequality above holds because η 1 β

Combining the above equations, we can get that

t=1 E F(wt) 2 4(1 β) [F(w0) F ]

1 β + 2LKσ2η

(1 β)T + 4η2L2K2G2

T K 4(1 β) [F(w0) F ]

1 β + 4η2L2K2G2

Let η = min{ 1 β

2KL, (1 β)[F (w0) F ] 1 2

(LT ) 1 2 σ , (1 β) 5 3 [F (w0) F ] 1 3

(LKG) 2 3 T 1 3 }, then we can get that

t=1 E F(wt) 2 O

C.3 Or Mo with Delay-Adaptive Learning Rate

C.3.1 Algorithm

The details of Or Mo with delay-adaptive learning rate (Or Mo-DA) are presented in Algorithm 5.

C.3.2 Notation

For a positive integer n, [n] = {0, 1, 2, , n 1}. [0] is defined as .

The function ite(k, t) denotes the iteration index of the latest parameter sent to worker k before iteration t, which can be formulated as

ite(k, t) =

0 t = 0, k [K], t t > 0, k = kt 1, ite(k, t 1) t > 0, k = kt 1,

where k [K], t [T + 1].

Algorithm 5 Or Mo-DA

1: Server: 2: Input: number of workers K, number of iterations T, momentum coefficient β [0, 1); 3: Initialization: initial parameter w0, momentum u0 = 0, index of the latest gradient group I0 = 0, waiting set C = ; 4: Send the initial parameter w0 and its iteration index 0 to all workers; 5: for t = 0 to T 1 do 6: if the waiting set C is empty and t

K > It then 7: wt+ 1

2 = wt βut, ut+ 1

2 = βut, It+1 = It + 1; 8: else 9: wt+ 1

2 = wt, ut+ 1

2 = ut, It+1 = It; 10: end if 11: Receive a stochastic gradient gkt ite(kt,t) and its iteration index ite(kt, t) from some worker kt and then calculate ite(kt,t)

K (i.e., the index of the gradient group that gkt ite(kt,t) belongs to);

12: Update the momentum ut+1 = ut+ 1

2 + βIt+1 ite(kt,t)

K ηtgkt ite(kt,t) ;

13: Update the parameter wt+1 = wt+ 1

2 1 βIt+1 ite(kt,t)

1 β ηtgkt ite(kt,t) ;

14: Add the worker kt to the waiting set C = C {kt}; 15: Execute the asynchronous communication scheduler: once the waiting set is not empty, i.e., C = , immediately send the parameter wt+1 and its iteration index t + 1 to the worker in C and set C to ; 16: end for 17: Notify all workers to stop; 18: Worker k : (k [K]) 19: repeat 20: Wait until receiving the parameter wt and its iteration index t from the server; 21: Randomly sample ξk D and then compute the stochastic gradient gk t = f(wt ; ξk); 22: Send the stochastic gradient gk t and its iteration index t to the server; 23: until receive server s notification to stop

ηt is the learning rate at iteration t and satisfies that

min{η, 1 4Lτt } τt > 2K,

where t [T].

The function next(k, t) denotes the index of the next iteration that the gradient from worker k will participate in the parameter update after iteration t (including iteration t), which can be formulated as

next(k, t) = min{j t : kj = k} j [T] \ [t] , kj = k, T j [T] \ [t] , kj = k,

where k [K], t [T]. It s easy to verify that ite(k, next(k, t)) = ite(k, t), next(k, ite(k, t)) = next(k, t), where k [K], t [T].

We define ˆτk,t = t ite(k, t), where k [K] and t [T + 1]. ˆτk,t denotes the current delay of the gradient gk ite(k,t) at iteration t, which is the number of iterations that have happened since ite(k, t). It s easy to verify that ˆτkt,t = t ite(kt, t) = τt, where t [T].

We also define an auxiliary sequence ˆηk,t for the adaptive learning rates. ˆηk,t denotes the learning rate corresponding to the gradient gk ite(k,t).

η ˆτk,next(k,t) 2K,

min{η, 1 4Lˆτk,next(k,t) } ˆτk,next(k,t) > 2K,

where k [K] and t [T]. Since next(k, t) = next(k, ite(k, t)), we can have that ˆηk,t = ˆηk,ite(k,t), where k [K], t [T].

If next(k, t) [T], ˆτk,next(k,t) = ˆτknext(k,t),next(k,t) = τnext(k,t) and then we have that

ˆηk,t = ηnext(k,t) =

η τnext(k,t) 2K,

min{η, 1 4Lτnext(k,t) } τnext(k,t) > 2K.

It s easy to verify that ˆηkt,t = ηnext(kt,t) = ηt, where t [T].

If next(k, t) = T,

η ˆτk,T 2K,

min{η, 1 4Lˆτk,T } ˆτk,T > 2K.

C.3.3 Convergence Analysis for Or Mo-DA

Firstly, we define one auxiliary sequence {ˆut}t 1 for the momentum: ˆu1 = P

k [K] ˆηk,0gk 0, and

( βˆut + ˆηkt 1,tgkt 1 t K | (t 1) ,

ˆut + ˆηkt 1,tgkt 1 t K (t 1) , (13)

Lemma 5. For any t 0, the gap between ut+1 and ˆut+1 can be formulated as follows:

ˆut+1 ut+1 = X

k [K],k =kt β t

K ˆηk,ite(k,t)gk ite(k,t). (14)

Proof. Base case: for t = 0, ˆu1 = P

k [K] ˆηk,0gk 0, u1 = ˆηk0,0gk0 0 , then we have

k [K],k =k0 β 0

K ˆηk,0gk 0

k [K],k =k0 β 0

K ˆηk,ite(k,0)gk ite(k,0).

Inductive hypothesis: for some arbitrary integer t 1 0, assume that (14) is true for t = t 1.

Inductive step: We will prove that (14) is true for t = t . Firstly, we divide our discussion into two cases based on whether t 1 is divisible by K and prove that

ˆut +1 ut +1 = X

k [K],k =kt 1 β t

K ite(k,t )

K ˆηk,ite(k,t )gk ite(k,t ) + ˆηkt 1,t g kt 1 t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

Case 1: K | (t 1)

ˆut +1 = βˆut + ˆηkt 1,t g kt 1 t ,

ut +1 = βut + β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t ),

ˆut +1 ut +1 = β(ˆut ut ) + ˆηkt 1,t g kt 1 t β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1 β t 1

K +1 ite(k,t 1)

K ˆηk,ite(k,t 1)gk ite(k,t 1)

+ ˆηkt 1,t g kt 1 t β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1 β t

K ite(k,t 1)

K ˆηk,ite(k,t 1)gk ite(k,t 1)

+ ˆηkt 1,t g kt 1 t β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

The second equation above holds because t

K = It when K | (t 1), It +1 = t

K and ˆηkt ,ite(kt ,t ) = ηnext(kt ,ite(kt ,t )) = ηt . The last equation above holds because t

Case 2: K (t 1)

ˆut +1 = ˆut + ˆηkt 1,t g kt 1 t ,

ut +1 = ut + β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t ),

ˆut +1 ut +1 = (ˆut ut ) + ˆηkt 1,t g kt 1 t β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1 β t 1

K ite(k,t 1)

K ˆηk,ite(k,t 1)gk ite(k,t 1)

+ ˆηkt 1,t g kt 1 t β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1 β t

K ite(k,t 1)

K ˆηk,ite(k,t 1)gk ite(k,t 1)

+ ˆηkt 1,t g kt 1 t β t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

The last equation above holds because t

Since ite(k, t ) = ite(k, t 1), k = kt 1, we can get the following equation for both cases above:

ˆut +1 ut +1 = X

k [K],k =kt 1 β t

K ite(k,t )

K ˆηk,ite(k,t )gk ite(k,t ) + ˆηkt 1,t g kt 1 t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

If kt = kt 1, then we have ite(kt , t ) = t and

ˆut +1 ut +1 = X

k [K],k =kt β t

K ite(k,t )

K ˆηk,ite(k,t )gk ite(k,t ).

If kt = kt 1, then we have

ˆut +1 ut +1 = X

k [K],k =kt 1 β t

K ite(k,t )

K ˆηk,ite(k,t )gk ite(k,t ) + ˆηkt 1,t g kt 1 t

K ite(kt ,t )

K ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1,k =kt β t

K ite(k,t )

K ˆηk,ite(k,t )gk ite(k,t ) + ˆηkt 1,t g kt 1 t

k [K],k =kt 1,k =kt β t

K ite(k,t )

K ˆηk,ite(k,t )gk ite(k,t )

K ite(kt 1,t )

K ˆηkt 1,ite(kt 1,t )g kt 1 ite(kt 1,t )

k [K],k =kt β t

K ite(k,t )

K ˆηk,ite(k,t )gk ite(k,t ).

We can conclude that ˆut+1 ut+1 = P

k [K],k =kt β t

K ˆηk,ite(k,t)gk ite(k,t) is true for any t 0.

Then, we define one auxiliary sequence { ˆwt}t 1 for the parameter: ˆw1 = w0 P

k [K] ˆηk,0gk 0, and

( ˆwt βˆut ˆηkt 1,tgkt 1 t K | (t 1),

ˆwt ˆηkt 1,tgkt 1 t K (t 1),

Lemma 6. For any t 0, the gap between wt+1 and ˆwt+1 can be formulated as follows:

ˆwt+1 wt+1 = X

k [K],k =kt

1 β ˆηk,ite(k,t)gk ite(k,t). (15)

Proof. Base case: for t = 0, ˆw1 = w0 P

k [K] ˆηk,0gk 0, w1 = w0 ˆηk0,0gk0 0 , then we have

k [K],k =k0

1 β ˆηk,ite(k,0)gk ite(k,0).

Inductive hypothesis: for some arbitrary integer t 1 0, assume that (15) is true for t = t 1.

Inductive step: We will prove that (15) is true for t = t . Firstly, we divide our discussion into two cases based on whether t 1 is divisible by K and prove that

ˆwt +1 wt +1 = X

k [K],k =kt 1

K ite(k,t )

1 β ˆηk,ite(k,t )gk ite(k,t ) ˆηkt 1,t g kt 1 t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

Case 1: K | (t 1)

ˆwt +1 = ˆwt βˆut ˆηkt 1,t g kt 1 t ,

wt +1 = wt βut 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t ),

ˆwt +1 wt +1 = ˆwt wt β (ˆut ut ) ˆηkt 1,t g kt 1 t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1)gk ite(k,t 1)

k [K],k =kt 1 β t 1

K ite(k,t 1)

K +1ˆηk,ite(k,t 1)gk ite(k,t 1)

ˆηkt 1,t g kt 1 t 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1)gk ite(k,t 1)

ˆηkt 1,t g kt 1 t + 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1)gk ite(k,t 1)

ˆηkt 1,t g kt 1 t + 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

The last equation above holds because t

Case 2: K (t 1)

ˆwt +1 = ˆwt ˆηkt 1,t g kt 1 t ,

wt +1 = wt 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t ),

ˆwt +1 wt +1 = ˆwt wt

ˆηkt 1,t g kt 1 t 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1)gk ite(k,t 1)

ˆηkt 1,t g kt 1 t + 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1)gk ite(k,t 1)

ˆηkt 1,t g kt 1 t + 1 β t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

The last equation above holds because t

Since ite(k, t ) = ite(k, t 1), k = kt 1, we can get the following equation for both cases above:

ˆwt +1 wt +1 = X

k [K],k =kt 1

K ite(k,t )

1 β ˆηk,ite(k,t )gk ite(k,t ) ˆηkt 1,t g kt 1 t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t ).

If kt = kt 1, then we have ite(kt , t ) = t and

ˆwt +1 wt +1 = X

k [K],k =kt

K ite(k,t )

1 β ˆηk,ite(k,t )gk ite(k,t ).

If kt = kt 1, then we have

ˆwt +1 wt +1 = X

k [K],k =kt 1

K ite(k,t )

1 β ˆηk,ite(k,t )gk ite(k,t ) ˆηkt 1,t g kt 1 t

K ite(kt ,t )

1 β ˆηkt ,ite(kt ,t )gkt ite(kt ,t )

k [K],k =kt 1,k =kt

K ite(k,t )

1 β ˆηk,ite(k,t )gk ite(k,t ) ˆηkt 1,t g kt 1 t

k [K],k =kt

K ite(k,t )

1 β ˆηk,ite(k,t )gk ite(k,t )

We can conclude that ˆwt+1 wt+1 = P

k [K],k =kt 1 β t

1 β ˆηk,ite(k,t)gk ite(k,t) is true for any t 0.

Then, we define another auxiliary sequence {ˆyt}t 1: ˆy1 = w0 1 1 β P

k [K] ˆηk,0gk 0, and

ˆyt+1 = ˆyt 1 1 β ˆηkt 1,tgkt 1 t ,

for t 1. Lemma 7. For any t 1, the gap between ˆyt and ˆwt can be formulated as follows:

ˆyt ˆwt = β 1 β ˆut. (16)

Proof. Base case: For t = 1, we have that

k [K] ˆηk,0gk 0

k [K] ˆηk,0gk 0

k [K] ˆηk,0gk 0 = β 1 β ˆu1.

Inductive hypothesis: for some arbitrary integer t 1, assume that ˆyt ˆwt = β 1 β ˆut is true for t = t .

Inductive step: We will prove that ˆyt ˆwt = β 1 β ˆut is true for t = t +1. We divide our discussion into two cases based on whether t 1 is divisible by K.

Case 1: K | (t 1)

ˆyt +1 = ˆyt 1 1 β ˆηkt 1,t g kt 1 t

ˆwt +1 = ˆwt βˆut ˆηkt 1,t g kt 1 t

ˆyt +1 ˆwt +1 = ˆyt 1 1 β ˆηkt 1,t g kt 1 t

ˆwt βˆut ˆηkt 1,t g kt 1 t

= ˆyt ˆwt + βˆut β 1 β ˆηkt 1,t g kt 1 t

1 β ˆut β 1 β ˆηkt 1,t g kt 1 t

= β 1 β ˆut +1

Case 2: K (t 1)

ˆyt +1 = ˆyt 1 1 β ˆηkt 1,t g kt 1 t

ˆwt +1 = ˆwt ˆηkt 1,t g kt 1 t

ˆyt +1 ˆwt +1 = ˆyt 1 1 β ˆηkt 1,t g kt 1 t

ˆwt ˆηkt 1,t g kt 1 t

= ˆyt ˆwt β 1 β ˆηkt 1,t g kt 1 t

= β 1 β ˆut β 1 β ˆηkt 1,t g kt 1 t

= β 1 β ˆut +1

We can conclude that ˆyt ˆwt = β 1 β ˆut is true for any t 1.

Lemma 8. For any t 1, ˆut can be formulated as follows:

ˆut = β t+K 2

k [K] ˆηk,0gk 0

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ˆηkj 1,jgkj 1 j

Proof. It s straightforward to get this conclusion from the definition of the sequence ˆut in (13).

Lemma 9. (descent lemma) With Assumptions 1 and 3, we have the following descent lemma for t 1,:

EF(ˆyt+1) F(ˆyt) +

L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

F(wt) 2 + (ˆηkt 1,t)2σ2L

+ L2ˆηkt 1,t

1 β ˆyt ˆwt 2 + L2ˆηkt 1,t

1 β ˆwt wt 2 .

ˆyt+1 = ˆyt 1 1 β ˆηkt 1,tgkt 1 t

EF(ˆyt+1) F(ˆyt) + E F(ˆyt), ˆyt+1 ˆyt + L

2 E ˆyt+1 ˆyt 2

= F(ˆyt) 1 1 β E F(ˆyt), ˆηkt 1,tgkt 1 t + L

2(1 β)2 E (ˆηkt 1,t)2 gkt 1 t 2

= F(ˆyt) 1 1 β F(ˆyt), ˆηkt 1,t F(wt) + L

2(1 β)2 E (ˆηkt 1,t)2 gkt 1 t 2 .

1 β F(ˆyt), F(wt) = ˆηkt 1,t

1 β F(ˆyt) F(wt) + F(wt), F(wt)

1 β F(ˆyt) F(wt), F(wt) ˆηkt 1,t

1 β F(wt) 2

ˆηkt 1,t 2(1 β) F(ˆyt) F(wt) 2 ˆηkt 1,t 2(1 β) F(wt) 2

ˆηkt 1,t L2

2(1 β) ˆyt wt 2 ˆηkt 1,t 2(1 β) F(wt) 2

2(1 β)2 E (ˆηkt 1,t)2 gkt 1 t 2 = L(ˆηkt 1,t)2

2(1 β)2 E gkt 1 t F(wt) + F(wt) 2

L(ˆηkt 1,t)2

E gkt 1 t F(wt) 2 + F(wt) 2

(ˆηkt 1,t)2L

2(1 β)2 σ2 + F(wt) 2

EF(ˆyt+1) F(ˆyt) +

L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

F(wt) 2 + (ˆηkt 1,t)2σ2L

+ L2ˆηkt 1,t

2(1 β) ˆyt wt 2

L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

F(wt) 2 + (ˆηkt 1,t)2σ2L

+ L2ˆηkt 1,t

1 β ˆyt ˆwt 2 + L2ˆηkt 1,t

1 β ˆwt wt 2 .

Lemma 10. With Assumption 1, the gap between ˆyt and ˆwt in Or Mo-DA can be bounded as follows:

t=1 E ˆηkt 1,t ˆyt ˆwt 2 2β2ηK2

k [K] (ˆηk,0)2 F(w0) 2 +

ˆηkt 1,t 2E F(wt) 2

k [K] (ˆηk,0)2 +

E ˆut 2 = E

k [K] ˆηk,0gk 0

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ˆηkj 1,jgkj 1 j

k [K] ˆηk,0 gk 0 F(w0)

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ˆηkj 1,j gkj 1 j F(wj)

k [K] ˆηk,0 F(w0)

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ˆηkj 1,j F(wj)

k [K] β2 t+K 2

K (ˆηk,0)2σ2 + 2

min{s K,t 1} X

j=(s 1)K+1 β2 t+K 2

K 2s ˆηkj 1,j 2σ2

k [K] ˆηk,0 F(w0)

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ˆηkj 1,j F(wj)

Let qt = P k [K] β t+K 2

K + P t+K 2

K s=1 Pmin{s K,t 1} j=(s 1)K+1 β t+K 2

K s, then we have

k [K] ˆηk,0 F(w0)

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ˆηkj 1,j F(wj)

qt ˆηk,0 F(w0)

min{s K,t 1} X

qt ˆηkj 1,j F(wj)

k [K] β t+K 2

K (ˆηk,0)2 F(w0) 2

min{s K,t 1} X

j=(s 1)K+1 β t+K 2

K s ˆηkj 1,j 2E F(wj) 2

k [K] β t+K 2

K (ˆηk,0)2 F(w0) 2

min{s K,t 1} X

j=(s 1)K+1 β t+K 2

K s ˆηkj 1,j 2E F(wj) 2

Thus, we get that

k [K] ˆηk,0 F(w0)

s=1 β t+K 2

min{s K,t 1} X

j=(s 1)K+1 ˆηkj 1,j F(wj)

k [K] β2 t+K 2

K (ˆηk,0)2σ2 + 2

min{s K,t 1} X

j=(s 1)K+1 β2 t+K 2

K 2s ˆηkj 1,j 2σ2

k [K] β t+K 2

K (ˆηk,0)2 F(w0) 2 +

min{s K,t 1} X

j=(s 1)K+1 β t+K 2

K s ˆηkj 1,j 2E F(wj) 2

k [K] β2 t+K 2

K (ˆηk,0)2σ2 + 2

min{s K,t 1} X

j=(s 1)K+1 β2 t+K 2

K 2s ˆηkj 1,j 2σ2.

t=1 E ˆηkt 1,t ˆyt ˆwt 2 β2η

t=1 E ˆut 2

k [K] β t+K 2

K (ˆηk,0)2 F(w0) 2

min{s K,t 1} X

j=(s 1)K+1 β t+K 2

K s ˆηkj 1,j 2E F(wj) 2

k [K] β2 t+K 2

K (ˆηk,0)2 +

min{s K,t 1} X

j=(s 1)K+1 β2 t+K 2

K 2s ˆηkj 1,j 2

k [K] (ˆηk,0)2 F(w0) 2 +

ˆηkt 1,t 2E F(wt) 2

k [K] (ˆηk,0)2 +

Lemma 11. With Assumption 1, letting η 1 8KL, the gap between ˆwt and wt in Or Mo-DA can be bounded as follows:

t=1 E ˆηkt 1,t ˆwt wt 2 ηK

t=1 ˆηkt 1,t E F(wt) 2 +

k [K] ˆηk,0

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

t=1 E ˆηkt 1,t ˆwt wt 2 η

t=1 E ˆwt wt 2

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1)gk ite(k,t 1)

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1) gk ite(k,t 1) F(wite(k,t 1))

k [K],k =kt 1

K ite(k,t 1)

1 β ˆηk,ite(k,t 1) F(wite(k,t 1))

k [K],k =kt

ˆηk,ite(k,t) 2 σ2 + KE F(wite(k,t)) 2

ˆηk,ite(k,t) 2 σ2 + KE F(wite(k,t)) 2 1 (k = kt) 1 (j = ite(k, t))

ˆηk,ite(k,t) 2 σ2 + KE F(wite(k,t)) 2 1 (k = kt) 1 (j = ite(k, t))

ˆηk,ite(k,t) 2 σ2 + KE F(wite(k,t)) 2 1 (k = kt) 1 (0 = ite(k, t))

ˆηkj 1,j 2 σ2 + KE F(wj) 2 (next(kj 1, j) j)

k [K] (ˆηk,0)2 σ2 + K F(w0) 2 (next(k, 0) 0)

t=1 E ˆηkt 1,t ˆwt wt 2 2η

ˆηkt 1,t 2 σ2 + KE F(wt) 2 ˆτkt 1,next(kt 1,t)

k [K] (ˆηk,0)2 σ2 + K F(w0) 2 ˆτk,next(k,0)

t=1 E ˆηkt 1,t ˆwt wt 2

t=1 ˆηkt 1,t h σ2 + KE F(wt) 2i + X

k [K] ˆηk,0 h σ2 + K F(w0) 2i

t=1 ˆηkt 1,t E F(wt) 2 +

k [K] ˆηk,0

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

Theorem 3. With Assumptions 1, 3 and 4, letting

min{η, 1 4Lτt } τt > 2K,

and η = min{ (1 β)2

T Lσ2 }, Algorithm 5 has the following convergence rate:

E F( w T ) 2 O(

where = F(w0) F and w T is randomly chosen from {w0, w1, , w T 1} according to a probability distribution which is related to the delay-adaptive learning rates as shown in (17).

Proof. According to Lemma 9, we can get that

EF(ˆy T ) EF(ˆy1)

" L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

E F(wt) 2 #

t=1 (ˆηkt 1,t)2

ˆηkt 1,t ˆwt wt 2 + ˆηkt 1,t ˆyt ˆwt 2 #

" L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

E F(wt) 2 #

t=1 (ˆηkt 1,t)2

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

t=1 ˆηkt 1,t E F(wt) 2 +

k [K] ˆηk,0

k [K] (ˆηk,0)2 F(w0) 2 +

ˆηkt 1,t 2E F(wt) 2

+ 2β2ηKL2σ2

k [K] (ˆηk,0)2 +

" L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

E F(wt) 2 #

t=1 (ˆηkt 1,t)2

2(1 β)3 + ηβ2Lσ2

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

t=1 ˆηkt 1,t E F(wt) 2 + X

k [K] ˆηk,0 F(w0) 2

k [K] ˆηk,0 F(w0) 2 +

t=1 ˆηkt 1,t E F(wt) 2

" L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

E F(wt) 2 #

t=1 (ˆηkt 1,t)2

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

t=1 ˆηkt 1,t E F(wt) 2 + X

k [K] ˆηk,0 F(w0) 2

" L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

E F(wt) 2 #

t=1 (ˆηkt 1,t)2

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

t=1 ˆηkt 1,t E F(wt) 2 + X

k [K] ˆηk,0 F(w0) 2

EF(ˆy1) F(w0) 1 1 β E F(w0),

k [K] ˆηk,0gk 0

k [K] ˆηk,0gk 0

k [K] ˆηk,0

1 β F(w0) 2

k [K] ˆηk,0 gk 0 F(w0) + F(w0)

k [K] ˆηk,0 2

k [K] ˆηk,0

F(w0) 2 + Lσ2

k [K] (ˆηk,0)2

EF(ˆy T ) F(w0)

" L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 2(1 β)

E F(wt) 2 #

t=1 (ˆηkt 1,t)2

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

t=1 ˆηkt 1,t E F(wt) 2 + X

k [K] ˆηk,0 F(w0) 2

k [K] ˆηk,0 2

P k [K] ˆηk,0

F(w0) 2 + Lσ2

k [K] (ˆηk,0)2

" L(ˆηkt 1,t)2

2(1 β)2 ˆηkt 1,t 4(1 β)

E F(wt) 2 #

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

k [K] ˆηk,0 2

k [K] ˆηk,0 4 (1 β)

EF(ˆy T ) F(w0) 1 8 (1 β)

t=1 ˆηkt 1,t E F(wt) 2 + X

k [K] ˆηk,0 F(w0) 2

t=1 ˆηkt 1,t + X

k [K] ˆηk,0

Thus, we can get that

1 PT 1 t=0 ηt E

t=0 ηt F(wt) 2 !

8 (1 β) (F(w0) F ) PT 1 t=0 ηt + 16ηLσ2

where η0 = P

k [K] ˆηk,0 and ηt = ˆηkt 1,t(t 1).

Then, we analyse the lower bound of PT 1 t=0 ηt.

It s easy to verify that

ˆτk,t+1 = t + 1 ite(k, t + 1) = t ite(k, t) + 1 = ˆτk,t + 1 k = kt, 0 = ˆτk,t τt k = kt.

k [K] ˆτk,0 = 0, we have P

k [K] ˆτk,T + PT 1 t=0 τt = (K 1)T. Moreover, ˆτk T 1,T =

T ite(k T 1, T) = 0. We get that P

k [K],k =k T 1 ˆτk,T + PT 1 t=0 τt = (K 1)T. Thus, at least T

2 delays are smaller than 2K. PT t=0 ηt = PT 1 t=1 ˆηkt 1,t + P k [K] ˆηk,0 T η

1 PT 1 t=0 ηt

t=0 ηt E F(wt) 2 16(1 β)

Tη (F(w0) F ) + 16ηLσ2

If we choose an output w T from {w0, w1, , w T 1} according to

P( w T = wt) ηt, (17)

where t [T] and let η = min{ (1 β)2

(1 β)3(F (w0) F )

T Lσ2 }, we have that

E F( w T ) 2 O

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The main claims made in the abstract and introduction accurately reflect the paper s contributions and scope. We propose Or Mo for ASGD in Subsection 3.2. We prove the convergence of Or Mo theoretically in Subsection 3.3 and verify its superiority empirically in Section 4. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss the limitations of the assumptions of Or Mo in Subsection 3.3. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes] Justification: The full set of assumptions and the proof sketch can be found in Section 3.3. Please refer to Appendix C for the complete proof details. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: This paper fully discloses all the information needed to reproduce the main experimental results. The implementation details can be found in Section 4. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code. The dataset is public. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The implementation details can be found in Section 4. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: Each experiment is repeated 5 times as reported in Section 4. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: The type of compute workers and the time of execution are presented in Section 4. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our research conforms with the Neur IPS Code of Ethics in every respect. Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: There is no societal impact of the work performed. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: This paper poses no such risks.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We have cited the original papers.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: This paper does not release new assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: This paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.