# Accelerating Gossip SGD with Periodic Global Averaging

Yiming Chen*, Kun Yuan*, Yingya Zhang, Pan Pan, Yinghui Xu, Wotao Yin
*Equal contribution. Alibaba Group, Hangzhou, China. Correspondence to: Kun Yuan.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity 1 − β, which measures the network connectivity. On large and sparse networks, where 1 − β ≈ 0, Gossip SGD requires more iterations to converge, which offsets its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging to Gossip SGD. Its transient stage, i.e., the number of iterations required to reach the asymptotic linear-speedup stage, improves from $\Omega(\beta^4 n^3/(1-\beta)^4)$ to $\Omega(\beta^4 n^3 H^4)$ for non-convex problems. The influence of the network topology in Gossip-PGA can be controlled by the averaging period H. Its transient-stage complexity is also superior to that of Local SGD, which is of order $\Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet-50) and language modeling (BERT) validate our theoretical findings.

1. Introduction

The scale of deep learning nowadays calls for efficient large-scale distributed training across multiple computing nodes in data-center clusters. In distributed optimization, a network of n nodes cooperates to solve the problem

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \big[\, f_i(x) := \mathbb{E}_{\xi_i \sim D_i} F_i(x; \xi_i) \,\big] \qquad (1)$$

where each component f_i is local and private to node i, and the random variable ξ_i denotes the local data that follows distribution D_i. We assume each node i can locally evaluate stochastic gradients $\nabla F_i(x;\xi_i)$ with $\xi_i \sim D_i$, but must communicate to access information from other nodes.

Table 1. Top-1 validation accuracy on ImageNet with 256 GPUs connected with the ring or one-peer exponential network. Gossip SGD takes more time to reach the same accuracy as Parallel SGD.

| Method | Epochs | Acc. (%) | Time (hrs.) |
|---|---|---|---|
| Parallel SGD | 120 | 76.26 | 2.22 |
| Gossip SGD (Ring) | 120 | 74.86 | 1.56 |
| Gossip SGD (Expo) | 120 | 75.34 | 1.55 |
| Gossip SGD (Ring) | 240 | 75.62 | 3.02 |
| Gossip SGD (Expo) | 240 | 76.18 | 3.03 |

Parallel SGD methods are leading algorithms to solve (1), in which every node processes local training samples independently and synchronizes gradients every iteration, either through a central Parameter Server (PS) (Li et al., 2014) or via the All-Reduce communication primitive (Patarasuk & Yuan, 2009). The global synchronization in Parallel SGD incurs either significant bandwidth cost or high latency, which hampers training scalability.

Many alternative methods have been proposed to reduce communication overhead in distributed training. Gossip SGD, also known as decentralized SGD (Nedic & Ozdaglar, 2009; Chen & Sayed, 2012; Lian et al., 2017; 2018; Assran et al., 2019), has recently received a lot of attention. This line of work lets each node communicate with (some of) its direct neighbors. In a sparse topology such as the one-peer exponential graph (Assran et al., 2019), each node communicates with only one neighbor at a time. This gossip-style communication is much faster than PS and All-Reduce, but the computed average can be highly inaccurate.
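To see concretely why a single gossip round yields an inexact average, the following minimal sketch (our own numpy illustration, not code from the paper) runs one neighbor-averaging step on a ring with uniform 1/3 weights and compares the result with the exact global mean that All-Reduce would return.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring: each node averages
    itself and its two neighbors with weight 1/3 (a common choice)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

n, d = 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))          # row i = model x_i held by node i
W = ring_mixing_matrix(n)

exact_mean = X.mean(axis=0)          # what All-Reduce would compute
one_gossip = W @ X                   # one gossip round: x_i <- sum_j w_ij x_j

# Deviation from the exact mean after a single gossip step.
err = np.linalg.norm(one_gossip - exact_mean, axis=1).max()
print(f"max deviation from exact mean after one gossip step: {err:.3f}")

# The quantity beta = ||W - (1/n) 11^T||_2 governs how fast repeated gossip
# shrinks this error; on a ring, beta approaches 1 as n grows.
beta = np.linalg.norm(W - np.ones((n, n)) / n, 2)
print(f"beta for the {n}-node ring: {beta:.3f}")
```

The printed β is exactly the connectivity quantity that appears throughout the analysis below.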
Local SGD (Stich, 2019; Yu et al., 2019; Lin et al., 2018) is another line of work that increases the computation-to-communication ratio. Local SGD lets each node run local gradient descent for multiple rounds and average parameters globally only once in a while. By communicating less frequently, Local SGD reduces the communication overhead.

The reduced communication in Gossip and Local SGD comes at a cost: a slower convergence rate. While both algorithms are proved to achieve linear speedup asymptotically, they are sensitive to the network topology and to the synchronization period, respectively. For Gossip SGD, the convergence rate is inversely proportional to 1 − β (β is defined in Remark 1). Since β → 1 on the large and sparse network topologies that are most valuable for deep training, Gossip SGD converges very slowly and requires more iterations than Parallel SGD to achieve a desired solution. This may nullify its communication efficiency and result in even more training time (see Table 1). Local SGD with a large averaging period meets the same issue.

Table 2. The lengths of the transient stages of Gossip SGD and Gossip-PGA. Since C_β = Σ_{k=0}^{H−1} β^k = (1 − β^H)/(1 − β) < min{1/(1−β), H}, Gossip-PGA always has a shorter transient stage, and the gap is more evident on large and sparse networks where β → 1.

| Scenario | Gossip SGD (iid) | Gossip SGD (non-iid) | Gossip-PGA, proposed (iid) | Gossip-PGA, proposed (non-iid) |
|---|---|---|---|---|
| Small or dense network (1/(1−β) < H) | Ω(n^3 β^4 / (1−β)^2) | Ω(n^3 β^4 / (1−β)^4) | Ω(n^3 β^4 C_β^2) | Ω(n^3 β^4 C_β^2 / (1−β)^2) |
| Large or sparse network (1/(1−β) ≥ H) | Ω(n^3 β^4 / (1−β)^2) | Ω(n^3 β^4 / (1−β)^4) | Ω(n^3 β^4 C_β^2) | Ω(n^3 β^4 C_β^2 H^2) |

Table 3. The lengths of the transient stages of Local SGD and Gossip-PGA. Gossip-PGA always has shorter transient stages than Local SGD since β < 1 and C_β < H. Such superiority becomes more significant on well-connected networks where β → 0.

| Scenario | Local SGD | Gossip-PGA |
|---|---|---|
| iid | Ω(n^3 H^2) | Ω(n^3 β^4 C_β^2) |
| non-iid | Ω(n^3 H^4) | Ω(n^3 β^4 C_β^2 H^2) |

This paper proposes Gossip-PGA, which adds periodic All-Reduce global averaging to Gossip SGD to accelerate its convergence, especially on large and sparse networks. Gossip-PGA also extends Local SGD with fast gossip-style communication after local updates. When the same averaging period H is used, the additional gossip communication endows Gossip-PGA with faster convergence than Local SGD.

Challenges. Gossip-PGA can be regarded as a special form of topology-changing Gossip SGD (Koloskova et al., 2020) and of SlowMo (Wang et al., 2019) (with the base optimizer set to Gossip SGD and the slow momentum coefficient set to 0). However, its theory and practical performance were not carefully investigated in the literature. Important unanswered questions include: how much acceleration can PGA bring to Gossip and Local SGD, in which scenarios does PGA benefit most, how should the averaging period be adjusted effectively, and how does Gossip-PGA perform in large-scale deep learning systems? Providing quantitative answers to these questions requires a new understanding of the interplay between gossip communication and the global averaging period. Simply following the existing analysis in (Koloskova et al., 2020) results in incomplete conclusions; see Remark 5. Also, the analysis of SlowMo (Wang et al., 2019) does not consider heterogeneous data distributions and cannot cover our results.
1.1. Main Results

This paper proves that Gossip-PGA converges at the rate

$$\underbrace{\frac{\sigma}{\sqrt{nT}}}_{\text{SGD rate}} \;+\; \underbrace{\frac{C_\beta^{1/3}\beta^{2/3}\big(\sigma^{2/3} + D_\beta^{1/3} b^{2/3}\big)}{T^{2/3}} + \frac{\beta D_\beta}{T}}_{\text{extra overhead}} \qquad (2)$$

for both smooth convex and non-convex functions f_i (the metrics used in the two scenarios are given in Theorems 1 and 2), where n is the network size, T is the total number of iterations, σ^2 denotes the gradient noise, b^2 gauges the data heterogeneity, β ∈ (0, 1) measures how well the network is connected, H is the global averaging period, and we define C_β = Σ_{k=0}^{H−1} β^k and D_β = min{H, 1/(1 − β)}.

Linear speedup. When T is sufficiently large, the first term 1/√(nT) dominates (2). This also applies to Parallel, Local, and Gossip SGD. Gossip-PGA and these algorithms all require T = Ω(1/(nϵ^2)) iterations to reach a desired accuracy ϵ, which is inversely proportional to n. We say an algorithm is in its linear-speedup stage at iteration T if, for this T, the term involving 1/√(nT) dominates the rate.

Transient stage. The transient stage refers to the iterations before an algorithm reaches its linear-speedup stage, that is, iterations 1, ..., T where T is still small enough that the non-1/√(nT) terms (the extra-overhead terms in (2)) dominate the rate. Take Gossip-PGA in the non-iid scenario (b ≠ 0) as an example. To reach linear speedup, T has to satisfy T^{2/3}/(C_β^{1/3} β^{2/3} D_β^{1/3}) ≳ n^{1/2} T^{1/2}, i.e., T ≳ n^3 β^4 C_β^2 D_β^2. So the transient stage has Ω(n^3 β^4 C_β^2 D_β^2) iterations. The transient stage is an important metric for the scalability of distributed algorithms.

Shorter transient stage than Gossip SGD. The transient-stage comparison between Gossip SGD and Gossip-PGA is shown in Table 2. Since C_β = (1 − β^H)/(1 − β) < min{H, 1/(1 − β)}, we conclude that Gossip-PGA always has a shorter transient stage than Gossip SGD for any β and H. Moreover, the superiority of Gossip-PGA becomes evident when the network is large and sparse, i.e., when 1 − β ≈ 0. In this case, the transient stage of Gossip SGD can grow dramatically (see the second row in Table 2), while that of Gossip-PGA is controlled by the global period H because C_β < H. As a result, Gossip-PGA improves the transient stage of Gossip SGD from O(n^3/(1 − β)^4) (or O(n^3/(1 − β)^2) in the iid scenario) to O(n^3), treating H as a constant, when β → 1.

Shorter transient stage than Local SGD. The transient-stage comparison between Local SGD and Gossip-PGA is shown in Table 3. Using C_β < H, we find that Gossip-PGA always has a shorter transient stage than Local SGD. Moreover, when the network is well connected so that β ≈ 0, it holds that C_β ≈ 1, and Gossip-PGA has a significantly shorter transient stage than Local SGD.

1.2. Contributions

- We establish the convergence rate of Gossip-PGA for both smooth convex and non-convex problems. Our results clarify how gossip communication and periodic global averaging collaborate to improve the transient stage of Gossip and Local SGD. We also establish shorter wall-clock training times for Gossip-PGA.

- We propose Gossip-AGA, which uses adaptive global averaging periods. Gossip-AGA adjusts H automatically and has convergence guarantees.

- We conduct various experiments (convex logistic regression and large-scale deep learning tasks) to validate all established theoretical results. In particular, the proposed Gossip-PGA/AGA achieves a convergence speed similar to parallel SGD in iterations, but provides 1.3x-1.9x runtime speedup. The introduced global averaging steps in Gossip-PGA/AGA remedy the accuracy degradation of Gossip SGD and Local SGD.
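To make the transient-stage orders in Tables 2 and 3 concrete, the short script below (an illustration we add here, not code from the paper) evaluates C_β, D_β and the Ω(·) expressions of Section 1.1 for rings of growing size with a fixed period H; the absolute numbers are only meaningful up to constants.

```python
import numpy as np

def ring_beta(n):
    """beta = ||W - 11^T/n||_2 for the uniform 1/3-weight ring."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return np.linalg.norm(W - np.ones((n, n)) / n, 2)

def transient_orders(n, beta, H):
    """Transient-stage orders (up to constants) from Tables 2 and 3."""
    C = (1 - beta**H) / (1 - beta)          # C_beta = sum_{k<H} beta^k
    D = min(H, 1 / (1 - beta))              # D_beta = min{H, 1/(1-beta)}
    return {
        "Gossip SGD (non-iid)": n**3 * beta**4 / (1 - beta)**4,
        "Local SGD (non-iid)":  n**3 * H**4,
        "Gossip-PGA (non-iid)": n**3 * beta**4 * C**2 * D**2,
    }

H = 16
for n in (20, 50, 100):
    beta = ring_beta(n)
    orders = transient_orders(n, beta, H)
    print(f"n={n:3d}, beta={beta:.3f}:",
          {k: f"{v:.1e}" for k, v in orders.items()})
```

On the ring, 1 − β shrinks like 1/n^2, so the Gossip SGD column explodes with n while the Gossip-PGA column stays capped by the period H, mirroring the discussion above.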
2. Related Work

Decentralized optimization algorithms can be traced back to (Tsitsiklis et al., 1986). Since then, decentralized optimization has been intensively studied in the signal processing and control communities. Decentralized gradient descent (DGD) (Nedic & Ozdaglar, 2009), diffusion (Chen & Sayed, 2012) and dual averaging (Duchi et al., 2011) are among the first decentralized algorithms that target general optimization problems. However, these algorithms suffer from a bias caused by data heterogeneity (Yuan et al., 2016). Various primal-dual algorithms have been proposed to overcome this issue; they are based on the alternating direction method of multipliers (ADMM) (Shi et al., 2014), explicit bias-correction (Shi et al., 2015; Yuan et al., 2019; Li et al., 2019c), gradient tracking (Xu et al., 2015; Di Lorenzo & Scutari, 2016; Nedic et al., 2017; Qu & Li, 2018), coordinate-descent methods (He et al., 2018), and dual acceleration (Scaman et al., 2017; 2018; Uribe et al., 2020).

In the context of machine learning, decentralized SGD, also known as Gossip SGD, has gained a lot of attention recently. (Lian et al., 2017) first proved that Gossip SGD can reach the same linear speedup as vanilla parallel SGD. (Assran et al., 2019) then extended Gossip SGD to directed topologies. A recent work (Koloskova et al., 2020) proposes a unified framework to analyze algorithms with changing topologies and local updates. While it covers Gossip-PGA as a special form, the theoretical and practical benefits of periodic global averaging were not studied therein. The data heterogeneity issue in Gossip SGD is discussed and addressed in (Tang et al., 2018; Yuan et al., 2020; Lu et al., 2019; Xin et al., 2020). Gossip SGD is also extended to asynchronous scenarios in (Lian et al., 2018; Luo et al., 2020).

Local SGD can be traced back to (Zinkevich et al., 2010), which proposed one-shot averaging. A more frequent averaging strategy is proposed in (Zhang et al., 2016), and the convergence properties of Local SGD are established in (Yu et al., 2019; Stich, 2019; Bayoumi et al., 2020). Local SGD is also widely used in federated learning (McMahan et al., 2017; Li et al., 2019a). Another closely related work (Wang et al., 2019) proposes a slow momentum (SlowMo) framework, where each node, similarly to the Gossip-PGA algorithm proposed in this paper, periodically synchronizes across the network and performs a momentum update. The analysis in SlowMo cannot cover the convergence results in this paper due to its data-homogeneous setting. In addition, we clarify new questions such as how much acceleration PGA can bring to Gossip and Local SGD, and how to adjust the averaging period effectively.

Various techniques can be integrated into Gossip SGD to improve its communication efficiency. This paper does not consider quantization (Alistarh et al., 2017; Bernstein et al., 2018), gradient compression (Tang et al., 2019; Koloskova et al., 2019b;a) or lazy communication (Chen et al., 2018; Liu et al., 2019), but these orthogonal techniques can be added to our methods.

3. Gossip SGD with Periodic Global Averaging

Assume all computing nodes are connected over a graph G = {V, E}, where V = {1, 2, ..., n} denotes the node set and E denotes the communication links between nodes. As in existing decentralized algorithms (Nedic & Ozdaglar, 2009; Chen & Sayed, 2012; Lian et al., 2017; Assran et al., 2019), information exchange in the gossip step is only allowed to occur between connected neighbors.
To characterize the decentralized communication, we let $W \in \mathbb{R}^{n\times n}$ be a doubly stochastic matrix, i.e., $W \ge 0$, $W\mathbf{1}_n = \mathbf{1}_n$ and $\mathbf{1}_n^T W = \mathbf{1}_n^T$. The (i, j)-th element w_ij is the weight that scales information flowing from node j to node i. If nodes i and j are not neighbors then w_ij = 0; if they are neighbors or identical then w_ij > 0. Furthermore, we define N_i as the set of neighbors of node i, which includes node i itself.

The Gossip-PGA algorithm is listed in Algorithm 1. In the gossip step, every node i collects information from all of its connected neighbors. In the global averaging step, nodes synchronize their model parameters using the efficient All-Reduce primitive. When H → ∞, Gossip-PGA reduces to standard Gossip SGD; when W = (1/n)11^T, Gossip-PGA reduces to vanilla parallel SGD; when W = I, Gossip-PGA reduces to Local SGD.

Algorithm 1: Gossip-PGA
Require: learning rate γ > 0, weight matrix W, global averaging period H; initialize all x_i^(0) to the same value.
for k = 0, 1, 2, ..., T − 1, every node i do
    Sample ξ_i^(k+1) and compute g_i^(k) = ∇F_i(x_i^(k); ξ_i^(k+1))
    x_i^(k+1/2) = x_i^(k) − γ g_i^(k)                          # local SGD update
    if mod(k + 1, H) = 0 then
        x_i^(k+1) = (1/n) Σ_{j=1}^{n} x_j^(k+1/2)              # global average
    else
        x_i^(k+1) = Σ_{j ∈ N_i} w_ij x_j^(k+1/2)               # one gossip step

All-Reduce vs. multiple gossip steps. In a computing cluster with n nodes, global averaging is typically conducted with the efficient Ring All-Reduce, rather than via multiple gossip steps as in (Berahas et al., 2018). A communication-time comparison between a single gossip step and a Ring All-Reduce step is given in Appendix H. In the one-peer exponential network, the exact global average can be achieved via ln(n) gossip communications, which generally takes more wall-clock time than a single Ring All-Reduce operation. We therefore recommend using All-Reduce to conduct the global averaging in Gossip-PGA.

Data-center vs. wireless network. This paper considers deep training within high-performance data-center clusters, in which all GPUs are connected with high-bandwidth channels and the network topology can be fully controlled. In such a setting, the periodic global averaging conducted with Ring All-Reduce has a tolerable communication cost; see Appendix H. In scenarios where global averaging is extremely expensive, such as wireless sensor networks, the global averaging can be approximated via multiple gossip steps, or may not be recommended at all.

3.1. Assumptions and analysis highlights

We now establish convergence rates for Gossip-PGA on smooth convex and non-convex problems. For all our theoretical results we make the following standard assumptions.

Assumption 1 (L-smoothness). Each local cost function f_i(x) is differentiable, and there exists a constant L such that for all $x, y \in \mathbb{R}^d$:
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|. \qquad (3)$$

Assumption 2 (Gradient noise). Recall that g_i^(k) is the stochastic gradient computed in Algorithm 1. It is assumed that for any k and i,
$$\mathbb{E}\big[\, g_i^{(k)} - \nabla f_i(x_i^{(k)}) \mid \mathcal{F}^{(k-1)} \big] = 0, \qquad (4)$$
$$\mathbb{E}\big[\, \| g_i^{(k)} - \nabla f_i(x_i^{(k)}) \|^2 \mid \mathcal{F}^{(k-1)} \big] \le \sigma^2 \qquad (5)$$
for some constant σ^2 > 0. Moreover, we assume the ξ_i^(k) are independent of each other for all k and i. The filtration is defined as $\mathcal{F}^{(k)} = \big\{ \{x_i^{(k)}\}_{i=1}^n, \{\xi_i^{(k)}\}_{i=1}^n, \ldots, \{x_i^{(0)}\}_{i=1}^n, \{\xi_i^{(0)}\}_{i=1}^n \big\}$.

Assumption 3 (Weighting matrix). The network is strongly connected and the weight matrix W satisfies $W\mathbf{1}_n = \mathbf{1}_n$, $\mathbf{1}_n^T W = \mathbf{1}_n^T$, and $\mathrm{null}(I - W) = \mathrm{span}(\mathbf{1}_n)$. We also assume $\|W - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\|_2 \le \beta$ for some β ∈ (0, 1).
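For readers who prefer code, here is a minimal numpy sketch of Algorithm 1 (our own illustration, not the authors' implementation). The gradient oracle `grad`, the toy quadratic objectives and the ring weights below are placeholders we supply for the example.

```python
import numpy as np

def gossip_pga(grad, x0, W, gamma, H, T, rng):
    """Minimal sketch of Algorithm 1 (Gossip-PGA).

    grad(i, x, rng) -> stochastic gradient of f_i at x (placeholder oracle).
    x0: shared initial model, shape (d,).
    W:  doubly stochastic mixing matrix, shape (n, n).
    """
    n = W.shape[0]
    X = np.tile(x0, (n, 1))                          # row i holds x_i^(k)
    for k in range(T):
        G = np.stack([grad(i, X[i], rng) for i in range(n)])
        X_half = X - gamma * G                       # local SGD update
        if (k + 1) % H == 0:
            X = np.tile(X_half.mean(axis=0), (n, 1)) # global average (All-Reduce)
        else:
            X = W @ X_half                           # one gossip step
    return X.mean(axis=0)

# Toy usage: quadratic local objectives f_i(x) = 0.5 * ||x - c_i||^2 with noise.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 8, 3
    C = rng.normal(size=(n, d))
    grad = lambda i, x, rng: (x - C[i]) + 0.01 * rng.normal(size=d)
    W = np.zeros((n, n))
    for i in range(n):                               # simple ring weights
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    x_hat = gossip_pga(grad, np.zeros(d), W, gamma=0.1, H=4, T=500, rng=rng)
    print("distance to minimizer:", np.linalg.norm(x_hat - C.mean(axis=0)))
```

Setting H very large recovers plain Gossip SGD, while replacing W by the identity recovers Local SGD, matching the reductions noted above.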
Remark 1. The quantity β ∈ (0, 1) indicates how well the topology is connected. A smaller β indicates a better-connected network, while a larger β implies a worse-connected topology.

Analysis highlights. To derive the influence of periodic global averaging, we have to exploit all useful structures of the algorithm to establish its superiority. These structures are:

- x_i^(k) = x̄^(k) when mod(k, H) = 0. This structure relieves the influence of the network topology.
- Gossip communications within each period also contribute to consensus among nodes. This structure is crucial to establish superiority over Local SGD.
- When the network is large and sparse, i.e., H < 1/(1 − β), the global averaging is more critical to drive consensus. This structure is crucial to establish superiority over Gossip SGD when H < 1/(1 − β).
- When the network is small or dense, i.e., H > 1/(1 − β), gossip communication is more critical to drive consensus. This structure is crucial to establish superiority over Gossip SGD when H > 1/(1 − β).

Ignoring any of the above structures in the analysis will result in incomplete conclusions when comparing Gossip-PGA, Gossip SGD and Local SGD.

3.2. Convergence analysis: convex scenario

Assumption 4 (Convexity). Each f_i(x) is convex.

Definition 1 (Data heterogeneity). When each f_i(x) is convex, we let $b^2 = \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x^\star)\|^2$ denote the data heterogeneity. When each local dataset follows the same distribution, it holds that f_i(x) = f(x) for all i and hence ∇f_i(x*) = ∇f(x*) = 0, which implies b^2 = 0.

With Assumption 4, we let x* be one of the global solutions to problem (1).

Table 4. Convergence rate comparison between Gossip SGD and Gossip-PGA for smooth convex/non-convex problems. We use the notation b^2 to indicate the data heterogeneity in both the convex and non-convex scenarios.

| | Gossip SGD (Koloskova et al., 2020) | Gossip-PGA |
|---|---|---|
| Rate (general form) | $O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{\beta^{2/3}\sigma^{2/3}}{(1-\beta)^{1/3}T^{2/3}} + \frac{\beta^{2/3}b^{2/3}}{(1-\beta)^{2/3}T^{2/3}} + \frac{\beta}{(1-\beta)T}\Big)$ | $O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{C_\beta^{1/3}\beta^{2/3}\sigma^{2/3}}{T^{2/3}} + \frac{C_\beta^{1/3}D_\beta^{1/3}\beta^{2/3}b^{2/3}}{T^{2/3}} + \frac{\beta D_\beta}{T}\Big)$ |
| Rate (when 1/(1−β) < H) | same as the general form (independent of H) | $O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{C_\beta^{1/3}\beta^{2/3}\sigma^{2/3}}{T^{2/3}} + \frac{C_\beta^{1/3}\beta^{2/3}b^{2/3}}{(1-\beta)^{1/3}T^{2/3}} + \frac{\beta}{(1-\beta)T}\Big)$ |
| Rate (when 1/(1−β) ≥ H) | same as the general form (independent of H) | $O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{C_\beta^{1/3}\beta^{2/3}\sigma^{2/3}}{T^{2/3}} + \frac{C_\beta^{1/3}H^{1/3}\beta^{2/3}b^{2/3}}{T^{2/3}} + \frac{\beta H}{T}\Big)$ |

Theorem 1. Under Assumptions 1-4, if γ is chosen as
$$\gamma = \min\Big\{ \frac{1}{12\beta L D_\beta},\ \Big(\frac{r_0}{r_1(T+1)}\Big)^{1/2},\ \Big(\frac{r_0}{r_2(T+1)}\Big)^{1/3} \Big\} \qquad (6)$$
with constants $r_0 = 2\,\mathbb{E}\|\bar{x}^{(0)} - x^\star\|^2$, $r_1 = 2\sigma^2/n$, and $r_2 = 6L\beta^2 C_\beta \sigma^2 + 18L\beta^2 C_\beta D_\beta b^2$, then it holds for any T that
$$\mathbb{E} f(\hat{x}^{(T)}) - f(x^\star) \;\lesssim\; \frac{\sigma}{\sqrt{nT}} + \frac{C_\beta^{1/3}\beta^{2/3}\big(\sigma^{2/3} + D_\beta^{1/3} b^{2/3}\big)}{T^{2/3}} + \frac{\beta D_\beta}{T} \qquad (7)$$
where $\bar{x}^{(k)} = \frac{1}{n}\sum_{i=1}^n x_i^{(k)}$, $\hat{x}^{(T)} = \frac{1}{T+1}\sum_{k=0}^{T}\bar{x}^{(k)}$, $C_\beta = \sum_{k=0}^{H-1}\beta^k$ and $D_\beta = \min\{H, 1/(1-\beta)\}$. (Proof in Appendix B.)

Remark 2. When β → 0, i.e., the network tends to be fully connected, Gossip-PGA converges at rate O(σ/√(nT)), which recovers the rate of parallel SGD.

Remark 3. When β → 1, i.e., information exchange via gossip communication is inefficient, it holds that C_β → H and D_β = min{H, 1/(1 − β)} = H. Substituting these into (7) recovers the rate of Local SGD; see Table 6.

Remark 4. When H → ∞, i.e., the networked agents tend not to conduct global synchronization, it holds that C_β → 1/(1 − β) and D_β = 1/(1 − β). Substituting these values into (7) recovers the rate of Gossip SGD; see Table 4.
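As a quick sanity check on Remarks 2-4 (our own illustration, not from the paper), the snippet below evaluates C_β and D_β in the limiting regimes for a fixed H and notes which rate the Gossip-PGA bound collapses to.

```python
# Sanity check of Remarks 2-4: behavior of C_beta and D_beta in limiting regimes.
def C(beta, H):   # C_beta = sum_{k=0}^{H-1} beta^k
    return (1 - beta**H) / (1 - beta) if beta < 1 else float(H)

def D(beta, H):   # D_beta = min{H, 1/(1-beta)}
    return min(H, 1 / (1 - beta)) if beta < 1 else float(H)

H = 16
for beta in (1e-3, 0.5, 0.999):
    print(f"beta={beta:6.3f}: C_beta={C(beta, H):8.2f} (vs H={H}), "
          f"D_beta={D(beta, H):8.2f} (vs 1/(1-beta)={1/(1-beta):8.1f})")

# beta -> 0: C_beta -> 1 and D_beta -> 1, so the extra overhead in (7)
#            vanishes and the parallel SGD rate is recovered (Remark 2).
# beta -> 1: C_beta -> H and D_beta -> H, matching the Local SGD rate (Remark 3).
# H -> inf (not shown): C_beta -> 1/(1-beta) and D_beta -> 1/(1-beta),
#            recovering the Gossip SGD rate (Remark 4).
```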
3.3. Convergence analysis: non-convex scenario

We first introduce an assumption about data heterogeneity specifically for non-convex problems:

Assumption 5 (Data heterogeneity). There exists a constant b̂ > 0 such that $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f(x)\|^2 \le \hat{b}^2$ for any $x \in \mathbb{R}^d$. If the local data follows the same distribution, it holds that b̂ = 0.

Theorem 2. Under Assumptions 1-3 and 5, if γ satisfies condition (6) (with b^2 replaced by b̂^2 and $r_0 = 4\,\mathbb{E} f(\bar{x}^{(0)})$), then it holds for any T > 0 that
$$\frac{1}{T+1}\sum_{k=0}^{T} \mathbb{E}\|\nabla f(\bar{x}^{(k)})\|^2 \;\lesssim\; \frac{\sigma}{\sqrt{nT}} + \frac{C_\beta^{1/3}\beta^{2/3}\big(\sigma^{2/3} + D_\beta^{1/3}\hat{b}^{2/3}\big)}{T^{2/3}} + \frac{\beta D_\beta}{T} \qquad (8)$$
where $\bar{x}^{(k)} = \frac{1}{n}\sum_{i=1}^n x_i^{(k)}$. (Proof in Appendix C.)

3.4. Comparison with Gossip SGD

To better illustrate how periodic global averaging relieves the effect of the network topology in Gossip SGD, we list the convergence rates of Gossip SGD and Gossip-PGA for smooth convex and non-convex problems in Table 4. The first row gives the general rate expressions for both algorithms. In the second row we substitute D_β = min{H, 1/(1 − β)} = 1/(1 − β) into the Gossip-PGA rate, and in the third row we substitute D_β = H. From this table we derive the transient stages of Gossip SGD and Gossip-PGA for each scenario (large/small network, iid/non-iid data distribution) in Table 2 (see the derivation in Appendix D). As explained in the Main Results subsection of the introduction, Tables 2 and 4 show that: (i) Gossip-PGA always converges faster (i.e., has a shorter transient stage) than Gossip SGD for any β and H; (ii) this superiority becomes evident on large and sparse networks where β → 1.

Remark 5. The convergence analysis of topology-changing Gossip SGD (Koloskova et al., 2020) covers Gossip-PGA. Letting p = 1 and τ = H in Theorem 2 of (Koloskova et al., 2020), one derives that Gossip-PGA has a transient stage on the order of Ω(n^3 H^4) in the non-convex, non-iid scenario. Such a transient stage cannot quantify the superiority over Gossip and Local SGD. In fact, it may even suggest that PGA harms Gossip SGD when H > 1/(1 − β), which is counter-intuitive. This is because (Koloskova et al., 2020) targets general time-varying topologies and does not utilize the structures listed in Sec. 3.1.

Transient stage in runtime. Table 2 compares the transient stages of Gossip-PGA and Gossip SGD in iterations, but what matters in practice is runtime. Since Gossip SGD and Gossip-PGA have the same computational overhead per iteration, we focus on the communication time spent in the transient stage. For a computing cluster of size n, let α denote the point-to-point latency in the network and θ the communication time needed to transmit a scalar. Since the variable x in problem (1) has dimension d, it takes θd time to transmit x between two nodes. Under this model, the All-Reduce global averaging step takes 2θd + nα = O(θd + nα) time (see Section 2.5 in (Ben-Nun & Hoefler, 2019)). The gossip-style communication time varies with the network topology. For the commonly used ring or grid topology, one gossip communication takes |N_i|θd + α = O(θd + α), where |N_i| is the neighborhood size of node i (|N_i| = 3 for the ring and 5 for the grid). For Gossip-PGA, if we amortize the periodic All-Reduce cost over each iteration, one iteration costs |N_i|θd + α + (2θd + nα)/H = O(θd + √n α) when we set H = √n. With the formula

total transient time = transient stage (in iterations) x communication time per iteration,

we calculate and compare the transient time of non-iid Gossip-PGA and Gossip SGD over the grid topology in Table 5. Comparisons for the iid scenario and for the ring topology can be found in Appendix D. In all cases Gossip-PGA has a shorter transient time.

Table 5. Transient time comparison between non-iid Gossip SGD and Gossip-PGA over the grid topology (1 − β = O(1/n)). We choose H = √n as the period in Gossip-PGA.

| | Gossip SGD | Gossip-PGA |
|---|---|---|
| Transient iterations | O(n^7) | O(n^5) |
| Single communication | O(θd + α) | O(θd + √n α) |
| Transient time | O(n^7 θd + n^7 α) | O(n^5 θd + n^5.5 α) |
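The α-θ cost model above can be tabulated in a few lines. The snippet below is our own illustration with hypothetical values for α, θ and d (placeholders, not measurements from the paper); it compares the amortized per-iteration communication time of the methods discussed.

```python
import math

# Alpha-theta communication model from Sec. 3.4 (illustrative placeholder numbers).
alpha = 5e-6      # point-to-point latency in seconds (hypothetical)
theta = 4e-10     # time to transmit one scalar in seconds (hypothetical)
d = 25_500_000    # model size, e.g. roughly the ResNet-50 parameter count
n = 256           # number of nodes
H = int(math.sqrt(n))                                   # period H = sqrt(n)
deg = 5                                                 # |N_i| for the grid

allreduce  = 2 * theta * d + n * alpha                  # one global average
gossip     = deg * theta * d + alpha                    # one gossip step
local_sgd  = allreduce / H                              # Local SGD, amortized
gossip_pga = gossip + allreduce / H                     # Gossip-PGA, amortized

for name, t in [("Gossip SGD", gossip), ("Local SGD", local_sgd),
                ("Gossip-PGA", gossip_pga), ("Parallel SGD (All-Reduce)", allreduce)]:
    print(f"{name:28s}: {t*1e3:8.3f} ms of communication per iteration")
```

Multiplying these per-iteration costs by the transient iteration counts of Table 2 or Table 3 reproduces the transient-time comparison in Table 5 up to constants.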
3.5. Comparison with Local SGD

The convergence rates of Gossip-PGA and Local SGD are listed in Table 6, from which we derive their transient stages in Table 3 (details in Appendix D). As explained in the introduction, Tables 3 and 6 show that (i) Gossip-PGA always converges faster (i.e., has a shorter transient stage) than Local SGD for any β and H, and (ii) this superiority becomes more evident on well-connected networks where β → 0.

Table 6. Convergence rate comparison between Local SGD (LSGD) and Gossip-PGA (G-PGA) for smooth convex/non-convex problems. The rate for Local SGD is from (Koloskova et al., 2020; Yu et al., 2019; Li et al., 2019b).

| Method | Rate |
|---|---|
| LSGD | $O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{H^{1/3}\sigma^{2/3}}{T^{2/3}} + \frac{H^{2/3}b^{2/3}}{T^{2/3}}\Big)$ |
| G-PGA | $O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{C_\beta^{1/3}\beta^{2/3}\sigma^{2/3}}{T^{2/3}} + \frac{C_\beta^{1/3}H^{1/3}\beta^{2/3}b^{2/3}}{T^{2/3}}\Big)$ |

As for the wall-clock transient time of Local SGD, if we amortize the periodic All-Reduce cost over each local update, it takes (2θd + nα)/H = O(θd/H + nα/H) communication time per iteration. Using the transient iterations derived in Table 3, the total transient time of Local SGD (non-iid scenario) is O(n^3 H^3 (θd + nα)). Comparing it with the total transient time O(n^3 H C_β^2 β^4 (Hθd + nα)) of Gossip-PGA, we find that Gossip-PGA always has a shorter transient runtime for a large H > β^4 C_β^2.

Remark 6. While we have shown that the transient time of Gossip-PGA is shorter than that of Gossip and Local SGD, it is worth noting that the communication time during the linear-speedup stage (i.e., after the transient stage) also contributes to the total training time. In this stage, Gossip-PGA is less efficient due to its periodic global averaging. Nevertheless, the extensive deep learning experiments in Sec. 5 show that Gossip-PGA still has a shorter total training time than Gossip and Local SGD.

4. Gossip SGD with Adaptive Global Averaging

Gossip-PGA carries the burden of tuning H by hand. A small H incurs more communication overhead, while a large H can slow down convergence. We therefore propose Gossip-AGA, an adaptive extension of Gossip-PGA.

Intuition. A small consensus variance $\sum_{i=1}^n \mathbb{E}\|x_i - \bar{x}\|^2$ accelerates Gossip-PGA. To see this, if $\sum_{i=1}^n \mathbb{E}\|x_i - \bar{x}\|^2 = 0$ at every iteration, then Gossip-PGA reduces to parallel SGD and reaches its fastest convergence. Recall from Lemma 8 in the appendix that the averaged consensus residual $\frac{1}{T+1}\sum_{k=0}^{T}\mathbb{E}\|x^{(k)} - \bar{x}^{(k)}\|^2$ is bounded by $\frac{d_1\gamma^2}{T+1}\sum_{k=0}^{T}\mathbb{E}\|\nabla f(\bar{x}^{(k)})\|^2 + d_2\gamma^2$, where d_1 and d_2 are constants. The initial consensus variance (when T is small) can therefore be significant due to the large γ and $\mathbb{E}\|\nabla f(\bar{x}^{(k)})\|^2$. In the later stage, when T is sufficiently large, both the diminishing step size γ and the gradient $\mathbb{E}\|\nabla f(\bar{x}^{(k)})\|^2$ go to 0, which naturally leads to a small consensus variance. These observations make it intuitive to conduct global synchronization more frequently in the initial stage to reduce the overall consensus variance.

Convergence. We denote by H^(ℓ) the duration of the ℓ-th period. The following corollary establishes convergence of Gossip-PGA for any time-varying but finite sequence of global averaging periods {H^(ℓ)}:
Corollary 1. Suppose Assumptions 1-3 and 5 hold and the time-varying period H^(ℓ) is upper bounded by H_max = max_{ℓ≥0} H^(ℓ). If γ satisfies the condition in Theorem 1 with H = H_max, then Gossip-AGA converges at rate (8) with H replaced by H_max. (Proof in Appendix E.)

Adaptive strategy. This subsection proposes an adaptive strategy inspired by (Wang & Joshi, 2019). If we recover the influence of the initial value $F_0 = \mathbb{E} f(\bar{x}^{(0)})$ on the convergence rate (8), Gossip-PGA for non-convex problems converges at
$$O\Big( \frac{\sqrt{F_0}\,\sigma}{\sqrt{nT}} + \frac{H^{1/3}\beta^{2/3}\sigma^{2/3}F_0^{2/3}}{T^{2/3}} + \frac{H^{2/3}\beta^{2/3}\hat{b}^{2/3}F_0^{2/3}}{T^{2/3}} + \frac{\beta D_\beta F_0}{T} \Big).$$
For a fixed T, a period $H = \sigma^{3/2}T^{1/4}/(\beta \hat{b}\, F_0^{1/4} n^{3/4})$ guarantees linear speedup. Therefore, the initial period H^(0) can be chosen as $H^{(0)} = d_1/[\mathbb{E} f(\bar{x}^{(0)})]^{1/4}$ for some constant d_1. Similarly, for the ℓ-th period, the workers can be viewed as restarting training at a new initial point $\bar{x}^{(T_{\ell-1})}$, where $T_{\ell-1} = H^{(0)} + \cdots + H^{(\ell-1)}$. As a result, the ℓ-th period can be chosen as $H^{(\ell)} = d_1/[\mathbb{E} f(\bar{x}^{(T_{\ell-1})})]^{1/4}$. With these choices of H^(0) and H^(ℓ), it is not difficult to obtain
$$H^{(\ell)} = \left( \frac{\mathbb{E} f(\bar{x}^{(0)})}{\mathbb{E} f(\bar{x}^{(T_{\ell-1})})} \right)^{1/4} H^{(0)}. \qquad (9)$$
Since $\mathbb{E} f(\bar{x}^{(k)})$ decreases as k increases, (9) generates an increasing sequence of periods H^(ℓ). We list Gossip-AGA as Algorithm 2 in Appendix G and elaborate on implementation details there.

5. Experimental Results

In this section, we first examine how the transient stage differs between Gossip-PGA, Gossip SGD and Local SGD on networks of different topologies and sizes, using convex logistic regression. Next, we systematically evaluate these methods on two typical large-scale deep learning tasks: image classification (over 256 GPUs) and language modeling (over 64 GPUs). See Appendix F for implementation details.

5.1. Logistic Regression

We consider a distributed logistic regression problem with $f_i(x) = \frac{1}{M}\sum_{m=1}^{M} \ln\big[1 + \exp(-y_{i,m} h_{i,m}^T x)\big]$, where $\{h_{i,m}, y_{i,m}\}_{m=1}^M$ are the local data samples at agent i, with $h_{i,m} \in \mathbb{R}^d$ the feature vector and $y_{i,m} \in \{+1, -1\}$ the corresponding label. Each h_{i,m} is generated from the normal distribution N(0, 10 I_d). To generate y_{i,m}, we first generate an auxiliary random vector $x_i^\star \in \mathbb{R}^d$ with each entry following N(0, 1). Next, we draw u_{i,m} from the uniform distribution U(0, 1). If $u_{i,m} \le 1/[1 + \exp(-h_{i,m}^T x_i^\star)]$, then y_{i,m} is set to +1; otherwise y_{i,m} is set to −1. We let $x_i^\star = x^\star$ for all i to generate data in the iid scenario, and $x_i^\star \ne x_j^\star$ for all i ≠ j in the non-iid scenario. Each $x_i^\star$ is normalized.
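The data-generation protocol above can be reproduced in a few lines. The sketch below reflects our reading of that description (the variable u_{i,m} and all names are ours, and N(0, 10 I_d) is interpreted as variance 10); it is not the authors' code.

```python
import numpy as np

def make_local_data(n, M, d, iid, seed=0):
    """Synthetic logistic-regression data as described in Sec. 5.1 (our reading)."""
    rng = np.random.default_rng(seed)
    x_star_shared = rng.normal(size=d)
    feats, labels = [], []
    for i in range(n):
        x_star = x_star_shared if iid else rng.normal(size=d)
        x_star = x_star / np.linalg.norm(x_star)          # "each x*_i is normalized"
        h = rng.normal(scale=np.sqrt(10.0), size=(M, d))  # h_{i,m} ~ N(0, 10 I_d)
        u = rng.uniform(size=M)                           # u_{i,m} ~ U(0, 1)
        p = 1.0 / (1.0 + np.exp(-h @ x_star))             # sigmoid(h_{i,m}^T x*_i)
        y = np.where(u <= p, 1.0, -1.0)                   # threshold to {+1, -1}
        feats.append(h)
        labels.append(y)
    return feats, labels

feats, labels = make_local_data(n=20, M=8000, d=10, iid=False)
print(len(feats), feats[0].shape, labels[0][:5])
```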
Figure 1 compares how Gossip-PGA performs against parallel SGD and Gossip SGD over the ring topology with non-iid data distribution. The network sizes are set to n = 20, 50, 100, which results in β = 0.967, 0.995, 0.998. We set d = 10 and M = 8000. H is set to 16 in Gossip-PGA. The step size γ is initialized as 0.2 and is halved every 1000 iterations. We repeat all simulations 50 times and plot the mean of all trials as a solid curve and the standard deviation as a shaded area.

It is observed that both Gossip SGD and Gossip-PGA asymptotically converge at the same rate as parallel SGD (i.e., they reach the linear-speedup stage), albeit with different transient stages. Gossip-PGA always has shorter transient stages than Gossip SGD, and this superiority becomes more evident as the network size increases (recall that 1 − β = O(1/n^2) for the ring). For experiments on other topologies such as the grid and exponential graphs, on iid data distributions, and for the comparison with Local SGD, see Appendix F. All experiments are consistent with the theoretical transient-stage comparisons in Tables 2 and 3.

Figure 1. Convergence comparison between Gossip-PGA, Gossip SGD and parallel SGD on the logistic regression problem over the ring topology. The transient stage is determined by counting the iterations before an algorithm exactly matches the convergence curve of Parallel SGD. Note that the transient stage of Gossip SGD in the middle and right sub-figures is beyond the plotting canvas.

5.2. Image Classification

The ImageNet-1k (Deng et al., 2009) dataset consists of 1,281,167 training images and 50,000 validation images in 1000 classes. (The usage of the ImageNet dataset in this paper is for non-commercial research purposes only.) We train a ResNet-50 (He et al., 2016) model (roughly 25.5M parameters) following the training protocol of (Goyal et al., 2017). We train for 120 epochs in total. The learning rate is warmed up in the first 5 epochs and is decayed by a factor of 10 at epochs 30, 60 and 90. We set the period to 6 for both Local SGD and Gossip-PGA. In Gossip-AGA, the period is set to 4 initially and is changed adaptively afterwards; roughly 9% of the iterations conduct global averaging.

Table 7. Comparison of top-1 validation accuracy (Column 2) and wall-clock training time (Column 3) of different methods after finishing all epochs. We also report the epochs and training time required to reach 76% accuracy (Column 4). N.A. means that the target accuracy is not reached when all epochs are completed.

| Method | Acc. (%) | Hrs. | Epochs / Hrs. to 76% |
|---|---|---|---|
| Parallel SGD | 76.26 | 2.22 | 94 / 1.74 |
| Local SGD | 74.20 | 1.05 | N.A. |
| Local SGD (3x) | 75.41 | 3 | N.A. |
| Gossip SGD | 75.34 | 1.55 | N.A. |
| Gossip SGD (2x) | 76.18 | 3 | 198 / 2.55 |
| OSGP | 75.04 | 1.32 | N.A. |
| OSGP (2x) | 76.07 | 2.59 | 212 / 2.28 |
| Gossip-PGA | 76.28 | 1.66 | 109 / 1.50 |
| Gossip-AGA | 76.25 | 1.57 | 91 / 1.20 |

Table 7 shows the top-1 validation accuracy and wall-clock training time of the aforementioned methods. Both Gossip-PGA and Gossip-AGA reach accuracy comparable to parallel SGD after all 120 epochs, but with roughly 1.3x-1.4x training-time speedup. On the other hand, while Local SGD and Gossip SGD complete all 120 epochs faster than Gossip-PGA/AGA and parallel SGD, they suffer from 2.06% and 0.92% accuracy degradation, respectively. Moreover, neither algorithm reaches 76% top-1 accuracy within 120 epochs. We also compare with OSGP (Assran et al., 2019), which adds overlapped communication to Gossip SGD. We find that OSGP (2x), while faster than Gossip SGD (2x), still needs more time than Gossip-PGA to achieve 76% accuracy. To further illustrate how much time it takes Local SGD and Gossip SGD to reach the target accuracy, we run additional Local SGD and Gossip SGD experiments with extended epochs: Gossip SGD (2x) trains for 240 epochs in total with the learning rate decayed at epochs 60, 120 and 180, and Local SGD (3x) trains for 360 epochs in total with the learning rate decayed at epochs 90, 180 and 270. It is observed that Gossip SGD (2x) reaches the target with notably more time expense than Gossip-PGA/AGA and parallel SGD, and Local SGD (3x) still cannot reach 76% accuracy.

Figure 2. Convergence results on ImageNet in terms of iterations and runtime (training loss vs. iterations on the left, training loss vs. training time on the right). More results are in Appendix F.3.
All these observations validate that periodic global averaging can accelerate Gossip SGD significantly.

Figure 2 shows the iteration-wise and runtime-wise convergence in terms of training loss. In the left plot, Gossip-PGA/AGA converges faster (in iterations) and more accurately than Local SGD and Gossip SGD, which is consistent with our theory. In the right plot, Gossip-PGA/AGA is the fastest method (in time) that reaches the same training loss as parallel SGD.

Comparison with SlowMo. Gossip-PGA is an instance of SlowMo in which the base optimizer is Gossip SGD, the slow momentum coefficient is 0, and the slow learning rate is 1. We conduct experiments to compare Gossip-PGA with SlowMo. The additional slow momentum update helps SlowMo when H is large but degrades it when H is small. This observation is consistent with Fig. 3(a) in (Wang et al., 2019) and implies that the slow momentum update may not always be beneficial in SlowMo.

Table 8. Comparison of top-1 validation accuracy with SlowMo under different periods.

| Period | Gossip-PGA | SlowMo |
|---|---|---|
| H = 6 | 76.28 | 75.23 |
| H = 48 | 75.66 | 75.81 |

Ring topology. While the convergence properties of Gossip-PGA are established over a static network topology, we utilize the dynamic one-peer exponential topology in the above deep learning experiments because it usually achieves better accuracy. To illustrate the derived theoretical results, we conduct an additional experiment over the static ring topology to compare Gossip-PGA with Gossip SGD in Table 9. Gossip-PGA achieves better accuracy than Gossip SGD after the same number of epochs, which coincides with our analysis that Gossip-PGA converges faster.

Table 9. Comparison of top-1 validation accuracy of Gossip-PGA and Gossip SGD over the ring topology.

| Method | Epochs | Acc. (%) | Time (hrs.) |
|---|---|---|---|
| Gossip SGD | 120 | 74.86 | 1.56 |
| Gossip-PGA | 120 | 75.94 | 1.68 |

Scalability. We establish in Theorem 2 that Gossip-PGA achieves linear speedup in the non-convex setting. To validate this, we conduct a scaling experiment and list the results in Table 10. The entries represent the final accuracy and the hours to finish training. Gossip-PGA achieves a roughly linear speedup in training time without notable performance degradation.

Table 10. Scaling effects of different methods with different numbers of nodes. Entries are final accuracy / hours to complete training.

| Method | 4 nodes | 8 nodes | 16 nodes | 32 nodes |
|---|---|---|---|---|
| Parallel SGD | 76.3 / 11.6 | 76.4 / 6.3 | 76.3 / 3.7 | 76.2 / 2.2 |
| Gossip SGD | 76.3 / 11.1 | 76.4 / 5.7 | 75.9 / 2.8 | 75.0 / 1.5 |
| Gossip-PGA | 76.4 / 11.2 | 76.7 / 5.9 | 76.3 / 3.0 | 76.2 / 1.6 |

Figure 3. Convergence results of BERT on the language modeling task in terms of iterations and runtime (training loss vs. iterations on the left, training loss vs. training time on the right).

5.3. Language Modeling

BERT (Devlin et al., 2018) is a widely used pre-trained language representation model for NLP tasks. We train a BERT-Large model (roughly 330M parameters) on the Wikipedia and BookCorpus datasets.

Table 11. Comparison of training loss and training time of BERT training with different algorithms after completing all training steps.

| Method | Final loss | Runtime (hrs.) |
|---|---|---|
| Parallel SGD | 1.75 | 59.02 |
| Local SGD | 2.85 | 20.93 |
| Local SGD (3x) | 1.88 | 60 |
| Gossip SGD | 2.17 | 29.7 |
| Gossip SGD (2x) | 1.81 | 59.7 |
| Gossip-PGA | 1.82 | 35.4 |
| Gossip-AGA | 1.77 | 30.4 |
We set the period to 6 for both Local SGD and Gossip-PGA. In Gossip-AGA, the period is set to 4 initially and is changed adaptively afterwards; roughly 9.6% of the iterations conduct global averaging.

Table 11 shows the final training loss and training runtime of the aforementioned methods. Gossip-AGA reaches a training loss comparable to parallel SGD, but with roughly 1.94x training-time speedup. Gossip SGD and Local SGD cannot reach a training loss below 1.8 even when trained for over 60 hours (see Local SGD (3x) and Gossip SGD (2x)). Figure 3 shows the iteration-wise and runtime-wise convergence with respect to training loss. The left plot shows that Gossip-PGA/AGA has almost the same iteration-wise convergence as Gossip SGD; the right plot shows that Gossip-AGA is the fastest method in training time to reach the same accuracy as parallel SGD.

6. Conclusion

We introduce Gossip-PGA/AGA to mitigate the slow convergence of Gossip SGD in distributed training. Theoretically, we prove the convergence improvement for smooth convex and non-convex problems. Empirically, experimental results of large-scale training validate our theory.

References

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709-1720, 2017.

Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning (ICML), pp. 344-353, 2019.

Bayoumi, A. K. R., Mishchenko, K., and Richtarik, P. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 4519-4529, 2020.

Ben-Nun, T. and Hoefler, T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):1-43, 2019.

Berahas, A. S., Bollapragada, R., Keskar, N. S., and Wei, E. Balancing communication and computation in distributed optimization. IEEE Transactions on Automatic Control, 64(8):3141-3155, 2018.

Bernstein, J., Zhao, J., Azizzadenesheli, K., and Anandkumar, A. signSGD with majority vote is communication efficient and fault tolerant. arXiv preprint arXiv:1810.05291, 2018.

Chen, J. and Sayed, A. H. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289-4305, 2012.

Chen, T., Giannakis, G., Sun, T., and Yin, W. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, pp. 5050-5060, 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255. IEEE, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Di Lorenzo, P. and Scutari, G. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120-136, 2016.

Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2011.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.

He, L., Bian, A., and Jaggi, M. CoLa: Decentralized linear learning. In Advances in Neural Information Processing Systems, pp. 4536-4546, 2018.

Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2019a.

Koloskova, A., Stich, S., and Jaggi, M. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pp. 3478-3487, 2019b.

Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. U. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning (ICML), pp. 1-12, 2020.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583-598, 2014.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-iid data. In International Conference on Learning Representations, 2019a.

Li, X., Yang, W., Wang, S., and Zhang, Z. Communication efficient decentralized training with multiple local updates. arXiv preprint arXiv:1910.09126, 2019b.

Li, Z., Shi, W., and Yan, M. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, July 2019c. Early access. Also available on arXiv:1704.07807.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330-5340, 2017.

Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pp. 3043-3052, 2018.

Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.

Liu, Y., Xu, W., Wu, G., Tian, Z., and Ling, Q. Communication-censored ADMM for decentralized consensus optimization. IEEE Transactions on Signal Processing, 67(10):2565-2579, 2019.

Lu, S., Zhang, X., Sun, H., and Hong, M. GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization. In 2019 IEEE Data Science Workshop (DSW), pp. 315-321. IEEE, 2019.

Luo, Q., He, J., Zhuo, Y., and Qian, X. Prague: High-performance heterogeneity-aware asynchronous decentralized training. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 401-416, 2020.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.

Nedic, A., Olshevsky, A., and Shi, W. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597-2633, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8024-8035, 2019.

Patarasuk, P. and Yuan, X. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2):117-124, 2009.

Qu, G. and Li, N. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245-1260, 2018.

Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulié, L. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pp. 3027-3036, 2017.

Scaman, K., Bach, F., Bubeck, S., Massoulié, L., and Lee, Y. T. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pp. 2740-2749, 2018.

Shi, W., Ling, Q., Yuan, K., Wu, G., and Yin, W. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750-1761, 2014.

Shi, W., Ling, Q., Wu, G., and Yin, W. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944-966, 2015.

Stich, S. U. Local SGD converges fast and communicates little. In International Conference on Learning Representations (ICLR), 2019.

Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. D2: Decentralized training over decentralized data. In International Conference on Machine Learning, pp. 4848-4856, 2018.

Tang, H., Yu, C., Lian, X., Zhang, T., and Liu, J. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, pp. 6155-6165. PMLR, 2019.

Tsitsiklis, J., Bertsekas, D., and Athans, M. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803-812, 1986.

Uribe, C. A., Lee, S., Gasnikov, A., and Nedić, A. A dual approach for optimal algorithms in distributed optimization over networks. Optimization Methods and Software, pp. 1-40, 2020.

Wang, J. and Joshi, G. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In Systems and Machine Learning (SysML) Conference, 2019.

Wang, J., Tantia, V., Ballas, N., and Rabbat, M. SlowMo: Improving communication-efficient distributed SGD with slow momentum. arXiv preprint arXiv:1910.00643, 2019.

Xin, R., Khan, U. A., and Kar, S. An improved convergence analysis for decentralized online stochastic non-convex optimization. arXiv preprint arXiv:2008.04195, 2020.

Xu, J., Zhu, S., Soh, Y. C., and Xie, L. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In IEEE Conference on Decision and Control (CDC), pp. 2055-2060, Osaka, Japan, 2015.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2019.

Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693-5700, 2019.

Yuan, K., Ling, Q., and Yin, W. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835-1854, 2016.

Yuan, K., Ying, B., Zhao, X., and Sayed, A. H. Exact diffusion for distributed optimization and learning, Part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708-723, 2019.

Yuan, K., Alghunaim, S. A., Ying, B., and Sayed, A. H. On the influence of bias-correction on distributed stochastic optimization. IEEE Transactions on Signal Processing, 2020.

Zhang, J., De Sa, C., Mitliagkas, I., and Ré, C. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595-2603, 2010.