# Enhancing Parallelism in Decentralized Stochastic Convex Optimization

Ofri Eisen* 1, Ron Dorfman* 1, Kfir Y. Levy 1

Abstract

Decentralized learning has emerged as a powerful approach for handling large datasets across multiple machines in a communication-efficient manner. However, such methods often face scalability limitations, as increasing the number of machines beyond a certain point negatively impacts convergence rates. In this work, we propose Decentralized Anytime SGD, a novel decentralized learning algorithm that significantly extends the critical parallelism threshold, enabling the effective use of more machines without compromising performance. Within the stochastic convex optimization (SCO) framework, we establish a theoretical upper bound on parallelism that surpasses the current state-of-the-art, allowing larger networks to achieve favorable statistical guarantees and closing the gap with centralized learning in highly connected topologies.

*Equal contribution. 1Department of Electrical and Computer Engineering, Technion, Haifa, Israel. Correspondence to: Ofri Eisen. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Distributed learning has become a key paradigm for large-scale machine learning (ML), where multiple machines train an ML model by utilizing their collective data and computational resources (Verbraeken et al., 2020). This approach enables efficient ML scaling by accelerating training through parallel computation. Distributed ML systems are especially useful when data is inherently distributed across multiple users, as in federated learning (McMahan et al., 2017; Kairouz et al., 2021), where preserving local data privacy is a priority.

Distributed learning systems are broadly categorized into two primary architectures: centralized and decentralized. Centralized systems rely on a central parameter server (PS) to coordinate training by aggregating local computations and distributing global model updates (Li et al., 2014). While this design simplifies training management, it faces scalability challenges due to communication bottlenecks and is inherently vulnerable to a single point of failure, since the server is essential to system operation. In contrast, decentralized systems rely on direct interactions between neighboring machines, eliminating the need for a central PS (Kempe et al., 2003; Boyd et al., 2006; Nedic & Ozdaglar, 2009; Lian et al., 2017). While this architecture introduces challenges in coordination and maintaining global model consistency, it offers significant advantages. By reducing the risk of a single point of failure and enhancing communication efficiency, decentralized systems provide greater flexibility and resilience (Assran et al., 2019; Kong et al., 2021; Yuan et al., 2021; Li et al., 2022).

A central trade-off in distributed learning, both centralized and decentralized, lies in balancing parallelism with statistical efficiency. While increasing parallelism by incorporating more machines can accelerate training, it may degrade learning efficiency beyond a certain point. This issue is particularly acute in decentralized systems, where sparse communication topologies amplify performance degradation, limiting the effective utilization of additional machines.
Existing decentralized methods reveal a persistent gap between centralized and decentralized learning in their ability to scale with the number of machines (Koloskova et al., 2020; 2021; He et al., 2022). Surprisingly, this gap is observed even in highly connected network topologies, where information between nodes propagates faster, effectively mimicking the behavior of centralized systems. This raises the question of whether decentralized methods are fundamentally constrained in fully exploiting increased parallelism, even under ideal communication conditions.

In this work, we propose Decentralized Anytime SGD (DAT-SGD), a novel and simple algorithm designed to enhance parallelism in decentralized systems. It relies on the Anytime SGD framework (Cutkosky, 2019), where stochastic gradient descent (SGD) updates are performed using gradients evaluated at averaged iterates. Each machine performs these updates locally and exchanges information with its neighbors. By relying on averaged query points that evolve more slowly than the iterates themselves, our approach effectively mitigates the bias caused by local model inconsistencies, also known as consensus distance.

Table 1. Parallelism Bounds. Comparison of parallelism bounds for decentralized SCO across different network topologies for Decentralized SGD (D-SGD, Koloskova et al., 2020) and our method, expressed in terms of the total number of samples N = MT. For reference, the centralized case achieves a bound of O(√N). NC stands for near-complete graph. Higher parallelism bounds are better, allowing us to accelerate the learning process without degrading performance.

| TOPOLOGY | 1/ρ | D-SGD | DAT-SGD |
|---|---|---|---|
| RING | O(M²) | O(N^{1/8}) | O(N^{1/6}) |
| TORUS | O(M) | O(N^{1/6}) | O(N^{1/4}) |
| NC | 1 | O(ρ^{1/2} N^{1/4}) | O(ρ √N) |

We perform our analysis within the stochastic convex optimization (SCO) framework, a powerful framework which captures several classical problems like SVMs and linear regression, as well as serving as a testbed for designing and analyzing ML algorithms. For convex and smooth functions, our method achieves an error of ε over a decentralized network of M machines in T = O(1/(Mε²) + 1/(ρε)) iterations, where ρ ∈ (0, 1] is the spectral gap of the network, a parameter that captures the connectivity of the network topology (see Definition 2.5 for a formal definition). This result implies that (1) for complete (or near-complete) topologies, where ρ = Ω(1), our algorithm recovers the convergence rate of centralized methods (Dekel et al., 2012), thus filling the gap implied by prior methods; and (2) for general topologies, it improves upon the previously best-known rate of O(1/(Mε²) + 1/(ρ^{1/2}ε^{3/2})) (Koloskova et al., 2020), thereby enabling larger networks while maintaining linear speed-up; see Table 1 for a comparison of parallelism bounds for common network topologies. In Section 2.1, we elaborate on the computation of these parallelism bounds.

1.1. Related Work

Centralized learning systems are widely adopted in both industry and academia, thanks to their simplicity and the effectiveness of Minibatch SGD (Dekel et al., 2012), which is the most widespread approach to training in such systems. While alternative centralized methods like local-SGD have been explored in recent years (McMahan et al., 2017; Kairouz et al., 2021), Minibatch SGD still serves as a cornerstone in large-scale training due to its ease of implementation and strong empirical performance.
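For reference, the snippet below gives a minimal sketch of centralized Minibatch SGD in the spirit of Dekel et al. (2012): all machines evaluate stochastic gradients at the same query point, and a parameter server averages them before taking a step, which divides the gradient variance by M. The quadratic objective, noise model, and hyperparameter values are illustrative assumptions rather than details taken from that work.

```python
import numpy as np

def minibatch_sgd(grad_fn, x0, eta, num_machines, num_rounds, noise_std, rng):
    """Centralized Minibatch (Parallel) SGD: in every round, all machines
    evaluate a stochastic gradient at the SAME query point and a parameter
    server averages them before taking the step."""
    x = x0.copy()
    for _ in range(num_rounds):
        grads = [grad_fn(x) + noise_std * rng.standard_normal(x.shape)
                 for _ in range(num_machines)]       # one noisy gradient per machine
        x -= eta * np.mean(grads, axis=0)            # PS step with the averaged gradient
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 10
    A = rng.standard_normal((d, d)) / np.sqrt(d)
    x_star = rng.standard_normal(d)
    grad_fn = lambda x: A.T @ (A @ (x - x_star))     # gradient of 0.5*||A(x - x*)||^2
    x_hat = minibatch_sgd(grad_fn, np.zeros(d), eta=0.1,
                          num_machines=16, num_rounds=2000, noise_std=1.0, rng=rng)
    print("distance to optimum:", np.linalg.norm(x_hat - x_star))
```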
Despite the enduring popularity of centralized training, decentralized systems present an appealing alternative by eliminating the need for a central parameter server and enabling direct communication among neighboring machines (Tsitsiklis, 1984; Kempe et al., 2003; Boyd et al., 2006; Nedic & Ozdaglar, 2009; Lian et al., 2017). This design not only mitigates the risk of a single point of failure but also offers greater scalability and flexibility (Assran et al., 2019; Kong et al., 2021; Yuan et al., 2021). A common mechanism underpinning these benefits is the gossip averaging protocol (Xiao & Boyd, 2004; Boyd et al., 2006), a low-overhead communication method that efficiently disseminates model updates across the network. In the context of stochastic optimization, decentralized training approaches have garnered considerable interest in recent years. The stochastic convex case was explored by Lian et al. (2017), who analyzed the Decentralized-SGD (D-SGD) method. Later, Koloskova et al. (2020) extended this approach within a unified framework that accounts for changing topologies and randomized gossip communication. While these aforementioned methods degrade in the face of data heterogeneity, Koloskova et al. (2021) completely eliminated this issue by employing an approach called gradienttracking (Di Lorenzo & Scutari, 2016; Nedic et al., 2017; Pu & Nedi c, 2021). See Table 2 for a comparison of convergence rates. Curiously, all existing methods present convergence bounds that imply a clear gap between the centralized and decentralized cases, even for highly connected topologies. Concretely, in centralized systems one may use up to O( N) machines to accelerate the learning process without degrading generalization (Dekel et al., 2012), where N is the total number of samples used in the process. Conversely, in decentralized systems the best known parallelization limit is M O((ρ N)1/2), which is substantially smaller. See Table 1 for bounds regarding different topologies. Decentralized training has also been extensively studied in the context of stochastic non-convex optimization. This was done in (Lian et al., 2017; Koloskova et al., 2020; 2021), which have derived analogous bounds to the ones they achieved in the convex setting. Additionally, Kong et al. (2021) empirically studied the interplay between the local model parameters and gossip communication. More recently, He et al. (2022) improved convergence rates for non-convex problems by replacing local SGD updates with local momentum updates, establishing a parallelization limit of M O((ρ N)2/3). However, this remains below the centralized bound of M O( N), which applies in the non-convex setting. To further improve the parallelism bound, we build on a recent technique that involves gradually shifting query points in SCO (Cutkosky, 2019). This approach has been leveraged in recent works to improve asynchronous (Aviv et al., 2021) and local (Dahan & Levy, 2024) training methods, as well as to design universal accelerated algorithms (Kavis et al., 2019; Ene et al., 2021; Antonakopoulos et al., 2022). Enhancing Parallelism in Decentralized SCO Table 2. Convergence to ϵ-accuracy. Comparison of convergence rates with prior work in convex decentralized learning. Here, σ denotes noise variance, ζ represents data heterogeneity, and ρ is the spectral gap. We note that the leading term σ2/Mϵ2 is statistically optimal (Nemirovski & Yudin, 1983). REFERENCE CONVERGENCE RATE KOLOSKOVA ET AL. (2020) O σ2 Mϵ2 + σ ρ+ζ KOLOSKOVA ET AL. 
(2021) O σ2 Mϵ2 + σ ρϵ3/2 + 1 THIS WORK O σ2 Mϵ2 + 2. Problem Setup and Background In this section, we formally define our problem setup and provide an overview of relevant background. We study decentralized SCO problems, where the objective is to minimize a convex loss function f : Rd R defined as follows: i=1 {fi(x) := Ez Di[fi(x, z)]} , (1) where fi : Rd R represents the loss function on machine i {1, . . . , M}, Di is the local data distribution, and fi( , z) is the instance-dependent loss for a given sample z on machine i. We focus on first-order optimization methods, where in round t, each machine i samples zi t Di, computes a stochastic gradient estimate fi(xi t, zi t), and uses it to update its local iterate xi t. The method ultimately produces an output xoutput, and its performance is measured by the expected excess loss, defined as: E[Excess-Loss] := E[f(xoutput)] f(x ) , where x arg minx Rd f(x) is a minimizer of f, and the expectation is taken over the randomness of the samples. Throughout, we make the following standard assumptions. Assumption 2.1 (Smoothness). Each function fi is Lsmooth, i.e., for any x, y Rd we have: fi(y) fi(x) + fi(x)T (y x) + L Assumption 2.2 (Bounded noise variance). There exists a constant σ2 such that for all x Rd and i [M]: Ez Di fi(x, z) fi(x) 2 σ2 . Assumption 2.3 (Bounded heterogeneity). There exists a constant ζ2 such that for all x Rd: i=1 fi(x) f(x) 2 ζ2 . We note that our bounded heterogeneity assumption has been widely adopted in prior works (Lian et al., 2017; Tang et al., 2018a; Li et al., 2019; He et al., 2022). Notably, some studies have explored milder assumptions, requiring bounded heterogeneity only at the optimum (Koloskova et al., 2020) or at the origin (Lu & De Sa, 2021), or even removing this assumption entirely by leveraging techniques such as gradient tracking (Nedic et al., 2017; Koloskova et al., 2021). In this work, we adopt the standard bounded heterogeneity assumption for simplicity, leaving potential relaxations to future work. Decentralized Training. We consider systems (networks) where nodes are connected through a communication graph, and each node can efficiently communicate with its neighbors. In such decentralized systems, nodes represent machines, and edges correspond to communication links. Gossip averaging is a robust communication protocol that efficiently and reliably propagates information across the decentralized network without relying on a central coordinator (Xiao & Boyd, 2004; Boyd et al., 2006; Nedic & Ozdaglar, 2009; Shi et al., 2014; Koloskova et al., 2020). Concretely, suppose each machine i in the network holds a vector xi 0 Rd, and our goal is to efficiently and robustly communicate the consensus x := 1 M P i xi 0 to all of the machines in the network. Gossip averaging is a sequential process towards doing so, where each machine computes a sequence {xi t}t such that xi t eventually converges to x for all i. At every gossip step, each node i updates its vector as a weighted average of its neighbors: j=1 Pijxj t , where Pij > 0 if nodes i and j are connected, and P [0, 1]M M is a given gossip matrix which satisfies the following properties. Definition 2.4 (Gossip matrix). A gossip matrix P [0, 1]M M is a symmetric, doubly stochastic matrix. That is, P satisfies P = P T as well as P1 = 1 and 1T P = 1T . These properties imply that (i) for any node i then {Pij}j [M] is a weight vector (i.e., has positive entries which sum to 1), and that (ii) the consensus is preserved i.e. 
that t, x = 1 M PM i=1 xi 0 = 1 M PM i=1 xi t. As is standard in the literature we shall assume that the matrix P is given and satisfies the above properties. Note that gossip averaging can be written in matrix form as: Xt+1 = Xt P , where Xt := x1 t x2 t x M t Rd M. Enhancing Parallelism in Decentralized SCO A key characteristic of the gossip matrix is the spectral gap, reflecting the connectivity degree of the network topology. Definition 2.5 (Spectral gap). For a gossip matrix P with eigenvalues 1 = λ1 > |λ2| |λM|, the spectral gap is defined as: ρ := 1 |λ2| (0, 1]. An important property of gossip averaging is its contractive effect on the distance from the average of local vectors. Property 2.6. Let X Rd M and define X := X 1 M 11 . Let P RM M be a gossip matrix with spectral gap ρ. Then, XP X 2 F (1 ρ) X X 2 F . (2) The above, along with the consensus-preserving property, ensures that iterative gossip averaging converges to the consensus (i.e., the average of the local vectors) exponentially fast. Notably, in complete graph topologies where ρ = 1, consensus is achieved in a single gossip step, effectively mimicking centralized behavior. 2.1. Parallelism Bounds It has been well established in the literature that the convergence rate of SGD for convex functions is given by O(σ/ T + 1/T), where T is the number of iterations (Nemirovski et al., 2009; Agarwal et al., 2012). Assuming each iteration processes a single sample, we have N = T total samples. In distributed systems, computation is distributed across M machines, therefore accelerating the training process. With M machines, we process N samples in just T = N/M iterations. However, as we mentioned, this parallelization introduces a trade-off, as increasing M can eventually reduce efficiency, as we show next. For centralized systems, Dekel et al. (2012) derived the following convergence rate for Mini-batch (Parallel) SGD: E[Excess-Loss] = O σ Thus, as long as M O( N) (ignoring σ), we can increase M without degrading performance. For decentralized systems, Koloskova et al. (2020) analyzed the Decentralized SGD (D-SGD) algorithm, obtaining the error rate: MT + σ2/3ρ1/3 + ζ2/3 ρ2/3T 2/3 + 1 σ2/3ρ1/3 + ζ2/3 M 2/3 ρ2/3N 2/3 + M Interestingly, the parallelism bound differs based on data heterogeneity. In homogeneous setups (ζ = 0), we can scale up to M O(ρ1/2N 1/4), whereas in heterogeneous settings (ζ > 0), the limit tightens to M O(ρN 1/4), which can be significantly lower for sparse topologies. It is worth mentioning that Koloskova et al. (2021) enhanced this bound, by incorporating a gradient tracking mechanism, eliminating the dependence on ζ altogether. Note that even for dense graph topologies, where ρ 1, the rate in (3) does not match the centralized case. As we establish in Section 4, our approach bridges this gap, achieving an error rate of O(σ/ MT + ( σ + ζ)/ρT + 1/T). This improvement increases the parallelism bound to M O(ρ N). Table 2 provides a comparison of convergence rates with prior work. 3. The Pitfall of Decentralized SGD Next, we discuss the D-SGD algorithm and outline its key limitation. Given the structure of decentralized topologies, the D-SGD algorithm naturally extends SGD. As described in Algorithm 1, each machine alternates between updating its local model weights using the standard SGD rule and exchanging information with its neighbors via gossip averaging. 
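To make the gossip machinery and the D-SGD update concrete, here is a minimal NumPy sketch that builds a symmetric, doubly stochastic gossip matrix for a ring, computes its spectral gap (Definition 2.5), and runs D-SGD (Algorithm 1) on toy local quadratics. The specific mixing weights, objective, noise level, and step size are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

def ring_gossip_matrix(num_machines):
    """Symmetric, doubly stochastic gossip matrix for a ring:
    each node averages itself with its two neighbors (weight 1/3 each)."""
    P = np.zeros((num_machines, num_machines))
    for i in range(num_machines):
        P[i, i] = 1 / 3
        P[i, (i - 1) % num_machines] = 1 / 3
        P[i, (i + 1) % num_machines] = 1 / 3
    return P

def spectral_gap(P):
    """rho = 1 - |lambda_2|, where eigenvalues are sorted by magnitude."""
    eigvals = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
    return 1.0 - eigvals[1]

def dsgd(local_grads, P, eta, num_rounds, noise_std, rng, dim):
    """Decentralized SGD (Algorithm 1): each machine takes a local SGD step
    with its own stochastic gradient, then gossip-averages with its neighbors."""
    M = P.shape[0]
    W = np.zeros((M, dim))                           # row i holds machine i's iterate w_t^i
    for _ in range(num_rounds):
        G = np.stack([g(W[i]) + noise_std * rng.standard_normal(dim)
                      for i, g in enumerate(local_grads)])
        W = P @ (W - eta * G)                        # local step followed by one gossip step
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, d = 16, 5
    P = ring_gossip_matrix(M)
    print("ring spectral gap:", spectral_gap(P))     # shrinks as Theta(1/M^2), i.e. 1/rho = O(M^2) as in Table 1
    # Heterogeneous quadratics f_i(w) = 0.5*||w - b_i||^2, whose average is minimized at mean(b_i).
    targets = rng.standard_normal((M, d))
    local_grads = [lambda w, b=b: w - b for b in targets]
    W = dsgd(local_grads, P, eta=0.05, num_rounds=3000, noise_std=0.5, rng=rng, dim=d)
    print("consensus distance:", np.mean(np.linalg.norm(W - W.mean(0), axis=1) ** 2))
    print("distance to optimum:", np.linalg.norm(W.mean(0) - targets.mean(0)))
```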
However, unlike in centralized Minibatch SGD, where all machines compute gradients at the same query points, decentralized training lacks immediate synchronization. Since gossip-based communication does not instantly enforce consensus (unless ρ = 1), local models diverge, and each machine evolves independently. This lack of synchronization introduces a fundamental challenge: gradient estimates at each node become biased with respect to the global consensus, affecting convergence analysis. As shown in (Koloskova et al., 2019), this bias is closely tied to the consensus distance Ξt := 1 M PM i=1 wi t wt 2, where wt = 1 M PM i=1 wi t represents the global average of local models. The consensus distance quantifies the discrepancy between local models, and limited communication makes it challenging to minimize. This issue can be framed as a competition between the rate of consensus achievement and the rate at which local parameters evolve. If all machines synchronized instantly, D-SGD would closely resemble centralized Minibatch SGD. However, for any ρ < 1, perfect synchronization remains unattainable, necessitating to control and bound this bias. To gain intuition, consider the D-SGD update rule under the simplifying assumption that the gradient variance across nodes is bounded: Ψt := 1 M PM i=1 gi t gt 2 G2 for all t 1, where gt = 1 M PM i=1 gi t is the global average of local stochastic gradients. Using standard gossip averaging analysis and the SGD update rule, we can obtain the following recursion for the Enhancing Parallelism in Decentralized SCO Algorithm 1 Decentralized SGD (D-SGD) Input: Initial point w1 Rd, learning rate η, gossip matrix P, number of rounds T. Initialize wi 1 = w1 for all i [M] for t = 1, . . . , T do for i [M] in parallel do Sample zi t D and set gi t = fi(wi t, zi t) wi t+ 1 2 = wi t ηgi t wi t+1 = PM j=1 wj t+ 1 2 Pij end for end for consensus distance: i=1 wi t+1 wt+1 2 i=1 wi t+ 1 i=1 (wt ηgi t) ( wt η gt) 2 Ξt + C G2η2 for some constant C > 0. Solving this recursion yields the bound Ξt O(η2/ρ2) for all t 1. In practice, Koloskova et al. (2020) conducted a more refined analysis, obtaining an improved bound of order O(η2/ρ). This leads to an overall error rate of O(1/ηT + η + η2/ρ). Optimizing the learning rate by balancing these terms suggests choosing η (ρ/T)1/3, which in turn results in the O(1/ρ1/3T 2/3) term in (3), ultimately limiting the parallelism bound. Ideally, if we could tighten the bound on the consensus distance, it would lead to a direct improvement in this term. In the next section, we achieve this by introducing gradually shifting anchor points. 4. DAT-SGD: Decentralized Anytime SGD In this section, we introduce DAT-SGD. We begin in Section 4.1 by discussing the Anytime SGD method, which serves as the foundation of our approach. Then, in Section 4.2, we extend this framework to the decentralized setting. Finally, in Section 4.3, we provide an intuitive analysis and a proof sketch of the convergence statement, highlighting how Anytime SGD enables to mitigate the consensus distance an idea we further elaborate on in Section 4.4. 4.1. Anytime SGD For this part, consider a single-machine setup. The widely studied SGD method generates a sequence of iterates {wt}t Algorithm 2 Decentralized Anytime SGD (DAT-SGD) 1: Input: Initial iterate w1, learning rate η, gossip matrix P, number of rounds T, non-negative weights {αt}t 1. 2: Initialize wi 1 = xi 1 = w1 for all i [M] 3: for t = 1, . . . 
, T do 4: for i [M] in parallel do 5: Sample zi t Di and set gi t = fi(xi t, zi t) 6: Local updates 2 = wi t ηαtgi t 8: xi t+ 1 α1:t xi t + αt α1:t wi t+ 1 2 9: Gossip communication 10: wi t+1 = PM j=1 wj t+ 1 11: xi t+1 = PM j=1 xj t+ 1 2 Pij 12: end for 13: end for using the update rule wt+1 = wt ηgt, where gt is a stochastic gradient estimate computed at the current iterate wt. In contrast, the Anytime SGD framework (Cutkosky, 2019) computes stochastic gradients at different query points specifically, the weighted average of past iterates. Formally, given a sequence of non-negative weights {αt}t, Anytime SGD generates two sequences, {wt}t and {xt}t, according to the update rules: wt+1 = wt ηαtgt, (5) xt+1 = α1:t 1 α1:t xt + αt α1:t wt+1, (6) where x1 = w1, α1:t := Pt τ=1 ατ for all t > 0, and α1:0 := 0. Here, gt is an estimate of the gradient at the weighted average xt. Cutkosky (2019) introduced the Anytime framework to ensure that the query points converge to the optimal solution unlike standard SGD, where iterates may not. It was shown that Anytime SGD achieves the same convergence rates as SGD for convex functions. However, the averaged query points {xt}t exhibit greater stability, changing slower than the iterates themselves. As we show next, this approach also enables establishing a last-iterate convergence guarantee. 4.2. Extension to the Decentralized Setup Our approach extends Anytime SGD to the decentralized setting, as outlined in Algorithm 2. In round t, each machine performs the updates given in Equations (5) and (6) locally and, similar to D-SGD, shares both its model weights, wi t, and query points, xi t, through gossip averaging. The convergence of Algorithm 2 is established in Theorem 4.1, with the full proof deferred to Appendix B. Enhancing Parallelism in Decentralized SCO Theorem 4.1. Under Assumptions 2.1-2.3, consider Algorithm 2 with linear weights αt = t and a learning rate given by ( 1 24LT , ρ2 D1 2K σ ρ T where D2 1 := w1 x 2, K2 := 5120L2, and σ2 := 2σ2 + ζ2. Then, for all T 1, the following bound holds: E[f( x T ) f ] O MT + D3/2 1 L σ ρT + LD2 1 T where x T := 1 M PM i=1 xi T . Let N = MT be the total number of samples after T iterations. From Theorem 4.1, ignoring dependencies on L, D1, σ and ζ, we obtain the convergence rate O(1/ N + M/ρN). This implies that the asymptotic upper bound on parallelism is O(ρ N). Beyond this, the second term of M/ρN dominates, leading to a decline in learning efficiency when increasing the number of machines. This bound represents a significant improvement over the prior O(ρ1/2N 1/4) parallelism limit (Koloskova et al., 2020; 2021). In addition, we can establish the convergence of the local iterates, as follows. The proof is provided in Appendix B.1. Corollary 4.2. Under the same assumptions and parameter selections as in Theorem 4.1, the local iterate at each machine i {1, . . . , M} satisfies: E[f(xi T ) f ] MT + D3/2 1 L σ ρT + LD2 1 T + M σD1 The additional last term does not affect the established upper bound on parallelism, which remains M O(ρ Transient Iteration Complexity. A complementary metric to the parallelism bound is the transient iteration complexity, which quantifies how many iterations are needed before the convergence rate matches that of centralized SGD, i.e., when the σ/ MT term dominates (Pu et al., 2021). From Theorem 4.1, it follows that the transient iteration complexity of our method is O(M/ρ2), representing an improvement over D-SGD by a factor of M 2. 
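Before instantiating these bounds for concrete topologies, the following is a minimal NumPy sketch of Algorithm 2 (DAT-SGD) on a ring topology with a synthetic least-squares objective, loosely in the spirit of Section 5.1. The ring weights, dimensions, noise level, learning rate, and the use of constant weights αt = 1 (as in the synthetic experiments, rather than the linear weights analyzed in Theorem 4.1) are illustrative assumptions.

```python
import numpy as np

def ring_gossip_matrix(M):
    """Symmetric, doubly stochastic ring gossip matrix (self + two neighbors, weight 1/3 each)."""
    P = np.zeros((M, M))
    for i in range(M):
        for j in (i - 1, i, i + 1):
            P[i, j % M] += 1 / 3
    return P

def dat_sgd(A, b, P, eta, num_rounds, noise_std, rng):
    """Decentralized Anytime SGD (Algorithm 2): gradients are evaluated at the
    averaged query points x_t^i rather than at the iterates w_t^i, and both
    sequences are gossip-averaged every round. Constant weights alpha_t = 1
    are used here, as in the synthetic experiments of Section 5.1."""
    M, d = b.shape
    W = np.zeros((M, d))          # local iterates w_t^i (one row per machine)
    X = W.copy()                  # local query points x_t^i, initialized to w_1
    alpha, alpha_sum = 1.0, 0.0
    for _ in range(num_rounds):
        # Noisy local gradients of f_i(x) = 0.5*||A_i x - b_i||^2, queried at X (not W).
        residual = np.einsum('mij,mj->mi', A, X) - b
        G = np.einsum('mji,mj->mi', A, residual)
        G += noise_std * rng.standard_normal(G.shape)
        W_half = W - eta * alpha * G                               # local Anytime-SGD step
        alpha_sum += alpha
        X_half = (1 - alpha / alpha_sum) * X + (alpha / alpha_sum) * W_half
        W, X = P @ W_half, P @ X_half                              # gossip both sequences
    return X                                                       # the query points are the output

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, d = 16, 20
    A = rng.standard_normal((M, d, d)) / np.sqrt(d)                # one data matrix per machine
    x_star = rng.standard_normal(d)
    b = np.einsum('mij,j->mi', A, x_star)                          # homogeneous targets b_i = A_i x*
    X = dat_sgd(A, b, ring_gossip_matrix(M), eta=0.02,
                num_rounds=5000, noise_std=1.0, rng=rng)
    print("avg distance to optimum:", np.mean(np.linalg.norm(X - x_star, axis=1)))
```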
For instance, this complexity corresponds to O(M 5) for the ring topology and O(M) for the exponential graph (Ying et al., 2021). 4.3. Proof Sketch To complement the analysis in Section 3, we first provide an intuitive analysis of the consensus distance. As formally shown in Appendices A and B, the bias in Anytime SGD is related to the average consensus distance between the query points, defined as i=1 xi t xt 2, where xt = 1 Since the query points evolve more gradually than the iterates, we can derive a tighter bound on Γt. To illustrate this, we consider a simplified setting with uniform weights αt = 1 for all t and assume Ψt G2 for all t 1, as in Section 3. Using gossip averaging for the query points and the Anytime averaging, we can show that: i=1 xi t+1 xt+1 2 i=1 xi t+ 1 (xi t xt) + 1 ρt2 Ξt + G2η2 . (7) for some constant C > 0. Recall that Ξt satisfies its own recursion, as derived in Equation (4). Using the refined bound Ξt O(G2η2/ρ), we obtain: Γt + 2C G2η2 Summing over t [T] and rearranging terms, we can get the bound PT t=1 Γt O(η2/ρ3). Since the final error rate is proportional to the average consensus distance over time, 1 T PT t=1 Γt, the resulting error is of order O(1/ηT + η + η2/ρ3T). Tuning the learning rate by setting η ρ yields the O(1/ρT) term, which enables enhanced parallelism. This simplified analysis provides intuition on how slowly changing query points improve the rate. Next, we present a more formal proof sketch capturing finer transition details. Proof Sketch of Theorem 4.1. As we show in Lemma B.1, the consensus query point sequence { xt}t 1 is an {αt}t 1weighted average of the consensus iterates sequence { wt}t 1. Moreover, the average iterates sequence evolves similarly to Equation (5), following the update rule wt+1 = wt ηαt gt. Thus, the consensus sequences { xt}t 1 and { wt}t 1 align with the structure of Anytime SGD, allowing us to leverage standard results applicable to Anytime SGD. Specifically, defining t := E[f( xt) f ], it follows for all t 1 that (cf. Lemma A.3): α1:t t D2 1 η + η τ=1 α2 τE gτ 2 + 4ηT BT , (9) Enhancing Parallelism in Decentralized SCO where BT := E h PT τ=1 α2 τ E gτ f( xτ) 2i . Focusing on the middle term, we show using standard arguments that: M + 3E E gτ f( xτ) 2 + 3E f( xτ) 2 . Plugging this into Equation (9) allows us to derive the following bound: α1:t t D2 1 η + 3σ2η τ=1 α2 τ + 8ηTBT τ=1 α2 τE f( xτ) 2 . (10) A key component in our proof is bounding BT , which is essentially the sum of weighted squared biases. As we show in Section 4.4, this term is bounded as follows: Therefore, from Equation (10), we get: α1:t t D2 1 η + 3σ2η τ=1 α2 τ + 4K2 σ2η3T τ=1 α2 τE f( xτ) 2 . (11) For the last sum, the smoothness of f implies that f(x) 2 2L(f(x) f ) for all x Rd, yielding: τ=1 α2 τE f( xτ) 2 2L τ=1 α2 τ τ 4L τ=1 α1:τ τ , where the last inequality follows from α2 τ = τ 2 2α1:τ. Substituting this bound into (11) and recalling that η 1 24LT , we obtain: α1:t t D2 1 η + 3σ2η τ=1 α2 τ + 4K2 σ2η3T τ=1 α1:τ τ . (12) Observe that at := α1:t t follows the structure at b + 1 2T PT τ=1 aτ, t [T], which implies that at 2b for all t [T] (see Lemma D.3). In particular, for t = T, we get: α1:T T 2D2 1 η + 6σ2η τ=1 α2 τ + 8K2 σ2η3T Dividing by α1:T , noting that PT τ=1 α2 τ α1:T T for linear weights, and appropriately tuning the learning rate concludes the proof. 4.4. Bounding the Bias A central part of our analysis is bounding the bias introduced by local model differences. 
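As a quick numerical illustration of why this bias is easier to control for DAT-SGD, the sketch below runs D-SGD and DAT-SGD (with constant weights) side by side on toy heterogeneous quadratics and compares the consensus distance of the points at which gradients are queried: the iterates for D-SGD versus the slowly moving averaged query points for DAT-SGD. The topology, objective, noise level, and step size are illustrative assumptions.

```python
import numpy as np

def ring_gossip_matrix(M):
    """Symmetric, doubly stochastic ring gossip matrix (self + two neighbors, weight 1/3 each)."""
    P = np.zeros((M, M))
    for i in range(M):
        for j in (i - 1, i, i + 1):
            P[i, j % M] += 1 / 3
    return P

def consensus_distance(V):
    """(1/M) * sum_i ||v_i - mean(v)||^2 over the rows v_i of V."""
    return np.mean(np.linalg.norm(V - V.mean(axis=0), axis=1) ** 2)

M, d, T, eta, noise = 16, 10, 2000, 0.05, 1.0
rng = np.random.default_rng(0)
P = ring_gossip_matrix(M)
targets = rng.standard_normal((M, d))        # heterogeneous quadratics f_i(w) = 0.5*||w - b_i||^2

W_dsgd = np.zeros((M, d))                    # D-SGD: gradients queried at the iterates w_t^i
W_dat = np.zeros((M, d))                     # DAT-SGD (alpha_t = 1): gradients queried at x_t^i
X_dat = W_dat.copy()
alpha_sum = 0.0
for t in range(1, T + 1):
    noise_t = noise * rng.standard_normal((M, d))
    # D-SGD step: local SGD update followed by gossip.
    W_dsgd = P @ (W_dsgd - eta * (W_dsgd - targets + noise_t))
    # DAT-SGD step: gradient at X, iterate update, query-point averaging, gossip.
    G = X_dat - targets + noise_t
    W_half = W_dat - eta * G
    alpha_sum += 1.0
    X_half = (1 - 1 / alpha_sum) * X_dat + (1 / alpha_sum) * W_half
    W_dat, X_dat = P @ W_half, P @ X_half

print("D-SGD   consensus distance at the query points (Xi_T):   ", consensus_distance(W_dsgd))
print("DAT-SGD consensus distance at the query points (Gamma_T):", consensus_distance(X_dat))
```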
The bias bound is formally stated in Lemma 4.3, and the complete proof can be found in Appendix B.2. A brief proof sketch is provided below. Lemma 4.3. Consider the setting in Theorem 4.1 and let gt := 1 M PM i=1 gi t and xt := 1 M PM i=1 xi t denote the average gradient and query point across machines at round t, respectively. Then, for linear weights αt = t, and a learning rate bounded as η ρ2 K , the following bound holds: τ=1 α2 τ E gτ f( xτ) 2 # where we recall that K2 := 5120L2 and σ2 := 2σ2 + ζ2. Proof Sketch. By Jensen s inequality and the smoothness of each fi, we can bound each term in the sum as follows: E E gτ f( xτ) 2 1 i=1 E fi(xi τ) fi( xτ) 2 i=1 E xi τ xτ 2 . As previously defined, let Γt := 1 M PM i=1 E xi t xt 2 denote the average consensus distance of the query points at round t. Thus, τ=1 α2 τΓτ . (13) Our objective is therefore to bound Γτ for every τ [T]. Using standard gossip averaging analysis and the query points averaging, we derive the following recursion for Γt: Γt + 4δ2 t ρ Ξt + η2α2 tΨt , (14) recalling that Ξt := 1 M PM i=1 E wi t wt 2 and Ψt := 1 M PM i=1 E gi t gt 2 represent the average consensus distances of the iterates and gradients at round t, respectively, and using δt := αt α1:t . Notably, Ξt satisfies its own recursion (see Lemma C.4): Ξt + 2η2α2 t ρ Ψt . Solving this recursion yields the following for all t [T]: Ξt 2η2α2 t ρ Enhancing Parallelism in Decentralized SCO Substituting this back into (14) gives: + 4η2α2 tδ2 t ρ In Lemma C.2, we show that Ψt 5 σ2 + 10L2Γt for all t [T]. Plugging this bound, noting that for linear weights α2 tδ2 t = O(1), and using some algebra, we obtain: Γt+1 1 κ + c2 1η2 τ=1 (1 κ)t 1 τΓτ + c2 2η2 where κ = ρ 2, c2 1 = Θ(K2), and c2 2 = Θ( σ2). We have now derived a recursion for Γt that no longer depends on Ξt or Ψt. For a sufficiently small learning rate, specifically η ρ2 K , this recursion can be explicitly solved, yielding: κ4 = Θ σ2η2 Injecting this bound into (13) yields the result: τ=1 α2 τΓτ = Θ 5. Experiments In this section, we empirically evaluate our method on both a synthetic least squares problem and an image classification task. All experiments are conducted using three random seeds, and we report the averaged results. 5.1. Least Squares on Synthetic Data We begin with a synthetic least squares problem to illustrate key theoretical properties of our algorithm. For each machine, the local objective function is defined as fi(x) = 1 2 Aix bi 2, where Ai Rd d is drawn from a standard multivariate normal distribution. The targets vector is set as bi = Ai(x δi), where x N(0, 1 sampled once per configuration, and δi N(0, ζ2 d Id) introduces heterogeneity across machines.1 To incorporate stochasticity, we add Gaussian noise ξ N(0, σ2 d Id) when querying local gradients, resulting in the noisy gradient estimate fi(x) + ξ. In our experiments, we set d = 50. We compare our method with D-SGD, evaluating performance across three network topologies: ring, torus, 1Here, ζ2 quantifies heterogeneity at the optimum. 4 9 25 49 100 Number of Machines (M) 10 6 10 5 10 4 10 3 10 2 4 9 25 49 100 Number of Machines (M) 10 6 10 5 10 4 10 3 10 2 10 1 4 9 25 49 100 Number of Machines (M) 4 9 25 49 100 Number of Machines (M) 10 6 10 5 10 4 10 3 10 2 10 1 DAT-SGD (Ring) D-SGD (Ring) DAT-SGD (Torus) D-SGD (Torus) DAT-SGD (Exp.) D-SGD (Exp.) Figure 1. 
Final error on synthetic least squares problem for different numbers of machines and various gradient noise variance (σ²) and heterogeneity (ζ²) levels over ring, torus, and exponential graph topologies. We plot (1/M) Σ_{i=1}^{M} ‖x^i_T − x*‖² and (1/M) Σ_{i=1}^{M} ‖w^i_T − x*‖² for DAT-SGD and D-SGD, respectively.

and 1-peer exponential graph (Ying et al., 2021). The exponential graph is a fast-mixing topology for which 1/ρ = O(log M). For each method and topology, we perform a grid search over the learning rate η ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1} and select the value that yields the lowest error after 100K iterations. For DAT-SGD, we use constant weights αt = 1 for all t.

In Figure 1, we plot the final error as a function of the number of machines, across four configurations defined by σ, ζ ∈ {1, 10}, with different colors indicating the underlying topology. For D-SGD, we observe that performance degrades on the ring and torus topologies starting from M = 4 and M = 9, respectively, while it remains relatively stable on the 1-peer exponential graph. In contrast, our method improves with larger M: performance steadily improves on the torus and exponential graphs, and improves on the ring topology up to M = 25 before deteriorating. This suggests that, beyond a certain threshold (between M = 25 and 49), the network-dependent error term (which scales as O(1/(ρT)) = O(M²/T) for the ring) becomes dominant. Overall, the results align with our theoretical findings, as DAT-SGD enables performance improvement for larger M. We provide complete convergence curves in Appendix E.

5.2. Image Classification with a Neural Network

Next, we evaluate our method on the Fashion MNIST (Xiao et al., 2017) image classification task using the LeNet (LeCun et al., 1998) architecture. The data is partitioned among workers following a Dirichlet distribution with parameter α, which controls the heterogeneity level (Hsu et al., 2019). We compare our method against D-SGD and D2 (Tang et al., 2018b), a decentralized optimization method specifically designed to improve robustness to data heterogeneity. Experiments are performed on the ring topology and the Base-2 Graph (Takezawa et al., 2023), a time-varying, state-of-the-art topology for decentralized learning. For our method and D-SGD, we use momentum with parameter β = 0.9. For each method and topology, the learning rate was selected via grid search over η ∈ {0.001, 0.01, 0.1}.

Unlike the synthetic least squares problem, this task is non-convex. Following the heuristic proposed by Dahan & Levy (2025), we adopt a momentum-like update for Anytime averaging of the form x_{t+1} = γ_t x_t + (1 − γ_t) w_t, using a fixed value of γ_t = 0.9. This choice was shown to enhance training stability and adaptability in non-convex landscapes.

Figure 2. Test accuracy on Fashion MNIST for different numbers of machines (across rows) and varying heterogeneity levels (across columns) on the ring and the Base-2 graph topologies.

Figure 2 depicts the test accuracy for M = 8 and 16 machines under heterogeneous (α = 0.1) and nearly homogeneous (α = 10) data, with different colors indicating the compared methods.
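For completeness, a minimal sketch of one round of this non-convex variant, combining local SGD with momentum (gradients queried at the slowly moving points), the fixed-coefficient Anytime averaging described above, and gossip communication, is given below. The fully connected gossip matrix, toy gradients, and all hyperparameter values are illustrative assumptions; the momentum buffers are kept local in this sketch.

```python
import numpy as np

def dat_sgd_momentum_step(W, X, V, grads, P, eta, beta=0.9, gamma=0.9):
    """One round of the non-convex DAT-SGD variant used in Section 5.2:
    local SGD-with-momentum steps on the iterates W (with gradients queried at
    the query points X), a fixed-coefficient 'momentum-like' Anytime averaging
    X <- gamma*X + (1 - gamma)*W (here applied to the freshly updated iterate,
    mirroring the Anytime update (6)), and gossip averaging of both sequences.
    Rows of W, X, V, grads correspond to machines; V holds momentum buffers."""
    V = beta * V + grads                 # heavy-ball momentum on the locally queried gradients
    W = W - eta * V                      # local iterate update
    X = gamma * X + (1 - gamma) * W      # slowly moving query points (fixed gamma, cf. Dahan & Levy, 2025)
    return P @ W, P @ X, V               # gossip-average iterates and query points

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, d = 8, 100
    P = np.full((M, M), 1 / M)                           # fully connected graph, for illustration only
    W = rng.standard_normal((M, d)); X = W.copy(); V = np.zeros((M, d))
    for _ in range(100):
        grads = X + 0.1 * rng.standard_normal((M, d))    # toy gradients of 0.5*||x||^2, queried at X
        W, X, V = dat_sgd_momentum_step(W, X, V, grads, P, eta=0.05)
    print("mean query-point norm:", np.linalg.norm(X.mean(axis=0)))
```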
In the heterogeneous case, our method consistently outperforms the baselines across both topologies, with results on the ring matching (for M = 8) or nearly matching (for M = 16) those of the baselines on the well-connected Base-2 Graph. Conversely, and perhaps unexpectedly, under homogeneous data, D-SGD and D2 achieve better performance, motivating further investigation into this gap across data regimes in non-convex settings. In Figure 3, similar to Figure 1, we show the final accuracy for different numbers of machines on the ring topology with 4 8 16 32 Number of Machines (M) Final Accuracy DAT-SGD DSGD D2 Figure 3. Final test accuracy on Fashion MNIST for varying number of machines on the ring topology with heterogeneous data. heterogeneous data (α = 0.1). Notably, the largest accuracy drop for DAT-SGD occurs between M = 8 and 16, whereas D-SGD and D2 degrade most between M = 4 and 8, demonstrating our claim of improved parallelism. 6. Conclusion and Future Work In this work, we presented DAT-SGD, a simple yet powerful approach to decentralized SCO that raises the parallelism threshold, enabling to increase the number of machines in the network while maintaining linear speed-up. This is achieved by effectively mitigating the consensus distance using the Anytime SGD mechanism, which computes stochastic gradients at gradually changing query points, thereby limiting local model divergence. Several promising directions emerge from our work. One is integrating our method with gradient tracking, which has been shown to remove dependence on data heterogeneity (Koloskova et al., 2021). Moreover, we assume the gossip matrix is symmetric and doubly stochastic, allowing us to use Property 2.6 for a clear and simple analysis. Extending our results to asymmetric or row/column stochastic matrices, as in, e.g., (Assran et al., 2019), remains an open problem. Finally, establishing convergence bounds in the non-convex setting is a compelling challenge for future research. Acknowledgments We thank the reviewers for their valuable comments. This research was partially supported by Israel PBC-VATAT, by the Technion Artificial Intelligence Hub (Tech.AI), and by the Israel Science Foundation (grant No. 3109/24). Enhancing Parallelism in Decentralized SCO Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235 3249, 2012. Antonakopoulos, K., Vu, D. Q., Cevher, V., Levy, K., and Mertikopoulos, P. Undergrad: A universal black-box optimization method with almost dimension-free convergence rate guarantees. In International Conference on Machine Learning, pp. 772 795. PMLR, 2022. Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pp. 344 353. PMLR, 2019. Aviv, R. Z., Hakimi, I., Schuster, A., and Levy, K. Y. Asynchronous distributed learning: Adapting to gradient delays without prior knowledge. In International Conference on Machine Learning, pp. 436 445. PMLR, 2021. Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508 2530, 2006. Cutkosky, A. 
Anytime online-to-batch, optimism and acceleration. In International conference on machine learning, pp. 1446 1454. PMLR, 2019. Dahan, T. and Levy, K. Y. SLowcal SGD : Slow Query Points Improve Local-SGD for Stochastic Convex Optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=B29Bl Re26Z. Dahan, T. and Levy, K. Y. Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism. In The Thirteenth International Conference on Learning Representations, 2025. URL https:// openreview.net/forum?id=z CZn EXF3b N. Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012. Di Lorenzo, P. and Scutari, G. Next: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120 136, 2016. Ene, A., Nguyen, H. L., and Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7314 7321, 2021. He, L., Karimireddy, S. P., and Jaggi, M. Byzantine-robust decentralized learning via clippedgossip. ar Xiv preprint ar Xiv:2202.01545, 2022. Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. ar Xiv preprint ar Xiv:1909.06335, 2019. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021. Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019. Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pp. 482 491. IEEE, 2003. Koloskova, A., Stich, S., and Jaggi, M. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pp. 3478 3487. PMLR, 2019. Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pp. 5381 5393. PMLR, 2020. Koloskova, A., Lin, T., and Stich, S. U. An improved analysis of gradient tracking for decentralized machine learning. Advances in Neural Information Processing Systems, 34:11422 11435, 2021. Kong, L., Lin, T., Koloskova, A., Jaggi, M., and Stich, S. Consensus control for decentralized deep learning. In International Conference on Machine Learning, pp. 5686 5696. PMLR, 2021. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems, 27, 2014. Enhancing Parallelism in Decentralized SCO Li, S., Zhou, T., Tian, X., and Tao, D. Learning to collaborate in decentralized learning of personalized models. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9766 9775, 2022. Li, X., Yang, W., Wang, S., and Zhang, Z. Communicationefficient local decentralized sgd methods. ar Xiv preprint ar Xiv:1910.09126, 2019. Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in neural information processing systems, 30, 2017. Lu, Y. and De Sa, C. Optimal complexity in decentralized training. In International conference on machine learning, pp. 7111 7123. PMLR, 2021. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273 1282. PMLR, 2017. Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48 61, 2009. Nedic, A., Olshevsky, A., and Shi, W. Achieving geometric convergence for distributed optimization over timevarying graphs. SIAM Journal on Optimization, 27(4): 2597 2633, 2017. Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. Nemirovski, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. 1983. Pu, S. and Nedi c, A. Distributed stochastic gradient tracking methods. Mathematical Programming, 187(1):409 457, 2021. Pu, S., Olshevsky, A., and Paschalidis, I. C. A sharp estimate on the transient time of distributed stochastic gradient descent. IEEE Transactions on Automatic Control, 67 (11):5900 5915, 2021. Shi, W., Ling, Q., Yuan, K., Wu, G., and Yin, W. On the linear convergence of the admm in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750 1761, 2014. Takezawa, Y., Sato, R., Bao, H., Niwa, K., and Yamada, M. Beyond exponential graph: Communication-efficient topologies for decentralized learning via finite-time convergence. Advances in Neural Information Processing Systems, 36:76692 76717, 2023. Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. Communication compression for decentralized training. Advances in Neural Information Processing Systems, 31, 2018a. Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. D2: Decentralized training over decentralized data. In International Conference on Machine Learning, pp. 4848 4856. PMLR, 2018b. Tsitsiklis, J. N. Problems in decentralized decision making and computation. Ph D thesis, Massachusetts Institute of Technology, 1984. Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., and Rellermeyer, J. S. A survey on distributed machine learning. Acm computing surveys (csur), 53(2): 1 33, 2020. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Xiao, L. and Boyd, S. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65 78, 2004. Ying, B., Yuan, K., Chen, Y., Hu, H., Pan, P., and Yin, W. Exponential graph is provably efficient for decentralized deep training. Advances in Neural Information Processing Systems, 34:13975 13987, 2021. Yuan, K., Chen, Y., Huang, X., Zhang, Y., Pan, P., Xu, Y., and Yin, W. Decentlam: Decentralized momentum sgd for large-batch deep training. 
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3029 3039, 2021. Enhancing Parallelism in Decentralized SCO A. Anytime SGD Analysis In this section, we discuss key properties of Anytime SGD with biased gradients. The main result is Lemma A.3, which will later be adapted to the decentralized, gossip-based communication setup in Appendix B For an initial iterate w1 Rd, learning rate η > 0, and non-negative weights {αt}t 1, we consider the sequences {wt}t 1 and {xt}t 1, defined by the update rules: wt+1 = wt ηαtgt, (15) xt+1 = α1:t 1 α1:t xt + αt α1:t wt+1, with x1 = w1 . (16) Here, gt is a biased estimate of the gradient of some function f at xt. The following result mimics the gradient inequality for the {αt}t 1-weighted averages. Lemma A.1 (Cutkosky, 2019; Theorem 1, Dahan & Levy, 2024). Let f : Rd R be a convex function with global minimum f := f(x ), and let the sequence {xt}t 1 be defined as in Equation (16). Then, for any t 1, the following holds: 0 α1:t (f(xt) f ) τ=1 ατ f(xτ) (wτ x ) . Next, we present a regret bound for online gradient descent with weighted gradients. Lemma A.2 (Cutkosky, 2019; Lemma 2, Dahan & Levy, 2024). Consider the update rule in Equation (15). Then, for all t 1, it holds that: t X τ=1 ατg τ (wτ x ) w1 x 2 τ=1 α2 τ gτ 2 . The following result is fundamental to our analysis and plays a key role in establishing our main proof. The analysis closely follows that of Dahan & Levy (2024) and is included here for completeness; see Appendix F therein for further details. Lemma A.3. For the sequences {wt}t 1 and {x}t 1 defined in Equations (15) and (16), the following holds for any t 1: α1:t E [f(xt) f ] w1 x 2 τ=1 α2 τE gτ 2 + 4ηT τ=1 α2 τE Egτ f(xτ) 2 . Proof. Lemma A.1 implies the following: α1:t E [f(xt) f ] E τ=1 ατ f(xτ) (wt x ) τ=1 ατg τ (wτ x ) τ=1 ατ ( f(xτ) gτ) (wτ x ) τ=1 α2 τE gτ 2 + τ=1 E h ατ ( f(xτ) gτ) (wτ x ) i , (17) where the last inequality follows from Lemma A.2. Every element in the rightmost sum can be bounded using Cauchy Schwarz inequality and Young s inequality, a b 1 2θ a 2 + θ 2 b 2, which holds for any θ > 0, yielding: E h ατ ( f(xτ) gτ) (wτ x ) i = E h ατ ( f(xτ) Egτ) (wτ x ) i E [ατ Egτ f(xτ) wτ x ] α2 τ 2θ E Egτ f(xτ) 2 + θ 2E wτ x 2 , (18) where the first equality follows from the law of total expectation. Enhancing Parallelism in Decentralized SCO Substituting (18) into (17) gives: α1:t (f(xt) f ) w1 x 2 τ=1 α2 τE gτ 2 + 1 τ=1 α2 τE Egτ f(xτ) 2 + θ τ=1 E wτ x 2 τ=1 α2 τE gτ 2 + 1 τ=1 α2 τE Egτ f(xτ) 2 + θ τ=1 E wτ x 2 , where the last inequality follows from the last three terms being monotonically increasing with t. Lemma 3 of Dahan & Levy (2024) guarantees that for sequences {wt}t 1 and {x}t 1 as defined in Equations (15) and (16), the following holds: τ=1 E wt x 2 2T w1 x 2 + 2Tη2 T X τ=1 α2 τE gτ 2 + 16η2T 2 T X τ=1 α2 τE Egτ f(xτ) 2 , which in turn implies that for θ = 1 4ηT : α1:t E [f(xt) f ] 1 2η + θT w1 x 2 + η 2 + θTη2 T X t=1 α2 t E gt 2 2θ + 8θη2T 2 T X τ=1 α2 τE Egτ f(xτ) 2 t=1 α2 t E gt 2 + 4ηT τ=1 α2 τE Egτ f(xτ) 2 t=1 α2 t E gt 2 + 4ηT τ=1 α2 τE Egτ f(xτ) 2 , thus concluding the proof. B. Proof of Theorem 4.1 In this section, we prove our main result. To simplify the presentation and analysis, we first introduce some notations: Xt := x1 t x2 t x M t Rd M, Xt := Xt 1 M 11 = xt xt xt Rd M, Wt := w1 t w2 t w M t Rd M, Wt := Wt 1 M 11 = wt wt wt Rd M, Gt := g1 t g2 t g M t Rd M, Gt := Gt 1 M 11 = gt gt gt Rd M , with xt = 1 M PM i=1 xi t, wt = 1 M PM i=1 wi t, and gt = 1 M PM i=1 gi t. 
Denoting δt = αt α1:t , our Decentralized Anytime SGD algorithm can be expressed using matrix notation as follows: Local update and averaging: 2 = Wt ηαt Gt Xt+ 1 2 = (1 δt) Xt + δt Wt+ 1 Gossip communication: ( Wt+1 = Wt+ 1 2 P Xt+1 = Xt+ 1 We also define the expected average distance between local elements and their mean, commonly referred to as the consensus distance. Specifically, we define consensus distances for the iterates, the query points (i.e., the {αt}t 1-weighted iterates), Enhancing Parallelism in Decentralized SCO and the gradients: i=1 E xi t xt 2 = 1 M E Xt Xt 2 F , i=1 E wi t wt 2 = 1 M E Wt Wt 2 F , i=1 E gi t gt 2 = 1 M E Gt Gt 2 F . First, we establish that the sequence of query points average over machines { xt}t 1 is indeed an {αt}t 1-weighted average of the sequence of iterates average over machines { wt}t 1. This allows us to apply the results from Appendix A on Anytime SGD to analyze the consensus iterates. Lemma B.1. The sequence { xt}t 1 is an {αt}t 1-weighted average of the sequence { wt}t 1. Proof. Using matrix notation and the linearity of averaging, we have: Xt+1 = Xt+ 1 2 P = Xt+ 1 α1:t Xt + αt α1:t Xt + αt 2 P = α1:t 1 α1:t Xt + αt α1:t Wt+1 , where the second and fourth equalities follow from Lemma D.1, implying that gossip communication preserves averages. Next, we establish the convergence of Algorithm 2, as stated in Theorem 4.1, which we restate here for ease of reference. Theorem 4.1. Under Assumptions 2.1-2.3, consider Algorithm 2 with linear weights αt = t and a learning rate given by ( 1 24LT , ρ2 D1 2K σ ρ T where D2 1 := w1 x 2, K2 := 5120L2, and σ2 := 2σ2 + ζ2. Then, for all T 1, the following bound holds: E[f( x T ) f ] O MT + D3/2 1 L σ ρT + LD2 1 T where x T := 1 M PM i=1 xi T . Proof. For any t [T], define t := E[f( xt) f ]. We analyze the consensus iterates, which (according to Lemma B.1) follow the structure of Anytime SGD as defined in Equations (15) and (16), namely, wt+1 = wt ηαt gt, and xt+1 = (1 δt) xt + δt wt+1. Therefore, by Lemma A.3, we have: α1:t t w1 x 2 τ=1 α2 τE gτ 2 + 4ηT τ=1 α2 τE E gτ f( xτ) 2 = D2 1 η + η τ=1 α2 τE gτ 2 +4ηT BT , (19) where BT := PT τ=1 α2 τE E gτ f( xτ) 2, and we also used w1 = w1. Enhancing Parallelism in Decentralized SCO Bounding (A). We focus on E gτ 2. By adding and subtracting E gτ and f( xτ), and applying the inequality a + b + c 2 3 a 2 + 3 b 2 + 3 c 2, we obtain: E gτ 2 = E gτ E gτ + E gτ f( xτ) + f( xτ) 2 3E gτ E gτ 2 + 3E E gτ f( xτ) 2 + 3E f( xτ) 2 M + 3E E gτ f( xτ) 2 + 3E f( xτ) 2 , where the last inequality holds because gi τ fi(xi τ) i [M] are independent, zero-mean, and have variance at most σ2: E gτ E gτ 2 = E gi τ fi(xi τ) i=1 E gi t fi(xi τ) 2 σ2 Thus, (A) is bounded as follows, τ=1 α2 t E gτ 2 3σ2 τ=1 α2 τ + 3BT + 3 τ=1 α2 τE f( xτ) 2 . Plugging this bound back into (19), we get: α1:t t D2 1 η + η τ=1 α2 τ + 3VT + 3 τ=1 α2 τE f( xτ) 2 ! = D2 1 η + 3σ2η τ=1 α2 τ + (3η + 4ηT) BT + 3η τ=1 α2 τE f( xτ) 2 D2 1 η + 3σ2η τ=1 α2 τ + 8ηT BT + 3η τ=1 α2 τE f( xτ) 2 . (20) For linear weights αt = t and learning rate upper bounded by ρ2 K , Lemma 4.3 provides the following bound on BT (see proof in Appendix B.2 below): Substituting this bound back to Equation (20), we obtain: α1:t t D2 1 η + 3σ2η τ=1 α2 τ + 4K2 σ2η3T τ=1 α2 τ + 3η τ=1 α2 τE f( xτ) 2 . The rightmost term is be bounded as follows: τ=1 α2 t E f( xτ) 2 2L τ=1 α2 τ τ 4L τ=1 α1:τ τ , where the first inequality follows from Lemma D.2 (i.e., f( xτ) 2 2L(f( xτ) f )), and the second inequality arises from α2 t = t2 t(t + 1) = 2α1:t. 
Therefore, α1:t t D2 1 η + 3σ2η τ=1 α2 τ + 4K2 σ2η3T τ=1 α2 τ + 12Lη D2 1 η + 3σ2η τ=1 α2 τ + 4K2 σ2η3T τ=1 α2 τ + 1 τ=1 α1:τ τ , Enhancing Parallelism in Decentralized SCO where the last inequality holds true for η 1 24LT . Applying Lemma D.3 with at = α1:t t and b = D2 1 η + 3σ2η M PT τ=1 α2 τ + ρ4 PT τ=1 α2 τ yields (for t = T): 2D2 1 η + 6σ2η τ=1 α2 τ + 8K2 σ2η3T 2D2 1 η + 6σ2ηT 3 M + 8K2 σ2η3T 4 where the second inequality holds as PT t=1 α2 t = PT t=1 t2 T 3. Thus, employing Lemma D.6 with A = 2D2 1, B = 6σ2T 3 C = 8K2 σ2T 4 ρ4 , and η1 = min n 1 24LT , ρ2 K o , we get: AB + 2A3/4C1/4 12D2 1σ2T 3 M + 2 2D2 1 3/4 8K2 σ2T 4 K σT ρ + 48LD2 1T + 2KD2 1 ρ2 . Finally, dividing by α1:T = T (T +1) 2 , we obtain the result as: T = E[f( x T ) f ] 8 K σ ρT + 96LD2 1 T + 4KD2 1 ρ2T 2 MT + D3/2 1 p L(σ + ζ) ρT + LD2 1 T + LD2 1 ρ2T 2 B.1. Proof of Corollary 4.2 Next, we demonstrate the convergence of the local iterates. Proof. Let i [M]. By the smoothness of the f, it holds that: E[f(xi T ) f ] = E[f(xi T ) f( x T )] + E[f( x T ) f ] E f( x T ) (xi T x T ) + L xi T x T 2 + E[f( x T ) f ] 2θE f( x T ) 2 + θ + L 2 E xi T x T 2 + E[f( x T ) f ] , where the last inequality holds for any θ > 0 due to Young s inequality, a b 1 2θ a 2 + θ 2 b 2. Setting θ = L and using Lemma D.2 to upper bound f( x T ) 2 by 2L(f( x T ) f ), we get: E[f(xi T ) f ] 1 2LE f( x T ) 2 + LE xi T x T 2 + E[f( x T ) f ] 2E[f( x T ) f ] + LE xi T x T 2 2E[f( x T ) f ] + L i=1 E xi T x T 2 = 2E[f( x T ) f ] + LMΓT . Enhancing Parallelism in Decentralized SCO Using Lemma C.3, which applies under the conditions of Theorem 4.1, we can bound ΓT by 2560 σ2η2/ρ4, where σ2 := 2σ2 + ζ2. In addition, the learning rate in Theorem 4.1 satisfies η2 D1ρ2/2K σT 2, where K := 5120L. Therefore, we obtain: E[f(xi T ) f ] 2E[f( x T ) f ] + 2560LM σ2η2 ρ4 2E[f( x T ) f ] + 8 5MD1 σ ρ2T 2 . Plugging the bound on E[f( x T ) f ] from Theorem 4.1 establishes the result: E[f(xi T ) f ] 16 K σ ρT + 192LD2 1 T + 8KD2 1 ρ2T 2 + 8 5MD1 σ ρ2T 2 MT + D3/2 1 L σ ρT + LD2 1 T + MD1 σ In terms on the total of samples N = MT, the established rate is given by: E[f(xi T ) f ] O N + MD3/2 1 L σ ρN + MLD2 1 N + M 3D1 σ Therefore, ignoring the dependence on L, D1, σ and ζ (i.e., assuming they all equal 1), the parallelism bound (i.e., the maximal asymptotic M for which the rate is O(1/ M O min n ρ B.2. Proof of Lemma 4.3 Proof. We aim to prove that: τ=1 α2 τ E gτ f( xτ) 2 # where K2 := 5120L2 and σ2 := 2σ2 + ζ2. Focusing on E E gτ f( xτ) 2 and using the convexity of 2, we apply Jensen s inequality to obtain: E E gτ f( xτ) 2 = E fi(xi τ) fi( xτ) i=1 E fi(xi τ) fi( xτ) 2 i=1 E xi τ xτ 2 where the second inequality follows from the smoothness of fi. Using Lemma C.3, which applies specifically to linear weights αt = t and a learning rate bounded as η ρ2 80L we bound Γτ and derive a corresponding bound for BT : τ=1 α2 τE E gτ f( xτ) 2 L2 T X τ=1 α2 τΓτ 2560L2 σ2η2 τ=1 α2 τ = K2 σ2η2 which concludes the proof. Enhancing Parallelism in Decentralized SCO C. Consensus Recursion Analysis In this section, we analyze the consensus distances Γt, Ξt, and Ψt. We begin by observing that Γt follows a recursive relation (Equation (24)), where Γt+1 depends not only on Γt but also on Ξt and Ψt. Similarly, Ξt satisfies its own recursion, with Ξt+1 depending on Ξt and Ψt (Lemma C.4). To simplify the analysis, we first solve the recursion for Ξt, allowing us to express Γt in terms of Ψt alone, eliminating the dependence on Ξt (Lemma C.1). 
Finally, in Lemma C.2, we bound Ψt in terms of the problem parameters (σ and ζ) and Γt, enabling us to explicitly solve the recursion for Γt (Lemma C.3). C.1. Consensus of the Query Points The next result establishes a recursion for the consensus distance of the query points Xt. Lemma C.1. For all t 1, the consensus distance of the query points (i.e., averaged iterates) Γt satisfies the following recursion: Γt + 4η2α2 tδ2 t ρ Proof. The gossip averaging of the query points leads to the following inequality: MΓt+1 = E Xt+1 Xt+1 2 F = E Xt+ 1 2 2 F (1 ρ)E Xt+ 1 2 2 F , (21) where we used Lemma D.1, stating that Xt+ 1 2 P = Xt+ 1 2 , and the mixing property of the gossip matrix (Property 2.6). Substituting the query points averaging, Xt+ 1 2 = (1 δt)Xt+δt Wt+ 1 2 , and using a + b 2 (1+β 1) a 2+(1+β) b 2 (which holds for any β > 0), we obtain: MΓt+1 (1 ρ)E Xt+ 1 = (1 ρ)E (1 δt)Xt + δt Wt+ 1 2 ((1 δt) Xt + δt Wt+ 1 (1 ρ) 1 + 1 (1 δt)2E Xt Xt 2 F + (1 ρ) (1 + β) δ2 t E Wt+ 1 (1 ρ) 1 + 1 MΓt + (1 ρ)(1 + β)δ2 t E Wt+ 1 2 2 F | {z } ( ) where the last inequality follows from (1 δt)2 1. Focusing on ( ) and plugging the iterate update rule, we have for all γ > 0: 2 2 F = E Wt ηαt Gt ( Wt ηαt Gt) 2 F E Wt Wt 2 F + (1 + γ) η2α2 t E Gt Gt 2 F MΞt + (1 + γ)η2α2 t MΨt . (23) Substituting (23) into (22), setting β = 2/ρ and γ = 1, and dividing by M, we get: Γt+1 (1 ρ) 1 + ρ Γt + (1 ρ) 1 + 2 δ2 t 2Ξt + 2η2α2 tΨt 1 ρ Γt + 4δ2 t ρ Ξt + η2α2 tΨt , (24) where in the second inequality we used (1 ρ)(1 + ρ 2 and (1 ρ)(1 + 2 ρ) = 1 ρ + 2 ρ. Note that we derived a recursion for Γt, which also involves Ξt. The consensus distance of the iterates, Ξt, satisfies its own Enhancing Parallelism in Decentralized SCO recursion as stated in Lemma C.4, from which an explicit bound is provided in Lemma C.5. Substituting this bound yields: Γt + 4δ2 t ρ t 1 τ Ψτ + η2α2 tΨt Γt + 4η2α2 tδ2 t ρ thus concluding the proof. Note that the sequence Ψ1, . . . , Ψt appears in the recursion given in Lemma C.1. We next provide an upper bound on Ψt in terms of Γt, which will enable us to derive an explicit bound for Γt. Lemma C.2. For every t 1, Ψt 5 2σ2 + ζ2 + 10L2Γt. Proof. Using PN n=1 un 2 N PN n=1 un 2, we have: i=1 E gi t gt 2 i=1 E gi t Egi t + Egi t gt + E gt E gt fi( xt) + fi( xt) f( xt) + f( xt) 2 E gi t Egi t 2 + E gt E gt 2 + E Egi t fi( xt) 2 + E E gt f( xt) 2 + E fi( xt) f( xt) 2 . Next, we individually bound each sum. Bounding 1 M PM i=1 E gi t Egi t 2. By the bounded variance assumption, it holds that: i=1 E gi t Egi t 2 = 1 i=1 E fi(xi t, zi t) fi(xi t) 2 1 i=1 σ2 = σ2 . Bounding E gt E gt 2. Since gi t Egi t are independent, zero mean, and have variance bounded by σ2, it follows that: E gt E gt 2 = E i=1 E gi t Egi t 2 σ2 Bounding 1 M PM i=1 E Egi t fi( xt) 2. By the smoothness of fi, i=1 E Egi t fi( xt) 2 = 1 i=1 E fi(xi t) fi( xt) 2 L2 i=1 E xi t xt 2 = L2Γt . Bounding E E gt f( xt) 2. By Jensen s inequality and the smoothness of each fi, we have: E E gt f( xt) 2 = E fi(xi t) fi( xt) i=1 E fi(xi t) fi( xt) 2 L2Γt . Enhancing Parallelism in Decentralized SCO Bounding 1 M PM i=1 E fi( xt) f( xt) 2. By the heterogeneity assumption (Assumption 2.3), i=1 E fi( xt) f( xt) 2 ζ2 . Substituting these bounds back into Equation (26) implies the result as, Ψt 5 σ2 + σ2 M + 2L2Γt + ζ2 5 2σ2 + ζ2 + 10L2Γt . Using this, we now derive an explicit bound for Γt, specifically for linear weights and a sufficiently small learning rate. Lemma C.3. 
Using this, we now derive an explicit bound for $\Gamma_t$, specifically for linear weights and a sufficiently small learning rate.

Lemma C.3. For linear weights $\{\alpha_t = t\}_{t \ge 1}$ and a learning rate $\eta \le \frac{\rho^2}{80L}$, the consensus distance of the query points $\Gamma_t$ is bounded as:
\[
\Gamma_t \le \frac{2560\tilde{\sigma}^2\eta^2}{\rho^4},
\]
where $\tilde{\sigma}^2 := 2\sigma^2 + \zeta^2$.

Proof. First, observe that for linear weights $\alpha_t = t$, we have $\alpha_{1:t} = \sum_{\tau=1}^{t}\tau = \frac{t(t+1)}{2}$, which implies that for all $t \ge 1$:
\[
\delta_t = \frac{\alpha_t}{\alpha_{1:t}} = \frac{2}{t+1}, \qquad \text{and} \qquad \alpha_t\delta_t = \frac{2t}{t+1} \le 2.
\]
Thus, from Lemma C.1, we get:
\[
\Gamma_{t+1} \le \left(1 - \frac{\rho}{2}\right)\Gamma_t + \frac{4\eta^2\alpha_t^2\delta_t^2}{\rho}\left(\frac{2}{\rho}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\Psi_\tau + \Psi_t\right) \le \left(1 - \frac{\rho}{2}\right)\Gamma_t + \frac{16\eta^2}{\rho}\left(\frac{2}{\rho}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\Psi_\tau + \Psi_t\right),
\]
where the second inequality follows from $\alpha_t^2\delta_t^2 \le 4$. Let $\tilde{\sigma}^2 := 2\sigma^2 + \zeta^2$. Then, by Lemma C.2, for all $t \ge 1$ we have $\Psi_t \le 5\left(\tilde{\sigma}^2 + 2L^2\Gamma_t\right)$. Substituting this bound yields:
\[
\Gamma_{t+1} \le \left(1 - \frac{\rho}{2}\right)\Gamma_t + \frac{80\eta^2}{\rho}\left(\tilde{\sigma}^2 + 2L^2\Gamma_t + \frac{2}{\rho}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\left(\tilde{\sigma}^2 + 2L^2\Gamma_\tau\right)\right)
\le \left(1 - \frac{\rho}{2} + \frac{2c_1^2\eta^2}{\rho}\right)\Gamma_t + \frac{4c_1^2\eta^2}{\rho^2}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\Gamma_\tau + \frac{c_2^2\eta^2}{\rho} + \frac{4c_2^2\eta^2}{\rho^3}
\le \left(1 - \frac{\rho}{2} + \frac{2c_1^2\eta^2}{\rho}\right)\Gamma_t + \frac{4c_1^2\eta^2}{\rho^2}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\Gamma_\tau + \frac{8c_2^2\eta^2}{\rho^3}, \tag{27}
\]
where $c_1^2 := 80L^2$, $c_2^2 := 80\tilde{\sigma}^2$, the final inequality holds because $1 \le \frac{4}{\rho^2}$, and the second inequality stems from the following bound on the geometric sum:
\[
\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau} \le \sum_{s=0}^{\infty}\left(1 - \frac{\rho}{2}\right)^{s} = \frac{2}{\rho}.
\]
Equation (27) exhibits a recursive structure as in Lemma D.5 with $\kappa = \rho/2$, since $\frac{2c_1^2}{\rho} = \frac{c_1^2}{\kappa}$, $\frac{4c_1^2}{\rho^2} = \frac{c_1^2}{\kappa^2}$, and $\frac{8c_2^2}{\rho^3} = \frac{c_2^2}{\kappa^3}$. Since $\eta \le \frac{\rho^2}{80L} \le \frac{\kappa^2}{2c_1}$, the condition required to apply the lemma is satisfied. Applying the lemma with $a_t = \Gamma_t$ and $\kappa = \rho/2$ yields the final result:
\[
\Gamma_t \le \frac{2c_2^2\eta^2}{\kappa^4} = \frac{2560\tilde{\sigma}^2\eta^2}{\rho^4}.
\]

C.2. Consensus of the Iterates

The following lemma provides a recursive relation for the consensus distance of the iterates $W_t$.

Lemma C.4. The consensus distance of the iterates $\Xi_t$ satisfies the following recursive relation:
\[
\Xi_{t+1} \le \left(1 - \frac{\rho}{2}\right)\Xi_t + \frac{2\eta^2\alpha_t^2}{\rho}\Psi_t.
\]

Proof. Similarly to Equation (21), we have by the gossip averaging of the iterates:
\[
M\Xi_{t+1} = \mathbb{E}\left\|W_{t+1} - \bar{W}_{t+1}\right\|_F^2 = \mathbb{E}\left\|W_{t+\frac{1}{2}}P - \bar{W}_{t+\frac{1}{2}}\right\|_F^2 \le (1 - \rho)\,\mathbb{E}\left\|W_{t+\frac{1}{2}} - \bar{W}_{t+\frac{1}{2}}\right\|_F^2,
\]
where the second equality is due to Lemma D.1, stating that $\bar{W}_{t+\frac{1}{2}}P = \bar{W}_{t+\frac{1}{2}}$, and the inequality follows from the mixing property of the gossip matrix (Property 2.6). Substituting the bound on $\mathbb{E}\left\|W_{t+\frac{1}{2}} - \bar{W}_{t+\frac{1}{2}}\right\|_F^2$ from Equation (23) then yields:
\[
M\Xi_{t+1} \le (1 - \rho)\left(1 + \frac{1}{\gamma}\right)M\Xi_t + (1 - \rho)(1 + \gamma)\,\eta^2\alpha_t^2 M\Psi_t.
\]
Dividing by $M$ and setting $\gamma = 2/\rho$ finally gives:
\[
\Xi_{t+1} \le (1 - \rho)\left(1 + \frac{\rho}{2}\right)\Xi_t + (1 - \rho)\left(1 + \frac{2}{\rho}\right)\eta^2\alpha_t^2\Psi_t \le \left(1 - \frac{\rho}{2}\right)\Xi_t + \frac{2\eta^2\alpha_t^2}{\rho}\Psi_t,
\]
where in the last inequality we used $(1 - \rho)\left(1 + \frac{\rho}{2}\right) \le 1 - \frac{\rho}{2}$ and $(1 - \rho)\left(1 + \frac{2}{\rho}\right) \le \frac{2}{\rho}$.

Using this recursion, we derive an explicit bound on $\Xi_t$ as follows.

Lemma C.5. For any non-decreasing sequence $\{\alpha_t\}_{t \ge 1}$, the consensus distance $\Xi_t$ is bounded as:
\[
\Xi_t \le \frac{2\eta^2\alpha_t^2}{\rho}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\Psi_\tau.
\]

Proof. The recursion for $\Xi_t$ in Lemma C.4 exhibits the structure in Lemma D.4. Applying the lemma with $a_t = \Xi_t$, $b_t = \eta^2\alpha_t^2\Psi_t$, and $\kappa = \rho/2$, we obtain:
\[
\Xi_t \le \frac{2}{\rho}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\eta^2\alpha_\tau^2\Psi_\tau \le \frac{2\eta^2\alpha_t^2}{\rho}\sum_{\tau=1}^{t-1}\left(1 - \frac{\rho}{2}\right)^{t-1-\tau}\Psi_\tau,
\]
where in the last inequality we used $\alpha_\tau^2 \le \alpha_t^2$ for all $\tau \le t$.

D. Technical Lemmas

In this section, we list some technical lemmas, starting with the following property that establishes the invariance of gossip communication when all machines hold the same vector.

Lemma D.1 (Proposition 1, Koloskova et al., 2020). Let $x \in \mathbb{R}^d$. For any matrix $X = \left[x \;\; x \;\; \cdots \;\; x\right] \in \mathbb{R}^{d \times M}$ and a symmetric, doubly stochastic matrix $P \in \mathbb{R}^{M \times M}$, we have $XP = X$.

Next, we state a classical result for smooth functions.

Lemma D.2. Let $f: \mathbb{R}^d \to \mathbb{R}$ be an $L$-smooth function and $x^* \in \arg\min_{x \in \mathbb{R}^d} f(x)$. Then, for every $x \in \mathbb{R}^d$, it holds that:
\[
\left\|\nabla f(x)\right\|^2 \le 2L\left(f(x) - f(x^*)\right).
\]

The following lemma proves to be useful as well.

Lemma D.3. Consider a non-negative sequence $a_1, \ldots, a_T$ satisfying the following relation for any $t \in [T]$:
\[
a_t \le b + \frac{1}{2T}\sum_{\tau=1}^{T} a_\tau,
\]
for some constant $b \in \mathbb{R}$. Then, for all $t \in [T]$, it holds that $a_t \le 2b$.

Proof. Summing over $t \in [T]$, we have that:
\[
\sum_{t=1}^{T} a_t \le Tb + \frac{1}{2T}\sum_{t=1}^{T}\sum_{\tau=1}^{T} a_\tau = Tb + \frac{1}{2}\sum_{\tau=1}^{T} a_\tau.
\]
Subtracting $\frac{1}{2}\sum_{t=1}^{T} a_t$ and multiplying by $2$ results in $\sum_{t=1}^{T} a_t \le 2Tb$. Thus, we obtain:
\[
a_t \le b + \frac{1}{2T}\sum_{\tau=1}^{T} a_\tau \le b + \frac{1}{2T} \cdot 2Tb = 2b.
\]
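As an illustration of how Lemma D.3 is used, the short sketch below draws a random non-negative sequence, shrinks it until it satisfies the premise $a_t \le b + \frac{1}{2T}\sum_{\tau} a_\tau$, and confirms that the conclusion $a_t \le 2b$ then holds; the constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
T, b = 50, 1.3

# Draw a random non-negative sequence and shrink it until it satisfies the premise
# a_t <= b + (1/(2T)) * sum_tau a_tau for every t in [T].
a = rng.uniform(0.0, 5.0, size=T)
while np.any(a > b + a.sum() / (2 * T)):
    a *= 0.99

assert np.all(a <= b + a.sum() / (2 * T))      # premise of Lemma D.3
print(f"max_t a_t = {a.max():.4f} <= 2b = {2 * b:.4f}: {bool(a.max() <= 2 * b)}")
```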
The following two lemmas are used to derive explicit bounds for the consensus recursions discussed in Appendix C.

Lemma D.4. Let $\kappa \in (0, 1]$ and consider a sequence $\{a_t\}_{t \ge 1}$ satisfying the recursion:
\[
a_t \le (1 - \kappa)a_{t-1} + \frac{b_{t-1}}{\kappa}, \qquad \text{with } a_1 = 0,
\]
for some non-negative sequence $\{b_t\}_{t \ge 1}$. Then, for all $t \ge 2$, the following bound holds:
\[
a_t \le \frac{1}{\kappa}\sum_{\tau=1}^{t-1}(1 - \kappa)^{t-1-\tau} b_\tau.
\]

Proof. We prove the statement using induction.

Base case. For $t = 2$, since $a_1 = 0$, it trivially follows that:
\[
a_2 \le (1 - \kappa)a_1 + \frac{b_1}{\kappa} = \frac{b_1}{\kappa} = \frac{1}{\kappa}\sum_{\tau=1}^{1}(1 - \kappa)^{1-\tau} b_\tau.
\]

Induction step. Assume that the induction hypothesis holds for some $t \ge 2$. We show that it holds for index $t + 1$:
\[
a_{t+1} \le (1 - \kappa)a_t + \frac{b_t}{\kappa} \le \frac{1 - \kappa}{\kappa}\sum_{\tau=1}^{t-1}(1 - \kappa)^{t-1-\tau} b_\tau + \frac{b_t}{\kappa} = \frac{1}{\kappa}\left(\sum_{\tau=1}^{t-1}(1 - \kappa)^{t-\tau} b_\tau + b_t\right) = \frac{1}{\kappa}\sum_{\tau=1}^{t}(1 - \kappa)^{t-\tau} b_\tau.
\]

Lemma D.5. Let $\kappa \in (0, 1]$ and consider a sequence $\{a_t\}_{t \ge 1}$ satisfying the recursion:
\[
a_t \le \left(1 - \kappa + \frac{c_1^2\eta^2}{\kappa}\right)a_{t-1} + \frac{c_1^2\eta^2}{\kappa^2}\sum_{\tau=1}^{t-2}(1 - \kappa)^{t-2-\tau} a_\tau + \frac{c_2^2\eta^2}{\kappa^3}, \qquad \text{with } a_1 = 0,
\]
for some non-negative constants $c_1$, $c_2$, and $\eta$. Suppose $\eta \le \frac{\kappa^2}{2c_1}$. Then, for all $t \ge 1$, the following bound holds:
\[
a_t \le \frac{2c_2^2\eta^2}{\kappa^4}.
\]

Proof. Similar to Lemma D.4, we prove this result by (strong) induction.

Base cases. For $t = 1$, the statement is trivial because $a_1 = 0 \le 2c_2^2\eta^2/\kappa^4$. For $t = 2$, we have:
\[
a_2 \le \left(1 - \kappa + \frac{c_1^2\eta^2}{\kappa}\right)a_1 + \frac{c_1^2\eta^2}{\kappa^2}\sum_{\tau=1}^{0}(1 - \kappa)^{-\tau} a_\tau + \frac{c_2^2\eta^2}{\kappa^3} = \frac{c_2^2\eta^2}{\kappa^3} \le \frac{2c_2^2\eta^2}{\kappa^4},
\]
where the last inequality follows from $1 \le \frac{2}{\kappa}$ as $\kappa \le 1$, thus verifying the base cases.

Induction step. Suppose that for some $t \ge 3$, the induction hypothesis holds for all indices $s = 1, \ldots, t$. We shall prove that it then holds for $t + 1$. Specifically, denoting the upper bound by $B := \frac{2c_2^2\eta^2}{\kappa^4}$, we assume that $a_s \le B$ for $s = 1, \ldots, t$ and prove that $a_{t+1} \le B$. Plugging the induction hypothesis into the recursion, we get:
\[
a_{t+1} \le \left(1 - \kappa + \frac{c_1^2\eta^2}{\kappa}\right)a_t + \frac{c_1^2\eta^2}{\kappa^2}\sum_{\tau=1}^{t-1}(1 - \kappa)^{t-1-\tau} a_\tau + \frac{c_2^2\eta^2}{\kappa^3}
\le \left(1 - \kappa + \frac{c_1^2\eta^2}{\kappa}\right)B + \frac{c_1^2\eta^2}{\kappa^2}B\sum_{\tau=1}^{t-1}(1 - \kappa)^{t-1-\tau} + \frac{c_2^2\eta^2}{\kappa^3}
\le \left(1 - \kappa + \frac{c_1^2\eta^2}{\kappa} + \frac{c_1^2\eta^2}{\kappa^3}\right)B + \frac{c_2^2\eta^2}{\kappa^3}
\le \left(1 - \kappa + \frac{2c_1^2\eta^2}{\kappa^3}\right)B + \frac{c_2^2\eta^2}{\kappa^3},
\]
where the last inequality holds because $\frac{1}{\kappa} \le \frac{1}{\kappa^3}$ for any $\kappa \le 1$, and the penultimate inequality results from the bound on the geometric series:
\[
\sum_{\tau=1}^{t-1}(1 - \kappa)^{t-1-\tau} = \sum_{s=0}^{t-2}(1 - \kappa)^{s} \le \frac{1}{\kappa}.
\]
From the condition on $\eta$, it follows that $\frac{2c_1^2\eta^2}{\kappa^3} \le \frac{\kappa}{2}$. Substituting this inequality and noting that $\frac{c_2^2\eta^2}{\kappa^3} = \frac{\kappa}{2}B$, we obtain:
\[
a_{t+1} \le \left(1 - \kappa + \frac{\kappa}{2}\right)B + \frac{\kappa}{2}B = B,
\]
thus establishing the result.

We also utilize the following result, which we prove using simple algebraic manipulation.

Lemma D.6. Let $A \ge 0$ and $B, C > 0$, and define $\eta = \min\left\{\eta_1, \sqrt{A/B}, (A/C)^{1/4}\right\}$ for some $\eta_1 > 0$. Then,
\[
\frac{A}{\eta} + B\eta + C\eta^3 \le \frac{A}{\eta_1} + 2\sqrt{AB} + 2A^{3/4}C^{1/4}.
\]

Proof. Since $\eta$ is the minimum between three terms, $1/\eta$ is the maximum of their inverses. Therefore,
\[
\frac{A}{\eta} + B\eta + C\eta^3 \le A\max\left\{\frac{1}{\eta_1}, \sqrt{\frac{B}{A}}, \left(\frac{C}{A}\right)^{1/4}\right\} + B\eta + C\eta^3 \le \frac{A}{\eta_1} + \sqrt{AB} + A^{3/4}C^{1/4} + \sqrt{AB} + A^{3/4}C^{1/4} = \frac{A}{\eta_1} + 2\sqrt{AB} + 2A^{3/4}C^{1/4},
\]
where the second inequality follows from the fact that the maximum of non-negative terms is smaller than their sum, and from the monotonic increase of the terms $B\eta$ and $C\eta^3$ with $\eta$.
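Before turning to the experimental results, the sketch below numerically exercises Lemma D.5, the recursion solver behind Lemma C.3: it iterates the recursion with equality at the largest admissible step size $\eta = \kappa^2/(2c_1)$ and confirms that the sequence stays below $2c_2^2\eta^2/\kappa^4$. The particular values of $\kappa$, $c_1$, and $c_2$ are arbitrary.

```python
import numpy as np

kappa, c1, c2, T = 0.3, 2.0, 1.5, 500
eta = kappa**2 / (2 * c1)                    # largest step size allowed by the lemma

a = np.zeros(T + 1)                          # a[1] = 0
for t in range(2, T + 1):
    taus = np.arange(1, t - 1)               # tau = 1, ..., t-2
    hist = float(np.dot((1 - kappa) ** (t - 2 - taus), a[taus])) if t > 2 else 0.0
    a[t] = ((1 - kappa + c1**2 * eta**2 / kappa) * a[t - 1]
            + (c1**2 * eta**2 / kappa**2) * hist
            + c2**2 * eta**2 / kappa**3)

bound = 2 * c2**2 * eta**2 / kappa**4
print(f"max_t a_t = {a.max():.6f} <= 2*c2^2*eta^2/kappa^4 = {bound:.6f}: {a.max() <= bound}")
```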
E. Additional Experimental Results

In Figures 4 to 6, we present the full convergence curves for DAT-SGD and D-SGD on the synthetic least squares problem over the ring, torus, and 1-peer exponential graph topologies, respectively. These results are shown for varying numbers of machines and correspond to the final errors reported in Figure 1. On the torus and exponential graph topologies, our method steadily improves with increasing numbers of machines, demonstrating the potential of our approach to effectively increase parallelism. For the ring topology, performance improves when transitioning from M = 9 to 25 machines but degrades at M = 100 machines, as the topology-related error term becomes dominant.

Conversely, D-SGD does not improve when increasing the number of machines. On the ring and torus topologies, its performance is initially better for larger M but then deteriorates, indicating that while we initially benefit from variance reduction, the optimization transitions to a regime where the leading error term depends on the topology, and inefficient information flow hinders further improvement. For the well-connected exponential graph, initial performance improves with M, but all configurations eventually converge to approximately the same error level.

Figure 4. Error curves for the least squares problem with varying noise, data heterogeneity, and # of machines in the ring topology (curves for DAT-SGD and D-SGD with M = 9, 25, 100).

Figure 5. Error curves for the least squares problem with varying noise, data heterogeneity, and # of machines in the torus topology (curves for DAT-SGD and D-SGD with M = 9, 25, 100).

Figure 6. Error curves for the least squares problem with varying noise, data heterogeneity, and # of machines in the 1-peer exponential graph topology (curves for DAT-SGD and D-SGD with M = 9, 25, 100).
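For readers who want to relate these topologies to the spectral gap $\rho$ appearing in the bounds, the following sketch builds simple gossip matrices for the ring and torus and estimates $1/\rho$ numerically. The uniform neighbor weights and the choice $\rho = 1 - |\lambda_2(P)|$ are illustrative assumptions; they need not match Definition 2.5 or the gossip matrices used in the experiments above.

```python
import numpy as np

def ring_gossip(M):
    """Symmetric, doubly stochastic ring gossip matrix: self and two neighbors, weight 1/3 each."""
    P = np.zeros((M, M))
    for i in range(M):
        P[i, i] = 1 / 3
        P[i, (i - 1) % M] = P[i, (i + 1) % M] = 1 / 3
    return P

def torus_gossip(side):
    """Gossip matrix for a side x side torus (side >= 3): self and four neighbors, weight 1/5 each."""
    M = side * side
    P = np.zeros((M, M))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            P[i, i] = 1 / 5
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                j = ((r + dr) % side) * side + (c + dc) % side
                P[i, j] += 1 / 5
    return P

def spectral_gap(P):
    """rho = 1 - second largest eigenvalue magnitude of P (one common definition)."""
    eig = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
    return 1.0 - eig[1]

for M in (9, 25, 100):
    side = int(round(M ** 0.5))
    print(f"M={M:3d}  ring: 1/rho = {1 / spectral_gap(ring_gossip(M)):8.1f}   "
          f"torus: 1/rho = {1 / spectral_gap(torus_gossip(side)):8.1f}")
```

Running this shows $1/\rho$ growing roughly quadratically in M for the ring and roughly linearly for the torus, which is the qualitative behavior reflected in the topology-dependent terms of the bounds.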