# dadao_decoupled_accelerated_decentralized_asynchronous_optimization__53b05f8a.pdf

DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization

Adel Nabli 1 2, Edouard Oyallon 1

1 Sorbonne Université, CNRS, ISIR, Paris, France. 2 Mila, Concordia University, Montréal, Canada. Correspondence to: Adel Nabli. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

This work introduces DADAO: the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of L-smooth and µ-strongly convex functions distributed over a given network of size n. Our key insight is based on modeling the local gradient updates and gossip communication procedures with separate independent Poisson Point Processes. This allows us to decouple the computation and communication steps, which can be run in parallel, while making the whole approach completely asynchronous. This leads to communication acceleration compared to synchronous approaches. Our new method employs primal gradients and does not use a multi-consensus inner loop nor other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or a Proximal operator. By relating the inverse of the smallest positive eigenvalue of the Laplacian matrix χ1 and the maximal resistance χ2 ≤ χ1 of the graph to a sufficient minimal communication rate between the nodes of the network, we show that our algorithm requires $\mathcal{O}(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon})$ gradients and only $\mathcal{O}(n\sqrt{\chi_1\chi_2}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon})$ communications to reach a precision ϵ, up to logarithmic terms. Thus, we simultaneously obtain an accelerated rate for both computations and communications, leading to an improvement over state-of-the-art works, our simulations further validating the strength of our relatively unconstrained method.

1. Introduction

In recent years, the increased amount of available data as well as the proliferation of highly-parallelizable and connected hardware have brought significant changes in the way we process data. These developments have led to the need for efficient and scalable methods for distributed optimization, particularly in the context of machine learning. Indeed, in scenarios where data is distributed across multiple nodes, such as in edge computing or distributed sensor networks, leveraging the local resources of each device is a topic of significant interest. In other settings, such as clusters, spreading the compute load is ideally done to obtain a linear speedup in the number of nodes. In a typical distributed training framework, the goal is to minimize a sum of functions $(f_i)_{i \leq n}$ split across the n nodes of a computer network. A corresponding optimization procedure involves alternating local computations on the nodes and communications along the edges E of the network. In the decentralized setting, there is no central machine aggregating the information sent by the workers: nodes are only allowed to communicate with their neighbors in the network. This work addresses simultaneously multiple limitations of existing decentralized algorithms while guaranteeing fast convergence rates.

Synchronous lock.
Optimal methods (Scaman et al., 2017; Kovalev et al., 2021a) have been derived for synchronous first-order algorithms, whose executions are blocked until all nodes have reached a predefined state (e.g., they must all finish computing local gradients before the round of communication begins), which limits their efficiency in practice as they can be heavily impacted by a few slow nodes or edges in the graph (the straggler problem). To tackle the synchronous lock, we rely on the continuized framework (Even et al., 2021a), itself derived from the randomized gossip model (Boyd et al., 2006). In randomized gossip, nodes update their local values at random times using pairwise communication updates named gossip. Thus, iterates are randomized and labeled with a continuous-time index that only requires local clocks to be synchronized at the beginning of the procedure (as opposed to a global iteration count that has to be known by all nodes at all times), and updates are performed locally with no regard to a specific global ordering of events. While based on discrete events and thus readily implementable, the continuized framework simplifies the analysis by leveraging continuous proof tools.

Coupled lock. However, in (Even et al., 2021a), gradient and gossip operations are coupled: each communication along an edge first requires the computation of the gradients of the two functions locally stored on the corresponding nodes. As more communication steps than gradient computations are necessary to reach an ϵ precision, even in an optimal framework (Kovalev et al., 2021a; Scaman et al., 2017), the coupling leads to an overload in terms of gradient steps. Moreover, coupling computations and communications implies they must be performed sequentially; decoupling them allows both tasks to be run in parallel, yielding an additional speedup. To our knowledge, our work is the first primal method to tackle those locks simultaneously while obtaining accelerated rates for both computations and communications.

We propose a novel algorithm (DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization) based on a combination of formulations similar to those of (Kovalev et al., 2021a; Even et al., 2021b; Hendrikx, 2022) in the continuized framework of (Even et al., 2021a). We study:

$$\inf_{x \in \mathbb{R}^d} \sum_{i=1}^{n} f_i(x)\,, \qquad (1)$$

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is a µ-strongly convex and L-smooth function computed in one of the n nodes of a network. We derive a first-order optimization algorithm that only uses primal gradients and relies on a modeling of the communication and gradient occurrences by Point-wise Poisson Processes (P.P.P.s) (Last & Penrose, 2017), leading to accelerated communication and computation rates. Furthermore, our communication bounds rely on the maximal resistance of a graph rather than the largest eigenvalue of a Laplacian, leading to an additional acceleration compared to works which rely on synchrony. Our framework is based on a simple fixed-point iteration and kept minimal: it only involves primal computations with an additional momentum term. Thus, we do not add other cumbersome designs such as the Error-Feedback or Forward-Backward used in (Kovalev et al., 2021a) (whose adaptation to asynchronous settings is for now unclear). While we do not consider the delays bound to appear in practice (we assume instantaneous communications and computations), we remove the coupling lock by performing gradient and gossip steps in parallel. Tab.
1 compares DADAO with other approaches and shows it is the only work to achieve accelerated rates both in number of communication and gradients. Contributions. (1) We propose a primal algorithm with provable guarantees in the context of asynchronous decentralized learning. (2) This algorithm is the first to reach accelerated rates for both communications and computations while not requiring ad-hoc mechanisms obtained from an inner loop. (3) We propose a simple theoretical frame- work compared to concurrent works, we show that our rates are better than previous works, and (4) we illustrate this theoretical comparison numerically. Structure of the paper. In Sec. 3.1, we describe our work hypothesis and our model of a decentralized environment, while Sec. 3.2 describes our dynamic. Sec. 3.3 states our convergence guarantees and highlights that the communication and computational rates of our method are better compared to its competitors. Next, Sec. 4.1 explains our implementation of this algorithm, and finally, Sec. 4.2 verifies our claims numerically. All our experiments are reproducible, using Py Torch (Paszke et al., 2019), our code being online 1. Notations: f = O(g) means there is a constant C > 0 such that |f| C|g|, {ei}i d is the canonical basis of Rd, d N, 1 is the vector of 1, I the identity, A+ is the pseudo-inverse of A. We further write ei ei I. 2. Related Work Continuized and asynchronous algorithms. We highly rely on the elegant continuized framework (Even et al., 2021a), which allows obtaining simpler proofs and brings the flexibility of asynchronous algorithms. We reemphasize that identically to (Even et al., 2021a), the result of this paper is a stochastic discrete algorithm with a continuous proof: our proof framework is not based on the discretization of an Ordinary Differential Equation (ODE) but rather studies the evolution of a Stochastic Differential Equation (SDE) with jumps. However, by contrast to (Even et al., 2021a), in our work, we significantly reduce the necessary amount of gradient steps compared to (Even et al., 2021a) while maintaining the same amount of activated edges. Another type of asynchronous algorithm can also be found in (Latz, 2021), yet it fails to obtain Nesterov s accelerated rates for lack of momentum. We note that (Leblond et al., 2018) studies the robustness to delays yet requires a shared memory and thus applies to a different context than decentralized optimization. (Hendrikx, 2022) is a promising approach for modeling random communication on graphs yet fails to obtain acceleration in a neat framework without inner loops. Decentralized algorithms with fixed topology. (Scaman et al., 2017) is the first work to derive an accelerated algorithm for decentralized optimization, and it links the convergence speed to the Laplacian eigengap. The corresponding algorithm uses a dual formulation and a Chebychev acceleration (synchronous and only for fixed topology). Yet, as stated in Tab. 3.3, it still requires many edges to be activated. Furthermore, under a relatively flexible condition on the in- 1https://github.com/Adel Nabli/DADAO/ DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Table 1. This table shows the strength of DADAO compared to concurrent works for obtaining ϵ-precision. n is the number of nodes, |E| the number of edges, 1 χ1 the smallest positive eigenvalue of a fixed weighted Laplacian L, ρ the eigengap and χ2 χ1 the effective resistance. Note that under reasonable assumptions χ1χ2n = O(|E| ρ) (see Lemma 3.3). 
Async., Comm., Grad., M.-C. and Prox. stand respectively for Asynchrony, Communication steps, Gradient steps., Multi-consensus and Proximal operator. As suggested in their respective papers, all the algorithms are run with 1 L L. For OGT, we note that the matrix is stochastic, a more precise comparison is given by Prop. 3.9. Method Async. Decoupled No Inner Loop Primal Total Total (M.-C. or Prox.) Oracle # Comm. # Grad. MSDA (Scaman et al., 2017) ρ|E| q ϵ DVR (Hendrikx et al., 2020) ρ|E| q ϵ ADOM+ (Kovalev et al., 2021a) ρ|E| q ϵ TVR (Hendrikx, 2022) ρ|E| L ϵ AGT (Li & Lin, 2021) ρ|E| q ϵ OGT (Song et al., 2021) ρ|E| q ϵ ESDACD (Hendrikx et al., 2019b) χ1χ2n q ϵ Continuized (Even et al., 2021a) χ1χ2n q ϵ DADAO (ours) χ1χ2n q tensity of our P.P.P.s, we show that our work improves over bounds that depend on the spectral gap. An emerging line of work following this formulation employs the continuized framework (Even et al., 2020; 2021a;b), but unfortunately do not use a primal oracle, as they rely on the gradients of the Fenchel conjugate. Finally, we note that the work of (Even et al., 2021b) incorporates delays in their model, showing that, with some adaptation, the continuized methods in (Even et al., 2021a) still converge at a linear rate. Yet transferring this robustness to our setting remains unclear. Reducing the number of communication has been studied in (Mishchenko et al., 2022), but without obtaining accelerated rates. (Hendrikx et al., 2021) allows for fast communication and gossip rates yet requires a proximal step and synchrony between nodes to apply a momentum variable. Finite sum acceleration. If each local function fi is a sum of elementary functions Pm j=1 fi,j with a favorable conditionning, an additional acceleration is possible, as observed by (Hendrikx et al., 2021; 2020; Hendrikx, 2022), which is a different problem from Eq. 1. This can be viewed as a cluster of nodes with infinite connectivity and an efficient decentralized algorithm should automatically adapt to such structure. Thus, we focused on the setting m = 1, which allows a fair comparison with these works. Error feedback/Gradient tracking. A major lock for asynchrony is the use of Gradient Tracking (GT) (Koloskova et al., 2021; Nedic et al., 2017; Li & Lin, 2021) or Error Feedback (Stich & Karimireddy, 2020; Kovalev et al., 2021b). Indeed, gradient operations are locally tracked by a running-mean variable which must be synchronously updated at each gradient update (in GT, the sum of this distributed variable keeps track of the gradient of the objective function), making it incompatible with an asynchronous framework. (Zhang & You, 2019) uses a GT-like procedure, which do not verify this property, at the cost of using a buffer (increasing memory requirements) and worse convergence rates (e.g., not accelerated). Furthermore, acceleration requires an undesirable multi-consensus inner loop. We emphasize that (Song et al., 2021) allows to decouple the gradient updates from communication, yet the framework is still synchronous, leading to synchronous communication rates, that asynchrony can improve (see Tab. 1). Decoupling procedures. Decoupling subsequent steps of optimization procedures traditionally leads to speedups (Hendrikx et al., 2021; Hendrikx, 2022; Belilovsky et al., 2020; 2021). This contrasts with methods which couple gradient and gossip updates, so that they happen in a predefined order, i.e., simultaneously (Even et al., 2021a) or sequentially (Kovalev et al., 2021a; Koloskova et al., 2020). 
In decoupled optimization procedures, inner-loops are not desirable as they require an external procedure that can be potentially slow and need a block-barrier instruction during the algorithm s execution (e.g., (Hendrikx et al., 2021)). It means in particular that it is preferrable to avoid approaches such as Catalyst (Lin et al., 2015), multi-consensus steps (Kovalev et al., 2021c) or Tchebychev acceleration of consensus (Scaman et al., 2017). Resistance of a graph. The maximal resistance of a graph is a widely studied quantity, particularly in physics (Klein & Randi c, 1993; Vos, 2016; Klein, 2002), as it is a refined DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization geometric invariant of graphs. The resistance of a graph corresponds to the commute time of a Markov Chain (Chandra et al., 1996). However, beyond the Continuized framework (Even et al., 2021a) or acceleration of consensus problems (Can et al., 2022; Aybat & G urb uzbalaban, 2017; Ghosh et al., 2008), we are unaware of generic, asynchronous, accelerated decentralized optimization procedures that rely on this quantity. Also, it has a more physical interpretation than the Laplacian s norm, as it can be computed via Ohm s and Kirchhoff s Circuit Laws (see (Chandra et al., 1996)). 3. Accelerated Asynchronous Algorithm 3.1. Gossip Framework We consider the problem defined by Eq. 1 in a distributed environment constituted by n nodes whose dynamic is indexed by a continuous time index t R+. Each node has a local memory and can compute a local gradient fi, as well as elementary operations, in an instantaneous manner. As said above, having no delay is less realistic, yet adding them also leads to significantly more difficult proofs whose adaptation to our framework remains largely unclear. Next, we will assume that our computations and gossip result from independent homogeneous P.P.P. with no delay. For the sake of simplicity, we assume that all nodes can compute a gradient at the same rate: Assumption 3.1 (Homogeneous gradient computations). The gradient computations are normalized to fire independently at a rate of 1 computation per time unit. For the i-th worker, we write Ni(t) the corresponding P.P.P. of rate 1, as well as N(t) = (Ni(t))i n. Remark 3.2. The P.P.P Ni(t) on node i means that taking gradient steps at i are discrete events, but the time intervals between two events is a random variable following an exponential law of parameter 1. Thus, the expected waiting time between two gradient steps on the i-th worker is 1 time unit. Next, we model the bandwidth of each connection. For an edge (i, j) belonging to E, the set of edges of a graph we assume connected (3.5), we write Mij(t) the P.P.P. with rate 0 < λij < . When this P.P.P. fires, both nodes share and update their local memories. The rate λij is adjustable locally by machine i while λji is controlled by machine j. While λij and λji may be different, we highlight that the communication process is symmetric, i.e. both nodes update their local memories when either one of the two corresponding P.P.Ps fires. Thus, in the corresponding undirected graph E, each edge (i, j) will fire at a rate of λij + λji. Given our notations, if (i, j) E, then the connection between (i, j) can be thought as a P.P.P. with intensity 0. Taking the λij as edge weights, we introduce the subsequent graph Laplacian, which is the expected Laplacian of our graph: (i,j) E λij(ei ej)(ei ej)T . 
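To make the gossip model above concrete, here is a minimal sketch (our own illustration, not the authors' released code) that builds the expected Laplacian $\Lambda = \sum_{(i,j)\in E} \lambda_{ij}(e_i - e_j)(e_i - e_j)^\top$ from a dictionary of edge intensities and samples the exponential waiting times of the gradient and communication P.P.P.s of Assumption 3.1 and Remark 3.2; the `rates` dictionary and the helper names are assumptions made for the example.

```python
import numpy as np

def expected_laplacian(n, rates):
    """Build Lambda = sum_{(i,j)} lambda_ij (e_i - e_j)(e_i - e_j)^T.

    `rates` maps an edge (i, j) to its adjustable intensity lambda_ij; since the
    communication process is symmetric, (i, j) and (j, i) contribute to the same
    undirected edge, which then fires at rate lambda_ij + lambda_ji.
    """
    L = np.zeros((n, n))
    for (i, j), lam in rates.items():
        e = np.zeros(n)
        e[i], e[j] = 1.0, -1.0
        L += lam * np.outer(e, e)
    return L

def next_event_times(n, rates, rng):
    """Sample the next firing time of each P.P.P. (Remark 3.2): gradient clocks
    tick at rate 1 on every node, edge (i, j) ticks at rate lambda_ij."""
    grad_times = rng.exponential(scale=1.0, size=n)                      # rate 1 per node
    comm_times = {e: rng.exponential(scale=1.0 / lam) for e, lam in rates.items()}
    return grad_times, comm_times

# Toy usage on a 4-node cycle where each undirected edge fires at rate 1.
rng = np.random.default_rng(0)
rates = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
Lam = expected_laplacian(4, rates)
print(np.allclose(Lam @ np.ones(4), 0.0))  # a Laplacian always annihilates the all-ones vector
```

The spectral quantities defined next (χ1 and χ2) can then be read off this matrix, e.g. via its pseudo-inverse `np.linalg.pinv(Lam)`.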
We write Λ P (i,j) E λij(ei ej)(ei ej)T its tensorized counter-part that will be useful for our Lyapunovbased proofs. Following (Scaman et al., 2017), we will compare this quantity to the following projector: 1 i,j n (ei ej)(ei ej)T . In this context, a natural quantity is the algebraic connectivity of our network given by (Kovalev et al., 2021a): χ1 sup x =1,x 1 We might also write χ1[Λ] to avoid confusion, depending on the context. Next, the maximal effective resistance of the network, as in (Even et al., 2021a; Ellens et al., 2011), is: 2 sup (i,j) E (ei ej)TΛ+(ei ej) . A standard quantity (Scaman et al., 2017), which is used to control the number of synchronous gossips steps, is the spectral gap, given by: We also introduce the value κ, the ratio of communication frequency between the fastest and slowest edges: κ sup(i,j) E λij + λji inf(i,j) E λij + λji . This ratio is typically bounded, for instance, in the case of a graph with constant edge weights or for λij = 1 di with di the degree of the i-th node in a bounded degree graph or a regular graph. We prove the following Lemma (proved in Appendix A), which is useful to control χ1, χ2 and compare our bounds with works that rely on the spectral gap of a graph: Lemma 3.3 (Effective resistance). The spectrum of Λ is nonnegative. Also, we have χ1 = + iff E is not a connected graph. If the graph is connected, then: Tr Λ χ2 χ1 . Furthermore, we also have the following: χ1χ2Tr Λ ρ q DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization The last part of this Lemma indicates that our method requires less communications than methods depending on the spectral gap when no degenerated behavior on the graph s connectivity happens, i.e. when κ is adequately bounded, which is a standard assumption (Hendrikx et al., 2019a). Remark 3.4. For synchronous frameworks (Kovalev et al., 2021a; Scaman et al., 2017), the spectral quantity extracted from their gossip matrix is a given measure of the connectedness of their graphs that is used afterwards to deduce the right number of synchronous communication rounds between two gradient steps. In our framework, Λ directly contains the information of both the topology E and the edge communication rates λij, thus χ1[Λ] must rather be understood as an indicator of how well the graphs connectedness and the chosen communication strategy interact. In fact, we will see later that there is a condition on χ1, χ2 for DADAO to converge, see Remark 3.7 for further discussion. Assumption 3.5 (Connected graph). The set of edges E defines a connected graph such that χ1[Λ] < . 3.2. Dynamic to optimum Next, we follow a standard approach (Kovalev et al., 2020; 2021a; Salim et al., 2022; Hendrikx, 2022) for solving Eq. 1 (see Appendix B for details), leading to studying, for 0 < ν < µ, the following Lagrangian: inf x Rn d sup y Rn d i=1 fi(xi) ν 2 x 2 x, y 1 2ν πz + y 2. For f(x) = Pn i=1 fi(xi), the saddle points (x , y , z ) of the above Lagrangian are given by: f(x ) νx y = 0 y + πz + νx = 0 πz + πy = 0 . (2) Our algorithm is based on a fixed-point algorithm to obtain those saddle points, which is a similar idea to (Kovalev et al., 2021b), which is only restricted to a setting without communication acceleration. Furthermore, contrary to (Kovalev et al., 2021a), we do not employ a Forward-Backward algorithm, which requires both an extra-inversion step and additional regularity on the considered proximal operator. 
Not only does this condition not hold in this particular case, but this is not desirable in a continuized framework where iterates are not ordered in a predefined sequence and require a local descent at each instant. Another major difference is that no Error-Feedback is required by our approach, which allows unlocking asynchrony while simplifying the proofs and decreasing the required number of communications. Instead, we show it is enough to incorporate a standard fixed point algorithm, without any specific preconditioning (see (Condat et al., 2019)). We consider the following dynamic: dxt = η( xt xt)dt γ( f(xt) νxt yt) d N(t) d xt = η(xt xt)dt γ( f(xt) νxt yt) d N(t) d yt = θ(yt + zt + ν xt)dt + (δ + δ)( f(xt) νxt yt)d N(t) dyt = α( yt yt)dt (3) dzt = α( zt zt)dt (i,j) E (ei ej)(ei ej)T(yt + zt)d Mij(t) d zt = α(zt zt)dt (i,j) E (ei ej)(ei ej)T(yt + zt)d Mij(t) , where ν, η, η, γ, α, α, θ, δ, δ, β, β are undetermined realvalued parameters. As in (Nesterov, 2003), variables are paired to obtain a Nesterov acceleration. The variables (x, y) allow decoupling the gossip steps from the gradient steps using independent P.P.P.s. Furthermore, the Lebesgue integrable path of yt does not correspond to a standard momentum, as in a continuized framework (Even et al., 2021a); however, it turns out to be a crucial component of our method. Compared to (Kovalev et al., 2021a), no extra multi-consensus step needs to be integrated. Our formulation of an asynchronous gossip step is similar to (Even et al., 2021a), which introduces a stochastic variable on edges; however, contrary to this work, our gossip and gradient computations are decoupled. We emphasize that while the dynamic 3 is formulated using SDEs (Arnold, 1974), which brings the power of the continuous-time analysis toolbox, it is still event-based and thus discrete in nature. Hence, the dynamic can be efficiently implemented in practice as explained in Sec. 4.1 and Appendix E. 3.3. Theoretical guarantees We prove the following in Appendix C. Theorem 3.6. Assume each fi is µ-strongly convex and Lsmooth. Assume 3.1, 3.5, and that 2χ1χ2 1. Then there exists some parameters for the dynamic Eq. (3) (given in Appendix C.1), such that for any initialization x0 ker(π), and x0 = x0, y0 = y0 = f(x0) µ 2 x0, z0 = z0 = πy0, we get for t R+: E[ xt x 2] (1 8 L µ + 2L2 µ2 ) x0 x 2e t 8 Also, the expected number of oracle gradient calls is nt and the expected number of edges activated is: t The condition 2χ1χ2 1 can simply be understood as whether or not the chosen communication strategy suits sufficiently well the graph topology E for DADAO to converge DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization using the expected rates of communication (λij)(i,j) E compared to the expected rate of one gradient step per time unit on each node we assumed in Assumption 3.1. In fact, we make the following interpretation: Remark 3.7. Introducing λ P (ij) E λij and PΛ 2Λ Tr Λ, we write Λ as the product of these two more interpretable quantities to gain more insight on the condition 2χ1[Λ]χ2[Λ] 1: Λ = λPΛ. (4) In this setting, λ is the expected rate of communication over the whole graph, while PΛ can be interpreted as the Laplacian of E with each edge weighted with its probability of having spiked given a communication happened in the graph. We have: χ1[Λ] = χ1[PΛ] λ ; χ2[Λ] = χ2[PΛ] PΛ being normalized, we could say that the quantities χ1[PΛ], χ2[PΛ] characterize the graph s connectivity while λ is the global rate of communication. 
Then, using (5), the condition 2χ1[Λ]χ2[Λ] 1 is equivalent to saying p 2χ1[PΛ]χ2[PΛ] λ, (6) meaning that the global communication rate should be larger than some spectral quantity quantifying the graph s connectivity. If structural constraints on the network makes it impossible to verify this condition, as the notion of time is only defined through the rate of gradient steps given by Assumption 3.1 (λ can be interpreted as the expected number of communications in the graph between the expected duration separating two subsequent gradient steps on a given node ), it only means that the gradient steps are happening too fast compared to the ability of the network to communicate, and the rate of gradient steps should be slow-down by a factor of 2χ1χ2. Thus, given a graph topology, it is straightforward to obtain a Laplacian verifying 2χ1χ2 1. Indeed, we have the following: Remark 3.8. For any graph topology E and any choice of corresponding graph Laplacian L verifying χ1[L] bounded, there is a communication strategy (λij)(i,j) E such that Λ, the Laplacian of E with edges (i, j) E weighted with their expected communication rate λij, verifies 2χ1[Λ]χ2[Λ] 1. One such example is given by: 2χ1[L]χ2[L]L . This property allows us to use in DADAO the Laplacians introduced in previous work, to which we can now compare. It leads to the following proposition (proved in Appendix D), which shows that DADAO obtains better complexities than concurrent works while starting from the same family of Laplacians: Proposition 3.9 (Comparison with concurrent work). If κ = O( | E| n ), then DADAO obtains better communication rate than MSDA (Scaman et al., 2017), and requires strictly less communications for the complete graph. For any fixed Laplacian valid for ADOM+ (Kovalev et al., 2021a), DADAO obtains a better communication rate than ADOM+, and requires strictly less communications for the complete graph. If κ = O( | E| n ), for a valid Laplacian using the Gossip matrix of OGT (Song et al., 2021) or AGT (Li & Lin, 2021), DADAO obtains a better communication rate than both Gradient Tracking methods, and requires strictly less communications for the complete graph. DADAO requires fewer gradient computations than the Continuized framework (Even et al., 2021b), and has a strictly better computation rate for the cycle graph. We highlight that (Scaman et al., 2017) claimed that their algorithm is optimal because they study the number of computations and synchronized gossips on a worst-case graph; our claim is, by nature different, as we are interested in the number of edges fired rather than the number of synchronized gossip rounds. Indeed, in an asynchronous framework, there is no notion of round of communication, which allows this framework to enjoy the advantageous rates of randomized procedures. Moreover, not only this measure of complexity is standard in asynchronous frameworks (e.g., see (Boyd et al., 2006)), but also, if we are aiming for the most frugal procedure, minimizing both the total number of computations and communications is of interest. Tab. 3.3 predicts the behavior of our algorithm on various classes of graphs encoded via a normalized Laplacian. It shows that systematically, our algorithm leads to the best complexities. For example, in the case of a complete graph, one synchronized gossip round requires a total of |E| = O(n2) communications whereas we show that a rate of only O(n) communications per time unit suffices for DADAO in this case. We note that the graph class depicted in Tab. 
3.3 was used as worst-case examples for proving the optimality of (Scaman et al., 2017) in a synchronous context. 4. Practical implementation 4.1. Algorithm To study the trajectories of Xt (xt, xt, yt), Yt (yt, zt, zt), we use the following, equivalent to (3): d Xt = a1(Xt, Yt)dt + b1(Xt)d N(t) (7) d Yt = a2(Xt, Yt)dt + X (i,j) E bij 2 (Yt)d Mij(t) , DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Table 2. Complexity for various graphs using 1 L L with L the standard Laplacian with unit edge weights. In this case, ρ = χ1. The complexities are reported per time unit so that all algorithms reach ϵ-precision at the same time. We have, respectively, for a star/line or cyclic/complete graph and the d-dimensional grid: χ1 = O(n), χ2 = O(n), Tr = O(1) / χ1 = O(n2), χ2 = O(1), Tr = O(n) / χ1 = O(1), χ2 = O(1), Tr = O(n) / χ1 = O(n2/d), χ2 = O(1), Tr = O(n). Method # edges activated per time unit # gradients computed per time unit Graph Star Line Complete d-grid Star Line Complete d-grid (Kovalev et al., 2021a) ADOM+ n2 n3 n2 n1+2/d n n n n (Scaman et al., 2017) MSDA n3/2 n2 n2 n1+1/d n n n n (Even et al., 2021a) Continuized n n2 n n1+1/d n n2 n n1+1/d Centralized n - - - n - - - DADAO (ours) n n2 n n1+1/d n n n n where a1, a2, b1 = (bi 1)i, (bij 2 )ij are smooth functions. We now describe the algorithm used to implement the dynamics of Eq. (3). Let us write T (i) 1 < T (i) 2 < ... < T (i) k < ... the time of the k-th event on the i-th node, which is either an edge activation, either a gradient update. We remind that the spiking times of a specific event correspond to random variables with independent exponential increments and can thus be generated at the beginning of our simulation. They can also be generated on the fly and locally to stress the locality and asynchronicity of our algorithm. We write Xt = (X(i) t )i and Yt = (Y (i) t )i, then on the i-th node, at the k-th iteration, we integrate on [T (i) k ; T (i) k+1] the ODE d Xt = a1(Xt, Yt)dt d Yt = a2(Xt, Yt)dt to define the values right before the spike. One can easily find the matrix A R6 6 such that: T (i) k+1 Y (i) = exp (T (i) k+1 T (i) k )A T (i) k Y (i) Next, if one has a gradient update, then: T (i) k+1 = X(i) T (i) k+1 + b1 Otherwise, if the edge (i, j) or (j, i) is activated, a communication bridge is created between both nodes i and j. In this case, the local update on i writes: T (i) k+1 = Y (i) T (i) k+1 + b2 T (i) k+1 , Y (j) Note that, even if this event takes place along an edge (i, j), we can write it separately for nodes i and j by making sure they both have the events T (i) ki = T (j) kj , for some ki, kj N corresponding to this communication. As advocated, all those operations are local, and we summarize in the Alg. 1 the algorithmic block corresponding to our implementation. See Appendix E for more details. Algorithm 1: This algorithm block describes our implementation on each local machine. The ODE routine is described by Eq. 8. Input: On each machine i {1, ..., n}, gradient oracle fi, parameters µ, L, χ1, tmax. 1 Initialize on each machine i {1, ..., n}: 2 Set X(i), Y (i), T (i) to 0 and set A via Eq. 
(75); 3 Synchronize the clocks of all machines ; 4 In parallel on workers i ∈ {1, ..., n}, while t < tmax, continuously do: 5 t ← clock() ; 6 if there is an event at time t then 7 (X(i), Y(i)) ← ODE(A, t − T(i), X(i), Y(i)) ; 8 if the event is to take a gradient step then 9 X(i) ← X(i) + b1(X(i)) ; 10 else if the event is to communicate with j then 11 Y(i) ← Y(i) + b2(Y(i), Y(j)) ; // Happens at j simultaneously. 12 T(i) ← t ; 13 return (x(i)_tmax)_{1 ≤ i ≤ n}.

4.2. Numerical results

In this section, we study the behavior of our method in a standard experimental setting (e.g., see (Kovalev et al., 2021a; Even et al., 2021a)). In order to compare to methods using the gradient of the Fenchel conjugate (Even et al., 2021a) in our experiments, we restrict ourselves to a situation where it is easily computable. Thus, we perform the empirical risk minimization for the decentralized linear regression task given by:

$$f_i(x) = \sum_{j=1}^{m} (a_{ij}^\top x - c_{ij})^2\,, \qquad (9)$$

where $a_{ij} \in \mathbb{R}^d$ and $c_{ij} \in \mathbb{R}$ correspond to the m local data points stored at node i. We follow a protocol similar to (Kovalev et al., 2021a): we generate n independent synthetic datasets with the make_regression function of scikit-learn (Pedregosa et al., 2011), each worker storing m = 100 data points. We recall that the metrics of interest are the total number of local gradient steps and the total number of individual messages exchanged (i.e., the number of edges that fired) to reach an ϵ-precision. We systematically used the proposed hyper-parameters of each reference paper for our implementation without any specific fine-tuning.

Figure 1. Comparison between ADOM+ (Kovalev et al., 2021a), the continuized framework (Even et al., 2021a), MSDA (Scaman et al., 2017) and DADAO, using the same data for the linear regression task, and the same graphs (from left to right: line with n = 150, complete with n = 250). The vertical axes report the average distance of the local iterates to the optimum.

Comparison between all methods. We fix the Laplacian matrix via the formula of Remark 3.8 to compare simultaneously to the continuized framework (Even et al., 2021a), MSDA (Scaman et al., 2017) and ADOM+ (Kovalev et al., 2021a). We compare ourselves to both versions of ADOM+: with and without the Multi-Consensus (M.-C.). We report in Fig. 1 results corresponding to the complete graph with n = 250 nodes and the line graph of size n = 150. While sharing the same asymptotic rate (see Fig. 2 for experimental confirmation), we note that the continuized framework (Even et al., 2021a) and MSDA (Scaman et al., 2017) have better absolute constants than DADAO, giving them an advantage both in terms of the number of communication and gradient steps. However, in the continuized framework, the gradient and communication steps being coupled, the number of gradient computations can potentially be orders of magnitude larger than for our algorithm, which is reflected in Fig. 1 for the line graph. As for MSDA, Tab. 3.3 showed it does not have the best communication rates on certain classes of graphs, as confirmed on the right of Fig. 1 for MSDA and in Fig. 2. Thanks to its M.-C. procedure, ADOM+ can significantly reduce the number of necessary gradient steps. Yet, consistently with our analysis in Prop. 3.9, our method is systematically better in all settings in terms of communications.
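For completeness, the following sketch reproduces the data-generation side of the protocol above: each of the n workers holds an independent make_regression dataset of m = 100 points, the global least-squares minimizer x* is obtained from the normal equations, and the reported metric is the average squared distance of the local iterates to x*. The dimension d, the noise level, and the helper names are assumptions of this illustration, not values taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_regression

def make_local_datasets(n_workers, m=100, d=10, seed=0):
    """One independent synthetic least-squares dataset (A_i, c_i) per worker."""
    datasets = []
    for i in range(n_workers):
        A, c = make_regression(n_samples=m, n_features=d, noise=1.0, random_state=seed + i)
        datasets.append((A, c))
    return datasets

def global_minimizer(datasets):
    """Closed-form minimizer of sum_i sum_j (a_ij^T x - c_ij)^2 via the normal equations."""
    d = datasets[0][0].shape[1]
    H, g = np.zeros((d, d)), np.zeros(d)
    for A, c in datasets:
        H += A.T @ A
        g += A.T @ c
    return np.linalg.solve(H, g)

def suboptimality(local_iterates, x_star):
    """Average squared distance of the local iterates to x*, the metric tracked in the figures."""
    return np.mean([np.linalg.norm(x - x_star) ** 2 for x in local_iterates])

# Toy usage with n = 4 workers, starting all local iterates at zero.
data = make_local_datasets(n_workers=4)
x_star = global_minimizer(data)
print(suboptimality([np.zeros_like(x_star)] * 4, x_star))
```

Any of the compared optimizers can then be monitored by evaluating `suboptimality` on its local iterates along the run, against the number of gradient steps and of fired edges.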
Further comparison between DADAO and MSDA. To experimentally confirm that our communication complexity is better than that of accelerated methods using the spectral gap, and to abstract away the better absolute constants of MSDA, we ran DADAO and MSDA on the task of distributed linear regression for star graphs of size n ∈ {10, 20, 70, 200, 300, 1000, 2000}. We considered the evolution of the average distance to the optimum with the number of gradient steps and communication steps in log scale for each run, and computed the slope of each line. For each graph size, we report in Fig. 2 the ratio between the slope for DADAO and the slope for MSDA.

Figure 2. Ratio between the slopes in log scale of DADAO and MSDA for star graphs of size n ∈ {10, 20, 70, 200, 300, 1000, 2000} (two panels, computation rate and communication rate, as a function of the size n of the star graph).

We remark that the ratio between the gradient complexities of DADAO and MSDA is indeed O(1) (with a constant value of 1/14), while MSDA is indeed O(√n) worse than DADAO for communications on the star graph, as stated in Tab. 3.3.

Discussion. We do not claim the optimality of the absolute constant in DADAO's rate (the 8√2 term in the exponential of Th. 3.6), which might be further improved with a tighter analysis and different values of the parameters. However, our focus being on large-scale problems, we are concerned with asymptotic rates rather than with the effect of the absolute constants. Thus, even though Fig. 1 may sometimes display a faster convergence for the continuized framework (Even et al., 2021a) and MSDA (Scaman et al., 2017) in terms of number of steps, we remind that both (Even et al., 2021a; Scaman et al., 2017) use expensive evaluations of dual gradients whereas we only use primal gradients, and Tab. 3.3 confirms that DADAO has at least their asymptotic rate. On the star graph, where Tab. 3.3 showed a strict advantage for DADAO compared to MSDA, Fig. 2 confirms that our convergence rate meets the one of (Scaman et al., 2017) for n ≤ 200, and is strictly better for larger graphs. Finally, we note that, our algorithm falling into the randomized category, further improvements may be theoretically possible. Indeed, (Woodworth & Srebro, 2016) showed that the lower bound on the number of grad-and-prox oracle accesses for the optimization of composite objectives with deterministic methods is in $\mathcal{O}(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon})$, which is attained by works such as (Scaman et al., 2017). However, for randomized algorithms, the lower bound is in $\mathcal{O}(n + \sqrt{n\frac{L}{\mu}}\log\frac{1}{\epsilon})$ and, to the best of our knowledge, no algorithm meeting this bound is known to date in the decentralized setting, and in particular, in the asynchronous one. In conclusion, while several methods can share similar asymptotic convergence rates, ours is the only one to perform at least as well as its competitors in every setting for different graph topologies, as predicted by Tab. 1.

5. Conclusion

In this work, we have proposed a novel stochastic algorithm for the decentralized optimization of a sum of smooth and strongly convex functions. We have demonstrated, theoretically and empirically, that this algorithm systematically leads to a substantial acceleration compared to state-of-the-art works. Furthermore, our algorithm is asynchronous, decoupled, primal, and does not rely on an extra inner loop: each of these properties makes it suitable for real applications.
In future work, we would like to explore the robustness of such algorithms to more challenging variabilities occurring in real-life applications such as time-varying networks and to extend our work to less regular functions and stochastic analysis. Acknowledgements This work was supported by Project ANR-21-CE23-0030 ADONIS and EMERG-ADONIS from Alliance SU. This work was granted access to the HPC/AI resources of IDRIS under the allocation AD011013743 made by GENCI. In addition, the authors would like to thank Rapha el Berthier, Mathieu Even, Hadrien Hendrikx, and Dmitry Kovalev for their helpful discussions. Arnold, L. Stochastic differential equations. New York, 1974. Aybat, N. S. and G urb uzbalaban, M. Decentralized computation of effective resistances and acceleration of consensus algorithms. In 2017 IEEE Global Conference on Signal and Information Processing (Global SIP), pp. 538 542, 2017. doi: 10.1109. Belilovsky, E., Eickenberg, M., and Oyallon, E. Decoupled greedy learning of CNNs. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 736 745. PMLR, 13 18 Jul 2020. Belilovsky, E., Leconte, L., Caccia, L., Eickenberg, M., and Oyallon, E. Decoupled greedy learning of cnns for synchronous and asynchronous distributed learning. ar Xiv preprint ar Xiv:2106.06401, 2021. Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508 2530, 2006. doi: 10.1109/TIT. 2006.874516. Can, B., Soori, S., Aybat, N. S., Dehnavi, M. M., and G urb uzbalaban, M. Randomized gossiping with effective resistance weights: Performance guarantees and applications. IEEE Transactions on Control of Network Systems, 9(2):524 536, 2022. Chandra, A. K., Raghavan, P., Ruzzo, W. L., Smolensky, R., and Tiwari, P. The electrical resistance of a graph captures its commute and cover times. computational complexity, 6(4):312 340, 1996. Condat, L., Kitahara, D., Contreras, A., and Hirabayashi, A. Proximal splitting algorithms for convex optimization: A tour of recent advances, with new twists, 2019. Ellens, W., Spieksma, F. M., Van Mieghem, P., Jamakovic, A., and Kooij, R. E. Effective graph resistance. Linear algebra and its applications, 435(10):2491 2506, 2011. Even, M., Hendrikx, H., and Massouli e, L. Asynchrony and acceleration in gossip algorithms, 2020. Even, M., Berthier, R., Bach, F., Flammarion, N., Hendrikx, H., Gaillard, P., Massouli e, L., and Taylor, A. A continuized view on nesterov acceleration for stochastic gradient descent and randomized gossip. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021a. DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Even, M., Hendrikx, H., and Massoulie, L. Decentralized optimization with heterogeneous delays: a continuoustime approach. ar Xiv preprint ar Xiv:2106.03585, 2021b. Ghosh, A., Boyd, S., and Saberi, A. Minimizing effective resistance of a graph. SIAM Review, 50(1):37 66, 2008. doi: 10.1137/050645452. Hendrikx, H. A principled framework for the design and analysis of token algorithms, 2022. Hendrikx, H., Bach, F., and Massouli e, L. An accelerated decentralized stochastic proximal algorithm for finite sums. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. 
Curran Associates, Inc., 2019a. Hendrikx, H., Bach, F., and Massouli e, L. Accelerated decentralized optimization with local updates for smooth and strongly convex objectives. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 897 906. PMLR, 2019b. Hendrikx, H., Bach, F., and Massouli e, L. Dual-free stochastic decentralized optimization with variance reduction. Advances in neural information processing systems, 33: 19455 19466, 2020. Hendrikx, H., Bach, F., and Massouli e, L. An optimal algorithm for decentralized finite-sum optimization. SIAM Journal on Optimization, 31(4):2753 2783, 2021. doi: 10.1137/20M134842X. Klein, D. J. Resistance-distance sum rules. Croatica chemica acta, 75(2):633 649, 2002. Klein, D. J. and Randi c, M. Resistance distance. Journal of mathematical chemistry, 12(1):81 95, 1993. Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pp. 5381 5393. PMLR, 2020. Koloskova, A., Lin, T., and Stich, S. U. An improved analysis of gradient tracking for decentralized machine learning. Advances in Neural Information Processing Systems, 34:11422 11435, 2021. Kovalev, D., Salim, A., and Richtarik, P. Optimal and practical algorithms for smooth and strongly convex decentralized optimization. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 18342 18352. Curran Associates, Inc., 2020. Kovalev, D., Gasanov, E., Gasnikov, A., and Richt arik, P. Lower bounds and optimal algorithms for smooth and strongly convex decentralized optimization over timevarying networks. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021a. Kovalev, D., Gasnikov, A., and Richt arik, P. Accelerated primal-dual gradient method for smooth and convexconcave saddle-point problems with bilinear coupling. ar Xiv preprint ar Xiv:2112.15199, 2021b. Kovalev, D., Shulgin, E., Richt arik, P., Rogozin, A. V., and Gasnikov, A. Adom: accelerated decentralized optimization method for time-varying networks. In International Conference on Machine Learning, pp. 5784 5793. PMLR, 2021c. Last, G. and Penrose, M. Lectures on the Poisson process, volume 7. Cambridge University Press, 2017. Latz, J. Analysis of stochastic gradient descent in continuous time. Statistics and Computing, 31(4):1 25, 2021. Leblond, R., Pedregosa, F., and Lacoste-Julien, S. Improved asynchronous parallel optimization analysis for stochastic incremental methods. ar Xiv preprint ar Xiv:1801.03749, 2018. Li, H. and Lin, Z. Accelerated gradient tracking over timevarying graphs for decentralized optimization. ar Xiv preprint ar Xiv:2104.02596, 2021. Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. Advances in neural information processing systems, 28, 2015. Mishchenko, K., Malinovsky, G., Stich, S., and Richt arik, P. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! ar Xiv preprint ar Xiv:2202.09357, 2022. Nedic, A., Olshevsky, A., and Shi, W. Achieving geometric convergence for distributed optimization over timevarying graphs. SIAM Journal on Optimization, 27(4): 2597 2633, 2017. Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003. 
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Salim, A., Condat, L., Kovalev, D., and Richt arik, P. An optimal algorithm for strongly convex minimization under affine constraints. In International Conference on Artificial Intelligence and Statistics, pp. 4482 4498. PMLR, 2022. Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massouli e, L. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3027 3036. PMLR, 06 11 Aug 2017. Song, Z., Shi, L., Pu, S., and Yan, M. Optimal gradient tracking for decentralized optimization. ar Xiv preprint ar Xiv:2110.05282, 2021. Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Sgd with delayed gradients. Journal of Machine Learning Research, 21(237):1 36, 2020. Vos, V. S. S. Methods for determining the effective resistance. Mestrado, Mathematisch Instituut Universiteit Leiden, 2016. Woodworth, B. E. and Srebro, N. Tight complexity bounds for optimizing composite objectives. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. Zhang, J. and You, K. Fully asynchronous distributed optimization with linear convergence in directed networks, 2019. DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Table of Contents A Communication bounds 12 B Saddle Point Reformulation 13 C Proof of the main Theorem (Th. 3.5) 13 C.1 Proof of the Lemma C.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 D Comparison of complexities with related work 19 E Practical Implementation 22 E.1 Simulating the Poisson Point Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 E.2 Pseudo Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 A. Communication bounds The following properties will be used all along the proofs of the Lemmas and Theorems and are related to the communication of our nodes. Lemma A.1. Under the assumptions of Theorem 3.6, if z0, z0 span(π), then zt, zt span(π) almost surely. Proof. It s clear that for any i, j, we get: π(ei ej)(ei ej)T = (ei ej)(ei ej)T . Thus, the variations of (zt, zt) belong to span(π), and so is the trajectory. Lemma A.2 (Effective resistance contraction). For (i, j) E and any x Rd, we have: (ei ej)(ei ej)Tx 2 Λ+ χ2 (ei ej)(ei ej)Tx 2 . Proof. 
Indeed, we note that: (ei ej)(ei ej)Tx 2 Λ+ = x T(ei ej)(ei ej)TΛ+(ei ej)(ei ej)Tx (10) 2χ2x T(ei ej)(ei ej)Tx (11) = χ2 (ei ej)(ei ej)Tx 2 (12) Lemma A.3 (Bound on the resistance). For (i, j) E, (ei ej)TΛ+(ei ej) 1 λij+λji . Proof. For a graph such that | E| = 1, the inequality is trivial. Now, we assume that there are other edges than (i, j) or (j, i). Thus, we have: Λ (λij + λji)(ei ej)(ei ej)T + λ π where π(ei ej) = 0, π1 = 0, rank( π) = n 1, π orthogonal projector and λ > 0 small enough. In this case: (λij + λji)(ei ej)(ei ej)T + λ π + Λ+, but this implies that 1 λij+λji (ei ej)TΛ+(ei ej). DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Proof of Lemma 3.3. First, we note that Λ is symmetric and has a non-negative spectrum, as: (ij) E λij xi xj 2 . From this, we also clearly see that χ1 = + iff the graph is disconnected. Next, assuming that the graph is connected, 0 is an eigenvalue of Λ with multiplicity 1 and by definition of χ1, we have χ1 χ2. As we also have the following: X (i,j) E λij(ei ej)TΛ+(ei ej) = Tr (Λ+Λ) = n 1 , we can write: n 1 2χ2 X (i,j) E λij = χ2Tr Λ and get n 1 Tr Λ χ2. Finally, note that: Tr (Λ) = 2 P (i,j) E λij 2| E| sup(i,j) E(λij + λji) and Tr (Λ) (n 1) Λ , so that, using Lemma A.3: χ2Tr(Λ) 1 p 2 inf(i,j) E(λij + λji) Tr Λ 2 inf(i,j) E(λij + λji) 2| E| sup (i,j) E (λij + λji) (14) (n 1) Λ (15) B. Saddle Point Reformulation With 0 < ν < µ and introducing an extra dual variable ˆx, we get: i=1 fi(x) = inf x,ˆx Rn d x=ˆx,πˆx=0 i=1 fi(xi) ν = inf x,ˆx Rn d sup y,z Rn d i=1 fi(xi) ν 2 ˆx 2 + y, ˆx x + z, πˆx = inf x Rn d sup y,z Rn d inf ˆx Rn d i=1 fi(xi) ν 2 ˆx 2 + y, ˆx x + z, πˆx = inf x Rn d sup y,z Rn d i=1 fi(xi) ν 2 x 2 x, y 1 2ν πz + y 2 . C. Proof of the main Theorem (Th. 3.5) Basic reminds about the Bregman divergence For a smooth convex function F, d F (x, y) F(x) F(y) F(y), x y is its Bregman divergence. Next, we set ν = µ 2 such that, for F(x) = Pn i=1 fi(xi) ν 2 x 2, then F is L-smooth, ν-strongly convex, and we get: 1 2L F(x) F(y) 2 d F (x, y) L and ν 2 x y 2 d F (x, y) 1 2ν F(x) F(y) 2 , DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Proof of Theorem 3.6. We introduce for a positive semi-definite matrix A, x 2 A x TAx. For our proof, we rely on the notation of Eq. 7, and we introduce X (x, x, y), Y (y, z, z) and the following Lyapunov potential: Φ(t, X, Y ) Atd F (x, x ) + At x x 2 + Bt y y 2 + Bt y y 2 + Ct z + y z y 2 + Ct z z 2 Λ+ , where At, At, Bt, Bt, Ct, Ct are non-negative functions to be defined. We will use this potential to control the trajectories. Because Φ is smooth, the SDE is a smooth trajectory, we get via Ito s formula (Last & Penrose, 2017) applied to the semi-martingale (Xt, Yt) on any intervals [0, T]: Φ(T, XT , YT ) = Φ(0, X0, Y0) + Z T 0 Φ(t, Xt, Yt), 1 a1(Xt, Yt) a2(Xt, Yt) Φ(t, Xt + bi 1(Xt), Yt) Φ(t, Xt, Yt) dt Φ(t, Xt, Yt + bij 2 (Yt)) Φ(t, Xt, Yt) λijdt where the following quantity is a Martingale: Φ(t, Xt , Yt + bi 1(Xt )) Φ(u, Xt , Yt ) (d Ni(t) dt) Φ(t, Xt + bij 2 (Xt ), Yt ) Φ(t, Xt , Yt ) (d Mij(t) λijdt) . Taking the expectation, we get that, as the initialization is deterministic: E[Φ(T, XT , YT )] = Φ(0, X0, Y0) + Z T 0 Φ(t, Xt, Yt), 1 a1(Xt, Yt) a2(Xt, Yt) Φ(t, Xt + bi 1(Xt), Yt) Φ(t, Xt, Yt) dt Φ(t, Xt, Yt + bij 2 (Yt)) Φ(t, Xt, Yt) λijdt , To show that the integrand term is negative, we will use the following technical Lemma, which is also difficult to prove and whose proof is deferred to Appendix C.1: Lemma C.1. 
If: L γ = 1 4L δ = 1 νL δ = 1 α = 1 L β = 2χ1[Λ] q L Bt = 1 8Le Bt = 1 16Le L Ct = 1 2µe L Ct = 1 32χ1Le DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Φ(t, Xt, Yt), 1 a1(Xt, Yt) a2(Xt, Yt) + Φ(t, Xt + b1(Xt), Yt) Φ(t, Xt, Yt) (i,j) E λij Φ(t, Xt, Yt + bij 2 (Yt)) Φ(t, Xt, Yt) 0 a.s. . Now, we remark that if we have F(x) = f(x) µ 4 x 2, and initialize with x0 = x0, y0 = y0 = F(x0) and z0 = z0 = π F(x0), then, given the linear relation between At, At, Bt, Bt, Ct, Ct, the L smoothness and the fact π is an orthogonal projection, we get: Φ(0, X0, Y0) d F (x0, x ) + µ 8 x0 x 2 + 1 16L F(x0) F(x ) 2 8L F(x0) F(x ) 2 + 1 2µ (I π)( F(x0) F(x )) 2 32 χ1 Lχ1 π( F(x0) F(x )) 2 In particular, as µ 4 x x 2 d F (x, x ) it implies that: E[ xt x 2] (1 8 L µ + 2L2 µ2 ) x0 x 2e t 8 Finally, we note that the expected number of gradients between [0, T] is given by: i=1 Ni(T)] = n T , and similarly, the number of edges activated is given by: 1 i,j n Mij(T)] = X 0 λij dt = T C.1. Proof of the Lemma C.1 We first state a couple of inequalities that we will combine to obtain a bound on our Lyapunov function. In all this section, for a given variable x, we denote by x+ the value of x right after a Poisson update. Lemma C.2. First: ϕA At(d F (x+, x ) d F (x, x )) + At( x+ x 2 x x 2) + ηAt x x, F(x) F(x ) + 2 η At x x, x x (16) F(x) y 2 At Lγ2 2 Atγ + At γ2 + Atγ F(x) y, y y + 2 γ At y y , x x (17) 2 γ At (d F ( x, x ) + d F (x , x) d F ( x, x)) ηAt(d F ( x, x) + d F (x, x ) d F ( x, x )) At η x x 2 + At η x x 2 DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Proof. First, we have to use optimality conditions and smoothness, as well as the separability of F: d F (x+, x ) d F (x, x ) = d F (x+, x) x+ x, F(x ) F(x) (18) 2 x+ x 2 x+ x, F(x ) F(x) (19) 2 y F(x) 2 γ F(x) y 2 + γ F(x) y, y y (20) Next, we note that, again using optimality conditions: x+ x 2 x x 2 = 2 x+ x, x x + x+ x 2 (21) = 2 γ F(x) y, x x + γ2 F(x) y 2 (22) = 2 γ F(x) F(x ), x x + 2 γ y y , x x + γ2 F(x) y 2 (23) = 2 γ(d F ( x, x ) + d F (x , x) d F ( x, x)) + 2 γ y y , x x + γ2 F(x) y 2 (24) Momentum in x associated with the term d F (x, x ) gives: η x x, F(x) F(x ) = η(d F ( x, x) + d F (x, x ) d F ( x, x )) (25) and momentum in x associated with x x 2 leads to: 2 η x x, x x = 2 η x x 2 + 2 η x x , x x η x x 2 + η x x 2 (26) Lemma C.3. Next, we show that if αBt = δ ϕB Bt( y+ y 2 y y 2) + Bt( y+ y 2 y y 2) + 2αBt y y , y y 2θ Bt y + z + ν x, y y + 2αCt y y, z + y y z (27) 2 Bt y y 2 δ 2 Bt y y 2 2 δ Bt F(x) y, y y + δ Bt F(x) F(x ) 2 + (δ + δ)2 δ Bt F(x) y 2 2θ Bt y + z y z , y y 2θν Bt x x , y y + 2αCt y y, z + y y z (28) Proof. Using optimality conditions: y+ y 2 y y 2 = 2 y y , y+ y + y+ y 2 (29) = 2δ F(x) y, y y + 2 δ F(x) y, y y + (δ + δ)2 F(x) y 2 (30) = 2 δ F(x) y, y y + δ F(x) F(x ) 2 δ y y 2 + (δ + δ)2 δ F(x) y 2 (31) The momentum in y associated with the term y y 2 gives: 2θ Bt y + z + ν x, y y = 2θ Bt y + z y z , y y 2θν Bt x x , y y (32) The momentum in y associated with the term y y 2 gives: 2αBt y y, y y = αBt y y 2 αBt y y 2 + αBt y y 2 (33) and the one associated with y + z y z 2: 2αCt y y, z + y y z (34) DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Lemma C.4. 
Finally, assuming θ Bt = β Ct = αCt, letting 1 τ > 0, z+ ij = z β(ei ej)(ei ej)T(y + z) and z+ ij = z β(ei ej)(ei ej)T(y + z): ϕC 2θ Bt y + z y z , y y + 2αCt y y, z + y y z X ij λij Ct y + z+ ij y z 2 y + z y z 2 ij λij Ct z+ ij z 2 Λ+ z z 2 Λ+ + 2 α Ct z z, z z Λ+ (35) + 2αCt z + y z y, z + y y z 2θ Bt y + z y z , y y (i,j) E λij (ei ej)(ei ej)T(y + z) 2 + β(β 1)Ct X (i,j) E λij (ei ej)(ei ej)T(y + z) 2 αCt y + z y z 2 + αχ1 Ct z z 2 α Ct z z 2 Λ+ L Ct z z 2 + τ ν δ Bt y y 2 (36) Proof. Having in mind that π(y + z ) = 0 and Λ+Λ = π, we get, using Lemma A.1 and Lemma A.2 on the inequality (40): (i,j) E λij z+ ij z 2 Λ+ z z 2 Λ+ (37) (i,j) E λij2 z z , z+ ij z Λ+ + z+ ij z 2 Λ+ (38) (i,j) E λij z z , (ei ej)(ei ej)T(y + z y z ) Λ+ (i,j) E λij β2 (ei ej)(ei ej)T(y + z) 2 Λ+ (39) 2 β z z , Λ+Λ(y + z) + χ2 β2 X (i,j) E λij (ei ej)(ei ej)T(y + z) 2 (40) = 2 β z z , π(y + z) + χ2 β2 X (i,j) E λij (ei ej)(ei ej)T(y + z) 2 (41) We also have, noting that y+, the value of y after a Poisson update, is equal to y: (i,j) E λij( y+ + z+ ij y z 2 y + z y z 2) (42) (i,j) E λij y + z+ ij y z, y + z y z + X (i,j) E λij y + z+ ij y z 2 (43) (i,j) E βλij (ei ej)(ei ej)T(y + z), y + z y z (i,j) E β2λij (ei ej)(ei ej)T(y + z) 2 (44) (i,j) E λij (ei ej)(ei ej)T(y + z) 2 (45) The momentum in z associated with z z 2 Λ+ gives: 2 α Ct z z, z z Λ+ αχ1 Ct z z 2 α Ct z z 2 Λ+ (46) DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization And the one in z associated with y + z y z 2 gives: 2αCt z z, z + y y z (47) Then, assuming that θ Bt = β Ct = αCt, we have: 2αCt y y, z + y y z 2 β Ct z z , y + z y z 2θ Bt y + z y z , y y + 2αCt z z, z + y y z (48) = 2αCt y + z y z 2 (49) At this stage, we split the negative term (49) into two halves, upper-bounding one of the halves by remembering that ν L 1 and introducing 1 τ > 0: αCt y + z y z 2 τ ν LαCt y + z y z 2 (50) L Ct y + z y z 2 (51) L Ct z z 2 + τ β ν L Ct y y 2 (52) L Ct z z 2 + τ ν δ Bt y y 2 (53) Keeping in mind that θ Bt = β Ct = αCt and δ 2 Bt = αBt, we put everything together. Defining Ψ = Φ t + ϕA + ϕB + ϕC, we have: Ψ F(x) y 2 At Lγ2 2 Atγ + At γ2 + (δ + δ)2 δ Bt + z z 2 Λ+ α Ct + C t (55) + y y 2( B t δ + x x 2( At η At ν γ + x x 2( A t At η) (58) + F(x) F(x ) 2(δ Bt γ 2L At) (59) (i,j) E λij (ei ej)(ei ej)T(y + z) 2 χ2 β2 Ct + β(β 1)Ct (60) + z z 2(χ1 α τ 1 + y y (B t (1 τ ν δ )αBt) (62) + y + z y z 2(C t αCt) (63) + d F (x, x )(A t ηAt) (64) + d F ( x, x)( Atη + 2 γ At) (65) + d F ( x, x )(Atη 2 γ At) (66) + F(x) y, y y ( 2 δ Bt + γAt) (67) + y y , x x 2 γ At 2θν Bt (68) DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization Proof of Lemma C.1. Our goal is to put to zero all of the terms appearing next to scalar products and make the factors of positive quantities (norms or divergences) less or equal to zero. Given our relations, we guess that each exponential has the same rate. Thus, with τ > 0, we fix δ 2 = η = η = α = τp ν L, which leads to γ = 2τ νL using Eq. (57). Also, from Eq. (66): 4 At = νAt. Next, from Eq. (59) and Eq. (68), it s necessary that: thus θ = 4τ q ν . From Eq. (68), we get: At = 2Lν Bt. Combining this previous equation with Eq. (67), as 4 At = νAt, we have δ = 4Lγ. Next, Eq. (54) gives, with the equations above: 2 γ) + At γ2 + (δ + δ)2 δ Bt = At Lγ2 4 γ2At + δ2 + δ2 + δ At + At(2τ r ν L + 16L2γ2) 1 We thus pick γ = 1 4L and τ = 1 8, so that δ = 1. Via Eq. (62), we fix τ = 1 8 < 1. With Eq. (61), we then get: We also put α = 2τp ν L and only one last equation, Eq. 
Proof of Lemma C.1. Our goal is to set to zero all of the terms appearing next to scalar products, and to make the factors of the positive quantities (norms or divergences) less than or equal to zero. Given our relations, we guess that each exponential has the same rate. Thus, with $\tau > 0$, we fix $\frac{\delta}{2} = \eta = \tilde\eta = \alpha = \tau\sqrt{\nu/L}$, which leads to $\gamma = \frac{2\tau}{\sqrt{\nu L}}$ using Eq. (57). Also, from Eq. (66): $4\tilde A_t = \nu A_t$. Next, Eq. (59) and Eq. (68) impose the scaling of $\tilde\theta$; from Eq. (68), we get $\tilde A_t = 2L\nu B_t$. Combining this previous equation with Eq. (67), as $4\tilde A_t = \nu A_t$, we have $\tilde\delta = 4L\gamma$. Next, plugging these relations into Eq. (54) makes the coefficient of $\|\nabla F(x) - y\|^2$ nonpositive for a small enough $\tau$; we thus pick $\gamma = \frac{1}{4\sqrt{\nu L}}$, i.e. $\tau = \frac{1}{8}$, which also fixes $\tilde\delta$. Via Eq. (62), we fix $\tilde\tau = \frac{1}{8} < 1$. With Eq. (61), we then fix $\tilde\alpha$: we put $\tilde\alpha = 2\tau\sqrt{\nu/L}$, and only one last equation, Eq. (60), needs to be satisfied, for which we pick $\beta$ and $\tilde\beta$ as functions of $\chi_1[\Lambda]$ so that $\chi_2\tilde\beta^2\tilde C_t + \beta(\beta - 1)C_t \le 0$; this uses $\chi_2 \le \chi_1$, so that $\frac{\chi_2}{2\chi_1} \le \frac{1}{2}$. In summary, all the constants are fixed as explicit functions of $L$, $\nu$ and $\chi_1[\Lambda]$ through the relations above, with $\tau = \tilde\tau = \frac{1}{8}$. Now, let us pick:
$$A_t = e^{t\tau\sqrt{\nu/L}}, \qquad B_t = \frac{e^{t\tau\sqrt{\nu/L}}}{8L}, \qquad \tilde B_t = \frac{e^{t\tau\sqrt{\nu/L}}}{16L}, \qquad C_t = \frac{e^{t\tau\sqrt{\nu/L}}}{2\mu}, \qquad \tilde C_t = \frac{e^{t\tau\sqrt{\nu/L}}}{32\chi_1 L}.$$
This implies that $\Psi \le 0$.

D. Comparison of complexities with related work

Proof of Prop. 3.9. We consider the settings of concurrent works and, given any gossip matrix admissible for them, we show that DADAO has better rates.

Comparison with ADOM+ (Kovalev et al., 2021a). The ADOM+ setting is:
- Gossip matrices: Laplacians $L$ with $\|Lx - x\|^2 \le (1 - \frac{1}{\chi})\|x\|^2$ for $\chi \ge 1$ and $x \perp \mathbf{1}$.
- Total number of gradients to reach $\epsilon$ precision: $O\big(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$.
- Total number of edges activated to reach $\epsilon$ precision: $O\big(|E|\,\chi\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$.

If we take an eigenvector of $L$ for the eigenvalue $\frac{1}{\chi_1[L]}$, the Laplacian inequality directly leads to $\big(1 - \frac{1}{\chi_1[L]}\big)^2 \le 1 - \frac{1}{\chi}$, and we have:
$$\frac{1}{\chi} \le \frac{1}{\chi_1[L]}\Big(2 - \frac{1}{\chi_1[L]}\Big) \le \frac{2}{\chi_1[L]},$$
leading to $\chi_1[L] \le 2\chi$. Recall that, by definition, $\chi_2 \le \chi_1$. Note that in ADOM+, we also have $\|L\| \le 2$, so that $\mathrm{Tr}(L) \le 2n$. Then, for any Laplacian matrix $L$ valid for ADOM+, we consider for DADAO an intensity matrix $\Lambda$ proportional to $L$, normalized using $\sqrt{2\chi_1[L]\chi_2[L]}$. Then, DADAO has the same gradient complexity as ADOM+, but the number of communications of DADAO is:
$$O\Big(\sqrt{2\chi_1[L]\chi_2[L]}\; 2n\,\sqrt{\tfrac{L}{\mu}}\log\tfrac{1}{\epsilon}\Big) = O\Big(|E|\,\chi\sqrt{\tfrac{L}{\mu}}\log\tfrac{1}{\epsilon}\Big),$$
since $\chi_1[L] \le 2\chi$, $\chi_2[L] \le \chi_1[L]$ and $n - 1 \le |E|$. Consequently, DADAO is better than ADOM+ for all configurations valid for both ADOM+ (in the fixed graph topology setting) and DADAO. We note that for the complete graph, $\chi = O(1)$ and $|E| = O(n^2)$, whereas $n\sqrt{2\chi_1[L]\chi_2[L]} = O(n)$, so that DADAO has strictly better communication rates than ADOM+ on this graph.

Comparison with Gradient Tracking methods AGT, OGT (Li & Lin, 2021; Song et al., 2021). The setting is described by:
- Gossip matrices: any matrix $L = I - W$ with $W$ symmetric and doubly stochastic, i.e. such that $\sum_i W_{ij} = 1$ and $\sum_j W_{ij} = 1$.
- Total number of gradients to reach $\epsilon$ precision: $O\big(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$.
- Total number of edges activated to reach $\epsilon$ precision: $O\big(|E|\frac{1}{\sqrt{\theta}}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$, where $\theta$ is the spectral gap of $W$.

If $\kappa = O\big(\frac{|E|}{n}\big)$, using Lemma 3.3, we have:
$$\sqrt{2\chi_1[L]\chi_2[L]}\;\mathrm{Tr}\,L = O\Big(\frac{|E|}{\sqrt{\theta}}\Big).$$
Furthermore, as $W$ is stochastic, $\|L\| \le 2$ and $\theta \le \frac{1}{\chi_1[L]}$, leading to $\rho \le \frac{2}{\theta}$. Consequently, the communication complexity of DADAO run for $O\big(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$ iterations recovers the rate of GT. Furthermore, for the complete graph, for any Laplacian $L$ with $\rho[L] = O(1)$ (recall that, by definition, $\rho \ge 1$), we have $\sqrt{2\chi_1[L]\chi_2[L]}\;\mathrm{Tr}\,L = O(n)$ whereas $|E| = O(n^2)$, thus DADAO uses an order of magnitude fewer communications than GT needs. Note that for the star graph, there is no admissible Laplacian in the framework of (Song et al., 2021).

Comparison with MSDA (Scaman et al., 2017). The setting is described by:
- Gossip matrices: any Laplacian admissible for DADAO.
- Total number of gradients to reach $\epsilon$ precision: $O\big(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$.
- Total number of edges activated to reach $\epsilon$ precision: $O\big(|E|\sqrt{\rho}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$.

If $\kappa = O\big(\frac{|E|}{n}\big)$, using Lemma 3.3, we see that, for any Laplacian $L$:
$$\sqrt{2\chi_1[L]\chi_2[L]}\;\mathrm{Tr}\,L = O\big(|E|\sqrt{\rho}\big).$$
Consequently, the communication complexity of DADAO run for $O\big(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$ iterations is better than that of MSDA. Furthermore, for the complete graph, for any Laplacian $L$ with $\rho[L] = O(1)$ (recall that, by definition, $\rho \ge 1$), we have $\sqrt{2\chi_1[L]\chi_2[L]}\;\mathrm{Tr}\,L = O(n)$ whereas $|E| = O(n^2)$, thus DADAO uses an order of magnitude fewer communications than MSDA needs.
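To make the two graph quantities driving these comparisons concrete, here is a small numerical sketch computing $\chi_1[L]$ (inverse of the smallest positive eigenvalue of the Laplacian) and $\chi_2[L]$ (taken here as half the maximal effective resistance over the edges, the normalization suggested by the resistance identity of Lemma D.1 below so that $\chi_2 \le \chi_1$). The function name and the use of networkx/numpy are our own choices, not the paper's code.

```python
import networkx as nx
import numpy as np

def chi_quantities(graph):
    """Return (chi_1, chi_2) for the unit-weight Laplacian of `graph` (a sketch)."""
    L = nx.laplacian_matrix(graph).toarray().astype(float)
    # chi_1: inverse of the smallest positive eigenvalue of L.
    eigvals = np.linalg.eigvalsh(L)
    chi_1 = 1.0 / min(v for v in eigvals if v > 1e-9)
    # chi_2: half the maximal effective resistance over the edges,
    # computed from the Moore-Penrose pseudo-inverse L^+ (cf. Eq. (73)).
    L_pinv = np.linalg.pinv(L)
    resistances = [L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j]
                   for i, j in graph.edges()]
    chi_2 = 0.5 * max(resistances)
    return chi_1, chi_2

# The complete graph has both quantities O(1), the cycle graph does not.
for name, G in [("complete", nx.complete_graph(50)), ("cycle", nx.cycle_graph(50))]:
    c1, c2 = chi_quantities(G)
    print(f"{name}: chi_1 = {c1:.2f}, chi_2 = {c2:.2f}, "
          f"sqrt(2*chi_1*chi_2) = {np.sqrt(2 * c1 * c2):.2f}")
```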
Comparison with the Continuized framework (Even et al., 2021a). In this framework, and using their notations, we have:
- Gossip matrices: all Laplacian matrices $L$ verifying $\mathrm{Tr}\,L = 2$.
- Total number of gradients to reach $\epsilon$ precision: $O\big(\frac{1}{\theta_{\mathrm{ARG}}}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$.
- Total number of edges activated to reach $\epsilon$ precision: of the same order, since gradient and gossip steps are coupled.

First, we slightly rephrase one of the propositions of (Even et al., 2021a) to match our notations:

Lemma D.1. For a Laplacian $L$ with $\mathrm{Tr}\,L = 2$, the communication and gradient complexity of the continuized framework is given by $O\big(\sqrt{2\chi_1[L]\chi_2[L]}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$.

Proof. Under the notations of (Even et al., 2021a), $\theta_{\mathrm{ARG}} = \sqrt{\mu_{\mathrm{gossip}}/\max_{\{v,w\}}\frac{R_{vw}}{P_{vw}}}$, with $\mu_{\mathrm{gossip}} = \frac{1}{\chi_1[L]}$ in our setting. Moreover, we remind that in (Even et al., 2021a), $L = AA^T$ and $Ae_{vw} = \sqrt{P_{vw}}(e_v - e_w)$. Thus, by definition:
$$\frac{R_{vw}}{P_{vw}} = \frac{1}{P_{vw}}e_{vw}^T A^+ A\, e_{vw} = \frac{e_{vw}^T A^+ (e_v - e_w)}{\sqrt{P_{vw}}} \quad (70)$$
$$= \frac{(A^{+T}e_{vw})^T(e_v - e_w)}{\sqrt{P_{vw}}} \quad (71)$$
$$= \frac{\big((AA^T)^{+T}Ae_{vw}\big)^T(e_v - e_w)}{\sqrt{P_{vw}}} \quad (72)$$
$$= (e_v - e_w)^T L^+ (e_v - e_w), \quad (73)$$
and we have $\theta_{\mathrm{ARG}} = \frac{1}{\sqrt{2\chi_1[L]\chi_2[L]}}$.

Next, if $\mathrm{Tr}\,L = 2$, we proved in Lemma 3.3 that $2\sqrt{\chi_1[L]\chi_2[L]} \ge n - 1$. Thus, we see that the gradient complexity of DADAO, in $O\big(n\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\big)$, is always better than the one of the continuized framework. If we write $f(n) = \Omega(g(n))$ for the fact that there are a constant $C > 0$ and $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $Cg(n) \le f(n)$, then, on the cycle graph, for any $L$ with $\mathrm{Tr}\,L = 2$, $\chi_1[L] = \Omega(n^3)$ and $\chi_2[L] = \Omega(n)$. Thus, DADAO uses an order of magnitude fewer gradients than the continuized framework on this graph.

E. Practical Implementation

In this section, we describe the implementation of our algorithm in more detail. As we did not physically execute our method on a computer network but ran it on a single machine, all the asynchronous computations and communications had to be simulated. Thus, we first discuss the method we followed to simulate our asynchronous framework, before detailing the practical steps of our algorithm through pseudo-code.

E.1. Simulating the Poisson Point Processes

To emulate the asynchronous setting, before running our algorithm, we generate two independent sequences of jump times at the graph's scale: one for the computations and one for the communications. As we consider independent P.P.P.s, the number of events in any time window is Poisson distributed and the inter-event times are i.i.d. exponential variables. At the graph's scale, each node spiking at rate 1, the intensity of the gradient-step process is $n$. Following the experimental setting of the Continuized framework (Even et al., 2021a), we consider that all edges in $E$ have the same probability of spiking. Thus, given any graph $E$ and $L$ its corresponding standard Laplacian with unit edge weights, we computed from $L$ the total intensity $\lambda$ of the communication process (all edges sharing the same rate). Having generated the two sequences of spiking times at the graph's scale, we run our algorithm by playing the events in order of appearance, attributing the location of each event by sampling uniformly one node if the event is a gradient step, or uniformly one edge in $E$ if it is a communication.
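As a concrete illustration of this event-generation procedure, here is a minimal Python sketch. The routine name ppp_spikes, the aggregate communication rate lam, and the return format are our own choices (not the authors' released code); locations are drawn only when the events are played, as described above.

```python
import numpy as np

def ppp_spikes(n, lam, t_max, rng=None):
    """Simulate the two independent Poisson point processes at the graph scale.

    Gradient events arrive at total rate n (one unit-rate clock per node) and
    communication events at total rate lam. Returns the merged, time-ordered
    list of (time, kind) pairs with kind in {"grad", "comm"}.
    """
    rng = np.random.default_rng() if rng is None else rng
    events = []
    for kind, rate in (("grad", n), ("comm", lam)):
        t = 0.0
        while True:
            t += rng.exponential(1.0 / rate)  # i.i.d. exponential inter-arrival times
            if t > t_max:
                break
            events.append((t, kind))
    events.sort()  # play the events in order of appearance
    return events

# When replaying the events, a "grad" event picks a node uniformly in {0, ..., n-1}
# and a "comm" event picks an edge uniformly in E.
```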
E.2. Pseudo Code

We keep the notations introduced in Eq. (3), and recall that the constants $\nu, \eta, \tilde\eta, \gamma, \alpha, \tilde\alpha, \theta, \delta, \tilde\delta, \beta, \tilde\beta$ take the values specified in Appendix C.1 (explicit functions of $\mu$, $L$ and $\chi_1[\Lambda]$). For the sake of completeness, we also recall that the $6\times 6$ matrix $A$ describing the linear ODE (8) is sparse, its nonzero entries involving only $\eta$, $\tilde\eta$, $\alpha$, $\tilde\alpha$, $\theta$ and $\theta\nu$. As described in Appendix E.1, we call PPPspikes the routine mentioned above, returning the time-ordered sequence of events and spike times of the two P.P.P.s. We can then write the pseudo-code of our implementation of the DADAO optimizer in Algorithm 2.

Algorithm 2: Pseudo-code of our implementation of DADAO on a single machine.

Input: on each machine $i \in \{1, \ldots, n\}$, an oracle able to evaluate $\nabla f_i$; parameters $\mu, L, \chi_1, t_{\max}, n, \lambda$; the graph $E$.
Initialize on each machine $i \in \{1, \ldots, n\}$:
    set $X(i) = (x_i, \tilde x_i, y_i)$ and $Y(i) = (\tilde y_i, z_i, \tilde z_i)$ to $0$;
    set the constants $\nu, \eta, \tilde\eta, \gamma, \alpha, \tilde\alpha, \theta, \delta, \tilde\delta, \beta, \tilde\beta$ using $\mu, L, \chi_1$;
    $T(i) \leftarrow 0$;
List_Events, List_Times $\leftarrow$ PPPspikes$(n, \lambda, t_{\max})$;
$n_{\mathrm{events}} \leftarrow |$List_Events$|$;
for $k \in [\![1, n_{\mathrm{events}}]\!]$ do
    if List_Events[$k$] is a gradient step then
        $i \sim \mathcal{U}([\![1, n]\!])$;
        $\big(X(i), Y(i)\big) \leftarrow \exp\big((\text{List\_Times}[k] - T(i))\,A\big)\,\big(X(i), Y(i)\big)$;    // integrate the local ODE since the last event at node $i$
        $g_i \leftarrow \nabla f_i(x_i) - \nu x_i - y_i$;    // local gradient computation
        $x_i \leftarrow x_i - \gamma g_i$;  $\tilde x_i \leftarrow \tilde x_i - \gamma g_i$;
        $y_i \leftarrow y_i + (\delta + \tilde\delta) g_i$;
        $T(i) \leftarrow$ List_Times[$k$];
    else if List_Events[$k$] is a communication step then
        $(i, j) \sim \mathcal{U}(E)$;
        $\big(X(i), Y(i)\big) \leftarrow \exp\big((\text{List\_Times}[k] - T(i))\,A\big)\,\big(X(i), Y(i)\big)$;
        $\big(X(j), Y(j)\big) \leftarrow \exp\big((\text{List\_Times}[k] - T(j))\,A\big)\,\big(X(j), Y(j)\big)$;
        $m_{ij} \leftarrow y_i + z_i - y_j - z_j$;    // message exchanged along the edge
        $z_i \leftarrow z_i - \beta m_{ij}$;  $\tilde z_i \leftarrow \tilde z_i - \tilde\beta m_{ij}$;
        $z_j \leftarrow z_j + \beta m_{ij}$;  $\tilde z_j \leftarrow \tilde z_j + \tilde\beta m_{ij}$;
        $T(i) \leftarrow$ List_Times[$k$];  $T(j) \leftarrow$ List_Times[$k$];
return $(x_i)_{1\le i\le n}$, the estimate of $x^\star$ on each worker $i$.
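To make the event-driven structure of Algorithm 2 concrete, here is a minimal single-machine sketch of the main loop in Python, consuming the output of the ppp_spikes sketch above. The state packing into a 6-row array per node, the dictionary of constants, and all names are our own assumptions, and the per-event updates follow the pseudo-code only up to these simplifications; this is a sketch, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

def dadao_simulation(events, A, grads, n, d, edges, cst, rng=None):
    """Replay DADAO events on one machine (sketch).

    events : ordered list of (time, kind) pairs, e.g. from ppp_spikes
    A      : 6 x 6 matrix of the linear ODE followed between events
    grads  : grads[i](x) returns the gradient of f_i at x (d-dimensional)
    cst    : dict with keys "nu", "gamma", "delta", "delta_tilde", "beta", "beta_tilde"
    Each node i keeps a 6 x d state S[i] = (x, x_t, y, y_t, z, z_t) and the
    time T[i] of its last update; between events it follows the linear ODE.
    """
    rng = np.random.default_rng() if rng is None else rng
    S, T = np.zeros((n, 6, d)), np.zeros(n)

    def mix(i, t):
        # Integrate the local ODE: S[i] <- exp((t - T[i]) A) S[i].
        S[i] = expm((t - T[i]) * A) @ S[i]
        T[i] = t

    for t, kind in events:
        if kind == "grad":                          # local gradient step
            i = rng.integers(n)
            mix(i, t)
            g = grads[i](S[i, 0]) - cst["nu"] * S[i, 0] - S[i, 2]
            S[i, 0] -= cst["gamma"] * g             # x update
            S[i, 1] -= cst["gamma"] * g             # momentum variable
            S[i, 2] += (cst["delta"] + cst["delta_tilde"]) * g
        else:                                       # pairwise gossip step
            i, j = edges[rng.integers(len(edges))]
            mix(i, t); mix(j, t)
            m = S[i, 2] + S[i, 4] - S[j, 2] - S[j, 4]   # y_i + z_i - y_j - z_j
            S[i, 4] -= cst["beta"] * m;  S[i, 5] -= cst["beta_tilde"] * m
            S[j, 4] += cst["beta"] * m;  S[j, 5] += cst["beta_tilde"] * m
    return S[:, 0]                                  # each node's estimate of x*
```

A caller would build `events = ppp_spikes(n, lam, t_max)`, the matrix `A` from the constants, and pass the list of graph edges; caching the matrix exponentials would speed this up but is omitted for clarity.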