# Dual-Free Stochastic Decentralized Optimization with Variance Reduction

Hadrien Hendrikx (INRIA - DIENS - PSL Research University, hadrien.hendrikx@inria.fr), Francis Bach (INRIA - DIENS - PSL Research University, francis.bach@inria.fr), Laurent Massoulié (INRIA - DIENS - PSL Research University, laurent.massoulie@inria.fr)

**Abstract.** We consider the problem of training machine learning models on distributed data in a decentralized way. For finite-sum problems, fast single-machine algorithms for large datasets rely on stochastic updates combined with variance reduction. Yet, existing decentralized stochastic algorithms either do not obtain the full speedup allowed by stochastic updates, or require oracles that are more expensive than regular gradients. In this work, we introduce a Decentralized stochastic algorithm with Variance Reduction called DVR. DVR only requires computing stochastic gradients of the local functions, and is computationally as fast as a standard stochastic variance-reduced algorithm run on a $1/n$ fraction of the dataset, where $n$ is the number of nodes. To derive DVR, we use Bregman coordinate descent on a well-chosen dual problem, and obtain a dual-free algorithm using a specific Bregman divergence. We give an accelerated version of DVR based on the Catalyst framework, and illustrate its effectiveness with simulations on real data.

## 1 Introduction

We consider the regularized empirical risk minimization problem distributed over a network of $n$ nodes. Each node has a local dataset of size $m$, and the problem thus writes:

$$\min_{x \in \mathbb{R}^d} F(x) \triangleq \sum_{i=1}^n f_i(x), \quad \text{with} \quad f_i(x) \triangleq \frac{\sigma_i}{2}\|x\|^2 + \sum_{j=1}^m f_{ij}(x), \tag{1}$$

where $f_{ij}$ typically corresponds to the loss function for training example $j$ of machine $i$, and $\sigma_i$ is the local regularization parameter for node $i$. We assume that each function $f_{ij}$ is convex and $L_{ij}$-smooth (see, e.g., [30]), and that each function $f_i$ is $M_i$-smooth. Following [40], we denote $\kappa_i = 1 + \sum_{j=1}^m L_{ij}/\sigma_i$ the stochastic condition number of $f_i$, and $\kappa_s = \max_i \kappa_i$. Similarly, the batch condition number is $\kappa_b = \max_i M_i/\sigma_i$. It always holds that $\kappa_b \leq \kappa_s \leq m\kappa_b$, but generally $\kappa_s \ll m\kappa_b$, which explains the success of stochastic methods. Indeed, $\kappa_s \approx m\kappa_b$ only when all Hessians are orthogonal to one another, which is rarely the case in practice, especially for a large dataset.

Regarding the distributed aspect, we follow the standard gossip framework [5, 29, 8, 32] and assume that nodes are linked by a communication network, which we represent as an undirected graph $G$. We denote $N(i)$ the set of neighbors of node $i$, and $\mathbb{1} \in \mathbb{R}^n$ the vector with all coordinates equal to 1. Communication is abstracted as multiplication by a positive semi-definite matrix $W \in \mathbb{R}^{n \times n}$, which is such that $W_{k\ell} = 0$ if $k \neq \ell$ and $k \notin N(\ell)$, and $\mathrm{Ker}(W) = \mathrm{Span}(\mathbb{1})$. The matrix $W$ is called the gossip matrix, and we denote its spectral gap by $\gamma = \lambda^+_{\min}(W)/\lambda_{\max}(W)$, the ratio between the smallest non-zero and the largest eigenvalues of $W$, which is a key quantity in decentralized optimization. We finally assume that nodes can compute a local stochastic gradient $\nabla f_{ij}$ in time 1, and that communication (i.e., multiplication by $W$) takes time $\tau$.
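For readers less familiar with the gossip framework, the minimal sketch below (ours, not part of the paper) builds a valid gossip matrix as a graph Laplacian and computes the spectral gap $\gamma$ defined above; function names are illustrative.

```python
import numpy as np

def laplacian(n_nodes, edges):
    """Graph Laplacian L = D - A: symmetric PSD, zero for non-neighbors,
    and Ker(L) = Span(1) when the graph is connected -- a valid gossip matrix."""
    L = np.zeros((n_nodes, n_nodes))
    for k, l in edges:
        L[k, k] += 1.0
        L[l, l] += 1.0
        L[k, l] -= 1.0
        L[l, k] -= 1.0
    return L

def spectral_gap(W):
    """gamma = lambda_min^+(W) / lambda_max(W)."""
    eig = np.linalg.eigvalsh(W)            # ascending eigenvalues
    lam_min_plus = min(e for e in eig if e > 1e-10)
    return lam_min_plus / eig[-1]

# Example: ring graph on 10 nodes (gamma scales like 1/n^2 for rings and chains).
n = 10
ring = [(i, (i + 1) % n) for i in range(n)]
print(spectral_gap(laplacian(n, ring)))
```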
**Single-machine stochastic methods.** Problem (1) is generally solved using first-order methods. When $m$ is large, computing $\nabla F$ becomes very expensive, and batch methods require $O(\kappa_b \log(\varepsilon^{-1}))$ iterations, which takes time $O(m\kappa_b \log(\varepsilon^{-1}))$, to minimize $F$ up to precision $\varepsilon$. In this case, updates using the stochastic gradients $\nabla f_{ij}$, where $(i,j)$ is selected randomly, can be much more effective [4]. Yet, these updates are noisy, and plain stochastic gradient descent (SGD) does not converge to the exact solution unless the step-size goes to zero, which slows down the algorithm. One way to fix this problem is to use variance-reduced methods such as SAG [33], SDCA [35], SVRG [16] or SAGA [7] (a minimal single-machine sketch is given at the end of this section). These methods require $O((nm + \kappa_s)\log(\varepsilon^{-1}))$ stochastic gradient evaluations, which can be much smaller than the $O(nm\kappa_b \log(\varepsilon^{-1}))$ evaluations of batch methods.

**Decentralized methods.** Decentralized adaptations of gradient descent in the smooth and strongly convex setting include EXTRA [37], DIGing [28] and NIDS [21]. These algorithms have sparked a lot of interest, and the latest convergence results [14, 42, 20] show that EXTRA and NIDS require time $O((\kappa_b + \gamma^{-1})(m + \tau)\log(\varepsilon^{-1}))$ to reach precision $\varepsilon$. A generic acceleration of EXTRA using Catalyst [20] obtains the (batch) optimal $O(\sqrt{\kappa_b}(1 + \tau/\sqrt{\gamma})\log(\varepsilon^{-1}))$ rate up to log factors. Another line of work on decentralized algorithms is based on the penalty method [19, 9]. It consists in applying traditional optimization algorithms to problems augmented with a Laplacian penalty, which in particular enables the use of accelerated methods. Yet, these algorithms are sensitive to the value of the penalty parameter (when it is fixed), since it directly influences the solution they converge to. Another natural way to construct decentralized optimization algorithms is through dual approaches [32, 38]. Although the dual approach leads to algorithms that are optimal both in terms of number of communications and computations [31, 13], these methods generally assume access to the proximal operator or to the gradient of the Fenchel conjugate of the local functions, which is not very practical in general since it requires solving a subproblem at each step.

**Decentralized stochastic optimization.** Although both stochastic and decentralized methods have a rich literature, there exist few decentralized stochastic methods with linear convergence rates. DSA [27] and GT-SAGA [41] are such algorithms, but they respectively take time $O((m\kappa_s + \kappa_s^4\gamma^{-1})(1 + \tau)\log(\varepsilon^{-1}))$ and $O((m + \kappa_s^2\gamma^{-2})(1 + \tau)\log(\varepsilon^{-1}))$ to reach precision $\varepsilon$. Therefore, they have significantly worse rates than decentralized batch methods when $m = 1$, and than single-machine stochastic methods when $n = 1$. Other methods have better convergence rates [36, 12], but they require evaluating proximal operators, which may be expensive.

**Our contributions.** This work develops a dual approach similar to that of [12], which leads to a decentralized stochastic algorithm with rate $O(m + \kappa_s + \tau\kappa_b/\sqrt{\gamma})$, where the $\sqrt{\gamma}$ factor comes from Chebyshev acceleration, as used in [32]. Yet, our algorithm, called DVR, can be formulated in the primal only, thus avoiding the need for computing expensive dual gradients or proximal operators. Besides, DVR is derived by applying Bregman coordinate descent to the dual of a specific augmented problem. Thus, its convergence follows from the convergence of block coordinate descent with Bregman gradients, which we prove as a side contribution. When executed on a single machine, DVR is similar to dual-free SDCA [34], and obtains similar rates. We believe that the same methodology could be applied to tackle non-convex problems, but we leave these extensions for future work.
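As announced above, here is a minimal single-machine SAGA-style loop [7], illustrating the variance-reduced updates that DVR extends to the decentralized setting; this sketch and its names are ours, not the paper's.

```python
import numpy as np

def saga(grad_j, m, d, step, n_steps, seed=0):
    """Minimal SAGA loop [7]: a table of the last gradient seen per sample
    makes the update unbiased, with variance vanishing at the optimum.
    grad_j(x, j) must return the gradient of the j-th summand at x."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    table = np.array([grad_j(x, j) for j in range(m)])  # stored gradients
    avg = table.mean(axis=0)                            # their running mean
    for _ in range(n_steps):
        j = rng.integers(m)
        g = grad_j(x, j)
        x -= step * (g - table[j] + avg)                # variance-reduced step
        avg += (g - table[j]) / m
        table[j] = g
    return x
```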
We present in Section 2 the derivations leading to DVR, namely the dual approach and the dual-free trick. Then, Section 3 presents the actual algorithm along with a convergence theorem based on block Bregman coordinate descent (presented in Appendix A). Section 4 shows how to accelerate DVR, both in terms of network dependence (Chebyshev acceleration) and global iteration complexity (Catalyst acceleration [23]). Finally, experiments on real-world data that demonstrate the effectiveness of DVR are presented in Section 5.

## 2 Algorithm Design

This section presents the key steps leading to DVR. We start by introducing a relevant dual formulation from [12], then introduce the dual-free trick based on [17], and finally show how this leads to DVR, an actual implementable decentralized stochastic algorithm, as a special case of the previous derivations.

### 2.1 Dual formulation

The standard dual formulation of Problem (1) is obtained by associating a parameter vector to each node, and imposing that two neighboring nodes have the same parameters [6, 15, 32]. This leads to the following constrained problem, in which we write $\theta^{(i)} \in \mathbb{R}^d$ the local vector of node $i$:

$$\min_{\theta \in \mathbb{R}^{nd}} \sum_{i=1}^n f_i(\theta^{(i)}) \quad \text{such that} \quad \theta^{(k)} = \theta^{(\ell)} \ \ \forall k,\ \forall \ell \in N(k). \tag{2}$$

Following the approach of [12, 13], we further split the $f_i(\theta^{(i)})$ term into $\frac{\sigma_i}{2}\|\theta^{(i)}\|^2 + \sum_{j=1}^m f_{ij}(\theta^{(ij)})$, with the constraint that $\theta^{(i)} = \theta^{(ij)}$ for all $j$. This is equivalent to the previous formulation performed on an augmented graph [12, 13], in which each node is split into a star network with the regularization at the center and one local summand at each tip of the star. Thus, the equivalent augmented constrained problem that we consider writes:

$$\min_{\theta \in \mathbb{R}^{n(m+1)d}} \sum_{i=1}^n \Big( \frac{\sigma_i}{2}\|\theta^{(i)}\|^2 + \sum_{j=1}^m f_{ij}(\theta^{(ij)}) \Big) \ \ \text{s.t.} \ \ \theta^{(k)} = \theta^{(\ell)} \ \forall k, \forall \ell \in N(k), \ \text{and} \ \theta^{(i)} = \theta^{(ij)} \ \forall i, j. \tag{3}$$

We now use Lagrangian duality and introduce two kinds of multipliers. The variable $x$ corresponds to the multipliers associated with the constraints given by the edges of the communication graph (i.e., $\theta^{(k)} = \theta^{(\ell)}$ if $k \in N(\ell)$), which we call communication edges. Similarly, $y$ corresponds to the constraints associated with the edges that are specific to the augmented graph (i.e., $\theta^{(i)} = \theta^{(ij)}$ for all $i, j$), which we call computation or virtual edges, since they are not present in the original graph and were constructed for the augmented problem. Therefore, there are $E$ communication edges (the number of edges in the initial graph) and $nm$ virtual edges. The dual formulation of Problem (3) thus writes:

$$\min_{x \in \mathbb{R}^{Ed},\, y \in \mathbb{R}^{nmd}} \ \frac{1}{2} q_A(x, y) + \sum_{i=1}^n \sum_{j=1}^m f_{ij}^*\big((A(x, y))^{(ij)}\big), \quad \text{with} \quad q_A(x, y) \triangleq (x, y)^\top A^\top \Sigma A (x, y), \tag{4}$$

where $(x, y) \in \mathbb{R}^{(E+nm)d}$ is the concatenation of the vector $x \in \mathbb{R}^{Ed}$ associated with the communication edges and the vector $y \in \mathbb{R}^{nmd}$ associated with the computation edges. We denote $\Sigma = \mathrm{Diag}(\sigma_1^{-1}, \dots, \sigma_n^{-1}, 0, \dots, 0) \otimes I_d \in \mathbb{R}^{n(m+1)d \times n(m+1)d}$, and $A$ is such that for all $z \in \mathbb{R}^d$, $A(e_{k\ell} \otimes z) = \mu_{k\ell}\,(u_k - u_\ell) \otimes P_{k\ell}\, z$ for edge $(k,\ell)$, where $P_{k\ell} = I_d$ if $(k,\ell)$ is a communication edge, $P_{ij}$ is the projector on $\mathrm{Ker}(f_{ij})^\perp$ (with $\mathrm{Ker}(f_{ij}) \triangleq \cap_{x \in \mathbb{R}^d} \mathrm{Ker}(\nabla^2 f_{ij}(x))$) if $(i,j)$ is a virtual edge, $z_1 \otimes z_2$ is the Kronecker product of vectors $z_1$ and $z_2$, and $e_{k\ell} \in \mathbb{R}^{E+nm}$ and $u_k \in \mathbb{R}^{n(m+1)}$ are the unit vectors associated with edge $(k,\ell)$ and node $k$ respectively. Note that the upper-left $nd \times nd$ block of $AA^\top$ (corresponding to the communication edges) is equal to $W \otimes I_d$, where $W$ is a gossip matrix (see, e.g., [32]) that depends on the $\mu_{k\ell}$. In particular, $W$ is equal to the Laplacian of the communication graph if $\mu_{k\ell}^2 = 1/2$ for all $(k,\ell)$. For computation edges, the projectors $P_{ij}$ account for the fact that the parameters $\theta^{(i)}$ and $\theta^{(ij)}$ only need to be equal on the subspace on which $f_{ij}$ is not constant, and we choose $\mu_{ij}$ such that $\mu_{ij}^2 = \alpha L_{ij}$ for some $\alpha > 0$. Although this introduces heavier notations, explicitly writing $A$ as an $n(1+m)d \times (E+nm)d$ matrix instead of an $n(1+m) \times (E+nm)$ matrix allows us to introduce the projectors $P_{ij}$, which yields a better communication complexity than choosing $P_{ij} = I_d$. See [12, 13] for more details on this dual formulation, and in particular on the construction of the augmented graph.
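As a sanity check of the structure of $A$ described above, the toy script below (ours; restricted to the communication block and $d = 1$, so that the Kronecker factor is trivial) verifies that $A_{\mathrm{comm}} A_{\mathrm{comm}}^\top$ is a weighted graph Laplacian, hence a valid gossip matrix. The exact $\mu_{k\ell}$ scaling convention here is an assumption of ours.

```python
import numpy as np

# Column (k, l) of A_comm is mu_kl * (u_k - u_l), hence
# A_comm @ A_comm.T = sum_{(k,l)} mu_kl^2 (u_k - u_l)(u_k - u_l)^T,
# a weighted Laplacian of the communication graph.
n, edges = 4, [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-cycle
mu2 = 0.5                                        # mu_kl^2, constant over edges
A_comm = np.zeros((n, len(edges)))
for col, (k, l) in enumerate(edges):
    A_comm[k, col] += np.sqrt(mu2)
    A_comm[l, col] -= np.sqrt(mu2)
W = A_comm @ A_comm.T
print(W)                                         # weighted Laplacian of the cycle
print(np.allclose(W @ np.ones(n), 0))            # Ker(W) contains Span(1)
```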
Now that we have obtained a suitable dual problem, we would like to solve it without computing gradients or proximal operators of the $f_{ij}^*$, which can be very expensive.

### 2.2 Dual-free trick

Dual methods are based on variants of Problem (4), to which they apply different algorithms. In particular, [32, 38] use accelerated gradient descent [30], and [11, 12] use accelerated (proximal) coordinate descent [24]. Let $p_{\mathrm{comm}}$ denote the probability of performing a communication step, and $p_{ij}$ the probability that node $i$ samples a gradient of $f_{ij}$; they are such that $\sum_{j=1}^m p_{ij} = 1 - p_{\mathrm{comm}}$ for all $i$. Applying a coordinate update with step-size $\eta/p_{\mathrm{comm}}$ to Problem (4) in the direction $x$ (associated with communication edges) writes:

$$x_{t+1} = x_t - \eta\, p_{\mathrm{comm}}^{-1}\, \nabla_x q_A(x_t, y_t), \tag{5}$$

where we denote $\nabla_x$ the gradient with respect to the coordinates that correspond to $x$ (communication edges), and $\nabla_{y,ij}$ the gradient for coordinate $(ij)$ (computation edge). Similarly, the standard coordinate update for a local computation edge $(i,j)$ can be written as:

$$y^{(ij)}_{t+1} = \arg\min_{y \in \mathbb{R}^d} \ \nabla_{y,ij} q_A(x_t, y_t)^\top y + \mu_{ij}\, \nabla f_{ij}^*(\mu_{ij} y^{(ij)}_t)^\top y + \frac{p_{ij}}{2\eta}\, \big\|y - y^{(ij)}_t\big\|^2, \tag{6}$$

where the minimization problem actually has a closed-form solution. Yet, as mentioned before, solving Equation (6) requires computing the derivative of $f_{ij}^*$. In order to avoid this, a trick introduced by [17] and later used in [39] is to replace the Euclidean distance term by a well-chosen Bregman divergence. More specifically, the Bregman divergence of a convex function $\phi$ is defined as:

$$D_\phi(x, y) = \phi(x) - \phi(y) - \nabla\phi(y)^\top (x - y). \tag{7}$$

Bregman gradient algorithms typically enjoy the same kind of guarantees as standard gradient algorithms, but with slightly different notions of relative smoothness and strong convexity [1, 25]. Note that the Bregman divergence of the squared Euclidean norm is the squared Euclidean distance, and the standard gradient descent algorithm is recovered in that case. We now replace the Euclidean distance by the Bregman divergence induced by the function $\phi : y \mapsto (L_{ij}/\mu_{ij}^2)\, f_{ij}^*(\mu_{ij} y^{(ij)})$, which is normalized to be 1-strongly convex since $f_{ij}^*$ is $L_{ij}^{-1}$-strongly convex. We recall that the constant $\alpha > 0$ is such that $\mu_{ij}^2 = \alpha L_{ij}$ for all computation edges $(i,j)$. Using the definition of the Bregman divergence with respect to $\phi$, we write:

$$y^{(ij)}_{t+1} = \arg\min_{y \in \mathbb{R}^d} \ \nabla_{y,ij} q_A(x_t, y_t)^\top y + \mu_{ij}\, \nabla f_{ij}^*(\mu_{ij} y^{(ij)}_t)^\top y + \frac{p_{ij}}{\eta}\, D_\phi\big(y, y^{(ij)}_t\big)$$
$$= \arg\min_{y \in \mathbb{R}^d} \ \Big( \frac{\alpha\eta}{p_{ij}}\, \nabla_{y,ij} q_A(x_t, y_t) - \Big(1 - \frac{\alpha\eta}{p_{ij}}\Big)\, \mu_{ij}\, \nabla f_{ij}^*(\mu_{ij} y^{(ij)}_t) \Big)^\top y + f_{ij}^*(\mu_{ij} y)$$
$$= \frac{1}{\mu_{ij}}\, \nabla f_{ij}\Big( \Big(1 - \frac{\alpha\eta}{p_{ij}}\Big)\, \nabla f_{ij}^*(\mu_{ij} y^{(ij)}_t) - \frac{\alpha\eta}{p_{ij}\mu_{ij}}\, \nabla_{y,ij} q_A(x_t, y_t) \Big).$$

In particular, if we know $\nabla f_{ij}^*(\mu_{ij} y^{(ij)}_t)$ then it is possible to compute $y^{(ij)}_{t+1}$. Besides,

$$\nabla f_{ij}^*(\mu_{ij} y^{(ij)}_{t+1}) = \Big(1 - \frac{\alpha\eta}{p_{ij}}\Big)\, \nabla f_{ij}^*(\mu_{ij} y^{(ij)}_t) - \frac{\alpha\eta}{p_{ij}\mu_{ij}}\, \nabla_{y,ij} q_A(x_t, y_t), \tag{8}$$

so we can also compute $\nabla f_{ij}^*(\mu_{ij} y^{(ij)}_{t+1})$, and use it for the next step. Therefore, instead of computing a dual gradient at each step, we can simply choose $y^{(ij)}_0 = \mu_{ij}^{-1} \nabla f_{ij}(z^{(ij)}_0)$ for any $z^{(ij)}_0$, and iterate from this. With this initialization, the Bregman coordinate update applied to Problem (4) in the block of direction $(i,j)$ yields:

$$z^{(ij)}_{t+1} = \Big(1 - \frac{\alpha\eta}{p_{ij}}\Big)\, z^{(ij)}_t - \frac{\alpha\eta}{p_{ij}\mu_{ij}}\, \nabla_{y,ij} q_A(x_t, y_t), \qquad y^{(ij)}_{t+1} = \mu_{ij}^{-1}\, \nabla f_{ij}(z^{(ij)}_{t+1}). \tag{9}$$
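To make the inversion step explicit, the following identities (standard convex duality, spelled out here for convenience) are what allow maintaining $z^{(ij)}_t$ instead of ever evaluating $\nabla f_{ij}^*$:

```latex
% For strongly convex and smooth f_{ij}, \nabla f_{ij}^* is the inverse map of \nabla f_{ij}:
w = \nabla f_{ij}(v) \iff v = \nabla f_{ij}^*(w).
% Hence, defining z_t^{(ij)} \triangleq \nabla f_{ij}^*(\mu_{ij} y_t^{(ij)}), Equation (8) is
% exactly the affine recursion on z_t^{(ij)} of Equation (9), and y_{t+1}^{(ij)} is recovered
% with a single primal gradient: \mu_{ij} y_{t+1}^{(ij)} = \nabla f_{ij}(z_{t+1}^{(ij)}).
```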
The iterations of (9) are called a dual-free algorithm because they are a transformation of the iterations of (6) that no longer requires computing $\nabla f_{ij}^*$. This is obtained by replacing the Euclidean distance in (6) by the Bregman divergence of a function proportional to $f_{ij}^*$. Note that although we use the same dual-free trick as [17], the tools are different: [17] applies a randomized primal-dual algorithm with a fixed choice of Bregman divergences to a specific primal-dual formulation, whereas we apply a generic randomized Bregman coordinate descent algorithm to a specific dual formulation.

### 2.3 Distributed implementation

The iterations of (9) do not involve the functions $f_{ij}^*$ anymore, which was our first goal. Yet, they consist in updating dual variables associated with the edges of the augmented graph, and have no clear distributed meaning yet. In this section, we rewrite the updates of (5) and (9) in order to obtain an easy-to-implement distributed algorithm. The key steps are (i) multiplying the updates by $A$, (ii) making the gossip matrix explicit, and (iii) remarking that $\theta^{(i)}_t = (\Sigma A(x_t, y_t))^{(i)}$ converges to the primal solution for all $i$. For a vector $z \in \mathbb{R}^{n(1+m)d}$, we denote $[z]_{\mathrm{comm}} \in \mathbb{R}^{nd}$ its restriction to the communication nodes, and $[M]_{\mathrm{comm}} \in \mathbb{R}^{nd \times nd}$ similarly refers to the restriction to communication nodes of a matrix $M \in \mathbb{R}^{n(1+m)d \times n(1+m)d}$. By abuse of notation, we call $A_{\mathrm{comm}} \in \mathbb{R}^{nd \times Ed}$ the restriction of $A$ to communication nodes and edges. We denote $P_{\mathrm{comm}}$ the projector on communication edges, and $P_{\mathrm{comp}}$ the projector on $y$. We multiply the communication update (5) by $A$ on the left (which is standard [32, 12]) and obtain:

$$A_{\mathrm{comm}} x_{t+1} = A_{\mathrm{comm}} x_t - \eta\, p_{\mathrm{comm}}^{-1}\, [A P_{\mathrm{comm}} A^\top]_{\mathrm{comm}}\, [\Sigma A(x_t, y_t)]_{\mathrm{comm}}. \tag{10}$$

Note that $[P_{\mathrm{comm}} A^\top \Sigma A(x_t, y_t)]_{\mathrm{comm}} = [P_{\mathrm{comm}} A^\top]_{\mathrm{comm}} [\Sigma A(x_t, y_t)]_{\mathrm{comm}}$ because $P_{\mathrm{comm}}$ and $\Sigma$ are non-zero only for communication edges and nodes. Similarly, and as previously stated, one can verify that $A_{\mathrm{comm}} [P_{\mathrm{comm}} A^\top]_{\mathrm{comm}} = [A P_{\mathrm{comm}} A^\top]_{\mathrm{comm}} = W \otimes I_d \in \mathbb{R}^{nd \times nd}$, where $W$ is a gossip matrix. We finally introduce $\tilde{x}_t \in \mathbb{R}^{nd}$, a variable associated with the nodes, defined by $\tilde{x}_t = A_{\mathrm{comm}} x_t$. With this rewriting, the communication update becomes:

$$\tilde{x}_{t+1} = \tilde{x}_t - \eta\, p_{\mathrm{comm}}^{-1}\, (W \otimes I_d)\, \Sigma_{\mathrm{comm}}\, [A(x_t, y_t)]_{\mathrm{comm}}.$$

To show that $[A(x_t, y_t)]_{\mathrm{comm}}$ is locally accessible to each node, we write:

$$[A(x_t, y_t)]^{(i)}_{\mathrm{comm}} = (A_{\mathrm{comm}} x_t)^{(i)} - \sum_{j=1}^m \big(A(e_{ij} \otimes y^{(ij)}_t)\big)^{(i)} = \tilde{x}^{(i)}_t - \sum_{j=1}^m \mu_{ij}\, y^{(ij)}_t.$$

We denote this rescaled local vector $\theta_t = \Sigma_{\mathrm{comm}}\big([A(x_t, y_t)]_{\mathrm{comm}}\big)$, and obtain for the variable $\tilde{x}_t$ the gossip update of Equation (12). Note that we directly write $y^{(ij)}_t$ instead of $P_{ij} y^{(ij)}_t$, even though there has been a multiplication by the matrix $A$. This is allowed because Equation (13) implies that (i) $y^{(ij)}_t \in \mathrm{Ker}(f_{ij})^\perp$ for all $t$, and (ii) the value of $(I_d - P_{ij}) z^{(ij)}_t$ does not matter, since $z^{(ij)}_t$ is only used to compute $\nabla f_{ij}$. We now consider computation edges, and remark that:

$$\nabla_{y,ij}\, q_A(x_t, y_t) = -\mu_{ij}\, (\Sigma_{\mathrm{comm}})_{ii}\, \big([A(x_t, y_t)]_{\mathrm{comm}}\big)^{(i)} = -\mu_{ij}\, \theta^{(i)}_t. \tag{11}$$

Plugging Equation (11) into the updates of (5) and (9), we obtain the following updates:

$$\tilde{x}_{t+1} = \tilde{x}_t - \frac{\eta}{p_{\mathrm{comm}}}\, (W \otimes I_d)\, \theta_t \tag{12}$$

for communication edges, and for the local update of the $j$-th component of node $i$:

$$z^{(ij)}_{t+1} = \Big(1 - \frac{\alpha\eta}{p_{ij}}\Big)\, z^{(ij)}_t + \frac{\alpha\eta}{p_{ij}}\, \theta^{(i)}_t, \qquad \theta^{(i)}_{t+1} = \frac{1}{\sigma_i}\Big( \tilde{x}^{(i)}_{t+1} - \sum_{j=1}^m \nabla f_{ij}(z^{(ij)}_{t+1}) \Big). \tag{13}$$

Finally, Algorithm 1 is obtained by expressing everything in terms of $\theta_t$ and removing the variable $\tilde{x}_t$. To simplify notations, we further consider $\theta$ as a matrix in $\mathbb{R}^{n \times d}$ (instead of a vector in $\mathbb{R}^{nd}$), so that the communication update of Equation (12) is a standard gossip update with matrix $W$, which we recall is such that $W \otimes I_d = [A P_{\mathrm{comm}} A^\top]_{\mathrm{comm}}$.
The local updates of Equation (13) are closely related to dual-free SDCA updates [34]; we come back to this link in Section 3.

## 3 Convergence Rate

The goal of this section is to set the parameters $\eta$ and $\alpha$ in order to get the best convergence guarantees. We introduce

$$\kappa_{\mathrm{comm}} = \gamma\, \lambda_{\max}\big(A_{\mathrm{comm}}^\top \Sigma_{\mathrm{comm}} A_{\mathrm{comm}}\big) \big/ \lambda^+_{\min}\big(A_{\mathrm{comm}}^\top D_M^{-1} A_{\mathrm{comm}}\big),$$

where $\lambda^+_{\min}$ and $\lambda_{\max}$ respectively refer to the smallest non-zero and the largest eigenvalues of the corresponding matrices, and $D_M$ is the diagonal matrix such that $(D_M)_{ii} = \sigma_i + \lambda_{\max}\big(\sum_{j=1}^m L_{ij} P_{ij}\big)$, where $\nabla^2 f_{ij}(x) \preceq L_{ij} P_{ij}$ for all $x \in \mathbb{R}^d$. Note that we use the notation $\kappa_{\mathrm{comm}}$ since it corresponds to a condition number. In particular, $\kappa_{\mathrm{comm}} \leq \kappa_s$ when $\sigma_i = \sigma_j$ for all $i, j$, and $\kappa_{\mathrm{comm}}$ otherwise captures more finely the interplay between the regularity of the local functions (through $D_M$ and $\Sigma_{\mathrm{comm}}$) and the topology of the network (through $A$).

**Theorem 1.** Choose $p_{\mathrm{comm}} = \big(1 + \gamma(m + \kappa_s)/\kappa_{\mathrm{comm}}\big)^{-1}$, $p_{ij} \propto 1 + L_{ij}/\sigma_i$ with $\sum_{j=1}^m p_{ij} = 1 - p_{\mathrm{comm}}$, and $\alpha$ and $\eta$ as in Algorithm 1. Then, there exists $C_0 > 0$ that only depends on $\theta_0$ (initial conditions) such that for all $t > 0$, the error and the expected time $T_\varepsilon$ required to reach precision $\varepsilon$ are such that:

$$\frac{1}{2}\, \mathbb{E}\big[\|\theta^{(i)}_t - \theta^\star\|^2\big] \leq C_0\, (1 - \alpha\eta)^t, \quad \text{and so} \quad T_\varepsilon = O\Big(\Big(m + \kappa_s + \frac{\tau\, \kappa_{\mathrm{comm}}}{\gamma}\Big)\log\frac{1}{\varepsilon}\Big).$$

**Algorithm 1** DVR($z_0$)

```
1:  α = 2 λ⁺_min(A⊤_comm D⁻¹_M A_comm),
    η = min( p_comm / λ_max(A⊤_comm Σ_comm A_comm),  min_{i,j} p_ij / (α (1 + σ⁻¹_i L_ij)) )
2:  θ^(i)_0 = −(Σ^m_{j=1} ∇f_ij(z^(ij)_0)) / σ_i   // z_0 is arbitrary, but θ_0 is not.
3:  for t = 0 to K − 1 do                           // Run for K iterations
4:      Sample u_t uniformly in [0, 1]              // Randomly decide the kind of update
5:      if u_t ≤ p_comm then
6:          θ_{t+1} = θ_t − (η / p_comm) Σ_comm W θ_t           // Communication using W
7:      else
8:          for i = 1 to n do
9:              Sample j ∈ {1, …, m} with probability p_ij
10:             z^(ij′)_{t+1} = z^(ij′)_t for j′ ≠ j             // Only one virtual node is updated
11:             z^(ij)_{t+1} = (1 − αη/p_ij) z^(ij)_t + (αη/p_ij) θ^(i)_t   // Virtual node update
12:             θ^(i)_{t+1} = θ^(i)_t − (∇f_ij(z^(ij)_{t+1}) − ∇f_ij(z^(ij)_t)) / σ_i  // Local update using ∇f_ij
13: return θ_K
```

**Proof sketch.** We have seen in Section 2 that DVR is obtained by applying Bregman coordinate descent to a well-chosen dual problem. Therefore, one of our key results consists in proving convergence rates for Bregman coordinate descent in the relatively-smooth setting. Although a similar algorithm is analyzed in [10], we give sharper results in the case of arbitrary sampling of blocks, and tightly adapt to the separability structure. This is crucial to our analysis, since the probability of sampling a local gradient and that of communicating can be vastly different. In order to ease the reading of the paper, we present these results in a general setting in Appendix A, which is self-contained and which we believe to be of independent interest (beyond its application to decentralized optimization). Then, Appendix B focuses on the application to decentralized optimization. In particular, we recall the equivalence between DVR and Bregman coordinate descent applied to the dual problem of Equation (4), and show that its structure is suited to the application of coordinate descent: indeed, no two virtual edges adjacent to the same node are updated at the same time with our sampling. Then, we evaluate the relative smoothness and strong convexity constants of the augmented problem, which is rather challenging due to the complex structure of the dual problem. This allows us to derive adequate values for the parameters $\alpha$ and $\eta$. Finally, we choose $p_{\mathrm{comm}}$ in order to minimize the execution time of DVR. We would like to highlight that the convergence theory of DVR decomposes nicely into several building blocks, and thus simple rates are obtained. This is not so usual for decentralized algorithms: for instance, many follow-up papers were needed to obtain a tight convergence theory for EXTRA [37, 14, 42, 20].
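To make Algorithm 1 concrete, here is a self-contained toy simulation (ours, not the paper's code) of DVR on a small ridge-regression instance over a ring graph. The step sizes are small illustrative constants rather than the exact formulas of line 1, and uniform local sampling replaces the $p_{ij}$ of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: n nodes on a ring, m samples each, quadratic losses
# f_ij(x) = 0.5 * (a_ij @ x - b_ij)^2 and f_i = (sigma/2)||x||^2 + sum_j f_ij.
n, m, d, sigma = 5, 10, 3, 1.0
A = rng.standard_normal((n, m, d))
b = rng.standard_normal((n, m))
grad = lambda i, j, x: (A[i, j] @ x - b[i, j]) * A[i, j]   # gradient of f_ij

W = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)       # ring Laplacian
W[0, -1] -= 1; W[-1, 0] -= 1

# DVR state (Algorithm 1), with illustrative constants.
p_comm = 0.2
p_ij = (1 - p_comm) / m                                    # uniform local sampling
eta, alpha = 0.05, 0.5
z = np.zeros((n, m, d))
theta = np.array([-sum(grad(i, j, z[i, j]) for j in range(m)) / sigma
                  for i in range(n)])                      # line 2 of Algorithm 1

for t in range(20000):
    if rng.random() < p_comm:
        theta -= (eta / p_comm) * (W @ theta) / sigma      # gossip step (line 6)
    else:
        for i in range(n):                                 # local steps (lines 8-12)
            j = rng.integers(m)
            c = alpha * eta / p_ij
            z_new = (1 - c) * z[i, j] + c * theta[i]
            theta[i] -= (grad(i, j, z_new) - grad(i, j, z[i, j])) / sigma
            z[i, j] = z_new

# Compare against the global solution of min_x sum_i f_i(x).
H = sum(A[i].T @ A[i] for i in range(n)) + n * sigma * np.eye(d)
x_star = np.linalg.solve(H, sum(A[i].T @ b[i] for i in range(n)))
print(np.linalg.norm(theta - x_star, axis=1))              # per-node distance to x_star
```

With these constants, the rows of $\theta$ should approach the global solution, illustrating that DVR drives all nodes to consensus and optimality while only ever computing primal gradients $\nabla f_{ij}$.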
We now discuss the convergence rate of DVR in more detail.

**Computation complexity.** The computation complexity of DVR is the same as that of locally running a stochastic algorithm with variance reduction at each node. This is not surprising since, as we argue below, DVR can be understood as a decentralized version of an algorithm that is closely related to dual-free SDCA [34]. Therefore, DVR improves the computation complexity of EXTRA from $O(m(\kappa_b + \gamma^{-1}))$ individual gradients to $O(m + \kappa_s)$, which is the expected improvement for a stochastic variance-reduced algorithm. In comparison, GT-SAGA [41], a recent decentralized stochastic algorithm, has a computation complexity of order $O(m + \kappa_s^2/\gamma^2)$, which is significantly worse than that of DVR, and generally worse than that of EXTRA as well.

**Communication complexity.** The communication complexity of DVR (i.e., the number of communications, so the communication time is retrieved by multiplying by $\tau$) is of order $O(\kappa_{\mathrm{comm}}/\gamma)$, and can be improved to $O(\kappa_{\mathrm{comm}}/\sqrt{\gamma})$ using Chebyshev acceleration (see Section 4). Yet, this is in general worse than the $O(\kappa_b + \gamma^{-1})$ communication complexity of EXTRA or NIDS, which can be interpreted as a partly accelerated communication complexity since the optimal dependence is $O(\sqrt{\kappa_b/\gamma})$ [31], and $2\sqrt{\kappa_b/\gamma} = \kappa_b + \gamma^{-1}$ in the worst case ($\kappa_b = \gamma^{-1}$). Yet, stochastic updates are mainly intended for the case in which the computation time dominates, and we show in the experimental section that DVR outperforms EXTRA and NIDS for a wide range of communication times $\tau$ (the computation complexity dominates roughly as long as $\tau < \gamma(m + \kappa_s)/\kappa_{\mathrm{comm}}$). Finally, the communication complexity of DVR is significantly lower than that of DSA and GT-SAGA, the primal decentralized stochastic alternatives presented in Section 1.

**Homogeneous parameter choice.** In the homogeneous case ($\sigma_i = \sigma_j$ for all $i, j$), choosing the optimal $p_{\mathrm{comp}}$ and $p_{\mathrm{comm}}$ described above leads to $\eta \lambda_{\max}(W) = \sigma p_{\mathrm{comm}}$. Therefore, the communication update becomes $\theta_{t+1} = (I - W/\lambda_{\max}(W))\, \theta_t$, which is a gossip update with a standard step-size (independent of the optimization parameters). Similarly, $\alpha\eta(m + \kappa_s) = p_{\mathrm{comp}}$, so the step-size of the computation updates is independent of the network.

**Links with SDCA.** The single-machine version of Algorithm 1 ($n = 1$, $p_{\mathrm{comm}} = 0$) is closely related to dual-free SDCA [34]. The difference is in the stochastic gradient used: DVR uses $\nabla f_{ij}(z^{(ij)}_t)$, where $z^{(ij)}_t$ is a convex combination of the $\theta^{(i)}_k$ for $k < t$, whereas dual-free SDCA uses $g^{(ij)}_t$, a convex combination of the $\nabla f_{ij}(\theta^{(i)}_k)$ for $k < t$. Both algorithms obtain the same rates.

**Local synchrony.** Instead of using the synchronous communications of Algorithm 1, it is possible to update edges one at a time, as in [12]. This can be very efficient in heterogeneous settings (both in terms of computation and communication times), and similar convergence results can be obtained using the same framework; we leave the details for future work.

## 4 Acceleration

We show in this section how to modify DVR to improve the convergence rate of Theorem 1.

**Network acceleration.** Algorithm 1 depends on $\gamma^{-1}$, also called the mixing time of the graph, which can be as high as $O(n^2)$ for a chain of length $n$ [26]. However, it is possible to improve this dependence to $\gamma^{-1/2}$ by using Chebyshev acceleration, as in [32]. To do so, the first step is to choose a polynomial $P$ of degree $k$ and communicate with $P(W)$ instead of $W$. In terms of implementation, this comes down to performing $k$ communication rounds instead of one, but it makes the algorithm depend on the spectral gap of $P(W)$. The important fact is that there exists a polynomial $P_\gamma$ of degree $\lceil \gamma^{-1/2} \rceil$ such that the spectral gap of $P_\gamma(W)$ is of order 1. Each communication step with $P_\gamma(W)$ only takes time $\tau \deg(P_\gamma) = \tau \lceil \gamma^{-1/2} \rceil$, so the communication term of Theorem 1 can be replaced by $\tau \kappa_{\mathrm{comm}} \gamma^{-1/2}$, thus leading to network acceleration. The polynomial $P_\gamma$ can for example be chosen as a Chebyshev polynomial, and we refer the interested reader to [32] for more details; see also the sketch below. Finally, other polynomials yield even faster convergence when the graph topology is known [2].
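The following sketch (ours) applies one concrete choice of $P_\gamma$, a degree-$K$ shifted Chebyshev polynomial, via the standard three-term recurrence; the shift constants are the standard ones, not taken verbatim from the paper.

```python
import numpy as np

def chebyshev_gossip(W, x, K):
    """Apply P_K(W) to x, where P_K is a degree-K shifted Chebyshev
    polynomial with P_K(0) = 0 and eigenvalues close to 1 elsewhere.
    Costs K multiplications by W, i.e., K communication rounds."""
    eig = np.linalg.eigvalsh(W)                 # in practice these bounds are known a priori
    lmin = min(e for e in eig if e > 1e-10)
    lmax = eig[-1]
    nu = (lmax + lmin) / (lmax - lmin)
    M = lambda v: ((lmax + lmin) * v - 2 * (W @ v)) / (lmax - lmin)
    v_prev, v = x, M(x)                         # T_0 and T_1 applied to x
    t_prev, t = 1.0, nu                         # T_0(nu) and T_1(nu)
    for _ in range(K - 1):                      # three-term Chebyshev recurrence
        v_prev, v = v, 2 * M(v) - v_prev
        t_prev, t = t, 2 * nu * t - t_prev
    return x - v / t                            # P_K(W) x, with P_K(0) = 0

# With K of order gamma^{-1/2}, the spectral gap of P_K(W) is of order 1.
```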
**Catalyst acceleration.** Catalyst [22] is a generic framework that achieves acceleration by solving a sequence of subproblems. Because of space limitations, we only present the accelerated convergence rate, without specifying the algorithm in the main text. Yet, only mild modifications to Algorithm 1 are required to obtain these rates, and the detailed derivations and proofs are presented in Appendix C (a generic sketch of the outer loop is given at the end of this section).

**Theorem 2.** DVR can be accelerated using Catalyst, so that the time $T_\varepsilon$ required to reach precision $\varepsilon$ is equal (up to log factors) to:

$$T_\varepsilon = O\Big(m + \sqrt{m\kappa_s} + \tau\sqrt{\frac{m\,\kappa_{\mathrm{comm}}}{\gamma}}\Big).$$

**Proof sketch.** We follow the approach of [20] to derive the algorithm, and apply Catalyst acceleration to the primal problem in the mean parameter $\bar{\theta}_t$ (which is never explicitly computed). Indeed, this conceptual algorithm can actually be implemented in a fully decentralized manner. Then, we proceed to the actual proof, which requires a tight control over both primal and dual warm-start errors: indeed, Theorem 4 (Appendix B) controls the dual variables, but Catalyst acceleration is applied to the primal variables.

This rate recovers the computation complexity of optimal finite-sum algorithms such as ADFS [12, 13]. Although the communication time is slightly increased (by a factor $\sqrt{m\kappa_{\mathrm{comm}}/\kappa_s}$), ADFS uses a stronger oracle than DVR (proximal operator instead of gradient), which is why we develop DVR in the first place. Although both ADFS and DVR are derived from the same dual formulation, both the approach and the resulting algorithms are rather different: ADFS uses accelerated coordinate descent, and thus has strong convergence guarantees at the cost of requiring dual oracles. DVR uses coordinate descent with the Bregman divergence of $\phi_{ij} \propto f_{ij}^*$ in order to work with primal oracles, but thus loses direct acceleration, which is recovered through the Catalyst framework. Note that the parameters of accelerated DVR can also be set such that $T_\varepsilon = O\big(\sqrt{\kappa_{\mathrm{comm}}}\,(m + \tau/\sqrt{\gamma})\log(\varepsilon^{-1})\big)$, which recovers the convergence rate of optimal batch algorithms, but loses the finite-sum speedup.
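For reference, the generic Catalyst outer loop [22, 23] has the following shape (a sketch under the standard strongly convex parameterization; the DVR-specific inner solver, warm starts, and stopping criteria are those of Appendix C and are not reproduced here):

```python
import numpy as np

def catalyst(inner_solver, x0, lam, mu, n_outer):
    """Generic Catalyst outer loop [22, 23].
    inner_solver(center, x_init) approximately minimizes the subproblem
    F(x) + (lam/2) * ||x - center||^2, warm-started at x_init;
    mu is the strong convexity constant of F."""
    q = mu / (mu + lam)
    beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))   # extrapolation weight
    x, y = x0.copy(), x0.copy()
    for _ in range(n_outer):
        x_new = inner_solver(y, x)               # solve the proximal subproblem
        y = x_new + beta * (x_new - x)           # Nesterov-style extrapolation
        x = x_new
    return x
```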
## 5 Experiments

We investigate in this section the practical performance of DVR. We solve a regularized logistic regression problem on the RCV1 dataset [18] ($d = 47236$) with $n = 81$ (leading to $m = 2430$) and two different graph topologies: an Erdős-Rényi random graph (see, e.g., [3]) and a grid. We choose $\mu_{k\ell}^2 = 1/2$ for all communication edges, so the gossip matrix $W$ is the Laplacian of the graph.

*Figure 1: Experimental results for the RCV1 dataset with different graphs of size $n = 81$, with $m = 2430$ samples per node, and with different regularization parameters. Panels: (a) Erdős-Rényi, $\sigma = m \cdot 10^{-5}$; (b) Grid, $\sigma = m \cdot 10^{-5}$; (c) Erdős-Rényi, $\sigma = m \cdot 10^{-7}$; (d) Grid, $\sigma = m \cdot 10^{-7}$.*

Figure 1 compares the performance of DVR with that of state-of-the-art primal algorithms such as EXTRA [37], NIDS [21], GT-SAGA [41], and Catalyst-accelerated versions of EXTRA [20] and DVR. Suboptimality refers to $F(\theta^{(0)}_t) - F(\theta^\star)$, where node 0 is chosen arbitrarily and $F(\theta^\star)$ is approximated by the minimal error over all iterations. Each subplot of Figure 1(a) shows the same run with different x-axes. The left plot measures the complexity in terms of individual gradients ($\nabla f_{ij}$) computed by each node, whereas the center plot measures it in terms of communications (multiplications by $W$). All other plots are taken with respect to (simulated) time (i.e., computing $\nabla f_{ij}$ takes time 1 and multiplying by $W$ takes time $\tau$), with $\tau = 250$ in order to report results that are independent of the computing cluster hardware and status. All parameters are chosen according to theory, except for the smoothness of the $f_i$, which requires computing the largest eigenvalue of a $d \times d$ matrix. For this, we start with $L_b = \sigma_i + \sum_{j=1}^m L_{ij}$ (which is a known upper bound), and decrease it while convergence is ensured, leading to $\kappa_b = 0.01\,\kappa_s$. The parameters of accelerated EXTRA are chosen as in [20], since tuning the number of inner iterations does not significantly improve the results (at the cost of a high tuning effort). For accelerated DVR, we set the number of inner iterations to $N/p_{\mathrm{comp}}$ (one pass over the local dataset). We use Chebyshev acceleration for (accelerated) DVR, but not for (accelerated) EXTRA, since it is actually slower in that case, as predicted by the theory.

As expected from their theoretical iteration complexities, NIDS and EXTRA perform very similarly [20], and GT-SAGA is the slowest method. Therefore, we only plot NIDS and GT-SAGA in Figure 1(a). We then see that, though it requires more communications, DVR has a much lower computation complexity than EXTRA, which illustrates the benefits of stochastic methods. We see that DVR is faster overall if we choose $\tau = 250$, and both methods perform similarly for $\tau = 1000$, at which point communicating takes roughly as much time as computing a full local gradient. We then see that accelerated EXTRA has quite a lot of overhead and, despite our tuning efforts, is slower than EXTRA when the regularization is rather high. On the other hand, accelerated DVR consistently outperforms DVR by a relatively large margin. The communication complexity is in particular greatly improved, allowing accelerated DVR to be the fastest method regardless of the setting.

*Figure 2: Experimental results for the RCV1 dataset with different graphs of size $n = 81$, with $m = 2430$ samples per node, and with different regularization parameters. Panels: (a) Erdős-Rényi, $\sigma = m \cdot 10^{-5}$; (b) Grid, $\sigma = m \cdot 10^{-5}$; (c) Grid, $\sigma = m \cdot 10^{-7}$.*

Finally, Figure 2 presents the comparison between DVR and MSDA [32], an optimal decentralized batch algorithm, in terms of communication complexity. To implement MSDA, we compute the dual gradients by solving each local subproblem ($\nabla f^*(x) = \arg\max_y x^\top y - f(y)$) up to precision $10^{-11}$ using accelerated gradient descent. Solving the subproblems with lower precision caused MSDA to plateau and not converge to the true optimum. In Figure 2(c), "Acc. DVR comm" (the brown line) refers to accelerated DVR with the Catalyst parameter chosen to favor communication complexity (as explained after Theorem 2). MSDA is the fastest algorithm, as expected, but accelerated DVR is not too far behind, especially given that it relies on generic Catalyst acceleration, which adds some complexity overhead.
Therefore, the comparison with MSDA corroborates the fact that accelerated DVR is competitive with optimal methods in terms of communication, while enjoying a drastically lower computational cost. Further experimental results are given in Appendix D, and the code is available in the supplementary material and at https://github.com/HadrienHx/DVR_NeurIPS.

## 6 Conclusion

This paper introduces DVR, a Decentralized stochastic algorithm with Variance Reduction obtained using Bregman block coordinate descent on a well-chosen dual formulation. Thanks to this approach, DVR inherits the fast rates and simple theory of dual approaches without the computational burden of relying on dual oracles. Therefore, DVR has a drastically lower computational cost than standard primal decentralized algorithms, although sometimes at the cost of a slight increase in communication complexity. The framework used to derive DVR is rather general, and could in particular be extended to analyze asynchronous algorithms. Finally, although deriving a direct acceleration of DVR is a challenging open problem, Catalyst and Chebyshev accelerations allow to significantly reduce DVR's communication overhead, both in theory and in practice.

**Acknowledgements.** This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). We also acknowledge support from the European Research Council (grant SEQUOIA 724063) and from the MSR-INRIA joint centre.

**Broader impact statement.** This work does not present any foreseeable societal consequence.

## References

[1] Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330-348, 2017.

[2] Raphaël Berthier, Francis Bach, and Pierre Gaillard. Accelerated gossip in networks of given dimension using Jacobi polynomial iterations. SIAM Journal on Mathematics of Data Science, 2(1):24-47, 2020.

[3] Béla Bollobás. Random Graphs. Number 73 in Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2001.

[4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT, pages 177-186. Springer, 2010.

[5] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508-2530, 2006.

[6] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

[7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646-1654, 2014.

[8] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2012.

[9] Darina Dvinskikh and Alexander Gasnikov. Decentralized and parallelized primal and dual accelerated methods for stochastic convex programming problems. arXiv preprint arXiv:1904.09015, 2019.

[10] Filip Hanzely and Peter Richtárik. Fastest rates for stochastic mirror descent methods. arXiv preprint arXiv:1803.07374, 2018.
[11] Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Accelerated decentralized optimization with local updates for smooth and strongly convex objectives. In Artificial Intelligence and Statistics, 2019.

[12] Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. An accelerated decentralized stochastic proximal algorithm for finite sums. In Advances in Neural Information Processing Systems, 2019.

[13] Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. An optimal algorithm for decentralized finite sum optimization. arXiv preprint arXiv:2005.10675, 2020.

[14] Dušan Jakovetić. A unification and generalization of exact distributed first-order methods. IEEE Transactions on Signal and Information Processing over Networks, 5(1):31-46, 2018.

[15] Dušan Jakovetić, José M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922-936, 2014.

[16] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

[17] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, pages 1-49, 2017.

[18] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361-397, 2004.

[19] Huan Li, Cong Fang, Wotao Yin, and Zhouchen Lin. A sharp convergence rate analysis for distributed accelerated gradient methods. arXiv preprint arXiv:1810.01053, 2018.

[20] Huan Li and Zhouchen Lin. Revisiting EXTRA for smooth distributed optimization. arXiv preprint arXiv:2002.10110, 2020.

[21] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494-4506, 2019.

[22] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384-3392, 2015.

[23] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 18(1):7854-7907, 2017.

[24] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244-2273, 2015.

[25] Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333-354, 2018.

[26] Bojan Mohar. Some applications of Laplace eigenvalues of graphs. In Graph Symmetry, pages 225-275. Springer, 1997.

[27] Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(1):2165-2199, 2016.

[28] Angelia Nedić, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597-2633, 2017.

[29] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.

[30] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[31] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal convergence rates for convex distributed optimization in networks. Journal of Machine Learning Research, 20:1-31, 2019.

[32] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027-3036, 2017.

[33] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83-112, 2017.

[34] Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747-754, 2016.

[35] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.

[36] Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, and Hui Qian. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. In International Conference on Machine Learning, pages 4631-4640, 2018.

[37] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944-966, 2015.

[38] César A. Uribe, Soomin Lee, Alexander Gasnikov, and Angelia Nedić. A dual approach for optimal algorithms in distributed optimization over networks. Optimization Methods and Software, pages 1-40, 2020.

[39] Jialei Wang and Lin Xiao. Exploiting strong convexity from data with primal-dual first-order algorithms. In International Conference on Machine Learning, pages 3694-3702, 2017.

[40] Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. DSCOVR: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. Journal of Machine Learning Research, 20(43):1-58, 2019.

[41] Ran Xin, Soummya Kar, and Usman A. Khan. Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence. IEEE Signal Processing Magazine, 37(3):102-113, 2020.

[42] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: Unified and tight convergence analysis. arXiv preprint arXiv:2002.11534, 2020.