# COLA: Decentralized Linear Learning

Lie He (EPFL, lie.he@epfl.ch), An Bian (ETH Zurich, ybian@inf.ethz.ch), and Martin Jaggi (EPFL, martin.jaggi@epfl.ch). The first two authors contributed equally.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract. Decentralized machine learning is a promising emerging paradigm in view of global challenges of data ownership and privacy. We consider learning of linear classification and regression models, in the setting where the training data is decentralized over many user devices, and the learning algorithm must run on-device, on an arbitrary communication network, without a central coordinator. We propose COLA, a new decentralized training algorithm with strong theoretical guarantees and superior practical performance. Our framework overcomes many limitations of existing methods, achieves communication efficiency, scalability, elasticity, and resilience to changes in data, and allows for unreliable and heterogeneous participating devices.

1 Introduction

With the immense growth of data, decentralized machine learning has become not only attractive but a necessity. Personal data from, for example, smartphones, wearables and many other mobile devices is sensitive and exposed to a great risk of data breaches and abuse when collected by a centralized authority or enterprise. Nevertheless, many users have grown accustomed to giving up control over their data in return for useful machine learning predictions (e.g. recommendations), which benefit from joint training on the data of all users combined in a centralized fashion. In contrast, decentralized learning aims at learning this same global machine learning model without any central server. Instead, we only rely on distributed computations of the devices themselves, with each user's data never leaving its device of origin. While increasing research progress has been made towards this goal, major challenges remain to be addressed in terms of privacy as well as algorithmic efficiency, robustness and scalability.

Motivated by the aforementioned challenges, we make progress in this work on the important problem of training generalized linear models in a fully decentralized environment. Existing research on decentralized optimization, $\min_{x\in\mathbb{R}^n} F(x)$, can be categorized into two main directions. The seminal line of work started by Bertsekas and Tsitsiklis in the 1980s, cf. [Tsitsiklis et al., 1986], tackles this problem by splitting the parameter vector $x$ by coordinates/components among the devices. A second, more recent line of work including e.g. [Nedic and Ozdaglar, 2009, Duchi et al., 2012, Shi et al., 2015, Mokhtari and Ribeiro, 2016, Nedic et al., 2017] addresses sum-structured $F(x) = \sum_k F_k(x)$, where $F_k$ is the local cost function of node $k$. This structure is closely related to empirical risk minimization in a learning setting. See e.g. [Cevher et al., 2014] for an overview of both directions. While the first line of work typically only provides convergence guarantees for smooth objectives $F$, the second approach often suffers from a lack of consensus, that is, the minimizers of $\{F_k\}_k$ are typically different since the data is in general not distributed i.i.d. between devices.

Contributions. In this paper, our main contribution is to propose COLA, a new decentralized framework for training generalized linear models with convergence guarantees.
Our scheme resolves both described issues in existing approaches, using techniques from primal-dual optimization, and can be seen as a generalization of COCOA [Smith et al., 2018] to the decentralized setting. More specifically, the proposed algorithm offers

- Convergence Guarantees: Linear and sublinear convergence rates are guaranteed for strongly convex and general convex objectives, respectively. Our results are free of the restrictive assumptions made by stochastic methods [Zhang et al., 2015, Wang et al., 2017], which require an i.i.d. data distribution over all devices.
- Communication Efficiency and Usability: Employing a data-local subproblem between communication rounds, COLA not only achieves communication efficiency but also allows the re-use of existing efficient single-machine solvers for on-device learning. We provide practical decentralized primal-dual certificates to diagnose the learning progress.
- Elasticity and Fault Tolerance: Unlike sum-structured approaches such as SGD, COLA is provably resilient to changes in the data and in the network topology, and to participating devices disappearing, straggling or re-appearing in the network.

Our implementation is publicly available under github.com/epfml/cola.

1.1 Problem statement

Setup. Many machine learning and signal processing models are formulated as a composite convex optimization problem of the form $\min_u\; l(u) + r(u)$, where $l$ is a convex loss function of a linear predictor over data and $r$ is a convex regularizer. Cornerstone applications include e.g. logistic regression, SVMs, Lasso and generalized linear models, each combined with or without L1, L2 or elastic-net regularization. Following the setup of [Dünner et al., 2016, Smith et al., 2018], these training problems can be mapped to either of the two following formulations, which are dual to each other:

$$\min_{x\in\mathbb{R}^n}\; F_A(x) := f(Ax) + \textstyle\sum_{i=1}^n g_i(x_i), \qquad\qquad \text{(A)}$$
$$\min_{w\in\mathbb{R}^d}\; F_B(w) := f^*(w) + \textstyle\sum_{i=1}^n g_i^*(-A_i^\top w), \qquad\qquad \text{(B)}$$

where $f^*$ and $g_i^*$ are the convex conjugates of $f$ and $g_i$, respectively. Here $x\in\mathbb{R}^n$ is a parameter vector and $A := [A_1; \ldots; A_n] \in \mathbb{R}^{d\times n}$ is a data matrix with column vectors $A_i\in\mathbb{R}^d$, $i\in[n]$. We assume that $f$ is smooth (has Lipschitz gradient) and that $g(x) := \sum_{i=1}^n g_i(x_i)$ is separable.

Data partitioning. As in [Jaggi et al., 2014, Dünner et al., 2016, Smith et al., 2018], we assume the dataset $A$ is distributed over $K$ machines according to a partition $\{\mathcal{P}_k\}_{k=1}^K$ of the columns of $A$. Note that this convention maintains the flexibility of partitioning the training dataset either by samples (through mapping applications to (B), e.g. for SVMs) or by features (through mapping applications to (A), e.g. for Lasso or L1-regularized logistic regression). For $x\in\mathbb{R}^n$, we write $x_{[k]}\in\mathbb{R}^n$ for the $n$-vector with elements $(x_{[k]})_i := x_i$ if $i\in\mathcal{P}_k$ and $(x_{[k]})_i := 0$ otherwise, and analogously $A_{[k]}\in\mathbb{R}^{d\times n_k}$ for the corresponding set of local data columns on node $k$, which is of size $n_k = |\mathcal{P}_k|$.

Network topology. We consider the task of joint training of a global machine learning model in a decentralized network of $K$ nodes. Its connectivity is modelled by a mixing matrix $W\in\mathbb{R}_+^{K\times K}$. More precisely, $W_{ij}\in[0,1]$ denotes the connection strength between nodes $i$ and $j$, with a non-zero weight indicating the existence of a pairwise communication link. We assume $W$ to be symmetric and doubly stochastic, which means each row and column of $W$ sums to one. The spectral properties of $W$ used in this paper are that the eigenvalues of $W$ are real, and $1 = \lambda_1(W) \ge \dots \ge \lambda_K(W) \ge -1$. Let the second-largest magnitude of the eigenvalues of $W$ be $\beta := \max\{|\lambda_2(W)|, |\lambda_K(W)|\}$.
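To make the role of $W$ and $\beta$ concrete, the following is a small illustrative sketch (assuming NumPy; it is not part of the released code) that builds the Metropolis-weight mixing matrix, the construction also used for the experiments in Section 4 and Appendix B, for a ring of $K = 16$ nodes and computes $\beta$:

```python
# Illustrative only (assumes NumPy; not from the paper's code release): Metropolis-weight
# mixing matrix for a ring of K nodes, and the resulting beta = max{|lambda_2|, |lambda_K|}.
import numpy as np

def metropolis_ring(K):
    """Symmetric, doubly stochastic mixing matrix W for a ring topology."""
    W = np.zeros((K, K))
    for k in range(K):
        for j in ((k - 1) % K, (k + 1) % K):
            W[k, j] = 1.0 / (1 + 2)           # Metropolis rule 1/(1+max(deg_k, deg_j)); ring degrees are 2
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))  # self-weights make every row (and column) sum to one
    return W

W = metropolis_ring(16)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)  # symmetric and doubly stochastic
lams = np.linalg.eigvalsh(W)                                    # real eigenvalues, ascending order
beta = max(abs(lams[-2]), abs(lams[0]))                         # second-largest eigenvalue magnitude
print(f"beta = {beta:.3f}, spectral gap 1 - beta = {1 - beta:.3f}")
```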
The quantity $1-\beta$ is called the spectral gap, a quantity well studied in graph theory and network analysis. The spectral gap measures the level of connectivity among nodes. In the extreme case when $W$ is diagonal, and thus an identity matrix, the spectral gap is 0 and there is no communication among nodes. To ensure convergence of decentralized algorithms, we impose the standard assumption of a positive spectral gap of the network, which holds for all connected graphs, such as e.g. a ring or 2-D grid topology; see also Appendix B for details.

1.2 Related work

Research in decentralized optimization dates back to the 1980s with the seminal work of Bertsekas and Tsitsiklis, cf. [Tsitsiklis et al., 1986]. Their framework focuses on the minimization of a (smooth) function by distributing the components of the parameter vector $x$ among agents. In contrast, a second, more recent line of work [Nedic and Ozdaglar, 2009, Duchi et al., 2012, Shi et al., 2015, Mokhtari and Ribeiro, 2016, Nedic et al., 2017, Scaman et al., 2017, 2018] considers minimization of a sum of individual local cost functions $F(x) = \sum_i F_i(x)$, which are potentially non-smooth. Our work can be seen as bridging the two scenarios in the primal-dual setting of (A) and (B). While decentralized optimization is a relatively mature area in the operations research and automatic control communities, it has recently received a surge of attention for machine learning applications, see e.g. [Cevher et al., 2014].

Decentralized gradient descent (DGD) with diminishing stepsizes was proposed by [Nedic and Ozdaglar, 2009, Jakovetic et al., 2012], showing convergence to the optimal solution at a sublinear rate. [Yuan et al., 2016] further prove that DGD converges to a neighborhood of the global optimum at a linear rate when used with a constant stepsize for strongly convex objectives. [Shi et al., 2015] present EXTRA, which offers a significant performance boost compared to DGD by using a gradient-tracking technique. [Nedic et al., 2017] propose the DIGing algorithm to handle a time-varying network topology. For a static and symmetric $W$, DIGing recovers EXTRA by redefining the two mixing matrices in EXTRA. The dual averaging method [Duchi et al., 2012] converges at a sublinear rate with a dynamic stepsize. Under a strong convexity assumption, decomposition techniques such as decentralized ADMM (DADMM, also known as consensus ADMM) achieve linear convergence on time-invariant undirected graphs, provided the subproblems are solved exactly [Shi et al., 2014, Wei and Ozdaglar, 2013]. DADMM+ [Bianchi et al., 2016] is a different primal-dual approach with more efficient closed-form updates in each step (as compared to ADMM), and is proven to converge, but without a rate. Compared to COLA, neither DADMM nor DADMM+ can be flexibly adapted to the communication-computation tradeoff, due to their fixed update definitions, and both require additional hyperparameters to be tuned in each use case (including the penalty parameter of ADMM). Notably, COLA shows superior performance compared to DIGing and decentralized ADMM in our experiments. [Scaman et al., 2017, 2018] present lower complexity bounds and optimal algorithms for objectives of the form $F(x) = \sum_i F_i(x)$. Specifically, [Scaman et al., 2017] assumes each $F_i(x)$ is smooth and strongly convex, and [Scaman et al., 2018] assumes each $F_i(x)$ is Lipschitz continuous and convex. Additionally, [Scaman et al., 2018] requires a boundedness constraint on the input problem.
In contrast, COLA can handle non-smooth and non-strongly convex objectives of the forms (A) and (B), suited to the mentioned applications in machine learning and signal processing. For smooth nonconvex models, [Lian et al., 2017] demonstrate that a variant of decentralized parallel SGD can outperform its centralized counterpart when the network latency is high. They further extend it to the asynchronous setting [Lian et al., 2018] and to deal with large data variance among nodes [Tang et al., 2018a] or with unreliable network links [Tang et al., 2018b]. For decentralized, asynchronous consensus optimization, [Wu et al., 2018] extend the existing PG-EXTRA and prove convergence of the algorithm. [Sirb and Ye, 2018] prove an $O(K/\varepsilon^2)$ rate for stale and stochastic gradients. [Lian et al., 2018] achieve an $O(1/\varepsilon)$ rate with linear speedup in the number of workers.

In the distributed setting with a central server, the algorithms of the COCOA family [Yang, 2013, Jaggi et al., 2014, Ma et al., 2015, Dünner et al., 2018] (see [Smith et al., 2018] for a recent overview) target problems of the forms (A) and (B). For convex models, COCOA has been shown to significantly outperform competing methods including e.g. ADMM and distributed SGD. Other centralized algorithm representatives are parallel SGD variants such as [Agarwal and Duchi, 2011, Zinkevich et al., 2010] and more recent distributed second-order methods [Zhang and Lin, 2015, Reddi et al., 2016, Gargiani, 2017, Lee and Chang, 2017, Dünner et al., 2018, Lee et al., 2018]. In this paper we extend COCOA to the challenging decentralized environment with no central coordinator, while maintaining all of its nice properties. We are not aware of any existing primal-dual methods in the decentralized setting, except the recent work of [Smith et al., 2017] on federated learning for the special case of multi-task learning problems. Federated learning was first described by [Konečný et al., 2015, 2016, McMahan et al., 2017] as decentralized learning for on-device learning applications, combining a global shared model with local personalized models. Current federated optimization algorithms (like FedAvg in [McMahan et al., 2017]) are still close to the centralized setting. In contrast, our work provides a fully decentralized alternative algorithm for federated learning with generalized linear models.

2 The decentralized algorithm: COLA

The COLA framework is summarized in Algorithm 1. For a given input problem we map it to either the (A) or the (B) formulation, and define the locally stored dataset $A_{[k]}$ and local part of the weight vector $x_{[k]}$ in node $k$ accordingly. While $v = Ax$ is the shared state being communicated in COCOA, this is generally unknown to a node in the fully decentralized setting. Instead, we maintain $v_k$, a local estimate of $v$ in node $k$, and use it as a surrogate in the algorithm.

Algorithm 1: COLA: Communication-Efficient Decentralized Linear Learning
1: Input: data matrix $A$ distributed column-wise according to partition $\{\mathcal{P}_k\}_{k=1}^K$; mixing matrix $W$; aggregation parameter $\gamma\in[0,1]$ and local subproblem parameter $\sigma'$ as in (1). Starting point $x^{(0)} := 0 \in\mathbb{R}^n$, $v^{(0)} := 0\in\mathbb{R}^d$, $v_k^{(0)} := 0\in\mathbb{R}^d$ for all $k = 1,\dots,K$
2: for $t = 0, 1, 2, \dots, T$ do
3:   for $k\in\{1,2,\dots,K\}$ in parallel over all nodes do
4:     compute locally averaged shared vector $v_k^{(t+1/2)} := \sum_{l=1}^K W_{kl}\, v_l^{(t)}$
5:     $\Delta x_{[k]} \leftarrow$ $\Theta$-approximate solution to subproblem (1) at $v_k^{(t+1/2)}$
6:     update local variable $x_{[k]}^{(t+1)} := x_{[k]}^{(t)} + \gamma\,\Delta x_{[k]}$
7:     compute update of local estimate $\Delta v_k := A_{[k]}\,\Delta x_{[k]}$
8:     $v_k^{(t+1)} := v_k^{(t+1/2)} + \gamma K\,\Delta v_k$
9:   end
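To make the communication pattern of Algorithm 1 concrete, here is a minimal single-process sketch (assuming NumPy; it is not the released implementation at github.com/epfml/cola). The function `local_solver` is a stand-in for any $\Theta$-approximate solver of the local subproblem (1) defined below:

```python
import numpy as np

def cola(A_blocks, W, local_solver, T, gamma=1.0):
    """Single-process simulation of Algorithm 1.

    A_blocks[k] is the d x n_k local data A_[k]; W is the K x K mixing matrix;
    local_solver(A_k, v_k, x_k, sigma_prime) must return a Theta-approximate
    solution of subproblem (1) in the sense of Assumption 1.
    """
    K = len(A_blocks)
    d = A_blocks[0].shape[0]
    x_blocks = [np.zeros(A.shape[1]) for A in A_blocks]  # local coordinates x_[k], initialised to 0
    v = np.zeros((K, d))                                 # v_k: local estimates of the shared vector Ax
    sigma_prime = gamma * K                              # safe subproblem parameter sigma' := gamma * K
    for t in range(T):
        v_half = W @ v                                   # gossip step: v_k^(t+1/2) = sum_l W_kl v_l^(t)
        for k in range(K):                               # runs in parallel over nodes in a real deployment
            dx = local_solver(A_blocks[k], v_half[k], x_blocks[k], sigma_prime)
            x_blocks[k] += gamma * dx                    # x_[k]^(t+1) := x_[k]^(t) + gamma * dx
            v[k] = v_half[k] + gamma * K * (A_blocks[k] @ dx)  # v_k^(t+1) := v_k^(t+1/2) + gamma*K*A_[k] dx
    return x_blocks, v
```

In a real deployment the inner loop runs on $K$ separate devices, and each node only exchanges its $d$-dimensional estimate $v_k$ with its direct neighbors, as prescribed by the non-zero entries of $W$.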
New data-local quadratic subproblems. During a computation step, node $k$ locally solves the following minimization problem:

$$\min_{\Delta x_{[k]}\in\mathbb{R}^n}\; \mathcal{G}_k^{\sigma'}\big(\Delta x_{[k]}; v_k, x_{[k]}\big), \qquad\qquad \text{(1)}$$

where

$$\mathcal{G}_k^{\sigma'}\big(\Delta x_{[k]}; v_k, x_{[k]}\big) := \tfrac{1}{K} f(v_k) + \nabla f(v_k)^\top A_{[k]}\Delta x_{[k]} + \tfrac{\sigma'}{2\tau}\big\|A_{[k]}\Delta x_{[k]}\big\|^2 + \textstyle\sum_{i\in\mathcal{P}_k} g_i\big(x_i + (\Delta x_{[k]})_i\big), \qquad \text{(2)}$$

and $\tau$ is the smoothness constant of $f$ (i.e., $f$ is $(1/\tau)$-smooth). Crucially, this subproblem only depends on the local data $A_{[k]}$ and the local vectors $v_l$ from the neighborhood of the current node $k$. In contrast, in COCOA [Smith et al., 2018] the subproblem is defined in terms of a globally aggregated shared vector $v_c := Ax\in\mathbb{R}^d$, which is not available in the decentralized setting.² The aggregation parameter $\gamma\in[0,1]$ does not need to be tuned; in fact, we use the default $\gamma := 1$ throughout the paper, see [Ma et al., 2015] for a discussion. Once $\gamma$ is settled, a safe choice of the subproblem relaxation parameter $\sigma'$ is given as $\sigma' := \gamma K$. $\sigma'$ can additionally be tightened using an improved Hessian subproblem (Appendix E.3).

² Subproblem interpretation: Note that for the special case $\gamma := 1$, $\sigma' := K$, by smoothness of $f$ our subproblem (2) is an upper bound on

$$\min_{\Delta x_{[k]}\in\mathbb{R}^n}\; \tfrac{1}{K} f\big(A(x + K\,\Delta x_{[k]})\big) + \textstyle\sum_{i\in\mathcal{P}_k} g_i\big(x_i + (\Delta x_{[k]})_i\big), \qquad\qquad \text{(3)}$$

which is a scaled block-coordinate update of block $k$ of the original objective (A). This assumes that we have consensus, $v_k \equiv Ax$ for all $k$. For quadratic objectives (i.e. when $f = \|\cdot\|_2^2$ and $A$ describes the quadratic), equality of the formulations (2) and (3) holds. Furthermore, by convexity of $f$, the sum of (3) over $k$ is an upper bound on the centralized update $f(x+\Delta x) + g(x+\Delta x)$. Both inequalities quantify the overhead of the distributed algorithm over the centralized version; see also [Yang, 2013, Ma et al., 2015, Smith et al., 2018] for the non-decentralized case.

Algorithm description. At time $t$ on node $k$, $v_k^{(t+1/2)}$ is a local estimate of the shared variable after a communication step (i.e. gossip mixing). The local subproblem (1) based on this estimate is solved and yields $\Delta x_{[k]}$. Then we calculate $\Delta v_k := A_{[k]}\Delta x_{[k]}$ and update the local shared vector $v_k^{(t+1)}$. We allow the local subproblem to be solved approximately:

Assumption 1 ($\Theta$-approximate solution). Let $\Theta\in[0,1]$ be the relative accuracy of the (potentially randomized) local solver, in the sense of returning an approximate solution $\Delta x_{[k]}$ at each step $t$ such that

$$\mathbb{E}\big[\mathcal{G}_k^{\sigma'}(\Delta x_{[k]}; v_k, x_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta x_{[k]}^\star; v_k, x_{[k]})\big] \;\le\; \Theta\,\big(\mathcal{G}_k^{\sigma'}(0; v_k, x_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta x_{[k]}^\star; v_k, x_{[k]})\big),$$

where $\Delta x_{[k]}^\star \in \arg\min_{\Delta x\in\mathbb{R}^n} \mathcal{G}_k^{\sigma'}(\Delta x; v_k, x_{[k]})$, for each $k\in[K]$.

Elasticity to network size, compute resources, changing data, and fault tolerance. Real-world communication networks are not homogeneous and static, but vary greatly in availability, computation, communication and storage capacity. Also, the training data is subject to changes. While these issues impose significant challenges for most existing distributed training algorithms, we show here that COLA offers adaptivity to such dynamic and heterogeneous scenarios. Scalability and elasticity in terms of availability and computational capacity can be modelled by a node-specific local accuracy parameter $\Theta_k$ in Assumption 1, as proposed by [Smith et al., 2017]. The more resources node $k$ has, the more accurate (smaller) a $\Theta_k$ we can use. The same mechanism also allows dealing with fault tolerance and stragglers, which is crucial e.g. on a network of personal devices. More specifically, when a new node $k$ joins the network, its $x_{[k]}$ variables are initialized to 0; when node $k$ leaves, its $x_{[k]}$ is frozen and its subproblem is not touched anymore (i.e. $\Theta_k = 1$), as sketched below.
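The join/leave mechanism just described amounts to a few lines of per-node bookkeeping; the following is a hypothetical sketch (illustrative names, assuming NumPy), where `local_solver` again stands in for a solver of subproblem (1), here run only to the node-specific accuracy $\Theta_k$:

```python
import numpy as np

class ColaNode:
    """Hypothetical per-node bookkeeping for the elasticity mechanism (names are illustrative)."""

    def __init__(self, A_local, theta=0.5):
        self.A = A_local                      # local data columns A_[k]
        self.x = np.zeros(A_local.shape[1])   # a joining node starts from x_[k] = 0
        self.theta = theta                    # local accuracy Theta_k; smaller means more local work
        self.active = True

    def leave(self):
        """Node k drops out of the network."""
        self.active = False                   # x_[k] is kept frozen from now on
        self.theta = 1.0                      # Theta_k = 1: its subproblem is no longer worked on

    def gossip_and_update(self, v_half, local_solver, gamma, K):
        """One round of Algorithm 1 from this node's perspective; Delta x_[k] = 0 if inactive."""
        if not self.active or self.theta >= 1.0:
            return v_half                     # only the gossip step touches the local estimate
        dx = local_solver(self.A, v_half, self.x, self.theta)
        self.x += gamma * dx
        return v_half + gamma * K * (self.A @ dx)
```

In Algorithm 1 this corresponds to taking $\Delta x_{[k]} = 0$ for an inactive node, so its estimate is only updated by the gossip step.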
Using the same approach, we can adapt to dynamic changes in the dataset, such as additions and removals of local data columns, by adjusting the size of the local weight vector accordingly. Unlike gradient-based methods and ADMM, COLA does not require parameter tuning to converge, which increases its resilience to drastic changes.

Extension to improved second-order subproblems. In the centralized setting, it has recently been shown that Hessian information of $f$ can be used to define improved local subproblems [Lee and Chang, 2017, Dünner et al., 2018]. Similar techniques can be applied to COLA as well; details are given in Appendix E.

Extension to time-varying graphs. Similar to scalability and elasticity, it is also straightforward to extend COLA to a time-varying graph under proper assumptions. If we use the time-varying model of [Nedic et al., 2017, Assumption 1], where an undirected graph is connected within every $B$ gossip steps, then changing COLA to perform $B$ communication steps and one computation step per round still guarantees convergence. Details of this setup are provided in Appendix E.

3 On the convergence of COLA

In this section we present a convergence analysis of the proposed decentralized algorithm COLA, for both general convex and strongly convex objectives. In order to capture the evolution of COLA, we reformulate the original problem (A) by incorporating both $x$ and the local estimates $\{v_k\}_{k=1}^K$:

$$\min_{x,\{v_k\}}\; H_A\big(x, \{v_k\}_{k=1}^K\big) := \tfrac{1}{K}\textstyle\sum_{k=1}^K f(v_k) + g(x) \quad\text{such that}\quad v_k = Ax,\; k=1,\dots,K. \qquad \text{(DA)}$$

While consensus is not always satisfied during Algorithm 1, the following relations between the decentralized objective and the original one (A) always hold. All proofs are deferred to Appendix C.

Lemma 1. Let $\{v_k\}$ and $x$ be the iterates generated during the execution of Algorithm 1. At any timestep it holds that

$$\tfrac{1}{K}\textstyle\sum_{k=1}^K v_k = Ax, \qquad\qquad \text{(4)}$$
$$F_A(x) \;\le\; H_A\big(x, \{v_k\}_{k=1}^K\big) \;\le\; F_A(x) + \tfrac{1}{2\tau K}\textstyle\sum_{k=1}^K \|v_k - Ax\|^2. \qquad\qquad \text{(5)}$$

The dual problem and duality gap of the decentralized objective (DA) are given in Lemma 2.

Lemma 2 (Decentralized dual function and duality gap). The Lagrangian dual of the decentralized formulation (DA) is

$$\min_{\{w_k\}_{k=1}^K}\; H_B\big(\{w_k\}_{k=1}^K\big) := \tfrac{1}{K}\textstyle\sum_{k=1}^K f^*(w_k) + \textstyle\sum_{i=1}^n g_i^*\big(-\tfrac{1}{K}\textstyle\sum_{k=1}^K A_i^\top w_k\big). \qquad\qquad \text{(6)}$$

Given primal variables $\{x, \{v_k\}_{k=1}^K\}$ and dual variables $\{w_k\}_{k=1}^K$, the duality gap is

$$G_H\big(x, \{v_k\}, \{w_k\}\big) := \tfrac{1}{K}\textstyle\sum_{k=1}^K\big(f(v_k) + f^*(w_k)\big) + g(x) + \textstyle\sum_{i=1}^n g_i^*\big(-\tfrac{1}{K}\textstyle\sum_{k=1}^K A_i^\top w_k\big).$$

If the dual variables are fixed according to the optimality condition $w_k = \nabla f(v_k)$, the dual variables can be omitted from the argument list of the duality gap, which we then write as $G_H(x, \{v_k\}_{k=1}^K)$. Note that the decentralized duality gap generalizes the duality gap of COCOA: when consensus is ensured, i.e. $v_k \equiv Ax$ and $w_k \equiv \nabla f(Ax)$, the decentralized duality gap recovers that of COCOA.

3.1 Linear rate for strongly convex objectives

We use the following data-dependent quantities in our main theorems:

$$\sigma_k := \max_{x_{[k]}\in\mathbb{R}^n} \big\|A_{[k]} x_{[k]}\big\|^2 / \|x_{[k]}\|^2, \qquad \sigma_{\max} := \max_{k=1,\dots,K}\sigma_k, \qquad \sigma := \textstyle\sum_{k=1}^K \sigma_k n_k. \qquad \text{(7)}$$

If the $\{g_i\}$ are strongly convex, COLA achieves the following linear rate of convergence.

Theorem 1 (Strongly convex $g_i$). Consider Algorithm 1 with $\gamma := 1$ and let $\Theta$ be the quality of the local solver as in Assumption 1. Let $g_i$ be $\mu_g$-strongly convex for all $i\in[n]$ and let $f$ be $(1/\tau)$-smooth. Let

$$\bar\sigma' := (1+\beta)\sigma', \qquad \delta := \Big(1 + \tfrac{(1-\beta)^2}{36(1+\Theta)\beta}\Big)^{-1}, \qquad \eta := \gamma(1-\Theta)(1-\delta), \qquad s_0 := \tfrac{\tau\mu_g}{\tau\mu_g + \sigma_{\max}\bar\sigma'} \in [0,1]. \qquad \text{(8)}$$

Then after $T$ iterations of Algorithm 1 with³

$$T \;\ge\; \tfrac{1}{\eta s_0}\,\log\tfrac{\varepsilon_H^{(0)}}{\varepsilon_H},$$

it holds that $\mathbb{E}\big[H_A(x^{(T)},\{v_k^{(T)}\}_{k=1}^K) - H_A(x^\star,\{v_k^\star\}_{k=1}^K)\big] \le \varepsilon_H$. Furthermore, after $T$ iterations with

$$T \;\ge\; \tfrac{1}{\eta s_0}\,\log\Big(\tfrac{1}{\eta s_0}\,\tfrac{\varepsilon_H^{(0)}}{\varepsilon_{G_H}}\Big),$$

we have the expected duality gap $\mathbb{E}\big[G_H\big(x^{(T)}, \{\textstyle\sum_{l=1}^K W_{kl}\, v_l^{(T)}\}_{k=1}^K\big)\big] \le \varepsilon_{G_H}$.

3.2 Sublinear rate for general convex objectives

Models such as sparse logistic regression, Lasso and group Lasso are non-strongly convex.
For such models, we show that COLA enjoys an $O(1/T)$ sublinear rate of convergence for all network topologies with a positive spectral gap.

Theorem 2 (Non-strongly convex case). Consider Algorithm 1, using a local solver of quality $\Theta$. Let $g_i(\cdot)$ have $L$-bounded support, and let $f$ be $(1/\tau)$-smooth. Let $\varepsilon_{G_H} > 0$ be the desired duality gap, and let $\bar\sigma' := (1+\beta)\sigma'$, $\delta := \big(1 + \tfrac{(1-\beta)^2}{36(1+\Theta)\beta}\big)^{-1}$ and $\eta := \gamma(1-\Theta)(1-\delta)$ as in Theorem 1. Then after $T$ iterations, where

$$T \;\ge\; T_0 + \max\Big\{\Big\lceil\tfrac{1}{\eta}\Big\rceil,\; \tfrac{4L^2\sigma\bar\sigma'}{\tau\,\varepsilon_{G_H}\,\eta}\Big\}, \qquad T_0 \;\ge\; t_0 + \Big[\tfrac{2}{\eta}\Big(\tfrac{8L^2\sigma\bar\sigma'}{\tau\,\varepsilon_{G_H}} - 1\Big)\Big]_+, \qquad t_0 \;\ge\; \max\Big\{0,\; \Big\lceil\tfrac{1}{\eta}\log\tfrac{2\tau\big(H_A(x^{(0)},\{v_l^{(0)}\}) - H_A(x^\star,\{v^\star\})\big)}{4L^2\sigma\bar\sigma'}\Big\rceil\Big\},$$

we have that the expected duality gap satisfies

$$\mathbb{E}\big[G_H\big(\bar x, \{\bar v_k\}_{k=1}^K, \{\bar w_k\}_{k=1}^K\big)\big] \;\le\; \varepsilon_{G_H}$$

at the averaged iterates $\bar x := \tfrac{1}{T - T_0}\sum_{t=T_0+1}^{T} x^{(t)}$, $\bar v_k := \tfrac{1}{T - T_0}\sum_{t=T_0+1}^{T} (v_k')^{(t)}$ and $\bar w_k := \tfrac{1}{T - T_0}\sum_{t=T_0+1}^{T} \nabla f\big((v_k')^{(t)}\big)$, where $v_k' := \sum_{l=1}^K W_{kl}\, v_l$.

Note that the assumption of bounded support for the $g_i$ functions is not restrictive in the general convex case, as discussed e.g. in [Dünner et al., 2016].

³ Here $\varepsilon_H^{(0)} := H_A(x^{(0)}, \{v_k^{(0)}\}_{k=1}^K) - H_A(x^\star, \{v_k^\star\}_{k=1}^K)$ is the initial suboptimality.

3.3 Local certificates for global accuracy

Accuracy certificates for the training error are very useful for practitioners to diagnose the learning progress. In the centralized setting, the duality gap serves as such a certificate, and is available as a stopping criterion on the master node. In the decentralized setting of interest here, this is more challenging, as consensus is not guaranteed. Nevertheless, we show in the following Proposition 1 that certificates for the decentralized objective (DA) can be computed from local quantities.

Proposition 1 (Local certificates). Assume $g_i$ has $L$-bounded support, and let $\mathcal{N}_k := \{j : W_{jk} > 0\}$ be the set of nodes accessible to node $k$. Then for any given $\varepsilon > 0$, we have $G_H(x; \{v_k\}_{k=1}^K) \le \varepsilon$ if for all $k = 1, \dots, K$ the following two local conditions are satisfied: (9) the duality gap of the local subproblem given by $v_k$ is below a prescribed threshold proportional to $\varepsilon$, and (10) the consensus violation is bounded, in the sense that

$$\Big\|\nabla f(v_k) - \tfrac{1}{|\mathcal{N}_k|}\textstyle\sum_{j\in\mathcal{N}_k}\nabla f(v_j)\Big\|$$

is below a threshold depending only on $\varepsilon$, $\beta$, $L$ and $\sum_{k=1}^K n_k\sigma_k$ (the precise constants are given with the proof in Appendix C).

The local conditions (9) and (10) have a clear interpretation. The first one ensures that the duality gap of the local subproblem given by $v_k$, the left-hand side of (9), is small. The second condition (10) guarantees that the consensus violation is bounded, by ensuring that the gradient of each node is similar to those of its neighboring nodes.

Remark 1. The resulting certificate from Proposition 1 is local, in the sense that no global vector aggregations are needed to compute it. For a certificate on the global objective, the boolean flag of whether each local condition (9) and (10) is satisfied needs to be shared with all nodes, but this requires extremely little communication. Exact values of the parameters $\beta$ and $\sum_{k=1}^K n_k\sigma_k$ are not required to be known, and any valid upper bound can be used instead.

We can use the local certificates to avoid unnecessary work on local problems which are already optimized, as well as to continuously quantify how much newly arriving local data has to be re-optimized in the case of online training. The local certificates can also be used to quantify the contribution of newly joining or departing nodes, which is particularly useful in the elastic scenario described above.

4 Experimental results

Here we illustrate the advantages of COLA in three respects: first, we investigate its behavior on different network topologies and with varying subproblem quality $\Theta$; second, we compare COLA with state-of-the-art decentralized baselines: 1) DIGing [Nedic et al., 2017], which generalizes the gradient-tracking technique of the EXTRA algorithm [Shi et al., 2015], and 2) decentralized ADMM (also known as consensus ADMM), which extends the classical ADMM (Alternating Direction Method of Multipliers) [Boyd et al., 2011] to the decentralized setting [Shi et al., 2014, Wei and Ozdaglar, 2013]; finally, we show that COLA works in the challenging unreliable-network environment where each node has a certain chance of dropping out of the network.

Figure 1: Suboptimality for solving Lasso ($\lambda = 10^{-6}$) for the RCV1 dataset on a ring of 16 nodes: a) number of iterations; b) time. The legend gives the number of local data passes per communication round.
Figure 2: Convergence of COLA for solving problems on a ring of $K=16$ nodes. Left: ridge regression on the URL reputation dataset ($\lambda = 10^{-4}$); right: Lasso on the webspam dataset ($\lambda = 10^{-5}$).

We implement all algorithms in PyTorch with an MPI backend. The decentralized network topology is simulated by running one thread per graph node, on a 2×12-core Intel Xeon CPU E5-2680 v3 server with 256 GB RAM. Table 1 describes the datasets⁴ used in the experiments. For Lasso, the columns of $A$ are features. For ridge regression, the columns are features and samples for COLA primal and COLA dual, respectively. The order of the columns is shuffled once before being distributed across the nodes. Due to space limitations, details of the experimental configurations are included in Appendix D.

Table 1: Datasets used for the empirical study

| Dataset     | #Training | #Features | Sparsity |
|-------------|-----------|-----------|----------|
| URL         | 2M        | 3M        | 3.5e-5   |
| Webspam     | 350K      | 16M       | 2.0e-4   |
| Epsilon     | 400K      | 2K        | 1.0      |
| RCV1 Binary | 677K      | 47K       | 1.6e-3   |

Effect of the approximation quality $\Theta$. We study the convergence behavior in terms of the approximation quality $\Theta$. Here, $\Theta$ is controlled by the number of data passes on subproblem (1) per node. Figure 1 shows that increasing the number of local passes always reduces the number of iterations (i.e. communication rounds) of COLA. However, for a fixed network bandwidth, this leads to a clear trade-off in overall wall-clock time, reflecting the cost of both communication and computation: more local passes mean fewer communication rounds, but also more time spent solving the subproblems. These observations suggest that one can adjust $\Theta$ for each node to handle system heterogeneity, as discussed at the end of Section 2.

Effect of graph topology. Fixing $K=16$, we test the performance of COLA on 5 different topologies: ring, 2-connected cycle, 3-connected cycle, 2D grid and complete graph. The mixing matrix $W$ is given by Metropolis weights for all test cases (details in Appendix B). Convergence curves are plotted in Figure 3. One can observe that COLA converges monotonically for all topologies and, especially when all nodes in the network are equal, a smaller $\beta$ leads to a faster convergence rate. This is consistent with the intuition that $1-\beta$ measures the connectivity level of the topology.

Superior performance compared to baselines. We compare COLA with DIGing and decentralized ADMM (D-ADMM) for strongly convex and general convex problems. For the general convex objective, we use Lasso regression with $\lambda = 10^{-4}$ on the webspam dataset; for the strongly convex objective, we use ridge regression with $\lambda = 10^{-5}$ on the URL reputation dataset. For ridge regression, we can map COLA to both the primal and the dual problem. Figure 2 traces the log-suboptimality. One can observe that for both general and strongly convex objectives, COLA significantly outperforms DIGing and decentralized ADMM in terms of both the number of communication rounds and the computation time. While DIGing and D-ADMM need parameter tuning to ensure convergence and efficiency, COLA is much easier to deploy as it is parameter-free. Additionally, the convergence guarantees of ADMM rely on exact subproblem solvers, whereas an inexact solver is allowed for COLA.
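Concretely, the Lasso experiments above correspond to formulation (A) with $f(v) = \tfrac12\|v-b\|^2$ for a label vector $b$ and $g_i(x_i) = \lambda|x_i|$, so a local solver for subproblem (1) only needs $\nabla f$ and the proximal operator of $g_i$. A minimal sketch of these problem-defining pieces (assuming NumPy; names are illustrative and not taken from the released code):

```python
import numpy as np

lam = 1e-6                                   # regularisation strength, as in Figure 1

def f(v, b):
    """Smooth part of (A), evaluated on v = Ax: least-squares loss."""
    return 0.5 * np.sum((v - b) ** 2)

def grad_f(v, b):
    """Gradient of f; it is 1-Lipschitz, so tau = 1 in the analysis."""
    return v - b

def prox_g(z, step):
    """Proximal operator of step * lam * |.| (soft-thresholding), applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
```

Ridge regression is obtained analogously by replacing $g_i(x_i) = \lambda|x_i|$ with $g_i(x_i) = \tfrac{\lambda}{2}x_i^2$, which is $\lambda$-strongly convex and hence covered by the linear rate of Theorem 1.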
⁴ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 3: Performance comparison of COLA on different topologies, solving Lasso regression ($\lambda = 10^{-6}$) for the RCV1 dataset with 16 nodes.

Figure 4: Performance of COLA on the URL dataset ($\lambda = 10^{-4}$) when each node has probability $p$ of staying in the network; $x_{[k]}$ is frozen when node $k$ leaves the network.

Fault tolerance to unreliable nodes. Assume each node of the network has only a probability $p$ of participating in each round. If a new node $k$ joins the network, its local variables are initialized as $x_{[k]} = 0$; if node $k$ leaves the network, its $x_{[k]}$ is frozen with $\Theta_k = 1$. All remaining nodes dynamically adjust their weights to maintain the doubly stochastic property of $W$. We run COLA on such unreliable networks for different values of $p$ and show the results in Figure 4. First, one can observe that for all $p > 0$ the suboptimality decreases monotonically as COLA progresses. It is also clear from the results that a smaller dropout rate (a larger $p$) leads to faster convergence of COLA.

5 Discussion and conclusions

In this work we have studied training generalized linear models in the fully decentralized setting. We proposed a communication-efficient decentralized framework, termed COLA, which is free of parameter tuning. We proved that it has a sublinear rate of convergence for general convex problems, allowing e.g. L1 regularizers, and a linear rate of convergence for strongly convex objectives. Our scheme offers primal-dual certificates which are useful in the decentralized setting. We demonstrated that COLA offers full adaptivity to heterogeneous distributed systems on arbitrary network topologies, is adaptive to changes in network size and data, and offers fault tolerance and elasticity. Future research directions include improved subproblems, extensions to network topologies with directed graphs, and the use of recent communication compression schemes [Stich et al., 2018].

Acknowledgments. We thank Prof. Bharat K. Bhargava for fruitful discussions. We acknowledge funding from SNSF grant 200021_175796, the Microsoft Research JRC project Coltrain, as well as a Google Focused Research Award.

References

John N. Tsitsiklis, Dimitri P. Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35, 2016.

Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

Volkan Cevher, Stephen Becker, and Mark Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31(5):32–43, 2014.

Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I. Jordan, and Martin Jaggi.
CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18(230):1–49, 2018.

Sixin Zhang, Anna E. Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In NIPS 2015 - Advances in Neural Information Processing Systems 28, pages 685–693, 2015.

Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In ICML 2017 - Proceedings of the 34th International Conference on Machine Learning, pages 1882–1919, 2017.

Celestine Dünner, Simone Forte, Martin Takáč, and Martin Jaggi. Primal-dual rates and certificates. In ICML 2016 - Proceedings of the 33rd International Conference on Machine Learning, pages 783–792, 2016.

Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.

Kevin Scaman, Francis R. Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 3027–3036, 2017.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for non-smooth distributed optimization in networks. arXiv preprint arXiv:1806.00291, 2018.

Dusan Jakovetic, Joao Xavier, and Jose M. F. Moura. Convergence rate analysis of distributed gradient methods for smooth optimization. In Telecommunications Forum (TELFOR), 2012 20th, pages 867–870. IEEE, 2012.

Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.

Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

Ermin Wei and Asuman Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. arXiv, 2013.

Pascal Bianchi, Walid Hachem, and Franck Iutzeler. A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, 2016.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5336–5346, 2017.

Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. In ICML 2018 - Proceedings of the 35th International Conference on Machine Learning, 2018.

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training over decentralized data. arXiv preprint arXiv:1803.07068, 2018a.

Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In NeurIPS 2018 - Advances in Neural Information Processing Systems, 2018b.

Tianyu Wu, Kun Yuan, Qing Ling, Wotao Yin, and Ali H. Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307, 2018.

Benjamin Sirb and Xiaojing Ye.
Decentralized consensus algorithm with delayed and stochastic gradients. SIAM Journal on Optimization, 28(2):1232–1254, 2018.

Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, 2013.

Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, pages 1973–1982, 2015.

Celestine Dünner, Aurelien Lucchi, Matilde Gargiani, An Bian, Thomas Hofmann, and Martin Jaggi. A distributed second-order algorithm you can trust. In ICML 2018 - Proceedings of the 35th International Conference on Machine Learning, pages 1357–1365, 2018.

Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.

Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370, 2015.

Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

Matilde Gargiani. Hessian-CoCoA: A general parallel and distributed framework for non-strongly convex regularizers. Master's thesis, ETH Zurich, 2017.

Ching-pei Lee and Kai-Wei Chang. Distributed block-diagonal approximation methods for regularized empirical risk minimization. arXiv preprint arXiv:1709.03043, 2017.

Ching-pei Lee, Cong Han Lim, and Stephen J. Wright. A distributed quasi-Newton algorithm for empirical risk minimization with nonsmooth regularization. In ACM International Conference on Knowledge Discovery and Data Mining, 2018.

Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In NIPS 2017 - Advances in Neural Information Processing Systems 30, 2017.

Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.

Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In NeurIPS 2018 - Advances in Neural Information Processing Systems, 2018.

Ralph Tyrell Rockafellar. Convex Analysis. Princeton University Press, 2015.

W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshop on Autodiff, 2017.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.