# IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method

Yossi Arjevani (NYU) yossia@nyu.edu; Joan Bruna (NYU) bruna@cims.nyu.edu; Bugra Can (Rutgers University) bc600@scarletmail.rutgers.edu; Mert Gürbüzbalaban (Rutgers University) mg1366@rutgers.edu; Stefanie Jegelka (MIT) stefje@csail.mit.edu; Hongzhou Lin (MIT) hongzhou@mit.edu

Abstract. We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way of deriving several well-known decentralized algorithms, including EXTRA [47] and SSDA [43]. When coupled with accelerated gradient descent, our framework yields a novel primal algorithm whose convergence rate is optimal and matched by recently derived lower bounds. We provide experimental results that demonstrate the effectiveness of the proposed algorithm on highly ill-conditioned problems.

1 Introduction

Due to their rapidly increasing size, modern datasets are typically collected, stored and manipulated in a distributed manner. This, together with strict privacy requirements, has created a large demand for efficient solvers for the decentralized setting, in which models are trained locally at each agent and only local parameter vectors are shared. This approach has become particularly appealing for applications such as edge computing [30, 48], cooperative multi-agent learning [8, 38] and federated learning [31, 49]. Clearly, the nature of the decentralized setting prevents global synchronization, as only communication between neighboring machines is allowed. The goal is then to arrive at a consensus among all local agents on a model that performs as well as in the centralized setting.
Arguably, the simplest approach to addressing decentralized settings is to adapt the vanilla gradient descent method to the underlying network architecture [11, 18, 34, 53]. To this end, the connections between the agents are modeled through a mixing matrix which dictates how agents average over their neighbors' parameter vectors. Perhaps surprisingly, when the stepsizes are constant, simply averaging the local iterates via the mixing matrix only converges to a neighborhood of the optimum [47, 58]. A recent line of works [17, 35, 36, 39, 46, 47] proposed a number of alternative methods that converge linearly to the global minimum. Extensions to the composite setting have also been studied [1, 2, 27, 55].

The overall complexity of solving decentralized optimization problems is typically determined by two factors: (i) the condition number of the objective function κf, which measures the hardness of solving the underlying optimization problem, and (ii) the condition number of the mixing matrix κW, which quantifies the severity of information bottlenecks present in the network. Lower complexity bounds recently derived for distributed settings [3, 5, 43, 52] show that one cannot expect a better dependence on the condition numbers than √κf and √κW.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Notably, despite the considerable recent progress, none of the methods mentioned above achieves accelerated rates, that is, a square-root dependence on both κf and κW simultaneously. An extensive effort has been devoted to obtaining acceleration for decentralized algorithms under various settings [12, 13, 16, 26, 40, 43, 44, 51, 54, 57, 59]. When a dual oracle is available, that is, when access to the gradients of the dual functions is provided, optimal rates can be attained for smooth and strongly convex objectives [43].
However, having access to a dual oracle is a very restrictive assumption, and resorting to a direct primalization through inexact approximation of the dual gradients leads to sub-optimal worst-case theoretical rates [51]. In this work, we propose a novel primal approach that leads to optimal rates in terms of the dependence on κf and κW. Our contributions can be summarized as follows.

- We introduce a novel framework based on the accelerated augmented Lagrangian method for designing primal decentralized methods. The framework provides a simple and systematic way of deriving several well-known decentralized algorithms [17, 46, 47], including EXTRA [47] and SSDA [43], and unifies their convergence analyses.
- Using accelerated gradient descent as a sub-routine, we derive a novel method for smooth and strongly convex local functions which achieves optimal accelerated rates in both condition numbers of the problem, κf and κW, using primal updates; see Table 2.
- We perform a large number of experiments which confirm our theoretical findings, and demonstrate a significant improvement when the objective function is ill-conditioned and κf ≫ κW.

2 Decentralized Optimization Setting

We consider n computational agents and a network graph G = (V, E) which defines how the agents are linked. The set of vertices V = {1, ..., n} represents the agents, and the set of edges E ⊆ V × V specifies the connectivity of the network, i.e., a communication link between agents i and j exists if and only if (i, j) ∈ E. Each agent has access to local information encoded by a loss function fi : R^d → R. The goal is to minimize the global objective over the entire network,

min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n fi(x).    (1)

In this paper, we assume that the local loss functions fi are differentiable, L-smooth and µ-strongly convex.¹ Strong convexity of the component functions fi implies that the problem admits a unique solution, which we denote by x*.
We consider the following computation and communication models [43]:

Local computation: Each agent i is able to compute the gradient of fi, and the cost of this computation is one unit of time.

Communication: Communication is done synchronously, and each agent can only exchange information with its neighbors, where i is a neighbor of j if (i, j) ∈ E. The ratio between the communication cost and the computation cost per round is denoted by τ.

We further assume that the propagation of information is governed by a mixing matrix W ∈ R^{n×n} [34, 43, 58]. Specifically, given a local copy of the decision variable xi ∈ R^d at node i ∈ {1, ..., n}, one round of communication performs the update xi ← Σ_{j=1}^n Wij xj. The following standard assumptions regarding the mixing matrix [43] are made throughout the paper.

Assumption 1. The mixing matrix W satisfies the following:
1. Symmetry: W = W^T.
2. Positiveness: W is positive semi-definite.
3. Decentralized property: If (i, j) ∉ E and i ≠ j, then Wij = Wji = 0.
4. Spectrum property: The kernel of W is the span of the vector of all ones, Ker(W) = R·1n.

¹ f is L-smooth if ∇f is L-Lipschitz; f is µ-strongly convex if f − (µ/2)‖x‖² is convex.

Algorithm 1 Decentralized Augmented Lagrangian framework
Input: mixing matrix W, regularization parameter ρ, stepsize η.
1: for k = 1, 2, ..., K do
2:   Xk = argmin_X { Pk(X) := F(X) + Λk^T X + (ρ/2)‖X‖²_W }.
3:   Λk+1 = Λk + ηWXk.
4: end for

A typical choice of the mixing matrix is the (weighted) Laplacian matrix of the graph. Another common choice is to set W = I − W̃, where W̃ is a doubly stochastic matrix [7, 10, 47]. By Assumption 1.4, all the eigenvalues of W are strictly positive, except for the smallest one. We let λmax(W) denote the maximum eigenvalue, and λ+min(W) the smallest positive eigenvalue. The ratio between these two quantities plays an important role in quantifying the overall complexity of the problem.

Theorem 1 (Decentralized lower bound [43]).
For any first-order black-box decentralized method, the number of time units required to reach an ϵ-optimal solution of (1) is lower bounded by

Ω( √κf (1 + τ√κW) log(1/ϵ) ),    (2)

where κf = L/µ is the condition number of the loss function and κW = λmax(W)/λ+min(W) is the condition number of the mixing matrix. The lower bound decomposes as follows: a) a computation cost, given by √κf log(1/ϵ), and b) a communication cost, given by τ√(κf κW) log(1/ϵ). The computation cost matches lower bounds for centralized settings [4, 37], while the communication cost introduces an additional term which depends on κW and accounts for the price of communication in decentralized models. It follows that the effective condition number of a given decentralized problem is κf κW. Clearly, the choice of the matrix W can strongly affect the optimal attainable performance. For example, κW can get as large as n² for the line/cycle graph, or be constant for the complete graph. In this paper, we do not focus on optimizing the choice of W for a given graph G; instead, following the approach taken by existing decentralized algorithms, we assume that the graph G and the mixing matrix W are given, and aim to achieve the optimal complexity (2) for this particular choice of W.

3 Related Work and the Dual Formulation

A standard approach to addressing problem (1) is to express it as a constrained optimization problem

min_{X ∈ R^{nd}} F(X) := (1/n) Σ_{i=1}^n fi(xi)  such that  x1 = x2 = ... = xn ∈ R^d,    (P)

where X = [x1; x2; ...; xn] ∈ R^{nd} is a concatenation of the local vectors. To lighten the notation, we introduce the global mixing matrix W = W ⊗ Id ∈ R^{nd×nd}, where ⊗ denotes the Kronecker product, and let ‖·‖_W denote the semi-norm induced by W, i.e. ‖X‖²_W = X^T W X. With this notation in hand, we briefly review the existing literature on decentralized algorithms.

Decentralized Gradient Descent The decentralized gradient method [34, 58] has the update rule

Xk+1 = WXk − η∇F(Xk).
(DGD)

However, with a constant stepsize, the algorithm does not converge to a global minimum of (P), but rather to a neighborhood of the solution [58]. A decreasing stepsize schedule may be used to ensure convergence, but this yields a sublinear convergence rate, even in the strongly convex case.

Linearly convergent primal algorithms By and large, recent methods that achieve linear convergence in the strongly convex case [17, 25, 35, 36, 39, 46, 47, 50] can be shown to follow a general framework based on the augmented Lagrangian method; see Algorithm 1. The main difference lies in how the subproblems Pk are solved. Shi et al. [46] apply an alternating directions method; in [47], the EXTRA algorithm takes a single gradient descent step to solve Pk (see Appendix B for details); Jakovetić et al. [17] use multi-step algorithms such as Jacobi/Gauss-Seidel methods. To the best of our knowledge, the complexity of these algorithms is not better than O((1 + τ)κf κW log(1/ϵ)); in other words, they are non-accelerated. The recently proposed algorithm APM-C [26] enjoys a square-root dependence on κf and κW, but incurs an additional log(1/ϵ) factor compared to the optimal attainable rate.

Algorithm 2 Accelerated Decentralized Augmented Lagrangian framework
Input: mixing matrix W, regularization parameter ρ, stepsize η, extrapolation parameters {βk}.
1: Initialize dual variables Λ1 = Ω1 = 0 ∈ R^{nd}.
2: for k = 1, 2, ..., K do
3:   Xk = argmin_X { Pk(X) := F(X) + Ωk^T X + (ρ/2)‖X‖²_W }.
4:   Λk+1 = Ωk + ηWXk.
5:   Ωk+1 = Λk+1 + βk+1(Λk+1 − Λk).
6: end for
Output: XK.

Algorithm 3 IDEAL: Inexact Accelerated Decentralized Augmented Lagrangian framework
Additional Input: a first-order optimization algorithm A.
Apply A to solve the subproblem Pk, warm-starting at Xk−1, to find an approximate solution
  Xk ≈ argmin_X { Pk(X) := F(X) + Ωk^T X + (ρ/2)‖X‖²_W }.
Option I: stop the algorithm when ‖Xk − X*k‖² ≤ ϵk, where X*k is the unique minimizer of Pk.
Option II: stop the algorithm after a prefixed number of iterations Tk.
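As a concrete toy instance of the framework, the sketch below (a hypothetical example, not the paper's implementation) runs Algorithm 1 on a consensus problem with quadratic locals fi(x) = (x − bi)²/2, so that µ = L = 1 and, taking F as the plain sum, the consensus optimum is the mean of the bi. The mixing matrix is the cycle-graph Laplacian (which satisfies Assumption 1), and since Pk is quadratic, line 2 of Algorithm 1 is solved exactly in closed form; NumPy is assumed.

```python
import numpy as np

def cycle_laplacian(n):
    """A valid mixing matrix: the graph Laplacian of the n-cycle (Assumption 1)."""
    W = 2.0 * np.eye(n)
    for i in range(n):
        W[i, (i + 1) % n] -= 1.0
        W[i, (i - 1) % n] -= 1.0
    return W

n = 4
W = cycle_laplacian(n)
ev = np.linalg.eigvalsh(W)
assert np.allclose(W @ np.ones(n), 0.0)     # spectrum property: Ker(W) = span(1_n)
lmax, lmin = ev.max(), ev[ev > 1e-9].min()

B = np.array([1.0, -2.0, 3.0, 0.5])         # f_i(x) = (x - b_i)^2 / 2
rho = 0.5
L_rho = lmax / (1.0 + rho * lmax)           # dual smoothness (mu = 1 here)
eta = 1.0 / L_rho

Lam = np.zeros(n)
for k in range(100):
    # Line 2: exact argmin of F(X) + Lam^T X + (rho/2)||X||_W^2, since the
    # optimality condition X - B + Lam + rho W X = 0 is linear.
    X = np.linalg.solve(np.eye(n) + rho * W, B - Lam)
    # Line 3: dual ascent on the multipliers.
    Lam = Lam + eta * (W @ X)

print(X)   # all entries approach the consensus optimum B.mean() = 0.625
```

The dual ascent step drives WX to zero, i.e. the agents to consensus, while the multipliers converge to the optimal dual variables; the linear rate here is governed by the ratio of the dual strong convexity and smoothness constants discussed in Section 4.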
Optimal method based on the dual formulation By Assumption 1.4, the constraint x1 = x2 = ... = xn is equivalent to the identity WX = 0, which is in turn equivalent to √W X = 0. Hence, the dual formulation of (P) is given by

max_{Λ ∈ R^{nd}} −F*(−√W Λ).    (D)

Since the primal function is convex and the constraints are linear, strong duality holds and we may address the dual problem instead of the primal one. Using this approach, [43] proposed a dual method with optimal accelerated rates, applying Nesterov's accelerated gradient method to the dual problem (D). As mentioned earlier, the main drawback of this method is that it requires access to the gradient of the dual function which, unless the primal function has a relatively simple structure, is not available. One may apply a first-order method to approximate the dual gradients inexactly, at the expense of an additional √κf factor in the computation cost [51], but this would make the algorithm no longer optimal. This indicates that achieving optimal rates using primal updates is a rather challenging task in the decentralized setting. In the following sections, we provide a generic framework which allows us to derive a primal decentralized method with optimal complexity guarantees.

4 An Inexact Accelerated Augmented Lagrangian Framework

In this section, we introduce our inexact accelerated augmented Lagrangian framework, and show how to combine it with Nesterov's acceleration. To ease the presentation, we first describe a conceptual algorithm, Algorithm 2, where the subproblems are solved exactly, and only then introduce inexact inner solvers. Similarly to Nesterov's accelerated gradient method, we use an extrapolation step for the dual variable Λk. The component WXk in line 4 of Algorithm 2 is the negative gradient of the Moreau envelope² of the dual function.
Hence, our algorithm is equivalent to applying Nesterov's method to the Moreau envelope of the dual function or, equivalently, it is an accelerated dual proximal point algorithm. This renders the optimal dual method proposed in [43] a special case of our algorithmic framework (with ρ set to 0).

² A proper definition of the Moreau envelope is given in [42]; readers who are not familiar with this concept may take it to be an implicit function which shares the same optimum as the original function.

While Algorithm 2 is conceptually plausible, it requires an exact solution of the augmented Lagrangian subproblems, which can be too expensive in practice. To address this issue, we introduce an inexact version, shown in Algorithm 3, where the k-th subproblem Pk is solved up to a predefined accuracy ϵk. The choice of ϵk is rather subtle. On the one hand, choosing a large ϵk may result in a non-converging algorithm. On the other hand, choosing a small ϵk can be exceedingly expensive, as the optimal solution X*k of the subproblem is not the global optimum X*. Intuitively, ϵk should be chosen to be of the same order of magnitude as ‖X*k − X*‖, leading to the following result.

Theorem 2. Consider the sequence of primal variables (Xk)k∈N generated by Algorithm 3, with the subproblem Pk solved up to accuracy ϵk under Option I. With the parameters set to

βk = (√Lρ − √µρ)/(√Lρ + √µρ),  η = 1/Lρ,  ϵk = (µρ/(2λmax(W))) (1 − (1/2)√(µρ/Lρ))^k ∆dual,    (3)

where Lρ = λmax(W)/(µ + ρλmax(W)), µρ = λ+min(W)/(L + ρλ+min(W)), and ∆dual is the initial dual function gap, we obtain

‖Xk − X*‖² ≤ Cρ (1 − (1/2)√(µρ/Lρ))^k ∆dual,    (4)

where X* = 1n ⊗ x* and Cρ is an explicit constant proportional to Lρ λmax(W), given in the appendix.

Corollary 3. The number of subproblems Pk required to achieve ‖Xk − X*‖² ≤ ϵ in IDEAL is bounded by

K = O( √(Lρ/µρ) log( Cρ ∆dual / ϵ ) ).    (5)

We remark that inexact accelerated augmented Lagrangian methods have been analyzed previously under different assumptions [21, 33, 56]. The main difference is that here we are able to establish a linear convergence rate, whereas existing analyses only yield sublinear rates.
One of the reasons for this discrepancy is that, although F is strongly convex, the dual problem (D) is not, as the mixing matrix W is singular. The key to obtaining a linear convergence rate is a fine-grained analysis of the dual problem, showing that the dual variables always lie in the subspace where strong convexity holds. The proof of the theorem relies on the equivalence between augmented Lagrangian methods and the dual proximal point algorithm [9, 41], which can be interpreted as applying an inexact accelerated proximal point algorithm [15, 28] to the dual problem. A complete convergence analysis is deferred to Section C in the appendix.

Theorem 2 provides an accelerated convergence rate with respect to the augmented condition number κρ := Lρ/µρ, as determined by the augmented Lagrangian parameter ρ in Algorithm 3. We have the following bounds:

1  =  lim_{ρ→∞} κρ  ≤  κρ = ((L + ρλ+min(W))/(µ + ρλmax(W))) · (λmax(W)/λ+min(W))  ≤  κρ|_{ρ=0} = κf κW,    (6)

where we observe that the condition number κρ is a decreasing function of the regularization parameter ρ. When ρ = 0, the maximum value κρ = κf κW is attained, the effective condition number of the decentralized problem. As ρ goes to infinity, the augmented condition number κρ goes to 1. Naively, one may want to take ρ as large as possible to obtain fast convergence. However, one must also take into account the complexity of solving the subproblems. Indeed, since W is singular, the additional regularization term in Pk does not improve the strong convexity of the subproblems, yielding an increase in the inner-loop complexity as ρ grows. Hence, the optimal choice of ρ requires balancing the inner and outer complexities carefully. To study the inner-loop complexity, we introduce a warm-start strategy. Intuitively, the distance between Xk−1 and the solution X*k of the k-th subproblem Pk is roughly of the order of ϵk−1. More precisely, we have the following result.
Lemma 4. Given the parameter choice in Theorem 2, initializing the subproblem Pk at Xk−1 yields

‖Xk−1 − X*k‖² ≤ 8Cρ (1 − (1/2)√(µρ/Lρ))^{k−1} ∆dual.

Consequently, the ratio between the initial gap at the k-th subproblem and the desired accuracy ϵk is bounded as ‖Xk−1 − X*k‖²/ϵk = O(κf κW ρ²), which is independent of k. In other words, the inner solver only needs to decrease the iterate gap by a constant factor for each Pk. If the inner algorithm enjoys a linear convergence rate, a constant number of iterations is sufficient; if it converges sublinearly, the inner-loop complexity grows with k. To illustrate the behaviour of different algorithms, we present the inner-loop complexity Tk of gradient descent (GD), accelerated gradient descent (AGD) and stochastic gradient descent (SGD) in Table 1. Note that while the inner complexities of GD and AGD are independent of k, the inner complexity of SGD increases geometrically with k. Other possible choices of inner solvers are the alternating directions and Jacobi/Gauss-Seidel methods, which yield accelerated variants of [46] and [17], respectively.

| | GD | AGD | SGD |
| --- | --- | --- | --- |
| Inner complexity Tk | Õ((L + ρλmax(W))/µ) | Õ(√((L + ρλmax(W))/µ)) | grows geometrically with k |
| Choice of ρ | L/λmax(W) | L/λmax(W) | L/λ+min(W) |
| Total Σ_{k=1}^K Tk | Õ(κf √κW log(1/ϵ)) | Õ(√(κf κW) log(1/ϵ)) | Õ(σ²√(κf κW)/(µ²ϵ)) |

Table 1: The first row indicates the number of iterations required for different inner solvers to achieve ϵk accuracy on the k-th subproblem Pk; the Õ notation hides logarithmic factors in the parameters ρ, κf and κW. The second row shows the theoretical choice of the regularization parameter ρ. The last row shows the total number of iterations under this choice of ρ.

In fact, the theoretical upper bounds on the inner complexity also provide a more practical way to halt the inner optimization processes (see Option II in Algorithm 3). Indeed, one can predefine the computational budget for each subproblem, for instance, 100 iterations of AGD. If this budget exceeds the theoretical inner complexity Tk in Table 1, then the desired accuracy ϵk is guaranteed to be reached.
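A minimal sketch of Algorithm 3 under Option II, again on a hypothetical toy quadratic instance (fi(x) = (x − bi)²/2 on a 4-cycle): the inner solver is plain GD on Pk, warm-started at the previous iterate and run for a fixed budget Tk, and the outer loop uses the extrapolated dual updates of Algorithm 2 with η = 1/Lρ and, as an assumed simplification, a constant Nesterov momentum β = (√κρ − 1)/(√κρ + 1) in place of the schedule {βk}. NumPy is assumed.

```python
import numpy as np

n = 4
W = 2*np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)
B = np.array([1.0, -2.0, 3.0, 0.5])           # f_i(x) = (x - b_i)^2 / 2
ev = np.linalg.eigvalsh(W)
lmax, lmin = ev.max(), ev[ev > 1e-9].min()

rho = 0.5
L_rho = lmax / (1.0 + rho * lmax)             # dual smoothness
mu_rho = lmin / (1.0 + rho * lmin)            # dual strong convexity
eta = 1.0 / L_rho
kap_rho = L_rho / mu_rho                      # augmented condition number
beta = (np.sqrt(kap_rho) - 1.0) / (np.sqrt(kap_rho) + 1.0)

inner_step = 1.0 / (1.0 + rho * lmax)         # 1/smoothness of P_k (mu = L = 1)
T_k = 50                                      # fixed inner budget (Option II)

X = np.zeros(n)
Lam = np.zeros(n)
Om = np.zeros(n)
for k in range(100):
    for _ in range(T_k):                      # warm start: X carries over
        X = X - inner_step * (X - B + Om + rho * (W @ X))   # gradient of P_k
    Lam_new = Om + eta * (W @ X)              # line 4 of Algorithm 2
    Om = Lam_new + beta * (Lam_new - Lam)     # line 5: dual extrapolation
    Lam = Lam_new

print(X)   # all entries approach the consensus optimum B.mean() = 0.625
```

With this budget the inner GD contracts the subproblem error by roughly (2/3)^50 per outer step, so the inexactness never dominates and no sub-optimality check is needed, illustrating the point of Option II.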
In particular, we do not need to evaluate the sub-optimality condition; it is automatically satisfied as long as the budget is chosen appropriately. Finally, the global complexity is obtained by summing Σ_{k=1}^K Tk, where K is the number of subproblems given in (5). Note that, so far, our analysis applies to any regularization parameter ρ. Since Σ_{k=1}^K Tk is a function of ρ, one can select ρ so that the overall complexity is minimized, leading to the choices of ρ described in Table 1.

Two-fold acceleration In our setting, acceleration seems to occur in two stages (when compared to the non-accelerated Õ(κf κW log(1/ϵ)) rates in [17, 35, 39, 46, 47]). First, combining IDEAL with GD improves the dependence on the condition number of the mixing matrix κW. Second, when used as an inner solver, AGD improves the dependence on the condition number of the local functions κf. This suggests that the two phenomena are independent: one is related to the consensus between the agents, as governed by the mixing matrix, while the other is related to the centralized hardness of the underlying optimization problem.

Stochastic oracle Our framework also subsumes the stochastic setting, where only noisy gradients are available. In this case, since SGD converges sublinearly, the required number of inner iterations Tk must increase in inverse proportion to ϵk, and the stepsize at the k-th iteration must be decreased accordingly. The overall complexity is then Õ(σ²√(κf κW)/(µ²ϵ)). However, in this case, the resulting dependence on the graph condition number can be improved [13].

| Algorithm | ρ | Computation cost | Communication cost |
| --- | --- | --- | --- |
| SSDA+AGD | 0 | O(κf √κW log(1/ϵ)) | O(τ√(κf κW) log(1/ϵ)) |
| IDEAL+AGD | L/λmax(W) | O(√(κf κW) log(1/ϵ)) | O(τ√(κf κW) log(1/ϵ)) |
| MSDA+AGD | 0 | O(κf log(1/ϵ)) | O(τ√(κf κW) log(1/ϵ)) |
| MIDEAL+AGD | L/λmax(Q(W)) | O(√κf log(1/ϵ)) | O(τ√(κf κW) log(1/ϵ)) |

Table 2: The communication cost is optimal for all of the presented algorithms, but the computation cost differs.
An additional factor of √κf is introduced in SSDA/MSDA compared to their original rates in [43], due to the gradient approximation. The optimal computation cost is achieved by combining our multi-stage algorithm MIDEAL with AGD as an inner solver.

Multi-stage variant (MIDEAL) We remark that the complexity presented in Table 1 is abbreviated, in the sense that it does not distinguish between communication cost and computation cost. To provide a more fine-grained analysis, it suffices to note that performing a gradient step on the subproblem, ∇Pk(X) = ∇F(X) + Ωk + ρWX, requires one local computation to evaluate ∇F and one round of communication to obtain WX. This implies that when GD/AGD/SGD is combined with IDEAL, the number of local computation rounds roughly equals the number of communication rounds, leading to a sub-optimal computation cost, as shown in Table 2. To achieve optimal accelerated rates, we enforce multiple communication rounds after each evaluation of ∇F. This is achieved by substituting the regularization metric ‖·‖²_W with ‖·‖²_Q(W), where Q is a well-chosen polynomial. In this case, the gradient of the subproblem becomes ∇Pk(X) = ∇F(X) + Ωk + ρQ(W)X, which requires deg(Q) rounds of communication. The choice of the polynomial Q relies on Chebyshev acceleration, introduced in [6, 43]. More concretely, the Chebyshev polynomials are defined by the recursion

T0(x) = 1,  T1(x) = x,  Tj+1(x) = 2x Tj(x) − Tj−1(x),

and Q is defined by

Q(x) = 1 − T_{jW}(c(1 − x)) / T_{jW}(c),  with  jW = ⌈√κW⌉,  c = (κW + 1)/(κW − 1).

Applying this specific choice of Q to the mixing matrix W reduces its condition number by the maximum amount [6, 43], yielding the graph-independent bound κQ(W) = λmax(Q(W))/λ+min(Q(W)) ≤ 4. Moreover, the symmetry, positiveness and spectrum properties of Assumption 1 are maintained by Q(W). Even though Q(W) no longer satisfies the decentralized property, it can be implemented using ⌈√κW⌉ rounds of communication with respect to W.
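The sketch below illustrates Chebyshev acceleration of the mixing matrix numerically. The exact polynomial in [6, 43] involves a specific normalization of W; as an assumption, this sketch instead uses the standard affine Chebyshev shift of the spectrum [λ+min, λmax], which preserves the kernel of W and satisfies the same κQ(W) ≤ 4 guarantee. Each multiplication by the shifted matrix M costs one round of communication, so evaluating Q(W) via the three-term recursion costs jW = ⌈√κW⌉ rounds.

```python
import numpy as np

n = 10
W = 2*np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)
ev = np.linalg.eigvalsh(W)
lmax, lmin = ev.max(), ev[ev > 1e-9].min()
kap_W = lmax / lmin
j_W = int(np.ceil(np.sqrt(kap_W)))            # polynomial degree ceil(sqrt(kappa_W))

# Affine shift mapping [lmin, lmax] into [-1, 1]; the zero eigenvalue maps to c.
M = ((lmax + lmin) * np.eye(n) - 2.0 * W) / (lmax - lmin)
c = (lmax + lmin) / (lmax - lmin)

# Three-term Chebyshev recursion T_{j+1} = 2 M T_j - T_{j-1}, on the matrix and
# in parallel on the scalar c (needed for the normalization).
T_prev, T_cur = np.eye(n), M.copy()
t_prev, t_cur = 1.0, c
for _ in range(j_W - 1):
    T_prev, T_cur = T_cur, 2.0 * M @ T_cur - T_prev
    t_prev, t_cur = t_cur, 2.0 * c * t_cur - t_prev

QW = np.eye(n) - T_cur / t_cur                # Q(W); Q(0) = 0, so Ker(W) is kept

ev_Q = np.linalg.eigvalsh(QW)
pos = ev_Q[ev_Q > 1e-9]
print(pos.max() / pos.min())                  # condition number of Q(W), <= 4
```

On the interval [λ+min, λmax] the shifted Chebyshev polynomial oscillates in [−1, 1], so the positive spectrum of Q(W) is compressed into [1 − 1/TjW(c), 1 + 1/TjW(c)], which is what makes the resulting condition number graph-independent.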
The implementation details of the resulting algorithm are similar to those of Algorithm 2, and follow by substituting the mixing matrix W with Q(W) (Algorithm 5 in Appendix E).

Comparison with inexact SSDA/MSDA [43] Recall that SSDA/MSDA are special cases of our algorithmic framework with the degenerate regularization parameter ρ = 0. Therefore, our complexity analysis naturally extends to an inexact analysis of SSDA/MSDA, as shown in Table 2. Although the resulting communication costs are optimal, the computation cost is not, due to the additional √κf factor introduced by solving the subproblems inexactly. In contrast, our multi-stage framework achieves the optimal computation cost.

Low communication cost regime (τ√κW < 1): the computation cost dominates the communication cost, and MIDEAL improves over MSDA by a factor of √κf.

Ill-conditioned regime (1 < τ√κW < √κf): the complexity of MSDA is dominated by its computation cost O(κf log(1/ϵ)), while the complexity of MIDEAL is dominated by its communication cost O(τ√(κf κW) log(1/ϵ)). The improvement is proportional to the ratio √κf/(τ√κW).

High communication cost regime (√κf < τ√κW): the communication cost dominates, and MIDEAL and MSDA are comparable.
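The three regimes can be checked with simple arithmetic. The sketch below compares total costs using the entries of Table 2 for inexact MSDA and MIDEAL, dropping the common log(1/ϵ) factor and all constants (an assumed simplification), with κf = 1000 and κW = 26.3 as in the experiments of Section 5.

```python
import numpy as np

# Total cost = computation + communication, per Table 2 (log factors dropped):
# inexact MSDA: kf + tau*sqrt(kf*kW);  MIDEAL: sqrt(kf) + tau*sqrt(kf*kW).
def msda(kf, kW, tau):
    return kf + tau * np.sqrt(kf * kW)

def mideal(kf, kW, tau):
    return np.sqrt(kf) + tau * np.sqrt(kf * kW)

kf, kW = 1000.0, 26.3
# tau chosen so that tau*sqrt(kW) is below 1, between 1 and sqrt(kf), above sqrt(kf).
for tau in (0.01, 1.0, 100.0):
    print(tau, msda(kf, kW, tau) / mideal(kf, kW, tau))
```

The printed speed-up is roughly √κf ≈ 32 in the low-communication regime, roughly √κf/(τ√κW) ≈ 6 in the ill-conditioned regime, and close to 1 in the high-communication regime, matching the three cases above.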
[Figure 1: six panels of log-suboptimality versus time for a circular graph (top row, κW = 26.3) and a barbell graph (bottom row, κW = 47.1), each with n = 16, κf = 1000 and τ ∈ {0.1, 1.0, 10.0} (left to right); curves: IDEAL+AGD, MIDEAL+AGD, APM-C, SSDA+AGD, MSDA+AGD, EXTRA.]

Figure 1: We evaluate the empirical performance of existing state-of-the-art algorithms, where the underlying network is a circular graph (top) and a barbell graph (bottom). We consider the following regimes: low communication cost (left), ill-conditioned problems (middle) and high communication cost (right). The x-axis is the time counter, i.e. the sum of the communication and computation costs; the y-axis is the suboptimality on a log scale. We observe that our algorithms IDEAL/MIDEAL are optimal under various regimes, validating our theoretical findings.

5 Experiments

Having described the IDEAL/MIDEAL algorithms for the decentralized optimization problem (1), we now turn to presenting empirical results which corroborate our theoretical analysis. To facilitate a simple comparison with existing state-of-the-art algorithms, we consider an ℓ2-regularized logistic regression task over two classes of the MNIST [24] and CIFAR-10 [23] benchmark datasets. For the MNIST experiment, we directly use the normalized images as input features. For the CIFAR experiment, a linear model is not rich enough to express the complex images.
Hence, we first apply an unsupervised learning model, a convolutional kernel network [29], to extract features, and then apply logistic regression on top of them. This can be viewed approximately as training the last layer of a conventional neural network while freezing the well-trained earlier layers. The smoothness parameter of logistic regression (assuming normalized feature vectors) can be shown to be bounded by 1/4, which, together with a regularization parameter µ ≈ 10⁻³, yields a relatively high bound, of order 10³, on the condition number of the loss function. Further empirical results demonstrating the robustness of IDEAL/MIDEAL under a wide range of parameter choices are provided in Appendix G.

We compare the performance of IDEAL/MIDEAL with the state-of-the-art algorithms EXTRA [47], APM-C [26] and the inexact dual methods SSDA/MSDA [43]. We set the inner iteration counter to Tk = 100 for all algorithms, and use the theoretical stepsize schedule. The decentralized environment is modelled in a synthetic setting, where the communication time is steady and no latency is encountered. To demonstrate the effect of the underlying network architecture, we consider: a) a circular graph, where the agents form a cycle; and b) a barbell graph, where the agents are split into two complete subgraphs connected by a single bridge (shown in Figure 2 in the appendix). As shown in Figure 1, our multi-stage algorithm MIDEAL is optimal in the regime where the communication cost τ is small, and the single-stage variant IDEAL is optimal when τ is large. As expected, the inexactness mechanism significantly slows down the dual methods SSDA/MSDA in the low communication cost regime. In contrast, the APM-C algorithm performs reasonably well in the low communication regime, but performs relatively poorly when the communication cost is high.
[Figure 2: six panels of log-suboptimality versus time for the CIFAR experiments, with the same layout as Figure 1: circular graph (top, κW = 26.3) and barbell graph (bottom, κW = 47.1), n = 16, κf = 1000, τ ∈ {0.1, 1.0, 10.0}; curves: IDEAL+AGD, MIDEAL+AGD, APM-C, SSDA+AGD, MSDA+AGD, EXTRA.]

Figure 2: CIFAR experiments. We observe a similar phenomenon as in the MNIST experiments: the multi-stage algorithm MIDEAL outperforms when the communication cost τ is low, and IDEAL outperforms in the other cases.

6 Conclusions

We propose a novel framework of decentralized algorithms for smooth and strongly convex objectives. The framework provides a unified viewpoint of several well-known decentralized algorithms and, when instantiated with AGD, achieves optimal convergence rates in theory and state-of-the-art performance in practice. We leave further generalizations to (non-strongly) convex and non-smooth objectives to future work.

Acknowledgements YA and JB acknowledge support from the Sloan Foundation and Samsung Research. BC and MG acknowledge support from the grants NSF DMS-1723085 and NSF CCF-1814888. HL and SJ acknowledge support from the Defense Advanced Research Projects Agency (grant number YFA17 N66001-17-1-4039).
The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

Broader impact

Centralization of data is not always possible because of security and legal concerns [14]. Our work proposes a new optimization algorithm for the decentralized setting, which can learn a model without revealing privacy-sensitive data. Potential applications include data coming from healthcare, the environment, safety, etc., such as personal medical information [19, 20], keyboard input history [22, 32], and beyond.

References

[1] S. Alghunaim, K. Yuan, and A. H. Sayed. A linearly convergent proximal gradient algorithm for decentralized optimization. In Advances in Neural Information Processing Systems, pages 2848–2858, 2019.
[2] S. A. Alghunaim, E. Ryu, K. Yuan, and A. H. Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 2020.
[3] Y. Arjevani and O. Shamir. Communication complexity of distributed convex learning and optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
[4] Y. Arjevani and O. Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In International Conference on Machine Learning (ICML), 2016.
[5] Y. Arjevani, O. Shamir, and N. Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117, pages 111–132, 2020.
[6] W. Auzinger and J. Melenk. Iterative solution of large linear systems. Lecture notes, 2011.
[7] N. S. Aybat and M. Gürbüzbalaban. Decentralized computation of effective resistances and acceleration of consensus algorithms.
In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 538–542. IEEE, 2017.
[8] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
[9] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014.
[10] B. Can, S. Soori, N. S. Aybat, M. M. Dehvani, and M. Gürbüzbalaban. Decentralized computation of effective resistances and acceleration of distributed optimization algorithms. arXiv preprint arXiv:1907.13110, 2019.
[11] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2011.
[12] D. Dvinskikh and A. Gasnikov. Decentralized and parallelized primal and dual accelerated methods for stochastic convex programming problems. arXiv preprint arXiv:1904.09015, 2019.
[13] A. Fallah, M. Gürbüzbalaban, A. Ozdaglar, U. Simsekli, and L. Zhu. Robust distributed accelerated stochastic gradient methods for multi-agent networks. arXiv preprint arXiv:1910.08701, 2019.
[14] GDPR. The EU General Data Protection Regulation (GDPR). 2016.
[15] O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
[16] H. Hendrikx, F. Bach, and L. Massoulié. An optimal algorithm for decentralized finite sum optimization, 2020.
[17] D. Jakovetić, J. M. Moura, and J. Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2014.
[18] D. Jakovetić, J. Xavier, and J. M. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.
[19] A. Jochems, T. M. Deist, J. Van Soest, M. Eble, P. Bulens, P. Coucke, W. Dries, P. Lambin, and A. Dekker.
Distributed learning: developing a predictive model based on data from multiple hospitals without data leaving the hospital a real life proof of concept. Radiotherapy and Oncology, 121(3):459 467, 2016. [20] A. Jochems, T. M. Deist, I. El Naqa, M. Kessler, C. Mayo, J. Reeves, S. Jolly, M. Matuszak, R. Ten Haken, J. van Soest, et al. Developing and validating a survival prediction model for nsclc patients through distributed learning across 3 countries. International Journal of Radiation Oncology* Biology* Physics, 99(2):344 352, 2017. [21] M. Kang, M. Kang, and M. Jung. Inexact accelerated augmented lagrangian methods. Computational Optimization and Applications, 62(2):373 404, 2015. [22] J. Konevcný, H. B. Mc Mahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning, 2016. URL https://arxiv.org/abs/1610.05492. [23] A. Krizhevsky et al. Learning multiple layers of features from tiny images. 2009. [24] Y. Le Cun, C. Cortes, and C. Burges. Mnist handwritten digit database. ATT Labs [Online], 2, 2010. URL http://yann.lecun.com/exdb/mnist. [25] H. Li and Z. Lin. Revisiting EXTRA for smooth distributed optimization. SIAM Journal on Optimization, 30(3):1795 1821, 2020. [26] H. Li, C. Fang, W. Yin, and Z. Lin. A sharp convergence rate analysis for distributed accelerated gradient methods. ar Xiv preprint ar Xiv:1810.01053, 2018. [27] Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17): 4494 4506, 2019. [28] H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. The Journal of Machine Learning Research, 18(1):7854 7907, 2017. [29] J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. 
In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016. [30] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief. A survey on mobile edge computing: The communication perspective. IEEE Communications Surveys & Tutorials, 19(4):2322 2358, 2017. [31] B. Mc Mahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273 1282, 2017. [32] H. B. Mc Mahan, E. Moore, D. Ramage, S. Hampson, et al. Communication-efficient learning of deep networks from decentralized data. ar Xiv preprint ar Xiv:1602.05629, 2016. [33] V. Nedelcu, I. Necoara, and Q. Tran-Dinh. Computational complexity of inexact gradient augmented lagrangian methods: application to constrained mpc. SIAM Journal on Control and Optimization, 52(5):3109 3134, 2014. [34] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48 61, 2009. [35] A. Nedic, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597 2633, 2017. [36] A. Nedi c, A. Olshevsky, W. Shi, and C. A. Uribe. Geometrically convergent distributed optimization with uncoordinated step-sizes. In 2017 American Control Conference (ACC), pages 3950 3955. IEEE, 2017. [37] Y. Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004. [38] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems, 11(3):387 434, 2005. [39] G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245 1260, 2017. [40] G. Qu and N. Li. Accelerated distributed nesterov gradient descent. IEEE Transactions on Automatic Control, 2019. [41] R. T. Rockafellar. 
Augmented lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of operations research, 1(2):97 116, 1976. [42] R. T. Rockafellar and R. J.-B. Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009. [43] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conferences on Machine Learning (ICML), 2017. [44] K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for nonsmooth distributed optimization in networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2018. [45] M. Schmidt, N. L. Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011. [46] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7): 1750 1761, 2014. [47] W. Shi, Q. Ling, G. Wu, and W. Yin. Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944 966, 2015. [48] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu. Edge computing: Vision and challenges. IEEE internet of things journal, 3(5):637 646, 2016. [49] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310 1321, 2015. [50] Y. Sun, A. Daneshmand, and G. Scutari. Convergence rate of distributed optimization algorithms based on gradient tracking. ar Xiv preprint ar Xiv:1905.02637, 2019. [51] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedi c. A dual approach for optimal algorithms in distributed optimization over networks. Optimization Methods and Software, pages 1 40, 2020. [52] B. E. Woodworth, J. Wang, A. Smith, B. 
Mc Mahan, and N. Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2018. [53] L. Xiao, S. Boyd, and S.-J. Kim. Distributed average consensus with least-mean-square deviation. Journal of parallel and distributed computing, 67(1):33 46, 2007. [54] J. Xu, Y. Tian, Y. Sun, and G. Scutari. Accelerated primal-dual algorithms for distributed smooth convex optimization over networks. International Conference on Artificial Intelligence and Statistics (AISTATS), 2020. [55] J. Xu, Y. Tian, Y. Sun, and G. Scutari. Distributed algorithms for composite optimization: Unified and tight convergence analysis. ar Xiv preprint ar Xiv:2002.11534, 2020. [56] S. Yan and N. He. Bregman augmented lagrangian and its acceleration, 2020. [57] H. Ye, L. Luo, Z. Zhou, and T. Zhang. Multi-consensus decentralized accelerated gradient descent. ar Xiv preprint ar Xiv:2005.00797, 2020. [58] K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835 1854, 2016. [59] J. Zhang, C. A. Uribe, A. Mokhtari, and A. Jadbabaie. Achieving acceleration in distributed optimization via direct discretization of the heavy-ball ode. In 2019 American Control Conference (ACC), pages 3408 3413. IEEE, 2019.