# adaptive_consensus_admm_for_distributed_optimization__14f8e3e1.pdf Adaptive Consensus ADMM for Distributed Optimization Zheng Xu 1 Gavin Taylor 2 Hao Li 1 M ario A. T. Figueiredo 3 Xiaoming Yuan 4 Tom Goldstein 1 The alternating direction method of multipliers (ADMM) is commonly used for distributed model fitting problems, but its performance and reliability depend strongly on userdefined penalty parameters. We study distributed ADMM methods that boost performance by using different fine-tuned algorithm parameters on each worker node. We present a O(1/k) convergence rate for adaptive ADMM methods with node-specific parameters, and propose adaptive consensus ADMM (ACADMM), which automatically tunes parameters without user oversight. 1. Introduction The alternating direction method of multipliers (ADMM) is a popular tool for solving problems of the form, min u Rn,v Rm f(u) + g(v), subject to Au + Bv = b, (1) where f : Rn R and g : Rm R are convex functions, A Rp n, B Rp m, and b Rp. ADMM was first introduced in (Glowinski & Marroco, 1975) and (Gabay & Mercier, 1976), and has found applications in many optimization problems in machine learning, distributed computing and many other areas (Boyd et al., 2011). Consensus ADMM (Boyd et al., 2011) solves minimization problems involving a composite objective f(v) = P i fi(v), where worker i stores the data needed to compute fi, and so is well suited for distributed model fitting problems (Boyd et al., 2011; Zhang & Kwok, 2014; Song et al., 2016; Chang et al., 2016; Goldstein et al., 2016; Taylor et al., 2016). To distribute this problem, consensus methods assign a separate copy of the unknowns, ui, to 1University of Maryland, College Park; 2United States Naval Academy, Annapolis; 3Instituto de Telecomunicac oes, IST, ULisboa, Portugal; 4Hong Kong Baptist University, Hong Kong. Correspondence to: Zheng Xu . Proceedings of the 34 th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s). each worker, and then apply ADMM to solve min ui Rd,v Rd i=1 fi(ui) + g(v), subject to ui = v, (2) where v is the central copy of the unknowns, and g(v) is a regularizer. The consensus problem (2) coincides with (1) by defining u = (u1; . . . ; u N) Rd N, A = Id N Rd N d N, and B = (Id; . . . ; Id) Rd N d, where Id represents the d d identity matrix. ADMM methods rely on a penalty parameter (stepsize) that is chosen by the user. In theory, ADMM converges for any constant penalty parameter (Eckstein & Bertsekas, 1992; He & Yuan, 2012; Ouyang et al., 2013). In practice, however, the efficiency of ADMM is highly sensitive to this parameter choice (Nishihara et al., 2015; Ghadimi et al., 2015), and can be improved via adaptive penalty selection methods (He et al., 2000; Song et al., 2016; Xu et al., 2017a). One such approach, residual balancing (RB) (He et al., 2000), adapts the penalty parameter so that the residuals (derivatives of the Lagrangian with respect to primal and dual variables) have similar magnitudes. When the same penalty parameter is used across nodes, RB is known to converge, although without a known rate guarantee. A more recent approach, AADMM (Xu et al., 2017a), achieves impressive practical convergence speed on many applications, including consensus problems, with adaptive penalty parameters by estimating the local curvature of the dual functions. However, the dimension of the unknown variables in consensus problems grows with the number of distributed nodes, causing the curvature estimation to be inaccurate and unstable. AADMM uses the same convergence analysis as RB. Consensus residual balancing (CRB) (Song et al., 2016) extends residual balancing to consensusbased ADMM for distributed optimization by balancing the local primal and dual residuals on each node. However, convergence guarantees for this method are fairly weak, and adaptive penalties need to be reset after several iterations to guarantee convergence. We study the use of adaptive ADMM in the distributed setting, where different workers use different local algorithm parameters to accelerate convergence. We begin by studying the theory and provide convergence guarantees when Adaptive Consensus ADMM for Distributed Optimization node-specific penalty parameters are used. We demonstrate a O(1/k) convergence rate under mild conditions that is applicable for many forms of adaptive ADMM including all the above methods. Our theory is more general than the convergence guarantee in (He et al., 2000; Xu et al., 2017a) that only shows convergence when the scalar penalty parameter is adapted. Next, we propose an adaptive consensus ADMM (ACADMM) method to automate local algorithm parameters selection. Instead of estimating one global penalty parameter for all workers, different local penalty parameters are estimated using the local curvature of subproblems on each node. 2. Related work ADMM is known to have a O(1/k) convergence rate under mild conditions for convex problems (He & Yuan, 2012; 2015), while a O(1/k2) rate is possible when at least one of the functions is strongly convex or smooth (Goldfarb et al., 2013; Goldstein et al., 2014; Kadkhodaie et al., 2015; Tian & Yuan, 2016). Linear convergence can be achieved with strong convexity assumptions (Davis & Yin, 2014; Nishihara et al., 2015; Giselsson & Boyd, 2016). All of these results assume constant parameters; to the best of our knowledge, no convergence rate has been proven for ADMM with an adaptive penalty: (He et al., 2000; Xu et al., 2017b) proves convergence without providing a rate, and (Lin et al., 2011; Banert et al., 2016; Goldstein et al., 2015) prove convergence for some particular variants of ADMM ( linearized or preconditioned ). To improve practical convergence of ADMM, fixed optimal parameters are discussed in (Raghunathan & Di Cairano, 2014; Ghadimi et al., 2015; Nishihara et al., 2015; Franc a & Bento, 2016). These methods make strong assumptions about the objective and require information about the spectrum of A and/or B. Additionally, adaptive methods have been proposed; the most closely related work to our own is (Song et al., 2016), which extends the results of (He et al., 2000) to consensus problems, where communication is controlled by predefined network structure and the regularizer g(v) is absent. In contrast to these methods, the proposed ACADMM extends the spectral penalty in (Xu et al., 2017a) to consensus problems and provides convergence theory that can be applied to a broad range of adaptive ADMM variants. 3. Consensus ADMM In the following, we use the subscript i to denote iterates computed on the ith node, superscript k is the iteration number, λk i is the dual vector of Lagrange multipliers, and {τ k i } are iteration/worker-specific penalty parameters (contrasted with the single constant penalty parameter τ of vanilla ADMM). Consensus methods apply ADMM to (2), resulting in the steps uk+1 i = arg min ui fi(ui) + τ k i 2 vk ui + λk i τ k i 2 (3) vk+1 = arg min v g(v) + τ k i 2 v uk+1 i + λk i τ k i 2 (4) λk+1 i = λk i + τ k i (vk+1 uk+1 i ). (5) The primal and dual residuals, rk and dk, are used to monitor convergence. rk 1 ... rk N dk 1 ... dk N ( rk i = vk uk i dk i = τ k i (vk 1 vk). (6) The primal residual rk approaches zero when the iterates accurately satisfy the linear constraints in (2), and the dual residual dk approaches zero as the iterates near a minimizer of the objective. Iteration can be terminated when rk 2 ϵtol max{ XN i=1 uk i 2, N vk 2} and dk 2 ϵtol XN i=1 λk i 2, (7) where ϵtol is the stopping tolerance. The residuals in (6) and stopping criterion in (7) are adopted from the general problem (Boyd et al., 2011) to the consensus problem. The observation that residuals rk, dk can be decomposed into local residuals rk i , dk i has been exploited to generalize the residual balancing method (He et al., 2000) for distributed consensus problems (Song et al., 2016). 4. Convergence analysis We now study the convergence of ADMM with nodespecific adaptive penalty parameters. We provide conditions on penalty parameters that guarantee convergence, and also a convergence rate. The issue of how to automatically tune penalty parameters effectively will be discussed in Section 5. 4.1. Diagonal penalty parameters for ADMM Let T k = diag(τ k 1 Id, . . . , τ k NId) be a diagonal matrix containing non-negative penalty parameters on iteration k. Define the norm u 2 T = u T Tu. Using the notation defined above with u = (u1; . . . ; u N) Rd N, we can rewrite the consensus ADMM steps (3) (5) as uk+1 = arg min u f(u) + Au, λk + 1/2 b Au Bvk 2 T k (8) Adaptive Consensus ADMM for Distributed Optimization vk+1 = arg min v g(v) + Bv, λk + 1/2 b Auk+1 Bv 2 T k (9) λk+1 = λk + T k(b Auk+1 Bvk+1). (10) When using a diagonal penalty matrix, the generalized residuals become ( rk = b Auk Buk dk = AT T k B(vk vk 1). (11) The sequel contains a convergence proof for generalized ADMM with adaptive penalty matrix T k. Our proof is inspired by the variational inequality (VI) approach in (He et al., 2000; He & Yuan, 2012; 2015). 4.2. Preliminaries Notation. We use the following notation to simplify the discussions. Define the combined variables y = (u; v) Rn+m and z = (u; v; λ) Rn+m+p, and denote iterates as yk = (uk; vk) and zk = (uk; vk; λk). Let y and z denote optimal primal/dual solutions. Further define z+ k = ( u+ k ; v+ k ; λ+ k ) := zk+1 zk and z k = ( u k; v k; λ k) := z zk. Set φ(y) = f(u) + g(v), F(z) = AT λ BT λ Au + Bv b 0 0 0 0 BT T k B 0 0 0 (T k) 1 In 0 0 0 Im 0 0 T k B Ip Note that F(z) is a monotone operator satisfying z, z , (z z )T (F(z) F(z )) 0. We introduce intermediate variable zk+1 = (uk+1; vk+1; ˆλk+1), where ˆλk+1 = λk + T k(b Auk+1 Bvk). We thus have z+ k = M k( zk+1 zk). (12) Variational inequality formulation. The optimal solution z of problem (1) satisfies the variational inequality (VI), z, φ(y) φ(y ) + (z z )T F(z ) 0. (13) From the optimality conditions for the sub-steps (8, 9), we see that yk+1 satisfies the variational inequalities u, f(u) f(uk+1) + (u uk+1)T (AT T k(Auk+1 + Bvk b) AT λk) 0 (14) v, g(v) g(vk+1) + (v vk+1)T (BT T k(Auk+1 + Bvk+1 b) BT λk) 0, (15) which can be combined as φ(y) φ(yk+1) + (z zk+1)T F( zk+1) + Hk z+ k 0. (16) Lemmas. We present several lemmas to facilitate the proof of our main convergence theory, which extend previous results regarding ADMM (He & Yuan, 2012; 2015) to ADMM with a diagonal penalty matrix. Lemma 1 shows the difference between iterates decreases as the iterates approach the true solution, while Lemma 2 implies a contraction in the VI sense. Full proofs are provided in supplementary material; Eq. (17) and Eq. (18) are supported using equations (13, 15, 16) and standard techniques, while Eq. (19) is proven from Eq. (18). Lemma 2 is supported by the relationship in Eq. (12). Lemma 1. The optimal solution z = (u ; v ; λ ) and sequence zk = (uk; vk; λk) of generalized ADMM satisfy (B v+ k )T λ+ k 0, (17) z k+1Hk z+ k 0, (18) z+ k 2 Hk z k 2 Hk z k+1 2 Hk. (19) Lemma 2. The sequence zk = (uk; vk; ˆλk) and zk = (uk; vk; λk)T from generalized ADMM satisfy, z, ( zk+1 z)T Hk z+ k 1 2( zk+1 z 2 Hk zk z 2 Hk). (20) 4.3. Convergence criteria We provide a convergence analysis of ADMM with an adaptive diagonal penalty matrix by showing (i) the norm of the residuals converges to zero; (ii) the method attains a worst-case ergodic O(1/k) convergence rate in the VI sense. The key idea of the proof is to bound the adaptivity of T k so that ADMM is stable enough to converge, which is presented as the following assumption. Assumption 1. The adaptivity of the diagonal penalty matrix T k = diag(τ k i , . . . , τ k p ) is bounded by k=1 (ηk)2 < , where (ηk)2 = max i {1,...,p}{(ηk i )2}, (ηk i )2 = max{τ k i /τ k 1 i 1, τ k 1 i /τ k i 1}. We can apply Assumption 1 to verify that 1 1 + (ηk)2 τ k i τ k 1 i 1 + (ηk)2. (22) which is needed to prove Lemma 3. Lemma 3. Suppose Assumption 1 holds. Then z = (u; v; λ) and z = (u ; v ; λ ) satisfy, z, z z z 2 Hk (1 + (ηk)2) z z 2 Hk 1. (23) Adaptive Consensus ADMM for Distributed Optimization Now we are ready to prove the convergence of generalized ADMM with adaptive penalty under Assumption 1. We prove the following quantity, which is a norm of the residuals, converges to zero. z+ k 2 Hk = B v+ k 2 T k + λ+ k 2 (T k) 1 = (AT T k) dk 2 T k + rk 2 T k, (24) where A denotes generalized inverse of a matrix A. Note that z+ k 2 Hk converges to zero only if rk and dk converge to zero, provided A and T k are bounded. Theorem 1. Suppose Assumption 1 holds. Then the iterates zk = (uk; vk; λk) of generalized ADMM satisfy lim k z+ k 2 Hk = 0. (25) Proof. Let z = zk, z = z in Lemma 3 to achieve z k 2 Hk (1 + (ηk)2) z k 2 Hk 1. (26) Combine (26) with Lemma 1 (19) to get z+ k 2 Hk (1+(ηk)2) z k 2 Hk 1 z k+1 2 Hk. (27) Accumulate (27) for k = 1 to l, t=k+1 (1 + (ηt)2) z+ k 2 Hk t=1 (1 + (ηt)2) z 1 2 H0 z l+1 2 Hl. Then we have k=1 z+ k 2 Hk t=1 (1 + (ηt)2) z 1 2 H0. (29) When l , Assumption 1 suggests Q t=1(1 + (ηt)2) < , which means P k=1 z+ k 2 Hk < . Hence limk z+ k 2 Hk = 0. We further exploit Assumption 1 and Lemma 3 to prove Lemma 4, and combine VI (16), Lemma 2, and Lemma 4 to prove the O(1/k) convergence rate in Theorem 2. Lemma 4. Suppose Assumption 1 holds. Then z = (u; v; λ) Rm+n+p and the iterates zk = (uk; vk; λk) of generalized ADMM satisfy, z k=1 ( z zk 2 Hk z zk 2 Hk 1) CΣ η CΠ η ( z z 2 H0 + z 1 2 H0) < , where CΣ η = P k=1(ηk)2, CΠ η = Q t=1(1 + (ηt)2). Theorem 2. Suppose Assumption 1 holds. Consider the sequence zk = (uk; vk; ˆλk) of generalized ADMM and define zl = 1 l Pl k=1 zk. Then sequence zl satisfies the convergence bound φ(y) φ( yl) + (z zl)T F( zl) 1 2 l( z z0 2 H0 + CΣ η CΠ η z z 2 H0 + CΣ η CΠ η z 1 2 H0). (31) Proof. We can verify with simple algebra that (z z )T F(z) = (z z )T F(z ). (32) Apply (32) with z = zk+1, and combine VI (16) and Lemma 2 to get φ(y) φ(yk+1) + (z zk+1)T F(z) (33) =φ(y) φ(yk+1) + (z zk+1)T F( zk+1) (34) ( zk+1 z)T Hk z+ k (35) 2( zk+1 z 2 Hk zk z 2 Hk). (36) Summing for k = 0 to l 1 gives us k=1 φ(y) φ(yk) + (z zk)T F(z) k=1( z zk 2 Hk 1 z zk 1 2 Hk 1). (37) Since φ(y) is convex, the left hand side of (37) satisfies, LHS = l φ(y) k=1 φ(yk) + (l z k=1 zk)T F(z) l φ(y) l φ( yl) + (l z l zl)T F(z). (38) Applying Lemma 4, we see the right hand side satisfies, k=1 ( z zk 2 Hk z zk 1 2 Hk 1)+ k=1 ( z zk 2 Hk 1 z zk 2 Hk) 2( z zl 2 Hl z z0 2 H0)+ 2CΣ η CΠ η ( z z 2 H0 + z 1 2 H0) (40) 2( z z0 2 H0 + CΣ η CΠ η z z 2 H0+ CΣ η CΠ η z 1 2 H0). (41) Combining inequalities (37), (38) and (41), and letting z = zk in (32) yields the O(1/k) convergence rate in (31) Adaptive Consensus ADMM for Distributed Optimization 5. Adaptive Consensus ADMM (ACADMM) To address the issue of how to automatically tune parameters on each node for optimal performance, we propose adaptive consensus ADMM (ACADMM), which sets worker-specific penalty parameters by exploiting curvature information. We derive our method from the dual interpretation of ADMM Douglas-Rachford splitting (DRS) using a diagonal penalty matrix. We then derive the spectral stepsizes for consensus problems by assuming the curvatures of the objectives are diagonal matrices with diverse parameters on different nodes. At last, we discuss the practical computation of the spectral stepsizes from consensus ADMM iterates and apply our theory in Section 4 to guarantee convergence. 5.1. Dual interpretation of generalized ADMM The dual form of problem (1) can be written min λ Rp f (AT λ) λ, b | {z } ˆ f(λ) + g (BT λ) | {z } ˆg(λ) where λ denotes the dual variable, while f , g denote the Fenchel conjugate of f, g (Rockafellar, 1970). It is known that ADMM steps for the primal problem (1) are equivalent to performing Douglas-Rachford splitting (DRS) on the dual problem (42) (Eckstein & Bertsekas, 1992; Xu et al., 2017a). In particular, the generalized ADMM iterates satisfy the DRS update formulas 0 (T k) 1(ˆλk+1 λk) + ˆf(ˆλk+1) + ˆg(λk) (43) 0 (T k) 1(λk+1 λk) + ˆf(ˆλk+1) + ˆg(λk+1), (44) where ˆλ denotes the intermediate variable defined in Section 4.2. We prove the equivalence of generalized ADMM and DRS in the supplementary material. 5.2. Generalized spectral stepsize rule Xu et al. (2017a) first derived spectral penalty parameters for ADMM using the DRS. Proposition 1 in (Xu et al., 2017a) proved that the minimum residual of DRS can be obtained by setting the scalar penalty to τ k = 1/ α β, where we assume the subgradients are locally linear as ˆf(ˆλ) = α ˆλ + Ψ and ˆg(λ) = β λ + Φ, (45) α, β R represent scalar curvatures, and Ψ, Φ Rp. We now present generalized spectral stepsize rules that can accomodate consensus problems. Proposition 1 (Generalized spectral DRS). Suppose the generalized DRS steps (43, 44) are used, and assume the subgradients are locally linear, ˆf(ˆλ) = Mα ˆλ + Ψ and ˆg(λ) = Mβ λ + Φ. (46) for matrices Mα = diag(α1Id, . . . , αNId) and Mβ = diag(β1Id, . . . , βNId), and some Ψ, Φ Rp. Then the minimal residual of ˆf(λk+1) + ˆg(λk+1) is obtained by setting τ k i = 1/ αi βi, i = 1, . . . , N. Proof. Substituting subgradients ˆf(ˆλ), ˆg(λ) into the generalized DRS steps (43, 44), and using our linear assumption (46) yields 0 (T k) 1(ˆλk+1 λk) + (Mα ˆλk+1 + Ψ) + (Mβ λk + Φ) 0 (T k) 1(λk+1 λk) + (Mα ˆλk+1 + Ψ) + (Mβ λk+1 + Φ). Since T k, Mα, Mβ are diagonal matrices, we can split the equations into independent blocks, i = 1, . . . , N, 0 (ˆλk+1 i λk i )/τ k i + (αi ˆλk+1 + Ψi) + (βi λk + Φi) 0 (λk+1 i λk i )/τ k i + (αi ˆλk+1 + Ψi) + (βi λk+1 + Φi). Applying Proposition 1 in (Xu et al., 2017a) to each block, τ k i = 1/ αi βi minimizes the block residual represented by rk+1 DR,i = (αi + βi)λk+1 + (ai + bi) , where ai Ψi, bi Φi. Hence the residual norm at step k + 1, which is (Mα + Mβ)λk+1 + (a + b) = q PN i=1(rk+1 DR,i)2 is minimized by setting τ k i = 1/ αi βi, i = 1, . . . , N. 5.3. Stepsize estimation for consensus problems Thanks to the equivalence of ADMM and DRS, Proposition 1 can also be used to guide the selection of the optimal penalty parameter. We now show that the generalized spectral stepsizes can be estimated from the ADMM iterates for the primal consensus problem (2), without explicitly supplying the dual functions. The subgradients of dual functions ˆf, ˆg can be computed from the ADMM iterates using the identities derived from (8, 9), Auk+1 b ˆf(ˆλk+1) and Bvk+1 ˆg(λk+1). (47) For the consensus problem we have A = Id N, B = (Id; . . . ; Id), and b = 0, and so (uk+1 1 ; . . . ; uk+1 N ) ˆf(ˆλk+1) (48) (vk+1; . . . ; vk+1 | {z } N duplicates of vk+1 ) ˆg(λk+1). (49) If we approximate the behavior of these sub-gradients using the linear approximation (46), and break the subgradients into blocks (one for each worker node), we get (omitting iteration index k for clarity) ui = αi ˆλi + ai and v = βi λi + bi, i (50) where αi and βi represent the curvature of local functions ˆfi and ˆgi on the ith node. Adaptive Consensus ADMM for Distributed Optimization We select stepsizes with a two step procedure, which follows the spectral stepsize literature. First, we estimate the local curvature parameters, αi and βi, by finding leastsquares solutions to (50). Second, we plug these curvature estimates into the formula τ k i = 1/ αi βi. This formula produces the optimal stepsize when ˆf and ˆg are well approximated by a linear function, as shown in Proposition 1. For notational convenience, we work with the quantities ˆαk i = 1/αi, ˆβk i = 1/βi, which are estimated on each node using the current iterates uk i , vk, λk i , ˆλk i and also an older iterate uk0 i , vk0, λk0 i , ˆλk0 i , k0 < k. Defining uk i = uk i uk0 i , ˆλk i = ˆλk i ˆλk0 i and following the literature for Barzilai-Borwein/spectral stepsize estimation, there are two least squares estimators that can be obtained from (50): ˆαk SD,i = ˆλk i , ˆλk i uk i , ˆλk i and ˆαk MG,i = uk i , ˆλk uk i , uk i (51) where SD stands for steepest descent, and MG stands for minimum gradient. (Zhou et al., 2006) recommend using a hybrid of these two estimators, and choosing ( ˆαk MG,i if 2 ˆαk MG,i > ˆαk SD,i ˆαk SD,i ˆαk MG,i/2 otherwise. (52) It was observed that this choice worked well for nondistributed ADMM in (Xu et al., 2017a). We can similarly estimate ˆβk i from vk = vk +vk0 and λk i = λk i λk0 i . ACADMM estimates the curvatures in the original ddimensional feature space, and avoids estimating the curvature in the higher Nd-dimensional feature space (which grows with the number of nodes N in AADMM (Xu et al., 2017a)), which is especially useful for heterogeneous data with different distributions allocated to different nodes. The overhead of our adaptive scheme is only a few inner products, and the computation is naturally distributed on different workers. 5.4. Safeguarding and convergence Spectral stepsizes for gradient descent methods are equipped with safeguarding strategies like backtracking line search to handle inaccurate curvature estimation and to guarantee convergence. To safeguard the proposed spectral penalty parameters, we check whether our linear subgradient assumption is reasonable before updating the stepsizes. We do this by testing that the correlations αk cor,i = uk i , ˆλk i uk i ˆλk i and βk cor,i = vk, λk i vk λk i , (53) are bounded away from zero by a fixed threshold. We also bound changes in the penalty parameter by (1+ Ccg/k2) according to Assumption 1, which was shown in Theorem 1 Algorithm 1 Adaptive consensus ADMM (ACADMM) Input: initialize v0, λ0 i , τ 0 i , k0 =0, 1: while not converge by (7) and k < maxiter do 2: Locally update uk i on each node by (3) 3: Globally update vk on central server by (4) 4: Locally update dual variable λk i on each node by (5) 5: if mod(k, Tf) = 1 then 6: Locally update ˆλk i = λk 1 i + τ k i (vk 1 uk i ) 7: Locally compute spectral stepsizes ˆαk i , ˆβk i 8: Locally estimate correlations αk cor,i , βk cor,i 9: Locally update τ k+1 i using (54) 10: k0 k 11: else 12: τ k+1 i τ k i 13: end if 14: k k + 1 15: end while and Theorem 2 to guarantee convergence. The final safeguarded ACADMM rule is ˆαk i ˆβk i if αkcor,i > ϵcor and βkcor,i > ϵcor ˆαk i if αkcor,i > ϵcor and βkcor,i ϵcor ˆβk i if αkcor,i ϵcor and βkcor,i > ϵcor τ k i otherwise, τ k+1 i = max{min{ˆτ k+1 i , (1 + Ccg k2 )τ k i } , τ k i 1 + Ccg/k2 }. The complete adaptive consensus ADMM is shown in Algorithm 1. We suggest updating the stepsize every Tf = 2 iterations, fixing the safeguarding threshold ϵcor = 0.2, and choosing a large convergence constant Ccg = 1010. 6. Experiments & Applications We now study the performance of ACADMM on benchmark problems, and compare to other methods. 6.1. Applications Our experiments use the following test problems that are commonly solved using consensus methods. Linear regression with elastic net regularizer. We consider consensus formulations of the elastic net (Zou & Hastie, 2005) with fi and g defined as, 2 Diui ci 2, g(v) = ρ1|v| + ρ2 2 v 2, (55) where Di Rni m is the data matrix on node i, and ci is a vector of measurements. Sparse logistic regression with ℓ1 regularizer can be written in the consensus form for distributed computing, Adaptive Consensus ADMM for Distributed Optimization Table 1: Iterations (and runtime in seconds);128 cores are used; absence of convergence after n iterations is indicated as n+. Application Dataset #samples #features 1 CADMM (Boyd et al., 2011) RB-ADMM (He et al., 2000) AADMM (Xu et al., 2017a) CRB-ADMM (Song et al., 2016) Proposed ACADMM Elastic net regression Synthetic1 64000 100 1000+(1.27e4) 94(1.22e3) 43(563) 106(1.36e3) 48(623) Synthetic2 64000 100 1000+(1.27e4) 130(1.69e3) 341(4.38e3) 140(1.79e3) 57(738) MNIST 60000 784 100+(1.49e4) 88(1.29e3) 40(5.99e3) 87(1.27e4) 14(2.18e3) CIFAR10 2 10000 3072 100+(1.04e3) 100+(1.06e3) 100+(1.05e3) 100+(1.05e3) 35(376) News20 19996 1355191 100+(4.61e3) 100+(4.60e3) 100+(5.17e3) 100+(4.60e3) 78(3.54e3) RCV1 20242 47236 33(1.06e3) 31(1.00e3) 20(666) 31(1.00e3) 8(284) Realsim 72309 20958 32(5.91e3) 30(5.59e3) 14(2.70e3) 30(5.57e3) 9(1.80e3) Sparse logistic regression Synthetic1 64000 100 138(137) 78(114) 80(101) 48(51.9) 24(29.9) Synthetic2 64000 100 317(314) 247(356) 1000+(1.25e3) 1000+(1.00e3) 114(119) MNIST 60000 784 325(444) 212(387) 325(516) 203(286) 149(218) CIFAR10 10000 3072 310(700) 152(402) 310(727) 149(368) 44(118) News20 19996 1355191 316(4.96e3) 211(3.84e3) 316(6.36e3) 207(3.73e3) 137(2.71e3) RCV1 20242 47236 155(115) 155(116) 155(137) 155(115) 150(114) Realsim 72309 20958 184(77) 184(77) 184(85) 183(77) 159(68) Support Vector Machine Synthetic1 64000 100 33(35.0) 33(49.8) 19(27) 26(28.4) 21(25.3) Synthetic2 64000 100 283(276) 69(112) 1000+(1.59e3) 81(97.4) 25(39.0) MNIST 60000 784 1000+(930) 172(287) 73(127) 285(340) 41(88.0) CIFAR10 10000 3072 1000+(774) 227(253) 231(249) 1000+(1.00e3) 62(60.2) News20 19996 1355191 259(2.63e3) 262(2.74e3) 259(3.83e3) 267(2.78e3) 217(2.37e3) RCV1 20242 47236 47(21.7) 47(21.6) 47(31.1) 40(19.0) 27(15.4) Realsim 72309 20958 1000+(76.8) 1000+(77.6) 442(74.4) 1000+(79.3) 347(41.6) SDP Ham-9-5-6 512 53760 100+(2.01e3) 100+(2.14e3) 35(860) 100+(2.14e3) 30(703) 1 #vertices #edges for SDP; 2We only use the first training batch of CIFAR10. j=1 log(1 + exp( ci,j DT i,jui)), g(v) = ρ|v| (56) where Di,j Rm is the jth sample, and ci,j { 1, 1} is the corresponding label. The minimization sub-step (3) in this case is solved by L-BFGS (Liu & Nocedal, 1989). Support Vector Machines (SVMs) minimize the distributed objective function (Goldstein et al., 2016) j=1 max{1 ci,j DT i,jui, 0}, g(v) = 1 2 v 2 2 (57) where Di,j Rm is the jth sample on the ith node, and ci,j { 1, 1} is its label. The minimization (3) is solved by dual coordinate ascent (Chang & Lin, 2011). Semidefinite programming (SDP) can be distributed as, fi(Ui) = ι{Di(Ui) = ci}, g(v) = F, V + ι{V 0} (58) where ι{S} is a characteristic function that is 0 if condition S is satisfied and infinity otherwise. V 0 indicates that V is positive semidefinite. V, F, Di,j Rn n are symmetric matrices, X, Y = trace(XT Y ) denotes the inner product of X and Y , and Di(X) = ( Di,1, X ; . . . ; Di,mi, X ). 6.2. Experimental Setup We test the problems in Section 6.1 with synthetic and real datasets. The number of samples and features are specified in Table 1. Synthetic1 contains samples from a normal distribution, and Synthetic2 contains samples from a mixture of 10 random Gaussians. Synthetic2 is heterogeneous because the data block on each individual node is sampled from only 1 of the 10 Gaussians. We also acquire large empirical datasets from the LIBSVM webpage (Liu et al., 2009), as well as MNIST digital images (Le Cun et al., 1998), and CIFAR10 object images (Krizhevsky & Hinton, 2009). For binary classification tasks (SVM and logreg), we equally split the 10 category labels of MNIST and CIFAR into positive and negative groups. We use a graph from the Seventh DIMACS Implementation Challenge on Semidefinite and Related Optimization Problems following (Burer & Monteiro, 2003) for Semidefinite Programming (SDP). The regularization parameter is fixed at ρ = 10 in all experiments. Consensus ADMM (CADMM) (Boyd et al., 2011), residual balancing (RB-ADMM) (He et al., 2000), adaptive ADMM (AADMM) (Xu et al., 2017a), and consensus residual balancing (CRB-ADMM) (Song et al., 2016) are implemented and reported for comparison. Hyperparameters of these methods are set as suggested by their creators. The initial penalty is fixed at τ0 = 1 for all methods unless otherwise specified. 6.3. Convergence results Table 1 reports the convergence speed in iterations and wall-clock time (secs) for various test cases. These experiments are performed with 128 cores on a Cray XC-30 supercomputer. CADMM with default penalty τ = 1 (Boyd et al., 2011) is often slow to converge. ACADMM outperforms the other ADMM variants on all the real-world Adaptive Consensus ADMM for Distributed Optimization 10-2 100 102 104 Initial penalty parameter ENRegression-Synthetic1 CADMM RB-ADMM AADMM CRB-ADMM ACADMM Number of cores ENRegression-Synthetic2 CADMM RB-ADMM AADMM CRB-ADMM ACADMM Number of cores SVM-Synthetic2 CADMM RB-ADMM AADMM CRB-ADMM ACADMM 10-2 100 102 104 Initial penalty parameter ENRegression-Synthetic2 CADMM RB-ADMM AADMM CRB-ADMM ACADMM (a) Sensitivity of iteration count to initial penalty τ0. Synthetic problems of EN regression are studied with 128 cores. Number of samples ENRegression-Synthetic2 CADMM RB-ADMM AADMM CRB-ADMM ACADMM (b) Sensitivity of iteration count to number of cores (top) and number of samples (bottom). Number of cores SVM-Synthetic2 CADMM RB-ADMM AADMM CRB-ADMM ACADMM (c) Sensitivity of iteration count (top) and wall time (bottom) to number of cores. Figure 1: ACADMM is robust to the initial penalty τ, number of cores N, and number of training samples. datasets, and is competitive with AADMM on two homogeneous synthetic datasets where the curvature may be globally estimated with a scalar. ACADMM is more reliable than AADMM since the curvature estimation becomes difficult for high dimensional variables. RB is relatively stable but sometimes has difficulty finding the exact optimal penalty, as the adaptation can stop because the difference of residuals are not significant enough to trigger changes. RB does not change the initial penalty in several experiments such as logistic regression on RCV1. CRB achieves comparable results with RB, which suggests that the relative sizes of local residuals may not always be very informative. ACADMM significantly boosts AADMM and the local curvature estimations are helpful in practice. 6.4. Robustness and sensitivity Fig. 1a shows that the practical convergence of ADMM is sensitive to the choice of penalty parameter. ACADMM is robust to the selection of the initial penalty parameter and achieves promising results for both homogeneous and heterogeneous data, comparable to ADMM with a fine-tuned penalty parameter. We study scalability of the method by varying the number of workers and training samples (Fig. 1b). ACADMM is fairly robust to the scaling factor. AADMM occasion- ally performs well when small numbers of nodes are used, while ACADMM is much more stable. RB and CRB are more stable than AADMM, but cannot compete with ACADMM. Fig. 1c (bottom) presents the acceleration in (wall-clock secs) achieved by increasing the number of workers. Finally, ACADMM is insensitive to the safeguarding hyper-parameters, correlation threshold ϵcor and convergence constant Ccg. Though tuning these parameters may further improve the performance, the fixed default values generally perform well in our experiments and enable ACADMM to run without user oversight. In further experiments in the supplementary material, we also show that ACADMM is fairly insensitive to the regularization parameter ρ in our classification/regression models. 7. Conclusion We propose ACADMM, a fully automated algorithm for distributed optimization. Numerical experiments on various applications and real-world datasets demonstrate the efficiency and robustness of ACADMM. We also prove a O(1/k) convergence rate for ADMM with adaptive penalties under mild conditions. By automating the selection of algorithm parameters, adaptive methods make distributed systems more reliable, and more accessible to users that lack expertise in optimization. Adaptive Consensus ADMM for Distributed Optimization Acknowledgements ZX , GT, HL and TG were supported by the US Office of Naval Research under grant N00014-17-1-2078 and by the US National Science Foundation (NSF) under grant CCF1535902. GT was partially supported by the DOD High Performance Computing Modernization Program. MF was partially supported by the Fundac ao para a Ciˆencia e Tecnologia, grant UID/EEA/5008/2013. XY was supported by the General Research Fund from Hong Kong Research Grants Council under grant HKBU-12313516. Banert, Sebastian, Bot, Radu Ioan, and Csetnek, Ern o Robert. Fixing and extending some recent results on the admm algorithm. ar Xiv preprint ar Xiv:1612.05057, 2016. Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. and Trends in Mach. Learning, 3:1 122, 2011. Burer, Samuel and Monteiro, Renato DC. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329 357, 2003. Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011. Chang, Tsung-Hui, Hong, Mingyi, Liao, Wei-Cheng, and Wang, Xiangfeng. Asynchronous distributed alternating direction method of multipliers: Algorithm and convergence analysis. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4781 4785. IEEE, 2016. Davis, Damek and Yin, Wotao. Faster convergence rates of relaxed peaceman-rachford and admm under regularity assumptions. ar Xiv preprint ar Xiv:1407.5210, 2014. Eckstein, Jonathan and Bertsekas, Dimitri. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293 318, 1992. Franc a, Guilherme and Bento, Jos e. An explicit rate bound for over-relaxed admm. In Information Theory (ISIT), 2016 IEEE International Symposium on, pp. 2104 2108. IEEE, 2016. Gabay, Daniel and Mercier, Bertrand. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17 40, 1976. Ghadimi, Euhanna, Teixeira, Andr e, Shames, Iman, and Johansson, Mikael. Optimal parameter selection for the alternating direction method of multipliers: quadratic problems. IEEE Trans. Autom. Control, 60:644 658, 2015. Giselsson, Pontus and Boyd, Stephen. Linear convergence and metric selection in douglas-rachford splitting and admm. 2016. Glowinski, Roland and Marroco, A. Sur l approximation, par el ements finis d ordre un, et la r esolution, par p enalisation-dualit e d une classe de probl emes de Dirichlet non lin eaires. ESAIM: Modlisation Mathmatique et Analyse Numrique, 9:41 76, 1975. Goldfarb, Donald, Ma, Shiqian, and Scheinberg, Katya. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349 382, 2013. Goldstein, Tom and Setzer, Simon. High-order methods for basis pursuit. UCLA CAM Report, pp. 10 41, 2010. Goldstein, Tom, O Donoghue, Brendan, Setzer, Simon, and Baraniuk, Richard. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588 1623, 2014. Goldstein, Tom, Li, Min, and Yuan, Xiaoming. Adaptive primal-dual splitting methods for statistical learning and image processing. In Advances in Neural Information Processing Systems, pp. 2080 2088, 2015. Goldstein, Tom, Taylor, Gavin, Barabin, Kawika, and Sayre, Kent. Unwrapping ADMM: efficient distributed computing via transpose reduction. In AISTATS, 2016. He, Bingsheng and Yuan, Xiaoming. On the o(1/n) convergence rate of the douglas-rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2): 700 709, 2012. He, Bingsheng and Yuan, Xiaoming. On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. Numerische Mathematik, 130: 567 577, 2015. He, Bingsheng, Yang, Hai, and Wang, Shengli. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Jour. Optim. Theory and Appl., 106(2):337 356, 2000. Kadkhodaie, Mojtaba, Christakopoulou, Konstantina, Sanjabi, Maziar, and Banerjee, Arindam. Accelerated alternating direction method of multipliers. In Proceedings of the 21th ACM SIGKDD, pp. 497 506, 2015. Adaptive Consensus ADMM for Distributed Optimization Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009. Le Cun, Yann, Bottou, L eon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Lin, Zhouchen, Liu, Risheng, and Su, Zhixun. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, pp. 612 620, 2011. Liu, Dong C and Nocedal, Jorge. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503 528, 1989. Liu, Jun, Chen, Jianhui, and Ye, Jieping. Large-scale sparse logistic regression. In ACM SIGKDD, pp. 547 556, 2009. Nishihara, R., Lessard, L., Recht, B., Packard, A., and Jordan, M. A general analysis of the convergence of ADMM. In ICML, 2015. Ouyang, Hua, He, Niao, Tran, Long, and Gray, Alexander G. Stochastic alternating direction method of multipliers. ICML (1), 28:80 88, 2013. Raghunathan, Arvind and Di Cairano, Stefano. Alternating direction method of multipliers for strictly convex quadratic programs: Optimal parameter selection. In American Control Conf., pp. 4324 4329, 2014. Rockafellar, R. Convex Analysis. Princeton University Press, 1970. Song, Changkyu, Yoon, Sejong, and Pavlovic, Vladimir. Fast ADMM algorithm for distributed optimization with adaptive penalty. AAAI, 2016. Studer, Christoph, Goldstein, Tom, Yin, Wotao, and Baraniuk, Richard G. Democratic representations. ar Xiv preprint ar Xiv:1401.3420, 2014. Taylor, Gavin, Burmeister, Ryan, Xu, Zheng, Singh, Bharat, Patel, Ankit, and Goldstein, Tom. Training neural networks without gradients: A scalable ADMM approach. ICML, 2016. Tian, Wenyi and Yuan, Xiaoming. Faster alternating direction method of multipliers with a worst-case o (1/n2) convergence rate. 2016. Xu, Zheng, Figueiredo, Mario AT, and Goldstein, Thomas. Adaptive ADMM with spectral penalty parameter selection. AISTATS, 2017a. Xu, Zheng, Figueiredo, Mario AT, Yuan, Xiaoming, Studer, Christoph, and Goldstein, Thomas. Adaptive relaxed ADMM: Convergence theory and practical implementation. CVPR, 2017b. Zhang, Ruiliang and Kwok, James T. Asynchronous distributed ADMM for consensus optimization. In ICML, pp. 1701 1709, 2014. Zhou, Bin, Gao, Li, and Dai, Yu-Hong. Gradient methods with adaptive step-sizes. Computational Optimization and Applications, 35:69 86, 2006. Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2): 301 320, 2005.