# Adding vs. Averaging in Distributed Primal-Dual Optimization

Chenxin Ma (CHM514@LEHIGH.EDU), Industrial and Systems Engineering, Lehigh University, USA
Virginia Smith (VSMITH@BERKELEY.EDU), University of California, Berkeley, USA
Martin Jaggi (JAGGI@INF.ETHZ.CH), ETH Zürich, Switzerland
Michael I. Jordan (JORDAN@CS.BERKELEY.EDU), University of California, Berkeley, USA
Peter Richtárik (PETER.RICHTARIK@ED.AC.UK), School of Mathematics, University of Edinburgh, UK
Martin Takáč (TAKAC.MT@GMAIL.COM), Industrial and Systems Engineering, Lehigh University, USA

Authors contributed equally.

Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck. It is difficult to reduce this bottleneck while still efficiently and accurately aggregating partial work from different machines. In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (COCOA) for distributed optimization. Our framework, COCOA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both COCOA as well as our new variants, and generalize the theory for both methods to cover non-smooth convex loss functions. We provide an extensive experimental comparison that shows the markedly improved performance of COCOA+ on several real-world distributed datasets, especially when scaling up the number of machines.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

## 1. Introduction

With the wide availability of large datasets that exceed the storage capacity of single machines, distributed optimization methods for machine learning have become increasingly important. Existing methods require significant communication between workers, frequently equaling the amount of local computation (or reading of local data). As a result, distributed machine learning suffers significantly from a communication bottleneck on real-world systems, where communication is typically several orders of magnitude slower than reading data from main memory.

In this work we focus on optimization problems with empirical loss minimization structure, i.e., objectives that are a sum of the loss functions of the individual datapoints. This includes the most commonly used regularized variants of linear regression and classification methods. For this class of problems, the recently proposed COCOA approach (Yang, 2013; Jaggi et al., 2014) develops a communication-efficient primal-dual scheme that targets the communication bottleneck, allowing more computation on data-local subproblems native to each machine before communication. By appropriately choosing the amount of local computation per round, this framework allows one to control the trade-off between communication and local computation based on the system hardware at hand.

However, the performance of COCOA (as well as related primal SGD-based methods) is significantly reduced by the need to average updates between all machines. As the number of machines K grows, the updates become diluted and slowed by the 1/K scaling; consider, e.g., the case in which all machines except one have already reached the solutions of their respective partial optimization tasks.
On the other hand, if the updates are instead added, the algorithms can diverge, as we will observe in the practical experiments below. To address both issues, we develop in this paper a novel generalization of the local COCOA subproblems assigned to each worker, making the framework more powerful in the following sense: without extra computational cost, the set of locally computed updates from the modified subproblems (one from each machine) can be combined more efficiently between machines. The proposed COCOA+ updates can be aggressively added (hence the '+' suffix), which yields much faster convergence both in practice and in theory. This difference is particularly significant as the number of machines K becomes large.

### 1.1. Contributions

Strong Scaling. To our knowledge, our framework is the first to exhibit favorable strong scaling for the class of problems considered, as the number of machines K increases and the data size is kept fixed. More precisely, while the convergence rate of COCOA degrades as K is increased, the stronger theoretical convergence rate shown here is, in the worst case, independent of K. Our experiments in Section 7 confirm the improved speed of convergence. Since only one vector is communicated per round and worker, this favorable scaling might be surprising. Indeed, for existing methods, splitting data among more machines generally increases communication requirements (Shamir & Srebro, 2014), which can severely affect overall runtime.

Theoretical Analysis of Non-Smooth Losses. While the existing analysis for COCOA in (Jaggi et al., 2014) only covered smooth loss functions, here we extend the class of functions to which the rates apply, additionally covering, e.g., support vector machines and non-smooth regression variants. We provide a primal-dual convergence rate for both COCOA as well as our new method COCOA+ in the case of general convex (L-Lipschitz) losses.

Primal-Dual Convergence Rate. Furthermore, we strengthen the rates by showing stronger primal-dual convergence for both algorithmic frameworks, which are almost tight to their objective-only counterparts. Primal-dual rates for COCOA had not previously been analyzed in the general convex case. Our primal-dual rates allow efficient and practical certificates of optimization quality, e.g., for stopping criteria. The new rates apply to both smooth and non-smooth losses, and to both COCOA and the extended COCOA+.

Arbitrary Local Solvers. COCOA as well as COCOA+ allow the use of arbitrary local solvers on each machine.

Experimental Results. We provide a thorough experimental comparison with competing algorithms using several real-world distributed datasets. Our practical results confirm the strong scaling of COCOA+ as the number of machines K grows, while competing methods, including the original COCOA, slow down significantly with larger K. We implement all algorithms in Spark, and our code is publicly available at github.com/gingsmith/cocoa.

### 1.2. History and Related Work

While optimal algorithms for the serial (single-machine) case are already well researched and understood, the literature in the distributed setting is comparatively sparse. In particular, the optimal trade-offs between computation and communication, as well as between optimization and statistical accuracy, are still widely unclear.
For an overview of this currently active research field, we refer the reader to (Balcan et al., 2012; Richtárik & Takáč, 2013; Duchi et al., 2013; Yang, 2013; Liu & Wright, 2014; Fercoq et al., 2014; Jaggi et al., 2014; Shamir & Srebro, 2014; Shamir et al., 2014; Zhang & Lin, 2015; Qu & Richtárik, 2014) and the references therein. We provide a detailed comparison of our proposed framework to the related work in Section 6.

## 2. Setup

We consider regularized empirical loss minimization problems of the following well-established form:

$$\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n}\sum_{i=1}^n \ell_i(x_i^T w) + \frac{\lambda}{2}\|w\|^2. \qquad (1)$$

Here the vectors {x_i}_{i=1}^n ⊂ R^d represent the training data examples, and the ℓ_i(·) are arbitrary convex real-valued loss functions (e.g., the hinge loss), possibly depending on label information for the i-th datapoint. The constant λ > 0 is the regularization parameter. The above class includes many standard problems of wide interest in machine learning, statistics, and signal processing, including support vector machines, regularized linear and logistic regression, ordinal regression, and others.

Dual Problem, and Primal-Dual Certificates. The conjugate dual of (1) takes the following form:

$$\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{1}{n}\sum_{j=1}^n \ell_j^*(-\alpha_j) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}A\alpha\Big\|^2. \qquad (2)$$

Here the data matrix A = [x_1, x_2, ..., x_n] ∈ R^{d×n} collects all datapoints as its columns, and ℓ_j^* is the conjugate function of ℓ_j. See, e.g., (Shalev-Shwartz & Zhang, 2013c) for several concrete applications.

It is possible to assign to any dual vector α ∈ R^n a corresponding primal feasible point

$$w(\alpha) = \frac{1}{\lambda n} A\alpha. \qquad (3)$$

The duality gap function is then given by

$$G(\alpha) := P(w(\alpha)) - D(\alpha). \qquad (4)$$

By weak duality, every value D(α) at a dual candidate α provides a lower bound on every primal value P(w). The duality gap is therefore a certificate of approximation quality: the distance to the unknown true optimum P(w*) must always lie within the duality gap, i.e.,

$$G(\alpha) = P(w(\alpha)) - D(\alpha) \;\ge\; P(w(\alpha)) - P(w^*) \;\ge\; 0.$$

In large-scale machine learning settings like those considered here, the availability of such a computable measure of approximation quality is a significant benefit during training. Practitioners using classical primal-only methods such as SGD have no means by which to accurately detect whether a model has been well trained, as P(w*) is unknown.

Classes of Loss Functions. To simplify the presentation, we assume that all loss functions ℓ_i are non-negative and satisfy

$$\ell_i(0) \le 1 \quad \forall i. \qquad (5)$$

Definition 1 (L-Lipschitz continuous loss). A function ℓ_i : R → R is L-Lipschitz continuous if for all a, b ∈ R we have

$$|\ell_i(a) - \ell_i(b)| \le L\,|a-b|. \qquad (6)$$

Definition 2 ((1/μ)-smooth loss). A function ℓ_i : R → R is (1/μ)-smooth if it is differentiable and its derivative is (1/μ)-Lipschitz continuous, i.e., for all a, b ∈ R we have

$$|\ell_i'(a) - \ell_i'(b)| \le \frac{1}{\mu}\,|a-b|. \qquad (7)$$
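To make the primal-dual certificate (4) concrete, the following minimal sketch (not part of the original paper) evaluates P(w(α)), D(α), and the gap G(α) for the binary hinge loss ℓ_i(a) = max(0, 1 − y_i a), whose conjugate satisfies ℓ_i*(−α_i) = −y_i α_i on the feasible box y_i α_i ∈ [0, 1]. Function names and the dense-matrix setup are illustrative assumptions, not the authors' code.

```python
import numpy as np

def primal_dual_gap(A, y, alpha, lam):
    """Hinge-loss SVM instance of the primal/dual pair (1)-(2) and gap (4).

    A    : d x n matrix whose columns are the data points x_i
    y    : labels in {-1, +1}
    alpha: dual vector in R^n (feasible if y_i * alpha_i lies in [0, 1])
    lam  : regularization parameter lambda > 0
    """
    d, n = A.shape
    w = A @ alpha / (lam * n)                     # primal-dual map (3): w(alpha) = A alpha / (lambda n)

    hinge = np.maximum(0.0, 1.0 - y * (A.T @ w))  # ell_i(x_i^T w) = max(0, 1 - y_i x_i^T w)
    P = hinge.mean() + 0.5 * lam * (w @ w)        # primal objective (1)

    # -ell_i^*(-alpha_i) = y_i alpha_i on the feasible box y_i alpha_i in [0, 1]
    D = (y * alpha).mean() - 0.5 * lam * (w @ w)  # dual objective (2)

    return P, D, P - D                            # duality gap (4): G(alpha) = P(w(alpha)) - D(alpha)

# Small usage example on random data (illustration only).
rng = np.random.default_rng(0)
d, n, lam = 20, 200, 0.01
A = rng.standard_normal((d, n))
y = np.where(rng.standard_normal(n) >= 0, 1.0, -1.0)
alpha = y * rng.uniform(0.0, 1.0, n)              # dual-feasible starting point
print(primal_dual_gap(A, y, alpha, lam))
```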
## 3. The COCOA+ Algorithm Framework

In this section we present our novel COCOA+ framework. COCOA+ inherits the many benefits of COCOA, as it remains a highly flexible, scalable, and communication-efficient framework for distributed optimization. COCOA+ differs algorithmically in that we modify the form of the local subproblems (9) to allow for more aggressive additive updates (as controlled by γ). We will see that these changes allow for stronger convergence guarantees as well as improved empirical performance. Proofs of all statements in this section are given in the supplementary material.

Data Partitioning. We write {P_k}_{k=1}^K for the given partition of the datapoints [n] := {1, 2, ..., n} over the K worker machines. We denote the size of each part by n_k = |P_k|. For any k ∈ [K] and α ∈ R^n, we write α_[k] ∈ R^n for the vector with entries

$$(\alpha_{[k]})_i := \begin{cases} 0, & \text{if } i \notin \mathcal{P}_k, \\ \alpha_i, & \text{otherwise.} \end{cases}$$

Local Subproblems in COCOA+. We can define a data-local subproblem of the original dual optimization problem (2), which can be solved on machine k and only requires access to data that is already available locally, i.e., datapoints with i ∈ P_k. More formally, each machine k is assigned the following local subproblem, depending only on the previous shared primal vector w ∈ R^d and the change in the local dual variables α_i with i ∈ P_k:

$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \; \mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}; w, \alpha_{[k]}), \qquad (8)$$

where

$$\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}; w, \alpha_{[k]}) := -\frac{1}{n}\sum_{i\in\mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) - \frac{1}{K}\frac{\lambda}{2}\|w\|^2 - \frac{1}{n}\, w^T A\,\Delta\alpha_{[k]} - \frac{\lambda}{2}\sigma'\Big\|\frac{1}{\lambda n}A\,\Delta\alpha_{[k]}\Big\|^2. \qquad (9)$$

Interpretation. The above definition of the local objective functions G^σ'_k is such that they closely approximate the global dual objective D as we vary the local variable Δα_[k], in the following precise sense:

Lemma 3. For any dual variables α, Δα ∈ R^n, primal vector w = w(α), and real values γ, σ' satisfying (11), it holds that

$$D\Big(\alpha + \gamma\sum_{k=1}^K \Delta\alpha_{[k]}\Big) \;\ge\; (1-\gamma)\,D(\alpha) + \gamma\sum_{k=1}^K \mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}; w, \alpha_{[k]}). \qquad (10)$$

The role of the parameter σ' is to measure the difficulty of the given data partition. For our purposes, we will see that it must be chosen not smaller than

$$\sigma' \;\ge\; \sigma'_{\min} := \gamma\,\max_{\Delta\alpha\in\mathbb{R}^n} \frac{\|A\,\Delta\alpha\|^2}{\sum_{k=1}^K \|A\,\Delta\alpha_{[k]}\|^2}. \qquad (11)$$

In the following lemma, we show that this parameter can be upper-bounded by γK, which is trivial to calculate for all values γ ∈ R. We show experimentally (Section 7) that this safe upper bound for σ' has only a minimal effect on the overall performance of the algorithm. Our main theorems below show convergence rates that depend on γ ∈ [1/K, 1], which we refer to as the aggregation parameter.

Lemma 4. The choice σ' := γK is valid for (11), i.e., γK ≥ σ'_min.

Notion of Approximation Quality of the Local Solver.

Assumption 1 (Θ-approximate solution). We assume that there exists Θ ∈ [0, 1) such that for all k ∈ [K], the local solver at any outer iteration t produces a (possibly) randomized approximate solution Δα_[k] which satisfies

$$\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]}; w, \alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}; w, \alpha_{[k]})\big] \;\le\; \Theta\,\big(\mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]}; w, \alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(0; w, \alpha_{[k]})\big), \qquad (12)$$

where

$$\Delta\alpha^\star_{[k]} \in \arg\max_{\Delta\alpha\in\mathbb{R}^n} \mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}; w, \alpha_{[k]}) \quad \forall k\in[K]. \qquad (13)$$
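As a concrete reading of the subproblem objective (9), the following minimal sketch (illustrative only, again specialized to the hinge loss with labels y_i ∈ {−1, +1}; the function and variable names are our assumptions, not the paper's) evaluates G^σ'_k for a candidate local update Δα_[k], touching only the columns of A owned by partition P_k and the shared vector w.

```python
import numpy as np

def local_subproblem_value(A_k, y_k, alpha_k, dalpha_k, w, lam, n, K, sigma_p):
    """Evaluate the local objective G^{sigma'}_k of eq. (9) for the hinge loss.

    A_k      : d x n_k matrix of the columns of A owned by partition P_k
    y_k      : labels of the local points, in {-1, +1}
    alpha_k  : current local dual variables alpha_[k]
    dalpha_k : candidate local update Delta alpha_[k]
    w        : shared primal vector w = w(alpha)
    lam, n, K: regularizer, total number of points, number of machines
    sigma_p  : subproblem parameter sigma' (e.g. the safe bound gamma * K)
    """
    new_alpha = alpha_k + dalpha_k
    # -(1/n) sum_i ell_i^*(-alpha_i - (Delta alpha_[k])_i); for the hinge loss this equals
    # (1/n) sum_i y_i (alpha_i + Delta alpha_i), finite only on the box y_i(.) in [0, 1]
    if np.any(y_k * new_alpha < -1e-12) or np.any(y_k * new_alpha > 1 + 1e-12):
        return -np.inf
    conj_term = (y_k * new_alpha).sum() / n

    Ada = A_k @ dalpha_k                  # A Delta alpha_[k] (only local columns contribute)
    return (conj_term
            - (lam / 2.0) * (w @ w) / K   # -(1/K)(lambda/2) ||w||^2
            - (w @ Ada) / n               # -(1/n) w^T A Delta alpha_[k]
            - (lam / 2.0) * sigma_p * (Ada @ Ada) / (lam * n) ** 2)
            # last term: -(lambda/2) sigma' ||(1/(lambda n)) A Delta alpha_[k]||^2
```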
We are now ready to describe the COCOA+ framework, shown in Algorithm 1. The crucial difference compared to the existing COCOA algorithm (Jaggi et al., 2014) is the more general local subproblem (9), as well as the aggregation parameter γ. These modifications allow the option of directly adding updates to the global vector w.

Algorithm 1 COCOA+ Framework
1: Input: Datapoints A distributed according to partition {P_k}_{k=1}^K; aggregation parameter γ ∈ (0, 1]; subproblem parameter σ' for the local subproblems G^σ'_k(Δα_[k]; w, α_[k]) for each k ∈ [K]. Starting point α^(0) := 0 ∈ R^n, w^(0) := 0 ∈ R^d.
2: for t = 0, 1, 2, ... do
3:   for k ∈ {1, 2, ..., K} in parallel over machines do
4:     call the local solver, computing a Θ-approximate solution Δα_[k] of the local subproblem (9)
5:     update α^(t+1)_[k] := α^(t)_[k] + γ Δα_[k]
6:     return Δw_k := (1/(λn)) A Δα_[k]
7:   end for
8:   reduce w^(t+1) := w^(t) + γ Σ_{k=1}^K Δw_k   (14)
9: end for

## 4. Convergence Guarantees

Before stating our main convergence results, we introduce some useful quantities and the following main lemma, which characterizes the effect of one iteration of Algorithm 1 for any chosen local solver.

Lemma 5. Let ℓ_i^* be strongly¹ convex with convexity parameter μ ≥ 0 with respect to the norm ‖·‖, for all i ∈ [n]. Then for all iterations t of Algorithm 1 under Assumption 1, and any s ∈ [0, 1], it holds that

$$\mathbb{E}\big[D(\alpha^{(t+1)}) - D(\alpha^{(t)})\big] \;\ge\; \gamma(1-\Theta)\Big(s\,G(\alpha^{(t)}) - \frac{\sigma'}{2\lambda}\Big(\frac{s}{n}\Big)^2 R^{(t)}\Big), \qquad (15)$$

where

$$R^{(t)} := -\frac{\lambda\mu n(1-s)}{\sigma' s}\,\big\|u^{(t)} - \alpha^{(t)}\big\|^2 + \sum_{k=1}^K \big\|A\,(u^{(t)} - \alpha^{(t)})_{[k]}\big\|^2, \qquad (16)$$

for u^{(t)} ∈ R^n with

$$-u_i^{(t)} \in \partial \ell_i\big(w(\alpha^{(t)})^T x_i\big). \qquad (17)$$

¹ Note that the case of weakly convex ℓ_i^*(·) is explicitly allowed here as well, as the lemma holds for the case μ = 0.

The following lemma provides a uniform bound on R^(t):

Lemma 6. If the ℓ_i are L-Lipschitz continuous for all i ∈ [n], then

$$\forall t:\quad R^{(t)} \;\le\; 4L^2 \sum_{k=1}^K \sigma_k n_k \;=:\; 4L^2\sigma, \qquad (18)$$

where

$$\sigma_k := \max_{\alpha_{[k]}\in\mathbb{R}^n} \frac{\|A\,\alpha_{[k]}\|^2}{\|\alpha_{[k]}\|^2}. \qquad (19)$$

Remark 7. If all datapoints x_i are normalized such that ‖x_i‖ ≤ 1 for all i ∈ [n], then σ_k ≤ |P_k| = n_k. Furthermore, if we assume that the data partition is balanced, i.e., that n_k = n/K for all k, then σ ≤ n²/K. This can be used to bound the constant R^(t) above as R^(t) ≤ 4L²n²/K.

### 4.1. Primal-Dual Convergence for General Convex Losses

The following theorem shows the convergence for non-smooth loss functions, in terms of objective values as well as the primal-dual gap. The analysis in (Jaggi et al., 2014) only covered the case of smooth loss functions.

Theorem 8. Consider Algorithm 1 under Assumption 1. Let ℓ_i(·) be L-Lipschitz continuous, and let ε_G > 0 be the desired duality gap (and hence an upper bound on primal sub-optimality). Then after T iterations, where

$$T \ge T_0 + \max\Big\{\Big\lceil\frac{1}{\gamma(1-\Theta)}\Big\rceil,\; \frac{4L^2\sigma\sigma'}{\lambda n^2\epsilon_G\,\gamma(1-\Theta)}\Big\}, \qquad (20)$$
$$T_0 \ge t_0 + \frac{2}{\gamma(1-\Theta)}\Big(\frac{8L^2\sigma\sigma'}{\lambda n^2\epsilon_G} - 1\Big)_+, \qquad t_0 \ge \max\Big(0,\;\Big\lceil\frac{1}{\gamma(1-\Theta)}\log\Big(\frac{2\lambda n^2\big(D(\alpha^\star)-D(\alpha^{(0)})\big)}{4L^2\sigma\sigma'}\Big)\Big\rceil\Big),$$

we have that the expected duality gap satisfies E[P(w(ᾱ)) − D(ᾱ)] ≤ ε_G at the averaged iterate

$$\bar\alpha := \frac{1}{T - T_0}\sum_{t=T_0+1}^{T-1} \alpha^{(t)}. \qquad (21)$$

The following corollary of the above theorem clarifies our main result: the more aggressive adding of the partial updates, as compared to averaging, offers a very significant improvement in terms of the total number of iterations needed. While the convergence in the adding case becomes independent of the number of machines K, the averaging regime shows the known degradation of the rate with growing K, which is a major drawback of the original COCOA algorithm. This important difference in convergence speed is not merely a theoretical artifact but is also confirmed in our practical experiments below for different K, as shown, e.g., in Figure 2. We further demonstrate below that by choosing γ and σ' accordingly, we can still recover the original COCOA algorithm and its rate.

Corollary 9. Assume that all datapoints x_i are bounded as ‖x_i‖ ≤ 1 and that the data partition is balanced, i.e., that n_k = n/K for all k. We consider two different choices of the aggregation parameter γ:

(COCOA Averaging, γ := 1/K): In this case, σ' := 1 is a valid choice which satisfies (11). Then using σ ≤ n²/K in light of Remark 7, we have that T iterations are sufficient for primal-dual accuracy ε_G, with

$$T \ge T_0 + \max\Big\{\Big\lceil\frac{K}{1-\Theta}\Big\rceil,\; \frac{4L^2}{\lambda\epsilon_G(1-\Theta)}\Big\}, \quad T_0 \ge t_0 + \frac{2K}{1-\Theta}\Big(\frac{8L^2}{\lambda K\epsilon_G} - 1\Big)_+, \quad t_0 \ge \max\Big(0,\;\Big\lceil\frac{K}{1-\Theta}\log\Big(\frac{2\lambda K\big(D(\alpha^\star)-D(\alpha^{(0)})\big)}{4L^2}\Big)\Big\rceil\Big).$$

Hence the more machines K, the more iterations are needed (in the worst case).

(COCOA+ Adding, γ := 1): In this case, the choice σ' := K satisfies (11). Then using σ ≤ n²/K in light of Remark 7, we have that T iterations are sufficient for primal-dual accuracy ε_G, with

$$T \ge T_0 + \max\Big\{\Big\lceil\frac{1}{1-\Theta}\Big\rceil,\; \frac{4L^2}{\lambda\epsilon_G(1-\Theta)}\Big\}, \quad T_0 \ge t_0 + \frac{2}{1-\Theta}\Big(\frac{8L^2}{\lambda\epsilon_G} - 1\Big)_+, \quad t_0 \ge \max\Big(0,\;\Big\lceil\frac{1}{1-\Theta}\log\Big(\frac{2\lambda\big(D(\alpha^\star)-D(\alpha^{(0)})\big)}{4L^2}\Big)\Big\rceil\Big).$$

This is significantly better than the averaging case.

In practice, we usually have σ ≪ n²/K, and hence the actual convergence rate can be much better than the proven worst-case bound. Table 1 shows that the actual value of σ is typically between one and two orders of magnitude smaller than the upper bound n²/K used here.
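Table 1 below reports this ratio; the underlying quantities are simple to compute, since σ_k in (19) is the squared spectral norm of the block of columns of A stored on machine k, so σ = Σ_k σ_k n_k can be compared directly against the safe bound n²/K. A minimal sketch (our illustration, assuming a dense matrix and a balanced random partition; not the authors' code):

```python
import numpy as np

def sigma_vs_bound(A, K, seed=0):
    """Compare sigma = sum_k sigma_k * n_k (Lemma 6 / Remark 7) with the bound n^2 / K.

    sigma_k = max_{alpha_[k]} ||A alpha_[k]||^2 / ||alpha_[k]||^2 is the squared
    spectral norm of the block of columns owned by machine k.
    """
    d, n = A.shape
    perm = np.random.default_rng(seed).permutation(n)
    parts = np.array_split(perm, K)                 # balanced random partition P_1, ..., P_K
    sigma = sum(np.linalg.norm(A[:, Pk], 2) ** 2 * len(Pk) for Pk in parts)
    return (n ** 2 / K) / sigma                     # the ratio reported in Table 1

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4000))
A /= np.linalg.norm(A, axis=0)                      # normalize so ||x_i|| <= 1 (Remark 7)
print(sigma_vs_bound(A, K=8))                       # ratio > 1: the safe bound is loose in practice
```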
Table 1. The ratio of the upper bound n²/K to the true value of the parameter σ, for several real datasets.

| K | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|
| news | 15.483 | 14.933 | 14.278 | 13.390 | 12.074 | 10.252 |
| real-sim | 42.127 | 36.898 | 30.780 | 23.814 | 16.965 | 11.835 |
| rcv1 | 40.138 | 23.827 | 28.204 | 21.792 | 16.339 | 11.099 |

| K | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
|---|---|---|---|---|---|---|
| covtype | 17.277 | 17.260 | 17.239 | 16.948 | 17.238 | 12.729 |

### 4.2. Primal-Dual Convergence for Smooth Losses

The following theorem shows the convergence for smooth losses, in terms of the objective as well as the primal-dual gap.

Theorem 10. Assume the loss functions ℓ_i are (1/μ)-smooth for all i ∈ [n]. We define σ_max := max_{k∈[K]} σ_k. Then after T iterations of Algorithm 1, with

$$T \ge \frac{1}{\gamma(1-\Theta)}\,\frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n}\,\log\frac{1}{\epsilon_D},$$

it holds that E[D(α*) − D(α^(T))] ≤ ε_D. Furthermore, after T iterations with

$$T \ge \frac{1}{\gamma(1-\Theta)}\,\frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n}\,\log\Big(\frac{1}{\gamma(1-\Theta)}\,\frac{\lambda\mu n + \sigma_{\max}\sigma'}{\lambda\mu n}\,\frac{1}{\epsilon_G}\Big),$$

we have the expected duality gap E[P(w(α^(T))) − D(α^(T))] ≤ ε_G.

The following corollary is analogous to Corollary 9, but for the case of smooth losses. It again shows that while the COCOA variant degrades as the number of machines K increases, the COCOA+ rate is independent of K.

Corollary 11. Assume that all datapoints x_i are bounded as ‖x_i‖ ≤ 1 and that the data partition is balanced, i.e., that n_k = n/K for all k. We again consider the same two choices of the aggregation parameter γ:

(COCOA Averaging, γ := 1/K): In this case, σ' := 1 is a valid choice which satisfies (11). Then using σ_max ≤ n_k = n/K in light of Remark 7, we have that T iterations are sufficient for suboptimality ε_D, with

$$T \ge \frac{1}{1-\Theta}\,\frac{\lambda\mu K + 1}{\lambda\mu}\,\log\frac{1}{\epsilon_D}.$$

Hence the more machines K, the more iterations are needed (in the worst case).

(COCOA+ Adding, γ := 1): In this case, the choice σ' := K satisfies (11). Then using σ_max ≤ n_k = n/K in light of Remark 7, we have that T iterations are sufficient for suboptimality ε_D, with

$$T \ge \frac{1}{1-\Theta}\,\frac{\lambda\mu + 1}{\lambda\mu}\,\log\frac{1}{\epsilon_D}.$$

This is significantly better than the averaging case. Both rates hold analogously for the duality gap.

### 4.3. Comparison with Original COCOA

Remark 12. If we choose averaging (γ := 1/K) for aggregating the updates, together with σ' := 1, then the resulting Algorithm 1 is identical to COCOA as analyzed in (Jaggi et al., 2014). However, that analysis only provides convergence for smooth loss functions ℓ_i, and only guarantees dual suboptimality, not the duality gap. Formally, when σ' = 1, the subproblems (9) differ from the original dual D(·) only by an additive constant, which does not affect the local optimization algorithms used within COCOA.

## 5. SDCA as an Example Local Solver

We have shown convergence rates for Algorithm 1 that depend solely on the approximation quality Θ of the local solver used (Assumption 1). In each round, any chosen local solver receives the local α variables as input, together with a shared vector w = w(α) (see (3)) that is consistent with the last state of all global variables α ∈ R^n. As an illustrative example of a local solver, Algorithm 2 below summarizes randomized coordinate ascent (SDCA) applied to the local subproblem (9). The following two theorems (13 and 14) characterize the local convergence for smooth and non-smooth losses, respectively. In all results we use r_max := max_{i∈[n]} ‖x_i‖².

Algorithm 2 LOCALSDCA(w, α_[k], k, H)
1: Input: α_[k], w = w(α)
2: Data: Local {(x_i, y_i)}_{i∈P_k}
3: Initialize: Δα^(0)_[k] := 0 ∈ R^n
4: for h = 0, 1, ..., H − 1 do
5:   choose i ∈ P_k uniformly at random
6:   δ*_i := arg max_{δ_i ∈ R} G^σ'_k(Δα^(h)_[k] + δ_i e_i; w, α_[k])
7:   Δα^(h+1)_[k] := Δα^(h)_[k] + δ*_i e_i
8: end for
9: Output: Δα^(H)_[k]
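For the hinge loss, the single-coordinate maximization in line 6 of Algorithm 2 has a closed form, analogous to the standard SDCA update but with w replaced by w + σ'·(1/(λn)) A Δα_[k] and the curvature scaled by σ'. The sketch below is our illustration under these assumptions (it is not the authors' Spark implementation): it implements LOCALSDCA for the hinge loss and one outer round of Algorithm 1 with a given aggregation parameter γ.

```python
import numpy as np

def local_sdca(A_k, y_k, alpha_k, w, lam, n, sigma_p, H, rng):
    """Algorithm 2 (LocalSDCA) specialized to the hinge loss.

    Returns the local update Delta alpha_[k] after H coordinate steps.
    """
    n_k = A_k.shape[1]
    dalpha = np.zeros(n_k)
    v = np.zeros_like(w)                              # v = (1/(lam*n)) * A_k @ dalpha, kept incrementally
    for _ in range(H):
        i = rng.integers(n_k)                         # line 5: pick a local coordinate uniformly
        x, yi = A_k[:, i], y_k[i]
        # line 6: closed-form maximizer of G^{sigma'}_k along coordinate i for the hinge loss,
        # i.e. the usual SDCA step with w -> w + sigma' * v and ||x||^2 -> sigma' * ||x||^2
        grad = 1.0 - yi * ((w + sigma_p * v) @ x)
        a_old = alpha_k[i] + dalpha[i]
        a_new = yi * np.clip(yi * a_old + lam * n * grad / (sigma_p * (x @ x)), 0.0, 1.0)
        delta = a_new - a_old
        dalpha[i] += delta                            # line 7
        v += delta * x / (lam * n)
    return dalpha

def cocoa_plus_round(A, y, alpha, w, parts, lam, sigma_p, H, gamma=1.0, rng=None):
    """One outer iteration of Algorithm 1: K local solves against the same shared w,
    followed by aggregation of all updates scaled by gamma (adding when gamma = 1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = A.shape[1]
    updates = [local_sdca(A[:, Pk], y[Pk], alpha[Pk], w, lam, n, sigma_p, H, rng)
               for Pk in parts]                       # run in parallel over machines in the real framework
    for Pk, dalpha_k in zip(parts, updates):
        alpha[Pk] += gamma * dalpha_k                 # line 5
        w = w + gamma * (A[:, Pk] @ dalpha_k) / (lam * n)   # lines 6 and 8: Delta w_k, then reduce
    return alpha, w
```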
Theorem 13. Assume the functions ℓ_i are (1/μ)-smooth for i ∈ [n]. Then Assumption 1 on the local approximation quality Θ is satisfied for LOCALSDCA as given in Algorithm 2 if we choose the number of inner iterations H as

$$H \;\ge\; n_k\,\frac{\sigma' r_{\max} + \lambda n\mu}{\lambda n\mu}\,\log\frac{1}{\Theta}.$$

Theorem 14. Assume the functions ℓ_i are L-Lipschitz for i ∈ [n]. Then Assumption 1 on the local approximation quality Θ is satisfied for LOCALSDCA as given in Algorithm 2 if we choose the number of inner iterations H as

$$H \;\ge\; n_k + \frac{\sigma' r_{\max}\,\|\Delta\alpha^\star_{[k]}\|^2}{2\Theta\lambda n^2\big(\mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]};\cdot) - \mathcal{G}^{\sigma'}_k(0;\cdot)\big)}.$$

Remark 15. Between the different regimes allowed in COCOA+ (ranging from averaging to adding the updates), the computational cost of obtaining the required local approximation quality varies with the choice of σ'. From the above worst-case upper bounds, we note that the cost can increase with σ', as the aggregation becomes more aggressive. However, as we will see in the practical experiments in Section 7 below, the additional cost is negligible compared to the gain in speed from the different aggregation, when measured on real datasets.

## 6. Discussion and Related Work

SGD-based Algorithms. For the empirical loss minimization problems of interest here, stochastic subgradient descent (SGD) based methods are well established. Several distributed variants of SGD have been proposed, many of which build on the idea of a parameter server (Niu et al., 2011; Liu et al., 2014; Duchi et al., 2013). The downside of this approach, even when carefully implemented, is that the amount of required communication is equal to the amount of data read locally (e.g., mini-batch SGD with a batch size of 1 per worker). These variants are in practice not competitive with the more communication-efficient methods considered here, which allow more local updates per round.

One-Shot Communication Schemes. At the other extreme, there are distributed methods that use only a single round of communication, such as (Zhang et al., 2013; Zinkevich et al., 2010; Mann et al., 2009; McWilliams et al., 2014). These require additional assumptions on the partitioning of the data and, furthermore, cannot guarantee convergence to the optimal solution for all regularizers, as shown in, e.g., (Shamir et al., 2014). (Balcan et al., 2012) shows additional relevant lower bounds on the minimum number of communication rounds necessary for a given approximation quality on similar machine learning problems.

Mini-Batch Methods. Mini-batch methods are more flexible and lie between these two communication vs. computation extremes. However, mini-batch versions of both SGD and coordinate descent (CD) (Richtárik & Takáč, 2013; Shalev-Shwartz & Zhang, 2013b; Yang, 2013; Qu & Richtárik, 2014; Qu et al., 2014) suffer from their convergence rate degrading towards the rate of batch gradient descent as the size of the mini-batch is increased. This is because mini-batch updates are made based on the outdated previous parameter vector w, in contrast to methods such as COCOA that allow immediate local updates. Furthermore, the aggregation parameter for mini-batch methods is harder to tune, as it can lie anywhere in the order of the mini-batch size. In the COCOA setting, the parameter lies in the smaller range given by K. Our COCOA+ extension avoids the need to tune this parameter entirely, by adding.

Methods Allowing Local Optimization.
Developing methods that allow for local optimization requires carefully devising data-local subproblems to be solved after each communication round. (Shamir et al., 2014; Zhang & Lin, 2015) have proposed distributed Newton-type algorithms in this spirit. However, the subproblems must be solved to high accuracy for convergence to hold, which is often prohibitive as the size of the data on a single machine is still relatively large. In contrast, the COCOA framework (Jaggi et al., 2014) allows the use of any local solver of weak local approximation quality in each round. By making use of the primal-dual structure in the line of work of (Yu et al., 2012; Pechyony et al., 2011; Yang, 2013; Lee & Roth, 2015), the COCOA and COCOA+ frameworks also allow more control over the aggregation of updates between machines.

Figure 1. Duality gap vs. the number of communicated vectors, and duality gap vs. elapsed time in seconds, for two datasets: Covertype (left, K=4) and RCV1 (right, K=8). Both are shown on a log-log scale, for three different values of regularization (λ = 1e-4, 1e-5, 1e-6). Each plot compares COCOA (red) and COCOA+ (blue) for three different values of H, the number of local iterations performed per round. For all plots, across all values of λ and H, we see that COCOA+ converges to the optimal solution faster than COCOA, in terms of both the number of communications and the elapsed time.

The practical variant DisDCA-p proposed in (Yang, 2013) allows additive updates but is restricted to SDCA updates, and was proposed without convergence guarantees.
DisDCA-p can be recovered as a special case of the COCOA+ framework when using SDCA as the local solver, if n_k = n/K and σ' := K; see Appendix C. The theory presented here therefore also covers that method.

ADMM. An alternative approach to distributed optimization is the alternating direction method of multipliers (ADMM), as used for distributed SVM training in, e.g., (Forero et al., 2010). This approach uses a penalty parameter that balances between the equality constraint on w and the optimization objective (Boyd et al., 2011). However, the known convergence rates for ADMM are weaker than those of the more problem-tailored methods mentioned previously, and the choice of the penalty parameter is often unclear.

Batch Proximal Methods. In spirit, for the special case of adding (γ = 1), COCOA+ resembles a batch proximal method, using the separable approximation (9) instead of the original dual (2). Known batch proximal methods require high-accuracy subproblem solutions, and do not allow arbitrary solvers of weak accuracy Θ as we do here.

## 7. Numerical Experiments

We present experiments on several large real-world distributed datasets. We show that COCOA+ converges faster than COCOA in terms of total rounds as well as elapsed time in all cases, despite varying the dataset, the regularization, the batch size, and the cluster size (Section 7.2). In Section 7.3 we demonstrate that this performance translates to orders-of-magnitude improvements in convergence when scaling up the number of machines K, as compared to COCOA as well as to several other state-of-the-art methods. Finally, in Section 7.4 we investigate the impact of the local subproblem parameter σ' in the COCOA+ framework.

Table 2. Datasets for Numerical Experiments.

| Dataset | n | d | Sparsity |
|---|---|---|---|
| covertype | 522,911 | 54 | 22.22% |
| epsilon | 400,000 | 2,000 | 100% |
| RCV1 | 677,399 | 47,236 | 0.16% |

### 7.1. Implementation Details

We implement all algorithms in Apache Spark (Zaharia et al., 2012) and run them on m3.large Amazon EC2 instances, applying each method to the binary hinge-loss support vector machine. The analysis for this non-smooth loss was not covered in (Jaggi et al., 2014) but is captured here, so this setting is both theoretically and practically justified. The datasets used are summarized in Table 2. For illustration and ease of comparison, we use SDCA (Shalev-Shwartz & Zhang, 2013c) as the local solver for both COCOA and COCOA+. Note that in this special case, if additionally σ' := K and the partitioning is balanced with n_k = n/K, one can show that the COCOA+ framework reduces to the practical variant of DisDCA (Yang, 2013), which previously had no convergence guarantees. We include more details on this connection in Appendix C.
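For readers who want to reproduce the qualitative behavior at small scale, the sketches given earlier compose directly. The toy driver below is our illustration on synthetic data (it assumes the functions primal_dual_gap and cocoa_plus_round defined in the sketches above are in scope, and is not the Spark setup used for the reported experiments): it runs a few outer rounds with averaging (γ = 1/K, σ' = 1) and with adding (γ = 1, σ' = K) and prints the resulting duality gaps.

```python
import numpy as np

# Toy driver combining the earlier sketches (primal_dual_gap, cocoa_plus_round);
# synthetic data only, meant to illustrate the gamma / sigma' choices rather than
# to reproduce the paper's measurements.
rng = np.random.default_rng(2)
d, n, K, lam, H = 20, 2000, 8, 1e-3, 200
A = rng.standard_normal((d, n))
A /= np.linalg.norm(A, axis=0)                         # normalized columns, ||x_i|| <= 1
y = np.where(rng.standard_normal(n) >= 0, 1.0, -1.0)
parts = np.array_split(rng.permutation(n), K)          # balanced partition, n_k = n / K

for label, gamma, sigma_p in [("averaging", 1.0 / K, 1.0), ("adding", 1.0, float(K))]:
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(20):                                # 20 outer rounds of Algorithm 1
        alpha, w = cocoa_plus_round(A, y, alpha, w, parts, lam, sigma_p, H, gamma, rng)
    print(label, primal_dual_gap(A, y, alpha, lam)[2]) # duality gap after the rounds
```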
### 7.2. Comparison of COCOA+ and COCOA

We compare the COCOA+ and COCOA frameworks directly on two datasets (Covertype and RCV1) across various values of the regularizer λ in Figure 1. For each value of λ we consider both methods with different values of H, the number of local iterations performed before communicating to the master. For all runs of COCOA+ we use the safe upper bound γK for σ'. In terms of both the total number of communications made and the elapsed time, COCOA+ (shown in blue) converges to the optimal solution faster than COCOA (red). The discrepancy is larger for greater values of λ, where the strongly convex regularizer has more of an impact and the problem difficulty is reduced. We also see a greater performance gap for smaller values of H, where there is frequent communication between the machines and the master, and the differences between the algorithms therefore play a larger role.

### 7.3. Scaling the Number of Machines K

In Figure 2 we demonstrate the ability of COCOA+ to scale with an increasing number of machines K. The experiments confirm the strong scaling of the new method, as predicted by our theory in Section 4, in contrast to the competing methods. Unlike COCOA, which becomes linearly slower when increasing the number of machines, the performance of COCOA+ improves with additional machines, only starting to degrade slightly beyond K=16 for the RCV1 dataset.

Figure 2. The effect of increasing K on the time (s) to reach an ε_D-accurate solution, for the RCV1 and Epsilon datasets. We see that COCOA+ converges twice as fast as COCOA on 100 machines for the Epsilon dataset, and nearly 7 times as quickly for the RCV1 dataset. Mini-batch SGD converges an order of magnitude more slowly than both methods.

### 7.4. Impact of the Subproblem Parameter σ'

Finally, in Figure 3, we consider the effect of the choice of the subproblem parameter σ' on convergence. We plot both the number of communications and the clock time on a log-log scale for the RCV1 dataset with K=8 and H=1e4. For γ = 1 (the most aggressive variant of COCOA+, in which updates are added) we consider several values of σ', ranging from 1 to 8. The value σ'=8 represents the safe upper bound γK. The optimal convergence occurs around σ'=4, and the method diverges for σ' ≤ 2. Notably, we see that the easy-to-calculate upper bound σ' := γK (as given by Lemma 4) has only slightly worse performance than the best possible subproblem parameter in our setting.

Figure 3. The effect of σ' on the convergence of COCOA+ for the RCV1 dataset distributed across K=8 machines (γ = 1, adding; σ' ∈ {1, 2, 4, 6, 8=K}). Decreasing σ' improves performance in terms of communication and overall run time up to a certain point, after which the algorithm diverges. The safe upper bound σ' := K = 8 has only slightly worse performance than the practically best (unsafe) value of σ'.

## 8. Conclusion

In conclusion, we have presented a novel framework, COCOA+, that allows for fast and communication-efficient additive aggregation in distributed algorithms for primal-dual optimization. We analyzed the theoretical performance of this method, giving strong primal-dual convergence rates with outer iterations scaling independently of the number of machines. We extended our theory to allow for non-smooth losses. Our experimental results show significant speedups over previous methods, including the original COCOA framework as well as other state-of-the-art methods.

Acknowledgments. We thank Ching-pei Lee and an anonymous reviewer for several helpful insights and comments.

## References

Balcan, M.-F., Blum, A., Fine, S., and Mansour, Y. Distributed learning, communication complexity and privacy. In COLT, pp. 26.1-26.22, 2012.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

Duchi, J. C., Jordan, M. I., and McMahan, H. B. Estimation, optimization, and parallelism when data is sparse. In NIPS, 2013.

Fercoq, O. and Richtárik, P. Accelerated, parallel and proximal coordinate descent. arXiv:1312.5799, 2013.

Fercoq, O., Qu, Z., Richtárik, P., and Takáč, M. Fast distributed coordinate descent for non-strongly convex losses. In IEEE Workshop on Machine Learning for Signal Processing, 2014.

Forero, P. A., Cano, A., and Giannakis, G. B. Consensus-based distributed support vector machines. JMLR, 11:1663-1707, 2010.

Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Krishnan, S., Hofmann, T., and Jordan, M. I. Communication-efficient distributed dual coordinate ascent. In NIPS, 2014.

Lee, C.-P. and Roth, D. Distributed box-constrained quadratic optimization for dual linear SVM. In ICML, 2015.

Liu, J. and Wright, S. J. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. arXiv:1403.3862, 2014.

Liu, J., Wright, S. J., Ré, C., Bittorf, V., and Sridhar, S. An asynchronous parallel stochastic coordinate descent algorithm. In ICML, 2014.

Lu, Z. and Xiao, L. On the complexity analysis of randomized block-coordinate descent methods. arXiv:1305.4723, 2013.

Mann, G., McDonald, R., Mohri, M., Silberman, N., and Walker, D. D. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, 2009.

Mareček, J., Richtárik, P., and Takáč, M. Distributed block coordinate descent for minimizing partially separable functions. arXiv:1406.0238, 2014.

McWilliams, B., Heinze, C., Meinshausen, N., Krummenacher, G., and Vanchinathan, H. P. LOCO: Distributing ridge regression with random projections. arXiv stat.ML, June 2014.

Niu, F., Recht, B., Ré, C., and Wright, S. J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.

Pechyony, D., Shen, L., and Jones, R. Solving large scale linear SVM with distributed block minimization. In NIPS Workshop on Big Learning, 2011.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling I: Algorithms and complexity. arXiv:1412.8060, 2014.

Qu, Z., Richtárik, P., and Zhang, T. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014.

Richtárik, P. and Takáč, M. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013.

Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, April 2014.

Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, pp. 1-52, 2015.

Shalev-Shwartz, S. and Zhang, T. Accelerated mini-batch stochastic dual coordinate ascent. In NIPS, 2013a.

Shalev-Shwartz, S. and Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. arXiv:1309.2375, 2013b.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567-599, 2013c.

Shamir, O. and Srebro, N. Distributed stochastic optimization and learning. In Allerton, 2014.
Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.

Tappenden, R., Takáč, M., and Richtárik, P. On the complexity of parallel coordinate descent. Technical Report ERGO 15-001, University of Edinburgh, 2015.

Yang, T. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, 2013.

Yang, T., Zhu, S., Jin, R., and Lin, Y. On theoretical analysis of distributed stochastic dual coordinate ascent. arXiv:1312.1031, 2013.

Yu, H.-F., Hsieh, C.-J., Chang, K.-W., and Lin, C.-J. Large linear classification when data cannot fit in memory. TKDD, 5(4):1-23, 2012.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

Zhang, Y. and Lin, X. DiSCO: Distributed optimization for self-concordant empirical loss. In ICML, pp. 362-370, 2015.

Zhang, Y., Duchi, J. C., and Wainwright, M. J. Communication-efficient algorithms for statistical optimization. JMLR, 14:3321-3363, 2013.

Zinkevich, M. A., Weimer, M., Smola, A. J., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.