# Distributed Learning with Sublinear Communication

Jayadev Acharya¹, Christopher De Sa¹, Dylan J. Foster², Karthik Sridharan¹
¹Cornell University, ²Massachusetts Institute of Technology. Correspondence to: Dylan Foster.

Abstract

In distributed statistical learning, $N$ samples are split across $m$ machines and a learner wishes to use minimal communication to learn as well as if the examples were on a single machine. This model has received substantial interest in machine learning due to its scalability and potential for parallel speedup. However, in high-dimensional settings, where the number of examples is smaller than the number of features (dimension), the speedup afforded by distributed learning may be overshadowed by the cost of communicating a single example. This paper investigates the following question: when is it possible to learn a $d$-dimensional model in the distributed setting with total communication sublinear in $d$? Starting with a negative result, we observe that for learning $\ell_1$-bounded or sparse linear models, no algorithm can obtain optimal error until communication is linear in dimension. Our main result is that by slightly relaxing the standard boundedness assumptions for linear models, we can obtain distributed algorithms that enjoy optimal error with communication logarithmic in dimension. This result is based on a family of algorithms that combine mirror descent with randomized sparsification/quantization of iterates, and extends to the general stochastic convex optimization model.

1. Introduction

In statistical learning, a learner receives examples $z_1,\ldots,z_N$ i.i.d. from an unknown distribution $\mathcal{D}$. Their goal is to output a hypothesis $\hat{h} \in \mathcal{H}$ that minimizes the prediction error $L_{\mathcal{D}}(h) = \mathbb{E}_{z\sim\mathcal{D}}\,\ell(h,z)$, and in particular to guarantee that the excess risk of the learner is small, i.e.

$$L_{\mathcal{D}}(\hat{h}) - \inf_{h\in\mathcal{H}} L_{\mathcal{D}}(h) \le \varepsilon(\mathcal{H}, N), \tag{1}$$

where $\varepsilon(\mathcal{H}, N)$ is a decreasing function of $N$.

This paper focuses on distributed statistical learning. We consider a distributed setting where the $N$ examples are split evenly across $m$ machines, with $n = N/m$ examples per machine, and the learner wishes to achieve an excess risk guarantee such as (1) with minimal overhead in computation or communication. Distributed learning has been the subject of extensive investigation due to its scalability for processing massive data: we may wish to efficiently process datasets that are spread across multiple data centers, or we may want to distribute data across multiple machines to allow for parallelization of learning procedures. The question of parallelizing computation via distributed learning is a well-explored problem (Bekkerman et al., 2011; Recht et al., 2011; Dekel et al., 2012; Chaturapruek et al., 2015). However, one drawback that limits the practical viability of these approaches is that the communication cost between machines may overshadow gains in parallel speedup (Bijral et al., 2016). Indeed, for high-dimensional statistical inference tasks where $N$ could be much smaller than the dimension $d$, or in modern deep learning where the number of model parameters exceeds the number of examples (e.g., He et al., 2016), communicating a single gradient or sending the raw model parameters between machines constitutes a significant overhead.
Algorithms with reduced communication complexity in distributed learning have received significant recent development (Seide et al., 2014; Alistarh et al., 2017; Zhang et al., 2017; Suresh et al., 2017; Bernstein et al., 2018; Tang et al., 2018), but typical results here take as a given that when gradients or examples live in $d$ dimensions, communication will scale as $\Omega(d)$. Our goal is to revisit this tacit assumption and understand when it can be relaxed. We explore the question of sublinear communication:

Suppose a hypothesis class $\mathcal{H}$ has $d$ parameters. When is it possible to achieve optimal excess risk for $\mathcal{H}$ in the distributed setting using $o(d)$ communication?

1.1. Sublinear Communication for Linear Models?

Let us first focus on linear models, which are a special case of the general learning setup (1). We restrict to linear hypotheses of the form $h_w(x) = \langle w, x\rangle$, where $w, x \in \mathbb{R}^d$, and write $\ell(h_w, z) = \phi(\langle w, x\rangle, y)$, where $\phi(\cdot, y)$ is a fixed link function and $z = (x, y)$. We overload notation slightly and write

$$L_{\mathcal{D}}(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\phi(\langle w, x\rangle, y). \tag{2}$$

This formulation captures standard learning tasks such as square loss regression, where $\phi(\langle w,x\rangle, y) = (\langle w,x\rangle - y)^2$; logistic regression, where $\phi(\langle w,x\rangle, y) = \log(1 + e^{-y\langle w,x\rangle})$; and classification with surrogate losses such as the hinge loss, where $\phi(\langle w,x\rangle, y) = \max\{1 - \langle w,x\rangle y,\, 0\}$.

Our results concern the communication complexity of learning for linear models in the $\ell_p/\ell_q$-bounded setup: weights belong to $\mathcal{W}_p = \{w \in \mathbb{R}^d : \|w\|_p \le B_p\}$ and feature vectors belong to $\mathcal{X}_q = \{x \in \mathbb{R}^d : \|x\|_q \le R_q\}$.¹ This setting is a natural starting point to investigate sublinear-communication distributed learning because learning is possible even when $N \ll d$. Consider the case where $p$ and $q$ are dual, i.e. $\frac{1}{p} + \frac{1}{q} = 1$, and where $\phi$ is 1-Lipschitz. Here it is well known (Zhang, 2002; Kakade et al., 2009) that whenever $q \ge 2$, the optimal sample complexity for learning, which is achieved by choosing the learner's weights $\hat{w}$ using empirical risk minimization (ERM), is

$$\mathbb{E}\,L_{\mathcal{D}}(\hat{w}) - \inf_{w\in\mathcal{W}_p} L_{\mathcal{D}}(w) = \Theta\!\left(B_p R_q \sqrt{\frac{C_q}{N}}\right), \tag{3}$$

where $C_q = q-1$ for finite $q$ and $C_\infty = \log d$; namely, for the $\ell_1/\ell_\infty$ case,

$$\mathbb{E}\,L_{\mathcal{D}}(\hat{w}) - \inf_{w\in\mathcal{W}_1} L_{\mathcal{D}}(w) = \Theta\!\left(B_1 R_\infty \sqrt{\frac{\log d}{N}}\right). \tag{4}$$

We see that when $q < \infty$ the excess risk for the dual $\ell_p/\ell_q$ setting is independent of dimension so long as the norm bounds $B_p$ and $R_q$ are held constant, and that even in the $\ell_1/\ell_\infty$ case there is only a mild logarithmic dependence. Hence, we can get nontrivial excess risk even when the number of examples $N$ is arbitrarily small compared to the dimension $d$. This raises the intriguing question: given that we can obtain nontrivial excess risk when $N \ll d$, can we obtain nontrivial excess risk when communication is sublinear in $d$? To be precise, we would like to develop algorithms that achieve (3)/(4) with total bits of communication $\mathrm{poly}(N, m, \log d)$, permitting also $\mathrm{poly}(B_p, R_q)$ dependence. The prospect of such a guarantee is exciting because, in light of the discussion above, this would imply that we can obtain nontrivial excess risk with fewer bits of total communication than are required to naively send a single feature vector.

¹Recall the definition of the $\ell_p$ norm: $\|w\|_p = \big(\sum_{i=1}^d |w_i|^p\big)^{1/p}$.

1.2. Contributions

We provide new communication-efficient distributed learning algorithms and lower bounds for $\ell_p/\ell_q$-bounded linear models and, more broadly, stochastic convex optimization. We make the following observations:

- For $\ell_2/\ell_2$-bounded linear models, sublinear communication is achievable, and is obtained by using a derandomized Johnson-Lindenstrauss transform to compress examples and weights.
- For $\ell_1/\ell_\infty$-bounded linear models, no distributed algorithm can obtain optimal excess risk until communication is linear in dimension.

These observations lead to our main result. We show that by relaxing the $\ell_1/\ell_\infty$-boundedness assumption and instead learning $\ell_1/\ell_q$-bounded models for a constant $q < \infty$, one unlocks a plethora of new algorithmic tools for sublinear distributed learning:

1. We give an algorithm with optimal rates matching (3), with total communication $\mathrm{poly}(N, m^q, \log d)$.
2. We extend the sublinear-communication algorithm to give refined guarantees, including instance-dependent small-loss bounds for smooth losses, fast rates for strongly convex losses, and optimal rates for matrix learning problems.

Our main algorithm is a distributed version of mirror descent that uses randomized sparsification of weight vectors to reduce communication. Beyond learning in linear models, the algorithm enjoys guarantees for the more general distributed stochastic convex optimization model.

To elaborate on the fast rates mentioned above, another important case where learning is possible when $N \ll d$ is the sparse high-dimensional linear model setup, central to compressed sensing and statistics. Here, the standard result is that when $\phi$ is strongly convex and the benchmark class consists of $k$-sparse linear predictors, i.e. $\mathcal{W}_0 = \{w \in \mathbb{R}^d : \|w\|_0 \le k\}$, one can guarantee

$$\mathbb{E}\,L_{\mathcal{D}}(\hat{w}) - \inf_{w\in\mathcal{W}_0} L_{\mathcal{D}}(w) = O\!\left(\frac{k\log(d/k)}{N}\right). \tag{5}$$

With $\ell_\infty$-bounded features, no algorithm can obtain optimal excess risk for this setting until communication is linear in dimension, even under compressed sensing-style assumptions. When features are $\ell_q$-bounded, however, our general machinery gives optimal fast rates matching (5) under Lasso-style assumptions, with communication $\mathrm{poly}(N^q, \log d)$.

The remainder of the paper is organized as follows. In Section 2 we develop basic upper and lower bounds for the $\ell_2/\ell_2$- and $\ell_1/\ell_\infty$-bounded settings. In Section 3 we shift to the $\ell_1/\ell_q$-bounded setting, where we introduce the family of sparsified mirror descent algorithms that leads to our main results, and sketch the analysis.

1.3. Related Work

Much of the work on algorithm design for distributed learning and optimization does not explicitly consider the number of bits used in communication per message, and instead tries to make communication efficient via other means, such as decreasing the communication frequency or making learning robust to network disruptions (Duchi et al., 2012; Zhang et al., 2012). Other work reduces the number of bits of communication, but still requires that this number be linear in the dimension $d$. One particularly successful line of work in this vein is low-precision training, which represents the numbers used for communication and elsewhere within the algorithm using few bits (Alistarh et al., 2017; Zhang et al., 2017; Seide et al., 2014; Bernstein et al., 2018; Tang et al., 2018; Stich et al., 2018; Alistarh et al., 2018). Although low-precision methods have seen great success and adoption in neural network training and inference, they are fundamentally limited to using a number of bits proportional to $d$: once they go down to one bit per number there is no additional benefit from decreasing the precision. Some work in this space uses sparsification to further decrease the communication cost of learning, either on its own or in combination with a low-precision representation for numbers (Alistarh et al., 2017; Wangni et al., 2018; Wang et al., 2018).
While the majority of these works apply low-precision representations and sparsification to gradients, a number of recent works apply sparsification to model parameters (Tang et al., 2018; Stich et al., 2018; Alistarh et al., 2018); we also adopt this approach. The idea of sparsifying weights is not new (Shalev-Shwartz et al., 2010), but our work is the first to provably give communication logarithmic in dimension. To achieve this, our assumptions and analysis are quite a bit different from the results mentioned above, and we crucially use mirror descent, departing from the gradient descent approaches in (Tang et al., 2018; Stich et al., 2018; Alistarh et al., 2018).

Lower bounds on the accuracy of learning procedures with limited memory and communication have been explored in several settings, including mean estimation, sparse regression, learning parities, detecting correlations, and independence testing (Shamir, 2014; Duchi et al., 2014; Garg et al., 2014; Steinhardt & Duchi, 2015; Braverman et al., 2016; Steinhardt et al., 2016; Acharya et al., 2019a;b; Raz, 2018; Han et al., 2018; Sahasranand & Tyagi, 2018; Dagan & Shamir, 2018; Dagan et al., 2019). In particular, the results of Steinhardt & Duchi (2015) and Braverman et al. (2016) imply that optimal algorithms for distributed sparse regression need communication much larger than the sparsity level under various assumptions on the number of machines and the communication protocol.

2. Linear Models: Basic Results

In this section we develop basic upper and lower bounds for communication in $\ell_2/\ell_2$- and $\ell_1/\ell_\infty$-bounded linear models. Our goal is to highlight that the communication complexity of distributed learning and the statistical complexity of centralized learning do not in general coincide, and to motivate the $\ell_1/\ell_q$-boundedness assumption under which we derive communication-efficient algorithms in Section 3.

2.1. Preliminaries

We formulate our results in a distributed communication model following Shamir (2014). Recalling that $n = N/m$, the model is as follows. For machine $i = 1,\ldots,m$:

- Receive $n$ i.i.d. examples $S_i = \{z^i_1,\ldots,z^i_n\}$.
- Compute message $W_i = f_i(S_i; W_1,\ldots,W_{i-1})$, where $W_i$ is at most $b_i$ bits.

Finally, return $\hat{W} = f(W_1,\ldots,W_m)$.

We refer to $\sum_{i=1}^m b_i$ as the total communication, and we refer to any protocol with $b_i \le b$ for all $i$ as a $(b,n,m)$ protocol. As a special case, this model captures a serial distributed learning setting where machines proceed one after another: each machine does some computation on its data $\{z^i_1,\ldots,z^i_n\}$ and the previous messages $W_1,\ldots,W_{i-1}$, then broadcasts its own message $W_i$ to all subsequent machines, and the final model in (1) is computed from $\hat{W}$, either on machine $m$ or on a central server. The model also captures protocols in which each machine independently computes a local estimator and sends it to a central server, which aggregates the local estimators to produce a final estimator (Zhang et al., 2012). All of our upper bounds have the serial structure above, and our lower bounds apply to any $(b,n,m)$ protocol.

2.2. ℓ2/ℓ2-Bounded Models

In the $\ell_2/\ell_2$-bounded setting, we can achieve sample-optimal learning with sublinear communication by using dimensionality reduction. The idea is to project examples into $k = O(N)$ dimensions using the Johnson-Lindenstrauss transform, then perform a naive distributed implementation of any standard learning algorithm in the projected space. Here we implement the approach using stochastic gradient descent; a simplified simulation of the resulting protocol is sketched at the end of this subsection. To project the examples onto the same subspace, the machines need to agree on a JL transformation matrix. To do so with little communication, we consider the derandomized sparse JL transform of Kane & Nelson (2010), which constructs a collection $\mathcal{A}$ of matrices in $\mathbb{R}^{k\times d}$ with $|\mathcal{A}| = \exp(O(\log(k/\delta)\log d))$ for confidence parameter $\delta$. The first machine randomly selects an entry of $\mathcal{A}$ and sends the identity of this matrix using $O(\log(k/\delta)\log d)$ bits to the other $m-1$ machines. The dimension $k$ and parameter $\delta$ are chosen as a function of $N$. Each machine uses the matrix $A$ to project its features down to $k$ dimensions. Letting $\tilde{x}_t = Ax_t$ denote the projected features, the first machine starts with a $k$-dimensional weight vector $u_1 = 0$ and performs the online gradient descent update (Zinkevich, 2003; Cesa-Bianchi & Lugosi, 2006) over its $n$ projected samples as

$$u_{t+1} \leftarrow u_t - \eta\,\nabla_u\phi(\langle u_t, \tilde{x}_t\rangle, y_t),$$

where $\eta > 0$ is the learning rate. Once the first machine has passed over all its samples, it broadcasts the last iterate $u_{n+1}$ as well as the running average $\frac{1}{n}\sum_{s=1}^n u_s$, which takes $O(k)$ communication. The next machine performs the same sequence of gradient updates on its own data using $u_{n+1}$ as the initialization, then passes its final iterate and the updated average to the next machine. This repeats until we arrive at the $m$th machine. The $m$th machine computes the $k$-dimensional average $\hat{u} = \frac{1}{N}\sum_{t=1}^N u_t$ and returns $\hat{w} = A^{\top}\hat{u}$ as the solution.

Theorem 1. When $\phi$ is $L$-Lipschitz and $k = \Theta(N\log(dN))$, the strategy above guarantees that

$$\mathbb{E}_S\,\mathbb{E}_A\,[L_{\mathcal{D}}(\hat{w})] - \inf_{w\in\mathcal{W}_2} L_{\mathcal{D}}(w) \le O\!\left(LB_2R_2\sqrt{\frac{1}{N}}\right),$$

where $\mathbb{E}_S$ denotes expectation over samples and $\mathbb{E}_A$ denotes expectation over the algorithm's randomness. The total communication is $O\big(mN\log(dN)\log(LB_2R_2N) + m\log(dN)\log d\big)$ bits.
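To make the serial protocol concrete, here is a minimal numpy simulation of the approach just described. It is a hypothetical sketch rather than the construction analyzed in Theorem 1: a shared random seed and a dense Gaussian projection stand in for the derandomized sparse JL transform described above, the link function is taken to be the square loss, and `run_protocol` together with the toy data is our own naming.

```python
import numpy as np

def run_protocol(machines, d, k, lr, seed=0):
    """machines: list of (X_i, y_i) with X_i of shape (n, d); serial projected SGD."""
    # All machines regenerate the same projection matrix from the shared seed,
    # so only the seed (not the k x d matrix) needs to be communicated.
    A = np.random.default_rng(seed).normal(size=(k, d)) / np.sqrt(k)
    u = np.zeros(k)          # current k-dimensional iterate
    u_sum = np.zeros(k)      # running sum of iterates, passed between machines
    T = 0
    for X, y in machines:    # machines process their data one after another
        for x_t, y_t in zip(X, y):
            x_proj = A @ x_t                          # project example to k dims
            grad = 2.0 * (u @ x_proj - y_t) * x_proj  # square-loss gradient
            u = u - lr * grad                         # online gradient step
            u_sum += u
            T += 1
        # Only u and u_sum (2k numbers) are sent to the next machine.
    u_bar = u_sum / T
    return A.T @ u_bar       # lift the averaged iterate back to R^d

# Toy usage: m machines, n samples each, drawn from a noisy linear model.
rng = np.random.default_rng(1)
d, k, m, n = 1000, 50, 5, 40
w_star = np.zeros(d); w_star[:10] = 1.0
machines = []
for _ in range(m):
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    y = X @ w_star + 0.01 * rng.normal(size=n)
    machines.append((X, y))
w_hat = run_protocol(machines, d, k, lr=0.5)
```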
2.3. ℓ1/ℓ∞-Bounded Models: Model Compression

While the results for the $\ell_2/\ell_2$-bounded setting are encouraging, they are not useful in the common situation where features are dense. When features are $\ell_\infty$-bounded, Equation (4) shows that one can obtain nearly dimension-independent excess risk so long as one restricts to $\ell_1$-bounded weights. This $\ell_1/\ell_\infty$-bounded setting is particularly important because it captures the fundamental problem of learning from a finite hypothesis class, or aggregation (Tsybakov, 2003): given a class $\mathcal{H}$ of $\{\pm 1\}$-valued predictors with $|\mathcal{H}| < \infty$, we can set $x = (h(z))_{h\in\mathcal{H}} \in \mathbb{R}^{|\mathcal{H}|}$, in which case (4) turns into the familiar finite class bound $\sqrt{\log|\mathcal{H}|/N}$ (Shalev-Shwartz & Ben-David, 2014). Thus, algorithms with communication sublinear in dimension for the $\ell_1/\ell_\infty$ setting would lead to positive results in the general setting (1).

As a first positive result in this direction, we observe that by using the well-known technique of randomized sparsification, or Maurey sparsification, we can compress models to require only logarithmic communication while preserving excess risk.² The method is simple: suppose we have a weight vector $w$ that lies in the simplex $\Delta_d$. We sample $s$ elements of $[d]$ i.i.d. according to $w$ and return the empirical distribution, which we will denote $Q_s(w)$. The empirical distribution is always $s$-sparse and can be communicated using at most $O(s\log(ed/s))$ bits when $s \le d$,³ and it follows from standard concentration tools that by taking $s$ large enough the empirical distribution will approximate the true vector $w$ arbitrarily well. Algorithm 1 states the version of this technique for general (signed) weight vectors; a short numerical sketch of the operator follows.

Algorithm 1 (Maurey Sparsification).
Input: Weight vector $w \in \mathbb{R}^d$. Sparsity level $s$.
Define $p \in \Delta_d$ via $p_i \propto |w_i|$.
For $\tau = 1,\ldots,s$: Sample index $i_\tau \sim p$.
Return $Q_s(w) = \frac{\|w\|_1}{s}\sum_{\tau=1}^{s}\mathrm{sgn}(w_{i_\tau})\,e_{i_\tau}$.
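The following is a minimal numpy sketch of the Maurey sparsification operator $Q_s$ from Algorithm 1. It is illustrative only: the bit-level encoding of the $s$-sparse output is not modeled, and the function name `maurey_sparsify` is ours.

```python
import numpy as np

def maurey_sparsify(w, s, rng=None):
    """Randomized (Maurey) sparsification Q_s(w) of a weight vector w.

    Samples s coordinates with probability proportional to |w_i| and returns
    an s-sparse unbiased estimate of w, following Algorithm 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    l1 = np.abs(w).sum()
    if l1 == 0.0:
        return np.zeros_like(w)
    p = np.abs(w) / l1                      # sampling distribution p_i ∝ |w_i|
    idx = rng.choice(len(w), size=s, p=p)   # i_1, ..., i_s drawn i.i.d. from p
    q = np.zeros_like(w)
    np.add.at(q, idx, np.sign(w[idx]))      # accumulate sgn(w_{i_tau}) e_{i_tau}
    return (l1 / s) * q

# Sanity check: Q_s(w) is unbiased and concentrates around w as s grows.
rng = np.random.default_rng(0)
w = rng.normal(size=10_000) * (rng.random(10_000) < 0.01)   # a mostly-sparse vector
x = rng.choice([-1.0, 1.0], size=w.shape)                   # an ℓ∞-bounded "feature"
for s in (100, 1_000, 10_000):
    q = maurey_sparsify(w, s, rng)
    print(s, abs((q - w) @ x))   # prediction error |<Q_s(w) - w, x>| shrinks with s
```

The final loop illustrates the quantity controlled by the lemma below: for an $\ell_\infty$-bounded feature vector $x$, the prediction error $|\langle Q_s(w) - w, x\rangle|$ shrinks as the sparsity budget $s$ grows.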
The following lemma shows that Maurey sparsification indeed provides a dimension-independent approximation to the excess risk in the $\ell_1/\ell_\infty$-bounded setting; it applies to the version of the technique for general vectors given in Algorithm 1.

Lemma 1. Let $w \in \mathbb{R}^d$ be fixed and suppose features belong to $\mathcal{X}_\infty$. When $\phi$ is $L$-Lipschitz, Algorithm 1 guarantees that

$$\mathbb{E}\,L_{\mathcal{D}}(Q_s(w)) \le L_{\mathcal{D}}(w) + \sqrt{\frac{2L^2R_\infty^2\|w\|_1^2}{s}},$$

where the expectation is with respect to the algorithm's randomness. Furthermore, when $\phi$ is $\beta$-smooth,⁴ Algorithm 1 guarantees

$$\mathbb{E}\,L_{\mathcal{D}}(Q_s(w)) \le L_{\mathcal{D}}(w) + \frac{\beta R_\infty^2\|w\|_1^2}{2s}.$$

The number of bits required to communicate $Q_s(w)$, including sending the scalar $\|w\|_1$ up to numerical precision, is at most $O(s\log(ed/s) + \log(LB_1R_\infty s))$.

Thus, if any single machine is able to find an estimator $\hat{w}$ with good excess risk, it can communicate it to any other machine while preserving the excess risk with sublinear communication. In particular, to preserve the optimal excess risk guarantee in (4) for a Lipschitz loss such as the absolute or hinge loss, the total number of bits of communication required is only $O(N + \log(LB_1R_\infty N))$, which is indeed sublinear in dimension! For smooth losses (square, logistic), this improves further to only $O(\sqrt{N}\log(ed/\sqrt{N}) + \log(LB_1R_\infty N))$ bits.

²We refer to the method as Maurey sparsification in reference to Maurey's early use of the technique in Banach spaces (Pisier, 1980), which predates its long history in learning theory (Jones, 1992; Barron, 1993; Zhang, 2002).
³That $O(s\log(ed/s))$ bits rather than, e.g., $O(s\log d)$ bits suffice is a consequence of the usual stars-and-bars counting argument. We expect one can bring the expected communication down further using an adaptive scheme such as Elias coding, as in Alistarh et al. (2017).
⁴A scalar function is said to be $\beta$-smooth if it has a $\beta$-Lipschitz first derivative.

2.4. ℓ1/ℓ∞-Bounded Models: Impossibility

Alas, we have only shown that if we happen to find a good solution, we can send it using sublinear communication. If we have to start from scratch, is it possible to use Maurey sparsification to coordinate between all machines to find a good solution? Unfortunately, the answer is no: for the $\ell_1/\ell_\infty$-bounded setting, in the extreme case where each machine has a single example, no algorithm can obtain a risk bound matching (4) until the number of bits $b$ allowed per machine is (nearly) linear in $d$.

Theorem 2. Consider the problem of learning with linear loss in the $(b,1,N)$ model, where the risk is $L_{\mathcal{D}}(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[-y\langle w,x\rangle]$. Let the benchmark class be the $\ell_1$ ball $\mathcal{W}_1$, where $B_1 = 1$. For any algorithm $\hat{w}$ there exists a distribution $\mathcal{D}$ with $\|x\|_\infty \le 1$ and $|y| \le 1$ such that

$$\Pr\!\left(L_{\mathcal{D}}(\hat{w}) - \inf_{w\in\mathcal{W}_1} L_{\mathcal{D}}(w) \ge \Omega\!\left(\min\!\left\{\sqrt{\tfrac{d}{bN}},\,1\right\}\right)\right) \ge \frac{1}{16}.$$

The lower bound also extends to the case of multiple examples per machine (i.e., $m > 1$), albeit with a less sharp tradeoff.

Proposition 1. Let $m$, $n$, and $\varepsilon > 0$ be fixed. In the setting of Theorem 2, any algorithm in the $(b,n,m)$ protocol with $b \le O\big(d^{1-\varepsilon}/(\varepsilon^2 N)\big)$ has excess risk at least $\Omega\big(\sqrt{d^{\varepsilon}/N}\big)$ with constant probability.

This lower bound follows almost immediately from a reduction to the hide-and-seek problem of Shamir (2014). The weaker guarantee from Proposition 1 is a consequence of the fact that the lower bound for the hide-and-seek problem from Shamir (2014) is weaker in the multi-machine case. The value of Theorem 2 and Proposition 1 is to rule out the possibility of obtaining optimal excess risk with communication polylogarithmic in $d$ in the $\ell_1/\ell_\infty$ setting, even when there are many examples per machine. This motivates the results of the next section, which show that for $\ell_1/\ell_q$-bounded models it is indeed possible to get polylogarithmic communication for any value of $m$.
One might hope that it is possible to circumvent Theorem 2 by making compressed sensing-type assumptions, e.g. assuming that the vector $w^\star$ is sparse and that a restricted eigenvalue or similar property is satisfied. Unfortunately, this is not the case.⁵

⁵See Appendix B.2 for additional discussion of compressed sensing-based assumptions under which sublinear communication may be possible.

Proposition 2. Consider square loss regression in the $(b,1,N)$ model. For any algorithm $\hat{w}$ there exists a distribution $\mathcal{D}$ with the following properties:

- $\|x\|_\infty \le 1$ and $|y| \le 1$ with probability 1.
- $\Sigma := \mathbb{E}[xx^\top] = I$, so that the population risk is 1-strongly convex, and in particular has restricted strong convexity constant 1.
- $w^\star = \arg\min_{w:\|w\|_1\le 1} L_{\mathcal{D}}(w)$ is 1-sparse.
- $\Pr\!\left(L_{\mathcal{D}}(\hat{w}) - L_{\mathcal{D}}(w^\star) \ge \Omega\!\left(\min\!\left\{\tfrac{d}{bN},\,1\right\}\right)\right) \ge \frac{1}{256}$.

Moreover, any algorithm in the $(b,n,m)$ protocol with $b \le O\big(d^{1-\varepsilon}/(\varepsilon^2 N)\big)$ has excess risk at least $\Omega\big(d^{\varepsilon}/N\big)$ with constant probability.

That $\Omega(d)$ communication is required to obtain optimal excess risk for $m = N$ was proven in Steinhardt & Duchi (2015). The lower bound for general $m$ is important here because it serves as a converse to the algorithmic results we develop for sparse regression in Section 3.⁶

⁶Braverman et al. (2016) also prove a communication lower bound for sparse regression. Their lower bound applies for all values of $m$ and for more sophisticated interactive protocols, but does not rule out the possibility of $\mathrm{poly}(N, m, \log d)$ communication.

3. Sparsified Mirror Descent

We now deliver on the promise outlined in the introduction and give new algorithms with logarithmic communication under an assumption we call $\ell_1/\ell_q$-boundedness. The model for which we derive algorithms in this section is more general than the linear model setup (2) to which our lower bounds apply. We consider problems of the form

$$\min_{w\in\mathcal{W}}\; L_{\mathcal{D}}(w) := \mathbb{E}_{z\sim\mathcal{D}}\,\ell(w,z), \tag{6}$$

where $\ell(\cdot,z)$ is convex, $\mathcal{W} \subseteq \mathcal{W}_1 = \{w \in \mathbb{R}^d : \|w\|_1 \le B_1\}$ is a convex constraint set, and subgradients $\partial\ell(w,z)$ are assumed to belong to $\mathcal{X}_q = \{x \in \mathbb{R}^d : \|x\|_q \le R_q\}$. This setting captures linear models with $\ell_1$-bounded weights and $\ell_q$-bounded features as a special case, but is more general, since the loss can be any Lipschitz function of $w$.

We have already shown that one cannot expect sublinear-communication algorithms for $\ell_1/\ell_\infty$-bounded models, and so the $\ell_q$-boundedness of subgradients in (6) may be thought of as strengthening our assumption on the data-generating process. That this is stronger follows from the elementary fact that $\|x\|_q \ge \|x\|_\infty$ for all $q$.

Statistical complexity and nontriviality. For the dual $\ell_1/\ell_\infty$ setup in (2) the optimal rate is $\Theta(\sqrt{\log d/N})$. While our goal is to find minimal assumptions that allow for distributed learning with sublinear communication, the reader may wonder at this point whether we have made the problem easier statistically by moving to the $\ell_1/\ell_q$ assumption. The answer is yes, but only slightly. When $q$ is constant the optimal rate for $\ell_1/\ell_q$-bounded models is $\Theta\big(B_1R_q\sqrt{(q-1)/N}\big)$,⁷ and so the effect of this assumption is to shave off the $\log d$ factor that was present in (4).

⁷The upper bound follows from (3) and the lower bound follows by reduction to the one-dimensional case.

3.1. Lipschitz Losses

Our main algorithm is called sparsified mirror descent (Algorithm 2). The idea is to run the online mirror descent algorithm (Ben-Tal & Nemirovski, 2001; Hazan, 2016) in serial across the machines and sparsify the iterates whenever we move from one machine to the next. In a bit more detail, Algorithm 2 proceeds from machine to machine sequentially.
On each machine, the algorithm generates a sequence of iterates $w^i_1,\ldots,w^i_{n+1}$ by doing a single pass over the machine's $n$ examples $z^i_1,\ldots,z^i_n$ using the mirror descent update with regularizer $R(w) = \frac{1}{2}\|w\|_p^2$, where $\frac{1}{p} + \frac{1}{q} = 1$, and using stochastic gradients $\nabla^i_t \in \partial\ell(w^i_t, z^i_t)$. After the last example is processed on machine $i$, we compress the last iterate using Maurey sparsification (Algorithm 1) and send it to the next machine, where the process is repeated.

To formally describe the algorithm, we recall the definition of the Bregman divergence. Given a convex regularization function $R : \mathbb{R}^d \to \mathbb{R}$, the Bregman divergence for $R$ is defined as

$$D_R(w \,\|\, w') = R(w) - R(w') - \langle\nabla R(w'),\, w - w'\rangle.$$

For the $\ell_1/\ell_q$ setting we exclusively use the regularizer $R(w) = \frac{1}{2}\|w\|_p^2$.

Algorithm 2 (Sparsified Mirror Descent).
Input: Constraint set $\mathcal{W}$ with $\|w\|_1 \le B_1$. Gradient norm parameter $q \in [2,\infty)$. Gradient $\ell_q$ norm bound $R_q$. Learning rate $\eta$. Initial point $w_1$. Sparsity levels $s, s_0 \in \mathbb{N}$.
Define $p = \frac{q}{q-1}$ and $R(w) = \frac{1}{2}\|w\|_p^2$.
For machine $i = 1,\ldots,m$:
  Receive $\tilde{w}^{i-1}$ from machine $i-1$ and set $w^i_1 = \tilde{w}^{i-1}$ (if machine 1, set $w^1_1 = w_1$).
  For $t = 1,\ldots,n$: // Mirror descent step.
    Get gradient $\nabla^i_t \in \partial\ell(w^i_t, z^i_t)$.
    $w^i_{t+1} \leftarrow \arg\min_{w\in\mathcal{W}}\; \eta\langle\nabla^i_t, w\rangle + D_R(w \,\|\, w^i_t)$.
  Let $\tilde{w}^i = Q_s(w^i_{n+1})$. // Sparsification.
  Send $\tilde{w}^i$ to machine $i+1$.
Sample $i \in [m]$, $t \in [n]$ uniformly at random and return $\hat{w} = Q_{s_0}(w^i_t)$.

The main guarantee for Algorithm 2 is as follows.

Theorem 3. Let $q \ge 2$ be fixed. Suppose that subgradients belong to $\mathcal{X}_q$ and that $\mathcal{W} \subseteq \mathcal{W}_1$. If we run Algorithm 2 with $\eta = \frac{B_1}{R_q}\sqrt{\frac{1}{C_qN}}$ and with initial point $w_1 = 0$, then whenever $s = \Omega(m^{2(q-1)})$ and $s_0 = \Omega(N^{q/2})$ the algorithm guarantees

$$\mathbb{E}[L_{\mathcal{D}}(\hat{w})] - \inf_{w\in\mathcal{W}} L_{\mathcal{D}}(w) \le O\!\left(B_1R_q\sqrt{\frac{C_q}{N}}\right),$$

where $C_q = q-1$ is a constant depending only on $q$. The total number of bits sent by each machine, besides communicating the final iterate $\hat{w}$, is at most $O\big(m^{2(q-1)}\log(d/m) + \log(B_1R_qN)\big)$, and so the total number of bits communicated globally is at most $O\big(N^{q/2}\log(d/N) + m^{2q-1}\log(d/m) + m\log(B_1R_qN)\big)$. In the linear model setting (2) with 1-Lipschitz loss $\phi$ it suffices to set $s_0 = \Omega(N)$, so the total number of bits of communication is $O\big(N\log(d/N) + m^{2q-1}\log(d/m) + m\log(B_1R_qN)\big)$.

We see that the communication required by sparsified mirror descent is exponential in the norm parameter $q$. This means that whenever $q$ is constant, the overall communication is polylogarithmic in dimension. When $q = \log d$, $\|x\|_q$ coincides with $\|x\|_\infty$ up to a multiplicative constant; in this case the communication from Theorem 3 becomes polynomial in dimension, which we know from Section 2.4 is necessary. The guarantee of Algorithm 2 extends beyond the statistical learning model to the first-order stochastic convex optimization model, as well as the online convex optimization model. A compact numerical sketch of the scheme is given after this paragraph.
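The sketch below is a hypothetical numpy rendering of sparsified mirror descent for a linear model with the absolute loss. It simplifies Algorithm 2 rather than reproducing the paper's implementation: the Bregman projection onto the constraint set $\mathcal{W}$ is omitted (the mirror step is unconstrained), the step size is passed in directly rather than set as in Theorem 3, and all function names are ours.

```python
import numpy as np

def sparsify(w, s, rng):
    """Maurey sparsification Q_s(w) (Algorithm 1), as in the earlier sketch."""
    l1 = np.abs(w).sum()
    if l1 == 0.0:
        return np.zeros_like(w)
    idx = rng.choice(len(w), size=s, p=np.abs(w) / l1)
    q = np.zeros_like(w)
    np.add.at(q, idx, np.sign(w[idx]))
    return (l1 / s) * q

def mirror_map(w, p):
    """Mirror map: nabla R(w) for R(w) = 0.5 * ||w||_p^2."""
    nrm = np.linalg.norm(w, ord=p)
    if nrm == 0.0:
        return np.zeros_like(w)
    return nrm ** (2 - p) * np.sign(w) * np.abs(w) ** (p - 1)

def grad_map(theta, q):
    """Inverse mirror map: nabla R*(theta) with R*(theta) = 0.5 * ||theta||_q^2."""
    nrm = np.linalg.norm(theta, ord=q)
    if nrm == 0.0:
        return np.zeros_like(theta)
    return nrm ** (2 - q) * np.sign(theta) * np.abs(theta) ** (q - 1)

def sparsified_mirror_descent(machines, q, lr, s, s0, rng):
    """Serial pass over machines, sparsifying the iterate between machines."""
    p = q / (q - 1.0)
    w = np.zeros(machines[0][0].shape[1])   # current primal iterate
    iterates = []
    for X, y in machines:                   # machines run one after another
        for x_t, y_t in zip(X, y):
            g = np.sign(x_t @ w - y_t) * x_t        # subgradient of |<w,x> - y|
            theta = mirror_map(w, p) - lr * g       # gradient step in the dual space
            w = grad_map(theta, q)                  # map back to the primal space
            iterates.append(w)
        w = sparsify(w, s, rng)   # compress before "sending" to the next machine
    # Return a sparsified, uniformly sampled iterate, as in Algorithm 2.
    return sparsify(iterates[rng.integers(len(iterates))], s0, rng)
```

For $q = 2$ (so $p = 2$) both maps reduce to the identity and the update is plain SGD; for larger $q$, the $\ell_p$ geometry of the regularizer is what keeps the error introduced by sparsification between machines under control, as the proof sketch below explains.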
Proof sketch. The basic premise behind the algorithm and its analysis is that by using the same learning rate across all machines, we can pretend as though we are running a single instance of mirror descent on a centralized machine. The key difference from the usual analysis is that we need to bound the error incurred by sparsification between successive machines. Here, the choice of the regularizer is crucial.

A fundamental property used in the analysis of mirror descent is strong convexity of the regularizer. To give convergence rates that do not depend on dimension (such as (3)) it is essential that the regularizer be $\Omega(1)$-strongly convex. Our regularizer $R$ indeed has this property.

Proposition 3 (Ball et al. (1994)). For $p \in (1,2]$, $R$ is $(p-1)$-strongly convex with respect to $\|\cdot\|_p$. Equivalently,

$$D_R(w \,\|\, w') \ge \frac{p-1}{2}\,\|w - w'\|_p^2.$$

On the other hand, to argue that sparsification has negligible impact on convergence, our analysis leverages smoothness of the regularizer. Strong convexity and smoothness are at odds with each other: it is well known that in infinite dimension, any space whose norm is both strongly convex and smooth is isomorphic to a Hilbert space (Pisier, 2011). What makes our analysis work is that while the regularizer $R$ is not smooth, it is Hölder-smooth for any finite $q$. This is sufficient to bound the approximation error from sparsification. To argue that the excess risk achieved by mirror descent with the $\ell_p$ regularizer $R$ is optimal, however, it is essential that the gradients are $\ell_q$-bounded rather than $\ell_\infty$-bounded. In more detail, the proof has three components:

Telescoping. Mirror descent gives a regret bound that telescopes across all $m$ machines up to the error introduced by sparsification. To match the optimal centralized regret, we only need to bound $m$ error terms of the form $D_R(w^\star \,\|\, Q_s(w^i_{n+1})) - D_R(w^\star \,\|\, w^i_{n+1})$.

Hölder-smoothness. We prove (Theorem 7) that the difference above is of order $B_1\|Q_s(w^i_{n+1}) - w^i_{n+1}\|_p + B_1^{3-p}\|Q_s(w^i_{n+1}) - w^i_{n+1}\|_p^{p-1}$.

Maurey for $\ell_p$ norms. We prove (Theorem 6) that $\mathbb{E}\|Q_s(w) - w\|_p \le O\big(\|w\|_1/s^{1-1/p}\big)$, and likewise that $\mathbb{E}\|Q_s(w) - w\|_p^{p-1} \le O\big(\big(\|w\|_1/s^{1-1/p}\big)^{p-1}\big)$.

With a bit more work these inequalities yield Theorem 3. We close this section with a few more notes about Algorithm 2 and its performance.

Remark 1. For the special case of $\ell_1/\ell_q$-bounded linear models with Lipschitz link function, it is straightforward to show that the following strategy also leads to sublinear communication: truncate each feature vector to the top $O(N^{q/2})$ coordinates, then send all the truncated examples to a central server, which returns the empirical risk minimizer. This strategy matches the risk of Theorem 3 with total communication $O(N^{q/2+1})$, but has two deficiencies. First, the total communication is larger than the $\tilde{O}(N + m^{2q-1})$ bound achieved by Theorem 3, unless $m$ is very large. Second, it does not extend to the general optimization setting.

3.2. Smooth Losses

We can improve the statistical guarantee and total communication further for the case where $L_{\mathcal{D}}$ is smooth with respect to $\ell_q$ rather than just Lipschitz. We assume that $\ell$ has $\beta_q$-Lipschitz gradients, in the sense that for all $w, w' \in \mathcal{W}_1$ and all $z$,

$$\|\nabla\ell(w,z) - \nabla\ell(w',z)\|_q \le \beta_q\,\|w - w'\|_p,$$

where $p$ is such that $\frac{1}{p} + \frac{1}{q} = 1$.

Theorem 4. Suppose, in addition to the assumptions of Theorem 3, that $\ell(\cdot,z)$ is non-negative and has $\beta_q$-Lipschitz gradients with respect to $\ell_q$. Let $L^\star = \inf_{w\in\mathcal{W}} L_{\mathcal{D}}(w)$. If we run Algorithm 2 with learning rate $\eta = \min\!\big\{B_1\sqrt{\tfrac{1}{C_q\beta_qL^\star N}},\ \tfrac{1}{4C_q\beta_q}\big\}$ and $w_1 = 0$ then, if $s = \Omega(m^{2(q-1)})$ and $s_0$ is chosen appropriately (its value, and hence the communication, depends on $L^\star$), the algorithm guarantees

$$\mathbb{E}[L_{\mathcal{D}}(\hat{w})] - L^\star \le O\!\left(\sqrt{\frac{C_q\beta_q B_1^2 L^\star}{N}} + \frac{C_q\beta_q B_1^2}{N}\right).$$

The total number of bits sent by each machine, besides communicating the final iterate $\hat{w}$, is at most $O(m^{2(q-1)}\log(d/m))$, and so the total number of bits communicated globally is at most $O\big(m\log(\beta_qB_1N) + s_0\log(d/N) + m^{2q-1}\log(d/m)\big)$.

Compared to the previous theorem, this result provides a so-called small-loss bound (Srebro et al., 2010), with the main term scaling with the optimal loss $L^\star$. The dependence on $N$ in the communication cost can be as low as $O(\sqrt{N})$ depending on the value of $L^\star$.

3.3. Fast Rates under Restricted Strong Convexity

So far, all of the algorithmic results we have presented scale as $O(N^{-1/2})$. While this is optimal for generic Lipschitz losses, we mentioned in Section 2 that for strongly convex losses the rate can be improved in a nearly dimension-independent fashion to $O(N^{-1})$ for sparse high-dimensional linear models.
As in the generic Lipschitz loss setting, we show that making the assumption of $\ell_1/\ell_q$-boundedness is sufficient to get statistically optimal distributed algorithms with sublinear communication, thus providing a way around the lower bounds for fast rates in Section 2.4. The key assumption for the results in this section is that the population risk satisfies a form of restricted strong convexity.

Assumption 1. There is a constant $\gamma_q > 0$ such that for all $w \in \mathcal{W}$,

$$L_{\mathcal{D}}(w) - L_{\mathcal{D}}(w^\star) - \langle\nabla L_{\mathcal{D}}(w^\star),\, w - w^\star\rangle \ge \gamma_q\,\|w - w^\star\|_1^2.$$

In a moment we will show how to relate this property to the standard restricted eigenvalue property in high-dimensional statistics (Negahban et al., 2012) and apply it to sparse regression.

Our main algorithm for strongly convex losses is Algorithm 3, which is stated in Appendix C due to space constraints. The algorithm does not introduce any new tricks for distributed learning over Algorithm 2; rather, it invokes Algorithm 2 repeatedly in an inner loop, relying on these invocations to take care of communication. This reduction is based on techniques developed in (Juditsky & Nesterov, 2014), whereby restricted strong convexity is used to establish that the error decreases geometrically as a function of the number of invocations of the sub-algorithm. We refer the reader to Appendix C for additional details.

Theorem 5. Suppose Assumption 1 holds, that subgradients belong to $\mathcal{X}_q$ for $q \ge 2$, and that $\mathcal{W} \subseteq \mathcal{W}_1$. When the parameter $c > 0$ in Algorithm 3 is a sufficiently large absolute constant, Algorithm 3 guarantees that

$$\mathbb{E}[L_{\mathcal{D}}(\bar{w}_T)] - L_{\mathcal{D}}(w^\star) \le O\!\left(\frac{C_qR_q^2}{\gamma_q N}\right).$$

The total number of bits communicated is $O\big(N^{2(q-1)}m^{2q-1}\big)$, up to scale parameters and logarithmic factors, plus $O(m\log(B_1R_qN))$. Treating scale parameters as constants, the total communication simplifies to $O\big(N^{2q-2}m^{2q-1}\log d\big)$.

Application: Sparse Regression. As an application of Algorithm 3, we consider the sparse regression setting (5), where $L_{\mathcal{D}}(w) = \mathbb{E}_{x,y}(\langle w,x\rangle - y)^2$. We assume $\|x\|_q \le R_q$ and $|y| \le 1$. We let $w^\star = \arg\min_{w\in\mathcal{W}_1} L_{\mathcal{D}}(w)$, so $\|w^\star\|_1 \le B_1$. We assume $w^\star$ is $k$-sparse with support set $S \subseteq [d]$. We invoke Algorithm 3 with constraint set $\mathcal{W}' = \{w \in \mathbb{R}^d : \|w\|_1 \le \|w^\star\|_1\}$ and let $\Sigma = \mathbb{E}[xx^\top]$. Our bound depends on the restricted eigenvalue parameter

$$\gamma = \inf_{w\in(\mathcal{W}'-w^\star)\setminus\{0\}} \frac{\langle w, \Sigma w\rangle}{\|w\|_1^2}.$$

Proposition 4. Algorithm 3, with constraint set $\mathcal{W}'$ and an appropriate choice of parameters, guarantees

$$\mathbb{E}[L_{\mathcal{D}}(\bar{w}_T)] - L_{\mathcal{D}}(w^\star) \le O\!\left(\frac{C_qR_q^2}{\gamma N}\right).$$

Suppressing problem-dependent constants, the total communication is of order $O\big((N^{2q-2}m^{2q-1}\log d)\cdot k^{4q-4}\big)$.

3.4. Extension: Matrix Learning and Beyond

The basic idea behind sparsified mirror descent, namely that by assuming $\ell_q$-boundedness one can get away with using a Hölder-smooth regularizer that behaves well under sparsification, is not limited to the $\ell_1/\ell_q$ setting. To extend the algorithm to more general geometry, all that is required is the following:

- The constraint set $\mathcal{W}$ can be written as the convex hull of a set of atoms $\mathcal{A}$ that has sublinear bit complexity.
- The data should be bounded in a norm whose dual admits a regularizer $R$ that is strongly convex and Hölder-smooth, and that is well behaved under sparsification. We remark in passing that this property and the previous one are closely related to the notions of type and cotype in Banach spaces (Pisier, 2011).

Here we deliver on this potential and sketch how to extend the results so far to matrix learning problems where $\mathcal{W} \subseteq \mathbb{R}^{d\times d}$ is a convex set of matrices. As in Section 3.1 we work with a generic Lipschitz loss $L_{\mathcal{D}}(W) = \mathbb{E}_z\,\ell(W,z)$. Letting $\|W\|_{S_p} = \big(\mathrm{tr}\big((WW^\top)^{p/2}\big)\big)^{1/p}$ denote the Schatten $p$-norm, we make the following spectral analogue of the $\ell_1/\ell_q$-boundedness assumption: $W \in \mathcal{W}_{S_1} = \{W \in \mathbb{R}^{d\times d} : \|W\|_{S_1} \le B_1\}$ and subgradients $\partial\ell(\cdot,z)$ belong to $\mathcal{X}_{S_q} = \{X \in \mathbb{R}^{d\times d} : \|X\|_{S_q} \le R_q\}$, where $q \ge 2$. Recall that $S_1$ and $S_\infty$ are the nuclear norm and spectral norm, respectively. The $S_1/S_\infty$ setup has many applications in learning (Hazan et al., 2012).

We make the following key changes to Algorithm 2:

- Use the Schatten regularizer $R(W) = \frac{1}{2}\|W\|_{S_p}^2$.
- Use the following spectral version of the Maurey operator $Q_s(W)$: let $W$ have singular value decomposition $W = \sum_{i=1}^d \sigma_i u_i v_i^\top$ with $\sigma_i \ge 0$,⁸ and define $P \in \Delta_d$ via $P_i \propto \sigma_i$. Sample $i_1,\ldots,i_s$ i.i.d. from $P$ and return $Q_s(W) = \frac{\|W\|_{S_1}}{s}\sum_{\tau=1}^{s} u_{i_\tau}v_{i_\tau}^\top$.
- Encode and transmit $Q_s(W)$ as the sequence $(u_{i_1},v_{i_1}),\ldots,(u_{i_s},v_{i_s})$, plus the scalar $\|W\|_{S_1}$. This takes $O(sd)$ bits.

⁸We may assume $\sigma_i \ge 0$ without loss of generality.

Proposition 5. Let $q \ge 2$ be fixed, and suppose that subgradients belong to $\mathcal{X}_{S_q}$ and that $\mathcal{W} \subseteq \mathcal{W}_{S_1}$. If we run the variant of Algorithm 2 described above with learning rate $\eta = \frac{B_1}{R_q}\sqrt{\frac{1}{C_qN}}$ and initial point $W_1 = 0$, then whenever $s = \Omega(m^{2(q-1)})$ and $s_0 = \Omega(N^{q/2})$, the algorithm guarantees

$$\mathbb{E}\,L_{\mathcal{D}}(\hat{W}) - \inf_{W\in\mathcal{W}} L_{\mathcal{D}}(W) \le O\!\left(B_1R_q\sqrt{\frac{C_q}{N}}\right),$$

where $C_q = q-1$. The total number of bits communicated globally is at most $O\big(m^{2q-1}d + N^{q/2}d\big)$.

In the matrix setting, the number of bits required to naively send weights $W \in \mathbb{R}^{d\times d}$ or subgradients $\partial\ell(W,z) \in \mathbb{R}^{d\times d}$ is $O(d^2)$. The communication required by our algorithm scales only as $O(d)$, so it is indeed sublinear. The proof of Proposition 5 is sketched in Appendix C; a short numerical sketch of the spectral sparsification operator is given below.
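Here is a hypothetical numpy sketch of the spectral Maurey operator used in the matrix extension. It computes an exact SVD for clarity and does not model the bit-level encoding of the sampled singular-vector pairs; the function name is ours.

```python
import numpy as np

def spectral_maurey(W, s, rng=None):
    """Return a rank-at-most-s randomized approximation Q_s(W) of a matrix W.

    Samples s rank-one terms u_i v_i^T with probability proportional to the
    singular values sigma_i, mirroring Algorithm 1 applied to the spectrum.
    """
    rng = np.random.default_rng() if rng is None else rng
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    nuclear = sigma.sum()                       # ||W||_{S_1}
    if nuclear == 0.0:
        return np.zeros_like(W)
    P = sigma / nuclear                         # P_i ∝ sigma_i
    idx = rng.choice(len(sigma), size=s, p=P)   # i_1, ..., i_s drawn i.i.d. from P
    # (||W||_{S_1} / s) * sum_tau u_{i_tau} v_{i_tau}^T
    return (nuclear / s) * (U[:, idx] @ Vt[idx, :])

# Sanity check: Q_s(W) approximates W in spectral norm as s grows.
rng = np.random.default_rng(0)
d, r = 200, 5
W = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / d   # a low-rank matrix
for s in (5, 50, 500):
    Q = spectral_maurey(W, s, rng)
    print(s, np.linalg.norm(Q - W, ord=2))
```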
4. Discussion

We hope our work will lead to further development of algorithms with sublinear communication. A few immediate questions:

- Can we get matching upper and lower bounds for communication in terms of $m$, $N$, $\log d$, and $q$?
- Currently all of our algorithms work serially. Can we extend the techniques to give parallel speedup?
- Returning to the general setting (1), what abstract properties of the hypothesis class $\mathcal{H}$ are required to guarantee that learning with sublinear communication is possible?

Acknowledgements

Part of this work was completed while DF was a student at Cornell University and supported by the Facebook PhD fellowship.

References

Acharya, J., Canonne, C. L., and Tyagi, H. Communication constrained inference and the role of shared randomness. In International Conference on Machine Learning, 2019a.

Acharya, J., Canonne, C. L., and Tyagi, H. Inference under information constraints I: Lower bounds from chi-square contraction. In Conference on Learning Theory, 2019b.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709-1720, 2017.

Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., and Renggli, C. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pp. 5977-5987, 2018.

Ball, K., Carlen, E. A., and Lieb, E. H. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115(1):463-482, 1994.

Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.

Bekkerman, R., Bilenko, M., and Langford, J. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.

Ben-Tal, A. and Nemirovski, A. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. SIAM, 2001.
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 559-568, 2018.

Bijral, A. S., Sarwate, A. D., and Srebro, N. On data dependence in distributed stochastic optimization. arXiv preprint arXiv:1603.04379, 2016.

Braverman, M., Garg, A., Ma, T., Nguyen, H. L., and Woodruff, D. P. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 1011-1020. ACM, 2016.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Chaturapruek, S., Duchi, J. C., and Ré, C. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Advances in Neural Information Processing Systems, pp. 1531-1539, 2015.

Dagan, Y. and Shamir, O. Detecting correlations with little memory and communication. In Conference on Learning Theory, pp. 1145-1198, 2018.

Dagan, Y., Kur, G., and Shamir, O. Space lower bounds for linear prediction. In Conference on Learning Theory, 2019.

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165-202, 2012.

Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2012.

Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Zhang, Y. Optimality guarantees for distributed statistical estimation. arXiv preprint arXiv:1405.0782, 2014.

Garg, A., Ma, T., and Nguyen, H. On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems, pp. 2726-2734, 2014.

Han, Y., Özgür, A., and Weissman, T. Geometric lower bounds for distributed parameter estimation under communication constraints. In Conference on Learning Theory, pp. 3163-3188, 2018.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

Hazan, E., Kale, S., and Shalev-Shwartz, S. Near-optimal algorithms for online matrix prediction. In Conference on Learning Theory, 2012.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Jones, L. K. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20(1):608-613, 1992.

Juditsky, A. and Nesterov, Y. Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems, 4(1):44-80, 2014.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793-800, 2009.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865-1890, 2012.

Kane, D. M. and Nelson, J. A derandomized sparse Johnson-Lindenstrauss transform. arXiv preprint arXiv:1006.3585, 2010.
Loh, P.-L. and Wainwright, M. J. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455-2482, 2017.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557, 2012.

Pisier, G. Remarques sur un résultat non publié de B. Maurey. Séminaire d'Analyse Fonctionnelle, pp. 1-12, 1980.

Pisier, G. Martingales in Banach spaces (in connection with type and cotype). Course IHP, February 2-8, 2011.

Raz, R. Fast learning requires good memory: A time-space lower bound for parity learning. Journal of the ACM (JACM), 66(1):3, 2018.

Recht, B., Ré, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693-701, 2011.

Sahasranand, K. and Tyagi, H. Extra samples can reduce the communication for independence testing. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 2316-2320. IEEE, 2018.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shalev-Shwartz, S., Srebro, N., and Zhang, T. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807-2832, 2010.

Shamir, O. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems, pp. 163-171, 2014.

Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pp. 2199-2207, 2010.

Steinhardt, J. and Duchi, J. Minimax rates for memory-bounded sparse linear regression. In Conference on Learning Theory, pp. 1564-1587, 2015.

Steinhardt, J., Valiant, G., and Wager, S. Memory, communication, and statistical queries. In Conference on Learning Theory, pp. 1490-1516, 2016.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4452-4463, 2018.

Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning, pp. 3329-3337. JMLR.org, 2017.

Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pp. 7663-7673, 2018.

Tsybakov, A. B. Optimal rates of aggregation. In Learning Theory and Kernel Machines, pp. 303-313. Springer, 2003.

Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pp. 9872-9883, 2018.

Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1306-1316, 2018.

Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., and Zhang, C. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pp. 4035-4043, 2017.
Zhang, T. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527-550, 2002.

Zhang, Y., Wainwright, M. J., and Duchi, J. C. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pp. 1502-1510, 2012.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pp. 928-936, 2003.