# Distributed Mean Estimation with Limited Communication

Ananda Theertha Suresh¹, Felix X. Yu¹, Sanjiv Kumar¹, H. Brendan McMahan²

¹Google Research, New York, NY, USA. ²Google Research, Seattle, WA, USA. Correspondence to: Ananda Theertha Suresh. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Motivated by the need for distributed learning and optimization algorithms with low communication cost, we study communication efficient algorithms for distributed mean estimation. Unlike previous works, we make no probabilistic assumptions on the data. We first show that for $d$-dimensional data with $n$ clients, a naive stochastic rounding approach yields a mean squared error (MSE) of $\Theta(d/n)$ and uses a constant number of bits per dimension per client. We then extend this naive algorithm in two ways: we show that applying a structured random rotation before quantization reduces the error to $O((\log d)/n)$, and a better coding strategy further reduces the error to $O(1/n)$. We also show that the latter coding strategy is optimal up to a constant in the minimax sense, i.e., it achieves the best MSE for a given communication cost. We finally demonstrate the practicality of our algorithms by applying them to distributed Lloyd's algorithm for k-means and power iteration for PCA.

## 1. Introduction

### 1.1. Background

Given $n$ vectors $X^n \stackrel{\text{def}}{=} X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ that reside on $n$ clients, the goal of distributed mean estimation is to estimate the mean of the vectors:

$$\bar{X} \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{1}$$

This basic estimation problem is used as a subroutine in several learning and optimization tasks where data is distributed across several clients. For example, in Lloyd's algorithm (Lloyd, 1982) for k-means clustering, if data is distributed across several clients, the server needs to compute the means of all clusters in each update step. Similarly, for PCA, if data samples are distributed across several clients, then for the power-iteration method, the server needs to average the output of all clients in each step.

Recently, algorithms involving distributed mean estimation have been used extensively in training large-scale neural networks and other statistical models (McDonald et al., 2010; Povey et al., 2014; Dean et al., 2012; McMahan et al., 2016; Alistarh et al., 2016). In a typical scenario of synchronized distributed learning, each client obtains a copy of a global model. The clients then update the model independently based on their local data. The updates (usually in the form of gradients) are then sent to a server, where they are averaged and used to update the global model. A critical step in all of the above algorithms is to estimate the mean of a set of vectors as in Eq. (1).

One of the main bottlenecks in distributed algorithms is the communication cost. This has spurred a line of work focusing on communication cost in learning (Tsitsiklis & Luo, 1987; Balcan et al., 2012; Zhang et al., 2013; Arjevani & Shamir, 2015; Chen et al., 2016). The communication cost can be prohibitive for modern applications, where each client can be a low-power and low-bandwidth device such as a mobile phone (Konečný et al., 2016). Given such a wide set of applications, we study the basic problem of achieving the optimal minimax rate in distributed mean estimation with limited communication.
We note that our model and results differ from previous works on mean estimation (Zhang et al., 2013; Garg et al., 2014; Braverman et al., 2016) in two ways: previous works assume that the data is generated i.i.d. according to some distribution, whereas we do not make any distributional assumptions on the data. Secondly, the objective in prior works is to estimate the mean of the underlying statistical model, while our goal is to estimate the empirical mean of the data.

Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server and they can transmit at the same time. In any independent communication protocol, each client transmits a function of $X_i$ (say $f(X_i)$), and a central server estimates the mean by some function of $f(X_1), f(X_2), \ldots, f(X_n)$. Let $\pi$ be any such protocol and let $\mathcal{C}_i(\pi, X_i)$ be the expected number of transmitted bits by the $i$-th client during protocol $\pi$, where throughout the paper, expectation is over the randomness in protocol $\pi$. The total number of bits transmitted by all clients with the protocol $\pi$ is

$$\mathcal{C}(\pi, X^n) \stackrel{\text{def}}{=} \sum_{i=1}^{n} \mathcal{C}_i(\pi, X_i).$$

Let the estimated mean be $\hat{\bar{X}}$. For a protocol $\pi$, the MSE of the estimate is

$$\mathcal{E}(\pi, X^n) = \mathbb{E}\left[\big\|\hat{\bar{X}} - \bar{X}\big\|_2^2\right].$$

We allow the use of both private and public randomness. Private randomness refers to random values that are generated by each machine separately, and public randomness refers to a sequence of random values that are shared among all parties. (In the absence of public randomness, the server can communicate a random seed that can be used by clients to emulate public randomness.)

The proposed algorithms work for any $X^n$. To measure the minimax performance, without loss of generality, we restrict ourselves to the scenario where each $X_i \in S^d$, the ball of radius 1 in $\mathbb{R}^d$, i.e., $X \in S^d$ iff $\|X\|_2 \le 1$, where $\|X\|_2$ denotes the $\ell_2$ norm of the vector $X$. For a protocol $\pi$, the worst case error over all $X^n \in S^d$ is

$$\mathcal{E}(\pi, S^d) \stackrel{\text{def}}{=} \max_{X^n : X_i \in S^d\ \forall i} \mathcal{E}(\pi, X^n).$$

Let $\Pi(c)$ denote the set of all protocols with communication cost at most $c$. The minimax MSE is

$$\mathcal{E}(\Pi(c), S^d) \stackrel{\text{def}}{=} \min_{\pi \in \Pi(c)} \mathcal{E}(\pi, S^d).$$

### 1.3. Results and discussion

#### 1.3.1. Algorithms

We first analyze the MSE $\mathcal{E}(\pi, X^n)$ for three algorithms, when $\mathcal{C}(\pi, X^n) = \Theta(nd)$, i.e., each client sends a constant number of bits per dimension.

**Stochastic uniform quantization.** In Section 2.1, as a warm-up, we first show that a naive stochastic binary quantization algorithm (denoted by $\pi_{sb}$) achieves an MSE of

$$\mathcal{E}(\pi_{sb}, X^n) = \Theta\!\left(\frac{d}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2\right)$$

and $\mathcal{C}(\pi_{sb}, X^n) = n\,(d + O(1))$ bits (we use $O(1)$ to denote $O(\log(dn))$), i.e., each client sends one bit per dimension. We further show that this bound is tight. In many practical scenarios, $d$ is much larger than $n$ and the above error is prohibitive (Konečný et al., 2016).

A natural way to decrease the error is to increase the number of levels of quantization. If we use $k$ levels of quantization, in Theorem 2, we show that the error decreases as

$$\mathcal{E}(\pi_{sk}, X^n) = O\!\left(\frac{d}{n(k-1)^2}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2\right). \tag{2}$$

However, the communication cost would increase to $\mathcal{C}(\pi_{sk}, X^n) = n\,(d\lceil\log_2 k\rceil + O(1))$ bits, which can be expensive if we would like the MSE to be $o(d/n)$. In order to reduce the communication cost, we propose two approaches.

**Stochastic rotated quantization.** We show that preprocessing the data by a random rotation reduces the mean squared error. Specifically, in Theorem 3, we show that this new scheme (denoted by $\pi_{srk}$) achieves an MSE of

$$\mathcal{E}(\pi_{srk}, X^n) = O\!\left(\frac{\log d}{n(k-1)^2}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2\right)$$

and has a communication cost of $\mathcal{C}(\pi_{srk}, X^n) = n\,(d\lceil\log_2 k\rceil + O(1))$. Note that the new scheme achieves much smaller MSE than naive stochastic quantization for the same communication cost.
**Variable length coding.** Our second approach uses the same quantization as $\pi_{sk}$ but encodes levels via variable length coding. Instead of using $\lceil\log_2 k\rceil$ bits per dimension, we show that using variable length encoding such as arithmetic coding to compress the data reduces the communication cost significantly. In particular, in Theorem 4 we show that there is a scheme (denoted by $\pi_{svk}$) such that

$$\mathcal{C}(\pi_{svk}, X^n) = O\!\left(nd\left(1 + \log\!\left(\frac{k^2}{d} + 1\right)\right)\right) + O(n), \tag{3}$$

and $\mathcal{E}(\pi_{svk}, X^n) = \mathcal{E}(\pi_{sk}, X^n)$. (All logarithms are to base $e$, unless stated otherwise.) Hence, setting $k = \sqrt{d} + 1$ in Eqs. (2) and (3) yields

$$\mathcal{E}(\pi_{svk}, X^n) = O\!\left(\frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2\right)$$

with $\Theta(nd)$ bits of communication, i.e., a constant number of bits per dimension per client. Of the three protocols, $\pi_{svk}$ has the best MSE for a given communication cost. Note that $\pi_{svk}$ uses $k$ quantization levels but, by Eq. (3), still uses $O(1)$ bits per dimension per client for all $k = O(\sqrt{d})$.

Theoretically, while variable length coding has better guarantees, stochastic rotated quantization has several practical advantages: it uses fixed length coding and hence can be combined with encryption schemes for privacy-preserving secure aggregation (Bonawitz et al., 2016). It can also provide lower quantization error in some scenarios due to better constants (see Section 7 for details).

Concurrent to this work, Alistarh et al. (2016) showed that stochastic quantization and Elias coding can be used to obtain communication-optimal SGD. Recently, Konečný & Richtárik (2016) showed that $\pi_{sb}$ can be improved further by optimizing the choice of stochastic quantization boundaries. However, their results depend on the number of bits necessary to represent a float, whereas ours do not.

#### 1.3.2. Minimax MSE

In the above protocols, all of the clients transmit the data. We augment these protocols with a sampling procedure, where only a random fraction of clients transmit data. We show that a combination of $k$-level quantization, variable length coding, and sampling can be used to achieve information-theoretically optimal MSE for a given communication cost. In particular, combining Corollary 1 and Theorem 5 yields our minimax result.

**Theorem 1.** There exists a universal constant $t < 1$ such that for communication cost $c \le ndt$ and $n \ge 1/t$,

$$\mathcal{E}(\Pi(c), S^d) = \Theta\!\left(\min\!\left(1, \frac{d}{c}\right)\right).$$

This result shows that the product of the communication cost and the MSE scales linearly in the number of dimensions.

The rest of the paper is organized as follows. We first analyze the stochastic uniform quantization technique in Section 2. In Section 3, we propose the stochastic rotated quantization technique, and in Section 4 we analyze arithmetic coding. In Section 5, we combine the above algorithm with a sampling technique and state the upper bound on the minimax risk, and in Section 6 we state the matching minimax lower bounds. Finally, in Section 7 we discuss some practical considerations and apply these algorithms to distributed power iteration and Lloyd's algorithm.

## 2. Stochastic uniform quantization

### 2.1. Warm-up: Stochastic binary quantization

For a vector $X_i$, let $X_i^{\max} = \max_{1\le j\le d} X_i(j)$ and similarly let $X_i^{\min} = \min_{1\le j\le d} X_i(j)$. In the stochastic binary quantization protocol $\pi_{sb}$, for each client $i$, the quantized value for each coordinate $j$ is generated independently with private randomness as

$$Y_i(j) = \begin{cases} X_i^{\max} & \text{w.p. } \dfrac{X_i(j) - X_i^{\min}}{X_i^{\max} - X_i^{\min}}, \\[4pt] X_i^{\min} & \text{otherwise.} \end{cases}$$

Observe that $\mathbb{E}\,Y_i(j) = X_i(j)$. The server estimates $\bar{X}$ by

$$\hat{\bar{X}}_{sb} = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$
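As an illustration, the following is a minimal NumPy sketch (ours, not the authors' code) of the client-side stochastic binary quantizer and the server-side estimate described above; the function names are ours and the side information $X_i^{\max}, X_i^{\min}$ (transmitted separately as discussed in Lemma 1 below) is handled in full precision for simplicity.

```python
import numpy as np

def binary_quantize(x, rng):
    """Stochastic binary quantization of one client vector x (pi_sb).

    Each coordinate is rounded to x.max() or x.min() with probabilities
    chosen so that the quantized vector is unbiased: E[Y] = x.
    """
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:                       # constant vector: nothing to quantize
        return x.copy()
    p = (x - x_min) / (x_max - x_min)        # probability of rounding up
    up = rng.random(x.shape) < p
    return np.where(up, x_max, x_min)        # 1 bit per coordinate + two scalars

def server_mean(quantized_vectors):
    """Server-side estimate: average of the unbiased client messages."""
    return np.mean(quantized_vectors, axis=0)

# Toy usage: empirical MSE of the estimate over repeated runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 1024))                    # n = 32 clients, d = 1024
X /= np.linalg.norm(X, axis=1, keepdims=True)      # put each X_i on the unit sphere
true_mean = X.mean(axis=0)
errs = [np.sum((server_mean([binary_quantize(x, rng) for x in X]) - true_mean) ** 2)
        for _ in range(20)]
# Well below the worst-case d/(2n) bound of Lemma 3 for unit-norm data, since
# i.i.d. Gaussian directions have a small dynamic range per coordinate.
print("empirical MSE:", np.mean(errs))
```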
We first bound the communication cost of this protocol.

**Lemma 1.** There exists an implementation of stochastic binary quantization that uses $d + O(1)$ bits per client and hence $\mathcal{C}(\pi_{sb}, X^n) \le n\,(d + O(1))$.

*Proof.* Instead of sending the vectors $Y_i$, clients transmit two real values $X_i^{\max}$ and $X_i^{\min}$ (to a desired error) and a bit vector $Y'_i$ such that $Y'_i(j) = 1$ if $Y_i(j) = X_i^{\max}$ and 0 otherwise. Hence each client transmits $d + 2r$ bits, where $r$ is the number of bits used to transmit a real value to the desired error. Let $B$ be the maximum norm of the underlying vectors. To bound $r$, observe that using $r$ bits one can represent a number between $-B$ and $B$ to an error of $B/2^{r-1}$. Thus using $3\log_2(dn) + 1$ bits one can represent the minimum and maximum to an additive error of $B/(nd)^3$. This error in transmitting the minimum and maximum of the vector does not affect our calculations and we ignore it for simplicity.

We note that in practice, each dimension of $X_i$ is often stored as a 32 bit or 64 bit float, and $r$ should be set to either 32 or 64. In this case, using an even larger $r$ does not further reduce the error.

We now compute the estimation error of this protocol.

**Lemma 2.** For any set of vectors $X^n$,

$$\mathcal{E}(\pi_{sb}, X^n) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\left(X_i^{\max} - X_i(j)\right)\left(X_i(j) - X_i^{\min}\right).$$

*Proof.*

$$\mathcal{E}(\pi_{sb}, X^n) = \mathbb{E}\left\|\hat{\bar{X}} - \bar{X}\right\|_2^2 = \frac{1}{n^2}\,\mathbb{E}\left\|\sum_{i=1}^{n}(Y_i - X_i)\right\|_2^2 = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left\|Y_i - X_i\right\|_2^2,$$

where the last equality follows by observing that the $Y_i - X_i$, $\forall i$, are independent zero mean random variables. The proof follows by observing that for every $i$,

$$\mathbb{E}\left\|Y_i - X_i\right\|_2^2 = \sum_{j=1}^{d}\mathbb{E}\left[(Y_i(j) - X_i(j))^2\right] = \sum_{j=1}^{d}\left(X_i^{\max} - X_i(j)\right)\left(X_i(j) - X_i^{\min}\right).$$

Lemma 2 implies the following upper bound.

**Lemma 3.** For any set of vectors $X^n$,

$$\mathcal{E}(\pi_{sb}, X^n) \le \frac{d}{2n}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2.$$

*Proof.* The proof follows from Lemma 2 by observing that for every $i$ and $j$,

$$\left(X_i^{\max} - X_i(j)\right)\left(X_i(j) - X_i^{\min}\right) \le \frac{\left(X_i^{\max} - X_i^{\min}\right)^2}{4} \le \frac{\|X_i\|_2^2}{2}. \tag{4}$$

We also show that the above bound is tight.

**Lemma 4.** There exists a set of vectors $X^n$ such that

$$\mathcal{E}(\pi_{sb}, X^n) \ge \frac{d-2}{2n}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2.$$

*Proof.* For every $i$, let $X_i$ be defined as follows: $X_i(1) = 1/\sqrt{2}$, $X_i(2) = -1/\sqrt{2}$, and $X_i(j) = 0$ for all $j > 2$. For every $i$, $X_i^{\max} = -X_i^{\min} = 1/\sqrt{2}$ and $\|X_i\|_2 = 1$. Substituting these values into the conclusion of Lemma 2 (which is an equality) yields the lemma.

Therefore, the simple algorithm proposed in this section gives an MSE of $\Theta(d/n)$. Such an error is too large for real-world use. For example, in the application to neural networks (Konečný et al., 2016), $d$ can be on the order of millions, yet $n$ can be much smaller than that. In such cases, the MSE is even larger than the norm of the vector.

### 2.2. Stochastic k-level quantization

A natural generalization of binary quantization is $k$-level quantization. Let $k$ be a positive integer larger than 2. We propose a $k$-level stochastic quantization scheme $\pi_{sk}$ to quantize each coordinate. Recall that for a vector $X_i$, $X_i^{\max} = \max_{1\le j\le d} X_i(j)$ and $X_i^{\min} = \min_{1\le j\le d} X_i(j)$. For every integer $r$ in the range $[0, k)$, let

$$B_i(r) \stackrel{\text{def}}{=} X_i^{\min} + \frac{r\,s_i}{k-1},$$

where $s_i$ satisfies $X_i^{\min} + s_i \ge X_i^{\max}$. A natural choice for $s_i$ would be $X_i^{\max} - X_i^{\min}$. (We will show in Section 4, however, that a higher value of $s_i$ combined with variable length coding has better guarantees.) The algorithm quantizes each coordinate into one of the $B_i(r)$'s stochastically. In $\pi_{sk}$, for the $i$-th client and $j$-th coordinate, if $X_i(j) \in [B_i(r), B_i(r+1))$,

$$Y_i(j) = \begin{cases} B_i(r+1) & \text{w.p. } \dfrac{X_i(j) - B_i(r)}{B_i(r+1) - B_i(r)}, \\[4pt] B_i(r) & \text{otherwise.} \end{cases}$$

The server estimates $\bar{X}$ by

$$\hat{\bar{X}}_{sk} = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$

As before, the communication complexity of this protocol is bounded. The proof is similar to that of Lemma 1 and hence omitted.

**Lemma 5.** There exists an implementation of stochastic $k$-level quantization that uses $d\lceil\log_2 k\rceil + O(1)$ bits per client and hence $\mathcal{C}(\pi_{sk}, X^n) \le n\left(d\lceil\log_2 k\rceil + O(1)\right)$.
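Before analyzing the error, here is a minimal sketch (ours) of the $k$-level quantizer with the default choice $s_i = X_i^{\max} - X_i^{\min}$; it returns the bin indices that would actually be transmitted with $\lceil\log_2 k\rceil$ bits each, plus the two scalars needed to dequantize. Function names are hypothetical.

```python
import numpy as np

def k_level_quantize(x, k, rng):
    """Stochastic k-level quantization (pi_sk) of one client vector x.

    Coordinates are snapped to the grid B(r) = x_min + r*s/(k-1), r = 0..k-1,
    rounding up or down at random so that the result is unbiased.
    Returns (bin_indices, x_min, s); each bin index costs ceil(log2 k) bits.
    """
    x_min = x.min()
    s = x.max() - x_min                       # default s_i = X_max - X_min
    if s == 0:
        return np.zeros(len(x), dtype=np.int64), x_min, 0.0
    width = s / (k - 1)
    scaled = (x - x_min) / width              # position on the grid, in [0, k-1]
    lower = np.floor(scaled).astype(np.int64)
    prob_up = scaled - lower                  # probability of rounding to the upper bin
    bins = lower + (rng.random(len(x)) < prob_up)
    return np.minimum(bins, k - 1), x_min, s

def k_level_dequantize(bins, x_min, s, k):
    """Map transmitted bin indices back to the grid values B(r)."""
    if s == 0:
        return np.full(len(bins), x_min)
    return x_min + bins * (s / (k - 1))

# Unbiasedness check on a toy vector: the average of many quantizations is x.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=8)
ys = [k_level_dequantize(*k_level_quantize(x, 4, rng), k=4) for _ in range(20000)]
print(np.allclose(np.mean(ys, axis=0), x, atol=0.02))   # True: E[Y] = X
```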
The mean squared loss can be bounded as follows.

**Theorem 2.** If $X_i^{\max} - X_i^{\min} \le s_i \le \sqrt{2}\,\|X_i\|_2$ for all $i$, then for any $X^n$, the $\pi_{sk}$ protocol satisfies

$$\mathcal{E}(\pi_{sk}, X^n) \le \frac{d}{2n(k-1)^2}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2.$$

*Proof.*

$$\mathcal{E}(\pi_{sk}, X^n) = \mathbb{E}\left\|\hat{\bar{X}} - \bar{X}\right\|_2^2 = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left\|Y_i - X_i\right\|_2^2 \le \frac{1}{n^2}\sum_{i=1}^{n}\frac{d\,s_i^2}{4(k-1)^2}, \tag{5}$$

where the second equality follows by observing that the $Y_i(j) - X_i(j)$ are independent zero mean random variables, each with $\mathbb{E}\left[(Y_i(j) - X_i(j))^2\right] \le \frac{s_i^2}{4(k-1)^2}$. Using $s_i \le \sqrt{2}\,\|X_i\|_2$ completes the proof.

We conclude this section by noting that $s_i = X_i^{\max} - X_i^{\min}$ satisfies the conditions of the above theorem by Eq. (4).

## 3. Stochastic rotated quantization

We show that the algorithm of the previous section can be significantly improved by a new protocol. The motivation comes from the fact that the MSE of stochastic binary quantization and stochastic $k$-level quantization is $O\!\left(\frac{d}{n}\cdot\frac{1}{n}\sum_i (X_i^{\max} - X_i^{\min})^2\right)$ (see the proofs of Lemma 3 and Theorem 2 with $s_i = X_i^{\max} - X_i^{\min}$). Therefore the MSE is smaller when $X_i^{\max}$ and $X_i^{\min}$ are close to each other. For example, when $X_i$ is generated uniformly on the unit sphere, with high probability $X_i^{\max} - X_i^{\min}$ is $O\!\left(\sqrt{(\log d)/d}\right)$ (Dasgupta & Gupta, 2003). In such a case, $\mathcal{E}(\pi_{sk}, X^n)$ is $O\!\left(\frac{\log d}{n}\right)$ instead of $O\!\left(\frac{d}{n}\right)$.

In this section, we show that even without any assumptions on the distribution of the data, we can reduce $X_i^{\max} - X_i^{\min}$ with a structured random rotation, yielding an $O\!\left(\frac{\log d}{n}\right)$ error. We call the method stochastic rotated quantization and denote it by $\pi_{srk}$.

Using public randomness, all clients and the central server generate a random rotation matrix (random orthogonal matrix) $R \in \mathbb{R}^{d\times d}$ according to some known distribution. Let $Z_i = RX_i$ and $\bar{Z} = R\bar{X}$. In the stochastic rotated quantization protocol $\pi_{srk}(R)$, clients quantize the vectors $Z_i$ instead of $X_i$ and transmit them as in $\pi_{sk}$. The server estimates $\bar{X}$ by

$$\hat{\bar{X}}_{srk} = R^{-1}\hat{\bar{Z}}, \qquad \hat{\bar{Z}} = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$

The communication cost is the same as that of $\pi_{sk}$ and is given by Lemma 5. We now bound the MSE.

**Lemma 6.** For any $X^n$,

$$\mathcal{E}(\pi_{srk(R)}, X^n) \le \frac{d}{2n^2(k-1)^2}\sum_{i=1}^{n}\mathbb{E}\left[(Z_i^{\max})^2 + (Z_i^{\min})^2\right],$$

where $Z_i = RX_i$ and, for every $i$, $s_i = Z_i^{\max} - Z_i^{\min}$.

*Proof.*

$$\mathcal{E}(\pi_{srk}, X^n) = \mathbb{E}\left\|R^{-1}\hat{\bar{Z}} - R^{-1}\bar{Z}\right\|_2^2 \stackrel{(a)}{=} \mathbb{E}\left\|\hat{\bar{Z}} - \bar{Z}\right\|_2^2 \stackrel{(b)}{\le} \frac{d}{4n^2(k-1)^2}\sum_{i=1}^{n}\mathbb{E}\left[s_i^2\right],$$

where the last inequality follows from Eq. (5) and the value of $s_i$; $(a)$ follows from the fact that rotation does not change the norm of the vector, and $(b)$ follows from the tower law of expectation. The lemma follows from observing that $s_i^2 \le 2(Z_i^{\max})^2 + 2(Z_i^{\min})^2$.

To obtain strong bounds, we need to find an orthogonal matrix $R$ that achieves low $(Z_i^{\max})^2$ and $(Z_i^{\min})^2$. In addition, due to the fact that $d$ can be huge in practice, we need a type of orthogonal matrix that permits fast matrix-vector products. Naive orthogonal matrices that support fast multiplication, such as block-diagonal matrices, often result in high values of $(Z_i^{\max})^2$ and $(Z_i^{\min})^2$. Motivated by recent work on structured matrices (Ailon & Chazelle, 2006; Yu et al., 2016), we propose to use a special type of orthogonal matrix $R = HD$, where $D$ is a random diagonal matrix with i.i.d. Rademacher entries ($\pm 1$ with probability 0.5) and $H$ is a Walsh-Hadamard matrix (Horadam, 2012). The Walsh-Hadamard matrix of dimension $2^m$ for $m \in \mathbb{N}$ is given by the recursive formula

$$H(2^1) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad H(2^m) = \begin{bmatrix} H(2^{m-1}) & H(2^{m-1}) \\ H(2^{m-1}) & -H(2^{m-1}) \end{bmatrix}.$$

Both applying the rotation and the inverse rotation take $O(d\log d)$ time and $O(1)$ additional space (with an in-place algorithm). The next lemma bounds $\mathbb{E}\,(Z_i^{\max})^2$ and $\mathbb{E}\,(Z_i^{\min})^2$ for this choice of $R$. The lemma is similar to that of Ailon & Chazelle (2006), and we give the proof in Appendix A for completeness.

**Lemma 7.** Let $R = HD$, where $D$ is a diagonal matrix with independent Rademacher random variables. For every $i$ and every sequence $X^n$,

$$\mathbb{E}\left[(Z_i^{\max})^2\right] = \mathbb{E}\left[(Z_i^{\min})^2\right] \le \frac{\|X_i\|_2^2\,(2\log d + 2)}{d}.$$

Combining the above two lemmas yields the main result.

**Theorem 3.** For any $X^n$, the $\pi_{srk(HD)}$ protocol satisfies

$$\mathcal{E}(\pi_{srk(HD)}, X^n) \le \frac{2\log d + 2}{n(k-1)^2}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2.$$
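The following is a sketch (our code, with hypothetical names) of the structured rotation $R = HD$, with $H$ normalized by $1/\sqrt{d}$ so that $R$ is orthogonal; the random signs $D$ are derived from a shared seed standing in for public randomness, and both the rotation and its inverse cost $O(d\log d)$ via the fast Walsh-Hadamard transform.

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform (returns a new array); len(a) must be a power of two."""
    a = np.asarray(a, dtype=np.float64).copy()
    h, d = 1, len(a)
    while h < d:
        for i in range(0, d, 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h] = x + y
            a[i + h:i + 2 * h] = x - y
        h *= 2
    return a

def random_signs(d, seed):
    """Rademacher diagonal D generated from a public (shared) seed."""
    return np.random.default_rng(seed).choice([-1.0, 1.0], size=d)

def rotate(x, signs):
    """Z = R X with R = (1/sqrt(d)) H D, so that R is orthogonal."""
    return fwht(signs * x) / np.sqrt(len(x))

def inverse_rotate(z, signs):
    """X = R^{-1} Z; uses H^{-1} = H / d and D^{-1} = D."""
    return signs * fwht(z) / np.sqrt(len(z))

# The rotation preserves norms and flattens the dynamic range of a spiky vector.
d, seed = 1024, 42
x = np.zeros(d); x[0] = 1.0                       # worst case for plain quantization
z = rotate(x, random_signs(d, seed))
print(np.allclose(np.linalg.norm(z), np.linalg.norm(x)))          # True: norm preserved
print(x.max() - x.min(), round(z.max() - z.min(), 3))             # 1.0 vs 0.0: spike spread evenly
print(np.allclose(inverse_rotate(z, random_signs(d, seed)), x))   # True: exact inverse
```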
## 4. Variable length coding

Instead of preprocessing the data via a rotation matrix as in $\pi_{srk}$, in this section we propose to use a variable length coding strategy to minimize the number of bits. Consider the stochastic $k$-level quantization technique. A natural way of transmitting $Y_i$ is sending the bin number for each coordinate, and thus the total number of bits the algorithm sends per transmitted vector would be $d\lceil\log_2 k\rceil$. This naive implementation is sub-optimal. Instead, we propose to further encode the transmitted values using universal compression schemes (Krichevsky & Trofimov, 1981; Falahatgar et al., 2015). We first encode $h_r$, the number of times each quantized value $r$ has appeared, and then use arithmetic or Huffman coding corresponding to the distribution $p_r = \frac{h_r}{d}$. We denote this scheme by $\pi_{svk}$. Since we quantize the vectors the same way in $\pi_{sk}$ and $\pi_{svk}$, the MSE of $\pi_{svk}$ is also given by Theorem 2. We now bound the communication cost.

**Theorem 4.** Let $s_i = \sqrt{2}\,\|X_i\|_2$. There exists an implementation of $\pi_{svk}$ such that $\mathcal{C}(\pi_{svk}, X^n)$ is at most

$$O\!\left(nd\left(1 + \log\!\left(\frac{k^2}{d} + 1\right)\right)\right) + O(n).$$

*Proof.* As in Lemma 1, $O(1)$ bits are used to transmit the $s_i$'s and $X_i^{\min}$'s. Recall that $h_r$ is the number of coordinates that are quantized into bin $r$, and that $r$ takes $k$ possible values. Furthermore, $\sum_r h_r = d$. Thus the number of bits necessary to represent the $h_r$'s is at most $k\lceil\log_2(d+1)\rceil$.

Once we have compressed the $h_r$'s, we use arithmetic coding corresponding to the distribution $p_r = h_r/d$ to compress and transmit the bin values for each coordinate. The total number of bits arithmetic coding uses is at most (MacKay, 2003)

$$d\sum_{r=0}^{k-1} p_r\log_2\frac{1}{p_r} + 2.$$

Let $p_r = h_r/d$, $a = (k-1)X_i^{\min}$, $b = s_i$, and $\beta = \sum_{r=0}^{k-1}\frac{1}{(a+br)^2 + \delta}$. Note that

$$\sum_{r} p_r\log_2\frac{1}{p_r} \le \sum_{r} p_r\log_2\left(\left((a+br)^2 + \delta\right)\beta\right) \le \log_2\left(\sum_{r} p_r(a+br)^2 + \delta\right) + \log_2\beta,$$

where the first inequality follows from the positivity of the KL-divergence between $(p_r)_r$ and the distribution with masses $1/\big(((a+br)^2+\delta)\beta\big)$, and the second from the concavity of the logarithm. Choosing $\delta = s_i^2$ yields $\beta \le 4/s_i^2$ and hence

$$\sum_{r} p_r\log_2\frac{1}{p_r} \le \log_2\left(\sum_{r} p_r(a+br)^2 + s_i^2\right) + \log_2\frac{4}{s_i^2}.$$

Note that if $Y_i(j)$ belongs to bin $r$, then $(a+br)^2 = (k-1)^2 Y_i^2(j)$. Recall that $h_r$ is the number of coordinates quantized into bin $r$; hence $\sum_r h_r(a+br)^2$ is the scaled squared norm of $Y_i$, i.e.,

$$\sum_{r} h_r(a+br)^2 = (k-1)^2\sum_{j=1}^{d} Y_i^2(j) = \sum_{j=1}^{d}\big((X_i(j) + \varepsilon(j))(k-1)\big)^2,$$

where $\varepsilon(j) = Y_i(j) - X_i(j)$. Taking expectations on both sides and using the fact that the $\varepsilon(j)$ are independent zero mean random variables with range $s_i/(k-1)$, we get

$$\mathbb{E}\sum_{r} h_r(a+br)^2 = (k-1)^2\sum_{j=1}^{d}\mathbb{E}\left[X_i(j)^2 + \varepsilon(j)^2\right] \le (k-1)^2\,\|X_i\|_2^2 + \frac{d\,s_i^2}{4}.$$

Using Jensen's inequality to move the expectation inside the logarithm yields the result.

Thus if $k = \sqrt{d} + 1$, the communication complexity is $O(nd)$ and the MSE is $O(1/n)$.
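As an illustration of the encoding step (ours, not the authors' implementation), the sketch below builds a Huffman code from the empirical bin counts $h_r$ and compares its cost against the fixed-length $d\lceil\log_2 k\rceil$ bits; arithmetic coding, as used in the analysis above, would come within about 2 bits of $d\,H(p)$ overall, while Huffman pays at most one extra bit per coordinate.

```python
import heapq
import numpy as np
from collections import Counter

def huffman_code_lengths(counts):
    """Code length in bits per symbol for a Huffman code built from symbol counts."""
    if len(counts) == 1:                          # degenerate case: a single symbol
        return {next(iter(counts)): 1}
    heap = [(c, i, (s,)) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in counts}
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)
        c2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:                   # symbols under the merged node get 1 bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

# Demo: most coordinates land in a few middle bins (as the choice s_i = sqrt(2)*||X_i||
# produces for typical data), so the empirical distribution is highly compressible.
rng = np.random.default_rng(0)
d, k = 1024, 33
bins = np.clip(np.round((k - 1) / 2 + rng.normal(scale=1.5, size=d)), 0, k - 1).astype(int)
counts = Counter(bins.tolist())
lengths = huffman_code_lengths(counts)
variable_bits = sum(counts[s] * lengths[s] for s in counts)   # plus the cost of the counts h_r
fixed_bits = d * int(np.ceil(np.log2(k)))                     # naive d * ceil(log2 k) encoding
print(variable_bits, fixed_bits)                              # variable-length is much cheaper here
```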
## 5. Communication-MSE trade-off

In the above protocols, all the clients transmit and hence the communication cost scales linearly with $n$. Instead, we show that any of the above protocols can be combined with client sampling to obtain trade-offs between the MSE and the communication cost. Note that a similar analysis also holds for sampling the coordinates.

Let $\pi$ be a protocol where the mean estimate is of the form

$$\hat{\bar{X}} = R^{-1}\cdot\frac{1}{n}\sum_{i=1}^{n} Y_i. \tag{6}$$

All three protocols we have discussed are of this form. Let $\pi_p$ be the protocol where each client participates independently with probability $p$. The server estimates $\bar{X}$ by

$$\hat{\bar{X}}_{\pi_p} = R^{-1}\cdot\frac{1}{np}\sum_{i\in S} Y_i,$$

where the $Y_i$'s are defined in the previous sections and $S$ is the set of clients that transmitted.

**Lemma 8.** For any set of vectors $X^n$ and any protocol $\pi$ of the form (6), its sampled version $\pi_p$ satisfies

$$\mathcal{E}(\pi_p, X^n) = \frac{1}{p}\,\mathcal{E}(\pi, X^n) + \frac{1-p}{np}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2, \qquad \mathcal{C}(\pi_p, X^n) = p\,\mathcal{C}(\pi, X^n).$$

*Proof.* The claim for the communication cost follows from Lemma 5 and the fact that, in expectation, $np$ clients transmit. We now bound the MSE. Let $S$ be the set of clients that transmit. The error $\mathcal{E}(\pi_p, X^n)$ is

$$\mathbb{E}\left\|\hat{\bar{X}}_{\pi_p} - \bar{X}\right\|_2^2 = \mathbb{E}\left\|\frac{1}{np}\sum_{i\in S} X_i - \bar{X}\right\|_2^2 + \mathbb{E}\left\|\frac{1}{np}\sum_{i\in S}\left(R^{-1}Y_i - X_i\right)\right\|_2^2,$$

where the equality follows by observing that the $R^{-1}Y_i - X_i$ are independent zero mean random variables, and hence for every $i$, $\mathbb{E}\big[(R^{-1}Y_i - X_i)^{T}\big(\tfrac{1}{np}\sum_{i'\in S}X_{i'} - \bar{X}\big)\big] = 0$. The first term can be computed as

$$\mathbb{E}\left\|\frac{1}{np}\sum_{i\in S} X_i - \bar{X}\right\|_2^2 = \frac{1}{n^2 p^2}\sum_{i=1}^{n}\operatorname{Var}\!\big(\mathbb{1}[i\in S]\big)\,\|X_i\|_2^2 = \frac{1-p}{np}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|_2^2.$$

Furthermore, the second term can be bounded as

$$\mathbb{E}\left\|\frac{1}{np}\sum_{i\in S}\left(R^{-1}Y_i - X_i\right)\right\|_2^2 \stackrel{(a)}{=} \frac{1}{n^2 p^2}\sum_{i=1}^{n}\mathbb{E}\big[\mathbb{1}[i\in S]\big]\;\mathbb{E}\left\|R^{-1}Y_i - X_i\right\|_2^2 = \frac{1}{p}\cdot\frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left\|R^{-1}Y_i - X_i\right\|_2^2 = \frac{1}{p}\,\mathcal{E}(\pi, X^n),$$

where $(a)$ follows from the fact that the $R^{-1}Y_i - X_i$ are independent zero mean random variables, and the last equality follows from the assumption that $\pi$'s mean estimate is of the form (6).

Combining the above lemma with Theorem 4, and choosing $k = \sqrt{d} + 1$, results in the following.

**Corollary 1.** For every $c \le nd\,(2 + \log_2(7/4))$, there exists a protocol $\pi$ such that $\mathcal{C}(\pi, S^d) \le c$ and

$$\mathcal{E}(\pi, S^d) = O\!\left(\min\!\left(1, \frac{d}{c}\right)\right).$$
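For concreteness, here is a small simulation (ours, with a generic unbiased stochastic quantizer standing in for any protocol of the form (6) with $R = I$) of the sampled protocol $\pi_p$: each client reports with probability $p$, the server rescales by $1/(np)$, and, as in Lemma 8, the measured MSE grows roughly like $1/p$ while the communication shrinks like $p$.

```python
import numpy as np

def stochastic_round(x, levels, rng):
    """A generic unbiased quantizer: stochastic rounding onto a uniform grid."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    grid = (x - lo) / (hi - lo) * (levels - 1)
    low = np.floor(grid)
    y = low + (rng.random(x.shape) < grid - low)
    return lo + y * (hi - lo) / (levels - 1)

def sampled_mean_estimate(X, p, levels, rng):
    """pi_p: each client reports with probability p; the server rescales by 1/(n*p)."""
    n = len(X)
    participating = rng.random(n) < p
    total = sum(stochastic_round(X[i], levels, rng) for i in range(n) if participating[i])
    return total / (n * p)

# Empirical check of the trade-off: smaller p means fewer transmissions and larger MSE.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 256))
X /= np.linalg.norm(X, axis=1, keepdims=True)
true_mean = X.mean(axis=0)
for p in (1.0, 0.5, 0.1):
    mses = [np.sum((sampled_mean_estimate(X, p, 16, rng) - true_mean) ** 2)
            for _ in range(50)]
    print(p, np.mean(mses))   # MSE grows roughly like 1/p, as predicted by Lemma 8
```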
## 6. Lower bounds

The lower bound relies on the lower bounds for distributed statistical estimation due to Zhang et al. (2013).

**Lemma 9** ((Zhang et al., 2013), Proposition 2). There exists a set of distributions $\mathcal{P}_d$ supported on $S^d$ such that if any centralized server wishes to estimate the mean of the underlying unknown distribution, then for any independent communication protocol $\pi$,

$$\max_{p_d\in\mathcal{P}_d}\ \mathbb{E}\left\|\theta(p_d) - \hat{\theta}\right\|_2^2 \ge t\,\min\!\left(1, \frac{d}{\mathcal{C}(\pi)}\right),$$

where $\mathcal{C}(\pi)$ is the communication cost of the protocol, $\theta(p_d)$ is the mean of $p_d$, and $t$ is a positive constant.

**Theorem 5.** Let $t$ be the constant in Lemma 9. For every $c \le ndt/4$ and $n \ge 4/t$,

$$\mathcal{E}(\Pi(c), S^d) \ge \frac{t}{4}\,\min\!\left(1, \frac{d}{c}\right).$$

*Proof.* Given $n$ samples from the underlying distribution, where each sample belongs to $S^d$, it is easy to see that

$$\mathbb{E}\left\|\theta(p_d) - \hat{\theta}(p_d)\right\|_2^2 \le \frac{1}{n},$$

where $\hat{\theta}(p_d)$ is the empirical mean of the observed samples. Let $\mathcal{P}_d$ be the set of distributions in Lemma 9. Hence for any protocol $\pi \in \Pi(c)$ there exists a distribution $p_d$ such that

$$\mathcal{E}(\pi, S^d) \ge \mathbb{E}\left\|\hat{\theta}(p_d) - \hat{\bar{X}}\right\|_2^2 \stackrel{(a)}{\ge} \frac{1}{2}\,\mathbb{E}\left\|\theta(p_d) - \hat{\bar{X}}\right\|_2^2 - \mathbb{E}\left\|\theta(p_d) - \hat{\theta}(p_d)\right\|_2^2 \stackrel{(b)}{\ge} \frac{t}{2}\min\!\left(1, \frac{d}{c}\right) - \frac{1}{n} \stackrel{(c)}{\ge} \frac{t}{4}\min\!\left(1, \frac{d}{c}\right),$$

where $(a)$ follows from the fact that $2(a-b)^2 + 2(b-c)^2 \ge (a-c)^2$, $(b)$ follows from Lemma 9 and the bound above, and $(c)$ follows from the fact that $\mathcal{C}(\pi, S^d) \le ndt/4$ and $n \ge 4/t$.

Corollary 1 and Theorem 5 yield Theorem 1. We note that the above lower bound holds only for communication cost $c = O(nd)$. Extending the results to larger values of $c$ remains an open problem.

At first glance it may appear that combining the structured random rotation and variable length encoding may improve the result asymptotically, and therefore violate the lower bound. However, this is not the case. Observe that variable length coding $\pi_{svk}$ and stochastic rotated quantization $\pi_{srk}$ use different aspects of the data. Variable length coding uses the fact that bins with large values of the index $r$ are less frequent; hence we can use fewer bits to encode the frequent bins and thus improve communication. In this scheme the bin-width $s_i/(k-1)$ is $\sqrt{2}\,\|X_i\|_2/(k-1)$. Rotated quantization uses the fact that the rotation makes the minimum and maximum closer to each other, and hence we can use bins of smaller width. In such a case, all the bins become more or less equally likely and variable length coding does not help. In this scheme the bin-width $s_i/(k-1)$ is $(Z_i^{\max} - Z_i^{\min})/(k-1) \approx \|X_i\|_2\sqrt{\log d}/(k\sqrt{d})$, which is much smaller than the bin-width for variable length coding. Hence variable length coding and random rotation cannot be used simultaneously.

## 7. Practical considerations and applications

Based on the theoretical analysis, the variable-length coding method provides the lowest quantization error asymptotically when using a constant number of bits. However, in practice, stochastic rotated quantization may be preferred due to (hidden) constant factors and the fact that it uses a fixed amount of bits per dimension. For example, consider quantizing the vector $[-1, 1, 0, 0]$: stochastic rotated quantization can use 1 bit per dimension and give zero error, whereas the other two protocols do not. To see this, observe that the naive quantization will quantize 0 to either 1 or $-1$, and variable length coding cannot achieve 0 error with 1 bit per dimension due to its constant factors.

We further note that the rotated quantization is preferred when applied to unbalanced data, due to the fact that the rotation can correct the unbalancedness. We demonstrate this by generating a dataset where the value of the last feature dimension is much larger than the others. We generate 1000 datapoints, each with 256 dimensions. The first 255 dimensions are generated i.i.d. from $N(0, 1)$, and the last dimension is generated from $N(100, 1)$. As shown in Figure 1, the rotated stochastic quantization has the best performance. The improvement is especially significant for low bit rate cases.

*Figure 1. Distributed mean estimation on data generated from a Gaussian distribution (x-axis: bits per dimension; curves: uniform, rotation, and variable-length quantization).*

We demonstrate two applications in the rest of this section. The experiments are performed on the MNIST ($d = 1024$) and CIFAR ($d = 512$) datasets.

**Distributed Lloyd's algorithm.** In the distributed Lloyd's (k-means) algorithm, each client has access to a subset of data points. In each iteration, the server broadcasts the cluster centers to all the clients. Each client updates the centers based on its local data, and sends the centers back to the server. The server then updates the centers by computing the weighted average of the centers sent from all clients. In the quantized setting, the client compresses the new centers before sending them to the server. This saves the uplink communication cost, which is often the bottleneck of distributed learning. (In this setting, the downlink is a broadcast, and therefore its cost can be reduced by a factor of $O(n/\log n)$ without quantization, where $n$ is the number of clients.) We set both the number of centers and the number of clients to 10. Figure 2 shows the result.

*Figure 2. Lloyd's algorithm with different types of quantizations (uniform, rotation, variable) on MNIST and CIFAR, with 16 and 32 quantization levels. The x-axis is the average number of bits sent per data dimension (this scales linearly with the number of iterations), and the y-axis is the global objective of Lloyd's algorithm.*

**Distributed power iteration.** Power iteration is a widely used method to compute the top eigenvector of a matrix. In the distributed setting, each client has access to a subset of the data. In each iteration, the server broadcasts the current estimate of the eigenvector to all clients. Each client then updates the eigenvector based on one power iteration on its local data, and sends the updated eigenvector back to the server. The server updates the eigenvector by computing the weighted average of the eigenvectors sent by all clients. Similar to the above distributed Lloyd's algorithm, in the quantized setting, the client compresses the estimated eigenvector before sending it to the server. Figure 3 shows the result. The dataset is distributed over 100 clients.

*Figure 3. Power iteration with different types of quantizations (uniform, rotation, variable) on MNIST and CIFAR, with 16 and 32 quantization levels. The x-axis is the average number of bits sent per data dimension (this scales linearly with the number of iterations), and the y-axis is the $\ell_2$ distance between the computed eigenvector and the ground-truth eigenvector.*
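To make the setup concrete, the following is a small self-contained toy simulation (ours, not the experimental code behind Figure 3, which uses MNIST/CIFAR with 100 clients) of distributed power iteration with a quantized uplink: each client applies one local power-iteration step, stochastically quantizes the result as in Section 2.2, and the server averages and renormalizes.

```python
import numpy as np

def stochastic_quantize(x, k, rng):
    """Unbiased k-level stochastic quantization of a vector (as in Section 2.2)."""
    lo, s = x.min(), x.max() - x.min()
    if s == 0:
        return x.copy()
    grid = (x - lo) / s * (k - 1)
    low = np.floor(grid)
    bins = low + (rng.random(x.shape) < grid - low)
    return lo + bins * s / (k - 1)

def distributed_power_iteration(client_data, k, iters, rng):
    """Each round: clients do one local power step, quantize, the server averages."""
    d = client_data[0].shape[1]
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        updates = []
        for A in client_data:                  # A: local data matrix (m_i x d)
            u = A.T @ (A @ v)                  # one local power-iteration step
            updates.append(stochastic_quantize(u, k, rng))
        v = np.mean(updates, axis=0)           # server average (unweighted; equal shares here)
        v /= np.linalg.norm(v)
    return v

# Toy run: 10 clients, d = 128, with a clear spectral gap in the first coordinate.
rng = np.random.default_rng(0)
scales = np.full(128, 0.5)
scales[0] = 3.0
data = rng.normal(size=(2000, 128)) * scales
clients = np.array_split(data, 10)
v = distributed_power_iteration(clients, k=32, iters=30, rng=rng)
truth = np.linalg.eigh(data.T @ data)[1][:, -1]
# l2 distance to the true top eigenvector (the metric on the y-axis of Figure 3);
# quantization noise only mildly perturbs the estimate.
print(min(np.linalg.norm(v - truth), np.linalg.norm(v + truth)))
```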
For both of these applications, variable-length coding achieves the lowest quantization error in most of the settings. Furthermore, for low bit rates, stochastic rotated quantization is competitive with variable-length coding.

## Acknowledgments

We thank Jayadev Acharya, Keith Bonawitz, Dan Holtmann-Rice, Jakub Konecny, Tengyu Ma, and Xiang Wu for helpful comments and discussions.

## References

Ailon, Nir and Chazelle, Bernard. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 2006.

Alistarh, Dan, Li, Jerry, Tomioka, Ryota, and Vojnovic, Milan. QSGD: Randomized quantization for communication-optimal stochastic gradient descent. arXiv:1610.02132, 2016.

Arjevani, Yossi and Shamir, Ohad. Communication complexity of distributed convex learning and optimization. In NIPS, 2015.

Balcan, Maria-Florina, Blum, Avrim, Fine, Shai, and Mansour, Yishay. Distributed learning, communication complexity and privacy. In COLT, 2012.

Bonawitz, Keith, Ivanov, Vladimir, Kreuter, Ben, Marcedone, Antonio, McMahan, H. Brendan, Patel, Sarvar, Ramage, Daniel, Segal, Aaron, and Seth, Karn. Practical secure aggregation for federated learning on user-held data. arXiv:1611.04482, 2016.

Braverman, Mark, Garg, Ankit, Ma, Tengyu, Nguyen, Huy L., and Woodruff, David P. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In STOC, 2016.

Chen, Jiecao, Sun, He, Woodruff, David, and Zhang, Qin. Communication-optimal distributed clustering. In NIPS, 2016.

Dasgupta, Sanjoy and Gupta, Anupam. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V., et al. Large scale distributed deep networks. In NIPS, 2012.

Efron, Bradley and Stein, Charles. The jackknife estimate of variance. The Annals of Statistics, pp. 586–596, 1981.

Falahatgar, Moein, Jafarpour, Ashkan, Orlitsky, Alon, Pichapati, Venkatadheeraj, and Suresh, Ananda Theertha. Universal compression of power-law distributions. In ISIT, 2015.

Garg, Ankit, Ma, Tengyu, and Nguyen, Huy L. On communication cost of distributed statistical estimation and dimensionality. In NIPS, 2014.

Horadam, Kathy J. Hadamard Matrices and Their Applications. Princeton University Press, 2012.

Konečný, Jakub and Richtárik, Peter. Randomized distributed mean estimation: Accuracy vs communication. arXiv:1611.07555, 2016.

Konečný, Jakub, McMahan, H. Brendan, Yu, Felix X., Richtárik, Peter, Suresh, Ananda Theertha, and Bacon, Dave. Federated learning: Strategies for improving communication efficiency. arXiv:1610.05492, 2016.

Krichevsky, R. and Trofimov, V. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981.
Lloyd, Stuart. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

MacKay, David J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

McDonald, Ryan, Hall, Keith, and Mann, Gideon. Distributed training strategies for the structured perceptron. In HLT, 2010.

McMahan, H. Brendan, Moore, Eider, Ramage, Daniel, and y Arcas, Blaise Agüera. Federated learning of deep networks using model averaging. arXiv:1602.05629, 2016.

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. arXiv preprint, 2014.

Tsitsiklis, John N. and Luo, Zhi-Quan. Communication complexity of convex optimization. Journal of Complexity, 3(3):231–243, 1987.

Yu, Felix X., Suresh, Ananda Theertha, Choromanski, Krzysztof, Holtmann-Rice, Daniel, and Kumar, Sanjiv. Orthogonal random features. In NIPS, 2016.

Zhang, Yuchen, Duchi, John, Jordan, Michael I., and Wainwright, Martin J. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, 2013.