# Improved Distributed Principal Component Analysis

Maria-Florina Balcan, School of Computer Science, Carnegie Mellon University, ninamf@cs.cmu.edu
Vandana Kanchanapally, School of Computer Science, Georgia Institute of Technology, vvandana@gatech.edu
Yingyu Liang, Department of Computer Science, Princeton University, yingyul@cs.princeton.edu
David Woodruff, Almaden Research Center, IBM Research, dpwoodru@us.ibm.com

Abstract

We study the distributed computing setting in which there are multiple servers, each holding a set of points, that wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low-dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as k-means clustering and low rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for k-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication, with only a negligible degradation in solution quality. Some of the techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest.

1 Introduction

Since data is often partitioned across multiple servers [20, 7, 18], there is an increased interest in computing on it in the distributed model. A basic tool for distributed data analysis is Principal Component Analysis (PCA). The goal of PCA is to find an r-dimensional (affine) subspace that captures as much of the variance of the data as possible. Hence, it can reveal low-dimensional structure in very high dimensional data. Moreover, it can serve as a preprocessing step to reduce the data dimension in various machine learning tasks, such as Non-Negative Matrix Factorization (NNMF) [15] and Latent Dirichlet Allocation (LDA) [3].

In the distributed model, approximate PCA was used by Feldman et al. [9] for solving a number of shape fitting problems such as k-means clustering. There the approximation is in the form of a coreset, which has the property that local coresets can be easily combined across servers into a global coreset, thereby providing an approximate PCA of the union of the data sets. Designing small coresets therefore leads to communication-efficient protocols. Coresets have the nice property that their size typically does not depend on the number n of points being approximated. A beautiful property of the coresets developed in [9] is that for approximate PCA their size also depends only linearly on the dimension d, whereas previous coresets depended quadratically on d [8]. This gives the best known communication protocols for approximate PCA and k-means clustering.

Despite this recent exciting progress, several important questions remain.
First, can we improve the communication further as a function of the number of servers, the approximation error, and other parameters of the downstream applications (such as the number k of clusters in k-means clustering)? Second, while preserving optimal or nearly-optimal communication, can we improve the computational costs of the protocols? We note that in the protocols of Feldman et al. each server has to run a singular value decomposition (SVD) on her local data set, while additional work needs to be performed to combine the outputs of each server into a global approximate PCA. Third, are these algorithms practical and do they scale well with large-scale datasets? In this paper we give answers to the above questions. To state our results more precisely, we first define the model and the problems.

Communication Model. In the distributed setting, we consider a set of $s$ nodes $V = \{v_i : 1 \le i \le s\}$, each of which can communicate with a central coordinator $v_0$. On each node $v_i$ there is a local data matrix $P_i \in \mathbb{R}^{n_i \times d}$ holding $n_i$ data points in $d$ dimensions ($n_i > d$). The global data $P \in \mathbb{R}^{n \times d}$ is the vertical concatenation of the local data matrices, i.e., $P^\top = [P_1^\top, P_2^\top, \ldots, P_s^\top]$ and $n = \sum_{i=1}^s n_i$. Let $p_i$ denote the $i$-th row of $P$. Throughout the paper, we assume that the data points are centered to have zero mean, i.e., $\sum_{i=1}^n p_i = 0$. Uncentered data requires a rank-one modification to the algorithms, whose communication and computation costs are dominated by those of the other steps.

Approximate PCA and ℓ2-Error Fitting. For a matrix $A = [a_{ij}]$, let $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$ be its squared Frobenius norm, and let $\sigma_i(A)$ be the $i$-th singular value of $A$. Let $A^{(t)}$ denote the matrix that contains the first $t$ columns of $A$. Let $L_X$ denote the linear subspace spanned by the columns of $X$. For a point $p$, let $\pi_L(p)$ be its projection onto the subspace $L$, and let $\pi_X(p)$ be shorthand for $\pi_{L_X}(p)$. For a point $p \in \mathbb{R}^d$ and a subspace $L \subseteq \mathbb{R}^d$, we denote the squared distance between $p$ and $L$ by $d^2(p, L) := \min_{q \in L} \|p - q\|_2^2 = \|p - \pi_L(p)\|_2^2$.

Definition 1. The linear (or affine) $r$-Subspace $k$-Clustering problem on $P \in \mathbb{R}^{n \times d}$ is
$$\min_{\mathcal{L}} d^2(P, \mathcal{L}) := \sum_{i=1}^{n} \min_{L \in \mathcal{L}} d^2(p_i, L), \qquad (1)$$
where $P$ is an $n \times d$ matrix whose rows are $p_1, \ldots, p_n$, and $\mathcal{L} = \{L_j\}_{j=1}^k$ is a set of $k$ centers, each of which is an $r$-dimensional linear (or affine) subspace.

PCA is a special case when $k = 1$ and the center is an $r$-dimensional subspace. This optimal $r$-dimensional subspace is spanned by the top $r$ right singular vectors of $P$, also known as the principal components, and can be found using the singular value decomposition (SVD). Another special case of the above is k-means clustering, when the centers are points ($r = 0$). Constrained versions of this problem include NNMF, where the $r$-dimensional subspace should be spanned by positive vectors, and LDA, which assumes a prior distribution defining a probability for each $r$-dimensional subspace. We will primarily be concerned with relative-error approximation algorithms, for which we would like to output a set $\mathcal{L}'$ of $k$ centers for which $d^2(P, \mathcal{L}') \le (1 + \epsilon) \min_{\mathcal{L}} d^2(P, \mathcal{L})$.

For approximate distributed PCA, the following protocol is implicit in [9]: each server $i$ computes its top $O(r/\epsilon)$ principal components $Y_i$ of $P_i$ and sends them to the coordinator. The coordinator stacks the $O(r/\epsilon) \times d$ matrices $Y_i$ on top of each other, forming an $O(sr/\epsilon) \times d$ matrix $Y$, computes the top $r$ principal components of $Y$, and returns these to the servers. This provides a relative-error approximation to the PCA problem. We refer to this algorithm as Algorithm disPCA.
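As a concrete illustration of the protocol just described, here is a minimal NumPy sketch of Algorithm disPCA (not code from the paper; the function names and the specific constant in $t_1 = \lceil r + 4r/\epsilon \rceil$ are illustrative, following the $O(r/\epsilon)$ scaling).

```python
import numpy as np

def local_summary(P_i, t1):
    """Server-side step: return the t1 x d summary Sigma_i^(t1) (V_i^(t1))^T
    built from the top t1 singular values / right singular vectors of P_i."""
    _, s, Vt = np.linalg.svd(P_i, full_matrices=False)
    return np.diag(s[:t1]) @ Vt[:t1]

def dis_pca(local_matrices, r, eps):
    """Coordinator-side step: stack the s local summaries into Y and return the
    top-r right singular vectors of Y as the d x r approximate principal components."""
    t1 = int(np.ceil(r + 4 * r / eps))                      # t1 = O(r / eps)
    Y = np.vstack([local_summary(P_i, t1) for P_i in local_matrices])
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:r].T

# Each server ships a t1 x d matrix, so the total communication is
# O(s * t1 * d) = O(s * r * d / eps) numbers.
```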
Our Contributions. Our results are summarized as follows.

Improved Communication: We improve the communication cost of using distributed PCA for k-means clustering and similar ℓ2-fitting problems. The best previous approach is to use Corollary 4.5 in [9], which shows that given a data matrix $P$, if we project the rows onto the space spanned by the top $O(k/\epsilon^2)$ principal components and solve the k-means problem in this subspace, we obtain a $(1 + \epsilon)$-approximation. In the distributed setting, this would require first running Algorithm disPCA with parameter $r = O(k/\epsilon^2)$, and thus communication at least $O(skd/\epsilon^3)$ to compute the $O(k/\epsilon^2)$ global principal components. Then one can solve a distributed k-means problem in this subspace, and an α-approximation in it translates to an overall $\alpha(1 + \epsilon)$ approximation.

Our Theorem 3 shows that it suffices to run Algorithm disPCA while incurring only $O(skd/\epsilon^2)$ communication to compute the $O(k/\epsilon^2)$ global principal components, preserving the k-means solution cost up to a $(1 + \epsilon)$-factor. Our communication is thus a $1/\epsilon$ factor better, and illustrates that for downstream applications it is sometimes important to open up the box rather than to directly use the guarantees of a generic PCA algorithm (which would give $O(skd/\epsilon^3)$ communication). One feature of this approach is that, by using the distributed k-means algorithm in [2] on the projected data, the coordinator can sample points from the servers proportionally to their local k-means solution costs; this reduces the communication roughly by a factor of $s$ compared to each server sending its local k-means coreset to the coordinator. Furthermore, before applying the above approach, one can first run any other dimension reduction to dimension $d'$ so that the k-means cost is preserved up to the desired accuracy. For example, if we want a $1 + \epsilon$ approximation factor, we can set $d' = O(\log n/\epsilon^2)$ by a Johnson-Lindenstrauss transform; if we are satisfied with a larger $2 + \epsilon$ approximation factor, we can set $d' = O(k/\epsilon^2)$ using [4]. In this way the parameter $d$ in the above communication cost bound can be replaced by $d'$. Note that unlike these dimension reductions, our algorithm for projecting onto principal components is deterministic and does not incur an error probability.

Improved Computation: We turn to the computational cost of Algorithm disPCA, which to the best of our knowledge has not been addressed. A major bottleneck is that each player computes a singular value decomposition (SVD) of its point set $P_i$, which takes $\min(n_i d^2, n_i^2 d)$ time. We modify Algorithm disPCA so that each server first samples an oblivious subspace embedding (OSE) [22, 5, 19, 17] matrix $H_i$, and runs the algorithm on the point set defined by the rows of $H_i P_i$. Using known OSEs, one can choose $H_i$ to have only a single non-zero entry per column, and thus $H_i P_i$ can be computed in $\mathrm{nnz}(P_i)$ time. Moreover, the number of rows of $H_i$ is $O(d^2/\epsilon^2)$, which may be significantly less than the original number of rows $n_i$. This number of rows can be further reduced to $O(d \log^{O(1)} d / \epsilon^2)$ if one is willing to spend $O(\mathrm{nnz}(P_i) \log^{O(1)} d / \epsilon)$ time [19]. We note that the number of non-zero entries of $H_i P_i$ is no more than that of $P_i$. One technical issue is that each of the $s$ servers is locally performing a subspace embedding, which succeeds with only constant probability.
If we want a single non-zero entry per column of $H_i$, then to achieve success probability $1 - O(1/s)$, so that we can union bound over all $s$ servers succeeding, we would naively need to increase the number of rows of $H_i$ by a factor linear in $s$. We give a general technique which takes a subspace embedding that succeeds with constant probability as a black box, applies it $O(\log(1/\delta))$ times independently, and from these applications finds one which is guaranteed to succeed with probability $1 - \delta$. Thus, in this setting the players can compute a subspace embedding of their data in $\mathrm{nnz}(P_i)$ time, for which the number of non-zero entries of $H_i P_i$ is no larger than that of $P_i$, and without incurring this additional factor of $s$. This may be of independent interest.

It may still be expensive to perform the SVD of $H_i P_i$, and for the coordinator to perform an SVD of $Y$ in Algorithm disPCA. We therefore replace the SVD computation with a randomized approximate SVD computation with spectral norm error. Our contribution here is to analyze the error in distributed PCA and k-means after performing these speedups.

Empirical Results: Our speedups result in significant computational savings. The randomized techniques we use reduce the time by orders of magnitude on medium and large-scale data sets, while preserving the communication cost. Although the theory predicts a new small additive error because of our speedups, in our experiments the solution quality was only negligibly affected.

Related Work. A number of algorithms for approximate distributed PCA have been proposed [21, 14, 16, 9], but either without theoretical guarantees, or without considering communication. Most closely related to our work are [9, 12]. [9] observes that the top singular vectors of the local data form its summary and that the union of these summaries is a summary of the global data, i.e., Algorithm disPCA. [12] studies algorithms in the arbitrary partition model, in which each server holds a matrix $P_i$ and $P = \sum_{i=1}^s P_i$. More details and more related work can be found in the appendix.

2 Tradeoff between Communication and Solution Quality

Algorithm disPCA for distributed PCA is suggested in [21, 9], and consists of a local stage and a global stage. In the local stage, each node performs SVD on its local data matrix, and communicates the first $t_1$ singular values $\Sigma_i^{(t_1)}$ and the first $t_1$ right singular vectors $V_i^{(t_1)}$ to the central coordinator. Then in the global stage, the coordinator concatenates the matrices $\Sigma_i^{(t_1)} (V_i^{(t_1)})^\top$ to form a matrix $Y$, and performs SVD on it to get the first $t_2$ right singular vectors $V^{(t_2)}$.

Figure 1: The key steps of Algorithm disPCA: each node runs a local PCA and sends $\Sigma_i^{(t_1)} (V_i^{(t_1)})^\top$; the coordinator stacks these into $Y$ and runs a global PCA to obtain $V^{(t_2)}$.

To get some intuition, consider the easy case when the data points actually lie in an $r$-dimensional subspace. We can run Algorithm disPCA with $t_1 = t_2 = r$. Since $P_i$ has rank $r$, its projection onto the subspace spanned by its first $t_1 = r$ right singular vectors, $\hat{P}_i = U_i^{(r)} \Sigma_i^{(r)} (V_i^{(r)})^\top$, is identical to $P_i$. Then we only need to do PCA on $\hat{P}$, the concatenation of the $\hat{P}_i$. Observing that $\hat{P} = \tilde{U} Y$ where $\tilde{U}$ has orthonormal columns, it suffices to compute the SVD of $Y$, and only $\Sigma_i^{(r)} (V_i^{(r)})^\top$ needs to be communicated.
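To spell out why communicating only $\Sigma_i^{(r)} (V_i^{(r)})^\top$ suffices in this easy case, the observation above can be written out as follows (a short derivation in the notation above, added for clarity):
$$
\hat{P} \;=\; \begin{bmatrix} \hat{P}_1 \\ \vdots \\ \hat{P}_s \end{bmatrix}
\;=\; \begin{bmatrix} U_1^{(r)} \Sigma_1^{(r)} (V_1^{(r)})^\top \\ \vdots \\ U_s^{(r)} \Sigma_s^{(r)} (V_s^{(r)})^\top \end{bmatrix}
\;=\; \underbrace{\begin{bmatrix} U_1^{(r)} & & \\ & \ddots & \\ & & U_s^{(r)} \end{bmatrix}}_{\tilde{U}}
\underbrace{\begin{bmatrix} \Sigma_1^{(r)} (V_1^{(r)})^\top \\ \vdots \\ \Sigma_s^{(r)} (V_s^{(r)})^\top \end{bmatrix}}_{Y},
\qquad \tilde{U}^\top \tilde{U} = I,
$$
so $\hat{P}^\top \hat{P} = Y^\top \tilde{U}^\top \tilde{U} Y = Y^\top Y$: the matrices $\hat{P}$ and $Y$ have the same right singular vectors and singular values, and the coordinator recovers the exact principal components from $Y$ alone.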
In the general case, when the data may have rank higher than $r$, it turns out that one needs to set $t_1$ sufficiently large, so that $\hat{P}_i$ approximates $P_i$ well enough and does not introduce too much error into the final solution. In particular, the following close projection property of the SVD is useful.

Lemma 1. Suppose $A$ has SVD $A = U \Sigma V^\top$ and let $\hat{A} = A V^{(t)} (V^{(t)})^\top$ denote its SVD truncation. If $t = O(r/\epsilon)$, then for any $d \times r$ matrix $X$ with orthonormal columns,
$$0 \le \|AX - \hat{A}X\|_F^2 \le \epsilon\, d^2(A, L_X) \quad \text{and} \quad 0 \le \|AX\|_F^2 - \|\hat{A}X\|_F^2 \le \epsilon\, d^2(A, L_X).$$

This means that the projections of $\hat{A}$ and $A$ onto any $r$-dimensional subspace are close when the projected dimension $t$ is sufficiently large compared to $r$. Now, note that the difference between $\|P - PXX^\top\|_F^2$ and $\|\hat{P} - \hat{P}XX^\top\|_F^2$ is only related to $\|PX\|_F^2 - \|\hat{P}X\|_F^2 = \sum_i \left[\|P_i X\|_F^2 - \|\hat{P}_i X\|_F^2\right]$, each term of which is bounded by the lemma. So we can use $\hat{P}$ as a proxy for $P$ in the PCA task. Again, computing PCA on $\hat{P}$ is equivalent to computing SVD on $Y$, as done in Algorithm disPCA. This leads to the following theorem, implicit in [9], stating that the algorithm can produce a $(1 + \epsilon)$-approximation for the distributed PCA problem.

Theorem 2. Suppose Algorithm disPCA takes parameters $t_1 \ge r + 4r/\epsilon - 1$ and $t_2 = r$. Then
$$\|P - P V^{(r)} (V^{(r)})^\top\|_F^2 \le (1 + \epsilon) \min_X \|P - PXX^\top\|_F^2,$$
where the minimization is over $d \times r$ orthonormal matrices $X$. The communication is $O(srd/\epsilon)$.

2.1 Guarantees for Distributed ℓ2-Error Fitting

Algorithm disPCA can also be used as a pre-processing step for applications such as ℓ2-error fitting. In this section, we prove the correctness of Algorithm disPCA as pre-processing for these applications. In particular, we show that by setting $t_1, t_2$ sufficiently large, the objective value of any solution changes only slightly when the original data $P$ is replaced by the projected data $\tilde{P} = P V^{(t_2)} (V^{(t_2)})^\top$. Therefore, the projected data serves as a proxy for the original data: any distributed algorithm can be applied to the projected data to obtain a solution for the original data. Since the dimension is lower, the communication cost is reduced. Formally:

Theorem 3. Let $t_1 = t_2 = O(rk/\epsilon^2)$ in Algorithm disPCA, for $\epsilon \in (0, 1/3)$. Then there exists a constant $c_0 \ge 0$ such that for any set of $k$ centers $\mathcal{L}$ in $r$-Subspace $k$-Clustering,
$$(1 - \epsilon)\, d^2(P, \mathcal{L}) \le d^2(\tilde{P}, \mathcal{L}) + c_0 \le (1 + \epsilon)\, d^2(P, \mathcal{L}).$$

The theorem implies that any α-approximate solution $\mathcal{L}$ on the projected data $\tilde{P}$ is a $(1 + 3\epsilon)\alpha$-approximation on the original data $P$. To see this, let $\mathcal{L}^*$ denote the optimal solution. Then
$$(1 - \epsilon)\, d^2(P, \mathcal{L}) \le d^2(\tilde{P}, \mathcal{L}) + c_0 \le \alpha\, d^2(\tilde{P}, \mathcal{L}^*) + c_0 \le \alpha (1 + \epsilon)\, d^2(P, \mathcal{L}^*),$$
which leads to $d^2(P, \mathcal{L}) \le (1 + 3\epsilon)\,\alpha\, d^2(P, \mathcal{L}^*)$. In other words, the distributed PCA step only introduces a small multiplicative approximation factor of $(1 + 3\epsilon)$.

The key to proving the theorem is the close projection property of the algorithm (Lemma 4): for any low-dimensional subspace spanned by $X$, the projections of $P$ and $\tilde{P}$ onto the subspace are close. In particular, we choose $X$ to be an orthonormal basis of the subspace spanning the centers. Then the difference between the objective values of $P$ and $\tilde{P}$ can be decomposed into two terms depending only on $\|PX - \tilde{P}X\|_F^2$ and $\|PX\|_F^2 - \|\tilde{P}X\|_F^2$ respectively, which are small as shown by the lemma. The complete proof of Theorem 3 is provided in the appendix.

Algorithm 1 Distributed k-means clustering
Input: $\{P_i\}_{i=1}^s$, $k \in \mathbb{N}_+$, $\epsilon \in (0, 1/2)$, and a non-distributed α-approximation algorithm $A_\alpha$.
1: Run Algorithm disPCA with $t_1 = t_2 = O(k/\epsilon^2)$ to get $V$, and send $V$ to all nodes.
2: Run the distributed k-means clustering algorithm in [2] on $\{P_i V V^\top\}_{i=1}^s$, using $A_\alpha$ as a subroutine, to get $k$ centers $\mathcal{L}$.
Output: $\mathcal{L}$.

Lemma 4. Let $t_1 = t_2 = O(k/\epsilon)$ in Algorithm disPCA. Then for any $d \times k$ matrix $X$ with orthonormal columns,
$$0 \le \|PX - \tilde{P}X\|_F^2 \le \epsilon\, d^2(P, L_X) \quad \text{and} \quad 0 \le \|PX\|_F^2 - \|\tilde{P}X\|_F^2 \le \epsilon\, d^2(P, L_X).$$
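A quick numerical illustration of the close projection property, in the single-matrix form of Lemma 1 that underlies Lemma 4 (synthetic data; the dimensions and the choice $t = \lceil r + 4r/\epsilon \rceil$ are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, eps = 2000, 100, 2, 0.2
A = rng.standard_normal((n, d)) * (1.0 / np.arange(1, d + 1))   # columns with decaying scale

t = int(np.ceil(r + 4 * r / eps))                     # t = O(r / eps); here 42 < d
_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_hat = A @ Vt[:t].T @ Vt[:t]                         # SVD truncation  A V^(t) (V^(t))^T

X, _ = np.linalg.qr(rng.standard_normal((d, r)))      # a random d x r orthonormal X
dist2 = np.linalg.norm(A - A @ X @ X.T, 'fro')**2     # d^2(A, L_X)

gap1 = np.linalg.norm(A @ X - A_hat @ X, 'fro')**2
gap2 = np.linalg.norm(A @ X, 'fro')**2 - np.linalg.norm(A_hat @ X, 'fro')**2
print(0 <= gap1 <= eps * dist2, 0 <= gap2 <= eps * dist2)   # both True in this run
```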
Proof Sketch (of Lemma 4): We first introduce some auxiliary variables for the analysis, which act as intermediate connections between $P$ and $\tilde{P}$. Imagine we perform two kinds of projections: first project $P_i$ to $\hat{P}_i = P_i V_i^{(t_1)} (V_i^{(t_1)})^\top$, then project $\hat{P}_i$ to $\bar{P}_i = \hat{P}_i V^{(t_2)} (V^{(t_2)})^\top$. Let $\hat{P}$ denote the vertical concatenation of the $\hat{P}_i$ and let $\bar{P}$ denote the vertical concatenation of the $\bar{P}_i$. These variables are designed so that the difference between $P$ and $\hat{P}$ and that between $\hat{P}$ and $\bar{P}$ are easily bounded. Our proof then proceeds by first bounding these differences, and then bounding that between $\bar{P}$ and $\tilde{P}$. In the following we sketch the proof of the second statement; the first statement can be proved by a similar argument. See the appendix for details. Write
$$\|PX\|_F^2 - \|\tilde{P}X\|_F^2 = \left[\|PX\|_F^2 - \|\hat{P}X\|_F^2\right] + \left[\|\hat{P}X\|_F^2 - \|\bar{P}X\|_F^2\right] + \left[\|\bar{P}X\|_F^2 - \|\tilde{P}X\|_F^2\right].$$
The first term is just $\sum_{i=1}^s \left[\|P_i X\|_F^2 - \|\hat{P}_i X\|_F^2\right]$, each summand of which can be bounded by Lemma 1, since $\hat{P}_i$ is the SVD truncation of $P_i$. The second term can be bounded similarly. The more difficult part is the third term. Note that $\bar{P}_i X = \hat{P}_i Z$ and $\tilde{P}_i X = P_i Z$, where $Z := V^{(t_2)} (V^{(t_2)})^\top X$, leading to $\|\bar{P}X\|_F^2 - \|\tilde{P}X\|_F^2 = \sum_{i=1}^s \left[\|\hat{P}_i Z\|_F^2 - \|P_i Z\|_F^2\right]$. Although $Z$ does not have orthonormal columns as required by Lemma 1, we prove a generalization (Lemma 7 in the appendix) which can be applied to show that the third term is indeed small.

Application to k-Means Clustering. To see the implication, consider the k-means clustering problem. We can first perform any other dimension reduction to dimension $d'$ so that the k-means cost is preserved up to accuracy $\epsilon$, then run Algorithm disPCA, and finally run any distributed k-means clustering algorithm on the projected data to get a good approximate solution. For example, in the first step we can set $d' = O(\log n/\epsilon^2)$ using a Johnson-Lindenstrauss transform, or we can perform no reduction and simply use the original data. As a concrete example, we can use the original data ($d' = d$), then run Algorithm disPCA, and finally run the distributed clustering algorithm in [2], which uses any non-distributed α-approximation algorithm as a subroutine and computes a $(1 + \epsilon)\alpha$-approximate solution. The resulting algorithm is presented in Algorithm 1.

Theorem 5. With probability at least $1 - \delta$, Algorithm 1 outputs a $(1 + \epsilon)^2\alpha$-approximate solution for distributed k-means clustering. The total communication cost of Algorithm 1 is $O(\frac{sk}{\epsilon^2})$ vectors in $\mathbb{R}^d$ plus $O\!\left(\frac{1}{\epsilon^4}\left(k^2 + \log\frac{1}{\delta}\right) + sk \log\frac{sk}{\delta}\right)$ vectors in $\mathbb{R}^{O(k/\epsilon^2)}$.

3 Fast Distributed PCA

Subspace Embeddings. One can significantly improve the running time of the distributed PCA algorithms by using subspace embeddings, while keeping guarantees similar to those in Lemma 4, which suffice for ℓ2-error fitting. More precisely, a subspace embedding matrix $H \in \mathbb{R}^{\ell \times n}$ for a matrix $A \in \mathbb{R}^{n \times d}$ has the property that for all vectors $y \in \mathbb{R}^d$, $\|HAy\|_2 = (1 \pm \epsilon)\|Ay\|_2$. Suppose each node $v_i$ independently chooses a random subspace embedding matrix $H_i$ for its local data $P_i$. Then the nodes run Algorithm disPCA on the embedded data $\{H_i P_i\}_{i=1}^s$ instead of on the original data $\{P_i\}_{i=1}^s$.

The work of [22] pioneered subspace embeddings. The recent fast sparse subspace embeddings [5] and their optimizations [17, 19] are particularly suitable for large-scale sparse data sets, since their running time is linear in the number of non-zero entries in the data matrix, and they also preserve the sparsity of the data. The algorithm takes as input an $n \times d$ matrix $A$ and a parameter $\ell$, and outputs an $\ell \times d$ embedded matrix $A' = HA$ (the embedding matrix $H$ need not be built explicitly). The embedded matrix is constructed as follows: initialize $A' = 0$; for each row of $A$, multiply it by $+1$ or $-1$ with equal probability, then add it to a row of $A'$ chosen uniformly at random.
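The construction above can be written in a few lines. The following sketch (illustrative sizes, not the paper's code) materializes $H$ as a sparse matrix only for clarity; $H$ has a single non-zero per column, so computing $HA$ touches each non-zero of $A$ once and runs in time proportional to $\mathrm{nnz}(A)$.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_embed(A, ell, rng):
    """Sparse subspace embedding in the style of [5]: each row of A receives a
    random sign and is added to a uniformly random row of the ell x d output."""
    n = A.shape[0]
    target_rows = rng.integers(0, ell, size=n)      # output row hit by each input row
    signs = rng.choice([-1.0, 1.0], size=n)         # random +/- 1 per input row
    H = csr_matrix((signs, (target_rows, np.arange(n))), shape=(ell, n))
    return H @ A                                    # the ell x d embedded matrix H A

# For a fixed y, ||H A y||_2 is typically within a small factor of ||A y||_2.
rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 30))
y = rng.standard_normal(30)
HA = sparse_embed(A, ell=2000, rng=rng)
print(np.linalg.norm(HA @ y) / np.linalg.norm(A @ y))   # close to 1
```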
The success probability of such an embedding is constant, whereas we need to set it to $1 - \delta$ with $\delta = \Theta(1/s)$. Known results which preserve a single non-zero entry per column of $H$ increase the dimension of $H$ by a factor of $s$. To avoid this, we propose an approach that boosts the success probability by computing $O(\log\frac{1}{\delta})$ independent embeddings, each with only constant success probability, and then running a cross-validation style procedure to find one which succeeds with probability $1 - \delta$. More precisely, we compute the SVD of all embedded matrices $H_j A = U_j \Sigma_j V_j^\top$, and find an index $j$ such that for at least half of the indices $j' \neq j$, all singular values of $\Sigma_{j'} V_{j'}^\top V_j \Sigma_j^{-1}$ lie in $[1 \pm O(\epsilon)]$ (see Algorithm 4 in the appendix). The reason why such an embedding $H_j A$ succeeds with high probability is as follows. Any two successful embeddings $H_j A$ and $H_{j'} A$, by definition, satisfy $\|H_j A x\|_2^2 = (1 \pm O(\epsilon)) \|H_{j'} A x\|_2^2$ for all $x$, which we show is equivalent to passing the test on the singular values. Since with probability at least $1 - \delta$ a $9/10$ fraction of the embeddings are successful, it follows that the one we choose is successful with probability $1 - \delta$.
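A minimal sketch of this selection test, based on the description above (the paper's exact procedure is Algorithm 4 in its appendix); the acceptance threshold $3\epsilon$ stands in for the $O(\epsilon)$ tolerance, and $A$ is assumed to have full column rank so that each $\Sigma_j$ is invertible.

```python
import numpy as np

def pick_successful_embedding(sketches, eps):
    """Given m sketches H_j A (each ell x d) from independent constant-probability
    subspace embeddings, return an index j that agrees with at least half of the
    others, i.e. all singular values of Sigma_j' V_j'^T V_j Sigma_j^{-1} lie in
    [1 - O(eps), 1 + O(eps)] for at least half of the indices j' != j."""
    svds = [np.linalg.svd(S, full_matrices=False) for S in sketches]   # (U_j, sigma_j, V_j^T)
    m = len(sketches)
    for j in range(m):
        _, sig_j, Vt_j = svds[j]
        agreements = 0
        for jp in range(m):
            if jp == j:
                continue
            _, sig_p, Vt_p = svds[jp]
            M = (sig_p[:, None] * Vt_p) @ (Vt_j.T / sig_j[None, :])    # Sigma_j' V_j'^T V_j Sigma_j^{-1}
            if np.all(np.abs(np.linalg.svd(M, compute_uv=False) - 1.0) <= 3 * eps):
                agreements += 1
        if agreements >= (m - 1) / 2:
            return j
    return 0   # with high probability this fallback is never reached
```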
Randomized SVD. The exact SVD of an $n \times d$ matrix is impractical when $n$ or $d$ is large. Here we show that the randomized SVD algorithm from [11] can be applied to speed up the computation without compromising the quality of the solution much. We need to use their specific form of randomized SVD since the error is with respect to the spectral norm, rather than the Frobenius norm, and so can be much smaller, as needed by our applications. The algorithm first probes the row space of the $\ell \times d$ input matrix $A$ with an $\ell \times 2t$ random matrix $\Omega$ and orthogonalizes the image of $A^\top \Omega$ to get a basis $Q$ (i.e., it QR-factorizes $A^\top \Omega$); it then projects the data onto this basis and computes the SVD factorization of the smaller matrix $AQ$. It also performs $q$ power iterations to push the basis towards the top $t$ right singular vectors.

Fast Distributed PCA for ℓ2-Error Fitting. We modify Algorithm disPCA by first having each node compute a subspace embedding locally, and then replacing each SVD invocation with a randomized SVD invocation. We thus arrive at Algorithm 2. For ℓ2-error fitting problems, by combining the approximation guarantees of the randomized techniques with those of distributed PCA, we are able to prove:

Theorem 6. Suppose Algorithm 2 takes $\epsilon \in (0, 1/2]$, $t_1 = t_2 = O(\max\{k/\epsilon^2, \log\frac{1}{\delta}\})$, $\ell = O(\frac{d^2}{\epsilon^2})$, and $q = O(\log\frac{d}{\epsilon})$ as input, and sets the failure probability of each local subspace embedding to $\delta' = \delta/(2s)$. Let $\tilde{P} = PVV^\top$. Then with probability at least $1 - \delta$, there exists a constant $c_0 \ge 0$ such that for any set of $k$ points $\mathcal{L}$,
$$(1 - \epsilon)\, d^2(P, \mathcal{L}) - \epsilon \|PX\|_F^2 \le d^2(\tilde{P}, \mathcal{L}) + c_0 \le (1 + \epsilon)\, d^2(P, \mathcal{L}) + \epsilon \|PX\|_F^2,$$
where $X$ is an orthonormal matrix whose columns span $\mathcal{L}$. The total communication is $O(skd/\epsilon^2)$ and the total time is $O\!\left(\mathrm{nnz}(P) + s \cdot d^3 k \cdot \mathrm{poly}(1/\epsilon)\right)$.

Proof Sketch: It suffices to show that $\tilde{P}$ enjoys the close projection property as in Lemma 4, i.e., $\|PX - \tilde{P}X\|_F^2 \approx 0$ and $\|PX\|_F^2 - \|\tilde{P}X\|_F^2 \approx 0$ for any orthonormal matrix $X$ whose columns span a low-dimensional subspace. Note that Algorithm 2 is just running Algorithm disPCA (with randomized SVD) on $TP$, where $T = \mathrm{diag}(H_1, H_2, \ldots, H_s)$, so we first show that $TP$ enjoys this property. But now the exact SVD is replaced with the randomized SVD, for which we need to use the spectral error bound to argue that the error introduced is small. More precisely, for a matrix $A$ and its SVD truncation $\hat{A}$ computed by the randomized SVD, it is guaranteed that the spectral norm of $A - \hat{A}$ is small; then $\|(A - \hat{A})X\|_F$ is small for any $X$ with small Frobenius norm, in particular for an orthonormal basis spanning a low-dimensional subspace. This suffices to guarantee that $TP$ enjoys the close projection property. Given this, it suffices to show that $P$ enjoys this property whenever $TP$ does, which follows from the definition of a subspace embedding.

Algorithm 2 Fast Distributed PCA for ℓ2-Error Fitting
Input: $\{P_i\}_{i=1}^s$; parameters $t_1, t_2$ for Algorithm disPCA; $\ell, q$ for the randomized techniques.
1: for each node $v_i \in V$ do
2: Compute the subspace embedding $P'_i = H_i P_i$.
3: end for
4: Run Algorithm disPCA on $\{P'_i\}_{i=1}^s$ to get $V$, where each SVD is computed by the randomized SVD.
Output: $V$.

4 Experiments

Our focus is to show that the randomized techniques used in Algorithm 2 reduce the time taken significantly without compromising the quality of the solution. We perform experiments for three tasks: rank-r approximation, k-means clustering, and principal component regression (PCR).

Datasets. We choose the following real world datasets from the UCI repository [1] for our experiments. For low rank approximation and k-means clustering, we choose two medium-size datasets, News Groups (18774 × 61188) and MNIST (70000 × 784), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes, 300000 × 102660) and PubMed abstracts (BOWpubmed, 8200000 × 141043). We use r = 10 for rank-r approximation and k = 10 for k-means clustering. For PCR, we use MNIST and further choose YearPredictionMSD (515345 × 90), CTslices (53500 × 386), and a large dataset, MNIST8m (800000 × 784).

Experimental Methodology. The algorithms are evaluated on a star network. The number of nodes is s = 25 for the medium-size datasets and s = 100 for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is assigned to a node with probability proportional to the node's weight, chosen from a power law with parameter α = 2. For each projection dimension, we first construct the projected data using distributed PCA. For low rank approximation, we report the ratio between the cost of the obtained solution and that of the solution computed by SVD on the global data. For k-means, we run the algorithm in [2] (with Lloyd's method as a subroutine) on the projected data to get a solution, and then report the ratio between the cost of this solution and that of a solution obtained by running Lloyd's method directly on the global data. For PCR, we perform regression on the projected data to get a solution, and then report the ratio between the error of this solution and that of a solution obtained by PCR directly on the global data. We stop an algorithm if it takes more than 24 hours. For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported.

Results. Figure 2 shows the results for low rank approximation. We observe that the error of the fast distributed PCA is comparable to that of the exact solution computed directly on the global data. This is also observed for distributed PCA with only one, or neither, of subspace embedding and randomized SVD. Furthermore, the error of the fast PCA is comparable to that of normal PCA, which means that the speedup techniques barely affect the accuracy of the solution. The second row shows the computational time, which suggests a significant decrease in the time taken to run the fast distributed PCA.
For example, on News Groups, the time of the fast distributed PCA improves over that of normal distributed PCA by a factor of 10 to 100. On the large dataset BOWpubmed, the normal PCA takes too long to finish and no results are presented, while the speedup versions produce good results in reasonable time. The use of the randomized techniques gives us a significant performance improvement while keeping the solution quality almost the same.

Figures 3 and 4 show the results for k-means clustering and PCR, respectively. Similar to low rank approximation, we observe that the distributed solutions are almost as good as those computed directly on the global data, and the speedups barely affect the solution quality. We again observe a huge decrease in the running time due to the speedup techniques.

Acknowledgments

This work was supported in part by NSF grants CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR grant N00014-09-1-0751, and AFOSR grant FA9550-091-0538. David Woodruff would like to acknowledge the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C0323, for supporting this work.

Figure 2: Low rank approximation. First row: error (normalized by baseline) vs. projection dimension. Second row: time vs. projection dimension. Panels cover the News Groups, MNIST, BOWnytimes, and BOWpubmed datasets; the methods compared are Fast_PCA, Only_Subspace, Only_Randomized, and Normal_PCA.

Figure 3: k-means clustering. First row: cost (normalized by baseline) vs. projection dimension. Second row: time vs. projection dimension. Panels cover the same datasets and methods as Figure 2.

Figure 4: PCR. First row: error (normalized by baseline) vs. projection dimension. Second row: time vs. projection dimension. Panels cover MNIST, YearPredictionMSD, CTslices, and MNIST8m, with the same methods as Figure 2.
References

[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[2] M.-F. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general communication topologies. In Advances in Neural Information Processing Systems, 2013.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
[4] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas. Stochastic dimensionality reduction for k-means clustering. CoRR, abs/1110.2897, 2011.
[5] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, 2013.
[6] M. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means clustering and low rank approximation. arXiv preprint arXiv:1410.6801, 2014.
[7] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google's globally-distributed database. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2012.
[8] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2011.
[9] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2013.
[10] M. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations. In ACM-SIAM Symposium on Discrete Algorithms, 2014.
[11] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 2011.
[12] R. Kannan, S. S. Vempala, and D. P. Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of the Conference on Learning Theory, 2014.
[13] N. Karampatziakis and P. Mineiro. Combining structured and unstructured randomness in large scale PCA. CoRR, abs/1310.6304, 2013.
[14] Y.-A. Le Borgne, S. Raybaud, and G. Bontempi. Distributed principal component analysis for wireless sensor networks. Sensors, 2008.
[15] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.
[16] S. V. Macua, P. Belanovic, and S. Zazo. Consensus-based distributed principal component analysis in wireless sensor networks. In Proceedings of the IEEE International Workshop on Signal Processing Advances in Wireless Communications, 2010.
[17] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Annual ACM Symposium on Theory of Computing, 2013.
[18] S. Mitra, M. Agrawal, A. Yadav, N. Carlsson, D. Eager, and A. Mahanti. Characterizing web-based video sharing workloads. ACM Transactions on the Web, 2011.
[19] J. Nelson and H. L. Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In IEEE Annual Symposium on Foundations of Computer Science, 2013.
[20] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003.
[21] Y. Qu, G. Ostrouchov, N. Samatova, and A. Geist. Principal component analysis for dimension reduction in massive distributed data sets. In Proceedings of the IEEE International Conference on Data Mining, 2002.
[22] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In IEEE Symposium on Foundations of Computer Science, 2006.