# On Distributed Averaging for Stochastic k-PCA

Aditya Bhaskara, School of Computing, University of Utah, bhaskara@cs.utah.edu
Maheshakya Wijewardena, School of Computing, University of Utah, pmaheshakya4@gmail.com

**Abstract.** In the stochastic k-PCA problem, we are given i.i.d. samples from an unknown distribution over vectors, and the goal is to compute the top $k$ eigenvalues and eigenvectors of the moment matrix. In the simplest distributed variant, we have $m$ machines, each of which receives $n$ samples. Each machine performs some computation and sends an $O(k)$-size summary of the local dataset to a central server. The server performs an aggregation and computes the desired eigenvalues and vectors. The goal is to achieve the same effect as the server computing using $mn$ samples by itself. The main choices in this framework are the choice of the summary and the method of aggregation. We consider a slight variant of the well-studied distributed averaging approach, and prove that it leads to significantly better bounds on the dependence between $n$ and the eigenvalue gaps. Our method can also be applied directly to a setting where the right value of the parameter $k$ (i.e., one for which there is a non-trivial eigenvalue gap) is not known exactly. This is a common issue in practice which prior methods were unable to address.

## 1 Introduction

Principal Component Analysis (PCA) is one of the classic tools for the analysis of high dimensional data. It is used in applications ranging from data visualization, to dimension reduction, to signal de-noising [16, 10]. Formally, the problem is the following: given a collection of data points $x_1, x_2, \ldots, x_n$, the aim is to find a subspace $U$ of dimension precisely $k$ that captures the most mass of the points. Specifically, the goal is to find a matrix $U \in \mathbb{R}^{d \times k}$ with orthonormal columns (corresponding to a basis for the desired subspace) so as to maximize $\|\Sigma U\|_F$, where $\Sigma$ is the covariance matrix of the data, defined as $\Sigma = \sum_i x_i x_i^T$. This problem can be solved efficiently by computing the singular value decomposition (SVD) (see [9]).

In the stochastic version of the problem, the data is viewed as samples from an unknown distribution $D$ over points in $\mathbb{R}^d$, and the goal is to find the top $k$ singular directions of the distribution covariance matrix (or the second moment matrix) $\Sigma = \mathbb{E}_{x \sim D}[xx^T]$. The question of how many samples from $D$ are needed to find a good estimate of the k-PCA is extensively studied ([15, 2, 20]), and tight bounds that involve the gap between the $k$th and the $(k+1)$th eigenvalues of $\Sigma$ can be obtained using matrix concentration inequalities [1, 22].

In this paper, we consider distributed algorithms for stochastic PCA, where the samples from $D$ are distributed across machines, and the goal is to use a small amount of communication and find a solution that approximates the PCA of the distribution. Our focus will be on the simplest model, where we have $m$ machines, each of which has access to $n$ i.i.d. samples of the data. Each machine sends one summary to a central server. The server, using the summaries from the different machines, computes the estimate of the PCA (this will be known as the aggregation step).

This distributed procedure is well-studied for various optimization problems [14, 24, 25]. The most well-known example is distributed convex optimization, where the goal is to optimize the loss $L(\theta) = \mathbb{E}_{x \sim D} f(x, \theta)$ for some convex function $f$. Here, it turns out that a simple procedure known as distributed averaging yields good guarantees: machines simply optimize the objective on their local dataset and send the solution $\theta$ to the server, and the central server averages the local solutions.
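To make this baseline concrete, here is a minimal sketch (our illustration, not code from the paper) of one-shot distributed averaging for a convex objective, with least squares as the local problem; all function and variable names below are our own.

```python
import numpy as np

def local_solve(X, y):
    # Each machine minimizes its local least-squares loss and returns the minimizer.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def distributed_average(datasets):
    # Server: average the locally optimal solutions (a single round of communication).
    return np.mean([local_solve(X, y) for X, y in datasets], axis=0)

# Tiny demo: m machines, n samples each, drawn from the same linear model.
rng = np.random.default_rng(0)
d, m, n = 10, 20, 200
theta_star = rng.normal(size=d)
datasets = []
for _ in range(m):
    X = rng.normal(size=(n, d))
    y = X @ theta_star + 0.1 * rng.normal(size=n)
    datasets.append((X, y))

theta_avg = distributed_average(datasets)
print("error of the averaged solution:", np.linalg.norm(theta_avg - theta_star))
```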
Tight bounds are known for distributed averaging for various convex objectives (see [24]).

A natural problem not covered by the general results on convex optimization is PCA. Garber et al. [8] studied the power of distributed averaging for PCA. They showed that for the problem of computing the top eigenvector, simply averaging the best vectors from the different machines does not work (the issue being one of having the right signs). However, it turns out that averaging with appropriately chosen signs works well, as long as there is a sufficient gap between $\lambda_1$ and $\lambda_2$. Fan et al. [7] extended this idea to the case of finding the top $k$ principal subspaces of the covariance matrix. They show that as long as every machine has sufficiently many samples (a quantity that depends on the gap between $\lambda_k$ and $\lambda_{k+1}$), distributed averaging of the projection matrices output by the different machines (followed by a k-SVD) yields a good approximation.

### 1.1 Problem setup and motivation

Let us start with some basic notation. For a real symmetric matrix $M$, we use $\lambda_j(M)$ to refer to its $j$th largest eigenvalue. We denote by $M_k$ the best rank-$k$ approximation of $M$. The trace of $M$ will be denoted $\mathrm{Tr}(M) = \sum_j \lambda_j(M)$, and the Frobenius norm is $\|M\|_F := \sqrt{\sum_j \lambda_j(M)^2}$. For $r \in [d]$, we define $\Delta_r$ to be the eigenvalue gap $\lambda_r - \lambda_{r+1}$.

**Formal setting for distributed stochastic PCA.** Let $D$ be an (unknown) sub-Gaussian distribution over vectors in $\mathbb{R}^d$. As in the prior works [7, 8], sub-Gaussian means that there exists a constant $C > 0$ such that $\|(u^T x)^2\|_{\psi_1} \le C\,\mathbb{E}[(u^T x)^2]$ for all $u \in \mathbb{R}^d$ ([12, 18]); here the $\psi_1$ norm of a random variable $X$ is $\|X\|_{\psi_1} = \sup_{p \ge 1} (\mathbb{E}|X|^p)^{1/p}/p$ (see [23]). We have $m$ machines, each of which receives $n$ i.i.d. samples from $D$. $A$ denotes the covariance matrix of the distribution $D$, i.e., $A = \mathbb{E}_{x \sim D}[xx^T]$. Let the spectral decomposition of $A$ be $U\Lambda U^T$, where $U \in \mathbb{R}^{d \times d}$ is a matrix with orthonormal columns and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$. The aim is to find the vectors $u_1, u_2, \ldots, u_k$ (the first $k$ columns of $U$) and the corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$.

**Motivation.** The works [7, 8] have two key limitations. First, they are aimed at finding the k-PCA subspace for a given $k$. Even modifying the goal slightly, e.g., requiring the algorithm to output each of the top $k$ PCA directions individually, requires more communication and a sample complexity (per machine) that has a quadratic dependence on the individual gaps, as we explain below. Second, and more importantly, these works assume knowledge of a number $k$ for which there exists an eigenvalue gap. This is quite unrealistic in practice. Can we design algorithms that work with only a rough idea of the location of the gap? Our main contribution is to handle these two issues. We provide novel estimation bounds and validate them using experiments on real and synthetic data.

The first restriction above is quite serious at a quantitative level. Ignoring other terms, the works of [7, 8] require (in order to estimate the k-subspace) a value of $n \gtrsim 1/\Delta_k^2$. Intuitively, this corresponds to a requirement that each machine have a rough estimate of the top-k PCA subspace. Their main results can then be interpreted as saying that under this assumption, the server can obtain a significantly better estimate of the top-k subspace than the individual machines. Further, the estimation errors are only obtained for the matrix $U_k U_k^T$ (i.e., the projection matrix onto the top-k subspace). If one needs each of the top $k$ singular directions, the procedure needs to be re-done for each index (and this requires $n \gtrsim 1/\Delta_{\min}^2$, where $\Delta_{\min} = \min_{i \le k} \Delta_i$, which could be tiny).
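For contrast with what follows, here is a sketch of our reading of the unweighted projection-matrix averaging described above (the approach of [7] as summarized in this section): each machine sends its top-$k$ eigenvectors, and the server averages the projection matrices and takes a k-SVD of the average. The code and names are illustrative, not the authors'.

```python
import numpy as np

def local_topk_basis(X, k):
    # Top-k eigenvectors of the local empirical second-moment matrix (d x k, orthonormal columns).
    A_hat = X.T @ X / X.shape[0]
    _, vecs = np.linalg.eigh(A_hat)          # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]

def unweighted_average_pca(local_bases, k):
    # Server: average the projection matrices U U^T, then return the top-k eigenvectors of the average.
    d = local_bases[0].shape[0]
    P_bar = np.zeros((d, d))
    for U in local_bases:
        P_bar += U @ U.T
    P_bar /= len(local_bases)
    _, vecs = np.linalg.eigh(P_bar)
    return vecs[:, ::-1][:, :k]              # an estimate of the top-k subspace only
```

Note that the averaged projection matrix carries no information about the eigenvalues themselves (each individual $UU^T$ has all of its nonzero eigenvalues equal to 1), which is one way to see why this summary is tied to a single, correctly chosen $k$.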
Note that our setting is slightly different from the deterministic version of distributed PCA, where we have a matrix whose columns are arbitrarily distributed across machines, and the goal is to find the best k-subspace. Moreover, the objective there is not always to find the right eigenvalues/vectors, but to approximate the value of the low-rank error (see [4, 11] and references therein). These results extend the works of [19, 5], where the power of subspace embeddings in matrix approximation is carried over to distributed settings. In this case, in order to obtain a $(1+\epsilon)$ low-rank approximation error in Frobenius norm for $A$, each machine needs to communicate $O(k/\epsilon)$ vectors, and the sketching methods perform better as the sketch size grows. In contrast, a part of our goal is to discuss the trade-off between $n$ and the quality of the approximations. Applying these sketching methods in our setting would yield error terms that depend only on the sketch size and cannot be controlled by $n$ or $m$, which makes them undesirable for this setting.

### 1.2 Our contributions

In Theorem 1 and its corollaries (Section 2.1), we show that as long as $n \ge \Omega(1/\Delta_k^2)$, we can (using a single sketch) compute estimates $\tilde v_i$ of each of the vectors $v_1, v_2, \ldots, v_k$ up to an error of roughly $\frac{\kappa_1}{\delta_i n} + \frac{\kappa_2}{\delta_i \sqrt{mn}}$, where $\delta_i := \min(\lambda_{i-1} - \lambda_i,\ \lambda_i - \lambda_{i+1})$ and $\kappa_1, \kappa_2$ are factors that are not dominant when $\Delta_k$ is large enough. Further, the amount of communication per machine is $O(kd)$, i.e., each machine communicates $k$ vectors in $\mathbb{R}^d$.

**Remark.** Note that in order to achieve $\|\tilde v_i - v_i\| \le 1/4$, the above bound requires, roughly, $\min(n, \sqrt{mn}) \gtrsim 4/\delta_i$ (up to the factors $\kappa_1, \kappa_2$). To achieve this error using prior work, one needs $n \ge \Omega(1/\delta_i^2)$. Our result can thus be a significant improvement as $m$ grows. Specifically, in our setting the individual machines need not be able to obtain any estimate of $v_i$, but the corresponding average is still accurate.

Our next result (Theorem 9) is in a setting in which we only approximately know the location of the gap in the eigenvalues (as is common in practice). In particular, suppose that there exists a $k \in (k_0, k_1)$ such that $\Delta_k$ is large enough. Then, using a single sketch of $k_1$ vectors (i.e., space $O(k_1 d)$), we show an efficient way to find $k$, and thus also achieve the same guarantees as in our first result (described above). Using prior results (at least directly) to solve this problem leads to two issues. First, we would need to run the averaging procedures for each $k$ in the range $(k_0, k_1)$. More importantly, it is not clear how to determine which of the PCA directions obtained are accurate and which are not (because the accuracy guarantees depend on the consecutive gaps, and we do not know which gap is large).

Finally, we run experiments on both real and synthetic datasets (the latter gives us a way to control the eigenvalue gaps), and establish that our theoretical bounds are reflected accurately in practice.

## 2 Spectral approximation via distributed averaging

We start by introducing notation that we will use for the rest of the paper and stating the theorems formally. Recall that the $j$th machine gets $n$ i.i.d. vectors from a sub-Gaussian distribution $D$; let $\hat A^{(j)}$ denote its empirical covariance matrix. Also, for the $j$th machine, let $\hat A_k^{(j)}$ denote the best rank-$k$ approximation of $\hat A^{(j)}$, and denote its SVD by $\hat U_k^{(j)} \hat\Lambda_k^{(j)} (\hat U_k^{(j)})^T$. We also define $\hat V_k^{(j)} = \hat U_k^{(j)} (\hat\Lambda_k^{(j)})^{1/2}$. The columns of this matrix are what each machine sends to the central server. Finally, we define the average across machines: $\bar A_k = \frac{1}{m} \sum_{j \in [m]} \hat A_k^{(j)}$.

**Algorithm 1** (Distributed Averaging, parameter $k$)
- **Local:** On each machine, compute the rank-$k$ SVD of the empirical covariance matrix $\hat A^{(j)}$, and send $\hat V_k^{(j)}$ (as defined above) to the server.
- **Server:** On the central server, compute $\bar A_k = \frac{1}{m} \sum_{j=1}^m \hat V_k^{(j)} (\hat V_k^{(j)})^T$. Then output the top $k$ eigenvalues and the corresponding eigenvectors of $\bar A_k$.
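The following is a minimal NumPy sketch of Algorithm 1 (our own illustrative implementation; the function names are not from the paper). Each machine sends $\hat V_k^{(j)} = \hat U_k^{(j)} (\hat\Lambda_k^{(j)})^{1/2}$, and the server averages the matrices $\hat V_k^{(j)} (\hat V_k^{(j)})^T$ and eigendecomposes the result.

```python
import numpy as np

def local_summary(X, k):
    # Machine j: rank-k eigendecomposition of the empirical covariance; return V_k = U_k Lambda_k^{1/2}.
    A_hat = X.T @ X / X.shape[0]
    vals, vecs = np.linalg.eigh(A_hat)                 # ascending order
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]  # top-k, descending
    return vecs * np.sqrt(np.clip(vals, 0.0, None))    # d x k summary sent to the server

def server_aggregate(summaries, k):
    # Server: average V V^T over machines, then output the top-k eigenvalues and eigenvectors.
    m = len(summaries)
    A_bar = sum(V @ V.T for V in summaries) / m
    vals, vecs = np.linalg.eigh(A_bar)
    return vals[::-1][:k], vecs[:, ::-1][:, :k]

# Usage: given a list `machines` of n x d local sample matrices,
#   eigvals_hat, eigvecs_hat = server_aggregate([local_summary(X_j, k) for X_j in machines], k)
```

Since $\hat V_k^{(j)} (\hat V_k^{(j)})^T = \hat A_k^{(j)}$, the matrix the server averages is exactly $\bar A_k$ as defined above.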
The procedure above differs from the prior works [8, 7] in the choice of the summary (i.e., what the individual machines send to the central server): Algorithm 1 uses the eigenvectors weighted by the square roots of the eigenvalues, while unweighted vectors are used in the prior work. This turns out to give us three advantages: (i) an efficient way to obtain the individual eigenvectors $v_1, v_2, \ldots, v_k$, (ii) the ability to reuse the summary for different values of $k$, as we will see in Section 4, and (iii) improved bounds on the parameters $m, n$ (especially in experiments).

### 2.1 Guarantees for eigenvalue and eigenvector estimation

We now formally state the guarantees obtained by the procedure above for estimating the eigenvalues and eigenvectors of $A$. We start with a general theorem about approximating the best rank-$k$ approximation $A_k$ of $A$, which will imply both of these statements.

**Theorem 1.** There exist constants $C_1$ and $C_2$ such that
$$\big\|\, \|\bar A_k - A_k\|_F \,\big\|_{\psi_1} \;\le\; C_1\,\frac{\kappa_1}{n} + C_2\,\frac{\kappa_2}{\sqrt{mn}}, \qquad \text{where } \kappa_1 = \frac{\sqrt{k}\,\lambda_1^2\,\mathrm{Tr}(A)}{\Delta_k^2} \text{ and } \kappa_2 = \frac{\lambda_1 \sqrt{k\,\lambda_1\,\mathrm{Tr}(A)}}{\Delta_k}.$$

The statement uses the sub-Gaussian ($\psi_1$) norm described in Section 1.1. By definition, we can rephrase the theorem as a concentration bound: the probability that $\|\bar A_k - A_k\|_F$ exceeds $\log(1/\delta)$ times the right-hand side is at most $\delta$. Using known perturbation bounds, we can show the following corollaries.

**Corollary 2.** For all $i \le k$, there exist constants $C_1$ and $C_2$ such that $\big\|\, |\lambda_i(\bar A_k) - \lambda_i(A)| \,\big\|_{\psi_1} \le C_1\,\frac{\kappa_1}{n} + C_2\,\frac{\kappa_2}{\sqrt{mn}}$.

The corollary follows from Theorem 1 using Weyl's inequality [21].

**Corollary 3.** Define $\delta_i = \min\{\lambda_{i-1} - \lambda_i,\ \lambda_i - \lambda_{i+1}\}$ (to deal with the border cases $i = 1, d$, define $\lambda_0 = +\infty$ and $\lambda_{d+1} = -\infty$). For $i \le k$, there exist constants $C_1$ and $C_2$ such that
$$\big\|\, 1 - (\hat u_i^T u_i)^2 \,\big\|_{\psi_1} \;\le\; C_1\,\frac{\kappa_1^2}{\delta_i^2 n^2} + C_2\,\frac{\kappa_2^2}{\delta_i^2 mn},$$
where $\hat u_i$ is the eigenvector corresponding to the $i$th largest eigenvalue of $\bar A_k$.

The proof follows from the Davis-Kahan sin-$\Theta$ theorem [21]. As outlined in Section 1.2, when the gap $\delta_i \ll \Delta_k$, using the summary corresponding to $k$ has a significant advantage over using the one for $i$. This results in a better guarantee (compared to the procedures of [8, 7]) when recovering $u_i$, for $1 \le i \le k$.

## 3 Analysis: estimating the rank-k approximation of A

Our goal in this section will be to show that $\bar A_k$ approximates $A_k$ accurately, thereby proving Theorem 1.

**Outline of the argument.** The key step is to define the matrix $A'$, which is the expectation of $\hat A_k^{(j)}$. As all the machines receive inputs drawn from $D$, this is independent of $j$. The argument proceeds in two steps, similar to the works of [7] and [8]. The first step is showing that $A' - A_k$ (in other words, the bias) is small. This is the harder step, and involves showing that one obtains non-trivial cancellations: even though $\hat A_k^{(j)} - A_k$ is of the order $O(1/\sqrt{n})$, we will show that $A' - A_k$ is of the order $O(1/n)$. The second step is to show that $\bar A_k$, which is the empirical average of $\hat A_k^{(j)}$ over the $m$ machines, is close to $A'$. This is proved using a matrix concentration bound, originally due to [3]. To summarize, let us define (noting that the right-hand side is independent of $j$) $A' = \mathbb{E}[\hat A_k^{(j)}]$. We have
$$\|\bar A_k - A_k\|_F \;\le\; \|A' - A_k\|_F + \|\bar A_k - A'\|_F.$$
The first term will be referred to as the bias and the second as the variance. In what follows we bound the two terms separately.
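The cancellation claimed in the outline can be checked numerically. The sketch below (our own illustration, on Gaussian data with a planted gap) compares the typical per-machine error $\|\hat A_k^{(j)} - A_k\|_F$, of order $1/\sqrt{n}$, with the error of the average over many machines, which is governed by the much smaller bias plus an $O(1/\sqrt{mn})$ fluctuation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n, m = 20, 3, 500, 400

# Population covariance with a gap below the k-th eigenvalue.
eigvals = np.sort(rng.uniform(0.1, 1.0, size=d))[::-1]
eigvals[:k] += 2.0
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = (U * eigvals) @ U.T
A_k = (U[:, :k] * eigvals[:k]) @ U[:, :k].T            # best rank-k approximation of A

def rank_k(M, k):
    vals, vecs = np.linalg.eigh(M)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    return (vecs * vals) @ vecs.T

local_approximations = []
for _ in range(m):
    X = rng.multivariate_normal(np.zeros(d), A, size=n)
    local_approximations.append(rank_k(X.T @ X / n, k))

per_machine_error = np.mean([np.linalg.norm(L - A_k) for L in local_approximations])
averaged_error = np.linalg.norm(np.mean(local_approximations, axis=0) - A_k)
print(per_machine_error, averaged_error)                # the second is much smaller
```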
### 3.1 Analyzing the bias term

We now show the following theorem about the bias term.

**Theorem 4.** There is a constant $C$ such that $\|A' - A_k\|_F \le C\,\frac{\sqrt{k}\,\lambda_1^2\,\mathrm{Tr}(A)}{\Delta_k^2\, n}$.

In what follows, we abuse notation slightly and write $\hat A = \hat A^{(j)}$ for some machine $j$; as we are ultimately interested in the expectation, the choice of $j$ will not matter. Write $\hat A = A + E$ and let $\epsilon = \|E\|_2/\Delta_k$. By definition, $A' = \mathbb{E}[\hat A_k]$. Let us also define the projection matrices $\Pi = U_k U_k^T$ and $\hat\Pi = \hat U_k \hat U_k^T$.

The main idea behind the proof of Theorem 4 is to express $\hat A_k - A_k$ on a single machine in terms that are linear and quadratic in $E$. Once we take the expectation of this error, the linear terms in $E$ (which are of order $O(1/\sqrt{n})$ and dominant in magnitude) vanish, giving the $O(1/n)$ bound on the bias in Theorem 4.

The first lemma gives a coarse bound, which we will use when $E$ is large.

**Lemma 5.** Let $\hat A_k$ be the rank-$k$ approximation on one of the machines, and let $E$ be as defined above. Then $\hat A_k - A_k = \Pi E + H$, where $\|H\|_F \le \frac{2\sqrt{k}\,\|E\|_2^2}{\Delta_k}$.

The next lemma shows that when $\epsilon = \|E\|_2/\Delta_k$ is small, we have a much better bound.

**Lemma 6.** Let $\hat A$ satisfy the condition $\epsilon = \|\hat A - A\|_2/\Delta_k \le 1/10$. There exist a linear function $f : \mathbb{R}^{d \times k} \to \mathbb{R}^{d \times k}$ and a constant $C$ such that
$$\hat A_k - A_k = \Pi E + \big(f(EU_k) U_k^T + U_k f(EU_k)^T\big) A + H, \qquad \text{where } \|H\|_F \le \frac{C\sqrt{k}\,\|E\|_2^2\,(\lambda_1 + \|E\|_2)}{\Delta_k^2}.$$

The lemma is a consequence of a result in [7] showing that in this case, $\hat\Pi$ has a sufficiently good first-order approximation in terms of $E$.

*Proof.* Lemma 2 of [7] shows that $\hat\Pi = \Pi + f(EU_k)U_k^T + U_k f(EU_k)^T + E'$, where (a) $f$ is a linear function as in the statement of the lemma that also satisfies $\|f(\cdot)\|_F \le \|\cdot\|_F/\Delta_k$, and (b) $E'$ is a matrix with $\|E'\|_F \le 24\sqrt{k}\,\|E\|_2^2/\Delta_k^2$ (this is only true under the assumption we have, i.e., $\epsilon \le 1/10$). Using this, together with $\Pi A = A_k$,
$$\hat A_k - A_k = \hat\Pi(A + E) - A_k = \big(\Pi + f(EU_k)U_k^T + U_k f(EU_k)^T + E'\big)(A + E) - A_k$$
$$= \Pi E + \big(f(EU_k)U_k^T + U_k f(EU_k)^T\big)A + \big(f(EU_k)U_k^T + U_k f(EU_k)^T\big)E + E'A + E'E.$$
Thus, to show the lemma, the error term to bound is $H = \big(f(EU_k)U_k^T + U_k f(EU_k)^T\big)E + E'A + E'E$. For the first term, note that $\|f(EU_k)\|_F \le \|EU_k\|_F/\Delta_k \le \sqrt{k}\,\|E\|_2/\Delta_k$. Thus we have
$$\big\|\big(f(EU_k)U_k^T + U_k f(EU_k)^T\big)E\big\|_F \le \big\|f(EU_k)U_k^T + U_k f(EU_k)^T\big\|_F\,\|E\|_2 \le \frac{2\sqrt{k}\,\|E\|_2^2}{\Delta_k}.$$
The second term can be bounded (using the bound on $E'$ above) by $24\sqrt{k}\,\lambda_1\,\|E\|_2^2/\Delta_k^2$. Using the bound on $E'$ again for the third term completes the proof of the lemma.

Note that the two lemmas give different linear approximations of $\hat A_k - A_k$. However, in order to take the expectation, we need the same function in both regimes. Luckily, the approximation from Lemma 6 can be used in place of the one from Lemma 5, with small error. To this end, note that
$$\big\|\big(f(EU_k)U_k^T + U_k f(EU_k)^T\big)A\big\|_F \le 2\,\|f(EU_k)\|_F\,\|A\|_2 \le \frac{2\sqrt{k}\,\lambda_1\,\|E\|_2}{\Delta_k} \qquad (1)$$
from the property of $f$ mentioned earlier (shown in [7]). We can now prove Theorem 4.

*Proof of Theorem 4.* Using the observation in (1), Lemma 5 implies that for all $E$, we have
$$\hat A_k - A_k = \Pi E + \big(f(EU_k)U_k^T + U_k f(EU_k)^T\big)A + H, \qquad \text{where } \|H\|_F \le \frac{4\sqrt{k}\,\|E\|_2\,(\lambda_1 + \|E\|_2)}{\Delta_k}. \qquad (2)$$
Using this expression when $\epsilon \ge 1/10$, and the one from Lemma 6 when $\epsilon$ is smaller, we can now take the expected value of $\hat A_k - A_k$. The linear terms in $E$ evaluate to zero (since $\mathbb{E}[E] = 0$ and $f$ is linear).
Thus we have
$$\|\mathbb{E}[\hat A_k - A_k]\|_F \;\le\; \mathbb{E}\big[Q_1\,\mathbf{1}[\epsilon \ge 1/10]\big] + \mathbb{E}\big[Q_2\,\mathbf{1}[\epsilon < 1/10]\big],$$
where $Q_1$ and $Q_2$ are the bounds on $\|H\|_F$ from (2) and Lemma 6 respectively. Now, on the event $\|E\|_2/\Delta_k \ge 1/10$, it is trivially true that $\|E\|_2/\Delta_k \le 10\,\|E\|_2^2/\Delta_k^2$. Thus we can simplify the above as
$$\|\mathbb{E}[\hat A_k - A_k]\|_F \;\le\; C'\,\mathbb{E}\left[\frac{\sqrt{k}\,\|E\|_2^2\,(\lambda_1 + \|E\|_2)}{\Delta_k^2}\right].$$
Using the sub-Gaussian property of the moments of our distribution, the expectation above is dominated by the $\mathbb{E}[\|E\|_2^2]$ term (due to the multiplier $\lambda_1$), and $\mathbb{E}[\|E\|_2^2] = O(\lambda_1\,\mathrm{Tr}(A)/n)$ for sub-Gaussian distributions. This gives
$$\|\mathbb{E}[\hat A_k - A_k]\|_F \;\le\; C\,\frac{\sqrt{k}\,\lambda_1^2\,\mathrm{Tr}(A)}{n\,\Delta_k^2}.$$
Since $\mathbb{E}[\hat A_k] = A'$, this completes the proof of the theorem.

### 3.2 Analyzing the variance term

We now need to show that the average of the matrices $\hat A_k$ is close to the expectation (which is $A'$). The main idea is to use the concentration inequality due to Bosq [3] (see also Lemma 4 of [7]), which lets us bound the $\psi_1$ norm of the average of i.i.d. random variables using the $\psi_1$ norm of the individual variables. To this end, we first show the following.

**Lemma 7.** Suppose each machine receives $n$ points, where $n \ge \lambda_1\,\mathrm{Tr}(A)/\Delta_k^2$. Then there is a constant $C$ such that $\big\|\, \|\hat A_k - A'\|_F \,\big\|_{\psi_1} \le C\,\kappa_2/\sqrt{n}$, with $\kappa_2 = \lambda_1\sqrt{k\,\lambda_1\,\mathrm{Tr}(A)}/\Delta_k$ as in Theorem 1.

Now we analyze the average of the $\hat A_k$.

**Theorem 8.** There exists a constant $C$ such that the matrix $\bar A_k$, i.e., the average of the matrices $\hat A_k^{(i)}$, satisfies $\big\|\, \|\bar A_k - A'\|_F \,\big\|_{\psi_1} \le C\,\kappa_2/\sqrt{mn}$.

Theorems 4 and 8 together complete the proof of our main approximation result, Theorem 1. As observed in Section 2.1, this also completes the proofs of Corollaries 2 and 3.

## 4 Algorithm for imprecise k

We now consider the setting in which we do not exactly know the value of $k$ for which $\lambda_k - \lambda_{k+1}$ is "large". Knowing that some $k$ in the interval $(k_0, k_1)$ satisfies an appropriate gap assumption, we will give an algorithm that can, using $O(k_1)$ columns of communication per machine, (a) find such a $k$, and (b) compute all the eigenvalues and eigenvectors with guarantees matching those from the case in which we know $k$ (i.e., Theorem 1).

Our algorithm relies on the following theorem. As defined earlier, for any $t \ge 1$, denote $\bar A_t := \frac{1}{m}\sum_i \hat A_t^{(i)}$ (i.e., the empirical average of the rank-$t$ approximations on the individual machines). The main advantage of having every machine send the eigenvectors $\hat v_i$ scaled by $\sqrt{\hat\lambda_i}$ (Algorithm 1) is that if a machine sends this information for $1 \le i \le k_1$, then the central server can compute $\bar A_t$ for every $1 \le t \le k_1$.

**Theorem 9.** Let $\bar A_t$ be defined as above, and let $\delta > 0$ be a given parameter. Let $k$ be an integer for which $\Delta_k = \lambda_k - \lambda_{k+1}$ is sufficiently large; in particular, so that for the given $m, n, \delta$ we have
$$\Delta_k \;\ge\; C\left(\frac{\kappa_1}{n} + \frac{\kappa_2}{\sqrt{mn}}\right)\log(1/\delta),$$
where $\kappa_1$ and $\kappa_2$ are as defined in Theorem 1, and $C$ is an appropriate constant. Then with probability at least $1 - \delta$, for all $t \ge k$, the matrix $\bar A_t$ has its top $k+1$ eigenvalues $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_k \ge \theta_{k+1}$ satisfying
$$|\theta_i - \lambda_i| \le O\!\left(\frac{\kappa_1}{n} + \frac{\kappa_2}{\sqrt{mn}}\right)\log(1/\delta) \quad \text{for } 1 \le i \le k, \qquad (3)$$
$$\theta_{k+1} \le \lambda_{k+1} + O\!\left(\frac{\kappa_1}{n} + \frac{\kappa_2}{\sqrt{mn}}\right)\log(1/\delta). \qquad (4)$$

The proof of this theorem is deferred to Section A.4 of the supplementary material.

**Estimating the location of the gap.** The theorem shows that one can use any value of $t \ge k$ in order to estimate all the eigenvalues up to $k$, and also the gap between the $k$th and the $(k+1)$th. Thus, if we only know the approximate location of a gap (some $k$ between $k_0$ and $k_1$), we can use Algorithm 1 with $k = k_1$ and, using the above result, find the $k$ with the desired gap. Knowing $k$, the server can then compute $\bar A_k$ (using only the information it already has), and this leads to a finer estimate of the matrix $A_k$. A sketch of this server-side procedure is given at the end of this section.

**Estimating the eigenvectors.** The theorem above can also be used (together with the sin-$\Theta$ theorem) to obtain eigenvector estimation guarantees (as in Corollary 3). While the bounds are qualitatively similar to those in Corollary 3, we observe that in practice, using $t > k$ is significantly better for approximating the top $k$ eigenspace to a good accuracy.
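Below is a sketch of how the server-side procedure of this section might be implemented, given the $k_1$-column summaries of Algorithm 1 (our own illustrative code; the gap-selection rule is a simple heuristic consistent with Theorem 9, not pseudocode from the paper).

```python
import numpy as np

def eigenvalues_of_A_bar_t(summaries, t):
    # Spectrum of bar{A}_t, computed from the k1-column summaries by truncating each to its first t
    # columns (columns are assumed to be ordered by decreasing local eigenvalue).
    A_bar_t = sum(V[:, :t] @ V[:, :t].T for V in summaries) / len(summaries)
    return np.linalg.eigvalsh(A_bar_t)[::-1]              # descending

def locate_gap(summaries, k0, k1):
    # Estimate the eigenvalues of bar{A}_{k1} and pick the candidate k in [k0, k1) with the largest gap.
    theta = eigenvalues_of_A_bar_t(summaries, k1)
    gaps = theta[:-1] - theta[1:]
    return max(range(k0, k1), key=lambda k: gaps[k - 1])  # gaps[k-1] = theta_k - theta_{k+1} (1-indexed)

# Once k_hat = locate_gap(...) is known, the server recomputes bar{A}_{k_hat} from the same summaries
# and reports its top k_hat eigenvalues and eigenvectors, as in Algorithm 1.
```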
## 5 Experiments

We validate our results with experiments on synthetic and real datasets. We simulated a distributed environment on a single machine.

### 5.1 Synthetic dataset

We generated vectors in $\mathbb{R}^d$ (with $d = 50$) from a multivariate Gaussian distribution with mean $0$ and covariance matrix $A = U\Lambda U^T$, where $\Lambda(1,1) = 1$, $\Lambda(i,i) = 0.9\,\Lambda(i-1,i-1)$ for $i = 2, \ldots, 6$, $\Lambda(7,7) = \Lambda(6,6) - 0.3$, and $\Lambda(i,i) = 0.9\,\Lambda(i-1,i-1)$ for $7 < i \le 50$. So there is a gap of $\Delta_6 = 0.3$. In this experiment we fixed the number of machines to $m = 50$. We computed the top 3 eigenvectors using Algorithm 1 while increasing the number of points $n$ per machine, and compared them with the top 3 eigenvectors of the population covariance matrix. Note that it is not possible to use the prior methods to compute individual eigenvectors. We first computed the eigenvectors by communicating only the top $k = 3$ weighted vectors, and then by communicating $k = 7$ weighted vectors so that the summary includes the eigengap $\Delta_6$. We computed the error $1 - (\hat u_i^T u_i)^2$, $i = 1, 2, 3$, for both cases; the results are averaged over 200 iterations. As observed in Figure 1, these results are consistent with our theoretical bounds for the case where the correct eigengap is not known (but is located within the $k$ we communicate).

Figure 1: Estimation errors of the first eigenvector (left), second eigenvector (middle), and third eigenvector (right) for $k = 3$ and $k = 7$ vs. sample size $n$ per machine.

### 5.2 Real datasets

We used 3 real datasets to evaluate our methods (Table 1) [13, 17, 6]. Each dataset has $N$ points and $d$ features. In these experiments we compute the error of the subspace spanned by the top $r$ eigenvectors, $\|\hat U_r \hat U_r^T - U_r U_r^T\|_F$. We consider each dataset $X$ as the population matrix, so the population covariance matrix is $A = XX^T$. On each machine we sampled $n$ columns of $X$ uniformly at random. For these experiments we fixed the number of machines to $m = 50$, and each result is averaged over 200 iterations.

| Dataset | N | d | r | t |
|---|---|---|---|---|
| MNIST-small | 20000 | 196 | 5 | 15 |
| NIPS-papers | 11463 | 150 | 5 | 15 |
| FMA-music | 21314 | 518 | 10 | 70 |

Table 1: Dataset information.

We compare the prior method of [7] (unweighted) with our methods. In one case we communicate exactly $r$ vectors (weighted $r$), and in the other case we communicate a slightly higher ($t > r$) number of vectors (weighted $t$). This is towards demonstrating our theoretical results for the case where we do not know the exact location of a reasonable eigengap. Note that it is not possible to compute the correct eigenspace using the prior methods if we do not communicate the exact number of required vectors. Similarly to the synthetic dataset experiments, we computed the error of each method while varying the sample size $n$ per machine (Figure 2).

Figure 2: Estimation errors of the top $r$ subspace of the MNIST-small dataset (left), NIPS-papers dataset (middle), and FMA-music dataset (right) vs. sample size $n$ per machine, comparing unweighted, weighted $r$, and weighted $t$ averaging.
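For completeness, here is a sketch (ours, not the authors' experiment code) of the synthetic covariance construction of Section 5.1 and the two error metrics used in this section; the choice of a random orthonormal $U$ is our assumption, since the paper does not specify it.

```python
import numpy as np

def synthetic_covariance(d=50, seed=0):
    # Spectrum from Section 5.1: geometric decay with an extra gap Delta_6 = 0.3.
    lam = np.zeros(d)
    lam[0] = 1.0
    for i in range(1, 6):
        lam[i] = 0.9 * lam[i - 1]
    lam[6] = lam[5] - 0.3
    for i in range(7, d):
        lam[i] = 0.9 * lam[i - 1]
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(d, d)))          # random orthonormal eigenvectors (our assumption)
    return (U * lam) @ U.T, U, lam

def eigenvector_error(u_hat, u):
    # Per-direction error 1 - (u_hat^T u)^2 used in Section 5.1.
    return 1.0 - float(u_hat @ u) ** 2

def subspace_error(U_hat_r, U_r):
    # Subspace error || U_hat_r U_hat_r^T - U_r U_r^T ||_F used in Section 5.2.
    return np.linalg.norm(U_hat_r @ U_hat_r.T - U_r @ U_r.T)
```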
## References

[1] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, March 2002.

[2] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), pages 3174–3182. Curran Associates Inc., 2013.

[3] Denis Bosq. Stochastic processes and random variables in function spaces. In Linear Processes in Function Spaces, pages 15–42. Springer, 2000.

[4] Christos Boutsidis, David P. Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC), pages 236–249. ACM, 2016.

[5] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pages 81–90. ACM, 2013.

[6] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. FMA: A dataset for music analysis, 2017.

[7] Jianqing Fan, Dong Wang, Kaizheng Wang, and Ziwei Zhu. Distributed estimation of principal eigenspaces. arXiv preprint arXiv:1702.06488, 2017.

[8] Dan Garber, Ohad Shamir, and Nathan Srebro. Communication-efficient algorithms for distributed stochastic principal component analysis. In Proceedings of the 34th International Conference on Machine Learning (ICML), Volume 70, pages 1203–1212. JMLR.org, 2017.

[9] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996.

[10] Ian Jolliffe. Principal Component Analysis. Springer, 2011.

[11] Ravi Kannan, Santosh Vempala, and David Woodruff. Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory (COLT), pages 1040–1057, 2014.

[12] Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators. arXiv preprint arXiv:1405.2468, 2014.

[13] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[14] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT'10), pages 456–464. Association for Computational Linguistics, 2010.

[15] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, November 1982.

[16] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

[17] Valerio Perrone, Paul A. Jenkins, Dario Spano, and Yee Whye Teh. Poisson random fields for dynamic feature models. arXiv preprint arXiv:1611.07460, 2016.

[18] Markus Reiß and Martin Wahl. Non-asymptotic upper bounds for the reconstruction error of PCA. arXiv preprint arXiv:1609.03779, 2016.

[19] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152. IEEE, 2006.

[20] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pages 144–152. JMLR.org, 2015.

[21] Gilbert W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990.

[22] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, August 2012.
[23] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[24] Yuchen Zhang, Martin J. Wainwright, and John C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.

[25] Martin A. Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. Parallelized stochastic gradient descent. In Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS'10), pages 2595–2603. Curran Associates Inc., 2010.