# Communication-Efficient Distributed SVD via Local Power Iterations

Xiang Li¹, Shusen Wang², Kun Chen¹, Zhihua Zhang¹

Abstract

We study distributed computing of the truncated singular value decomposition problem. We develop an algorithm that we call LocalPower for improving communication efficiency. Specifically, we uniformly partition the dataset among m nodes and alternate between multiple (precisely p) local power iterations and one global aggregation. In the aggregation, we propose to weight each local eigenvector matrix with an orthogonal Procrustes transformation (OPT). As a practical surrogate of OPT, sign-fixing, which uses a diagonal matrix with ±1 entries as weights, has better computational complexity and stability in experiments. We theoretically show that, under certain assumptions, LocalPower lowers the required number of communications by a factor of p to reach a constant accuracy. We also show that the strategy of periodically decaying p helps obtain high-precision solutions. We conduct experiments to demonstrate the effectiveness of LocalPower.

1. Introduction

In this paper we consider the truncated singular value decomposition (SVD), which has broad applications in machine learning, such as dimension reduction (Wold et al., 1987), matrix completion (Candès & Recht, 2009), and information retrieval (Deerwester et al., 1990). Let $a_1, \ldots, a_n \in \mathbb{R}^d$ be sampled i.i.d. from some fixed but unknown distribution. The goal is to compute the top $k$ ($k < \min\{d, n\}$) singular vectors of $A = [a_1, \ldots, a_n]^\top \in \mathbb{R}^{n \times d}$. Let $V_k \in \mathbb{R}^{d \times k}$ contain the top $k$ right singular vectors.

The power iteration and its variants, such as Krylov subspace iterations, are common approaches to the truncated SVD. They have $O(nd)$ space complexity and $O(ndk)$ per-iteration time complexity. They take $O(\log \frac{d}{\epsilon})$ iterations to converge to $\epsilon$ precision, where $O$ hides the spectral gap and constants (Golub & Van Loan, 2012; Saad, 2011).

¹School of Mathematical Sciences, Peking University, China. ²Department of Computer Science, Stevens Institute of Technology, USA. Correspondence to: Xiang Li. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

When either n or d is big, the data matrix $A \in \mathbb{R}^{n \times d}$ may not fit in memory, making standard single-machine algorithms infeasible. A distributed power iteration is feasible and practical for large-scale truncated SVDs. In particular, we partition the rows of A among m worker nodes (see Figure 1(a)) and let the nodes perform power iterations in parallel (see Figure 1(b)). In every iteration, every node performs $O(\frac{ndk}{m})$ FLOPs (supposing the load is balanced), while the server performs only $O(dk^2)$ FLOPs.

When solving large-scale matrix computation problems, communication costs are not negligible; in fact, communication costs can outweigh computation costs. The large-scale SVD experiments in (Gittens et al., 2016; Wang et al., 2019) show that the runtime caused by communication and the straggler's effect¹ can exceed the computation time. Due to the communication costs and other overheads, parallel computing can even exhibit anti-scaling; that is, when m is big, the overall wall-clock runtime increases with m. Reducing the frequency of communication will reduce the communication and synchronization costs and thereby improve scalability.

¹The straggler's effect means that one outlier node is tremendously slower than the rest, and the system waits for the slowest node to complete.
1.1. Our Contributions

Inspired by the FedAvg algorithm (McMahan et al., 2017), we propose an algorithm called LocalPower to improve communication efficiency. LocalPower is based on the distributed power iteration (DPI) described in Figure 1. The difference is that LocalPower makes every node locally perform orthogonal iterations using its own data for p iterations. In the case p = 1, LocalPower degenerates to DPI. When p ≥ 2, local updates are employed to reduce the communication frequency.

In practice, a naive implementation of the proposed LocalPower does not work very well. We propose three effective techniques for improving LocalPower:

- We propose to decay the communication interval, p, over time. In this way, the loss drops fast in the beginning and converges to the optimal solution in the end. Without the decay strategy, LocalPower is not guaranteed to converge to the optimum.
- The orthogonal Procrustes transformation (OPT) post-processes the output matrices of the m nodes after each iteration so that the m matrices are close to each other. OPT makes LocalPower stable at the cost of more computation.
- To reduce the computation of OPT, we restrict its feasible set from orthogonal matrices to the set of all diagonal matrices with ±1 entries. In this way, OPT becomes the sign-fixing technique, which is stable (slightly worse than OPT) and efficient. Sign-fixing was originally proposed by Garber et al. (2017) for the special case of k = 1, while we generalize sign-fixing to k > 1.

Figure 1. (a) The n × d data matrix A is partitioned among m worker nodes. (b) In every iteration of the distributed power iteration, there are two rounds of communication; most of the computations are performed by the worker nodes. (c) Commonly used symbols:

| Notation | Definition |
| --- | --- |
| A | data matrix |
| n | number of rows |
| d | number of columns |
| ρ | rank of the data matrix |
| m | number of partitions |
| k | target rank of the truncated SVD |
| r | running columns (iteration rank) |
| p | number of local iterations |
| s_i | number of rows in the i-th node |
| p_i = s_i / n | fraction of rows in the i-th node |

In summary, this work's contributions include the new algorithm, LocalPower, its convergence analysis, and the effective techniques for improving LocalPower.

The remainder of this paper is organized as follows. In Section 2, we define notation and give preliminary background on the orthogonal Procrustes problem and the distributed power iteration. In Section 3, we propose LocalPower and its variants, and then provide theoretical analysis in Section 4. In Section 5, we conduct experiments to illustrate the effectiveness of LocalPower and to validate our theoretical results. In Section 6, we give further discussions on some aspects of LocalPower. All proof details can be found in Appendix A. In Appendix D, we discuss related work on SVD and parallel algorithms.

2. Preliminary

Notation. For any $A \in \mathbb{R}^{n \times d}$, we use $\|A\|_2$ and $\|A\|_F$ to denote its spectral norm and Frobenius norm. Let $A^\dagger \in \mathbb{R}^{d \times n}$ denote the Moore-Penrose pseudo-inverse of A. For any positive integer T, let $[T] = \{1, 2, \ldots, T\}$. $\mathcal{O}_{d \times k}$ is the set of all $d \times k$ column-orthonormal matrices ($1 \le k \le d$); $\mathcal{O}_k$, short for $\mathcal{O}_{k \times k}$, denotes the set of $k \times k$ orthogonal matrices.
$\mathcal{R}(U)$ denotes the subspace spanned by the columns of U. The commonly used notation is summarized in Figure 1(c).

Power iteration. The top k right singular vectors of A can be obtained by the subspace iteration that repeats

$$Y \leftarrow M Z \quad \text{and} \quad Z \leftarrow \mathrm{orth}(Y), \qquad (1)$$

where $M = \frac{1}{n} A^\top A$. In every power iteration, computing Y has $O(ndk)$ time complexity, and orthogonalizing Y has $O(dk^2)$ time complexity. It is well known that the tangent of the principal angles between $\mathcal{R}(Z)$ and $\mathcal{R}(U_k)$ converges to zero geometrically (Arbenz et al., 2012; Saad, 2011), and thus so does their projection distance.

Distributed power iteration (DPI) is a direct distributed variant of the power iteration. Consider data parallelism and partition the data (rows of A) among m worker nodes; see Figure 1(a) for an illustration. We partition A as $A = [A_1^\top, \ldots, A_m^\top]^\top$, where $A_i \in \mathbb{R}^{s_i \times d}$ contains $s_i$ rows of A. Using m worker nodes and data parallelism, one power iteration works in four steps. First, the server broadcasts Z to the workers, which has $O(dk)$ or $O(dkm)$ communication complexity (depending on the network structure). Second, every worker (say, the i-th) locally computes

$$Y_i = M_i Z \in \mathbb{R}^{d \times k} \quad \text{with} \quad M_i = \tfrac{1}{s_i} A_i^\top A_i, \qquad (2)$$

which has $O(d^2 k)$ or $O(s_i d k)$ time complexity. Third, the server aggregates $Y_i$, for all $i \in [m]$, to obtain $Y = \sum_{i=1}^m p_i Y_i$; this step is equivalent to $Y = M Z$, where $M = \sum_{i=1}^m p_i M_i$ with $p_i = \frac{s_i}{n}$. It has $O(dk)$ or $O(dkm)$ communication complexity. Last, the server locally orthogonalizes Y to obtain $Z = \mathrm{orth}(Y)$, which has merely $O(dk^2)$ time complexity. The algorithm is described in Figure 1(b). The following lemma is a well-known result (Arbenz et al., 2012; Saad, 2011).

Lemma 1. To obtain a column-orthonormal matrix Z such that the subspace distance $\mathrm{dist}(Z, U_k) \le \epsilon$ (see Definition 1 for details), with high probability, the communication needed by DPI is $\Omega\big(\frac{\sigma_k}{\sigma_k - \sigma_{k+1}} \log \frac{d}{\epsilon}\big)$. Here, $\sigma_j$ is the j-th largest singular value of the matrix M.

Algorithm 1 LocalPower
1: Input: distributed dataset $\{A_i\}_{i=1}^m$, target rank k, iteration rank $r \ge k$, number of iterations T.
2: Initialization: generate a standard Gaussian matrix $Y_0$.
3: for t = 0 to T do
4:   Broadcast: if $t \in \mathcal{I}_T$, the server sends $Y_t$ to the workers; let $Y_t^{(i)} \leftarrow Y_t$.
5:   Local computation: for all $i \in [m]$, the i-th worker locally computes $Z_t^{(i)} = \mathrm{orth}(Y_t^{(i)})$ and $Y_{t+1}^{(i)} = \frac{1}{s_i} A_i^\top A_i Z_t^{(i)}$.
6:   Aggregation: if $(t + 1) \in \mathcal{I}_T$, the server computes $Y_{t+1} = \sum_{i=1}^m p_i Y_{t+1}^{(i)}$.
7: end for
8: Output: $\mathrm{orth}(Y_{T+1})$.

3. Algorithms

LocalPower is a new algorithm that we propose for improving communication efficiency. It is described in Algorithm 1. Its basic idea is to trade more local power iterations for fewer communications by reducing the communication frequency. Between two communications, every worker node locally runs eqn. (2) multiple times. We let the set $\mathcal{I}_T (\subseteq [T])$ index the iterations that perform communication; for example, $\mathcal{I}_T = \{0, p, 2p, \ldots, T\}$ means that the algorithm communicates once after every p local power iterations. The cardinality $|\mathcal{I}_T|$ is the total number of communications. Suppose LocalPower performs one communication every p iterations. In T iterations, each worker performs $O(s_i d k T)$ FLOPs, the server performs $O(dk^2 T / p)$ FLOPs, and the overall communication complexity is $O(dk\,T / p)$. The standard distributed power iteration is a special case of LocalPower with p = 1, that is, $\mathcal{I}_T = \{0, 1, 2, \ldots, T\}$.²

²The reason why we average $Y_t^{(i)}$ instead of $Z_t^{(i)}$ is that we want LocalPower to reduce to DPI when p = 1.
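To make Algorithm 1 concrete, below is a minimal single-process NumPy sketch that simulates vanilla LocalPower (no OPT or sign-fixing). The function names, the toy data with a planted spectrum, and the way workers are simulated as a Python list are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def orth(Y):
    # Column-orthonormalization via reduced QR.
    Q, _ = np.linalg.qr(Y)
    return Q

def local_power(A_parts, k, T, comm_set, rng):
    """Simulate Algorithm 1 (vanilla LocalPower) on a single machine.

    A_parts  : list of local blocks A_i, each of shape (s_i, d)
    comm_set : the index set I_T of iterations at which communication happens
    """
    n = sum(A.shape[0] for A in A_parts)
    d = A_parts[0].shape[1]
    weights = [A.shape[0] / n for A in A_parts]        # p_i = s_i / n
    Y = rng.standard_normal((d, k))                    # Y_0: standard Gaussian
    Y_local = [Y.copy() for _ in A_parts]              # the workers' copies Y_t^(i)
    for t in range(T + 1):
        if t in comm_set:                              # broadcast Y_t to all workers
            Y_local = [Y.copy() for _ in A_parts]
        # local step: Z_t^(i) = orth(Y_t^(i)); Y_{t+1}^(i) = (1/s_i) A_i^T A_i Z_t^(i)
        Y_local = [A.T @ (A @ orth(Yi)) / A.shape[0]
                   for A, Yi in zip(A_parts, Y_local)]
        if (t + 1) in comm_set:                        # aggregation: Y_{t+1} = sum_i p_i Y_{t+1}^(i)
            Y = sum(w * Yi for w, Yi in zip(weights, Y_local))
    return orth(Y)

# Toy data with a decaying spectrum, split among m = 3 workers; communicate every p = 4 iterations.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((3000, 50)))
V0, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = U0 @ np.diag(1.5 ** -np.arange(50)) @ V0.T
A_parts = np.array_split(A, 3)
T, p = 39, 4
comm_set = set(range(0, T + 2, p))                     # I_T = {0, 4, ..., 40}
Z = local_power(A_parts, k=5, T=T, comm_set=comm_set, rng=rng)

Uk = V0[:, :5]                                         # true top-5 right singular vectors
print(np.linalg.norm(Z @ Z.T - Uk @ Uk.T, 2))          # subspace error; a small non-zero floor remains
```

In a real deployment each element of `A_parts` would live on a different worker, and the two `comm_set` branches correspond to the broadcast and aggregation messages; everything else is purely local computation.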
One-shot SVD, a.k.a. divide-and-conquer SVD (Liang et al., 2014; Garber et al., 2017; Fan et al., 2019), is a special case of LocalPower with p = T, that is, $\mathcal{I}_T = \{0, T\}$.

Decaying p. In practice, it is helpful to use a big p in the beginning but let p = 1 in the end. For example, we can decrease p by half every few communications. The rationale is that the error of LocalPower does not converge to zero if p is big; see the theoretical analysis in the next section. Our empirical observations confirm the theory: if p is set big, then the error drops very fast in the beginning, but it does not vanish with the iterations.

Orthogonal Procrustes Transformation. In Algorithm 1, the i-th worker locally computes $Y_{t+1}^{(i)} = \frac{1}{s_i} A_i^\top A_i Z_t^{(i)}$. When it comes to the time of communication (i.e., $t + 1 \in \mathcal{I}_T$), we replace this equation by the following steps. First, we choose the device that has the maximum number of samples as a base. Without loss of generality, we can assume the first device is selected (which indicates $1 = \operatorname{argmax}_{i \in [m]} p_i$). Second, we compute

$$O_t^{(i)} = \operatorname{argmin}_{O \in \mathcal{O}_k} \big\| Z_t^{(i)} O - Z_t^{(1)} \big\|_F^2. \qquad (4)$$

Eqn. (4) is a classic matrix approximation problem in linear algebra, named the Procrustes problem (Schönemann, 1966; Cape, 2020). The solution to eqn. (4) is referred to as the orthogonal Procrustes transformation (OPT) and has a closed form: $O_t^{(i)} = W_1 W_2^\top$, where $W_1 \Sigma W_2^\top$ is the SVD of $(Z_t^{(i)})^\top Z_t^{(1)}$. Finally, we compute $Y_{t+1}^{(i)} = \frac{1}{s_i} A_i^\top A_i Z_t^{(i)} O_t^{(i)}$.

Remark 1. Intuitively, such an $O_t^{(i)}$ adjusts $Z_t^{(i)}$ so that it aligns better with $Z_t^{(1)}$. In an ideal case, all $Z_t^{(i)}$'s would be identical to $Z_t^{(1)}$, and thus the aggregation step (line 6 in Algorithm 1) would be the same as that in DPI. From our theory, it is important to use OPT: it weakens the assumption on the smallness of the residual error incurred by local computation. From our experiments, it stabilizes vanilla LocalPower and achieves much smaller errors.

Remark 2. To compute such an $O_t^{(i)}$, the i-th client should communicate $Z_t^{(i)}$ to the server, which results in additional communication cost. However, this cost is of the same magnitude as that of sending $Y_{t+1}^{(i)}$ in the aggregation step. Besides, the computation of $O_t^{(i)}$, as well as the communication of $Y_{t+1}^{(i)}$, only happens when $t + 1 \in \mathcal{I}_T$. These make the additional communication cost affordable.

Sign-Fixing. While OPT makes LocalPower more stable in practice, OPT incurs more local computation. Specifically, it has time complexity $O(dk^2)$ via calling the SVD of $(Z_t^{(i)})^\top Z_t^{(1)}$. To attain both efficiency and stability, we propose to replace the $k \times k$ matrix $O_t^{(i)}$ in eqn. (4) by

$$D_t^{(i)} = \operatorname{argmin}_{D \in \mathcal{D}_k} \big\| Z_t^{(i)} D - Z_t^{(1)} \big\|_F^2, \qquad (5)$$

where $\mathcal{D}_k$ denotes the set of all $k \times k$ diagonal matrices with ±1 diagonal entries. $D_t^{(i)}$ can be computed in $O(kd)$ time by

$$D_t^{(i)}[j, j] = \mathrm{sgn}\big( \big\langle Z_t^{(i)}[:, j],\, Z_t^{(1)}[:, j] \big\rangle \big), \quad j \in [k].$$

We empirically observe that sign-fixing serves as a good practical surrogate of OPT; it maintains good stability and achieves comparably small errors.

Remark 3. If we decay p, p will drop to one after a few communications. When p = 1, we stop using OPT (or sign-fixing); we simply set $O_t^{(i)}$ (or $D_t^{(i)}$) to the identity matrix.

Remark 4. The technique of sign-fixing was proposed in the setting of k = 1 by Garber et al. (2017). In the k = 1 setting, OPT and sign-fixing coincide. In eqn. (5), we provide a simple way to extend it to the high-dimensional case k > 1: we compute a $D_t^{(i)}$ that simultaneously adjusts the signs of the columns of $Z_t^{(i)}$ relative to $Z_t^{(1)}$. There are other ways to handle the high-dimensional sign-fixing problem; for example, if the first k eigenvalues are well separated from the others, we can reduce the top-k sign-fixing problem to k instances of the one-dimensional sign-fixing problem.
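As a concrete illustration, the sketch below computes the OPT weight of eqn. (4) via a small $k \times k$ SVD and the sign-fixing weight of eqn. (5) from column-wise inner products. The helper names and the toy check are our own assumptions, shown only for illustration.

```python
import numpy as np

def opt_weight(Z_i, Z_1):
    # Orthogonal Procrustes transformation, eqn. (4):
    # O = argmin_{O orthogonal} ||Z_i O - Z_1||_F = W1 W2^T,
    # where W1 S W2^T is the SVD of Z_i^T Z_1 (a k x k matrix).
    W1, _, W2t = np.linalg.svd(Z_i.T @ Z_1)
    return W1 @ W2t

def sign_fixing_weight(Z_i, Z_1):
    # Sign-fixing, eqn. (5): a diagonal matrix with +/-1 entries,
    # D[j, j] = sgn(<Z_i[:, j], Z_1[:, j]>), computable in O(dk) time.
    s = np.sign(np.einsum('dj,dj->j', Z_i, Z_1))
    s[s == 0] = 1.0                      # break ties arbitrarily
    return np.diag(s)

# Toy check: Z_i spans the same subspace as Z_1 but with a rotated or sign-flipped basis.
rng = np.random.default_rng(0)
d, k = 100, 5
Z_1, _ = np.linalg.qr(rng.standard_normal((d, k)))
R, _ = np.linalg.qr(rng.standard_normal((k, k)))                 # random k x k rotation
Z_i = Z_1 @ R
print(np.linalg.norm(Z_i @ opt_weight(Z_i, Z_1) - Z_1))          # ~ 0: OPT undoes the rotation
Z_flip = Z_1 * np.array([1, -1, 1, -1, 1])                       # only column sign flips
print(np.linalg.norm(Z_flip @ sign_fixing_weight(Z_flip, Z_1) - Z_1))  # ~ 0: sign-fixing repairs them
```

In the toy check, OPT undoes an arbitrary rotation of the basis, whereas sign-fixing only repairs column sign flips, which is the cheaper correction used when the full OPT is too costly.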
4. Convergence Analysis

In this section we analyze the convergence of LocalPower and show the benefit of OPT under an ideal setting. We use the projection distance between two subspaces as the metric for convergence evaluation.

Definition 1 (Projection Distance). Let $U, \widetilde{U} \in \mathcal{O}_{d \times k}$ be any matrices with orthonormal columns. The projection distance³ between them is $\mathrm{dist}(U, \widetilde{U}) = \| U U^\top - \widetilde{U} \widetilde{U}^\top \|_2$.

The projection distance satisfies $\mathrm{dist}(U, \widetilde{U}) = \sin \theta_k(U, \widetilde{U})$, where $\theta_k(U, \widetilde{U})$ denotes the k-th largest principal angle between the subspaces spanned by U and $\widetilde{U}$. Principal angles quantify how different two subspaces are. We can calculate $\theta_1(U, \widetilde{U}), \theta_2(U, \widetilde{U}), \ldots, \theta_k(U, \widetilde{U})$ via the SVD of $U^\top \widetilde{U}$: the l-th largest singular value of $U^\top \widetilde{U}$ is equal to $\cos \theta_l(U, \widetilde{U})$ for all $l = 1, \ldots, k$.

³Unlike the spectral norm or the Frobenius norm of the difference of bases, the projection distance is invariant to a global orthonormal transformation of the bases. See Ye & Lim (2014) for more on distances between subspaces.

Definition 2 (Local Approximation). Let $M_i = \frac{1}{s_i} A_i^\top A_i$ be hosted by the i-th worker. Let $M = \frac{1}{n} \sum_{i=1}^m A_i^\top A_i = \sum_{i=1}^m p_i M_i$. Define $\eta = \max_{i \in [m]} \| M_i - M \|_2$, which measures how far the local matrices $M_1, \ldots, M_m$ are from M. If $s_i = p_i n$ is sufficiently larger than d, then $\eta$ is sufficiently small.

Definition 3 (Residual Error). If OPT is not used, define $\rho_t = \max_{i \in [m]} \| Z_t^{(i)} - Z_t^{(1)} \|_2$. If OPT is used, define $\rho_t = \max_{i \in [m]} \min_{O \in \mathcal{O}_k} \| Z_t^{(i)} O - Z_t^{(1)} \|_2$.

The residual error $\rho_t$ measures how much the local top-k eigenspace estimators vary across the m workers. By definition, $\rho_t$ with OPT is no larger than $\rho_t$ without OPT. When $t \in \mathcal{I}_T$, $Z_t^{(1)} = \cdots = Z_t^{(m)}$ and thus $\rho_t = 0$. When $t \notin \mathcal{I}_T$, each local update may enlarge $\rho_t$. Hence, intuitively $\rho_t$ depends on p, i.e., the number of local iterations between two communications. However, later we will show that with OPT $\rho_t$ does not depend on p (when p is sufficiently large), while it does depend on p without OPT. A residual error is inevitable in the previous literature on empirical risk minimization that uses local updates to improve communication efficiency (Stich, 2018; Wang & Joshi, 2018b; Yu et al., 2019; Li et al., 2019a;b). In our case, it takes the form of $\rho_t$.

Assumption 1 (Uniformly small residual errors). Let r be the running column number, $\sigma_j$ be the j-th largest singular value of M, and $\epsilon \in (0, 0.5)$ be a constant. Assume $\eta \le \frac{1}{3\kappa}$, where $\kappa = \|M\|_2 \|M^\dagger\|_2$ is the condition number of M. Assume that for all $t \in [T]$,

$$\eta\, \mathbb{1}_{t \notin \mathcal{I}_T} + (1 - p_{\max}) \big( \rho_t + \rho_{t-1} \mathbb{1}_{t \notin \mathcal{I}_T} \big) = O(\epsilon_0), \qquad (6)$$

where $p_{\max} = \max_{i \in [m]} p_i$, $\mathbb{1}_{t \notin \mathcal{I}_T}$ is the indicator function of the event $\{t \notin \mathcal{I}_T\}$, and $\epsilon_0$ is a sufficiently small quantity determined by $\epsilon$, the spectral gap, and a small constant $\tau > 0$.

Theorem 1 (Convergence of LocalPower). Let $\tau$ be a positive constant, and let Assumption 1 hold. Then, after $|\mathcal{I}_T|$ rounds of communication, where $T = \Omega\big( \frac{\sigma_k}{\sigma_k - \sigma_{k+1}} \log \frac{\tau d}{\epsilon} \big)$, with probability at least $1 - \tau^{-\Omega(r+1-k)} - e^{-\Omega(d)}$ we have $\mathrm{dist}(Z_T, U_k) = \sin \theta_k(Z_T, U_k) \le \epsilon$.

Theorem 1 shows that LocalPower takes $T = \widetilde{\Theta}\big( \frac{\sigma_k}{\sigma_k - \sigma_{k+1}} \big)$ iterations to obtain an $\epsilon$-optimal solution, the same quantity required by DPI. However, LocalPower uses fewer communications: for example, with $\mathcal{I}_T = \{0, p, 2p, \ldots, T\}$, LocalPower makes only $|\mathcal{I}_T| = \widetilde{\Theta}\big( \frac{1}{p} \frac{\sigma_k}{\sigma_k - \sigma_{k+1}} \big)$ communications, saving a factor of p compared with DPI.
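For reference, the projection distance of Definition 1 and the principal angles can be computed directly from two SVDs. The following minimal NumPy sketch (the helper names are ours) also checks the identity between the distance and the sine of the largest principal angle:

```python
import numpy as np

def projection_distance(U, U_tilde):
    # dist(U, U~) = || U U^T - U~ U~^T ||_2  (Definition 1)
    return np.linalg.norm(U @ U.T - U_tilde @ U_tilde.T, ord=2)

def principal_angles(U, U_tilde):
    # cos(theta_l) are the singular values of U^T U~ (clipped for numerical safety)
    s = np.linalg.svd(U.T @ U_tilde, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Sanity check: the distance equals the sine of the largest principal angle.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))
V, _ = np.linalg.qr(rng.standard_normal((50, 3)))
theta = principal_angles(U, V)
assert np.isclose(projection_distance(U, V), np.sin(theta.max()))
```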
Theorem 1 depends on Assumption 1, which requires eqn. (6) to hold for all $t \in [T]$. What's more, the final error $\epsilon$ is positively related to $\eta$ and $\rho_t$ via eqn. (6). The first part of eqn. (6) (i.e., $\eta\, \mathbb{1}_{t \notin \mathcal{I}_T}$) is incurred by the variability of the $M_i$'s; if all devices had access to M (which would imply $M_1 = \cdots = M_m$), this term would vanish. The second part of eqn. (6) is brought about by intermittent communication. Indeed, if communication happens at iteration t (i.e., $t \in \mathcal{I}_T$), we have $\rho_t = 0$ and $\mathbb{1}_{t \notin \mathcal{I}_T} = 0$, so eqn. (6) holds trivially. Without communication, $\rho_t$ is likely to grow continually, which is harmful to obtaining an accurate solution. Therefore, the assumption effectively requires that the communication interval p not be too large. On the other hand, when p is fixed, the assumption instead imposes a restriction on $\eta$ when $t \notin \mathcal{I}_T$, because we show in Theorem 2 that $\rho_t$ is bounded by a function of $\eta$. If OPT is used, then $\rho_t = O(\eta)$, without dependence on p. However, if OPT is not used, then $\rho_t = O(\sqrt{k}\, p\, \kappa^p \eta)$ has an exponential dependence on p.

Theorem 2 (Benefits of OPT). Let $\tau(t) \in \mathcal{I}_T$ be the nearest communication time before t and $p = t - \tau(t)$. Let e be the natural constant. Assume $\eta$ is sufficiently small. With OPT, $\rho_t$ is bounded by

$$\rho_t \le \min\Big\{ 2 e^2 \kappa\, p\, \eta,\ \ \frac{\eta \sigma_1}{\delta_k} + 2 \gamma_k^{p/4} C_t \Big\},$$

where $\gamma_k \in (0, 1)$, $\delta_k = \Theta(\sigma_k - \sigma_{k+1})$, and $\limsup_{t \to \infty} C_t = O(\eta + \epsilon)$. Without OPT, $\rho_t$ is bounded by $O(\sqrt{k}\, p\, \kappa^p \eta)$.

Why does using OPT yield such an exponential improvement in the dependence on p? This is mainly due to the following property of OPT. Let $O^\star = \operatorname{argmin}_{O \in \mathcal{O}_k} \| U - \widetilde{U} O \|_F$ for $U, \widetilde{U} \in \mathcal{O}_{d \times k}$. Then, up to a universal constant, we have $\| U - \widetilde{U} O^\star \|_2 \lesssim \mathrm{dist}(U, \widetilde{U})$; see Lemma 3 in the appendix for a formal statement and detailed proof. It implies that, up to a tractable orthonormal transformation, the difference between the orthonormal bases of two subspaces is no larger than the projection distance between the subspaces. By the Davis-Kahan theorem (see Lemma 11), their projection distance is no larger than $O(\eta)$ up to some problem-dependent constants. However, without OPT, we have to use perturbation theory to bound $\rho_t$, which inevitably results in an exponential dependence on p.

5. Experiments

Settings. We use 15 datasets available on the LIBSVM website.⁴ The n data samples are randomly shuffled and then partitioned among m nodes so that each node has $s = \frac{n}{m}$ samples. We set $m = \max(\frac{n}{1000}, 3)$ so that each node has s = 1,000 samples, unless n is too small. The features are normalized so that all the values are between −1 and 1. All the algorithms start from the same initialization $Y_0$. We fix the target rank to k = 5. Our focus is on communication efficiency, so we use communication rounds to evaluate the compared algorithms. Due to the space limit, we defer more experiment details and additional experiment results to Appendix E.

⁴This page contains them all: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. See Table 4 in the appendix for the n, d information.

Compared algorithms. We evaluate three variants of LocalPower: the vanilla version, with OPT, and with sign-fixing. We compare our algorithms with the one-shot algorithms UDA (Fan et al., 2019), WDA (Bhaskara & Wijewardena, 2019), and DR-SVD⁵; the algorithms are described in Appendix E.2.

⁵DR-SVD is a direct distributed variant of randomized SVD, the latter proposed by Halko et al. (2011).

Final precision. In this set of experiments, we study the precision when the algorithms converge. For the three variants of LocalPower we fix p = 4 (without decaying p). We run each algorithm 10 times and report the mean and standard deviation (std) of the final errors. Due to limited space, Table 1 shows the results on 7 datasets.
Table 6 and Figure 4 (in the appendix) present all the results on the 15 datasets. Out of the 15 datasets, LocalPower has the smallest error mean and std on 12 datasets. The results indicate that one-shot methods do not find high-precision solutions unless the local data size is sufficiently large.

The final error depends on p. With p > 1, the final error, $\lim_{t \to \infty} \sin \theta_k(Z_t, U_k)$, does not converge to zero; instead, it remains a constant after a number of iterations. Figure 3(c) shows that the final error depends on p: the bigger p is, the bigger the final error.

The final error is not sensitive to large p. The final error stops growing with p when p is sufficiently large. Note that as $p \to \infty$, LocalPower becomes a one-shot algorithm, that is, the algorithm performs only one aggregation.⁶ One-shot algorithms typically have reasonable empirical performance and theoretical bounds.

⁶This one-shot method is different from those we introduced in the related work: it simply averages the local top-k eigenvectors rather than using the distributed averaging methods (see Algorithms 2 and 3).

The final error depends on m. A big m means a smaller local sample size, $s = \frac{n}{m}$, and thereby a big matrix approximation error, $\eta$ (in Definition 2). Our theory indicates that a big m (and thereby a big $\eta$) is bad for the final error. The empirical results in Figure 3(c) corroborate our theory.

Effect of local power iterations. In this set of experiments, we set p to 1, 2, 4, or 8 (without decaying p) and compare the convergence curves. Note that LocalPower with p = 1 is the standard distributed power iteration (DPI). We plot the error, $\sin \theta_k(Z_t, U_k)$, against communications. The convergence curves indicate how p affects the communication efficiency. Figure 2(a) shows the experimental results on one dataset; due to the page limit, the results on the other datasets are left to the appendix (see Figures 5, 6, and 7). In all the experiments, a large p leads to fast convergence in the beginning but a non-vanishing error in the end. Some machine learning tasks, such as principal component analysis and latent semantic analysis (Deerwester et al., 1990), do not require high-precision solutions. In this case, LocalPower is advantageous over DPI, as LocalPower finds a satisfactory solution using very few communications. It is also implied that LocalPower helps two-stage methods like that of Garber et al. (2017). If a higher precision is required, we can decay p so that LocalPower attains the same precision as DPI. While one-shot algorithms are more communication-efficient, their precision is too low unless each node has a large sample size.

The decay strategy. We have observed that a large p speeds up initial convergence but enlarges the final error. By contrast, p = 1 has the lowest error (which can actually be zero) but also the slowest convergence. Similar phenomena have been previously observed in distributed empirical risk minimization (Wang & Joshi, 2018a; Li et al., 2019b). To allow for both fast initial convergence and a vanishing final error, we are motivated to decay p gradually: we halve p until it reaches 1. We apply the decay strategy to the three variants of LocalPower. For each setting and each dataset, we repeat the experiment 10 times and report the mean and std. Table 2 and Figure 3(a) show the results on some datasets; the results on all 15 datasets are left to the appendix (see Table 7 and Figures 8, 9, and 10). The decay strategy not only makes convergence faster but also greatly improves the final precision.
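One simple way to realize the decay strategy is to build the communication set $\mathcal{I}_T$ from a halving schedule of the interval p. The helper below is a small illustrative sketch (here we halve p after every communication round, one of several reasonable variants), not the authors' exact schedule.

```python
def decayed_comm_set(T, p0):
    """Build I_T for a schedule that halves the interval p after each
    communication until p reaches 1 (then communicate every iteration)."""
    comm, t, p = [0], 0, p0
    while t < T:
        t += p
        comm.append(min(t, T))
        p = max(p // 2, 1)          # halve the interval, floor at 1
    return sorted(set(comm))

# Example: T = 30, starting interval p0 = 8.
print(decayed_comm_set(30, 8))      # [0, 8, 12, 14, 15, 16, ..., 30]
```

The resulting set can be plugged in directly as the communication index set $\mathcal{I}_T$ of Algorithm 1.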
Stability. In almost all the experiments, LocalPower with OPT has a smaller std and more stable convergence curves than LocalPower without OPT.

Why does OPT improve stability? Theorem 2 shows that with OPT, $\rho_t$ (in Definition 3) grows at most linearly in p. Even if p is large, Assumption 1 can be satisfied, and thus Theorem 1 guarantees the convergence of LocalPower with OPT. However, Theorem 2 also shows that without OPT, $\rho_t$ is an exponential function of p. If p is large, Assumption 1 is violated, and thus the convergence of LocalPower without OPT is not guaranteed.

Sign-fixing is a practical alternative to OPT. Table 1 and Figures 5 and 6 show that sign-fixing has stability comparable to OPT. To explain why sign-fixing works, we first explain what causes instability. Note that if we flip the signs of some columns of $Z_t^{(i)}$, the subspace $\mathcal{R}(Z_t^{(i)})$ remains the same. During the local power iterations on the i-th node, the signs of the columns of $Z_t^{(i)}$ can flip. While the sign flipping does not affect $\mathcal{R}(Z_t^{(i)})$, it changes the outcome of the aggregation of $Z_t^{(1)}, \ldots, Z_t^{(m)}$. The sign-fixing method counteracts sign flips and thereby stabilizes LocalPower. Table 2 shows that LocalPower with decaying p has better stability: with the decay strategy, p drops to 1 after several communications, and LocalPower becomes the standard DPI, which does not suffer from the instability issue.

Effect of local sample size. Since the n data samples are partitioned among the m nodes uniformly at random, every node holds $s = \frac{n}{m}$ samples. Figure 3(b) shows that a small m, equivalently a big s, is good for LocalPower. We use $\eta = \max_{i \in [m]} \| M_i - M \|_2 / \| M \|_2$ to measure the difference between a local covariance matrix and the full one. We give the values of $\eta$ under different uniform partitions in Table 3. It shows that if s is large (so m is small), $\eta$ is small, which implies that $M_1, \ldots, M_m$ approximate the global matrix M well and that the residuals accumulated by the local iterations are small. This in turn makes the curves with small m have small errors, which agrees with our theory.

6. Discussion

Smallness of η. Theorem 1 requires $\eta = O(\frac{1}{\kappa})$, which might be too stringent in practice. If we use a refined analysis like that of Guo et al. (2021), it can be relaxed to $\eta = O(1)$, and the dependence of $\epsilon_0$ on $\kappa$ can also be removed.⁷ Besides, the concurrent work (Charisopoulos et al., 2020) provides a sharper analysis of one-shot averaging via OPT, which might be used to refine our analysis and relax the requirement on $\eta$ further.

⁷In particular, Guo et al. (2021) analyze the convergence of a virtual sequence of the form $Z_t = \sum_{i=1}^m p_i Z_t^{(i)} D_t^{(i)}$, while we focus on the weighted $Y_t^{(i)}$, i.e., $Y_t = \sum_{i=1}^m p_i Y_t^{(i)} D_t^{(i)}$. Roughly speaking, $Y_t$ is about $\|M\|_2$ times larger than $Z_t$, while $Y_t^\dagger$ differs from $Z_t^\dagger$ by about a factor of $\|M^\dagger\|_2$. This leads to an additional factor of $\kappa = \|M\|_2 \|M^\dagger\|_2$.

Table 1. Errors of the three proposed algorithms and three baseline methods on seven datasets. We show the mean error over ten repeated experiments, with the standard deviation enclosed in parentheses. Results on all fifteen datasets are shown in Table 6.
| Dataset | OPT (LocalPower, p = 4) | Sign-fixing (LocalPower, p = 4) | Vanilla (LocalPower, p = 4) | DR-SVD | UDA | WDA |
| --- | --- | --- | --- | --- | --- | --- |
| A9a | 4.09e-03 (4.20e-4) | 5.82e-03 (1.41e-3) | 8.13e-02 (3.44e-2) | 4.63e-02 (9.24e-3) | 2.64e-02 (1.58e-2) | 2.40e-02 (1.50e-2) |
| Abalone | 3.16e-03 (2.89e-3) | 3.85e-03 (2.54e-3) | 3.03e-02 (5.70e-2) | 3.20e-01 (2.30e-1) | 1.03e-01 (9.38e-2) | 1.03e-01 (9.18e-2) |
| Acoustic | 1.83e-03 (4.40e-4) | 2.03e-03 (3.90e-4) | 2.38e-03 (8.5e-4) | 1.54e-02 (6.59e-3) | 7.76e-03 (2.64e-3) | 6.67e-03 (2.42e-3) |
| Combined | 6.01e-03 (1.59e-3) | 5.57e-03 (1.05e-3) | 2.47e-02 (3.40e-2) | 5.19e-02 (6.23e-3) | 4.63e-02 (2.97e-3) | 4.16e-02 (2.76e-2) |
| Connect-4 | 1.27e-02 (4.52e-3) | 1.81e-02 (3.79e-3) | 1.70e-02 (4.35e-3) | 1.61e-02 (2.96e-3) | 1.65e-01 (3.48e-2) | 1.56e-01 (3.26e-2) |
| Covtype | 7.38e-03 (6.50e-4) | 6.23e-03 (4.70e-4) | 1.28e-02 (1.88e-3) | 1.82e-01 (8.73e-2) | 6.09e-02 (9.70e-3) | 5.60e-02 (9.41e-3) |
| MSD | 9.90e-03 (1.21e-3) | 9.62e-03 (5.20e-4) | 1.44e-02 (1.58e-3) | 3.01e-02 (9.64e-3) | 1.55e-02 (1.39e-3) | 1.92e-02 (1.14e-3) |

Table 2. Error comparison of LocalPower with p = 4 and the decay strategy, under the same setting as Table 1. See Table 7 for full results. In theory, LocalPower with the decay strategy achieves zero error.

| Dataset | OPT | Sign-fixing | Vanilla |
| --- | --- | --- | --- |
| A9a | 4.84e-03 (1.40e-02) | 1.52e-03 (4.08e-03) | 3.11e-04 (4.84e-04) |
| Abalone | 3.50e-10 (4.10e-10) | 4.14e-10 (4.00e-10) | 6.12e-10 (6.77e-10) |
| Acoustic | 1.40e-05 (2.16e-05) | 1.92e-05 (3.72e-05) | 2.28e-05 (4.91e-05) |
| Combined | 3.68e-03 (5.63e-03) | 7.74e-03 (1.70e-02) | 2.99e-03 (3.88e-03) |
| Connect-4 | 4.90e-03 (8.47e-03) | 3.58e-03 (4.35e-03) | 3.09e-03 (3.16e-03) |
| Covtype | 5.57e-04 (1.55e-03) | 4.95e-05 (5.40e-05) | 8.01e-05 (8.62e-05) |
| MSD | 2.75e-05 (3.34e-05) | 2.47e-05 (3.27e-05) | 3.02e-05 (2.10e-05) |

Table 3. The value of η under uniform partitions on some datasets. For a fixed n, the larger m, the larger η. See Table 5 for full results.

| Dataset | m = 20 | m = 40 | m = 60 |
| --- | --- | --- | --- |
| A9a | 0.034 | 0.0563 | 0.0701 |
| Abalone | 0.1089 | 0.23 | 0.2458 |
| Acoustic | 0.0063 | 0.0107 | 0.0134 |
| Combined | 0.006 | 0.0089 | 0.0113 |
| Connect-4 | 0.0376 | 0.054 | 0.0771 |
| Covtype | 0.0078 | 0.011 | 0.0159 |
| MSD | 0.0007 | 0.0009 | 0.0012 |

Figure 2. (a) Convergence of LocalPower with different weighting schemes and various p on the Covtype dataset, where $A \in \mathbb{R}^{581{,}012 \times 54}$; see Figures 5, 6 and 7 for full results. (b) Stability on the A9a dataset: the vanilla LocalPower sometimes fluctuates and even diverges (see Figure 7 for full results). We can stabilize it in two ways: (i) use the $\mathcal{O}_k$ or $\mathcal{D}_k$ weights (i.e., OPT or sign-fixing), or (ii) use the decay strategy.

Increase local sample size. In addition to OPT or the decay strategy, we find that increasing the local data size also reduces the final error. Intuitively, if $s_i$ is sufficiently large, then $M_i = \frac{1}{s_i} A_i^\top A_i$ will be very close to $M = \frac{1}{n} A^\top A$. Indeed, this is true if we construct each $A_i$ by sampling uniformly from the overall data A (see Lemma 2). Therefore, to make $\eta$ sufficiently small, we can increase the local data size. If the total number of rows n is fixed in advance, increasing each $s_i$ is equivalent to decreasing the number of worker nodes m.

The term $\eta = \max_{i \in [m]} \| M_i - M \|_2 / \| M \|_2$ is commonly used to analyze matrix approximation problems. It aims to ensure that each $A_i$ is a typical representative of the whole dataset A. Prior work (Gittens & Mahoney, 2016; Woodruff, 2014; Wang et al., 2016) showed that uniform sampling with the partition size in Lemma 2 suffices for $M_i$ to approximate M well. The proof is based on the matrix Bernstein inequality (Tropp, 2015). Therefore, under uniform sampling, a small $\eta$ corresponds to a sufficiently large local dataset (or equivalently, a small number of worker nodes), as can also be seen in Table 3.
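To see how η behaves under uniform partitioning, it can also be measured directly on any dataset. The snippet below is an illustrative sketch on synthetic data (the helper name and the data are our own assumptions, not the procedure behind Table 3).

```python
import numpy as np

def local_approx_error(A, m, rng):
    """eta = max_i ||M_i - M||_2 / ||M||_2 for a random uniform partition into m blocks."""
    n = A.shape[0]
    M = A.T @ A / n
    idx = rng.permutation(n)
    blocks = np.array_split(idx, m)
    errs = [np.linalg.norm(A[b].T @ A[b] / len(b) - M, 2) for b in blocks]
    return max(errs) / np.linalg.norm(M, 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((20000, 50))
for m in (20, 40, 60):
    print(m, local_approx_error(A, m, rng))   # eta grows with m (i.e., with smaller local sample size)
```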
Figure 3. Some results on the Covtype dataset. (a) A typical convergence curve of the decay strategy; see Figures 8, 9 and 10 for full results. (b) The smaller m is, the faster the convergence and the smaller the error; see Figures 11 and 12 for full results. (c) The error depends positively on p and m; see Figure 13 for full results.

One may question whether each device is motivated to participate in the cooperative eigenspace estimation, given the large-local-dataset assumption. Here we focus on empirical PCA rather than population PCA. This implies that we inevitably suffer a statistical error, which diminishes as the total number of samples grows. As a result, if m devices participate in the training with comparable local data sizes, the statistical error can be reduced by a factor of m. See Appendix B for more details.

Lemma 2 (Uniform sampling). Let $\epsilon, \delta \in (0, 1)$. Assume the rows of $A_i$ are sampled from the rows of A uniformly at random, and assume each node has sufficiently many samples, that is, $s_i$ exceeds a threshold depending on $\mu$, $\rho$, $\epsilon$, and $\delta$, for all $i \le m$, where $\rho = \mathrm{rank}(A)$ and $\mu$ is the row coherence of A.⁸ Then, with probability greater than $1 - \delta$, we have $\eta = \max_{i \in [m]} \| M_i - M \|_2 / \| M \|_2 \le \epsilon$.

⁸The row coherence of A is defined by $\mu(A) = \frac{n}{d} \max_j \| u_j \|_2^2 \in [1, \frac{n}{d}]$, where $u_j$ is the j-th row of a column-orthonormal basis of the range of A.

Error dependence. The choice of $\mathcal{I}_T$ determines how frequently LocalPower communicates. We explore the use of $\mathcal{I}_T = \{0, p, 2p, \ldots, T\}$ and the decay strategy in experiments. When p = 1, LocalPower reduces to DPI. As a result, both residual error terms $\Psi_t$ and $\Omega_t$ vanish and, as shown in Lemma 1, DPI converges to zero error. When $p \ge 2$, the error $\sin \theta_k$ typically increases with p and is non-zero. Corollary 1 depicts the relationship between the error and problem-dependent parameters including n, m, and p. The proof is provided in Appendix A.5; it follows from Theorem 2 and Lemma 2.

Corollary 1. Under uniform sampling, and assuming $s_i = \Theta(\frac{n}{m})$ and that n is sufficiently large, with probability $1 - \delta$, LocalPower with OPT has an asymptotic error satisfying $\limsup_{t \to \infty} \sin \theta_k(Z_t, U_k) = O\big( h_p(\eta) \big)$, where $\eta$ is the local approximation error controlled by Lemma 2, and $h_p(x)$ is non-negative and increasing in (typically both p and) x, satisfying $h_1(x) = 0$ as well as $0 \le h_p(x) \le C x$ for some constant C. We hide the constants $\sigma_k, k, d, \rho, \kappa, \delta$ in the big-O notation and in $h_p(\cdot)$. However, with any decay strategy in which p eventually converges to 1, LocalPower achieves zero error asymptotically.

Corollary 1 says that when p goes to infinity, the error saturates and has a finite limit, because $h_p(\cdot)$ is bounded. The curve of error vs. p and m in Figure 3(c) validates this conclusion. Indeed, the extreme case of a very large p means that LocalPower reduces to the one-shot method, which typically has a non-zero optimization error. Corollary 1 also reveals ways to reduce the error: we can (i) use the decay strategy ($p \to 1$) to achieve arbitrarily small error, or (ii) reduce the number of devices ($m \downarrow$) or collect more data points ($n \uparrow$). Both approaches work empirically in our experiments.

Dependence on $\sigma_k - \sigma_{k+1}$. Our result depends on $\sigma_k - \sigma_{k+1}$ even when r > k, where r is the number of columns used in the subspace iteration.
If we borrow the tool of Balcan et al. (2016a) rather than that of Hardt & Price (2014), we can improve the result to a slightly milder dependency on $\sigma_k - \sigma_{q+1}$, where q is any intermediate integer between k and r. In particular, the required number of iterations T decreases from $\widetilde{O}\big( \frac{\sigma_k}{\sigma_k - \sigma_{k+1}} \big)$ to $\widetilde{O}\big( \frac{\sigma_k}{\sigma_k - \sigma_{q+1}} \big)$. This means that using additional columns speeds up convergence. For a formal statement, please refer to Appendix C.

Further extensions. Our proposed LocalPower is simple, effective, and well-grounded. While we analyze it only in the centralized setting, LocalPower can be extended to broader settings, such as the decentralized setting (Gang et al., 2019) and the streaming setting (Raja & Bajwa, 2020). To further reduce the communication complexity, we can combine LocalPower with sketching techniques (Boutsidis et al., 2016; Balcan et al., 2016b): for example, we could sketch each $Y_t^{(i)}$ and communicate the compressed iterates to the central server in each iteration. We leave these extensions to future work. Besides, in typical federated learning systems, clients might not respond to the central server due to connection failures. It is also possible to consider partial participation of clients and the optimal way to select clients (Reisizadeh et al., 2020; Chen et al., 2020); Guo et al. (2021) make an attempt in this direction.

7. Conclusion

We have developed a communication-efficient distributed algorithm named LocalPower to solve the truncated SVD. Every worker machine performs multiple (say, p) local power iterations between two consecutive communications. We have theoretically shown that LocalPower converges p times faster (in terms of communication) than the baseline distributed power iteration, provided the residual error is uniformly small. To reduce the residual error, we can (i) use OPT or sign-fixing, (ii) use a decay strategy that halves p gradually, and (iii) increase the local data size. Both OPT and sign-fixing improve stability, and sign-fixing is additionally computationally efficient. The decay strategy is motivated by the experimental observation that a large p often leads to a quick initial drop of the loss but a higher final error; it achieves zero error asymptotically in theory and has better convergence performance in experiments. We have conducted thorough experiments to show the effectiveness of LocalPower, and the theory agrees with our empirical results.

Acknowledgement

Li, Chen and Zhang have been supported by the National Key Research and Development Project of China (No. 2018AAA0101004 & 2020AAA0104400) and Beijing Academy of Artificial Intelligence (BAAI).

References

Allen-Zhu, Z. and Li, Y. LazySVD: even faster SVD decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pp. 974-982, 2016.

Arbenz, P. and Kressner, D. Lecture notes on solving large scale eigenvalue problems. D-MATH, ETH Zürich, 2012.

Arora, R., Cotter, A., and Srebro, N. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems, pp. 1815-1823, 2013.

Balcan, M.-F., Du, S. S., Wang, Y., and Yu, A. W. An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory, pp. 284-309, 2016a.

Balcan, M. F., Liang, Y., Song, L., Woodruff, D., and Xie, B. Communication efficient distributed kernel principal component analysis. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 725-734, 2016b.
Bhaskara, A. and Wijewardena, P. M. On distributed averaging for stochastic k-PCA. In Advances in Neural Information Processing Systems, pp. 11024-11033, 2019.

Boutsidis, C., Woodruff, D. P., and Zhong, P. Optimal principal component analysis in distributed and streaming models. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 236-249. ACM, 2016.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

Cape, J. Orthogonal Procrustes and norm-dependent optimality. The Electronic Journal of Linear Algebra, 36(36):158-168, 2020.

Charisopoulos, V., Benson, A. R., and Damle, A. Communication-efficient distributed eigenspace estimation. arXiv preprint arXiv:2009.02436, 2020.

Chen, W., Horvath, S., and Richtarik, P. Optimal client sampling for federated learning. arXiv preprint arXiv:2010.13723, 2020.

Chen, X., Lee, J. D., Li, H., and Yang, Y. Distributed estimation for principal component analysis: An enlarged eigenspace analysis. Journal of the American Statistical Association, pp. 1-12, 2021.

De Sa, C., He, B., Mitliagkas, I., Ré, C., and Xu, P. Accelerated stochastic power iteration. Proceedings of Machine Learning Research, 84:58, 2018.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

Fan, J., Wang, D., Wang, K., Zhu, Z., et al. Distributed estimation of principal eigenspaces. The Annals of Statistics, 47(6):3009-3031, 2019.

Gang, A., Raja, H., and Bajwa, W. U. Fast and communication-efficient distributed PCA. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7450-7454. IEEE, 2019.

Garber, D. and Hazan, E. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.

Garber, D., Hazan, E., Jin, C., Kakade, S. M., Musco, C., Netrapalli, P., and Sidford, A. Faster eigenvector computation via shift-and-invert preconditioning. In ICML, pp. 2626-2634, 2016.

Garber, D., Shamir, O., and Srebro, N. Communication-efficient algorithms for distributed stochastic principal component analysis. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1203-1212. JMLR.org, 2017.

Gittens, A. and Mahoney, M. W. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17(1):3977-4041, 2016.

Gittens, A., Devarakonda, A., Racah, E., Ringenburg, M., Gerhardt, L., Kottalam, J., Liu, J., Maschhoff, K., Canon, S., and Chhugani, J. Matrix factorizations at scale: a comparison of scientific data analytics in Spark and C+MPI using three case studies. In IEEE International Conference on Big Data, 2016.

Golub, G. H. and Van Loan, C. F. Matrix Computations, volume 3. JHU Press, 2012.

Grammenos, A., Mendoza-Smith, R., Mascolo, C., and Crowcroft, J. Federated principal component analysis. arXiv preprint arXiv:1907.08059, 2019.

Guo, X., Li, X., Chang, X., Wang, S., and Zhang, Z. Privacy-preserving distributed SVD via federated power. arXiv preprint arXiv:2103.00704, 2021.
Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Hardt, M. and Price, E. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pp. 2861-2869, 2014.

Ji-guang, S. Perturbation of angles between linear subspaces. Journal of Computational Mathematics, pp. 58-61, 1987.

Khaled, A., Mishchenko, K., and Richtárik, P. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2019a.

Li, X., Yang, W., Wang, S., and Zhang, Z. Communication efficient decentralized training with multiple local updates. arXiv preprint arXiv:1910.09126, 2019b.

Liang, Y., Balcan, M.-F. F., Kanchanapally, V., and Woodruff, D. Improved distributed principal component analysis. In Advances in Neural Information Processing Systems, pp. 3113-3121, 2014.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (AISTATS), 2017.

Musco, C. and Musco, C. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems (NIPS), 2015.

Oja, E. and Karhunen, J. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69-84, 1985.

Raja, H. and Bajwa, W. U. Distributed stochastic algorithms for high-rate streaming principal component analysis. arXiv preprint arXiv:2001.01017, 2020.

Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2021-2031. PMLR, 2020.

Saad, Y. Numerical methods for large eigenvalue problems. Available from: http://www-users.cs.umn.edu/saad/books.html, 2011.

Schönemann, P. H. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1-10, 1966.

Shamir, O. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, pp. 144-152, 2015.

Shamir, O. Convergence of stochastic gradient descent for PCA. In International Conference on Machine Learning, pp. 257-265, 2016.

Simchowitz, M., Alaoui, A. E., and Recht, B. On the gap between strict-saddles and true convexity: An omega(log d) lower bound for eigenvector approximation. arXiv preprint arXiv:1704.04548, 2017.

Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Sun, J.-g. On perturbation bounds for the QR factorization. Linear Algebra and its Applications, 215:95-111, 1995.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

Tropp, J. A. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

Vu, V. Q., Lei, J., et al. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905-2947, 2013.
Wang, J. and Joshi, G. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv preprint arXiv:1810.08313, 2018a.

Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018b.

Wang, S., Luo, L., and Zhang, Z. SPSD matrix approximation vis column selection: Theories, algorithms, and extensions. Journal of Machine Learning Research, 17(49):1-49, 2016.

Wang, S., Gittens, A., and Mahoney, M. W. Scalable kernel k-means clustering with Nyström approximation: Relative-error bounds. Journal of Machine Learning Research, 20(12):1-49, 2019.

Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37-52, 1987.

Woodruff, D. P. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

Wu, S. X., Wai, H.-T., Li, L., and Scaglione, A. A review of distributed algorithms for principal component analysis. Proceedings of the IEEE, 106(8):1321-1340, 2018.

Xu, Z. Gradient descent meets shift-and-invert preconditioning for eigenvector computation. Advances in Neural Information Processing Systems, 31:2825-2834, 2018.

Ye, K. and Lim, L.-H. Distance between subspaces of different dimensions. arXiv preprint arXiv:1407.0900, 2014.

Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI Conference on Artificial Intelligence, 2019.

Zhou, F. and Cong, G. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.