Tensor Decomposition via Simultaneous Power Iteration

Po-An Wang (Academia Sinica, Taiwan) and Chi-Jen Lu (Academia Sinica, Taiwan). Correspondence to: Po-An Wang.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Tensor decomposition is an important problem with many applications across several disciplines, and a popular approach for this problem is the tensor power method. However, previous works with theoretical guarantees based on this approach can only find the top eigenvectors one by one, unlike the case for matrices. In this paper, we show how to find the eigenvectors simultaneously with the help of a new initialization procedure. This allows us to achieve a better running time in the batch setting, as well as a lower sample complexity in the streaming setting.

1. Introduction

Tensors have long been used successfully in several disciplines, including neuroscience, phylogenetics, statistics, signal processing, computer vision, and data mining. They are used to model multi-relational or multi-modal data, and their decompositions often reveal underlying structures behind the observed data; see (Kolda & Bader, 2009) for a survey of such results. Recently, they have found applications in machine learning, particularly for learning various latent variable models (Anandkumar et al., 2012; Chaganty & Liang, 2014; Anandkumar et al., 2014a).

One popular decomposition method in such applications is the CP (Candecomp/Parafac) decomposition, which decomposes the given tensor into a sum of rank-one components. This is similar to the singular value decomposition (SVD) of matrices, and a popular approach for SVD is the power method, which is well understood and has nice theoretical guarantees. As tensors can be seen as a generalization of matrices to higher orders, one would hope that a natural generalization of the power method to tensors could inherit the success of the matrix case. However, the situation turns out to be much more complicated for tensors (see, e.g., the discussion in (Anandkumar et al., 2014a)), and in fact several problems related to tensor decomposition are known to be NP-hard (Hillar & Lim, 2013).

Nevertheless, when the given tensor has some additional structure, the tensor decomposition problem becomes tractable again. In particular, for tensors having an orthogonal decomposition, Anandkumar et al. (2014a) provided an efficient algorithm based on the tensor power method with a theoretical guarantee. Still, as we will discuss later in Section 2, the seemingly subtle change of going from matrices to tensors makes some significant differences for the power method. The first is that while the matrix power method can guarantee that a randomly selected initial vector will almost surely converge to the top singular vector, we have much less control over where the iteration converges in the tensor case. Consequently, most previous works based on the tensor power method with theoretical guarantees, such as (Anandkumar et al., 2014a;b; Wang & Anandkumar, 2016), require much more complicated procedures. In particular, they can only find the top k eigenvectors one by one, each time applying the power method to a modified tensor, deflated from the original tensor according to the previously found vectors. Moreover, to find each vector, they need to sample several initial vectors and apply the power method on all of them, before selecting just one of them.
In contrast, algorithms for matrices such as (Mitliagkas et al., 2013; Hardt & Price, 2014) are much simpler, as they can find the k vectors simultaneously by applying the power method on only k random initial vectors. The second difference, on the other hand, has a beneficial effect: it allows the tensor power method to converge exponentially faster than the matrix one when starting from good initial vectors. Then a natural question is: can we inherit the best of both worlds? Namely, is it possible to have a simple algorithm which finds the k eigenvectors of a tensor simultaneously and converges faster than that for matrices?

Our Results. As in previous works, we consider the slightly harder scenario in which we only have access to a noisy version of the tensor we want to decompose. This arises in applications such as learning latent variable models, in which the tensor we have access to is obtained from some empirical average of the observed data. Our main contribution is to answer the above question affirmatively.

First, we consider the batch setting, in which we assume that the given noisy tensor is stored somewhere and can be accessed whenever we want. In this setting, we identify a sufficient condition such that if we have k initial vectors satisfying this condition, then we can apply the tensor power method on them simultaneously, which will come within distance $\varepsilon$ of the eigenvectors in $O(\log\log\frac{1}{\varepsilon})$ iterations, with parameters related to the eigenvalues treated as constants. To apply such a result, we need an efficient way to find such initial vectors. We show how to do this by choosing a good direction to project the tensor down to a matrix while preserving the eigengaps, and then applying the matrix power method for only a few iterations, just enough to obtain vectors meeting that sufficient condition. The number of iterations needed here is only $O(\log d)$, independent of $\varepsilon$, where d is the dimension of the eigenvectors.

The result stated above is for orthogonal tensors. On the other hand, it is known that a nonorthogonal tensor with linearly independent eigenvectors can be converted into an orthogonal one with the help of some whitening matrix. However, previous works usually pay little attention to how to find such a whitening matrix efficiently. According to (Anandkumar et al., 2014a), one way is via SVD of some second moment matrix, but doing this using the matrix power method would take longer to converge than the tensor power method which would then be applied to the whitened tensor. Our second contribution is to provide an efficient way to find a whitening matrix, by applying only one iteration of the matrix power method.

While most previous works on tensor decomposition focus on the batch setting, storing even a tensor of order three requires $\Omega(d^3)$ space, which is infeasible for a large d. We show how to avoid this in the streaming setting, with a stream of data arriving one at a time, which is the only source of information about the tensor. We provide a streaming algorithm using only $O(kd)$ space, which is the smallest possible, just enough to store the k eigenvectors of dimension d. To achieve an approximation error $\varepsilon$, the total number of samples we need is $O(kd\log d + \frac{1}{\varepsilon^2}\log(d\log\frac{1}{\varepsilon}))$, with the dependence on the eigengaps again hidden as constant factors.

Related Works. There is a huge literature on tensor decomposition, and it is beyond the scope of this paper to give a comprehensive survey.
Thus, we only compare our results to the most related ones, particularly those based on the power method. While different works may focus on different aspects, we are most interested in understanding how the error parameter $\varepsilon$ affects various performance measures, having in mind a small $\varepsilon$.

First, the batch algorithm of Anandkumar et al. (2014a), using the better analysis in (Wang & Anandkumar, 2016), runs in about $O((k^2\log k)(\log\log\frac{1}{\varepsilon}))$ time, which can be made to run in $O(k\log\log\frac{1}{\varepsilon})$ iterations in parallel, while ours are $O(k\log\log\frac{1}{\varepsilon})$ and $O(\log\log\frac{1}{\varepsilon})$, respectively. On the other hand, one advantage of their algorithm is that its running time does not depend on the eigengaps, while ours has that dependence hidden above as constants. In the streaming setting, Wang & Anandkumar (2016) provided an algorithm using $O(dk\log k)$ memory and $O(\frac{k}{\varepsilon^2}\log(d\log\log\frac{1}{\varepsilon}))$ samples, while ours only uses $O(dk)$ memory and $O(\frac{1}{\varepsilon^2}\log(d\log\log\frac{1}{\varepsilon}))$ samples. (We use a different input distribution from theirs; the bound listed here is modified from theirs according to our distribution.) Nevertheless, the sample complexity of Wang & Anandkumar (2016) is also independent of the eigengaps, while ours has that dependence hidden above as a constant factor. As one can see, our algorithms, which find the k eigenvectors simultaneously, allow us to save a factor of k in the time complexity and the sample complexity, although our bounds may become worse when the eigengaps are small. Thus, our algorithms can be seen as new options for users to choose from, depending on the data they are given.

Although not directly related, let us also compare to previous works on SVD. Two related ones, both based on the simultaneous matrix power method, are the batch algorithm of (Hardt & Price, 2014), which converges in $O(\log\frac{d}{\varepsilon})$ iterations, and the streaming algorithm of (Li et al., 2016), which requires $O(\frac{1}{\varepsilon^2}\log(d\log\frac{1}{\varepsilon}))$ samples. Both bounds are worse than ours and also depend on the eigengaps. Thus, although one approach for orthogonal tensor decomposition is to reduce it to a matrix SVD problem, this does not appear to give better performance than ours. Finally, comparisons of the tensor power method with other approaches can be found in works such as (Anandkumar et al., 2014a; Wang & Anandkumar, 2016). For example, the online SGD approach of (Ge et al., 2015) works only for tensors of even order, and its sample complexity has a poor dependency on the dimension d.

Organization of the paper. First, we provide some preliminaries in Section 2. Then we present our batch algorithm for orthogonal and symmetric tensors of order three in Section 3, and for general orthogonal tensors in Section 4. In Section 5, we introduce our whitening procedure for nonorthogonal but symmetric tensors. Finally, in Section 6, we present our algorithm for the streaming setting. Due to the space limitation, all our proofs are deferred to the appendix in the supplementary material.

2. Preliminaries

Let us first introduce some notation and definitions which we will use later. Let $\mathbb{R}$ denote the set of real numbers and $\mathbb{N}$ the set of positive integers. Let $\mathcal{N}(0,1)$ denote the standard normal distribution with mean 0 and variance 1, and let $\mathcal{N}^d(0,1)$, for $d\in\mathbb{N}$, denote the d-variate distribution which has each of its d dimensions sampled independently from $\mathcal{N}(0,1)$. For $d\in\mathbb{N}$, let $[d]$ denote the set $\{1,\dots,d\}$. For a vector $x$, let $\|x\|$ denote its $L_2$ norm. For $d\in\mathbb{N}$, let $I_d$ denote the $d\times d$ identity matrix.
For a matrix $A\in\mathbb{R}^{d\times k}$, let $A_i$, for $i\in[k]$, denote its i-th column, and let $A_{i,j}$, for $j\in[d]$, be the j-th entry of $A_i$. Moreover, for a matrix $A$, let $A^\top$ denote its transpose, and define its norm as $\|A\| = \max_{x\in\mathbb{R}^k} \|Ax\|/\|x\|$, using the convention that $0/0 = 0$.

Tensors are the focus of our paper, and they can be seen as generalizations of matrices to higher orders. For simplicity of presentation, we will use symmetric tensors of order three as examples in the following definitions. A real tensor $T$ of order three can be seen as a three-dimensional array in $\mathbb{R}^{d\times d\times d}$, for some $d\in\mathbb{N}$, with its $(i,j,k)$-th entry denoted as $T_{i,j,k}$. For such a tensor $T$ and three matrices $A\in\mathbb{R}^{d\times m_1}$, $B\in\mathbb{R}^{d\times m_2}$, $C\in\mathbb{R}^{d\times m_3}$, let $T(A,B,C)$ be the tensor in $\mathbb{R}^{m_1\times m_2\times m_3}$ with its $(a,b,c)$-th entry defined as $\sum_{i,j,k\in[d]} T_{i,j,k}\,A_{a,i}\,B_{b,j}\,C_{c,k}$. The norm of a tensor $T$ we will use is the operator norm: $\|T\| = \max_{x,y,z\in\mathbb{R}^d} |T(x,y,z)|/(\|x\|\,\|y\|\,\|z\|)$.

The tensor decomposition problem. In this problem, there is a tensor $T$ with some unknown decomposition
$T = \sum_{i\in[d]} \lambda_i\, u_i\otimes u_i\otimes u_i$,
with $\lambda_i \ge 0$ and $u_i\in\mathbb{R}^d$ for any $i\in[d]$. Then given some $k\in[d]$ and $\varepsilon\in(0,1)$, our goal is to find $\hat\lambda_i$ and $\hat u_i$ with $|\hat\lambda_i - \lambda_i| \le \varepsilon$ and $\|\hat u_i - u_i\| \le \varepsilon$, for every $i\in[k]$. We will assume that
$\sum_{i\in[d]} \lambda_i \le 1$ and $\forall i\in[k]: \lambda_i > \lambda_{i+1}$.  (1)
As in previous works, we consider a slightly harder version of the problem, in which we only have access to some noisy version of $T$, instead of the noiseless $T$. We will consider the following two settings. In the batch setting, we have access to some $\tilde T = T + \Phi$ the whole time, for some perturbation tensor $\Phi$. In the streaming setting, we have a stream of data points $x_1, x_2, \dots$ arriving one by one, which provide the only information we have about $T$, with each $x_\tau\in\mathbb{R}^d$ allowing us to compute some $\tilde T_\tau$ with mean $\mathbb{E}[\tilde T_\tau] = T$. In this streaming setting, we are particularly interested in the case of a large d, which prohibits one from storing a tensor of size $d^3$ in memory.

Power Method: Matrices versus Tensors. Note that a tensor of order two is just a matrix, and a popular approach for decomposing matrices is the so-called power method, which works as follows. Suppose we are given a $d\times d$ matrix $M = \sum_{i\in[d]} \lambda_i\, u_i u_i^\top$, with nonnegative $\lambda_1 > \lambda_2$ and orthonormal vectors $u_1,\dots,u_d$. The power method starts with some $q^{(0)} = \sum_{i\in[d]} c_i u_i$, usually chosen randomly, and then repeatedly performs the update $q^{(t)} = M q^{(t-1)}$, which results in
$q^{(t)} = \sum_{i\in[d]} \lambda_i (u_i^\top q^{(t-1)})\, u_i = \sum_{i\in[d]} \lambda_i^t c_i\, u_i$.
Note that for any $i \neq 1$, as $\lambda_i < \lambda_1$, the coefficient $\lambda_i^t c_i$ will soon become much smaller than the coefficient $\lambda_1^t c_1$ if $c_1$ is not too small, which is likely to happen for a randomly chosen $q^{(0)}$. This has the effect that after normalization, $q^{(t)}/\|q^{(t)}\|$ approaches $u_1$ quickly.

Now consider a tensor $T = \sum_{i\in[d]} \lambda_i\, u_i\otimes u_i\otimes u_i$, with nonnegative $\lambda_1 > \lambda_2$ and orthonormal vectors $u_1,\dots,u_d$. The tensor version of the power method again starts from a randomly chosen $q^{(0)} = \sum_{i\in[d]} c_i u_i$, but now repeatedly performs the update $q^{(t)} = T(I_d, q^{(t-1)}, q^{(t-1)})$, which in turn results in
$q^{(t)} = \sum_{i\in[d]} \lambda_i (u_i^\top q^{(t-1)})^2\, u_i = \sum_{i\in[d]} \lambda_i^{-1} (\lambda_i c_i)^{2^t}\, u_i$.
The coefficient of each $u_i$ now has a different form from the matrix case, and this leads to the following two effects. First, one now has much less control over what $q^{(t)}/\|q^{(t)}\|$ converges to. In fact, it can converge to any $u_i \neq u_1$ if $u_i$ has the largest value of $\lambda_i|c_i|$, which happens with a good probability if $\lambda_i$ is not much smaller than $\lambda_1$. Consequently, to find the top k vectors $u_1,\dots,u_k$, previous works based on the power method all need much more complicated procedures (Anandkumar et al., 2014a), compared to those for matrices, as discussed in the introduction. On the other hand, the different form of $q^{(t)}$ has the beneficial effect that the convergence is now exponentially faster than in the matrix case. More precisely, if $\lambda_i|c_i| < \lambda_j|c_j|$, then the gap between the coefficients $(\lambda_i c_i)^{2^t}$ and $(\lambda_j c_j)^{2^t}$ is amplified much faster. We will show how to inherit this nice property of faster convergence while avoiding the difficulty discussed above.
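To make the contrast concrete, here is a minimal numpy sketch (our own toy construction, not taken from the paper) that runs the two iterations side by side on a synthetic orthogonal decomposition; the names `lam`, `U`, and the two step functions are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
lam = np.sort(rng.uniform(0.2, 1.0, d))[::-1]     # lambda_1 > ... > lambda_d (almost surely)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthonormal u_1, ..., u_d as columns of U

M = (U * lam) @ U.T                               # M = sum_i lambda_i u_i u_i^T
T = np.einsum('i,ai,bi,ci->abc', lam, U, U, U)    # T = sum_i lambda_i u_i (x) u_i (x) u_i

def matrix_power_step(q):
    q = M @ q                                     # q^(t) = M q^(t-1)
    return q / np.linalg.norm(q)

def tensor_power_step(q):
    q = np.einsum('abc,b,c->a', T, q, q)          # q^(t) = T(I, q^(t-1), q^(t-1))
    return q / np.linalg.norm(q)

q0 = rng.standard_normal(d)
q_mat, q_ten = q0.copy(), q0.copy()
for _ in range(10):
    q_mat = matrix_power_step(q_mat)
    q_ten = tensor_power_step(q_ten)

# The matrix iterate aligns with u_1; the tensor iterate aligns with whichever
# u_i maximizes lambda_i * |c_i| for this particular start, as discussed above.
print(np.abs(U.T @ q_mat).round(3))
print(np.abs(U.T @ q_ten).round(3))
```

After a few steps the squared-coefficient dynamics typically make the tensor iterate lock onto a single component much more sharply than the matrix iterate does, though not necessarily onto $u_1$.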
3. Orthogonal and Symmetric Tensors of Order Three

In this section, we focus on the special case in which the tensors to be decomposed are orthogonal, symmetric, and of order three. Formally, there is an underlying tensor
$T = \sum_{i\in[d]} \lambda_i\, u_i\otimes u_i\otimes u_i$,
with orthonormal vectors $u_i$'s and real $\lambda_i$'s satisfying the condition (1). Then given $k\in[d]$ and $\varepsilon\in(0,1)$, our goal is to find approximations of those $\lambda_i$ and $u_i$ within distance $\varepsilon$, but we only have access to some noisy tensor $\tilde T = T + \Phi$, for some symmetric perturbation tensor $\Phi$.

Algorithm 1 Robust tensor power method
  Input: tensor $\tilde T\in\mathbb{R}^{d\times d\times d}$ and parameters $k, L, S, N$.
  Initialization Phase:
    Sample $w_1,\dots,w_L, Y^{(0)}_1,\dots,Y^{(0)}_k \sim \mathcal{N}^d(0,1)$.
    Compute $w = \frac{1}{L}\sum_{j\in[L]} \tilde T(I_d, w_j, w_j)$.
    Compute $M = \tilde T(I_d, I_d, w)$.
    Factorize $Y^{(0)}$ as $Z^{(0)} R^{(0)}$ by QR decomposition.
    for $s = 1$ to $S$ do
      Compute $Y^{(s)} = M Z^{(s-1)}$.
      Factorize $Y^{(s)}$ as $Z^{(s)} R^{(s)}$ by QR decomposition.
    end for
  Tensor Power Phase:
    Let $Q^{(0)} = Z^{(S)}$.
    for $t = 1$ to $N$ do
      Compute $Y^{(t)}_j = \tilde T(I_d, Q^{(t-1)}_j, Q^{(t-1)}_j)$ for $j\in[k]$.
      Factorize $Y^{(t)}$ as $Q^{(t)} R^{(t)}$ by QR decomposition.
    end for
  Output: $\hat u_j = Q^{(N)}_j$ and $\hat\lambda_j = \tilde T(\hat u_j, \hat u_j, \hat u_j)$ for $j\in[k]$.

Our algorithm is given in Algorithm 1, which consists of two phases: the initialization phase and the tensor power phase. The main phase is the tensor power phase, which we discuss in detail in Subsection 3.1. For our tensor power phase to work, it needs a good starting point. This is provided by the initialization phase, which we discuss in detail in Subsection 3.2. Through these two subsections, we will prove Theorem 1 below, which summarizes the performance of our algorithm in terms of the following parameters of the tensor:
$\gamma = \min_{i\in[k]} \frac{\lambda_i^2 - \lambda_{i+1}^2}{\lambda_i^2}$ and $\Delta = \min_{i\in[k]} (\lambda_i - \lambda_{i+1})$.

Theorem 1. Suppose $\varepsilon \le \frac{\lambda_k}{2}$ and the perturbation tensor has the bound $\|\Phi\| \le \alpha_0\min\{\varepsilon\Delta, \frac{\Delta^2}{\sqrt{dk}}\}$ for a small enough constant $\alpha_0$. Then for some $L = O(\frac{1}{\gamma^2}\log d)$, $S = O(\frac{1}{\gamma}\log d)$, and $N = O(\log\log\frac{1}{\varepsilon})$, our Algorithm 1 with high probability outputs $\hat u_i$ and $\hat\lambda_i$ with $\|\hat u_i - u_i\| \le \varepsilon$ and $|\hat\lambda_i - \lambda_i| \le \varepsilon$ for every $i\in[k]$.

Let us make some remarks about the theorem. First, the L samples are used to compute $w$ and $M$, which can be done in a parallel way. Second, our parameter $\gamma$ is related to the parameter $\gamma' = \min_{i\in[k]}\frac{\lambda_i - \lambda_{i+1}}{\lambda_i}$ used in (Hardt & Price, 2014), and it is easy to verify that $\gamma \ge \gamma'$. Thus, our algorithm for tensors converges in $O(\frac{1}{\gamma}\log d + \log\log\frac{1}{\varepsilon})$ rounds, which is faster than the $O(\frac{1}{\gamma'}\log d + \frac{1}{\gamma'}\log\frac{1}{\varepsilon})$ rounds of (Hardt & Price, 2014) for matrices. Note that our dependence on the error parameter $\varepsilon$ is exponentially smaller than that of (Hardt & Price, 2014), which means that for a small $\varepsilon$, we can decompose tensors much faster than matrices. Finally, compared to previous works on tensors, our convergence time for a small $\varepsilon$ is about $O(\log\log\frac{1}{\varepsilon})$, while those in (Anandkumar et al., 2014a; Wang & Anandkumar, 2016) are at least $\Omega(k\log\log\frac{1}{\varepsilon})$.
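For reference, the following is a minimal numpy sketch of Algorithm 1 in the batch setting. It follows the pseudocode above but makes no attempt at the careful choices of L, S, N from Theorem 1; the function and variable names (`robust_tensor_power`, `T_noisy`) are our own.

```python
import numpy as np

def robust_tensor_power(T_noisy, k, L, S, N, rng=None):
    """Sketch of Algorithm 1 for a noisy symmetric (d, d, d) tensor T_noisy = T + Phi."""
    rng = np.random.default_rng() if rng is None else rng
    d = T_noisy.shape[0]

    # Initialization phase: average projections to get w, then M = T(I, I, w),
    # followed by S steps of the simultaneous matrix power method.
    ws = rng.standard_normal((L, d))
    w = np.einsum('abc,jb,jc->ja', T_noisy, ws, ws).mean(axis=0)
    M = np.einsum('abc,c->ab', T_noisy, w)
    Z, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(S):
        Z, _ = np.linalg.qr(M @ Z)

    # Tensor power phase: simultaneous tensor power iterations with QR re-orthonormalization.
    Q = Z
    for _ in range(N):
        Y = np.stack([np.einsum('abc,b,c->a', T_noisy, Q[:, j], Q[:, j])
                      for j in range(k)], axis=1)
        Q, _ = np.linalg.qr(Y)

    lam = np.array([np.einsum('abc,a,b,c->', T_noisy, Q[:, j], Q[:, j], Q[:, j])
                    for j in range(k)])
    Q[:, lam < 0] *= -1.0   # resolve the sign ambiguity left by QR so the estimates are nonnegative
    return np.abs(lam), Q
```

The sketch uses numpy's reduced QR factorization in place of explicit Gram-Schmidt; the two differ only in sign conventions, which the last step resolves.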
3.1. Our Robust Tensor Power Method

The tensor power phase of our Algorithm 1 is based on our version of the tensor power method, which works as follows. At each step t, we maintain a $d\times k$ matrix $Q^{(t)}$ with columns $Q^{(t)}_1,\dots,Q^{(t)}_k$ as our current estimates of $u_1,\dots,u_k$, obtained by updating the previous estimates with the following two operations.

The main operation is to apply the noisy tensor $\tilde T$ on them simultaneously to get a $d\times k$ matrix $Y^{(t)}$ with its j-th column computed as
$Y^{(t)}_j = \tilde T(I_d, Q^{(t-1)}_j, Q^{(t-1)}_j) = \sum_{i\in[d]} \lambda_i (u_i^\top Q^{(t-1)}_j)^2\, u_i + \hat\Phi^{(t)}_j$,
where $\hat\Phi^{(t)}_j = \Phi(I_d, Q^{(t-1)}_j, Q^{(t-1)}_j)$. This implies that
$\forall i\in[d]:\ u_i^\top Y^{(t)}_j = \lambda_i (u_i^\top Q^{(t-1)}_j)^2 + u_i^\top \hat\Phi^{(t)}_j$,  (2)
which shows the progress made by this operation. The second operation is to orthogonalize $Y^{(t)}$ as $Y^{(t)} = Q^{(t)} R^{(t)}$, by the QR decomposition via the Gram-Schmidt process, to obtain a $d\times k$ matrix $Q^{(t)}$ with columns $Q^{(t)}_1,\dots,Q^{(t)}_k$, which then become our new estimates. As we will show in Lemma 1 below, given a small enough $\|\Phi\|$, if we start with a full-rank $Q^{(0)}$, then each $Q^{(t)}$ also has full rank and consists of orthonormal columns, and each $R^{(t)}$ is invertible. Moreover, although we apply the QR decomposition on the whole matrix $Y^{(t)}$ to obtain the matrix $Q^{(t)}$, it has the effect that for any $m\in[k]$, the first m columns of $Q^{(t)}$ can be seen as obtained from the first m columns of $Y^{(t)}$ by a QR decomposition. This property is needed in our Lemma 1 and Theorem 2 below to guarantee the simultaneous convergence of $Q^{(t)}_i$ to $u_i$ for every $i\in[k]$.

Before stating Lemma 1, which guarantees the progress we make at each step, let us prepare some notation. For a $d\times k$ matrix $A$ and some $m\in[k]$, let $A_{[m]}$ denote the $d\times m$ matrix containing the first m columns of $A$. Let $U$ denote the $d\times k$ matrix with the target vector $u_i$ as its i-th column. For a $d\times k$ matrix $Q$ and some $m\in[k]$, define
$\cos_m(Q) = \min_{y\in\mathbb{R}^m} \|U_{[m]}^\top Q_{[m]}\, y\| / \|Q_{[m]}\, y\|$,
which equals the cosine of the m-th principal angle between the column spaces of $U_{[m]}$ and $Q_{[m]}$; let $\sin_m(Q) = \sqrt{1 - \cos^2_m(Q)}$, and let us use as the error measure $\tan_m(Q) = \sin_m(Q)/\cos_m(Q)$. More information about principal angles can be found in, e.g., (Golub & Van Loan, 1996). Then we have the following lemma, which we prove in Appendix B.1.

Lemma 1. Fix any $m\in[k]$ and $t \ge 0$. Let $\hat\Phi^{(t)}_{[m]}$ denote the $d\times m$ matrix with $\hat\Phi^{(t)}_j = \Phi(I_d, Q^{(t-1)}_j, Q^{(t-1)}_j)$ as its j-th column, and suppose
$\|\hat\Phi^{(t)}_{[m]}\| < \Delta\cdot\min\{\beta,\ \cos^2_m(Q^{(t-1)})\}$,  (3)
for some $\beta > 0$. Then for $\rho = \max_{i\in[k]} (\frac{\lambda_{i+1}}{\lambda_i})^{1/4}$, we have
$\tan_m(Q^{(t)}) \le \max\{\beta,\ \max\{\beta, \rho\}\cdot\tan^2_m(Q^{(t-1)})\}$.

Observe that the guarantee provided by the lemma above has a similar form to that in (Hardt & Price, 2014) for matrices. The main difference is that here in the tensor case, the error measure is essentially squared after each step, which has the following two implications. First, to guarantee that the error is indeed reduced, we need $\tan_m(Q^{(t-1)})$ to be small enough (say, less than one), unlike in the matrix case. Next, if we indeed have a small enough $\tan_m(Q^{(t-1)})$, then the error is reduced at a much faster rate than in the matrix case. Another difference is that here we provide the guarantee for all the k submatrices $Q^{(t)}_{[m]}$, for $m\in[k]$, instead of just one matrix $Q^{(t)}$.
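As a concrete reference for this error measure, here is a small numpy helper (our own, not from the paper) that computes $\tan_m(Q)$ via the singular values of $U_{[m]}^\top Q_{[m]}$, assuming both matrices already have orthonormal columns.

```python
import numpy as np

def tan_m(U, Q, m):
    """tan of the m-th principal angle between col(U[:, :m]) and col(Q[:, :m]).

    With orthonormal columns, ||Q[:, :m] y|| = ||y||, so the minimization in the
    definition of cos_m equals the smallest singular value of U[:, :m]^T Q[:, :m].
    """
    s = np.linalg.svd(U[:, :m].T @ Q[:, :m], compute_uv=False)
    cos = s.min()
    sin = np.sqrt(max(0.0, 1.0 - cos ** 2))
    return sin / cos
```

Tracking this quantity for every $m\in[k]$ simultaneously is exactly what the guarantee of Lemma 1 provides.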
Providing the guarantee for all k submatrices allows us to show the simultaneous convergence of each column $Q^{(t)}_i$ to the target vector $u_i$ for every $i\in[k]$, as given in the following theorem, which we prove in Appendix B.2.

Theorem 2. For any $\varepsilon\in(0, \frac{\lambda_k}{2})$, there exists some $N \le O(\log\log\frac{1}{\varepsilon})$ such that the following holds. Suppose the perturbation is bounded by $\|\Phi\| \le \frac{\varepsilon\Delta}{2}$ and we start from some initial $Q^{(0)}$ with $\tan_m(Q^{(0)}) \le 1$ for every $m\in[k]$. Then for any $t \ge N$, with $\hat u_i = Q^{(t)}_i$ and $\hat\lambda_i = \tilde T(\hat u_i, \hat u_i, \hat u_i)$, we have $\|u_i - \hat u_i\| \le \varepsilon$ and $|\lambda_i - \hat\lambda_i| \le \varepsilon$, for every $i\in[k]$.

Note that the convergence rate guaranteed by the theorem above is exponentially faster than that in (Hardt & Price, 2014) for matrices, assuming that we indeed have such a good initial $Q^{(0)}$ to start with. In the next subsection, we show how it can be found efficiently.

3.2. Initialization Procedure

Our approach for finding a good initialization is to project the tensor down to a matrix and apply the matrix power method for only a few steps, just enough to make the tangents less than one. Although we could continue applying the matrix power method until reaching the much smaller target bound $\varepsilon$, this would take exponentially longer than switching to the tensor power method as we actually do.

As mentioned above, we would first like to project the tensor $\tilde T$ down to a matrix. A naive approach is to sample a random vector w and take the matrix $\tilde T(I_d, I_d, w)$, which is close to $T(I_d, I_d, w) = \sum_{i\in[d]} \lambda_i (u_i^\top w)\, u_i u_i^\top$. However, this may mess up the gaps between eigenvalues, which are needed to guarantee the convergence rate of the matrix power method. The reason is that as each $u_i^\top w$ has mean zero, the coefficient $\lambda_i (u_i^\top w)$ also has mean zero and thus has a good chance of coming very close to the others. To preserve the gaps, we would like to have $u_i^\top w$ sufficiently larger than $u_{i+1}^\top w$ for each i with high probability.

To achieve this, let us first imagine sampling a random $w\in\mathbb{R}^d$ from $\mathcal{N}^d(0,1)$ and computing the vector $\tilde T(I_d, w, w)$, which is close to $T(I_d, w, w) = \sum_{i\in[d]} \lambda_i (u_i^\top w)^2\, u_i$. Then one can show that for every i, $\mathbb{E}[(u_i^\top w)^2] = 1$, so that $\mathbb{E}[\tilde T(I_d, w, w)] \approx \mathbb{E}[T(I_d, w, w)] = \sum_{i\in[d]} \lambda_i u_i$, and $u_i^\top\mathbb{E}[T(I_d, w, w)] = \lambda_i > \lambda_{i+1} = u_{i+1}^\top\mathbb{E}[T(I_d, w, w)]$. However, we want the gap-preserving guarantee to hold with high probability, instead of only in expectation. Thus we go further by sampling not just one, but some number L of vectors $w_1,\dots,w_L$ independently from the distribution $\mathcal{N}^d(0,1)$, and then taking the average
$w = \frac{1}{L}\sum_{j\in[L]} \tilde T(I_d, w_j, w_j)$.  (4)
The following lemma shows that such a w is likely to have $u_i^\top w \approx \lambda_i$, which we prove in Appendix B.3.

Lemma 2. Suppose we have $\tilde T = T + \Phi$ with $\|\Phi\| \le \frac{\Delta}{3\sqrt d}$. Then for some $L \le O(\frac{1}{\gamma^2}\log d)$, the vector w computed according to (4) with high probability satisfies
$|u_i^\top w - \lambda_i| \le \frac{1}{4}(\lambda_i\gamma + 2\Delta)$ for every $i\in[k]$.  (5)

With this w, we compute the matrix $M = \tilde T(I_d, I_d, w)$. As shown by the following lemma, which we prove in Appendix B.4, M is close to a matrix with the $u_i$'s as eigenvectors and good gaps between eigenvalues.

Lemma 3. Suppose we have $\tilde T = T + \Phi$. Then for any w satisfying the condition (5) in Lemma 2, the matrix $M = \tilde T(I_d, I_d, w)$ can be decomposed as
$M = \sum_{i\in[d]} \bar\lambda_i\, u_i u_i^\top + \bar\Phi$,
for some $\bar\lambda_i$'s with $\bar\lambda_i - \bar\lambda_{i+1} \ge \frac{\Delta^2}{2}$ for $i\in[k]$, and $\bar\Phi = \Phi(I_d, I_d, w)$ with $\|\bar\Phi\| \le 2\|\Phi\|$.

With such a matrix M, we next apply the matrix power method of (Hardt & Price, 2014) to find good approximations of its eigenvectors. More precisely, we sample an initial matrix $Y^{(0)}\in\mathbb{R}^{d\times k}$ by choosing each of its columns independently according to the distribution $\mathcal{N}^d(0,1)$, and factorize it as $Y^{(0)} = Z^{(0)} R^{(0)}$ by QR decomposition via the Gram-Schmidt process. Then at step $s \ge 1$, we multiply the previous estimate $Z^{(s-1)}$ by M to obtain $Y^{(s)} = M Z^{(s-1)}$, factorize it as $Y^{(s)} = Z^{(s)} R^{(s)}$ by QR decomposition via the Gram-Schmidt process, and then take the orthonormal $Z^{(s)}$ as the new estimate.
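To see numerically why the averaging in (4) helps, the following toy sketch (our own construction, using the noiseless T for simplicity) compares the naive coefficients $\lambda_i(u_i^\top w_0)$ with the inner products $u_i^\top w$ obtained from (4):

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 20, 5000
lam = np.linspace(1.0, 0.2, d); lam /= lam.sum()   # lambda_1 > ... > lambda_d, summing to 1
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
T = np.einsum('i,ai,bi,ci->abc', lam, U, U, U)

# Naive projection T(I, I, w0): coefficients lambda_i * (u_i^T w0) are zero-mean
# and can land in any order, destroying the gaps.
w0 = rng.standard_normal(d)
naive = lam * (U.T @ w0)

# Averaged projection (4): u_i^T w concentrates around lambda_i, preserving the order.
ws = rng.standard_normal((L, d))
w = np.einsum('abc,jb,jc->ja', T, ws, ws).mean(axis=0)

print(naive[:5].round(4))                          # typically scrambled in sign and order
print((U.T @ w)[:5].round(4), lam[:5].round(4))    # tracks lambda_1, ..., lambda_5
```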
The following lemma shows the number of steps needed to find a good enough $Z^{(s)}$.

Lemma 4. Suppose we are given a matrix M having the decomposition described in Lemma 3, with $\|\bar\Phi\| \le \frac{2\alpha_0\Delta^2}{\sqrt{dk}}$ for a small enough constant $\alpha_0$. Then there exists some $S \le O(\frac{1}{\gamma}\log d)$ such that with high probability, we have $\tan_m(Z^{(s)}) \le 1$ for every $m\in[k]$ whenever $s \ge S$.

This together with the previous two lemmas guarantees that given $\tilde T = T + \Phi$ with $\|\Phi\| \le \alpha_0\min\{\frac{\Delta}{\sqrt d}, \frac{\Delta^2}{\sqrt{dk}}\}$, for a small enough constant $\alpha_0$, we can obtain with high probability a good $Z^{(S)}$ which can be used as the initial $Q^{(0)}$ for our tensor power phase. Combining this with Theorem 2 in the previous subsection, we then have our Theorem 1 given at the beginning of the section.

4. General Orthogonal Tensors

In the previous section, we considered tensors which are orthogonal, symmetric, and of order three. In this section, we show how to extend our results to general orthogonal tensors, dealing with higher orders first and then with asymmetry.

4.1. Higher-Order Tensors

To handle orthogonal and symmetric tensors of any order, only the initialization procedure needs to be modified. First, for tensors of any odd order, a straightforward modification is as follows. Take for example a tensor of order $2r+1$. Now we simply compute
$w = \frac{1}{L}\sum_{j\in[L]} \tilde T(I_d, w_j, \dots, w_j)$,
with 2r copies of $w_j$, and similarly to Lemma 2, one can show that w is likely to be close to the vector $\sum_{i\in[d]} \lambda_i u_i$. With such a vector w, one can show that the matrix $M = \tilde T(I_d, I_d, w, \dots, w)$, with $2r-1$ copies of w, is close to the matrix $\sum_{i\in[d]} \lambda_i^{2r}\, u_i u_i^\top$, similarly to Lemma 3. Then the rest is the same as in the previous section. Note that this approach has the eigenvalues decrease exponentially in r. A different approach avoiding this is to compute M directly as
$M = \frac{1}{L}\sum_{j\in[L]} \big(\tilde T(I_d, I_d, w_j, \dots, w_j)\big)^2$,
which is close to
$\frac{1}{L}\sum_{j\in[L]} \big(T(I_d, I_d, w_j, \dots, w_j)\big)^2 = \frac{1}{L}\sum_{j\in[L]}\sum_{i\in[d]} \lambda_i^2 (u_i^\top w_j)^{4r-2}\, u_i u_i^\top = \sum_{i\in[d]} \lambda_i^2 \Big(\frac{1}{L}\sum_{j\in[L]} (u_i^\top w_j)^{4r-2}\Big)\, u_i u_i^\top$.
Then one can again show that such a matrix M is likely to be close to the matrix $\sum_{i\in[d]} \lambda_i^2\, u_i u_i^\top$.

To handle tensors of even order, the initialization is slightly different but the idea is similar. Given a tensor of order 2r, we again sample vectors $w_1,\dots,w_L$ as before, but now we compute the matrix directly as
$M = \frac{1}{L}\sum_{j\in[L]} \tilde T(I_d, I_d, w_j, \dots, w_j)$,
with $2r-2$ copies of $w_j$. As before, one can show that the matrix M is likely to be close to the matrix $\sum_{i\in[d]} \lambda_i\, u_i u_i^\top$. Then again we can apply the matrix power method on M and obtain a good initialization for the tensor power method as before. Note that now the eigenvalues are no longer squared, and the previous requirement on $\|\Phi\|$ can be slightly relaxed, with the dependence on $\Delta^2$ being replaced by $\Delta$.

4.2. Asymmetric Tensors

For simplicity of presentation, let us focus on the third-order case; the extension to higher orders is straightforward. That is, now the underlying tensor has the form
$T = \sum_{i\in[d]} \lambda_i\, a_i\otimes b_i\otimes c_i$,
with nonnegative $\lambda_i$'s satisfying the condition (1), together with three sets of orthonormal vectors, the $a_i$'s, $b_i$'s, and $c_i$'s. As before, we only have access to a noisy version $\tilde T$ of T.

The main modification of our algorithm is again to the initialization procedure, and the idea is also similar. To find a good initial matrix A for the $a_i$'s, we sample $w_1,\dots,w_L$ independently from $\mathcal{N}^d(0,1)$, and now compute the matrix
$M = \frac{1}{L}\sum_{j\in[L]} \tilde T(I_d, I_d, w_j)\, \tilde T(I_d, I_d, w_j)^\top$.
As before, it is not hard to show that M is close to
$\frac{1}{L}\sum_{j\in[L]}\sum_{i\in[d]} \lambda_i^2 (c_i^\top w_j)^2\, a_i a_i^\top$,
which is close to the matrix $\sum_{i\in[d]} \lambda_i^2\, a_i a_i^\top$ with high probability. From the matrix M, we can again apply the matrix power method to find a good initial matrix A. Similarly, we can find good initial matrices B and C for the $b_i$'s and $c_i$'s, respectively.

Next, with such matrices, we would like to apply the tensor power method, which we modify as follows. Now at each step t, we take the previous estimates $A^{(t-1)}, B^{(t-1)}, C^{(t-1)}$, and compute
$X^{(t)}_i = \tilde T(I_d, B^{(t-1)}_i, C^{(t-1)}_i)$, $Y^{(t)}_i = \tilde T(A^{(t-1)}_i, I_d, C^{(t-1)}_i)$, $Z^{(t)}_i = \tilde T(A^{(t-1)}_i, B^{(t-1)}_i, I_d)$,
for $i\in[k]$, followed by orthonormalizing $X^{(t)}, Y^{(t)}, Z^{(t)}$ via QR decomposition to obtain the new estimates $A^{(t)}, B^{(t)}, C^{(t)}$. It is not hard to show that the resulting algorithm has a similar convergence rate as our Algorithm 1.
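A minimal numpy sketch of one such asymmetric update step (our own names; `T_noisy` is the noisy tensor and A, B, C hold the current k estimates as columns):

```python
import numpy as np

def asymmetric_power_step(T_noisy, A, B, C):
    """One step: X_i = T(I, B_i, C_i), Y_i = T(A_i, I, C_i), Z_i = T(A_i, B_i, I),
    followed by QR orthonormalization of X, Y, Z to get the new A, B, C."""
    k = A.shape[1]
    X = np.stack([np.einsum('abc,b,c->a', T_noisy, B[:, i], C[:, i]) for i in range(k)], axis=1)
    Y = np.stack([np.einsum('abc,a,c->b', T_noisy, A[:, i], C[:, i]) for i in range(k)], axis=1)
    Z = np.stack([np.einsum('abc,a,b->c', T_noisy, A[:, i], B[:, i]) for i in range(k)], axis=1)
    A_new, _ = np.linalg.qr(X)
    B_new, _ = np.linalg.qr(Y)
    C_new, _ = np.linalg.qr(Z)
    return A_new, B_new, C_new
```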
5. Nonorthogonal but Symmetric Tensors

In the previous section, we considered general orthogonal tensors, which can be asymmetric. In this section, we consider nonorthogonal tensors which are symmetric. We remark that for some latent variable models, such as the multi-view model, the corresponding asymmetric tensors can be converted into symmetric ones (Anandkumar et al., 2014a), so that our result here can still be applied.

For simplicity of exposition, let us again focus on the case of order three, so that the given tensor has the form $T = \sum_{i\in[d]} \lambda_i\, v_i\otimes v_i\otimes v_i$, but the vectors $v_i$'s are no longer assumed to be orthogonal to each other. Still, we assume them to be linearly independent, and we again assume without loss of generality that $\|T\| \le 1$ and $\|v_i\| = 1$ for each i. In addition, let us assume, as in previous works, that $\lambda_j = 0$ for $j \ge k+1$. (This assumption is not necessary; we make it just to simplify the first step of our algorithm given below. Without it, we can simply replace that step by the matrix power method used in our Algorithm 1, which takes more steps but can still do the job.)

Following (Anandkumar et al., 2014a), we would like to whiten such a tensor T into an orthogonal one, so that we can then apply our Algorithm 1. More precisely, our goal is to find a $d\times k$ matrix W such that the tensor $T(W, W, W)$ becomes orthogonal. As in (Anandkumar et al., 2014a), assume that we also have available a matrix $M = \sum_{i\in[k]} \lambda_i\, v_i v_i^\top$. (More generally, the weights $\lambda_i$ in M are allowed to differ from those in T, but for simplicity we assume they are the same.) Then for a whitening matrix, it suffices to find some W such that $W^\top M W = I_k$. The reason is that
$I_k = W^\top M W = \sum_{i\in[k]} \lambda_i (W^\top v_i)(W^\top v_i)^\top$,
which implies that the vectors $\sqrt{\lambda_i}\, W^\top v_i$, for $i\in[k]$, are orthonormal. Then the tensor $T(W, W, W)$ equals
$\sum_{i\in[k]} \lambda_i (W^\top v_i)^{\otimes 3} = \sum_{i\in[k]} \frac{1}{\sqrt{\lambda_i}} \big(\sqrt{\lambda_i}\, W^\top v_i\big)^{\otimes 3}$,
which has an orthogonal decomposition.

According to (Anandkumar et al., 2014a), one way to find such a W is to do the spectral decomposition of M as $U\Lambda U^\top$, with eigenvectors as the columns of U, and let $W = U\Lambda^{-\frac12}$. However, we will not take this approach, because finding a good approximation of U by the matrix power method would take longer to converge than the tensor power method which we will later apply to the whitened tensor. Our key observation is that it suffices to find a $d\times k$ matrix Q such that the matrix $P = Q^\top M Q$ is invertible, since we can then let $W = Q P^{-\frac12}$. With such a W, the tensor $T(W, W, W)$ becomes orthogonal, so that we can decompose it to obtain $\sigma_i = \frac{1}{\sqrt{\lambda_i}}$ and $u_i = \sqrt{\lambda_i}\, W^\top v_i$, from which we can recover $\lambda_i = \frac{1}{\sigma_i^2}$ and $v_i = \sigma_i\, Q P^{\frac12} u_i$ if Q has orthonormal columns. (To apply our Algorithm 1 to the whitened tensor, we need to scale it properly, say by a factor of $\sqrt{\lambda_k}/k$, to make its norm at most one.)

As before, we consider a similar setting in which we only have access to a noisy $\tilde M = M + \Phi_M$, for some symmetric perturbation matrix $\Phi_M$, in addition to the noisy tensor $\tilde T = T + \Phi$.
Then our algorithm for finding the whitening matrix consists of the following two steps:

1. Sample a random matrix $Z\in\mathbb{R}^{d\times k}$ with orthonormal columns, compute $Y = \tilde M Z$, and factorize it as $Y = \tilde Q\tilde R$ by a QR decomposition.
2. Compute $\tilde P = \tilde Q^\top \tilde M\tilde Q$ and output $\tilde W = \tilde Q\tilde P^{-\frac12}$ as the whitening matrix.

We analyze our algorithm in the following. First note that $\tilde Q$ is computed in the same way as we compute $Z^{(1)}$ in Algorithm 1, and with $\lambda_{k+1} = 0$ we are likely to have $\tan_k(\tilde Q) \approx 0$, so that the matrix $P = \tilde Q^\top M\tilde Q$ is invertible. Formally, we have the following, which we prove in Appendix C.1.

Lemma 5. Suppose $\|\Phi_M\| \le \frac{\alpha_0\lambda_k}{\sqrt{dk}}$ for a small enough constant $\alpha_0$. Then with high probability we have $\sigma_{\max}(P) \le \lambda_1$ and $\sigma_{\min}(P) \ge \frac{\lambda_k}{2}$.

Next, with a small enough $\|\Phi_M\|$, if P is invertible, then so is $\tilde P$, and moreover, $\tilde P^{-\frac12}$ is close to $P^{-\frac12}$. This is shown in the following, which we prove in Appendix C.2.

Lemma 6. Fix any $\epsilon\in(0,1)$ and suppose we have $\sigma_{\min}(P) \ge 2\epsilon$ and $\|\Phi_M\| \le \epsilon$. Then $\tilde P$ is invertible and $\|\tilde P^{-\frac12} - P^{-\frac12}\| \le 2\epsilon\,(\sigma_{\min}(P))^{-2}(\sigma_{\max}(P))^{\frac12}$.

Then, with a good $\tilde P^{-\frac12}$, we can obtain a good $\tilde W$ and have $\tilde T(\tilde W, \tilde W, \tilde W)$ close to $T(W, W, W)$, which has an orthogonal decomposition. This is shown in the following, which we prove in Appendix C.3.

Theorem 3. Fix any $\varepsilon\in(0, \frac{\lambda_k}{4})$ and suppose we have $\|\Phi_M\| \le \lambda_k^{3/2}\,\varepsilon$ and $\|\Phi\| \le \alpha_0\,\varepsilon\min\{\frac{\lambda_k}{\sqrt{dk}}, \frac{\lambda_k^3}{\lambda_1}\}$, for a small enough constant $\alpha_0$. Then with high probability we have $\|\tilde T(\tilde W, \tilde W, \tilde W) - T(W, W, W)\| \le \varepsilon$.
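A minimal numpy sketch of this two-step whitening procedure (our own; `M_noisy` stands for $\tilde M$, and $\tilde P^{-\frac12}$ is formed by an eigendecomposition of the small $k\times k$ matrix, assuming the noise is small enough, as in Lemma 5, to keep it positive definite):

```python
import numpy as np

def whitening_matrix(M_noisy, k, rng=None):
    """Step 1: one matrix power iteration on M_noisy from a random orthonormal start.
    Step 2: P = Q^T M_noisy Q, then W = Q P^{-1/2}."""
    rng = np.random.default_rng() if rng is None else rng
    d = M_noisy.shape[0]

    Z, _ = np.linalg.qr(rng.standard_normal((d, k)))
    Q, _ = np.linalg.qr(M_noisy @ Z)

    P = Q.T @ M_noisy @ Q                      # k x k and symmetric
    evals, evecs = np.linalg.eigh(P)           # assumes P is positive definite (small noise)
    P_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    return Q @ P_inv_sqrt

# The whitened tensor can then be formed as
#   np.einsum('abc,ax,by,cz->xyz', T_noisy, W, W, W)
# and, after the scaling mentioned above, handed to Algorithm 1.
```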
6. Streaming Setting

In the previous sections, we considered the batch setting, in which the tensor $\tilde T$ is assumed to be stored somewhere and can be accessed whenever we want. However, storing such a tensor, say of order three, requires a space complexity of $\Omega(d^3)$, which becomes impractical even for a moderate value of d. In this section, we study the possibility of achieving a space complexity of $O(kd)$, which is the least amount of memory needed just to store the k vectors in $\mathbb{R}^d$.

More precisely, we consider the streaming setting, in which there is a stream of vectors $x_1, x_2, \dots$ arriving one at a time. We assume that each vector x is sampled independently from some distribution over $\mathbb{R}^d$ with $\|x\| \le 1$, that there is some function $g: \mathbb{R}^d \to \mathbb{R}^{d\times d\times d}$ such that $\mathbb{E}[g(x)] = T$, and that given $x, u, v\in\mathbb{R}^d$, $g(x)(I_d, u, v)$ can be computed in $O(d)$ space. Such a function g is known to exist for some latent variable models (Ge et al., 2015; Wang & Anandkumar, 2016).

Given such a function, our algorithms in the previous sections can all be converted to work in the streaming setting using $O(kd)$ space. This is because all our operations involving tensors have the form $\tilde T(I_d, u, v)$, for some $u, v\in\mathbb{R}^d$, which can be realized as
$\tilde T(I_d, u, v) = \frac{1}{|J|}\sum_{t\in J} g(x_t)(I_d, u, v)$,
for a collection J of samples, with the right-hand side clearly computable in $O(kd)$ space. (This also includes the initialization phase, in which we now do not store the matrix M explicitly but instead replace the operation $M Z_i$ by $\tilde T(I_d, Z_i, w)$.) Then, depending on the distance we want between $\tilde T = \frac{1}{|J|}\sum_{t\in J} g(x_t)$ and T, we can choose a proper size for J. In fact, to save on the total number of samples, we can follow the approach of (Li et al., 2016) by choosing different sizes in different iterations of the matrix or tensor power method.

Following (Wang & Anandkumar, 2016), let us take the specific case with $g(x) = x\otimes x\otimes x$ as a concrete example and focus on the orthogonal case studied in Section 3; it is not hard to convert our other algorithms to the streaming setting as well. One can show that in this specific case we have $\mathbb{E}[x] = \sum_{i\in[d]} \lambda_i u_i$, so that there is a more efficient way to find a vector w for producing the matrix M in the initialization phase. Formally, we have the following lemma, which we prove in Appendix D.1.

Lemma 7. There is an algorithm using $O(d)$ space and $O(\frac{\log k}{\Delta^2})$ samples to find some $w\in\mathbb{R}^d$ satisfying the condition (5) in Lemma 2 with high probability.

With such a vector w, we can then use the streaming algorithm of (Li et al., 2016) to find a good initial matrix Z for the later tensor power phase. Formally, we have the following lemma, which we prove in Appendix D.2.

Lemma 8. Given w from Lemma 7, we can use $O(kd)$ space and $O(\frac{kd\log d}{\gamma^2\Delta^4})$ samples to find some $Z\in\mathbb{R}^{d\times k}$ with $\tan_m(Z) < 1$ for every $m\in[k]$, with high probability.

Having such a matrix Z, we can proceed to the tensor power phase. Borrowing again the idea from (Li et al., 2016), let us partition the incoming data into blocks of increasing sizes, with the t-th block $J_t$ used to carry out one tensor power iteration
$Y^{(t)}_i = \tilde T^{(t)}(I_d, Q^{(t-1)}_i, Q^{(t-1)}_i)$, for $i\in[k]$,
of Algorithm 1, with $\tilde T^{(t)} = \frac{1}{|J_t|}\sum_{\tau\in J_t} g(x_\tau)$. Instead of preparing this $\tilde T^{(t)}$ and then computing each $Y^{(t)}_i$, we now go through $|J_t|$ steps of updates: for each $\tau\in J_t$, do
$Y^{(t)}_i \leftarrow Y^{(t)}_i + \frac{1}{|J_t|}\,(x_\tau^\top Q^{(t-1)}_i)^2\, x_\tau$.
The block sizes are chosen carefully to keep $\|\tilde T^{(t)} - T\|$ small enough so that $\tan_m(Q^{(t)})$ decreases at a desirable rate. Here, we choose the parameters
$\beta_t = \max\{\rho^{2^t - 1}, \varepsilon\}$ and $|J_t| = \frac{c_0\log(dt)}{\Delta^2\beta_t^2}$,  (6)
for a large enough constant $c_0$, to make the condition (3) in Lemma 1 hold with high probability, so that we have $\tan_m(Q^{(t)}) \le \beta_t$. In Appendix D.3, we summarize our algorithm and prove the following theorem.

Theorem 4. Given $\varepsilon\in(0, \frac{\lambda_k}{2})$, with high probability we can find $\hat\lambda_i, \hat u_i$ with $|\hat\lambda_i - \lambda_i| \le \varepsilon$ and $\|\hat u_i - u_i\| \le \varepsilon$, for every $i\in[k]$, using $O(kd)$ space and $O\big(\frac{kd\log d}{\gamma^2\Delta^4} + \frac{\log(d\log\frac{1}{\varepsilon})}{\gamma\Delta^2\varepsilon^2}\big)$ samples.
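To make the block update concrete, here is a minimal sketch (our own, for the specific case $g(x) = x\otimes x\otimes x$) of one block of the streaming tensor power phase; it touches each sample once, keeps only $O(kd)$ numbers in memory, and never forms the tensor.

```python
import numpy as np

def streaming_tensor_power_block(sample_stream, Q, block_size):
    """One tensor power iteration over a block J_t of samples.

    Accumulates Y_i = (1/|J_t|) sum_{tau in J_t} (x_tau^T Q_i)^2 x_tau, an unbiased
    estimate of T(I, Q_i, Q_i) when E[x ⊗ x ⊗ x] = T, then re-orthonormalizes by QR.
    """
    d, k = Q.shape
    Y = np.zeros((d, k))
    for _ in range(block_size):
        x = next(sample_stream)                       # one length-d sample, seen only once
        Y += np.outer(x, (x @ Q) ** 2) / block_size   # column i gains (x^T Q_i)^2 * x
    Q_new, _ = np.linalg.qr(Y)
    return Q_new
```

Successive blocks would grow according to (6), so that the per-block error stays below the shrinking targets $\beta_t$.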
References

Anandkumar, Animashree, Hsu, Daniel J., and Kakade, Sham M. A method of moments for mixture models and hidden Markov models. In COLT, volume 1, pp. 4, 2012.

Anandkumar, Animashree, Ge, Rong, Hsu, Daniel, Kakade, Sham M., and Telgarsky, Matus. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773-2832, 2014a.

Anandkumar, Animashree, Ge, Rong, and Janzamin, Majid. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014b.

Chaganty, Arun Tejasvi and Liang, Percy. Estimating latent-variable graphical models using moments and likelihoods. In ICML, pp. 1872-1880, 2014.

Ge, Rong, Huang, Furong, Jin, Chi, and Yuan, Yang. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pp. 797-842, 2015.

Golub, Gene H. and Van Loan, Charles F. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

Hardt, Moritz and Price, Eric. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pp. 2861-2869, 2014.

Hillar, Christopher J. and Lim, Lek-Heng. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6), 2013.

Kolda, Tamara G. and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

Li, Chun-Liang, Lin, Hsuan-Tien, and Lu, Chi-Jen. Rivalry of two families of algorithms for memory-restricted streaming PCA. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 473-481, 2016.

Mitliagkas, Ioannis, Caramanis, Constantine, and Jain, Prateek. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems, pp. 2886-2894, 2013.

Wang, Yining and Anandkumar, Anima. Online and differentially-private tensor decomposition. In Advances in Neural Information Processing Systems, pp. 3531-3539, 2016.