# Power Iteration for Tensor PCA

Journal of Machine Learning Research 23 (2022) 1-47. Submitted 10/21; Revised 12/21; Published 1/22.

Jiaoyang Huang (jh4427@nyu.edu), New York University, New York, NY
Daniel Z. Huang (dzhuang@caltech.edu), California Institute of Technology, Pasadena, CA
Qing Yang (yangq@ustc.edu.cn), University of Science and Technology of China, China
Guang Cheng (guangcheng@ucla.edu), University of California, Los Angeles & Purdue University, Los Angeles, CA

Editor: Animashree Anandkumar

Abstract

In this paper, we study the power iteration algorithm for the asymmetric spiked tensor model, as introduced in Richard and Montanari (2014). We give necessary and sufficient conditions for the convergence of the power iteration algorithm. When the power iteration algorithm converges, for the rank one spiked tensor model we show that the estimators for the spike strength and for linear functionals of the signal are asymptotically Gaussian; for the multi-rank spiked tensor model we show that the estimators are asymptotically mixtures of Gaussians. This new phenomenon is different from the spiked matrix model. Using these asymptotic results for our estimators, we construct valid and efficient confidence intervals for spike strengths and linear functionals of the signals.

Keywords: Spiked tensor, tensor PCA, power iteration, statistical inference

1. Introduction

High order arrays, or tensors, have been actively considered in neuroimaging analysis, topic modeling, signal processing and recommendation systems Frolov and Oseledets (2017); Comon (2014); Hackbusch (2012); Karatzoglou et al. (2010); Rendle and Schmidt-Thieme (2010); Zhou et al. (2013); Simony et al. (2016); Cichocki et al. (2015); Sidiropoulos et al. (2017). Setting the stage, imagine that the signal is a large symmetric low-rank $k$-th order tensor

    \sum_{j=1}^{r} \beta_j v_j^{\otimes k} \in \otimes^k \mathbb{R}^n,    (1)

where $r$ ($r \ll n$) is the rank and the $\beta_j$ are the strengths of the signals. Such low-rank tensor components appear in various applications, e.g. community detection Anandkumar et al. (2014a); Jing et al. (2020), moment estimation for latent variable models Hsu and Kakade (2013); Anandkumar et al. (2014b) and hypergraph matching Duchenne et al. (2011). Suppose that we do not have access to perfect measurements of the entries of this signal tensor: the observed tensor $X$ is the signal in (1) contaminated by a substantial amount of random noise, $X = \sum_{j=1}^{r} \beta_j v_j^{\otimes k} + Z$, where the random tensor $Z$ has i.i.d. Gaussian entries with mean $0$ and variance $1/n$. The aim is to perform reliable estimation and inference on the unseen signal tensor. In the literature, this is the spiked tensor model, introduced in Richard and Montanari (2014). In the special case $k = 2$, the model reduces to the well-known spiked matrix model Johnstone (2001). In this setting it is known that there is an order $1$ critical signal-to-noise ratio $\beta_c$ such that below $\beta_c$ it is information-theoretically impossible to detect the spikes, and above $\beta_c$ it is possible to detect the spikes by Principal Component Analysis (PCA). A body of work has quantified the behavior of PCA in this setting
Johnstone (2001); Baik et al. (2005); Baik and Silverstein (2006); Paul (2007); Benaych-Georges and Nadakuditi (2012); Bai and Yao (2012); Johnstone and Lu (2009); Birnbaum et al. (2013); Cai et al. (2013); Ma et al. (2013); Vu et al. (2013); Cai et al. (2015); El Karoui et al. (2008); Ledoit et al. (2012); Donoho et al. (2018). We refer readers to the review article Johnstone and Paul (2018) for more discussion and references to this and related lines of work.

Tensor problems are far more than an extension of matrix problems. Beyond the more involved structure and high dimensionality, many concepts, e.g. eigenvalues and eigenvectors, are not well defined Kolda and Bader (2009), and most tensor problems are NP-hard Hillar and Lim (2013). Despite a large body of work tackling the spiked tensor model, there are several fundamental yet unaddressed challenges that deserve further attention.

Computational Hardness. As for the spiked matrix model, the spiked tensor model has an order $1$ critical signal-to-noise ratio $\beta_k$ (depending on the order $k$). Below $\beta_k$ it is information-theoretically impossible to detect the spikes, and above $\beta_k$ the maximum likelihood estimator is a distinguishing statistic Chen et al. (2019, 2018a); Lesieur et al. (2017); Perry et al. (2020); Jagannath et al. (2018). In the matrix setting the maximum likelihood estimator is the top eigenvector, which can be computed in polynomial time by, e.g., power iteration. However, for tensors of order $k \geq 3$, computing the maximum likelihood estimator is NP-hard in the generic setting. It is widely believed that there is a regime of signal-to-noise ratios for which it is information-theoretically possible to recover the signal but no known algorithm can efficiently approximate it. In the pioneering work Richard and Montanari (2014), the algorithmic aspects of this model were studied in the special setting where the rank is $r = 1$ and the signal-to-noise ratio is $\beta$. They showed that tensor power iteration with random initialization recovers the signal provided $\beta \gtrsim n^{(k-1)/2}$, and tensor unfolding recovers the signal provided $\beta \gtrsim n^{(\lceil k/2 \rceil - 1)/2}$. Based on heuristic arguments, they predicted that the necessary and sufficient condition for power iteration to succeed is $\beta \gtrsim n^{(k-2)/2}$, and for tensor unfolding it is $\beta \gtrsim n^{(k-2)/4}$. Langevin dynamics and gradient descent were studied in Arous et al. (2020) and shown to recover the signal provided $\beta \gtrsim n^{(k-2)/2}$. Later the sharp threshold $\beta \gtrsim n^{(k-2)/4}$ was achieved using Sum-of-Squares algorithms Hopkins et al. (2015, 2016); Kim et al. (2017) and sophisticated iteration algorithms Luo et al. (2020); Zhang and Xia (2018); Han et al. (2020). The necessary part of this threshold remains open, and its relation with the hypergraph planted clique problem was discussed in Luo and Zhang (2020a,b).

Statistical inference. In many applications, the ultimate goal is often not to characterize the $L_2$ or bulk behavior (e.g. the mean squared estimation error) of the signals, but rather to reason about the signals along a few preconceived yet important directions. In the example of community detection for hypergraphs, the entries of the vector $v$ can represent different community memberships. Testing whether two nodes belong to the same community reduces to the hypothesis testing problem of whether the corresponding entries of $v$ are equal.
These problems can be formulated as estimation and inference for linear functionals of the signal, namely quantities of the form $\langle a, v_j \rangle$, $1 \leq j \leq r$, with a prescribed vector $a$. A natural starting point is to plug in an estimator $\hat{v}_j$ of $v_j$, i.e. the estimator $\langle a, \hat{v}_j \rangle$. However, most prior works Richard and Montanari (2014); Hopkins et al. (2015, 2016); Kim et al. (2017); Luo et al. (2020); Zhang and Xia (2018) on spiked tensor models focus on the $L_2$ risk analysis, which is often too coarse to give tight uncertainty bounds for the plug-in estimator. To further complicate matters, there is often a bias issue surrounding the plug-in estimator. Addressing these issues calls for a refined risk analysis of the algorithms.

1.1 Our Contributions

We consider the power iteration algorithm given by the recursion

    u_0 = u,  \qquad u_{t+1} = \frac{X[u_t^{\otimes (k-1)}]}{\| X[u_t^{\otimes (k-1)}] \|_2},    (2)

where $u \in \mathbb{R}^n$ with $\|u\|_2 = 1$ is the initial vector, and $X[v^{\otimes (k-1)}] \in \mathbb{R}^n$ is the vector with $i$-th entry given by $\langle X, e_i \otimes v^{\otimes (k-1)} \rangle$. The estimators are given by

    \hat{v} = u_T,  \qquad \hat{\beta} = \langle X, \hat{v}^{\otimes k} \rangle,    (3)

for some large $T$. In the worst case scenario, i.e. with random initialization, the power iteration algorithm underperforms tensor unfolding. However, if extra information about the signals $v_j$ is available, the power iteration algorithm with a warm start can be used to obtain a much better estimator. In fact this approach is commonly used to obtain refined estimators. In this paper, we study the convergence and statistical inference aspects of the power iteration algorithm. The main contributions of this paper are summarized below.

Convergence criterion. We give necessary and sufficient conditions for the convergence of the power iteration algorithm in finite time. In the rank one case $r = 1$, we show that the power iteration algorithm converges to the true signal $v$ in finite time provided $|\beta \langle u, v \rangle^{k-2}| \gg 1$, where $u$ is the initialization vector. In the complementary setting $|\beta \langle u, v \rangle^{k-2}| \ll 1$, the output of the power iteration algorithm behaves like a random Gaussian vector and has no correlation with the signal. With random initialization, i.e. $u$ a uniformly random vector on the unit sphere, our results assert that the power iteration algorithm converges in finite time if and only if $\beta \gg n^{(k-2)/2}$, which verifies the prediction in Richard and Montanari (2014). This is analogous to PCA of the spiked matrix model, where power iteration recovers the top eigenvector. However, the multi-rank spiked tensor model, i.e. $r \geq 2$, is different from the multi-rank spiked matrix model. The power iteration algorithm for the multi-rank spiked tensor model is more sensitive to the initialization: the algorithm converges if $\max_j |\beta_j \langle u, v_j \rangle^{k-2}| \gg 1$, and in this case it converges to $v_{j^*}$ with $j^* = \operatorname{argmax}_j |\beta_j \langle u, v_j \rangle^{k-2}|$.

Statistical inference. We consider the statistical inference problem for the spiked tensor model and develop the limiting distributions of the above power iteration estimators. In the rank one case, above the threshold $|\beta \langle u, v \rangle^{k-2}| \gg 1$, we show that our estimator $\langle a, \hat{v} \rangle$ (modulo some global sign) admits the first order approximation

    \langle a, \hat{v} \rangle \approx \left(1 - \frac{1}{2\beta^2}\right) \langle a, v \rangle + \frac{\langle a_{\perp}, \xi \rangle}{\beta},

where $a_{\perp} = a - \langle a, v \rangle v$, and $\xi = Z[v^{\otimes(k-1)}]$ is an $n$-dimensional vector with i.i.d. $N(0, 1/n)$ Gaussian entries. For the multi-rank spiked tensor model, the output of the power iteration algorithm depends on the angle between the initialization $u$ and the signals $v_j$. We consider the case that the initialization $u$ is a uniformly random vector on the unit sphere.
For such initialization, very interestingly, our estimator $\langle a, \hat{v} \rangle$ is asymptotically a mixture of Gaussians, with modes at the $\langle a, v_j \rangle$ and mixture weights depending on the signal strengths $\beta_j$. Using these asymptotic results, we construct valid and efficient confidence intervals for the linear functionals $\langle a, v_j \rangle$. In a concurrent work, Xia et al. (2020) studied statistical inference for several low rank tensor models, and asymptotic distributions were established under some essential conditions on the signal-to-noise ratio.

1.2 Notations

For a vector $v \in \mathbb{R}^n$, we denote its $i$-th coordinate as $v(i)$. We identify $k$-th order tensors in $\otimes^k \mathbb{R}^n$ with vectors of dimension $n^k$, i.e. $\tau = (\tau_{i_1 i_2 \cdots i_k})_{1 \leq i_1, i_2, \cdots, i_k \leq n}$. For any two $k$-th order tensors $\tau, \eta \in \otimes^k \mathbb{R}^n$, we denote their inner product as $\langle \tau, \eta \rangle := \sum_{1 \leq i_1, i_2, \cdots, i_k \leq n} \tau_{i_1 i_2 \cdots i_k} \eta_{i_1 i_2 \cdots i_k}$. A $k$-th order tensor can act on a $(k-1)$-th order tensor and return a vector: for $\tau \in \otimes^k \mathbb{R}^n$ and $\eta \in \otimes^{k-1} \mathbb{R}^n$,

    \tau[\eta] \in \mathbb{R}^n,  \qquad \tau[\eta](i) = \langle \tau, e_i \otimes \eta \rangle = \sum_{1 \leq i_1, \cdots, i_{k-1} \leq n} \tau_{i i_1 i_2 \cdots i_{k-1}} \eta_{i_1 i_2 \cdots i_{k-1}}.

We denote the $L_2$ norm of a vector $v$ as $\|v\|_2$. We use $\stackrel{d}{=}$ for equality in law and $\stackrel{d}{\to}$ for convergence in law. We denote the index sets $[[a, b]] = \{a, a+1, a+2, \cdots, b\}$ and $[[n]] = \{1, 2, 3, \cdots, n\}$. We use $C$ to represent a large universal constant and $c$ a small universal constant, which may be different from line to line. We write $X = O(Y)$ or $X \lesssim Y$ if there exists a universal constant such that $|X| \leq CY$. We write $X = o(Y)$ or $X \ll Y$ if the ratio $|X|/Y \to 0$ as $n$ goes to infinity. We write $X \gtrsim Y$ if there exists a universal constant such that $X \geq c|Y|$. We write $X \asymp Y$ if there exist universal constants such that $cY \leq |X| \leq CY$. We say an event holds with high probability if there exists $c > 0$ such that, for $n$ large enough, the event holds with probability at least $1 - n^{-c \log n}$.

An outline of the paper is given as follows. In Section 2.1, we state our main results for the rank-one spiked tensor model; in particular, a distributional result for the power iteration algorithm is developed with general initialization. Section 2.2 investigates the general rank-r spiked tensor model. A similar distributional result is established with general initialization as in Section 2.1, while with uniformly distributed initialization over the unit sphere we obtain a multinomial distribution which yields a Gaussian mixture. Numerical simulations are presented in Section 3. All proofs and technical details are deferred to the appendix.

2. Main Results

2.1 Rank one spiked tensor model

In this section, we state our main results for the rank-one spiked tensor model (corresponding to $r = 1$ in (1)):

    X = \beta v^{\otimes k} + Z,

where $X \in \otimes^k \mathbb{R}^n$ is the $k$-th order tensor observation; $Z \in \otimes^k \mathbb{R}^n$ is a noise tensor whose entries are i.i.d. $N(0, 1/n)$ Gaussian random variables; $\beta \in \mathbb{R}$ is the signal size; and $v \in \mathbb{R}^n$ is an unknown unit vector to be recovered. We obtain a distributional result for the power iteration algorithm (2) with general initialization $u$: when $|\beta|$ is above a certain threshold, $u_t$ converges to $v$ and the error is asymptotically Gaussian; when $|\beta|$ is below the same threshold, the algorithm does not converge.

Theorem 1 Fix the initialization $u \in \mathbb{R}^n$ with $\|u\|_2 = 1$ and $\langle u, v \rangle \gtrsim 1/\sqrt{n}$. If $|\beta \langle u, v \rangle^{k-2}| \geq n^{\varepsilon}$ with arbitrarily small $\varepsilon > 0$, then for any fixed unit vector $a \in \mathbb{R}^n$ and any time

    T \geq 1 + \frac{1}{\varepsilon} \left( \frac{1}{2} + \frac{2 \log |\beta|}{\log n} \right),

with probability $1 - O(n^{-c(\log n)^2})$ the power iteration estimator $(\hat{\beta}, \hat{v}) = (X[u_T^{\otimes k}], u_T)$ from (3) satisfies:
1. If $k$ is odd and $\beta > 0$, then $(\hat{\beta}, \hat{v})$ recovers $(\beta, v)$ in the following sense:

    \langle a, \hat{v} \rangle = \langle a, u_T \rangle = \left(1 - \frac{1}{2\beta^2}\right) \langle a, v \rangle + \frac{\langle a, \xi \rangle - \langle a, v \rangle \langle v, \xi \rangle}{\beta} + O\!\left( \frac{\log n}{\beta^2 \sqrt{n}} + \frac{(\log n)^{3/2}}{\beta^{3/2} n^{3/4}} + \frac{|\langle a, v \rangle|}{\beta^4} \right),    (4)

where $\xi = Z[v^{\otimes(k-1)}]$ is an $n$-dimensional vector with i.i.d. $N(0, 1/n)$ Gaussian entries, and

    \hat{\beta} = X[u_T^{\otimes k}] = \beta + \langle \xi, v \rangle - \frac{k/2 - 1}{\beta} + O\!\left( \frac{\log n}{|\beta| \sqrt{n}} + \frac{(\log n)^{3/2}}{|\beta|^{1/2} n^{3/4}} + \frac{1}{|\beta|^3} \right).    (5)

2. If $k$ is odd and $\beta < 0$, then $(\hat{\beta}, \hat{v})$ recovers $(-\beta, -v)$, in the sense of (4) and (5) with $(\beta, v)$ replaced by $(-\beta, -v)$.

3. If $k$ is even and $\beta > 0$, then $(\hat{\beta}, \hat{v})$ recovers $(\beta, \operatorname{sgn}(\langle u, v \rangle)v)$, in the sense of (4) and (5) with $(\beta, v)$ replaced by $(\beta, \operatorname{sgn}(\langle u, v \rangle)v)$.

4. If $k$ is even and $\beta < 0$, then $(\hat{\beta}, \hat{v})$ recovers $(\beta, (-1)^T \operatorname{sgn}(\langle u, v \rangle)v)$, in the sense of (4) and (5) with $(\beta, v)$ replaced by $(\beta, (-1)^T \operatorname{sgn}(\langle u, v \rangle)v)$.

Theorem 2 Fix the initialization $u \in \mathbb{R}^n$ with $\|u\|_2 = 1$. If $|\beta| \geq n^{\varepsilon}$ and $|\beta \langle u, v \rangle^{k-2}| \leq n^{-\varepsilon}$ with arbitrarily small $\varepsilon > 0$, then $u_t$ does not converge to $v$, and $u_t$ behaves like a random Gaussian vector. For any time

    T \geq 1 + \frac{1}{\varepsilon} \left( \frac{1}{2} - \frac{\log |\beta|}{(k-2) \log n} \right),

with probability $1 - O(n^{-c(\log n)^2})$ it holds that

    \hat{v} = u_T = \frac{\xi}{\|\xi\|_2} + O\!\left( |\beta| \left( \frac{\log n}{\sqrt{n}} \right)^{k-1} \right),

where $\xi$ is a standard Gaussian vector in $\mathbb{R}^n$ and the error term is a vector of length bounded by $|\beta| (\log n/\sqrt{n})^{k-1}$.

In Theorem 1, we assume that $\langle u, v \rangle \gtrsim 1/\sqrt{n}$, which is generic and is true for a random $u$. Moreover, if the initial vector $u$ is random, then $|\langle u, v \rangle| \asymp n^{-1/2}$. Notably, Theorems 1 and 2 together state that, with random initialization, power iteration recovers $v$ if $|\beta| \gg n^{(k-2)/2}$ and fails if $|\beta| \ll n^{(k-2)/2}$. This gives a rigorous proof of the prediction in Richard and Montanari (2014) that the necessary and sufficient condition for convergence is $|\beta| \gg n^{(k-2)/2}$.

In practice, it may be possible to use domain knowledge to choose better initialization points. For example, in classical topic modeling applications Anandkumar et al. (2014b), the unknown vectors $v$ are related to the topic word distributions, and many documents may be composed primarily of words from just a single topic. Therefore, good initialization points can be derived from these single-topic documents.

The special case $k = 2$, i.e. the spiked matrix model, has been intensively studied since the pioneering work of Johnstone Johnstone (2001). In this setting it is known [30] that there is an order $O(1)$ critical signal-to-noise ratio such that below the threshold it is information-theoretically impossible to recover $v$, and above the threshold PCA (partially) recovers the unseen eigenvector $v$ Péché (2006); Abbe et al. (2020); O'Rourke et al. (2018); Vu (2011); Zhong (2017); Chen et al. (2018b); Zhang et al. (2018); Cheng et al. (2020). The special case of our result Theorem 1 with $k = 2$, i.e. that power iteration recovers the eigenvector $v$ when $\beta \geq n^{\varepsilon}$, recovers some of the above-mentioned results.

As a consequence of Theorem 1, we have the following central limit theorem for our estimators.

Corollary 3 (Central Limit Theorem) Fix the initialization $u \in \mathbb{R}^n$ with $\|u\|_2 = 1$ and $|\langle u, v \rangle| \gtrsim 1/\sqrt{n}$. Assume $|\beta \langle u, v \rangle^{k-2}| \geq n^{\varepsilon}$ with arbitrarily small $\varepsilon > 0$ and that we are in Case 1 of Theorem 1. Then for any fixed unit vector $a \in \mathbb{R}^n$ obeying

    |\langle a, v \rangle| = o\!\left( \frac{\beta^3}{\sqrt{n}} \right),    (6)

and any time $T \geq 1 + \frac{1}{\varepsilon} \left( \frac{1}{2} + \frac{2 \log |\beta|}{\log n} \right)$, the estimators $\hat{v} = u_T$ and $\hat{\beta} = X[\hat{v}^{\otimes k}]$ satisfy

    \frac{\sqrt{n}\, \hat{\beta}}{\sqrt{\langle a, (I_n - \hat{v}\hat{v}^{\top}) a \rangle}} \left( \frac{2\hat{\beta}^2}{2\hat{\beta}^2 - 1} \langle a, \hat{v} \rangle - \langle a, v \rangle \right) \xrightarrow{d} N(0, 1),    (7)

as $n$ tends to infinity. We have similar results for Cases 2, 3, 4, by simply changing $(\beta, v)$ in (7) to the corresponding limit.

We remark that in Corollary 3, we assume that $|\langle a, v \rangle| = o(\beta^3/\sqrt{n})$, which is generic. For example, if $v$ is delocalized and $a$ is supported on finitely many entries, we will have $|\langle a, v \rangle| \lesssim 1/\sqrt{n}$, and (6) is satisfied.
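Remark (implementation sketch). To make the recursion (2) and the estimators (3) concrete, the following is a minimal NumPy sketch for the case $k = 3$; the function name and the use of einsum contractions are our own illustration and are not part of the paper's specification.

```python
import numpy as np

def power_iteration_estimator(X, u0, T):
    """Tensor power iteration (2)-(3) for a third order tensor X (k = 3).

    X  : (n, n, n) array, the observed spiked tensor X = beta * v^{x3} + Z.
    u0 : (n,) array, initialization (normalized internally).
    T  : number of iterations.
    Returns (beta_hat, v_hat) = (<X, u_T^{x3}>, u_T).
    """
    u = u0 / np.linalg.norm(u0)
    for _ in range(T):
        # X[u^{x(k-1)}]: contract the last two indices of X against u.
        w = np.einsum('ijk,j,k->i', X, u, u)
        u = w / np.linalg.norm(w)          # u_{t+1} = X[u_t^{x2}] / ||X[u_t^{x2}]||_2
    beta_hat = np.einsum('ijk,i,j,k->', X, u, u, u)   # beta_hat = <X, u_T^{x3}>
    return beta_hat, u
```

A warm start $u_0$ with $|\beta \langle u_0, v \rangle^{k-2}| \geq n^{\varepsilon}$ places the iteration in the regime of Theorem 1; with a uniformly random $u_0$ the condition becomes $|\beta| \geq n^{(k-2)/2+\varepsilon}$.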
With the central limit theorem for our estimators in Corollary 3, we can easily write down the confidence interval for our estimators. Corollary 4 (Prediction Interval) Given the asymptotic significance level α, and let zα = Φ(1 α/2) where Φ( ) is the CDF of a standard Gaussian. If |β u, v k 2| nε with arbitrarily small ε > 0, in Case 1 of Theorem 1, for any fixed unit vector a Rn obeying | a, v | = o β3 2 + 2 log |β| let bv = u T , and bβ = X[bv k]. The asymptotic confidence interval of a, v is given by a, (In bvbv )a nbβ , a, bv + zα a, (In bvbv )a We have similar results for Cases 2, 3, 4, by simply changing (β, v) in (9) to the corresponding limit. 2.2 General Results: rank-r spiked tensor model In this section, we state our main results for the general case, the rank-r spiked tensor model (1): j=1 βjv k j k Rn + Z (10) Huang, Huang, Yang, Cheng X k Rn is the k-th order tensor observation. Z k Rn is a noise tensor. The entries of Z are i.i.d. standard N(0, 1/n) Gaussian random variables. |β1| |β2| |βr| are the signal sizes. v1, v2, , vr Rn are unknown orthonormal vectors to be recovered. In (10), we assumed that the signals v1, v2, , vr are orthonormal. The orthogonally decomposable tensor has been widely studied as a benchmark setting for tensor decomposition in the literature Kolda (2001); Chen and Saad (2009); Robeva (2016); Belkin et al. (2018); Auddy and Yuan (2020). Before stating our main results, we need to introduce some more notations and assumptions. Assumption 5 We assume that the initialization does not distinguish v1, v2, , vr, such that there exists some large constant κ > 0 1/κ u, vi u, vj for all 1 i, j r. If we take the uniform initialization, i.e. u0 = u is a uniformly distributed vector in Sn 1. Then with probability 1 O(r/ κ) we will have 1/ κn | u, vi | p κ/n for 1 i r, and Assumption 5 holds. The same as in the rank-1 case, the quantities |βj u, vj k 2| play a crucial role in our power iteration algorithm. We need to make the following technical assumption: Assumption 6 Let j = argmaxj |βj u, vj k 2|. We assume that there exists some large constant κ > 0 (1 1/κ)|βj u, vj k 2| |βj u, vj k 2|, for all 1 j r and j = j . It turns out under Assumptions 5 and 6, the power iteration converges to vj . Moreover, if we simply take the uniform initialization, i.e. u0 = u is a uniformly distributed vector in Sn 1. Assumption 6 holds for some 1 j r with probability 1 O(1/κ). Theorem 7 Fix the initialization u Rn with u 2 = 1 and | u, vj | 1/ n, for 1 j r. Let j = argmaxj |βj u, vj k 2|. Under Assumptions 5 and 6, if |βj u, vj k 2| nε with arbitrarily small ε > 0, then for any fixed unit vector a Rn, and 2 + 2 log |β1| + log log( n|β1|) with probability 1 O(n c(log n)2), the power iteration estimator (bβ, bv) = (X[u k T ], u T ) from (3) satisfies Power Iteration for Tensor PCA 1. If k is odd, and βj > 0, then (bβ, bv) recovers (βj , vj ) in the following sense, a, bv = a, u T = a, vj + a, ξ a, vj vj , ξ log n n|β1| k 1 + log n |β1|2 n + (log n)3/2 |β1|3/2n3/4 + 1 |β1|4 where ξ = Z[v (k 1) j ], is an n-dim vector, with each entry i.i.d. N(0, 1/n) Gaussian random variable. And bβ = X[u k T ] = βj + ξ, vj k/2 1 log n n|β1| k 1 + log n |β1| n + (log n)3/2 |β1|1/2n3/4 + 1 |β1|3 2. If k is odd, and βj < 0 then (bβ, bv) recovers ( βj , vj ), in the sense of (11) and (12) with (βj , vj ) replaced by ( βj , vj ) 3. If k is even, and βj > 0, then (bβ, bv)) recovers (βj , sgn( u, vj )vj ), in the sense of (11) and (12) with (βj , vj ) replaced by (βj , sgn( u, vj )vj ); 4. 
If k is even, and βj < 0, then (bβ, bv)) recovers (βj , ( 1)T sgn( u, vj )vj ), in the sense of (11) and (12) with (βj , vj ) replaced by (βj , ( 1)T sgn( u, vj )vj ). In Theorem 7, we assume that | u, vj | 1/ n for 1 j r. This is generic and is true for a random initialization u. Similarly to Theorem 2, if |βj u, vj k 2| n ε, the iteration does not converge, and ut behaves like a random Gaussian vector for any fixed time t. We want to remark that for multi-rank spiked tensor model, the senarios for k = 2, i.e. the spiked matrix model, and k 3 are very different. For the spiked matrix model, in Theorem 7, we always have that j = argmaxj |βj| = 1, and power iteration algorithm always converges to the eigenvector corresponding to the largest eigenvalue. However, for rank k 3, the power iteration algorithm may converge to any vector vj provided that the initialization u is sufficiently close to vj. As a consequence of Theorem 7, we have the following central limit theorem for our estimators. Corollary 8 Fix the initialization u Rn with u 2 = 1 and | u, vj | 1/ n for 1 j r. We assume |β u, vj k 2| nε with arbitrarily small ε > 0, and Assumptions 5 and 6. In Case 1 of Theorem 7, for any fixed unit vector a Rn, for any fixed unit vector a Rn | a, vj | = o |β1|3 2 + 2 log |β1| Huang, Huang, Yang, Cheng the estimators bv = u T , and bβ = X[u k T ] satisfy a, (In bvbv )a 1 a, bv a, vj d N(0, 1). (14) We have similar results for Cases 2, 3, 4, by simply changing (βj , vj ) in (14) to the corresponding limit. In the following we take u to be a random vector uniformly distributed over the unit sphere. The power iteration algorithm can be easily understood in this setting, thanks to Theorem 7. More precisely if j = argmaxj |βj u, vj k 2| and the initialization u satisfies Assumptions 5 and 6, then the power iteration estimator (bv, bβ) recovers (vj , βj ). From the discussions below, for a random vector u uniformly distributed over the unit sphere, Assumptiosn 5 and 6 holds with probability 1 O(1/ κ). We can compute explicitly the probability that index i achieves argmaxj |βj u, vj k 2|: pi := P(i = argmaxj |βj u, vj k 2|) |βℓ| 1 k 2 x 2 πe y2/2dy for any 1 i r. For spiked matrix model, i.e. k = 2, we always have 1 = argmaxj |βj u, vj k 2|, and p1 = 1, p2 = p3 = = 0. For spiked tensor models with k 3, all those pi are nonnegative and p1 p2 p3 > 0. When there exists a large gap between the first signal and the remaining signals, namely |β1| M|β2| for some large constant M > 0, then Z M 1 k 2 x 2 πe y2/2dy 2 πe x2/2 1 e M 2 k 2 x2/2 r 1 dx 2 πe x2/2 1 (r 1)e M 2 k 2 x2/2 dx 1 (r 1) M 2 k 2 + 1 = 1 O M 2 k 2 , where in the second line we used the lower bound for the error function. In other words, we can recover the top signal v1 with probability 1 δ, provided |β1|/|β2| Cδ (k 2)/2. Theorem 9 Fix large κ > 0 and recall pi as defined (15). If u is uniformly distributed over the unit sphere, and |β1| n(k 2)/2+ε with arbitrarily small ε > 0, then for any fixed unit vector a Rn, any 1 i r, and 2 + 2 log |β1| + log log( n|β1|) the power iteration estimator (bβ, bv) = (X[u k T ], u T ) from (3) satisfies Power Iteration for Tensor PCA 1. If k is odd, and βi > 0 then with probability pi + O(1/ κ), (bβ, bv) recovers (βi, vi), in the sense a, bv = a, u T = 1 1 2β2 i a, vi + a, ξ a, vi vi, ξ log n n|β1| k 1 + log n |β1|2 n + (log n)3/2 |β1|3/2n3/4 + 1 |β1|4 where ξ = Z[v (k 1) i ], is an n-dim vector, with each entry i.i.d. N(0, 1/n) Gaussian random variable. 
And bβ = X[u k T ] = βi + ξ, vi k/2 1 log n n|β1| k 1 + log n |β1| n + (log n)3/2 |β1|1/2n3/4 + 1 |β1|3 2. If k is odd, and βi < 0 then with probability pi + O(1/ κ), (bβ, bv) recovers ( βi, vi); 3. If k is even, and βi > 0 or βi < 0, then with probability pi/2 + O(1/ κ), (bβ, bv) recovers (βi, +vi), and with probability pi/2 + O(1/ κ), (bβ, bv) recovers (βi, vi). We want to emphasize here that the senario for k = 2, i.e. the spiked matrix model, and k 3 are very different. For spiked matrix model, i.e. k = 2, we always have that p1 = 0, p2 = p3 = = 0. The power iteration algorithm always converges to the eigenvector corresponding to the largest eigenvalue. We can only recover (β1, v1) no matter how many times we repeat the algorithm. However, for spiked tensor models with k 3, all those pi are nonnegative, p1 p2 p3 > 0. By repeating the power iteration algorithm for sufficiently many times, it recovers (βi, vi) with probability roughly pi. Similar to the rank one case in Section 2.1, we are also able to establish the asymptotic distribution and confidence interval for multi-rank spiked tensor model with uniformly distributed initialization u. Corollary 10 Fix k 3, assume u to be a random vector uniformly distributed over the unit sphere and |β1| n(k 2)/2+ε with arbitrarily small ε > 0. In Case 1 of Theorem 9, for any fixed unit vector a Rn, and time 2 + 2 log |β1| + log log( n|β1|) for any 1 i r, with probability pi + O(1/ κ), the estimators bv = u T and bβ = X[u k T ] satisfy nbβ p a, (In bvbv )a 2bβ2 a, vi d N(0, 1). (18) And n βi bβ k/2 1 d N(0, 1). (19) Huang, Huang, Yang, Cheng We have similar results for Cases 2, 3, 4, by simply changing (βi, vi) above to the corresponding limit. We want to emphasize the difference between Corollary 3 and Corollary 10. In the rank one case, the estimators bβ and a, bv are asymptotically Gaussian. In the multi-rank spiked tensor model with k 3, those estimators bβ and a, bv are no longer Gaussian. Instead, they are asymptotically a mixture Gaussian with mixture weights p1 p2 p3 . Corollary 11 Given the asymptotic significance level α, and let zα = Φ(1 α/2) where Φ( ) is the CDF of a standard Gaussian. Under the conditions in Corollary 10, in Case 1 of Theorem 9, we can find the asymptotic confidence interval of a, vi as a, (In bvbv )a nbβ , a, bv + zα a, (In bvbv )a and the asymptotic confidence interval of βi as bβ + k/2 1 bβ zα n, bβ + k/2 1 We have similar results for Cases 2, 3, 4, by changing (βi, vi) above to the corresponding limit. 2.3 Proof Sketch In this section, we outline the proof of Theorem 1. The detailed proofs are given in Appendix A. To analyze the power iteration algorithm (2), we construct an auxiliary iteration, y0 = u and yt+1 = X[y (k 1) t ]. This gives us a sequence of vectors y0, y1, y2, , yt, . To analyze yt+1, we condition on y0, y1, , yt, then Lemma 12 implies a decomposition of the Gaussian tensor Z = Zfixed + Zrand given by a deterministic part and a random part. This can be used to give a decomposition of yt+1 yt+1 = β v, yt k 1v + Zfixed[y (k 1) t ] + Zrand[y (k 1) t ] =: at+1v + bt+1wt+1 + ct+1ξt+1, (20) where the first term is the projection in the signal direction and ξt+1 is a random unit vector. The relation (20) induces a recursion for the coefficients {at, bt, ct}. Although the recursion is complicated, it can be analyzed using Gaussian concentration estimates. Under the Assumption of Theorem 1 and |β u, v k 2| nε, we show that eventually at will dominate. 
More precisely, for 2 + 2 log |β| yt = atv + at β ξ + small error, (21) where ξ = Z[v (k 1)] behaves like a Gaussian vector. The claims of Theorem 1 then follows from (21). Power Iteration for Tensor PCA 3. Numerical Study In this section, we conduct numerical experiments on synthetic data to demonstrate our distributional results provided in Sections 2.1 and 2.2. We fix the dimension n = 600 and rank k = 3. 3.1 Rank one spiked tensor model We begin with numerical experiments on rank one case. This section is devoted to numerically studying the efficiency of our estimators for the strength of signals and linear functionals of the signals. We take the signal v a random vector sampled from the unit sphere in Rn, and the vector 3(en/3 + e2n/3 + en) For the setting without prior information of the signal, we take the initialization of our power iteration algorithm u a random vector sampled from the unit sphere in Rn, and the strength of signal β = n(k 2)/2 24.495. We plot in Figure 1 our estimators for the strength of signals after normalization and our estimators for the linear functionals of the signals nbβ p a, (In bvbv )a 2bβ2 1 a, bv a, v (23) as in Corollary 3. For the setting that there is prior information of the signal, we take the initilization of our power iteration algorithm u = (v + w)/ v + w 2, where v is a random vector sampled from the unit sphere in Rn. We plot our estimators for the strength of signals after normalization (22) and our estimators for the linear functionals of the signals (23) for β = 5 in Figure 2, and for β = 10 in Figure 3. Although our Theorem 1 and Corollary 3 requires |β u, v k 2| nε 1, Figures 2 and 3 indicate that our estimators bβ and a, bv are asymptotically Gaussian even with small β, i.e. β = 5, 10. Theorem 1 also indicates that error term in Corollary (3), i.e. the error term in (7), is of order 1/|β|. This matches with our simulation. In Figures 2 and 3, the the difference between the Gaussian fit of our empirical density and the density of N(0, 1) decreases as β increases from 5 to 10. In Figure 4, we test the threshold signal-to-noise ratio for the power iteration algorithm. Our Theorems 1 and 2 state that for |β u0, v k 2| 1 tensor power iteration recovers the signal v, and fails when |β u0, v k 2| 1. Especially for random initialization, we have that | u0, v | 1/ n. Our Theorems state that for |β| n(k 2)/2 tensor power iteration recovers the signal v, and fails when |β| n(k 2)/2. Take k = 3. In the left panel of Figure 4, we test tensor power iteration with random initialization for various dimensions n {200, 300, 400, 500, 600} and signal strength β/ n (0, 2]. In the right panel of Figure 4, we test tensor power iteration with fixed small β = 3 and informative initialization β u0, v (0, 2] for various dimensions n {200, 300, 400, 500, 600}. The outputs bv, v are averaged over 60 independent trials. Huang, Huang, Yang, Cheng Figure 1: The empirical density of normalized bβ as in (22) (left panel), and normalized a, bv as in (23). The results are reported over 2000 independent trials where the initialization of our power iteration algorithm u a random vector sampled from the unit sphere in Rn, and the strength of signal β = n(k 2)/2 24.495. Figure 2: The empirical density of normalized bβ as in (22) (left panel), and normalized a, bv as in (23). The results are reported over 2000 independent trials where the initialization of our power iteration algorithm u a random vector sampled from the unit sphere in Rn, and the strength of signal β = 5. 
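The following is a minimal, scaled-down sketch of the rank-one experiment described in this section (a smaller $n$ than the paper's $n = 600$ and fewer trials are used purely to keep the run time modest). The normalization of $\hat{\beta}$ used for the left panels, whose formula (22) is not reproduced in the text above, is taken here as $\sqrt{n}(\hat{\beta} + (k/2-1)/\hat{\beta} - \beta)$ following Corollary 11; this choice, together with all variable names, is our own reconstruction rather than code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, beta, trials, T = 200, 3, 10.0, 200, 20   # scaled down from n = 600, 2000 trials

# Signal v uniform on the unit sphere; test direction a supported on three coordinates.
v = rng.standard_normal(n); v /= np.linalg.norm(v)
a = np.zeros(n); a[[n // 3 - 1, 2 * n // 3 - 1, n - 1]] = 1 / np.sqrt(3)

stats_beta, stats_lin = [], []
for _ in range(trials):
    Z = rng.standard_normal((n, n, n)) / np.sqrt(n)        # i.i.d. N(0, 1/n) noise
    X = beta * np.einsum('i,j,k->ijk', v, v, v) + Z
    w = rng.standard_normal(n); w /= np.linalg.norm(w)
    u = v + w; u /= np.linalg.norm(u)                      # warm start u = (v + w)/||v + w||_2
    for _ in range(T):                                     # power iteration (2)
        y = np.einsum('ijk,j,k->i', X, u, u)
        u = y / np.linalg.norm(y)
    beta_hat = np.einsum('ijk,i,j,k->', X, u, u, u)
    # Normalized spike strength (our reconstruction of (22), via Corollary 11).
    stats_beta.append(np.sqrt(n) * (beta_hat + (k / 2 - 1) / beta_hat - beta))
    # Normalized linear functional, as in (23) / Corollary 3.
    var_a = a @ a - (a @ u) ** 2                           # <a, (I - v_hat v_hat^T) a>
    stats_lin.append(np.sqrt(n) * beta_hat
                     * (2 * beta_hat**2 / (2 * beta_hat**2 - 1) * (a @ u) - a @ v)
                     / np.sqrt(var_a))

# Both samples should be approximately standard normal (compare Figures 1-3).
print(np.mean(stats_beta), np.std(stats_beta))
print(np.mean(stats_lin), np.std(stats_lin))
```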
Power Iteration for Tensor PCA Figure 3: The empirical density of normalized bβ as in (22) (left panel), and normalized a, bv as in (23). The results are reported over 2000 independent trials where the initialization of our power iteration algorithm u a random vector sampled from the unit sphere in Rn, and the strength of signal β = 10. Figure 4: Output of tensor power iteration with random initialization for various signal strength β/ n (0, 2] (left panel), and tensor power iteration with fixed small β = 3 and informative initialization β u0, v (0, 2]. Huang, Huang, Yang, Cheng Figure 5: Scatter plot of (bβ, a, bv ) (first panel), the normalized (bβ, a, bv ) as in (24) for the cluster corresponding to (β1, a, v1 ) (second panel), the normalized (bβ, a, bv ) as in (25) for the cluster corresponding to (β2, a, v2 ). The contour plot is a standard 2-dim Gaussian distribution, at 1, 2, 3 standard deviation. The results are reported over 5000 independent trials where the initialization of our power iteration algorithm u a random vector sampled from the unit sphere in Rn. 3.2 Rank-r spiked tensor model In this section, we conduct numerical experiments to demonstrate our distributional results for the multi-rank spiked tensor model. We consider the simplest case that there are two spikes with signals v1, v2, such that they are uniformly sampled from the unit sphere in Rn and orthogonal to each other v1, v2 = 0, and the vector 3(en/3 + e2n/3 + en). We test the setting that there is no prior information of the signal. We take the strength of signals β1 = 1.2 n(k 2)/2 29.394 and β2 = n(k 2)/2 24.495 and the initialization of our power iteration algorithm u a random vector sampled from the unit sphere in Rn. We scatter plot in Figure 6 our estimator bβ for the strength of signals, and our estimator a, bv for the linear functionals of the signals over 5000 independent trials. As seen in the first panel of Figure 6, our estimators (bβ, a, bv ) form two clusters, centered around (β1, a, v1 ) (29.394, 0.000) and (β2, a, v2 ) (24.495, 0.039). In the second and third panels, we zoom in, and scatter plot for the cluster corresponding to (β1, a, bv1 ) (29.394, 0.000) bβ β, nbβ1 p a, (In bvbv )a 2bβ2 1 a, bv a, v1 , (24) and scatter plot for the cluster corresponding to (β2, a, bv2 ) (24.495, 0.039) bβ β2, nbβ p a, (In bvbv )a 2bβ2 1 a, bv a, v2 . (25) Power Iteration for Tensor PCA Figure 6: The empirical density of the normalized (bβ, a, bv ) as in (24) for the cluster corresponding to (β1, a, v1 ) (second panel), the normalized (bβ, a, bv ) as in (25) for the cluster corresponding to (β2, a, v2 ). The results are reported over 5000 independent trials where the initialization of our power iteration algorithm u a random vector sampled from the unit sphere in Rn. As predicted by our Theorem 9, both clusters are asymptotically Gaussian, and the normalized estimators matches pretty well with the contour plot of standard 2-dim Gaussian distribution, at 1, 2, 3 standard deviation. We plot in Figure 6 our estimators for the strength of signals and the linear functionals of the signals after normalization, for the first cluster (24), and for the second cluster (25). In Table (1), for each n {50, 100, 200, 400, 600} and k = 3, we take the strength of signals β1 = 10n(k 2)/2 and β2 = 12 n(k 2)/2. Over 1500 independent trials for power iteration with random initialization for each n, we estimate the percentage bp1 of estimators converging to β1, and the percentage bp2 of estimators converging to β2. 
Our theoretical Huang, Huang, Yang, Cheng n = 50 n = 100 n = 200 n = 400 n = 600 bp1 0.421 0.421 0.439 0.446 0.445 bp2 0.579 0.579 0.561 0.554 0.555 signal β1 0.932 0.956 0.948 0.961 0.957 linear form a, v1 0.965 0.929 0.959 0.945 0.952 signal β2 0.944 0.944 0.946 0.953 0.949 linear form a, v2 0.950 0.950 0.937 0.949 0.941 Table 1: Estimated bp1, bp2 over 1500 independent trials for dimension n {50, 100, 200, 400, 600} (top two rows), and numerical coverage rates for our 95% confidence intervals over 1500 independent trials for dimension n {50, 100, 200, 400, 600} (last four rows). p1 = P(|β1 u, v1 | > |β2 u, v2 |) 0.44, p2 = P(|β1 u, v1 | < |β2 u, v2 |) 0.56. We also examine the numerical coverage rates for our 95% confidence intervals over 1500 independent trials. Acknowledgments The research collaboration was initiated when both G.C. and J.H. were warmly hosted by IAS in the special year of Deep Learning Theory. The research of G.C. is supported by NSF grant NSF-SCALE Mo DL (2134209). The research of J.H. is supported by the Simons Foundation as a Junior Fellow at the Simons Society of Fellows, and NSF grant DMS-2054835. The research of Q.Y. is supported by National Natural Science Foundation of China (No. 12101585). We also want to thank the anonymous referees for their helpful comments and suggestions. Power Iteration for Tensor PCA Appendix A. Proof of main theorems A.1 Proof of Theorems 1 and 2 The following lemma on the conditioning of Gaussian tensors will be repeatedly use in the remaining of this section. Lemma 12 Let Z k Rn be a random Gaussian tensor. The entries of Z are i.i.d. standard N(0, 1/n) Gaussian random variables. Fix τ1, τ2, , τt k 1Rn orthonormal (k 1)-th order tensors, i.e. τi, τj = δij, and vectors ξ1, ξ2, , ξt Rn. Then the distribution of Z[τ] conditioned on Z[τs] = ξs for 1 s t is s=1 τs, τ ξs + Z s=1 τs, τ τs where Z is an independent copy of Z. Proof [Proof of Lemma 12] For any (k 1)-th order tensor τ, viewed as a vector in Rnk 1, we can decompose it as the projection on the span of τ1, τ2, , τt and the orthogonal part s=1 τs, τ τs + s=1 τs, τ τs Using the above decomposition and Z[τs] = ξs, we can write Z[τ] as s=1 τs, τ ξs + Z s=1 τs, τ τs and the first sum and the second term on the righthand side of (27) are independent. The claim (26) follows. Proof [Proof of Theorem 1] We define an auxiliary iteration, y0 = u and yt+1 = X[y (k 1) t ]. (28) Then with yt, our original power iteration (2) is given by ut = yt/ yt 2. Let ξ = Z[v (k 1)] Rn. Then the entries of ξ are given by ξ(i) = Z[v (k 1)](i) = Z, ei v (k 1) = X i1,i2, ,ik 1 [[1,n]] Zii1i2 ik 1v(i1)v(i2) v(ik 1). From the expression, ξ(i) is a linear combination of Gaussian random variables, itself is also a Gaussian. Moreover, these entries ξ(i) are i.i.d. Gaussian variables with mean zero and variance 1/n: E[ξ(i)2] = X i1,i2, ,ik 1 [[1,n]] E[Z2 ii1i2 ik 1]v(i1)2v(i2)2 v(ik 1)2 = 1 Huang, Huang, Yang, Cheng We can compute yt iteratively: y1 is given by y1 = X[y (k 1) 0 ] = β y0, v k 1v + Z[y (k 1) 0 ]. (29) For the last term on the righthand side of (29), we can decompose y (k 1) 0 as a projection on v (k 1) and its orthogonal part: y (k 1) 0 = y0, v k 1v (k 1) + q 1 y0, v 2(k 1)τ0, where τ0 (k 1)Rn and v (k 1), τ0 = 0, τ0, τ0 = 1. Thanks to Lemma 12, conditioning on ξ = Z[v (k 1)], ξ1 = Z[τ0] has the same law as Z[τ0], where Z is an independent copy of Z. Since τ0, τ0 = 1, ξ1 is a Gaussian vector with each entry N(0, 1/n). 
With those notations we can rewrite the expression (29) of y1 as y1 = β y0, v k 1v + y0, v k 1ξ + q 1 y0, v 2(k 1)ξ1. (30) In the following we show that: Claim 13 We can compute y1, y2, y3, , yt inductively. The Gram-Schmidt orthonormalization procedure gives an orthogonal base of v (k 1), y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 as: v (k 1), τ0, τ1, , τt 1. (31) Let ξs+1 = Z[τs] for 0 s t 1. Conditioning on ξ = Z[v (k 1)] and ξs+1 = Z[τs] for 0 s t 2, ξt = Z[τt 1] is an independent Gaussian vector, with each entry N(0, 1/n). Then yt is in the following form yt = atv + btwt + ctξt, btwt = bt0ξ + bt1ξ1 + + btt 1ξt 1, (32) where wt 2 = 1. Proof [Proof of Claim 13] The Claim 13 for t = 1 follows from (30). In the following, assuming Claim 13 holds for t, we prove it for t + 1. Let v (k 1), τ0, τ1, , τt be an orthogonal base for v (k 1), y (k 1) 0 , y (k 1) 1 , , y (k 1) t , obtained by the Gram-Schmidt orthonormalization procedure. More precisely, given those tensors v (k 1), τ0, τ1, , τt 1, we denote b(t+1)0 = y (k 1) t , v (k 1) , ct+1 = y (k 1) t , τt , b(t+1)(s+1) = y (k 1) t , τs , 0 s t 1. then b(t+1)0v (k 1) +b(t+1)1τ0 +b(t+1)2τ1 + b(t+1)tτt 1 is the projection of y (k 1) t on the span of v (k 1), y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 . With those notations, we can write y (k 1) t as y (k 1) t = b(t+1)0v (k 1) + b(t+1)1τ0 + b(t+1)2τ1 + b(t+1)tτt 1 + ct+1τt, (33) Power Iteration for Tensor PCA Using (32) and (33), we notice that βv k 1, y (k 1) t = β(at + bt wt, v + ct ξt, v )k 1v, and the iteration (28) implies that yt+1 = β(at + bt wt, v + ct ξt, v )k 1v + bt+1wt+1 + ct+1Z[τt], (34) bt+1wt+1 = Z[b(t+1)0v (k 1) + b(t+1)1τ0 + b(t+1)2τ1 + b(t+1)tτt 1] = b(t+1)0ξ + b(t+1)1ξ1 + b(t+1)2ξ2 + b(t+1)tξt. Since τt is orthogonal to v (k 1), τ0, τ1, , τt 1, Lemma 12 implies that conditioning on ξ = Z[v (k 1)] and ξs+1 = Z[τs] for 0 s t 1, ξt+1 = Z[τt] is an independent Gaussian vector, with each entry N(0, 1/n). The above discussion gives us that yt+1 = at+1v + bt+1wt+1 + ct+1ξt+1, at+1 = β(at + bt wt, v + ct ξt, v )k 1. (35) In this way, for any t 0, yt is given in the form (32). In the following, We study the case that u, v > 0. The case u, v < 0 can be proven in exactly the same way, by simply changing (β, v) with (( 1)kβ, v). We prove by induction Claim 14 For any fixed time t, with probability at least 1 O(e c(log N)2) the following holds: for any s t, |as| |β|(|bs0| + |bs1| + + |bs(s 1)|), |as| nε max{1(k 3)|cs/β1/(k 2)|, |cs/ n|}. (36) ξ , ξs 2 = 1 + O(log n/ n), | v, ξ |, | a, ξ |, | a, ξs |, Proj Span{v,ξ,ξ1, , ,ξs 1}(ξs) 2 log n/ n. (37) Proof [Proof of Claim 14] From (30), y1 = β u, v k 1v + u, v k 1ξ + p 1 u, v 2(k 1)ξ1. We have a1 = β u, v k 1, b10 = u, v k 1, b1w1 = u, v k 1ξ and c1 = p 1 u, v 2(k 1). Since ξ is a Gaussian vector with each entry mean zero and variance 1/n, the concentration for chi-square distribution implies that i=1 ξ(i)2 = 1 + O(log n/ n) with probability 1 ec(log n)2. We can directly check that |a1| = |βb10|, |β1/(k 2)a1| = |β u, v k 2|(k 1)/(k 2) n(k 1)ε/(k 2) nε|c1|, and | na1| = |β u, v k 2|| n u, v | nε nε|c1|. Moreover, conditioning on Z[v (k 1)] = ξ, Lemma 12 implies that ξ1 = Z[τ0] Huang, Huang, Yang, Cheng is an independent Gaussian random vector with each entry N(0, 1/n). By the standard concentration inequality, it holds that with probability 1 ec(log n)2, ξ1 2 = 1+O(log n/ n), | a, ξ1 | and the projection of ξ1 on the span of {v, ξ} is bounded by log n/ n. So far we have proved that (36) and (37) for t = 1. 
In the following, we assume that (36) holds for t, and prove it for t + 1. We recall from (32) and (35) that at+1 = β(at + bt wt, v + ct ξt, v )k 1, btwt = bt0ξ + bt1ξ1 + + btt 1ξt 1 (38) By our induction hypothesis, we have that |bt wt, v | |bt0 ξ, v | + |bt1 ξ1, v | + + |bt(t 1) ξt 1, v | (log n/ n)|at|/|β|, (39) |ct ξt, v | (log n/ n)|ct| (log n)|at|/nε. (40) It follows from plugging (39) and (40) into (38), we get at+1 = β(at + O(log n|at|/nε))k 1 = (1 + O(log n/nε))βak 1 t . (41) We recall from (33), the coefficients b(t+1)0, b(t+1)1, , b(t+1)t are determined from the projection of y (k 1) t on v (k 1), τ0, τ1, , τt 1 y (k 1) t = b(t+1)0v (k 1) + b(t+1)1τ0 + b(t+1)2τ1 + b(t+1)tτt 1 + ct+1τt. We also recall that v (k 1), τ0, τ1, , τt 1 are obtained from v (k 1), y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 by the Gram-Schmidt orthonormalization procedure. So we have that the span of vectors (viewed as vectors) v (k 1), τ0, τ1, , τt 1 is the same as the span of the tensors v (k 1), y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 , which is contained in the span of {v, wt, y0, , yt 1} (k 1). Moreover from the relation (32), one can see that the span of {v, wt, y0, , yt 1} is the same as the span of {v, ξ, ξ1, , ξt 1}. It follows that b2 (t+1)0 + b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t = Proj Span{v (k 1),τ0,τ1, ,τt 1}(atv + btwt + ctξt) (k 1) 2 Proj Span{v,wt,y0, ,yt 1} (k 1)(atv + btwt + ctξt) (k 1) 2 Proj Span{v,wt,y0, ,yt 1}(atv + btwt + ctξt) k 1 2 = atv + btwt + ct Proj Span{v,ξ,ξ1, ,ξt 1}(ξt) k 1 2 |at| + |bt| + log n|ct| n k 1 |at|k 1 |at+1|/|β|, where in the last line we used our induction hypothesis that Proj Span{v,ξ,ξ1, ,ξt 1}(ξt) 2 log n/ n. Power Iteration for Tensor PCA Finally we estimate ct+1. We recall from (33), the coefficient ct+1 is the remainder of y (k 1) t after projecting on v (k 1), τ0, τ1, , τt 1. It is bounded by the remainder of y (k 1) t after projecting on v (k 1), |ct+1| y (k 1) t ak 1 t v (k 1) 2 = (atv + btwt + ctξt) (k 1) ak 1 t v (k 1) 2. The difference (atv + btwt + ctξt) (k 1) ak 1 t v (k 1) is a sum of terms in the following form, η1 η2 ηk 1, (43) where vectors η1, η2, , ηk 1 {atv, btwt+ctξt}, and at least one of them is btwt+ctξt. We notice that by our induction hypothesis, btwt + ctξt 2 |bt| wt 2 + |ct| ξt 2 |bt| + |ct|. For the L2 norm of (43), each copy of atv contributes at and each copy of btwt + ctξt contributes a factor |bt| + |ct|. We conclude that |ct+1| (atv + btwt + ctξt) (k 1) ak 1 t v (k 1) 2 r=1 |at|k 1 r(|bt| + |ct|)r. (44) Combining the above estimate with (41) that |at+1| |β||at|k 1, we divide both sides of (44) by |β||at|k 1, |ct+1| |at+1| 1 |at| + |ct| where we used our induction hypothesis that |at| |β||bt|. There are three cases: 1. If |ct|/|at| 1, then |ct+1| |at+1| 1 If k = 2, then our assumption |β u, v k 2| = |β| nε, implies that |ct+1|/|at+1| (|ct|/|at|)/nε. If k 2, by our induction hypothesis |ct|/|at| β1/(k 2)/nε. This implies (|ct|/|at|)k 2/|β| 1/nε, and we still get that |ct+1|/|at+1| (|ct|/|at|)/nε. 2. If 1/|β| |ct|/|at| 1, then |ct+1| |at+1| 1 where we used that |β| |β u, v k 2| nε. 3. Finally for |ct|/|at| 1/|β|, we will have |ct+1| |at+1| 1 Huang, Huang, Yang, Cheng In all these cases if |ct|/|at| min{ n, 1(k 3)|β|1/(k 2)}/nε, we have |ct+1|/|at+1| min{ n, 1(k 3)|β|1/(k 2)}/nε. This finishes the proof of the induction (36). 
For (37), since τt is orthogonal to v (k 1), τ0, τ1, , τt 1, Lemma 12 implies that conditioning on ξ = Z[v (k 1)] and ξs+1 = Z[τs] for 0 s t 1, ξt+1 = Z[τt] is an independent Gaussian vector, with each entry N(0, 1/n). By the standard concentration inequality, it holds that with probability 1 ec(log n)2, ξt+1 2 = 1+O(log n/ n), | a, ξt+1 | and the projection of ξt+1 on the span of {v, ξ, ξ1, , ξt} is bounded by log n/ n. This finishes the proof of the induction (37). Next, using (36) and (37) in Claim 14 as input, we prove that for 2 + 2 log |β| with probability 1 ec(log n)2 we have yt = atv + bt0ξ + bt1ξ1 + + btt 1ξt 1 + ctξt, (47) β + O log n|at| |bt1|, |bt2|, , |bt(t 1)| (log n)1/2|at| |β|3/2n1/4 , |ct| |at|/β2. (48) Let xt = |ct/at| |β|1/(k 2), then (45) implies from the discussion after (45), we have that either xt+1 1/β2, or xt+1 xt/nε. Since x1 = |c1/a1| n1/2 ε, we conclude that it holds xt = |ct/at| 1/β2, when t 1 2 + 2 log |β| To derive the upper bound of bt1, bt2, , bt(t 1), we use (42). b2 (t+1)0 + b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t atv + btwt + ct Proj Span{v,ξ,ξ1, ,ξt 1}(ξt) 2(k 1) 2 |at| (|bt| + |ct|) log n n + |bt| + |ct|log n n where we used our induction (37) that | ξ, v |, | ξ1, v |, , | ξt, v | log n/ n and the projection Proj Span{v, ξ, ξ1, , ξt 1}(ξt) 2 log n/ n. Moreover, the first term b(t+1)0 is the projection of y (k 1) t on v (k 1), b(t+1)0 = atv + btwt + ctξt, v k 1 = at + O log n(|bt| + |ct|) n Power Iteration for Tensor PCA where we used (37) that | ξ, v |, | ξ1, v |, , | ξt, v | log n/ n. Now we can take difference of (50) and (51), and use that |bt| |at|/|β| from (36) and |ct| |at|/|β| from (49), b(t+1)0 = ak 1 t + O |at|k 1 log n , b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t a2(k 1) t log n |β| n. From (38) and (41), we have that at+1 = βb(t+1)0 βak 1 t . (53) Using the above relation, we can simplify (52) as b(t+1)0 = at+1 β + O log n|at+1| , |b(t+1)1|, |b(t+1)2|, |b(t+1)t| (log n)1/2|at+1| |β|3/2n1/4 . This finishes the proof of (48). With the expression (48), we can process to prove our main results (4) and (5). Thanks to (37), for t satisfies (46), we have that with probability at least 1 O(e c(log N)2) yt 2 2 = a2 t β2 + 2 v, ξ log n β2 n + (log n)3/2 |β|3/2n3/4 + 1 By rearranging it we get 1/ yt 2 = 1 |at| 1 1 2β2 v, ξ log n β2 n + (log n)3/2 |β|3/2n3/4 + 1 We can take the inner product a, yt using (47) and (48), and multiply (55) a, ut = a, yt yt 2 = sgn(at) 1 1 2β2 a, v + a, ξ a, v v, ξ log n β2 n + (log n)3/2 |β|3/2n3/4 + | a, v | where we used (37) that with high probability | a, ξ |, | a, ξs | for 1 s t are bounded by log n/ n. This finishes the proof of (4). For bβ in (5), we have X[u k t ] = X[y k t ] yt k 2 = yt, X[y (k 1) t ] yt k 2 = yt, yt+1 yt k 2 . (57) Thanks to (49), (35) and (37), for t satisfies (46), with probability at least 1 O(e c(log N)2), we have yt+1 = at+1v + b(t+1)0ξ + b(t+1)1ξ1 + + b(t+1)tξt + ct+1ξt+1, Huang, Huang, Yang, Cheng where |ct+1| |at|k 1/β2, at+1 = β(at + bt0 ξ, v + bt1 ξ1, v + + bt(t 1) ξt 1, v + ct ξt, v )k 1 log n β2 n + (log n)3/2 !!k 1 , (58) b(t+1)0 = ak 1 t 1 + O log n , |b(t+1)1|, |b(t+1)2| + + |b(t+1)t| ak 1 t (log n)1/2 |β|1/2n1/4 . From the discussion above, combining with (47) and (48) with straightforward computation, we have yt, yt+1 = βak t β2 + (k + 1) ξ, v log n β2 n + (log n)3/2 By plugging (55) and (59) into (57), we get X[u k t ] = sgn(at)k β + ξ, v k/2 1 log n |β| n + (log n)3/2 |β|1/2n3/4 + 1 |β|3 Since by our assumption, in Case 1 we have that β > 0. 
Thanks to (58) at+1 = βak 1 t (1 + o(1)), especially at+1 and at are of the same sign. In the case u, v > 0, we have a1 = β u, v k 1 > 0. We conclude that at > 0. Therefore sgn(X[u k t ]) = sgn(at)k = +, and it follows that X[u k t ] = β + ξ, v k/2 1 log n |β| n + (log n)3/2 |β|1/2n3/4 + 1 |β|3 This finishes the proof of (5). The Cases 2, 3, 4 follow by simply changing (β, v) in the righthand side of (4) and (5) to the corresponding limit. Proof [Proof of Theorem 2] We use the same notations as in the proof of Theorem 1. If |β| nε and |β u, v k 2| n ε, then we first prove by induction that for any fixed time t, with probability at least 1 O(e c(log N)2) the following holds: for any s t, |bs0|, |bs1|, , |bs(s 1)| max{|cs|/|β|(k 1)/(k 2), (log n)k 1|cs|/n(k 1)/2}, |cs| nεβ1/(k 2)|as|, (60) ξ , ξs 2 = 1 + O(log n/ n), | v, ξ |, Proj Span{v,ξ,ξ1, , ,ξs 1}(ξs) 2 log n/ n. (61) Power Iteration for Tensor PCA From (30), a1 = β u, v k 1, b10 = u, v k 1 and c1 = p 1 u, v 2(k 1). Since |β| nε and |β u, v k 2| n ε, we have that | u, v | n 2ε/(k 2) 1 and therefore |c1| 1. We can check that |β1/(k 2)a1| = |β u, v k 2|(k 1)/(k 2) n ε n ε|c1| and |b10| = |a1/β| n ε|c1/β(k 1)/(k 2)|. Moreover, conditioning on Z[v (k 1)] = ξ, Lemma 12 implies that ξ1 = Z[τ0] is an independent Gaussian random vector with each entry N(0, 1/n). By the standard concentration inequality, it holds that with probability 1 ec(log n)2, ξ1 2 = 1 + O(log n/ n), and the projection of ξ1 on the span of {v, ξ} is bounded by log n/ n. So far we have proved (60) and (61) for t = 1. In the following, assuming the statements (60) and (61) hold for t, we prove them for t + 1. From (38), using (38) and (39), we have |at+1| = β(at + bt wt, v + ct ξt, v )k 1 |β| |at| + log n(|bt0| + |bt1| + + |bt(t 1)|) n + log n|ct| n |β| |at| + log n|ct| n k 1 |β| |ct| nε|β|1/(k 2) + log n|ct| n |β||ct|k 1 1 nε|β|1/(k 2) + log n n k 1 |ct|k 1 nε|β|1/(k 2) , where in the third line we used our induction hypothesis that |bt0|+|bt1|+ +|bt(t 1)| |ct|, and n ε |β|| u, v |k 2 |β|/n(k 2)/2. For b(t+1)0, b(t+1)1, , b(t+1)t, from (42) we have b2 (t+1)0 + b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t |at| + |bt| + log n|ct| n |at| + log n|ct| n k 1 |ct| nε|β|1/(k 2) + log n|ct| n |ct|k 1 1 nε|β|1/(k 2) + log n n Finally we estimate ct+1. We recall from (33), the coefficient ct+1 is the remainder of y (k 1) t after projecting on v (k 1), τ0, τ1, , τt 1. We have the following lower bound for ct+1 |ct+1|2 = (atv + btwt + ctξt) (k 1) 2 2 (b2 (t+1)0 + b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t) atv + btwt + ctξt 2(k 1) 2 O |ct|2(k 1) 1 nε|β|1/(k 2) + log n n For the first term on the righthand side of (64), using our induction hypothesis (60) and (61) that |at| |ct|, we have atv + btwt + ctξt 2 2 = a2 t + b2 t + c2 t ξt 2 2 + 2atbt v, wt + 2atct v, ξt + 2btct wt, ξt = 1 + O log n n + 1 n2εβ2/(k 2) c2 t . (65) Huang, Huang, Yang, Cheng We get the following lower for ct+1 by plugging (65) into (64), and rearranging |ct+1| 1 + O log n n + 1 n2εβ2/(k 2) |ct|k 1 (66) The claim that |b(t+1)0|, |b(t+1)1|, , |b(t+1)t| max{|ct+1|/|β|(k 1)/(k 2), (log n)k 1|ct+1|/n(k 1)/2} follows from combining (63) and (66). The claim that |ct+1| nεβ1/(k 2)|at+1| follows from combining (62) and (66). For (61), since τt is orthogonal to v (k 1), τ0, τ1, , τt 1, Lemma 12 implies that conditioning on ξ = Z[v (k 1)] and ξs+1 = Z[τs] for 0 s t 1, ξt+1 = Z[τt] is an independent Gaussian vector, with each entry N(0, 1/n). 
By the standard concentration inequality, it holds that with probability 1 ec(log n)2, ξt+1 2 = 1 + O(log n/ n), and the projection of ξt+1 on the span of {v, ξ, ξ1, , ξt} is bounded by log n/ n. This finishes the proof of the induction (61). Next, using (36) and (37) as input, we prove that for 2 log |β| (k 2) log n yt = atv + bt0ξ + bt1ξ1 + + bt(t 1)ξt 1 + ctξt, (68) |at|, |bt0|, |bt1|, , |bt(t 1)| |ct||β| log n n Let xt = |at/ct|, then (60) implies that xt 1/(nε|β|1/(k 2)). By taking the ratio of (62) and (66), we get xt+1 |β| log n n + xt there are two cases, 1. if log n/ n xt 1/(nε|β|1/(k 2)), then xt+1 |β|xk 1 t = xt(|β|1/(k 2)xt)k 2 xt/nε; 2. If xt log n/ n, then |xt+1| |β|(log n/ n)k 1. Since x1 = |a1/c1| 1/(nε|β|1/(k 2)), we conclude that xt = |at/ct| |β|(log n/ n)k 1, when t 1 2 log |β| (k 2) log n Power Iteration for Tensor PCA In this regime, (63) implies that |b(t+1)0|, |b(t+1)1|, |b(t+1)2|, , |b(t+1)t| |β||ct|k 1 |at| |ct| + log n n |β||ct|k 1 log n n k 1 |ct+1||β| log n n where we used (66) in the last inequality. This finishes the proof of (69). Using (69), we can compute ut, ut = yt yt = ξt ξt 2 + OP |β| log n n where the error term is a vector of length bounded by |β|(log n/ n)k 1. This finishes the proof of Theorem 1. A.2 Proof of Corollarys 3 and 4 Proof [Proof of Corollary 3] According to the definition of ξ in (4) of Theorem 1, i.e. ξ = Z[v (k 1)], is an n-dim vector, with each entry i.i.d. N(0, 1/n) Gaussian random variable. We see that ξ, v d= N (0, 1/n) . Especially with high probability we will have that | ξ, v | log n/ n. Then we conclude from (5), with high probability it holds bβ = β + O 1 β + log n n With the bound (112), we can replace a, v /(2β2) on the righthand side of (4) by a, v /(2bβ2), which gives an error a, v = O | a, v | 1 |β|4 + log n |β|3 n Combining the above discussion together, we can rewrite (4) as a, v = a, ξ a, v v, ξ log n β2 n + (log n)3/2 β3/2n3/4 + | a, v | with high probability. Again thanks to the definition of ξ in (4) of Theorem 1, i.e. ξ = Z[v (k 1)], is an n-dim vector, with each entry i.i.d. N(0, 1/n) Gaussian random variable, we see that a, ξ a, v v, ξ = a a, v v, ξ , Huang, Huang, Yang, Cheng is a Gaussian random variable, with mean zero and variance E[ a a, v v, ξ 2] = 1 n a a, v v 2 2 = 1 n a, (In vv )a = 1 + o(1) n a, (In bvbv )a . This together with (112), (113) as well as our assumption (6) nbβ p a, (In bvbv )a 2bβ2 1 a, bv a, v d N(0, 1). (74) Under the same assumption, we have similar results for Cases 2, 3, 4, by simply changing (β, v) in the righthand side of (4) and (5) to the corresponding expression. Proof [Proof of Corollary 4] Given the significance level α, the asymptotic confidence intervals in Corollary 4 can be calculated from Corollary 3 by bounding the absolute values of the left hand sides of (7) at zα. A.3 Proof of Theorem 7 Proof [Proof of Theorem 7] We define an auxiliary iteration, y0 = u and yt+1 = X[y (k 1) t ]. (75) Then we have that ut = yt/ yt 2. For index j = (j1, j2, , jk 1) [[1, r]]k 1. Let ξj = Z[vj1 vj2 vjk 1]. Its entries i1,i2, ,ik 1 [[1,n]] Zii1i2 ik 1vj1(i1)vj2(i2) vjk 1(ik 1), are linear combination of Gaussian random variables, which is also Gaussian. These entries are i.i.d. Gaussian variables with mean zero and variance 1/n, E[ξj(i)2] = X i1,i2, ,ik 1 [[1,n]] E[Z2 ii1i2 ik 1]vj1(i1)2vj2(i2)2 vjk 1(ik 1)2 = 1 We can compute yt iteratively: y1 = X[y (k 1) 0 ] = j=1 βj y0, vj k 1vj + Z[y (k 1) 0 ]. 
(76) For the last term on the righthand side of (76), we can decompose y (k 1) 0 as a projection on vj1 vj2 vjk 1 for j [[1, r]]k 1, and its orthogonal part: y (k 1) 0 = X s=1 y0, vjs vj1 vj2 vjk 1 + j=1 y0, vj 2 Power Iteration for Tensor PCA where the sum is over j [[1, r]]k 1, τ0 k Rn and τ0 2 = 1. Let ξ1 = Z[τ0]. By our construction vj1 vj2 vjk 1 for any j [[1, r]]k 1 and τ0 are othorgonal to each other. Thanks to Lemma 12, conditioning on ξj := Z[vj1 vj2 vjk 1] for index j = (j1, j2, , jk 1) [[1, r]]k 1, ξ1 = Z[τ0] has the same law as Z[τ0], where Z is an independent copy of Z. Since τ0, τ0 = 1, ξ1 is a Gaussian vector with each entry N(0, 1/n). With those notations we can rewrite y1 as j=1 βj y0, vj k 1vj + X s=1 y0, vjs ξj + j=1 y0, vj 2 In the following we show that: Claim 15 We can compute y2, y3, , yt inductively. The Gram-Schmidt orthonormalization procedure gives an orthogonal base of vj1 vj2 vjk 1 for j [[1, r]]k 1 and y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 as: {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt 1. (78) Let ξj = Z[vj1 vj2 vjk 1] for j = (j1, j2, , jk 1) [[1, r]]k 1, and ξs+1 = Z[τs] for 0 s t 1. Conditioning on ξj = Z[vj1 vj2 vjk 1] for j = (j1, j2, , jk 1) [[1, r]]k 1 and ξs+1 = Z[τs] for 0 s t 2, ξt = Z[τt 1] is an independent Gaussian vector, with each entry N(0, 1/n). Then yt is in the following form yt = atvt + btwt + ctξt, (79) atvt = at1v1 + at2v2 + + atrvr, btwt = X j btjξj + bt1ξ1 + + btt 1ξt 1, (80) and v1 2, v2 2, , vr 2, wt 2 = 1. Proof [Proof of Claim 15] The Claim 15 for t = 1 follows from (77). In the following, assuming Claim 15 holds for t, we prove it for t + 1. Conditioning on ξj = Z[vj1 vj2 vjk 1] for index j = (j1, j2, , jk 1) [[1, r]]k 1 and Z[τs] = ξs+1 for 0 s t 2, Lemma 12 implies that ξt = Z[τt 1] has the same law as Z[τt 1], where Z is an independent copy of Z. Since τt 1 is orthogonal to vj1 vj2 vjk 1 for index j = (j1, j2, , jk 1) [[1, r]]k 1 and Z[τs] = ξs+1 for 0 s t 2, ξt is an independent Gaussian random vector with each entry N(0, 1/n). Let {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt be an orthogonal base for vj1 vj2 vjk 1 for j [[1, r]]k 1 and y (k 1) 0 , y (k 1) 1 , , y (k 1) t , obtained by the Gram Schmidt orthonormalization procedure. More precisely, given those tensors {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt 1, we denote b(t+1)j = y (k 1) t , vj1 vj2 vjk 1 , j = (j1, j2, , jk 1) [[1, r]]k 1, b(t+1)s = y (k 1) s , τs 1 , 1 s t, ct+1 = y (k 1) t , τt (81) Huang, Huang, Yang, Cheng and P j b(t+1)jvj1 vj2 vjk 1 + b(t+1)1τ0 + b(t+1)2τ1 + b(t+1)tτt 1 is the projection of y (k 1) t on the span of {vj1 vj2 vjk 1}j [[1,r]]k 1, y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 . Then we can write y (k 1) t in terms of the base (78) y (k 1) t = X j b(t+1)jvj1 vj2 vjk 1 + b(t+1)1τ0 + b(t+1)2τ1 + b(t+1)tτt 1 + ct+1τt. The recursion (75) implies that j=1 βj(atj + bt wt, vj + ct ξt, vj )k 1vj + bt+1wt+1 + ct+1Z[τt] (82) bt+1wt+1 = Z[ X j b(t+1)jvj1 vj2 vjk 1 + b(t+1)1τ0 + b(t+1)2τ1 + b(t+1)tτt 1] j b(t+1)jξj + b(t+1)1ξ1 + b(t+1)2ξ2 + b(t+1)tξt. Since τt is orthogonal to {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt 1, Lemma 12 implies that conditioning on ξj = Z[vj1 vj2 vjk 1] for j = (j1, j2, , jk 1) [[1, r]]k 1 and ξs+1 = Z[τs] for 0 s t 1, ξt+1 = Z[τt] is an independent Gaussian vector, with each entry N(0, 1/n). The above discussion gives us that yt+1 = at+1vt+1 + bt+1wt+1 + ct+1ξt+1, at+1 = q a2 (t+1)1 + a2 (t+1)2 + + a2 (t+1)r. a(t+1)j = βj(atj + bt wt, vj + ct ξt, vj )k 1, 1 j r. 
(84) We recall that by our Assumption 5, that 1/κ u, vi u, vj for all 1 i, j r. If j = argmaxj βj u, vj k 2, it is necessary that βj β1, where the implicit constant depends on κ. In the following, we study the case that u, vj > 0. The case u, vj < 0 can be proven in exactly the same way, by simply changing (β, vj ) with (( 1)kβ, vj ). We prove by induction Power Iteration for Tensor PCA Claim 16 For any fixed time t, with probability at least 1 O(e c(log n)2) the following holds: for any s t, |asj | |asj|, |as| |β1|( X j |bsj| + |bs1| + + |bs(s 1)|), |as| nε max{1(k 3)|cs/β1/(k 2) 1 |, |cs/ n|}, and for j = (j1, j2, , jk 1) [[1, r]]k 1 ξj , ξs 2 = 1 + O(log n/ n), | vj, ξj |, | a, ξj |, | a, ξs | log n/ n. Proj Span{v1,v2, ,vr,{ξj}j [[1,r]]k 1,ξ1, , ,ξs 1}(ξs) 2 log n/ n. (86) Proof [Proof of Claim 16] From (77), we have j=1 βj y0, vj k 1vj + X s=1 y0, vjs ξj + j=1 y0, vj 2 j=1 a1jvj + X j b1jξj + c1ξ1, where a1j = βj u, vj k 1 for 1 j r, b1j = Qk 1 s=1 u, vjs for any index j = (j1, j2, , jk 1) 1 Pr j=1 u, vj 2 (k 1) . Since ξj are independent Gaussian vectors with each entry mean zero and variance 1/n, the concentration for chi-square distribution implies that ξj 2 = 1 + O(log n/ n) with probability 1 ec(log n)2. Since j = argmaxj |βj u, vj k 2|, combining with our Assumption 5, it gives that |a1j | |a1j|/κ. As a consequence, we also have that |a1| = p a2 11 + a2 12 + + a2 1r |a1j |. Again using our Assumption 5 j=1 | u, vj | j=1 | u, vj |k 1 |a1j |/βj |a1|/β1. We can check that |β1/(k 2) 1 a1| |β1/(k 2) j a1j | = |βj u, vj k 2|(k 1)/(k 2) nε nε|c1|, and | na1| | na1j | = |βj u, vj k 2|| n u, vj | nε nε|c1|. Moreover, conditioning on ξj = Z[vj1 vj2 vjk 1] for j = (j1, j2, , jk 1) [[1, r]]k 1, Lemma 12 implies that ξ1 = Z[τ0] is an independent Gaussian random vector with each entry N(0, 1/n). By the standard concentration inequality, it holds that with probability 1 ec(log n)2, ξ1 2 = 1 + O(log n/ n), | a, ξ1 | and the projection of ξ1 on the span of {v1, v2, , vr, {ξj}j [[1,r]]k 1} is bounded by log n/ n. So far we have proved that (36) and (37) hold for t = 1. In the following, we assume that (85) and (86) hold for t, and prove it for t + 1. We recall from (80) and (84) that a(t+1)j = βj(atj + bt wt, vj + ct ξt, vj )k 1, btwt = X j btjξj + bt1ξ1 + + btt 1ξt 1. Huang, Huang, Yang, Cheng By our induction hypothesis, we have |bt wt, vj | X j |btj ξj, vj | + |bt1 ξ1, vj | + + |bt(t 1) ξt 1, vj | (log n/ n)|at|/|β1|, |ct ξt, vj | (log n/ n)|ct| (log n)|at|/nε. (89) It follows from plugging (88) and (89) into (87), we get a(t+1)j = βj(atj + bt wt, vj + ct ξt, vj )k 1 = βj(atj + O(log n|at|/nε))k βj|at|k 1, and especially a(t+1)j = βj (atj + O(log n|atj |/nε))k = (1 + O(log n/nε))βj ak 1 tj . Therefore, we conclude that |at+1j | |βj ak 1 tj | |βak 1 t |, (90) |a(t+1)j| βj|at|k 1 βj |atj |k 1 |at+1j |. We recall from (81), P j b(t+1)jvj1 vj2 vjk 1+b(t+1)1τ0+b(t+1)2τ1+ b(t+1)tτt 1 is the projection of y (k 1) t on the span of {vj1 vj2 vjk 1}j [[1,r]]k 1, y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 . We also recall that {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt 1 are obtained from {vj1 vj2 vjk 1}j [[1,r]]k 1, , y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 by the Gram-Schmidt orthonormalization procedure. So we have that the span of vectors {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt 1 is the same as the span of the tensors {vj1 vj2 vjk 1}j [[1,r]]k 1, y (k 1) 0 , y (k 1) 1 , , y (k 1) t 1 , which is contained in the span of the tensors {v1, v2, , vr, wt, y0, , yt 1} (k 1). 
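For illustration, the first step of the auxiliary iteration (75) and the coefficients in (77) can be checked numerically. The short Python/NumPy sketch below builds a rank-two instance, forms y_1 = X[u^{⊗(k-1)}], and compares ⟨y_1, v_j⟩ with a_{1j} = β_j⟨u, v_j⟩^{k-1}; the dimensions, spike strengths, and the (deliberately tilted) initialization are illustrative choices, not values taken from the paper.

```python
import numpy as np

def rank_one(v, k):
    """The rank-one tensor v^{(k)} as a dense array."""
    T = v
    for _ in range(k - 1):
        T = np.multiply.outer(T, v)
    return T

def contract(T, u, k):
    """T[u^{(k-1)}]: contract the last k-1 modes of T with the vector u."""
    out = T
    for _ in range(k - 1):
        out = out @ u
    return out

rng = np.random.default_rng(0)
n, k = 60, 3
v1 = rng.standard_normal(n); v1 /= np.linalg.norm(v1)
v2 = rng.standard_normal(n); v2 -= (v2 @ v1) * v1; v2 /= np.linalg.norm(v2)
betas, vs = np.array([5.0, 3.5]) * np.sqrt(n), [v1, v2]

X = (sum(b * rank_one(v, k) for b, v in zip(betas, vs))
     + rng.standard_normal((n,) * k) / np.sqrt(n))       # Z with i.i.d. N(0, 1/n) entries

# an illustrative unit initialization with non-negligible overlap on both spikes
g = rng.standard_normal(n)
u = 0.6 * v1 + 0.5 * v2 + 0.6 * g / np.linalg.norm(g)
u /= np.linalg.norm(u)

y1 = contract(X, u, k)                                   # y_1 = X[u^{(k-1)}]
overlaps = np.array([u @ v for v in vs])
a1 = betas * overlaps ** (k - 1)                         # a_{1j} = beta_j <u, v_j>^{k-1}
print(a1)                                                # predicted signal coefficients
print(np.array([y1 @ v for v in vs]))                    # <y_1, v_j>: a_{1j} plus an O(1/sqrt(n)) noise overlap
print(int(np.argmax(np.abs(betas) * np.abs(overlaps) ** (k - 2))))  # the index j* favored by the iteration
```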
Moreover from the relation (79) and (80), one can see that the span of {v1, v2, , vr, wt, y0, , yt 1} is the same as the span of {v1, v2, , vr, {ξj}j [[1,r]]k 1, ξ1, , ξt 1}. It follows that j b2 (t+1)j + b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t = Proj Span{{vj1 vj2 vjk 1}j [[1,r]]k 1,τ0,τ1, ,τt 1}(atvt + btwt + ctξt) (k 1) 2 Proj Span{v1,v2, ,vr,wt,y0, ,yt 1} (k 1)(atvt + btwt + ctξt) (k 1) 2 Proj Span{v,wt,y0, ,yt 1}(atvt + btwt + ctξt) k 1 2 = atvt + btwt + ct Proj Span{v1,v2, ,vr,{ξj}j [[1,r]]k 1,ξ1, ,ξt 1}(ξt) k 1 2 |at| + |bt| + log n|ct| n k 1 |at|k 1 |at+1|/β1, Power Iteration for Tensor PCA where in the first line we used (83), and in the last line of (91) we used our induction hypothesis that Proj Span{v1,v2, ,vr,{ξj}j [[1,r]]k 1,ξ1, ,ξt 1}(ξt) 2 log n/ n. Finally we estimate ct+1. We recall from (81), the coefficient ct+1 is the remainder of y (k 1) t after projecting on {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt 1. It is bounded by the remainder of y (k 1) t after projecting on {vj1 vj2 vjk 1}j [[1,r]]k 1, |ct+1| y (k 1) t ak 1 t v (k 1) t 2 = (atvt + btwt + ctξt) (k 1) ak 1 t v (k 1) t 2. The difference (atvt + btwt + ctξt) (k 1) ak 1 t v (k 1) t is a sum of terms in the following form, η1 η2 ηk 1, (92) where η1, η2, , ηk 1 {atvt, btwt + ctξt}, and at least one of them is btwt + ctξt. We notice that by our induction hypothesis, btwt + ctξt 2 |bt| wt 2 + |ct| ξt 2 |bt| + |ct|. For the L2 norm of (92), each copy of atvt contributes at and each copy of btwt + ctξt contributes a factor |bt| + |ct|. We conclude that |ct+1| (atvt + btwt + ctξt) (k 1) ak 1 t v (k 1) t 2 r=1 |at|k 1 r(|bt| + |ct|)r. (93) Combining with (90) that |at+1| |β1||at|k 1, we divide both sides of (93) by |β1||at|k 1, |ct+1| |at+1| 1 |β1| |at| + |ct| |β1| + |ct| There are three cases: 1. If |ct|/|at| 1, then |ct+1| |at+1| 1 |β1| |β1| + |ct| If k = 2, then |ct+1|/|at+1| (|ct|/|at|)/nε. If k 2, by our induction hypothesis |ct|/|at| β1/(k 2) 1 /nε. Especially, (|ct|/|at|)k 2/|β1| 1/nε. We still get that |ct+1|/|at+1| (|ct|/|at|)/nε. 2. If 1/|β1| |ct|/|at| 1, then |ct+1| |at+1| 1 |β1| |β1| + |ct| 3. Finally for |ct|/|at| 1/|β1|, we will have |ct+1| |at+1| 1 |β1| |β1| + |ct| Huang, Huang, Yang, Cheng In all these cases we have |ct+1|/|at+1| min{ n, 1(k 3)|β1|1/(k 2)}/nε. This finishes the proof of the induction (85). For (86), since τt is orthogonal to {vj1 vj2 vjk 1}j [[1,r]]k 1, τ0, τ1, , τt 1, Lemma 12 implies that conditioning on ξj = Z[vj1 vj2 vjk 1] for index j = (j1, j2, , jk 1) [[1, r]]k 1 and ξs+1 = Z[τs] for 0 s t 1, ξt+1 = Z[τt] is an independent Gaussian vector, with each entry N(0, 1/n). By the standard concentration inequality, it holds that with probability 1 ec(log n)2, ξt+1 2 = 1 + O(log n/ n), | a, ξt+1 | and the projection of ξt+1 on the span of {v1, v2, , vr, {ξj}j [[1,r]]k 1, ξ1, , ξt 1} is bounded by log n/ n. This finishes the proof of the induction (86). Next, using (85) and (86) as input, we prove that for 2 + 2 log |β1| + log log( n|β1|) log(k 1) (95) j=1 atjvj + X j btjξj + bt1ξ1 + + btt 1ξt 1 + ctξt, (96) |atj| log n n 1 |β1| k 1 |atj |, j = j , bt(j ,j , ,j ) = atj βj + O log n|at| , |b(t+1)j | log n n|β1|2 |atj |, j = (j , j , , j ), |bt1|, |bt2|, , |bt(t 1)| (log n)1/2|at| |β1|3/2n1/4 , |ct| |at|/β2 1 Let xt = |ct/at| n ε|β|1/(k 2), and rt = maxj =j (β1/(k 2) j atj)/(β1/(k 2) j atj ). For t = 1, our Assumption 6 implies that β1/(k 2) j a1j (βj u, vj k 2)(k 1)/(k 2) ((1 1/κ)βj u, vj k 2)(k 1)/(k 2) (1 1/κ)β1/(k 2) j a1j . Thus we have that r1 (1 1/κ). 
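The estimate (93) relies only on the elementary fact that a term of the form (92) has Frobenius norm equal to the product of the norms of its factors, so that the difference of tensor powers is bounded by the binomial sum (‖x‖+‖e‖)^{k-1} − ‖x‖^{k-1}, which matches the display leading to (93) up to constants. A quick numerical sanity check of this inequality, with arbitrary illustrative dimensions:

```python
import numpy as np
from math import comb

def tensor_power(x, m):
    """x^{(m)} as a dense array."""
    T = x
    for _ in range(m - 1):
        T = np.multiply.outer(T, x)
    return T

rng = np.random.default_rng(1)
n, m = 10, 3                        # m plays the role of k - 1
x = rng.standard_normal(n)          # stands in for a_t * v_t
e = 0.2 * rng.standard_normal(n)    # stands in for b_t * w_t + c_t * xi_t

lhs = np.linalg.norm(tensor_power(x + e, m) - tensor_power(x, m))
a, b = np.linalg.norm(x), np.linalg.norm(e)
rhs = sum(comb(m, r) * a ** (m - r) * b ** r for r in range(1, m + 1))  # = (a+b)^m - a^m
print(lhs, rhs)                     # lhs <= rhs, the bound behind (93)
```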
We recall from (87) β1/(k 2) j a(t+1)j = β1/k 2 j (atj + bt wt, vj + ct ξt, vj ) k 1 = β1/k 2 j (atj + O |at|log n(1/|β1| + xt) n Power Iteration for Tensor PCA where we used (85) and (86). Thus it follows that rt+1 = max j =j β1/k 2 j (atj + O (|at| log n(1/|β1| + xt)/ n) β1/k 2 j (atj + O (|at| log n(1/|β1| + xt)/ n) rt + O (log n(1/|β1| + xt)/ n) 1 + O (log n(1/|β1| + xt)/ n) For xt, (94) implies xt+1 1 |β1| from the discussion after (94), we have that either xt+1 1/|β1|2, or xt+1 xt/nε. Since x1 = |c1/a1| n1/2 ε, and r1 (1 1/κ) we conclude from (98) and (99) that xt = |ct/at| 1/β2 1, rt (log n/(|β1| n))k 1, (100) 2 + 2 log |β1| + log log( n|β1|) To derive the upper bound of bt1, bt2, , bt(t 1), we use (91). j b2 (t+1)j + b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t atvt + btwt + ct Proj Span{v1,v2, ,vr,{ξj}j [[1,r]]k 1,ξ1, ,ξt 1}(ξt) 2(k 1) 2 |at| (|bt| + |ct|) log n n + |bt| + |ct|log n n where we used (86). The first term b(t+1)j is the projection of y (k 1) t on vj1 vj2 vjk 1, s=1 atvt + btwt + ctξt, vjs = atjs + O log n(|bt| + |ct|) n j b2 (t+1)j = s=1 | atvt + btwt + ctξt, vjs |2 !k |at| (|bt| + |ct|) log n n + |bt| + |ct|log n n Huang, Huang, Yang, Cheng where we used (37) that | ξj, vj |, | ξ1, vj |, , | ξt, vj | log n/ n. Now we can take difference of (101) and (103), and use that |bt| |at|/|β1| from (85) and |ct| |at|/|β1|2 from (100), b2 (t+1)1 + b2 (t+1)2 + + b2 (t+1)t a2(k 1) t log n |β| n. (104) Using (102) and (100), we get that b(t+1)j = ak 1 tj 1 + O log n n|β1| , j = (j , j , , j ) |b(t+1)j| log n n|β1||atj |k 1, j = j . (105) From (87), (90) and (100), we have that a(t+1)j = βj b(t+1)j = βj ak 1 tj 1 + O log n n|β1| |a(t+1)j| log n n|β1| k 1 |a(t+1)j |, j = j . (106) Using the above relation, we can simplify (104) and (105) as |b(t+1)1|, |b(t+1)2|, |b(t+1)t| (log n)1/2|at+1| |β1|3/2n1/4 . b(t+1)j = a(t+1)j 1 + O log n n|β1| |b(t+1)j| log n n|β1|2 |a(t+1)j |, j = j . This finishes the proof of (97). With the expression (97), we can process to prove our main results (11) and (12). Thanks to (86) and (96), for t satisfies (95), we have that with probability at least 1 O(e c(log n)2) yt 2 2 = a2 tj β2 j + 2 vj , ξj log n n|β1| k 1 + log n β2 1 n + (log n)3/2 |β1|3/2n3/4 + 1 where j = (j , j , , j ). By rearranging it we get 1/ yt 2 = a2 tj 1 1 2β2 j 2 vj , ξj log n n|β1| k 1 + log n β2 1 n + (log n)3/2 |β1|3/2n3/4 + 1 Power Iteration for Tensor PCA We can take the inner product a, yt , and multiply (108) a, ut = a, yt yt 2 = sgn(atj ) a, vj + a, ξj a, vj vj , ξj log n n|β1| k 1 + log n β2 1 n + (log n)3/2 |β1|3/2n3/4 + 1 where we used (37) that with high probability | a, ξj |, | a, ξs | for 1 s t are bounded by log n/ n. This finishes the proof of (11). For bβ in (12), we have that X[u k t ] = X[y k t ] yt k 2 = yt, X[y (k 1) t ] yt k 2 = yt, yt+1 yt k 2 . (109) Thanks to (100), (106) and (86), for t satisfies (95), with probability at least 1 O(e c(log n)2), we can write the first term on the righthand side of (57), we have j a(t+1)jvj + X j b(t+1)jξj + b(t+1)1ξ1 + + b(t+1)tξt + ct+1ξt+1, where |ct+1| |at|k 1/β2, a(t+1)j = βj b(t+1)j = βj ak 1 tj 1 + O log n n|β1| |a(t+1)j| log n n|β1| k 1 |a(t+1)j |, j = j b(t+1)j = a(t+1)j 1 + O log n n|β1| |b(t+1)j| log n n|β1|2 |a(t+1)j |, j = j , |b(t+1)1|, |b(t+1)2| + + |b(t+1)t| ak 1 t (log n)1/2 |β|1/2n1/4 . 
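To see the contraction (98)–(100) in action, one can track the ratio r_t along the iteration. The sketch below runs the (rescaled) iteration y_{t+1} = X[y_t^{⊗(k-1)}] on a rank-two instance and prints r_t, using ⟨y_t, v_j⟩ as a proxy for the coefficient a_{tj} (the two agree up to the small w_t and ξ_t overlaps controlled in (85)–(86)); all numerical choices, including an initialization with a well-separated leading index in the spirit of Assumption 6, are illustrative.

```python
import numpy as np

def rank_one(v, k):
    T = v
    for _ in range(k - 1):
        T = np.multiply.outer(T, v)
    return T

def contract(T, u, k):
    out = T
    for _ in range(k - 1):
        out = out @ u
    return out

rng = np.random.default_rng(2)
n, k, t_max = 60, 3, 6
v1 = rng.standard_normal(n); v1 /= np.linalg.norm(v1)
v2 = rng.standard_normal(n); v2 -= (v2 @ v1) * v1; v2 /= np.linalg.norm(v2)
betas, vs = np.array([5.0, 3.5]) * np.sqrt(n), [v1, v2]
X = sum(b * rank_one(v, k) for b, v in zip(betas, vs)) + rng.standard_normal((n,) * k) / np.sqrt(n)

# initialization whose leading index is well separated, in the spirit of Assumption 6
g = rng.standard_normal(n)
u = 0.8 * v1 + 0.4 * v2 + 0.4 * g / np.linalg.norm(g); u /= np.linalg.norm(u)
weights0 = np.abs(betas) ** (1.0 / (k - 2)) * np.abs([u @ v for v in vs])
j_star = int(np.argmax(weights0))

y = u.copy()
for t in range(1, t_max + 1):
    y = contract(X, y / np.linalg.norm(y), k)      # rescaling y_t changes only its length, not the direction u_t
    a_proxy = np.array([y @ v for v in vs])        # <y_t, v_j>, a proxy for the coefficient a_{tj}
    w = np.abs(betas) ** (1.0 / (k - 2)) * np.abs(a_proxy)
    r_t = max(w[j] for j in range(len(vs)) if j != j_star) / w[j_star]
    print(t, r_t)                                  # r_t should collapse rapidly toward a small noise floor
```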
From the discussion above, combining with (96) and (97) with straightforward computation, we have yt, yt+1 = βj ak tj β2 j + (k + 1) ξj , vj log n n|β1| k 1 + log n β2 1 n + (log n)3/2 |β1|3/2n3/4 By plugging (108) and (110) into (109), we get X[u k t ] = sgn(ak tj ) βj + ξj , vj k/2 1 log n n|β1| k 1 + log n |β1| n + (log n)3/2 |β1|1/2n3/4 + 1 |β1|3 Huang, Huang, Yang, Cheng Since by our assumption, in Case 1 we have that βj > 0. Thanks to (106) at+1j = βak 1 tj (1 + o(1)), especially at+1j and atj are of the same sign. In the case u, vj > 0, we have a1j = β u, vj k 1 > 0. We conclude that atj > 0. Therefore sgn(X[u k t ]) = sgn(atj )k = +, and it follows that X[u k t ] = βj + ξj , vj k/2 1 log n n|β1| k 1 + log n |β1| n + (log n)3/2 |β1|1/2n3/4 + 1 |β1|3 This finishes the proof of (12). The Cases 2, 3, 4, by simply changing (βj , vj ) in the righthand side of (11) and (12) to the corresponding limit. A.4 Proof of Theorem 9 Proof [Proof of Theorem 9] We first prove (15). If u is uniformly distributed over the unit sphere, then it has the same law as η/ η 2, where η is an n-dim standard Gaussian vector, with each entry N(0, 1). With this notation |βj u, vj k 2| = |βj η, vj k 2|/ η k 2 2 , (111) and we can rewrite P(i = argmaxj |βj u, vj k 2|) as P(i = argmaxj |βj u, vj k 2|) = P(i = argmaxj |βj η, vj k 2|). Since v1, v2, , vr are orthonormal vectors, v1, η , v2, η , , vr, η are independent standard Gaussian random variables. Then we have pi = P(i = argmaxj |βj η, vj k 2|) = P(|βi/βℓ|1/k 2 η, vi | | η, vℓ |, for all i = ℓ) |βℓ| 1 k 2 x 2 πe y2/2dy This gives (15). Using the fact we can rewrite u as η/ η 2, we have that with probability 1 O(1/ κ), 1/ κn | u, vi | p for all 1 i r. Thus Assumption (5) holds, and especially, max j |βj u, vj k 2| |β1 u, v1 k 2| |β1(1/ κn)k 2| nε. Theorem 9 then follows directly from Theorem 7. Power Iteration for Tensor PCA A.5 Proof of Corollaries 8, 10 and 11 Proof [Proof of Corollary 8] According to the definition of ξ in (11) of Theorem 7, i.e. ξ = Z[v (k 1) j ], is an n-dim vector, with each entry i.i.d. N(0, 1/n) Gaussian random variable. We see that ξ, v d= N (0, 1/n) . Especially with high probability we will have that | ξ, v | log n/ n. Then we conclude from (12), with high probability it holds bβ = βj + O 1 βj + log n n With the bound (112), we can replace a, v /(2β2) on the righthand side of (11) by a, v /(2bβ2), which gives an error a, v = O | a, v | 1 |βj |4 + log n |βj |3 n Combining the above discussion together, we can rewrite (11) as a, v = a, ξ a, vj vj , ξ log n n|β1| k 1 + log n |β1|2 n + (log n)3/2 |β1|3/2n3/4 + 1 |β1|4 with high probability, where we used that |βj | |β1|. Again thanks to the definition of ξ in (11) of Theorem 1, i.e. ξ = Z[v (k 1)], is an n-dim vector, with each entry i.i.d. N(0, 1/n) Gaussian random variable, we see that a, ξ a, vj vj , ξ = a a, vj vj , ξ , is a Gaussian random variable, with mean zero and variance E[ a a, vj vj , ξ 2] = 1 n a a, vj vj 2 2 = 1 n a, (In vj v j )a n a, (In bvj bv j )a . This together with (112), (113) as well as our assumption (13) a, (In bvbv )a 2bβ2 1 a, bv a, vj d N(0, 1). (114) Under the same assumption, we have similar results for Cases 2, 3, 4, by simply changing (βj , vj ) in the righthand side of (4) and (5) to the corresponding expression. Huang, Huang, Yang, Cheng Proof [Proof of Corollary 10] For k 3 and |β1| n(k 2)/2+ε, the assumption 13 holds trivially. The claim (18) follows from (14). 
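In the proof of Theorem 9 above, the selection probabilities p_i depend on the spikes only through the i.i.d. standard Gaussians ⟨η, v_j⟩, so p_i = P(|β_i/β_ℓ|^{1/(k-2)} |⟨η, v_i⟩| ≥ |⟨η, v_ℓ⟩| for all ℓ ≠ i) can be evaluated either by direct Monte Carlo or by a one-dimensional integral of the form displayed in (15). The sketch below does both and checks that they agree; the spike strengths are illustrative (for k = 3 the exponent 1/(k-2) equals 1).

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
k = 3
betas = np.array([3.0, 2.0, 1.0])   # illustrative spike strengths
r = len(betas)

# Monte Carlo: <eta, v_j> are i.i.d. N(0,1) because the v_j are orthonormal.
G = rng.standard_normal((200_000, r))
winners = np.argmax(np.abs(betas) * np.abs(G) ** (k - 2), axis=1)
p_mc = np.bincount(winners, minlength=r) / len(G)

# Quadrature: condition on |<eta, v_i>| = x (folded-normal density) and use
# P(|g| <= c x) = erf(c x / sqrt(2)) for each competing coordinate.
x = np.linspace(0.0, 10.0, 20_001)
dx = x[1] - x[0]
folded = np.sqrt(2.0 / np.pi) * np.exp(-x ** 2 / 2.0)
p_quad = []
for i in range(r):
    prod = np.ones_like(x)
    for l in range(r):
        if l == i:
            continue
        c = abs(betas[i] / betas[l]) ** (1.0 / (k - 2))
        prod *= np.vectorize(erf)(c * x / sqrt(2.0))
    p_quad.append(float(np.sum(prod * folded) * dx))
print(p_mc)
print(np.array(p_quad))             # the two estimates of (p_1, ..., p_r) should agree closely
```

That several p_i are strictly positive is what produces a mixture of Gaussian limits, rather than a single Gaussian, for the estimators under random initialization.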
For (19), we recall that in (17), $\xi = Z[v_i^{\otimes(k-1)}]$ is an $n$-dimensional vector whose entries are i.i.d. $N(0, 1/n)$ Gaussian random variables. We see that $\langle \xi, v_i \rangle \stackrel{d}{=} N(0, 1/n)$. In particular, with high probability $|\langle \xi, v_i \rangle| \lesssim \log n/\sqrt{n}$. Then we conclude from (17) that with high probability

$\hat\beta = \beta_i + O\!\left(\frac{1}{\beta_i} + \frac{\log n}{\sqrt{n}}\right).$ (115)

With the bound (115), we can replace $(k/2-1)/\beta_i$ on the right-hand side of (17) by $(k/2-1)/\hat\beta$, which gives an error

$\frac{k/2-1}{\beta_i} - \frac{k/2-1}{\hat\beta} = O\!\left(\frac{1}{|\beta_1|^2}\left(\frac{1}{|\beta_1|} + \frac{\log n}{\sqrt{n}}\right)\right),$

where we used that $|\beta_i| \asymp |\beta_1|$. Combining the above discussion, we can rewrite (17) as

$\beta_i = \hat\beta + \frac{k/2-1}{\hat\beta} - \langle \xi, v_i \rangle + O_P\!\left(\left(\frac{\log n}{\sqrt{n}\,|\beta_1|}\right)^{k-1} + \frac{\log n}{|\beta_1|\sqrt{n}} + \frac{(\log n)^{3/2}}{|\beta_1|^{1/2} n^{3/4}} + \frac{1}{|\beta_1|^2}\right).$ (116)

Since $\langle \xi, v_i \rangle \stackrel{d}{=} N(0, 1/n)$ and the error term in (116) is much smaller than $1/\sqrt{n}$, we conclude from (116) that

$\sqrt{n}\left(\beta_i - \hat\beta - \frac{k/2-1}{\hat\beta}\right) \xrightarrow{d} N(0,1).$

This finishes the proof of (19).

Proof [Proof of Corollary 11] Given the significance level $\alpha$, the asymptotic confidence intervals in Corollary 11 can be calculated from Corollary 10 by bounding the absolute values of the left-hand sides of (18) and (19) at $z_\alpha$.
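As a concrete illustration of Corollaries 10 and 11, the following sketch forms the bias-corrected estimate $\hat\beta + (k/2-1)/\hat\beta$ and the corresponding two-sided interval of half-width $z_\alpha/\sqrt{n}$ from a single power-iteration run on a rank-one instance. The dimension, spike strength, and iteration count are illustrative, and the interval uses the normalization appearing in (19); this is a sketch under those assumptions rather than a reference implementation.

```python
import numpy as np
from statistics import NormalDist

def rank_one(v, k):
    T = v
    for _ in range(k - 1):
        T = np.multiply.outer(T, v)
    return T

def contract(T, u, k):
    out = T
    for _ in range(k - 1):
        out = out @ u
    return out

rng = np.random.default_rng(4)
n, k, alpha = 80, 3, 0.05
beta = 5.0 * np.sqrt(n)                        # comfortably above the n^{(k-2)/2} scale
v = rng.standard_normal(n); v /= np.linalg.norm(v)
X = beta * rank_one(v, k) + rng.standard_normal((n,) * k) / np.sqrt(n)

# power iteration, then the plug-in spike-strength estimator beta_hat = X[u^{(k)}]
u = rng.standard_normal(n); u /= np.linalg.norm(u)
for _ in range(10):
    y = contract(X, u, k); u = y / np.linalg.norm(y)
beta_hat = contract(X, u, k) @ u

# bias-corrected estimate and interval, assuming the normalization of (19):
# sqrt(n) * (beta - beta_hat - (k/2 - 1)/beta_hat) is asymptotically N(0, 1)
center = beta_hat + (k / 2 - 1) / beta_hat
z = NormalDist().inv_cdf(1 - alpha / 2)        # two-sided normal critical value
ci = (center - z / np.sqrt(n), center + z / np.sqrt(n))
print(beta, beta_hat, ci)
```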