The Complexity of Sparse Tensor PCA

Davin Choo*
Department of Computer Science, National University of Singapore
davin@u.nus.edu

Tommaso d'Orsi
Department of Computer Science, ETH Zürich
tommaso.dorsi@inf.ethz.ch

Abstract
We study the problem of sparse tensor principal component analysis: given a tensor $Y = W + \lambda x^{\otimes p}$ with $W \in (\mathbb{R}^n)^{\otimes p}$ having i.i.d. Gaussian entries, the goal is to recover the $k$-sparse unit vector $x \in \mathbb{R}^n$. The model captures both sparse PCA (in its Wigner form) and tensor PCA. For the highly sparse regime of $k \leq \sqrt{n}$, we present a family of algorithms that smoothly interpolates between a simple polynomial-time algorithm and the exponential-time exhaustive search algorithm. For any $1 \leq t \leq k$, our algorithm recovers the sparse vector for signal-to-noise ratio $\lambda \geq \tilde{O}\big(\sqrt{t}\,(k/t)^{p/2}\big)$ in time $\tilde{O}(n^{p+t})$, capturing the state-of-the-art guarantees for the matrix settings (in both the polynomial-time and the sub-exponential-time regimes). Our results naturally extend to the case of $r$ distinct $k$-sparse signals with disjoint supports, with guarantees that are independent of the number of spikes. Even in the restricted case of sparse PCA, known algorithms only recover the sparse vectors for $\lambda \geq \tilde{O}(k \cdot r)$, while our algorithms require $\lambda \geq \tilde{O}(k)$. Finally, by analyzing the low-degree likelihood ratio, we complement these algorithmic results with rigorous evidence illustrating the trade-offs between signal-to-noise ratio and running time. This lower bound captures the known lower bounds for both sparse PCA and tensor PCA. In this general model, we observe a more intricate three-way trade-off between the number of samples $n$, the sparsity $k$, and the tensor power $p$.

1 Introduction
Sparse tensor principal component analysis is a statistical primitive generalizing both sparse PCA² and tensor PCA³. We are given multi-linear measurements in the form of a tensor
$$Y = W + \lambda x^{\otimes p} \in (\mathbb{R}^n)^{\otimes p} \tag{SSTM}$$
for a Gaussian noise tensor $W \in (\mathbb{R}^n)^{\otimes p}$ containing i.i.d. $N(0,1)$ entries⁴ and signal-to-noise ratio $\lambda > 0$. Our goal is to estimate the structured unit vector $x \in \mathbb{R}^n$. The structure we enforce on $x$ is sparsity: $|\mathrm{supp}(x)| \leq k$. The model can be extended to include multiple spikes in a natural way, $Y = W + \sum_{q=1}^{r} \lambda_q x_{(q)}^{\otimes p}$, and even to general order-$p$ tensors, $Y = W + \sum_{q=1}^{r} \lambda_q X_{(q)}$ for $X_{(q)} = x_{(q,1)} \otimes \cdots \otimes x_{(q,p)} \in (\mathbb{R}^n)^{\otimes p}$. In this introduction, we focus on the simplest single-spike setting of SSTM. It is easy to see that sparse PCA corresponds to the setting with tensor order $p = 2$. On the other hand, tensor PCA is captured by effectively removing the sparsity constraint: $|\mathrm{supp}(x)| \leq n$. In recent years, two parallel lines of work focused respectively on sparse PCA [JL09, AW08, BR13a, DM16, HKP+17, DKWB19, HSV20, dKNS20] and tensor PCA [MR14, HSS15, MSS16, HKP+17, KWB19, AMMN19]; however, no result captures both settings.

* Part of the work was done while the author was at ETH Zürich.
² Often in the literature, the terms sparse PCA and spiked covariance model refer to the sparse spiked Wishart model. However, here we consider the sparse spiked Wigner matrix model.
³ Tensor PCA is also known as the spiked Wigner tensor model, or simply the spiked tensor model.
⁴ Throughout the paper, we will write random variables in boldface.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
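To fix ideas, the following is a minimal NumPy sketch of how one might sample an instance of the single-spike model (SSTM) above with a flat $k$-sparse signal; the function name and parameter choices are illustrative and are not taken from the paper.

```python
import numpy as np

def sstm_instance(n, k, p, lam, seed=None):
    """Sample Y = W + lam * x^{(tensor p)} with a k-sparse flat unit vector x."""
    rng = np.random.default_rng(seed)
    # k-sparse flat signal: k random coordinates of magnitude 1/sqrt(k), random signs.
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    # Build the rank-one order-p spike x (x) x (x) ... (x) x.
    spike = x
    for _ in range(p - 1):
        spike = np.multiply.outer(spike, x)
    # i.i.d. N(0, 1) noise tensor W.
    W = rng.standard_normal((n,) * p)
    return W + lam * spike, x

# Example: a small order-3 instance.
Y, x = sstm_instance(n=30, k=5, p=3, lam=25.0, seed=0)
```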
The appeal of the sparse spiked tensor model (henceforth SSTM) is that it allows one to study the computational and statistical aspects of these other fundamental statistical primitives in a unified framework, understanding the computational phenomena at play from a more general perspective. In this work, we investigate SSTM from both algorithmic and computational hardness perspectives. Our algorithm improves over known tensor algorithms whenever the signal vector is highly sparse. We also present a lower bound against low-degree polynomials which extends the known lower bounds for both sparse PCA and tensor PCA, leading to a more intricate understanding of how all three parameters ($n$, $k$ and $p$) interact.

1.1 Related work
Disregarding computational efficiency, it is easy to see that optimal statistical guarantees can be achieved with a simple exhaustive search (corresponding to the maximum likelihood estimator): find a $k$-sparse unit vector maximizing $\langle Y, x^{\otimes p}\rangle$. This algorithm returns a $k$-sparse unit vector $\hat{x}$ achieving constant squared correlation⁵ with the signal $x$ as soon as $\lambda \gtrsim \sqrt{k \log(np/k)}$, that is, whenever $\lambda \gtrsim \max_{\|x\|=1,\,\|x\|_0=k} \langle W, x^{\otimes p}\rangle$. Unfortunately, this approach runs in time exponential in $k$ and takes super-polynomial time when $p \ll k$.⁶ As such, we assume $p \leq k$ from now on.
Taking into account computational aspects, the picture changes. A good starting point to draw intuition for SSTM is the literature on sparse PCA and tensor PCA. We briefly outline some known results here. To simplify the discussion, we hide absolute constant multiplicative factors using $O(\cdot)$, $\Omega(\cdot)$, $\lesssim$, and $\gtrsim$, and hide multiplicative factors logarithmic in $n$ using $\tilde{O}(\cdot)$.

1.1.1 Sparse PCA (Wigner noise)
Sparse PCA with Wigner noise exhibits a sharp phase transition in the top eigenvalue of $Y$ at $\lambda \approx \sqrt{n}$ [FP07]. In this strong signal regime, the top eigenvector⁷ $v$ of $Y$ correlates⁸ with $x$ with high probability, so the following spectral method achieves the same guarantees as the exhaustive search suggested above: compute a leading eigenvector of $Y$ and restrict it to the top $k$ largest entries in absolute value. Conversely, when $\lambda \ll \sqrt{n}$, the top eigenvector of $Y$ does not correlate with the signal $x$. In this weak signal regime, [JL09] proposed a simple algorithm known as diagonal thresholding: compute the top eigenvector of the principal submatrix defined by the $k$ largest diagonal entries of $Y$. This algorithm recovers the sparse direction when $\lambda \geq \tilde{O}(k)$, thus requiring almost an additional $\sqrt{k}$ factor when compared to inefficient algorithms. More refined polynomial-time algorithms (low-degree polynomials [dKNS20], covariance thresholding [DM16] and the basic SDP relaxation [dGJL07, dKNS20]) only improve over diagonal thresholding by a logarithmic factor in the regime $n^{1-o(1)} \leq k^2 \leq n$. Interestingly, multiple results suggest that this information-computation gap is inherent to the sparse PCA problem [BR13a, BR13b, DKWB19, dKNS20]. Subexponential-time algorithms and lower bounds have also been shown. For instance, [DKWB19, HSV20] presented smooth trade-offs between signal strength and running time.⁹

⁵ One could also aim to find a unit vector with correlation approaching one or, in the restricted setting of $x \in \{0, 1/\sqrt{k}\}^n$, aim to recover the support of $x$. At the coarseness of our discussion here, these goals could be considered mostly equivalent.
⁶ Note that the problem input is of size $n^p$. So when $p \geq k$, exhaustive search takes $n^{O(p)}$ time, which is polynomial in $n^p$. Thus, the interesting parameter regimes occur when $p \leq k$.
⁷ By "top eigenvector" or "leading eigenvector", we mean the eigenvector corresponding to the largest (in absolute value) eigenvalue.
⁸ More precisely, the vector consisting of the $k$ largest (in absolute value) entries of $v$ correlates with $x$.
⁹ Both works studied the single-spike matrix setting. [HSV20] only considers the Wishart noise model and thus its guarantees cannot be compared to ours. [DKWB19] studied both the Wishart and Wigner noise models. In the Wishart noise model setting, both [HSV20] and [DKWB19] observe the same trade-off between running time and signal-to-noise ratio. In the Wigner noise model setting, our algorithm and the algorithm of [DKWB19] offer the same smooth trade-off between running time and signal strength, up to universal constants.
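Before moving to the tensor setting, here is a minimal NumPy sketch of the diagonal thresholding procedure described above, for the Wigner matrix case $p = 2$; the function name is illustrative and this is not code from the paper.

```python
import numpy as np

def diagonal_thresholding(Y, k):
    """Diagonal thresholding sketch for the Wigner model Y = lam * x x^T + W (p = 2).

    Keep the k coordinates with the largest diagonal entries of Y, then take the
    leading eigenvector of the corresponding k-by-k principal submatrix.
    """
    n = Y.shape[0]
    S = np.argsort(np.diag(Y))[-k:]               # indices of the k largest diagonal entries
    sub = Y[np.ix_(S, S)]                         # k-by-k principal submatrix
    eigvals, eigvecs = np.linalg.eigh(sub)
    v = eigvecs[:, np.argmax(np.abs(eigvals))]    # eigenvector of the largest |eigenvalue|
    xhat = np.zeros(n)
    xhat[S] = v
    return xhat / np.linalg.norm(xhat)
```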
1.1.2 Tensor PCA
In tensor settings, computing $\max_{\|x\|=1} \langle Y, x^{\otimes p}\rangle$ is NP-hard already for $p = 3$ [HL13]. For even tensor powers $p$, one can unfold the tensor $Y$ into an $n^{p/2}$-by-$n^{p/2}$ matrix and solve for the top eigenvector [MR14]. However, this approach is sub-optimal for odd tensor powers. For general tensor powers $p$, a successful strategy to tackle tensor PCA has been the use of semidefinite programming [HSS15, BGL16, HSS19]. Spectral algorithms inspired by the insight of these convex relaxations have also been successfully applied to the problem [SS17]. These methods succeed in recovering the single spike $x$ when $\lambda \geq \tilde{O}(n^{p/4})$, thus exhibiting a large gap when compared to exhaustive search algorithms. Matching lower bounds have been shown for constant degrees in the Sum-of-Squares hierarchy [BGL16, HKP+17] and through average-case reductions [BB20].

1.1.3 Sparsity-exploiting algorithms and tensor algorithms
It is natural to ask how the characteristics of sparse PCA and tensor PCA extend to the more general setting of SSTM. In particular, there are two main observations to be made.
The first observation concerns the sharp computational transition that we see for $k \ll \sqrt{n}$ in sparse PCA. In these highly sparse settings, the top eigenvector of $Y$ does not correlate with the signal $x$, and so algorithms primarily based on spectral methods fail to recover it. Indeed, the best known guarantees are achieved through algorithms that crucially exploit the sparsity of the hidden signal. These algorithms require the signal strength to satisfy $\lambda \geq \tilde{O}(k)$, with only a logarithmic dependency on the ambient dimension. To exemplify this to an extreme, notice how the following algorithm can recover the support of $xx^{\mathsf{T}}$ with the same guarantees as diagonal thresholding, essentially disregarding the matrix structure of the data: zero out all but the $k^2$ largest (in absolute value) entries of $Y$. A natural question to ask is whether a similar phenomenon may happen for higher-order tensors. In the highly sparse settings where $k \ll \sqrt{n}$, can we obtain better algorithms exploiting the sparsity of the hidden vector? Recently, a partial answer appeared in [LZ20] with a polynomial-time algorithm recovering the hidden signal for $\lambda \geq \tilde{O}(\sqrt{p}\, k^{p/2})$, albeit with a suboptimal dependency on the tensor order $p$.
The second observation concerns the computational-statistical gap in the spiked tensor model. As $p$ grows, the gap between efficient algorithms and exhaustive search widens, with the polynomial-time algorithms requiring signal strength $\lambda \geq \tilde{O}(n^{p/4})$ while exhaustive search succeeds when $\lambda \geq \tilde{O}(\sqrt{n})$ [MR14]. The question here is: how strong is the dependency on $p$ for efficient algorithms in sparse signal settings? In this work, we investigate these questions in the high-order tensor regime $p \in \omega(1)$.
We present a family of algorithms with a smooth trade-off between running time and signal-to-noise ratio. Even restricting to polynomial-time settings, our algorithms improve over previous results. Furthermore, through the lens of low-degree polynomials, we provide rigorous evidence of an exponential gap in the tensor order $p$ between algorithms and lower bounds.

Remark. The planted sparse densest sub-hypergraph model [CPMB19, BCPS20, CPSB20] is closely related to SSTM. We discuss this model in Appendix E.

1.2 Results
1.2.1 Single spike setting
Consider first the restricted, but representative, case where the planted signal is a $(k, A)$-sparse unit vector: a unit vector with $k$ non-zero entries having magnitudes in the range $\left[\frac{1}{A\sqrt{k}}, \frac{A}{\sqrt{k}}\right]$ for some constant $A \geq 1$. We say that the signal is flat when $A = 1$ and approximately flat when $A \geq 1$. Our first result is a limited brute force algorithm (informally, an algorithm that smoothly interpolates between some brute force approach and some simple polynomial-time algorithm) that exactly recovers the support of the planted signal.¹⁰

Theorem 1 (Algorithm for single spike sparse tensor PCA, Informal). Let $A \geq 1$ be a constant. Consider the observation tensor $Y = W + \lambda x^{\otimes p}$, where the additive noise tensor $W \in (\mathbb{R}^n)^{\otimes p}$ contains i.i.d. $N(0,1)$ entries and the signal $x \in \mathbb{R}^n$ is a $(k, A)$-sparse unit vector with signal strength $\lambda > 0$. Let $1 \leq t \leq k$ be an integer. Suppose that $\lambda \geq \tilde{O}\big(\sqrt{t}\,(k/t)^{p/2}\big)$ (the precise condition, including its dependence on $A$, is stated in Appendix B). Then, there exists an algorithm that runs in $O(p\, n^{p+t})$ time and, with probability 0.99, outputs the support of $x$.

¹⁰ A similar algorithm was analyzed by [DKWB19] for the special case of $p = 2$ and $r = 1$.

Let us first consider Theorem 1 in its simplest setting where $A = 1$ and $t$ is a fixed constant. For $k \leq \sqrt{n}$, the theorem succeeds when $\lambda \geq \tilde{O}(k^{p/2})$, thus improving over the guarantees of known tensor PCA methods, which require $\lambda \geq \tilde{O}(n^{p/4})$. In addition, since support recovery is exact, one can obtain a good estimate¹¹ of the planted signal by running any known tensor PCA algorithm on the subtensor corresponding to its support. Indeed, the resulting subtensor will be of significantly smaller dimension, and the requirement on the signal strength needed by single-spike tensor PCA algorithms is weaker than the requirement we impose on $\lambda$ (see Remark 9 for details). As a result, our algorithm recovers the guarantees of diagonal thresholding in the matrix ($p = 2$) setting. Our polynomial-time algorithm also improves over the result of [LZ20], which required $\lambda \gtrsim \sqrt{p\, k^{p} \log n}$, by removing the polynomial dependency on the tensor order $p$ in the signal strength $\lambda$.¹²
Consider now the limited brute force parameter $t$. From the introductory exposition, we know that one can obtain a statistically optimal algorithm by performing a brute force search over the space of $k$-sparse flat vectors in $\mathbb{R}^n$. The limited brute force algorithm is a natural extension that takes into account computational constraints by searching over the smaller set of $t$-sparse flat vectors, for $1 \leq t \leq k$, to maximize $\langle Y, u^{\otimes p}\rangle$. The parametric nature of the algorithm captures both the brute force search algorithm (when $t = k$) and the idea of diagonal thresholding (when $t = 1$ and $p = 2$). As long as $t \leq k$, using a larger $t$ represents a direct trade-off between running time and the signal-to-noise ratio. Extending the result to approximately flat vectors, the dependency on $A$ in the term $(2A^2)^{p/2}$ can be removed by increasing the computational budget to some value $t' \approx 2A^2 t$.

1.2.2 Multiple spikes
Theorem 2 (Algorithm for multi-spike sparse tensor PCA, Informal). Let $A \geq 1$ be a constant.
Consider the observation tensor
$$Y = W + \sum_{q=1}^{r} \lambda_q\, x_{(q)}^{\otimes p},$$
where the additive noise tensor $W \in (\mathbb{R}^n)^{\otimes p}$ contains i.i.d. $N(0,1)$ entries and the signals $x_{(1)}, \ldots, x_{(r)} \in \mathbb{R}^n$ are $(k, A)$-sparse unit vectors with disjoint supports and corresponding signal strengths $\lambda_1 \geq \ldots \geq \lambda_r > 0$. Let $1 \leq t \leq k$ be an integer and $0 < \epsilon \leq 1/2$. Suppose that $\lambda_r$ exceeds a threshold of order $\tilde{O}\big(\sqrt{t}\,(k/t)^{p/2}\big)$ (as in Theorem 1; see Appendix B for the precise condition) and that $\lambda_r \geq A^{2p} (2\epsilon)^{p-1} \lambda_1$. Then, there exists an algorithm that runs in $O(r\, p\, n^{p+t})$ time and, with probability 0.99, outputs the individual signal supports of $x_{(\pi(1))}, \ldots, x_{(\pi(r))}$ for some unknown bijection $\pi : [r] \to [r]$.

Theorem 2 requires two assumptions on the signals: (1) the signals have disjoint supports; (2) there is a bounded signal strength gap $\lambda_r \geq A^{2p} (2\epsilon)^{p-1} \lambda_1$. In the context of sparse PCA, algorithms that recover multiple spikes (e.g. [JL09, DM16]) only require the sparse vectors to be orthogonal. However, their guarantees are of the form $\lambda_r \geq \tilde{O}\big(\big|\bigcup_{q \in [r]} \mathrm{supp}(x_{(q)})\big|\big)$. That is, when the $r$ signals have disjoint supports, they require the smallest signal to satisfy $\lambda_r \geq \tilde{O}(k \cdot r)$. In comparison, already for constant $t$, Theorem 2 successfully recovers the supports when $\lambda_r \geq \tilde{O}(k)$, thus removing the dependency on the number of signals and improving the bound by a $1/r$ factor.¹³ Meanwhile, the bounded signal strength gap assumption is a common identifiability assumption (e.g. see [CMW13, DM16]). We remark that Theorem 2 provides a trade-off between this signal strength gap assumption and the signal strengths: we can recover the supports with a smaller gap if the signal strengths are increased proportionally; increasing $\lambda_r$ by a multiplicative factor $\alpha$ enables the algorithm to succeed with a gap that is smaller by a multiplicative factor of $1/\alpha$. As an immediate consequence, we also obtain a trade-off between the gap assumption and running time: every time we double $t$ (while ensuring $1 \leq t \leq k$), the requirement on $\lambda_r$ is lowered by a factor of $(1/\sqrt{2})^{p-1}$, and thus the algorithm can succeed with a smaller gap. Finally, as in the single spike case, the exact support recovery allows us to obtain a good estimate of each signal by running known tensor PCA algorithms.

¹¹ Recovery is up to a global sign flip since $\langle u, x\rangle^p = \langle -u, x\rangle^p$ for even tensor powers $p$.
¹² The result of [LZ20] extends to the settings where $Y = W + \lambda X$ for an approximately flat tensor $X \in (\mathbb{R}^n)^{\otimes p}$. Both Theorem 1 and Theorem 2 can also be extended to these settings (see Appendix B.2).

Remark. We remark that these results can be extended in a natural way to the general tensor settings $Y = W + \sum_{q=1}^{r} \lambda_q X_{(q)}$, where $X_{(q)} = x_{(q,1)} \otimes \cdots \otimes x_{(q,p)} \in (\mathbb{R}^n)^{\otimes p}$ for $q \in [r]$. See Appendix B.2.
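For concreteness, here is a minimal NumPy sketch of sampling from the multi-spike model of Theorem 2 (flat signals on disjoint supports); the function name and parameter choices are illustrative and are not taken from the paper.

```python
import numpy as np

def multi_spike_instance(n, k, p, lams, seed=None):
    """Sample Y = W + sum_q lam_q * x_(q)^{(tensor p)} with disjoint flat k-sparse spikes."""
    rng = np.random.default_rng(seed)
    r = len(lams)
    assert r * k <= n, "disjoint supports require r * k <= n"
    coords = rng.permutation(n)[: r * k]          # r disjoint supports of size k
    Y = rng.standard_normal((n,) * p)             # noise tensor W
    signals = []
    for q in range(r):
        x = np.zeros(n)
        x[coords[q * k:(q + 1) * k]] = 1.0 / np.sqrt(k)   # flat signal on its own support
        spike = x
        for _ in range(p - 1):
            spike = np.multiply.outer(spike, x)
        Y = Y + lams[q] * spike
        signals.append(x)
    return Y, signals
```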
1.2.3 An exponential gap between lower bounds and algorithms
SSTM generalizes both sparse PCA and tensor PCA. Hence, a tight hardness result for the model is interesting as it may combine and generalize the known bounds for these special cases. Here, we give a lower bound for the restricted computational model captured by low-degree polynomials. Originally developed in the context of the sum-of-squares hierarchy, this computational model appears to accurately predict the current best-known guarantees for problems such as sparse PCA, tensor PCA, community detection, and planted clique (e.g. see [HS17, HKP+17, Hop18, BHK+19, DKWB19, KWB19, dKNS20]).

Theorem 3 (Lower bound for low-degree polynomials, Informal). Let $1 \leq D \leq 2n/p$ and let $\nu$ be the distribution of $Z \in (\mathbb{R}^n)^{\otimes p}$ with i.i.d. entries from $N(0,1)$. Then, there exists a distribution $\mu$ over tensors $Y \in (\mathbb{R}^n)^{\otimes p}$ of the form $Y = W + \lambda x^{\otimes p}$, where $W \in (\mathbb{R}^n)^{\otimes p}$ is a noise tensor with i.i.d. $N(0,1)$ entries, the marginal distribution of $x$ is supported on vectors with entries in $\{0, 1/\sqrt{k}\}$, and $x$ and $W$ are distributionally independent, such that whenever $\lambda$ lies below a threshold depending on $n$, $k$, $p$ and $D$ (involving a factor of the form $1 + \ln\frac{np}{D}$; see Appendix C for the exact expression), $\mu$ is indistinguishable¹⁴ from $\nu$ with respect to all polynomials of degree at most $D$.

¹³ It is an intriguing question whether an improvement of $1/r$ can be achieved in the more general settings of orthogonal spikes. Our approach relies on the signals having disjoint supports and we expect it not to generalize to orthogonal signals. This can be noticed in the simplest setting with brute-force parameter $t = 1$ and $p = 2$, where the criterion of Algorithm 3 for finding an entry of a signal vector is to look at the diagonal entries of the data matrix. In this case, the algorithm may be fooled since the largest diagonal entry can depend on more than one spike. Nevertheless, we are unaware of any fundamental barrier suggesting that such guarantees are computationally hard to achieve.
¹⁴ In the sense that for any low-degree polynomial $p(Y)$ we have $\left|\mathbb{E}_\mu\, p(Y) - \mathbb{E}_\nu\, p(Y)\right| \leq \sqrt{\mathbb{V}_\nu\, p(Y)} \cdot o(1)$. See Appendix A.4.2.

Theorem 3 states that, for certain values of $\lambda$, low-degree polynomials cannot be used to distinguish between the distributions of $Y$ and $W$, as typical values of low-degree polynomials are the same (up to a vanishing difference) under both distributions. The theorem captures known results in both the sparse and tensor PCA settings. When $p = 2$, our bound reduces to $\lambda \lesssim \min\Big\{\sqrt{n},\ \frac{k}{\sqrt{D}}\Big(1 + \ln\frac{2n\sqrt{D}}{ek^2}\Big)\Big\}$, matching the known low-degree bounds of [DKWB19] in the sparse PCA setting. Meanwhile, in the tensor PCA settings ($p \geq 2$, $k = n$), Theorem 3 implies a bound of the form $\lambda \lesssim \sqrt{D}\,\big(\frac{n}{p\sqrt{D}}\big)^{p/4}$, thus recovering the results of [KWB19].
For constant power $p$ and $k \leq \sqrt{n}$, our lower bound suggests that no estimator captured by polynomials of degree $D \approx t \log n$ can improve over our algorithmic guarantees by more than a logarithmic factor. However, for $p \in \omega(1)$, an exponential gap appears between the bounds of Theorem 3 and state-of-the-art algorithms (both in the sparse settings as well as in the dense settings).¹⁵ As a concrete example, let us consider the setting where $p = n^{0.1} < k$. The polynomial-time algorithm of Theorem 1 requires $\lambda \geq O(k^{p/2})$, while according to Theorem 3 it may be enough to have $\lambda \geq O\big((k/n^{0.1})^{p/2}\big)$. Similarly, for $k$ as large as $n$, known tensor algorithms recover the signal for $\lambda \geq O(n^{p/4})$, while our lower bound only rules out algorithms for $\lambda \lesssim O\big((n^{0.9})^{p/4}\big)$. Surprisingly, for the distinguishing problem considered in Theorem 3, these bounds appear to be tight: for a wide range of parameters (in both the dense and the sparse settings) there exist polynomial-time algorithms that can distinguish the distributions $\nu$ and $\mu$ right at the threshold considered in Theorem 3 (see Appendix C). It remains a fascinating open question whether sharper recovery algorithms can be designed or stronger lower bounds are required. Finally, we would like to highlight that this non-trivial dependency on $p$ is a purely computational phenomenon, as it does not appear in information-theoretic bounds (see Appendix D).

¹⁵ In particular, in the sparse settings $k \ll n$, the $p^{p/2}$ factor could not be seen in the restricted case of sparse PCA (as this factor is a constant when $p = 2$).

Remark. Note that Theorem 3 is not in itself a lower bound for the recovery problem. However, any algorithm which obtains a good estimate of the signal vector $x$ for signal strength $\lambda \gtrsim \sqrt{k \log n}$ can be used to design a probabilistic algorithm which solves the distinguishing problem for signal strength $O_p(\lambda)$. Let us elaborate. Consider an algorithm that, given $Y = W + \lambda x^{\otimes p}$, outputs a vector $\hat{x}$ such that $|\langle \hat{x}, x\rangle| \geq 0.9$. With high probability, $\max_{\|z\|=1,\,\|z\|_0=k} |\langle W, z^{\otimes p}\rangle| \leq \tilde{O}(\sqrt{k})$, and thus $|\langle Y, \hat{x}^{\otimes p}\rangle| \geq \lambda (0.9)^p - \tilde{O}(\sqrt{k})$. Therefore, one can solve the distinguishing problem as follows: output "planted" if $|\langle Y, \hat{x}^{\otimes p}\rangle| \gtrsim \sqrt{k \log n}$ and "null" otherwise.
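The reduction sketched in the remark above can be phrased in a few lines of code. Below is a minimal NumPy sketch, with the recovery algorithm passed in as a black box and the constant in front of the threshold omitted; all names are illustrative and not taken from the paper.

```python
import numpy as np

def tensor_correlation(Y, u):
    """Compute <Y, u^{(tensor p)}> by contracting one mode of Y with u at a time."""
    T = Y
    while T.ndim > 0:
        T = np.tensordot(T, u, axes=([T.ndim - 1], [0]))
    return float(T)

def distinguish(Y, estimator, k):
    """Run a (black-box) recovery algorithm, then threshold <Y, xhat^{(tensor p)}>.

    Mirrors the test described above: output "planted" if the correlation exceeds
    roughly sqrt(k * log n), and "null" otherwise.
    """
    n = Y.shape[0]
    xhat = estimator(Y)                      # assumed to return a unit vector
    if abs(tensor_correlation(Y, xhat)) >= np.sqrt(k * np.log(n)):
        return "planted"
    return "null"
```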
1.3 Notation and outline of paper
We write random variables in boldface and the set $\{1, \ldots, n\}$ as $[n]$. We hide absolute constant multiplicative factors and multiplicative factors logarithmic in $n$ using the standard notations $O(\cdot)$, $\Omega(\cdot)$, $\lesssim$, $\gtrsim$, and $\tilde{O}(\cdot)$. We denote by $e_1, \ldots, e_n \in \mathbb{R}^n$ the standard basis vectors. For $x \in \mathbb{R}^n$, we use $\mathrm{supp}(x) \subseteq [n]$ to denote the set of support coordinates. We say that $x$ is a $(k, A)$-sparse vector if, for $k \in [n]$ and a constant $A \geq 1$, $|\mathrm{supp}(x)| = k$ and $\frac{1}{A\sqrt{k}} \leq |x_\ell| \leq \frac{A}{\sqrt{k}}$ for every $\ell \in \mathrm{supp}(x)$. When $A = 1$, we say that $x$ is a $k$-sparse flat vector and may omit the parameter $A$. For general $A \geq 1$, we say that $x$ is approximately flat. For an integer $t \geq 1$, we define $U_t = \big\{ u \in \{0, \pm\frac{1}{\sqrt{t}}\}^n : |\mathrm{supp}(u)| = t \big\}$ as the set of $t$-sparse flat vectors. For a tensor $T \in (\mathbb{R}^n)^{\otimes p}$ and a vector $u \in \mathbb{R}^n$, their inner product is defined as $\langle T, u^{\otimes p}\rangle = \sum_{i_1, \ldots, i_p \in [n]} T_{i_1, \ldots, i_p}\, u_{i_1} \cdots u_{i_p}$.
The rest of the paper is organized as follows. In Section 2, we introduce the main ideas behind Theorem 1 and Theorem 2. In Section 3, we flesh out some concrete unresolved research questions. Appendix A contains preliminary notions. We formally prove Theorem 1 and Theorem 2 in Appendix B. The lower bound Theorem 3 is given in Appendix C. We present an information-theoretic bound in Appendix D. Appendix E discusses the planted sparse densest sub-hypergraph model. Finally, Appendix F contains any deferred technical proofs required throughout the paper.

2 Recovering signal supports via limited brute force searches
We describe here the main ideas behind our limited brute force algorithm. We consider the following model.

Model 4 (Sparse spiked tensor model). For $A \geq 1$, $r \geq 1$, $k \leq n$, we observe a tensor of the form
$$Y = W + \sum_{q=1}^{r} \lambda_q\, x_{(q)}^{\otimes p} \in (\mathbb{R}^n)^{\otimes p},$$
where $W \in (\mathbb{R}^n)^{\otimes p}$ is a noise tensor with i.i.d. $N(0,1)$ entries, $\lambda_1 \geq \ldots \geq \lambda_r > 0$ are the signal strengths, and $x_{(1)}, \ldots, x_{(r)}$ are $k$-sparse flat unit-length signal vectors.

We first look at the simplest setting of a single flat signal (i.e. $A = 1$ and $r = 1$). Then, we explain how to extend the analysis to multiple flat signals. For a cleaner discussion, we assume here that all the non-zero entries of the sparse vector $x$ and of the vectors in the set $U_t$ have positive sign. Our techniques also easily extend to approximately flat vectors (where $A \geq 1$) and to general signal tensors $x_{(1)} \otimes \cdots \otimes x_{(p)} \in (\mathbb{R}^n)^{\otimes p}$. We provide details for these extensions in Appendix B.

2.1 Single flat signal
Limited brute force. As already mentioned in the introduction, a brute force search over $U_k$ for the vector maximizing $\langle Y, u^{\otimes p}\rangle$ returns the signal vector $x$ (up to a global sign flip) with high probability whenever $\lambda \gtrsim \sqrt{k \log n}$. This algorithm provides provably optimal guarantees but requires exponential time (see Appendix D for an information-theoretic lower bound). The idea of a limited brute force search is to search over a smaller set $U_t$ ($1 \leq t \leq k$) instead, and to use the maximizer $v$ to determine the signal support $\mathrm{supp}(x)$. The hope is that, for a sufficiently large signal-to-noise ratio, this $t$-sparse vector $v$ will still be non-trivially correlated with the hidden vector $x$. Indeed, as $t$ grows, the requirement on $\lambda$ decreases towards the information-theoretic bound, at the expense of increased running time.
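Under the positive-sign convention adopted in this section, $\langle Y, u^{\otimes p}\rangle$ for a flat vector $u$ supported on a set $S$ only involves the $|S|^p$ entries of $Y$ indexed by $S$. A minimal NumPy sketch of this search primitive (with a naive enumeration over supports; names are illustrative and not from the paper):

```python
import numpy as np
from itertools import combinations

def flat_correlation(Y, S, p):
    """<Y, u^{(tensor p)}> for the flat vector u with entries 1/sqrt(|S|) on S.

    Only the |S|^p entries of Y indexed by S contribute, so the sum is taken
    over the small subtensor Y[S, ..., S].
    """
    t = len(S)
    idx = np.ix_(*([list(S)] * p))
    return Y[idx].sum() / t ** (p / 2)

def limited_brute_force_maximizer(Y, t):
    """Search over all t-subsets of [n], i.e. the positive-sign flat vectors in U_t."""
    n, p = Y.shape[0], Y.ndim
    return max(combinations(range(n), t), key=lambda S: flat_correlation(Y, S, p))
```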
As a concrete example, consider the matrix setting ($p = 2$). It is easy to generalize the classic diagonal thresholding algorithm ([JL09]) into a limited brute-force algorithm. Recall that diagonal thresholding identifies the support of $x$ by picking the indices of the $k$ largest diagonal entries of $Y$. In other words, the algorithm simply computes $\langle Y, e_i^{\otimes 2}\rangle$ for all $i \in [n]$ and returns the largest $k$ indices. From this perspective, the algorithm can be naturally extended to $t > 1$ by computing the $\lceil k/t \rceil$ vectors $u \in U_t$ maximizing $\langle Y, u^{\otimes 2}\rangle$ and reconstructing the signal from them. For $t = k$, the algorithm corresponds to exhaustive search.
With this intuition in mind, we now introduce our family of algorithms, heavily inspired¹⁶ by [DKWB19]. We first apply a preprocessing step to obtain two independent copies of the data.

Algorithm 1 (Preprocessing).
Input: $Y$.
Sample a Gaussian tensor $Z \in (\mathbb{R}^n)^{\otimes p}$ where each entry is an i.i.d. standard Gaussian $N(0,1)$.
Return two independent copies $Y^{(1)}$ and $Y^{(2)}$ of $Y$ as follows: $Y^{(1)} = \frac{1}{\sqrt{2}}(Y + Z)$ and $Y^{(2)} = \frac{1}{\sqrt{2}}(Y - Z)$.

Algorithm 1 effectively creates two independent copies of the observation tensor $Y$. To handle the noise variance, the signal-to-noise ratio is only decreased by the constant factor $1/\sqrt{2}$. For simplicity, we will ignore this constant factor in the remainder of the section.

Algorithm 2 (Single spike limited brute force).
Input: $k$, $t$ and $Y^{(1)}, Y^{(2)}$ obtained from Algorithm 1.
Compute $v := \arg\max_{u \in U_t} \langle Y^{(1)}, u^{\otimes p}\rangle$.
Compute the vector $\alpha \in \mathbb{R}^n$ with entries $\alpha_\ell := \langle Y^{(2)}, v^{\otimes (p-1)} \otimes e_\ell\rangle$ for every $\ell \in [n]$.
Return the indices of the $k$ largest entries of $\alpha$.

¹⁶ The algorithm in [DKWB19] is a specialization of ours (with comparable guarantees) in the simplest setting of $p = 2$ and a single spike. However, looking at [DKWB19], it is a priori unclear how to generalize the result to the settings of our interest. This is especially true in the tensor settings ($p \geq 3$) with multiple spikes, where signals may interfere with each other.

The signal support recovery process outlined in Algorithm 2 has two phases. In the first phase, we search over $U_t$ to obtain a vector $v$ that is correlated with the signal $x$. In the second phase, we use $v$ to identify $\mathrm{supp}(x)$. The correctness of the algorithm follows from these two claims:
(i) The $t$-sparse maximizer $v$ shares a large fraction of its support coordinates with the signal $x$.
(ii) The $k$ largest entries of $\alpha$ belong to the support $\mathrm{supp}(x)$ of the signal $x$.
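A minimal NumPy sketch of this two-phase procedure (Algorithm 1 followed by Algorithm 2) under the positive-sign convention; the search over $U_t$ is written naively and is only practical for small $n$ and $t$, coordinates are ranked by $|\alpha_\ell|$, and all names are illustrative rather than taken from the paper.

```python
import numpy as np
from itertools import combinations

def split_copies(Y, seed=None):
    """Preprocessing as in Algorithm 1: Y1 = (Y + Z)/sqrt(2), Y2 = (Y - Z)/sqrt(2)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(Y.shape)
    return (Y + Z) / np.sqrt(2), (Y - Z) / np.sqrt(2)

def flat_correlation(Y, S, p):
    # <Y, u^{(tensor p)}> for the positive flat vector u supported on S (as in the sketch above).
    t = len(S)
    return Y[np.ix_(*([list(S)] * p))].sum() / t ** (p / 2)

def single_spike_support(Y, k, t, seed=None):
    """Sketch of Algorithm 2: find the t-sparse flat maximizer v on Y1, then score each
    coordinate l with alpha_l = <Y2, v^{(tensor p-1)} (x) e_l> and keep the top k."""
    n, p = Y.shape[0], Y.ndim
    Y1, Y2 = split_copies(Y, seed)
    S = max(combinations(range(n), t), key=lambda S: flat_correlation(Y1, S, p))
    v = np.zeros(n)
    v[list(S)] = 1.0 / np.sqrt(t)
    # Contract Y2 with v along the first p - 1 modes; the remaining vector is alpha.
    alpha = Y2
    for _ in range(p - 1):
        alpha = np.tensordot(alpha, v, axes=([0], [0]))
    # Rank coordinates by |alpha_l| and keep the k largest.
    return set(np.argsort(np.abs(alpha))[-k:])
```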
Crucial to our analysis is the following standard concentration bound on Gaussian tensors. We directly use Lemma 5 in our exposition here, and formally prove a more general form in Appendix F.1.

Lemma 5. Let $p \leq n$ and $t > 0$ be integers, and let $W \in (\mathbb{R}^n)^{\otimes p}$ be a tensor with i.i.d. $N(0,1)$ entries. Then, with high probability, for any $u \in U_t$, $\left|\langle W, u^{\otimes p}\rangle\right| \lesssim \sqrt{t \log n}$.

For some constant $0 < \epsilon \leq 1/2$, suppose that
$$\lambda \;\gtrsim\; \frac{1}{\epsilon\,(1-\epsilon)^{p-1}} \left(\frac{k}{t}\right)^{p/2} \sqrt{t\, p \log n}\,. \qquad (1)$$
For any $u \in U_t$ with support $\mathrm{supp}(u) \subseteq \mathrm{supp}(x)$, we have
$$\langle Y^{(1)}, u^{\otimes p}\rangle = \lambda\, \langle x, u\rangle^p + \langle W^{(1)}, u^{\otimes p}\rangle \;\geq\; \lambda \left(\frac{t}{k}\right)^{p/2} - O\big(\sqrt{t \log n}\big).$$
On the other hand, any $u \in U_t$ with support satisfying $|\mathrm{supp}(u) \cap \mathrm{supp}(x)| \leq (1-\epsilon)\, t$ has small correlation with $Y^{(1)}$, in the sense that
$$\langle Y^{(1)}, u^{\otimes p}\rangle = \lambda\, \langle x, u\rangle^p + \langle W^{(1)}, u^{\otimes p}\rangle \;\leq\; \lambda\, (1-\epsilon)^p \left(\frac{t}{k}\right)^{p/2} + O\big(\sqrt{t \log n}\big).$$
By Eq. (1), with high probability, $v$ will have at least a fraction $(1-\epsilon)$ of its support contained in $\mathrm{supp}(x)$, yielding the first claim.
Observe that $v$ does not completely overlap with $x$. A priori, this might seem to be an issue. However, it turns out that we can still use $v$ to exactly reconstruct the support of $x$. Indeed, for all $\ell \in \mathrm{supp}(x)$,
$$\alpha_\ell = \lambda\, x_\ell\, \langle x, v\rangle^{p-1} + \big\langle W^{(2)},\, v^{\otimes(p-1)} \otimes e_\ell\big\rangle \;\gtrsim\; \sqrt{\log n} + \big\langle W^{(2)},\, v^{\otimes(p-1)} \otimes e_\ell\big\rangle,$$
where the last step uses $x_\ell = 1/\sqrt{k}$, $\langle x, v\rangle \geq (1-\epsilon)\sqrt{t/k}$ and Eq. (1). Now, by independence of $W^{(2)}$ and $v$, $\langle W^{(2)}, v^{\otimes(p-1)} \otimes e_\ell\rangle$ behaves like a standard Gaussian. Thus, with high probability, $|\langle W^{(2)}, v^{\otimes(p-1)} \otimes e_\ell\rangle| \lesssim \sqrt{\log n}$ and $\alpha_\ell \gtrsim \sqrt{\log n}$. Conversely, if $\ell$ is not in the support of the signal, then $\alpha_\ell \lesssim \sqrt{\log n}$. So, the vector $\alpha$ acts as an indicator of the support of $x$!

Remark 6. In its simplest form of $t = 1$, Algorithm 2 does not exploit the tensor structure of the data: it performs an entry-wise search for the largest entries (in magnitude) over a subset of the entries of $Y$. However, this is no longer true as $t$ grows. For $t = k$, the algorithm computes the $k$-sparse flat unit vector $u$ maximizing $\langle Y^{(1)}, u^{\otimes p}\rangle$.

2.2 Multiple flat signals with disjoint signal supports
Consider now the setting with $r > 1$ spikes. Recall that we assumed the vectors $x_{(1)}, \ldots, x_{(r)}$ to have non-intersecting supports. We also assumed that, for any $q, q' \in [r]$ and some fixed scalar $0 < \kappa \leq 1$, if $\lambda_q \leq \lambda_{q'}$, then $\lambda_q \geq \kappa\, \lambda_{q'}$. We remark that we may not recover the signal supports in a known order, but we are guaranteed to recover all of them exactly. For simplicity of discussion, let us assume here that we recover the vector $x_{(i)}$ at iteration $i$.
The idea to recover the $r$ spikes is essentially to run Algorithm 2 $r$ times. At first, we compute the $t$-sparse vector $v$ maximizing $\langle Y^{(1)}, v^{\otimes p}\rangle$. Then, using $v$, we compute the vector $\alpha$ to obtain a set $I_1 \subseteq [n]$. With high probability, we will have $I_1 = \mathrm{supp}(x_{(1)})$, and so we will exactly recover the support of $x_{(1)}$. In the second iteration of the loop, we repeat the same procedure with the additional constraint of searching only over the subset of $U_t$ containing vectors with support disjoint from $I_1$. Similarly, at iteration $i$, we search over the subset of $U_t$ containing vectors with support disjoint from $\bigcup_{1 \leq j < i} I_j$.
For any $\epsilon > 0$, the algorithm can tolerate different values of $\kappa$. Indeed, the above analysis holds as long as $\kappa \geq \big(\frac{\epsilon}{1-\epsilon}\big)^{p-1}$. This suggests two ways in which we can loosen the requirement $\lambda_{\min} \geq \kappa\, \lambda_{\max}$ and still successfully recover the signals through Algorithm 3. One is to increase the running time, so that we can decrease $\epsilon$ without increasing the signal-to-noise ratio $\lambda_{\min}$. The other is to decrease $\epsilon$ and increase the value of $\lambda_{\min}$ accordingly.

Remark 8 (On independent copies of Y). To clarify why it suffices to have two independent copies of $Y$ even for multiple iterations, observe that at each iteration $i$, the choice of the set $I_i$ depends only on the vector $v$ with high probability. Consider the following thought experiment where we are given a fresh copy $Y^{(i)}$ of $Y$ in the second phase of each iteration $i$ of the algorithm (while still using only a single copy $Y^{(1)}$ for all the first phases). Even with fresh randomness, the result is the same as Algorithm 3 with high probability, because at each iteration the output is determined by the choice of the maximizer $v$.

Remark 9 (Reconstructing the signals from their supports). After recovering the individual signal supports, one can reconstruct the signals by running known tensor PCA algorithms (e.g. [MR14, HSS15]) on the subtensor defined by each recovered support. The signal strength required for this new subproblem is weaker and is satisfied by our recovery assumptions. For instance, by concatenating our algorithm with [HSS15, Theorem 7.1], one obtains vectors $\hat{x}_{(1)}, \ldots, \hat{x}_{(r)}$ such that $\langle \hat{x}_{(i)}, x_{(i)}\rangle \geq 0.99$ for every $i \in [r]$, with probability 0.99.

3 Open questions
Open question 1. Theorem 2 improves over existing sparse PCA multi-spike recovery algorithms (which only assumed orthogonal spikes) by a factor of $1/r$ in the case where these planted signals have disjoint support.
Can one still obtain an improvement of 1/r if we only assume orthogonality? Open question 2. For p ω(1), there is an exponential gap between the bounds of Theorem 3 and state-of-the-art algorithms. A natural question is whether one can design better recover algorithms or prove stronger lower bounds for this range of tensor power p ω(1)? Remark (Societal impact). This work does not present any foreseeable negative societal consequence. Acknowledgments and Disclosure of Funding This project has received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement No 815464). This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-Ph D/2021-08-013). We thank David Steurer for several helpful conversations. We thank Luca Corinzia and Paolo Penna for useful discussions about the planted sparse densest sub-hypergraph model. [AMMN19] Gerard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape of the spiked tensor model. Communications on Pure and Applied Mathematics, 72(11):2282 2330, 2019. 2 [AW08] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In 2008 IEEE International Symposium on Information Theory, pages 2454 2458. IEEE, 2008. 2 [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], pages 648 847, 2020. 3 [BCPS20] Joachim M Buhmann, Luca Corinzia, Paolo Penna, and Wojciech Szpankowski. Recovery of a Planted k-Densest Sub-Hypergraph, 2020. Available at: https: //www.cs.purdue.edu/homes/spa/papers/isit20-clique.pdf. 3, 27 [BGL16] Vijay Bhattiprolu, Venkatesan Guruswami, and Euiwoong Lee. Sum-of-squares certificates for maxima of random tensors on the sphere. ar Xiv preprint ar Xiv:1605.00903, 2016. 3 [BHK+19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM Journal on Computing, 48(2):687 735, 2019. 5, 16, 17 [BKW20] Afonso S. Bandeira, Dmitriy Kunisky, and Alexander S. Wein. Computational hardness of certifying bounds on constrained PCA problems. In 11th Innovations in Theoretical Computer Science Conference, ITCS 2020, January 12-14, 2020, Seattle, Washington, USA, pages 78:1 78:29, 2020. 16 [BR13a] Quentin Berthet and Philippe Rigollet. Complexity Theoretic Lower Bounds for Sparse Principal Component Detection. In Conference on Learning Theory, pages 1046 1066, 2013. 2 [BR13b] Quentin Berthet and Philippe Rigollet. OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS IN HIGH DIMENSION. The Annals of Statistics, 41(4):1780 1815, 2013. 2 [Cam60] Lucien Le Cam. Locally asymptotically normal families. Univ. California Publ. Statist., 1960. 17 [CMW13] T Tony Cai, Zongming Ma, and Yihong Wu. SPARSE PCA: OPTIMAL RATES AND ADAPTIVE ESTIMATION. The Annals of Statistics, 41(6):3074 3110, 2013. 5 [CPMB19] Luca Corinzia, Paolo Penna, Luca Mondada, and Joachim M Buhmann. Exact Recovery for a Family of Community-Detection Generative Models. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 415 419. IEEE, 2019. 3, 27 [CPSB20] Luca Corinzia, Paolo Penna, Wojciech Szpankowski, and Joachim M. Buhmann. 
Statistical and computational thresholds for the planted k-densest sub-hypergraph problem, 2020. 3, 27 [d GJL07] Alexandre d Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434 448, 2007. 2 [d KNS20] Tommaso d Orsi, Pravesh K. Kothari, Gleb Novikov, and David Steurer. Sparse PCA: Algorithms, Adversarial Perturbations and Certificates, 2020. To appear in FOCS 2020. 2, 5, 18 [DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Subexponential-Time Algorithms for Sparse PCA. ar Xiv preprint ar Xiv:1907.11635, 2019. 2, 3, 5, 7, 22 [DM16] Yash Deshpande and Andrea Montanari. Sparse PCA via Covariance Thresholding. The Journal of Machine Learning Research, 17(1):4913 4953, 2016. 2, 4, 5 [Duc16] John Duchi. Lecture Notes for Statistics 311/Electrical engineering 377, 2016. Available at: https://stanford.edu/class/stats311/Lectures/full_notes.pdf. 15, 16 [FP07] Delphine Féral and Sandrine Péché. The Largest Eigenvalue of Rank One Deformation of Large Wigner Matrices. Communications in mathematical physics, 272(1):185 228, 2007. 2 [HKP+17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 720 731. IEEE, 2017. 2, 3, 5, 16, 17, 22 [HL13] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are np-hard. Journal of the ACM (JACM), 60(6):1 39, 2013. 3 [Hop18] Samuel Brink Klevit Hopkins. STATISTICAL INFERENCE AND THE SUM OF SQUARES METHOD. Ph D thesis, Cornell University, 2018. 5, 16, 17 [HS17] Samuel B Hopkins and David Steurer. Efficient Bayesian estimation from few samples: community detection and related problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 379 390. IEEE, 2017. 5, 16, 17, 22 [HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-squares proofs. In Conference on Learning Theory, pages 956 1006, 2015. 2, 3, 10 [HSS19] Samuel B Hopkins, Tselil Schramm, and Jonathan Shi. A robust spectral algorithm for overcomplete tensor decomposition. In Conference on Learning Theory, pages 1683 1722, 2019. 3 [HSV20] Guy Holtzman, Adam Soffer, and Dan Vilenchik. A greedy anytime algorithm for sparse PCA. In Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], pages 1939 1956, 2020. 2 [HW20] Justin Holmgren and Alexander S Wein. Counterexamples to the low-degree conjecture. ar Xiv preprint ar Xiv:2004.08454, 2020. 17 [JL09] Iain M Johnstone and Arthur Yu Lu. On Consistency and Sparsity for Principal Components Analysis in High Dimensions. Journal of the American Statistical Association, 104(486):682 693, 2009. 2, 4, 7 [KWB19] Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Notes on Computational Hardness of Hypothesis Testing: Predictions using the Low-Degree Likelihood Ratio. ar Xiv preprint ar Xiv:1907.11636, 2019. 2, 5 [LZ20] Yuetian Luo and Anru Zhang. Tensor clustering with planted structures: Statistical optimality and computational limits. Co RR, abs/2005.10743, 2020. 3, 4, 21 [MR14] Andrea Montanari and Emile Richard. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897 2905, 2014. 2, 3, 10 [MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. 
Polynomial-time tensor decompositions with sum-of-squares. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 438 446. IEEE, 2016. 2 [NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289 337, 1933. 17 [NWZ20] Jonathan Niles-Weed and Ilias Zadik. The all-or-nothing phenomenon in sparse tensor pca. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17674 17684. Curran Associates, Inc., 2020. 26 [NZ20] Jonathan Niles-Weed and Ilias Zadik. The all-or-nothing phenomenon in sparse tensor PCA. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. 25 [O D14] Ryan O Donnell. Analysis of boolean functions. Cambridge University Press, 2014. 15 [PWB16] Amelia Perry, Alexander S. Wein, and Afonso S. Bandeira. Statistical limits of spiked tensor models. Co RR, abs/1612.07728, 2016. 25 [PWB+20] Amelia Perry, Alexander S Wein, Afonso S Bandeira, et al. Statistical limits of spiked tensor models. In Annales de l Institut Henri Poincaré, Probabilités et Statistiques, volume 56, pages 230 264. Institut Henri Poincaré, 2020. 26 [SC19] Jonathan Scarlett and Volkan Cevher. An Introductory Guide to Fano s Inequality with Applications in Statistical Estimation. ar Xiv preprint ar Xiv:1901.00555, 2019. 16 [SS17] Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning. ar Xiv preprint ar Xiv:1706.08672, 2017. 3 [Tao12] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012. 18 [Tao14] Terence Tao. Metric entropy analogues of sum set theory, 2014. Available at: https://terrytao.wordpress.com/2014/03/19/ metric-entropy-analogues-of-sum-set-theory/. 15 [Tro15] Joel A Tropp. An Introduction to Matrix Concentration Inequalities. ar Xiv preprint ar Xiv:1501.01571, 2015. 18 [TS14] Ryota Tomioka and Taiji Suzuki. Spectral norm of random tensors. ar Xiv preprint ar Xiv:1407.1870, 2014. 18 [Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018. 15, 18, 28