Sparse PCA via Covariance Thresholding

Yash Deshpande
Electrical Engineering, Stanford University
yashd@stanford.edu

Andrea Montanari
Electrical Engineering and Statistics, Stanford University
montanari@stanford.edu

In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension n × p and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here that the principal components v_1, ..., v_r have at most k_1, ..., k_r non-zero entries respectively, and study the high-dimensional regime in which p is of the same order as n. In an influential paper, Johnstone and Lu [JL04] introduced a simple algorithm that estimates the support of the principal vectors v_1, ..., v_r by the largest entries in the diagonal of the empirical covariance. This method can be shown to succeed with high probability if k_q ≤ C_1 √(n/log p), and to fail with high probability if k_q ≥ C_2 √(n/log p), for two constants 0 < C_1, C_2 < ∞. Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees. Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler and Vilenchik [KNV13]. We confirm empirical evidence presented by these authors and rigorously prove that the algorithm succeeds with high probability for k of order √n. Recent conditional lower bounds [BR13] suggest that it might be impossible to do significantly better. The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before.

1 Introduction

In the spiked covariance model proposed by [JL04], we are given data x_1, x_2, ..., x_n, with x_i ∈ R^p, of the form [1]

    x_i = Σ_{q=1}^r √β_q u_{q,i} v_q + z_i .        (1)

Here v_1, ..., v_r ∈ R^p is a set of orthonormal vectors that we want to estimate, while u_{q,i} ~ N(0, 1) and z_i ~ N(0, I_p) are independent and identically distributed. The quantity β_q ∈ R_{>0} quantifies the signal-to-noise ratio. We are interested in the high-dimensional limit n, p → ∞ with lim_{n→∞} p/n = α ∈ (0, ∞). In the rest of this introduction we will refer to the rank-one case, in order to simplify the exposition, and drop the subscript q ∈ {1, 2, ..., r}. Our results and proofs hold for general bounded rank.

[1] Throughout the paper, we follow the convention of denoting scalars by lowercase, vectors by lowercase boldface, and matrices by uppercase boldface letters.

The standard method of principal component analysis involves computing the sample covariance matrix G = n^{-1} Σ_{i=1}^n x_i x_i^T and estimating v = v_1 by its principal eigenvector v̂_PC(G). It is a well-known fact that, in the high-dimensional asymptotic regime p/n → α > 0, this yields an inconsistent estimate [JL09]: namely, ‖v̂_PC − v‖_2 does not converge to 0 in the high-dimensional asymptotic limit, unless α → 0 (i.e. p/n → 0). Even worse, Baik, Ben-Arous and Péché [BBAP05] and Paul [Pau07] demonstrate a phase transition phenomenon: if β ≤ √α, the estimate is asymptotically orthogonal to the signal, ⟨v̂_PC, v⟩ → 0. On the other hand, for β > √α, ⟨v̂_PC, v⟩ remains strictly positive as n, p → ∞. This phase transition phenomenon has attracted considerable attention recently within random matrix theory [FP07, CDMF09, BGN11, KY13].

These inconsistency results motivated several efforts to exploit additional structural information on the signal v. In two influential papers, Johnstone and Lu [JL04, JL09] considered the case of a signal v that is sparse in a suitable basis, e.g.
in the wavelet domain. Without loss of generality, we will assume here that v is sparse in the canonical basis e_1, ..., e_p. In a nutshell, [JL09] proposes the following (a minimal code sketch of this procedure is given below):

1. Order the diagonal entries of the Gram matrix, G_{i(1),i(1)} ≥ G_{i(2),i(2)} ≥ ... ≥ G_{i(p),i(p)}, and let J ≡ {i(1), i(2), ..., i(k)} be the set of indices corresponding to the k largest entries.
2. Set to zero all the entries G_{i,j} of G unless i, j ∈ J, and estimate v with the principal eigenvector of the resulting matrix.

Johnstone and Lu formalized the sparsity assumption by requiring that v belongs to a weak ℓ_q-ball with q ∈ (0, 1). Instead, here we consider a strict sparsity constraint, where v has exactly k non-zero entries with magnitudes bounded below by θ/√k for some constant θ > 0. The case θ = 1 was studied by Amini and Wainwright in [AW09]. Within this model, it was proved that diagonal thresholding successfully recovers the support of v provided v is sparse enough, namely k ≤ C √(n/log p), with C = C(α, β) a constant [AW09]. (Throughout the paper we denote by C constants that can change from point to point.) This result is a striking improvement over vanilla PCA. While the latter requires a number of samples scaling as the number of parameters [2], n ≳ p, sparse PCA via diagonal thresholding achieves the same objective with a number of samples scaling as the number of non-zero parameters, n ≳ k² log p.

[2] Throughout the introduction, we write f(n) ≳ g(n) as a shorthand for f(n) ≥ C g(n) for some constant C = C(β, α), and analogously f(n) ≲ g(n) for f(n) ≤ C g(n). Further, C denotes a constant that may change from point to point.

At the same time, this result is not as optimistic as might have been expected. By searching exhaustively over all possible supports of size k (a method that has complexity of order p^k), the correct support can be identified with high probability as soon as n ≳ k log p. On the other hand, no method can succeed for much smaller n, because of information-theoretic obstructions [AW09].

Over the last ten years, a significant effort has been devoted to developing practical algorithms that outperform diagonal thresholding, see e.g. [MWA05, ZHT06, dEGJL07, dBG08, WTH09]. In particular, d'Aspremont et al. [dEGJL07] developed a promising M-estimator based on a semidefinite programming (SDP) relaxation. Amini and Wainwright [AW09] carried out an analysis of this method and proved that, if (i) k ≤ C(β) √(n/log p), and (ii) the SDP solution has rank one, then the SDP relaxation provides a consistent estimator of the support of v.

At first sight, this appears as a satisfactory solution of the original problem. No procedure can estimate the support of v from fewer than k log p samples, and the SDP relaxation succeeds in doing so from at most a constant factor more samples. This picture was upset by a recent, remarkable result by Krauthgamer, Nadler and Vilenchik [KNV13], who showed that the rank-one condition assumed by Amini and Wainwright does not hold for √n ≲ k ≲ n/log p. This result is consistent with recent work of Berthet and Rigollet [BR13] demonstrating that sparse PCA cannot be performed in polynomial time in the regime k ≳ √n, under a certain computational complexity conjecture for the so-called planted clique problem.

In summary, the problem of support recovery in sparse PCA demonstrates a fascinating interplay between computational and statistical barriers. From a statistical perspective, and disregarding computational considerations, the support of v can be estimated consistently if and only if k ≲ n/log p. This can be done, for instance, by exhaustive search over all the (p choose k) possible supports of v. (See [VL12, CMW+13] for a minimax analysis.)
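Returning to the diagonal thresholding procedure described at the start of this discussion, the following is a minimal illustrative sketch of those two steps. It assumes a data matrix X with rows x_i and a known support size k; the function and variable names are ours, not from [JL09].

```python
import numpy as np

def diagonal_thresholding(X, k):
    """Johnstone-Lu style diagonal thresholding (illustrative sketch).

    X : (n, p) data matrix with rows x_i;  k : assumed support size.
    Returns an estimate of the principal component v.
    """
    n, p = X.shape
    G = X.T @ X / n                          # sample covariance (Gram matrix)
    # Step 1: indices of the k largest diagonal entries.
    J = np.argsort(np.diag(G))[-k:]
    # Step 2: zero out every entry G_{ij} unless both i and j lie in J,
    # then take the principal eigenvector of the restricted matrix.
    G_restricted = np.zeros_like(G)
    G_restricted[np.ix_(J, J)] = G[np.ix_(J, J)]
    eigvals, eigvecs = np.linalg.eigh(G_restricted)
    return eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
```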
From a computational perspective, the problem appears to be much more difficult. There is rigorous evidence [BR13, MW13] that no polynomial algorithm can reconstruct the support unless k ≲ √n. On the positive side, a very simple algorithm (Johnstone and Lu's diagonal thresholding) succeeds for k ≲ √(n/log p).

Of course, several elements are still missing in this emerging picture. In the present paper we address one of them, providing an answer to the following question:

    Is there a polynomial time algorithm that is guaranteed to solve the sparse PCA problem with high probability for √(n/log p) ≲ k ≲ √n?

We answer this question positively by analyzing a covariance thresholding algorithm that proceeds, briefly, as follows. (A precise, general definition, with some technical changes, is given in the next section.)

1. Form the Gram matrix G and set to zero all its entries that are smaller in modulus than τ/√n, for τ a suitably chosen constant.
2. Compute the principal eigenvector v̂_1 of this thresholded matrix.
3. Denote by B ⊆ {1, ..., p} the set of indices corresponding to the k largest entries of v̂_1.
4. Estimate the support of v by 'cleaning' the set B. (Briefly, v is estimated by thresholding G v̂_B, with v̂_B obtained by zeroing the entries of v̂_1 outside B.)

Such a covariance thresholding approach was proposed in [KNV13], and is in turn related to earlier work by Bickel and Levina [BL08]. The formulation discussed in the next section presents some technical differences that have been introduced to simplify the analysis. Notice that, to simplify proofs, we assume k to be known: this issue is discussed in the next two sections.

The rest of the paper is organized as follows. In the next section we provide a detailed description of the algorithm and state our main results. Our theoretical results assume full knowledge of problem parameters for ease of proof. In light of this, in Section 3 we discuss a practical implementation of the same idea that does not require prior knowledge of problem parameters, and is entirely data-driven. We also illustrate the method through simulations. The complete proofs are available in the accompanying supplement, in Sections 1, 2 and 3 respectively.

2 Algorithm and main result

For notational convenience, we shall assume hereafter that 2n sample vectors are given (instead of n): {x_i}_{1 ≤ i ≤ 2n}. These are distributed according to the model (1). The number of spikes r will be treated as a known parameter in the problem. We will make the following assumptions:

A1 The number of spikes r and the signal strengths β_1, ..., β_r are fixed as n, p → ∞. Further, β_1 > β_2 > ... > β_r are all distinct.

A2 Let Q_q and k_q denote the support of v_q and its size respectively. We let Q = ∪_q Q_q and k = Σ_q k_q throughout. Then the non-zero entries of the spikes satisfy |v_{q,i}| ≥ θ/√k_q for all i ∈ Q_q, for some θ > 0. Further, for any q, q′ we assume |v_{q,i}/v_{q′,i}| ≤ γ for every i ∈ Q_q ∩ Q_{q′}, for some constant γ > 1.

As before, we are interested in the high-dimensional limit of n, p → ∞ with p/n → α. A more detailed description of the covariance thresholding algorithm for the general model (1) is given in Algorithm 1. We describe the basic intuition for the simpler rank-one case (omitting the subscript q ∈ {1, 2, ..., r}), while stating results in generality.
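Before turning to this intuition, the following self-contained sketch shows how synthetic data from model (1) can be generated; the parameter choices and names are ours, not from the paper. We use flat spikes on disjoint supports, which keeps the v_q orthonormal and satisfies Assumption A2 with θ = 1.

```python
import numpy as np

def generate_spiked_data(n, p, betas, supports, rng=None):
    """Draw 2n samples x_i = sum_q sqrt(beta_q) u_{q,i} v_q + z_i  (model (1)).

    betas    : array of signal strengths beta_1 > ... > beta_r (Assumption A1)
    supports : list of disjoint index arrays; spike v_q is flat on its support,
               |v_{q,i}| = 1/sqrt(k_q), hence Assumption A2 holds with theta = 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    r = len(betas)
    V = np.zeros((p, r))
    for q, S in enumerate(supports):
        V[S, q] = 1.0 / np.sqrt(len(S))       # unit-norm, k_q-sparse spike
    U = rng.standard_normal((2 * n, r))        # factors u_{q,i} ~ N(0, 1)
    Z = rng.standard_normal((2 * n, p))        # noise z_i ~ N(0, I_p)
    X = U @ (np.sqrt(betas) * V).T + Z         # rows of X are the samples x_i
    return X, V

# Example (our choice of sizes): one spike with k = 30 non-zeros, p = n = 2000.
# X, V = generate_spiked_data(2000, 2000, np.array([2.0]), [np.arange(30)])
```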
We start by splitting the data into two halves: (x_i)_{1 ≤ i ≤ n} and (x_i)_{n < i ≤ 2n}. The first half is used to compute the Gram matrix G and the estimate Σ̂ = G − I_p, while the second half is reserved for the support-cleaning step described below. As discussed in the introduction, for β > √α, the principal eigenvector of G, and hence of Σ̂, is positively correlated with v, i.e. lim_{n→∞} ⟨v̂_1(Σ̂), v⟩ > 0. However, for β ≤ √α, the noise component in Σ̂ dominates and the two vectors become asymptotically orthogonal, i.e., for instance, lim_{n→∞} ⟨v̂_1(Σ̂), v⟩ = 0.

In order to reduce the noise level, we exploit the sparsity of the spike v. Denoting by X ∈ R^{n×p} the matrix with rows x_1, ..., x_n, by Z ∈ R^{n×p} the matrix with rows z_1, ..., z_n, and letting u = (u_1, u_2, ..., u_n), the model (1) can be rewritten as

    X = √β u v^T + Z .        (2)

Hence, letting β̃ ≡ β ‖u‖²/n ≈ β and w ≡ √β Z^T u/n,

    Σ̂ = β̃ v v^T + v w^T + w v^T + (Z^T Z/n − I_p) .        (3)

For a moment, let us neglect the cross terms (v w^T + w v^T). The signal component β̃ v v^T is sparse, with k² entries of magnitude β/k, which (in the regime of interest, k = √n/C) is equivalent to Cβ/√n. The noise component Z^T Z/n − I_p is dense, with entries of order 1/√n. Assuming k/√n to be a small enough constant, it should therefore be possible to remove most of the noise by thresholding the entries at a level of order 1/√n. For technical reasons, we use the soft thresholding function η : R × R_{≥0} → R, η(z; τ) = sgn(z)(|z| − τ)_+. We will omit the second argument wherever it is clear from context. Classical denoising theory [DJ94, Joh02] provides upper bounds on the estimation error of such a procedure. Note however that these results fall short of our goal: classical theory measures the estimation error by an (element-wise) ℓ_p norm, while here we are interested in the resulting principal eigenvector. This requires bounding, for instance, the error in operator norm.

Since the soft thresholding function z ↦ η(z; τ/√n) is affine when |z| is much larger than τ/√n, we would expect that the following decomposition holds approximately (for instance, in operator norm):

    η(Σ̂) ≈ η(β̃ v v^T + v w^T + w v^T) + η(Z^T Z/n − I_p) .        (4)

The main technical challenge now is to control the operator norm of the perturbation η(Z^T Z/n − I_p). It is easy to see that η(Z^T Z/n − I_p) has entries of variance δ(τ)/n, with δ(τ) → 0 as τ → ∞. If the entries were independent with mild decay, this would imply, with high probability,

    ‖η(Z^T Z/n − I_p)‖_2 ≤ C √δ(τ) ,        (5)

for some constant C. Further, the first component in the decomposition (4) is still approximately rank one, with norm of the order of β̃ ≈ β. Consequently, using standard linear algebra results on the perturbation of eigenspaces [DK70], we obtain an error bound of order √δ(τ)/β on ‖v̂_1 − v‖ (up to a sign flip). Our first theorem formalizes this intuition and provides a bound on the estimation error in the principal components of η(Σ̂).

Theorem 1. Under the spiked covariance model (1) satisfying Assumption A1, let v̂_q denote the q-th eigenvector of η(Σ̂), computed with threshold τ. For every α, (β_q)_{q=1}^r ∈ (0, ∞), integer r and every ε > 0, there exist constants τ = τ(ε, α, (β_q)_{q=1}^r, r, θ) and 0 < c* = c*(ε, α, (β_q)_{q=1}^r, r, θ) < ∞ such that, if Σ_q |supp(v_q)| ≤ c* √n, then

    P{ min(‖v̂_q − v_q‖, ‖v̂_q + v_q‖) ≤ ε for all q ∈ {1, ..., r} } → 1

as n, p → ∞.

Random matrices of the type η(Z^T Z/n − I_p) are called inner-product kernel random matrices and have attracted recent interest within probability theory [EK10a, EK10b, CS12]. In this literature, the asymptotic eigenvalue distribution of a matrix f(Z^T Z/n) is the object of study. Here f : R → R is a kernel function applied entry-wise to the matrix Z^T Z/n, with Z a matrix as above.
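The following small experiment illustrates, with dimensions and a threshold of our own choosing, the effect that Eq. (5) formalizes: entrywise soft thresholding at level τ/√n sharply reduces the spectral norm of the pure-noise matrix Z^T Z/n − I_p.

```python
import numpy as np

def soft_threshold(M, t):
    """Entrywise soft thresholding eta(M; t) = sign(M) * (|M| - t)_+."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

rng = np.random.default_rng(0)
n, p, tau = 1200, 600, 4.0              # illustrative sizes and threshold (our choice)
Z = rng.standard_normal((n, p))
noise = Z.T @ Z / n - np.eye(p)         # pure-noise component of the covariance

raw_norm = np.linalg.norm(noise, 2)                          # about (1 + sqrt(p/n))^2 - 1
thr_norm = np.linalg.norm(soft_threshold(noise, tau / np.sqrt(n)), 2)
print(f"||Z'Z/n - I||_2 = {raw_norm:.2f}, after thresholding: {thr_norm:.2f}")
```

With these illustrative values the thresholded norm comes out markedly smaller than the raw norm, consistent with a bound of the form C√δ(τ).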
Unfortunately, the results of [EK10a, EK10b, CS12] cannot be applied to our problem, for the following reasons:

- The results of [EK10a, EK10b] are perturbative and provide conditions under which the spectrum of f(Z^T Z/n) is close to a rescaling of the spectrum of Z^T Z/n (with rescaling factors depending on the Taylor expansion of f close to 0). We are interested instead in a non-perturbative regime in which the spectrum of f(Z^T Z/n) is very different from that of Z^T Z/n (and the Taylor expansion is trivial).
- [CS12] consider n-dependent kernels, but their results are asymptotic and concern the weak limit of the empirical spectral distribution of f(Z^T Z/n). This does not yield an upper bound on the spectral norm [3] of f(Z^T Z/n).

[3] Note that [CS12] also provide a finite-n bound for the spectral norm of f(Z^T Z/n) via the moment method, but this bound diverges with n and does not give a result of the type of Eq. (5).

Algorithm 1 Covariance Thresholding
1: Input: Data (x_i)_{1 ≤ i ≤ 2n}, parameters r, (k_q)_{q ≤ r} ∈ N, θ, τ, ρ ∈ R_{≥0};
2: Compute the Gram matrices G ≡ Σ_{i=1}^n x_i x_i^T/n and G′ ≡ Σ_{i=n+1}^{2n} x_i x_i^T/n;
3: Compute Σ̂ = G − I_p (resp. Σ̂′ = G′ − I_p);
4: Compute the matrix η(Σ̂) by soft-thresholding the entries of Σ̂:
       η(Σ̂)_{ij} = Σ̂_{ij} − τ/√n   if Σ̂_{ij} ≥ τ/√n,
                    0                 if −τ/√n < Σ̂_{ij} < τ/√n,
                    Σ̂_{ij} + τ/√n   if Σ̂_{ij} ≤ −τ/√n;
5: Let (v̂_q)_{q ≤ r} be the first r eigenvectors of η(Σ̂);
6: Define s_q ∈ R^p by s_{q,i} = v̂_{q,i} I(|v̂_{q,i}| ≥ θ/(2√k_q));
7: Output: Q̂ = {i ∈ [p] : ∃ q s.t. |(Σ̂′ s_q)_i| ≥ ρ}.

Our approach to prove Theorem 1 follows instead the so-called ε-net method: we develop high-probability bounds on the maximum Rayleigh quotient

    max_{y ∈ S^{p−1}} ⟨y, η(Z^T Z/n) y⟩ = max_{y ∈ S^{p−1}} Σ_{i,j} η(⟨z_i, z_j⟩/n) y_i y_j .

Here S^{p−1} = {y ∈ R^p : ‖y‖ = 1} is the unit sphere and z_i denotes the i-th column of Z. Since η(Z^T Z/n) is not Lipschitz continuous in the underlying Gaussian variables Z, concentration does not follow immediately from Gaussian isoperimetry. We have to develop more careful (non-uniform) bounds on the gradient of η(Z^T Z/n) and show that they imply concentration as required.

While Theorem 1 guarantees that v̂ is a reasonable estimate of the spike v in ℓ_2 distance (up to a sign flip), it does not provide a consistent estimator of its support. This brings us to the second phase of our algorithm. Although v̂ is not even expected to be sparse, it is easy to see that the largest entries of v̂ should have significant overlap with supp(v). Steps 6 and 7 of the algorithm exploit this property to construct a consistent estimator Q̂ of the support of the spike v. Our second theorem guarantees that this estimator is indeed consistent.

Theorem 2. Consider the spiked covariance model of Eq. (1) satisfying Assumptions A1, A2. For any α, (β_q)_{q ≤ r} ∈ (0, ∞), θ, γ > 0 and integer r, there exist constants c*, τ, ρ, depending on α, (β_q)_{q ≤ r}, γ, θ, r, such that, if Σ_q k_q = Σ_q |supp(v_q)| ≤ c* √n, then the Covariance Thresholding algorithm (Algorithm 1) recovers the joint supports of the v_q with high probability, i.e.

    P{ Q̂ = ∪_q supp(v_q) } → 1  as n, p → ∞.
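Before the remarks below, here is a minimal sketch of Algorithm 1 as stated above, assuming the parameters r, (k_q), θ, τ, ρ are known; the function names are ours and details such as tie-breaking and sign conventions are simplified.

```python
import numpy as np

def soft_threshold(M, t):
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def covariance_thresholding(X, r, k, theta, tau, rho):
    """Sketch of Algorithm 1.  X has 2n rows; k is the list (k_q)_{q<=r}."""
    two_n, p = X.shape
    n = two_n // 2
    X1, X2 = X[:n], X[n:]
    Sigma1 = X1.T @ X1 / n - np.eye(p)          # hat-Sigma  from the first half
    Sigma2 = X2.T @ X2 / n - np.eye(p)          # hat-Sigma' from the second half
    eta = soft_threshold(Sigma1, tau / np.sqrt(n))          # step 4
    eigvals, eigvecs = np.linalg.eigh(eta)
    V_hat = eigvecs[:, ::-1][:, :r]             # step 5: top-r eigenvectors
    Q_hat = set()
    for q in range(r):
        v = V_hat[:, q]
        s_q = v * (np.abs(v) >= theta / (2 * np.sqrt(k[q])))          # step 6
        Q_hat |= set(np.flatnonzero(np.abs(Sigma2 @ s_q) >= rho))     # step 7
    return sorted(Q_hat)
```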
Given the results above, it is useful to pause for a few remarks.

Remark 2.1. We focus on consistent estimation of the joint support ∪_q supp(v_q) of the spikes. In the rank-one case, this obviously corresponds to standard support recovery. Once this is accomplished, estimating the individual supports poses no additional difficulty: indeed, since |∪_q supp(v_q)| = O(√n), an extra step with n fresh samples x_i restricted to Q̂ yields estimates for v_q with ℓ_2 error of order √(k/n). This implies consistent estimation of supp(v_q) when the v_q have entries bounded below as in Assumption A2.

Remark 2.2. Recovering the signed supports Q_{q,+} = {i ∈ [p] : v_{q,i} > 0} and Q_{q,−} = {i ∈ [p] : v_{q,i} < 0} is possible using the same technique as recovering the supports supp(v_q) above, and poses no additional difficulty.

Remark 2.3. Assumption A2 requires |v_{q,i}| ≥ θ/√k_q for all i ∈ Q_q. This is a standard requirement in the support recovery literature [Wai09, MB06]. The second part of Assumption A2 guarantees that, when the supports of two spikes overlap, their entries are roughly of the same order. This is necessary for our proof technique to go through. Avoiding such an assumption altogether remains an open question.

Our covariance thresholding algorithm assumes knowledge of the correct support sizes k_q. Notice that the same assumption is made in earlier theoretical work, e.g. in the analysis of SDP-based reconstruction by Amini and Wainwright [AW09]. While this assumption is not realistic in applications, it helps to focus our exposition on the most challenging aspects of the problem. Further, a ballpark estimate k_0 of Σ_q k_q is actually sufficient; given such an estimate, we use the following steps in lieu of Steps 6 and 7 of Algorithm 1 (an illustrative sketch of this modified cleaning step follows Theorem 3 below):

6′: Define s′_q ∈ R^p by
        s′_{q,i} = v̂_{q,i}   if |v̂_{q,i}| > θ/(2√k_0),
                   0          otherwise.        (8)
7′: Return Q̂ = ∪_q {i ∈ [p] : |(Σ̂′ s′_q)_i| ≥ ρ}.        (9)

The next theorem shows that this procedure is effective even if k_0 overestimates Σ_q k_q by an order of magnitude. Its proof is deferred to Section 2 of the supplement.

Theorem 3. Consider the spiked covariance model of Eq. (1). For any α, β ∈ (0, ∞), let the constants c*, τ, ρ be given as in Theorem 2. Further assume k = Σ_q |supp(v_q)| ≤ c* √n and Σ_q k_q ≤ k_0 ≤ 20 Σ_q k_q. Then the Covariance Thresholding algorithm (Algorithm 1), with the definitions in Eqs. (8) and (9), recovers the joint supports of the v_q successfully, i.e. P{ Q̂ = ∪_q supp(v_q) } → 1 as n, p → ∞.
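As an illustration, the modified cleaning step of Eqs. (8) and (9) changes only the last lines of the sketch given after Theorem 2: a single ballpark estimate k0 replaces the individual k_q. The names are again ours.

```python
import numpy as np

def clean_support_k0(V_hat, Sigma2, theta, k0, rho):
    """Support cleaning with one ballpark sparsity estimate k0 (Eqs. (8)-(9)).

    V_hat  : (p, r) matrix whose columns are the eigenvectors of eta(hat-Sigma)
    Sigma2 : hat-Sigma' computed from the second half of the samples
    """
    p, r = V_hat.shape
    Q_hat = set()
    for q in range(r):
        v = V_hat[:, q]
        s_q = v * (np.abs(v) > theta / (2 * np.sqrt(k0)))              # Eq. (8)
        Q_hat |= set(np.flatnonzero(np.abs(Sigma2 @ s_q) >= rho))      # Eq. (9)
    return sorted(Q_hat)
```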
3 Practical aspects and empirical results

Specializing to the rank-one case, Theorems 1 and 2 show that Covariance Thresholding succeeds with high probability for a number of samples n ≳ k², while Diagonal Thresholding requires n ≳ k² log p. The reader might wonder whether eliminating the log p factor has any practical relevance or is a purely conceptual improvement. Figure 1 presents simulations on synthetic data under the strictly sparse model, using the Covariance Thresholding procedure of Algorithm 1, as analyzed in the proof of Theorem 2. The objective is to check whether the log p factor has an impact at moderate p. We compare this with Diagonal Thresholding. We plot the empirical success probability as a function of k/√n for several values of p, with p = n. The empirical success probability was computed using 100 independent instances of the problem. A few observations are of interest: (i) Covariance Thresholding appears to have a significantly larger success probability in the difficult regime where Diagonal Thresholding starts to fail; (ii) the curves for Diagonal Thresholding appear to decrease monotonically with p, indicating that k proportional to √n is not the right scaling for this algorithm (as is known from theory); (iii) in contrast, the curves for Covariance Thresholding become steeper for larger p and, in particular, the success probability increases with p for k ≤ 1.1 √n. This indicates a sharp threshold at k = const · √n, as suggested by our theory.

[Figure 1: The support recovery phase transitions for Diagonal Thresholding (left), Covariance Thresholding (center) and the data-driven version of Section 3 (right); each panel plots the fraction of support recovered against k/√n for p = 625, 1250, 2500, 5000. For Covariance Thresholding, the fraction of support recovered correctly increases monotonically with p as long as k ≤ c√n with c ≈ 1.1; further, it appears to converge to one throughout this region. For Diagonal Thresholding, the fraction of support recovered correctly decreases monotonically with p for all k of order √n. This confirms that Covariance Thresholding (with or without knowledge of the support size k) succeeds with high probability for k ≤ c√n, while Diagonal Thresholding requires a significantly sparser principal component.]

In terms of practical applicability, Algorithm 1 has the shortcoming of requiring knowledge of the problem parameters β_q, r, k_q. Furthermore, the thresholds ρ, τ suggested by theory need not be optimal. We next describe a principled approach to estimating (where possible) the parameters of interest and running the algorithm in a purely data-dependent manner. Assume the following model, for i ∈ [n]:

    x_i = µ + Σ_q √β_q u_{q,i} v_q + σ z_i ,

where µ ∈ R^p is a fixed mean vector, the u_i have mean 0 and variance 1, and the z_i have mean 0 and covariance I_p. Note that our focus in this section is not on rigorous analysis, but rather on demonstrating a principled approach to applying covariance thresholding in practice. We proceed as follows (an illustrative code sketch of these steps is given at the end of this list):

Estimating µ, σ: We let µ̂ = Σ_{i=1}^n x_i / n be the empirical mean estimate for µ. Further, letting X̄ = X − 1µ̂^T, we see that pn − (Σ_q k_q) n ≈ pn entries of X̄ have mean 0 and variance σ². We let σ̂ = MAD(X̄)/ν, where MAD(·) denotes the median absolute deviation of the entries of the matrix in the argument, and ν is a constant scale factor. Guided by the Gaussian case, we take ν = Φ^{-1}(3/4) ≈ 0.6745.

Choosing τ: Although in the statement of the theorem our choice of τ depends on the SNR β/σ², we believe this to be an artifact of our analysis. Indeed, it is reasonable to threshold 'at the noise level', as follows. The noise component of entry (i, j) of the sample covariance (ignoring lower-order terms) is given by σ²⟨z_i, z_j⟩/n. By the central limit theorem, ⟨z_i, z_j⟩/√n is approximately N(0, 1). Consequently, σ²⟨z_i, z_j⟩/n is approximately N(0, σ⁴/n), and we need to choose the (rescaled) threshold proportional to √(σ⁴) = σ². Using the previous estimates, we let τ = ν′ σ̂² for a constant ν′. In simulations, a choice 3 ≤ ν′ ≤ 4 appears to work well.

Estimating r: We define Σ̂ = X̄^T X̄/n − σ̂² I_p and soft-threshold it to obtain η(Σ̂), using τ as above. Our proof of Theorem 1 relies on the fact that η(Σ̂) has r eigenvalues that are separated from the bulk of the spectrum [4]. Hence, we estimate r by r̂, the number of eigenvalues of η(Σ̂) separated from the bulk.

Estimating v_q: Let v̂_q denote the q-th eigenvector of η(Σ̂). Our theoretical analysis indicates that v̂_q is expected to be close to v_q. In order to denoise v̂_q, we assume v̂_q ≈ (1 − δ) v_q + ε_q, where ε_q is additive random noise. We then threshold v̂_q at the noise level to recover a better estimate of v_q. To do this, we estimate the standard deviation of the noise ε_q by σ̂_ε = MAD(v̂_q)/ν; here we are again guided by the Gaussian heuristic, with ν ≈ 0.6745. Since v_q is sparse, this procedure returns a good estimate of the size of the noise deviation. We let η_H(v̂_q) denote the vector obtained by hard-thresholding v̂_q: we set (η_H(v̂_q))_i = v̂_{q,i} if |v̂_{q,i}| ≥ ν′ σ̂_{ε_q}, and 0 otherwise. We then let v̂′_q = η_H(v̂_q)/‖η_H(v̂_q)‖ and return v̂′_q as our estimate for v_q.

[4] The support of the bulk of the spectrum can be computed numerically from the results of [CS12].
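A compact sketch of this data-driven pipeline follows, using the Gaussian-heuristic choices above (ν ≈ 0.6745 and τ = ν′σ̂² with ν′ ∈ [3, 4]). The eigenvalue-separation rule for r̂ is implemented here by a crude gap heuristic of our own; it stands in for the bulk-edge computation of [CS12] and is only a hypothetical simplification, as are all function names.

```python
import numpy as np

NU = 0.6745                                    # Phi^{-1}(3/4), MAD scale factor

def mad(a):
    """Median absolute deviation of all entries of a."""
    a = np.asarray(a).ravel()
    return np.median(np.abs(a - np.median(a)))

def data_driven_covariance_thresholding(X, nu_prime=3.5):
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    Xbar = X - mu_hat                          # centered data
    sigma_hat = mad(Xbar) / NU                 # noise scale estimate
    tau = nu_prime * sigma_hat**2              # threshold "at the noise level"
    Sigma = Xbar.T @ Xbar / n - sigma_hat**2 * np.eye(p)
    eta = np.sign(Sigma) * np.maximum(np.abs(Sigma) - tau / np.sqrt(n), 0.0)
    eigvals, eigvecs = np.linalg.eigh(eta)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # descending order
    # Crude r estimate: count leading eigenvalues separated from the bulk by an
    # unusually large gap (our stand-in for the bulk-edge computation of [CS12]).
    gaps = eigvals[:-1] - eigvals[1:]
    r_hat = max(1, int(np.argmax(gaps[: p // 2] < 5 * np.median(gaps))))
    # Denoise each retained eigenvector by hard thresholding at the MAD noise level.
    V_hat = []
    for q in range(r_hat):
        v = eigvecs[:, q]
        sigma_eps = mad(v) / NU
        v_ht = np.where(np.abs(v) >= nu_prime * sigma_eps, v, 0.0)
        norm = np.linalg.norm(v_ht)
        V_hat.append(v_ht / norm if norm > 0 else v)
    return np.column_stack(V_hat), r_hat
```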
Note that, while different in several respects, this empirical approach shares the same philosophy as Algorithm 1. On the other hand, the data-driven algorithm presented in this section is less straightforward to analyze, a task that we defer to future work. Figure 1 also shows the results of a support recovery experiment using the data-driven version of this section. Covariance thresholding in this form also appears to work for supports of size k ≤ const · √n.

Figure 2 shows the performance of vanilla PCA, Diagonal Thresholding and Covariance Thresholding on the 'Three Peak' example of Johnstone and Lu [JL04]. This signal is sparse in the wavelet domain, and the simulations employ the data-driven version of covariance thresholding. A similar experiment with the 'box' example of Johnstone and Lu is provided in the supplement. These experiments demonstrate that, while for large values of n both Diagonal Thresholding and Covariance Thresholding perform well, the latter appears superior for smaller values of n.

[Figure 2: The results of simple PCA, Diagonal Thresholding and Covariance Thresholding (respectively) for the 'Three Peak' example of Johnstone and Lu [JL09] (see Figure 1 of that paper). The signal is sparse in the Symmlet 8 basis. We use β = 1.4, p = 4096, and the rows correspond to sample sizes n = 1024, 1625, 2580, 4096 respectively. Parameters for Covariance Thresholding are chosen as in Section 3, with ν′ = 4.5. Parameters for Diagonal Thresholding are from [JL09]. On each curve, we superpose the clean signal (dotted).]

References

[AW09] Arash A. Amini and Martin J. Wainwright, High-dimensional analysis of semidefinite relaxations for sparse principal components, The Annals of Statistics 37 (2009), no. 5B, 2877–2921.
[BBAP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, The Annals of Probability (2005), 1643–1697.
[BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi, The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices, Advances in Mathematics 227 (2011), no. 1, 494–521.
[BL08] Peter J. Bickel and Elizaveta Levina, Regularized estimation of large covariance matrices, The Annals of Statistics (2008), 199–227.
[BR13] Quentin Berthet and Philippe Rigollet, Computational lower bounds for sparse PCA, arXiv preprint arXiv:1304.0828 (2013).
[CDMF09] Mireille Capitaine, Catherine Donati-Martin, and Delphine Féral, The largest eigenvalues of finite rank deformation of large Wigner matrices: convergence and nonuniversality of the fluctuations, The Annals of Probability 37 (2009), no. 1, 1–47.
[CMW+13] T. Tony Cai, Zongming Ma, Yihong Wu, et al., Sparse PCA: optimal rates and adaptive estimation, The Annals of Statistics 41 (2013), no. 6, 3074–3110.
[CS12] Xiuyuan Cheng and Amit Singer, The spectrum of random inner-product kernel matrices, arXiv preprint arXiv:1202.3155 (2012).
[dBG08] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui, Optimal solutions for sparse principal component analysis, The Journal of Machine Learning Research 9 (2008), 1269–1294.
[dEGJL07] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, SIAM Review 49 (2007), no. 3, 434–448.
[DJ94] D. L. Donoho and I. M. Johnstone, Minimax risk over ℓp-balls, Probability Theory and Related Fields 99 (1994), 277–303.
[DK70] Chandler Davis and W. M. Kahan, The rotation of eigenvectors by a perturbation. III, SIAM Journal on Numerical Analysis 7 (1970), no. 1, 1–46.
[EK10a] Noureddine El Karoui, On information plus noise kernel random matrices, The Annals of Statistics 38 (2010), no. 5, 3191–3216.
[EK10b] Noureddine El Karoui, The spectrum of kernel random matrices, The Annals of Statistics 38 (2010), no. 1, 1–50.
[FP07] Delphine Féral and Sandrine Péché, The largest eigenvalue of rank one deformation of large Wigner matrices, Communications in Mathematical Physics 272 (2007), no. 1, 185–228.
[JL04] Iain M. Johnstone and Arthur Yu Lu, Sparse principal components analysis, Unpublished manuscript (2004).
[JL09] Iain M. Johnstone and Arthur Yu Lu, On consistency and sparsity for principal components analysis in high dimensions, Journal of the American Statistical Association 104 (2009), no. 486.
[Joh02] Iain M. Johnstone, Function estimation and Gaussian sequence models, Unpublished manuscript (2002).
[KNV13] R. Krauthgamer, B. Nadler, and D. Vilenchik, Do semidefinite relaxations really solve sparse PCA?, CoRR abs/1306.3690 (2013).
[KY13] Antti Knowles and Jun Yin, The isotropic semicircle law and deformation of Wigner matrices, Communications on Pure and Applied Mathematics (2013).
[MB06] Nicolai Meinshausen and Peter Bühlmann, High-dimensional graphs and variable selection with the lasso, The Annals of Statistics (2006), 1436–1462.
[MW13] Zongming Ma and Yihong Wu, Computational barriers in minimax submatrix detection, arXiv preprint arXiv:1309.5914 (2013).
[MWA05] Baback Moghaddam, Yair Weiss, and Shai Avidan, Spectral bounds for sparse PCA: exact and greedy algorithms, Advances in Neural Information Processing Systems, 2005, pp. 915–922.
[Pau07] Debashis Paul, Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statistica Sinica 17 (2007), no. 4, 1617.
[VL12] Vincent Q. Vu and Jing Lei, Minimax rates of estimation for sparse PCA in high dimensions, Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
[Wai09] Martin J. Wainwright, Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso), IEEE Transactions on Information Theory 55 (2009), no. 5, 2183–2202.
[WTH09] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics 10 (2009), no. 3, 515–534.
[ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani, Sparse principal component analysis, Journal of Computational and Graphical Statistics 15 (2006), no. 2, 265–286.