# sparse_pca_via_covariance_thresholding__340e2758.pdf

Journal of Machine Learning Research 17 (2016) 1-41 Submitted 4/15; Revised 4/16; Published 8/16

Sparse PCA via Covariance Thresholding

Yash Deshpande yash.deshpande@stanford.edu Department of Electrical Engineering Stanford University Stanford, CA 94305, USA

Andrea Montanari montanari@stanford.edu Departments of Electrical Engineering and Statistics Stanford University Stanford, CA 94305, USA

Editor: Alexander Rakhlin

In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension n p and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here each of the principal components v1, . . . , vr has at most s0 non-zero entries. We are particularly interested in the high dimensional regime wherein p is comparable to, or even much larger than n. In an inﬂuential paper, Johnstone and Lu (2004) introduced a simple algorithm that estimates the support of the principal vectors v1, . . . , vr by the largest entries in the diagonal of the empirical covariance. This method can be shown to identify the correct support with high probability if s0 K1 p

n/ log p, and to fail with high probability if s0 K2 p

n/ log p for two constants 0 < K1, K2 < . Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees. Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler, Vilenchik, et al. (2015). On the basis of numerical simulations (for the rank-one case), these authors conjectured that covariance thresholding correctly recover the support with high probability for s0 K n (assuming n of the same order as p). We prove this conjecture, and in fact establish a more general guarantee including higher-rank as well as n much smaller than p. Recent lower bounds (Berthet and Rigollet, 2013; Ma and Wigderson, 2015) suggest that no polynomial time algorithm can do signiﬁcantly better. The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before. Using these, we also derive sharp bounds for estimating the population covariance, and the principal component (with ℓ2-loss).

c 2016 Yash Deshpande and Andrea Montanari.

Deshpande and Montanari

1. Introduction

In the spiked covariance model proposed by Johnstone and Lu (2004), we are given data x1, x2, . . . , xn with xi Rp of the form1:

βq uq,i vq + zi , (1)

Here v1, . . . , vr Rp is a set of orthonormal vectors, that we want to estimate, while uq,i N(0, 1) and zi N(0, Ip) are independent and identically distributed. The quantity βq R>0 is a measure of signal-to-noise ratio. In the rest of this introduction, in order to simplify the exposition, we will refer to the rank one case and drop the subscript q {1, 2, . . . , r}. Further, we will assume n to be of the same order as p. Our results and proofs hold for a broad range of scalings of r, p, n, and will be stated in general form. The standard method of principal component analysis involves computing the sample covariance matrix G = n 1 Pn i=1 xix T i and estimates v = v1 by its principal eigenvector v PC(G). It is a well-known fact that, in the high dimensional regime, this yields an inconsistent estimate (see Johnstone and Lu (2009)). Namely v PC v 0 unless p/n 0. Even worse, Baik, Ben Arous, and P ech e (2005) and Paul (2007) demonstrate the following phase transition phenomenon. Assuming that p/n α (0, ), if β < α the estimate is asymptotically orthogonal to the signal, i.e. v PC, v 0. On the other hand, for β > α, | v PC, v | remains bounded away from zero as n, p . This phase transition phenomenon has attracted considerable attention recently within random matrix theory (see, e.g. F eral and P ech e, 2007; Capitaine et al., 2009; Benaych-Georges and Nadakuditi, 2011; Knowles and Yin, 2013). These inconsistency results motivated several eﬀorts to exploit additional structural information on the signal v. In two inﬂuential papers, Johnstone and Lu (2004, 2009) considered the case of a signal v that is sparse in a suitable basis, e.g. in the wavelet domain. Without loss of generality, we will assume here that v is sparse in the canonical basis e1, . . . ep. In a nutshell, Johnstone and Lu (2009) propose the following:

1. Order the diagonal entries of the Gram matrix Gi(1),i(1) Gi(2),i(2) Gi(p),i(p), and let J {i(1), i(2), . . . , i(k)} be the set of indices corresponding to the s0 largest entries.

2. Set to zero all the entries Gi,j of G unless i, j J, and estimate v with the principal eigenvector of the resulting matrix.

Johnstone and Lu formalized the sparsity assumption by requiring that v belongs to a weak ℓq-ball with q (0, 1). Instead, here we consider a strict sparsity constraint where v has exactly s0 non-zero entries, with magnitudes bounded below by θ/ s0 for some constant θ > 0. Amini and Wainwright (2009) studied the more restricted case when every entry of v has equal magnitude of 1/ s0. Within this restricted model, they proved diagonal thresholding successfully recovers the support of v provided the sample size n satisﬁes2

1. Throughout the paper, we follow the convention of denoting scalars by lowercase, vectors by lowercase boldface, and matrices by uppercase boldface letters. 2. Throughout the introduction, we write f(n) g(n) as a shorthand of f(n) K g(n) for a some constant K = K(r, β) .

Sparse PCA via Covariance Thresholding

n s2 0 log p (see Amini and Wainwright, 2009). This result is a striking improvement over vanilla PCA. While the latter requires a number of samples scaling with the number of parameters n p, sparse PCA via diagonal thresholding achieves the same objective with a number of samples that scales with the number of non-zero parameters, n s2 0 log p. At the same time, this result is not as strong as might have been expected. By searching exhaustively over all possible supports of size s0 (a method that has complexity of order ps0) the correct support can be identiﬁed with high probability as soon as n s0 log p. No method can succeed for much smaller n, because of information theoretic obstructions. We refer the reader to Amini and Wainwright (2009) for more details. Over the last ten years, a signiﬁcant eﬀort has been devoted to developing practical algorithms that outperform diagonal thresholding, see e.g. Moghaddam et al. (2005); Zou et al. (2006); d Aspremont et al. (2007, 2008); Witten et al. (2009). In particular, d Aspremont et al. (2007) developed a promising M-estimator based on a semideﬁnite programming (SDP) relaxation. Amini and Wainwright (2009) also carried out an analysis of this method and proved that, if3 (i) n K(β) s0 log(p s0)p, and (ii) the SDP solution has rank one, then the SDP relaxation provides a consistent estimator of the support of v. At ﬁrst sight, this appears as a satisfactory solution of the original problem. No procedure can estimate the support of v from less than s0 log p samples, and the SDP relaxation succeeds in doing it from at most a constant factor more samples. This picture was upset by a recent, remarkable result by Krauthgamer et al. (2015) who showed that the rankone condition assumed by Amini and Wainwright does not hold for n s0 (n/ log p). This result is consistent with recent work of Berthet and Rigollet (2013) demonstrating that sparse PCA cannot be performed in polynomial time in the regime s0 n, under a certain computational complexity conjecture for the so-called planted clique problem. In summary, the sparse PCA problem demonstrates a fascinating interplay between computational and statistical barriers.

From a statistical perspective, and disregarding computational considerations, the support of v can be estimated consistently if and only if s0 n/ log p. This can be done, for instance, by exhaustive search over all the p s0 possible supports of v. We refer to Vu and Lei (2012); Cai et al. (2013) for a minimax analysis.

From a computational perspective, the problem appears to be much more diﬃcult. There is rigorous evidence (Berthet and Rigollet, 2013; ?; Ma and Wigderson, 2015; Wang et al., 2014) that no polynomial algorithm can reconstruct the support unless s0 n. On the positive side, a very simple algorithm (Johnstone and Lu s diagonal thresholding) succeeds for s0 p

Of course, several elements are still missing in this emerging picture. In the present paper we address one of them, providing an answer to the following question:

Is there a polynomial time algorithm that is guaranteed to solve the sparse PCA problem with high probability for p

n/ log p s0 n?

3. Throughout the paper, we denote by K constants that can depend on problem parameters r and β. We denote by upper case C (lower case c) generic absolute constants that are bigger (resp. smaller) than 1, but which might change from line to line.

Deshpande and Montanari

We answer this question positively by analyzing a covariance thresholding algorithm that proceeds, brieﬂy, as follows. (A precise, general deﬁnition, with some technical changes is given in the next section.)

1. Form the empirical covariance matrix G and set to zero all its entries that are in modulus smaller than τ/ n, for τ a suitably chosen constant.

2. Compute the principal eigenvector bv1 of this thresholded matrix.

3. Denote by B {1, . . . , p} be the set of indices corresponding to the s0 largest entries of bv1.

4. Estimate the support of v by cleaning the set B. (Brieﬂy, v is estimated by thresholding Gbv B with bv B obtained by zeroing the entries outside B.)

Such a covariance thresholding approach was proposed in Krauthgamer et al. (2015), and is in turn related to earlier work by Bickel and Levina (2008b); Cai et al. (2010). The formulation discussed in the next section presents some technical diﬀerences that have been introduced to simplify the analysis. Notice that, to simplify proofs, we assume s0 to be known: this issue is discussed in the next two sections. The rest of the paper is organized as follows. In the next section we provide a detailed description of the algorithm and state our main results. The proof strategy for our results is explained in Section 3. Our theoretical results assume full knowledge of problem parameters for ease of proof. In light of this, in Section 4 we discuss a practical implementation of the same idea that does not require prior knowledge of problem parameters, and is data-driven. We also illustrate the method through simulations. The complete proofs are in Sections 5, 7 and 6 respectively. A preliminary version of this paper appeared in (Deshpande and Montanari, 2014). This paper extends signiﬁcantly the results in Deshpande and Montanari (2014). In particular, by following an analogous strategy, we improve greatly the bounds obtained by Deshpande and Montanari (2014). This signifantly improves the regimes of (s0, p, n) on which we can obtain non-trivial results. The proofs follow a similar strategy but are, correspondingly, more careful.

2. Algorithm and main results

We provide a detailed description of the covariance thresholding algorithm for the general model (1) in Table 1. For notational convenience, we shall assume that 2n sample vectors are given (instead of n): {xi}1 i 2n. We start by splitting the data into two halves: (xi)1 i n and (xi)n<i 2n and compute the respective sample covariance matrices G and G respectively. Deﬁne Σ to be the population covariance minus identity. i.e.

q=1 βqvqv T q . (2)

Throughout, we let Qq and sq denote the support of vq and its size respectively, for q {1, 2, . . . , r}. We further let Q = r q=1Qq and s0 = |Q|. The matrix G is used, in steps 1 to

Sparse PCA via Covariance Thresholding

Algorithm 1 Covariance Thresholding

1: Input: Data (xi)1 i 2n, parameter s0 N, τ, ρ R 0;

2: Compute the empirical covariance matrices G Pn i=1 xix T i /n , G P2n i=n+1 xix T i /n;

3: Compute bΣ = G Ip (resp. bΣ = G Ip);

4: Compute the matrix η(bΣ) by soft-thresholding the entries of bΣ:

bΣij τ n if bΣij τ/ n,

0 if τ/ n < bΣij < τ/ n, bΣij + τ n if bΣij τ/ n,

5: Let (bvq)q r be the ﬁrst r eigenvectors of η(bΣ);

6: Output: b Q = {i [p] : q s.t. |(bΣ bvq)i| ρ}.

4 to obtain a good estimate η(bΣ) for the low rank part of the population covariance Σ. The algorithm ﬁrst computes bΣ, a centered version of the empirical covariance of the samples as follows:

bΣ G Ip, (3)

where G = n 1 P i n xix T i is the sample covariance matrix.

It then obtains the estimate η(bΣ) Rp p by soft thresholding each entry of bΣ at a threshold τ/ n. Explicitly:

ij η bΣij; τ n

Here η : R R+ R is the soft thresholding function

z λ if z λ z + λ if z λ 0 otherwise.

In step 5 of the algorithm, this estimate is used to construct good estimates bvq of the eigenvectors vq. Finally, in step 6, these estimates are combined with the (independent) second half of the data G to construct estimators b Qq for the support of the individual eigenvectors vq. In the ﬁrst two subsections we will focus on the estimation of Σ and the individual principal components. Our results on support recovery are provided in the ﬁnal subsection.

2.1 Estimating the population covariance

Our ﬁrst result bounds the estimation error of the soft thresholding procedure in operator norm.

Deshpande and Montanari

Theorem 1 There exist numerical constants C1, C2, C > 0 such that the following happens. Assume n > C log p, n > s2 0 and let τ = C1(β 1) p

log(p/s2 0). We keep the thresholding level τ according to

τ when τ log p/2, s2 0 p/e C2τ when τ log p/2, s0 p/e 0 otherwise.

. Then with probability 1 o(1):

η(bΣ) Σ op C

s2 0 1 . (7)

At this point, it is useful to compare Theorem 1 with available results in the literature. Classical denoising theory (Donoho and Johnstone, 1994; Johnstone, 2015) provides upper bounds on the estimation error of soft-thresholding. However, estimation error is measured by (element-wise) ℓp norm, while here we are interested in operator norm. Bickel and Levina (2008a,b); Karoui (2008); Cai, Zhang, Zhou, et al. (2010); Cai and Liu (2011) considered the operator norm error of thresholding estimators for structured covariance matrices. Specializing to our case of exact sparsity, the result of Bickel and Levina (2008a) implies that, with high probability:

ηH(bΣ) Σ op C0

Here ηH( , ) is the hard-thresholding function: ηH(z) = z I(|z| τ/ n), and the threshold is chosen to be τ = C1 log p. Also, ηH(M) is the matrix obtained by thresholding the entries of M. In fact, Cai et al. (2012) showed that the rate in (8) is minimax optimal over the class of sparse population covariance matrices, with at most s0 non-zero entries per row, under the assumption s2 0/n C(log p) 3. Theorem 1 ensures consistency under a weaker sparsity condition, viz. s2 0/n 0 is suﬃcient. Also, the resulting rate depends on log(p/s2 0) instead of log p. In other words, in order to achieve η(bΣ) Σ op < ε for a ﬁxed ε, it is suﬃcient s0 ε n as opposed to s0 p

n/ log p. Crucially, in this regime for s0 = Θ(ε n), Theorem 1 suggests a threshold of order τ = Θ( p

log(1/ε)) as opposed to τ = C1 log p which is used in Bickel and Levina (2008a); Cai et al. (2012). As we will see in Section 3, this regime mathematically more challenging than the one of Bickel and Levina (2008a); Cai et al. (2012). By setting τ = C1 log p for a large enough constant C1, all the entries of bΣ outside the support of Σ are set to 0. In contrast, a large part of our proof is devoted to control the operator norm of the noise part of bΣ.

2.2 Estimating the principal components

We next turn to the question of estimating the principal components v1, . . . vr. Of course, these are not identiﬁable if there are degeneracies in the population eigenvalues β1, β2, . . . , βr. We thus introduce the following identiﬁability condition.

Sparse PCA via Covariance Thresholding

A1 The spike strengths β1 > β2 > . . . βr are all distinct. We denote by β max(β1, . . . , βr) and βmin minq =q (β1 β2, β2 β3, . . . , βr). Namely, β is the largest signal strength and βmin is the minimum gap.

We measure estimation error through the following loss, deﬁned for x, y Sp 1 {v Rp : v = 1}:

2 min s {+1, 1}

x s y 2 (9)

= 1 | x, y | . (10)

Notice the minimization over the sign s {+1, 1}. This is required because the principal components v1, . . . , vr are only identiﬁable up to a sign. Analogous results can obtained for alternate loss functions such as the projection distance:

2 xx T yy T F = p

1 x, y 2. (11)

The theorem below is an immediate consequence of Theorem 1. In particular, it uses the guarantee of Theorem 1 to show that the corresponding principal components of η(bΣ) provide good estimates of the principal components vq, 1 q r.

Theorem 2 There exists a numerical constant C such that the following holds. Suppose that Assumption A1 holds in addition to the conditions n > C log p, s2 0 < n, and s2 0 < p/e. Set τ as according to Theorem 1, and let bv1, . . . , bvr denote the r principal eigenvectors of η(bΣ; τ/ n). Then, with probability 1 o(1)

max q [r] L(bvq, vq) C β2 min

s2 0 . (12)

Proof Let η(bΣ; τ/ n) Σ. By Davis-Kahn sin-theta theorem (Davis and Kahan, 1970), we have, for βmin > op,

L(bvq, vq) 1

For β2 min > 2C(s2 0(β2 1)/n) log(p/s2 0), the claim follows by using Theorem 1. If β2 min 2C(s2 0(β2 1)/n) log(p/s2 0), the claim is obviously true since L(bvq, vq) 1 always.

2.3 Support recovery

Finally, we consider the question of support recovery of the principal components vq. The second phase of our algorithm aims at estimating union of the supports Q = Q1 Qr from the estimated principal components bvq. Note that, although bvq is not even expected to be sparse, it is easy to see that the largest entries of bvq should have signiﬁcant overlap with supp(vq). Step 6 of the algorithm exploit this property to construct a consistent estimator b Qq of the support of the spike vq. We will require the following assumption to ensure support recovery.

Deshpande and Montanari

A2 There exist constants θ, γ > 0 such that the following holds. The non-zero entries of the spikes satisfy |vq,i| θ/ s0 for all i Qq. Further, for any q, q |vq,i/vq ,i| γ for every i Qq Qq . Without loss of generality, we will assume γ 1.

Theorem 3 Assume the spiked covariance model of Eq. (1) satisfying assumptions A1 and A2, and further n > C log p, s2 0 < n, and s2 0 < p/e for C a large enough numerical constant. Consider the Covariance Thresholding algorithm of Table 1, with τ as in Theorem 1 ρ = βminθ/(2 s0). Then there exists K0 = K0(θ, γ, β, βmin) such that, if

n K0s2 0r log p

then the algorithm recovers the union of supports of vq with probability 1 o(1) (i.e. we have b Q = Q).

The proof in Section 7 also provides an explicit expression for the constant K0.

Remark 4 In Assumption A2, the requirement on the minimum size of |vq,i| is standard in support recovery literature (see, e.g. Wainwright, 2009; Meinshausen and B uhlmann, 2006). Additionally, however, we require that when the supports of vq, vq overlap, they are of the same order, quantiﬁed by the parameter γ. Relaxing this condition is a potential direction for future work.

Remark 5 Recovering the signed supports Qq,+ = {i [p] : vq,i > 0} and Qq, = {i [p] : vq,i < 0}, up to a sign ﬂip, is possible using the same technique as recovering the supports supp(vq) above, and poses no additional diﬃculty.

3. Algorithm intuition and proof strategy

For the purposes of exposition, throughout this section, we will assume that r = 1 and drop the corresponding subscript q. Denoting by X Rn p the matrix with rows x1, . . . xn, by Z Rn p the matrix with rows z1, . . . zn, and letting u = (u1, u2, . . . , un), the model (1) can be rewritten as

β u v T + Z . (15)

Recall that bΣ = n 1XTX Ip = G Ip. For β > p

p/n, the principal eigenvector of G, and hence of bΣ is positively correlated with v, i.e. | bv1(bΣ), v | is bounded away from zero. However, for β < p

p/n, the noise component in bΣ dominates and the two vectors become asymptotically orthogonal, i.e. for instance limn | bv1(bΣ), v | = 0. In order to reduce the noise level, we must exploit the sparsity of the spike v. Now, letting β β u 2/n β, and w βZTu/n, we can rewrite bΣ as

bΣ = β vv T + v w T + w v T + 1

n ZTZ Ip, . (16)

Sparse PCA via Covariance Thresholding

For a moment, let us neglect the cross terms (vw T + wv T). The signal component β vv T

is sparse with s2 0 entries of magnitude β θ2/s0, which (in the regime of interest s0 = n/C) is equivalent to Cθ2β/ n. The noise component ZTZ/n Ip is dense with entries of order 1/ n. Assuming s0/ n < c for some small constant c, it should be possible to remove most of the noise by thresholding the entries at level of order 1/ n. For technical reasons, we use the soft thresholding function η : R R 0 R, η(z; τ) = sgn(z)(|z| τ)+. We will omit the second argument from η( ; ) wherever it is clear from context. Consider again the decomposition (16). Since the soft thresholding function η(z; τ/ n) is aﬃne when z τ/ n, we would expect that the following decomposition holds approximately (for instance, in operator norm):

η(bΣ) η β vv T + η 1

Since β β and each entry of vv T has magnitude at least θ2/s0, the ﬁrst term is still approximately rank one, with η β vv T βvv T op s0τ n. (18)

This is straightforward to see since soft thresholding introduces a maximum bias of τ/ n per entry of the matrix, while the factor s0 comes due to the support size of vv T (see Proposition 14 below for a rigorous argument). The main technical challenge now is to control the operator norm of the perturbation η(ZTZ/n Ip). We know that η(ZTZ/n Ip) has entries of variance δ(τ)/n, for δ(τ) exp( cτ 2). If entries were independent with mild tail conditions, this would imply with high probability η 1

op Cδ(τ) r p

n = C exp( cτ 2) r p

for some constant C. Combining the bias bound from Eq. (18) and the heuristic decomposition of Eq. (19) with the decomposition (17) results in the bound

η(bΣ) βvv T op s0τ n + C exp( cτ 2) r p

Our analysis formalizes this argument and shows that such a bound is correct when p < n. The matrix η ZTZ/n Ip is a special case of so-called inner-product kernel random matrices, which have attracted recent interest within probability theory (see El Karoui, 2010a,b; Cheng and Singer, 2013; Fan and Montanari, 2015). The basic object of study in this line of work is a matrix M Rp p of the type:

n I(i = j) . (21)

In other words, fn : R R is a kernel function and is applied entry-wise to the matrix ZTZ/n Ip, with Z a matrix with independent standard normal entries as above and zi Rn

are the columns of Z.

Deshpande and Montanari

The key technical challenge in our proof is the analysis of the operator norm of such matrices, when fn is the soft-thresholding function, with threshold of order 1/ n. Earlier results are not general enough to cover this case. El Karoui (2010a,b) provide conditions under which the spectrum of fn(ZTZ/n Ip) is close to a rescaling of the spectrum of (ZTZ/n Ip). We are interested instead in a diﬀerent regime in which the spectrum of fn(ZTZ/n Ip) is very diﬀerent from the one of (ZTZ/n Ip). Cheng and Singer (2013) consider n-dependent kernels, but their results are asymptotic and concern the weak limit of the empirical spectral distribution of fn(ZTZ/n Ip). This does not yield an upper bound on the spectral norm of fn(ZTZ/n Ip). Finally, Fan and Montanari (2015) consider the spectral norm of kernel random matrices for smooth kernels f, only in the proportional regime n/p c (0, ). Our approach to proving Theorem 1 follows instead the ε-net method: we develop high probability bounds on the maximum Rayleigh quotient:

max y Sp 1 y, η(ZTZ/n Ip)y = max y Sp 1 X

i,j η zi, zj

by discretizing Sp 1 = {y Rp : y = 1}, the unit sphere in p dimensions. For a ﬁxed y, the Rayleigh quotient y, η(ZTZ/n Ip)y is a (complicated) function of the underlying Gaussian random variables Z. One might hope that it is Lipschitz continuous with some Lipschitz constant B = B(n, p, τ, y), thereby implying, by Gaussian isoperimetry (Ledoux, 2005), that it concentrates to the scale B around its expectation (i.e. 0). Then, by a standard union bound argument over a discretization of the sphere, one would obtain that the operator norm of η ZTZ/n Ip is typically no more than p supy Sp 1 B(n, p, τ, y). Unfortunately, this turns out not to be true over the whole space of Z, i.e. the Rayleigh quotient is not Lipschitz continuous in the underlying Gaussian variables Z. Our approach, instead, shows that for typical values of Z, we can control the gradient of y, η(ZTZ/n Ip)y with respect to Z, and extract the required concentration only from such local information of the function. This is formalized in our concentration lemma 9, which we apply extensively while proving Theorem 1. This lemma is a signﬁcantly improved version of the analogous result in Deshpande and Montanari (2014).

4. Practical aspects and empirical results

Specializing to the rank one case, Theorems 2 and 3 show that Covariance Thresholding succeeds with high probability for a number of samples n s2 0, while Diagonal Thresholding requires n s2 0 log p. The reader might wonder whether eliminating the log p factor has any practical relevance or is a purely conceptual improvement. Figure 1 presents simulations on synthetic data under the strictly sparse model, and the Covariance Thresholding algorithm of Table 1, used in the proof of Theorem 3. The objective is to check whether the log p factor has an impact at moderate p. We compare this with Diagonal Thresholding. We plot the empirical success probability as a function of s0/ n for several values of p, with p = n. The empirical success probability was computed by using 100 independent instances of the problem. A few observations are of interest: (i) Covariance Thresholding appears to have a signiﬁcantly larger success probability in the diﬃcult regime where Diagonal Thresholding starts to fail; (ii) The curves for Diagonal Thresholding appear to

Sparse PCA via Covariance Thresholding

0.5 1 1.5 2

Fraction of support recovered

p = 625 p = 1250 p = 2500 p = 5000

0.5 1 1.5 2 0

Fraction of support recovered

p = 625 p = 1250 p = 2500 p = 5000

0.5 1 1.5 2

Fraction of support recovered

p = 625 p = 1250 p = 2500 p = 5000

Figure 1: The support recovery phase transitions for Diagonal Thresholding (left) and Covariance Thresholding (center) and the data-driven version of Section 4 (right). For Covariance Thresholding, the fraction of support recovered correctly increases monotonically with p, as long as s0 c n with c 1.1. Further, it appears to converge to one throughout this region. For Diagonal Thresholding, the fraction of support recovered correctly decreases monotonically with p for all s0 of order n. This conﬁrms that Covariance Thresholding (with or without knowledge of the support size s0) succeeds with high probability for s0 c n, while Diagonal Thresholding requires a signiﬁcantly sparser principal component.

decrease monotonically with p indicating that s0 proportional to n is not the right scaling for this algorithm (as is known from theory); (iii) In contrast, the curves for Covariance Thresholding become steeper for larger p, and, in particular, the success probability increases with p for s0 1.1 n. This indicates a sharp threshold for s0 = const n, as suggested by our theory. In terms of practical applicability, our algorithm in Table 1 has the shortcomings of requiring knowledge of problem parameters s0, β, θ. Furthermore, the thresholds ρ, τ suggested by theory need not be optimal. We next describe a principled approach to estimating (where possible) the parameters of interest and running the algorithm in a purely data-dependent manner. Assume the following model, for i [n]

βquq,ivq + σzi,

where µ Rp is a ﬁxed mean vector, uq,i have mean 0 and variance 1, and zi have mean 0 and covariance Ip. Note that our focus in this section is not on rigorous analysis, but instead to demonstrate a principled approach to applying covariance thresholding in practice. We proceed as follows:

Estimating µ, σ: We let bµ = Pn i=1 xi/n be the empirical mean estimate for µ. Further letting X = X 1bµT we see that pn (P q kq)n pn entries of X are mean 0 and variance σ2. We let bσ = MAD(X)/ν where MAD( ) denotes the median absolute deviation of the entries of the matrix in the argument, and ν is a constant scale factor. Guided by the Gaussian case, we take ν = Φ 1(3/4) 0.6745.

Choosing τ: Although in the statement of the theorem, our choice of τ depends on the SNR β/σ2, it is reasonable to instead threshold at the noise level , as follows. The

Deshpande and Montanari

noise component of entry i, j of the sample covariance (ignoring lower order terms)

is given by σ2 zi, zj /n. By the central limit theorem, zi, zj / n d N(0, 1). Consequently, σ2 zi, zj /n N(0, σ4/n), and we need to choose the (rescaled) threshold proportional to

σ4 = σ2. Using previous estimates, we let τ = ν bσ2 for a constant ν . In simulations, a choice 3 ν 4 appears to work well.

Estimating r: We deﬁne bΣ = X TX/n bσ2Ip and soft threshold it to get η(bΣ) using τ as above. Our proof of Theorem 2 relies on the fact that η(bΣ) has r eigenvalues that are separated from the bulk of the spectrum. Hence, we estimate r using br: the number of eigenvalues separated from the bulk in η(bΣ). The edge of the spectrum can be computed numerically using the Stieltjes transform method as in Cheng and Singer (2013).

Estimating vq: Let bvq denote the qth eigenvector of η(bΣ). Our theoretical analysis indicates that bvq is expected to be close to vq. In order to denoise bvq, we assume bΣbvq (1 δ)vq+εq, where εq is additive random noise (perhaps with some sparse corruptions). We then threshold bΣvq at the noise level to recover a better estimate of vq. To do this, we estimate the standard deviation of the noise ε by c σε = MAD(bvq)/ν. Here we set again guided by the Gaussian heuristic ν 0.6745. Since vq is sparse, this procedure returns a good estimate for the size of the noise deviation. We let bv q denote the vector obtained by hard thresholding bvq: set bv i = bvq,i if |bvq,i| ν bσεq and 0 otherwise. We then let bv q = bv q/ bv q and return bv q as our estimate for vq.

Note that while diﬀerent in several respects this empirical approach shares the same philosophy of the algorithm in Table 1. On the other hand, the data-driven algorithm presented in this section is less straightforward to analyze, a task that we defer to future work. Figure 1 also shows results of a support recovery experiment using the data-driven version of this section. Covariance thresholding in this form also appears to work for supports of size s0 const n. Figure 2 shows the performance of vanilla PCA, Diagonal Thresholding and Covariance Thresholding on the Three Peak example of Johnstone and Lu (2004). This signal is sparse in the wavelet domain and the simulations employ the datadriven version of covariance thresholding. A similar experiment with the box example of Johnstone and Lu is provided in Figure 3. These experiments demonstrate that, while for large values of n both Diagonal Thresholding and Covariance Thresholding perform well, the latter appears superior for smaller values of n.

5. Proof preliminaries

In this section we review some notation and preliminary facts that we will use throughout the paper.

5.1 Notation

We let [m] = {1, 2, . . . , m} denote the set of ﬁrst m integers. We will represent vectors using boldface lower case letters, e.g. u, v, x. The entries of a vector u Rn will be represented

Sparse PCA via Covariance Thresholding

0 1,000 2,000 3,000 4,000 5 10 2

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000 5 10 2

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

Figure 2: The results of Simple PCA, Diagonal Thresholding and Covariance Thresholding (respectively) for the Three Peak example of Johnstone and Lu (2009) (see Figure 1 of the paper). The signal is sparse in the Symmlet 8 basis. We use β = 1.4, p = 4096, and the rows correspond to sample sizes n = 1024, 1625, 2580, 4096 respectively. Parameters for Covariance Thresholding are chosen as in Section 4, with ν = 4.5. Parameters for Diagonal Thresholding are from Johnstone and Lu (2009). On each curve, we superpose the clean signal (dotted).

by ui, i [n]. Matrices are represented using boldface upper case letters e.g. A, X. The entries of a matrix A Rm n are represented by Aij for i [m], j [n]. Given a matrix A Rm n, we generically let a1, a2, . . . , am denote its rows, and a1, a2, . . . , an its columns.

Deshpande and Montanari

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000 2

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000 2

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

0 1,000 2,000 3,000 4,000

Figure 3: The results of Simple PCA, Diagonal Thresholding and Covariance Thresholding (respectively) for a synthetic block-constant function (which is sparse in the Haar wavelet basis). We use β = 1.4, p = 4096, and the rows correspond to sample sizes n = 1024, 1625, 2580, 4096 respectively. Parameters for Covariance Thresholding are chosen as in Section 4, with ν = 4.5. Parameters for Diagonal Thresholding are from Johnstone and Lu (2009). On each curve, we superpose the clean signal (dotted).

For E [m] [n], we deﬁne the projector operator PE : Rm n Rm n by letting PE(A) be the matrix with entries

( Aij if (i, j) E, 0 otherwise. (23)

Sparse PCA via Covariance Thresholding

For a matrix A Rm n, and a set E [n], we deﬁne its column restriction AE Rm n to be the matrix obtained by setting to 0 columns outside E:

( Aij if j E, 0 otherwise.

Similarly y E is obtained from y by setting to zero all indices outside E. The operator norm of a matrix A is denoted by A (or A op) and its Frobenius norm by A F . We write x for the standard ℓ2 norm of a vector x. Other vector norms such as ℓ1 or ℓ are denoted with appropriate subscripts. We let Qq denotes the support of the qth spike vq. Also, we denote the union of the supports of vq by Q = q Qq. The complement of a set E [n] is denoted by Ec. We write η( ; ) for the soft-thresholding function. By η( ; τ) we denote the derivative of η( ; τ) with respect to the ﬁrst argument, which exists Lebesgue almost everywhere. To simplify the notation, we omit the second argument when it is understood from context. For a random variable Z and a measurable set A we write E{Z; A} to denote E{ZI(Z A)}, the expectation of Z constrained to the event A. In the statements of our results, consider the limit of large p and large n with certain conditions on p, n (as in Theorem 2). This limit will be referred to either as n large enough or p large enough where the phrase large enough indicates dependence of p (and thereby n) on speciﬁc problem parameters. The Gaussian distribution function will be denoted by Φ(x) = R x e t2/2 dt/

5.2 Preliminary facts

Let SN 1 denote the unit sphere in N dimensions, i.e. SN 1 = {x RN : x = 1}. We use the following deﬁnition (see Vershynin, 2012, Deﬁnition 5.2) of the ε-net of a set X Rn:

Deﬁnition 6 (Nets, Covering numbers) A subset T ε(X) X is called an ε-net of X if every point in X may be approximated by one in T ε(X) with error at most ε. More precisely:

x X, inf y T ε(X) x y ε.

The minimum cardinality of an ε-net of X, if ﬁnite, is called its covering number.

The following two facts are useful while using ε-nets to bound the spectral norm of a matrix. For proofs, we refer the reader to (see Vershynin, 2012, Lemmas 5.2, 5.4).

Lemma 7 Let Sn 1 be the unit sphere in n dimensions. Then there exists an ε-net of Sn 1, T ε(Sn 1) satisfying:

|T ε(Sn 1)| 1 + 2

Deshpande and Montanari

Lemma 8 Let A Rn n be a symmetric matrix. Then, there exists x T ε(Sn 1) such that

| x, Ax | (1 2ε) A . (24)

Proof Firstly, we have A = maxx Sn 1| x, Ax | = maxx Sn 1 Ax . Let x be the maximizer (which exists as Sn 1 is compact and | x, Ax | is continuous in x). Choose x T ε n so that x x ε. Then:

x, Ax = x x , A(x + x ) + x , Ax . (25)

The lemma then follows as | x, A(x x ) | x + x A x x 2ε A .

Throughout the paper we will denote by T ε N an ε-net on the unit sphere SN 1 that satisﬁes Lemma 7. For a subset of indices S [N] we denote by T ε N(S) the natural isometric embedding of T ε S in SN 1. We now state a general concentration lemma. This will be our basic tool to establish Theorem 2, and thereby Theorem 3.

Lemma 9 Let z N(0, IN) be vector of N i.i.d. standard normal variables. Suppose S is a ﬁnite set and we have functions Fs : RN R for every s S. Assume G RN RN is a Borel set such that for Lebesgue-almost every (x, y) G:

max s S max t [0,1] Fs(

1 ty) L . (26)

Then, for any > 0:

P n max s S |Fs(z) EFs(z)| o C|S| exp 2

2 E n max s S (Fs(z) Fs(z ))2 ; Gco .

Here z is an independent copy of z.

Proof We use the Maurey-Pisier method along with symmetrization. By centering, assume that EFs(z) = 0 for all s S. Further, by including the functions Fs in the set S (at most doubling its size), it suﬃces to prove the one-sided version of the inequality:

P{max s S Fs(z) } C|S| exp 2

2 E{max s (Fs(z) Fs(z ))2; Gc} . (28)

We ﬁrst implement the symmetrization. Note that:

{x : max s Fs(x) } {x : max x R,s S[2x Fs(x) x2] 2} (29)

{x, y : max s [Fs(x) Fs(y)] } {x, y : max x R,s S[2x(Fs(x) Fs(y)) x2] 2}. (30)

Sparse PCA via Covariance Thresholding

Furthermore, by centering, Fs(z) = E{Fs(z) Fs(z )|z}. Hence for any non-decreasing convex function φ(z):

E n φ max x,s [2x Fs(z) x2] o E φ max x,s E{2x Fs(z) 2x Fs(z ) x2|z} (31)

(a) E φ E max x,s [2x(Fs(z) Fs(z )) x2]|z (32)

(b) E n φ max x,s [2x(Fs(z) Fs(z )) x2] o . (33)

Here we use Jensen s inequality with the monotonicity of φ( ) to obtain (a) and with the convexity of φ( ) to obtain (b). Now we choose φ(z) = (z a)+, for a = 2/2.

P{max s Fs(z) } P max x,s [2x Fs(z) x2] 2 (34)

(a) φ( 2) 1E n φ max x,s [2x Fs(z) x2] o (35)

(b) φ( 2) 1E n φ max x,s [2x(Fs(z) Fs(z )) x2] o (36)

= φ( 2) 1E n φ max s [(Fs(z) Fs(z ))2] o (37)

= φ( 2) 1 E n φ max s [(Fs(z) Fs(z ))2] ; G o

+ E φ(max s [(Fs(z) Fs(z ))2]; Gc . (38)

Here (a) is Markov s inequality, and (b) is the symmetrization bound Eq. (33), where we use the fact that φ(z) = (z a)+ is non-decreasing and convex in z. At this point, it is easy to see that the lemma follows if we are able to control the ﬁrst term in Eq. (38). We establish this via the Maurey-Pisier method. Deﬁne the path z(θ) z sin θ + z cos θ, the velocity z dz/dθ = z cos θ z sin θ.

E n φ max s [(Fs(z) Fs(z ))2] ; G o = Z

0 P n max s [(Fs(z) Fs(z ))2] a

+I(G) x o dx

0 P max s [|Fs(z) Fs(z )|]

x + a; G dx (40)

a e λ x max s

h E exp{λ(Fs(z) Fs(z ))}; G i dx ,

where, in the last inequality we use the union bound followed by Markov s inequality. To control the exponential moment, note that Fs(z) Fs(z ) = R π/2 0 F(z(θ)), z(θ) dθ whence,

Deshpande and Montanari

using Jensen s inequality:

E n exp λ(Fs(z) Fs(z )) ; G o = E exp Z π/2

0 λ Fs(z(θ)), z(θ) dθ ; G (42)

0 E n exp λπ Fs(z(θ)), z(θ) /2 ; G o dθ. (43)

Deﬁne the set Gθ = {(z, z ) : maxs Fs(z(θ)) L}. Then:

E n exp λ(Fs(z) Fs(z )) ; G o (a) 2

0 E n exp λπ Fs(z(θ)), z(θ) /2 ; Gθ o dθ (44)

0 E exp λ2π2 Fs(z(θ)) 2

8 ; Gθ dθ (45)

(c) exp λ2π2L2

Here (a) follows as Gθ G. Equality (b) follows from noting that Gθ is measurable with respect to z(θ) and, hence, ﬁrst integrating with respect to z(θ) = z cos θ z sin θ, a Gaussian random variable that is independent of z(θ). The ﬁnal inequality (c) follows by using the fact that Fs(z(θ)) L on the set Gθ. Since this bound is uniform over s S, we can use it in (41):

E n φ(max s (Fs(z) Fs(z ))2); G o 2|S| Z

a exp λ x + λ2π2L2

λ2 (1 + λ a) exp λ a + λ2π2L2

We can now set λ = 4 a/π2L2, to obtain the exponent above as 2a/π2L2 = 2/π2L2. The prefactor (1 + λ a)λ 2 is bounded by CL2 max(, L2/ 2) when a = 2/2. Therefore, as required, we obtain:

E n φ(max s (Fs(z) Fs(z ))2); G o C max(1, L4/ 4) exp 2

Combining this with Eq. (38) and the fact that φ( 2) 1 C 2 gives Eq. (28) and, consequently, the lemma.

By a simple application of Cauchy-Schwarz, this lemma implies the following.

Corollary 10 Under the same conditions as Lemma 9,

P n max s S |Fs(z) EFs(z)| o C|S| exp 2

2 E n max s S (Fs(z) Fs(z ))4 o1/2 P{Gc}1/2. (50)

The following two lemmas are well-known concentration of measure results. The forms below can be found in (Vershynin, 2012, Corollary 5.35), (Laurent and Massart, 2000, Lemma 1) respectively.

Sparse PCA via Covariance Thresholding

Lemma 11 Let A RM N be a matrix with i.i.d. standard normal entries, i.e. Aij N(0, 1). Then, for every t 0:

N + t o exp t2

Lemma 12 Let z N(0, IN). Then

P{ z 2 N + 2

Nt + 2t} exp( t). (52)

6. Proof of Theorem 1

Since bΣ = XTX/n Ip, we have:

n vq(vq)T +

βq n vq(ZTuq)T + (ZTuq)v T q )

(pβqβq uq, uq

n vq(vq )T )

n Ip . (53)

We let D = {(i, i) : i [p] \ Q} be the diagonal entries not included in any support. (Recall that Q = q Qq denote the union of the supports.) Further let E = Q Q, F = (Qc Qc)\D, and G = [p] [p]\(D E F), or, equivalently G = (Q Qc) (Qc Q). Since these are disjoint we have:

η(bΣ) = PE n η(bΣ) o

+ PF n η bΣ o

+ PG n η(bΣ) o

+ PD n η(bΣ) o

The ﬁrst term corresponds to the signal component, while the last three terms correspond to the noise component. Theorem 1 is a direct consequence of the next ﬁve propositions. The ﬁrst demonstrates that, even for a low level of thresholding, viz. τ < log p/2, the term N has small operator norm. The second demonstrates that the soft thresholding operation preserves the signal in the term S. The next two propositions show that the cross and diagonal terms C and D are negligible as well. Finally, in the last proposition, we demonstrate that, for the regime of thresholding far above the noise level, i.e. τ > C log p, the noise terms N and C vanish entirely.

Proposition 13 Let N denote the second term of Eq. (54). Since F = Qc Qc\D,

N = PF η(bΣ) = PF

n ZTZ . (55)

Then, there exists an absolute constant C such that the following happens. Assuming that (i) τ < log p/2 and (ii) n > C log p, then with probability 1 o(1)

e τ 2/C . (56)

Deshpande and Montanari

Proposition 14 Let S denote the ﬁrst term in Eq. (54):

S = PE n η(bΣ) o . (57)

Assume that (i) s0/n < 1 and (ii)n > C log p: Then with probability 1 o(1):

S Σ op 2τs0 n + C(β 1) rs0

Proposition 15 Let C denote the matrix corresponding to the third term of Eq. (54):

C = PG n η(bΣ) o .

Assuming the conditions of Proposition 13 and, additionally, that s2 0 p, there exist constants C, c such that with probability 1 o(1)

C op C τe cτ 2/(β 1) r p

Proposition 16 Let D denote the matrix corresponding to the third term of Eq. (54):

D = PD n η(bΣ) o .

With probability 1 o(1) we have that D op C p

Proposition 17 For some absolute constant C0, we have for τ C0(β 1) log p that, with probability 1 o(1):

i, j Nij = Cij = 0. (60)

Therefore, N op = 0 and C op = 0.

Remark 18 At this point we remark that the probability 1 o(1) can be made quantitative, for e.g. of the form 1 exp( min( p, n)/C1), for every n large enough. For simplicity of exposition we do not pursue this in the paper.

We defer the proofs of Propositions 13, 14, 15, 16 and 17 to Sections 6.1, 6.2, 6.3, 6.4 and 6.5 respectively. By combining them for β = O(1), we immediately obtain the following bound.

Theorem 19 There exist numerical constants C0, C1 such that the following happens. Assume β C0, n > C1 log p and τ log p/2. Then with probability 1 o(1):

η(bΣ) Σ op 2τs0 n + C r p

e τ 2/C + C

Sparse PCA via Covariance Thresholding

Proof The proof is obtained by adding the error terms from Propositions 13, 14, 15 and 16, and noting that β is bounded.

Using Propositions 13, 14, 15 and 16, together with a suitable choice of τ, we obtain the proof of Theorem 1. Proof [Proof of Theorem 1] Note that in the case s2 0 > p/e there is no thresholding and hence the result follows from the fact that bΣ Σ op C p

p/n (Vershynin, 2012, Remark 5.40). We assume now that s2 0 p/e and the case that τ = C1(β 1) p

log(p/s2 0) log p/2. In that case we set τ = τ log p/2. Below we will keep C1 a large enough constant, and check that each of the error terms in Propositions 13, 14, 15 and 16 is upper bounded by (a constant times) the right-hand side of Eq. (7). Throughout C will denote a generic constant that can be made as large as we want, and can change from line to line. We start from Proposition 13:

s2 0 n log p

s2 0 , (65)

where in the last step we used (e s2 0/p), (s2 0/n) 1. Next consider Proposition 14:

s2 0 . (67)

From Proposition 15, we get, using the same argument as in Eq. (65)

s2 0 n log p

s2 0 . (69)

Finally, the term of Proposition 16 is also bounded as desired using log p s2 0 log(p/s2 0) (dividing both sides by p and using the fact that x 7 x log(1/x) is increasing). The case of τ log p/2 is easier. In that case, we can keep τ = C2τ with C2 large enough so that τ C0(β 1) log p for C0 of Proposition 17. Then, by Proposition 17, we

Deshpande and Montanari

know that N = 0 and C = 0. Therefore we only need consider the terms S Σ and D. For these terms we can use Propositions 14 and 16 respectively and, arguing as in the earlier case τ log p, we obtain the desired result.

6.1 Proof of Proposition 13

Deﬁne e N as

Since N is a principal submatrix of e N, it suﬃces to prove the same bound for e N. Our main tool in the proof will be the concentration lemma 9 which we use on multiple occasions. With a view to using the lemma, we let let Z Rn p denote an independent copy of Z, and z i it s ith column. The proof relies on two preliminary lemmas. For some A 1 (to be chosen later), we ﬁrst state and prove the following lemma that controls the norm of any principal submatrix of N of size at most p/A.

Lemma 20 Fix any A 1. There exists an absolute constants C, c such that:

P n max S [p],|S| p/A PS S( N) op o C exp plog CA

2 exp( cn).

Proof For any subset S [p] recall that T ε p (S) denotes an ε-net of unit vectors in Sp 1

supported on the subset S. For simplicity let T(A) = S:|S| p/AT ε p (S). It suﬃces, by Lemma 8, to control y, Ny on the set T(A). In particular:

P n max S [p],|S| p/A PS S( N) op o P n max y T(A) | y, Ny | (1 2ε) o . (71)

Consider the good set G1 given by:

G1 = {(Z, Z ) : max( Z , Z )

2( n + p))}. (72)

To use Lemma 9, we need to compute E y, Ny and the gradient of y, Ny with respect to the underlying random variables Z. Since η( ) is an odd function the expectation vanishes. To compute the gradient, we let t [0, 1] and W =

t Z+ 1 t Z , and consider y, Ny = y, η(WTW/n)y as a function of the W. Taking the gradient with respect to a column wℓfor ℓ S:

wℓ y, Ny = yℓ

i =ℓ,i S wiyi η( wi, wℓ /n) (73)

Sparse PCA via Covariance Thresholding

( yi η( wi, wℓ /n) if i = ℓ, i S 0 otherwise. (75)

Since σ y = 1, we have that wℓ y, Ny 2 y2 ℓ W 2/n2. Summing over ℓ S we obtain the gradient bound, holding on the good set G1:

W y, Ny 2 P ℓy2 ℓ n2 W 2 (76)

which holds because of triangle inequality and the fact that

2. We can now apply Lemma 9 to bound the RHS of Eq. (71) and get:

P n max S [p],|S| p/A PS S( N) o C|T(A)| exp n2 2

2 E max y T y, Ny 2; Gc 1 . (78)

We can simplify the terms on the right-hand side to obtain the result of the lemma. With ε = 1/4, Stirling s approximation and Lemma 7 we have:

|T(A)| exp plog CA

We use a crude bound on the complement of the good set G1. It is easy to see that, for any unit vector y, y, Ny 2 N 2 F ZTZ 2 F /n2. Cauchy-Schwarz then implies that

E{max y, Ny 2; Gc 1} n 2 E{ ZTZ 4 F } 1/2P{Gc 1}1/2 (80)

(np)C exp( c(n + p)), (81)

where the bound on P{Gc 1} follows from Lemma 11. This concludes the lemma.

Note that Lemma 20, with A = 1, tells us that N op is of order p

p/n + (p/n)2

(uniformly in τ) with high probability. Already this non-asymptotic bound is non-trivial, since the previous results of Cheng and Singer (2013) and Fan and Montanari (2015) do not extend to this case. However, Proposition 13 is stronger, and establishes a rate of decay with the thresholding level τ. The second lemma we require controls the Rayleigh quotient y, Ny when the entries of y are spread out .

Lemma 21 Assume that τ log p/2. Given A 1 and a unit vector y, let S = {i : |yi| p

A/p} and y S, y Sc denote the projections of y onto supports S, Sc respectively. We have:

P n max y T 1/4 p | y S, Ny S | o C exp n2 2

L2 1 + Cp + (np)C exp c min( p, n) , (82)

Deshpande and Montanari

for any L1 where L1 = C1 p

A exp( τ 2/16)(n + p)/n2. The same bound holds for P maxy T 1/4 p | y Sc, Ny S | .

Proof We ﬁrst prove the claim for y S, e Ny S . Firstly, we have E y S, Ny S = 0. Consider the good set G2 of pairs (W, W ) Rn p Rn p satisfying the conditions:

2( n + p) , (83)

j [p]\i |I( wi, wj | τ n/2) 2 exp( τ 2/16) , (84)

j [p]\i |I( w i, w j | τ n/2) 2 exp( τ 2/16) , (85)

j [p] I(| wi, w j | τ n/2) 2 exp( τ 2/16). . (86)

Also, for any pair W, W G2, for W(t) =

t W + 1 t W (and its columns w(t)i deﬁned appropriately) we have:

W(t) max t (

2p) = 2( n + p), (87)

j [p]\i I( w(t)i, w(t)j τ n) 6 exp( τ 2/16). (88)

Equation (87) follows by a simple application of triangle inequality and condition (83) deﬁning G2. For inequality (88), expanding the product w(t)i, w(t)j :

w(t)i, w(t)j = t wi, wj + (1 t) w i, w j + p

t(1 t) wi, w j , (89)

whence, by triangle inequality and p

I(| w(t)i, w(t)j | τ n) I(| wi, wj | τ n/2) + I(| w i, w j | τ n/2)

+ I(| wi, w j | τ n/2). (90)

The gradient of y S, η(WTW/n)y S with respect to a column wℓof W is given by:

wℓ y S, η(WTW/n)y S = yℓ

j S\ℓ yj η wj, wℓ

( η( wi, wℓ /n; τ/ n)yi when i S\ℓ 0 otherwise. (93)

Sparse PCA via Covariance Thresholding

wℓ y S, e Ny S 2 y2 ℓ n2 W 2 σ 2 (94)

i =ℓ (yi η( wi, wℓ /n))2 (95)

(a) y2 ℓ W 2

p I(| wi, wℓ | τ n) (96)

(b) y2 ℓ n2 C(n + p)A exp( τ 2/16) (97)

Here (a) follows from fact that the entries of y S are bounded by p

A/p and the deﬁnition of the soft thresholding function. Inequality (b) follows follows when we set W = Z(t) =

t Z+ 1 t Z and (Z, Z ) G2. Therefore, summing over ℓwe obtain the following bound for the gradient of y S, e Ny S

Z(t) y S, e Ny S 2 C1 A exp( τ 2/16)(n + p)

n2 L2 1. (98)

We can use now Lemma 9, to get, for L1 > 0 as deﬁned above and any L1:

P n max y T 1/4 p y S, e Ny S o C exp 2

+ CL 2 1 E{max y T εp y S, e Ny S 2; G2} (99)

CL2 1 + Cp + C(np)CP{Gc 2}1/2, (100)

where the last line follows by Cauchy-Schwarz, as in the proof of Lemma 20, and the fact that L1 (np) C2 using the upper bound τ log p/2. To obtain the thesis, we need to now bound P{Gc 2}. It suﬃces to control the failure probability of conditions (83), (84), (85), (86) of the good set G2 individually, and apply the union bound. For Z, Z independent, max( Z , Z )

2( n + p) with probability at most 2 exp( c(n + p)) by Lemma 11. Now consider condition (84) with i = 1, without loss of generality. First, for any h > 0 we have:

j =1 I(| z1, zj | τ n/2) h o P n1

j =1 I(| z1, zj | τ n/2) 2h; z1 2 n o

Lemma 12 guarantees that the second term is at most exp( cn). To control the ﬁrst term, we note that, conditional on z1, zj, z1 , j = 1 are independent Gaussian random variables with variance z1 2. Therefore, conditional on z1, I(| z1, zj | τ n/2) are independent Bernoulli random variables with success probability h0 = 2Φ τ n/(2 z1 ) , where Φ( ) is

Deshpande and Montanari

the Gaussian cumulative distribution function. It follows, by the Chernoﬀ-Hoeﬀding bound for Bernoulli random variables that

j =1 I(| z1, zj | τ n/2) h z1 o exp p D(h h0) , (102)

where D(a b) = a log(a/b) + (1 a) log[(1 a)/(1 b)]. Choosing h = 4Φ( τ/(2

2)), and conditional on z1

2n, D(h h0) ch for a constant c, implying that

j =1 I(| z1, zj | τ n/2) h; z1

2n o exp( cph). (103)

By standard bounds h = 4Φ( τ/2

2) 2 exp( τ 2/16) and, as τ log p/2, h 1/ p, we have

j =1 I(| z1, zj | τ n/2) h; z1

2n o exp( c p). (104)

Combining this with Eq. (101) we now get:

j =1 I(| z1, zj | τ n/2) h o 2 exp( c min(n, p)). (105)

A similar bound holds for i = 1 and the other conditions (85) and (86), whence we have by the union bound that P{Gc 2} p2 exp( c min( p, n)). This completes the proof of the claim (82). The proof of the claim for y S, Ny Sc is analogous, so we only sketch the points at which it diﬀers from that of Eq. (82). We use the same good set G2, as deﬁned earlier. Computing the gradient as for y S, e Ny S we obtain:

wℓ y S, e Ny Sc = yℓ

j S(ℓ) yj wj η wj, wℓ

Here S(ℓ) = Sc if ℓ S and S otherwise. Deﬁne the vector σ(ℓ) Rp as

( yℓyj η wj, wℓ

0 otherwise. (107)

As before, we have that wℓ y S, e Ny Sc = n 1 Wσ(ℓ) n 1 W σ(ℓ) . Therefore, summing over ℓ [p]:

W y S, e Ny Sc 2 W

ℓ [p] σ(ℓ) 2 (108)

j S(ℓ) y2 ℓy2 j η wj, wℓ

j Sc y2 j y2 ℓ η wj, wℓ

p max ℓ [p]

j =p η wj, wℓ

Sparse PCA via Covariance Thresholding

Under the condition of G2, the gradient also satisﬁes, when evaluated at W = Z(t) =

t Z + 1 t Z :

Z(t) y S, e Ny Sc 2 CA exp( τ 2/16)(n + p)

The rest of the proof is then the same as before.

Given these lemmas, we can now establish Proposition 13. Proof [Proof of Proposition 13] We use a variant of the ε-net argument of Lemma 20. To bound the probability that N op is large, with Lemma 8, we obtain:

P N op P n max y T εp | y, Ny | (1 2ε) o . (113)

Let S = {i : |yi| p

A/p} for some A 1 to be chosen later. Then let y = y S + y Sc denote the projections of y onto supports S, Sc respectively. Since y, Ny = y Sc, Ny Sc + y S, Ny S + 2 y S, Ny Sc by triangle inequality and union bound:

P N op P n max y T εp | y Sc, Ny Sc | + | y S, Ny S | + 2| y S, Nys Sc | (1 2ε) o

P n max y T εp | y Sc, Ny Sc | (1 2ε)/4 o + P n max y T εp | y S, Ny S | (1 2ε)/4 o

+ P n max y T εp | y S, Ny Sc | (1 2ε)/4 o (115)

P n max S :|S | p/A PS S ( N) (1 2ε)/4 o + P n max y T εp | y S, Ny S | (1 2ε)/4 o

+ P n max y T εp | y S, Ny Sc | (1 2ε)/4 o . (116)

With ε = 1/4, the ﬁrst term is controlled by Lemma 20 while the ﬁnal two are controlled by Lemma 21. We choose ε = 1/4 in Eq. (116), and

A + A exp τ 2

for large enough C so that, using the bounds of Lemmas 20 and 21, we have:

P N C(np)C exp h c min p, n, plog A

This probability bound is o(1) provided A is not too large: we choose A = 0.25 p

τ exp(τ 2/16) p which guarantees that the bound above is o(1) when n > C log p for some C large enough. This concludes the proposition.

Deshpande and Montanari

6.2 Proof of Proposition 14

We decompose the empirical covariance matrix (53) as

PE(bΣ) = Σ + 1 + 2 + T 2 + PE 1

n ZTZ Ip , (119)

n uq, u q 1q=q vqv T q , (120)

βq n vq(ZTuq)T Q . (121)

Next notice that, for any x R,

η(x) x τ n . (122)

With a view to employing this inequality, we use Eq. (119) and the triangle inequality:

PE(η(bΣ)) Σ op = PE η(bΣ) PE bΣ 1 2 T 2 PE 1

n ZTZ Ip op (123)

PE η(bΣ) bΣ op + 1 op + 2 2 op + PE 1

n ZTZ Ip op (124)

s0τ n + 1 op + 2 2 op + PE 1

n ZTZ Ip op, (125)

where the last line follows by noticing that the ﬁrst term is supported on E of size s0 s0 and then using bias bound Eq. (122) entry-wise. We next bound each of the three terns on the right hand side. For the ﬁrst term in Eq (125), note that with a change of basis to the orthonormal set v1, . . . vr 1 is equivalent to an r r matrix with entries Mqq pβqβq , where Mqq = uq, u q /n 1q=q . Denote by B Rr r the diagonal matrix with Bqq = p

βq and by U Rr n, the matrix with columns u1,. . . ur. Then, we have, with high probability

1 op = BMB op (126)

B 2 op M op = β 1

n UTU Ir r op (127)

The last inequality follows from the Bai-Yin law on eigenvalues of Wishart matrices (see Vershynin, 2012, Corollary 5.35). Consider the second term in Eq (125). By orthonormality of v1, . . . , vr, the matrix 2 is orthogonally equivalent to BZT QU/n, where we recall that ZQ denotes the submatrix of Z formed by the columns in Q. Denoting by PU the orthogonal projector onto the column

Sparse PCA via Covariance Thresholding

space of U, we then have, with high probability,

n B op ZT QPUU op (129)

n PUZQ op U op (130)

n s0 + r n + r Cβ rs0

Here the penultimate inequality follows by Lemma 11 noting that, by invariance under rotations (and since PU project onto a random subspace of r dimensions independent of Z), PUZQ op is distributed as the norm of a matrix with i.i.d. standard normal entries, with dimensions |Q| r, |Q| s0. Finally, for the third term of Eq. (125) we use the Bai-Yin law of Wishart matrices (see Vershynin, 2012, Corollary 5.35) to obtain, with high probability: PE 1

n ZTZ Ip op = 1

n ZT QZQ Is0 op (132)

Finally, substituting the above bounds in Eq. (125), we get

PE(η(bΣ)) Σ op = τs0 n + C(1 + β) rs0

which implies the proposition.

6.3 Proof of Proposition 15

Note that C = C + CT where C = PQ Qc η(bΣ) . It is therefore suﬃcient to control C, and then use triangle inequality. The proof is similar to that of Proposition 13. We let U Rn r denote the matrix with columns u1, u2,. . . ur, and introduce the set

U n U Rn r : 1

n UTU Ir r op 5 r r

We then have

P C op sup U U P C op U + P U U . (136)

Notice that, by the Bai-Yin law on eigenvalues of Wishart matrices (see Vershynin, 2012, Corollary 5.35), limn P(U U) = 1 (throughout r < c n for c a small constant). It is therefore suﬃcient to show sup U U P C op U 0 for as in the statement of the theorem. In order to lighten the notation, we will write e P( ) P( |U) and bound the above probability uniformly over U U. (In other words e P denotes expectation over Z with U ﬁxed). We ﬁrst control the norms of small submatrices of C, following which we control the full matrix.

Deshpande and Montanari

Lemma 22 Fix an A [1, p1/3], and let L = p

((β 1)n + p)/n2. Then, there exists an absolute constant C > 0 such that, for any > 0:

e P n max Qc S:|S| p/A PQ S η(bΣ) op o C exp Cs0 + p log(CA)

+ L 2(np)C exp( n/C). (137)

Proof Let, as before, T ε p (S) denote the ε-net of unit vectors supported on S Qc of size at most p/A and let T = ST ε p (S). Then, by Lemma 8, with ε = 1/4:

e P n max S Qc|S| p/A

PQ S η(bΣ) op o e P n max y T,w T εs0 w, Cy (1 2ε)/2 o . (138)

It now suﬃces to control the right hand side via Lemma 9. We ﬁrst compute the gradients with respect to zℓas before:

n P i Qc yi η( xℓ, zi /n) zj when ℓ Q, . yℓ

n P i Q wi η( zℓ, xi /n) xi when ℓ Qc, (139)

Therefore, arguing as in proof of Proposition 13 (see Lemma 20):

Z w, Cy 2 F = X

ℓ zℓ w, Cy 2 Z 2 + XQ 2

Let B Rr r be the diagonal matrix with entries Bq,q = p

βq, and V Rp r be the matrix with columns v1, . . . , vr. We then have X = UBVT + Z, whence, recalling U U, and r c n with c small enough

XQ X UBVT + Z (141)

β U + Z 5 p

βn + Z . (142)

Consider the good set G4 of pairs Z, Z satisfying:

max( Z , Z )

max( ZQ , Z Q )

For (Z, Z G4, and t [0, 1], deﬁne Z(t) =

t Z + 1 t Z . Now Using Eqs. (140) and (142, the gradient w, Cy evaluated at Z(t) satisﬁes:

w, Cy 2 3 Z(t) 2 + 10βn

C (n + p) + βn

C (β 1)n + p

Sparse PCA via Covariance Thresholding

Now applying Corollary 10, for L = C p

((β 1)n + p)/n2:

e P n max S Qc|S| p/A

PQ S η(bΣ) op o C|T| exp 2

+ CL 2e E{max w,y w, Cy 4}1/4P{G4}1/2. (148)

Let ε = 1/4, observing that T S:|S| p/AT ε p (S), we have the bound (using Lemma 7 and Stirling s approximation):

|T| exp(Cs0 + A 1p log CA), (149)

for some absolute C. Now, as in the proof of Proposition 13, | w, Cy | C C F bΣ F . From this it follows that e E maxw,y w, Cy 4 (np)C for some C. Finally P{Gc 4} exp( cn) using Lemmas 11, 12 and the union bound. Combining these bounds in Eq. (148) yields the lemma.

Now we prove a similar lemma when y has entries that are spread out .

Lemma 23 Fix an A [1, p1/3], and a unit vector y RQc let S = {i : |yi| p

A/p, and y S denote the projection of y on the set of indices S. Then there exists a numerical constant C such that, assuming τ log p/2, we have

e P n max w T ε Q,y T ε Qc w, Cy S o C exp 2

CL2 + Cp + (np)C exp c min( p, n) ,

where L = p

A exp( τ 2/C(β 1))(n(β 1) + p)/n2.

Proof For simplicity of notation, it is convenient to introduce the vector y = y S. Throughout the proof, we will use that y 1 and y p

A/p. We compute the gradients as follows:

n P i Qc y i η( xℓ, zi /n) zj when ℓ Q y ℓ n P i Q wi η( zℓ, xi /n) xi when ℓ Qc . (151)

Therefore we have

ℓ Q zℓ w, Cy 2 X

w2 ℓ n2 Z 2 X

y i η( xℓ, zℓ) 2 (152)

pn2 max ℓ Q

i Qc η( xℓ, zi /n), (153)

Deshpande and Montanari

where we used the fact that |y i| p

A/p and that η( ) {0, 1}. Similarly, for ℓ Qc:

ℓ Qc zℓ w, Cy 2 X

(y i)2 XQ 2

wi η( zℓ, xi /n) 2 (154)

ℓ Qc (y ℓ)2 η( zℓ, xi /n)2 (155)

pn2 max ℓ Q

i Qc η( zi, xℓ /n). (156)

Combining the bounds in Eqs.(153), (156), we obtain

Z w, Cy 2 F = X

ℓ [p] zℓ w, Cy 2 (157)

pn2 ( XQ 2 + Z 2) max i Q

j Qc η( xi, zj /n). (158)

With K = Cβ 1, we deﬁne the good set G5 of pairs (Z, Z ) satisfying

j Qc I( xi, zj τ n/2) 2 exp( τ 2/K) (160)

j Qc I( x i, z j τ n/2) 2 exp( τ 2/K) (161)

j Qc I( x i, zj τ n/4) 2 exp( τ 2/K) (162)

j Qc I( xi, z j τ n/4) 2 exp( τ 2/K). (163)

Deﬁne Z(t) =

t Z + 1 t Z with (Z, Z ) G5. By Eq. (158) the gradient evaluated at Z(t) is bounded by

pn2 ( XQ(t) 2 + Z(t) 2) max i Q

j Qc η( x(t)i), z(t)j /n) (164)

pn2 ((β 1)n + p) max i Q

j Qc η( x(t)i), z(t)j /n) , (165)

where we bounded XQ(t) as in Eq. (142), and used Z(t) op 2( n+ p), which follows from Eq. (159) and triangle inequality. Furthermore, as x(t)i, z(t)j = t xi, zj + (1 t) x i, z j + p

t(1 t)( xi, z j + x i, zj ), we have that:

η( x(t)i), z(t)j /n) = I(| x(t)i, z(t)j | τ n) (166)

I(| xi, zj | τ n/2) + I(| x i, z j | τ n/2)

+ I(| x i, zj | τ n/4) + I(| x i, zj | τ n/4). (167)

Sparse PCA via Covariance Thresholding

Hence on the good set G5, we have:

j Qc η( x(t)i), z(t)j /n) 4p e τ 2/K . (168)

Therefore the gradient satisﬁes, on the good set:

Z w, Cy 2 C A

n2 ((β 1)n + p) e τ 2/K = CL2 . (169)

Hence, by Lemma 9, we obtain:

e P n max w T ε Q,y T εp w, Cy o C|T ε Q||T ε p | exp 2

+ CL 2 e E{max w, Cy 4}1/4P{Gc 5}1/2 .

By Lemma 7, keeping ε = 1/4 we have that the ﬁrst term is at most C exp(Cp+exp( 2/CL2 )). For the second term, we have | w, Cy | C C F bΣ F . Since E{ bΣ 4 F } (np)C, we have that E{maxw,y w, Cy 4}1/4 (np)C. Also as τ < log p, L (np) C, implying that the second term is bounded above by (np)CP{Gc 5}1/2. Therefore:

e P n max w T ε Q,y T εp w, Cy o C exp Cp 2

+ (np)CP{Gc 5}1/2 . (171)

It remains to control the probability of the bad set Gc 5. For this, we control the probability of violating any one condition among (159), (160), (161), (162) and (163) deﬁning G5 and then use the union bound. By Lemmas 11, condition (159) hold with probability 1 C exp( cn). The argument controlling the probability for conditions (160), (161), (162) and (163) to hold are essentially the same, so we restrict ourselves to condition (160) keeping i = 1 Q, without loss of generality. Conditional on x1, x1, zj for j Qc are independent N(0, x1 2) variables. Therefore, conditional on x1, I(| x1, zj | τ n/2) are independent Bernoulli random variables with success probability Φ{ τ n/2 x1 }. Deﬁne h1 to be the success probability, i.e. h1 = Φ( τ n/(2 x1 )). Since K = C(β 1) we can enlarge C to a large absolute constant. Letting V Rn r

be the matrix with columns v1, . . . , vr, and B the diagonal matrix with Bq,q = p

βq, we have, with probability at least 1 exp( n/C),

x1 UBVTe1 + z1 B U + z1

where the last equality holds since U U and by tail bounds on chi-squared random variables. Further

j Qc I(| x1, zj τ n/2) |Qc|h o e P{ x1 2 Kn}

+ sup x1 2 Kn e P n X

j Qc I(| x1, zj τ n/2) |Qc|h x1 o .

Deshpande and Montanari

By the above argument, the ﬁrst term is at most exp( n/C) and we turn to the second term. By the Chernoﬀbound

j Qc I(| x1, zj τ n/2) |Qc|h x1 o exp |Qc|D(h||h1) , , (174)

with h1 < exp( τ 2/K) when x1 2 Kn/4. Choosing h = 2 exp( τ 2/K) implies that h1 h/2 when and, thereby, that D(h h1) h/C. Further since τ < log p/2, h 1/ p. This implies that

exp( |Qc|D(h h1 h1)) = exp( (p s0)h/C) exp( p/C). (175)

Combining this with Eq. (173) we have that P{Gc} Cp2 exp( min(n, p)/C) for some absolute C. Plugging this in Eq. (171) yields the lemma.

We are now ready to prove Proposition 15. Indeed, as in Proposition 13, for any unit vector y RQc, let S = {i : |yi| p

A/p} and y S, y Sc denote the projections on the indices in S, Sc respectively.

e P n C1 } e P n max w T ε Q,y T ε Qc | w, Cy | (1 2ε) o (176)

e P n max w T ε Q,y T ε Qc | w, Cy S | (1 2ε)/2 o

+ P n max w T ε Q,y T ε Qc | w, Cy Sc | (1 2ε)/2 o . (177)

As before, we will let ε = 1/4. The ﬁrst term is controlled via Lemma 22, while the second is controlled by Lemma 23. We keep = where

= C L p + L

so that, via the bounds of Lemmas 22, 23 and that s2 0 p:

P{ C1 } C exp cp log A

+ L 2 (np)C exp c min( p, n) . (179)

We now set A = (τ 2/K) exp(τ 2/K) 1/2 with K = C(β 1) for a suitable constant C and, since τ log p/2, we get that A p1/3. Furthermore, it is straightforward to see that L (np) C, and this implies that

P{ C1 } (np)C exp( c min( p, n)) = o(1). (180)

With this setting of A, we get the form of below, as required for the proposition.

C e cτ 2/K r

K pn(β 1) + p2

C (τ 1)e cτ 2/K r p

Sparse PCA via Covariance Thresholding

6.4 Proof of Proposition 16

Since D is a diagonal matrix, its spectral norm is bounded by the maximum of its entries. This is easily done as, for every i Qc:

|(D)ii| = η zi 2

By the Chernoﬀbound for χ2-squared random variables as in Lemma 12 followed by the union bound, with probability 1 o(1):

for some absolute C. Here we used the fact that (log p)/n < 1.

6.5 Proof of Proposition 17

It suﬃces to show that with probability 1 o(1)

max i,j F G |bΣij| τ n = C0(β 1)

This is a standard argument (see Bickel and Levina, 2008b, Lemma A.3) where (following the dependence on β) it suﬃces to take τ C0(β 1) log p for C0 a suﬃciently large absolute constant. We note here that the same can also be proved via the conditioning technique applied in the proofs of Propositions 13 and 15.

7. Proof of Theorems 3

Throughout this section, to lighten notation, we drop the prime from bΣ and X while keeping in mind that these are independent from bv1, . . . , bvr. We further write X = UBVT+ Z, where U Rn r is the matrix with columns u1, . . . , ur, B is diagonal with Bii = βi and V Rp r has columns v1, . . . , vr. Deﬁne the event

U n U Rn r : 1

n UTU Ir r op 3 r r

By the Bai-Yin law on eigenvalues of Wishart matrices (see Vershynin, 2012), limn P(U U) = 1. In the rest of the proof, we will therefore assume U U ﬁxed, and denote by e P( ) = P( |U) the expectation conditional on U. In other words, e P( ) denotes expectation with respect to Z. Note that

n VBUTUBVT + 1

n ZTUBVT + 1

n VBUTZ + 1

n ZTZ I . (188)

Deshpande and Montanari

We then have, for q {1, . . . , r} and i {1, . . . , p}, (bΣbvq)i βq vq, bvq vq,i T (1) i,q + T (2) i,q + T (3) i,q , (189)

T (1) i,q 1

n ei, VBUTUBVTbvq βq vq, bvq vq,i , (190)

T (2) i,q 1

Z, (UBVTei)bv T q + (UBVTbvq)e T i , (191)

T (3) i,q ei, 1

n ZTZ I bvq . (192)

We next bound, with high probability, maxi,q T (a) i,q for a {1, 2, 3}. Throughout we let ε maxq [r] bvq vq . Considering the ﬁrst term, we have

T (1) i,q ei, VB 1

n UTU I BVTbvq + ei, VB2VTbvq βq vq, bvq vq,i (193)

n + βε r max q [r]\q |vq ,i| , (194)

where in the last inequality we used P q [r]\q vq , bvq 2 1 vq, bvq 2 ε2/2.

Consider next the second term. Since Zij iid N(0, 1), it follows that T (2) i,q = |Wi,q|, for Wi,q N(0, σ2 i,q) a Gaussian random variable with variance

n2 (UBVTei)bv T q + (UBVTbvq)e T i 2 F (195)

n UBVTei 2 + UBVTbvq 2o (196)

n2 UBVT 2 op (197)

n2 U 2 op B 2 op 8β2

By union bound over i [p], q [r] we obtain

max i [p],q [r] T (2) i,q 8β

Finally, consider the last term. By rotational invariance of Z, the distribution of T (3) i,q only depends on the angle between ei and bvq. Calling this angle ϑ, we have

T (3) i,q d= e1, 1

n ZTZ I e1 cos ϑ + e1, 1

n ZTZ I e2 sin ϑ (200)

n z1 2 1 + 1

n z1, z2 . (201)

Both of these terms have Bernstein-type tail bonds, whence

e P T (3) i,q t n

2 exp c min(t n, t2) . (202)

Sparse PCA via Covariance Thresholding

Using t = C0 log p, and recalling that n C log p for C a large constant, we obtain e P T (3) i,q C0 p

(log p)/n 2 p 10. Hence by union bound

max i [p],q [r] T (3) i,q C0

By putting together Eqs. (194), (199), (203), and using assumption A2, we get

(bΣbvq)i βq vq, bvq vq,i Cβ r r

n + βεγ r |vq,i| I(i Q) . (204)

Let b Qq = {i [p] : |(bΣ bvq)i| ρ}. We claim that the above implies that, with high probability, Qq b Qq Q for all q. For i Q, we have

(bΣbvq)i Cβ r r

2 s0 , (206)

where the last inequality follows from Eq. (14). On the other hand, By Theorem 2 and using the assumption (14), we can guarantee

βγ r 1 . (207)

Hence for i Qq, and considering to be deﬁnite vq,i > 0, we get

(bΣbvq)i βq vq, bvq vq,i Cβ r r

n βεγ r |vq,i| (208)

βmin 1 ε β βmin εγ r vq,i Cβ r r

4 s0 Cβ r r

2 s0 . (211)

where, in the ﬁrst inequality, we used vq, bvq 1 ε. This concludes the proof. Keeping track of the dependence on θ, γ, β, βmin, we get that the following conditions are suﬃcient for the theorem s conclusion to hold (with C a

Deshpande and Montanari

suitable numerical constant):

β2 minθ2 s0 log p , (212)

β2 minθ2 rs0 , (213)

β2 min γ2 r s2 0 log p

s2 0 , (214)

β2 min s2 0 log p

s2 0 . (215)

All of these conditions are implied by the assumptions of Theorem 3, namely Eq. (14). In particular, this is shown by using the fact that s0 log p s2 0 log(p/s2 0) for s0 p.

Acknowledgments

We are grateful to David Donoho for his feedback on an early draft of this manuscript. This work was partially supported by the NSF CAREER award CCF-0743978, the NSF grant CCF-1319979, and the grants AFOSR/DARPA FA9550-12-1-0411 and FA9550-13-1-0036.

Arash A Amini and Martin J Wainwright. High-dimensional analysis of semideﬁnite relaxations for sparse principal components. The Annals of Statistics, 37(5B):2877 2921, 2009.

Jinho Baik, G erard Ben Arous, and Sandrine P ech e. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, pages 1643 1697, 2005.

Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of ﬁnite, low rank perturbations of large random matrices. Advances in Mathematics, 227 (1):494 521, 2011.

Quentin Berthet and Philippe Rigollet. Computational lower bounds for sparse pca. ar Xiv preprint ar Xiv:1304.0828, 2013.

Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577 2604, 2008a.

Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, pages 199 227, 2008b.

T Tony Cai, Cun-Hui Zhang, Harrison H Zhou, et al. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118 2144, 2010.

Sparse PCA via Covariance Thresholding

T Tony Cai, Harrison H Zhou, et al. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, 40(5):2389 2420, 2012.

T Tony Cai, Zongming Ma, Yihong Wu, et al. Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074 3110, 2013.

Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672 684, 2011.

Mireille Capitaine, Catherine Donati-Martin, and Delphine F eral. The largest eigenvalues of ﬁnite rank deformation of large wigner matrices: convergence and nonuniversality of the ﬂuctuations. The Annals of Probability, 37(1):1 47, 2009.

Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013.

Alexandre d Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse pca using semideﬁnite programming. SIAM review, 49(3): 434 448, 2007.

Alexandre d Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9:1269 1294, 2008.

Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970.

Yash Deshpande and Andrea Montanari. Sparse pca via covariance thresholding. In Advances in Neural Information Processing Systems, pages 334 342, 2014.

David L. Donoho and Iain M. Johnstone. Minimax risk over lp balls. Prob. Th. and Rel. Fields, 99:277 303, 1994.

Noureddine El Karoui. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191 3216, 2010a.

Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1 50, 2010b.

Zhou Fan and Andrea Montanari. The spectral norm of random inner-product kernel matrices. ar Xiv:1507.05343, 2015.

Delphine F eral and Sandrine P ech e. The largest eigenvalue of rank one deformation of large wigner matrices. Communications in mathematical physics, 272(1):185 228, 2007.

Iain M. Johnstone. Function estimation and gaussian sequence models. Unpublished manuscript, 2015.

Iain M Johnstone and Arthur Yu Lu. Sparse principal components analysis. Unpublished manuscript, 2004.

Deshpande and Montanari

Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.

Noureddine El Karoui. Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics, pages 2717 2756, 2008.

Antti Knowles and Jun Yin. The isotropic semicircle law and deformation of wigner matrices. Communications on Pure and Applied Mathematics, 2013.

Robert Krauthgamer, Boaz Nadler, Dan Vilenchik, et al. Do semideﬁnite relaxations solve sparse pca up to the information limit? The Annals of Statistics, 43(3):1300 1322, 2015.

Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302 1338, 2000.

Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2005.

Tengyu Ma and Avi Wigderson. Sum-of-squares lower bounds for sparse pca. In Advances in Neural Information Processing Systems, pages 1603 1611, 2015.

Nicolai Meinshausen and Peter B uhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436 1462, 2006.

Baback Moghaddam, Yair Weiss, and Shai Avidan. Spectral bounds for sparse pca: Exact and greedy algorithms. In Advances in neural information processing systems, pages 915 922, 2005.

Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17(4):1617, 2007.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y.C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 210 268. Cambridge University Press, 2012.

Vincent Q Vu and Jing Lei. Minimax rates of estimation for sparse pca in high dimensions. In Proceedings of the 15th International Conference on Artiﬁcial Intelligence and Statistics (AISTATS) 2012, 2012.

Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using-constrained quadratic programming (lasso). Information Theory, IEEE Transactions on, 55(5):2183 2202, 2009.

Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational trade-oﬀs in estimation of sparse principal components. ar Xiv preprint ar Xiv:1408.5369, 2014.

Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515 534, 2009.

Sparse PCA via Covariance Thresholding

Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265 286, 2006.