# Sparse PCA from Sparse Linear Regression

Guy Bresler (MIT, guy@mit.edu), Sung Min Park (MIT, sp765@mit.edu), Mădălina Persu (Two Sigma, MIT, mpersu@mit.edu)

Sparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR) have a wide range of applications and have attracted a tremendous amount of attention in the last two decades as canonical examples of statistical problems in high dimension. A variety of algorithms have been proposed for both SPCA and SLR, but an explicit connection between the two had not been made. We show how to efficiently transform a black-box solver for SLR into an algorithm for SPCA: assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, the SPCA algorithm derived from it achieves near state-of-the-art guarantees for testing and for support recovery under the single spiked covariance model, as obtained by the current best polynomial-time algorithms. Our reduction not only highlights the inherent similarity between the two problems, but also, from a practical standpoint, allows one to obtain a collection of algorithms for SPCA directly from known algorithms for SLR. We provide experimental results on simulated data comparing our proposed framework to other algorithms for SPCA.

1 Introduction

Principal component analysis (PCA) is a fundamental technique for dimension reduction used widely in data analysis. PCA projects data along a few directions that explain most of the variance of the observed data. One can also view this as linearly transforming the original set of variables into a (smaller) set of uncorrelated variables called principal components.

Recent work in high-dimensional statistics has focused on sparse principal component analysis (SPCA), as ordinary PCA estimates become inconsistent in this regime [22]. In SPCA, we restrict the principal components to be sparse, meaning they have only a few nonzero entries in the original basis. This has the advantage, among others, that the components are more interpretable [23, 49], though the components may no longer be uncorrelated.

We study SPCA under the Gaussian (single) spiked covariance model introduced by [21]: we observe $n$ samples of a random variable $X$ distributed according to a Gaussian distribution $N(0, I_d + \theta uu^\top)$, where $\|u\|_2 = 1$ with at most $k$ nonzero entries,² $I_d$ is the $d \times d$ identity matrix, and $\theta$ is the signal-to-noise parameter. We study two settings of the problem, hypothesis testing and support recovery.

Sparsity assumptions have played an important role in a variety of other problems in high-dimensional statistics, in particular linear regression. Linear regression is also ill-posed in high dimensions, so by imposing sparsity on the regression vector we recover tractability. Though the literatures on the two problems are largely disjoint, there is a striking similarity between them, in particular when we consider statistical and computational trade-offs.

The views expressed herein are solely the views of the author(s) and are not necessarily the views of Two Sigma Investments, LP or any of its affiliates. They are not intended to provide, and should not be relied upon for, investment advice.

² Sometimes we will write this latter condition as $u \in B_0(k)$, where $B_0(k)$ is the $\ell_0$-ball of radius $k$.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
The natural information-theoretically optimal algorithm for SPCA [4] involves searching over all possible supports of the hidden spike. This bears resemblance to the minimax optimal algorithm for SLR [35], which optimizes over all sparse supports of the regression vector. Both problems appear to exhibit gaps between statistically optimal algorithms and known computationally efficient algorithms, and conditioned on relatively standard complexity assumptions, these gaps seem irremovable [3, 44, 47].

1.1 Our contributions

In this paper we give algorithmic evidence that this similarity is likely not a coincidence. Specifically, we give a simple, general, and efficient procedure for transforming a black-box solver for sparse linear regression into an algorithm for SPCA. At a high level, our algorithm tries to predict each coordinate³ linearly from the rest of the coordinates using a black-box algorithm for SLR. The advantages of such a black-box framework are twofold: theoretically, it highlights a structural connection between the two problems; practically, it allows one to simply plug in any of the vast majority of solvers available for SLR and directly get an algorithm for SPCA with provable guarantees. In particular:

For hypothesis testing: we match the state-of-the-art provable guarantee for computationally efficient algorithms; our algorithm successfully distinguishes between isotropic and spiked Gaussian distributions at signal strength $\theta \gtrsim \sqrt{k^2 \log d / n}$. This matches the phase transition of diagonal thresholding [22] and Minimal Dual Perturbation [4] up to constant factors.

For support recovery: for general $n$ and $d$, when each non-zero entry of $u$ is at least $\Omega(1/\sqrt{k})$ (a standard assumption in the literature), our algorithm succeeds with high probability at signal strength $\theta \gtrsim \sqrt{k^2 \log d / n}$, which is nearly optimal.⁴

In experiments, we demonstrate that using popular existing SLR algorithms as our black-box results in reasonable performance. We theoretically and empirically illustrate that our SPCA algorithm is also robust to rescaling of the data, for instance by using a Pearson correlation matrix instead of a covariance matrix.⁵ Many iterative methods rely on initialization via first running diagonal thresholding, which filters variables with higher variance; rescaling renders diagonal thresholding useless, so in some sense our framework is more robust.

2 Preliminaries

2.1 Problem formulation for SPCA

Hypothesis testing. Here, we want to distinguish whether $X$ is distributed according to an isotropic Gaussian or a spiked covariance model. That is, our null and alternate hypotheses are
$$H_0 : X \sim N(0, I_d) \quad \text{and} \quad H_1 : X \sim N(0, I_d + \theta uu^\top).$$
Our goal is to design a test $\psi : \mathbb{R}^{n \times d} \to \{0, 1\}$ that discriminates $H_0$ and $H_1$. More precisely, we say that $\psi$ discriminates between $H_0$ and $H_1$ with probability $1 - \delta$ if both type I and type II errors have probability smaller than $\delta$:
$$P_{H_0}(\psi(X) = 1) \le \delta \quad \text{and} \quad P_{H_1}(\psi(X) = 0) \le \delta.$$
We assume the following additional condition on the spike $u$:

(C1) $u_i^2 (1 - u_i^2) \ge c_{\min}/k$ for at least one $i \in [d]$, where $c_{\min} > 0$ is some constant.

³ From here on, we will use coordinate and variable interchangeably.

⁴ In the scaling limit where $d/n$ converges to a constant as $d, n \to \infty$, the covariance thresholding algorithm [15] theoretically succeeds at a signal strength that is a factor of order $\sqrt{\log d}$ smaller. However, our experimental results indicate that with an appropriate choice of black-box, our Q algorithm outperforms covariance thresholding.

⁵ Solving SPCA based on the correlation matrix was suggested in a few earlier works [49, 41].
The above condition says that at least one coordinate has enough mass, yet the mass is not entirely concentrated on just that single coordinate. Trivially, we always have at least one $i \in [d]$ such that $u_i^2 \ge 1/k$, but this is not enough for our regression setup, since we want at least one other coordinate $j$ to have sufficient correlation with coordinate $i$. We remark that the above condition is a very mild technical condition. If it were violated, almost all of the mass of $u$ would be on a single coordinate, so a simple procedure for testing the variance (which is akin to diagonal thresholding) would suffice.

Support recovery. The goal of support recovery is to identify the support of $u$ from our samples $X$. More precisely, we say that a support recovery algorithm succeeds if the recovered support $\hat{S}$ is the same as $S$, the true support of $u$. As is standard in the literature [1, 31], we need to assume a minimal bound on the size of the entries of $u$ on the support. For our support recovery algorithm, we will assume the following condition (note that it implies Condition (C1) and is much stronger):

(C2) $|u_i| \ge c_{\min}/\sqrt{k}$ for every $i$ in the support of $u$, for some constant $0 < c_{\min} < 1$.

Though the settings are a bit different, this minimal bound along with our results is consistent with lower bounds known for sparse recovery. These lower bounds ([18, 42]; the bound of [18] is a factor of $k$ weaker) imply that the number of samples must grow roughly as $n \gtrsim (1/u_{\min}^2)\, k \log d$, where $u_{\min}$ is the smallest entry of our signal $u$ normalized by $1/\sqrt{k}$, which is qualitatively the same threshold required by our theorems.

2.2 Background on SLR

In linear regression, we observe a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times d}$ that are linked by the linear model $y = X\beta^* + w$, where $w \in \mathbb{R}^n$ is some form of observation noise, typically with i.i.d. $N(0, \sigma^2)$ entries. Our goal is to recover $\beta^*$ given the noisy observations $y$. While the matrices $X$ we consider arise from a (random) correlated design (as analyzed in [42], [43]), it will make no difference to assume the matrices are deterministic by conditioning, as long as the distribution of the design matrix and the noise are independent, which we will demonstrate in our case. Most of the relevant results on sparse linear regression pertain to deterministic design.

In sparse linear regression, we additionally assume that $\beta^*$ has only $k$ non-zero entries, where $k \ll d$. This makes the problem well posed in the high-dimensional setting. Commonly used performance measures for SLR are tailored to prediction error ($\frac{1}{n}\|X\beta^* - X\hat\beta\|_2^2$, where $\hat\beta$ is our guess), support recovery (recovering the support of $\beta^*$), or parameter estimation (minimizing $\|\beta^* - \hat\beta\|$ under some norm). We focus on prediction error, analyzed over random realizations of the noise. There is a large amount of work on SLR and we defer a more in-depth overview to Appendix A.

Most efficient methods for SLR impose certain conditions on $X$. We focus on the restricted eigenvalue condition, which, roughly stated, makes the prediction loss strongly convex near the optimum:

Definition 2.1 (Restricted eigenvalue [47]). First define the cone $C(S) = \{\beta \in \mathbb{R}^d \mid \|\beta_{S^c}\|_1 \le 3\|\beta_S\|_1\}$, where $S^c$ denotes the complement of $S$ and $\beta_T$ is $\beta$ restricted to the subset $T$. The restricted eigenvalue (RE) constant of $X$, denoted $\gamma(X)$, is defined as the largest constant $\gamma > 0$ such that
$$\frac{1}{n}\|X\beta\|_2^2 \ge \gamma \|\beta\|_2^2 \quad \text{for all } \beta \in \bigcup_{|S| = k,\, S \subseteq [d]} C(S).$$

For more discussion on the restricted eigenvalue, see Appendix A.
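As a concrete illustration of the kind of solver we will later treat as a black-box, the sketch below implements thresholded Lasso, the variant used in the experiments of Section 4: run the Lasso and keep only the $k$ largest coefficients in magnitude [47]. The function name, the scikit-learn dependency, and the default regularization level are illustrative assumptions of ours, not the paper's actual implementation or tuning; any solver exposing the same interface could be substituted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def slr_thresholded_lasso(y, X, k, lam=None):
    """Sketch of an SLR black-box SLR(y, X, k): thresholded Lasso.

    Runs the Lasso, then keeps the k largest coefficients in magnitude so the
    returned estimate is k-sparse. `lam` defaults to the usual sqrt(log d / n)
    scaling; the constant is a heuristic choice, not the paper's.
    """
    n, d = X.shape
    if lam is None:
        lam = 2.0 * np.sqrt(np.log(d) / n)
    beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_
    beta_hat = np.zeros(d)
    top_k = np.argsort(np.abs(beta))[-k:]   # indices of the k largest |coefficients|
    beta_hat[top_k] = beta[top_k]
    return beta_hat
```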
Black-box condition. Given the known guarantees on SLR, we define a condition that is natural to require of our SLR black-box, which is invoked as $\mathrm{SLR}(y, X, k)$.

Condition 2.2 (Black-box condition). Let $\gamma(X)$ denote the restricted eigenvalue of $X$. There are universal constants $c, c', c''$ such that $\mathrm{SLR}(y, X, k)$ outputs $\hat\beta$ that is $k$-sparse and satisfies
$$\frac{1}{n}\|X\hat\beta - X\beta^*\|_2^2 \le \frac{c}{\gamma(X)} \cdot \frac{\sigma^2 k \log d}{n} \quad \text{for all } \beta^* \in B_0(k),$$
with probability at least $1 - c'\exp(-c'' k \log d)$.

3 Algorithms and main results

We first discuss how to view samples from the spiked covariance model in terms of a linear model. We then give some intuition motivating our statistic. Finally, we state our algorithms and main theorems, and give a high-level sketch of the proof.

3.1 The linear model

Let $X^{(1)}, X^{(2)}, \dots, X^{(n)}$ be $n$ i.i.d. samples from the spiked covariance model; denote by $X \in \mathbb{R}^{n \times d}$ the matrix whose rows are the $X^{(i)}$. Intuitively, if variable $i$ is contained in the support of the spike, then the rest of the support should provide a nontrivial prediction for $X_i$, since variables in the support are correlated. Conversely, for $i$ not in the support (or under the isotropic null hypothesis), all of the variables are independent and the other variables are useless for predicting $X_i$. So we regress $X_i$ onto the rest of the variables.

Let $X_{-i}$ denote the matrix of samples in the SPCA model with the $i$th column removed. For each column $i$, we can view our data as coming from a linear model with design matrix $X = X_{-i}$ and response variable $y = X_i$. The true regression vector depends on $i$. Under the alternate hypothesis $H_1$, if $i \in S$, we can write $y = X\beta^* + w$, where $\beta^* = \frac{\theta u_i}{1 + \theta(1 - u_i^2)}\, u_{-i}$ and $w \sim N(0, \sigma^2 I_n)$ with $\sigma^2 = \frac{1 + \theta}{1 + \theta(1 - u_i^2)}$.⁶ If $i \notin S$, and for any $i \in [d]$ under the null hypothesis, $y = w$ where $w = X_i \sim N(0, I_n)$ (implicitly $\beta^* = 0$).

3.2 Designing the test statistic

Based on the linear model above, we want to compute a test statistic that will indicate when a coordinate $i$ is on the support. Intuitively, the predictive power of our linear model should be higher when $i$ is on the support. Indeed, a calculation shows that the variance in $X_i$ is reduced by approximately $\theta^2/k$. We want to measure this reduction in noise to detect whether $i$ is on the support or not.

Suppose for instance that we have access to $\beta^*$ rather than $\hat\beta$ (note that this is not possible in practice since we do not know the support!). Since we want to measure the reduction in noise when the variable is on the support, as a first step we might try the following statistic:
$$Q_i = \frac{1}{n}\|y - X\beta^*\|_2^2 .$$
Unfortunately, this statistic will not be able to distinguish the two hypotheses, as the reduction in the above error is too small (on the order of $\theta^2/k$, compared to an overall order of $1 + \theta$), so deviation due to random sampling will mask any reduction in noise. We can fix this by adding the variance term $\frac{1}{n}\|y\|_2^2$:
$$Q_i = \frac{1}{n}\|y\|_2^2 - \frac{1}{n}\|y - X\beta^*\|_2^2 .$$
On a more intuitive level, including $\frac{1}{n}\|y\|_2^2$ allows us to measure the relative gain in predictive power without being penalized by a possibly large variance in $y$. Fluctuations in $y$ due to noise will typically be canceled out in the difference of terms in $Q_i$, minimizing the variance of our statistic.

We have to add one final fix to the above estimator. We obviously do not have access to $\beta^*$, so we must use the estimate $\hat\beta = \mathrm{SLR}(y, X, k)$ (with $y, X$ as defined in Section 3.1), which we get from our black-box. As our analysis shows, this substitution does not affect much of the discriminative power of $Q_i$, as long as the SLR black-box satisfies the prediction error guarantee stated in Condition 2.2. This gives our final statistic:⁷
$$Q_i = \frac{1}{n}\|y\|_2^2 - \frac{1}{n}\|y - X\hat\beta\|_2^2 .$$

⁶ The theory of linear minimum mean-square-error (LMMSE) estimation confirms that this choice of $\beta^*$ minimizes the error $\sigma^2$. See Appendix B.1, B.2 for details of this calculation.

⁷ As pointed out by a reviewer, this statistic is actually equivalent to $R^2$ up to rescaling by the sample variance. Our formula is slightly different, though, as we use the sample variance computed with the population mean as opposed to the sample mean, since the mean is known to be zero.
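To see why this statistic separates the two hypotheses, the following short calculation is a sketch of the population-level computation (the paper's full analysis is deferred to Appendices B and C): it records the mean of the idealized statistic that uses $\beta^*$, plugging in the expressions for $\beta^*$ and $\sigma^2$ from Section 3.1.

```latex
% Mean of the idealized statistic Q_i (using beta^* rather than \hat\beta): a sketch.
\begin{align*}
i \in S \text{ under } H_1: \quad
\mathbb{E}[Q_i] &= \mathrm{Var}(X_i) - \sigma^2
  = (1 + \theta u_i^2) - \frac{1 + \theta}{1 + \theta(1 - u_i^2)}
  = \frac{\theta^2 u_i^2 (1 - u_i^2)}{1 + \theta(1 - u_i^2)}
  = (\beta^*)^\top \Sigma_{-i,-i}\, \beta^* , \\
i \notin S \text{ or under } H_0: \quad
\mathbb{E}[Q_i] &= 0 \qquad \text{(since } \beta^* = 0\text{)} .
\end{align*}
```

Under Condition (C1) and for $\theta = O(1)$, the on-support mean is of order $\theta^2/k$, which is the separation that the algorithms in Section 3.3 compare against a threshold of order $k\log d/n$; Condition 2.2 ensures that substituting $\hat\beta$ for $\beta^*$ changes $Q_i$ by roughly that same order, with high probability.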
3.3 Algorithms

Below we give algorithms for hypothesis testing and for support recovery, based on the Q statistic.

Algorithm 1 (Q-hypothesis testing)
Input: $X \in \mathbb{R}^{n \times d}$, $k$. Output: $\{0, 1\}$.
for $i = 1, \dots, d$ do
  $\hat\beta_i = \mathrm{SLR}(X_i, X_{-i}, k)$
  $Q_i = \frac{1}{n}\|X_i\|_2^2 - \frac{1}{n}\|X_i - X_{-i}\hat\beta_i\|_2^2$
  if $Q_i > \frac{13\, k \log d}{n}$ then return 1
end for
Return 0

Algorithm 2 (Q-support recovery)
Input: $X \in \mathbb{R}^{n \times d}$, $k$. Output: $\hat S$.
$\hat S = \emptyset$
for $i = 1, \dots, d$ do
  $\hat\beta_i = \mathrm{SLR}(X_i, X_{-i}, k)$
  $Q_i = \frac{1}{n}\|X_i\|_2^2 - \frac{1}{n}\|X_i - X_{-i}\hat\beta_i\|_2^2$
  if $Q_i > \frac{13\, k \log d}{n}$ then $\hat S \leftarrow \hat S \cup \{i\}$
end for
Return $\hat S$

Below we summarize our guarantees for the above algorithms. The proofs are simple, but we defer them to Appendix C.

Theorem 3.1 (Hypothesis test). Assume we have access to an SLR black-box that satisfies Condition 2.2, with runtime $T(d, n, k)$ per instance. Under Condition (C1), there exist universal constants $c_1, c_2, c_3, c_4$ such that if $\theta^2 > c_1 \frac{k^2 \log d}{n}$ and $n > c_2\, k \log d$, Algorithm 1 outputs a test $\psi$ such that
$$P_{H_0}(\psi(X) = 1) \vee P_{H_1}(\psi(X) = 0) \le c_3 \exp(-c_4\, k \log d)$$
in time $O(d\,T + d^2 n)$.

Theorem 3.2 (Support recovery). Under the same conditions as above plus Condition (C2), if $\theta^2 > c_1 \frac{k^2 \log d}{n}$, Algorithm 2 above finds $\hat S = S$ with probability at least $1 - c_3 \exp(-c_4\, k \log d)$ in time $O(d\,T + d^2 n)$.
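For concreteness, here is a direct translation of the two algorithms into code. It is a minimal sketch assuming NumPy and any black-box `slr(y, X, k)` with the interface of Condition 2.2 (for example, the thresholded-Lasso sketch at the end of Section 2.2); it is not the implementation used in Section 4, and the threshold constant is the one appearing in the pseudocode above as reconstructed here.

```python
import numpy as np

def q_statistics(X, k, slr):
    """Q statistic of every coordinate: regress X_i on the remaining columns
    with the SLR black-box and record the drop in residual energy."""
    n, d = X.shape
    Q = np.empty(d)
    for i in range(d):
        y = X[:, i]
        X_rest = np.delete(X, i, axis=1)
        beta_hat = slr(y, X_rest, k)
        Q[i] = (np.sum(y**2) - np.sum((y - X_rest @ beta_hat)**2)) / n
    return Q

def q_hypothesis_test(X, k, slr, c=13.0):
    """Algorithm 1: return 1 (spiked) if any coordinate exceeds the threshold."""
    n, d = X.shape
    tau = c * k * np.log(d) / n
    return int(np.any(q_statistics(X, k, slr) > tau))

def q_support_recovery(X, k, slr, c=13.0):
    """Algorithm 2: return all coordinates whose Q statistic exceeds the threshold."""
    n, d = X.shape
    tau = c * k * np.log(d) / n
    return np.flatnonzero(q_statistics(X, k, slr) > tau)
```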
3.4 Comments

RE for the sample design matrix. Because the population covariance $\Sigma = \mathbb{E}[XX^\top]$ has minimum eigenvalue 1, with high probability the sample design matrix $X$ has a constant restricted eigenvalue given enough samples, i.e. when $n$ is large enough (see Appendix B.3 for more details), and the prediction error guarantee of Condition 2.2 will be good enough for our analysis.

Running time. Apart from the $d$ calls to the SLR black-box, the runtime of both Algorithm 1 and Algorithm 2 is $O(nd^2)$. The discussion presented at the end of Appendix C details why this is competitive for (single spiked) SPCA, at least theoretically.

Unknown sparsity. Throughout the paper we assume that the sparsity level $k$ is known. However, if $k$ is unknown, standard techniques could be used to adaptively find approximate values of $k$ [16]. For instance, for hypothesis testing, we can start with an initial overestimate $k_0$ and keep halving it until enough coordinates $i$ have a $Q_i$ that passes the threshold for the given $k_0$.

Robustness of the Q statistic to rescaling. Intuitively, our algorithms for detecting correlated structure in data should be invariant to rescaling of the data; the precise scale or units in which a variable is measured should not have an impact on our ability to find meaningful structure underlying the data. Our algorithms based on Q are robust to rescaling, perhaps unsurprisingly, since correlations between variables in the support remain under rescaling. On the other hand, diagonal thresholding, an often-used preprocessing step for SPCA which filters variables strictly based on variance, would trivially fail under rescaling. This illustrates a strength of our framework over other existing algorithms for SPCA. Below we show explicitly that the Q statistic is indeed robust to rescaling.

Let $\tilde X = DX$ be the rescaling of $X$, where $D$ is some diagonal matrix, and let $D_S$ be $D$ restricted to the rows and columns in $S$. Note that $\tilde\Sigma$, the covariance matrix of the rescaled data, is just $D\Sigma D$, by expanding the definition. Similarly, note that $\tilde\Sigma_{2:d,1} = D_1 D_{2:d}\, \Sigma_{2:d,1}$, where $D_{2:d}$ denotes $D$ without row and column 1. Now, recall the term which dominated our analysis of $Q_i$ under $H_1$, namely $(\beta^*)^\top \Sigma_{2:d,2:d}\, \beta^*$, which was equal to $\Sigma_{1,2:d}\, \Sigma_{2:d,2:d}^{-1}\, \Sigma_{2:d,1}$. We replace the covariances by their rescaled versions to obtain
$$(\tilde\beta^*)^\top \tilde\Sigma_{2:d,2:d}\, \tilde\beta^* = (D_1 \Sigma_{1,2:d} D_{2:d})\, D_{2:d}^{-1} \Sigma_{2:d,2:d}^{-1} D_{2:d}^{-1}\, (D_{2:d} \Sigma_{2:d,1} D_1) = D_1^2\, (\beta^*)^\top \Sigma_{2:d,2:d}\, \beta^*.$$
For the spiked covariance model, rescaling the variances to one amounts to rescaling with $D_1^2 = 1/(1 + \theta u_1^2) \ge 1/(1 + \theta)$. Thus, we see that our signal strength is affected only by a constant factor (assuming $\theta \lesssim 1$).

4 Experiments

We test our algorithmic framework on randomly generated synthetic data and compare to other existing algorithms for SPCA. The code was implemented in Python using standard libraries. We refer to our general algorithm from Section 3 that uses the Q statistic as SPCAvSLR. For our SLR black-box, we use thresholded Lasso [47].⁸ (We experimented with other SLR algorithms such as the forward-backward algorithm of [46] and CoSaMP⁹ [32], but results were similar and only slower.) For more details on our experimental setup, including hyperparameter selection, see Appendix D.

Support recovery. We randomly generate a spike $u \in \mathbb{R}^d$ by first choosing a random support of size $k$ and then using random signs for each coordinate (the uniform magnitudes ensure that Condition (C2) is met). The spike is then scaled appropriately with $\theta$ to build the spiked covariance matrix of our normal distribution, from which we draw samples. We study how the performance of the algorithms varies over various values of $k$ for fixed $n$ and $d$.¹⁰ As in [15], our measure is the fraction of the true support that is recovered. We compare SPCAvSLR with the following algorithms: diagonal thresholding, which is a simple baseline; SPCA (ZHT) [49], a fast heuristic also based on the regression idea; the truncated power method of [45], which is known for both strong theoretical guarantees and empirical performance; and covariance thresholding, which has state-of-the-art theoretical guarantees. We modified each algorithm to return the top $k$ most likely coordinates in the support (rather than thresholding based on a cutoff); for algorithms that compute a candidate eigenvector, we choose the top $k$ coordinates largest in absolute value.

We observe that SPCAvSLR performs better than covariance thresholding and diagonal thresholding, but its performance falls short of that of the truncated power method and the heuristic algorithm of [49]. We suspect that experimenting with different SLR algorithms may slightly improve its performance. The reason for the gap between the performance of SPCAvSLR and other state-of-the-art algorithms, despite its theoretical guarantees, is open to further investigation.

Hypothesis testing. We generate data under two different distributions: for the spiked covariance model, we generate a spike $u$ by sampling a uniformly random direction from the $k$-dimensional unit sphere and embedding the vector at a random subset of $k$ coordinates among the $d$ coordinates; for the null, we draw from the standard isotropic Gaussian. In a single trial, we draw $n$ samples from each distribution and compute various statistics¹¹ (diagonal thresholding (DT), Minimal Dual Perturbation (MDP), and our Q statistic, again using thresholded Lasso). We repeat for 100 trials and plot the resulting empirical distribution of each statistic.
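The sketch below shows one way to generate the spikes and samples described in these two experiments. It follows the description above (for support recovery we take each nonzero entry to be $\pm 1/\sqrt{k}$, which is how we read the uniformity remark), but the function names, constants, and the example trial are ours rather than the paper's experimental code.

```python
import numpy as np

def spike_support_recovery(d, k, rng):
    """Spike for the support-recovery experiment: random support of size k,
    each nonzero entry +/- 1/sqrt(k) with a random sign (so (C2) holds)."""
    u = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    u[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return u, support

def spike_hypothesis_testing(d, k, rng):
    """Spike for the testing experiment: uniformly random direction on the
    k-dimensional unit sphere, embedded at a random subset of k coordinates."""
    u = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    v = rng.standard_normal(k)
    u[support] = v / np.linalg.norm(v)
    return u, support

def sample_spiked(n, u, theta, rng):
    """n samples from N(0, I_d + theta * u u^T), written as Z + sqrt(theta) * g u^T."""
    Z = rng.standard_normal((n, len(u)))
    g = rng.standard_normal(n)
    return Z + np.sqrt(theta) * np.outer(g, u)

# Example: one hypothesis-testing trial at the parameters reported in Figure 2.
rng = np.random.default_rng(0)
u, _ = spike_hypothesis_testing(d=500, k=30, rng=rng)
X_alt = sample_spiked(n=200, u=u, theta=4.0, rng=rng)   # H1 samples
X_null = rng.standard_normal((200, 500))                # H0 samples
```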
We observe similar performance of DT and Q, while MDP seems slightly more effective at distinguishing $H_0$ and $H_1$ at the same signal strength (that is, the distributions of the statistics under $H_0$ vs. $H_1$ are more well-separated).

⁸ Thresholded Lasso is a variant of the Lasso where, after running the Lasso, we keep the $k$ largest (in magnitude) coefficients to make the estimator $k$-sparse. Proposition 2 in [47] shows that thresholded Lasso satisfies Condition 2.2.

⁹ CoSaMP in theory requires the stronger condition of restricted isometry on the sensing matrix.

¹⁰ We remark that while the size of this dataset might seem too small to be representative of the high-dimensional setting, it is representative of the usual size of datasets on which these methods are usually tested. One bottleneck is the computation of the covariance matrix.

¹¹ While some of the algorithms used for support recovery in the previous section could in theory be adapted for hypothesis testing, the extensions were not immediate, so we do not consider them here.

Figure 1: Performance of diagonal thresholding, SPCA (ZHT), the truncated power method, covariance thresholding, and SPCAvSLR for support recovery at $n = d = 625$ (left) and $n = 625$, $d = 1250$ (right), varying values of $k$, and $\theta = 3.0$. On the horizontal axis we show $k/\sqrt{n}$; the vertical axis is the fraction of the support correctly recovered. Each datapoint on the figure is averaged over 50 trials.

Rescaling variables. As discussed in Section 3.4, our algorithms are robust to rescaling the covariance matrix to the correlation matrix. As illustrated in Figure 2 (right), DT fails while Q appears to remain effective for distinguishing the hypotheses in the same regime of parameters. Other methods such as MDP and CT also appear to be robust to such rescaling (not shown). This suggests that more modern algorithms for SPCA may be more appropriate than diagonal thresholding in practice, particularly on instances where the relative scales of the variables may not be accurate or knowable in advance, but we still want to be able to find correlation between the variables.

Figure 2: Performance of diagonal thresholding (DT), MDP, and Q for hypothesis testing at $n = 200$, $d = 500$, $k = 30$, $\theta = 4$ (left and center). $T_0$ denotes the statistic $T$ under $H_0$, and similarly for $T_1$. The effect of rescaling the covariance matrix to make variances indistinguishable is demonstrated (right).

5 Previous work

Here we discuss in more detail previous approaches to SPCA and how they relate to our work. Various approaches to SPCA have been designed in an extensive list of prior work. As we cannot cover all of them, we focus on works that aim to give computationally efficient (i.e. polynomial-time) algorithms with provable guarantees in settings similar to ours. These algorithms include fast, heuristic methods based on $\ell_1$ minimization [23, 49], rigorous but slow methods based on natural semidefinite program (SDP) relaxations [13, 1, 41, 44], iterative methods motivated by power methods for approximating eigenvectors [45, 24], and non-iterative methods based on random projections [20], among others. Many iterative methods rely on initialization schemes, such as ordinary PCA or diagonal thresholding [22]. Below, we discuss the known sample bounds for support recovery and hypothesis testing.

Support recovery. [1] analyzed both diagonal thresholding and an SDP for support recovery under the spiked covariance model.¹² They showed that the SDP requires an order of $k$ fewer samples when the SDP optimal solution is rank one.
However, [27] showed that the rank-one condition does not hold in general, particularly in the regime approaching the information-theoretic limit ($\sqrt{n} \lesssim k \lesssim n/\log d$). This is consistent with computational lower bounds from [3] ($k \gtrsim \sqrt{n}$), but a small gap remains (diagonal thresholding and SDPs succeed only up to $k \lesssim \sqrt{n/\log d}$). The above gap was closed by the covariance thresholding algorithm, first suggested by [27] and analyzed by [15], which succeeds in the regime $\sqrt{n/\log d} \lesssim k \lesssim \sqrt{n}$, although the theoretical guarantee is limited to the regime where $d/n$ converges to a constant, due to its reliance on techniques from random matrix theory.

Hypothesis testing. Some works [4, 1, 17] have focused on the problem of detection. In this case, [4] observed that it suffices to work with the much simpler dual of the standard SDP, called Minimal Dual Perturbation (MDP). Diagonal thresholding (DT) and MDP work up to the same signal threshold as for support recovery, but MDP seems to outperform DT on simulated data [4]. MDP works at the same signal threshold as the standard SDP relaxation for SPCA. [17] analyze a statistic based on an SDP relaxation and its approximation ratio to the optimal statistic. In the regime where $k$ and $n$ are proportional to $d$, their statistic succeeds at a signal threshold for $\theta$ that is independent of $d$, unlike the MDP. However, their statistic is quite slow to compute; the runtime is at least a high-order polynomial in $d$.

Regression-based approaches. To the best of our knowledge, our work is the first to give a general framework for SPCA that uses SLR in a black-box fashion. [49] uses specific algorithms for SLR such as the Lasso as a subroutine, but they use a heuristic alternating minimization procedure to solve a non-convex problem, and hence lack any theoretical guarantees. [31] applies a regression-based approach to a restricted class of graphical models. While our regression setup is similar, their statistic is different and their analysis depends directly on the particulars of the Lasso. Further, their algorithm requires extraneous conditions on the data. [9] also uses a reduction to linear regression for their problem of sparse subspace estimation. Their iterative algorithm depends crucially on a good initialization done by a diagonal-thresholding-like pre-processing step, which fails under rescaling of the data.¹³ Furthermore, their framework uses regression for the specific case of orthogonal design, whereas our design matrix can be more general as long as it satisfies a condition similar to the restricted eigenvalue condition. On the other hand, their setup allows for more general $\ell_q$-based sparsity as well as the estimation of an entire subspace, as opposed to a single component. [29] also achieves this more general setup, while still suffering from the same initialization problem.

Sparse priors. Finally, connections between SPCA and SLR have been noted in the probabilistic setting [26, 25], albeit in an indirect manner: the same sparsity-inducing priors can be used for either problem. We view our work as entirely different, as we focus on giving a black-box reduction. Furthermore, provable guarantees for the EM algorithm and variational methods are lacking in general, and it is not immediately obvious what signal threshold their algorithms achieve for the single spiked covariance model.

6 Conclusion

We gave a black-box reduction from instances of the SPCA problem under the spiked covariance model to instances of SLR.
Given oracle access to an SLR black-box meeting a certain natural condition, the reduction is shown to efficiently solve hypothesis testing and support recovery. Several directions are open for future work. The work in this paper remains limited to the Gaussian setting and to the single spiked covariance model; making the results more general would make the connection drawn here more appealing. Also, the algorithms designed here, though simple, seem a bit wasteful in that they do not aggregate information from the different statistics. Designing an estimator that makes more efficient use of the samples would be interesting. Finally, there is certainly room for improvement by tuning the choice of the SLR black-box to make the algorithm more efficient for use in practice.

¹² They analyze the subcase where the spike is uniform on all $k$ coordinates.

¹³ See Section 3.4 for more discussion on rescaling.

References

[1] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In IEEE International Symposium on Information Theory (ISIT 2008), pages 2454-2458. IEEE, 2008.
[2] Afonso S Bandeira, Edgar Dobriban, Dustin G Mixon, and William F Sawin. Certifying the restricted isometry property is hard. IEEE Transactions on Information Theory, 59(6):3448-3450, 2013.
[3] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory, pages 1046-1066, 2013.
[4] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780-1815, 2013.
[5] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705-1732, 2009.
[6] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265-274, 2009.
[7] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674-1697, 2007.
[8] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Sparse density estimation with $\ell_1$ penalties. In Learning Theory, pages 530-543. Springer, 2007.
[9] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074-3110, 2013.
[10] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313-2351, 2007.
[11] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.
[12] Dong Dai, Philippe Rigollet, Lucy Xia, and Tong Zhang. Aggregation of affine estimators. Electronic Journal of Statistics, 8(1):302-327, 2014.
[13] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert R G Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.
[14] David Gamarnik and Ilias Zadik. High dimensional regression with binary coefficients: estimating squared error and a phase transition. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 948-953. PMLR, 2017.
[15] Yash Deshpande and Andrea Montanari. Sparse PCA via covariance thresholding. In Advances in Neural Information Processing Systems, pages 334-342, 2014.
[16] Thong T Do, Lu Gan, Nam Nguyen, and Trac D Tran. Sparsity adaptive matching pursuit algorithm for practical compressed sensing. In 42nd Asilomar Conference on Signals, Systems and Computers, pages 581-587. IEEE, 2008.
[17] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Approximation bounds for sparse principal component analysis. Mathematical Programming, 148(1-2):89-110, 2014.
[18] Alyson K Fletcher, Sundeep Rangan, and Vivek K Goyal. Necessary and sufficient conditions for sparsity pattern recovery. IEEE Transactions on Information Theory, 55(12):5758-5772, 2009.
[19] David Gamarnik and Ilias Zadik. Sparse high-dimensional linear regression: algorithmic barriers and a local search algorithm. arXiv preprint arXiv:1711.04952, 2017.
[20] Milana Gataric, Tengyao Wang, and Richard J Samworth. Sparse principal component analysis via random projections. arXiv preprint arXiv:1712.05630, 2017.
[21] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295-327, 2001.
[22] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 2009.
[23] Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.
[24] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517-553, 2010.
[25] Rajiv Khanna, Joydeep Ghosh, Russell A Poldrack, and Oluwasanmi Koyejo. Sparse submodular probabilistic PCA. In AISTATS, 2015.
[26] Oluwasanmi O Koyejo, Rajiv Khanna, Joydeep Ghosh, and Russell Poldrack. On prior distributions and approximate inference for structured variables. In Advances in Neural Information Processing Systems, pages 676-684, 2014.
[27] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations really solve sparse PCA? Technical report, Weizmann Institute of Science, 2013.
[28] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.
[29] Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772-801, 2013.
[30] Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
[31] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436-1462, 2006.
[32] Deanna Needell and Joel A Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.
[33] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348-1356, 2009.
[34] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241-2259, 2010.
[35] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976-6994, 2011.
[36] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731-771, 2011.
[37] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. In Conference on Learning Theory, 2012.
[38] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[39] Sara van de Geer. The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 2007.
[40] Sara A van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
[41] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670-2678, 2013.
[42] Martin Wainwright. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. In IEEE International Symposium on Information Theory (ISIT 2007), pages 961-965. IEEE, 2007.
[43] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.
[44] Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational trade-offs in estimation of sparse principal components. The Annals of Statistics, 44(5):1896-1930, 2016.
[45] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899-925, 2013.
[46] Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems, pages 1921-1928, 2009.
[47] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv preprint arXiv:1402.1918, 2014.
[48] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. arXiv preprint arXiv:1503.03188, 2015.
[49] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.