# Approximating Sparse PCA from Incomplete Data

Abhisek Kundu (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY; kundua2@rpi.edu), Petros Drineas (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY; drinep@cs.rpi.edu), Malik Magdon-Ismail (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY; magdon@cs.rpi.edu)

**Abstract.** We study how well one can recover sparse principal components of a data matrix using a sketch formed from a few of its elements. We show that for a wide class of optimization problems, if the sketch is close (in the spectral norm) to the original data matrix, then one can recover a near optimal solution to the optimization problem by using the sketch. In particular, we use this approach to obtain sparse principal components and show that, for $m$ data points in $n$ dimensions, $O(\epsilon^{-2}\tilde{k}\max\{m,n\})$ elements give an $\epsilon$-additive approximation to the sparse PCA problem ($\tilde{k}$ is the stable rank of the data matrix). We demonstrate our algorithms extensively on image, text, biological and financial data. The results show that not only are we able to recover the sparse PCAs from the incomplete data, but, by using our sparse sketch, the running time drops by a factor of five or more.

## 1 Introduction

Principal components analysis (PCA) constructs a low dimensional subspace of the data such that projection of the data onto this subspace preserves as much information as possible (or, equivalently, maximizes the variance of the projected data). The earliest reference to PCA is [15]. Since then, PCA has evolved into a classic tool for data analysis. A challenge for the interpretation of the principal components (or factors) is that they can be linear combinations of all the original variables. When the original variables have direct physical significance (e.g., genes in biological applications or assets in financial applications), it is desirable to have factors which have loadings on only a small number of the original variables. These interpretable factors are sparse principal components (SPCA).

The question we address is not how to better perform sparse PCA; rather, it is whether one can perform sparse PCA on incomplete data and be assured some degree of success (i.e., can we do sparse PCA when we have a small sample of data points and those data points have missing features?). Incomplete data is a situation one is confronted with all too often in machine learning. For example, with user-recommendation data, one does not have all the ratings of any given user. Or, in a privacy preserving setting, a client may not want to give us all the entries in the data matrix. In such a setting, our goal is to show that if the samples that we do get are chosen carefully, the sparse PCA features of the data can be recovered within some provable error bounds. A significant part of this work is to demonstrate our algorithms on a variety of data sets.

More formally, the data matrix is $A \in \mathbb{R}^{m\times n}$ ($m$ data points in $n$ dimensions). Data matrices often have low effective rank. Let $A_k$ be the best rank-$k$ approximation to $A$; in practice, it is often possible to choose a small value of $k$ for which $\|A - A_k\|_2$ is small. The best rank-$k$ approximation $A_k$ is obtained by projecting $A$ onto the subspace spanned by its top-$k$ principal components $V_k$, the $n\times k$ matrix containing the top-$k$ right singular vectors of $A$.
These top-$k$ principal components are the solution to the variance maximization problem:

$$V_k = \arg\max_{V\in\mathbb{R}^{n\times k},\ V^T V = I}\ \mathrm{trace}(V^T A^T A V).$$

We denote the maximum variance attainable by $\mathrm{OPT}_k$, which is the sum of squares of the top-$k$ singular values of $A$. To get sparse principal components, we add a sparsity constraint to the optimization problem: every column of $V$ should have at most $r$ non-zero entries (the sparsity parameter $r$ is an input),

$$S_k = \arg\max_{V\in\mathbb{R}^{n\times k},\ V^T V = I,\ \|V^{(i)}\|_0 \le r}\ \mathrm{trace}(V^T A^T A V). \qquad (1)$$

The sparse PCA problem is itself a very hard problem that is not only NP-hard, but also inapproximable [12]. There are many heuristics for obtaining sparse factors [2, 18, 20, 5, 4, 14, 16], including some approximation algorithms with provable guarantees [1]. The existing research typically addresses the task of getting just the top principal component ($k = 1$); some exceptions are [11, 3, 19, 9]. While the sparse PCA problem is hard and interesting, it is not the focus of this work. We address the question: what if we do not know $A$, but only have a sparse sampling of some of the entries of $A$ (incomplete data)? The sparse sampling is used to construct a sketch of $A$, denoted $\tilde{A}$. There is not much else to do but solve the sparse PCA problem with the sketch $\tilde{A}$ instead of the full data $A$, to get $\tilde{S}_k$:

$$\tilde{S}_k = \arg\max_{V\in\mathbb{R}^{n\times k},\ V^T V = I,\ \|V^{(i)}\|_0 \le r}\ \mathrm{trace}(V^T \tilde{A}^T \tilde{A} V). \qquad (2)$$

We study how $\tilde{S}_k$ performs as an approximation to $S_k$ with respect to the objective that we are trying to optimize, namely $\mathrm{trace}(S^T A^T A S)$; that is, the quality of approximation is measured with respect to the true $A$. We show that the quality of approximation is controlled by how well $\tilde{A}^T\tilde{A}$ approximates $A^T A$, as measured by the spectral norm of the deviation $A^T A - \tilde{A}^T\tilde{A}$. This is a general result that does not rely on how one constructs the sketch $\tilde{A}$.

**Theorem 1 (Sparse PCA from a Sketch).** Let $S_k$ be a solution to the sparse PCA problem (1), and let $\tilde{S}_k$ be a solution to the sparse PCA problem (2) for the sketch $\tilde{A}$. Then,

$$\mathrm{trace}(\tilde{S}_k^T A^T A \tilde{S}_k) \ \ge\ \mathrm{trace}(S_k^T A^T A S_k) - 2k\,\|A^T A - \tilde{A}^T\tilde{A}\|_2.$$

Theorem 1 says that if we can closely approximate $A$ with $\tilde{A}$, then we can compute, from $\tilde{A}$, sparse components which capture almost as much variance as the optimal sparse components computed from the full data $A$. In our setting, the sketch $\tilde{A}$ is computed from a sparse sampling of the data elements in $A$ (incomplete data). To determine which elements to sample, and how to form the sketch, we leverage some recent results in element-wise matrix completion [8]. In a nutshell, if one samples larger data elements with higher probability than smaller data elements, then, for the resulting sketch $\tilde{A}$, the error $\|A^T A - \tilde{A}^T\tilde{A}\|_2$ will be small. The details of the sampling scheme and how the error depends on the number of samples are given in Section 2.1. Combining the bound on $\|A - \tilde{A}\|_2$ from Theorem 4 in Section 2.1 with Theorem 1, we get our main result:

**Theorem 2 (Sampling Complexity for Sparse PCA).** Sample $s$ data elements from $A \in \mathbb{R}^{m\times n}$ to form the sparse sketch $\tilde{A}$ using Algorithm 1. Let $S_k$ be a solution to the sparse PCA problem (1), and let $\tilde{S}_k$, which solves (2), be a solution to the sparse PCA problem for the sketch $\tilde{A}$ formed from the $s$ sampled data elements. Suppose the number of samples $s$ satisfies

$$s \ \ge\ 2k^2\epsilon^{-2}\big(\rho^2 + \epsilon\gamma/(3k)\big)\,\log\big((m+n)/\delta\big)$$

($\rho^2$ and $\gamma$ are dimensionless quantities that depend only on $A$). Then, with probability at least $1-\delta$,

$$\mathrm{trace}(\tilde{S}_k^T A^T A \tilde{S}_k) \ \ge\ \mathrm{trace}(S_k^T A^T A S_k) - 2\epsilon(2 + \epsilon/k)\,\|A\|_2^2.$$

The dependence of $\rho^2$ and $\gamma$ on $A$ is given in Section 2.1.
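To see where the particular form of the additive error in Theorem 2 comes from, here is a short sketch of how Theorem 1 combines with the sampling bound of Theorem 4 (applied with accuracy parameter $\epsilon/k$), writing $\Delta = \tilde{A} - A$; this is an outline of the argument, not a full proof:

$$\|A^T A - \tilde{A}^T\tilde{A}\|_2 \;=\; \|A^T\Delta + \Delta^T A + \Delta^T\Delta\|_2 \;\le\; 2\|A\|_2\|\Delta\|_2 + \|\Delta\|_2^2 \;\le\; \Big(\tfrac{2\epsilon}{k} + \tfrac{\epsilon^2}{k^2}\Big)\|A\|_2^2 \;=\; \tfrac{\epsilon}{k}\Big(2 + \tfrac{\epsilon}{k}\Big)\|A\|_2^2.$$

Multiplying by the factor $2k$ from Theorem 1 gives the additive error $2\epsilon(2+\epsilon/k)\|A\|_2^2$ in Theorem 2.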
Roughly speaking, we can ignore the term with $\gamma$ since it is multiplied by $\epsilon/k$, and $\rho^2 = O(\tilde{k}\max\{m,n\})$, where $\tilde{k}$ is the stable (numerical) rank of $A$. To paraphrase Theorem 2, when the stable rank is a small constant, with $O(k^2\max\{m,n\})$ samples (suppressing the dependence on $\epsilon$ and the logarithmic factor), one can recover almost as good sparse principal components as with all the data (the price being a small fraction of the optimal variance, since $\mathrm{OPT}_k \ge \|A\|_2^2$).

As far as we know, the only prior work related to the problem we consider here is [10], which proposed a specific method to construct sparse PCA from incomplete data. In contrast, we develop a general tool that can be used with any existing sparse PCA heuristic. Moreover, we derive much simpler bounds (Theorems 1 and 2) using matrix concentration inequalities, as opposed to the $\epsilon$-net arguments in [10]. We also give an application of Theorem 1 to running sparse PCA after denoising the data using a greedy thresholding algorithm that sets the small elements to zero (see Theorem 3). Such denoising is appropriate when the observed matrix has been element-wise perturbed by small noise, and the uncontaminated data matrix is sparse and contains large elements. We show that if an appropriate fraction of the (noisy) data is set to zero, one can still recover sparse principal components. This gives a principled approach to regularizing sparse PCA in the presence of small noise when the data is sparse.

Not only do our algorithms preserve the quality of the sparse principal components, but iterative algorithms for sparse PCA, whose running time is proportional to the number of non-zero entries in the input matrix, also benefit from the sparsity of $\tilde{A}$. Our experiments show about five-fold speed gains while producing near-comparable sparse components using less than 10% of the data.

**Discussion.** In summary, we show that one can recover sparse PCA from incomplete data while gaining computationally at the same time. Our result compares the optimal sparse components from $\tilde{A}$ with those from $A$. One cannot efficiently find these optimal components (the problem is NP-hard to even approximate), so in practice one runs a heuristic, in which case the approximation error of the heuristic would have to be taken into account. Our experiments show that using the incomplete data with the heuristics is just as good as using those same heuristics with the complete data. In practice, one may not be able to choose which elements to sample; rather, the samples are given to you. Our result establishes that if the samples are chosen with larger values being more likely, then one can recover sparse PCA. In such a setting one has no choice but to run sparse PCA on the sampled elements and hope; our theoretical results suggest that the outcome will be reasonable. This is because, while we do not have specific control over which samples we get, the samples are likely to represent the larger elements. For example, with user-recommendation data, users are more likely to rate items they either really like (large positive value) or really dislike (large negative value).

**Notation.** We use bold uppercase (e.g., $X$) for matrices and bold lowercase (e.g., $x$) for column vectors. The $i$-th row of $X$ is $X_{(i)}$, and the $i$-th column of $X$ is $X^{(i)}$. Let $[n]$ denote the set $\{1, 2, \ldots, n\}$. $\mathbb{E}(X)$ is the expectation of a random variable $X$; for a matrix, $\mathbb{E}(X)$ denotes the element-wise expectation. For a matrix $X \in \mathbb{R}^{m\times n}$, the Frobenius norm $\|X\|_F$ satisfies $\|X\|_F^2 = \sum_{i,j=1}^{m,n} X_{ij}^2$, and the spectral (operator) norm is $\|X\|_2 = \max_{\|y\|_2 = 1}\|Xy\|_2$.
We also use the $\ell_1$ and $\ell_0$ norms: $\|X\|_{\ell_1} = \sum_{i,j=1}^{m,n}|X_{ij}|$ and $\|X\|_0$ (the number of non-zero entries in $X$). The $k$-th largest singular value of $X$ is $\sigma_k(X)$, and $\log x$ is the natural logarithm of $x$.

## 2 Sparse PCA from a Sketch

In this section, we prove Theorem 1 and give a simple application to zeroing small fluctuations as a way to regularize against noise. In the next section we use a more sophisticated way to select the elements of the matrix, allowing us to tolerate a sparser matrix (more incomplete data) while still recovering sparse PCA to reasonable accuracy.

Theorem 1 is a corollary of a more general result for a class of optimization problems involving a Lipschitz-like objective function over an arbitrary (not necessarily convex) domain. Let $f(V, X)$ be a function defined for a matrix variable $V$ and a matrix parameter $X$. The optimization variable $V$ lies in some feasible set $S$, which is arbitrary; the parameter $X$ is also arbitrary. We assume that $f$ is locally Lipschitz in $X$, that is,

$$|f(V, \tilde{X}) - f(V, X)| \ \le\ \gamma(X)\,\|X - \tilde{X}\|_2 \quad \text{for all } V \in S.$$

(Note that we allow the Lipschitz constant to depend on the fixed matrix $X$, but not on the variables $V, \tilde{X}$; this is more general than a globally Lipschitz objective.) The next lemma is the key tool we need to prove Theorem 1, and it may be of independent interest in other optimization settings. We are interested in maximizing $f(V, X)$ w.r.t. $V$ to obtain $V^*$. But we only have an approximation $\tilde{X}$ of $X$, and so we maximize $f(V, \tilde{X})$ to obtain $\tilde{V}$, which will be a suboptimal solution with respect to $X$. We wish to bound $f(V^*, X) - f(\tilde{V}, X)$, which quantifies how suboptimal $\tilde{V}$ is w.r.t. $X$.

**Lemma 1 (Surrogate optimization bound).** Let $f(V, X)$ be $\gamma$-locally Lipschitz w.r.t. $X$ over the domain $V \in S$. Define $V^* = \arg\max_{V\in S} f(V, X)$ and $\tilde{V} = \arg\max_{V\in S} f(V, \tilde{X})$. Then,

$$f(V^*, X) - f(\tilde{V}, X) \ \le\ 2\gamma(X)\,\|X - \tilde{X}\|_2.$$

In the lemma, the function $f$ and the domain $S$ are arbitrary. In our setting, $X \in \mathbb{R}^{n\times n}$, the domain is $S = \{V \in \mathbb{R}^{n\times k} : V^T V = I_k,\ \|V^{(j)}\|_0 \le r\}$, and $f(V, X) = \mathrm{trace}(V^T X V)$. We first show that $f$ is Lipschitz w.r.t. $X$ with $\gamma = k$ (a constant independent of $X$). Let the representation of $V$ by its columns be $V = [v_1, \ldots, v_k]$. Then,

$$|\mathrm{trace}(V^T X V) - \mathrm{trace}(V^T \tilde{X} V)| \;=\; |\mathrm{trace}((X - \tilde{X})VV^T)| \;\le\; \sum_{i=1}^{k}\sigma_i(X - \tilde{X}) \;\le\; k\,\|X - \tilde{X}\|_2,$$

where $\sigma_i(\cdot)$ is the $i$-th largest singular value (we used von Neumann's trace inequality and the fact that $VV^T$ is a $k$-dimensional projection). Now, by Lemma 1,

$$\mathrm{trace}(V^{*T} X V^*) - \mathrm{trace}(\tilde{V}^T X \tilde{V}) \ \le\ 2k\,\|X - \tilde{X}\|_2.$$

Theorem 1 follows by setting $X = A^T A$ and $\tilde{X} = \tilde{A}^T\tilde{A}$ (a direct proof is given in footnote 1 below).

**Greedy thresholding.** We first consider the simplest scenario of incomplete data, where Theorem 1 gives some reassurance that one can compute good sparse principal components. Suppose the smallest data elements have been set to zero. This can happen, for example, if only the largest elements are measured, or, in a noisy setting, if the small elements are treated as noise and set to zero. So

$$\tilde{A}_{ij} = \begin{cases} A_{ij}, & |A_{ij}| \ge \delta;\\ 0, & |A_{ij}| < \delta.\end{cases}$$

Recall that $\tilde{k} = \|A\|_F^2/\|A\|_2^2$ is the stable rank of $A$, and define $\|A_\delta\|_F^2 = \sum_{|A_{ij}|<\delta} A_{ij}^2$. Let $\tilde{A} = A + \Delta$. By construction, $\|\Delta\|_F^2 = \|A_\delta\|_F^2$. Then,

$$\|A^T A - \tilde{A}^T\tilde{A}\|_2 \;=\; \|A^T\Delta + \Delta^T A + \Delta^T\Delta\|_2 \;\le\; 2\|A\|_2\|\Delta\|_2 + \|\Delta\|_2^2. \qquad (3)$$

Suppose the zeroing of elements only loses a small fraction of the energy in $A$, i.e., $\delta$ is selected so that $\|A_\delta\|_F^2 \le \epsilon^2\|A\|_F^2/\tilde{k}$; that is, at most an $\epsilon^2/\tilde{k}$ fraction of the total variance in $A$ has been lost in the unmeasured (zeroed) data. Then $\|\Delta\|_F \le \epsilon\|A\|_F/\sqrt{\tilde{k}} = \epsilon\|A\|_2$, and (3) gives $\|A^T A - \tilde{A}^T\tilde{A}\|_2 \le \epsilon(2+\epsilon)\|A\|_2^2$.

**Theorem 3.** Suppose that $\tilde{A}$ is created from $A$ by zeroing all elements of magnitude less than $\delta$, and $\delta$ is such that the truncated norm satisfies $\|A_\delta\|_F^2 \le \epsilon^2\|A\|_F^2/\tilde{k}$. Then the sparse PCA solution $\tilde{V}$ computed from $\tilde{A}$ satisfies

$$\mathrm{trace}(\tilde{V}^T A^T A \tilde{V}) \ \ge\ \mathrm{trace}(V^{*T} A^T A V^*) - 2k\epsilon\,\|A\|_2^2\,(2 + \epsilon).$$
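The following is a minimal NumPy sketch of the thresholding step analyzed above: it chooses the largest $\delta$ for which the zeroed entries carry at most an $\epsilon^2/\tilde{k}$ fraction of $\|A\|_F^2$. The function name and the greedy tie-handling are illustrative choices, not part of the paper's algorithm.

```python
import numpy as np

def threshold_sketch(A, eps):
    """Zero the smallest-magnitude entries of A while losing at most an
    eps^2 / stable_rank fraction of the total energy ||A||_F^2 (cf. Theorem 3)."""
    fro2 = np.linalg.norm(A, 'fro') ** 2
    stable_rank = fro2 / np.linalg.norm(A, 2) ** 2   # spectral norm via ord=2
    budget = eps ** 2 * fro2 / stable_rank
    # sort magnitudes ascending and count how many can be zeroed within the budget
    flat = np.sort(np.abs(A), axis=None)
    cum = np.cumsum(flat ** 2)
    num_zeroed = np.searchsorted(cum, budget, side='right')
    delta = flat[num_zeroed - 1] if num_zeroed > 0 else 0.0
    # keep entries strictly above delta (ties at delta may slightly exceed the budget)
    A_sketch = np.where(np.abs(A) > delta, A, 0.0)
    return A_sketch, delta
```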
Theorem 3 shows that it is possible to recover sparse PCA after setting small elements to zero. This is appropriate when most of the elements of $A$ are small noise and a few of the elements of $A$ contain large data values. For example, if the data consists of a sparse set of $O(\sqrt{nm})$ large elements (of magnitude, say, 1) and $nm - O(\sqrt{nm})$ small elements whose magnitude is $o(1/\sqrt{nm})$ (a high signal-to-noise setting), then $\|A_\delta\|_F^2/\|A\|_F^2 \to 0$, and with just a sparse sampling of the $O(\sqrt{nm})$ large elements (very incomplete data) we recover a near optimal sparse PCA. Greedily keeping only the large elements of the matrix requires a particular structure in $A$ to work, and it is based on a crude Frobenius-norm bound for the spectral error. In Section 2.1, we use recent results in element-wise matrix sparsification to choose the elements in a randomized way, with a bias toward larger elements. With high probability, one can directly bound the spectral error and hence get better performance.

¹ Theorem 1 can also be proved directly as follows:

$$\mathrm{trace}(V^{*T}XV^*) - \mathrm{trace}(\tilde{V}^T X \tilde{V}) = \mathrm{trace}(V^{*T}XV^*) - \mathrm{trace}(V^{*T}\tilde{X}V^*) + \mathrm{trace}(V^{*T}\tilde{X}V^*) - \mathrm{trace}(\tilde{V}^T X \tilde{V}) \le k\|X-\tilde{X}\|_2 + \mathrm{trace}(V^{*T}\tilde{X}V^*) - \mathrm{trace}(\tilde{V}^T X \tilde{V}) \le k\|X-\tilde{X}\|_2 + \mathrm{trace}(\tilde{V}^T\tilde{X}\tilde{V}) - \mathrm{trace}(\tilde{V}^T X \tilde{V}) \le 2k\|X-\tilde{X}\|_2.$$

### 2.1 An (ℓ1, ℓ2)-Sampling Based Sketch

In the previous section, we created the sketch by deterministically setting the small data elements to zero. Instead, we could randomly select the data elements to keep. It is natural to bias this random sampling toward the larger elements. Therefore, we define sampling probabilities for each data element $A_{ij}$ which are proportional to a mixture of the absolute value and the square of the element:

$$p_{ij} = \alpha\,\frac{|A_{ij}|}{\|A\|_{\ell_1}} + (1-\alpha)\,\frac{A_{ij}^2}{\|A\|_F^2}, \qquad (4)$$

where $\alpha \in (0, 1]$ is a mixing parameter. Such a sampling probability was used in [8] to sample data elements in independent trials to get a sketch $\tilde{A}$. We repeat the prototypical algorithm for element-wise matrix sampling as Algorithm 1 below.

**Algorithm 1: Hybrid (ℓ1, ℓ2)-Element Sampling**
Input: $A \in \mathbb{R}^{m\times n}$; number of samples $s$; probabilities $\{p_{ij}\}$.
1. Set $\tilde{A} = 0_{m\times n}$.
2. For $t = 1, \ldots, s$ (i.i.d. trials with replacement):
3. &nbsp;&nbsp;&nbsp;Randomly sample indices $(i_t, j_t) \in [m]\times[n]$ with $\mathbb{P}[(i_t, j_t) = (i, j)] = p_{ij}$.
4. &nbsp;&nbsp;&nbsp;Update $\tilde{A}$: $\tilde{A}_{i_t j_t} \leftarrow \tilde{A}_{i_t j_t} + A_{i_t j_t}/(s\,p_{i_t j_t})$.
5. Return $\tilde{A}$ (with at most $s$ non-zero entries).

Note that, unlike the deterministic zeroing of small elements, this sampling scheme samples the element $A_{ij}$ with probability $p_{ij}$ and then rescales it by $1/p_{ij}$. To see the intuition for this rescaling, consider the expected outcome of a single sample ($s = 1$): $\mathbb{E}[\tilde{A}_{ij}] = p_{ij}\cdot(A_{ij}/p_{ij}) + (1-p_{ij})\cdot 0 = A_{ij}$; that is, $\tilde{A}$ is a sparse but unbiased estimate of $A$. This unbiasedness holds for any choice of the sampling probabilities $p_{ij}$ defined over the elements of $A$ in Algorithm 1. However, for an appropriate choice of the sampling probabilities, we get much more than unbiasedness: we can control the spectral norm of the deviation, $\|A - \tilde{A}\|_2$. In particular, the hybrid-$(\ell_1, \ell_2)$ distribution in (4) was analyzed in [8], where the authors suggest an optimal choice $\alpha^*$ of the mixing parameter which minimizes the theoretical bound on $\|A - \tilde{A}\|_2$. The procedure to choose $\alpha^*$ is summarized in Algorithm 1 of [8].
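Below is a minimal NumPy sketch of the probabilities in (4) and of Algorithm 1, assuming a dense input matrix and a user-supplied mixing parameter `alpha` (the optimal $\alpha^*$ of Algorithm 1 of [8] is not reimplemented here; the function names are illustrative).

```python
import numpy as np

def hybrid_probabilities(A, alpha):
    """Hybrid (l1, l2) element-sampling probabilities from Eq. (4)."""
    absA = np.abs(A)
    return alpha * absA / absA.sum() + (1 - alpha) * A**2 / (A**2).sum()

def sample_sketch(A, s, p, seed=None):
    """Algorithm 1: s i.i.d. element samples (with replacement), rescaled so that
    the sketch is an unbiased estimate of A with at most s non-zeros."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    flat_p = p.ravel()
    flat_p = flat_p / flat_p.sum()            # guard against floating-point drift
    idx = rng.choice(m * n, size=s, replace=True, p=flat_p)
    A_sketch = np.zeros_like(A, dtype=float)
    for t in idx:
        i, j = divmod(t, n)                   # row-major flattening: t = i*n + j
        A_sketch[i, j] += A[i, j] / (s * p[i, j])
    return A_sketch
```

For example, one could form the sketch used in the experiments with `sample_sketch(A, s, hybrid_probabilities(A, alpha))`, with `alpha` set to the value of $\alpha^*$ reported in Table 1 for the corresponding data set.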
Using the probabilities in (4) to create the sketch $\tilde{A}$ via Algorithm 1, with the mixing parameter chosen by Algorithm 1 of [8], one can prove a bound on $\|A - \tilde{A}\|_2$. We state a simplified version of the bound from [8] in Theorem 4.

**Theorem 4 ([8]).** Let $A \in \mathbb{R}^{m\times n}$ and let $\epsilon > 0$ be an accuracy parameter. Define probabilities $p_{ij}$ as in (4), with $\alpha$ chosen using Algorithm 1 of [8]. Let $\tilde{A}$ be the sparse sketch produced by Algorithm 1 with a number of samples

$$s \ \ge\ 2\epsilon^{-2}\big(\rho^2 + \gamma\epsilon/3\big)\,\log\big((m+n)/\delta\big),$$

where $\rho^2$ and $\gamma$ are dimensionless quantities depending on $A$ and on $\alpha$ (their exact form is given in [8]); in particular, $\rho^2 = O(\tilde{k}\max\{m,n\})$, where $\tilde{k} = \|A\|_F^2/\|A\|_2^2$ is the stable rank of $A$. Then, with probability at least $1-\delta$,

$$\|A - \tilde{A}\|_2 \ \le\ \epsilon\,\|A\|_2.$$

## 3 Experiments

We show the experimental performance of sparse PCA from a sketch using several real data matrices. As we mentioned, sparse PCA is NP-hard, so we must use heuristics. These heuristics are discussed next, followed by the data, the experimental design, and finally the results.

**Algorithms for sparse PCA.** Let G ("ground truth") denote the algorithm which computes the principal components (which may not be sparse) of the full data matrix $A$; the optimal variance is $\mathrm{OPT}_k$. We consider six heuristics for obtaining sparse principal components:

- $G_{\max,r}$: the $r$ largest-magnitude entries in each principal component generated by G.
- $G_{sp,r}$: $r$-sparse components computed by the Spasm toolbox of [17] on $A$.
- $H_{\max,r}$: the $r$ largest-magnitude entries of the principal components of the $(\ell_1, \ell_2)$-sampled sketch $\tilde{A}$.
- $H_{sp,r}$: $r$-sparse components computed by Spasm on the $(\ell_1, \ell_2)$-sampled sketch $\tilde{A}$.
- $U_{\max,r}$: the $r$ largest-magnitude entries of the principal components of the uniformly sampled sketch $\tilde{A}$.
- $U_{sp,r}$: $r$-sparse components computed by Spasm on the uniformly sampled sketch $\tilde{A}$.

The output of an algorithm Z is a set of sparse principal components $V$, and our metric is the variance captured with respect to the original centered data $A$: $f(Z) = \mathrm{trace}(V^T A^T A V)$. We consider the following statistics:

- $f(G_{\max,r})/f(G_{sp,r})$: relative loss of greedy thresholding versus Spasm, illustrating the value of a good sparse PCA algorithm. Our sketch-based algorithms do not address this loss.
- $f(H_{\max/sp,r})/f(G_{\max/sp,r})$: relative loss of using the $(\ell_1, \ell_2)$-sketch $\tilde{A}$ instead of the complete data $A$. A ratio close to 1 is desired.
- $f(U_{\max/sp,r})/f(G_{\max/sp,r})$: relative loss of using the uniform sketch instead of the complete data $A$; a benchmark to highlight the value of a good sketch.

We also report the computation time of the algorithms. The results confirm that sparse PCA algorithms using the $(\ell_1, \ell_2)$-sketch are nearly comparable to the same algorithms on the complete data, and that the gain in computation time from the sparse sketch is proportional to its sparsity.

**Data sets.** We show results on image, text, stock, and gene expression data.

- Digit data ($m = 2313$, $n = 256$): we use the handwritten zip-code digit images of [7] (300 pixels/inch in 8-bit gray scale). Each pixel is a feature (normalized to lie in $[-1, 1]$), and each $16\times 16$ digit image forms a row of the data matrix $A$. We focus on three digits: "6" (664 samples), "9" (644 samples), and "1" (1005 samples).
- Tech TC data ($m = 139$, $n = 15170$): we use the Technion Repository of Text Categorization Datasets (Tech TC, see [6]) from the Open Directory Project (ODP). We removed words (features) with fewer than 5 letters, and each document (row) is normalized to unit norm.
- Stock data ($m = 7056$, $n = 1218$): we use S&P100 stock market data with 7056 snapshots of prices for 1218 stocks. The prices of each day form a row of the data matrix, and each stock is a feature; a principal component represents an "index" of sorts.
- Gene expression data ($m = 107$, $n = 22215$): we use the GSE10072 gene expression data for lung cancer from the NCBI Gene Expression Omnibus database. There are 107 samples (58 lung tumor cases and 49 normal lung controls) forming the rows of the data matrix, with 22,215 probes (features) from the GPL96 platform annotation table.
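The results that follow compare algorithms through the captured variance $f(Z) = \mathrm{trace}(V^T A^T A V)$. A minimal NumPy sketch of this metric and of a "max, r" truncation heuristic is given below as a stand-in for $G_{\max,r}$/$H_{\max,r}$ (Spasm [17] is a MATLAB toolbox and is not reproduced here; the function names are illustrative).

```python
import numpy as np

def captured_variance(A, V):
    """f(Z) = trace(V^T A^T A V): variance captured w.r.t. the original centered data A."""
    AV = A @ V
    return np.sum(AV * AV)   # equals ||A V||_F^2 = trace(V^T A^T A V)

def max_r_component(B, r):
    """'max, r' heuristic: top right singular vector of B, truncated to its
    r largest-magnitude entries and renormalized (an n x 1 sparse loading)."""
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    v = Vt[0]
    keep = np.argsort(np.abs(v))[-r:]
    v_sparse = np.zeros_like(v)
    v_sparse[keep] = v[keep]
    return (v_sparse / np.linalg.norm(v_sparse)).reshape(-1, 1)
```

For instance, the ratio $f(H_{\max,r})/f(G_{\max,r})$ reported in Table 1 corresponds to `captured_variance(A, max_r_component(A_sketch, r)) / captured_variance(A, max_r_component(A, r))`; the "sp, r" variants replace the truncation heuristic with Spasm.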
### 3.1 Results

We report results primarily for the top principal component ($k = 1$), which is the case most considered in the literature. When $k > 1$, our results do not change qualitatively. We note the optimal mixing parameter $\alpha^*$ obtained using Algorithm 1 of [8] for the various data sets in Table 1.

**Handwritten digits.** We sample approximately 7% of the elements from the centered data using $(\ell_1, \ell_2)$-sampling, as well as uniform sampling. The performance for small $r$ is shown in Table 1, including the running time $\tau$. For this data, $f(G_{\max,r})/f(G_{sp,r}) \approx 0.23$ ($r = 10$), so it is important to use a good sparse PCA algorithm. We see from Table 1 that the $(\ell_1, \ell_2)$-sketch significantly outperforms the uniform sketch; a more extensive comparison of the recovered variance is given in Figure 2(a). We also observe a speed-up of a factor of about 6 for the $(\ell_1, \ell_2)$-sketch. We point out that the uniform sketch is reasonable for the digits data because most data elements are close to either $+1$ or $-1$, since the pixels are mostly black or white. We show a visualization of the principal components in Figure 1: the sparse components from the $(\ell_1, \ell_2)$-sketch are almost identical to those from the complete data.

| Data set | $\alpha^*$ | $r$ | $f(H_{\max/sp,r})/f(G_{\max/sp,r})$ | $\tau(G)/\tau(H)$ | $f(U_{\max/sp,r})/f(G_{\max/sp,r})$ | $\tau(G)/\tau(U)$ |
|---|---|---|---|---|---|---|
| Digit | 0.42 | 40 | 0.99 / 0.90 | 6.21 | 1.01 / 0.70 | 5.33 |
| Tech TC | 1.00 | 40 | 0.94 / 0.99 | 5.70 | 0.41 / 0.38 | 5.96 |
| Stock | 0.10 | 40 | 1.00 / 1.00 | 3.72 | 0.66 / 0.66 | 4.76 |
| Gene | 0.92 | 40 | 0.82 / 0.88 | 3.61 | 0.65 / 0.15 | 2.53 |

Table 1: Comparison of sparse principal components from the $(\ell_1, \ell_2)$-sketch and the uniform sketch.

Figure 1 (panels: (a) $r = 100\%$, (b) $r = 50\%$, (c) $r = 30\%$, (d) $r = 10\%$): [Digits] Visualization of the top-3 sparse principal components. In each panel, the left image shows $G_{sp,r}$ and the right image shows $H_{sp,r}$.

Figure 2 (panels: (a) Digit, (b) Tech TC, (c) Stock, (d) Gene; x-axis: sparsity constraint $r$ in percent; curves: $f(H_{sp,r})/f(G_{sp,r})$ and $f(U_{sp,r})/f(G_{sp,r})$): Performance of sparse PCA for the $(\ell_1, \ell_2)$-sketch and the uniform sketch over an extensive range of the sparsity constraint $r$. The performance of the uniform sketch is significantly worse, highlighting the importance of a good sketch.

**Tech TC data.** We sample approximately 5% of the elements from the centered data using $(\ell_1, \ell_2)$-sampling, as well as uniform sampling. For this data, $f(G_{\max,r})/f(G_{sp,r}) \approx 0.84$ ($r = 10$). We observe a very significant performance difference between the $(\ell_1, \ell_2)$-sketch and the uniform sketch; a more extensive comparison of the recovered variance is given in Figure 2(b). We also observe a speed-up of a factor of about 6 for the $(\ell_1, \ell_2)$-sketch. Unlike the digits data, whose entries are uniformly near $\pm 1$, the text data is spiky, and here it is important to sample with a bias toward larger elements, which is why the uniform sketch performs very poorly. As a final comparison, we look at the actual sparse top component with sparsity parameter $r = 10$. The topic IDs in the Tech TC data are 10567 ("US: Indiana: Evansville") and 11346 ("US: Florida"). The top-10 features (words) in the full PCA of the complete data are shown in Table 2. In Table 3 we show which words appear in the top sparse principal component with sparsity $r = 10$ for the various sparse PCA algorithms. We observe that sparse PCA from the $(\ell_1, \ell_2)$-sketch, with only 5% of the data sampled, matches quite closely the same sparse PCA algorithm applied to the complete data ($G_{\max/sp,r}$ matches $H_{\max/sp,r}$).

| ID | Top 10 in $G_{\max,r}$ | ID | Other words |
|---|---|---|---|
| 1 | evansville | 11 | service |
| 2 | florida | 12 | small |
| 3 | south | 13 | frame |
| 4 | miami | 14 | tours |
| 5 | indiana | 15 | faver |
| 6 | information | 16 | transaction |
| 7 | beach | 17 | needs |
| 8 | lauderdale | 18 | commercial |
| 9 | estate | 19 | bullet |
| 10 | spacer | 20 | inlets |
|  |  | 21 | producer |

Table 2: [Tech TC] Top ten words in the top principal component of the complete data (the other words are discovered by some of the sparse PCA algorithms).

| $G_{\max,r}$ | $H_{\max,r}$ | $U_{\max,r}$ | $G_{sp,r}$ | $H_{sp,r}$ | $U_{sp,r}$ |
|---|---|---|---|---|---|
| 1 | 1 | 6 | 1 | 1 | 6 |
| 2 | 2 | 14 | 2 | 2 | 14 |
| 3 | 3 | 15 | 3 | 3 | 15 |
| 4 | 4 | 16 | 4 | 4 | 16 |
| 5 | 5 | 17 | 5 | 5 | 17 |
| 6 | 7 | 7 | 6 | 7 | 7 |
| 7 | 6 | 18 | 7 | 8 | 18 |
| 8 | 8 | 19 | 8 | 6 | 19 |
| 9 | 11 | 20 | 9 | 12 | 20 |
| 10 | 12 | 21 | 13 | 11 | 21 |

Table 3: [Tech TC] Relative ordering of the words (w.r.t. $G_{\max,r}$) in the top sparse principal component with sparsity parameter $r = 10$.

**Stock data.**
We sample about 2% of the non-zero elements from the centered data using $(\ell_1, \ell_2)$-sampling, as well as uniform sampling. For this data, $f(G_{\max,r})/f(G_{sp,r}) \approx 0.96$ ($r = 10$). We observe a very significant performance difference between the $(\ell_1, \ell_2)$-sketch and the uniform sketch; a more extensive comparison of the recovered variance is given in Figure 2(c). We also observe a speed-up of a factor of about 4 for the $(\ell_1, \ell_2)$-sketch. Similar to the Tech TC data, this data set is also spiky, so biased sampling toward larger elements significantly outperforms the uniform sketch.

**Gene expression data.** We sample about 9% of the elements from the centered data using $(\ell_1, \ell_2)$-sampling, as well as uniform sampling. For this data, $f(G_{\max,r})/f(G_{sp,r}) \approx 0.05$ ($r = 10$), which means a good sparse PCA algorithm is imperative. We observe a very significant performance difference between the $(\ell_1, \ell_2)$-sketch and the uniform sketch; a more extensive comparison of the recovered variance is given in Figure 2(d). We also observe a speed-up of a factor of about 4 for the $(\ell_1, \ell_2)$-sketch. Similar to the Tech TC data, this data set is also spiky, and consequently biased sampling toward larger elements significantly outperforms the uniform sketch.

**Performance of other sketches.** We briefly report on other options for sketching $A$. We consider a suboptimal $\alpha$ (not the $\alpha^*$ from Algorithm 1 of [8]) in (4) to construct a suboptimal hybrid distribution, and use this distribution in Algorithm 1 to construct a sparse sketch. Figure 3 shows that a good sketch, i.e., one using $\alpha^*$, is important.

Figure 3 (x-axis: sparsity constraint $r$ in percent; curves: $f(H_{sp,r})$ for $\alpha = 0.1$ and $\alpha = 1.0$): [Stock data] Performance of the sketch using a suboptimal $\alpha$, illustrating the importance of the optimal mixing parameter $\alpha^*$.

**Conclusion.** It is possible to use a sparse sketch (incomplete data) to recover sparse principal components that are nearly as good as those one would have obtained from the complete data. We mention that, while $G_{\max}$, which uses the largest weights of the unconstrained PCA, does not perform well with respect to the captured variance, it does identify good features. A simple enhancement to $G_{\max}$ is to recalibrate the sparse component after identifying the features: this is an unconstrained PCA problem on just the columns of the data matrix corresponding to the selected features. This method of recalibration can be used to improve any sparse PCA algorithm.
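A minimal NumPy sketch of this recalibration idea is shown below; the helper name and the use of the top right singular vector of the selected columns as the re-fitted loadings are illustrative choices, not the paper's prescribed implementation.

```python
import numpy as np

def recalibrate(A, v_sparse):
    """Re-fit the nonzero loadings of a sparse component by an unconstrained
    top principal component restricted to the selected columns of A."""
    v = np.ravel(v_sparse)
    support = np.flatnonzero(v)                      # features selected by the sparse PCA
    _, _, Vt = np.linalg.svd(A[:, support], full_matrices=False)
    v_new = np.zeros_like(v, dtype=float)
    v_new[support] = Vt[0]                           # top right singular vector on the support
    return v_new
```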
Our algorithms are simple and efficient, and many interesting avenues for further research remain. Can the sampling complexity for the top-$k$ sparse PCA be reduced from $O(k^2)$ to $O(k)$? We suspect that this should be possible by obtaining a better bound on $\sum_{i=1}^{k}\sigma_i(A^T A - \tilde{A}^T\tilde{A})$; we used the crude bound $k\,\|A^T A - \tilde{A}^T\tilde{A}\|_2$. We also presented a general surrogate optimization bound which may be of interest in other applications. In particular, it is pointed out in [13] that, though PCA optimizes variance, a more natural way to look at PCA is as the linear projection of the data that minimizes information loss; [13] gives efficient algorithms to find sparse linear dimension reductions that minimize information loss, and the information loss of sparse PCA can be considerably higher than optimal. To minimize information loss, the objective to maximize is $f(V) = \mathrm{trace}(A^T A V (A V)^{+} A)$, where $(AV)^{+}$ denotes the pseudo-inverse. It would be interesting to see whether one can recover sparse low-information-loss linear projectors from incomplete data.

**Acknowledgments.** AK and PD are partially supported by NSF IIS-1447283 and IIS-1319280.

## References

[1] M. Asteris, D. Papailiopoulos, and A. Dimakis. Non-negative sparse PCA with provable guarantees. In Proc. ICML, 2014.
[2] J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal components. Applied Statistics, 22:203–214, 1995.
[3] T. T. Cai, Z. Ma, and Y. Wu. Sparse PCA: optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013.
[4] A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, June 2008.
[5] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[6] E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. In Proc. ICML, 2004.
[7] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
[8] A. Kundu, P. Drineas, and M. Magdon-Ismail. Recovering PCA from hybrid-(ℓ1, ℓ2) sparse sampling of data elements. arXiv:1503.00547, 2015.
[9] J. Lei and V. Q. Vu. Sparsistency and agnostic inference in sparse PCA. The Annals of Statistics, 43(1):299–322, 2015.
[10] K. Lounici. Sparse principal component analysis with missing observations. arXiv:1205.7060, 2012.
[11] Z. Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801, 2013.
[12] M. Magdon-Ismail. NP-hardness and inapproximability of sparse PCA. arXiv:1502.05675, 2015.
[13] M. Magdon-Ismail and C. Boutsidis. arXiv:1502.06626, 2015.
[14] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proc. ICML, 2006.
[15] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572, 1901.
[16] H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99:1015–1034, July 2008.
[17] K. Sjöstrand, L. H. Clemmensen, R. Larsen, and B. Ersbøll. SpaSM: a Matlab toolbox for sparse statistical modeling. Journal of Statistical Software (accepted for publication), 2012.
[18] N. Trendafilov, I. T. Jolliffe, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12:531–547, 2003.
[19] Z. Wang, H. Lu, and H. Liu. Nonconvex statistical optimization: minimax-optimal sparse PCA in polynomial time. arXiv:1408.5352, 2014.
[20] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.