# streaming_sparse_principal_component_analysis__0fe0be39.pdf

Streaming Sparse Principal Component Analysis

Wenzhuo Yang A0096049@NUS.EDU.SG Department of Mechanical Engineering, National University of Singapore, Singapore 117576

Huan Xu MPEXUH@NUS.EDU.SG Department of Mechanical Engineering, National University of Singapore, Singapore 117576

This paper considers estimating the leading k principal components with at most s non-zero attributes from p-dimensional samples collected sequentially in memory limited environments. We develop and analyze two memory and computational efﬁcient algorithms called streaming sparse PCA and streaming sparse ECA for analyzing data generated according to the spike model and the elliptical model respectively. In particular, the proposed algorithms have memory complexity O(pk), computational complexity O(pk min{k, s log p}) and sample complexity Θ(s log p). We provide their ﬁnite sample performance guarantees, which implies statistical consistency in the high dimensional regime. Numerical experiments on synthetic and realworld datasets demonstrate good empirical performance of the proposed algorithms.

1. Introduction

Principal component analysis (PCA) (Pearson, 1901), arguably the most widely used dimension reduction method, is a fundamental tool in data analysis in a wide range of areas including machine learning, ﬁnance, statistics and many others. Standard PCA extracts the principal components (PCs) of a set of samples by computing the leading eigenvectors of the sample covariance matrix. However, in the face of modern high dimensional data with p n, PCA is no longer statistically solid. Indeed, Johnstone & Lu (2009) showed that the consistency of PCA depends crucially on the limiting value of p/n: the angle between the PCA estimate and the true leading PC does not converge to zero unless p/n goes to zero.

To address this inconsistency issue and encourage more in-

Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

terpretable solutions, most of previous works focus on the sparse setting assuming that the leading PCs are sparse, i.e., only a few of their attributes are non-zero. In this setting, many variants of sparse PCA have been developed (e.g., Zou et al., 2006; d Aspremont et al., 2007; Shen & Huang, 2008; Journee & Y. Nesterov, 2008; Birnbaum et al., 2013; Vu et al., 2013; Yuan & Zhang, 2013; Wang et al., 2014). For example, Zou et al. (2006) proposed to use a regression-type optimization problem based on the elastic-net to compute sparse PCs. d Aspremont et al. (2007) considered a convex semideﬁnite program formulation for sparse PCA. Yuan & Zhang (2013) and Ma (2013) proposed the TPower method and the iterative thresholding sparse PCA respectively, which are essentially modiﬁed variants of the classical power method. Vu et al. (2013) proposed the Fantope projection selection method (FPS) which is a convex relaxation formulation of sparse principal subspace estimation based on a semideﬁnite program.

Yet, it can be hard to apply these sparse PCA methods to real large scale data, as 1) they need to explicitly compute the sample covariance matrix or store all the samples, which means that at least O(p min{p, n}) storage is required; 2) their computational cost may become prohibitive when the dimensionality is high. For example, FPS is solved via an ADMM algorithm that requires to perform spectral decomposition on a p p matrix in each iteration.

For non-sparse PCA, many computation/memory efﬁcient algorithms have been proposed in recent years. For example, Warmuth & Kuzmin (2008) proposed a multiplicative update algorithm called online PCA under the streaming data model i.e., the samples are received sequentially. Although each update of online PCA can be calculated efﬁciently, it still requires O(p2) storage. Brand (2002) and Arora et al. (2012) developed two variants of PCA incremental PCA and the stochastic power method both of which have low memory and computational complexity. Yet, they only showed the empirical performance and provided no theoretical performance guarantees. Recently, Mitliagkas et al. (2013) proposed streaming PCA

Streaming Sparse Principal Component Analysis

with memory complexity O(pk) and sample complexity Θ(p log p). However, similar to PCA, all these method are inconsistent in the high dimensional regime since sparsity is not exploited. Indeed, how to design a computationand memory-efﬁcient sparse PCA method remains unsolved. Mairal et al. (2010) proposed an online learning algorithm for sparse coding that leads to an online sparse PCA algorithm. Although this algorithm requires only O(pk) memory, its computational cost is high due to solving the elasticnet problem in each iteration.

Another important issue is the sub-Gaussianality assumption. Many sparse PCA methods are theoretically analyzed under the spike model (e.g., Amini & Wainwright, 2009; Vu et al., 2013; Yuan & Zhang, 2013; Vu & Lei, 2012; Shen et al., 2013; Mitliagkas et al., 2013; Cai et al., 2014). The limitation of the spike model is it requires subgaussian data and noise, and hence can not model heavy-tail distributions. To relax this assumption, Han & Liu (2013b) used the semiparametric transelliptical family to model data and proposed the transelliptical component analysis (TCA) based on the the marginal Kendall s tau statistic for estimating the leading eigenvectors of the correlation matrix. In their following-up work (Han & Liu, 2013a), they developed the elliptical component analysis (ECA) based on the multivariate Kendall s tau statistic for estimating the leading eigenvectors of the covariance matrix under the elliptical family (Fang et al., 1990) a semiparametric generalization of the Gaussian family that can model heavy-tail distributions and nontrivial tail dependence between variables (Hult & Lindskog, 2002). Although TCA and ECA have beautiful theoretical guarantees, they require at least O(p2n2 + p3) computation and O(p2) memory due to the calculation of the marginal/multivariate Kendall s tau matrix and the spectral decomposition, which makes them unsuitable for large-scale applications.

In this paper, to address the issues discussed above, we propose two variants of sparse PCA streaming sparse PCA and streaming sparse ECA for estimating the leading PCs of samples drawn from the spike model and the elliptical model, respectively. Our theoretical analysis shows that both algorithms have memory complexity O(pk), computational complexity O(pk min{k, s log p}) and sample complexity Θ(s log p), and are consistent in the high dimensional regime.

Notation. Matrices and column vectors are denoted by upper-case and lower-case boldface letters, respectively. For matrix X, we use X 0 and X 2 to denote the number of non-zero entries and the spectral norm of X, respectively. The ﬁrst k singular values of X are denoted by λ1(X), , λk(X), and the ith column and the ith row of X are denoted by Xi and X(i) respectively. For a square matrix S, its main diagonal is denoted by diag(S).

2. Problem Setup

We consider the streaming data model where one receives sample xt at time t for t = 1, 2, , and xt vanishes after it is collected unless it is stored in the memory. Our goal is to extract the leading k PCs of the received data. We consider the following two models that generate the samples:

Spike model. Sample xt is generated according to xt = Azt + wt where zt Rd is an i.i.d. sample of the random vector z N(0, Id), wt Rp is an i.i.d. Gaussian noise drawn from N(0, σ2Ip), and matrix A Rp d is deterministic but unknown. Let Σ be the covariance of xt, i.e., Σ E[xtx t ] = AA + σ2Ip.

Elliptical model. Sample xt is generated according to the elliptical distribution ECp(µ, Σ, ξ), i.e., xt = µ + ξt Azt, where zt Rd is a sample of z which is a uniform random vector on the unit sphere, ξt is a sample of ξ which is a scalar random variable (with unknown distribution) independent of z, µ Rp is a ﬁxed vector, and A Rp d

is a deterministic matrix satisfying AA = Σ. We only consider the case that ξ has a continuous distribution.

As our algorithms are designed for the sparse setting, we assume that the projection matrix Π Uk U k onto the subspace spanned by the eigenvectors Uk Rp k

of Σ corresponding to its k largest eigenvalues satisﬁes diag(Π) 0 s, where s indicates the sparsity which is less than the dimension p and the number of samples n.

3. Algorithm

We now present the details of streaming sparse PCA and streaming sparse ECA for analyzing high dimensional data. Similar to streaming PCA (Mitliagkas et al., 2013), our algorithms are block-wise stochastic power methods that update the estimated PCs once a block of samples are received. The difference between our algorithms and streaming PCA is that we propose to apply a row truncation operator as shown in Algorithm 1 to maintain the sparsity of the estimated PCs to achieve consistency in the high dimensional regime, which truncates the row vectors of a matrix to zero except the ones with the largest l2-norms.

Algorithm 1 Row Truncation Operator

Input: Matrix X Rp k and parameter s. Procedure:

1) Compute vi = X(i) 2 for i = 1, , p; 2) Sort {vi} and select the largest s ones. Let I be the selected indices; 3) Compute X where X(i) = X(i) if i I or 0 otherwise, and return X.

We ﬁrst present two streaming sparse PCA algorithms, i.e.,

Streaming Sparse Principal Component Analysis

Algorithm 2 and 3, for analyzing data generated according to the spike model. Algorithm 2 accepts a sequence of samples and estimates the leading k sparse PCs simultaneously. The samples are divided into blocks with equal-size B. In each block, the estimated PCs are updated via the power method update followed by the row truncation operation and the QR decomposition. Clearly, its memory complexity is O(pk) since only the estimated PCs need to be stored in the memory. Its computational complexity for each iteration is O(p(k2 + log p) + pk B) since the computation of the power method update requires O(pk B) operations, the row truncation operation needs O(pk + p log p) operations and the QR decomposition requires O(pk2) operations. In the next section, we show that block size B = Θ(s log p) which implies that streaming sparse PCA has a much lower computational complexity O(pk min{k, s log p}) than the methods proposed by Vu et al. (2013); d Aspremont et al. (2007) when s is much smaller than p.

Algorithm 2 Streaming SPCA via Row Truncation Input: Samples {x1, x2, }, number of steps T, block size B, parameter γ and initial solution Q0 Rp k. Procedure:

for τ = 0 to T 1 do

1) Initialize Sτ+1 = 0; for t = Bτ + 1 to B(τ + 1) do

2) Sτ+1 = Sτ+1 + 1

B xtx t Qτ; end for 3) Sτ+1 = Truncate( Sτ+1, γ); 4) QR-decomposition: Sτ+1 = Qτ+1Rτ+1; end for 5) Return QT .

Intuitively, Algorithm 2 works when matrix Uk Rp k which consists of the leading k PCs is row sparse, or equivalently, the projection matrix Π onto the subspace spanned by the leading k PCs satisﬁes diag(Π) 0 s. If this row sparse assumption is not satisﬁed, e.g., the leading k PCs are all sparse but their supports are nearly disjoint, one can compute the PCs iteratively via the iterative deﬂation method (d Aspremont et al., 2007; Mackey, 2008), as shown in Algorithm 3.

Our streaming sparse PCA algorithm has the following advantages compared with streaming PCA and TPower: 1) As we show in the next section, the streaming sparse PCA is consistent in the high dimensional regime where p is much larger than n a regime where streaming PCA is known to be inconsistent. 2) TPower is not designed to handle the streaming data model. It needs to store all the samples or explicitly compute the empirical covariance matrix, which requires at least O(p min{p, n}) storage. This can be problematic in applications involving large data. In contrast, our methods only require O(pk) storage. 3) When

Algorithm 3 Streaming SPCA via Iterative Deﬂation Input: Parameters B1, , Bk, T1, , Tk, γ1, , γk and samples {x1, x2, }. Procedure:

1) Let n = 0; for i = 1 to k do

2) Initialize q(i) 0 ; 4) Run Algorithm 2 with {y n, , y n+Ti Bi}, Ti, Bi, q(i) 0 , γi and k = 1. For t = n, , n + Ti Bi, yt is deﬁned as yt xt i 1 j=0 q(j)q(j) xt. The output of Algorithm 2 is denoted by q(i); 5) Set n = n + Ti Bi; end for 6) Return q(1), , q(k).

the row sparse assumption holds, Algorithm 1 can compute the leading k PCs simultaneously, but TPower can only extract them one by one via the iterative deﬂation method.

For elliptically distributed data, Han & Liu (2013a) proposed the ECA algorithm based on the multivariate Kendall s tau statistic. In particular, let x be a random vector following the elliptical distribution ECp(µ, Σ, ξ) and x be an independent copy of x. The multivariate Kendall s tau matrix is deﬁned as

K E [(x x)(x x)

It is known that the eigenspace of K is identical to that of Σ. Based on this fact, Han & Liu (2013a) proposed to recover the eigenspace of a second order U-statistic estimator of K which is deﬁned as

ˆKU 2 n(n 1)

(xi xi )(xi xi )

xi xi 2 2 ,

where {x1, , xn} are n independent samples of x. Note that ˆKU is indeed the empirical covariance matrix of { xi xi xi xi 2 }i <i. ˆKU is not suitable for the streaming data model because its computation requires to store all the samples and is quite time-consuming when n is large. So instead of ˆKU, we consider the following estimator:

(x2i 1 x2i)(x2i 1 x2i)

x2i 1 x2i 2 2 . (2)

Without loss of generality, we assume that n is an even number. This leads to streaming sparse ECA proposed in Algorithm 4. In comparison with ECA, streaming sparse ECA is able to deal with the streaming data model and requires only O(pk) storage. We remark that the main difference between Algorithms 2 and 4 is that Algorithm 4 computes the empirical covariance of { x2t 1 x2t x2t 1 x2t 2 }B(τ+1) t=Bτ+1,

Streaming Sparse Principal Component Analysis

Algorithm 4 Streaming ECA via Row Truncation Input: Samples {x1, x2, }, number of steps T, block size B, parameter γ and initial solution Q0 Rp k. Procedure:

for τ = 0 to T 1 do

1) Initialize Sτ+1 = 0; for t = Bτ + 1 to B(τ + 1) do

2) yt = (x2t 1 x2t)/ x2t 1 x2t 2; 3) Sτ+1 = Sτ+1 + 1

B yty t Qτ; end for 4) Sτ+1 = Truncate( Sτ+1, γ); 5) QR-decomposition: Sτ+1 = Qτ+1Rτ+1; end for 6) Return QT .

rather than the empirical covariance matrix of {xt}B(τ+1) t=Bτ+1, to handle elliptically distributed data.

4. Performance Guarantees

To measure the accuracy of the output, we use the following distance function (Mitliagkas et al., 2013). For two orthogonal matrix U, V Rp k, the distance between U and V is deﬁned based on the largest principal angle between the column spaces of U and V:

dist(U, V) = dist(span(U), span(V)) = U V 2, (3)

where U is an orthogonal basis of the perpendicular subspace to the one spanned by the columns of U.

4.1. Streaming Sparse PCA

We ﬁrst show performance guarantees of streaming sparse PCA, i.e., Algorithm 2 and 3, which handles data generated under the spike model. Recall that the covariance matrix is denoted by Σ. Let Σ = UΛU be the singular value decomposition of Σ, Uk be a submatrix of U consisting of the ﬁrst k columns of U and λk be the kth largest eigenvalue of Σ. Theorem 1 shows sufﬁcient conditions for Algorithm 2 to obtain a solution of accuracy ϵ. We assume that the initial solution Q0 satisﬁes U k, Q0 > ϵ.

Theorem 1. For η > 0, 0 < ϵ < 1, and γ s, let µ (k+1)λk+1+2ηλk

λk , f(µ, η, k) max{ (2+

k}. If the initial solution Q0 satisﬁes that

ν U k, Q0 2 < 1 µ2

1 µ2 + (µ + 1)f(µ, η, k) , (4)

and the following two inequalities hold

T log(ϵ/ν) log [ µ/ (

1 ν2 f(µ, η, k)ν )],

B ck2λ2 1[(s + 2γ) log p + log T]

where c is a universal constant, then with probability at least 1 s 10 the output QT of Algorithm 2 satisﬁes U k, QT 2 ϵ.

Remark 1. From Theorem 1, we observe that: 1) Algorithm 2 succeeds as long as λk > (k + 1)λk+1 since there exists η, e.g., η = λk (k+1)λk+1

4λk so that µ < 1. 2)

λk affects the convergence rate and the block size

B. A smaller (k+1)λk+1

λk leads to a smaller µ and a larger η, which implies faster convergence and less samples required. 3) The upper bound related to the initial solution Q0 mainly depends on µ, which goes to 0 as µ 1 and approaches to k

η as µ 0. In other words, a more accurate initial solution is required when µ is larger. 4) The block size B should be at least Θ((s + 2γ) log p + log T). Typically, if s γ 2s, our algorithm can succeed when B = Θ(s log p + log T). Notice that this is typically much smaller than streaming PCA which requires the block size B = Θ(p log p).

Remark 2. When k = d, η can be set to λd (d+1)σ2

2 [ 1 + (d+1)σ2

] and f(µ, η, k) = 2+

in this case, Algorithm 2 succeeds as long as σ2 the covariance of the noise is less than λd d+1.

Remark 3. When k = 1, η can be set to λ1 2λ2

4λ1 so that µ = 1 2 + λ2

λ1 and f(µ, η, k) = (2 +

2)µ. Notice that Algorithm 2 now becomes a block-wise stochastic version of TPower. When γ = s, Yuan & Zhang (2013) proved that TPower succeeds when λ1 >

5λ2, while our analysis leads to a slightly better result of λ1 > 2λ2.

The following theorem provides the performance guarantee of Algorithm 3 that extracts PCs via iterative deﬂation. Note that the errors of the ﬁrst t 1 estimated PCs q(1), , q(t 1) may propagate to the estimate of q(t).

Theorem 2. Let η > 0, 0 < ϵ <

2 2 , γi s for i = 1, , k. Let {ϵ1, ϵ2, , ϵk} be such that ϵk = ϵ, ϵk 1 = ηλkϵk 20λ1k, ϵk 2 = ηλk 1ϵk 1

20λ1(k 1) , , ϵ1 = ηλ2ϵ2

For the ith iteration, let µi 2λi+1+2ηλi

λi , f(µi, η)

2)µi, η}. If the initial solution q(i) 0 satisﬁes

1 |u i q(i) 0 |2 < 1 µ2 i

1 µ2 i + (µi + 1)f(µi, η) ,

and the following two inequalities hold

Ti log(ϵi/νi) log [ µi/ ( 1 νi2 f(µi, η)νi )],

Bi cλ2 1[(s + 2γi) log p + log(k Ti)]

ϵ2 i η2λ2 i ,

where c is a universal constant, then |u i q(i)|

1 ϵ2 i holds with probability at least 1 s 10.

Streaming Sparse Principal Component Analysis

Let Q [q(1), , q(k)], one can easily verify that Theorem 2 implies U k, Q 2 2ϵ. In the experiments, we empirically show that Algorithm 3 can generates good estimates when k is relatively small, e.g., k = 8.

4.2. Streaming Sparse ECA

We now present the performance guarantee of streaming sparse ECA. We start with the following theorem which states that the population multivariate Kendall s tau statistic K and the scatter matrix Σ under the elliptical model share the same eigenspace.

Theorem 3. (Marden, 1999; Croux et al., 2002; Han & Liu, 2013a) Let ECp(µ, Σ, ξ) be a continuous elliptical distribution and K be the population multivariate Kendall s tau statistic. Then if rank(K) = q and λj(Σ) = λk(Σ) for any j = k {1, , q}, we have

uj(Σ) = uj(K) and λj(K) = E

[ λj(Σ)y2 j q i=1 λi(Σ)y2 i

where uj( ) is the eigenvector corresponding to the jth

largest eigenvalue and y (y1, , yq) N(0, Iq).

This theorem states that when Σ has distinct eigenvalues, Σ and K have the same eigenspace with the same descending order of the eigenvalues. Based on this, our streaming sparse ECA utilizes ˆK deﬁned in Equation (2) to recover the eigenspace of Σ. We here abuse the notations and let λk(K) be the kth largest eigenvalue of K and Uk be the matrix consisting of the eigenvectors of K corresponding to its k largest eigenvalues, and suppose that diag(Uk U k ) 0 s. The following theorem states the theoretical guarantee of the streaming sparse ECA.

Theorem 4. For η > 0, 0 < ϵ < 1, and γ s, let µ (k+1)λk+1+2ηλk

λk , f(µ, η, k) max{ (2+

k}. If the initial solution Q0 satisﬁes that

ν U k, Q0 2 < 1 µ2

1 µ2 + (µ + 1)f(µ, η, k) ,

and the following two inequalities hold

T log(ϵ/ν) log [ µ/ (

1 ν2 f(µ, η, k)ν )],

B ck2 (1 + λ1(K))2 [(s + 2γ) log p + log T]

ϵ2η2λk(K)2 ,

where c is a universal constant, then with probability at least 1 s 10 the output QT of Algorithm 4 satisﬁes U k, QT 2 ϵ.

Notice that all the bounds in Theorem 4 relates to the eigenvalues of K. The connection between the eigenvalues of

K and Σ has been shown in (Han & Liu, 2013a), namely, λj(K) = Θ(λj(Σ)/tr(Σ)) or each j, when Σ F log p = tr(Σ) o(1), e.g., the condition number of Σ is upper bounded by a constant.

5. Initialization

As shown in Theorem 1, 2 and 4, both of streaming sparse PCA and ECA require an initial solution whose estimation error is bounded by Equation (4) which involves the intrinsic properties of Σ and the number of PCs one wants to extract. Therefore, how to ﬁnd such an initial solution is an important issue. Yuan & Zhang (2013) proposed to adaptively select the desired sparsity γ in TPower, i.e., one can run TPower with a relatively large γ to generate the initial solution and then rerun the algorithm with a smaller γ, while (Han & Liu, 2013a) computed the initial solution by FPS (Vu et al., 2013). In streaming PCA, Mitliagkas et al. (2013) showed that the initial solution could be generated by randomly drawing k vectors from the standard Gaussian distribution.

Due to limitations of space, we mainly discuss the initialization issue for streaming sparse PCA. This can be easily generalized to streaming sparse ECA. The idea is to run streaming PCA (streaming sparse PCA with γ = p) on several blocks of the collected samples and then apply the truncation operation on its output to generate the initial solution. This is summarized in Algorithm 5.

Algorithm 5 Finding Initial Solution Input: Samples {x1, x2, }, block size B, parameter θ and γ. Procedure:

1) Run streaming PCA on several blocks so that the output ˆQ0 satisﬁes U k, ˆQ0 2 θ; 2) Initialize S0 = 0; for t = 1 to B do

2) S0 = S0 + 1

B xtx t ˆQ0; end for 3) S0 = Truncate( S0, γ); 4) QR-decomposition: S0 = Q0R0; 5) Return Q0.

Theorem 5 provides the performance guarantee of Algorithm 5. Note that the block size B should be Ω(p) to achieve consistency. However, we observe from experiments empirically this algorithm is able to generate acceptable results even when B is much smaller than p, especially when the covariance of noise σ2 is small.

Theorem 5. Fix accuracy ϵ with 0 < ϵ < 1, and let θ =

2)(1+ kλk+1

λk )2 , then U k, Q0 2 ϵ holds

Streaming Sparse Principal Component Analysis

with probability at least 1 d 10 if γ s, U k, ˆQ0 2 θ

and B ck2[λ1(A)4d+(2λ1(A)+σ)2σ2p]

(λk(A)2+σ2)2 θ2 , where c is a universal constant.

6. Proof Sketch

We now sketch the proofs of the theorems presented in Section 4. We start with Theorem 1. The key ingredient of our proofs is to build a connection between the estimation error on the τ th iteration U k, Qτ 2 and that on the τ + 1th

iteration U k, Qτ+1 2. If this error decreases on each iteration, then one can compute the number of iterations needed for the estimation error to be less than ϵ.

In the rest of this section, the empirical covariance of samples {x Bτ+1, , x B(τ+1)} is denoted by ˆΣτ, and the row supports of Uk, Qτ and Qτ+1 are represented by S, Fτ and Fτ+1, respectively. We let F S Fτ Fτ+1 and let X(i, j) be the (i, j)th entry of matrix X. For a p p squared matrix, e.g., ˆΣτ, let ˆΣτ,F denote the matrix whose (i, j)th entry equals ˆΣτ(i, j) if the row index i and the column index j are both in F, and 0 otherwise. For a p k matrix, e.g., Sτ+1, let Sτ+1,F denote the matrix whose (i, j)th entry equals Sτ+1(i, j) if the row index i F, and 0 otherwise. Then one can easily verify that Sτ+1,F = ˆΣτ,FQτ and Sτ+1 = Truncate( Sτ+1,F, γ). Let Sτ+1,F = Qτ+1,FRτ+1,F be the QR decomposition of Sτ+1,F. In order to establish a relationship between Qτ and Qτ+1, we ﬁrst connect Qτ to Qτ+1,F:

Lemma 1. Let ξ = sup F:|F| s+2γ ˆΣτ,F Στ,F 2, then if Sτ+1,F has full column rank, we have

U k, Qτ+1,F 2 2

[λk+1 U k, Qτ 2 + ξ]2

[λk+1 U k, Qτ 2 + ξ]2 + [λk

1 U k, Qτ 2 2 ξ]2 .

We now connect Qτ+1,F to Qτ+1 based on the perturbation analysis of the QR decomposition and the error analysis of the row truncation operation.

Lemma 2. Let Qτ+1,F, be an orthonormal matrix such that matrix [Qτ+1,F, Qτ+1,F, ] is orthogonal and let ξ sup F:|F| s+2γ ˆΣτ,F Στ,F 2, then if γ s and

k[λk+1 U k, Qτ 2 + ξ]

1 U k, Qτ 2 2 < 1

Q τ+1,F, (Qτ+1 Qτ+1,F) 2

k[λk+1 U k, Qτ 2 + ξ]

1 U k, Qτ 2 2 c

k[λk+1 U k, Qτ 2 + ξ] ,

where c = 2 +

We then prove that U k, Qτ+1 2 can be upper bounded by U k, Qτ+1,F 2 + Q τ+1,F, (Qτ+1 Qτ+1,F) 2. Therefore, by combining the two lemmas above, let ν U k, Q0 2, we can show that the following holds with

probability at least 1 s 10

U k, Qτ+1 2 µ U k, Qτ 2

1 ν2 f(µ, η, k)ν ,

if B ck2λ2 1[(s+2γ) log p+log T ]

ϵ2η2λ2 k and Inequality (4) holds. Thus Theorem 1 is established.

Theorem 2 can be proved by analyzing the error propagation in the iterative deﬂation, combining with the results derived from Theorem 1 with k = 1. For Theorem 4, as we have discussed in Section 3, streaming sparse ECA uses ˆK as the estimator of the multivariate Kendall s tau matrix instead of the empirical covariance of the received samples. Therefore, Theorem 4 can be established following the proofs discussed above, together with an upper bound of sup F:|F| s+2γ ( ˆKτ Kτ)F 2 shown below.

Theorem 6. Let x1, , xn be n independent observations of random vector x ECp(µ, Σ, ξ). Let K be multivariate Kendall s tau matrix of x and ˆK be the empirical estimation of K which is deﬁned as Equation (2). Then there exists a universal constant c such that the following holds with probability at least 1 s 10

sup v 2=1, v 0=s |v ( ˆK K)v|

c ( min {4λ1(K)

qλq(K), 1 } + K 2

s log p + log T

for parameter T, where q = rank(K).

Theorem 6 implies that sup F:|F| s+2γ ( ˆKτ Kτ)F 2

c (1 + K 2)

(s+2γ) log p+log T

n with high probability. Then by embedding this inequality into Lemma 1, Lemma 2, we obtain Theorem 4.

7. Experiments

We investigate the performance of our algorithms on a variety of simulated and real-world datasets. All the algorithms mentioned below are implemented in Python. The experiments are conducted on a desktop PC with an i7 3.4GHz CPU and 4G memory.

7.1. Synthetic Data

We ﬁrst illustrate the empirical performance of our streaming sparse PCA (Algorithm 2) with synthetic datasets. For

Streaming Sparse Principal Component Analysis

the spike model, we consider two data generating schemes. The ﬁrst one, which is similar to (Yuan & Zhang, 2013), is that matrix A Rp 2 satisﬁes that AA = λ1v1v 1 + λ2v2v 2 , where v1 Rp and v2 Rp are two sparse vectors satisfying that

10 1 i 10 0 otherwise, v2i = { 1

10 11 i 20 0 otherwise,

and λ1 = 5, λ2 = 3. The second one constructs matrix A Rp d by randomly generating two orthogonal matrices U Rp d and V Rd d such that diag(UU ) 0 = s, and then setting A = USV where S Rd d is a diagonal matrix whose diagonal entries are computed ac-

cording to the chi-square density f(x) = x 1

2 ) , i.e.,

20) for i = 1, , d. In the experiments, we repeat each test 20 times and report the average results.

In the ﬁrst experiment, we make a comparison between streaming sparse PCA, streaming PCA, FPS and online sparse PCA (Mairal et al., 2010) for a relative small p. We use three measurements to evaluate their performance, namely, the subspace distance deﬁned in Equation (3), the expressed variance in (Xu et al., 2013) deﬁned as k i=1 q i AA qi/ k i=1 λi(AA ), and the sparsity deﬁned as k i=1 |{j : |qij| > t}|, where {q1, , qk} are their estimation and t is a threshold which is set to 0.001. Figure 1(a) shows the results in the ﬁrst scheme, where the leading PC is extracted. We observe that streaming sparse PCA performs similarly to online sparse PCA and outperforms FPS and streaming PCA. The running time of streaming sparse PCA is much less than that of FPS and online sparse PCA because FPS needs to solve a SDP problem via an ADMM algorithm and online sparse PCA needs to solve the elastic-net (Zou & Hastie, 2005) in each iteration, whereas our algorithm only requires the truncation operation and the QR decomposition. Figure 1(b) presents the results in the second scheme, where we extract the leading ten PCs. It can be observed that FPS and streaming sparse PCA have better performance than the other methods. Notice that streaming sparse PCA runs more than 100 times faster than FPS and FPS needs to store the p p covariance matrix while streaming sparse PCA only maintains a p k matrix.

In the second experiment, we mainly compare streaming PCA and streaming sparse PCA in the high dimensional regime since the other two methods are too slow in this setup. The estimation error is measured by the subspace distance. The samples are generated according to the second scheme described above and we extract the leading 10 PCs. Figure 2(a) shows their estimation errors when n = 1000, B = 100 and p varies from 1000 to 50000. Clearly, streaming sparse PCA succeeds in recovering the sparse PCs but streaming PCA fails when p is large, which

300 400 500 600 700 800 900 p

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

300 400 500 600 700 800 900 p

Running tim e (s)

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

300 400 500 600 700 800 900 p

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

300 400 500 600 700 800 900 p

Running tim e (s)

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

300 400 500 600 700 800 900 p

Subs pa ce dista nc e

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

300 400 500 600 700 800 900 p

Expre sse d varian ce

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

300 400 500 600 700 800 900 p

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08

Subs pa ce dista nc e

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

300 400 500 600 700 800 900 p

Expre sse d varian ce

Stre am ing PCA Stre am ing SPCA Online SPCA FPS

Figure 1. (a) Scheme 1 with n = 1000, σ2 = 0.5, B = 100, γ = 10 and k = 1. (b) Scheme 2 with n = 1000, d = 10, σ2 = 0.2, B = 100, s = 50, γ = 50 and k = 10.

is consistent with the theoretical results shown in Theorem 1. Figure 2(b) presents the number of samples required for recovering the leading 10 PCs of accuracy ϵ 0.03 where B = 100, which shows that streaming PCA requires much more samples than streaming sparse PCA to achieve the same accuracy. Figure 2(c) shows the effect of block size B on their performance, where n = 1000 and p = 5000. When B is too small, e.g., less than 60, streaming sparse PCA does not guarantee the convergence to the true PCs, which agrees with Theorem 1. Figure 2(d) demonstrates the probability of success of streaming sparse PCA, measured by the fraction of the trials in which the estimation error is less than 0.05. We here set n = 1000 and B = 100. We can observe that the tolerance of σ2 decreases as p increases when B is ﬁxed. This is because both of p and σ2

affect the lower bound of B as shown in Theorem 1.

We now compare our streaming sparse PCA/ECA with ECA (Han & Liu, 2013a) under the elliptical model. In the following experiments, the samples are independently drawn from ECp(0, Σ, ξ). Here, Σ is constructed according to Σ = AA + Ip where A is generated following the ﬁrst scheme described above, and ξ follows the chidistribution with degree of freedom p or the F-distribution with degrees of freedom p and 1. We estimate the leading

Streaming Sparse Principal Component Analysis

0 10000 20000 30000 40000 50000 p

Subs pa ce dista nc e

Stre am ing PCA Stre am ing SPCA

0 5000 10000 15000 20000 p

Stre am ing PCA Stre am ing SPCA

20 40 60 80 100 120 140 160 B

Subs pa ce dista nc e

Stre am ing PCA Stre am ing SPCA

0.05 0.15 0.25 0.35 0.45 2000 6000 10000 14000 18000 22000 26000 30000

Prob ab ility of succe ss

Figure 2. (a), (b) and (c) show the comparison between streaming PCA and streaming sparse PCA, where σ2 = 0.2, k = 10 and γ = s = 100. (d) presents the effect of σ2 on the performance of streaming sparse PCA.

0 5 10 15 20 25 30 35 40 0.0

Subs pa ce dista nc e

ECA Stre am ing SPCA Stre am ing ECA

0 5 10 15 20 25 30 35 40 0.0

Subs pa ce dista nc e

ECA Stre am ing SPCA Stre am ing ECA

Figure 3. Comparison between ECA, streaming sparse ECA, and streaming sparse PCA. Random variable ξ follows (Left) the chidistribution χp, and (Right) the F-distribution F(d, 1).

eigenvector of Σ. Figure 3 plots the subspace distances between their estimation and the true PC against parameter γ for these three methods, where we set p = 500, n = 600 and B = 100, which shows that streaming sparse ECA performs similarly to ECA and is better than streaming sparse PCA especially when γ is close to s. Typically, streaming sparse ECA is much faster than ECA when p or n is large, because ECA computes the second order U-statistic estimator and uses FPS to construct a good initial solution. In this experiment, on the average, ECA needs 30s to generate its solution while streaming sparse ECA only requires 30 milliseconds. Figure 4 shows the comparison between the streaming versions of PCA, ECA and sparse PCA in the high dimensional setting where p = 10000, n = 2000 and B = 100. Clearly, streaming sparse ECA outperforms the other two methods in the elliptical model.

7.2. Real-world Datasets

We use two large datasets, the NIPS paper dataset and the NYTimes news articles dataset, both available from the UCI Machine Learning Repository (Bache & Lichman), to compare the empirical performance of streaming sparse PCA, streaming PCA and large-scale sparse PCA (Zhang & El Ghaoui, 2011). Both datasets record word occur-

0 5 10 15 20 25 30 35 40 0.0

Subs pa ce dista nc e

Stre am ing PCA Stre am ing SPCA Stre am ing ECA

0 5 10 15 20 25 30 35 40 0.0

Subs pa ce dista nc e

Stre am ing PCA Stre am ing SPCA Stre am ing ECA

Figure 4. Comparison between streaming PCA and streaming sparse PCA/ECA. Random variable ξ follows (Left) the chidistribution χp, and (Right) the F-distribution F(d, 1).

1 2 3 4 5 6 7 8 k

0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32

Explaine d varian ce

Stre am ing PCA Stre am ing SPCA Large -sca le SPCA

1 2 3 4 5 6 7 8 k

Explaine d varian ce

Stre am ing PCA Stre am ing SPCA Large -sca le SPCA

Figure 5. (Left) NIPS dataset. (Right) NYTimes dataset.

rences in the form of bags-of-words. The NIPS dataset contains 1500 articles and a dictionary of 12419 words. The NYTimes dataset contains 300000 articles and a dictionary of 102660 words. We evaluate the performance by the fraction of explained variance which is deﬁned as tr(U XX U)/tr(XX ) where X Rp n is the sample matrix and U Rp k is the output. Parameters B and γ in streaming sparse PCA are set to 300 and 500, respectively. The leading k PCs are computed via the iterative deﬂation. Figure 5 plots the explained variance against k. We observe that streaming sparse PCA performs similarly to streaming PCA and is better than large-scale sparse PCA, in terms of the expressed variance. One advantage of streaming sparse PCA is that it is computationally efﬁcient and guarantees the sparsity of its solution, i.e., γ = 500, without much loss of performance compared to streaming PCA.

8. Conclusion

In this paper, we propose streaming sparse PCA/ECA for dimensionality reduction of high dimensional data generated according to the spike model and the elliptical model, and establish ﬁnite sample performance guarantees. The experiments validate that they perform better and are more computationally efﬁcient than the other alternatives to PCA.

Acknowledgments

This work is partially supported by the Ministry of Education of Singapore Ac RF Tier Two grants R265000443112 and R265000519112, and A*STAR Public Sector Funding R265000540305.

Streaming Sparse Principal Component Analysis

Amini, A. A. and Wainwright, M. J. High-dimensional analysis of semideﬁnite relaxiation for sparse principal components. The Annals of Statistics, 37(5B):2877 2921, 2009.

Arora, R., Cotter, A., Livescu, K., and Srebro, N. Stochastic optimization for PCA and PLS. In Allerton Conference, 2012.

Bache, K. and Lichman, M. UCI machine learning repository. URL http://archive.ics.uci.edu/ml.

Birnbaum, A., Johnstone, I. M., Nadler, B., and Paul, D. Minimax bounds for sparse PCA with noisy highdimensional data. The Annals of Statistics, 41(3):1055 1084, 2013.

Brand, M. Incremental singular value decomposition of uncertain data with missing values. In ECCV, 2002.

Cai, T., Ma, Z., and Wu, Y. Optimal estimation and rank detection for sparse spiked covariance matrices. ar Xiv:1305.3235, 2014.

Croux, C., Ollila, E., and Oja, H. Sign and rank covariance matrices: statistical properties and application to principal components analysis. Statistics for Industry and Technology, pp. 257 269, 2002.

d Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. A direct formulation for sparse PCA using semideﬁnite programming. SIAM review, 49(3):434 448, 2007.

Fang, K., Kotz, S., and Ng, K. Symmetric Multivariate and Related Distributions. Chapman & Hall, 1990.

Han, F. and Liu, H. ECA: High dimensional elliptical component analysis in non-Gaussian distributions. ar Xiv:1310.3561, 2013a.

Han, F. and Liu, H. Optimal rates of convergence for latent generalized correlation matrix estimation in transelliptical distribution. ar Xiv:1305.6916v3, 2013b.

Hult, H. and Lindskog, F. Multivariate extremes, aggregation and dependence in elliptical distributions. Advances in Applied probability, 34(3):587 608, 2002.

Johnstone, I. M. and Lu, A. Y. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682 693, 2009.

Journee, M. and Y. Nesterov, Peter Richtarik, R. Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, pp. 517 553, 2008.

Ma, Z. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772 801, 2013.

Mackey, L. Deﬂation methods for sparse PCA. In NIPS, 2008.

Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19 60, 2010.

Marden, J. Some robust estimates of principal components. Statistics and Probability Letters, 43(4):349 359, 1999.

Mitliagkas, I., Caramanis, C., and Jain, P. Memory limited, streaming PCA. In NIPS, 2013.

Pearson, K. On lines and planes of closest ﬁt to systems of points in space. Philosophical Magazine, 2(6):559C 572, 1901.

Shen, D., Shen, H., and Marron, J.S. Consistency of sparse PCA in high dimension, low sample size contexts. Journal of Multivariate Analysis, 115:115 317, 2013.

Shen, H. and Huang, J. Z. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99:1015 1034, 2008.

Vu, V. Q. and Lei, J. Minimax rates of estimation for sparse PCA in high dimensions. In AISTATS, 2012.

Vu, V. Q., Cho, J., Lei, J., and Robe, K. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In NIPS, 2013.

Wang, Z., Lu, H., and Liu, H. Nonconvex statistical optimization: minimax optimal sparse PCA in polynomial time. In NIPS, 2014.

Warmuth, M. K. and Kuzmin, D. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9: 2287 2320, 2008.

Xu, H., Caramanis, C., and Mannor, S. Outlier-robust PCA: the high-dimensional case. IEEE Transactions on Information Theory, 59(1):546 572, 2013.

Yuan, X. and Zhang, T. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14:899 925, 2013.

Zhang, Y. and El Ghaoui, L. Large-scale sparse principal component analysis with application to text data. In NIPS, 2011.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301 320, 2005.

Streaming Sparse Principal Component Analysis

Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. In JCGS, pp. 265 286, 2006.