Learning Feature Sparse Principal Subspace

Lai Tian, School of Computer Science & Center for OPTIMAL, Northwestern Polytechnical University, Xi'an 710072, China. tianlai.cs@gmail.com
Feiping Nie (corresponding author), School of Computer Science & Center for OPTIMAL, Northwestern Polytechnical University, Xi'an 710072, China. feipingnie@gmail.com
Rong Wang, School of Cybersecurity & Center for OPTIMAL, Northwestern Polytechnical University, Xi'an 710072, China. wangrong07@tsinghua.org.cn
Xuelong Li, School of Computer Science & Center for OPTIMAL, Northwestern Polytechnical University, Xi'an 710072, China. li@nwpu.edu.cn

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

This paper presents new algorithms to solve the feature-sparsity constrained PCA problem (FSPCA), which performs feature selection and PCA simultaneously. Existing optimization methods for FSPCA require data distribution assumptions and lack global convergence guarantees. Though the general FSPCA problem is NP-hard, we show that, for a low-rank covariance, FSPCA can be solved globally (Algorithm 1). We then propose another strategy (Algorithm 2) that solves FSPCA for a general covariance by iteratively building a carefully designed proxy. We prove (data-dependent) approximation bounds and convergence guarantees for the new algorithms. When the spectrum of the covariance follows an exponential or Zipf distribution, we provide exponential or posynomial approximation bounds, respectively. Experimental results show the promising performance and efficiency of the new algorithms compared with the state-of-the-art on both synthetic and real-world datasets.

1 Introduction

Consider $n$ data points in $\mathbb{R}^d$. When $d \gg n$, PCA is inconsistent in estimating the $m$ leading eigenvectors $W \in \mathbb{R}^{d \times m}$ of the population covariance matrix $A \in \mathbb{R}^{d \times d}$ [18], which can be addressed by assuming sparsity in the principal components. Prior work has been done in methodology design [51, 38, 11, 41, 36, 47, 34, 33, 22] and theoretical understanding [42, 23, 46, 50].

Principal subspace estimation [6, 20, 28, 16, 45] is directly connected to dimension reduction and is important when there is more than one principal component of interest. Indeed, typical applications of PCA use the projection onto the principal subspace to facilitate exploration and inference of important features of the data. As Vu et al. [42] point out, dimension reduction by PCA should emphasize subspaces rather than eigenvectors. The sparsity level in sparse principal subspace estimation is defined as follows [42, 43, 47].

Definition 1.1 (Subspace sparsity, [42]). For the $m$-dimensional principal subspace $\mathrm{span}(W)$ of the covariance $A$, the subspace sparsity level $k$ is defined by
$$k = \mathrm{card}(\mathrm{supp}[\mathrm{diag}(\Pi)]) = \|W\|_{2,0},$$
where $\Pi = WW^\top$ is the projection matrix onto $\mathrm{span}(W)$ and $\|\cdot\|_{2,0}$ is the row-sparsity norm.

Figure 1: Element-wise Sparse PCA W (a) versus Feature Sparse PCA W (b). [Two $d \times m$ matrices: element-wise sparsity scatters zero entries across rows, whereas feature sparsity zeroes out entire rows.]

This paper considers the principal subspace estimation problem with the feature subspace sparsity constraint, termed Feature Sparse PCA (Problem (3.1)).
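For concreteness, the sketch below (mine, not part of the paper) checks Definition 1.1 numerically: the subspace sparsity level equals both the number of non-zero rows of $W$ and the number of non-zero diagonal entries of the projection matrix $\Pi = WW^\top$.

```python
import numpy as np

def subspace_sparsity(W, tol=1e-12):
    """Row-sparsity level ||W||_{2,0}: the number of rows of W with non-zero norm."""
    return int(np.sum(np.linalg.norm(W, axis=1) > tol))

# Toy example: d = 5 features, m = 2 components, only rows {0, 2, 3} are active.
rng = np.random.default_rng(0)
W = np.zeros((5, 2))
W[[0, 2, 3], :] = np.linalg.qr(rng.standard_normal((3, 2)))[0]  # orthonormal on the support

Pi = W @ W.T                                                    # projection onto span(W)
k_from_rows = subspace_sparsity(W)                              # ||W||_{2,0}
k_from_diag = int(np.sum(np.abs(np.diag(Pi)) > 1e-12))          # card(supp[diag(Pi)])
assert k_from_rows == k_from_diag == 3
```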
Some approaches have been proposed to solve the FSPCA problem [43, 47, 27]. Yet, the existing methods have several drawbacks. (1) Most of the existing analysis holds only with high probability and under specific data-generation assumptions; e.g., Yang & Xu [47] require data generated from the spike model, and Wang et al. [43] require data generated from a sub-Gaussian distribution. Otherwise, they only guarantee convergence when the initial solution is near the global optimum. (2) In practice, monotonic algorithms are preferred because they improve the objective in every step; however, existing iterative schemes for FSPCA are not guaranteed to be ascent methods. (3) Some methods make the spike-model assumption, under which the population covariance is intrinsically low-rank (up to an additive scaled identity), yet existing methods cannot make full use of the low-rank structure in the covariance.

Compared with prior work, which mostly averages out the worst case by assuming a probabilistic model on the covariance, our work provides algorithms with deterministic, model-free analysis from the optimization perspective, and can thus be applied to any model. [21, 9, 36, 1] also consider the sparse PCA problem from the optimization perspective, but they only compute the leading sparse eigenvector, which can be suboptimal when multiple eigenvectors are of interest. In this paper, we provide two optimization strategies to compute the leading sparse principal subspace with provable optimization guarantees. The first one (Algorithm 1) solves the feature sparse PCA problem globally when the covariance matrix is low-rank, while the second one (Algorithm 2) solves feature sparse PCA for a general covariance matrix iteratively with guaranteed convergence.

Contributions. More precisely, we make the following contributions:
1. We show that, for a low-rank covariance matrix, the FSPCA problem can be solved globally with the newly proposed Algorithm 1. For the general high-rank case, we report an iterative algorithm (Algorithm 2) that builds a carefully designed proxy.
2. We prove (data-dependent) approximation bounds and convergence guarantees for the proposed optimization strategies. The computational complexities of both algorithms are analyzed.
3. We conduct experiments on both synthetic and real-world data to evaluate the new algorithms. The experimental results demonstrate the promising performance of the newly proposed algorithms compared with state-of-the-art methods.

Notations. Throughout this paper, scalars, vectors, and matrices are denoted by lowercase letters, boldface lowercase letters, and boldface uppercase letters, respectively. For a matrix $A \in \mathbb{R}^{d \times d}$, $A^\top$ denotes the transpose of $A$, $\mathrm{Tr}(A) = \sum_{i=1}^{d} a_{ii}$, and $\|A\|_F^2 = \mathrm{Tr}(A^\top A)$; $\mathbb{1}\{\text{condition}\}$ is the $(0,1)$-indicator of the condition; $\mathbf{1}_n \in \mathbb{R}^n$ denotes the vector of all ones; $\|x\|_0$ denotes the number of non-zero elements; $\|W\|_{2,0} = \sum_{i=1}^{d} \big\| \|w^i\|_2 \big\|_0 = \sum_{i=1}^{d} \mathbb{1}\{w^i \neq 0\}$ measures the row-sparsity of $W \in \mathbb{R}^{d \times m}$, where $w^i \in \mathbb{R}^{1 \times m}$ is the $i$th row of $W$; $I_{n \times n} \in \mathbb{R}^{n \times n}$ denotes the identity matrix; $I(1{:}k)$ is the first $k$ elements of the index list $I$; $A^\dagger$ denotes the Moore-Penrose inverse; $A_m$ is the best rank-$m$ approximation of $A$ in Frobenius norm; $\mathrm{card}(I)$ is the cardinality of $I$; $[n] := \{i \in \mathbb{Z} : 1 \le i \le n\}$. We assume the eigenvalues $\{\lambda_i\}_{i=1}^{n}$ are arranged in descending order, i.e., $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$.
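Two pieces of this notation are used repeatedly below: the best rank-$m$ approximation $A_m$, obtained from the top-$m$ eigenpairs of the PSD matrix $A$, and the Moore-Penrose inverse $A^\dagger$. The small NumPy sketch below is my own illustration under these definitions, not code from the paper.

```python
import numpy as np

def best_rank_m(A, m):
    """Best rank-m approximation (in Frobenius norm) of a symmetric PSD matrix A."""
    vals, vecs = np.linalg.eigh(A)                   # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:m]                 # indices of the m largest eigenvalues
    return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 20))
A = X @ X.T                                          # a PSD covariance-like matrix, d = 8
A_m = best_rank_m(A, m=3)
assert np.linalg.matrix_rank(A_m) == 3

A_m_pinv = np.linalg.pinv(A_m)                       # Moore-Penrose inverse A_m^dagger
```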
2 Prior Work

In this section, we review several prior works that consider related problems.

Sparse Principal Components. Most existing methods in the literature for the sparse PCA problem estimate only the first leading eigenvector under an element-wise sparsity constraint. To estimate the $m$ leading eigenvectors, one has to build a new covariance matrix with the deflation technique [26] and solve for the leading eigenvector again. The main drawback of this scheme is that, for example, the indices of the non-zero elements in the first eigenvector might not coincide with those of the second eigenvector. As shown in Figure 1, the sparsity pattern is inconsistent among the $m$ leading eigenvectors, which causes difficulties in applications such as feature selection. Moreover, deflation has identifiability and orthogonality issues when the top $m$ eigenvalues are not distinct [43]. [1, 36, 21, 9] propose methods and analysis for the leading eigenvector with approximation guarantees, but their guarantees only apply to the first component, not to further deflation iterations.

Sparse Principal Subspace. Vu et al. [42] consider a different setting in which the estimated subspace is subspace-sparsity constrained (Definition 1.1), so the sparsity pattern is forced to be consistent across rows. They show this problem has nice statistical properties [42]; that is, the optimum is statistically minimax optimal. But there is a gap between the computational methods and the statistical theory. To close this gap, [43, 47, 27, 6, 20, 28, 16, 45] proposed algorithms to solve the subspace-sparsity constrained problem. However, from an optimization viewpoint, existing methods require data distribution assumptions and lack global convergence guarantees. Besides, [2] proposed an algorithm, running in time exponential in $\mathrm{rank}(A)$ and $m$, for the disjoint-FSPCA problem, which requires the supports of different eigenvectors to be disjoint; this is clearly different from our setting.

Sparse Regression. Another line of research [35, 13, 8, 32] considers solving the sparse regression problem with the $\ell_{2,0}$ constraint or its convex relaxation. The main technical difference between $\ell_{2,0}$-constrained sparse regression and FSPCA is the semi-orthogonal constraint on $W$. Without the semi-orthogonal constraint, the FSPCA objective is not bounded from above. Existing techniques for the $\ell_{2,0}$-constrained sparse regression problem, e.g., the projected gradient scheme in [35], cannot be used to solve our problem because, to our knowledge, there is no method to solve the projection subproblem under the semi-orthogonal constraint. Thus, the FSPCA problem is substantially more difficult than $\ell_{2,0}$-constrained sparse regression.

3 Problem Setup

Formally, we propose algorithms to solve the following general problem:
$$\max_{W \in \mathbb{R}^{d \times m}} \ \mathrm{Tr}(W^\top A W) \quad \text{s.t.} \quad W^\top W = I_{m \times m}, \ \|W\|_{2,0} \le k, \tag{3.1}$$
where $m \le k \le d$ and the matrix $A \in \mathbb{R}^{d \times d}$ is positive semi-definite. This problem is NP-hard to solve globally even for $m = 1$ [30] and, sadly, NP-hard to solve $(1-\varepsilon)$-approximately for a small $\mu > 0$ if $\varepsilon < \mu$ [9]. Several techniques have been proposed [43, 47] to solve this challenging problem, but they report only high-probability analysis, and none of them provides a practical algorithm with deterministic guarantees on both approximation and global convergence.

Remark 3.1. As shown in Vu et al. [42], the optimal $W$ of Problem (3.1) achieves the optimal minimax error for row-sparse subspace estimation. Besides, the FSPCA problem can be viewed as performing unsupervised feature selection and PCA simultaneously. The key point is that the $\ell_{2,0}$-norm constraint forces the sparsity pattern to be consistent across eigenvectors, while the vanilla element-wise sparse PCA model cannot maintain this consistency, as shown in Figure 1. One might use only the leading sparse eigenvector for feature selection [25, 31], but this leads to a suboptimal solution when there is more than one principal component of interest (see Figure 2, TPower (G)).
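To make the setup concrete, here is a small NumPy sketch (my own, not the authors' code) that evaluates the objective of Problem (3.1) and checks its two constraints for a candidate $W$.

```python
import numpy as np

def fspca_objective(A, W):
    """Tr(W^T A W): the variance explained by span(W)."""
    return float(np.trace(W.T @ A @ W))

def is_feasible(W, k, tol=1e-8):
    """Check the constraints of Problem (3.1): W^T W = I_m and ||W||_{2,0} <= k."""
    m = W.shape[1]
    orthonormal = np.allclose(W.T @ W, np.eye(m), atol=tol)
    row_sparse = int(np.sum(np.linalg.norm(W, axis=1) > tol)) <= k
    return orthonormal and row_sparse

rng = np.random.default_rng(1)
d, m, k = 20, 3, 7
X = rng.standard_normal((d, 50))
A = X @ X.T                                          # positive semi-definite covariance

# A feasible candidate: an orthonormal block supported on k rows, zero elsewhere.
support = rng.choice(d, size=k, replace=False)
W = np.zeros((d, m))
W[support, :] = np.linalg.qr(rng.standard_normal((k, m)))[0]
assert is_feasible(W, k)
print(fspca_objective(A, W))
```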
4 Optimization Strategies

In this section, we provide new optimization strategies to solve the FSPCA model in Problem (3.1). We first consider the case $\mathrm{rank}(A) \le m$, for which a non-iterative strategy (Algorithm 1) is provided to solve the problem globally. We then consider the general case $\mathrm{rank}(A) > m$, for which we provide an iterative algorithm (Algorithm 2) that approximates $A$ with a carefully designed low-rank proxy covariance $P$ and solves the proxy subproblem with Algorithm 1.

4.1 GO: Global Optimum if rank(A) ≤ m

We introduce the following notion for ease of notation.

Definition 4.1 (Row selection matrix map). We use the $(d,k)$-row selection matrix map $S_{d,k}(I)$ to build a row selection matrix $S \in \mathbb{R}^{d \times k}$ from given indices $I$ such that $S_{d,k}(I) = S$, i.e., $s_{ij} = \mathbb{1}\{i = I(j)\}$.

One can left-multiply by the selection matrix $S$ to select a specific set of $k$ rows from the $d$ inputs. The procedure for solving Problem (3.1) is summarized in Algorithm 1.

Algorithm 1 GO, for rank(A) ≤ m
1: procedure GO(A, m, k, d)
2:   I ← indices of the k largest elements of diag(A), preferring smaller indices if tied
3:   S ← S_{d,k}(I)
4:   V ← m leading eigenvectors of A_{I,I}
5:   return W = SV
6: end procedure

The following theorem justifies the global optimality of the output of Algorithm 1.

Theorem 4.2. Suppose $A \succeq 0$ and $\mathrm{rank}(A) \le m$. Let $W = \mathrm{GO}(A, m, k, d)$ with $m \le k \le d$. Then $W$ is a globally optimal solution of Problem (3.1).

Remark 4.3. Theorem 4.2 guarantees the global optimality of Algorithm 1 for a low-rank $A$. It is interesting that, though Problem (3.1) is NP-hard in general, it is globally solvable for a low-rank covariance. A natural idea, then, is to try to solve the general Problem (3.1) by running Algorithm 1 on the best rank-$m$ approximation $A_m$. In Theorem 5.1 and Section 6, we justify this idea theoretically and empirically.

Remark 4.4. Notably, for any $B \in \{A + \sigma I_{d \times d} : A \succeq 0, \ \mathrm{rank}(A) \le m, \ \sigma \ge 0\}$, which is the population covariance in the spike model, Algorithm 1 still outputs a globally optimal solution when given $B - \sigma I_{d \times d}$ as input, since $\mathrm{Tr}(W^\top B W) = \mathrm{Tr}(W^\top (B - \sigma I_{d \times d}) W) + \sigma m = \mathrm{Tr}(W^\top A W) + \sigma m$. The sufficient condition $\mathrm{rank}(A) \le m$ is the special case $\sigma = 0$.

4.2 IPU: Iterative Proxy Update for rank(A) > m

In this subsection, we consider the general case $\mathrm{rank}(A) > m$. The main idea is to build a proxy covariance $P$ of the original $A$ such that $\mathrm{rank}(P) \le m$ and $P \succeq 0$; we can then run Algorithm 1 with the low-rank proxy $P$ to solve the original problem iteratively. We also note that the proxy covariance $P$ introduced below, by design, makes the iterative procedure an MM-type one, which directly implies its convergence by construction (see Section 5.2 for details).

Proxy Construction. Given the estimate $W_t$ from the $t$th iterative step, we define the matrix
$$P_t = A W_t (W_t^\top A W_t)^{\dagger} W_t^\top A$$
as the low-rank proxy of the original $A$. We then solve Problem (3.1) with the proxy $P_t$ rather than $A$. The following claim verifies that $P_t$ satisfies the sufficient conditions of Algorithm 1.

Claim 4.5. For each $t \ge 1$ with $W_t^\top W_t = I_{m \times m}$, it holds that $\mathrm{rank}(P_t) \le m$ and $P_t \succeq 0$.

Indices Selection. With the proxy matrix $P_t$ in hand, a natural idea is to iteratively update $W$ by solving the following problem with Algorithm 1:
$$\widetilde{W}_{t+1} \leftarrow \mathrm{GO}(P_t, m, k, d). \tag{4.1}$$
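The sketch below is my NumPy reading of Algorithm 1 (GO) and of the proxy-based update in (4.1); variable names are mine, and the pseudoinverse is used exactly as in the definition of $P_t$. It is a minimal illustration under those definitions, not the authors' implementation.

```python
import numpy as np

def go(A, m, k):
    """Algorithm 1 (GO): take the k largest diagonal entries of A (smaller indices win
    ties), then the m leading eigenvectors of the selected k-by-k principal submatrix."""
    d = A.shape[0]
    I = np.argsort(-np.diag(A), kind="stable")[:k]       # k largest diagonal entries
    vals, vecs = np.linalg.eigh(A[np.ix_(I, I)])         # eigenvalues in ascending order
    V = vecs[:, np.argsort(vals)[::-1][:m]]              # m leading eigenvectors of A_{I,I}
    W = np.zeros((d, m))
    W[I, :] = V                                          # W = S V, with S the row-selection matrix
    return W, I

def proxy(A, W):
    """Low-rank proxy P_t = A W (W^T A W)^dagger W^T A used by the iterative scheme."""
    AW = A @ W
    return AW @ np.linalg.pinv(W.T @ AW) @ AW.T

# One proxy-based update in the spirit of (4.1).
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 40))
A = X @ X.T
W0, _ = go(A, m=3, k=7)                                  # e.g., initialize with GO on A itself
W1, I1 = go(proxy(A, W0), m=3, k=7)
```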
We can further refine $\widetilde{W}_{t+1}$ by performing the eigenvalue decomposition on the original $A$ rather than on the proxy covariance $P_t$, which accelerates convergence.

Eigenvectors Refinement. Note that $\widetilde{W}_{t+1}$ can be written as $\widetilde{W}_{t+1} = S_{t+1} \widetilde{V}_{t+1}$, where $S_{t+1}$ is the selection matrix and $\widetilde{V}_{t+1}$ collects the eigenvectors on the row support of $\widetilde{W}_{t+1}$. Then $\widetilde{W}_{t+1}$ can be further refined by fixing the selection matrix $S_{t+1}$ and updating the eigenvectors $V_{t+1}$ with
$$V_{t+1} \leftarrow \arg\max_{V^\top V = I_{m \times m}} \ \mathrm{Tr}(V^\top S_{t+1}^\top A S_{t+1} V). \tag{4.2}$$
Finally, the refined $W_{t+1}$ is computed as $W_{t+1} \leftarrow S_{t+1} V_{t+1}$. Compared with updating via Problem (4.1), updating with the refinement makes larger progress and is thus more aggressive.

Algorithm 2 IPU, for general A
1: procedure IPU(A, m, k, d, W_0)
2:   t ← 0
3:   repeat
4:     P_t ← A W_t (W_t^⊤ A W_t)^† W_t^⊤ A
5:     [S, I] ← GO(P_t, m, k, d)
6:     V ← m leading eigenvectors of A_{I,I}
7:     W_{t+1} ← SV;  t ← t + 1
8:   until W_t = W_{t-1}
9:   return W_t
10: end procedure

In summary, we collect the procedure for solving FSPCA when $\mathrm{rank}(A) > m$ in Algorithm 2. The iterative procedure in Algorithm 2 is simple and well motivated by the iteratively updated proxy idea. However, existing algorithms [43, 47] in the literature usually follow the orthogonal iteration scheme [19], which makes it hard to see the difference between prior art and IPU. To address this, we provide an orthogonal-iteration-like reformulation of Algorithm 2 and a detailed discussion in the Appendix (omitted here due to space limitations), which might be of interest in its own right.

5 Theoretical Analysis

In this section, we provide theoretical analysis for Algorithms 1 and 2. Specifically, we prove approximation and convergence guarantees for the new algorithms. We then report their computational complexities and compare them with those of methods in the literature.

5.1 Approximation Guarantee

The intuition behind the approximation bound is that, since we have global optimality when $\mathrm{rank}(A) \le m$, we want to understand the solution accuracy when $\mathrm{rank}(A)$ is close to $m$. To begin, we define constants that capture the eigenvalue decay of $A$. Let $r = \min\{\mathrm{rank}(A), 2m\}$ and
$$G_1 = \frac{\sum_{i=m+1}^{r} \lambda_i(A)}{\sum_{i=1}^{m} \lambda_i(A)}, \qquad G_2 = \frac{\sum_{i=m+1}^{r} \lambda_i(A)}{\sum_{i=1}^{d} \lambda_i(A)}.$$
The main approximation result can be stated as follows.

Theorem 5.1. Suppose $A \succeq 0$ with condition number $\kappa$ and $m \le k \le d$. Let $W_m = \mathrm{GO}(A_m, m, k, d)$, and let $W^*$ be a globally optimal solution of Problem (3.1). Then
$$(1 - \varepsilon) \le \frac{\mathrm{Tr}(W_m^\top A W_m)}{\mathrm{Tr}(W^{*\top} A W^*)} \le 1,$$
where $\varepsilon$ is an explicit function of $m$, $k$, $\kappa$, $G_1$, and $G_2$ that vanishes when $G_1 = G_2 = 0$.

Remark 5.2. Theorem 5.1 says that, for sufficiently large $m$ or $k$, $\mathrm{GO}(A_m, m, k, d)$ gives a certified approximate solution of Problem (3.1). Also note that, when the eigenvalues of the covariance $A$ decay fast enough, a small $m$ or $k$ suffices for a certified approximation. Notably, when $\mathrm{rank}(A) \le m$, we have $G_1 = G_2 = 0$ and $A = A_m$, which implies $\varepsilon = 0$, and the output of Algorithm 1 is globally optimal. Using Theorem 4.2, the bound given in Theorem 5.1 is sharp. If the eigenvalues of $A$ decay sufficiently fast, e.g., exponentially, the bound becomes tighter.

Corollary 5.3 (Exponential distribution). Suppose $A \succeq 0$, $m \le k \le d$, and $\lambda_i(A) = c\, e^{-\bar{c} i}$ with $c > 0$, $\bar{c} > 0$ for each $i = 1, \dots, 2m$. Let $W_m = \mathrm{GO}(A_m, m, k, d)$, and let $W^*$ be an optimal solution of Problem (3.1). If $m \ge \Omega\!\left(\tfrac{1}{k\varepsilon}\right)$, then
$$(1 - \varepsilon) \le \frac{\mathrm{Tr}(W_m^\top A W_m)}{\mathrm{Tr}(W^{*\top} A W^*)} \le 1.$$
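As a reference, here is a tiny sketch (mine) of the eigenvalue-decay constants $G_1$ and $G_2$ defined above, with $r = \min\{\mathrm{rank}(A), 2m\}$; both vanish when $\mathrm{rank}(A) \le m$, matching the exact-recovery case of Theorem 4.2.

```python
import numpy as np

def decay_constants(A, m, tol=1e-10):
    """G1 = sum_{i=m+1}^{r} lambda_i / sum_{i=1}^{m} lambda_i and
       G2 = sum_{i=m+1}^{r} lambda_i / sum_{i=1}^{d} lambda_i, with r = min(rank(A), 2m)."""
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]        # eigenvalues in descending order
    r = min(int(np.sum(lam > tol)), 2 * m)            # numerical rank, capped at 2m
    tail = lam[m:r].sum()                             # empty (zero) when rank(A) <= m
    return tail / lam[:m].sum(), tail / lam.sum()

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 40))
G1, G2 = decay_constants(X @ X.T, m=3)
```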
[14], Mihail & Papadimitriou [29] that many phenomena approximately follow Zipf-like spectrum, e.g., Web caching, Internet topology, and city population. The ith eigenvalue of the Zipf-like spectrum is ci t with constants c > 0, t > 1. We have following corollary for Zipf-like distributed eigenvalues. Corollary 5.4 (Zipf s distribution). Suppose A 0, m k d, and λi(A) = ci t with t > 1, c > 0 for each i = 1, . . . , 2m. Let Wm = Go(Am, m, k, d), and W be an optimal solution of Problem 3.1. If m Ω d kε 1 t 1 , then we have (1 ε) Tr(W m AWm) Tr(W AW ) 1. 5.2 Convergence Guarantee In this section, we show the iterative scheme proposed in Algorithm 2 increases the objective function value in every iterative step, which directly indicates the convergence of the iterative scheme. Lots of classical algorithms can be framed into the MM framework, e.g., EM Algorithm [12], Proximal Algorithms [4, 37], Concave-Convex Procedure (CCCP) [49, 24]. Please refer to [39] for further discussion. It is notable that the newly proposed Algorithm 2 can also be viewed as a special case of the general MM optimization framework. Unlike conventional MM using Jensen s/A-GM/Cauchy-Schwartz s inequalities, or quadratic upper bound to build auxiliary function [44, 39], our auxiliary function for Algorithm 2 is based on the von Neumann s trace inequality [40], which is defined by g(W; Wt) = Tr(W AWt(W t AWt) W t AW) Tr(W AW). Meanwhile, it is easy to check that g(W; Wt) satisfies g(Wt; Wt) = Tr(W t AWt). Theorem 5.5 (Monotonic increasing). Suppose A 0, m k d. Let Wt+1 be the variable defined in Algorithm 2. If Wt = Wt+1 up to EVD, then, Tr(W t AWt) < Tr(W t+1AWt+1). Leveraging the ascent property, we have following the approximation guarantee for Algorithm 2. Corollary 5.6. Suppose A 0, κ = λ1(A)/λd(A). Let c W = IPU(A, m, k, d, Go(Am, m, k, d)), and W be an optimal solution of Problem 3.1. Then, we have (1 ε) Tr(c W Ac W) Tr(W AW ) 1 with m , 1 κ 1, 1 k Remark 5.7. Theorem 5.5 shows that the newly proposed Algorithm 2 is an ascent method, that is {Tr(W t AWt)}T t=1 is an increasing sequence, which is important since most of the existing algorithms for solving Problem (3.1) are not ascent. That is to say, they cannot guarantee the output is better than the initialization. Combining with the fact that the objective function is bounded from above by finite Tr(A), the convergence of objective function value can be obtained. We show the sequence from Algorithm 2 converges to a fixed point in the sense of subspace. Theorem 5.8 (Convergence). Suppose A 0, m k d, and λm λm+1 > 0 on the selected principal submatrix of fixed point. Let {Wt} t=1 be any sequence generated by Algorithm 2. Then, the sequence {Wt} t=1 converges to a fixed point, say f W, of Algorithm 2 in the sense of subspace, and sin Θ (span(Wt+1), span(Wt)) 2 0, Tr(W t AWt) Tr(f W Af W). 5.3 Computational Complexity Algorithm 1. It is easy to see the overall complexity is O(d+k3) since O(d) for the largest k indices selection (use the Θ(d) median of medians [5] to select the largest k-th element, then do a scan to filter elements that is larger than the k-th element), O(k3) for eigenvalue decomposition, and O(km) for building the output W. Algorithm 2. The overall computational complexity2 is O(max{dkm, k3}T), where T is the number of iterative steps used to coverage. We did not provide an upper bound on T as characterizing the rate of convergence for most MM algorithm is very hard [44] (except for some quadratic upper bound type algorithms). 
5.4 On the Invertibility of $W_t^\top A W_t$

The definition of the proxy matrix $P_t$ involves the Moore-Penrose inverse $(W_t^\top A W_t)^{\dagger}$. In this subsection, we provide a condition under which this matrix is always invertible, so that the Moore-Penrose inverse $(W_t^\top A W_t)^{\dagger}$ can be replaced with the matrix inverse $(W_t^\top A W_t)^{-1}$. The reason we care about invertibility is that when $W_t^\top A W_t$ is not invertible, it is rank deficient and thus might not be a good approximation to the high-rank covariance $A$.

Claim 5.9. If $\mathrm{rank}(A) \ge d - k + m$, then, for all $t$, $W_t^\top A W_t$ in Algorithm 2 is invertible.

Remark 5.10. The condition in Claim 5.9 is easy to satisfy. Indeed, we can solve Problem (3.1) with $A_\epsilon = A + \epsilon I_{d \times d}$, so that $\mathrm{rank}(A_\epsilon) = d \ge d - k + m$. This small $\epsilon$ perturbation of $A$ does not change the optimal $W$, because $\mathrm{Tr}(W^\top A_\epsilon W) = \mathrm{Tr}(W^\top A W) + \epsilon m$, which merely adds the constant $\epsilon m$ to the original objective; hence the optimal $W$ remains unchanged. In practice, we recommend using $A_\epsilon$ with a small $\epsilon > 0$ to be safe.

6 Experiments

In this section, we provide experimental results to validate the effectiveness of the proposed GO and IPU on both synthetic and real-world data. In our experiments, we always use $A_\epsilon$ with $\epsilon = 0.1$ to be safe (Remark 5.10), except on the No. C synthetic data, where we require $\mathrm{rank}(A) \le m$.

6.1 Synthetic Data

To show the effectiveness of the proposed methods, we build a series of small-scale synthetic datasets whose global optimum can be obtained by brute-force search. We then compare our methods with several state-of-the-art methods, with the optimal indices and objective value in hand.

Experiment Setup. We compare the newly proposed GO (Algorithm 1) and IPU (Algorithm 2) with SOAP [43], SRT [47], and CSSP [27]. For the synthetic data, we fix m = 3, k = 7, and d = 20. We cannot afford a larger-scale setting, since the brute-force search space grows exponentially.
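For a setting this small, the global optimum can be computed by enumerating all row supports; the sketch below (mine) is one way to build such a brute-force reference, using the fact that for a fixed support the best objective is the sum of the m leading eigenvalues of the corresponding principal submatrix.

```python
import itertools
import numpy as np

def fspca_brute_force(A, m, k):
    """Global optimum of Problem (3.1) by enumerating all (d choose k) supports;
    only feasible for small d (77,520 supports for d = 20, k = 7)."""
    d = A.shape[0]
    best_val, best_I = -np.inf, None
    for I in itertools.combinations(range(d), k):
        vals = np.linalg.eigvalsh(A[np.ix_(I, I)])    # ascending eigenvalues of A_{I,I}
        val = vals[-m:].sum()                         # best objective for this support
        if val > best_val:
            best_val, best_I = val, I
    return best_val, best_I

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 40))
opt_val, opt_I = fspca_brute_force(X @ X.T, m=3, k=7)
```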
We consider three different initialization methods: Random Subspace; Convex Relaxation, proposed in [41] and used in [43]; and Low Rank Approx., i.e., GO(A_m, m, k, d). We consider six different synthetic data schemes in our experiments, summarized in Table 1.

Table 1: Synthetic Data Description
| No. | Description | Note |
| A | λ(A) = {100, 100, 4, 1, ..., 1} | Setting in [43] |
| B | λ(A) = {300, 180, 60, 1, ..., 1} | Setting in [43] |
| C | λ(A) = {300, 180, 60, 0, ..., 0} | Verify the correctness of Theorem 4.2 |
| D | λ(A) = {160, 80, 40, 20, 10, 5, 2, 1, ..., 1} | For all σ, rank(A + σ I_{d×d}) > m |
| E | X sampled i.i.d. from U[0, 1], A = XX^⊤ | Uniform distribution |
| F | X sampled i.i.d. from N(0, 1), A = XX^⊤ | Gaussian distribution |

Schemes A and B are the synthetic data used in [43], trimmed to fit our setting (m = 3, k = 7, d = 20). Scheme C validates that Algorithm 1 globally solves Problem (3.1). Scheme D examines the performance when rank(A) is strictly larger than m. Schemes E and F compare the performance when data are generated from known distributions rather than from covariances with fixed eigenvalues. For A-D, we fix the eigenvalues and generate the eigenspace randomly, following [43]. Every scheme is run independently 100 times, and we report the mean and standard error. For the Random Subspace setting, every realization of A is run 20 times with different random initializations; thus, in the random initialization setting, we run all algorithms 20 × 100 = 2000 times. To compute the standard error of HF, we run the algorithms as we do for random initialization. The overall mean and standard error are reported.

Performance Measures. (1) Intersection Ratio (IR): card({estimated indices} ∩ {optimal indices}) / k. We use the Intersection Ratio because FSPCA performs feature selection and PCA simultaneously; it measures the overlap between the indices returned by an algorithm and the optimal indices. (2) Relative Error (RE): $\big(\mathrm{Tr}(W^{*\top} A W^*) - \mathrm{Tr}(W^\top A W)\big) / \mathrm{Tr}(W^{*\top} A W^*)$. (3) Hit Frequency (HF): $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{\text{Relative Error} \le 10^{-3}\}$, where N is the number of repeated runs. This measure shows how often the algorithm approximately reaches the global optimum.
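A minimal sketch (mine) of the three measures; `I_hat`/`W_hat` denote an algorithm's estimated indices and subspace, and `I_opt`/`W_opt` the brute-force optimum — hypothetical inputs, not values from the paper.

```python
import numpy as np

def intersection_ratio(I_hat, I_opt, k):
    """IR: size of the overlap between estimated and optimal indices, divided by k."""
    return len(set(I_hat) & set(I_opt)) / k

def relative_error(A, W_hat, W_opt):
    """RE = (Tr(W*^T A W*) - Tr(W_hat^T A W_hat)) / Tr(W*^T A W*)."""
    opt = np.trace(W_opt.T @ A @ W_opt)
    return float((opt - np.trace(W_hat.T @ A @ W_hat)) / opt)

def hit_frequency(rel_errors, thresh=1e-3):
    """HF: fraction of repeated runs whose relative error is at most the threshold."""
    return float(np.mean(np.asarray(rel_errors, dtype=float) <= thresh))
```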
Table 2: Synthetic Data Results [mean (std. err.); IR and HF: larger is better; RE: smaller is better]. Each method is evaluated under three initializations; for each initialization we report IR, RE, and HF.
| | | Random Subspace | | | Convex Relaxation | | | Low Rank Approx. | | |
| Scheme | Method | IR | RE | HF | IR | RE | HF | IR | RE | HF |
| A | SOAP | 0.73 (0.09) | 0.03 (0.02) | 0.18 (0.15) | 0.71 (0.12) | 0.08 (0.04) | 0.01 (0.01) | 0.84 (0.12) | 0.03 (0.03) | 0.22 (0.17) |
| A | SRT | 0.77 (0.19) | 0.01 (0.02) | 0.70 (0.21) | 0.92 (0.12) | 0.01 (0.02) | 0.62 (0.24) | 0.88 (0.16) | 0.02 (0.04) | 0.50 (0.25) |
| A | CSSP | 0.63 (0.13) | 0.88 (0.05) | 0.00 (0.00) | 0.62 (0.12) | 0.87 (0.06) | 0.00 (0.00) | 0.62 (0.12) | 0.87 (0.06) | 0.00 (0.00) |
| A | Go | 0.92 (0.12) | 0.01 (0.03) | 0.74 (0.19) | 0.93 (0.12) | 0.01 (0.03) | 0.67 (0.22) | 0.93 (0.12) | 0.01 (0.03) | 0.66 (0.22) |
| A | IPU | 0.97 (0.04) | 0.00 (0.00) | 1.00 (0.00) | 0.99 (0.04) | 0.00 (0.00) | 0.97 (0.03) | 0.98 (0.05) | 0.00 (0.00) | 0.91 (0.08) |
| B | SOAP | 0.76 (0.12) | 0.03 (0.03) | 0.14 (0.12) | 0.78 (0.11) | 0.04 (0.03) | 0.09 (0.08) | 0.77 (0.12) | 0.04 (0.03) | 0.05 (0.05) |
| B | SRT | 0.59 (0.08) | 0.03 (0.03) | 0.28 (0.20) | 0.79 (0.14) | 0.04 (0.04) | 0.15 (0.13) | 0.80 (0.16) | 0.04 (0.05) | 0.30 (0.21) |
| B | CSSP | 0.77 (0.10) | 0.90 (0.05) | 0.00 (0.00) | 0.76 (0.12) | 0.90 (0.05) | 0.00 (0.00) | 0.76 (0.12) | 0.91 (0.05) | 0.00 (0.00) |
| B | Go | 0.99 (0.02) | 0.00 (0.00) | 1.00 (0.00) | 0.99 (0.02) | 0.00 (0.00) | 1.00 (0.00) | 0.99 (0.01) | 0.00 (0.00) | 1.00 (0.00) |
| B | IPU | 0.97 (0.03) | 0.00 (0.00) | 1.00 (0.00) | 0.99 (0.01) | 0.00 (0.00) | 1.00 (0.00) | 0.99 (0.01) | 0.00 (0.00) | 1.00 (0.00) |
| C | SOAP | 0.77 (0.12) | 0.04 (0.03) | 0.11 (0.10) | 0.77 (0.12) | 0.04 (0.03) | 0.08 (0.07) | 0.76 (0.12) | 0.04 (0.03) | 0.05 (0.05) |
| C | SRT | 0.59 (0.08) | 0.03 (0.04) | 0.20 (0.16) | 0.76 (0.16) | 0.05 (0.05) | 0.12 (0.11) | 0.80 (0.17) | 0.05 (0.06) | 0.26 (0.19) |
| C | CSSP | 0.77 (0.11) | 0.94 (0.03) | 0.00 (0.00) | 0.76 (0.12) | 0.94 (0.03) | 0.00 (0.00) | 0.76 (0.12) | 0.94 (0.03) | 0.00 (0.00) |
| C | Go | 1.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) |
| C | IPU | 1.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) |
| D | SOAP | 0.79 (0.08) | 0.01 (0.01) | 0.43 (0.25) | 0.80 (0.13) | 0.02 (0.02) | 0.15 (0.13) | 0.84 (0.11) | 0.01 (0.01) | 0.22 (0.17) |
| D | SRT | 0.57 (0.07) | 0.02 (0.02) | 0.14 (0.12) | 0.77 (0.15) | 0.04 (0.04) | 0.12 (0.11) | 0.83 (0.14) | 0.02 (0.03) | 0.27 (0.20) |
| D | CSSP | 0.76 (0.12) | 0.80 (0.07) | 0.00 (0.00) | 0.77 (0.12) | 0.82 (0.08) | 0.00 (0.00) | 0.77 (0.12) | 0.82 (0.08) | 0.00 (0.00) |
| D | Go | 0.91 (0.10) | 0.00 (0.01) | 0.52 (0.25) | 0.92 (0.10) | 0.00 (0.01) | 0.59 (0.24) | 0.92 (0.09) | 0.00 (0.01) | 0.56 (0.25) |
| D | IPU | 0.83 (0.07) | 0.00 (0.00) | 0.97 (0.03) | 0.93 (0.10) | 0.00 (0.01) | 0.65 (0.23) | 0.92 (0.10) | 0.00 (0.01) | 0.60 (0.24) |
| E | SOAP | 0.43 (0.07) | 0.06 (0.03) | 0.01 (0.01) | 0.46 (0.16) | 0.12 (0.05) | 0.00 (0.00) | 0.73 (0.16) | 0.04 (0.04) | 0.12 (0.11) |
| E | SRT | 0.86 (0.07) | 0.00 (0.00) | 0.72 (0.20) | 0.88 (0.11) | 0.01 (0.01) | 0.40 (0.24) | 0.90 (0.09) | 0.01 (0.01) | 0.52 (0.25) |
| E | CSSP | 0.43 (0.16) | 0.82 (0.06) | 0.00 (0.00) | 0.43 (0.16) | 0.83 (0.06) | 0.00 (0.00) | 0.44 (0.16) | 0.83 (0.06) | 0.00 (0.00) |
| E | Go | 0.89 (0.09) | 0.00 (0.01) | 0.48 (0.25) | 0.90 (0.09) | 0.00 (0.01) | 0.46 (0.25) | 0.88 (0.10) | 0.01 (0.01) | 0.41 (0.24) |
| E | IPU | 0.83 (0.06) | 0.00 (0.00) | 0.89 (0.10) | 0.87 (0.10) | 0.01 (0.01) | 0.37 (0.23) | 0.88 (0.10) | 0.01 (0.01) | 0.42 (0.24) |
| F | SOAP | 0.61 (0.07) | 0.01 (0.01) | 0.36 (0.23) | 0.79 (0.14) | 0.03 (0.03) | 0.16 (0.13) | 0.81 (0.12) | 0.03 (0.02) | 0.16 (0.13) |
| F | SRT | 0.62 (0.08) | 0.01 (0.01) | 0.37 (0.23) | 0.82 (0.12) | 0.03 (0.02) | 0.20 (0.16) | 0.82 (0.12) | 0.03 (0.02) | 0.17 (0.14) |
| F | CSSP | 0.79 (0.13) | 0.52 (0.08) | 0.00 (0.00) | 0.77 (0.14) | 0.54 (0.08) | 0.00 (0.00) | 0.77 (0.14) | 0.54 (0.08) | 0.00 (0.00) |
| F | Go | 0.83 (0.12) | 0.02 (0.03) | 0.21 (0.17) | 0.81 (0.12) | 0.03 (0.03) | 0.16 (0.13) | 0.81 (0.12) | 0.03 (0.03) | 0.16 (0.13) |
| F | IPU | 0.62 (0.07) | 0.01 (0.01) | 0.44 (0.25) | 0.82 (0.12) | 0.03 (0.02) | 0.18 (0.15) | 0.82 (0.12) | 0.03 (0.02) | 0.17 (0.14) |

Results. Experimental results are reported in Table 2, from which we draw the following insights. (1) From No. C, Algorithm 1 gives a globally optimal solution when the covariance A is low-rank. (2) Both GO and IPU outperform or match the other state-of-the-art methods, especially when the numerical rank of the covariance is small. (3) CSSP does not perform well in HF and RE, which is consistent with the results reported in [27], since the objective of CSSP is a regression-type minimization rather than variance maximization. (4) When the Low Rank Approx. strategy (with GO) is used as initialization, all methods match or even exceed the explained variance obtained with Convex Relaxation initialization, while the computational cost of Low Rank Approx. (an SVD) is substantially smaller than that of Convex Relaxation (ADMM or SDP). A small but important detail: IPU is a local ascent algorithm, so when initialized with Low Rank Approx., IPU always performs at least as well as GO. Meanwhile, initialization with Random Subspace yields better performance than both Convex Relaxation and Low Rank Approx., which is not surprising since the reported results for Random Subspace take the maximum objective value over 20 random initializations. This strategy is widely used in practice, e.g., running k-means multiple times with different initializations and picking the one with the smallest loss.
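As a concrete instance of this multi-start strategy, the self-contained sketch below (mine; the function name and defaults are hypothetical) runs the IPU iteration from several random k-row supports and keeps the solution with the largest objective.

```python
import numpy as np

def ipu_multistart(A, m, k, n_starts=20, max_iter=50, seed=0):
    """Run the IPU iteration from several random orthonormal starts with random
    k-row supports and keep the best objective (the multi-start strategy above)."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    best_val, best_W = -np.inf, None
    for _ in range(n_starts):
        I = rng.choice(d, size=k, replace=False)             # random support
        W = np.zeros((d, m))
        W[I, :] = np.linalg.qr(rng.standard_normal((k, m)))[0]
        for _ in range(max_iter):
            AW = A @ W
            P = AW @ np.linalg.pinv(W.T @ AW) @ AW.T         # proxy P_t
            I = np.argsort(-np.diag(P), kind="stable")[:k]   # indices selection (GO on P_t)
            vals, vecs = np.linalg.eigh(A[np.ix_(I, I)])     # refinement on A_{I,I}
            W_new = np.zeros((d, m))
            W_new[I, :] = vecs[:, np.argsort(vals)[::-1][:m]]
            if np.allclose(W_new, W):                        # W_t = W_{t-1}: stop
                break
            W = W_new
        val = float(np.trace(W.T @ A @ W))
        if val > best_val:
            best_val, best_W = val, W
    return best_W, best_val
```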
6.2 Real-world Data

Experiment Setup. We consider real-world datasets, including Lymphoma (biology) [48], NUS-WIDE (web images) [10], and Numerical Numbers (handwritten numbers) [3]. We compare GO and IPU with SOAP, SRT, and TPower (G), and report the results of TPower (D) as a baseline. TPower (G) selects the sparsity pattern greedily with the leading eigenvector, while TPower (D) uses the deflation scheme, which cannot produce a consistent sparsity pattern across rows. We follow [43] and use Convex Relaxation as the initialization. Following [43, 47], we use the Normalized Explained Variance (NEV) as the performance measure, defined as $\mathrm{Tr}(\widehat{W}^\top A \widehat{W}) / \mathrm{Tr}(A_m)$, where $\widehat{W}$ is the subspace estimate returned by an algorithm.

Figure 2: Real-world Data Results. [Normalized Explained Variance versus sparsity k on Lymphoma (m=10, d=500), NUS-WIDE (m=10, d=225), and Numerical Numbers (m=10, d=216) for SOAP, SRT, TPower (G), TPower (D), Go, IPU, and the Theorem 5.1 bound.]

Results. The experimental results are reported in Figure 2, from which we draw the following insights. (1) On all three real-world datasets, the new algorithms, GO and IPU, consistently perform better than the other state-of-the-art methods that solve FSPCA. (2) On the Numerical Numbers dataset, the performance of all methods except SOAP and TPower (D) is tied. It is of interest whether this is because the dataset is too difficult or too easy; we therefore plot the approximation bound of Theorem 5.1, which reveals that these methods achieve almost optimal performance. (3) While TPower (D) achieves the highest NEV, it cannot be used for either feature selection or sparse subspace estimation (see Definition 1.1), due to the inconsistent sparsity pattern produced by one-by-one eigenvector estimation (see Figure 1); in effect, TPower (D) solves a less constrained problem.

Computation Time. We evaluated computation time on synthetic setting D with d = 5000 and m = 10; see Figure 3, which shows that the new algorithms scale well to large covariances. All experiments in this paper were run in MATLAB 2018a on a MacBook Pro with a 2.3 GHz quad-core Intel Core i5 CPU and 16 GB of memory.

Figure 3: Computation Time. [Computation time (seconds) versus sparsity k on synthetic data D (m=10, d=5000) for the methods of Wang et al. (2014), Yang et al. (2015), Yuan et al. (2013), and for Go and IPU.]

Convergence. In Theorem 5.5 we prove the monotonic ascent property of IPU (Algorithm 2), and in Remark 5.7 we claim that existing iterative schemes are not guaranteed to be monotonically ascending. Here we provide numerical evidence for this claim. We run GO, IPU, SOAP, and SRT on the Lymphoma dataset with m = 10, k = 100, d = 500, using the same convex relaxation initialization (with row truncation) for all methods, and record the objective value at every iterative step. The results are plotted in Figure 4: both SOAP and SRT are not ascent methods, and both GO and IPU achieve better explained variance than SOAP and SRT from the same initialization. Moreover, IPU takes fewer than 10 steps to converge, which is what we observe in all our experiments.

Figure 4: Convergence. [Explained Variance (×10^4) versus iterative steps T on Lymphoma (m=10, k=100, d=500) for SOAP, SRT, Go, and IPU.]
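The kind of check plotted in Figure 4 can be scripted with a tiny helper like the one below (mine): given the recorded iterates of any FSPCA solver (a hypothetical list `iterates` of per-step solutions $W_t$), it returns the objective trace and whether the trace is monotonically non-decreasing, as Theorem 5.5 guarantees for IPU.

```python
import numpy as np

def check_monotone_ascent(A, iterates, tol=1e-9):
    """Objective trace Tr(W_t^T A W_t) for a sequence of iterates, plus a flag that is
    True iff the trace never decreases (up to a small numerical tolerance)."""
    trace = [float(np.trace(W.T @ A @ W)) for W in iterates]
    ascent = all(b >= a - tol for a, b in zip(trace, trace[1:]))
    return trace, ascent
```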
7 Conclusion

In this paper, we present algorithms to directly estimate the row-sparsity constrained leading m eigenvectors. We propose Algorithm 1 to solve FSPCA globally for low-rank covariances. For general high-rank covariances, we propose Algorithm 2, which solves FSPCA by iteratively building a carefully designed low-rank proxy covariance matrix. We prove theoretical guarantees for both algorithms on approximation and convergence. Experimental results show the promising performance of the new algorithms compared with state-of-the-art methods.

Broader Impact

This paper provides efficient, effective, and provable algorithms to solve the feature sparse PCA problem. Researchers working on feature selection, dimension reduction, and graph analysis may find the techniques in this paper interesting and highly usable for real-world applications.

Acknowledgments and Disclosure of Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0101902, in part by the National Natural Science Foundation of China under Grants 61936014, 61772427, and 61751202, and in part by the Fundamental Research Funds for the Central Universities under Grant G2019KY0501.

References

[1] Asteris, M., Papailiopoulos, D., and Dimakis, A. Nonnegative sparse PCA with provable guarantees. In International Conference on Machine Learning, pp. 1728-1736, 2014.
[2] Asteris, M., Papailiopoulos, D., Kyrillidis, A., and Dimakis, A. G. Sparse PCA via bipartite matchings. In Advances in Neural Information Processing Systems, pp. 766-774, 2015.
[3] Asuncion, A. and Newman, D. UCI machine learning repository, 2007.
[4] Bertsekas, D. P. and Tseng, P. Partial proximal minimization algorithms for convex programming. SIAM Journal on Optimization, 4(3):551-572, 1994.
[5] Blum, M., Floyd, R. W., Pratt, V. R., Rivest, R. L., and Tarjan, R. E. Time bounds for selection. 1973.
[6] Bouveyron, C., Latouche, P., and Mattei, P.-A. Bayesian variable selection for globally sparse probabilistic PCA. Electronic Journal of Statistics, 12(2):3036-3070, 2018.
[7] Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web caching and Zipf-like distributions: Evidence and implications. In IEEE INFOCOM, volume 1, pp. 126-134. IEEE, 1999.
[8] Cai, X., Nie, F., and Huang, H. Exact top-k feature selection via l2,0-norm constraint. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[9] Chan, S. O., Papailiopoulos, D., and Rubinstein, A. On the approximability of sparse PCA. In Conference on Learning Theory, pp. 623-646, 2016.
[10] Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., and Zheng, Y.-T. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR'09), Santorini, Greece, July 8-10, 2009.
[11] d'Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. G. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, July 2007.
[12] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
[13] Du, X., Nie, F., Wang, W., Yang, Y., and Zhou, X. Exploiting combination effect for unsupervised feature selection by ℓ2,0 norm. IEEE Transactions on Neural Networks and Learning Systems, (99):1-14, 2018.
[14] Faloutsos, M., Faloutsos, P., and Faloutsos, C. On power-law relationships of the internet topology. In ACM SIGCOMM Computer Communication Review, volume 29, pp. 251-262. ACM, 1999.
[15] Golub, G. H. and Van Loan, C. F. Matrix Computations. JHU Press, fourth edition, 2013. ISBN 1421407949, 9781421407944. URL http://www.cs.cornell.edu/cv/GVL4/golubandvanloan.htm.
[16] Gu, Q., Li, Z., and Han, J. Joint feature selection and subspace learning. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[17] Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 2012.
[18] Johnstone, I. M. and Lu, A. Y. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682-693, 2009.
[19] Journée, M., Nesterov, Y., Richtárik, P., and Sepulchre, R. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517-553, 2010.
[20] Khan, Z., Shafait, F., and Mian, A. Joint group sparse PCA for compressed hyperspectral imaging. IEEE Transactions on Image Processing, 24(12):4934-4942, 2015.
[21] Khanna, R., Ghosh, J., Poldrack, R., and Koyejo, O. Sparse submodular probabilistic PCA. In Artificial Intelligence and Statistics, pp. 453-461, 2015.
[22] Kundu, A., Drineas, P., and Magdon-Ismail, M. Recovering PCA and sparse PCA via hybrid-(l1, l2) sparse sampling of data elements. The Journal of Machine Learning Research, 18(1):2558-2591, 2017.
[23] Lei, J. and Vu, V. Q. Sparsistency and agnostic inference in sparse PCA. The Annals of Statistics, 43(1):299-322, 2015.
[24] Lipp, T. and Boyd, S. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263-287, 2016.
[25] Luss, R. and d'Aspremont, A. Clustering and feature selection using sparse principal component analysis. Optimization and Engineering, 11(1):145-157, 2010.
[26] Mackey, L. W. Deflation methods for sparse PCA. In Advances in Neural Information Processing Systems, pp. 1017-1024, 2009.
[27] Magdon-Ismail, M. and Boutsidis, C. Optimal sparse linear encoders and sparse PCA. In Advances in Neural Information Processing Systems, pp. 298-306, 2016.
[28] Masaeli, M., Yan, Y., Cui, Y., Fung, G., and Dy, J. G. Convex principal feature selection. In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 619-628. SIAM, 2010.
[29] Mihail, M. and Papadimitriou, C. On the eigenvalue power law. In International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 254-262. Springer, 2002.
[30] Moghaddam, B., Weiss, Y., and Avidan, S. Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems, pp. 915-922, 2006.
[31] Naikal, N., Yang, A. Y., and Sastry, S. S. Informative feature selection for object recognition via sparse PCA. In 2011 International Conference on Computer Vision, pp. 818-825. IEEE, 2011.
[32] Nie, F., Huang, H., Cai, X., and Ding, C. H. Efficient and robust feature selection via joint l2,1-norms minimization. In Advances in Neural Information Processing Systems, pp. 1813-1821, 2010.
[33] Nie, F., Huang, H., Ding, C., Luo, D., and Wang, H. Robust principal component analysis with non-greedy l1-norm maximization. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, pp. 1433, 2011.
[34] Nie, F., Yuan, J., and Huang, H. Optimal mean robust principal component analysis. In International Conference on Machine Learning, pp. 1062-1070, 2014.
[35] Pang, T., Nie, F., Han, J., and Li, X. Efficient feature selection via l2,0-norm constrained sparse regression. IEEE Transactions on Knowledge and Data Engineering, 2018.
[36] Papailiopoulos, D., Dimakis, A., and Korokythakis, S. Sparse PCA through low-rank approximations. In International Conference on Machine Learning, pp. 747-755, 2013.
[37] Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014.
[38] Shen, H. and Huang, J. Z. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015-1034, 2008.
[39] Sun, Y., Babu, P., and Palomar, D. P. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing, 65(3):794-816, 2017.
[40] Von Neumann, J. Some matrix-inequalities and metrization of matric space. 1937.
[41] Vu, V. Q., Cho, J., Lei, J., and Rohe, K. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pp. 2670-2678, 2013.
[42] Vu, V. Q. and Lei, J. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905-2947, 2013.
[43] Wang, Z., Lu, H., and Liu, H. Tighten after relax: Minimax-optimal sparse PCA in polynomial time. In Advances in Neural Information Processing Systems, pp. 3383-3391, 2014.
[44] Wu, T. T. and Lange, K. The MM alternative to EM. Statistical Science, 25(4):492-505, 2010.
[45] Xiaoshuang, S., Zhihui, L., Zhenhua, G., Minghua, W., Cairong, Z., and Heng, K. Sparse principal component analysis via joint l2,1-norm penalty. In Australasian Joint Conference on Artificial Intelligence, pp. 148-159. Springer, 2013.
[46] Yang, D., Ma, Z., and Buja, A. Rate optimal denoising of simultaneously sparse and low rank matrices. The Journal of Machine Learning Research, 17(1):3163-3189, 2016.
[47] Yang, W. and Xu, H. Streaming sparse principal component analysis. In International Conference on Machine Learning, pp. 494-503, 2015.
[48] Yuan, X.-T. and Zhang, T. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899-925, 2013.
[49] Yuille, A. L. and Rangarajan, A. The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems, pp. 1033-1040, 2002.
[50] Zhang, A. and Han, R. Optimal sparse singular value decomposition for high-dimensional high-order data. Journal of the American Statistical Association, pp. 1-40, 2018.
[51] Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.