# Nonnegative Sparse PCA with Provable Guarantees

Megasthenis Asteris MEGAS@UTEXAS.EDU
Dimitris S. Papailiopoulos DIMITRIS@UTEXAS.EDU
Alexandros G. Dimakis DIMAKIS@AUSTIN.UTEXAS.EDU
Department of Electrical and Computer Engineering, The University of Texas at Austin, TX, USA

Abstract

We introduce a novel algorithm to compute nonnegative sparse principal components of positive semidefinite (PSD) matrices. Our algorithm comes with approximation guarantees contingent on the spectral profile of the input matrix $A$: the sharper the eigenvalue decay, the better the quality of the approximation. If the eigenvalues decay like any asymptotically vanishing function, we can approximate nonnegative sparse PCA within any accuracy $\epsilon$ in time polynomial in the matrix dimension $n$ and the desired sparsity $k$, but not in $1/\epsilon$. Further, we obtain a data-dependent bound that is computed by executing the algorithm on a given data set. This bound is significantly tighter than a priori bounds and can be used to show that, for all tested datasets, our algorithm is provably within 40% to 90% of the unknown optimum. Our algorithm is combinatorial and explores a subspace defined by the leading eigenvectors of $A$. We test our scheme on several data sets, showing that it matches or outperforms the previous state of the art.

1. Introduction

Given a data matrix $S \in \mathbb{R}^{n \times m}$ comprising $m$ zero-mean vectors on $n$ features, the first principal component (PC) is

$$\arg\max_{\|x\|_2 = 1} x^T A x, \qquad (1)$$

where $A = \tfrac{1}{m} S S^T$ is the $n \times n$ positive semidefinite (PSD) empirical covariance matrix. Subsequent PCs can be computed after $A$ has been appropriately deflated to remove the first eigenvector. PCA is arguably the workhorse of high-dimensional data analysis and achieves dimensionality reduction by computing the directions of maximum variance. Typically, all $n$ features affect these directions positively or negatively, resulting in dense PCs, which explain the largest possible data variance but are often not interpretable.

It has been shown that enforcing nonnegativity on the computed principal components can aid interpretability. This is particularly true in applications where features interact only in an additive manner. For instance, in bioinformatics, chemical concentrations are nonnegative (Kim & Park, 2007), and the expression level of genes is typically attributed to positive or negative influences of those genes, but not both (Badea & Tilivea, 2005). Here, enforcing nonnegativity, in conjunction with sparsity, on the computed components can assist the discovery of local patterns in the data. In computer vision, where features may coincide with nonnegatively valued image pixels, nonnegative sparse PCA pertains to the extraction of the most informative image parts (Lee & Seung, 1999). In other applications, nonnegative weights admit a meaningful probabilistic interpretation.

Sparsity emerges as an additional desirable trait of the computed components because it further helps interpretability (Zou et al., 2006; d'Aspremont et al., 2007b), even independently of nonnegativity. From a machine learning perspective, enforcing sparsity serves as an unsupervised feature selection method: the active coordinates in an optimal $\ell_0$-norm constrained PC should correspond to the most informative subset of features.
Although nonnegativity inherently promotes sparsity, an explicit sparsity constraint enables precise control over the number of selected features.

Nonnegative Sparse PC. Nonnegativity and sparsity can be directly enforced on the principal component optimization by adding constraints to (1). The $k$-sparse nonnegative principal component of $A$ is

$$x^\star = \arg\max_{x \in S^n_k} x^T A x, \qquad (2)$$

where $S^n_k = \{x \in \mathbb{R}^n : \|x\|_2 = 1,\ \|x\|_0 \le k,\ x \ge 0\}$, for a desired sparsity parameter $k \in [n]$. The problem of computing the first eigenvector (1) is easily solvable, but with the additional sparsity and nonnegativity constraints, problem (2) becomes computationally intractable. The cardinality constraint alone renders sparse PCA NP-hard (Moghaddam et al., 2006b). Even if the $\ell_0$-norm constraint is dropped, we show that problem (2) remains computationally intractable, via a reduction from checking matrix copositivity, a well-known co-NP-complete decision problem (Murty & Kabadi, 1987; Parrilo, 2000). Therefore, each of the constraints $x \ge 0$ and $\|x\|_0 \le k$ individually makes the problem intractable.

Our Contribution: We introduce a novel algorithm for approximating the nonnegative $k$-sparse principal component with provable approximation guarantees. Given any PSD matrix $A \in \mathbb{R}^{n \times n}$, sparsity parameter $k$, and accuracy parameter $d \in [n]$, our algorithm outputs a nonnegative, $k$-sparse, unit-norm vector $x_d$ that achieves at least a $\rho_d$ fraction of the maximum objective value in (2), i.e.,

$$x_d^T A x_d \ge \rho_d \cdot x^{\star T} A x^{\star}, \qquad (3)$$

where the approximation factor $\rho_d \in (0, 1]$ is an explicit function of $k$, $n$, and the eigenvalues of $A$ (Theorem 1). Here, $\lambda_i$ is the $i$th largest eigenvalue of $A$, and the accuracy parameter $d$ specifies the rank of the approximation used and controls the running time. Specifically, our algorithm runs in time $O(n^d k^d + n^{d+1})$. As can be seen, our result depends on the spectral profile of $A$: the faster the eigenvalue decay, the tighter the approximation.

Near-linear time approximation. Our algorithm has a running time of $O(n^d k^d + n^{d+1})$, which in the linear sparsity regime can be as high as $O(n^{2d})$. This can be impractical for large data sets, even if we set the rank parameter $d$ to two or three. We present a modification of our algorithm that provably approximates the output of the first in near-linear time. Specifically, for any desired accuracy $\epsilon \in (0, 1]$, it computes a nonnegative, $k$-sparse, unit-norm vector $\hat{x}_d$ such that

$$\hat{x}_d^T A \hat{x}_d \ge (1 - \epsilon)\, \rho_d \cdot x^{\star T} A x^{\star}, \qquad (5)$$

where $\rho_d$ is the approximation factor in (3). We show that the running time of our approximate algorithm is $O(\epsilon^{-d} n \log n)$, which is near-linear in $n$ for any fixed accuracy parameters $d$ and $\epsilon$.

Our approximation theorem has several implications.

Exact solution for low-rank matrices. Observe that if the matrix $A$ has rank $d$, our algorithm returns the optimal $k$-sparse PC for any target sparsity $k$. The same holds in the case of the rank-$d$ update matrix $A = \sigma I + C$, with $\mathrm{rank}(C) = d$ and arbitrary constant $\sigma$, since the algorithm can be equivalently applied to $C$.

PTAS for any spectral decay. Consider the linear sparsity regime $k = cn$ and assume that the eigenvalues follow a decay law $\lambda_i \le \lambda_1 f(i)$ for any decay function $f(i)$ that vanishes: $f(i) \to 0$ as $i \to \infty$. Special cases include power-law decay $f(i) = 1/i^{\alpha}$ or even very slow decay functions like $f(i) = 1/\log\log i$. For all these cases, we can solve nonnegative sparse PCA to any desired accuracy $\epsilon$ in time polynomial in $n$ and $k$, but not in $1/\epsilon$. Therefore, we obtain a polynomial-time approximation scheme (PTAS) for any vanishing spectral decay.
Computable upper bounds. In addition to these theoretical guarantees, our method yields a data-dependent upper bound on the maximum value of (2), which can be computed by running our algorithm. As can be seen in Figures 4 to 6, the obtained upper bound, combined with our achievable point, sandwiches the unknown optimum within a narrow region. Using this upper bound, we are able to show that our solutions are within 40% to 90% of the optimal on all the datasets that we examine. To the best of our knowledge, this framework of data-dependent bounds has not been considered in the previous literature.

1.1. Related Work

There is a substantial volume of work on sparse PCA, spanning a rich variety of approaches: from early heuristics in (Jolliffe, 1995), to the LASSO-based techniques in (Jolliffe et al., 2003), the elastic-net $\ell_1$-regression in (Zou et al., 2006), a greedy branch-and-bound technique in (Moghaddam et al., 2006a), and semidefinite programming approaches (d'Aspremont et al., 2008; Zhang et al., 2012; d'Aspremont et al., 2007a). This line of work does not consider or enforce nonnegativity constraints.

When nonnegative components are desired, fundamentally different approaches have been used. Nonnegative matrix factorization (Lee & Seung, 1999) and its sparse variants (Hoyer, 2004; Kim & Park, 2007) fall within that scope: data is expressed as (sparse) nonnegative linear combinations of (sparse) nonnegative parts. These approaches are interested in finding a lower-dimensional representation of the data that reveals latent structure and minimizes a reconstruction error, but are not explicitly concerned with the statistical significance of individual output vectors.

Nonnegativity as an additional constraint on (sparse) PCA first appeared in (Zass & Shashua, 2007). The authors suggested a coordinate-descent scheme that jointly computes a set of nonnegative sparse principal components, maximizing the cumulative explained variance. An $\ell_1$-penalty promotes sparsity of the computed components on average, but not on each component individually. A second convex penalty is incorporated to favor orthogonal components. Similar convex optimization approaches for nonnegative PCA have subsequently been proposed in the literature. In (Allen & Maletić-Savatić, 2011), for instance, the authors suggest an alternating maximization scheme for the computation of the first nonnegative PC, allowing the incorporation of known structural dependencies. A competitive algorithm for nonnegative sparse PCA was established in (Sigg & Buhmann, 2008), with the development of a framework stemming from Expectation Maximization (EM) for a probabilistic generative model of PCA. The proposed algorithm, which enforces hard sparsity, or nonnegativity, or both constraints simultaneously, computes the first approximate PC in $O(n^2)$, i.e., time quadratic in the number of features. To the best of our knowledge, no prior works provide provable approximation guarantees for the nonnegative sparse PCA optimization problem. Further, no data-dependent upper bounds have been presented in the previous literature.

Differences from SPCA work. Our work is closely related to (Karystinos & Liavas, 2010; Asteris et al., 2011; Papailiopoulos et al., 2013), which introduced the idea of solving low-rank quadratic combinatorial optimization problems on low-rank PSD matrices using hyperspherical transformations. Such transformations are called spannograms and follow a similar architecture.
In this paper, we extend the spannogram framework to nonnegative sparse PCA. The most important technical issue compared to (Asteris et al., 2011; Papailiopoulos et al., 2013) is introducing nonnegativity constraints in spannogram algorithms. To understand how this changes the problem, notice that in the original sparse PCA problem without nonnegativity constraints, if the support is known, the optimal principal component supported on that set can be easily found. However, under nonnegativity constraints, the problem is hard even if the optimal support is known. This is the fundamental technical problem that we address in this paper. We show that if the involved subspace is low-dimensional, it is possible to solve this problem.

2. Algorithm Overview

Given an $n \times n$ PSD matrix $A$, the desired sparsity $k$, and an accuracy parameter $d \in [n]$, our algorithm computes a nonnegative, $k$-sparse, unit-norm vector $x_d$ approximating the nonnegative, $k$-sparse PC of $A$. We begin with a high-level description of the main steps of the algorithm.

Step 1. Compute $A_d$, the rank-$d$ approximation of $A$. We compute $A_d$, the best rank-$d$ approximation of $A$, zeroing out the $n - d$ trailing eigenvalues of $A$, that is,

$$A_d = \sum_{i=1}^{d} \lambda_i u_i u_i^T,$$

where $\lambda_i$ is the $i$th largest eigenvalue of $A$ and $u_i$ the corresponding eigenvector.

Step 2. Compute $S_d$, a set of $O(n^d)$ candidate supports. Enumerating the $\binom{n}{k}$ possible supports of $k$-sparse vectors in $\mathbb{R}^n$ is computationally intractable. Using our Spannogram technique described in Section 4, we efficiently determine a collection $S_d$ of support sets, with cardinality $|S_d| \le 2^d \binom{n+1}{d}$, that provably contains the support of the nonnegative, $k$-sparse PC of $A_d$.

Step 3. Compute $X_d$, a set of candidate solutions. For each candidate support set $I \in S_d$, we compute a candidate solution $x$ supported only on $I$:

$$\arg\max_{\|x\|_2 = 1,\ x \ge 0,\ \mathrm{supp}(x) \subseteq I} x^T A_d x. \qquad (6)$$

The constant rank of $A_d$ is essential in solving (6): the constrained quadratic maximization is in general NP-hard, even for a given support.

Step 4. Output the best candidate solution in $X_d$, i.e., the candidate that maximizes the quadratic form. If multiple components are desired, the procedure is repeated after an appropriate deflation has been applied to $A_d$ (Mackey, 2008).

The steps are formally presented in Algorithm 1. A detailed description is the subject of subsequent sections.

Algorithm 1  Spannogram Nonnegative Sparse PCA
Input: A (n x n PSD matrix), k in [n], d in [n]
 1: [U, Λ] ← svd(A, d)
 2: V ← U Λ^{1/2}                                        {A_d = V V^T}
 3: S_d ← Spannogram(V, k)                               {Alg. 2; |S_d| = O(n^d)}
 4: X_d ← {}
 5: for all I ∈ S_d do
 6:   c(I) ← arg max_{‖c‖_2 = 1, V_I c ≥ 0} ‖V_I c‖_2^2   {Sec. 5}
 7:   x(I)_I ← |V_I c(I)| / ‖V_I c(I)‖_2,   x(I)_{I^c} ← 0
 8:   X_d ← X_d ∪ {x(I)}
 9: end for                                              {|X_d| ≤ |S_d|}
Output: x_d ← arg max_{x ∈ X_d} x^T A_d x

2.1. Approximation Guarantees

Instead of the nonnegative, $k$-sparse principal component $x^{\star}$ of $A$, which attains the optimal value $\mathrm{OPT} = x^{\star T} A x^{\star}$, our algorithm outputs a nonnegative, $k$-sparse, unit-norm vector $x_d$. We measure the quality of $x_d$ as a surrogate of $x^{\star}$ by the approximation factor $x_d^T A x_d / \mathrm{OPT}$. Clearly, the approximation factor takes values in $(0, 1]$, with higher values implying a tighter approximation.

Theorem 1. For any $n \times n$ PSD matrix $A$, sparsity parameter $k$, and accuracy parameter $d \in [n]$, Alg. 1 outputs a nonnegative, $k$-sparse, unit-norm vector $x_d$ such that

$$x_d^T A x_d \ge \rho_d \cdot x^{\star T} A x^{\star},$$

in time $O(n^{d+1} + n^d k^d)$, where the approximation factor $\rho_d$ is an explicit function of $k$, $n$, and the eigenvalues $\lambda_1 \ge \dots \ge \lambda_{d+1}$ of $A$; in particular, the guarantee is exact ($\rho_d = 1$) when $A$ has rank at most $d$, and $\rho_d$ approaches 1 the faster the spectrum of $A$ decays.

The approximation guarantee of Theorem 1 relies on establishing connections among the eigenvalues of $A$ and the quadratic forms $x_d^T A x_d$ and $x_d^T A_d x_d$. The proof can be found in the supplemental material. The complexity of Algorithm 1 follows from its detailed description.
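Before detailing each step, a compact skeleton may help fix ideas. The Python/NumPy sketch below mirrors Algorithm 1 under our own naming assumptions: `spannogram_supports` and `solve_support` are hypothetical placeholders for the routines developed in Sections 4 and 5 (the latter is expected to return a pair `(value, c)` maximizing $\|V_I c\|_2^2$ over unit-norm $c$ with $V_I c \ge 0$). This is an illustration, not the authors' implementation.

```python
import numpy as np

def nonnegative_sparse_pc(A, k, d, spannogram_supports, solve_support):
    """Skeleton of Algorithm 1 (illustrative sketch, not the reference code).

    spannogram_supports(V, k) -> iterable of index sets (Section 4, Alg. 2);
    solve_support(V_I)        -> (value, c) for the per-support problem (Section 5).
    """
    n = A.shape[0]
    # Step 1: best rank-d approximation A_d = V V^T from the top-d eigenpairs.
    evals, evecs = np.linalg.eigh(A)
    lam = np.maximum(evals[::-1][:d], 0.0)       # guard tiny negative eigenvalues
    U = evecs[:, ::-1][:, :d]
    V = U * np.sqrt(lam)                         # V = U Lambda^{1/2}

    best_x, best_val = np.zeros(n), -np.inf
    for I in spannogram_supports(V, k):          # Step 2: candidate supports
        I = np.fromiter(I, dtype=int)
        _, c = solve_support(V[I, :])            # Step 3: per-support candidate
        if c is None:
            continue
        xI = np.abs(V[I, :] @ c)
        if np.linalg.norm(xI) == 0:
            continue
        x = np.zeros(n)
        x[I] = xI / np.linalg.norm(xI)
        val = np.linalg.norm(V.T @ x) ** 2       # Step 4: x^T A_d x = ||V^T x||^2
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```

The two placeholders can be filled in with the sketches given later in this document, e.g. the d = 2 support construction after Section 4 and the constrained quadratic solver after Section 5.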
3. Proposed Scheme

Our algorithm approximates the nonnegative, $k$-sparse PC of a PSD matrix $A$ by computing the corresponding PC of $A_d$, a rank-$d$ surrogate of the input argument $A$:

$$A_d = \sum_{i=1}^{d} v_i v_i^T = V V^T, \qquad (7)$$

where $v_i = \sqrt{\lambda_i}\, u_i$ is the scaled eigenvector corresponding to the $i$th largest eigenvalue of $A$, and $V = [v_1 \cdots v_d] \in \mathbb{R}^{n \times d}$. In this section, we delve into the details of our algorithmic developments and describe how the low rank of $A_d$ unlocks the computation of the desired PC.

3.1. Rank-1: A simple case

We begin with the rank-1 case because, besides its motivational simplicity, it is a fundamental component of the algorithmic developments for the rank-$d$ case. In the rank-1 case, $V$ reduces to a single vector $v \in \mathbb{R}^n$ and $x_1$, the nonnegative $k$-sparse PC of $A_1$, is the solution to

$$\max_{x \in S^n_k} x^T A_1 x = \max_{x \in S^n_k} (v^T x)^2. \qquad (8)$$

That is, $x_1$ is the nonnegative, $k$-sparse, unit-length vector that maximizes $(v^T x)^2$. Let $I = \mathrm{supp}(x_1)$, $|I| \le k$, be the unknown support of $x_1$. Then, $(v^T x)^2 = \bigl(\sum_{i \in I} v_i x_i\bigr)^2$. Since $x_1 \ge 0$, it should not be hard to see that the active entries of $x_1$ must correspond to nonnegative or nonpositive entries of $v$, but not a combination of both. In other words, $v_I$, the subvector of $v$ indexed by $I$, must satisfy $v_I \ge 0$ or $v_I \le 0$. In either case, by the Cauchy-Schwarz inequality,

$$(v^T x)^2 = (v_I^T x_I)^2 \le \|v_I\|_2^2\, \|x_I\|_2^2 = \|v_I\|_2^2. \qquad (9)$$

Equality in (9) can always be achieved by setting $x_I = v_I / \|v_I\|_2$ if $v_I \ge 0$, and $x_I = -v_I / \|v_I\|_2$ if $v_I \le 0$. The support of the optimal solution $x_1$ is the set $I$ for which $\|v_I\|_2^2$ in (9) is maximized under the restriction that the entries of $v_I$ do not have mixed signs.

Definition 1. Let $I_k^+(v)$, $1 \le k \le n$, denote the set of indices of the (at most) $k$ largest nonnegative entries of $v \in \mathbb{R}^n$.

Proposition 3.1. Let $x_1$ be the solution to problem (8). Then, $\mathrm{supp}(x_1) \in S_1 = \{ I_k^+(v),\ I_k^+(-v) \}$.

The collection $S_1$ and the associated candidate vectors via (9) are constructed in $O(n)$ time. The solution $x_1$ is the candidate that maximizes the quadratic.

3.2. Rank-d case

In the rank-$d$ case, $x_d$, the nonnegative, $k$-sparse PC of $A_d$, is the solution to the following problem:

$$\max_{x \in S^n_k} x^T A_d x = \max_{x \in S^n_k} \|V^T x\|_2^2. \qquad (10)$$

Consider an auxiliary vector $c \in \mathbb{R}^d$ with $\|c\|_2 = 1$. By the Cauchy-Schwarz inequality,

$$\|V^T x\|_2^2 = \|c\|_2^2\, \|V^T x\|_2^2 \ge (c^T V^T x)^2. \qquad (11)$$

Equality in (11) is achieved if and only if $c$ is collinear with $V^T x$. Since $c$ spans the entire unit sphere, such a $c$ exists for every $x$, yielding an alternative description of the objective function in (10):

$$\|V^T x\|_2^2 = \max_{c \in S^d} \bigl((Vc)^T x\bigr)^2, \qquad (12)$$

where $S^d = \{ c \in \mathbb{R}^d : \|c\|_2 = 1 \}$ is the $d$-dimensional unit sphere. The maximization in (10) becomes

$$\max_{x \in S^n_k} \|V^T x\|_2^2 = \max_{x \in S^n_k} \max_{c \in S^d} |(Vc)^T x|^2 = \max_{c \in S^d} \max_{x \in S^n_k} |(Vc)^T x|^2. \qquad (13)$$

The set of candidate supports. A first key observation is that for fixed $c$, the product $Vc$ is a vector in $\mathbb{R}^n$. Maximizing $|(Vc)^T x|^2$ over all vectors $x \in S^n_k$ is a rank-1 instance of the optimization problem, as in (8). Let $(c_d, x_d)$ be the optimal solution of (10). By Proposition 3.1, the support of $x_d$ coincides with either $I_k^+(V c_d)$ or $I_k^+(-V c_d)$. Hence, we can safely claim that $\mathrm{supp}(x_d)$ appears in

$$S_d = \bigcup_{c \in S^d} \bigl\{ I_k^+(Vc) \bigr\}. \qquad (14)$$

Naively, one might think that $S_d$ can contain as many as $\binom{n}{k}$ distinct support sets. In Section 4, we show that $|S_d| \le 2^d \binom{n+1}{d}$ and present our Spannogram technique (Alg. 2) for efficiently constructing $S_d$ in $O(n^{d+1})$. Each support in $S_d$ corresponds to a candidate principal component.
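Proposition 3.1 and the equality condition in (9) translate directly into a short routine. The following Python/NumPy sketch (the name `rank1_solver` is ours, mirroring the call in Algorithm 3 of Section 6) returns the nonnegative $k$-sparse PC of $A_1 = v v^T$; it is the subroutine invoked for each fixed $c$ in (13) and repeatedly in the near-linear algorithm of Section 6.

```python
import numpy as np

def rank1_solver(v, k):
    """Nonnegative k-sparse PC of the rank-1 matrix v v^T (sketch of Prop. 3.1)."""
    n = v.shape[0]
    best_x, best_val = np.zeros(n), -np.inf
    for w in (v, -v):
        idx = np.flatnonzero(w >= 0)                # nonnegative entries of w
        if idx.size == 0:
            continue
        top = idx[np.argsort(w[idx])[::-1][:k]]     # I_k^+(w)
        nrm = np.linalg.norm(w[top])
        if nrm == 0:
            continue
        x = np.zeros(n)
        x[top] = w[top] / nrm                       # achieves equality in (9)
        val = (v @ x) ** 2
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```

For example, `rank1_solver(V @ c, k)` returns the best nonnegative $k$-sparse response to a fixed direction $Vc$ in the range of $V$.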
Solving for a given support. We seek a pair $(x, c)$ that maximizes (13) under the additional constraint that $x$ is supported only on a given set $I$. By the Cauchy-Schwarz inequality, the objective in (13) satisfies

$$|(Vc)^T x|^2 = |(V_I c)^T x_I|^2 \le \|V_I c\|_2^2, \qquad (15)$$

where $V_I$ is the matrix formed by the rows of $V$ indexed by $I$. Equality in (15) is achieved if and only if $x_I$ is collinear with $V_I c$. However, it is not achievable for arbitrary $c$, as $x_I$ must be nonnegative. From Proposition 3.1, we infer that $x$ being supported on $I$ implies that all entries of $V_I c$ have the same sign. Further, whenever the latter condition holds, a nonnegative $x_I$ collinear with $V_I c$ exists and equality in (15) can be achieved. Under the additional constraint that $\mathrm{supp}(x) = I \in S_d$, the maximization in (13) becomes

$$\max_{c \in S^d} \max_{\substack{x \in S^n_k \\ \mathrm{supp}(x) \subseteq I}} |(Vc)^T x|^2 = \max_{\substack{c \in S^d \\ V_I c \ge 0}} \|V_I c\|_2^2. \qquad (16)$$

The constraint $V_I c \ge 0$ in (16) is equivalent to requiring that all entries of $V_I c$ have the same sign, since $c$ and $-c$ achieve the same objective value. The optimization problem in (16) is NP-hard. In fact, it encompasses the original nonnegative PCA problem as a special case. Here, however, the constant dimension $d = \Theta(1)$ of the unknown variable $c$ permits otherwise intractable operations. In Section 5, we outline an $O(k^d)$ algorithm for solving this constrained quadratic maximization.

The algorithm. The previous discussion suggests a two-step algorithm for solving the rank-$d$ optimization problem in (10). First, run the Spannogram algorithm to construct $S_d$, the collection of $O(n^d)$ candidate supports for $x_d$, in $O(n^{d+1})$. For each $I \in S_d$, solve (16) in $O(k^d)$ to obtain a candidate solution $x(I)$ supported on $I$. Output the candidate solution that maximizes the quadratic $x^T A_d x$. Efficiently combining the previous steps yields an $O(n^{d+1} + n^d k^d)$ procedure for approximating the nonnegative sparse PC, outlined in Alg. 1.

4. The Nonnegative Spannogram

In this section, we describe how to construct $S_d$, the collection of candidate supports defined in (14), for a given $V \in \mathbb{R}^{n \times d}$. $S_d$ comprises all support sets induced by vectors in the range of $V$. The Spannogram of $V$ is a visualization of its range, and a valuable tool in efficiently collecting those supports.

Figure 1. Spannogram of an arbitrary rank-2 matrix $V \in \mathbb{R}^{4 \times 2}$. At a point $\phi$, the values of the curves correspond to the entries of a vector $v(\phi)$ in the range of $V$, and vice versa.

4.1. Constructing S2

We describe the $d = 2$ case, the simplest nontrivial case, to facilitate a gentle exposure to the Spannogram technique. The core ideas generalize to arbitrary $d$, and a detailed description is provided in the supplemental material.

Spherical variables. Up to scaling, all vectors $v$ in the range of $V \in \mathbb{R}^{n \times 2}$, denoted $\mathcal{R}(V)$, can be written as $v = Vc$ for some $c \in \mathbb{R}^2$ with $\|c\| = 1$. We introduce a variable $\phi \in \Phi = (-\pi/2, \pi/2]$ and set $c$ to be the following function of $\phi$: $c(\phi) = [\sin\phi,\ \cos\phi]^T$. The range of $V$, $\mathcal{R}(V) = \{ v(\phi) = Vc(\phi),\ \phi \in \Phi \}$, is also a function of $\phi$, and in turn $S_2$ can be expressed as $\bigcup_{\phi \in \Phi} \{ I_k^+(v(\phi)),\ I_k^+(-v(\phi)) \}$.

Spannogram. The $i$th entry of $v(\phi)$ is a continuous function of $\phi$ generated by the $i$th row of $V$: $[v(\phi)]_i = V_{i,1}\sin\phi + V_{i,2}\cos\phi$. Fig. 1 depicts the functions corresponding to the rows of an arbitrary matrix $V \in \mathbb{R}^{4 \times 2}$. We call this a spannogram, because at each $\phi$, the values of the curves coincide with the entries of a vector in the range of $V$. A key observation is that the sorting of the curves is locally invariant for most points in $\Phi$.
In fact, due to the continuity of the curves, as we move along the $\phi$-axis, the set $I_k^+(v(\phi))$ can only change at points where a curve intersects (i) another curve, or (ii) the zero axis; a change in either the sign of a curve or the relative order of two curves is necessary, although not sufficient, for $I_k^+(v(\phi))$ to change. Appending a zero $(n+1)$th row to $V$, the two aforementioned conditions can be merged into one: $I_k^+(v(\phi))$ can change only at the points where two of the $n + 1$ curves intersect.

Finding the unique intersection point of two curves $[v(\phi)]_i$ and $[v(\phi)]_j$ for all pairs $\{i, j\}$ is the key to discovering all possible candidate support sets. There are exactly $\binom{n+1}{2}$ such points, partitioning $\Phi$ into $\binom{n+1}{2} + 1$ intervals within which the sets of the $k$ largest nonnegative entries of $v(\phi)$ and $-v(\phi)$ are invariant.

Constructing S2. The point $\phi_{ij}$ where the $i$th and $j$th curves intersect corresponds to a vector $v(\phi_{ij}) \in \mathcal{R}(V)$ whose $i$th and $j$th entries are equal. To find it, it suffices to compute a $c \ne 0$ such that $(e_i - e_j)^T V c = 0$, i.e., a unit-norm vector $c_{ij}$ in the one-dimensional nullspace of $(e_i - e_j)^T V$. Then, $v(\phi_{ij}) = V c_{ij}$. We compute the candidate support $I_k^+(v(\phi_{ij}))$ at the intersection. Assuming for simplicity that only the $i$th and $j$th curves intersect at $\phi_{ij}$, the sorting of all curves is unchanged in a small neighborhood of $\phi_{ij}$, except for the $i$th and $j$th curves, whose order changes over $\phi_{ij}$. If both the $i$th and $j$th entries of $v(\phi_{ij})$, or neither of them, are included in the $k$ largest nonnegative entries, then the set $I_k^+(v(\phi))$ is identical in the two intervals incident to $\phi_{ij}$. Otherwise, the $i$th and $j$th curves occupy the $k$th and $(k+1)$th order at $\phi_{ij}$, and the change in their relative order implies that one leaves and one joins the set of the $k$ largest nonnegative curves at $\phi_{ij}$. The support sets associated with the two adjacent intervals differ in only one element (one contains index $i$ and the other contains index $j$ instead), while the remaining $k - 1$ common indices correspond to the $k - 1$ largest curves at the intersection point $\phi_{ij}$. We include both in $S_2$ and repeat the above procedure for $I_k^+(-v(\phi_{ij}))$.

Each pairwise intersection is computed in $O(1)$ and the at most 4 associated candidate supports in $O(n)$. In total, the collection $S_2$ comprises $|S_2| \le 4\binom{n+1}{2} = O(n^2)$ candidate supports and can be constructed in $O(n^3)$. The generalized Spannogram algorithm for constructing $S_d$ runs in $O(n^{d+1})$ and is formally presented in Alg. 2; a detailed description is provided in the supplemental material.

Algorithm 2  Spannogram algorithm for constructing S_d
Input: V in R^{n x d}, k in [n]
 1: S_d ← {}                                   {set of candidate supports in R(V)}
 2: V̂ ← [V^T 0_d]^T                            {append a zero row; V̂ in R^{(n+1) x d}}
 3: for all (n+1 choose d) sets {i_1, ..., i_d} ⊆ [n+1] do
 4:   c ← Nullspace([(e_{i_1} − e_{i_2})^T; ...; (e_{i_1} − e_{i_d})^T] V̂)
 5:   for α ∈ {+1, −1} do
 6:     I ← I_k^+(α V̂ c)                       {entries no smaller than the kth largest nonnegative one}
 7:     if |I| ≤ k then
 8:       S_d ← S_d ∪ {I}                      {no ambiguity}
 9:     else
10:       A ← {i_1, ..., i_d} \ {n+1}          {ambiguous set}
11:       T ← I \ ({i_1, ..., i_d} ∪ {n+1})
12:       r ← k − |T|
13:       d̂ ← |A|
14:       for all r-subsets A^(r) of A do
15:         Î ← T ∪ A^(r)
16:         S_d ← S_d ∪ {Î}                    {at most 2^d new candidates}
17:       end for
18:     end if
19:   end for
20: end for
Output: S_d
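For concreteness, the following Python/NumPy sketch enumerates candidate supports for the $d = 2$ case roughly along the lines of the construction above. The function names are ours, and degenerate intersections (more than two curves meeting at a point) are handled only through a tiny-rotation heuristic rather than the exact subset enumeration of Algorithm 2.

```python
import numpy as np
from itertools import combinations

def topk_nonneg_support(w, k):
    """I_k^+(w): indices of the (at most) k largest nonnegative entries of w."""
    idx = np.flatnonzero(w >= 0)
    return frozenset(idx[np.argsort(w[idx])[::-1][:k]])

def spannogram_supports_d2(V, k, delta=1e-9):
    """Candidate supports S_2 for V in R^{n x 2} (sketch of Alg. 2 for d = 2)."""
    n = V.shape[0]
    Vhat = np.vstack([V, np.zeros((1, 2))])      # zero row captures sign changes
    supports = set()
    for i, j in combinations(range(n + 1), 2):
        a, b = Vhat[i] - Vhat[j]                 # (e_i - e_j)^T Vhat = [a, b]
        c = np.array([-b, a])                    # spans its 1-d nullspace
        if np.linalg.norm(c) == 0:               # identical rows; curves coincide
            continue
        c /= np.linalg.norm(c)
        # probe both intervals adjacent to the intersection point
        for theta in (delta, -delta):
            rot = np.array([[np.cos(theta), -np.sin(theta)],
                            [np.sin(theta),  np.cos(theta)]])
            for alpha in (+1.0, -1.0):
                v = alpha * (V @ (rot @ c))      # a vector in the range of V
                supports.add(topk_nonneg_support(v, k))
    return supports
```

Under the analysis above, the optimal support of (10) appears in this collection; the sketch may miss it only through the heuristic treatment of degenerate intersections, which Algorithm 2 resolves exactly.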
5. Quadratic Maximization over unit vectors in the intersection of halfspaces

Each support set $I \in S_d$ yields a candidate nonnegative, $k$-sparse PC, which can be obtained by solving (16), a quadratic maximization over the intersection of halfspaces and the unit sphere:

$$c^{\star} = \arg\max_{\substack{c \in S^d \\ Rc \ge 0}} c^T Q c, \qquad (P_d)$$

where $Q = V_I^T V_I$ is a $d \times d$ matrix and $R$ is a $k \times d$ matrix. Problem $(P_d)$ is NP-hard: for $Q$ PSD and $R = I_{d \times d}$, it becomes the nonnegative PCA problem, i.e., problem (2) without the sparsity constraint. Here, however, we are interested in the case where the dimension $d$ is a constant. We outline an $O(k^d)$ algorithm, i.e., polynomial in the number of linear constraints, for solving $(P_d)$. A detailed proof is available in the supplemental material.

The objective of $(P_d)$ is maximized by $u_1 \in \mathbb{R}^d$, the leading eigenvector of $Q$. If $u_1$ or $-u_1$ is feasible, i.e., if it satisfies all linear constraints, then $c^{\star} = \pm u_1$. It can be shown that if neither of $\pm u_1$ is feasible, at least one of the $k$ linear constraints is active at the optimal solution $c^{\star}$, that is, there exists $1 \le i \le k$ such that $R_{i,:} c^{\star} = 0$. Fig. 2 depicts an example for $d = 2$. The leading eigenvector of $Q$ lies outside the feasible region, an arc of the unit circle in the intersection of $k$ halfspaces. The optimal solution coincides with one of the two endpoints of the feasible region, where a linear inequality is active, motivating a simple algorithm for solving $(P_2)$: (i) for each linear inequality, determine a unit-length point where the inequality becomes active, and (ii) output the point that is feasible and maximizes the objective.

Figure 2. An instance of $(P_2)$: the leading eigenvector $u_1$ of $Q$ lies outside the feasible region (highlighted arc of the unit circle). The maximum of the constrained quadratic optimization problem is attained at $\phi_1$, an endpoint of the feasible interval.

Back to the general $(P_d)$ problem: if a linear inequality $R_{i,:} c \ge 0$ for some $i \in [k]$ is enforced with equality, the modified problem can be written as a quadratic maximization in the form of $(P_d)$, with the dimension reduced to $d - 1$ and $k - 1$ linear constraints. This observation suggests a recursive algorithm for solving $(P_d)$: if $\pm u_1$ is feasible, it is also the optimal solution. Otherwise, for $i = 1, \ldots, k$, set the $i$th inequality constraint active, solve recursively, and collect candidate solutions. Finally, output the candidate that maximizes the objective. The $O(k^d)$ recursive algorithm is formally presented in the supplemental material.
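The recursion is compact enough to sketch in code. The following Python/NumPy routine is a minimal illustration under our own simplifying assumptions (the name `solve_pd` is ours; an empty feasible region returns `(-inf, None)`, and degenerate inputs such as an all-zero constraint row receive no special treatment); it is not the exact routine of the supplemental material.

```python
import numpy as np

def solve_pd(Q, R, tol=1e-12):
    """Sketch of the recursion for (P_d): maximize c^T Q c over ||c||=1, R c >= 0.

    If the leading eigenvector of Q (or its negation) is feasible, return it;
    otherwise activate each constraint in turn (R_i c = 0), reduce the dimension
    by one, and recurse. Returns (value, c) or (-inf, None) if infeasible.
    """
    d = Q.shape[0]
    if d == 1:
        # base case: c = +1 or -1 is feasible if all constraints share its sign
        for c in (np.array([1.0]), np.array([-1.0])):
            if np.all(R @ c >= -tol):
                return float(Q[0, 0]), c
        return -np.inf, None

    evals, evecs = np.linalg.eigh(Q)
    u1 = evecs[:, -1]                           # leading eigenvector of Q
    for c in (u1, -u1):
        if np.all(R @ c >= -tol):
            return float(evals[-1]), c

    best_val, best_c = -np.inf, None
    for i in range(R.shape[0]):
        # enforce R_i c = 0: restrict c to the (d-1)-dim nullspace of R_i
        _, _, Vt = np.linalg.svd(R[i:i+1, :])
        N = Vt[1:, :].T                         # orthonormal basis of {c : R_i c = 0}
        Rred = np.delete(R, i, axis=0) @ N
        val, z = solve_pd(N.T @ Q @ N, Rred, tol)
        if z is not None and val > best_val:
            best_val, best_c = val, N @ z
    return best_val, best_c
```

For the per-support step (16) of Algorithm 1, one would call it with $Q = V_I^T V_I$ and $R = V_I$, e.g. `val, c = solve_pd(V[I].T @ V[I], V[I])`.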
6. Near-Linear Time Nonnegative SPCA

Alg. 1 approximates the nonnegative, $k$-sparse PC of a PSD matrix $A$ by solving the nonnegative sparse PCA problem exactly on $A_d$, the best rank-$d$ approximation of $A$. Albeit polynomial in $n$, the running time of Alg. 1 can be impractical even for moderate values of $n$. Instead of pursuing the exact solution to the low-rank nonnegative sparse PCA problem $\max_{x \in S^n_k} x^T A_d x$, we can compute an approximate solution in near-linear time, with performance arbitrarily close to optimal. The suggested procedure is outlined in Algorithm 3, and a detailed discussion is provided in the supplemental material. Alg. 3 relies on randomly sampling points from the range of $A_d$ and efficiently solving rank-1 instances of the nonnegative sparse PCA problem, as described in Section 3.1.

Algorithm 3  Approximate Spannogram NSPCA (ε-net)
Input: A (n x n PSD matrix), k in [n], d in [n], ε in (0, 1]
 1: [U, Λ] ← svd(A, d)
 2: V ← U Λ^{1/2}                    {A_d = V V^T}
 3: X_d ← {}
 4: for i = 1 : O(ε^{-d} log n) do
 5:   c ← randn(d, 1)
 6:   a ← V c / ‖c‖_2
 7:   x ← rank1solver(a)             {Section 3.1}
 8:   X_d ← X_d ∪ {x}
 9: end for
Output: x̂_d ← arg max_{x ∈ X_d} ‖V^T x‖_2^2

Theorem 2. For any $n \times n$ PSD matrix $A$, sparsity parameter $k$, and accuracy parameters $d \in [n]$ and $\epsilon \in (0, 1]$, Alg. 3 outputs a nonnegative, $k$-sparse, unit-norm vector $\hat{x}_d$ such that

$$\hat{x}_d^T A \hat{x}_d \ge (1 - \epsilon)\, \rho_d \cdot x^{\star T} A x^{\star},$$

with probability at least $1 - 1/n$, in time $O(\epsilon^{-d} n \log n)$ plus the time to compute the $d$ leading eigenvectors of $A$.

7. Experimental Evaluation

We empirically evaluate the performance of our algorithm on various datasets and compare it to the EM algorithm for sparse and nonnegative PCA of (Sigg & Buhmann, 2008), which is known to outperform previous algorithms (a Matlab implementation is available by the author).

CBCL Face Dataset. The CBCL face image dataset (Sung, 1996), with 2429 gray-scale images of size 19 x 19 pixels, has been used in the performance evaluation of both the NSPCA (Zass & Shashua, 2007) and EM (Sigg & Buhmann, 2008) algorithms. Fig. 3 depicts samples from the dataset, as well as six orthogonal, nonnegative, $k$-sparse components ($k = 40$) successively computed by (i) Alg. 3 ($d = 3$, $\epsilon = 0.1$) and (ii) the EM algorithm. Features active in one component are removed from the dataset prior to computing subsequent PCs to ensure orthogonality. Fig. 3 reveals the ability of nonnegative sparse PCA to extract significant parts.

Figure 3. We plot (a) six samples from the dataset, and the six leading orthogonal, nonnegative, k-sparse PCs for k = 40 extracted by (b) Alg. 3 (d = 3, ε = 0.1), and (c) the EM algorithm.

In Fig. 4, we plot the variance explained by the computed approximate nonnegative, $k$-sparse PC (normalized by the leading eigenvalue) versus the sparsity parameter $k$. Alg. 3 for $d = 3$ and $\epsilon = 0.1$ and the EM algorithm exhibit nearly identical performance. For this dataset, we also compute the leading component using the NSPCA algorithm of (Zass & Shashua, 2007). Note that NSPCA does not allow precise control of the sparsity of its output; an appropriate sparsity penalty $\beta$ was determined via binary search for each target sparsity $k$. We plot the explained variance only for those values of $k$ for which a $k$-sparse component was successfully extracted. Finally, note that both the EM and NSPCA algorithms are randomly initialized. All depicted values are the best results over multiple random restarts.

Our theory allows us to obtain provable approximation guarantees: based on Theorem 2 and the output of Alg. 3, we compute a data-dependent upper bound on the maximum variance; the unknown optimum provably lies in the shaded area. For instance, for $k = 180$, the extracted component explains at least 58% of the variance explained by the true nonnegative, $k$-sparse PC. The quality of the bound depends on the accuracy parameters $d$ and $\epsilon$, and on the eigenvalue decay of the empirical covariance matrix of the data. There exist datasets on which our algorithm provably achieves 70% or even 90% of the optimal.

Figure 4. CBCL dataset (Sung, 1996). We plot the normalized variance explained by the approximate nonnegative, k-sparse PC versus the sparsity k, for Alg. 3 (d = 3, ε = 0.10), the EM algorithm, and NSPCA with several penalty values. Our theory yields a provable data-dependent approximation guarantee: the true unknown optimum provably lies in the shaded area.

Leukemia Dataset. The Leukemia dataset (Armstrong et al., 2001) contains 72 samples, each consisting of expression values for 12582 probe sets. The dataset was used in the evaluation of (Sigg & Buhmann, 2008). In Fig. 5, we plot the normalized variance explained by the computed nonnegative, $k$-sparse PC versus the sparsity parameter $k$. For low values of $k$, Alg. 3 outperforms the EM algorithm in terms of explained variance.
For larger values, the two algorithms exhibit similar performance. The approximation guarantees accompanying our algorithm allow us to upper bound the optimal performance. For $k$ as small as 50, which roughly amounts to 0.4% of the features, the extracted component captures at least 44.6% of the variance corresponding to the true nonnegative $k$-sparse PC. The obtained upper bound is a significant improvement over the trivial bound given by $\lambda_1$.

Figure 5. Leukemia dataset (Armstrong et al., 2001). We plot the normalized variance explained by the output of Alg. 3 (d = 3, ε = 0.1) versus the sparsity k, and compare with the EM algorithm of (Sigg & Buhmann, 2008). By our approximation guarantees, the maximum variance provably lies in the shaded area.

Low Resolution Spectrometer Dataset. The Low Resolution Spectrometer (LRS) dataset, available in (Bache & Lichman, 2013), originates from the Infra-Red Astronomy Satellite Project. It contains 531 high-quality spectra (samples) measured in 93 bands. Fig. 6 depicts the normalized variance explained by the computed nonnegative, $k$-sparse PC versus the sparsity parameter $k$. The empirical covariance matrix of this dataset exhibits a sharper spectral decay than the previous examples, yielding tighter approximation guarantees according to our theory. For instance, for $k = 20$, the extracted nonnegative component captures at least 86% of the maximum variance. For values closer to $k = 90$, where the computed PC is nonnegative but no longer sparse, this value climbs to nearly 93%.

Figure 6. LRS dataset (Bache & Lichman, 2013). We plot the normalized explained variance versus the sparsity k. Alg. 3 (d = 3, ε = 0.1) and the EM algorithm exhibit similar performance. The optimum value of the objective in (2) provably lies in the shaded area, which in this case is particularly tight.

8. Conclusions

We introduced a novel algorithm for nonnegative sparse PCA, expanding the spannogram theory to nonnegative quadratic optimization. We observe that the performance of our algorithm often matches and sometimes outperforms the previous state of the art (Sigg & Buhmann, 2008). Even though the theoretical running time of Alg. 3 scales better than EM, in practice we observed similar speed, both on the order of a few seconds. Our approach has the benefit of provable approximation, giving both theoretical a priori guarantees and data-dependent bounds that can be used to estimate the variance explained by nonnegative sparse PCs, as shown in our experiments.

9. Acknowledgements

The authors would like to acknowledge support from NSF grants CCF-1344364, CCF-1344179, DARPA XDATA, and research gifts by Google and Docomo.

References

Allen, Genevera I. and Maletić-Savatić, Mirjana. Sparse non-negative generalized PCA with applications to metabolomics. Bioinformatics, 2011.

Armstrong, Scott A., Staunton, Jane E., Silverman, Lewis B., Pieters, Rob, den Boer, Monique L., Minden, Mark D., Sallan, Stephen E., Lander, Eric S., Golub, Todd R., and Korsmeyer, Stanley J. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1):41–47, 2001.
Asteris, M., Papailiopoulos, D. S., and Karystinos, G. N. Sparse principal component of a rank-deficient matrix. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pp. 673–677, 2011.

Bache, K. and Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Badea, Liviu and Tilivea, Doina. Sparse factorizations of gene expression guided by binding data. In Pacific Symposium on Biocomputing, 2005.

d'Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. G. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007a.

d'Aspremont, Alexandre, Bach, Francis R., and El Ghaoui, Laurent. Full regularization path for sparse principal component analysis. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pp. 177–184, New York, NY, USA, 2007b. ACM.

d'Aspremont, Alexandre, Bach, Francis, and El Ghaoui, Laurent. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, June 2008.

Hoyer, Patrik O. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

Jolliffe, I. T. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22(1):29–35, 1995.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

Karystinos, G. N. and Liavas, A. P. Efficient computation of the binary vector that maximizes a rank-deficient quadratic form. IEEE Transactions on Information Theory, 56(7):3581–3593, 2010.

Kim, Hyunsoo and Park, Haesun. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.

Lee, Daniel D. and Seung, H. Sebastian. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

Mackey, Lester. Deflation methods for sparse PCA. In Advances in Neural Information Processing Systems 21, NIPS '08, pp. 1–8, Vancouver, Canada, December 2008.

Moghaddam, B., Weiss, Y., and Avidan, S. Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems, 18:915, 2006a.

Moghaddam, Baback, Weiss, Yair, and Avidan, Shai. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pp. 641–648, 2006b.

Murty, Katta G. and Kabadi, Santosh N. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

Papailiopoulos, D. S., Dimakis, A. G., and Korokythakis, S. Sparse PCA through low-rank approximations. In Proceedings of the 30th International Conference on Machine Learning, ICML '13, pp. 767–774. ACM, 2013.

Parrilo, Pablo A. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, Pasadena, California, 2000.

Sigg, Christian D. and Buhmann, Joachim M. Expectation-maximization for sparse and non-negative PCA. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp. 960–967, New York, NY, USA, 2008. ACM.

Sung, Kah-Kay. Learning and example selection for object and pattern recognition. PhD thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.

Zass, Ron and Shashua, Amnon. Nonnegative sparse PCA. In Advances in Neural Information Processing Systems 19, pp. 1561–1568, Cambridge, MA, 2007. MIT Press.
Zhang, Y., d'Aspremont, A., and El Ghaoui, L. Sparse PCA: Convex relaxations, algorithms and applications. In Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 915–940, 2012.

Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.