# sparse_pca_a_geometric_approach__2880d649.pdf

Journal of Machine Learning Research 23 (2022) 1-33 Submitted 1/22; Revised 12/22; Published 12/22

Sparse PCA: a Geometric Approach

Dimitris Bertsimas dbertsim@mit.edu Driss Lahlou Kitane driss@mit.edu Operations Research Center Massachusetts Institute of Technology 77, Massachusetts Ave. Cambridge, MA 02138, USA

Editor: Daniel Hsu

We consider the problem of maximizing the variance explained from a data matrix using orthogonal sparse principal components that have a support of ﬁxed cardinality. While most existing methods focus on building principal components (PCs) iteratively through deﬂation, we propose Geo SPCA, a novel algorithm to build all PCs at once while satisfying the orthogonality constraints which brings substantial beneﬁts over deﬂation. This novel approach is based on the left eigenvalues of the covariance matrix which helps circumvent the non-convexity of the problem by approximating the optimal solution using a binary linear optimization problem that can ﬁnd the optimal solution. The resulting approximation can be used to tackle diﬀerent versions of the sparse PCA problem including the case in which the principal components share the same support or have disjoint supports and the Structured Sparse PCA problem. We also propose optimality bounds and illustrate the beneﬁts of Geo SPCA in selected real world problems both in terms of explained variance, sparsity and tractability. Improvements vs. the greedy algorithm, which is often at par with state-of-the-art techniques, reaches up to 24% in terms of variance while solving real world problems with 10,000s of variables and support cardinality of 100s in minutes. We also apply Geo SPCA in a face recognition problem yielding more than 10% improvement vs. other PCA based technique such as structured sparse PCA.

Keywords: Linear Integer Optimization, PCA, Sparse PCA

1. Introduction

PCA (Pearson, 1901) is a popular data analysis technique. It is used in a variety of applications including ﬁnance, data imputation, image processing and genome analysis. PCA is of particular interest when the data matrix data X Rn p has a high dimension p. Yet, models resulting from PCA use all the features while sparsity of the PCs can be desired for various beneﬁts. Sparse versions of PCA that use a reduced number of variables to build principal components were proposed to improve interpretability, enhance model s predictive power or reduce operational costs (such as in ﬁnance) and investment costs (such as in spectroscopy). Solving the sparse PCA problem is particularly challenging due to the non-convexity of the problem.

c 2022 Dimitris Bertsimas and Driss Lahlou Kitane.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/22-0088.html.

Dimitris Bertsimas and Driss Lahlou Kitane

1.1 Background

The optimization community has been studying several versions of sparse PCA problem for decades now. Most of the methods proposed fall into two broad categories. One of the categories aims at approximating the whole covariance matrix using sparse principal components; the loss function is typically of the form ||X Z|| where X Rn p is the covariance matrix and Z is constructed using problem variables. In this category, sparsity is generated using thresholding and/or regularization either directly in the objective function or in the constraints. Thresholding has been used as early as in (Jeﬀers, 1967). Limits of thresholding have also been documented (Cadima and Jolliﬀe, 1995), in particular key variables could be zeroed and highly correlated variables chosen together which ultimately could lead to inaccurate interpretations. The ﬁrst algorithm to use ℓ1 penalization is SCOTLASS (Jolliﬀe et al., 2003). It was introduced then as a method for preserving the orthogonality constraints while sparsity was induced by ℓ1 constraints but, using this technique, the number of features used in the resulting sparse model is capped by n and the method is not tractable for p 100. Larger problems could be then tackled by the introduction of the LASSO method to generate PCs by adopting an ℓ1 penalized regression approach in the generation of the PCs (Zou et al., 2006) which, on one hand, allowed the number of variables used in the model to exceed n but, on the other hand, sacriﬁced the orthogonality of the PCs and provided no guarantee on the optimality of the solution. (Shen and Huang, 2008; Witten et al., 2009) introduced iterative thresholding methods to tackle the sparse PCA problem building PCs iteratively. Other breakthroughs were achieved and successfully implemented including the GPower method using ℓ1 penalty which preserves orthogonality (Journee et al., 2010), scales for p in 10,000s and n in 1,000s and outperforms LASSO-based regression approach and greedy algorithm in terms of quality of solution and computation speed. Subsequent works using penalization to induce sparsity adopting optimization over Stiefel manifolds (Huang and Wei, 2019; Tan et al., 2021; Chen et al., 2020) and the Procrustes reformulation (Benidis et al., 2016) showed that solution quality and computation time could be further improved considering the same size of problems while preserving orthogonality. The state of the art in this category of works can then handle large instances in competitive time, delivers orthogonality of the PCs and can furthermore be adapted to variations of the sparse PCA problem including the structured sparse PCA (Jenatton et al., 2009; Li et al., 2017). Yet, the community is still making sizeable eﬀorts to improve the variance captured. Indeed, most of these approaches come with no guarantee of the optimality of the solution and the control of the number of variables cannot be chosen with precision. Recent eﬀorts in (Erichson et al., 2020) introduce ℓ0 penalization while keeping a loss function of the form ||X Z|| which could theoretically achieve sparser principal components and achieve a tighter control over the number of nonzero variables (numerical tests in this paper focusing essentially on ℓ1 penalization). The number of variables used in the ﬁrst category of works is often too high to enable interpretability or to be practical in several intended uses of sparse PCA.

The second category of approaches aims at having a strict control on the number of variables used. For a given number k of variables, the objective sought is to maximize the variance captured using one or several principal components that have a support of cardinal k. Rather than generating sparsity through penalization, this family of approaches

Sparse PCA, a Geometric Approach

aims at ﬁnding optimal or near optimal solutions for a deﬁned number of features to be used. Initial work involved a greedy strategy as well as a branch-and-bound approach (Moghaddam et al., 2006). Optimal solutions were sought through semideﬁnite optimization modeling (SDO) (d Aspremont et al., 2004). Further developments of this technique enabled tackling relatively big problems with optimality certiﬁcates (d Aspremont et al., 2008). GPower technique was also adapted to propose a truncation approach in (Yuan and Zhang, 2013). Later, a novel technique that performs particularly well when the the decay of eigenvalue of the data matrix X was proposed (Papailiopoulos et al., 2013). Although each sparse PC constructed is optimal, the set of PCs are constructed iteratively using deﬂation techniques of the data matrix, thus there is no guarantee of the overall optimality and orthogonality is sacriﬁced. Another track of work develops branch-and-bound approaches to construct optimal solutions while controlling k. This enabled solving problems for p in the 1,000s and k 10 but the PCs are then constructed by iteratively deﬂating X (Berk and Bertsimas, 2019). More recent work combines branch-and-bound and SDO formulation to solve problems with p in 10,000s and k 10 to near optimality in hours (Bertsimas et al., 2022; Li and Xie, 2020). Since the PCs are obtained by deﬂating X, orthogonality is sacriﬁced. One variation of the sparse PCA tackled by this category of approaches includes additional constraints aiming at building sparse PCs with disjoint supports (Asteris et al., 2015). More recently, (Del Pia, 2022) proves that ﬁnding orthogonal principal components sharing the same support and maximizing the variance captured is a polynomial problem when the rank of the covariance matrix is ﬁxed. The result is further extended to a special case of the disjoint supports sparse PCA problem when the cardinality of the disjoint supports is identical. This paper is theoretical and does not provide any numerical results so there is no indication regarding how the suggested approach would perform from a practical point of view. (Dey et al., 2022) also recently proposed the ﬁrst algorithm that builds simultaneously orthogonal sparse principal components sharing the same support while upper-bounding the number of nonzero rows. The methods tackles problems for p as high as 2000 in less than two hours and upper and lower bounds are provided for the optimal solution.

The two broad categories of approaches presented do not span all of the approaches that the machine learning community designed. Indeed, remarkable approaches aimed for instance at approximating the subspace generated by the principal components of the classic PCA algorithm were developed. (Johnstone and Lu, 2009) focuses on the principal components that correspond to the largest eigenvalues of the covariance matrix and generates sparsity through thresholding. (Ma, 2013) further improves the results achieved by this approach by proposing a novel iterative thresholding technique achieving tighter loss than comparable techniques.

1.2 Motivations and Contributions

As mentioned in the background section, one of the main approaches to construct sparse principal components that the machine learning community adopted is to add a constraint upper-bounding the number of nonzero variables. This approach is particularly relevant when the desired number of nonzero variables is low compared to the dimension p of the data points. When upper-bounding the number of nonzero variables by an integer k when

Dimitris Bertsimas and Driss Lahlou Kitane

k is signiﬁcantly lower, the maximization of the variance captured is more relevant than the minimization of the error ||X Z|| where X Rn p is the data matrix and Z is the constructed sparse matrix as the error ||X Z|| is too large. Considering this approach, the machine learning community focused on constructing the PCs iteratively which does not guarantee the optimality of the overall solution (and actually yields sub-optimal solutions). Only very recent works (Del Pia, 2022; Dey et al., 2022) tackle the problem of building multiple orthogonal principal components at once to guarantee optimality when PCs share a common support (Del Pia, 2022; Dey et al., 2022) or when PCs have disjoint supports (Del Pia, 2022). Our main motivation is to bring a new method that upper-bounds the number of nonzero variables, builds multiple orthogonal principal components at once and scales further than existing methods when PCs share the same support or have block-disjoint supports. Furthermore, we also aim at providing guarantees on the quality of the solution found.

In the present paper, we propose a novel approach to sparse PCA that considers the left eigenvectors of X (it is worth noting that all works on sparse PCA focus on the right eigenvectors). We derive then a geometrical interpretation of the resulting problem. This interpretation leads to a binary linear formulation that aims at approximating the original problem. This geometrical approach is versatile and can be adapted to several versions of sparse PCA problem. We introduce two formulations; one for a version in which all PCs share the same support and another version in which groups of PCs use disjoint supports (a generalization of the version imposing to PCs to have disjoint supports). Building on the versatility of the method, we also propose a formulation for the structured sparse PCA problem. We provide optimality gap bounds and test the proposed method on real world data sets. The geometrical approach we propose solves problem for p in 10,000s and k in 100s while preserving the orthogonality and controlling k in minutes.

As we mentioned earlier, upper-bounding the number of nonzero variables is a desirable feature for practitioners whether the true principal components share the same support or not. While the method we propose could be used in the general case, practitioners might me even more interested in this technique when the data studied follows particular structures (eg. hyperspectral imagery and spectroscopy when the focus is on particular variables (Fu et al., 2017; Bertsimas et al., 2020), computer vision (Jenatton et al., 2009) or matching (Benidis et al., 2016) among others).

We summarize our contributions in this paper below:

We introduce a new approach to the sparse PCA problem based on left eigenvectors that leads to a geometrical interpretation of sparse PCA. We approximate the sparse PCA problem using binary linear optimization (BLO) and the introduction of cuts that improve the solution.

We prove that the optimal solution can be found using the approximation proposed in a ﬁnite number of steps and provide a theoretical optimality gap to the solution generated by the method we propose.

We propose formulations and algorithms (i) for sparse PCA problem in which all PCs share the same support, (ii) for a generalization of the version of PCA requiring that the PCs have disjoint support, and (iii) for the structured sparse PCA problem.

Sparse PCA, a Geometric Approach

U[s]U[s]T X(s)

X(s) U[s]U[s]T X(s)

Figure 1: X(s) U[s]U[s]T X(s) is the projection of X(s) on the subspace of lower dimension deﬁned by U[s]

We test the proposed method on publicly available real world data sets and compare its performance vs. existing methods and show that the geometrical approach produces solutions of higher quality than alternative state-of-the-art methods.

The structure of the paper is as follows. In Section 2, we ﬁrst study the sparse PCA problem when all principal components share the same support. We introduce the geometrical approach and oﬀer interpretations and provide formulations, optimality gap bounds and related algorithm. In Section 3, we extend the geometric approach to the case in which the principal components have disjoint supports and to the structured sparse PCA problem. In Section 4, we compare Geo SPCA to other state-of-the-art sparse PCA techniques and we ﬁnally conclude in Section 5.

1.3 Notations and Deﬁnitions

In the remainder of this paper, we deﬁne X Rn p a centered data matrix with n data points and p features. We refer to the number of features used to build a sparse PCA model by k. All matrices and vectors in bold characters.

We note MT , the transpose of matrix M. Consider the SVD decomposition of a real matrix Rn p, UΣVT . We call the columns of U (resp. V) the left (resp. right) eigenvectors of X. Consider m a non-negative integer, we note [m] = {1, 2, . . . , m}. We denote Xi the ith column of a matrix X. The scalar si is the ith component of vector s. The matrix S = diag(s) is the diagonal matrix with a diagonal equal to the vector s. Consider σ [m], we note sσ the vector of {0, 1}m such that sσ i = 1 if i σ and sσ i = 0, otherwise. If E is a ﬁnite set, we note |E| its cardinality. Consider a matrix M Rn p and a vector s {0, 1}p, we note M(s) the matrix we obtain by suppressing all columns Mi of M such that si = 0. ||.||F is the Frobenius norm. ||.||0 designates the number of nonzero coeﬃcients of a vector or the number of nonzero rows of a matrix. For b a non-negative integer, Ib is the identity matrix of size b b. Given s {0, 1}p, we note U[s] any element of argmax U Ra n tr(UT X(s)X(s)T U) s.t. UT U = Ia for a given non-negative integer a and η(s) = ||X(s) U[s]U[s]T X(s)||2 F , µ(s) = ||X(s)||2 F = Pp i=1 si||Xi||2 and π(s) = ||U[s]U[s]T X(s)||2 F = tr(U[s]T X(s)X(s)T U[s]). It is easy to verify that µ(s) = π(s) + η(s) (See Figure 1). We deﬁne ﬁnally the vector e Rp as the vector e = {1, 1, . . . , 1}.

Dimitris Bertsimas and Driss Lahlou Kitane

2. Sparse PCA problem with Principal components sharing the same support

In this section, we address the sparse PCA problem in which all the principal components share the same support. We ﬁrst transform the problem to a left eigenvectors problem and derive a geometric interpretation of the problem. We then introduce a binary linear optimization approximation of the sparse PCA problem. We ﬁnally propose an algorithm that solves the approximation and that can ﬁnd the optimal solution to the original problem.

2.1 An exact formulation using the left eigenvectors perspective

We consider the problem of maximizing the variance explained by a given number a of sparse principal components that share the same support of cardinal less than or equal to k. We write the problem as:

max W Rp a tr(WT XT XW) (1)

s.t. ||W||0 k,

where ||W||0 is the number of nonzero rows of W. The a principal components formed by the columns of W share then the same support of cardinality at most k. We transform the problem to let the left eigenvectors appear:

Proposition 1 Problem (1) is equivalent to:

max s {0,1}p,U Rn a

j=1 (XT i Uj)2 (2)

s.t. e T s k,

Proof We introduce a variable s {0, 1}p and consider that si = 0 if WT i = 0 and si = 1, otherwise. When Pp i=1 si k, and W Rp a, we have ||SW||0 k where S = diag(s) and ||SW||0 is the number of nonzero rows of SW. If the principal components are represented by the columns of SW, then WT ST SW = WT SW = Ia ensures that the principal components are normalized and orthogonal. We note that if, in addition, we have WT W = Ia, we have i [p], Pp j=1 W 2 ij = 1 but since Pp j=1 sj W 2 ij = 1, it means that Wij = 0 if sj = 0. Hence, the combination of WT SW = Ia and WT W = Ia ensures that ||W||0 k. Problem (1) can then be rewritten:

Sparse PCA, a Geometric Approach

max s {0,1}p,W Rp a tr(WT SXT XSW) (3)

s.t. e T s k,

WT SW = Ia,

The constraints WT SW = Ia can be dropped (the proof is detailed in Proposition 2). Consider X(s) the matrix obtained when the columns Xi are removed from X when si = 0. Problem (3) can then be written considering inner and outer problems as follows:

max s {0,1}p,W Rk a tr(WT X(s)T X(s)W)

s.t. e T s k

The inner problem is a standard PCA problem for the data matrix X(s). We can now write the equivalent problem considering the left eigenvectors of X(s):

max s {0,1}p,U Rn a tr(UT X(s)X(s)T U)

s.t. e T s k

By expanding the trace, we have then:

max s {0,1}p,U Rn a

j=1 (X(s)T i Uj)2

s.t. e T s k,

which is equivalent to:

max s {0,1}p,U Rn a

j=1 (XT i Uj)2

s.t. e T s k

j=1 (X(s)T i Uj)2 =

j=1 (XT i Uj)2.

Pa j=1(XT i Uj)2 is the sum of the norms of the projections of Xi in the subspace generated by the columns of U. In other words, the problem is to ﬁnd k columns of X that maximize the sum of the norms of their projections in a subspace of Rn of dimension a.

Dimitris Bertsimas and Driss Lahlou Kitane

Proposition 2 Problem (3) is equivalent to:

max s {0,1}p,W Rp a tr(WT SXT XSW)

s.t. e T s k,

Proof Consider an SVD decomposition of X(s) = UΣVT with U Rn n, Σ Rn k

and V Rk k such that Σ is a diagonal matrix and UT U = VT V = Ik. Consider now V Rp k such that Vi = 0 if si = 0 and the columns of V appear in V in the indices i for which si = 1 in the same order as in V. We can easily verify that XS = UΣV T which shows that XS and X(s) have the same nonzero eigenvalues and the same left eigenvectors.

2.2 Geometric approximation formulation

Although the problem has been transformed, the two main issues of the problem are still present; the objective function is not concave and we still need to deal with the non-convex orthogonality constraints. In this subsection, our objective is to provide a binary linear problem that approximates Problem (2). We ﬁrst propose a linear approximation of the objective function and outline the geometric intuition supporting this approximation. We then replace the non-convex constraints by a set of linear constraints yielding the same optimal solutions. The approximation introduces a new parameter (noted η). We also show that optimal solutions of Problem (2) are also optimal solutions for the approximation if η is chosen appropriately.

We start by addressing the non-concavity of the objective function. The underlying hypothesis when using PCA to tackle a data problem is that the data matrix X can be written X = X + ϵ where X is a matrix of rank a < min(n, p) and ϵ is a low-norm matrix representing noise, or second order phenomena or other perturbations that are to be ignored by PCA modelling. X is found by projecting X according to a set of a orthonormal vectors Vi, the columns of V, X = VVT X. Since ||X X ||F = ||ϵ||F 0, we consider that ||X||F ||X ||F .

When considering Problem (2), the objective function is the sum of the norms of the projections of k columns of X on the subspace generated by the columns of U. Since the inner problem of problem (2) is the standard PCA problem, and using the approximation we just mentioned, we can approximate the objective function as the sum of the norms of k columns of X instead of the sum of the norms of their projections whenever ||X UUT X||F is small enough. We introduce then the parameter η that bounds that norm of the diﬀerence of X and its projection UUT X. This can be expressed by a constraint ||X UUT X||2 F η

Sparse PCA, a Geometric Approach

for a given η 0:

max s {0,1}p,U Rn a

i=1 si||Xi||2 (5)

s.t. e T s k,

||X(s) UUT X(s)||2 F η

that could be rewritten as follows:

max s {0,1}p

i=1 si||Xi||2 (6)

s.t. e T s k,

Indeed, since the objective function of (5) does not have U as variable, the s is feasible if there exists U such that ||X(s) UUT X(s)||2 F η and UT U = Ia. If such U exists, the objective function is equal to Pp i=1 si||Xi||2. If U[s] veriﬁes ||X(s) U[s]U[s]T X(s)||2 F η, then U[s] is a feasible solution of the inner problem of (5). Conversely, if s is feasible then by deﬁnition ||X(s) U[s]U[s]T X(s)||2 F η.

We replace ﬁnally the constraints η(s) η by eliminating the vectors s for which we have η(s) > η. Consider σ [p], and sσ {0, 1}p such that sσ i = 1 if i σ and sσ i = 0 otherwise. If η(sσ) > η, sσ is not feasible and could be cut using the following inequality: X

i σ si |σ| 1.

Hence, we derive the following approximate formulation for the sparse PCA problem with PCs sharing a common support:

max s {0,1}p

i=1 si||Xi||2 (7)

s.t. e T s k,

σ [p], η(sσ) > η

i σ si |σ| 1.

We note f(η) the value of the objective function of problem (2) evaluated for an arbitrarily chosen optimal solution of problem (7) for a chosen η > 0 (if problem (7) has several solutions, we choose one that minimizes ||X(s) UT UX(s)||F ) and let us prove that an optimal solution to problem (2) can be found using formulation (7):

Dimitris Bertsimas and Driss Lahlou Kitane

Theorem 3 Consider so an optimal solution of problem (2). There exists δ > 0 such that for any η [η(so), η(so) + δ], optimal solutions of problem (7) are optimal solutions of problem (2).

Proof By contradiction, consider η = η(so) and suppose that there exists s {0, 1}p such that s is an optimal solution of problem (7) and is not an optimal solution for problem (2) which implies Pp i=1 so i Pa j=1(Uj[so]T Xi)2 > Pp i=1 s i Pa j=1(Uj[s ]T Xi)2 according to problem (2). Since s is an optimal solution of problem (7) and so is feasible in (7) because η = η(so), we have Pp i=1 s i||Xi||2 Pp i=1 so i ||Xi||2, which yields then:

i=1 s i||Xi||2

j=1 (Uj[s ]T Xi)2 >

i=1 so i ||Xi||2

j=1 (Uj[so]T Xi)2,

i.e. η(s ) > η(so),

which means that s is not feasible as it violates the constraint η(s) η(so).

Since f(η) takes discrete and ﬁnite values as the number of s {0, 1}p is ﬁnite, we can choose δ > 0 such that the set of feasible solutions of problem (7) remains the same as if η = η(so) to ensure that so remains an optimal solution of problem (7). For example, we can consider s argmins||X(s) U[s]U[s]T X(s)||2 F > η(so) and choose δ = ( η η(so))/2 with η = ||X( s) U[ s]U[ s]T X( s)||2 F .

Beyond the theoretical value of Theorem 4 as it shows that with an adequate η, an optimal solution could be found. More practically, this theorem is also used in the design of a practical algorithm in Section 2.4 (Algorithm 2) can approach and ﬁnd the optimal solution in a ﬁnite number of steps.

Although the objective function and the constraints of problem (7) are linear, there are still potentially an exponential number of constraints (the number of subsets of [p] is 2p) which can hinder the tractability of the problem and we still need to assess the quality of the solution found by solving problem (7) relative to the optimal solution of (1). We start by providing a theoretical bound that partially addresses the ﬁrst question and further address practical and empirical aspects in Section 3. We also address the large number of constraints by proposing a simple separation procedure to generate useful constraints without including all the constraints in the problem.

Finally we note that the approximation ||ϵ||F 0 is considered only to provide the reader with an intuition behind the approximation. ||ϵ||F 0 is not a required assumption in any of the results of this paper.

2.3 Worst case upper bound

Before introducing an algorithm to solve Problem (7), we provide tight worst case upper bounds for the diﬀerence between the optimal value of Problem (2) and the value of the objective function of Problem (2) evaluated at an optimal solution of the approximation Problem (7). After showing the role η plays in these bounds, we derive a straightforward

Sparse PCA, a Geometric Approach

way to derive an upper bound without knowing η. The relationship between the worst case upper bound and η furthermore enables us to design an algorithm that approaches an optimal solution of the original Problem (2). Furthermore, this algorithm ﬁnds the optimal solution to the original Problem (2) in a ﬁnite number of iterations.

Proposition 4 Consider (so, U[so]) an optimal solution of problem (2). Let s A be an optimal solution of problem (7) for η η(so), then π(so) π(s A) η; meaning that the diﬀerence between the optimal value of problem (2),Pp i=1 so i Pa j=1(XT i U[so]j)2, and the value of the objective function of problem (2) evaluated at s A, Pp i=1 s A i Pa j=1(XT i U[s A]j)2 is bounded by η. Furthermore, this bound is tight.

Proof Note Vo the optimal value of problem (2), s A a solution to problem (7) and VA = max U Rn a Pp i=1 s A i Pa j=1(XT i Uj)2 s.t. UT U = Ia, the value of the objective function of problem (2) evaluated at s A. We have:

i=1 so i ||Xi||2 Vo

i=1 s A i ||Xi||2 η,

i=1 so i ||Xi||2

i=1 s A i ||Xi||2 + η Vo VA.

As η η(so), so is feasible in Problem (7) so

i=1 so i ||Xi||2

i=1 s A i ||Xi||2,

then η Vo VA.

We show now that the bound is tight. We build an example for X R2 4. Consider 0 < ϵ < 0.5 and X = ϵ ϵ 1 1 1 1 0 0

and consider k = 2 and a = 1. The columns of X are illustrated in Figure 2. Let us ﬁrst note that the value of the objective function of the optimal solution of problem (2) is 2 and is given by taking s = (1 1 0 0). If η 2ϵ, then s is feasible in Problem (7) and the optimal solution is given by s and U[s ] = U1 = (0 1). If η < 2ϵ, then it is impossible to have s1 = s2 = 1; the optimal solution is then s = (0 0 1 1) and the value of the objective function of problem (2) when s = s is equal to 2 2ϵ as U[s ] = U2 = (1 0). Since η can be taken as close as desired to 2ϵ, then the bound is tight.

This bound expresses the fact that the approximation is as good as the hypothesis that the matrix X(so) composed of the columns of the optimal support can be approximated with

Dimitris Bertsimas and Driss Lahlou Kitane

y X1( ϵ, 1) X2(ϵ, 1)

U1(0, 1) η 2ϵ

U2(1, 0) X3

Figure 2: Example illustrating the tightness of Proposition 4 s bound.

a matrix of rank a. Even if η0 is not known, we show in the implementation of Algorithm 2 that this bound is practical. We prove ﬁrst that a bound can be found a priori by solving the classic PCA problem. We need ﬁrst the following result:

Proposition 5 Consider that X = VVT X + ϵ with V Rn a and VT V = Ia. Let (so, U[so]) be an optimal solution of problem (2). We have:

η(so) ||Soϵ||2.

Furthermore, so is feasible when η = ||Soϵ||2 in problem (7).

Proof We have:

||Soϵ||F = ||So(X VVT X)||F = ||So X So VVT X||F = ||So X VVT So X||F as So VVT = VVT So

||X(so) U[so]U[so]T X(so)||F by deﬁnition.

When η = ||Soϵ||2, since

||X(so) U[so]U[so]T X(so)||2 F ||Soϵ||,

then ||X(so) U[so]U[so]T X(so)||2 F η

and hence so is feasible in Problem (7).

Corollary 6 Consider V argmin V Rn a||X VVT X|| st VT V = Ia (i.e., V is a solution to the classical PCA problem). We denote ϵ = X V V T X and σ [p] the set of the indices of k columns of ϵ with the highest norm. We have:

η(so) ||Soϵ||2 F X

i σ ||ϵi||2,

Sparse PCA, a Geometric Approach

ϵ can be easily computed using classic PCA. Although So is not known, the norm of the columns of ϵ can be computed and so is P i σ ||ϵi||. Using the same notations as in Proposition 4 and Corollary 6, we have:

Corollary 7 The diﬀerence between the optimal value of problem (2), ||U[so]U[so]T X(so)||2 F , and the value of the objective function of problem (2) evaluated at s A, ||U[s A]U[s A]T X(s A)||2 F is bounded by P i σ ||ϵi||2.

2.4 Algorithm

We propose two algorithms. Algorithm 1 solves Problem (7) for a given parameter η while Algorithm 2 starts without any knowledge on η and then approaches the optimal solution of Problem (2) by exploring several values of η while also updating the worst case upper bound.

We propose an implementation based on cut generation (Algorithm 1). We initially solve the problem without any of the constraints P i σ si |σ| 1, and then iteratively add these constraints. If at iteration t, the optimal solution found st violates η(st) η, then we add the constraint P i σt si |σt| 1 such that σt is the set of indices i such that st i = 1. Note that the computation of U[st] is inexpensive as it involves only the generation of an SVD decomposition of the matrix X(st) which is of size n k which can in done in O(k3 + nk2).

Algorithm 1: Constraints generation algorithm for PCs with a common support

Input: Data matrix X, number of components a, number of variable sought k, parameter η and a set cuts (optional); Output: Optimal support for the approximation formulation; Initiate (φ), a BLO problem with the following formulation:

max s {0,1}p

i=1 si||Xi||2

s.t. e T s k,

Add the input set of cuts if there is any. Compute s0 an optimal solution of (φ) using a BLO solver; Compute U[s0] by solving the PCA problem for matrix X(s0); while ||η(st) > η do

Update (φ) by adding the constraint P i σt si |σt| 1; Compute st+1 an optimal solution of (φ) using a BLO solver; Compute U[st+1] by solving the PCA problem for matrix X(st+1); end return the support found by solving (φ)

While Algorithm 1 terminates as {0, 1}p is ﬁnite and the while loop suppresses at least one element of {0, 1}p, the number of iterations is still potentially exponential. However, in

Dimitris Bertsimas and Driss Lahlou Kitane

practice the algorithm is relatively fast and solves large real world problems in minutes as illustrated in Section 4.

We also use the proof of Theorem 4 to reﬁne the search for η(so). We have:

Lemma 8 We assume that Problem (7) has a unique optimal solution. The function f : η f(η) returning the value of the objective function of problem (2) evaluated at the optimal value of problem (7) depending on η is a piece wise constant function.

Proof We note denote the function F as F(η) = {s {0, 1}p|η(s) η}. For a value η R+, F returns the set of supports s such that η(s) η. The cardinality of F(η) is ﬁnite as {0, 1}p is ﬁnite. The cardinality of F(η) is increasing with respect to η and f is constant when F is which proves the lemma.

Using this lemma, we can derive a method to eﬃciently select η. Indeed, consider that we solve problem (7) for a value η and note s, an optimal solution. It is easy to see that f(η) is constant for η [ η( s), η]. We can then start with a large value of η0 and iteratively update ηt by computing η(st) where st is a solution obtained and then input a new value ηt+1 = η(st) δ for δ small enough as f(η) is constant on [ η(st), ηt].

This also provides means to obtain tighter optimality gaps. Indeed, using Proposition 4 and starting with η0, which is large enough, we can store and update η , the value of η that achieves the largest value f(η) and since we have started with values of η larger than the η(so) corresponding to the optimal solution of problem (2), then according to Proposition 4, the diﬀerence between the optimal solution of problem (2) and f(η ) is capped by η .

We report then the following algorithm to solve or approximate problem (2) and obtain η .

Algorithm 2: Approximate solution for problem (2)

Input: Data matrix X, number of components a, number of variable sought k and an integer λ ; Output: Support s approximating the solution and η ; Initiate with η0 and η large enough; c:=0; while c λ do

Run Algorithm 1 using ηt and cuts generated so far; if f(η ) < f(ηt) then

η := ηt; ηt+1 := η(st) δ; c:= c + 1; end return s , η , f(η )

Proposition 9 Algorithm 2 converges and for δ > 0 small enough, Algorithm 2 ﬁnds an optimal solution to (2) in a ﬁnite number of steps for λ large enough.

Proof We ﬁrst note that Algorithm 2 converges as f(η ) increases from an iteration to another by construction and is upper-bounded by ||X||2 F .

Sparse PCA, a Geometric Approach

We consider now so an optimal solution of (2), and let = {|η(s) η(s )| > 0, s.t.e T s = e T s = k; s, s {0, 1}p}. Suppose that is not empty. Consider any δ > 0 such that δ < min( ). Considering Algorithm 2, we know that η0 > η(so). If f(η0) = f(η(so)) then Algorithm 2 ﬁnds an optimal solution s0 at the ﬁrst iteration as f(η(so)) is an optimal solution to problem (2) according to Theorem 4. Consider now t such that ηt > η(so) such that f(ηt) < f(η(so)). We show now that f(ηt+1) = f(η(so)) or ηt+1 > ηo. If ηt+1 < η(so), then so is feasible in (7) and Pp i=1 st+1 i ||Xi||2 Pp i=1 so i ||Xi||2. We also have by construction of δ, η(st+1) η(so). Since f(η(s)) = Pp i=1 si||Xi|| + η(s), we have f(ηt+1) f(η(so)) and since f(η(so)) is the optimal value of (2), then f(ηt+1) = f(η(so)) meaning that Algorithm 2 ﬁnds an optimal solution st+1.

If ηt+1 = η(so), then f(ηt+1) = f(η(so)) according to Theorem 4. We conclude that if f(ηt+1) < f(η(so)) then ηt+1 < η(so).

Hence, since ηt+1 ηt+2 > δ > 0 by construction, Algorithm 2 ﬁnds an optimal solution in less than (η0 η(so))/δ steps.

If = , then an optimal solution of (7) is also an optimal solution to (2) as all η(s) are equal and the Algorithm 2 ﬁnds the optimal solution at the ﬁrst iteration.

We note that the number of values f(η) takes is lower than the number of feasible solution, which contributes to simplifying the problem. The condition c λ in the while loop is a stopping criteria. It stops the search of a solution if no improvement is achieved after λ attempts after η is updated. Of course, other stopping criteria could be considered in this algorithm.

We ﬁnally provide an additional upper-bound for the optimal solution of (2) that leverages on the history of search of Algorithm 2.

Proposition 10 Consider that Algorithm 2 generated θ cuts and note sθ an optimal solution of (7) using the θ cuts generated. If f(η ) Pp i=1 sθ i ||Xi||2, then s is an optimal solution for (2). Otherwise, the optimal solution of (2) is upper-bounded by Pp i=1 sθ i ||Xi||2.

Proof We note sθ a solution of (7) after generating θ cuts using Algorithm 2, φ(θ) the value taken by the objective function of (2) at sθ and ψ(θ) = Pp i=1 sθ i ||Xi||2, the value objective function of (7) at the same point sθ. We ﬁrst note that ψ is a decreasing function. Indeed, the optimal value of (7) can only decrease when we add constraints. We also note that for any θ, φ(θ) ψ(θ).

If f(η ) Pp i=1 sθ i ||Xi||2 = ψ(θ), then for any θ θ, f(η ) >= ψ(θ ) as ψ is decreasing; and since φ(θ ) ψ(θ ), then f(η ) ψ(θ ) so f(η ) is optimal.

Otherwise, by construction, f(η ) φ(θ ) for any θ θ. We also have for any θ θ, ψ(θ) ψ(θ ) ans ψ is decreasing and we also have φ(θ ) ψ(θ ) so ψ(θ) φ(θ ) so the optimal solution of (2) is indeed upper-bounded by Pp i=1 sθ i ||Xi||2 = ψ(θ).

Dimitris Bertsimas and Driss Lahlou Kitane

2.5 Complexity assessment

There are two aspects that need to be considered to assess the theoretical eﬃciency of the proposed algorithms; the number of cuts generated) and the computational cost of each iteration.

Let us start by the number of cuts. As mentioned earlier, the number of cuts is potentially exponential. We provide here a theoretical example in which the number of cuts generated is exponential whatever the dimension p chosen. We choose k = 2 and a = 1 and consider α = 5/24 and n = p. Consider the matrix X Rn p such that x1,1 = 1, x1,2 = 1, x1,j = 1 α for j 3, xi,i = 2α and xi,i+1 = 2α for i 3, x2,p = 2α and xi,j = 0 otherwise. It is easy to verify that the norm of any column Xj is strictly greater than 1 for j 1 and the norm of the two ﬁrst columns is equal to 1. Is is also easy to verify that pair of columns (i, j) = (1, 2), (2, 1), UT X(ij)T X(ij)U < 2 where X(ij) is a sub-matrix of X formed by its i and j columns and U Rp with ||U|| = 1. It is then easy to verify that the optimal solution so of (2) is achieved with s1 = s2 = 1 and si = 0 for i 3. We set now η = 0.00001, then the only feasible solution for (7) is so. Since ||Xi||2 + ||Xj||2 > 2 for any (i, j) = (1, 2), (2, 1), and the cuts generated cut one binary point at a time, then Algorithm 1 will have to cut all the binary points except the optimal solution before it reaches the optimal solution.

We examine now the cost of each iteration. Each iteration involves two steps (i) solving a BLO problem and (ii) a running a separation algorithm. As mentioned earlier, the separation algorithm consists of a classic PCA problem of dimension n k which can be solved by using an SVD decomposition in O(k3 + nk2). Solving the BLO problem in Algorithm 1 is not trivial and the constraints matrix deﬁnes a polyhedron that has non-integer vertices in the general case. Yet, we can show that the BLO problem can be replaced by a much simpler algorithm. Indeed, at the ﬁrst iteration of Algorithm 1 (no cut generated yet), the solution is obtained by choosing the k columns of X that have the largest norms. Suppose now that we are at iteration t. If the separation algorithm ﬁnds a violated cut at iteration t 1, the new cut generated cuts exactly one binary point which is the optimal solution found of iteration t 1. This means that a solution of the BLO at iteration t is deﬁned by a set of vectors with the largest sum of norms that are diﬀerent from the solutions of iterations 1, 2, 3, ..., t 1. Since the BLO chooses the sets with largest sums of norms, a solution for the BLO of iteration t has a sum of norms lower or equal to the sum of the binary solutions that were cut. Finding such solution could then be performed through tree search with a cost of O(k) as this process is then reduced ﬁnding the k columns with the largest norms, then the set with the second largest sum of norms,..., then the set with the tth largest sum of norms. The actual computational cost of each iteration is then O(k3 + nk2).

Since the cost of each iteration is modest, theoretically the potential issue would be with the number of iteration that could be theoretically exponential. Yet, in practice, high quality solutions are found in a tractable fashion for fairly large instances as we show in the following section using real-life data.

In addition, it is worth noting that our method does not require computing the covariance matrix XT X. Since we are interested in particular in instances in which p is large

Sparse PCA, a Geometric Approach

(and in general high dimension setting we have n p), this results in dramatic reduction in memory requirements in addition to reductions in computation time.

Finally, although solving the BLO is computationally costly in theory, we have found that it is actually fast in practice when it comes to the BLO proposed in Algorithm 1. In Section 4, we illustrate how Algorithm 2 performs vs. other methods. We keep using the BLO because (i) solving the BLO using commercial solvers is so fast that Algorithm 2 already scales more than any other known method that tackles the problem considered in this section, (ii) the BLO could hold the opportunity to add tighter cuts (for potential future research) and (iii) too often, solving is deemed computationally costly while in practice it can be competitive, so we decided to illustrate the fact that considering BLO could be a valid approach to tackle high dimension problems.

3. Extension to the Disjoint Support Block Sparse PCA and the Structured Sparse PCA problems

We consider now the problem of maximizing the variance generated by b sets of PCs such that the PCs of a same set share the same support while the supports of PCs of diﬀerent sets are disjoint. We call this problem the Disjoint Support Block Sparse PCA (DSB SPCA) problem.

3.1 DSB SPCA Formulation

We note (ki)i [b] the desired cardinality of supports for each set and (ai)i [k] the number of PCs in each set. Following the same method to formulate problem (3), the problem can be written as follows:

max si {0,1}p,(Wi Rp ai);i [b]

i=1 tr(WT i Si XT XSi Wi) (8)

j=1 sij ki, i [b],

WT i Si Wi = Iai, i [b],

WT i Wi = Iai, i [b],

i=1 sij 1, j [p].

We ﬁrst note that when i [b], ai = 1, then formulation (5) matches the sparse PCA problem with disjoint support as formulated in (Bertsimas et al., 2022). Following the same

Dimitris Bertsimas and Driss Lahlou Kitane

steps to construct problem (7), we obtain the following approximation for (5):

max si {0,1}p;i [b]

j=1 sij||Xi||2 (9)

j=1 sij ki, i [b],

σ [p], i [b], ||X(sσ i ) U[sσ i ]U[sσ i ]T X(sσ i )||2 F > η

j σ sij |σ| 1,

i=1 sij 1, j [p].

3.2 Worst Case Scenario Upper Bound

We generalize Proposition 4 as follows:

Proposition 11 Let (so l , U[so l ])l [b] be an optimal solution of Problem (8). We note

l=1 ||X(so l ) U[so l ]U[so l ]T X(so l )||2 F

Consider (s A l )l [b] an optimal solution of Problem (9) for η = Pb l=1 ηl η(so), then the diﬀerence between the optimal value of (8),

j=1 (XT i U[so l ]j)2,

and the value of the objective function of (8) evaluated at (s A l )l [b],

j=1 (XT i U[s A l ]j)2

is bounded by η. Furthermore, this bound is tight.

Proof Let Vo be the optimal value of Problem (8), (s A l )l [b] be a solution to Problem (9) and b X

j=1 (XT i U[s A l ]j)2,

be the value of the objective function of Problem (8) evaluated at (s A l )l [b]. We have

i=1 so li||Xi||2 Vo

Sparse PCA, a Geometric Approach

i=1 s A li||Xi||2 η.

i=1 so li||Xi||2

i=1 s A li||Xi||2 + η Vo VA.

(so l )l [b] is feasible in Problem (9) so Pb l=1 Pp i=1 so li||Xi||2 Pb l=1 Pp i=1 s A li||Xi||2, then η Vo VA.

Proposition 4 and corollaries 6 and 7 are still applicable in the case of disjoint block supports and the proofs are almost identical.

3.3 Algorithm

Algorithm 1 can be easily adapted for the case of groups of PCs having disjoint supports

Algorithm 3: Constraints generation algorithm for group of PCs with disjoint supports

Input: Data matrix X, number of components (al)l [b], number of variables sought (kl) and parameter (ηl)l [b] per group of PCs.; Result: Optimal supports for the disjoint supports approximation formulation Initiate the following BLO problem (φ):

max si {0,1}p;i [b]

j=1 sij||Xi||2

j=1 sij ki, i [b],

i=1 sij 1, j [p].

Compute (so) an optimal solution of (φ) using a BLO solver; Compute U[so] by solving the PCA problem for matrix X(so); while l [b] ||X(st l) U[st l]U[st l]T X(st l)||2 F > ηl do Update (φ) by adding the constraint P i σt sli |σt| 1 for all l [b] such that ||X(st l) U[st l]U[st l]T X(st l)|| > ηl; Compute (st+1 l )l [b] an optimal solution of (φ) using a BLO solver; Compute (U[st+1 l ])l [b] by solving the PCA problem for matrix (X(st+1 l ))l [b]; end return the support found by solving (φ)

Dimitris Bertsimas and Driss Lahlou Kitane

3.4 Structured Sparse PCA

In some applications of sparse PCA, additional properties to sparsity are needed either to further improve interpretability or to enhance performances in subsequent classiﬁcation or regression. One of the notable applications of Structured Sparse PCA is image recognition in which the fact that the features selected need to be adjacent and/or form a particular 2D pattern (Jenatton et al., 2009). Another notable similar application is protein complex dynamics in which practitioners require that the 3D distance between the features to be limited and more recently, more abstract structures have been considered in genomics based on the interaction among diﬀerent genes (Li et al., 2017).

The approximation (7) can be adapted to virtually any pattern that can be deﬁned by linear constraints. We propose a general formulation for the approximation Structured Sparse PCA problem following the same approach that lead us to propose problem (7) and problem (9). We then illustrate this formulation for 2D data in Section 3.

Consider Π = {π1, ..., π|Π|} a set of subsets of [p] that represent the patterns that are desired and b > 0 a number of patterns that would be used to construct the PCs. In practice, patterns could represent a structure that is sought in the data; for instance, patterns could be related genes in genomics or 2D shapes in image recognition. We propose the following exact formulation:

max s {0,1}p,z {0,1}|Π|,W Rp a tr(WT SXT XSW) (10)

j [Π]:i πj zj, i [p],

Following the same steps as the ones we used to build the approximation (7), we propose the following approximation for the structured sparse PCA problem for a given ητ > 0:

max s {0,1}p

i=1 si||Xi||2 (11)

j [Π]:i πj zj, i [p],

πj Π, ||X(sπj) U[sπj]U[sπj]T X(sπj)||2 F > ητ,

Sparse PCA, a Geometric Approach

This formulation can also be extended to the case in which the patterns are disjoint:

max s {0,1}p

i=1 si||Xi||2 (12)

j [Π]:i πj zj, i [p],

πj Π, ||X(sπj) U[sπj]U[sπj]T X(sπj)|| > ητ zj = 0,

j:i πj zj 1.

(12) being a special case of problem (9) (each pattern representing one group of PCs sharing the same support), optimality bounds and algorithm apply in this case for ηl = ητ. Furthermore, if |Π| is not too large, all patterns πj that violate ||X(sπj) U[sπj]U[sπj]T X(sπj)|| ητ can be enumerated and eliminated before solving the problem. In this case, all the constraints of Problem (12) can be enumerated and and there is no need to use a cut generation algorithm.

We aim in this section to illustrate the beneﬁts of Geometric Sparse PCA (Geo SPCA) method in terms of variance explained, sparsity, predictive power and tractability. We also aim at comparing the performances obtained when building all the PCs at once vs. building them iteratively by deﬂating the data matrix.

We ﬁrst explicit principles for the choice of a and tuning of η. We then test and compare Geo SPCA in the case of all PCs sharing the same support on real world data sets. We ﬁnally test Geo SPCA on a image recognition data set using its structured sparsity version including disjointedness constraints of problem (9).

All tests are conducted computations on an Intel Core i7-8750H CPU at 2.20GHz with 16Gb of RAM on Windows 10 Pro. The solver we used is Gurobi Optimizer 9.1 running with Python 3.6.5.

4.1 Choosing η and a

We base the approach we use to choose a and η on the the similarities Geo SPCA has with the classic PCA.

We ﬁrst tune a by ﬁnding a suitable number of PCs for the classic PCA following a standard procedure. Namely, a could be chosen as the number of PCs beyond which the marginal gain in explained variance is limited (Figure 3a). If X can be approximated by a

Dimitris Bertsimas and Driss Lahlou Kitane

Figure 3: Illustrative ﬁgures for the choice of a and η.

matrix X of rank a, then a sub-matrix X(s) of X composed of selected columns of X can be approximated by a matrix of rank a. Although, theoretically, X(s) could be approximated by a matrix of a rank lower than a, this is could be the case in a real data setting but this would involve a special structure in the data; in particular, a subset of the columns of X must be orthogonal to the remaining columns, or in other words, independence among variables would be required.

The parameter η is found using Algorithm 2. We start by choosing η0 that is large (for example ||X||2 F ) and then Algotrithm 2 tightens the values of η by updating ηt. When ηt > η(so) + δ for a δ > 0 (see Theorem 4), then f(η ) is lower than the optimal value of (2). When ηt > η(so) then the optimal solutions of problem (2) are cut from the feasible set of problem (7) (Figure 3.b).

4.2 Geo SPCA with common support

We consider the problem of maximizing the variance explained from X using a number a of orthogonal PCs that have a common support of cardinal k. We use the formulation (7) to approximate problem (2). We use real world data sets of various natures and sizes to illustrate the beneﬁts of Geo SPCA. In this subsection, we focus on the amount of explained variance, the beneﬁt of building all PCs at once and the tractability depending on the size of the data set, the number of PCs and the cardinality k chosen. We selected publicly available data sets that are widely used in the literature including Mturk (n; p) = (180; 500) (Cheng et al., 2016) which consists of descriptions of randomly chosen pictures using a bag of words, Colon (n; p) = (62; 2, 000) (Alon et al., 1999) a colon cancer gene expression data set, Arcene (n, p) = (700; 10, 000) a mass-spectrometric data aiming at detecting cancer patterns proposed at the NIPS 2003 Feature Selection Challenge (Guyon et al., 2007) and CGD (n; p) = (286; 22, 283) (Wang et al., 2005), a gene expression data set used to classify breast cancers.

We compare Geo SPCA to other techniques that control the sparsity by imposing k, the number of variables that are used in the sparse model. Comparison with other techniques in which sparsity is induced by regularization is not signiﬁcant because achieving the level

Sparse PCA, a Geometric Approach

of sparsity that is achieved while controlling k requires to choose a very large weight for the regularization which skews the objective function and produces ultimately low quality solutions. Since we are also aiming at choosing k higher than 100 and p in 10,000s; we choose Path SPCA (d Aspremont et al., 2008) to illustrate the beneﬁts of building all PCs at once as its performance is comparable to other techniques extracting iteratively (Papailiopoulos et al., 2013; Yuan and Zhang, 2013). Since this technique extracts one PC at each iteration, a deﬂation technique is needed. We choose Schur complement deﬂation for its empirical performance and ease of use (see (Mackey, 2009) for a description and a full discussion on deﬂation techniques). Sparse PCs found with this process are then orthogonalized. We also use a greedy algorithm directly inspired from (Moghaddam et al., 2006) and (d Aspremont et al., 2008). We build the support of the solution σ by iteratively including the indices that maximize the increase in variace pactured from an iteration to another. We start with an empty set and we iteratively add indices i to σ such that:

i argmaxi [p]\σ max W Rp a tr(WT Sσ {i}XT XSσ {i}W)

s.t. WT W = Ia.

Greedy approaches have proven to be particularly eﬀective and are often at par with state-of-the-art techniques (d Aspremont et al., 2008; Bertsimas et al., 2022; Journee et al., 2010; Papailiopoulos et al., 2013).

We report the sorted norms of the columns on X for the diﬀerent data sets considered in Figure 4 (Top). Formulation (2) sheds a light on one reason the greedy approach performs well. Indeed, as explained in problem (2), the objective function is Pp i=1 si Pa j=1(XT i Uj)2, so maximizing the variance captured from k features is closely related to the norm of the k columns related to these features. If some columns of X have a much higher norm than the rest of the columns, they are likely to be selected by the greedy algorithm and their related variables are also likely to be in the support of the optimal solution. This is true for a = 1 which is the case that is the most studied in the literature and many of the data sets considered in the literature have columns with a norm far exceeding the norms of the remaining columns.

We ﬁrst conduct tests with Path SPCA using a constant number of features for the construction of each PC. After generating a PCs, we count the total number of features used k (we note that diﬀerent PCs may use common features). We then conduct tests using Algorithm 1 using k variables. Values of k are chosen for illustration purposes and are also the result of the number of variables that the deﬂation method produces. Indeed, in all cases studied in this paper, Path SPCA often chooses to reuse variables that were used in previously constructed PCs. The cuts are generated as lazy constraints. We report results in Table 1. We ﬁrst notice that deﬂation produces signiﬁcantly lower explained variance vs. Geo SPCA and the greedy in all cases; sometimes by more than an order of magnitude. The illustrates the beneﬁt of constructing all the PCs at once instead of doing so iteratively through deﬂation. Considering the Greedy approach, Geo SPCA outperforms this method in almost all cases by up to 24.2%. When it comes to the variance explained by sparse PCA, even 1% is signiﬁcant as this implies, depending on the application

Dimitris Bertsimas and Driss Lahlou Kitane

Table 1: Explained Variance captured by each approach, Geo SPCA algorithm computation time, the optimality gap (GAP) and the number of cuts generated. a is the number of PCs used. k is the number of variables in the support. The columns Deﬂation, Greedy and Geo SCPA indicate the variance obtained using Path PCA combined with Schur complement deﬂation technique, the Greedy algorithm and Geo SPCA respectively. The Geo SPCA vs. Greedy column shows the relative improvement in captured variance of Geo SPCA vs. the Greedy algorithm. The Time column indicates the computation time of Geo SPCA Algorithm. GAP indicates the theoretical upper bound of the relative gap between the variance captured by Geo SPCA vs. the variance captured by the optimal solution of the exact formulation as per propositions 4 and 10 (0% gap means that the solution is optimal as per Proposition 10).

Data set a k Deﬂation Greedy Geo SPCA

Geo SPCA vs. Greedy Time GAP #cuts Mturk 8 16 5.62E+3 1.57E+5 1.59E+5 +1.1% <1s 0% 56 (n, p) = 21 5.6E+3 1.66E+5 1.67E+5 +1% <1s 1.7% 44 (180;500) 26 8.73E+3 1.71E+5 1.74E+5 +1.7% <1s 2.8% 30 29 8.7E+3 1.76E+5 1.78E+5 +0.8% <1s 3.4% 154 32 9.06E+3 1.8E+5 1.81E+5 +0.4% <1s 4% 1208 Colon 5 11 1.02E+9 4.57E+9 4.79E+9 +4.9% <1s 1.7% 120 (n, p) = 12 1.97E+9 4.74E+9 4.92E+9 +3.8% <1s 3.8% 85 (62;2,000) 15 2.27E+9 5.41E+9 5.49E+9 +1.5% <1s 8.4% 562 18 2.8E+9 5.9E+9 5.94E+9 +0.6% <1s 12% 384 33 3.81E+9 7.62E+9 7.6E+9 -0.3% <1s 21.2% 1214 Arcene 3 14 7.43E+7 8.2E+8 1.02E+9 +24.2% <1s 0% 6 (n, p) = 22 8.7E+7 1.27E+9 1.5E+9 +18.1% <1s 0% 9 (700; 53 1.76E+8 2.74E+9 3.01E+9 +9.7% <1s 1.6% 7 10,000) 132 5.9E+8 5.83E+9 6.08E+9 +4.4% <1s 4.1% 35 270 7.95E+8 9.78E+9 9.82E+9 +0.4% 45s 7.1% 590 CGD 11 23 5.11E+10 1.84E+12 1.89E+12 +2.7% <1s 1% 71 (n,p) = 43 1.09E+11 2.5E+12 2.63E+12 +5.1% <1s 6.3% 50 (286; 58 1.91E+11 2.89E+12 3E+12 +3.9% 6s 10.2% 166 22,283) 85 2.82E+11 3.45E+12 3.54E+12 +2.9% 20s 14.4% 359 174 3.25E+11 4.64E+12 4.67E+12 +0.6% 118s 23% 1308

Sparse PCA, a Geometric Approach

Figure 4: (Top) Norms of the columns of X in decreasing order (Bottom) Variance explained by standard PCA depending on the number of PCs

Table 2: Computation time range for Path SPCA and Greedy Algorithms

Method Mturk Colon Arcene CGD Path SPCA 1 to 3s 9 to 11s 4 to 8 mins 2 mins to 11 hours Greedy 2 to 10s 4 to 20s 2min to 8 hours 4 mins to 51 hours

considered, more lives saved or increased proﬁts. Even theoretically, considering problem (2), when the vectors with the largest norms have a norm far exceeding the remaining columns (Figure 3), improving the explained variance by 1% is remarkable. We notice also that for Arcene and CDG data sets, the error ϵ has a greater norm compared to other data sets with respect to a. This signals that the sparse PCA problem is harder to solve which explains the edge Geo SPCA has over a simple greedy approach.

Computation time for Geo SPCA did not exceed 2 mins while Path SPCA and the greedy approach needed several hours (and often tens of hours on CGD) to provide a solution (see Table 2. We note also that Geo SPCA does not need to compute or handle the covariance matrix XT X which also contribute to drastically reduced computation time, especially for the largest instances. XT X has 4.84 10E9 entries that, besides the inherent Path SPCA computation time, needs to be deﬂated to compute each PC. This already creates complications in the management of the memory capacity of most personal computers.

Still considering the same data sets and a and k values, we report in Figure 5 the evolution of the objective function of (2) in blue and (7) in orange as cuts are iteratively added in Algorithm 2 with respect to the number of cuts for the 3000 ﬁrst cuts generated. For this experiment, we drop the lazy constraints feature in Gurobi. An optimal solution

Dimitris Bertsimas and Driss Lahlou Kitane

Figure 5: Values of the objective functions of (2) (in blue) and (7) (in orange) in function of the number of cuts added using Algorithm 2.

is found when the lowest value of (7) (in orange) is lower than a value of (2) (in blue). The green dotted line represents the lowest value of (7) achieved.

We also report for the same experiment f(ηt) with respect to t in Figure 6. We chose instances in which the optimal value is found by Algorithm 2 (using Proposition 10 to prove it). For Mturk (left), we notice that Algorithm 2 reduces ηt until the optimal solution is found at t = 2, then optimal solutions are cut and f(ηt) then decreases below the optimal solution value. For Arcene (right), the optimal solution is found at the ﬁrst iteration, then the optimal solution is cut and f(ηt) decreases also to values below the optimal value.

Sparse PCA, a Geometric Approach

Figure 6: f(ηt) with respect to t on the left for Mturk data set ((a, k) = (8, 16)) and Arcene ((a, k) = 3, 22))

Figure 7: Sample of pictures of the data set chosen

4.3 Experiments for Structured Sparse PCA

Imposing a structure in the variables to build the PCs can yield substantial beneﬁts vs. PCA or sparse PCA in a number of applications including genomics and face recognition(Li et al., 2017; Jenatton et al., 2009). We test Geo SPCA in its structured version (12) (we will call it Geo SSPCA). We chose to use face recognition for its ease of interpretation to test the method and use the data set (Martinez and Kak, 2001). The data set consists of 2600 cropped pictures of the faces of 50 men and 50 women. For each person, 26 pictures are provided with the diﬀerent face expressions and lightning conﬁgurations. In 12 of the 26 pictures, parts of the face is hidden either by black glasses or scarves (a sample of pictures is provided in Figure7). We use as patterns triangles, rectangles and octagons with dimensions varying from 3 to 8 pixels as elements of the patterns set. We ﬁlter all patterns π that violate the constraint ||X(sπj) U[sπj]U[sπj]T X(sπj)||2 F ητ (using the same notations as in (12))

Dimitris Bertsimas and Driss Lahlou Kitane

Figure 8: Example of PCs constructed for k = 10 and a = 3. First row shows 2 halves of the mouth; second row represents the shape of the jaw; third row the eye and the glasses (many participants to the data set were wearing glasses); fourth and ﬁfth rows focus on the shape of the forehead and some on the eyebrows.

so we solve only a BLO problem once as all remaining patterns verify this constraint. We choose a = 3 and k = 7 and report an example of PCs constructed while solving (12) in Figure8. We notice that the shapes and the location of the shape capture intuitive components of the face most of the time (eyes, mouth, shape of the jaw and forehead,...).

We follow the same procedure used in (Jenatton et al., 2009); the resolution of the images is reduced from 165 120 to 38 27; for each person represented in the data set, we use the 14 pictures in which the whole face is visible for training and test on the 12 pictures in which part of the face is hidden. We also use k-NN algorithm to classify the pictures after reducing the dimension using Geo SSPCA and compare to the precision achieved by Structured Sparse PCA(Jenatton et al., 2009) (using the modeling scheme proposed by the authors), and PCA depending on the number of PCs. For Geo SSPCA, we choose k = 5 and increase a to have a number of PCs varying from 10 to 70. A comparison of the precision is provided in Figure 9.

We note that the patterns can be written also as intersections of half spaces by modifying accordingly (12). However, the polyhedron resulting from the relaxation of the binary constraints leads to computational considerations that go beyond the scope of this paper.

4.4 Discussion

We deduct from the geometric interpretation of Geo SPCA some implications on good practices when using Sparse PCA in terms of choice of a and k. First on the choice of a, if

Sparse PCA, a Geometric Approach

10 20 30 40 50 60 70 0

Figure 9: Out-of-sample face recognition precision by method

the matrix X can be approximated with a matrix X of rank a , it seems unlikely that a submatrix X(s) of X would need to be approximated by a matrix of rank higher than a . Choosing a value a higher than a would lead to capture elements that were meant to be ignored such as noise and could then lead to over-ﬁtting. It is however possible to consider values of a that are lower than a . If X can be approximated by a matrix of rank a , a submatrix X(s) could eventually be approximated by a matrix X(s) of rank b < a if X(s)

has columns that are orthogonal to other columns of X , or in other words if the variables deﬁned by the support of s are independent from other features of X. This is true when the supports of the PCs are disjoint for example as we have considered in the Structured Sparse PCA setting in the current section.

Regarding the choice of k, we notice ﬁrst that choosing k a leads to a trivial problem as an optimal solution can be constructed using the k columns of X that have the largest norm. In this case, k columns can be projected into a space of dimension a using U[s] with X(s) = U[s]U[s]T X(s). On the other hand, k cannot be chosen too big to preserve the putpose of sparse PCA. We provide a summary in Figure 10.

5. Conclusion

In this paper, we proposed Geo SPCA, a new approach to the sparse PCA problem building on a geometrical interpretation of the problem. We addressed in particular the case in which the PCs share a common support. We then illustrated the versatility of this method to the case in which PCs are organized in groups that have disjoint supports and further extended this adaptation to the Structured Sparse PCA problem. The experiments we conducted showed that Geo SPCA can tackle real world instances with a number of features

Dimitris Bertsimas and Driss Lahlou Kitane

Figure 10: Map for the choice of a and k.

in the 10,000s exceeding the performance of state-of-the-art approaches while providing high quality solutions in minutes.

We believe the method can be further applied to more variants of the Sparse PCA problem and can also be improved especially by generating more eﬃcient cuts at each iteration of the algorithms proposed.

U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745 6750, 1999.

Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, and Alexandros Dimakis. Sparse pca via bipartite matchings. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

Konstantinos Benidis, Ying Sun, Prabhu Babu, and Daniel Palomar. Orthogonal sparse eigenvectors: A procrustes problem. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4683 4686, 2016.

Lauren Berk and Dimitris Bertsimas. Certiﬁably optimal sparse principal component analysis. Mathematical Programming Computation, 11:381 420, 9 2019.

D. Bertsimas, D. Lahlou Kitane, N. Azami, and F.R. Doucet. Novel mixed integer optimization sparse regression approach in chemometrics. Analytica Chimica Acta, 1137:115 124, 2020.

Dimitris Bertsimas, Ryan Cory-Wright, and Jean Pauphilet. Solving large-scale sparse pca to certiﬁable (near) optimality. Journal of Machine Learning Research, 23(13):1 35, 2022.

Jorge Cadima and Ian T Jolliﬀe. Loading and correlations in the interpretation of principle compenents. Journal of applied Statistics, 22(2):203 214, 1995.

Sparse PCA, a Geometric Approach

EShixiang Chen, Shiqian Ma, Lingzhou Xue, and Hui Zou. An alternating manifold proximal gradient method for sparse principal component analysis and sparse canonical correlation analysis. INFORMS Journal on Optimization, 2(3):192 208, 2020.

Ting-Yu Cheng, Guiguan Lin, xinyang gong, Kang-Jun Liu, and Shan-Hung (Brandon) Wu. Learning user perceived clusters with feature-level supervision. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Alexandre d Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semideﬁnite programming. Co RR, cs.CE/0406021, 2004.

Alexandre d Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269 1294, 6 2008.

Alberto Del Pia. Sparse pca on ﬁxed-rank matrices. Mathematical Programming, 2022.

Santanu S. Dey, Marco Molinaro, and Guanyi Wang. Solving sparse principal component analysis with global support. Mathematical Programming, 2022.

N. Benjamin Erichson, Peng Zheng, Krithika Manohar, Steven L. Brunton, J. Nathan Kutz, and Aleksandr Y. Aravkin. Sparse principal component analysis via variable projection. SIAM Journal on Applied Mathematics, 80(2):977 1002, 2020.

Wei Fu, Shutao Li, Leyuan Fang, and J on Atli Benediktsson. Adaptive spectral spatial compression of hyperspectral image with sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 55(2):671 682, 2017.

Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick Pletscher, Georg Schneider, and Markus Uhr. Competitive baseline methods set new standards for the nips 2003 feature selection benchmark. Pattern Recognition Letters, 28(12):1438 1444, 2007.

Wen Huang and Ke Wei. Extending ﬁsta to riemannian optimization for sparse pca. Technical report, 2019.

J. N. R. Jeﬀers. Two case studies in the application of principal component analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 16(3):225 236, 1967.

Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Structured sparse principal component analysis, 2009.

Iain M. Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486): 682 693, 2009.

Ian T. Jolliﬀe, Nickolay T. Trendaﬁlov, and Mudassir Uddin. A modiﬁed principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12 (3):531 547, 2003.

Dimitris Bertsimas and Driss Lahlou Kitane

Michel Journee, Yurii Nesterov, Peter Richtarik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11:517 553, 2 2010.

Yongchun Li and Weijun Xie. Exact and approximation algorithms for sparse pca, 2020.

Ziyi Li, Sandra Safo, and Qi Long. Incorporating biological information in sparse principal component analysis with application to genomic data. BMC Bioinformatics, 18(1):332, 2017.

Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772 801, 2013.

Lester Mackey. Deﬂation methods for sparse pca. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.

A.M. Martinez and A.C. Kak. Pca versus lda. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228 233, 2001.

Baback Moghaddam, Yair Weiss, and Shai Avidan. Spectral bounds for sparse pca: Exact and greedy algorithms. In Y. Weiss, B. Sch olkopf, and J. Platt, editors, Advances in Neural Information Processing Systems, volume 18. MIT Press, 2006.

Dimitris Papailiopoulos, Alexandros Dimakis, and Stavros Korokythakis. Sparse pca through low-rank approximations. In Sanjoy Dasgupta and David Mc Allester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 747 755, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR.

Karl Pearson. On lines and planes of closest ﬁt to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559 572, 1901.

Haipeng Shen and Jianhua Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015 1034, 2008.

Mingkui Tan, Zhibin Hu, Yuguang Yan, Jiezhang Cao, Dong Gong, and Qingyao Wu. Learning sparse pca with stabilized admm method on stiefel manifold. IEEE Transactions on Knowledge and Data Engineering, 33(3):1078 1088, 2021.

Yixin Wang, Jan GM Klijn, Yi Zhang, Anieta M Sieuwerts, Maxime P Look, Fei Yang, Dmitri Talantov, Mieke Timmermans, Marion E Meijer-van Gelder, Jack Yu, Tim Jatkoe, Els MJJ Berns, David Atkins, and John A Foekens. Gene-expression proﬁles to predict distant metastasis of lymph-node-negative primary breast cancer. The Lancet, 365(9460): 671 679, 2005.

Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515 534, 04 2009.

Sparse PCA, a Geometric Approach

Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14:899 925, 12 2013.

Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265 286, 2006.