# A Note on Sparse Generalized Eigenvalue Problem

Yunfeng Cai, Guanhua Fang, Ping Li
Cognitive Computing Lab, Baidu Research
No. 10 Xibeiwang East Road, Beijing 100193, China
10900 NE 8th St., Bellevue, Washington 98004, USA
{yunfengcai, guanhuafang, liping11}@baidu.com

The sparse generalized eigenvalue problem (SGEP) aims to find the leading eigenvector with a sparsity structure. SGEP plays an important role in statistical learning and has wide applications including, but not limited to, sparse principal component analysis, sparse canonical correlation analysis, and sparse Fisher discriminant analysis. Due to the sparsity constraint, the solution of SGEP entails interesting properties from both numerical and statistical perspectives. In this paper, we provide a detailed sensitivity analysis for SGEP and establish the rate-optimal perturbation bound under the sparse setting. Specifically, we show that the bound is related to the perturbation/noise level as well as to the recovery of the true support of the leading eigenvector. We also investigate an estimator of SGEP obtained by imposing a non-convex regularization. Such an estimator can achieve the optimal error rate and can recover the sparsity structure as well. Extensive numerical experiments, using an alternating direction method of multipliers (ADMM)-based computational method, corroborate our theoretical findings.

1 Introduction

The sparse generalized eigenvalue problem (SGEP) is to solve the following constrained optimization problem:

$$\max_{x\in\mathbb{R}^p}\ \frac{x^{\mathsf T}\tilde A x}{x^{\mathsf T}\tilde B x},\quad \text{subject to } \|x\|_0\le k, \qquad (1)$$

where $\tilde A$, $\tilde B$ are both $p$-by-$p$ symmetric matrices and $\tilde B$ is (semi-)positive definite; $\|x\|_0$ denotes the $\ell_0$-norm of $x$, i.e., the number of nonzero entries of $x$; $k$ is an integer which may be much smaller than $p$. For a fixed $k$, we let $x^\star$ be the solution to (1). Moreover, the matrices $\tilde A$ and $\tilde B$ usually have the following decomposition:

$$\tilde A = A + E, \qquad \tilde B = B + F, \qquad (2)$$

where $A$, $B$ are underlying symmetric matrices and $B$ is positive definite. However, $A$ and $B$ are unobserved in practice and they can be contaminated by noise/perturbation matrices $E$ and $F$, respectively. Let $\lambda_1$ be the largest generalized eigenvalue of the matrix pair $(A, B)$ and $u_1$ be its corresponding eigenvector, which is also known as the leading eigenvector. In addition, $u_1$ is always assumed to be sparse, i.e., $\|u_1\|_0 \ll p$. Therefore, the SGEP is essentially to find an approximation of $u_1$ based on the observation $(\tilde A, \tilde B)$.

The SGEP is challenging due to the following aspects. First, the sparsity-constrained optimization problem (1) is NP-hard, since it is essentially a subset selection problem [19, 20]. To recover the underlying sparsity structure, we need to try different $k$, which is computationally expensive. Second, due to an insufficient number of samples, $\tilde A$ and $\tilde B$ can both be ill-conditioned or even share an approximate common null space, and the perturbations $E$ and $F$ can be large. As a result, the leading eigenvector of the pair $(\tilde A, \tilde B)$ may not be a good approximation of $u_1$.

Though challenging, especially when there is a limited number of data samples, SGEP and related problems have received more and more attention in the past decade, and various numerical methods have been proposed to solve the SGEP. For example, in [24], the SGEP is framed as a d.c. (difference of convex functions) program and is solved as a sequence of convex programs by invoking the majorization-minimization method.
In [23], the ℓ0-norm constraint is approximated by a continuous surrogate function, an algorithm inspired by the majorization-minimization method is developed by iteratively majorizing the surrogate function by a quadratic separable function, and a systematic smoothing-based approach is proposed to deal with the singularity issue. [9] proposes a semidefinite programming method for sparse canonical correlation analysis, which is one of the important applications of SGEP. In [30], a two-stage computational framework is proposed to solve the SGEP, in which the first stage computes an initial guess via convex relaxation, and the second stage solves a nonconvex optimization via a fixed-step-size steepest ascent method followed by a simple truncation. In [22], a general framework called sparse estimation with linear programming is proposed for sparse canonical correlation analysis. Combining the idea of an ℓ1-penalized adaptive normalized quasi-Newton algorithm with a nested orthogonal complement structure, an algorithm is proposed in [32] to solve the SGEP. [13] solves the SGEP by modifying the standard generalized orthogonal iteration with a sparsity-inducing penalty for the eigenvectors. For a more comprehensive review of computational methods for the SGEP, please see [3, 42] and the references therein.

Despite the efforts mentioned above, there is very little work investigating the optimal estimation error, and no work provides theoretical guarantees on whether the underlying sparsity structure of the leading eigenvector of SGEP can be recovered. In this paper, we bridge this theoretical gap in the literature. First, we provide both upper and lower bounds on the approximation error given noisy observations in the sparse generalized eigenvalue problem setting. To be specific, we prove that the difference between $u_1$ and $x^\star$ is bounded by two parts, the perturbation error and the error of not recovering the true support of $u_1$. Furthermore, we show that the error bound we obtain is rate optimal. That is, the lower bound matches the upper bound up to a multiplicative constant. Second, we consider a family of non-convex penalty functions and reformulate (1) as a regularization problem. We show that such a regularized estimator enjoys the merits of sparsity and near unbiasedness. With these nice properties, the estimator can be shown to achieve the optimal estimation error rate and to recover the true support of the leading eigenvector as well. In addition, we also present a new computational algorithm. The estimation procedure adopts the Alternating Direction Method of Multipliers (ADMM, [37, 21]) and can recover the sparsity structure well under mild conditions. A local convergence theory is established for the proposed method.

The rest of the paper is organized as follows. In Section 2, we introduce the sparse generalized eigenvalue problem with its applications and technical tools. In Section 3, we provide a sensitivity analysis for SGEP and establish its upper and lower perturbation bounds. In Section 4, we propose a regularized estimator via a non-convex penalization method and establish the estimation bound. We also propose a computational method, the non-convex SGEP algorithm (NC-SGEP), to tackle the estimation problem. Multiple numerical results are given in Section 5 and corroborate our theory. Concluding remarks are given in Section 6.

Notations: The calligraphic letters $\mathcal I$, $\mathcal J$, $\mathcal K$, and $\mathcal S$ are usually used to denote index sets. $|\mathcal I|$ denotes the cardinality of $\mathcal I$;
e.g., if $\mathcal I = \{i_1, i_2, \ldots, i_s\}$, where $i_1, i_2, \ldots, i_s$ are distinct integers, then $|\mathcal I| = s$. We use $x$ to denote a generic vector in $\mathbb R^p$ and $x^{\mathsf T}$ to denote its transpose. $x[j]$ is the $j$th element of $x$. $x[\mathcal I]$ stands for the subvector of $x$ with indices in set $\mathcal I$. We use $A$ to denote a matrix in $\mathbb R^{p\times p}$. $A[i, j]$ is the entry in the $i$th row and the $j$th column. $A[\mathcal I, \mathcal J]$ stands for the submatrix of $A$ with row indices in set $\mathcal I$ and column indices in set $\mathcal J$. If $\mathcal I = \mathcal J$, $A[\mathcal I, \mathcal I]$ is also denoted by $A_{\mathcal I}$. We also let $(\mathcal I, \mathcal J)$ be the union of sets $\mathcal I$ and $\mathcal J$. For $x\in\mathbb R^p$, $\mathrm{supp}(x)$ denotes the index set of all nonzero entries of $x$. For a matrix $A$, $\mathrm{supp}(A)$ denotes the index set of all nonzero rows of $A$. $I_p$ is the $p\times p$ identity matrix. $\mathrm{diag}(d_1, \ldots, d_p)$ represents a diagonal matrix with $d_1, \ldots, d_p$ being its diagonal elements. $\|x\|_0$ represents the number of nonzero entries of $x$; $\|x\|_2 := \sqrt{\sum_j x^2[j]}$; $\|A\|_2$ denotes the largest singular value of $A$; $\|A\|_F$ and $\|A\|_*$ are the Frobenius norm and nuclear norm of matrix $A$, respectively; $\|A\|_{\max} := \max_{i,j}|A[i, j]|$. We write $a \gg b$ ($a \ll b$) if $a \ge Kb$ ($a \le b/K$) for some sufficiently large number $K$.

2 Preliminary

Applications. The sparse generalized eigenvalue problem has a wide range of applications. Many high-dimensional multivariate statistical problems can be formulated as special instances of (1).

Example 1 [Sparse Principal Component Analysis [44, 5, 1, 35, 36]] Given $n$ observations with $p$ features, sparse principal component analysis (SPCA) seeks the best low-dimensional projection of the observed data for increasing interpretability, minimizing information loss, and achieving a sparsity structure. Let $\tilde\Sigma$ be the empirical covariance matrix. SPCA aims to solve
$$\max_{x\in\mathbb R^p}\ x^{\mathsf T}\tilde\Sigma x, \quad \text{subject to } \|x\|_0\le k,\ \|x\|_2 = 1. \qquad (3)$$
Such a problem is a special case of SGEP with $\tilde A = \tilde\Sigma$ and $\tilde B = I$.

Example 2 [Sparse Fisher's Discriminant Analysis [31, 12, 17, 14, 16]] Given $n$ observations with $p$ features from $K$ different classes, Fisher's discriminant analysis seeks a projection mapping observations to a low-dimensional space on which the between-class variance $\Sigma_b$ is large while the within-class variance $\Sigma_w$ is small. Let $\tilde\Sigma_b$ and $\tilde\Sigma_w$ be the sample estimates of $\Sigma_b$ and $\Sigma_w$, respectively. To obtain the sparse leading discriminant vector, one can solve
$$\max_{x\in\mathbb R^p}\ x^{\mathsf T}\tilde\Sigma_b x, \quad \text{subject to } x^{\mathsf T}\tilde\Sigma_w x = 1,\ \|x\|_0\le k. \qquad (4)$$
Such a problem can be formulated as an SGEP with $\tilde A = \tilde\Sigma_b$ and $\tilde B = \tilde\Sigma_w$.

Example 3 [Sparse Canonical Correlation Analysis [39, 9, 10, 41]] Given two random vectors $X\in\mathbb R^p$, $Y\in\mathbb R^p$, let $\Sigma_{xx}$, $\Sigma_{yy}$, $\Sigma_{xy}$ be the covariance matrices of $X$, $Y$, and the cross-covariance matrix between $X$ and $Y$, respectively. Let $\tilde\Sigma_{xx}$, $\tilde\Sigma_{yy}$, $\tilde\Sigma_{xy}$ be estimators (constructed from samples) of $\Sigma_{xx}$, $\Sigma_{yy}$, $\Sigma_{xy}$, respectively. Sparse canonical correlation analysis (SCCA) aims to solve the constrained optimization problem:
$$\max_{u_x, u_y}\ u_x^{\mathsf T}\tilde\Sigma_{xy}u_y, \quad \text{subject to } u_x^{\mathsf T}\tilde\Sigma_{xx}u_x = 1,\ u_y^{\mathsf T}\tilde\Sigma_{yy}u_y = 1,\ \|u_x\|_0\le k_x,\ \|u_y\|_0\le k_y,$$
where $k_x$ and $k_y$ are two small integers. Such a problem can be recast as an SGEP with
$$\tilde A = \begin{bmatrix} 0 & \tilde\Sigma_{xy} \\ \tilde\Sigma_{xy}^{\mathsf T} & 0\end{bmatrix}, \qquad \tilde B = \begin{bmatrix}\tilde\Sigma_{xx} & 0 \\ 0 & \tilde\Sigma_{yy}\end{bmatrix}, \qquad u_1 = \begin{bmatrix} u_x \\ u_y\end{bmatrix}.$$
Strictly speaking, the two problems are not exactly equivalent, because $\|u\|_0\le k_x + k_y$ does not necessarily imply $\|u_x\|_0\le k_x$ and $\|u_y\|_0\le k_y$.

Technical preparation. We first introduce some useful terminology for describing the generalized eigenvalue problem (GEP). Let $A\in\mathbb R^{p\times p}$ be a symmetric matrix and $B\in\mathbb R^{p\times p}$ be a symmetric and positive definite matrix. Then the GEP for the matrix pair $(A, B)$ is defined as
$$A u_i = \lambda_i B u_i, \qquad i = 1, \ldots, p.$$
Without loss of generality, we can always assume that $\lambda_1 \ge \cdots \ge \lambda_p$. Here $u_i$ is the $i$th eigenvector and $\lambda_i$ is the $i$th eigenvalue; $(\lambda_i, u_i)$ is called the $i$th eigenpair of $(A, B)$. Notice that $u_i$ is only determined up to scale; in this paper, we always assume that $\|u_i\|_2 = 1$.
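To make Example 3 and the GEP notation concrete, the short sketch below assembles the SCCA block pair $(\tilde A, \tilde B)$ from (cross-)covariance estimates and computes the leading eigenpair of the resulting dense GEP, ignoring the sparsity constraints in (1). This is only an illustration under our own naming (`make_scca_pair`, `leading_eigpair`); it is not the algorithm proposed in this paper.

```python
import numpy as np
from scipy.linalg import eigh

def make_scca_pair(Sxx, Syy, Sxy):
    """Assemble A = [[0, Sxy], [Sxy^T, 0]] and B = blockdiag(Sxx, Syy) as in Example 3."""
    p, q = Sxy.shape
    A = np.block([[np.zeros((p, p)), Sxy],
                  [Sxy.T, np.zeros((q, q))]])
    B = np.block([[Sxx, np.zeros((p, q))],
                  [np.zeros((q, p)), Syy]])
    return A, B

def leading_eigpair(A, B):
    """Leading eigenpair of the dense GEP A u = lambda B u (no sparsity constraint)."""
    lams, U = eigh(A, B)                      # generalized eigenvalues in ascending order
    u1 = U[:, -1] / np.linalg.norm(U[:, -1])  # normalize so that ||u1||_2 = 1
    return lams[-1], u1
```

In the high-dimensional regime discussed above, this unconstrained solution can be far from the sparse $u_1$, which motivates the analysis and the estimator developed in the following sections.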
The perturbation analysis of the generalized eigenvalue problem has been studied extensively since the 1970s, and many perturbation bounds have been developed [25, 27, 28, 33]. For the reader's convenience, we present several existing fundamental results before moving to our main theory.

Definition 1 The angle between $x, y\ (\ne 0)\in\mathbb R^p$ is defined as $\theta(x, y) := \arccos\frac{|x^{\mathsf T}y|}{\|x\|_2\|y\|_2}$.

Assume $x$ and $y\in\mathbb R^p$ have unit $\ell_2$ norms and let $[x, X_2]$ be an orthogonal matrix. Then it holds that
$$|\sin\theta(x, y)| = \|X_2^{\mathsf T}y\|_2. \qquad (5)$$
Furthermore, it can be checked that $|\sin\theta(x, y)| \le \|x - y\|_2 \le 2|\sin\theta(x, y)|$. Thus $|\sin\theta(x, y)|$ is a measure that quantifies the distance between two vectors.

In the generalized eigenvalue problem, the Crawford number is an important quantity for characterizing the perturbation bound.

Definition 2 The Crawford number of a symmetric matrix pair $(A, B)$ is defined as
$$c(A, B) := \min_{\|x\|_2 = 1}\sqrt{(x^{\mathsf T}Ax)^2 + (x^{\mathsf T}Bx)^2}.$$
A symmetric matrix pair $(A, B)$ is referred to as definite if $c(A, B) > 0$.

Obviously, the symmetric matrix pair $(A, B)$ is definite for any positive definite $B$. By [25], it is known that the Crawford number is continuous with respect to the matrix pair.

3 Understanding of Perturbation Bound

3.1 On the General Setting

In order to understand the difference between the solution of (1) and the true leading eigenvector, we first describe the perturbation results without considering sparsity. The following lemma is from [29] and is a generalization of the Davis-Kahan $\sin\Theta$ theorem in [6] (also see [26, 15]). It gives an upper bound for $|\sin\theta(u_1, \tilde u_1)|$, where $\tilde u_1$ is the leading eigenvector of $(\tilde A, \tilde B)$.

Lemma 1 Suppose $(A, B)$ and $(\tilde A, \tilde B) = (A + E, B + F)$ are both symmetric definite pairs, and their eigenvalues satisfy $\lambda_1\ge\cdots\ge\lambda_p\ge 0$ and $\tilde\lambda_1\ge\cdots\ge\tilde\lambda_p\ge 0$, respectively. Let $\varphi_1 = \arctan\lambda_1 > \tilde\varphi_2 = \arctan\tilde\lambda_2$, and let $u_1$, $\tilde u_1$ be the eigenvectors corresponding to $\lambda_1$, $\tilde\lambda_1$, respectively. Denote
$$C_u = \frac{\sqrt{2(\|A\|_2^2 + \|B\|_2^2)}}{c(A, B)}, \qquad \xi = \frac{\sqrt{\|Eu_1\|_2^2 + \|Fu_1\|_2^2}}{c(\tilde A, \tilde B)}.$$
Then
$$|\sin\theta(u_1, \tilde u_1)| \le \frac{C_u\,\xi}{\sin(\varphi_1 - \tilde\varphi_2)}.$$
Therefore, it holds asymptotically that
$$|\sin\theta(u_1, \tilde u_1)| \lesssim \frac{C_u\,\epsilon}{c(A, B)\sin(\varphi_1 - \varphi_2)}. \qquad (6)$$

Although the perturbation upper bound has been studied extensively, there are few results on lower bounds, especially in the linear algebra literature. Next we establish the perturbation lower bound. That is, given any symmetric definite matrix pair $(A, B)$ with $\lambda_1 > \lambda_2$ and any sufficiently small constant $\epsilon > 0$, we can always find a matrix pair $(E, F)$ satisfying $\sqrt{\|E\|_2^2 + \|F\|_2^2} < \epsilon$ such that the distance between $u_1$ and $\tilde u_1$ is lower bounded by a quantity of the same order as the bound in Lemma 1. The result is stated in the next lemma.

Lemma 2 Follow the notations in Lemma 1. For any small positive constant $\epsilon$, it holds that
$$\sup_{(E, F)\in\mathcal F_\epsilon}|\sin\theta(u_1, \tilde u_1)| \ge \frac{C_l\,\epsilon}{c(A, B)\sin(\varphi_1 - \varphi_2)}, \qquad (7)$$
where $C_l = \frac{c(A, B)}{\sqrt{2(\|A\|_2^2 + \|B\|_2^2)}\,\kappa^{3/2}}$ with $\kappa := \|B\|_2\|B^{-1}\|_2$, and $\mathcal F_\epsilon := \{(E, F)\mid \sqrt{\|E\|_2^2 + \|F\|_2^2}\le\epsilon\}$.

Comparing (6) and (7) and noticing that $C_u$, $C_l$ are two constants, we can see that the upper and lower bounds of the approximation error only differ up to a multiplicative constant when the condition number $\kappa$ is assumed to be bounded. The quantity $\epsilon/c(A, B)$ can be seen as the relative perturbation, and $\frac{1}{\sin(\varphi_1 - \varphi_2)}$ is a monotonically decreasing function of the gap $\varphi_1 - \varphi_2$. The upper and lower bounds are both proportional to $\frac{\epsilon}{c(A, B)\sin(\varphi_1 - \varphi_2)}$. Thus, we declare that the perturbation bound for the leading eigenvector $u_1$ is rate-optimal.
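As a numerical illustration of Definitions 1-2 and of the scaling in (6)-(7), the sketch below (our own, with assumed helper names such as `sin_theta` and `crawford_number`) computes $|\sin\theta|$ between the leading eigenvectors of a pair and its perturbed version, and crudely estimates the Crawford number by sampling unit vectors.

```python
import numpy as np
from scipy.linalg import eigh

def sin_theta(x, y):
    """|sin theta(x, y)| for nonzero vectors x, y, per Definition 1."""
    c = abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(max(0.0, 1.0 - min(1.0, c) ** 2))

def crawford_number(A, B, n_samples=20000, seed=0):
    """Monte Carlo estimate of c(A, B) = min_{||x||_2=1} sqrt((x'Ax)^2 + (x'Bx)^2).
    Sampling only a finite set of unit vectors, this overestimates the true minimum."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, A.shape[0]))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    qa = np.einsum('ij,jk,ik->i', X, A, X)
    qb = np.einsum('ij,jk,ik->i', X, B, X)
    return np.sqrt((qa ** 2 + qb ** 2).min())

def leading_vec(A, B):
    """Leading eigenvector of the pair (A, B), normalized to unit length."""
    _, U = eigh(A, B)
    return U[:, -1] / np.linalg.norm(U[:, -1])

# Example: a small symmetric perturbation E of size eps moves the leading
# eigenvector by roughly eps / (c(A, B) * gap), in line with Lemma 1.
p = 20
A = np.diag(np.linspace(2.0, 1.0, p))
B = np.eye(p)
E = 1e-2 * np.random.default_rng(1).standard_normal((p, p))
E = (E + E.T) / 2
print(sin_theta(leading_vec(A, B), leading_vec(A + E, B)))
```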
3.2 On the Sparse Setting

Now we are ready to present the upper and lower perturbation bounds under the sparse setting. Throughout the rest of this section, the following assumptions are made.

A1 For any $\mathcal K\supseteq\mathrm{supp}(u_1)$ with $|\mathcal K|\le s + k$, it holds that
$$\frac{\sqrt{\|\tilde A_{\mathcal K} - A_{\mathcal K}\|_2^2 + \|\tilde B_{\mathcal K} - B_{\mathcal K}\|_2^2}}{c(A_{\mathcal K}, B_{\mathcal K})} < 1,$$
where $c(A_{\mathcal K}, B_{\mathcal K})$ is the Crawford number of $(A_{\mathcal K}, B_{\mathcal K})$ and $s = |\mathrm{supp}(u_1)|$.

A2 For any $\mathcal K\supseteq\mathrm{supp}(u_1)$ with $|\mathcal K|\le s + k$, $\tilde A_{\mathcal K}$ and $\tilde B_{\mathcal K}$ are positive definite.

Assumption A1 requires that the perturbation within a small superset of $\mathrm{supp}(u_1)$ is tiny. It says that one can get a good approximation of $u_1$ (according to Lemmas 1 and 2) when a small superset of $\mathrm{supp}(u_1)$ is available. It is, in fact, a necessary condition for computing a good approximation of $u_1$. Assumption A2 is a technical requirement to ensure positive definiteness for submatrices of $\tilde B$.

We further adopt the following notations. We let $\mathcal S := \mathrm{supp}(u_1)$ and define the perturbation level
$$\epsilon := \max_{\mathcal K:\ \mathcal K\supseteq\mathcal S,\ |\mathcal K|\le s + k}\sqrt{\|\tilde A_{\mathcal K} - A_{\mathcal K}\|_2^2 + \|\tilde B_{\mathcal K} - B_{\mathcal K}\|_2^2},$$
and the perturbation set $\mathcal F_{\epsilon, l} := \{(E, F)\mid \sqrt{\|E_{\mathcal K}\|_2^2 + \|F_{\mathcal K}\|_2^2}\le\epsilon$ for any subset $\mathcal K$ with size less than $l\}$.

Upper Bound of $|\sin\theta(u_1, x^\star)|$

Theorem 1 Let $\mathcal J = \mathrm{supp}(x^\star)$, $\mathcal K = \mathcal S\cup\mathcal J$, and $\mathcal D = \mathcal K\setminus\mathcal J$. Denote $\rho^\star = \frac{x^{\star\mathsf T}\tilde A x^\star}{x^{\star\mathsf T}\tilde B x^\star}$, $c_{\mathcal K} = c(A_{\mathcal K}, B_{\mathcal K})$, and $\tilde c_{\mathcal K} = c(\tilde A_{\mathcal K}, \tilde B_{\mathcal K})$. Let
$$\delta = \frac{\|\tilde A[\mathcal D, \mathcal J]\,x^\star[\mathcal J] - \rho^\star\,\tilde B[\mathcal D, \mathcal J]\,x^\star[\mathcal J]\|_2}{\tilde c_{\mathcal K}}.$$
If
$$\arctan\rho^\star > \arctan\mu_2 + \arctan\epsilon + \arctan\big(\delta\sqrt{1 + \rho^{\star 2}}\big), \qquad (8)$$
then
$$|\sin\theta(u_1, x^\star)| \le \frac{C_\epsilon\,\epsilon}{c(A_{\mathcal K}, B_{\mathcal K})\sin(\varphi_1 - \tilde\varphi_2)} + \frac{C_\delta\,\delta}{\sin(\varphi^\star - \tilde\varphi_2)},$$
where $\varphi^\star = \arctan\rho^\star$, $\mu_2$ is the second largest eigenvalue of $(A_{\mathcal K}, B_{\mathcal K})$, and $C_\epsilon$, $C_\delta$ are some constants.

Remark: The upper bound has two terms. The first term is due to the perturbation, which is approximately proportional to the perturbation level $\epsilon$. The second term is due to the failure to find the true support set of $u_1$. If $\mathrm{supp}(u_1)\subseteq\mathrm{supp}(x^\star)$, then $\delta = 0$ and, consequently, the second term vanishes. Therefore, it holds asymptotically that
$$|\sin\theta(u_1, x^\star)| \lesssim \frac{C_{u,\mathcal K}\,\epsilon}{c(A_{\mathcal K}, B_{\mathcal K})\sin(\varphi_1 - \varphi_2)} \qquad (9)$$
with $C_{u,\mathcal K} = \frac{\sqrt{2(\|A_{\mathcal K}\|_2^2 + \|B_{\mathcal K}\|_2^2)}}{c(A_{\mathcal K}, B_{\mathcal K})}$, when $\mathrm{supp}(x^\star)\supseteq\mathrm{supp}(u_1)$ and $\epsilon$ is sufficiently small.

Lower Bound of $|\sin\theta(u_1, x^\star)|$

Next we present the lower bound for $|\sin\theta(u_1, x^\star)|$. It follows immediately from the proofs of Theorem 1 and Lemma 2.

Theorem 2 Follow the notations and assumptions in Theorem 1. Then the following results hold.
Case (a) If $\delta = 0$, then
$$\max_{(E, F)\in\mathcal F_{\epsilon, s+k}}|\sin\theta(u_1, x^\star)| \ge \frac{C_{l,\mathcal K}\,\epsilon}{c(A_{\mathcal K}, B_{\mathcal K})\sin(\varphi_1 - \varphi_2)},$$
where $C_{l,\mathcal K} = \frac{c(A_{\mathcal K}, B_{\mathcal K})}{\sqrt{2(\|A_{\mathcal K}\|_2^2 + \|B_{\mathcal K}\|_2^2)}\,\kappa_2^{3/2}(B_{\mathcal K})}$.
Case (b) If $\epsilon = 0$, then
$$\max_{(E, F)\in\mathcal F_{\epsilon, s+k}}|\sin\theta(u_1, x^\star)| \ge \frac{\hat\xi_{\mathcal K}}{\sin(\varphi^\star - \varphi_2)},$$
where $\hat\xi_{\mathcal K} = \frac{\sqrt{\|\hat E[u_1]_{\mathcal K}\|_2^2 + \|\hat F[u_1]_{\mathcal K}\|_2^2}}{\sqrt{\|A\|_2^2 + \|B\|_2^2}}$, and $\hat E, \hat F\in\mathbb R^{|\mathcal K|\times|\mathcal K|}$ are some matrices controlled by $\delta$.

Note that the Crawford number and the condition number are well bounded for the submatrix pair $(A_{\mathcal K}, B_{\mathcal K})$. Therefore, in Case (a), the lower bound is of the same order as the first term of the upper bound in Theorem 1; in Case (b), $\hat\xi_{\mathcal K} = O(\delta)$, so the lower bound is of the same order as the second term of the upper bound in Theorem 1. We may therefore declare that our perturbation bounds are rate-optimal.

Connection to the literature. Although we do not make any statistical assumptions on $E$ and $F$, we still make connections to the statistical literature here. First, it can be computed that
$$\sin(\varphi_1 - \varphi_2) = \frac{\lambda_1 - \lambda_2}{\sqrt{(1 + \lambda_1^2)(1 + \lambda_2^2)}}.$$
When $B = I$ is the identity matrix, $C_{u,\mathcal K} = O(\sqrt{\lambda_1^2 + 1})$ and $C_{l,\mathcal K} = \Omega(\frac{1}{\lambda_1^2 + 1})$. Thus our lower bound can be simplified to $\frac{\sqrt{\lambda_2^2 + 1}}{\lambda_1 - \lambda_2}\,\epsilon$.
In sparse PCA [36], it has been established that the minimax lower bound $\inf_{\hat u_1}\sup_{\mathcal M}(\mathbb E\|\hat u_1 - u_1\|_2^2)^{1/2}$ is $\Theta\big(\frac{\sqrt{\lambda_1\lambda_2}}{\lambda_1 - \lambda_2}\,\epsilon\big)$ with $\epsilon = \sqrt{\frac{s\log p}{n}}$, where $\mathcal M$ is the model space in which the covariance matrix $A$ has an $s$-sparse principal singular vector and $\lambda_1$ is larger than $\lambda_2$ by a certain margin. When $\lambda_1$, $\lambda_2$, and $\lambda_1/\lambda_2$ are bounded constants, this minimax rate matches ours. In [2], the minimax perturbation bound for singular subspaces of non-symmetric matrices is provided. The minimax rate of $\|\hat u_1 - u_1\|_2$ is $\Theta\big(\frac{\alpha z_{21} + \beta z_{12}}{\alpha^2 + \beta^2\wedge\min\{z_{12}^2, z_{21}^2\}}\big)$, where the detailed definitions of $\alpha$, $\beta$, $z_{12}$, $z_{21}$ can be found in [2]. In particular, for symmetric $A$ and $\tilde A$, the above bound simplifies to $\frac{\epsilon}{\lambda_1 - \lambda_2}$ when $\epsilon\lesssim\lambda_1 - \lambda_2$. Again, when $\lambda_1$ and $\lambda_2$ are bounded constants, the rate is consistent with ours.

4 Understanding of Estimation Quality

In order to estimate the sparse leading eigenvector, it is straightforward to solve (1). It can be seen that the sparsity of the solution to (1) depends on the choice of $k$, which may not be easy to tune. It is also computationally expensive to find the optimal solution to (1). Alternatively, such a problem can be solved via a regularization method. For example, [23] uses a smooth surrogate function to replace the $\ell_0$ penalty for solving the SGEP. A good estimator should have the following properties: (1) sparsity, (2) near unbiasedness, and (3) stability. In high-dimensional statistical problems, $\tilde B$ is always singular when the sample size is smaller than the number of features. In order to obtain a good approximation of $u_1$, we consider the following restricted problem:
$$(\mathrm{P1})\qquad \min_{x\in\mathbb R^p,\ \|x\|_0\le s_n}\ -x^{\mathsf T}\tilde A x + p_\lambda(x) \quad \text{s.t. } x^{\mathsf T}\tilde B x\le 1, \qquad (10)$$
where $p_\lambda(x) := \sum_{j=1}^p p_\lambda(x[j])$, $p_\lambda(\cdot)$ is some univariate non-convex function, and $s_n$ is the restricted dimension. In practice, $s_n$ need not be a small number and it can grow with the sample size $n$. The restriction $\|x\|_0\le s_n$ is imposed to enforce that the solution is nearly low-dimensional. Together with $p_\lambda$, the estimator can recover the sparsity structure. This reformulation can be viewed as a counterpart of two-stage methods [30, 9], which need to find a good approximation of $u_1$ in the first stage.

4.1 Penalization Function

For the choice of $p_\lambda$, we consider a family of special non-convex penalties:
$$\mathcal P_\lambda = \{p_\lambda(x) : p_\lambda(x)\ \text{satisfies (a1)-(a3)}\}, \qquad (11)$$
where
(a1) The function $p_\lambda(x)$ is even, i.e., $p_\lambda(x) = p_\lambda(-x)$;
(a2) The derivative of $p_\lambda(x)$, $p'_\lambda(x)$, exists on $(0, \infty)$; $\lim_{x\to 0^+}p'_\lambda(x) = \lambda$ and $p'_\lambda(x) = 0$ if $x\ge\gamma\lambda$ for some constant $\gamma$;
(a3) On $(0, \infty)$, $p'_\lambda(x)$ is monotone decreasing and Lipschitz continuous, i.e., there exists a constant $\kappa$ such that $0\le\frac{p'_\lambda(x_1) - p'_\lambda(x_2)}{x_2 - x_1}\le\kappa$ for any $0 < x_1 < x_2$.

Here (a1) requires $p_\lambda(x)$ to be symmetric; (a2) specifies the local behavior of the derivative of $p_\lambda(x)$ around 0 and requires $p_\lambda(x)$ to be flat for large $x$; (a3) puts continuity constraints on $p'_\lambda(x)$. Many popular penalty functions are included in $\mathcal P_\lambda$, for example, the smoothly clipped absolute deviation (SCAD, [7]) and the minimax concave penalty (MCP, [43]). Based on these definitions, we can see that any $p_\lambda(x)\in\mathcal P_\lambda$ behaves like the $\ell_1$-norm locally around the origin. On the other hand, $p_\lambda(x)$ puts a smaller penalty on the signal than the $\ell_1$ penalty does, and it does not penalize large values at all. Intuitively speaking, the regularized estimator with a non-convex penalty may outperform the $\ell_1$-norm-based estimator since it is nearly unbiased.
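As a concrete member of $\mathcal P_\lambda$, the sketch below implements the MCP penalty [43] and its derivative; it satisfies (a1)-(a3) with $\kappa = 1/\gamma$. This is a minimal illustration under our own naming (`mcp`, `mcp_grad`, with `lam` and `gamma` standing for $\lambda$ and $\gamma$), not code from the paper.

```python
import numpy as np

def mcp(t, lam, gamma=3.0):
    """MCP penalty value p_lambda(t), elementwise: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam,
    and the constant gamma*lam^2/2 beyond that (no penalty growth for large signals)."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a ** 2 / (2.0 * gamma), 0.5 * gamma * lam ** 2)

def mcp_grad(t, lam, gamma=3.0):
    """Derivative extended by symmetry: equals lam near 0 and vanishes for |t| >= gamma*lam,
    so (a1)-(a3) hold with Lipschitz constant kappa = 1/gamma."""
    return np.sign(t) * np.maximum(lam - np.abs(t) / gamma, 0.0)
```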
4.2 Support Recovery

In this section, we provide the error bound and oracle properties of the non-convex estimator under suitable conditions. Some additional notations are introduced as follows. We define $\hat u_1$ as $\arg\max_{x:\ \|x\|_0\le s_n,\ x^{\mathsf T}\tilde B x\le 1}\ x^{\mathsf T}A x$ and denote $\rho = \hat u_1^{\mathsf T}A\hat u_1$. The vector $\hat u_1$ can be viewed as the best approximation of the leading eigenvector in the restricted space. We define $c(A, B, s_n) := \max_{|\mathcal K|\le s_n}c(A_{\mathcal K}, B_{\mathcal K})$ and $c(\tilde A, \tilde B, s_n) := \max_{|\mathcal K|\le s_n}c(\tilde A_{\mathcal K}, \tilde B_{\mathcal K})$. We also define $N(A, B, s_n) := \max_{|\mathcal K|\le s_n}\sqrt{\|A_{\mathcal K}^2 + B_{\mathcal K}^2\|_2}$ and $\epsilon_s := \max_{\mathcal K:\ |\mathcal K|\le s_n}\{\|E_{\mathcal K}\|_2, \|F_{\mathcal K}\|_2\}$.

A different set of assumptions is stated as follows.

B0 (Regularity) $\lambda_1 - \lambda_2$ is positive; $\|B\|_2$ and $\|B^{-1}\|_2$ are bounded by some constant.
B1 (Signal) $\min\{|u_1[j]| : j\in\mathrm{supp}(u_1)\}$ is sufficiently large.
B2 (Penalization) $\lambda \ge C_F\,\epsilon_s$, where $C_F$ is a constant depending on $c(A, B, s_n)$, $c(\tilde A, \tilde B, s_n)$, and $N(A, B, s_n)$.
B3 (Support size) $|\mathcal S|\lambda^2 \lesssim 1$.
B4 (Identifiability) $\max_{\mathcal K:\ |\mathcal K|\le s_n,\ \mathcal S\nsubseteq\mathcal K}\lambda_1(A_{\mathcal K}, B_{\mathcal K}) < \lambda_1 - 2\epsilon$.

Condition B0 ensures that the leading eigenvector is unique and that the underlying matrix $B$ is well behaved. Condition B1 requires that the absolute values of the entries in the true support are not too small so that $\mathcal S$ can be identified. Condition B2 specifies the relationship between the penalty level $\lambda$ and the noise level; that is, the penalty level $\lambda$ should be at least as large as the noise level up to a multiplicative constant. Condition B3 makes sure that the support size is not too large, i.e., $u_1$ should be sparse. Condition B4 is for the identifiability of $\mathcal S$, ensuring that the support of $u_1$ can still be identified after perturbation. The following theorem guarantees that the estimator is not far away from the re-scaled leading eigenvector. Without loss of generality, we always assume $s_n\ge|\mathcal S|$. Then we have the following results.

Theorem 3 Let $\hat x$ be the optimizer of Problem (P1). Under Conditions B0-B4, it holds that
$$\min_{\mathrm{sgn}\in\{-1, 1\}}\|\mathrm{sgn}\cdot\hat x - u_{1s}\|_2 \le C_1\sin(\theta(\hat u_1, u_1)) + C_2\epsilon_s,$$
for some constants $C_1$ and $C_2$. Here $u_{1s} = u_1(u_1^{\mathsf T}Bu_1)^{-1/2}$ is the re-scaled leading eigenvector.

Furthermore, we can recover the support of $u_1$ under this restricted problem.

Theorem 4 Under the same conditions as in Theorem 3, it holds that
$$\mathrm{supp}(\hat x) = \mathrm{supp}(u_1). \qquad (12)$$

Based on Theorem 4, we actually have even stronger results. When all required conditions are met, the proposed estimator is equal to the oracle estimator, which is the one obtained when the true support $\mathcal S$ is known to us. Therefore, the estimator achieves the optimal error bound, which matches the one obtained in Theorem 1. Specifically, the proposed estimator achieves the optimal error rate $(\frac{|\mathcal S|\log p}{n})^{1/2}$ in sparse PCA and sparse CCA problems.

Theorem 5 Under the same set of conditions as in Theorem 4, we have that
$$|\sin\theta(\hat x, u_1)| \le \frac{C_{u,\mathcal S}\,\epsilon}{c(A_{\mathcal S}, B_{\mathcal S})\sin(\varphi_1 - \varphi_2)}, \qquad (13)$$
with $C_{u,\mathcal S} = C\,\frac{\sqrt{2(\|A_{\mathcal S}\|_2^2 + \|B_{\mathcal S}\|_2^2)}}{c(A_{\mathcal S}, B_{\mathcal S})}$ ($C$ is a universal constant).

Remark 1 To achieve the optimal statistical error rate $(\frac{|\mathcal S|\log p}{n})^{1/2}$, most existing methods are two-stage based. For example, [38] studies the sparse PCA problem and proposes a "sparse orthogonal iteration pursuit" (SOAP) algorithm which involves "relax" and "tighten" stages; [9] considers the sparse CCA problem and adopts a two-stage regularization approach with an $\ell_1$ penalty for the first stage and a group LASSO penalty for the second stage. By comparison, our estimator does not require a convex relaxation in the first stage.

4.3 Computation

For computational purposes, we consider solving (P1) using the alternating direction method of multipliers (ADMM, [37, 21]). Specifically, we relax problem (P1) by reformulating it as
$$\min_{x, z, y}\ L(x, z, y), \quad \text{s.t. } x^{\mathsf T}\tilde B x = 1,\ \|x\|_0\le s_n, \qquad (14)$$
where $L(x, z, y) = -x^{\mathsf T}\tilde A x + p_\lambda(z) + y^{\mathsf T}(x - z) + \frac{\eta}{2}\|x - z\|_2^2$. In (14), we introduce several auxiliary variables for the following reasons. We construct a new vector $z$ which is a copy of $x$; it helps us split the original problem into two simple separate sub-problems. The vector $y$ is the dual variable for the constraint $x = z$. With formulation (14), we can optimize the objective function with respect to each variable iteratively. The steps for updating $x$, $z$, $y$ are described as follows.

i Update $x$: At the $(t+1)$-th iteration, we aim to find
$$x^{(t+1)} = \arg\min_{x\in\mathcal D}\ \frac{\eta}{2}\|x - z^{(t)}\|_2^2 + (y^{(t)})^{\mathsf T}(x - z^{(t)}) - x^{\mathsf T}\tilde A x, \qquad (15)$$
where $\mathcal D = \{x : x^{\mathsf T}\tilde B x = 1,\ \|x\|_0\le s_n\}$. Solve sub-problem (15) to get $x^{(t+1)}$.

ii Update $z$: We have $z^{(t+1)} = \arg\min_z\ \frac{\eta}{2}\|z - x^{(t+1)}\|_2^2 + (y^{(t)})^{\mathsf T}(x^{(t+1)} - z) + p_\lambda(z)$. Each entry of $z$ can be optimized separately. Specifically, if we take $p_\lambda$ as the MCP penalty, then $z^{(t+1)}$ has the following analytical form:
$$z^{(t+1)}[j] = \begin{cases}\check z^{(t)}[j] & \text{if } |\check z^{(t)}[j]| > \gamma\lambda,\\ \mathrm{sgn}(\check z^{(t)}[j])\,\dfrac{(|\check z^{(t)}[j]| - \lambda/\eta)_+}{1 - 1/(\eta\gamma)} & \text{if } |\check z^{(t)}[j]|\le\gamma\lambda,\end{cases}$$
where $\check z^{(t)} := x^{(t+1)} + y^{(t)}/\eta$.

iii Update $y$: By the dual variable update in [21], we have $y^{(t+1)} = y^{(t)} + \eta(x^{(t+1)} - z^{(t+1)})$.

We call the above procedure the non-convex SGEP (NC-SGEP) algorithm. The proposed algorithm only involves matrix-vector products and eigen-decompositions of submatrices of $\tilde B$, which is computationally efficient. Compared with other truncation methods, we do not need to make an effort to choose the best $s_n$, owing to the non-convex regularization term. Thanks to the penalization term, our method gives a sparse estimator for a very wide range of choices of the restricted dimension $s_n$.

Here we propose a projection-based method for solving sub-problem (15). At the $(t+1)$-th iteration, we aim to find $x^{(t+1)}$, the minimizer of (15). We construct an active set $\mathcal S_t$ and consider the following recursive formula:
$$b_m^{(t+1)} = z^{(t)} - \frac{y^{(t)} - \tilde A x_{m-1}^{(t+1)}}{\eta}, \qquad x_m^{(t+1)}[\mathcal S_t] = \big(\beta_m^{(t+1)}\tilde B_{\mathcal S_t} + I\big)^{-1}b_m^{(t+1)}[\mathcal S_t], \qquad (16)$$
where $x_0^{(t+1)} = x^{(t)}$ and the index $m\in\{1, 2, \ldots\}$. The scalar $\beta_m^{(t+1)}$ satisfies
$$1 = \sum_j\frac{d_j\,(\tilde b_m^{(t+1)}[j])^2}{(\beta_m^{(t+1)}d_j + 1)^2}$$
with $\tilde b_m^{(t+1)} = U^{\mathsf T}b_m^{(t+1)}[\mathcal S_t]$, where $\tilde B_{\mathcal S_t} = UDU^{\mathsf T}$ and $D = \mathrm{diag}(d_1, \ldots, d_{s_n})$. This recursive formula is valid due to the following two observations (Propositions 1-2).

Proposition 1 Let $\check y$ be the projection of $y$ onto the ellipsoid $\{x\mid x^{\mathsf T}Bx = 1\}$. Then $\check y$ has the form $\check y = (\beta B + I)^{-1}y$, where $\beta$ is the scalar solving the equation
$$1 = \sum_j\frac{d_j\,(\tilde y[j])^2}{(\beta d_j + 1)^2},$$
where $\tilde y = U^{\mathsf T}y$, $B = UDU^{\mathsf T}$, and $D = \mathrm{diag}(d_1, \ldots, d_p)$.

Proposition 2 The limit point returned by (16) is a stationary point of (15).

Proposition 1 gives an explicit formula for projecting an arbitrary vector $y$ onto the ellipsoid $\{x\mid x^{\mathsf T}Bx = 1\}$. By contrast, the naive rescaling method, i.e., $\check y = y(y^{\mathsf T}By)^{-1/2}$, may lead to worse performance. Since (15) is highly non-convex, Proposition 2 only guarantees a way to find a stationary point, not necessarily an optimal solution. In this paper, we consider the following two possible constructions of the active set $\mathcal S_t$; a small illustrative sketch of the resulting updates is given after the list.
(C1) $\mathcal S_t$ is the set of indices corresponding to the $s_n$ largest absolute values of the entries of $b_1^{(t+1)}$ (see (16)).
(C2) $\mathcal S_t$ is the set of indices corresponding to the $s_n$ largest absolute values of the entries of $x^{(t)}$.
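The sketch below strings the three updates together in NumPy, with the $x$-update approximated by a single projection step of the recursion (16) under active-set rule (C1). It follows our reconstruction of the formulas above; all names (`project_ellipsoid`, `mcp_threshold`, `nc_sgep_step`) and the default parameter values are ours, and it is meant as an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import brentq

def project_ellipsoid(y, B):
    """Project y onto {x : x^T B x = 1} via Proposition 1: the projection is
    (beta*B + I)^{-1} y, with beta solving sum_j d_j*yt_j^2/(beta*d_j+1)^2 = 1."""
    d, U = np.linalg.eigh(B)                     # B = U diag(d) U^T, d > 0
    yt = U.T @ y
    g = lambda beta: np.sum(d * yt ** 2 / (beta * d + 1.0) ** 2) - 1.0
    # In the generic case, g decreases from +inf to -1 on (-1/d_max, inf),
    # so the root can be bracketed and found with Brent's method.
    lo, hi = -1.0 / d.max() + 1e-10, 1.0
    while g(hi) > 0:
        hi *= 2.0
    beta = brentq(g, lo, hi)
    return U @ (yt / (beta * d + 1.0))

def mcp_threshold(v, lam, gamma, eta):
    """Elementwise z-update (prox of the MCP penalty), assuming eta*gamma > 1."""
    shrunk = np.sign(v) * np.maximum(np.abs(v) - lam / eta, 0.0) / (1.0 - 1.0 / (eta * gamma))
    return np.where(np.abs(v) > gamma * lam, v, shrunk)

def nc_sgep_step(A, B, x, z, y, lam=0.3, gamma=3.0, eta=1.0, s_n=25):
    """One ADMM iteration of NC-SGEP: active-set x-update, MCP z-update, dual y-update."""
    b = z - (y - A @ x) / eta                    # first vector of the recursion (16)
    S = np.argsort(-np.abs(b))[:s_n]             # active set, rule (C1)
    x_new = np.zeros_like(x)
    x_new[S] = project_ellipsoid(b[S], B[np.ix_(S, S)])
    z_new = mcp_threshold(x_new + y / eta, lam, gamma, eta)
    y_new = y + eta * (x_new - z_new)
    return x_new, z_new, y_new

# Illustrative usage on a random symmetric pair (assumed data, not from the paper):
# p = 50
# A = np.random.randn(p, p); A = (A + A.T) / 2
# B = np.eye(p)
# x = z = np.ones(p) / np.sqrt(p); y = np.zeros(p)
# for _ in range(200):
#     x, z, y = nc_sgep_step(A, B, x, z, y, s_n=10)
```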
The following theorem gives the local convergence of the NC-SGEP algorithm under additional assumptions. In general, ADMM-based methods are extremely hard to analyze, especially for highly non-convex problems. It remains an open question whether a global convergence result can be established.

Theorem 6 For the active set construction (C2), take a sufficiently large $\eta$, set the initial support of $x^{(0)}$ so that it includes $\mathrm{supp}(u_1)$, and let $z^{(0)} = x^{(0)}$, $y^{(0)} = 0$. Then it holds that
$$T(\epsilon) \le C\big(L(x^{(0)}, z^{(0)}, y^{(0)}) - \underline f\big),$$
where $T(\epsilon) = \min\{t : \|x^{(t)} - z^{(t)}\|\le\epsilon\}$ and $\underline f = \min_x L(x, x, y)$, for some constant $C$ which may depend on $\eta$ and $\max_{\mathcal S:\ |\mathcal S|\le s_n}\|(B_{\mathcal S})^{-1/2}A_{\mathcal S}(B_{\mathcal S})^{-1/2}\|_2$. Moreover, $\mathrm{supp}(x^{(T(\epsilon))}) = \mathrm{supp}(u_1)$ if $\epsilon = o(\epsilon_s)$. For construction method (C1), the same local convergence result also holds under the extra assumption that $B$ is diagonal.

4.4 Remarks

In practice, we find that construction (C1) has a higher probability of finding the global optimum than construction (C2). Especially in the application of sparse principal component analysis, construction (C1) is very efficient at recovering the sparsity structure of the leading component. Theorem 6 gives a local convergence result, which requires that the support of the initial value $x^{(0)}$ contains the support of the true leading eigenvector $u_1$. Such a good initial candidate for the leading eigenvector can be obtained via semidefinite programming (SDP)-based methods [35, 36, 9]. In other words, our proposed method can easily be merged into a two-stage-type method. The proposed algorithm only requires $O(s_np + s_n^3)$ operations per iteration, while SDP-based methods have $O(p^3)$ computational complexity. More discussion of SDP-based methods can be found in the supplemental material.

5 Numerical Experiments

5.1 Validation of Perturbation Bounds

We conduct perturbation analyses of the proposed estimator under different settings. The matrix pair is set as $A = 4u_1u_1^{\mathsf T} + I - P_{u_1}$, $B = I$. We sample $n$ data points from $N(0, A)$ and another $n$ data points from $N(0, B)$. $\tilde A$ and $\tilde B$ are the corresponding sample covariance matrices. The leading eigenvector $u_1$ has unit norm and has nonzero entries in the first $|\mathcal S|$ positions. We fix the dimension $p = 100$ and let the number of samples ($n$) and the support size ($|\mathcal S|$) vary. Each setting is repeated 100 times with the fixed choice $\lambda = 0.3$, $\eta = 1$, $s_n = 25$. The mean and standard deviation of $\|\hat x - u_1\|_2$ are reported. We also compute the oracle estimator, which is the best $\hat x$ when the true support $\mathcal S$ is known.

Table 1: Estimation error $\|\hat x - u_1\|_2$ under different perturbation and sparsity levels. "Oracle" / "Est": with / without knowing the support $\mathcal S$. "Sd" is the standard deviation of "Est". The three Oracle/Est/Sd blocks correspond to increasing support sizes $|\mathcal S|$.

| n | 100 | 200 | 400 | 800 | 1600 | 3200 | 6400 |
|---|---|---|---|---|---|---|---|
| $\sqrt{\log p/n}$ | 0.230 | 0.162 | 0.115 | 0.081 | 0.057 | 0.041 | 0.029 |
| Oracle | 0.056 | 0.036 | 0.029 | 0.019 | 0.013 | 0.010 | 0.007 |
| Est | 0.140 | 0.047 | 0.043 | 0.024 | 0.019 | 0.012 | 0.008 |
| Sd | (0.190) | (0.034) | (0.030) | (0.022) | (0.013) | (0.009) | (0.006) |
| Oracle | 0.107 | 0.073 | 0.057 | 0.037 | 0.028 | 0.019 | 0.014 |
| Est | 0.225 | 0.131 | 0.077 | 0.043 | 0.032 | 0.022 | 0.017 |
| Sd | (0.244) | (0.061) | (0.036) | (0.028) | (0.017) | (0.011) | (0.009) |
| Oracle | 0.173 | 0.118 | 0.087 | 0.062 | 0.044 | 0.030 | 0.022 |
| Est | 0.277 | 0.191 | 0.110 | 0.082 | 0.058 | 0.043 | 0.028 |
| Sd | (0.279) | (0.156) | (0.066) | (0.039) | (0.031) | (0.022) | (0.012) |

From Table 1, when the noise level $\sqrt{\log p/n}$ is small, we can see that the estimation error is quite close to the optimal (oracle) error. In addition, we can see that the estimation error is proportional to $\sqrt{\log p/n}$. This indicates that our method can achieve the theoretically optimal error rate, i.e., $O(\sqrt{|\mathcal S|\log p/n})$.

5.2 Validation of Sparsity Recovery

The underlying matrix pair is set as $A = 3u_1u_1^{\mathsf T} + I - P_{u_1}$, $B = I$. Thus, we naturally take $\tilde B = B = I$ and $\tilde A$ as the sample covariance of the data. We let the dimension $p$ grow from 16 to 256, fix the sample size $n = 100$, and set $\lambda = 0.5$, $\eta = 1$, and $s_n = 50$.
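For reference, the following sketch generates the synthetic pair used in Sections 5.1-5.2 under our reading of the setup: $P_{u_1}$ is taken to be the projector $u_1u_1^{\mathsf T}$, and $u_1$ is chosen with equal nonzero entries on its support (the paper only specifies unit norm and support on the first $|\mathcal S|$ coordinates). Names such as `make_pair` and `spike` are ours.

```python
import numpy as np

def make_pair(p=100, s=5, n=100, spike=3.0, seed=0):
    """Population pair A = spike*u1 u1^T + I - u1 u1^T, B = I, plus the noisy
    observation A_tilde (sample covariance of n draws from N(0, A)) and B_tilde = I."""
    rng = np.random.default_rng(seed)
    u1 = np.zeros(p)
    u1[:s] = 1.0 / np.sqrt(s)         # unit norm, supported on the first s coordinates
    P = np.outer(u1, u1)
    A = spike * P + np.eye(p) - P     # leading eigenpair of (A, I) is (spike, u1)
    X = rng.multivariate_normal(np.zeros(p), A, size=n)
    A_tilde = X.T @ X / n
    return A, A_tilde, np.eye(p), u1
```

With `spike = 4` this corresponds to the Section 5.1 setting, and with `spike = 3` to the Section 5.2 setting.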
We compare the proposed method with the semidefinite programming (SDP) method [34, 18, 40] with the ℓ1 (SDP_L1) and MCP (SDP_MCP) penalties. The estimation error and the percentage of support recovery are reported in Table 2. We can see that the proposed method has slightly better performance. This is because NC-SGEP optimizes the objective over the space $\mathbb R^p$, unlike the SDP methods, which work in the space $\mathbb R^{p\times p}$. In addition, the proposed method can recover the sparsity structure quite well, while the SDP methods never recover the true support set.

Table 2: Estimation accuracy for sparse canonical correlation analysis.

| Metric | Method | p = 16 | p = 32 | p = 64 | p = 128 | p = 256 |
|---|---|---|---|---|---|---|
| $\|\hat x\hat x^{\mathsf T} - u_1u_1^{\mathsf T}\|$ | NC-SGEP | 0.074 (0.056) | 0.076 (0.068) | 0.072 (0.059) | 0.081 (0.054) | 0.084 (0.054) |
| | SDP_L1 | 0.071 (0.046) | 0.080 (0.041) | 0.095 (0.049) | 0.094 (0.039) | 0.100 (0.041) |
| | SDP_MCP | 0.070 (0.045) | 0.079 (0.040) | 0.093 (0.048) | 0.093 (0.038) | 0.096 (0.039) |
| Recovery of $\mathcal S$ | NC-SGEP | 93% | 82% | 68% | 59% | 49% |
| | SDP_L1 | - | - | - | - | - |
| | SDP_MCP | - | - | - | - | - |

6 Conclusion

In this paper, we establish upper and lower bounds for the perturbation analysis of the sparse generalized eigenvalue problem. We also consider a new statistical estimation method. The proposed method gives a sparse, nearly unbiased, and stable solution to SGEP. We show that the proposed estimator can achieve the optimal estimation error rate. We further present a non-convex SGEP (NC-SGEP) algorithm to solve the non-convex regularization problem with a guarantee of local convergence. Multiple numerical results validate our theories and also show the superior performance of the proposed method. In future work, on the theoretical side, we may focus on extending the current results to the problem of finding multiple leading sparse eigenvectors and establishing the corresponding new lower-bound theory. On the computational side, even though computing the leading vectors one by one seems straightforward, it is not an easy problem since it requires an orthogonalization procedure; the computational complexity may require special care.

[1] Aharon Birnbaum, Iain M. Johnstone, Boaz Nadler, and Debashis Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. Annals of Statistics, 41(3):1055, 2013.
[2] T. Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. Annals of Statistics, 46(1):60-89, 2018.
[3] Yunfeng Cai and Ping Li. An inverse-free truncated Rayleigh-Ritz method for sparse generalized eigenvalue problem. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3460-3470, Online [Palermo, Sicily, Italy], 2020.
[4] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223, 2006.
[5] Alexandre d'Aspremont, Francis R. Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res., 9:1269-1294, 2008.
[6] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7(1):1-46, 1970.
[7] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.
[8] Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America, 35(11):652, 1949.
[9] Chao Gao, Zongming Ma, and Harrison H. Zhou. An efficient and optimal method for sparse canonical correlation analysis. arXiv e-prints, 1409, 2014.
[10] Chao Gao, Zongming Ma, and Harrison H. Zhou. Sparse CCA: Adaptive estimation and computational barriers. Annals of Statistics, 45(5):2074-2101, 2017.
[11] Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2741-2750, New York City, NY, 2016.
[12] Yaqian Guo, Trevor Hastie, and Robert Tibshirani. Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8(1):86-100, 2007.
[13] Sungkyu Jung, Jeongyoun Ahn, and Yongho Jeon. Penalized orthogonal iteration for sparse estimation of generalized eigenvalue problem. Journal of Computational and Graphical Statistics, 28(3):710-721, 2019.
[14] Mladen Kolar and Han Liu. Optimal feature selection in high-dimensional discriminant analysis. IEEE Trans. Inf. Theory, 61(2):1063-1083, 2015.
[15] Ren-Cang Li. Matrix perturbation theory. In L. Hogben, R. Brualdi, and G. W. Stewart, editors, Handbook of Linear Algebra, chapter 21. CRC Press, Boca Raton, FL, 2nd edition, 2014.
[16] Qing Mai, Yi Yang, and Hui Zou. Multiclass sparse discriminant analysis. Statistica Sinica, 29(1):97-111, 2019.
[17] Qing Mai, Hui Zou, and Ming Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29-42, 2012.
[18] Jérôme Malick, Janez Povh, Franz Rendl, and Angelika Wiegele. Regularization methods for semidefinite programming. SIAM J. Optim., 20(1):336-356, 2009.
[19] Baback Moghaddam, Yair Weiss, and Shai Avidan. Spectral bounds for sparse PCA: exact and greedy algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 915-922, Vancouver, Canada, 2005.
[20] Baback Moghaddam, Yair Weiss, and Shai Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the Twenty-Third International Conference (ICML), pages 641-648, Pittsburgh, PA, 2006.
[21] Neal Parikh and Stephen P. Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127-239, 2014.
[22] Sandra E. Safo, Jeongyoun Ahn, Yongho Jeon, and Sungkyu Jung. Sparse generalized eigenvalue problem with application to canonical correlation analysis for integrative analysis of methylation and gene expression data. Biometrics, 74(4):1362-1371, 2018.
[23] Junxiao Song, Prabhu Babu, and Daniel Pérez Palomar. Sparse generalized eigenvalue problem via smooth optimization. IEEE Trans. Signal Process., 63(7):1627-1642, 2015.
[24] Bharath K. Sriperumbudur, David A. Torres, and Gert R. G. Lanckriet. A majorization-minimization approach to the sparse generalized eigenvalue problem. Mach. Learn., 85(1-2):3-39, 2011.
[25] Gilbert W. Stewart. Perturbation bounds for the definite generalized eigenvalue problem. Linear Algebra and Its Applications, 23:69-85, 1979.
[26] Gilbert W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, Boston, 1990.
[27] Ji-Guang Sun. The perturbation bounds of generalized eigenvalues of a class of matrix-pairs. Math. Numer. Sinica, 4(1):23-29, 1982.
[28] Ji-Guang Sun. Perturbation analysis for the generalized eigenvalue and the generalized singular value problem. In Matrix Pencils, pages 221-244. Springer, 1983.
[29] Ji-Guang Sun. The perturbation bounds for eigenspaces of a definite matrix-pair. Numer. Math., 41(3):321-343, 1983.
[30] Kean Ming Tan, Zhaoran Wang, Han Liu, and Tong Zhang. Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1057-1086, 2018.
[31] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science, pages 104-117, 2003.
[32] Kengo Uchida and Isao Yamada. A nested ℓ1-penalized adaptive normalized quasi-Newton algorithm for sparsity-aware generalized eigen-subspace extraction. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 212-217. IEEE, 2018.
[33] Charles F. Van Loan and Gene H. Golub. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 4th edition, 2013.
[34] Lieven Vandenberghe and Stephen P. Boyd. Semidefinite programming. SIAM Rev., 38(1):49-95, 1996.
[35] Vincent Q. Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems (NIPS), pages 2670-2678, Lake Tahoe, NV, 2013.
[36] Vincent Q. Vu, Jing Lei, et al. Minimax sparse principal subspace estimation in high dimensions. Annals of Statistics, 41(6):2905-2947, 2013.
[37] Bo Wahlberg, Stephen Boyd, Mariette Annergren, and Yang Wang. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83-88, 2012.
[38] Zhaoran Wang, Huanran Lu, and Han Liu. Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352, 2014.
[39] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.
[40] Henry Wolkowicz, Romesh Saigal, and Lieven Vandenberghe. Handbook of Semidefinite Programming: Theory, Algorithms, and Applications, volume 27. Springer Science & Business Media, 2012.
[41] Zhiqiang Xu and Ping Li. Towards practical alternating least-squares for CCA. In NeurIPS, pages 14737-14746, 2019.
[42] Zhiqiang Xu and Ping Li. A practical Riemannian algorithm for computing dominant generalized eigenspace. In Conference on Uncertainty in Artificial Intelligence, pages 819-828. PMLR, 2020.
[43] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894-942, 2010.
[44] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.