Journal of Machine Learning Research 24 (2023) 1-61. Submitted 7/21; Revised 12/22; Published 5/23.

Sparse GCA and Thresholded Gradient Descent

Sheng Gao shenggao@wharton.upenn.edu
Zongming Ma zongming@wharton.upenn.edu
Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Ji Zhu

Abstract

Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple data sets. It generalizes canonical correlation analysis, which is designed for two data sets. We study sparse GCA when there are potentially multiple leading generalized correlation tuples of interest in the data and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as a generalized eigenvalue problem at both the population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic data sets.

Keywords: sparse GCA, sparse CCA, thresholded gradient descent, non-convex optimization, generalized eigenvalue problem

1. Introduction

With the advent of big data acquisition technology, it has become increasingly important to integrate information across multiple data sets collected on a common set of subjects. Canonical correlation analysis (CCA), first proposed by Hotelling (1992), is a widely used statistical tool for integrating information from two data sets: it seeks linear combinations of variables within each data set such that their correlation is maximized. However, recent advances in fields such as multi-omics and multimodal brain imaging have presented us with new challenges, since scientists are now often able to collect more than two data sets on the same set of subjects. To tackle these challenges, we turn to a useful generalization of CCA called generalized correlation analysis (GCA) (Kettenring, 1971), which aims to explore linear relationships across multiple data sources. Kettenring (1971) proposed five different variants of generalized correlation analysis of multiple data sets, where different methods correspond to maximization of different objective functions of covariances and correlations, subject to certain normalization constraints. Tenenhaus and Tenenhaus (2011) extended these approaches to regularized versions by adding l2 penalties to the objective functions and proposed partial least squares algorithms for solving them. In addition to unsupervised settings, it has been shown that GCA can be incorporated into supervised learning settings to improve efficiency and accuracy (Shen et al., 2014).

Suppose that the k data sets are i.i.d. realizations of k random vectors X{1} ∈ R^{p1}, X{2} ∈ R^{p2}, ..., X{k} ∈ R^{pk}. Here and after, we use the subscript {j} to denote the jth set of features and/or the jth data set.
At the population level, we propose to seek vectors a{1}, a{2}, ..., a{k} (called the first k-tuple of generalized loading vectors) that solve

maximize over l{1}, ..., l{k}:   \sum_{i,j=1}^{k} cov(l{i}⊤ X{i}, l{j}⊤ X{j})   subject to   \sum_{i=1}^{k} var(l{i}⊤ X{i}) = 1.    (1)

An important difference between the foregoing formulation and those in Kettenring (1971) is the normalization constraint. We propose to normalize the sum of the variances of the linear combinations, while Kettenring (1971) requires the individual variance of each linear combination to be one. When k = 2, the two normalizations essentially lead to the same solution. One can show that the optimal solution to (1) necessarily has var(a{i}⊤ X{i}) = 1/2 for i = 1 and 2, and hence the optimal a{i}'s are proportional to those obtained by normalizing individual variances. However, when k ≥ 3, the solutions are usually different, and we argue that formulation (1) is more appropriate for two reasons. First, the normalization in (1) conforms with the null principle (Donnell et al., 1994) in that if all linear combinations of the X{i}'s are uncorrelated, then the objective function reduces to the left-hand side of the normalization constraint. In addition, it is more resilient to spurious solutions, as illustrated by the following example.

Example 1 Suppose k = 3 and p1 = p2 = p3 = m. Let X{1} = Y v{1} + Z{1}, X{2} = Y v{2} + Z{2} and X{3} = Z{3}, where v{1} and v{2} are two deterministic unit vectors in R^m, Y ~ N(0, 1), Z{i} ~iid N_m(0, I_m), and they are mutually independent. Simple linear algebra shows that the optimal solution to (1) is a{1} = (1/2) v{1}, a{2} = (1/2) v{2} and a{3} = 0. In contrast, with the individual normalization constraints var(l{i}⊤ X{i}) = 1, the optimal solution would change to a{1} = (1/√2) v{1}, a{2} = (1/√2) v{2}, and a{3} could be any vector in B^{m-1}, where B^{m-1} is the unit ball in R^m. Comparing the solutions under the two normalizations, formulation (1) is advantageous in that it provides a meaningful and unique optimal a{3}.

The solution to (1) gives the first k-tuple of leading GCA loading vectors. More generally, we want to extract successive k-tuples of leading loading vectors subject to certain additional constraints. In view of (1), suppose we have found (a^{(m)}{1}, ..., a^{(m)}{k}), for m = 1, ..., r − 1, as the first (r − 1) k-tuples of leading loading vectors. We define the rth k-tuple (a^{(r)}{1}, ..., a^{(r)}{k}) as the solution to (1) with the following additional constraints:

\sum_{i=1}^{k} cov((a^{(m)}{i})⊤ X{i}, l{i}⊤ X{i}) = 0,   m = 1, ..., r − 1.

To be more concise, let the columns of L{i} ∈ R^{pi × r} be the successive leading loading vectors for X{i}, for i = 1, ..., k. The foregoing proposal for finding the first r leading k-tuples then reduces to the following optimization problem

maximize over L{1}, ..., L{k}:   \sum_{i,j} Tr(L{i}⊤ Σ{ij} L{j})   subject to   \sum_{i=1}^{k} L{i}⊤ Σ{ii} L{i} = I_r.    (2)

Here Σ{ii} denotes the covariance matrix of X{i} and Σ{ij} = cov(X{i}, X{j}) for i ≠ j. To write (2) more concisely, we define L = [L{1}⊤, ..., L{k}⊤]⊤ ∈ R^{p × r} where p = \sum_{i=1}^{k} pi, Σ0 = diag(Σ{11}, ..., Σ{kk}) ∈ R^{p × p}, and Σ ∈ R^{p × p} whose (i, j)th off-diagonal block is Σ{ij} and whose ith diagonal block is Σ{ii}. Then we can rewrite (2) as

maximize over L:   Tr(L⊤ Σ L)   subject to   L⊤ Σ0 L = I_r.    (3)

In what follows, A = [A{1}⊤, ..., A{k}⊤]⊤ denotes the solution to (3), where each A{i} ∈ R^{pi × r} and the mth column of A{i} is a^{(m)}{i}, for i = 1, ..., k and m = 1, ..., r.
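To make formulation (3) and Example 1 concrete, the sketch below builds the population covariance of Example 1 and solves (3) with r = 1 numerically. It is a minimal illustration assuming NumPy and SciPy; the variable names are ours and are not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

m = 5
rng = np.random.default_rng(0)
v1 = rng.standard_normal(m); v1 /= np.linalg.norm(v1)
v2 = rng.standard_normal(m); v2 /= np.linalg.norm(v2)

# Population covariance blocks under Example 1: X1 = Y*v1 + Z1, X2 = Y*v2 + Z2, X3 = Z3.
S11 = np.outer(v1, v1) + np.eye(m)
S22 = np.outer(v2, v2) + np.eye(m)
S33 = np.eye(m)
S12 = np.outer(v1, v2)                       # cov(X1, X2) = v1 v2'
Z = np.zeros((m, m))
Sigma = np.block([[S11, S12, Z], [S12.T, S22, Z], [Z, Z, S33]])
Sigma0 = np.block([[S11, Z, Z], [Z, S22, Z], [Z, Z, S33]])

# Solve (3) with r = 1; eigh returns eigenvectors a normalized so that a' Sigma0 a = 1.
vals, vecs = eigh(Sigma, Sigma0)
a = vecs[:, -1]                              # leading generalized eigenvector
a1, a2, a3 = a[:m], a[m:2*m], a[2*m:]
print(round(vals[-1], 4))                            # leading value, 1.5 in this example
print(round(np.linalg.norm(a3), 4))                  # ~0: the noise-only block gets zero loading
print(round(abs(a1 @ v1) / np.linalg.norm(a1), 4))   # ~1: a1 is proportional to v1
```

The leading solution puts all of its weight on the two blocks that share the latent variable Y, in agreement with Example 1.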
Informed readers might have already realized that (3) is closely connected to the generalized eigenvalue problem |Σ − λΣ0| = 0, where |M| stands for the determinant of a square matrix M. In addition, when Σ0 is diagonal (i.e., when k = p), the problem is equivalent to principal component analysis (PCA) of the correlation matrix: for any non-singular Σ0, the columns of Σ0^{1/2} A are the leading eigenvectors of the correlation matrix Σ0^{-1/2} Σ Σ0^{-1/2}. When k = 2, the solution to (3) is equivalent to CCA up to scaling. Thus, program (3) provides a unified formulation for extracting leading linear covariation within one or across multiple data sets.

In practice, one does not have direct knowledge of the joint population covariance matrix Σ or even its block diagonal part Σ0. The natural choice is to replace them with their sample counterparts. In modern applications of GCA, the data set dimensions (i.e., the pi's) can be much larger than the sample size n. Hence, one suffers from the curse of dimensionality if no further structural assumption is made (Donoho, 2000; Johnstone, 2001; Bickel and Levina, 2008a,b). Due to its interpretability and practicality, a structural assumption that has been widely adopted in both theory and practice is sparsity: most energy of the solution to (1)–(3) concentrates on a small number of entries (Hastie et al., 2015; Wainwright, 2019). Let A denote the solution to (3). In this manuscript, we adopt the assumption that at most s rows of A contain nonzero entries. In other words, the target of estimation is also the solution to the following sparse generalized correlation analysis (Sparse GCA, or SGCA) problem:

maximize over L:   Tr(L⊤ Σ L)   subject to   L⊤ Σ0 L = I_r,   ||L||_{2,0} ≤ s.    (4)

Here and after, for any matrix L, ||L||_{2,0} counts the number of nonzero rows in L. In view of the discussion following (3), when k = p and k = 2, (4) reduces to sparse PCA of a correlation matrix and sparse CCA, respectively.

1.1 Main contributions

The main contributions of the present manuscript are the following. First, we clarify the target of estimation in generalized correlation analysis (i.e., the solution A to (3)) by considering a natural latent variable model in which an r-dimensional latent variable drives the covariation of k random vectors. Under mild conditions, we show that the linear subspace spanned by the leading r generalized eigenvectors in our GCA formulation (i.e., the columns of A) coincides with the subspace spanned by a concrete functional of parameters in the latent variable model. In addition, we characterize the behavior of generalized eigenvalues under the latent variable model. Next, for sample sparse GCA, we propose a thresholded gradient descent algorithm for solving a Lagrangian version of the sample counterpart of (4). The algorithm is intuitive and easy to implement. In view of the discussion following (3), such an algorithm provides a unified approach to a number of different sparse unsupervised learning problems, including sparse PCA of correlation matrices and sparse CCA. Furthermore, we provide a theoretical analysis of the thresholded gradient descent algorithm. We show that with high probability, when initialized properly, estimation errors of the intermediate results after each iteration converge at a geometric rate until they arrive in a tight neighborhood of the population solution to (4). Statistically, we establish tight estimation error bounds for the output of our algorithm as an estimator.
Numerically, this implies geometric convergence of our algorithm to such an estimator. Finally, these theoretical findings are corroborated by numerical studies on simulated data sets. 1.2 Related works In view of the discussion following (3), the present paper is closely connected to the sparse PCA and sparse CCA literature. To date, there is a large literature devoted to various aspects of the sparse PCA problem, including algorithms (Zou et al., 2006; Johnstone and Lu, 2009; Amini and Wainwright, 2009; Witten et al., 2009; Ma, 2013; Yuan and Zhang, 2013; Vu et al., 2013; Wang et al., 2014), information-theoretic limits (Birnbaum et al., 2013; Cai et al., 2013; Vu and Lei, 2012) and computational theoretic limits (Berthet and Rigollet, 2013; Gao et al., 2017). However, this literature has mostly focused on sparse PCA of covariance matrices. Thus, the special case of k = p in our setting complements the existing literature by providing both theory and method for sparse PCA of correlation matrices. In the case of k = 2, the theory and method in this paper specialize to the sparse CCA setting (Chen et al., 2013; Gao et al., 2015a, 2017). In this case, we provide a new iterative algorithm for sparse CCA that achieves optimal estimation rates derived in Gao et al. (2015a). Therefore, for this special case, the present manuscript provides a competitive alternative to existing sparse CCA methods. When r = 1, (4) reduces to the population version of the sparse generalized eigenvalue problem considered in Tan et al. (2018). Tan et al. (2018) proposed a truncated Rayleigh Sparse GCA and Thresholded Gradient Descent flow method for estimating the population solution in this special case, and established its rates of convergence. It is not clear how their method can be generalized to estimate successive generalized eigenvectors or eigenspaces. In contrast, our estimator is based on a different algorithm (thresholded gradient descent of a Lagrangian objective function). It not only achieves fast converging estimation rates for estimating the first generalized eigenvector but also works for estimating leading generalized eigenspaces of fixed dimensions that are greater than one. 1.3 Organization of the paper The rest of this paper is organized as follows. In section 2 we examine the generalized eigenvalue problem underpinning sparse GCA under a latent variable model. Section 3 proposes a thresholded gradient descent algorithm and its initialization via generalized Fantope projection. Section 4 establishes the convergence rate of our algorithm under reasonable initialization. Numerical results are presented in Section 5. Technical proofs are deferred to appendices. 1.4 Notation For any set S, let |S| denote its size and Sc denote its complement. For any event E, 1E is its indicator function. For a vector u, u = (P i u2 i )1/2, u 0 = P i 1ui =0, u 1 = P i |ui|, u = maxi |ui|. For any matrix A = (aij), the ith row of A is denoted by Ai and the jth column by A j. We let Col(A) denote the span of columns of A. For a positive integer m, [m] denotes the index set 1, 2, . . . , m. For two subsets I and J of indices, we write AIJ for the |I| |J| submatrices formed by aij with (i, j) I J. When I or J is the whole set, we abbreviate it with an , and so if A Rp k, then AI = AI[k] and A J = A[p]J. For any square matrix A = (aij), denote its trace by Tr(A) = P i aii. Moreover, we denote the collection of all p r matrices with orthonormal columns by O(p, r), and abbreviate O(r, r) by O(r). 
The set of p p symmetric matrices is denoted by Sp. Furthermore, σi(A) stands for the ith largest singular value of A and σmax(A) = σ1(A), σmin(A) = σmin{p,k}(A). The Frobenius norm and the operator norm of A are A F = (P i,j a2 ij)1/2 and A op = σ1(A), respectively. The infinity norm of A is defined as A = maxi,j |aij|. The l1 norm and the nuclear norm of a matrix A are A 1 = P i,j |aij| and A = P i σi(A), respectively. Similarly, si(A) stands for the ith largest eigenvalue of A Rp p, while smax(A) = s1(A) and smin(A) = sp(A). The support of A is defined as supp(A) = {i [n] : Ai > 0}. For two symmetric matrices A and B, we write A B if B A is positive semidefinite. For any positive semi-definite matrix A, A1/2 denotes its principal square root that is positive semi-definite and satisfies A1/2A1/2 = A. The trace inner product of two matrices A, B Rp k is A, B = Tr(A B). For any real numbers a and b, let a b = min(a, b) and a b = max(a, b). For any two sequences of positive numbers {an} and {bn}, we write an = O(bn) if lim supn an/bn is finite. Given a random element X, L(X) denotes its probability distribution. The symbol C and its variants C1, C , etc. are generic positive constants and may vary from occurrence to occurrence, unless otherwise specified. The symbols P and E stand for generic probability and expectation when the distribution is clear from the context. 2. A Latent Variable Model In this part, we aim to identify the solutions to (3) as functionals of parameters in the joint distribution of (X {1}, . . . , X {k}) . To this end, we introduce an intuitive latent variable model. The fundamental assumption underlying generalized correlation analysis is that there exists a shared low-dimensional latent variable which orchestrates the (linear) covariation of observed features across all data sets. Let z be an r-dimensional latent variable. The following latent variable model describes an idealized data generating process: X{i} = U{i}z + e{i}, i = 1, . . . , k, z Nr(0, Ir), e{i} ind Npi(0, Ψ{ii}). (5) Here the deterministic matrix U{i} Rpi r for i = 1, . . . , k, Ψ{ii} is positive definite, and the latent variable z and the idiosyncratic noises {e{i}}k i=1 are mutually independent. Under the foregoing latent variable model, the joint covariance matrix of (X {1}, . . . , X {k}) Rp is given by Σ with the ith diagonal block Σ{ii} = U{i}U {i} + Ψ{ii}, for i = 1, . . . , k, (6) and the (i, j)th block Σ{ij} = U{i}U {j}, for 1 i = j k. (7) We denote by Ψ the block diagonal matrix with blocks Ψ{ii}, i = 1, . . . , k, on the diagonal. We also let U = [U {1}, U {2}, ..., U {k}] Rp r. In the rest of this section, we assume that the observed data sets are generated by model (5). In addition, we let λ1 λ2 λp 0 (8) denote the population generalized eigenvalues of Σ with respect to Σ0 = diag(Σ{11}, . . . , Σ{kk}). The following key lemma identifies the connection between parameters in model (5) and the solution to (3). Lemma 1 Suppose that Σ and Σ0 are specified by model (5) (7). Let A = [A {1}, . . . , A {k}] be the solution to (3). If the rth generalized eigenvalue of Σ w.r.t. Σ0 is larger than 1, i.e., λr > 1, then Col(A{i}) Col(Σ 1 {ii}U{i}) for all i [k]. For any i, if further rank(A{i}) = r , then Col(A{i}) = Col(Σ 1 {ii}U{i}). The next lemma describes the behavior of population generalized eigenvalues under latent variable model (5). It also identifies a sufficient condition for λr > 1. To this end, we start with an assumption motivated by Fan et al. (2019). 
Assumption 2 In latent variable model (5) (7), we assume the following: The matrix U has full column rank, that is, rank(U) = r; Σ0 and U satisfy σr(Σ 1/2 0 U) 1. Sparse GCA and Thresholded Gradient Descent Lemma 3 In model (5) (7), under Assumption 2, we have r = max{j : λj > 1, j [p]}. (9) In other words, there are exactly r generalized eigenvalues greater than 1. Moreover, define Y = [U {1}, U {2}, . . . , U {k}] Rp r where U {i} is the Moore-Penrose inverse of U{i}. Then the multiplicity of 1 as a generalized eigenvalue is #{j : λj = 1} = p i=1 rank(U{i}) + r rank(U UY U). (10) The foregoing lemma guarantees an eigengap between the rth and the (r + 1)th generalized eigenvalues, and so the leading rank r generalized eigenspace is well-defined at the population level. By the foregoing lemma, under model (5) we can decompose Σ as Σ = Σ0KΛK Σ0 = Σ0AΛr A Σ0 + Σ0BΛB Σ0. (11) Here Λ, Λr, and Λ are all diagonal where Λr = diag(λ1, λ2, ..., λr) collects the first r generalized eigenvalues, Λ collects the remaining p r generalized eigenvalues, and Λ = diag(Λr, Λ). In addition, B collects the eigenvectors associated with the bottom p r generalized eigenvalues. On the other hand, the decomposition (11) hold as long as the rth and (r + 1)th generalized eigenvalues are distinct and hence is more general than model (5). Remark 4 In the case when k = 2, The covariance matrices between X and Y can be reparameterized as Σxy = Σx V Θr W Σy with V Σx V = W Σy W = Ir Gao et al. (2015a, 2017). Here Θr = diag(θ1, θ2, ..., θr) collects the leading r canonical coefficients for CCA and Θr + Ir = Λr. Hence we have λr = θr + 1. The solution A = [A {1}, A {2}] to (3) satisfies that A{1} = 1 2V, A{2} = 1 2W. Such a relationship does not hold in general for k 3. Remark 5 When k = p, Σ0 becomes a diagonal matrix where the diagonal entries are variances of the variables. The problem (3) is then equivalent to finding the leading eigenspace of correlation matrix of the data, defined as R = Σ 1/2 0 ΣΣ 1/2 0 . Let ER Rp r be the matrix containing the eigenvectors that span the leading r dimensional eigenspace. If A is the solution to (3), then Σ1/2 0 A coincides with ER (up to an r r rotation matrix when there is any generalized eigenvalue with multiplicity larger than one). Thus, we essentially estimate the leading eigenspace of the correlation matrix. 3. Gradient Descent with Hard Thresholding In this section, we present a thresholded gradient descent algorithm for simultaneously finding multiple leading sparse generalized eigenvectors. 3.1 Motivation The sample counterpart of (4) can be recast as the following minimization problem: minimize L bΣ, LL subject to L bΣ0L = Ir, L 2,0 s. Here bΣ is the joint sample covariance matrix of all k sets of features. We propose to solve its Lagrangian version: minimize L Rp r f(L) subject to L 2,0 s, (12) where the objective function f(L) is f(L) = bΣ, LL + λ 2 L bΣ0L Ir 2 F. (13) Effectively, minimizing the first term in (13) maximizes the original objective function whereas minimizing the second term controls the deviation from the normalization constraint in (3). Here λ is a tuning parameter. Intuitively, the larger it is, the more penalty we put on deviating from the normalization constraint and the closer L bΣ0L is to Ir. To deal with the constraint in (12), we shall perform the following hard thresholding. 
Definition 6 Given a matrix U and a natural number k, we define the output of the hard thresholding function HT(U, k) to be the matrix obtained by keeping the k rows with the largest l2 norms and replacing all other rows with zeros, that is,

HT(U, k)_{i·} = U_{i·} if i ∈ C_k, and 0 otherwise,

where C_k is the index set of the k rows of U with the largest l2 norms. When there is a tie, we always pick the smaller/smallest index.

3.2 Algorithm

Let X^{(1)}, X^{(2)}, ..., X^{(n)} be i.i.d. observations generated from the latent variable model with covariance matrix Σ and its block diagonal part Σ0. Let Σ̂ and Σ̂0 be the sample covariance matrix and its block diagonal part computed on {X^{(1)}, X^{(2)}, ..., X^{(n)}}. Algorithm 1 describes the proposed thresholded gradient descent procedure. Given a proper initial estimator, which we shall specify later, each iteration first performs a step of gradient descent on f defined in (13), and then keeps the s′ rows with the largest l2 norms and thresholds the remaining rows to zero. By iteratively performing these two steps, the algorithm can be viewed as a heuristic for solving the non-convex optimization problem (4). Here s′ is a user-specified tuning parameter and is not necessarily equal to the true sparsity level s. Step 2 in Algorithm 1 is a re-normalization step. This is due to the fact that the stationary point of the objective function f(L) is not the same as the solution to the original optimization problem (4). Thus, we first transform the initializer for the iterations and eventually transform back in Step 7 to define the final estimator.

Algorithm 1: Thresholded gradient descent for sparse GCA
Input: Covariance matrix estimator Σ̂ and its block diagonal part Σ̂0; initialization Â0.
Tuning parameters: step size η; penalty λ; sparsity level s′; number of iterations T.
1: Ã0 ← Â0 (Â0⊤ Σ̂0 Â0)^{-1/2}
2: V̄1 ← Ã0 (I + (1/λ) Ã0⊤ Σ̂ Ã0)^{1/2}
3: for t = 1, 2, 3, ..., T do
4:   V_{t+1} ← V̄_t − η ∇f(V̄_t) = V̄_t + 2η (Σ̂ V̄_t − λ Σ̂0 V̄_t (V̄_t⊤ Σ̂0 V̄_t − I_r))
5:   V̄_{t+1} ← HT(V_{t+1}, s′)
6: end for
7: Output: Â_T = V̄_T (V̄_T⊤ Σ̂0 V̄_T)^{-1/2}

In Algorithm 1, each iteration is computationally efficient: line 4 is essentially matrix multiplication and addition, and line 5 requires calculating and sorting row l2 norms. Since Σ̂ = (1/n) \sum_{i=1}^{n} (X^{(i)} − X̄)(X^{(i)} − X̄)⊤ where X̄ = (1/n) \sum_{i=1}^{n} X^{(i)}, an efficient way to calculate Σ̂ V̄_t is to compute Y_t^{(i)} = (X^{(i)} − X̄)⊤ V̄_t first, followed by multiplying (X^{(i)} − X̄) Y_t^{(i)}, resulting in O(npr) flops. Line 5 requires O(rp + p log p) flops for selecting the top s′ rows. Hence a single iteration of thresholded gradient descent requires O(npr + p log p) flops in total.

3.3 Initialization via generalized Fantope projection

Success of Algorithm 1 depends crucially on the quality of the initial estimator Â0. Thus, we need to find an initial estimator that is relatively close to the true generalized eigenspace in some distance so that later iterations can further improve on estimation accuracy. We now introduce such an initial estimator based on Fantope projection (Vu et al., 2013). Note that

\sum_{i,j=1}^{k} Tr(A{i}⊤ Σ{ij} A{j}) = ⟨Σ, AA⊤⟩.

The idea is to lift AA⊤ into S^p, the space of p × p symmetric matrices, and hence to treat it as a single quantity. Since we assume that A is row sparse, F = AA⊤ has at most s² nonzero entries, which is much smaller than its total number of entries. To ensure sparsity of its solution, we impose an entrywise l1 penalty to define the following objective function for initialization:

minimize over F ∈ S^p:   −⟨Σ̂, F⟩ + ρ ||F||_1.
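For concreteness, the following minimal NumPy sketch implements the hard-thresholding operator of Definition 6 and a single iteration of Algorithm 1 (lines 4–5). The function and variable names are ours, and the snippet illustrates the update rule for the objective (13) rather than reproducing the authors' implementation.

```python
import numpy as np

def hard_threshold_rows(U, k):
    """Keep the k rows of U with the largest l2 norms; zero out the rest (Definition 6)."""
    norms = np.linalg.norm(U, axis=1)
    keep = np.argsort(-norms, kind="stable")[:k]   # stable sort breaks ties by smaller index
    out = np.zeros_like(U)
    out[keep] = U[keep]
    return out

def tgd_step(V_bar, Sigma_hat, Sigma0_hat, eta, lam, s_prime):
    """One step of Algorithm 1 on f(L) = -<Sigma_hat, LL'> + (lam/2) ||L' Sigma0_hat L - I||_F^2."""
    r = V_bar.shape[1]
    grad = (-2 * Sigma_hat @ V_bar
            + 2 * lam * Sigma0_hat @ V_bar @ (V_bar.T @ Sigma0_hat @ V_bar - np.eye(r)))
    V_next = V_bar - eta * grad                     # gradient step (line 4)
    return hard_threshold_rows(V_next, s_prime)     # hard thresholding (line 5)
```

We now return to the construction of the initial estimator.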
On the other hand, the normalization constraint on A implies that (bΣ0)1/2F(bΣ0)1/2 P(p, r), where P(p, r) is defined to be the set of rank r projection matrices P(p, r) = {PP , P O(p, r)}. However, this direct generalization leads to a nonconvex feasible set since P(p, r) is nonconvex. To obtain a bona fide convex program, in the light of Vu et al. (2013), we use the Fantope set introduced by Dattorro (2003): Fr = {X : 0 X I and Tr (X) = r}. The motivation is the observation from Overton and Womersley (1991) that Fr = conv(P(p, r)), where conv(A) denotes the convex hull of A. To summarize, our initial estimator is the solution to the following program: minimize F Sp bΣ, F + ρ F 1, subject to (bΣ0)1/2F(bΣ0)1/2 Fr. (14) Upon obtaining b F as the solution to (14), we collect the leading r eigenvectors of b F as e Ur Rp r and the corresponding leading r eigenvalues as entries of the diagonal matrix e Dr Rr r. Then let A0 = e Ur e D1/2 r and b A0 = HT(A0, s ). (15) This finishes initialization for Algorithm 1. The initialization procedure has O(p3) computation complexity due to the ADMM step involved in solving the generalized Fantope projection. See Gao et al. (2017) for details of the ADMM algorithm. As we shall show in next section, b F suffers a relatively large estimation error rate for estimating AA and hence b A0 for A. However, for Algorithm 1 to work, such an estimator serves well as an initial estimator under mild conditions. In addition to PCA and the current setting, an analogous initialization via convex relaxation idea has appeared in Gao et al. (2017) for performing sparse CCA which is asymmetric. 4. Theoretical Results We provide theoretical justifications for our algorithms in this section. We first state our main result on how each iteration of Algorithm 1 improves estimation accuracy, followed by a corollary on error bounds achieved by the final estimator. Analysis of generalized Fantope initialization follows our investigation of the main algorithm. In addition, we include a lower bound for the finite k setting at the end of this section. Parameter space Under covariance structure (11), we define F({si}k 1, {pi}k 1, r, {λj}p 1; ν) as the collection of all covariance matrices Σ satisfying (11) and the following conditions: (i) Sparsity: A{i} Rpi r with A{i} 2,0 si; (ii) Bounded spectrum : 1 ν smin(Σ) < smax(Σ) ν, 1 ν smin(Σ0) < smax(Σ0) ν; (iii) Eigengap : λr λr+1 > 0. Here λi s are generalized eigenvalues as defined in (8). By Section 2, under model (5) the eigengap condition is satisfied if Assumption 2 holds. The parameter space is then defined Sparse GCA and Thresholded Gradient Descent as Pn({si}k 1, {pi}k 1, r, {λj}p 1; ν) = L(X(1), X(2)..., X(n)) : X(i) iid N(0, Σ), Σ F({si}k 1, {pi}k 1, r, {λj}p 1; ν) . (17) In what follows, S denotes the true row support of A. While the parameter space requires normality, our theoretical analysis generalizes directly to sub-Gaussian distributions. We omit the generalization in this work as it is mostly formality. Matrix distance To measure estimation accuracy, we define the distance between two matrices U and V Rp r as dist(U, V ) = min P O(r) UP V F. (18) Here O(r) is the collection of r r orthogonal matrices. The matrix distance has been used previously in Ge et al. (2017), Tu et al. (2015) and Golub and Van Loan (2012), among others. 4.1 Main results The following theorem characterizes numerical convergence of Algorithm 1 when starting at a reasonable initializer. 
Theorem 7 In Algorithm 1, set η c 12λ1ν(1 + c)2 , λ = λ1 for some constant c, and (λr λr+1)2η2 s. (20) For initial estimator b A0, suppose that it has row sparsity s , and that for V = A I + 1 b A0 is so constructed that after the first two lines of Algorithm 1, we have dist(V, V 1) 1 8 ν min c(λr λr+1) 2λ1ν2(42 + 25c), 1 + c If (16) holds and p 1 + λ2 r+1 (λr λr+1)2 n < c0 (23) for some sufficiently small constant c0 > 0, then for some constants C, C > 0, uniformly over Pn = Pn({si}k 1, {pi}k 1, r, {λj}p 1; ν), with probability at least 1 exp( C (s log(ep/s ))), for all t 1, dist(V, V t+1) C s 1 + λ2 r+1 λr λr+1 n | {z } Statistical Error + ξtdist(V, V 1) | {z } Optimization Error where ξ = 1 η2(λr λr+1)2 The foregoing theorem leads to the following corollary on high probability error bounds for estimating A. Corollary 8 Suppose the conditions of Theorem 7 hold. For each t 1, let b At = V t(V t bΣ0V t) 1/2. Then for some constants C, C1, C > 0, uniformly over Pn, with probability at least 1 exp( C (s log(ep/s ))), for all t T0, where T0 = log C s 1+λ2 r+1 λr λr+1 log(1 η2(λr λr+1)2 64ν2 ) , (25) dist(A, b At) C1 1 + λ2 r+1 λr λr+1 In particular, when T T0, the last display holds for the output b AT of Algorithm 1. We give some brief remarks on Theorem 7 and Corollary 8. The error bound for each step, given by the right side of (24), is composed of two parts: a statistical error term independent of the iteration counter t, and an optimization error that decreases geometrically as t increases. When t T0, the optimization error is dominated by the statistical error and we can achieve the estimation rate in Corollary 8. Thus, T0 can be interpreted as the minimum number of iterations needed for achieving the estimation error rate in Corollary 8. When the eigengap λr λr+1 is lower bounded by a positive constant, it is straightforward to verify that T0 = O(log(p+n)). In other words, thresholded gradient descent drives down the statistical error of the estimator to the desired rate after O(log(p + n)) iterates. It is worth noting that the contraction rate ξ does not depend on ambient dimension p, which can be much larger than r and s. Condition (22) on initial estimator is related to the notion of basin of attraction . It has previously appeared in the literature of machining learning with nonconvex optimization in other contexts, e.g., Chen and Candes (2015); Chi et al. (2019); Wang et al. (2014). The global landscape of (12) could be hard to handle due to non-convexity and sparsity constraints. However, within the basin of attraction, the objective function f( ) is locally smooth and strongly convex, which allows projected gradient descent to find a statistically sound solution at geometric convergence rate. Outline of proof To prove Theorem 7, we track the estimator trajectory {V t : t 0} over iteration. To this end, we first characterize a global high probability event under which the entire trajectory would lie inside the basin of attraction within which the objective function is smooth and strongly convex. On the event, we analyze in sequel the effects of the gradient step and the hard thresholding step in each iteration. In particular, we show that each gradient step drives down the distance between the target of estimation and the current estimator while the hard thresholding step projects the current estimator onto the feasible set. For details, see Propositions 18 and 19, respectively. 
Combining the two propositions, we obtain a recursive inequality that characterizes the estimator trajectory Sparse GCA and Thresholded Gradient Descent as in (24). This completes the major steps in the proof. Finally, Corollary 8 is a direct consequence of normalization on the same event that we performed the foregoing analysis of iteration. For proof details, see Appendix B. 4.2 Analysis of initialization The following theorem characterizes the estimation accuracy of the proposed initial estimator via generalized Fantope projection. Theorem 9 Suppose s2 log p n(λr λr+1)2 ϵ (26) for some sufficiently small ϵ > 0. Let b F be the solution to (14) where ρ = γ q n for γ [γ1, γ2] for some positive constants γ1 < γ2. There exist constants C, C > 0 such that uniformly over Pn, with probability at least 1 exp( C (s + log(ep/s))), b F AA 2 F C (Pk i=1 si)2 log(Pk i=1 pi) n(λr λr+1)2 = O s2 log p n(λr λr+1)2 The following corollary further bounds the estimation accuracy of b A0 through the lens of V 1 defined in line 2 of Algorithm 1. Corollary 10 Suppose that the conditions of Theorem 9 hold. Under the choice of s and condition (23) in Theorem 7, there exist constants C, C > 0 such that uniformly over Pn, with probability at least 1 exp( C (s + log(ep/s))), dist(V, V 1) Cs λr λr+1 where V is defined in (21). Combining Theorems 7 and 9 and Corollaries 8 and 10, we obtain the following corollary on the whole procedure: Algorithm 1 with initialization via generalized Fantope projection. Corollary 11 Suppose that n C0 max (1 + λ2 1)(1 + λ2 r+1) (λr λr+1)4 rs log p, s2 log p (λr λr+1)2 for some sufficiently large positive constant C0, that T T0 with T0 in (25), and that all the other conditions in Theorems 7 and 9 are satisfied. There exist constants C, C > 0 such that uniformly over Pn, with probability at least 1 exp( C (s + log(ep/s))), the final output satisfies dist(A, b AT ) C1 1 + λ2 r+1 λr λr+1 4.3 A lower bound for finite k Corollary 8 provides upper bounds for the proposed procedure in Algorithm 1. For finite k, we have the following information-theoretic lower bound result. Theorem 12 Assume that 1 r mini si 2 , and that n(λr λr+1)2 C0 r + max i log epi for some sufficiently large positive constant C0. Then there exist positive constants c and c0 such that the minimax risk for estimating A satisfies inf b A sup Σ F EΣdist2(A, b A) c0 c n(λr λr+1)2 i=1 si log epi The proof of Theorem 12 is given in Appendix C. Comparing Theorem 12 and Corollary 8, we see that when r is finite and λ1 C for some large positive constant C, as long as there exists i {1, . . . , k} such that si and pi are of the same order as s and p simultaneously, the lower and upper bounds match. Otherwise, they differ by at most a multiplicative factor of log p. 5. Numerical Results This section reports numerical results on synthetic data sets. Except for the settings in Section 5.4 and Section 5.6, r, the latent dimension in model (5) is assumed to be known. Choices of tuning parameters are specified in each setting. In practice, they can also be selected using cross-validation on grids. The rest of this section is organized as follows. We first study numerical errors for estimating generalized eigenspaces of different dimensions. For the special case of r = 1, we also compare our method with the Rifle method in Tan et al. (2018). Next, we assess the performance of Algorithm 1 in the context of sparse CCA by comparing it to the Co La R method in Gao et al. (2017). 
Furthermore, we consider a potential model mis-specification scenario where the input latent dimension of the algorithm differs from the true value. We then apply Algorithm 1 to perform sparse PCA of correlation matrices. Finally, we investigate the performance of Algorithm 1 under a general covariance structure.

5.1 Sparse GCA with different latent dimensions

We first consider sparse GCA of three high-dimensional data sets. In particular, we set n = 500, k = 3, p2 = p3 = 200, p1 = 500, and s1 = s2 = s3 = 5. To generate covariance matrices Σ and Σ0, we use the latent variable model specified in Section 2. Specifically, Σ is a block matrix with

Σ{ij} = T{i} U{i} U{j}⊤ T{j},  Σ{ii} = T{i},  for i ≠ j ∈ {1, 2, 3},  where  U{1}⊤ T{1} U{1} = U{2}⊤ T{2} U{2} = U{3}⊤ T{3} U{3} = I.

Here each Σ{ii} = T{i} is a Toeplitz matrix, defined by setting (T{k})_{ij} = σ_{kij} where σ_{kij} = a_k^{|i−j|} for all i, j ∈ [p_k], with a1 = 0.5, a2 = 0.7, a3 = 0.9. To generate U{i} ∈ R^{pi × r}, we first randomly select a support of size 5. For each row in the support, we generate its entries as i.i.d. standard normal random variables. Then all U{i}'s are normalized with respect to T{i}. With the foregoing construction, it is straightforward to verify that λr − λr+1 = 2 and that Col(A{i}) = Col(U{i}). Finally, Σ0 = diag(Σ{11}, Σ{22}, Σ{33}) contains the block diagonal elements of Σ.

We vary r in {1, 2, 3, 4, 5} and report squared matrix distances, defined as dist²(A, Â) = min_{O ∈ O(r)} ||ÂO − A||²_F, for both initial and final estimators based on 50 repetitions in each setting. For tuning parameters in Algorithm 1, we set s′ = 20, η = 0.001, λ = 0.01, and T = 15000. The tuning parameter ρ for generalized Fantope initialization is set to 1/√n. The truncation parameter for initialization is also set to s′ = 20.

Table 1 reports the results of the aforementioned simulation study. For all latent dimensions, we observe a significant decrease in estimation error after Algorithm 1 is applied. This corroborates the theory in Section 4. To better understand how each iteration of Algorithm 1 improves estimation, we plot dist(V, V̄_t) and dist(A, Â_t) in logarithmic scale against the iteration counter t for r = 1, 2, 3, 4, 5 and n = 500. For any t, we set Â_t = V̄_t (V̄_t⊤ Σ̂0 V̄_t)^{-1/2}, which agrees with the definition of Â_T in the last line of Algorithm 1. From Figure 1, we observe an approximately linear decay at the beginning in all cases, which corresponds to exponential decay in the original scale. Moreover, after sufficiently many iterations, all error curves plateau, which suggests that the performance of the resulting estimators has stabilized. Both phenomena agree well with the theoretical findings in Theorem 7.

Table 1: Median errors of initial (generalized Fantope) and final (Algorithm 1) estimators in squared matrix distance out of 50 repetitions. Median absolute deviations of errors are reported in parentheses.

Error    | r = 1           | r = 2           | r = 3           | r = 4           | r = 5
Initial  | 0.1319 (0.0737) | 0.2308 (0.0591) | 0.2746 (0.0652) | 0.2354 (0.0483) | 0.1969 (0.0237)
Final    | 0.0015 (0.0030) | 0.0072 (0.0209) | 0.0098 (0.0301) | 0.0121 (0.0071) | 0.0171 (0.0061)

Tan et al. (2018) proposed a truncated Rayleigh flow method, called Rifle, to solve the sparse generalized eigenvalue problem when r = 1. Here we compare our method with Rifle under the same experiment setting as above, with n = 500 and 1000. For TGD we set the tuning parameters to be s′ = 20, η = 0.01, λ = 0.01 and T = 15000.
For Rifle we also use s′ = 20 for truncation, the step size η is set to the default value 0.01, and we run it for sufficiently many iterations until the estimation errors no longer improve. We vary the sample size n in {500, 1000} and each experiment setting is repeated 50 times. We report median estimation errors of the initial and final estimators of both methods, measured in squared matrix distance, in Table 2. Both Rifle and our method produce highly accurate final estimators, which significantly decrease the errors of their respective initializers. Final estimators by our algorithm are slightly more accurate in both settings.

Figure 1: Plots of dist(V, V̄_t) (left) and dist(A, Â_t) (right) (both in logarithmic scale) vs. iteration counter t.

Table 2: Median errors of initial and final estimators of both Rifle and Algorithm 1 (TGD) in squared matrix distance out of 50 repetitions. Median absolute deviations of errors are reported in parentheses.

         | Rifle                             | TGD
         | Initial Error   | Final Error     | Initial Error   | Final Error
n = 500  | 0.1225 (0.0562) | 0.0006 (0.0005) | 0.1458 (0.0713) | 0.0003 (0.0011)
n = 1000 | 0.0805 (0.0449) | 0.0003 (0.0012) | 0.0920 (0.0504) | 0.0002 (0.0003)

5.2 Choices of tuning parameters

In this section, we study the selection of tuning parameters. For the sparsity level s′ in the hard thresholding step of Algorithm 1, we show that a five-fold cross-validation approach works reasonably well. For the Lagrangian multiplier parameter λ, we show that the estimation procedure is robust with respect to it, and we recommend fixing it at some small positive constant. For the simulations below, we fix n = 500 and p = 900. In addition, we fix the other tuning parameters throughout the experiments as follows: ρ = 1/√n for Fantope initialization, and η = 0.001, T = 15000 for Algorithm 1.

Choice of sparsity level s′. The procedure for the selection of s′ is as follows. We first randomly split the data X into five folds of equal sizes. For l = 1, ..., 5, we use one fold as the test set X^test_(l) and the other four folds combined as the training set X^train_(l). For each value of s′ in a pre-specified grid G′, we apply Algorithm 1 on X^train_(l) to obtain an estimator Â^train_(l). Then we compute the test GCA score defined as CV_(l)(s′) = Tr(Â^train_(l)⊤ Σ̂^test_(l) Â^train_(l)). Here Σ̂^test_(l) is the sample covariance matrix of the test data X^test_(l). Finally, we compute the cross-validation score for s′, defined as CV(s′) = (1/5) \sum_{l=1}^{5} CV_(l)(s′). Upon obtaining the cross-validation scores for all s′ ∈ G′, we select the sparsity level that maximizes CV(s′). In the experiments here, we set the grid G′ as {5i : i = 1, ..., 20}. We fix r = 3, λ = 0.01, and consider three cases: s1 = s2 = s3 = 5 (Case I), s1 = s2 = s3 = 15 (Case II), and s1 = s2 = s3 = 20 (Case III). The true sparsity levels in the three cases are then s = 15, 45, and 60, respectively. The results of the five-fold cross-validation procedure for the three cases are plotted in Figure 2. Here, we plot CV(s′) against s′ ∈ G′ and use error bars to indicate the range of one standard deviation of the cross-validation scores.

Figure 2: Plots of cross-validation score CV(s′) against s′ ∈ G′ for Case I (s = 15, left), Case II (s = 45, middle), and Case III (s = 60, right). Curves indicate averages over five folds and error bars indicate one standard deviation.

Figure 2 shows that in all three cases, the cross-validation scores increase with s′ until reaching a plateau and then fluctuate around some constant when s′ is set to larger values.
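The cross-validation loop just described can be written compactly as below. This is a sketch assuming NumPy; the fitting routine is passed in as a callable (for example, an implementation of Algorithm 1 returning Â for a given s′), and all names are ours rather than the paper's.

```python
import numpy as np

def select_sparsity_by_cv(X, fit, grid, n_folds=5, seed=0):
    """Five-fold CV for s': pick the value maximizing the average test GCA score CV(s').

    X    : (n, p) data matrix.
    fit  : callable fit(X_train, s_prime) -> A_hat of shape (p, r), e.g. Algorithm 1.
    grid : iterable of candidate s' values.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = {}
    for s_prime in grid:
        cv = []
        for l in range(n_folds):
            test = folds[l]
            train = np.concatenate([folds[m] for m in range(n_folds) if m != l])
            A_hat = fit(X[train], s_prime)
            Xc = X[test] - X[test].mean(axis=0)
            Sigma_test = Xc.T @ Xc / len(test)                   # test-fold sample covariance
            cv.append(np.trace(A_hat.T @ Sigma_test @ A_hat))    # CV_(l)(s')
        scores[s_prime] = np.mean(cv)                            # CV(s')
    return max(scores, key=scores.get)
```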
The sparsity levels selected by this procedure are slightly larger than the true sparsity levels in all three settings. Specifically, the selected sparsity levels for the three cases are 25, 60, and 75, respectively. This suggests that our previous choice of s = 20 for s = 15 in Section 5.1 is reasonable. Choice of penalty λ. We now study the impact of different choices of λ on our estimator. To this end, we vary λ on a log scale and consider all values in the set {0.0001, 0.001, 0.01, 0.1, 1} and compare the final estimation errors across these choices. We fix s = 20 and test using the same simulation setting as in Section 5.1 with r = 3 and 4. We set all other tuning parameters in the same way and only change λ. Table 3 reports squared distances from estimators to true parameter values based on 50 repetitions in each (r, λ) value combination. From Table 3, we notice that for both r = 3 and r = 4, the final estimation error is robust with respect to the choice of λ. 5.3 Sparse CCA To further compare the performance of our algorithm with a benchmark, we compare its performance in sparse CCA problems with the Co La R method proposed by Gao et al. (2017). Instead of using gradient descent and hard thresholding, Gao et al. (2017) refined the initial estimate by a linear regression with group Lasso penalty. In addition, their λ 0.0001 0.001 0.01 0.1 1 r = 3 0.0121(0.0364) 0.0136(0.0252) 0.0098(0.0301) 0.0093(0.0140) 0.0099(0.0108) r = 4 0.0145(0.0068) 0.0152(0.0102) 0.0121(0.0071) 0.0101(0.0055) 0.0119(0.0093) Table 3: Median errors of the final (Algorithm 1) estimators in squared matrix distance out of 50 repetitions for r = 3, 4 and different choices of λ. Median absolute deviations of errors are reported in parentheses. estimator was shown to achieve the optimal estimation rate for canonical loading matrix under a prediction loss that is different from (18). For fair comparison, we use the same simulation settings as in Gao et al. (2017). Using the notation in Remark 4, we set p1 = p2, Σx = Σy = M, r = 2, θ1 = 0.9, θ2 = 0.8. The covariance matrix between X and Y is generated from the canonical pair model Σxy = Σx V Θ2W Σy Recall that from the relationship between Θ2 and Λ2 we obtain that λ1 = θ1+1 = 1.9 and λ2 = θ2+1 = 1.8. Hence the leading two generalized eigenvalues are 1.9 and 1.8. Moreover, the row supports for both V and W are set to be {1, 6, 11, 16, 21}. The values at the nonzero coordinates are obtained by normalizing (with respect to M = Σx = Σy) random numbers drawn from the uniform distribution on the finite set { 2, 1, 0, 1, 2}. We consider three different choices of M as follows: (i) Identity: M = I; (ii) Toeplitz: M = (σij) where σij = 0.3|i j| for all i, j; (iii) Sparse Inv: M = σ0 ij/ q σ0 iiσ0 jj where Σ0 = (σ0 ij) = Ω 1 for Ω= ωij with ωij = 1i=j + 0.5 1|i j|=1 + 0.4 1|i j|=2 i, j [p1]. We consider the same four configurations of (n, p1, p2) as in Gao et al. (2017). For Co La R and its initialization, we set all tuning parameters to those values used in Gao et al. (2017). For tuning parameters in Algorithm 1 and its initialization, we set ρ = 1 n , s = 20, λ = 0.01, η = 0.001, and T = 10000. Upon obtaining the final estimate b A = [ b A {1}, b A {2}] from Algorithm 1, we calculate estimators for V and W as b V = b A{1}( b A {1}bΣx b A{1}) 1/2 and c W = b A{2}( b A {2}bΣy b A{2}) 1/2. We summarize simulation results of the foregoing settings in Tables 4 6, each corresponding to a different choice of M. Following Gao et al. 
(2017), we measure estimation error by prediction loss defined as L(V, b V ) = inf O O(r) Σ1/2 x (b V O V ) 2 F and L(W, c W) = inf O O(r) Σ1/2 y (c WO W) 2 F. Each reported number is the median error out of 50 independent repetitions. In Tables 4 6, the first four columns collect results of Co La R. Columns V -GFP and W-GFP collect errors of initial estimators by generalized Fantope projection, while columns V -TGD and W-TGD report errors of final estimators by Algorithm 1. In all settings, both Co La R and Algorithm 1 yield good final estimators, which significantly improve over their respective initializers. In addition, Algorithm 1 outperforms Co La R in most settings. Sparse GCA and Thresholded Gradient Descent (n,p1,p2) V -init W-init V -Co La R W-Co La R V -GFP W-GFP V -TGD W-TGD (300,300,200) 0.2885 0.1706 0.0511 0.0601 0.1391 0.2636 0.0118 0.0118 (600,600,200) 0.3236 0.2004 0.0638 0.0764 0.3420 0.1655 0.0237 0.0485 (300,300,500) 0.1202 0.0664 0.0135 0.0166 0.2646 0.3078 0.0100 0.0076 (600,600,500) 0.1408 0.0811 0.0176 0.0209 0.2459 0.1876 0.0213 0.0157 Table 4: Comparison of Co La R and Algorithm 1: median errors out of 50 repetitions with Toeplitz covariance structure. (n,p1,p2) V -init W-init V -Co La R W-Co La R V -GFP W-GFP V -TGD W-TGD (300,300,200) 0.2653 0.1712 0.0498 0.0646 0.3413 0.5688 0.0454 0.0211 (600,600,200) 0.3167 0.2087 0.0671 0.0776 0.5016 0.7134 0.0231 0.0559 (300,300,500) 0.1207 0.0665 0.0135 0.0159 0.3580 0.1713 0.0124 0.0117 (600,600,500) 0.1448 0.0817 0.0166 0.0203 0.3906 0.4095 0.0082 0.0129 Table 5: Comparison of Co La R and Algorithm 1: median errors out of 50 repetitions with Identity covariance structure. (n,p1,p2) V -init W-init V -Co La R W-Co La R V -GFP W-GFP V -TGD W-TGD (300,300,200) 0.5552 0.5718 0.1568 0.1194 0.2021 0.4924 0.0719 0.0421 (600,600,200) 0.5596 0.6133 0.2123 0.1572 0.4653 0.2222 0.1555 0.0686 (300,300,500) 0.2695 0.1917 0.0242 0.0219 0.1592 0.3058 0.0042 0.0056 (600,600,500) 0.3068 0.2368 0.0338 0.0271 0.1853 0.1404 0.0454 0.0491 Table 6: Comparison of Co La R and Algorithm 1: median errors out of 50 repetitions with Sparse Inv covariance structure. In summary, both Co La R and Algorithm 1 consistently estimate the leading r canonical loading vectors, while Algorithm 1 has a slight advantage. Moreover, Algorithm 1 is more desirable due to its generality beyond the sparse CCA setting. 5.4 Model mis-specification To investigate the robustness of our method, we consider one possible mis-specification of the model in the case of two data sets: There are 3 pairs of nontrivial canonical correlations present but we only set r = 2. As we are in the CCA setting, we consider three types of covariance matrices specified in the previous subsection. For all three cases, the first two canonical loading vectors and generalized eigenvalues are generated in the same way as the previous subsection. In addition, we also set the support of the third vector to be {1, 6, 11, 16, 21}. We fix θ3 = 0.3 (i.e., λ3 = 1.3) and focus on the configuration (n, p1, p2) = (300, 300, 500). All tuning parameters are set in the same way as in last subsection. Simulation results under these settings can be found in Table 7, and we continue to use prediction loss as in the previous subsection. 
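For reference, both the matrix distance in (18) and the prediction loss used here are Procrustes-type criteria and can be evaluated in a few lines. The sketch below assumes NumPy and SciPy; the function names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def procrustes_align(U, V):
    """Orthogonal P minimizing ||U P - V||_F, obtained from the SVD of U'V."""
    W, _, Zt = np.linalg.svd(U.T @ V)
    return W @ Zt

def matrix_dist(U, V):
    """dist(U, V) = min over orthogonal P of ||U P - V||_F, as in (18)."""
    return np.linalg.norm(U @ procrustes_align(U, V) - V)

def prediction_loss(V_hat, V, Sigma_x):
    """L(V, V_hat) = min over orthogonal O of ||Sigma_x^{1/2} (V_hat O - V)||_F^2."""
    S = np.real(sqrtm(Sigma_x))
    return matrix_dist(S @ V_hat, S @ V) ** 2
```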
Similar to Co La R, Algorithm 1 continues to Covariance structure V -Co La R W-Co La R V -TGD W-TGD Toeplitz 0.0197 0.0197 0.0291 0.0157 Identity 0.0190 0.0195 0.0113 0.0155 Sparse Inv 0.0348 0.0263 0.0112 0.0488 Table 7: Comparison of Co La R and Algorithm 1: median errors of estimating first two generalized eigenvectors out of 100 repetitions in mis-specified models. produce accurate estimates when the input latent dimension rin is smaller than the true value r and there is an eigengap between θrin and θrin+1, and thus between λrin and λrin+1. 5.5 Sparse PCA of correlation matrices We now apply Algorithm 1 to performing sparse PCA of correlation matrices. See Remark 5 for a detailed account on how sparse PCA of correlation matrices can be cast as a special case of sparse GCA. By Remark 5, let bΣ0 be the diagonal matrix with p sample variances on the diagonal, we could use bΣ1/2 0 b A as our estimator of the r leading eigenvectors of correlation matrix R = Σ 1/2 0 ΣΣ 1/2 0 , where b A is the estimator produced by Algorithm 1. We report the squared matrix distance for estimating the leading eigenspace L(ER, b A) = min O O(r) bΣ1/2 0 b AO ER 2 F for initialization and final error, where ER is the leading eigenspace of R. Given ambient dimension p, sparsity s, latent dimension r and generalized eigenvalues λ1 λr > 1, we generate a p p correlation matrix R with leading eigenvalues λ1 λr > λr+1 = 1 and s-sparse leading eigenvectors according to the procedure described in Section D. To further construct the covariance matrix Σ, we generate Σ0 as a diagonal matrix with i.i.d. uniform random numbers in [0.1, 1] as diagonal elements. Finally, we set Σ = Σ1/2 0 RΣ1/2 0 . In our simulation study, we set p = 500, s = 20, r = 3, and let sample size n be either 500 or 2000. The tuning parameter for initialization is set as ρ = 1 n . Tuning parameters of Algorithm 1 are set to be s = 40, T = 20000, λ = 0.01, and η = 0.001. The leading generalized eigenvalues are set as {5, 5, 5} (case I) or {7, 5, 3} (case II). For both cases, our construction of Σ ensures that λ4 = 1. n Case Initial Error Final Error n = 500 I 0.4603 (0.1404) 0.0851 (0.0306) II 0.6379 (0.0567) 0.1001 (0.0098) n = 2000 I 0.0882 (0.0177) 0.0184 (0.0025) II 0.1060 (0.0697) 0.0258 (0.0025) Table 8: Sparse PCA of correlation matrices by Algorithm 1: median errors of initial and final estimators out of 50 repetitions. MADs are reported in parentheses. Sparse GCA and Thresholded Gradient Descent We report results from 50 repetitions for each case and each sample size in Table 8. From Table 8, it can be seen that in all settings, Algorithm 1 yields accurate estimators of leading eigenspaces of correlation matrices. Due to a larger eigengap in the first case, it has slightly better estimation accuracy which corroborates our theory. 5.6 Sparse GCA with general covariance structure In this section, we investigate the performance of our algorithm under a general covariance structure and demonstrate empirically that our algorithm works beyond latent variable models as long as the eigengap condition holds. We consider sparse GCA of three high-dimensional data sets with p1 = 500, p2 = p3 = 200 and let the rank of all off-diagonal blocks of Σ be pmin = mini pi = 200. To generate covariance matrices Σ and Σ0, we first define a diagonal matrix Θ with Θii = 2/i for i = 1, 2, . . . , pmin. We then define Σ as a block matrix with Σ{ij} = T{i}U{i}ΘU {j}T{j}, Σ{ii} = 2T{i}, for i = j {1, 2, 3}, U {1}T{1}U{1} = U {2}T{2}U{2} = U {3}T{3}U{3} = I. 
Here, each Σ{ii} = 2T{i} and each T{i} is a Toeplitz matrix, defined by setting (T{k})ij = σkij where σkij = a|i j| k for all i, j [pk] with a1 = 0.5, a2 = 0.7, a3 = 0.9. The first five columns of each U{i} have row sparsity level si = 5. As a result, the support sizes of the first three GCA loading vectors are at most s = 15. For each U{i} Rpi pmin, we define U{i} = [U{i}(1), U{i}(2)] where U{i}(1) Rpi si and U{i}(2) Rpi (pmin si). The submatrices U{i}(1), U{i}(2) are generated as follows. To generate each U{i}(1), we first randomly select a support of size si. For each row in the support of U{i}(1), we generate its entries as i.i.d. standard normal random variables and fill the remaining entries with zeros. Then, we normalize all U{i}(1) s with respect to T{i}. To generate U{i}(2), we first perform SVD on U{i}(1) T 1/2 {i} = P{i}D{i}Q {i} to obtain Q{i} O(pi). We then take e Q{i} = Q{i}, J for J = {si + 1, . . . , pmin} which is the submatrix of Q{i} consisting of the (si + 1)th to the pminth columns of Q{i}. By construction, we have U{i}(1) T 1/2 {i} e Q{i} = 0 and e Q{i} Rpi (pmin si). Finally, we define U{i}(2) = T 1/2 {i} e Q{i}. By construction, we have U{i}(2) T{i}U{i}(2) = I and U{i}(2) T{i}U{i}(1) = 0. Thus, the leading five columns of each U{i} are sparse and we have U {i}T{i}U{i} = I. With the foregoing construction, it is straightforward to verify that the ith generalized eigenvalue is given by λi = 1 + 2i 1 for i = 1, 2, . . . , pmin and the non-trivial GCA loadings are given by [U {1}, U {2}, U {3}] . Therefore, there are eigengaps between any consecutive generalized eigenvalues above one and only the first five GCA loading vectors are sparse. Under the foregoing setup, we aim at estimating the leading GCA loading vector (Case I), the matrix with leading two GCA loading vectors (Case II), and the matrix with leading three GCA loading vectors (Case III). For tuning parameters, we set s = 20 for sparsity level, η = 0.001, λ = 0.01, and T = 15000 in Algorithm 1, and ρ = 1 n in (14). We consider two different sample sizes: n = 500 and n = 2000. Table 9 reports the squared matrix distances for the initial and the final estimators from true value defined as dist2(A, b A) = min O O(r) b AO A 2 F for all three cases and two different sample sizes. Numbers in each cell are the median and the MAD of results over 50 repetitions. Population parameters are also generated independently in each repetition. To gauge the scale of errors, we also report squared Frobenius norm A 2 F of the estimand in the last column of Table 9. n Case Initial Error Final Error A 2 F I 0.0500 (0.0358) 0.0005 (0.0007) 0.5669 II 0.1586 (0.0420) 0.0605 (0.0313) 1.2433 III 0.3097 (0.1484) 0.1831 (0.1107) 1.8378 I 0.0174 (0.0064) 0.0001 (0.0002) 0.5669 II 0.0390 (0.0268) 0.0136 (0.0063) 1.2433 III 0.0741 (0.0719) 0.0369 (0.0178) 1.8378 Table 9: Sparse GCA with general covariance structure: median errors of initial and final estimators out of 50 repetitions with MADs reported in parentheses. Medians of squared Frobenius norms of estimands are listed in the last column. From Table 9, it can be seen that in all settings, Algorithm 1 consistently yields improved estimates of leading GCA loading vectors whenever there is an eigengap. Appendix A. Proof of Lemmas in Section 2 A.1 Proof of Lemma 1 Proof For Λr = diag(λ1, . . . , λr). Since A solves (3), we have Σ 1 0 ΣA = AΛr. (31) By model (5) (7), we have Σ = Ψ + UU and Σ0 = diag(Σ{11}, . . . 
, Σ{kk}), and so Ip1 Σ 1 {11}U{1}U {2} Σ 1 {11}U{1}U {k} Σ 1 {22}U{2}U {1} Ip2 Σ 1 {22}U{2}U {k} ... ... ... ... Σ 1 {kk}U{k}U {1} Σ 1 {kk}U{k}U {2} Ipk Thus, (31) leads to A{1} + Σ 1 {11}U{1} P j =1 U {j}A{j} ... A{k} + Σ 1 {kk}U{k} P j =k U {j}A{j} A{1}Λr ... A{k}Λr Collecting terms, we obtain Σ 1 {11}U{1} P j =1 U {j}A{j} ... Σ 1 {kk}U{k} P j =k U {j}A{j} A{1}(Λr Ir) ... A{k}(Λr Ir) = A(Λr Ir). Sparse GCA and Thresholded Gradient Descent Since both A and Λr Ir are of rank r under the assumption of the lemma, we complete the proof. A.2 Proof of Lemma 3 Before the proof of the lemma, we recall Weyl s inequality (Weyl, 1912) stated as follows. Lemma 13 (Weyl s inequality) Let A, B be two p p Hermitian matrices and si(A) be the ith eigenvalue of A. Then for 1 l, j p, we have sl(A + B) sj(A) + sk(B), for l j + k 1, sj(A) + sl(B) sj+l p(A + B), for j + l p. Proof The proof is adapted from Fan et al. (2019). It is composed of three parts. First we show that if i r + 1, λi 1. Then we will prove that λr > 1. Finally we prove the result on multiplicity of 1. Recall that the generalized eigenvalues are identical to the eigenvalues of R = Σ 1/2 0 ΣΣ 1/2 0 . From now on we study the eigenvalues of R directly. Step (1): Note that we can write R as R = Σ 1/2 0 ΣΣ 1/2 0 = Σ 1/2 0 (UU + Ψ)Σ 1/2 0 = Q1Q 1 + Q2Q 2 where Q1 = Σ 1/2 0 U Q2 = Σ 1/2 0 Ψ1/2. We first show that Q2Q 2 op 1. To this end, note that Q2Q 2 op = Q 2 Q2 op = Ψ1/2Σ 1 0 Ψ1/2 op. Note that Ψ1/2Σ 1 0 Ψ1/2 is a block diagonal matrix with the ith block given by Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii}. Thus, it suffices to show that Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii} op 1, for i [k]. To this end, by Woodbury matrix identity, Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii} =Ψ1/2 {ii}[Ψ 1 {ii} Ψ 1 {ii}U{i}(Ir + U {i}Ψ 1 {ii}U{i}) 1U {i}Ψ 1 {ii}]Ψ1/2 {ii} =Ipi Ψ 1/2 {ii} U{i}(Ir + U {i}Ψ 1 {ii}U{i}) 1U {i}Ψ 1/2 {ii} . Therefore, 0 Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii} Ipi, and so Q2Q 2 op 1. By Lemma 13, we have, for i j + k 1, λi = si(R) sj(Q1Q 1 ) + sk(Q2Q 2 ). Now we set j = r + 1, k = 1 and this gives λi= si(R) sr+1(Q1Q 1 ) + s1(Q2Q 2 ) = s1(Q2Q 2 ) for i r + 1. The last equality holds since we have rank(Q1Q 1 ) = r and so sr+1(Q1Q 1 ) = 0. As a result, λi= si(R) s1(Q2Q 2 ) = Q2Q 2 op 1. This finishes the proof of the first step. Step (2): By Lemma 13 and Assumption 2, we have λr= sr(R) = sr(Q1Q 1 + Q2Q 2 ) sr(Q1Q 1 ) + sp(Q2Q 2 ) > sr(Q1Q 1 ) = σ2 r(Σ 1/2 0 U) 1. Here, the strict inequality holds since Q2Q 2 is of full rank. This finishes the proof for the rth eigenvalue. Step (3): Now we calculate the multiplicity of 1. To this end, we resort to Theorem 2.2 in Tian (2004), which gives rank(U U Y U ) = rank(U UY U) = rank(U U U {1}U{1} U {k}U{k}) = rank(e U) + rank(U) i=1 rank(U{i}). 0 Σ{12} Σ{13} ... Σ{1k} Σ{21} 0 Σ{23} ... Σ{2k} Σ{31} Σ{32} 0 ... Σ{3k} ... ... ... ... ... Σ{k1} Σ{k2} Σ{k3} ... 0 Note that we have Σ{ij} = U{i}U {j}, rank(U) = r, by the latent variable model. Moreover, it is easy to verify that multiplicity of 1 as an eigenvalue of R is the same as multiplicity of 0 as an eigenvalue of e U, which in turn equals p rank(e U). Thus, multiplicity of 1 is p rank(e U) = p rank(U U Y U ) + rank(U) i=1 rank(U{i}) i=1 rank(U{i}) + r rank(U UY U). This finishes the proof of the lemma. Appendix B. Proof of Main Results We first provide a lemma about the distance defined in (18) which is adapted from Lemmas 5.3 and 5.4 from Tu et al. (2015). Lemma 14 For any U, V Rp r, we have dist2(U, V ) 1 2( 2 1)σ2r(V ) UU V V 2 F. 
Sparse GCA and Thresholded Gradient Descent If further that dist(U, V ) 1 4 V op, we have 4 U opdist(U, V ). Before diving into the proof, we present a list of nice events on the intersection of which all desired results hold deterministically. Recall that S is the true row support of A with cardinality s. Let I {1, 2, 3, ...p} be an index set. We use b A(I) Rp r to denote the leading r generalized eigenvectors that solve max bΣ, LL such that L bΣ0L = Ir, supp(L) I. (32) We also denote the diagonal matrix formed by the first r restricted sample generalized eigenvalues by bΛr(I). For some sufficiently large constant C > 0, define B1 = dist(A, b A(I)) C r 1 + λ2 r+1 λr λr+1 for all I S such that |I| 2s + s . Note that the event includes all index sets containing true support S and the sizes of which are at most 2s + s . We define event B2 as B2 = bΣII ΣII op bΣ0,II Σ0,II op C (2s + s) log p for all I [p] with |I| 2s + s . Further we define B3 as B3 = bΣ Σ + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0 + λr+1 Σ0 bΣ0 + λr+1 Σ0AA Σ0 bΣ0AA bΣ0 C Let e A = A(A bΣ0A) 1/2, eΛr = (A bΣ0A)1/2Λr(A bΣ0A)1/2. (36) The event B4 is defined as eΛr Λr F + λr+1 A bΣ0A I F + Σ1/2 0 ( e A A) F C r(s + log(ep/s)) (37) for some sufficiently large constant C > 0. The following lemmas guarantee that all these events occur with high probabilities, uniformly over Pn. In addition, since s s, Lemma 12 in Gao et al. (2015a) implies that B2 happens with probability at least 1 exp( C s log(ep/s )) for some positive constant C , uniformly over Pn. Lemma 15 Suppose condition (23) holds, there exist constants C, C > 0 such that uniformly over Pn, B1 happens with probability at least 1 exp( C s log(ep/s )). Lemma 16 Suppose r p log p/n c for some sufficiently small constant c (0, 1). Then there exist positive constants C, C such that uniformly over Pn, B3 happens with probability at least 1 p C . Lemma 17 Suppose q s+log(ep/s) n c for some sufficiently small constant c (0, 1). Then there exist constants C, C > 0 such that uniformly over Pn, B4 happens with probability at least 1 exp( C (s + log(ep/s))). In the rest of this section, we will show that Theorem 7 and Corollary 8 hold on event B1 B2. Theorem 9 holds on event B3 B4, and Corollary 10 holds on event B2 B3 B4. As a result, the entire algorithm with generalized Fantope initialization yields an estimator satisfying the upper bound in Corollary 11 on B1 B2 B3 B4, which holds with probability at least 1 exp( C (s + log(ep/s))) for some constant C > 0 uniformly over Pn, by the foregoing lemmas and the union bound. This argument proves Corollary 11 about the whole procedure. Before proving the main Theorem, we first specifically study effect of the gradient descent step and the hard thresholding step. The following proposition characterizes the progress in the gradient step. Proposition 18 Let St = supp(V t) S supp(V t+1) S supp(A) be a super set of the row support of V t. Define b V (St) = b A(St) I + bΛr(St) Set η c 12λ1ν(c + 2)2 and λ = λ1 for some constant c < 1. On event B2, suppose V t satisfies dist(V t, b V (St)) 1 4 ν min c(λr λr+1) 2λ1ν2(42 + 25c), 1 + c then after the gradient step, we have dist2(V o t+1, b V (St)) 1 η(λr λr+1) dist2(V t, b V (St)), here we use V o t+1 Rp r to denote a matrix that has the same entries as those in Vt+1 on St [r] and zeros elsewhere. The following proposition characterizes the effect of hard thresholding. Proposition 19 Define V as in (21). 
Then if we perform hard thresholding by selecting the top s elements of Vt+1, we have dist2(V, V t+1) 1 + 2 r s dist2(V, Vt+1). (40) Sparse GCA and Thresholded Gradient Descent B.1 Proof of Theorem 7 We first prove the theorem assuming that Propositions 18 and 19 hold. Proofs of the two propositions will be given later in Sections B.3 and B.4. It is worth noting that the results of both propositions hold deterministically on event B2. The first step is to define the effective support in each step. We first define Ft = supp(V t), recall that S is the true row support of A, then we define the effective restricted set to be St = Ft Ft+1 S. The gradient descent step restricted to St can be viewed as Vt+1,St = V t,St 2η( bΣSt St V t,St + λbΣ0,St St V t,St (V t,St bΣ0,St St V t,St Ir)). Note that applying hard thresholding on V o t+1 (as defined in Proposition 18) is equivalent to applying hard thresholding on the original Vt+1. This allows us to replace the intermediate update by V o t+1 and still obtain the same output sequence V t+1. Thus, we will prove instead for the update using bΣ0,St St, bΣSt St. Proof Throughout the proof, we assume the event B1 B2 happens, which occurs with probability at least 1 exp( C s log(ep/s )) for some constant C > 0, uniformly over Pn. As we have mentioned, the conclusions of Propositions 18 and 19 hold on this event, and the remaining arguments in this proof proceed in a deterministic fashion. We argue by induction on the iteration counter t. Specifically, for t = 1, 2, ..., we will prove that V t satisfies condition (39) and that dist(V t, V ) ξt 1 dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 for some positive constant C1. Base Case: By (23), (22), and definition of B1 in (33), condition (39) is satisfied when t = 1. Moreover, when t = 1 equation (41) holds trivially. Induction Step: Suppose that V t satisfies the radius condition (39) and that the induction hypothesis (41) is satisfied at step t. We are to show that (39) and (41) hold for V t+1. In the gradient step, Proposition 18 shows that under radius condition (39) on dist(V t, b V (St)), if we choose the step-size to be η 1 dist(V o t+1, b V (St)) p 1 αη dist(V t, b V (St)) 1 αη dist(V t, b V (St)), α = (λr λr+1) 4ν , β = 12λν Recall that V = A I + 1 2 , b V (St) = b A(St) I + 1 λ bΛr(St) 1 By (33), on event B1 B2, since λ = λ1/c, we have dist(V, b V (St)) C0 r 1 + λ2 r+1 λr λr+1 for some positive constant C0. Triangle inequality then leads to dist(V o t+1, V ) 1 αη dist(V t, V ) + 2C0 r 1 + λ2 r+1 λr λr+1 Turn to the thresholding step. By Proposition 19, we have dist(V t+1, V ) dist(V o t+1, V ) dist(V o t+1, V ). Combining the last two displays, we obtain that dist(V t+1, V ) 1 + r s dist(V t, V ) 1 + λ2 r+1 λr λr+1 ξ = 1 + r s Note that we can always ensure ξ < 1 by enlarging s appropriately. Specifically, the choice in the theorem, s 16 α2η2 s, ensures that ξ 1 + 2 r s Since |St| 2s + s and s s, we further obtain that dist(V t+1, V ) ξ dist(V t, V ) + C1 1 + λ2 r+1 λr λr+1 Sparse GCA and Thresholded Gradient Descent for some constant C1 > 0. Note that in the above equation C1 does not depend on t. We now show that it is identical with the constant C1 in (41). To this end, note that dist(V t+1, V ) ξ dist(V t, V ) + C1 1 + λ2 r+1 λr λr+1 ξt 1 dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 1 + λ2 r+1 λr λr+1 ξt dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 Here the second inequality is due to the induction hypothesis at the tth iteration. 
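For concreteness, one iteration of the update analyzed above, a gradient step on the Lagrangian followed by row-wise hard thresholding, can be written compactly. The following is a minimal numerical sketch, not Algorithm 1 itself: the matrices S_hat and S0_hat, the tuning values lam, eta, and s_keep, and the toy data are illustrative placeholders rather than the quantities prescribed by the theory. The helper dist computes the metric $\min_{P \in O(r)} \|UP - V\|_F$ via the SVD of $U^\top V$.

```python
import numpy as np

def tgd_step(V, S_hat, S0_hat, lam, eta, s_keep):
    """One illustrative thresholded-gradient-descent step (a sketch, not Algorithm 1)."""
    r = V.shape[1]
    # gradient of f(V) = -<S_hat, VV'> + (lam/2)||V'S0_hat V - I_r||_F^2
    grad = 2.0 * (-S_hat @ V + lam * S0_hat @ V @ (V.T @ S0_hat @ V - np.eye(r)))
    V = V - eta * grad
    # hard thresholding: keep only the s_keep rows with the largest l2 norms
    keep = np.argsort(np.linalg.norm(V, axis=1))[::-1][:s_keep]
    V_new = np.zeros_like(V)
    V_new[keep] = V[keep]
    return V_new

def dist(U, V):
    """min over orthogonal P of ||U P - V||_F, computed via the SVD of U'V."""
    A, _, Bt = np.linalg.svd(U.T @ V)
    return np.linalg.norm(U @ (A @ Bt) - V)

# toy run on synthetic placeholders for the sample covariance matrices
rng = np.random.default_rng(0)
p, r, n = 30, 2, 400
X = rng.standard_normal((n, p))
S_hat = X.T @ X / n
S0_hat = np.diag(np.diag(S_hat))
V = 0.1 * rng.standard_normal((p, r))
for _ in range(200):
    V = tgd_step(V, S_hat, S0_hat, lam=5.0, eta=0.005, s_keep=10)
print(np.count_nonzero(np.linalg.norm(V, axis=1)))  # at most s_keep nonzero rows
```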
Moreover, it can be ensured that V t+1 also satisfies (39) since ξ < 1 and the extra term is bounded by a sufficiently small constant due to (23). As a result, we have shown that both (39) and (41) are satisfied for all t 1. In summary, on B1 B2, for any t 1, dist(V t+1, V ) ξt dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 ξt dist(V 1, V ) + 4C1 1 + λ2 r+1 λr λr+1 ξt dist(V 1, V ) + C s 1 + λ2 r+1 λr λr+1 Here, the second inequality holds since 1 1 ξ 4 α2η2 and the last inequality is due to s 16 α2η2 s. This completes the proof. B.2 Proof of Corollary 8 Proof We prove the desired result on B1 B2, which happens with probability at least 1 exp( C (s log(ep/s ))) for some constant C > 0, uniformly over Pn. For notational convenience, denote the statistical error rate in (24) by ϵn, that is 1 + λ2 r+1 λr λr+1 Under condition (25), the statistical error dominates optimization error in (24). Hence by Theorem 7 dist(V t, V ) is bounded by a constant multiple of ϵn, that is, V t, A I + Λr for some constant C0 > 0. Let P be the orthogonal matrix that minimizes the distance, then we can write V t = A I + Λr 1/2 P + Q , (44) where Q F C0ϵn. Since V t is s row sparse and A is s row sparse, Q is s + s row sparse. The remaining proof is composed of two steps: (1) bounding (V t bΣ0V t) 1/2 P (I + Λr λ ) 1/2P F and (2) bounding V t(V t bΣ0V t) 1/2 AP F. The bound in step (2) then gives the desired bound in the statement of the corollary. In the rest of the proof, let = (V t bΣ0V t) 1/2 P (I + Λr/λ) 1/2P. Step (1): By definition we have V t bΣ0V t = 1/2 A bΣ0A I + Λr 1/2 P + Q bΣ0A I + Λr 1/2 A bΣ0Q + Q bΣ0Q. Since A Σ0A = Ir, we have 1/2 A Σ0A I + Λr 1/2 P = P I + Λr Then we can bound the Frobenius norm for V t bΣ0V t P (I + Λr V t bΣ0V t P (I + Λr λ )P F P (I + Λr λ )1/2A (bΣ0 Σ0)A(I + Λr λ )1/2P F | {z } Term I + 2 Q bΣ0A(I + Λr λ )1/2 F | {z } Term II + Q bΣ0Q F | {z } Term III We bound the three terms on the right side of the last display separately. Since A is s sparse, we have that on event B2, λ )1/2A (bΣ0 Σ0)A(I + Λr λ )1/2P op I + Λr λ op A S (bΣ0,SS Σ0,SS)AS op Since A is of rank r, we can then bound the Frobenius norm of Term I as λ )1/2A (bΣ0 Σ0)A(I + Λr λ )1/2P F C1 Sparse GCA and Thresholded Gradient Descent For Term II, since Q is s + s sparse, we notice that on event B2 and under condition (23), we can bound it as Q bΣ0A(I + Λr λ )1/2 F C2 Q F for some constant C2 > 0. Term III is also dominated by the same upper bound. On B2, the operator norms of V t bΣ0V t and (V t bΣ0V t) 1 are both upper bounded by a positive constant. In addition, by (19), the operator norms of I + Λr/λ and (I + Λr/λ) 1 are also upper bounded by a positive constant. So we conclude that F (V t bΣ0V t) 1/2 op (V t bΣ0V t)1/2 P (I + Λr λ )1/2P F P (I + Λr λ ) 1/2P op C3 V t bΣ0V t P (I + Λr λ )P F C4ϵn. where the second to last inequality is due to Lemma 2 in Supplement of Gao et al. (2017). Step (2): Now we bound b At AP F as follows. Recall we have defined that = (V t bΣ0V t) 1/2 P (I + Λr b At AP F = V t(V t bΣ0V t) 1/2 AP F = (A(I + Λr λ )1/2P + Q)(P (I + Λr λ ) 1/2P + ) AP F = AP + QP (I + Λr λ ) 1/2P + A(I + Λr λ )1/2P + Q AP F λ ) 1/2P F + A(I + Λr λ )1/2P F + Q F Q F P (I + Λr λ ) 1/2P op + A(I + Λr λ )1/2P op F + Q F op for some constant C5 > 0, due to the bounds on Q F and F that we have established in step (1). Combining the results above we deduce that dist( b At, A) b At AP F C5ϵn with probability at least 1 exp( C (s log(ep/s ))) for some positive constants C5 and C , uniformly over Pn. 
This finishes our proof. B.3 Proof of Proposition 18 Proof Throughout the whole proof, we work on event B2 defined in (34) which happens with probability at least 1 exp( C s log(ep/s )) for some constant C > 0, uniformly over Pn. As mentioned before, the gradient step is equivalent to replacing all covariance matrices by bΣSt St, bΣ0,St St respectively due to sparsity of V t, V t+1 (since the output sequence V t remains unaltered after this substitution). Effectively, at t th step of gradient descent the relavant support is St since outside this set the output matrix has row equal to 0. We denote principal submatrices bΣSt St, bΣ0,St St by bΣt R|St| |St|, bΣ0,t R|St| |St| for simplicity in the proof and define the restricted Lagrangian function ft to be ft(L) = bΣt, LL + λ 2 L bΣ0,t L Ir 2 F, where L R|St| r. As a result, 1 2 ft(L) = bΣt L + λbΣ0,t L(L bΣ0,t L Ir). (45) Throughout the proof of Proposition 18, we will work on the restricted function and its gradient. We denote Lt+1 = Lt η ft(Lt). Then we notice that Lt = V t,St , Lt+1 = Vt+1,St by our submatrix notation respectively. Our proof will work on distance involving Lt, Lt+1 which transfers to the desired bound as stated in this proposition. Before the proof we revisit a lemma characterizing the effect of gradient descent. The following lemma is adapted from Lemma 4 in Chi et al. (2019) and it is an extension of the gradient descent condition from vectors to rank r matrices. For notational convenience, we define L t to be a global minimizer of function ft(L) (which will be calculated later) and HX = argmin H O(r) XH L t F. We define a function f(L) to be β smooth at L if for all Z, we have vec(Z) 2f(L)vec(Z) β Z 2 F. Lemma 20 Suppose that ft is β smooth within a ball B(L t ) = {L : L L t F R} and that ft(L)P = ft(LP) for any orthonormal matrix P. Assume that for any L B(L t ) and any Z, we have vec(ZHZ L t ) 2ft(L)vec(ZHZ L t ) α ZHZ L t 2 F. In addition, if η 1 β, then using gradient descent with dist(Lt, L t ) R, we have dist2(Lt+1, L t ) (1 αη) dist2(Lt, L t ). Moreover, with dist(L0, L t ) R, we have dist2(Lt, L t ) (1 αη)t dist2(L0, L t ). In Lemma 4 of Chi et al. (2019) the condition on gradient descent is Lt B(L t ). Here we generalize it to dist(Lt, L t ) R and the proof follows without change of the original proof. By (45), it is straightforward to verify the condition ft(L)P = ft(LP). The remaining proof is composed of three steps: (1) deriving the expression for vec(Z) 2ft(L)vec(Z), (2) verifying the smoothness condition, and (3) verifying the condition on strong convexity. We check the radius condition at the end of the proof. Sparse GCA and Thresholded Gradient Descent Step (1) Recall (45). As a result, 1 2vec ft(L) = (Ir bΣt)vec(L) + λ(Ir bΣ0,t LL bΣ0,t)vec(L) λ(Ir bΣ0,t)vec(L). The main calculation is to deal with vec(bΣ0,t LL bΣ0,t L) = (I bΣ0,t LL bΣ0,t)vec(L). We now directly compute this expression as follows: since L R|St| r, we have vec(L) = [l 1 , l 2 , ..., l r ] where li is the i th column of the matrix. Following this notation, we can write vec(bΣ0,t LL bΣ0,t L) = (I bΣ0,t LL bΣ0,t)vec(L) bΣ0,t LL bΣ0,t 0 ... 0 0 bΣ0,t LL bΣ0,t ... 0 0 0 ... 0 0 0 ... bΣ0,t LL bΣ0,t l1 l2 l3 ... lr As a result, vec(bΣ0,t LL bΣ0,t L) = bΣ0,t LL bΣ0,tl1 bΣ0,t LL bΣ0,tl2 bΣ0,t LL bΣ0,tl3 ... bΣ0,t LL bΣ0,tlr bΣ0,t Pr i=1 lil i bΣ0,tl1 bΣ0,t Pr i=1 lil i bΣ0,tl2 bΣ0,t Pr i=1 lil i bΣ0,tl3 ... 
bΣ0,t Pr i=1 lil i bΣ0,tlr Now we can calculate the derivative vec ft(L) vec(L) , note that we can do this block by block: the jk block entry of the Hessian is just vec ft(L)j 1 2vec ft(L) = (Ir bΣt)vec(L) + λ(Ir bΣ0,t LL bΣ0,t)vec(L) λ(Ir bΣ0,t)vec(L), the only term we have to deal with is the middle one. By previous calculations, we have bΣ0,t Pr i=1 lil i bΣ0,tlj lk = bΣ0,tlkl k bΣ0,tlj lk = l j bΣ0,tlkbΣ0,t + bΣ0,tljl k bΣ0,t, when j = k and bΣ0,t Pr i=1 lil i bΣ0,tlj lj = X i =j bΣ0,tlil i bΣ0,t + 2bΣ0,tljl j bΣ0,t + l j bΣ0,tlj bΣ0,t = bΣ0,t LL bΣ0,t + bΣ0,tljl j bΣ0,t + l j bΣ0,tlj bΣ0,t. As a result, we can combine the above results to obtain the Hessian 1 2 2ft(L) =1 2 vec ft(L) bΣt ... 0 0 bΣt 0 0 ... bΣt bΣ0,t .. 0 0 bΣ0,t 0 0 ... bΣ0,t bΣ0,t LL bΣ0,t + bΣ0,tl1l 1 bΣ0,t + l 1 bΣ0,tl1bΣ0,t ... l 1 bΣ0,tlr bΣ0,t + bΣ0,tl1l r bΣ0,t ... ... ... bΣ0,tlrl 1 bΣ0,t ... bΣ0,t LL bΣ0,t + bΣ0,tlrl r bΣ0,t +l r bΣ0,tl1bΣ0,t ... +l r bΣ0,tlr bΣ0,t Now we proceed to calculate vec(Z) 2ft(L)vec(Z). We have vec(Z) 2ft(L)vec(Z) = X i,j z i 2ft(L)ijzj where zi is the i th column of Z. Substituting the expression on the Hessian, we have 1 2vec(Z) 2ft(L)vec(Z) = i=1 z i bΣtzi λ i=1 z i bΣ0,tzi i =j l i bΣ0,tljz i bΣ0,tzj i =j z i bΣ0,tljl i bΣ0,tzj + i=1 (z i bΣ0,tli)2 i=1 z i bΣ0,t LL bΣ0,tzi + i=1 (l i bΣ0,tli)(z i bΣ0,tli) . Now we claim the following simplification: 1 2vec(Z) 2ft(L)vec(Z) =gt(L, Z) = bΣt, ZZ λ bΣ0,t, ZZ + λ ZZ , bΣ0,t LL bΣ0,t + λ L bΣ0,t L, Z bΣ0,t Z + λ Z bΣ0,t L, L bΣ0,t Z . Sparse GCA and Thresholded Gradient Descent Now we begin to prove this claim. We expand the above expression as follows: gt(L, Z) = bΣt, ZZ λ( bΣ0,t, ZZ ZZ , bΣ0,t LL bΣ0,t L bΣ0,t L, Z bΣ0,t Z Z bΣ0,t L, L bΣ0,t Z ) i=1 z i bΣtzi λ i=1 z i bΣ0,tzi + λ X i,j z i bΣ0,tljl j bΣ0,tzi + λ X i,j z i bΣ0,tljl i bΣ0,tzj i =j l i bΣ0,tljz i bΣ0,tzj + λ i=1 (l i bΣ0,tli)(z i bΣ0,tzi) i=1 z i bΣtzi λ i=1 z i bΣ0,tzi + λ( X i =j l i bΣ0,tljz i bΣ0,tzj + X i =j z i bΣ0,tljl i bΣ0,tzj i=1 (l i bΣ0,tli)(z i bΣ0,tzi) + i=1 z i bΣ0,t j=1 ljl j bΣ0,tzi + i=1 (z i bΣ0,tli)2) i=1 z i bΣtzi λ( i=1 z i bΣ0,tzi) + λ( X i =j l i bΣ0,tljz i bΣ0,tzj + X i =j z i bΣ0,tljl i bΣ0,tzj i=1 (z i bΣ0,tli)2 + i=1 z i bΣ0,t LL bΣ0,tzi + i=1 (l i bΣ0,tli)(z i bΣ0,tli)) 2vec(Z) 2ft(L)vec(Z). This completes the first step. Step (2) In view of the lemma, our next task would be to bound the smoothness parameter in a neighborhood of L t . The neighborhood will be defined by the distance bΣ1/2 0,t L t bΣ1/2 0,t L F δ, (we define in this unusual way due to the normalization constraint and specific δ given by the condition will be explained later), which by triangle inequality gives bΣ1/2 0,t L t op δ bΣ1/2 0,t L op bΣ1/2 0,t L t op + δ. We first find this global minimizer L t up to a rotation matrix. Setting the gradient equal to 0, we have any critical point must satisfy the following equation bΣt L = λbΣ0,t L(L bΣ0,t L Ir). From the above equation, we deduce that the column space of global minimizer should coincide with some generalized eigenvectors (not necessarily leading ones). Hence without loss of generality we assume the global minimizer of function is achieved at when L t = b Lt Dt for Dt being an invertible matrix and b Lt being the generalized eigenvectors with eigenvalues bλI for I [p], |I| = r for sample covariance matrices. Then we have bΣtb Lt Dt = λbΣ0,tb Lt Dt(D2 t Ir) and this gives bΣ0,tb LtbΛIDt = λbΣ0,tb Lt Dt(D2 t Ir). Here we abuse the notation a bit to denote diagonal matrices with entries bλI to be bΛI. 
We deduce that Dt = (Ir + 1 λ bΛI) 1 2 . This is true for any critical point and now we will show that the global minimizer is achieved at I being {1, 2, ..., r}. To see this, note that ft(L t ) = ft(b Lt Dt) = bΣt, b Lt Dt D t b L t + λ 2 D t b L t bΣ0b Lt Dt Ir 2 F = Tr(D t bΛIDt) + λ 2 D t Dt Ir 2 F = Tr(bΛI) 1 λ Tr(bΛ2 I) + 1 2λ Tr(bΛ2 I). We notice that the above quantity is minimized only when bΛI = bΛr(St), that is, when we are selecting the leading r generalized eigenvectors of bΣt with respect to bΣ0,t. As a result, Dt = Ir + 1 λ bΛr(St) 1 2 , L t = b A(St)St λ bΛr(St) 1 2 = b V (St)St , where bΛr(St) is the diagonal matrix with entries being first r generalized eigenvalues for sample covariance matrices as specified before. According to our definition, b A(St)St and b V (St)St are of size |St| r. In the rest of the proof, we denote b V (St)St by b V for simplicity. Similarly, we slightly abuse notation and abbreviate bΛr(St), b A(St)St , b B(St)St as bΛr, b A, b B in the rest of this proof. Then we have bΣ1/2 0,t L t op = bΣ1/2 0,t b V op = q event B2 and assumption in the theorem. Here with slight abuse of notation, we use bλi to denote the ith restricted sample generalized eigenvalue. Now we can bound the smoothness parameter from above by controlling each term in Hessian matrix. Since we are working on event B2, we have that 1 2ν bΣt op 2ν and 1 2ν bΣ0,t op 2ν. Similar bound holds for minimum restricted sample generalized eigenvalue. When bΣ1/2 0,t L t bΣ1/2 0,t L F δ, we have gt(L, Z) = bΣt, ZZ λ bΣ0,t, ZZ + λ ZZ , bΣ0,t LL bΣ0,t + λ L bΣ0,t L, Z bΣ0,t Z + λ Z bΣ0,t L, L bΣ0,t Z 0 + 0 + λ Z 2 F L bΣ0,t 2 op + λ L bΣ0,t LZ F bΣ0,t Z F + λ Z bΣ0,t L F L bΣ0,t Z F 4λν(δ + bΣ1/2 0,t b V op)2 Z 2 F + 2λ(δ + bΣ1/2 0,t b V op)2ν Z 2 F (6λν(δ + bΣ1/2 0,t b V op)2) Z 2 F = 1 Then we can upper bound the Hessian eigenvalue by Step (3) To derive the strong convexity parameter α, we start from the function evaluated at the global minimizer L = b V . Define e Z = ZHZ b V , we now lower bound gt(b V , e Z) by a Sparse GCA and Thresholded Gradient Descent constant multiple of 1 2 e Z 2 F. To do this, we substitute the expression into gt and obtain gt(b V , e Z) = bΣt + λbΣ0,t, e Z e Z + λ b V bΣ0,t b V , e Z bΣ0,t e Z + λ e Z bΣ0,t b V , e Z bΣ0,t b V + λ e Z bΣ0,t b V , b V bΣ0,t e Z = bΣt + λbΣ0,t, e Z e Z + λ I + 1 λ bΛr, e Z bΣ0,t e Z + λ e Z e Z , bΣ0,t b V b V bΣ0,t + λ e Z bΣ0,t b V , b V bΣ0,t e Z = bΣt, e Z e Z + bΛr, e Z bΣ0,t e Z + λ e Z e Z , bΣ0,t b V b V bΣ0,t | {z } Term I + λ e Z bΣ0,t b V , b V bΣ0,t e Z | {z } Term II We deal with Term I and Term II separately. To simplify Term I, we recall that bΣ 1 0,t bΣtbΣ 1 0,t = b AbΛr b A + b BbΛ b B , where we have defined the remaining generalized eigenvectors by b B and the rest of eigenvalues by diagonal entries of bΛ, both on the restrict set St. Here we also omit the dependence on t. 
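As an aside, the two facts established in Step (2), the gradient formula (45) and the closed form of the global minimizer $L_t^* = \widehat{A}(S_t)_{S_t}(I + \widehat{\Lambda}_r(S_t)/\lambda)^{1/2}$, are easy to confirm numerically on small synthetic inputs. The sketch below is purely illustrative; the matrices S and S0 and the value of lam are placeholders, not the restricted sample matrices of the proof.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
m, r, lam = 10, 2, 5.0
G1 = rng.standard_normal((m, m)); S = G1 @ G1.T / m + np.eye(m)    # stand-in for Sigma_hat_t
G2 = rng.standard_normal((m, m)); S0 = G2 @ G2.T / m + np.eye(m)   # stand-in for Sigma0_hat_t

def f(L):
    R = L.T @ S0 @ L - np.eye(r)
    return -np.trace(L.T @ S @ L) + 0.5 * lam * np.sum(R * R)

def grad(L):
    # gradient formula (45): 2(-S L + lam S0 L (L'S0 L - I_r))
    return 2.0 * (-S @ L + lam * S0 @ L @ (L.T @ S0 @ L - np.eye(r)))

# (i) finite-difference check of the gradient formula at a generic point
L = rng.standard_normal((m, r)); E = rng.standard_normal((m, r)); eps = 1e-6
print((f(L + eps * E) - f(L - eps * E)) / (2 * eps), np.sum(grad(L) * E))

# (ii) the gradient vanishes at L* built from the leading generalized eigenpairs,
#      which eigh normalizes so that A_hat' S0 A_hat = I
w, Q = eigh(S, S0)                      # generalized eigenpairs, ascending order
idx = np.argsort(w)[::-1][:r]
A_hat, lam_r = Q[:, idx], w[idx]
L_star = A_hat @ np.diag(np.sqrt(1.0 + lam_r / lam))
print(np.linalg.norm(grad(L_star)))     # ~ 0 up to floating-point error
```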
By the definition of restricted sample generalized eigenvectors, we have Term I = bΛr, e Z bΣ0,t e Z + e Z e Z , λbΣ0,t b V b V bΣ0,t bΣt = e ZbΛr e Z , bΣ0,t + e Z e Z , bΣ0,t(λ b A(I + bΛr/λ) b A bΣ 1 0,t bΣbΣ 1 0,t )bΣ0,t bλr e Z e Z , bΣ0,t + e Z e Z , bΣ0,t(λ b A(I + bΛr/λ) b A ( b AbΛr b A + b BbΛ b B ))bΣ0,t = e Z e Z , bΣ1/2 0,t (bλr I + bΣ1/2 0,t b A(λI) b A bΣ1/2 0,t bΣ1/2 0,t b BbΛ b BbΣ1/2 0,t )bΣ1/2 0,t = e Z e Z , bΣ1/2 0,t (bλr(bΣ1/2 0,t b A b A bΣ1/2 0,t + bΣ1/2 0,t b B b B bΣ1/2 0,t ) + λbΣ1/2 0,t b A b A bΣ1/2 0,t bΣ1/2 0,t b BbΛ b B bΣ1/2 0,t )bΣ1/2 0,t = e Z e Z , bΣ1/2 0,t (bΣ1/2 0,t b A(λ + bλr)I b A bΣ1/2 0,t + bΣ1/2 0,t b B(bλr bΛ) b B bΣ1/2 0,t )bΣ1/2 0,t 2ν e Z e Z , (bλr bλr+1)I = 1 2ν (bλr bλr+1) e Z 2 F. The first inequality is due to the fact we only select the first r generalized eigenvalues, the second inequality follows from the fact that the second term can be bounded using the eigengap since bΛ is a diagonal matrix containing the last |St| r generalized eigenvalues. As a result, Term I can be bounded in terms of eigengap at sample level. Note that the above inequality holds for any e Z. Now we bound Term II. Here we use the fact that HZ is the solution to min P O(r) ZP b V F. By definition of e Z, we hope to bound λ (ZHZ b V ) bΣ0,t b V , b V bΣ0,t(ZHZ b V ) . We will prove now that this can be lower bounded by 0, as is the case in Example 1 (46) in Chi et al. (2019). Note that we have e Z bΣ0,t b V , b V bΣ0,t e Z = Tr(b V bΣ0,t(ZHZ b V )b V bΣ0,t(ZHZ b V ) = Tr(bΣ0,t(ZHZ b V )b V bΣ0,t(ZHZ b V )b V ) = Tr(bΣ0,t ZHZ b V bΣ0,t ZHZ b V ) 2 Tr(bΣ0,t b V b V bΣ0,t ZHZ b V ) + Tr(bΣ0,t b V b V bΣ0,t b V b V ). By Lemma 2 in Ten Berge (1977), we know that ZHZ b V 0 and it is also symmetric. If we denote ZHZ b V = L0L 0 and write Z1 = bΣ1/2 0,t L0, Z2 = bΣ1/2 0,t b V , we have e Z bΣ0,t b V , b V bΣ0,t e Z = Z1Z 1 2 F + Z2Z 2 2 F 2 Z1Z 1 , Z2Z 2 0 by Cauchy Schwarz Inequality. This proves that Term II is non-negative. Finally we conclude that we have gt(b V , e Z) 1 2ν (bλr bλr+1) e Z 2 F. Now we argue for general L. First we show that for any L in the neighborhood of b V |gt(L, e Z) gt(b V , e Z)| = |λ e Z e Z , bΣ0,t(LL b V b V )bΣ0,t + λ L bΣ0,t L b V bΣ0,t b V , e Z bΣ0,t e Z + λ( e Z bΣ0,t L, L bΣ0,t e Z e Z bΣ0,t b V , b V bΣ0,t e Z )| c1 L b V F e Z 2 F, for some positive constant c1 depending on λ and ν. Specifically, this bound can be obtained by bounding each term. The first term can be bounded by λ| e Z e Z , bΣ0,t(LL b V b V )bΣ0,t | λ e Z 2 F bΣ0,t(LL b V b V )bΣ0,t op 2λν bΣ1/2 0,t LL bΣ1/2 0,t bΣ1/2 0,t b V b V bΣ1/2 0,t op e Z 2 F 2λν bΣ1/2 0,t L op bΣ1/2 0,t L bΣ1/2 0,t b V F e Z 2 F 2λν3/2 1 + λ1 λ + δ L b V F e Z 2 F, under the assumption that dist(bΣ1/2 0,t L, bΣ1/2 0,t b V ) 1 4 bΣ1/2 0,t b V op = 1 λ . The first inequality is the property of matrix norm, the second inequality is due to the bound on sample covariance matrix. The third inequality follows from Lemma 14 and the last inequality follows from the neighborhood assumption. The second term can be bounded in a similar fashion as λ| L bΣ0,t L b V bΣ0,t b V , e Z bΣ0,t e Z | λ (L bΣ0,t L b V bΣ0,t b V ) e Z F bΣ0,t e Z F 2λν L bΣ0,t L b V bΣ0,t b V op e Z 2 F = 2λν b V bΣ0,t(L b V ) + (L b V ) bΣ0,t b V + (L b V ) bΣ0,t(L b V ) op e Z 2 F 2λν(2 bΣ1/2 0,t b V op bΣ1/2 0,t (L b V ) op + δ bΣ1/2 0,t (L b V ) op) e Z 2 F λν bΣ1/2 0,t (L b V ) op e Z 2 F 2 δ + 2 1 + λ1 λν3/2 L b V F e Z 2 F. 
Sparse GCA and Thresholded Gradient Descent The first line is standard matrix inequality and the rest follows from the neighborhood assumption as well as the assumption on the operator norm of sample covariance matrices on event B2. Finally we bound the last term as λ|( e Z bΣ0,t L, L bΣ0,t e Z e Z bΣ0,t b V , b V bΣ0,t e Z )| = λ| Tr(( e Z bΣ0,t L)2 ( e Z bΣ0,t b V )2)| = λ| Tr( e Z bΣ0,t(L + b V ) e Z bΣ0,t(L b V ))| λ e Z bΣ0,t(L + b V ) F e Z bΣ0,t(L b V ) F 2λν3/2 e Z 2 F L b V F bΣ1/2 0,t (L + b V ) op 2λν3/2 2 1 + λ1 + δ L b V F e Z 2 F. In this way, we see that we can choose the constant to be 2 δ + 2 1 + λ1 2λν3/2 1 + λ1 Thus for any L within a δ neighborhood of b V = L t as defined above, we have, by triangle inequality, gt(L, e Z) gt(b V , e Z) |gt(L, e Z) gt(b V , e Z)| 2ν (bλr bλr+1) e Z 2 F c1 L b V F e Z 2 F 4ν (bλr bλr+1) e Z 2 F 1 8ν (λr λr+1) e Z 2 F, as long as L b V F bλr bλr+1 4νc1 , which is guaranteed by the assumption in the theorem and the event B2. Recall that gt(L, e Z) = 1 2vec( e Z) 2ft(L)vec( e Z). This motivates us to pick α = (λr λr+1) under appropriate radius conditions. To finally find the radius of attraction region such that dist(Lt, L t ) = dist(V t, b V (St)) R, we notice that throughout the proof for smoothness and strongly convexity to hold, we require 3 conditions on the distance L L t F (bλr bλr+1) 4νc1 , bΣ1/2 0,t L bΣ1/2 0,t L t F δ, bΣ1/2 0,t L bΣ1/2 0,t L t F 1 Hence to ensure all of the above three conditions to hold, we can just set Here recall that 1 ν is the lower bound of minimum eigenvalue of Σ0. In summary, we conclude that there is a constant α such that for any L in the neighborhood of L t and for any Z, vec(ZHZ L t ) 2ft(L)vec(ZHZ L t ) α ZHZ L t 2 F. Combine with the upper bound and resort to Lemma 20, we conclude that if we choose the step-size to be η 1 β, under the radius conditions, dist2(Lt+1, b V ) (1 αη) dist2(Lt, b V ). Moreover, we have α = (λr λr+1) 4ν , β = 12λν This finishes our analysis for the gradient descent step. We comment here that Proposition 18 is then proved by simply taking δ = 1 and calculate the radius in (46) and α, β accordingly. We keep δ in the proof for the sake of generality. Specifically, under the choice that λ = λ1 c , we have η c 12λ1ν(c + 2)2 = 1 and we have, by lifting the matrix of size |St| r to p r by filling in 0s, dist2(V o t+1, b V (St)) 1 η(λr λr+1) dist2(V t, b V (St)). B.4 Proof of Proposition 19 Proof Recall that we have defined supp(V ) = S, supp(V t+1) = Ft+1. Now we define sets F1 = S\Ft+1, F2 = S Ft+1 and F3 = Ft+1\S, the sizes of which are denoted by k1, k2 and k3 respectively. We also define x1 = VF1 F, x2 = VF2 F, y1 = Vt+1,F1 F, y2 = Vt+1,F2 F, y3 = Vt+1,F3 F. Finally, we define x2 1 + x2 2 = V 2 F = X2, y2 1 + y2 2 + y2 3 Vt+1 2 F = Y 2. Let = Tr(|V Vt+1|) be the quantity we are interested in. Here we let Tr(|M|) denote the trace norm Tr(|M|) = Tr( M M) which is simply the sum of singular values (a.k.a. nuclear Sparse GCA and Thresholded Gradient Descent norm). This is a special case of Schatten p norm, defined as T p = Tr(|T|p) 1 p which is essentially vector p norm of a vector composed of singular values. We first prove the following claim about the effect of truncation step: Tr(|V Vt+1|) Tr(|V V t+1|) r s X2Y 2 2, 1 + p XY (X2Y 2 2) Proof Since in the hard thresholding step we greedily pick the rows with largest l2 norms, we have y2 1 k1 y2 3 k3 . In addition, since k1 + k2 = s s = k2 + k3, we also have k1 k3. 
By applying Holder s inequality on Schatten norm and the definition of support set, we have 2 ( VF1 F Vt+1,F1 F + VF2 F Vt+1,F2 F)2 = (x1y1 + x2y2)2 (y2 1 + y2 2)X2 (Y 2 y2 3)X2 X2Y 2 k3 and this gives y2 1 k1 k3X2 (X2Y 2 2) k1 + k2 k3 + k2 (Y 2 2/X2) = s s (Y 2 2/X2), where the second inequality holds since k1 k3. Now we split the arguments into two cases. Case I: < XY q s s+s . In this case, we obtain that This implies (47) immediately. Case II: We can now assume that XY q s s+s . Then we have s (Y 2 2/X2) 2 By definition we have X2 x2 1 x1y1 + x2y2 . Solving this inequality as a quadratic inequality in x1 yields (X2Y 2 2)(Y 2 y2 1) Y 2 . By definition of X2 = x2 1 +x2 2 we have x1 X. Also we have XY by Holder inequality on Schatten norm. Combining the above inequalities together, we have Note that y1 p s X by previous calculation. Substituting it into the above equation, we have Finally, we can compute that X2Y 2 2 min X, 1 X2Y 2 2(1 + r s X2Y 2 2, 1 + p s s XY (X2Y 2 2) This is the error induced by the truncation step, finally we have Tr(|V Vt+1|) Tr(|V V t+1|) Tr(|V (V t+1 Vt+1)|) X2Y 2 2, 1 + p s s XY (X2Y 2 2) This finishes the proof for the claim. We now switch to work with the distance metric dist(U, V ) = min P O(r) UP V F. We first expand the expression UP V 2 F = UP 2 F + V 2 F 2 UP, V = U 2 F + V 2 F 2 Tr(P U V ). So, to minimize the distance metric we defined is to maximize Tr(P U V ). To this end, let ADB denote the singular value decomposition of U V , then we have Tr(P U V ) = Tr(P ADB ) = Tr(B P AD) = Tr(ZD) for Z being an orthogonal matrix. Since D is a diagonal matrix, we have Tr(ZD) = P i Zii Dii and to maximize this value we would like to have Zii = 1 for all i since Dii are non-negative. As a result, we have the optimal P being AB and the optimal value is given by min UP V 2 F = U 2 F + V 2 F 2 Tr(D) = U 2 F + V 2 F 2 Tr(|U V |). Sparse GCA and Thresholded Gradient Descent Now we consider the subspace distance we have defined. Recall that = Tr(|V Vt+1|). Then the subspace distance can be written as dist2(V t+1, V ) = V 2 F + V t+1 2 F 2 Tr |V V t+1| V 2 F + V t+1 2 F 2 Tr |V Vt+1| + 2 r s X2Y 2 2, 1 + p s s XY (X2Y 2 2) V 2 F + Vt+1 2 F 2 Tr |V Vt+1| + 2 r s s XY (X2Y 2 2) = dist2(Vt+1, V ) + 2 r s s XY (X2Y 2 2), where the first equality follows from the previous expansion of distance, the first inequality follows from Claim (47) and we use the fact that truncation reduces the Frobenius norm in the second inequality. In the last equality, we use again the relationship between trace norm and distance. To deal with the extra error term, we use the following bound X2Y 2 2 = (XY + )(XY ) 2XY (XY ) XY (X2 + Y 2 2 ) = XY ( V 2 F + Vt+1 2 F 2 Tr(|V Vt+1|) = XY dist2(Vt+1, V ). Combining the inequalities, we have dist2(V t+1, V ) dist2(Vt+1, V ) + 2 r s s XY (X2Y 2 2) dist2(Vt+1, V ) + 2 r s s XY XY dist2(Vt+1, V ) = 1 + 2 r s dist2(Vt+1, V ). This finishes our analysis for the hard thresholding step. B.5 Proof of Theorem 9 Proof We present the proof of initialization using generalized Fantope in this section. Some proof arguments originate from Gao et al. (2017). Recall that we have defined e A = A(A bΣ0A) 1/2, eΛr = (A bΣ0A)1/2Λr(A bΣ0A)1/2, and e F = e A e A . (48) Throughout the proof, we work on the event B3 B4, which by Lemmas 16 and 17 happens with probability at least 1 exp( C (s + log(ep/s))) for some constant C > 0, uniformly over Pn. We shall need the following lemma, which is a simplified version of Lemma 6.3 in Gao et al. (2017). 
The proof is essentially the same and so we omit it. Lemma 21 (Curvature of Fantope) Let P O(p, r) and D = diag(d1, d2, ..., dr) with d1 d2 ... dr 0. If F Fr, then PDP , PP F dr 2 PP F 2 F. Recall that Σ = Σ0KΛK Σ0 = Σ0AΛr A Σ0 + Σ0BΛB Σ0. For notational simplicity, for any positive semi-definite matrix B, we define φB max(k) = max u 0 k,u =0 u Bu u u , φB min(k) = max u 0 k,u =0 u Bu In the rest of this proof, let = b F e F with e F defined in (48). As in Gao et al. (2017), the main proof consists of two steps. The first step is to derive upper bound of bΣ1/2 0 bΣ1/2 0 F and the second step is to lower bound bΣ1/2 0 bΣ1/2 0 F by F. Step (1) First note that by Lemma 17, e A is well defined on event B4, so e F = e A e A is also well defined on event B4. In addition, on event B4, we have Σ1/2 0 ( e F AA )Σ1/2 0 F C r(s + log p) We first show that e F is a feasible solution, that is, bΣ1/2 0 e A e A bΣ1/2 0 Fr. By (48), e F = e A e A = A(A bΣ0A) 1A . Let M = bΣ1/2 0 e F bΣ1/2 0 = bΣ1/2 0 A(A bΣ0A) 1A bΣ1/2 0 . Then Tr(M) = Tr(bΣ1/2 0 A(A bΣ0A) 1A bΣ1/2 0 ) = r. Next we check that 0 M I. It is obvious that 0 M. Moreover, M op bΣ1/2 0 e A op bΣ1/2 0 e A op 1 by the construction of e A. Hence, e F Fr. Recall that b F is the solution to (14). The basic inequality implies bΣ, b F + ρ b F 1 bΣ, e F + ρ e F 1. Rearranging terms, we have 0 ρ( e F 1 e F + 1) + bΣ, . (50) We deal with each term on the right side separately. For the first term on the right, we have e F 1 e F + 1 = e FSS 1 e FSS + SS 1 (SS)c 1 SS 1 (SS)c 1. (51) Sparse GCA and Thresholded Gradient Descent Here the first equality holds since e F is supported on S S. For the second term, we have bΣ, = bΣ Σ, + Σ0AΛr A Σ0, | {z } Term I + Σ0BΛB Σ0, | {z } Term II We now bound Term I and Term II separately. Bound for Term I : For Term I, we decompose it in the following way: Σ0AΛr A Σ0, = Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + bΣ0AΛr A bΣ0, . Now we bound the second term on the right as follows bΣ0AΛr A bΣ0, = bΣ1/2 0 AΛr A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 = bΣ1/2 0 e AeΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 = bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 + bΣ1/2 0 e A(eΛr Λr) e A bΣ1/2 0 , bΣ1/2 0 bΣ1/2 0 bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 + eΛr Λr F bΣ1/2 0 bΣ1/2 0 F. As a result, Term I can be bounded in the following way Term I Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 +δ1 bΣ1/2 0 bΣ1/2 0 F, for δ1 = eΛr Λr F. Bound for Term II : To bound Term II, we first notice that by definition, B Σ0A is zero matrix since A, B are normalized with respect to Σ0. As a result, we have Σ0BΛB Σ0, = Σ0BΛB Σ0, b F λr+1 Σ0BB Σ0, b F = λr+1 Σ0BB Σ0, b F e A e A = λr+1 Σ1/2 0 (I Σ1/2 0 AA Σ1/2 0 )Σ1/2 0 , b F e A e A = λr+1 Σ0 Σ0AA Σ0, b F e A e A . Here, the second last equality holds since Σ1/2 0 AA Σ1/2 0 + Σ1/2 0 BB Σ1/2 0 = I. We further decompose the rightmost side as λr+1 Σ0 Σ0AA Σ0, b F e A e A = λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, + λr+1 bΣ0, b F e A e A | {z } Term A λr+1 bΣ1/2 0 AA bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 | {z } Term B We now deal with Term A and Term B separately. By definition we have bΣ1/2 0 b F bΣ1/2 0 Fr, hence bΣ0, b F e A e A = Tr(bΣ1/2 0 b F bΣ1/2 0 ) Tr( e A bΣ0 e A) = r r = 0. 
For Term B, we have λr+1 bΣ1/2 0 AA bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 = λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 λr+1 bΣ1/2 0 e A(A bΣ0A I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 + λr+1 A bΣ0A I F bΣ1/2 0 bΣ1/2 0 F. Therefore, Term II has the following bound Σ0BΛB Σ0, λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 + δ2 bΣ1/2 0 bΣ1/2 0 F, for δ2 = λr+1 A bΣ0A I F. Now we combine the results for Term I and Term II to obtain Σ0AΛr A Σ0, + Σ0BΛB Σ0, Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 + δ1 bΣ1/2 0 bΣ1/2 0 F + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 + δ2 bΣ1/2 0 bΣ1/2 0 F = Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, + δ bΣ1/2 0 bΣ1/2 0 F + bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 , where δ = δ1 + δ2. Plugging back to (50), together (51) with (52), we have 0 ρ( e F 1 e F + 1) + bΣ, ρ( SS 1 (SS)c 1) + bΣ Σ, + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, + δ bΣ1/2 0 bΣ1/2 0 F + bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 . By Holder s inequality, we have bΣ Σ, + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, bΣ Σ 1 + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0 1 + λr+1 Σ0 bΣ0 1 + λr+1 Σ0AA Σ0 bΣ0AA bΣ0 1. On event B3, we can pick ρ = γ q n for some large constant γ > 0 such that bΣ Σ + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0 + λr+1 Σ0 bΣ0 + λr+1 Σ0AA Σ0 bΣ0AA bΣ0 ρ Sparse GCA and Thresholded Gradient Descent thus the lefthand side of (54) is bounded by ρ 2 1. Furthermore, by Lemma 21, we have bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 = bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( e A e A b F)bΣ1/2 0 2 bΣ1/2 0 ( e A e A b F)bΣ1/2 0 2 F. Together with (53) and (54), the last display implies that when (55) holds, 0 ρ( SS 1 (SS)c 1)+ ρ 2 1 λr λr+1 2 bΣ1/2 0 ( e A e A b F)bΣ1/2 0 2 F +δ bΣ1/2 0 bΣ1/2 0 F. Rearranging terms and multiplying both side by 2, we obtain (λr λr+1) bΣ1/2 0 bΣ1/2 0 2 F 3ρ SS 1 ρ (SS)c 1 + 2δ bΣ1/2 0 bΣ1/2 0 F 3ρ SS 1 + 2δ bΣ1/2 0 bΣ1/2 0 F. This can be view as a quadratic equation, which, by Lemma 2 in Cai et al. (2013), yields bΣ1/2 0 bΣ1/2 0 2 F 4δ2 (λr λr+1)2 + 6ρ λr λr+1 SS 1. (56) Combining the last two displays, we have 0 3ρ SS 1 ρ (SS)c 1 + δ2 λr λr+1 + (λr λr+1) bΣ1/2 0 bΣ1/2 0 2 F 9ρ SS 1 ρ (SS)c 1 + 5δ2 This can be viewed as a version of generalized cone condition. Finally, using Cauchy Schwarz inequality, SS 1 s SS F, and so (56) leads to bΣ1/2 0 bΣ1/2 0 2 F 4δ2 (λr λr+1)2 + 6ρs λr λr+1 SS F. (57) and this is the end of the first step. Step (2) Recall that we have established the generalized cone condition (SS)c 1 9 SS 1 + 5δ2 (λr λr+1)ρ. (58) In this step we lower bound bΣ1/2 0 bΣ1/2 0 2 F by a function of F on this cone. Adapting the peeling argument in Bickel et al. (2009), we define the index set J1 = {(il, jl)}t l=1 in (S S)c correspond to the entries with the t largest absolute values in , and also define e J = (S S) S J1. Next we partition e Jc into disjoint subsets J2, ..., JM of size t and possibly |JM| < t such that each Jm is the set of indices corresponding the entries of the t largest absolute values in outside e J S m 1 j=2 Jj. 
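The peeling construction just described is procedural, so it may help to see it spelled out. The following sketch is illustrative only: delta, base_mask, and t stand in for $\Delta$, the base index set $(S \times S) \cup J_1$, and the chunk size $t$.

```python
import numpy as np

def peel(delta, base_mask, t):
    """Sort the entries of |delta| outside the base set in decreasing magnitude
    and cut them into consecutive chunks J_2, J_3, ..., J_M of size at most t."""
    flat = np.abs(delta).ravel()
    outside = np.flatnonzero(~base_mask.ravel())
    order = outside[np.argsort(flat[outside])[::-1]]           # decreasing |delta|
    return [order[i:i + t] for i in range(0, order.size, t)]

# tiny usage example with placeholder inputs
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 6))
mask = np.zeros((6, 6), dtype=bool); mask[:2, :2] = True       # toy base set
print([len(J) for J in peel(D, mask, t=10)])                   # chunk sizes
```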
Then by triangle inequality bΣ1/2 0 bΣ1/2 0 F bΣ1/2 0 e J bΣ1/2 0 F m=2 bΣ1/2 0 Jm bΣ1/2 0 F φ bΣ0 min(s + t) e J F φ bΣ0 max(t) In addition, by our construction of the index sets, m=2 Jm 1 1 t 1/2 (SS)c 1 t 1/2(9 SS 1 + 5δ2 t e J F + 5δ2 The second last inequality follows from the generalized cone condition (58). Hence combining the results above, we have bΣ1/2 0 bΣ1/2 0 F κ1 e J F κ2 δ2 where κ1 = φ bΣ0 min(s + t) 9s tφ bΣ0 max(t), κ2 = 5φ bΣ0 max(t). Taking t = c1s2 for c1 sufficiently large, using the same argument as in Gao et al. (2017), under condition (26) we can lower bound κ1 by some constant C1 and upper bound κ2 by some constant C2. Combining (57) and (59), we obtain e J 2 F C1 sρ λr λr+1 e J F + C2 (λr λr+1)2 + δ2 Solving this equation gives (λr λr+1)2 + δ2 (λr λr+1)2 + δ2 for some positive constant C3 that is sufficiently large. Also we have m=2 Jm F 9s t e J F + 5δ2 Sparse GCA and Thresholded Gradient Descent Combining the last two displays and using t = c1s2, we obtain that sρ + δ λr λr+1 + δ2 sρ(λr λr+1) Recall that on event B3 we pick ρ = γ q n . Then on event B4, we have F C5 sρ λr λr+1 = C5γ s log p n(λr λr+1) on event B3 B4. Step (3) The last step follows from AA b F F F + e F AA F. We combine the last display and (49) to obtain the desired bounds for each term on the right side. This finishes our proof. B.6 Proof of Corollary 10 Proof Define τ 2 n = s2 log p n(λr λr+1)2 . We will bound dist(V 1, V ) by a constant multiple of τn on event B2 B3 B4. Notice that since A0A 0 is the best rank r approximation, and AA also has rank r. Thus, we obtain that A0A 0 b F F AA b F F. Triangle inequality further leads to A0A 0 AA F A0A 0 b F F + AA b F F 2 AA b F F. Together with Lemma 14, the last display implies dist(A, A0) C0τn for some positive constant C0. By definition b A0 = HT(A0, s ), V 1 and e A0 are both s sparse. Moreover, following the lines of the proof of Proposition 19 and Corollary 8, we have dist( e A0, A) C1τn for some constant C1 > 0. Let P be the orthogonal matrix such that e A0P A F = dist( e A0, A). Then we have V 1P = e A0(I + e A 0 bΣ e A0/λ)1/2P = e A0PP (I + e A 0 bΣ e A0/λ)1/2P = e A0P(I + P e A 0 bΣ e A0P/λ)1/2. dist(V 1, V ) V 1P V F = e A0P(I + P e A 0 bΣ e A0P/λ)1/2 A(I + Λr/λ)1/2 F , we turn to bound the rightmost side. To this end, triangle inequality gives e A0P(I + P e A 0 bΣ e A0P/λ)1/2 A(I + Λr/λ)1/2 F e A0P(I + P e A 0 bΣ e A0P/λ)1/2 e A0P(I + Λr/λ)1/2 F | {z } Term I + ( e A0P A)(I + Λr/λ)1/2 F | {z } Term II We now bound each term separately. For Term II, we have ( e A0P A)(I + Λr/λ)1/2 F e A0P A F (I + Λr/λ)1/2 op C2τn. The last inequality holds since λ = λ1/c. To deal with Term I, we define = e A0P A and notice that e A0P(I + P e A 0 bΣ e A0P/λ)1/2 e A0P(I + Λr/λ)1/2 F e A0P op (I + P e A 0 bΣ e A0P/λ)1/2 (I + Λr/λ)1/2 F. By matrix root perturbation bound (e.g., Lemma 2 in Gao et al. (2017)), we have (I + P e A 0 bΣ e A0P/λ)1/2 (I + Λr/λ)1/2 F C3 P e A 0 bΣ e A0P Λr F C3 (A + ) bΣ(A + ) A ΣA F = C3( A (bΣ Σ)A F + 2 bΣA F + bΣ F) for some positive constant C3. Since e A0 is s sparse and A is s sparse, = e A0P A is s+s sparse. By definition of P, F = dist( e A0, A) C1τn. By a similar argument to the proof of Corollary 8, on event B2, we have (I + P e A 0 bΣ e A0P/λ)1/2 (I + Λr/λ)1/2 F C4 Consequently, Term I is also dominated by a constant multiple of τn. Thus, we conclude that on the intersection of B2 B3 B4, dist(V 1, V ) C5τn for some positive constant C5. 
By union bound, the intersection of the three events holds with probability at least 1 exp( C (s + log(ep/s))), uniformly over Pn. This completes the proof. B.7 Proof of Lemma 15 Proof In this proof, we work on event B2 defined in (34), which occurs with probability at least 1 exp( C s log(ep/s )) for some constant C > 0, uniformly over Pn. The proof relies on Theorem 3.1 of Sun (1983), a matrix perturbation bound for generalized eigenspaces. Fix any I such that S I and |I| 2s + s. We shall apply the theorem on matrix pair (ΣII, Σ0,II) and its perturbation (bΣII, bΣ0,II). Sparse GCA and Thresholded Gradient Descent To this end, for the fixed set I, define δ = min λ,λ λ λ p (1 + λ2)(1 + λ 2) for λ {λ1, ...λr} and λ {bλr+1, ...bλ|I|}, where λi denotes the ith generalized eigenvalue of matrix pair (ΣII, Σ0,II) and bλi denotes the ith sample generalized eigenvalue of matrix pair (bΣII, bΣ0,II). Define event e B2 = bλr+1 2λr+1 λr + λr+1 On event e B2, since generalized eigenvalues are in decreasing order, we have δ = λr bλr+1 q (1 + λ2 1)(1 + bλ2 r+1) 1 2 λr λr+1 q (1 + λ2 1)(1 + bλ2 r+1) . Moreover, since bλ2 r+1 4λ2 r+1 on e B2, we further obtain that 4 λr λr+1 p 1 + λ2 r+1 . On the other hand, e B2 can be equivalently defined as e B2 = bλr+1 λr+1 λr+1 λr λr+1 Since λi is also the ith eigenvalue of Σ 1/2 0,II ΣIIΣ 1/2 0,II and bλi the ith eigenvalue of bΣ 1/2 0,II bΣII bΣ 1/2 0,II , Weyl s inequality then implies that on event B2, |bλr+1 λr+1| Σ 1/2 0,II ΣIIΣ 1/2 0,II bΣ 1/2 0,II bΣII bΣ 1/2 0,II op (Σ 1/2 0,II bΣ 1/2 0,II )ΣIIΣ 1/2 0,II op + bΣ 1/2 0,II (ΣII bΣII)Σ 1/2 0,II op + bΣ 1/2 0,II bΣII(Σ 1/2 0,II bΣ 1/2 0,II ) op Σ 1/2 0,II bΣ 1/2 0,II op ΣIIΣ 1/2 0,II op + bΣ 1/2 0,II op ΣII bΣII op Σ 1/2 0,II op + bΣ 1/2 0,II bΣII op Σ 1/2 0,II bΣ 1/2 0,II op for some constant C0 depending on ν. Here the last inequality is due to the matrix root bound from Lemma 2 in Gao et al. (2015a). Under condition (23), by the choice of s in (20), the last display implies |bλr+1 λr+1| (λr λr+1 2 λr+1). Hence e B2 holds on event B2 under condition (23). Furthermore, we have γ(ΣII, Σ0,II) = min x =1 (x ΣIIx)2 + (x Σ0,IIx)2 1 γ(bΣII, bΣ0,II) = min x =1 (x bΣIIx)2 + (x bΣ0,IIx)2 1 on event B2. Now we apply Theorem 3.1 of Sun (1983) for perturbation on restricted covariance matrices and obtain that dist(A, b A(I)) C1 Σ2 II + Σ2 0,II op γ(ΣII, Σ0,II)γ(bΣII, bΣ0,II) (bΣII ΣII)AI 2 F + (bΣ0,II Σ0,II)AI 2 F δ 1 + λ2 r+1 λr λr+1 AI 2 F( bΣII ΣII 2op + bΣ0,II Σ0,II 2op) 1 + λ2 r+1 λr λr+1 1 + λ2 r+1 λr λr+1 for some positive constant C4 on event B2. This finishes the proof of Lemma 15. B.8 Proofs of Lemma 16 and Lemma 17 Proof Note that λr+1 < 1. Hence, it suffices to show that each of the four infinity norm terms on the left side in the definition of event B3 in (35) is upper bounded by a constant multiple of p log p/n with probability at least 1 p C for some positive constant C , uniformly over Pn. This can be achieved by following the lines of the proof of Lemma 6.4 in Gao et al. (2017) and so we omit the details. Proof We first bound the operator norm of each term on the left side of (37) by a constant multiple of q 1 n(s + log(ep/s)), and the result on Frobenius follows directly since each term is of rank r. As in the proof of Lemma 6.1 of Gao et al. (2017), we have Σ1/2 0 ( e A A) op Σ1/2 0 A op (A bΣ0A)1/2 I op (A bΣ0A) 1/2 op, eΛr Λr op (A bΣ0A)1/2 I op Λr(A bΣ0A)1/2 op + Λr op (A bΣ0A)1/2 I op. Sparse GCA and Thresholded Gradient Descent So it remains to bound (A bΣ0A)1/2 I op. By Lemma 2 in Gao et al. 
(2015a), it suffices to bound A bΣ0A I op. Note that A bΣ0A I op = A (bΣ0 Σ0)A op = A S (bΣ0,SS Σ0,SS)AS op = sup v =1 (AS v) (bΣ0,SS Σ0,SS)(AS v) Σ1/2 0,SSAS 2 op Σ 1/2 0,SS bΣ0,SSΣ 1/2 0,SS I op Σ 1/2 0,SS bΣ0,SSΣ 1/2 0,SS I op C bΣ0,SS Σ0,SS op. By Lemma 3 of Gao et al. (2015a), we then have A bΣ0A I op C1 with probability at least 1 exp( C (s + log(ep/s)) for some constant C > 0, uniformly over Pn. This completes the proof. Appendix C. Proof of Lower Bound We first present a lemma on Kullback-Leibler divergence between data distributions from a special kind of covariance matrix. The Lemma can be viewed as a general case of Lemma 1 in Gao et al. (2015b). Lemma 22 For t = 1, 2, define Σ(t) to be a block matrix whose (i, j)th block is given by λU(t) {i}U(t) {j} for i = j, where U(t) {i} O(pi, r) and whose ith diagonal block is given by Ipi. Further let P (t) denote the distribution of a random i.i.d sample of size n from the Np(0, Σ(t)) distribution where p = Pk i=1 pi. Then D(P (1)||P (2)) = λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) U(1) {i}U(1) {j} U(2) {i}U(2) Proof The first step is to find the eigenvalues of Σ(t). To this end, define U(t) = h U(t) {2} , . . . , U(t) K(t) i = h U(t) {2} . . . , U(t) {i 1}, (k 1)U(t) {i+1}, . . . , U(t) That is, K(t) i Rp r is a block matrix with jth blocks being U(t) {j} for j = i and the ith block being (k 1)U(t) {i}. It is straightforward to verify that Σ(t) yields the following decomposition Σ(t) = I + (k 1)λ k U(t)U(t) λ i=1 K(t) i K(t) We then deduce that the eigen-structure of Σ(t) is given by r eigenvalues equal to 1+λ(k 1), (k 1)r eigenvalues equal to 1 λ and p kr eigenvalues equal to 1. Furthermore, we notice that Σ(t)U(t) = (1 + (k 1)λ)U(t), hence the leading r dimensional eigenspace is given by the matrix U(t). This in particular implies that det Σ(1) = det Σ(2). Now the KL divergence is given by D(P (1)||P (2)) = n Tr (Σ(2)) 1Σ(1) i=1 pi log det (Σ(2)) 1Σ(1) ! Tr (Σ(2)) 1Σ(1) p Tr (Σ(2)) 1 Σ(1) Σ(2) . We now calculate the inverse of the block matrix Σ(2). To this end, we first guess the form of the inverse and then determine the coefficients. Specifically, we try to solve for the inverse with the following form I + a U(2) {1}U(2) {1} b U(2) {1}U(2) {2} . . . . . . b U(2) {1}U(2) {k} b U(2) {2}U(2) {1} I + a U(2) {2}U(2) {2} b U(2) {2}U(2) {3} . . . b U(2) {2}U(2) {k} . . . . . . . . . . . . . . . b U(2) {k}U(2) {1} . . . . . . . . . I + a U(2) {k}U(2) with the (i, j)th block being b U(2) {i}U(2) {j} for i = j and ith diagonal block being I+a U(2) {i}U(2) To determine the values of (a, b), we solve for the equation (Σ(2)) 1Σ(2) = Ip. This requires two conditions on the matrix product: the off-diagonal block to be 0 and the diagonal block to be I. For the ith diagonal block, we require I + (k 1)bλU(2) {i}(U(2) {i}) + a U(2) {i}(U(2) {i}) = I, and for the (i, j)th off-diagonal block, we require λb(k 2)U(2) {i}(U(2) {j}) + b U(2) {i}(U(2) {j}) + λU(2) {i}(U(2) {j}) + aλU(2) {i}(U(2) {j}) = 0. Solving the above equations, we have a = (k 1)bλ, b = λ (k 1)λ2 λ(k 2) 1. Sparse GCA and Thresholded Gradient Descent As a result, we can simplify the expression for KL divergence as follows D(P (1)||P (2)) = n Tr (Σ(2)) 1 Σ(1) Σ(2) j =i b U(2) {i}(U(2) {j}) U(1) {j}(U(1) {i}) U(2) {j}(U(2) {i}) i,j:i =j Tr I U(2) {i}(U(2) {j}) U(1) {j}(U(1) {i}) U(1) {i}(U(1) {j}) U(2) {i}(U(2) {j}) 2 = λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) U(1) {i}(U(1) {j}) U(2) {i}(U(2) {j}) 2 This finishes our proof for the Lemma. Note that when k = 2, we recover Lemma 1 in Gao et al. 
(2015b) as a special case. C.1 Proof of Theorem 12 Proof The main body of the proof is adapted from Gao et al. (2015a). The main tool for our proof is Fano s Lemma. For the sake of completeness, we provide the following version of Fano s Lemma from Yu et al. (1997). Lemma 23 Let (Θ, ρ) be a metric space and {Pθ : θ Θ} a collection of probability measures. For any totally bounded T Θ, denoted by M(T, ρ, ϵ) the ϵ-packing number of T with respect to ρ, i.e., the maximal number of points in T whose pairwise maximum distance in ρ is at least ϵ. Define the Kullback-Leibler diameter T by d KL(T) sup θ,θ T D(Pθ||Dθ ). inf θ sup θ Θ Eθ[ρ2(ˆθ(X), θ)] sup T Θ sup ϵ>0 1 d KL(T) + log 2 log M(T, ρ, ϵ) The proof is compose of two steps corresponding to different terms in the lower bound. Throughout the proof, we define λ = λr λr+1 for simplicity. Step I. We first establish the term involving r Pk i=1 si. To this end, let U{i}0 = Ir 0 O(pi, r) for each i = 1, 2, . . . , k. For some ϵ0 (0, p r (s1 r)] to be specified later, let B(ϵ0) = {U{1} O(p1, r) : supp(U{1}) [s1], U{1} U{1}0 F ϵ0} Ip1 λU{1}U {2}0 λU{1}U {3}0 . . . λU{1}U {k}0 λU{2}0U {1} Ip2 . . . λU{2}0U {k}0 . . . . . . . . . . . . λU{k}0U {1} λU{k}0U {2}0 . . . Ipk : U{1} B(ϵ0) with U = h U {1}, U {2}0, . . . , U {k}0 i . Since our target of estimation is A instead of U, we first establish the relationship between U and A under T0. Note that under the construction of Σ, we have Σ0 = Ip so the generalized eigenspace coincides with eigenspace. From Lemma 22, we deduce that the leading r dimensional eigenspace of Σ is given by span(U). From the normalization constraint such that A A = I, we conclude that dist(A, 1 k U) = 0. From here on, we first derive the minimax lower bound for the estimation of U and the lower bound for estimating A under the matrix distance follows immediately by scaling. The above analysis also implies that T0 F, where F is our original parameter space. By Lemma 22, d KL(T0) = sup U(1) {1},U(2) {1} B(ϵ0) D(P (1)||P (2)) U(1) {1},U(2) {1} B(ϵ0) λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) U(1) {i}U(1) {j} U(2) {i}U(2) U(1) {1},U(2) {1} B(ϵ0) λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) j=2 2 U(1) {1}U(1) {j}0 U(2) {1}U(2) U(1) {1},U(2) {1} B(ϵ0) λ2n 2 ( (k 1)λ2 + (k 2)λ + 1) U(1) {1} U(2) {1} = 2λ2n(k 1)ϵ2 0 (k 1)λ2 + (k 2)λ + 1. Here, the second to last equality follows from the definition of B(ϵ0). We now establish a lower bound for the packing number of T0. For some α (0, 1) which shall be specified later, we define {e U{1}(1), e U{1}(2), . . . , e U{1}(N)} O(p1, r) to be a maximal set such that supp(e U{1}(i)) [s1] and for i = j, e U{1}(i)e U {1}(i) U{1}0U {1}0 F ϵ0, e U{1}(i)e U {1}(i) e U{1}(j)e U {1}(j) F Then by Lemma 1 in Cai et al. (2013), for some absolute constant C > 1, For each e U{1}(i), we define U{1}(i) to be the matrix such that U{1}0 U{1}(i) 2 F = dist2 e U{1}(i), U{1}0 . Sparse GCA and Thresholded Gradient Descent Then for any i, we deduce that U{1}(i) O(p1, r), supp(U{1}(i)) [s1] and U{1}(i)U{1}(i) = e U{1}(i)e U{1}(i) . In addition, Lemma 6.6 in Gao et al. (2017) implies that U{1}(i) U{1}0 F = dist e U{1}(i), U{1}0 = 1 2 e U{1}(i)e U {1}(i) U{1}0U {1}0 F ϵ0 hence U{1}(i) B(ϵ0). On the other hand, note that from Lemma 6.6 in Gao et al. (2017), U{1}(i), U{1}(j) = 1 2 U{1}(i)U {1}(i) U{1}(j)U {1}(j) F 2 e U{1}(i)e U {1}(i) e U{1}(j)e U {1}(j) F αϵ0. Define the metric to be ρ(Σ(1), Σ(2)) = dist U(1) {1}, U(2) {1} . 
The above argument implies that for ϵ = αϵ0, log M(T0, ρ, ϵ) r(s1 r) log 1 (k 1)λ2 + (k 2)λ + 1 n(k 1)λ2 r(s1 r) for sufficiently small constants c0, α, we obtain a lower bound of the order r (s1 r) (k 1)λ2 + (k 2)λ + 1 n(k 1)λ2 r(s1 r) by applying Lemma 23. By symmetry, we also have the lower bound with r(s1 r) replaced by r(si r) for i = 2, 3, . . . , k. Furthermore, recall that we have r 1 2 mini si hence r (si r), we obtain the lower bound of the order r r Pk i=1 si nλ2 for the estimation of U with metric dist2(U, b U), when k is finite. Step II We now turn to establish the lower bound term involving si log epi si . The step follows from the rank-one argument from Chen et al. (2013). Without loss of generality, we may assume that si pi 2 for any i, we then consider the following subset of the parameter space: Ip1 λU{1}U {2}0 λU{1}U {3}0 . . . λU{1}U {k}0 λU{2}0U {1} Ip2 . . . λU{2}0U {k}0 . . . . . . . . . . . . λU{k}0U {1} λU{k}0U {2}0 . . . Ipk : U{1} = Ir 1 0 0 ur for ur Sp1 r+1 and |supp(ur)| s1 r + 1. Restricting on the set T1, the minimax risk for estimating U is the same as the minimax risk for estimating ur under the squared error loss ur bur 2 F. Let X{i} = [X{i}1, X{i}2] for X{i}1 Rn (r 1) and X{i}2 Rn (pi r+1). By the same argument in Gao et al. (2015a), it is further equivalent to estimating ur under the squared loss based on the observations from X{i}2 since X{i}2 for i = 1, 2 . . . , k is a sufficient statistic for ur. Applying the argument in Chen et al. (2013), we obtain the lower bound for dist2(U, b U) with the following term By symmetry the same lower bound holds if we replace s1, p1 by si, pi for i = 2, 3, . . . k. Combining all the above steps and noticing the 1 k scaling between the estimation of A and U, we finally deduce that the lower bound for dist2(A, b A) is given by (30). This finishes our proof. Appendix D. Simulation Details for Sparse PCA of Correlation Matrices Here we describe the procedure of generating a p p correlation matrix R with eigenvalues λ1 λr > λr+1 = 1 λp > 0 and s-sparse leading eigenvectors. Necessarily, Pr i=1 λr s < p 1 and s r. Without loss of generality, we assume that s = m 2l where m and l are positive integers and 2l r. Since 2l r, there is a 2l r matrix T0 such that T0,ij { 1} and that the columns of T0 are orthogonal. For example, we may take the first r columns of a 2l 2l Hadamard matrix. Fix such a T0, we generate an s r matrix Ts as T 0 . . . T 0 . By our construction, each row of Ts has the same l2 norm and the columns are orthonormal. Next, we define Rs = Ts diag(θ1, . . . , θr)T s + 1 Pr i=1 θi This decomposition ensures that the diagonal of Rs is equal to 1. We then solve for θi such that the eigenvalues of Rs are given by λ1, . . . , λr. Then we augment this s s positive definite matrix to a p p correlation matrix R = diag(Rs, Ip s). It is straightforward to verify that R has leading eigenvalues λ1 λr > 1 and the columns of T = [T s O (p s) r] are the corresponding eigenvectors. A.A. Amini and M.J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, 37(5B):2877 2921, 2009. Q. Berthet and P. Rigollet. Computational lower bounds for sparse pca. ar Xiv preprint ar Xiv:1304.0828, 2013. Sparse GCA and Thresholded Gradient Descent Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577 2604, 2008a. Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. 
The Annals of Statistics, 36(1):199-227, 2008b.
Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.
A. Birnbaum, I.M. Johnstone, B. Nadler, and D. Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. The Annals of Statistics, 41(3):1055-1084, 2013.
T. Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074-3110, 2013.
Mengjie Chen, Chao Gao, Zhao Ren, and Harrison H. Zhou. Sparse CCA via precision adjusted iterative thresholding. arXiv preprint arXiv:1311.6186, 2013.
Yuxin Chen and Emmanuel Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739-747, 2015.
Yuejie Chi, Yue M. Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239-5269, 2019.
Jon Dattorro. Convex Optimization and Euclidean Distance Geometry. Meboo, Palo Alto, 2005.
Deborah J. Donnell, Andreas Buja, and Werner Stuetzle. Analysis of additive dependencies and concurvities using smallest additive principal components. The Annals of Statistics, pages 1635-1668, 1994.
David L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1(2000):32, 2000.
Jianqing Fan, Jianhua Guo, and Shurong Zheng. Estimating number of factors by adjusted eigenvalues thresholding. arXiv preprint arXiv:1909.10710, 2019.
Chao Gao, Zongming Ma, Zhao Ren, and Harrison H. Zhou. Minimax estimation in sparse canonical correlation analysis. The Annals of Statistics, 43(5):2168-2197, 2015a.
Chao Gao, Zongming Ma, Zhao Ren, and Harrison H. Zhou. Supplement to "Minimax estimation in sparse canonical correlation analysis". The Annals of Statistics, 2015b.
Chao Gao, Zongming Ma, and Harrison H. Zhou. Sparse CCA: Adaptive estimation and computational barriers. The Annals of Statistics, 45(5):2074-2101, 2017.
Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1233-1242. JMLR.org, 2017.
Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
Harold Hotelling. Relations between two sets of variates. In Breakthroughs in Statistics, pages 162-190. Springer, 1992.
Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, pages 295-327, 2001.
I.M. Johnstone and A.Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682-693, 2009.
Jon R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433-451, 1971.
Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772-801, 2013.
Michael L. Overton and Robert S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM J. Matrix Anal. Appl., 1991.
Cencheng Shen, Ming Sun, Minh Tang, and Carey E. Priebe. Generalized canonical correlation analysis for classification. Journal of Multivariate Analysis, 130:310-322, 2014.
Ji-guang Sun. The perturbation bounds for eigenspaces of a definite matrix-pair. Numerische Mathematik, 41(3):321-343, 1983.
Kean Ming Tan, Zhaoran Wang, Han Liu, and Tong Zhang. Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1057-1086, 2018.
Jos M.F. Ten Berge. Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42(2):267-276, 1977.
Arthur Tenenhaus and Michel Tenenhaus. Regularized generalized canonical correlation analysis. Psychometrika, 76(2):257, 2011.
Yongge Tian. Rank equalities for block matrices and their Moore-Penrose inverses. Houston J. Math., 30(4):483-510, 2004.
Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
Vincent Q. Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670-2678, 2013.
V.Q. Vu and J. Lei. Minimax sparse principal subspace estimation in high dimensions. arXiv preprint arXiv:1211.0373, 2012.
Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
Zhaoran Wang, Huanran Lu, and Han Liu. Tighten after relax: Minimax-optimal sparse PCA in polynomial time. In Advances in Neural Information Processing Systems, pages 3383-3391, 2014.
Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441-479, 1912.
D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10:515-534, 2009.
Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam. Springer, 1997.
Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899-925, 2013.
H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:265-286, 2006.