Journal of Machine Learning Research 24 (2023) 1-61. Submitted 7/21; Revised 12/22; Published 5/23.

Sparse GCA and Thresholded Gradient Descent

Sheng Gao shenggao@wharton.upenn.edu
Zongming Ma zongming@wharton.upenn.edu
Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Ji Zhu

Abstract

Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple data sets. It generalizes canonical correlation analysis, which is designed for two data sets. We study sparse GCA when there are potentially multiple leading generalized correlation tuples of interest in the data and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as a generalized eigenvalue problem at both the population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic data sets.

Keywords: sparse GCA, sparse CCA, thresholded gradient descent, non-convex optimization, generalized eigenvalue problem

1. Introduction

With the advent of big data acquisition technology, it has become increasingly important to integrate information across multiple data sets collected on a common set of subjects. Canonical correlation analysis (CCA), first proposed by Hotelling (1992), is a widely used statistical tool for integrating information from two data sets: it seeks linear combinations of variables within each data set such that their correlation is maximized. However, recent advances in fields such as multi-omics and multimodal brain imaging have presented us with new challenges, since scientists are now often able to collect more than two data sets on the same set of subjects. To tackle these challenges, we turn to a useful generalization of CCA called generalized correlation analysis (GCA) (Kettenring, 1971), which aims to explore linear relationships across multiple data sources. Kettenring (1971) proposed five different variants of generalized correlation analysis of multiple data sets, where different methods correspond to maximization of different objective functions of covariances and correlations, subject to certain normalization constraints. Tenenhaus and Tenenhaus (2011) extended these approaches to regularized versions by adding l2 penalties to the objective functions and proposed partial least squares algorithms for solving them. In addition to unsupervised settings, it has been shown that GCA can be incorporated into supervised learning settings to improve efficiency and accuracy (Shen et al., 2014).

Suppose that the k data sets are i.i.d. realizations of k random vectors X{1} ∈ R^{p1}, X{2} ∈ R^{p2}, ..., X{k} ∈ R^{pk}. Here and after, we use the subscript {j} to denote the jth set of features and/or the jth data set.
At the population level, we propose to seek vectors a{1}, a{2}, ..., a{k} (called the first k-tuple of generalized loading vectors) that solve

maximize over l{1}, ..., l{k}:   \sum_{i,j=1}^{k} cov(l{i}⊤ X{i}, l{j}⊤ X{j})   subject to   \sum_{i=1}^{k} var(l{i}⊤ X{i}) = 1.    (1)

An important difference between the foregoing formulation and those in Kettenring (1971) is the normalization constraint. We propose to normalize the sum of the variances of the linear combinations, while Kettenring (1971) requires the individual variance of each linear combination to be one. When k = 2, the two normalizations essentially lead to the same solution. One can show that the optimal solution to (1) necessarily has var(a{i}⊤ X{i}) = 1/2 for i = 1 and 2, and hence the optimal a{i}'s are proportional to those obtained by normalizing individual variances. However, when k ≥ 3, the solutions are usually different, and we argue that formulation (1) is more appropriate for two reasons. First, the normalization in (1) conforms with the null principle (Donnell et al., 1994) in that if all linear combinations of the X{i}'s are uncorrelated, then the objective function reduces to the left-hand side of the normalization constraint. In addition, it is more resilient to spurious solutions, as illustrated by the following example.

Example 1 Suppose k = 3 and p1 = p2 = p3 = m. Let X{1} = Y v{1} + Z{1}, X{2} = Y v{2} + Z{2} and X{3} = Z{3}, where v{1} and v{2} are two deterministic unit vectors in R^m, Y ~ N(0, 1), Z{i} ~iid N_m(0, I_m), and they are mutually independent. Simple linear algebra shows that the optimal solution to (1) is a{1} = (1/2) v{1}, a{2} = (1/2) v{2} and a{3} = 0. In contrast, with the individual normalization constraints var(l{i}⊤ X{i}) = 1, the optimal solution would change to a{1} = (1/√2) v{1}, a{2} = (1/√2) v{2}, and a{3} could be any vector in B^{m-1}, where B^{m-1} is the unit ball in R^m. Comparing the solutions under the two normalizations, formulation (1) is advantageous in that it provides a meaningful and unique optimal a{3}.

The solution to (1) gives the first k-tuple of leading GCA loading vectors. More generally, we want to extract successive k-tuples of leading loading vectors subject to certain additional constraints. In view of (1), suppose we have found (a^{(m)}{1}, ..., a^{(m)}{k}), for m = 1, ..., r − 1, as the first (r − 1) k-tuples of leading loading vectors. We define the rth k-tuple (a^{(r)}{1}, ..., a^{(r)}{k}) as the solution to (1) with the following additional constraints:

\sum_{i=1}^{k} cov((a^{(m)}{i})⊤ X{i}, l{i}⊤ X{i}) = 0,   m = 1, ..., r − 1.

To be more concise, let the columns of L{i} ∈ R^{pi × r} be the successive leading loading vectors for X{i}, for i = 1, ..., k. The foregoing proposal for finding the first r leading k-tuples then reduces to the following optimization problem

maximize over L{1}, ..., L{k}:   \sum_{i,j} Tr(L{i}⊤ Σ{ij} L{j})   subject to   \sum_{i=1}^{k} L{i}⊤ Σ{ii} L{i} = I_r.    (2)

Here Σ{ii} denotes the covariance matrix of X{i} and Σ{ij} = cov(X{i}, X{j}) for i ≠ j. To write (2) more concisely, we define L = [L{1}⊤, ..., L{k}⊤]⊤ ∈ R^{p × r} where p = \sum_{i=1}^{k} pi, Σ0 = diag(Σ{11}, ..., Σ{kk}) ∈ R^{p × p}, and Σ ∈ R^{p × p} whose (i, j)th off-diagonal block is Σ{ij} and whose ith diagonal block is Σ{ii}. Then we can rewrite (2) as

maximize over L:   Tr(L⊤ Σ L)   subject to   L⊤ Σ0 L = I_r.    (3)

In what follows, A = [A{1}⊤, ..., A{k}⊤]⊤ denotes the solution to (3), where each A{i} ∈ R^{pi × r} and the mth column of A{i} is a^{(m)}{i}, for i = 1, ..., k and m = 1, ..., r.
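To make formulation (3) and Example 1 concrete, the sketch below builds the population covariance of Example 1 and solves (3) with r = 1 numerically. It is a minimal illustration assuming NumPy and SciPy; the variable names are ours and are not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

m = 5
rng = np.random.default_rng(0)
v1 = rng.standard_normal(m); v1 /= np.linalg.norm(v1)
v2 = rng.standard_normal(m); v2 /= np.linalg.norm(v2)

# Population covariance blocks under Example 1: X1 = Y*v1 + Z1, X2 = Y*v2 + Z2, X3 = Z3.
S11 = np.outer(v1, v1) + np.eye(m)
S22 = np.outer(v2, v2) + np.eye(m)
S33 = np.eye(m)
S12 = np.outer(v1, v2)                       # cov(X1, X2) = v1 v2'
Z = np.zeros((m, m))
Sigma = np.block([[S11, S12, Z], [S12.T, S22, Z], [Z, Z, S33]])
Sigma0 = np.block([[S11, Z, Z], [Z, S22, Z], [Z, Z, S33]])

# Solve (3) with r = 1; eigh returns eigenvectors a normalized so that a' Sigma0 a = 1.
vals, vecs = eigh(Sigma, Sigma0)
a = vecs[:, -1]                              # leading generalized eigenvector
a1, a2, a3 = a[:m], a[m:2*m], a[2*m:]
print(round(vals[-1], 4))                            # leading value, 1.5 in this example
print(round(np.linalg.norm(a3), 4))                  # ~0: the noise-only block gets zero loading
print(round(abs(a1 @ v1) / np.linalg.norm(a1), 4))   # ~1: a1 is proportional to v1
```

The leading solution puts all of its weight on the two blocks that share the latent variable Y, in agreement with Example 1.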
Informed readers might have already realized that (3) is closely connected to the generalized eigenvalue problem |Σ − λΣ0| = 0, where |M| stands for the determinant of a square matrix M. In addition, when Σ0 is diagonal (i.e., when k = p), the problem is equivalent to principal component analysis (PCA) of the correlation matrix: for any non-singular Σ0, the columns of Σ0^{1/2} A are the leading eigenvectors of the correlation matrix Σ0^{-1/2} Σ Σ0^{-1/2}. When k = 2, the solution to (3) is equivalent to CCA up to scaling. Thus, program (3) provides a unified formulation for extracting leading linear covariation within one or across multiple data sets.

In practice, one does not have direct knowledge of the joint population covariance matrix Σ or even its block diagonal part Σ0. The natural choice is to replace them with their sample counterparts. In modern applications of GCA, the data set dimensions (i.e., the pi's) can be much larger than the sample size n. Hence, one suffers from the curse of dimensionality if no further structural assumption is made (Donoho, 2000; Johnstone, 2001; Bickel and Levina, 2008a,b). Due to its interpretability and practicality, a structural assumption that has been widely adopted in both theory and practice is sparsity: most energy of the solution to (1)–(3) concentrates on a small number of entries (Hastie et al., 2015; Wainwright, 2019). Let A denote the solution to (3). In this manuscript, we adopt the assumption that at most s rows of A contain nonzero entries. In other words, the target of estimation is also the solution to the following sparse generalized correlation analysis (Sparse GCA, or SGCA) problem:

maximize over L:   Tr(L⊤ Σ L)   subject to   L⊤ Σ0 L = I_r,   ||L||_{2,0} ≤ s.    (4)

Here and after, for any matrix L, ||L||_{2,0} counts the number of nonzero rows in L. In view of the discussion following (3), when k = p and k = 2, (4) reduces to sparse PCA of a correlation matrix and sparse CCA, respectively.

1.1 Main contributions

The main contributions of the present manuscript are the following. First, we clarify the target of estimation in generalized correlation analysis (i.e., the solution A to (3)) by considering a natural latent variable model in which an r-dimensional latent variable drives the covariation of k random vectors. Under mild conditions, we show that the linear subspace spanned by the leading r generalized eigenvectors in our GCA formulation (i.e., the columns of A) coincides with the subspace spanned by a concrete functional of parameters in the latent variable model. In addition, we characterize the behavior of generalized eigenvalues under the latent variable model. Next, for sample sparse GCA, we propose a thresholded gradient descent algorithm for solving a Lagrangian version of the sample counterpart of (4). The algorithm is intuitive and easy to implement. In view of the discussion following (3), such an algorithm provides a unified approach to a number of different sparse unsupervised learning problems, including sparse PCA of correlation matrices and sparse CCA. Furthermore, we provide a theoretical analysis of the thresholded gradient descent algorithm. We show that with high probability, when initialized properly, estimation errors of the intermediate results after each iteration converge at a geometric rate until they arrive in a tight neighborhood of the population solution to (4). Statistically, we establish tight estimation error bounds for the output of our algorithm as an estimator.
Numerically, this implies geometric convergence of our algorithm to such an estimator. Finally, these theoretical findings are corroborated by numerical studies on simulated data sets. 1.2 Related works In view of the discussion following (3), the present paper is closely connected to the sparse PCA and sparse CCA literature. To date, there is a large literature devoted to various aspects of the sparse PCA problem, including algorithms (Zou et al., 2006; Johnstone and Lu, 2009; Amini and Wainwright, 2009; Witten et al., 2009; Ma, 2013; Yuan and Zhang, 2013; Vu et al., 2013; Wang et al., 2014), information-theoretic limits (Birnbaum et al., 2013; Cai et al., 2013; Vu and Lei, 2012) and computational theoretic limits (Berthet and Rigollet, 2013; Gao et al., 2017). However, this literature has mostly focused on sparse PCA of covariance matrices. Thus, the special case of k = p in our setting complements the existing literature by providing both theory and method for sparse PCA of correlation matrices. In the case of k = 2, the theory and method in this paper specialize to the sparse CCA setting (Chen et al., 2013; Gao et al., 2015a, 2017). In this case, we provide a new iterative algorithm for sparse CCA that achieves optimal estimation rates derived in Gao et al. (2015a). Therefore, for this special case, the present manuscript provides a competitive alternative to existing sparse CCA methods. When r = 1, (4) reduces to the population version of the sparse generalized eigenvalue problem considered in Tan et al. (2018). Tan et al. (2018) proposed a truncated Rayleigh Sparse GCA and Thresholded Gradient Descent flow method for estimating the population solution in this special case, and established its rates of convergence. It is not clear how their method can be generalized to estimate successive generalized eigenvectors or eigenspaces. In contrast, our estimator is based on a different algorithm (thresholded gradient descent of a Lagrangian objective function). It not only achieves fast converging estimation rates for estimating the first generalized eigenvector but also works for estimating leading generalized eigenspaces of fixed dimensions that are greater than one. 1.3 Organization of the paper The rest of this paper is organized as follows. In section 2 we examine the generalized eigenvalue problem underpinning sparse GCA under a latent variable model. Section 3 proposes a thresholded gradient descent algorithm and its initialization via generalized Fantope projection. Section 4 establishes the convergence rate of our algorithm under reasonable initialization. Numerical results are presented in Section 5. Technical proofs are deferred to appendices. 1.4 Notation For any set S, let |S| denote its size and Sc denote its complement. For any event E, 1E is its indicator function. For a vector u, u = (P i u2 i )1/2, u 0 = P i 1ui =0, u 1 = P i |ui|, u = maxi |ui|. For any matrix A = (aij), the ith row of A is denoted by Ai and the jth column by A j. We let Col(A) denote the span of columns of A. For a positive integer m, [m] denotes the index set 1, 2, . . . , m. For two subsets I and J of indices, we write AIJ for the |I| |J| submatrices formed by aij with (i, j) I J. When I or J is the whole set, we abbreviate it with an , and so if A Rp k, then AI = AI[k] and A J = A[p]J. For any square matrix A = (aij), denote its trace by Tr(A) = P i aii. Moreover, we denote the collection of all p r matrices with orthonormal columns by O(p, r), and abbreviate O(r, r) by O(r). 
The set of p p symmetric matrices is denoted by Sp. Furthermore, σi(A) stands for the ith largest singular value of A and σmax(A) = σ1(A), σmin(A) = σmin{p,k}(A). The Frobenius norm and the operator norm of A are A F = (P i,j a2 ij)1/2 and A op = σ1(A), respectively. The infinity norm of A is defined as A = maxi,j |aij|. The l1 norm and the nuclear norm of a matrix A are A 1 = P i,j |aij| and A = P i σi(A), respectively. Similarly, si(A) stands for the ith largest eigenvalue of A Rp p, while smax(A) = s1(A) and smin(A) = sp(A). The support of A is defined as supp(A) = {i [n] : Ai > 0}. For two symmetric matrices A and B, we write A B if B A is positive semidefinite. For any positive semi-definite matrix A, A1/2 denotes its principal square root that is positive semi-definite and satisfies A1/2A1/2 = A. The trace inner product of two matrices A, B Rp k is A, B = Tr(A B). For any real numbers a and b, let a b = min(a, b) and a b = max(a, b). For any two sequences of positive numbers {an} and {bn}, we write an = O(bn) if lim supn an/bn is finite. Given a random element X, L(X) denotes its probability distribution. The symbol C and its variants C1, C , etc. are generic positive constants and may vary from occurrence to occurrence, unless otherwise specified. The symbols P and E stand for generic probability and expectation when the distribution is clear from the context. 2. A Latent Variable Model In this part, we aim to identify the solutions to (3) as functionals of parameters in the joint distribution of (X {1}, . . . , X {k}) . To this end, we introduce an intuitive latent variable model. The fundamental assumption underlying generalized correlation analysis is that there exists a shared low-dimensional latent variable which orchestrates the (linear) covariation of observed features across all data sets. Let z be an r-dimensional latent variable. The following latent variable model describes an idealized data generating process: X{i} = U{i}z + e{i}, i = 1, . . . , k, z Nr(0, Ir), e{i} ind Npi(0, Ψ{ii}). (5) Here the deterministic matrix U{i} Rpi r for i = 1, . . . , k, Ψ{ii} is positive definite, and the latent variable z and the idiosyncratic noises {e{i}}k i=1 are mutually independent. Under the foregoing latent variable model, the joint covariance matrix of (X {1}, . . . , X {k}) Rp is given by Σ with the ith diagonal block Σ{ii} = U{i}U {i} + Ψ{ii}, for i = 1, . . . , k, (6) and the (i, j)th block Σ{ij} = U{i}U {j}, for 1 i = j k. (7) We denote by Ψ the block diagonal matrix with blocks Ψ{ii}, i = 1, . . . , k, on the diagonal. We also let U = [U {1}, U {2}, ..., U {k}] Rp r. In the rest of this section, we assume that the observed data sets are generated by model (5). In addition, we let λ1 λ2 λp 0 (8) denote the population generalized eigenvalues of Σ with respect to Σ0 = diag(Σ{11}, . . . , Σ{kk}). The following key lemma identifies the connection between parameters in model (5) and the solution to (3). Lemma 1 Suppose that Σ and Σ0 are specified by model (5) (7). Let A = [A {1}, . . . , A {k}] be the solution to (3). If the rth generalized eigenvalue of Σ w.r.t. Σ0 is larger than 1, i.e., λr > 1, then Col(A{i}) Col(Σ 1 {ii}U{i}) for all i [k]. For any i, if further rank(A{i}) = r , then Col(A{i}) = Col(Σ 1 {ii}U{i}). The next lemma describes the behavior of population generalized eigenvalues under latent variable model (5). It also identifies a sufficient condition for λr > 1. To this end, we start with an assumption motivated by Fan et al. (2019). 
Assumption 2 In latent variable model (5) (7), we assume the following: The matrix U has full column rank, that is, rank(U) = r; Σ0 and U satisfy σr(Σ 1/2 0 U) 1. Sparse GCA and Thresholded Gradient Descent Lemma 3 In model (5) (7), under Assumption 2, we have r = max{j : λj > 1, j [p]}. (9) In other words, there are exactly r generalized eigenvalues greater than 1. Moreover, define Y = [U {1}, U {2}, . . . , U {k}] Rp r where U {i} is the Moore-Penrose inverse of U{i}. Then the multiplicity of 1 as a generalized eigenvalue is #{j : λj = 1} = p i=1 rank(U{i}) + r rank(U UY U). (10) The foregoing lemma guarantees an eigengap between the rth and the (r + 1)th generalized eigenvalues, and so the leading rank r generalized eigenspace is well-defined at the population level. By the foregoing lemma, under model (5) we can decompose Σ as Σ = Σ0KΛK Σ0 = Σ0AΛr A Σ0 + Σ0BΛB Σ0. (11) Here Λ, Λr, and Λ are all diagonal where Λr = diag(λ1, λ2, ..., λr) collects the first r generalized eigenvalues, Λ collects the remaining p r generalized eigenvalues, and Λ = diag(Λr, Λ). In addition, B collects the eigenvectors associated with the bottom p r generalized eigenvalues. On the other hand, the decomposition (11) hold as long as the rth and (r + 1)th generalized eigenvalues are distinct and hence is more general than model (5). Remark 4 In the case when k = 2, The covariance matrices between X and Y can be reparameterized as Σxy = Σx V Θr W Σy with V Σx V = W Σy W = Ir Gao et al. (2015a, 2017). Here Θr = diag(θ1, θ2, ..., θr) collects the leading r canonical coefficients for CCA and Θr + Ir = Λr. Hence we have λr = θr + 1. The solution A = [A {1}, A {2}] to (3) satisfies that A{1} = 1 2V, A{2} = 1 2W. Such a relationship does not hold in general for k 3. Remark 5 When k = p, Σ0 becomes a diagonal matrix where the diagonal entries are variances of the variables. The problem (3) is then equivalent to finding the leading eigenspace of correlation matrix of the data, defined as R = Σ 1/2 0 ΣΣ 1/2 0 . Let ER Rp r be the matrix containing the eigenvectors that span the leading r dimensional eigenspace. If A is the solution to (3), then Σ1/2 0 A coincides with ER (up to an r r rotation matrix when there is any generalized eigenvalue with multiplicity larger than one). Thus, we essentially estimate the leading eigenspace of the correlation matrix. 3. Gradient Descent with Hard Thresholding In this section, we present a thresholded gradient descent algorithm for simultaneously finding multiple leading sparse generalized eigenvectors. 3.1 Motivation The sample counterpart of (4) can be recast as the following minimization problem: minimize L bΣ, LL subject to L bΣ0L = Ir, L 2,0 s. Here bΣ is the joint sample covariance matrix of all k sets of features. We propose to solve its Lagrangian version: minimize L Rp r f(L) subject to L 2,0 s, (12) where the objective function f(L) is f(L) = bΣ, LL + λ 2 L bΣ0L Ir 2 F. (13) Effectively, minimizing the first term in (13) maximizes the original objective function whereas minimizing the second term controls the deviation from the normalization constraint in (3). Here λ is a tuning parameter. Intuitively, the larger it is, the more penalty we put on deviating from the normalization constraint and the closer L bΣ0L is to Ir. To deal with the constraint in (12), we shall perform the following hard thresholding. 
Definition 6 Given a matrix U and a natural number k, we define the output of the hard thresholding function HT(U, k) to be the matrix obtained by keeping the k rows with the largest l2 norms and replacing all other rows with zeros, that is,

HT(U, k)_{i·} = U_{i·} if i ∈ C_k, and 0 otherwise,

where C_k is the index set of the k rows of U with the largest l2 norms. When there is a tie, we always pick the smaller/smallest index.

3.2 Algorithm

Let X^{(1)}, X^{(2)}, ..., X^{(n)} be i.i.d. observations generated from the latent variable model with covariance matrix Σ and its block diagonal part Σ0. Let Σ̂ and Σ̂0 be the sample covariance matrix and its block diagonal part computed on {X^{(1)}, X^{(2)}, ..., X^{(n)}}. Algorithm 1 describes the proposed thresholded gradient descent procedure. Given a proper initial estimator, which we shall specify later, each iteration first performs a step of gradient descent on f defined in (13), and then keeps the s′ rows with the largest l2 norms and thresholds the remaining rows to zero. By iteratively performing these two steps, the algorithm can be viewed as a heuristic for solving the non-convex optimization problem (4). Here s′ is a user-specified tuning parameter and is not necessarily equal to the true sparsity level s. Step 2 in Algorithm 1 is a re-normalization step. This is due to the fact that the stationary point of the objective function f(L) is not the same as the solution to the original optimization problem (4). Thus, we first transform the initializer for the iterations and eventually transform back in Step 7 to define the final estimator.

Algorithm 1: Thresholded gradient descent for sparse GCA
Input: Covariance matrix estimator Σ̂ and its block diagonal part Σ̂0; initialization Â0.
Tuning parameters: step size η; penalty λ; sparsity level s′; number of iterations T.
1: Ã0 ← Â0 (Â0⊤ Σ̂0 Â0)^{-1/2}
2: V̄1 ← Ã0 (I + (1/λ) Ã0⊤ Σ̂ Ã0)^{1/2}
3: for t = 1, 2, 3, ..., T do
4:   V_{t+1} ← V̄_t − η ∇f(V̄_t) = V̄_t + 2η (Σ̂ V̄_t − λ Σ̂0 V̄_t (V̄_t⊤ Σ̂0 V̄_t − I_r))
5:   V̄_{t+1} ← HT(V_{t+1}, s′)
6: end for
7: Output: Â_T = V̄_T (V̄_T⊤ Σ̂0 V̄_T)^{-1/2}

In Algorithm 1, each iteration is computationally efficient: line 4 is essentially matrix multiplication and addition, and line 5 requires calculating and sorting row l2 norms. Since Σ̂ = (1/n) \sum_{i=1}^{n} (X^{(i)} − X̄)(X^{(i)} − X̄)⊤ where X̄ = (1/n) \sum_{i=1}^{n} X^{(i)}, an efficient way to calculate Σ̂ V̄_t is to compute Y_t^{(i)} = (X^{(i)} − X̄)⊤ V̄_t first, followed by multiplying (X^{(i)} − X̄) Y_t^{(i)}, resulting in O(npr) flops. Line 5 requires O(rp + p log p) flops for selecting the top s′ rows. Hence a single iteration of thresholded gradient descent requires O(npr + p log p) flops in total.

3.3 Initialization via generalized Fantope projection

Success of Algorithm 1 depends crucially on the quality of the initial estimator Â0. Thus, we need to find an initial estimator that is relatively close to the true generalized eigenspace in some distance so that later iterations can further improve on estimation accuracy. We now introduce such an initial estimator based on Fantope projection (Vu et al., 2013). Note that

\sum_{i,j=1}^{k} Tr(A{i}⊤ Σ{ij} A{j}) = ⟨Σ, AA⊤⟩.

The idea is to lift AA⊤ into S^p, the space of p × p symmetric matrices, and hence to treat it as a single quantity. Since we assume that A is row sparse, F = AA⊤ has at most s² nonzero entries, which is much smaller than its total number of entries. To ensure sparsity of its solution, we impose an entrywise l1 penalty to define the following objective function for initialization:

minimize over F ∈ S^p:   −⟨Σ̂, F⟩ + ρ ||F||_1.
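For concreteness, the following minimal NumPy sketch implements the hard-thresholding operator of Definition 6 and a single iteration of Algorithm 1 (lines 4–5). The function and variable names are ours, and the snippet illustrates the update rule for the objective (13) rather than reproducing the authors' implementation.

```python
import numpy as np

def hard_threshold_rows(U, k):
    """Keep the k rows of U with the largest l2 norms; zero out the rest (Definition 6)."""
    norms = np.linalg.norm(U, axis=1)
    keep = np.argsort(-norms, kind="stable")[:k]   # stable sort breaks ties by smaller index
    out = np.zeros_like(U)
    out[keep] = U[keep]
    return out

def tgd_step(V_bar, Sigma_hat, Sigma0_hat, eta, lam, s_prime):
    """One step of Algorithm 1 on f(L) = -<Sigma_hat, LL'> + (lam/2) ||L' Sigma0_hat L - I||_F^2."""
    r = V_bar.shape[1]
    grad = (-2 * Sigma_hat @ V_bar
            + 2 * lam * Sigma0_hat @ V_bar @ (V_bar.T @ Sigma0_hat @ V_bar - np.eye(r)))
    V_next = V_bar - eta * grad                     # gradient step (line 4)
    return hard_threshold_rows(V_next, s_prime)     # hard thresholding (line 5)
```

We now return to the construction of the initial estimator.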
On the other hand, the normalization constraint on A implies that (bΣ0)1/2F(bΣ0)1/2 P(p, r), where P(p, r) is defined to be the set of rank r projection matrices P(p, r) = {PP , P O(p, r)}. However, this direct generalization leads to a nonconvex feasible set since P(p, r) is nonconvex. To obtain a bona fide convex program, in the light of Vu et al. (2013), we use the Fantope set introduced by Dattorro (2003): Fr = {X : 0 X I and Tr (X) = r}. The motivation is the observation from Overton and Womersley (1991) that Fr = conv(P(p, r)), where conv(A) denotes the convex hull of A. To summarize, our initial estimator is the solution to the following program: minimize F Sp bΣ, F + ρ F 1, subject to (bΣ0)1/2F(bΣ0)1/2 Fr. (14) Upon obtaining b F as the solution to (14), we collect the leading r eigenvectors of b F as e Ur Rp r and the corresponding leading r eigenvalues as entries of the diagonal matrix e Dr Rr r. Then let A0 = e Ur e D1/2 r and b A0 = HT(A0, s ). (15) This finishes initialization for Algorithm 1. The initialization procedure has O(p3) computation complexity due to the ADMM step involved in solving the generalized Fantope projection. See Gao et al. (2017) for details of the ADMM algorithm. As we shall show in next section, b F suffers a relatively large estimation error rate for estimating AA and hence b A0 for A. However, for Algorithm 1 to work, such an estimator serves well as an initial estimator under mild conditions. In addition to PCA and the current setting, an analogous initialization via convex relaxation idea has appeared in Gao et al. (2017) for performing sparse CCA which is asymmetric. 4. Theoretical Results We provide theoretical justifications for our algorithms in this section. We first state our main result on how each iteration of Algorithm 1 improves estimation accuracy, followed by a corollary on error bounds achieved by the final estimator. Analysis of generalized Fantope initialization follows our investigation of the main algorithm. In addition, we include a lower bound for the finite k setting at the end of this section. Parameter space Under covariance structure (11), we define F({si}k 1, {pi}k 1, r, {λj}p 1; ν) as the collection of all covariance matrices Σ satisfying (11) and the following conditions: (i) Sparsity: A{i} Rpi r with A{i} 2,0 si; (ii) Bounded spectrum : 1 ν smin(Σ) < smax(Σ) ν, 1 ν smin(Σ0) < smax(Σ0) ν; (iii) Eigengap : λr λr+1 > 0. Here λi s are generalized eigenvalues as defined in (8). By Section 2, under model (5) the eigengap condition is satisfied if Assumption 2 holds. The parameter space is then defined Sparse GCA and Thresholded Gradient Descent as Pn({si}k 1, {pi}k 1, r, {λj}p 1; ν) = L(X(1), X(2)..., X(n)) : X(i) iid N(0, Σ), Σ F({si}k 1, {pi}k 1, r, {λj}p 1; ν) . (17) In what follows, S denotes the true row support of A. While the parameter space requires normality, our theoretical analysis generalizes directly to sub-Gaussian distributions. We omit the generalization in this work as it is mostly formality. Matrix distance To measure estimation accuracy, we define the distance between two matrices U and V Rp r as dist(U, V ) = min P O(r) UP V F. (18) Here O(r) is the collection of r r orthogonal matrices. The matrix distance has been used previously in Ge et al. (2017), Tu et al. (2015) and Golub and Van Loan (2012), among others. 4.1 Main results The following theorem characterizes numerical convergence of Algorithm 1 when starting at a reasonable initializer. 
Theorem 7 In Algorithm 1, set η c 12λ1ν(1 + c)2 , λ = λ1 for some constant c, and (λr λr+1)2η2 s. (20) For initial estimator b A0, suppose that it has row sparsity s , and that for V = A I + 1 b A0 is so constructed that after the first two lines of Algorithm 1, we have dist(V, V 1) 1 8 ν min c(λr λr+1) 2λ1ν2(42 + 25c), 1 + c If (16) holds and p 1 + λ2 r+1 (λr λr+1)2 n < c0 (23) for some sufficiently small constant c0 > 0, then for some constants C, C > 0, uniformly over Pn = Pn({si}k 1, {pi}k 1, r, {λj}p 1; ν), with probability at least 1 exp( C (s log(ep/s ))), for all t 1, dist(V, V t+1) C s 1 + λ2 r+1 λr λr+1 n | {z } Statistical Error + ξtdist(V, V 1) | {z } Optimization Error where ξ = 1 η2(λr λr+1)2 The foregoing theorem leads to the following corollary on high probability error bounds for estimating A. Corollary 8 Suppose the conditions of Theorem 7 hold. For each t 1, let b At = V t(V t bΣ0V t) 1/2. Then for some constants C, C1, C > 0, uniformly over Pn, with probability at least 1 exp( C (s log(ep/s ))), for all t T0, where T0 = log C s 1+λ2 r+1 λr λr+1 log(1 η2(λr λr+1)2 64ν2 ) , (25) dist(A, b At) C1 1 + λ2 r+1 λr λr+1 In particular, when T T0, the last display holds for the output b AT of Algorithm 1. We give some brief remarks on Theorem 7 and Corollary 8. The error bound for each step, given by the right side of (24), is composed of two parts: a statistical error term independent of the iteration counter t, and an optimization error that decreases geometrically as t increases. When t T0, the optimization error is dominated by the statistical error and we can achieve the estimation rate in Corollary 8. Thus, T0 can be interpreted as the minimum number of iterations needed for achieving the estimation error rate in Corollary 8. When the eigengap λr λr+1 is lower bounded by a positive constant, it is straightforward to verify that T0 = O(log(p+n)). In other words, thresholded gradient descent drives down the statistical error of the estimator to the desired rate after O(log(p + n)) iterates. It is worth noting that the contraction rate ξ does not depend on ambient dimension p, which can be much larger than r and s. Condition (22) on initial estimator is related to the notion of basin of attraction . It has previously appeared in the literature of machining learning with nonconvex optimization in other contexts, e.g., Chen and Candes (2015); Chi et al. (2019); Wang et al. (2014). The global landscape of (12) could be hard to handle due to non-convexity and sparsity constraints. However, within the basin of attraction, the objective function f( ) is locally smooth and strongly convex, which allows projected gradient descent to find a statistically sound solution at geometric convergence rate. Outline of proof To prove Theorem 7, we track the estimator trajectory {V t : t 0} over iteration. To this end, we first characterize a global high probability event under which the entire trajectory would lie inside the basin of attraction within which the objective function is smooth and strongly convex. On the event, we analyze in sequel the effects of the gradient step and the hard thresholding step in each iteration. In particular, we show that each gradient step drives down the distance between the target of estimation and the current estimator while the hard thresholding step projects the current estimator onto the feasible set. For details, see Propositions 18 and 19, respectively. 
Combining the two propositions, we obtain a recursive inequality that characterizes the estimator trajectory Sparse GCA and Thresholded Gradient Descent as in (24). This completes the major steps in the proof. Finally, Corollary 8 is a direct consequence of normalization on the same event that we performed the foregoing analysis of iteration. For proof details, see Appendix B. 4.2 Analysis of initialization The following theorem characterizes the estimation accuracy of the proposed initial estimator via generalized Fantope projection. Theorem 9 Suppose s2 log p n(λr λr+1)2 ϵ (26) for some sufficiently small ϵ > 0. Let b F be the solution to (14) where ρ = γ q n for γ [γ1, γ2] for some positive constants γ1 < γ2. There exist constants C, C > 0 such that uniformly over Pn, with probability at least 1 exp( C (s + log(ep/s))), b F AA 2 F C (Pk i=1 si)2 log(Pk i=1 pi) n(λr λr+1)2 = O s2 log p n(λr λr+1)2 The following corollary further bounds the estimation accuracy of b A0 through the lens of V 1 defined in line 2 of Algorithm 1. Corollary 10 Suppose that the conditions of Theorem 9 hold. Under the choice of s and condition (23) in Theorem 7, there exist constants C, C > 0 such that uniformly over Pn, with probability at least 1 exp( C (s + log(ep/s))), dist(V, V 1) Cs λr λr+1 where V is defined in (21). Combining Theorems 7 and 9 and Corollaries 8 and 10, we obtain the following corollary on the whole procedure: Algorithm 1 with initialization via generalized Fantope projection. Corollary 11 Suppose that n C0 max (1 + λ2 1)(1 + λ2 r+1) (λr λr+1)4 rs log p, s2 log p (λr λr+1)2 for some sufficiently large positive constant C0, that T T0 with T0 in (25), and that all the other conditions in Theorems 7 and 9 are satisfied. There exist constants C, C > 0 such that uniformly over Pn, with probability at least 1 exp( C (s + log(ep/s))), the final output satisfies dist(A, b AT ) C1 1 + λ2 r+1 λr λr+1 4.3 A lower bound for finite k Corollary 8 provides upper bounds for the proposed procedure in Algorithm 1. For finite k, we have the following information-theoretic lower bound result. Theorem 12 Assume that 1 r mini si 2 , and that n(λr λr+1)2 C0 r + max i log epi for some sufficiently large positive constant C0. Then there exist positive constants c and c0 such that the minimax risk for estimating A satisfies inf b A sup Σ F EΣdist2(A, b A) c0 c n(λr λr+1)2 i=1 si log epi The proof of Theorem 12 is given in Appendix C. Comparing Theorem 12 and Corollary 8, we see that when r is finite and λ1 C for some large positive constant C, as long as there exists i {1, . . . , k} such that si and pi are of the same order as s and p simultaneously, the lower and upper bounds match. Otherwise, they differ by at most a multiplicative factor of log p. 5. Numerical Results This section reports numerical results on synthetic data sets. Except for the settings in Section 5.4 and Section 5.6, r, the latent dimension in model (5) is assumed to be known. Choices of tuning parameters are specified in each setting. In practice, they can also be selected using cross-validation on grids. The rest of this section is organized as follows. We first study numerical errors for estimating generalized eigenspaces of different dimensions. For the special case of r = 1, we also compare our method with the Rifle method in Tan et al. (2018). Next, we assess the performance of Algorithm 1 in the context of sparse CCA by comparing it to the Co La R method in Gao et al. (2017). 
Furthermore, we consider a potential model mis-specification scenario where the input latent dimension of the algorithm differs from the true value. We then apply Algorithm 1 to perform sparse PCA of correlation matrices. Finally, we investigate the performance of Algorithm 1 under a general covariance structure.

5.1 Sparse GCA with different latent dimensions

We first consider sparse GCA of three high-dimensional data sets. In particular, we set n = 500, k = 3, p2 = p3 = 200, p1 = 500, and s1 = s2 = s3 = 5. To generate covariance matrices Σ and Σ0, we use the latent variable model specified in Section 2. Specifically, Σ is a block matrix with

Σ{ij} = T{i} U{i} U{j}⊤ T{j},  Σ{ii} = T{i},  for i ≠ j ∈ {1, 2, 3},  where  U{1}⊤ T{1} U{1} = U{2}⊤ T{2} U{2} = U{3}⊤ T{3} U{3} = I.

Here each Σ{ii} = T{i} is a Toeplitz matrix, defined by setting (T{k})_{ij} = σ_{kij} where σ_{kij} = a_k^{|i−j|} for all i, j ∈ [p_k], with a1 = 0.5, a2 = 0.7, a3 = 0.9. To generate U{i} ∈ R^{pi × r}, we first randomly select a support of size 5. For each row in the support, we generate its entries as i.i.d. standard normal random variables. Then all U{i}'s are normalized with respect to T{i}. With the foregoing construction, it is straightforward to verify that λr − λr+1 = 2 and that Col(A{i}) = Col(U{i}). Finally, Σ0 = diag(Σ{11}, Σ{22}, Σ{33}) contains the block diagonal elements of Σ.

We vary r in {1, 2, 3, 4, 5} and report squared matrix distances, defined as dist²(A, Â) = min_{O ∈ O(r)} ||ÂO − A||²_F, for both initial and final estimators based on 50 repetitions in each setting. For tuning parameters in Algorithm 1, we set s′ = 20, η = 0.001, λ = 0.01, and T = 15000. The tuning parameter ρ for generalized Fantope initialization is set to 1/√n. The truncation parameter for initialization is also set to s′ = 20.

Table 1 reports the results of the aforementioned simulation study. For all latent dimensions, we observe a significant decrease in estimation error after Algorithm 1 is applied. This corroborates the theory in Section 4. To better understand how each iteration of Algorithm 1 improves estimation, we plot dist(V, V̄_t) and dist(A, Â_t) in logarithmic scale against the iteration counter t for r = 1, 2, 3, 4, 5 and n = 500. For any t, we set Â_t = V̄_t (V̄_t⊤ Σ̂0 V̄_t)^{-1/2}, which agrees with the definition of Â_T in the last line of Algorithm 1. From Figure 1, we observe an approximately linear decay at the beginning in all cases, which corresponds to exponential decay in the original scale. Moreover, after sufficiently many iterations, all error curves plateau, which suggests that the performance of the resulting estimators has stabilized. Both phenomena agree well with the theoretical findings in Theorem 7.

Table 1: Median errors of initial (generalized Fantope) and final (Algorithm 1) estimators in squared matrix distance out of 50 repetitions. Median absolute deviations of errors are reported in parentheses.

Error    | r = 1           | r = 2           | r = 3           | r = 4           | r = 5
Initial  | 0.1319 (0.0737) | 0.2308 (0.0591) | 0.2746 (0.0652) | 0.2354 (0.0483) | 0.1969 (0.0237)
Final    | 0.0015 (0.0030) | 0.0072 (0.0209) | 0.0098 (0.0301) | 0.0121 (0.0071) | 0.0171 (0.0061)

Tan et al. (2018) proposed a truncated Rayleigh flow method, called Rifle, to solve the sparse generalized eigenvalue problem when r = 1. Here we compare our method with Rifle under the same experiment setting as above, with n = 500 and 1000. For TGD we set the tuning parameters to be s′ = 20, η = 0.01, λ = 0.01 and T = 15000.
For Rifle we also use s′ = 20 for truncation, the step size η is set to the default value 0.01, and we run it for sufficiently many iterations until the estimation errors no longer improve. We vary the sample size n in {500, 1000} and each experiment setting is repeated 50 times. We report median estimation errors of the initial and final estimators of both methods, measured in squared matrix distance, in Table 2. Both Rifle and our method produce highly accurate final estimators, which significantly decrease the errors of their respective initializers. Final estimators by our algorithm are slightly more accurate in both settings.

Figure 1: Plots of dist(V, V̄_t) (left) and dist(A, Â_t) (right) (both in logarithmic scale) vs. iteration counter t.

Table 2: Median errors of initial and final estimators of both Rifle and Algorithm 1 (TGD) in squared matrix distance out of 50 repetitions. Median absolute deviations of errors are reported in parentheses.

         | Rifle                             | TGD
         | Initial Error   | Final Error     | Initial Error   | Final Error
n = 500  | 0.1225 (0.0562) | 0.0006 (0.0005) | 0.1458 (0.0713) | 0.0003 (0.0011)
n = 1000 | 0.0805 (0.0449) | 0.0003 (0.0012) | 0.0920 (0.0504) | 0.0002 (0.0003)

5.2 Choices of tuning parameters

In this section, we study the selection of tuning parameters. For the sparsity level s′ in the hard thresholding step of Algorithm 1, we show that a five-fold cross-validation approach works reasonably well. For the Lagrangian multiplier parameter λ, we show that the estimation procedure is robust with respect to it, and we recommend fixing it at some small positive constant. For the simulations below, we fix n = 500 and p = 900. In addition, we fix the other tuning parameters throughout the experiments as follows: ρ = 1/√n for Fantope initialization, and η = 0.001, T = 15000 for Algorithm 1.

Choice of sparsity level s′. The procedure for the selection of s′ is as follows. We first randomly split the data X into five folds of equal sizes. For l = 1, ..., 5, we use one fold as the test set X^test_(l) and the other four folds combined as the training set X^train_(l). For each value of s′ in a pre-specified grid G′, we apply Algorithm 1 on X^train_(l) to obtain an estimator Â^train_(l). Then we compute the test GCA score defined as CV_(l)(s′) = Tr(Â^train_(l)⊤ Σ̂^test_(l) Â^train_(l)). Here Σ̂^test_(l) is the sample covariance matrix of the test data X^test_(l). Finally, we compute the cross-validation score for s′, defined as CV(s′) = (1/5) \sum_{l=1}^{5} CV_(l)(s′). Upon obtaining the cross-validation scores for all s′ ∈ G′, we select the sparsity level that maximizes CV(s′). In the experiments here, we set the grid G′ as {5i : i = 1, ..., 20}. We fix r = 3, λ = 0.01, and consider three cases: s1 = s2 = s3 = 5 (Case I), s1 = s2 = s3 = 15 (Case II), and s1 = s2 = s3 = 20 (Case III). The true sparsity levels in the three cases are then s = 15, 45, and 60, respectively. The results of the five-fold cross-validation procedure for the three cases are plotted in Figure 2. Here, we plot CV(s′) against s′ ∈ G′ and use error bars to indicate the range of one standard deviation of the cross-validation scores.

Figure 2: Plots of cross-validation score CV(s′) against s′ ∈ G′ for Case I (s = 15, left), Case II (s = 45, middle), and Case III (s = 60, right). Curves indicate averages over five folds and error bars indicate one standard deviation.

Figure 2 shows that in all three cases, the cross-validation scores increase with s′ until reaching a plateau and then fluctuate around some constant when s′ is set to larger values.
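The cross-validation loop just described can be written compactly as below. This is a sketch assuming NumPy; the fitting routine is passed in as a callable (for example, an implementation of Algorithm 1 returning Â for a given s′), and all names are ours rather than the paper's.

```python
import numpy as np

def select_sparsity_by_cv(X, fit, grid, n_folds=5, seed=0):
    """Five-fold CV for s': pick the value maximizing the average test GCA score CV(s').

    X    : (n, p) data matrix.
    fit  : callable fit(X_train, s_prime) -> A_hat of shape (p, r), e.g. Algorithm 1.
    grid : iterable of candidate s' values.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = {}
    for s_prime in grid:
        cv = []
        for l in range(n_folds):
            test = folds[l]
            train = np.concatenate([folds[m] for m in range(n_folds) if m != l])
            A_hat = fit(X[train], s_prime)
            Xc = X[test] - X[test].mean(axis=0)
            Sigma_test = Xc.T @ Xc / len(test)                   # test-fold sample covariance
            cv.append(np.trace(A_hat.T @ Sigma_test @ A_hat))    # CV_(l)(s')
        scores[s_prime] = np.mean(cv)                            # CV(s')
    return max(scores, key=scores.get)
```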
The sparsity levels selected by this procedure are slightly larger than the true sparsity levels in all three settings. Specifically, the selected sparsity levels for the three cases are 25, 60, and 75, respectively. This suggests that our previous choice of s = 20 for s = 15 in Section 5.1 is reasonable. Choice of penalty λ. We now study the impact of different choices of λ on our estimator. To this end, we vary λ on a log scale and consider all values in the set {0.0001, 0.001, 0.01, 0.1, 1} and compare the final estimation errors across these choices. We fix s = 20 and test using the same simulation setting as in Section 5.1 with r = 3 and 4. We set all other tuning parameters in the same way and only change λ. Table 3 reports squared distances from estimators to true parameter values based on 50 repetitions in each (r, λ) value combination. From Table 3, we notice that for both r = 3 and r = 4, the final estimation error is robust with respect to the choice of λ. 5.3 Sparse CCA To further compare the performance of our algorithm with a benchmark, we compare its performance in sparse CCA problems with the Co La R method proposed by Gao et al. (2017). Instead of using gradient descent and hard thresholding, Gao et al. (2017) refined the initial estimate by a linear regression with group Lasso penalty. In addition, their λ 0.0001 0.001 0.01 0.1 1 r = 3 0.0121(0.0364) 0.0136(0.0252) 0.0098(0.0301) 0.0093(0.0140) 0.0099(0.0108) r = 4 0.0145(0.0068) 0.0152(0.0102) 0.0121(0.0071) 0.0101(0.0055) 0.0119(0.0093) Table 3: Median errors of the final (Algorithm 1) estimators in squared matrix distance out of 50 repetitions for r = 3, 4 and different choices of λ. Median absolute deviations of errors are reported in parentheses. estimator was shown to achieve the optimal estimation rate for canonical loading matrix under a prediction loss that is different from (18). For fair comparison, we use the same simulation settings as in Gao et al. (2017). Using the notation in Remark 4, we set p1 = p2, Σx = Σy = M, r = 2, θ1 = 0.9, θ2 = 0.8. The covariance matrix between X and Y is generated from the canonical pair model Σxy = Σx V Θ2W Σy Recall that from the relationship between Θ2 and Λ2 we obtain that λ1 = θ1+1 = 1.9 and λ2 = θ2+1 = 1.8. Hence the leading two generalized eigenvalues are 1.9 and 1.8. Moreover, the row supports for both V and W are set to be {1, 6, 11, 16, 21}. The values at the nonzero coordinates are obtained by normalizing (with respect to M = Σx = Σy) random numbers drawn from the uniform distribution on the finite set { 2, 1, 0, 1, 2}. We consider three different choices of M as follows: (i) Identity: M = I; (ii) Toeplitz: M = (σij) where σij = 0.3|i j| for all i, j; (iii) Sparse Inv: M = σ0 ij/ q σ0 iiσ0 jj where Σ0 = (σ0 ij) = Ω 1 for Ω= ωij with ωij = 1i=j + 0.5 1|i j|=1 + 0.4 1|i j|=2 i, j [p1]. We consider the same four configurations of (n, p1, p2) as in Gao et al. (2017). For Co La R and its initialization, we set all tuning parameters to those values used in Gao et al. (2017). For tuning parameters in Algorithm 1 and its initialization, we set ρ = 1 n , s = 20, λ = 0.01, η = 0.001, and T = 10000. Upon obtaining the final estimate b A = [ b A {1}, b A {2}] from Algorithm 1, we calculate estimators for V and W as b V = b A{1}( b A {1}bΣx b A{1}) 1/2 and c W = b A{2}( b A {2}bΣy b A{2}) 1/2. We summarize simulation results of the foregoing settings in Tables 4 6, each corresponding to a different choice of M. Following Gao et al. 
(2017), we measure estimation error by prediction loss defined as L(V, b V ) = inf O O(r) Σ1/2 x (b V O V ) 2 F and L(W, c W) = inf O O(r) Σ1/2 y (c WO W) 2 F. Each reported number is the median error out of 50 independent repetitions. In Tables 4 6, the first four columns collect results of Co La R. Columns V -GFP and W-GFP collect errors of initial estimators by generalized Fantope projection, while columns V -TGD and W-TGD report errors of final estimators by Algorithm 1. In all settings, both Co La R and Algorithm 1 yield good final estimators, which significantly improve over their respective initializers. In addition, Algorithm 1 outperforms Co La R in most settings. Sparse GCA and Thresholded Gradient Descent (n,p1,p2) V -init W-init V -Co La R W-Co La R V -GFP W-GFP V -TGD W-TGD (300,300,200) 0.2885 0.1706 0.0511 0.0601 0.1391 0.2636 0.0118 0.0118 (600,600,200) 0.3236 0.2004 0.0638 0.0764 0.3420 0.1655 0.0237 0.0485 (300,300,500) 0.1202 0.0664 0.0135 0.0166 0.2646 0.3078 0.0100 0.0076 (600,600,500) 0.1408 0.0811 0.0176 0.0209 0.2459 0.1876 0.0213 0.0157 Table 4: Comparison of Co La R and Algorithm 1: median errors out of 50 repetitions with Toeplitz covariance structure. (n,p1,p2) V -init W-init V -Co La R W-Co La R V -GFP W-GFP V -TGD W-TGD (300,300,200) 0.2653 0.1712 0.0498 0.0646 0.3413 0.5688 0.0454 0.0211 (600,600,200) 0.3167 0.2087 0.0671 0.0776 0.5016 0.7134 0.0231 0.0559 (300,300,500) 0.1207 0.0665 0.0135 0.0159 0.3580 0.1713 0.0124 0.0117 (600,600,500) 0.1448 0.0817 0.0166 0.0203 0.3906 0.4095 0.0082 0.0129 Table 5: Comparison of Co La R and Algorithm 1: median errors out of 50 repetitions with Identity covariance structure. (n,p1,p2) V -init W-init V -Co La R W-Co La R V -GFP W-GFP V -TGD W-TGD (300,300,200) 0.5552 0.5718 0.1568 0.1194 0.2021 0.4924 0.0719 0.0421 (600,600,200) 0.5596 0.6133 0.2123 0.1572 0.4653 0.2222 0.1555 0.0686 (300,300,500) 0.2695 0.1917 0.0242 0.0219 0.1592 0.3058 0.0042 0.0056 (600,600,500) 0.3068 0.2368 0.0338 0.0271 0.1853 0.1404 0.0454 0.0491 Table 6: Comparison of Co La R and Algorithm 1: median errors out of 50 repetitions with Sparse Inv covariance structure. In summary, both Co La R and Algorithm 1 consistently estimate the leading r canonical loading vectors, while Algorithm 1 has a slight advantage. Moreover, Algorithm 1 is more desirable due to its generality beyond the sparse CCA setting. 5.4 Model mis-specification To investigate the robustness of our method, we consider one possible mis-specification of the model in the case of two data sets: There are 3 pairs of nontrivial canonical correlations present but we only set r = 2. As we are in the CCA setting, we consider three types of covariance matrices specified in the previous subsection. For all three cases, the first two canonical loading vectors and generalized eigenvalues are generated in the same way as the previous subsection. In addition, we also set the support of the third vector to be {1, 6, 11, 16, 21}. We fix θ3 = 0.3 (i.e., λ3 = 1.3) and focus on the configuration (n, p1, p2) = (300, 300, 500). All tuning parameters are set in the same way as in last subsection. Simulation results under these settings can be found in Table 7, and we continue to use prediction loss as in the previous subsection. 
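For reference, both the matrix distance in (18) and the prediction loss used here are Procrustes-type criteria and can be evaluated in a few lines. The sketch below assumes NumPy and SciPy; the function names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def procrustes_align(U, V):
    """Orthogonal P minimizing ||U P - V||_F, obtained from the SVD of U'V."""
    W, _, Zt = np.linalg.svd(U.T @ V)
    return W @ Zt

def matrix_dist(U, V):
    """dist(U, V) = min over orthogonal P of ||U P - V||_F, as in (18)."""
    return np.linalg.norm(U @ procrustes_align(U, V) - V)

def prediction_loss(V_hat, V, Sigma_x):
    """L(V, V_hat) = min over orthogonal O of ||Sigma_x^{1/2} (V_hat O - V)||_F^2."""
    S = np.real(sqrtm(Sigma_x))
    return matrix_dist(S @ V_hat, S @ V) ** 2
```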
Similar to Co La R, Algorithm 1 continues to Covariance structure V -Co La R W-Co La R V -TGD W-TGD Toeplitz 0.0197 0.0197 0.0291 0.0157 Identity 0.0190 0.0195 0.0113 0.0155 Sparse Inv 0.0348 0.0263 0.0112 0.0488 Table 7: Comparison of Co La R and Algorithm 1: median errors of estimating first two generalized eigenvectors out of 100 repetitions in mis-specified models. produce accurate estimates when the input latent dimension rin is smaller than the true value r and there is an eigengap between θrin and θrin+1, and thus between λrin and λrin+1. 5.5 Sparse PCA of correlation matrices We now apply Algorithm 1 to performing sparse PCA of correlation matrices. See Remark 5 for a detailed account on how sparse PCA of correlation matrices can be cast as a special case of sparse GCA. By Remark 5, let bΣ0 be the diagonal matrix with p sample variances on the diagonal, we could use bΣ1/2 0 b A as our estimator of the r leading eigenvectors of correlation matrix R = Σ 1/2 0 ΣΣ 1/2 0 , where b A is the estimator produced by Algorithm 1. We report the squared matrix distance for estimating the leading eigenspace L(ER, b A) = min O O(r) bΣ1/2 0 b AO ER 2 F for initialization and final error, where ER is the leading eigenspace of R. Given ambient dimension p, sparsity s, latent dimension r and generalized eigenvalues λ1 λr > 1, we generate a p p correlation matrix R with leading eigenvalues λ1 λr > λr+1 = 1 and s-sparse leading eigenvectors according to the procedure described in Section D. To further construct the covariance matrix Σ, we generate Σ0 as a diagonal matrix with i.i.d. uniform random numbers in [0.1, 1] as diagonal elements. Finally, we set Σ = Σ1/2 0 RΣ1/2 0 . In our simulation study, we set p = 500, s = 20, r = 3, and let sample size n be either 500 or 2000. The tuning parameter for initialization is set as ρ = 1 n . Tuning parameters of Algorithm 1 are set to be s = 40, T = 20000, λ = 0.01, and η = 0.001. The leading generalized eigenvalues are set as {5, 5, 5} (case I) or {7, 5, 3} (case II). For both cases, our construction of Σ ensures that λ4 = 1. n Case Initial Error Final Error n = 500 I 0.4603 (0.1404) 0.0851 (0.0306) II 0.6379 (0.0567) 0.1001 (0.0098) n = 2000 I 0.0882 (0.0177) 0.0184 (0.0025) II 0.1060 (0.0697) 0.0258 (0.0025) Table 8: Sparse PCA of correlation matrices by Algorithm 1: median errors of initial and final estimators out of 50 repetitions. MADs are reported in parentheses. Sparse GCA and Thresholded Gradient Descent We report results from 50 repetitions for each case and each sample size in Table 8. From Table 8, it can be seen that in all settings, Algorithm 1 yields accurate estimators of leading eigenspaces of correlation matrices. Due to a larger eigengap in the first case, it has slightly better estimation accuracy which corroborates our theory. 5.6 Sparse GCA with general covariance structure In this section, we investigate the performance of our algorithm under a general covariance structure and demonstrate empirically that our algorithm works beyond latent variable models as long as the eigengap condition holds. We consider sparse GCA of three high-dimensional data sets with p1 = 500, p2 = p3 = 200 and let the rank of all off-diagonal blocks of Σ be pmin = mini pi = 200. To generate covariance matrices Σ and Σ0, we first define a diagonal matrix Θ with Θii = 2/i for i = 1, 2, . . . , pmin. We then define Σ as a block matrix with Σ{ij} = T{i}U{i}ΘU {j}T{j}, Σ{ii} = 2T{i}, for i = j {1, 2, 3}, U {1}T{1}U{1} = U {2}T{2}U{2} = U {3}T{3}U{3} = I. 
Here, each Σ{ii} = 2T{i} and each T{i} is a Toeplitz matrix, defined by setting (T{k})ij = σkij where σkij = a|i j| k for all i, j [pk] with a1 = 0.5, a2 = 0.7, a3 = 0.9. The first five columns of each U{i} have row sparsity level si = 5. As a result, the support sizes of the first three GCA loading vectors are at most s = 15. For each U{i} Rpi pmin, we define U{i} = [U{i}(1), U{i}(2)] where U{i}(1) Rpi si and U{i}(2) Rpi (pmin si). The submatrices U{i}(1), U{i}(2) are generated as follows. To generate each U{i}(1), we first randomly select a support of size si. For each row in the support of U{i}(1), we generate its entries as i.i.d. standard normal random variables and fill the remaining entries with zeros. Then, we normalize all U{i}(1) s with respect to T{i}. To generate U{i}(2), we first perform SVD on U{i}(1) T 1/2 {i} = P{i}D{i}Q {i} to obtain Q{i} O(pi). We then take e Q{i} = Q{i}, J for J = {si + 1, . . . , pmin} which is the submatrix of Q{i} consisting of the (si + 1)th to the pminth columns of Q{i}. By construction, we have U{i}(1) T 1/2 {i} e Q{i} = 0 and e Q{i} Rpi (pmin si). Finally, we define U{i}(2) = T 1/2 {i} e Q{i}. By construction, we have U{i}(2) T{i}U{i}(2) = I and U{i}(2) T{i}U{i}(1) = 0. Thus, the leading five columns of each U{i} are sparse and we have U {i}T{i}U{i} = I. With the foregoing construction, it is straightforward to verify that the ith generalized eigenvalue is given by λi = 1 + 2i 1 for i = 1, 2, . . . , pmin and the non-trivial GCA loadings are given by [U {1}, U {2}, U {3}] . Therefore, there are eigengaps between any consecutive generalized eigenvalues above one and only the first five GCA loading vectors are sparse. Under the foregoing setup, we aim at estimating the leading GCA loading vector (Case I), the matrix with leading two GCA loading vectors (Case II), and the matrix with leading three GCA loading vectors (Case III). For tuning parameters, we set s = 20 for sparsity level, η = 0.001, λ = 0.01, and T = 15000 in Algorithm 1, and ρ = 1 n in (14). We consider two different sample sizes: n = 500 and n = 2000. Table 9 reports the squared matrix distances for the initial and the final estimators from true value defined as dist2(A, b A) = min O O(r) b AO A 2 F for all three cases and two different sample sizes. Numbers in each cell are the median and the MAD of results over 50 repetitions. Population parameters are also generated independently in each repetition. To gauge the scale of errors, we also report squared Frobenius norm A 2 F of the estimand in the last column of Table 9. n Case Initial Error Final Error A 2 F I 0.0500 (0.0358) 0.0005 (0.0007) 0.5669 II 0.1586 (0.0420) 0.0605 (0.0313) 1.2433 III 0.3097 (0.1484) 0.1831 (0.1107) 1.8378 I 0.0174 (0.0064) 0.0001 (0.0002) 0.5669 II 0.0390 (0.0268) 0.0136 (0.0063) 1.2433 III 0.0741 (0.0719) 0.0369 (0.0178) 1.8378 Table 9: Sparse GCA with general covariance structure: median errors of initial and final estimators out of 50 repetitions with MADs reported in parentheses. Medians of squared Frobenius norms of estimands are listed in the last column. From Table 9, it can be seen that in all settings, Algorithm 1 consistently yields improved estimates of leading GCA loading vectors whenever there is an eigengap. Appendix A. Proof of Lemmas in Section 2 A.1 Proof of Lemma 1 Proof For Λr = diag(λ1, . . . , λr). Since A solves (3), we have Σ 1 0 ΣA = AΛr. (31) By model (5) (7), we have Σ = Ψ + UU and Σ0 = diag(Σ{11}, . . . 
, Σ{kk}), and so Ip1 Σ 1 {11}U{1}U {2} Σ 1 {11}U{1}U {k} Σ 1 {22}U{2}U {1} Ip2 Σ 1 {22}U{2}U {k} ... ... ... ... Σ 1 {kk}U{k}U {1} Σ 1 {kk}U{k}U {2} Ipk Thus, (31) leads to A{1} + Σ 1 {11}U{1} P j =1 U {j}A{j} ... A{k} + Σ 1 {kk}U{k} P j =k U {j}A{j} A{1}Λr ... A{k}Λr Collecting terms, we obtain Σ 1 {11}U{1} P j =1 U {j}A{j} ... Σ 1 {kk}U{k} P j =k U {j}A{j} A{1}(Λr Ir) ... A{k}(Λr Ir) = A(Λr Ir). Sparse GCA and Thresholded Gradient Descent Since both A and Λr Ir are of rank r under the assumption of the lemma, we complete the proof. A.2 Proof of Lemma 3 Before the proof of the lemma, we recall Weyl s inequality (Weyl, 1912) stated as follows. Lemma 13 (Weyl s inequality) Let A, B be two p p Hermitian matrices and si(A) be the ith eigenvalue of A. Then for 1 l, j p, we have sl(A + B) sj(A) + sk(B), for l j + k 1, sj(A) + sl(B) sj+l p(A + B), for j + l p. Proof The proof is adapted from Fan et al. (2019). It is composed of three parts. First we show that if i r + 1, λi 1. Then we will prove that λr > 1. Finally we prove the result on multiplicity of 1. Recall that the generalized eigenvalues are identical to the eigenvalues of R = Σ 1/2 0 ΣΣ 1/2 0 . From now on we study the eigenvalues of R directly. Step (1): Note that we can write R as R = Σ 1/2 0 ΣΣ 1/2 0 = Σ 1/2 0 (UU + Ψ)Σ 1/2 0 = Q1Q 1 + Q2Q 2 where Q1 = Σ 1/2 0 U Q2 = Σ 1/2 0 Ψ1/2. We first show that Q2Q 2 op 1. To this end, note that Q2Q 2 op = Q 2 Q2 op = Ψ1/2Σ 1 0 Ψ1/2 op. Note that Ψ1/2Σ 1 0 Ψ1/2 is a block diagonal matrix with the ith block given by Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii}. Thus, it suffices to show that Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii} op 1, for i [k]. To this end, by Woodbury matrix identity, Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii} =Ψ1/2 {ii}[Ψ 1 {ii} Ψ 1 {ii}U{i}(Ir + U {i}Ψ 1 {ii}U{i}) 1U {i}Ψ 1 {ii}]Ψ1/2 {ii} =Ipi Ψ 1/2 {ii} U{i}(Ir + U {i}Ψ 1 {ii}U{i}) 1U {i}Ψ 1/2 {ii} . Therefore, 0 Ψ1/2 {ii}(Ψ{ii} + U{i}U {i}) 1Ψ1/2 {ii} Ipi, and so Q2Q 2 op 1. By Lemma 13, we have, for i j + k 1, λi = si(R) sj(Q1Q 1 ) + sk(Q2Q 2 ). Now we set j = r + 1, k = 1 and this gives λi= si(R) sr+1(Q1Q 1 ) + s1(Q2Q 2 ) = s1(Q2Q 2 ) for i r + 1. The last equality holds since we have rank(Q1Q 1 ) = r and so sr+1(Q1Q 1 ) = 0. As a result, λi= si(R) s1(Q2Q 2 ) = Q2Q 2 op 1. This finishes the proof of the first step. Step (2): By Lemma 13 and Assumption 2, we have λr= sr(R) = sr(Q1Q 1 + Q2Q 2 ) sr(Q1Q 1 ) + sp(Q2Q 2 ) > sr(Q1Q 1 ) = σ2 r(Σ 1/2 0 U) 1. Here, the strict inequality holds since Q2Q 2 is of full rank. This finishes the proof for the rth eigenvalue. Step (3): Now we calculate the multiplicity of 1. To this end, we resort to Theorem 2.2 in Tian (2004), which gives rank(U U Y U ) = rank(U UY U) = rank(U U U {1}U{1} U {k}U{k}) = rank(e U) + rank(U) i=1 rank(U{i}). 0 Σ{12} Σ{13} ... Σ{1k} Σ{21} 0 Σ{23} ... Σ{2k} Σ{31} Σ{32} 0 ... Σ{3k} ... ... ... ... ... Σ{k1} Σ{k2} Σ{k3} ... 0 Note that we have Σ{ij} = U{i}U {j}, rank(U) = r, by the latent variable model. Moreover, it is easy to verify that multiplicity of 1 as an eigenvalue of R is the same as multiplicity of 0 as an eigenvalue of e U, which in turn equals p rank(e U). Thus, multiplicity of 1 is p rank(e U) = p rank(U U Y U ) + rank(U) i=1 rank(U{i}) i=1 rank(U{i}) + r rank(U UY U). This finishes the proof of the lemma. Appendix B. Proof of Main Results We first provide a lemma about the distance defined in (18) which is adapted from Lemmas 5.3 and 5.4 from Tu et al. (2015). Lemma 14 For any U, V Rp r, we have dist2(U, V ) 1 2( 2 1)σ2r(V ) UU V V 2 F. 
Sparse GCA and Thresholded Gradient Descent If further that dist(U, V ) 1 4 V op, we have 4 U opdist(U, V ). Before diving into the proof, we present a list of nice events on the intersection of which all desired results hold deterministically. Recall that S is the true row support of A with cardinality s. Let I {1, 2, 3, ...p} be an index set. We use b A(I) Rp r to denote the leading r generalized eigenvectors that solve max bΣ, LL such that L bΣ0L = Ir, supp(L) I. (32) We also denote the diagonal matrix formed by the first r restricted sample generalized eigenvalues by bΛr(I). For some sufficiently large constant C > 0, define B1 = dist(A, b A(I)) C r 1 + λ2 r+1 λr λr+1 for all I S such that |I| 2s + s . Note that the event includes all index sets containing true support S and the sizes of which are at most 2s + s . We define event B2 as B2 = bΣII ΣII op bΣ0,II Σ0,II op C (2s + s) log p for all I [p] with |I| 2s + s . Further we define B3 as B3 = bΣ Σ + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0 + λr+1 Σ0 bΣ0 + λr+1 Σ0AA Σ0 bΣ0AA bΣ0 C Let e A = A(A bΣ0A) 1/2, eΛr = (A bΣ0A)1/2Λr(A bΣ0A)1/2. (36) The event B4 is defined as eΛr Λr F + λr+1 A bΣ0A I F + Σ1/2 0 ( e A A) F C r(s + log(ep/s)) (37) for some sufficiently large constant C > 0. The following lemmas guarantee that all these events occur with high probabilities, uniformly over Pn. In addition, since s s, Lemma 12 in Gao et al. (2015a) implies that B2 happens with probability at least 1 exp( C s log(ep/s )) for some positive constant C , uniformly over Pn. Lemma 15 Suppose condition (23) holds, there exist constants C, C > 0 such that uniformly over Pn, B1 happens with probability at least 1 exp( C s log(ep/s )). Lemma 16 Suppose r p log p/n c for some sufficiently small constant c (0, 1). Then there exist positive constants C, C such that uniformly over Pn, B3 happens with probability at least 1 p C . Lemma 17 Suppose q s+log(ep/s) n c for some sufficiently small constant c (0, 1). Then there exist constants C, C > 0 such that uniformly over Pn, B4 happens with probability at least 1 exp( C (s + log(ep/s))). In the rest of this section, we will show that Theorem 7 and Corollary 8 hold on event B1 B2. Theorem 9 holds on event B3 B4, and Corollary 10 holds on event B2 B3 B4. As a result, the entire algorithm with generalized Fantope initialization yields an estimator satisfying the upper bound in Corollary 11 on B1 B2 B3 B4, which holds with probability at least 1 exp( C (s + log(ep/s))) for some constant C > 0 uniformly over Pn, by the foregoing lemmas and the union bound. This argument proves Corollary 11 about the whole procedure. Before proving the main Theorem, we first specifically study effect of the gradient descent step and the hard thresholding step. The following proposition characterizes the progress in the gradient step. Proposition 18 Let St = supp(V t) S supp(V t+1) S supp(A) be a super set of the row support of V t. Define b V (St) = b A(St) I + bΛr(St) Set η c 12λ1ν(c + 2)2 and λ = λ1 for some constant c < 1. On event B2, suppose V t satisfies dist(V t, b V (St)) 1 4 ν min c(λr λr+1) 2λ1ν2(42 + 25c), 1 + c then after the gradient step, we have dist2(V o t+1, b V (St)) 1 η(λr λr+1) dist2(V t, b V (St)), here we use V o t+1 Rp r to denote a matrix that has the same entries as those in Vt+1 on St [r] and zeros elsewhere. The following proposition characterizes the effect of hard thresholding. Proposition 19 Define V as in (21). 
Then if we perform hard thresholding by selecting the top s elements of Vt+1, we have dist2(V, V t+1) 1 + 2 r s dist2(V, Vt+1). (40) Sparse GCA and Thresholded Gradient Descent B.1 Proof of Theorem 7 We first prove the theorem assuming that Propositions 18 and 19 hold. Proofs of the two propositions will be given later in Sections B.3 and B.4. It is worth noting that the results of both propositions hold deterministically on event B2. The first step is to define the effective support in each step. We first define Ft = supp(V t), recall that S is the true row support of A, then we define the effective restricted set to be St = Ft Ft+1 S. The gradient descent step restricted to St can be viewed as Vt+1,St = V t,St 2η( bΣSt St V t,St + λbΣ0,St St V t,St (V t,St bΣ0,St St V t,St Ir)). Note that applying hard thresholding on V o t+1 (as defined in Proposition 18) is equivalent to applying hard thresholding on the original Vt+1. This allows us to replace the intermediate update by V o t+1 and still obtain the same output sequence V t+1. Thus, we will prove instead for the update using bΣ0,St St, bΣSt St. Proof Throughout the proof, we assume the event B1 B2 happens, which occurs with probability at least 1 exp( C s log(ep/s )) for some constant C > 0, uniformly over Pn. As we have mentioned, the conclusions of Propositions 18 and 19 hold on this event, and the remaining arguments in this proof proceed in a deterministic fashion. We argue by induction on the iteration counter t. Specifically, for t = 1, 2, ..., we will prove that V t satisfies condition (39) and that dist(V t, V ) ξt 1 dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 for some positive constant C1. Base Case: By (23), (22), and definition of B1 in (33), condition (39) is satisfied when t = 1. Moreover, when t = 1 equation (41) holds trivially. Induction Step: Suppose that V t satisfies the radius condition (39) and that the induction hypothesis (41) is satisfied at step t. We are to show that (39) and (41) hold for V t+1. In the gradient step, Proposition 18 shows that under radius condition (39) on dist(V t, b V (St)), if we choose the step-size to be η 1 dist(V o t+1, b V (St)) p 1 αη dist(V t, b V (St)) 1 αη dist(V t, b V (St)), α = (λr λr+1) 4ν , β = 12λν Recall that V = A I + 1 2 , b V (St) = b A(St) I + 1 λ bΛr(St) 1 By (33), on event B1 B2, since λ = λ1/c, we have dist(V, b V (St)) C0 r 1 + λ2 r+1 λr λr+1 for some positive constant C0. Triangle inequality then leads to dist(V o t+1, V ) 1 αη dist(V t, V ) + 2C0 r 1 + λ2 r+1 λr λr+1 Turn to the thresholding step. By Proposition 19, we have dist(V t+1, V ) dist(V o t+1, V ) dist(V o t+1, V ). Combining the last two displays, we obtain that dist(V t+1, V ) 1 + r s dist(V t, V ) 1 + λ2 r+1 λr λr+1 ξ = 1 + r s Note that we can always ensure ξ < 1 by enlarging s appropriately. Specifically, the choice in the theorem, s 16 α2η2 s, ensures that ξ 1 + 2 r s Since |St| 2s + s and s s, we further obtain that dist(V t+1, V ) ξ dist(V t, V ) + C1 1 + λ2 r+1 λr λr+1 Sparse GCA and Thresholded Gradient Descent for some constant C1 > 0. Note that in the above equation C1 does not depend on t. We now show that it is identical with the constant C1 in (41). To this end, note that dist(V t+1, V ) ξ dist(V t, V ) + C1 1 + λ2 r+1 λr λr+1 ξt 1 dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 1 + λ2 r+1 λr λr+1 ξt dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 Here the second inequality is due to the induction hypothesis at the tth iteration. 
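For concreteness, one iteration of the update analyzed above, a gradient step on the Lagrangian followed by row-wise hard thresholding, can be written compactly. The following is a minimal numerical sketch, not Algorithm 1 itself: the matrices S_hat and S0_hat, the tuning values lam, eta, and s_keep, and the toy data are illustrative placeholders rather than the quantities prescribed by the theory. The helper dist computes the metric $\min_{P \in O(r)} \|UP - V\|_F$ via the SVD of $U^\top V$.

```python
import numpy as np

def tgd_step(V, S_hat, S0_hat, lam, eta, s_keep):
    """One illustrative thresholded-gradient-descent step (a sketch, not Algorithm 1)."""
    r = V.shape[1]
    # gradient of f(V) = -<S_hat, VV'> + (lam/2)||V'S0_hat V - I_r||_F^2
    grad = 2.0 * (-S_hat @ V + lam * S0_hat @ V @ (V.T @ S0_hat @ V - np.eye(r)))
    V = V - eta * grad
    # hard thresholding: keep only the s_keep rows with the largest l2 norms
    keep = np.argsort(np.linalg.norm(V, axis=1))[::-1][:s_keep]
    V_new = np.zeros_like(V)
    V_new[keep] = V[keep]
    return V_new

def dist(U, V):
    """min over orthogonal P of ||U P - V||_F, computed via the SVD of U'V."""
    A, _, Bt = np.linalg.svd(U.T @ V)
    return np.linalg.norm(U @ (A @ Bt) - V)

# toy run on synthetic placeholders for the sample covariance matrices
rng = np.random.default_rng(0)
p, r, n = 30, 2, 400
X = rng.standard_normal((n, p))
S_hat = X.T @ X / n
S0_hat = np.diag(np.diag(S_hat))
V = 0.1 * rng.standard_normal((p, r))
for _ in range(200):
    V = tgd_step(V, S_hat, S0_hat, lam=5.0, eta=0.005, s_keep=10)
print(np.count_nonzero(np.linalg.norm(V, axis=1)))  # at most s_keep nonzero rows
```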
Moreover, it can be ensured that V t+1 also satisfies (39) since ξ < 1 and the extra term is bounded by a sufficiently small constant due to (23). As a result, we have shown that both (39) and (41) are satisfied for all t 1. In summary, on B1 B2, for any t 1, dist(V t+1, V ) ξt dist(V 1, V ) + C1 1 ξ 1 + λ2 r+1 λr λr+1 ξt dist(V 1, V ) + 4C1 1 + λ2 r+1 λr λr+1 ξt dist(V 1, V ) + C s 1 + λ2 r+1 λr λr+1 Here, the second inequality holds since 1 1 ξ 4 α2η2 and the last inequality is due to s 16 α2η2 s. This completes the proof. B.2 Proof of Corollary 8 Proof We prove the desired result on B1 B2, which happens with probability at least 1 exp( C (s log(ep/s ))) for some constant C > 0, uniformly over Pn. For notational convenience, denote the statistical error rate in (24) by ϵn, that is 1 + λ2 r+1 λr λr+1 Under condition (25), the statistical error dominates optimization error in (24). Hence by Theorem 7 dist(V t, V ) is bounded by a constant multiple of ϵn, that is, V t, A I + Λr for some constant C0 > 0. Let P be the orthogonal matrix that minimizes the distance, then we can write V t = A I + Λr 1/2 P + Q , (44) where Q F C0ϵn. Since V t is s row sparse and A is s row sparse, Q is s + s row sparse. The remaining proof is composed of two steps: (1) bounding (V t bΣ0V t) 1/2 P (I + Λr λ ) 1/2P F and (2) bounding V t(V t bΣ0V t) 1/2 AP F. The bound in step (2) then gives the desired bound in the statement of the corollary. In the rest of the proof, let = (V t bΣ0V t) 1/2 P (I + Λr/λ) 1/2P. Step (1): By definition we have V t bΣ0V t = 1/2 A bΣ0A I + Λr 1/2 P + Q bΣ0A I + Λr 1/2 A bΣ0Q + Q bΣ0Q. Since A Σ0A = Ir, we have 1/2 A Σ0A I + Λr 1/2 P = P I + Λr Then we can bound the Frobenius norm for V t bΣ0V t P (I + Λr V t bΣ0V t P (I + Λr λ )P F P (I + Λr λ )1/2A (bΣ0 Σ0)A(I + Λr λ )1/2P F | {z } Term I + 2 Q bΣ0A(I + Λr λ )1/2 F | {z } Term II + Q bΣ0Q F | {z } Term III We bound the three terms on the right side of the last display separately. Since A is s sparse, we have that on event B2, λ )1/2A (bΣ0 Σ0)A(I + Λr λ )1/2P op I + Λr λ op A S (bΣ0,SS Σ0,SS)AS op Since A is of rank r, we can then bound the Frobenius norm of Term I as λ )1/2A (bΣ0 Σ0)A(I + Λr λ )1/2P F C1 Sparse GCA and Thresholded Gradient Descent For Term II, since Q is s + s sparse, we notice that on event B2 and under condition (23), we can bound it as Q bΣ0A(I + Λr λ )1/2 F C2 Q F for some constant C2 > 0. Term III is also dominated by the same upper bound. On B2, the operator norms of V t bΣ0V t and (V t bΣ0V t) 1 are both upper bounded by a positive constant. In addition, by (19), the operator norms of I + Λr/λ and (I + Λr/λ) 1 are also upper bounded by a positive constant. So we conclude that F (V t bΣ0V t) 1/2 op (V t bΣ0V t)1/2 P (I + Λr λ )1/2P F P (I + Λr λ ) 1/2P op C3 V t bΣ0V t P (I + Λr λ )P F C4ϵn. where the second to last inequality is due to Lemma 2 in Supplement of Gao et al. (2017). Step (2): Now we bound b At AP F as follows. Recall we have defined that = (V t bΣ0V t) 1/2 P (I + Λr b At AP F = V t(V t bΣ0V t) 1/2 AP F = (A(I + Λr λ )1/2P + Q)(P (I + Λr λ ) 1/2P + ) AP F = AP + QP (I + Λr λ ) 1/2P + A(I + Λr λ )1/2P + Q AP F λ ) 1/2P F + A(I + Λr λ )1/2P F + Q F Q F P (I + Λr λ ) 1/2P op + A(I + Λr λ )1/2P op F + Q F op for some constant C5 > 0, due to the bounds on Q F and F that we have established in step (1). Combining the results above we deduce that dist( b At, A) b At AP F C5ϵn with probability at least 1 exp( C (s log(ep/s ))) for some positive constants C5 and C , uniformly over Pn. 
This finishes our proof. B.3 Proof of Proposition 18 Proof Throughout the whole proof, we work on event B2 defined in (34) which happens with probability at least 1 exp( C s log(ep/s )) for some constant C > 0, uniformly over Pn. As mentioned before, the gradient step is equivalent to replacing all covariance matrices by bΣSt St, bΣ0,St St respectively due to sparsity of V t, V t+1 (since the output sequence V t remains unaltered after this substitution). Effectively, at t th step of gradient descent the relavant support is St since outside this set the output matrix has row equal to 0. We denote principal submatrices bΣSt St, bΣ0,St St by bΣt R|St| |St|, bΣ0,t R|St| |St| for simplicity in the proof and define the restricted Lagrangian function ft to be ft(L) = bΣt, LL + λ 2 L bΣ0,t L Ir 2 F, where L R|St| r. As a result, 1 2 ft(L) = bΣt L + λbΣ0,t L(L bΣ0,t L Ir). (45) Throughout the proof of Proposition 18, we will work on the restricted function and its gradient. We denote Lt+1 = Lt η ft(Lt). Then we notice that Lt = V t,St , Lt+1 = Vt+1,St by our submatrix notation respectively. Our proof will work on distance involving Lt, Lt+1 which transfers to the desired bound as stated in this proposition. Before the proof we revisit a lemma characterizing the effect of gradient descent. The following lemma is adapted from Lemma 4 in Chi et al. (2019) and it is an extension of the gradient descent condition from vectors to rank r matrices. For notational convenience, we define L t to be a global minimizer of function ft(L) (which will be calculated later) and HX = argmin H O(r) XH L t F. We define a function f(L) to be β smooth at L if for all Z, we have vec(Z) 2f(L)vec(Z) β Z 2 F. Lemma 20 Suppose that ft is β smooth within a ball B(L t ) = {L : L L t F R} and that ft(L)P = ft(LP) for any orthonormal matrix P. Assume that for any L B(L t ) and any Z, we have vec(ZHZ L t ) 2ft(L)vec(ZHZ L t ) α ZHZ L t 2 F. In addition, if η 1 β, then using gradient descent with dist(Lt, L t ) R, we have dist2(Lt+1, L t ) (1 αη) dist2(Lt, L t ). Moreover, with dist(L0, L t ) R, we have dist2(Lt, L t ) (1 αη)t dist2(L0, L t ). In Lemma 4 of Chi et al. (2019) the condition on gradient descent is Lt B(L t ). Here we generalize it to dist(Lt, L t ) R and the proof follows without change of the original proof. By (45), it is straightforward to verify the condition ft(L)P = ft(LP). The remaining proof is composed of three steps: (1) deriving the expression for vec(Z) 2ft(L)vec(Z), (2) verifying the smoothness condition, and (3) verifying the condition on strong convexity. We check the radius condition at the end of the proof. Sparse GCA and Thresholded Gradient Descent Step (1) Recall (45). As a result, 1 2vec ft(L) = (Ir bΣt)vec(L) + λ(Ir bΣ0,t LL bΣ0,t)vec(L) λ(Ir bΣ0,t)vec(L). The main calculation is to deal with vec(bΣ0,t LL bΣ0,t L) = (I bΣ0,t LL bΣ0,t)vec(L). We now directly compute this expression as follows: since L R|St| r, we have vec(L) = [l 1 , l 2 , ..., l r ] where li is the i th column of the matrix. Following this notation, we can write vec(bΣ0,t LL bΣ0,t L) = (I bΣ0,t LL bΣ0,t)vec(L) bΣ0,t LL bΣ0,t 0 ... 0 0 bΣ0,t LL bΣ0,t ... 0 0 0 ... 0 0 0 ... bΣ0,t LL bΣ0,t l1 l2 l3 ... lr As a result, vec(bΣ0,t LL bΣ0,t L) = bΣ0,t LL bΣ0,tl1 bΣ0,t LL bΣ0,tl2 bΣ0,t LL bΣ0,tl3 ... bΣ0,t LL bΣ0,tlr bΣ0,t Pr i=1 lil i bΣ0,tl1 bΣ0,t Pr i=1 lil i bΣ0,tl2 bΣ0,t Pr i=1 lil i bΣ0,tl3 ... 
bΣ0,t Pr i=1 lil i bΣ0,tlr Now we can calculate the derivative vec ft(L) vec(L) , note that we can do this block by block: the jk block entry of the Hessian is just vec ft(L)j 1 2vec ft(L) = (Ir bΣt)vec(L) + λ(Ir bΣ0,t LL bΣ0,t)vec(L) λ(Ir bΣ0,t)vec(L), the only term we have to deal with is the middle one. By previous calculations, we have bΣ0,t Pr i=1 lil i bΣ0,tlj lk = bΣ0,tlkl k bΣ0,tlj lk = l j bΣ0,tlkbΣ0,t + bΣ0,tljl k bΣ0,t, when j = k and bΣ0,t Pr i=1 lil i bΣ0,tlj lj = X i =j bΣ0,tlil i bΣ0,t + 2bΣ0,tljl j bΣ0,t + l j bΣ0,tlj bΣ0,t = bΣ0,t LL bΣ0,t + bΣ0,tljl j bΣ0,t + l j bΣ0,tlj bΣ0,t. As a result, we can combine the above results to obtain the Hessian 1 2 2ft(L) =1 2 vec ft(L) bΣt ... 0 0 bΣt 0 0 ... bΣt bΣ0,t .. 0 0 bΣ0,t 0 0 ... bΣ0,t bΣ0,t LL bΣ0,t + bΣ0,tl1l 1 bΣ0,t + l 1 bΣ0,tl1bΣ0,t ... l 1 bΣ0,tlr bΣ0,t + bΣ0,tl1l r bΣ0,t ... ... ... bΣ0,tlrl 1 bΣ0,t ... bΣ0,t LL bΣ0,t + bΣ0,tlrl r bΣ0,t +l r bΣ0,tl1bΣ0,t ... +l r bΣ0,tlr bΣ0,t Now we proceed to calculate vec(Z) 2ft(L)vec(Z). We have vec(Z) 2ft(L)vec(Z) = X i,j z i 2ft(L)ijzj where zi is the i th column of Z. Substituting the expression on the Hessian, we have 1 2vec(Z) 2ft(L)vec(Z) = i=1 z i bΣtzi λ i=1 z i bΣ0,tzi i =j l i bΣ0,tljz i bΣ0,tzj i =j z i bΣ0,tljl i bΣ0,tzj + i=1 (z i bΣ0,tli)2 i=1 z i bΣ0,t LL bΣ0,tzi + i=1 (l i bΣ0,tli)(z i bΣ0,tli) . Now we claim the following simplification: 1 2vec(Z) 2ft(L)vec(Z) =gt(L, Z) = bΣt, ZZ λ bΣ0,t, ZZ + λ ZZ , bΣ0,t LL bΣ0,t + λ L bΣ0,t L, Z bΣ0,t Z + λ Z bΣ0,t L, L bΣ0,t Z . Sparse GCA and Thresholded Gradient Descent Now we begin to prove this claim. We expand the above expression as follows: gt(L, Z) = bΣt, ZZ λ( bΣ0,t, ZZ ZZ , bΣ0,t LL bΣ0,t L bΣ0,t L, Z bΣ0,t Z Z bΣ0,t L, L bΣ0,t Z ) i=1 z i bΣtzi λ i=1 z i bΣ0,tzi + λ X i,j z i bΣ0,tljl j bΣ0,tzi + λ X i,j z i bΣ0,tljl i bΣ0,tzj i =j l i bΣ0,tljz i bΣ0,tzj + λ i=1 (l i bΣ0,tli)(z i bΣ0,tzi) i=1 z i bΣtzi λ i=1 z i bΣ0,tzi + λ( X i =j l i bΣ0,tljz i bΣ0,tzj + X i =j z i bΣ0,tljl i bΣ0,tzj i=1 (l i bΣ0,tli)(z i bΣ0,tzi) + i=1 z i bΣ0,t j=1 ljl j bΣ0,tzi + i=1 (z i bΣ0,tli)2) i=1 z i bΣtzi λ( i=1 z i bΣ0,tzi) + λ( X i =j l i bΣ0,tljz i bΣ0,tzj + X i =j z i bΣ0,tljl i bΣ0,tzj i=1 (z i bΣ0,tli)2 + i=1 z i bΣ0,t LL bΣ0,tzi + i=1 (l i bΣ0,tli)(z i bΣ0,tli)) 2vec(Z) 2ft(L)vec(Z). This completes the first step. Step (2) In view of the lemma, our next task would be to bound the smoothness parameter in a neighborhood of L t . The neighborhood will be defined by the distance bΣ1/2 0,t L t bΣ1/2 0,t L F δ, (we define in this unusual way due to the normalization constraint and specific δ given by the condition will be explained later), which by triangle inequality gives bΣ1/2 0,t L t op δ bΣ1/2 0,t L op bΣ1/2 0,t L t op + δ. We first find this global minimizer L t up to a rotation matrix. Setting the gradient equal to 0, we have any critical point must satisfy the following equation bΣt L = λbΣ0,t L(L bΣ0,t L Ir). From the above equation, we deduce that the column space of global minimizer should coincide with some generalized eigenvectors (not necessarily leading ones). Hence without loss of generality we assume the global minimizer of function is achieved at when L t = b Lt Dt for Dt being an invertible matrix and b Lt being the generalized eigenvectors with eigenvalues bλI for I [p], |I| = r for sample covariance matrices. Then we have bΣtb Lt Dt = λbΣ0,tb Lt Dt(D2 t Ir) and this gives bΣ0,tb LtbΛIDt = λbΣ0,tb Lt Dt(D2 t Ir). Here we abuse the notation a bit to denote diagonal matrices with entries bλI to be bΛI. 
We deduce that Dt = (Ir + 1 λ bΛI) 1 2 . This is true for any critical point and now we will show that the global minimizer is achieved at I being {1, 2, ..., r}. To see this, note that ft(L t ) = ft(b Lt Dt) = bΣt, b Lt Dt D t b L t + λ 2 D t b L t bΣ0b Lt Dt Ir 2 F = Tr(D t bΛIDt) + λ 2 D t Dt Ir 2 F = Tr(bΛI) 1 λ Tr(bΛ2 I) + 1 2λ Tr(bΛ2 I). We notice that the above quantity is minimized only when bΛI = bΛr(St), that is, when we are selecting the leading r generalized eigenvectors of bΣt with respect to bΣ0,t. As a result, Dt = Ir + 1 λ bΛr(St) 1 2 , L t = b A(St)St λ bΛr(St) 1 2 = b V (St)St , where bΛr(St) is the diagonal matrix with entries being first r generalized eigenvalues for sample covariance matrices as specified before. According to our definition, b A(St)St and b V (St)St are of size |St| r. In the rest of the proof, we denote b V (St)St by b V for simplicity. Similarly, we slightly abuse notation and abbreviate bΛr(St), b A(St)St , b B(St)St as bΛr, b A, b B in the rest of this proof. Then we have bΣ1/2 0,t L t op = bΣ1/2 0,t b V op = q event B2 and assumption in the theorem. Here with slight abuse of notation, we use bλi to denote the ith restricted sample generalized eigenvalue. Now we can bound the smoothness parameter from above by controlling each term in Hessian matrix. Since we are working on event B2, we have that 1 2ν bΣt op 2ν and 1 2ν bΣ0,t op 2ν. Similar bound holds for minimum restricted sample generalized eigenvalue. When bΣ1/2 0,t L t bΣ1/2 0,t L F δ, we have gt(L, Z) = bΣt, ZZ λ bΣ0,t, ZZ + λ ZZ , bΣ0,t LL bΣ0,t + λ L bΣ0,t L, Z bΣ0,t Z + λ Z bΣ0,t L, L bΣ0,t Z 0 + 0 + λ Z 2 F L bΣ0,t 2 op + λ L bΣ0,t LZ F bΣ0,t Z F + λ Z bΣ0,t L F L bΣ0,t Z F 4λν(δ + bΣ1/2 0,t b V op)2 Z 2 F + 2λ(δ + bΣ1/2 0,t b V op)2ν Z 2 F (6λν(δ + bΣ1/2 0,t b V op)2) Z 2 F = 1 Then we can upper bound the Hessian eigenvalue by Step (3) To derive the strong convexity parameter α, we start from the function evaluated at the global minimizer L = b V . Define e Z = ZHZ b V , we now lower bound gt(b V , e Z) by a Sparse GCA and Thresholded Gradient Descent constant multiple of 1 2 e Z 2 F. To do this, we substitute the expression into gt and obtain gt(b V , e Z) = bΣt + λbΣ0,t, e Z e Z + λ b V bΣ0,t b V , e Z bΣ0,t e Z + λ e Z bΣ0,t b V , e Z bΣ0,t b V + λ e Z bΣ0,t b V , b V bΣ0,t e Z = bΣt + λbΣ0,t, e Z e Z + λ I + 1 λ bΛr, e Z bΣ0,t e Z + λ e Z e Z , bΣ0,t b V b V bΣ0,t + λ e Z bΣ0,t b V , b V bΣ0,t e Z = bΣt, e Z e Z + bΛr, e Z bΣ0,t e Z + λ e Z e Z , bΣ0,t b V b V bΣ0,t | {z } Term I + λ e Z bΣ0,t b V , b V bΣ0,t e Z | {z } Term II We deal with Term I and Term II separately. To simplify Term I, we recall that bΣ 1 0,t bΣtbΣ 1 0,t = b AbΛr b A + b BbΛ b B , where we have defined the remaining generalized eigenvectors by b B and the rest of eigenvalues by diagonal entries of bΛ, both on the restrict set St. Here we also omit the dependence on t. 
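As an aside, the two facts established in Step (2), the gradient formula (45) and the closed form of the global minimizer $L_t^* = \widehat{A}(S_t)_{S_t}(I + \widehat{\Lambda}_r(S_t)/\lambda)^{1/2}$, are easy to confirm numerically on small synthetic inputs. The sketch below is purely illustrative; the matrices S and S0 and the value of lam are placeholders, not the restricted sample matrices of the proof.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
m, r, lam = 10, 2, 5.0
G1 = rng.standard_normal((m, m)); S = G1 @ G1.T / m + np.eye(m)    # stand-in for Sigma_hat_t
G2 = rng.standard_normal((m, m)); S0 = G2 @ G2.T / m + np.eye(m)   # stand-in for Sigma0_hat_t

def f(L):
    R = L.T @ S0 @ L - np.eye(r)
    return -np.trace(L.T @ S @ L) + 0.5 * lam * np.sum(R * R)

def grad(L):
    # gradient formula (45): 2(-S L + lam S0 L (L'S0 L - I_r))
    return 2.0 * (-S @ L + lam * S0 @ L @ (L.T @ S0 @ L - np.eye(r)))

# (i) finite-difference check of the gradient formula at a generic point
L = rng.standard_normal((m, r)); E = rng.standard_normal((m, r)); eps = 1e-6
print((f(L + eps * E) - f(L - eps * E)) / (2 * eps), np.sum(grad(L) * E))

# (ii) the gradient vanishes at L* built from the leading generalized eigenpairs,
#      which eigh normalizes so that A_hat' S0 A_hat = I
w, Q = eigh(S, S0)                      # generalized eigenpairs, ascending order
idx = np.argsort(w)[::-1][:r]
A_hat, lam_r = Q[:, idx], w[idx]
L_star = A_hat @ np.diag(np.sqrt(1.0 + lam_r / lam))
print(np.linalg.norm(grad(L_star)))     # ~ 0 up to floating-point error
```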
By the definition of restricted sample generalized eigenvectors, we have Term I = bΛr, e Z bΣ0,t e Z + e Z e Z , λbΣ0,t b V b V bΣ0,t bΣt = e ZbΛr e Z , bΣ0,t + e Z e Z , bΣ0,t(λ b A(I + bΛr/λ) b A bΣ 1 0,t bΣbΣ 1 0,t )bΣ0,t bλr e Z e Z , bΣ0,t + e Z e Z , bΣ0,t(λ b A(I + bΛr/λ) b A ( b AbΛr b A + b BbΛ b B ))bΣ0,t = e Z e Z , bΣ1/2 0,t (bλr I + bΣ1/2 0,t b A(λI) b A bΣ1/2 0,t bΣ1/2 0,t b BbΛ b BbΣ1/2 0,t )bΣ1/2 0,t = e Z e Z , bΣ1/2 0,t (bλr(bΣ1/2 0,t b A b A bΣ1/2 0,t + bΣ1/2 0,t b B b B bΣ1/2 0,t ) + λbΣ1/2 0,t b A b A bΣ1/2 0,t bΣ1/2 0,t b BbΛ b B bΣ1/2 0,t )bΣ1/2 0,t = e Z e Z , bΣ1/2 0,t (bΣ1/2 0,t b A(λ + bλr)I b A bΣ1/2 0,t + bΣ1/2 0,t b B(bλr bΛ) b B bΣ1/2 0,t )bΣ1/2 0,t 2ν e Z e Z , (bλr bλr+1)I = 1 2ν (bλr bλr+1) e Z 2 F. The first inequality is due to the fact we only select the first r generalized eigenvalues, the second inequality follows from the fact that the second term can be bounded using the eigengap since bΛ is a diagonal matrix containing the last |St| r generalized eigenvalues. As a result, Term I can be bounded in terms of eigengap at sample level. Note that the above inequality holds for any e Z. Now we bound Term II. Here we use the fact that HZ is the solution to min P O(r) ZP b V F. By definition of e Z, we hope to bound λ (ZHZ b V ) bΣ0,t b V , b V bΣ0,t(ZHZ b V ) . We will prove now that this can be lower bounded by 0, as is the case in Example 1 (46) in Chi et al. (2019). Note that we have e Z bΣ0,t b V , b V bΣ0,t e Z = Tr(b V bΣ0,t(ZHZ b V )b V bΣ0,t(ZHZ b V ) = Tr(bΣ0,t(ZHZ b V )b V bΣ0,t(ZHZ b V )b V ) = Tr(bΣ0,t ZHZ b V bΣ0,t ZHZ b V ) 2 Tr(bΣ0,t b V b V bΣ0,t ZHZ b V ) + Tr(bΣ0,t b V b V bΣ0,t b V b V ). By Lemma 2 in Ten Berge (1977), we know that ZHZ b V 0 and it is also symmetric. If we denote ZHZ b V = L0L 0 and write Z1 = bΣ1/2 0,t L0, Z2 = bΣ1/2 0,t b V , we have e Z bΣ0,t b V , b V bΣ0,t e Z = Z1Z 1 2 F + Z2Z 2 2 F 2 Z1Z 1 , Z2Z 2 0 by Cauchy Schwarz Inequality. This proves that Term II is non-negative. Finally we conclude that we have gt(b V , e Z) 1 2ν (bλr bλr+1) e Z 2 F. Now we argue for general L. First we show that for any L in the neighborhood of b V |gt(L, e Z) gt(b V , e Z)| = |λ e Z e Z , bΣ0,t(LL b V b V )bΣ0,t + λ L bΣ0,t L b V bΣ0,t b V , e Z bΣ0,t e Z + λ( e Z bΣ0,t L, L bΣ0,t e Z e Z bΣ0,t b V , b V bΣ0,t e Z )| c1 L b V F e Z 2 F, for some positive constant c1 depending on λ and ν. Specifically, this bound can be obtained by bounding each term. The first term can be bounded by λ| e Z e Z , bΣ0,t(LL b V b V )bΣ0,t | λ e Z 2 F bΣ0,t(LL b V b V )bΣ0,t op 2λν bΣ1/2 0,t LL bΣ1/2 0,t bΣ1/2 0,t b V b V bΣ1/2 0,t op e Z 2 F 2λν bΣ1/2 0,t L op bΣ1/2 0,t L bΣ1/2 0,t b V F e Z 2 F 2λν3/2 1 + λ1 λ + δ L b V F e Z 2 F, under the assumption that dist(bΣ1/2 0,t L, bΣ1/2 0,t b V ) 1 4 bΣ1/2 0,t b V op = 1 λ . The first inequality is the property of matrix norm, the second inequality is due to the bound on sample covariance matrix. The third inequality follows from Lemma 14 and the last inequality follows from the neighborhood assumption. The second term can be bounded in a similar fashion as λ| L bΣ0,t L b V bΣ0,t b V , e Z bΣ0,t e Z | λ (L bΣ0,t L b V bΣ0,t b V ) e Z F bΣ0,t e Z F 2λν L bΣ0,t L b V bΣ0,t b V op e Z 2 F = 2λν b V bΣ0,t(L b V ) + (L b V ) bΣ0,t b V + (L b V ) bΣ0,t(L b V ) op e Z 2 F 2λν(2 bΣ1/2 0,t b V op bΣ1/2 0,t (L b V ) op + δ bΣ1/2 0,t (L b V ) op) e Z 2 F λν bΣ1/2 0,t (L b V ) op e Z 2 F 2 δ + 2 1 + λ1 λν3/2 L b V F e Z 2 F. 
Sparse GCA and Thresholded Gradient Descent The first line is standard matrix inequality and the rest follows from the neighborhood assumption as well as the assumption on the operator norm of sample covariance matrices on event B2. Finally we bound the last term as λ|( e Z bΣ0,t L, L bΣ0,t e Z e Z bΣ0,t b V , b V bΣ0,t e Z )| = λ| Tr(( e Z bΣ0,t L)2 ( e Z bΣ0,t b V )2)| = λ| Tr( e Z bΣ0,t(L + b V ) e Z bΣ0,t(L b V ))| λ e Z bΣ0,t(L + b V ) F e Z bΣ0,t(L b V ) F 2λν3/2 e Z 2 F L b V F bΣ1/2 0,t (L + b V ) op 2λν3/2 2 1 + λ1 + δ L b V F e Z 2 F. In this way, we see that we can choose the constant to be 2 δ + 2 1 + λ1 2λν3/2 1 + λ1 Thus for any L within a δ neighborhood of b V = L t as defined above, we have, by triangle inequality, gt(L, e Z) gt(b V , e Z) |gt(L, e Z) gt(b V , e Z)| 2ν (bλr bλr+1) e Z 2 F c1 L b V F e Z 2 F 4ν (bλr bλr+1) e Z 2 F 1 8ν (λr λr+1) e Z 2 F, as long as L b V F bλr bλr+1 4νc1 , which is guaranteed by the assumption in the theorem and the event B2. Recall that gt(L, e Z) = 1 2vec( e Z) 2ft(L)vec( e Z). This motivates us to pick α = (λr λr+1) under appropriate radius conditions. To finally find the radius of attraction region such that dist(Lt, L t ) = dist(V t, b V (St)) R, we notice that throughout the proof for smoothness and strongly convexity to hold, we require 3 conditions on the distance L L t F (bλr bλr+1) 4νc1 , bΣ1/2 0,t L bΣ1/2 0,t L t F δ, bΣ1/2 0,t L bΣ1/2 0,t L t F 1 Hence to ensure all of the above three conditions to hold, we can just set Here recall that 1 ν is the lower bound of minimum eigenvalue of Σ0. In summary, we conclude that there is a constant α such that for any L in the neighborhood of L t and for any Z, vec(ZHZ L t ) 2ft(L)vec(ZHZ L t ) α ZHZ L t 2 F. Combine with the upper bound and resort to Lemma 20, we conclude that if we choose the step-size to be η 1 β, under the radius conditions, dist2(Lt+1, b V ) (1 αη) dist2(Lt, b V ). Moreover, we have α = (λr λr+1) 4ν , β = 12λν This finishes our analysis for the gradient descent step. We comment here that Proposition 18 is then proved by simply taking δ = 1 and calculate the radius in (46) and α, β accordingly. We keep δ in the proof for the sake of generality. Specifically, under the choice that λ = λ1 c , we have η c 12λ1ν(c + 2)2 = 1 and we have, by lifting the matrix of size |St| r to p r by filling in 0s, dist2(V o t+1, b V (St)) 1 η(λr λr+1) dist2(V t, b V (St)). B.4 Proof of Proposition 19 Proof Recall that we have defined supp(V ) = S, supp(V t+1) = Ft+1. Now we define sets F1 = S\Ft+1, F2 = S Ft+1 and F3 = Ft+1\S, the sizes of which are denoted by k1, k2 and k3 respectively. We also define x1 = VF1 F, x2 = VF2 F, y1 = Vt+1,F1 F, y2 = Vt+1,F2 F, y3 = Vt+1,F3 F. Finally, we define x2 1 + x2 2 = V 2 F = X2, y2 1 + y2 2 + y2 3 Vt+1 2 F = Y 2. Let = Tr(|V Vt+1|) be the quantity we are interested in. Here we let Tr(|M|) denote the trace norm Tr(|M|) = Tr( M M) which is simply the sum of singular values (a.k.a. nuclear Sparse GCA and Thresholded Gradient Descent norm). This is a special case of Schatten p norm, defined as T p = Tr(|T|p) 1 p which is essentially vector p norm of a vector composed of singular values. We first prove the following claim about the effect of truncation step: Tr(|V Vt+1|) Tr(|V V t+1|) r s X2Y 2 2, 1 + p XY (X2Y 2 2) Proof Since in the hard thresholding step we greedily pick the rows with largest l2 norms, we have y2 1 k1 y2 3 k3 . In addition, since k1 + k2 = s s = k2 + k3, we also have k1 k3. 
By applying Holder s inequality on Schatten norm and the definition of support set, we have 2 ( VF1 F Vt+1,F1 F + VF2 F Vt+1,F2 F)2 = (x1y1 + x2y2)2 (y2 1 + y2 2)X2 (Y 2 y2 3)X2 X2Y 2 k3 and this gives y2 1 k1 k3X2 (X2Y 2 2) k1 + k2 k3 + k2 (Y 2 2/X2) = s s (Y 2 2/X2), where the second inequality holds since k1 k3. Now we split the arguments into two cases. Case I: < XY q s s+s . In this case, we obtain that This implies (47) immediately. Case II: We can now assume that XY q s s+s . Then we have s (Y 2 2/X2) 2 By definition we have X2 x2 1 x1y1 + x2y2 . Solving this inequality as a quadratic inequality in x1 yields (X2Y 2 2)(Y 2 y2 1) Y 2 . By definition of X2 = x2 1 +x2 2 we have x1 X. Also we have XY by Holder inequality on Schatten norm. Combining the above inequalities together, we have Note that y1 p s X by previous calculation. Substituting it into the above equation, we have Finally, we can compute that X2Y 2 2 min X, 1 X2Y 2 2(1 + r s X2Y 2 2, 1 + p s s XY (X2Y 2 2) This is the error induced by the truncation step, finally we have Tr(|V Vt+1|) Tr(|V V t+1|) Tr(|V (V t+1 Vt+1)|) X2Y 2 2, 1 + p s s XY (X2Y 2 2) This finishes the proof for the claim. We now switch to work with the distance metric dist(U, V ) = min P O(r) UP V F. We first expand the expression UP V 2 F = UP 2 F + V 2 F 2 UP, V = U 2 F + V 2 F 2 Tr(P U V ). So, to minimize the distance metric we defined is to maximize Tr(P U V ). To this end, let ADB denote the singular value decomposition of U V , then we have Tr(P U V ) = Tr(P ADB ) = Tr(B P AD) = Tr(ZD) for Z being an orthogonal matrix. Since D is a diagonal matrix, we have Tr(ZD) = P i Zii Dii and to maximize this value we would like to have Zii = 1 for all i since Dii are non-negative. As a result, we have the optimal P being AB and the optimal value is given by min UP V 2 F = U 2 F + V 2 F 2 Tr(D) = U 2 F + V 2 F 2 Tr(|U V |). Sparse GCA and Thresholded Gradient Descent Now we consider the subspace distance we have defined. Recall that = Tr(|V Vt+1|). Then the subspace distance can be written as dist2(V t+1, V ) = V 2 F + V t+1 2 F 2 Tr |V V t+1| V 2 F + V t+1 2 F 2 Tr |V Vt+1| + 2 r s X2Y 2 2, 1 + p s s XY (X2Y 2 2) V 2 F + Vt+1 2 F 2 Tr |V Vt+1| + 2 r s s XY (X2Y 2 2) = dist2(Vt+1, V ) + 2 r s s XY (X2Y 2 2), where the first equality follows from the previous expansion of distance, the first inequality follows from Claim (47) and we use the fact that truncation reduces the Frobenius norm in the second inequality. In the last equality, we use again the relationship between trace norm and distance. To deal with the extra error term, we use the following bound X2Y 2 2 = (XY + )(XY ) 2XY (XY ) XY (X2 + Y 2 2 ) = XY ( V 2 F + Vt+1 2 F 2 Tr(|V Vt+1|) = XY dist2(Vt+1, V ). Combining the inequalities, we have dist2(V t+1, V ) dist2(Vt+1, V ) + 2 r s s XY (X2Y 2 2) dist2(Vt+1, V ) + 2 r s s XY XY dist2(Vt+1, V ) = 1 + 2 r s dist2(Vt+1, V ). This finishes our analysis for the hard thresholding step. B.5 Proof of Theorem 9 Proof We present the proof of initialization using generalized Fantope in this section. Some proof arguments originate from Gao et al. (2017). Recall that we have defined e A = A(A bΣ0A) 1/2, eΛr = (A bΣ0A)1/2Λr(A bΣ0A)1/2, and e F = e A e A . (48) Throughout the proof, we work on the event B3 B4, which by Lemmas 16 and 17 happens with probability at least 1 exp( C (s + log(ep/s))) for some constant C > 0, uniformly over Pn. We shall need the following lemma, which is a simplified version of Lemma 6.3 in Gao et al. (2017). 
The proof is essentially the same and so we omit it. Lemma 21 (Curvature of Fantope) Let P O(p, r) and D = diag(d1, d2, ..., dr) with d1 d2 ... dr 0. If F Fr, then PDP , PP F dr 2 PP F 2 F. Recall that Σ = Σ0KΛK Σ0 = Σ0AΛr A Σ0 + Σ0BΛB Σ0. For notational simplicity, for any positive semi-definite matrix B, we define φB max(k) = max u 0 k,u =0 u Bu u u , φB min(k) = max u 0 k,u =0 u Bu In the rest of this proof, let = b F e F with e F defined in (48). As in Gao et al. (2017), the main proof consists of two steps. The first step is to derive upper bound of bΣ1/2 0 bΣ1/2 0 F and the second step is to lower bound bΣ1/2 0 bΣ1/2 0 F by F. Step (1) First note that by Lemma 17, e A is well defined on event B4, so e F = e A e A is also well defined on event B4. In addition, on event B4, we have Σ1/2 0 ( e F AA )Σ1/2 0 F C r(s + log p) We first show that e F is a feasible solution, that is, bΣ1/2 0 e A e A bΣ1/2 0 Fr. By (48), e F = e A e A = A(A bΣ0A) 1A . Let M = bΣ1/2 0 e F bΣ1/2 0 = bΣ1/2 0 A(A bΣ0A) 1A bΣ1/2 0 . Then Tr(M) = Tr(bΣ1/2 0 A(A bΣ0A) 1A bΣ1/2 0 ) = r. Next we check that 0 M I. It is obvious that 0 M. Moreover, M op bΣ1/2 0 e A op bΣ1/2 0 e A op 1 by the construction of e A. Hence, e F Fr. Recall that b F is the solution to (14). The basic inequality implies bΣ, b F + ρ b F 1 bΣ, e F + ρ e F 1. Rearranging terms, we have 0 ρ( e F 1 e F + 1) + bΣ, . (50) We deal with each term on the right side separately. For the first term on the right, we have e F 1 e F + 1 = e FSS 1 e FSS + SS 1 (SS)c 1 SS 1 (SS)c 1. (51) Sparse GCA and Thresholded Gradient Descent Here the first equality holds since e F is supported on S S. For the second term, we have bΣ, = bΣ Σ, + Σ0AΛr A Σ0, | {z } Term I + Σ0BΛB Σ0, | {z } Term II We now bound Term I and Term II separately. Bound for Term I : For Term I, we decompose it in the following way: Σ0AΛr A Σ0, = Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + bΣ0AΛr A bΣ0, . Now we bound the second term on the right as follows bΣ0AΛr A bΣ0, = bΣ1/2 0 AΛr A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 = bΣ1/2 0 e AeΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 = bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 + bΣ1/2 0 e A(eΛr Λr) e A bΣ1/2 0 , bΣ1/2 0 bΣ1/2 0 bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 + eΛr Λr F bΣ1/2 0 bΣ1/2 0 F. As a result, Term I can be bounded in the following way Term I Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 +δ1 bΣ1/2 0 bΣ1/2 0 F, for δ1 = eΛr Λr F. Bound for Term II : To bound Term II, we first notice that by definition, B Σ0A is zero matrix since A, B are normalized with respect to Σ0. As a result, we have Σ0BΛB Σ0, = Σ0BΛB Σ0, b F λr+1 Σ0BB Σ0, b F = λr+1 Σ0BB Σ0, b F e A e A = λr+1 Σ1/2 0 (I Σ1/2 0 AA Σ1/2 0 )Σ1/2 0 , b F e A e A = λr+1 Σ0 Σ0AA Σ0, b F e A e A . Here, the second last equality holds since Σ1/2 0 AA Σ1/2 0 + Σ1/2 0 BB Σ1/2 0 = I. We further decompose the rightmost side as λr+1 Σ0 Σ0AA Σ0, b F e A e A = λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, + λr+1 bΣ0, b F e A e A | {z } Term A λr+1 bΣ1/2 0 AA bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 | {z } Term B We now deal with Term A and Term B separately. By definition we have bΣ1/2 0 b F bΣ1/2 0 Fr, hence bΣ0, b F e A e A = Tr(bΣ1/2 0 b F bΣ1/2 0 ) Tr( e A bΣ0 e A) = r r = 0. 
For Term B, we have λr+1 bΣ1/2 0 AA bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 = λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 λr+1 bΣ1/2 0 e A(A bΣ0A I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 + λr+1 A bΣ0A I F bΣ1/2 0 bΣ1/2 0 F. Therefore, Term II has the following bound Σ0BΛB Σ0, λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 + δ2 bΣ1/2 0 bΣ1/2 0 F, for δ2 = λr+1 A bΣ0A I F. Now we combine the results for Term I and Term II to obtain Σ0AΛr A Σ0, + Σ0BΛB Σ0, Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + bΣ1/2 0 e AΛr e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 + δ1 bΣ1/2 0 bΣ1/2 0 F + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, λr+1 bΣ1/2 0 e A e A bΣ1/2 0 , bΣ1/2 0 ( b F e A e A )bΣ1/2 0 + δ2 bΣ1/2 0 bΣ1/2 0 F = Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, + δ bΣ1/2 0 bΣ1/2 0 F + bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 , where δ = δ1 + δ2. Plugging back to (50), together (51) with (52), we have 0 ρ( e F 1 e F + 1) + bΣ, ρ( SS 1 (SS)c 1) + bΣ Σ, + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, + δ bΣ1/2 0 bΣ1/2 0 F + bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 . By Holder s inequality, we have bΣ Σ, + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0, + λr+1 Σ0 bΣ0, λr+1 Σ0AA Σ0 bΣ0AA bΣ0, bΣ Σ 1 + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0 1 + λr+1 Σ0 bΣ0 1 + λr+1 Σ0AA Σ0 bΣ0AA bΣ0 1. On event B3, we can pick ρ = γ q n for some large constant γ > 0 such that bΣ Σ + Σ0AΛr A Σ0 bΣ0AΛr A bΣ0 + λr+1 Σ0 bΣ0 + λr+1 Σ0AA Σ0 bΣ0AA bΣ0 ρ Sparse GCA and Thresholded Gradient Descent thus the lefthand side of (54) is bounded by ρ 2 1. Furthermore, by Lemma 21, we have bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( b F e F)bΣ1/2 0 = bΣ1/2 0 e A(Λr λr+1I) e A bΣ1/2 0 , bΣ1/2 0 ( e A e A b F)bΣ1/2 0 2 bΣ1/2 0 ( e A e A b F)bΣ1/2 0 2 F. Together with (53) and (54), the last display implies that when (55) holds, 0 ρ( SS 1 (SS)c 1)+ ρ 2 1 λr λr+1 2 bΣ1/2 0 ( e A e A b F)bΣ1/2 0 2 F +δ bΣ1/2 0 bΣ1/2 0 F. Rearranging terms and multiplying both side by 2, we obtain (λr λr+1) bΣ1/2 0 bΣ1/2 0 2 F 3ρ SS 1 ρ (SS)c 1 + 2δ bΣ1/2 0 bΣ1/2 0 F 3ρ SS 1 + 2δ bΣ1/2 0 bΣ1/2 0 F. This can be view as a quadratic equation, which, by Lemma 2 in Cai et al. (2013), yields bΣ1/2 0 bΣ1/2 0 2 F 4δ2 (λr λr+1)2 + 6ρ λr λr+1 SS 1. (56) Combining the last two displays, we have 0 3ρ SS 1 ρ (SS)c 1 + δ2 λr λr+1 + (λr λr+1) bΣ1/2 0 bΣ1/2 0 2 F 9ρ SS 1 ρ (SS)c 1 + 5δ2 This can be viewed as a version of generalized cone condition. Finally, using Cauchy Schwarz inequality, SS 1 s SS F, and so (56) leads to bΣ1/2 0 bΣ1/2 0 2 F 4δ2 (λr λr+1)2 + 6ρs λr λr+1 SS F. (57) and this is the end of the first step. Step (2) Recall that we have established the generalized cone condition (SS)c 1 9 SS 1 + 5δ2 (λr λr+1)ρ. (58) In this step we lower bound bΣ1/2 0 bΣ1/2 0 2 F by a function of F on this cone. Adapting the peeling argument in Bickel et al. (2009), we define the index set J1 = {(il, jl)}t l=1 in (S S)c correspond to the entries with the t largest absolute values in , and also define e J = (S S) S J1. Next we partition e Jc into disjoint subsets J2, ..., JM of size t and possibly |JM| < t such that each Jm is the set of indices corresponding the entries of the t largest absolute values in outside e J S m 1 j=2 Jj. 
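The peeling construction just described is procedural, so it may help to see it spelled out. The following sketch is illustrative only: delta, base_mask, and t stand in for $\Delta$, the base index set $(S \times S) \cup J_1$, and the chunk size $t$.

```python
import numpy as np

def peel(delta, base_mask, t):
    """Sort the entries of |delta| outside the base set in decreasing magnitude
    and cut them into consecutive chunks J_2, J_3, ..., J_M of size at most t."""
    flat = np.abs(delta).ravel()
    outside = np.flatnonzero(~base_mask.ravel())
    order = outside[np.argsort(flat[outside])[::-1]]           # decreasing |delta|
    return [order[i:i + t] for i in range(0, order.size, t)]

# tiny usage example with placeholder inputs
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 6))
mask = np.zeros((6, 6), dtype=bool); mask[:2, :2] = True       # toy base set
print([len(J) for J in peel(D, mask, t=10)])                   # chunk sizes
```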
Then by triangle inequality bΣ1/2 0 bΣ1/2 0 F bΣ1/2 0 e J bΣ1/2 0 F m=2 bΣ1/2 0 Jm bΣ1/2 0 F φ bΣ0 min(s + t) e J F φ bΣ0 max(t) In addition, by our construction of the index sets, m=2 Jm 1 1 t 1/2 (SS)c 1 t 1/2(9 SS 1 + 5δ2 t e J F + 5δ2 The second last inequality follows from the generalized cone condition (58). Hence combining the results above, we have bΣ1/2 0 bΣ1/2 0 F κ1 e J F κ2 δ2 where κ1 = φ bΣ0 min(s + t) 9s tφ bΣ0 max(t), κ2 = 5φ bΣ0 max(t). Taking t = c1s2 for c1 sufficiently large, using the same argument as in Gao et al. (2017), under condition (26) we can lower bound κ1 by some constant C1 and upper bound κ2 by some constant C2. Combining (57) and (59), we obtain e J 2 F C1 sρ λr λr+1 e J F + C2 (λr λr+1)2 + δ2 Solving this equation gives (λr λr+1)2 + δ2 (λr λr+1)2 + δ2 for some positive constant C3 that is sufficiently large. Also we have m=2 Jm F 9s t e J F + 5δ2 Sparse GCA and Thresholded Gradient Descent Combining the last two displays and using t = c1s2, we obtain that sρ + δ λr λr+1 + δ2 sρ(λr λr+1) Recall that on event B3 we pick ρ = γ q n . Then on event B4, we have F C5 sρ λr λr+1 = C5γ s log p n(λr λr+1) on event B3 B4. Step (3) The last step follows from AA b F F F + e F AA F. We combine the last display and (49) to obtain the desired bounds for each term on the right side. This finishes our proof. B.6 Proof of Corollary 10 Proof Define τ 2 n = s2 log p n(λr λr+1)2 . We will bound dist(V 1, V ) by a constant multiple of τn on event B2 B3 B4. Notice that since A0A 0 is the best rank r approximation, and AA also has rank r. Thus, we obtain that A0A 0 b F F AA b F F. Triangle inequality further leads to A0A 0 AA F A0A 0 b F F + AA b F F 2 AA b F F. Together with Lemma 14, the last display implies dist(A, A0) C0τn for some positive constant C0. By definition b A0 = HT(A0, s ), V 1 and e A0 are both s sparse. Moreover, following the lines of the proof of Proposition 19 and Corollary 8, we have dist( e A0, A) C1τn for some constant C1 > 0. Let P be the orthogonal matrix such that e A0P A F = dist( e A0, A). Then we have V 1P = e A0(I + e A 0 bΣ e A0/λ)1/2P = e A0PP (I + e A 0 bΣ e A0/λ)1/2P = e A0P(I + P e A 0 bΣ e A0P/λ)1/2. dist(V 1, V ) V 1P V F = e A0P(I + P e A 0 bΣ e A0P/λ)1/2 A(I + Λr/λ)1/2 F , we turn to bound the rightmost side. To this end, triangle inequality gives e A0P(I + P e A 0 bΣ e A0P/λ)1/2 A(I + Λr/λ)1/2 F e A0P(I + P e A 0 bΣ e A0P/λ)1/2 e A0P(I + Λr/λ)1/2 F | {z } Term I + ( e A0P A)(I + Λr/λ)1/2 F | {z } Term II We now bound each term separately. For Term II, we have ( e A0P A)(I + Λr/λ)1/2 F e A0P A F (I + Λr/λ)1/2 op C2τn. The last inequality holds since λ = λ1/c. To deal with Term I, we define = e A0P A and notice that e A0P(I + P e A 0 bΣ e A0P/λ)1/2 e A0P(I + Λr/λ)1/2 F e A0P op (I + P e A 0 bΣ e A0P/λ)1/2 (I + Λr/λ)1/2 F. By matrix root perturbation bound (e.g., Lemma 2 in Gao et al. (2017)), we have (I + P e A 0 bΣ e A0P/λ)1/2 (I + Λr/λ)1/2 F C3 P e A 0 bΣ e A0P Λr F C3 (A + ) bΣ(A + ) A ΣA F = C3( A (bΣ Σ)A F + 2 bΣA F + bΣ F) for some positive constant C3. Since e A0 is s sparse and A is s sparse, = e A0P A is s+s sparse. By definition of P, F = dist( e A0, A) C1τn. By a similar argument to the proof of Corollary 8, on event B2, we have (I + P e A 0 bΣ e A0P/λ)1/2 (I + Λr/λ)1/2 F C4 Consequently, Term I is also dominated by a constant multiple of τn. Thus, we conclude that on the intersection of B2 B3 B4, dist(V 1, V ) C5τn for some positive constant C5. 
By union bound, the intersection of the three events holds with probability at least 1 exp( C (s + log(ep/s))), uniformly over Pn. This completes the proof. B.7 Proof of Lemma 15 Proof In this proof, we work on event B2 defined in (34), which occurs with probability at least 1 exp( C s log(ep/s )) for some constant C > 0, uniformly over Pn. The proof relies on Theorem 3.1 of Sun (1983), a matrix perturbation bound for generalized eigenspaces. Fix any I such that S I and |I| 2s + s. We shall apply the theorem on matrix pair (ΣII, Σ0,II) and its perturbation (bΣII, bΣ0,II). Sparse GCA and Thresholded Gradient Descent To this end, for the fixed set I, define δ = min λ,λ λ λ p (1 + λ2)(1 + λ 2) for λ {λ1, ...λr} and λ {bλr+1, ...bλ|I|}, where λi denotes the ith generalized eigenvalue of matrix pair (ΣII, Σ0,II) and bλi denotes the ith sample generalized eigenvalue of matrix pair (bΣII, bΣ0,II). Define event e B2 = bλr+1 2λr+1 λr + λr+1 On event e B2, since generalized eigenvalues are in decreasing order, we have δ = λr bλr+1 q (1 + λ2 1)(1 + bλ2 r+1) 1 2 λr λr+1 q (1 + λ2 1)(1 + bλ2 r+1) . Moreover, since bλ2 r+1 4λ2 r+1 on e B2, we further obtain that 4 λr λr+1 p 1 + λ2 r+1 . On the other hand, e B2 can be equivalently defined as e B2 = bλr+1 λr+1 λr+1 λr λr+1 Since λi is also the ith eigenvalue of Σ 1/2 0,II ΣIIΣ 1/2 0,II and bλi the ith eigenvalue of bΣ 1/2 0,II bΣII bΣ 1/2 0,II , Weyl s inequality then implies that on event B2, |bλr+1 λr+1| Σ 1/2 0,II ΣIIΣ 1/2 0,II bΣ 1/2 0,II bΣII bΣ 1/2 0,II op (Σ 1/2 0,II bΣ 1/2 0,II )ΣIIΣ 1/2 0,II op + bΣ 1/2 0,II (ΣII bΣII)Σ 1/2 0,II op + bΣ 1/2 0,II bΣII(Σ 1/2 0,II bΣ 1/2 0,II ) op Σ 1/2 0,II bΣ 1/2 0,II op ΣIIΣ 1/2 0,II op + bΣ 1/2 0,II op ΣII bΣII op Σ 1/2 0,II op + bΣ 1/2 0,II bΣII op Σ 1/2 0,II bΣ 1/2 0,II op for some constant C0 depending on ν. Here the last inequality is due to the matrix root bound from Lemma 2 in Gao et al. (2015a). Under condition (23), by the choice of s in (20), the last display implies |bλr+1 λr+1| (λr λr+1 2 λr+1). Hence e B2 holds on event B2 under condition (23). Furthermore, we have γ(ΣII, Σ0,II) = min x =1 (x ΣIIx)2 + (x Σ0,IIx)2 1 γ(bΣII, bΣ0,II) = min x =1 (x bΣIIx)2 + (x bΣ0,IIx)2 1 on event B2. Now we apply Theorem 3.1 of Sun (1983) for perturbation on restricted covariance matrices and obtain that dist(A, b A(I)) C1 Σ2 II + Σ2 0,II op γ(ΣII, Σ0,II)γ(bΣII, bΣ0,II) (bΣII ΣII)AI 2 F + (bΣ0,II Σ0,II)AI 2 F δ 1 + λ2 r+1 λr λr+1 AI 2 F( bΣII ΣII 2op + bΣ0,II Σ0,II 2op) 1 + λ2 r+1 λr λr+1 1 + λ2 r+1 λr λr+1 for some positive constant C4 on event B2. This finishes the proof of Lemma 15. B.8 Proofs of Lemma 16 and Lemma 17 Proof Note that λr+1 < 1. Hence, it suffices to show that each of the four infinity norm terms on the left side in the definition of event B3 in (35) is upper bounded by a constant multiple of p log p/n with probability at least 1 p C for some positive constant C , uniformly over Pn. This can be achieved by following the lines of the proof of Lemma 6.4 in Gao et al. (2017) and so we omit the details. Proof We first bound the operator norm of each term on the left side of (37) by a constant multiple of q 1 n(s + log(ep/s)), and the result on Frobenius follows directly since each term is of rank r. As in the proof of Lemma 6.1 of Gao et al. (2017), we have Σ1/2 0 ( e A A) op Σ1/2 0 A op (A bΣ0A)1/2 I op (A bΣ0A) 1/2 op, eΛr Λr op (A bΣ0A)1/2 I op Λr(A bΣ0A)1/2 op + Λr op (A bΣ0A)1/2 I op. Sparse GCA and Thresholded Gradient Descent So it remains to bound (A bΣ0A)1/2 I op. By Lemma 2 in Gao et al. 
(2015a), it suffices to bound A bΣ0A I op. Note that A bΣ0A I op = A (bΣ0 Σ0)A op = A S (bΣ0,SS Σ0,SS)AS op = sup v =1 (AS v) (bΣ0,SS Σ0,SS)(AS v) Σ1/2 0,SSAS 2 op Σ 1/2 0,SS bΣ0,SSΣ 1/2 0,SS I op Σ 1/2 0,SS bΣ0,SSΣ 1/2 0,SS I op C bΣ0,SS Σ0,SS op. By Lemma 3 of Gao et al. (2015a), we then have A bΣ0A I op C1 with probability at least 1 exp( C (s + log(ep/s)) for some constant C > 0, uniformly over Pn. This completes the proof. Appendix C. Proof of Lower Bound We first present a lemma on Kullback-Leibler divergence between data distributions from a special kind of covariance matrix. The Lemma can be viewed as a general case of Lemma 1 in Gao et al. (2015b). Lemma 22 For t = 1, 2, define Σ(t) to be a block matrix whose (i, j)th block is given by λU(t) {i}U(t) {j} for i = j, where U(t) {i} O(pi, r) and whose ith diagonal block is given by Ipi. Further let P (t) denote the distribution of a random i.i.d sample of size n from the Np(0, Σ(t)) distribution where p = Pk i=1 pi. Then D(P (1)||P (2)) = λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) U(1) {i}U(1) {j} U(2) {i}U(2) Proof The first step is to find the eigenvalues of Σ(t). To this end, define U(t) = h U(t) {2} , . . . , U(t) K(t) i = h U(t) {2} . . . , U(t) {i 1}, (k 1)U(t) {i+1}, . . . , U(t) That is, K(t) i Rp r is a block matrix with jth blocks being U(t) {j} for j = i and the ith block being (k 1)U(t) {i}. It is straightforward to verify that Σ(t) yields the following decomposition Σ(t) = I + (k 1)λ k U(t)U(t) λ i=1 K(t) i K(t) We then deduce that the eigen-structure of Σ(t) is given by r eigenvalues equal to 1+λ(k 1), (k 1)r eigenvalues equal to 1 λ and p kr eigenvalues equal to 1. Furthermore, we notice that Σ(t)U(t) = (1 + (k 1)λ)U(t), hence the leading r dimensional eigenspace is given by the matrix U(t). This in particular implies that det Σ(1) = det Σ(2). Now the KL divergence is given by D(P (1)||P (2)) = n Tr (Σ(2)) 1Σ(1) i=1 pi log det (Σ(2)) 1Σ(1) ! Tr (Σ(2)) 1Σ(1) p Tr (Σ(2)) 1 Σ(1) Σ(2) . We now calculate the inverse of the block matrix Σ(2). To this end, we first guess the form of the inverse and then determine the coefficients. Specifically, we try to solve for the inverse with the following form I + a U(2) {1}U(2) {1} b U(2) {1}U(2) {2} . . . . . . b U(2) {1}U(2) {k} b U(2) {2}U(2) {1} I + a U(2) {2}U(2) {2} b U(2) {2}U(2) {3} . . . b U(2) {2}U(2) {k} . . . . . . . . . . . . . . . b U(2) {k}U(2) {1} . . . . . . . . . I + a U(2) {k}U(2) with the (i, j)th block being b U(2) {i}U(2) {j} for i = j and ith diagonal block being I+a U(2) {i}U(2) To determine the values of (a, b), we solve for the equation (Σ(2)) 1Σ(2) = Ip. This requires two conditions on the matrix product: the off-diagonal block to be 0 and the diagonal block to be I. For the ith diagonal block, we require I + (k 1)bλU(2) {i}(U(2) {i}) + a U(2) {i}(U(2) {i}) = I, and for the (i, j)th off-diagonal block, we require λb(k 2)U(2) {i}(U(2) {j}) + b U(2) {i}(U(2) {j}) + λU(2) {i}(U(2) {j}) + aλU(2) {i}(U(2) {j}) = 0. Solving the above equations, we have a = (k 1)bλ, b = λ (k 1)λ2 λ(k 2) 1. Sparse GCA and Thresholded Gradient Descent As a result, we can simplify the expression for KL divergence as follows D(P (1)||P (2)) = n Tr (Σ(2)) 1 Σ(1) Σ(2) j =i b U(2) {i}(U(2) {j}) U(1) {j}(U(1) {i}) U(2) {j}(U(2) {i}) i,j:i =j Tr I U(2) {i}(U(2) {j}) U(1) {j}(U(1) {i}) U(1) {i}(U(1) {j}) U(2) {i}(U(2) {j}) 2 = λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) U(1) {i}(U(1) {j}) U(2) {i}(U(2) {j}) 2 This finishes our proof for the Lemma. Note that when k = 2, we recover Lemma 1 in Gao et al. 
(2015b) as a special case. C.1 Proof of Theorem 12 Proof The main body of the proof is adapted from Gao et al. (2015a). The main tool for our proof is Fano s Lemma. For the sake of completeness, we provide the following version of Fano s Lemma from Yu et al. (1997). Lemma 23 Let (Θ, ρ) be a metric space and {Pθ : θ Θ} a collection of probability measures. For any totally bounded T Θ, denoted by M(T, ρ, ϵ) the ϵ-packing number of T with respect to ρ, i.e., the maximal number of points in T whose pairwise maximum distance in ρ is at least ϵ. Define the Kullback-Leibler diameter T by d KL(T) sup θ,θ T D(Pθ||Dθ ). inf θ sup θ Θ Eθ[ρ2(ˆθ(X), θ)] sup T Θ sup ϵ>0 1 d KL(T) + log 2 log M(T, ρ, ϵ) The proof is compose of two steps corresponding to different terms in the lower bound. Throughout the proof, we define λ = λr λr+1 for simplicity. Step I. We first establish the term involving r Pk i=1 si. To this end, let U{i}0 = Ir 0 O(pi, r) for each i = 1, 2, . . . , k. For some ϵ0 (0, p r (s1 r)] to be specified later, let B(ϵ0) = {U{1} O(p1, r) : supp(U{1}) [s1], U{1} U{1}0 F ϵ0} Ip1 λU{1}U {2}0 λU{1}U {3}0 . . . λU{1}U {k}0 λU{2}0U {1} Ip2 . . . λU{2}0U {k}0 . . . . . . . . . . . . λU{k}0U {1} λU{k}0U {2}0 . . . Ipk : U{1} B(ϵ0) with U = h U {1}, U {2}0, . . . , U {k}0 i . Since our target of estimation is A instead of U, we first establish the relationship between U and A under T0. Note that under the construction of Σ, we have Σ0 = Ip so the generalized eigenspace coincides with eigenspace. From Lemma 22, we deduce that the leading r dimensional eigenspace of Σ is given by span(U). From the normalization constraint such that A A = I, we conclude that dist(A, 1 k U) = 0. From here on, we first derive the minimax lower bound for the estimation of U and the lower bound for estimating A under the matrix distance follows immediately by scaling. The above analysis also implies that T0 F, where F is our original parameter space. By Lemma 22, d KL(T0) = sup U(1) {1},U(2) {1} B(ϵ0) D(P (1)||P (2)) U(1) {1},U(2) {1} B(ϵ0) λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) U(1) {i}U(1) {j} U(2) {i}U(2) U(1) {1},U(2) {1} B(ϵ0) λ2n 4 ( (k 1)λ2 + (k 2)λ + 1) j=2 2 U(1) {1}U(1) {j}0 U(2) {1}U(2) U(1) {1},U(2) {1} B(ϵ0) λ2n 2 ( (k 1)λ2 + (k 2)λ + 1) U(1) {1} U(2) {1} = 2λ2n(k 1)ϵ2 0 (k 1)λ2 + (k 2)λ + 1. Here, the second to last equality follows from the definition of B(ϵ0). We now establish a lower bound for the packing number of T0. For some α (0, 1) which shall be specified later, we define {e U{1}(1), e U{1}(2), . . . , e U{1}(N)} O(p1, r) to be a maximal set such that supp(e U{1}(i)) [s1] and for i = j, e U{1}(i)e U {1}(i) U{1}0U {1}0 F ϵ0, e U{1}(i)e U {1}(i) e U{1}(j)e U {1}(j) F Then by Lemma 1 in Cai et al. (2013), for some absolute constant C > 1, For each e U{1}(i), we define U{1}(i) to be the matrix such that U{1}0 U{1}(i) 2 F = dist2 e U{1}(i), U{1}0 . Sparse GCA and Thresholded Gradient Descent Then for any i, we deduce that U{1}(i) O(p1, r), supp(U{1}(i)) [s1] and U{1}(i)U{1}(i) = e U{1}(i)e U{1}(i) . In addition, Lemma 6.6 in Gao et al. (2017) implies that U{1}(i) U{1}0 F = dist e U{1}(i), U{1}0 = 1 2 e U{1}(i)e U {1}(i) U{1}0U {1}0 F ϵ0 hence U{1}(i) B(ϵ0). On the other hand, note that from Lemma 6.6 in Gao et al. (2017), U{1}(i), U{1}(j) = 1 2 U{1}(i)U {1}(i) U{1}(j)U {1}(j) F 2 e U{1}(i)e U {1}(i) e U{1}(j)e U {1}(j) F αϵ0. Define the metric to be ρ(Σ(1), Σ(2)) = dist U(1) {1}, U(2) {1} . 
The above argument implies that for ϵ = αϵ0, log M(T0, ρ, ϵ) r(s1 r) log 1 (k 1)λ2 + (k 2)λ + 1 n(k 1)λ2 r(s1 r) for sufficiently small constants c0, α, we obtain a lower bound of the order r (s1 r) (k 1)λ2 + (k 2)λ + 1 n(k 1)λ2 r(s1 r) by applying Lemma 23. By symmetry, we also have the lower bound with r(s1 r) replaced by r(si r) for i = 2, 3, . . . , k. Furthermore, recall that we have r 1 2 mini si hence r (si r), we obtain the lower bound of the order r r Pk i=1 si nλ2 for the estimation of U with metric dist2(U, b U), when k is finite. Step II We now turn to establish the lower bound term involving si log epi si . The step follows from the rank-one argument from Chen et al. (2013). Without loss of generality, we may assume that si pi 2 for any i, we then consider the following subset of the parameter space: Ip1 λU{1}U {2}0 λU{1}U {3}0 . . . λU{1}U {k}0 λU{2}0U {1} Ip2 . . . λU{2}0U {k}0 . . . . . . . . . . . . λU{k}0U {1} λU{k}0U {2}0 . . . Ipk : U{1} = Ir 1 0 0 ur for ur Sp1 r+1 and |supp(ur)| s1 r + 1. Restricting on the set T1, the minimax risk for estimating U is the same as the minimax risk for estimating ur under the squared error loss ur bur 2 F. Let X{i} = [X{i}1, X{i}2] for X{i}1 Rn (r 1) and X{i}2 Rn (pi r+1). By the same argument in Gao et al. (2015a), it is further equivalent to estimating ur under the squared loss based on the observations from X{i}2 since X{i}2 for i = 1, 2 . . . , k is a sufficient statistic for ur. Applying the argument in Chen et al. (2013), we obtain the lower bound for dist2(U, b U) with the following term By symmetry the same lower bound holds if we replace s1, p1 by si, pi for i = 2, 3, . . . k. Combining all the above steps and noticing the 1 k scaling between the estimation of A and U, we finally deduce that the lower bound for dist2(A, b A) is given by (30). This finishes our proof. Appendix D. Simulation Details for Sparse PCA of Correlation Matrices Here we describe the procedure of generating a p p correlation matrix R with eigenvalues λ1 λr > λr+1 = 1 λp > 0 and s-sparse leading eigenvectors. Necessarily, Pr i=1 λr s < p 1 and s r. Without loss of generality, we assume that s = m 2l where m and l are positive integers and 2l r. Since 2l r, there is a 2l r matrix T0 such that T0,ij { 1} and that the columns of T0 are orthogonal. For example, we may take the first r columns of a 2l 2l Hadamard matrix. Fix such a T0, we generate an s r matrix Ts as T 0 . . . T 0 . By our construction, each row of Ts has the same l2 norm and the columns are orthonormal. Next, we define Rs = Ts diag(θ1, . . . , θr)T s + 1 Pr i=1 θi This decomposition ensures that the diagonal of Rs is equal to 1. We then solve for θi such that the eigenvalues of Rs are given by λ1, . . . , λr. Then we augment this s s positive definite matrix to a p p correlation matrix R = diag(Rs, Ip s). It is straightforward to verify that R has leading eigenvalues λ1 λr > 1 and the columns of T = [T s O (p s) r] are the corresponding eigenvectors. A.A. Amini and M.J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, 37(5B):2877 2921, 2009. Q. Berthet and P. Rigollet. Computational lower bounds for sparse pca. ar Xiv preprint ar Xiv:1304.0828, 2013. Sparse GCA and Thresholded Gradient Descent Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577 2604, 2008a. Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. 
The Annals of Statistics, 36(1):199-227, 2008b.
Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.
A. Birnbaum, I.M. Johnstone, B. Nadler, and D. Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. The Annals of Statistics, 41(3):1055-1084, 2013.
T. Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074-3110, 2013.
Mengjie Chen, Chao Gao, Zhao Ren, and Harrison H. Zhou. Sparse CCA via precision adjusted iterative thresholding. arXiv preprint arXiv:1311.6186, 2013.
Yuxin Chen and Emmanuel Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739-747, 2015.
Yuejie Chi, Yue M. Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239-5269, 2019.
Jon Dattorro. Convex Optimization and Euclidean Distance Geometry. Meboo, Palo Alto, 2005.
Deborah J. Donnell, Andreas Buja, and Werner Stuetzle. Analysis of additive dependencies and concurvities using smallest additive principal components. The Annals of Statistics, pages 1635-1668, 1994.
David L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1(2000):32, 2000.
Jianqing Fan, Jianhua Guo, and Shurong Zheng. Estimating number of factors by adjusted eigenvalues thresholding. arXiv preprint arXiv:1909.10710, 2019.
Chao Gao, Zongming Ma, Zhao Ren, and Harrison H. Zhou. Minimax estimation in sparse canonical correlation analysis. The Annals of Statistics, 43(5):2168-2197, 2015a.
Chao Gao, Zongming Ma, Zhao Ren, and Harrison H. Zhou. Supplement to "Minimax estimation in sparse canonical correlation analysis". The Annals of Statistics, 2015b.
Chao Gao, Zongming Ma, and Harrison H. Zhou. Sparse CCA: Adaptive estimation and computational barriers. The Annals of Statistics, 45(5):2074-2101, 2017.
Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1233-1242. JMLR.org, 2017.
Gene H. Golub and Charles F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
Harold Hotelling. Relations between two sets of variates. In Breakthroughs in Statistics, pages 162-190. Springer, 1992.
Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, pages 295-327, 2001.
I.M. Johnstone and A.Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682-693, 2009.
Jon R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433-451, 1971.
Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772-801, 2013.
Michael L. Overton and Robert S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM J. Matrix Anal. Appl., 1991.
Cencheng Shen, Ming Sun, Minh Tang, and Carey E. Priebe. Generalized canonical correlation analysis for classification. Journal of Multivariate Analysis, 130:310-322, 2014.
Ji-guang Sun. The perturbation bounds for eigenspaces of a definite matrix-pair. Numerische Mathematik, 41(3):321-343, 1983.
Kean Ming Tan, Zhaoran Wang, Han Liu, and Tong Zhang. Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1057-1086, 2018.
Jos M.F. Ten Berge. Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42(2):267-276, 1977.
Arthur Tenenhaus and Michel Tenenhaus. Regularized generalized canonical correlation analysis. Psychometrika, 76(2):257, 2011.
Yongge Tian. Rank equalities for block matrices and their Moore-Penrose inverses. Houston J. Math., 30(4):483-510, 2004.
Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
Vincent Q. Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670-2678, 2013.
V.Q. Vu and J. Lei. Minimax sparse principal subspace estimation in high dimensions. arXiv preprint arXiv:1211.0373, 2012.
Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
Zhaoran Wang, Huanran Lu, and Han Liu. Tighten after relax: Minimax-optimal sparse PCA in polynomial time. In Advances in Neural Information Processing Systems, pages 3383-3391, 2014.
Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441-479, 1912.
D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10:515-534, 2009.
Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam. Springer, 1997.
Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899-925, 2013.
H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:265-286, 2006.