Journal of Machine Learning Research 22 (2021) 1-71. Submitted 1/20; Revised 4/21; Published 8/21.

Integrated Principal Components Analysis

Tiffany M. Tang (tiffany.tang@berkeley.edu), Department of Statistics, University of California, Berkeley, Berkeley, CA 94720, USA
Genevera I. Allen (gallen@rice.edu), Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA

Editor: Garvesh Raskutti

Abstract

Data integration, or the strategic analysis of multiple sources of data simultaneously, can often lead to discoveries that may be hidden in individualistic analyses of a single data source. We develop a new unsupervised data integration method named Integrated Principal Components Analysis (iPCA), which is a model-based generalization of PCA and serves as a practical tool to find and visualize common patterns that occur in multiple data sets. The key idea driving iPCA is the matrix-variate normal model, whose Kronecker product covariance structure captures both individual patterns within each data set and joint patterns shared by multiple data sets. Building upon this model, we develop several penalized (sparse and non-sparse) covariance estimators for iPCA, and using geodesic convexity, we prove that our non-sparse iPCA estimator converges to the global solution of a non-convex problem. We also demonstrate the practical advantages of iPCA through extensive simulations and a case study application to integrative genomics for Alzheimer's disease. In particular, we show that the joint patterns extracted via iPCA are highly predictive of a patient's cognition and Alzheimer's diagnosis.

Keywords: data integration, multi-view data, matrix-variate normal, dimension reduction, integrative genomics

©2021 Tiffany M. Tang and Genevera I. Allen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-084.html.

1. Introduction

The recent growth in both data volume and variety drives the need for principled data integration methods that can analyze multiple sources of data simultaneously. For instance, meteorologists must integrate data from satellites, ground-based sensors, and numerical models for forecasting (Ghil and Malanotte-Rizzoli, 1991). Audio and video are often combined for surveillance as well as speech recognition (Shivappa et al., 2010); and as new high-throughput technologies arise in biology, scientists are leveraging information from multiple genomic sources to better understand complex biological processes (Huang et al., 2017). By exploiting the commonalities and diversity of information from different data sets, data integration methods have the potential to provide a holistic and perhaps more realistic model of the phenomena at hand.

In this work, we focus on facilitating unsupervised learning tasks such as pattern recognition, dimension reduction, and visualization for integrated data in various applications. More specifically, we consider the multi-view data setting, where we observe multiple data sources (or matrices) with features of different types that are measured on the same set of samples.
Our goal in this setting is to develop a practical statistical data integration method that 1) leverages multiple data sources to discover and visualize dominant joint patterns among the samples that are common across all data sets; 2) generalizes the classical principal components analysis (PCA), thereby inheriting its nice properties including easily-interpretable visualizations, a unique solution, and nested, orthogonal components that can be quickly obtained all at once; and 3) has provable statistical and optimization guarantees to begin bridging the gap between the theory and practice of data integration.

To this end, we propose Integrated Principal Components Analysis (iPCA), which extends a model-based framework of classical Principal Components Analysis (PCA) to integrated data. As such, iPCA inherits the many advantages and familiar interpretations of PCA. The key idea behind iPCA is to leverage a new but natural connection between data integration and the matrix-variate normal model. This enables us to flexibly model a rich set of dependencies among features and samples simultaneously, as is often crucial in integrated data. Building upon this model, we develop novel penalized covariance estimators for iPCA, including a new geodesically convex penalty with both theoretical and practical advantages. While the main contributions of this work are methodological and applied in nature, we also begin to study the theoretical properties of our approach. In particular, we show that our non-sparse iPCA estimator converges to the global solution of a non-convex problem using geodesic convexity, and in the Appendix, we show that our sparse iPCA estimator consistently estimates the underlying joint subspace. Finally, through simulations and a careful case study application to integrative genomics for Alzheimer's disease, we demonstrate the superior empirical performance of iPCA for integrative exploratory data analysis, joint pattern recognition, and visualization. Specifically, in our case study, iPCA is able to identify joint patterns that are biologically meaningful and can separate patients by their cognitive capabilities and Alzheimer's diagnosis. This interesting finding and application to Alzheimer's disease not only demonstrates the practical utility of iPCA but also points to promising hypotheses for follow-up experiments in Alzheimer's research.

1.1 Related Work

Currently, existing data integration methods for unsupervised learning tend to fall into one of two categories, matrix factorizations or the multiblock PCA family, but each has major practical limitations that we address with iPCA. First, matrix factorizations, including coupled matrix and tensor (CMTF) factorizations (Singh and Gordon, 2008; Acar et al., 2014) and Joint and Individual Variation Explained (JIVE) (Lock et al., 2013), each solve an optimization problem to factorize the integrated data sets into a low-rank joint variation matrix, encoding the shared patterns, and a low-rank individual variation matrix, encoding the patterns specific to each data set. Matrix factorizations are well-liked due to their simplicity and computational feasibility, but a practical challenge with these methods is that they do not have a unique solution and heavily depend on the ranks of the factorized matrices. That is, the top factors from CMTF and JIVE are non-nested and can change drastically depending on the user-specified rank.
Since the optimal rank is almost always unknown and must be specified a priori, this poses significant interpretability challenges, as the practitioner could end up with many different, but equally valid, solutions corresponding to different choices of ranks and random seeds. On a related front, the generalized SVD (GSVD) (Alter et al., 2003; Ponnapalli et al., 2011) provides an exact matrix decomposition for integrated data that does not depend on the matrix rank. Nevertheless, it is limited in scope since the GSVD assumes each matrix has full row rank, excluding problems with both high- and low-dimensional data sets.

On the other hand, the multiblock PCA family, which includes methods such as Multiple Factor Analysis (MFA) (Escofier and Pages, 1994; Abdi et al., 2013) and Consensus PCA (Westerhuis et al., 1998), works to generalize PCA to the integrated data regime and does not suffer from the rank-dependence or interpretability issues of matrix factorizations. In these multiblock PCA methods, each data matrix is normalized according to a specific procedure, and then PCA is performed on the normalized concatenated data. However, these normalization schemes are often ad hoc, and it is unclear which scheme works best and in which situation. Closely related to this is Distributed PCA (Fan et al., 2017), which integrates data that are stored across multiple servers and implicitly assumes that the data are i.i.d. across the different servers. This assumption differs from our target setting, where we allow for heterogeneity among the different sources (or servers in the distributed context). To date, an unsupervised data integration method which both generalizes PCA and automatically determines the best way to normalize for the different scales and signal strengths between sources does not exist. Motivated by these limitations, we develop iPCA as a proper generalization of PCA, thus inheriting its many properties and advantages (e.g., a unique solution, easily-interpretable visualizations, and nested, orthogonal principal components) unlike the matrix factorizations; and unlike the multiblock PCA methods, iPCA automatically adjusts for the different scales and signals between data sources without the need for a specified normalization scheme. The two main building blocks of iPCA are PCA and the matrix-variate normal distribution, which we review next.

1.2 Principal Components Analysis

Given a column-centered data set X ∈ R^{n×p} with n samples and p features, recall that PCA finds orthogonal directions v_1, . . . , v_m ∈ R^p which maximize the variance with respect to the feature covariance matrix Δ ∈ S^p_{++}, where S^p_{++} is the set of p × p positive definite matrices. That is, for each j = 1, . . . , m,

v_j = argmax_{v ∈ R^p} v^T Δ v subject to v^T v = 1, v^T v_i = 0 for all i < j.  (1)

It is well-known that the PC loading v_j is the eigenvector of Δ with the jth largest eigenvalue, and its corresponding PC score is u_j := X v_j. In practice, since the population covariance Δ is typically unknown, the sample version of PCA plugs in an estimate ˆΔ := (1/n) X^T X for Δ in (1). It follows that the PC loadings are the eigenvectors of ˆΔ, and the PC scores are the scaled eigenvectors of ˆΣ := (1/p) X X^T. To later establish the link between iPCA and PCA, we point out that ˆΔ is the maximum likelihood estimator (MLE) of Δ under the multivariate normal model x_1, . . . , x_n ~iid N(0, Δ), and ˆΣ is the MLE of Σ under x_{·1}, . . . , x_{·p} ~iid N(0, Σ), where x_i is the ith row of X and x_{·j} is the jth column of X. Thus, there is a dual row/column interpretation of the PCA model.
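To make this dual row/column interpretation concrete, the following small numpy sketch (our own illustration, not from the paper) checks that the PC scores X v_j computed from the eigenvectors of ˆΔ = (1/n) X^T X coincide, up to scaling, with the eigenvectors of ˆΣ = (1/p) X X^T:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)              # column-center the data

# PC loadings: eigenvectors of Delta_hat = X^T X / n
Delta_hat = X.T @ X / n
_, V = np.linalg.eigh(Delta_hat)
V = V[:, ::-1]                      # sort columns by decreasing eigenvalue

# PC scores: project the data onto the loadings, u_j = X v_j
U_scores = X @ V

# Dual view: the scores are (scaled) eigenvectors of Sigma_hat = X X^T / p
Sigma_hat = X @ X.T / p
_, U_dual = np.linalg.eigh(Sigma_hat)
U_dual = U_dual[:, ::-1]

# Each score vector is proportional to the corresponding eigenvector of Sigma_hat
j = 0
corr = np.abs(np.corrcoef(U_scores[:, j], U_dual[:, j])[0, 1])
print(f"|correlation| between X v_{j} and eigenvector {j} of Sigma_hat: {corr:.6f}")  # ~1
```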
While this is not the only way of viewing PCA, this formulation illustrates two points which we explore further in iPCA: (1) PCA finds linear projections of the data that maximize the variance under a multivariate normal model; (2) eigenvectors correspond to the dominant (or variance-maximizing) patterns in the data.

1.3 Matrix-variate Normal Model

Laid out in Gupta and Nagar (1999) and Dawid (1981), the matrix-variate normal distribution is an extension of the multivariate normal distribution such that the matrix is the unit of study. Formally, we say X ∈ R^{n×p} follows a matrix-variate normal distribution and write X ~ N_{n,p}(M, Σ ⊗ Δ) if vec(X^T) follows a multivariate normal distribution with a Kronecker product covariance structure, vec(X^T) ~ N(vec(M^T), Σ ⊗ Δ). Here, vec(X) ∈ R^{np} is the column vector formed by stacking the columns of X below one another. We call M ∈ R^{n×p} the mean matrix, Σ ∈ S^n_{++} the row covariance matrix, and Δ ∈ S^p_{++} the column covariance matrix. Put differently, the row covariance matrix Σ encodes the dependencies between rows of X while the column covariance matrix Δ encodes the dependencies among columns, i.e., X_{i·} ~ N(M_{i·}, Σ_{ii} Δ) and X_{·j} ~ N(M_{·j}, Δ_{jj} Σ). It can also be shown that if Σ = I and M = 0, we are in the familiar multivariate normal setting, x_1, . . . , x_n ~iid N(0, Δ), and if Δ = I and M = 0, then x_{·1}, . . . , x_{·p} ~iid N(0, Σ). The matrix-variate normal model, however, is far more general than the multivariate normal. While the multivariate normal can only model relationships between elements of a single row or a single column of X, the matrix-variate normal can model relationships between elements from different rows and columns. With this level of flexibility, the matrix-variate normal has proven to be a versatile tool in various contexts such as graphical models (Yin and Li, 2012; Tsiligkaridis et al., 2013; Zhou, 2014a), spatio-temporal models (Greenewald and Hero, 2015), and transposable models (Allen and Tibshirani, 2010). Our work on iPCA is the first to consider the matrix-variate normal model in light of data integration.
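The Kronecker covariance structure is also easy to simulate from, which helps clarify what the model asserts. The sketch below (our own illustration) uses the standard construction X = M + Σ^{1/2} Z Δ^{1/2} with i.i.d. standard normal Z, and empirically checks that Cov(vec(X^T)) ≈ Σ ⊗ Δ:

```python
import numpy as np

def sample_matrix_normal(M, Sigma, Delta, rng):
    """Draw X ~ N_{n,p}(M, Sigma (x) Delta) via X = M + Sigma^{1/2} Z Delta^{1/2}."""
    A = np.linalg.cholesky(Sigma)          # Sigma = A A^T
    B = np.linalg.cholesky(Delta)          # Delta = B B^T
    Z = rng.standard_normal(M.shape)       # iid N(0, 1) entries
    return M + A @ Z @ B.T

rng = np.random.default_rng(1)
n, p = 4, 3
Sigma = np.eye(n) + 0.5 * np.ones((n, n))  # row covariance
Delta = np.diag([1.0, 2.0, 3.0])           # column covariance
M = np.zeros((n, p))

# Empirically check Cov(vec(X^T)) ~= Sigma (x) Delta
draws = np.stack([sample_matrix_normal(M, Sigma, Delta, rng).reshape(-1)
                  for _ in range(50_000)])          # reshape(-1) stacks rows, i.e. vec(X^T)
emp_cov = np.cov(draws, rowvar=False)
print(np.max(np.abs(emp_cov - np.kron(Sigma, Delta))))  # small; shrinks with more draws
```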
1.4 Outline

Building upon the matrix-variate normal model, we introduce iPCA as a proper generalization of PCA to the integrated data regime in Section 2. In Section 3, we discuss covariance estimation methods for iPCA and begin to study some of their theoretical properties. We then demonstrate the strong empirical performance of iPCA in Section 4 through simulations and a real data application to integrative genomics for Alzheimer's disease. Finally, we conclude with a discussion of iPCA in Section 5.

2. Integrated PCA

At its core, iPCA, like PCA, is an unsupervised tool for exploratory data analysis, pattern recognition, and visualization. Unlike PCA, however, iPCA aims to extract the dominant joint patterns which are common to multiple data sets, not necessarily the variance-maximizing patterns, since the latter might be specific to one data set. These joint patterns are typically of considerable interest to practitioners, as their common occurrence in multiple data sets may point to some foundational mechanism or structure. For instance, scientists may be more interested in uncovering the patterns (or clusters) of patients who have similar gene expression levels and miRNA expression levels than those patients with similar gene expression levels alone.

Figure 1: Coupled matrices X_1, X_2, X_3 with n = 200, p_1 = 300, p_2 = 500, p_3 = 400 were simulated from the iPCA model (2). Here, Σ and Δ_1, Δ_2, Δ_3 were taken to be as in the base simulation described in Section 4.1. (A) plots the top two eigenvectors of Σ. In separate PCA analyses (B-D), the individual signal in each data set masks the true joint signal, but (E) iPCA (using the multiplicative Frobenius estimator) exploits the integrated data structure and recovers the true joint signal.

Figure 1 illustrates a motivating example for when iPCA is advantageous. In the example, strong dependencies among features obscure the true joint patterns among the samples, so that the true joint signal is not the variance-maximizing direction. As a result, applying PCA separately to each of the data sets (panels B-D) fails to reveal the joint signal. iPCA can better recover the joint signal because it exploits the known integrated data structure and extracts the shared information among all three data sets simultaneously.

Generally speaking, iPCA finds these joint patterns by modeling the dependencies between and within data sets via the matrix-variate normal model. The inherent Kronecker product covariance structure enables us to decompose the total covariance of each data matrix into two components: an individual column covariance structure which is unique to each data set and a joint row covariance structure which is shared among all data sets. The joint row covariance structure is our primary interest, and maximizing this joint variation will yield the dominant patterns which are common across all data sets. In the following sections, we introduce iPCA and provide interpretations and intuition into the model.

2.1 Population Model of iPCA

Suppose we observe K coupled data matrices, X_1, . . . , X_K, of dimensions n × p_1, . . . , n × p_K, where n is the number of samples and p_k is the number of features in X_k. Throughout this paper, we let p := ∑_{k=1}^K p_k and X := [X_1, . . . , X_K]. Since iPCA is primarily interested in the data variation, let us assume that each data matrix X_k has been mean-centered so that each column of X_k has a mean of 0. Suppose also that all of the data matrices are measured on the same n samples and that the rows of the X_k are perfectly aligned (see Figure 2).

Figure 2: Integrated Data Setting for iPCA: We observe K different but coupled data matrices, each with a distinct set of features that are measured on the same set of n samples. Assume that the rows align.

Under the iPCA model, we assume that each data set X_k arises from a matrix-variate normal distribution,

X_k ~ N_{n,p_k}(0, Σ ⊗ Δ_k) independently, (k = 1, . . . , K)  (2)

where Σ is an n × n row covariance matrix that is jointly shared by all data matrices, and Δ_k is a p_k × p_k column covariance matrix that is specific to X_k. We next provide additional intuition into the model parameters Σ, Δ_1, . . . , Δ_K and unpack the iPCA modeling assumptions.

Feature Dependencies Δ_k: By properties of the matrix-variate normal, we can interpret Δ_k as describing the dependence structure among features in X_k, giving rise to feature patterns unique to X_k. As a consequence of the iPCA model in (2), this feature (or column) dependence is revealed as

[X_k]_{i·} ~ N(0, Σ_{ii} Δ_k), (k = 1, . . . , K, i = 1, . . . , n).  (3)

In other words, each sample or row in X_k has its own variance scaling factor given by Σ_{ii}, while all dependencies among the features are captured by Δ_k.
Sample Dependencies Σ: Analogously, we can interpret Σ as describing the common row dependence structure, corresponding to patterns among the samples that are shared by all K data sets. According to the iPCA model in (2), the sample (or row) dependence manifests itself as

[X_k]_{·j} ~ N(0, [Δ_k]_{jj} Σ), (k = 1, . . . , K, j = 1, . . . , p_k).  (4)

Here, each feature or column in X_k gets its own variance scaling factor [Δ_k]_{jj}, while all dependencies among the samples are given by Σ.

iPCA Through a Whitening Lens: In addition to these marginal interpretations of Σ and Δ_k, it is also important to gain intuition into how Σ and Δ_k interact within the iPCA model. The simplest way to do so is through a whitening perspective, where we note that the iPCA model in (2) can be equivalently rewritten as follows: for each k = 1, . . . , K,

Σ^{-1/2} X_k Δ_k^{-1/2} = Z_k, where [Z_k]_{ij} ~iid N(0, 1).  (5)

This is to say that after whitening each data matrix X_k of both its feature dependencies in Δ_k and the shared row dependencies in Σ, the remainder is distributed i.i.d. from a standard normal distribution. Thus, all dependencies in X_k must be captured by Σ and Δ_k. This assumption may seem strong at first glance, but it is a simple generalization of the usual assumptions in PCA. To see this, notice another equivalence relation to the iPCA model (2) based on whitening: for each k = 1, . . . , K,

Y_k := X_k Δ_k^{-1/2} ~ N_{n,p_k}(0, Σ ⊗ I_{p_k}) independently,  (6)

or equivalently, in vector form,

[X_k Δ_k^{-1/2}]_{·j} ~iid N(0, Σ) for each j = 1, . . . , p_k.  (7)

That is, in the idealistic scenario when the population covariances are known, the iPCA model simply states that each data set X_k, after being whitened of its feature dependencies Δ_k, follows a normal distribution with the common row covariance Σ. This is not a new assumption. In fact, it is the primary modeling assumption if we were to apply classical PCA to the concatenated centered and whitened data [Y_1, . . . , Y_K]. Further, like PCA, while iPCA assumes and works best with normally-distributed data, it is still practically effective in the non-normal regime. We provide empirical evidence of this robustness in Section 4.

It is important to note, however, that when Δ_k ≠ I, as in most cases with real data, the underlying iPCA model greatly differs from that of applying PCA to the concatenated un-whitened data (henceforth referred to as concatenated PCA). While the normal assumption in iPCA is often insignificant in practice, it is crucial to account for Δ_k in the integrated data model. Intuitively, the feature dependencies Δ_k in iPCA can be viewed as nuisance parameters that obscure the sample dependencies Σ, and vice versa, so ignoring the feature dependencies completely, as in concatenated PCA, can be extremely problematic. The challenge here, though, is that Δ_k and Σ are both unknown in practice and must be estimated. In the setting where we only observe one data set X_1, it is not possible to distinguish or estimate both Δ_1 and Σ, but with multiple observed data sets X_1, . . . , X_K, we can and should leverage the additional data to distinguish between the feature and sample dependencies. As we will see later, this is an upshot of iPCA: namely, iPCA models and estimates both Δ_k and Σ concurrently, exploiting the integrated structure while also accounting for their mutual nuisance effects.
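The whitening equivalence (5)-(7) can also be checked numerically: whitening a matrix drawn from model (2) by Σ^{-1/2} and Δ_k^{-1/2} leaves entries that behave like i.i.d. standard normals. The following is a small illustrative numpy check (ours, not from the paper):

```python
import numpy as np

def sym_mat_power(A, power):
    """Symmetric matrix power (e.g. -0.5 for the inverse square root) via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** power) @ Q.T

rng = np.random.default_rng(2)
n, p_k = 5, 4
Sigma = np.eye(n) + 0.3 * np.ones((n, n))       # shared row covariance
Delta_k = np.diag(np.arange(1.0, p_k + 1.0))    # feature covariance of data set k

S_half, D_half = sym_mat_power(Sigma, 0.5), sym_mat_power(Delta_k, 0.5)
S_whiten, D_whiten = sym_mat_power(Sigma, -0.5), sym_mat_power(Delta_k, -0.5)

# Simulate X_k = Sigma^{1/2} Z_k Delta_k^{1/2}, then whiten it as in (5)
whitened = np.stack([
    (S_whiten @ (S_half @ rng.standard_normal((n, p_k)) @ D_half) @ D_whiten).ravel()
    for _ in range(50_000)
])
emp_cov = np.cov(whitened, rowvar=False)
print(np.abs(emp_cov - np.eye(n * p_k)).max())  # close to 0: whitened entries ~ iid N(0, 1)
```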
Now, under these modeling assumptions from the iPCA model in (2), which we have seen to be generalizations of the classical PCA assumptions, iPCA extends the variance-maximizing ideas of PCA and achieves its objective of finding the dominant joint and individual patterns in the data by maximizing the joint row covariance Σ and the individual column covariances Δ_1, . . . , Δ_K simultaneously. Namely, for k = 1, . . . , K, iPCA solves

u_i = argmax_{u ∈ R^n} u^T Σ u subject to u^T u = 1, u^T u_l = 0 for all l < i, (i = 1, . . . , n)  (8)
v_j^k = argmax_{v ∈ R^{p_k}} v^T Δ_k v subject to v^T v = 1, v^T v_l^k = 0 for all l < j, (j = 1, . . . , p_k)  (9)

for which we know the solution to be given by the eigendecompositions of Σ and Δ_k, respectively. That is, u_i is the eigenvector of Σ with the ith largest eigenvalue, and v_j^k is the eigenvector of Δ_k with the jth largest eigenvalue. Most notably, u_1 maximizes the joint variation and is interpreted as the most dominant pattern among the samples which occurs in all K data sets. We call the columns of U := [u_1, . . . , u_n] the integrated principal component (iPC) scores and the columns of V_k := [v_1^k, . . . , v_{p_k}^k] the iPC loadings for the kth data set. Note that though the population covariances in (2) are not identifiable (e.g., Σ ⊗ Δ_k = (cΣ) ⊗ (c^{-1} Δ_k) for any c > 0), the iPC scores and loadings are identifiable since eigenvectors are scale-invariant. Since we are often most interested in the joint patterns, we primarily plot the iPC scores U to visualize the joint patterns in sample space. To visualize the individual feature patterns from the kth data set, we can also plot the iPC loadings V_k.

2.2 Sample Version of iPCA

To perform iPCA in practice, we must typically plug in estimators ˆΣ and ˆΔ_k for Σ and Δ_k, since the population covariances in (2) are almost always unknown. We summarize the sample version of iPCA as follows:

1. Model each (column-centered) data set via a matrix-variate normal model: X_k ~ N_{n,p_k}(0, Σ ⊗ Δ_k), k = 1, . . . , K.
2. Estimate the covariance matrices Σ and Δ_1, . . . , Δ_K simultaneously to obtain ˆΣ and ˆΔ_1, . . . , ˆΔ_K. Methods for covariance estimation will be discussed in Section 3.
3. Compute the eigenvectors, say ˆU = eigenvectors of ˆΣ and ˆV_k = eigenvectors of ˆΔ_k. We interpret ˆU as the dominant joint patterns in sample space and ˆV_k as the dominant patterns in feature space which are specific to X_k.
4. Visualize and explore the dominant joint patterns by plotting the iPC scores ˆU and the dominant individual patterns by plotting the iPC loadings ˆV_k.

2.2.1 Variance Explained by iPCA

After performing iPCA, we can also interpret the signal in the obtained iPCs through a notion of variance explained, analogous to that in PCA.

Definition 1 Assume that X_k has been column-centered. We define the cumulative proportion of variance explained in data set X_k by the top m iPCs to be

PVE_{k,m} := ‖(U^{(m)})^T X_k V_k^{(m)}‖_F^2 / ‖X_k‖_F^2,  (10)

where U^{(m)} = [u_1, . . . , u_m] are the top m iPC scores, and V_k^{(m)} = [v_1^k, . . . , v_m^k] are the top m iPC loadings associated with X_k.

Definition 2 The marginal proportion of variance explained in data set X_k by the mth iPC is defined as MPVE_{k,m} := PVE_{k,m} − PVE_{k,m−1}.

We verify in Appendix A that PVE_{k,m} is a proportion and monotonically increasing as m increases. Aside from being well-defined, we also show in Appendix A that (10) generalizes the cumulative proportion of variance explained in PCA and hence is a natural definition.
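Steps 2-4 of the sample procedure and the PVE of Definition 1 are straightforward to express in code once covariance estimates are in hand. The numpy sketch below is our own illustration; the naive plug-in covariances are placeholders standing in for the penalized estimators developed in Section 3:

```python
import numpy as np

def ipca_components(Sigma_hat, Delta_hats):
    """Step 3 of the sample iPCA procedure: eigenvectors of the covariance estimates."""
    def eig_desc(A):
        _, Q = np.linalg.eigh(A)
        return Q[:, ::-1]                        # columns sorted by decreasing eigenvalue
    U_hat = eig_desc(Sigma_hat)                  # joint iPC scores (sample patterns)
    V_hats = [eig_desc(D) for D in Delta_hats]   # individual iPC loadings per data set
    return U_hat, V_hats

def cumulative_pve(Xk, U_hat, Vk_hat, m):
    """Definition 1: proportion of variance in X_k explained by the top m iPCs."""
    Um, Vm = U_hat[:, :m], Vk_hat[:, :m]
    return np.linalg.norm(Um.T @ Xk @ Vm, "fro") ** 2 / np.linalg.norm(Xk, "fro") ** 2

# Toy demo with naive placeholder covariance estimates (not the penalized estimators)
rng = np.random.default_rng(3)
n, dims = 30, [8, 12]
Xs = [rng.standard_normal((n, p)) for p in dims]
Xs = [X - X.mean(axis=0) for X in Xs]                    # column-center
Sigma_hat = sum(X @ X.T for X in Xs) / sum(dims)         # naive joint row covariance
Delta_hats = [X.T @ X / n for X in Xs]                   # naive individual column covariances

U_hat, V_hats = ipca_components(Sigma_hat, Delta_hats)
for k, (X, Vk) in enumerate(zip(Xs, V_hats), start=1):
    print(f"X{k}: PVE by top 2 iPCs = {cumulative_pve(X, U_hat, Vk, 2):.3f}")
```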
Note, however, that unlike PCA, it may be that MPVE_{k,m+1} > MPVE_{k,m} in iPCA. This is because iPCA does not maximize the total variance. For example, if MPVE_{1,2} > MPVE_{1,1}, this simply means that data set X_1 contributed more variation to the joint pattern in iPC2 than in iPC1.

2.3 Connections to Existing Methods

Throughout our development of iPCA, we have established several connections between iPCA and PCA, which demonstrate that iPCA is indeed a natural extension of PCA. We also find it instructive to draw on connections between iPCA and other existing data integration methods to develop an even deeper understanding of iPCA.

2.3.1 Relationship to Multiblock PCA Family

As discussed in Abdi et al. (2013), multiblock PCA methods reduce to performing PCA on the normalized concatenated data X̃ = [X̃_1, . . . , X̃_K], where each X_k has been normalized to X̃_k according to some procedure. We will show later in Proposition 6 that performing PCA on the unnormalized concatenated data (i.e., concatenated PCA) is a special case of iPCA, where we assume Δ_k = I for each k. Proposition 6 can also be easily extended to show that multiblock PCA methods are a special case of iPCA for some fixed Δ_1, . . . , Δ_K, where the exact form of Δ_k depends on the normalization procedure. For example, since MFA normalizes X_k by dividing all of its entries by its largest singular value σ_{max,k}, MFA is a special case of iPCA where each Δ_k = σ_{max,k} I. Put differently, MFA assumes [X_k σ_{max,k}^{-1/2}]_{·j} ~iid N(0, Σ), whereas iPCA assumes [X_k Δ_k^{-1/2}]_{·j} ~iid N(0, Σ). This gives rise to another interpretation of iPCA: iPCA is a generalization and unifying framework for the entire multiblock PCA family. However, while the multiblock PCA methods assume that Δ_k takes a specific form, iPCA does not impose any restrictions on the form of Δ_k and instead freely estimates the full Δ_k matrix simultaneously with Σ. In doing so, iPCA acts as an automatic way of normalizing for the different scales and signals between data sources. Without appropriate normalization, the estimated principal components from multiblock PCA methods such as concatenated PCA will be biased, as quantified by the following proposition.

Proposition 3 Suppose that X_k ~ N_{n,p_k}(0, Σ ⊗ Δ_k) for each k = 1, . . . , K, and define X = [X_1, . . . , X_K], Δ = diag(Δ_1, . . . , Δ_K), and Z = [Z_1, . . . , Z_K], where Z_k is as defined in (5) previously. If X Δ^{-1/2} = U D V^T and X = Ũ D̃ Ṽ^T are their respective compact SVDs, then U and Ũ denote the iPCA and concatenated PCA scores, respectively, and

U − Ũ = Σ^{1/2} Z (V D^{-1} − Δ^{1/2} Ṽ D̃^{-1}).  (11)

Moreover, the iPCA and concatenated PCA scores are equal if and only if V D^{-1} − Δ^{1/2} Ṽ D̃^{-1} is in the nullspace of Z or Δ^{1/2} Ṽ = V D^{-1} D̃.

Proposition 3 implies that if the feature dependencies, captured by Δ_k, rotate the data X_k in a way that does not obscure the joint row patterns, then concatenated PCA will work adequately. However, for most situations and examples of Δ_k that we expect to find in real data, this is unlikely to occur, and the resulting eigenvectors of concatenated PCA will be severely biased. To avoid this bias, both the sample and feature dependencies must be accounted for in the estimation procedure, as done in iPCA.
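As a concrete example of a fixed-Δ_k normalization scheme, here is a short sketch (ours, with arbitrary illustrative scales) of MFA-style preprocessing as described above: divide each block by its largest singular value, then run ordinary PCA on the concatenation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, dims, scales = 40, [15, 25, 10], [1.0, 10.0, 0.1]       # blocks on very different scales
Xs = [rng.standard_normal((n, p)) * s for p, s in zip(dims, scales)]
Xs = [X - X.mean(axis=0) for X in Xs]

# MFA-style normalization: divide each block by its largest singular value,
# i.e. fix Delta_k up front rather than estimating it as iPCA does
Xs_norm = [X / np.linalg.svd(X, compute_uv=False)[0] for X in Xs]

# ...then ordinary PCA on the normalized concatenated data
X_cat = np.hstack(Xs_norm)
U, D, Vt = np.linalg.svd(X_cat, full_matrices=False)
mfa_scores = U[:, :2]                  # top two sample-pattern directions
print(mfa_scores.shape)                # (40, 2)
```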
2.3.2 Relationship to Matrix Factorizations

Coupled matrix factorizations (CMF) decompose each data set X_k ∈ R^{n×p_k} into the product of a low-rank joint factor U ∈ R^{n×m} and a low-rank individual factor V_k ∈ R^{m×p_k} so that X_k ≈ U V_k (Singh and Gordon, 2008; Acar et al., 2014). This approximate factorization is related to iPCA in that our matrix-variate normal model assumes a similar multiplicative structure. Specifically, from (5), the iPCA model can be viewed as X_k = Σ^{1/2} Z_k Δ_k^{1/2}, where Z_k is a standard normal random matrix. Moreover, an argument similar to Theorem 2 from Hastie et al. (2015) shows that one solution of the CMF optimization problem (with ℓ2 penalties) is the solution of concatenated PCA and thereby a special case of iPCA. Despite this connection, however, there is a fundamental difference between CMF and iPCA. On the one hand, CMF assumes X_k can be approximated by a low-rank matrix, and the estimation of the CMF factors actively depends on the pre-specified rank m. On the other hand, the rank of X_k plays absolutely no role in the iPCA assumptions or the estimation step of iPCA. Consequently, the joint and individual CMF factors can change drastically depending on the pre-specified rank, unlike in iPCA. CMF also does not have a unique solution nor does it enforce orthogonal components, whereas iPCA gives unique, nested, orthogonal components that can be interpreted in the same way as in PCA.

In contrast to the multiplicative models of CMF and iPCA, JIVE, which has been commonly used in integrative genomics, assumes an additive model and decomposes coupled data into the sum of a low-rank joint variation matrix, a low-rank individual variation matrix, and an error matrix (Lock et al., 2013). Additive and multiplicative models, being quite different, are each advantageous in different situations, but as with CMF, the estimation of JIVE depends on the pre-specified ranks of its factors and results in non-nested, rank-dependent joint and individual components.

3. Covariance Estimators for iPCA

We next return to address the covariance estimation step when fitting the iPCA model to data. In Section 3.1, we consider the traditional maximum likelihood approach but find that it suffers from substantial limitations in the integrated data setting. These limitations ultimately drive the need for new estimators, which we develop in Section 3.2.

3.1 Unpenalized Maximum Likelihood Estimators

Guided by the formulation of PCA in Section 1.2, we instinctively try to estimate the iPCA population covariances Σ and Δ_1, . . . , Δ_K via maximum likelihood estimation. Under the iPCA population model (2), the log-likelihood function reduces to

ℓ(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = p log|Σ^{-1}| + n ∑_{k=1}^K log|Δ_k^{-1}| − ∑_{k=1}^K tr(Σ^{-1} X_k Δ_k^{-1} X_k^T),  (12)

so by taking partial derivatives of (12) with respect to each parameter, we obtain

Lemma 4 The unpenalized MLEs of Σ and Δ_1, . . . , Δ_K satisfy

ˆΣ = (1/p) ∑_{k=1}^K X_k ˆΔ_k^{-1} X_k^T,  (13)
ˆΔ_k = (1/n) X_k^T ˆΣ^{-1} X_k, k = 1, . . . , K.  (14)
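The fixed-point equations (13)-(14) suggest a simple alternating scheme. The sketch below (ours, for illustration) iterates them directly; a small ridge term is added because, as the discussion that follows shows, the unpenalized MLE often fails to exist, and the raw updates can become singular.

```python
import numpy as np

def unpenalized_flipflop(Xs, n_iter=10, ridge=1e-8):
    """Alternate the fixed-point equations (13)-(14) of Lemma 4 (illustration only).

    Per Theorem 5, the unpenalized MLE does not exist for column-centered or n > p data,
    so a small ridge keeps the intermediate matrices invertible."""
    n = Xs[0].shape[0]
    p = sum(X.shape[1] for X in Xs)
    Delta_invs = [np.eye(X.shape[1]) for X in Xs]
    for _ in range(n_iter):
        Sigma = sum(X @ Di @ X.T for X, Di in zip(Xs, Delta_invs)) / p      # eq (13)
        Sigma_inv = np.linalg.inv(Sigma + ridge * np.eye(n))
        Deltas = [X.T @ Sigma_inv @ X / n for X in Xs]                      # eq (14)
        Delta_invs = [np.linalg.inv(D + ridge * np.eye(D.shape[0])) for D in Deltas]
    return Sigma, Deltas

rng = np.random.default_rng(5)
Xs = [rng.standard_normal((50, 8)), rng.standard_normal((50, 12))]
Sigma_hat, Delta_hats = unpenalized_flipflop(Xs)
print(Sigma_hat.shape, [D.shape for D in Delta_hats])
```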
However, with only one matrix observation per matrix-variate normal model in the iPCA context, existence of the MLE is not guaranteed. In fact, the following theorem essentially implies that the MLE does not exist for all practical purposes.

Theorem 5 (i) Suppose that X_k has been column-centered so that each column has a mean of 0 for each k = 1, . . . , K. Then the unpenalized MLEs for Σ and Δ_1, . . . , Δ_K are not positive definite and hence do not exist. (ii) Suppose that X_k has not been column-centered but that rank(X) = n and rank(X_k) = p_k for k = 1, . . . , K. (a) If n = p_k for some k = 1, . . . , K, then the unpenalized log-likelihood function ℓ(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) is unbounded. (b) If the unpenalized log-likelihood function ℓ(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) is bounded, then the unpenalized MLEs for Σ, Δ_1, . . . , Δ_K exist.

The proof of Theorem 5 also shows that if the unpenalized MLEs for Σ and Δ_1, . . . , Δ_K exist, then p_k ≠ n ≤ p for each k = 1, . . . , K. Thus, in summary, if p_k = n for some k, n > p, or the data matrices have been column-centered to have mean 0, then the unpenalized MLEs do not exist. These severe restrictions motivate new covariance estimators. For example, one alternative but naive approach is to estimate Σ and Δ_k by setting their counterparts to I.

Proposition 6 (i) The MLE for Δ_k, assuming that Σ = I, is

ˆΔ_k = (1/n) X_k^T X_k.  (15)

(ii) The MLE for Σ, assuming Δ_k = I for all k = 1, . . . , K, is

ˆΣ = (1/p) ∑_{k=1}^K X_k X_k^T = (1/p) X X^T.  (16)

This approach for estimating Σ is the familiar MLE for the concatenated data X, and hence, performing concatenated PCA is equivalent to a special case of iPCA where we set Δ_k = I for each k. While this formalizes the connection between PCA and iPCA, we will see in Section 4 that concatenated PCA performs poorly when the data sets are of different scales or when the feature dependencies are stronger than the sample dependencies. In the next section, we discuss more effective methods for estimating the iPCA covariances.

3.2 Penalized Maximum Likelihood Estimators

Given that the unpenalized MLEs do not exist for a large class of problems, one possible solution is to develop penalized MLEs, which solve

(ˆΣ^{-1}, ˆΔ_1^{-1}, . . . , ˆΔ_K^{-1}) = argmax_{Σ^{-1} ≻ 0, Δ_1^{-1} ≻ 0, . . . , Δ_K^{-1} ≻ 0} { p log|Σ^{-1}| + n ∑_{k=1}^K log|Δ_k^{-1}| − ∑_{k=1}^K tr(Σ^{-1} X_k Δ_k^{-1} X_k^T) − P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) }.  (17)

Similar to previous work on the penalized matrix-variate normal log-likelihood (Yin and Li, 2012; Allen and Tibshirani, 2010), we can apply an additive-type penalty and define the additive Lq iPCA penalty to be

P_q(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = λ_Σ ‖Σ^{-1}‖_q + ∑_{k=1}^K λ_k ‖Δ_k^{-1}‖_q,

where λ_Σ, λ_1, . . . , λ_K > 0 are tuning parameters. Though there are many potential choices of norm-penalties here, one natural choice is the additive Frobenius penalty, ‖·‖_q = ‖·‖_F^2, as it is a proper generalization of PCA. That is, performing iPCA with the additive Frobenius penalty is equivalent to PCA in the K = 1 case (see Theorem 1 in Allen and Tibshirani 2010). When K > 1, the additive Frobenius penalty induces a smoothness over the eigenvalues of the covariance matrices and returns a dense full-rank covariance estimator. In the sparse covariance setting, we can instead induce sparsity through the additive L1 penalty ‖·‖_q = ‖·‖_{1,off}, where ‖A‖_{1,off} = ∑_{i≠j} |A_{ij}|. Applying the additive L1 penalty to the inverse covariance matrix is common practice, but one can alternatively apply the additive L1 penalty to the inverse correlation matrix. This latter approach was adopted in Zhou (2014a) and Rothman et al. (2008). Given the plethora and well-studied nature of sparse graphical model estimation, we leverage existing ideas and tools to show that this approach, namely, the additive L1 penalty applied to the inverse correlation matrix, consistently estimates the true underlying joint subspace under certain conditions, at a rate depending on ∑_{k=1}^K p_k, max{s_Σ, 1}, and log(max{n, p_k}), where s_Σ is the number of non-zero off-diagonal entries in Σ^{-1}. However, due to the superior empirical performance of the iPCA Frobenius estimators over the sparse iPCA estimators (see Section 4), we leave the precise theorem statement and discussion of the L1 subspace consistency result to Appendix F.
Despite the popularity of additive-type penalties in the literature, an overarching downside of these existing penalties in the integrated data regime is that solving (17) with additive penalties is a non-convex problem, for which we can only guarantee convergence to a local solution. Nonetheless, though (17) is non-convex in Euclidean space, Wiesel (2012) showed that the matrix-variate normal log-likelihood is geodesically convex (g-convex) with respect to the manifold of positive definite matrices. G-convexity is a generalized notion of convexity on a Riemannian manifold, and like convexity, all local minima of g-convex functions are globally optimal. Exploiting this idea of g-convexity, we propose a novel type of penalty, named the multiplicative Frobenius iPCA penalty,

P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = ∑_{k=1}^K λ_k ‖Σ^{-1} ⊗ Δ_k^{-1}‖_F^2,

which we will show to be g-convex in Theorem 8. Note that since ‖A ⊗ B‖_F^2 = ‖A‖_F^2 ‖B‖_F^2, the multiplicative penalty can be rewritten as the product ‖Σ^{-1}‖_F^2 ∑_{k=1}^K λ_k ‖Δ_k^{-1}‖_F^2, giving rise to its name. Like the additive Frobenius iPCA estimator, the multiplicative Frobenius iPCA estimator is a shrinkage technique that returns a dense covariance estimate with smoothed eigenvalues when K > 1, and when K = 1, it is equivalent to PCA (see Appendix E).

Having introduced several different types of penalized iPCA covariance estimators, namely, the additive Frobenius estimator, multiplicative Frobenius estimator, additive L1 covariance estimator, and additive L1 correlation estimator, the question for practitioners becomes how to compute these estimators, how to select the penalty parameters, and which estimator to use in which situation. We discuss each in turn next.

3.2.1 Flip-flop Algorithms for iPCA Estimators

For each of the aforementioned penalties, we can compute the corresponding penalized MLEs via Flip-Flop algorithms (also known as block coordinate descent algorithms), which iteratively optimize over each of the parameters, one at a time, while keeping all other parameters fixed. These algorithms are derived fully in Appendix C.2, but in general, for the Frobenius penalties, each Flip-Flop update has a closed-form solution determined by a full eigendecomposition. For the L1 penalties (also known as the Kronecker Graphical Lasso, Tsiligkaridis et al., 2013), each update can be solved by the graphical lasso (Hsieh et al., 2011). We provide the multiplicative Frobenius Flip-Flop algorithm here in Algorithm 1, and as the other algorithms take similar forms, we leave them to Appendix C.2.

Algorithm 1 Flip-Flop Algorithm for Multiplicative Frobenius iPCA Estimators
1: Center the columns of X_1, . . . , X_K, and initialize ˆΣ, ˆΔ_1, . . . , ˆΔ_K to be positive definite.
2: while not converged do
3:   Take eigendecomposition: ∑_{k=1}^K X_k ˆΔ_k^{-1} X_k^T = U Γ U^T
4:   Regularize eigenvalues: Φ_ii = (1/(2p)) ( Γ_ii + √( Γ_ii^2 + 8p ∑_{k=1}^K λ_k ‖ˆΔ_k^{-1}‖_F^2 ) )
5:   Update ˆΣ^{-1} = U Φ^{-1} U^T
6:   for k = 1, . . . , K do
7:     Take eigendecomposition: X_k^T ˆΣ^{-1} X_k = V Φ V^T
8:     Regularize eigenvalues: Γ_ii = (1/(2n)) ( Φ_ii + √( Φ_ii^2 + 8n λ_k ‖ˆΣ^{-1}‖_F^2 ) )
9:     Update ˆΔ_k^{-1} = V Γ^{-1} V^T
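A compact numpy sketch of Algorithm 1 follows. It is our own illustration of the flip-flop structure, assuming the eigenvalue updates as written in the box above, and is not the authors' reference implementation; the convergence check here is a plain relative change in ˆΣ^{-1} (the paper additionally scales this by the square root of the mean penalty parameter, see Section 4).

```python
import numpy as np

def ipca_mult_frobenius(Xs, lams, max_iter=100, tol=1e-6):
    """Flip-flop updates for the multiplicative Frobenius iPCA estimator (Algorithm 1 sketch)."""
    Xs = [X - X.mean(axis=0) for X in Xs]                  # step 1: column-center
    n = Xs[0].shape[0]
    p = sum(X.shape[1] for X in Xs)
    Sigma_inv = np.eye(n)
    Delta_invs = [np.eye(X.shape[1]) for X in Xs]

    for _ in range(max_iter):
        Sigma_inv_old = Sigma_inv

        # Steps 3-5: update Sigma^{-1}
        S = sum(X @ Di @ X.T for X, Di in zip(Xs, Delta_invs))
        gam, U = np.linalg.eigh(S)
        c = sum(lam * np.linalg.norm(Di, "fro") ** 2 for lam, Di in zip(lams, Delta_invs))
        phi = (gam + np.sqrt(gam ** 2 + 8 * p * c)) / (2 * p)     # regularized eigenvalues of Sigma_hat
        Sigma_inv = U @ np.diag(1.0 / phi) @ U.T

        # Steps 6-9: update each Delta_k^{-1}
        for k, X in enumerate(Xs):
            phi_k, V = np.linalg.eigh(X.T @ Sigma_inv @ X)
            ck = lams[k] * np.linalg.norm(Sigma_inv, "fro") ** 2
            gam_k = (phi_k + np.sqrt(phi_k ** 2 + 8 * n * ck)) / (2 * n)  # eigenvalues of Delta_hat_k
            Delta_invs[k] = V @ np.diag(1.0 / gam_k) @ V.T

        rel = (np.linalg.norm(Sigma_inv - Sigma_inv_old, "fro")
               / np.linalg.norm(Sigma_inv_old, "fro"))
        if rel < tol:
            break

    Sigma_hat = U @ np.diag(phi) @ U.T
    return Sigma_hat, [np.linalg.inv(Di) for Di in Delta_invs]

# Toy usage: the iPC scores are the eigenvectors of the returned Sigma_hat
rng = np.random.default_rng(6)
Xs = [rng.standard_normal((30, 40)), rng.standard_normal((30, 25))]
Sigma_hat, Delta_hats = ipca_mult_frobenius(Xs, lams=[1.0, 1.0])
ipc_scores = np.linalg.eigh(Sigma_hat)[1][:, ::-1]
print(ipc_scores.shape)   # (30, 30)
```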
The following theorem guarantees numerical convergence of the Flip-Flop algorithms to a local solution for the multiplicative Frobenius, additive Frobenius, and additive L1 penalties.

Theorem 7 Suppose that the objective function in (17) is bounded below. Suppose also that either (i) P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) is a differentiable convex function with respect to each coordinate or (ii) P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = P_0(Σ^{-1}) + ∑_{k=1}^K P_k(Δ_k^{-1}), where P_k is a (possibly non-differentiable) convex function for each k = 0, 1, . . . , K. If either (i) or (ii) holds, then the Flip-Flop algorithm corresponding to (17) converges to a stationary point of the objective.

However, building upon Wiesel (2012) and the notion of g-convexity, we can prove a far stronger result for the multiplicative Frobenius iPCA estimator.

Theorem 8 The multiplicative Frobenius iPCA estimator is jointly geodesically convex in Σ^{-1} and Δ_1^{-1}, . . . , Δ_K^{-1}. Because of this, the Flip-Flop algorithm for the multiplicative Frobenius iPCA estimator given in Algorithm 1 converges to the global solution.

There are currently only a handful of non-convex problems for which there exists an achievable global solution, so this guarantee that the multiplicative Frobenius iPCA estimator always reaches a global solution is both extremely rare and highly desirable. In Section 4, we will also see that the multiplicative Frobenius iPCA estimator undoubtedly gives the best empirical performance, indicating that in addition to its optimization-theoretic advantages from global convergence, there are significant practical advantages associated with the g-convex penalty. A self-contained review of g-convexity and the proof of Theorem 8 are given in Appendix D.

3.2.2 Tuning Penalty Parameters

To select penalty parameters for (17) in a data-driven manner, we propose a cross-validation-like procedure. The following procedure can also be used to perform iPCA in missing data scenarios. Let Λ denote the space of penalty parameters, and let λ := (λ_Σ, λ_1, . . . , λ_K) be a specific choice of penalty parameters in Λ. The idea is to first randomly leave out scattered elements from each X_k. Then, for each λ ∈ Λ, impute the missing elements via an Expectation-Maximization-like algorithm, similar to Allen and Tibshirani (2010). Finally, select the λ which minimizes the error between the imputed values and the observed values. Searching over all combinations of penalty parameters in Λ, however, can be computationally intractable if K or the data sets themselves are large. In these cases, we can select the penalty parameters in a greedy manner: first fix λ_1, . . . , λ_K and optimize over λ_Σ, then fix λ_Σ, λ_2, . . . , λ_K and optimize λ_1, and so forth, stopping after optimizing λ_K. Note also that it can be substantially easier and faster to tune the multiplicative Frobenius penalty, which has K penalty parameters, compared to the additive iPCA penalties with K + 1 parameters. Because the choice of penalty parameter can significantly impact the empirical performance of iPCA, having one less parameter to tune is an extremely important practical advantage, and we attribute part of the strong empirical performance of the multiplicative Frobenius iPCA estimator, displayed in Section 4, to this advantage. Details, technical derivations, and numerical results regarding our imputation method and penalty parameter selection are provided in Appendix G.
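The greedy search itself is simple to sketch, independent of how the validation score is computed. In the illustration below (ours), `greedy_tune`, `score_fn`, and `grids` are hypothetical names; `score_fn` stands in for the EM-style imputation error described above, whose details are in Appendix G.

```python
import numpy as np

def greedy_tune(score_fn, grids, init):
    """Greedy coordinate search over penalty parameters.

    score_fn(lams) returns a validation score to minimize (e.g. the error between
    imputed and held-out entries); grids maps each parameter name to candidate values."""
    lams = dict(init)
    for name in grids:                        # optimize one parameter at a time
        best_val, best_score = lams[name], np.inf
        for val in grids[name]:
            s = score_fn({**lams, name: val})
            if s < best_score:
                best_val, best_score = val, s
        lams[name] = best_val                 # fix it and move to the next parameter
    return lams

# Toy usage with a synthetic score standing in for the imputation error
target = {"lam1": 0.1, "lam2": 1.0}
score = lambda lams: sum((np.log10(lams[k]) - np.log10(target[k])) ** 2 for k in lams)
grid = np.logspace(-3, 3, 7)
print(greedy_tune(score, {"lam1": grid, "lam2": grid}, {"lam1": 1.0, "lam2": 1.0}))
```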
3.3 Choosing the Type of Penalized iPCA Estimator

As a result of its global convergence guarantee and the reduced complexity in tuning fewer penalty parameters, we strongly recommend using the multiplicative Frobenius estimator in practice. Though we have yet to prove statistical guarantees for the multiplicative Frobenius estimator, its strong empirical performance, seen next in Section 4, firmly supports this recommendation. Even in the sparse setting (see Figure 13), the multiplicative Frobenius estimator performs only slightly worse than the additive L1 iPCA estimators, for which we have proved subspace consistency guarantees (see Appendix F). This empirically demonstrates the robustness and applicability of the multiplicative Frobenius estimator to a diverse array of problems.

4. Empirical Results

In the following simulations and case study, we evaluate iPCA against individual PCAs on each of the data sets X_k, concatenated PCA, distributed PCA, JIVE, and MFA. Note that many data integration methods from the multiblock PCA family are known to perform similarly to MFA (Abdi et al., 2013), so we only include MFA to minimize redundancy. We also omit CMF, as it performs similarly to concatenated PCA, and the GSVD, since it is not applicable for integrated data with both low-dimensional and high-dimensional data sets. Our focus here will be on the non-sparse setting, and we leave the sparse simulations to Appendix H. Within the class of iPCA estimators, we thus concentrate our attention on the additive and multiplicative Frobenius iPCA estimators in these dense simulations, but to also represent the sparse estimators, we include the most commonly used sparse estimator, the additive L1 penalty (‖·‖_{1,off}) applied to the inverse covariance matrices. The stopping rule in the Flip-Flop algorithms for the additive and multiplicative Frobenius iPCA estimators is λ̄^{1/2} ‖ˆΣ_t^{-1} − ˆΣ_{t−1}^{-1}‖_F / ‖ˆΣ_{t−1}^{-1}‖_F < 10^{-6}, where λ̄ denotes the mean of the penalty parameters and ˆΣ_t denotes the estimate of Σ in the tth iteration. Due to computational constraints, we stop the L1 Flip-Flop algorithm after one iteration, and we select the iPCA penalty parameters in a greedy manner, as discussed in Section 3.2.2.

4.1 Simulations

The base simulation is set up as follows: three coupled data matrices, X_1, X_2, X_3, with n = 150, p_1 = 300, p_2 = 500, p_3 = 400, were simulated according to the iPCA Kronecker covariance model (2). Here, Σ is taken to be a full-rank spiked covariance matrix, where the top two eigenvalues are simulated to be much larger than the rest. These top two factors in Σ form the three clusters shown in Figure 1A. Δ_1 is an autoregressive Toeplitz matrix with entry (i, j) given by 0.9^{|i−j|}; Δ_2 is the observed covariance matrix of miRNA data from TCGA ovarian cancer (The Cancer Genome Atlas Research Network, 2011); and Δ_3 is a block-diagonal matrix with five equally-sized blocks. We also ensured that the largest eigenvalue of each Δ_k was larger than that of Σ so that the joint patterns are intentionally obscured by individualistic patterns. From this base simulation, we systematically varied the parameters (number of samples, number of features, and strength of the joint signal in Σ, i.e., ‖Σ‖_2) one at a time while keeping everything else constant.

We evaluate the performance of the various methods using the subspace recovery error: if the true underlying subspace of Σ was simulated to be of dimension d with the orthogonal eigenbasis u_1, . . . , u_d, and the top d eigenvectors of the estimate ˆΣ are given by ˆu_1, . . . , ˆu_d, then the subspace recovery error is defined to be (1/d) ‖ˆU ˆU^T − U U^T‖_F^2, where U = [u_1, . . . , u_d] and ˆU = [ˆu_1, . . . , ˆu_d]. This metric simply quantifies the distance between the true subspace of Σ and the estimated subspace from ˆΣ. We note that a lower subspace recovery error implies higher estimation accuracy, and in the base simulation, the true subspace of Σ is given by the number of spikes, so that d = 2.
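A small numpy helper for this metric (an illustrative sketch, not from the paper):

```python
import numpy as np

def subspace_recovery_error(U_true, U_hat):
    """(1/d) * ||U_hat U_hat^T - U U^T||_F^2 for two orthonormal bases with d columns."""
    d = U_true.shape[1]
    P_true = U_true @ U_true.T
    P_hat = U_hat @ U_hat.T
    return np.linalg.norm(P_hat - P_true, "fro") ** 2 / d

# Quick check: identical subspaces give 0; completely orthogonal subspaces give 2
Q, _ = np.linalg.qr(np.random.default_rng(8).standard_normal((10, 4)))
U, U_perp = Q[:, :2], Q[:, 2:4]
print(subspace_recovery_error(U, U), subspace_recovery_error(U, U_perp))  # 0.0 2.0
```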
Figure 3: Subspace recovery as simulation parameters vary from the base simulation (panels: sample size n; number of features added to each X_k; signal in Σ): (A) As the number of samples increases, it becomes more difficult to estimate the joint row subspace; (B) As the number of features increases, it becomes slightly easier to estimate the joint row subspace; (C) Performance drastically improves as the strength of the joint signal in Σ (i.e., the top singular value of Σ) increases. Moreover, in almost every scenario, the multiplicative and additive Frobenius iPCA estimators outperform their competitors.

Although there are other metrics, such as canonical angles, which also quantify the distance between subspaces, these metrics behave similarly to the subspace recovery error and are omitted for brevity. The average subspace recovery errors, measured over 50 trials, from the various simulations are shown in Figure 3. We clearly see that the additive and multiplicative Frobenius iPCA estimators consistently outperformed all other methods. Since Σ was not simulated to be sparse, it is no surprise that the Frobenius iPCA estimators outperformed the L1 iPCA estimator. It is also expected that distributed PCA performs poorly since the Δ_k's are not all identical, violating its basic assumption. What may be surprising is that doing PCA on X_2 performed better than its competitors, excluding the Frobenius iPCA estimators. We speculate that this is because the observed covariance Δ_2 happened to be a very low-rank matrix, and because Δ_2 was low-rank, the signal from Σ most likely dominated much of the variation in the second PC. Looking ahead at Figure 4A, as Laplacian error was added to the simulated data, PCA on X_2 failed to recover the true signal since the Laplacian error increasingly contributed to the variation in the data. We also point out that MFA always yielded a lower error than concatenated PCA in Figure 3, indicating that there is value in normalizing data sets to be on comparable scales. On the other hand, we must be wary of this normalization process. In the case of these simulations, PCA on X_2 outperformed MFA, illustrating that normalization can sometimes remove important information.

Figure 4: Robustness Simulations (panels: SD of added Laplacian error; SD of noise in the JIVE model): (A) As Laplacian error is increasingly added to the simulated data sets, the Frobenius iPCA estimators appear to be robust to the departures from Gaussianity; (B) As the amount of noise in the JIVE model increases, iPCA seems to be comparable to existing methods (including JIVE with known and estimated ranks), illustrating its relative robustness to departures from the Kronecker product model.
To verify that these simulation results are not heavily dependent on the base simulation setup of Σ and Δ_k, we also ran simulations varying the dimension d of the true joint subspace U and the number of data sets K. We provide these results in Appendix H.

Beyond simulating from the iPCA model (2), we check for robustness to departures from the two main iPCA assumptions: normality and separability (i.e., the Kronecker covariance structure). To deviate from normality, we add Laplacian noise to the base simulation setup, and to depart from the Kronecker covariance structure, we simulate data from the JIVE model. The results are summarized in Figure 4, and we leave the simulation details as well as other simulations that demonstrate robustness to Appendix H. As seen in Figure 4A, the Frobenius iPCA estimators appear to be relatively robust to non-Gaussian noise and outperformed their competitors even as the standard deviation of the added Laplacian errors increased. From the simulations under the JIVE model in Figure 4B, we see that as the amount of noise increases, JIVE given the known ranks yields the lowest error, as expected. But similar to how JIVE was comparable to competing methods under the iPCA model (Figure 3), the iPCA estimators are comparable to competing methods for high noise levels under the JIVE model. At low noise levels, however, the Frobenius iPCA estimators are surprisingly able to recover the true joint subspace better than JIVE with the known ranks. Further investigation into this peculiar behavior reveals that the Frobenius iPCA estimators give lower subspace recovery errors but much larger approximation errors ‖Σ − ˆΣ‖_F^2, compared to JIVE with the known ranks. This brings up a subtle but important distinction: iPCA revolves around estimating the underlying subspace, determined by eigenvectors, while JIVE focuses on minimizing the matrix approximation error. These are inherently different objectives, and it is common for iPCA to estimate the eigenvectors well at the cost of a poor matrix approximation due to the regularized eigenvalues.

4.2 Case Study: Integrative Genomics of Alzheimer's Disease

A key motivating example for our research is in integrative genomics, where the goal is to combine multiple genomic sources to gain insights into the genetic basis of diseases. In particular, apart from the APOE gene, little is known about the genomic basis of Alzheimer's disease (AD) and the genes which contribute to dominant expression patterns in AD. In this case study, we delve into the integrative genomics of AD and jointly analyze miRNA expression, gene expression via RNASeq, and DNA methylation data obtained from the Religious Orders Study Memory and Aging Project (ROSMAP) Study (Mostafavi et al., 2018). The ROSMAP study is a longitudinal clinical-pathological cohort study of aging and AD, consisting of 507 subjects, 309 miRNAs, 900 genes, and 1250 CpG (methylation) sites after preprocessing (which we detail in Appendix I). This data is uniquely positioned for the study of AD since its genomics data is collected from post-mortem brain tissue from the dorsolateral prefrontal cortex, an area known to play a critical role in cognitive functions. For our analysis, we consider two clinical outcomes: clinician's diagnosis and global cognition score.
The clinician's diagnosis is the last clinical evaluation prior to the patient's death and is a categorical variable with three levels: Alzheimer's disease (AD), mild cognitive impairment (MCI), and no cognitive impairment (NCI). Global cognition score, a continuous variable, is the average of 19 cognitive tests and is the last cognitive testing score prior to death. While the clinician's diagnosis is sometimes subjective, the global cognition score is viewed as a more objective measure of cognition. Our goal is to find common patterns among patients, which occur in all three data sets, and to understand whether these joint patterns are predictive of AD, as measured by the clinician's diagnosis and global cognition score.

To this end, we run iPCA and other existing methods to extract dominant patterns from the ROSMAP data. Figure 5 shows the PC plots obtained from the various methods; each point represents a subject and is colored by either clinician's diagnosis or cognition score. Since visuals are a subjective measure of performance, we quantify performance by taking the top PCs and using them as predictors in a random forest to predict the outcome of interest. The random forest test errors, averaged over 100 random training/test splits, are shown in Figure 6. Here, we see that the joint patterns extracted from iPCA using the multiplicative Frobenius penalty were the most predictive of the clinician's diagnosis of AD and the patient's global cognition score. Moreover, most of the predictive power can be attributed to the first three iPCs, which we visualize in Figure 7A-B. We also note that the top iPCs from iPCA with the multiplicative Frobenius penalty were more predictive than combining the PCs from the three individual PCAs performed on each data set. This showcases empirically that a joint analysis of the integrated data sets can be advantageous over three disparate analyses.

Beyond the high predictive power of iPCA with the multiplicative Frobenius penalty, it is perhaps more important for scientists to be able to interpret the iPCA results. One way is through the proportion of variance explained by the joint iPCs, as defined in Section 2.2.1. Figure 7C shows the marginal proportions for the top 5 iPCs. It reveals that the RNASeq data set contributed the most variation in the joint patterns found by iPC1 and iPC2, and the miRNA data set contributed the most variation in iPC3. More interestingly, even though iPC2 and iPC3 have relatively small variances, iPCA is able to pick out these weak joint signals, which we found to be predictive of AD. This reiterates that the most variable patterns in the data are not necessarily the most relevant patterns for the question at hand. In this case, our goal was to find joint patterns which occur in all three data sets, and since the joint signal is not the most dominant source of variation in each data set, no individualistic PCA analysis would have identified the joint signal found by iPCA.
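Returning to the prediction experiment summarized in Figure 6, the sketch below illustrates the same idea with scikit-learn's RandomForestClassifier rather than the R randomForest defaults used in the paper. The data here are synthetic stand-ins (the real inputs would be the ROSMAP iPC scores and the clinician's diagnosis), but the split sizes mirror those reported above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
scores = rng.standard_normal((507, 10))    # stand-in for the top-10 (i)PC scores of 507 subjects
labels = rng.integers(0, 3, size=507)      # stand-in for diagnosis: AD / MCI / NCI coded 0-2

errors = []
for seed in range(100):                    # 100 random training (n=375) / test (n=132) splits
    tr_x, te_x, tr_y, te_y = train_test_split(
        scores, labels, test_size=132, random_state=seed)
    rf = RandomForestClassifier(random_state=seed).fit(tr_x, tr_y)
    errors.append(1.0 - rf.score(te_x, te_y))
print(f"mean test classification error: {np.mean(errors):.3f}")
```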
We conclude our ROSMAP analysis by extracting the top genetic features which are associated with the joint patterns shown in Figure 7. Since iPCA provides an estimate of both Σ and Δ_k, we can select the top features by applying sparse PCA to each ˆΔ_k obtained from iPCA. Table 1 lists the top miRNAs, genes, and genes affiliated with the selected CpG sites obtained from sparse PCA. Here, we used the sparseEigen R package (Benidis et al., 2016) and chose the tuning parameter such that there were only 12 non-zero features.

Figure 5: We plot the first two (integrated) principal components from the various methods applied to the ROSMAP data (including iPCA with the multiplicative Frobenius, additive Frobenius, and L1 estimators; PCA on the miRNA, RNASeq, and methylation data sets; and concatenated PCA). Each point represents a subject, colored by the clinician's diagnosis in panels A-I and by global cognition score in panels J-R.

Figure 6: We took the top PCs and used them as predictor variables in a random forest to predict (A) the clinician's diagnosis and (B) the global cognition score. For the random forest, we split the ROSMAP data into a training (n = 375) and test set (n = 132) and used the default random forest settings in R. The average test errors from the random forests over 100 random splits are shown as the number of PCs used in the random forests increases.

Figure 7: (A)-(B) We show the top 3 iPCs obtained from iPCA with the multiplicative Frobenius estimator; the points are colored by clinician's diagnosis and global cognition score in (A) and (B), respectively. (C) We plot the marginal proportion of variance explained by the top iPCs in each data set in the ROSMAP analysis (using the multiplicative Frobenius iPCA estimator).

Because the RNASeq data contributed most of the variation in iPC1, we did a literature search on the top five genes extracted by sparse PCA on ˆΔ_2. Out of the top five genes, we found evidence in the biological literature which links four of the five genes (the exception being SVOP) to AD (Carrette et al., 2003; Li et al., 2017; Han et al., 2014; Espuny-Camacho et al., 2017). While this is only a preliminary investigation into the importance of the genetic features obtained from iPCA, it is encouraging evidence and may potentially hint at candidate genes for future research.

|    | miRNA           | RNASeq    | Methylation              |
|----|-----------------|-----------|--------------------------|
| 1  | miR-216a        | VGF       | TMCO6                    |
| 2  | miR-127-3p      | SVOP      | PHF3                     |
| 3  | miR-124         | PCDHGC5   | BRUNOL4                  |
| 4  | miR-30c         | ADCYAP1   | OSCP1                    |
| 5  | miR-143         | LINC01007 | GRIN2B                   |
| 6  | miR-27a         | FRMPD2L1  | CASP9                    |
| 7  | miR-603         | SLC30A3   | ZFP91; LPXN; ZFP91-CNTF  |
| 8  | miR-423-3p      | NCALD     | CNP                      |
| 9  | miR-204         | S100A4    | YWHAE                    |
| 10 | miR-128         | AZGP1     | C11orf73                 |
| 11 | miR-193a-3p     | PAK1      | TMED10                   |
| 12 | ebv-miR-BART14  | MAL2      | RELL1                    |

Table 1: Top genetic features obtained by applying sparse PCA to each ˆΔ_k in the ROSMAP analysis (using the multiplicative Frobenius iPCA estimator).
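The feature-selection step above applies sparse PCA to each ˆΔ_k via the sparseEigen R package. As a rough stand-in, the following sketch (ours) uses a simple truncated power iteration, a different but standard sparse-eigenvector heuristic, to extract a leading eigenvector with a fixed number of non-zero entries from a covariance estimate:

```python
import numpy as np

def truncated_power_iteration(S, n_nonzero=12, n_iter=200, seed=0):
    """Approximate leading sparse eigenvector of a symmetric PSD matrix S,
    keeping only n_nonzero entries (a simple stand-in for sparse PCA)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = S @ v                                   # power step
        keep = np.argsort(np.abs(v))[-n_nonzero:]   # truncate to the largest entries
        mask = np.zeros_like(v)
        mask[keep] = v[keep]
        v = mask / np.linalg.norm(mask)
    return v

# Toy demo: a covariance whose leading eigenvector is supported on the first 12 coordinates
rng = np.random.default_rng(10)
u = np.zeros(100)
u[:12] = rng.standard_normal(12)
u /= np.linalg.norm(u)
S = 10.0 * np.outer(u, u) + np.eye(100)
v = truncated_power_iteration(S)
print(sorted(np.nonzero(v)[0]))                     # indices of the selected features
```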
5. Discussion

As showcased in the simulations and the Alzheimer's disease case study, iPCA is not simply a theoretical construct that generalizes PCA to the integrated data setting. It is also a useful and effective tool in practice for discovering interesting joint patterns that are shared across multiple data sets. We believe that iPCA's strong empirical performance is due in part to its flexibility to handle a rich set of dependencies among samples and features concurrently, which is particularly appropriate and necessary for integrated data problems. This flexibility is inherently driven by the underlying matrix-variate normal model, and from a whitening perspective, we can view the iPCA model as a natural generalization of the classical PCA model and assumptions. More specifically, in relation to PCA, iPCA can be viewed as performing PCA on the concatenated feature-whitened data, having estimated the individual feature covariances Δ_k and the joint sample covariance Σ simultaneously (a small illustrative sketch is given at the end of this section).

While we discuss many potential penalized iPCA estimators for Σ and the Δ_k's in this work, we recommend that practitioners strongly consider using our new multiplicative Frobenius iPCA estimator. The simulations show that the Frobenius penalties are relatively robust to departures from the model assumptions, and furthermore, the multiplicative Frobenius penalized estimator requires one less penalty parameter to tune and always converges to the global solution. Similar in spirit to other shrinkage penalties (Ledoit and Wolf, 2004), the multiplicative Frobenius iPCA estimator is a shrinkage technique that induces smoothness over the eigenvalues of the covariance matrices. However, its multiplicative form is especially unique and well-suited for integrated data problems, as it performs an automatic reweighting of the integrated data sets and accounts for the concurrent estimation of Σ and Δ_k in each penalty term. Further investigation into the multiplicative Frobenius penalty and its statistical properties is left for future research.

Moreover, we believe that this work opens the door for research into the theoretical underpinnings of dimension reduction and data integration in ways that other non-model-based methods cannot. Building upon the model-based construction of iPCA, we show in Appendix F that the additive L1 correlation estimator satisfies one of the first statistical guarantees in the data integration context. However, this is only the beginning and certainly not meant to be the final investigation into the theoretical properties of the iPCA estimators and data integration. Minimax results and consistency guarantees for the multiplicative Frobenius iPCA estimator in particular are challenging and will require careful follow-up in future work. Still, there are many other open avenues for future exploration. Analogous to PCA, one might imagine similarly fruitful extensions of iPCA to higher-order data, functional data, and other structured applications. One could continue exploring g-convex penalties in different contexts and problems. Another interesting area for future research would be to develop a general framework for proving the consistency of g-convex estimators using the intrinsic manifold, rather than the Euclidean space. We believe this intersection of g-convexity and statistical theory is a particularly ripe area for future research. Overall, in this work, we developed a theoretically sound and practical tool for performing dimension reduction in the integrated data setting, thus facilitating holistic analyses at a large scale.
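To make the whitening interpretation above concrete, the following is a minimal sketch, assuming the K column-centered data matrices are stored in a list `X` and the corresponding estimated feature covariances (e.g., from the multiplicative Frobenius Flip-Flop) in a list `Delta_hat`; it illustrates the interpretation rather than the estimation procedure itself.

```r
# Whiten each data set by its estimated feature covariance, concatenate, and
# take an SVD: the left singular vectors play the role of the joint iPC scores.
whiten <- function(Xk, Dk) {
  e <- eigen(Dk, symmetric = TRUE)
  Dk_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  Xk %*% Dk_inv_sqrt
}
X_whitened <- do.call(cbind, Map(whiten, X, Delta_hat))
ipc_scores <- svd(X_whitened)$u    # columns ordered by decreasing singular value
```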
Acknowledgments

The authors acknowledge support from NSF DMS-1264058, NSF DMS-1554821, and NSF NeuroNex-1707400. T.T. also acknowledges support from the NSF Graduate Research Fellowship Program DGE-1752814. The authors thank Dr. Joshua Shulman and Dr. David Bennett for help acquiring the ROS/MAP data and acknowledge support from NIH P30AG10161, RF1AG15819, R01AG17917, and R01AG36042 for these data. The authors thank Dr. Zhandong Liu and Dr. Ying-Wooi Wan for help with processing and interpreting the ROS/MAP data. The authors also thank Cole Franks for help with improving and strengthening Theorem 5.

Appendix A. Variance Explained by iPCA

To ensure that the cumulative proportion of variance explained from Definition 1 is a well-defined concept, we check that PVE_{k,m} is a proportion and is an increasing function of m. This implies that the marginal proportion of variance explained given in Definition 2 is also a proportion.

Proposition 9 The cumulative proportion of variance explained in X_k by the top m iPCs, as defined in (10), satisfies the following properties for each k = 1, ..., K and m = 1, ..., min{n, p_k}:
(i) 0 <= PVE_{k,m} <= 1;
(ii) PVE_{k,m-1} <= PVE_{k,m}.

Proof (i) Since the Frobenius norm is always non-negative, it is clear that PVE_{k,m} >= 0. So it suffices to show that PVE_{k,m} <= 1, or equivalently,

\| (U^{(m)})^T X_k V_k^{(m)} \|_F^2 \le \| X_k \|_F^2.

By definition of the Frobenius norm, we have that

\| (U^{(m)})^T X_k V_k^{(m)} \|_F^2
= \sum_{i=1}^{m} \sum_{j=1}^{m} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [(U^{(m)})^T]_{iq} \, X_{k,qr} \, [V_k^{(m)}]_{rj} \Big)^2
\overset{[1]}{\le} \sum_{i=1}^{n} \sum_{j=1}^{p_k} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [U^T]_{iq} \, X_{k,qr} \, [V_k]_{rj} \Big)^2
= \| U^T X_k V_k \|_F^2 \overset{[2]}{=} \| X_k \|_F^2.

Here, [2] holds by the orthogonality of U and V_k, and [1] follows from the facts that m <= min{n, p_k}, that U^{(m)} and V_k^{(m)} are precisely the first m columns of U and V_k, respectively, and that each summand is non-negative. This concludes the proof of part (i).

(ii) We follow a similar argument as in part (i) to see that

\| (U^{(m-1)})^T X_k V_k^{(m-1)} \|_F^2
= \sum_{i=1}^{m-1} \sum_{j=1}^{m-1} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [(U^{(m-1)})^T]_{iq} \, X_{k,qr} \, [V_k^{(m-1)}]_{rj} \Big)^2
\le \sum_{i=1}^{m} \sum_{j=1}^{m} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [(U^{(m)})^T]_{iq} \, X_{k,qr} \, [V_k^{(m)}]_{rj} \Big)^2
= \| (U^{(m)})^T X_k V_k^{(m)} \|_F^2.

This implies that PVE_{k,m-1} <= PVE_{k,m}.

Next, we claim that Definition 1 is a generalization of the cumulative proportion of variance explained in PCA. Recall that in PCA, if X = U D V^T is the SVD of X, then the cumulative proportion of variance explained by the top m PCs is (\sum_{i=1}^{m} d_i^2) / (\sum_{i=1}^{p} d_i^2), where D = diag(d_1, ..., d_p). We can rewrite this as

\frac{\sum_{i=1}^{m} d_i^2}{\sum_{i=1}^{p} d_i^2} = \frac{\| D^{(m)} \|_F^2}{\| X \|_F^2} = \frac{\| (U^{(m)})^T X V^{(m)} \|_F^2}{\| X \|_F^2}    (18)

using properties of the SVD. Since U contains the PC scores and V the PC loadings from PCA, Definition 1 is indeed a natural definition in the sense that it generalizes the PCA cumulative proportion of variance explained.

Appendix B. Difference Between iPCA and Concatenated PCA

To formalize the difference between iPCA and concatenated PCA, we examine the bias of the eigenvectors from concatenated PCA when sample and feature dependencies simultaneously obscure each other under the matrix-variate normal model.

Proposition 10 Suppose that X_k ~ N_{n, p_k}(0, Σ ⊗ Δ_k) for each k = 1, ..., K, and define X̃ = [X_1, ..., X_K], Δ = diag(Δ_1, ..., Δ_K), and Z̃ = [Z_1, ..., Z_K], where Z_k is as defined in (5) previously. If X̃ Δ^{-1/2} = U D V^T and X̃ = Ũ D̃ Ṽ^T are their respective compact SVDs, then U and Ũ denote the iPCA and concatenated PCA scores, respectively, and

U - Ũ = Σ^{1/2} Z̃ ( V D^{-1} - Δ^{1/2} Ṽ D̃^{-1} ).    (11)

Moreover, the iPCA and concatenated PCA scores are equal if and only if V D^{-1} - Δ^{1/2} Ṽ D̃^{-1} lies in the nullspace of Z̃ or Δ^{1/2} Ṽ = V D^{-1} D̃.

Proof By construction of their methodology, we know that the iPCA scores U and the concatenated PCA scores Ũ satisfy the following compact SVDs:

X̃ Δ^{-1/2} = U D V^T ,    X̃ = Ũ D̃ Ṽ^T .
We also know from (5) that X = Σ1/2 Z 1/2 . (19) Using the orthogonality of the singular vectors and plugging in (19) for X in the compact SVD formulas then gives U = Σ1/2 Z V D 1, U = Σ1/2 Z 1/2 V D 1. U U = Σ1/2 Z V D 1 1/2 V D 1 . Appendix C. Covariance Estimation for i PCA In this section, we provide the proofs and derivations related to the unpenalized and penalized maximum likelihood estimators under the i PCA model. C.1 Unpenalized Maximum Likelihood Estimators We begin this section by deriving the log-likelihood equation associated with the i PCA population model (2). Recall that the probability density function for each matrix-variate normal model (k = 1, . . . , K) is given by f (Xk | Σ, k) = (2π) npk 2tr Σ 1 Xk 1 k XT k . Integrated Principal Components Analysis Hence, the log-likelihood function is ℓ(Σ 1, 1 1 , . . . , 1 K ) = 2 log(2π) + pk 2 log | Σ 1 | + n 2 log | 1 k | 2tr Σ 1 Xk 1 k XT k i p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k . The unpenalized MLEs are designed to solve the optimization problem argmin Σ 1, 1 1 ,..., 1 K ℓ(Σ 1, 1 1 , . . . , 1 K ) (20) Taking partial derivatives with respect to each covariance parameter thus gives the following: Lemma 11 The unpenalized MLEs of Σ and 1, . . . , K satisfy k=1 Xk ˆ 1 k XT k (13) n XT k ˆΣ 1 Xk k = 1, . . . , K. (14) Proof Taking the partial derivatives of the log-likelihood equation with respect to Σ 1 and 1 k respectively yields the gradient equations: ℓ Σ 1 = p Σ k=1 Xk 1 k XT k ℓ 1 k = n k XT k Σ 1 Xk . Setting the gradient equations equal to 0 gives the desired result. Assuming that the unpenalized MLEs exist, we can compute the unpenalized MLEs of Σ and 1, . . . , K via Algorithm 2, which is analogous to the Flip-Flop algorithm provided in Dutilleul (1999). The next theorem provides very restrictive conditions for which the unpenalized MLEs exist, but in almost all practical cases, the unpenalized maximum likelihood problem is ill-posed for i PCA. Theorem 5 (i) Suppose that Xk has been column-centered so that each column has a mean of 0 for each k = 1, . . . , K. Then the unpenalized MLEs for Σ and 1, . . . , K are not positive definite and hence do not exist. (ii) Suppose that Xk has not been column-centered but that rank( X) = n and rank(Xk) = pk for k = 1, . . . , K. Tang and Allen Algorithm 2 Flip-Flop Algorithm for i PCA Unpenalized MLEs 1: Assume that Xk has been centered appropriately and that the unpenalized MLEs exist. 2: Initialize ˆΣ, ˆ 1, . . . , ˆ K to be symmetric positive definite. 3: while not converged do 4: Update ˆΣ = 1 p PK k=1 Xk ˆ 1 k XT k 5: for k = 1, . . . , K do 6: Update ˆ k = 1 n XT k ˆΣ 1 Xk (a) If n = pk for some k = 1, . . . , K, then the unpenalized log-likelihood function ℓ(Σ 1, 1 1 , . . . , 1 K ) is unbounded. (b) If the unpenalized log-likelihood function ℓ(Σ 1, 1 1 , . . . , 1 K ) is bounded, then the unpenalized MLE for Σ, 1, . . . , K exist. Proof (ii) Suppose that Xk has not been column-centered but that rank( X) = n and rank(Xk) = pk for k = 1, . . . , K. We will first prove part (b), so assume also that the unpenalized log-likelihood is bounded. From the unpenalized MLEs in Lemma 4, it is easy to see that ˆΣ and ˆ 1, . . . , ˆ K are symmetric positive semidefinite. We next claim that ˆΣ and ˆ 1, . . . , ˆ K are full rank and hence positive definite. To prove this claim, we proceed by induction on the unpenalized Flip-Flop iteration counter m. Let ˆΣ m and ˆ m k denote the mth Flip-Flop update of ˆΣ and ˆ k, respectively. 
Clearly, the base case holds since Σ0 and 0 1, . . . , 0 K are initialized to be symmetric positive definite. So suppose Σm and m 1 . . . , m K are full rank. Then the unpenalized Flip-Flop iterates at the (m + 1)th-update step are k=1 Xk( ˆ m k ) 1 XT k = 1 p X( m) 1 X T , ˆ m+1 k = 1 n XT k ( ˆΣ m+1) 1 Xk, where X = [X1, . . . , XK] and m = diag( ˆ m 1 , . . . , ˆ m K). Therefore, we have that rank ˆΣ m+1 = rank X( m) 1 X T = rank X( m) 1 2 = rank( X) = n. Here, the second equality holds because rank(AT A) = rank(A) = rank(AT ). The third equality holds because ( m) 1 2 is full rank (i.e. rank = p) by the inductive hypothesis, and the last equality holds by hypothesis. Thus, ˆΣ m+1 is positive definite. Similarly, for each k = 1, . . . , K, rank ˆ m+1 k = rank XT k ( ˆΣ m+1) 1 Xk = rank(Xk) = pk. So for each k = 1, . . . , K, ˆ m+1 k is positive definite. By induction, ˆΣ m, ˆ m 1 , . . . , ˆ m K 0 for each iterate of Algorithm 2, and by Corollary 21, Algorithm 2 converges to the global solution of (20). Thus, the unpenalized MLEs for Σ and 1 1 , . . . , 1 K exist under the assumptions given in part (ii)(b). Integrated Principal Components Analysis To prove part (a) of (ii), let X1, . . . , XK be given, and assume that n = pk for some k = 1, . . . , K. By the rank assumptions, it must be that pk n p for each k = 1, . . . , K. Define K0 = {k {1, . . . , K} | pk < n} and K1 = {k {1, . . . , K} | pk = n}. Note that by assumption, K0 is not empty, and for each k K0, XT k has a kernel. We can then choose bases for Rn and Rpk (k = 1, . . . , K) such that the first row of Xk is zero for each k K0 and the first row of Xk is the first standard basis vector, denoted fk,1, of Rpk for each k K1. Let e1 denote the first standard basis vector in Rn. We will next construct a family of diagonal matrices for Σ, 1, . . . , K that sends the unpenalized log-likelihood to infinity. Specifically, let Σ = et e1 e T 1 , k = Ipk for k K0, and k = e t f k,1 f T k,1 for k K1, where e denotes the matrix exponential. Then for k K0, tr Σ 1 Xk 1 k XT k = tr Σ 1 Xk XT k = i=2 x T k,i xk,i, and for k K1, tr Σ 1 Xk 1 k XT k = 1 + (et 1)(x T k,i fk,1)2 + x T k,i xk,i , where xk,i denotes the ith row of Xk. Therefore, for every t R, we have that ℓ(Σ 1, 1 1 , . . . , 1 K ) = p log | Σ | n X k K0 log | k | n X k K1 log | k | k K0 tr Σ 1 Xk 1 k XT k X k K1 tr Σ 1 Xk 1 k XT k = p log(et) n X k K1 log(e t) X i=2 x T k,i xk,i (et 1)(x T k,i fk,1)2 + x T k,i xk,i ! = t(n|K1| p) X i=2 x T k,i xk,i (et 1)(x T k,i fk,1)2 + x T k,i xk,i ! As t , the first term approaches since n|K1| < p while the second and third terms approach a constant. Thus, ℓ(Σ 1, 1 1 , . . . , 1 K ) is unbounded, as desired. (i) Suppose that we have centered each data matrix Xk to have column means 0. Recall that the unpenalized MLEs for Σ and 1, . . . , K are obtained by k=1 Xk ˆ 1 k XT k = 1 p X 1 X T , n XT k ˆΣ 1 Xk, Tang and Allen where := diag( ˆ 1, . . . , ˆ K). For the sake of contradiction, suppose that there exists n, p1, . . . , pk such that Σ 1 and 1 1 , . . . , 1 K are symmetric positive definite. By the same argument as in part (ii)(b), rank( ˆΣ) = rank( X), but since each Xk has been centered to have column means 0, then the n rows of X are linearly dependent. Hence, rank( ˆΣ) = rank( X) < n, which implies that ˆΣ can never be positive definite, a contradiction. Therefore, the unpenalized MLEs never exist if each Xk has be column-centered to have mean 0. Remark 12 Notice that if the unpenalized MLEs for Σ, 1, . . . 
, K exist, then n [1]= rank( ˆΣ) [2]= rank( X) [3] min{n, p} pk [1]= rank( ˆ k) [2]= rank(Xk) [3] min{n, pk} k = 1, . . . , K where [1] follows from positive definiteness of ˆΣ and ˆ k, [2] follows the same argument as in the proof of Theorem 5, and [3] holds by properties of rank and the dimensions of X and Xk. By combining these rank constraints with the result of Theorem 5, we see that if the unpenalized MLEs for Σ, 1, . . . , K exist, it must be that pk = n p for each k = 1, . . . , K. Proposition 6 can be proved in the same way as Lemma 4, so we omit the proof. C.2 Penalized Maximum Likelihood Estimators In this section, we develop Flip-Flop algorithms and analyze the convergence results for both Frobenius and L1 penalties. For the sake of notation, let ℓP (Σ 1, 1 1 , . . . , 1 K ) = p log | Σ 1 | n k=1 log | 1 k | + k=1 tr Σ 1 Xk 1 k XT k + P(Σ 1, 1 1 , . . . , 1 K ) We give the overarching framework of the Flip-Flop algorithms in Algorithm 3, and we show in Theorem 7 that Algorithm 3 can be used to find a local solution of (17) for a certain class of penalties, which includes the additive Frobenius, multiplicative Frobenius, and additive L1 penalties. The main idea behind the proof is to use convexity and view Algorithm 3 as a block coordinate descent algorithm so that each update of the Flip-Flop algorithm is a descent direction. Theorem 7 Suppose that the objective function in (17) is bounded below. Suppose also that either (i) P(Σ 1, 1 1 , . . . , 1 K ) is a differentiable convex function with respect to each coordinate or (ii) P(Σ 1, 1 1 , . . . , 1 K ) = P0(Σ 1) + PK k=1 Pk( 1 k ), where Pi is a (non-differentiable) convex function for each k = 1, . . . , K. If either (i) or (ii) holds, then the Flip-Flop algorithm corresponding to (17) converges to a stationary point of the objective. Integrated Principal Components Analysis Algorithm 3 Outline of Flip-Flop Algorithm for Penalized i PCA Covariance Estimators 1: Center the columns of X1, . . . , XK, and initialize ˆΣ, ˆ 1, . . . , ˆ K to be positive definite. 2: while not converged do 3: Update Σ while fixing all other variables: ˆΣ 1 = argmin Σ 1 0 ℓP (Σ 1, ˆ 1 1 , . . . , ˆ 1 K ) 4: for k = 1, . . . , K do 5: Update k while fixing all other variables: ˆ 1 k = argmin 1 k 0 ℓP ( ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 k 1, 1 k , ˆ 1 k+1, . . . , ˆ 1 K ) Proof Suppose either P(Σ 1, 1 1 , . . . , 1 K ) is a differentiable convex function with respect to each coordinate or P(Σ 1, 1 1 , . . . , 1 K ) = P0(Σ 1) + PK k=1 Pk( 1 k ) where Pi is a (non-differentiable) convex function for each i = 1, . . . , K. Let ℓ(Σ 1, 1 1 , . . . , 1 K ) = p log | Σ 1 |+n PK k=1 log | 1 k | PK k=1 tr Σ 1 Xk 1 k XT k . Since the domain of ℓis open and ℓis Gateaux-differentiable on its domain, then ℓP is regular in the domain of ℓP by Lemma 3.1 in Tseng (2001). Note also that since the log-determinant is a strictly concave function on the set of symmetric positive definite matrices, the trace function is linear, and the penalty term is convex with respect to each coordinate by hypothesis, then ℓP is strictly convex in Σ 1 with 1 1 , . . . , 1 K fixed, and for each k = 1, . . . , K, ℓP is strictly convex in in 1 k with Σ 1, 1 j , j = k fixed. Because ℓP is regular and strictly convex with respect to each coordinate, it follows that the Flip-Flop algorithm corresponding to (17) converges to a stationary point of the objective function by Theorem 4.1(c) in Tseng (2001). 
In the following sections, we will derive the specific form of the Flip-Flop updates for each of the penalized i PCA estimators. C.2.1 Additive Frobenius Penalized Flip-Flop Estimator To compute the additive Frobenius penalized estimator, we solve ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 K = argmax Σ 1 0 1 1 ,..., 1 K 0 n p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k k=1 λk 1 k 2 F o . (21) Tang and Allen The gradient equations corresponding to (21) are given by k=1 Xk ˆ 1 k XT k 2λΣ ˆΣ 1 = 0, (22) n ˆ k XT k ˆΣ 1 Xk 2λk ˆ 1 k = 0 k = 1, . . . , K. (23) Lemma 13 If gradient equations (22) and (23) are satisfied, then k=1 Xk ˆ 1 k XT k k=1 Xk ˆ 1 k XT k (24) 4n2 XT k ˆΣ 1 Xk 2 1 2n XT k ˆΣ 1 Xk k = 1, . . . , K (25) Proof Define ˆSΣ = PK k=1 Xk ˆ 1 k XT k , and right multiply (22) by ˆΣ to obtain p ˆΣ 2 ˆSΣ ˆΣ 2λΣ I = 0 (26) p ˆSΣ ˆΣ + 1 4p2 ˆS2 Σ = 2λΣ 4p2 ˆS2 Σ. (27) On the other hand, we can multiply (22) by ˆΣ on the left to obtain p ˆΣ 2 ˆΣˆSΣ 2λΣ I = 0 (28) p ˆΣˆSΣ + 1 4p2 ˆS2 Σ = 2λΣ 4p2 ˆS2 Σ. (29) Adding (27) and (29) and then dividing by 2 gives 2p ˆSΣ ˆΣ 1 2p ˆΣˆSΣ + 1 4p2 ˆS2 Σ = 2λΣ 4p2 ˆS2 Σ is positive definite, its has a unique square root. Thus, We can similarly rearrange (23) and complete the square to obtain (25). We now have the machinery to prove Proposition 14, which gives us the form of each update in the additive Frobenius Flip-Flop algorithm. Integrated Principal Components Analysis Proposition 14 ˆΣ and ˆ 1, . . . , ˆ K are solutions to the gradient equations in (22) and (23) if and only if Γ + Γ2 + 8λΣp I 1 and ˆ k = Vk Φk + Φ2 k + 8λkn I 1 2 VT k k = 1, . . . , K, (31) where U, Vk, Γ, and Φk are defined by the eigendecompositions PK k=1 Xk ˆ 1 k XT k = U Γ UT and XT k ˆΣ 1 Xk = Vk Φk VT k . Proof ( ) Suppose that ˆΣ and ˆ 1, . . . , ˆ K are solutions to the gradient equations in (22) and (23). We will first show that the eigenvectors of ˆΣ and PK k=1 Xk ˆ 1 k XT k are equivalent. Let u be an eigenvector of ˆΣ with the corresponding eigenvalue φ. Then, by (22), K X k=1 Xk ˆ 1 k XT k u = (p ˆΣ 2λΣ ˆΣ 1) u = (pφ 2λΣφ 1) u . Therefore, u is an eigenvector of PK k=1 Xk ˆ 1 k XT k with the eigenvalue pφ 2λΣφ 1. Conversely, suppose u is an eigenvector of PK k=1 Xk ˆ 1 k XT k with eigenvalue γ. Then k=1 Xk ˆ 1 k XT k p + 1 4p2 γ2 u . This implies that k=1 Xk ˆ 1 k XT k 1 2 u = 2λΣ p + 1 4p2 γ2 1 So by Lemma 13 and (32), we have that k=1 Xk ˆ 1 k XT k k=1 Xk ˆ 1 k XT k p + 1 4p2 γ2 1 γ2 + 8λΣp u, (35) so u is indeed an eigenvector of ˆΣ with the eigenvalue 1 2p γ + p γ2 + 8λΣp . Since the eigenvectors of ˆΣ and PK k=1 Xk ˆ 1 k XT k are equivalent and (35) gives us the exact connection between their eigenvalues, it follows that Γ + Γ2 + 8λΣp I 1 Tang and Allen where U, Γ are defined by the eigendecomposition PK k=1 Xk ˆ 1 k XT k = U Γ UT . This same logic can be used to show that for each k = 1, . . . , K, Φk + Φ2 k + 8λkn I 1 where Vk, Φk are defined by the eigendecomposition XT k ˆΣ 1 Xk = Vk Φk VT k . We omit the proof since it uses the same argument as above. ( ) Now suppose that ˆΣ and ˆ 1, . . . , ˆ K satisfy equations (30) and (31). Since we know h 1 2p γ + p γ2 + 8λΣp i 1 = 1 4λΣ γ2 + 8λΣp γ , it follows that h 1 2p Γ + Γ2 + 8λΣp I 1 2 i 1 = 1 4λΣ Γ2 + 8λΣp I 1 2 Γ . Therefore, k=1 Xk ˆ 1 k XT k 2λΣ ˆΣ 1 = p U 1 Γ + Γ2 + 8λΣp I 1 2 UT U Γ UT Γ + Γ2 + 8λΣp I 1 Γ + Γ2 + 8λΣp I 1 2 UT U Γ UT Γ2 + 8λΣp I 1 Similarly, one can substitute in (31) and follow the same argument to show that (23) is also satisfied. 
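Concretely, Proposition 14 reduces the Σ-update to an eigendecomposition followed by a closed-form transformation of the eigenvalues. A minimal sketch of this single update is given below (the Δ_k-update is analogous, with n and λ_k in place of p and λ_Σ); `X` is a list of centered data matrices and `Delta_inv` a list of current inverse feature covariance estimates, both placeholders rather than the authors' code.

```r
# One Sigma-update of the additive Frobenius Flip-Flop (sketch).  Implements
# Sigma_hat = U Phi U^T with Phi_ii = (Gamma_ii + sqrt(Gamma_ii^2 + 8*lambda_Sigma*p)) / (2p).
update_Sigma <- function(X, Delta_inv, lambda_Sigma) {
  p <- sum(vapply(X, ncol, integer(1)))                  # p = p_1 + ... + p_K
  S <- Reduce(`+`, Map(function(Xk, Dk_inv) Xk %*% Dk_inv %*% t(Xk), X, Delta_inv))
  e <- eigen(S, symmetric = TRUE)                        # S = U Gamma U^T
  phi <- (e$values + sqrt(e$values^2 + 8 * lambda_Sigma * p)) / (2 * p)
  e$vectors %*% diag(phi) %*% t(e$vectors)               # regularized Sigma_hat
}
```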
As a consequence of Proposition 14, each update step in the additive Frobenius Flip Flop algorithm can be solved by taking a full eigendecomposition and then regularizing the eigenvalues. This algorithm is given in Algorithm 4. Algorithm 4 Flip-Flop Algorithm for Additive Frobenius Penalized i PCA Estimators 1: Center the columns of X1, . . . , XK, and initialize ˆΣ, ˆ 1, . . . , ˆ K to be positive definite. 2: while not converged do 3: Take eigendecomposition: PK k=1 Xk ˆ 1 k XT k = U Γ UT 4: Regularize eigenvalues: Φii = 1 Γ2 ii + 8λΣp 5: Update ˆΣ 1 = U Φ 1 UT 6: for k = 1, . . . , K do 7: Take eigendecomposition: XT k ˆΣ 1 Xk = V Φ VT 8: Regularize eigenvalues: Γii = 1 2n Φii + q Φ2 ii + 8λkn . 9: Update ˆ 1 k = V Γ 1 VT Integrated Principal Components Analysis C.2.2 Multiplicative Frobenius Penalized Flip-Flop Estimator The derivation of the multiplicative Frobenius Flip-Flop algorithm is very similar to the previous derivation with the additive Frobenius penalty. Thus, we omit most of the details and simply provide a sketch of the derivation. Recall that the multiplicative Frobenius penalized estimator solves ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 K = argmax Σ 1 0 1 1 ,..., 1 K 0 n p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k k=1 λk 1 k 2 F o , (36) for which the gradient equations are k=1 Xk ˆ 1 k XT k 2 ˆΣ 1 K X k=1 λk ˆ 1 k 2 F = 0, (37) n ˆ k XT k ˆΣ 1 Xk 2λk ˆ 1 k ˆΣ 1 2 F = 0 k = 1, . . . , K. (38) Assuming these gradient equations (37) and (38) are satisfied, we can write 2 PK k=1 λk|| ˆ 1 k ||2 F p I + 1 k=1 Xk ˆ 1 k XT k k=1 Xk ˆ 1 k XT k , ˆ k = 2λk|| ˆΣ 1||2 F n I + 1 4n2 XT k ˆΣ 1 Xk 2 1 2n XT k ˆΣ 1 Xk k = 1, . . . , K. We then can follow the same argument in Proposition 14 to show that ˆΣ and ˆ 1, . . . , ˆ K are solutions to the multiplicative Frobenius gradient equations (37) and (38) if and only if k=1 λk|| ˆ 1 k ||2 F I and ˆ k = Vk Φk + Φ2 k + 8nλk|| ˆΣ 1||2 F I 1 2 VT k k = 1, . . . , K, (40) where U, Vk, Γ, and Φk are defined by the eigendecompositions PK k=1 Xk ˆ 1 k XT k = U Γ UT and XT k ˆΣ 1 Xk = Vk Φk VT k . This gives rise to the Flip Flop algorithm for solving the multiplicative Frobenius penalized estimators, as given in Algorithm 1. C.2.3 Additive L1 Penalized Flip-Flop Estimator If the inverse covariance matrices are known to have a sparse underlying structure, then we can apply an L1 penalty, rather than the Frobenius penalty, to induce this sparsity. Note Tang and Allen that while it is possible to use a multiplicative L1 penalty, the multiplicative L1 penalty is not geodesically convex and is not separable (as defined in Tseng, 2001). Thus, we primarily consider the additive L1 penalized i PCA estimator: ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 K = argmax Σ 1 0 1 1 ,..., 1 K 0 n p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k λΣ Σ 1 1,off k=1 λk 1 k 1,off o . (41) Note that 1,offpenalizes the off-diagonal entries (i.e. A 1,off= P i =j |aij|), but it is also possible to use the ordinary L1 norm 1. For fixed ˆ 1, . . . , ˆ K, the Flip-Flop update for ˆΣ is seen to be ˆΣ 1 = argmin Σ 1 0 p log | Σ 1 | + tr k=1 Xk ˆ 1 k XT k + λΣ Σ 1 1,off, and similarly for fixed ˆΣ and ˆ j, j = k, the update for ˆ k is ˆ 1 k = argmin 1 k 0 n log | 1 k | + tr 1 k XT k ˆΣ 1 Xk + λk 1 k 1,off. Both of which can be solved via the graphical lasso algorithm (Hsieh et al., 2011). Plugging in these updates to the framework laid out in Algorithm 3, we give the additive L1 Flip-Flop algorithm in Algorithm 5. 
Algorithm 5 Flip-Flop Algorithm for Additive L1 Penalized iPCA Covariance Estimators
1: Center the columns of X_1, ..., X_K, and initialize Σ̂, Δ̂_1, ..., Δ̂_K to be positive definite.
2: while not converged do
3:   Put A := (1/p) Σ_{k=1}^{K} X_k Δ̂_k^{-1} X_k^T
4:   Apply the graphical lasso: Σ̂^{-1} = argmin_{Σ^{-1}} { -log |Σ^{-1}| + tr(A Σ^{-1}) + λ_Σ ||Σ^{-1}||_{1,off} }
5:   for k = 1, ..., K do
6:     Put A_k := (1/n) X_k^T Σ̂^{-1} X_k
7:     Apply the graphical lasso: Δ̂_k^{-1} = argmin_{Δ_k^{-1}} { -log |Δ_k^{-1}| + tr(A_k Δ_k^{-1}) + λ_k ||Δ_k^{-1}||_{1,off} }

Appendix D. Geodesic Convexity and the Multiplicative Frobenius iPCA Estimator

Because of the central role that geodesic convexity plays in the multiplicative Frobenius iPCA estimator, we give a self-contained introduction to geodesic convexity in Appendix D.1. This review provides the necessary concepts and tools to prove the theorems in Appendix D.2. For a more comprehensive review, we refer to Rapcsák (1991) and Vishnoi (2018).

D.1 Introduction to Geodesic Convexity

Convex optimization problems arise frequently in a variety of machine learning applications such as regression, matrix completion, and clustering, to name a few examples. Beyond their widespread applications, convex problems are well understood theoretically and can be reliably solved in polynomial time. As a result, machine learning tasks are often formulated as convex problems in Euclidean space to guarantee fast convergence to a global solution. Convexity, however, is not limited to Euclidean spaces. Many tools which we know and love from convex optimization can be extended to geodesic convexity (g-convexity) on Riemannian manifolds. In this general Riemannian setting, there are several applications in which we have g-convexity but not convexity (Zhang and Sra, 2016).

Before formally defining geodesic convexity, we first recall some useful concepts from metric geometry. A metric space is a pair (X, d) consisting of a set X and a distance function d that satisfies positivity, symmetry, and the triangle inequality. A path γ is a continuous mapping from [0, 1] to X, and the length ℓ of a path γ is defined as ℓ(γ) := sup{ Σ_{i=1}^{n} d(γ(t_{i-1}), γ(t_i)) : 0 = t_0 < ... < t_n = 1, n ∈ N }. A metric space is a length space if d(x, y) = inf ℓ(γ), where the infimum is taken over all paths γ : [0, 1] → X joining x and y.

Definition 15 Let (X, d) be a length space. A path γ : [0, 1] → X is a geodesic if for every t ∈ [0, 1] there exists an interval [a, b] ⊆ [0, 1] which contains a neighborhood of t such that γ|_{[a,b]} is a shortest path from γ(a) to γ(b).

Put simply, a geodesic is a path which locally minimizes length. Note that geodesics minimize length locally, but not necessarily globally; a canonical example is the great circles on a sphere. This concept of a geodesic generalizes the notion of a line in Euclidean space to general (nonlinear) length spaces. By replacing lines with geodesics in the definition of convexity, we can extend convexity to g-convexity in a straightforward manner.

Definition 16 Let M be a Riemannian manifold. A function f : M → R is geodesically convex if for any x, y ∈ M, any geodesic γ such that γ(0) = x and γ(1) = y, and any t ∈ [0, 1], it holds that f(γ(t)) <= (1 - t) f(x) + t f(y).

The only caveat is that the underlying space must be a Riemannian manifold. To prevent a long-winded detour into the details of Riemannian geometry, we avoid the full technical definition and simply think of a Riemannian manifold as a real differentiable manifold equipped with the notion of an inner product.
We need the structure of a Riemannian manifold in order to meaningfully perform algebraic operations in our space. In summary, by extending the notion of a line to a geodesic, we can easily translate the notion of convexity on a Euclidean space to geodesic convexity on a Riemannian manifold. We next give a series of known properties regarding geodesic convexity that will be of use in the following proofs (Wiesel, 2012; Vishnoi, 2018). Many of these properties are analogous to the familiar convex setting. Tang and Allen Theorem 17 Any local minima of a geodesically convex function is a global minima. Proposition 18 The following operations preserve geodesic convexity: (i) (Addition) If f and g are g-convex functions, then f + g is g-convex. (ii) (Kronecker Products) Suppose f is a real-valued, g-convex function on Sd ++, and Qj Sdj ++ for each j = 1, . . . , J such that QJ j=1 dj = d. Then g(Q1, . . . , QK) = f(Q1 QK) is jointly g-convex in {Qj}J j=1 Sd1 ++ Sd K ++. The proof of Proposition 18(i) is straightforward from the definition of g-convex functions, and Proposition 18(ii) is proved in Wiesel (2012). Remark 19 (Example 4.9 in Vishnoi, 2018) Consider the manifold of positive definite matrices Sn ++. For Q0, Q1 Sn ++, the geodesic joining Q0 to Q1 can be parameterized as 1 2 0 t [0, 1]. (42) D.2 Global Convergence of the Multiplicative Frobenius i PCA Estimator We now have the tools to start proving global convergence of the multiplicative i PCA estimator. The roadmap of this proofs is as follows. First, we prove that the negative loglikelihood is g-convex. Then, we prove that the multiplicative Frobenius penalty is g-convex. Since the sum of g-convex functions is g-convex, this implies that the multiplicative i PCA estimator is a g-convex optimization problem. Furthermore, since the Flip-Flop algorithm was proven to converge in Theorem 7, it follows that the multiplicative Frobenius i PCA estimator converges to the global solution as a consequence of geodesic convexity! Without loss of generality, we assume that each data set Xk has been column-centered for the remainder of this section. Lemma 20 The negative log-likelihood of our model, (12), is jointly geodesically convex in Σ 1, 1 1 , . . . 1 K with respect to Sn ++ Sp1 ++ Sp K ++. Proof Let X = [X1, . . . , XK] Rn p and define 1 0 0 0 2 0 ... ... ... ... 0 0 K Since is a block diagonal matrix, then | | = QK k=1 | k | and | 1| = QK k=1 | 1 k |. This implies that k=1 log | 1 k | = log( k=1 | 1 k |) = log | 1|. (43) Integrated Principal Components Analysis Also, using this new parameterization, k=1 tr(Xk 1 k XT k Σ 1) = tr( X 1 X T Σ 1) = vec( X)T ( 1 Σ 1) vec( X). (44) Using (43) and (44), we rewrite the negative log-likelihood as ℓ(Σ 1, 1 1 , . . . , 1 K ) p log | Σ 1 | n k=1 log | 1 k | + k=1 tr(Xk 1 k XT k Σ 1) = p log | Σ 1 | n log | 1| + vec( X)T ( 1 Σ 1) vec( X) = log(| Σ 1 |p| 1|n) + vec( X)T ( 1 Σ 1) vec( X) = log(| 1 Σ 1 | 1) + vec( X)T ( 1 Σ 1) vec( X). Next, consider the manifold Snp ++, and let the function f : Snp ++ R be given by f(Q) = log(| Q | 1) + vec( X)T Q vec( X). Since f is the sum of g-convex functions on Snp ++ (Wiesel, 2012), and ℓ(Σ 1, 1) = f( 1 Σ 1), then by Proposition 18, the negative log-likelihood is jointly geodesically convex in Σ 1, 1 1 , . . . , 1 K with respect to Sn ++ Sp1 ++ Sp K ++. Corollary 21 Suppose the unpenalized log-likelihood in (20) is bounded. If the Flip-Flop estimators for the unpenalized log-likelihood exist, then they converge to the global solution of (20). 
Proof Recall that the Flip-Flop algorithm yields the iterates: p PK k=1 Xk ˆ 1 k XT k = argmax Σ ℓ(Σ, ˆ 1, . . . , ˆ K) 2. For each k = 1, . . . , K, ˆ k = 1 n XT k ( ˆΣ) 1 Xk = argmax k ℓ( ˆΣ, ˆ 1, . . . , ˆ k 1, k, ˆ k+1, . . . , ˆ K) Thus, each update of the Flip-Flop algorithm monotonically increases the log-likelihood ℓ. Assuming that the MLEs exist and are bounded, this Flip-Flop algorithm converges to a local maximum. Furthermore, since ℓis jointly geodesically concave in Σ 1, 1 1 , . . . 1 K , then all local maxima are global maxima so that the Flip-Flop estimators for the unpenalized log-likelihood converge to the global solution. Lemma 22 The multiplicative Frobenius norm penalty, Σ 1 2 F PK k=1 λk 1 k 2 F , is jointly geodesically convex in Σ 1, 1 1 , . . . , 1 K with respect to Sn ++ Sp1 ++ Sp K ++. Tang and Allen Proof Since tr(A B) = tr(A)tr(B) and A 2 F = tr(AT A), then P(Σ 1, 1 1 , . . . , 1 K ) := Σ 1 2 F k=1 λk 1 K 2 F (45) k=1 λktr(Σ 2)tr( 2 k ) (46) k=1 λktr( 2 k Σ 2) (47) k=1 λk 1 k Σ 1 2 F (48) λk 1 k ) Σ 1 2 F (49) = 1 Σ 1 2 F , (50) 1 λ1 1 0 0 0 1 λ2 2 0 ... ... ... ... 0 0 1 λK K We will next show that the function f : Snp ++ R defined by f(Qα) := Qα 2 F = tr((Qα)T Qα) = tr((Qα)2), α { 1} (51) is geodesically convex in Q Snp ++. That is, let Q0, Q1 Snp ++ be given, and let Qt be the geodesic between Q0 and Q1 as given in (42). We want to show that f(Qα t ) is a convex function with respect to t. So consider the eigendecomposition Q 1 2 0 = U D UT , where U is an orthogonal matrix and D is a diagonal matrix with diagonal entries di. Then for all t [0, 1], it follows from (51) that f(Qα t ) = tr(Qα t Qα t ) (52) 2 0 )αt Qα 0 (Q 1 2 0 (U D UT )αt Qα 0 (U D UT )αt Q 2 0 U Dαt UT Qα 0 U Dαt UT Q = tr(Dαt UT Qα 0 U Dαt UT Qα 0 U) (56) = tr(Dαt A Dαt A), (57) where A := UT Qα 0 U. Integrated Principal Components Analysis Because A is symmetric and (Dαt A)ij = dαt i aij, then (Dαt A Dαt A)ij = l=1 ((Dαt A)il(Dαt A)lj) = l=1 (dαt i aildαt l alj) = l=1 (dαt i aildαt l ajl). Plugging this into (57) gives i=1 (Dαt A Dαt A)ii = l=1 (dαt i aildαt l ail) = l=1 (a2 il(didl)αt). Note that since Q0 and Q1 are positive definite, then Q 1 2 0 is positive definite. Thus, di > 0 for each i = 1, . . . , np, and because di > 0 for each i = 1, . . . , np, f(Qα t ) = Pnp i=1 Pnp l=1(a2 il(didl)αt) is a convex function in t. This proves f is g-convex in Q Snp ++. Since f is g-convex in Q Snp ++ and P(Σ 1, 1) = f( 1 Σ 1) from (50), then the multiplicative Frobenius norm penalty P is g-convex in Σ 1, 1 1 , . . . , 1 K by Proposition 18(ii). Theorem 8 The multiplicative Frobenius i PCA estimator is jointly geodesically convex in Σ 1 and 1 1 , . . . , 1 K . Because of this, the Flip-Flop algorithm for the multiplicative Frobenius i PCA estimator given in Algorithm 1 converges to the global solution. Proof From Lemma 20 and Lemma 22, we see that the objective function corresponding to the multiplicative Frobenius penalized estimator in (36) is the sum of jointly g-convex functions. Thus, the multiplicative Frobenius penalized estimator is also jointly geodesically convex in Σ 1 and 1 1 , . . . , 1 K . Recall we have already proved that Algorithm 1 converges to a stationary point in Theorem 7. Since the multiplicative Frobenius i PCA estimator is geodesically convex, then all local optima are global optima, and the Flip-Flop algorithm for the multiplicative Frobenius i PCA estimator converges to the global solution of (36). 
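For reference, a minimal sketch of the resulting multiplicative Frobenius Flip-Flop (the structure of Algorithm 1, using the eigenvalue updates in (40)) is given below; `X` is a list of column-centered data matrices, `lambda` a vector of penalty parameters (one per data set), and the convergence check is simplified to a fixed number of sweeps. This is an illustration under those assumptions, not the authors' implementation.

```r
# Sketch of the multiplicative Frobenius Flip-Flop: each update is an
# eigendecomposition followed by a closed-form regularization of the eigenvalues.
ipca_mult_frob <- function(X, lambda, n_iter = 50) {
  K <- length(X); n <- nrow(X[[1]]); p <- sum(vapply(X, ncol, integer(1)))
  Sigma_inv <- diag(n)
  Delta_inv <- lapply(X, function(Xk) diag(ncol(Xk)))
  frob2 <- function(A) sum(A^2)                       # squared Frobenius norm
  for (it in seq_len(n_iter)) {
    # Sigma-update: penalty weight is sum_k lambda_k * ||Delta_k^{-1}||_F^2
    w <- sum(lambda * vapply(Delta_inv, frob2, numeric(1)))
    S <- Reduce(`+`, Map(function(Xk, Dk_inv) Xk %*% Dk_inv %*% t(Xk), X, Delta_inv))
    e <- eigen(S, symmetric = TRUE)
    phi <- (e$values + sqrt(e$values^2 + 8 * p * w)) / (2 * p)
    Sigma_inv <- e$vectors %*% diag(1 / phi) %*% t(e$vectors)
    # Delta_k-updates: penalty weight is lambda_k * ||Sigma^{-1}||_F^2
    for (k in seq_len(K)) {
      wk <- lambda[k] * frob2(Sigma_inv)
      Sk <- t(X[[k]]) %*% Sigma_inv %*% X[[k]]
      ek <- eigen(Sk, symmetric = TRUE)
      theta <- (ek$values + sqrt(ek$values^2 + 8 * n * wk)) / (2 * n)
      Delta_inv[[k]] <- ek$vectors %*% diag(1 / theta) %*% t(ek$vectors)
    }
  }
  list(Sigma_inv = Sigma_inv, Delta_inv = Delta_inv)
}
```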
Remark 23 The multiplicative Frobenius i PCA estimator is unique in the sense that the Kronecker production solution ˆΣ ˆ k is unique. Appendix E. Equivalence between PCA and i PCA with Frobenius Penalties when K = 1 One major reason for using the Frobenius penalties is that in the case where we observe only one data set, i PCA with the Frobenius penalties and the classical PCA are equivalent in the sense that the PC scores and loadings are the same. Throughout this section, we will assume K = 1, and we let the SVD of X (which has been column-centered) be given by X = U D VT , where U Rn n, D Rn p with the diagonal elements di, V Rp p, and r = rank(X) < min{n, p}. Without loss of generality, suppose also that p n. In PCA, it Tang and Allen is well known that the PC scores are given by the columns of U and the PC loadings are given by the columns of V. We will show that i PCA with the Frobenius penalties also yield the same PC scores U and loadings V. Let us first consider i PCA with the additive Frobenius penalty λΣ Σ 1 2 F +λ 1 2 F . By Theorem 1 in Allen and Tibshirani (2010), there is a unique global solution maximizing the matrix-variate normal log-likelihood with additive Frobenius penalties (21) when K = 1. This global solution is given by ˆΣ = U β UT and ˆ = V θ VT , where β = diag(β1, . . . , βn) and θ = diag(θ1, . . . , θp) are defined as follows: d2 i θi nθ2 i 2λ , i = 1, . . . , r s p , i = r + 1, . . . , n v u u t c2,i q c2 2,i 4c1c3,i 2c1 , i = 1, . . . , r r n , i = r + 1, . . . , p, with coefficients c1 = 2λΣn2, c2,i = d4 i (p n) + 8nλΣλ , c3,i = 2λ (d4 i 4λΣλ ). Since the PC scores and loadings from i PCA are the eigenvectors of ˆΣ and ˆ , respectively, the above result shows that the PC scores and loadings from i PCA with the additive Frobenius penalty are precisely U and V, as in PCA. We next investigate the equivalence of i PCA with the multiplicative Frobenius penalty λ Σ 1 2 F 1 2 F and PCA when K = 1. While we can follow a similar argument as Allen and Tibshirani (2010), the proof is complicated by the non-separable penalty terms. That is, each term in the multiplicative penalty depends on both ˆΣ and ˆ . We will return to this complication in the proof of the following theorem. Theorem 24 In the case when K = 1, the solution to i PCA with the multiplicative Frobenius penalty (36) is given by ˆΣ = U β UT and ˆ = V θ VT , where β = diag(β1, . . . , βn) Integrated Principal Components Analysis and θ = diag(θ1, . . . , θp) satisfy the system v u u t c2,i(T) q c2 2,i(T) 4c1(T)c3,i(T) 2c1(T) , i = 1, . . . , r n , i = r + 1, . . . , p βi = d2 i θi nθ2 i 2λ, i = 1, . . . , r p , i = r + 1, . . . , n k=1 θ 2 k + (p r) n with c1(T) = 2λn2T, c2,i(T) = d4 i (p n) + 8nλ2T, and c3,i(T) = 2λ(d4 i 4λ2T). This solution exists and is unique (up to a scaling factor). Proof As seen previously, the gradient equations from the matrix-variate normal loglikelihood with the multiplicative Frobenius penalty are p ˆΣ 2λ ˆΣ 1 ˆ 1 2 F = X ˆ 1 XT n ˆ 2λ ˆ 1 ˆΣ 1 2 F = XT ˆΣ 1 X . Thus, the eigenvectors of ˆΣ and ˆ are equal to their respective quadratic forms. It follows that there is only one solution for the eigenvectors namely, the left and right singular vectors of X = U D VT . [Note: the last n r eigenvectors of U and the last p r eigenvectors of V are not unique.] Put ˆΣ = U β UT and ˆ = V θ VT , where β = diag(β1, . . . , βn) and θ = diag(θ1, . . . , θp). Note that we have the implicit constraints βi > 0 and θi > 0 for each i due to the positive definiteness of the covariances. 
Plugging in these decompositions and the SVD of X into the gradient equations gives p U β UT 2λ U β 1 UT V θ 1 VT 2 F = U D VT V θ 1 VT V DT UT n V θ VT 2λ V θ 1 VT U β 1 UT 2 F = V DT UT U β 1 UT U D VT , or equivalently, using the orthogonality of U and V, pβ 2λβ 1 θ 1 2 F = D θ 1 DT nθ 2λθ 1 β 1 2 F = DT β 1 D . Tang and Allen We can write this element-wise as the following system of equations: pβi 2λβ 1 i k=1 θ 2 k = d2 i θ 1 i , i = 1, . . . , r (58) pβi 2λβ 1 i k=1 θ 2 k = 0, i = r + 1, . . . , n (59) nθi 2λθ 1 i k=1 β 2 k = d2 i β 1 i , i = 1, . . . , r (60) nθi 2λθ 1 i k=1 β 2 k = 0, i = r + 1, . . . , p. (61) Now, to simplify this system of equations, we notice that if (β, θ) solve this system, then for any positive scalar c, (cβ, c 1θ) is also a solution. Without loss of generality, we may thus assume that β is normalized so that Pn k=1 β 2 k = 1. It immediately follows from (61) that θi = q n for i = r + 1, . . . , p. On the other hand, for i = 1, . . . , r, (60) gives us that nθ2 i βi 2λβi = d2 i θi = βi = d2 i θi nθ2 i 2λ. Substituting this equation for βi into (58) then yields pd2 i θi nθ2 i 2λ 2λ(nθ2 i 2λ) d2 i θi k=1 θ 2 k = d2 i θi Finding a common denominator and expanding all terms yields [ 2λn2T]θ4 i + [d4 i (p n) + 8nλ2T]θ2 i + [2λ(d4 i 4λ2T)] = 0, where T = Pp k=1 θ 2 k = Pr k=1 θ 2 k + (p r) n 2λ. For the sake of notation, let us define c1(T) = 2λn2T, c2,i(T) = d4 i (p n) + 8nλ2T, and c3,i(T) = 2λ(d4 i 4λ2T). Summarizing what we have done so far, β 0 and θ 0 must satisfy: βi = d2 i θi nθ2 i 2λ, i = 1, . . . , r (62) p , i = r + 1, . . . , n (63) 0 = c1(T)θ4 i + c2,i(T)θ2 i + c3,i(T), i = 1, . . . , r (64) n , i = r + 1, . . . , p (65) k=1 θ 2 k + (p r) n Integrated Principal Components Analysis Now, from the quartic polynomial in (64), the four possible roots are v u u t c2,i(T) q c2 2,i(T) 4c1(T)c3,i(T) In any case, c1(T) < 0, c2,i(T) > 0, and c2 2,i(T) 4c1(T)c3,i(T) = d8 i (p n)2 + 16λ2npd4 i T > 0, v u u t c2,i(T) q c2 2,i(T) 4c1(T)c3,i(T) 2c1(T) . (67) is always a real positive root. Furthermore, if θi is given by (67) for each i = 1, . . . , r, then nθ2 i 2λ = 1 4λn T d4 i (p n) 8nλ2T q d8 i (p n)2 + 16λ2npd4 i T 2λ = 2λ + 1 4λn T d4 i (p n) + q d8 i (p n)2 + 16λ2npd4 i T 2λ Thus, the corresponding βi, which we have already shown to be given by βi = d2 i θi nθ2 i 2λ, is also positive. This shows that (67) yields a feasible solution for our system of equations (62)-(66). We claim that this is the only choice of θi, which yields a feasible solution. To see this, we first immediately eliminate the two negative square root solutions due to the positivity constraint on θi. We next divide the argument into three cases. Case 1: If d4 i 4λ2T = 0, then c3,1(T) = 0 and so c2,i(T) c2,i(T) which equals 0 if we take the positive sign. Thus, in this case, there is only one feasible root by choosing the negative sign. Case 2: If d4 i 4λ2T > 0, then c3,i(T) > 0. Additionally, c1(T) < 0 and c2,i(T) > 0, so by Descartes rule of signs, there is at most one real positive solution to (64). As shown earlier, the root given in (67) is a real positive solution. This must be the unique real positive root by Descartes . Case 3: If d4 i 4λ2T < 0, then c3,i(T) < 0, so by Descartes rule of signs, there are at most two real positive solutions to (64). We have already found one real positive root, given by (67). Suppose also that the other possible root v u u t c2,i(T) + q c2 2,i(T) 4c1(T)c3,i(T) 2c1(T) . (68) Tang and Allen is a real positive root. 
Since the corresponding βi must also be positive (in order to be feasible) and we have already shown that βi = d2 i θi nθ2 i 2λ, it follows that the denominator nθ2 i 2λ must be positive. However, substituting (68) into this denominator gives nθ2 i 2λ = 1 4λn T d4 i (p n) 8nλ2T + q d8 i (p n)2 + 16λ2npd4 i T 2λ = 2λ + 1 4λn T d4 i (p n) q d8 i (p n)2 + 16λ2npd4 i T 2λ This contradicts the positivity of βi and hence is not a feasible solution. Thus, in any case, we have shown that there is only one feasible root of (64), which is given by (67). The last step of this proof is to show that there exists a T which solves our system of equations (62)-(66). In particular, we can substitute (67) into (66) to see that T must satisfy c2 2,i(T) 4c1(T)c3,i(T) + (p r) n d4 i (p n) + 8nλ2T + q d8 i (p n)2 + 16λ2npd4 i T + (p r) n Let f(T) denote the right hand side of the equation in (70). It can be shown that f (T) > 0 for all T 0. Also, when T = 0, we have that f(T) = (p r) n 2λ > 0 = T, and as T , we have f(T) np 2λ < . Together, these observations imply that there exists a unique solution to the equation T = f(T) for some T > 0. Thus, there exists a unique feasible T, and hence also β and θ, to our system of equations (62)-(66). As a direct consequence, since the PC scores and loadings from i PCA are the eigenvectors of ˆΣ and ˆ , respectively, Theorem 24 shows that when K = 1, the PC scores and loadings from i PCA with the multiplicative Frobenius penalty are precisely U and V, as in PCA. In other words, i PCA with the additive or multiplicative Frobenius penalty is a proper generalization of PCA to multiple data sets. Appendix F. Subspace Consistency of the Additive L1 Correlation i PCA Estimator While the multiplicative Frobenius estimator appears superior from an optimization pointof-view, we will show in this section that the additive L1 correlation estimator satisfies one of the first statistical guarantees in the data integration context. Specifically, our primary objective will be to prove that the additive L1 correlation estimator after one Flip-Flop iteration, outlined in Algorithm 6, is a consistent estimator of the true joint covariance matrix Σ and hence also consistently estimates the underlying joint subspace. Integrated Principal Components Analysis As mentioned in the main text, the additive L1 correlation estimator applies the L1 penalty to the correlation matrix, rather than the usual covariance matrix, and has been adopted previously in Zhou (2014a) and Rothman et al. (2008) for non-integrated data. In this non-integrated setting, Zhou (2014a) derived convergence rates for the one-step version of Algorithm 6, assuming only one matrix instance was observed from the matrix-variate normal distribution. Motivated by this approach, we extend the proof idea and results in Zhou (2014a), where K = 1, to i PCA, where we observe one matrix instance for each of the K 1 distinct matrix-variate normal models. Our proof closely resembles Zhou (2014a) with technical challenges due to the estimation of Σ from multiple k s with different pk s. We will see later that the consistency rate obtained in our main theorem, Theorem 30, which holds for all K 1, is equivalent to the rate given in Zhou (2014a) when K = 1. Note however that this subspace consistency property is an incredibly unique property of the additive L1 correlation estimator. When trying to prove a similar result for the other proposed estimators, we run into several difficulties. 
For instance, if we use the additive L1 covariance estimator, the proof of our main result in Theorem 30 no longer goes through due to an additional p term in the graphical lasso convergence rate (Rothman et al., 2008). More specifically, the graphical lasso convergence rate when applied to the inverse covariance matrix, known to be O q (p+s) log(p) , is not fast enough to guarantee statistical convergence of the Flip-Flop algorithm. However, by applying the graphical lasso to the inverse correlation matrix, we obtain a faster rate of O q , which then enables us to prove Theorem 30. Other works, which have studied convergence rates of sparse penalties in the non-integrated setting, are also not applicable for integrated data problems. In particular, Tsiligkaridis et al. (2013) proved convergence rates for the additive L1 covariance penalty but assumed multiple matrix observations per matrix-variate normal model. Since i PCA assumes only one matrix observation per model, the guarantees from Tsiligkaridis et al. (2013) do not hold. The problem of proving rates of convergence for the Frobenius estimators is even more difficult than the L1 estimators. Without imposing some additional structure on the covariance matrices, we cannot even hope to prove statistical consistency in the pk > n setting. For the L1 penalties, it is natural to impose a sparsity constraint, but with the dense Frobenius penalties, the appropriate underlying structure of the covariance matrices is unclear. One preliminary idea is to exploit the g-convexity of the log-likelihood function and impose some additional structure based upon the associated manifold. However, we leave this for future research as it requires developing a whole new set of tools, combined with Riemannian geometry, to study statistical properties in the manifold space, rather than the usual Euclidean space. We next collect notation. Let ˆΣ denote the additive L1 correlation estimator obtained after one iteration of the while loop in Algorithm 6. Assume that for each k = 1, . . . , K, the true population model is given by Xk Nn,pk(0, Σ k), where Σ = (σij) and k = (δk,ij). For the purpose of identifiability, define Σ = (σ ,ij) = n Σ /tr(Σ) and k = (δ k,ij) = tr(Σ) k n so that tr(Σ ) = n and Σ k = Σ k for each k. Let ρ(Σ) and ρ( k) denote the true correlation matrices corresponding to Σ and k, respectively. Define vec(A) to be the vectorization operator which creates a column vector from the Tang and Allen Algorithm 6 Flip-Flop Algorithm for Additive L1 Correlation-Penalized i PCA Estimators 1: Center the columns of X1, . . . , XK, and initialize ˆΣ, ˆ 1, . . . , ˆ K to be the identity matrix of the appropriate size. 2: while not converged do 3: for k = 1, . . . 
, K do 4: Compute sample covariance matrix: ˆSk = 1 n XT k ˆΣ 1 Xk 5: Get standard deviation estimate: ˆ Wk = diag(ˆSk)1/2 6: Convert to sample correlation matrix: ˆSρ,k = ˆ W 1 k ˆSk ˆ W 1 k 7: Apply graphical lasso to estimate correlation matrix: ˆ 1 ρ,k = argmin 1 ρ,k log | 1 ρ,k | + tr(ˆSρ,k 1 ρ,k) + λk 1 ρ,k 1,off 8: Convert back to covariance estimate: ˆ k = ˆ Wk ˆ ρ,k ˆ Wk 9: Compute sample covariance matrix: ˆSΣ = 1 p PK k=1 Xk ˆ 1 k XT k 10: Get standard deviation estimate: ˆ WΣ = diag(ˆSΣ)1/2 11: Convert to sample correlation matrix: ˆSρ,Σ = ˆ W 1 Σ ˆSΣ ˆ W 1 Σ 12: Apply graphical lasso to estimate correlation matrix: ˆΣ 1 ρ = argmin Σ 1 ρ log | Σ 1 ρ | + tr(ˆSρ,Σ Σ 1 ρ ) + λΣ Σ 1 ρ 1,off 13: Convert back to covariance estimate: ˆΣ = ˆ WΣ ˆΣρ ˆ WΣ matrix A by stacking the columns of A below one another. Then for each k = 1, . . . , K, put Sk = vec(Xk)vec(Xk)T and Sk = vec(Xk)T vec(Xk). Let Srq k denote the r, qth block of size n n of Sk, and let Srq k denote the r, qth block of size pk pk of Sk. If A is a matrix, let A 2 denote the operator norm or the maximum singular value of A, let A F denote the Frobenius norm (i.e. A 2 F = P i,j a2 ij), let A 0,offdenote the number of non-zero non-diagonal entries in A, let A 1 = P i,j |aij|, and let A 1,off= P i =j |aij|. Denote the stable rank of A by r(A) = A 2 F / A 2 2. Let us also write for a real symmetric matrix A, φmin(A) to be the minimum eigenvalue of A. Define σmin = mini σii, σmax = maxi σii, δk,min = mini δk,ii, δk,max = maxi δk,ii, and similarly for σ ,min, σ ,max, δ k,min, and δ k,max. Also write a b = max(a, b) and a b = min(a, b). If a = o(b), then |a/b| 0 as n, p1, . . . , p K . If a b, then there exists positive constants c, C such that cb a Cb as n, p1, . . . , p K . Lastly, we also adopt the notation defined in Algorithm 6 for the remainder of this section. To establish the consistency of ˆΣ, we require the following assumptions, which are generalizations of those in (Zhou, 2014a): (A1) Assume that Σ 1 and 1 k are sparse with respect to each other s dimensions: sΣ := Σ 1 0,off= o p2 pk log(n pk) and sk := 1 k 0,off= o n log(n pk) for each k = 1, . . . , K. (A2) Assume that we have uniformly bounded spectra: 0 < φmin(Σ) φmax(Σ) < and 0 < φmin( k) φmax( k) < for each k = 1, . . . , K. Integrated Principal Components Analysis (A3) Assume that the inverse correlation matrices satisfy ρ(Σ) 1 1 n and ρ( k) 1 1 pk for each k = 1, . . . , K. (A4) Assume that K is finite and the growth rate of n and p1, . . . , p K satisfy n pk = o(exp(n pk)) for each k = 1, . . . , K. Remark 25 (1) Rather than verifying the sparsity assumption of Σ 1 in (A1) and the growth rate (A4) for each k, it is sufficient to check that sΣ = o p2 maxk pk log(n maxk pk) and n maxk pk = o(exp(n mink pk)), respectively. (2) (A1) implies that q sk log(n pk) sΣ log(n pk) pk 0 as n, p1, . . . , p K . Under these assumptions, we prove our main statistical consistency result in Theorem 30. From this, we can easily establish subspace consistency for ˆΣ in Corollary 31. The overall idea of this convergence proof is to follow Algorithm 6 and sequentially bound each step in the algorithm. First, the error from line 8 of Algorithm 6 can be bounded by adapting results from Rothman et al. (2008). Then, in Lemma 28, we bound the error between Σ and the sample covariance estimate ˆSΣ defined in the step 9. Following the Flip-Flop algorithm, we next bound the error between ρ(Σ) and the sample correlation estimate ˆSρ,Σ from step 11 in Theorem 29. 
Finally, in Theorem 30, we prove the rate of convergence in the operator and Frobenius norms for ˆΣ 1 and ˆΣ as defined in step 13 of the algorithm. Two direct consequences of convergence in the operator norm are consistent eigenvalues and eigenvectors of ˆΣ, which in turn yield subspace consistency. F.1 Preliminaries The main driver behind Lemma 28 is a large deviation inequality, namely Theorem 13.1 from Zhou (2014b). As this is an important result used multiple times in the proof of Lemma 28, we state Theorem 13.1 from Zhou (2014b) below in a slightly different form for convenience. Theorem 26 Assume that n pk 2. Let M be an n n matrix and N be a pk pk matrix such that 1 n M 2 F < and 1 pk N 2 F < . Define τk = 2C K2 log1/2(n pk), where C := 1 c 1 c 1 and K and c are the constants from Theorem 12.1 in Zhou (2014b). (i) If the stable ranks satisfy r(Σ1/2 M Σ1/2) 4 log(n pk) and r( 1/2 k N 1/2 k ) 4 log(n pk), then with probability 1 3 (n pk)2 , we have that diag( k) 1/2 r=1 Mqr Srq k diag( k) 1/2 tr(Σ M) n ρ( k) Dkτk and diag(Σ) 1/2 r=1 Nqr Srq k diag(Σ) 1/2 tr( k N) p ρ(Σ) D kτk, where Dk = 1 n Σ1/2 M Σ1/2 F and D k = 1 p 1/2 k N 1/2 k F . Tang and Allen (ii) If n pk = o(expn pk), then with probability 1 3 (n pk)2 , the above inequalities hold with Dk = 2 n Σ 2 M 2 and D k = 2 pk We refer to Zhou (2014b) for the proof of this theorem, but if one looks at the interior of this proof, we obtain the following useful result, presented in Corollary 27. To simplify notation, let E (k, M) denote the event ( diag( k) 1/2 1 n r=1 Mqr Srq k diag( k) 1/2 tr(Σ M) n ρ( k) 2 n Σ 2 M 2τk let EΣ(k, N) denote the event ( diag(Σ) 1/2 1 p r=1 Nqr Srq k diag(Σ) 1/2 tr( k N) p ρ(Σ) 2 pk p k 2 N 2τk and define for each k = 1, . . . , K, νn,k := 2 n τk = 4C K2 r νpk := 2 pk p τk = 4C K2 p pk log(n pk) p = 4C K2 pk Corollary 27 Assume the same conditions and notation as Theorem 26. Also suppose that n pk = o(expn pk) Then under the event E (k, M), we have that r=1 Mqr Srq k νn,k Σ 2 M 2 p δk,iiδk,jj i, j, and under the event EΣ(k, N), we have that r=1 Nqr Srq k νpk k 2 N 2 σiiσjj i, j. We now have the necessary large deviation inequalities to prove Lemma 28, a generalization of Lemma 6.1 from Zhou (2014a). Though the proof of Lemma 28 closely resembles that of Lemma 6.1 from Zhou (2014a), modifications must be made as i PCA considers multiple distinct matrix-variate normal models while Zhou (2014a) considers only one matrix-variate normal model. For clarity, we give our proof in its entirety and refer to results in Zhou (2014a) when necessary. Lemma 28 Suppose that (A1)-(A4) hold. Let ˆ ρ,k and ˆ k be obtained as in steps 7 and 8 in Algorithm 6, where we choose λk = 2αk ϵ(1 αk) 3αk 1 αk for αk = A νn,k where A = n Σ F Integrated Principal Components Analysis and ϵ (0, 2/3). Then on event E , we have for ˆSΣ defined in step 9 of Algorithm 6, 4C K2 σ ,iiσ ,jj pk (1 + o(1)) k=1 |σ ,ij| µk σ ,iiσ ,jj νpk(1 + o(1)) + |σ ,ij| µk , µk := λk ˆ 1 ρ,k 1,off p + αk 1 αk ˆ 1 ρ,k 1 p µk, µk := λk ρ( k) 1 1,off p + αk 1 αk Moreover, P(E ) 1 PK k=1 8 (n pk)2 . To put simply, with high probability, 4C K2σ ,max pk (1 + o(1)) k=1 σ ,maxµk k=1 [νpk(1 + o(1)) + µk] . Proof For each k = 1, . . . , K, define RΣ,k := [δk,11vec(Σ) . . . δk,1pkvec(Σ) . . . δk,pkpkvec(Σ)] vec(Σ) vec( k)T ˆRΣ,k := h vec( S11 k ) . . . vec( S1pk k ) . . . vec( Spkpk k ) i . Then one can verify as in Zhou (2014a) that vec(Σ ) = 1 k=1 RΣ,k vec(( k) 1) and vec(ˆSΣ) = 1 k=1 ˆRΣ,kvec( ˆ 1 k ). 
(72) The equalities from (72) thus yield vec(ˆSΣ Σ ) = 1 k=1 ˆRΣ,kvec( ˆ 1 k ) 1 k=1 RΣ,k vec(( k) 1) (73) k=1 ˆRΣ,kvec( ˆ 1 k ) 1 k=1 RΣ,k vec(( k) 1) k=1 ˆRΣ,kvec(( k) 1) 1 k=1 ˆRΣ,kvec(( k) 1) (74) h ( ˆRΣ,k RΣ,k)vec(( k) 1) + ˆRΣ,kvec( ˆ 1 k ( k) 1) i . (75) Tang and Allen For each k = 1, . . . , K, define Θk := ˆ k k and Θk := ˆ 1 k ( k) 1, and notice that Θk = ˆ 1 k Θk( k) 1. Then ˆRΣ,kvec( Θk) = ˆRΣ,kvec( Θk) + RΣ,k vec( Θk) RΣ,k vec( Θk) (76) = RΣ,k vec( Θk) + ( ˆRΣ,k RΣ,k)vec( Θk) (77) = RΣ,k vec( ˆ 1 k Θk( k) 1) + ( ˆRΣ,k RΣ,k)vec( ˆ 1 k Θk( k) 1). (78) Putting (75) and (78) together gives vec(ˆSΣ Σ ) = p( ˆRΣ,k RΣ,k)vec(( k) 1) + 1 p RΣ,k vec( ˆ 1 k Θk( k) 1) p( ˆRΣ,k RΣ,k)vec( ˆ 1 k Θk( k) 1) i k=1 (U1,k + U2,k + U3,k), where the matrix correspondent for each of the above terms will be denoted by M1,k, M2,k, and M3,k, respectively. We will proceed to bound each of the terms separately. In order to bound U1,k, notice that by definition of RΣ,k and ˆRΣ,k, p( ˆRΣ,k RΣ,k)vec(( k) 1) (79) r=1 vec( Sqr k δk,qr Σ)[( k) 1]qr (80) r=1 vec( Sqr k )[( k) 1]qr tr( k( k) 1) p vec(Σ) (81) r=1 vec( Sqr k )[( k) 1]qr pk p vec(Σ ) (82) r=1 Sqr k [( k) 1]qr pk Define E0,k to be the event E (k, (Σ ) 1) EΣ(k, ( k) 1). Then by Corollary 27, under the event E0,k, we have |(M1,k)ij| νpk σ ,iiσ ,jj. Moreover, by Theorem 26, P(E0,k) 1 3 (n pk)2 . Next, we will bound the second term U2,k. As in Zhou (2014b), we can write p RΣ,k vec( ˆ 1 k Θk( k) 1) = 1 ptr( ˆ 1 k Θk)vec(Σ ), and M2,k = 1 ptr( ˆ 1 k Θk) Σ . Integrated Principal Components Analysis By Claim 17.3 in Zhou (2014b), it follows that that under event X0,k := EΣ(k, I) E (k, I), λk ˆ 1 ρ,k 1,off αk 1 αk ˆ 1 ρ,k 1 tr(Θk ˆ 1 k ) λk ˆ 1 ρ,k 1,off+ αk 1 αk ˆ 1 ρ,k 1 |tr( Θk ˆ 1 k )| λk ˆ 1 ρ,k 1,off+ αk 1 αk ˆ 1 ρ,k 1, which implies that on X0,k, |(M2,k)ij| |σ ,ij| λk ˆ 1 ρ,k 1,off+ αk 1 αk ˆ 1 ρ,k 1 = |σ ,ij| µk. Additionally, by Theorem 26, P(X0,k) 1 3 (n pk)2 . To bound the final term U3,k, we follow the same logic as (79)-(82) to obtain p( ˆRΣ,k RΣ,k)vec( Θk) r=1 vec( Sqr k )[ Θk]qr tr( k Θk) and M3,k = 1 r=1 Sqr k [ Θk]qr tr( k Θk) Define E1,k := EΣ(k, Θk). By the proof of Theorem 26, P(E1,k|X0,k) 1 2 (n pk)2 . On the other hand, under the event E1,k, Corollary 27 gives |(M3,k)ij| νpk k 2 Θk 2 σiiσjj. We next bound Θk 2 using Corollary 10.1 from Zhou (2014b), so assuming that X0,k holds, then sk 1 δ k,minφ2 min(ρ( k)). In summary, under the event E := K k=1(E0,k E1,k X0,k), we have that |(ˆSΣ Σ )ij| k=1 (|(M1,k)ij| + |(M2,k)ij| + |(M3,k)ij|) νpk σ ,iiσ ,jj + |σ ,ij| µk sk 1 δ k,minφ2 min(ρ( k)) νpk σ ,iiσ ,jj + |σ ,ij| µk Tang and Allen sk 1 δk,minφ2 min(ρ( k)) νpk σ ,iiσ ,jj(1 + o(1)) + |σ ,ij| µk under the assumptions. Furthermore, applying the union bound implies that P(E ) 1 PK k=1 8 (n pk)2 . Assuming that event X0,k holds, µk µk is a consequence of Corollary 17.4 from Zhou (2014b), and this concludes the proof. It is important to point out that in our choice of λk in (71), A can be considered a constant under the bounded spectrum assumption (A2). In addition, (A1) implies that q n 0 as n, pk . Therefore, since λk is on the order of A q n , λk 0 as n, pk (for k = 1, . . . , K). Next, stepping through Algorithm 6, we bound the error between the correlation estimate ˆSρ,Σ and the true correlation matrix ρ(Σ). Theorem 29 Suppose the conditions in Lemma 28 hold. Define ηk := νpk(1 + o(1)) + µk. 
Then under event E , we have for ˆSρ,Σ defined in step 11 in Algorithm 6 and i = j, pk (1 + o(1))(1 + |ρ(Σ)ij|) 1 PK k=1 ηk + 2|ρ(Σ)ij| PK k=1 µk 1 PK k=1 ηk (84) " νpk(1 + o(1))(1 + |ρ(Σ)ij|) 1 PK k=1 ηk + 2|ρ(Σ)ij| µk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk , where ηk := νpk(1 + o(1)) + µk. (86) Proof Because µk µk by Lemma 28 and |ρ(Σ)ij| 1 for all i, j, it is clear that " νpk(1 + o(1))(1 + |ρ(Σ)ij|) 1 PK k=1 ηk + 2|ρ(Σ)ij| µk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk . Therefore, it suffices to show (84). Assume throughout this proof that event E holds. Then by Lemma 28, k=1 [νpk(1 + o(1)) + µk] = Integrated Principal Components Analysis which implies k=1 ηk = r σ ,ii 1 PK k=1 ηk (87) for all i. On the other hand, for i = j, Lemma 28 gives ˆSΣ σ ,iiσ ,jj ρ(Σ) ˆSΣ σ ,iiσ ,jj Σ σ ,iiσ ,jj νpk(1 + o(1)) + |σ ,ij| σ ,iiσ ,jj µk k=1 (νpk(1 + o(1)) + |ρ(Σ)ij| µk) , (90) Thus, for i = j, [ˆSΣ]ij [ˆSΣ]1/2 ii [ˆSΣ]1/2 jj ρ(Σ)ij [ˆSΣ]ij σ ,iiσ ,jj [ˆSΣ]1/2 ii σ ,ii [ˆSΣ]1/2 jj σ ,jj [ˆSΣ]ij σ ,iiσ ,jj ρ(Σ)ij [ˆSΣ]1/2 ii σ ,ii [ˆSΣ]1/2 jj σ ,jj [ˆSΣ]1/2 ii σ ,ii [ˆSΣ]1/2 jj σ ,jj [ˆSΣ]1/2 ii [ˆSΣ]1/2 jj [ˆSΣ]ij σ ,iiσ ,jj ρ(Σ)ij [ˆSΣ]1/2 ii [ˆSΣ]1/2 jj 1 1 PK k=1 ηk k=1 (νpk(1 + o(1)) + |ρ(Σ)ij| µk) 1 PK k=1 ηk 1 νpk(1 + o(1))(1 + |ρ(Σ)ij|) + 2|ρ(Σ)ij| µk 1 PK k=1 ηk . (96) as desired. Note that (95) follows from (87) and (90). Tang and Allen F.2 Main Result We can now build on top of Theorem 29 and existing results to prove our main convergence result in Theorem 30, which is a generalization of Corollary 17.2 from Zhou (2014b). Indeed, when K = 1, Theorem 30 gives the same rates as Zhou (2014a). Moreover, as a direct consequence of Theorem 30, we obtain Corollary 31, which establishes bounds on the eigenvectors and eigenvalues of the additive L1 correlation i PCA estimator, thereby giving the desired subspace consistency. Theorem 30 Suppose that (A1)-(A4) hold and that n p pk for each k = 1, . . . , K. Assume also that η 1/4, and λΣ is chosen to be λΣ = 2 PK k=1 ηk ϵ1(1 PK k=1 ηk) , for some ϵ1 (0, 1). (97) Then on the event E , ˆΣ Σ 2 2 CλΣσ ,maxκ(ρ(Σ))2 sΣ 1, (98) ˆΣ Σ F 2 CλΣσ ,maxκ(ρ(Σ))2 sΣ n, (99) ˆΣ 1 (Σ ) 1 2 CλΣ sΣ 1 σ ,minφ2 min(ρ(Σ)), (100) ˆΣ 1 (Σ ) 1 2 CλΣ sΣ n σ ,minφ2 min(ρ(Σ)) (101) for some constant C. Proof Assume that E holds throughout this proof. Next, recall that from Theorem 29, ω := 2 PK k=1 ηk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk . By defining Cp,k := 2 ϵ ρ( k) 1 1,off p + ρ( k) 1 1 p , it follows that ηk = νpk + αk 1 αk Cp,k (1+ o(1)). Then because Cp,k 1 under (A3) and νpk 0 as n, pk under (A1) and (A2), 2 PK k=1 ηk 1 PK k=1 ηk k=1 νpk 0 as n, pk . Moreover, under (A1), we have ω sΣ 1 = o(1). Thus, we can apply Theorem 4.5 from Zhou (2014a) to obtain for some constant C, ˆΣρ ρ(Σ) 2 ˆΣρ ρ(Σ) F Cκ(ρ(Σ))2λΣ sΣ 1 and ˆΣ 1 ρ ρ(Σ) 1 2 ˆΣ 1 ρ ρ(Σ) 1 F CλΣ sΣ 1 2φ2 min(ρ(Σ)). Integrated Principal Components Analysis Now since we have a bound on the correlation estimates, the next step is to consider the covariance estimates. Let us define WΣ = diag(Σ )1/2 = q n tr(Σ)diag(Σ)1/2. By Lemma 28, | ˆ W 2 Σ,ii W2 Σ,ii | = | ˆ W 2 Σ,ii σ ,ii| σ ,ii ˆ WΣ WΣ 2 σ ,max k=1 ηk (103) and ˆ W 1 Σ W 1 Σ 2 1 σ ,max 1 + PK k=1 ηk 1 q 1 + PK k=1 ηk 1 q 1 PK k=1 ηk q 1 PK k=1 ηk PK k=1 ηk q 1 PK k=1 ηk . 
(105) Using Proposition 15.2 in Zhou (2014b), (103), and (105), we obtain ˆΣ Σ 2 = ˆ WΣ ˆΣρ ˆ WΣ WΣ ρ(Σ) WΣ 2 ˆ WΣ WΣ 2 + WΣ 2 2 ˆΣρ ρ(Σ) 2 + ˆ WΣ WΣ 2 ˆ WΣ WΣ 2 + 2 ρ(Σ) 2 Cκ(ρ(Σ))2λΣ sΣ 1 σ ,max And because λΣ was chosen to satisfy PK k=1 ηk < λΣ(1 PK k=1 ηk)/2 where PK k=1 ηk PK k=1 ηk 1/4, we have ˆΣ Σ 2 Cκ(ρ(Σ))2λΣ sΣ 1 σ ,max + σ ,max λΣ 2 Cκ(ρ(Σ))2σ ,maxλΣ sΣ 1. Tang and Allen We can also bound the error on the Frobenius norm similarly. Using Proposition 15.2 in Zhou (2014b), (103), (105), PK k=1 ηk < λΣ(1 PK k=1 ηk)/2, and PK k=1 ηk PK k=1 ηk 1/4 we see that ˆΣ Σ F ˆ WΣ WΣ 2 + WΣ 2 2 ˆΣρ ρ(Σ) F + ˆ WΣ WΣ 2 ˆ WΣ WΣ 2 + 2 ρ(Σ) F Cκ(ρ(Σ))2λΣ sΣ 1 σ ,max 2 Cκ(ρ(Σ))2σ ,maxλΣ sΣ n. The same logic can be used to prove (100) and (101), so we omit the details. To summarize the convergence results from Theorem 30, if we set λk = 2αk ϵ(1 αk) n k = 1, . . . , K, (106) and λΣ = 2 PK k=1 ηk ϵ1(1 PK k=1 ηk) then according to Theorem 30, with probability 1 PK k=1 8 (n pk)2 , (sΣ 1) log(n pk) ˆΣ 1 (Σ 1) 2 = O (sΣ 1) log(n pk) (sΣ n) log(n pk) ˆΣ 1 (Σ 1) F = O (sΣ n) log(n pk) Subspace consistency for the additive L1 correlation estimator, ˆΣ, follows as a direct corollary of Theorem 30. Corollary 31 Let ui denote the eigenvector of Σ corresponding to the eigenvalue φi and ˆui be the eigenvector of ˆΣ with eigenvalue ˆφi, where the eigenvalues are sorted in descending order. Under the conditions stated in Theorem 30, we have the following: Integrated Principal Components Analysis (i) For each i, |ˆφi φi| = O (sΣ 1) log(n pk) (ii) For each i such that min(φi 1 φi, φi φi+1) = 0, ˆui ui 2 = O (sΣ 1) log(n pk) Thus, the eigenvectors and eigenvalues of the one-step additive L1 correlation estimator, ˆΣ, are consistent. Proof (i) By Weyl s theorem (Horn and Johnson, 2012), we have for each i, |ˆφi φi| ˆΣ Σ 2. Thus, it follows from Theorem 30 that |ˆφi φi| = O (sΣ 1) log(n pk) (ii) By a variant of the Davis-Kahan sin θ theorem given in Yu et al. (2015), we have for each i, ˆui ui 2 23/2 ˆΣ Σ 2 min(φi 1 φi, φi φi+1), assuming that min(φi 1 φi, φi φi+1) = 0 and ˆv T v 0. Note however that if ˆu T i ui < 0, we can simply take the negative of ˆui and apply the theorem to ˆui and ui. Thus, for each i such that min(φi 1 φi, φi φi+1) = 0, it follows from Theorem 30 that ˆui ui 2 = O (sΣ 1) log(n pk) Since PK k=1 pk (sΣ 1) log(n pk) pk converges to 0 as n, p1, . . . , p K under (A1), this gives us the consistency of the eigenvalues and eigenvectors of ˆΣ. Note though that the consistency statements above assume that n p pk k = 1, . . . , K (i.e. the large n setting). If instead n < p pk k = 1, . . . , K (i.e. the large p setting), then we can modify Algorithm 6 to first initialize an estimate of Σ assuming ˆ k = I for each k. Call this initial estimate ˆΣ 1. Then estimate k given ˆΣ 1. Call this estimate ˆ k. Lastly, obtain the final estimate of Σ given ˆ 1, . . . , ˆ K. Denote this final estimate of Σ by ˆΣ. Similar convergence rates can be obtained for the large p setting Tang and Allen by using this modified algorithm. Namely, if the penalty parameters λk and λΣ are chosen on the order of (106) and (107), we can obtain the same rates as (108)-(111). The only additional assumption required here is a bound on |ρ(Σ)ij|, namely, |ρ(Σ)ij| = O( npk p ) for each k = 1, . . . , K and i = j. We omit this proof as it is a repetition of previous arguments with slight differences. A more thorough discussion of this scenario is presented in Zhou (2014a). 
Thus, in either the large n or large p case, we have a variant of the additive L1 correlation Flip-Flop algorithm such that, under certain assumptions, the estimate of Σ converges in the operator norm at the rate given in (108). As a result of convergence in the operator norm, we obtain consistency of the eigenvalues and eigenvectors of Σ̂ (see Corollary 31). Since the eigenvectors of Σ̂ define the estimated iPCA subspace, this in turn implies subspace consistency of the one-step additive L1 correlation estimator.

Appendix G. Selecting Penalty Parameters

Our missing data imputation framework for selecting penalty parameters is given in Algorithm 7. In the subsequent sections, we discuss algorithms to impute the missing data in step 4 of this algorithm. As one option, one could use the multi-cycle expectation-conditional maximization (MCECM) algorithm (Meng and Rubin, 1993), which iterates between taking conditional expectations in the E-step and maximizing with respect to one variable at a time in the M-step. We derive the full MCECM algorithm for iPCA in Appendix G.1. This method generalizes the MCECM imputation method proposed in Allen and Tibshirani (2010), which only considered the K = 1 case. However, as Allen and Tibshirani (2010) pointed out, the full MCECM algorithm is computationally expensive, so in practice, we also advocate using a faster one-step approximation to the MCECM algorithm, which we discuss in Appendix G.2.

Algorithm 7 Selecting Penalty Parameters via Missing Data Imputation Framework
Given: data X_1, . . . , X_K, space of penalty parameters Λ, type of penalized iPCA estimator
1: for k = 1, . . . , K do
2:   Randomly leave out 5% of the elements in X_k; denote these scattered missing elements by X^m_k
3: for λ in Λ do
4:   Impute the missing values (preferably by Algorithm 9); denote these imputed values by X̂^m_k
5: Select the λ which minimizes ∑_{k=1}^K ‖X̂^m_k − X^m_k‖²_F / ‖X̄^m_k − X^m_k‖²_F, where X̄^m_k denotes the values of the column mean matrix X̄_k at the missing indices

Following the notation in Allen and Tibshirani (2010), we write X^o = (X^o_1, . . . , X^o_K) to denote the totality of observed entries of X = (X_1, . . . , X_K), and X^m = (X^m_1, . . . , X^m_K) to denote the missing entries of X. Also define Θ = (µ_1, . . . , µ_K, Σ^{−1}, Δ_1^{−1}, . . . , Δ_K^{−1}), and let Θ′ be the current estimates of Θ. Note that the MCECM and its one-step approximation for iPCA (where K ≥ 1) are generalizations of the imputation algorithms from Allen and Tibshirani (2010) (where K = 1).

G.1 Multi-Cycle Expectation-Conditional Maximization Algorithm

For concreteness, we will work with the multiplicative Frobenius penalty; the other penalized methods are very similar. We proceed to derive the E-steps and M-steps with respect to each variable for the MCECM algorithm. Note here that we will estimate the column means µ_k adaptively within the MCECM algorithm, rather than fixing them a priori; we find that this strategy works better in practice. So first, in order to compute the E-steps, note that the Q function is

Q(Θ; Θ′) := E_{X^m | X^o, Θ′}[ℓ(Θ | X)]
= p log |Σ^{−1}| + n ∑_{k=1}^K log |Δ_k^{−1}| − E_{X^m | X^o, Θ′}[ ∑_{k=1}^K tr( Σ^{−1} (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ) ] − ∑_{k=1}^K λ_k ‖Σ^{−1}‖²_F ‖Δ_k^{−1}‖²_F.

Thus, for the E-step with respect to Σ, we use linearity of the expectation and trace operators to obtain

E_{X^m | X^o, Θ′}[ ∑_{k=1}^K tr( Σ^{−1} (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ) ] = ∑_{k=1}^K tr( Σ^{−1} E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ] ).
Using the notation and proof of Proposition 3 in Allen and Tibshirani (2010), the conditional expectation reduces to

E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ] = X̂_k Δ_k^{−1} X̂_k^T + F_k(Δ_k^{−1}).   (112)

For the E-step with respect to Δ_k, a similar argument shows that

E_{X^m | X^o, Θ′}[ ∑_{k=1}^K tr( Σ^{−1} (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ) ] = ∑_{k=1}^K tr( E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T)^T Σ^{−1} (X_k − 1_n µ_k^T) ] Δ_k^{−1} ),

and again from Allen and Tibshirani (2010), we have that

E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T)^T Σ^{−1} (X_k − 1_n µ_k^T) ] = X̂_k^T Σ^{−1} X̂_k + G(Σ^{−1}).   (113)

We next plug (112) and (113) back into the Q function and take partial derivatives to compute the M-steps. For the M-step with respect to Σ, we have that

∂Q/∂Σ^{−1} = p Σ − ∑_{k=1}^K ( X̂_k Δ_k^{−1} X̂_k^T + F_k(Δ_k^{−1}) ) − 2 Σ^{−1} ∑_{k=1}^K λ_k ‖Δ_k^{−1}‖²_F = 0,

so given Δ_k^{−1} from the previous iteration, we can update Σ via an eigendecomposition of ∑_{k=1}^K ( X̂_k Δ_k^{−1} X̂_k^T + F_k(Δ_k^{−1}) ). (The form of this update is analogous to the Flip-Flop updates in the multiplicative Frobenius Flip-Flop algorithm.) Similarly, for the M-step with respect to Δ_k,

∂Q/∂Δ_k^{−1} = n Δ_k − ( X̂_k^T Σ^{−1} X̂_k + G(Σ^{−1}) ) − 2 λ_k Δ_k^{−1} ‖Σ^{−1}‖²_F = 0,

so given Σ^{−1} from the previous iteration, we can update Δ_k by an eigendecomposition of X̂_k^T Σ^{−1} X̂_k + G(Σ^{−1}). Putting these E-steps and M-steps together, we provide the full MCECM algorithm to impute missing values in Algorithm 8.

Algorithm 8 Full MCECM Algorithm for iPCA
1: Set µ̂_k to be the column means of X^o_k for each k = 1, . . . , K   Initialization
2: If x^k_{ij} is missing, set x^k_{ij} = µ̂^j_k
3: Initialize Σ̂^{−1} and Δ̂_1^{−1}, . . . , Δ̂_K^{−1} to be symmetric positive definite
4: while not converged do
5:   Compute ∑_{k=1}^K ( X̂_k Δ̂_k^{−1} X̂_k^T + F_k(Δ̂_k^{−1}) )   E-Step (Σ)
6:   Update µ̂_k to be the column means of X̂_k
7:   Take the eigendecomposition ∑_{k=1}^K ( X̂_k Δ̂_k^{−1} X̂_k^T + F_k(Δ̂_k^{−1}) ) = U Γ U^T   M-Step (Σ)
8:   Regularize the eigenvalues: φ_i = (1/(2p)) ( γ_i + sqrt( γ_i² + 8p ∑_{k=1}^K λ_k ‖Δ̂_k^{−1}‖²_F ) )
9:   Update Σ̂^{−1} = U Φ^{−1} U^T
10:  for k = 1, . . . , K do
11:    Compute X̂_k^T Σ̂^{−1} X̂_k + G_k(Σ̂^{−1})   E-Step (Δ_k)
12:    Update µ̂_k to be the column means of X̂_k
13:    Take the eigendecomposition X̂_k^T Σ̂^{−1} X̂_k + G_k(Σ̂^{−1}) = V Φ V^T   M-Step (Δ_k)
14:    Regularize the eigenvalues: γ_i = (1/(2n)) ( φ_i + sqrt( φ_i² + 8n λ_k ‖Σ̂^{−1}‖²_F ) )
15:    Update Δ̂_k^{−1} = V Γ^{−1} V^T
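To make the closed-form M-step updates in Algorithm 8 concrete, the following is a minimal R sketch of steps 7-9 and 13-15 for the complete-data case, so the correction terms F_k and G_k drop out. The function and variable names here are illustrative and are not part of our released code.

```r
# Minimal sketch of the Algorithm 8 M-steps with complete data
# (correction terms F_k and G_k omitted).  Names are illustrative.
update_Sigma_inv <- function(X_list, Delta_inv_list, lambdas) {
  p <- sum(sapply(X_list, ncol))
  # E-step quantity with complete data: sum_k X_k Delta_k^{-1} X_k^T
  A <- Reduce(`+`, Map(function(X, Di) X %*% Di %*% t(X), X_list, Delta_inv_list))
  c_pen <- sum(mapply(function(l, Di) l * sum(Di^2), lambdas, Delta_inv_list))
  eig <- eigen(A, symmetric = TRUE)
  gam <- eig$values
  phi <- (gam + sqrt(gam^2 + 8 * p * c_pen)) / (2 * p)   # step 8
  eig$vectors %*% diag(1 / phi) %*% t(eig$vectors)       # step 9: Sigma-hat^{-1}
}

update_Delta_inv_k <- function(Xk, Sigma_inv, lambda_k) {
  n <- nrow(Xk)
  B <- t(Xk) %*% Sigma_inv %*% Xk
  eig <- eigen(B, symmetric = TRUE)
  phi <- eig$values
  gam <- (phi + sqrt(phi^2 + 8 * n * lambda_k * sum(Sigma_inv^2))) / (2 * n)  # step 14
  eig$vectors %*% diag(1 / gam) %*% t(eig$vectors)       # step 15: Delta_k-hat^{-1}
}
```

Each update thus reduces to a single eigendecomposition followed by the eigenvalue regularization derived above, which is what keeps the multiplicative Frobenius updates inexpensive.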
G.2 One-Step Approximation

Algorithm 8 is a generalization of the TRCMAimpute algorithm from Allen and Tibshirani (2010), and as discussed in Allen and Tibshirani (2010), it is computationally expensive to compute F_k(Δ̂_k^{−1}) and G_k(Σ̂^{−1}). Hence, rather than using the full MCECM algorithm to impute the missing values in step 4 of Algorithm 7, we advocate, as in Allen and Tibshirani (2010), using a one-step approximation to the MCECM algorithm. We will empirically show that the one-step approximation is both a good approximation to the full MCECM algorithm and works well in practice. The idea behind the one-step approximation is that since the first step of the MCECM algorithm typically gives the steepest decrease in the objective function, we can quickly approximate the MCECM algorithm by first obtaining a decent initial imputation and then stopping the algorithm after one M-step and one E-step. We detail the initial imputation step, M-step, and E-step as follows.

For the initial imputation step, we impute missing values assuming Σ = I. If we assume Σ = I, then X_k ∼ N_{n,p_k}(1_n µ_k^T, I ⊗ Δ_k) for each k = 1, . . . , K, or equivalently, x^k_1, . . . , x^k_n are i.i.d. N(µ_k, Δ_k), where x^k_i is the ith row of X_k. Since this reduces to the familiar multivariate case, we can initially impute the missing values in X_k using any (regularized) multivariate normal imputation method such as RCMimpute from Allen and Tibshirani (2010) for each k = 1, . . . , K. Given the initial imputation for X_1, . . . , X_K from the previous step, we next compute the M-step and update µ_1, . . . , µ_K, Σ, Δ_1, . . . , Δ_K using the penalized Flip-Flop algorithms derived in Appendix C. In the next and final step, we take an E-step to impute the missing values by X̂^m_k = E[X^m_k | X^o_k, µ̂_k, Σ̂, Δ̂_k] for each k = 1, . . . , K. This can be done using the alternating conditional expectations algorithm from Allen and Tibshirani (2010), applied to each X_k separately; the only difference is that we specialize to the case where the mean matrix is M_k = 1_n µ_k^T. We summarize this one-step approximation method for missing data imputation in Algorithm 9.

Algorithm 9 One-Step MCECM Approximation
1: for k = 1, . . . , K do   Initial Imputation // E-Step
2:   Impute the missing values in X_k assuming Σ = I; call it X̂_{k,0}   (use any regularized multivariate normal imputation method)
3: Estimate µ_1, . . . , µ_K, Σ, Δ_1, . . . , Δ_K given X̂_{1,0}, . . . , X̂_{K,0} via the penalized MLE Flip-Flop algorithm   M-Step
4: for k = 1, . . . , K do   E-Step
5:   Set the missing values X̂^m_k = E[X^m_k | X^o_k, µ̂_k, Σ̂, Δ̂_k] using the alternating conditional expectations algorithm as in Allen and Tibshirani (2010)

In Figure 8, we compare the numerical convergence of the one-step approximation and the full MCECM algorithm for a small simulation. From this plot, we note two important observations. First, the initialization (assuming Σ = I) in the one-step approximation algorithm makes a significant difference, compared to initializing missing elements to their respective column means as in the full MCECM algorithm. Second, the first update of the MCECM algorithm results in the largest increase in the log-likelihood function. As discussed in Allen and Tibshirani (2010), these two observations motivate the one-step approximation algorithm, which takes advantage of a good initialization and one update step to maintain sufficient accuracy while reducing the computational workload. Figure 8 also shows that the likelihood function after a good initialization and one iteration is on par with the full MCECM algorithm after 15 iterations. For a more detailed discussion of computation and timing comparisons between the one-step approximation and the full MCECM algorithm, we refer to Allen and Tibshirani (2010). Note that though Allen and Tibshirani (2010) only treats the K = 1 case, the results are applicable to the K > 1 case due to the separability of the log-likelihood function.

Figure 8: We use the same simulation as that in Figure 9 and randomly leave out 5% of the entries in each data set. We impute the missing values using the full MCECM and one-step approximation algorithms with the multiplicative Frobenius penalty (λ = (1, 1)), and we plot the log-likelihood value over each iterate. The log-likelihood obtained by the one-step approximation is on par with the log-likelihood after 15 iterations of the MCECM algorithm.

Figure 9 compares the average imputation errors from the full MCECM and the one-step approximation for a small simulation. In this case, for both the one-step approximation and the full MCECM, λ = (10^{−0.5}, 10^0) ≈ (0.32, 1.00) gave the lowest average imputation error, and hence, both imputation methods selected λ = (10^{−0.5}, 10^0) for the multiplicative Frobenius iPCA estimator. This further supports the use of the one-step algorithm as an approximation to the full MCECM in practice. Moreover, we verified that the minimum subspace recovery error of 2.00 was obtained at λ = (0.01, 10^{1.5}) ≈ (0.01, 31.62) and that λ = (10^{−0.5}, 10^0) yielded a similar subspace recovery error of 2.08. This preliminary empirical evidence leads us to believe that the one-step imputation algorithm is indeed a good approximation to the full MCECM algorithm.
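To make the selection framework of Algorithm 7 concrete, the following is a minimal R sketch of its leave-out-and-impute loop. Here `impute_fn` stands in for the one-step imputation of Algorithm 9 (or the full MCECM) and, like all other names in the sketch, is illustrative rather than part of our released code.

```r
# Minimal sketch of the penalty selection loop in Algorithm 7.  `impute_fn`
# takes the data sets with NAs and a candidate lambda and returns imputed data.
select_penalty <- function(X_list, lambda_grid, impute_fn, prop_missing = 0.05) {
  # step 2: randomly hide 5% of the entries in each X_k
  masks <- lapply(X_list, function(X) {
    m <- matrix(FALSE, nrow(X), ncol(X))
    m[sample(length(X), floor(prop_missing * length(X)))] <- TRUE
    m
  })
  X_obs <- Map(function(X, m) { X[m] <- NA; X }, X_list, masks)

  # steps 3-4: impute under each candidate lambda and record the relative error
  errs <- sapply(lambda_grid, function(lambda) {
    X_hat <- impute_fn(X_obs, lambda)
    sum(mapply(function(Xh, X, m) {
      Xo <- X; Xo[m] <- NA
      X_bar <- matrix(colMeans(Xo, na.rm = TRUE), nrow(X), ncol(X), byrow = TRUE)
      sum((Xh[m] - X[m])^2) / sum((X_bar[m] - X[m])^2)
    }, X_hat, X_list, masks))
  })

  # step 5: return the lambda with the smallest held-out imputation error
  lambda_grid[[which.min(errs)]]
}
```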
Appendix H. Simulations

In order to check that the simulation results in Figure 3 are not heavily dependent on our choice of Σ and Δ_1, Δ_2, Δ_3, we ran additional simulations, varying the dimension of the true underlying subspace U and the number of data sets K. These simulation results are shown in Figure 10.

Figure 9: We simulated two coupled data matrices X_1, X_2 with n = 50, p_1 = 60, p_2 = 70 according to the iPCA model (2). Here, we took Σ to be as in the base simulation described in Section 4.1, Δ_1 to be an autoregressive Toeplitz matrix with the (i, j)th entry given by 0.9^{|i−j|}, and Δ_2 to be a block-diagonal matrix with five equal-sized blocks. Then, we randomly removed 5% of the elements in X_1 and X_2 and imputed these missing values using the full MCECM and the one-step approximation algorithms with the multiplicative Frobenius penalty. We plot the average imputation error ∑_{k=1}^K ‖X̂^m_k − X^m_k‖²_F / ‖X̄^m_k − X^m_k‖²_F plus or minus one standard error, taken over 10 trials. In the left graph of each panel (A and B), λ_1 varies while λ_2 is fixed at its optimal value (i.e., λ_2 = 10^0), and in the right graph of each panel, λ_2 varies while λ_1 is fixed at its optimal value (i.e., λ_1 = 10^{−0.5}). The minimum average imputation error is achieved at λ = (10^{−0.5}, 10^0) for both imputation algorithms.

Figure 10: Additional Simulations: (A) The Frobenius iPCA estimators yield the lowest subspace recovery error regardless of the subspace dimension of Σ; (B) As the number of integrated data sets increases, most methods tend to do better, with the multiplicative Frobenius iPCA penalty slightly outperforming the others when K = 10. Note that we did not run individual PCAs in (B) because the number of data sets is changing. (Panel A plots the subspace recovery error against the subspace dimension of Σ, and panel B plots it against the number of data sets K; the methods shown are iPCA (x Frob), iPCA (+ Frob), Concatenated, and Distributed.)

For the simulations in Figure 10A, we took Δ_1, Δ_2, Δ_3 to be the same as in the base simulation, and we put Σ to be of the form U D U^T, where U was a random n × n orthogonal matrix, and D = diag(d_1, . . . , d_n) was simulated by d_i ∼ Unif(5, 75) for i = 1, . . . , D, and d_i = 1 otherwise. Here, D is the dimension of the true underlying subspace of Σ (e.g., D was taken to be 2 in the base simulation). Then, like in the base simulation, we generated X_k for each k = 1, 2, 3 by X_k ∼ N(0, Σ ⊗ Δ_k), or equivalently, X_k = Σ^{1/2} Ω_k Δ_k^{1/2}, where Ω_k is an n × p_k random matrix with i.i.d. N(0, 1) entries.
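For readers who wish to reproduce this setup, the following is a minimal R sketch of the data-generating mechanism X_k = Σ^{1/2} Ω_k Δ_k^{1/2} used throughout this section. The specific Σ and Δ_1 constructed in the example mirror the spiked and autoregressive Toeplitz choices described above, and all names are illustrative rather than taken from our released code.

```r
# Minimal sketch of the Kronecker data-generating mechanism:
# X_k = Sigma^{1/2} Omega_k Delta_k^{1/2}, i.e. X_k ~ N_{n,p_k}(0, Sigma (x) Delta_k).
mat_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(sqrt(pmax(e$values, 0))) %*% t(e$vectors)
}

simulate_ipca_data <- function(Sigma, Delta_list) {
  n <- nrow(Sigma)
  Sig_half <- mat_sqrt(Sigma)
  lapply(Delta_list, function(Delta) {
    pk <- nrow(Delta)
    Omega <- matrix(rnorm(n * pk), n, pk)   # i.i.d. N(0, 1) entries
    Sig_half %*% Omega %*% mat_sqrt(Delta)
  })
}

# Example mirroring the Figure 10A setup with D = 2 and an AR Toeplitz Delta_1
set.seed(1)
n <- 50
U <- qr.Q(qr(matrix(rnorm(n * n), n, n)))     # random orthogonal matrix
d <- c(runif(2, 5, 75), rep(1, n - 2))        # two spiked eigenvalues
Sigma <- U %*% diag(d) %*% t(U)
Delta1 <- 0.9^abs(outer(1:60, 1:60, "-"))     # (i, j) entry = 0.9^{|i - j|}
X_list <- simulate_ipca_data(Sigma, list(Delta1))
```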
For the simulations in Figure 10B, we took Σ to be the same as in the base simulation, and we generated Δ_k from a random choice among the following covariance types: (i) an autoregressive Toeplitz matrix with the (i, j)th entry given by ρ^{|i−j|}, where ρ ∼ Unif(−0.9, 0.9); (ii) a block-diagonal matrix with B blocks with entries q_1, . . . , q_B, where B ∼ Unif(3, 10) and q_i ∼ Unif(0, 0.9); (iii) a spiked covariance matrix U D U^T, where U is a random orthogonal matrix, d_i ∼ Unif(5, 75) for i = 1, . . . , D, d_i = 1 for i = D + 1, . . . , p_k, and D ∼ Unif(5, 50); (iv) the observed covariance matrix of real miRNA data from TCGA Ovarian Cancer. (Note that this covariance matrix can only be used for one k ∈ {1, . . . , K}.) The number of features was randomly selected by p_k ∼ Unif(200, 500), and we ensured that the signal in each Δ_k was larger than that in Σ so that the individual signal was larger than the joint signal.

As conveyed in Figure 10A, the Frobenius iPCA estimators outperformed their competitors regardless of the subspace dimension of Σ. It is also encouraging to see that the strong performance of the Frobenius iPCA estimators is not dependent on the cluster model for Σ, which was used in the simulations in Figure 3 but not in Figure 10A. Note that since the simulated Σ was dense, the additive L1 penalized iPCA estimator should perform poorly, as it does. We also point out that JIVE tends to do worse as the subspace dimension of Σ increases because JIVE tends to underestimate the rank of the joint variation matrix when the subspace dimension of Σ is larger. This is one of the disadvantages of matrix factorizations, as they require the rank of the factorized matrices to be chosen a priori. In Figure 10B, we see that most of the methods perform better as the number of integrated data sets increases and that the multiplicative Frobenius iPCA estimator slightly outperforms all the other methods when K = 10. We speculate that the additive Frobenius iPCA estimator does worse for large values of K because the grid of possible penalty parameters was too coarse; as K increases, so does the number of penalty parameters, and hence it becomes more difficult to choose appropriate penalty parameters when K is large.

For the Laplacian simulations in Figure 4A, we generated X_k for each k = 1, 2, 3 by X_k = Σ^{1/2} Ω_k Δ_k^{1/2} + E_k, where Σ and Δ_k were taken as in the base simulation, and E_k was an n × p_k random matrix with i.i.d. Laplace(0, b) entries. For the simulations in Figure 4B, we simulated three coupled data sets from an instance of the JIVE model: for each k = 1, 2, 3, X_k = J_k + A_k + E_k, where J = [J_1, J_2, J_3] is the joint variation matrix of rank r = 5, A_1, A_2, A_3 are the individual variation matrices of rank r_1 = 10, r_2 = 15, r_3 = 20, respectively, and E_k are error matrices with independent entries from N(0, σ²). Similar to the simulations in Lock et al. (2013), we set J_k and A_k by J_k = U V_k and A_k = U_k W_k, where U ∈ R^{n×r}, V_k ∈ R^{r×p_k}, U_k ∈ R^{n×r_k}, and W_k ∈ R^{r_k×p_k}. Here, U, U_k, V_k, W_k are matrices whose entries are randomly generated i.i.d. from one of the following distributions: N(0, 1), Unif(0, 1), Exp(1), and the discrete random variable on {−2, −1, 0, 1, 2} with uniform probabilities. Note that under this JIVE model, the true joint covariance matrix is given by Σ = J J^T.

In addition to the robustness simulations in Figure 4, we also ran simulations under the CMF model and simulations with uncommon row covariance matrices. When simulating from the CMF model, we generated three coupled data matrices with n = 150 and p_1 = 300, p_2 = 500, p_3 = 400 via the model X_k = U V_k^T + E_k. Here, U ∈ R^{n×2} was taken to be a random two-dimensional subspace from a cluster model with three clusters (as shown in Figure 1), each V_k ∈ R^{p_k×2} was taken to be the top two eigenvectors of the base simulation's Δ_k (e.g., V_1 was taken to be the top two eigenvectors of the autoregressive Toeplitz matrix with (i, j)th entry given by 0.9^{|i−j|}), and E_k is a random matrix with i.i.d. N(0, σ²) entries.
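A minimal R sketch of this CMF data-generating model is given below. The three-cluster score matrix U from Figure 1 is not reproduced here, so any n × 2 score matrix with the desired cluster structure would be passed in as the argument U; all names are illustrative.

```r
# Minimal sketch of the CMF model described above: X_k = U V_k^T + E_k, where
# V_k holds the top two eigenvectors of the base simulation's Delta_k and E_k
# has i.i.d. N(0, sigma^2) entries.
simulate_cmf_data <- function(U, Delta_list, sigma) {
  n <- nrow(U)
  lapply(Delta_list, function(Delta) {
    pk <- nrow(Delta)
    Vk <- eigen(Delta, symmetric = TRUE)$vectors[, 1:2]   # top two eigenvectors
    U %*% t(Vk) + matrix(rnorm(n * pk, sd = sigma), n, pk)
  })
}
```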
We summarize the simulation results from the CMF model with increasing levels of noise σ in Figure 11. From this figure, we see that when the additive noise is small, CMF (and concatenated PCA) yield slightly lower subspace recovery errors than the multiplicative Frobenius iPCA estimator, but as σ increases, this improvement over the multiplicative Frobenius iPCA estimator diminishes.

Figure 11: CMF Model Simulations: As the noise level σ of the CMF model increases, the multiplicative Frobenius iPCA estimator performs on par with CMF (and concatenated PCA). (The figure plots the subspace recovery error against the standard deviation of the additive normal noise for iPCA (x Frob), iPCA (+ Frob), JIVE, MFA, CMF, concatenated PCA, and PCA on each of X_1, X_2, X_3.)

For the simulations with uncommon row covariance matrices, we modified the base simulation so that two out of the three data sets arose from the model X_k ∼ N_{n,p_k}(0, Σ ⊗ Δ_k) while the final data set arose from the model X_k ∼ N_{n,p_k}(0, Σ̃ ⊗ Δ_k). Here, Σ and Δ_1, Δ_2, Δ_3 are as in the base simulation, while Σ̃ is a rank-2 spiked covariance with eigenvectors generated from i.i.d. random normal entries. We summarize the simulation results from this uncommon row covariance model in Figure 12. In Figure 12A, we see that regardless of which X_k is generated from the uncommon Σ̃, the Frobenius iPCA estimators are relatively robust to the model misspecification. Figure 12B shows the difference in subspace recovery error when applying the methods to all three data sets (i.e., mixture) versus applying them to the two data sets generated by the common Σ (i.e., oracle). Here, we see that while the Frobenius iPCA estimators perform better with oracle knowledge of the two data sets generated by the common Σ, the decline in performance is relatively small when adding the outlying data set, again demonstrating the robustness of the Frobenius iPCA estimators.

In the last set of simulations, we empirically study the iPCA estimators under the sparse setting. For these sparse simulations, we generated two data sets with n = 50 and p_1 = 50, p_2 = 100 according to the Kronecker product model X_k ∼ N_{n,p_k}(0, Σ ⊗ Δ_k). Because we are interested in uncovering the low-rank sparse structure when performing dimension reduction, we generate Σ as follows: let U denote the eigenvectors of a block-diagonal matrix with B equally-sized blocks, and put D = diag(25, 12.5, 1, . . . , 1) ∈ R^{n×n}. Then simulate Σ = U D U^T so that Σ is a low-rank block-diagonal matrix with B blocks. To generate Δ_1, we use the huge package (Jiang et al., 2019) in R to obtain the sparse covariance matrix associated with multivariate normal data generated from a sparse banded graph (with bandwidth = 4). Lastly, we took Δ_2 to be the block-diagonal matrix (with five equally-sized blocks) created by taking the observed covariance matrix of miRNA data from TCGA ovarian cancer (The Cancer Genome Atlas Research Network, 2011) and zeroing out the entries off of the block diagonal.
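Before turning to the results, we note that a sparse banded-graph covariance like the Δ_1 above can be generated directly with the huge package. The short sketch below reflects one way this could be done with huge.generator; the argument names are our reading of that interface and should be checked against the package documentation.

```r
# Sketch of generating a sparse Delta_1 from a banded graph (bandwidth 4)
# with the huge package.  Check argument names against the huge documentation.
library(huge)

set.seed(1)
gen <- huge.generator(n = 50, d = 50, graph = "band", g = 4)
Delta1 <- gen$sigma   # covariance matrix implied by the sparse banded graph
```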
The results from this sparse simulation study as we increase the number of blocks in Σ (and hence increase the sparsity in Σ) are shown in Figure 13. From this study, we see that when the amount of sparsity is relatively low, the multiplicative Frobenius iPCA estimator and the additive L1 iPCA covariance estimator perform the best while the additive L1 iPCA correlation estimator performs poorly. We believe that in this low sparsity scenario, the additive L1 iPCA correlation estimator struggles with choosing the appropriate penalty parameters and thus does not perform as well. However, as the sparsity level increases, both the additive L1 iPCA covariance and correlation estimators outperform the dense Frobenius iPCA estimators.

Figure 12: Uncommon Row Covariance Simulations: (A) We plot the subspace recovery errors for various methods under the uncommon row covariance model. In the top panel, X_1 was generated from the uncommon Σ̃; in the middle panel, X_2 was generated differently; and in the bottom panel, X_3 was different. (B) We compare the subspace recovery errors from various methods when applied to the mixture of all three data sets (i.e., mixture) versus when applied to the two data sets which were generated by the common Σ (i.e., oracle). (Methods shown include iPCA (x Frob), iPCA (+ Frob), MFA, and concatenated PCA.)

Figure 13: Sparse Simulations: As we increase the number of blocks in Σ (and hence also the sparsity of Σ), the sparse iPCA estimators improve over the Frobenius iPCA estimators. However, when the sparsity level is relatively low, the multiplicative Frobenius iPCA estimator performs similarly to the additive L1 iPCA covariance estimator. (The figure plots the subspace recovery error against the number of blocks for iPCA (x Frob), iPCA (+ Frob), iPCA (L1), iPCA (Corr L1), JIVE, MFA, concatenated PCA, and PCA on X_1 and X_2.)

We finally note that other metrics which also quantify the distance between subspaces, such as canonical angles, behave similarly to the subspace recovery error, so we omitted these results for brevity. Other common metrics such as |‖Σ̂‖_2 − ‖Σ‖_2| or ‖Σ̂ − Σ‖²_F are not appropriate for our study because we are interested in the distance between subspaces of eigenvectors, and eigenvectors are scale-invariant while these metrics are not.

Appendix I. Case Study: Integrative Genomics of Alzheimer's Disease

The ROSMAP data originally contained 309 miRNAs, 41,809 genes, and 420,132 CpG sites, so we aggressively preprocessed the RNASeq and methylation data sets to reduce the number of features to manageable sizes. First, we transformed the methylation data to m-values and log-transformed the RNASeq counts, as is common in most analyses for these data types. Then, we removed batch (experimental) effects from both data sets via ComBat (Johnson et al., 2007). We next filtered the features by taking those with the highest variance (top 20,000 genes for RNASeq and top 50,000 CpG sites for methylation). Then, we performed univariate filtering and kept the features with the highest association with the clinician's diagnosis. This left us with p_1 = 309, p_2 = 900, and p_3 = 1250 features in the miRNA, RNASeq, and methylation data sets, respectively. R code can be found at https://github.com/DataSlingers/iPCA.

References

H. Abdi, L. J. Williams, and D. Valentin. Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdisciplinary Reviews: Computational Statistics, 5(2):149-179, 2013.

E. Acar, E. E. Papalexakis, G. Gürdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro. Structure-revealing data fusion. BMC Bioinformatics, 15(1):239, Jul 2014. ISSN 1471-2105. doi: 10.1186/1471-2105-15-239.
URL https://doi.org/10.1186/ 1471-2105-15-239. G. I. Allen and R. Tibshirani. Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics, 4(2):764 790, 2010. O. Alter, P. O. Brown, and D. Botstein. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences, 100(6):3351 3356, 2003. K. Benidis, Y. Sun, P. Babu, and D. P. Palomar. Orthogonal sparse PCA and covariance estimation via procrustes reformulation. IEEE Transactions on Signal Processing, 64(23): 6211 6226, 2016. URL http://www.danielppalomar.com/publications.html. O. Carrette, I. Demalte, A. Scherl, O. Yalkinoglu, G. Corthals, P. Burkhard, D. F. Hochstrasser, and J.C. Sanchez. A panel of cerebrospinal fluid potential biomarkers for the diagnosis of Alzheimer s disease. Proteomics, 3(8):1486 1494, 2003. A.P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265 274, 1981. P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2):105 123, 1999. B. Escofier and J. Pages. Multiple factor analysis. Computational Statistics & Data Analysis, 18(1):121 140, 1994. I. Espuny-Camacho, A. M. Arranz, M. Fiers, A. Snellinx, K. Ando, S. Munck, J. Bonnefont, L. Lambot, N. Corthout, L. Omodho, et al. Hallmarks of Alzheimer s disease in stemcell-derived human neurons transplanted into mouse brain. Neuron, 93(5):1066 1081, 2017. J. Fan, D. Wang, K. Wang, and Z. Zhu. Distributed estimation of principal eigenspaces. Ar Xiv e-prints, February 2017. M. Ghil and P. Malanotte-Rizzoli. Data assimilation in meteorology and oceanography. In Advances in Geophysics, volume 33, pages 141 266. Elsevier, 1991. K. Greenewald and A. O. Hero. Robust kronecker product PCA for spatio-temporal covariance estimation. IEEE Transactions on Signal Processing, 63(23):6368 6378, 2015. Tang and Allen A. K. Gupta and D. K. Nagar. Matrix variate distributions. CRC Press, 1999. P. Han, W. Liang, L. C. Baxter, J. Yin, Z. Tang, T. G. Beach, R. J. Caselli, E. M. Reiman, and J. Shi. Pituitary adenylate cyclase activating polypeptide is reduced in Alzheimer disease. Neurology, 82(19):1724 1728, 2014. T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1): 3367 3402, 2015. R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012. C. J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330 2338, 2011. Sijia Huang, Kumardeep Chaudhary, and Lana X Garmire. More is better: recent progress in multi-omics data integration methods. Frontiers in genetics, 8:84, 2017. Haoming Jiang, Xinyu Fei, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman, Xingguo Li, and Tuo Zhao. huge: High-Dimensional Undirected Graph Estimation, 2019. URL https://CRAN.R-project.org/package=huge. R package version 1.3.2. W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118 127, 2007. Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. 
Journal of multivariate analysis, 88(2):365 411, 2004. Y. Li, Z. Chen, Y. Gao, G. Pan, H. Zheng, Y. Zhang, H. Xu, G. Bu, and H. Zheng. Synaptic adhesion molecule Pcdh-γC5 mediates synaptic dysfunction in Alzheimer s disease. Journal of Neuroscience, pages 1051 17, 2017. E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1):523, 2013. X. L. Meng and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267 278, 1993. ISSN 00063444. URL http: //www.jstor.org/stable/2337198. S. Mostafavi, C. Gaiteri, S. E. Sullivan, C. C. White, S. Tasaki, J. Xu, M. Taga, H. U. Klein, E. Patrick, V. Komashko, et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer s disease. Nature Neuroscience, 21(6):811 819, 2018. ISSN 1546-1726. URL https://doi.org/10.1038/ s41593-018-0154-9. S. P. Ponnapalli, M. A. Saunders, C. F. Van Loan, and O. Alter. A higher-order generalized singular value decomposition for comparison of global m RNA expression from multiple organisms. Plo S one, 6(12):e28072, 2011. Integrated Principal Components Analysis T. Rapcs ak. Geodesic convexity in nonlinear optimization. Journal of Optimization Theory and Applications, 69(1):169 183, Apr 1991. ISSN 1573-2878. doi: 10.1007/BF00940467. URL https://doi.org/10.1007/BF00940467. A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494 515, 2008. doi: 10.1214/08-EJS176. S. T. Shivappa, M. M. Trivedi, and B. D. Rao. Audiovisual information fusion in humancomputer interfaces and intelligent environments: A survey. Proceedings of the IEEE, 98 (10):1692 1715, 2010. A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 650 658. ACM, 2008. The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609, 2011. P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475 494, 2001. T. Tsiligkaridis, A. O. Hero, and S. Zhou. On convergence of kronecker graphical lasso algorithms. IEEE Transactions on Signal Processing, 61:1743 1755, 2013. N. K. Vishnoi. Geodesic Convex Optimization: Differentiation on Manifolds, Geodesics, and Convexity. Ar Xiv e-prints, June 2018. J. A. Westerhuis, T. Kourti, and J. F. Mac Gregor. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12(5):301 321, 1998. A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182, 2012. J. Yin and H. Li. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis, 107:119 140, 2012. Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315 323, 2015. doi: 10.1093/biomet/asv008. URL http://dx.doi.org/10.1093/biomet/asv008. H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617 1638, 2016. S. Zhou. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532 562, 2014a. S. Zhou. 
Supplement to "Gemini: Graph estimation with matrix variate normal instances". 2014b.