Journal of Machine Learning Research 22 (2021) 1-71. Submitted 1/20; Revised 4/21; Published 8/21.

Integrated Principal Components Analysis

Tiffany M. Tang (tiffany.tang@berkeley.edu), Department of Statistics, University of California, Berkeley, Berkeley, CA 94720, USA
Genevera I. Allen (gallen@rice.edu), Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA

Editor: Garvesh Raskutti

Abstract

Data integration, or the strategic analysis of multiple sources of data simultaneously, can often lead to discoveries that may be hidden in individualistic analyses of a single data source. We develop a new unsupervised data integration method named Integrated Principal Components Analysis (iPCA), which is a model-based generalization of PCA and serves as a practical tool to find and visualize common patterns that occur in multiple data sets. The key idea driving iPCA is the matrix-variate normal model, whose Kronecker product covariance structure captures both individual patterns within each data set and joint patterns shared by multiple data sets. Building upon this model, we develop several penalized (sparse and non-sparse) covariance estimators for iPCA, and using geodesic convexity, we prove that our non-sparse iPCA estimator converges to the global solution of a non-convex problem. We also demonstrate the practical advantages of iPCA through extensive simulations and a case study application to integrative genomics for Alzheimer's disease. In particular, we show that the joint patterns extracted via iPCA are highly predictive of a patient's cognition and Alzheimer's diagnosis.

Keywords: data integration, multi-view data, matrix-variate normal, dimension reduction, integrative genomics

©2021 Tiffany M. Tang and Genevera I. Allen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-084.html.

1. Introduction

The recent growth in both data volume and variety drives the need for principled data integration methods that can analyze multiple sources of data simultaneously. For instance, meteorologists must integrate data from satellites, ground-based sensors, and numerical models for forecasting (Ghil and Malanotte-Rizzoli, 1991). Audio and video are often combined for surveillance as well as speech recognition (Shivappa et al., 2010); and as new high-throughput technologies arise in biology, scientists are leveraging information from multiple genomic sources to better understand complex biological processes (Huang et al., 2017). By exploiting the commonalities and diversity of information from different data sets, data integration methods have the potential to provide a holistic and perhaps more realistic model of the phenomena at hand.

In this work, we focus on facilitating unsupervised learning tasks such as pattern recognition, dimension reduction, and visualization for integrated data in various applications. More specifically, we consider the multi-view data setting, where we observe multiple data sources (or matrices) with features of different types that are measured on the same set of samples.
Our goal in this setting is to develop a practical statistical data integration method that 1) leverages multiple data sources to discover and visualize dominant joint patterns among the samples that are common across all data sets; 2) generalizes the classical principal components analysis (PCA), thereby inheriting its nice properties including easily-interpretable visualizations, a unique solution, and nested, orthogonal components that can be quickly obtained all at once; and 3) has provable statistical and optimization guarantees to begin bridging the gap between the theory and practice of data integration.

To this end, we propose Integrated Principal Components Analysis (iPCA), which extends a model-based framework of classical Principal Components Analysis (PCA) to integrated data. As such, iPCA inherits the many advantages and familiar interpretations of PCA. The key idea behind iPCA is to leverage a new but natural connection between data integration and the matrix-variate normal model. This enables us to flexibly model a rich set of dependencies among features and samples simultaneously, as is often crucial in integrated data. Building upon this model, we develop novel penalized covariance estimators for iPCA, including a new geodesically convex penalty with both theoretical and practical advantages. While the main contributions of this work are methodological and applied in nature, we also begin to study the theoretical properties of our approach. In particular, we show that our non-sparse iPCA estimator converges to the global solution of a non-convex problem using geodesic convexity, and in the Appendix, we show that our sparse iPCA estimator consistently estimates the underlying joint subspace. Finally, through simulations and a careful case study application to integrative genomics for Alzheimer's disease, we demonstrate the superior empirical performance of iPCA for integrative exploratory data analysis, joint pattern recognition, and visualization. Specifically, in our case study, iPCA is able to identify joint patterns that are biologically meaningful and can separate patients by their cognitive capabilities and Alzheimer's diagnosis. This interesting finding and application to Alzheimer's disease not only demonstrates the practical utility of iPCA but also points to promising hypotheses for follow-up experiments in Alzheimer's research.

1.1 Related Work

Currently, existing data integration methods for unsupervised learning tend to fall into one of two categories, matrix factorizations or the multiblock PCA family, but each has major practical limitations that we address with iPCA. First, matrix factorizations, including coupled matrix and tensor (CMTF) factorizations (Singh and Gordon, 2008; Acar et al., 2014) and Joint and Individual Variation Explained (JIVE) (Lock et al., 2013), each solve an optimization problem to factorize the integrated data sets into a low-rank joint variation matrix, encoding the shared patterns, and a low-rank individual variation matrix, encoding the patterns specific to each data set. Matrix factorizations are well-liked due to their simplicity and computational feasibility, but a practical challenge with these methods is that they do not have a unique solution and heavily depend on the ranks of the factorized matrices. That is, the top factors from CMTF and JIVE are non-nested and can change drastically depending on the user-specified rank.
Since the optimal rank is almost always unknown and must be specified a priori, this poses significant interpretability challenges, as the practitioner could end up with many different, but equally valid, solutions corresponding to different choices of ranks and random seeds. On a related front, the generalized SVD (GSVD) (Alter et al., 2003; Ponnapalli et al., 2011) provides an exact matrix decomposition for integrated data that does not depend on the matrix rank. Nevertheless, it is limited in scope since the GSVD assumes each matrix has full row rank, excluding problems with both high- and low-dimensional data sets.

On the other hand, the multiblock PCA family, which includes methods such as Multiple Factor Analysis (MFA) (Escofier and Pages, 1994; Abdi et al., 2013) and Consensus PCA (Westerhuis et al., 1998), works to generalize PCA to the integrated data regime and does not suffer from the rank-dependence or interpretability issues of matrix factorizations. In these multiblock PCA methods, each data matrix is normalized according to a specific procedure, and then PCA is performed on the normalized concatenated data. However, these normalization schemes are often ad hoc, and it is unclear which scheme works best and in which situation. Closely related to this is Distributed PCA (Fan et al., 2017), which integrates data that are stored across multiple servers and implicitly assumes that the data are i.i.d. across the different servers. This assumption differs from our target setting, where we allow for heterogeneity among the different sources (or servers in the distributed context). To date, an unsupervised data integration method which both generalizes PCA and automatically determines the best way to normalize for the different scales and signal strengths between sources does not exist. Motivated by these limitations, we develop iPCA as a proper generalization of PCA, thus inheriting its many properties and advantages (e.g., a unique solution, easily-interpretable visualizations, and nested, orthogonal principal components) unlike the matrix factorizations; and unlike the multiblock PCA methods, iPCA automatically adjusts for the different scales and signals between data sources without the need for a specified normalization scheme. The two main building blocks of iPCA are PCA and the matrix-variate normal distribution, which we review next.

1.2 Principal Components Analysis

Given a column-centered data set X ∈ R^{n×p} with n samples and p features, recall that PCA finds orthogonal directions v_1, . . . , v_m ∈ R^p which maximize the variance with respect to the feature covariance matrix Δ ∈ S^p_{++}, where S^p_{++} is the set of p × p positive definite matrices. That is, for each j = 1, . . . , m,

v_j = argmax_{v ∈ R^p} v^T Δ v subject to v^T v = 1, v^T v_i = 0 for all i < j.  (1)

It is well-known that the PC loading v_j is the eigenvector of Δ with the jth largest eigenvalue, and its corresponding PC score is u_j := X v_j. In practice, since the population covariance Δ is typically unknown, the sample version of PCA plugs in an estimate ˆΔ := (1/n) X^T X for Δ in (1). It follows that the PC loadings are the eigenvectors of ˆΔ, and the PC scores are the scaled eigenvectors of ˆΣ := (1/p) X X^T. To later establish the link between iPCA and PCA, we point out that ˆΔ is the maximum likelihood estimator (MLE) of Δ under the multivariate normal model x_1, . . . , x_n ~iid N(0, Δ), and ˆΣ is the MLE of Σ under x_{·1}, . . . , x_{·p} ~iid N(0, Σ), where x_i is the ith row of X and x_{·j} is the jth column of X. Thus, there is a dual row/column interpretation of the PCA model.
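To make this dual row/column interpretation concrete, the following small numpy sketch (our own illustration, not from the paper) checks that the PC scores X v_j computed from the eigenvectors of ˆΔ = (1/n) X^T X coincide, up to scaling, with the eigenvectors of ˆΣ = (1/p) X X^T:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)              # column-center the data

# PC loadings: eigenvectors of Delta_hat = X^T X / n
Delta_hat = X.T @ X / n
_, V = np.linalg.eigh(Delta_hat)
V = V[:, ::-1]                      # sort columns by decreasing eigenvalue

# PC scores: project the data onto the loadings, u_j = X v_j
U_scores = X @ V

# Dual view: the scores are (scaled) eigenvectors of Sigma_hat = X X^T / p
Sigma_hat = X @ X.T / p
_, U_dual = np.linalg.eigh(Sigma_hat)
U_dual = U_dual[:, ::-1]

# Each score vector is proportional to the corresponding eigenvector of Sigma_hat
j = 0
corr = np.abs(np.corrcoef(U_scores[:, j], U_dual[:, j])[0, 1])
print(f"|correlation| between X v_{j} and eigenvector {j} of Sigma_hat: {corr:.6f}")  # ~1
```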
While this is not the only way of viewing PCA, this formulation illustrates two points which we explore further in iPCA: (1) PCA finds linear projections of the data that maximize the variance under a multivariate normal model; (2) eigenvectors correspond to the dominant (or variance-maximizing) patterns in the data.

1.3 Matrix-variate Normal Model

Laid out in Gupta and Nagar (1999) and Dawid (1981), the matrix-variate normal distribution is an extension of the multivariate normal distribution such that the matrix is the unit of study. Formally, we say X ∈ R^{n×p} follows a matrix-variate normal distribution and write X ~ N_{n,p}(M, Σ ⊗ Δ) if vec(X^T) follows a multivariate normal distribution with a Kronecker product covariance structure, vec(X^T) ~ N(vec(M^T), Σ ⊗ Δ). Here, vec(X) ∈ R^{np} is the column vector formed by stacking the columns of X below one another. We call M ∈ R^{n×p} the mean matrix, Σ ∈ S^n_{++} the row covariance matrix, and Δ ∈ S^p_{++} the column covariance matrix. Put differently, the row covariance matrix Σ encodes the dependencies between rows of X while the column covariance matrix Δ encodes the dependencies among columns, i.e., X_{i·} ~ N(M_{i·}, Σ_{ii} Δ) and X_{·j} ~ N(M_{·j}, Δ_{jj} Σ). It can also be shown that if Σ = I and M = 0, we are in the familiar multivariate normal setting, x_1, . . . , x_n ~iid N(0, Δ), and if Δ = I and M = 0, then x_{·1}, . . . , x_{·p} ~iid N(0, Σ). The matrix-variate normal model, however, is far more general than the multivariate normal. While the multivariate normal can only model relationships between elements of a single row or a single column of X, the matrix-variate normal can model relationships between elements from different rows and columns. With this level of flexibility, the matrix-variate normal has proven to be a versatile tool in various contexts such as graphical models (Yin and Li, 2012; Tsiligkaridis et al., 2013; Zhou, 2014a), spatio-temporal models (Greenewald and Hero, 2015), and transposable models (Allen and Tibshirani, 2010). Our work on iPCA is the first to consider the matrix-variate normal model in light of data integration.
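The Kronecker covariance structure is also easy to simulate from, which helps clarify what the model asserts. The sketch below (our own illustration) uses the standard construction X = M + Σ^{1/2} Z Δ^{1/2} with i.i.d. standard normal Z, and empirically checks that Cov(vec(X^T)) ≈ Σ ⊗ Δ:

```python
import numpy as np

def sample_matrix_normal(M, Sigma, Delta, rng):
    """Draw X ~ N_{n,p}(M, Sigma (x) Delta) via X = M + Sigma^{1/2} Z Delta^{1/2}."""
    A = np.linalg.cholesky(Sigma)          # Sigma = A A^T
    B = np.linalg.cholesky(Delta)          # Delta = B B^T
    Z = rng.standard_normal(M.shape)       # iid N(0, 1) entries
    return M + A @ Z @ B.T

rng = np.random.default_rng(1)
n, p = 4, 3
Sigma = np.eye(n) + 0.5 * np.ones((n, n))  # row covariance
Delta = np.diag([1.0, 2.0, 3.0])           # column covariance
M = np.zeros((n, p))

# Empirically check Cov(vec(X^T)) ~= Sigma (x) Delta
draws = np.stack([sample_matrix_normal(M, Sigma, Delta, rng).reshape(-1)
                  for _ in range(50_000)])          # reshape(-1) stacks rows, i.e. vec(X^T)
emp_cov = np.cov(draws, rowvar=False)
print(np.max(np.abs(emp_cov - np.kron(Sigma, Delta))))  # small; shrinks with more draws
```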
1.4 Outline

Building upon the matrix-variate normal model, we introduce iPCA as a proper generalization of PCA to the integrated data regime in Section 2. In Section 3, we discuss covariance estimation methods for iPCA and begin to study some of their theoretical properties. We then demonstrate the strong empirical performance of iPCA in Section 4 through simulations and a real data application to integrative genomics for Alzheimer's disease. Finally, we conclude with a discussion of iPCA in Section 5.

2. Integrated PCA

At its core, iPCA, like PCA, is an unsupervised tool for exploratory data analysis, pattern recognition, and visualization. Unlike PCA, however, iPCA aims to extract the dominant joint patterns which are common to multiple data sets, not necessarily the variance-maximizing patterns, since the latter might be specific to one data set. These joint patterns are typically of considerable interest to practitioners, as their common occurrence in multiple data sets may point to some foundational mechanism or structure. For instance, scientists may be more interested in uncovering the patterns (or clusters) of patients who have similar gene expression levels and miRNA expression levels than those patients with similar gene expression levels alone.

Figure 1: Coupled matrices X_1, X_2, X_3 with n = 200, p_1 = 300, p_2 = 500, p_3 = 400 were simulated from the iPCA model (2). Here, Σ and Δ_1, Δ_2, Δ_3 were taken to be as in the base simulation described in Section 4.1. (A) plots the top two eigenvectors of Σ. In separate PCA analyses (B-D), the individual signal in each data set masks the true joint signal, but (E) iPCA (using the multiplicative Frobenius estimator) exploits the integrated data structure and recovers the true joint signal.

Figure 1 illustrates a motivating example for when iPCA is advantageous. In the example, strong dependencies among features obscure the true joint patterns among the samples, so that the true joint signal is not the variance-maximizing direction. As a result, applying PCA separately to each of the data sets (panels B-D) fails to reveal the joint signal. iPCA can better recover the joint signal because it exploits the known integrated data structure and extracts the shared information among all three data sets simultaneously.

Generally speaking, iPCA finds these joint patterns by modeling the dependencies between and within data sets via the matrix-variate normal model. The inherent Kronecker product covariance structure enables us to decompose the total covariance of each data matrix into two components: an individual column covariance structure which is unique to each data set and a joint row covariance structure which is shared among all data sets. The joint row covariance structure is our primary interest, and maximizing this joint variation will yield the dominant patterns which are common across all data sets. In the following sections, we introduce iPCA and provide interpretations and intuition into the model.

2.1 Population Model of iPCA

Suppose we observe K coupled data matrices, X_1, . . . , X_K, of dimensions n × p_1, . . . , n × p_K, where n is the number of samples and p_k is the number of features in X_k. Throughout this paper, we let p := ∑_{k=1}^K p_k and X := [X_1, . . . , X_K]. Since iPCA is primarily interested in the data variation, let us assume that each data matrix X_k has been mean-centered so that each column of X_k has a mean of 0. Suppose also that all of the data matrices are measured on the same n samples and that the rows of the X_k are perfectly aligned (see Figure 2).

Figure 2: Integrated Data Setting for iPCA: We observe K different but coupled data matrices, each with a distinct set of features that are measured on the same set of n samples. Assume that the rows align.

Under the iPCA model, we assume that each data set X_k arises from a matrix-variate normal distribution,

X_k ~ N_{n,p_k}(0, Σ ⊗ Δ_k) independently, (k = 1, . . . , K)  (2)

where Σ is an n × n row covariance matrix that is jointly shared by all data matrices, and Δ_k is a p_k × p_k column covariance matrix that is specific to X_k. We next provide additional intuition into the model parameters Σ, Δ_1, . . . , Δ_K and unpack the iPCA modeling assumptions.

Feature Dependencies Δ_k: By properties of the matrix-variate normal, we can interpret Δ_k as describing the dependence structure among features in X_k, giving rise to feature patterns unique to X_k. As a consequence of the iPCA model in (2), this feature (or column) dependence is revealed as

[X_k]_{i·} ~ N(0, Σ_{ii} Δ_k), (k = 1, . . . , K, i = 1, . . . , n).  (3)

In other words, each sample or row in X_k has its own variance scaling factor given by Σ_{ii}, while all dependencies among the features are captured by Δ_k.
Sample Dependencies Σ: Analogously, we can interpret Σ as describing the common row dependence structure, corresponding to patterns among the samples that are shared by all K data sets. According to the iPCA model in (2), the sample (or row) dependence manifests itself as

[X_k]_{·j} ~ N(0, [Δ_k]_{jj} Σ), (k = 1, . . . , K, j = 1, . . . , p_k).  (4)

Here, each feature or column in X_k gets its own variance scaling factor [Δ_k]_{jj}, while all dependencies among the samples are given by Σ.

iPCA Through a Whitening Lens: In addition to these marginal interpretations of Σ and Δ_k, it is also important to gain intuition into how Σ and Δ_k interact within the iPCA model. The simplest way to do so is through a whitening perspective, where we note that the iPCA model in (2) can be equivalently rewritten as follows: for each k = 1, . . . , K,

Σ^{-1/2} X_k Δ_k^{-1/2} = Z_k, where [Z_k]_{ij} ~iid N(0, 1).  (5)

This is to say that after whitening each data matrix X_k of both its feature dependencies in Δ_k and the shared row dependencies in Σ, the remainder is distributed i.i.d. from a standard normal distribution. Thus, all dependencies in X_k must be captured by Σ and Δ_k. This assumption may seem strong at first glance, but it is a simple generalization of the usual assumptions in PCA. To see this, notice another equivalence relation to the iPCA model (2) based on whitening: for each k = 1, . . . , K,

Y_k := X_k Δ_k^{-1/2} ~ N_{n,p_k}(0, Σ ⊗ I_{p_k}) independently,  (6)

or equivalently, in vector form,

[X_k Δ_k^{-1/2}]_{·j} ~iid N(0, Σ) for each j = 1, . . . , p_k.  (7)

That is, in the idealistic scenario when the population covariances are known, the iPCA model simply states that each data set X_k, after being whitened of its feature dependencies Δ_k, follows a normal distribution with the common row covariance Σ. This is not a new assumption. In fact, it is the primary modeling assumption if we were to apply classical PCA to the concatenated centered and whitened data [Y_1, . . . , Y_K]. Further, like PCA, while iPCA assumes and works best with normally-distributed data, it is still practically effective in the non-normal regime. We provide empirical evidence of this robustness in Section 4.

It is important to note, however, that when Δ_k ≠ I, as in most cases with real data, the underlying iPCA model greatly differs from that of applying PCA to the concatenated un-whitened data (henceforth referred to as concatenated PCA). While the normal assumption in iPCA is often insignificant in practice, it is crucial to account for Δ_k in the integrated data model. Intuitively, the feature dependencies Δ_k in iPCA can be viewed as nuisance parameters that obscure the sample dependencies Σ, and vice versa, so ignoring the feature dependencies completely, as in concatenated PCA, can be extremely problematic. The challenge here, though, is that Δ_k and Σ are both unknown in practice and must be estimated. In the setting where we only observe one data set X_1, it is not possible to distinguish or estimate both Δ_1 and Σ, but with multiple observed data sets X_1, . . . , X_K, we can and should leverage the additional data to distinguish between the feature and sample dependencies. As we will see later, this is an upshot of iPCA: namely, iPCA models and estimates both Δ_k and Σ concurrently, exploiting the integrated structure while also accounting for their mutual nuisance effects.
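The whitening equivalence (5)-(7) can also be checked numerically: whitening a matrix drawn from model (2) by Σ^{-1/2} and Δ_k^{-1/2} leaves entries that behave like i.i.d. standard normals. The following is a small illustrative numpy check (ours, not from the paper):

```python
import numpy as np

def sym_mat_power(A, power):
    """Symmetric matrix power (e.g. -0.5 for the inverse square root) via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** power) @ Q.T

rng = np.random.default_rng(2)
n, p_k = 5, 4
Sigma = np.eye(n) + 0.3 * np.ones((n, n))       # shared row covariance
Delta_k = np.diag(np.arange(1.0, p_k + 1.0))    # feature covariance of data set k

S_half, D_half = sym_mat_power(Sigma, 0.5), sym_mat_power(Delta_k, 0.5)
S_whiten, D_whiten = sym_mat_power(Sigma, -0.5), sym_mat_power(Delta_k, -0.5)

# Simulate X_k = Sigma^{1/2} Z_k Delta_k^{1/2}, then whiten it as in (5)
whitened = np.stack([
    (S_whiten @ (S_half @ rng.standard_normal((n, p_k)) @ D_half) @ D_whiten).ravel()
    for _ in range(50_000)
])
emp_cov = np.cov(whitened, rowvar=False)
print(np.abs(emp_cov - np.eye(n * p_k)).max())  # close to 0: whitened entries ~ iid N(0, 1)
```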
Now, under these modeling assumptions from the iPCA model in (2), which we have seen to be generalizations of the classical PCA assumptions, iPCA extends the variance-maximizing ideas of PCA and achieves its objective of finding the dominant joint and individual patterns in the data by maximizing the joint row covariance Σ and the individual column covariances Δ_1, . . . , Δ_K simultaneously. Namely, for k = 1, . . . , K, iPCA solves

u_i = argmax_{u ∈ R^n} u^T Σ u subject to u^T u = 1, u^T u_l = 0 for all l < i, (i = 1, . . . , n)  (8)
v_j^k = argmax_{v ∈ R^{p_k}} v^T Δ_k v subject to v^T v = 1, v^T v_l^k = 0 for all l < j, (j = 1, . . . , p_k)  (9)

for which we know the solution to be given by the eigendecompositions of Σ and Δ_k, respectively. That is, u_i is the eigenvector of Σ with the ith largest eigenvalue, and v_j^k is the eigenvector of Δ_k with the jth largest eigenvalue. Most notably, u_1 maximizes the joint variation and is interpreted as the most dominant pattern among the samples which occurs in all K data sets. We call the columns of U := [u_1, . . . , u_n] the integrated principal component (iPC) scores and the columns of V_k := [v_1^k, . . . , v_{p_k}^k] the iPC loadings for the kth data set. Note that though the population covariances in (2) are not identifiable (e.g., Σ ⊗ Δ_k = (cΣ) ⊗ (c^{-1} Δ_k) for any c > 0), the iPC scores and loadings are identifiable since eigenvectors are scale-invariant. Since we are often most interested in the joint patterns, we primarily plot the iPC scores U to visualize the joint patterns in sample space. To visualize the individual feature patterns from the kth data set, we can also plot the iPC loadings V_k.

2.2 Sample Version of iPCA

To perform iPCA in practice, we must typically plug in estimators ˆΣ and ˆΔ_k for Σ and Δ_k, since the population covariances in (2) are almost always unknown. We summarize the sample version of iPCA as follows:

1. Model each (column-centered) data set via a matrix-variate normal model: X_k ~ N_{n,p_k}(0, Σ ⊗ Δ_k), k = 1, . . . , K.
2. Estimate the covariance matrices Σ and Δ_1, . . . , Δ_K simultaneously to obtain ˆΣ and ˆΔ_1, . . . , ˆΔ_K. Methods for covariance estimation will be discussed in Section 3.
3. Compute the eigenvectors, say ˆU = eigenvectors of ˆΣ and ˆV_k = eigenvectors of ˆΔ_k. We interpret ˆU as the dominant joint patterns in sample space and ˆV_k as the dominant patterns in feature space which are specific to X_k.
4. Visualize and explore the dominant joint patterns by plotting the iPC scores ˆU and the dominant individual patterns by plotting the iPC loadings ˆV_k.

2.2.1 Variance Explained by iPCA

After performing iPCA, we can also interpret the signal in the obtained iPCs through a notion of variance explained, analogous to that in PCA.

Definition 1 Assume that X_k has been column-centered. We define the cumulative proportion of variance explained in data set X_k by the top m iPCs to be

PVE_{k,m} := ‖(U^{(m)})^T X_k V_k^{(m)}‖_F^2 / ‖X_k‖_F^2,  (10)

where U^{(m)} = [u_1, . . . , u_m] are the top m iPC scores, and V_k^{(m)} = [v_1^k, . . . , v_m^k] are the top m iPC loadings associated with X_k.

Definition 2 The marginal proportion of variance explained in data set X_k by the mth iPC is defined as MPVE_{k,m} := PVE_{k,m} − PVE_{k,m−1}.

We verify in Appendix A that PVE_{k,m} is a proportion and monotonically increasing as m increases. Aside from being well-defined, we also show in Appendix A that (10) generalizes the cumulative proportion of variance explained in PCA and hence is a natural definition.
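Steps 2-4 of the sample procedure and the PVE of Definition 1 are straightforward to express in code once covariance estimates are in hand. The numpy sketch below is our own illustration; the naive plug-in covariances are placeholders standing in for the penalized estimators developed in Section 3:

```python
import numpy as np

def ipca_components(Sigma_hat, Delta_hats):
    """Step 3 of the sample iPCA procedure: eigenvectors of the covariance estimates."""
    def eig_desc(A):
        _, Q = np.linalg.eigh(A)
        return Q[:, ::-1]                        # columns sorted by decreasing eigenvalue
    U_hat = eig_desc(Sigma_hat)                  # joint iPC scores (sample patterns)
    V_hats = [eig_desc(D) for D in Delta_hats]   # individual iPC loadings per data set
    return U_hat, V_hats

def cumulative_pve(Xk, U_hat, Vk_hat, m):
    """Definition 1: proportion of variance in X_k explained by the top m iPCs."""
    Um, Vm = U_hat[:, :m], Vk_hat[:, :m]
    return np.linalg.norm(Um.T @ Xk @ Vm, "fro") ** 2 / np.linalg.norm(Xk, "fro") ** 2

# Toy demo with naive placeholder covariance estimates (not the penalized estimators)
rng = np.random.default_rng(3)
n, dims = 30, [8, 12]
Xs = [rng.standard_normal((n, p)) for p in dims]
Xs = [X - X.mean(axis=0) for X in Xs]                    # column-center
Sigma_hat = sum(X @ X.T for X in Xs) / sum(dims)         # naive joint row covariance
Delta_hats = [X.T @ X / n for X in Xs]                   # naive individual column covariances

U_hat, V_hats = ipca_components(Sigma_hat, Delta_hats)
for k, (X, Vk) in enumerate(zip(Xs, V_hats), start=1):
    print(f"X{k}: PVE by top 2 iPCs = {cumulative_pve(X, U_hat, Vk, 2):.3f}")
```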
Note, however, that unlike PCA, it may be that MPVE_{k,m+1} > MPVE_{k,m} in iPCA. This is because iPCA does not maximize the total variance. For example, if MPVE_{1,2} > MPVE_{1,1}, this simply means that data set X_1 contributed more variation to the joint pattern in iPC2 than in iPC1.

2.3 Connections to Existing Methods

Throughout our development of iPCA, we have established several connections between iPCA and PCA, which demonstrate that iPCA is indeed a natural extension of PCA. We also find it instructive to draw on connections between iPCA and other existing data integration methods to develop an even deeper understanding of iPCA.

2.3.1 Relationship to Multiblock PCA Family

As discussed in Abdi et al. (2013), multiblock PCA methods reduce to performing PCA on the normalized concatenated data X̃ = [X̃_1, . . . , X̃_K], where each X_k has been normalized to X̃_k according to some procedure. We will show later in Proposition 6 that performing PCA on the unnormalized concatenated data (i.e., concatenated PCA) is a special case of iPCA, where we assume Δ_k = I for each k. Proposition 6 can also be easily extended to show that multiblock PCA methods are a special case of iPCA for some fixed Δ_1, . . . , Δ_K, where the exact form of Δ_k depends on the normalization procedure. For example, since MFA normalizes X_k by dividing all of its entries by its largest singular value σ_{max,k}, MFA is a special case of iPCA where each Δ_k = σ_{max,k} I. Put differently, MFA assumes [X_k σ_{max,k}^{-1/2}]_{·j} ~iid N(0, Σ), whereas iPCA assumes [X_k Δ_k^{-1/2}]_{·j} ~iid N(0, Σ). This gives rise to another interpretation of iPCA: iPCA is a generalization and unifying framework for the entire multiblock PCA family. However, while the multiblock PCA methods assume that Δ_k takes a specific form, iPCA does not impose any restrictions on the form of Δ_k and instead freely estimates the full Δ_k matrix simultaneously with Σ. In doing so, iPCA acts as an automatic way of normalizing for the different scales and signals between data sources. Without appropriate normalization, the estimated principal components from multiblock PCA methods such as concatenated PCA will be biased, as quantified by the following proposition.

Proposition 3 Suppose that X_k ~ N_{n,p_k}(0, Σ ⊗ Δ_k) for each k = 1, . . . , K, and define X = [X_1, . . . , X_K], Δ = diag(Δ_1, . . . , Δ_K), and Z = [Z_1, . . . , Z_K], where Z_k is as defined in (5) previously. If X Δ^{-1/2} = U D V^T and X = Ũ D̃ Ṽ^T are their respective compact SVDs, then U and Ũ denote the iPCA and concatenated PCA scores, respectively, and

U − Ũ = Σ^{1/2} Z (V D^{-1} − Δ^{1/2} Ṽ D̃^{-1}).  (11)

Moreover, the iPCA and concatenated PCA scores are equal if and only if V D^{-1} − Δ^{1/2} Ṽ D̃^{-1} is in the nullspace of Z or Δ^{1/2} Ṽ = V D^{-1} D̃.

Proposition 3 implies that if the feature dependencies, captured by Δ_k, rotate the data X_k in a way that does not obscure the joint row patterns, then concatenated PCA will work adequately. However, for most situations and examples of Δ_k that we expect to find in real data, this is unlikely to occur, and the resulting eigenvectors of concatenated PCA will be severely biased. To avoid this bias, both the sample and feature dependencies must be accounted for in the estimation procedure, as done in iPCA.
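As a concrete example of a fixed-Δ_k normalization scheme, here is a short sketch (ours, with arbitrary illustrative scales) of MFA-style preprocessing as described above: divide each block by its largest singular value, then run ordinary PCA on the concatenation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, dims, scales = 40, [15, 25, 10], [1.0, 10.0, 0.1]       # blocks on very different scales
Xs = [rng.standard_normal((n, p)) * s for p, s in zip(dims, scales)]
Xs = [X - X.mean(axis=0) for X in Xs]

# MFA-style normalization: divide each block by its largest singular value,
# i.e. fix Delta_k up front rather than estimating it as iPCA does
Xs_norm = [X / np.linalg.svd(X, compute_uv=False)[0] for X in Xs]

# ...then ordinary PCA on the normalized concatenated data
X_cat = np.hstack(Xs_norm)
U, D, Vt = np.linalg.svd(X_cat, full_matrices=False)
mfa_scores = U[:, :2]                  # top two sample-pattern directions
print(mfa_scores.shape)                # (40, 2)
```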
2.3.2 Relationship to Matrix Factorizations

Coupled matrix factorizations (CMF) decompose each data set X_k ∈ R^{n×p_k} into the product of a low-rank joint factor U ∈ R^{n×m} and a low-rank individual factor V_k ∈ R^{m×p_k} so that X_k ≈ U V_k (Singh and Gordon, 2008; Acar et al., 2014). This approximate factorization is related to iPCA in that our matrix-variate normal model assumes a similar multiplicative structure. Specifically, from (5), the iPCA model can be viewed as X_k = Σ^{1/2} Z_k Δ_k^{1/2}, where Z_k is a standard normal random matrix. Moreover, an argument similar to Theorem 2 from Hastie et al. (2015) shows that one solution of the CMF optimization problem (with ℓ2 penalties) is the solution of concatenated PCA and thereby a special case of iPCA. Despite this connection, however, there is a fundamental difference between CMF and iPCA. On the one hand, CMF assumes X_k can be approximated by a low-rank matrix, and the estimation of the CMF factors actively depends on the pre-specified rank m. On the other hand, the rank of X_k plays absolutely no role in the iPCA assumptions or the estimation step of iPCA. Consequently, the joint and individual CMF factors can change drastically depending on the pre-specified rank, unlike in iPCA. CMF also does not have a unique solution nor does it enforce orthogonal components, whereas iPCA gives unique, nested, orthogonal components that can be interpreted in the same way as in PCA.

In contrast to the multiplicative models of CMF and iPCA, JIVE, which has been commonly used in integrative genomics, assumes an additive model and decomposes coupled data into the sum of a low-rank joint variation matrix, a low-rank individual variation matrix, and an error matrix (Lock et al., 2013). Additive and multiplicative models, being quite different, are each advantageous in different situations, but as with CMF, the estimation of JIVE depends on the pre-specified ranks of its factors and results in non-nested, rank-dependent joint and individual components.

3. Covariance Estimators for iPCA

We next return to address the covariance estimation step when fitting the iPCA model to data. In Section 3.1, we consider the traditional maximum likelihood approach but find that it suffers from substantial limitations in the integrated data setting. These limitations ultimately drive the need for new estimators, which we develop in Section 3.2.

3.1 Unpenalized Maximum Likelihood Estimators

Guided by the formulation of PCA in Section 1.2, we instinctively try to estimate the iPCA population covariances Σ and Δ_1, . . . , Δ_K via maximum likelihood estimation. Under the iPCA population model (2), the log-likelihood function reduces to

ℓ(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = p log|Σ^{-1}| + n ∑_{k=1}^K log|Δ_k^{-1}| − ∑_{k=1}^K tr(Σ^{-1} X_k Δ_k^{-1} X_k^T),  (12)

so by taking partial derivatives of (12) with respect to each parameter, we obtain

Lemma 4 The unpenalized MLEs of Σ and Δ_1, . . . , Δ_K satisfy

ˆΣ = (1/p) ∑_{k=1}^K X_k ˆΔ_k^{-1} X_k^T,  (13)
ˆΔ_k = (1/n) X_k^T ˆΣ^{-1} X_k, k = 1, . . . , K.  (14)
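The fixed-point equations (13)-(14) suggest a simple alternating scheme. The sketch below (ours, for illustration) iterates them directly; a small ridge term is added because, as the discussion that follows shows, the unpenalized MLE often fails to exist, and the raw updates can become singular.

```python
import numpy as np

def unpenalized_flipflop(Xs, n_iter=10, ridge=1e-8):
    """Alternate the fixed-point equations (13)-(14) of Lemma 4 (illustration only).

    Per Theorem 5, the unpenalized MLE does not exist for column-centered or n > p data,
    so a small ridge keeps the intermediate matrices invertible."""
    n = Xs[0].shape[0]
    p = sum(X.shape[1] for X in Xs)
    Delta_invs = [np.eye(X.shape[1]) for X in Xs]
    for _ in range(n_iter):
        Sigma = sum(X @ Di @ X.T for X, Di in zip(Xs, Delta_invs)) / p      # eq (13)
        Sigma_inv = np.linalg.inv(Sigma + ridge * np.eye(n))
        Deltas = [X.T @ Sigma_inv @ X / n for X in Xs]                      # eq (14)
        Delta_invs = [np.linalg.inv(D + ridge * np.eye(D.shape[0])) for D in Deltas]
    return Sigma, Deltas

rng = np.random.default_rng(5)
Xs = [rng.standard_normal((50, 8)), rng.standard_normal((50, 12))]
Sigma_hat, Delta_hats = unpenalized_flipflop(Xs)
print(Sigma_hat.shape, [D.shape for D in Delta_hats])
```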
However, with only one matrix observation per matrix-variate normal model in the iPCA context, existence of the MLE is not guaranteed. In fact, the following theorem essentially implies that the MLE does not exist for all practical purposes.

Theorem 5 (i) Suppose that X_k has been column-centered so that each column has a mean of 0 for each k = 1, . . . , K. Then the unpenalized MLEs for Σ and Δ_1, . . . , Δ_K are not positive definite and hence do not exist. (ii) Suppose that X_k has not been column-centered but that rank(X) = n and rank(X_k) = p_k for k = 1, . . . , K. (a) If n = p_k for some k = 1, . . . , K, then the unpenalized log-likelihood function ℓ(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) is unbounded. (b) If the unpenalized log-likelihood function ℓ(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) is bounded, then the unpenalized MLEs for Σ, Δ_1, . . . , Δ_K exist.

The proof of Theorem 5 also shows that if the unpenalized MLEs for Σ and Δ_1, . . . , Δ_K exist, then p_k ≠ n ≤ p for each k = 1, . . . , K. Thus, in summary, if p_k = n for some k, n > p, or the data matrices have been column-centered to have mean 0, then the unpenalized MLEs do not exist. These severe restrictions motivate new covariance estimators. For example, one alternative but naive approach is to estimate Σ and Δ_k by setting their counterparts to I.

Proposition 6 (i) The MLE for Δ_k, assuming that Σ = I, is

ˆΔ_k = (1/n) X_k^T X_k.  (15)

(ii) The MLE for Σ, assuming Δ_k = I for all k = 1, . . . , K, is

ˆΣ = (1/p) ∑_{k=1}^K X_k X_k^T = (1/p) X X^T.  (16)

This approach for estimating Σ is the familiar MLE for the concatenated data X, and hence, performing concatenated PCA is equivalent to a special case of iPCA where we set Δ_k = I for each k. While this formalizes the connection between PCA and iPCA, we will see in Section 4 that concatenated PCA performs poorly when the data sets are of different scales or when the feature dependencies are stronger than the sample dependencies. In the next section, we discuss more effective methods for estimating the iPCA covariances.

3.2 Penalized Maximum Likelihood Estimators

Given that the unpenalized MLEs do not exist for a large class of problems, one possible solution is to develop penalized MLEs, which solve

(ˆΣ^{-1}, ˆΔ_1^{-1}, . . . , ˆΔ_K^{-1}) = argmax_{Σ^{-1} ≻ 0, Δ_1^{-1} ≻ 0, . . . , Δ_K^{-1} ≻ 0} { p log|Σ^{-1}| + n ∑_{k=1}^K log|Δ_k^{-1}| − ∑_{k=1}^K tr(Σ^{-1} X_k Δ_k^{-1} X_k^T) − P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) }.  (17)

Similar to previous work on the penalized matrix-variate normal log-likelihood (Yin and Li, 2012; Allen and Tibshirani, 2010), we can apply an additive-type penalty and define the additive Lq iPCA penalty to be

P_q(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = λ_Σ ‖Σ^{-1}‖_q + ∑_{k=1}^K λ_k ‖Δ_k^{-1}‖_q,

where λ_Σ, λ_1, . . . , λ_K > 0 are tuning parameters. Though there are many potential choices of norm-penalties here, one natural choice is the additive Frobenius penalty, ‖·‖_q = ‖·‖_F^2, as it is a proper generalization of PCA. That is, performing iPCA with the additive Frobenius penalty is equivalent to PCA in the K = 1 case (see Theorem 1 in Allen and Tibshirani 2010). When K > 1, the additive Frobenius penalty induces a smoothness over the eigenvalues of the covariance matrices and returns a dense full-rank covariance estimator. In the sparse covariance setting, we can instead induce sparsity through the additive L1 penalty ‖·‖_q = ‖·‖_{1,off}, where ‖A‖_{1,off} = ∑_{i≠j} |A_{ij}|. Applying the additive L1 penalty to the inverse covariance matrix is common practice, but one can alternatively apply the additive L1 penalty to the inverse correlation matrix. This latter approach was adopted in Zhou (2014a) and Rothman et al. (2008). Given the plethora and well-studied nature of sparse graphical model estimation, we leverage existing ideas and tools to show that this approach, namely, the additive L1 penalty applied to the inverse correlation matrix, consistently estimates the true underlying joint subspace under certain conditions, at a rate depending on ∑_{k=1}^K p_k, max{s_Σ, 1}, and log(max{n, p_k}), where s_Σ is the number of non-zero off-diagonal entries in Σ^{-1}. However, due to the superior empirical performance of the iPCA Frobenius estimators over the sparse iPCA estimators (see Section 4), we leave the precise theorem statement and discussion of the L1 subspace consistency result to Appendix F.
Despite the popularity of additive-type penalties in the literature, an overarching downside of these existing penalties in the integrated data regime is that solving (17) with additive penalties is a non-convex problem, for which we can only guarantee convergence to a local solution. Nonetheless, though (17) is non-convex in Euclidean space, Wiesel (2012) showed that the matrix-variate normal log-likelihood is geodesically convex (g-convex) with respect to the manifold of positive definite matrices. G-convexity is a generalized notion of convexity on a Riemannian manifold, and like convexity, all local minima of g-convex functions are globally optimal. Exploiting this idea of g-convexity, we propose a novel type of penalty, named the multiplicative Frobenius iPCA penalty,

P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = ∑_{k=1}^K λ_k ‖Σ^{-1} ⊗ Δ_k^{-1}‖_F^2,

which we will show to be g-convex in Theorem 8. Note that since ‖A ⊗ B‖_F^2 = ‖A‖_F^2 ‖B‖_F^2, the multiplicative penalty can be rewritten as the product ‖Σ^{-1}‖_F^2 ∑_{k=1}^K λ_k ‖Δ_k^{-1}‖_F^2, giving rise to its name. Like the additive Frobenius iPCA estimator, the multiplicative Frobenius iPCA estimator is a shrinkage technique that returns a dense covariance estimate with smoothed eigenvalues when K > 1, and when K = 1, it is equivalent to PCA (see Appendix E).

Having introduced several different types of penalized iPCA covariance estimators, namely, the additive Frobenius estimator, multiplicative Frobenius estimator, additive L1 covariance estimator, and additive L1 correlation estimator, the question for practitioners becomes how to compute these estimators, how to select the penalty parameters, and which estimator to use in which situation. We discuss each in turn next.

3.2.1 Flip-flop Algorithms for iPCA Estimators

For each of the aforementioned penalties, we can compute the corresponding penalized MLEs via Flip-Flop algorithms (also known as block coordinate descent algorithms), which iteratively optimize over each of the parameters, one at a time, while keeping all other parameters fixed. These algorithms are derived fully in Appendix C.2, but in general, for the Frobenius penalties, each Flip-Flop update has a closed-form solution determined by a full eigendecomposition. For the L1 penalties (also known as the Kronecker Graphical Lasso, Tsiligkaridis et al., 2013), each update can be solved by the graphical lasso (Hsieh et al., 2011). We provide the multiplicative Frobenius Flip-Flop algorithm here in Algorithm 1, and as the other algorithms take similar forms, we leave them to Appendix C.2.

Algorithm 1 Flip-Flop Algorithm for Multiplicative Frobenius iPCA Estimators
1: Center the columns of X_1, . . . , X_K, and initialize ˆΣ, ˆΔ_1, . . . , ˆΔ_K to be positive definite.
2: while not converged do
3:   Take eigendecomposition: ∑_{k=1}^K X_k ˆΔ_k^{-1} X_k^T = U Γ U^T
4:   Regularize eigenvalues: Φ_ii = (1/(2p)) ( Γ_ii + √( Γ_ii^2 + 8p ∑_{k=1}^K λ_k ‖ˆΔ_k^{-1}‖_F^2 ) )
5:   Update ˆΣ^{-1} = U Φ^{-1} U^T
6:   for k = 1, . . . , K do
7:     Take eigendecomposition: X_k^T ˆΣ^{-1} X_k = V Φ V^T
8:     Regularize eigenvalues: Γ_ii = (1/(2n)) ( Φ_ii + √( Φ_ii^2 + 8n λ_k ‖ˆΣ^{-1}‖_F^2 ) )
9:     Update ˆΔ_k^{-1} = V Γ^{-1} V^T
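A compact numpy sketch of Algorithm 1 follows. It is our own illustration of the flip-flop structure, assuming the eigenvalue updates as written in the box above, and is not the authors' reference implementation; the convergence check here is a plain relative change in ˆΣ^{-1} (the paper additionally scales this by the square root of the mean penalty parameter, see Section 4).

```python
import numpy as np

def ipca_mult_frobenius(Xs, lams, max_iter=100, tol=1e-6):
    """Flip-flop updates for the multiplicative Frobenius iPCA estimator (Algorithm 1 sketch)."""
    Xs = [X - X.mean(axis=0) for X in Xs]                  # step 1: column-center
    n = Xs[0].shape[0]
    p = sum(X.shape[1] for X in Xs)
    Sigma_inv = np.eye(n)
    Delta_invs = [np.eye(X.shape[1]) for X in Xs]

    for _ in range(max_iter):
        Sigma_inv_old = Sigma_inv

        # Steps 3-5: update Sigma^{-1}
        S = sum(X @ Di @ X.T for X, Di in zip(Xs, Delta_invs))
        gam, U = np.linalg.eigh(S)
        c = sum(lam * np.linalg.norm(Di, "fro") ** 2 for lam, Di in zip(lams, Delta_invs))
        phi = (gam + np.sqrt(gam ** 2 + 8 * p * c)) / (2 * p)     # regularized eigenvalues of Sigma_hat
        Sigma_inv = U @ np.diag(1.0 / phi) @ U.T

        # Steps 6-9: update each Delta_k^{-1}
        for k, X in enumerate(Xs):
            phi_k, V = np.linalg.eigh(X.T @ Sigma_inv @ X)
            ck = lams[k] * np.linalg.norm(Sigma_inv, "fro") ** 2
            gam_k = (phi_k + np.sqrt(phi_k ** 2 + 8 * n * ck)) / (2 * n)  # eigenvalues of Delta_hat_k
            Delta_invs[k] = V @ np.diag(1.0 / gam_k) @ V.T

        rel = (np.linalg.norm(Sigma_inv - Sigma_inv_old, "fro")
               / np.linalg.norm(Sigma_inv_old, "fro"))
        if rel < tol:
            break

    Sigma_hat = U @ np.diag(phi) @ U.T
    return Sigma_hat, [np.linalg.inv(Di) for Di in Delta_invs]

# Toy usage: the iPC scores are the eigenvectors of the returned Sigma_hat
rng = np.random.default_rng(6)
Xs = [rng.standard_normal((30, 40)), rng.standard_normal((30, 25))]
Sigma_hat, Delta_hats = ipca_mult_frobenius(Xs, lams=[1.0, 1.0])
ipc_scores = np.linalg.eigh(Sigma_hat)[1][:, ::-1]
print(ipc_scores.shape)   # (30, 30)
```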
The following theorem guarantees numerical convergence of the Flip-Flop algorithms to a local solution for the multiplicative Frobenius, additive Frobenius, and additive L1 penalties.

Theorem 7 Suppose that the objective function in (17) is bounded below. Suppose also that either (i) P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) is a differentiable convex function with respect to each coordinate or (ii) P(Σ^{-1}, Δ_1^{-1}, . . . , Δ_K^{-1}) = P_0(Σ^{-1}) + ∑_{k=1}^K P_k(Δ_k^{-1}), where P_k is a (possibly non-differentiable) convex function for each k = 0, 1, . . . , K. If either (i) or (ii) holds, then the Flip-Flop algorithm corresponding to (17) converges to a stationary point of the objective.

However, building upon Wiesel (2012) and the notion of g-convexity, we can prove a far stronger result for the multiplicative Frobenius iPCA estimator.

Theorem 8 The multiplicative Frobenius iPCA estimator is jointly geodesically convex in Σ^{-1} and Δ_1^{-1}, . . . , Δ_K^{-1}. Because of this, the Flip-Flop algorithm for the multiplicative Frobenius iPCA estimator given in Algorithm 1 converges to the global solution.

There are currently only a handful of non-convex problems for which there exists an achievable global solution, so this guarantee that the multiplicative Frobenius iPCA estimator always reaches a global solution is both extremely rare and highly desirable. In Section 4, we will also see that the multiplicative Frobenius iPCA estimator undoubtedly gives the best empirical performance, indicating that in addition to its optimization-theoretic advantages from global convergence, there are significant practical advantages associated with the g-convex penalty. A self-contained review of g-convexity and the proof of Theorem 8 are given in Appendix D.

3.2.2 Tuning Penalty Parameters

To select penalty parameters for (17) in a data-driven manner, we propose a cross-validation-like procedure. The following procedure can also be used to perform iPCA in missing data scenarios. Let Λ denote the space of penalty parameters, and let λ := (λ_Σ, λ_1, . . . , λ_K) be a specific choice of penalty parameters in Λ. The idea is to first randomly leave out scattered elements from each X_k. Then, for each λ ∈ Λ, impute the missing elements via an Expectation-Maximization-like algorithm, similar to Allen and Tibshirani (2010). Finally, select the λ which minimizes the error between the imputed values and the observed values. Searching over all combinations of penalty parameters in Λ, however, can be computationally intractable if K or the data sets themselves are large. In these cases, we can select the penalty parameters in a greedy manner: first fix λ_1, . . . , λ_K and optimize over λ_Σ, then fix λ_Σ, λ_2, . . . , λ_K and optimize λ_1, and so forth, stopping after optimizing λ_K. Note also that it can be substantially easier and faster to tune the multiplicative Frobenius penalty, which has K penalty parameters, compared to the additive iPCA penalties with K + 1 parameters. Because the choice of penalty parameter can significantly impact the empirical performance of iPCA, having one less parameter to tune is an extremely important practical advantage, and we attribute part of the strong empirical performance of the multiplicative Frobenius iPCA estimator, displayed in Section 4, to this advantage. Details, technical derivations, and numerical results regarding our imputation method and penalty parameter selection are provided in Appendix G.
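The greedy search itself is simple to sketch, independent of how the validation score is computed. In the illustration below (ours), `greedy_tune`, `score_fn`, and `grids` are hypothetical names; `score_fn` stands in for the EM-style imputation error described above, whose details are in Appendix G.

```python
import numpy as np

def greedy_tune(score_fn, grids, init):
    """Greedy coordinate search over penalty parameters.

    score_fn(lams) returns a validation score to minimize (e.g. the error between
    imputed and held-out entries); grids maps each parameter name to candidate values."""
    lams = dict(init)
    for name in grids:                        # optimize one parameter at a time
        best_val, best_score = lams[name], np.inf
        for val in grids[name]:
            s = score_fn({**lams, name: val})
            if s < best_score:
                best_val, best_score = val, s
        lams[name] = best_val                 # fix it and move to the next parameter
    return lams

# Toy usage with a synthetic score standing in for the imputation error
target = {"lam1": 0.1, "lam2": 1.0}
score = lambda lams: sum((np.log10(lams[k]) - np.log10(target[k])) ** 2 for k in lams)
grid = np.logspace(-3, 3, 7)
print(greedy_tune(score, {"lam1": grid, "lam2": grid}, {"lam1": 1.0, "lam2": 1.0}))
```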
3.3 Choosing the Type of Penalized iPCA Estimator

As a result of its global convergence guarantee and the reduced complexity in tuning fewer penalty parameters, we strongly recommend using the multiplicative Frobenius estimator in practice. Though we have yet to prove statistical guarantees for the multiplicative Frobenius estimator, its strong empirical performance, seen next in Section 4, firmly supports this recommendation. Even in the sparse setting (see Figure 13), the multiplicative Frobenius estimator performs only slightly worse than the additive L1 iPCA estimators, for which we have proved subspace consistency guarantees (see Appendix F). This empirically demonstrates the robustness and applicability of the multiplicative Frobenius estimator to a diverse array of problems.

4. Empirical Results

In the following simulations and case study, we evaluate iPCA against individual PCAs on each of the data sets X_k, concatenated PCA, distributed PCA, JIVE, and MFA. Note that many data integration methods from the multiblock PCA family are known to perform similarly to MFA (Abdi et al., 2013), so we only include MFA to minimize redundancy. We also omit CMF, as it performs similarly to concatenated PCA, and the GSVD, since it is not applicable for integrated data with both low-dimensional and high-dimensional data sets. Our focus here will be on the non-sparse setting, and we leave the sparse simulations to Appendix H. Within the class of iPCA estimators, we thus concentrate our attention on the additive and multiplicative Frobenius iPCA estimators in these dense simulations, but to also represent the sparse estimators, we include the most commonly used sparse estimator, the additive L1 penalty (‖·‖_{1,off}) applied to the inverse covariance matrices. The stopping rule in the Flip-Flop algorithms for the additive and multiplicative Frobenius iPCA estimators is λ̄^{1/2} ‖ˆΣ_t^{-1} − ˆΣ_{t−1}^{-1}‖_F / ‖ˆΣ_{t−1}^{-1}‖_F < 10^{-6}, where λ̄ denotes the mean of the penalty parameters and ˆΣ_t denotes the estimate of Σ in the tth iteration. Due to computational constraints, we stop the L1 Flip-Flop algorithm after one iteration, and we select the iPCA penalty parameters in a greedy manner, as discussed in Section 3.2.2.

4.1 Simulations

The base simulation is set up as follows: three coupled data matrices, X_1, X_2, X_3, with n = 150, p_1 = 300, p_2 = 500, p_3 = 400, were simulated according to the iPCA Kronecker covariance model (2). Here, Σ is taken to be a full-rank spiked covariance matrix, where the top two eigenvalues are simulated to be much larger than the rest. These top two factors in Σ form the three clusters shown in Figure 1A. Δ_1 is an autoregressive Toeplitz matrix with entry (i, j) given by 0.9^{|i−j|}; Δ_2 is the observed covariance matrix of miRNA data from TCGA ovarian cancer (The Cancer Genome Atlas Research Network, 2011); and Δ_3 is a block-diagonal matrix with five equally-sized blocks. We also ensured that the largest eigenvalue of each Δ_k was larger than that of Σ so that the joint patterns are intentionally obscured by individualistic patterns. From this base simulation, we systematically varied the parameters (number of samples, number of features, and strength of the joint signal in Σ, i.e., ‖Σ‖_2) one at a time while keeping everything else constant.

We evaluate the performance of the various methods using the subspace recovery error: if the true underlying subspace of Σ was simulated to be of dimension d with the orthogonal eigenbasis u_1, . . . , u_d, and the top d eigenvectors of the estimate ˆΣ are given by ˆu_1, . . . , ˆu_d, then the subspace recovery error is defined to be (1/d) ‖ˆU ˆU^T − U U^T‖_F^2, where U = [u_1, . . . , u_d] and ˆU = [ˆu_1, . . . , ˆu_d]. This metric simply quantifies the distance between the true subspace of Σ and the estimated subspace from ˆΣ. We note that a lower subspace recovery error implies higher estimation accuracy, and in the base simulation, the true subspace of Σ is given by the number of spikes, so that d = 2.
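A small numpy helper for this metric (an illustrative sketch, not from the paper):

```python
import numpy as np

def subspace_recovery_error(U_true, U_hat):
    """(1/d) * ||U_hat U_hat^T - U U^T||_F^2 for two orthonormal bases with d columns."""
    d = U_true.shape[1]
    P_true = U_true @ U_true.T
    P_hat = U_hat @ U_hat.T
    return np.linalg.norm(P_hat - P_true, "fro") ** 2 / d

# Quick check: identical subspaces give 0; completely orthogonal subspaces give 2
Q, _ = np.linalg.qr(np.random.default_rng(8).standard_normal((10, 4)))
U, U_perp = Q[:, :2], Q[:, 2:4]
print(subspace_recovery_error(U, U), subspace_recovery_error(U, U_perp))  # 0.0 2.0
```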
Figure 3: Subspace recovery as simulation parameters vary from the base simulation (panels: sample size n; number of features added to each X_k; signal in Σ): (A) As the number of samples increases, it becomes more difficult to estimate the joint row subspace; (B) As the number of features increases, it becomes slightly easier to estimate the joint row subspace; (C) Performance drastically improves as the strength of the joint signal in Σ (i.e., the top singular value of Σ) increases. Moreover, in almost every scenario, the multiplicative and additive Frobenius iPCA estimators outperform their competitors.

Although there are other metrics, such as canonical angles, which also quantify the distance between subspaces, these metrics behave similarly to the subspace recovery error and are omitted for brevity. The average subspace recovery errors, measured over 50 trials, from the various simulations are shown in Figure 3. We clearly see that the additive and multiplicative Frobenius iPCA estimators consistently outperformed all other methods. Since Σ was not simulated to be sparse, it is no surprise that the Frobenius iPCA estimators outperformed the L1 iPCA estimator. It is also expected that distributed PCA performs poorly since the Δ_k's are not all identical, violating its basic assumption. What may be surprising is that doing PCA on X_2 performed better than its competitors, excluding the Frobenius iPCA estimators. We speculate that this is because the observed covariance Δ_2 happened to be a very low-rank matrix, and because Δ_2 was low-rank, the signal from Σ most likely dominated much of the variation in the second PC. Looking ahead at Figure 4A, as Laplacian error was added to the simulated data, PCA on X_2 failed to recover the true signal since the Laplacian error increasingly contributed to the variation in the data. We also point out that MFA always yielded a lower error than concatenated PCA in Figure 3, indicating that there is value in normalizing data sets to be on comparable scales. On the other hand, we must be wary of this normalization process. In the case of these simulations, PCA on X_2 outperformed MFA, illustrating that normalization can sometimes remove important information.

Figure 4: Robustness Simulations (panels: SD of added Laplacian error; SD of noise in the JIVE model): (A) As Laplacian error is increasingly added to the simulated data sets, the Frobenius iPCA estimators appear to be robust to the departures from Gaussianity; (B) As the amount of noise in the JIVE model increases, iPCA seems to be comparable to existing methods (including JIVE with known and estimated ranks), illustrating its relative robustness to departures from the Kronecker product model.
To verify that these simulation results are not heavily dependent on the base simulation setup of Σ and Δ_k, we also ran simulations varying the dimension d of the true joint subspace U and the number of data sets K. We provide these results in Appendix H.

Beyond simulating from the iPCA model (2), we check for robustness to departures from the two main iPCA assumptions: normality and separability (i.e., the Kronecker covariance structure). To deviate from normality, we add Laplacian noise to the base simulation setup, and to depart from the Kronecker covariance structure, we simulate data from the JIVE model. The results are summarized in Figure 4, and we leave the simulation details as well as other simulations that demonstrate robustness to Appendix H. As seen in Figure 4A, the Frobenius iPCA estimators appear to be relatively robust to non-Gaussian noise and outperformed their competitors even as the standard deviation of the added Laplacian errors increased. From the simulations under the JIVE model in Figure 4B, we see that as the amount of noise increases, JIVE given the known ranks yields the lowest error, as expected. But similar to how JIVE was comparable to competing methods under the iPCA model (Figure 3), the iPCA estimators are comparable to competing methods for high noise levels under the JIVE model. At low noise levels, however, the Frobenius iPCA estimators are surprisingly able to recover the true joint subspace better than JIVE with the known ranks. Further investigation into this peculiar behavior reveals that the Frobenius iPCA estimators give lower subspace recovery errors but much larger approximation errors ‖Σ − ˆΣ‖_F^2, compared to JIVE with the known ranks. This brings up a subtle but important distinction: iPCA revolves around estimating the underlying subspace, determined by eigenvectors, while JIVE focuses on minimizing the matrix approximation error. These are inherently different objectives, and it is common for iPCA to estimate the eigenvectors well at the cost of a poor matrix approximation due to the regularized eigenvalues.

4.2 Case Study: Integrative Genomics of Alzheimer's Disease

A key motivating example for our research is in integrative genomics, where the goal is to combine multiple genomic sources to gain insights into the genetic basis of diseases. In particular, apart from the APOE gene, little is known about the genomic basis of Alzheimer's disease (AD) and the genes which contribute to dominant expression patterns in AD. In this case study, we delve into the integrative genomics of AD and jointly analyze miRNA expression, gene expression via RNASeq, and DNA methylation data obtained from the Religious Orders Study Memory and Aging Project (ROSMAP) Study (Mostafavi et al., 2018). The ROSMAP study is a longitudinal clinical-pathological cohort study of aging and AD, consisting of 507 subjects, 309 miRNAs, 900 genes, and 1250 CpG (methylation) sites after preprocessing (which we detail in Appendix I). This data is uniquely positioned for the study of AD since its genomics data is collected from post-mortem brain tissue from the dorsolateral prefrontal cortex, an area known to play a critical role in cognitive functions. For our analysis, we consider two clinical outcomes: clinician's diagnosis and global cognition score.
The clinician's diagnosis is the last clinical evaluation prior to the patient's death and is a categorical variable with three levels: Alzheimer's disease (AD), mild cognitive impairment (MCI), and no cognitive impairment (NCI). Global cognition score, a continuous variable, is the average of 19 cognitive tests and is the last cognitive testing score prior to death. While the clinician's diagnosis is sometimes subjective, the global cognition score is viewed as a more objective measure of cognition. Our goal is to find common patterns among patients, which occur in all three data sets, and to understand whether these joint patterns are predictive of AD, as measured by the clinician's diagnosis and global cognition score.

To this end, we run iPCA and other existing methods to extract dominant patterns from the ROSMAP data. Figure 5 shows the PC plots obtained from the various methods; each point represents a subject and is colored by either clinician's diagnosis or cognition score. Since visuals are a subjective measure of performance, we quantify performance by taking the top PCs and using them as predictors in a random forest to predict the outcome of interest. The random forest test errors, averaged over 100 random training/test splits, are shown in Figure 6. Here, we see that the joint patterns extracted from iPCA using the multiplicative Frobenius penalty were the most predictive of the clinician's diagnosis of AD and the patient's global cognition score. Moreover, most of the predictive power can be attributed to the first three iPCs, which we visualize in Figure 7A-B. We also note that the top iPCs from iPCA with the multiplicative Frobenius penalty were more predictive than combining the PCs from the three individual PCAs performed on each data set. This showcases empirically that a joint analysis of the integrated data sets can be advantageous over three disparate analyses.

Beyond the high predictive power of iPCA with the multiplicative Frobenius penalty, it is perhaps more important for scientists to be able to interpret the iPCA results. One way is through the proportion of variance explained by the joint iPCs, as defined in Section 2.2.1. Figure 7C shows the marginal proportions for the top 5 iPCs. It reveals that the RNASeq data set contributed the most variation in the joint patterns found by iPC1 and iPC2, and the miRNA data set contributed the most variation in iPC3. More interestingly, even though iPC2 and iPC3 have relatively small variances, iPCA is able to pick out these weak joint signals, which we found to be predictive of AD. This reiterates that the most variable patterns in the data are not necessarily the most relevant patterns for the question at hand. In this case, our goal was to find joint patterns which occur in all three data sets, and since the joint signal is not the most dominant source of variation in each data set, no individualistic PCA analysis would have identified the joint signal found by iPCA.
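Returning to the prediction experiment summarized in Figure 6, the sketch below illustrates the same idea with scikit-learn's RandomForestClassifier rather than the R randomForest defaults used in the paper. The data here are synthetic stand-ins (the real inputs would be the ROSMAP iPC scores and the clinician's diagnosis), but the split sizes mirror those reported above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
scores = rng.standard_normal((507, 10))    # stand-in for the top-10 (i)PC scores of 507 subjects
labels = rng.integers(0, 3, size=507)      # stand-in for diagnosis: AD / MCI / NCI coded 0-2

errors = []
for seed in range(100):                    # 100 random training (n=375) / test (n=132) splits
    tr_x, te_x, tr_y, te_y = train_test_split(
        scores, labels, test_size=132, random_state=seed)
    rf = RandomForestClassifier(random_state=seed).fit(tr_x, tr_y)
    errors.append(1.0 - rf.score(te_x, te_y))
print(f"mean test classification error: {np.mean(errors):.3f}")
```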
We conclude our ROSMAP analysis by extracting the top genetic features which are associated with the joint patterns shown in Figure 7. Since iPCA provides an estimate of both Σ and Δ_k, we can select the top features by applying sparse PCA to each ˆΔ_k obtained from iPCA. Table 1 lists the top miRNAs, genes, and genes affiliated with the selected CpG sites obtained from sparse PCA. Here, we used the sparseEigen R package (Benidis et al., 2016) and chose the tuning parameter such that there were only 12 non-zero features.

Figure 5: We plot the first two (integrated) principal components from the various methods applied to the ROSMAP data (including iPCA with the multiplicative Frobenius, additive Frobenius, and L1 estimators; PCA on the miRNA, RNASeq, and methylation data sets; and concatenated PCA). Each point represents a subject, colored by the clinician's diagnosis in panels A-I and by global cognition score in panels J-R.

Figure 6: We took the top PCs and used them as predictor variables in a random forest to predict (A) the clinician's diagnosis and (B) the global cognition score. For the random forest, we split the ROSMAP data into a training (n = 375) and test set (n = 132) and used the default random forest settings in R. The average test errors from the random forests over 100 random splits are shown as the number of PCs used in the random forests increases.

Figure 7: (A)-(B) We show the top 3 iPCs obtained from iPCA with the multiplicative Frobenius estimator; the points are colored by clinician's diagnosis and global cognition score in (A) and (B), respectively. (C) We plot the marginal proportion of variance explained by the top iPCs in each data set in the ROSMAP analysis (using the multiplicative Frobenius iPCA estimator).

Because the RNASeq data contributed most of the variation in iPC1, we did a literature search on the top five genes extracted by sparse PCA on ˆΔ_2. Out of the top five genes, we found evidence in the biological literature which links four of the five genes (the exception being SVOP) to AD (Carrette et al., 2003; Li et al., 2017; Han et al., 2014; Espuny-Camacho et al., 2017). While this is only a preliminary investigation into the importance of the genetic features obtained from iPCA, it is encouraging evidence and may potentially hint at candidate genes for future research.

|    | miRNA           | RNASeq    | Methylation              |
|----|-----------------|-----------|--------------------------|
| 1  | miR-216a        | VGF       | TMCO6                    |
| 2  | miR-127-3p      | SVOP      | PHF3                     |
| 3  | miR-124         | PCDHGC5   | BRUNOL4                  |
| 4  | miR-30c         | ADCYAP1   | OSCP1                    |
| 5  | miR-143         | LINC01007 | GRIN2B                   |
| 6  | miR-27a         | FRMPD2L1  | CASP9                    |
| 7  | miR-603         | SLC30A3   | ZFP91; LPXN; ZFP91-CNTF  |
| 8  | miR-423-3p      | NCALD     | CNP                      |
| 9  | miR-204         | S100A4    | YWHAE                    |
| 10 | miR-128         | AZGP1     | C11orf73                 |
| 11 | miR-193a-3p     | PAK1      | TMED10                   |
| 12 | ebv-miR-BART14  | MAL2      | RELL1                    |

Table 1: Top genetic features obtained by applying sparse PCA to each ˆΔ_k in the ROSMAP analysis (using the multiplicative Frobenius iPCA estimator).
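The feature-selection step above applies sparse PCA to each ˆΔ_k via the sparseEigen R package. As a rough stand-in, the following sketch (ours) uses a simple truncated power iteration, a different but standard sparse-eigenvector heuristic, to extract a leading eigenvector with a fixed number of non-zero entries from a covariance estimate:

```python
import numpy as np

def truncated_power_iteration(S, n_nonzero=12, n_iter=200, seed=0):
    """Approximate leading sparse eigenvector of a symmetric PSD matrix S,
    keeping only n_nonzero entries (a simple stand-in for sparse PCA)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = S @ v                                   # power step
        keep = np.argsort(np.abs(v))[-n_nonzero:]   # truncate to the largest entries
        mask = np.zeros_like(v)
        mask[keep] = v[keep]
        v = mask / np.linalg.norm(mask)
    return v

# Toy demo: a covariance whose leading eigenvector is supported on the first 12 coordinates
rng = np.random.default_rng(10)
u = np.zeros(100)
u[:12] = rng.standard_normal(12)
u /= np.linalg.norm(u)
S = 10.0 * np.outer(u, u) + np.eye(100)
v = truncated_power_iteration(S)
print(sorted(np.nonzero(v)[0]))                     # indices of the selected features
```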
5. Discussion

As showcased in the simulations and the Alzheimer's disease case study, iPCA is not simply a theoretical construct that generalizes PCA to the integrated data setting. It is also a useful and effective tool in practice for discovering interesting joint patterns that are shared across multiple data sets. We believe that iPCA's strong empirical performance is due in part to its flexibility to handle a rich set of dependencies among samples and features concurrently, which is particularly appropriate and necessary for integrated data problems. This flexibility is inherently driven by the underlying matrix-variate normal model, and from a whitening perspective, we can view the iPCA model as a natural generalization of the classical PCA model and assumptions. More specifically, in relation to PCA, iPCA can be viewed as performing PCA on the concatenated feature-whitened data, having estimated the individual feature covariances Δ_k and the joint sample covariance Σ simultaneously (a small illustrative sketch is given at the end of this section).

While we discuss many potential penalized iPCA estimators for Σ and the Δ_k's in this work, we recommend that practitioners strongly consider using our new multiplicative Frobenius iPCA estimator. The simulations show that the Frobenius penalties are relatively robust to departures from the model assumptions, and furthermore, the multiplicative Frobenius penalized estimator requires one less penalty parameter to tune and always converges to the global solution. Similar in spirit to other shrinkage penalties (Ledoit and Wolf, 2004), the multiplicative Frobenius iPCA estimator is a shrinkage technique that induces smoothness over the eigenvalues of the covariance matrices. However, its multiplicative form is especially unique and well-suited for integrated data problems, as it performs an automatic reweighting of the integrated data sets and accounts for the concurrent estimation of Σ and Δ_k in each penalty term. Further investigation into the multiplicative Frobenius penalty and its statistical properties is left for future research.

Moreover, we believe that this work opens the door for research into the theoretical underpinnings of dimension reduction and data integration in ways that other non-model-based methods cannot. Building upon the model-based construction of iPCA, we show in Appendix F that the additive L1 correlation estimator satisfies one of the first statistical guarantees in the data integration context. However, this is only the beginning and certainly not meant to be the final investigation into the theoretical properties of the iPCA estimators and data integration. Minimax results and consistency guarantees for the multiplicative Frobenius iPCA estimator in particular are challenging and will require careful follow-up in future work. Still, there are many other open avenues for future exploration. Analogous to PCA, one might imagine similarly fruitful extensions of iPCA to higher-order data, functional data, and other structured applications. One could continue exploring g-convex penalties in different contexts and problems. Another interesting area for future research would be to develop a general framework for proving the consistency of g-convex estimators using the intrinsic manifold, rather than the Euclidean space. We believe this intersection of g-convexity and statistical theory is a particularly ripe area for future research. Overall, in this work, we developed a theoretically sound and practical tool for performing dimension reduction in the integrated data setting, thus facilitating holistic analyses at a large scale.
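To make the whitening interpretation above concrete, the following is a minimal sketch, assuming the K column-centered data matrices are stored in a list `X` and the corresponding estimated feature covariances (e.g., from the multiplicative Frobenius Flip-Flop) in a list `Delta_hat`; it illustrates the interpretation rather than the estimation procedure itself.

```r
# Whiten each data set by its estimated feature covariance, concatenate, and
# take an SVD: the left singular vectors play the role of the joint iPC scores.
whiten <- function(Xk, Dk) {
  e <- eigen(Dk, symmetric = TRUE)
  Dk_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  Xk %*% Dk_inv_sqrt
}
X_whitened <- do.call(cbind, Map(whiten, X, Delta_hat))
ipc_scores <- svd(X_whitened)$u    # columns ordered by decreasing singular value
```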
Acknowledgments

The authors acknowledge support from NSF DMS-1264058, NSF DMS-1554821, and NSF NeuroNex-1707400. T.T. also acknowledges support from the NSF Graduate Research Fellowship Program DGE-1752814. The authors thank Dr. Joshua Shulman and Dr. David Bennett for help acquiring the ROS/MAP data and acknowledge support from NIH P30AG10161, RF1AG15819, R01AG17917, and R01AG36042 for these data. The authors thank Dr. Zhandong Liu and Dr. Ying-Wooi Wan for help with processing and interpreting the ROS/MAP data. The authors also thank Cole Franks for help with improving and strengthening Theorem 5.

Appendix A. Variance Explained by iPCA

To ensure that the cumulative proportion of variance explained from Definition 1 is a well-defined concept, we check that PVE_{k,m} is a proportion and is an increasing function of m. This implies that the marginal proportion of variance explained given in Definition 2 is also a proportion.

Proposition 9 The cumulative proportion of variance explained in X_k by the top m iPCs, as defined in (10), satisfies the following properties for each k = 1, ..., K and m = 1, ..., min{n, p_k}:
(i) 0 <= PVE_{k,m} <= 1;
(ii) PVE_{k,m-1} <= PVE_{k,m}.

Proof (i) Since the Frobenius norm is always non-negative, it is clear that PVE_{k,m} >= 0. So it suffices to show that PVE_{k,m} <= 1, or equivalently,

\| (U^{(m)})^T X_k V_k^{(m)} \|_F^2 \le \| X_k \|_F^2.

By definition of the Frobenius norm, we have that

\| (U^{(m)})^T X_k V_k^{(m)} \|_F^2
= \sum_{i=1}^{m} \sum_{j=1}^{m} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [(U^{(m)})^T]_{iq} \, X_{k,qr} \, [V_k^{(m)}]_{rj} \Big)^2
\overset{[1]}{\le} \sum_{i=1}^{n} \sum_{j=1}^{p_k} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [U^T]_{iq} \, X_{k,qr} \, [V_k]_{rj} \Big)^2
= \| U^T X_k V_k \|_F^2 \overset{[2]}{=} \| X_k \|_F^2.

Here, [2] holds by the orthogonality of U and V_k, and [1] follows from the facts that m <= min{n, p_k}, that U^{(m)} and V_k^{(m)} are precisely the first m columns of U and V_k, respectively, and that each summand is non-negative. This concludes the proof of part (i).

(ii) We follow a similar argument as in part (i) to see that

\| (U^{(m-1)})^T X_k V_k^{(m-1)} \|_F^2
= \sum_{i=1}^{m-1} \sum_{j=1}^{m-1} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [(U^{(m-1)})^T]_{iq} \, X_{k,qr} \, [V_k^{(m-1)}]_{rj} \Big)^2
\le \sum_{i=1}^{m} \sum_{j=1}^{m} \Big( \sum_{q=1}^{n} \sum_{r=1}^{p_k} [(U^{(m)})^T]_{iq} \, X_{k,qr} \, [V_k^{(m)}]_{rj} \Big)^2
= \| (U^{(m)})^T X_k V_k^{(m)} \|_F^2.

This implies that PVE_{k,m-1} <= PVE_{k,m}.

Next, we claim that Definition 1 is a generalization of the cumulative proportion of variance explained in PCA. Recall that in PCA, if X = U D V^T is the SVD of X, then the cumulative proportion of variance explained by the top m PCs is (\sum_{i=1}^{m} d_i^2) / (\sum_{i=1}^{p} d_i^2), where D = diag(d_1, ..., d_p). We can rewrite this as

\frac{\sum_{i=1}^{m} d_i^2}{\sum_{i=1}^{p} d_i^2} = \frac{\| D^{(m)} \|_F^2}{\| X \|_F^2} = \frac{\| (U^{(m)})^T X V^{(m)} \|_F^2}{\| X \|_F^2}    (18)

using properties of the SVD. Since U contains the PC scores and V the PC loadings from PCA, Definition 1 is indeed a natural definition in the sense that it generalizes the PCA cumulative proportion of variance explained.

Appendix B. Difference Between iPCA and Concatenated PCA

To formalize the difference between iPCA and concatenated PCA, we examine the bias of the eigenvectors from concatenated PCA when sample and feature dependencies simultaneously obscure each other under the matrix-variate normal model.

Proposition 10 Suppose that X_k ~ N_{n, p_k}(0, Σ ⊗ Δ_k) for each k = 1, ..., K, and define X̃ = [X_1, ..., X_K], Δ = diag(Δ_1, ..., Δ_K), and Z̃ = [Z_1, ..., Z_K], where Z_k is as defined in (5) previously. If X̃ Δ^{-1/2} = U D V^T and X̃ = Ũ D̃ Ṽ^T are their respective compact SVDs, then U and Ũ denote the iPCA and concatenated PCA scores, respectively, and

U - Ũ = Σ^{1/2} Z̃ ( V D^{-1} - Δ^{1/2} Ṽ D̃^{-1} ).    (11)

Moreover, the iPCA and concatenated PCA scores are equal if and only if V D^{-1} - Δ^{1/2} Ṽ D̃^{-1} lies in the nullspace of Z̃ or Δ^{1/2} Ṽ = V D^{-1} D̃.

Proof By construction of their methodology, we know that the iPCA scores U and the concatenated PCA scores Ũ satisfy the following compact SVDs:

X̃ Δ^{-1/2} = U D V^T ,    X̃ = Ũ D̃ Ṽ^T .
We also know from (5) that X = Σ1/2 Z 1/2 . (19) Using the orthogonality of the singular vectors and plugging in (19) for X in the compact SVD formulas then gives U = Σ1/2 Z V D 1, U = Σ1/2 Z 1/2 V D 1. U U = Σ1/2 Z V D 1 1/2 V D 1 . Appendix C. Covariance Estimation for i PCA In this section, we provide the proofs and derivations related to the unpenalized and penalized maximum likelihood estimators under the i PCA model. C.1 Unpenalized Maximum Likelihood Estimators We begin this section by deriving the log-likelihood equation associated with the i PCA population model (2). Recall that the probability density function for each matrix-variate normal model (k = 1, . . . , K) is given by f (Xk | Σ, k) = (2π) npk 2tr Σ 1 Xk 1 k XT k . Integrated Principal Components Analysis Hence, the log-likelihood function is ℓ(Σ 1, 1 1 , . . . , 1 K ) = 2 log(2π) + pk 2 log | Σ 1 | + n 2 log | 1 k | 2tr Σ 1 Xk 1 k XT k i p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k . The unpenalized MLEs are designed to solve the optimization problem argmin Σ 1, 1 1 ,..., 1 K ℓ(Σ 1, 1 1 , . . . , 1 K ) (20) Taking partial derivatives with respect to each covariance parameter thus gives the following: Lemma 11 The unpenalized MLEs of Σ and 1, . . . , K satisfy k=1 Xk ˆ 1 k XT k (13) n XT k ˆΣ 1 Xk k = 1, . . . , K. (14) Proof Taking the partial derivatives of the log-likelihood equation with respect to Σ 1 and 1 k respectively yields the gradient equations: ℓ Σ 1 = p Σ k=1 Xk 1 k XT k ℓ 1 k = n k XT k Σ 1 Xk . Setting the gradient equations equal to 0 gives the desired result. Assuming that the unpenalized MLEs exist, we can compute the unpenalized MLEs of Σ and 1, . . . , K via Algorithm 2, which is analogous to the Flip-Flop algorithm provided in Dutilleul (1999). The next theorem provides very restrictive conditions for which the unpenalized MLEs exist, but in almost all practical cases, the unpenalized maximum likelihood problem is ill-posed for i PCA. Theorem 5 (i) Suppose that Xk has been column-centered so that each column has a mean of 0 for each k = 1, . . . , K. Then the unpenalized MLEs for Σ and 1, . . . , K are not positive definite and hence do not exist. (ii) Suppose that Xk has not been column-centered but that rank( X) = n and rank(Xk) = pk for k = 1, . . . , K. Tang and Allen Algorithm 2 Flip-Flop Algorithm for i PCA Unpenalized MLEs 1: Assume that Xk has been centered appropriately and that the unpenalized MLEs exist. 2: Initialize ˆΣ, ˆ 1, . . . , ˆ K to be symmetric positive definite. 3: while not converged do 4: Update ˆΣ = 1 p PK k=1 Xk ˆ 1 k XT k 5: for k = 1, . . . , K do 6: Update ˆ k = 1 n XT k ˆΣ 1 Xk (a) If n = pk for some k = 1, . . . , K, then the unpenalized log-likelihood function ℓ(Σ 1, 1 1 , . . . , 1 K ) is unbounded. (b) If the unpenalized log-likelihood function ℓ(Σ 1, 1 1 , . . . , 1 K ) is bounded, then the unpenalized MLE for Σ, 1, . . . , K exist. Proof (ii) Suppose that Xk has not been column-centered but that rank( X) = n and rank(Xk) = pk for k = 1, . . . , K. We will first prove part (b), so assume also that the unpenalized log-likelihood is bounded. From the unpenalized MLEs in Lemma 4, it is easy to see that ˆΣ and ˆ 1, . . . , ˆ K are symmetric positive semidefinite. We next claim that ˆΣ and ˆ 1, . . . , ˆ K are full rank and hence positive definite. To prove this claim, we proceed by induction on the unpenalized Flip-Flop iteration counter m. Let ˆΣ m and ˆ m k denote the mth Flip-Flop update of ˆΣ and ˆ k, respectively. 
Clearly, the base case holds since Σ0 and 0 1, . . . , 0 K are initialized to be symmetric positive definite. So suppose Σm and m 1 . . . , m K are full rank. Then the unpenalized Flip-Flop iterates at the (m + 1)th-update step are k=1 Xk( ˆ m k ) 1 XT k = 1 p X( m) 1 X T , ˆ m+1 k = 1 n XT k ( ˆΣ m+1) 1 Xk, where X = [X1, . . . , XK] and m = diag( ˆ m 1 , . . . , ˆ m K). Therefore, we have that rank ˆΣ m+1 = rank X( m) 1 X T = rank X( m) 1 2 = rank( X) = n. Here, the second equality holds because rank(AT A) = rank(A) = rank(AT ). The third equality holds because ( m) 1 2 is full rank (i.e. rank = p) by the inductive hypothesis, and the last equality holds by hypothesis. Thus, ˆΣ m+1 is positive definite. Similarly, for each k = 1, . . . , K, rank ˆ m+1 k = rank XT k ( ˆΣ m+1) 1 Xk = rank(Xk) = pk. So for each k = 1, . . . , K, ˆ m+1 k is positive definite. By induction, ˆΣ m, ˆ m 1 , . . . , ˆ m K 0 for each iterate of Algorithm 2, and by Corollary 21, Algorithm 2 converges to the global solution of (20). Thus, the unpenalized MLEs for Σ and 1 1 , . . . , 1 K exist under the assumptions given in part (ii)(b). Integrated Principal Components Analysis To prove part (a) of (ii), let X1, . . . , XK be given, and assume that n = pk for some k = 1, . . . , K. By the rank assumptions, it must be that pk n p for each k = 1, . . . , K. Define K0 = {k {1, . . . , K} | pk < n} and K1 = {k {1, . . . , K} | pk = n}. Note that by assumption, K0 is not empty, and for each k K0, XT k has a kernel. We can then choose bases for Rn and Rpk (k = 1, . . . , K) such that the first row of Xk is zero for each k K0 and the first row of Xk is the first standard basis vector, denoted fk,1, of Rpk for each k K1. Let e1 denote the first standard basis vector in Rn. We will next construct a family of diagonal matrices for Σ, 1, . . . , K that sends the unpenalized log-likelihood to infinity. Specifically, let Σ = et e1 e T 1 , k = Ipk for k K0, and k = e t f k,1 f T k,1 for k K1, where e denotes the matrix exponential. Then for k K0, tr Σ 1 Xk 1 k XT k = tr Σ 1 Xk XT k = i=2 x T k,i xk,i, and for k K1, tr Σ 1 Xk 1 k XT k = 1 + (et 1)(x T k,i fk,1)2 + x T k,i xk,i , where xk,i denotes the ith row of Xk. Therefore, for every t R, we have that ℓ(Σ 1, 1 1 , . . . , 1 K ) = p log | Σ | n X k K0 log | k | n X k K1 log | k | k K0 tr Σ 1 Xk 1 k XT k X k K1 tr Σ 1 Xk 1 k XT k = p log(et) n X k K1 log(e t) X i=2 x T k,i xk,i (et 1)(x T k,i fk,1)2 + x T k,i xk,i ! = t(n|K1| p) X i=2 x T k,i xk,i (et 1)(x T k,i fk,1)2 + x T k,i xk,i ! As t , the first term approaches since n|K1| < p while the second and third terms approach a constant. Thus, ℓ(Σ 1, 1 1 , . . . , 1 K ) is unbounded, as desired. (i) Suppose that we have centered each data matrix Xk to have column means 0. Recall that the unpenalized MLEs for Σ and 1, . . . , K are obtained by k=1 Xk ˆ 1 k XT k = 1 p X 1 X T , n XT k ˆΣ 1 Xk, Tang and Allen where := diag( ˆ 1, . . . , ˆ K). For the sake of contradiction, suppose that there exists n, p1, . . . , pk such that Σ 1 and 1 1 , . . . , 1 K are symmetric positive definite. By the same argument as in part (ii)(b), rank( ˆΣ) = rank( X), but since each Xk has been centered to have column means 0, then the n rows of X are linearly dependent. Hence, rank( ˆΣ) = rank( X) < n, which implies that ˆΣ can never be positive definite, a contradiction. Therefore, the unpenalized MLEs never exist if each Xk has be column-centered to have mean 0. Remark 12 Notice that if the unpenalized MLEs for Σ, 1, . . . 
, K exist, then n [1]= rank( ˆΣ) [2]= rank( X) [3] min{n, p} pk [1]= rank( ˆ k) [2]= rank(Xk) [3] min{n, pk} k = 1, . . . , K where [1] follows from positive definiteness of ˆΣ and ˆ k, [2] follows the same argument as in the proof of Theorem 5, and [3] holds by properties of rank and the dimensions of X and Xk. By combining these rank constraints with the result of Theorem 5, we see that if the unpenalized MLEs for Σ, 1, . . . , K exist, it must be that pk = n p for each k = 1, . . . , K. Proposition 6 can be proved in the same way as Lemma 4, so we omit the proof. C.2 Penalized Maximum Likelihood Estimators In this section, we develop Flip-Flop algorithms and analyze the convergence results for both Frobenius and L1 penalties. For the sake of notation, let ℓP (Σ 1, 1 1 , . . . , 1 K ) = p log | Σ 1 | n k=1 log | 1 k | + k=1 tr Σ 1 Xk 1 k XT k + P(Σ 1, 1 1 , . . . , 1 K ) We give the overarching framework of the Flip-Flop algorithms in Algorithm 3, and we show in Theorem 7 that Algorithm 3 can be used to find a local solution of (17) for a certain class of penalties, which includes the additive Frobenius, multiplicative Frobenius, and additive L1 penalties. The main idea behind the proof is to use convexity and view Algorithm 3 as a block coordinate descent algorithm so that each update of the Flip-Flop algorithm is a descent direction. Theorem 7 Suppose that the objective function in (17) is bounded below. Suppose also that either (i) P(Σ 1, 1 1 , . . . , 1 K ) is a differentiable convex function with respect to each coordinate or (ii) P(Σ 1, 1 1 , . . . , 1 K ) = P0(Σ 1) + PK k=1 Pk( 1 k ), where Pi is a (non-differentiable) convex function for each k = 1, . . . , K. If either (i) or (ii) holds, then the Flip-Flop algorithm corresponding to (17) converges to a stationary point of the objective. Integrated Principal Components Analysis Algorithm 3 Outline of Flip-Flop Algorithm for Penalized i PCA Covariance Estimators 1: Center the columns of X1, . . . , XK, and initialize ˆΣ, ˆ 1, . . . , ˆ K to be positive definite. 2: while not converged do 3: Update Σ while fixing all other variables: ˆΣ 1 = argmin Σ 1 0 ℓP (Σ 1, ˆ 1 1 , . . . , ˆ 1 K ) 4: for k = 1, . . . , K do 5: Update k while fixing all other variables: ˆ 1 k = argmin 1 k 0 ℓP ( ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 k 1, 1 k , ˆ 1 k+1, . . . , ˆ 1 K ) Proof Suppose either P(Σ 1, 1 1 , . . . , 1 K ) is a differentiable convex function with respect to each coordinate or P(Σ 1, 1 1 , . . . , 1 K ) = P0(Σ 1) + PK k=1 Pk( 1 k ) where Pi is a (non-differentiable) convex function for each i = 1, . . . , K. Let ℓ(Σ 1, 1 1 , . . . , 1 K ) = p log | Σ 1 |+n PK k=1 log | 1 k | PK k=1 tr Σ 1 Xk 1 k XT k . Since the domain of ℓis open and ℓis Gateaux-differentiable on its domain, then ℓP is regular in the domain of ℓP by Lemma 3.1 in Tseng (2001). Note also that since the log-determinant is a strictly concave function on the set of symmetric positive definite matrices, the trace function is linear, and the penalty term is convex with respect to each coordinate by hypothesis, then ℓP is strictly convex in Σ 1 with 1 1 , . . . , 1 K fixed, and for each k = 1, . . . , K, ℓP is strictly convex in in 1 k with Σ 1, 1 j , j = k fixed. Because ℓP is regular and strictly convex with respect to each coordinate, it follows that the Flip-Flop algorithm corresponding to (17) converges to a stationary point of the objective function by Theorem 4.1(c) in Tseng (2001). 
In the following sections, we will derive the specific form of the Flip-Flop updates for each of the penalized i PCA estimators. C.2.1 Additive Frobenius Penalized Flip-Flop Estimator To compute the additive Frobenius penalized estimator, we solve ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 K = argmax Σ 1 0 1 1 ,..., 1 K 0 n p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k k=1 λk 1 k 2 F o . (21) Tang and Allen The gradient equations corresponding to (21) are given by k=1 Xk ˆ 1 k XT k 2λΣ ˆΣ 1 = 0, (22) n ˆ k XT k ˆΣ 1 Xk 2λk ˆ 1 k = 0 k = 1, . . . , K. (23) Lemma 13 If gradient equations (22) and (23) are satisfied, then k=1 Xk ˆ 1 k XT k k=1 Xk ˆ 1 k XT k (24) 4n2 XT k ˆΣ 1 Xk 2 1 2n XT k ˆΣ 1 Xk k = 1, . . . , K (25) Proof Define ˆSΣ = PK k=1 Xk ˆ 1 k XT k , and right multiply (22) by ˆΣ to obtain p ˆΣ 2 ˆSΣ ˆΣ 2λΣ I = 0 (26) p ˆSΣ ˆΣ + 1 4p2 ˆS2 Σ = 2λΣ 4p2 ˆS2 Σ. (27) On the other hand, we can multiply (22) by ˆΣ on the left to obtain p ˆΣ 2 ˆΣˆSΣ 2λΣ I = 0 (28) p ˆΣˆSΣ + 1 4p2 ˆS2 Σ = 2λΣ 4p2 ˆS2 Σ. (29) Adding (27) and (29) and then dividing by 2 gives 2p ˆSΣ ˆΣ 1 2p ˆΣˆSΣ + 1 4p2 ˆS2 Σ = 2λΣ 4p2 ˆS2 Σ is positive definite, its has a unique square root. Thus, We can similarly rearrange (23) and complete the square to obtain (25). We now have the machinery to prove Proposition 14, which gives us the form of each update in the additive Frobenius Flip-Flop algorithm. Integrated Principal Components Analysis Proposition 14 ˆΣ and ˆ 1, . . . , ˆ K are solutions to the gradient equations in (22) and (23) if and only if Γ + Γ2 + 8λΣp I 1 and ˆ k = Vk Φk + Φ2 k + 8λkn I 1 2 VT k k = 1, . . . , K, (31) where U, Vk, Γ, and Φk are defined by the eigendecompositions PK k=1 Xk ˆ 1 k XT k = U Γ UT and XT k ˆΣ 1 Xk = Vk Φk VT k . Proof ( ) Suppose that ˆΣ and ˆ 1, . . . , ˆ K are solutions to the gradient equations in (22) and (23). We will first show that the eigenvectors of ˆΣ and PK k=1 Xk ˆ 1 k XT k are equivalent. Let u be an eigenvector of ˆΣ with the corresponding eigenvalue φ. Then, by (22), K X k=1 Xk ˆ 1 k XT k u = (p ˆΣ 2λΣ ˆΣ 1) u = (pφ 2λΣφ 1) u . Therefore, u is an eigenvector of PK k=1 Xk ˆ 1 k XT k with the eigenvalue pφ 2λΣφ 1. Conversely, suppose u is an eigenvector of PK k=1 Xk ˆ 1 k XT k with eigenvalue γ. Then k=1 Xk ˆ 1 k XT k p + 1 4p2 γ2 u . This implies that k=1 Xk ˆ 1 k XT k 1 2 u = 2λΣ p + 1 4p2 γ2 1 So by Lemma 13 and (32), we have that k=1 Xk ˆ 1 k XT k k=1 Xk ˆ 1 k XT k p + 1 4p2 γ2 1 γ2 + 8λΣp u, (35) so u is indeed an eigenvector of ˆΣ with the eigenvalue 1 2p γ + p γ2 + 8λΣp . Since the eigenvectors of ˆΣ and PK k=1 Xk ˆ 1 k XT k are equivalent and (35) gives us the exact connection between their eigenvalues, it follows that Γ + Γ2 + 8λΣp I 1 Tang and Allen where U, Γ are defined by the eigendecomposition PK k=1 Xk ˆ 1 k XT k = U Γ UT . This same logic can be used to show that for each k = 1, . . . , K, Φk + Φ2 k + 8λkn I 1 where Vk, Φk are defined by the eigendecomposition XT k ˆΣ 1 Xk = Vk Φk VT k . We omit the proof since it uses the same argument as above. ( ) Now suppose that ˆΣ and ˆ 1, . . . , ˆ K satisfy equations (30) and (31). Since we know h 1 2p γ + p γ2 + 8λΣp i 1 = 1 4λΣ γ2 + 8λΣp γ , it follows that h 1 2p Γ + Γ2 + 8λΣp I 1 2 i 1 = 1 4λΣ Γ2 + 8λΣp I 1 2 Γ . Therefore, k=1 Xk ˆ 1 k XT k 2λΣ ˆΣ 1 = p U 1 Γ + Γ2 + 8λΣp I 1 2 UT U Γ UT Γ + Γ2 + 8λΣp I 1 Γ + Γ2 + 8λΣp I 1 2 UT U Γ UT Γ2 + 8λΣp I 1 Similarly, one can substitute in (31) and follow the same argument to show that (23) is also satisfied. 
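Concretely, Proposition 14 reduces the Σ-update to an eigendecomposition followed by a closed-form transformation of the eigenvalues. A minimal sketch of this single update is given below (the Δ_k-update is analogous, with n and λ_k in place of p and λ_Σ); `X` is a list of centered data matrices and `Delta_inv` a list of current inverse feature covariance estimates, both placeholders rather than the authors' code.

```r
# One Sigma-update of the additive Frobenius Flip-Flop (sketch).  Implements
# Sigma_hat = U Phi U^T with Phi_ii = (Gamma_ii + sqrt(Gamma_ii^2 + 8*lambda_Sigma*p)) / (2p).
update_Sigma <- function(X, Delta_inv, lambda_Sigma) {
  p <- sum(vapply(X, ncol, integer(1)))                  # p = p_1 + ... + p_K
  S <- Reduce(`+`, Map(function(Xk, Dk_inv) Xk %*% Dk_inv %*% t(Xk), X, Delta_inv))
  e <- eigen(S, symmetric = TRUE)                        # S = U Gamma U^T
  phi <- (e$values + sqrt(e$values^2 + 8 * lambda_Sigma * p)) / (2 * p)
  e$vectors %*% diag(phi) %*% t(e$vectors)               # regularized Sigma_hat
}
```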
As a consequence of Proposition 14, each update step in the additive Frobenius Flip Flop algorithm can be solved by taking a full eigendecomposition and then regularizing the eigenvalues. This algorithm is given in Algorithm 4. Algorithm 4 Flip-Flop Algorithm for Additive Frobenius Penalized i PCA Estimators 1: Center the columns of X1, . . . , XK, and initialize ˆΣ, ˆ 1, . . . , ˆ K to be positive definite. 2: while not converged do 3: Take eigendecomposition: PK k=1 Xk ˆ 1 k XT k = U Γ UT 4: Regularize eigenvalues: Φii = 1 Γ2 ii + 8λΣp 5: Update ˆΣ 1 = U Φ 1 UT 6: for k = 1, . . . , K do 7: Take eigendecomposition: XT k ˆΣ 1 Xk = V Φ VT 8: Regularize eigenvalues: Γii = 1 2n Φii + q Φ2 ii + 8λkn . 9: Update ˆ 1 k = V Γ 1 VT Integrated Principal Components Analysis C.2.2 Multiplicative Frobenius Penalized Flip-Flop Estimator The derivation of the multiplicative Frobenius Flip-Flop algorithm is very similar to the previous derivation with the additive Frobenius penalty. Thus, we omit most of the details and simply provide a sketch of the derivation. Recall that the multiplicative Frobenius penalized estimator solves ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 K = argmax Σ 1 0 1 1 ,..., 1 K 0 n p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k k=1 λk 1 k 2 F o , (36) for which the gradient equations are k=1 Xk ˆ 1 k XT k 2 ˆΣ 1 K X k=1 λk ˆ 1 k 2 F = 0, (37) n ˆ k XT k ˆΣ 1 Xk 2λk ˆ 1 k ˆΣ 1 2 F = 0 k = 1, . . . , K. (38) Assuming these gradient equations (37) and (38) are satisfied, we can write 2 PK k=1 λk|| ˆ 1 k ||2 F p I + 1 k=1 Xk ˆ 1 k XT k k=1 Xk ˆ 1 k XT k , ˆ k = 2λk|| ˆΣ 1||2 F n I + 1 4n2 XT k ˆΣ 1 Xk 2 1 2n XT k ˆΣ 1 Xk k = 1, . . . , K. We then can follow the same argument in Proposition 14 to show that ˆΣ and ˆ 1, . . . , ˆ K are solutions to the multiplicative Frobenius gradient equations (37) and (38) if and only if k=1 λk|| ˆ 1 k ||2 F I and ˆ k = Vk Φk + Φ2 k + 8nλk|| ˆΣ 1||2 F I 1 2 VT k k = 1, . . . , K, (40) where U, Vk, Γ, and Φk are defined by the eigendecompositions PK k=1 Xk ˆ 1 k XT k = U Γ UT and XT k ˆΣ 1 Xk = Vk Φk VT k . This gives rise to the Flip Flop algorithm for solving the multiplicative Frobenius penalized estimators, as given in Algorithm 1. C.2.3 Additive L1 Penalized Flip-Flop Estimator If the inverse covariance matrices are known to have a sparse underlying structure, then we can apply an L1 penalty, rather than the Frobenius penalty, to induce this sparsity. Note Tang and Allen that while it is possible to use a multiplicative L1 penalty, the multiplicative L1 penalty is not geodesically convex and is not separable (as defined in Tseng, 2001). Thus, we primarily consider the additive L1 penalized i PCA estimator: ˆΣ 1, ˆ 1 1 , . . . , ˆ 1 K = argmax Σ 1 0 1 1 ,..., 1 K 0 n p log | Σ 1 | + n k=1 log | 1 k | k=1 tr Σ 1 Xk 1 k XT k λΣ Σ 1 1,off k=1 λk 1 k 1,off o . (41) Note that 1,offpenalizes the off-diagonal entries (i.e. A 1,off= P i =j |aij|), but it is also possible to use the ordinary L1 norm 1. For fixed ˆ 1, . . . , ˆ K, the Flip-Flop update for ˆΣ is seen to be ˆΣ 1 = argmin Σ 1 0 p log | Σ 1 | + tr k=1 Xk ˆ 1 k XT k + λΣ Σ 1 1,off, and similarly for fixed ˆΣ and ˆ j, j = k, the update for ˆ k is ˆ 1 k = argmin 1 k 0 n log | 1 k | + tr 1 k XT k ˆΣ 1 Xk + λk 1 k 1,off. Both of which can be solved via the graphical lasso algorithm (Hsieh et al., 2011). Plugging in these updates to the framework laid out in Algorithm 3, we give the additive L1 Flip-Flop algorithm in Algorithm 5. 
Algorithm 5 Flip-Flop Algorithm for Additive L1 Penalized iPCA Covariance Estimators
1: Center the columns of X_1, ..., X_K, and initialize Σ̂, Δ̂_1, ..., Δ̂_K to be positive definite.
2: while not converged do
3:   Put A := (1/p) Σ_{k=1}^{K} X_k Δ̂_k^{-1} X_k^T
4:   Apply the graphical lasso: Σ̂^{-1} = argmin_{Σ^{-1}} { -log |Σ^{-1}| + tr(A Σ^{-1}) + λ_Σ ||Σ^{-1}||_{1,off} }
5:   for k = 1, ..., K do
6:     Put A_k := (1/n) X_k^T Σ̂^{-1} X_k
7:     Apply the graphical lasso: Δ̂_k^{-1} = argmin_{Δ_k^{-1}} { -log |Δ_k^{-1}| + tr(A_k Δ_k^{-1}) + λ_k ||Δ_k^{-1}||_{1,off} }

Appendix D. Geodesic Convexity and the Multiplicative Frobenius iPCA Estimator

Because of the central role that geodesic convexity plays in the multiplicative Frobenius iPCA estimator, we give a self-contained introduction to geodesic convexity in Appendix D.1. This review provides the necessary concepts and tools to prove the theorems in Appendix D.2. For a more comprehensive review, we refer to Rapcsák (1991) and Vishnoi (2018).

D.1 Introduction to Geodesic Convexity

Convex optimization problems arise frequently in a variety of machine learning applications such as regression, matrix completion, and clustering, to name a few examples. Beyond their widespread applications, convex problems are well understood theoretically and can be reliably solved in polynomial time. As a result, machine learning tasks are often formulated as convex problems in Euclidean space to guarantee fast convergence to a global solution. Convexity, however, is not limited to Euclidean spaces. Many tools which we know and love from convex optimization can be extended to geodesic convexity (g-convexity) on Riemannian manifolds. In this general Riemannian setting, there are several applications in which we have g-convexity but not convexity (Zhang and Sra, 2016).

Before formally defining geodesic convexity, we first recall some useful concepts from metric geometry. A metric space is a pair (X, d) consisting of a set X and a distance function d that satisfies positivity, symmetry, and the triangle inequality. A path γ is a continuous mapping from [0, 1] to X, and the length ℓ of a path γ is defined as ℓ(γ) := sup{ Σ_{i=1}^{n} d(γ(t_{i-1}), γ(t_i)) : 0 = t_0 < ... < t_n = 1, n ∈ N }. A metric space is a length space if d(x, y) = inf ℓ(γ), where the infimum is taken over all paths γ : [0, 1] → X joining x and y.

Definition 15 Let (X, d) be a length space. A path γ : [0, 1] → X is a geodesic if for every t ∈ [0, 1] there exists an interval [a, b] ⊆ [0, 1] which contains a neighborhood of t such that γ|_{[a,b]} is a shortest path from γ(a) to γ(b).

Put simply, a geodesic is a path which locally minimizes length. Note that geodesics minimize length locally, but not necessarily globally; a canonical example is the great circles on a sphere. This concept of a geodesic generalizes the notion of a line in Euclidean space to general (nonlinear) length spaces. By replacing lines with geodesics in the definition of convexity, we can extend convexity to g-convexity in a straightforward manner.

Definition 16 Let M be a Riemannian manifold. A function f : M → R is geodesically convex if for any x, y ∈ M, any geodesic γ such that γ(0) = x and γ(1) = y, and any t ∈ [0, 1], it holds that f(γ(t)) <= (1 - t) f(x) + t f(y).

The only caveat is that the underlying space must be a Riemannian manifold. To prevent a long-winded detour into the details of Riemannian geometry, we avoid the full technical definition and simply think of a Riemannian manifold as a real differentiable manifold equipped with the notion of an inner product.
We need the structure of a Riemannian manifold in order to meaningfully perform algebraic operations in our space. In summary, by extending the notion of a line to a geodesic, we can easily translate the notion of convexity on a Euclidean space to geodesic convexity on a Riemannian manifold. We next give a series of known properties regarding geodesic convexity that will be of use in the following proofs (Wiesel, 2012; Vishnoi, 2018). Many of these properties are analogous to the familiar convex setting. Tang and Allen Theorem 17 Any local minima of a geodesically convex function is a global minima. Proposition 18 The following operations preserve geodesic convexity: (i) (Addition) If f and g are g-convex functions, then f + g is g-convex. (ii) (Kronecker Products) Suppose f is a real-valued, g-convex function on Sd ++, and Qj Sdj ++ for each j = 1, . . . , J such that QJ j=1 dj = d. Then g(Q1, . . . , QK) = f(Q1 QK) is jointly g-convex in {Qj}J j=1 Sd1 ++ Sd K ++. The proof of Proposition 18(i) is straightforward from the definition of g-convex functions, and Proposition 18(ii) is proved in Wiesel (2012). Remark 19 (Example 4.9 in Vishnoi, 2018) Consider the manifold of positive definite matrices Sn ++. For Q0, Q1 Sn ++, the geodesic joining Q0 to Q1 can be parameterized as 1 2 0 t [0, 1]. (42) D.2 Global Convergence of the Multiplicative Frobenius i PCA Estimator We now have the tools to start proving global convergence of the multiplicative i PCA estimator. The roadmap of this proofs is as follows. First, we prove that the negative loglikelihood is g-convex. Then, we prove that the multiplicative Frobenius penalty is g-convex. Since the sum of g-convex functions is g-convex, this implies that the multiplicative i PCA estimator is a g-convex optimization problem. Furthermore, since the Flip-Flop algorithm was proven to converge in Theorem 7, it follows that the multiplicative Frobenius i PCA estimator converges to the global solution as a consequence of geodesic convexity! Without loss of generality, we assume that each data set Xk has been column-centered for the remainder of this section. Lemma 20 The negative log-likelihood of our model, (12), is jointly geodesically convex in Σ 1, 1 1 , . . . 1 K with respect to Sn ++ Sp1 ++ Sp K ++. Proof Let X = [X1, . . . , XK] Rn p and define 1 0 0 0 2 0 ... ... ... ... 0 0 K Since is a block diagonal matrix, then | | = QK k=1 | k | and | 1| = QK k=1 | 1 k |. This implies that k=1 log | 1 k | = log( k=1 | 1 k |) = log | 1|. (43) Integrated Principal Components Analysis Also, using this new parameterization, k=1 tr(Xk 1 k XT k Σ 1) = tr( X 1 X T Σ 1) = vec( X)T ( 1 Σ 1) vec( X). (44) Using (43) and (44), we rewrite the negative log-likelihood as ℓ(Σ 1, 1 1 , . . . , 1 K ) p log | Σ 1 | n k=1 log | 1 k | + k=1 tr(Xk 1 k XT k Σ 1) = p log | Σ 1 | n log | 1| + vec( X)T ( 1 Σ 1) vec( X) = log(| Σ 1 |p| 1|n) + vec( X)T ( 1 Σ 1) vec( X) = log(| 1 Σ 1 | 1) + vec( X)T ( 1 Σ 1) vec( X). Next, consider the manifold Snp ++, and let the function f : Snp ++ R be given by f(Q) = log(| Q | 1) + vec( X)T Q vec( X). Since f is the sum of g-convex functions on Snp ++ (Wiesel, 2012), and ℓ(Σ 1, 1) = f( 1 Σ 1), then by Proposition 18, the negative log-likelihood is jointly geodesically convex in Σ 1, 1 1 , . . . , 1 K with respect to Sn ++ Sp1 ++ Sp K ++. Corollary 21 Suppose the unpenalized log-likelihood in (20) is bounded. If the Flip-Flop estimators for the unpenalized log-likelihood exist, then they converge to the global solution of (20). 
Proof Recall that the Flip-Flop algorithm yields the iterates: p PK k=1 Xk ˆ 1 k XT k = argmax Σ ℓ(Σ, ˆ 1, . . . , ˆ K) 2. For each k = 1, . . . , K, ˆ k = 1 n XT k ( ˆΣ) 1 Xk = argmax k ℓ( ˆΣ, ˆ 1, . . . , ˆ k 1, k, ˆ k+1, . . . , ˆ K) Thus, each update of the Flip-Flop algorithm monotonically increases the log-likelihood ℓ. Assuming that the MLEs exist and are bounded, this Flip-Flop algorithm converges to a local maximum. Furthermore, since ℓis jointly geodesically concave in Σ 1, 1 1 , . . . 1 K , then all local maxima are global maxima so that the Flip-Flop estimators for the unpenalized log-likelihood converge to the global solution. Lemma 22 The multiplicative Frobenius norm penalty, Σ 1 2 F PK k=1 λk 1 k 2 F , is jointly geodesically convex in Σ 1, 1 1 , . . . , 1 K with respect to Sn ++ Sp1 ++ Sp K ++. Tang and Allen Proof Since tr(A B) = tr(A)tr(B) and A 2 F = tr(AT A), then P(Σ 1, 1 1 , . . . , 1 K ) := Σ 1 2 F k=1 λk 1 K 2 F (45) k=1 λktr(Σ 2)tr( 2 k ) (46) k=1 λktr( 2 k Σ 2) (47) k=1 λk 1 k Σ 1 2 F (48) λk 1 k ) Σ 1 2 F (49) = 1 Σ 1 2 F , (50) 1 λ1 1 0 0 0 1 λ2 2 0 ... ... ... ... 0 0 1 λK K We will next show that the function f : Snp ++ R defined by f(Qα) := Qα 2 F = tr((Qα)T Qα) = tr((Qα)2), α { 1} (51) is geodesically convex in Q Snp ++. That is, let Q0, Q1 Snp ++ be given, and let Qt be the geodesic between Q0 and Q1 as given in (42). We want to show that f(Qα t ) is a convex function with respect to t. So consider the eigendecomposition Q 1 2 0 = U D UT , where U is an orthogonal matrix and D is a diagonal matrix with diagonal entries di. Then for all t [0, 1], it follows from (51) that f(Qα t ) = tr(Qα t Qα t ) (52) 2 0 )αt Qα 0 (Q 1 2 0 (U D UT )αt Qα 0 (U D UT )αt Q 2 0 U Dαt UT Qα 0 U Dαt UT Q = tr(Dαt UT Qα 0 U Dαt UT Qα 0 U) (56) = tr(Dαt A Dαt A), (57) where A := UT Qα 0 U. Integrated Principal Components Analysis Because A is symmetric and (Dαt A)ij = dαt i aij, then (Dαt A Dαt A)ij = l=1 ((Dαt A)il(Dαt A)lj) = l=1 (dαt i aildαt l alj) = l=1 (dαt i aildαt l ajl). Plugging this into (57) gives i=1 (Dαt A Dαt A)ii = l=1 (dαt i aildαt l ail) = l=1 (a2 il(didl)αt). Note that since Q0 and Q1 are positive definite, then Q 1 2 0 is positive definite. Thus, di > 0 for each i = 1, . . . , np, and because di > 0 for each i = 1, . . . , np, f(Qα t ) = Pnp i=1 Pnp l=1(a2 il(didl)αt) is a convex function in t. This proves f is g-convex in Q Snp ++. Since f is g-convex in Q Snp ++ and P(Σ 1, 1) = f( 1 Σ 1) from (50), then the multiplicative Frobenius norm penalty P is g-convex in Σ 1, 1 1 , . . . , 1 K by Proposition 18(ii). Theorem 8 The multiplicative Frobenius i PCA estimator is jointly geodesically convex in Σ 1 and 1 1 , . . . , 1 K . Because of this, the Flip-Flop algorithm for the multiplicative Frobenius i PCA estimator given in Algorithm 1 converges to the global solution. Proof From Lemma 20 and Lemma 22, we see that the objective function corresponding to the multiplicative Frobenius penalized estimator in (36) is the sum of jointly g-convex functions. Thus, the multiplicative Frobenius penalized estimator is also jointly geodesically convex in Σ 1 and 1 1 , . . . , 1 K . Recall we have already proved that Algorithm 1 converges to a stationary point in Theorem 7. Since the multiplicative Frobenius i PCA estimator is geodesically convex, then all local optima are global optima, and the Flip-Flop algorithm for the multiplicative Frobenius i PCA estimator converges to the global solution of (36). 
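For reference, a minimal sketch of the resulting multiplicative Frobenius Flip-Flop (the structure of Algorithm 1, using the eigenvalue updates in (40)) is given below; `X` is a list of column-centered data matrices, `lambda` a vector of penalty parameters (one per data set), and the convergence check is simplified to a fixed number of sweeps. This is an illustration under those assumptions, not the authors' implementation.

```r
# Sketch of the multiplicative Frobenius Flip-Flop: each update is an
# eigendecomposition followed by a closed-form regularization of the eigenvalues.
ipca_mult_frob <- function(X, lambda, n_iter = 50) {
  K <- length(X); n <- nrow(X[[1]]); p <- sum(vapply(X, ncol, integer(1)))
  Sigma_inv <- diag(n)
  Delta_inv <- lapply(X, function(Xk) diag(ncol(Xk)))
  frob2 <- function(A) sum(A^2)                       # squared Frobenius norm
  for (it in seq_len(n_iter)) {
    # Sigma-update: penalty weight is sum_k lambda_k * ||Delta_k^{-1}||_F^2
    w <- sum(lambda * vapply(Delta_inv, frob2, numeric(1)))
    S <- Reduce(`+`, Map(function(Xk, Dk_inv) Xk %*% Dk_inv %*% t(Xk), X, Delta_inv))
    e <- eigen(S, symmetric = TRUE)
    phi <- (e$values + sqrt(e$values^2 + 8 * p * w)) / (2 * p)
    Sigma_inv <- e$vectors %*% diag(1 / phi) %*% t(e$vectors)
    # Delta_k-updates: penalty weight is lambda_k * ||Sigma^{-1}||_F^2
    for (k in seq_len(K)) {
      wk <- lambda[k] * frob2(Sigma_inv)
      Sk <- t(X[[k]]) %*% Sigma_inv %*% X[[k]]
      ek <- eigen(Sk, symmetric = TRUE)
      theta <- (ek$values + sqrt(ek$values^2 + 8 * n * wk)) / (2 * n)
      Delta_inv[[k]] <- ek$vectors %*% diag(1 / theta) %*% t(ek$vectors)
    }
  }
  list(Sigma_inv = Sigma_inv, Delta_inv = Delta_inv)
}
```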
Remark 23 The multiplicative Frobenius i PCA estimator is unique in the sense that the Kronecker production solution ˆΣ ˆ k is unique. Appendix E. Equivalence between PCA and i PCA with Frobenius Penalties when K = 1 One major reason for using the Frobenius penalties is that in the case where we observe only one data set, i PCA with the Frobenius penalties and the classical PCA are equivalent in the sense that the PC scores and loadings are the same. Throughout this section, we will assume K = 1, and we let the SVD of X (which has been column-centered) be given by X = U D VT , where U Rn n, D Rn p with the diagonal elements di, V Rp p, and r = rank(X) < min{n, p}. Without loss of generality, suppose also that p n. In PCA, it Tang and Allen is well known that the PC scores are given by the columns of U and the PC loadings are given by the columns of V. We will show that i PCA with the Frobenius penalties also yield the same PC scores U and loadings V. Let us first consider i PCA with the additive Frobenius penalty λΣ Σ 1 2 F +λ 1 2 F . By Theorem 1 in Allen and Tibshirani (2010), there is a unique global solution maximizing the matrix-variate normal log-likelihood with additive Frobenius penalties (21) when K = 1. This global solution is given by ˆΣ = U β UT and ˆ = V θ VT , where β = diag(β1, . . . , βn) and θ = diag(θ1, . . . , θp) are defined as follows: d2 i θi nθ2 i 2λ , i = 1, . . . , r s p , i = r + 1, . . . , n v u u t c2,i q c2 2,i 4c1c3,i 2c1 , i = 1, . . . , r r n , i = r + 1, . . . , p, with coefficients c1 = 2λΣn2, c2,i = d4 i (p n) + 8nλΣλ , c3,i = 2λ (d4 i 4λΣλ ). Since the PC scores and loadings from i PCA are the eigenvectors of ˆΣ and ˆ , respectively, the above result shows that the PC scores and loadings from i PCA with the additive Frobenius penalty are precisely U and V, as in PCA. We next investigate the equivalence of i PCA with the multiplicative Frobenius penalty λ Σ 1 2 F 1 2 F and PCA when K = 1. While we can follow a similar argument as Allen and Tibshirani (2010), the proof is complicated by the non-separable penalty terms. That is, each term in the multiplicative penalty depends on both ˆΣ and ˆ . We will return to this complication in the proof of the following theorem. Theorem 24 In the case when K = 1, the solution to i PCA with the multiplicative Frobenius penalty (36) is given by ˆΣ = U β UT and ˆ = V θ VT , where β = diag(β1, . . . , βn) Integrated Principal Components Analysis and θ = diag(θ1, . . . , θp) satisfy the system v u u t c2,i(T) q c2 2,i(T) 4c1(T)c3,i(T) 2c1(T) , i = 1, . . . , r n , i = r + 1, . . . , p βi = d2 i θi nθ2 i 2λ, i = 1, . . . , r p , i = r + 1, . . . , n k=1 θ 2 k + (p r) n with c1(T) = 2λn2T, c2,i(T) = d4 i (p n) + 8nλ2T, and c3,i(T) = 2λ(d4 i 4λ2T). This solution exists and is unique (up to a scaling factor). Proof As seen previously, the gradient equations from the matrix-variate normal loglikelihood with the multiplicative Frobenius penalty are p ˆΣ 2λ ˆΣ 1 ˆ 1 2 F = X ˆ 1 XT n ˆ 2λ ˆ 1 ˆΣ 1 2 F = XT ˆΣ 1 X . Thus, the eigenvectors of ˆΣ and ˆ are equal to their respective quadratic forms. It follows that there is only one solution for the eigenvectors namely, the left and right singular vectors of X = U D VT . [Note: the last n r eigenvectors of U and the last p r eigenvectors of V are not unique.] Put ˆΣ = U β UT and ˆ = V θ VT , where β = diag(β1, . . . , βn) and θ = diag(θ1, . . . , θp). Note that we have the implicit constraints βi > 0 and θi > 0 for each i due to the positive definiteness of the covariances. 
Plugging in these decompositions and the SVD of X into the gradient equations gives p U β UT 2λ U β 1 UT V θ 1 VT 2 F = U D VT V θ 1 VT V DT UT n V θ VT 2λ V θ 1 VT U β 1 UT 2 F = V DT UT U β 1 UT U D VT , or equivalently, using the orthogonality of U and V, pβ 2λβ 1 θ 1 2 F = D θ 1 DT nθ 2λθ 1 β 1 2 F = DT β 1 D . Tang and Allen We can write this element-wise as the following system of equations: pβi 2λβ 1 i k=1 θ 2 k = d2 i θ 1 i , i = 1, . . . , r (58) pβi 2λβ 1 i k=1 θ 2 k = 0, i = r + 1, . . . , n (59) nθi 2λθ 1 i k=1 β 2 k = d2 i β 1 i , i = 1, . . . , r (60) nθi 2λθ 1 i k=1 β 2 k = 0, i = r + 1, . . . , p. (61) Now, to simplify this system of equations, we notice that if (β, θ) solve this system, then for any positive scalar c, (cβ, c 1θ) is also a solution. Without loss of generality, we may thus assume that β is normalized so that Pn k=1 β 2 k = 1. It immediately follows from (61) that θi = q n for i = r + 1, . . . , p. On the other hand, for i = 1, . . . , r, (60) gives us that nθ2 i βi 2λβi = d2 i θi = βi = d2 i θi nθ2 i 2λ. Substituting this equation for βi into (58) then yields pd2 i θi nθ2 i 2λ 2λ(nθ2 i 2λ) d2 i θi k=1 θ 2 k = d2 i θi Finding a common denominator and expanding all terms yields [ 2λn2T]θ4 i + [d4 i (p n) + 8nλ2T]θ2 i + [2λ(d4 i 4λ2T)] = 0, where T = Pp k=1 θ 2 k = Pr k=1 θ 2 k + (p r) n 2λ. For the sake of notation, let us define c1(T) = 2λn2T, c2,i(T) = d4 i (p n) + 8nλ2T, and c3,i(T) = 2λ(d4 i 4λ2T). Summarizing what we have done so far, β 0 and θ 0 must satisfy: βi = d2 i θi nθ2 i 2λ, i = 1, . . . , r (62) p , i = r + 1, . . . , n (63) 0 = c1(T)θ4 i + c2,i(T)θ2 i + c3,i(T), i = 1, . . . , r (64) n , i = r + 1, . . . , p (65) k=1 θ 2 k + (p r) n Integrated Principal Components Analysis Now, from the quartic polynomial in (64), the four possible roots are v u u t c2,i(T) q c2 2,i(T) 4c1(T)c3,i(T) In any case, c1(T) < 0, c2,i(T) > 0, and c2 2,i(T) 4c1(T)c3,i(T) = d8 i (p n)2 + 16λ2npd4 i T > 0, v u u t c2,i(T) q c2 2,i(T) 4c1(T)c3,i(T) 2c1(T) . (67) is always a real positive root. Furthermore, if θi is given by (67) for each i = 1, . . . , r, then nθ2 i 2λ = 1 4λn T d4 i (p n) 8nλ2T q d8 i (p n)2 + 16λ2npd4 i T 2λ = 2λ + 1 4λn T d4 i (p n) + q d8 i (p n)2 + 16λ2npd4 i T 2λ Thus, the corresponding βi, which we have already shown to be given by βi = d2 i θi nθ2 i 2λ, is also positive. This shows that (67) yields a feasible solution for our system of equations (62)-(66). We claim that this is the only choice of θi, which yields a feasible solution. To see this, we first immediately eliminate the two negative square root solutions due to the positivity constraint on θi. We next divide the argument into three cases. Case 1: If d4 i 4λ2T = 0, then c3,1(T) = 0 and so c2,i(T) c2,i(T) which equals 0 if we take the positive sign. Thus, in this case, there is only one feasible root by choosing the negative sign. Case 2: If d4 i 4λ2T > 0, then c3,i(T) > 0. Additionally, c1(T) < 0 and c2,i(T) > 0, so by Descartes rule of signs, there is at most one real positive solution to (64). As shown earlier, the root given in (67) is a real positive solution. This must be the unique real positive root by Descartes . Case 3: If d4 i 4λ2T < 0, then c3,i(T) < 0, so by Descartes rule of signs, there are at most two real positive solutions to (64). We have already found one real positive root, given by (67). Suppose also that the other possible root v u u t c2,i(T) + q c2 2,i(T) 4c1(T)c3,i(T) 2c1(T) . (68) Tang and Allen is a real positive root. 
Since the corresponding βi must also be positive (in order to be feasible) and we have already shown that βi = d2 i θi nθ2 i 2λ, it follows that the denominator nθ2 i 2λ must be positive. However, substituting (68) into this denominator gives nθ2 i 2λ = 1 4λn T d4 i (p n) 8nλ2T + q d8 i (p n)2 + 16λ2npd4 i T 2λ = 2λ + 1 4λn T d4 i (p n) q d8 i (p n)2 + 16λ2npd4 i T 2λ This contradicts the positivity of βi and hence is not a feasible solution. Thus, in any case, we have shown that there is only one feasible root of (64), which is given by (67). The last step of this proof is to show that there exists a T which solves our system of equations (62)-(66). In particular, we can substitute (67) into (66) to see that T must satisfy c2 2,i(T) 4c1(T)c3,i(T) + (p r) n d4 i (p n) + 8nλ2T + q d8 i (p n)2 + 16λ2npd4 i T + (p r) n Let f(T) denote the right hand side of the equation in (70). It can be shown that f (T) > 0 for all T 0. Also, when T = 0, we have that f(T) = (p r) n 2λ > 0 = T, and as T , we have f(T) np 2λ < . Together, these observations imply that there exists a unique solution to the equation T = f(T) for some T > 0. Thus, there exists a unique feasible T, and hence also β and θ, to our system of equations (62)-(66). As a direct consequence, since the PC scores and loadings from i PCA are the eigenvectors of ˆΣ and ˆ , respectively, Theorem 24 shows that when K = 1, the PC scores and loadings from i PCA with the multiplicative Frobenius penalty are precisely U and V, as in PCA. In other words, i PCA with the additive or multiplicative Frobenius penalty is a proper generalization of PCA to multiple data sets. Appendix F. Subspace Consistency of the Additive L1 Correlation i PCA Estimator While the multiplicative Frobenius estimator appears superior from an optimization pointof-view, we will show in this section that the additive L1 correlation estimator satisfies one of the first statistical guarantees in the data integration context. Specifically, our primary objective will be to prove that the additive L1 correlation estimator after one Flip-Flop iteration, outlined in Algorithm 6, is a consistent estimator of the true joint covariance matrix Σ and hence also consistently estimates the underlying joint subspace. Integrated Principal Components Analysis As mentioned in the main text, the additive L1 correlation estimator applies the L1 penalty to the correlation matrix, rather than the usual covariance matrix, and has been adopted previously in Zhou (2014a) and Rothman et al. (2008) for non-integrated data. In this non-integrated setting, Zhou (2014a) derived convergence rates for the one-step version of Algorithm 6, assuming only one matrix instance was observed from the matrix-variate normal distribution. Motivated by this approach, we extend the proof idea and results in Zhou (2014a), where K = 1, to i PCA, where we observe one matrix instance for each of the K 1 distinct matrix-variate normal models. Our proof closely resembles Zhou (2014a) with technical challenges due to the estimation of Σ from multiple k s with different pk s. We will see later that the consistency rate obtained in our main theorem, Theorem 30, which holds for all K 1, is equivalent to the rate given in Zhou (2014a) when K = 1. Note however that this subspace consistency property is an incredibly unique property of the additive L1 correlation estimator. When trying to prove a similar result for the other proposed estimators, we run into several difficulties. 
For instance, if we use the additive L1 covariance estimator, the proof of our main result in Theorem 30 no longer goes through due to an additional p term in the graphical lasso convergence rate (Rothman et al., 2008). More specifically, the graphical lasso convergence rate when applied to the inverse covariance matrix, known to be O q (p+s) log(p) , is not fast enough to guarantee statistical convergence of the Flip-Flop algorithm. However, by applying the graphical lasso to the inverse correlation matrix, we obtain a faster rate of O q , which then enables us to prove Theorem 30. Other works, which have studied convergence rates of sparse penalties in the non-integrated setting, are also not applicable for integrated data problems. In particular, Tsiligkaridis et al. (2013) proved convergence rates for the additive L1 covariance penalty but assumed multiple matrix observations per matrix-variate normal model. Since i PCA assumes only one matrix observation per model, the guarantees from Tsiligkaridis et al. (2013) do not hold. The problem of proving rates of convergence for the Frobenius estimators is even more difficult than the L1 estimators. Without imposing some additional structure on the covariance matrices, we cannot even hope to prove statistical consistency in the pk > n setting. For the L1 penalties, it is natural to impose a sparsity constraint, but with the dense Frobenius penalties, the appropriate underlying structure of the covariance matrices is unclear. One preliminary idea is to exploit the g-convexity of the log-likelihood function and impose some additional structure based upon the associated manifold. However, we leave this for future research as it requires developing a whole new set of tools, combined with Riemannian geometry, to study statistical properties in the manifold space, rather than the usual Euclidean space. We next collect notation. Let ˆΣ denote the additive L1 correlation estimator obtained after one iteration of the while loop in Algorithm 6. Assume that for each k = 1, . . . , K, the true population model is given by Xk Nn,pk(0, Σ k), where Σ = (σij) and k = (δk,ij). For the purpose of identifiability, define Σ = (σ ,ij) = n Σ /tr(Σ) and k = (δ k,ij) = tr(Σ) k n so that tr(Σ ) = n and Σ k = Σ k for each k. Let ρ(Σ) and ρ( k) denote the true correlation matrices corresponding to Σ and k, respectively. Define vec(A) to be the vectorization operator which creates a column vector from the Tang and Allen Algorithm 6 Flip-Flop Algorithm for Additive L1 Correlation-Penalized i PCA Estimators 1: Center the columns of X1, . . . , XK, and initialize ˆΣ, ˆ 1, . . . , ˆ K to be the identity matrix of the appropriate size. 2: while not converged do 3: for k = 1, . . . 
, K do 4: Compute sample covariance matrix: ˆSk = 1 n XT k ˆΣ 1 Xk 5: Get standard deviation estimate: ˆ Wk = diag(ˆSk)1/2 6: Convert to sample correlation matrix: ˆSρ,k = ˆ W 1 k ˆSk ˆ W 1 k 7: Apply graphical lasso to estimate correlation matrix: ˆ 1 ρ,k = argmin 1 ρ,k log | 1 ρ,k | + tr(ˆSρ,k 1 ρ,k) + λk 1 ρ,k 1,off 8: Convert back to covariance estimate: ˆ k = ˆ Wk ˆ ρ,k ˆ Wk 9: Compute sample covariance matrix: ˆSΣ = 1 p PK k=1 Xk ˆ 1 k XT k 10: Get standard deviation estimate: ˆ WΣ = diag(ˆSΣ)1/2 11: Convert to sample correlation matrix: ˆSρ,Σ = ˆ W 1 Σ ˆSΣ ˆ W 1 Σ 12: Apply graphical lasso to estimate correlation matrix: ˆΣ 1 ρ = argmin Σ 1 ρ log | Σ 1 ρ | + tr(ˆSρ,Σ Σ 1 ρ ) + λΣ Σ 1 ρ 1,off 13: Convert back to covariance estimate: ˆΣ = ˆ WΣ ˆΣρ ˆ WΣ matrix A by stacking the columns of A below one another. Then for each k = 1, . . . , K, put Sk = vec(Xk)vec(Xk)T and Sk = vec(Xk)T vec(Xk). Let Srq k denote the r, qth block of size n n of Sk, and let Srq k denote the r, qth block of size pk pk of Sk. If A is a matrix, let A 2 denote the operator norm or the maximum singular value of A, let A F denote the Frobenius norm (i.e. A 2 F = P i,j a2 ij), let A 0,offdenote the number of non-zero non-diagonal entries in A, let A 1 = P i,j |aij|, and let A 1,off= P i =j |aij|. Denote the stable rank of A by r(A) = A 2 F / A 2 2. Let us also write for a real symmetric matrix A, φmin(A) to be the minimum eigenvalue of A. Define σmin = mini σii, σmax = maxi σii, δk,min = mini δk,ii, δk,max = maxi δk,ii, and similarly for σ ,min, σ ,max, δ k,min, and δ k,max. Also write a b = max(a, b) and a b = min(a, b). If a = o(b), then |a/b| 0 as n, p1, . . . , p K . If a b, then there exists positive constants c, C such that cb a Cb as n, p1, . . . , p K . Lastly, we also adopt the notation defined in Algorithm 6 for the remainder of this section. To establish the consistency of ˆΣ, we require the following assumptions, which are generalizations of those in (Zhou, 2014a): (A1) Assume that Σ 1 and 1 k are sparse with respect to each other s dimensions: sΣ := Σ 1 0,off= o p2 pk log(n pk) and sk := 1 k 0,off= o n log(n pk) for each k = 1, . . . , K. (A2) Assume that we have uniformly bounded spectra: 0 < φmin(Σ) φmax(Σ) < and 0 < φmin( k) φmax( k) < for each k = 1, . . . , K. Integrated Principal Components Analysis (A3) Assume that the inverse correlation matrices satisfy ρ(Σ) 1 1 n and ρ( k) 1 1 pk for each k = 1, . . . , K. (A4) Assume that K is finite and the growth rate of n and p1, . . . , p K satisfy n pk = o(exp(n pk)) for each k = 1, . . . , K. Remark 25 (1) Rather than verifying the sparsity assumption of Σ 1 in (A1) and the growth rate (A4) for each k, it is sufficient to check that sΣ = o p2 maxk pk log(n maxk pk) and n maxk pk = o(exp(n mink pk)), respectively. (2) (A1) implies that q sk log(n pk) sΣ log(n pk) pk 0 as n, p1, . . . , p K . Under these assumptions, we prove our main statistical consistency result in Theorem 30. From this, we can easily establish subspace consistency for ˆΣ in Corollary 31. The overall idea of this convergence proof is to follow Algorithm 6 and sequentially bound each step in the algorithm. First, the error from line 8 of Algorithm 6 can be bounded by adapting results from Rothman et al. (2008). Then, in Lemma 28, we bound the error between Σ and the sample covariance estimate ˆSΣ defined in the step 9. Following the Flip-Flop algorithm, we next bound the error between ρ(Σ) and the sample correlation estimate ˆSρ,Σ from step 11 in Theorem 29. 
Finally, in Theorem 30, we prove the rate of convergence in the operator and Frobenius norms for ˆΣ 1 and ˆΣ as defined in step 13 of the algorithm. Two direct consequences of convergence in the operator norm are consistent eigenvalues and eigenvectors of ˆΣ, which in turn yield subspace consistency. F.1 Preliminaries The main driver behind Lemma 28 is a large deviation inequality, namely Theorem 13.1 from Zhou (2014b). As this is an important result used multiple times in the proof of Lemma 28, we state Theorem 13.1 from Zhou (2014b) below in a slightly different form for convenience. Theorem 26 Assume that n pk 2. Let M be an n n matrix and N be a pk pk matrix such that 1 n M 2 F < and 1 pk N 2 F < . Define τk = 2C K2 log1/2(n pk), where C := 1 c 1 c 1 and K and c are the constants from Theorem 12.1 in Zhou (2014b). (i) If the stable ranks satisfy r(Σ1/2 M Σ1/2) 4 log(n pk) and r( 1/2 k N 1/2 k ) 4 log(n pk), then with probability 1 3 (n pk)2 , we have that diag( k) 1/2 r=1 Mqr Srq k diag( k) 1/2 tr(Σ M) n ρ( k) Dkτk and diag(Σ) 1/2 r=1 Nqr Srq k diag(Σ) 1/2 tr( k N) p ρ(Σ) D kτk, where Dk = 1 n Σ1/2 M Σ1/2 F and D k = 1 p 1/2 k N 1/2 k F . Tang and Allen (ii) If n pk = o(expn pk), then with probability 1 3 (n pk)2 , the above inequalities hold with Dk = 2 n Σ 2 M 2 and D k = 2 pk We refer to Zhou (2014b) for the proof of this theorem, but if one looks at the interior of this proof, we obtain the following useful result, presented in Corollary 27. To simplify notation, let E (k, M) denote the event ( diag( k) 1/2 1 n r=1 Mqr Srq k diag( k) 1/2 tr(Σ M) n ρ( k) 2 n Σ 2 M 2τk let EΣ(k, N) denote the event ( diag(Σ) 1/2 1 p r=1 Nqr Srq k diag(Σ) 1/2 tr( k N) p ρ(Σ) 2 pk p k 2 N 2τk and define for each k = 1, . . . , K, νn,k := 2 n τk = 4C K2 r νpk := 2 pk p τk = 4C K2 p pk log(n pk) p = 4C K2 pk Corollary 27 Assume the same conditions and notation as Theorem 26. Also suppose that n pk = o(expn pk) Then under the event E (k, M), we have that r=1 Mqr Srq k νn,k Σ 2 M 2 p δk,iiδk,jj i, j, and under the event EΣ(k, N), we have that r=1 Nqr Srq k νpk k 2 N 2 σiiσjj i, j. We now have the necessary large deviation inequalities to prove Lemma 28, a generalization of Lemma 6.1 from Zhou (2014a). Though the proof of Lemma 28 closely resembles that of Lemma 6.1 from Zhou (2014a), modifications must be made as i PCA considers multiple distinct matrix-variate normal models while Zhou (2014a) considers only one matrix-variate normal model. For clarity, we give our proof in its entirety and refer to results in Zhou (2014a) when necessary. Lemma 28 Suppose that (A1)-(A4) hold. Let ˆ ρ,k and ˆ k be obtained as in steps 7 and 8 in Algorithm 6, where we choose λk = 2αk ϵ(1 αk) 3αk 1 αk for αk = A νn,k where A = n Σ F Integrated Principal Components Analysis and ϵ (0, 2/3). Then on event E , we have for ˆSΣ defined in step 9 of Algorithm 6, 4C K2 σ ,iiσ ,jj pk (1 + o(1)) k=1 |σ ,ij| µk σ ,iiσ ,jj νpk(1 + o(1)) + |σ ,ij| µk , µk := λk ˆ 1 ρ,k 1,off p + αk 1 αk ˆ 1 ρ,k 1 p µk, µk := λk ρ( k) 1 1,off p + αk 1 αk Moreover, P(E ) 1 PK k=1 8 (n pk)2 . To put simply, with high probability, 4C K2σ ,max pk (1 + o(1)) k=1 σ ,maxµk k=1 [νpk(1 + o(1)) + µk] . Proof For each k = 1, . . . , K, define RΣ,k := [δk,11vec(Σ) . . . δk,1pkvec(Σ) . . . δk,pkpkvec(Σ)] vec(Σ) vec( k)T ˆRΣ,k := h vec( S11 k ) . . . vec( S1pk k ) . . . vec( Spkpk k ) i . Then one can verify as in Zhou (2014a) that vec(Σ ) = 1 k=1 RΣ,k vec(( k) 1) and vec(ˆSΣ) = 1 k=1 ˆRΣ,kvec( ˆ 1 k ). 
(72) The equalities from (72) thus yield vec(ˆSΣ Σ ) = 1 k=1 ˆRΣ,kvec( ˆ 1 k ) 1 k=1 RΣ,k vec(( k) 1) (73) k=1 ˆRΣ,kvec( ˆ 1 k ) 1 k=1 RΣ,k vec(( k) 1) k=1 ˆRΣ,kvec(( k) 1) 1 k=1 ˆRΣ,kvec(( k) 1) (74) h ( ˆRΣ,k RΣ,k)vec(( k) 1) + ˆRΣ,kvec( ˆ 1 k ( k) 1) i . (75) Tang and Allen For each k = 1, . . . , K, define Θk := ˆ k k and Θk := ˆ 1 k ( k) 1, and notice that Θk = ˆ 1 k Θk( k) 1. Then ˆRΣ,kvec( Θk) = ˆRΣ,kvec( Θk) + RΣ,k vec( Θk) RΣ,k vec( Θk) (76) = RΣ,k vec( Θk) + ( ˆRΣ,k RΣ,k)vec( Θk) (77) = RΣ,k vec( ˆ 1 k Θk( k) 1) + ( ˆRΣ,k RΣ,k)vec( ˆ 1 k Θk( k) 1). (78) Putting (75) and (78) together gives vec(ˆSΣ Σ ) = p( ˆRΣ,k RΣ,k)vec(( k) 1) + 1 p RΣ,k vec( ˆ 1 k Θk( k) 1) p( ˆRΣ,k RΣ,k)vec( ˆ 1 k Θk( k) 1) i k=1 (U1,k + U2,k + U3,k), where the matrix correspondent for each of the above terms will be denoted by M1,k, M2,k, and M3,k, respectively. We will proceed to bound each of the terms separately. In order to bound U1,k, notice that by definition of RΣ,k and ˆRΣ,k, p( ˆRΣ,k RΣ,k)vec(( k) 1) (79) r=1 vec( Sqr k δk,qr Σ)[( k) 1]qr (80) r=1 vec( Sqr k )[( k) 1]qr tr( k( k) 1) p vec(Σ) (81) r=1 vec( Sqr k )[( k) 1]qr pk p vec(Σ ) (82) r=1 Sqr k [( k) 1]qr pk Define E0,k to be the event E (k, (Σ ) 1) EΣ(k, ( k) 1). Then by Corollary 27, under the event E0,k, we have |(M1,k)ij| νpk σ ,iiσ ,jj. Moreover, by Theorem 26, P(E0,k) 1 3 (n pk)2 . Next, we will bound the second term U2,k. As in Zhou (2014b), we can write p RΣ,k vec( ˆ 1 k Θk( k) 1) = 1 ptr( ˆ 1 k Θk)vec(Σ ), and M2,k = 1 ptr( ˆ 1 k Θk) Σ . Integrated Principal Components Analysis By Claim 17.3 in Zhou (2014b), it follows that that under event X0,k := EΣ(k, I) E (k, I), λk ˆ 1 ρ,k 1,off αk 1 αk ˆ 1 ρ,k 1 tr(Θk ˆ 1 k ) λk ˆ 1 ρ,k 1,off+ αk 1 αk ˆ 1 ρ,k 1 |tr( Θk ˆ 1 k )| λk ˆ 1 ρ,k 1,off+ αk 1 αk ˆ 1 ρ,k 1, which implies that on X0,k, |(M2,k)ij| |σ ,ij| λk ˆ 1 ρ,k 1,off+ αk 1 αk ˆ 1 ρ,k 1 = |σ ,ij| µk. Additionally, by Theorem 26, P(X0,k) 1 3 (n pk)2 . To bound the final term U3,k, we follow the same logic as (79)-(82) to obtain p( ˆRΣ,k RΣ,k)vec( Θk) r=1 vec( Sqr k )[ Θk]qr tr( k Θk) and M3,k = 1 r=1 Sqr k [ Θk]qr tr( k Θk) Define E1,k := EΣ(k, Θk). By the proof of Theorem 26, P(E1,k|X0,k) 1 2 (n pk)2 . On the other hand, under the event E1,k, Corollary 27 gives |(M3,k)ij| νpk k 2 Θk 2 σiiσjj. We next bound Θk 2 using Corollary 10.1 from Zhou (2014b), so assuming that X0,k holds, then sk 1 δ k,minφ2 min(ρ( k)). In summary, under the event E := K k=1(E0,k E1,k X0,k), we have that |(ˆSΣ Σ )ij| k=1 (|(M1,k)ij| + |(M2,k)ij| + |(M3,k)ij|) νpk σ ,iiσ ,jj + |σ ,ij| µk sk 1 δ k,minφ2 min(ρ( k)) νpk σ ,iiσ ,jj + |σ ,ij| µk Tang and Allen sk 1 δk,minφ2 min(ρ( k)) νpk σ ,iiσ ,jj(1 + o(1)) + |σ ,ij| µk under the assumptions. Furthermore, applying the union bound implies that P(E ) 1 PK k=1 8 (n pk)2 . Assuming that event X0,k holds, µk µk is a consequence of Corollary 17.4 from Zhou (2014b), and this concludes the proof. It is important to point out that in our choice of λk in (71), A can be considered a constant under the bounded spectrum assumption (A2). In addition, (A1) implies that q n 0 as n, pk . Therefore, since λk is on the order of A q n , λk 0 as n, pk (for k = 1, . . . , K). Next, stepping through Algorithm 6, we bound the error between the correlation estimate ˆSρ,Σ and the true correlation matrix ρ(Σ). Theorem 29 Suppose the conditions in Lemma 28 hold. Define ηk := νpk(1 + o(1)) + µk. 
Then under event E , we have for ˆSρ,Σ defined in step 11 in Algorithm 6 and i = j, pk (1 + o(1))(1 + |ρ(Σ)ij|) 1 PK k=1 ηk + 2|ρ(Σ)ij| PK k=1 µk 1 PK k=1 ηk (84) " νpk(1 + o(1))(1 + |ρ(Σ)ij|) 1 PK k=1 ηk + 2|ρ(Σ)ij| µk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk , where ηk := νpk(1 + o(1)) + µk. (86) Proof Because µk µk by Lemma 28 and |ρ(Σ)ij| 1 for all i, j, it is clear that " νpk(1 + o(1))(1 + |ρ(Σ)ij|) 1 PK k=1 ηk + 2|ρ(Σ)ij| µk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk . Therefore, it suffices to show (84). Assume throughout this proof that event E holds. Then by Lemma 28, k=1 [νpk(1 + o(1)) + µk] = Integrated Principal Components Analysis which implies k=1 ηk = r σ ,ii 1 PK k=1 ηk (87) for all i. On the other hand, for i = j, Lemma 28 gives ˆSΣ σ ,iiσ ,jj ρ(Σ) ˆSΣ σ ,iiσ ,jj Σ σ ,iiσ ,jj νpk(1 + o(1)) + |σ ,ij| σ ,iiσ ,jj µk k=1 (νpk(1 + o(1)) + |ρ(Σ)ij| µk) , (90) Thus, for i = j, [ˆSΣ]ij [ˆSΣ]1/2 ii [ˆSΣ]1/2 jj ρ(Σ)ij [ˆSΣ]ij σ ,iiσ ,jj [ˆSΣ]1/2 ii σ ,ii [ˆSΣ]1/2 jj σ ,jj [ˆSΣ]ij σ ,iiσ ,jj ρ(Σ)ij [ˆSΣ]1/2 ii σ ,ii [ˆSΣ]1/2 jj σ ,jj [ˆSΣ]1/2 ii σ ,ii [ˆSΣ]1/2 jj σ ,jj [ˆSΣ]1/2 ii [ˆSΣ]1/2 jj [ˆSΣ]ij σ ,iiσ ,jj ρ(Σ)ij [ˆSΣ]1/2 ii [ˆSΣ]1/2 jj 1 1 PK k=1 ηk k=1 (νpk(1 + o(1)) + |ρ(Σ)ij| µk) 1 PK k=1 ηk 1 νpk(1 + o(1))(1 + |ρ(Σ)ij|) + 2|ρ(Σ)ij| µk 1 PK k=1 ηk . (96) as desired. Note that (95) follows from (87) and (90). Tang and Allen F.2 Main Result We can now build on top of Theorem 29 and existing results to prove our main convergence result in Theorem 30, which is a generalization of Corollary 17.2 from Zhou (2014b). Indeed, when K = 1, Theorem 30 gives the same rates as Zhou (2014a). Moreover, as a direct consequence of Theorem 30, we obtain Corollary 31, which establishes bounds on the eigenvectors and eigenvalues of the additive L1 correlation i PCA estimator, thereby giving the desired subspace consistency. Theorem 30 Suppose that (A1)-(A4) hold and that n p pk for each k = 1, . . . , K. Assume also that η 1/4, and λΣ is chosen to be λΣ = 2 PK k=1 ηk ϵ1(1 PK k=1 ηk) , for some ϵ1 (0, 1). (97) Then on the event E , ˆΣ Σ 2 2 CλΣσ ,maxκ(ρ(Σ))2 sΣ 1, (98) ˆΣ Σ F 2 CλΣσ ,maxκ(ρ(Σ))2 sΣ n, (99) ˆΣ 1 (Σ ) 1 2 CλΣ sΣ 1 σ ,minφ2 min(ρ(Σ)), (100) ˆΣ 1 (Σ ) 1 2 CλΣ sΣ n σ ,minφ2 min(ρ(Σ)) (101) for some constant C. Proof Assume that E holds throughout this proof. Next, recall that from Theorem 29, ω := 2 PK k=1 ηk 1 PK k=1 ηk 2 PK k=1 ηk 1 PK k=1 ηk . By defining Cp,k := 2 ϵ ρ( k) 1 1,off p + ρ( k) 1 1 p , it follows that ηk = νpk + αk 1 αk Cp,k (1+ o(1)). Then because Cp,k 1 under (A3) and νpk 0 as n, pk under (A1) and (A2), 2 PK k=1 ηk 1 PK k=1 ηk k=1 νpk 0 as n, pk . Moreover, under (A1), we have ω sΣ 1 = o(1). Thus, we can apply Theorem 4.5 from Zhou (2014a) to obtain for some constant C, ˆΣρ ρ(Σ) 2 ˆΣρ ρ(Σ) F Cκ(ρ(Σ))2λΣ sΣ 1 and ˆΣ 1 ρ ρ(Σ) 1 2 ˆΣ 1 ρ ρ(Σ) 1 F CλΣ sΣ 1 2φ2 min(ρ(Σ)). Integrated Principal Components Analysis Now since we have a bound on the correlation estimates, the next step is to consider the covariance estimates. Let us define WΣ = diag(Σ )1/2 = q n tr(Σ)diag(Σ)1/2. By Lemma 28, | ˆ W 2 Σ,ii W2 Σ,ii | = | ˆ W 2 Σ,ii σ ,ii| σ ,ii ˆ WΣ WΣ 2 σ ,max k=1 ηk (103) and ˆ W 1 Σ W 1 Σ 2 1 σ ,max 1 + PK k=1 ηk 1 q 1 + PK k=1 ηk 1 q 1 PK k=1 ηk q 1 PK k=1 ηk PK k=1 ηk q 1 PK k=1 ηk . 
(105) Using Proposition 15.2 in Zhou (2014b), (103), and (105), we obtain ˆΣ Σ 2 = ˆ WΣ ˆΣρ ˆ WΣ WΣ ρ(Σ) WΣ 2 ˆ WΣ WΣ 2 + WΣ 2 2 ˆΣρ ρ(Σ) 2 + ˆ WΣ WΣ 2 ˆ WΣ WΣ 2 + 2 ρ(Σ) 2 Cκ(ρ(Σ))2λΣ sΣ 1 σ ,max And because λΣ was chosen to satisfy PK k=1 ηk < λΣ(1 PK k=1 ηk)/2 where PK k=1 ηk PK k=1 ηk 1/4, we have ˆΣ Σ 2 Cκ(ρ(Σ))2λΣ sΣ 1 σ ,max + σ ,max λΣ 2 Cκ(ρ(Σ))2σ ,maxλΣ sΣ 1. Tang and Allen We can also bound the error on the Frobenius norm similarly. Using Proposition 15.2 in Zhou (2014b), (103), (105), PK k=1 ηk < λΣ(1 PK k=1 ηk)/2, and PK k=1 ηk PK k=1 ηk 1/4 we see that ˆΣ Σ F ˆ WΣ WΣ 2 + WΣ 2 2 ˆΣρ ρ(Σ) F + ˆ WΣ WΣ 2 ˆ WΣ WΣ 2 + 2 ρ(Σ) F Cκ(ρ(Σ))2λΣ sΣ 1 σ ,max 2 Cκ(ρ(Σ))2σ ,maxλΣ sΣ n. The same logic can be used to prove (100) and (101), so we omit the details. To summarize the convergence results from Theorem 30, if we set λk = 2αk ϵ(1 αk) n k = 1, . . . , K, (106) and λΣ = 2 PK k=1 ηk ϵ1(1 PK k=1 ηk) then according to Theorem 30, with probability 1 PK k=1 8 (n pk)2 , (sΣ 1) log(n pk) ˆΣ 1 (Σ 1) 2 = O (sΣ 1) log(n pk) (sΣ n) log(n pk) ˆΣ 1 (Σ 1) F = O (sΣ n) log(n pk) Subspace consistency for the additive L1 correlation estimator, ˆΣ, follows as a direct corollary of Theorem 30. Corollary 31 Let ui denote the eigenvector of Σ corresponding to the eigenvalue φi and ˆui be the eigenvector of ˆΣ with eigenvalue ˆφi, where the eigenvalues are sorted in descending order. Under the conditions stated in Theorem 30, we have the following: Integrated Principal Components Analysis (i) For each i, |ˆφi φi| = O (sΣ 1) log(n pk) (ii) For each i such that min(φi 1 φi, φi φi+1) = 0, ˆui ui 2 = O (sΣ 1) log(n pk) Thus, the eigenvectors and eigenvalues of the one-step additive L1 correlation estimator, ˆΣ, are consistent. Proof (i) By Weyl s theorem (Horn and Johnson, 2012), we have for each i, |ˆφi φi| ˆΣ Σ 2. Thus, it follows from Theorem 30 that |ˆφi φi| = O (sΣ 1) log(n pk) (ii) By a variant of the Davis-Kahan sin θ theorem given in Yu et al. (2015), we have for each i, ˆui ui 2 23/2 ˆΣ Σ 2 min(φi 1 φi, φi φi+1), assuming that min(φi 1 φi, φi φi+1) = 0 and ˆv T v 0. Note however that if ˆu T i ui < 0, we can simply take the negative of ˆui and apply the theorem to ˆui and ui. Thus, for each i such that min(φi 1 φi, φi φi+1) = 0, it follows from Theorem 30 that ˆui ui 2 = O (sΣ 1) log(n pk) Since PK k=1 pk (sΣ 1) log(n pk) pk converges to 0 as n, p1, . . . , p K under (A1), this gives us the consistency of the eigenvalues and eigenvectors of ˆΣ. Note though that the consistency statements above assume that n p pk k = 1, . . . , K (i.e. the large n setting). If instead n < p pk k = 1, . . . , K (i.e. the large p setting), then we can modify Algorithm 6 to first initialize an estimate of Σ assuming ˆ k = I for each k. Call this initial estimate ˆΣ 1. Then estimate k given ˆΣ 1. Call this estimate ˆ k. Lastly, obtain the final estimate of Σ given ˆ 1, . . . , ˆ K. Denote this final estimate of Σ by ˆΣ. Similar convergence rates can be obtained for the large p setting Tang and Allen by using this modified algorithm. Namely, if the penalty parameters λk and λΣ are chosen on the order of (106) and (107), we can obtain the same rates as (108)-(111). The only additional assumption required here is a bound on |ρ(Σ)ij|, namely, |ρ(Σ)ij| = O( npk p ) for each k = 1, . . . , K and i = j. We omit this proof as it is a repetition of previous arguments with slight differences. A more thorough discussion of this scenario is presented in Zhou (2014a). 
Thus, in either the large n or large p case, we have a variant of the additive L1 correlation Flip-Flop algorithm such that, under certain assumptions, the estimate of Σ converges in the operator norm at the rate given in (108). As a result of convergence in the operator norm, we obtain consistency of the eigenvalues and eigenvectors of Σ̂ (see Corollary 31). Since the eigenvectors of Σ̂ define the estimated iPCA subspace, this in turn implies subspace consistency of the one-step additive L1 correlation estimator.

Appendix G. Selecting Penalty Parameters

Our missing data imputation framework for selecting penalty parameters is given in Algorithm 7. In the subsequent sections, we discuss algorithms to impute the missing data in step 4 of this algorithm. As one option, one could use the multi-cycle expectation-conditional maximization (MCECM) algorithm (Meng and Rubin, 1993), which iterates between taking conditional expectations in the E-step and maximizing with respect to one variable at a time in the M-step. We derive the full MCECM algorithm for iPCA in Appendix G.1. This method generalizes the MCECM imputation method proposed in Allen and Tibshirani (2010), which only considered the K = 1 case. However, as Allen and Tibshirani (2010) pointed out, the full MCECM algorithm is computationally expensive, so in practice, we also advocate using a faster one-step approximation to the MCECM algorithm, which we discuss in Appendix G.2.

Algorithm 7 Selecting Penalty Parameters via Missing Data Imputation Framework
Given: data X_1, . . . , X_K, space of penalty parameters Λ, type of penalized iPCA estimator
1: for k = 1, . . . , K do
2:   Randomly leave out 5% of the elements in X_k; denote these scattered missing elements by X^m_k
3: for λ in Λ do
4:   Impute the missing values (preferably by Algorithm 9); denote these imputed values by X̂^m_k
5: Select the λ which minimizes ∑_{k=1}^K ‖X̂^m_k − X^m_k‖²_F / ‖X̄^m_k − X^m_k‖²_F, where X̄^m_k denotes the values of the column mean matrix X̄_k at the missing indices

Following the notation in Allen and Tibshirani (2010), we write X^o = (X^o_1, . . . , X^o_K) to denote the totality of observed entries of X = (X_1, . . . , X_K), and X^m = (X^m_1, . . . , X^m_K) to denote the missing entries of X. Also define Θ = (µ_1, . . . , µ_K, Σ^{−1}, Δ_1^{−1}, . . . , Δ_K^{−1}), and let Θ′ be the current estimates of Θ. Note that the MCECM and its one-step approximation for iPCA (where K ≥ 1) are generalizations of the imputation algorithms from Allen and Tibshirani (2010) (where K = 1).

G.1 Multi-Cycle Expectation-Conditional Maximization Algorithm

For concreteness, we will work with the multiplicative Frobenius penalty; the other penalized methods are very similar. We proceed to derive the E-steps and M-steps with respect to each variable for the MCECM algorithm. Note here that we will estimate the column means µ_k adaptively within the MCECM algorithm, rather than fixing them a priori; we find that this strategy works better in practice. So first, in order to compute the E-steps, note that the Q function is

Q(Θ; Θ′) := E_{X^m | X^o, Θ′}[ℓ(Θ | X)]
= p log |Σ^{−1}| + n ∑_{k=1}^K log |Δ_k^{−1}| − E_{X^m | X^o, Θ′}[ ∑_{k=1}^K tr( Σ^{−1} (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ) ] − ∑_{k=1}^K λ_k ‖Σ^{−1}‖²_F ‖Δ_k^{−1}‖²_F.

Thus, for the E-step with respect to Σ, we use linearity of the expectation and trace operators to obtain

E_{X^m | X^o, Θ′}[ ∑_{k=1}^K tr( Σ^{−1} (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ) ] = ∑_{k=1}^K tr( Σ^{−1} E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ] ).
Using the notation and proof of Proposition 3 in Allen and Tibshirani (2010), the conditional expectation reduces to

E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ] = X̂_k Δ_k^{−1} X̂_k^T + F_k(Δ_k^{−1}).   (112)

For the E-step with respect to Δ_k, a similar argument shows that

E_{X^m | X^o, Θ′}[ ∑_{k=1}^K tr( Σ^{−1} (X_k − 1_n µ_k^T) Δ_k^{−1} (X_k − 1_n µ_k^T)^T ) ] = ∑_{k=1}^K tr( E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T)^T Σ^{−1} (X_k − 1_n µ_k^T) ] Δ_k^{−1} ),

and again from Allen and Tibshirani (2010), we have that

E_{X^m | X^o, Θ′}[ (X_k − 1_n µ_k^T)^T Σ^{−1} (X_k − 1_n µ_k^T) ] = X̂_k^T Σ^{−1} X̂_k + G(Σ^{−1}).   (113)

We next plug (112) and (113) back into the Q function and take partial derivatives to compute the M-steps. For the M-step with respect to Σ, we have that

∂Q/∂Σ^{−1} = p Σ − ∑_{k=1}^K ( X̂_k Δ_k^{−1} X̂_k^T + F_k(Δ_k^{−1}) ) − 2 Σ^{−1} ∑_{k=1}^K λ_k ‖Δ_k^{−1}‖²_F = 0,

so given Δ_k^{−1} from the previous iteration, we can update Σ via an eigendecomposition of ∑_{k=1}^K ( X̂_k Δ_k^{−1} X̂_k^T + F_k(Δ_k^{−1}) ). (The form of this update is analogous to the Flip-Flop updates in the multiplicative Frobenius Flip-Flop algorithm.) Similarly, for the M-step with respect to Δ_k,

∂Q/∂Δ_k^{−1} = n Δ_k − ( X̂_k^T Σ^{−1} X̂_k + G(Σ^{−1}) ) − 2 λ_k Δ_k^{−1} ‖Σ^{−1}‖²_F = 0,

so given Σ^{−1} from the previous iteration, we can update Δ_k by an eigendecomposition of X̂_k^T Σ^{−1} X̂_k + G(Σ^{−1}). Putting these E-steps and M-steps together, we provide the full MCECM algorithm to impute missing values in Algorithm 8.

Algorithm 8 Full MCECM Algorithm for iPCA
1: Set µ̂_k to be the column means of X^o_k for each k = 1, . . . , K   Initialization
2: If x^k_{ij} is missing, set x^k_{ij} = µ̂^j_k
3: Initialize Σ̂^{−1} and Δ̂_1^{−1}, . . . , Δ̂_K^{−1} to be symmetric positive definite
4: while not converged do
5:   Compute ∑_{k=1}^K ( X̂_k Δ̂_k^{−1} X̂_k^T + F_k(Δ̂_k^{−1}) )   E-Step (Σ)
6:   Update µ̂_k to be the column means of X̂_k
7:   Take the eigendecomposition ∑_{k=1}^K ( X̂_k Δ̂_k^{−1} X̂_k^T + F_k(Δ̂_k^{−1}) ) = U Γ U^T   M-Step (Σ)
8:   Regularize the eigenvalues: φ_i = (1/(2p)) ( γ_i + sqrt( γ_i² + 8p ∑_{k=1}^K λ_k ‖Δ̂_k^{−1}‖²_F ) )
9:   Update Σ̂^{−1} = U Φ^{−1} U^T
10:  for k = 1, . . . , K do
11:    Compute X̂_k^T Σ̂^{−1} X̂_k + G_k(Σ̂^{−1})   E-Step (Δ_k)
12:    Update µ̂_k to be the column means of X̂_k
13:    Take the eigendecomposition X̂_k^T Σ̂^{−1} X̂_k + G_k(Σ̂^{−1}) = V Φ V^T   M-Step (Δ_k)
14:    Regularize the eigenvalues: γ_i = (1/(2n)) ( φ_i + sqrt( φ_i² + 8n λ_k ‖Σ̂^{−1}‖²_F ) )
15:    Update Δ̂_k^{−1} = V Γ^{−1} V^T
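To make the closed-form M-step updates in Algorithm 8 concrete, the following is a minimal R sketch of steps 7-9 and 13-15 for the complete-data case, so the correction terms F_k and G_k drop out. The function and variable names here are illustrative and are not part of our released code.

```r
# Minimal sketch of the Algorithm 8 M-steps with complete data
# (correction terms F_k and G_k omitted).  Names are illustrative.
update_Sigma_inv <- function(X_list, Delta_inv_list, lambdas) {
  p <- sum(sapply(X_list, ncol))
  # E-step quantity with complete data: sum_k X_k Delta_k^{-1} X_k^T
  A <- Reduce(`+`, Map(function(X, Di) X %*% Di %*% t(X), X_list, Delta_inv_list))
  c_pen <- sum(mapply(function(l, Di) l * sum(Di^2), lambdas, Delta_inv_list))
  eig <- eigen(A, symmetric = TRUE)
  gam <- eig$values
  phi <- (gam + sqrt(gam^2 + 8 * p * c_pen)) / (2 * p)   # step 8
  eig$vectors %*% diag(1 / phi) %*% t(eig$vectors)       # step 9: Sigma-hat^{-1}
}

update_Delta_inv_k <- function(Xk, Sigma_inv, lambda_k) {
  n <- nrow(Xk)
  B <- t(Xk) %*% Sigma_inv %*% Xk
  eig <- eigen(B, symmetric = TRUE)
  phi <- eig$values
  gam <- (phi + sqrt(phi^2 + 8 * n * lambda_k * sum(Sigma_inv^2))) / (2 * n)  # step 14
  eig$vectors %*% diag(1 / gam) %*% t(eig$vectors)       # step 15: Delta_k-hat^{-1}
}
```

Each update thus reduces to a single eigendecomposition followed by the eigenvalue regularization derived above, which is what keeps the multiplicative Frobenius updates inexpensive.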
G.2 One-Step Approximation

Algorithm 8 is a generalization of the TRCMAimpute algorithm from Allen and Tibshirani (2010), and as discussed in Allen and Tibshirani (2010), it is computationally expensive to compute F_k(Δ̂_k^{−1}) and G_k(Σ̂^{−1}). Hence, rather than using the full MCECM algorithm to impute the missing values in step 4 of Algorithm 7, we advocate, as in Allen and Tibshirani (2010), using a one-step approximation to the MCECM algorithm. We will empirically show that the one-step approximation is both a good approximation to the full MCECM algorithm and works well in practice. The idea behind the one-step approximation is that since the first step of the MCECM algorithm typically gives the steepest decrease in the objective function, we can quickly approximate the MCECM algorithm by first obtaining a decent initial imputation and then stopping the algorithm after one M-step and one E-step. We detail the initial imputation step, M-step, and E-step as follows.

For the initial imputation step, we impute missing values assuming Σ = I. If we assume Σ = I, then X_k ∼ N_{n,p_k}(1_n µ_k^T, I ⊗ Δ_k) for each k = 1, . . . , K, or equivalently, x^k_1, . . . , x^k_n are i.i.d. N(µ_k, Δ_k), where x^k_i is the ith row of X_k. Since this reduces to the familiar multivariate case, we can initially impute the missing values in X_k using any (regularized) multivariate normal imputation method such as RCMimpute from Allen and Tibshirani (2010) for each k = 1, . . . , K. Given the initial imputation for X_1, . . . , X_K from the previous step, we next compute the M-step and update µ_1, . . . , µ_K, Σ, Δ_1, . . . , Δ_K using the penalized Flip-Flop algorithms derived in Appendix C. In the next and final step, we take an E-step to impute the missing values by X̂^m_k = E[X^m_k | X^o_k, µ̂_k, Σ̂, Δ̂_k] for each k = 1, . . . , K. This can be done using the alternating conditional expectations algorithm from Allen and Tibshirani (2010), applied to each X_k separately; the only difference is that we specialize to the case where the mean matrix is M_k = 1_n µ_k^T. We summarize this one-step approximation method for missing data imputation in Algorithm 9.

Algorithm 9 One-Step MCECM Approximation
1: for k = 1, . . . , K do   Initial Imputation // E-Step
2:   Impute the missing values in X_k assuming Σ = I; call it X̂_{k,0}   (use any regularized multivariate normal imputation method)
3: Estimate µ_1, . . . , µ_K, Σ, Δ_1, . . . , Δ_K given X̂_{1,0}, . . . , X̂_{K,0} via the penalized MLE Flip-Flop algorithm   M-Step
4: for k = 1, . . . , K do   E-Step
5:   Set the missing values X̂^m_k = E[X^m_k | X^o_k, µ̂_k, Σ̂, Δ̂_k] using the alternating conditional expectations algorithm as in Allen and Tibshirani (2010)

In Figure 8, we compare the numerical convergence of the one-step approximation and the full MCECM algorithm for a small simulation. From this plot, we note two important observations. First, the initialization (assuming Σ = I) in the one-step approximation algorithm makes a significant difference, compared to initializing missing elements to their respective column means as in the full MCECM algorithm. Second, the first update of the MCECM algorithm results in the largest increase in the log-likelihood function. As discussed in Allen and Tibshirani (2010), these two observations motivate the one-step approximation algorithm, which takes advantage of a good initialization and one update step to maintain sufficient accuracy while reducing the computational workload. Figure 8 also shows that the likelihood function after a good initialization and one iteration is on par with the full MCECM algorithm after 15 iterations. For a more detailed discussion of computation and timing comparisons between the one-step approximation and the full MCECM algorithm, we refer to Allen and Tibshirani (2010). Note that though Allen and Tibshirani (2010) only treats the K = 1 case, the results are applicable to the K > 1 case due to the separability of the log-likelihood function.

Figure 8: We use the same simulation as that in Figure 9 and randomly leave out 5% of the entries in each data set. We impute the missing values using the full MCECM and one-step approximation algorithms with the multiplicative Frobenius penalty (λ = (1, 1)), and we plot the log-likelihood value over each iterate. The log-likelihood obtained by the one-step approximation is on par with the log-likelihood after 15 iterations of the MCECM algorithm.

Figure 9 compares the average imputation errors from the full MCECM and the one-step approximation for a small simulation. In this case, for both the one-step approximation and the full MCECM, λ = (10^{−0.5}, 10^0) ≈ (0.32, 1.00) gave the lowest average imputation error, and hence, both imputation methods selected λ = (10^{−0.5}, 10^0) for the multiplicative Frobenius iPCA estimator. This further supports the use of the one-step algorithm as an approximation to the full MCECM in practice. Moreover, we verified that the minimum subspace recovery error of 2.00 was obtained at λ = (0.01, 10^{1.5}) ≈ (0.01, 31.62) and that λ = (10^{−0.5}, 10^0) yielded a similar subspace recovery error of 2.08. This preliminary empirical evidence leads us to believe that the one-step imputation algorithm is indeed a good approximation to the full MCECM algorithm.
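To make the selection framework of Algorithm 7 concrete, the following is a minimal R sketch of its leave-out-and-impute loop. Here `impute_fn` stands in for the one-step imputation of Algorithm 9 (or the full MCECM) and, like all other names in the sketch, is illustrative rather than part of our released code.

```r
# Minimal sketch of the penalty selection loop in Algorithm 7.  `impute_fn`
# takes the data sets with NAs and a candidate lambda and returns imputed data.
select_penalty <- function(X_list, lambda_grid, impute_fn, prop_missing = 0.05) {
  # step 2: randomly hide 5% of the entries in each X_k
  masks <- lapply(X_list, function(X) {
    m <- matrix(FALSE, nrow(X), ncol(X))
    m[sample(length(X), floor(prop_missing * length(X)))] <- TRUE
    m
  })
  X_obs <- Map(function(X, m) { X[m] <- NA; X }, X_list, masks)

  # steps 3-4: impute under each candidate lambda and record the relative error
  errs <- sapply(lambda_grid, function(lambda) {
    X_hat <- impute_fn(X_obs, lambda)
    sum(mapply(function(Xh, X, m) {
      Xo <- X; Xo[m] <- NA
      X_bar <- matrix(colMeans(Xo, na.rm = TRUE), nrow(X), ncol(X), byrow = TRUE)
      sum((Xh[m] - X[m])^2) / sum((X_bar[m] - X[m])^2)
    }, X_hat, X_list, masks))
  })

  # step 5: return the lambda with the smallest held-out imputation error
  lambda_grid[[which.min(errs)]]
}
```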
Appendix H. Simulations

In order to check that the simulation results in Figure 3 are not heavily dependent on our choice of Σ and Δ_1, Δ_2, Δ_3, we ran additional simulations, varying the dimension of the true underlying subspace U and the number of data sets K. These simulation results are shown in Figure 10.

Figure 9: We simulated two coupled data matrices X_1, X_2 with n = 50, p_1 = 60, p_2 = 70 according to the iPCA model (2). Here, we took Σ to be as in the base simulation described in Section 4.1, Δ_1 to be an autoregressive Toeplitz matrix with the (i, j)th entry given by 0.9^{|i−j|}, and Δ_2 to be a block-diagonal matrix with five equal-sized blocks. Then, we randomly removed 5% of the elements in X_1 and X_2 and imputed these missing values using the full MCECM and the one-step approximation algorithms with the multiplicative Frobenius penalty. We plot the average imputation error ∑_{k=1}^K ‖X̂^m_k − X^m_k‖²_F / ‖X̄^m_k − X^m_k‖²_F plus or minus one standard error, taken over 10 trials. In the left graph of each panel (A and B), λ_1 varies while λ_2 is fixed at its optimal value (i.e., λ_2 = 10^0), and in the right graph of each panel, λ_2 varies while λ_1 is fixed at its optimal value (i.e., λ_1 = 10^{−0.5}). The minimum average imputation error is achieved at λ = (10^{−0.5}, 10^0) for both imputation algorithms.

Figure 10: Additional Simulations: (A) The Frobenius iPCA estimators yield the lowest subspace recovery error regardless of the subspace dimension of Σ; (B) As the number of integrated data sets increases, most methods tend to do better, with the multiplicative Frobenius iPCA penalty slightly outperforming the others when K = 10. Note that we did not run individual PCAs in (B) because the number of data sets is changing. (Panel A plots the subspace recovery error against the subspace dimension of Σ, and panel B plots it against the number of data sets K; the methods shown are iPCA (x Frob), iPCA (+ Frob), Concatenated, and Distributed.)

For the simulations in Figure 10A, we took Δ_1, Δ_2, Δ_3 to be the same as in the base simulation, and we put Σ to be of the form U D U^T, where U was a random n × n orthogonal matrix, and D = diag(d_1, . . . , d_n) was simulated by d_i ∼ Unif(5, 75) for i = 1, . . . , D, and d_i = 1 otherwise. Here, D is the dimension of the true underlying subspace of Σ (e.g., D was taken to be 2 in the base simulation). Then, like in the base simulation, we generated X_k for each k = 1, 2, 3 by X_k ∼ N(0, Σ ⊗ Δ_k), or equivalently, X_k = Σ^{1/2} Ω_k Δ_k^{1/2}, where Ω_k is an n × p_k random matrix with i.i.d. N(0, 1) entries.
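For readers who wish to reproduce this setup, the following is a minimal R sketch of the data-generating mechanism X_k = Σ^{1/2} Ω_k Δ_k^{1/2} used throughout this section. The specific Σ and Δ_1 constructed in the example mirror the spiked and autoregressive Toeplitz choices described above, and all names are illustrative rather than taken from our released code.

```r
# Minimal sketch of the Kronecker data-generating mechanism:
# X_k = Sigma^{1/2} Omega_k Delta_k^{1/2}, i.e. X_k ~ N_{n,p_k}(0, Sigma (x) Delta_k).
mat_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(sqrt(pmax(e$values, 0))) %*% t(e$vectors)
}

simulate_ipca_data <- function(Sigma, Delta_list) {
  n <- nrow(Sigma)
  Sig_half <- mat_sqrt(Sigma)
  lapply(Delta_list, function(Delta) {
    pk <- nrow(Delta)
    Omega <- matrix(rnorm(n * pk), n, pk)   # i.i.d. N(0, 1) entries
    Sig_half %*% Omega %*% mat_sqrt(Delta)
  })
}

# Example mirroring the Figure 10A setup with D = 2 and an AR Toeplitz Delta_1
set.seed(1)
n <- 50
U <- qr.Q(qr(matrix(rnorm(n * n), n, n)))     # random orthogonal matrix
d <- c(runif(2, 5, 75), rep(1, n - 2))        # two spiked eigenvalues
Sigma <- U %*% diag(d) %*% t(U)
Delta1 <- 0.9^abs(outer(1:60, 1:60, "-"))     # (i, j) entry = 0.9^{|i - j|}
X_list <- simulate_ipca_data(Sigma, list(Delta1))
```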
For the simulations in Figure 10B, we took Σ to be the same as in the base simulation, and we generated Δ_k from a random choice among the following covariance types: (i) an autoregressive Toeplitz matrix with the (i, j)th entry given by ρ^{|i−j|}, where ρ ∼ Unif(−0.9, 0.9); (ii) a block-diagonal matrix with B blocks with entries q_1, . . . , q_B, where B ∼ Unif(3, 10) and q_i ∼ Unif(0, 0.9); (iii) a spiked covariance matrix U D U^T, where U is a random orthogonal matrix, d_i ∼ Unif(5, 75) for i = 1, . . . , D, d_i = 1 for i = D + 1, . . . , p_k, and D ∼ Unif(5, 50); (iv) the observed covariance matrix of real miRNA data from TCGA Ovarian Cancer. (Note that this covariance matrix can only be used for one k ∈ {1, . . . , K}.) The number of features was randomly selected by p_k ∼ Unif(200, 500), and we ensured that the signal in each Δ_k was larger than that in Σ so that the individual signal was larger than the joint signal.

As conveyed in Figure 10A, the Frobenius iPCA estimators outperformed their competitors regardless of the subspace dimension of Σ. It is also encouraging to see that the strong performance of the Frobenius iPCA estimators is not dependent on the cluster model for Σ, which was used in the simulations in Figure 3 but not in Figure 10A. Note that since the simulated Σ was dense, the additive L1 penalized iPCA estimator should perform poorly, as it does. We also point out that JIVE tends to do worse as the subspace dimension of Σ increases because JIVE tends to underestimate the rank of the joint variation matrix when the subspace dimension of Σ is larger. This is one of the disadvantages of matrix factorizations, as they require the rank of the factorized matrices to be chosen a priori. In Figure 10B, we see that most of the methods perform better as the number of integrated data sets increases and that the multiplicative Frobenius iPCA estimator slightly outperforms all the other methods when K = 10. We speculate that the additive Frobenius iPCA estimator does worse for large values of K because the grid of possible penalty parameters was too coarse; as K increases, so does the number of penalty parameters, and hence it becomes more difficult to choose appropriate penalty parameters when K is large.

For the Laplacian simulations in Figure 4A, we generated X_k for each k = 1, 2, 3 by X_k = Σ^{1/2} Ω_k Δ_k^{1/2} + E_k, where Σ and Δ_k were taken as in the base simulation, and E_k was an n × p_k random matrix with i.i.d. Laplace(0, b) entries. For the simulations in Figure 4B, we simulated three coupled data sets from an instance of the JIVE model: for each k = 1, 2, 3, X_k = J_k + A_k + E_k, where J = [J_1, J_2, J_3] is the joint variation matrix of rank r = 5, A_1, A_2, A_3 are the individual variation matrices of rank r_1 = 10, r_2 = 15, r_3 = 20, respectively, and E_k are error matrices with independent entries from N(0, σ²). Similar to the simulations in Lock et al. (2013), we set J_k and A_k by J_k = U V_k and A_k = U_k W_k, where U ∈ R^{n×r}, V_k ∈ R^{r×p_k}, U_k ∈ R^{n×r_k}, and W_k ∈ R^{r_k×p_k}. Here, U, U_k, V_k, W_k are matrices whose entries are randomly generated i.i.d. from one of the following distributions: N(0, 1), Unif(0, 1), Exp(1), and the discrete random variable on {−2, −1, 0, 1, 2} with uniform probabilities. Note that under this JIVE model, the true joint covariance matrix is given by Σ = J J^T.

In addition to the robustness simulations in Figure 4, we also ran simulations under the CMF model and simulations with uncommon row covariance matrices. When simulating from the CMF model, we generated three coupled data matrices with n = 150 and p_1 = 300, p_2 = 500, p_3 = 400 via the model X_k = U V_k^T + E_k. Here, U ∈ R^{n×2} was taken to be a random two-dimensional subspace from a cluster model with three clusters (as shown in Figure 1), each V_k ∈ R^{p_k×2} was taken to be the top two eigenvectors of the base simulation's Δ_k (e.g., V_1 was taken to be the top two eigenvectors of the autoregressive Toeplitz matrix with (i, j)th entry given by 0.9^{|i−j|}), and E_k is a random matrix with i.i.d. N(0, σ²) entries.
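A minimal R sketch of this CMF data-generating model is given below. The three-cluster score matrix U from Figure 1 is not reproduced here, so any n × 2 score matrix with the desired cluster structure would be passed in as the argument U; all names are illustrative.

```r
# Minimal sketch of the CMF model described above: X_k = U V_k^T + E_k, where
# V_k holds the top two eigenvectors of the base simulation's Delta_k and E_k
# has i.i.d. N(0, sigma^2) entries.
simulate_cmf_data <- function(U, Delta_list, sigma) {
  n <- nrow(U)
  lapply(Delta_list, function(Delta) {
    pk <- nrow(Delta)
    Vk <- eigen(Delta, symmetric = TRUE)$vectors[, 1:2]   # top two eigenvectors
    U %*% t(Vk) + matrix(rnorm(n * pk, sd = sigma), n, pk)
  })
}
```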
We summarize the simulation results from the CMF model with increasing levels of noise σ in Figure 11. From this figure, we see that when the additive noise is small, CMF (and concatenated PCA) yield slightly lower subspace recovery errors than the multiplicative Frobenius iPCA estimator, but as σ increases, this improvement over the multiplicative Frobenius iPCA estimator diminishes.

Figure 11: CMF Model Simulations: As the noise level σ of the CMF model increases, the multiplicative Frobenius iPCA estimator performs on par with CMF (and concatenated PCA). (The figure plots the subspace recovery error against the standard deviation of the additive normal noise for iPCA (x Frob), iPCA (+ Frob), JIVE, MFA, CMF, concatenated PCA, and PCA on each of X_1, X_2, X_3.)

For the simulations with uncommon row covariance matrices, we modified the base simulation so that two out of the three data sets arose from the model X_k ∼ N_{n,p_k}(0, Σ ⊗ Δ_k) while the final data set arose from the model X_k ∼ N_{n,p_k}(0, Σ̃ ⊗ Δ_k). Here, Σ and Δ_1, Δ_2, Δ_3 are as in the base simulation, while Σ̃ is a rank-2 spiked covariance with eigenvectors generated from i.i.d. random normal entries. We summarize the simulation results from this uncommon row covariance model in Figure 12. In Figure 12A, we see that regardless of which X_k is generated from the uncommon Σ̃, the Frobenius iPCA estimators are relatively robust to the model misspecification. Figure 12B shows the difference in subspace recovery error when applying the methods to all three data sets (i.e., mixture) versus applying them to the two data sets generated by the common Σ (i.e., oracle). Here, we see that while the Frobenius iPCA estimators perform better with oracle knowledge of the two data sets generated by the common Σ, the decline in performance is relatively small when adding the outlying data set, again demonstrating the robustness of the Frobenius iPCA estimators.

In the last set of simulations, we empirically study the iPCA estimators under the sparse setting. For these sparse simulations, we generated two data sets with n = 50 and p_1 = 50, p_2 = 100 according to the Kronecker product model X_k ∼ N_{n,p_k}(0, Σ ⊗ Δ_k). Because we are interested in uncovering the low-rank sparse structure when performing dimension reduction, we generate Σ as follows: let U denote the eigenvectors of a block-diagonal matrix with B equally-sized blocks, and put D = diag(25, 12.5, 1, . . . , 1) ∈ R^{n×n}. Then simulate Σ = U D U^T so that Σ is a low-rank block-diagonal matrix with B blocks. To generate Δ_1, we use the huge package (Jiang et al., 2019) in R to obtain the sparse covariance matrix associated with multivariate normal data generated from a sparse banded graph (with bandwidth = 4). Lastly, we took Δ_2 to be the block-diagonal matrix (with five equally-sized blocks) created by taking the observed covariance matrix of miRNA data from TCGA ovarian cancer (The Cancer Genome Atlas Research Network, 2011) and zeroing out the entries off of the block diagonal.
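Before turning to the results, we note that a sparse banded-graph covariance like the Δ_1 above can be generated directly with the huge package. The short sketch below reflects one way this could be done with huge.generator; the argument names are our reading of that interface and should be checked against the package documentation.

```r
# Sketch of generating a sparse Delta_1 from a banded graph (bandwidth 4)
# with the huge package.  Check argument names against the huge documentation.
library(huge)

set.seed(1)
gen <- huge.generator(n = 50, d = 50, graph = "band", g = 4)
Delta1 <- gen$sigma   # covariance matrix implied by the sparse banded graph
```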
The results from this sparse simulation study as we increase the number of blocks in Σ (and hence increase the sparsity in Σ) are shown in Figure 13. From this study, we see that when the amount of sparsity is relatively low, the multiplicative Frobenius iPCA estimator and the additive L1 iPCA covariance estimator perform the best while the additive L1 iPCA correlation estimator performs poorly. We believe that in this low sparsity scenario, the additive L1 iPCA correlation estimator struggles with choosing the appropriate penalty parameters and thus does not perform as well. However, as the sparsity level increases, both the additive L1 iPCA covariance and correlation estimators outperform the dense Frobenius iPCA estimators.

Figure 12: Uncommon Row Covariance Simulations: (A) We plot the subspace recovery errors for various methods under the uncommon row covariance model. In the top panel, X_1 was generated from the uncommon Σ̃; in the middle panel, X_2 was generated differently; and in the bottom panel, X_3 was different. (B) We compare the subspace recovery errors from various methods when applied to the mixture of all three data sets (i.e., mixture) versus when applied to the two data sets which were generated by the common Σ (i.e., oracle). (Methods shown include iPCA (x Frob), iPCA (+ Frob), MFA, and concatenated PCA.)

Figure 13: Sparse Simulations: As we increase the number of blocks in Σ (and hence also the sparsity of Σ), the sparse iPCA estimators improve over the Frobenius iPCA estimators. However, when the sparsity level is relatively low, the multiplicative Frobenius iPCA estimator performs similarly to the additive L1 iPCA covariance estimator. (The figure plots the subspace recovery error against the number of blocks for iPCA (x Frob), iPCA (+ Frob), iPCA (L1), iPCA (Corr L1), JIVE, MFA, concatenated PCA, and PCA on X_1 and X_2.)

We finally note that other metrics which also quantify the distance between subspaces, such as canonical angles, behave similarly to the subspace recovery error, so we omitted these results for brevity. Other common metrics such as |‖Σ̂‖_2 − ‖Σ‖_2| or ‖Σ̂ − Σ‖²_F are not appropriate for our study because we are interested in the distance between subspaces of eigenvectors, and eigenvectors are scale-invariant while these metrics are not.

Appendix I. Case Study: Integrative Genomics of Alzheimer's Disease

The ROSMAP data originally contained 309 miRNAs, 41,809 genes, and 420,132 CpG sites, so we aggressively preprocessed the RNASeq and methylation data sets to reduce the number of features to manageable sizes. First, we transformed the methylation data to m-values and log-transformed the RNASeq counts, as is common in most analyses for these data types. Then, we removed batch (experimental) effects from both data sets via ComBat (Johnson et al., 2007). We next filtered the features by taking those with the highest variance (top 20,000 genes for RNASeq and top 50,000 CpG sites for methylation). Then, we performed univariate filtering and kept the features with the highest association with the clinician's diagnosis. This left us with p_1 = 309, p_2 = 900, and p_3 = 1250 features in the miRNA, RNASeq, and methylation data sets, respectively. R code can be found at https://github.com/DataSlingers/iPCA.

References

H. Abdi, L. J. Williams, and D. Valentin. Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdisciplinary Reviews: Computational Statistics, 5(2):149-179, 2013.

E. Acar, E. E. Papalexakis, G. Gürdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro. Structure-revealing data fusion. BMC Bioinformatics, 15(1):239, Jul 2014. ISSN 1471-2105. doi: 10.1186/1471-2105-15-239.
URL https://doi.org/10.1186/ 1471-2105-15-239. G. I. Allen and R. Tibshirani. Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics, 4(2):764 790, 2010. O. Alter, P. O. Brown, and D. Botstein. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences, 100(6):3351 3356, 2003. K. Benidis, Y. Sun, P. Babu, and D. P. Palomar. Orthogonal sparse PCA and covariance estimation via procrustes reformulation. IEEE Transactions on Signal Processing, 64(23): 6211 6226, 2016. URL http://www.danielppalomar.com/publications.html. O. Carrette, I. Demalte, A. Scherl, O. Yalkinoglu, G. Corthals, P. Burkhard, D. F. Hochstrasser, and J.C. Sanchez. A panel of cerebrospinal fluid potential biomarkers for the diagnosis of Alzheimer s disease. Proteomics, 3(8):1486 1494, 2003. A.P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265 274, 1981. P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2):105 123, 1999. B. Escofier and J. Pages. Multiple factor analysis. Computational Statistics & Data Analysis, 18(1):121 140, 1994. I. Espuny-Camacho, A. M. Arranz, M. Fiers, A. Snellinx, K. Ando, S. Munck, J. Bonnefont, L. Lambot, N. Corthout, L. Omodho, et al. Hallmarks of Alzheimer s disease in stemcell-derived human neurons transplanted into mouse brain. Neuron, 93(5):1066 1081, 2017. J. Fan, D. Wang, K. Wang, and Z. Zhu. Distributed estimation of principal eigenspaces. Ar Xiv e-prints, February 2017. M. Ghil and P. Malanotte-Rizzoli. Data assimilation in meteorology and oceanography. In Advances in Geophysics, volume 33, pages 141 266. Elsevier, 1991. K. Greenewald and A. O. Hero. Robust kronecker product PCA for spatio-temporal covariance estimation. IEEE Transactions on Signal Processing, 63(23):6368 6378, 2015. Tang and Allen A. K. Gupta and D. K. Nagar. Matrix variate distributions. CRC Press, 1999. P. Han, W. Liang, L. C. Baxter, J. Yin, Z. Tang, T. G. Beach, R. J. Caselli, E. M. Reiman, and J. Shi. Pituitary adenylate cyclase activating polypeptide is reduced in Alzheimer disease. Neurology, 82(19):1724 1728, 2014. T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1): 3367 3402, 2015. R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012. C. J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330 2338, 2011. Sijia Huang, Kumardeep Chaudhary, and Lana X Garmire. More is better: recent progress in multi-omics data integration methods. Frontiers in genetics, 8:84, 2017. Haoming Jiang, Xinyu Fei, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman, Xingguo Li, and Tuo Zhao. huge: High-Dimensional Undirected Graph Estimation, 2019. URL https://CRAN.R-project.org/package=huge. R package version 1.3.2. W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118 127, 2007. Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. 
Journal of multivariate analysis, 88(2):365 411, 2004. Y. Li, Z. Chen, Y. Gao, G. Pan, H. Zheng, Y. Zhang, H. Xu, G. Bu, and H. Zheng. Synaptic adhesion molecule Pcdh-γC5 mediates synaptic dysfunction in Alzheimer s disease. Journal of Neuroscience, pages 1051 17, 2017. E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1):523, 2013. X. L. Meng and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267 278, 1993. ISSN 00063444. URL http: //www.jstor.org/stable/2337198. S. Mostafavi, C. Gaiteri, S. E. Sullivan, C. C. White, S. Tasaki, J. Xu, M. Taga, H. U. Klein, E. Patrick, V. Komashko, et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer s disease. Nature Neuroscience, 21(6):811 819, 2018. ISSN 1546-1726. URL https://doi.org/10.1038/ s41593-018-0154-9. S. P. Ponnapalli, M. A. Saunders, C. F. Van Loan, and O. Alter. A higher-order generalized singular value decomposition for comparison of global m RNA expression from multiple organisms. Plo S one, 6(12):e28072, 2011. Integrated Principal Components Analysis T. Rapcs ak. Geodesic convexity in nonlinear optimization. Journal of Optimization Theory and Applications, 69(1):169 183, Apr 1991. ISSN 1573-2878. doi: 10.1007/BF00940467. URL https://doi.org/10.1007/BF00940467. A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494 515, 2008. doi: 10.1214/08-EJS176. S. T. Shivappa, M. M. Trivedi, and B. D. Rao. Audiovisual information fusion in humancomputer interfaces and intelligent environments: A survey. Proceedings of the IEEE, 98 (10):1692 1715, 2010. A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 650 658. ACM, 2008. The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609, 2011. P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475 494, 2001. T. Tsiligkaridis, A. O. Hero, and S. Zhou. On convergence of kronecker graphical lasso algorithms. IEEE Transactions on Signal Processing, 61:1743 1755, 2013. N. K. Vishnoi. Geodesic Convex Optimization: Differentiation on Manifolds, Geodesics, and Convexity. Ar Xiv e-prints, June 2018. J. A. Westerhuis, T. Kourti, and J. F. Mac Gregor. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12(5):301 321, 1998. A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182, 2012. J. Yin and H. Li. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis, 107:119 140, 2012. Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315 323, 2015. doi: 10.1093/biomet/asv008. URL http://dx.doi.org/10.1093/biomet/asv008. H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617 1638, 2016. S. Zhou. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532 562, 2014a. S. Zhou. 
Supplement to "Gemini: Graph estimation with matrix variate normal instances". 2014b.