Journal of Machine Learning Research 25 (2024) 1-82. Submitted 7/22; Revised 8/23; Published 2/24.

Personalized PCA: Decoupling Shared and Unique Features

Naichen Shi (naichens@umich.edu)
Raed Al Kontar (alkontar@umich.edu)
Department of Industrial & Operations Engineering, University of Michigan, Ann Arbor, MI 48109-2117, USA

Editor: Martin Jaggi

Abstract

In this paper, we tackle a significant challenge in PCA: heterogeneity. When data are collected from different sources with heterogeneous trends while still sharing some congruency, it is critical to extract shared knowledge while retaining the unique features of each source. To this end, we propose personalized PCA (Per PCA), which uses mutually orthogonal global and local principal components to encode both unique and shared features. We show that, under mild conditions, both unique and shared features can be identified and recovered by a constrained optimization problem, even if the covariance matrices are immensely different. We also design a fully federated algorithm inspired by distributed Stiefel gradient descent to solve the problem. The algorithm introduces a new group of operations called generalized retractions to handle orthogonality constraints, and only requires global PCs to be shared across sources. We prove the linear convergence of the algorithm under suitable assumptions. Comprehensive numerical experiments highlight Per PCA's superior performance in feature extraction and prediction from heterogeneous datasets. As a systematic approach to decouple shared and unique features from heterogeneous datasets, Per PCA finds applications in several tasks, including video segmentation, topic extraction, and feature clustering.

Keywords: principal component analysis, personalization, heterogeneity

(c) 2024 Naichen Shi and Raed Al Kontar. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-0810.html.

1. Introduction

Principal component analysis (PCA) (F.R.S., 1901; Hotelling, 1933) unravels data features by finding a few principal components (PCs) from high-dimensional data that explain the largest portion of the variance. Due to its effective feature learning and dimension reduction capability, PCA has seen immense success across various domains, including image processing (Deledalle et al., 2011; Jégou and Chum, 2012), time series modeling (Yang and Shahabi, 2004; Aguilera et al., 1999), bio-information (Reich et al., 2008; Novembre and Stephens, 2008), condition monitoring (Pozo et al., 2018; Li et al., 2018b), and many more. However, since all data are equally weighted in standard PCA, an underlying assumption is that these data come from homogeneous distributions. This assumption, however, is often challenged in various scenarios, including the Internet of Things (IoT), where data do not come from a single source but from a large number of distinct edge devices (or clients). The edge devices, from smartphones to connected vehicles, usually operate in different environments and conditions (Kontar et al., 2017, 2018). The data collected by edge devices are also subject to changes in external conditions (Kontar et al., 2021) or user preferences (Kulkarni et al., 2020). Thus, it is common for the datasets to contain significant heterogeneity and even conflicting trends while sharing some congruity. Standard PCA often does not work well when data homogeneity is not guaranteed (Oba et al., 2007; Hong et al., 2021).

Few works have endeavored to extend the PCA philosophy to incorporate data heterogeneity. For example, Heterogeneous PCA (Oba et al., 2007) considers the case where data from different sources have different noise levels.
They propose a reweighting technique to alleviate noise heteroscedasticity. Such an approach is shown to be useful in identifying PCs under heteroscedastic noise. However, simply treating the discrepancy among datasets as different levels of noise might be inadequate to understand the intrinsic features within the data and insufficient to encode both unique and shared features across devices and clients. As such, personalized solutions are needed.

To transmute the heterogeneity from a bane into a blessing, in this work, we propose personalized PCA (Per PCA), which fits personalized features on each client in addition to common features shared by all clients. In our model, data are driven by several mutually orthogonal global (shared) and local (personalized) PCs. The global PCs model the common patterns among different datasets, while the local PCs model the idiosyncratic features of one specific dataset. Global and local PCs work together to fit the observations. Figure 1 illustrates the use of homogeneous PCA and personalized PCA to fit two heterogeneous datasets. As shown in the figure, simply pooling all data across datasets using homogeneous PCA fails to encode the unique features within each dataset, and the horizontal PC is a misleading one that is not representative of any source. In contrast, personalized PCA aims at decoupling unique and shared features so that heterogeneity across data sources is accounted for.

There are several benefits to personalization. First, employing several local PCs to fit individual data patterns enables us to accurately describe immensely heterogeneous trends in the datasets. Second, global PCs shared by all data can be estimated more precisely without being affected by disagreeing drifts from different sources.
What's more, disentangling local features from global ones provides a systematic and interpretable approach to analyzing the heterogeneity structure of datasets and leveraging this knowledge for better analytics. These include: (i) Improving classification and clustering: instead of using raw data, operating on unique features may yield better performance, as differences become more explicit when shared features are removed; (ii) Transforming personalized, predictive analytics: by selectively transferring common knowledge from one data source to another, we can reduce the negative transfer of knowledge and enhance personalized predictive and prescriptive models; (iii) Anomaly detection: by monitoring changes in the unique features, we envision that anomalies can be detected better and faster.

Figure 1: Comparison between homogeneous PCA (standard PCA) and personalized PCA (Per PCA). There are two datasets, one colored blue and the other pink. Dots represent the observations. Observations from one dataset lie on a 2-dimensional plane. The black arrows represent the learned global PCs, and the colored arrows represent the learned local PCs. Homogeneous PCA is a standard PCA on the pooled dataset. We will revisit the example in Section 7.

To enable personalized PCA, we propose an optimization framework to provably recover both global and local PCs from noisy observations. The objective is to minimize the empirical reconstruction error under orthogonality constraints. The formulation stands on solid theoretical ground: we prove that, under an identifiability condition, the optimal solution recovers the true global and local PCs. Not only can the PCs be solved for, but they can also be solved for efficiently. We design an algorithm based on Stiefel manifold gradient descent that provably converges linearly to the global optimum under mild conditions. The algorithm relies on a new operation
called generalized retraction to handle the orthogonality constraints. It is worth noting that our algorithm is designed in a federated manner: the need to share raw data or place all data in a central location is circumvented, and only the updates of global PCs need to be shared across clients. Compared with centralized PCA, where all datasets are uploaded to a central server and PCA is learned on the aggregated dataset, our algorithm reaps the benefits of distributed and federated analytics, including communication, cost, storage, and privacy benefits (Kontar et al., 2021). We will show the advantages of Per PCA over existing distributed PCA methods in Section 2.1.

Furthermore, Per PCA proposes a novel provable paradigm for decoupling shared and unique features. Its applications go beyond simple data dimension reduction. We show that Per PCA has remarkable performance in video segmentation and topic extraction tasks. Hence, Per PCA opens up new possibilities for broader applications.

Moving forward, we will use client, edge device, data source, and local dataset interchangeably to represent the entities of interest. Here, entities are broadly defined, encompassing various levels of granularity. For instance, we can extract shared and unique features across dispersed datasets, across output classes within a dataset, or even among observations (such as images) within a single dataset.

1.1 Main contributions

We summarize our contributions in the following:

Modeling: We propose a personalized PCA model that learns both global and local features from distributed datasets. These features can be recovered from observations by solving a nonconvex optimization problem designed to minimize reconstruction error.
Consistency: We find that there exists a simple sufficient condition based on the misalignment of local PCs that ensures the identifiability of the global and local PCs: the maximum eigenvalue of the average of the projections onto the local subspaces should be smaller than 1. We show that, under the identifiability condition, both global and local PCs can be estimated from noisy observations with an error that is upper bounded by O(1/n), where n is the number of observations on each client. As the error decreases to 0 when n approaches infinity, the error bound essentially implies the consistency of Per PCA. The analysis extends conventional matrix perturbation bounds (Bhatia, 1997; Vu et al., 2013) to personalized settings, where a change in one client's covariance matrix can affect the PC estimates on all clients. We also use a minimax statistical lower bound to show that the statistical error upper bound is almost tight in terms of the eigengap and the misalignment parameter.

Algorithm: We design an algorithm based on Stiefel manifold gradient descent (St-GD hereon) to obtain global and local PC estimates. The major difficulty for the algorithm is handling the orthogonality constraints. To tackle it, we introduce a correction step that relies on a group of operations called generalized retractions. A generalized retraction extends the retraction in the literature (Edelman et al., 1998), as it is defined on the entire R^{d×r} rather than the tangent bundle of the Stiefel manifold St(d, r). In our algorithm, clients only need to share iterates of global PCs, thus preserving privacy and minimizing communication costs.

Convergence: The proposed algorithm has a local linear convergence rate. To the best of our knowledge, this is the first theoretical guarantee for an algorithm that simultaneously learns global and local PCs.
Interestingly, the convergence is faster when local PCs are more heterogeneous, a result that lies in sharp contrast to conventional predictive federated or transfer learning theory (Zhuang et al., 2020), as it highlights that heterogeneity can be a blessing in disguise. On the technical side, we introduce a novel Lyapunov function to study distributed St-GD with generalized retractions.

Numerical results: Empirical evidence on both synthetic and real-life datasets confirms Per PCA's ability to decouple shared and unique features. Also, Per PCA has exciting applications in video segmentation and topic extraction. For instance, on video segmentation tasks, Per PCA has significant advantages over the popular Robust PCA (Candes et al., 2011) when heterogeneity patterns are not sparse.

1.2 Organization

The paper is organized as follows. We review related work and introduce notation in Section 2. In Section 3, we propose the formulation of Per PCA and link it with constrained optimization. Section 4 contains the theoretical analysis of identifiability and consistency. A federated algorithm to solve Per PCA is developed in Section 5, and its convergence guarantee is established in Section 6. Numerical experimentation results are demonstrated in Section 7. Finally, Section 8 concludes the paper with a brief discussion. Readers mainly interested in the implementation and applications of Per PCA can focus on Sections 3, 5, and 7. An implementation of the proposed method is in the linked GitHub repository.

2. Preliminaries

In this section, we review related work in the literature and introduce the needed notation.

2.1 Related work

Structural PCA. Structural PCA attempts to build structural models for data and noise. Research on structural PCA abounds. A seminal algorithm along this line is Robust PCA (Candes et al., 2011).
The authors point out that traditional PCA is sensitive to noise in the observations and tackle this issue by decomposing an observation matrix Y into a low-rank part L and a sparse noise part S: Y = L + S. The low-rank matrix L corresponds to the signal, and S represents the noise. It turns out that the two parts can be exactly identified under regularity conditions with carefully designed algorithms. Robust PCA has become a useful technique in image denoising and video processing (Bouwmans et al., 2018), collaborative filtering (Xu et al., 2012), and many more. Sparse PCA (Zou et al., 2006) adds sparsity constraints on the PCs, encouraging each PC to depend on a minimal number of variables. While these methods are powerful in handling large noise or high-dimensional data, they mainly analyze homogeneous data.

Several algorithms have also been invented to leverage variance heterogeneity across samples. Heterogeneous component analysis (HCA) (Oba et al., 2007) assumes data come from different sources with different levels of noise. To better learn the PCs under heteroscedastic variance, HCA reweights the empirical loss of each observation according to the inverse of its variance, so that noisier samples contribute less to the total loss. Hong et al. (2021) calculate the optimal weights in the asymptotic case by considering the signal-to-noise ratio. Though these methods have superior performance compared to uniform-weighting PCA, heterogeneity among different sources is only modeled by the noise magnitude. A few heuristic methods also attempt to use low-rank features to characterize heterogeneity, including joint and individual variance explained (JIVE) (Lock et al., 2013) and common and individual feature extraction (CIFE) (Zhou et al., 2015). However, it is difficult to distribute these methods for federated learning and to provide theoretical guarantees for their outputs.

Distributed PCA. There has been a recent push to calculate PCs on distributed devices.
Oftentimes, the clients/edge devices use their local data to estimate PCs and communicate with a central server to update their estimates. One round of information exchange between clients and the central server is referred to as a communication round. Based on the number of communication rounds between edge devices and the server, research can be roughly divided into two categories: (1) those that require only one round of communication and (2) those that require multiple rounds of communication. In one-round PCA algorithms, clients estimate PCs from local datasets and send summary statistics to the server. The server then analyzes the aggregated statistics to calculate the PCs of the entire dataset. There are several ways for the server to calculate PCs. Qu et al. (2002) propose a method to reconstruct the aggregated covariance matrix by averaging the clients' covariance matrices, each approximated by a few top PCs. Global PCs can then be obtained by computing the top eigenvectors of the averaged covariance matrix. dist PCA (Fan et al., 2019) provides an alternative approach, where the server stacks locally calculated PCs into a large matrix and runs another PCA on the stacked matrix. Liang et al. (2014) use a similar method, where clients calculate a singular value decomposition (SVD) of the local observation matrix and then send the singular values and singular vectors to the server. The server stacks the scaled singular vectors and runs SVD on the stacked matrix. Federated PCA (Grammenos et al., 2020) considers streaming-data applications where edge devices have limited memory budgets. In their work, locally estimated subspaces are hierarchically merged to form the global subspace. Feldman et al. (2013) also focus on streaming data and reduce large datasets into smaller ones. In spite of the reductions in communication or memory cost, these algorithms are often not guaranteed to recover the true PCs exactly.
Also, they are built upon homogeneity assumptions and neglect statistical heterogeneity among the distributed datasets. To obtain more refined estimates of PCs from distributed datasets, a series of works propose to use multiple rounds of communication (Chen et al., 2020; Garber et al., 2017; Huang and Pan, 2020; Alimisis et al., 2021). Among them, Chen et al. (2020) and Garber et al. (2017) design PC updates via shift-and-invert iterations. The shift-and-invert method (Garber and Hazan, 2015) reformulates inverse power iteration as an unconstrained convex optimization problem and uses gradient-based iterative algorithms to solve it. With a similar rationale, Chen et al. (2020) apply the shift-and-invert formulation to distributed settings and use distributed Newton methods to solve for the top eigenvector of the covariance matrix. Then, the covariance matrix is deflated to calculate the subsequent eigenvectors. Besides shift-and-invert iterations, manifold optimization is also employed for PCA. Huang and Pan (2020) use distributed Riemannian optimization to find top PCs from homogeneous datasets. To further reduce communication costs, Alimisis et al. (2021) combine quantized distributed optimization and Riemannian gradient descent with an exponential map to calculate the leading eigenvectors of the covariance matrix. These methods usually treat the differences among clients' covariance matrices as errors. Thus, when datasets are heterogeneous, the errors are large, and these algorithms fail to retrieve the true PCs.

Gradient descent on manifolds. The centralized version of gradient descent on manifolds, or Riemannian gradient descent, has been well studied (Absil et al., 2008; Boumal, 2022). Algorithms based on exponential mappings (Edelman et al., 1998) can achieve convergence rates comparable to their Euclidean counterparts. Since exponential mappings are expensive to compute, there are algorithms that replace them with retractions.
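To make retraction-based Stiefel gradient ascent concrete, here is a minimal sketch of our own (the function names `qr_retraction` and `stiefel_pca` and all parameters are illustrative, not any cited implementation) that maximizes Tr(UᵀSU) over orthonormal U using a QR-based retraction, thereby recovering the top-k eigenspace of a covariance matrix:

```python
import numpy as np

def qr_retraction(X):
    # Map an arbitrary full-column-rank d x k matrix back onto the Stiefel manifold
    Q, R = np.linalg.qr(X)
    # Fix column signs (diagonal of R positive) so the map is continuous
    return Q * np.sign(np.diag(R))

def stiefel_pca(S, k, lr=0.1, iters=1000, seed=0):
    # Gradient ascent on Tr(U^T S U) subject to U^T U = I
    d = S.shape[0]
    U = qr_retraction(np.random.default_rng(seed).standard_normal((d, k)))
    for _ in range(iters):
        G = 2 * S @ U                              # Euclidean gradient
        W = G - U @ ((U.T @ G + G.T @ U) / 2)      # project onto tangent space at U
        U = qr_retraction(U + lr * W)              # ascent step, then retract
    return U
```

From a random initialization, the iterates converge to the subspace spanned by the top k eigenvectors of S, mirroring the behavior of orthogonal (subspace) iteration.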
Tang (2019) presents an elegant framework for analyzing k-PCA by Riemannian gradient descent with the Cayley retraction. This work proves the local linear convergence of Stiefel gradient descent and also shows that the algorithm can exactly recover the top eigenspaces. Recent years have also seen advances in distributed manifold optimization. Chen et al. (2021a,b) introduce a simple distributed St-GD algorithm that minimizes a general objective on the manifold. In each round, clients apply St-GD to the local objectives and send the updated variables to the server; the server then averages the received updates and applies a retraction. The algorithm is guaranteed to converge to stationary points at a sublinear rate.

Per PCA also exploits St-GD to solve for PCs. However, our algorithm enhances simple manifold optimization by simultaneously optimizing local and global PCs, while also incorporating orthogonality constraints between the global and local PCs. Per PCA thus introduces a special correction step to handle such constraints. This is done by defining a new operation we name a generalized retraction, defined on the entire R^{d×r} rather than the tangent bundle of the Stiefel manifold St(d, r). We should note that, among all the distributed algorithms discussed, only Per PCA models different or distributed datasets by global and local PCs. Thus, it brings unique advantages in decoupling local and global features from highly heterogeneous datasets. Besides, there are several additional benefits of Per PCA in convergence and computation compared with typical existing models. In terms of convergence, Per PCA converges to stationary points of the empirical reconstruction error and is guaranteed to recover the true PCs exactly with proper initialization. The algorithm does not involve a computationally intensive exponential map and can solve for k PCs at one time.
More importantly, Per PCA is fully federated, and different clients can collaborate by sharing only a few global PCs that encode shared, not unique, features. The comparison of Per PCA and several typical PCA algorithms is summarized in Table 1.

Table 1: Comparison of related work: Robust PCA (Candes et al., 2011), JIVE (Lock et al., 2013), dist PCA (Fan et al., 2019), Distri-Eigen (Chen et al., 2020), CEDRE (Huang and Pan, 2020), PCA by St-GD (Tang, 2019), and Per PCA (ours). Metrics included and their definitions are: (i) Exact convergence: the algorithm can recover top subspaces of the sample covariance matrix exactly; (ii) k-PCA: the algorithm can calculate the subspace spanned by the top k PCs instead of one single component; (iii) Federated: the algorithm can be run in a distributed fashion where raw data remain where they are generated on the edge and only focused updates need to be shared across clients; (iv) Personalized: the algorithm encodes both shared and unique features across all datasets.

2.2 Notations

We first introduce the needed notation in this subsection. For a d-dimensional vector x, we use ‖x‖ to denote its 2-norm. The inner product of two vectors is the standard inner product in Euclidean space: ⟨x, y⟩ = x^T y. We use I_d to denote the identity matrix in R^{d×d}; we sometimes omit the subscript d if the dimension is clear from the context. For a real matrix A ∈ R^{m×n}, we use ‖A‖_F to denote its Frobenius norm, ‖A‖_F = (Σ_{i=1}^n Σ_{j=1}^m A_{ij}^2)^{1/2}, and ‖A‖_op to denote its operator norm, ‖A‖_op = max_{v ∈ R^n, ‖v‖=1} ‖Av‖. For two matrices A, B ∈ R^{m×n}, we define their inner product as ⟨A, B⟩ = Σ_{i=1}^n Σ_{j=1}^m A_{ij} B_{ij} = Tr(A^T B). If A ∈ R^{n×n} is symmetric positive definite (PSD), it has an eigendecomposition A = U D U^T, where D is an n-by-n diagonal matrix whose diagonal entries are all positive and U is an n-by-n unitary matrix. Then, for p ∈ R, the p-th power of A is defined as A^p = U D^p U^T.
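These definitions map directly onto numerical routines; a quick illustrative check (ours) of the norms, the trace inner product, and the eigendecomposition-based matrix power:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

fro = np.sqrt((A ** 2).sum())          # ||A||_F: root of summed squared entries
op = np.linalg.norm(A, 2)              # ||A||_op: largest singular value
inner = np.trace(A.T @ B)              # <A, B> = Tr(A^T B)

# p-th power of a symmetric positive definite matrix via eigendecomposition
M = A.T @ A + np.eye(3)                # symmetric positive definite
w, U = np.linalg.eigh(M)
half = U @ np.diag(w ** 0.5) @ U.T     # M^{1/2}, so that M^{1/2} M^{1/2} = M
```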
For a square matrix A ∈ R^{n×n}, we use λ_min(A) and λ_max(A) to denote the minimum and maximum eigenvalues of A. Similarly, we use λ_1(A), λ_2(A), ..., λ_n(A) to denote the n eigenvalues of A in descending order. We use ‖A‖_op and λ_max(A) interchangeably when A is symmetric PSD. For a matrix A ∈ R^{m×n}, we use vec(A) ∈ R^{mn} to denote its vectorization, i.e., the vector formed by concatenating all the column vectors of A. col(A) is the linear subspace spanned by the column vectors of A. We use A_{i1:i2, j1:j2} to denote the submatrix of A formed by picking the i1, i1+1, ..., i2-th rows and the j1, j1+1, ..., j2-th columns of A. For two matrices A ∈ R^{m×n1} and B ∈ R^{m×n2}, [A, B] ∈ R^{m×(n1+n2)} is defined as the concatenation of A and B by column. Finally, we use the standard O(·), Ω(·), and o(·) notations throughout the paper.

3. What is Per PCA?

We establish the formulation of Per PCA in this section.

3.1 Motivation

Suppose we have N clients (i.e., data sources) with datasets {Y_(i)}_{i=1}^N, where Y_(i) is a d-by-n_i matrix; d is the dimension of the data, and n_i is the number of datapoints on client i. The datasets {Y_(i)}_{i=1}^N have commonalities but also possess client-level distinctive features. The task is to find a few low-dimensional common and unique features that best characterize the observations in the high-dimensional data {Y_(i)}_{i=1}^N. Standard PCA uses a small number of principal components (PCs) to explain the variations in {Y_(i)}_{i=1}^N. Such treatment ignores the client-to-client differences in the observations. Present IoT systems usually consist of distributed edge devices (clients) that operate in extremely heterogeneous environments. It is thus important to consider the different features of different clients. As a more capacious description of the data, we consider the model where local observations are driven by r_1 global PCs and r_{2,(i)} local PCs.
More specifically, on data source i, an observation y_(i) is generated from

    y_(i) = Σ_{q=1}^{r_1} φ_(i),q u_q + Σ_{q=1}^{r_{2,(i)}} ϕ_(i),q v_(i),q + ε_(i)    (1)

where the φ_(i),q's and ϕ_(i),q's are coefficients, or scores in PCA terminology, the u_q's are global PCs, the v_(i),q's are local PCs, and the ε_(i)'s are i.i.d. noise vectors. r_1 is the number of global PCs, and r_{2,(i)} is the number of local PCs on client i. We allow the v_(i),q's to be client-dependent while enforcing the u_q's to remain the same across all clients. Naturally, the u_q's encode the information shared by all participants, while the v_(i),q's describe the distinctive patterns on each client.

Similar to standard PCA, different principal components need to be orthonormal:

    u_{q1}^T u_{q2} = δ_{q1,q2},  (v_(i),q1)^T v_(i),q2 = δ_{q1,q2},  ∀ q1, q2, i = 1, ..., N    (2)

where δ_{q1,q2} is the Kronecker delta. In addition to (2), we further require that the global and local features are orthogonal:

    u_{q1}^T v_(i),q2 = 0,  ∀ q1, q2, i = 1, ..., N    (3)

The orthogonality of the PCs implies that the shared and unique features span different subspaces, thus describing independent and decoupled patterns in the data sources. Model (1) is an interpretable linear model that naturally incorporates both common and individual features of different clients. It is useful in applications where disentangling global and local features is important. The development of IoT and recent advancements in federated and distributed analytics present numerous such applications, including time series data, image and video data, and language data. We will show the efficacy of (1) on several examples. The task of Per PCA is to recover the global and local PCs from the observations {Y_(i)}_{i=1}^N.
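A generative process obeying (1)-(3) is easy to simulate; the following sketch (ours, with illustrative dimensions and score scales) samples orthonormal global PCs, per-client local PCs orthogonal to them, and noisy observations:

```python
import numpy as np

def make_per_pca_data(d=10, r1=2, r2=2, N=3, n=200, noise=0.05, seed=1):
    # Sample global PCs U and, per client, local PCs V_(i) satisfying
    # U^T U = I, V_(i)^T V_(i) = I, U^T V_(i) = 0  (constraints (2)-(3)).
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, r1)))
    local_pcs, data = [], []
    for _ in range(N):
        B = rng.standard_normal((d, r2))
        B -= U @ (U.T @ B)                 # project out the global subspace
        V, _ = np.linalg.qr(B)             # orthonormal local PCs, V ⟂ U
        scores_g = 3.0 * rng.standard_normal((r1, n))   # global scores
        scores_l = 1.5 * rng.standard_normal((r2, n))   # local scores
        Y = U @ scores_g + V @ scores_l + noise * rng.standard_normal((d, n))
        local_pcs.append(V)
        data.append(Y)
    return U, local_pcs, data
```

Projecting the random matrix B off the global subspace before the QR step is what enforces the cross-orthogonality constraint (3) exactly.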
We can write the global and local PCs in matrix form:

    U = [u_1, ..., u_{r_1}],    V_(i) = [v_(i),1, ..., v_(i),r_{2,(i)}]    (4)

and solve for U and the V_(i)'s by minimizing the empirical reconstruction loss:

    min_{U, {V_(i)}_{i=1,...,N}}  Σ_{i=1}^N (1/n_i) ‖Y_(i) − Ŷ_(i)‖_F^2
    subject to  U^T U = I,  V_(i)^T V_(i) = I,  V_(i)^T U = 0,  ∀ i    (5)

where Ŷ_(i) is the statistical fit for client i's data given the PCs U and V_(i):

    Ŷ_(i) = U U^T Y_(i) + V_(i) V_(i)^T Y_(i)    (6)

Intuitively, in (5), we look for the PCs such that the prediction Ŷ_(i) best fits the distributed datasets. The objective (5) has another interpretation: with some algebra, we can transform (5) into

    max_{U, {V_(i)}_{i=1,...,N}}  Σ_{i=1}^N [ Tr(U^T S_(i) U) + Tr(V_(i)^T S_(i) V_(i)) ]
    subject to  U^T U = I,  V_(i)^T V_(i) = I,  V_(i)^T U = 0,  ∀ i    (7)

where S_(i) is the data covariance matrix:

    S_(i) = (1/n_i) Y_(i) Y_(i)^T

From (7), it is clear that Per PCA attempts to find global and local low-dimensional subspaces that best align with the data covariance matrices. We will study objective (7) from here on. For simplicity, we introduce

    f_i(U, V_(i)) = (1/2) Tr(U^T S_(i) U) + (1/2) Tr(V_(i)^T S_(i) V_(i))    (8)

    f(U, {V_(i)}) = Σ_{i=1}^N f_i(U, V_(i))    (9)

Then (7) amounts to maximizing f under the orthonormality constraints. Notice that, though f and the f_i's are convex, the constraint set in (7) is nonconvex; thus, the problem is nonconvex. The nonconvex formulation (7) appears difficult to analyze and solve. In the following sections, we delve into the identifiability and optimization of (7). Fortunately, our results show that, under minimal conditions, (7) can be solved efficiently, and the optimal solution recovers the true PCs.

4. Are Global and Local PCs Identifiable?

Given the formulation (7), one may ask whether it is possible to identify the true local and global PCs by solving (7). Clearly, global and local PCs cannot be decoupled in every case.
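A small numerical illustration of our own makes this concrete: when every client shares the same covariance matrix, exchanging a direction between the global block U and the local blocks V_(i) leaves the objective (7) unchanged, so the optimum cannot pin down which directions are shared and which are unique (all names and dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 6, 100, 4
Y = rng.standard_normal((d, n))
S = Y @ Y.T / n                        # every client shares this covariance

def objective(U, Vs):
    # Objective (7): total explained variance summed over clients
    return sum(np.trace(U.T @ S @ U) + np.trace(V.T @ S @ V) for V in Vs)

Q, _ = np.linalg.qr(rng.standard_normal((d, 4)))
U, V = Q[:, :2], Q[:, 2:]              # feasible: orthonormal, U^T V = 0
# Swap one "global" direction with one "local" direction on every client;
# the swapped pair is still feasible for the constraints in (7)
U_swap = np.column_stack([Q[:, 2], Q[:, 1]])
V_swap = np.column_stack([Q[:, 0], Q[:, 3]])

val_original = objective(U, [V] * N)
val_swapped = objective(U_swap, [V_swap] * N)
```

Both assignments attain the same objective value, exhibiting the non-identifiability that the misalignment condition introduced below is designed to rule out.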
As a simple counterexample, if all local PCs are the same, then distinguishing local from global PCs is impossible, as there are infinitely many combinations of them that all maximize the explained variance in (7). This edifying counterexample poses the fundamental question of model identifiability. Therefore, we need to find out which data instances are identifiable. In the following, we introduce an identifiability condition and then establish the relationship between the estimated and true PCs.

We restrict our analysis to recovering the subspace spanned by the top PCs (Bhatia, 1997). Therefore, we introduce the projection matrix notation P_U: if U is a matrix with orthonormal columns, i.e., U^T U = I, then P_U is defined as P_U = U U^T. We use Π_g to denote the projection matrix onto the true global eigenspace, i.e., Π_g = P_{U_true}, where U_true are the true top global PCs. Also, we use Π_(i) to denote the projection matrix onto the true local eigenspace, Π_(i) = P_{V_(i),true}, where V_(i),true are the true top local PCs on client i. Recall that we model global and local PCs as mutually orthogonal features; this property can be formally characterized by the following assumption.

Assumption 4.1 (Orthogonality of global and local PCs) Let Π_g be the global projection matrix and the Π_(i)'s the local projection matrices. We assume that

    Π_g Π_(i) = 0    (10)

In addition, we consider the case where the subspace corresponding to the projection Π_g + Π_(i) is indeed an invariant subspace of the population covariance matrix Σ_(i) on client i, i.e., (Π_g + Π_(i)) Σ_(i) = Σ_(i) (Π_g + Π_(i)).

The requirement (Π_g + Π_(i)) Σ_(i) = Σ_(i) (Π_g + Π_(i)) in Assumption 4.1 essentially assumes that U_true and V_(i),true are indeed eigenvectors of the population covariance matrix Σ_(i). As the counterexample suggests, Assumption 4.1 alone is insufficient to guarantee the identifiability of the global and local PCs. To distinguish them, we need another identifiability condition.
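Before stating it, note that the invariant-subspace requirement of Assumption 4.1 is easy to check numerically; a sketch of our own, where the global and local PCs are taken to be eigenvectors of an illustrative population covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))       # eigenvectors of Sigma
Sigma = Q @ np.diag([5.0, 4.0, 3.0, 2.0, 1.0, 0.5]) @ Q.T

U_true, V_true = Q[:, :2], Q[:, 2:4]   # global / local PCs: disjoint eigenvector blocks
Pi_g = U_true @ U_true.T               # global projection matrix
Pi_i = V_true @ V_true.T               # local projection matrix
P = Pi_g + Pi_i
# Because U_true and V_true are eigenvectors of Sigma, P commutes with Sigma,
# i.e., (Pi_g + Pi_i) Sigma = Sigma (Pi_g + Pi_i), and Pi_g Pi_i = 0.
```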
To rule out the counterexample, the local PCs, and accordingly the Π_(i)'s, should differ from each other. To this end, we introduce the notion of misalignment. Misalignment is quantified by a parameter θ defined through the maximum eigenvalue of the average of the local projection matrices. Assumption 4.2 is a formal statement of the identifiability condition.

Assumption 4.2 (Misalignment) Let the Π_(i)'s be the local projection matrices. We assume there exists a positive constant θ ∈ (0, 1) such that:

    λ_max( (1/N) Σ_{i=1}^N Π_(i) ) ≤ 1 − θ    (11)

The constant θ characterizes the misalignment between the local principal subspaces. When θ is larger, the local eigenspaces are more heterogeneous; when θ is smaller, the local eigenspaces are more similar. As an extreme case, if all Π_(i)'s are identical, (1/N) Σ_{i=1}^N Π_(i) is still a projection, so its maximum eigenvalue is 1 and θ becomes zero.

4.1 Statistical error

It turns out that the misalignment Assumption 4.2 is sufficient to ensure identifiability. The following perturbation bound shows that, when the sample covariance matrices are close to the population covariance matrices, we can obtain relatively accurate estimates of the global and local eigenspaces by solving (7).

Theorem 1 Under Assumptions 4.1 and 4.2, and if there exists a constant δ > 0 such that

    λ_{r_1 + r_{2,(i)}}( (Π_g + Π_(i)) Σ_(i) ) − λ_1( (I − Π_g − Π_(i)) Σ_(i) ) ≥ δ

for all i, we have:

    ‖P_Û − Π_g‖_F^2 + (1/N) Σ_{i=1}^N ‖P_V̂_(i) − Π_(i)‖_F^2 ≤ (8/(θ δ^2)) · (1/N) Σ_{i=1}^N ‖Σ_(i) − S_(i)‖_F^2    (12)

where Û and the V̂_(i)'s are the optimal solutions to the objective in (7).

δ is usually called the eigengap in the literature (Vu et al., 2013; Huang and Pan, 2020). The δ^{-2} factor on the right-hand side of (12) is standard in matrix perturbation analysis. Theorem 1 confirms the intuition on identifiability. Specifically, as θ increases, the right-hand side of (12) decreases, resulting in a smaller estimation error; consequently, finding local and global PCs becomes easier. This result critically highlights that heterogeneity can be a blessing.
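The misalignment parameter of Assumption 4.2 is cheap to compute; a quick numerical check of our own (with hypothetical random subspaces) confirms the two regimes: heterogeneous local subspaces give θ strictly between 0 and 1, while identical local subspaces collapse θ to 0:

```python
import numpy as np

def misalignment_theta(projections):
    # theta = 1 - lambda_max( average of the local projection matrices ), cf. (11)
    avg = np.mean(projections, axis=0)
    return 1.0 - np.linalg.eigvalsh(avg)[-1]

rng = np.random.default_rng(0)
d, r = 6, 2

def random_projection():
    # Projection onto a random r-dimensional subspace of R^d
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return Q @ Q.T

# Heterogeneous local subspaces: 0 < theta < 1 (identifiable regime)
theta_hetero = misalignment_theta([random_projection() for _ in range(5)])

# Degenerate case: identical local subspaces give theta = 0 (unidentifiable)
P = random_projection()
theta_identical = misalignment_theta([P, P, P])
```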
In the counterexample, θ → 0 and the right-hand side approaches infinity; hence, one cannot accurately recover the PCs.

In addition, Theorem 1 highlights the benefits of collaborative learning across multiple related clients. The right-hand side of (12) is the average difference between the sample and population covariance matrices over all clients. For clients with a larger dataset, this distance is lower, and for clients with a smaller dataset, the distance can be higher. By jointly optimizing objective (7), clients learn from each other and obtain PC estimates with a statistical error depending on the average distance.

4.2 Minimax statistical lower bound

Though the statistical error bound in Theorem 1 is intuitive, it is not apparent whether the bound is sharp. To fully understand the statistical difficulty of recovering shared and unique components from {S_(i)}, we establish a lower bound on the minimax risk of estimators under the subspace error. For simplicity, we define the subspace error between {Û, {V̂_(i)}} and {U, {V_(i)}} as

    L_subspace( {Û, {V̂_(i)}}, {U, {V_(i)}} ) = ‖P_Û − P_U‖_F^2 + (1/N) Σ_{i=1}^N ‖P_V̂_(i) − P_V_(i)‖_F^2    (13)

Additionally, we use Θ to denote the parameter space specified by Assumption 4.1:

    Θ = { U, {V_(i)} | U^T U = I, V_(i)^T V_(i) = I, U^T V_(i) = 0 }    (14)

The following theorem provides a lower bound on the statistical error.

Theorem 2 If the data generation process satisfies Assumptions 4.1 and 4.2, the eigengap introduced in Theorem 1 is at least δ, and Σ_{i=1}^N ‖S_(i) − Σ_(i)‖_F^2 = o(1), then, among data generated by all possible {U_true, {V_(i),true}} ∈ Θ, the supremum of the subspace error between the optimal solution to (7), {Û, {V̂_(i)}}, and the ground truth, {U_true, {V_(i),true}}, satisfies

    sup_{ {U_true, {V_(i),true}} ∈ Θ } L_subspace( {Û, {V̂_(i)}}, {U_true, {V_(i),true}} ) = Ω( (1/(θ δ^2)) · (1/N) Σ_{i=1}^N ‖Σ_(i) − S_(i)‖_F^2 )    (15)

Theorem 2 gives a minimax lower bound on the subspace error in terms of the misalignment parameter θ and the eigengap δ.
Roughly speaking, the lower bound is of order $\frac{1}{N}\sum_{i=1}^N \|\Sigma_{(i)} - S_{(i)}\|_F^2$, and it almost matches the upper bound in Theorem 1, as the error scales with $\frac{1}{\theta}$ and $\frac{1}{\delta^2}$ in both. Theorem 2 also demonstrates the intrinsic statistical difficulty of separating global and local PCs. When the local PCs are more aligned and the noise components grow larger, $\theta$ and $\delta$ become smaller, and the statistical error of the subspace estimate becomes larger accordingly.

The proof of Theorem 2 is based on a variant of the spiked population model (Birnbaum et al., 2013). We use perturbation analysis to calculate the leading order of the subspace error when the sample covariance is close to the population covariance. The full proof is in Appendix C, along with a comparison between the theoretical statistical error estimate and the statistical error obtained from numerical simulations.

4.3 Sample complexity

In this section, we estimate the statistical error when the data are generated by a sub-Gaussian distribution. A random vector $y \in \mathbb{R}^d$ admits a sub-Gaussian distribution with parameter $\sigma$ if, for each fixed unit vector $v \in S^{d-1}$, $\mathbb{E}\, e^{\lambda \langle v, y\rangle} \le e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb{R}$. The parameter $\sigma$ denotes the variance level: when $\sigma$ is larger, the data are noisier. As a special case, if $y$ admits a Gaussian distribution with mean zero and covariance $\Sigma_y$, then $\sigma^2 = \|\Sigma_y\|_{\mathrm{op}}$ (Wainwright, 2019). The following corollary gives an upper bound on the estimation error.

Corollary 3 If the dataset on each client $i$, $\{Y_{(i)}\}_{i=1}^N$, admits an i.i.d. sub-Gaussian distribution with parameter $\sigma$, and the assumptions in Theorem 1 are satisfied, then with probability at least $1 - \tilde\delta$ (over the randomness of the data generation process), we have:
$$\|P_{\hat U} - \Pi_g\|_F^2 + \frac{1}{N}\sum_{i=1}^N \|P_{\hat V_{(i)}} - \Pi_{(i)}\|_F^2 \le \frac{4}{\theta\delta^2}\,\sigma^4 C^2\, \frac{d}{N}\sum_{i=1}^N \frac{d + \log(2N/\tilde\delta)}{n_i} \qquad (16)$$
where $C$ is a constant. Inequality (16) essentially shows the consistency of the solutions $\hat U$ and the $\hat V_{(i)}$'s.
When the data dimension $d$ is fixed and the sample sizes $n_i$ are relatively large, the right-hand side of (16) decreases as $O\big(\sum_{i=1}^N \frac{1}{N\theta\delta^2 n_i}\big)$. As the $n_i$'s approach infinity, the subspace error decreases to 0, and the estimated eigenspaces approach the true values accordingly.

Equation (16) also highlights the benefits of knowledge sharing. If each client only used its own data to estimate the PCs, the estimation error would be $O(1/n_i)$, which can be high for clients with few observations (i.e., small $n_i$). However, when $N$ clients collaborate in learning global and local PCs, the estimation error becomes the average of the individual statistical errors, $O\big(\sum_{i=1}^N \frac{1}{N\theta\delta^2 n_i}\big)$. Data-poor clients can thus borrow strength from other clients to improve the estimates of their PCs. The statistical consistency and knowledge-sharing effect are also examined by numerical experiments in Section 7.

Here, we note that statistical consistency cannot be achieved by existing estimates without personalized modeling. For example, the statistical error of distPCA (Fan et al., 2019) depends on $O\big(\frac{1}{N}\sum_{i=1}^N \|\Sigma_{(i),l}\|_{\mathrm{op}}\big)$, which does not decrease with the number of observations $n_i$ as long as $\|\Sigma_{(i),l}\|_{\mathrm{op}} > 0$. The comparison highlights the advantages of personalization through our formulation in (7).

Now we present the proof of Corollary 3.

Proof We adopt the covariance concentration bounds in Wainwright (2019) and Rinaldo (2019). Since the data on client $i$ admit independent sub-Gaussian distributions, Theorem 13.3 in Rinaldo (2019) states that, with probability at least $1 - \delta_1$, there exists a constant $C$ such that:
$$\|\Sigma_{(i)} - S_{(i)}\|_{\mathrm{op}} \le \sigma^2 C \max\Big\{ \sqrt{\tfrac{d + \log(2/\delta_1)}{n_i}},\ \tfrac{d + \log(2/\delta_1)}{n_i} \Big\}$$
We choose $\delta_1 = \tilde\delta / N$. Then, by a union bound, with probability at least $1 - \tilde\delta$:
$$\|\Sigma_{(i)} - S_{(i)}\|_{\mathrm{op}} \le \sigma^2 C \max\Big\{ \sqrt{\tfrac{d + \log(2N/\tilde\delta)}{n_i}},\ \tfrac{d + \log(2N/\tilde\delta)}{n_i} \Big\}$$
holds for all $i$. Combining this with Theorem 1 proves the bound in (16).

Equation (16) also gives a simple estimate of the sample complexity.
Corollary 4 Under the assumptions of Theorem 1, and assuming that the data on client $i$ admit an i.i.d. sub-Gaussian distribution with parameter $\sigma$, if each client has at least $O\big(\frac{1}{\epsilon}\frac{\sigma^4 d^2}{\theta\delta^2}\big)$ observations, then with high probability, the estimation error is smaller than $\epsilon$.

Proof The proof is straightforward. Notice that when $n_i \gtrsim d$, the right-hand side of (16) is dominated by the $d/n_i$ terms. Thus, if we neglect the logarithmic factors on the right-hand side of (16) and set $\frac{4}{\theta\delta^2}\sigma^4 C^2 d\, \frac{1}{N}\sum_{i=1}^N \frac{d}{n_i} \le \epsilon$, the statistical error will also be upper bounded by $\epsilon$. The inequality holds when each client has no fewer than $O\big(\frac{1}{\epsilon}\frac{\sigma^4 d^2}{\theta\delta^2}\big)$ observations.

5. Recovering Local and Global PCs

The statistical consistency proved in Section 4 rests on the premise that the objective in (7) can be solved to optimality. An efficient algorithm is not readily apparent, as the constraints in (7) are nonconvex. In this section, we develop a class of algorithms to solve (7). The major difficulty in optimizing (7) lies in the nonconvex constraints: in addition to the orthonormality constraints on $U$ and the $V_{(i)}$'s, the constraints $U^T V_{(i)} = 0$ require global and local PCs to be mutually orthogonal. The latter constraints introduce interactions between local and global variables, which renders simple distributed Stiefel manifold descent (Chen et al., 2021b) inadequate. To handle the orthogonality constraints, we propose a class of algorithms that we call Personalized PCA (PerPCA). PerPCA adopts Stiefel manifold gradient descent to ensure that all constraints are satisfied throughout the algorithm. It is worth noting that PerPCA is naturally federated, as the computation is distributed over clients and only updates of the global PCs need to be shared. In the remainder of this section, we build the PerPCA algorithm step by step.
But before delving into the technical details of parallel gradients and retractions, to illustrate the essence of PerPCA, we first present a simple instance that exploits polar projections to maintain the orthonormality of the updates.

5.1 An instance of PerPCA

The polar projection of a full-column-rank matrix $W \in \mathbb{R}^{n_1 \times n_2}$, where $n_1 \ge n_2$, returns the orthonormal matrix
$$\mathrm{Polar}(W) = W (W^T W)^{-1/2} \qquad (17)$$
Polar projection can be efficiently implemented via SVD (Breloy et al., 2021). It is known that, among all orthonormal matrices, $\mathrm{Polar}(W)$ is the closest to $W$ (Kahan, 2011). Therefore, we can combine gradient descent with polar projection to solve problem (7). The pseudocode is summarized in Algorithm 1.

Algorithm 1 An instance of PerPCA using polar projection
  Input: client covariance matrices $\{S_{(i)}\}_{i=1}^N$, stepsize $\eta_\tau$
  Initialize $U_1$ and $V_{(1),\frac12}, \dots, V_{(N),\frac12}$
  for communication rounds $\tau = 1, \dots, R$ do
    for client $i = 1, \dots, N$ do
      $V_{(i),\tau} = \mathrm{Polar}\big(V_{(i),\tau-\frac12} - U_\tau U_\tau^T V_{(i),\tau-\frac12}\big)$
      $[U_{(i),\tau+1}, V_{(i),\tau+\frac12}] = \mathrm{Polar}\big([U_\tau, V_{(i),\tau}] + \eta_\tau S_{(i)}[U_\tau, V_{(i),\tau}]\big)$
      Upload $U_{(i),\tau+1}$ to the server
    end for
    Server calculates $U_{\tau+1} = \mathrm{Polar}\big(\frac{1}{N}\sum_{i=1}^N U_{(i),\tau+1}\big)$
    Server broadcasts $U_{\tau+1}$
  end for

In Algorithm 1, at each iteration, client $i$ first deflates $V_{(i),\tau-\frac12}$ to make it orthogonal to $U_\tau$. This ensures that the updates are feasible: $U_\tau^T V_{(i),\tau} = 0$, $U_\tau^T U_\tau = I$, and $V_{(i),\tau}^T V_{(i),\tau} = I$. Then client $i$ uses gradient ascent and polar projection to update $U_{(i),\tau+1}$ and $V_{(i),\tau+\frac12}$. This step increases the objective while respecting the orthonormality constraints on $U$ and $V_{(i)}$. After the updates, client $i$ sends the updated $U_{(i),\tau+1}$ to the server. The server averages all received $U_{(i),\tau+1}$'s, orthonormalizes the average, and broadcasts the updated $U_{\tau+1}$. It is intuitively clear how the iterations in Algorithm 1 maximize the objective while keeping the updates feasible.
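The iterations of Algorithm 1 can be sketched in a few lines of NumPy (our own simplification for a toy problem; the function names and toy covariances are illustrative, not from the paper):

```python
import numpy as np

def polar(W):
    # Polar(W) = W (W^T W)^(-1/2), computed via the SVD of W
    A, _, Bt = np.linalg.svd(W, full_matrices=False)
    return A @ Bt

def perpca_polar(S_list, r1, r2, eta=0.1, rounds=200, seed=0):
    """Sketch of Algorithm 1: deflation, gradient ascent, and polar projection."""
    rng = np.random.default_rng(seed)
    d = S_list[0].shape[0]
    U = polar(rng.standard_normal((d, r1)))
    V = [polar(rng.standard_normal((d, r2))) for _ in S_list]
    for _ in range(rounds):
        U_up = []
        for i, S in enumerate(S_list):
            V[i] = polar(V[i] - U @ (U.T @ V[i]))       # deflate V_i against U
            W = np.hstack([U, V[i]])
            W = polar(W + eta * S @ W)                  # ascent step, then polar
            U_up.append(W[:, :r1])
            V[i] = W[:, r1:]
        U = polar(sum(U_up) / len(U_up))                # server: average, re-orthonormalize
    return U, V

# Toy example: shared top direction e1; client-specific directions e2 and e3
e = np.eye(4)
S1 = 5 * np.outer(e[0], e[0]) + 3 * np.outer(e[1], e[1])
S2 = 5 * np.outer(e[0], e[0]) + 3 * np.outer(e[2], e[2])
U, V = perpca_polar([S1, S2], r1=1, r2=1)
print(abs(U[0, 0]))       # close to 1: U aligns with the shared direction e1
print(abs(V[0][1, 0]))    # close to 1: V_(1) aligns with client 1's direction e2
```

On this toy problem the global PC converges to the shared eigenvector and each local PC to the client-specific one, matching the intended decoupling.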
In the rest of this section, we study a broader class of algorithms through the lens of manifold optimization and show that Algorithm 1 is a special case of this class. We begin by reviewing a few concepts from manifold optimization and then define a class of operations called generalized retractions. We then use techniques from Stiefel gradient descent to design a class of algorithms that solves (7).

5.2 Generalized retractions

We begin by introducing the Stiefel manifold, commonly used in matrix analysis (Edelman et al., 1998). The Stiefel manifold $St(d, r)$ is the set of all $d \times r$ orthonormal matrices:
$$St(d, r) = \{U \in \mathbb{R}^{d\times r} \mid U^T U = I\} \qquad (18)$$
It is embedded in a $d \times r$-dimensional Euclidean space. One can verify that $St(d, r)$ is not convex in general (Edelman et al., 1998). For $U \in St(d, r)$, the tangent space of $St(d, r)$ at $U$ is defined as:
$$T_U = \{\xi \in \mathbb{R}^{d\times r} \mid \xi^T U + U^T \xi = 0\}$$
It can be derived by differentiating $U^T U = I$. The normal space $N_U$ is defined as the orthogonal complement of the tangent space at $U$. Both $T_U$ and $N_U$ are linear subspaces of $\mathbb{R}^{d\times r}$, so we can define projections onto them. $P_{N_U}$ denotes the projection onto the normal space:
$$P_{N_U}(V) = \tfrac12 U \big(U^T V + V^T U\big)$$
Similarly, $P_{T_U}$ denotes the projection onto the tangent space:
$$P_{T_U}(V) = V - P_{N_U}(V)$$
One can verify that for any matrix $V \in \mathbb{R}^{d\times r}$, $P_{T_U}(V)^T U + U^T P_{T_U}(V) = 0$.

Next, we introduce the notion of a generalized retraction. The motivation is straightforward. For an orthonormal matrix $U$ and a general update matrix $\xi$, the matrix $U + \xi$ generally violates the orthonormality constraint: $(U + \xi)^T (U + \xi) \ne I$. A generalized retraction finds an approximation of $U + \xi$ that strictly satisfies the orthonormality constraint. Ideally, the best approximation would be found via projection. However, the projection onto a general nonlinear manifold is hard to analyze.
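The tangent and normal projections just defined are easy to verify numerically; a short NumPy check of our own (not from the paper):

```python
import numpy as np

def proj_normal(U, V):
    # P_{N_U}(V) = (1/2) U (U^T V + V^T U): projection onto the normal space at U
    return 0.5 * U @ (U.T @ V + V.T @ U)

def proj_tangent(U, V):
    # P_{T_U}(V) = V - P_{N_U}(V): projection onto the tangent space at U
    return V - proj_normal(U, V)

rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((6, 2)))[0]   # a point on St(6, 2)
X = rng.standard_normal((6, 2))
xi = proj_tangent(U, X)

# Tangent-space condition: xi^T U + U^T xi = 0
print(np.allclose(xi.T @ U + U.T @ xi, 0))          # True
# Projection is idempotent on tangent vectors
print(np.allclose(proj_tangent(U, xi), xi))         # True
```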
Therefore, one can relax this projection to a generalized retraction, defined formally as follows:

Definition 5 We call a mapping $GR_U(\cdot): \mathbb{R}^{d\times r} \to St(d, r)$ a generalized retraction if
1. (Property 1): $\mathrm{col}(GR_U(\xi)) = \mathrm{col}(U + \xi)$, for all $U \in St(d, r)$, $\xi \in \mathbb{R}^{d\times r}$.
2. (Property 2): There exist constants $M_1, M_2 \ge 0$ and $M_3 > 0$ such that:
$$\|GR_U(\xi) - (U + P_{T_U}(\xi))\|_F \le M_1 \|P_{T_U}(\xi)\|_F^2 + M_2 \|\xi - P_{T_U}(\xi)\|_F,\quad \forall U \in St(d, r),\ \xi \in \mathbb{R}^{d\times r},\ \|\xi\|_F \le M_3$$

Figure 2 illustrates the Stiefel manifold, tangent space, and generalized retraction. Notice that the definition of a generalized retraction extends the definition of a retraction in the literature (Absil et al., 2008). A retraction is usually defined as a mapping from the tangent bundle to the Stiefel manifold $St(d, r)$ (Edelman et al., 1998), whereas a generalized retraction maps all of $\mathbb{R}^{d\times r}$ to $St(d, r)$. This extension allows us to apply a generalized retraction directly to any matrix, eliminating the need to project it onto the tangent space beforehand.

Figure 2: An illustration of the Stiefel manifold, tangent space, and generalized retraction. The red surface represents the Stiefel manifold. The blue plane represents the tangent space at $U \in St(d, r)$. $\xi$ is a general $d \times r$ matrix representing the update direction. $P_{T_U}(\xi)$ projects $\xi$ onto the tangent space at $U$. The generalized retraction $GR_U(\xi)$ maps $U + \xi$ back to the Stiefel manifold.

Property 1 requires that a generalized retraction preserve column spaces. This property is indispensable in our algorithm development, as we use it to ensure the orthogonality of global and local PCs. Property 2 requires that $GR_U(\xi)$ be close to the tangent-space update $U + P_{T_U}(\xi)$. In the special case $\xi \in T_U$, Property 2 reduces to $\|GR_U(\xi) - (U + \xi)\|_F \le M_1 \|\xi\|_F^2$, which coincides with the definition of a retraction in the literature (Chen et al., 2021a).
When the norm of $\xi$ is small, Property 2 essentially requires that the difference between a generalized retraction and the projection onto the tangent space be a higher-order term. Though Definition 5 looks demanding, there are several available choices for a generalized retraction.

Proposition 6 The polar projection, defined as
$$GR_U^{\mathrm{polar}}(\xi) = (U + \xi)\big(I + \xi^T U + U^T \xi + \xi^T \xi\big)^{-1/2} \qquad (19)$$
is a generalized retraction. Its computational complexity is $O(dr^2 + r^3)$.

(19) is consistent with definition (17), though the notations differ slightly. Notice that the polar projection can be equivalently calculated via the SVD of $U + \xi$ (Breloy et al., 2021). We relegate the proof and implementation details to Appendix F.1. As discussed, an interesting property of the polar projection is that it is exactly the projection of $U + \xi$ onto the Stiefel manifold:
$$GR_U^{\mathrm{polar}}(\xi) = \arg\min_{V \in St(d,r)} \|U + \xi - V\|_F \qquad (20)$$
The proof of (20) can be found in Kahan (2011). QR decomposition, another influential algorithm in numerical linear algebra, also satisfies the requirements of a generalized retraction.

Proposition 7 For a matrix $U + \xi \in \mathbb{R}^{d\times r}$, QR decomposition finds an orthonormal matrix $Q \in St(d, r)$ and an upper triangular matrix $R \in \mathbb{R}^{r\times r}$ such that $QR = U + \xi$. The QR retraction, defined as $GR_U^{QR}(\xi) = Q$, is a generalized retraction. Its computational complexity is $O(dr^2)$.

We relegate the proof to Appendix F.2. In all of our experiments, we choose the polar projection as the generalized retraction.

5.3 PerPCA: The algorithm

Now we are ready to introduce the personalized PCA algorithm, PerPCA. Recall that our algorithm is designed to be federated and requires multiple communication rounds between the clients and a central server/entity that orchestrates the collaborative learning process. Suppose that at communication round $\tau$, each client has feasible global components $U_\tau$ and local components $V_{(i),\tau}$, i.e., $[U_\tau, V_{(i),\tau}] \in St(d, r_1 + r_{2,(i)})$.
Then client $i$ calculates the gradient of the objective $f_i$ defined in (8):
$$\nabla_U f_i(U_\tau, V_{(i),\tau}) = S_{(i)} U_\tau, \qquad \nabla_{V_{(i)}} f_i(U_\tau, V_{(i),\tau}) = S_{(i)} V_{(i),\tau}$$
Since the gradient direction generally does not lie in the tangent space $T_{[U_\tau, V_{(i),\tau}]}$, simple gradient ascent would move $[U_\tau, V_{(i),\tau}]$ off $St(d, r_1 + r_{2,(i)})$. To ensure the iterates move along the manifold, Stiefel optimization first projects the gradient onto the tangent space:
$$G_{(i),\tau} = P_{T_{[U_\tau, V_{(i),\tau}]}}\big(S_{(i)} [U_\tau, V_{(i),\tau}]\big) \qquad (21)$$
In the literature, $G_{(i),\tau}$ is usually referred to as the parallel gradient on the manifold (Edelman et al., 1998). Note that $G_{(i),\tau}$ is a $d \times (r_1 + r_{2,(i)})$ matrix.

Clients then update the global and local PCs in the direction of the parallel gradient $G_{(i),\tau}$. As there is a small difference between the Stiefel manifold and the tangent space, the updated PCs are not yet orthonormal. Therefore, we use a generalized retraction to retract the updated local components to the Stiefel manifold, and denote the retracted matrix by $V_{(i),\tau+\frac12}$. For the global components, clients first send them to the server. The server takes the average and uses a generalized retraction to map the average to $St(d, r_1)$, producing the updated global PC matrix $U_{\tau+1}$.

A major challenge then arises: after the server averages the global PCs, $U_{\tau+1}$ is no longer orthogonal to $V_{(i),\tau+\frac12}$, i.e., $U_{\tau+1}^T V_{(i),\tau+\frac12} \ne 0$ in general. Thus $U_{\tau+1}$ and the $V_{(i),\tau+\frac12}$'s become infeasible, and an algorithm based purely on St-GD cannot proceed. One can verify that $U_{\tau+1}^T V_{(i),\tau+\frac12} = O(\eta_\tau)$, which is of the same order as the parallel gradient update; thus we cannot resolve the infeasibility by decreasing the stepsize. This is a fundamental limitation of a simple route that uses distributed St-GD.

Can we resolve the challenge by enforcing the orthogonality between the global and local PC estimates? Inspired by Gram-Schmidt orthonormalization, we introduce a correction step on the local PCs.
We calculate the projection of $V_{(i),\tau+\frac12}$ onto the column space of $U_{\tau+1}$ and subtract the projected matrix from $V_{(i),\tau+\frac12}$. The resulting (deflated) matrix is orthogonal to $U_{\tau+1}$. Then, we use a generalized retraction to map the deflated matrix to the Stiefel manifold. Since a key property of a generalized retraction is that it preserves the column space, the retracted matrix, denoted $V_{(i),\tau+1}$, remains orthogonal to $U_{\tau+1}$. Now $U_{\tau+1}$ and $V_{(i),\tau+1}$ are feasible, and the updates can repeat over multiple communication rounds until convergence. The pseudocode is summarized in Algorithm 2.

Algorithm 2 PerPCA by St-GD
  Input: client covariance matrices $\{S_{(i)}\}_{i=1}^N$, stepsize $\eta_\tau$
  Initialize $U_1$ and $V_{(1),\frac12}, \dots, V_{(N),\frac12}$
  for communication rounds $\tau = 1, \dots, R$ do
    for client $i = 1, \dots, N$ do
      $V_{(i),\tau} = GR_{V_{(i),\tau-\frac12}}\big(-U_\tau U_\tau^T V_{(i),\tau-\frac12}\big)$  // Deflate then retract
      Choice 1:
        Calculate $G_{(i),\tau} = P_{T_{[U_\tau, V_{(i),\tau}]}}\big(S_{(i)}[U_\tau, V_{(i),\tau}]\big)$  // Tangent projection
        Update $U_{(i),\tau+1} = U_\tau + \eta_\tau (G_{(i),\tau})_{1:d,\,1:r_1}$  // Gradient ascent
        Update $V_{(i),\tau+\frac12} = GR_{V_{(i),\tau}}\big(\eta_\tau (G_{(i),\tau})_{1:d,\,(r_1+1):(r_1+r_{2,(i)})}\big)$  // Retract
      Choice 2:
        Update $[U_{(i),\tau+1}, V_{(i),\tau+\frac12}] = GR^{\mathrm{polar}}_{[U_\tau, V_{(i),\tau}]}\big(\eta_\tau S_{(i)}[U_\tau, V_{(i),\tau}]\big)$  // Retract after gradient ascent
      Send $U_{(i),\tau+1}$ to the server  // Share global PCs
    end for
    Server calculates $U_{\tau+1} = GR_{U_\tau}\big(\frac{1}{N}\sum_{i=1}^N U_{(i),\tau+1} - U_\tau\big)$  // Average then retract
    Server broadcasts $U_{\tau+1}$
  end for
  Return principal components $U_R$ and the $V_{(i),R}$'s

The first line in the client loop, $V_{(i),\tau} = GR_{V_{(i),\tau-\frac12}}\big(-U_\tau U_\tau^T V_{(i),\tau-\frac12}\big)$, is the correction on the local PC matrix. Regardless of whether $U_\tau$ and $V_{(i),\tau-\frac12}$ are orthogonal, $U_\tau$ and $V_{(i),\tau}$ are always feasible: $[U_\tau, V_{(i),\tau}] \in St(d, r_1 + r_{2,(i)})$. Then each client applies standard St-GD (choice 1) or a variant of St-GD (choice 2) to update $U_{(i),\tau+1}$ and $V_{(i),\tau+\frac12}$ simultaneously. The updated global PCs are sent to the server, which averages all received global PCs and retracts the average to $St(d, r_1)$.
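A minimal NumPy check (our own sketch, using the polar retraction) that the deflate-then-retract correction restores feasibility:

```python
import numpy as np

def polar(W):
    # Polar retraction via SVD: closest orthonormal matrix to W
    A, _, Bt = np.linalg.svd(W, full_matrices=False)
    return A @ Bt

def correction_step(U_new, V_half):
    """Deflate-then-retract: remove the col(U_new) component of V_half, then
    re-orthonormalize. The retraction preserves the column space, so the
    result stays orthogonal to U_new."""
    return polar(V_half - U_new @ (U_new.T @ V_half))

rng = np.random.default_rng(2)
U = np.linalg.qr(rng.standard_normal((10, 2)))[0]
V = np.linalg.qr(rng.standard_normal((10, 3)))[0]     # not orthogonal to U
V_corr = correction_step(U, V)

# Feasibility restored: [U, V_corr] lies on St(10, 5)
W = np.hstack([U, V_corr])
print(np.allclose(W.T @ W, np.eye(5)))                 # True
```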
The obtained $U_{\tau+1}$ is then broadcast back to the clients and becomes the starting point of the next iteration. The algorithm repeats for a certain number of communication rounds.

In Algorithm 2, we introduce two algorithmic choices on the client side. For choice 1, clients perform standard St-GD: they first project the updates onto the tangent space, then retract them to the Stiefel manifold. For choice 2, clients use polar projection in place of St-GD. This update rule is inspired by the Minorization-Maximization algorithm (Breloy et al., 2021). Recall that by (20), polar projection acts as a projection onto the nonlinear Stiefel manifold; hence it is close to the composition of the projection onto the tangent space and the retraction from the tangent space to the manifold. We propose two choices to enrich practitioners' toolkits, as they have similar performance in most of our case studies. We focus on choice 1 in our theoretical analysis. However, we observed in the video segmentation task that choice 2 allows larger stepsizes and thus converges faster. Hence, we leave the specific algorithmic choice to practitioners' discretion.

In general, the computational complexity per communication round at one client is $O(d^2)$. To see this, note that one iteration only involves matrix multiplications and generalized retractions. The complexity of the matrix multiplication $S_{(i)} U_\tau$ is $O(d^2 r_1)$, and the complexity of the tangent projection step is similar. When the ranks $r_1$ and $r_{2,(i)}$ are far smaller than the data dimension $d$, the computational cost of the generalized retractions is only $O(d)$. Thus, the per-iteration computational complexity is $O(d^2)$. It is worth noting that the complexity can be further reduced to $O(d)$ if the covariance matrix $S_{(i)}$ is known to be low rank.
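The low-rank shortcut amounts to multiplying in the factored order; a toy check of ours (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, r1 = 2000, 20, 3
Y = rng.standard_normal((d, n))          # low-rank factor: S = Y Y^T, never formed
U = np.linalg.qr(rng.standard_normal((d, r1)))[0]

SU_cheap = Y @ (Y.T @ U)                 # O(d * n * r1) flops
SU_dense = (Y @ Y.T) @ U                 # O(d^2 * n + d^2 * r1) flops, for comparison
print(np.allclose(SU_cheap, SU_dense))   # True
```

Both orders give the same matrix, but the factored order avoids ever materializing the $d \times d$ covariance, which is the source of the $O(d)$ per-iteration cost.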
More specifically, when $S_{(i)}$ has a low-rank factorization $S_{(i)} = Y_{(i)} Y_{(i)}^T$, where $Y_{(i)} \in \mathbb{R}^{d\times n_{(i)}}$ with $n_{(i)} \ll d$, the computational cost of the matrix multiplication $S_{(i)} U_\tau = Y_{(i)} (Y_{(i)}^T U_\tau)$ is reduced to $O(d\, n_{(i)} r_1)$. As $n_{(i)}$ and $r_1$ are far smaller than $d$, this is $O(d)$, and hence the per-iteration computational complexity is only $O(d)$.

6. Does Algorithm 2 Recover the Local and Global Truth?

Though the development of Algorithm 2 is intuitive, it is important to understand whether it converges and, if so, what kind of solution it recovers. In this section, we analyze the convergence of Algorithm 2 and show that, in general, it converges to stationary points of the objective. In addition, when the local and global components are initialized properly, Algorithm 2 converges linearly to the globally optimal solutions, which exactly recover the true local and global PCs.

6.1 Global convergence

To analyze the convergence, we make the additional assumption that the largest eigenvalues of the sample covariance matrices $S_{(i)}$ are upper bounded:

Assumption 6.1 We assume that the operator norms of the $S_{(i)}$'s are upper bounded by constants $G_{(i),\mathrm{op}}$:
$$\|S_{(i)}\|_{\mathrm{op}} \le G_{(i),\mathrm{op}} \qquad (22)$$
and the Frobenius norms of the $S_{(i)}$'s are upper bounded by constants $G_{(i),F}$:
$$\|S_{(i)}\|_F \le G_{(i),F} \qquad (23)$$
We use $G_{\max,\mathrm{op}}$ to denote $\max_i G_{(i),\mathrm{op}}$, and $G_{\max,F}$ to denote $\max_i G_{(i),F}$.

Assumption 6.1 is common in the optimization literature, as it essentially assumes the objective is Lipschitz continuous. Moreover, if the data are independently generated from a sub-Gaussian distribution, Assumption 6.1 holds with high probability (Wainwright, 2019).
The first-order (KKT) condition for problem (7) is that, for the parallel gradients defined in (21), the local parts are zero on each client and the average of the global parts is zero:
$$\big(G_{(i)}\big)_{1:d,\,(r_1+1):(r_1+r_{2,(i)})} = 0,\ \forall i \in \{1, 2, \dots, N\}, \qquad \frac{1}{N}\sum_{i=1}^N \big(G_{(i)}\big)_{1:d,\,1:r_1} = 0 \qquad (24)$$
The proof of the KKT conditions (24) is in Appendix B. It is clear from Algorithm 2 that when (24) is satisfied, the global and local PC updates are stationary. Thus, (24) describes the stationary points of (7). At non-stationary points, (24) generally does not hold. The theorem below provides an upper bound on the magnitude of the violation of conditions (24). As the violation decreases to zero when the number of communication rounds approaches infinity, the theorem shows that Algorithm 2 converges to KKT points. We use $r$ to denote the maximum rank $r = \max\{r_1, r_{2,(1)}, \dots, r_{2,(N)}\}$.

Theorem 8 Under Assumption 6.1, if we choose a constant stepsize $\eta_\tau = \eta_1 = O\big(\frac{1}{G_{\max,\mathrm{op}}\, r}\big)$, then Algorithm 2 with choice 1 converges to stationary points:
$$\min_{\tau \in \{1,\dots,R\}} \Big\|\frac{1}{N}\sum_{i=1}^N \big(I - P_{U_\tau} - P_{V_{(i),\tau}}\big) S_{(i)} U_\tau\Big\|_F^2 + \frac{1}{N}\sum_{i=1}^N \Big\|\big(I - P_{U_\tau} - P_{V_{(i),\tau}}\big) S_{(i)} V_{(i),\tau}\Big\|_F^2 = O\Big(\frac{1}{R}\Big)$$
Despite the nonconvex constraints in (7), Algorithm 2 provably converges to stationary points regardless of the initial conditions. The $\frac{1}{R}$ convergence rate is comparable to rates in the literature (Chen et al., 2021a). Our algorithm handles global and local PCs at the same time and attains stationary points of both components. In the following section, we sketch the proof of Theorem 8; the complete proof is relegated to Appendix D.

6.1.1 Proof sketch for Theorem 8 and key lemmas

As discussed, one major difficulty in analyzing Algorithm 2 lies in the correction step, which changes the local PCs by $O(\eta_\tau)$, comparable to the change in the gradient ascent step. Therefore, a naive treatment of the correction step generates a large error term that cannot be bounded.
To bypass this issue, we exploit a nice structure in objective (8): $f_i(U, V_{(i)})$ depends only on the subspace spanned by the concatenated matrix $[U, V_{(i)}]$. Therefore, one can adjust $\mathrm{col}(U)$ and $\mathrm{col}(V_{(i)})$ without changing the objective value, as long as $\mathrm{col}([U, V_{(i)}])$ stays the same. One major technical novelty of our work is to introduce Lyapunov functions that take this key property into consideration. We define the two following Lyapunov functions:
$$L_{(i),1}(U, V) = \frac12 \mathrm{Tr}\big(U^T (I - P_V)\, S_{(i)}\, (I - P_V)\, U\big) \qquad (25)$$
$$L_{(i),2}(U, V) = \frac12 \mathrm{Tr}\big(V^T S_{(i)} V\big) \qquad (26)$$
It is easy to see that when $V^T U = 0$, we have:
$$L_{(i),1}(U, V) + L_{(i),2}(U, V) = \frac12 \mathrm{Tr}\big(U^T S_{(i)} U\big) + \frac12 \mathrm{Tr}\big(V^T S_{(i)} V\big) = f_i(U, V)$$
At each communication round $\tau$, the global and local components are indeed orthogonal, $U_\tau^T V_{(i),\tau} = 0$, thus $L_{(i),1}(U_\tau, V_{(i),\tau}) + L_{(i),2}(U_\tau, V_{(i),\tau}) = f_i(U_\tau, V_{(i),\tau})$. $L_{(i),1}$ explicitly encodes the orthogonality constraint into the objective. This design enables convenient handling of the correction step: we can prove that the correction step on $V$ changes $L_{(i),1} + L_{(i),2}$ only by $O(\eta_\tau^2)$, so only the ascent step changes $L_{(i),1} + L_{(i),2}$ by $O(\eta_\tau)$. Thus, the change in the Lyapunov functions is dominated by the update from the parallel gradient. By calculating the updates of $U$ and $\{V_{(i)}\}$ in each communication round, we obtain the following informal version of the sufficient progress lemma:

Lemma 9 (Informal) When we choose a constant stepsize $\eta_\tau = \eta = O\big(\frac{1}{G_{\max,\mathrm{op}}\, r}\big)$ and $U_\tau$ and $V_{(i),\tau}$ satisfy the orthogonality condition $U_\tau^T V_{(i),\tau} = 0$, we have:
$$\sum_{i=1}^N \Big\langle \nabla_U L_{(i),1}(U_\tau, V_{(i),\tau}),\; U_{\tau+1} - U_\tau \Big\rangle + \sum_{i=1}^N \Big\langle \nabla_{V_{(i)}} L_{(i),1}(U_\tau, V_{(i),\tau}) + \nabla_{V_{(i)}} L_{(i),2}(U_\tau, V_{(i),\tau}),\; V_{(i),\tau+1} - V_{(i),\tau} \Big\rangle$$
$$\ge \Omega(\eta)\Big( \Big\|\frac{1}{N}\sum_{i=1}^N \big(I - P_{U_\tau} - P_{V_{(i),\tau}}\big) S_{(i)} U_\tau\Big\|_F^2 + \frac{1}{N}\sum_{i=1}^N \Big\|\big(I - P_{U_\tau} - P_{V_{(i),\tau}}\big) S_{(i)} V_{(i),\tau}\Big\|_F^2 \Big) - O(\eta^2) \qquad (27)$$
When $\eta$ is small, the $O(\eta)$ term dominates the $O(\eta^2)$ term. Thus, Lemma 9 essentially shows that, in Algorithm 2, the Lyapunov functions make monotone progress in each communication round, up to higher-order terms.
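The identity $L_{(i),1} + L_{(i),2} = f_i$ on feasible points is easy to check numerically; a small sketch of ours (function names are illustrative):

```python
import numpy as np

def L1(U, V, S):
    # L_{(i),1}(U, V) = (1/2) Tr(U^T (I - P_V) S (I - P_V) U), with P_V = V V^T
    D = U - V @ (V.T @ U)                # (I - P_V) U
    return 0.5 * np.trace(D.T @ S @ D)

def L2(V, S):
    # L_{(i),2}(U, V) = (1/2) Tr(V^T S V)  (does not depend on U)
    return 0.5 * np.trace(V.T @ S @ V)

def f(U, V, S):
    # Client objective: (1/2) Tr(U^T S U) + (1/2) Tr(V^T S V)
    return 0.5 * np.trace(U.T @ S @ U) + 0.5 * np.trace(V.T @ S @ V)

rng = np.random.default_rng(4)
d = 7
A = rng.standard_normal((d, d))
S = A @ A.T                              # a symmetric PSD "covariance"
Q = np.linalg.qr(rng.standard_normal((d, 4)))[0]
U, V = Q[:, :2], Q[:, 2:]                # feasible pair: U^T V = 0

print(np.isclose(L1(U, V, S) + L2(V, S), f(U, V, S)))   # True
```

When $U^T V = 0$, the deflation inside $L_{(i),1}$ is a no-op, which is exactly why the two Lyapunov functions sum to the objective on feasible iterates.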
With the sufficient progress property, standard analysis of first-order optimization yields an $O\big(\frac{1}{R}\big)$ convergence rate. Formal proofs of Theorem 8 and Lemma 9 can be found in Appendix D.

6.2 Local convergence

Theorem 8 only shows that Algorithm 2 converges to stationary points; it provides no further information about the properties of the final solution. In problems like feature extraction, we want to know whether the stationary point is a globally optimal solution and whether it corresponds to the true PCs. To this end, we analyze the convergence of the global and local PCs. The convergence depends on a Polyak-Lojasiewicz style condition. As in Section 6.1, we introduce another assumption on the eigenvalue distribution of the sample covariance matrices. Without loss of generality, in this section we assume $r_1 = r_{2,(1)} = \cdots = r_{2,(N)} = r$.

Assumption 6.2 (Covariance matrix eigenvalue lower bound) We further assume that the population covariance $\Sigma_{(i)}$ can be entirely explained by $\Pi_g + \Pi_{(i)}$, i.e., $\Sigma_{(i)} \big(\Pi_g + \Pi_{(i)}\big) = \Sigma_{(i)}$, and that the minimum nonzero eigenvalue of $\Sigma_{(i)}$ is lower bounded by a constant $\mu > 0$:
$$\mu\,\big(\Pi_g + \Pi_{(i)}\big) \preceq \Sigma_{(i)} \qquad (28)$$
where $\Pi_g$ and $\Pi_{(i)}$ are rank-$r$ projection matrices.

Assumption 6.2 states that the data covariance decomposes into noiseless global and local parts of rank $r$. The noiseless assumption on the population covariance matrices is standard in the local convergence analysis of many PCA algorithms (e.g., Tang, 2019). The following theorem shows that if Algorithm 2 is initialized within the attraction basin of the global optimum, the iterates converge to the globally optimal solution linearly.
Theorem 10 (Informal) Under Assumptions 4.2, 6.1, and 6.2, if the difference between the population and sample covariances is small, we initialize close to the global optimum, and we choose a constant stepsize $\eta_\tau = \eta = O\big(\frac{1}{G_{\max,\mathrm{op}}\, r}\big)$, then Algorithm 2 with choice 1 converges linearly to the global optimum:
$$f(\hat U, \{\hat V_{(i)}\}) - f(U_R, \{V_{(i),R}\}) = O\big((1 - c\,\eta\mu\theta)^R\big)$$
for some constant $c > 0$, where $\{\hat U, \{\hat V_{(i)}\}\}$ is a set of optimal solutions to problem (7). Furthermore, we recover the exact globally optimal solutions: $P_{U_R} \to P_{\hat U}$ and $P_{V_{(i),R}} \to P_{\hat V_{(i)}}$ as $R \to \infty$.

It is worthwhile to point out that in Theorem 10, the convergence is faster for a larger misalignment parameter $\theta$. This is intuitively understandable: when the local eigenspaces are more heterogeneous, it is easier to identify the different eigenspaces. On the other hand, if all the local eigenspaces are similar, it is difficult to distinguish local PCs from global PCs; thus, the convergence is slower. This result is in striking contrast to standard federated learning (e.g., Li et al., 2020, 2018a), where data heterogeneity leads to slower convergence. We verify this finding in Section 7. A formal version of Theorem 10 and its proof are relegated to Appendix E.

With the statistical error bound provided by Theorem 1 and the convergence guarantee from Theorem 10, we can derive the following corollary.

Corollary 11 Under the same assumptions as Theorems 1 and 10, after $t = \Omega\big(\frac{r\, G_{\max,\mathrm{op}}}{\mu\theta} \log \frac{1}{\varepsilon_{\mathrm{stats}}}\big)$ communication rounds, we obtain estimates of the global and local PCs that satisfy
$$\|P_{U_t} - \Pi_g\|_F^2 + \frac{1}{N}\sum_{i=1}^N \|P_{V_{(i),t}} - \Pi_{(i)}\|_F^2 = O(\varepsilon_{\mathrm{stats}})$$
where $\varepsilon_{\mathrm{stats}}$ is the statistical error $\varepsilon_{\mathrm{stats}} = \frac{1}{\theta\delta^2}\,\sigma^4 C^2\, \frac{d}{N}\sum_{i=1}^N \frac{1}{n_i}$.
Proof By the triangle inequality, we have
$$\|P_{U_t} - \Pi_g\|_F^2 + \frac{1}{N}\sum_{i=1}^N \|P_{V_{(i),t}} - \Pi_{(i)}\|_F^2 \le 2\|P_{U_t} - P_{\hat U}\|_F^2 + \frac{2}{N}\sum_{i=1}^N \|P_{V_{(i),t}} - P_{\hat V_{(i)}}\|_F^2 + 2\|\Pi_g - P_{\hat U}\|_F^2 + \frac{2}{N}\sum_{i=1}^N \|\Pi_{(i)} - P_{\hat V_{(i)}}\|_F^2$$
The first group of terms is bounded by Theorem 10, and the second group by Theorem 1.

6.2.1 Proof sketch of Theorem 10 and key lemmas

To prove the linear convergence in Theorem 10, we need a stronger version of the sufficient progress inequality than Lemma 9: we must show that, in each communication round, the change in the Lyapunov functions is proportional to the optimality gap. This requires a careful analysis of the geometry of objective (7) around the global optimum $P_{\hat U}$ and $\{P_{\hat V_{(i)}}\}$. The key result is the Polyak-Lojasiewicz (PL) inequality.

Lemma 12 (Polyak-Lojasiewicz inequality) Under the same conditions as Theorem 10, we have
$$\Big\|\frac{1}{N}\sum_{i=1}^N \big(I - P_{U_\tau} - P_{V_{(i),\tau}}\big) S_{(i)} U_\tau\Big\|_F^2 + \frac{1}{N}\sum_{i=1}^N \Big\|\big(I - P_{U_\tau} - P_{V_{(i),\tau}}\big) S_{(i)} V_{(i),\tau}\Big\|_F^2 = \Omega\Big( f(\hat U, \{\hat V_{(i)}\}) - f(U_\tau, \{V_{(i),\tau}\}) \Big)$$
The PL inequality shows that the squared norm of the parallel gradient is lower bounded by a constant fraction of the optimality gap. It certifies a nice geometric property of objective (7), so that each gradient step makes significant progress. Combining the PL inequality with Lemma 16 readily proves Theorem 10.

One of our major technical contributions is establishing the PL inequality for the nonconvex problem (7). We analyze the local geometry of the problem with the help of a special set of optimal solutions $\{\hat U_\tau, \{\hat V_{(i),\tau}\}\}$. We show that this set of optimal solutions is close to the current iterate $\{U_\tau, \{V_{(i),\tau}\}\}$, and that the difference $\{U_\tau - \hat U_\tau, \{V_{(i),\tau} - \hat V_{(i),\tau}\}\}$ is aligned with the parallel gradient. As a result, the parallel gradient directs the updates toward the optimal solutions. The full proofs of Lemma 12 and Theorem 10 are relegated to Appendix E.

7. Numerical Experiments

This section tests our model on a set of datasets across different applications.
We start in Section 7.1 with a proof-of-concept study on a synthetic dataset to verify the theoretical findings in Sections 4 and 6. We also discuss the effects of overparametrization and show an interesting application of PerPCA to federated client clustering using local PCs. In Section 7.2, we provide an illustrative example in comparison with Robust PCA to shed light on the end goal of our model. Next, in Section 7.3, we apply PerPCA to the real-life heterogeneous distributed datasets FEMNIST and CIFAR10 to show its advantages in finding better features. Finally, we demonstrate how PerPCA separates shared and unique features in video and language data in Section 7.4.

We note that, by Theorem 10, a suitable initialization is needed for the best performance of PerPCA. We thus employ the standard one-communication-round distributed PCA algorithm proposed in Qu et al. (2002) to initialize the global PCs in Algorithm 2, unless specified otherwise. Local PCs are always randomly initialized. In this section, we set $r_{2,(1)} = r_{2,(2)} = \cdots = r_{2,(N)} = r_2$.

7.1 Proof of concept on synthetic datasets

We generate data from model (1). The $u_q$'s and $v_q$'s are set to be orthogonal components. After obtaining the $u_q$'s and $v_q$'s, we sample the score coefficients $\varphi_{(i),q}$ and $\phi_{(i),q}$ from i.i.d. Gaussian distributions. The noise terms $\epsilon_{(i)}$ are also sampled from i.i.d. Gaussian distributions. Under this setting, multiple aspects are tested: in Section 7.1.1, we revisit the example in Figure 1 and examine the convergence behavior of PerPCA numerically. In Sections 7.1.2, 7.1.3, and 7.1.4, we demonstrate how the statistical errors change with (i) the number of observations $n$, (ii) the data dimension $d$, and (iii) the number of clients $N$, and compare the results with our theory. In Section 7.1.5, we show that in PerPCA, clients benefit from knowledge sharing to improve their PC estimates.
Then we investigate the numerical performance of PerPCA when $r_1$ and $r_2$ are overparametrized in Section 7.1.6. Finally, in Section 7.1.7, we describe a method that exploits the estimated local PCs for client clustering.

7.1.1 Convergence of PerPCA

We first analyze the convergence of PerPCA. Theorem 10 predicts that (i) PerPCA has local linear convergence, and (ii) a larger $\theta$ expedites convergence. To verify these two theoretical results, we run PerPCA on a group of synthetic datasets. We set $N = 2$, $d = 3$, and $n_{(i)} = 1000$. Each client has exactly one global component $u_1$ and one local component $v_{(i),1}$. After setting the global PC $u_1$ and the local PCs $v_{(1),1}$ and $v_{(2),1}$, we generate the data according to model (1), where the coefficients $\varphi_{(i),q}$ and $\phi_{(i),q}$ are randomly sampled from Gaussian distributions. By changing the directions of the local PCs $v_{(1),1}$ and $v_{(2),1}$, we can modify $\theta = \sin^2\big(\tfrac12 \arccos(v_{(1),1}^T v_{(2),1})\big)$. Figure 1, shown in the introduction, is an instance of this analysis with $\theta = 0.127$.

To see the effect of $\theta$ on convergence, we generate data with $\theta$ ranging from 0 to 0.3. In this experiment, we initialize the global and local PCs as random Gaussian vectors. We run each experiment with the same stepsize $\eta = 0.1$ but from 10 different random initializations, and we record the reconstruction error in each communication round $\tau$. The reconstruction error is defined as the objective in (5) divided by the number of observations $n_{(i)}$:
$$\text{Reconstruction error} = \frac{1}{n_{(i)}}\big\|Y_{(i)} - \big(P_U + P_{V_{(i)}}\big) Y_{(i)}\big\|_F^2$$
Results are shown in Figure 3. From Figure 3 (left), we can see that PerPCA indeed enjoys linear convergence. Furthermore, bluer curves have larger slopes, indicating that a larger $\theta$ leads to faster convergence. This finding is corroborated by Figure 3 (right), which plots the log error at the 100th communication round against $\theta$: the log error decreases linearly as $\theta$ increases. These results confirm the insights from Theorem 10.
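For this two-client, rank-one setting, the angle between the local PCs determines $\theta$ in closed form; a quick numerical check of ours (the closed form below is our own derivation from Assumption 4.2, using the fact that the average of two rank-one projections has largest eigenvalue $(1 + |v_1^T v_2|)/2$):

```python
import numpy as np

def theta_two_clients(v1, v2):
    # theta = 1 - lambda_max((Pi_1 + Pi_2) / 2) for unit local PCs v1, v2
    M = 0.5 * (np.outer(v1, v1) + np.outer(v2, v2))
    return 1.0 - np.linalg.eigvalsh(M)[-1]

alpha = np.pi / 3                                   # 60-degree angle between local PCs
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([np.cos(alpha), np.sin(alpha), 0.0])
closed_form = np.sin(alpha / 2) ** 2                # = (1 - cos(alpha)) / 2
print(np.isclose(theta_two_clients(v1, v2), closed_form))   # True
```

Rotating $v_{(2),1}$ away from $v_{(1),1}$ thus sweeps $\theta$ from 0 (parallel local PCs) to 1/2 (orthogonal local PCs), which is how the range of $\theta$ values in this experiment is produced.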
(a) Log reconstruction error vs communication round. (b) Final log reconstruction error vs $\theta$.

Figure 3: Left: the learning curve of the reconstruction error. Each curve represents one set of experiments with one $\theta$; the bluer the curve, the larger $\theta$. Right: log reconstruction error after 100 communication rounds for datasets with different misalignment parameters $\theta$. We run each experiment 10 times, each with a different random initialization. The red line represents the mean log error after 100 communication rounds over the ten experiments, and the blue-shaded region shows the confidence interval.

7.1.2 Dependence of Statistical Error on n

Knowing that PerPCA converges rather quickly, we can use the final iterates of Algorithm 2 as an estimate of the optimal solution to problem (7). To show that the estimate can indeed recover the true local and global PCs, we calculate the subspace error, defined in (13), between the eigenspace estimates and the true values:
$$\text{Subspace error} = \big\|P_{U_\tau} - \Pi_g\big\|_F^2 + \frac{1}{N}\sum_{i=1}^{N}\big\|P_{V_{(i),\tau}} - \Pi_{(i)}\big\|_F^2 \tag{30}$$
Recall that Theorem 1 shows that this error should decrease to 0 as the number of observations on each client approaches infinity. Additionally, Corollary 3 gives a finite-sample bound on the subspace error. Here, we benchmark against a one-shot approach, distPCA (Fan et al., 2019), with a simple variant that makes it amenable to personalization. In standard distPCA, each client first calculates its top $r_1 + r_2$ principal components and sends them to the server. The server then concatenates all the received PCs into a $d \times N(r_1 + r_2)$ matrix and calculates the top $r_1$ principal components of that matrix. To enable personalization in distPCA, we take the following route: we use the obtained top $r_1$ principal components $U_{\text{distPCA}}$ as estimates of the global principal components, and then estimate the local PCs with the help of the global ones. Specifically, the global PCs $U_{\text{distPCA}}$ are sent back to the clients.
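The subspace-error metric (30) compares projection matrices; a minimal numpy sketch (the function name is ours, and orthonormal columns are assumed so $P_U = UU^\top$):

```python
import numpy as np

def subspace_error(U, Vs, Pi_g, Pis):
    """Metric (30): || P_U - Pi_g ||_F^2 + (1/N) sum_i || P_{V_i} - Pi_i ||_F^2,
    comparing estimated projection matrices with the true ones."""
    err = np.sum((U @ U.T - Pi_g) ** 2)
    err += np.mean([np.sum((V @ V.T - Pi) ** 2) for V, Pi in zip(Vs, Pis)])
    return float(err)
```

For two orthogonal rank-1 subspaces the local term equals $\|e_2 e_2^\top - e_3 e_3^\top\|_F^2 = 2$, which makes the metric easy to unit-test.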
Each client then deflates its sample covariance matrix, $S_{(i),\text{deflate}} = \big(I - P_{U_{\text{distPCA}}}\big)\, S_{(i)}\, \big(I - P_{U_{\text{distPCA}}}\big)$, and calculates the top $r_2$ principal components of $S_{(i),\text{deflate}}$ as local PCs.

To analyze statistical consistency, we run PerPCA on datasets with varying numbers of observations $n_{(i)}$ and compare with the benchmark algorithm distPCA. We fix the data dimension $d = 15$ and generate data from 2 global PCs and 10 local PCs. On each client, the variances contributed by local PCs are set to be 100 times larger than those contributed by global PCs to simulate large heterogeneity. This is achieved by setting the standard deviations of $\varphi_{(i),q}$ to be 10 times smaller than those of $\phi_{(i),q}$ in the data-generating model (1). We use 100 clients; 50 of them have $n$ observations each, and the remaining 50 have only $\frac{1}{10}n$ observations each. We run both algorithms and estimate the subspace error (30) over 5 different random seeds.

Figure 4: Log subspace error vs local observations $n$. PerPCA is consistent while distPCA is not.

Results in Figure 4 show that PerPCA achieves a smaller statistical error for almost all $n$, and, more importantly, the error decreases with $n$, which indicates that PerPCA gives consistent estimates of the global and local PCs. When the error is small, the slope of the curve is approximately $-1$ on the log-log scale, which matches the theoretical error upper bound $O\big(\frac{1}{n}\big)$ in Corollary 3. In comparison, the statistical error of distPCA does not decrease even when $n$ is very large, implying that the method is not consistent for heterogeneous datasets. This result also sheds light on an important insight: simply learning global components and using them for personalization in a train-then-personalize philosophy is not optimal, as global components from aggregated data may not contain the information required for personalization.

7.1.3 Dependence of Statistical Error on d

We also examine the performance of PerPCA on data with different dimensions $d$.
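The deflation-based personalization step for distPCA can be sketched as follows (function name ours); the deflated covariance is projected away from the global subspace on both sides, so its top eigenvectors are automatically orthogonal to the global PCs:

```python
import numpy as np

def local_pcs_by_deflation(S, U_global, r2):
    """Personalized distPCA variant: deflate the client covariance by the
    global subspace, then take the deflated matrix's top-r2 eigenvectors."""
    d = S.shape[0]
    P = U_global @ U_global.T
    S_def = (np.eye(d) - P) @ S @ (np.eye(d) - P)
    vals, vecs = np.linalg.eigh(S_def)            # eigh returns ascending order
    top = np.argsort(vals)[::-1][:r2]             # indices of the r2 largest
    return vecs[:, top]
```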
We fix $n = 10000$ and generate data with different $d$. Other settings are the same as in Section 7.1.2. We calculate the subspace error of the estimates given by PerPCA and distPCA. Results are plotted in Figure 5.

Figure 5: Log error vs data dimension $d$.

From Figure 5, PerPCA again achieves a smaller statistical error for all $d$. Also, the error grows almost quadratically with $d$, which again matches the upper bound in Corollary 3.

7.1.4 Dependence of Statistical Error on N

Now, we explore how the number of clients $N$ affects the statistical error. We fix $d = 15$ and $n = 10000$, and change $N$ from 10 to 1000. The other settings are the same as in Section 7.1.2. After obtaining the global and local PCs, we calculate the subspace error (30) of both global and local PCs, as well as the subspace error of only the global PCs, $\|P_U - \Pi_g\|_F^2$. Results are plotted in Figure 6. Figure 6(a) shows that as $N$ increases, the average subspace error decreases slowly. The decreasing trend is more conspicuous for the subspace error of the global PCs, shown in Figure 6(b). This is understandable: when more clients participate in PerPCA, more observations are available, so the global PCs can be better estimated.

7.1.5 Shared knowledge

When the PCs on different clients are extremely heterogeneous, it is natural to ask whether clients are indeed sharing knowledge and learning from each other in PerPCA. Corollary 3 indicates, from a theoretical perspective, that clients benefit from participating in the collaborative learning process. In this section, we show numerically how the learned global components improve client-level predictions.

(a) Average subspace error of global and local PC estimates. (b) Subspace error of global PC estimates.

Figure 6: Left: the average of the local and global PCs' subspace error. Right: the global PCs' subspace error.

The dataset on client $i$ is split into a training set $Y_{(i),\text{train}}$ and a testing set $Y_{(i),\text{test}}$.
We use the training set $Y_{(i),\text{train}}$ to find estimates of the global and local PCs, and the testing set to calculate the testing error. We focus on the reconstruction error defined in (29). As in Section 7.1.2, we simulate two groups of clients with highly unbalanced dataset sizes. One group of clients has $n$ observations each; we call them data-rich clients. The other group has only $\frac{1}{10}n$ observations each; we call them data-sparse clients. We set $N = 100$ and $n = 100$.

In this experiment, we compare PerPCA with 3 benchmarks: indivPCA, CPCA, and distPCA. In indivPCA, each client uses its own data to calculate PCs independently, without any knowledge sharing. CPCA represents PCA on the pooled data from all clients, i.e., all data are uploaded to a central server and PCA is learned on the aggregated dataset. For fair comparison, we allow indivPCA and CPCA to retain $r_1 + r_2$ principal components. The testing reconstruction errors averaged over each group are shown in Table 2, where "Ground Truth" corresponds to the testing loss of the true PCs.

Client Group   indivPCA      CPCA          distPCA       PerPCA        Ground Truth
Data sparse    1.87 ± 0.01   2.07 ± 0.01   1.91 ± 0.01   1.68 ± 0.02   1.50 ± 0.01
Data rich      1.80 ± 0.01   2.10 ± 0.01   1.88 ± 0.01   1.52 ± 0.01   1.50 ± 0.01

Table 2: Testing reconstruction error averaged over each group.

From Table 2, it is clear that PerPCA achieves the smallest testing error in both the data-sparse and the data-rich groups, and thus the best predictive performance. As PerPCA outperforms indivPCA, we can conclude that PerPCA learns useful shared knowledge. The results highlight PerPCA's ability to extract common features from heterogeneous datasets. Also, CPCA exhibits the worst performance, which again highlights the need for personalized learning when data come from heterogeneous sources.

Figure 7: Simulations where we choose $\hat r_2 > r_2$.
7.1.6 Overparametrization

In practice, when the true ranks $r_1$ and $r_{2,(i)}$ are unknown, practitioners may choose the ranks of $U$ and $V_{(i)}$ to be larger than the ground truth. This is called the overparametrized regime (Zhuo et al., 2021). Overparametrization is a common technique for PCA and matrix factorization. Here we investigate the numerical performance of PerPCA in an overparametrized regime.

We use synthetic data to analyze the convergence behavior of PerPCA. We set $d = 30$ and $r_1 = 1$, $r_{2,(i)} = r_2 = 1$. Then we randomly generate data $\{Y_{(i)}\}$ for $N = 20$ clients and calculate the corresponding covariance matrices $\{S_{(i)}\}$. The data are generated without noise to better understand the convergence. We then run overparametrized PerPCA on the generated data. More specifically, in Algorithm 2 we choose orthonormal matrices $U \in \mathbb{R}^{d \times \hat r_1}$ and $V_{(i)} \in \mathbb{R}^{d \times \hat r_2}$ with ranks $\hat r_1 \geq r_1$ and $\hat r_2 \geq r_2$. Since, in practice, one may overparametrize the ranks of the global and local PCs differently, we study the two cases separately.

Case 1: $\hat r_2 > r_2$. We choose the rank of the local PCs $\hat r_2$ to be higher than the ground truth $r_2$, while keeping $\hat r_1 = r_1$. Then we run PerPCA from random initializations to obtain the iterates $U_\tau$ and $\{V_{(i),\tau}\}$ for different $\tau$. We analyze three metrics:

Global error: $\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_{(i)}}\big\|P_{U_\tau}Y_{(i)} - \Pi_g Y_{(i)}\big\|_F^2$

Local error: $\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_{(i)}}\big\|P_{V_{(i),\tau}}Y_{(i)} - \Pi_{(i)} Y_{(i)}\big\|_F^2$

Reconstruction error: $\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_{(i)}}\big\|Y_{(i)} - \big(P_{U_\tau} + P_{V_{(i),\tau}}\big)Y_{(i)}\big\|_F^2$

Up to a constant factor, the reconstruction error is upper bounded by the sum of the local error and the global error. We plot these metrics for different $\hat r_2$ in Figure 7. There are a few interesting observations in Figure 7. First, for all $\hat r_2$, the reconstruction errors decrease linearly. This is understandable, as using a larger rank $\hat r_2$ for the local features adds representation power to the model, thus helping model fitting.
As the covariance matrices are noiseless, the linear decrease of the reconstruction error is also consistent with the standard matrix factorization results in Zhuo et al. (2021). Second, when the local features $\{V_{(i)}\}$ are slightly overparametrized, $r_2 \leq \hat r_2 \leq r_2 + 3$, the global error and local error also decrease linearly. These results show that with slightly overparametrized $\{V_{(i)}\}$, one can still recover both the global features and the local ones. Third, when $\hat r_2$ is very large, $\hat r_2 \geq r_2 + 4$, the global error and local error decrease only sublinearly. In this highly overparametrized regime, although the reconstruction error still decreases to zero linearly, the learned global and local features do not converge to the ground truth equally fast.

Figure 8: Simulations where we choose $\hat r_1 > r_1$.

When the ground truth $\Pi_g$ and $\Pi_{(i)}$ are unknown, one cannot evaluate the local and global errors. Therefore, we propose the estimated misalignment value $\theta_{\text{est}}$ as a statistic indicative of the global-local separation:
$$\theta_{\text{est}} = 1 - \lambda_{\max}\Big(\frac{1}{N}\sum_{i=1}^{N} P_{\hat V_{(i)}}\Big)$$
where $\hat V_{(i)}$ is the recovered local PCs on client $i$. $\theta_{\text{est}}$ measures how different the local features are. We calculate $\theta_{\text{est}}$ for different ranks of $V_{(i)}$ and show the results in Table 3.

$\hat r_2$          $r_2$   $r_2+1$   $r_2+2$   $r_2+3$   $r_2+4$              $r_2+6$
$\theta_{\text{est}}$   0.90    0.69      0.36      0.19      $1.8\times10^{-6}$   $2.5\times10^{-9}$

Table 3: Misalignment value $\theta_{\text{est}}$ for different ranks of the matrices $V_{(i)}$.

From Table 3, when $\hat r_2$ increases from $r_2 + 3$ to $r_2 + 4$, $\theta_{\text{est}}$ decreases rapidly from 0.19 to almost 0. This change indicates that the local features become closely aligned when $\hat r_2 \geq r_2 + 4$, so the local features are no longer distinguishable. The abrupt change in $\theta_{\text{est}}$ echoes the results in Figure 7: when $\theta_{\text{est}}$ is small, local PCs are similar, and the separation between local and global PCs is not clear.

Case 2: $\hat r_1 > r_1$. Similarly, we choose the rank of the global PCs $\hat r_1$ to be higher than the ground truth $r_1$, while keeping $\hat r_2 = r_2$. The results, shown in Figure 8, demonstrate qualitatively different behavior than Figure 7.
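The statistic $\theta_{\text{est}}$ is a one-liner on projection matrices; a minimal sketch (function name ours, orthonormal columns assumed). It returns 0 when all local subspaces coincide and approaches $1 - \frac{1}{N}$ when they are mutually orthogonal:

```python
import numpy as np

def theta_est(V_hats):
    """theta_est = 1 - lambda_max( (1/N) * sum_i P_{V_i} ): near 0 when local
    subspaces align, larger when they are mutually distinct."""
    P_avg = np.mean([V @ V.T for V in V_hats], axis=0)
    return float(1.0 - np.linalg.eigvalsh(P_avg)[-1])   # eigvalsh is ascending
```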
Even when $\hat r_1$ is only slightly overparametrized, $\hat r_1 = r_1 + 1$, the global and local errors do not decrease linearly to 0; yet the fitting error decreases linearly in all cases. The comparison implies that when $U$ is overparametrized, the combined features $U$ and $\{V_{(i)}\}$ can still explain the data variance well, but may not exactly characterize the global and local features.

In light of these insights, if practitioners suspect heterogeneity among the data sources, we recommend carefully selecting small values of $r_1$ and $r_2$ that ensure a small reconstruction error while maintaining a large value of $\theta_{\text{est}}$.

7.1.7 Clustering based on local principal components

Apart from capturing the variance structure in the data, the learned local and global components can reveal high-level information about the clients' interrelatedness. Below we present an interesting application of PerPCA to client clustering. An important question in federated and distributed learning is how to cluster clients based on summary statistics of their data. This is usually done by exploiting distance metrics over the estimated parameters or gradients from each client (Sattler et al., 2019). PerPCA offers an alternative approach to client clustering based on local PCs. The intuition is that, by focusing on local PCs, differences across clients become more explicit than in the raw data. More specifically, when $r_{2,(1)} = \cdots = r_{2,(N)} = r_2$, one can calculate the subspace distance $\rho_{i,j}$ between clients $i$ and $j$, defined as
$$\rho_{i,j} = \big\|P_{\hat V_{(i)}} - P_{\hat V_{(j)}}\big\|_F$$
If the column spaces of $\hat V_{(i)}$ and $\hat V_{(j)}$ are more similar, $\rho_{i,j}$ is smaller. The $\rho_{i,j}$'s measure the closeness of the local subspaces, thus revealing a similarity structure among clients. They form an $N \times N$ matrix $\rho$, and simple spectral clustering (Hastie et al., 2009) on $\rho$ can be used to analyze the relations among different clients. As an example, we generate clients from 10 different client groups.
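The subspace-distance matrix $\rho$ can be built directly from the recovered local PCs; a minimal sketch (function name ours, orthonormal columns assumed). The resulting matrix can then be handed to any off-the-shelf spectral clustering or MDS routine:

```python
import numpy as np

def subspace_distance_matrix(V_hats):
    """rho[i, j] = || P_{V_i} - P_{V_j} ||_F, the pairwise distance
    between the clients' local subspaces."""
    Ps = [V @ V.T for V in V_hats]
    N = len(Ps)
    rho = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            rho[i, j] = np.linalg.norm(Ps[i] - Ps[j], 'fro')
    return rho
```

Clients with identical local subspaces get distance 0, while rank-1 orthogonal subspaces get distance $\sqrt{2}$, so same-group clients form tight blocks in $\rho$.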
Clients in one group share the same local PCs, while different groups have different local PCs; the data on clients within one group thus have a similar variance structure. We set $r_1 = 2$, $r_2 = 3$, and $d = 15$. We apply PerPCA and calculate the matrix $\rho(\tau)$ at each communication round $\tau$ (we omit the dependence of $\rho(\tau)$ on $\tau$ for simplicity). Then, we use multidimensional scaling (MDS) (Hastie et al., 2009) and spectral clustering on $\rho$. Results are shown in Figure 9. Since local PCs are randomly initialized, it is hard to find meaningful structure at initialization in Figure 9(a). However, after only one communication round, the true structure emerges in Figure 9(b). After 30 communication rounds, clients can be effectively clustered based on their learned local PCs.

(a) Initial local subspace. (b) After 1 communication round. (c) After 30 communication rounds.

Figure 9: MDS of the distance matrix $\rho$. Color denotes the output of the spectral clustering algorithm; numbers denote the true cluster labels.

7.2 An illustrative example in comparison to Robust PCA

The philosophy of finding common and unique features can be applied to tasks beyond explaining data variance. In this section, we use a simple example to demonstrate how PerPCA can separate shared and unique features in image data. We start by comparing PerPCA with Robust PCA. Though Robust PCA is designed to learn low-rank and sparse parts, it is also potentially useful for finding irregular and common patterns in a dataset. When data come from different sources $\{Y_{(i)}\}$ with an equal number of columns, $n_{(1)} = \cdots = n_{(N)} = n$, one can stack them into one matrix $Y_{\text{stack}} = \big[\operatorname{vec}\big(Y_{(1)}\big), \ldots, \operatorname{vec}\big(Y_{(N)}\big)\big]$. Robust PCA can then be applied to the stacked matrix $Y_{\text{stack}} \in \mathbb{R}^{nd \times N}$ to separate low-rank and sparse parts. The common wisdom is to use the low-rank part to represent shared patterns and the sparse part to represent irregular trends (Candes et al., 2011).
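The stacking step can be sketched in one line (function name ours), using the column-major convention for $\operatorname{vec}(\cdot)$:

```python
import numpy as np

def stack_clients(Ys):
    """Stack client matrices column-wise for the Robust PCA baseline:
    Y_stack = [vec(Y_1), ..., vec(Y_N)] in R^{(n*d) x N}."""
    return np.column_stack([Y.reshape(-1, order='F') for Y in Ys])
```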
The underlying assumption of such an approach is that the unique features are somewhat sparse across all datasets. However, there are cases where a sparse matrix cannot model the unique features. An example is shown in Table 4. We create 4 images of different icons (triangle, disk, cross, and cloud) on similar background textures using PowerPoint, and try to distinguish the icons from the background. As a greyscale image can naturally be represented by an observation matrix whose dimensions are its height and width, we can construct 4 datasets representing the 4 images. We then apply PerPCA and Robust PCA to identify the icons. From Table 4, it is apparent that Robust PCA does not perform well: it cannot recover the icons and always leaves shadows of the icons on other images, probably because the icons occupy a large portion of the image and thus cannot be modeled as sparse noise. PerPCA recovers the icons by projecting the images onto the subspaces spanned by the local PCs. The third row of Table 4 shows that PerPCA performs well, as the recovered icons have clear edges and shapes. This highlights the need for personalized inference in many applications where PCA is utilized.

Table 4: A comparison of PerPCA and Robust PCA on images of icons over background textures (rows: images 1-4; sparse parts recovered by Robust PCA; projections onto local PCs by PerPCA).

7.3 Real-life federated datasets

We also apply our algorithm to FEMNIST (Caldas et al., 2019) and CIFAR10 (Krizhevsky et al., 2009). FEMNIST is a popular dataset in federated analytics. It consists of greyscale images of handwritten digits and English letters contributed by 3550 different writers. Each image has $28 \times 28 = 784$ pixels. Different writers have different writing styles, so the datasets are inherently heterogeneous. Our task is to learn a few PCs that represent the dataset. On average, each client has 89 images. We represent each image by a vector in $\mathbb{R}^{784}$.
For these vectors, we randomly choose 80% to form the training set and take the rest as the test set. CIFAR10 is a multiclass image dataset consisting of 60000 images from 10 classes. To simulate a heterogeneous setting, we separate the training and testing sets of CIFAR10 into 20 parts such that each part contains images from only 2 classes, and assign the parts to 20 clients. This data partition scheme is consistent with the federated learning literature (McMahan et al., 2017). We then use similar preprocessing procedures to vectorize the images on each client and construct the datasets $\{Y_{(i)}\}$.

We use PerPCA, indivPCA, CPCA, and distPCA to fit PCs on the training sets. Then, we evaluate the reconstruction error (29) on both the training and test sets. The experiments are repeated 3 times to calculate the means and standard deviations. The results are shown in Table 5. As the reconstruction error measures the difference between the original and reconstructed images, it reflects how well the learned PCs characterize the features in the data.

Reconstruction error   indivPCA        CPCA            distPCA         PerPCA
FEMNIST Training       0.49 ± 0.01     1.72 ± 0.01     1.43 ± 0.01     1.44 ± 0.02
CIFAR10 Training       105.68 ± 0.01   114.79 ± 0.01   113.56 ± 0.01   113.69 ± 0.02
FEMNIST Testing        1.97 ± 0.03     1.73 ± 0.01     1.73 ± 0.01     1.70 ± 0.01
CIFAR10 Testing        120.79 ± 0.02   115.44 ± 0.02   115.43 ± 0.01   115.33 ± 0.02

Table 5: The means and standard deviations of the training and testing reconstruction errors on FEMNIST and CIFAR10.

In Table 5, indivPCA achieves the lowest training error but incurs a high testing error, suggesting that its learned PCs overfit the training sets. PerPCA has the lowest testing loss on both FEMNIST and CIFAR10, highlighting its ability to combine common knowledge with unique trends to find better features from data.

7.4 Other Applications

Besides the experiments in the previous sections, PerPCA can excel in various tasks that require separating shared and unique features.
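The label-skewed CIFAR10 partition can be sketched with the shard trick common in the federated learning literature (an assumption-level sketch of the McMahan-style split, not the paper's exact script; the function name is ours): sort indices by class, cut them into shards, and hand each client a fixed number of shards.

```python
import numpy as np

def partition_by_label(labels, n_clients=20, classes_per_client=2, seed=0):
    """Sort indices by label, split into n_clients * classes_per_client
    shards, and assign classes_per_client random shards to each client."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind='stable')                 # group by class
    shards = np.array_split(order, n_clients * classes_per_client)
    shard_ids = rng.permutation(len(shards))
    return [np.concatenate([shards[s] for s in
                            shard_ids[c * classes_per_client:
                                      (c + 1) * classes_per_client]])
            for c in range(n_clients)]
```

When every class count divides the shard size evenly, each shard is class-pure, so each client sees at most `classes_per_client` classes.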
In this section, we use video segmentation and topic extraction as two examples to show the applicability of PerPCA.

7.4.1 Video segmentation

The task of video segmentation is to separate moving parts (foreground) from stationary backgrounds in a video. For a video with $F$ frames, where each frame is an image of width $W$ and height $H$, we can model it as $F$ separate datasets: each dataset contains the data of one image frame, i.e., $H$ observations in $\mathbb{R}^W$. Therefore, we can naturally apply PerPCA to recover local and global PCs from the constructed datasets of all frames. Intuitively, the global PCs should capture the features shared across all frames, representing the stationary background, while the local PCs capture the features unique to each frame, corresponding to the moving parts. Hence, after obtaining the global and local PCs, we project the original picture onto the subspaces spanned by these components to extract the background and foreground segments. We use a surveillance video example from Vacavant et al. (2012). We set $r_1 = 50$ and $r_{2,(1)} = \cdots = r_{2,(N)} = 50$ and apply Algorithm 2 with choice 2. Some segmentation results are shown in Table 6: the backgrounds and moving parts are well separated by the global and local PCs, validating PerPCA's ability to find common and unique features in image datasets.

Table 6: Video segmentation (rows: sample frames 1-3; projections onto the global PC space; projections onto the local PC space).

7.4.2 Topic extraction

PerPCA is also useful for modeling changing topics in language datasets. As a demonstration, we analyze the presidential debate transcriptions from 1960 to 2020 (Asokan, 2022). The goal is to extract the key debating topics of each specific election year. The dataset contains 9135 dialogues in 46 debates from 13 election years, where one dialogue is the speech a speaker makes in the debate before another person speaks.
In Table 6, the moving cars are separated from the background in the video from Vacavant et al. (2012). After removing common English words such as "you", "I", "and", "at", and "that" from the text corpus, 5464 different words are used in the dataset. We model the dataset as a collection of 13 separate datasets, each containing all the dialogues of one election year. To construct the observation matrix $Y_{(i)}$, we first use one-hot encoding to map each English word to a vector in $\mathbb{R}^{5464}$. Then, we add all the vectors corresponding to the words that appear in one dialogue; the summed vector is one observation in $\mathbb{R}^{5464}$. $Y_{(i)}$ is formed by concatenating the observations corresponding to the dialogues of election year $i$. With the datasets constructed, we run PerPCA for 20 communication rounds to extract local PCs. We set $r_1 = 2$ and $r_{2,(1)} = \cdots = r_{2,(N)} = 2$.

Table 7: U.S. presidential debate key topics represented by local and global PCs.

Year   Top local principal component words
1960   peace, Castro, Africa, Kennedy, now, world, ...
1976   billion, Carter, Governor, Africa, Ford, people, world, ...
1980   coal, oil, money, energy, Social, Security, Reagan, ...
1984   Union, tax, Soviet, arms, leadership, proposal, ...
1988   drug, young, strong, build, future, enforcement, good, ...
1992   Bill, school, children, care, health, taxes, reform, plan, control, ...
1996   Clinton, Security, Medicare, budget, tax, Dole, Bob, ...
2000   school, public, plan, children, money, Social, Security, health, tax, ...
2004   wrong, plan, cost, free, Saddam, troops, Iraq, war, health, tax, ...
2008   nuclear, oil, troops, Iraq, Afghanistan, Pakistan, health, Iran, energy, ...
2012   million, small, business, China, Medicare, Romney, jobs, tax, ...
2016   Russia, Trump, Hillary, companies, taxes, Mosul, Iran, deal, ...
2020   Harris, Pence, Trump, down, Joe, Biden, jobs, Donald, health, ...
Common words: tax, country, States, make, world, money, people, cut, ...
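The construction of one observation per dialogue (a sum of one-hot word encodings, i.e. a word-count vector) can be sketched as follows; the function name and toy vocabulary are ours:

```python
import numpy as np

def dialogue_vectors(dialogues, vocab):
    """One column per dialogue: the sum of one-hot encodings of its words,
    a count vector in R^{|vocab|}. Words outside vocab (e.g. removed stop
    words) are dropped."""
    index = {w: k for k, w in enumerate(vocab)}
    Y = np.zeros((len(vocab), len(dialogues)))
    for j, text in enumerate(dialogues):
        for w in text.split():
            if w in index:
                Y[index[w], j] += 1.0
    return Y
```

Concatenating the columns for all dialogues of one election year yields that year's observation matrix $Y_{(i)}$.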
To show the key topics represented by the two local PCs, we find the words corresponding to the dimensions of each local and global PC with the top-20 largest absolute values. Table 7 contains the most informative keywords among the top-20 keywords obtained. From Table 7, one can identify different key debating topics for different years. In some years, the key topics concern public finance and domestic economic reform; in others, they concern international relations. These topics represent the central issues at specific times in history.

8. Conclusion

This work proposes PerPCA, a systematic approach to decoupling shared and unique features in heterogeneous datasets. We show that the problem is well formulated and that consistency can be guaranteed under mild conditions. A fully federated algorithm with a convergence guarantee is designed to efficiently obtain global and local PCs from noisy observations. Extensive simulations highlight PerPCA's ability to separate shared and unique features in various applications. As the first systematic approach to quantitatively decouple shared and unique features, we envision that PerPCA can find use across various downstream analytics tasks such as interpretability, clustering, classification, change detection, and transfer/federated learning. Within these areas, one can leverage unique knowledge so that differences become more explicit, and leverage shared knowledge to transfer useful information from one source to another. Exploration along these directions may be promising. In addition, within PerPCA itself there are several avenues for expansion. On the optimization front, it is promising to design algorithms that converge faster or require less computation, including Grassmannian gradient descent and adaptive-stepsize Stiefel gradient descent.
Further, extensions of PerPCA that consider missing data, large noise, sparse factors, or malicious intruders are important directions for future work.

9. Acknowledgements and Disclosure of Funding

We thank the anonymous reviewers for their constructive feedback. This research is supported by Raed Al Kontar's National Science Foundation (NSF) CAREER Grant 2144147.

Appendix A. Proof of Theorem 1

In this section, we show the proof of Theorem 1. We first use standard perturbation analysis on the eigenspaces of $\Sigma_{(i)}$ (Vu et al., 2013). By assumption, $\Pi_g$ and $\Pi_{(i)}$ are the projections onto the top eigenspaces of $\Sigma_{(i)}$; therefore, for any orthogonal projection matrices $P_{\hat U}$ and $P_{\hat V_{(i)}}$, we have:
$$\begin{aligned}
\big\langle \Sigma_{(i)},\, \Pi_g + \Pi_{(i)} - P_{\hat U} - P_{\hat V_{(i)}} \big\rangle
&= \big\langle \big(\Pi_g + \Pi_{(i)}\big)\Sigma_{(i)},\, I - P_{\hat U} - P_{\hat V_{(i)}} \big\rangle - \big\langle \big(I - \Pi_g - \Pi_{(i)}\big)\Sigma_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} \big\rangle \\
&\geq \lambda_{r_1 + r_{2,(i)}}\big(\big(\Pi_g + \Pi_{(i)}\big)\Sigma_{(i)}\big)\, \big\langle \Pi_g + \Pi_{(i)},\, I - P_{\hat U} - P_{\hat V_{(i)}} \big\rangle \\
&\qquad - \lambda_1\big(\big(I - \Pi_g - \Pi_{(i)}\big)\Sigma_{(i)}\big)\, \big\langle I - \Pi_g - \Pi_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} \big\rangle \\
&\geq \delta\, \Big[ r_1 + r_{2,(i)} - \big\langle \Pi_g + \Pi_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} \big\rangle \Big]
\end{aligned}$$
Summing both sides over $i$ from 1 to $N$, we have:
$$\sum_{i=1}^{N} \big\langle \Sigma_{(i)},\, \Pi_g + \Pi_{(i)} - P_{\hat U} - P_{\hat V_{(i)}} \big\rangle \geq \delta \sum_{i=1}^{N} \Big[ r_1 + r_{2,(i)} - \big\langle \Pi_g + \Pi_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} \big\rangle \Big] \tag{32}$$
Since $P_{\hat U}$ and $\{P_{\hat V_{(i)}}\}$ are the optimal solutions to (7), and $\Pi_g$ and $\{\Pi_{(i)}\}$ are feasible, we know that:
$$\sum_{i=1}^{N} \big\langle S_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} \big\rangle \geq \sum_{i=1}^{N} \big\langle S_{(i)},\, \Pi_g + \Pi_{(i)} \big\rangle \tag{33}$$
Combining (32) and (33), we obtain:
$$\sum_{i=1}^{N} \big\langle S_{(i)} - \Sigma_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} - \Pi_g - \Pi_{(i)} \big\rangle \geq \delta \sum_{i=1}^{N} \Big[ r_1 + r_{2,(i)} - \big\langle \Pi_g + \Pi_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} \big\rangle \Big]$$
We can use the Cauchy-Schwarz inequality to further bound the left-hand side:
$$\big\langle S_{(i)} - \Sigma_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} - \Pi_g - \Pi_{(i)} \big\rangle \leq \big\|S_{(i)} - \Sigma_{(i)}\big\|_F\, \big\|P_{\hat U} + P_{\hat V_{(i)}} - \Pi_g - \Pi_{(i)}\big\|_F$$
Notice that
$$\big\|P_{\hat U} + P_{\hat V_{(i)}} - \Pi_g - \Pi_{(i)}\big\|_F^2 = \big\|P_{\hat U} + P_{\hat V_{(i)}}\big\|_F^2 + \big\|\Pi_g + \Pi_{(i)}\big\|_F^2 - 2\big\langle P_{\hat U} + P_{\hat V_{(i)}},\, \Pi_g + \Pi_{(i)} \big\rangle = 2\Big[ r_1 + r_{2,(i)} - \big\langle P_{\hat U} + P_{\hat V_{(i)}},\, \Pi_g + \Pi_{(i)} \big\rangle \Big]$$
By another application of the Cauchy-Schwarz inequality (now over $i$), we thus have:
$$\sum_{i=1}^{N} \big\langle S_{(i)} - \Sigma_{(i)},\, P_{\hat U} + P_{\hat V_{(i)}} - \Pi_g - \Pi_{(i)} \big\rangle \leq \Big(\sum_{i=1}^{N} \big\|S_{(i)} - \Sigma_{(i)}\big\|_F^2\Big)^{1/2} \Big( 2\sum_{i=1}^{N} \Big[ r_1 + r_{2,(i)} - \big\langle P_{\hat U} + P_{\hat V_{(i)}},\, \Pi_g + \Pi_{(i)} \big\rangle \Big] \Big)^{1/2}$$
Finally:
$$\sum_{i=1}^{N} \Big[ r_1 + r_{2,(i)} - \big\langle P_{\hat U} + P_{\hat V_{(i)}},\, \Pi_g + \Pi_{(i)} \big\rangle \Big] \leq \frac{2}{\delta^2} \sum_{i=1}^{N} \big\|S_{(i)} - \Sigma_{(i)}\big\|_F^2 \tag{34}$$
The relation (34) slightly extends the standard result from matrix perturbation theory.
However, (34) only shows that the sum $P_{\hat U} + P_{\hat V_{(i)}}$ is close to the sum $\Pi_g + \Pi_{(i)}$; one cannot infer from it additional information about the closeness of $P_{\hat U}$ to $\Pi_g$, or of $P_{\hat V_{(i)}}$ to $\Pi_{(i)}$. In other words, (34) alone does not ensure that the recovered global and local PCs correspond to the true PCs. Such a guarantee is too weak in practice, where we want to know whether the solved $U$ and $V_{(i)}$'s are close to the ground truth. Fortunately, we can show that this is indeed the case if the problem satisfies the identifiability Assumption 4.2. An important ingredient is the following lemma, which indicates that closeness in the direct-sum space leads to closeness of each of the global and local subspaces.

Lemma 13 Suppose that for $i = 1, \ldots, N$, $(P_U, P_{V_{(i)}})$ and $(P'_U, P'_{V_{(i)}})$ are projection matrices satisfying $P_U P_{V_{(i)}} = 0$ and $P'_U P'_{V_{(i)}} = 0$ for each $i$. Among them, $P_U$ and $P'_U$ have rank $r_1$, while $P_{V_{(i)}}$ and $P'_{V_{(i)}}$ have rank $r_{2,(i)}$. If there exists a positive constant $\theta > 0$ such that
$$\lambda_{\max}\Big(\frac{1}{N}\sum_{i=1}^{N} P'_{V_{(i)}}\Big) \leq 1 - \theta$$
then we have the following bounds:
$$\sum_{i=1}^{N} \Big[ r_1 + r_{2,(i)} - \big\langle P_U + P_{V_{(i)}},\, P'_U + P'_{V_{(i)}} \big\rangle \Big] \leq N\big(r_1 - \langle P'_U, P_U \rangle\big) + \sum_{i=1}^{N} \Big( r_{2,(i)} - \big\langle P'_{V_{(i)}}, P_{V_{(i)}} \big\rangle \Big)$$
$$\sum_{i=1}^{N} \Big[ r_1 + r_{2,(i)} - \big\langle P_U + P_{V_{(i)}},\, P'_U + P'_{V_{(i)}} \big\rangle \Big] \geq \frac{\theta}{2} \Big( N\big(r_1 - \langle P'_U, P_U \rangle\big) + \sum_{i=1}^{N} \Big( r_{2,(i)} - \big\langle P'_{V_{(i)}}, P_{V_{(i)}} \big\rangle \Big) \Big) \tag{36}$$
The proof of Lemma 13 is at the end of Section G. By applying inequality (36) to (34), we can prove the desired error bound in Theorem 1.

Appendix B. KKT conditions

We now derive the KKT conditions (24). The Lagrangian of the objective (7) is:
$$\mathcal{L} = \sum_{i=1}^{N} \Big[ -\frac{1}{2}\mathrm{Tr}\big(U^\top S_{(i)} U\big) - \frac{1}{2}\mathrm{Tr}\big(V_{(i)}^\top S_{(i)} V_{(i)}\big) + \big\langle \Lambda_{2,(i)},\, V_{(i)}^\top V_{(i)} - I \big\rangle + \big\langle \Lambda_{3,(i)},\, U^\top V_{(i)} \big\rangle \Big] + \big\langle \Lambda_1,\, U^\top U - I \big\rangle$$
where $\Lambda_1 \in \mathbb{R}^{r_1 \times r_1}$, $\Lambda_{2,(i)} \in \mathbb{R}^{r_{2,(i)} \times r_{2,(i)}}$, and $\Lambda_{3,(i)} \in \mathbb{R}^{r_1 \times r_{2,(i)}}$ are dual variables. The KKT conditions are:
$$\begin{aligned}
-S_{(i)} V_{(i)} + V_{(i)}\big(\Lambda_{2,(i)} + \Lambda_{2,(i)}^\top\big) + U \Lambda_{3,(i)} &= 0 \\
-\sum_{i=1}^{N} \Big[ S_{(i)} U - V_{(i)} \Lambda_{3,(i)}^\top \Big] + U\big(\Lambda_1 + \Lambda_1^\top\big) &= 0 \\
U^\top U = I_{r_1}, \quad V_{(i)}^\top V_{(i)} = I_{r_{2,(i)}}, \quad U^\top V_{(i)} &= 0
\end{aligned} \tag{38}$$
Left-multiplying the first equation in (38) by $I - P_U - P_{V_{(i)}}$ gives $\big(I - P_U - P_{V_{(i)}}\big) S_{(i)} V_{(i)} = 0$, which is the first equation in (24). Left-multiplying the first equation in (38) by $U^\top$ gives $\Lambda_{3,(i)} = U^\top S_{(i)} V_{(i)}$.
Plugging this into the second equation in (38), we have $\sum_{i=1}^{N} \big[ S_{(i)} U - P_{V_{(i)}} S_{(i)} U \big] = U\big(\Lambda_1 + \Lambda_1^\top\big)$. Left-multiplying both sides by $I - P_U$ then yields the second equation in (24). One can also infer (38) from (24).

Appendix C. Proof of Theorem 2

Inspired by Birnbaum et al. (2013), in this section we use a spiked population model to demonstrate the lower bound. We first use matrix perturbation analysis to estimate the leading-order term of the estimation error of the global PCs. Then, we verify our results through numerical experiments.

Proof To prove Theorem 2, it suffices to find one set of parameters under which the statistical error is indeed $\Omega\big(\frac{1}{\delta^2}\big)$. For simplicity, we consider $N = 2$ and $r_1 = r_2 = 1$, i.e., each client is driven by one global feature and one local feature. We define the signal vectors $w_{1,1}, w_{1,2}, w_{2,1}, w_{2,2} \in \mathbb{R}^4$ and a noise vector $w_3 \in \mathbb{R}^4$ as
$$\begin{aligned}
w_{1,1} &= (\cos\gamma\sin\alpha,\ \sin\alpha\sin\gamma,\ \cos\alpha,\ 0)^\top \\
w_{1,2} &= (\cos\gamma\cos\alpha,\ \cos\alpha\sin\gamma,\ -\sin\alpha,\ 0)^\top \\
w_{2,1} &= (-\cos\gamma\sin\alpha,\ \sin\alpha\sin\gamma,\ \cos\alpha,\ 0)^\top \\
w_{2,2} &= (-\cos\gamma\cos\alpha,\ \cos\alpha\sin\gamma,\ -\sin\alpha,\ 0)^\top \\
w_3 &= (0,\ 0,\ 0,\ 1)^\top
\end{aligned}$$
Then, we define the population covariance matrices as
$$\Sigma_1 = 2 w_{1,1} w_{1,1}^\top + w_{1,2} w_{1,2}^\top + \varrho\, w_3 w_3^\top, \qquad \Sigma_2 = 2 w_{2,1} w_{2,1}^\top + w_{2,2} w_{2,2}^\top + \varrho\, w_3 w_3^\top \tag{40}$$
where $\varrho < 1$ is a constant. In (40), $2 w_{1,1} w_{1,1}^\top + w_{1,2} w_{1,2}^\top$ denotes the signal part of $\Sigma_1$ and $\varrho\, w_3 w_3^\top$ denotes the noise part. Clearly, the model (40) satisfies Assumption 4.1 in the main paper, with eigengap $\delta = 1 - \varrho$. It is easy to check that if we run PerPCA directly on the population covariance matrices $\{\Sigma_1, \Sigma_2\}$ defined in (40), the algorithm recovers the optimal global PC
$$u = (0,\ 0,\ 1,\ 0)^\top$$
and the optimal local PCs
$$v_1 = (\cos\gamma,\ \sin\gamma,\ 0,\ 0)^\top, \qquad v_2 = (-\cos\gamma,\ \sin\gamma,\ 0,\ 0)^\top \tag{41}$$
It is also easy to see that the misalignment parameter is $\theta = \sin^2\gamma$ when $0 \leq \gamma \leq \frac{\pi}{4}$.
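The claim $\theta = \sin^2\gamma$ can be checked in one line from the construction, using the misalignment definition $\theta = 1 - \lambda_{\max}\big(\frac{1}{N}\sum_i P_{v_i}\big)$ introduced with $\theta_{\text{est}}$ (a short verification, assuming that definition):

```latex
v_1^\top v_2 = -\cos^2\gamma + \sin^2\gamma = -\cos 2\gamma,
\qquad
\lambda_{\max}\!\big(P_{v_1} + P_{v_2}\big) = 1 + \lvert v_1^\top v_2\rvert,
\]
\[
\theta
= 1 - \lambda_{\max}\!\Big(\tfrac{1}{2}\big(P_{v_1} + P_{v_2}\big)\Big)
= \tfrac{1}{2}\big(1 - \lvert\cos 2\gamma\rvert\big)
= \sin^2\gamma
\quad \text{for } 0 \le \gamma \le \tfrac{\pi}{4},
```

where the last step uses the half-angle identity $\frac{1}{2}(1 - \cos 2\gamma) = \sin^2\gamma$ and $\cos 2\gamma \geq 0$ on $[0, \frac{\pi}{4}]$.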
We further introduce $v'_1$ and $v'_2$ as
$$v'_1 = (-\sin\gamma,\ \cos\gamma,\ 0,\ 0)^\top, \qquad v'_2 = (\sin\gamma,\ \cos\gamma,\ 0,\ 0)^\top$$
Now we consider the sample covariance matrices $S_1$ and $S_2$. For simplicity, we assume they are only slightly perturbed from the population covariance matrices: $S_1 = \Sigma_1 + \varepsilon\, \delta S_1$ and $S_2 = \Sigma_2 + \varepsilon\, \delta S_2$, where $\varepsilon \ll 1$ is a small number and $\delta S_1$, $\delta S_2$ are defined as
$$\begin{aligned}
\delta S_1 &= v_1 w_3^\top + w_3 v_1^\top + v_1 u^\top + u v_1^\top + v_1 v_2'^\top + v'_2 v_1^\top + w_3 u^\top + u w_3^\top \\
\delta S_2 &= v_2 w_3^\top + w_3 v_2^\top + v_2 u^\top + u v_2^\top + v_2 v_1'^\top + v'_1 v_2^\top
\end{aligned} \tag{42}$$
i.e., there are small perturbations in the sample covariance matrices; $\delta S_1$ and $\delta S_2$ model the small differences between the sample and population covariance matrices. It is easy to calculate that $\|S_1 - \Sigma_1\|_F^2 + \|S_2 - \Sigma_2\|_F^2 = O(\varepsilon^2)$.

We can run PerPCA on the sample covariance matrices $S_1$ and $S_2$, and denote the optimal global and local PCs by $\hat u$ and $(\hat v_1, \hat v_2)$. Clearly, $\hat u$ and $(\hat v_1, \hat v_2)$ are functions of $\varepsilon$; as $\varepsilon$ goes to zero, the sample covariances become the population covariances, and $(\hat u, \hat v_1, \hat v_2)$ become $(u, v_1, v_2)$. To estimate $(\hat u, \hat v_1, \hat v_2)$ for nonzero $\varepsilon$, we use the KKT conditions to analyze how $(\hat u, \hat v_1, \hat v_2)$ change with respect to $\varepsilon$. Recall that the KKT conditions (38) read:
$$\begin{aligned}
S_1 \hat v_1 &= \hat v_1 \lambda_{21} + \hat u \hat u^\top S_1 \hat v_1 \\
S_2 \hat v_2 &= \hat v_2 \lambda_{22} + \hat u \hat u^\top S_2 \hat v_2 \\
(S_1 + S_2)\, \hat u &= \hat u \lambda_1 + \hat v_1 \hat v_1^\top S_1 \hat u + \hat v_2 \hat v_2^\top S_2 \hat u
\end{aligned} \tag{43}$$
Since $\Sigma_1$ and $\Sigma_2$ do not have duplicate eigenvalues, it follows from Greenbaum et al. (2020) that $(\hat u, \hat v_1, \hat v_2)$ and $(\lambda_1, \lambda_{21}, \lambda_{22})$ are analytic functions of $\varepsilon$ when $\varepsilon$ is small. We can thus write the Taylor series expansions
$$\begin{aligned}
\hat u(\varepsilon) &= u + \varepsilon u^{(1)} + \varepsilon^2 u^{(2)} + \cdots &
\lambda_1(\varepsilon) &= \lambda_1^{(0)} + \varepsilon \lambda_1^{(1)} + \varepsilon^2 \lambda_1^{(2)} + \cdots \\
\hat v_1(\varepsilon) &= v_1 + \varepsilon v_1^{(1)} + \varepsilon^2 v_1^{(2)} + \cdots &
\lambda_{21}(\varepsilon) &= \lambda_{21}^{(0)} + \varepsilon \lambda_{21}^{(1)} + \varepsilon^2 \lambda_{21}^{(2)} + \cdots \\
\hat v_2(\varepsilon) &= v_2 + \varepsilon v_2^{(1)} + \varepsilon^2 v_2^{(2)} + \cdots &
\lambda_{22}(\varepsilon) &= \lambda_{22}^{(0)} + \varepsilon \lambda_{22}^{(1)} + \varepsilon^2 \lambda_{22}^{(2)} + \cdots
\end{aligned} \tag{44}$$
where $u^{(1)}$ is the first-order coefficient and $u^{(2)}$ the second-order coefficient of the expansion of $\hat u(\varepsilon)$, and similar notations are used for the other variables.
Then we can take the expansion (44) into the KKT conditions (43) and match the O(ε) terms on both sides, δS1v1 + Σ1v(1) 1 = v1λ(1) 21 + v(1) 1 λ(0) 21 + v(1) 1 v T 1 Σ1u + v1v(1) 1 T Σ1u + v1v T 1 δS1u + v1v T 1 Σ1u(1) δS2v2 + Σ2v(1) 2 = v2λ(1) 22 + v(1) 2 λ(0) 22 + v(1) 2 v T 2 Σ2u + v2v(1) 2 T Σ2u + v2v T 2 δS2u + v2v T 2 Σ2u(1) (δS1 + δS2) u + (Σ1 + Σ2) u(1) = uλ(1) 1 + u(1)λ(0) 1 + u(1)u T Σ1v1 + uu(1)T Σ1v1 + uu T δS1v1 + uu T Σ1v(1) 1 + u(1)u T Σ2v2 + uu(1)T Σ2v2 + uu T δS2v2 + uu T Σ2v(1) 2 (45) To solve equation (45), we can expand u(1), v(1) 1 , and v(1) 2 over a basis, u(1) = ϕ00u + ϕ01v1 + ϕ02v 2 + ϕ03w3 v(1) 1 = ϕ10u + ϕ11v1 + ϕ12v 2 + ϕ13w3 v(1) 2 = ϕ20u + ϕ21v 1 + ϕ22v2 + ϕ23w3 Since ˆu = ˆv1 = ˆv2 = 1, we know that ϕ00 = ϕ11 = ϕ22 = 0. Then we can take (46) into (45), and solve ϕ s as, 4 sin(2α) cot(γ) 2 sin(α) cos(α) ϕ03 = 2 sin(2α) + cos(2α) + 2ϱ 3 4 (ϱ2 3ϱ + 2) 4 sin(2α) cot(γ) 4(cos(2α) + 3) ϕ13 = sin(2α) 2 cos(2α) + 4ϱ 6 4 (ϱ2 3ϱ + 2) 4 sin(2α) cot(γ) 4(cos(2α) + 3) ϕ23 = sin(2α) 2 cos(2α) + 4ϱ 6 4 (ϱ2 3ϱ + 2) Personalized PCA Now, we obtained the closed-form formula for the first-order perturbation of global and local PCs. It is straightforward to calculate that ˆuˆu T uu T 2 F u + εu(1) + ε2u(2) + u + εu(1) + ε2u(2) + T uu T = ε22 u(1) 2 + O ε3 3 csc2(γ) + (1 ϱ)2 (2 ϱ)2 Since we know that θ = sin2(γ) and δ = 1 ϱ when γ π 4 , we have, ˆuˆu T uu T 2 F = ε2 δ2 (δ + 1)2 + O(ε3) (49) When ε is small, the higher order term O ε3 can be neglected. Thus the error in (49) can be further simplified to Ω ε2 1 δ2 when θ and δ are small. This completes our proof. We also verify the predicted error (49) via numerical simulations. In the simulations, we run Per PCA on the sample covariance matrices S1 and S2 to obtain the global PC ˆu. Then we use ˆu to calculate the subspace error ˆuˆu T uu T . This is the actual statistical error for the estimates from Per PCA. We compare it with the predicted values in (49) under different parameter values of θ and δ. 
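A miniature version of this experiment, using plain single-source PCA on a 2×2 spiked model rather than the full two-client PerPCA construction (the matrices below are illustrative stand-ins), already exhibits the predicted Ω(ε²/δ²) scaling of the squared subspace error:

```python
import numpy as np

def proj_err(delta, eps):
    # Spiked 2x2 model with eigengap delta, perturbed at strength eps.
    Sigma = np.diag([1.0, 1.0 - delta])
    E = np.array([[0.0, 1.0], [1.0, 0.0]])
    u = np.array([1.0, 0.0])                      # population top PC
    _, V = np.linalg.eigh(Sigma + eps * E)
    u_hat = V[:, -1]                               # sample top PC
    return np.linalg.norm(np.outer(u_hat, u_hat) - np.outer(u, u)) ** 2

eps = 1e-5
for delta in (0.4, 0.2, 0.1):
    print(delta, proj_err(delta, eps) / eps**2)   # approximately 2 / delta**2
```

Halving δ quadruples the rescaled error, mirroring the 1/δ² dependence verified at scale in Figures 10 and 11.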
Results are shown in Figures 10 and 11.

Figure 10: The (rescaled) global PC error ‖ûû^T − uu^T‖²_F /ε² under different eigengaps δ.

Figure 11: The (rescaled) global PC error ‖ûû^T − uu^T‖²_F /ε² under different misalignment parameters θ.

Figures 10 and 11 demonstrate a good match between the predicted statistical error and the actual statistical error. The two curves vividly show that the global subspace error grows linearly with 1/θ and 1/δ².

Appendix D. Proof of Theorem 8

Now we analyze the global convergence of Algorithm 2. We begin by calculating the derivatives of L_(i),1 and L_(i),2. The derivative of L_(i),1 with respect to U is:

∇_U L_(i),1(U, V) = (I − P_V) S_(i) (I − P_V) U     (50)

When V^T U = 0, this reduces to:

∇_U L_(i),1(U, V) = (I − P_V) S_(i) U

And the derivative of L_(i),1 with respect to V is:

∇_V L_(i),1(U, V) = P_U S_(i) V + S_(i) P_U V − P_U P_V S_(i) V − S_(i) P_V P_U V     (51)

When V^T U = 0, this reduces to:

∇_V L_(i),1(U, V) = P_U S_(i) V

Similarly, the derivative of L_(i),2 with respect to V is:

∇_V L_(i),2(U, V) = S_(i) V     (52)

The following lemma shows that the gradients of the functions we introduced are Lipschitz continuous.

Lemma 14 When ‖U‖_op and ‖V‖_op are upper bounded by 1, the gradients of L_(i),1 + L_(i),2 are Lipschitz continuous with constant L. More formally, for any U_1, U_2, V_1, V_2 ∈ R^{d×r} such that ‖U_1‖_op, ‖U_2‖_op, ‖V_1‖_op, ‖V_2‖_op ≤ 1, we have:

‖[∇_U L_(i),1(U_2, V_2) − ∇_U L_(i),1(U_1, V_1), ∇_V L_(i),1(U_2, V_2) + ∇_V L_(i),2(U_2, V_2) − ∇_V L_(i),1(U_1, V_1) − ∇_V L_(i),2(U_1, V_1)]‖_F ≤ L √(‖U_1 − U_2‖²_F + ‖V_1 − V_2‖²_F)

where L = 9√2 G_(i),op.     (54)

Proof First, we calculate the difference in the gradient with respect to U:

‖∇_U L_(i),1(U_2, V_2) − ∇_U L_(i),1(U_1, V_1)‖_F
= ‖(I − V_1V_1^T) S_(i) U_1 − (I − V_2V_2^T) S_(i) U_2‖_F
≤ ‖(I − V_1V_1^T) S_(i) U_1 − (I − V_2V_2^T) S_(i) U_1‖_F + ‖(I − V_2V_2^T) S_(i) U_1 − (I − V_2V_2^T) S_(i) U_2‖_F
≤ ‖V_1V_1^T − V_2V_2^T‖_F G_(i),op + ‖U_1 − U_2‖_F G_(i),op
≤ 2‖V_1 − V_2‖_F G_(i),op + ‖U_1 − U_2‖_F G_(i),op

where we used the triangle inequality for the Frobenius norm in the first inequality, and Lemma 23 in the second and third inequalities.
Next, we calculate the difference in the gradient of V : V L(i),1(U2, V2) + V L(i),2(U2, V2) V L(i),1(U1, V1) V L(i),2(U1, V1) F (PU2 I) S(i)V2 (PU1 I) S(i)V1 F + S(i)PU2V2 S(i)PU2V2 + PU2PV2S(i)V2 PU1PV1S(i)V1 F + S(i)PV2PU2V2 S(i)PV1PU1V1 F 7 V1 V2 F G(i),op + 6 U1 U2 F G(i),op Summing them up, we know: h UL(i),1(U2, V2) UL(i),1(U1, V1), V L(i),1(U2, V2) + V L(i),2(U2, V2) V L(i),1(U1, V1) V L(i),2(U1, V1) i F UL(i),1(U2, V2) UL(i),1(U1, V1) F + V L(i),1(U2, V2) + V L(i),2(U2, V2) V L(i),1(U1, V1) V L(i),2(U1, V1) F 9 V1 V2 F G(i),op + 7 U1 U2 F G(i),op U1 U2 2 F + V1 V2 2 F We thus complete the proof. Now we introduce some notations: I PUτ PV(i),τ S(i)Uτ (56) It is easy to verify Uτ TUτ when U T V(i),τ = 0: U T τ Uτ = 0 The Frobenius norm of Uτ is upper bounded by: I PUτ PV(i),τ S(i)Uτ I PUτ PV(i),τ op S(i) op Uτ F i=1 Gmax,op r = Gmax,op r where we applied Lemma 23 for the first inequality. Shi and Kontar By the client update rule, we know that: U(i),τ+1 = Uτ + ητ I PUτ PV(i),τ S(i)Uτ (58) Therefore, after the server takes the average of U(i),τ+1 and performs a generalized retraction, the following holds: Uτ+1 = Uτ + ητ Uτ + η2 τe1,τ (59) where e1,τ is an error term defined as: η2τ (Uτ+1 Uτ ητ Uτ) By definition of a generalized retraction, since Uτ is in the tangent space of Uτ, we have: e1,τ F M1 Uτ 2 F where we applied the condition ητ M3 r Gmax,op thus ητ Uτ F M3. Remember that M3 is a numerical constant in the definition of generalized retraction (Definition 5). Similarly, we define: V(i),τ = I PUτ PV(i),τ S(i)V(i),τ (60) Also by Lemma 23, the Frobenius norm of V(i),τ is upper bounded by: = I PUτ PV(i),τ S(i)V(i),τ F I PUτ PV(i),τ op S(i) op V(i),τ F Now we calculate the update of V(i),τ in one communication round. We summarize the result in the following lemma. 
Lemma 15 If we choose the stepsize ητ min M3 6+12M1+M2 1 1 Gmax,op r, the update of V(i) is given by: V(i),τ+1 = V(i),τ + ητ V(i),τ ητUτ U T τ V(i),τ + η2 τe5,(i),τ where e5,(i),τ is an error term that satisfies: e5,(i),τ C5,0 Uτ 2 + C5,1 V(i),τ 2 where C5,0 and C5,1 are two constants that only depend on M1 and M2 from the generalized retraction Definition 5. Personalized PCA Proof We first calculate the projection: Uτ+1U T τ+1 = Uτ + ητ Uτ + η2 τe1,τ Uτ + ητ Uτ + η2 τe1,τ T = UτU T τ + ητ Uτ U T τ + UτU T τ + η2 τe2,(i),τ where e2,(i),τ is defined as: e2,(i),τ = Uτ U T τ +Uτe T 1,(i),τ +e1,(i),τU T τ +ητ Uτe T 1,(i),τ +ητe1,(i),τ U T τ +η2 τe1,(i),τe T 1,(i),τ Its norm is upper bounded by: e2,(i),τ F Uτ 2 F + 2 Uτ op e1,(i),τ F + 2ητ Uτ F e1,(i),τ F + η2 τ e1,(i),τ 2 F Uτ 2 F + 2M1 Uτ 2 F + 2ητM1 Uτ 3 F + η2 τM2 1 Uτ 4 F (1 + 3M1 + 1 4M2 1 ) Uτ 2 F where the final inequality comes from upper bound (57) and the choice of stepsize ητ: ητ 1 Gmax,op r 1 6+12M1+M2 1 1 2Gmax,op r. Similarly, we define e3,(i),τ as: e3,(i),τ = 1 2 V(i),τ ητ V(i),τ By definition of a retraction, the norm of e3,(i),τ is upper bounded by: e3,(i),τ F M1 V(i),τ 2 F Uτ+1U T τ+1V(i),τ+ 1 2 = UτU T τ V(i),τ+ 1 2 + ητ Uτ U T τ + UτU T τ V(i),τ+ 1 2 + η2 τe2,(i),τV(i),τ+ 1 2 = UτU T τ V(i),τ + ητ V(i),τ + η2 τe3,(i),τ + ητ Uτ U T τ + UτU T τ V(i),τ + ητ V(i),τ + η2 τe3,(i),τ + η2 τe2,(i),τV(i),τ+ 1 2 = ητUτ U T τ V(i),τ + η2 τUτU T τ e3,(i),τ + η2 τe2,(i),τV(i),τ+ 1 2 + η2 τUτ U T τ V(i),τ + η3 τ Uτ U T τ + UτU T τ e3,(i),τ = ητUτ U T τ V(i),τ + η2 τe4,(i),τ where we use e4,(i),τ to denote: e4,(i),τ = UτU T τ e3,(i),τ + e2,(i),τV(i),τ+ 1 2 + Uτ U T τ V(i),τ + ητ Uτ U T τ + UτU T τ e3,(i),τ Shi and Kontar Its norm is upper bounded as: e4,(i),τ F e3,(i),τ F + e2,(i),τ F + Uτ F V(i),τ F + ητ Uτ F + V(i),τ F C4,0 Uτ 2 F + C4,1 V(i),τ 2 F where C4,0 = 3 2 + 3M1 + 1 and C4,1 = 1 Thus we know when ητ min{M3 M3 4C4,0 } 1 Gmax,op r, Uτ+1U T τ+1V(i),τ+ 1 Next we calculate the projection PNV(i),τ+ 1 Uτ+1U T 
τ+1V(i),τ+ 1 PNV(i),τ+ 1 Uτ+1U T τ+1V(i),τ+ 1 = V(i),τ+ 1 V T (i),τ+ 1 2 Uτ+1U T τ+1V(i),τ+ 1 = η2 τV(i),τ+ 1 2 V T (i),τ+ 1 We use e5,(i),τ to denote the difference between GRV(i),τ+ 1 Uτ+1U T τ+1V(i),τ+ 1 and V(i),τ + ητ V(i),τ ητUτ U T τ V(i),τ, then its norm is upper bounded by: η2 τ e5,(i),τ F = GRV(i),τ+ 1 Uτ+1U T τ+1V(i),τ+ 1 V(i),τ ητ V(i),τ + ητUτ U T τ V(i),τ F GRV(i),τ+ 1 Uτ+1U T τ+1V(i),τ+ 1 2 + Uτ+1U T τ+1V(i),τ+ 1 + V(i),τ+ 1 2 Uτ+1U T τ+1V(i),τ+ 1 2 V(i),τ ητ V(i),τ + ητUτ U T τ V(i),τ F By property 2 of the generalized retraction in Definition 5, we have: GRV(i),τ+ 1 Uτ+1U T τ+1V(i),τ+ 1 2 Uτ+1U T τ+1V(i),τ+ 1 PTV(i),τ+ 1 Uτ+1U T τ+1V(i),τ+ 1 F + (M2 + 1) PNV(i),τ+ 1 Uτ+1U T τ+1V(i),τ+ 1 M1 Uτ+1U T τ+1V(i),τ+ 1 F + (M2 + 1)η2 τ V(i),τ+ 1 2 V T (i),τ+ 1 2 e4,(i),τ F = M1 ητUτ U T τ V(i),τ + η2 τe4,(i),τ 2 F + (M2 + 1)η2 τ V(i),τ+ 1 2 V T (i),τ+ 1 2 e4,(i),τ F 2M1η2 τ Uτ U T τ V(i),τ 2 F + η4 τ e4,(i),τ 2 F + (M2 + 1)η2 τ V(i),τ+ 1 2 V T (i),τ+ 1 2 e4,(i),τ F η2 τ Uτ 2 F (2M1(C4,0 + C4,1)C4,0 + 2M1 + M2C4,0) + η2 τ V(i),τ 2 F (2M1(C4,0 + C4,1)C4,1 + M2C4,1) Personalized PCA For the second part: 2 Uτ+1U T τ+1V(i),τ+ 1 2 V(i),τ ητ V(i),τ + ητUτ U T τ V(i),τ F = η2 τ e3,(n),τ e4,(n),τ F η2 τ e3,(n),τ F + η2 τ e4,(n),τ F Therefore, the norm of e5,(i),τ is upper bounded as: e5,(i),τ F Uτ 2 F (2M1(C4,0 + C4,1)C4,0 + 2M1 + M2C4,0 + C4,0) + V(i),τ 2 F (2M1(C4,0 + C4,1)C4,1 + M2C4,1 + M1 + C4,1) This completes our proof, with 8 (12 (M2 + 1) + M1 (24M2 + M1 (M1 (M1 (M1 + 32) + 254) + 2 (M2 + 109)) + 88)) C5,1 = M4 1 + 81M3 1 4 + 13M2 1 + (2M2 + 5) M1 + 1 The following lemma shows the sufficient decrease property: Lemma 16 (Formal version of Lemma 9) When we choose the stepsize ητ 1 Gmax,op r min M3 6+12M1+M2 1 , and Uτ and V(i),τ satisfy the orthogonality condition U T τ V(i),τ = 0, we have: i=1 UL(i),1(Uτ, V(i),τ), Uτ+1 Uτ D V(i)L(i),1(Uτ, V(i),τ) + V(i)L(i),2(Uτ, V(i),τ), V(i),τ+1 V(i),τ E ητN Uτ 2 F ητ V(i),τ 2 F + η2 τ C6,0N Uτ 2 F + C6,1 where C6,0 and 
C6,1 are constants dependent only on M1, M2, r, and Gmax,op: C6,0 = Gmax,op r(M1 + C5,0) and C6,1 = Gmax,op r C5,1 Shi and Kontar Proof We firstly calculate the sufficient decrease of U: i=1 UL(i),1(Uτ, V(i),τ), Uτ+1 Uτ I PV(i),τ S(i)Uτ, Uτ+1 Uτ I PV(i),τ S(i)Uτ, ητ I PV(i),τ PUτ S(i)Uτ + η2 τe1,τ I PV(i),τ S(i)Uτ, ητ I PV(i),τ PUτ S(i)Uτ I PV(i),τ S(i)Uτ, η2 τe1,τ I PV(i),τ PUτ S(i)Uτ, ητ I PV(i),τ PUτ S(i)Uτ I PV(i),τ S(i)Uτ, η2 τe1,τ ητN Uτ 2 F + η2 τ e1,τ F I PV(i),τ S(i)Uτ F ητN Uτ 2 F + M1η2 τ Uτ 2 F i=1 G(i),op r Next, we calculate: D V(i)L(i),1(Uτ, V(i),τ) + V(i)L(i),2(Uτ, V(i),τ), V(i),τ+1 V(i),τ E = (I PUτ ) S(i)V(i),τ, ητ V(i),τ + ητUτ U T τ V(i),τ + η2 τe5,(i),τ = (I PUτ ) S(i)V(i),τ, ητ V(i),τ + (I PUτ ) S(i)V(i),τ, ητUτ U T τ V(i),τ + (I PUτ ) S(i)V(i),τ, η2 τe5,(i),τ = D I PUτ PV(i),τ S(i)V(i),τ, ητ V(i),τ E + (I PUτ ) S(i)V(i),τ, η2 τe5,(i),τ ητ V(i),τ 2 F + η2 τ (I PUτ ) S(i)V(i),τ F e5,(i),τ F ητ V(i),τ 2 F + η2 τG(i),op r C5,0 Uτ 2 F + C5,1 V(i),τ 2 Personalized PCA Adding them, we have: * N X i=1 UL(i),1(Uτ, V(i),τ), Uτ+1 Uτ + D V(i)L(i),1(Uτ, V(i),τ) + V(i)L(i),2(Uτ, V(i),τ), V(i),τ+1 V(i),τ E i=1 ητ V(i),τ 2 F + η2 τ NC6,0 Uτ 2 F + C6,1 where the constants are: C6,0 = Gmax,op r(M1 + C5,0) and C6,1 = Gmax,op r C5,1 Finally, we come to the proof of Theorem 8. Proof We choose constant a stepsize ητ = η1 small enough: η1 ηc = min n 1 2C6,0 + L 1 + M2 2 2 + 2 1 + C5,0 2C6,1 + 2L 1 + C5,1 1 Gmax,op r M3 2 , 1 Gmax,op r 6 + 12M1 + M2 1 Obviously, η1 satisfies the requirement in Lemma 15 and 16. By the property of Lipschitz continuity, we have: L(i) Uτ+1, V(i),τ+1 L(i) Uτ, V(i),τ + UL(i) Uτ, V(i),τ , Uτ+1 Uτ + UL(i) Uτ, V(i),τ , V(i),τ+1 V(i),τ V(i),τ+1 V(i),τ 2 F + Uτ+1 Uτ 2 F where L is defined in (54). 
Since U T τ+1V(i),τ+1 = 0 and U T τ V(i),τ = 0, we know that L(i) Uτ+1, V(i),τ+1 = fi Uτ+1, V(i),τ+1 and that L(i) Uτ, V(i),τ = fi Uτ, V(i),τ Then, summing up both sides for n from 1 to N, we have: f Uτ+1, {V(i),τ+1} f Uτ, {V(i),τ} i=1 UL(i) Uτ, V(i),τ , Uτ+1 Uτ UL(i) Uτ, V(i),τ , V(i),τ+1 V(i),τ V(i),τ+1 V(i),τ 2 F + Uτ+1 Uτ 2 F Shi and Kontar From equation (59), we know Uτ+1 Uτ F = ητ Uτ + η2 τe1,τ F ητ Uτ F + η2 τe1,τ F ητ Uτ F + η2 τM1 Uτ 2 F Similarly, from Lemma 15, we have: V(i),τ+1 V(i),τ F = ητ V(i),τ + ητUτ U T τ V(i),τ + η2 τe5,(i),τ F ητ V(i),τ F + ητ Uτ U T τ V(i),τ F + η2 τ e5,(i),τ F ητ V(i),τ F + ητ Uτ F + η2 τ e5,(i),τ F ητ Uτ F (1 + C5,0ητ Uτ F ) + ητ V(i),τ F 1 + C5,1ητ V(i),τ F ητ V(i),τ F Combining the two inequalities and Lemma 16, we have: f Uτ+1, {V(i),τ+1} f Uτ, {V(i),τ} ητ + η2 τNC6,0 Uτ 2 F + η2 τ i=1 C6,1 V(i),τ 2 F 2 + 2 1 + C5,0 Uτ 2 F + 2 1 + C5,1 2 V(i),τ 2 F f Uτ, {V(i),τ} ητ (64) Summing up both sides for τ from 1 to R and rearranging terms, we have: f(U1, {V(i),1}) + f(UR+1, {V(i),R+1}) As a result, min τ {1, ,N} 2 f(UR+1, {V(i),R+1}) f(U1, {V(i),1}) Personalized PCA This completes the proof of Theorem 8. Notice that C6,0, C6,1, and L are of the order Gmax,op r, thus the requirement on ηc in equation (63) becomes: ηc = Cη 1 Gmax,op r where Cη is a constant that only depends on M1, M2, and M3 from the generalized retraction Definition 5. Appendix E. Proof for local linear convergence In this section, we will show the full proof of Theorem 10. A formal theorem is stated below. 
Theorem 17 (Formal version of Theorem 10) Under Assumptions 4.2, 6.1, and 6.2, if the difference between the population and sample covariances is small, √(Σ_{i=1}^N ‖S_(i) − Σ_(i)‖²_F) ≤ min{ µθ^{3/2}/4, µ²θ²/(128² · 2G_max,op) } and ‖S_(i) − Σ_(i)‖ ≤ G_max,op, when we initialize close to the global optimum, φ_0 ≤ µ³θ³/(411041792 G²_max,op), and choose a constant stepsize η_t = η = O(1/(G_max,op √r)), then Algorithm 2 with choice 1 will converge to the global optimum:

f(ˆU, {ˆV_(i)}) − f(U_R, {V_(i),R}) → 0

where (ˆU, {ˆV_(i)}) is a set of optimal solutions to problem (7). Furthermore, we can recover the exact global optimal solutions: P_{V(i),R} → P_{ˆV(i)}.

We will start by introducing the needed notations, then establish some lemmas that characterize the local geometry of the optimization objective, and prove Theorem 10 at the end.

At communication round τ, remember that we use U_τ and V_(i),τ to denote the updated variables. We use (ˆU, {ˆV_(i)}) to denote one set of optimal solutions to (7). For simplicity, we use ˆΠ_g to denote the projection ˆΠ_g = ˆU ˆU^T and ˆΠ_(i) to denote the projection ˆΠ_(i) = ˆV_(i) ˆV_(i)^T. Since each covariance matrix S_(i) is symmetric positive semidefinite, we can find a matrix F_(i) ∈ R^{d×d} such that F_(i)F_(i)^T = S_(i) by Cholesky factorization. Furthermore, we can define F_(i),g, F_(i),l, and R_(i) as

F_(i),g = ˆΠ_g F_(i),  F_(i),l = ˆΠ_(i) F_(i),  ˆF_(i) = F_(i),g + F_(i),l,  R_(i) = F_(i) − ˆΠ_g F_(i) − ˆΠ_(i) F_(i)     (65)

Apparently, F_(i),g^T R_(i) = F_(i),l^T R_(i) = 0. Next we will introduce a set of optimal solutions (ˆU_τ, {ˆV_(i),τ}) that is close to the current iterates (U_τ, {V_(i),τ}). The variables ˆU_τ and ˆV_(i),τ are defined as

ˆU_τ = ˆΠ_g U_τ (U_τ^T ˆΠ_g U_τ)^{−1/2}     (66)

and

ˆV_(i),τ = ˆΠ_(i) V_(i),τ (V_(i),τ^T ˆΠ_(i) V_(i),τ)^{−1/2}     (67)

for each i = 1, . . . , N. It is easy to verify that ˆU_τ (ˆU_τ)^T = ˆΠ_g and that ˆV_(i),τ (ˆV_(i),τ)^T = ˆΠ_(i). Notice that (ˆU_τ, {ˆV_(i),τ}) is one set of global optimal solutions that depends on the communication round τ.
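Construction (66) is a matrix-square-root normalization: it projects the current iterate onto the optimal subspace and re-orthonormalizes it. A small numerical sketch (the dimensions and the 0.1 perturbation below are arbitrary choices, not from the paper) confirms the two properties just stated, namely that the result has orthonormal columns and reproduces the optimal projection exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 6, 2

# A rank-r projection Pi_g and a nearby orthonormal iterate U.
Ustar, _ = np.linalg.qr(rng.standard_normal((d, r)))
Pi = Ustar @ Ustar.T
U, _ = np.linalg.qr(Ustar + 0.1 * rng.standard_normal((d, r)))

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    lam, Q = np.linalg.eigh(M)
    return Q @ np.diag(lam ** -0.5) @ Q.T

# Uhat = Pi_g U (U^T Pi_g U)^{-1/2}, as in (66)
Uhat = Pi @ U @ inv_sqrt(U.T @ Pi @ U)

print(np.allclose(Uhat.T @ Uhat, np.eye(r)))   # orthonormal columns
print(np.allclose(Uhat @ Uhat.T, Pi))          # spans exactly col(Pi_g)
```

The algebra behind both checks is one line each: Π² = Π gives Û^TÛ = I, and ΠU(U^TΠU)^{-1}U^TΠ is the rank-r projection onto col(ΠU) = col(Π).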
We use Uτ to denote the difference between Uτ and ˆUτ: Uτ = Uτ ˆUτ (68) and similarly: V(i),τ = V(i),τ ˆV(i),τ (69) Since ˆUτ and ˆV(i),τ s are optimal, we can simplify the KKT conditions in (24) as R(i)F T (i) ˆV(i),τ = 0, i [N] i=1 R(i)F T (i) ˆUτ = 0 (70) We can replace F(i) by ˆF(i) in (70) since F T (i) ˆV(i),τ = ˆF T (i) ˆV(i),τ and F T (i) ˆUτ = ˆF T (i) ˆUτ. We will first show some properties of the introduced variables. Lemma 18 Under the same conditions as Theorem 10, there exists constants ˆθ = θ 2, such that the following holds, 2. The smallest nonzero eigenvalue of ˆF(i) ˆF T (i) is lower bounded by ˆµ. PN i=1 1 N ˆΠ(i) 1 ˆθ. R(i) ˆµˆθ 64 2Gmax,op = µθ 128 Personalized PCA We will use the ˆθ and ˆµ notations in the remaining parts of the section. Proof We will prove the claims one by one. From (34), we know, Πg + Π(i) ˆΠg ˆΠ(i) 2 S(i) Σ(i) 2 F where we replace δ by µ since I Πg Π(i) Σ(i) = 0. Therefore, the difference between ˆF(i) ˆF T (i) and Σ(i) is upper bounded by, ˆF(i) ˆF T (i) Σ(i) = ˆΠg + ˆΠ(i) S(i) ˆΠg + ˆΠ(i) Σ(i) ˆΠg + ˆΠ(i) Σ(i) ˆΠg + ˆΠ(i) Σ(i) + ˆΠg + ˆΠ(i) Σ(i) S(i) ˆΠg + ˆΠ(i) 2 ˆΠg + ˆΠ(i) Πg Π(i) F Σ(i) + Σ(i) S(i) Σ(j) S(j) 2 F + Σ(i) S(i) 1282 2Gmax,op From Weyl s theorem, we know, ˆF(i) 2 = ˆF(i) ˆF T (i) ˆF(i) ˆF T (i) Σ(i) + Σ(i) This proves Claim 1. Also by Weyl s theorem, we know, λ2r ˆF(i) ˆF T (i) λ2r ˆF(i) ˆF T (i) Σ(i) ˆF(i) ˆF T (i) Σ(i) 32768Gmax,op This proves Claim 2. Next, we consider the results from Theorem 1, P ˆU Πg 2 F + 1 P ˆV(i) Π(i) 2 F 8 θµ2 1 N Σ(i) S(i) 2 F Shi and Kontar Therefore, an upper bound for P ˆV(i) Π(i) F is P ˆV(i) Π(i) F Σ(i) S(i) 2 F where we applied the condition that q PN i=1 Σ(i) S(i) 2 F µθ1.5 4 in the last inequality. As a result, we have, 1 N 1 θ + θ 1 1 This proves Claim 3. 
Then we analyze the norm of R(i)RT (i), R(i) 2 = RT (i)R(i) = I ˆΠg ˆΠ(i) S(i) I ˆΠg ˆΠ(i) I ˆΠg ˆΠ(i) Σ(i) I ˆΠg ˆΠ(i) + I ˆΠg ˆΠ(i) S(i) Σ(i) I ˆΠg ˆΠ(i) Πg + Π(i) ˆΠg ˆΠ(i) F Σ(i) + S(i) Σ(i) Σ(i) S(i) 2 F 1 + 4Gmax,op where the last inequality comes from the fact that PN i=1 Σ(i) S(i) 2 F 1 1+ 8Gmax,op µ 2 µ2θ2 32768Gmax,op Personalized PCA As discussed in Section 6.1.1, different U and V(i) s may have the same objective value, as long as they span the same column space. We introduce a variable ζ to denote the subspace distance between the estimate and ground truth: ζ(i),τ = r D PV(i),τ , Π(i) E (71) for each i = 1, ...N, and, ζ(0),τ = r PUτ , Πg (72) We use ζτ to denote: ζτ = ζ(0),τ + 1 i=1 ζ(i),τ (73) The ζ(0),τ and ζ(i),τ s defined represent how far away the iterates are from the ground truth, measured by subspace distance. We can also define, eζ(i),τ = 2r D PUτ + PV(i),τ , ˆΠg + ˆΠ(i) E (74) for each i = 1, ...N. We use eζτ to denote: i=1 eζ(i),τ (75) From Lemma 13, since ˆΠ(i) s are ˆθ-misaligned, there exists a relation between ζτ and eζτ: ˆθ 2Nζτ eζτ Nζτ (76) For simplicity, we also define the optimality gap φτ as, Tr ˆΠg S(i) + Tr ˆΠ(i)S(i) Tr PUτ S(i) Tr PV(i),τ S(i) (77) We then use the optimality gap φτ to upper bound the norm of Uτ and V(i),τ. Lemma 19 Under the same conditions as Theorem 10, we have, Shi and Kontar Proof By definition of the optimality gap φτ, we have, Tr ˆΠg S(i) + Tr ˆΠ(i)S(i) Tr PUτ S(i) Tr PV(i),τ S(i) F(i) PUτ + PV(i),τ F(i) 2 F F(i) ˆΠg + ˆΠ(i) F(i) 2 I PUτ PV(i),τ ˆF(i) + I PUτ PV(i),τ R(i) 2 I PUτ PV(i),τ ˆF(i) 2 F + 2 D I PUτ PV(i),τ ˆF(i), I PUτ PV(i),τ R(i) E | {z } Term I + I PUτ PV(i),τ R(i) 2 F R(i) 2 F | {z } Term II The above can be further simplified. 
For Term I, we have, i=1 2 D I PUτ PV(i),τ ˆF(i), I PUτ PV(i),τ R(i) E i=1 2Tr RT (i) I PUτ PV(i),τ ˆF(i) i=1 2Tr RT (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ ˆF(i) i=1 2Tr RT (i) Uτ U T τ V(i),τ V T (i),τ ˆF(i) where we have applied the KKT conditions (70) that ˆV T (i),τ ˆF(i)RT (i) = 0 and PN i=1 ˆU T τ ˆF(i)RT (i) = 0 in the third equality. For term Term II, we can also derive I PUτ PV(i),τ R(i) 2 = Tr RT (i) I PUτ PV(i),τ R(i) Tr RT (i)R(i) = Tr RT (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ R(i) = Tr RT (i) Uτ U T τ + V(i),τ V T (i),τ R(i) where we used the condition ˆU T τ R(i) = ˆV T (i),τR(i) = 0 in the last equality. Personalized PCA Combining these, we have, I PUτ PV(i),τ ˆF(i) 2 F Tr RT (i) Uτ U T τ + V(i),τ V T (i),τ R(i) 2Tr RT (i) Uτ U T τ + V(i),τ V T (i),τ ˆF(i) (79) From Lemma 18, we know that ˆF(i) ˆF T (i) ˆµ ˆΠg + ˆΠ(i) , thus i=1 Tr I PUτ PV(i),τ ˆF(i) ˆF T (i) i=1 Tr I PUτ PV(i),τ ˆΠg + ˆΠ(i) ˆµ r Tr ˆΠg PUτ + r Tr ˆΠ(i)PV(i),τ Uτ 2 F + V(i),τ 2 F By Cauchy-Schwartz inequality, we have, Tr RT (i) Uτ U T τ + V(i),τ V T (i),τ R(i) = Tr RT (i) Uτ U T τ R(i) + Tr RT (i) V(i),τ V T (i),τR(i) Uτ 2 F R(i) 2 + V(i),τ 2 F R(i) 2 2Tr RT (i) Uτ U T τ + V(i),τ V T (i),τ ˆF(i) 2 Uτ 2 F R(i) ˆF(i) + 2 V(i),τ 2 F R(i) ˆF(i) Since R(i) ˆµˆθ 64 2Gmax,op and ˆF(i) p 2Gmax,op, we have R(i) 2 + 2 R(i) ˆF(i) 3 R(i) ˆF(i) 8 Thus we have, Uτ 2 F + V(i),τ 2 F This completes our proof. Next we will provide a lemma that characterizes the landscape of the objective. 
Shi and Kontar Lemma 20 Under the same conditions as Theorem 10, we have, I PUτ PV(i),τ S(i)Uτ, Uτ D I PUτ PV(i),τ S(i)V(i),τ, V(i),τ E We first consider the inner product term, D I PUτ PV(i),τ S(i)V(i),τ, V(i),τ E = D I PUτ PV(i),τ F(i), V(i),τV T (i),τF(i) E = D I PUτ PV(i),τ ˆF(i), V(i),τV T (i),τ ˆF(i) E | {z } Term III + D I PUτ PV(i),τ R(i), V(i),τV T (i),τR(i) E | {z } Term IV + D I PUτ PV(i),τ ˆF(i), V(i),τV T (i),τR(i) E | {z } Term V + D I PUτ PV(i),τ R(i), V(i),τV T (i),τ ˆF(i) E | {z } Term VI We will analyze each term separately. For Term III, we know that V(i),τV T (i),τ = V(i),τV T (i),τ ˆV(i),τ ˆV(i),τ ˆV(i),τ V T (i),τ. Therefore, D I PUτ PV(i),τ ˆF(i), V(i),τV T (i),τ ˆF(i) E = D I PUτ PV(i),τ ˆF(i), PV(i),τ ˆΠ(i) ˆV(i),τ V T (i),τ ˆF(i) E = D I PUτ PV(i),τ ˆF(i), PV(i),τ ˆΠ(i) ˆF(i) E D I PUτ PV(i),τ ˆF(i), ˆV(i),τ V T (i),τ ˆF(i) E = D I PUτ PV(i),τ ˆF(i), PV(i),τ ˆΠ(i) ˆF(i) E + ϵ1,(i),τ Personalized PCA where ϵ1,(i),τ is defined as ϵ1,(i),τ = D I PUτ PV(i),τ ˆF(i), ˆV(i),τ V T (i),τ ˆF(i) E . 
Its norm is upper bounded by ϵ1,(i),τ = Tr ˆF T (i) I PUτ PV(i),τ ˆV(i),τ V T (i),τ ˆF(i) = Tr ˆF T (i) I PUτ PV(i),τ V(i),τ V T (i),τ ˆF(i) = Tr ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ V(i),τ V T (i),τ ˆF(i) ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ F V(i),τ V T (i),τ ˆF(i) F ˆΠg + ˆΠ(i) PUτ PV(i),τ F eζ(i),τζ(i),τGmax,op For Term IV, we have, D I PUτ PV(i),τ R(i), V(i),τV T (i),τR(i) E = D I PUτ PV(i),τ R(i), V(i),τ V T (i),τR(i) E = D R(i), V(i),τ V T (i),τR(i) E + ϵ2,(i),τ where ϵ2,(i),τ is defined as ϵ2,(i),τ = D PUτ + PV(i),τ R(i), V(i),τ V T (i),τR(i) E and its norm is upper bounded by, ϵ2,(i),τ F = Tr RT (i) PUτ + PV(i),τ V(i),τ V T (i),τR(i) = Tr RT (i) PUτ + PV(i),τ ˆΠg ˆΠ(i) PUτ + PV(i),τ V(i),τ V T (i),τR(i) RT (i) PUτ + PV(i),τ ˆΠg ˆΠ(i) F PUτ + PV(i),τ V(i),τ V T (i),τR(i) F PUτ + PV(i),τ ˆΠg ˆΠ(i) F V(i),τ 2 F R(i) 2 eζ(i),τζ(i),τGmax,op For the Term V, D I PUτ PV(i),τ ˆF(i), V(i),τV T (i),τR(i) E = D I PUτ PV(i),τ ˆF(i), V(i),τ V T (i),τR(i) E Shi and Kontar where the norm of ϵ3,(i),τ is upper bounded by, = Tr ˆF T (i) I PUτ PV(i),τ V(i),τ V T (i),τR(i) = Tr ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ V(i),τ V T (i),τR(i) ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ F V(i),τ V T (i),τR(i) F PUτ + PV(i),τ ˆΠg ˆΠ(i) F V(i),τ 2 F R(i) ˆF(i) eζ(i),τζ(i),τGmax,op For Term VI, D I PUτ PV(i),τ R(i), V(i),τV T (i),τ ˆF(i) E = D I PUτ PV(i),τ R(i), V(i),τ V T (i),τ ˆF(i) E = D R(i), V(i),τ V T (i),τ ˆF(i) E + ϵ4,(i),τ where ϵ4,(i),τ is defined as ϵ4,(i),τ = D PUτ + PV(i),τ R(i), V(i),τ V T (i),τ ˆF(i) E Its norm is upper bounded by, = Tr RT (i) PUτ + PV(i),τ V(i),τ V T (i),τ ˆF(i) = Tr RT (i) Πg Π(i) + PUτ + PV(i),τ V(i),τ V T (i),τ ˆF(i) V(i),τ 2 F R(i) ˆF(i) Πg Π(i) + PUτ + PV(i),τ F eζ(i),τζ(i),τGmax,op Combining these terms, we have, D I PUτ PV(i),τ S(i)V(i),τ, V(i),τ E = D I PUτ PV(i),τ ˆF(i), PV(i),τ ˆΠ(i) ˆF(i) E + D R(i), V(i),τ V T (i),τR(i) E + D R(i), V(i),τ V T (i),τ ˆF(i) E + ϵ1,(i),τ + ϵ2,(i),τ + ϵ3,(i),τ + ϵ4,(i),τ (80) Personalized PCA Similarly, we can calculate the inner product 
term, D I PUτ PV(i),τ S(i)Uτ, Uτ E D I PUτ PV(i),τ F(i), UτU T τ F(i) E D I PUτ PV(i),τ ˆF(i), UτU T τ ˆF(i) E | {z } Term VII + D I PUτ PV(i),τ R(i), UτU T τ R(i) E | {z } Term VIII + D I PUτ PV(i),τ ˆF(i), UτU T τ R(i) E | {z } Term IX + D I PUτ PV(i),τ R(i), UτU T τ ˆF(i) E | {z } Term X For the Term VII, we can simplify it as, D I PUτ PV(i),τ ˆF(i), UτU T τ ˆF(i) E = D I PUτ PV(i),τ ˆF(i), PUτ ˆΠg ˆF(i) E + ϵ5,(i),τ where ϵ5,(i),τ is defined as, ϵ5,(i),τ = D I PUτ PV(i),τ ˆF(i), ˆUτ U T τ ˆF(i) E = D I PUτ PV(i),τ ˆF(i), Uτ U T τ ˆF(i) E Its norm is upper bounded by ϵ5,(i),τ F = Tr F T (i) I PUτ PV(i),τ Uτ U T τ ˆF(i) = Tr ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ Uτ U T τ ˆF(i) ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ F Uτ U T τ ˆF(i) F ˆΠg + ˆΠ(i) PUτ PV(i),τ F Uτ 2 F ˆF(i) 2 eζ(i),τζ(0),τGmax,op For Term VIII, also we have, D I PUτ PV(i),τ R(i), UτU T τ R(i) E D I PUτ PV(i),τ R(i), Uτ U T τ R(i) E = R(i), Uτ U T τ R(i) + ϵ6,(i),τ Shi and Kontar where ϵ6,(i),τ is defined as, ϵ6,(i),τ = D PUτ + PV(i),τ R(i), Uτ U T τ R(i) E and its norm is upper bounded by, ϵ6,(i),τ F = Tr RT (i) PUτ + PV(i),τ Uτ U T τ R(i) = Tr RT (i) PUτ + PV(i),τ ˆΠg ˆΠ(i) PUτ + PV(i),τ Uτ U T τ R(i) RT (i) PUτ + PV(i),τ ˆΠg ˆΠ(i) F PUτ + PV(i),τ Uτ U T τ R(i) F PUτ + PV(i),τ ˆΠg ˆΠ(i) F Uτ 2 F R(i) 2 eζ(i),τζ(0),τGmax,op For Term IX, we have, D I PUτ PV(i),τ ˆF(i), UτU T τ R(i) E = D I PUτ PV(i),τ ˆF(i), Uτ U T τ R(i) E where the norm of ϵ7,(i),τ is upper bounded by, ϵ7,(i),τ = Tr ˆF T (i) I PUτ PV(i),τ Uτ U T τ R(i) = Tr ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ Uτ U T τ R(i) ˆF T (i) ˆΠg + ˆΠ(i) PUτ PV(i),τ F Uτ U T τ R(i) F ˆΠg + ˆΠ(i) PUτ PV(i),τ F Uτ 2 F R(i) ˆF(i) eζ(i),τζ(0),τGmax,op Finally, for Term X, we have, D I PUτ PV(i),τ R(i), UτU T τ ˆF(i) E D I PUτ PV(i),τ R(i), Uτ ˆU T τ ˆF(i) E + D I PUτ PV(i),τ R(i), Uτ U T τ ˆF(i) E D I PUτ PV(i),τ R(i), Uτ U T τ ˆF(i) E D PV(i),τ R(i), Uτ ˆUτ ˆF(i) E D R(i), Uτ U T τ ˆF(i) E + ϵ8,(i),τ Personalized PCA where ϵ8,(i),τ is defined as, ϵ8,(i),τ = D PUτ PV(i),τ 
R(i), Uτ U T τ ˆF(i) E D PV(i),τ R(i), Uτ ˆUτ ˆF(i) E Its norm is upper bounded by Tr RT (i) PUτ PV(i),τ Uτ U T τ ˆF(i) + Tr RT (i)PV(i),τ Uτ ˆUτ ˆF(i) Tr RT (i) PUτ PV(i),τ Uτ U T τ ˆF(i) + Tr RT (i) PV(i),τ ˆΠ(i) PV(i),τ Uτ ˆUτ ˆF(i) eζ(i),τζ(0),τGmax,op + PV(i),τ ˆΠ(i) F Uτ F R(i) ˆF(i) Combining them, we have, D I PUτ PV(i),τ S(i)Uτ, Uτ E D I PUτ PV(i),τ ˆF(i), PUτ ˆΠg ˆF(i) E + R(i), Uτ U T τ R(i) + D R(i), Uτ U T τ ˆF(i) E + ϵ5,(i),τ + ϵ6,(i),τ + ϵ7,(i),τ + ϵ8,(i),τ (81) Comparing (79), (80), and (81), we know that, I PUτ PV(i),τ S(i)Uτ, Uτ D I PUτ PV(i),τ S(i)V(i),τ, V(i),τ E i=1 Tr RT (i) Uτ U T τ + V(i),τ V T (i),τ F(i) α=1 ϵα,(i),τ From the estimated upper bounds of ϵ1,(i),τ to ϵ8,(i),τ , we know that, i=1 Tr RT (i) Uτ U T τ + V(i),τ V T (i),τ F(i) + α=1 ϵα,(i),τ Uτ 2 F + V(i),τ 2 F R(i) ˆF(i) + 14 eζ(i),τ ζ(i),τ + ζ(0),τ Shi and Kontar From Lemma 18, we know that R(i) ˆF(i) ˆµˆθ 64, we can thus upper bound the first term as, Uτ 2 F + V(i),τ 2 F Uτ 2 F + V(i),τ 2 F where the last inequality comes from Lemma 19. Also, the second term can be bounded as, eζ(i),τ ζ(i),τ + ζ(0),τ ζ(i),τ + ζ(0),τ j=1 ζ(j),τ + ζ(0),τ i=1 eζ(i),τ i=1 ζ(i),τ + ζ(0),τ j=1 ζ(j),τ + ζ(0),τ i=1 ζ(i),τ + ζ(0),τ i=1 Uτ 2 F + V(i),τ 2 F where the second inequality comes from Cauchy-Schwrtz inequality, the third inequality comes from Lemma 13, the fourth inequality comes from Lemma 26, the fifth inequality comes from Lemma 19, and the last inequality comes from the fact that φτ ˆµ3ˆθ3 51380224G2max,op . This completes the proof. Combining Lemma 19 with Lemma 20, we can prove the following PL-inequality. Lemma 21 (Lemma 12 in the main paper) Under the same conditions as Theorem 10, we have V(i),τ 2 F ˆθˆµ 16 φτ Personalized PCA Proof From Cauchy-Schwartz inequality, we know that, V(i),τ, V(i),τ N Uτ F Uτ + V(i),τ F V(i),τ F v u u t N Uτ 2 F + v u u t N Uτ 2 F + v u u t N Uτ 2 F + where the last inequality comes from Lemma 19. 
From Lemma 20, we have a lower bound on the inner-product sum. Combining them, we have:

N‖U'_τ‖²_F + Σ_{i=1}^N ‖V'_(i),τ‖²_F ≥ (ˆθˆµ/16) φ_τ

Finally, we come to the proof of Theorem 10:

Proof Combining Lemma 21 with equation (64), we can bound f(U_{τ+1}, {V_(i),τ+1}) − f(U_τ, {V_(i),τ}). We add f* on both sides. Since φ_τ = f* − f(U_τ, {V_(i),τ}), we have:

φ_{τ+1} ≤ (1 − (η/2) · (ˆµˆθ/16)) φ_τ

Thus φ_τ decreases linearly with τ. From Lemma 19 and Lemma 26, we can show that ‖P_{U_τ} − ˆΠ_g‖_F and ‖P_{V(i),τ} − ˆΠ_(i)‖_F decrease linearly to zero as well. This completes the proof of Theorem 10.

Appendix F. Some examples of generalized retractions

In this section, we discuss two popular normalization schemes: polar projection and QR decomposition. We prove that both fit Definition 5 of a generalized retraction. The analysis in this section is inspired by Liu et al. (2019). However, Liu et al. (2019) only considers conventional retraction operations, while we consider generalized retractions.

F.1 Polar projection

Polar projection is defined as:

GR^polar_U(ξ) = (U + ξ)(I + U^Tξ + ξ^TU + ξ^Tξ)^{−1/2}

Then obviously, col(GR_U(ξ)) = col(U + ξ). To verify the second property, we can calculate the difference between GR_U(ξ) and U + P_{T_U}(ξ). Notice that

(I + U^Tξ + ξ^TU + ξ^Tξ)^{−1/2} = Σ_{n=0}^∞ (−1)^n ((2n−1)!!/(2^n n!)) (U^Tξ + ξ^TU + ξ^Tξ)^n

so that

GR_U(ξ) − (U + P_{T_U}(ξ)) = −(1/2) U ξ^Tξ − (1/2) ξ(U^Tξ + ξ^TU + ξ^Tξ) + (U + ξ) Σ_{n=2}^∞ (−1)^n ((2n−1)!!/(2^n n!)) (U^Tξ + ξ^TU + ξ^Tξ)^n     (82)

By the properties of the Frobenius norm:

‖U^Tξ + ξ^TU + ξ^Tξ‖_F ≤ 2‖ξ‖_F ‖U^T‖_op + ‖ξ‖²_F = 2‖ξ‖_F + ‖ξ‖²_F ≤ 3‖ξ‖_F

Therefore,

‖GR_U(ξ) − (U + P_{T_U}(ξ))‖_F ≤ (1/2)‖ξ‖²_F + ‖ξ‖²_F + (1/2)‖ξ‖³_F + (1 + ‖ξ‖_F) Σ_{n=2}^∞ ((2n−1)!!/(2^n n!)) (3‖ξ‖_F)^n ≤ M_polar ‖ξ‖²_F

where M_polar = 253/8. We applied the following summation in the derivation:

Σ_{n=2}^∞ ((2n−1)!!/(2^n n!)) x^n = (1−x)^{−1/2} − (1 + x/2) = (3x² + x³) / (4(√(1−x) + (1−x)(1 + x/2)))

and the fact that x ≤ 1/2 in the third inequality.
Since

‖ξ‖²_F = ‖P_{T_U}(ξ) + P_{N_U}(ξ)‖²_F ≤ 2‖P_{T_U}(ξ)‖²_F + 2‖P_{N_U}(ξ)‖²_F ≤ 2‖P_{T_U}(ξ)‖²_F + ‖P_{N_U}(ξ)‖_F

we prove that polar projection is a generalized retraction with M1 = 253/4 and M2 = 253/8. Polar projection can be implemented via the singular value decomposition of U + ξ, whose computational complexity is O(dr² + r³) (Breloy et al., 2021).

F.2 QR decomposition

QR decomposition is an extension of Gram-Schmidt orthonormalization. For a matrix U + ξ ∈ R^{d×r}, the method finds an orthogonal matrix Q ∈ R^{d×r} and an upper triangular matrix R ∈ R^{r×r} such that QR = U + ξ. Then GR^QR_U(ξ) = Q. In this section, we will prove that QR decomposition is a generalized retraction for ‖ξ‖ ≤ 1/4. Our proof in this section extends that in Liu et al. (2019). Notice that col(U + ξ) = col(Q); thus the first property of a generalized retraction in Definition 5 is satisfied. We will prove the second property in the case M3 = 1/4. Similar to Liu et al. (2019), we define U(t) = U + tξ for t ∈ [0, 1], and use Q(t)R(t) to denote the QR decomposition of U(t).
Then: GRQR U (ξ) (U + ξ) F = Q(1) Q(1)R(1) F = Q(1) (I R(1)) F R(1) R(0) F 0 R (t)dt F Since Q(t)R(t) is the QR decomposition of U(t), we have: RT (t)R(t) = U T (t)U(t) = U T U + tξT U + t U T ξ + t2ξT ξ (83) Taking the derivative with respect to t on both sides, we have: R T (t)R(t) + RT (t)R (t) = ξT U + U T ξ + 2tξT ξ We can left multiply both sides by R 1 T (t), and right multiply both sides by R 1(t), to obtain: R 1 T (t) R T (t) + R (t)R 1(t) = R 1 T (t) ξT U + U T ξ + 2tξT ξ R 1(t) Since on the left hand side, R (t)R 1(t) is an upper triangular matrix, its transpose R 1 T (t) R T (t) is a lower triangular matrix, we have: R (t)R 1(t) = up h R 1 T (t) ξT U + U T ξ + 2tξT ξ R 1(t) i where for C Rd d, up [ ] is defined as: Cij, if j > i 1 2Cii, if j = i 0, if j < i R (t) = up h R 1 T (t) ξT U + U T ξ + 2tξT ξ R 1(t) i R(t) and accordingly: R (t) F = up h R 1 T (t) ξT U + U T ξ + 2tξT ξ R 1(t) i R(t) F up h R 1 T (t) ξT U + U T ξ + 2tξT ξ R 1(t) i F R(t) op R 1 T (t) ξT U + U T ξ + 2tξT ξ R 1(t) F R(t) op Personalized PCA where we used Lemma 23 for the first inequality. From (83), we know that: R(t) 2 op = R(t)T R(t) op = I + t ξT U + U T ξ + t2ξT ξ op 1 t ξT U + U T ξ op t2 ξT ξ op 1 2 ξ F ξT ξ F where the first inequality comes from the triangle inequality, the second comes from the fact that F op, and the third comes from the requirement ξ F 1 4. Similarly, we can derive: R(t) 2 op = R(t)T R(t) op 1 + 2 ξ F + ξT ξ F 25 As a result, R (t) F R 1 T (t) ξT U + U T ξ + 2tξT ξ R 1(t) F R(t) op R 1 T (t) ξT U + U T ξ R 1(t) F + R 1 T (t) 2tξT ξ R 1(t) F ξT U + U T ξ F R 1 T (t)R 1(t) op + 2t ξT ξ F R 1 T (t)R 1(t) op 7 ξT U + U T ξ F + 2t ξT ξ F Hence, GRQR U (ξ) (U + ξ) F 7 ξT U + U T ξ F + ξT ξ F Since U is an orthogonal matrix, PNU (ξ) F = 1 2 U ξT U + U T ξ F = 1 2 ξT U + U T ξ F . By Cauchy-Schwartz inequality, ξT ξ F = (PNU (ξ) + PTU (ξ))T (PNU (ξ) + PTU (ξ)) F 2 PNU (ξ) 2 F + 2 PTU (ξ) 2 F . 
Thus we have: GRQR U (ξ) (U + ξ) F 2 PNU (ξ) F + 2 PNU (ξ) 2 F + 2 PTU (ξ) 2 F 7 PNU (ξ) F + 40 7 PTU (ξ) 2 F Hence the second property of definition holds with M1 = 40 7 and M2 = 80 7 . QR decomposition can be implemented by Gram-Schmidt or Householder algorithm with computation complexity of O(dr2) Shi and Kontar Appendix G. Auxiliary lemmas In this section, we show some auxiliary lemmas needed for the proof in earlier Sections. Most lemmas are derived from basic facts in linear algebra. We begin with some general inequalities related to matrix trace norms. Lemma 22 For two matrices A, B Rd d, if both A, B are symmetric positive definite, then: Tr (AB) 0 A simple corollary is that if A1, A2, B Rd d are symmetric and B is positive semi-definite, and A1 A2, then Tr (A1B) Tr (A2B) Proof Since both A and B are positive symmetric, there exists X, Y Rd d, such that A = XT X and B = Y T Y , therefore: = Tr XT XY T Y = Tr (Y XT )T Y XT The following lemma presents an upper bound of the Frobenius norm of the product of two matrices. Lemma 23 For two matrices A Rm n, and B Rn k, we have: AB F A op B F and: AB F A F B op The proof of the lemma can be found in Sun and Luo (2015). The following lemma introduces a simple upper bound on the Frobenius norm of Ir U T P U. Lemma 24 For any rank-r orthonormal matrix U Rd r, and rank-r projection matrix P Rd d, we have: Ir U T P U F r Tr U T P U (84) Proof It is easy to see that Ir U T P U is positive semidefinite. Also, for a positive semidefinite matrix, its Frobenius norm is upper bounded by its trace. Inequality (84) follows accordingly. We can proceed to the following lemma that upper bounds the trace of the k-th power of Ir U T P U. 
Lemma 25 For any rank-$r$ orthonormal matrix $U \in \mathbb{R}^{d \times r}$, rank-$r$ projection matrix $P \in \mathbb{R}^{d \times d}$, and $k = 1, 2, \ldots$, the matrix $(I_r - U^T P U)^k$ is positive semidefinite and:
\[
0 \le \operatorname{Tr}\!\left( \left( I_r - U^T P U \right)^k \right) \le \left( r - \operatorname{Tr}\!\left( U^T P U \right) \right)^k. \tag{85}
\]
Proof Since $I_r - U^T P U$ is symmetric positive semidefinite, $(I_r - U^T P U)^k$ is also symmetric positive semidefinite. Let the eigenvalues of $I_r - U^T P U$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r \ge 0$. Then:
\[
\operatorname{Tr}\!\left( \left( I_r - U^T P U \right)^k \right) = \sum_{i=1}^r \lambda_i^k \le \lambda_1^{k-1} \sum_{i=1}^r \lambda_i.
\]
By Lemma 24, $\lambda_1^{k-1} \le \| I_r - U^T P U \|_F^{k-1} \le \left( r - \operatorname{Tr}(U^T P U) \right)^{k-1}$, and $\sum_{i=1}^r \lambda_i = r - \operatorname{Tr}(U^T P U)$. This completes our proof.

Based on the above results, we can discuss some properties of the projection of a matrix onto a subspace. Suppose we know that the column space of $U \in \mathbb{R}^{d \times r}$ is close to that of $P \in \mathbb{R}^{d \times d}$; can we find a matrix $\bar{U}$ close to $U$ with column vectors in $\operatorname{col}(P)$? The following two lemmas give affirmative answers.

Lemma 26 For any rank-$r$ orthonormal matrix $U \in \mathbb{R}^{d \times r}$ and rank-$r$ projection matrix $P \in \mathbb{R}^{d \times d}$, we define:
\[
\bar{U} = P U \left( U^T P U \right)^{-1/2}.
\]
If $r - \operatorname{Tr}(U^T P U) \le 1$, we have:
\[
\left\| \bar{U} - U \right\|_F^2 \ge r - \operatorname{Tr}\!\left( U^T P U \right) \tag{86}
\]
\[
\left\| \bar{U} - U \right\|_F^2 \le 2 \left( r - \operatorname{Tr}\!\left( U^T P U \right) \right) \tag{87}
\]
Proof To prove the lower bound (86) and the upper bound (87), we can write $\| \bar{U} - U \|_F^2$ as:
\[
\left\| \bar{U} - U \right\|_F^2 = \langle \bar{U}, \bar{U} \rangle + \langle U, U \rangle - 2 \langle \bar{U}, U \rangle = 2r - 2 \langle \bar{U}, U \rangle.
\]
We first find an upper bound for $\langle \bar{U}, U \rangle$. Notice that:
\[
\langle \bar{U}, U \rangle
= \operatorname{Tr}\!\left( U^T P U \left( U^T P U \right)^{-1/2} \right)
= \operatorname{Tr}\!\left( \left( U^T P U \right)^{1/2} \right)
= \operatorname{Tr}\!\left( \left( I_r - \left( I_r - U^T P U \right) \right)^{1/2} \right)
\]
\[
= r - \tfrac{1}{2} \operatorname{Tr}\!\left( I_r - U^T P U \right) - \sum_{n=2}^{\infty} \frac{(2n-3)!!}{2^n n!} \operatorname{Tr}\!\left( \left( I_r - U^T P U \right)^n \right)
\le r - \tfrac{1}{2} \operatorname{Tr}\!\left( I_r - U^T P U \right).
\]
We used the series $(1 - x)^{1/2} = 1 - \tfrac{1}{2} x - \sum_{n=2}^{\infty} \frac{(2n-3)!!}{2^n n!} x^n$ and the result $\operatorname{Tr}\left( (I_r - U^T P U)^n \right) \ge 0$ from Lemma 25. As a result:
\[
\left\| \bar{U} - U \right\|_F^2 \ge \operatorname{Tr}\!\left( I_r - U^T P U \right) = r - \operatorname{Tr}\!\left( U^T P U \right).
\]
Similarly, from Lemma 25, $\operatorname{Tr}\left( (I_r - U^T P U)^n \right) \le \left( \operatorname{Tr}(I_r - U^T P U) \right)^n$, thus:
\[
\langle \bar{U}, U \rangle
\ge r - \tfrac{1}{2} \operatorname{Tr}\!\left( I_r - U^T P U \right) - \sum_{n=2}^{\infty} \frac{(2n-3)!!}{2^n n!} \left( \operatorname{Tr}\!\left( I_r - U^T P U \right) \right)^n
= r - 1 + \sqrt{1 - \operatorname{Tr}\!\left( I_r - U^T P U \right)}
\ge r - \operatorname{Tr}\!\left( I_r - U^T P U \right),
\]
where we used the relation $\sqrt{1 - x} \ge 1 - x$ for $x \in [0, 1]$ in the last inequality, applicable since $\operatorname{Tr}(I_r - U^T P U) = r - \operatorname{Tr}(U^T P U) \le 1$. Thus:
\[
\left\| \bar{U} - U \right\|_F^2 \le 2 \operatorname{Tr}\!\left( I_r - U^T P U \right) = 2 \left( r - \operatorname{Tr}\!\left( U^T P U \right) \right).
\]
This completes our proof.

The following lemma shows that we can identify global PCs from local PCs.
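Before moving on, the construction in Lemma 26 can be sanity-checked numerically. The following sketch (NumPy, not part of the paper) builds a random rank-$r$ projection $P$, an orthonormal $U$ whose column space is close to $\operatorname{col}(P)$, forms $\bar{U} = PU(U^TPU)^{-1/2}$ via an eigendecomposition, and checks bounds (86)-(87):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 10, 3

# A rank-r projection P, and an orthonormal U whose column space is close
# to col(P), so that r - Tr(U^T P U) <= 1 as Lemma 26 requires.
W, _ = np.linalg.qr(rng.standard_normal((d, r)))
P = W @ W.T
U, _ = np.linalg.qr(W + 0.1 * rng.standard_normal((d, r)))

# U_bar = P U (U^T P U)^{-1/2}, via an eigendecomposition of U^T P U.
S = U.T @ P @ U
evals, evecs = np.linalg.eigh(S)
S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
U_bar = P @ U @ S_inv_sqrt

gap = r - np.trace(S)
assert gap <= 1.0                                # Lemma 26's requirement
diff = np.linalg.norm(U_bar - U, "fro") ** 2
assert gap - 1e-10 <= diff <= 2 * gap + 1e-10    # bounds (86) and (87)

# U_bar is orthonormal with columns in col(P), as the construction promises.
assert np.allclose(U_bar.T @ U_bar, np.eye(r))
assert np.allclose(P @ U_bar, U_bar)
```

The final two assertions confirm why $\bar{U}$ is useful: $\bar{U}^T\bar{U} = (U^TPU)^{-1/2} U^T P U (U^TPU)^{-1/2} = I_r$ and $P\bar{U} = \bar{U}$ since $P^2 = P$.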
Lemma 27 Suppose that for $i = 1, \ldots, N$, $P_U, P_{V(i)}$ and $\bar{P}_U, \bar{P}_{V(i)}$ are projection matrices satisfying $P_U P_{V(i)} = 0$ and $\bar{P}_U \bar{P}_{V(i)} = 0$ for each $i$. Among them, $P_U$ and $\bar{P}_U$ have rank $r_1$, while $P_{V(i)}$ and $\bar{P}_{V(i)}$ have rank $r_{2,(i)}$. If there exists a positive constant $\theta > 0$ such that:
\[
\lambda_{\max}\!\left( \frac{1}{N} \sum_{i=1}^N \bar{P}_{V(i)} \right) \le 1 - \theta,
\]
we have the following bounds:
\[
\sum_{i=1}^N \left[ r_1 + r_{2,(i)} - \operatorname{Tr}\!\left( \left( P_U + P_{V(i)} \right) \left( \bar{P}_U + \bar{P}_{V(i)} \right) \right) \right]
\le N \left( r_1 - \operatorname{Tr}(\bar{P}_U P_U) \right) + \sum_{i=1}^N \left( r_{2,(i)} - \operatorname{Tr}(\bar{P}_{V(i)} P_{V(i)}) \right) \tag{88}
\]
\[
\sum_{i=1}^N \left[ r_1 + r_{2,(i)} - \operatorname{Tr}\!\left( \left( P_U + P_{V(i)} \right) \left( \bar{P}_U + \bar{P}_{V(i)} \right) \right) \right]
\ge \frac{\theta}{2} \left[ N \left( r_1 - \operatorname{Tr}(\bar{P}_U P_U) \right) + \sum_{i=1}^N \left( r_{2,(i)} - \operatorname{Tr}(\bar{P}_{V(i)} P_{V(i)}) \right) \right] \tag{89}
\]
Notice that we can replace $+$ by $-$ on the left-hand side of (88) and (89).

Proof We first prove the upper bound (88). Since $P_U \bar{P}_{V(i)} P_U$ is positive semidefinite, we know that:
\[
\operatorname{Tr}\!\left( P_U \bar{P}_{V(i)} \right) = \operatorname{Tr}\!\left( P_U \bar{P}_{V(i)} P_U \right) \ge 0.
\]
Similarly, we have $\operatorname{Tr}\left( \bar{P}_U P_{V(i)} \right) \ge 0$. Combining them, we have:
\[
\operatorname{Tr}\!\left( \left( P_U + P_{V(i)} \right) \left( \bar{P}_U + \bar{P}_{V(i)} \right) \right)
= \operatorname{Tr}(P_U \bar{P}_U) + \operatorname{Tr}(P_U \bar{P}_{V(i)}) + \operatorname{Tr}(P_{V(i)} \bar{P}_U) + \operatorname{Tr}(P_{V(i)} \bar{P}_{V(i)})
\ge \operatorname{Tr}(P_U \bar{P}_U) + \operatorname{Tr}(P_{V(i)} \bar{P}_{V(i)}).
\]
This proves inequality (88).

Next, we prove the lower bound (89). Since $\bar{P}_U \bar{P}_{V(i)} = 0$ implies $\bar{P}_{V(i)} = (I - \bar{P}_U) \bar{P}_{V(i)} (I - \bar{P}_U)$, we have:
\[
\sum_{i=1}^N \operatorname{Tr}\!\left( \left( P_U + P_{V(i)} \right) \left( \bar{P}_U + \bar{P}_{V(i)} \right) \right)
= \sum_{i=1}^N \left[ \operatorname{Tr}(P_U \bar{P}_U) + \operatorname{Tr}\!\left( P_U (I - \bar{P}_U) \bar{P}_{V(i)} (I - \bar{P}_U) \right) + \operatorname{Tr}(P_{V(i)} \bar{P}_U) + \operatorname{Tr}(P_{V(i)} \bar{P}_{V(i)}) \right]
\]
\[
= \sum_{i=1}^N \left[ \operatorname{Tr}(P_U \bar{P}_U) + \operatorname{Tr}\!\left( (I - \bar{P}_U) P_U (I - \bar{P}_U) \bar{P}_{V(i)} \right) + \operatorname{Tr}(P_{V(i)} \bar{P}_U) + \operatorname{Tr}(P_{V(i)} \bar{P}_{V(i)}) \right].
\]
Since $(I - \bar{P}_U) P_U (I - \bar{P}_U)$ and $\bar{P}_{V(i)}$ are both symmetric positive semidefinite, we have:
\[
\sum_{i=1}^N \operatorname{Tr}\!\left( (I - \bar{P}_U) P_U (I - \bar{P}_U) \bar{P}_{V(i)} \right)
\le \operatorname{Tr}\!\left( (I - \bar{P}_U) P_U (I - \bar{P}_U) \right) \lambda_{\max}\!\left( \sum_{i=1}^N \bar{P}_{V(i)} \right)
\le \operatorname{Tr}\!\left( P_U - P_U \bar{P}_U \right) N (1 - \theta)
= N \left( r_1 - \operatorname{Tr}(P_U \bar{P}_U) \right) (1 - \theta).
\]
For notational simplicity, we define $z_0 = r_1 - \operatorname{Tr}(P_U \bar{P}_U)$ and $z_i = r_{2,(i)} - \operatorname{Tr}(P_{V(i)} \bar{P}_{V(i)})$.
From the orthogonality $\bar{P}_U \bar{P}_{V(i)} = 0$, we have:
\[
\operatorname{Tr}\!\left( P_{V(i)} \bar{P}_U \right)
= \operatorname{Tr}\!\left( P_{V(i)} (I - \bar{P}_{V(i)}) \bar{P}_U (I - \bar{P}_{V(i)}) \right)
= \operatorname{Tr}\!\left( (I - \bar{P}_{V(i)}) P_{V(i)} (I - \bar{P}_{V(i)}) \bar{P}_U \right)
\le \lambda_{\max}\!\left( \bar{P}_U \right) \operatorname{Tr}\!\left( (I - \bar{P}_{V(i)}) P_{V(i)} (I - \bar{P}_{V(i)}) \right)
\le \operatorname{Tr}\!\left( P_{V(i)} - P_{V(i)} \bar{P}_{V(i)} \right)
= z_i.
\]
Also, from the orthogonality $P_U P_{V(i)} = 0$, we have:
\[
\operatorname{Tr}\!\left( P_{V(i)} \bar{P}_U \right)
= \operatorname{Tr}\!\left( (I - P_U) P_{V(i)} (I - P_U) \bar{P}_U \right)
= \operatorname{Tr}\!\left( P_{V(i)} (I - P_U) \bar{P}_U (I - P_U) \right)
\le \lambda_{\max}\!\left( P_{V(i)} \right) \operatorname{Tr}\!\left( (I - P_U) \bar{P}_U (I - P_U) \right)
\le \operatorname{Tr}\!\left( \bar{P}_U - P_U \bar{P}_U \right)
= z_0.
\]
Combining the two, $\operatorname{Tr}\left( P_{V(i)} \bar{P}_U \right) \le \min\{z_0, z_i\}$. As a result:
\[
\sum_{i=1}^N \left[ r_1 + r_{2,(i)} - \operatorname{Tr}\!\left( \left( P_U + P_{V(i)} \right) \left( \bar{P}_U + \bar{P}_{V(i)} \right) \right) \right]
\ge \sum_{i=1}^N \left[ z_0 - (1 - \theta) z_0 + z_i - \min\{z_0, z_i\} \right]
= \sum_{i=1}^N \left[ \theta z_0 + z_i - \min\{z_0, z_i\} \right].
\]
Since for any number $\nu \in (0, 1)$ we have $z_i - \min\{z_0, z_i\} \ge \nu (z_i - z_0)$, we can set $\nu = \theta/2$ to obtain:
\[
\sum_{i=1}^N \left[ \theta z_0 + z_i - \min\{z_0, z_i\} \right]
\ge \sum_{i=1}^N \left[ \theta z_0 + \frac{\theta}{2} (z_i - z_0) \right]
= \frac{\theta}{2} \sum_{i=1}^N (z_0 + z_i).
\]
This proves inequality (89).

References

P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

Ana M. Aguilera, Francisco A. Ocaña, and Mariano J. Valderrama. Forecasting time series by functional PCA. Discussion of several weighted approaches. Computational Statistics, 14(3):443–467, 1999.

Foivos Alimisis, Peter Davies, Bart Vandereycken, and Dan Alistarh. Distributed principal component analysis with limited communication. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=edCFRvlWqV.

Rohan Asokan. US presidential debate transcripts 1960–2020. Kaggle dataset, 2022. doi: 10.34740/KAGGLE/DSV/3690532. URL https://www.kaggle.com/datasets/arenagrenade/us-presidential-debate-transcripts-19602020.

Rajendra Bhatia. Matrix Analysis. Springer, New York, NY, 1997.

Aharon Birnbaum, Iain M. Johnstone, Boaz Nadler, and Debashis Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. Annals of Statistics, 41(3):1055, 2013.

Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. To appear with Cambridge University Press, April 2022. URL http://www.nicolasboumal.net/book.

Thierry Bouwmans, Sajid Javed, Hongyang Zhang, Zhouchen Lin, and Ricardo Otazo. On the applications of robust PCA in image and video processing. Proceedings of the IEEE, 106(8):1427–1457, 2018. doi: 10.1109/JPROC.2018.2853589.

Arnaud Breloy, Sandeep Kumar, Ying Sun, and Daniel P. Palomar. Majorization-minimization on the Stiefel manifold with application to robust sparse PCA. IEEE Transactions on Signal Processing, 69:1507–1520, 2021. doi: 10.1109/TSP.2021.3058442.
Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. In NeurIPS, 2019.

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 2011.

Shixiang Chen, Alfredo Garcia, Mingyi Hong, and Shahin Shahrampour. On the local linear rate of consensus on the Stiefel manifold. arXiv preprint, 2021a. URL https://arxiv.org/pdf/2101.09346.pdf.

Shixiang Chen, Alfredo Garcia, Mingyi Hong, and Shahin Shahrampour. Decentralized Riemannian gradient descent on the Stiefel manifold. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1594–1605. PMLR, 18–24 Jul 2021b. URL https://proceedings.mlr.press/v139/chen21g.html.

Xi Chen, Jason D. Lee, He Li, and Yun Yang. Distributed estimation for principal component analysis: a gap-free approach. CoRR, abs/2004.02336, 2020. URL https://arxiv.org/abs/2004.02336.

Charles-Alban Deledalle, Joseph Salmon, and Arnak Dalalyan. Image denoising with patch-based PCA: local versus global. In The 22nd British Machine Vision Conference, 2011.

Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998. doi: 10.1137/S0895479895290954.

Jianqing Fan, Dong Wang, Kaizheng Wang, and Ziwei Zhu. Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009–3031, 2019. doi: 10.1214/18-AOS1713.

Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '13, pages 1434–1453, USA, 2013. Society for Industrial and Applied Mathematics.
ISBN 9781611972511.

Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 1, 2:559–572, 1901.

Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. arXiv, abs/1509.05647, 2015. URL https://arxiv.org/abs/1509.05647.

Dan Garber, Ohad Shamir, and Nathan Srebro. Communication-efficient algorithms for distributed stochastic principal component analysis. In ICML, pages 1203–1212, 2017. URL http://proceedings.mlr.press/v70/garber17a.html.

Andreas Grammenos, Rodrigo Mendoza Smith, Jon Crowcroft, and Cecilia Mascolo. Federated principal component analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6453–6464. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/47a658229eb2368a99f1d032c8848542-Paper.pdf.

Anne Greenbaum, Ren-Cang Li, and Michael L. Overton. First-order perturbation theory for eigenvalues and eigenvectors. SIAM Review, 62(2):463–482, 2020. doi: 10.1137/19M124784X.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics, 2009.

David Hong, Fan Yang, Jeffrey A. Fessler, and Laura Balzano. Optimally weighted PCA for high-dimensional heteroscedastic data. arXiv preprint, 2021. URL https://arxiv.org/pdf/1810.12862.pdf.

H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 1933. doi: 10.1037/h0071325.

Long-Kai Huang and Sinno Pan. Communication-efficient distributed PCA by Riemannian optimization. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4465–4474. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/huang20e.html.
Hervé Jégou and Ondřej Chum. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, Computer Vision – ECCV 2012, pages 774–787, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33709-3.

William Kahan. The nearest orthogonal or unitary matrix. W. Kahan's Supplementary Notes for Math. 128, 2011.

Raed Kontar, Shiyu Zhou, Chaitanya Sankavaram, Xinyu Du, and Yilu Zhang. Nonparametric-condition-based remaining useful life prediction incorporating external factors. IEEE Transactions on Reliability, 67(1):41–52, 2017.

Raed Kontar, Shiyu Zhou, Chaitanya Sankavaram, Xinyu Du, and Yilu Zhang. Nonparametric modeling and prognosis of condition monitoring signals using multivariate Gaussian convolution processes. Technometrics, 60(4):484–496, 2018.

Raed Kontar, Naichen Shi, Xubo Yue, Seokhyun Chung, Eunshin Byon, Mosharaf Chowdhury, Jionghua Jin, Wissam Kontar, Neda Masoud, Maher Nouiehed, et al. The internet of federated things (IoFT). IEEE Access, 9:156071–156113, 2021.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

V. Kulkarni, M. Kulkarni, and A. Pant. Survey of personalization techniques for federated learning. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pages 794–797, 2020. doi: 10.1109/WorldS450073.2020.9210355.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of the 3rd MLSys Conference, 2018a.

Wei Li, Minjun Peng, and Qingzhong Wang. Fault detectability analysis in PCA method during condition monitoring of sensors in a nuclear power plant. Annals of Nuclear Energy, 119:342–351, 2018b.

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data.
In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJxNAnVtDS.

Yingyu Liang, Maria-Florina F. Balcan, Vandana Kanchanapally, and David Woodruff. Improved distributed principal component analysis. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/52947e0ade57a09e4a1386d08f17b656-Paper.pdf.

Huikang Liu, Anthony Man-Cho So, and Weijie Wu. Quadratic optimization with orthogonality constraint: Explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. In Proceedings of the 33rd International Conference on Machine Learning, 2019.

Eric F. Lock, Katherine A. Hoadley, James Stephen Marron, and Andrew B. Nobel. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1):523, 2013.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

John Novembre and Matthew Stephens. Interpreting principal component analyses of spatial population genetic variation. Nature Genetics, 40(5):646–649, 2008.

Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, and Shin Ishii. Heterogeneous component analysis. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper/2007/file/a8abb4bb284b5b27aa7cb790dc20f80b-Paper.pdf.

Francesc Pozo, Yolanda Vidal, and Oscar Salgado. Wind turbine condition monitoring strategy through multiway PCA and multivariate inference. Energies, 11(4):749, 2018.
Yongming Qu, George Ostrouchov, Nagiza Samatova, and Al Geist. Principal component analysis for dimension reduction in massive distributed data sets. In Knowledge and Information Systems (KAIS), April 2002.

David Reich, Alkes L. Price, and Nick Patterson. Principal component analysis of genetic data. Nature Genetics, 40(5):491–492, 2008.

Alessandro Rinaldo. Lecture notes in advanced statistical theory, Fall 2019.

Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multi-task optimization under privacy constraints. arXiv preprint arXiv:1910.01991, 2019.

Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. FOCS, 2015.

Cheng Tang. Exponentially convergent stochastic k-PCA without variance reduction. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Antoine Vacavant, Thierry Chateau, Alexis Wilhelm, and Laurent Lequievre. A benchmark dataset for outdoor foreground/background extraction. In ACCV Workshops, 2012. URL https://api.semanticscholar.org/CorpusID:10634625.

Vincent Q. Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/81e5f81db77c596492e6f1a5a792ed53-Paper.pdf.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019. doi: 10.1017/9781108627771.

Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012. doi: 10.1109/TIT.2011.2173156.

Kiyoung Yang and Cyrus Shahabi. A PCA-based similarity measure for multivariate time series. In Proceedings of the 2nd ACM International Workshop on Multimedia Databases, pages 65–74, 2004.
Guoxu Zhou, Andrzej Cichocki, Yu Zhang, and Danilo P. Mandic. Group component analysis for multiblock data: Common and individual feature extraction. IEEE Transactions on Neural Networks and Learning Systems, 27(11):2426–2439, 2015.

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.

Jiacheng Zhuo, Jeongyeol Kwon, Nhat Ho, and Constantine Caramanis. On the computational and statistical complexity of over-parameterized matrix sensing. arXiv preprint arXiv:2102.02756, 2021.

Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006. doi: 10.1198/106186006X113430.