# coVariance Neural Networks

Saurabh Sihag, University of Pennsylvania, sihags@pennmedicine.upenn.edu
Gonzalo Mateos, University of Rochester, gmateosb@ece.rochester.edu
Corey McMillan, University of Pennsylvania, cmcmilla@pennmedicine.upenn.edu
Alejandro Ribeiro, University of Pennsylvania, aribeiro@seas.upenn.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Graph neural networks (GNN) are an effective framework that exploit interrelationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.

1 Introduction

Convolutional neural networks (CNN) are among the most popular deep learning (DL) architectures due to their demonstrated ability to learn latent features from Euclidean data adaptively and automatically [1]. Motivated by the fact that graph models can naturally represent data in many practical applications [2], generalizations of CNNs that adapt to graph data via GNN architectures have rapidly emerged in recent years [3]. A practical challenge for DL architectures is scalability to high-dimensional data [4]. Specifically, DL models can typically have millions of parameters that require long training times and high storage capacity [5]. GNN architectures have two attractive features for learning from high-dimensional data: 1. the number of parameters associated with graph convolution operations in GNNs is independent of the graph size; and 2. under certain assumptions, GNN performance can be transferred across graphs of different dimensions [6].

Principal component analysis (PCA) is a traditional, non-parametric approach for linear dimensionality reduction and for increasing the interpretability of high-dimensional datasets [7]. Accordingly, it is common to embed the PCA transform within a DL pipeline [4, 8, 9]. Such approaches usually use PCA to optimize the DL model parameters by removing redundancy in the features in intermediate layers. The scope of our work is different. In this paper, we leverage the connections between PCA and the mathematical concepts used in GNNs to propose a learning framework that can recover PCA and also inherit additional desirable features associated with GNNs.

1.1 PCA and Graph Fourier Transform

PCA is an orthonormal transform which identifies the modes of maximum variance in a dataset. The transformed dimensions of the data (known as principal components) are uncorrelated.
Computing the PCA transform reduces to finding the eigenbasis of the covariance matrix or, equivalently, the singular value decomposition of the data matrix [10]. We note that PCA computation using the eigendecomposition of a covariance matrix has similarities with the graph Fourier transform (GFT) in the graph signal processing literature [11]. This observation is discussed more formally next.

Consider an m-dimensional random vector X ∈ R^{m×1} with the true or ensemble covariance matrix of X defined as C = E[(X − E[X])(X − E[X])^T]. In practice, we may not observe the data model for X and its statistics, such as the covariance matrix C, directly. Alternatively, we have access to a dataset consisting of n random, independent and identically distributed (i.i.d.) samples of X, given by x_i ∈ R^{m×1}, i ∈ {1, ..., n}, where the data matrix is x̂_n = [x_1, ..., x_n]. Using the samples x_i, i ∈ {1, ..., n}, we can form an estimate of the ensemble covariance matrix, conventionally referred to as the sample covariance matrix

\hat{C}_n = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T ,   (1)

where x̄ is the sample mean across the n samples and ^T denotes the transpose operator. Given the eigendecomposition of Ĉ_n,

\hat{C}_n = U W U^T ,   (2)

where U = [u_1, ..., u_m] is the matrix whose columns are the eigenvectors and W = diag(w_1, ..., w_m) is the diagonal matrix of eigenvalues, such that w_1 ≥ w_2 ≥ ... ≥ w_m, the implementation of PCA is given by [10]

\hat{y}_n = U^T \hat{x}_n .   (3)

The eigenvectors of Ĉ_n form the principal components, and the transformation in (3) is equivalent to the projection of the sample x_i in the eigenvector space of Ĉ_n. A detailed overview of PCA is provided in Appendix B. We note that in the context of graph signal processing, (3) is equivalent to the graph Fourier transform (GFT) if the covariance matrix Ĉ_n is considered the graph representation or a graph shift operator for the data x̂_n [12].
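For illustration, a minimal NumPy sketch of (1)-(3) is given below (ours, not part of the paper's implementation; the dimensions and random data are arbitrary): it forms the sample covariance matrix from a data matrix and projects the samples onto its eigenvectors, which is exactly the PCA transform.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 200                      # data dimension and number of samples (arbitrary here)
X = rng.normal(size=(m, n))        # data matrix [x_1, ..., x_n], one sample per column

# Sample covariance matrix as in (1)
x_bar = X.mean(axis=1, keepdims=True)
C_hat = (X - x_bar) @ (X - x_bar).T / (n - 1)

# Eigendecomposition C_hat = U W U^T as in (2); sort eigenvalues in descending order
w, U = np.linalg.eigh(C_hat)
order = np.argsort(w)[::-1]
w, U = w[order], U[:, order]

# PCA transform y_hat = U^T x_hat as in (3): rows of Y are the principal component scores
Y = U.T @ X
print(np.round(w, 3))              # eigenvalues w_1 >= ... >= w_m
```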
1.2 Contributions

Motivated by the observation in Section 1.1, we introduce the notion of the coVariance Fourier transform and define coVariance filters similar to graph convolutional filters in GNNs. Our analysis shows that standard PCA can be recovered using coVariance filters. Furthermore, we propose a DL architecture based on coVariance filters, referred to as coVariance neural networks (VNN)¹: a permutation-invariant architecture derived from the GNN framework that uses convolutional filters defined on the covariance matrix. Understanding VNNs is relevant given the ubiquity of so-termed correlation networks [13, Chapter 7.3.1] and covariance matrices as graph models for neuroimaging [14], user preference [15], financial [16], and gene expression data [13, Chapter 7.3.4], to name a few timely domains. A summary of our contributions is as follows:

Stability of VNN: For finite n, the eigenvectors of the sample covariance matrix Ĉ_n are perturbed from the eigenvectors of the ensemble covariance matrix C. Moreover, the principal components corresponding to eigenvalues that are close are unstable, i.e., small changes in the dataset may cause large changes in such principal components [17] and, subsequently, irreproducible statistical results [18]. Statistical learning approaches such as regression, classification, and clustering that use PCA as the first step for dimension reduction inherit this instability. Our main theoretical contribution is to establish the stability of VNNs to perturbations in the sample covariance matrix in terms of the sample size n. For this purpose, we leverage the perturbation theory for sample covariance matrices and show that, for a given Ĉ_n, the perturbation in the VNN output Φ(Ĉ_n) with respect to the VNN output for C, Φ(C), satisfies ‖Φ(Ĉ_n) − Φ(C)‖ = O(1/n^{1/2−ε}) for some ε < 1/2 with high probability. Therefore, our analysis ties the stability of VNNs to the quality of sample covariance matrices, which is distinct from existing studies that analyze the stability of GNNs under abstract, data-independent discrete or geometric perturbations to graphs [19, 20].

¹We emphasize the V in coVariance in the abbreviation to avoid confusion with any existing machine learning frameworks.

Empirical Validations: Our experiments on real-world and synthetic datasets show that VNNs are indeed stable to perturbations in the sample covariance matrix. Moreover, on a multi-resolution dataset, we also illustrate the transferability of VNN performance to datasets of different dimensions without re-training. These observations provide convincing evidence for the advantages offered by VNNs over standard PCA-based approaches for data analysis.

1.3 Related Work

GNNs: GNNs broadly encapsulate all DL models adapted to graph data [3, 21]. Recent survey articles in [3] and [21] categorize the variants of GNNs according to diverse criteria that include mathematical formulations, algorithms, and hardware/software implementations. Convolutional GNNs [22], graph autoencoders [23], recurrent GNNs, and gated GNNs [24] are among a few prominently studied and applied categories of GNNs. The taxonomy pertinent to this paper is that of the graph convolutional network (GCN), which generalizes the concept of convolution to graphs. Specifically, VNNs are based on GCNs that model the convolution operation via graph convolutional filters [25]. Graph convolutional filters are defined as polynomials over the graph representation, where the order of the polynomial controls the aggregation of information over nodes in the graph [12]. We extend the definition of graph convolutional filters to filters over sample covariance matrices to define VNNs. The ubiquity of PCA and the prevalence of covariance matrices across disciplines for data analysis motivate us to study VNNs independently of the aforementioned GCNs. Moreover, the unique structure inherent to covariance matrices (relative to arbitrary graphs) and their sample-based perturbations facilitate sharper stability analyses, which offer novel insights.

Robust PCA: Robust PCA is a variant of PCA that aims to recover the low-rank representation of a dataset arbitrarily corrupted due to outliers [26, 27, 28]. The outliers considered in robust PCA approaches could manifest in the dataset due to missing data, adversarial behavior, or defects or contamination in data collection. In contrast to such settings, here we consider the inherent statistical uncertainty in the sample covariance matrix, where ill-defined eigenvalues and eigenvectors of the sample covariance matrix can be the source of instability in statistical inference. Therefore, the notion of stability of VNNs in this paper is fundamentally distinct from the notion of stability or robustness in robust PCA approaches.

2 coVariance Filters

If the m dimensions of the data x̂_n are represented as the nodes of an m-node, undirected graph, the sample covariance matrix Ĉ_n is equivalent to its adjacency matrix. In GNNs, the graph Fourier transform projects graph signals onto the eigenspace of the graph and is leveraged to analyze graph convolutional filters [12].
Therefore, we start by formalizing the notion of the coVariance Fourier transform (abbreviated as VFT). For this purpose, we leverage the eigendecomposition of Ĉ_n in (2).

Definition 1 (coVariance Fourier Transform). Consider a sample covariance matrix Ĉ_n as defined in (1). The coVariance Fourier transform (VFT) of a random sample x is defined as its projection on the eigenspace of Ĉ_n and is given by

\tilde{x} = U^T x .   (4)

The i-th entry of x̃, i.e., [x̃]_i, represents the i-th Fourier coefficient and is associated with the eigenvalue w_i.

Note that the similarity between PCA and the VFT implies that the eigenvalue w_i encodes the variability of the dataset x̂_n in the direction of the principal component u_i. In this context, the eigenvalues of the covariance matrix are the mathematical equivalent of the notion of graph frequencies in graph signal processing [11]. Next, we define the notion of coVariance filters (VF), which are polynomials in the covariance matrix.

Definition 2 (coVariance Filter). Given a set of real-valued parameters {h_k}_{k=0}^{m}, the coVariance filter for the covariance matrix Ĉ_n is defined as

H(\hat{C}_n) = \sum_{k=0}^{m} h_k \hat{C}_n^k .   (5)

The output of the coVariance filter H(Ĉ_n) for an input x is given by

z = \sum_{k=0}^{m} h_k \hat{C}_n^k x = H(\hat{C}_n) x .   (6)

The coVariance filter H(Ĉ_n) follows analytic concepts similar to those of the well-studied graph convolutional filters for combining information across different neighborhoods [12]. Moreover, the filter is defined by the parameters {h_k}_{k=0}^{m}; therefore, for the ensemble covariance matrix C, the coVariance filter is given by H(C). On taking the VFT of the output in (6) and leveraging (2), we have

\tilde{z} = U^T z = U^T \sum_{k=0}^{m} h_k [U W U^T]^k x = \sum_{k=0}^{m} h_k W^k \tilde{x} ,   (7)

where (7) holds from the orthonormality of the eigenvectors and the definition of the VFT in (4). Using (7), we can further define the frequency response of the coVariance filter over the covariance matrix Ĉ_n in the domain of its principal components as

h(w_i) = \sum_{k=0}^{m} h_k w_i^k ,   (8)

such that, from (7) and (8), the i-th element of z̃ satisfies

[\tilde{z}]_i = h(w_i) [\tilde{x}]_i .   (9)

Equation (9) reveals that performing the coVariance filtering operation boils down to processing (e.g., amplifying or attenuating) the principal components of the data. This observation draws an analogy with linear time-invariant systems in signal processing, where the different frequency modes (in this case, principal components) can be processed separately using coVariance filters, in a way determined by the frequency response values h(w_i). For instance, using a narrowband coVariance filter whose frequency response is h(λ) = 1 if λ = w_i and h(λ) = 0 otherwise, we recover the score corresponding to the projection of x on u_i, i.e., the i-th principal component of Ĉ_n. Therefore, there exist filterbanks of narrowband coVariance filters that enable the recovery of the PCA transformation. This observation is formalized in Theorem 1.

Theorem 1 (coVariance Filter Implementation of PCA). Given a covariance matrix Ĉ_n with the eigendecomposition in (2), if the PCA transformation of an input x is given by y = U^T x, there exists a filterbank of coVariance filters {H_i(Ĉ_n) : i ∈ {1, ..., m}}, such that the score of the projection of the input x on the eigenvector u_i can be recovered by the application of a coVariance filter H_i(Ĉ_n) as

[y]_i = u_i^T H_i(\hat{C}_n) x ,   (10)

where the frequency response h_i(λ) of the filter H_i(Ĉ_n) is given by

h_i(\lambda) = \begin{cases} \eta_i , & \text{if } \lambda = w_i , \\ 0 , & \text{otherwise} . \end{cases}   (11)
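The relation (9) and the narrowband construction behind Theorem 1 can be checked numerically. The following NumPy sketch (ours; the taps and data are arbitrary, and the narrowband filter is applied spectrally for brevity rather than built as an explicit polynomial) applies a coVariance filter and verifies both properties.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 500
X = rng.normal(size=(m, n))
Xc = X - X.mean(axis=1, keepdims=True)
C_hat = Xc @ Xc.T / (n - 1)                      # sample covariance matrix as in (1)

w, U = np.linalg.eigh(C_hat)
w, U = w[::-1], U[:, ::-1]                       # eigenvalues in descending order

def covariance_filter(C, h, x):
    """Apply H(C)x = sum_k h[k] C^k x as in (5)-(6), without forming C^k explicitly."""
    out, Ckx = np.zeros_like(x), x.copy()
    for hk in h:
        out += hk * Ckx
        Ckx = C @ Ckx
    return out

h = np.array([0.5, 0.3, 0.2])                    # filter taps h_0, h_1, h_2 (arbitrary)
x = rng.normal(size=m)
z = covariance_filter(C_hat, h, x)

# Check (9): in the VFT domain, each coefficient is scaled by h(w_i) = sum_k h_k w_i^k.
x_tilde, z_tilde = U.T @ x, U.T @ z
freq_response = np.polyval(h[::-1], w)           # h(w_i) evaluated at each eigenvalue
assert np.allclose(z_tilde, freq_response * x_tilde)

# Theorem 1 (with eta_i = 1): a narrowband filter isolates the i-th principal component score.
i = 2
H_i = U[:, [i]] @ U[:, [i]].T                    # spectral response 1 at w_i, 0 elsewhere
assert np.isclose(U[:, i] @ (H_i @ x), U[:, i] @ x)
print("filter checks passed")
```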
Theorem 1 establishes an equivalence between processing data samples with PCA and processing data samples with a specific polynomial of the covariance matrix. As we shall see in subsequent sections, processing with a polynomial of the covariance matrix has advantages in terms of stability with respect to perturbations in the sample covariance matrix. The design of the frequency responses of the different coVariance filters sufficient to implement the PCA transformation according to Theorem 1 is shown figuratively in Appendix D. If the PCA-based data transformation is to be followed by dimensionality reduction or statistical learning tasks such as regression or classification, the coVariance filter-based PCA can be coupled with the post-hoc analysis model to obtain an end-to-end framework that enables the optimization of the parameters η_i in the frequency responses of the coVariance filters.

3 coVariance Neural Networks (VNN)

In this section, we propose the coVariance Neural Network (VNN), which provides an end-to-end, non-linear, parametric mapping from the input data x to any generic objective r and is defined as

r = \Phi(x; \hat{C}_n, \mathcal{H}) ,   (12)

for a sample covariance matrix Ĉ_n, where H is the set of filter coefficients that characterize the representation space defined by the mapping Φ(·). The VNN Φ(x; Ĉ_n, H) may be formed by multiple layers, where each layer consists of two main components: i) a filter bank made of VFs similar to that in (6); and ii) a pointwise non-linear activation function (such as ReLU or tanh). Therefore, in principle, the architecture of VNNs is similar to that of graph neural networks with the covariance matrix Ĉ_n as the graph shift operator [12]. We next define the coVariance perceptron, which forms the building block of a VNN and is equivalent to a 1-layer VNN.

Definition 3 (coVariance Perceptron). Consider a dataset with the sample covariance matrix Ĉ_n. For a given non-linear activation function σ(·), input x, a coVariance filter H(Ĉ_n) = \sum_{k=0}^{m} h_k \hat{C}_n^k and its corresponding coefficient set H, the coVariance perceptron is defined as

\Phi(x; \hat{C}_n, \mathcal{H}) = \sigma(H(\hat{C}_n) x) .   (13)

The VNN can be constructed by cascading multiple layers of coVariance perceptrons (shown in Appendix D in the Supplementary file). Note that the non-linear activation functions across different layers allow for non-linear transformations, thus increasing the expressiveness of VNNs beyond linear mappings such as coVariance filters. Furthermore, similar to GNNs, we can further increase the representation power of VNNs by incorporating multiple parallel inputs and outputs per layer, enabled by filter banks at every layer [12, 29]. In this context, we remark that the definition of the one-layer perceptron in (13) can be expanded as follows.

Remark 1 (coVariance Perceptron with Filter Banks). Consider a coVariance perceptron with F_in m-dimensional inputs and F_out m-dimensional outputs. Denote the input at the perceptron by x_in = [x_in[1], ..., x_in[F_in]] and the output at the perceptron by x_out = [x_out[1], ..., x_out[F_out]]. Each input x_in[g], g ∈ {1, ..., F_in}, is processed by F_out coVariance filters in parallel. For f ∈ {1, ..., F_out}, the f-th output in x_out is given by

x_{\text{out}}[f] = \sigma\left( \sum_{g=1}^{F_{\text{in}}} H_{fg}(\hat{C}_n)\, x_{\text{in}}[g] \right) = \Phi(x_{\text{in}}; \hat{C}_n, \mathcal{H}_f) ,   (14)

where H_f is the set of all filter coefficients of the coVariance filters [H_{fg}(Ĉ_n)]_g in (14). Therefore, the VNN with the filter bank implementation deploys F_in × F_out VFs in a layer defined by the coVariance perceptron in Remark 1.
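A minimal sketch of a coVariance perceptron with a filter bank, assuming PyTorch, is given below. It mirrors the layer shapes used later in Section 5.1 (13 outputs per layer, 2 filter taps) but is our own illustration rather than the authors' implementation; note that the learnable taps do not depend on the data dimension m.

```python
import torch
import torch.nn as nn

class CoVariancePerceptron(nn.Module):
    """One VNN layer as in Remark 1: an F_in x F_out bank of coVariance filters with
    num_taps taps each, followed by a pointwise non-linearity (ReLU here)."""

    def __init__(self, f_in, f_out, num_taps):
        super().__init__()
        # h[k, g, f]: tap k of the filter H_fg mapping input feature g to output feature f
        self.h = nn.Parameter(0.1 * torch.randn(num_taps, f_in, f_out))

    def forward(self, x, C):
        # x: (batch, f_in, m) input features; C: (m, m) sample covariance matrix
        out = 0.0
        Ckx = x
        for k in range(self.h.shape[0]):
            # accumulate sum_g h[k, g, f] * (C^k x)[g] over the taps k, as in (14)
            out = out + torch.einsum('bgm,gf->bfm', Ckx, self.h[k])
            Ckx = Ckx @ C            # apply C along the node dimension (C is symmetric)
        return torch.relu(out)

# Toy usage with random data standing in for an m-dimensional dataset.
m, n = 104, 307
X = torch.randn(n, m)
Xc = X - X.mean(dim=0, keepdim=True)
C_hat = Xc.T @ Xc / (n - 1)

layer1 = CoVariancePerceptron(f_in=1, f_out=13, num_taps=2)
layer2 = CoVariancePerceptron(f_in=13, f_out=13, num_taps=2)
x = torch.randn(8, 1, m)                     # a batch of 8 inputs with one feature each
y = layer2(layer1(x, C_hat), C_hat)          # shape (8, 13, m)
age_estimate = y.mean(dim=(1, 2))            # unweighted-mean readout, as used in Section 5
print(age_estimate.shape)                    # torch.Size([8])
```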
Next, we note that the definitions of the coVariance filter in Section 2 and the VNN are with respect to the sample covariance matrix Ĉ_n. However, due to finite sample size effects, the sample covariance matrix is a perturbed version of the ensemble covariance matrix C. PCA-based approaches that rely on the eigendecomposition of the sample covariance matrix can potentially be highly sensitive to such perturbations. Specifically, if small changes are made to the dataset x̂_n, certain ill-defined eigenvalues and eigenvectors can induce instability in the outcomes of PCA-based statistical learning models [17]. Therefore, it is desirable that the designs of coVariance filters and VNNs not be sensitive to random perturbations in the sample covariance matrix relative to the ensemble covariance matrix C. In this context, we study the stability of coVariance filters and VNNs in the next section.

4 Stability Analysis

We start with the stability analysis of coVariance filters, which will also be instrumental in establishing the stability of VNNs. Our results leverage the eigendecomposition of C, given by

C = V \Lambda V^T ,   (15)

where V ∈ R^{m×m} is the matrix of eigenvectors, such that V = [v_1, ..., v_m], and Λ = diag(λ_1, ..., λ_m) is the diagonal matrix of eigenvalues, such that λ_1 ≥ ... ≥ λ_m.

4.1 Stability of coVariance Filters

To study the stability of coVariance filters, we analyze the effect of the statistical uncertainty induced in the sample covariance matrix with respect to the ensemble covariance matrix on the coVariance filter output. To this end, we compare the outputs H(C)x and H(Ĉ_n)x for any random instance x of X. Without loss of generality, we present our results over instances of X where ‖X‖ ≤ 1. We also consider the following assumption on the frequency response.

Assumption: The frequency response of the filter H(C) satisfies

|h(\lambda_i) - h(\lambda_j)| \leq \frac{M}{k_i} |\lambda_i - \lambda_j| , \quad \text{where } k_i = \sqrt{\mathbb{E}[\|XX^T v_i\|^2] - \lambda_i^2} ,   (16)

for some constant M > 0 and for all non-zero eigenvalues λ_i ≠ λ_j, i, j ∈ {1, ..., m} of C. Here, k_i is a measure of the kurtosis of the distribution of X.

Next, we provide the result that establishes the stability of coVariance filters in Theorem 2. The impact of the assumption in (16) on filter stability is discussed subsequently. To present the results in Theorem 2, we also use the following definitions:

k_{\min} = \min_{i \in \{1,\dots,m\},\, \lambda_i > 0} k_i \quad \text{and} \quad \kappa = \max_{i,j : \lambda_i \neq \lambda_j} \frac{k_i^2}{|\lambda_i - \lambda_j|} .   (17)

Theorem 2 (Stability of coVariance Filter). Consider a random vector X ∈ R^{m×1}, such that its corresponding covariance matrix is given by C = E[(X − E[X])(X − E[X])^T]. For a sample covariance matrix Ĉ_n formed using n i.i.d. instances of X and a random instance x of X, such that ‖x‖ ≤ 1, and under the assumption in (16), the following holds with probability at least 1 − n^{−2ε} − 2κm/n for any ε ∈ (0, 1/2]:

\big\| H(\hat{C}_n) - H(C) \big\| = \frac{M}{n^{1/2 - \varepsilon}} \, O\!\left( \sqrt{m} + \sqrt{\|C\| \log(mn)} \right) .   (18)

Proof. See Appendix C.

The right-hand side of (18) and the conditions in the assumption in (16) are obtained from an analysis of the finite-sample-size-driven perturbations in Ĉ_n and its eigenvectors with respect to those of C. From Theorem 2, we note that ‖H(Ĉ_n) − H(C)‖ decays with the number of samples n at least at the rate of 1/n^{1/2−ε}. Thus, we conclude that the stability of the coVariance filter improves as the number of samples n increases. This observation is along expected lines, as the estimate Ĉ_n becomes closer to the ensemble covariance matrix C by virtue of the law of large numbers.
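The qualitative content of Theorem 2, namely that the filter perturbation shrinks as n grows, can be checked numerically. The sketch below is ours; a Gaussian X with a known ensemble covariance C is assumed purely for illustration (it does not satisfy the bounded-norm condition of the theorem).

```python
import numpy as np

rng = np.random.default_rng(2)
m = 20
A = rng.normal(size=(m, m))
C = A @ A.T / m + 0.1 * np.eye(m)                # an assumed ensemble covariance matrix
h = np.array([0.5, 0.3, 0.2])                    # fixed filter taps

def H_apply(M_, x):
    """Apply the coVariance filter sum_k h[k] M_^k to x."""
    out, Mx = np.zeros_like(x), x.copy()
    for hk in h:
        out += hk * Mx
        Mx = M_ @ Mx
    return out

x = rng.normal(size=m)
x /= np.linalg.norm(x)                           # ||x|| <= 1 as in Theorem 2
ref = H_apply(C, x)

for n in [50, 200, 1000, 5000, 20000]:
    X = rng.multivariate_normal(np.zeros(m), C, size=n)   # n i.i.d. samples of X
    Xc = X - X.mean(axis=0, keepdims=True)
    C_hat = Xc.T @ Xc / (n - 1)
    err = np.linalg.norm(H_apply(C_hat, x) - ref)
    print(f"n = {n:6d}   ||H(C_hat)x - H(C)x|| = {err:.4f}")   # shrinks roughly like 1/sqrt(n)
```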
Next, we briefly discuss two aspects of the assumption in (16). The upper bound in (16) controls the variability of the frequency response h(λ) with respect to λ. For any pair of eigenvalues λ_i and λ_j of C, this variability is tied to the eigengap |λ_i − λ_j| and the factor k_i. We discuss this next.

Discriminability between close eigenvalues: Firstly, for a given k_i and eigenvalue λ_i, the response of the filter for any eigenvalue λ_j, j ≠ i, becomes closer to h(λ_i) as |λ_i − λ_j| decreases. From the perturbation theory of eigenvectors and eigenvalues of sample covariance matrices, we know that the sample-based estimates of the eigenspaces corresponding to eigenvalues λ_i and λ_j become harder to distinguish as |λ_i − λ_j| decreases [30]. Therefore, a coVariance filter that satisfies (16) sacrifices discriminability between close eigenvalues to preserve its stability with respect to the statistical uncertainty inherent in the sample covariance matrix.

Stability with respect to kurtosis and estimation quality: Since E[‖XX^T v_i‖²] ≤ E[‖X‖⁴], the factor k_i is tied to the kurtosis of the underlying distribution of X in the direction of v_i. Distributions with high kurtosis tend to have heavier tails and more outliers. Therefore, a smaller k_i indicates that the distribution of X has a fast-decaying tail in the direction of v_i, which allows for a more accurate estimation of λ_i and v_i. We refer the reader to [30, Section 4.1.3] for additional details. In the context of coVariance filters, we note that the upper bound in (16) is more liberal if λ_i is associated with a smaller k_i or, equivalently, if the distribution of X has a smaller kurtosis in the direction of v_i. This observation implies that if λ_i and v_i are easier to estimate, the frequency response for λ_j in the vicinity of λ_i is less constrained.

4.2 Stability of coVariance Neural Networks

The stability of VNNs is analyzed by comparing Φ(x; Ĉ_n, H) and Φ(x; C, H). Note that stable graph convolutional filters imply the stability of GNNs for different perturbation models [19]. In VNNs, the perturbations are derived from the finite sample effect in the sample covariance matrix, which is distinct from the data-independent perturbation models considered in the existing literature on GNNs, and this allows us to relate the sample size n to VNN stability. We formalize the stability of VNNs under the assumption of stable coVariance filters in Theorem 3. For this purpose, we consider a VNN Φ(·; ·, H) with L layers and F m-dimensional inputs and outputs per layer, with the filter bank given by H = {H^ℓ_{fg}}, f, g ∈ {1, ..., F}, ℓ ∈ {1, ..., L}.

Theorem 3 (Stability of VNN). Consider a sample covariance matrix Ĉ_n and the ensemble covariance matrix C. Given a bank of coVariance filters {H^ℓ_{fg}}, such that |h^ℓ_{fg}(λ)| ≤ 1, and a pointwise non-linearity σ(·), such that |σ(a) − σ(b)| ≤ |a − b|, if the coVariance filters satisfy

\big\| H^{\ell}_{fg}(\hat{C}_n) - H^{\ell}_{fg}(C) \big\| \leq \alpha_n ,   (19)

for some α_n > 0, then we have

\big\| \Phi(x; \hat{C}_n, \mathcal{H}) - \Phi(x; C, \mathcal{H}) \big\| \leq L F^{L-1} \alpha_n .   (20)

The proof of Theorem 3 follows from [12, Appendix E] for any generic α_n. The parameter α_n represents the finite sample effect on the perturbations in Ĉ_n with respect to C. From Theorem 2, we note that α_n scales as 1/n^{1/2−ε} with the number of samples n for coVariance filters whose frequency response depends on the eigengap and the kurtosis of the underlying distribution of the data as in (16). Furthermore, the stability of a VNN decreases with an increase in the number of m-dimensional inputs and outputs per layer F and the number of layers L, which is consistent with the stability properties of GNNs. Therefore, our results present a more holistic perspective on the stability of VNNs than is possible for generic GNNs.
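The corresponding end-to-end behavior can be illustrated in the same way. The sketch below (ours; a toy two-layer ReLU VNN with fixed random taps and an assumed Gaussian model) shows the output perturbation between Φ(x; Ĉ_n, H) and Φ(x; C, H) shrinking as n grows, as Theorems 2 and 3 together predict.

```python
import numpy as np

rng = np.random.default_rng(4)
m, F, L, K = 15, 3, 2, 2
A = rng.normal(size=(m, m))
C = A @ A.T / m                                      # assumed ensemble covariance (illustrative)
taps = 0.1 * rng.normal(size=(L, F, F, K))           # fixed filter bank shared by both models

def filt(Cmat, h, x):                                # H(C)x = sum_k h[k] C^k x
    out, Cx = np.zeros_like(x), x.copy()
    for hk in h:
        out, Cx = out + hk * Cx, Cmat @ Cx
    return out

def vnn(Cmat, x):                                    # x: (F, m); layers of perceptrons as in (14)
    for l in range(L):
        x = np.maximum(0.0, np.stack(
            [sum(filt(Cmat, taps[l, f, g], x[g]) for g in range(F)) for f in range(F)]))
    return x

x0 = rng.normal(size=(F, m))
ref = vnn(C, x0)
for n in [50, 200, 1000, 5000, 20000]:
    X = rng.multivariate_normal(np.zeros(m), C, size=n)
    Xc = X - X.mean(axis=0, keepdims=True)
    C_hat = Xc.T @ Xc / (n - 1)
    print(n, np.linalg.norm(vnn(C_hat, x0) - ref))   # output perturbation shrinks as n grows
```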
Remark 2 (Computational Complexity of VNN). For a coVariance perceptron defined in (14), the computational cost is given by O(m²TF_in F_out), where T ≤ m is the maximum number of filter taps in its associated filter bank.

From Remark 2, we note that the computational complexity of VNNs can be prohibitive for large m. However, oftentimes sparsity is imposed as a regularization to estimate high-dimensional correlation matrices; see, e.g., [31]. As a result, the computational complexity becomes O(|E|TF_in F_out), where |E| is the number of non-zero correlations (edges in the covariance graph) and can be markedly smaller than m². Since the VNN architecture is analogous to that of GNNs, the property of GNN transferability across graphs of different sizes (see [6]) can also establish the scalability of VNNs to high-dimensional datasets in settings where multi-resolution datasets are available. In the next section, we empirically validate our theoretical results on the stability of VNNs. Moreover, on a set of multi-resolution datasets, we also empirically evaluate VNNs for transferability.

5 Experiments

In this section, we discuss our experiments on different datasets. Primarily, we evaluate VNNs on a regression problem on different neuroimaging datasets curated at the University of Pennsylvania, where we regress human chronological age (time since birth) against cortical thickness data. Cortical thickness is a measure of gray matter width, and it is well established that cortical thickness varies with healthy aging [32]. Additional details on the neurological utility of this experiment are included in Appendix E. Brief descriptions of these datasets are provided next.

ABC Dataset: The ABC dataset consists of cortical thickness data collected from a heterogeneous population of n = 341 subjects (mean age = 68.1 years, standard deviation = 13) that comprises healthy adults and subjects with mild cognitive impairment or Alzheimer's disease. For each individual, joint-label fusion [33] was used to quantify mean cortical thickness in m = 104 anatomical parcellations. Therefore, for every subject, we have a 104-dimensional vector whose entries correspond to the cortical thickness in distinct brain regions.

Multi-resolution FTDC Datasets: The FTDC datasets consist of cortical thickness data from n = 170 healthy subjects (mean age = 64.26 years, standard deviation = 8.26). For each subject, the cortical thickness data is extracted according to a multi-resolution Schaefer parcellation [34] at 100 parcel, 300 parcel, and 500 parcel resolutions. Therefore, for each subject, we have cortical thickness data consisting of m = 100, m = 300, or m = 500 features, with a higher number of features providing the cortical thickness data of a brain at a finer resolution. We leverage the different resolutions of data available to form three datasets: FTDC100, FTDC300, and FTDC500, which are the cortical thickness datasets corresponding to the 100, 300, and 500 feature resolutions, respectively.

Our primary objective is to illustrate the stability and transferability of VNNs, which also imply advantages over traditional PCA-based approaches.
Hence, we use PCA-based regression as the primary baseline for comparison against VNNs. We use the ABC dataset to demonstrate the higher stability of VNNs over PCA-based regression models in Section 5.1. In Section 5.2, we use the FTDC datasets to demonstrate the transferability of VNN performance across the multi-resolution datasets without re-training. The experiments in Section 5.2 clearly lie beyond the scope of PCA-based statistical learning models. Details on data and code availability for the experiments are included in Appendix A.

5.1 Stability against Perturbations in the Sample Covariance Matrix

In this section, we evaluate the stability or robustness of the trained VNN and PCA-regression models against perturbations in the sample covariance matrix used in training. To this end, we first train nominal VNN and PCA-regression models. The effects of perturbations in the sample covariance matrix on the performance of the nominal models are evaluated subsequently. We first describe the experiment designs for the nominal models based on VNN and PCA-regression.

VNN Experiments: We randomly split the ABC dataset into a 90/10 train/test split. The sample covariance matrix is formed from the 307 samples in the training set, i.e., we have Ĉ_307 of size 104 × 104. The VNN consists of 2 layers with 2 filter taps each, a filter bank of 13 m-dimensional outputs per layer for the m = 104 dimensions of the input data, and a readout layer that calculates the unweighted mean of the outputs at the last layer to form an estimate for age. The hyperparameters for the VNN architecture and the learning rate of the optimizer in this experiment and all subsequent VNN experiments in this section are chosen by a hyperparameter optimization framework called Optuna [35]. The training set is randomly subdivided into subsets of 273 and 34 samples, such that the VNN is trained with respect to the mean squared error loss between the predicted age and the true age on the 273 samples. The loss is optimized using batch stochastic gradient descent with the Adam optimizer available in the PyTorch library [36] for up to 100 epochs. The learning rate for the optimizer is set to 0.0151. The VNN model with the best minimum mean squared error performance on the remaining 34 samples (which act as a validation set) is included in the set of nominal models for this permutation of the training set.

PCA-regression: The PCA-regression pipeline consists of two steps: i) we first identify the principal components using the eigendecomposition of the sample covariance matrix Ĉ_307; and then, ii) to maintain consistency with the VNN, we transform the 273 samples from the training set used for VNN training and fit the transformed data to the corresponding age data using a regression model. Regression is implemented using the sklearn package in Python [37] with linear and radial basis function ('rbf') kernels. PCA-regression with an rbf kernel enables us to accommodate non-linear relationships between cortical thickness and age. PCA-regression with a linear kernel in the regression model is referred to as PCA-LR, and that with an rbf kernel in the regression model is referred to as PCA-rbf. The optimal number of principal components in the PCA-regression pipeline is selected through a 10-fold cross-validation procedure on the training set, repeated 5 times.

For 100 random permutations of the training set in the VNN and PCA-regression experiments, we form a set of 100 nominal models and evaluate their stability.
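For reference, a minimal PCA-regression baseline along these lines could be assembled as follows (ours, not the authors' pipeline; the use of KernelRidge for the rbf regressor and the synthetic stand-in data are assumptions).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(5)
n, m = 307, 104
X_train = rng.normal(size=(n, m))                 # stand-in for cortical thickness features
y_train = rng.uniform(40, 90, size=n)             # stand-in for age

# Step i) PCA on the training covariance; step ii) regression on the PCA scores.
pipe = Pipeline([("pca", PCA()), ("reg", KernelRidge(kernel="rbf"))])
grid = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [5, 10, 20, 50]},   # candidate numbers of components
    cv=RepeatedKFold(n_splits=10, n_repeats=5, random_state=0),
    scoring="neg_mean_absolute_error",
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)       # selected components and cross-validated MAE
```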
To evaluate the stability of the VNN, we replace the sample covariance matrix Ĉ_307 with Ĉ_n for n ∈ [5, 341], n ≠ 307. For the PCA-regression models, we re-evaluate the principal components corresponding to Ĉ_n to transform the training data while keeping the regression model learnt for the PCA transformation from Ĉ_307 fixed. Clearly, Ĉ_n will be perturbed with respect to Ĉ_307 due to the finite sample size effect. For each nominal model, we evaluate the model performance in terms of the mean absolute error (MAE) and the correlation between predicted age and true age for the training set and the test set.

Figure 1: Stability of regression performance for VNN and PCA-regression models trained on Ĉ_307 formed from the ABC dataset. Panels (a) and (b) illustrate the performance variation on the training set for the VNN and PCA-regression models when the sample covariance matrix Ĉ_307 is perturbed by addition or removal of samples (n = 307 marked by a blue arrow in (a)). Panels (c)-(f) correspond to the variation in performance in terms of MAE and correlation between predicted age and true age on the test set only. Panels (c) and (d) illustrate the variation in mean MAE performance of the VNN and PCA-regression models trained using Ĉ_307 on the test set (n = 307 marked by a blue arrow in (c)), with panel (d) zooming in on the range 290-341 with error bars included. Panel (e) illustrates the variation in the correlation between true age and predicted age for the VNN and PCA-regression models, with panel (f) zooming in on the range 290-341 with error bars included.

Figures 1 a)-b) illustrate the variation in MAE performance with perturbations to Ĉ_307 on the training set. Figures 1 c)-f) correspond to similar results on the test set for MAE and correlation. The mean performance in terms of MAE over the 100 nominal models is marked by a blue arrow in Fig. 1 a) for the training set and in Fig. 1 c) for the test set. For the PCA-regression models, we observe that both training and test performance in terms of MAE degrade significantly when the sample covariance matrix is perturbed from Ĉ_307 by removing or adding even a small number of samples (also seen in Fig. 1 b) and d), which focus on the range of 290-341 samples from Fig. 1 a) and c), respectively, and include error bars). In contrast, the VNN performance on both the training and test sets is stable with respect to perturbations to Ĉ_307, as suggested by our theoretical results. However, as the number of samples decreases to n < 50, we observe a significant decrease in VNN stability in Fig. 1 a) and c). We also observe that the correlation between the predicted age and the true age in the test set for the VNN is consistently more stable than that for the PCA-regression models over the entire range of samples evaluated (Fig. 1 e)). Moreover, Fig. 1 e) also demonstrates that in the range of n < 50 samples, there is a sharper decline in the correlation for the PCA-regression models than for the VNN, despite the PCA-regression models having a smaller MAE than the VNN in this range. Thus, our experiments demonstrate the stability of VNNs while also illustrating that PCA-regression models may be overfit on the principal components of the sample covariance matrix used in training. Additional experiments on synthetic data and the FTDC datasets also illustrate the stability of VNNs (see Appendix E).
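A schematic version of the perturbation protocol above, applied to the PCA-regression baseline, is sketched below (ours; synthetic data and a plain linear regressor are assumptions). The regression weights learned on the scores of the nominal covariance matrix are kept fixed while the principal components are recomputed from Ĉ_n, which is the mechanism behind the instability observed in Fig. 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n_full, m, p = 307, 104, 10                       # training samples, features, components kept
X = rng.normal(size=(n_full, m))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n_full)   # synthetic "age" target

def top_eigvecs(Xmat, p):
    """Principal components (top-p eigenvectors) of the sample covariance of Xmat."""
    Xc = Xmat - Xmat.mean(axis=0, keepdims=True)
    C = Xc.T @ Xc / (Xmat.shape[0] - 1)
    w, U = np.linalg.eigh(C)
    return U[:, np.argsort(w)[::-1][:p]]

U_nominal = top_eigvecs(X, p)
reg = LinearRegression().fit(X @ U_nominal, y)    # nominal model: frozen after this point

for n in [50, 150, 250, 307]:
    U_n = top_eigvecs(X[:n], p)                   # principal components of the perturbed C_hat_n
    mae = np.mean(np.abs(reg.predict(X @ U_n) - y))
    print(f"n = {n:3d}   training MAE with frozen regressor = {mae:.2f}")
```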
5.2 Transferability

We evaluate transferability for VNN models trained on the FTDC datasets across different resolutions. For this purpose, the VNN is trained at a specific resolution, and its sample covariance matrix is replaced by the sample covariance matrix at a different resolution for testing. The training of VNNs in this context follows a procedure similar to that described in Section 5.1 and results in 100 random VNN models for different permutations of the training set. For FTDC500, the VNN model consists of 1 layer with a filter bank of 27 m-dimensional outputs per layer for the m = 500 dimensions of the input data, and 2 filter taps in the layer. The learning rate for the Adam optimizer is set to 0.008. For FTDC300, the VNN model consists of a 2-layer architecture, with 44 m-dimensional outputs per layer for the m = 300 dimensions in both layers and 2 filter taps in each layer. The learning rate for the Adam optimizer is set to 0.0033. For FTDC100, the VNN model consists of a 2-layer architecture, with 93 m-dimensional outputs per layer for the m = 100 dimensions in both layers and 2 filter taps in each layer. The learning rate for the optimizer is set to 0.0047. The readout layer in each model evaluates the unweighted mean of the outputs of the final layer to form an estimate for age.

Figure 2: Transferability for VNNs across FTDC datasets. (Panel (a): MAE; panel (b): correlation between true age and predicted age. Rows indicate the training dataset and columns the test dataset, over FTDC100, FTDC300, and FTDC500.)

We tabulate the MAE in the matrix in Fig. 2 a) and the correlation between true age and predicted age in the matrix in Fig. 2 b), where the row ID indicates the dataset on which the VNN was trained and the column ID indicates the dataset on which the VNN was tested to evaluate the transferability of VNN performance across datasets with different resolutions. For instance, the element at coordinates (1, 2) in Fig. 2 a) represents the MAE evaluated on the complete FTDC300 dataset (m = 300) for VNNs trained on the FTDC100 dataset (m = 100). The results in Fig. 2 show that the performance of VNNs in terms of MAE and correlation between predicted age and true age can be transferred across different resolutions in the FTDC datasets. Note that this experiment is not feasible for PCA-regression models, where the principal components and the regression model would need to be re-evaluated for data from a different resolution.
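The mechanism that makes this transfer possible is that the learned filter taps h_k do not depend on m, so a trained filter bank can be applied directly to a covariance matrix of a different dimension. A minimal sketch of this idea (ours; random data stand in for the FTDC resolutions, and the taps are arbitrary) is given below.

```python
import numpy as np

rng = np.random.default_rng(7)
h = np.array([0.4, 0.3])                          # "learned" taps h_0, h_1: independent of m

def filter_output(X, h):
    """Apply H(C_hat) = sum_k h[k] C_hat^k to every sample and return the mean readout."""
    Xc = X - X.mean(axis=0, keepdims=True)
    C_hat = Xc.T @ Xc / (X.shape[0] - 1)
    out = sum(hk * X @ np.linalg.matrix_power(C_hat, k) for k, hk in enumerate(h))
    return out.mean(axis=1)                       # unweighted mean over the m coordinates

# The same taps h are reused with covariance matrices of different dimensions,
# mimicking evaluation of a model trained at one resolution on another resolution.
for m in [100, 300, 500]:
    X = rng.normal(size=(170, m))                 # stand-in for an FTDC dataset at resolution m
    print(m, filter_output(X, h)[:3])             # per-subject readouts; no retraining needed
```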
6 Conclusions

In this paper, we have introduced coVariance neural networks (VNN) that operate on covariance matrices and whose architecture is derived from graph neural networks. We have studied the stability properties of VNNs for sample covariance matrices and shown that the stability of VNNs improves with the number of data samples n. Our experiments on real datasets have demonstrated that VNNs are significantly more stable than PCA-based approaches with respect to perturbations in the sample covariance matrix. Also, unlike PCA-based approaches, VNNs do not require the eigendecomposition of the sample covariance matrix. Furthermore, on a set of multi-resolution datasets, we have observed that VNN performance is also transferable across cortical thickness data collected at multiple resolutions without re-training.

Acknowledgements

This research was supported by the National Science Foundation under grants CCF-1750428 and CCF-1934962. The ABC dataset was provided by the Penn Alzheimer's Disease Research Center (ADRC; NIH AG072979) at the University of Pennsylvania. The MRI data for the FTDC datasets were provided by the Penn Frontotemporal Degeneration Center (NIH AG066597). Cortical thickness data were made available by the Penn Image Computing and Science Lab at the University of Pennsylvania.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[2] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2008.
[3] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4-24, 2020.
[4] Isha Garg, Priyadarshini Panda, and Kaushik Roy. A low effort approach to structured CNN design using PCA. IEEE Access, 8:1347-1360, 2019.
[5] Mohammad-Parsa Hosseini, Senbao Lu, Kavin Kamaraj, Alexander Slowikowski, and Haygreev C Venkatesh. Deep learning architectures. In Deep Learning: Concepts and Architectures, pages 1-24. Springer, 2020.
[6] Luana Ruiz, Luiz Chamon, and Alejandro Ribeiro. Graphon neural networks and the transferability of graph neural networks. Advances in Neural Information Processing Systems, 33:1702-1712, 2020.
[7] Ian T Jolliffe and Jorge Cadima. Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016.
[8] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12):5017-5032, 2015.
[9] Muhammad Umer, Zainab Imtiaz, Saleem Ullah, Arif Mehmood, Gyu Sang Choi, and Byung-Won On. Fake news stance detection using deep learning architecture (CNN-LSTM). IEEE Access, 8:156695-156706, 2020.
[10] Jonathon Shlens. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100, 2014.
[11] Antonio Ortega, Pascal Frossard, Jelena Kovačević, José M. F. Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808-828, 2018.
[12] Fernando Gama, Elvin Isufi, Geert Leus, and Alejandro Ribeiro. Graphs, convolutions, and neural networks: From graph filters to graph neural networks. IEEE Signal Processing Magazine, 37(6):128-138, 2020.
[13] E. D. Kolaczyk. Statistical Analysis of Network Data: Methods and Models. Springer, New York, NY, 2009.
[14] Alaa Bessadok, Mohamed Ali Mahjoub, and Islem Rekik. Graph neural networks in network neuroscience. arXiv preprint arXiv:2106.03535, 2021.
[15] Weiyu Huang, Antonio G. Marques, and Alejandro R. Ribeiro. Rating prediction via graph signal processing. IEEE Transactions on Signal Processing, 66(19):5066-5081, 2018. doi: 10.1109/TSP.2018.2864654.
[16] José Vinícius de Miranda Cardoso, Jiaxi Ying, and Daniel Perez Palomar. Algorithms for learning graphs in financial markets. arXiv preprint arXiv:2012.15410, 2020.
[17] Ian T Joliffe and BJT Morgan. Principal component analysis and exploratory factor analysis. Statistical Methods in Medical Research, 1(1):69-95, 1992.
[18] Eran Elhaik. Principal component analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Scientific Reports, 12(1):1-35, 2022.
[19] Fernando Gama, Joan Bruna, and Alejandro Ribeiro. Stability properties of graph neural networks. IEEE Transactions on Signal Processing, 68:5680-5695, 2020.
[20] Nicolas Keriven, Alberto Bietti, and Samuel Vaiter. Convergence and stability of graph convolutional networks on large random graphs. Advances in Neural Information Processing Systems, 33:21512-21523, 2020.
[21] Sergi Abadal, Akshay Jain, Robert Guirado, Jorge López-Alonso, and Eduard Alarcón. Computing graph neural networks: A survey from algorithms to accelerators. ACM Computing Surveys (CSUR), 54(9):1-38, 2021.
[22] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph convolutional networks: A comprehensive review. Computational Social Networks, 6(1):1-23, 2019.
[23] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[24] Yujia Li, Richard Zemel, Marc Brockschmidt, and Daniel Tarlow. Gated graph sequence neural networks. In Proceedings of ICLR'16, 2016.
[25] Luana Ruiz, Fernando Gama, and Alejandro Ribeiro. Graph neural networks: Architectures, stability, and transferability. Proceedings of the IEEE, 109(5):660-682, 2021.
[26] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1-37, 2011.
[27] Zihan Zhou, Xiaodong Li, John Wright, Emmanuel Candès, and Yi Ma. Stable principal component pursuit. In IEEE International Symposium on Information Theory, pages 1518-1522, 2010. doi: 10.1109/ISIT.2010.5513535.
[28] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. Advances in Neural Information Processing Systems, 23, 2010.
[29] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[30] Andreas Loukas. How close are the eigenvectors of the sample and actual covariance matrices? In International Conference on Machine Learning, pages 2228-2237. PMLR, 2017.
[31] Jacob Bien and Robert J Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807-820, 2011.
[32] Madhav Thambisetty, Jing Wan, Aaron Carass, Yang An, Jerry L Prince, and Susan M Resnick. Longitudinal changes in cortical thickness associated with normal aging. NeuroImage, 52(4):1215-1223, 2010.
[33] Hongzhi Wang, Jung W Suh, Sandhitsu R Das, John B Pluta, Caryne Craige, and Paul A Yushkevich. Multi-atlas segmentation with joint label fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(3):611-623, 2012.
[34] Alexander Schaefer, Ru Kong, Evan M Gordon, Timothy O Laumann, Xi-Nian Zuo, Avram J Holmes, Simon B Eickhoff, and BT Thomas Yeo. Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex, 28(9):3095-3114, 2018.
[35] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623-2631, 2019.
[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[37] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825-2830, 2011.
[38] Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 2013.
[39] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
[40] Dane Taylor, Juan G Restrepo, and François G Meyer. Ensemble-based estimates of eigenvector error for empirical covariance matrices. Information and Inference: A Journal of the IMA, 8(2):289-312, 2019.
[41] Xavier Mestre. On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices. IEEE Transactions on Signal Processing, 56(11):5353-5368, 2008.
[42] Hedieh Sajedi and Nastaran Pardakhti. Age prediction based on brain MRI image: A survey. Journal of Medical Systems, 43(8):1-30, 2019.
[43] SA Valizadeh, Jürgen Hänggi, Susan Mérillat, and Lutz Jäncke. Age prediction on the basis of brain anatomical measures. Human Brain Mapping, 38(2):997-1008, 2017.
[44] Habtamu M Aycheh, Joon-Kyung Seong, Jeong-Hyeon Shin, Duk L Na, Byungkon Kang, Sang W Seo, and Kyung-Ah Sohn. Biological brain age prediction using cortical thickness data: A large scale cohort study. Frontiers in Aging Neuroscience, 10:252, 2018.
[45] Jeffrey S Phillips, Fulvio Da Re, David J Irwin, Corey T McMillan, Sanjeev N Vaishnavi, Sharon X Xie, Edward B Lee, Philip A Cook, James C Gee, Leslie M Shaw, et al. Longitudinal progression of grey matter atrophy in non-amnestic Alzheimer's disease. Brain, 142(6):1701-1722, 2019.
[46] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] We discussed limitations of our work in Remark 2 and the discussion thereafter.
(c) Did you discuss any potential negative societal impacts of your work? [N/A]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] A data and code availability statement is available in Appendix A.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] The hyperparameters in the VNN architecture and the learning rate of the optimizer are chosen by a hyperparameter optimization framework called Optuna [35].
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] The experiments were run on a 12GB Nvidia GeForce RTX 3060 GPU.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] All participants in the ABC and FTDC datasets took part in an informed consent procedure approved by an Institutional Review Board convened at the University of Pennsylvania.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] The data contains no personally identifiable information or offensive content.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? All participants in the ABC and FTDC datasets received a monetary honorarium of $25-$40 per participant for participation.