# bayesian_spiked_laplacian_graphs__5b2e8aae.pdf Journal of Machine Learning Research 24 (2023) 1-35 Submitted 10/20; Revised 11/22; Published 1/23 Bayesian Spiked Laplacian Graphs Leo L Duan li.duan@ufl.edu Department of Statistics, University of Florida George Michailidis gmichail@ufl.edu Department of Statistics, University of Florida Mingzhou Ding mding@bme.ufl.edu Department of Biomedical Engineering, University of Florida Editor: Edo Airoldi In network analysis, it is common to work with a collection of graphs that exhibit heterogeneity. For example, neuroimaging data from patient cohorts are increasingly available. A critical analytical task is to identify communities, and graph Laplacian-based methods are routinely used. However, these methods are currently limited to a single network and also do not provide measures of uncertainty on the community assignment. In this work, we first propose a probabilistic network model called the Spiked Laplacian Graph that considers an observed network as a transform of the Laplacian and degree matrices of the network generating process, with the Laplacian eigenvalues modeled by a modified spiked structure. This effectively reduces the number of parameters in the eigenvectors, and their sign patterns allow efficient estimation of the underlying community structure. Further, the posterior distribution of the eigenvectors provides uncertainty quantification for the community estimates. Second, we introduce a Bayesian non-parametric approach to address the issue of heterogeneity in a collection of graphs. Theoretical results are established on the posterior consistency of the procedure and provide insights on the trade-off between model resolution and accuracy. We illustrate the performance of the methodology on synthetic data sets, as well as a neuroscience study related to brain activity in working memory. Keywords: Isoperimetric Constant, Mixed-Effect Eigendecomposition, Normalized Graph Cut, Stiefel Manifold. 1. Introduction In recent years, there has been a strong interest in modeling network data due to their increased availability in social sciences (Aggarwal, 2011), biology (Minch et al., 2015) and engineering (Zhang et al., 2008). A popular generative model, suitable for social network analysis has been the stochastic block model (Nowicki and Snijders, 2001; Karrer and Newman, 2011) and its variant, the mixed membership stochastic block model (Airoldi et al., 2008). Their popularity stems from the fact that these models tend to produce networks organized in communities; subsets of vertices connected with one another with particular edge densities. For example, edge density may be higher within communities than be- 2023 Leo L Duan, George Michailidis and Mingzhou Ding. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/20-1206.html. Duan, Michailidis and Ding tween communities. A key analytical task is that of community detection and a plethora of algorithms have been proposed in the literature [Leighton and Rao (1999); Khandekar et al. (2009); Arora et al. (2009); Mucha et al. (2010); Fortunato (2010); Papadopoulos et al. (2012); for recent reviews, see Abbe (2017); Javed et al. (2018)]. Further, consistency results when the number of vertices grows to infinity have also been provided for certain community detection algorithms, with spectral clustering being the most prominent among them (Rohe et al., 2011; Amini et al., 2013). However, when the network is of small to moderate size, such consistency results are not directly applicable, which motivated various Bayesian approaches. Many of them can be viewed as variants of the latent space model (Hoffet al., 2002), wherein the key idea is to assume a latent coordinate for each vertex, and the pairwise interaction of two coordinates (e.g., inner product, distance) determines the probability of whether an edge should form. Such an example is the Bayesian stochastic block model [see, e.g., Mc Daid et al. (2013); van der Pas and van der Vaart (2018); Geng et al. (2019)] that characterizes the randomness in the community labels. Some other approaches consider edge formation as the outcome of a stochastic mechanism that can lead to a power-law degree distribution (Cai et al., 2016), or to sparse networks (Caron and Fox, 2017) and can aid in link prediction tasks (Williamson, 2016). In many scientific areas, it is becoming common to have access to a collection of networks, that usually exhibit a certain degree of heterogeneity. For example, neuroscientists collect brain signals from EEG/MEG technologies for cohorts of patients that give rise to networks capturing brain activity between regions of interest (ROIs) (Shen et al., 2013). The networks in the collection share common features (e.g., community structure, since they are derived from subjects either responding to the same stimulus in designed experimental studies or having the same disease condition in observational studies), but also exhibit heterogeneity. Analysis of such collections could proceed by applying current approaches to each network and then devising methods for aggregating the results, which could prove challenging, since the possible significant variation from one network to another renders pooling information error-prone (e.g., by assuming a shared latent space). This issue was recognized by Durante et al. (2017) that proposed to use multiple sets of coordinates, modeled by a non-parametric mixture distribution. The latter approach fits better the underlying data, vis-a-vis a naive averaging across multiple networks. Similarly, Mukherjee et al. (2017) proposed an approach to directly cluster the networks, which reduces the heterogeneity for downstream analysis. Another important factor to consider for multiple networks is the risk of model misspecification. Unfortunately, a fully Bayesian network model is sensitive to this issue, since it needs to impose a parametric distribution on the edge generating mechanism, often using a Bernoulli distribution. In contrast, in real world applications, the available network data are in fact produced by various processing algorithms, such as thresholding the correlation matrix from the multivariate time series (Sojoudi, 2016) (hence not Bernoulli). Although one could trace back the data processing steps and develop a corresponding generative model, often this is impractical due to the complexity and use of heuristics in those steps. Spiked Laplacian Rather, it is more useful to consider a modeling approach, that is probabilistic, but based on more relaxed assumptions. The above two factors lead us to consider the graph Laplacian (Chung and Graham, 1997), which lies at the heart of spectral clustering algorithms that use a set of eigenvectors corresponding to the smallest non-trivial eigenvalues. It is a simple transformation of the adjacency matrix, and its smallest non-trivial eigenvalues provide information on the minimum edge loss when partitioning the network into multiple communities. However, spectral clustering-based approaches are primarily algorithmic in nature, involving a multi-stage procedure starting by normalizing the graph Laplacian, followed by a singular value decomposition and selection of the appropriate number of eigenvectors to use (based mostly on empirical inspection) and then finally a post-processing of the eigenvectors through an application of the K-means algorithm (Ng et al., 2002). Further, performance guarantees for such approaches are asymptotic in nature and take the form of high probability error bounds on the number of communities selected and the misclassification error rate (Hein et al., 2007; Von Luxburg et al., 2008; Rohe et al., 2011). Since there is no likelihood function involved, measures of uncertainty for community assignments are difficult to obtain, and further, it becomes challenging to accommodate heterogeneity across networks. To overcome these challenges, we consider a probabilistic model for the graph Laplacian, leveraging its spectral properties and subsequently introducing a non-parametric Bayes approach on a population of networks/graphs. The crux of the problem is how to parameterize a valid Laplacian matrix by only focusing on a small set of eigenvectors (rank) that captures the underlying community structure. We leverage ideas from the spiked covariance model (Donoho et al., 2018), and adding a new transformation that focuses on the smallest eigenvalues (as opposed to the largest ones in covariance modeling). We then show that the associated eigenvectors contain useful information for a hierarchical partitioning of the graph, which leads to an almost instantaneous estimation of the underlying communities, with no need for post-processing of the results from spectral clustering with iterative algorithms such as K-means. Due to the Bayesian nature of the model, the estimated community labels have a posterior distribution, which quantifies their uncertainty. To the best of our knowledge, existing Bayesian models involving the graph Laplacian mostly use it as a tool to construct a regularizing prior distribution for different types of problems, such as variable selection in regression (Liu et al., 2014), function estimation on a graph (Kirichenko et al., 2017) and covariance specification for Gaussian processes (Dunson et al.). In contrast, we use the Laplacian as a transformation for network data and propose a new likelihood with the goal of carrying out near-optimal community detection, hence our focus is different and novel. The remainder of this paper is organized as follows: Section 2 introduces the construction of the spiked graph Laplacian, and the non-parametric Bayesian model that accommodates heterogeneity in a collection of graphs; Section 3 introduces the estimation of the communities based on the posterior distribution; Section 4 establishes theoretical properties for the proposed model. Section 5 evaluates the model performance based on synthetic data, while Section 6 illustrates the modeling approach in a data application aiming to charac- Duan, Michailidis and Ding terize the heterogeneity in brain scans in a human working memory study. The software implementation can be found on https://github.com/leoduan/Bayes Spiked Laplacian. 2. The Spiked Graph Laplacian Model Suppose S graphs/networks are observed, each denoted by G(s) = {V (s), E(s)}, s = 1, . . . , S, with corresponding vertex set V (s) = {1, . . . , n} and edge set E(s) = {e(s) i,j }i,j. For ease of presentation, we focus on undirected, weighted graphs, whose adjacency matrix is given by A(s) = {A(s) i,j }i,j with entries satisfying A(s) i,j 0, A(s) j,i = A(s) i,j and A(s) i,i = 0. Extension to a binary A(s) i,j is discussed at the end of the paper. For notational convenience, the graph index (s) is omitted in the sequel. The observed normalized Laplacian is a transformation of the adjacency matrix given by L = D 1/2(D A)D 1/2, (1) where D = diag{di}n i=1 is the observed degree matrix, with di = Pn i=1 Ai,j. Theorem 1 For any adjacency matrix A with di > 0 for i = 1, . . . , n, the normalized Laplacian L = D 1/2(D A)D 1/2 has all eigenvalues λk [0, 2] (Chung and Graham, 1997). The smallest eigenvalue λ1 = 0 (index 1 denotes the smallest one) and L d1/2 = 0. The above theorem shows that D is dependent on L, therefore, to build a probabilistic model on the adjacency matrix, we consider the following factorized model: Π(A) = Π(L)Π(D | L). where we use Π( ) to denote a density. To fully characterize the level of dependency of D on L, we make the following observations. (i) If the multiplicity of zero eigenvalue from L is one, then the corresponding unit-norm eigenvector (subject to sign change) must be φ1 = ( di/ z1)i=1...n, with z1 = Pk i=1 di. This means that given L, all di s are almost known except for a scalar z1. (ii) If the multiplicity of zero eigenvalues of L is K > 1, then it means there are K disjoint component subgraphs (Chung and Graham, 1997), we can see that the corresponding eigenvectors are in the form of an n K matrix [ φ1 . . . φK]O, with O RK K a rotation matrix, φk = [ di1(ci = k)/ zk]i=1...n, zk = p Pn i=1 di1(ci = k) and 1(ci = k) the indicator function representing if the ith vertex is in the kth component subgraph. Since any rotation will not change the 2-norm of each row vector, it is not hard to see that the 2-norm of the each row is still di1(ci = k)/ zk; hence all di s are almost known except for ( z1, . . . , z K). Summarizing these two cases, we can see that Π(D | L) is a distribution that only describes the total scale of degrees in each component. Remark 2 Based on the above discussion, we can see that once L is known, Π(D | L) contains very little additional information about the pairwise relationship in A. Therefore, we choose to focus on Π(L) from now on to be exact, if we have parameter β = (βD, βL) that enters the likelihood as Π(L; βL)Π(D | L, βD), as our parameter of interest is βL, we only need to handle Π(L; βL) in model specification and posterior inference. Spiked Laplacian To specify Π(L; βL), we use the following signal-plus-noise matrix model: L = µL + E, µL = k=1 λk qk q k + l=T+1 θ ql q l, (2) where E is a symmetric matrix capturing random variation with E = {ei,j}i,j, ei,j N(0, σ2 e) for i < j. The matrix µL is symmetric with eigenvalues λ1, λ2 . . . , λT , θ, . . . , θ | {z } (n T) sponding eigenvectors q1, . . . , qn. We further require q1(i) > 0 for all i. We do not impose monotonicity constraint for those λk except having λ1 = 0; later, we may re-arrange them in non-descending order and will index them by subscript .(k). Collecting the eigenvectors and eigenvalues in matrices Q = ( q1, . . . , q T ) and Λ = diag(λ1, . . . , λT ), and after re-arranging terms in (2) we obtain k=1 (λk θ) qk q k + l=1 θ ql q l = Q(Λ IT θ)Q + Inθ, where q T+1, . . . , qn are canceled due to orthonormality, Pn l=1 ql q l = In. Therefore, our parameter of interest is βL = (Q, Λ, θ, σ2 e), which has O(n T) many elements. Remark 3 This model shares similarities with the spiked covariance model (Donoho et al., 2018), except that the spikes λ2, . . . , λT are associated with the smallest eigenvalues, that as shown later, drive the partitioning of the graph into communities. For this reason, we coin the term spiked graph Laplacian for µL. 2.1 A Non-parametric Bayesian model for Heterogeneous Spiked Graph Laplacians A key benefit of the probabilistic model introduced for the spiked graph Laplacian is that it enables us to naturally capture heterogeneity in a collection of graphs G(s), s = 1, , S, with associated Laplacians and their decompositions (µ(s) L , Q(s), θ(s), Λ(s)). Note that in (2), each q(s) k forms a factor matrix q(s) k q(s) k encoding the pairwise interactions of the vertices, while each λ(s) k modulates the magnitude of the interactions. Given such a heterogeneous collection of graphs/networks, in order to learn both the shared community structure across them, and also capture their heterogeneity as reflected in their edge density, we use a two-pronged approach: (i) a non-parametric Bayesian model is used for estimating a common dictionary of factors, and (ii) a random-effects model controls the number of spikes for each graph Laplacian L(s). Duan, Michailidis and Ding The matrix of eigenvectors Q(s) is modeled based on a Dirichlet process mixture, l=1 πlδU(l)(.), Π(U(l)) exp tr ΩM U(l) I[u(l) 1 (i) > 0 for i = 1, . . . , n], π1 = ν1, πl = νl Y l 1, νl Beta(1, α0), whose base measure is a constrained matrix Langevin distribution, on a Stiefel sub-manifold with the elements in the first column being all positive, VT,n = {Q Rn T : Q Q = IT , q1(i) > 0, i = 1 . . . n}; Ωa diagonal T T matrix; the concentration parameter α0 > 0; δa(.) a point mass at a; I(E) takes the value 1 if E holds, and 0 otherwise. An important property of the Dirichlet process mixture is that the posterior distribution is discrete almost surely. Therefore, using this non-parametric prior distribution allows us to obtain a discrete distribution for Q(s), where Q(1) . . . , Q(S) have only a few unique values that are significantly less than S. That is, we learn a dictionary of the eigenmatrices. Remark 4 Since L(s) = µL(s)+E(s), the term E(s) gives a perturbation to the eigenvectors of L(s), so that they become unique for each subject s = 1, . . . , S. Therefore, even though we use a Dirichlet process prior that makes the distribution of Q(s) (eigenvectors of µL(s)) discrete that is, we allow the possibility of having Q(s) = Q(s ) for two subjects s and s , the corresponding observed Laplacian matrices L(s) and L(s ) do not have the same eigenvectors (otherwise, the model would be too restrictive). The eigenvalues λ(s) k and θ(s), s = 1, . . . , S are assumed independently and identically distributed according to the following prior distribution: η(s) k Bernoulli(w), λ(s) k | η(s) k = 1 N(0,2)(0, σ2 λ,1), λ(s) k | η(s) k = 0 N(0,2)(µθ, σ2 λ,0), θ(s) N(0,2)(µθ, σ2 θ), for k = 2, . . . , T, with N(0,2) denoting a Gaussian distribution truncated to the (0, 2) interval. Further, since λ(s) 1 = 0, we assign η(s) 1 = 1. When marginalizing over η(s) k , each λ(s) k follows a two-component mixture, with the first component capturing small spikes, and the second component capturing those large ones close to θ. This enables a constant dimension T for all L(s), while retaining adaptiveness to have the effective number of small spikes: k=1 η(s) k , (6) as shown later, equivalent to κ(s) communities. Remark 5 An alternative parameterization would be using T (s) that varies directly with each graph; however, this would lead to an inefficient discrete search when estimating the Spiked Laplacian posterior distribution. In theory, one could also consider fixing T = n; nevertheless, when setting T n, we only need to estimate the first T eigenvectors [due to (3)], and the algorithm is computationally more efficient than when T = n. Our parameterization of the mixture prior in (5) is motivated by the observation that those large eigenvalues of Laplacian have a concentration at a value away from 0 (see Figure 2). Therefore, we make the second-component location µθ non-zero, along with a small scale σ2 λ,0. Another possibility is the continuous spike-and-slab prior (George and Mc Culloch, 1993) with the second location µθ = 0, a large scale σ2 λ,0, and the truncated support on (0, 2); nevertheless, under that prior, those small but not close-to-zero λ(s) k s will be more likely to be assigned to the component with η(s) k = 0 (as in the slab group), which is not ideal for our modeling purpose. 4 eigenmatrices Q(s) sampled from (4) G G G G G G G G G G G G G G G G G G G G G G G G G G G 0 10 20 30 Index G G G G G G G G G G G G G G G G G G G G G G G G G G G 0 10 20 30 Index G G G G G G G G G G G G G G G G G G G G G G G G G G G 0 10 20 30 Index G G G G G G G G G G G G G G G G G G G G G G G G G G G G 0 10 20 30 Index 4 sets of eigenvalues (Λ(s), θ(s)) sampled from (5) . 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Adjacency matrices simulated using L = µL + E, A = D1/2(I L)D1/2. Figure 1: Simulated adjacency matrices illustrating how the non-parametric spiked Laplacian model captures graph heterogeneity: Graphs (b), (c), (d) use the same eigenmatrix Q(s) drawn from the Dirichlet process, creating similar community structure; the independent eigenvalues Λ(s) lead to varying degree of sparsity ((b) vs (c)) and also dictate whether a community can be further divided into two smaller communities ((b) vs (d)). Graph a take a different value for Q(s); hence its community structure is completely different from (b), (c), (d). Next, we illustrate the high flexibility of the proposed modeling framework based on synthetic data. We draw four eigenmatrices from (4) and four sets of eigenvalues from (5), and obtain the adjacency matrix using (2) with σ2 e = 10 2. As shown in Figure 1: (i) Graphs (b), (c), (d) have the same values in the eigenmatrix Q(s), therefore they share a similar community structure and appeare quite different from graph (a); (ii) among those three, Duan, Michailidis and Ding the independent eigenvalues Λ(s) create varying edge, thus leading to different strengths in connectivity between graphs (b) and (c), and also dictate whether a community can be further divided into two smaller communities [(b) vs (d)]. 2.2 Specification of the Prior Distribution For the variance parameters σ2 θ, σ2 λ,0 and σ2 λ,1, we assign proper Inverse-Gamma(2, 0.1) distributions with a weakly informative prior mean of 0.1. To choose the mean parameter µθ of those larger eigenvalues, we consider the idealized case of having a graph G consisting of K disjoint complete subgraphs the kth subgraph has nk > 1 vertices, among which each pair of vertices are connected with an equal edge weight Ai,j = a > 0. Then for this graph G , we have that the eigenvalues of its normalized Laplacian are given by: 0, . . . , 0 | {z } K , n1 n1 1, . . . , n1 n1 1 | {z } (n1 1) , . . . , n K n K 1, . . . , n K n K 1 | {z } (n K 1) which can be derived as a direct extension of (6) from Banerjee and Jost (2008). Therefore, with most of nk/(nk 1) 1 and viewing the observed graph as some deviation from G , we set the prior mean µθ = 1 in this article. As a flexible alternative, one could further use a hierarchical prior µθ N(1, σ2 µ1) , so that µθ can be adaptive to the data. For w, we assign a non-informative prior Beta(1, 1) distribution. For the noise variance σ2 e, we set a diffuse prior Inverse-Gamma(0.01, 0.01) distribution. For the base measure of the Dirichlet process (4), we choose the non-informative Ω= diag(0, , 0), making it a uniform prior measure over VT,n and eliminating the need to estimate M or any intractable normalizing constant. We choose a small concentration α0 = 0.1 to induce sparsity in the mixture weights, which lead to fewer unique values in Q(s), thus aiding interpretation by having few communities. Note that one can always select a larger α0 value, if more communities with finer-scale differences is desired. In numerical experiments, this prior specification shows good empirical performance in recovering the ground truth and is robust to a wide range of values of n, S and noise levels without the need for tuning. 2.3 Estimation of the Posterior Distribution We use Gibbs sampling to estimate the posterior distribution. Since an infinite mixture distribution is involved, we use a latent assignment zs {1, 2, . . .} for each graph, such that Q(s) = U(l) if zs = l. Then, the likelihood given {zs} becomes s=1 Π(L(s); σ2 e, Λ(s), Q(s), θ(s), z(s)) (σ2 e) Sn(n+1) tr (Λ(s) θ(s)IT )2 + L θ(s)In 2 F 1 2σ2e tr (θ(s)In L(s))U(l)(θ(s)IT Λ(s))U(l) . Spiked Laplacian In the above, we replaced the fixed diagonal elements L(s) i,i = 1 with an augmented random variant L(s) i,i = N(µL,i,i, 2σ2 e), for easier matrix-based computation as in Hoff(2009). A Gaussian Integral Trick for the Product-Matrix-Bingham Distribution: One immediate challenge of sampling U(l) from (8) is the exponential-quadratic in the full conditional distribution: Π(U(l) | .) exp 1 s:zs=l tr(Fs U(l)Gs U(l) ) etr(ΩM U(l)), (9) where Fs = θ(s)In L(s) and Gs = θ(s)IT Λ(s). This corresponds to the product of a matrix Bingham-{Fs/(2σ2 e), Gs} distribution, for which a closed form is unavailable for sampling purposes. To address this problem, we propose a new data augmentation for the product-matrix Bingham distribution, which extends the Gaussian integral trick (Zhang et al., 2012) on the Stiefel manifold. Consider an augmented random matrix Rs RT n from the matrix Gaussian Mat-No(Gs U Fs, Gsσ2 e, Fs): Π(Rs | U(l), .) |Fs| T/2|Gs| n/2etr 1 2σ2e F 1 s (Rs Gs U(l) Fs) G 1 s (Rs Gs U(l) Fs) , (10) The joint distribution becomes: Π({Rs}s:zs=l, U(l) | .) tr(F 1 s R s G 1 s Rs) 2tr(R s U(l) ) etr ΩM U(l) , where all the quadratic terms in (9) are canceled, leading to full conditional Π(U(l) | {Rs}s:zs=l, .) etr 1 s:zs=l Rs U(l) + ΩM U(l) . (11) Therefore, we can sample (10) and (11) alternatively in closed form; note that the latter is a matrix Langevin distribution amenable to the sampling algorithm in Hoff(2009). Sampling Algorithm: To simplify computations, we approximate the Dirichlet process mixture model with a truncated version, setting the number of mixture components to g and using Dir(α0/g, . . . , α0/g) (in this paper, we use g = 30). The detailed steps of the algorithm are given in the Appendix. 3. Community Detection based on the Posterior Distribution Next, we focus on the community assignment labels c(s) i N for each vertex i in graph s, using the obtained posterior sample of Q(s) and Λ(s). Specifically, we obtain {c(s) i }n i=1 via a fast and deterministic transformation of (Q(s), Λ(s)), in which the algorithm aims to Duan, Michailidis and Ding optimize the partitioning of each graph. Note that this is a measurable transformation, we quantify the uncertainty via the induced distribution of c(s) i . Since the discussion pertains to each graph, we omit superscript (s) for ease of presentation. Optimal Graph Cut We first introduce the concept of optimal graph cut . In the simplest possible case, suppose we want to bi-partition (or, cut ) a graph G = (V, E) into two sub-graphs G1 = V1, E(V1, V1) and G2 = V2, E(V2, V2) , with V1 V2 = V and V1 V2 = , and E(Vj, Vj) the edges formed among the vertices within Vj. An intuitive cut criterion corresponds to minimizing the loss of edge weights between two sub-graphs: P i V1,j V2 Ai,j. On the other hand, we want to prevent trivial cuts, where one of the partition vertex sets Vj, j = 1, 2 comprises of few or even a single vertex. To that end, Shi and Malik (2000) introduced the minimal normalized cut loss defined as h2(G) = min (V1,V2) P i V1,j V2 Ai,j minl=1,2 P i,j Vl Ai,j , where the denominator is the sum of the vertex degrees in one of two subgraphs. Initially, h(G) was proposed for a binary adjacency matrix A, and is also known as the Cheeger or isoperimetric constant (Mohar, 1989), representing the bottleneck of the flow across the edges connecting the two partitioned vertex sets; later on, this loss was extended to weighted graphs (Friedland and Nabben, 2002). Louis et al. (2011) extends it to κ-partitioning of a weighted graph, with the corresponding loss function known as the sparsest κ-cut : hκ(G) = min (V1,...,Vκ) P m 0, dj > 0 and (θ λ(k)) > 0 for small λ(k). If qk(i) and qk(j) have the same sign, they contribute positively to A ,i,j. Therefore, to minimize the loss due to a graph cut, a locally optimal cut is simply dividing the set into two subsets the one with q(k) 0 and the one with q(k) < 0. In the simplest case with κ = 2, this is exactly the Fiedler vector partitioning (Fiedler, 1989). We do this recursively until obtaining κ subsets. Due to the orthonormality of the eigenvectors, the following holds for k 2, i=1 q(1)(i)q(k)(i) = 0, q(1)(i) > 0. To satisfy these constraints, each vector q(k) must contain both plus and minus signs; hence, we can always use the sign-based partitioning. This algorithm can run very fast, since it only takes one scan from 1 to κ. Duan, Michailidis and Ding 3.2 Obtaining Point Estimates from Posterior Samples Based on the posterior samples, it is of interests to obtain point estimates for both interpration and benchmarking purposes. On estimating the effective number of patterns, as zs = l represents assigning L(s) to the l-th group, for each posterior sample, we record the number of unique values in {zs}s=1, ,S, denoted by bz. Using the posterior samples, we obtain a discrete distribution for bz = 1, 2, , and we take the one with the largest probability as a point estimate ˆbz. To estimate the number of communities for each subject, for each posterior sample of Λ(s), we obtain a κ(s) from (6) as the number of communities. Similarly based on the posterior samples, we obtain a discrete distribution for κ(s) = 1, 2, . . . and pick the one with the largest probability as a point estimate for ˆκ(s). To obtain a point estimate of community assignments for the s-th subject, conditioned on our point estimate on the number of communities, we take those samples Q(s) : κ(s) = ˆκ(s), and find one that has the largest posterior density, and run Algorithm 1 to obtain assignment labels ˆc(s) i s to ˆκ(s) communities. 4. Theoretical Results Next, we establish several properties of the proposed methodology. We first show that adapting κ(s) for the graph involves a trade-offbetween the number of eigenvectors to be estimate and their estimation accuracy compared to the ground truth under noise perturbations; hence, this is a trade-offbetween the number of communities κ(s) one attempts to identify and the uncertainty/error on the estimated community membership c(s) i . Assume L is a noisy version of a true L0 (not necessarily having a spiked structure), with Q0 containing its eigenvectors. The spiked graph Laplacian model produces an estimated ˆL = ˆQ(λ IT θ) ˆQ + Inθ based on the posterior distribution. We can quantify the distance between the sub-matrices of ˆQ and Q0. Theorem 6 (Trade-offbetween resolution and estimating accuracy) For any given posterior sample from the spiked graph Laplacian model, let the eigen-vectors/values in the spiked graph Laplacian estimate be ordered such that λ(1) < λ(2) λ(3) . . . λ(T) < λ(T+1) = . . . = λ(n) = θ. Further, assume each element of (ˆL L0) is σe-sub-Gaussian, due to both L0 and ˆL being normalized Laplacians and thus all their elements are in the [ 1, 0] interval. Denote the sub-matrices formed by the first k columns of the ˆQ and Q0 matrices as ˆQ1:k and Q0,1:k, respectively; then, for any k [2, T 1], there exists an orthonormal matrix O, for any t > 0: Pr ˆQ1:k O Q0,1:k F kn23/2σe λ(k+1) λ(k) t 1 δt, where δt = exp[ {t2/64 log(5 Spiked Laplacian Next, note that the likelihood function can be re-written as, Π(L; σ2 e, Λ, Q, θ) (σ2 e) n(n+1)/4 exp 1 4σ2e tr(ΛΛ) 2tr(ΛQ LQ) θ2(n T) 2θtr{L(I QQ )} exp 1 4σ2e tr(LL) , where Λ and θ are conditionally independent. Integrating out Λ and θ, we obtain the marginal likelihood of Q: Π(L; σ2 e, Q) exp (Pn k=T+1 q k Lqk)2 exp PT k=1(q k Lqk)2 with Φ denoting the cumulative distribution function of the normal distribution and ζ =(σ2 e) n(n+1)/4+(T+1)/2 T Y k=2 {Φ(2 q k Lqk p 2σ2e ) Φ( q k Lqk p [Φ{ 2 (Pn k=T+1 q k Lqk)/(n T) p 2σ2e/(n T) } Φ{ (Pn k=T+1 q k Lqk)/(n T) p 2σ2e/(n T) }]. Remark 7 To see the intuition regarding the marginal likelihood, consider log Π(L; σ2 e, Q) as a loss function over Q, while ignoring the normalizing constant ζ, k=1 (q k Lqk)2 1 n T n X k=T+1 q k Lqk 2 = k=1 (q k Lqk)2 m n T where m [1, 2] due to P x2 k (P xk)2 2 P x2 k, with xk = q k Lqk 0 with L being positive semi-definite. Therefore, the first T factors (q k Lqk)2 have a substantially higher contribution compared to the remaining ones, which is consistent with our modeling focus on the first T eigenvectors. Lastly, we show that the proposed non-parametric model of the matrix containing the eigenvectors is posterior consistent. There has been theoretical work on community detection and eigenvector estimation for single graphs, assuming that the number of vertices n goes to infinity. A fundamental difference here is that we have fixed and bounded n in each graph, but the number of graphs S grows. Hence, a new theoretical approach is required. In order to avoid a potential discrepancy between the number of spikes in the true and prescribed models, we use the full eigen-decomposition for the raw observed L(s) = W (s)Ω(s)W (s) , where W (s) is an orthonormal matrix and Ω(s) diagonal. Note that W (s) belong to a a Stiefel sub-manifold V Vn,n, with the first column elements being all positive. Similarly, for the spiked graph Laplacian we have µL = Q Λ Q , where Q V and the first T columns equal to parameter Q, Λ = diag{λ1, . . . , λT , θ, . . . , θ}. Using f to denote the likelihood, each observed L(s) = W (s)Ω(s)W (s) can be generated from f(W (s), Ω(s) | Q , Λ ) etr 1 2σ2e Q Λ Q W (s)Ω(s)W (s) | {z } f(W (s)|Ω(s),Q ,Λ ) Ω(s)Ω(s) + Λ Λ | {z } f(Ω(s)|Λ ) Duan, Michailidis and Ding The former corresponds to W (s) Matrix-Bingham Ω(s), (2σ2 e) 1Q Λ Q , in which Q serves as the location parameter. Therefore, based on a non-parametric mixture priordistribution for Q , our task is equivalent to showing the consistency of estimating Q V under the Matrix-Bingham likelihood. Using the Q -marginal density f Q (W (s)) = R R f(W (s), Ω(s) | Q , Λ )P(dΛ , dΩ(s)), where P(.) denotes the appropriate measure, consider a neighborhood of the true density f Q ,0 on the manifold V as Bϵ(f Q ,0) = f Q : Z gf Q µ(d W) gf Q ,0µ(d W) ϵ, g Cb(V ) , with Cb denoting the class of continuous and bounded functions, and µ(.) the Haar measure on V . Next, we establish that the probability for the posterior density falling into Bϵ(f Q ,0) goes to 1 as S . Theorem 8 (Consistent density estimation for the eigenmatrix) Let W (1) . . . W (S) be matrices of eigenvectors, whose elements are independently and identically distributed from a distribution with density f Q ,0. Then, for all ϵ > 0, as S , Π Bϵ(f Q ,0) | W (1), . . . , W (S) = Bϵ(f Q ,0) QS s=1 f Q (W (s))Π(df) R QS s=1 f Q (W (s))Π(df) 1 a.s.Pf Q ,0, with Pf Q ,0 the true probability measure for (W (1), W (2), . . .). Remark 9 Although the space of eigenmatrix is a compact domain, the primary challenge is that there are more than one (up to infinite) different data-generating eigenmatrices, due to the presence of heterogeneity. Therefore, this theorem shows that we can obtain consistency in the sense of density estimation of the population of those eigenmatrices, as opposed to having the posterior converge to any fixed eigenmatrix. The consistency in density estimation shows that we can accurately estimate the true data generating distribution for the population of networks. On clustering the subjects, suppose we have K local maxima in the density f(W (s)) (W (s) as all the n eigenvectors of L(s)), a consistently estimated density will ensure: (i) if an observation L(s) has its W (s) sufficently close to the kth local maximum, then a classifier that maximizes the density: arg maxl=1...Kf(W (s) | Q = U(l), Λ , Ω(s)) will correctly assign subject s to the kth group (provided neither Λ and Ω(s) is a zerovalued matrix); (ii) on the other hand, if an observation L(s) has its W (s) almost equally distant away from several local maxima, then we may not perfectly cluster subject s; nevertheless, we can quantify the uncertainty via Pr(zs = k | .) = f(W (s) | Q = U(k), Λ , Ω(s))/[PK l=1 f(W (s) | Q = U(l), Λ , Ω(s))]. That is, we will have the mis-clustering error converge asymptotically to the Bayes error rate (as an irreducible error). Regarding the theoretical guarantees of finding communities for each subject, we want to clarify that the obtained results are based on a fixed (and small) n (number of vertices) Spiked Laplacian and growing S (number of subjects) setting. We purposely let the accuracy of community detection vary from one subject to another as one can imagine, due to the heterogeneity of data, it is likely that two networks may share the same community-partition pattern (that is, we may have zs = zs ), but network s may be noisier as having much more betweencommunity connections than network s . Mathematically, this is reflected in the eigenvalues with λ(s) k 0, but λ(s ) k 0. As we consider {λ(s) k }k=1...T to be random effects that vary over s, we do not aim for obtaining a convergence result for any individual subject. Rather, we aim for discovering a few shared community-partition patterns, each represented by U(l) in Theorem 8. On the other hand, if one wants to obtain some asymptotic guarantee on the community detection error rate for a specific subject, it is necessary to consider a very different scenario with n , and impose some stronger assumption on the eigenvalues. Subsequently, one can show that the first few eigenvectors will converge to the corresponding population eigenfunction (Von Luxburg et al., 2008). 5. Performance Evaluation based on Synthetic Data 5.1 Impact of Different Noise Levels on a Single Graph We first examine the effects of noise on estimating the communities in a single graph. We generate a weighted graph comprising of 60 vertices and three communities of size 10, 20 and 30 vertices, respectively. To avoid directly using the proposed model to generate data, we simulate each edge within the communities as a Bernoulli event with probability 0.5, and then add Gaussian noise No(0, ξ2) to the adjacency matrix with varying ξ2. As shown in Figure 2, the 3-community structure can be visualized by the spectral gap between the third and fourth eigenvalues. As the noise increases, the gap diminishes, making it more difficult to separate the communities. Observed Spectral Gap (ˆλ4 ˆλ3) 0.6 0.3 0.1 0.05 0.01 Spiked Laplacian (1 0) (0.95 0.05) (0.88 0.09) (0.58 0.20) (0.40 0.14) Observed Laplacian (1 0) (0.91 0.07) (0.78 0.05) (0.42 0.20) (0.36 0.25) Table 1: The spiked Laplacian model has higher accuracy in recovering community labels, comparing to direct decomposition of the observed Laplacian. The results are calculated by clustering the second and third eigenvectors into three groups (patterns), then comparing with the ground truth labels to compute normalized mutual information (NMI). Mean standard deviation is reported based on 50 times of experiments. The higher the NMI, the higher the accuracy. Duan, Michailidis and Ding Low-noise graph (observed spectral gap 0.6) Medium-noise graph (observed spectral gap 0.3) High-noise graph (observed spectral gap 0.05) Figure 2: Three simulated graphs with different degree of noise, corresponding to different spectral gaps in the eigenvalues (for clarity, we show the first 35 eigenvalues out of 60). Comparing the eigenvalues produced by the direct decomposition of the raw Laplacian (cyan), and the ones by spiked Laplacian model (red), the latter has a clearly larger spectral gap between the third and fourth, corresponding to better separation between signal and noise. The spiked Laplacian model has a lifting effect on the fourth eigenvalue (shown in red in Figure 2). This is due to the flat structure imposed, effectively replacing the fourth eigenvalue ˆλ4 by θ (P60 k=4 ˆλk)/57 with ˆλk > ˆλ4 for k > 4. Consequently, it leads to an increase in the spectral gap, compared to a direct eigendecomposition of the graph Laplacian (shown in cyan). Practically, this leads to improved accuracy in finding the community labels, as shown in Table 1. This phenomenon can be viewed as a result of rank regularization on the Laplacian matrix; Le et al. (2018) discussed similar effects under a slightly different regularization in spectral clustering. Further, we would like to discuss the effect of sparsity on the eigenvalues of L. On the one hand, since L is the normalized Laplacian, it is scale-invariant in the magnitude of A; hence the above result will remain the same, even if A is multiplied to a small positive number. On the other hand, if the noise-free (unobserved) graph A becomes sparser, but the noise magnitude does not scale down with A , then it would impact the accuracy of Spiked Laplacian signal recovery, with a similar effect as in Subfigure 3 of Figure 2 (high-noise graph) this is as expected since the signal-to-noise ratio decreases as the signal becomes sparse. To provide some numerical illustration, we simulate additional graphs in the appendix, except with the Bernoulli probability reduced from 0.5 to 0.3, 0.2 and 0.1, respectively. 5.2 Impact of Graph Size on Community Detection n 100 300 500 1000 Spiked Laplacian Model (0.84 0.15) (0.90 0.04) (0.86 0.04) (1 0) Stochastic Block Model (0.65 0.23) (0.84 0.09) (0.87 0.04) (1 0) Bayesian SBM (0.70 0.14) (0.83 0.08) (0.88 0.05) (1 0) Table 2: At small n, the spiked Laplacian model has higher accuracy in estimating the community labels. Mean standard deviation is reported based on 50 times of experiments. The higher the NMI, the higher the accuracy. Next, we evaluate the effects of varying the number of vertices n on community detection. We adopt a similar 3-community setting as in the previous subsection, retaining the community size ratio as 1:2:3, and increase the total number of vertices. We calculate the normalized mutual information that compares the estimated community labels and the ground truth (details provided in the Appendix), using estimates produced by the spiked graph Laplacian model, the stochastic block model using the spectral clustering algorithm (Ng et al., 2002) and the Bayesian stochastic block model using a Gibbs sampler based on the model in van der Pas and van der Vaart (2018). As shown in Table 2, for large n 500, there are almost no differences in terms of the point estimate accuracy. However, at smaller vertex sizes n, the spiked graph Laplacian model exhibits clearly superior performance. Figure 3 shows the posterior distribution on the effective number of communities in one experiment at n = 100. It is evident in this experiment, that the point estimate ˆκ matches the ground truth of 3 communities. We find similar results for larger n s as well. In 50 times of repeated experiments, ˆκ matches with the ground truth for 98% of time. 1 2 3 4 5 6 7 8 kappa Figure 3: Posterior distribution of κ as the effective number of communities at n = 100. Duan, Michailidis and Ding Next, we empirically show that the advantage for small n is attributable to more accurate uncertainty quantification. For a more intuitive illustration of this issue, we generate a 2-community graph using a latent position model we first sample latent yi s near two manifolds [Figure 4, Panel (b)], then compute the pairwise similarity between latent positions [Ai,j = exp( 10 yi yj 2)], and use it as the edge weight. Clearly, most of the uncertainty is located in the center of the adjacency matrix, where the manifolds get close. (a) Observed Laplacian L based on the simulated graph. (b) The latent positions yi used to generate the graph, colored by the true labels. (c) q2(i) vs vertex index. Two communities are classified by the K-means (red) or sign-based algorithm (blue). (d) Spiked Laplacian model correctly estimated the uncertainty pr(ci = 1), based on sign-partitioning of each posterior sample q2. (e) Applying Gaussian mixture model on only one sample of q2(i) (as in the stochastic block model) under-estimates the uncertainty pr(ci = 1). Figure 4: Illustration of uncertainty quantification by the spiked graph Laplacian model. Panel (c) plots one sample of q2. The sign-based partition used by the spiked graph Laplacian model has a default decision boundary at the zero line (in blue). Applying this bipartitioning on each posterior sample of q2, it leads to an accurate uncertainty quantification [Panel (d)]. Comparatively, in the estimation of stochastic block model, one applies K-means or a Gaussian mixture model on one sample of q2 (based on the direct eigendecomposition of L), which could result in a severe underestimation of the uncertainty, as shown in Panel (e). 5.3 Accommodating Heterogeneity in a Collection of Graphs In this experiment, we deal with multiple graphs comprising of 300 vertices each, whose adjacency matrices exhibit heterogeneity. We first generate a set of five possible communitypartition patterns, each represented by a binary matrix (denoted by W (l)) of size 300 6 for Spiked Laplacian l = 1, . . . , 5; each row has one 1 and five 0 s, encoding the ground truth of the community labels in 1, , 6. To generate a graph, we randomly draw one of five patterns as W (s) and a non-negative random vector Λ, producing its adjacency matrix by A(s) = W (s) Λ W (s) + E(s), with E(s) being a Gaussian noise matrix and e(s) i,j = e(s) j,i N(0, 1), for s = 1, . . . , 500. We compare the performance of the proposed model against several popular alternatives: (1) simple averaging of all graphs followed by the use of a stochastic block model, (2) coregularized stochastic block model/spectral clustering (Kumar et al., 2011), (3) clustering the graphs into five groups (patterns), and applying the stochastic block model in each group, (4) independent stochastic block model for each graph. The first two competitors produce only one partitioning, while the latter two accommodate the heterogeneity. We compute two benchmark scores: the normalized mutual information (NMI), reflecting the similarity between the estimated community labels to the ground truth in each graph; and the Root Mean Squared Error between the individual L(s) and the smoothed ˆL(s), as the goodness of fit criterion. Benchmark Scores NMI (higher is better) RMSE ( 10 3, lower is better) Spiked Laplacian Graphs 0.85 0.04 1.9 0.2 Average+SBM 0.21 0.15 9.2 2.5 Co-regularized SBM 0.25 0.11 10.2 4.5 Clustering Graphs + SBMs 0.67 0.24 5.5 1.5 Individual SBMs 0.45 0.13 1.2 0.2 Table 3: Benchmark of the fitting models to a population of heterogeneous graphs. Mean standard deviation is reported based on 50 times of experiments. When computing the RMSE, for the Spiked Laplacian Graphs, we obtain ˆL(s) from the spiked representation taking individual κ(s) as the truncated dimension, averaging over the posterior sample; for the other four, we define ˆL(s) as the truncated spectral representation ˆQˆΛ ˆQ with ( ˆQ, ˆΛ) corresponding to the top 6 dimensions (as the ground truth dimension for data generation). As shown in Table 3, our proposed model has the highest accuracy in estimating the community labels, followed by the two-stage estimator that clusters the graphs first and then partitions the vertices via the stochastic block model. The performance of individual stochastic block models is significantly inferior, likely due to the fact that they do not borrow information among graphs, and the number of vertices per graph is relatively small. For the goodness-of-fit measure, the individual stochastic block models achieve the best score due to their higher flexibility. The proposed model xhibits a slightly larger error, but is significantly lower than the remaining competitors. Figure 5 shows the posterior distributions on the effective numbers of communities (κ(s)) in one graph and the numbers of distinct patterns (bz) in one experiment. In this experiment, both the point estimates ˆκ and ˆbz match the ground truth. In 50 times of repeated experiments, ˆκ matches the ground truth for 92% of time, ˆbz for 72% of time. Duan, Michailidis and Ding Despite the very good empirical results, we want to caution that one should be careful on choosing those two dimensions, and we provide further guidance at the end of this article. 1 2 3 4 5 6 7 8 kappa (a) Posterior distribution of κ(s) as the effective number of communities for one graph. 2 3 4 5 6 7 8 bz (b) Posterior distribution of bz as the effective number of unique values in zs (distinct patterns). Figure 5: Posterior distributions on the numbers of communities and distinct patterns in one experiment. 5.4 Robustness to Over-specified T Lastly, we examine if the proposed model can handle an over-specified T, when it is larger than necessary. We focus on the following two issues: (i) whether the posterior sample of η can successfully identify redundant λk s; (ii) whether a misspecified T affects the estimation of the first few eigenvalues. (a) Eigenvalues estimated with T = 10. (b) Eigenvalues estimated with T = 30. Figure 6: Simulation showing the first few small eigenvalues are almost unaffected by an overly large T, and the variable ηk successfully identifies the redundant λk. We use the same single graph setup to generate graphs with three communities, except we now set T = 10 and T = 30. As shown in Figure 6, the posterior distribution of ηk successfully finds all unnecessary λk s, as indicated by ηk = 0. Further, there is almost Spiked Laplacian no difference in the estimates of the first few eigenvalues λ1, λ2, λ3. On the other hand, we should clarify that if the spectral gap is small, ηk will more likely be assigned to 0; this is an expected behavior indicating there is a large loss if we still want to partition the graph. 6. Data Application: Characterizing Heterogeneity in a Human Working Memory Study We employ the proposed spiked graph Laplacian model on data obtained from a neuroscience study on working memory, focusing on human brain functional connectivity (Hu et al., 2019). The study involved 1,329 brain scans, wherein each subject in the study was asked to do the Sternberg verbal working memory task, which involved memorizing a list of six numbers, followed by a memory retrieval task that requires the subject to answer if a number was among the six shown earlier. Electroencephalogram (EEG) signals were obtained from 128 electrode channels placed over each subject s head, and subsequently, a 128 128 connectivity network is estimated during the retrieval task period using absolute Pearson correlation. Each network has weighted edges taking values in the [0, 1] interval. Figure 7 depicts the adjacency matrices of three subjects for the memory retrieval task, and the presence of heterogeneity is apparent. It can be seen that memory-related connectivity can exhibit different levels of concentration in the front or back of the head [Panels (c) or (d), with spatial coordinates, plotted in Figure 9 (a)], or, they are more localized in smaller regions [Panel (e)]. We apply the spiked graph Laplacian model on this data set and the results obtained are based on an MCMC run of 30, 000 steps, with the first 10, 000 used as the burn-in period. The majority of the samples from the posterior distribution contain six distinct U(l) s in the clustered eigenmatrix values. Figure 8 depicts the three corresponding to the raw A(s) shown in the previous Figure, obtained from the fitted Laplacian matrices. The remaining three seem to correspond to smaller variations and are shown in the Appendix. The proportions for these six patterns are 25.6%, 24.1%, 16.1%, 14.7%, 15.2% and 4.3%, as estimated in the posterior mean of allocation zs. Duan, Michailidis and Ding Figure 7: Brain functional connectivity adjacency matrices of three individuals undertaking the memory retrieval task. A significant level of heterogeneity can be observed. Figure 8: Fitted Laplacian shows the structure underneath each raw connectivity matrix. Spiked Laplacian (a) Coordinates of the EEG sensors, viewed from the top of the head. (b) Histogram of the number of communities in all subjects. (c) The subject has two communities, with the larger one near the back of the head. (d) The subject has two communities, with the larger one near the front of the head. (e) The subject has four communities: outer-front, mid-front, left-back, right-back. Figure 9: Community structure for each brain scan from multiple subjects in the working memory study. We then evaluate the community structures in each network. As shown in Figure 9(a), the model discovers 1 6 communities from these graphs, as estimated by κ(s). To gain insight into the scientific implications, we plot the community labels mapped to the spatial coordinates. Panel (c) and (d) show that most of the networks contain only two distinct communities, although the division can be quite different in the dominating area either in the front or in the back of the head. Panel (e) shows a very distinct pattern with four communities, partitioned as the outer-front, mid-front, left-back, right-back regions of the head. 7. Discussion In this paper, we propose a probabilistic graph model based on the Laplacian, allowing us to exploit concepts and results rom spectral graph theory to conduct flexible community detection in a population of heterogeneous graphs. Our model can be considered as a Duan, Michailidis and Ding general method to introduce Bayesian tools into the spectral graph framework. There are several extensions worth exploring in future work. First, if the goal is to generate a new graph with binary Ai,j, such as in link prediction, then it could adopt a Bernoulli distribution associated with a canonical link. Second, if those graphs have some known covariance structure, such as is the case of repeated measurements or temporal effects, then it could take an alternative distribution on the eigenmatrix or eigenvalues to incorporate those structures. Third, for large graphs, it is of interest to consider θ not as a single constant, but as a step function. Lastly, a recent discovery is that the Dirichlet process mixture model, although it enjoys posterior consistency in density estimation (Ghosal et al., 1999), can lead to inconsistent estimates of the number of clusters (Miller and Harrison, 2013, 2014). Therefore, although we did obtain interpretable results of finding 6 patterns and small numbers of communities in our application, to be rigorous, we cautiously believe recovering a ground-truth number of clusters/patterns is still a non-trivial task. As alternatives, one may replace the Dirichlet process mixture prior with a mixture of finite mixtures prior distributions (Miller and Harrison, 2018), or an infinite mixture of quasi-Bernoulli stick-breaking prior distributions (Zeng et al., 2022), both of which have shown correct asymptotic behavior in simpler cases. However, for the task of clustering eigenvectors, significant work is needed to verify if consistency holds on the number of clusters, as it requires checking for a completely correct model specification and identifiability of the parameters. Emmanuel Abbe. Community Detection and Stochastic Block Models: Recent Developments. Journal of Machine Learning Research, 18(1):6446 6531, 2017. Charu C Aggarwal. An Introduction to Social Network Data Analytics. In Social Network Data Analytics, pages 1 15. Springer, 2011. Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research, 9:1981 2014, 2008. Arash A Amini, Aiyou Chen, Peter J Bickel, and Elizaveta Levina. Pseudo-Likelihood Methods for Community Detection in Large Sparse Networks. The Annals of Statistics, 41(4):2097 2122, 2013. Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander Flows, Geometric Embeddings and Graph Partitioning. Journal of the Association for Computing Machinery, 56(2): 1 37, 2009. Anirban Banerjee and J urgen Jost. On the Spectrum of the Normalized Graph Laplacian. Linear Algebra and Its Applications, 428(11-12):3015 3022, 2008. Abhishek Bhattacharya and David B Dunson. Nonparametric Bayesian Density Estimation on Manifolds With Applications to Planar Shapes. Biometrika, 97(4):851 865, 2010. Spiked Laplacian Diana Cai, Trevor Campbell, and Tamara Broderick. Edge-Exchangeable Graphs and Sparsity. In Advances in Neural Information Processing Systems, pages 4249 4257, 2016. Fran cois Caron and Emily B Fox. Sparse Graphs Using Exchangeable Random Measures. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1295 1366, 2017. Fan RK Chung and Fan Chung Graham. Spectral Graph Theory. Number 92. American Mathematical Soc., 1997. David L Donoho, Matan Gavish, and Iain M Johnstone. Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model. The Annals of Statistics, 46(4):1742, 2018. David B. Dunson, Hau-Tieng Wu, and Nan Wu. Graph based gaussian processes on restricted domains. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(2):414 439. doi: https://doi.org/10.1111/rssb.12486. Daniele Durante, David B Dunson, and Joshua T Vogelstein. Nonparametric Bayes Modeling of Populations of Networks. Journal of the American Statistical Association, 112 (520):1516 1530, 2017. Miroslav Fiedler. Laplacian of Graphs and Algebraic Connectivity. Banach Center Publications, 25(1):57 70, 1989. Santo Fortunato. Community Detection in Graphs. Physics Reports, 486(3-5):75 174, 2010. Shmuel Friedland and Reinhard Nabben. On Cheeger-Type Inequalities for Weighted Graphs. Journal of Graph Theory, 41(1):1 17, 2002. Junxian Geng, Anirban Bhattacharya, and Debdeep Pati. Probabilistic Community Detection With Unknown Number of Communities. Journal of the American Statistical Association, 114(526):893 905, 2019. Edward I George and Robert E Mc Culloch. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88(423):881 889, 1993. Subhashis Ghosal, Jayanta K Ghosh, and RV Ramamoorthi. Posterior Consistency of Dirichlet Mixtures in Density Estimation. The Annals of Statistics, 27(1):143 158, 1999. Matthias Hein, Jean-Yves Audibert, and Ulrike von Luxburg. Graph Laplacians and Their Convergence on Random Neighborhood Graphs. Journal of Machine Learning Research, 8:1325 1368, 2007. Peter D Hoff. Simulation of the matrix Bingham von Mises Fisher Distribution, With Applications to Multivariate and Relational Data. Journal of Computational and Graphical Statistics, 18(2):438 456, January 2009. Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent Space Approaches to Social Network Analysis. Journal of the American Statistical Association, 97(460):1090 1098, 2002. Duan, Michailidis and Ding Zhenhong Hu, Christopher M Barkley, Susan E Marino, Chao Wang, Abhijit Rajan, Ke Bo, Immanuel Babu Henry Samuel, and Mingzhou Ding. Working Memory Capacity Is Negatively Associated With Memory Load Modulation of Alpha Oscillations in Retention of Verbal Working Memory. Journal of Cognitive Neuroscience, pages 1 13, 2019. Muhammad Aqib Javed, Muhammad Shahzad Younis, Siddique Latif, Junaid Qadir, and Adeel Baig. Community Detection in Networks: A Multidisciplinary Review. Journal of Network and Computer Applications, 108:87 111, 2018. Brian Karrer and Mark EJ Newman. Stochastic Blockmodels and Community Structure in Networks. Physical Review E, 83(1):016107, 2011. Rohit Khandekar, Satish Rao, and Umesh Vazirani. Graph Partitioning Using Single Commodity Flows. Journal of the Association for Computing Machinery, 56(4):1 15, 2009. Alisa Kirichenko, Harry van Zanten, et al. Estimating a Smooth Function on a Large Graph by Bayesian Laplacian Regularisation. Electronic Journal of Statistics, 11(1): 891 915, 2017. Abhishek Kumar, Piyush Rai, and Hal Daume. Co-Regularized Multi-View Spectral Clustering. In Advances in Neural Information Processing Systems, pages 1413 1421, 2011. Can M Le, Elizaveta Levina, and Roman Vershynin. Concentration of Random Graphs and Application to Community Detection. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 2925 2943. World Scientific, 2018. Tom Leighton and Satish Rao. Multicommodity Max-Flow Min-Cut Theorems and Their Use in Designing Approximation Algorithms. Journal of the Association for Computing Machinery, 46(6):787 832, 1999. Lizhen Lin, Vinayak Rao, and David Dunson. Bayesian Nonparametric Inference on the Stiefel Manifold. Statistica Sinica, pages 535 553, 2017. Fei Liu, Sounak Chakraborty, Fan Li, Yan Liu, Aurelie C Lozano, et al. Bayesian Regularization via Graph Laplacian. Bayesian Analysis, 9(2):449 474, 2014. Anand Louis, Prasad Raghavendra, Prasad Tetali, and Santosh Vempala. Algorithmic Extensions of Cheeger s Inequality to Higher Eigenvalues and Partitions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 315 326. Springer, 2011. Aaron F Mc Daid, Thomas Brendan Murphy, Nial Friel, and Neil J Hurley. Improved Bayesian Inference for the Stochastic Block Model With Application to Large Networks. Computational Statistics & Data Analysis, 60:12 31, 2013. Jeffrey W Miller and Matthew T Harrison. A Simple Example of Dirichlet Process Mixture Inconsistency for the Number of Components. Advances in Neural Information Processing Systems, 26, 2013. Spiked Laplacian Jeffrey W Miller and Matthew T Harrison. Inconsistency of Pitman-Yor Process Mixtures for the Number of Components. Journal of Machine Learning Research, 15(1):3333 3370, 2014. Jeffrey W Miller and Matthew T Harrison. Mixture Models With a Prior on the Number of Components. Journal of the American Statistical Association, 113(521):340 356, 2018. Kyle J Minch, Tige R Rustad, Eliza JR Peterson, Jessica Winkler, David J Reiss, Shuyi Ma, Mark Hickey, William Brabant, Bob Morrison, and Serdar Turkarslan. The DNA-binding Network of Mycobacterium tuberculosis. Nature Communications, 6:5829, 2015. Bojan Mohar. Isoperimetric Numbers of Graphs. Journal of Combinatorial Theory, Series B, 47(3):274 291, 1989. Peter J Mucha, Thomas Richardson, Kevin Macon, Mason A Porter, and Jukka-Pekka Onnela. Community Structure in Time-Dependent, Multiscale, and Multiplex Networks. Science, 328(5980):876 878, 2010. Soumendu Sundar Mukherjee, Purnamrita Sarkar, and Lizhen Lin. On Clustering Network Valued Data. In Advances in Neural Information Processing Systems, pages 7071 7081, 2017. Andrew Y Ng, Michael I Jordan, and Yair Weiss. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, pages 849 856, 2002. Krzysztof Nowicki and Tom A B Snijders. Estimation and Prediction for Stochastic Blockstructures. Journal of the American Statistical Association, 96(455):1077 1087, 2001. Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali, and Ploutarchos Spyridonos. Community Detection in Social Media. Data Mining and Knowledge Discovery, 24(3): 515 554, 2012. Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral Clustering and the High-Dimensional Stochastic Blockmodel. The Annals of Statistics, 39(4):1878 1915, 2011. Xilin Shen, Fuyuze Tokoglu, Xenios Papademetris, and R Todd Constable. Groupwise Whole-Brain Parcellation From Resting-State f MRI Data for Network Node Identification. Neuroimage, 82:403 415, 2013. Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888 905, 2000. Somayeh Sojoudi. Equivalence of Graphical Lasso and Thresholding for Sparse Graphs. Journal of Machine Learning Research, 17(1):3943 3963, 2016. Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Soc., 2012. SL van der Pas and AW van der Vaart. Bayesian Community Detection. Bayesian Analysis, 13(3):767 796, 2018. Duan, Michailidis and Ding Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of Spectral Clustering. The Annals of Statistics, pages 555 586, 2008. Martin J Wainwright. High-Dimensional Statistics: a Non-asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. Sinead A Williamson. Nonparametric Network Models for Link Prediction. Journal of Machine Learning Research, 17(1):7102 7121, 2016. Cheng Zeng, Jeffrey W Miller, and Leo L Duan. Consistent Model-based Clustering: using the Quasi-Bernoulli Stick-Breaking Process. ar Xiv preprint ar Xiv:2008.09938, 2022. Li Zhang, Yuan Li, and Ramakant Nevatia. Global Data Association for Multi-Object Tracking Using Network Flows. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1 8. IEEE, 2008. Yichuan Zhang, Zoubin Ghahramani, Amos J Storkey, and Charles A Sutton. Continuous Relaxations for Discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 3194 3202, 2012. Proof of Theorem 1 Proof The bounds on the eigenvalues of the Laplacian are given and discussed in Chung and Graham (1997). For the first eigenvector we have µL d1/2 = D 1/2 (D A )D 1/2 d1/2 = D 1/2 (D A ) 1 = 0. Proof of Theorem 6 Proof For simplicity, we omit .(s) in the proof and use σe for σe0. The proof consists of the following four parts: 1. An application of the Davis-Kahan Theorem Let E = L L, using Theorem 2 in citepyu2014useful with r = 1 and s = k, we obtain Q0 QO F 23/2 min(k1/2 E op, E F ) 23/2(k1/2 E op) where E op denotes the operator norm ( E op = sup x =1 Ex ). 2. Discretizing Sn 1 = {x : x = 1} using a maximal ϵ-net: Following Tao (2012), let Nϵ Sn 1 be an ϵ-net with ϵ (0, 1), such that for any two x Nϵ, x Nϵ, x x ϵ. Maximizing over the number of included points in Sn 1, we obtain a maximal ϵ-net N 0 ϵ . Clearly, the balls with centers x N 0 ϵ and radius ϵ/2 are disjoint, and all covered by a large ball centered at the origin with radius 1 + ϵ/2, hence |N 0 ϵ | (ϵ/2 + 1 ϵ/2 )n = (ϵ + 2 Spiked Laplacian On the other hand, for any y Sn, there is at least one x N 0 ϵ : x y ϵ, otherwise y can be added to the net, contradicting the maximal condition. Choosing y Sn that attains Ey = E op, and its associated x N 0 ϵ : x y ϵ E op Ex = Ey Ex E(y x) E opϵ, by an application of the triangle inequality and f(x) = Ex is E op-Lipschitz. Therefore, E op t implies at least one x N 0 ϵ : Ex (1 ϵ)t. pr( E op t) pr [ x N 0 ϵ Ex (1 ϵ)t |N 0 ϵ |pr Ex (1 ϵ)t, where x Sn where the last inequality follows from the union bound. 3. Concentration inequality for Ex Since E is symmetric, let E = EU + EL, with EU being the upper triangular portion including the diagonal and EL the lower triangular portion. We first use B to represent either EU or EL. Let B be an n n matrix comprising of bi,j independent and σ2 e-sub-Gaussian elements. Then, for each element Bx E exp{t B jx} = E exp{t k=1 xkbj,k} k=1 E exp{txkbj,k} k=1 exp{t2σ2 ex2 k/2} = exp{t2σ2 e/2} where the inequality is due to the sub-Gaussian assumption, and the last equality due to x = 1. Therefore, each Zj = Bjx is sub-Gaussian as well. By a result in Wainwright (2019), this is equivalent to E exp(κZ2 j 2σ2e ) (1 κ) 1/2 (14) for all κ (0, 1). We have Ex 2 = EUx + ELx 2 ( EUx + ELx )2 2( EUx 2 + ELx 2) By the Cauchy Schwarz inequality, we obtain E exp(κ Ex 2 2σ2e ) E exp(2κ[ EUx 2 + ELx 2] E exp(4κ EUx 2 2σ2e )E exp(4κ ELx 2 Since EU and EL comprise of sub-Gaussian elements and zeros, they are also sub-Gaussian with σ2 e; then, multiplying (14) over j = 1, . . . , n for each matrix, we get E exp(κ Ex 2 (1 4κ) n/2(1 4κ) n/2 = (1 4κ) n/2 where κ (0, 1/4). Using Markov s inequality pr( Ex t) = pr exp(κ Ex 2 2σ2e ) exp( κt2 2σ2e ) (1 4κ) n/2 exp( κt2 Duan, Michailidis and Ding 4. Combining results to obtain a concentration inequality pr( E op t) (ϵ + 2 ϵ )n(1 4κ) n/2 exp( κ(1 ϵ)2t2 Letting t = c1 nσe, κ = 1/8 and ϵ = 1/2, we have pr( E op c1 nσe) exp[ {c2 1/64 log(5 Q ˆQ ˆO F 23/2k1/2c1 nσe λk+1 λk with probability greater than 1 δ. Proof of Theorem 8 For simplicity, we omit .(s) for now and let D = Λ and B = Ω. Without loss of generality, we assume the diagonal of B are ordered 0 = b1 < b2 . . . bn; and we have fixed d1 = 0 and d2, . . . , dn > 0. The parameter Q follows a matrix Bingham distribution truncated to V g(Q ; W, D, B, σ2 e)Π(d Q ) = Z 1(σ2 e, D, B)etr 1 2σ2e DQ WBW Q Π(d Q ) where Z is a normalizing constant. We utilize the result of Bhattacharya and Dunson (2010) to establish weak consistency of the posterior density estimation. There are three sufficient conditions to check: (1) The kernel g(.) is continuous in all of its arguments. (2) The set {F0} D0 ϵ intersects the parameter support of Q and σ2 e, where D0 ϵ is the interior of Dϵ, a compact neighborhood for σ2 e. (3) For any continuous f, there is a Dϵ for σ2 e, such that = sup W V ,σ2e Dϵ f(W) Z g(Q ; W, D, B, σ2 e)f(Q )Π(d Q ) ϵ. The first two conditions are straightforward to check (see Lin et al. (2017) for similar derivation). We will focus on verifying (3). Note the Frobenius distance between two orthonormal matrices dist(W, Q )2 = 2n 2tr(W Q ) = 2 j=1 (1 gj,j), where gi,j is the element of G = W Q , where |gj,j| 1 due to orthonormality of G. Let (1 gj,j) = sj,jσe, with sj,j [0, 2/σe], then Pn j=1(1 gj,j) = Pn j=1 sj,jσe. As σe 0, dist(W, Q ) 0 for any fixed (s1,1, . . . , sn,n). By the continuity of f and compactness of Stiefel manifold, as σe 0 f(W) f(Q ) 0. (15) Spiked Laplacian Z 1(σ2 e, D, B) Z sup W V f(W) f(Q ) etr 1 2σ2e DQ [WBW ]Q Π(d Q ) = Z 1(σ2 e, D, B) Z sup W V f(W) f(WG) etr 1 2σ2e DG BG Π(d G) where the second line is due to the invariant volume of rotation via W. It can be verified that tr(DG BG) = j=1 bidjg2 i,j j=1 bjdj(1 g2 j,j) + i =j bidjg2 i,j j=1 bjdj(1 g2 j,j) + i =j g2 i,j j=1 bjdj(1 g2 j,j) + j=1 djbn(1 g2 j,j) j=1 dj(bn bj)(1 g2 j,j) j=1 dj(bn bj)g2 j,j, where the first inequality is due to dj 0 and bn bi for all i; the fourth line is due to the 1 unit norm for each column of G. Applying one-to-one transformation T : V S, T(G) = {si,j = gi,j for i = j, sj,j = (1 gj,j)/σe}i,j, denote the transformed G matrix by GS. We have Π(d G) = φ (G)dg1,1 dg1,2 . . . dgn,n φ (GS) σn e φ (GS)ds1,1 ds1,2 . . . dsn,n φ (GS) σn e Π(d GS), where φ and φ are some functions of G and Gs, respectively. Since sj,j 2/σe, we have (1 sj,jσe)2 = 1+2sj,jσe s2 j,jσ2 e 3 s2 j,jσ2 e. Continuing from above, j=1 dj(bn bj)(1 sj,jσe)2 j=1 dj(bn bj)(3 s2 j,jσ2 e) j=1 dj(bn bj)s2 j,jσ2 e j=1 dj(bn bj)s2 j,jσ2 e Duan, Michailidis and Ding Combining the above, Z 1(σ2 e, D, B) exp 1 j=1 3djbj) σn e f(W) f(WGS) j=1 dj(bn bj)s2 j,j φ (GS) Π(d GS). Note that sup GS V sup W V f(W) f(WGS) M due to the compactness of V and continuity of f. And clearly, j=1 dj(bn bj)s2 j,j φ (GS) Π(d GS) < . Using dominated convergence theorem, when σe 0, the integral in (16) goes to zero. Our remaining task is to verify the constant before the integral is finite as σe 0. Note the inverse of the constant in (16) σ n e Z(σ2 e, D, B) exp 1 2σ2e = σ n e exp 1 2σ2e 2σ2e DU BU Π(d U) j=1 bidju2 i,j i=1 u2 i,j) Π(d U) j=1 (bi 4bn)dju2 i,j j=2 (bi 4bn)dju2 i,j j=2 (bi 4bn)dnu2 i,j i=1 (bi 4bn)dn(1 u2 i,1) Π(d U), where we use d1 = 0, (bi 4bn) 0 and dj dn in the inequality. Since the last line does not depend on U2:n, we denote the null space of u1 by K(u1) = {U2:n Vn 1,n : u ku1 = 0, k > 1}. It is not hard to see that the volume K(u1) is a constant invariant to u1, we denote it by vol(K). The above is then, σ n e vol(K) Z S+ exp 1 2σ2e i=1 (4bn bi)dn(1 u2 i,1) Π(d U1) σ n e vol(K) Z S+ exp 1 2σ2e i=1 (4bn bi)dn(1 + ui,1)2 Π(d U1), where S+ is the unit-norm space constrained to all elements positive; and the inequality due to (1 u2) = (1 u)(1 + u) (1 + u)2 for u 0. Spiked Laplacian Let ti = (1 + ui,1)/σe. We have Π(d U1) = ψ(U1)du1,1 du2,1 . . . dun,1 ψ(T) σn e ψ(T)dt1 dt2 . . . dtn ψ(T) σn e Π(d T), where ψ and ψ are some functions of U1 and T, respectively. The above is then i=1 (4bn bi)dnt2 i ψ(T) Π(d T), which is bounded away from 0. Therefore, the constant in (16) is finite as σ2 e 0. The limit result means that for any ϵ > 0, we have a neighborhood Dϵ = {σ2 e : 1/σ2 e > Nϵ}, so that < ϵ. Details of the Gibbs Sampling Algorithm The posterior sampling proceeds according to the following steps: 1. Sample Rs from (11) in the main article. 2. Sample U (l) from (12) in the main article. 3. Sample from the categorical distribution zs Π(zs | .) πl1(zs = l) exp 1 2σ2e tr( L(s)(In U (l)U (l)T ) ) + µθ 2( 1 σ2 λ,η(s) k + 1 2σ2e ) 1 T X u(l) k L(s)u(l) k 2σ2e + (1 η(s) k )µθ σ2 λ,η(s) k with 1(.) the indicator function, update Q(s) = U (zs). 4. Sample (π1, π2, . . . , πg) Dir(α0/g + P 1(zs = 1), α0/g + P 1(zs = 2), . . . , α0/g + P 1(zs = 1)). 5. Sample for k = 2, . . . , T λ(s) k N(0,2) ( 1 σ2 λ,η(s) k + 1 2σ2e ) 1 q(s) k L(s)q(s) k 2σ2e + (1 η(s) k )µθ σ2 λ,η(s) k , ( 1 σ2 λ,η(s) k + 1 2σ2e ) 1 . 6. Sample from the Bernoulli for k = 2, . . . , T, η(s) k 1(η(s) k = 1)w N(0,2)(λ(s) k ; 0, σ2 λ,1) + 1(η(s) k = 0)(1 w)N(0,2)(λ(s) k ; µθ, σ2 λ,0), where N(0,2)(x; a, b) denotes the density of the truncated normal. θ(s) N(0,2) i L(s)(i, i) X k q(s)T k L(s)q(s) k ) + µθ 8. Sample for i = 1, . . . , n L(s) i,i N Q(s)(Λ(s) θ(s)IT )Q(s) (i,i) + θ(s), 2σ2 e Duan, Michailidis and Ding σ2 e Inv-Gamma n2S s=1 L(s) θIn Q(l) (Λ(s) θ(s)IT )Q(l) 2 F The Laplacian Eigenvalues of Sparse Graphs under High Noise Level To provide some numerical illustration, we simulated additional graphs as in Section 5.1, except with the Bernoulli probability reduced to 0.3, 0.2 and 0.1, respectively. Within-community Bernoulli probability 0.3 Within-community Bernoulli probability 0.2 Within-community Bernoulli probability 0.1 Figure 10: When the graph sparsity increases, but the noise level remains relatively high, it becomes more difficult to distinguish the first few eigenvalues of the Laplacian from the remaining larger ones. Additional Components in Working Memory Data Analysis Figure 11: Fitted Laplacian shows the structure underneath the raw connectivity matrix. Spiked Laplacian Additional Details on Normalized Mutual Information On quantifying the accuracy of ˆc(s) i , we use the normalized mutual information, as a measure of similarity that is invariant to label switching. To provide some more details, consider discrete x and y as in two vectors of equal lengths, using P( ) to denote a proportion, we have I(x, y) = X i,j P(x = i, y = j) log P(x = i, y = j) P(x = i)P(y = j) i P(x = i) log P(x = i), NMI(x, y) = 2I(x, y) H(x) + H(y).