# sparse_popularity_adjusted_stochastic_block_model__273e82f0.pdf

Journal of Machine Learning Research 22 (2021) 1-36 Submitted 10/19; Revised 4/21; Published 10/21

Sparse Popularity Adjusted Stochastic Block Model

Majid Noroozi mnoroozi@memphis.edu Department of Mathematical Sciences University of Memphis Memphis, TN 38152, USA

Marianna Pensky Marianna.Pensky@ucf.edu Department of Mathematics University of Central Florida Orlando, FL 32816, USA

Ramchandra Rimal ramchandra.rimal@mtsu.edu Department of Mathematical Sciences Middle Tennessee State University Murfreesboro, TN 37132, USA

Editor: Edo Airoldi

In the present paper we study a sparse stochastic network enabled with a block structure. The popular Stochastic Block Model (SBM) and the Degree Corrected Block Model (DCBM) address sparsity by placing an upper bound on the maximum probability of connections between any pair of nodes. As a result, sparsity describes only the behavior of network as a whole, without distinguishing between the block-dependent sparsity patterns. To the best of our knowledge, the recently introduced Popularity Adjusted Block Model (PABM) is the only block model that allows to introduce a structural sparsity where some probabilities of connections are identically equal to zero while the rest of them remain above a certain threshold. The latter presents a more nuanced view of the network.

Keywords: Stochastic Block Model, Popularity Adjusted Block Model, Sparsity, Sparse Subspace Clustering

1. Introduction

1.1 Stochastic Block Models

The last few years have seen a surge of interest in stochastic network models. Indeed, such models appear in a variety of applications ranging from social to biological sciences. Stochastic networks can be described in a variety of ways, however, in the last decade stochastic block models attracted more and more attention due to their ability to summarize data in a compact and intuitive way and to uncover low-dimensional structures that fully describe a given network.

In this paper, we consider an undirected network with n nodes and no self-loops and multiple edges. Let A {0, 1}n n be the symmetric adjacency matrix of the network with Ai,j = 1 if there is a connection between nodes i and j, and Ai,j = 0 otherwise. We assume

c 2021 Majid Noroozi, Marianna Pensky and Ramchandra Rimal.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-835.html.

Noroozi, Pensky and Rimal

that Ai,j Bernoulli(Pi,j), 1 i j n, (1)

where Ai,j are conditionally independent given Pi,j and Ai,j = Aj,i, Pi,j = Pj,i for i > j. The block models assume that each node in the network belongs to one of K distinct blocks or communities Nk, k = 1, , K. The communities are described by the vector c of community assignment, with ci = k if the node i belongs to the community k. One can also consider a corresponding membership (or clustering) matrix Z {0, 1}n K such that Zi,k = 1 iﬀi Nk, i = 1, . . . , n. The degree of a node i and its expected degree are deﬁned, respectively, as the number of edges and the sum of probabilities of connections between the node i and the rest of the nodes. One of the features of the block models is that they assume that the probability of connection between node i Nk and node j Nl depends on the pair of blocks (k, l) to which nodes (i, j) belong. In particular, the Stochastic Block Model (SBM) assumes that the probability of connection between nodes is completely deﬁned by the communities to which they belong, so that, for any pair of nodes (i, j), one has Pi,j = Bci,cj where Bk,l is the probability of connection between communities k and l. In particular, under the SBM, all nodes from the same community have the same expected degree. Since the real life networks usually contain a very small number of high-degree nodes while the rest of the nodes have very few connections (low degree), the SBM model fails to explain the structure of many networks that occur in practice. The Degree Corrected Block Model (DCBM) addresses this deﬁciency by allowing these probabilities to be multiplied by the node-dependent weights (see, e.g., Chen et al. (2018), Karrer and Newman (2011), Zhao et al. (2012) among others). Under the DCBM, the elements of matrix P are modeled as Pi,j = θi Bci,cjθj, where θi, i = 1, . . . , n, are the degree parameters of the nodes, and B is the (K K) matrix of baseline interaction between communities. Identiﬁability of the parameters is usually ensured by a constraint of the form P i Nk θi = 1 for all k = 1, . . . , K (see, e.g., Karrer and Newman (2011)). The Popularity Adjusted Block Model (PABM), introduced by Sengupta and Chen (2018) and subsequently studied in Noroozi et al. (2021), provides a generalization of both the SBM and the DCBM. The DCBM enables a more ﬂexible spectral structure of matrix P which is especially useful in the cases when the mixed membership models cannot be employed. We are particularly interested in the PABM since, to the best of our knowledge, it is the only block model that allows to model structural sparsity in the connections between the nodes in the network. In order to understand the PABM, consider a rearranged version P(Z, K) of matrix P where its ﬁrst n1 rows correspond to nodes from class 1, the next n2 rows correspond to nodes from class 2 and the last n K rows correspond to nodes from class K. Denote the (k, l)- th block of matrix P(Z, K) by P (k,l)(Z, K). Then, sub-matrix P (k,l)(Z, K) [0, 1]nk nl corresponds to pairs of nodes in communities (k, l) respectively. It is easy to see that in the SBM, P (k,l)(Z, K) has all elements equal to Bk,l, while in the DCBM, P (k,l) = Bk,lθ(k)(θ(l))T

where θ(k) is the sub-vector of vector θ that contains weights for the nodes in community k. Under the PABM, each pair of blocks P (k,l)(Z, K) and P (l,k)(Z, K) is deﬁned using a unique combination of vectors Λ(l,k) as follows:

P (k,l)(Z, K) = [P (l,k)(Z, K)]T = Λ(k,l) [Λ(l,k)]T [0, 1]nk nl, k, l = 1, . . . , K. (2)

Sparse Popularity Adjusted Stochastic Block Model

Here, vectors Λ(k,l) [0, 1]nk, k = 1, . . . , K, form column l of matrix Λ [0, 1]n K given by

Λ(1,1) Λ(1,2) Λ(1,K)

Λ(2,1) Λ(2,2) Λ(2,K) ... ... ... Λ(K,1) Λ(K,2) Λ(K,K)

Vector Λ(k,l) represents the popularity (or, the level of interaction) of nodes in class k with respect to class l. The PABM allows higher degree of ﬂexibility in modeling the probability matrix and, in addition, does not require any identiﬁability conditions for its ﬁtting, thus, providing an attractive alternative to SBM and DCBM.

1.2 Sparsity in Block Models

The real life networks are usually sparse in a sense that a large number of nodes have small degrees. One of the shortcomings of both the SBM and the DCBM is that they do not allow to eﬃciently model sparsity. Speciﬁcally, in majority of high-dimensional setting, sparsity means structural sparsity and establishes that some parameters of the model are equal to zero and have no eﬀect on the variables of interest. Finding the set of nonzero parameters in such models is one of the goals of the inference. This is true in, for example, high-dimensional regression model where identiﬁcation of the set of nonzero coeﬃcients is crucial for understanding which independent variables aﬀect the variable of interest. However, the traditional stochastic block models do not allow to model sparsity in a structural way. The latter is due to simplistic modeling of connection probabilities. Indeed, for the SBM, it is not realistic to assume that all nodes in a pair of communities have no connections, hence, in the SBM setting, one does not assume that the block probabilities Bk,l = 0 for some k and l. The DCBM is not very diﬀerent in this respect, since setting any node-speciﬁc weight to zero will force the respective node to be totally disconnected from the network. For this reason, unlike in other numerous statistical settings, sparsity in block models is deﬁned as a low maximum probability of connections between the nodes: max i,j Pi,j τ(n) where τ(n) 0 as n (see, e.g., Klopp et al. (2017),

Lei and Rinaldo (2015)). As a result, sparsity describes only the behavior of network as a whole, without distinguishing between the block-dependent sparsity patterns. In addition, the above deﬁnition of sparsity has other drawbacks. In particular, one has to estimate every probability of connections Bk,l, no matter how small it is, and, in many settings (see, e.g., Klopp et al. (2017)), in order to take advantage of the fact that Pi,j are bounded above by τ(n), one needs to incorporate this unknown value into the estimation process. To the best of our knowledge, the PABM is the only existing block model that allows to model sparsity as structural sparsity where some connection probabilities are equal to zero, while the average connection probabilities between classes are above certain level, and the network is connected. In the context of PABM, setting Λ(k,l) i = 0 simply means that node i in class k is not active ( popular ) in class l. This, nevertheless, does not prevent this node from having high probability of connection with nodes in another class. Setting some elements of vectors Λ(k,l) to zero will merely lead to some of the rows (columns) of

Noroozi, Pensky and Rimal

sub-matrices P (k,l)(Z, K) being zero. Moreover, since Ai,j are Bernoulli variables with the means Pi,j, those zeros are fairly easy to identify, as Pi,j = 0 implies Ai,j = 0. Identiﬁcation of the set of zeros in the sub-columns Λ(k,l) of matrix Λ gives the nuanced picture of the behavioral patterns of the nodes in the network and leads to a better understanding of network topology. Moreover, it allows to improve the precision of estimation of the matrix of connection probabilities, since it is well known that, when many of the elements of a vector or a matrix are identical zeros, identifying those zeros and estimating the rest of the elements leads to a smaller error than when this information is ignored. In summary, to the best of our knowledge, our paper is the ﬁrst paper that studies structural sparsity in stochastic block models and the PABM is the only block model that allows the treatment. The rest of the paper is organized as follows. Section 2 is the key part of the paper. After introducing notations in Section 2.1, we review the PABM and convey the structure of the probability matrix in Section 2.2. Section 2.3 formulates an optimization procedure for estimation and clustering. Furthermore, Section 2.4 suggests two possible expressions for the penalties and examines the support sets of the true and estimated probability matrices. Section 3 produces upper bounds on the estimation and clustering errors. Since the optimization procedure in Section 2.3 is NP-hard, Section 4 discusses implementation of the community detection via sparse subspace clustering. Sections 5.1 and 5.2 complement the theory with simulations on synthetic networks and real data examples. Finally, Appendix A presents simulation results for the precision of estimation of the number of communities, and also contains the proofs of the statements in the paper.

2. Estimation and Clustering in Sparse PABM

2.1 Notation

For any two positive sequences {an} and {bn}, an bn means that there exists a constant C > 0 independent of n such that C 1an bn Can for any n. For any set Ω, denote cardinality of Ωby |Ω|. For any numbers a and b, a b = min(a, b). For any vector t Rp, denote its ℓ2, ℓ1, ℓ0 and ℓ norms by, respectively, t , t 1, t 0 and t . Denote by 1m the m-dimensional column vector with all components equal to one. For any matrix A, denote its spectral and Frobenius norms by, respectively, A op and A F . Let vec(A) be the vector obtained from matrix A by sequentially stacking its columns. Denote column i of matrix A by A:,i. Denote by ΠJ(X), the projection of a matrix X : n m onto the set of matrices with nonzero elements in the set J = J1 J2 = {(i, j) : i J1, j J2}. Denote by Π(1)(X) the best rank one approximation of matrix X and by Πu,v(X) the rank one projection of X onto pair of unit vectors u, v given by

Πu,v(X) = (uu T )X(vv T ). (4)

Then, Π(1)(X) = Πu,v(X) provided (u, v) is a pair of singular vectors of X corresponding to the largest singular value. Denote by Mn,K a collection of clustering matrices Z {0, 1}n K such that Zi,k = 1 iﬀ i Nk, i = 1, . . . , n, and ZT Z = diag(n1, . . . , n K) where nk = |Nk| is the size of community

Sparse Popularity Adjusted Stochastic Block Model

k, where k = 1, . . . , K. Denote by PZ,K {0, 1}n n the permutation matrix corresponding to Z Mn,K that rearranges any matrix B Rn n, so that its ﬁrst n1 rows correspond to nodes from class 1, the next n2 rows correspond to nodes from class 2 and the last n K rows correspond to nodes from class K. Recall that PZ,K is an orthogonal matrix with P 1 Z,K = PT Z,K. For any PZ,K and any matrix B Rn n denote the permuted matrix and its blocks by, respectively, B(Z, K) and B(k,l)(Z, K), where B(k,l)(Z, K) Rnk nl, k, l = 1, . . . , K, and

B(Z, K) = PT Z,KBPZ,K, B = PZ,KB(Z, K)PT Z,K. (5)

Also, throughout the paper, we use the star symbol to identify the true quantities. In particular, we denote the true matrix of connection probabilities by P and the true clustering matrix that partitions n nodes into K communities by Z .

2.2 The Structural Sparsity of the Probability Matrix

Consider the problem of estimation and clustering of the true matrix P of the probabilities of the connection between the nodes. Consider a block P (k,l) (Z , K ) of the rearranged version P (Z , K ) of P . Let Λ Λ(Z , K ) [0, 1]n K be a block matrix with each column l partitioned into K blocks Λ(k,l) Λ(k,l) (Z , K ). Here, Λ(k,l) [0, 1]nk and Λ(l,k) [0, 1]nl

are the column vectors and P (k,l) (Z , K ) follows (2), i.e., P (k,l) (Z , K ) = Λ(k,l) [Λ(l,k) ]T . Hence, P (k,l) (Z , K ) are rank-one matrices such that P (k,l) (Z , K ) = [P (l,k) (Z , K )]T and that each pair of blocks P (k,l) and P (l,k) , involves a unique combination of vectors Λ(k,l) and Λ(l,k) , k, l = 1, . . . , K . Vectors Λ(k,l) and Λ(l,k) describe the heterogeneity of the connections of nodes in the pair of communities (k, l). While, on average, those communities can be connected, some nodes in community k may have no interaction with nodes in community l or vice versa, so that some of the elements of vectors Λ(k,l) and Λ(l,k) can be identical zeros. Denote the set of indices of all nonzero elements of matrix Λ by

J J (Z , K ) =

k,l=1 (J )k,l.

Let (J )k,l (J )k,l(Z , K ) = {i : (Λ )(k,l) i = 0}, J(k,l) = (J )k,l (J )l,k, (6)

be, respectively, the true support of vector Λ(k,l) and the set of all ordered pairs of indices (positions) of non-zero elements of sub-matrix P (k,l) (Z , K ). Here, the elements of (J )k,l are enumerated by their corresponding rows in matrix Λ . Then,

(P )(k,l) i,j (Z , K ) > 0 iﬀ (i, j) J(k,l)

and row i and column j of P (k,l) (Z , K ) are equal to zero if i / (J )k,l or j / (J )l,k. Note that the set J J (Z , K ) relies upon the true clustering deﬁned by K and Z . One can also consider sparsity sets ( J )k,l ( J )k,l(Z, K) and Jk,l Jk,l(Z, K) for an

Noroozi, Pensky and Rimal

arbitrary K and matrix Z Mn,K

( J )k,l = {i : (P )(k,l) i,j (Z, K) = 0, for some j = 1, . . . , nl},

(7) Jk,l = {i : A(k,l) i,j (Z, K) = 0, for some j = 1, . . . , nl},

where the elements of ( J )k,l and Jk,l are enumerated by their corresponding rows in matrices P and A, respectively. Examples of the sets (J )k,l, (J )(k,l), ( J )k,l and ( J )(k,l) are considered in Section 2.4. For any sparsity sets Jk,l Jk,l(Z, K), deﬁne, similarly to (6),

k,l=1 Jk,l with J(k,l) = Jk,l Jl,k (8)

It follows from the deﬁnitions (7) and (8) that, for any K, Z Mn,K and k, l = 1, . . . , K

Jk,l(Z, K) ( J )k,l(Z, K) and J(Z, K) J (Z, K). (9)

2.3 Optimization Procedure for Estimation and Clustering

Observe that although matrices P (k,l) (Z , K ) and the sets J(k,l) are well deﬁned, vectors Λ(k,l) and Λ(l,k) can be determined only up to a multiplicative constant. In order to avoid this ambiguity, we denote Θ(k,l) = Λ(k,l) [Λ(l,k) ]T and recover matrix Θ with the uniquely deﬁned rank one blocks Θ(k,l) and their supports J(k,l) , k, l = 1, . . . , K . For this purpose, we need to solve the following optimization problem

(ˆΘ, ˆZ, ˆJ, ˆK) argmin Θ,Z,J,K

A(k,l)(Z, K) Θ(k,l)(Z, J, K) 2

F + Pen(n, J, K)

s.t. A(Z, K) = PT Z,KAPZ,K, Z Mn,K,

supp(Θ(k,l)) = J(k,l) = Jk,l Jl,k, rank(Θ(k,l)) = 1, k, l = 1, . . . , K.

Here, ˆΘ is the block matrix with blocks ˆΘ(k,l), k, l = 1, . . . , K. Observe that, if ˆZ, ˆJ and ˆK were known, the best solution of problem (10) would be given by the best rank one approximations ˆΘ(k,l) of matrices A(k,l)( ˆZ, ˆK), restricted to the sets ˆJ(k,l) of indices of nonzero elements:

ˆΘ(k,l)( ˆZ, ˆJ, ˆK) = Π(1) Π ˆJ(k,l) A(k,l)( ˆZ, ˆK) , (11)

where ΠJ(k,l) A(k,l) is the projection of matrix A(k,l) onto the set of matrices with the support J(k,l), and Π(1) is the best rank one approximation of a matrix. Plugging (11) into (10), we rewrite optimization problem (10) as

( ˆZ, ˆJ, ˆK) argmin Z,J,K

k,l=1 A(k,l)(Z, K) Π(1)[ΠJ(k,l)(A(k,l)(Z, K))] 2 F + Pen(n, J, K)

s.t. A(Z, K) = PT Z,KAPZ,K, Z Mn,K,

J(k,l) J(k,l)(Z, K) = Jk,l(Z, K) Jl,k(Z, K), k, l = 1, . . . , K.

Sparse Popularity Adjusted Stochastic Block Model

In practice, in order to obtain ( ˆZ, ˆJ, ˆK), one needs to solve optimization problem (12) for every K, obtaining

( ˆZK, ˆJK) argmin Z,J

A(k,l)(Z, K) Π(1) ΠJ(k,l)(A(k,l)(Z, K)) 2

F + Pen(n, J, K)

s.t. A(Z, K) = PT Z,KAPZ,K, ZK Mn,K,

J(k,l) J(k,l)(Z, K) = Jk,l(Z, K) Jl,k(Z, K), k, l = 1, . . . , K.

and then ﬁnd ˆK as

ˆK argmin K

A(k,l)( ˆZK, K) Π(1) Π ˆ J(k,l) K

A(k,l)( ˆZK, K) 2

F + Pen(n, ˆJK, K)

2.4 The Support of the Probability Matrix and the Penalty

Consider solution of optimization problem (13) for a ﬁxed value of K. If ˆZK Mn,K is a solution of (12), then

ˆJK argmin J

A(k,l)( ˆZK, K) Π(1) ΠJ(k,l) A(k,l)( ˆZK, K) 2

F + Pen(n, J, K)

s.t. A( ˆZK, K) = PT ˆ ZK,KAP ˆ ZK,K, J(k,l) = Jk,l Jl,k, Jk,l Jk,l( ˆZK, K).

Observe that if the penalty term Pen(n, J, K) were not present in (15) or did not depend on a set J, then one would have ˆJK = JK and ˆJ(k,l) K = J(k,l) K , where, by (7), J(k,l) K is the set of indices of nonzero rows and columns in A(k,l)( ˆZK, K). It is easy to see that

Π J(k,l) A(k,l)( ˆZK, K) = A(k,l)( ˆZK, K),

Π(1) Π J(k,l) A(k,l)( ˆZK, K) = Π(1) A(k,l)( ˆZK, K) .

Hence, even if sparsity is not speciﬁcally enforced (as it happens in Noroozi et al. (2021) where the penalty depends on n and K only), one still obtains a sparse estimator ˆP with the support ˆJK = JK. If the true number of clusters K and the true clustering matrix Z Mn,K were available, then the statement below shows that, with high probability, sets J J (Z , K ) and J(Z , K ) would coincide, provided nonzero elements of matrix P are above CK p

ln n/n where C is an absolute constant. Therefore, some zeros of the adjacency matrix correspond to the true zero probabilities of connections.

Lemma 1. Let K2 n and the true matrix P be such that (P )i,j = 0 or (P )i,j > ϖ(n, K ). If the community sizes are balanced, i.e., the sizes of the true communities are bounded below by C0n/K for some C0 (0, 1], and

then, with probability at least 1 e t, one has J (Z , K ) = J(Z , K ).

Unfortunately, K and Z are unknown and, hence, ˆJK(Z, K) = JK(Z, K) may not always be the best estimator. In order to understand this, consider, for example, the situation displayed in Figure 1

Noroozi, Pensky and Rimal

Figure 1: Zeros of the probability matrix with n = 5 and K = 2. Star symbols correspond to nonzero elements, the x symbols stand for the diagonal elements that are unavailable, the thick lines correspond to clustering assignments. Left panel: matrix Λ with (J )1,1 = {1, 2}, (J )2,1 = {3, 5}, (J )1,2 = {1, 2} and (J )2,2 = {3, 4, 5}. Middle panel: matrix P (Z , K ) with true clustering, ( J )c 2,1(Z ) = {4}, ˆPi,j(Z , K ) = 0 for (i, j) {(1, 4), (2, 4), (4, 1), (4, 2)}, so that, zero entries of the probability matrix are estimated by zeros. Right panel: matrix P ( ˆZ, K ) with node 3 erroneously placed into community 1. The values of (P )4,3 and (P )3,4 are nonzero. If A3,4 = A4,3 = 0, then {4} Jc 2,1( ˆZ) and ˆPi,j( ˆZ, K ) = 0 for (i, j) {(1, 4), (2, 4), (3, 4), (4, 1), (4, 2), (4, 3)}, hence, zero entries of P are still estimated by the identical zeros. However, if A4,3 = A3,4 = 1, then zero elements (P )4,1, (P )4,2, (P )1,4 and (P )2,4 are estimated by positive values.

where n = 5, K = 2 and, under the true clustering, one has n1 = 2 and n2 = 3. Vectors Λ2,1 has one zero element, so that (J )1,1 = {1, 2}, (J )2,1 = {3, 5}, (J )1,2 = {1, 2} and (J )2,2 = {3, 4, 5} (left panel) leading to (J )(1,1) = {(1, 1), (1, 2), (2, 1), (2, 2), }, (J )(2,1) = {(3, 1), (3, 2), (5, 1), (5, 2)}, (J )(1,2) = {(1, 3), (2, 3), (1, 5), (2, 5)} and (J )(2,2) = {(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5), (5, 3), (5, 4), (5.5)} (middle panel). With the true clustering (middle panel), ( J )c 2,1(Z ) = {4}, so that ˆPi,j(Z , K ) = 0 for (i, j) {(1, 4), (2, 4), (4, 1), (4, 2)}. Hence, zero entries of the probability matrix are estimated by zeros.

Consider now the situation where the third node has been erroneously placed into community 1 by clustering matrix ˆZ (right panel). Then, we have (J )c 2,1 = {4} but ( J )c 2,1( ˆZ) is an empty set. If A3,4 = A4,3 = 0, then {4} Jc 2,1( ˆZ) and ˆPi,j( ˆZ, K ) = 0 for (i, j) {(1, 4), (2, 4), (4, 1), (4, 2)}, hence, zero entries of P are still estimated by the identical zeros. However, if A4,3 = A3,4 = 1, then it is possible that zero elements (P )4,1, (P )4,2, (P )1,4 and (P )2,4 are estimated by positive values. For example, if A5,1 = 1, A5,2 = 1 and A5,3 = 1, then ˆP4,1 = 0.3536 and ˆP4,2 = 0.3536 which leads to higher estimation errors than setting ˆP4,1 = ˆP4,2 = 0. Therefore, it is reasonable to introduce a penalty that will lead to trimming the support of ˆP(Z, K).

One can consider two kinds of penalties here: separable and non-separable. We say that a penalty Pen(n, J, K) is separable if for any K and any clustering matrix Z that partitions n nodes

Sparse Popularity Adjusted Stochastic Block Model

into K communities of sizes nk, k = 1, . . . , K, one can write

Pen(n, J, K) = Pen(0)(n, J, K) + Pen(1)(n, K) with Pen(0)(n, J, K) =

k=1 F(|Jk,l|, nk), (16)

where Jk,l Jk,l(Z, K). Otherwise, the penalty is non-separable.

Lemma 2. Let ( ˆZK, ˆJK) be the solution of the optimization problem (13). If Pen(n, J, K) is an increasing function of |J| (for a non-separable penalty) or of |Jk,l|, k, l = 1, . . . , K (for a separable penalty), then

ˆJk,l( ˆZK, K) Jk,l( ˆZK, K) ( J )k,l( ˆZK, K), ˆJ( ˆZK, K) J( ˆZK, K) J ( ˆZK, K). (17)

3. The Errors of Estimation and Clustering

3.1 The penalty

In what follows, we consider the separable and the non-separable penalties of the form (16) with the common Pen(1)(n, K) term, i.e.

Pen(a)(n, J, K) = Pen(0,a)(n, J, K) + Pen(1)(n, K), (18)

where a =s for the separable penalty and a = ns for the non-separable one, and

Pen(0,s)(n, J, K) = β1

k,l=1 |Jk,l| ln(nke/|Jk,l|) + β2K

k=1 ln nk (19)

Pen(0,ns)(n, J, K) = β1|J| ln(n Ke/|J|) + 2β2 ln n (20)

Pen(1)(n, K) = β2[n ln K + ln n]. (21)

Here, the separable penalty corresponds to F(|Jk,l|, nk) = β1|Jk,l| ln(nke/|Jk,l|) + β2 ln nk and the exact expressions for β1 and β2 are given in the proof of Theorem 1. In the next two sections, we shall provide upper bounds for the errors of the solution of optimization problem (10) with the separable or the non-separable penalty (18), as well as upper bounds for the clustering error in the case of the separable penalty. While the separable penalty has some valuable properties (see Lemma 2), the non-separable penalty is much easier to interpret. Fortunately, as the statement below shows, under very nonrestrictive conditions, the penalties are within a constant factor of each other.

Lemma 3. If n 8 and K p

n/ ln n, then

Pen(ns)(n, J, K) < (2 + β1/β2) Pen(s)(n, J, K) < 2 (2 + β1/β2) Pen(ns)(n, J, K). (22)

3.2 The Estimation Errors

Theorem 1. Let (ˆΘ, ˆZ, ˆJ, ˆK) be a solution of optimization problem (10) with the penalty deﬁned in (18). Construct the estimator ˆP of P of the form

ˆP = P ˆ Z, ˆ K ˆΘ( ˆZ, ˆJ, ˆK) PT ˆ Z, ˆ K (23)

where P ˆ Z, ˆ K is the permutation matrix corresponding to ( ˆZ, ˆK). Then, for any t > 0 and some absolute positive constants γ and C, one has

P n n 2 ˆP P 2 F n 2 H0 Pen(n, J , K ) + n 2 Ct o 1 3e t, (24)

Noroozi, Pensky and Rimal

n 2 E ˆP P 2 F n 2 H0 Pen(n, J , K ) + 3n 2 C. (25)

The exact expressions for H0 and C are given in the proof of Theorem 1.

Observe that, due to Lemma 3, the separable and non-separable penalties are within a constant factor of each other, so that Theorem 1 implies that the estimation error is proportional to Pen(n, J , K ) where Pen(n, J, K) Pen(ns)(n, J, K) n ln K + |J| ln(n Ke/|J|) + ln n. (26)

The ﬁrst term in (26) is due to the clustering errors, the second term quantiﬁes the diﬃculty of ﬁnding |J| nonzero elements among n K elements of matrix Λ [0, 1]n K and estimating them, while the term ln n ln(n K) stands for the diﬃculty of ﬁnding the cardinality of the set |J|, and it is always dominated by the ﬁrst two terms in (26). Since each node is connected to at least one community with a nonzero probability, one has n |J| n K. In the (non-sparse) PABM, |J| = n K and the second term in (26) is always asymptotically larger than the other two terms, as n . In SPABM, the second term in (26) dominates the ﬁrst term only if K = 1 or |J|/n as n . However, if K > 1 and |J| n, then both terms are of the equal asymptotic order. If K and |J| n as n , then SPABM has the error O(n ln K) which is asymptotically smaller than O(n K) error of PABM.

3.3 Detectability of clusters

In order one can detect clusters, the vectors Λ(k,l), l = 1, . . . , K, should be suﬃciently diﬀerent for every k = 1, . . . , K. Assume that K = K is known and that the following condition holds.

Assumption A1. For any k = 1, . . . , K, vectors Λ(k,1), . . . , Λ(k,K) are linearly independent.

Under Assumption A1, the true clusters are detectable.

Lemma 4. Let Z Mn,K be the true clustering matrix, and Z Mn,K be an arbitrary clustering matrix. Let J = J (Z ) be the true set of indices of nonzero elements, and J = J (Z) be the set of indices of nonzero elements, deﬁned in (7), which is associated with a clustering matrix Z Mn,K. If Assumption A1 holds and the network is connected, then

P (k,l) (Z ) Π(1) ΠJ(k,l) (P (k,l) (Z ) 2

P (k,l) (Z) Π(1) Π J(k,l) (P (k,l) (Z) 2

(27) where, for any matrix B, Π(1)(B) is its rank one approximation and ΠJ is its projection on the set of indices deﬁned by J. Moreover, equality in (27) occurs if and only if matrices Z and Z coincide up to a permutation of columns.

3.4 The Clustering Errors

In order to evaluate the clustering error when clustering is applied to the adjacency matrix, we assume that the true number of classes K = K is known. Then ˆZ ˆZK is a solution of the optimization problem (13). Let Z Mn,K be the true clustering matrix and Z Mn,K be any other clustering matrix. Then the proportion of misclustered nodes can be evaluated as

Err(Z, Z ) = (2n) 1 min PK PK ZPK Z 1 = (2n) 1 min PK PK ZPK Z 2 F (28)

Sparse Popularity Adjusted Stochastic Block Model

where PK is the set of permutation matrices PK : {1, 2, , K} {1, 2, , K}. Let

Υ(Z , δn) = Z Mn,K : (2n) 1 min PK PK ZPK Z 1 δn

be the set of clustering matrices with the proportion of misclassiﬁed nodes being at least δn (0, 1). The success of clustering in (13) relies upon the fact that matrix P is a collection of K2 rank one blocks, so that the operator and the Frobenius norms of each block are the same. On the other hand, if clustering were incorrect, the ranks of the blocks would increase which would lead to the discrepancy between their operator and Frobenius norms. In particular, the following statement is true.

Theorem 2. Let K = K 2 be the true number of clusters, Z Mn,K be the true clustering matrix and Assumption A1 hold. Let J = J (Z ) be the true set of indices of nonzero elements, and J = J (Z) be the set of indices of nonzero elements, deﬁned in (7), which is associated with a clustering matrix Z Mn,K. Let ˆZ ˆZK be a solution of the optimization problem (13) and δn 0 as n . If there exists αn (0, 1/2) and absolute positive constants H1 and H2, independent of K, n, J , J (Z), δn and αn, such that

P 2 F max Z Υ(Z ,δn)

k,l=1 P (k,l) (Z) 2 op + H1

αn | J (Z)| ln

αn (|J | + n ln K) + H2|J | ln n Ke

then, with probability at least 1 2K n, the proportion of the nodes, misclassiﬁed by ˆZ, is at most δn.

Example 1. In order to see what condition (30) means, we consider a simple example. We study the sparse PABM with K = 2, and Z Mn,2 with equal size communities N = n/2. Assume that Λ(k,k) = a 1N, k = 1, 2, while elements Λ(k,l) i of vectors Λ(k,l), k = l, are equal to

b if i Jk, k = 1, 2, and equal to zero otherwise. Examine the case of an assortative network, where a an, b bn and b/a = ρ ρn 1. Denote J = J1 J2 and note that the cardinality of the set of nonzero elements of matrix Λ is equal to 2N + |J| with |J| = |J1| + |J2|. Denote the overall proportion of nonzero entries in vectors Λ(1,2) and Λ(2,1) by γ, and the proportion of zero entries in vectors Λ(1,2)

and Λ(2,1) by s: γ = |J|/n = (|J1| + |J2|)/(2N), s = 1 γ.

Below we examine what condition (30) of Theorem 2 means for diﬀerent values of s and ρ. Assume that the connection probabilities are not too small, speciﬁcally, that

lim n na2 n = . (31)

Let δn δ = δ1 + δ2. Let Z Υ(Z , δn) Mn,2 be an arbitrary incorrect clustering matrix and, according to Z, ˇNk = Nδk nodes are moved erroneously from class k to class l, l = k, and Nk = N(1 δk) nodes remain correctly in class k. Then, according to Z, community k has Nk + ˇNl nodes, k = 1, 2, k = l, and the proportion of misclassiﬁed nodes is equal to (δ1N + δ2N)/n = δ/2. Denote the subsets of nodes corresponding to nonzero elements of vector Λ(k,l), that correctly stay in class k and those that are misclassiﬁed into community l, l = k, by Jk and ˇJk, respectively. Then Jk = Jk ˇJk, k = 1, 2. Denote

βk = | Jk|/| Nk|, ˇβk = | ˇJk|/| ˇNk|, k = 1, 2, (32)

Noroozi, Pensky and Rimal

and note that βk, ˇβk [0, 1]. Then, for any Z Mn,2 with equal class sizes and the proportion of misclassiﬁed nodes being δ/2, one has

P 2 F (1 + αn)

k,l=1 P (k,l) (Z) 2 op ˇC a2n2( n(Z) 16 αn), (33)

where ˇC is an absolute constant, α αn and

n(Z) = δ2 1 (1 ρ2 n ˇβ2 1 β2 2)2 + δ2 2 (1 ρ2 n β2 1 ˇβ2 2)2. (34)

The proof of the inequality (33) is given in the Appendix. Note that, in this example, the right hand side of (30) reduces to H1 n α 1 n + H2n, so we need to show that a2n2 n(Z) 16 a2n2 αn + H1 n α 1 n + H2n, (35)

for some αn (0, 1/2) and absolute positive constants H1 and H2. It is easy to see that the right hand side of (35) is minimized by αn = Cα /(an n) (0, 1/2), and (35) appears as

an n n(Z) H3 (36)

for some absolute positive constant H3. Below, we examine when this condition can be satisﬁed for δn 0 as n . First, we consider the case when s = 0, so that γ = 1 and there is no structural sparsity. In this case, ˇβk = βk = 1, k = 1, 2, and, due to δ2 1 + δ2 2 (δ1 + δ2)2/2 = δ2/8, one obtains from (34) that n(Z) δ2 n(1 ρ2 n)2. Hence, (36) becomes an n δ2 n(1 ρ2 n)2 H3, so that

δ2 n na2 n(1 ρ2 n)2 1 0 if na2 n(1 ρ2 n)2 (n ). (37)

The latter implies that either an should be asymptotically larger than n 1/2 or the ratio ρn = bn/an should be separated from one. Now, consider s > 0, so that γ < 1. In this case we need the minimal possible value of n(Z) over Z Υ(Z , δn) to satisfy condition (36). To formalize this notion, we introduce

b F(γ, δ, ρ, a, n) = min n δ2 1 (1 ρ2 n ˇβ2 1 β2 2)2 + δ2 2 (1 ρ2 n β2 1 ˇβ2 2)2o

s.t. 0 βk 1, 0 ˇβk 1, βk, ˇβk given by (32), k = 1, 2,

δk 0, k = 1, 2, δ1 + δ2 = δ 1/2 β1(1 δ1) + β2(1 δ2) + ˇβ1δ1 + ˇβ2δ2 = 2γ (38)

In order the proportion of clustering errors is bounded above by δn 0, one needs

b F(γ, δn, ρn, an, n) a2 nn 1/2 , (n ). (39)

Consider the case when s < 1/2, so that 1/2 < γ < 1. If δn 0, then, for n large enough, one has δn 2(γ 1/2) and, hence, 2γ 1+δn. Set δ1 = δ, δ2 = 0, ˇβ1 = β2 = 1, β1 = [2γ (1+δ)]/(1 δ), ˇβ2 = 0. It is easy to verify that conditions in (38) hold, so that b F(γ, δ, ρ, a, n) δ2(1 ρ2 n)2 and condition (39) is equivalent to (37), that occurs when there is no structural sparsity (s = 0). Now, let s > 1/2, so that 0 < γ < 1/2. Let d = (1 2γ)/2 and, if δn 0, then, for n large enough, one has δn d. Let γ0 = (1 2d)/(1 d) < 1. Then, 2γ/(1 δn) γ0. By (38), obtain βk(1 δk) 2γ and, hence, βk 2γ/(1 δk) γ0, k = 1, 2. Consequently, due to βk, ˇβk 1, obtain

b F(γ, δn, ρn, an, n) (δ2 1 + δ2 2)(1 ρ2 nγ2 0)2 δ2 n(1 γ2 0)2/2.

Sparse Popularity Adjusted Stochastic Block Model

Since γ0 is a non-asymptotic quantity, condition (39) holds for some δn 0 as n , whenever assumption (31) is satisﬁed. Therefore, if s > 1/2, one has δn 0 even if ρn 1 as n . The sparsity proportion of s = 1/2 constitutes the so called elbow value, so the diﬃculty of clustering varies signiﬁcantly for s < 1/2 and s > 1/2. Analysis of the conditions that ensure δn 0 when s = 1/2 requires more sophisticated tools, so we do not study s = 1/2 in this paper.

Remark 1. Non-constant connection probabilities. We remark that consideration of constant values for elements of vectors Λ(k,k) and Λ(k,l), k, l = 1, 2, k = l, is motivated by showing a clear pattern of the impact of sparsity on the clustering precision. Assumption that in-cluster and outof-cluster connection probabilities take constant values are quite common in stochastic networks literature (see, e.g., Abbe (2018), Abbe et al. (2020a), Abbe et al. (2020b) and Ndaoud et al. (2020) among others). Indeed, if Λ(k,k) i an, and Λ(k,l) i bn if i Jk, k = 1, 2, k = l, and equal to zero otherwise, where bn/an = ρn 1, then conclusion of Example 1 that δn 0 as n is still true, provided condition (31) holds and s > 1/2. However, in the case of s < 1/2, δn may tend to zero even if s < 1/2, depending on the exact values of components of vectors Λ(k,k) and Λ(k,l). Studying the case of constant probabilities allowed us to show the beneﬁts of structural sparsity more clearly.

4. Implementation of Clustering

In Section 2, we obtained an estimator ˆZ of the true clustering matrix Z as a solution of optimization problem (12). Minimization in (12) is somewhat similar to modularity maximization in Bickel and Chen (2009) or Zhao et al. (2012) in the sense that modularity maximization as well as minimization in (12) are NP-hard, and, hence, require some relaxation in order to obtain an implementable clustering solution. In the case of the SBM and the DCBM, possible relaxations include semideﬁnite programming (see, e.g., Amini and Levina (2018) and references therein), variational methods (Celisse et al. (2012)) and spectral clustering and its versions (see, e.g., Joseph and Yu (2016), Lei and Rinaldo (2015) and Rohe et al. (2011) among others). Since in the case of SPABM, columns of matrix P that correspond to nodes in the same class are neither identical, nor proportional, direct application of spectral clustering to matrix P does not deliver the partition of the nodes. However, it is easy to see that the columns of matrix P that correspond to nodes in the same community, form a matrix with K rank-one blocks, hence, those columns lie in the subspace of the dimension at most K. Therefore, matrix P consists of K clusters of columns (rows) that lie in the union of K distinct subspaces, each of the dimension K. For this reason, the subspace clustering presents a technique for obtaining a fast and reliable solution of optimization problem (12) (or (13)). Subspace clustering has been widely used in computer vision and, for this reason, it is a very well studied and developed technique. Subspace clustering is designed for separation of points that lie in the union of subspaces. Let {Xj RD}n j=1 be a given set of points drawn from an unknown union of K 1 linear or aﬃne subspaces {Si}K i=1 of unknown dimensions di = dim(Si), 0 < di < D, i = 1, ..., K. In the case of linear subspaces, the subspaces can be described as Si = {x RD : x = U iy}, i = 1, ..., K, where U i RD di is a basis for subspace Si and y Rdi is a low-dimensional representation for point x. The goal of subspace clustering is to ﬁnd the number of subspaces K, their dimensions {di}K i=1, the subspace bases {U i}K i=1, and the segmentation of the points according to the subspaces. Several methods have been developed to implement subspace clustering such as algebraic methods (Boult and Brown (1991), Ma et al. (2008), Vidal et al. (2005)), iterative methods (Agarwal and Mustafa (2004), Bradley and Mangasarian (2000), Tseng (2000)), and spectral clustering based methods (Elhamifar and Vidal (2009), Elhamifar and Vidal (2013), Favaro et al. (2011), Liu et al. (2013), Liu et al. (2010), Soltanolkotabi et al. (2014), Vidal (2011)). In this paper, we use the latter group of techniques.

Noroozi, Pensky and Rimal

Spectral clustering algorithms rely on construction of an aﬃnity matrix whose entries are based on some distance measures between the points. In particular, in the case of the SBM, adjacency matrix itself serves as the aﬃnity matrix, while for the DCBM, the aﬃnity matrix is obtained by normalizing rows/columns of A. In the case of the subspace clustering problem, one cannot use the typical distance-based aﬃnity because two points could be very close to each other, but lie in diﬀerent subspaces, while they could be far from each other, but lie in the same subspace. One of the solutions is to construct the aﬃnity matrix using self-representation of the points with the expectation that a point is more likely to be presented as a linear combination of points in its own subspace rather than from a diﬀerent one. A number of approaches such as Low Rank Representation (see, e.g., Liu et al. (2013), Liu et al. (2010)) and Sparse Subspace Clustering (see, e.g., Elhamifar and Vidal (2013), Elhamifar and Vidal (2009)) have been proposed in the past decade for the solution of this problem. In this paper, we use Sparse Subspace Clustering (SSC) since it allows one to take advantage of the knowledge that, for a given K, columns of matrix P lie in the union of K distinct subspaces, each of the dimension at most K. If matrix P were known, the weight matrix W would be based on writing every data point as a sparse linear combination of all other points by solving the following optimization problem min Wj Wj 1 s.t. (P )j = X

k =j Wkj(P )k (40)

In the case of data contaminated by noise, the SSC algorithm does not attempt to write data as an exact linear combination of other points. Instead, SSC can be built upon the solution of the elastic net problem

c Wj argmin Wj

2 Aj AWj 2 2 + γ1 Wj 1 + γ2 Wj 2 2

s.t. Wjj = 0 , j = 1, ..., n, (41)

where γ1, γ2 > 0 are tuning parameters. The quadratic term stabilizes the LASSO problem by making the problem strongly convex, and therefore it has a unique minimum. We solve (41) using the LARS algorithm Efron et al. (2004) implemented in SPAMS Matlab toolbox (see Mairal et al. (2014)). Given c W, the aﬃnity matrix is deﬁned as |c W| + |c W T | where, for any matrix B, matrix |B| has absolute values of elements of B as its entries. The class assignment (clustering matrix) Z is then obtained by applying spectral clustering to |c W| + |c W T |. We elaborate on the implementation of the SSC in Section 5.1.

5. Simulations and Real Data Examples

5.1 Simulations on Synthetic Networks

In this section we evaluate the performance of our method using synthetic networks. We assume that the number of communities (clusters) K is known and for simplicity consider a perfectly balanced model with n/K nodes in each cluster. We generate each network from a random graph model with a symmetric probability matrix P given by the SPABM model with a clustering matrix Z and a block matrix Λ. To generate synthetic networks, we start by producing a block matrix Λ in (3) with random entries between 0 and 1. We use a parameter σ as the proportion of nonzero entries in matrix Λ to control the sparsity of networks. To do that, we set n Kσ smallest non-diagonal entries of Λ to zero. Then we multiply the non-diagonal blocks of Λ by ω, 0 < ω < 1, to ensure that most nodes in the same community have larger probability of interactions. As a result, matrix P(Z, K) with blocks P (k,l)(Z, K) = Λ(k,l)(Λ(l,k))T , k, l = 1, . . . , K, has larger entries mostly in the diagonal blocks than in the non-diagonal blocks and some zero rows (columns) in the non-diagonal blocks. The parameter ω is the heterogeneity parameter. Indeed, if ω = 0, the matrix P is strictly

Sparse Popularity Adjusted Stochastic Block Model

300 350 400 450 500 n

Clustering Error

300 350 400 450 500 n

Estimation Error

=0.5, =0.3 =0.5, =0.5 =0.5, =0.7 =0.8, =0.3 =0.8, =0.5 =0.8, =0.7

300 350 400 450 500 n

Clustering Error

300 350 400 450 500 n

Estimation Error

=0.5, =0.3 =0.5, =0.5 =0.5, =0.7 =0.8, =0.3 =0.8, =0.5 =0.8, =0.7

300 350 400 450 500 n

Clustering Error

300 350 400 450 500 n

Estimation Error

=0.5, =0.3 =0.5, =0.5 =0.5, =0.7 =0.8, =0.3 =0.8, =0.5 =0.8, =0.7

Figure 2: The clustering errors Err( ˆZ, Z) deﬁned in (28) (left panels) and the estimation errors n 2 ˆP P 2 F (right panels) for K = 4 (top), K = 5 (middle) and K = 6 (bottom) clusters. The errors are evaluated over 100 simulation runs. The number of nodes ranges from n = 300 to n = 540 with the increments of 60. Dashed lines represent the results for ω = 0.5 and solid lines represent the results for ω = 0.8; σ = 0.3 (red), σ = 0.5 (blue) and σ = 0.7 (black).

Noroozi, Pensky and Rimal

block-diagonal, while in the case of ω = 1, there is no diﬀerence between entries in diagonal and nonzero entries in non-diagonal blocks. Next, we generate a random clustering matrix Z Mn,K corresponding to the case of equal community sizes and the permutation matrix PZ,K corresponding to the clustering matrix Z. Subsequently, we scramble rows and columns of P(Z, K) to create the probability matrix P = PZ,KP(Z, K)PT Z,K. Finally we generate the lower half of the adjacency matrix A as independent Bernoulli variables Aij Ber(Pij), i = 1, . . . , n, j = 1, . . . , i 1, and set Aij = Aji when j > i. In practice, the diagonal elements of matrix A are unavailable, so we estimate diag(P) without their knowledge. Now we use SSC to ﬁnd the clustering matrix ˆZ. Since the diagonal elements of matrix A are unavailable, we initially set Aii = 0, i = 1, ..., n, and solve optimization problem (41) with γ1 = 30ρ(A) and γ2 = 125(1 ρ(A)), where ρ(A) is the density of matrix A, the proportion of nonzero entries of A. The values of γ1 and γ2 have been obtained empirically by testing on synthetic networks. After matrix c W of weights is evaluated, we obtain the clustering matrix ˆZ by applying spectral clustering to |c W| + |c W T |, as it was described in Section 4. In this paper, we use the normalized cut algorithm Shi and Malik (2000) to perform spectral clustering. Given ˆZ, we generate matrix A( ˆZ, K) = PT ˆ Z,KAP ˆ Z,K with blocks A(k,l)( ˆZ, K), k, l = 1, . . . , K, and obtain ˆΘ(k,l)( ˆZ, K) by using the rank one approximation for each of the blocks. Finally, we estimate matrix P by ˆP = ˆP( ˆZ, ˆK) using formula (23) with ˆK = K. Figure 2 represents the accuracy of SSC in terms of the average estimation errors n 2 ˆP P 2 F and the average clustering errors Err( ˆZ, Z) deﬁned in (28) for K = 4, 5 and 6, respectively, and the number of nodes ranging from n = 300 to n = 540 with the increments of 60. The left panels display the clustering errors Err( ˆZ, Z) while the right ones exhibit the estimation errors n 2 ˆP P 2 F , as functions of the number of nodes, for two diﬀerent values of the parameter ω: ω = 0.5 (dashed lines) and 0.8 (solid lines) and three diﬀerent values of the parameter σ: σ = 0.3 (red lines), 0.5 (blue lines), and 0.7 (black lines). Figure 2 shows that sparsity has a diﬀerent eﬀect on estimation and clustering errors. It is easy to see that as sparsity increases (σ decreases), the estimation errors decrease. On the other hand, the diﬃculty of clustering depends on combination of the sparsity parameter σ and the heterogeneity parameter ω. Speciﬁcally, a denser network is easier to cluster when the network is more diverse (the heterogeneity parameter ω is larger), while for a very sparse network, heterogeneity of the network does not play much of a role. Indeed, in all three graphs in the left half of Figure 2, the red curves, corresponding to the most sparse case (σ = 0.3), are close together while the black curves, corresponding to the least sparse case (σ = 0.7), are further apart. The graphs also show the eﬀect of the number of clusters K on the clustering errors. Indeed, for large K (K = 6), when n is small (n < 420), a sparser network is not harder to cluster than a denser one, perhaps because the diverse sparsity patterns make the network less uniform. In summary, the diﬃculty of clustering depends on the interplay between sparsity and heterogeneity of the network. Our procedure does not estimate the set J explicitly. Instead, we set ˆJ = J = SK k,l=1 Jk,l where Jk,l is deﬁned in (7). Our next objective is to evaluate how accurate J is, as an estimator of J . While there are several ways for doing this, below we use two measures, the false positive rate ρF P , deﬁned as the proportion of zero entries in P that are estimated by non-zeros in ˆP, and F N = P 1 F X F , where X F is the Frobenius norm of nonzero entries in P that are estimated by zeros in ˆP. The reports on the accuracies of estimating J are presented in Figure 3. The left panels display ρF P while the right ones exhibit F N, as functions of the number of nodes for the same settings as in Figure 2. The left panels of Figure 3 demonstrate that the proportion of false positives ρF P decreases as the network becomes more and more sparse and more heterogeneous (the proportions of false positives are smaller for smaller values of σ and larger values of ω). Again, the same as for Figure 2, the pattern emerges only when the number of nodes per community reaches some critical threshold. Indeed, as the bottom left panel of Figure 3 shows, the false positive rate is high, when the number of

Sparse Popularity Adjusted Stochastic Block Model

300 350 400 450 500 n

300 350 400 450 500 n

=0.5, =0.3 =0.5, =0.5 =0.5, =0.7 =0.8, =0.3 =0.8, =0.5 =0.8, =0.7

300 350 400 450 500 n

300 350 400 450 500 n

=0.5, =0.3 =0.5, =0.5 =0.5, =0.7 =0.8, =0.3 =0.8, =0.5 =0.8, =0.7

300 350 400 450 500 n

300 350 400 450 500 n

=0.5, =0.3 =0.5, =0.5 =0.5, =0.7 =0.8, =0.3 =0.8, =0.5 =0.8, =0.7

Figure 3: The false positive rates ρFP (left panels) and the rates FN (right panels) for K = 4 (top), K = 5 (middle) and K = 6 (bottom) clusters. The rates are evaluated over 100 simulation runs. The number of nodes ranges from n = 300 to n = 540 with the increments of 60. Dashed lines represent the results for ω = 0.5 and solid lines represent the results for ω = 0.8; σ = 0.3 (red), σ = 0.5 (blue) and σ = 0.7 (black).

Noroozi, Pensky and Rimal

nodes is small. The right hand side panels of Figure 3 show that F N, the relative norm of nonzero entries of P estimated by zeros, is minimal for the moderately sparse network σ = 0.5 and becomes smaller when the network is more heterogeneous. One can also notice that the values of F N are almost independent of σ when the network is relatively homogeneous (ω = 0.5) but become more diverse when the network becomes more diverse (ω = 0.8).

Remark 2. Unknown number of clusters. In our previous simulations we treated the true number of clusters as a known quantity. However, we can actually use ˆP to obtain an estimator ˆK of K by solving, for every suitable K, the optimization problem (14), which can be equivalently rewritten as ˆK = argmin K { ˆP A 2 F + Pen(n, J, K)}. (42)

The penalties Pen(n, J, K) deﬁned in (18) are, however, motivated by the objective of setting it above the noise level with a very high probability. In our simulations, we also study the selection of an unknown K using an empirical version of this penalty

Pen(n, J, K) = ρ(A)n K p

ln n (ln K)3. (43)

In order to assess the accuracy of ˆK as an estimator of K, we evaluated ˆK as a solution of optimization problem (42) with the penalty (43) in each of the previous simulations settings over 100 simulation runs. Table 1 in Section A.1 of the Appendix presents the relative frequencies of the estimators ˆK of K for K ranging from 3 to 5, n = 360 and 480 and ω = 0.5 and 0.8 and σ = 0.4, 0.6 and 0.8. Table 1 conﬁrms that for majority of settings, ˆK = K , i.e., the estimated and the true number of clusters coincide with high probability. We would like to point out that the problem (41) of ﬁnding weights is indeed strongly convex and it leads to a unique set of weights for every column of the adjacency matrix. However, the subsequent spectral clustering is not convex since it requires application of the K-means clustering to the main K eigenvectors of the weight matrix. The subspace clustering is carried out with a ﬁxed number of clusters. The number of clusters is then found as a solution of the discrete optimization problem (14). Therefore, even with the same adjacency matrix, due to random initialization of the K-means algorithm, the values of ˆK may vary.

5.2 Real Data Examples

In this section, we report the performance of SSC and our estimation procedure when they are applied to two real life networks, an ego-network and a human brain network. To study the ego-network, we use the dataset described comprehensively in Leskovec and Mcauley (2012). An ego-network is a social network of a single person, with the exclusion of the person generating this network. Users of social networking sites are usually provided with a tool that allows them to organize their networks into categories, referred to, in Leskovec and Mcauley (2012), as social circles. Practically all major social networking cites provide such functionality, for example, circles on Google+, and lists on Facebook and Twitter. Examples of such circles include university classmates, sports team members, relatives, etc. Once circles are created by a user, they can be utilized, for example, for content ﬁltering (e.g. to ﬁlter status updates posted by distant acquaintances) or for privacy (e.g., to hide personal information from coworkers). In this paper, we attempt to recover social circles of an ego-network when only binary connection data is available. In particular, we formulate the problem of circle detection as a clustering problem on an individual ego-network. In principle, circles can overlap or a circle can be a subset of another circle, hence, as an example in this paper, we study an ego-network with only few nodes overlap between the circles which does not aﬀect the performance of the clustering method. Speciﬁcally, we study an ego-network from Facebook where user proﬁles are treated as nodes and a friendship between two user proﬁles is considered as an edge between them. Since a friendship is a mutual tie,

Sparse Popularity Adjusted Stochastic Block Model

0 100 200 300 400 500 600 nz = 25114

0 50 100 150 200 250 300 350 400 nz = 30894

Figure 4: The adjacency matrices of the ego-network with 25114 nonzero entries and 5 clusters (left) and the brain network with 30894 nonzero entries and 6 clusters (right) after clustering

the ego-network is undirected. The ego-network studied in this paper, has 777 nodes with 17 circles, each circle containing between 2 to 225 nodes. For our study, we extract the ﬁve largest circles of the this network, obtaining a network with 629 nodes and 12557 edges. We carried out clustering of the nodes using the SSC and compared the clustering assignments of SSC with the true class assignments. The SSC provides 85% accuracy. In addition, we applied formula (42) with K ranging from 2 to 6 to the adjacency matrix with the randomly permuted rows (columns), obtaining the true number of clusters with 100% accuracy over 100 runs. Figure 4 shows the adjacency matrix of the graph after clustering (left), which conﬁrms that the network indeed follows the SPABM. Indeed, the SPABM is a very appropriate model for this example since users display diﬀerent degrees of connections to users in other circles, and, furthermore, the network is sparse, which justiﬁes the application of the SPABM.

Our second example involves analyzing a human brain functional network, constructed on the basis of the resting-state functional MRI (rsf MRI). We use the the brain connectivity dataset presented as a Group Average rsf MRI matrix described in Crossley et al. (2013). In this dataset, the brain is partitioned into 638 distinct regions and a weighted graph is used to characterize the network topology. Nicolini et al. (2017) developed a new Asymptotical Surprise method, which is applied for clustering of the weighted graph. Asymptotical Surprise detects 47 communities ranging from 1 to 133. Since the true clustering as well as the true number of clusters are unknown for this dataset, we treat the results of the Asymptotical Surprise as the ground truth.

In order to generate a binary network, we set all nonzero weights to one in the Group Average rsf MRI matrix, obtaining a network with 18625 undirected edges. For evaluating the performance of SSC on this network, we extract 6 largest communities derived by the Asymptotical Surprise, obtaining a network with 422 nodes and 15447 edges. Applying (42), with K ranging from 2 to 10, to the adjacency matrix with the randomly permuted rows (columns), we recovered the true number of clusters with 64% accuracy over 100 simulation runs. For this true number of communities, our version of the SSC detects the true communities with 94% accuracy. Figure 4 (right) displays the

Noroozi, Pensky and Rimal

adjacency matrix of the network after clustering, showing that the network is very sparse, thus, justifying application of the SPABM to the data.

Acknowledgments

The authors of the paper were partially supported by National Science Foundation (NSF) grants DMS-1712977 and DMS-2014928. We would also like to thank Drs. Nicolini, Bordier and Bifone for providing the brain dataset together with the results of their clustering algorithm.

Appendix A.

A.1 Accuracy of Estimating the Number of Communities

Table 1 below presents the relative frequencies of the estimators ˆK of K for K ranging from 3 to 5, n = 360 and 480, ω = 0.5 and 0.8, and σ = 0.4, 0.6 and 0.8. Table 1 conﬁrms that for majority of settings, the estimated and the true number of clusters are equal, ˆK = K , with high probability.

A.2 Proof of Theorem 1

Overview: The proof follows the standard oracle inequality strategy. We bound the error ˆP P 2

F by the random error term plus the diﬀerence between the values of the penalty function at K , J and ˆK, ˆJ: 2Tr h (A P )T ( ˆP P ) i + Pen(n, J , K ) Pen(n, ˆJ, ˆK).

Subsequently, we show that the random error term is bounded above by the sum of the Pen(n, ˆJ, ˆK)

and a small multiple of ˆP P 2

F with high probability. The latter leads to the conclusion that ˆP P 2

F is smaller than a multiple of Pen(n, J , K ) with high probability. The details of the proof are given below.

Proof. Denote Ξ = A P and recall that, given matrix P , entries Ξi,j = Ai,j (P )ij of Ξ are the independent Bernoulli errors for 1 i j n and Ξi,j = Ξj,i. Then following notations (5), for any Z and K

Ξ(Z, K) = PT Z,KΞPZ,K and P (Z, K) = PT Z,KP PZ,K.

Let (ˆΘ, ˆZ, ˆJ, ˆK) be a solution of optimization problem (10), and the estimator ˆP ˆP( ˆZ, ˆJ, ˆK) of P be of the form (23). Since A(Z, K) = PT Z,KAPZ,K, one has A = PZ,KA(Z, K)PT Z,K and it follows from (10) that PT ˆ Z, ˆ KAP ˆ Z, ˆ K ˆΘ( ˆZ, ˆJ, ˆK) 2

F + Pen(n, ˆJ, ˆK) PT Z ,K APZ ,K PT Z ,K P PZ ,K 2

F + Pen(n, J , K )

Using orthogonality of permutation matrices, we can rewrite the previous inequality as A P ˆ Z, ˆ K ˆΘ( ˆZ, ˆJ, ˆK)PT ˆ Z, ˆ K

F A P 2 F + Pen(n, J , K ) Pen(n, ˆJ, ˆK) (A.1)

Hence (A.1) and (23) yield A ˆP 2

F A P 2 F + Pen(n, J , K ) Pen(n, ˆJ, ˆK) (A.2)

Sparse Popularity Adjusted Stochastic Block Model

n = 360 ω = 0.5 ω = 0.8 K ˆK σ = 0.4 σ = 0.6 σ = 0.8 σ = 0.4 σ = 0.6 σ = 0.8 2 0.01 0 0.01 0 0 0 3 0.49 0.62 0.62 0.54 0.79 0.76 3 4 0.31 0.27 0.30 0.39 0.17 0.18 5 0.15 0.09 0.06 0.06 0.04 0.06 6 0.04 0.02 0.01 0.01 0 0 2 0 0 0 0 0 0 3 0.01 0.01 0.05 0 0 0.01 4 4 0.66 0.74 0.66 0.72 0.85 0.81 5 0.22 0.22 0.25 0.23 0.15 0.16 6 0.11 0.03 0.04 0.05 0 0.02 2 0 0 0.02 0 0 0 3 0 0 0.03 0 0 0 5 4 0.05 0.07 0.23 0 0 0.08 5 0.70 0.69 0.54 0.74 0.84 0.84 6 0.25 0.24 0.18 0.26 0.16 0.08

n = 480 ω = 0.5 ω = 0.8 K ˆK σ = 0.4 σ = 0.6 σ = 0.8 σ = 0.4 σ = 0.6 σ = 0.8 2 0 0 0 0 0 0 3 0.64 0.62 0.60 0.54 0.73 0.76 3 4 0.26 0.17 0.31 0.32 0.24 0.19 5 0.08 0.19 0.07 0.10 0.01 0.05 6 0.02 0.02 0.02 0.04 0.02 0 2 0 0 0 0 0 0 3 0.01 0 0 0 0 0 4 4 0.64 0.68 0.76 0.69 0.74 0.83 5 0.21 0.30 0.21 0.23 0.24 0.17 6 0.14 0.02 0.03 0.08 0.02 0 2 0 0 0 0 0 0 3 0 0 0.02 0 0 0 5 4 0.04 0.01 0.21 0 0 0.05 5 0.65 0.78 0.65 0.77 0.89 0.86 6 0.31 0.21 0.12 0.23 0.11 0.09

Table 1: The relative frequencies of the estimators ˆK of K for K ranging from 3 to 5, n = 360 and 480 and ω = 0.5 and 0.8 and σ = 0.4, 0.6 and 0.8.

Noroozi, Pensky and Rimal

Now adding and subtracting P in the norm on the left side of (A.2), we rewrite (A.2) as ˆP P 2

F ( ˆZ, ˆJ, ˆK) + Pen(n, J , K ) Pen(n, ˆJ, ˆK) (A.3)

where ( ˆZ, ˆJ, ˆK) = 2Tr h ΞT ( ˆP P ) i .

Again, using orthogonality of permutation matrices, we obtain

( ˆZ, ˆJ, ˆK) = 2 D Ξ( ˆZ, ˆK), (ˆΘ( ˆZ, ˆJ, ˆK) P ( ˆZ, ˆK)) E

where A, B = Tr(AT B). Then, in the block form, ( ˆZ, ˆJ, ˆK) appears as

( ˆZ, ˆJ, ˆK) =

k,l=1 (k,l)( ˆZ, ˆJ, ˆK) (A.4)

with (k,l)( ˆZ, ˆJ, ˆK) = 2 D Ξ(k,l)( ˆZ, ˆK), Πˆu,ˆv Π ˆ J(k,l) A(k,l)( ˆZ, ˆK) P (k,l) ( ˆZ, ˆK) E .

Here Πˆu,ˆv is deﬁned in (4), and ˆu ˆu(k,l)( ˆZ, ˆJ, ˆK) and ˆv ˆv(k,l)( ˆZ, ˆJ, ˆK) are the singular vectors of

Π ˆ J(k,l) A(k,l)( ˆZ, ˆK) corresponding to the largest singular values of Π ˆ J(k,l) A(k,l)( ˆZ, ˆK) . Let u =

u(k,l)( ˆZ, ˆJ, ˆK) and v = v(k,l)( ˆZ, ˆJ, ˆK) be the singular vectors of Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK)) corresponding to the largest singular values of Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK)), and Π u, v(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) be the rank one projection of P (k,l) ( ˆZ, ˆK) deﬁned in (4). We point out here that although all singular vectors depend on the block (k, l), as well as on Z, J and K, we omit these dependences from the notations since, otherwise, the paper will become unreadable. In addition, vectors ˆu and ˆv have supports ˆJk,l and ˆJl,k, respectively. Recall that

Πˆu,ˆv(Π ˆ J(k,l)(A(k,l)( ˆZ, ˆK))) = Πˆu,ˆv(Π ˆ J(k,l)(P (k,l) ( ˆZ, ˆK)) + Π ˆ J(k,l)(Ξ(k,l)( ˆZ, ˆK)))

Then, (k,l)( ˆZ, ˆJ, ˆK) can be partitioned into the sums of three components

(k,l)( ˆZ, ˆJ, ˆK) = (k,l) 1 ( ˆZ, ˆJ, ˆK) + (k,l) 2 ( ˆZ, ˆJ, ˆK) + (k,l) 3 ( ˆZ, ˆJ, ˆK), k, l = 1, 2, , K, (A.5)

(k,l) 1 ( ˆZ, ˆJ, ˆK) = 2 D Ξ(k,l)( ˆZ, ˆK), Πˆu,ˆv(Π ˆ J(k,l)(Ξ(k,l)( ˆZ, ˆK))) E (A.6)

(k,l) 2 ( ˆZ, ˆJ, ˆK) = 2 D Ξ(k,l)( ˆZ, ˆK), Π u, v(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) P (k,l) ( ˆZ, ˆK) E (A.7)

(k,l) 3 ( ˆZ, ˆJ, ˆK) = 2 D Ξ(k,l)( ˆZ, ˆK), Πˆu,ˆv(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) Π u, v(Π ˆ J(k,l)(P (k,l) ( ˆZ, ˆK))) E (A.8)

With some abuse of notations, for any matrix B and any vectors u, v, let Πu,v Π ˆ J(B( ˆZ, ˆK)) be

the matrix with blocks Πu,v Π ˆ J(k,l)(B(k,l)( ˆZ, ˆK)) , k, l = 1, 2, , ˆK. Then, it follows from (A.5) that ( ˆZ, ˆJ, ˆK) = 1( ˆZ, ˆJ, ˆK) + 2( ˆZ, ˆJ, ˆK) + 3( ˆZ, ˆJ, ˆK) (A.9)

1( ˆZ, ˆJ, ˆK) = 2 D Ξ( ˆZ, ˆK), Πˆu,ˆv Π ˆ J(Ξ( ˆZ, ˆK)) E

2( ˆZ, ˆJ, ˆK) = 2 D Ξ( ˆZ, ˆK), Π u, v Π ˆ J(P ( ˆZ, ˆK)) P ( ˆZ, ˆK) E

3( ˆZ, ˆJ, ˆK) = 2 D Ξ( ˆZ, ˆK), Πˆu,ˆv Π ˆ J(P ( ˆZ, ˆK)) Π u, v Π ˆ J(P ( ˆZ, ˆK)) E . (A.12)

Sparse Popularity Adjusted Stochastic Block Model

Now, we need to derive an upper bound for each component in (A.5) and (A.9).

An upper bound for 1( ˆZ, ˆJ, ˆK). Observe that

| (k,l) 1 ( ˆZ, ˆJ, ˆK)| = 2 Πˆu,ˆv(Π ˆ J(k,l)(Ξ(k,l)( ˆZ, ˆK))) 2

op 2 Π ˆ J(k,l)(Ξ(k,l)( ˆZ, ˆK)) 2

Fix t > 0 and let Ω1 be the set such that Π ˆ J Ξ( ˆZ, ˆK) 2 op F1(n, ˆJ, ˆK) + C2t. According to Lemma 8, P(Ω1) 1 exp( t), (A.13)

and, for ω Ω1, one has

| 1( ˆZ, ˆJ, ˆK)| 2

k,l=1 Π ˆ J(k,l)(Ξ(k,l)( ˆZ, ˆK)) 2 op 2 F1(n, ˆJ, ˆK) + 2 C2t (A.14)

where F1(n, J, K) is deﬁned by either (A.49) or (A.50) and C2 is given in Lemma 7.

An upper bound for 2( ˆZ, ˆJ, ˆK). Now, consider 2( ˆZ, ˆJ, ˆK) given by (A.11). Note that

| 2( ˆZ, ˆJ, ˆK)| = 2 Π u, v Π ˆ J(P ( ˆZ, ˆK)) P ( ˆZ, ˆK) F | Ξ( ˆZ, ˆK), H u, v( ˆZ, ˆJ, ˆK) |,

H u, v( ˆZ, ˆJ, ˆK) = Π u, v Π ˆ J(P ( ˆZ, ˆK)) P ( ˆZ, ˆK)

Π u, v Π ˆ J(P ( ˆZ, ˆK)) P ( ˆZ, ˆK) F

Since for any a, b and α1 > 0, one has 2ab α1a2 + b2/α1, obtain

| 2( ˆZ, ˆJ, ˆK)| α1 Π u, v Π ˆ J(P ( ˆZ, ˆK)) P ( ˆZ, ˆK) 2 F + α 1 1 | Ξ( ˆZ, ˆK), H u, v( ˆZ, ˆJ, ˆK) |2 (A.15)

Observe that if K, J and Z Mn,K are ﬁxed, then H u, v(Z, J, K) is ﬁxed and, for any K, J and Z, one has H u, v(Z, J, K) F = 1. Note also that, for ﬁxed K, J and Z, matrix Ξ(Z, K) [0, 1]n n

contains independent Bernoulli errors. It is well known that if ξ is a vector of independent Bernoulli

errors and h is any ﬁxed vector with

i=1 h2 i = 1, then, for any x > 0, the Hoeﬀding s inequality

yields P(|ξT h|2 > x) 2 exp( x/2).

Since Ξ(Z, K), H u, v(Z, J, K) = [vec(Ξ(Z, K))]T vec(H u, v(Z, J, K)), applying the Hoeﬀding s inequality and accounting for the symmetry, we derive for any ﬁxed K, J and Z:

P | Ξ(Z, K), H u, v(Z, J, K) |2 x > 0 2 exp( x/2).

Hence, application of the union bound over K, Z and J leads to

P | Ξ( ˆZ, ˆK), H u, v( ˆZ, ˆJ, ˆK) |2 F2(n, ˆJ, ˆK) > 2t (A.16)

P max 1 K nmax J max Z Mn,k[| Ξ(Z, K), H u, v(Z, J, K) |2 F2(n, J, K)] > 2 t 2 exp( t),

Noroozi, Pensky and Rimal

where F2(n, ˆJ, ˆK) stands for F (s) 2 (n, J, K) or F (ns) 2 (n, J, K) and

F (ns) 2 (n, J, K) = 2 ln n + 2(n + 2) ln K + 2|J| ln(n Ke/|J|) (A.17)

F (s) 2 (n, J, K) = 2

k,l=1 |Jk,l| ln(nke/|Jk,l|) + 2

ln n + n ln K + K

Using Lemma 6, obtain that

Π u, v Π ˆ J(P ( ˆZ, ˆK)) P ( ˆZ, ˆK) 2 F Πˆu,ˆv Π ˆ J(P ( ˆZ, ˆK)) P ( ˆZ, ˆK) 2 F ˆP P 2 F .

Denote the set on which (A.16) holds by Ωc 2, so that

P(Ω2) 1 2 exp( t). (A.19)

Then inequalities (A.15) and (A.16) imply that, for any α1 > 0 and any ω Ω2, one has

| 2( ˆZ, ˆJ, ˆK)| α1 ˆP P 2 F + α 1 1 F2(n, ˆJ, ˆK) + 2 α 1 1 t. (A.20)

An upper bound for 3( ˆZ, ˆJ, ˆK). Now consider 3( ˆZ, ˆJ, ˆK) deﬁned in (A.12) with components

(A.8). Note that matrices Xk,l = Πˆu,ˆv(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) Π u, v(Π ˆ J(k,l)(P (k,l) ( ˆZ, ˆK))) have ranks at most two. Use the fact that (see, e.g., Giraud (2014), page 123)

A, B A (2,r) B (2,r) r A op B F , r = min{rank(A), rank(B)}, (A.21)

where, for any matrix X, X (2,q) is the Ky-Fan (2, q) norm such that X 2 (2,q) rank(X) X 2 op. Applying inequality (A.21) with r = 2 to (A.8), derive that

| (k,l) 3 ( ˆZ, ˆJ, ˆK)| 4 Π ˆ J(k,l)(Ξ(k,l)( ˆZ, ˆK)) op Πˆu,ˆv(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) Π u, v(Π ˆ J(k,l)(P (k,l) ( ˆZ, ˆK))) F

Then, for any α2 > 0, obtain

| 3( ˆZ, ˆJ, ˆK)| =

k,l=1 | (k,l) 3 ( ˆZ, ˆJ, ˆK)| 2

k,l=1 Ξ(k,l)( ˆZ, ˆK) 2 op (A.22)

k,l=1 Πˆu,ˆv(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) Π u, v(Π ˆ J(k,l)(P (k,l) ( ˆZ, ˆK))) 2 F

Note that, by Lemma 6,

Πˆu,ˆv(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) Π u, v(Π ˆ J(k,l)(P (k,l) ( ˆZ, ˆK))) 2 F

2 Πˆu,ˆv(Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK))) P (k,l) ( ˆZ, ˆK) 2 F + 2 Π u, v(Π ˆ J(k,l)(P (k,l) ( ˆZ, ˆK))) P (k,l) ( ˆZ, ˆK) 2 F

4 Πˆu,ˆv Π ˆ J(k,l) A(k,l)( ˆZ, ˆK) P (k,l) ( ˆZ, ˆK) 2 F = 4 ˆΘ(k,l)( ˆZ, ˆJ, ˆK) P (k,l) ( ˆZ, ˆK) 2 F

k,l=1 Πˆu,ˆv Π ˆ J(k,l)(P (k,l)( ˆZ, ˆK)) Π u, v Π ˆ J(k,l) P (k,l) ( ˆZ, ˆK) 2 F

4 ˆΘ( ˆZ, ˆJ, ˆK) P ( ˆZ, ˆK) 2

F = 4 ˆP P 2 F

Sparse Popularity Adjusted Stochastic Block Model

Combining the last inequality with (A.14) and (A.22), obtain that for any α2 > 0, t > 0 and ω Ω1, one has | 3( ˆZ, ˆJ, ˆK)| 8α2 ˆP P 2 F + 2 α 1 2 F1(n, ˆJ, ˆK) + 2 α 1 2 C2 t. (A.23)

An upper bound in probability. Let Ω= Ω1 Ω2. Then, (A.13) and (A.19) imply that P(Ω) 1 3 exp( t) and that, for ω Ω, inequalities (A.14), (A.20) and (A.23) simultaneously hold. Hence, (A.9) implies that, for any ω Ω,

| ( ˆZ, ˆJ, ˆK)| (2+2 α 1 2 )F1(n, ˆJ, ˆK)+α 1 1 F2(n, ˆJ, ˆK)+(α1+8α2) ˆP P 2 F +2 (C2+α 1 1 +C2 α 1 2 ) t.

Combination of the last inequality and (A.3) yields that, for α1 + 8α2 < 1 and any ω Ω,

(1 α1 8α2) ˆP P 2

F (2 + 2 α 1 2 ) F1(n, ˆJ, ˆK) + α 1 1 F2(n, ˆJ, ˆK) (A.24)

+ Pen(n, J , K ) Pen(n, ˆJ, ˆK) + 2 (C2 + α 1 1 + C2 α 1 2 ) t.

Setting Pen(n, ˆJ, ˆK) = (2 + 2/α2)F1(n, ˆJ, ˆK) + 1/α1F2(n, ˆJ, ˆK), obtain the penalty as deﬁned in (18) (21), with

β1 = 2(C1 + C2)(1 + α2)

α1 , β2 = 2C2(1 + α2)

α1 . (A.25)

Dividing both sides of (A.24) by (1 α1 8α2), obtain that

P n ˆP P 2 F (1 α1 8α2) 1 Pen(n, J , K ) + C t o 1 3e t (A.26)

where C = 2(1 α1 8α2) 1(C2 + 1/α1 + C2/α2) t. To obtain (24) set H0 = (1 α1 8α2) 1.

An upper bound in expectation. In order to obtain the upper bound (25) note that for ξ = ˆP P 2 F H0 Pen(n, K ), one has E ˆP P 2 F = H0 Pen(n, K ) + Eξ, where

0 P(ξ > z)dz = C Z

0 P(ξ > Ct)dt C Z

0 3 e t dt = 3 C,

which yields (25).

A.3 Proof of Theorem 2.

Let K be ﬁxed, and known, so that K = K and, hence, A( ˆZ, K) A( ˆZ) and so on. Let Z be the true clustering matrix and J be the set of indices such that Pi,j(Z , K ) = 0 if (i, j) / J . It follows from (13) that

k,l=1 A(k,l)( ˆZ) Π(1)(Π ˆ J(k,l)(A(k,l)( ˆZ))) 2 F + Pen(n, ˆJ, K)

k,l=1 A(k,l)(Z ) Π(1)(ΠJ(k,l) (A(k,l)(Z ))) 2 F + Pen(n, J , K)

where Π(1)(B) is the best rank one approximation of matrix B. Since for any Z Mn,K and any J, one has

A(k,l)(Z) 2

F = A 2 F , D A(k,l)(Z), Π(1) ΠJ(k,l)(A(k,l)(Z)) E = Π(1)(ΠJ(k,l)(A(k,l)(Z))) 2

Noroozi, Pensky and Rimal

and Pen(1)(n, K) does not depend on the sparsity set J, obtain from (13), with non-separable penalty (20):

k,l=1 Π(1) Π ˆ J(k,l) A(k,l)( ˆZ) 2 F

k,l=1 Π(1) ΠJ(k,l)

A(k,l)(Z ) 2 F (A.27)

+ β1| ˆJ| ln(n Ke/| ˆJ|) β1|J | ln(n Ke/|J |).

Denote, as before, Ξ(k,l)(Z) = A(k,l)(Z) P (k,l) (Z). Note that, for any J(k,l), matrices P (k,l) (Z ) and Π(1)(ΠJ(k,l)(A(k,l)(Z))) have rank one, while for Z = Z , some P (k,l) (Z) may have ranks higher than one. Note that for any Z Mn,K and any J(k,l)

Π(1)(ΠJ(k,l)(A(k,l)(Z))) F = Π(1)(ΠJ(k,l)(A(k,l)(Z))) op (A.28)

P (k,l) (Z) op Π(1)(ΠJ(k,l)(Ξ(k,l)(Z)) op = P (k,l) (Z) F Π(1)(ΠJ(k,l)(Ξ(k,l)(Z)) op

Note that, for (i, j) / J(k,l) , one has Ξ(k,l) i,j (Z ) = 0, since a Bernoulli random variable with zero mean is identically equal to zero. Therefore, for any set J(k,l), the matrix ΠJ(k,l)(Ξ(k,l)(Z )) has (J )k,l Jk,l nonzero rows and (J )l,k Jl,k nonzero columns. Thus, for any t > 0, by Lemma 7

ΠJ(k,l)(Ξ(k,l)(Z )) 2

op C1|J J| + C2 t

1 exp( t). (A.29)

Observe that, for any a, b, c > 0, a b c implies b2 (1 + τ)a2 + (1 + 1/τ)c2 for any τ > 0, so that a2 b2/(1 + τ) c2/τ. Therefore, by (A.28), for any τ (0, 1), one has

Π(1)(ΠJ(k,l) (A(k,l)(Z ))) 2 F (1 + τ) 1 P (k,l) (Z) 2 F τ 1 Π(1)(ΠJ(k,l)(Ξ(k,l)(Z)) 2 op. (A.30)

Hence, it follows from (A.29) and (A.30), that, for any τ (0, 1), any t > 0

Π(1)[ΠJ (k,l)(A(k,l)(Z ))] 2

F 1 1 + τ P 2 F C1|J |

1 e t. (A.31)

On the other hand, for any τ0 (0, 1), derive

Π(1)[Π ˆ J(k,l)(A(k,l)( ˆZ))] 2

F (1 + τ0) Π ˆ J(k,l) P (k,l) ( ˆZ) 2

op + (1 + 1/τ0) Π ˆ J(k,l)Ξ(k,l)( ˆZ) 2

Taking a union bound similarly to Lemma 8 and recalling that K is ﬁxed, obtain for any t > 0

Π ˆ J(k,l)(Ξ(k,l)( ˆZ)) 2

op (C1 + C2)| ˆJ| ln(n Ke/| ˆJ|) + C2(2 ln n + n ln K + t)

Therefore, for any τ0 (0, 1) and any t > 0, derive

Π(1)[Π ˆ J(k,l)(A(k,l)( ˆZ))] 2

Π ˆ J(k,l)P (k,l) ( ˆZ) 2

+ (1 + 1/τ0) h (C1 + C2)| ˆJ| ln(n Ke/| ˆJ|) + C2(2 ln n + n ln K + t) io 1 e t,

Sparse Popularity Adjusted Stochastic Block Model

Combining (A.27), (A.31) and (A.32), derive that, for any τ, τ0 (0, 1) and any t > 0, one has with probability at least 1 2e t

Π ˆ J(k,l)P (k,l) ( ˆZ) 2

op + (1 + 1/τ0) h (C1 + C2)| ˆJ| ln(n Ke/| ˆJ|) + C2(2 ln n + n ln K + t) i

(1 + τ) 1 P 2 F τ 1 C1|J | τ 1 C2t + β1| ˆJ| ln(n Ke/| ˆJ|) β1|J | ln(n Ke/|J |).

Recall that, by Lemma 2, ˆJ( ˆZK, K) J( ˆZK, K) J ( ˆZK, K) and ˆJk,l( ˆZK, K) ( J )k,l( ˆZK, K) for any (k, l), so that

Π ˆ J(k,l)P (k,l) ( ˆZ) 2 op Π( J )(k,l)P (k,l) ( ˆZ) 2 op = P (k,l) ( ˆZ) 2 op.

Then, combining similar terms and multiplying both sides by (1 + τ), obtain for any τ, τ0 (0, 1) and any t > 0, with probability at least 1 2e t

P 2 F (1 + τ0)(1 + τ)

P (k,l) ( ˆZ) 2

op (1 + τ)[τ 1 0 (1 + τ0) β1]| ˆJ| ln(n Ke/| ˆJ|)+

β1|J | ln(n Ke/|J |) + (1 + τ)|J |[C1 τ 1 + β1 ln(n Ke/|J |)+

C2(1 + τ)(1 + τ0)τ 1 0 (2 ln n + n ln K) + C2(1 + τ0)(1 + τ 1 + τ 1 0 ) t.

Set t = n ln K. Let τ = τ0 and (1 + τ0)(1 + τ) = 1 + αn. Then, τ 1 = α 1(1 + 1 + αn), and hence τ 1(1 + τ)l α 1 for l = 0, 1, 2. Taking into account that, by Lemma 2, | ˆJ( ˆZ)| | J ( ˆZ)| and that function f(x) = x ln(n Ke/x) is an increasing function of x, derive that for any αn > 0 and t > 0 and some absolute positive constants H1 and H2, one has with probability at least 1 2e t

P 2 F (1 + αn)

k,l=1 P (k,l) ( ˆZ) 2 op H2|J | ln(n Ke/|J |)+

H1 α 1 n h | J ( ˆZ)| ln(n Ke/| J ( ˆZ)|) + |J | + n ln K (A.33)

The proof is completed by comparison between (A.33) and (30), and by the contradiction argument.

A.4 Proofs of Lemmas 1, 2, 3 and 4

Proof of Lemma 1. Note that index j is incorrectly identiﬁed if j J l,k ( Jl,k)c or j Jl,k (J l,k)c. Since Bernoulli variable with zero mean is always equal to zero, the second case is impossible. Observe that for any (k, l), one has P (k,l) P (k,l) (Z , K ) and

i=1 (P )(k,l) ij nkϖ(n, K) C0n K 1 ϖ(n, K) if j J l,k,

i=1 (P )(k,l) ij = 0 if j (J l,k)c

Therefore, for any (k, l) and j J l,k, by Hoeﬀding inequality,

P(j ( Jl,k)c) = P

i=1 A(k,l) ij (Z , K ) = 0

h A(k,l) ij (Z , K ) (P )(k,l) ij i =

i=1 (P )(k,l) ij

h A(k,l) ij (Z , K ) (P )(k,l) ij i C0n K 1 ϖ(n, K )

exp n 2 C0 2n K 2 ϖ2(n, K ) o .

Noroozi, Pensky and Rimal

Hence, applying the lower bound for ϖ2(n, K ) and the union bound, obtain

P(J (Z , K ) = J(Z , K ))

k,l=1 P(j J l,k ( Jl,k)c)

K2 exp n 2 C0 2n K 2 ϖ2(n, K ) o K2 n 1e t e t

which completes the proof.

Proof of Lemma 2. Since (P )i,j = 0 implies Ai,j = 0, one has Jk,l( ˆZK, K) ( J )k,l( ˆZK, K) and J( ˆZK, K) J ( ˆZK, K). In order to prove the ﬁrst inclusions in (17), consider the following two optimization problems

J( ˆZK, K) argmin J

A(k,l)( ˆZK, K) Π(1) ΠJ(k,l)(A(k,l)( ˆZK, K)) 2

F + Pen(n, J, K)

J( ˆZK, K) argmin J

A(k,l)( ˆZK, K) Π(1) ΠJ(k,l)(A(k,l)( ˆZK, K)) 2

Since Pen(n, J, K) is an increasing function of |J| (for a non-separable penalty) or of |Jk,l| (for a separable penalty), one has

( J)k,l( ˆZK, K) ( J)k,l( ˆZK, K), J( ˆZK, K) J( ˆZK, K) (A.36)

On the other hand, one has J( ˆZK, K) = ˆJ( ˆZK, K) since the right hand side of (A.34) is minimized at ˆJ( ˆZK, K). In addition, it is easy to see that the right hand side of (A.35) takes the smallest value at J( ˆZK, K) = J( ˆZK, K). Therefore,

( ˆJ)k,l( ˆZK, K) ( J)k,l( ˆZK, K), ˆJ( ˆZK, K) J( JK, K),

which completes the proof.

Proof of Lemma 3. Note that the diﬀerence between separable and non-separable penalty is given by n/s = Pen(ns)(n, J, K) Pen(s)(n, J, K) = β1 n/s 1 + β2 n/s 2 (A.37)

n/s 1 = |J| ln n Ke

k,l=1 |Jk,l| ln nke

, n/s 2 = 2 ln n K

Note that, due to the log-sum inequality (Theorem 17.1.2 of Cover and Thomas (2006)), n/s 1 0 with n/s 2 = 0 if and only if nk/|Jk,l| = n K/|J| for every k, l = 1, . . . , K. In the extreme case where the nodes have nonzero connection probabilities only to the nodes in the same class, one has |Jk,l| = nk for k = l and 0 otherwise, so that |J| = n. Then, n/s 1 = n ln K, so that

0 n/s 1 n ln K. (A.38)

Sparse Popularity Adjusted Stochastic Block Model

Now, consider n/s 2 . Note that application of the log-sum inequality (Theorem 17.1.2 of Cover and Thomas (2006)) yields

2 ln n K2 ln(n/K) n/s 2 2 ln n K ln(n + 1 K).

It is easy to see that 0 < K2 ln n n ln K if n 8 and K p

n/ ln n, therefore,

2 ln n n ln K n/s 2 2 ln n. (A.39)

Combining (A.37) (A.39), obtain that

β2(2 ln n n ln K) n/s β1n ln K + 2 β2 ln n.

Pen(ns)(n, J, K) Pen(s)(n, J, K) + β1n ln K + 2 β2 ln n < (2 + β1/β2)Pen(s)(n, J, K)

Pen(s)(n, J, K) Pen(ns)(n, J, K) + β2(2 ln n n ln K) < 2Pen(ns)(n, J, K),

which leads to (22).

Proof of Lemma 4. Note that Π(1) ΠJ(k,l) (P (k,l) (Z )) = Π(1)(P (k,l) (Z )) = P (k,l) (Z ), so that

the left hand side of inequality (27) is equal to identical zero. Also, Π J(k,l) (P (k,l) (Z)) = P (k,l) (Z),

hence we need to prove that P (k,l) (Z) Π(1)(P (k,l) (Z)) F > 0 for at least one pair (k, l), k, l = 1, . . . , K. Consider matrix Z Mn,K such that Z cannot be obtained from Z by a permutation of columns. Let i be a misclassiﬁed node, so that it belongs to communities l and l according to Z and Z, respectively. Then, the i-th column in the cluster l of matrix P is vertical concatenation of vectors Λ(1,l ) Λ(l ,1) i , Λ(2,l ) Λ(l ,2) i , . . . , Λ(K,l ) Λ(l ,K) i . Since the node i is connected to the network, there exists t such that Λ(l ,t) i > 0. When node i is moved to cluster l, according to Z, the column Λ(t,l ) Λ(l ,t) i is moved to the sub-matrix P (t,l) which contains multiples of vectors Λ(t,l ). Under Assumption A1, vectors Λ(t,l) and Λ(t,l ) are linearly independent, so that the rank of sub-matrix P (t,l) (Z) is at least two. Therefore, P (t,l) (Z) Π(1)(P (t,l) (Z)) F > 0, which completes the proof.

A.5 Supplementary Lemmas

Lemma 5. Let A and B be arbitrary matrices in Rm n and u Rn and v Rm be any unit vectors. Let u, v be the singular vectors of matrix A corresponding to its largest singular value. Then,

Πu,v(B), A Πu,v(A) = 0 and A Π u, v(A) A Πu,v(A) , (A.40)

so that, the best rank one approximation of A is given by Π(1)(A) = Π u, v(A). Here, Πu,v(A) is deﬁned in (4).

Lemma 6. Let A = P + Ξ. Denote by (ˆu, ˆv) and (u, v) the pairs of singular vectors of matrices ΠJ(A) and ΠJ(P), respectively, corresponding to their largest singular values. Then,

Πu,v(ΠJ(P)) P F Πˆu,ˆv(ΠJ(P)) P F Πˆu,ˆv(ΠJ(A)) P F (A.41)

where, for any matrix X, Πu,v(X) is the projection of X onto the pair of unit vectors (u, v), given in (4), and ΠJ(X) is the projection of the matrix X onto the set of all matrices with the rectangular support J.

Noroozi, Pensky and Rimal

Proof. Note that

Πˆu,ˆv(ΠJ(A)) P 2 F = Πˆu,ˆv(ΠJ(P + Ξ)) P 2 F =

Πˆu,ˆv(ΠJ(P)) + Πˆu,ˆv(ΠJ(Ξ)) P 2 F =

Πˆu,ˆv(ΠJ(Ξ)) + [Πˆu,ˆv(ΠJ(P)) ΠJ(P)] + [ΠJ(P) P] 2 F

Since matrices Πˆu,ˆv(ΠJ(Ξ)) and [Πˆu,ˆv(ΠJ(P)) ΠJ(P)] are supported on the set of indices J and ΠJ(P) P is supported on Jc, the latter matrix is orthogonal to the ﬁrst two. On the other hand, Πˆu,ˆv(ΠJ(Ξ)) and [Πˆu,ˆv(ΠJ(P)) ΠJ(P)] = Π ˆu,ˆv(ΠJ(P)) are also orthogonal. Therefore,

Πˆu,ˆv(ΠJ(A)) P 2 F = Πˆu,ˆv(ΠJ(Ξ)) 2 F + Πˆu,ˆv(ΠJ(P)) ΠJ(P) 2 F + ΠJ(P) P 2 F =

Πˆu,ˆv(ΠJ(Ξ)) 2 F + Πˆu,ˆv(ΠJ(P)) P 2 F Πˆu,ˆv(ΠJ(P)) P 2 F Πu,v(ΠJ(P)) P 2 F

where the last inequality follows from Lemma 5.

Lemma 7. Let elements of matrix Ξ ( 1, 1)n n be independent Bernoulli errors. Let matrix Ξ be partitioned into K2 sub-matrices Ξ(k,l) with supports J(k,l) = Jk,l Jl,k, k, l = 1, , K, such that Ξ(k,l) = (Ξ(l,k))T . Then, for any x > 0

ΠJ(k,l) Ξ(k,l) 2

op C1|J| + C2x

1 exp( x), (A.42)

where C1 and C2 are absolute constants independent of n, K and sets Jk,l, k, l = 1, , K.

Proof. Denote |Jk,l| = nk,l, k, l = 1, , K, and observe that matrices Ξ(k,l) are eﬀectively of the size nk,l nl,k. Consider K(K + 1)/2-dimensional vectors ξ and µ with elements ξk,l = ΠJ(k,l) Ξ(k,l) op and µk,l = E ΠJ(k,l) Ξ(k,l) op, 1 k l K, and let η = ξ µ. Then,

ΠJ(k,l) Ξ(k,l) 2

op ξ 2 2 η 2 + 2 µ 2 (A.43)

Hence, we need to construct the upper bounds for η 2 and µ 2. We start with constructing upper bounds for µ 2. Let Ξ(k,l) i,j be elements of the (nk,l nl,k)-

dimensional matrix ΠJ(k,l) Ξ(k,l) . Then, E(Ξ(k,l) i,j ) = 0 and, by Hoeﬀding s inequality, E n exp(λΞ(k,l) i,j ) o

exp λ2/8 . Taking into account that Bernoulli errors are bounded by one in absolute value and applying Corollary 3.3 of Bandeira and van Handel (2016) with m = nk,l, n = nl,k, σ = 1, σ1 = nl,k and σ2 = nk,l, obtain

nk,l + nl,k + q

ln(nk,l nl,k)

where C0 is an absolute constant independent of nk,l and nl,k. Therefore,

k,l=1 (nk,l + nl,k + ln(nk,l nl,k)) 6C2 0|J| + 3C2 0

k,l=1 ln(nk,l). (A.44)

Next, we show that, for any ﬁxed partition, ηk,l = ξk,l µk,l are independent sub-gaussian random variables when 1 k l K. Independence follows from the conditions of Lemma 7.

Sparse Popularity Adjusted Stochastic Block Model

To prove the sub-gaussian property, use Talagrand s concentration inequality (Theorem 6.10 of Boucheron et al. (2013)): if Ξ1, Ξ2, Ξ3, , Ξn are independent random variables taking values in the interval [0, 1] and f : [0, 1]n R is a separately convex function such that |f(x) f(y)| x y for all x, y [0, 1]n, then, for Z = f(Ξ1, Ξ2, Ξ3, , Ξn) and any t > 0, one has P(Z > EZ + t) exp( t2/2). Apply this theorem to vectors ζk,l = vec(ΠJ(k,l) Ξ(k,l) ) [0, 1]nk,l nl,k

and f(ΠJ(k,l) Ξ(k,l) ) = f(ζk,l) = ΠJ(k,l) Ξ(k,l) op. Note that, for any two matrices Ξ and Ξ of

the same size, one has Ξ Ξ 2 op Ξ Ξ 2 F = vec(Ξ) vec( Ξ) 2. Then, applying Talagrand s inequality with Z = ΠJ(k,l) Ξ(k,l) op and Z = ΠJ(k,l) Ξ(k,l) op, obtain

P ΠJ(k,l) Ξ(k,l) op E ΠJ(k,l) Ξ(k,l) op > t 2 exp( t2/2).

Now, use the Lemma 5.5 of Vershynin (2012) which states that the latter implies that, for any t > 0 and some absolute constant C4 > 0,

E [exp(tηk,l)] = E [exp(t(ξk,l µk,l))] exp(C4t2/2). (A.45)

Hence, ηk,l are independent sub-gaussian random variables when 1 k l K. In order to obtain an upper bound for η 2, use Theorem 2.1 of Hsu et al. (2012). Applying this theorem with A = IK(K+1)/2, µ = 0 and σ2 = C4 to a sub-vector η of η which contains components ηk,l with 1 k l K, obtain

P n η 2 C4 K(K + 1)/2 + p

2 K(K + 1) x + 2x o exp( x).

Since η 2 2 η 2, derive

P n η 2 2C4K(K + 1) + 6C4x o exp ( x) (A.46)

Combination of formulas (A.43) and (A.46) yield

P n ξ 2 2 µ 2 + 4C4K(K + 1) + 12C4x o 1 exp ( x)

Plugging in µ 2 from (A.44) into the last inequality, derive for any x > 0 that

ξ 2 12C2 0|J| + 6C2 0

k,l=1 ln(nk,l) + 4C4K(K + 1) + 12C4x

1 exp ( x) . (A.47)

Since K(K + 1) 2K2 and

k,l=1 ln(nk,l) + 8C4K2 max(6C2 0, 8C4)

k,l=1 ln(nk,le) max(6C2 0, 8C4)|J|,

inequality (A.42) holds with C1 = 12C2 0 + max(6C2 0, 8C4) and C2 = 12C4.

Lemma 8. For any t > 0,

Π ˆ J(k,l) Ξ(k,l)( ˆZ, ˆK) 2

op F1(n, ˆJ, ˆK) C2t

1 exp ( t), (A.48)

Noroozi, Pensky and Rimal

with F1(n, J, K) = F (ns) 1 (n, J, K) or F1(n, J, K) = F (s) 1 (n, J, K), where

F (ns) 1 (n, J, K) = (C1 + C2)|J| ln(n Ke/|J|) + C2(3 ln n + n ln K) (A.49)

F (s) 1 (n, J, K) = (C1 + C2)

k,l=1 |Jk,l| ln(nke/|Jk,l|) + C2

ln n + n ln K + K

and C1 and C2 are the absolute constants from Lemma 7.

Proof. Note that |Jk,l| |Jk,l| ln(n Ke/|Jk,l|), |J| |J| ln(n Ke/|J|), and also that |J| =

k,l=1 |Jk,l|.

First, let us prove the statement for F1(n, J, K) = F (ns) 1 (n, J, K). For this purpose, set x = t + 3 ln n + n ln K + |J| ln(n Ke/|J|) in Lemma 7 and apply the union bound over K [1, n], Z Mn,K and J {1, . . . , n K}. Obtain

Π ˆ J(k,l) Ξ(k,l)( ˆZ, ˆK) 2

op F (ns) 1 (n, ˆJ, ˆK) C2t 0

k,l=1 ΠJ(k,l) Ξ(k,l)(Z, K) 2 op F (ns) 1 (n, J, K) C2t

|J|=j exp( t 3 ln n n ln K j ln(n Ke/j))

j=1 Kn n K j

exp( t 3 ln n n ln K j ln(n Ke/j)) exp( t).

In order to prove the statement for F1(n, J, K) = F (s) 1 (n, J, K), choose

x = t + ln n + n ln K +

k,l=1 [ln(nk) + |Jk,l| ln(nk e/|Jk,l|)]

in Lemma 7 and again apply the union bound over Z Mn,K, K [1, n] and |Jkl| {1, . . . , nk}, k, l = 1, . . . , K. Obtain

Π ˆ J(k,l) Ξ(k,l)( ˆZ, ˆK) 2

op F (s) 1 (n, ˆJ, ˆK) C2t 0

|Jk,l|=jk,l P

k,l=1 ΠJ(k,l) Ξ(k,l)(Z, K) 2 op F (s) 1 (n, J, K) C2t

t ln n n ln K

k,l=1 [ln(nk) + jk,l ln(nk e/jk,l)]

which completes the proof.

Sparse Popularity Adjusted Stochastic Block Model

Proof of the inequality (33). For any m, denote em = 1m/ m, so that em = 1. Denote by Λ(k,l) and ˇΛ(k,l) the portions of vectors Λ(k,l), k, l = 1, 2, that, respectively, stayed in the correct class and were moved to the wrong one by the erroneous clustering matrix Z. It is easy to check that, for k = 1, 2, matrices P (k,k) P (k,k) (Z) are 2 2 block matrices with blocks Λ(k,k)( Λ(k,k))T

and ˇΛ(l,l)(ˇΛ(l,l))T on the main diagonal and ˇΛ(l,k)( Λ(k,l))T and its transpose oﬀthe main diagonal. Here, for k, l = 1, 2, k = l, one has

a Nk e Nk, Λ(k,l) = q

βk e Nk + q

ˇΛ(k,k) = q

a ˇNk e ˇ Nk, ˇΛ(k,l) = q

ˇβk e ˇ Nk + q

1 ˇβk e ˇ Nk

where e m is a unit vector orthogonal to em. Consider matrices Uk : ( Nk + ˇNl) 4, k = 1, 2, with the columns

(Uk):,1 = [e Nk; 0 ˇ Nl], (Uk):,2 = [0 Nk; e ˇ Nl], (Uk):,3 = [e Nk; 0 ˇ Nl], (Uk):,4 = [0 Nk; e ˇ Nl],

where 0m is the m-dimensional zero column vector, and [a; b] denotes the vector, obtained by stacking column vectors a and b together vertically. Then, it is easy to verify that U T k Uk = I4, and that P (k,k) = Uk Hk U T k , where Hk is the 4 4 symmetric matrix

Hk = [ Bk, Rk, 0, Fk; Rk, ˇBl, Gk, 0; 0, Gk, 0, Qk; Fk, 0, Qk, 0]

(with elements listed row by row). Therefore,

P (k,k) 2 F = Hk 2 F , P (k,k) 2 op = Hk 2 op (A.51)

Consider the top left sub-matrix Hk = [ Bk, Rk; Rk, Bl] of matrix Hk. Let λ1,k λ2,k λ3,k λ4,k and λ1,k λ2,k be the eigenvalues of matrices Hk and Hk, respectively. Then, by Interlace Theorem (see Rao and Rao (1998), P 10.2.1) with m = 4 and n = 2, obtain

λ1,k λ1,k λ3,k, λ2,k λ2,k λ4,k (A.52)

Observe that for any α > 0, one has

Hk 2 F (1 + α) Hk 2 op λ2 2,k αλ2 1,k λ2 2,k α Hk 2 F λ2 2,k (A.53)

= (1 + α)λ2 2,k α Hk 2 F (1 + α) λ2 2,k α Hk 2 F

Hence, by (A.51) and (A.52), for diagonal blocks, derive

D = P (1,1) 2 F + P (2,2) 2 F (1 + αn) P (1,1) 2 op + P (2,2) 2 op (A.54)

λ2 2,1 + λ2 2,2 αn P (1,1) 2 F + P (2,2) 2 F

Also, for non-diagonal blocks, one has

ND = 2 P (1,2) 2 F 2(1 + αn) P (1,2) 2 op 2αn P (1,2) 2 op 2αn P (1,2) 2 F (A.55)

Combining (A.54) and (A.55), obtain

P 2 F (1 + αn)

k,l=1 P (k,l) (Z) 2 op = D + ND λ2 2,1 + λ2 2,2 αn P 2 F (A.56)

Noroozi, Pensky and Rimal

It is easy to check that

λ2 2,k = 1/4 Bk + ˇBl q

( Bk + ˇBl)2 4R2 k

2 ( Bk + ˇBl) 2 ( Bk ˇBl R2 k)2

Note that Bk ˇBl R2 k = Bk ˇBl(1 ρ2 n β2 k ˇβ2 l ). Also, due to max(δk, δl) δ 1/2 1 min(δk, δl), obtain Bk ˇBl Bk + ˇBl = a N(1 δk)δl

1 δk + δl n an δl

Plugging the last two expressions into (A.56) and taking into account that P 2 F 4a2N 2 = a2n2, arrive at (33) with n given by (34).

Emmanuel Abbe. Community detection and stochastic block models: Recent developments. J. Mach. Learn. Res., 18(177):1 86, 2018.

Emmanuel Abbe, Enric Boix-Adsera, Peter Ralli, and Colin Sandon. Graph powering and spectral robustness. SIAM Journal on Mathematics of Data Science, 2(1):132 157, 2020a. doi: 10.1137/ 19M1257135. URL https://doi.org/10.1137/19M1257135.

Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. The Annals of Statistics, 48(3):1452 1474, 2020b. doi: 10.1214/19-AOS1854. URL https://doi.org/10.1214/19-AOS1854.

Pankaj K Agarwal and Nabil H Mustafa. K-means projective clustering. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 155 165. ACM, 2004.

Arash A. Amini and Elizaveta Levina. On semideﬁnite relaxations for the block model. Ann. Statist., 46(1):149 179, 02 2018. doi: 10.1214/17-AOS1545.

Afonso S. Bandeira and Ramon van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Ann. Probab., 44(4):2479 2506, 07 2016.

Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and newman girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068 21073, 2009. ISSN 0027-8424. doi: 10.1073/pnas.0907096106.

St ephane Boucheron, G abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

T. Boult and L. Brown. Factorization-based segmentation of motions. Journal of Global Optimization, pages 179 186, 11 1991. doi: 10.1109/WVM.1991.212809.

P. S. Bradley and O. L. Mangasarian. k-plane clustering. J. of Global Optimization, 16(1):23 32, January 2000. ISSN 0925-5001. doi: 10.1023/A:1008324625522.

Alain Celisse, Jean-Jacques Daudin, and Laurent Pierre. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Statist., 6:1847 1899, 2012. doi: 10.1214/12-EJS729.

Sparse Popularity Adjusted Stochastic Block Model

Yudong Chen, Xiaodong Li, and Jiaming Xu. Convexiﬁed modularity maximization for degreecorrected stochastic block models. Ann. Statist., 46(4):1573 1602, 08 2018. doi: 10.1214/ 17-AOS1595.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006. ISBN 0471241954.

Nicolas A Crossley, Andrea Mechelli, Petra E V ertes, Toby T Winton-Brown, Ameera X Patel, Cedric E Ginestet, Philip Mc Guire, and Edward T Bullmore. Cognitive relevance of the community structure of the human brain functional coactivation network. Proceedings of the National Academy of Sciences, 110(28):11583 11588, 2013.

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407 499, 2004.

E. Elhamifar and R. Vidal. Sparse subspace clustering. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2790 2797, June 2009. doi: 10.1109/CVPR.2009.5206547.

Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2765 2781, November 2013. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.57.

P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. CVPR 11, pages 1801 1807, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-4577-0394-2. doi: 10.1109/CVPR.2011.5995365.

Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17:6 pp., 2012. doi: 10.1214/ECP.v17-2079.

Antony Joseph and Bin Yu. Impact of regularization on spectral clustering. Ann. Statist., 44(4): 1765 1791, 08 2016. doi: 10.1214/16-AOS1447.

Brian Karrer and Mark E. J. Newman. Stochastic blockmodels and community structure in networks. Physical review. E, Statistical, nonlinear, and soft matter physics, 83 1 Pt 2:016107, 2011.

Olga Klopp, Alexandre B. Tsybakov, and Nicolas Verzelen. Oracle inequalities for network models and sparse graphon estimation. Ann. Statist., 45(1):316 354, 02 2017. doi: 10.1214/16-AOS1454. URL https://doi.org/10.1214/16-AOS1454.

Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. Ann. Statist., 43(1):215 237, 02 2015. doi: 10.1214/14-AOS1274.

Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in neural information processing systems, pages 539 547, 2012.

Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML 10, pages 663 670, USA, 2010. Omnipress. ISBN 978-1-60558-907-7.

Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., 35(1): 171 184, January 2013. ISSN 0162-8828. doi: 10.1109/TPAMI.2012.88.

Noroozi, Pensky and Rimal

Yi Ma, Allen Y. Yang, Harm Derksen, and Robert Fossum. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Rev., 50(3):413 458, August 2008. ISSN 0036-1445. doi: 10.1137/060655523.

Julien Mairal, F Bach, J Ponce, G Sapiro, R Jenatton, and G Obozinski. Spams: A sparse modeling software, v2.3. URL http://spams-devel. gforge. inria. fr/downloads. html, 2014.

Mohamed Ndaoud, Suzanne Sigalla, and Alexandre B. Tsybakov. Improved clustering algorithms for the bipartite stochastic block model. ar Xiv e-prints, art. ar Xiv:1911.07987, 2020.

Carlo Nicolini, C ecile Bordier, and Angelo Bifone. Community detection in weighted brain connectivity networks beyond the resolution limit. Neuroimage, 146:28 39, 2017.

Majid Noroozi, Ramchandra Rimal, and Marianna Pensky. Estimation and clustering in popularity adjusted block model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):293 317, 2021. doi: https://doi.org/10.1111/rssb.12410. URL https: //rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12410.

C.R. Rao and M A. Rao. Matrix Algebra and Its Applications to Statistics and Econometrics. World Scientiﬁc, Singapore, 1998. ISBN 98102322683.

Karl Rohe, Sourav Chatterjee, Bin Yu, et al. Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist., 39(4):1878 1915, 2011.

Srijan Sengupta and Yuguo Chen. A block model for node popularity in networks with community structure. Journal of the Royal Statistical Society Series B, 80(2):365 386, 2018.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888 905, 2000.

Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel J. Candes. Robust subspace clustering. Ann. Statist., 42(2):669 699, 04 2014. doi: 10.1214/13-AOS1199.

Paul Tseng. Nearest q-ﬂat to m points. Journal of Optimization Theory and Applications, 105(1): 249 252, 2000.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pages 210 268. Cambridge University Press, 2012. doi: 10.1017/CBO9780511794308.006.

Ren e Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52 68, 2011.

Rene Vidal, Yi Ma, and Shankar Sastry. Generalized principal component analysis (gpca). IEEE Trans. Pattern Anal. Mach. Intell., 27(12):1945 1959, 2005.

Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist., 40(4):2266 2292, 2012.