# distributed_community_detection_in_large_networks__3c78e638.pdf

Journal of Machine Learning Research 24 (2023) 1-28 Submitted 7/19; Revised 5/23; Published 5/23

Distributed Community Detection in Large Networks

Sheng Zhang szhang37@ncsu.edu Rui Song rsong@ncsu.edu Wenbin Lu wlu4@ncsu.edu Department of Statistics North Carolina State University Raleigh, NC 27607, USA

Ji Zhu jizhu@umich.edu Department of Statistics University of Michigan Ann Arbor, MI 48109, USA

Editor: Edo Airoldi

Community detection for large networks poses challenges due to the high computational cost as well as heterogeneous community structures. In this paper, we consider widely existing real-world networks with grouped communities (or the group structure ), where nodes within grouped communities are densely connected and nodes across grouped communities are relatively loosely connected. We propose a two-step community detection approach for such networks. Firstly, we leverage modularity optimization methods to partition the network into groups, where between-group connectivity is low. Secondly, we employ the stochastic block model (SBM) or degree-corrected SBM (DCSBM) to further partition the groups into communities, allowing for varying levels of between-community connectivity. By incorporating this two-step structure, we introduce a novel divide-and-conquer algorithm that asymptotically recovers both the group structure and the community structure. Numerical studies conﬁrm that our approach signiﬁcantly reduces computational costs while achieving competitive performance. This framework provides a comprehensive solution for detecting community structures in networks with grouped communities, oﬀering a valuable tool for various applications. Keywords: community detection, divide and conquer, large networks, modularity, spectral clustering, stochastic block model

1. Introduction

For many real-world networks, such as social networks, airline route networks, and citation networks (Egghe and Rousseau, 1990), community structures are commonly observed as functional modules, i.e., nodes belonging to the same community share similar connectivity patterns. Finding such community structures can be useful for exploring, modeling, and understanding networks (Rohe et al., 2011). The stochastic block model (SBM) (Holland et al., 1983; Snijders and Nowicki, 1997) has been widely used for modeling the community structure in networks and has shown appealing empirical and theoretical properties. Community detection based on the SBM is a nontrivial task, as optimizing the likelihood function over all possible community labels is an NP-hard problem (Bickel and Chen, 2009).

c 2023 Zhang, Song, Lu and Zhu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/19-590.html.

Various algorithms have been developed for community detection based on the SBM in the literature. For example, Daudin et al. (2008) considered the variational EM algorithm by optimizing the community assignment and probability matrix iteratively. Rohe et al. (2011) proposed the spectral clustering algorithm by analyzing the Laplacian of the adjacency matrix for high-dimensional SBM. Amini et al. (2013) proposed pseudo-likelihood based algorithms for network community detection tasks. However, when dealing with very large networks, the computational cost is often too high and causes challenges. In the variational EM method, each EM update requires O(n2) computations (Snijders and Nowicki, 1997; Handcock et al., 2007), where n is the number of nodes in the network; for the spectral clustering method, the complexity for each iteration is O(m K + n K2), where m is the number of edges and K is the number of communities (White and Smyth, 2005); and the pseudo-likelihood algorithm lacks the guarantee of convergence (Wang et al., 2020). Besides, if the size of communities is unknown, selecting the number of communities also requires high computational cost. For example, Wang and Bickel (2017) showed that directly using likelihood-based model selection methods requires an exponential number of summands and would become intractable as n grows. Even with the variational likelihood EM update algorithm, each step will require O(n2) computations. Note that these methods all work and only work with the entire network. To overcome the computational burden for community detection in large networks, a natural idea is to ﬁrst divide the large network into several sub-networks, and then communities are detected within each sub-network in a parallel way. The random division would break the community structure and lead to erroneous community detection results. Therefore, a key question is how to divide the large network into sub-networks in a sensible way. In many real-world applications, it has often been observed that communities in large networks can be divided into groups such that nodes (or communities) within the same group have relatively high link probabilities, while nodes in diﬀerent groups have much lower link probabilities. One such example is the airline route network1 which contains routes among airports spanning the globe. It can be seen that the group structure in the airline route network corresponds to geographical regions, where connections within the same region are much denser than connections across regions. Within each region (group), the airports are further divided into communities depending on connections due to diﬀerent levels of economies, areas or politics. See Section 6.1 for more details. Another illustrative example is the Facebook ego network2, which is created based on the friend lists of 10 ego people (see Figure 8 in Appendix). Naturally, people connected to the same ego person have higher link probabilities among themselves than to people connected to diﬀerent egos. Therefore, it is reasonable to assume that there are 10 groups corresponding to the 10 egos in the Facebook ego network. Within each group, people could belong to diﬀerent friend circles (communities), such as the colleague circle, the family circle, the classmate circle, etc, of the ego person. See Section 6.2 for more details. In this paper, we consider large networks with a grouped community structure . An illustration of the grouped community structure is shown in Figure 1(a) and (b), where the network consists of four communities, with the green and the yellow communities belonging to one group and the red and blue communities belonging to the other group. The groups

1. https://openﬂights.org/data.html 2. https://snap.stanford.edu/data

Distributed Community Detection in Large Networks

are such that nodes (or communities) within the same group have much denser links than those between diﬀerent groups. In Figure 1(a), communities within the same group are non-assortative, while Figure 1(b) shows an example that communities within the same group are assortative. Our goal has two folds: divide the large network into sub-networks according to the group structure so that it does not break the community structure within each group during the division; consistently conduct community detection within each subnetwork.

Probability Matrix: (Non-assortative within group)

Probability Matrix: (Assortative within group)

Figure 1: An illustration for the grouped community structure: the yellow and the green communities belong to the same group; the red and blue communities are in another group. In panel (a), communities within the same group are non-assortative (links are dense across communities); and in panel (b), communities within the same group are assortative (links are dense within communities).

Towards this goal, we adopt a division method by optimizing the modularity (Newman and Girvan, 2004), which is a widely used measure for the strength of division of a network into groups. Speciﬁcally, modularity is deﬁned as the fraction of edges that fall within the divided groups minus the expected fraction if edges were distributed at random (Newman, 2012). The larger the modularity, the divided groups will have denser connections within the groups but sparser connections between the groups. Many modularity optimization algorithms, such as the edge-betweenness method (Newman and Girvan, 2004), the fast-greedy method (Blondel et al., 2008) and the exhaustive modularity optimization via simulated annealing method (Guimera et al., 2004), have been developed to divide a network into groups. These methods are usually model-free and can be implemented eﬃciently for large networks. For example, the computational cost for the fast greedy algorithm is O(nlog2n), which is nearly linear in the size n (Fortunato and Hric, 2016). Although the modularity optimization methods are computationally eﬃcient, they cannot be used to recover the reﬁned community structure within groups since communities within the same group may have relatively more non-assortative links. This motivates us to adopt a novel divide-and-conquer scheme for distributed community detection for large networks. Speciﬁcally, we propose to ﬁrst divide a large network into groups using the modularity-based method and then conduct community detection using the model-based method within each group. We show that our proposed divide-and-conquer method can

estimate the community labels consistently for networks with the group structure under certain conditions. The remainder of the paper is organized as follows. In Section 2, we describe in detail the SBM with the grouped community structure. In Section 3, we present the proposed distributed community detection algorithm. Theoretical properties of the proposed distributed community detection algorithm and extension to the Degree-Corrected SBM (DCSBM) are given in Section 4. In Section 5, we conduct simulation studies to examine the performance of the proposed method and compare it with other community detection methods. In Section 6, we apply the proposed method to an airline route network and a Facebook ego network. We conclude the paper with discussions in Section 7. All technical proofs are provided in the Appendix.

2. SBM with Grouped Community Structures

Consider an undirected and unweighted network N = (V, A), where V = {v1, v2, ..., vn} is the set of nodes and A = (Aij) is the adjacency matrix with Aij = 1 representing a link between nodes vi and vj, and 0 otherwise. First, we assume that the node set {v1, v2, ..., vn} can be partitioned into G disjoint groups G1, G2, ..., GG. In addition, each group Gt, is further partitioned into Kt disjoint communities Ct,1, Ct,2, ..., Ct,Kt. Second, the link probabilities are denoted by the matrix Bn RK K, where K = PG t=1 Kt is the total number of communities in the network. It is assumed that for any two nodes vi Ct,a and vj Cs,b, they are connected by an edge with the probability Bn,Pt 1 i=0 Ki+a,Ps 1 j=0 Kj+b, where Bn,h,l is

the (h, l)th element in Bn and K0 is set as 0. Hence, the connection probability between nodes only depends on their community labels. For group structure g = (g1, g2, ..., gn)T

where gi {1, 2, .., G} denotes the group label for node vi, i = 1, . . . , n, we assume that the group labels gi s follow a multinomial distribution with parameters Π = (Π1, Π2, ..., ΠG)T . For community structure, we deﬁne c = (c1, c2, ..., cn), where ci {1, 2, .., K} denotes the community label for node vi, i = 1, . . . , n, and the community labels ci s follow a multinomial distribution with parameters π = (π1, π2, ..., πK)T . With these notation, for a node vi Ct,a (i.e. the a-th community in the t-th group), we have ci = Pt 1 i=0 Ki + a. We also have Πt = PKt j=1 πPt 1 i=0 Ki+j, t = 1, ..., G, which indicates that the frequency of a speciﬁc group is the summation of the frequencies of the communities in that group. Note that for the grouped community structure to be meaningful, one needs additional constraints on the group assignment. Speciﬁcally, we introduce Condition 1 on the generative probability matrix Bn such that links within groups are denser than links across groups.

Condition 1 For graph G with community label c and probability matrix Bn. Given the group assignment g, the probability matrix Bn satisﬁes that Bn,ab > Bn,0 for communities a and b that belong to the same group g., and Bn,ab < Bn,0 for communities a and b that belong to diﬀerent groups, where Bn,0 = P ab πaπb Bn,ab measures the averaged connection probability over all communities.

Lemma 1 shows that, under Condition 1, the group assignment is well-deﬁned and unique. Therefore, there is no identiﬁability issue regarding to group structure.

Distributed Community Detection in Large Networks

Lemma 1 For generative SBM graph with probability matrix Bn, if both group assignments g1 and g2 satisfy Condition 1, then g1 = g2.

Proof: Suppose g1 = g2, then there exist nodes vi, vj such that g1 i = g1 j and g2 i = g2 j . Under the Condition 1, we have Bn,cicj > Bn,0 and Bn,cicj < Bn,0, which causes contradiction. Therefore, Lemma 1 holds.

To illustrate the SBM with the grouped community structure, we consider the toy examples in Figures 1(a) and (b). Both networks are generated under the SBM with n = 20, K = 4, and π = (1/4, 1/4, 1/4, 1/4)T . The link probability matrices are

0.3 0.8 0.01 0.01 0.8 0.3 0.02 0.02 0.01 0.02 0.3 0.9 0.01 0.02 0.9 0.3

0.8 0.3 0.02 0.02 0.3 0.7 0.01 0.01 0.02 0.01 0.9 0.3 0.02 0.01 0.3 0.9

respectively. Note that both probability matrices satisfy Condition 1 with group size of two. Therefore, both networks are SBM with the grouped community structure, i.e., in both networks, nodes between groups are loosely connected in comparison to nodes within the same group. In Figure 1(a), within community nodes are relatively loosely connected while nodes across communities but within the group are relatively densely connected. While in Figure 1(b), the within community nodes have dense links and nodes across communities have loose links.

3. A Distributed Community Detection Algorithm

In this section, we present a divide-and-conquer algorithm that ﬁrst divides the entire network into disjoint groups and then identiﬁes communities within each group in the SBM. The summarized algorithms are provided in Algorithm 1.

3.1 Group Division Based on Modularity Optimization

Modularity, as deﬁned by Newman (2006), is the fraction of the edges that fall within the given groups minus the expected fraction if edges were distributed at random. Hence the group structure can be recovered by maximizing the modularity. To evaluate the optimality of a group assignment e, we use the Erdos-Renyi modularity (Guimera et al., 2004) deﬁned as

i,j (Aij L/n2)I(ei = ej), (1)

where L is sum of degrees as L = P ij Aij. Note that modularity QER(e) ranges from -1 to 1, and a large value of modularity indicates good performance for the group assignment e, i.e. the group division based on e produces dense links within groups and sparse links between groups. Exact modularity maximization is an NP-hard problem, which may be intractable for large networks. Henceforth, several approximate methods have been developed to ﬁnd decent solutions that approximately maximize the modularity. In our implementation, we

adopt the fast greedy algorithm (Clauset et al., 2004) to divide a large network into groups by optimizing the modularity, which is a widely used hierarchical clustering algorithm for group detection. The basic idea of the fast greedy algorithm is to start with each node being the single member of a group. Then we repeatedly concatenate the two groups whose amalgamation produces the largest increase in modularity. For a network of n nodes, after n 1 such joins, a single group remains and the algorithm stops. The entire process can be represented as a tree whose leaves are the nodes of the original network and whose internal joints correspond to the joins. This dendrogram represents a hierarchical decomposition of the network into groups at all levels. Therefore, a cutoﬀpoint of the dendrogram can be selected based on the modularity or a given number of groups. Since the fast greedy algorithm generates a dendrogram, the number of groups G can be determined by choosing the partition with the greatest modularity during the training process without extra computation (Newman and Girvan, 2004). One potential risk when using the maximal modularity to determine the groups is the resolution limit issue (Fortunato and Barthelemy, 2007), which means maximizing modularity would miss substructures of a network, i.e. wrongly treat several small groups as one large group. However, in our method, the resolution limit issue can be corrected in the following community detection step.

3.2 Distributed Community Detection Algorithm

After dividing the network into disjoint groups, we conduct community detection within each group. Various methods can be adapted to estimate the community labels within each group, such as the variational EM algorithm based on SBM (VSBM) (Daudin et al., 2008) and the regularized spherical spectral clustering (SSP) algorithm (Qin and Rohe, 2013). The number of communities Kt in group t can be determined by the asymptotic likelihood ratio Bayesian information criterion (LRBIC) (Wang and Bickel, 2017) which is a likelihood-based model selection method. We use csub t to denote the community labels in group t, where csub t = (csub t,1 , csub t,2 , . . . , csub t,|Gt|) and | | denotes the cardinality of the set.

Let ˆcsub t denote the estimated community labels in group t, for t = 1, . . . , G. Then, the estimated community labels ˆc of the entire network can be constructed by concatenating detected community labels from all groups as follows: for node vi in group t, if ˆcsub t assigns vi to the k-th community in group t, then we denote the i-th element of ˆc as Pt 1 j=0 ˆKj + k, where ˆKj is the estimated number of communities in group j obtained from the previous step. Given the estimated community labels ˆc, we can further estimate the probability matrix ˆBn by the maximum likelihood estimation (Karrer and Newman, 2011).

3.3 Computational Cost for Proposed Method

In terms of the computational cost, the division step takes O(nlog2n) computations in the fast greedy modularity method. Suppose group Gt (for t = 1, 2, .., G) contains nt nodes and mt edges. For the spectral clustering method, the computational complexity in Gt is O(mt Kt + nt K2 t ). Deﬁne Kmax = max(K1, K2, .., KG) and nmax, mmax in the same way. The computational cost for all groups is O(G (mmax Kmax + nmax K2 max)), which is much smaller than the original cost O(m K +n K2) as n grows. For VSBM in the community-level detection step, the computational cost in each EM update is O(G n2 max) which is much

Distributed Community Detection in Large Networks

Algorithm 1: Distributed Community Detection

Input: the network N = (V, A) Output: the estimated community labels ˆc and estimated probability matrix ˆBn 1 PART I: Group Division

2 Apply the fast greedy algorithm to the network N and obtain the dendrogram H

3 If G is not given:

4 Choose an increment threshold δ; initialize the group size G = 1; initialize the current modularity value Q = 0

5 for j = 2; j < n; j = j + 1 do

6 Calculate the modularity value Qj with group size j in dendrogram H

7 if Qj Q > δ: update the current modularity value Q Qj and the group size G j

8 else: break the loop, output G

9 Divide the network N into G groups {Gt}{t=1,...,G} based on the dendrogram H

10 PART II: Community Detection

11 Initialization: t = 1

13 Determine the number of communities ˆKt in each group Gt using LRBIC; estimate the community labels ˆcsub t in Gt using a SBM community detection method (such as VSBM or SSP)

15 15 t = t + 1;

16 until t = G;

17 PART III: Combination

18 Obtain the community labels ˆc by concatenating ˆcsub t , for t = 1, . . . , G, and the estimated probability matrix ˆBn given ˆc.

smaller than applying VSBM to the whole graph. Furthermore, the computational time can be signiﬁcantly decreased if distributed computing is used in the community detection step. It should also be noted that applying LRBIC to each group for selecting the number of communities has lower computational cost than to the whole network.

4. Theoretical Properties

In this section, we show that the proposed distributed community detection algorithm can consistently recover the community labels for networks with the grouped community structure. We state the main results in this section and provide all the proofs in the Appendix.

4.1 Group Detection Consistency for the SBM

We ﬁrst formally formulate the group detection problem for networks with the grouped community structure. Given the total number of groups G, we use e to denote a group assignment e = {e1, e2, . . . , en}, where ei is the group label for node vi and it takes value in {1, 2, . . . , G}. Deﬁne ˆg = argmaxe QER(e) as the estimated group labels. Our ﬁrst goal is to show that ˆg can consistently estimate the true group labels g. To establish group consistency for the proposed distributed community detection algorithm, we make the following assumptions for the probability matrix Bn and the true group assignment g. First, we note the probability matrix Bn is not ﬁxed, as otherwise, the expected degree for each node will grow proportionally to n as n , which is not realistic in practical applications. Speciﬁcally, we assume the following condition for the probability matrix Bn.

Condition 2 The probability matrix Bn can be written as Bn = ρn B, where ρn = πT Bnπ = P ab πaπb Bn,ab is the averaged probability of two nodes being linked. We assume ρn 0 as n .

Condition 2 has been widely used in the literature for studying the consistency of community detection in networks (e.g. Bickel and Chen, 2009; Lei et al., 2015; Jin et al., 2015; Zhao et al., 2012; Abbe, 2017). Under Condition 2, the expected degree of a node is given by λn = nρn and the expected total degree in the networks can be written as un = nλn = n2ρn. If nρn = O(1), then the expected degree of a node is bounded as the size of the network grows to inﬁnity. The strong and weak consistency of estimated group labels ˆg can be deﬁned by

P(ˆg = g) 1 as n ,

and for any ε > 0,

i 1(ˆgi = gi)

respectively. In the next theorem, we establish the consistency for group detection. Note that all the results are up to permutation of the group labels.

Distributed Community Detection in Large Networks

Theorem 2 Under the SBM, suppose Bn satisﬁes Conditions 1 and 2, then the estimated group labels ˆg obtained by maximizing QER(e) are strongly consistent when λn/ log(n) and weakly consistent when λn .

Theorem 2 shows that by maximizing the modularity and under certain conditions, the estimated group labels ˆg can consistently recover the group structure in the SBM, without breaking the community structures within the groups. In comparison to previous works, the key diﬀerence lies in the requirement on the link probability matrix in Condition 1. For example, Theorem 2 in Zhao et al. (2012) requires the link probabilities within communities to be higher than a threshold while link probabilities across communities are lower than it, i.e. to achieve consistency, the community structure is required to be assortative. In our theory, the community-level assortative condition is relaxed to the group-level assortative condition, which leads to the divide-and-conquer algorithm (i.e. Algorithm 1) that works not only for assortative networks but also for non-assortative networks. There are two key and unique challenges in the proof of Theorem 2, which are not shared by previous works. The ﬁrst one is how to make sure the maximization of modularity maintains the community structure. Motivated by the proof in Zhao et al. (2012), we show that if the maximizer of the modularity s expectation can be denoted as a generic G K matrix with at most one nonzero element in each column, then the corresponding group estimation is consistent. The second challenge is to show that Conditions 1 and 2 are suﬃcient to lead to the generic matrix with at most one nonzero element in each column. The detailed proof is provided in the Appendix.

4.2 Extension to the Degree-corrected SBM

To allow for more ﬂexible degree heterogeneity, we consider the degree-corrected stochastic block model (DCSBM) (Karrer and Newman, 2011), an extension of the SBM. Out of simplicity, we use the same notations as in 2 in DCSBM. In addition to the parameters of the SBM, each node is also associated with a degree parameter θi. For identiﬁability, we assume the expectation of the degree variable is 1, i.e. E(θi) = 1. The edge Aij is a Poisson distributed random variable with P(Aij = 1|c, θ) = θiθj Bn,cicj, where θ = (θ1, θ2, .., θn). The group structure in DCSBM is deﬁned under the same idea as in grouped SBM: links within groups are denser than links between groups. Since the DCSBM is a generative model, the group structure can be deﬁned by adding constraints on the probability matrix Bn and the degree variable θ as in Condition 3. Further, Lemma 3 shows that the group structure in the DCSBM is well-deﬁned.

Condition 3 Suppose θi is a discrete random variable and takes value in the set {x1, x2, .., x M}. Let τau denote the frequency of community a and the degree value xu (i.e. τau = P(ci = a, θi = xu)). Further, let πa = P u τau and Bn,0 = P ab πa πb Bn,ab. For the normalized matrix Γ = W ( W1)( W1)T where Wab = πa πb Bn,ab/ Bn,0, it satisﬁes that Γn,ab > 0 for communities a and b that belong to the same group, and Γn,ab < 0 for a and b that belong to diﬀerent groups.

Lemma 3 For generative DCSBM graph with probability Bn, if both g1 and g2 satisfy Condition 3, then we have g1 = g2.

Following the same idea as in Section 3 for the SBM, we can build the distributed community detection algorithm for the DCSBM. In the group division part, the modularity under the DCSBM is deﬁned in Eq. (2), which is a modiﬁcation of the Erdos-Renyi modularity in Eq. (1) by taking into account the degree heterogeneity,

i,j (Aij didj/L)I(ei = ej), (2)

where di = P j Aij is the degree of node vi, and e, O(e), L are all deﬁned the same as before. As discussed in Section 3.1, exact modularity maximization is infeasible. Therefore, we again apply a fast greedy method to the degree-corrected modularity. Just like in the SBM, the number of groups can be obtained during the training process of the fast greedy method. In the community detection part, we ﬁrst use LRBIC to determine the number of communities Kt, then we apply community detection methods such as the variational DCSBM method, spectral clustering or SCORE (Jin et al., 2015), to identify the community label csub t within each group t, t = 1, 2, .., G. The combination of the community detection part and the group detection part is the same as in Algorithm 1. Lastly, we show the group consistency result under the DCSBM in Theorem 4. Based on the Theorem, the group detection part of the distributed community detection algorithm can consistently recover the group labels for the DCSBM with group structure.

Theorem 4 Under the DCSBM, if the model satisﬁes Conditions 2 and 3, then the estimated group labels ˆg obtained by maximizing QDC(e) are strongly consistent when λn/ log(n) and weakly consistent when λn .

4.3 Community Detection Consistency

In this subsection, we establish the strong and weak consistency for the estimated community labels ˆc obtained using the proposed distributed community detection algorithm. The deﬁnitions of the strong and weak consistency for the estimated community labels ˆc are similar to those for ˆg: The strong and weak consistency of estimated community labels ˆc can be deﬁned by P(ˆc = c) 1 as n ,

and for any ε > 0,

i 1(ˆci = ci)

respectively.

Theorem 5 (Consistency of the Distributed Community Detection Algorithm) Provided that the estimated group labels ˆg are strongly consistent and the adopted community detection algorithm is strongly (weakly) consistent, then the estimated community labels ˆc are strongly (weakly) consistent.

Distributed Community Detection in Large Networks

Remark: Note that Theorem 5 covers both the SBM and the DCSBM. If the number of groups G is 1, Theorem 5 reduces to the community detection results established in Bickel and Chen (2009) for the SBM and that in Zhao et al. (2011) for the DCSBM. Because the community level detection methods, including the SSP (Von Luxburg et al., 2008) and VSBM (Daudin et al., 2008) for the SBM and SCORE (Jin et al., 2015) for the DCSBM, are consistent, our proposed distributed community detection algorithm is consistent provided that the estimated group labels are strongly consistent as stated in Theorem 5.

5. Simulation Studies

In this section, we conduct simulation studies to evaluate the performance of the proposed distributed community algorithm.

5.1 Comparative Methods and Implementation Detail

5In SBM, we consider three baselines. First, FG denotes the ER modularity optimization (Eq (1)) based on fast-greedy algorithm.3 The estimated community size and community label can be obtained simultaneously from the dendrogram generated by fast-greedy algorithm. Second, VSBM denotes the variational EM algorithm for maximizing SBM likelihood function.4 The size of community is selected by LRBIC method where Kmax is set as the true community size.5 Third, SSP denotes the spherical spectral clustering algorithm,6 the size of community is also selected by LRBIC method. As comparison, we study two distributed community detection methods summarized in Algorithm 1. D-VSBM and D-SSP represent group division via modularity optimization and community detection within each sub-network is conducted using VSBM and SSP methods, respectively. The group size is obtained from the group division step and community sizes within sub-networks are estimated from LRBIC method as well. In DCSBM, three baselines are compared. Firstly, FG-DC denotes the DC modularity optimization (Eq (2)) based on fast-greedy algorithm. Secondly, VDCSBM denotes the variational EM algorithm for maximizing DCSBM likelihood function. Lastly, SCORE denotes the spectral clustering on ratios-of-eigenvectors community detection method.7 In both VDCSBM and SCORE method, the size of community is selected through LRBIC method. We compare the baselines with distributed method, D-DCSBM and D-SCORE, to evaluate the performance of proposed methods.

5.2 Numerical Results for SBM and DCSBM

We start with grouped community structure SBM network with generation based on the description in Section 2. Speciﬁcally, we generate networks using the SBM with G = 4 groups, the four groups contain 2, 3, 3 and 4 communities, respectively. The community frequen-

3. https://igraph.org/r/doc/cluster fast greedy.html 4. https://rdrr.io/cran/mixer/ 5. LRBIC function in randnet R package: https://cran.r-project.org/web/packages/randnet/randnet.pdf 6. reg.SSP function in randnet R package: https://cran.r-project.org/web/packages/randnet/randnet.pdf 7. https://rdrr.io/cran/Score Plus/man/SCORE.html

Figure 2: Relatively small networks. The panels on the left show the NMI for all methods with growing number of nodes n under the SBM; the panels on the right show the corresponding computing times; the panel on the bottom is the group NMI.

Figure 3: The panels on the left show the NMI for all methods with growing number of nodes n under the DCSBM. The panels on the right show the corresponding computing times; the panel on the bottom is the group NMI.

cies are set as π = (1/12, 1/12, .., 1/12)T . The probability matrix Bn R12 12 is generated as follows: for nodes vi and vj belonging to the same group, Bcicj Unif(1/100, 1); for nodes vi and vj from diﬀerent groups, Bcicj Unif(0, 1/100). Hence, the generated Bn satisﬁes Condition 1. Therefore, the networks which are generated from Bn have structure that community structure within a group is non-assortative while the groups remain as assortative. Under the setting of DCSBM, we follow the network generation strategy as in SBM, except that now we have the extra degree parameter θ = (θ1, . . . , θn), and they are generated independently as follows,

P(θi = mx) = P(θi = x) = 1

where we set m = 3/2 and x = 2/(m + 1) so that E(θi) = 1 is satisﬁed as discussed in Section 4.2. For each setting, we repeat the simulation 100 times with various network sizes n. To evaluate the performance of diﬀerent methods of community detection, we use the normalized mutual information (NMI) index (Mori and Malik, 2003) shown in Equation (3) where I( , ) and H( ) are mutual information and entropy respectively. NMI ranges

Distributed Community Detection in Large Networks

between 0 and 1, with larger values indicating better performance. NMI equals 1 when the method recovers the true community structure perfectly.

NMI(c,ˆc) = 2 I(c,ˆc)

H(c), H(ˆc)

I(c,ˆc) = X

e2 ˆc p(c,ˆc)(e1, e2)log(p(c,ˆc)(e1, e2)

pc(e1)pˆc(e2)) (3)

We also compare the computational time for diﬀerent methods to evaluate the scalability. The results for both SBM and DCSBM are summarized in Figures 2 and 3, respectively. We have three ﬁndings from the result. First, as the size of the network n increases, except for the modularity-based method (FG and FG-DC), the NMI increases for all other methods in both SBM and DCSBM models. Also, we observe that likelihood-based method VSBM and VDCSBM are better than SSP and SCORE in NMI, respectively, while the computational time for VSBM and VDCSBM grows exponentially which means they are not scalable. Third, we ﬁnd that the distributed method and non-distributed method are close in community detection performance, while distributed methods are scalable in computation time. To study the performance in relatively large size of networks, we consider relatively large networks where the size of networks varies from 5e3 to 5e5 and the the size group G also grows as the size of nodes increases. Within each group, the community size is 5 with equal community frequency. The probability matrix is designed as the small scale experiment above. we summarize the result in Table 1. Computational time including community size selection using LRBIC and community label detection step, hence SSP and SCORE cannot be calculated when size of n is too large. We use G-NMI to denote the group level NMI which is deﬁned as G-NMI= NMI(g, ˆg). From the result, we show the proposed distributed based methods works well even in large size of networks setting. Also, the G-NMIs are large which indicate the proposed distributed method can maintain the group level structure well. The distributed based methods are also scalable compared to baseline methods.

6. Data Examples

In this section, we apply the proposed method to real-world networks including an airline route network and Facebook ego networks. Since the true community labels of real application are unknown, the evaluation step is based on two aspects: ﬁrst, in airline route network, we analyze the community detection results from regions geometric and economic factors; second, in both applications, we analyze the AUC results by considering the recovering power of edges based on the estimated SBM model.

6.1 Airline Route Network

The airline route network8 contains 67,663 routes between 3,321 airports on 548 airlines spanning the globe. The adjacency matrix A is then constructed as follows: each airport is considered as a node; if there exists a route between two nodes vi and vj, we set Aij = 1,

8. https://openﬂights.org/data.html

(n, G, K) Method NMI t (s) G-NMI Method NMI t (s) G-NMI

(5e3, 1e1, 5e1)

VSBM 0.97 (0.005) 6952 (20) - VDCSBM 0.95 (0.006) 7024 (20) -

D-VSBM 0.97 (0.008) 42 (6) 0.95 D-VDCSBM 0.95 (0.010) 48 (5) 0.96

SSP 0.88 (0.004) 411 (8) - SCORE 0.90 (0.007) 502 (9) -

D-SSP 0.86 (0.006) 25 (4) 0.95 D-SCORE 0.89 (0.012) 35 (5) 0.96

(5e4, 1e2, 5e2)

VSBM - - - VDCSBM - - -

D-VSBM 0.96 (0.010) 512 (10) 0.96 D-VDCSBM 0.95 (0.012) 553 (12) 0.97

SSP - - - SCORE - - -

D-SSP 0.90 (0.008) 273 (11) 0.96 D-SCORE 0.88 (0.013) 394 (12) 0.97

(5e5, 1e3, 5e3)

VSBM - - - VDCSBM - - -

D-VSBM 0.96 (0.010) 5590 (35) 0.94 D-VDCSBM 0.982 (0.010) 6024 (32) 0.97

SSP - - - SCORE - - -

D-SSP 0.90 (0.009) 3025 (20) 0.94 D-SCORE 0.975 (0.012) 4520 (25) 0.97

Table 1: Result for relatively large networks. Computational time includes two parts: 1) community size selection 2) community label detection. Using LRBIC to determine the community size grows exponentially with the size of networks. Our distributed community detection method can calculate the community size as well as get a high NMI score in community detection.

otherwise, Aij = 0. We apply the proposed method and report the results based on D-SSP. The results from D-VSBM are similar and omitted here. We ﬁrst choose the number of groups G. Figure 6(a) shows the modularity value under diﬀerent values of G. As one can see, the modularity value reaches its peak when G = 13 and then remains almost the same as G further increases. Therefore, we divide the network into G = 13 groups. Figure 4 shows the seven largest identiﬁed groups, where diﬀerent colors correspond to diﬀerent estimated groups. Roughly speaking, these seven groups represent seven diﬀerent regions in the world. For example, the black points are mainly in South America, the pink points are mainly in Canada, while the yellow points are mainly in the United States, Mexico, and the Caribbean region. Community detection is then applied to each identiﬁed group. For example, the right panel of Figure 5(a) shows the community detection results for the group in South America. As one can see, the two detected communities correspond to the two centers in the left

Distributed Community Detection in Large Networks

Figure 4: Group detection in the airline route network; diﬀerent colors indicate diﬀerent estimated group labels.

(a) South America (black points in Figure 4 )

(b) Canada (pink points in Figure 4)

Figure 5: In each sub-ﬁgure, one panel shows the airline routes (with red points corresponding to airports and green lines corresponding to airline routes), and the other panel shows the estimated community labels - diﬀerent colors indicate diﬀerent estimated community labels. In Canada, there are three airports (in black) that are obviously mislabeled.

panel of Figure 5(a): one in Uruguay and Argentina, and the other in Brazil, which are understandable. Another example is shown in Figure 5(b). Two detected communities (lower panel) match the airline routes in the upper panel and correspond to eastern and western Canada, respectively. Further, we evaluate the eﬀectiveness of community detection in a predictive manner. Speciﬁcally, we ﬁrst randomly sample a proportion of the node pairs as the testing set and

mask the corresponding edge Aij values as zero; then we apply the proposed community detection method to obtain ˆBn which can be estimated according to the detected community labels ˆc by considering the connection probability of nodes both within and between communities. Lastly, we evaluate the performance of ˆBn on the testing set using the area under the receiver operating characteristic curve (AUC). The left panel in Figure 7 shows how the AUC value changes when we vary the proportion of masked node pairs. As we can see, overall, the AUC values are pretty high and stable. As the proportion of masked node pairs increases, the AUC value decreases, but not by a lot. We can also see that the AUC of D-SSP is only slightly lower than that of SSP, indicating again that with data splitting, the proposed distributed community detection method does not degrade the community detection performance.

(a) Airline route network

(b) Facebook ego network

Figure 6: Modularity under diﬀerent values of G

6.2 Ego Networks

Ego networks (Leskovec and Mcauley, 2012; Leskovec and Krevl, 2014) are often created from user surveys in social media, such as Facebook, Twitter, or Google+. Each ego person is considered as a primary node, and the ego s friends are all considered as nodes in the ego network. If two nodes in the ego network are friends with each other, a link is added between these two nodes. For example, the Facebook ego network we consider here contains 4,039 nodes and 88,234 edges, corresponding to 10 ego people (see Figure 8). Since there are 10 ego people in this Facebook ego network, it is reasonable to set the number of groups G as 10. The plot of modularity for the ego network (Figure 6(b)), which shows the modularity value under diﬀerent values of G, conﬁrms the conjecture. As one can see, the modularity value reaches its peak at G = 10 and stays at about the same value for large G values. After setting the number of groups as 10, we apply the D-SSP method to the Facebook ego network to obtain the estimated community labels and link probability matrix. Similar to the previous subsection, the results are evaluated using the AUC on the testing set (see Figure 7). Further, we have also considered a Twitter ego network and a Google+ ego network. Since these two networks are relatively large (with 142 and 132 ego people respectively), the SSP method can not be applied due to the computational cost. We thus apply the SSP

Distributed Community Detection in Large Networks

Figure 7: AUC under diﬀerent proportions of masked node pairs. The x-axis for masked proportion is set as 0.1, 0.3, 0.5, 0.7, 0.99 respectively.

Network Egos Nodes Edges AUC (mask 10%) (D-SSP)

Time (D-SSP)

AUC (mask 10%) (SSP)

Facebook 10 4,039 88,234 0.985 44(s) 0.986 214(s) Twitter 42 5,816 83,694 0.991 34(s) 0.989 420(s) Google+ 8 6,382 325,819 0.941 101(s) 0.954 761(s) Twitter 142 81,306 1,342,296 0.989 1091(s) Google+ 132 107,614 13,673,453 0.974 5030(s)

Table 2: Numerical results for ego networks; all experiments are conducted via a Dell R7425 machine with dual processor AMD Epyc 32 core 2.2 GHZ with 8GB of RAM

method only on a subset of the ego network corresponding to a randomly sampled subset of the ego people. The results are shown in Table 2. It can be seen that the proposed distributed community detection method D-SSP has comparable performance to SSP (when SSP can be applied) in terms of the AUC, but signiﬁcantly reducing the computational cost.

7. Conclusion and Future Work

In this paper, we have proposed a novel distributed community detection method for large networks with a grouped community structure. The method is easy to implement using the current technology in modularity optimization and community detection. We have also established both weak and strong consistency of the proposed method for both group and community detection. Numerical results show that the proposed distributed commu-

nity detection method achieves similar detection performance in comparison to community detection methods without data splitting, but largely reduces the computational cost. There are several directions we plan to extend our work. The number of groups G is currently assumed ﬁxed as n grows in our theoretical analysis. Using techniques developed in Choi et al. (2012) and Rohe et al. (2011), we may relax this assumption. Another interesting direction is to generalize the current work to incorporate the dynamic representation of timevarying networks with evolving community structures (Sarkar and Moore, 2006; Berger-Wolf and Saia, 2006). This is challenging and requires further investigation.

Emmanuel Abbe. Community detection and stochastic block models: recent developments.

J. Mach. Learn. Res., 18:177:1 177:86, 2017.

Arash A Amini, Aiyou Chen, Peter J Bickel, and Elizaveta Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41 (4):2097 2122, 2013.

Tanya Y Berger-Wolf and Jared Saia. A framework for analysis of dynamic social networks. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 523 528. ACM, 2006.

Peter J Bickel and Aiyou Chen. A nonparametric view of network models and newman girvan and other modularities. Proceedings of the National Academy of Sciences, pages pnas 0907096106, 2009.

Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.

David S Choi, Patrick J Wolfe, and Edoardo M Airoldi. Stochastic blockmodels with a growing number of classes. Biometrika, 99(2):273 284, 2012.

Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical review E, 70(6):066111, 2004.

J-J Daudin, Franck Picard, and St ephane Robin. A mixture model for random graphs.

Statistics and computing, 18(2):173 183, 2008.

Leo Egghe and Ronald Rousseau. Introduction to informetrics: Quantitative methods in

library, documentation and information science. Elsevier Science Publishers, 1990.

Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36 41, 2007.

Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics

Reports, 659:1 44, 2016.

Roger Guimera, Marta Sales-Pardo, and Lu ıs A Nunes Amaral. Modularity from ﬂuctuations in random graphs and complex networks. Physical Review E, 70(2):025101, 2004.

Distributed Community Detection in Large Networks

Mark S Handcock, Adrian E Raftery, and Jeremy M Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301 354, 2007.

Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109 137, 1983.

Jiashun Jin et al. Fast community detection by score. The Annals of Statistics, 43(1):57 89, 2015.

Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107, 2011.

Jing Lei, Alessandro Rinaldo, et al. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215 237, 2015.

Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In

Advances in neural information processing systems, pages 539 547, 2012.

Greg Mori and Jitendra Malik. Recognizing objects in adversarial clutter: Breaking a visual captcha. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I I. IEEE, 2003.

Mark EJ Newman. Modularity and community structure in networks. Proceedings of the

national academy of sciences, 103(23):8577 8582, 2006.

Mark EJ Newman. Communities, modules and large-scale structure in networks. Nature

physics, 8(1):25, 2012.

Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.

Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120 3128, 2013.

Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878 1915, 2011.

Purnamrita Sarkar and Andrew W Moore. Dynamic social network analysis using latent space models. In Advances in Neural Information Processing Systems, pages 1145 1152, 2006.

Tom AB Snijders and Krzysztof Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of classiﬁcation, 14(1):75 100, 1997.

Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555 586, 2008.

Jiangzhou Wang, Jingfei Zhang, Binghui Liu, Ji Zhu, and Jianhua Guo. Fast network community detection with proﬁle-pseudo likelihood methods. ar Xiv preprint ar Xiv:2011.00647, 2020.

YX Rachel Wang and Peter J Bickel. Likelihood-based model selection for stochastic block models. The Annals of Statistics, 45(2):500 528, 2017.

Scott White and Padhraic Smyth. A spectral clustering approach to ﬁnding communities in graphs. In Proceedings of the 2005 SIAM international conference on data mining, pages 274 285. SIAM, 2005.

Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Community extraction for social networks.

Proceedings of the National Academy of Sciences, 2011.

Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics, 40 (4):2266 2292, 2012.

Distributed Community Detection in Large Networks

We ﬁrst formally formulate the group detection problem for networks with the grouped community structure. Given the total number of groups G, we use e to denote a group assignment e = {e1, e2, . . . , en}, where ei is the group label for node vi and it takes value in {1, 2, . . . , G}. Deﬁne ˆg = argmaxe QER(e) as the estimated group labels. Our ﬁrst goal is to show that ˆg can consistently estimate the true group labels g. Following Bickel and Chen (2009) and Zhao et al. (2012), for any given group assignment e, we construct matrix O(e) RG G as Ots(e) = X

ij Aij I{ei = t, ej = s},

where I is the indicator function. The total degree in group t under the group assignment e can be deﬁned as Ot(e) = X

s Ots(e), t = 1, . . . , G.

Let f(e) = (n1(e)

n , ..., n G(e)

n )T denote the frequencies in each group under the group assignment e, where nt(e) = P i I{ei = t} is the number of nodes in group t. Also, denote the sum of degrees as L = P ij Aij. We start with the general format of modularity-based criteria, which can be written as

Q(e) = F(O(e)

µn , f(e)).

We aim to ﬁnd an e that maximizes the modularity. The key in group detection problem is to construct a population version of Q(e) such that the group labels g can maximize the population version of Q(e). We ﬁrst need to construct the population version of f(e) and O(e). We deﬁne a matrix R(e) RG K M as Rtau(e) = 1

n P i I(ei = t, ci = a, θi = xu), which is a normalized one-hot representation of the relationship between nodes and structure including group label, community labels and degree level. For any generic array S = [Stau] RG K M, we can deﬁne a G-dimensional vector h(S) = [ht(S)] by

au Stau, (4)

and G G matrix H(S) = [Hts(S)] by

abuv xuxv Bab Sta Ssb. (5)

Then we have Lemma 6 as follows:

E(ft(e)|c, θ) = ht(R(e)),

1 un E(Ots(e)|c, θ) = Hts(R(e)).

Proof of Lemma 6: For the ﬁrst equation, left part can be written as E(ft(e)|c, θ) = ft(e) = 1 n P i I(ei = t) due to the deﬁnition of ft(e). From the construction of h and R(e) above, we have ht(R(e)) = P au Rtau(e) = P au 1 n P i I(ei = t, ci = a, θi = xu). After changing the order of two summations, we have X

i I(ei = t, ci = a, θi = xu) = X

au I(ei = t, ci = a, θi = xu) = 1

i I(ei = t).

The last equality holds because community labels c and degrees θ take exact one value for each i. Therefore, E(ft(e)|c, θ) = ht(R(e)). For the second equation, we have by deﬁnition of Ots:

1 un E(Ots(e)|c, θ) = 1

ij Aij I{ei = t, ej = s}|c, θ)

abuv Aij I{ei = t, gi = a, θi = xu}I{ej = s, gj = b, θj = xv}|c, θ).

Given community labels and degrees, the adjacency matrix A is constructed from the probability matrix Bn = ρn B. Therefore, we have:

abuv Aij I{ei = t, gi = a, θi = xu}I{ej = s, gj = b, θj = xv}|c, θ)

ij θiθj Bci,cj X

abuv I{ei = t, gi = a, θi = xu}I{ej = s, gj = b, θj = xv}

abuv xuxv Bab X

ij I{ei = t, gi = a, θi = xu}I{ej = s, gj = b, θj = xv}.

From the construction of H and R(e), the following equations holds:

Hts(R(e)) = 1

abuv xuxv Bab Rta(e)Rsb(e)

ab xuxv Bab X

i I(ei = t, ci = a) X

j I(ej = s, cj = b).

Therefore, 1 un E(Ots(e)|c) = Hts(R(e)). From the Lemma 6, the population version of Q(e) can be deﬁned as the expectation conditional on c but replacing R(e) by T(e) RG K M, where

Ttau(e) = P i I(ei = t, ci = a, θi = xu) P i I(ci = a, θi = xu) τau,

it can be shown that Ttau(e) = Rtau(e)τau/ˆτau, where τau = P(ci = a, θi = xu) and ˆτau = 1

n P i I(ci = a, θi = xu). Therefore, the population version of Q(e) can be further denoted as F(H(T(e)), h(T(e))), where we omit the constant multiplier L. Theorem 2 would indicate that the group labels g are the maximizer of F(H(T(e)), h(T(e))). Lemma 7 is based on Bernstein s inequality which is utilized in Lemma 8. Lemma 8 provides the conditions for the consistency in maximizing modularity Q(e).

Distributed Community Detection in Large Networks

Lemma 7 Deﬁne X(e) RG G by Xts(e) = O(e)

µn H(R(e)). Let ||X|| = maxs,t|Xst| and |e g| = P i I(ei = gi). Then

P(maxe||X(e)|| ε) 2Gn+2exp( 1

8C ε2µn) (6)

for ε < 3C, where C = maxab Bab.

P(max|e g| m||X(e) X(g)|| ε) 2 n m

for ε 6Cm/n.

P(max|e g| m||X(e) X(g)|| ε) 2 n m

16C ε2µn) (8)

for ε 6Cm/n.

Proof: See the proof of Lemma A.1. in the supplementary material of Zhao et al. (2012).

Lemma 8 For any Q(e) of the form

Q(e) = F(O(e)

if π, B, F satisfy conditions (*), (a),(b),(c), then Q is strongly consistent under stochastic block models if λn/(log(n)) and weakly consistent when λn . The conditions are listed as follows: (*) F(H(S), h(S)) is uniquely maximized over P = {S : S 0, PG t=1 Stau = τau} by S = S , where S tau = τau for one t, otherwise, S t au = 0 for t = t. S u has only one nonzero element in each column, and communities within group stay in the same row. (a) F is Lipschitz in its arguments; (b) Let W = H(D). The directional derivatives 2F

ε2 (M0 + ε(M1 M0), t0 + ε(t1 t0))|ε=0+ are continuous in (M1, t1) for all (M0, t0) in a neighborhood of (W, π); (c) Let G(S) = F(H(S), h(S)). Then on P, G((1 ε)D+εS)

ε |ε=0+ < C < 0 for all π, B.

Proof: The proof has three steps. Step 1: show that the modularity Q(e) = F(O(e)

µn , f(e)) is uniformly close to the population version F(H(T(e)), h(T(e))). In another word, we need to show that for λn , there exist εn 0 such that

P(maxe|F(O(e)

µn , f(e)) F(H(T(e)), h(T(e)))| < εn) 1. (9)

µn , f(e)) F(H(T(e)), h(T(e)))|

µn , f(e)) F(H(R(e)), h(R(e)))| + |F(H(R(e)), h(R(e))) F(H(T(e)), h(T(e)))|,

For the ﬁrst item on the left hand above, since h(R(e)) = f(e) and by Lipschitz continuity,

µn , f(e)) F(H(R(e)), f(e))| M1||X(e)|| . (10)

From Equation 6 in Lemma 7, Equation 10 tends to 0 uniformly for λn . For the second item,

|F(H(R(e)), f(e)) F(H(T(e)), h(T(e)))|

M1||H(R(e)) h(T(e))|| + M2||H(R(e)) h(T(e))||2, (11)

where ||.||2 is Euclidean norm for vectors. Since for λn , Rta(e) Tta(e) a.s., Equation 11 converges to 0 uniformly too. Thus, Equation 9 holds. Step 2: show that the weak consistency of group assignments holds. Since T(g) P and has only one nonzero element in each column. By continuity and condition ( ), there exists δn 0, if |e g|/n = P i I(ei = gi)/n δn, then

F(H(T(g)), h(T(g))) F(H(T(e)), h(T(e))) > 2εn.

From Equation 9 in step 1, it is can be shown that

P(max{e:|e g|/n δn}F(O(e)

µn , f(e)) < F(O(g)

µn , f(g)))

P(|max{e:|e g|/n δn}F(O(e)

µn , f(e)) max{e:|e g|/n δn}F(H(T(e)), h(T(e)))| < εn,

µn , f(g)) F(H(T(e)), h(T(e)))| < εn) 1

The Equation 12 implies that P(|e g|/n < δn) 1, and weak consistency of group assignment holds. Step 3: show that the strong consistency of group assignments holds. See the step 3 of the proof of of Theorem 4.1 in the Zhao et al. (2012).

Now we provide the proof of Theorem 2. Proof of Theorem 2: Under the block assumption θi 1, the population version of QER(e) is F(H(S), h(S)) = P t(Htt(S) B0h2 t (S)) up to a constant multiplier, where S RG K. Therefore, it satisﬁes conditions (a),(b),(c) and we only need to show condition (*) is satisﬁed as well. For S P, we have the following equation:

t (Htt(S) B0h2 t (S)) +

t =s (Hts(S) B0ht(S)hs(S))

t,s Hts(S) B0(

= 1 12 = 0.

Distributed Community Detection in Large Networks

The second last equality holds due to Lemma 6 and the fact that P t ft = 1 and P ts Ots

L = 1. Also, we construct RG G by:

ts = (I(t = s) 1/2) 2. (14)

From Equation 13 and 14, the population version F(H(S), h(S)) can be rewritten as:

F(H(S), h(S)) =

t (Htt(S) B0h2 t (S))

t,s ts(Hts(S) B0ht(S)hs(S)).

From the Deﬁnition of h and H in equation 4 and 5, we get:

t,s ts(Hts(S) B0ht(S)hs(S))

a,b Bab Sta Ssb B0

a,b Sta Ssb ts(Bab B0).

Construct c RK K by:

( 1, community a, b belong to same group 1, O.W.

From the conditions in Condition 1, the following inequality always holds for any t, s, a, b:

ts(Bab B0) c ab(Bab B0).

Therefore, population version of modularity satisﬁes:

F(H(S), h(S)) = 1

a,b Sta Ssb ts(Bab B0)

a,b Sta Ssb c ab(Bab B0).

We can change the order of summations in last item above:

a,b Sta Ssb c ab(Bab B0) = 1

a,b c ab(Bab B0)

a,b πaπb c ab(Bab B0) = F(H(S ), h(S )).

The second last equality holds because S P. Therefore, we have F(H(S), h(S)) F(H(S ), h(S )) (up to row permutation of S ). Therefore, condition ( ) is also satisﬁed. From the result in Lemma 8, Theorem 2 holds. Proof of Theorem 4: For the degree-corrected model, we follow a similar strategy in Proof of Theorem 2. The population version of QNGM can be denoted as

F(H(S)) = X

we need to show condition (*) is satisﬁed. From equation,

we can show

F(H(S)) = 1

B0 )2) (15)

From the deﬁnition of H in Equation 5, it can be shown that

ab Sta Ssb Bab,

ac Sta πc Bac

where Sta = P u xu Stau. Hence, we obtain

F(H(S)) = 1

ts ts( P ab Sta Ssb Bab

B0 P ac Sta πc Bac P bd Ssb πd Bbd B0 2 )

ab Sta Ssb ts(Bab

B0 (P c πc Bac)(P d πd Bbd) B0 2 ). (16)

From the condition 3, it satisﬁes that

ab Sta Ssb ts(Bab

B0 (P c πc Bac)(P d πd Bbd) B0 2 )

ab Sta Ssb c ab(Bab

B0 (P c πc Bac)(P d πd Bbd) B0 2 )

Therefore, condition ( ) is satisﬁed (up to permutation). From the result in Lemma 8, Theorem 4 holds. Proof of Theorem 5:

Distributed Community Detection in Large Networks

For weak consistency, given any ε > 0:

i=1 1(ˆci = ci) ε) = P({ 1

i=1 1(ˆci = ci) ε}, ˆg = g)+P({ 1

i=1 1(ˆci = ci) ε}, ˆg = g)

The second item P({ 1

n Pn i=1 1(ˆci = ci) ε}, ˆg = g) tends to zero because of the strong consistency of group labels ˆg. The ﬁrst item can be bounded by considering the event under each group. I.e, for nodes in group t, { 1

n Pnt k=1 1(ˆcsub t,k = csub t,k ) ε, ˆg = g} can lead to { 1

n Pn i=1 1(ˆci = ci) ε, ˆg = g}. Therefore, we have:

i=1 1(ˆci = ci) ε}, ˆg = g) P( t=1,2,..,G{ 1

k=1 1(ˆcsub t,k = csub t,k ) ε, ˆg = g})

= P( t=1,2,..,G{ 1

k=1 1(ˆcsub t,k = csub t,k ) ε

πt , ˆg = g})

k=1 1(ˆcsub t,k = csub t,k ) ε

πt , ˆg = g}).

The last part tends to zero for each t due to the weak consistency of the community labels ˆcsub t , t = 1, 2, ..., G. For strong consistency:

P(ˆc = c) = P(ˆg = g, ˆc = c) + P(ˆg = g, ˆc = c)

The ﬁrst item goes to zero from the strong consistency of the estimated community labels in each correct group; the second item goes to zero from the strong consistency of the estimated group label.

Figure 8: The adjacency matrix for the Facebook ego network; in the network, each node denotes a user and each link denotes two users are friends in Facebook. From the adjacency matrix, we can easily observe several groups of friend list.