# network_global_testing_by_counting_graphlets__b2214d48.pdf

Network Global Testing by Counting Graphlets

Jiashun Jin 1 Zheng Tracy Ke 2 Shengming Luo 1

Consider a large social network with possibly severe degree heterogeneity and mixedmemberships. We are interested in testing whether the network has only one community or there are more than one communities. The problem is known to be non-trivial, partially due to the presence of severe degree heterogeneity. We construct a class of test statistics using the numbers of short paths and short cycles, and the key to our approach is a general framework for canceling the effects of degree heterogeneity. The tests compare favorably with existing methods. We support our methods with careful analysis and numerical study with simulated data and a real data example.

1. Introduction

Given a large symmetrical network, we are interested in the global testing problem where we use the adjacency matrix of the network to test whether the network consists of only one community or that it consists of multiple communities, where some nodes may have mixed memberships.

Real networks frequently have severe degree heterogeneity. The Stochastic Block Model (SBM) is well-known, but does not accommodate severe degree heterogeneity. To tackle the problem, Karrer and Newman (2011) proposed the Degree Corrected Block Model (DCBM). DCBM strikes a better balance between theory and practice than SBM, and has become increasingly more popular recently.

We adopt a Degree-Corrected Mixed-Membership (DCMM) model (Jin et al., 2017). DCMM can be viewed as an extension of DCBM, but allows for mixed memberships. Suppose the network has n nodes and K perceivable communities

C1, C2, . . . , CK.

*Equal contribution 1Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, USA 2Department of Statistics, University of Chicago, Chicago, USA. Correspondence to: Jiashun Jin <jiashun@stat.cmu.edu>.

Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

For each node, we assign a Probability Mass Function (PMF) πi = (πi(1), πi(2), , πi(K)) , where πi(k) is the weight that node i puts on community Ck, 1 k K. We call node i pure if πi is degenerate and mixed otherwise. Let A Rn,n be the adjacency matrix, where Aij = 1 if nodes i and j have an edge, and Aij = 0 otherwise (all diagonal entries of A are 0 as we don t treat a node as connecting to itself). In DCMM, we think the upper triangle of A contains independent Bernoulli random variables. Moreover, for n degree heterogeneity parameters θ1, θ2, . . . , θn and a non-singular, irreducible matrix P RK,K,

P(Aij = 1) = θiθj

ℓ=1 πi(k)πj(ℓ)Pkℓ. (1)

Also, we assume (a) all diagonals of P are 1, and (b) each of the K communities has at least one pure node. With such constraints, DCMM is identiﬁable (Jin et al., 2017).

Let Θ be the n n diagonal matrix Θ = diag(θ1, . . . , θn) and let Π be the n K matrix Π = [π1, π2, . . . , πn] . Let Ω= ΘΠPΠ Θ (recall that P RK,K):

π 1 ... π n

P [π1, . . . , πn]

Let diag(Ω) be the diagonal matrix where the i-th diagonal entry is Ωii, and let W = A [Ω diag(Ω)]. We have

A = [Ω diag(Ω)] + W = signal + noise .

Remark 1. Many recent works use the Random Degree Parameter (RDP) model (which is narrower than ours): ﬁxing a scaling parameter αn > 0 and a density function f over (0, ) where the ﬁrst a few moments of f are ﬁnite, and es-

pecially the second moment is 1, we assume (θi/αn) iid f, i = 1, 2, . . . , n. We call the resultant DCMM model the DCMM-RDP model. In applications where we don t know how θi s are correlated, it is preferable to treat θ as nonrandom as before, and it is safer to use the original DCMM than DCMM-RDP.

The testing problem can be cast as testing a null hypothesis H(n) 0 against a speciﬁc hypothesis H(n) 1 in its complement:

H(n) 0 : K = 1 vs. H(n) 1 : K > 1, (2)

Network Global Testing by Counting Graphlets

where under H(n) 1 , DCMM holds for some eligible (P, θ1, ..., θn, π1, ..., πn). Note that under H(n) 0 , P = 1, π1, ..., πn are all degenerate, and P(A(i, j) = 1) = θiθj.

Most existing literatures for the testing problem (2) have been focused on the special case where

No degree heterogeneity. θ1 = θ2 = . . . = θn.

No mixed-membership. All πi are degenerate PMFs.

Many tests were proposed for this special case, including but are not limited to the likelihood ratio test approach (Wang & Bickel, 2017) and the spectral approach (Bickel & Sarkar, 2016; Lei, 2016; Banerjee & Ma, 2017).

However, in our setting, θ1, θ2, . . . , θn vary signiﬁcantly from one to another, and it is unclear how to extend the methods above to the current setting. The likelihood ratio test is not applicable, for there are many unknown parameters (θ1, π1), (θ2, π2), ..., (θn, πn). The spectral approach also faces challenges, because the distributions of such test statistics depend on unknown parameters in a complicated way, and it is nontrivial to ﬁgure out the rejection region.

Also, it may be tempting to adapt the recent approaches on estimating K to our testing problem (Saldana et al., 2017; Chen & Lei, 2017; Le & Levina, 2015), but similarly, due to severe degree heterogeneity, the null distributions of such statistics are not tractable, so they cannot be used directly.

Recently, Gao and Lafferty (2017) (see also Bubeck et al. (2016) which is related but on different settings) proposed a new test called the Erd os-Zuckerberg (EZ) test for DCMM, with the following constraints.

(GL1) No mixed-membership: all πi are degenerate.

(GL2) The community labels are uniformly drawn so the K communities have roughly equal sizes.

(GL3) The DCMM-RDP holds (i.e., θi are iid samples), and the K K matrix P has the special form of

a b b ... ... ... b b a

Note that the last bullet point is particularly restrictive.

Gao and Lafferty (2017) made an interesting observation that the effect of the degree heterogeneity parameters θ1, θ2, . . . , θn is largely canceled out in the EZ test, and the test statistic approximately equals to 0 under the null, regardless of what θ1, θ2, . . . , θn are. This allows us to ﬁnd a convenient way to map out the rejection region.

Unfortunately, the authors did not make it clear whether the cancellation is coincidental and is due to the symmetry they imposed on the model (see GL1-GL3), or is inherent and holds for much broader settings.

In this paper, we introduce a class of test statistics by counting the number of graphlets in the network. Fixing a small m 1, we count two kinds of graphlets: length-m paths and m-cycles. Our main contributions are as follows:

Ideation. For a K-vector η and a K K matrix G1/2PG1/2 to be introduced, we derive succinct proxies for the number of length-m paths and mcycles, using η and the eigenvalues and eigenvectors of G1/2PG1/2. The proxies motivate a systematic way of constructing tests where the degree heterogeneity is largely removed so the distributions are more tractable.

To the best of our knowledge, such proxies have not been discovered in the literature. Methods and theory. We propose a class of graphlet count (GC) test statistics, and derive their asymptotic distributions, under the null and alternative hypotheses.

We try to be as general as possible, and our methods and theory are for DCMM with minimal constraints.

The way we construct our statistics is to use the proxies aforementioned, and thus is different from that in Gao and Lafferty (2017), which uses calculations that heavily depend on the imposed constraints GL1-GL3. See Section 3.2 for more comparison.

Our ﬁndings support the philosophy of Jin (2015) which introduced the community detection algorithm SCORE. Jin (2015) pointed out that θ1, θ2, . . . , θn are required to model severe degree heterogeneity, but they turn out to be nuisance parameters, the effects of which can be largely removed with a proper construction of statistics.

While our tests are designed for global testing, the idea is also useful for tackling other problems. For example, we can combine our idea with those in community detection (e.g. Jin (2015), Chen et al. (2018), Qin and Rohe (2013)) to estimate the number of communities K.

2. A Class of Graphlet Count (GC) Statistics

The testing problem (2) is hard for there are so many unknown parameters: P, θ1, . . . , θn, π1, . . . , πn. The parameters θ1, θ2, . . . , θn, which are required to model the severe degree heterogeneity of real networks, are especially hard to deal with for they vary signiﬁcantly from one to another. What we need is therefore a smart test statistic that

does not vary signiﬁcantly as θ = (θ1, θ2, . . . , θn)

varies, and has a tractable limiting distribution (so it is easy to map out the rejection region), is powerful in differentiating the null and alternative.

Our idea is to use the graphlet-count statistics. In a network, we say a path is a self-avoiding path if it doesn t intersect with itself, and a cycle if it is a closed path that does not intersect with itself.

Network Global Testing by Counting Graphlets

Deﬁnition 2.1 For 0 m n, let Bn,m = Qm 1 s=0 (n s). Deﬁnition 2.2 For m 1, we deﬁne the density of lengthm self-avoiding paths by

b Lm = 1 Bn,m+1

1 i1,...,im+1 n i1, ,im+1 are distinct

Ai1i2Ai2i3 Aimim+1.

and for m 3, we deﬁne the density of m-cycles by b Cm = 1 Bn,m

1 i1,...,im n i1, ,im are distinct

Ai1i2Ai2i3 Aimi1.

We propose the family of test statistics, called the Graphlet Count (GC) test statistics:

bχ(m) gc = b Cm (b Lm 1/b Lm 2)m, m = 3, 4, . . . . (3)

Remark 2. Using the adjacency matrix A, b Cm and b Lm can be conveniently computed (e.g., b C4 = 1 24( n 4) tr(A4)

2(1 n A21n) + 1 n A1n ). See supplemental material.

2.1. The Key Idea: Why the GC Test Statistics Work

Recall that Ω= ΘΠPΠ Θ. Let 1n be the n-dimensional vector of 1 s and let θ = (θ1, θ2, . . . , θn) . We use ( , ) to denote the inner product of two vectors. The K K matrix G Π Θ2Π and the vector η RK play a key role.

Deﬁnition 2.3 Denote the vector G 1/2Π Θ1n by η.

Deﬁnition 2.4 For 1 k K, let λk be the k-th largest (in absolute value) eigenvalue of G1/2PG1/2, and let ξk be the corresponding eigenvector.

It turns out that λ1, . . . , λK (eigenvalues of G1/2PG1/2) are the nonzero eigenvalues of Ω. The following results, which will be made precise in Theorems 3.1-3.2, play the key role:

nm b Cm tr(Ωm) =

k=1 λm k , (4)

nm+1 b Lm 1 nΩm1n =

k=1 (η, ξk)2λm k . (5)

We explain why these equations motivate the test statistic bχ(m) gc . Recall that we hope to have a statistic that does not vary too much as θ varies, so ﬁrst, it is desirable to remove the terms (η, ξk)2, which not only depend on θ but are also not very tractable. Under the alternative hypothesis, it is unclear how to cancel these terms, but under the null hypothesis, K = 1, and the right hand side of (5) reduces to (η, ξ1)2λm 1 , and there are many ways to do the cancellation. One such way is to use the following ratio:

nmb Lm 1 nm 1b Lm 2 PK k=1(η, ξk)2λm 1 k PK k=1(η, ξk)2λm 2 k

( = λ1, under H(n) 0 , λ1, under H(n) 1 . (6)

Therefore, at least under H(n) 0 , we have managed to cancel the terms (η, ξk)2.

Next, it is also desirable to cancel the term λ1, at least under the null hypothesis. Comparing (4) and (6), there are many ways to do this, and one such way is to use the bχ(m) gc statistic aforementioned:

nmbχ(m) gc = nm b Cm [ nmb Lm 1)/(nm 1b Lm 2) m.

In fact, by (4) and (6), we have that up to a negligible term (i.e., of a smaller order than that of the standard deviation of the statistic under H(n) 0 ),

( = 0, under H(n) 0 , 1 nm PK k=2 λm k , under H(n) 1 .

On the one hand, the effects of degree heterogeneity are largely canceled in the statistic, so it does not vary signiﬁcantly as the vector θ varies (this is particularly important for we wish to have a rejection region that is relatively insensitive to θ). On the other hand, the statistic bχ(m) gc is able to differentiate the null hypothesis and the alternative hypothesis, through the term PK k=2 λm k . This suggests that bχ(m) gc is a reasonable test statistic.

Note that bχ(m) gc is only one of many test statistics with the desired properties above, but seemingly one of the simplest.

Remark 3. Recall that λ1, . . . , λK are the eigenvalues of G1/2PG1/2, and they are also the eigenvalues of PG (nonnegative, irreducible). By Perron s theorem (Horn & Johnson, 1985), λ1 is positive and λ1 > |λk| for all 2 k K.

Remark 4. Our tests include the EZ test as a special case (i.e., bχ(m) gc with m = 3), but our idea is by no means a straightforward extension of that in Gao and Lafferty (2017). The EZ test was derived by calculations that depend heavily on the constraints GL1-GL3 imposed on P, θ, etc., and it was unclear whether the core idea of the EZ test is only valid when these constraints hold, or in more general settings. The GC tests are derived by (4)-(6), where the relationship between the test statistics and θ, η, and the eigenvalues and eigenvectors of G1/2PG1/2 has not been discovered before, even in cases where the constraints GL1-GL3 hold.

Remark 5. Can we simply use PK k=2 bλm k as the test statistic, where bλk is the k-th eigenvalue of A? We can not, as K is unknown. Can we simply use bλ2 as the test statistic? We can, but the asymptotic distribution of bλ2 is much harder to derive than that of bχ(m) gc (which is Gaussian; see below), so it is challenging to determine the rejection region.

3. Main Results

In theory, we use n as the driving asymptotic parameter, let the matrices (Θ, Π, P) change with n, and consider a

Network Global Testing by Counting Graphlets

sequence of problems where we test H(n) 0 : K = 1 vs. H(n) 1 : K > 1 (K is unknown but does not change with n). Recall node i is pure if πi is degenerate. A pure node can be in any of the K communities. For 1 k K, we let

Nk = {1 i n : πi is degenerate and πi(k) = 1}

be the set of all pure nodes in the community k. Assume

θ , θ 3 0. (7)

Since θmax θ 3, this implies θmax 0. Suppose there is a constant c1 > 0 so that for any 1 k K, P

i Nk θ2 i Pn i=1 θ2 i c1. (8)

(this roughly says each community has sufﬁciently many pure nodes). Also, assume for some constant c2 (0, 1),

all singular values of P fall between c2 and c 1 2 . (9)

Denote Cm = E[ b Cm] and Lm = E[b Lm] and introduce a non-stochastic counterpart of bχ(m) gc by

χ(m) gc = Cm (Lm 1/Lm 2)m. (10)

Theorem 3.1 Consider the DCMM model (1) where (7)-(9) hold. As n ,

χ(m) gc = 1 nm

"PK k=1(η, ξk)2λm 1 k PK k=1(η, ξk)2λm 2 k

+ O(n m θ 4 4 θ 2m 4).

The last term is of a smaller order of the standard deviation of bχ(m) gc and is thus negligible. Theorem 3.1 solidiﬁes what we mentioned in Section 2.1 and is proved in Section 6.

Theorem 3.2 Consider the DCMM model (1) where (7)-(9) hold. As n , for m = 3, 4, under either H(n) 0 or H(n) 1 , r

2m b C 1/2 m

bχ(m) gc χ(m) gc

The proof for other ﬁxed m is similar, but signiﬁcantly more tedious so we leave it as future work. Theorem 3.2 is proved in the supplemental material. Compared to existing literature, our theorems are for a much broader setting where existing works have very little theory and understanding.

Remark 6. Conditions (8)-(9) are only for H(n) 1 since they naturally hold under H(n) 0 ; the conditions are only mild. Condition (7) is also only mild. Take the DCMM-RDP model for example (see Remark 1): θ nαn and θ 3 n1/3αn, so Condition (7) requires n 1/2 αn n 1/3. The case αn n 1/3 corresponds to the strong signal case, the analysis of which is different and we leave it to the forthcoming manuscript.

3.1. Testing Power

By Theorems 3.1-3.2, we expect to see that in distribution, r

2m b C 1/2 m bχ(m) gc N r

2m C 1/2 m χ(m) gc , 1 .

Motivated by Theorem 3.1 and equation (4), we introduce a

2m C 1/2 m χ(m) gc , deﬁned as

δ(m) gc = (2m) 1/2

q PK k=1 λm k

k=1 λm k PK k=1(η, ξk)2λm 1 k PK k=1(η, ξk)2λm 2 k

(11) and we expect to see that in distribution, r

2m b C 1/2 m bχ(m) gc N(δ(m) gc , 1). (12)

It is noteworthy that under H(n) 0 , δ(m) gc = 0.

Fixing 0 < α < 1, let zα be the (1 α)-quantile of N(0, 1). Consider the Graphlet Count (GC) test where we

reject H(n) 0

2m b C 1/2 m bχ(m) gc > zα. (13)

The theorem below is proved in the supplemental material.

Theorem 3.3 Consider the DCMM model (1) where (7)-(9) hold. As n , for m = 3, 4, the level and the power of the Graphlet Count test are respectively α + o(1) and Φ δ(m) gc zα + o(1). Moreover, if δ(m) gc as n , then the power 1.

3.2. Comparison of bχ(3) gc and bχ(4) gc

One of the key messages is that, bχ(3) gc may be powerless for some non-null cases, even when the signals are strong and the test is relatively easy. In comparison, bχ(4) gc has successfully avoided such a pitfall. Note that translated to our terms, the EZ test by Gao and Lafferty (2017) is bχ(3) gc .

In detail, by Remark 3, λ1 is positive and has the largest absolute value among all λk s, so

k=1 (η, ξk)2λm 1 k ]/[

k=1 (η, ξk)2λm 2 k ] λ1. (14)

It follows that under H(n) 1

k=2 λm k )/[2m

k=1 λm k ]1/2, (15)

where = is achievable; see Section 3.3 for examples.

Network Global Testing by Counting Graphlets

It can be shown that PK k=1 λm k > 0, no matter m is odd or even. The numerator PK k=2 λm k , however, is more tricky.

When m is even, the numerator of (15) is positive.

When m is odd, the numerator of (15) can be negative or 0. In fact, δ(m) gc may be 0 under some H(n) 1 , even when signals are strong; see Section 3.3 for an example. If additionally P (and so G1/2PG1/2) is positive deﬁnite, then δ(m) gc is positive, either m is odd or even.

Corollary 3.1 Consider the DCMM model (1) where (7)- (9) hold and the Graphlet Count test (13). As n .

When m = 3, for some conﬁgurations of the non-null case, δ(3) gc = 0 and the power of the test is α + o(1). If additionally the K K matrix P is positive deﬁnite, then there is a constant c3 > 0 such that δ(3) gc c3 θ 3

and so the power of the test Φ c3 θ 3 zα , which tends to 1 since θ in our setting.

If m = 4, then there is a constant c4 > 0 such that δ(4) gc c4 θ 4 and the power of the test Φ c4 θ 4 zα , which tends to 1 as θ in our setting.

See the supplement for the proof. In comparison, bχ(3) gc has a two-fold disadvantage: it may lose power in some non-null conﬁgurations even when θ is large, and as θ , its power grows to 1 in a speed slower than that of bχ(4) gc .

Remark 7. In the most subtle case where θ 1, for any m 3, bχ(m) gc has a non-trivial power since the signal to noise ratio δ(m) gc θ m. The form is reminiscent of the results on likelihood ratio (Mossel et al., 2015) which is unfortunately only for SBM (much narrower than DCMM).

Remark 8. The computational complexity of the GC test is O(nd2) for m = 3 and O(nd3) for m = 4 (Schank & Wagner, 2005), where d is the maximum degree. Many large networks are reasonably sparse where d n, and the complexity is reasonably modest in such cases.

Remark 9. Our work is connected to Maugis et al. (2017), which characterizes the expected number of closed walks. However, their work does not study the expected number non-closed walks, and the standard deviations of close and non-closed walks, so how to apply their results to our setting is unclear. Our work is also closed to the notion of clustering coefﬁcient (Holland & Leinhardt, 1971; Watts & Strogatz, 1998), which in our notation equals to 3 b C3/b L2. To use this as a test, the challenge is how to normalize the statistic properly so the limiting distribution is more tractable. Also, the test may lose power in some settings. In Section 3.3, we provide an example where C3/L2 = λ1 + (PK k=2 λ3 k)/λ2 1, so when PK k=2 λ3 k = 0, asymptotically, the test is powerless while the GC test may still have good power.

0.5 0.55 0.6 0.65 0.7 0.75 b

m=3, ||θ||=6 m=4, ||θ||=6 m=3, ||θ||=8 m=4, ||θ||=8

Figure 1. Plot of δ(m) gc in the setting of Section 3.3 ((K, a) = (10, .25)). When b = .7, δ(3) gc = 0 even when θ is very large, and the EZ test is powerless. In comparison, bχ(4) gc has overcome such a pitfall. Also, bχ(4) gc is more powerful when θ is large.

3.3. An Example

It is instructive to consider an example where δ(m) gc can be further simpliﬁed. Consider a setting where

the DCMM-RDP model holds (see Remark 1), i.e., (θi/αn) i.i.d. f, where the 2nd moment of f is 1, all πi s are degenerate, dividing to K equal-size communities, the rows of P have an equal sum.

It follows that approximately: (a). θ = nαn, (b). G the K K identity matrix, so ξ1 (the ﬁrst eigenvector of G1/2PG1/2) is approximately proportional to the vector of ones 1K. (c). η = G 1/2Π Θ1n 1K, due to the random model for θi. Therefore, (η, ξk) = (ξ1, ξk), which equals to 1 if k = 1 and 0 otherwise, and so by basic linear algebra,

δ(m) gc = (2m) 1/2 q PK k=1 λm k

k=1 λm k λm 1

= PK k=2 λm k q

2m PK k=1 λm k

We now consider a setting where we can spell out λk more explicitly. Suppose K is even and the K K matrix P calibrating the community structure is a 2 2 block-wise

matrix having the form of P = D C C D

, where C, D

RK/2,K/2 and

1 a a ... ... ... a a 1

b b b ... ... ... b b b

In this case, λ1 = nα2 n K (1 a) + K

2 (a + b) , and the other (K 1) eigenvalues are (which one is λ2 depends on (a, b))

nα2 n K [(1 a) + K

2 (a b)], nα2 n K (1 a), . . . , nα2 n K (1 a).

Network Global Testing by Counting Graphlets

If we let AK(a, b) = (1 a) + K

2 (a b) and BK(a, b) = (1 a) + K

2 (a + b), recalling θ = nαn, then

δ(3) gc = (K 3

2 θ 3) [A3 K(a, b) + (K 2)(1 a)3] p

[A3 K(a, b) + B3 K(a, b) + (K 2)(1 a)3 ,

δ(4) gc = (K 2 θ 4) [A4 K(a, b) + (K 2)(1 a)4] p

A4 K(a, b) + B4 K(a, b) + (K 2)(1 a)4 .

Note that δ(4) gc is always positive, but δ(3) gc can be either positive or negative, and in particular,

δ(3) gc = 0 when b = w + (1 w)a, where w = 1+(K 2)1/3

Figure 1 compares δ(3) gc and δ(4) gc for (K, a) = (10, .25) and various (b, θ ). Regardless of θ , δ(3) gc gets small when b is close to 0.7. However, δ(4) gc has no such issue.

3.4. The Lower Bound

The lower bound is not discussed here but is studied in the extended version (Jin et al., 2018), where we show that under DCMM, if θ c for a sufﬁciently small constant, then the risk (sum of type-I and type-II errors) of any test converges to 1 as n . See also Massouli e (2014), Mossel (2015), Abbe & Sandon (2016), Gao & Lafferty (2017). Note by Corollary 3.1, if the level α 0, then the risk of the GC test 0 as θ . A closely related work is Jin and Ke (2017). Under the DCMM and the assumption of θmax Cθmin, they showed that when θ it is impossible to successfully estimate the mixed memberships.

4. Simulations

We investigate bχ(m) gc for m = 3, 4. Recall that when m = 3, it coincides with the EZ test (Gao & Lafferty, 2017). The methods (Bickel & Sarkar, 2016; Lei, 2016; Banerjee & Ma, 2017; Wang & Bickel, 2017) are for SBM, which do not apply to our settings and so we skip them for study.

Experiment 1 (checking for asymptotic normality). Fixing n = 200, we consider a null setting where θi s are from

10θi iid Pareto(4, 0.375);1 (note: severe degree heterogeneity!) We also consider an alternative case where θi s are from

2θi iid Pareto(4, 0.375), K = 3, and P is the matrix with unit diagonals and all off-diagonals are 1/3. Among the 200 nodes, 180 are pure with 60 in each community, and the remaining 20 nodes have a mixed membership (1/3, 1/3, 1/3). The results of 500 repetitions are in Figure 2, which suggest that the claimed asymptotic normality is valid even for relatively small n.

Experiment 2 (power comparison). Fix (n, K) = (300, 10). All nodes are pure with 30 in each commu-

1In Pareto(α, xm), α is for shape, xm is for scale.

Figure 2. Histograms of p

(Bn,4/8) b C 1/2 4 bχ(4) gc for the null hypothesis (light blue) and the alternative hypothesis (yellow). The blue and red curves are densities of N(0, 1) and N(δ(4) gc , 1), respectively, which support our results on asymptotical normality.

nity. For (a, b) and h > 0, let the matrix P be the same as in Section 3.3. Set θi = (h/ θ ) θi, where θi iid Pareto(4, 0.375); we note that θ = h.

For θ ranging in {5, 6, . . . , 10}, we consider three different settings for (a, b), (i)-(iii). See Table 1 for the results. In (i), (a, b) = (.15, .52) and the second eigenvalue λ2 of G1/2PG1/2 is moderately large, and bχ(4) gc uniformly outperforms bχ(3) gc (the EZ test). In (ii), (a, b) = (.30, .54), λ2 is relatively small and the testing problem is more challenging than (i). In this case, bχ(4) gc performs better again. In (iii), we investigate a case where δ(3) gc = 0 and see how bχ(3) gc deals with this most challenging case (the case poses no challenge to bχ(4) gc as δ(4) gc is always > 0; see Section 3.3). In this case, bχ(3) gc loses power, and signiﬁcantly underperforms bχ(4) gc .

These results are consistent with our theoretical results, especially those of Section 3.3.

Table 1. Powers of bχ(4) gc (GC) and bχ(3) gc (EZ).

(a, b) θ 5 6 7 8 9 10

(.15, .52) GC .13 .46 .77 .99 1.0 1.0 EZ .07 .17 .28 .64 .93 .99

(.30, .54) GC .11 .12 .27 .55 .85 .99 EZ .07 .10 .22 .25 .51 .78

(.15, .66) GC .09 .27 .72 .94 1.0 1.0 EZ .08 .05 .07 .06 .16 .43

5. Application to a Football Network

In the college football network (Girvan & Newman, 2002), each node is a Division I-A college team and two nodes have an edge if and only if they played 1 games during the Fall 2000 season. There are a total of 115 teams. Except for 5 independent teams, all teams are manually divided into 11 conferences for administration purposes; we treat these manually labeled communities as the ground truth.

First, we consider a relatively easy setting where we test whether the whole network has only 1 community or has multiple communities. As expected, for both m = 3 and 4, our test bχ(m) gc rejects the null with extremely small p-values.

Network Global Testing by Counting Graphlets

Next, we consider a more subtle problem, where for each of the 11 manually labeled communities aforementioned, we test whether it can t be further divided into multiple communities (null) or it can (alternative). The results of all 11 testing settings are in Table 2.

For the ﬁrst 10 test settings, despite some differences in the p-values, two tests, bχ(3) gc and bχ(4) gc , agree with each other and both accept the null. For the last setting (corresponding to the Western Athletic Conference (WAC)), however, bχ(4) gc votes for rejection and bχ(3) gc votes for acceptance. It turns out that one team ( Biose State ) in the WAC is an outlier, which did not play any game in the data range. After removing the outlier, both tests vote for acceptance. These results are consistent with the ground truth, suggesting (a) both tests yield reasonable testing results even for small-size networks, and (b) bχ(4) gc is more effective in detecting outliers.

Table 2. The 11 testing results (each corresponds to a conference). Column 3-4: test scores p

Bn,m/(2m) b C 1/2 m bχ(m) gc and corresponding p-values with m = 3. Column 5-6: same but for m = 4.

Conference size score p-value score p-value Atlantic Coast 9 0.00 1.00 0.00 0.50 Big East 8 0.00 1.00 0.00 0.50 Big Ten 11 -0.07 0.94 -0.31 0.62 Big Twelve 12 -0.02 0.98 -0.48 0.68 Conference USA 10 0.26 0.80 1.23 0.11 Mid-American 13 0.65 0.51 0.24 0.41 Mountain West 8 0.00 1.00 0.00 0.50 Paciﬁc Ten 10 -0.04 0.97 -0.19 0.58 Southeastern 12 -0.06 0.95 -0.40 0.65 Sun Belt 7 1.48 0.14 1.06 0.15 Western Athletic 10 0.51 0.61 2.48 0.01

6. Proof of Theorem 3.1

First, we show that for any m 3,

k=1 λm k + O θ 4 4 θ 2m 4 , (16)

and for any m 1, Bn,m+1Lm equals to

k=1 λm k (η ξk)2 + O θ 2 1 θ 4 4 θ 2m 6 . (17)

Consider (16). By deﬁnition, Bn,m Cm = Bn,m E[ b Cm] = P

i1,...,im E[Ai1i2 Aimi1] = P

i1,...,im Ωi1i2 Ωimi1, where the sum is over distinct indices i1, ..., im. As a result,

Bn,m Cm = tr(Ωm) X

non-distinct i1,...,im

Ωi1i2Ωi2i3 Ωimi1.

We calculate the term tr(Ωm). From the DCMM model, Ω= ΘΠPΠ Θ. It follows that

Ωm = ΘΠP(Π Θ2Π)P(Π Θ2Π) P(Π Θ2Π)PΠ Θ

= ΘΠ(PGPG PGP)Π Θ

= (ΘΠG 1/2)(G1/2PG1/2)m(G 1/2Π Θ).

For any matrices A and B, tr(AB) = tr(BA). As a result,

tr(Ωm) = tr (P 1/2GP 1/2)m(G 1/2Π Θ)(ΘΠG 1/2)

= tr (P 1/2GP 1/2)m = X

k λm k . (18)

We then bound the remainder term. Note that Ωij = θiθj(π i Pπj) Cθiθj, where the last inequality is from Condition (9). Hence, X

non-distinct i1,...,im

Ωi1i2Ωi2i3 Ωimi1 X

non-distinct i1,...,im

Cθ2 i1θ2 i2 θ2 im

i1,...,im 1 Cθ4 i1θ2 i2 θ2 im 1 C θ 4 4 θ 2(m 2).

Combining the above gives (16).

Consider (17). Similarly, we have

Bn,m+1Lm = 1 nΩm1n X

non-distinct i1,...,im+1

Ωi1i2Ωi2i3 Ωimim+1.

Since Ω= ΘΠPΠ Θ, it follows that

1 nΩm1n = 1 nΘΠP(Π Θ2Π)P(Π Θ2Π) PΠ Θ1n = 1 nΘΠ(PGPG P)Π Θ1n

= 1 nΘΠG 1/2(G1/2PG1/2)m G 1/2Π Θ1n

= η (G1/2PG1/2)mη.

The eigen-decomposition G1/2PG1/2 = PK k=1 λkξkξ k implies that (G1/2PG1/2)m = P k=1 λm k ξkξ k. As a result,

1 nΩm1n = η K X

k=1 λm k ξkξ k η =

k=1 λm k (η ξk)2. (19)

We then bound the remainder term. Since Ωij Cθiθj, Ωi1i2 Ωimim+1 Cθi1θim+1θ2 i2 θ2 im. It follows that

non-distinct i1,...,im+1

Ωi1i2Ωi2i3 Ωimim+1

i1=im+1 + X

i2=im+1 + X

non-distinct i2,...,im

Cθi1θim+1θ2 i2 θ2 im

i1,...,im Cθ2 i1 θ2 im + X

i1,...,im Cθi1θ3 i2θ2 i3 θ2 im

i1,...,im 1,im+1 θi1θim+1θ4 i2θ2 i3 θ2 im 1

C h θ 2m + θ 1 θ 3 3 θ 2(m 2) + θ 2 1 θ 4 4 θ 2(m 3)i

Network Global Testing by Counting Graphlets

C θ 2(m 3) θ 6 + θ 1 θ 3 3 θ 2 + θ 2 1 θ 4 4 .

We need to compare the three terms in the brackets. First, applying Holder s inequality with p = 3 and q = 3/2,

we have P i θ2 i = P i θ

2 3 i (P i θ

3 i ) 1 p (P i θ

3 i ) 1 q . It implies θ 2 θ 4/3 4 θ 2/3 1 . As a result,

θ 6 θ 4 4 θ 2 1.

This means the ﬁrst term above is dominated by the last term. Second, by Cauchy-Schwartz inequality, P

i θ2 i )1/2(P

i θ4 i )1/2, which means θ 3 3 θ θ 2 4. As a result, θ 1 θ 3 3 θ 2 θ 1 θ 2 4 θ 3. Furthermore, we have proved θ 3 θ 2 4 θ 1. Combining the above gives θ 1 θ 3 3 θ 2 θ 2 1 θ 4 4. Hence, the second term is dominated by the last term. In summary, the remainder term is O( θ 2 1 θ 4 4 θ 2(m 3)). This proves (17).

Next, we use (16)-(17) to show the claim. Write for short

χ(m) gc,0 = 1 Bn,m

k(η, ξk)2λm 1 k P

k(η, ξk)2λm 2 k

Write e Cm = Bn,m Cm, e C0 m = tr(Ωm), e Lm = Bn,m+1Lm, and e L0 m = 1 nΩm1n, for all m. By deﬁnition and (18)-(19),

Bn,mχ(m) gc,0 = e C0 m (e L0 m 1/e L0 m 2)m,

Bn,mχ(m) gc = Bn,m[Cm (Lm 1/Lm 2)m]

= e Cm (Bn,m 1)m

(Bn,m)m 1 (e Lm 1/e Lm 2)m.

As a result,

Bn,m|χ(m) gc χ(m) gc,0|

| e Cm e C0 m| + (e Lm 1)m

(e Lm 2)m (e L0 m 1)m

(e L0 m 2)m

+ (e Lm 1)m

(e Lm 2)m (Bn,m 1)m

(Bn,m)m 1 1

I1 + I2 + I3. (20)

We now bound these three terms. By (16)-(17),

| e Cm e C0 m| = O( θ 4 4 θ 2m 4),

|e Lm e L0 m| = O( θ 2 1 θ 4 4 θ 2m 6). (21)

Hence, I1 = O( θ 4 4 θ 2m 4).

To bound I2, we need the following lemma, which is proved in the supplemental material.

Lemma 6.1 Under conditions of Theorem 3.3, |λk| θ 2

for 1 k K, and max1 k K |η ξk| θ 1 θ 1.

By (19), e L0 m = P

k(η ξk)2λm k . For m even, it then follows from Lemma 6.1 that

e L0 m c θ 2m 2 θ 2 1 (22)

for a constant c > 0. For m odd, this is still true; see the proof of Lemma B.1 in the supplemental material. Additionally, by Cauchy-Schwarz inequality, θ 4 n θ 4 4; hence, for m 3, |e Lm e L0 m| is negligible compared to the order of e L0 m, so we also have e Lm c θ 2m 2 θ 2 1. Now, e L0 m 1 e L0 m 2 e Lm 1 e Lm 2

|e Lm 1 e L0 m 1| e L0 m 2 + e Lm 1 e Lm 2

|e Lm 2 e L0 m 2| e L0 m 2

= O θ 2 1 θ 4 4 θ 2m 8

θ 2m 6 θ 2 1

+ O θ 2 θ 2 1 θ 4 4 θ 2m 10

θ 2m 6 θ 2 1

= O( θ 4 4 θ 2).

Since |xm ym| C|x y|(|x| + |y|)m 1,

I2 C e L0 m 1 e L0 m 2 e Lm 1 e Lm 2

e L0 m 1 e L0 m 2

= O( θ 4 4 θ 2 θ 2m 2) = O( θ 4 4 θ 2m 4).

Consider I3. Note that (Bn,m 1)m

(Bn,m)m 1 = n(n 1) (n m+2)

(n m+1)m 1 = Qm 1 j=1 (1 + j n m+1) = 1 + O( 1

I3 = O( θ 2m n 1) = O( θ 4 4 θ 2m 4),

where we have used the universal inequality θ 4 n θ 4 4.

We plug the above results into (20) and note that Bn,m nm. The claim follows immediately.

7. Conclusion

We consider a hard testing problem in the rather general DCMM model where the challenge is severe degree heterogeneity. We discover a systematic way to cancel the effects of degree heterogeneity, and propose a family of tests, with careful analysis and numerical support. Compared to literature, our tests have competitive powers and are applicable in much broader settings. Our theory is also for very broad settings where existing works have very limited understanding. We point out an unappealing feature of the EZ test (Gao & Lafferty, 2017), and shows a new test in our family has successfully overcome the problem that the EZ test faces.

In our theorems, we require K to be ﬁxed, but the results continue to hold if K reasonably slowly. We also require the singular values of P are in the same order, but this is mostly for simplicity in presentation and can be replaced by weaker conditions. We also assume θ 3 0. The case θ 3 is related to the dense network case, the analysis of which can be done but is different and we leave it as future work.

Abbe, E. and Sandon, C. Achieving the KS threshold in the general stochastic block model with linearized acyclic

Network Global Testing by Counting Graphlets

belief propagation. In Advances in Neural Information Processing Systems, pp. 1334 1342, 2016.

Airoldi, E., Blei, D., Fienberg, S., and Xing, E. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981 2014, 2008.

Banerjee, D. and Ma, Z. Optimal hypothesis testing for stochastic block models with growing degrees. ar Xiv:1705.05305, 2017.

Bickel, P. J. and Sarkar, P. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78 (1):253 273, 2016.

Bubeck, S., Ding, J., Eldan, R., and R acz, M. Z. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms, 49(3):503 532, 2016.

Chen, K. and Lei, J. Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association, pp. 1 11, 2017.

Chen, Y., Li, X., and Xu, J. Convexiﬁed modularity maximization for degree-corrected stochastic block models. The Annals of Statistics, to appear, 2018.

Gao, C. and Lafferty, J. Testing for global network structure using small subgraph statistics. ar Xiv:1710.00862, 2017.

Girvan, M. and Newman, M. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821 7826, 2002.

Holland, P. W. and Leinhardt, S. Transitivity in structural models of small groups. Comparative Group Studies, 2 (2):107 124, 1971.

Horn, R. and Johnson, C. Matrix Analysis. Cambridge University Press, 1985.

Jin, J. Fast community detection by SCORE. The Annals of Statistics, 43(1):57 89, 2015.

Jin, J. and Ke, Z. T. A sharp lower bound for mixedmembership estimation. ar Xiv:1709.05603, 2017.

Jin, J., Ke, Z. T., and Luo, S. Estimating network memberships by simplex vertices hunting. ar Xiv:1708.07852, 2017.

Jin, J., Ke, Z. T., and Luo, S. Network global testing by counting graphlets (extended version). Manuscript, 2018.

Karrer, B. and Newman, M. Stochastic blockmodels and community structure in networks. Physical Review E, 83 (1):016107, 2011.

Le, C. M. and Levina, E. Estimating the number of communities in networks by spectral methods. ar Xiv:1507.00827, 2015.

Lei, J. A goodness-of-ﬁt test for stochastic block models. The Annals of Statistics, 44(1):401 424, 2016.

Massouli e, L. Community detection thresholds and the weak ramanujan property. In Proceedings of the fortysixth annual ACM symposium on Theory of computing, pp. 694 703. ACM, 2014.

Maugis, P.-A. G., Olhede, S. C., and Wolfe, P. J. Topology reveals universal features for network comparison. 1705.05677, 2017.

Mossel, E., Neeman, J., and Sly, A. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431 461, 2015.

Qin, T. and Rohe, K. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pp. 3120 3128, 2013.

Saldana, D., Yu, Y., and Feng, Y. How many communities are there? Journal of Computational and Graphical Statistics, 26(1):171 181, 2017.

Schank, T. and Wagner, D. Approximating clustering coefﬁcient and transitivity. Journal of Graph Algorithms and Applications, 9(2):265 275, 2005.

Wang, Y. R. and Bickel, P. J. Likelihood-based model selection for stochastic block models. The Annals of Statistics, 45(2):500 528, 2017.

Watts, D. J. and Strogatz, S. H. Collective dynamics of small-world networks. Nature, 393, 1998.