Graph Clustering: Block-models and model free results

Yali Wan
Department of Statistics, University of Washington
Seattle, WA 98195-4322, USA
yaliwan@washington.edu

Marina Meilă
Department of Statistics, University of Washington
Seattle, WA 98195-4322, USA
mmp@stat.washington.edu

Clustering graphs under the Stochastic Block Model (SBM) and its extensions is well studied. Guarantees of correctness exist under the assumption that the data is sampled from a model. In this paper, we propose a framework in which we obtain correctness guarantees without assuming the data comes from a model. The guarantees we obtain depend instead on statistics of the data that can be checked. We also show that this framework ties in with the existing model-based framework, and that we can exploit results in model-based recovery, as well as strengthen the results existing in that area of research.

1 Introduction: a framework for clustering with guarantees without model assumptions

In the last few years, model-based clustering in networks has witnessed spectacular progress. At the center of attention are the so-called block-models: the Stochastic Block Model (SBM), the Degree-Corrected SBM (DC-SBM) and the Preference Frame Model (PFM). The understanding of these models has advanced, especially regarding the conditions under which recovery of the true clustering is possible with small or no error. The algorithms for recovery with guarantees have also been improved. However, the impact of the above results is limited by the assumption that the observed data comes from the model.

This paper proposes a framework that provides theoretical guarantees for the results of model-based clustering algorithms, without making any assumption about the data generating process. To describe the idea, we need some notation. Assume that a graph G on n nodes is observed. A model-based algorithm clusters G, and outputs a clustering C and parameters M(G, C). The framework is as follows: if M(G, C) fits the data G well, then we shall prove that any other clustering C' of G that also fits G well must be a small perturbation of C. If this holds, then C with model parameters M(G, C) can be said to capture the data structure in a meaningful way. We exemplify our approach by obtaining model-free guarantees for the SBM and PFM models. Moreover, we show that model-free and model-based results are intimately connected.

2 Background: graphs, clusterings and block models

Graphs, degrees, Laplacian, and clustering. Let G be a graph on n nodes, described by its adjacency matrix $\hat A$. Define $\hat d_i = \sum_{j=1}^n \hat A_{ij}$, the degree of node i, and $\hat D = \mathrm{diag}\{\hat d_i\}$ the diagonal matrix of node degrees. The (normalized) Laplacian of G is defined as $\hat L = \hat D^{-1/2}\hat A\hat D^{-1/2}$ (rigorously speaking, the normalized graph Laplacian is $I - \hat L$ [10]). By extension, we define the degree matrix D and the Laplacian L associated to any matrix $A \in \mathbb{R}^{n\times n}$ with $A_{ij} = A_{ji} \ge 0$ in a similar way.

Let C be a partitioning (clustering) of the nodes of G into K clusters. We use the shorthand notation $i \in k$ for "node i belongs to cluster k". We represent C by its $n \times K$ indicator matrix Z, defined by

$$Z_{ik} = \begin{cases} 1 & \text{if } i \in k,\\ 0 & \text{otherwise,}\end{cases} \qquad i = 1, \dots, n,\; k = 1, \dots, K. \qquad (1)$$

Note that $Z^T Z = \mathrm{diag}\{n_k\}$ with $n_k$ counting the number of nodes in cluster k, and $Z^T \hat A Z = [n_{kl}]_{k,l=1}^K$ with $n_{kl}$ counting the edges in G between clusters k and l.
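To make these quantities concrete, here is a minimal numpy sketch (illustrative only; the random graph and the cluster labels are placeholders) that builds $\hat D$, $\hat L$, Z and the counting matrices above.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy symmetric 0/1 adjacency matrix \hat A on n nodes (placeholder data).
n, K = 12, 3
A_hat = rng.integers(0, 2, size=(n, n))
A_hat = np.triu(A_hat, 1)
A_hat = A_hat + A_hat.T                        # symmetric, zero diagonal

d_hat = A_hat.sum(axis=1)                      # degrees \hat d_i
D_hat = np.diag(d_hat)                         # \hat D
# Normalized Laplacian \hat L = \hat D^{-1/2} \hat A \hat D^{-1/2}
# (assumes every node has positive degree).
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_hat))
L_hat = D_inv_sqrt @ A_hat @ D_inv_sqrt

# Indicator matrix Z of an arbitrary clustering C with K clusters, eq. (1).
labels = rng.integers(0, K, size=n)            # cluster label of each node (placeholder)
Z = np.zeros((n, K))
Z[np.arange(n), labels] = 1

print(Z.T @ Z)          # diag{n_k}: cluster sizes
print(Z.T @ A_hat @ Z)  # [n_kl]: edge counts between clusters
print(Z.T @ D_hat @ Z)  # diagonal matrix of cluster volumes
```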
Moreover, for two indicator matrices Z, Z' of clusterings C, C', the entry $(Z^T Z')_{kk'}$ counts the number of points in the intersection of cluster k of C with cluster k' of C', and $(Z^T \hat D Z')_{kk'} = \sum_{i \in k \cap k'} \hat d_i$ is the volume of the same intersection.

Block models for random graphs (SBM, DC-SBM, PFM). This family of models contains the Stochastic Block Model (SBM) [1, 18], the Degree-Corrected SBM (DC-SBM) [17] and the Preference Frame Model (PFM) [20]. Under each of these model families, a graph G with adjacency matrix $\hat A$ over n nodes is generated by sampling its edges independently following the law $\hat A_{ij} \sim \mathrm{Bernoulli}(A_{ij})$, for all $i > j$. The symmetric matrix $A = [A_{ij}]$ describing the graph is the edge probability matrix. The three model families differ in the constraints they put on an acceptable A. Let C be a clustering. The entries of A are defined w.r.t. C as follows (and we say that A is compatible with C).

SBM: $A_{ij} = B_{kl}$ whenever $i \in k$, $j \in l$, with $B = [B_{kl}] \in \mathbb{R}^{K\times K}$ symmetric and non-negative.

DC-SBM: $A_{ij} = w_i w_j B_{kl}$ whenever $i \in k$, $j \in l$, with B as above and $w_1, \dots, w_n$ non-negative weights associated with the graph nodes.

PFM: A satisfies $D = \mathrm{diag}(A\mathbf{1})$ and $D^{-1} A Z = Z R$, where $\mathbf{1}$ denotes the vector of all ones, Z is the indicator matrix of C, and R is a stochastic matrix ($R\mathbf{1} = \mathbf{1}$, $R_{kl} \ge 0$); the details are in [20].

While perhaps not immediately obvious, the SBM is a subclass of the DC-SBM, and the latter a subclass of the PFM. Another common feature of block-models, which will be significant throughout this work, is that for all three, Spectral Clustering algorithms [15] have been proved to work well at estimating C.

3 Main theorem: blueprint and results for PFM, SBM

Let $\mathcal{M}$ be a model class, such as SBM, DC-SBM, PFM, and denote by $M(G, C) \in \mathcal{M}$ a model that is compatible with C and is fitted in some way to the graph G (we do not assume in general that this fit is optimal).

Theorem 1 (Generic Theorem) We say that clustering C fits G well w.r.t. $\mathcal{M}$ iff M(G, C) is close to G. If C fits G well w.r.t. $\mathcal{M}$, then (subject to other technical conditions) any other clustering C' which also fits G well is close to C, i.e. dist(C, C') is small.

In what follows, we will instantiate this Generic Theorem and the concepts therein; in particular, the following will be formally defined. (1) Model construction, i.e. an algorithm to fit a model in $\mathcal{M}$ to (G, C). This is necessary since we want our results to be computable in practice. (2) A goodness of fit measure between M(G, C) and the data G. (3) A distance between clusterings. We adopt the widely used Misclassification Error (or Hamming) distance defined below.

The Misclassification Error (ME) distance between two clusterings C, C' over the same set of n points is

$$\mathrm{dist}(C, C') = 1 - \frac{1}{n}\max_{\pi \in S_K}\sum_{k=1}^{K} |k \cap \pi(k)|, \qquad (2)$$

where $\pi$ ranges over the permutations $S_K$ of K elements, and $\pi(k)$ indexes a cluster in C'. If the points are weighted by their degrees, a natural measure on the node set, the weighted ME (wME) distance is

$$\mathrm{dist}_{\hat d}(C, C') = 1 - \frac{1}{\sum_{i=1}^n \hat d_i}\;\max_{\pi \in S_K}\;\sum_{k=1}^{K}\;\sum_{i \in k \cap \pi(k)} \hat d_i. \qquad (3)$$

In the above, $\sum_{i \in k \cap k'} \hat d_i$ represents the total weight of the set of points assigned to cluster k by C and to cluster k' (or $\pi(k)$) by C'. Note that in the indicator matrix representation of clusterings, this is the (k, k') element of the matrix $Z^T \hat D Z' \in \mathbb{R}^{K\times K}$. While dist is more popular, we believe $\mathrm{dist}_{\hat d}$ is more natural, especially when the node degrees are dissimilar, as $\hat d$ can be seen as a natural measure on the set of nodes, and $\mathrm{dist}_{\hat d}$ is equivalent to the earth-mover's distance.
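Both distances amount to a maximum-weight matching between the clusters of C and C'; for small K they can be computed by brute force over the $K!$ permutations. A minimal Python sketch (illustrative only; the example labels and weights are placeholders) based directly on (2) and (3):

```python
from itertools import permutations
import numpy as np

def me_distance(labels1, labels2, K, weights=None):
    """Misclassification Error distance (2); weighted version (3) if
    per-node weights (e.g. degrees) are given. Brute force over S_K,
    so only practical for small K."""
    n = len(labels1)
    w = np.ones(n) if weights is None else np.asarray(weights, float)
    # confusion[k, k'] = total weight of points in cluster k of C and k' of C'
    confusion = np.zeros((K, K))
    for i in range(n):
        confusion[labels1[i], labels2[i]] += w[i]
    best = max(sum(confusion[k, pi[k]] for k in range(K))
               for pi in permutations(range(K)))
    return 1.0 - best / w.sum()

# Example: two clusterings of 6 points into K = 2 clusters.
C1 = [0, 0, 0, 1, 1, 1]
C2 = [1, 1, 0, 0, 0, 0]
print(me_distance(C1, C2, K=2))                              # unweighted ME distance
print(me_distance(C1, C2, K=2, weights=[5, 5, 1, 1, 1, 1]))  # weighted ME distance
```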
3.1 Main result for PFM

Constructing a model. Given a graph G and a clustering C of its nodes, we wish to construct a PFM compatible with C whose Laplacian L makes $\|\hat L - L\|$ small. Let the spectral decomposition of $\hat L$ be

$$\hat L = [\hat Y\;\; \hat Y_{\mathrm{low}}]\begin{bmatrix}\hat\Lambda & 0\\ 0 & \hat\Lambda_{\mathrm{low}}\end{bmatrix}\begin{bmatrix}\hat Y^T\\ \hat Y_{\mathrm{low}}^T\end{bmatrix} = \hat Y \hat\Lambda \hat Y^T + \hat Y_{\mathrm{low}}\hat\Lambda_{\mathrm{low}}\hat Y_{\mathrm{low}}^T, \qquad (4)$$

where $\hat Y \in \mathbb{R}^{n\times K}$, $\hat Y_{\mathrm{low}} \in \mathbb{R}^{n\times(n-K)}$, $\hat\Lambda = \mathrm{diag}(\hat\lambda_1, \dots, \hat\lambda_K)$, $\hat\Lambda_{\mathrm{low}} = \mathrm{diag}(\hat\lambda_{K+1}, \dots, \hat\lambda_n)$. To ensure that the matrices $\hat Y$, $\hat Y_{\mathrm{low}}$ are uniquely defined, we assume throughout the paper that $\hat L$'s K-th eigengap, i.e. $|\hat\lambda_K| - |\hat\lambda_{K+1}|$, is non-zero.

Assumption 1 The eigenvalues of $\hat L$ satisfy $\hat\lambda_1 = 1 \ge |\hat\lambda_2| \ge \dots \ge |\hat\lambda_K| > |\hat\lambda_{K+1}| \ge \dots \ge |\hat\lambda_n|$.

Denote by $\mathcal{R}(M)$ the subspace spanned by the columns of a matrix M, and by $\|\cdot\|$ the Euclidean (spectral) norm.

PFM Estimation Algorithm
Input: graph G with $\hat A$, $\hat D$, $\hat L$, $\hat Y$, $\hat\Lambda$; clustering C with indicator matrix Z.
Output: $(A, L) = \mathrm{PFM}(G, C)$

1. Construct an orthogonal matrix derived from Z:
$$Y_Z = \hat D^{1/2} Z C^{-1/2}, \quad \text{with } C = Z^T \hat D Z \text{ the column normalization of } Z; \qquad (5)$$
note that $C_{kk} = \sum_{i\in k}\hat d_i$ is the volume of cluster k.

2. Project $Y_Z$ on $\hat Y$ and perform a Singular Value Decomposition:
$$F = Y_Z^T \hat Y = U \Sigma V^T. \qquad (6)$$

3. Change basis in $\mathcal{R}(Y_Z)$ to align with $\hat Y$:
$$Y = Y_Z U V^T; \quad \text{complete } Y \text{ to an orthonormal basis } [Y\;B] \text{ of } \mathbb{R}^n. \qquad (7)$$

4. Construct the Laplacian L and the edge probability matrix A:
$$L = Y \hat\Lambda Y^T + (BB^T)\hat L(BB^T), \qquad A = \hat D^{1/2} L \hat D^{1/2}. \qquad (8)$$

Proposition 2 Let G, $\hat A$, $\hat D$, $\hat L$, $\hat Y$, $\hat\Lambda$ and Z be defined as above, and $(A, L) = \mathrm{PFM}(G, C)$. Then, 1. $\hat D$ and L (or A) define a PFM with degrees $\hat d_{1:n}$. 2. The columns of Y are eigenvectors of L with eigenvalues $\hat\lambda_{1:K}$. 3. $\hat D^{1/2}\mathbf{1}$ is an eigenvector of both L and $\hat L$ with eigenvalue $\hat\lambda_1 = 1$.

The proof is relegated to the Supplement, as are all the omitted proofs. PFM(G, C) is an estimator of the PFM parameters given the clustering. It is evidently not the Maximum Likelihood estimator, but we can show that it is consistent in the following sense.

Proposition 3 (Informal) Assume that G is sampled from a PFM with parameters $D^*$, $L^*$ compatible with $C^*$, and let $L = \mathrm{PFM}(G, C^*)$. Then, under standard recovery conditions for the PFM (e.g. [20]), $\|L - L^*\| = o(1)$ w.r.t. n.

Assumption 2 (Goodness of fit for PFM) $\|\hat L - L\| \le \varepsilon$.

PFM(G, C) instantiates M(G, C), and Assumption 2 instantiates the goodness of fit measure. It remains to prove an instance of Generic Theorem 1 for these choices.

Theorem 4 (Main Result (PFM)) Let G be a graph with $\hat d_{1:n}$, $\hat D$, $\hat L$, $\hat\lambda_{1:n}$ as defined, with $\hat L$ satisfying Assumption 1. Let C, C' be two clusterings with K clusters, and let L, L' be their corresponding Laplacians, defined as in (8), each satisfying Assumption 2. Set

$$\delta = \frac{(K-1)\varepsilon^2}{(|\hat\lambda_K| - |\hat\lambda_{K+1}|)^2} \quad\text{and}\quad \delta_0 = \frac{\min_k C_{kk}}{\max_k C_{kk}},$$

with C defined as in (5), where k indexes the clusters of C. Then, whenever $\delta \le \delta_0$,

$$\mathrm{dist}_{\hat d}(C, C') \le \frac{\max_k C_{kk}}{\sum_k C_{kk}}\,\delta, \qquad (9)$$

with $\mathrm{dist}_{\hat d}$ the weighted ME distance (3).

In the remainder of this section we outline the proof steps, while the partial results of Propositions 5, 6, 7 are proved in the Supplement. First, we apply the perturbation bound known as the sin Θ theorem of Davis and Kahan, in the form presented in Chapter V of [19].

Proposition 5 Let $\hat Y$, $\hat\lambda_{1:n}$, Y be defined as usual. If Assumptions 1 and 2 hold, then

$$\|\mathrm{diag}(\sin\theta_{1:K}(\hat Y, Y))\| \le \frac{\varepsilon}{|\hat\lambda_K| - |\hat\lambda_{K+1}|} = \varepsilon', \qquad (10)$$

where $\theta_{1:K}$ are the canonical (or principal) angles between $\mathcal{R}(\hat Y)$ and $\mathcal{R}(Y)$ (see e.g. [8]).

The next step concerns the closeness of Y and $\hat Y$ in Frobenius norm. Since Proposition 5 bounds the sines of the canonical angles, we exploit the fact that the cosines of the same angles are the singular values of $F = Y_Z^T \hat Y$ of (6).
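This fact, and the identities behind Proposition 6 below, can be checked numerically. The following minimal numpy sketch (illustrative only; random orthonormal bases stand in for Y and $\hat Y$) verifies that $\|F\|_F^2 = \mathrm{trace}(M\hat M) = K - \sum_i \sin^2\theta_i$ and that $\|M - \hat M\|_F^2 = 2\sum_i \sin^2\theta_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 4

# Two random orthonormal bases of K-dimensional subspaces of R^n,
# standing in for Y (resp. Y_Z) and \hat Y.
Y, _ = np.linalg.qr(rng.standard_normal((n, K)))
Y_hat, _ = np.linalg.qr(rng.standard_normal((n, K)))

F = Y.T @ Y_hat
cosines = np.linalg.svd(F, compute_uv=False)   # cosines of the principal angles
sines2 = 1.0 - cosines**2                      # sin^2 of the same angles

M, M_hat = Y @ Y.T, Y_hat @ Y_hat.T
# ||F||_F^2 = trace(M M_hat) = sum cos^2 = K - sum sin^2   (cf. Proposition 6, part 1)
print(np.allclose(np.linalg.norm(F, 'fro')**2, np.trace(M @ M_hat)))
print(np.allclose(np.linalg.norm(F, 'fro')**2, K - sines2.sum()))
# ||M - M_hat||_F^2 = 2 sum sin^2                          (cf. Proposition 6, part 2)
print(np.allclose(np.linalg.norm(M - M_hat, 'fro')**2, 2 * sines2.sum()))
```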
Proposition 6 Let $M = YY^T$, $\hat M = \hat Y\hat Y^T$ and F, $\varepsilon'$ as above. Assumptions 1 and 2 imply that 1. $\|F\|_F^2 = \mathrm{trace}(M\hat M^T) \ge K - (K-1)\varepsilon'^2$. 2. $\|M - \hat M\|_F^2 \le 2(K-1)\varepsilon'^2$.

Now we show that all clusterings which satisfy Proposition 6 must be close to each other in the weighted ME distance. For this, we first need an intermediate result. Assume we have two clusterings C, C' with K clusters, for which we construct $Y_Z, Y, L, M$, respectively $Y'_{Z'}, Y', L', M'$, as above. Then the subspaces spanned by Y and Y' will be close.

Proposition 7 Let $\hat L$ satisfy Assumption 1 and let C, C' be two clusterings for which L, L' satisfy Assumption 2. Then $\|Y_Z^T Y'_{Z'}\|_F^2 \ge K - 4(K-1)\varepsilon'^2 = K - 4\delta$.

The main result now follows from Proposition 7 and Theorem 9 of [13], as shown in the Supplement. This proof approach is different from the existing perturbation bounds for clustering, which all use counting arguments. The result of [13] is a local equivalence, which bounds the error we need in terms of the $\delta$ defined above ("local" meaning that the result only holds for small $\delta$).

3.2 Main Theorem for SBM

In this section, we offer an instantiation of Generic Theorem 1 for the case of the SBM. As before, we start with a model estimator, which in this case is the Maximum Likelihood estimator.

SBM Estimation Algorithm
Input: graph G with $\hat A$; clustering C with indicator matrix Z.
Output: $A = \mathrm{SBM}(G, C)$

1. Construct an orthogonal matrix derived from Z: $Y_Z = Z C^{-1/2}$ with $C = Z^T Z$.
2. Estimate the edge probabilities: $B = C^{-1} Z^T \hat A Z C^{-1}$.
3. Construct A from B by $A = Z B Z^T$.

Proposition 8 Let $\tilde B = C^{1/2} B C^{1/2}$ and denote the eigenvalues of $\tilde B$, ordered by decreasing magnitude, by $\lambda_{1:K}$. Let the spectral decomposition of $\tilde B$ be $\tilde B = U \Lambda U^T$, with U an orthogonal matrix and $\Lambda = \mathrm{diag}(\lambda_{1:K})$. Then 1. A is an SBM. 2. $\lambda_{1:K}$ are the K principal eigenvalues of A; the remaining eigenvalues of A are zero. 3. $A = Y \Lambda Y^T$ where $Y = Y_Z U$.

Assumption 3 (Eigengap) $\tilde B$ is non-singular (or, equivalently, $|\lambda_K| > 0$).

Assumption 4 (Goodness of fit for SBM) $\|\hat A - A\| \le \varepsilon$.

With the model (SBM), estimator, and goodness of fit defined, we are ready for the main result.

Theorem 9 (Main Result (SBM)) Let G be a graph with adjacency matrix $\hat A$, and let $\hat\lambda^A_K$ be the K-th singular value of $\hat A$. Let C, C' be two clusterings with K clusters, each satisfying Assumptions 3 and 4. Set $\delta = \frac{4K\varepsilon^2}{|\hat\lambda^A_K|^2}$ and $\delta_0 = \min_k n_k / \max_k n_k$, where k indexes the clusters of C. Then, whenever $\delta \le \delta_0$, $\mathrm{dist}(C, C') \le \delta \max_k n_k / n$, where dist is the ME distance (2).

Note that the eigengap of $\hat A$, $\hat\lambda^A_K$, is not bounded above, and neither is $\varepsilon$. Since the SBM is less flexible than the PFM, we expect that for the same data G, Theorem 9 will be more restrictive than Theorem 4.

4 The results in perspective

4.1 Cluster validation

Theorems like 4 and 9 can provide model-free guarantees for clustering. We exemplify this procedure in the experimental Section 6, using standard spectral clustering as described in e.g. [18, 17, 15]. What is essential is that all the quantities, such as $\varepsilon$ and $\delta$, are computable from the data. Moreover, if Y is available, then the bound in Theorem 4 can be improved.

Proposition 10 Theorem 4 holds when $\delta$ is replaced by $\delta_Y = K - \langle\hat M, M\rangle_F + (K-1)(\varepsilon')^2 + 2\sqrt{2(K-1)}\,\varepsilon'\,\|\hat M - M\|_F$, with $\varepsilon' = \varepsilon/(|\hat\lambda_K| - |\hat\lambda_{K+1}|)$ and $M$, $\hat M$ defined in Proposition 6.
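To make the validation procedure concrete, the following is a minimal numpy sketch (illustrative only, not optimized; it assumes every cluster is non-empty and every node has positive degree) that fits $L = \mathrm{PFM}(G, C)$ following the Estimation Algorithm of Section 3.1 and returns the quantities $\varepsilon$, $\delta$, $\delta_0$ of Theorem 4.

```python
import numpy as np

def pfm_validate(A_hat, labels, K):
    """Fit L = PFM(G, C) per Section 3.1 and evaluate the quantities of Theorem 4.
    A_hat: symmetric nonnegative adjacency/similarity matrix; labels: cluster of
    each node in {0, ..., K-1}. Returns (eps, delta, delta0)."""
    n = A_hat.shape[0]
    d = A_hat.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(d))
    L_hat = D_is @ A_hat @ D_is

    # Eigendecomposition of \hat L; keep the K eigenvalues largest in magnitude.
    vals, vecs = np.linalg.eigh(L_hat)
    order = np.argsort(-np.abs(vals))
    lam, Y_hat = vals[order[:K]], vecs[:, order[:K]]
    gap = np.abs(lam[-1]) - np.abs(vals[order[K]])   # |lambda_K| - |lambda_{K+1}|

    # Steps 1-4 of the PFM Estimation Algorithm.
    Z = np.zeros((n, K)); Z[np.arange(n), labels] = 1
    C = Z.T @ np.diag(d) @ Z                         # diagonal matrix of cluster volumes (5)
    Y_Z = np.diag(np.sqrt(d)) @ Z @ np.diag(1.0 / np.sqrt(np.diag(C)))
    U, _, Vt = np.linalg.svd(Y_Z.T @ Y_hat)          # (6)
    Y = Y_Z @ U @ Vt                                 # (7)
    P = np.eye(n) - Y @ Y.T                          # projector B B^T onto the complement
    L = Y @ np.diag(lam) @ Y.T + P @ L_hat @ P       # (8)

    eps = np.linalg.norm(L_hat - L, 2)               # Assumption 2
    delta = (K - 1) * eps**2 / gap**2                # Theorem 4
    delta0 = np.diag(C).min() / np.diag(C).max()
    return eps, delta, delta0

# Usage: if delta <= delta0, Theorem 4 guarantees dist_dhat(C, C') <=
# delta * max_k C_kk / sum_k C_kk for any C' that also satisfies Assumption 2.
```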
4.2 Using existing model-based recovery theorems to prove model-free guarantees

We exemplify this by using (the proof of) Theorem 3 of [20] to prove the following.

Theorem 11 (Alternative result based on [20] for PFM) Under the same conditions as in Theorem 4, $\mathrm{dist}_{\hat d}(C, C') \le \delta_{WM}$, with $\delta_{WM} = \frac{128\,K\varepsilon^2}{(|\hat\lambda_K| - |\hat\lambda_{K+1}|)^2}$.

It follows, too, that with the techniques in this paper, the error bound in [20] can be improved by a factor of 128. Similarly, if we use the results of [18] we obtain an alternative model-free guarantee for the SBM.

Assumption 5 (Alternative goodness of fit for SBM) $\|\hat L^2 - L^2\|_F \le \varepsilon$, where $\hat L$, L are the Laplacians of $\hat A$ and of $A = \mathrm{SBM}(G, C)$, respectively.

Theorem 12 (Alternative result based on [18] for SBM) Under the same conditions as in Theorem 9, except for replacing Assumption 4 with Assumption 5, $\mathrm{dist}(C, C') \le \delta_{RCY}$ with $\delta_{RCY} = \frac{16\,\varepsilon^2\max_k n_k}{|\hat\lambda_K|^4}$.

A problem with this result is that Assumption 5 is much stronger than Assumption 4 (being in Frobenius norm). The more recent results of [17] (with unspecified constants), in conjunction with our original Assumptions 3 and 4 and the assumption that all clusters have equal sizes, give a bound of $O(K\varepsilon^2/\hat\lambda_K^2)$ for the SBM; hence our model-free Theorem 9 matches this more restrictive model-based theorem.

4.3 Sanity checks and Extensions

It can easily be verified that if indeed G is sampled from an SBM, or a PFM, then for large enough n and large enough model eigengap, Assumptions 1 and 2 (respectively 3 and 4) will hold. Some immediate extensions and variations of Theorems 4 and 9 are possible. For example, one could replace the spectral norm by the Frobenius norm in Assumptions 2 and 4, which would simplify some of the proofs. However, using the Frobenius norm would be a much stronger assumption [18]. Theorem 4 holds not just for simple graphs, but also in the more general case when $\hat A$ is a weighted graph (i.e. a similarity matrix). The theorems can also be extended to cover the case when C' is a clustering that is $\alpha$-worse than C, i.e. when $\|L' - \hat L\| \le (1 + \alpha)\|L - \hat L\|$.

4.4 Clusterability and resilience

Our theorems also imply the stability of a clustering to perturbations of the graph G. Indeed, let $\hat L'$ be the Laplacian of $G'$, a perturbation of G. If $\|\hat L' - \hat L\| \le \varepsilon$, then $\|\hat L' - L\| \le 2\varepsilon$, and (1) $G'$ is well fitted by a PFM whenever G is, and (2) C is $\delta$-stable w.r.t. $G'$, hence C is what some authors [9] call resilient.

A graph G is clusterable when G can be fitted well by some clustering $C^*$. Much work [4, 7] has been devoted to showing that clusterability implies that finding a C close to $C^*$ is computationally efficient. Such results can be obtained in our framework by exploiting existing recovery theorems such as [18, 17, 20], which give recovery guarantees for Spectral Clustering under the assumption of sampling from the model. For this, we can simply replace the model assumption with the assumption that there exists a $C^*$ for which L (or A) satisfies Assumptions 1 and 2 (respectively 3 and 4).

5 Related work

To our knowledge, there is no work of the type of Theorem 1 in the literature on the SBM, DC-SBM, or PFM. The closest work is [6], which guarantees approximate recovery assuming G is close to a DC-SBM. Spectral clustering is also used for loss-based clustering in (weighted) graphs, and some stability results exist in this context. Even though they measure clustering quality by different criteria, so that the $\varepsilon$ values are not comparable, we review them here. The recent paper [16], Theorem 1.2, states that if the K-way Cheeger constant of G satisfies $\rho(K) \le (1 - \hat\lambda_{K+1})/(cK^3)$, then the clustering error satisfies $\mathrm{dist}_{\hat d}(C, C_{\mathrm{opt}}) \le C/c = \delta_{PSZ}$ (the result is in fact stronger, bounding the perturbation of each cluster individually by $\delta_{PSZ}$, but it also includes a factor larger than 1 bounding the error of the K-means algorithm). In the current proof, the constant $C = 2\cdot 10^5$; moreover, $\rho(K)$ cannot be computed tractably.
In [14], the bound $\delta_{MSX}$ depends on $\varepsilon_{MSX}$, the Normalized Cut scaled by the eigengap. Since both bounds refer to the result of spectral clustering, we can compare the relationship between $\delta_{MSX}$ and $\varepsilon_{MSX}$; for [14], this is $\delta_{MSX} = 2\varepsilon_{MSX}[1 - \varepsilon_{MSX}/(K-1)]$, which is about $K-1$ times larger than our $\delta$ when $\varepsilon = \varepsilon_{MSX}$. In [5], $\mathrm{dist}(C, C')$ is defined in terms of $\|Y_Z^T Y'_{Z'}\|_F^2$, and the loss is (closely related to) $\|\hat A - \mathrm{SBM}(G, C)\|_F^2$. The bound does not take into account the eigengap, that is, the stability of the subspace $\hat Y$ itself. Bootstrap for validating a clustering C was studied in [11] (see also the references therein for earlier work). In [3] the idea is to introduce a statistic, and large deviation bounds for it, conditioned on sampling from an SBM (with covariates) and on a given C.

6 Experimental evaluation

Experiment setup. Given G, we obtain a clustering $C_0$ by spectral clustering [15]. Then we calculate clusterings C by perturbing $C_0$ with gradually increasing noise. For each C, we construct the models PFM(G, C) and SBM(G, C), and compute $\varepsilon$, $\delta$ and $\delta_0$. If $\delta \le \delta_0$, C is guaranteed to be stable by the theorems. In the remainder of this section, we describe the data generating process for the simulated datasets and the results we obtained.

PFM Datasets. We generate from a PFM with $K = 5$, $n = 10000$, $\lambda_{1:K} = (1, 0.875, 0.75, 0.625, 0.5)$, eigengap $= 0.48$, $n_{1:K} = (2000, 2000, 2000, 2000, 2000)$. The stochastic matrix R and its stationary distribution $\rho$ are

$$\rho = (0.25,\; 0.12,\; 0.17,\; 0.18,\; 0.28), \qquad R = \begin{bmatrix} .79 & .02 & .06 & .03 & .10\\ .03 & .71 & .23 & .00 & .02\\ .09 & .16 & .69 & .00 & .06\\ .04 & .00 & .00 & .80 & .16\\ .10 & .01 & .03 & .11 & .76 \end{bmatrix}.$$

We sample an adjacency matrix $\hat A$ from the resulting A.

Perturbed PFM Datasets. The edge probability matrix A is obtained from the previous model by perturbing its principal subspace (details in the Supplement). Then we sample $\hat A$ from this A.

Lancichinetti-Fortunato-Radicchi (LFR) simulated matrix [12]. The LFR benchmark graphs are widely used for testing community detection algorithms, due to the heterogeneity in their distributions of node degree and community size. An LFR matrix is simulated with $n = 10000$, $K = 4$, $n_k = (2467, 2416, 2427, 2690)$ and $\mu = 0.2$, where $\mu$ is the mixing parameter, i.e. the fraction of a node's edges that are shared with nodes outside its community.

Political Blogs Dataset. A directed network A of hyperlinks between weblogs on US politics, compiled from online directories by Adamic and Glance [2], where each blog is assigned a political leaning, liberal or conservative, based on its content. The network contains 1490 blogs; after erasing the disconnected nodes, $n = 983$. We study $\hat A = (A^T A)^3$, which is a smoothed undirected graph. For $A^T A$ itself we find no guarantees.

The first two data sets are expected to fit the PFM well, but not the SBM, while the LFR data is expected to be a good fit for an SBM. Since all bounds can be computed on weighted graphs as well, we have also run the experiments on the edge probability matrices A used to generate the PFM and perturbed PFM graphs. The results of these experiments are summarized in Figure 1. In all of the experiments, the clustering C is guaranteed to be stable by Theorem 4 until, as the unweighted error grows, a breaking point is reached, beyond which the assumptions of the theorem fail. In particular, $C_0$ is always stable in the PFM framework.
Comparing the $\delta$ from Theorem 9 to that from Theorem 4, we find that Theorem 9 (the guarantee for the SBM) is much harder to satisfy. All $\delta$ values from Theorem 9 are above 1 and are not shown (we also computed $\delta_{RCY}$, but the bounds were not informative). In particular, for the SBM model class, C cannot be proved stable even for the LFR data. Note that part of the reason why, with the PFM model, very little difference from the clustering $C_0$ can be tolerated for a clustering to be stable is that the large eigengap makes PFM(G, C) differ from PFM(G, $C_0$) even for very small perturbations. By comparing the bounds for $\hat A$ with the bounds for the weighted graphs A, we can see that the effect of the sampling noise on $\delta$ is approximately equal to that of the clustering perturbation. Of course, the sampling noise varies with n, decreasing for larger graphs. Moreover, from the Political Blogs data, we see that smoothing a graph, e.g. by taking powers of its adjacency matrix, has a stability-inducing effect.

Figure 1: Quantities $\varepsilon$, $\delta$, $\delta_0$ from Theorem 4 plotted vs. $\mathrm{dist}(C, C_0)$ for the various datasets: $\hat A$ denotes a simple graph, while A denotes a weighted graph (i.e. a non-negative matrix). For the Political Blogs data, "truth" means $C_0$ is the true clustering of [2], and "spectral" means $C_0$ is obtained from spectral clustering. For the SBM, $\delta$ is always greater than $\delta_0$.

7 Discussion

This paper makes several contributions. At a high level, it poses the problem of model-free validation in the area of community detection in networks. The stability paradigm is not entirely new, but using it explicitly with model-based clustering (instead of cost-based clustering) is. So is turning the model-based recovery theorems around to be used in a model-free framework. All quantities in our theorems are computable from the data and the clustering C, i.e. they do not contain undetermined constants and do not depend on parameters that are not available.

As with distribution-free results in general, making fewer assumptions allows for less confidence in the conclusions, and the results are not always informative. Sometimes this should be so, e.g. when the data does not fit the model well. But it is also possible that the fit is good, yet not good enough to satisfy the conditions of the theorems as they are currently formulated. This happens with the SBM bounds, and we believe tighter bounds are possible for this model. It would be particularly interesting to study the non-spectral, sharp thresholds of [1] from the point of view of model-free recovery. A complementary problem is to obtain negative guarantees (i.e. that C is not unique up to perturbations).

At the technical level, we obtain several different and model-specific stability results, which bound the perturbation of a clustering by the perturbation of a model. They can be used both in model-free and in existing or future model-based recovery guarantees, as we have shown in Section 3 and in the experiments. The proof techniques that lead to these results are actually simpler, more direct, and more elementary than the ones found in previous papers.

References

[1] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms. arXiv preprint arXiv:1503.00609, 2015.

[2] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36-43. ACM, 2005.
[3] Edoardo M. Airoldi, David S. Choi, and Patrick J. Wolfe. Confidence sets for network structure. Technical Report arXiv:1105.6245, 2011.

[4] Pranjal Awasthi. Clustering under stability assumptions. In Encyclopedia of Algorithms, pages 331-335, 2016.

[5] Francis Bach and Michael I. Jordan. Learning spectral clustering with applications to speech separation. Journal of Machine Learning Research, 7:1963-2001, 2006.

[6] Maria-Florina Balcan, Christian Borgs, Mark Braverman, Jennifer Chayes, and Shang-Hua Teng. Finding endogenously formed communities. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 767-783. SIAM, 2013.

[7] Shai Ben-David. Computational feasibility of clustering under clusterability assumptions. CoRR, abs/1501.00437, 2015.

[8] Rajendra Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.

[9] Yonatan Bilu and Nathan Linial. Are stable instances easy? CoRR, abs/0906.3162, 2009.

[10] Fan R. K. Chung. Spectral Graph Theory, volume 92. American Mathematical Society, 1997.

[11] Brian Karrer, Elizaveta Levina, and M. E. J. Newman. Robustness of community structure in networks. Physical Review E, 77:046119, April 2008.

[12] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.

[13] Marina Meilă. Local equivalence of distances between clusterings: a geometric perspective. Machine Learning, 86(3):369-389, 2012.

[14] Marina Meilă, Susan Shortreed, and Liang Xu. Regularized spectral learning. In Robert Cowell and Zoubin Ghahramani, editors, Proceedings of the Artificial Intelligence and Statistics Workshop (AISTATS 05), 2005.

[15] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

[16] Richard Peng, He Sun, and Luca Zanetti. Partitioning well-clustered graphs with k-means and heat kernel. In Proceedings of the Annual Conference on Learning Theory (COLT), pages 1423-1455, 2015.

[17] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120-3128, 2013.

[18] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, pages 1878-1915, 2011.

[19] Gilbert W. Stewart and Ji-guang Sun. Matrix Perturbation Theory, volume 175. Academic Press, New York, 1990.

[20] Yali Wan and Marina Meilă. A class of network models recoverable by spectral clustering. In Daniel Lee and Masashi Sugiyama, editors, Advances in Neural Information Processing Systems (NIPS), 2015 (to appear).