# Scalable and Effective Conductance-Based Graph Clustering

Longlong Lin¹, Rong-Hua Li², Tao Jia¹
¹College of Computer and Information Science, Southwest University, Chongqing 400715, China
²Beijing Institute of Technology, China
longlonglin@swu.edu.cn; lironghuabit@126.com; tjia@swu.edu.cn

## Abstract

Conductance-based graph clustering has been recognized as a fundamental operator in numerous graph analysis applications. Despite its significant success, existing algorithms either struggle to obtain satisfactory clustering quality, or incur high time and space complexity to achieve provable clustering quality. To overcome these limitations, we devise a powerful peeling-based graph clustering framework PCon. We show that many existing solutions can be reduced to our framework: they first define a score function for each vertex, then iteratively remove the vertex with the smallest score, and finally output the result with the smallest conductance seen during the peeling process. Based on our framework, we propose two novel algorithms, PCon_core and PCon_de, with linear time and space complexity, which can efficiently and effectively identify clusters from massive graphs with more than a billion edges. Surprisingly, we prove that PCon_de identifies clusters with a near-constant approximation ratio, an important theoretical improvement over the well-known quadratic Cheeger bound. Empirical results on real-life and synthetic datasets show that our algorithms achieve 5–42 times speedup with high clustering accuracy, while using 1.4–7.8 times less memory than the baseline algorithms.

## Introduction

Graph clustering is an important algorithmic primitive with applications in numerous tasks, including image segmentation (Shi and Malik 1997; Tolliver and Miller 2006), community detection (Fortunato 2009; Leskovec, Lang, and Mahoney 2010), and machine learning (Belkin and Niyogi 2001; Bianchi, Grattarola, and Alippi 2020). Generally, graph clustering aims to partition the entire graph into several non-overlapping vertex sets, called clusters, such that the vertices within the same cluster have more connections than the vertices in different clusters. Many graph clustering methods have been proposed, such as modularity-based graph clustering (Newman 2004), structural graph clustering (Xu et al. 2007), and cohesive subgraph-based clustering (Chang and Qin 2019). Perhaps the most representative graph clustering method is conductance-based clustering (Andersen, Chung, and Lang 2006; Yang et al. 2019), owing to its nice structural properties and solid theoretical foundation.

Given an undirected graph G(V, E), the conductance of a cluster C is defined as $\phi(C) = \frac{|E(C, \bar{C})|}{\min\{vol(C),\, 2m - vol(C)\}}$, in which $|E(C, \bar{C})|$ is the number of edges with one endpoint in C and the other not in C, vol(C) is the sum of the degrees of all vertices in C, and m is the number of edges in G. Thus, a smaller φ(C) implies that the number of edges going out of C is relatively small compared with the number of edges within C. As a consequence, the smaller φ(C) is, the better the clustering quality of the cluster C (Leskovec, Lang, and Mahoney 2010; Yang and Leskovec 2015). Therefore, conductance-based graph clustering aims at identifying a vertex set C with the smallest φ(C).
However, identifying the cluster with the smallest conductance φ* ∈ (0, 1] is challenging due to its NP-hardness (Chawla et al. 2006). On the positive side, many approximate or heuristic algorithms have been proposed to either reduce the conductance of the returned cluster or improve efficiency. For example, the classic Fiedler vector-based spectral clustering algorithm outputs a cluster with conductance $O(\sqrt{\phi^*})$ (Alon and Milman 1985). Unfortunately, such an algorithm has to compute the eigenvector corresponding to the second smallest eigenvalue of the normalized Laplacian matrix of G, resulting in prohibitively high time and space complexity. To boost efficiency, numerous diffusion-based local clustering algorithms have been proposed, whose running time depends only on the size of the resulting cluster and is independent of, or depends at most polylogarithmically on, the size of the entire graph (Spielman and Teng 2004; Andersen, Chung, and Lang 2006; Kloster and Gleich 2014). However, all these diffusion-based local clustering algorithms are heuristic, and the quality of their output depends heavily on many hard-to-tune parameters, so their clustering performance is unstable and in most cases very poor (Zhu, Lattanzi, and Mirrokni 2013; Fountoulakis, Wang, and Yang 2020). Thus, it is very challenging to improve both computational efficiency and clustering quality.

To this end, we propose a powerful peeling-based computing framework PCon, which can efficiently and effectively identify conductance-based clusters. In particular, we observe that Fiedler vector-based spectral clustering algorithms and diffusion-based local clustering algorithms essentially follow a peeling-based computing paradigm: they first define a score function for each vertex, then iteratively remove the vertex with the smallest score, and finally output the result with the smallest conductance seen during the peeling process. Thus, the primary challenge is how to design a proper score function that is easy to compute and yields good clustering quality. With the help of the concept of degeneracy ordering (Batagelj and Zaversnik 2003), we propose a heuristic algorithm PCon_core with linear time and space complexity. Specifically, PCon_core assigns each vertex a score equal to its position in the degeneracy ordering. To further improve clustering quality, we devise another effective algorithm PCon_de, which assigns the score of each vertex based on its degree ratio (Definition 3), a novel concept defined as the ratio of the degree of a vertex in the current subgraph to its degree in the original graph. Surprisingly, we prove that PCon_de has a near-constant approximation ratio, an important theoretical improvement over the well-known quadratic Cheeger bound of Fiedler vector-based spectral clustering. In a nutshell, our main contributions are as follows.

- **Novel computing framework:** We introduce PCon, a novel peeling-based graph clustering framework, which embraces most state-of-the-art algorithms.
- **Novel algorithms:** Based on PCon, we develop two novel and efficient algorithms with linear time and space complexity: a heuristic algorithm that optimizes for efficiency, and an approximation algorithm that obtains provable clustering quality.
- **Extensive experiments:** We conduct extensive experiments on eleven datasets with six competitors to test the scalability and effectiveness of the proposed algorithms. The results show that our proposal outperforms diffusion-based local clustering by a large margin in terms of clustering accuracy, and achieves 5–42 times speedup with high clustering accuracy while using 1.4–7.8 times less memory than Fiedler vector-based spectral clustering.

## Problem Formulation

We use G(V, E) to denote an undirected graph, in which V (resp. E) indicates the vertex set (resp. edge set) of G. Let |V| = n and |E| = m be the number of vertices and edges, respectively. For simplicity, we focus on unweighted graphs in this paper; our results extend to the weighted case in a straightforward manner. Let $G_S = (S, E_S)$ be the subgraph induced by $S \subseteq V$, where $E_S = \{(u, v) \in E \mid u, v \in S\}$. We use $N_S(v) = \{u \in S \mid (u, v) \in E\}$ to denote the neighbors of v in S, and let $d_S(v) = |N_S(v)|$ be the degree of vertex v in S. When the context is clear, we write N(v) and d(v) for $N_V(v)$ and $d_V(v)$, respectively. For a vertex subset $S \subseteq V$, we define $\bar{S} = V \setminus S$ as the complement of S, and $E(S, \bar{S}) = \{(u, v) \in E \mid u \in S, v \in \bar{S}\}$ as the set of edges with one endpoint in S and the other not in S. Let $vol(S) = \sum_{u \in S} d(u)$ be the sum of the degrees of all vertices in S. A cluster is a vertex set $S \subseteq V$.

According to (Leskovec, Lang, and Mahoney 2010; Yang and Leskovec 2015), a cluster S is good if it is densely connected internally and well separated from the remainder of G. Therefore, we use the representative metric conductance (Fiedler 1973; Spielman and Teng 2004; Andersen, Chung, and Lang 2006; Kloster and Gleich 2014) to measure the quality of a cluster S.

**Definition 1** Given an undirected graph G(V, E) and a vertex subset $S \subseteq V$, the conductance of S is defined as
$$\phi(S) = \frac{|E(S, \bar{S})|}{\min\{vol(S),\, 2m - vol(S)\}},$$
with φ(V) = 1 for convenience.

By Definition 1, we have $\phi(S) = \phi(\bar{S})$. Figure 1 illustrates the concept of conductance on a synthetic graph.

Figure 1: Illustration of conductance for a synthetic graph with 9 vertices and 10 edges. For example, φ({v1}) = 2/min{2, 20−2} = 1, φ({v1, v2}) = 3/min{5, 20−5} = 3/5, φ({v1, v2, v3}) = 3/min{9, 20−9} = 3/11, and φ({v1, v2, v3, v4}) = 1/min{11, 20−11} = 1/9. The smallest conductance is φ* = 1/9, and the corresponding cluster is {v1, v2, v3, v4}.

**Problem Statement.** Given an undirected graph G(V, E), the goal of conductance-based graph clustering is to identify a vertex subset $S^* \subseteq V$ satisfying $vol(S^*) \le m$ and $\phi(S^*) \le \phi(S)$ for any $S \subseteq V$. For simplicity, we use φ* to denote φ(S*).
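To make Definition 1 concrete, here is a minimal Python sketch (ours, not part of the paper) that computes φ(S) from an adjacency-set representation; the two-triangle toy graph in the usage example is hypothetical.

```python
def conductance(adj, S, m):
    """phi(S) from Definition 1. adj maps each vertex to its neighbor set,
    S is the candidate cluster, m is the total number of edges."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)  # |E(S, S-bar)|
    vol = sum(len(adj[u]) for u in S)                      # vol(S)
    denom = min(vol, 2 * m - vol)
    return 1.0 if denom == 0 else cut / denom              # phi(V) = 1 convention

# Hypothetical example: two triangles {1,2,3} and {4,5,6} joined by edge (3,4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(conductance(adj, {1, 2, 3}, m=7))  # 1/7: one cut edge, vol(S) = 7
```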
## Existing Solutions

Here, we review some state-of-the-art algorithms. For convenience, we classify them into three categories: Fiedler vector-based spectral clustering, diffusion-based local clustering, and other methods. Table 1 summarizes their guarantees and complexities.

| Method | Accuracy Guarantee | Time Complexity | Remark |
|---|---|---|---|
| SC (Fiedler 1973) | $O(\sqrt{\phi^*})$ | $O(n^3)$ | Eigenvector-based |
| ASC (Trevisan 2017) | $O(\sqrt{4\phi^* + 2\epsilon})$ | $O((m+n) \cdot \frac{1}{\epsilon})$ | Eigenvector-based |
| NIBBLE (Spielman and Teng 2004) | – | $O(\frac{\log(1/\epsilon)}{\epsilon})$ | Diffusion-based |
| NIBBLE_PPR (Andersen, Chung, and Lang 2006) | – | $O(\frac{\log(1/(\alpha\epsilon))}{\alpha\epsilon})$ | Diffusion-based |
| HK_Relax (Kloster and Gleich 2014) | – | $O(\frac{t e^t \log(1/\epsilon)}{\epsilon})$ | Diffusion-based |
| PCon_core (this paper) | – | $O(m+n)$ | Degeneracy-based |
| PCon_de (this paper) | $1/2 + \phi^*/2$ | $O(m+n)$ | Degree-ratio-based |

Table 1: A comparison of state-of-the-art algorithms. φ* ∈ (0, 1] is the smallest conductance value, ϵ is the error tolerance, and α and t are model parameters of NIBBLE_PPR and HK_Relax, respectively. The space complexity of SC is O(n²); all other methods use O(m + n) space. "–" means the corresponding method has no accuracy guarantee.

### Fiedler Vector-Based Spectral Clustering

Fiedler vector-based spectral clustering obtains the clustering result by computing the spectrum of the normalized Laplacian matrix. Specifically, let A be the adjacency matrix of G, where $A_{uv} = 1$ if $(u, v) \in E$ and $A_{uv} = 0$ otherwise. We use D to denote the diagonal degree matrix of G, in which $D_{uu} = d(u)$. The Laplacian matrix of G is defined as $L = D - A$, and the normalized Laplacian matrix as $\mathcal{L} = D^{-1/2} L D^{-1/2}$. The following theorem is an important theoretical basis of Fiedler vector-based spectral clustering.

**Theorem 1 (Cheeger inequality (Alon and Milman 1985))** Given an undirected graph G(V, E) and its normalized Laplacian matrix $\mathcal{L}$, let $\lambda_2$ be the second smallest eigenvalue of $\mathcal{L}$. Then, $\lambda_2/2 \le \phi^* \le \sqrt{2\lambda_2}$.

At a high level, Fiedler vector-based spectral clustering consists of three steps: (1) Compute the eigenvector x of λ2. (2) Sort all entries of x such that $x_1 \le x_2 \le \dots \le x_n$. (3) Output $S = \arg\min_{S_i} \phi(S_i)$, in which $S_i = \{x_1, x_2, ..., x_i\}$.

**Theorem 2 ((Fiedler 1973; Alon and Milman 1985))** Let S be the vertex subset returned by Fiedler vector-based spectral clustering. Then $\phi(S) \le \sqrt{2\lambda_2}$, and consequently $\phi^* \le \phi(S) \le 2\sqrt{\phi^*}$.

Note that since Fiedler vector-based spectral clustering (SC for short) needs to compute an eigenvector of $\mathcal{L}$, its time complexity is O(n³) and its space complexity is O(n²) (Fiedler 1973), resulting in poor scalability. Thus, efficient approximate algorithms have been devised. For example, Trevisan proposed ASC to approximate the eigenvector of the second smallest eigenvalue of $\mathcal{L}$ (Trevisan 2017). As a result, ASC identifies a cluster S with $\phi(S) \le \sqrt{4\phi^* + 2\epsilon}$ using $O((m+n) \cdot \frac{1}{\epsilon})$ time, in which ϵ is the error tolerance. Although the time complexity of ASC is much lower than that of SC, the quality of ASC degrades rapidly, as shown in our experiments.
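Steps (2) and (3) above, sorting vertices by a score and returning the best prefix, form the sweep cut shared by the spectral and diffusion methods and by the PCon framework below. The following sketch (ours, reusing the hypothetical adjacency-set representation from the previous snippet and assuming a connected graph) scans all proper prefixes while maintaining the cut size and volume incrementally:

```python
def sweep_cut(adj, m, score):
    """Sweep cut: sort vertices by ascending score (step 2) and return the
    proper prefix with the smallest conductance (step 3), updating the cut
    size and volume incrementally. Assumes a connected graph."""
    order = sorted(adj, key=lambda u: score[u])
    S, cut, vol = set(), 0, 0
    best_phi, best_size = float("inf"), 0
    for u in order[:-1]:  # proper prefixes only, so min(vol, 2m - vol) > 0
        vol += len(adj[u])
        cut += sum(-1 if v in S else 1 for v in adj[u])  # edges into S become internal
        S.add(u)
        phi = cut / min(vol, 2 * m - vol)
        if phi < best_phi:
            best_phi, best_size = phi, len(S)
    return set(order[:best_size]), best_phi
```

The sort makes this O(m + n log n). For SC the score is the Fiedler vector entry; for the diffusion methods below, one can pass the negated degree-normalized diffusion vector so that large-mass vertices come first.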
### Diffusion-Based Local Clustering

Diffusion-based local clustering is another technique, which propagates probability mass from a given query seed vertex q to identify clusters. Here, we describe three state-of-the-art graph diffusions: Truncated Random Walk (Spielman and Teng 2004), Personalized PageRank (Andersen, Chung, and Lang 2006), and Heat Kernel (Kloster and Gleich 2014). For simplicity, we use $\pi_T$, $\pi_P$, and $\pi_H$ to denote the probability distributions at the end of the Truncated Random Walk, Personalized PageRank, and Heat Kernel diffusions, respectively.

Before proceeding further, we introduce some notation. Let $P = D^{-1}A$ be the probability transition matrix of G, in which $P_{uv} = 1/d(u)$ for any $v \in N(u)$. Moreover, we use $P^k$ to denote the k-hop transition matrix of G; that is, $P^k_{uv}$ is the probability that a k-hop (k ≥ 1) random walk from vertex u ends at vertex v.

1) Truncated Random Walk (Spielman and Teng 2004) is a graph diffusion in which propagation and truncation are performed alternately. Specifically, for any vector s and error tolerance ϵ, the truncation operator Tr(s) is defined entrywise as

$$Tr(s)[u] = \begin{cases} s[u], & \text{if } s[u] \ge d(u)\,\epsilon, \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$

Furthermore, we use $Z_0 = \chi_q$ to denote the one-hot vector whose only value-1 entry corresponds to the given query seed q. Propagation and truncation are combined as $Z_i = Tr(Z_{i-1}P)$, in which $Z_{i-1}P$ is the propagation step and $Tr(\cdot)$ is the truncation step that yields the next probability distribution. We let $\pi_T = Z_N$ be the probability distribution after N iterations, in which N is an input parameter.

2) Personalized PageRank (Andersen, Chung, and Lang 2006) models a special random walk process. Specifically, given a stop probability parameter α (a.k.a. the teleportation probability), an α-discount random walk (1) starts from the given query seed q, and (2) at each step stops at the current vertex with probability α, or continues to walk according to the probability transition matrix P with probability 1 − α. Then $\pi_P(u)$ is the probability that the α-discount random walk stops at u.

3) Heat Kernel (Kloster and Gleich 2014) also models a special random walk process. Specifically, given a heat constant t, $\pi_H(u)$ is the probability that a random walk of length k starting from the given query seed q ends at vertex u, in which k is sampled from the Poisson distribution $\eta(k) = e^{-t} t^k / k!$. Thus $\pi_H(u) = \sum_{k=0}^{\infty} \eta(k)\, P^k_{qu}$.

These diffusion-based local clustering algorithms likewise consist of three steps: (1) Compute the probability distribution π at the end of the corresponding graph diffusion, and let $y = \pi D^{-1}$. (2) Sort the non-zero entries of y such that $y_1 \ge y_2 \ge \dots \ge y_{sup(y)}$, in which sup(y) is the number of non-zero entries in y. (3) Output $S = \arg\min_{S_i} \phi(S_i)$, in which $S_i = \{y_1, y_2, ..., y_i\}$.

Note that since diffusion-based local clustering algorithms aim to recover the cluster to which the given query seed q belongs, they only enjoy locally-biased Cheeger-like guarantees (Spielman and Teng 2004; Andersen, Chung, and Lang 2006; Kloster and Gleich 2014); that is, they do not bound the gap to φ*. Besides, the clustering quality of their output depends heavily on the given query seed and on hard-to-tune parameters, so they are unstable and prone to finding degenerate solutions in most cases (Zhu, Lattanzi, and Mirrokni 2013; Fountoulakis, Wang, and Yang 2020).
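For illustration, the sketch below (ours) realizes step (1) for Personalized PageRank by plain truncated power iteration over the α-discount walk. This is a global simplification for exposition, not the local push algorithm of Andersen, Chung, and Lang (2006), whose cost is independent of the graph size.

```python
def ppr_scores(adj, q, alpha, iters=100):
    """pi_P for seed q via truncated power iteration over the alpha-discount
    walk: pi_P = alpha * sum_k (1 - alpha)^k * chi_q P^k.
    Global O(iters * (m + n)) simplification; assumes every vertex reachable
    from q has degree >= 1 (no dead ends)."""
    pi = {u: 0.0 for u in adj}
    walk = {u: 0.0 for u in adj}
    walk[q] = 1.0                            # chi_q: all mass on the seed
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        for u, p in walk.items():
            if p > 0.0:
                pi[u] += alpha * p           # walk stops at u w.p. alpha
                share = (1.0 - alpha) * p / len(adj[u])
                for v in adj[u]:             # continue via P w.p. 1 - alpha
                    nxt[v] += share
        walk = nxt
    return pi
```

Steps (2)-(3) then reduce to the sweep cut above, e.g. `sweep_cut(adj, m, {u: -pi[u] / len(adj[u]) for u in adj})`, since `sweep_cut` sorts ascending while the diffusion sweep orders y = πD⁻¹ descending.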
### Other Methods

Optimization-based (Leighton and Rao 1999; Arora, Rao, and Vazirani 2004) and flow-based (Orecchia and Zhu 2014; Veldt, Gleich, and Mahoney 2016; Wang et al. 2017) techniques are two important lines of work for improving Fiedler vector-based spectral clustering and diffusion-based local clustering. For example, Leighton et al. (Leighton and Rao 1999) used linear programming to solve conductance-based graph clustering with an O(log n)-approximation. Furthermore, Arora et al. (Arora, Rao, and Vazirani 2004) achieved the best-known $O(\sqrt{\log n})$-approximation using a non-trivial semi-definite programming algorithm with $O(\frac{(n+m)^2}{\epsilon^{O(1)}})$ time cost. However, these methods are mostly of theoretical interest, as they are difficult to implement and offer rather poor practical efficiency (Yang et al. 2019).

## The Proposed Algorithms

Here, we first present a three-stage computing framework that embraces the computing paradigm of most existing approaches. Then, we develop two scalable and effective algorithms for conductance-based graph clustering.

### Computing Framework

Inspired by most existing approaches (Fiedler 1973; Spielman and Teng 2004; Andersen, Chung, and Lang 2006; Kloster and Gleich 2014), we propose a three-stage computing framework PCon. In Stage 1, we define a score function for every vertex according to the application at hand; for simplicity, we use s(u) to denote the score of vertex u. In Stage 2, we iteratively remove the vertex with the smallest score; this iterative deletion process is referred to as a peeling process. In Stage 3, we output the result with the smallest conductance encountered during the peeling process.

Obviously, Stage 1 is the key to the proposed framework, and different algorithms instantiate different score functions. For example, s(u) = x[u] for Fiedler vector-based spectral clustering, in which x is the eigenvector of the second smallest eigenvalue of $\mathcal{L}$; in diffusion-based local clustering, s(u) = y[u], in which $y = \pi D^{-1}$ and π is the probability distribution of the diffusion process. Thus, most state-of-the-art algorithms can be reduced to the proposed three-stage computing framework. However, the primary problem with these existing algorithms is that their score functions are difficult to compute efficiently and effectively. To this end, in the next sections, we devise two new score functions that are easier to compute.

### The PCon_core Algorithm

Recall that in Stage 2 we iteratively remove the vertex with the smallest score; that is, we have to create a linear ordering on the vertices. Many ordering strategies have been proposed for graph analysis tasks such as graph coloring (Hasenplaugh et al. 2014) and k-clique listing (Li et al. 2020), but ordering techniques for conductance-based graph clustering are less explored. To fill this gap, we propose a simple algorithm based on the well-known degeneracy ordering (Batagelj and Zaversnik 2003).

**Definition 2 (Degeneracy ordering)** Given an undirected graph G, a permutation $(u_1, u_2, ..., u_n)$ of the vertices of G is a degeneracy ordering iff every vertex $u_i$ has the minimum degree in the subgraph of G induced by $V_i = \{u_i, u_{i+1}, ..., u_n\}$; that is, $u_i = \arg\min\{d_{V_i}(u) \mid u \in V_i\}$.

We use the degeneracy ordering to assign the score s(u) of each vertex u; specifically, s(u) = i if u is the i-th element of the degeneracy ordering from left to right.

**Example 1** Reconsider the graph in Figure 1. According to Definition 2, the degeneracy ordering is (v6, v8, v9, v7, v5, v1, v2, v3, v4), as illustrated in Figure 2. Consider vertex v1: s(v1) = 6, because v1 has the minimum degree in the subgraph of G induced by {v1, v2, v3, v4}. Similarly, s(v6) = 1 and s(v7) = 4.

Figure 2: A degeneracy ordering of the vertices in Figure 1.

The degeneracy ordering can be computed in linear time by the classic core decomposition (Batagelj and Zaversnik 2003): iteratively remove the vertex with the smallest degree in the current subgraph until all vertices are removed, updating the degrees of the remaining vertices after each removal. The sequence of removed vertices then forms the degeneracy ordering. Using a proper data structure (e.g., bin sort), this vertex-removal process runs in linear time (Batagelj and Zaversnik 2003). Based on the degeneracy ordering, we propose a simple but practically effective algorithm PCon_core, outlined in Algorithm 1.

```
Algorithm 1: PCon_core
Input: An undirected graph G(V, E)
Output: A cluster S
1: (u1, u2, ..., un) ← the degeneracy ordering
2: i ← n, Si ← V, S ← V
3: while i ≠ 0 do
4:   if vol(Si) ≤ m and φ(Si) < φ(S) then
5:     S ← Si
6:   end if
7:   Si−1 ← Si \ {ui}; i ← i − 1
8: end while
9: return S
```
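A compact Python sketch of Algorithm 1 (ours, for illustration): it computes the degeneracy ordering by the peel-minimum-degree process described above, then sweeps the prefixes S_i = {u1, ..., ui}. The bin structure below uses a simple min-scan to keep the sketch short, so it is near-linear rather than the strict O(m + n) achievable with bucket pointers.

```python
def degeneracy_ordering(adj):
    """Degeneracy ordering via core decomposition (Batagelj and Zaversnik
    2003): repeatedly remove a minimum-degree vertex, updating degrees."""
    deg = {u: len(adj[u]) for u in adj}
    bins = {}                                 # degree -> set of vertices
    for u, d in deg.items():
        bins.setdefault(d, set()).add(u)
    order, removed = [], set()
    for _ in range(len(adj)):
        d = min(b for b in bins if bins[b])   # smallest non-empty degree bin
        u = bins[d].pop()
        order.append(u)
        removed.add(u)
        for v in adj[u]:
            if v not in removed:              # neighbour loses one degree
                bins[deg[v]].remove(v)
                deg[v] -= 1
                bins.setdefault(deg[v], set()).add(v)
    return order

def pcon_core(adj, m):
    """Algorithm 1: among the prefixes S_i = {u_1, ..., u_i} of the
    degeneracy ordering, return the one with the smallest conductance
    subject to vol(S_i) <= m. Scanning i upward visits the same candidate
    sets as Algorithm 1's downward peel."""
    best, best_phi = set(adj), 1.0            # phi(V) = 1 by convention
    S, cut, vol = set(), 0, 0
    for u in degeneracy_ordering(adj):
        vol += len(adj[u])
        cut += sum(-1 if v in S else 1 for v in adj[u])
        S.add(u)
        if 0 < vol <= m and cut / vol < best_phi:  # vol <= m => min(...) = vol
            best, best_phi = set(S), cut / vol
    return best
```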
Specifically, we first set the score s(u) of each vertex u ∈ V according to the degeneracy ordering (Line 1). Then, we execute Stage 2 (Lines 3, 7, and 8) and Stage 3 (Lines 4-6 and 9).

### The PCon_de Algorithm

As stated in the Introduction, Fiedler vector-based spectral clustering obtains a cluster with conductance $O(\sqrt{\phi^*})$, but at prohibitively high time and space complexity. On the other hand, although diffusion-based local clustering has very low time complexity, it is heuristic and offers no global conductance guarantee. Thus, an interesting problem is to devise an algorithm with better time complexity than Fiedler vector-based spectral clustering and better conductance quality than diffusion-based local clustering. To this end, we propose a novel algorithm PCon_de with linear time and space complexity and a near-constant approximation ratio.

**Lemma 1** Given an undirected graph G(V, E) and a vertex subset S with $vol(S) \le m$, we have $\phi(S) = 1 - \frac{\sum_{u \in S} d_S(u)}{\sum_{u \in S} d_V(u)}$.

**Proof 1** According to Definition 1, if vol(S) ≤ m, we have $\phi(S) = \frac{|E(S, \bar{S})|}{\min\{vol(S),\, 2m - vol(S)\}} = \frac{|E(S, \bar{S})|}{vol(S)}$. Furthermore, every edge incident to a vertex of S either stays inside S or crosses to $\bar{S}$, so $\sum_{u \in S} d_V(u) = \sum_{u \in S} d_S(u) + |E(S, \bar{S})|$. Thus, $\phi(S) = \frac{\sum_{u \in S} d_V(u) - \sum_{u \in S} d_S(u)}{\sum_{u \in S} d_V(u)} = 1 - \frac{\sum_{u \in S} d_S(u)}{\sum_{u \in S} d_V(u)}$.

For simplicity, we define $g(S) = \frac{\sum_{u \in S} d_S(u)}{2 \sum_{u \in S} d_V(u)}$, so that $\phi(S) = 1 - 2g(S)$ whenever vol(S) ≤ m; hence the larger the value of g(S), the better the quality of S. We let $\tilde{S}$ denote the optimal vertex set for g(·); that is, $g(\tilde{S}) \ge g(S)$ for any vertex subset $S \subseteq V$.

**Lemma 2** Given an undirected graph G(V, E), $\frac{d_{\tilde{S}}(u)}{d_V(u)} \ge g(\tilde{S})$ holds for any $u \in \tilde{S}$.

**Proof 2** By contradiction. Assume there is a vertex $u \in \tilde{S}$ such that $\frac{d_{\tilde{S}}(u)}{d_V(u)} < g(\tilde{S})$, i.e., $d_{\tilde{S}}(u) < g(\tilde{S})\, d_V(u)$. Removing u deletes the $d_{\tilde{S}}(u)$ edges of $G_{\tilde{S}}$ incident to u, so
$$g(\tilde{S} \setminus \{u\}) = \frac{\sum_{v \in \tilde{S}} d_{\tilde{S}}(v) - 2\, d_{\tilde{S}}(u)}{2\left(\sum_{v \in \tilde{S}} d_V(v) - d_V(u)\right)} > \frac{\sum_{v \in \tilde{S}} d_{\tilde{S}}(v) - 2\, g(\tilde{S})\, d_V(u)}{2\left(\sum_{v \in \tilde{S}} d_V(v) - d_V(u)\right)} = \frac{2\, g(\tilde{S}) \left(\sum_{v \in \tilde{S}} d_V(v) - d_V(u)\right)}{2\left(\sum_{v \in \tilde{S}} d_V(v) - d_V(u)\right)} = g(\tilde{S}),$$
which contradicts the optimality of $\tilde{S}$ for g(·). As a result, $\frac{d_{\tilde{S}}(u)}{d_V(u)} \ge g(\tilde{S})$ holds for any $u \in \tilde{S}$.

**Definition 3 (Degree ratio)** Given an undirected graph G(V, E) and a subgraph $G_S$, the degree ratio of $u \in S$ w.r.t. G and $G_S$ is defined as $Dr_S(u) = \frac{d_S(u)}{d_V(u)}$.

Based on the above lemmas and definitions, we devise a linear-time greedy removal algorithm called PCon_de, shown in Algorithm 2.

```
Algorithm 2: PCon_de
Input: An undirected graph G(V, E)
Output: A cluster Ŝ
 1: S ← V; Ŝ ← V
 2: DrS(u) ← dS(u)/dV(u) for each vertex u ∈ S
 3: while S ≠ ∅ do
 4:   u ← arg min{DrS(u) | u ∈ S}
 5:   S ← S \ {u}
 6:   if g(S) > g(Ŝ) and vol(S) ≤ m then
 7:     Ŝ ← S
 8:   end if
 9:   for v ∈ NS(u) do
10:     DrS(v) ← DrS(v) − 1/dV(v)
11:   end for
12: end while
13: return Ŝ
```

In particular, Algorithm 2 first initializes the current search space S as V, the candidate result Ŝ as V, and the degree ratio $Dr_S(u)$ of every vertex u ∈ S according to Definition 3 (Lines 1-2). Subsequently, it executes a greedy removal round after round to improve the quality of the target cluster (Lines 3-12). Specifically, in each round it extracts the vertex u with the smallest degree ratio (Line 4); Lines 5-11 then update the current search space S, the candidate result Ŝ if improved, and $Dr_S(v)$ for each $v \in N_S(u)$. The iteration terminates once the current search space is empty (Line 3). Finally, Ŝ is returned as the approximate solution (Line 13).
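The following Python sketch (ours) mirrors Algorithm 2 using the identity φ(S) = 1 − 2g(S) from Lemma 1. For simplicity it uses a lazy-deletion heap, giving O((m + n) log n) rather than the paper's linear time; a bucket structure over the discrete values $d_S(v)/d_V(v)$ would recover O(m + n). It assumes a connected input graph with no isolated vertices.

```python
import heapq

def pcon_de(adj, m):
    """Sketch of Algorithm 2 (PCon_de): repeatedly peel the vertex with the
    smallest degree ratio Dr_S(u) = d_S(u)/d_V(u), tracking the best set
    under g(S) = sum_S d_S(u) / (2 * sum_S d_V(u)), so phi(S) = 1 - 2*g(S)
    when vol(S) <= m (Lemma 1). Lazy-deletion heap instead of the paper's
    linear-time structure; assumes no isolated vertices."""
    deg_v = {u: len(adj[u]) for u in adj}    # d_V(u), fixed
    deg_s = dict(deg_v)                      # d_S(u), shrinks as we peel
    num = sum(deg_s.values())                # sum_{u in S} d_S(u) = 2|E_S|
    den = sum(deg_v.values())                # sum_{u in S} d_V(u) = vol(S)
    heap = [(1.0, u) for u in adj]           # Dr_V(u) = 1 for every vertex
    S, best, best_g = set(adj), set(adj), float("-inf")
    while S:
        ratio, u = heapq.heappop(heap)
        if u not in S or ratio > deg_s[u] / deg_v[u]:
            continue                         # stale entry: u was updated later
        S.remove(u)
        num -= 2 * deg_s[u]                  # u and its neighbours lose these edges
        den -= deg_v[u]
        for v in adj[u]:
            if v in S:
                deg_s[v] -= 1                # Dr_S(v) drops by 1/d_V(v) (Line 10)
                heapq.heappush(heap, (deg_s[v] / deg_v[v], v))
        if S and den <= m and num / (2 * den) > best_g:   # Lines 6-8
            best, best_g = set(S), num / (2 * den)
    return best
```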
**Theorem 3** Algorithm 2 identifies a cluster with conductance at most $1/2 + \phi^*/2$.

**Proof 3** Let $\tilde{S}$ be the optimal vertex set for g(·). In Lines 3-12, Algorithm 2 executes the peeling process; that is, in each round it greedily deletes the vertex with the smallest degree ratio. Consider the round t in which the first vertex u of $\tilde{S}$ is deleted, and let $V_t$ be the vertex set at the beginning of round t. Clearly, $\tilde{S} \subseteq V_t$, because u is the first deleted vertex of $\tilde{S}$. This implies that $\min_{v \in V_t} Dr_{V_t}(v) = Dr_{V_t}(u) = \frac{d_{V_t}(u)}{d_V(u)} \ge \frac{d_{\tilde{S}}(u)}{d_V(u)}$. Furthermore,
$$g(V_t) = \frac{\sum_{v \in V_t} d_{V_t}(v)}{2\sum_{v \in V_t} d_V(v)} = \frac{\sum_{v \in V_t} d_V(v)\, Dr_{V_t}(v)}{2\sum_{v \in V_t} d_V(v)} \ge \frac{\sum_{v \in V_t} d_V(v)\, Dr_{V_t}(u)}{2\sum_{v \in V_t} d_V(v)} = \frac{d_{V_t}(u)}{2\, d_V(u)} \ge \frac{d_{\tilde{S}}(u)}{2\, d_V(u)}.$$
By Lemma 2, $\frac{d_{\tilde{S}}(u)}{d_V(u)} \ge g(\tilde{S})$; thus $g(V_t) \ge \frac{g(\tilde{S})}{2}$. Since Algorithm 2 maintains the best solution during the peeling process (Lines 6-8), the set Ŝ returned in Line 13 satisfies $g(\hat{S}) \ge g(V_t) \ge \frac{g(\tilde{S})}{2}$. On the other hand, by Lemma 1 we have $g(\hat{S}) = \frac{1 - \phi(\hat{S})}{2}$ and $g(S^*) = \frac{1 - \phi(S^*)}{2}$. According to the definition of $\tilde{S}$, we know that $g(\tilde{S}) \ge g(S^*)$. Thus,
$$\frac{1 - \phi(\hat{S})}{2} = g(\hat{S}) \ge \frac{g(\tilde{S})}{2} \ge \frac{g(S^*)}{2} = \frac{1 - \phi(S^*)}{4}.$$
Namely, $\phi(\hat{S}) \le 1 - \frac{1 - \phi(S^*)}{2} = 1/2 + \phi^*/2$. As a result, Algorithm 2 identifies a cluster with conductance at most 1/2 + φ*/2.

## Empirical Results

### Experimental Setup

We evaluate our proposed solutions on six real-life publicly-available datasets (Table 2), all downloadable from http://snap.stanford.edu/, which are widely used benchmarks for conductance-based graph clustering (Shun et al. 2016; Yang et al. 2019). The maximum connected components of these datasets are used in the experiments. We also use five synthetic graph models: LFR (Lancichinetti, Fortunato, and Kertész 2009), WS (Watts and Strogatz 1998), PLC (Holme and Kim 2002), ER (Erdős, Rényi et al. 1960), and BA (Barabási and Albert 1999).

| Dataset | \|V\| | \|E\| | Average degree |
|---|---|---|---|
| DBLP | 317,080 | 1,049,866 | 6.62 |
| Youtube | 1,134,890 | 2,987,624 | 5.27 |
| Pokec | 1,632,803 | 22,301,964 | 27.32 |
| LJ | 4,843,953 | 42,845,684 | 17.69 |
| Orkut | 3,072,441 | 117,185,083 | 76.28 |
| Twitter | 41,652,231 | 1,202,513,046 | 57.74 |

Table 2: Dataset statistics; the last column is the average degree.

The following six competitors are implemented for comparison.

**Eigenvector-based methods:** SC (Fiedler 1973) and ASC (Trevisan 2017). SC (resp. ASC) uses the exact (resp. approximate) eigenvector of the second smallest eigenvalue of $\mathcal{L}$ to execute the peeling process. We use sparse matrices to highly optimize these algorithms.

**Diffusion-based methods:** NIBBLE_PPR (Andersen, Chung, and Lang 2006) and HK_Relax (Kloster and Gleich 2014). Since NIBBLE_PPR and HK_Relax take a seed vertex as input, to be more reliable, we randomly select 50 vertices as seed vertices and report the average runtime and quality. Unless specified otherwise, following previous work (Shun et al. 2016), we set α = 0.01 and ϵ = 1/m for NIBBLE_PPR, and t = 10 and ϵ = 1/m for HK_Relax. Note that we do not include NIBBLE of Table 1 in the experiments, because NIBBLE is outperformed by NIBBLE_PPR and HK_Relax (Shun et al. 2016).

**Flow-based methods:** Simple Local (Veldt, Gleich, and Mahoney 2016) and CRD (Wang et al. 2017). These methods devise specialized max-flow algorithms with early termination. We take their default parameters in our experiments.
### Results on Real-World Graphs

Table 3 reports the runtime and conductance of each method on the real-world graphs.

| Runtime (s)/Conductance | DBLP | Youtube | Pokec | LJ | Orkut | Twitter |
|---|---|---|---|---|---|---|
| SC | 56/**0.009** | 552/0.006 | 165/0.002 | 14492/0.001 | 837/0.007 | 130570/0.002 |
| ASC | 21/0.482 | 73/0.554 | 730/0.454 | 1185/0.423 | 3089/0.415 | 67409/0.602 |
| NIBBLE_PPR | **7**/0.130 | **18**/0.110 | 156/0.184 | 784/0.021 | 3768/0.009 | 5822/0.177 |
| HK_Relax | 31/0.127 | 138/0.113 | 1271/0.011 | 2144/0.036 | 6016/0.008 | 99613/0.026 |
| Simple Local | 108/**0.009** | 596/0.006 | 213/0.002 | 14807/0.001 | 16267/0.013 | 283745/0.004 |
| CRD | 47/0.184 | 235/0.183 | 1231/0.218 | 1279/0.194 | 7261/0.141 | 38060/0.336 |
| PCon_core | 9/0.106 | 27/0.404 | **116**/0.327 | **266**/0.088 | **500**/0.223 | **5486**/0.419 |
| PCon_de | 11/0.027 | 39/**0.004** | 141/**0.001** | 345/**0.000** | 577/**0.006** | 7655/**0.000** |

Table 3: Runtime (in seconds) and conductance on real-world graphs. The best result in each metric is highlighted in bold.

We have the following observations. (1) The proposed algorithms PCon_core and PCon_de are consistently faster than the other methods on most graphs, especially the larger ones. In particular, they achieve speedups of 5, 14, 42, and 17 times over SC on DBLP, Youtube, LJ, and Twitter, respectively. For example, on Twitter, with more than a billion edges, PCon_core and PCon_de take 5,486 and 7,655 seconds to obtain the result, respectively, while SC takes 130,570 seconds. (2) PCon_core is slightly faster than PCon_de, but its conductance values are poor. (3) The clustering quality of PCon_de outperforms the other methods on five of the six datasets. Besides, PCon_de always outperforms NIBBLE_PPR and HK_Relax. This is because PCon_de finds clusters with a near-constant approximation ratio, while Fiedler vector-based spectral clustering algorithms (i.e., SC and ASC) only have a quadratic Cheeger bound and diffusion-based local clustering algorithms (i.e., NIBBLE_PPR and HK_Relax) have no clustering quality guarantee (Table 1). Therefore, these results give clear evidence that the proposed algorithms achieve significant speedups with high clustering quality compared with the baselines.

We use the normalized mutual information (NMI for short) (Cilibrasi and Vitányi 2005) to measure how close each detected cluster is to the ground-truth one; for a cluster C, the larger the NMI, the better the quality of C. Figure 3 shows the NMI scores of the different methods on the real-world graphs with ground-truth clusters. As can be seen, PCon_de consistently outperforms the other methods, while the NMI scores of all other methods vary significantly across datasets. For example, SC, Simple Local, and CRD are almost zero on DBLP and Youtube, but they have relatively good NMI scores on Orkut. Therefore, these results imply that PCon_de approximates the ground-truth clusters more effectively than all other methods.

Figure 3: NMI scores on real-world graphs with ground-truth clusters (DBLP, Youtube, and Orkut).

From Figure 4, we can see that the memory overheads of PCon_core and PCon_de are consistently lower than that of SC on all datasets; in particular, they save 1.4–7.8 times memory compared to SC. For example, on the Orkut graph, PCon_core and PCon_de consume about 535.88 MB and 746.26 MB to obtain the result, respectively, while SC consumes 4209.25 MB. This is because PCon_core and PCon_de only need linear space to compute the score of each vertex, whereas SC adopts an eigenvector of the matrix to compute the scores and thus requires quadratic space in the worst case. These results demonstrate that PCon_core and PCon_de can identify clusters with low memory.

Figure 4: Memory overhead on real-world graphs (excluding the size of the graph itself).
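For reproducibility of the accuracy metric, NMI scores like those in Figure 3 can be computed with scikit-learn. The paper does not spell out its exact NMI variant, so the binary-membership formulation below is an assumption on our part, and `detected`/`ground_truth` are hypothetical inputs.

```python
from sklearn.metrics import normalized_mutual_info_score

def cluster_nmi(n, detected, ground_truth):
    """NMI between a detected cluster and a ground-truth cluster over
    vertices 0..n-1, using binary in/out membership labels (one common
    formulation; assumed here, not specified by the paper)."""
    pred = [1 if v in detected else 0 for v in range(n)]
    true = [1 if v in ground_truth else 0 for v in range(n)]
    return normalized_mutual_info_score(true, pred)
```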
### Results on Synthetic Graphs

We further use synthetic graphs to evaluate the scalability and effectiveness of our proposed solutions. In particular, Figure 5(a) reports the scalability on ER graphs (Erdős, Rényi et al. 1960) with a varying number of vertices; all other synthetic models show similar trends. As can be seen, PCon_core and PCon_de scale near-linearly with respect to the size of the graph. However, SC has poor scalability, because its time cost fluctuates greatly as the graph size increases. These results indicate that our proposed algorithms can handle massive graphs while SC cannot.

The LFR model (Lancichinetti, Fortunato, and Kertész 2009) is a widely used benchmark with ground-truth clusters. In Figure 5(b), we generate five synthetic graphs of 1,000,000 vertices by varying the mixing ratio µ from 0.1 to 0.9; a larger µ means more edges crossing clusters, which makes the intrinsic clusters harder to detect. As can be seen, PCon_de consistently outperforms SC in terms of NMI and φ. Meanwhile, the quality of PCon_de decreases as µ increases, but it always remains better than that of SC. Furthermore, we generate four graphs with a varying number of vertices while fixing µ = 0.4; Figure 5(c) shows a similar trend to Figure 5(b). As a consequence, these results imply that PCon_de approximates the ground-truth clusters more effectively than SC.

Figure 5: Results on synthetic graphs: (a) scalability testing on ER graphs (running time vs. n); (b) LFR graphs varying µ (NMI and φ of SC and PCon_de); (c) LFR graphs varying n (NMI and φ of SC and PCon_de).

## Conclusion

In this paper, we devise a peeling-based computing framework PCon for conductance-based graph clustering and observe that most state-of-the-art algorithms can be reduced to PCon. Inspired by our framework, we first propose an efficient heuristic algorithm PCon_core, which adopts the degeneracy ordering to detect clusters. To improve accuracy, we further propose a powerful algorithm PCon_de with a near-constant approximation ratio, which achieves an important theoretical improvement over existing approaches such as Fiedler vector-based spectral clustering. Finally, extensive experiments on eleven datasets with six competitors demonstrate that the proposed algorithms achieve 5–42 times speedup with high accuracy while using 1.4–7.8 times less memory than the state-of-the-art solutions.

## Acknowledgments

This work is supported by (i) National Key R&D Program of China 2021YFB3301300, (ii) NSFC Grants U2241211, 62072034, and U1809206, (iii) Fundamental Research Funds for the Central Universities under Grant SWU-KQ22028, (iv) University Innovation Research Group of Chongqing CXQT21005, (v) Industry-University-Research Innovation Fund for Chinese Universities 2021ALA03016, and (vi) CCF-Huawei Populus Grove Fund. Rong-Hua Li is the corresponding author of this paper.

## References

Alon, N.; and Milman, V. D. 1985. λ1, isoperimetric inequalities for graphs, and superconcentrators. *Journal of Combinatorial Theory, Series B*, 38(1): 73-88.

Andersen, R.; Chung, F. R. K.; and Lang, K. J. 2006. Local Graph Partitioning using PageRank Vectors. In *FOCS*, 475-486.

Arora, S.; Rao, S.; and Vazirani, U. V. 2004. Expander flows, geometric embeddings and graph partitioning. In Babai, L., ed., *STOC*. ACM.

Barabási, A.-L.; and Albert, R. 1999. Emergence of scaling in random networks. *Science*, 286(5439): 509-512.
Batagelj, V.; and Zaversnik, M. 2003. An O(m) Algorithm for Cores Decomposition of Networks. *CoRR*, cs.DS/0310049.

Belkin, M.; and Niyogi, P. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In *NIPS*, 585-591.

Bianchi, F. M.; Grattarola, D.; and Alippi, C. 2020. Spectral Clustering with Graph Neural Networks for Graph Pooling. In *ICML*, 874-883.

Chang, L.; and Qin, L. 2019. Cohesive Subgraph Computation Over Large Sparse Graphs. In *ICDE*, 2068-2071.

Chawla, S.; Krauthgamer, R.; Kumar, R.; Rabani, Y.; and Sivakumar, D. 2006. On the Hardness of Approximating Multicut and Sparsest-Cut. *Comput. Complex.*, 15(2): 94-114.

Cilibrasi, R.; and Vitányi, P. M. B. 2005. Clustering by compression. *IEEE Trans. Inf. Theory*, 51(4): 1523-1545.

Erdős, P.; Rényi, A.; et al. 1960. On the evolution of random graphs. *Publ. Math. Inst. Hung. Acad. Sci*, 5(1): 17-60.

Fiedler, M. 1973. Algebraic connectivity of graphs. *Czechoslovak Mathematical Journal*, 23(2): 298-305.

Fortunato, S. 2009. Community detection in graphs. *Physics Reports*, 486(3): 75-174.

Fountoulakis, K.; Wang, D.; and Yang, S. 2020. p-Norm Flow Diffusion for Local Graph Clustering. In *ICML*, volume 119, 3222-3232.

Hasenplaugh, W.; Kaler, T.; Schardl, T. B.; and Leiserson, C. E. 2014. Ordering heuristics for parallel graph coloring. In Blelloch, G. E.; and Sanders, P., eds., *SPAA*.

Holme, P.; and Kim, B. J. 2002. Growing scale-free networks with tunable clustering. *Physical Review E*, 65(2): 026107.

Kloster, K.; and Gleich, D. F. 2014. Heat kernel based community detection. In *KDD*, 1386-1395.

Lancichinetti, A.; Fortunato, S.; and Kertész, J. 2009. Detecting the overlapping and hierarchical community structure in complex networks. *New Journal of Physics*, 11(3): 033015.

Leighton, T.; and Rao, S. 1999. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. *Journal of the ACM*, 46(6): 787-832.

Leskovec, J.; Lang, K. J.; and Mahoney, M. W. 2010. Empirical comparison of algorithms for network community detection. In *WWW*, 631-640.

Li, R.; Gao, S.; Qin, L.; Wang, G.; Yang, W.; and Yu, J. X. 2020. Ordering Heuristics for k-clique Listing. *Proc. VLDB Endow.*, 13(11): 2536-2548.

Newman, M. E. 2004. Fast algorithm for detecting community structure in networks. *Physical Review E*, 69(6): 066133.

Orecchia, L.; and Zhu, Z. A. 2014. Flow-Based Algorithms for Local Graph Clustering. In Chekuri, C., ed., *SODA*, 1267-1286. SIAM.

Shi, J.; and Malik, J. 1997. Normalized Cuts and Image Segmentation. In *CVPR*, 731-737.

Shun, J.; Roosta-Khorasani, F.; Fountoulakis, K.; and Mahoney, M. W. 2016. Parallel Local Graph Clustering. *Proc. VLDB Endow.*, 9(12): 1041-1052.

Spielman, D. A.; and Teng, S. 2004. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In *STOC*, 81-90.

Tolliver, D.; and Miller, G. L. 2006. Graph Partitioning by Spectral Rounding: Applications in Image Segmentation and Clustering. In *CVPR*, 1053-1060.

Trevisan, L. 2017. Lecture notes on graph partitioning, expanders and spectral methods. University of California, Berkeley. https://people.eecs.berkeley.edu/luca/books/expanders-2016.pdf.

Veldt, N.; Gleich, D.; and Mahoney, M. 2016. A simple and strongly-local flow-based method for cut improvement. In *ICML*, 1938-1947.

Wang, D.; Fountoulakis, K.; Henzinger, M.; Mahoney, M. W.; and Rao, S. 2017. Capacity Releasing Diffusion for Speed and Locality. In *ICML*, volume 70, 3598-3607.

Watts, D. J.; and Strogatz, S. H. 1998. Collective dynamics of 'small-world' networks. *Nature*, 393(6684): 440-442.
Xu, X.; Yuruk, N.; Feng, Z.; and Schweiger, T. A. J. 2007. SCAN: A Structural Clustering Algorithm for Networks. In *KDD*, 824-833.

Yang, J.; and Leskovec, J. 2015. Defining and evaluating network communities based on ground-truth. *Knowl. Inf. Syst.*, 42(1): 181-213.

Yang, R.; Xiao, X.; Wei, Z.; Bhowmick, S. S.; Zhao, J.; and Li, R. 2019. Efficient Estimation of Heat Kernel PageRank for Local Clustering. In *SIGMOD*, 1339-1356.

Zhu, Z. A.; Lattanzi, S.; and Mirrokni, V. S. 2013. A Local Algorithm for Finding Well-Connected Clusters. In *ICML*, 396-404.