# online_sparsification_of_bipartitelike_clusters_in_graphs__9a0001ab.pdf Online Sparsification of Bipartite-Like Clusters in Graphs Joyentanuj Das 1 Suranjan De 1 He Sun 1 Graph clustering is an important algorithmic technique for analysing massive graphs, and has been widely applied in many research fields of data science. While the objective of most graph clustering algorithms is to find a vertex set of low conductance, a sequence of recent studies highlights the importance of the inter-connection between vertex sets when analysing real-world datasets. Following this line of research, in this work we study bipartite-like clusters and present efficient and online sparsification algorithms that find such clusters in both undirected graphs and directed ones. We conduct experimental studies on both synthetic and real-world datasets, and show that our algorithms significantly speedup the running time of existing clustering algorithms while preserving their effectiveness. 1. Introduction Graph clustering is a fundamental technique in data analysis with wide-ranging applications in machine learning and data science. A classical graph clustering problem involves partitioning the vertices of a graph into sets of highly connected vertices to minimize the normalised cut value. However, many real-world clustering tasks are defined by alternative objective functions, tailored to the specific needs and constraints of the problem at hand. One such example involves uncovering the vertex sets that are densely connected to each other, and these two vertex sets form a bipartite-like graph. For example, when representing the migration or trade datasets with a graph, a bipartitelike cluster captures a typical pattern of regional migration or trade (Cucuringu et al., 2020; Laenen & Sun, 2020; He et al., 2022), and the significance of bipartite-like clusters is further highlighted when studying many other real-world datasets (Bennett et al., 2022; Concas et al., 2022). 1School of Informatics, University of Edinburgh, Edinburgh, United Kingdom. Correspondence to: He Sun . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). Our Results. We first study bipartite-like clusters in undirected graphs, and present an algorithm that sparsifies an undirected graph while preserving its structure of bipartite-like clusters. Our algorithm can be implemented online, and directly applied to speed up the running time of existing algorithms that find bipartite-like clusters. Formally speaking, for an undirected G = (V, E) and a pair of disjoint and non-empty subsets A, B V , we define ϕG(A, B) 2w G(A, B) vol G(A B), where w G(A, B) X {u,v} E u A,v B is the cut value between A and B and vol G(A B) is the volume of A B defined by vol G(A B) = X {u,v} E u A B Notice that a high value of ϕG(A, B) implies that most edges adjacent to the vertices in A B are between A and B, and A and B form a bipartite-like cluster. Generalising this to multiple clusters, for every k N we define the k-way dual Cheeger constant by ρG(k) max (A1,B1),...,(Ak,Bk) min 1 i k ϕG(Ai, Bi), (1.1) where the maximum is taken over all the possible k pairs of subsets (A1, B1), . . . , (Ak, Bk) satisfying Ai Aj = , Bi Bj = , Ai Bj = for different i, j [k], and Ai Bi = for different i, j [k]. Notice that a high value of ρG(k) implies that G contains k bipartitelike clusters, in each of which the vertex sets Ai and Bi are densely connected to each other. We prove that, when G presents a clear structure of k bipartite-like clusters, this structure can be represented by a sparse subgraph G of G with e O(n) edges, and G can be constructed online in nearly-linear time1. Our result is as follows: 1We say that a graph algorithm runs in nearly-linear time if the algorithm s running time is O(m poly log n), where m and n are the number of edges and vertices of the input graph. For simplicity, we use e O( ) to hide a poly-logarithmic factor of n. Online Sparsification of Bipartite-Like Clusters in Graphs Theorem 1 (Result for undirected graphs). Let G = (VG, EG, w G) be an undirected and weighted graph of m edges, and assume that G contains k bipartite-like clusters (A1, B1), . . . , (Ak, Bk) corresponding to ρG(k). Then, there is an algorithm that runs in e O(m) time and computes a sparsifier G = (VG, F EG, ew), such that these k bipartite-like clusters are preserved in G with high probability. That is, it holds with high probability that ρG (k) = Ω( ρG(k)), and G contains only k bipartitelike clusters. Secondly, we study the bipartite-like clusters in directed graphs. Let G = (V G, E G, w G) be a digraph with weight function w G : E G R 0. For any vertex u V G, we use degout(u) P (u,v) E w G(u, v) and degin(u) P (v,u) E w G(v, u) to express the sum of weights of directed edges with u as the tail or the head, respectively. For any S V G, we define volout(S) P u S degout(u) and volin(S) P u S degin(u). For any two disjoint subsets A, B V G, we define ϕ G(A, B) by ϕ G(A, B) 2w G(A, B) volout(A) + volin(B), (1.2) where w G(A, B) X (u,v) E u A,v B is the sum of the weights of the edges from A to B. For every k N, the k-way directed dual Cheeger constant is defined by ρ G(k) max (A1,B1),...,(Ak,Bk) min 1 i k ϕ G(Ai, Bi), (1.3) where the maximum is taken over all the possible k pairs of subsets (A1, B1), . . . , (Ak, Bk) satisfying Ai Aj = , Bi Bj = , Ai Bj = for different i, j [k], Ai Bi = for any i [k]. By definition, a high value of ρ G(k) implies that G contains k bipartite-like clusters (A1, B1), . . . , (Ak, Bk) such that most edges with their tails in Ai have their head in Bi and conversely most edges with their head in Bi have their tail in Ai. We prove that, when G presents a structure of k bipartite-like clusters with respect to ρ G(k), this structure can be represented by a sparse graph G with e O(n) edges, and G can be constructed online in nearly-linear time: Theorem 2 (Result for directed graphs). Let G = (V G, E G, w G) be a directed and weighted graph of m edges, and assume that G contains k directly bipartite-like clusters (A1, B1), . . . , (Ak, Bk) with respect to ρ G(k). Then, there is an algorithm that runs in e O(m) time and computes a sparsifier G = (V G, F E G, ew), such that these k directed bipartite-like clusters of G are preserved in G with high probability. That is, it holds with high probability that ρ G (k) = Ω ρ G(k) , and G only contains k directed bipartite-like clusters. Now we examine the significance of Theorems 1 and 2. We first highlight that our algorithms preserve the cut values w(Ai, Bi) between the pairs of vertex sets Ai and Bi for 1 i k; this objective is different from the one for most graph sparsification problems, which only preserve the cut values between vertex set S and V \S. Secondly, our algorithms preserve k bipartite-like clusters, and the value of k in the output graph is the same as the input graph. Thirdly, our second result works for directed graphs; this result is very interesting on its own since most sparsification algorithms are only applicable for undirected graphs. Finally, while the design of most graph sparsification algorithms are based on Laplacian solvers making it unpractical, our designed algorithms only use random sampling. The design of our algorithms is based on several new reductions and sampling routines, and our algorithms can be implemented online with the degree oracles. As such one can run our algorithms online while exploring the underlying graph with existing local algorithms (e.g., (Andersen, 2010; Li & Peng, 2013)), resulting in direct improvement on the running time of the existing algorithms. To demonstrate this, we conduct experimental studies and show that our algorithms can be directly applied to significantly speed up the running time of the the ones presented in (Macgregor & Sun, 2021a), while preserving similar output results on both the synthetic and real-world datasets. Related Work. Bipartite-like clusters are widely studied in both theoretical computer science and machine learning communities. In theoretical computer science, Trevisan (2009) developed a spectral algorithm that finds a bipartitelike cluster in an undirected graph, and used this to design an approximation algorithm for the max-cut problem. This result is improved by Soto (2015). Liu (2015) studied the relationship between the k-way dual Cheeger constant and the eigenvalues of the normalised graph Laplacians, and developed a Cheer-type inequality. In the machine learning community, bipartite-like clusters are employed to model highly-correlated data items of different types, and algorithms finding these clusters are studied in different settings. Andersen (2010), Li & Peng (2013) and Macgregor & Sun (2021a) presented local algorithms that find bipartite-like clusters, and Macgregor & Sun (2021b) presented an algorithm that finds bipartite components in hypergraphs. Cucuringu et al. (2020) proved that densely connected clusters in a directed graph can be uncovered through spectral clustering on a complexvalued Hermitian matrix representation of directed graphs. Online Sparsification of Bipartite-Like Clusters in Graphs Neumann & Peng (2022) designed a sublinear-time oracle which, under a certain condition, correctly classified the membership of most vertices in a set of hidden planted ground-truth clusters in signed graphs. Our work relates to finding clusters in disassortative networks (Moore et al., 2011; Pei et al., 2019; Zhu et al., 2020), although most existing techniques are based on semi-supervised and global methods. Our work is also related to designing graph sparsification algorithms, e.g., (Spielman & Teng, 2011; Batson et al., 2012; Cohen et al., 2017; Lee & Sun, 2017; 2018). We highlight that, while a spectral sparsifier preserves the cut value w(S, V \ S) between any vertex set S and its complement V \S, our algorithms output preserves the cut value w(Ai, Bi) for pairs of vertex sets Ai and Bi. Moreover, our algorithms are much easier to implement, and work for directed graphs. 2. Preliminaries In this section we list the notation and preliminary results used in the analysis. Matrix Representation of Graphs. We always use G = (V, E, w) to represent an undirected and weighted graph with n vertices and weight function w : E R 0. The degree of any vertex u is defined as d G(u) = P u v w(u, v), where the notation u v represents that u and v are adjacent, i.e., {u, v} E(G). The normalised indicator vector of any S V is defined by d G(v) vol G(S) if v S, and χS(v) = 0 otherwise. Let AG be the adjacency matrix of G defined by (AG)u,v = w(u, v) if {u, v} E(G), and (AG)u,v = 0 otherwise. The degree matrix DG of G is a diagonal matrix defined by (DG)u,u = d G(u), and the normalised Laplacian of G is defined by LG = I D 1/2 G AGD 1/2 G . We can also write the normalised Laplacian matrix with respect to the indicator vectors of the vertices: for each vertex v, we define an indicator vector χv Rn by χv(u) = 1 dv if u = v, and χv(u) = 0 otherwise. We further define be = χu χv for each edge e = {u, v}, where the orientation of e is chosen arbitrarily. Then, we have e={u,v} E w(u, v) beb e. We also define JG I + D 1/2 G AGD 1/2 G . For any symmetric matrix A Rn n, let λ1(A) λ2(A) λn(A) be the eigenvalues of A. For ease of presentation, we always use 0 = λ1 λ2 λn 2 to express the eigenvalues of LG, with the corresponding orthonormal eigenvectors f1, f2, , fn. With slight abuse of notation, we use L 1 G for the pseudo-inverse of LG, i.e., 1 λi fif i . Note that when G is connected, it holds that λ2 > 0 and the matrix L 1 G is well defined. We sometimes drop the subscript G when it is clear from the context. For any x Rn we define x p Pn i=1 x2 i , and for any M Rn n we define M = max x Rn\{0} Mx Graph expansion and Cheeger inequality. For any undirected graph G, the expansion (or conductance) of any non-empty subset S V in G is defined as ϕG(S) w G(S, S) where S is the complement of S. We call subsets of vertices S1, S2, , Sk a k-way partition of G if Si = for all 1 i k, Si Sj = for i = j and Sk i=1 Si = V . For any k N, the k-way expansion constant is defined as ρG(k) = min S1,S2, ,Sk max 1 i k ϕG(Si), where the minimum is taken over all possible k-way partitions of G. Lee et al. (2014) proves the following higherorder Cheeger inequality: Lemma 3 (Higher-order Cheeger Inequality, (Lee et al., 2014)). It holds for any undirected graph G of n vertices and integer 1 k n that λk/2 ρG(k) Ck2p where C is a universal constant. Generalising this, Liu (2015) proves the following higherorder dual-Cheeger inequality: Lemma 4 (Higher-order dual-Cheeger Inequality, (Liu, 2015)). It holds for any undirected graph G of n vertices and integer 1 k n that (2 λn k+1)/2 1 ρG(k) Ck3p where C is a universal constant. The higher-order dual Cheeger inequality can be viewed as a quantitative version of the fact that λn k+1 = 2 if and only if G has at least k bipartite connected components. Online Sparsification of Bipartite-Like Clusters in Graphs 3. Proof of Theorem 1 In this section we present a nearly-linear time sparsification algorithm such that every bipartite-like cluster in an undirected graph G is approximately preserved in the sparsifed graph G , and sketch the proof. Our result is as follows: Theorem 5 (Formal Statement of Theorem 1). There exists a nearly-linear time algorithm that, given an input graph G = (V, E, w) with ρG(k) 1 log n for constant some k, with high probability computes a sparsifier G = (V, F E, ew) with |F| = O n log3 n 2 λn k edges such that the fol- lowing hold: (1) ρG (k) = Ω( ρG(k)); (2) λk+1(JG ) = Θ(λk+1(JG)). The first statement of Theorem 5 shows that the k bipartitelike clusters of G is approximately preserved in G , and together with Lemma 4 the second statement shows that the number of bipartite-like clusters in G and G is the same. Algorithm. Our algorithm is similar with (Sun & Zanetti, 2019) at a high level, and is based on sampling edges in G with carefully defined probabilities. Formally, for an input undirected graph G = (V, E, w G), the algorithm starts with G = (V, , ew) and samples every edge u v in G with probability pe pu(v)+pv(u) pu(v) pv(u), where min w G(u, v) C log3 n d G(u) (2 λn k), 1 , (3.1) for some constant C. For every sampled edge e = {u, v}, the algorithm adds e to graph G , and sets w G (e) = w G(e)/pe. Notice that, the choice of C only changes the sampling probability by a constant factor, and doesn t influence the asymptotic order of the sampled edges. Moreover, in practice we usually treat C log3 n 2 λn k as O(logc n) for a constant c, and this only influences the total number of sampled edges and the algorithm s running time by a polylogarithmic factor. Proof Sketch of Theorem 5. We first prove that the cut values between Ai and Bi in G is preserved in H for any 1 i k. For any edge e = {u, v}, we define the random variable Ye by Ye = w G(u, v)/pe with probability pe, and Ye = 0 otherwise. By defining X = w H(Ai, Bi), we prove that E[X] = w G(Ai, Bi) and e={u,v} u Ai,v Bi w(u, v) d G(u) + d G(v) Let {(Ai, Bi)}k i=1 be the optimal clusters corresponding to ρ(k). Then, we have for every 1 i k that ρG(k) ϕG(Ai, Bi) = 2w G(Ai, Bi) vol G(Ai Bi), which implies 2 vol G(Ai Bi) X e={u,v} u Ai,v Bi Applying the Chebyshev s inequality, we have for any constant c R+ that P [|X E[X]| c E[X]] E[X2] c2 E[X]2 2 (2 λn k) c2 C log3 n ρG(k)2 max e={u,v} u Ai,v Bi {d G(u) + d G(v)} vol G(Ai Bi)2 X e={u,v} u Ai,v Bi Since vol G(Ai Bi) = P u Ai d G(u)+P v Bi d G(v) and d G(u) = P u v w G(u, v), we have max e={u,v} u Ai,v Bi (d G(u) + d G(v)) u Ai d G(u) + X v Bi d G(v) = vol G(Ai Bi) and P e={u,v} u Ai,v Bi w G(u, v) vol G(Ai Bi). Applying these gives us that P [|X E[X]| c E[X]] 2(2 λn k) c2 C log3 n ρ(k)2 = O 1 log n Hence, by the union bound, we have that w H(Ai, Bi) = Ω(w G(Ai, Bi)) for all 1 i k. The proof of the second statement of Theorem 5 can be found in the appendix. Finally, the total number of edges in H follows by the definition of sampling probability and the Markov inequality. This completes the proof of Theorem 5. 4. Proof of Theorem 2 In this section we present a nearly-linear time sparsification algorithm such that every directed bipartite-like cluster in a directed graph is approximately preserved in the output sparsifier, and prove Theorem 2. Specifically, for a digraph G that contains exactly k pairs of (A1, B1), . . . , (Ak, Bk) with high values of ϕ G(Ai, Bi) for every 1 i k, our Online Sparsification of Bipartite-Like Clusters in Graphs objective is to construct a sparse digraph G , such that (i) the values of ϕ G (Ai, Bi) are high for every 1 i k and (ii) the number of such pairs in G is the same as G. Our result is as follows: Theorem 6 (Formal Statement of Theorem 2). There is a nearly-linear time algorithm that, given a directed and weighted graph G = (V G, E G, w G) with n vertices and k directed bipartite-like clusters satisfying ρ G(k) = 1 o(1/k) as input, with high probability computes a sparsifier G = (V G, F E G, ew) such that ρ G (k) = Ω ρ G(k) . Moreover, the total number of edges in the output graph is nearly-linear in n. Before sketching our technique, recall that, for undirected graphs, the value of k is proven to be identical for G and G by analysing the eigenvalues of JG and JG and applying the higher-order dual-Cheeger inequality (Lemma 4). However, a natural matrix representation for directed graphs could result in complex-valued eigenvalues, and there is no analogue of Lemma 4 for directed graphs. To overcome this, our developed algorithm is based on a novel reduction from a directed graph to an undirected one, and its reverse operation. Specifically, our designed algorithm consists of the following three steps: 1. for any input digraph G, the algorithm constructs an undirected graph H such that every directed bipartitelike cluster defined by (Ai, Bi) in G corresponds to a low-conductance set in H; 2. the algorithm constructs a sparsifier H of H, such that H and H have the same structure of clusters; 3. the algorithm applies the sparsified undirected graph H to construct a directed graph G of G that satisfies ρ G (k) = Ω ρ G(k) . See Figure 1 for illustration. G , ρ G (k) (H , ρH (k)) reverse semi-double cover semi-double graph sparsification graph sparsification Figure 1: A commutative diagram of our construction. To construct G from G, we construct graphs H and H and prove the close relationships between G, H, H , and G . Constructing H from G. Notice that, to preserve ϕ G (Ai, Bi), the cut values w(Ai, Bi) between Ai and Bi need to be approximately preserved in a sparsified directed graph; this objective is different from the most graph sparsification one, which only preserves the cut value between any set S and its complement. To overcome this, we construct an undirected graph H such that every bipartite-like cluster defined by (Ai, Bi) in G corresponds to a lowconductance set in H. Specifically, for a weighted digraph G = (V G, E G, w G), we construct its semi-double cover H = (VH, EH, w H) as follows: (1) every vertex v V G has two corresponding vertices v1, v2 VH; (2) for every edge (u, v) E G, we add the edge {u1, v2} in EH. See Figure 2 for illustration. a2 b2 c2 d2 a1 b1 c1 d1 Figure 2: Illustration of the semi-double cover construction. A directed graph of n vertices (left) corresponds to an undirected and bipartite graph of 2n vertices (right). Next we analyse the properties of the reduced graph. Let G be a directed graph with semi-double cover H. For any S V G, we define S1 VH and S2 VH by S1 {v1|v S} and S2 {v2|v S}. A subset S of VH is called simple if |{v1, v2} S| 1 holds for all v V G. The following lemma develops a relationship between the flow ratio from A to B defined by f G(A, B) 1 ϕ G(A, B) (4.1) and ΦH(A1 B2), for any A, B. Lemma 7. Let G be a directed graph with semi-double cover H. Then, it holds for any A, B V G that f G(A, B) = ϕH(A1 B2). Similarly, for any simple set S VH, let A = {u : u1 S} and B = {u : u2 S}. Then, it holds that f G(A, B) = ϕH(S). Lemma 7 proves a one-to-one correspondence between any bipartite-like cluster in G and a vertex set in H. Building on this, we prove that this one-to-one correspondence can be generalised between any k bipartite-like clusters in G and k disjoint vertex sets in H. Moreover, the structure of k bipartite-like clusters in G is preserved by a collection of k disjoint vertex sets of low conductance in H. Lemma 8. For any directed and weighted graph G = (V G, E G, w G) and k N, it holds that ρ G(k) = 1 min C1,...,Ck max 1 i k ϕH(Ci), (4.2) Online Sparsification of Bipartite-Like Clusters in Graphs where the minimum is taken over k disjoint simple subsets of VH defined by Ci = Ai1 Bi2 for 1 i k. Sparsification of H. Next we construct a sparse representation of H, denoted by H , such that the k vertex sets of low conductance is preserved in H . To achieve this, we apply the following result to construct a cluster-preserving sparsifier. Lemma 9 ((Sun & Zanetti, 2019)). There exists a nearlylinear time algorithm that, given a graph G = (V, E, w) with k clusters as input, with probability at least 9/10, computes a sparsifier H = (V, F E, ew) with |F| = O((1/λk+1) n log n) edges such that the following holds: (1) it holds for any 1 i k that ϕH(Si) = O(k ϕG(Si)), where S1, , Sk are the optimal clusters in G that achieves ρ(k); (2) λk+1(LH) = Ω(λk+1(LG)). Constructing G from H . Finally, we construct a directed graph G from H such that the original k directed bipartite-like clusters in G is preserved in G . To achieve this, we introduce the following reverse semi-double cover: Definition 10 (reverse semi-double cover). Given any double cover graph H = (VH , EH , w H ) as input, the reverse semi-double cover of H is a directed graph G = (V G , E G , w G ) constructed as follows: every pair of vertices u1 and u2 in VH corresponds to a vertex v V G ; we add an edge (u, v) to E G if there is edge {u1, v2} EH , and set w G (u, v) = w H (u1, v2). One might think that the reverse double cover plays an exact opposite role of the double cover, however it is not the case. In particular, while our constructed subsets C1, . . . , Ck in the first step are always simple in H (cf. Lemma 8), the k subsets corresponding to ρH(k) are not necessarily simple. As a result, min C1,...,Ck max 1 i k ϕH(Ci) = ρH(k) doesn t hold in general, and there is no direct correspondence between C1, . . . , Ck in H and the k directed bipartite-like clusters in G that correspond to ρ G (k). To analyse ρ G (k), for any set S VH we partition the set into two subsets S1 and S2 defined by S1 = S (Ai1 Bi2) and S2 = S (Ai2 Bi1). For example, following Figure 2, if Ai = {a, c} and Bi = {b, d} and the set S VH is S = {a1, b1, b2, c1, c2}, then we have S1 = {a1, b2, c1} and S2 = {b1, c2}. As Ai and Bi are densely connected in H, there are few edges within Ai and Bi for 1 i k. Hence, there are very few edges between S1 and S2 for any S VH. Without loss of generality, we assume that 2w H(S1, S2) w H(S1, S1) + w H(S2, S2) c for some constant c < 1. Simplifying the inequality above we get w H(S1, S1) + w H(S2, S2) 2w H(S1, S2) (1 c) w H(S1, S1) + w H(S2, S2) . Thus, for any not necessarily simple vertex set S VH we have ϕH(S) = w H(S, S) = w H(S1, S1) + w H(S2, S) 2w H(S1, S2) vol(S1) + vol(S2) (1 c) min w H(S1, S1) vol(S1) , w H(S2, S2) = (1 c) min {ϕH(S1), ϕH(S2)} , where the last inequality follows by the median inequality. Thus, for every S VH, there is a simple set T VH such that ϕH(S) (1 c) ϕH(T). Moreover, for any collection of k-disjoint sets S1, S2, , Sk, where Si VH we have a collection of k-disjoint simple sets T1, T2, , Tk, where Ti VH, such that max 1 i k ϕH(Si) (1 c) max 1 i k ϕH(Ti). Taking minimum over all such collection of k-disjoint subsets of VH gives us that min S1,S2, ,Sk max 1 i k ϕH(Si) = ρH(k) (1 c) min T1,T2, ,Tk max 1 i k ϕH(Ti), where in the second half of the inequality the minimum is taken over collection of k-disjoint simple subsets of VH. On one hand, rearranging the above inequality we have 1 1 c ρH(k) min T1,T2, ,Tk max 1 i k ϕH(Ti), (4.3) and on the other hand, since the collection of k-disjoint simple subsets of VH is a sub-collection of the collection of k-disjoint subsets of VH, we have min T1,T2, ,Tk max 1 i k ϕH(Ti) ρH(k). (4.4) Thus, combining (4.3) and (4.4), we have 1 1 c ρH(k) min T1,T2, ,Tk max 1 i k ϕH(Ti) ρH(k). (4.5) Further, combining (4.2) and (4.5) we have 1 1 1 c ρH(k) ρ G(k) 1 ρH(k). (4.6) Online Sparsification of Bipartite-Like Clusters in Graphs (a) Runtime comparison (b) Bipartiteness Ratio comparison Figure 3: Runtime and bipartiteness comparison between MS and our algorithm by fixing p = 0.3, q = 0.1p and varying the number of vertices between 500 and 2, 500 in each partition. Proof of Theorem 2. Now we are ready to prove Theorem 2. Since G is a directed graph with k bipartitelike clusters, the value of ρ G(k) is high; together with (4.6), this implies that ρH(k) = o(1). By Lemma 9, we know that there exists a sparsifier H of H, such that ρH (k) = O(k ρH(k)). Thus, we can conclude that ρH (k) = o(1). Hence, applying (4.6) for G and H 1 1 1 c ρH (k) ρ G (k) 1 ρH (k). (4.7) Finally, using the fact that ρH (k) = o(1), we conclude that ρ G (k) is close to 1 and hence the structure of G will be preserved in G . Moreover, by the construction of H, and H , and G , the value of k is preserved. For the running time, notice that all the intermediate graphs H and H can be constructed locally, and it s sufficient to examine every edge of the input graph G once throughout the execution of the algorithm. This implies the nearlylinear running time of our overall algorithm. Combining everything above above proves Theorem 2. 5. Experiments We evaluate the performance of our proposed algorithms on synthetic and real-world datasets. We employ the algorithms presented in (Macgregor & Sun, 2021a) as the baseline algorithms, and examine the speedup of their algorithms when applying our sparsification algorithms as subroutines. Notice that, as all the involved operations of our algorithms can be performed locally, one can run our graph sparsification algorithms online while exploring the underlying graph with a local algorithm. For ease of presentation, in this section we call the local algorithm in (Macgregor & Sun, 2021a) with our sparsification framework our algorithm. All experiments were performed on a HP ZBook Studio with 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz processor and 32 GB of RAM. Our code can be downloaded from https://github.com/suranjande4/Online Sparsification-of-Bipartite-Like-Clusters-in-Graphs. 5.1. Results for Undirected Graphs Synthetic Dataset. We compare the performance of our algorithm with the LOCBIPARTDC algorithm presented in (Macgregor & Sun, 2021a), which we refer to as MS, on synthetic graphs generated from the stochastic block model (SBM). Specifically, we assume that the graph has k = 2 clusters, say C1, C2, and the number of vertices in each cluster, denoted by n1 and n2 respectively, satisfies n1 = n2. Moreover, any pair of vertices u Ci and v Cj is connected with probability pij. We assume that p12 = p21 = p and p11 = p22 = q, where q = 0.1p. Throughout the experiments, we leave the parameters n and p free but maintain the above relations. Our algorithm sparsifies the underlying graph and simultaneously applies the MS algorithm. We evaluate the quality of the output (L, R) returned by each algorithm with respect to its bipartiteness ratio defined by β(L, R) = 1 ϕ(L, R). All our reported results are the average performance of each algorithm over 10 runs, in which a random vertex from C1 C2 is chosen as the starting vertex of the algorithm. We generate graphs from the SBM such that q = 0.1p and vary the size of the target set by varying n1 between 500 and 2, 500. In Figure 3, we fix the probability p = 0.3 and vary the number of vertices n1 = n2 and compare both runtime and the bipartiteness ratio between the MS algorithm and our algorithm. One can observe that for a fixed probability p as we increase the number of vertices, our algorithm takes much less time than the MS algorithm and maintains a similar bipartiteness ratio with the MS algorithm. Online Sparsification of Bipartite-Like Clusters in Graphs Table 1: Comparison of MS with our algorithm on the Militarised Interstate Disputes Dataset. We use the vertices corresponding to the listed countries in the first column as the seed vertex of the local algorithm. COUNTRY NAME MS MS OUR ALGO. OUR ALGO. RUNTIME BIPARTITENESS RUNTIME BIPARTITENESS USA 0.034 0.292 0.0044 0.285 NETHERLANDS 0.0351 0.307 0.0042 0.281 LITHUANIA 0.0336 0.303 0.0043 0.165 (a) Runtime comparison (b) Flow-ratio comparison Figure 4: Runtime and flow-ratio comparison between ECD and our algorithm by fixing η = 0.7 and varying the number of vertices in each partition from 500 to 2, 500. Real-world Dataset. We evaluate the performance of our algorithm on the Dyadic Militarised Interstate Disputes Dataset (v3.1) (Maoz et al., 2019), which records every interstate dispute during 1900 1950, including the level of hostility resulting from the dispute and the number of casualties. We construct a graph from the dataset as follows: every country is represented by a vertex; two vertices are connected by an edge with weight 30 if there is a war between the corresponding countries, and the two vertices are connected by an edge with weight 1 if the corresponding countries have other dispute which is not part of an interstate war. We set γ = 0.02 for the MS algorithm, and Table 1 compares the performance of the MS algorithm with ours. This shows that our algorithm takes much less time than the MS algorithm and maintains a similar bipartiteness ratio. 5.2. Results for Directed Graphs Synthetic Dataset. We compare the performance of our algorithm with the EVOCUTDIRECTED algorithm presented in (Macgregor & Sun, 2021a), which we refer to as ECD, and use the graphs generated from the SBM as the algorithms input. In our algorithm, given a digraph G as input, we sparsify the graph along with generating the volumebiased ESP on G s semi-double cover H. Since the ECD is a local algorithm, we also test our algorithm locally. In this model, we look into a cluster which is almost bipartite with the bipartition being L and R. We set the number of vertices in L and R to be n1 and n2 such that n1 = n2 and the probability of assigning an edge is defined by L R L 9/n1 η R 1 η 9/n2 , i.e., the probability that there is an edge within the partition is 9/n1 = 9/n2 and so on. As most of our directed edges are from L to R, the value of η is high. For our experiments we generate two sets of plots: We fix the value of η = 0.7 and increase the number of vertices in each partition from 500 to 2, 500, and compare the runtime of ECD and our algorithm. As reported in Figure 4, our algorithm takes much less time than the ECD algorithm and gives a similar flowratio at the same time as we increase the number of vertices. We increase the number of vertices in each partition from 500 to 2, 500, increase the value of η from 0.7 to 0.9, and compare the runtime of the ECD algorithm and our algorithm. As reported in Figure 5, our algorithm runs faster than the ECD algorithm as η increases. Real-world Dataset. Now we evaluate the algorithms performance on the US Migration Dataset (U.S. Census Bureau, 2000). We construct the digraph from the dataset as follows: every county in the mainland USA is represented Online Sparsification of Bipartite-Like Clusters in Graphs (a) η = 0.7 (b) for η = 0.8 (c) η = 0.9 Figure 5: Runtime comparison between ECD and our algorithm for η = 0.7, 0.8, and 0.9 respectively. Table 2: Comparison of ECD with our algorithm on real-world datasets. We use the vertices corresponding to the listed counties in the first column as the seed vertex of the local algorithm. COUNTY NAME TARGET ϕ ECD ECD OUR ALGO. OUR ALGO. RUNTIME FLOW-RATIO RUNTIME FLOW-RATIO Maricopa County 0.2 20.661 0.414 13.434 0.417 Virginia Beach City 0.2 15.31 0.546 12.29 0.621 Kanawha county 0.2 9.318 0.33 8.483 0.33 by a vertex; for any vertices i, j, the edge weight of is given by |(Mi,j Mj,i)/(Mi,j + Mj,i)|, where Mi,j is the number of people who migrated from county to county between 1995 and 2000; in addition, the direction of (i, j) is set to be from i to j if Mi,j > Mj,i, otherwise the direction is set to be the opposite. We compare the output of ECD and the output of ECD when applying our sparsification algorithm as subroutine. Furthermore, we use the vertices corresponding to different counties as the input of the local algorithm ECD. As shown in Table 2, with our developed algorithm the local ECD algorithm achieves roughly the same flow ratio, and our sparsification procedure significantly speeds up the total running time of the algorithm. Moreover, the runtime speedup is more significant when the local algorithm explores more parts of the input graph. In conclusion, these experimental studies demonstrate that our developed algorithms can be directly employed to speed up the running time of existing algorithms that find bipartite-like clusters, and can be widely applied when analysing datasets of various domains. We believe that our developed techniques and algorithms will motivate future and fruitful studies for analysing complex cluster structures of graphs. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Acknowledgements The first and third authors of the paper are supported by EPSRC Early Career Fellowship (EP/T00729X/1). Andersen, R. A local algorithm for finding dense subgraphs. ACM Transactions on Algorithms (TALG), 6(4): 1 12, 2010. Batson, J. D., Spielman, D. A., and Srivastava, N. Twiceramanujan sparsifiers. SIAM J. Comput., 41(6):1704 1721, 2012. Bennett, S., Mihai, C., and Gesine, R. Lead lag detection and network clustering for multivariate time series with an application to the US equity market. Machine Learning, 111:4497 4538, 2022. Cohen, M. B., Kelner, J. A., Peebles, J., Peng, R., Rao, A. B., Sidford, A., and Vladu, A. Almost-linear-time algorithms for markov chains and new spectral primitives for directed graphs. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 410 419, 2017. Concas, A., Fenu, C., Reichel, L., Rodriguez, G., and Yunzi, Z. Chained structure of directed graphs with applications to social and transportation networks. Applied Network Science, 7(64), 2022. Cucuringu, M., Li, H., Sun, H., and Zanetti, L. Hermitian matrices for clustering directed graphs: insights and applications. In The 23rd International Conference on Artificial Intelligence and Statistics, 2020. Online Sparsification of Bipartite-Like Clusters in Graphs He, Y., Reinert, G., and Cucuringu, M. DIGRAC: digraph clustering based on flow imbalance. In Learning on Graphs Conference, Proceedings of Machine Learning Research, pp. 21, 2022. Laenen, S. and Sun, H. Higher-order spectral clustering of directed graphs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020. Lee, J. R., Gharan, S. O., and Trevisan, L. Multiway spectral partitioning and higher-order cheeger inequalities. Journal of the ACM, 61(6):1 30, 2014. Lee, Y. T. and Sun, H. An sdp-based algorithm for linearsized spectral sparsification. In 49th Annual ACM Symposium on Theory of Computing, pp. 678 687, 2017. Lee, Y. T. and Sun, H. Constructing linear-sized spectral sparsification in almost-linear time. SIAM J. Comput., 47(6):2315 2336, 2018. Li, A. and Peng, P. Detecting and characterizing small dense bipartite-like subgraphs by the bipartiteness ratio measure. In 24th International Symposium on Algorithms and Computation (ISAAC 13), pp. 655 665, 2013. Liu, S. Multi-way dual cheeger constants and spectral bounds of graphs. Advances in Mathematics, 268:306 338, 2015. Macgregor, P. and Sun, H. Local algorithms for finding densely connected clusters. In 38th International Conference on Machine Learning, pp. 7268 7278, 2021a. Macgregor, P. and Sun, H. Finding bipartite components in hypergraphs. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pp. 7912 7923, 2021b. Maoz, Z., Johnson, P. L., Kaplan, J., Ogunkoya, F., and Shreve, A. P. The dyadic militarized interstate disputes (MIDs) dataset version 3.0: Logic, characteristics, and comparisons to alternative datasets. Journal of Conflict Resolution, 63(3):811 835, 2019. URL https: //correlatesofwar.org/. Moore, C., Yan, X., Zhu, Y., Rouquier, J.-B., and Lane, T. Active learning for node classification in assortative and disassortative networks. In 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 11), pp. 841 849, 2011. Neumann, S. and Peng, P. Sublinear-time clustering oracle for signed graphs. In International Conference on Machine Learning, pp. 16496 16528, 2022. Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. Geom-GCN: Geometric Graph Convolutional Networks. In 7th International Conference on Learning Representations (ICLR 19), 2019. Soto, J. A. Improved analysis of a max-cut algorithm based on spectral partitioning. SIAM Journal on Discrete Mathematics, 29(1):259 268, 2015. Spielman, D. A. and Teng, S. Spectral sparsification of graphs. SIAM J. Comput., 40(4):981 1025, 2011. Sun, H. and Zanetti, L. Distributed graph clustering and sparsification. ACM Trans. Parallel Comput., 6(3):17:1 17:23, 2019. Trevisan, L. Max cut and the smallest eigenvalue. In 41st Annual ACM Symposium on Theory of Computing, pp. 263 272, 2009. U.S. Census Bureau. United states 2000 census. https://web.archive.org/web/ 20150905081016/https://www.census. gov/population/www/cen2000/commuting/, 2000. Accessed: January 2021. Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., and Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. In 34th Advances in Neural Information Processing Systems (Neur IPS 20), 2020. Online Sparsification of Bipartite-Like Clusters in Graphs A. Useful Inequalities The following inequalities will be used in our analysis. Theorem 11 (Courant-Fischer Theorem). Let A be a n n symmetric matrix with eigenvalues λ1 λ2 λn. Then, it holds for any 1 k n that λk = min S dim(S)=k max y S\{0} y A y = max S dim(S)=n k+1 min y S\{0} y A y where the maximisation and minimisation are over the subspaces of Rn. Lemma 12 (Bernstein s Inequality). Let X1, X2, , Xn be independent random variables such that |Xi| M for any 1 i n. Let X = Pn i=1 Xi, and R = Pn i=1 E X2 i . Then, it holds that P [|X E[X]| t] 2 exp Lemma 13 (Matrix Chernoff Bound). Consider a finite sequence {Xi} of independent, random, PSD matrices of dimension d that satisfy Xi R. Let µmin = λmin (E[P i Xi]) and µmax = λmax (E[P i Xi]). Then, it holds that for δ [0, 1], and (1 + δ)µmax B. Omitted Detail from Section 3 This section presents all the omitted detail from Section 3, and gives a complete proof of Theorem 5. We first recall that, for every vertex u and its adjacent vertex v, the algorithm assigns the edge e = {u, v} the probability pu(v) min w G(u, v) C log3 n d G(u) (2 λn k), 1 , (B.1) for a large enough constant C R 0. The algorithm checks every edge and samples an edge e = {u, v} with probability pe, where pe pu(v) + pv(u) pu(v) pv(u). Note that, it is easy to check that pe satisfies the inequality 1 2(pu(v) + pv(u)) pe pu(v) + pv(u). We start with an empty set F and gradually store all the sampled edges in F, which is sampled by the algorithm. Finally, the algorithm returns a weighted graph H = (V, F, w H), where the weight w H(u, v) of every sampled edge e = {u, v} F is defined by w H(u, v) = w G(u, v) Next, we analyze the size of F. Since e={u,v} w G(u, v) C log3 n d G(u) (2 λn k) = O n log3 n Online Sparsification of Bipartite-Like Clusters in Graphs it holds by Markov inequality that with constant probability the number of edges e = {u, v} with pu(v) 1 is O n log3 n 2 λn k . Without loss of generality, we assume that these edges are in F, and in the remaining part of the proof we assume it holds for any edge u v that w G(u, v) C log3 n d G(u) (2 λn k) < 1. Then, the expected number of edges in H equals X e={u,v} pe X e={u,v} pu(v) + pv(u) e={u,v} w(u, v) 1 d G(u) + 1 d G(v) = O n log3 n and by Markov inequality it holds with constant probability that |F| = O n log3 n Now we show that the cut value between Ai and Bi is preserved in H for all 1 i k. For any edge e = {u, v}, we define the random variable Ye by pe with probability pe, 0 otherwise. (B.2) Also, we define X = w H(Ai, Bi), and have that e={u,v} u Ai,v Bi e={u,v} u Ai,v Bi pe w G(u, v) e={u,v} u Ai,v Bi w G(u, v) = w G(Ai, Bi). (B.3) Next, we analyse the second moment of the random variable X and have that e={u,v} u Ai,v Bi pe w G(u, v) e={u,v} u Ai,v Bi e={u,v} u Ai,v Bi 2w G(u, v)2 pu(v) + pv(u) e={u,v} u Ai,v Bi 2w G(u, v)2 w G(u,v) C log3 n (2 λn k) 1 d G(u) + 1 d G(v) e={u,v} u Ai,v Bi w(u, v) d G(u) + d G(v) Online Sparsification of Bipartite-Like Clusters in Graphs where the last step follows by the means inequality. Let {(Ai, Bi)}k i=1 be the optimal cluster where ρ(k) is attained for graph G. Recall that for every k N, the k-way dual Cheeger constant is defined by ρG(k) = max (A1,B1), ,(Ak,Bk) min 1 i k ϕG(Ai, Bi). Then, we have for every 1 i k that ρG(k) ϕG(Ai, Bi) = 2w G(Ai, Bi) vol G(Ai Bi), which implies ρG(k) 2 vol G(Ai Bi) X e={u,v} u Ai,v Bi w G(u, v). (B.5) Next, by the Chebyshev s inequality we have for any constant c R+ that P [|X E[X]| c E[X]] E[X2] c2 E[X]2 2 λn k C log3 n P e={u,v} u Ai,v Bi w G(u, v) d G(u)+d G(v) c2 P e={u,v} u Ai,v Bi w G(u, v) 2 2 λn k C log3 n P e={u,v} u Ai,v Bi w G(u, v) d G(u)+d G(v) 2 vol G(Ai Bi) 2 = 2 (2 λn k) c2 C log3 n ρG(k)2 P e={u,v} u Ai,v Bi w G(u, v) (d G(u) + d G(v)) vol G(Ai Bi)2 2 (2 λn k) c2 C log3 n ρG(k)2 max e={u,v} u Ai,v Bi {d G(u) + d G(v)} P e={u,v} u Ai,v Bi w G(u, v) vol G(Ai Bi)2 . Since vol G(Ai Bi) = P u Ai d G(u) + P v Bi d G(v) and d G(u) = P u v w G(u, v), we have max e={u,v} u Ai,v Bi {d G(u) + d G(v)} X u Ai d G(u) + X v Bi d G(v) = vol G(Ai Bi) e={u,v} u Ai,v Bi w G(u, v) vol G(Ai Bi). Thus, we have by (B.6) and the assumption of ρ(k) 1 log(n) that P [|X E[X]| c E[X]] 2(2 λn k) c2 C log3 n ρ(k)2 = O 1 log n Online Sparsification of Bipartite-Like Clusters in Graphs Hence, by choosing a sufficient large constant c and the union bound, we have that w H(Ai, Bi) = Ω(w G(Ai, Bi)) for all 1 i k. (B.7) Next, we show that the degree of every vertex in H is approximately preserved with high probability. Based on the random variable Ye defined in (B.2), we define the random variable Zu by Then, the expected value of Zu is given by e:v u E[Ye] = X e:v u pe w G(u, v) v:v u w G(u, v) = d G(u), and the second moment can be upper bounded by e:v u E Y 2 e = X e:v u pe w G(u, v) since pe pu(v). Now using the value of pu(v) from (3.1), we have e:v u E Y 2 e X v:v u w(u, v)2 d G(u) (2 λn k) w(u, v) C log3 n = d G(u) (2 λn k) v:v u w G(u, v) = d2 G(u) (2 λn k) and for any edge e = {u, v} we have that pu(v) d G(u) (2 λn k) Now, applying Bernstein s inequality (Lemma 12), we have P |d H(u) d G(u)| du = P |Zu E[Zu]| E[Zu] d2 G(u) (2 λn k) C log3 n + 1 6 d2 G(u) (2 λn k) 1 8 C log3 n 7 6 (2 λn k) Online Sparsification of Bipartite-Like Clusters in Graphs Hence, it holds by the union bound that, with high probability, the degree of all the vertices in H are approximately preserved up to a constant factor. This implies that for any subset S V , we have vol H(S) = Θ (vol G(S)) , more specifically, vol H(Ai Bi) = Θ (vol G(Ai Bi)) , (B.8) for all 1 i k. Thus, combining (B.7) and (B.8) gives us that ϕH(Ai, Bi) = Ω ϕG(Ai, Bi) (B.9) for all 1 i k, which implies that ρH(k) min 1 i k ϕH(Ai, Bi) = min 1 i k Ω = Ω( ρG(k)) , where the last equality follows from the fact that {(Ai, Bi)}k i=1 is the optimal cluster where ρ(k) is attained for graph G. Next, we show that the top (n k)-eigenspaces of JG are preserved in H. Without loss of generality we assume the graph is connected. Since JG = 2I LG by definition, it holds that λi(JG) = 2 λn+1 i(LG). (B.10) i=1 (2 λi(LG))fif i , and with slight abuse of notation we call P 1/2 as the square root of the pseudo-inverse of P, i.e., i=1 (2 λi(LG)) 1/2fif i . Let P be the projection on the span of {f1, f2, , fn k}, then i=1 fif i . Recall that, for each vertex v, the indicator vector χv Rn is defined by χv(u) = 1 d G(v) if u = v and χv(u) = 0 otherwise. For each edge e = {u, v} of G we define a vector ge = χu + χv Rn and a random matrix Xe Rn n by w H(u, v) P 1/2geg e P 1/2 if e = {u, v} is sampled by the algorithm, 0 otherwise. (B.11) Then, it holds that X e={u,v} F w H(u, v) P 1/2geg e P 1/2 e={u,v} F w H(u, v) geg e = P 1/2J HP 1/2, Online Sparsification of Bipartite-Like Clusters in Graphs where J H X e={u,v} F w H(u, v) geg e is the signless Laplacian matrix of H normalised with respect to the degree of the vertices in the original graph G. We will now prove that, with high probability the top n k eigenspaces of J H and JG are approximately the same. We first analyse the expectation of P e E Xe, and have that e={u,v} E pe w H(u, v) P 1/2geg e P 1/2 e={u,v} E pe w G(u, v) pe P 1/2geg e P 1/2 e={u,v} F w G(u, v) geg e = P 1/2JGP 1/2 = i=1 fif i = P. Moreover, for any edge e = {u, v} E sampled by the algorithm, we have Xe w H(u, v) g e P 1/2P 1/2ge = w G(u, v) pe g e P 1ge pe 1 2 λn k ge 2 2w G(u, v) pu(v) + pv(u) 1 2 λn k 1 d G(u) + 1 d G(v) 2 C log3 n, where the second inequality follows by the min-max theorem of eigenvalues. Now we apply the matrix Chernoff bound (Lemma 13) to analyze the eigenvalues of P e E Xe. Following Lemma 13 we set the parameters as follows: µmax = λmax R = 2 C log3 n, and Then using the Matrix Chernoff bound (Lemma 13), we have for some constant C; this implies that On the other hand, since E P e E Xe = P, we have µmin = 1 and hence keeping R and δ the same as above, using the Matrix Chernoff bound (Lemma 13), we get Online Sparsification of Bipartite-Like Clusters in Graphs this implies that Combining (B.13), (B.14) and the fact that P e E Xe = P 1/2J HP 1/2, with probability 1 O 1 n3 it holds for any non-zero x Rn in span{f1, f2, , fn k} that x P 1/2J HP 1/2x x x 1 Let y = P 1/2x, and we rewrite (B.15) as y J Hy y Py = y J Hy y y y y Since dim(span{f1, f2, , fn k}) = n k, there exist n k orthogonal vectors whose Rayleigh quotient with respect to J H is Θ(λn k(2I LG)). Hence, by the Courant-Fischer Theorem (Theorem 11) we have 1 2 λn k(2I LG) λk+1(J H) 3 2 λn k(2I LG) (B.16) By the definition of J H = D 1/2 G (DH + AH) D 1/2 G , we have JH = D 1/2 H (DH + AH) D 1/2 H = D 1/2 H D1/2 G J H D1/2 G D 1/2 H . Hence, we set y = D1/2 G D 1/2 H x for any x Rn and have that x x = x D 1/2 H D1/2 G J H D1/2 G D 1/2 H x = y J H y x x 1 2 y J H y y y , where we use the fact that the degree of a vertex differs by a constant factor between H and G. Similarly, we also have 2 y J H y y y , (B.18) Let T Rn be a (k + 1)-dimensional subspace of Rn satisfying λk+1(JH) = max x =0,x T x JH x and e T = n D1/2 G D 1/2 H x : x T o . Since D1/2 G D 1/2 H has full rank, e T is also a (k + 1)-dimensional subspace of Rn. Hence, by the Courant-Fischer Theorem (Theorem 11) and (B.17), we have that λk+1(J H) = min S dim(S)=k+1 max y S\{0} y J H y y y max y e T \{0} y J H y y y 2 max x T \{0} x JH x x x = 2 λk+1(JH). Online Sparsification of Bipartite-Like Clusters in Graphs Next, using (B.16) and (B.19), we have 1 2 λk+1(JG) λk+1(J H) 2 λk+1(JH), which implies that 1 4 λk+1(JG) λk+1(JH). (B.20) Similarly, let U Rn be an (n k)-dimensional subspace of Rn satisfying λk+1(JH) = min x =0,x U x JH x and e U = n D1/2 G D 1/2 H x : x U o . Since D1/2 G D 1/2 H has full rank, e U is also an (n k)-dimensional subspace of Rn. Thus, using the Courant-Fischer Theorem (Theorem 11) and (B.18), we have λk+1(J H) = max S dim(S)=n k min y S\{0} y J H y y y min y e U\{0} y J H y y y 3 min x U\{0} x (2I LH) x 3 λk+1 (JH) . Next, by (B.16) and (B.21) we have 2 3 λk+1(JH) γk+1(L H) 3 2 λk+1(JG), which implies that 4 λk+1(JG). (B.22) Thus, combining (B.20) and (B.22) we have 1 4 λk+1(JG) λk+1(JH) 9 4 λk+1(JG), Hence, the the top n k eigenspaces of JG are preserved in JH. This proves the second statement of the theorem. C. Omitted Detail from Section 4 In this section we list all the proofs omitted from Section 4. Proof of Lemma 7. The proof follows from (Macgregor & Sun, 2021a), which proves the result for undirected graphs. We include the proof here for completeness. Let S = A1 B2 in H, then ϕH(A1 B2) = ϕH(S) = w H(S, V \ S) = vol H(S) 2w H(S, S) = 1 2w H(S, S) = 1 2w G(A, B) volout(A) + volin(B) = f G(A, B). This proves the first statement of the lemma. The second statement of the lemma follows by the similar argument. Online Sparsification of Bipartite-Like Clusters in Graphs Proof of Lemma 8. By definition, we have that f G(A, B) = 1 ϕ G(A, B), (C.2) and this implies that ρ G(k) = max (A1,B1),...,(Ak,Bk) min 1 i k ϕ G(Ai, Bi) = max (A1,B1),...,(Ak,Bk) min 1 i k 1 f G(Ai, Bi) = 1 min (A1,B1),...,(Ak,Bk) max 1 i k f G(Ai, Bi) = 1 min C1,...,Ck max 1 i k ϕH(Ci), where the second line follow by (C.2), and the last one follows by Lemma 7 and Ci = Ai1 Bi2.