# higherorder_spectral_clustering_of_directed_graphs__abada8cf.pdf

Higher-Order Spectral Clustering of Directed Graphs

Steinar Laenen Oxford Research Group Five AI steinar.laenen@five.ai

He Sun School of Informatics University of Edinburgh h.sun@ed.ac.uk

Clustering is an important topic in algorithms, and has a number of applications in machine learning, computer vision, statistics, and several other research disciplines. Traditional objectives of graph clustering are to ﬁnd clusters with low conductance. Not only are these objectives just applicable for undirected graphs, they are also incapable to take the relationships between clusters into account, which could be crucial for many applications. To overcome these downsides, we study directed graphs (digraphs) whose clusters exhibit further structural information amongst each other. Based on the Hermitian matrix representation of digraphs, we present a nearly-linear time algorithm for digraph clustering, and further show that our proposed algorithm can be implemented in sublinear time under reasonable assumptions. The signiﬁcance of our theoretical work is demonstrated by extensive experimental results on the UN Comtrade Dataset: the output clustering of our algorithm exhibits not only how the clusters (sets of countries) relate to each other with respect to their import and export records, but also how these clusters evolve over time, in accordance with known facts in international trade.

1 Introduction

Clustering is one of the most fundamental problems in algorithms and has applications in many research ﬁelds including machine learning, network analysis, and statistics. Data can often be represented by a graph (e.g., users in a social network, servers in a communication network), and this makes graph clustering a natural choice to analyse these datasets. Over the past three decades, most studies on undirected graph clustering have focused on the task of partitioning with respect to the edge densities, i.e., vertices form a cluster if they are better connected to each other than to the rest of the graph. The well-known normalised cut value [25] and graph conductance [20] capture these classical deﬁnitions of clusters, and have become the objective functions of most undirected graph clustering algorithms.

While the design of these algorithms has received a lot of research attention from both theoretical and applied research areas, these algorithms are usually unable to uncover higher-order structural information among clusters in directed graphs (digraphs). For example, let us look at the international oil trade network [29], which employs digraphs to represent how mineral fuels and oils are imported and exported between countries. Although this highly connected digraph presents little cluster structure with respect to a typical objective function of undirected graph clustering, from an economic point of view this digraph clearly exhibits a structure of clusters: there is a cluster of countries mainly exporting oil, a cluster mainly importing oil, and several clusters in the middle of this trade chain. All these clusters are characterised by the imbalance of the edge directions between clusters, and further present a clear ordering reﬂecting the overall trade pattern. This type of structure is not only found in trade data, but also in many other types of data such as migration data and infectious disease

steinar9@gmail.com

34th Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada.

spreading data. We view these types of patterns as a higher-order structure among the clusters and, in our point of view, this structural information could be as important as the individual clusters themselves.

Our contribution. In this work we study clustering algorithms for digraphs whose cluster structure is deﬁned with respect to the imbalance of edge densities as well as the edge directions between the clusters. Formally, for any set of vertices S0, . . . , Sk 1 that forms a partition of the vertex set V (G) of a digraph G, we deﬁne the ﬂow ratio of {Sj}k 1 j=0 by

w(Sj, Sj 1) vol(Sj) + vol(Sj 1),

where w(S, T) P (u,v) E u S,v T w(u, v) is the cut value from S V to T V and vol(S) is the sum

of degrees of the vertices in S. We say that {Sj}k 1 j=0 forms an optimal partition if this {Sj}k 1 j=0 maximises the ﬂow ratio over all possible partitions. By introducing a complex-valued representation of the graph Laplacian matrix LG, we show that this optimal partition {Sj}k 1 j=0 is well embedded into the bottom eigenspace of LG. To further exploit this novel and intriguing connection, we show that an approximate partition with bounded approximation guarantee can be computed in time nearly-linear in the number of edges of the input graph. In the settings for which the degrees of the vertices are known in advance, we also present a sub-linear time implementation of the algorithm. The signiﬁcance of our work is further demonstrated by experimental results on several synthetic and real-world datasets. In particular, on the UN Comtrade dataset our clustering results are well supported by the literature from other research ﬁelds. At the technical level, our analysis could be viewed as a hybrid between the proof of the Cheeger inequality [6] and the analysis of spectral clustering for undirected graphs [23], as well as a sequence of recent work on fast constructions of graph sparsiﬁcation (e.g., [26]). We believe our analysis for the new Hermitian Laplacian LG could inspire future research on studying the clusters higher-order structure using spectral methods.

Related work. There is a rich literature on spectral algorithms for graph clustering. For undirected graph clustering, the works most related to ours are [23, 25, 30]. For digraph clustering, [24] proposes to perform spectral clustering on the symmetrised matrix A = M M + MM of the input graph s adjacency matrix M; [9] initiates the studies of spectral clustering on complex-valued Hermitian matrix representations of digraphs, however their theoretical analysis only holds for digraphs generated from the stochastic block model. Our work is also linked to analysing higher-order structures of clusters in undirected graphs [4, 5, 31], and community detection in digraphs [7, 21]. The main takeaway is that there is no previous work which analyses digraph spectral clustering algorithms to uncover the higher-order structure of clusters in a general digraph.

2 Preliminaries

Throughout the paper, we always assume that G = (V, E, w) is a digraph with n vertices, m edges, and weight function w : V V R 0. We write u v if there is an edge from u to v in the graph. For any vertex u, the in-degree and out-degree of u are deﬁned as din u P

v:v u w(v, u) and dout u P

v:u v w(u, v), respectively. We further deﬁne the total degree of u by du din u + dout u , and deﬁne vol(S) P u S du for any S V . For any set of vertices S and T, the symmetric difference between S and T is deﬁned by S T (S \ T) (T \ S).

Given any digraph G as input, we use M Rn n to denote the adjacency matrix of G, where Mu,v = w(u, v) if there is an edge u v, and Mu,v = 0 otherwise. We use A Cn n to represent the Hermitian adjacency matrix of G, where Au,v = Av,u = w(u, v) ω 2πk if u v, and Au,v = 0 otherwise. Here, ω 2πk is the 2πk -th root of unity, and x is the conjugate of x. The normalised Laplacian matrix of G is deﬁned by LG I D 1/2AD 1/2, where the degree matrix D Rn n is deﬁned by Du,u = du, and Du,v = 0 for any u = v. We sometimes drop the subscript G if the underlying graph is clear from the context.

For any Hermitian matrix A Cn n and non-zero vector x Cn, the Rayleigh quotient R(A, x) is deﬁned as R(A, x) x Ax/x x, where x is the complex conjugate transpose of x Cn. For any

Hermitian matrix B Cn n, let λ1(B) . . . λn(B) be the eigenvalues of B with corresponding eigenvectors f1, . . . , fn, where fj Cn for any 1 j n.

3 Encoding the ﬂow-structure into LG s bottom eigenspace

Now we study the structure of clusters with respect to their ﬂow imbalance, and their relation to the bottom eigenspace of the normalised Hermitian Laplacian matrix. For any set of vertices S0, . . . , Sk 1, we say that S0, . . . , Sk 1 form a k-way partition of V (G), if it holds that S

0 j k 1 Sj = V (G) and Sj Sℓ= for any j = ℓ. As discussed in Section 1, the primary focus of the paper is to study digraphs in which there are signiﬁcant connections from Sj to Sj 1 for any 1 j k 1. To formalise this, we introduce the notion of ﬂow ratio of {Sj}k 1 j=0, which is deﬁned by

ΦG (S0, . . . , Sk 1)

w(Sj, Sj 1) vol(Sj) + vol(Sj 1). (1)

We call this k-way partition {Sj} an optimal clustering if the ﬂow ratio given by {Sj} achieves the maximum deﬁned by θk(G) max S0,...,Sk 1 Si=V,Sj Sℓ= ΦG (S0, . . . , Sk 1) . (2)

Notice that, for any two consecutive clusters Sj and Sj 1, the value w(Sj, Sj 1) (vol(Sj) + vol(Sj 1)) 1 evaluates the ratio of the total edge weight in the cut (Sj, Sj 1) to the total weight of the edges with endpoints in Sj or Sj 1; moreover, only k 1 out of 2 k 2 different cuts among S0, . . . , Sk 1 contribute to ΦG(S0, . . . , Sk 1) according to (1). We remark that, although the deﬁnition of ΦG(S0, . . . , Sk 1) shares some similarity with the normalised cut value for undirected graph clustering [25], in our setting an optimal clustering is the one that maximises the ﬂow ratio. This is in a sharp contrast to most objective functions for undirected graph clustering, whose aim is to ﬁnd clusters of low conductance2. In addition, it is not difﬁcult to show that this problem is NP-hard since, when k = 2, our problem is exactly the MAX DICUT problem studied in [15].

To study the relationship between the ﬂow structure among S0, . . . , Sk 1 and the eigen-structure of the normalised Laplacian matrix of the graph, we deﬁne for every optimal cluster Sj (0 j k 1) an indicator vector χj Cn by χj(u) w 2π k j if u Sj and χj(u) = 0 otherwise. We further deﬁne the normalised indicator vector of χj by

c χj D1/2χj D1/2χj ,

j=0 c χj. (3)

We highlight that, due to the use of complex numbers, a single vector y is sufﬁcient to encode the structure of k clusters: this is quite different from the case of undirected graphs, where k mutually perpendicular vectors are needed in order to study the eigen-structure of graph Laplacian and the cluster structure [20, 23, 30]. In addition, by the use of roots of unity in (3), different clusters are separated from each other by angles, indicating that the use of a single eigenvector could be sufﬁcient to approximately recover k clusters. Our result on the relationship between λ1(LG) and θk(G) is summarised as follows: Lemma 3.1. Let G = (V, E, w) be a weighted digraph with normalised Hermitian Laplacian LG Cn n. Then, it holds that λ1(LG) 1 4

k θk(G). Moreover, θk(G) = k/4 if G is a bipartite digraph with all the edges having the same direction, and θk(G) < k/4 otherwise.

2It is important to notice that, among 2 k 2 cuts formed by pairwise different clusters, only (k 1) cut values contribute to our objective function. If one takes all of the 2 k 2 cut values into account, the objective function would involve 2 k 2 terms. However, even if most of the 2 k 2 terms are much smaller than the ones along the ﬂow, their sum could still be dominant, leaving little information on the structure of clusters. Therefore, we should only take (k 1) cut values into account when the clusters present a ﬂow structure.

Notice that the bipartite graph G with θk(G) = k/4 is a trivial case for our problem; hence, without lose of generality we assume θk(G) < k/4 in the following analysis. To study how the distribution of eigenvalues inﬂuences the cluster structure, similar to the case of undirected graphs we introduce the parameter γ deﬁned by

γk(G) λ2 1 (4/k) θk(G).

Our next theorem shows that the structure of clusters in G and the eigenvector corresponding to λ1(LG) can be approximated by each other with approximation ratio inversely proportional to γk(G).

Theorem 3.2. The following statements hold: (1) there is some α C such that the vector ef1 = αf1 satisﬁes y ef1 2 1/γk(G); (2) there is some β C such that the vector ey = βy satisﬁes f1 ey 2 1/ (γk(G) 1).

4 Algorithm

In this section we discuss the algorithmic contribution of the paper. In Section 4.1 we will describe the main algorithm, and its efﬁcient implementation based on nearly-linear time Laplacian solvers; we will further present a sub-linear time implementation of our algorithm, assuming the degrees of the vertices are known in advance. The main technical ideas used in analysing the algorithms will be discussed in Section 4.2.

4.1 Algorithm Description

Main algorithm. We have seen from Section 3 that the structure of clusters is approximately encoded in the bottom eigenvector of LG. To exploit this fact, we propose to embed the vertices of G into R2 based on the bottom eigenvector of LG, and apply k-means on the embedded points. Our algorithm, which we call Simple Herm, only consists of a few lines of code and is described as follows: (1) compute the bottom eigenvector f1 Cn of the normalised Hermitian Laplacian matrix LG of G; (2) compute the embedding {F(v)}v V [G], where F(v) 1 dv f1(v) for any vertex v; (3) apply k-means on the embedded points {F(v)}v V [G].

We remark that, although the entries of LG are complex-valued, some variant of the graph Laplacian solvers could still be applied for our setting. For most practical instances, we have k = O(logc n) for some constant c, in which regime our proposed algorithm runs in nearly-linear time3. We refer a reader to [22] on technical discussion on the algorithm of approximating f1 in nearly-linear time.

Speeding-up the runtime of the algorithm. Since Ω(m) time is needed for any algorithm to read an entire graph, the runtime of our proposed algorithm is optimal up to a poly-logarithmic factor. However we will show that, when the vertices degrees are available in advance, the following sub-linear time algorithm could be applied before the execution of the main algorithm, and this will result in the algorithm s total runtime to be sub-linear in m.

More formally, our proposed sub-linear time implementation is to construct a sparse subgraph H of the original input graph G, and run the main algorithm on H instead. The algorithm for obtaining graph H works as follows: every vertex u in the graph G checks each of its outgoing edges e = (u, v), and samples each outgoing edge with probability

pu(u, v) min w(u, v) α log n

λ2 dout u , 1 ;

in the same time, every vertex v checks each of its incoming edges e = (u, v) with probability

pv(u, v) min w(u, v) α log n

λ2 dinv , 1 ,

where α R>0 is some constant which can be determined experimentally. As the algorithm goes through each vertex, it maintains all the sampled edges in a set F. Once all the edges have

3Given any graph G with n vertices and m edges as input, we say an algorithm runs in nearly-linear time if the algorithm s runtime is O(m logc n) for some constant c.

been checked, the algorithm returns a weighted graph H = (V, F, w H), where each sampled edge e = (u, v) has a new weight deﬁned by w H(u, v) = w(u, v)/pe. Here, pe is the probability that e is sampled by one of its endpoints and, for any e = (u, v), we can write pe as pe = pu(u, v) + pv(u, v) pu(u, v)pv(u, v).

4.2 Analysis

Analysis of the main algorithm. Now we analyse the proposed algorithm, and prove that running k-means on {F(v)}v V [G] is sufﬁcient to obtain a meaningful clustering with bounded approximation guarantee. We assume that the output of a k-means algorithm is A0, . . . , Ak 1. We deﬁne the cost function of the output clustering A0, . . . , Ak 1 by

COST(A0, . . . , Ak 1) min c0,...,ck 1 C

u Aj du F(u) cj 2,

and deﬁne the optimal clustering by

2 k min partition A0,...Ak 1 COST(A0, . . . , Ak 1).

Although computing the optimal clustering for k-means is NP-hard, we will show that the cost value for the optimal clustering can be upper bounded with respect to γk(G). To achieve this, we deﬁne k points p(0), . . . , p(k 1) in C, where p(j) is deﬁned by

k (ω 2π k )j

vol(Sj) , 0 j k 1. (4)

We could view these p(0), . . . , p(k 1) as approximate centers of the k clusters, which are separated from each other through different powers of ω 2π k .

Our ﬁrst lemma shows that the total distance between the embedded points from every Sj and their respective centers p(j) can be upper bounded, which is summarised as follows:

Lemma 4.1. It holds that Pk 1 j=0 P

u Sj du F(u) p(j) 2 (γk(G) 1) 1.

Since the cost value of the optimal clustering is the minimum over all possible partitions of the embedded points, by Lemma 4.1 we have that 2 k (γk(G) 1) 1. We assume that the k-means algorithm used here achieves an approximation ratio of APT. Therefore, the output A0, . . . , Ak 1 of this k-means algorithm satisﬁes COST(A0, . . . , Ak 1) APT (γk(G) 1).

Secondly, we show that the norm of the approximate centre of each cluster is inversely proportional to the volume of each cluster. This implies that larger clusters are closer to the origin, while smaller clusters are further away from the origin.

Lemma 4.2. It holds for any 0 j k 1 that p(j) 2 = β 2 (k vol(Sj)) 1.

Thirdly, we prove that the distance between different approximate centres p(j) and p(ℓ) is inversely proportional to the volume of the smaller cluster, which implies that the embedded points of the vertices from a smaller cluster are far from the embedded points from other clusters. This key fact explains why our algorithm is able to approximately recover the structure of all the clusters.

Lemma 4.3. It holds for any 0 j = ℓ k 1 that p(j) p(ℓ) 2 β 2

3k3 min{vol(Sj),vol(Sℓ)}.

Combining these three lemmas with some combinatorial analysis, we prove that the symmetric difference between every returned cluster by the algorithm and its corresponding cluster in the optimal partition can be upper bounded, since otherwise the cost value of the returned clusters would contradict Lemma 4.1. Theorem 4.4. Let G = (V, E) be a digraph, and S0, . . . , Sk 1 be a k-way partition of V [G] that maximises the ﬂow ratio ΦG(S0, . . . , Sk 1). Then, there is an algorithm that returns a kway partition A0, . . . , Ak 1 of V [G]. Moreover, by assuming Aj corresponds to Sj in the optimal partition, it holds that vol(Aj Sj) εvol(Sj) for some ε = 48k3 (1+APT) (γk(G) 1) 1/2.

We remark that the analysis of our algorithm is similar with the work of [23]. However, the analysis in [23] relies on k indicator vectors of k clusters, each of which is in a different dimension of Rn; this implies that k eigenvectors are needed in order to ﬁnd a good k-way partition. In our case, all the embedded points are in R2, and the embedded points from different clusters are mainly separated by angles; this makes our analysis slightly more involved than [23].

Analysis for the speeding-up subroutine. We further analyse the speeding-up subroutine described in Section 4.1. Our analysis is very similar with [27], and the approximation guarantee of our speeding-up subroutine is as follows:

Theorem 4.5. Given a digraph G = (V, E) as input, the speeding-up subroutine computes a subgraph H = (V, F) of G with O((1/λ2) n log n) edges. Moreover, with high probability, the computed sparse graph H satisﬁes that θk(H) = Ω(θk(G)), and λ2(LH) = Ω(λ2(LG)).

5 Experiments

In this section we present the experimental results of our proposed algorithm Simple Herm on both synthetic and real-world datasets, and compare its performance against the previous state-of-theart. All our experiments are conducted with an ASUS Zen Book Pro UX501VW with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz with 12GB of RAM.

We will compare Simple Herm against the DD-SYM algorithm [24] and the Herm-RW algorithm [9]. Given the adjacency matrix M Rn n as input, the DD-SYM algorithm computes the matrix A = M M + MM , and uses the top k eigenvectors of a random walk matrix D 1A to construct an embedding for k-means clustering. The Herm-RW algorithm uses the imaginary unit i to represent directed edges and applies the top k/2 eigenvectors of a random walk matrix to construct an embedding for k-means. Notice that both of the DD-SYM and Herm-RW algorithms involve the use of multiple eigenvectors, and DD-SYM requires computing matrix multiplications, which makes it computationally more expensive than ours.

5.1 Results on Synthetic Datasets

We ﬁrst perform experiments on graphs generated from the Directed Stochastic Block Model (DSBM) which is introduced in [9]. We introduce a path structure into the DSBM, and compare the performance of our algorithm against the others. Speciﬁcally, for given parameters k, n, p, q, η, a graph randomly chosen from the DSBM is constructed as follows: the overall graph consists of k clusters S0, . . . , Sk 1 of the same size, each of which can be initially viewed as a G(n, p) random graph. We connect edges with endpoints in different clusters with probability q, and connect edges with endpoints within the same cluster with probability p. In addition, for any edge (u, v) where u Sj and v Sj+1, we set the edge direction as u v with probability η, and set the edge direction as v u with probability 1 η. For all other pairs of clusters which do not lie along the path, we set their edge directions randomly. The directions of edges inside a cluster are assigned randomly.

As graphs generated from the DSBM have a well-deﬁned ground truth clustering, we apply the Adjusted Rand Index (ARI) [14] to measure the performance of different algorithms. We further set p = q, since this is one of the hardest regimes for studying the DSBM. In particular, when p = q, the edge density plays no role in characterising the structure of clusters, and the edges are entirely deﬁned with respect to the edge directions.

We set n = 1000, and k = 4. We set the value of p to be between 0.5 and 0.8, and the value of η to be between 0.5 and 0.7. As shown in Figure 1, our proposed Simple Herm clearly outperforms the Herm-RW and the DD-SYM algorithms.

Next, we study the case of n = 2000 and k = 8, but the structure of clusters presents a more signiﬁcant path topology. Speciﬁcally, we assume that any pair of vertices within each cluster are connected with probability p (0.05, 0.1); moreover, all the edges crossing different clusters are along the cuts (Sj, Sj+1) for some 0 j k 2. By setting η (0.65, 1), our results are reported in Figure 2. From these results, it is easy to see that, when the underlying graph presents a clear ﬂow structure, our algorithm performs signiﬁcantly better than both the Herm-RW and DD-SYM algorithms, for which multiple eigenvectors are needed.

0.50 0.55 0.60 0.65 0.70 η

Simple Herm Herm RW DD-SYM

0.50 0.55 0.60 0.65 0.70 η

0.50 0.55 0.60 0.65 0.70 η

0.50 0.55 0.60 0.65 0.70 η

Figure 1: n = 1000 and k = 4. Average ARIs over 5 runs of different algorithms, with respect to different values of p and η.

0.650.700.750.800.850.900.951.00

Simple Herm Herm RW DD-SYM

0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

Figure 2: n = 2000 and k = 8. Average ARIs over 5 runs of different algorithms, with respect to different values of p and η.

5.2 Results on the UN Comtrade Dataset

We compare our proposed algorithm against the previous state-of-the-art on the UN Comtrade Dataset [29]. This dataset consists of the import-export tradeﬂow data of 97 speciﬁc commodities across N = 246 countries and territories over the period 1964 2018. The total size of the data in zipped ﬁles is 99.8GB, where every csv ﬁle for a single year contained around 20,000,000 lines.

Pre-processing. As the pre-processing step, for any ﬁxed commodity c and any ﬁxed year, we construct a directed graph as follows: the constructed graph has N = 246 vertices, which correspond to 246 countries and territories listed in the dataset. For any two vertices j and ℓ, there is a directed edge from j to ℓif the export of commodity c from country j to ℓis greater than the export from ℓ to j, and the weight of that edge is set to be the absolute value of the difference in trade, i.e., the net trade value between ℓand j. Notice that our construction ensures that all the edge weights are non-negative, and there is at most one directed edge between any pair of vertices.

Result on the International Oil Trade Industry. We ﬁrst study the international trade for mineral fuels, oils, and oil distillation product in the dataset. The primary reason for us to study the international oil trade is due to the fact that crude oil is one of the highest traded commodities worldwide [2], and plays a signiﬁcant role in geopolitics (e.g., 2003 Iraq War). Many references in international trade and policy making (e.g., [3, 10, 11]) allow us to interpret the results of our proposed algorithm.

Following previous studies on the same dataset from complex networks perspectives [13, 32], we set k = 4. Our algorithm s output around the period of 2006 2009 is visualised in Figure 3. We choose to highlight the results between 2006 and 2009, since 2008 sees the largest post World Ward II oil shock after the economic crisis [18]. As discussed earlier, our algorithm s output is naturally associated with an ordering of the clusters that optimises the value of Φ, and this ordering is reﬂected in our visualisation as well. Notice that such ordering corresponds to the chain of oil trade, and indicates the clusters of main export countries and import countries for oil trade.

From Figure 3, we see that the output of our algorithm from 2006 to 2008 is pretty stable, and this is in sharp contrast to the drastic change between 2008 and 2009, caused by the economic crisis. Moreover, many European countries move across different clusters from 2008 to 2009. The visualisation results of the other algorithms are less signiﬁcant than ours.

We further show that this dynamic change of clusters provides a reasonable reﬂection of international economics. Speciﬁcally, we compute the clustering results of our Simple Herm algorithm on the same dataset from 2002 to 2017, and compare it with the output of the DD-SYM algorithm. For every two

cluster 1 cluster 2 cluster 3 cluster 4

Figure 3: The clustering result for international trade from 2006 to 2009, where k = 4. Red countries form start of the trade chain, and yellow countries the end of the trade chain. Countries coloured white have no data.

'02/'03'03/'04'04/'05'05/'06'06/'07'07/'08'08/'09'09/'10'10/'11'11/'12'12/'13'13/'14'14/'15'15/'16

year-to-year

Symmetric Difference

Simple Herm DD-SYM

Figure 4: Comparison of the symmetric difference of the returned clusters between two consecutive years.

consecutive years, we map every cluster to its optimal correspondence (i.e., the one that minimises the symmetric difference between the two). We further compute the total symmetric difference between the clustering results for every two consecutive years, and our results are visualised in Figure 4. As shown in the ﬁgure, our algorithm has notable changes in clustering during 2004/2005 and 2008/2009 respectively. The peak around 2004/2005 might be a delayed change as a consequence of the Venezuelan oil strike and the Iraq war of 2003. Both the events led to the decrease in oil barrel production by 5.4 million barrels per day [16]. The peak around 2008/2009 is of course due to the economic crisis. These peaks correspond to the same periods of cluster instability found in the complex network analysis literature [1, 33], further signifying our result4. Compared to our algorithm, the clustering result of the DD-SYM algorithm is less stable over time.

Result on the International Wood Trade. We also study the international wood trade network (IWTN). This network looks at the trade of wood and articles of wood. Although the IWTN is less studied than the International Oil Trade Industry in the literature, it is nonetheless the reﬂection of an important and traditional industry and deserves detailed analysis. Wood trade is dependent on a number of factors, such as the amount of forest a country has left, whether countries are trying to reintroduce forests, and whether countries are deforesting a lot for agriculture (e.g., Amazon rainforest in Brazil) [17].

4We didn t plot the result between 2016 and 2017, since the symmetric difference for the DD-SYM algorithm is 107 and the symmetric difference for the Simple Herm algorithm is 17. We believe this is an anomaly for DD-SYM, and plotting this result in the same ﬁgure would make it difﬁcult to compare other years results.

cluster 1 cluster 2 cluster 3 cluster 4

Figure 5: Change in clustering of Simple Herm of the IWTN from 2006 to 2009 with k = 4. Clusters are labelled according to their position in the ordering that maximises the ﬂow ratio between the 4 clusters. Red countries form start of the trade chain, and yellow countries the end of the trade chain. Countries coloured in white have no data.

Figure 5 visualises the clusters from 2006 to 2009. As we can see, the structure of clusters are stable in early years, and the ﬁrst cluster contains countries with large forests such as Canada, Brazil, Russia, Germany, and China. However, there is a signiﬁcant change of the cluster structure from 2008 to 2009, and countries in Eastern Europe, the Middle East and Central Asia move across different clusters.

5.3 Result on the Data Science for COVID-19 Dataset

The Data Science for COVID-19 Dataset (DS4C) [19] contains information about 3519 South Korean citizens infected with COVID-19. Here, digraphs are essential to represent how the virus is transmitted among the individuals, and the clusters with high ratio of out-going edges represent the communities worst hit by the virus. We ﬁrst identify the largest connected component of the infection graph, which consists of 67 vertices and 66 edges, and run our algorithm on the largest connected component. By setting k = 4, our algorithm manages to identify a super-spreader as a single cluster, and the path of infection between groups of people along which most infections lie.

6 Broader Impact

The primary focus of our work is efﬁcient clustering algorithms for digraphs, whose clusters are deﬁned with respect to the edge directions between different clusters. We believe that our work could have long-term social impact. For instance, when modelling the transmission of COVID-19 among individuals through a digraph, the cluster (group of people) with the highest ratio of out-going edges represents the most infectious community. This type of information could aid local containment policy. With the development of many tracing Apps for COVID-19 and a signiﬁcant amount of infection data available in the near future, our studied algorithm could potentially be applied in this context. In addition, as shown by our experimental results on the UN Comtrade Dataset, our work could be employed to analyse many practical data for which most traditional clustering algorithms do not sufﬁce.

Acknowledgments and Disclosure of Funding

Part of this work was done when Steinar Laenen studied at the University of Edinburgh as a Master student. He Sun is supported by an EPSRC Early Career Fellowship (EP/T00729X/1).

[1] H. An, W. Zhong, Y. Chen, H. Li, and X. Gao. Features and evolution of international crude oil trade relationships: A trading-based network analysis. Energy, 74:254 259, 2014.

[2] F. I. Association. Total 2017 volume 25.2 billion contracts, down 0.1% from 2016. https://www.fia. org/resources/total-2017-volume-252-billion-contracts-down-01-2016, Jan 2018. Accessed: 2020-06-05.

[3] N. B. Behmiri and J. R. P. Manso. Crude oil conservation policy hypothesis in OECD (organisation for economic cooperation and development) countries: A multivariate panel Granger causality test. Energy, 43(1):253 260, 2012.

[4] A. R. Benson, D. F. Gleich, and J. Leskovec. Tensor spectral clustering for partitioning higher-order network structures. In International Conference on Data Mining, pages 118 126, 2015.

[5] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163 166, 2016.

[6] F. Chung. Spectral graph theory. In CBMS: Conference Board of the Mathematical Sciences, Regional Conference Series, 1997.

[7] F. Chung. Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics, 9(1):1 19, 2005.

[8] F. Chung and L. Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79 127, 2006.

[9] M. Cucuringu, H. Li, H. Sun, and L. Zanetti. Hermitian matrices for clustering directed graphs: insights and applications. In International Conference on Artiﬁcial Intelligence and Statistics, 2020.

[10] L.-B. Cui, P. Peng, and L. Zhu. Embodied energy, export policy adjustment and China s sustainable development: a multi-regional input-output analysis. Energy, 82:457 467, 2015.

[11] N. Cui, Y. Lei, and W. Fang. Design and impact estimation of a reform program of China s tax and fee policies for low-grade oil and gas resources. Petroleum Science, 8(4):515 526, 2011.

[12] M. Dittrich and S. Bringezu. The physical dimension of international trade: Part 1: Direct global ﬂows between 1962 and 2005. Ecological Economics, 69(9):1838 1847, 2010.

[13] R. Du, G. Dong, L. Tian, M. Wang, G. Fang, and S. Shao. Spatiotemporal dynamics and ﬁtness analysis of global oil market: Based on complex network. Public Library of Science one, 11(10), 2016.

[14] A. J. Gates and Y.-Y. Ahn. The impact of random models on clustering similarity. The Journal of Machine Learning Research, 18(1):3049 3076, 2017.

[15] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisﬁability problems using semideﬁnite programming. Journal of the ACM, 42(6):1115 1145, 1995.

[16] J. D. Hamilton. Historical oil shocks. Technical report, National Bureau of Economic Research, 2011.

[17] T. Kastner, K.-H. Erb, and S. Nonhebel. International wood trade and forest change: A global analysis. Global Environmental Change, 21(3):947 956, 2011.

[18] L. Kilian. Exogenous oil supply shocks: how big are they and how much do they matter for the US economy? The Review of Economics and Statistics, 90(2):216 240, 2008.

[19] Korea Centers for Disease Control & Prevention. Data science for COVID-19. https://www.kaggle. com/kimjihoo/coronavirusdataset, 2020. Accessed: 2020-06-03.

[20] J. R. Lee, S. O. Gharan, and L. Trevisan. Multiway spectral partitioning and higher-order Cheeger inequalities. Journal of the ACM, 61(6):37:1 37:30, 2014.

[21] E. A. Leicht and M. E. J. Newman. Community structure in directed networks. Physical Review Letters, 100:118703, 2008.

[22] H. Li, H. Sun, and L. Zanetti. Hermitian Laplacians and a Cheeger inequality for the Max-2-Lin problem. In 27th Annual European Symposium on Algorithms (ESA), pages 1 14, 2019.

[23] R. Peng, H. Sun, and L. Zanetti. Partitioning well-clustered graphs: Spectral clustering works! SIAM J. Comput., 46(2):710 743, 2017.

[24] V. Satuluri and S. Parthasarathy. Symmetrizations for clustering directed graphs. In Proceedings of the 14th International Conference on Extending Database Technology, pages 343 354, 2011.

[25] J. Shi and J. Malik. Normalized cuts and image segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 731 737, 1997.

[26] D. A. Spielman and N. Srivastava. Graph sparsiﬁcation by effective resistances. SIAM Journal on Computing, 40(6):1913 1926, 2011.

[27] H. Sun and L. Zanetti. Distributed graph clustering and sparsiﬁcation. ACM Transactions on Parallel Computing, 6(3):17:1 17:23, 2019.

[28] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389 434, 2012.

[29] United Nations. UN comtrade free API. https://comtrade.un.org/data/. Accessed: 2020-06-03.

[30] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395 416, 2007.

[31] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In 23rd International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 555 564, 2017.

[32] Z. Zhang, H. Lan, and W. Xing. Global trade pattern of crude oil and petroleum products: Analysis based on complex network. In IOP Conference Series: Earth and Environmental Science, volume 153, pages 22 33. IOP Publishing, 2018.

[33] W. Zhong, H. An, X. Gao, and X. Sun. The evolution of communities in the international oil trade network. Physica A: Statistical Mechanics and its Applications, 413:42 52, 2014.