# CoDiNMF: Co-Clustering of Directed Graphs via NMF

Woosang Lim (woosang.lim@cc.gatech.edu), Rundong Du (rdu@gatech.edu), Haesun Park (hpark@cc.gatech.edu)
School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

## Abstract

Co-clustering computes clusters of data items and the related features concurrently, and it has been used in many applications such as community detection, product recommendation, computer vision, and pricing optimization. In this paper, we propose a new co-clustering method, called CoDiNMF, which improves the clustering quality and finds directional patterns among co-clusters by using multiple directed and undirected graphs. We design the objective function of co-clustering by using the min-cut criterion combined with an additional term which controls the sum of net directional flow between different co-clusters. In addition, we show that a variant of Nonnegative Matrix Factorization (NMF) can solve the proposed objective function effectively. We run experiments on the US patents and Blog Catalog data sets, whose ground truth is known, and show that CoDiNMF improves clustering results compared to other co-clustering methods in terms of average F1 score, Rand index, and adjusted Rand index (ARI). Finally, we compare CoDiNMF and other co-clustering methods on the Wikipedia data set of philosophers, and we find meaningful directional flows of influence among co-clusters of philosophers.

## Introduction

Clustering is an essential task in unsupervised learning which finds the inherent relationships and group structures in data sets. In particular, co-clustering simultaneously computes co-clusters which consist of both features and data items (or higher-order entities) by exploiting the dualities between them. Since co-clustering is more effective in analyzing dyadic or higher-order relationships than traditional one-way clustering, it has many applications such as text mining (Dhillon 2001; Dhillon, Mallela, and Modha 2003), bioinformatics (Cho et al. 2004), product recommendation (Vlachos et al. 2014), pricing optimization (Zhu, Yang, and He 2015), sense discovery (Chen et al. 2015), and community detection (Rohe, Qin, and Yu 2016). Dhillon (Dhillon 2001) suggested Spectral Co-clustering (SCo) to compute co-clusters by using the min-cut problem of a bipartite graph with documents and words as parts. Since then, co-clustering has been studied in different ways over the last decades: information-theoretic co-clustering (Dhillon, Mallela, and Modha 2003), multi-view co-clustering (Sun et al. 2015), co-clustering based on spectral approaches (Wu, Benson, and Gleich 2016; Rohe, Qin, and Yu 2016), and co-clustering based on NMF (Long, Zhang, and Yu 2005; Wang et al. 2011). However, most co-clustering methods assume that the connections between entities are symmetric or undirected, whereas many interactions in real networks are asymmetric or directed. For example, the data set of patents and words contains a citation network among patents, which can be represented as a directed and asymmetric graph.
As another example, in the development of philosophy, influence flows from one philosopher or school of philosophy to other philosophers or schools, and these relationships can be directed and asymmetric. Rohe et al. recently proposed a spectral co-clustering method for directed networks, called DI-SIM, which uses an asymmetric regularized graph Laplacian for directed graphs (Rohe, Qin, and Yu 2016). DI-SIM computes the left and right singular vectors of the regularized graph Laplacian to generate two lower-dimensional representations, one for sending nodes and one for receiving nodes. Then, by using an asymmetry score, it discovers the asymmetries in the relationships and describes the directional communities. However, DI-SIM can only be applied to data sets which consist of one kind of entity distinguished by sending and receiving roles.

In this paper, we propose a new NMF-based co-clustering method, called CoDiNMF, which computes co-clusters by using both directed and undirected relationships to improve co-clustering quality. In particular, CoDiNMF is able to find directional relationships among co-clusters by controlling the net directional flow between different co-clusters. In the following sections, we derive CoDiNMF and compare it with other NMF-based co-clustering methods in terms of accuracy on the US patents and Blog Catalog data sets. We also demonstrate its ability to find directional relationships on the Wikipedia data set of philosophers.

## Related Work

In this section, we briefly discuss some of the existing co-clustering methods in the literature, including spectral co-clustering and NMF-based co-clustering.

### Spectral Co-clustering

Let $X$ be the $m \times n$ data matrix for the set of features $A = \{a_1, \ldots, a_m\}$ and the set of data items $B = \{b_1, \ldots, b_n\}$. For a document data set, $A$ is the set of words, $B$ is the set of documents, and $X$ is the weight matrix between $A$ and $B$. Dhillon suggested Spectral Co-clustering (SCo), which constructs a graph Laplacian by using a bipartite graph represented by the edge weight matrix
$$M = \begin{bmatrix} 0 & X \\ X^\top & 0 \end{bmatrix},$$
computes a low-dimensional embedding, and runs k-means clustering on the computed embedding to find $k$ co-clusters (Dhillon 2001). More specifically, the objective function of co-clustering for bipartition is designed to solve a min-cut problem defined with a partition vector $q$ and the weight matrix $M$, where $q_i = +\sqrt{\eta_2/\eta_1}$ if the $i$-th row/column of $M$ belongs to the first cluster and $q_i = -\sqrt{\eta_1/\eta_2}$ if the $i$-th row/column of $M$ belongs to the second cluster, $\eta_i$ is the sum of edge weights of the nodes in the $i$-th cluster, and the plus or minus sign of $q_i$ plays an important role in assigning nodes to clusters. Since solving the proposed objective function exactly is NP-complete, SCo was suggested as a relaxation. It computes the second left and right singular vectors of the normalized weight matrix $X_n = D_1^{-1/2} X D_2^{-1/2}$, and constructs a vector $y \in \mathbb{R}^{m+n}$ as an approximation of $q$, where $D_1$ and $D_2$ are diagonal matrices such that $D_1(i,i) = \sum_j X_{i,j}$ and $D_2(i,i) = \sum_j X_{j,i}$. Finally, it runs the k-means algorithm on the 1-dimensional $y$ to compute co-clusters. For multipartitioning, it uses $\ell = \lceil \log_2 k \rceil$ singular vectors of $X_n$ to construct a matrix $Y \in \mathbb{R}^{(m+n) \times \ell}$ and runs k-means on $Y$.
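As a reading aid, the following sketch outlines the SCo procedure described above in a few lines of linear algebra. It is a minimal illustration under assumptions, not the original implementation: the dense nonnegative matrix `X`, the degree-scaled embedding, the use of scikit-learn's `KMeans`, and the toy usage at the end are all choices made for the example.

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def spectral_coclustering(X, k, n_vectors=None):
    """Minimal sketch of SCo (Dhillon 2001): normalize the bipartite weight
    matrix, take singular vectors, and run k-means on the stacked embedding."""
    m, n = X.shape
    ell = n_vectors or max(1, int(np.ceil(np.log2(k))))
    # D1, D2: diagonal degree matrices (row sums and column sums of X).
    d1 = np.asarray(X.sum(axis=1)).ravel()
    d2 = np.asarray(X.sum(axis=0)).ravel()
    Xn = (X / np.sqrt(d1)[:, None]) / np.sqrt(d2)[None, :]
    # Keep the `ell` singular vectors that follow the first (trivial) pair.
    U, s, Vt = svds(Xn, k=ell + 1)
    order = np.argsort(-s)                 # svds does not sort singular values
    U, Vt = U[:, order], Vt[order, :]
    # Degree-scaled embedding of the m features and n items, stacked.
    Y = np.vstack([U[:, 1:ell + 1] / np.sqrt(d1)[:, None],
                   Vt[1:ell + 1, :].T / np.sqrt(d2)[:, None]])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Y)
    return labels[:m], labels[m:]          # feature clusters, item clusters

# Toy usage on a random nonnegative word-document matrix (assumed data).
rng = np.random.default_rng(0)
X = rng.random((40, 60)) + 1e-3
word_clusters, doc_clusters = spectral_coclustering(X, k=3)
```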
Wu et al. extended spectral co-clustering to tensor data sets and proposed a method called General Tensor Spectral Co-clustering (GTSC) (Wu, Benson, and Gleich 2016). First, GTSC extends the original rectangular data tensor to a square and symmetric extended tensor. Next, it constructs a Markov chain by using the transition matrix and the super-spacey stationary vector, and computes the second leading real-valued eigenvector of the Markov chain. At the final step, it applies sweep cuts in a recursive bisection procedure for clustering. Although GTSC extends spectral co-clustering, it still uses only undirected and symmetric connections between entities.

Rohe et al. recently proposed a new spectral co-clustering method called DI-SIM, which applies the spectral co-clustering approach to directed graphs (Rohe, Qin, and Yu 2016). First, it assumes that we have an $n \times n$ weight matrix $S$ which represents directed connections among the $n$ nodes in the data set, and it constructs a regularized graph Laplacian by normalizing $S$ as $L = O^{-1/2} S P^{-1/2}$, where $O$ and $P$ are $n \times n$ diagonal matrices with $O_{i,i} = \sum_j S_{i,j} + \tau$ and $P_{i,i} = \sum_j S_{j,i} + \tau$, and the regularization parameter $\tau$ is the average out-degree. Next, it computes the first $k$ left and right singular vectors of $L$, and constructs a $k$-dimensional embedding by normalizing the rows of the singular vector matrices. At the final step, it runs k-means on the $k$-dimensional embedding and finds the asymmetry in the relationships by using an asymmetry score. Although DI-SIM can analyze clustering asymmetries on some data sets, it can be applied only to the special case of a data set which represents sending and receiving relationships among nodes of the same kind of entity.

### NMF-based Co-clustering

Nonnegative Matrix Factorization (NMF) and its variants have been used for clustering (Xu, Liu, and Gong 2003; Kim and Park 2011; Kuang, Yun, and Park 2015; Kuang and Park 2013; Du et al. 2017a), and NMF has the advantage of providing an intuitive interpretation of the result. Nonnegative Block Value Decomposition (NBVD) was proposed to analyze the latent block structure of a nonnegative data matrix based on Nonnegative Matrix Tri-Factorization (NMTF) (Long, Zhang, and Yu 2005), and its objective function is
$$\min_{W,T,H} \|X - WTH\|_F^2 \quad \text{subject to} \quad W, T, H \geq 0, \qquad (1)$$
where $X$ is the data matrix defined in the previous section, $W$ is $m \times c$, $T$ is $c \times l$, and $H$ is $l \times n$. $T$ is called the block value matrix, which can be considered a compact representation of $X$. $W$ and $H$ are coefficient matrices for row and column clusters, respectively. To solve Eqn (1), they derived an algorithm that iteratively updates the decomposition by using a set of multiplicative update rules.

Wang et al. proposed Locality Preserved Fast Nonnegative Matrix Tri-Factorization (LP-FNMTF) for co-clustering, which uses two undirected graphs, $G_A$ of features and $G_B$ of data points, in addition to the bipartite graph of $X$ (Wang et al. 2011), where the undirected graphs $G_A$ and $G_B$ can be either generated from $X$ or obtained from prior knowledge. Thus, the objective function of LP-FNMTF in Eqn (2) has two terms in addition to Eqn (1) to account for the manifold structures in both the data and feature spaces, and the constraints in Eqn (1) and Eqn (2) are different:
$$\min_{W,S,H} \|X - WSH\|_F^2 + \alpha \, \mathrm{tr}\!\left(W^\top (I - L_A) W\right) + \beta \, \mathrm{tr}\!\left(H (I - L_B) H^\top\right) \quad \text{subject to} \quad W \in \Psi^{m \times d}, \; H \in \Psi^{c \times n}, \qquad (2)$$
where $L_A$ and $L_B$ are normalized graph Laplacians such that $L_A = D_A^{-1/2} G_A D_A^{-1/2}$ and $L_B = D_B^{-1/2} G_B D_B^{-1/2}$, and $\Psi$ denotes a cluster indicator matrix. LP-FNMTF computes $d \times c$ co-clusters, but we can set $d = c = k$ to find $k$ clusters among the data points.
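To make the tri-factorization objective in Eqn (1) concrete, here is a small sketch that minimizes $\|X - WTH\|_F^2$ under nonnegativity constraints with standard multiplicative updates. This is only an illustrative solver, not the NBVD or LP-FNMTF code; the particular update rules, random initialization, and toy data are assumptions for the example.

```python
import numpy as np

def nmtf(X, c, l, n_iter=300, eps=1e-10, seed=0):
    """Multiplicative-update sketch for  min ||X - W T H||_F^2
    with W (m x c), T (c x l), H (l x n) all kept nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, c))
    T = rng.random((c, l))
    H = rng.random((l, n))
    for _ in range(n_iter):
        TH = T @ H                                            # c x n
        W *= (X @ TH.T) / (W @ TH @ TH.T + eps)               # row-cluster factor
        T *= (W.T @ X @ H.T) / (W.T @ W @ T @ H @ H.T + eps)  # block value matrix
        H *= (T.T @ W.T @ X) / (T.T @ W.T @ W @ T @ H + eps)  # column-cluster factor
    return W, T, H

# Toy usage: argmax over W rows / H columns gives hard row/column clusters.
X = np.abs(np.random.default_rng(1).random((30, 50)))
W, T, H = nmtf(X, c=4, l=4)
row_clusters = W.argmax(axis=1)
col_clusters = H.argmax(axis=0)
```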
So far, we have discussed spectral co-clustering and NMF-based co-clustering, including NBVD and LP-FNMTF. All of them, except DI-SIM, use only undirected relationships when constructing the objective function to compute $k$ co-clusters, and DI-SIM only considers data sets consisting of sending and receiving relationships among nodes of the same kind of entity.

## Co-clustering of Directed Graphs

Many data sets in the real world contain both undirected and directed relationships. For example, research papers and patents can be encoded with words as their features, but the two data sets also have directed relationships, namely the citations among the papers and the citations among the patents, respectively.

Figure 1: Previous co-clustering models use an undirected bipartite graph between data and features (left); our co-clustering model uses a bipartite graph combined with directed subgraphs (right).

Fig 1 describes an example of dyadic data which can be represented as a combination of an undirected subgraph and directed subgraphs. Thus, unlike clustering methods on multiple undirected graphs (Tang, Lu, and Dhillon 2009; Wang et al. 2011; Sun et al. 2015) and multi-view graphs on the same entity (Nie, Li, and Li 2017), we consider an extended $(m+n) \times (m+n)$ weight matrix $M$ obtained by combining the weight matrices of the undirected graph and the directed graphs, to encode additional directed relationships for a data set of two entities:
$$M = \begin{bmatrix} Z & X \\ X^\top & S \end{bmatrix}, \qquad (3)$$
where $X \in \mathbb{R}^{m \times n}$ is the weight matrix of the undirected graph, and $Z \in \mathbb{R}^{m \times m}$ and $S \in \mathbb{R}^{n \times n}$ can be weight matrices of the directed graphs shown on the right side of Fig 1. For a document data set, $Z$ can be a weight matrix of the word-word graph, $X$ a weight matrix of the word-document graph, and $S$ a weight matrix of the document-document graph. More generally, we can assume that the relationships between features and items are also directed; then $X_1$ and $X_2$ in Eqn (4) can be different, i.e., $X_1 \neq X_2$:
$$M = \begin{bmatrix} Z & X_1 \\ X_2^\top & S \end{bmatrix}. \qquad (4)$$
Since such data inherently have directional relationships among the entities, we need to exploit them to enhance the clustering quality and also to find directional relationships among co-clusters. Our CoDiNMF is derived under the condition of Eqn (4), and we discuss it in detail in the following section.

## CoDiNMF: Co-clustering of Directed Graphs via NMF

We first define a directional cut in directed graphs to derive the objective function of CoDiNMF. Let $M$ in Eqn (4) be the extended weight matrix of the directed graph, and let $V_i$ be the set of the $i$-th co-cluster, which consists of entities from both sets $A$ and $B$.

Figure 2: Illustrations of the objective function. Left (min-cut problem on directed graphs): the min-cut problem is derived by considering the sum of all directional cuts between different co-clusters. Right (maximizing differences of directional flows): an additional term helps to control the net directional flow, which is the sum of differences of directional flows.

Then, the directional cut in a directed graph for clusters $V_i$ and $V_j$ is
$$\mathrm{Cut}(V_i, V_j) = \sum_{s \in V_i, \, t \in V_j} M_{s,t},$$
where $\mathrm{Cut}(V_i, V_j) \neq \mathrm{Cut}(V_j, V_i)$ in general, since the cuts between $V_i$ and $V_j$ can be different depending on the direction. Based on the notion of directional cut, we introduce two different aspects, cut minimization in directed graphs and control of net directional flow, to derive an objective function.
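The block construction in Eqns (3)-(4) and the directional cut above can be made concrete with a short sketch. The sparse SciPy representation, the helper names, the node-index convention (features first, then items), and the toy matrices below are assumptions made for this example, not part of the paper.

```python
import numpy as np
import scipy.sparse as sp

def extended_weight_matrix(X1, Z, S, X2=None):
    """Assemble the (m+n) x (m+n) block matrix of Eqn (4):
        M = [[Z,     X1],
             [X2^T,  S ]]
    Z (m x m) and S (n x n) may be asymmetric (directed graphs); if X2 is
    omitted, X2 = X1 and the construction reduces to Eqn (3)."""
    X2 = X1 if X2 is None else X2
    return sp.bmat([[sp.csr_matrix(Z),    sp.csr_matrix(X1)],
                    [sp.csr_matrix(X2).T, sp.csr_matrix(S)]], format="csr")

def directional_cut(M, nodes_i, nodes_j):
    """Cut(V_i, V_j): total weight of edges going from V_i to V_j."""
    return M[nodes_i][:, nodes_j].sum()

# Toy usage: 3 features, 4 items; S holds directed item-item "citations".
rng = np.random.default_rng(2)
X = rng.random((3, 4))
Z = np.zeros((3, 3))
S = np.triu(rng.random((4, 4)), k=1)   # strictly upper triangular -> asymmetric
M = extended_weight_matrix(X, Z, S)
print(directional_cut(M, [0, 1, 3], [2, 4, 5, 6]),
      directional_cut(M, [2, 4, 5, 6], [0, 1, 3]))   # generally different
```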
**Edge cut minimization in directed graphs:** Our first criterion is minimizing the cut, which is the sum of all directional cuts among the $k$ co-clusters. We assume that each co-cluster indicator vector $u_i$ consists of 0s and 1s, whereas the usual spectral-clustering-based methods use $-1$ and $1$ as the values of the cluster indicator vector. Specifically, the elements of $u_i$ are 1 for the members of the $i$-th cluster and 0 for the members of the other clusters, so $u_i$ is nonnegative and $u_i^\top u_j = 0$ when $i \neq j$. Using the defined $u_i$, the directional cut from $V_i$ to $V_j$ can be expressed as $\mathrm{Cut}(V_i, V_j) = u_i^\top M u_j$. Then, the cut becomes $\sum_{i \neq j} \mathrm{Cut}(V_i, V_j) = \sum_i$
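As a sanity check of the indicator-vector form $\mathrm{Cut}(V_i, V_j) = u_i^\top M u_j$, the sketch below evaluates all pairwise directional cuts at once as $U^\top M U$ for a toy asymmetric weight matrix and a hard cluster assignment. The toy data, and the particular reading of the net directional flow as the sum of $|\mathrm{Cut}(V_i, V_j) - \mathrm{Cut}(V_j, V_i)|$ over cluster pairs, are assumptions for this example.

```python
import numpy as np

def indicator_matrix(labels, k):
    """U[:, i] is the 0/1 co-cluster indicator vector u_i; columns are orthogonal."""
    U = np.zeros((len(labels), k))
    U[np.arange(len(labels)), labels] = 1.0
    return U

# Toy asymmetric extended weight matrix over 6 nodes split into 2 co-clusters.
M = np.array([[0, 2, 1, 0, 0, 0],
              [0, 0, 3, 1, 0, 0],
              [0, 0, 0, 0, 2, 0],
              [0, 0, 0, 0, 4, 1],
              [0, 0, 1, 0, 0, 2],
              [0, 0, 0, 0, 0, 0]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
U = indicator_matrix(labels, k=2)

cuts = U.T @ M @ U                         # cuts[i, j] = u_i^T M u_j = Cut(V_i, V_j)
total_cut = cuts.sum() - np.trace(cuts)    # sum of Cut(V_i, V_j) over i != j
net_flow = np.abs(cuts - cuts.T).sum() / 2 # one reading of the net directional flow
print(cuts)
print(total_cut, net_flow)
```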