# hybridorder_stochastic_block_model__2e1112c9.pdf Hybrid-order Stochastic Block Model Xunxun Wu,1 Chang-Dong Wang, 1,2,3* Pengfei Jiao 4 1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China 2 Guangdong Province Key Laboratory of Computational Science, Guangzhou, China 3 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China 4 Center of Biosafety Research and Strategy, Law school of Tianjin University, Tianjin, China xunxunwu99@outlook.com, changdongwang@hotmail.com, pjiao@tju.edu.cn Community detection is a research hotspot in machine learning and data mining. However, most of the existing community detection methods only rely on the lower-order connectivity patterns, while ignoring the higher-order connectivity patterns, and unable to capture the building blocks of the complex network. In recent years, some community detection methods based on higher-order structures have been developed, but they mainly focus on the motif network composed of higher-order structures, which violate the original lowerorder topological structure and are affected by the fragmentation issue, resulting in the deviation of community detection results. Therefore, there is still a lack of community detection methods that can effectively utilize higher-order connectivity patterns and lower-order connectivity patterns. To overcome the above limitations, this paper proposes the Hybridorder Stochastic Block Model (HSBM) from the perspective of the generative model. Based on the classical stochastic block model, the generation of lower-order structure and higher-order structure of the network is modeled uniformly, and the original topological properties of the network are maintained while using higher-order connectivity patterns. At the same time, a heuristic algorithm for community detection is proposed to optimize the objective function. Extensive experiments on six real-world datasets show that the proposed method outperforms the existing approaches. Introduction Community detection is a research hotspot in network analysis that aims to divide the network into substructures with tight internal connections and sparse external connections, and plays an important role in various fields such as human social interaction, economy and trade, biological information, transportation and electricity. Many community detection methods have been proposed (He et al. 2016; Blondel et al. 2008; He et al. 2017; Jin et al. 2018; Ganji, Bailey, and Stuckey 2018; Jin et al. 2019), but most of them only rely on the lower-order structure at the level of individual nodes and edges, and ignore the higher-order structure in the network (Shi and Malik 2000; Newman 2006; Frey and Dueck 2007; Schaeffer 2007; *Corresponding author Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Chakraborty et al. 2014). The common higher-order structures are small subgraphs in networks, also known as network motifs, which are crucial for understanding the fundamental structures and regulating the behavior of complex networks (Benson, Gleich, and Leskovec 2016). In order to capture the building blocks of the network, some community detection methods based on higher-order structure have been proposed in recent years (Arenas et al. 2008; Benson, Gleich, and Leskovec 2016; Tsourakakis, Pachocki, and Mitzenmacher 2017; Yin et al. 2017; Huang, Wang, and Chao 2018, 2019a,b). The basic idea of these higher-order methods is as follows: The first step is to construct a motif network by using motifs. If two nodes have involved in at least one common motif, there is a higher-order connection between them; otherwise, there is no higher-order connection. Then, lower-order methods are used on the motif adjacency matrix for community detection. However, this kind of approaches focuses on the motif network and ignores the the original lower-order topological structure of complex network. In the process of constructing motif network, some connected components and isolated nodes may be generated, which leads to the fragmentation issue of the motif network. At the same time, the construction principle of the motif network may cause two nodes connected in the original network not to be connected in the motif network, and making the nodes that might originally be in the same community belong to different communities, which violates the lowerorder topological structure of the network, and causes the deviation of community detection results. Although a community detection method based on edge enhancement has proposed (Li et al. 2019a), the addition of edges may also break the original topological structure. To make effective use of higher-order structure and lowerorder structure for community detection, a Hybrid-order Stochastic Block Model (HSBM) is proposed. This method solves the problem of hybrid-order community detection from the perspective of the generative model. Based on the classical stochastic block model, the generation of lowerorder structure and higher-order structure of the network is modeled uniformly. It is able to reveal the generation mechanism of the network from the perspective of community structure, and maintain the topological structure, higherorder structure, and statistical characteristics of the network. At the same time, a heuristic algorithm is proposed to op- The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) timize the objective function. Extensive experiments on six real-world datasets show that the proposed method is effective and advanced for community detection. The main contributions of this paper are as follows: We propose a Hybrid-order Stochastic Block Model (HSBM) for the community detection, which leverages both higher-order and lower-order connectivity patterns. We for the first time propose to use the generative model to uniformly model the original network and motif network, avoiding the disseverance of higher-order and lower-order connectivity patterns and the fragmentation issues in the current community detection methods. Extensive experiments are conducted on six real-world networks to prove the effectiveness of the proposed method. Preliminaries and Problem Statement Before formally introducing the problem statement and the proposed approach, we briefly introduce some of the necessary background and notations. The input is a network G = (V, E), where V = {v1, v2, ..., vn} means the node set of the graph, and E = {e1, e2, ..., em} denotes the edge set consisting of m edges. In this paper, we focus on undirected networks, which have been most extensively studied, and we allow the networks to include both multi-edges and self-edges without generality (Karrer and Newman 2010). A Nn n represents the adjacency matrix of the network, i.e. the lower-order connection of the original network. Specifically, Aij represents the number of connections between node i and node j if i is not equal to j, and Aii is equal to twice the number of self-edges of node i. The lower-order structure above mainly focuses on the connectivity patterns at the level of individual nodes and edges, and ignores the higher-order connectivity patterns. Motif is a sub-network that occurs frequently in complex networks, and its number is significantly higher than that in randomized networks of the same degree distribution. As the building block of complex networks, it is crucial to understand the basic structure and to regulate the behavior of the networks (Benson, Gleich, and Leskovec 2016). Different motifs exist in different complex networks. For instance, the triangle motif has been widely found in social networks, and two-hop path motifs are common structures of air traffic networks. Formally, network motif can be expressed as: Mq p = {VM, EM} (1) where VM V represents the node set consisting of p nodes, and EM E represents the edge set consisting of q edges in the motif M. Given the original adjacency matrix A, we can define the motif adjacency matrix M Nn n based on different motif types, which is defined as: Mij =number of motif instances containing nodes i and j (2) In this way, M is used to represent the motif-based higherorder connections of the complex network, where the edges (a) Original network (b) Motif network Figure 1: Illustration of the differences between the original network and the motif network on the DBLP dataset. represent the number of co-occurrence of nodes i and j under a specific motif type. Note that Mij can be 0 even if Aij = 0. This paper focuses on the triangular motif M3 3, but the proposed method can be well extended to other motifs. From the above formula, we can see that in the process of constructing the motif network, the original lower-order topological structure is not well maintained. For illustration purpose, we illustrate the original network of the DBLP dataset and its corresponding motif network, as shown in Figure 1. From the figure, the structure of the original network and the motif network is significantly different. On the one hand, for the node-level, the lack of higher-order connectivity patterns between some nodes led to the following situation: two nodes are connected in the original network, while there are no edges between them in the motif network. If only the higher-order connectivity structure obtained by the above definition is used for the subsequent analysis, the original lower-order information will obviously be ignored, which will result in the deviation of community detection results. On the other hand, for the network-level, when we construct a motif network, the original connected network is divided into several connected components of different sizes and some isolated nodes, resulting in the fragmentation issue (Li et al. 2019a). In the process of partitioning nodes, these isolated nodes will not be supported by the original network, which makes the community labels of isolated nodes present randomness. Although an edge enhancement approach has been proposed to solve the above issue (Li et al. 2019a), the added edges of the method may still destroy the original lowerorder connectivity patterns. To our best knowledge, there is still a lack of methods to effectively utilize higher-order structure and lower-order structure for community detection. Since we are using both higher-order connectivity patterns and lower-order connectivity patterns, the proposed method should be named Hybrid-order Stochastic Block Model (HSBM) . On the one hand, we maintain the topological structure of the original network so that the community assignments of nodes does not violate the original lowerorder connectivity patterns. On the other hand, we utilize higher-order connectivity patterns, and construct the motif network to correct the deviation of community detection results caused by using only lower-order structure. The Proposed Method Motif Network Construction We use the topological structure of the original network to construct a motif network, which is represented as: GM = {VM, EM} (3) where GM represents a motif network based on the triangle motif, and VM represents the node set of the motif network containing n nodes, which is the same as the node set V in the original network, and EM represents the edge set containing m edges. In particular, we have EM = {EM 1 , EM 2 , ..., EM m } with EM l = (i, j, Mij), l = 1, 2, ..., m . (4) where i and j represent the nodes at both ends of edges, and Mij represents the number of edges between nodes i and j on the motif network as Eq. (2). Hybrid-order Stochastic Block Model In this section, we introduce the proposed Hybrid-order Stochastic Block Model (HSBM) in detail. We first introduce the generation mechanism of the motif network. In the motif network, we assume that the number of edges generated between each pair of nodes follows an independent Poisson distribution. Let gi {1, 2, ..., K}, i = 1, ..., n represent the community label of node i, where K is the number of communities. X RK K represents the edge expectation matrix of nodes between communities in the motif network, where Xrs denotes the expected value of the motif adjacency matrix element Mij for node i and j belonging to communities r and s respectively. To reflect the difference of nodes within the community in the motif network, the parameter ξ [0, 1]n is introduced to control the expected degrees of nodes. Therefore, in the motif network, the expected number of edges between nodes i and j belonging to communities r and s respectively is ξiξj Xrs. In this way, under the conditions of given parameters X, ξ and community labels g, the probability distribution of generating the motif network GM can be obtained as follows: P(M|ξ, X, g) = Y (ξiξj Xgigj)Mij Mij! exp( ξiξj Xgigj) 2ξ2 i Xgigi)Mii/2 (Mii/2)! exp( 1 2ξ2 i Xgigi) where the expected number of self-edges at node i in community r is 1/2ξ2 i Xrr. For each community r, there is a constraint about ξ as follows: X i ξiδgi,r = 1 (6) where δ is the Kronecker delta, which is equal to 1 when two subscripts are equal, and 0 otherwise. Then, the sum of ξi of all nodes in the community is 1, i.e., the value of ξi is equal to the probability that the connected node is i itself when an 𝑋𝑋𝑟𝑟𝑟𝑟 𝑔𝑔𝑗𝑗 𝑣𝑣𝑖𝑖, 𝑣𝑣𝑗𝑗 𝓥𝓥 𝜃𝜃𝑖𝑖 𝜃𝜃𝑗𝑗 𝜉𝜉𝑖𝑖 𝜉𝜉𝑗𝑗 Figure 2: A graphical representation of HSBM edge is connected to the community which node i belongs to in the motif network. For the original adjacency matrix Aij, the above process is also applicable. The generation probability of A can be obtained as follows: P(A|θ, ω, g) = Y (θiθjωgigj)Aij Aij! exp( θiθjωgigj) 2θ2 i ωgigi)Aii/2 (Aii/2)! exp( 1 2θ2 i ωgigi) where ω RK K is the expected value of the original adjacency matrix elements. θ [0, 1]n is the parameter of node differences in original networks, corresponding to ξ in the motif network, and its constraint is: X i θiδgi,r = 1 (8) Figure 2 shows the generation process of the original adjacency matrix A and the motif adjacency matrix M, and reflects the conditional independent relationship between the observation variables A and M, and parameters g, θ, ξ, ω and X. It can be seen that the generation of the original adjacency matrix A is independent of other variables except for the community labels g and the expected value of edges ω, and the node difference parameter θ. The generation of motif adjacency matrix M only depends on the node labels g, the expected value of edges X, and the node difference parameter ξ in the motif network. The proposed model generates both the original adjacency matrix A and the motif adjacency matrix M. In classical SBM, the generation of edges in the network is conditional independent. The motif network characterizes local motif structures, and the generation of M can alleviate the constraint of conditional independence in SBM. Although the model does not directly model the relationship between A and M, M is calculated by deterministic rules through A, and the community labels g are used to simultaneously model the generation of A and M, so that the two have mutually reinforcing effects. The model not only describes the original lower-order topological structure, but also realizes the influence of the higher-order connectivity pattern on the community structure. Therefore, we can obtain the joint probability distribution of the observation variables A and M given the model parameters and community labels as follows: P(A, M|θ,ω, g, ξ, X) =P(A|θ, ω, g)P(M|ξ, X, g) = 1 Q i