# HIGEN: HIERARCHICAL GRAPH GENERATIVE NETWORKS

Published as a conference paper at ICLR 2024

Mahdi Karami
mahdi.karami@ualberta.ca

ABSTRACT

Most real-world graphs exhibit a hierarchical structure, which is often overlooked by existing graph generation methods. To address this limitation, we propose a novel graph generative network that captures the hierarchical nature of graphs and successively generates the graph sub-structures in a coarse-to-fine fashion. At each level of the hierarchy, this model generates communities in parallel, followed by the prediction of cross-edges between communities using separate neural networks. This modular approach enables scalable graph generation for large and complex graphs. Moreover, we model the output distribution of edges in the hierarchical graph with a multinomial distribution and derive a recursive factorization for this distribution. This enables us to generate community graphs with integer-valued edge weights in an autoregressive manner. Empirical studies demonstrate the effectiveness and scalability of our proposed generative model, achieving state-of-the-art performance in terms of graph quality across various benchmark datasets. Code available at https://github.com/Karami-m/HiGen_main.

1 INTRODUCTION

Graphs play a fundamental role in representing relationships and are widely applicable in various domains. The task of generating graphs from data holds immense value for diverse applications but also poses significant challenges (Dai et al., 2020). Some of the applications include the exploration of novel molecular and chemical structures (Jin et al., 2020), document generation (Blei et al., 2003), circuit design (Mirhoseini et al., 2021), the analysis and synthesis of realistic data networks, as well as the synthesis of scene graphs in computer vision (Manolis Savva et al., 2019; Ramakrishnan et al., 2021).

In all the aforementioned domains, a common observation is the presence of locally heterogeneous edge distributions in the graph representing the system, leading to the formation of clusters or communities and hierarchical structures. These clusters represent groups of nodes characterized by a high density of edges within the group and a comparatively lower density of edges connecting the group with the rest of the graph. In a hierarchical structure that arises from graph clustering, the communities at the lower levels capture the local structures and relationships within the graph; these communities provide insights into the fine-grained interactions among nodes. On the other hand, the higher levels of the hierarchy reflect the broader interactions between communities and characterize global properties of the graph. Therefore, in order to generate realistic graphs, it is essential for graph generation models to learn this multi-scale structure and be able to capture the cross-level relations.

While hierarchical multi-resolution generative models have been developed for specific data types such as voice (Oord et al., 2016), images (Reed et al., 2017; Karami et al., 2019) and molecular motifs (Jin et al., 2020), these methods rely on domain-specific priors that are not applicable to general graphs with their unordered nature. To the best of our knowledge, there exists no data-driven generative model specifically designed for generic graphs that can effectively incorporate hierarchical structure.

Graph generative models have been extensively studied in the literature.
Classical methods based on random graph theory, such as those proposed in Erdős & Rényi (1960) and Barabási & Albert (1999), can only capture a limited set of hand-engineered graph statistics. Leskovec et al. (2010) leveraged the Kronecker product of matrices, but the resulting generative model is very limited in modeling the underlying graph distributions. With recent advances in graph neural networks, a variety of deep neural network models have been introduced that are based on variational autoencoders (VAE) (Kingma & Welling, 2013) or generative adversarial networks (GAN) (Goodfellow et al., 2020). Some examples of such models include (De Cao & Kipf, 2018; Simonovsky & Komodakis, 2018; Kipf & Welling, 2016; Ma et al., 2018; Liu et al., 2019; Bojchevski et al., 2018; Yang et al., 2019). The major challenge in VAE-based models is that they rely on heuristics to solve a graph matching problem for aligning the VAE's input and sampled output, limiting them to small graphs. On the other hand, GAN-based methods circumvent the need for graph matching by using a permutation-invariant discriminator. However, they can still suffer from convergence issues and have difficulty capturing complex dependencies in graph structures for moderate to large graphs (Li et al., 2018; Martinkus et al., 2022). To address these limitations, Martinkus et al. (2022) recently proposed using spectral conditioning to enhance the expressivity of GAN models in capturing global graph properties.

On the other hand, autoregressive models approach graph generation as a sequential decision-making process. Following this paradigm, Li et al. (2018) proposed a generative model based on GNNs, but it has a high complexity of $O(mn^2)$. In a distinct approach, GraphRNN (You et al., 2018) modeled graph generation with a two-stage RNN architecture for generating new nodes and their links, respectively. However, traversing all elements of the adjacency matrix in a predefined order results in $O(n^2)$ time complexity, making it non-scalable to large graphs. In contrast, GRAN (Liao et al., 2019) employs a graph attention network and generates the adjacency matrix row by row, resulting in a sequential generation process with $O(n)$ steps. To improve the scalability of generative models, Dai et al. (2020) proposed an algorithm for sparse graphs that decreases the training complexity to $O(\log n)$, but at the expense of increasing the generation time complexity to $O((n + m) \log n)$. Despite their improvement in capturing complex statistics of graphs, autoregressive models rely heavily on an appropriate node ordering and do not take into account the community structures of the graphs. Additionally, due to their recursive nature, they are not fully parallelizable.

A new family of diffusion models for graphs has emerged recently. Continuous denoising diffusion was developed by Jo et al. (2022), which adds Gaussian noise to the graph adjacency matrix and node features during the diffusion process. However, since continuous noise destroys the sparsity and structural properties of the graph, discrete denoising diffusion models have been developed as a solution (Haefeli et al., 2022; Vignac et al., 2022; Kong et al., 2023). These models progressively edit graphs by adding or removing edges in the diffusion process, and denoising graph neural networks are then trained to reverse the diffusion process.
While denoising diffusion models can offer promising results, their main drawback is the requirement of a long chain of reverse diffusion steps, which can result in relatively slow sampling. To address this limitation, Chen et al. (2023) introduced a diffusion-based graph generative model in which a discrete diffusion process randomly removes edges while a denoising model is trained to invert this process; therefore, it only focuses on a portion of the nodes in the graph at each denoising step.

In this work, we introduce HiGen, a Hierarchical Graph Generative Network, to address the limitations of existing generative models by incorporating community structures and cross-level interactions. This approach involves generating graphs in a coarse-to-fine manner, where graph generation at each level is conditioned on a higher-level (lower-resolution) graph. The generation of communities at lower levels is performed in parallel, followed by the prediction of cross-edges between communities using a separate graph neural network. This parallelized approach enables high scalability. To capture hierarchical relations, our model allows each node at a given level to depend not only on its neighbouring nodes but also on its corresponding super-node at the higher level. We address the generation of integer-valued edge weights of the hierarchical structure by modeling the output distribution of edges using a multinomial distribution. We show that the multinomial distribution can be factorized successively, enabling the autoregressive generation of each community. This property makes the proposed architecture well-suited for generating graphs with integer-valued edge weights. Furthermore, by breaking down the graph generation process into the generation of multiple small partitions that are conditionally independent of each other, HiGen reduces its sensitivity to a predefined initial ordering of nodes.

2 BACKGROUND

A graph $G = (\mathcal{V}, \mathcal{E})$ is a collection of nodes (vertices) $\mathcal{V}$ and edges $\mathcal{E}$ with corresponding sizes $n = |\mathcal{V}|$ and $m = |\mathcal{E}|$, and an adjacency matrix $A^{\pi}$ for the node ordering $\pi$. The node set can be partitioned into $c$ communities (a.k.a. clusters or modules) using a graph partitioning function $\mathcal{F} : \mathcal{V} \to \{1, ..., c\}$, where each cluster of nodes forms a sub-graph denoted by $\mathcal{C}_i = (\mathcal{V}(\mathcal{C}_i), \mathcal{E}(\mathcal{C}_i))$ with adjacency matrix $A_i$.

Figure 1: (a) A sample hierarchical graph HG with 2 levels. Communities are shown in different colors, and the weight of a node and the weight of an edge at a higher level represent the sum of the edges in the corresponding community and bipartite, respectively. Node size and edge width indicate their weights. (b) The corresponding adjacency matrix of the graph at the leaf level, $G^2$, where each of its sub-graphs corresponds to a block in the adjacency matrix: communities correspond to diagonal blocks and are shown in different colors, while bipartites are colored in gray. (c) Decomposition of the multinomial distribution as a recursive stick-breaking process where, at each iteration, first a fraction of the remaining weights $r_t$ is allocated to the $t$-th row (corresponding to the $t$-th node in the sub-graph) and then this fraction, $v_t$, is distributed among that row of the lower-triangular adjacency matrix, $\hat{A}$. (d), (e) Parallel generation of communities and bipartites, respectively. Shadowed lines are the augmented edges representing candidate edges at each step.
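To make the partitioning notation concrete, the following minimal sketch (an illustrative example with a hand-picked toy graph and a hypothetical helper name `split_by_partition`, not the paper's implementation) extracts the community sub-graphs $\mathcal{C}_i$ (diagonal blocks of the adjacency matrix) and the bipartite graphs $\mathcal{B}_{ij}$ (off-diagonal blocks) induced by a node-to-cluster assignment $\mathcal{F}$:

```python
import numpy as np

def split_by_partition(A, labels):
    """Return the community sub-graphs C_i (diagonal blocks of A) and the
    bipartite graphs B_ij (off-diagonal blocks with at least one cross-edge)
    induced by a node-to-cluster assignment."""
    clusters = np.unique(labels)
    members = {c: np.where(labels == c)[0] for c in clusters}
    # community C_i: the block A[V(C_i), V(C_i)]
    communities = {c: A[np.ix_(members[c], members[c])] for c in clusters}
    # bipartite B_ij: cross-links between clusters i and j, kept if non-empty
    bipartites = {}
    for i in clusters:
        for j in clusters:
            if i < j:
                block = A[np.ix_(members[i], members[j])]
                if block.sum() > 0:
                    bipartites[(i, j)] = block
    return communities, bipartites

# toy graph: 6 nodes split into two clusters {0,1,2} and {3,4,5}
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])   # a hand-picked partition F
communities, bipartites = split_by_partition(A, labels)
print(communities[0])      # adjacency block of the first community
print(bipartites[(0, 1)])  # cross-links forming the bipartite between them
```

This block view is the same one shown in Figure 1b, where communities appear as diagonal blocks and bipartites as the gray off-diagonal blocks.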
The cross-links between neighboring communities form a bipartite graph, denoted by Bij = (V(Ci), V(Cj), E(Bij)) with adjacency matrix Aij. Each community is aggregated to a super-node and each bipartite corresponds to a super-edge linking neighboring communities, which induces a coarser graph at the higher (a.k.a. parent) level. Herein, the levels are indexed by superscripts. Formally, each community at level l, Cl i, is mapped to a node at the higher level graph, also called its parent node, vl 1 i := Pa(Cl i) and each bipartite at level l is represented by an edge in the higher level, also called its parent edge, el 1 i = Pa(Bl ij) = (vl 1 i , vl 1 j ). The weights of the self edges and the weights of the cross-edges in the parent level are determined by the sum of the weights of the edges within their corresponding community and bipartite, respectively. Therefore, the edges in the induced graphs at the higher levels have integer-valued weights: wl 1 ii = P e E(Cl i) we and wl 1 ij = P e E(Bl ij) we, moreover sum of all edge weights remains constant in all levels so w0 := P e E(Gl) we = |E|, l [0, ..., L]. This clustering process continues recursively in a bottom-up approach until a single node graph G0 is obtained, producing a hierarchical graph, defined by the set of graphs in all levels of abstractions, HG := {G0, ...., GL 1, GL}. This forms a dendrogram tree with G0 being the root and GL being the final graph that is generated at the leaf level. An HG is visualized in Figure 1a. The hierarchical tree structure enables modeling of both local and long-range interactions among nodes, as well as control over the flow of information between them, across multiple levels of abstraction. This is a key aspect of our proposed generative model. Please refer to appendix A for a list of notation definitions. Published as a conference paper at ICLR 2024 3 HIERARCHICAL GRAPH GENERATION In graph generative networks, the objective is to learn a generative model, p(G) given a set of training graphs. This work aims to establish a hierarchical multi-resolution framework for generating graphs in a coarse-to-fine fashion. In this framework, we assume that the graphs do not have node attributes, so the generative model only needs to characterize the graph topology. Given a particular node ordering π, and a hierarchical graph HG := {G0, ...., GL 1, GL}, produced by recursively applying a graph partitioning function, F, we can factorize the generative model using the chain rule of probability as: p(G = GL, π) = p({GL, GL 1, ..., G0}, π) = p(GL, π | {GL 1, ..., G0}) ... p(G1, π | G0) p(G0) l=0 p(Gl, π | Gl 1) p(G0) (1) In other words, the generative process involves specifying the probability of the graph at each level conditioned on its parent level graph in the hierarchy. This process is iterated recursively until the leaf level is reached. Here, the distribution of the root p(G0) = p(w0 = w0) can be simply estimated using the empirical distribution of the number of edges, w0 = |E|, of graphs in the training set. Based on the partitioned structure within each level of HG, the conditional generative probability p(Gl | Gl 1) can be decomposed into the probability of its communities and bipartite graphs as: p(Gl | Gl 1) = p({Cl i i V(Gl 1)} {Bl ij (i, j) E(Gl 1)} | Gl 1) i V(Gl 1) p(Cl i | Gl 1) Y (i,j) E(Gl 1) p(Bl ij | Gl 1, {Cl k}Cl k Gl) (2) This decomposition is based on the assumption that, given the parent graph Gl 1, generative probabilities of communities, Cl i, are mutually independent. 
The following theorem leverages the properties of the multinomial distribution to establish the conditional independence of these components for the integer-valued weighted hierarchical graph (refer to Appendix B.1 for the proof).

Theorem 3.1. Let the random vector $\mathbf{w} := [w_e]_{e \in \mathcal{E}(G^l)}$ denote the set of weights of all edges of $G^l$, such that their sum is $w^0 = \mathbf{1}^T \mathbf{w}$. The joint probability of $\mathbf{w}$ can be described by a multinomial distribution: $\mathbf{w} \sim \mathrm{Mu}(\mathbf{w} \mid w^0, \boldsymbol{\theta}^l)$. By observing that the sums of the edge weights within each community $\mathcal{C}^l_i$ and bipartite graph $\mathcal{B}^l_{ij}$ are determined by the weights of their parent edges at the higher level, $w^{l-1}_{ii}$ and $w^{l-1}_{ij}$ respectively, we can establish that these components are conditionally independent and each of them follows a multinomial distribution:

$$p(G^l \mid G^{l-1}) = \prod_{i \in \mathcal{V}(G^{l-1})} \mathrm{Mu}([w_e]_{e \in \mathcal{C}^l_i} \mid w^{l-1}_{ii}, \boldsymbol{\theta}^l_{ii}) \prod_{(i,j) \in \mathcal{E}(G^{l-1})} \mathrm{Mu}([w_e]_{e \in \mathcal{B}^l_{ij}} \mid w^{l-1}_{ij}, \boldsymbol{\theta}^l_{ij}) \tag{3}$$

where $\{\boldsymbol{\theta}^l_{ij}[e] \in [0, 1] \text{ s.t. } \mathbf{1}^T \boldsymbol{\theta}^l_{ij} = 1 \mid (i,j) \in \mathcal{E}(G^{l-1})\}$ are the parameters of the multinomial models.

Therefore, given the parent graph at a higher level, the generation of the graph at its subsequent level can be reduced to the generation of its partition and bipartite sub-graphs. As illustrated in Figure 1, this decomposition enables parallel generation of the communities in each level, followed by predicting the bipartite sub-graphs of that level. Each of these sub-graphs corresponds to a block in the adjacency matrix, as visualized in Figure 2a, so the proposed hierarchical model generates the adjacency matrix in a block-wise fashion and constructs the final graph topology.
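As a numerical illustration of equation (3), the following sketch samples all community and bipartite blocks of one level in parallel from independent multinomials whose totals are fixed by the parent edge weights. The inputs are hypothetical: in HiGen the probability vectors $\boldsymbol{\theta}$ are produced by neural networks and the communities are generated autoregressively (Section 3.1) rather than in a single multinomial draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_level(parent_weights, thetas):
    """Sample the edge weights of one level, given its parent graph.

    parent_weights: dict mapping a community i or a bipartite (i, j) to its
                    parent edge weight w_{ii} or w_{ij}.
    thetas:         dict mapping the same keys to a probability vector over
                    the candidate edges of that block (each sums to 1).

    By Theorem 3.1 the blocks are conditionally independent given the parent
    graph, so each one is drawn from its own multinomial.
    """
    return {key: rng.multinomial(parent_weights[key], thetas[key])
            for key in parent_weights}

# toy parent level: two communities with self-weights 3 and 3,
# connected by one super-edge of weight 1
parent_weights = {0: 3, 1: 3, (0, 1): 1}
thetas = {0: np.ones(3) / 3,         # 3 candidate edges inside community 0
          1: np.ones(3) / 3,         # 3 candidate edges inside community 1
          (0, 1): np.ones(9) / 9}    # 3x3 candidate cross-edges
samples = sample_level(parent_weights, thetas)
for key, w in samples.items():
    assert w.sum() == parent_weights[key]  # each block's total matches its parent edge
print(samples)
```

Because each block's total is pinned to its parent edge weight, the sampled level is automatically consistent with the parent graph.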
3.1 COMMUNITY GENERATION

Based on equation (3), the edge weights within each community can be jointly modeled using a multinomial distribution. In this work, our objective is to specify the generative probability of communities at level $l$, $p(\mathcal{C}^l_i \mid G^{l-1})$, as an autoregressive process; hence, we need to factorize the joint multinomial distribution of edges into a sequence of conditional distributions. Toward this goal, we show in Appendix B.2 that this multinomial distribution can be decomposed into a sequence of binomial distributions, one for each edge, using a stick-breaking process. This result enables us to model community generation as an edge-by-edge autoregressive process with $O(|\mathcal{V}_{\mathcal{C}}|^2)$ generation steps, similar to existing algorithms such as GraphRNN (You et al., 2018) or DeepGMG (Li et al., 2018). However, inspired by GRAN (Liao et al., 2019), a community can be generated more efficiently by generating one node at a time, reducing the sampling process to $O(|\mathcal{V}_{\mathcal{C}}|)$ steps. This requires decomposing the generative probability of edges in a group-wise form, where the candidate edges between the $t$-th node (the new node at the $t$-th step) and the already generated graph are grouped together. In other words, this model completes the lower-triangular adjacency matrix one row at a time, conditioned on the already generated sub-graph and the parent-level graph. The following theorem formally derives this decomposition for multinomial distributions.

Theorem 3.2. For a random counting vector $\mathbf{w} \in \mathbb{Z}^{E}_{+}$ with a multinomial distribution $\mathrm{Mu}(\mathbf{w} \mid w, \boldsymbol{\theta})$, let us split it into $T$ disjoint groups $\mathbf{w} = [\mathbf{u}_1, ..., \mathbf{u}_T]$, where $\mathbf{u}_t \in \mathbb{Z}^{E_t}_{+}$ and $\sum_{t=1}^{T} E_t = E$, and also split the probability vector accordingly as $\boldsymbol{\theta} = [\boldsymbol{\theta}_1, ..., \boldsymbol{\theta}_T]$. Additionally, let us define the sum of all variables in the $t$-th group ($t$-th step) by a random count variable $v_t := \sum_{e=1}^{E_t} u_{t,e}$. Then, the multinomial distribution can be factorized as a chain of binomial and multinomial distributions:

$$\mathrm{Mu}(\mathbf{w} = [\mathbf{u}_1, ..., \mathbf{u}_T] \mid w, \boldsymbol{\theta} = [\boldsymbol{\theta}_1, ..., \boldsymbol{\theta}_T]) = \prod_{t=1}^{T} \mathrm{Bi}(v_t \mid r_t, \eta_{v_t})\, \mathrm{Mu}(\mathbf{u}_t \mid v_t, \boldsymbol{\lambda}_t), \tag{4}$$

where $r_t = w - \sum_{i < t} v_i$
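A small numerical sketch of the factorization in equation (4): groups are filled one at a time by first drawing the group total $v_t$ from a binomial over the remaining budget $r_t$ and then spreading $v_t$ over the group with a multinomial. The specific forms used for $\eta_{v_t}$ and $\boldsymbol{\lambda}_t$ below (renormalized group probabilities) are one consistent choice and an assumption of this sketch, since the excerpt above is truncated before defining them.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_multinomial(w, theta_groups):
    """Sample a multinomial count vector group by group, following the chain
    of binomial and multinomial factors in equation (4).

    At step t the remaining budget r_t = w - sum_{i<t} v_i is split: a
    binomial draw decides how much of it, v_t, is allocated to group t, and a
    multinomial draw spreads v_t over the entries of that group.  The
    renormalized probabilities used for eta_t and lambda_t are an assumption
    of this sketch (one consistent choice), not a claim about the paper.
    """
    remaining_mass = sum(th.sum() for th in theta_groups)  # probability mass left
    r = w                                                  # counts left to allocate
    groups = []
    for theta_t in theta_groups:
        p_t = theta_t.sum()
        eta_t = min(1.0, p_t / remaining_mass)   # binomial success probability
        v_t = rng.binomial(r, eta_t)             # total weight given to group t
        groups.append(rng.multinomial(v_t, theta_t / p_t))  # spread it within the group
        r -= v_t
        remaining_mass -= p_t
    return np.concatenate(groups)

# sanity check: chained sampling matches the mean of a direct Mu(w, theta) draw
theta = rng.dirichlet(np.ones(6))
theta_groups = [theta[:2], theta[2:5], theta[5:]]   # groups of sizes 2, 3, 1
w = 50
samples = np.stack([stick_breaking_multinomial(w, theta_groups)
                    for _ in range(20000)])
print(np.round(samples.mean(axis=0), 2))   # empirical mean of the chained draws
print(np.round(w * theta, 2))              # w * theta, the multinomial mean
```

Averaging many chained draws recovers the mean $w\,\boldsymbol{\theta}$ of a direct multinomial draw, which is what the factorization predicts. In HiGen the groups correspond to rows of the lower-triangular adjacency matrix of a community, so one binomial/multinomial step generates all candidate edges of a newly added node.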