Published as a conference paper at ICLR 2025

ADVANCING GRAPH GENERATION THROUGH BETA DIFFUSION

Xinyang Liu1,*, Yilin He1,*, Bo Chen2, Mingyuan Zhou1
1The University of Texas at Austin, 2Xidian University
xinyangatk@gmail.com, yilin.he@mccombs.utexas.edu, bchen@mail.xidian.edu.cn, mingyuan.zhou@mccombs.utexas.edu

Diffusion models have excelled in generating natural images and are now being adapted to a variety of data types, including graphs. However, conventional models often rely on Gaussian or categorical diffusion processes, which can struggle to accommodate the mixed discrete and continuous components characteristic of graph data. Graphs typically feature discrete structures and continuous node attributes that often exhibit rich statistical patterns, including sparsity, bounded ranges, skewed distributions, and long-tailed behavior. To address these challenges, we introduce Graph Beta Diffusion (GBD), a generative model specifically designed to handle the diverse nature of graph data. GBD leverages a beta diffusion process, effectively modeling both continuous and discrete elements. Additionally, we propose a modulation technique that enhances the realism of generated graphs by stabilizing critical graph topology while maintaining flexibility for other components. GBD competes strongly with existing models across multiple general and biochemical graph benchmarks, showcasing its ability to capture the intricate balance between discrete and continuous features inherent in real-world graph data. Our PyTorch code is available at https://github.com/xinyangATK/GraphBetaDiffusion.

[Figure 1: Overview of the forward and reverse diffusion processes of GBD. The forward process applies $G_t = G_{t-1} \odot Q_t$ and the reverse process applies $G_{t-1} = G_t + P_t \odot (1 - G_t)$, where $Q_t$ and $P_t$ are beta variables sampled from beta distributions parameterized by the initial graphs $G_0$ and the clean graphs predicted by $\hat{G}_\theta$. The neural network $\hat{G}_\theta$ is learned by minimizing Equation 8, which is constituted by $\mathcal{L}_{\text{sampling}}$ and $\mathcal{L}_{\text{correction}}$.]

[Figure 2: Edge generation process of the GBD model in graph topology (top) and adjacency matrix (bottom) views, with snapshots at 0.05T, 0.1T, 0.15T, 0.25T, 0.5T, and T. Nodes are sorted by descending degree centrality and color-coded by degree, from yellow (high) to purple (low). The modulation technique in Section 2.3 enables early emergence of key positions such as community hubs, enhancing reverse chain stability.]

*Equal contribution.

1 INTRODUCTION

In recent years, there has been a significant surge of interest and activity in the field of graph generation, particularly with the development of advanced generative models tailored for graphs. This growing attention is driven by the recognition of graph data's pervasive presence and utility across diverse real-world applications, ranging from social network studies (Newman et al., 2002; Conti et al., 2011; Abbe, 2018; Shirzadian et al., 2023) to biochemical molecular research (Jin et al., 2018; Liu et al., 2018; Bongini et al., 2021; Guo et al., 2024; Li & Yamanishi, 2024). Additionally, the rapid evolution of machine learning tools has introduced powerful techniques for data generation, among which diffusion models (Ho et al., 2020; Song et al., 2021; Austin et al., 2021; Avdeyev et al., 2023; Chen & Zhou, 2023; Zhou et al., 2023) stand out as a notable example.
As these advanced tools intersect with the task of graph generation, we witness the emergence of numerous diffusion-based graph generative models (Niu et al., 2020a; Jo et al., 2022; Haefeli et al., 2022; Huang et al., 2022; Vignac et al., 2023; Jo et al., 2023; Cho et al., 2023; Chen et al., 2023; Kong et al., 2023). While diffusion-based graph generative models often demonstrate superior performance compared to their predecessors (You et al., 2018; De Cao & Kipf, 2018; Li et al., 2018; Simonovsky & Komodakis, 2018; Liu et al., 2019; Shi et al., 2020; Luo et al., 2021; Martinkus et al., 2022), there is still potential for further enhancement in the quality of generated graphs. Among the latest advancements in these methods, it is widely recognized that incorporating inductive bias from the graph data is generally beneficial for model design (Jo et al., 2023). One promising direction for incorporating this bias involves considering the statistical characteristics of the distribution of graph data. For instance, both Graph D3PM (Haefeli et al., 2022) and DiGress (Vignac et al., 2023) have demonstrated that respecting the binary or categorical nature of the graph adjacency matrix, and modeling it in discrete space, is beneficial for generating more realistic graphs.

Accounting for the discreteness of the graph adjacency matrix has thus been shown to enhance performance. However, it is crucial to recognize that the complexity and flexibility of the distribution characteristics of graph data extend beyond mere discreteness. Real-world graphs usually display sparse edge distributions and exhibit diverse statistical patterns in node attributes, which may include skewed, multi-modal, or long-tailed distributions (Barabási et al., 2000; Ciotti et al., 2015; Liang et al., 2023; Wang et al., 2023). While the values within node feature matrices are not inherently bounded in range, they are often empirically represented or processed into quantities that are bounded within specific limits. Considering the unique characteristics inherent to graph data, it is clear that Gaussian and categorical distributions, often the default choices for constructing diffusion processes, may not adequately align with these graph traits. This misalignment could introduce noticeable limitations in accurately modeling the distribution of graphs.

Given the unique statistical characteristics of graph data, the beta distribution emerges as a particularly suitable modeling choice. With great flexibility to model continuous data with diverse statistical characteristics and to approximate discrete distributions at all sparsity levels, the beta distribution aligns well with the inherent traits of graphs, making it a promising candidate to surpass the potential limitations imposed by utilizing Gaussian or categorical distributions.

In this paper, we introduce Graph Beta Diffusion (GBD) as a novel addition to diffusion-based graph generative models. GBD models the joint distribution of node attributes and edge connections within a graph through beta diffusion (Zhou et al., 2023), a generative diffusion process built upon the thinning of beta random variables in its multiplicative forward diffusion process and their thickening in its multiplicative reverse process. We underscore two major contributions arising from the development of GBD.
First, our experiments generating data on various synthetic and real-world graphs confirm the effectiveness of beta diffusion as a strategic choice within the design framework of the backbone diffusion model, especially for graph generation tasks. Second, our exploration of the model's design space has yielded a set of recommended practices, notably a novel modulation technique that bolsters the stability of generating essential graph structures. We demonstrate that these practices, when implemented together, lead to consistent enhancements in model performance.

2 THE METHODOLOGY

In this study, our primary focus lies in generating two types of graphs: generic graphs and molecular graphs. A graph with $N$ nodes is represented by the tuple $G = (A, X)$, where $X \in \mathbb{R}^{N \times D}$ denotes the node features with feature dimension $D$, and $A \in \mathbb{R}^{N \times N}$ is the symmetric binary adjacency matrix that defines the connections between nodes. The selection of node features offers high flexibility, ranging from raw-data-provided node categories to hand-crafted features such as node-level statistics (Jo et al., 2022) or spectral graph signals (Jo et al., 2023). The features within $X$ exhibit great diversity in their nature, including numerical, categorical, and ordinal types. Through preprocessing methods including dummy-encoding, empirical CDF transformation, or normalization, we standardize them as continuous variables bounded within $[0, 1]$. For molecular graphs, we use $A^{(1:K)}$ to represent the structure of a graph with $K$ types of edges, and $G$ is defined as $(A^{(1:K)}, X)$. In the sequel, we by default employ the generic graph scenario to illustrate the methodology.

2.1 FORWARD AND REVERSE BETA DIFFUSION PROCESSES

Forward multiplicative beta diffusion process. Such a process can be characterized by the transition probability $q(G_t \mid G_{t-1}, G_0)$, with $G_0$ denoting the combination of the original adjacency matrix and node feature matrix. Following recent diffusion-based graph generative models (Jo et al., 2022; Vignac et al., 2023; Jo et al., 2023; Cho et al., 2023), we assume $q(G_t \mid G_{t-1}, G_0)$ to be factorizable such that $q(G_t \mid G_{t-1}, G_0) = q(A_t \mid A_{t-1}, A_0)\, q(X_t \mid X_{t-1}, X_0)$. Constructing the forward multiplicative beta diffusion process (Zhou et al., 2023) for graph modeling, we have:

$$A_t = A_{t-1} \odot Q_{A,t}, \quad Q_{A,t} \sim \text{Beta}\big(\eta_A \alpha_t A_0,\; \eta_A (\alpha_{t-1} - \alpha_t) A_0\big), \tag{1}$$
$$X_t = X_{t-1} \odot Q_{X,t}, \quad Q_{X,t} \sim \text{Beta}\big(\eta_X \alpha_t X_0,\; \eta_X (\alpha_{t-1} - \alpha_t) X_0\big), \quad t \in [1, T]. \tag{2}$$

Here $\eta_A, \eta_X$ are positive scalars adjusting the concentration of the beta distributions, with higher values leading to enhanced concentration and reduced variability. The diffusion noise schedule is defined by $\{\alpha_t \mid t \in [1, T]\}$, a sequence of values descending from 1 towards 0 as $t$ increases. Elements in the fractional multipliers $Q_{A,t}$ and $Q_{X,t}$ are independently sampled from their respective beta distributions. With the forward diffusion process defined in Equations 1 and 2, we characterize the stochastic transitions of an element $g$ within $G$ as:

$$q(g_t \mid g_{t-1}, g_0) = \frac{1}{g_{t-1}} \text{Beta}\left(\frac{g_t}{g_{t-1}} \,\Big|\, \eta \alpha_t g_0,\; \eta (\alpha_{t-1} - \alpha_t) g_0\right), \tag{3}$$

where, depending on whether $g$ is an element of $A$ or $X$, we have either $\eta = \eta_A$ or $\eta = \eta_X$. Derived from Equation 3, the joint distribution $q(G_{1:T} \mid G_0)$ admits an analytical marginal distribution at each time stamp $t$; specifically,

$$q(G_t \mid G_0) = \text{Beta}\big(\eta \alpha_t G_0,\; \eta (1 - \alpha_t G_0)\big). \tag{4}$$
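To make the forward process concrete, the following is a minimal PyTorch sketch of Equations 1, 2, and 4. The function names and calling conventions are illustrative rather than taken from the released code; `G0` stands for the transformed adjacency or node-feature tensor with entries in (0, 1), and `eta` for the corresponding concentration scalar.

```python
import torch

def sample_forward_marginal(G0, alpha_t, eta):
    # G_t ~ q(G_t | G_0) = Beta(eta * alpha_t * G0, eta * (1 - alpha_t * G0))  (Eq. 4)
    return torch.distributions.Beta(
        eta * alpha_t * G0, eta * (1.0 - alpha_t * G0)
    ).sample()

def forward_step(G_prev, G0, alpha_t, alpha_prev, eta):
    # One multiplicative thinning step G_t = G_{t-1} * Q_t  (Eqs. 1-2);
    # requires alpha_prev > alpha_t so that both beta parameters stay positive.
    Q = torch.distributions.Beta(
        eta * alpha_t * G0, eta * (alpha_prev - alpha_t) * G0
    ).sample()
    return G_prev * Q
```

Because the marginal in Equation 4 is available in closed form, training never needs to simulate the stepwise thinning; `sample_forward_marginal` suffices to draw $G_t$ for a random $t$.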
Reverse multiplicative beta diffusion process. It is important to note that the joint distribution $q(G_{1:T} \mid G_0)$ can be equivalently constructed in reverse order, which directs samples from the terminal states $G_T$ towards the initial states $G_0$ by incrementally applying the change $\delta G_t$ at each reversed time stamp. With the change at a given time $t$ parameterized as $\delta G_t := P_t \odot (1 - G_t)$, where $P_t$ are beta-distributed fractional multipliers, the time-reversal multiplicative sampling process can be mathematically defined as: for $t = T, T-1, \ldots, 1$,

$$G_{t-1} = G_t + P_t \odot (1 - G_t), \quad P_t \sim \text{Beta}\big(\eta (\alpha_{t-1} - \alpha_t) G_0,\; \eta (1 - \alpha_{t-1} G_0)\big). \tag{5}$$

Similar to the forward sampling process, we can derive the transition distribution corresponding to the reverse sampling process described in Equation 5 as follows:

$$q(G_{t-1} \mid G_t, G_0) = \frac{1}{1 - G_t} \text{Beta}\left(\frac{G_{t-1} - G_t}{1 - G_t} \,\Big|\, \eta (\alpha_{t-1} - \alpha_t) G_0,\; \eta (1 - \alpha_{t-1} G_0)\right). \tag{6}$$

Following previous work (Austin et al., 2021; Haefeli et al., 2022; Vignac et al., 2023; Zhou et al., 2023), we construct the reverse diffusion process through the definition of the ancestral sampling distribution as follows:

$$p_\theta(G_{t-1} \mid G_t) := q\big(G_{t-1} \mid G_t, \hat{G}_\theta(G_t, t)\big), \tag{7}$$

where $\hat{G}_\theta(G_t, t)$ is a neural network that predicts the conditional expectation of $G_0$ given $G_t$. Following Vignac et al. (2023), we instantiate $\hat{G}_\theta(G_t, t)$ as a graph transformer network (Dwivedi & Bresson, 2020). We present the complete sampling process in Appendix D.1.

2.2 TRAINING GBD

The overall training procedure of GBD is described in Section D.1 of the Appendix. We employ the objective function proposed by beta diffusion (Zhou et al., 2023), specifically,

$$\mathcal{L} = \sum_{t=2}^{T} \big[(1 - \omega)\, \mathcal{L}_{\text{sampling}}(t, G_0) + \omega\, \mathcal{L}_{\text{correction}}(t, G_t)\big], \quad \omega \in [0, 1]. \tag{8}$$

In Equation 8, the loss terms associated with sampling and correction are defined as

$$\mathcal{L}_{\text{sampling}}(t, G_0) = \mathbb{E}_{q(G_t, G_0)}\, \text{KL}\big(p_\theta(G_{t-1} \mid G_t)\, \|\, q(G_{t-1} \mid G_t, G_0)\big), \tag{9}$$
$$\mathcal{L}_{\text{correction}}(t, G_0) = \mathbb{E}_{q(G_t, G_0)}\, \text{KL}\big(q(G_\tau \mid \hat{G}_\theta(G_t, t))\, \|\, q(G_\tau \mid G_0)\big). \tag{10}$$

In Equation 10, the KL divergence is evaluated between the following distributions: $q(G_\tau \mid \hat{G}_\theta(G_t, t))$ is $\text{Beta}(\eta \alpha_t \hat{G}_\theta(G_t, t),\; \eta(1 - \alpha_t \hat{G}_\theta(G_t, t)))$, and $q(G_\tau \mid G_0)$ is the same as $q(G_t \mid G_0)$ in distribution. The subscript $\tau$ is introduced to represent a generic graph sample other than $G_t$ that is also obtained at time $t$ from the forward diffusion process. The core principle behind the loss terms can be described as follows: $\mathcal{L}_{\text{sampling}}$ drives the empirical ancestral sampling distribution towards the destination-conditional posterior distribution, while $\mathcal{L}_{\text{correction}}$ corrects the bias in the marginal distribution at each time stamp accumulated through ancestral sampling. These two types of loss terms collectively reduce the divergence between the empirical joint distribution of two graphs sampled from adjacent time stamps in the reverse process and their joint distribution derived from the forward diffusion process. A positive weight $\omega$ is introduced to balance the effects of these two types of loss terms. We set it to 0.01, following Zhou et al. (2023), and found that this configuration is sufficient to produce graphs that closely resemble the reference graphs without further tuning. To better elucidate the optimization objective, we list the analytical expressions of the KL divergence terms in Appendix B.
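As a sanity-check implementation, the two loss terms can be evaluated directly with `torch.distributions`: the KL in Equation 9 reduces to a KL between the beta distributions of the fractional increment $(G_{t-1} - G_t)/(1 - G_t)$ (cf. Equation 6), and Equation 10 is a KL between two beta marginals. Below is a minimal sketch under the notation above, with illustrative function names; it is not the released training code, which uses the analytical Bregman-divergence forms discussed next.

```python
import torch
from torch.distributions import Beta, kl_divergence

def gbd_losses(G0, G0_hat, alpha_t, alpha_prev, eta, omega=0.01):
    # L_sampling (Eq. 9): KL between the reverse-transition betas over the
    # fractional increment (G_{t-1} - G_t) / (1 - G_t), cf. Equation 6.
    p = Beta(eta * (alpha_prev - alpha_t) * G0_hat, eta * (1 - alpha_prev * G0_hat))
    q = Beta(eta * (alpha_prev - alpha_t) * G0,     eta * (1 - alpha_prev * G0))
    L_sampling = kl_divergence(p, q)

    # L_correction (Eq. 10): KL between the time-t beta marginals implied by
    # the prediction and by the clean data.
    p_marg = Beta(eta * alpha_t * G0_hat, eta * (1 - alpha_t * G0_hat))
    q_marg = Beta(eta * alpha_t * G0,     eta * (1 - alpha_t * G0))
    L_correction = kl_divergence(p_marg, q_marg)

    return ((1 - omega) * L_sampling + omega * L_correction).mean()
```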
It is demonstrated in Zhou et al. (2023) that the KL divergence between two beta distributions can be expressed in the form of a Bregman divergence. Namely, considering a convex function $\phi(\alpha, \beta) = \ln \text{B}(\alpha, \beta)$, where $\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the beta function, the loss term $\mathcal{L}_{\text{sampling}}$ can be expressed as

$$\mathcal{L}_{\text{sampling}}(t, G_0) = \mathbb{E}_{q(G_t)} \mathbb{E}_{q(G_0 \mid G_t)}\, d_\phi\big([a_{\text{sampling}}, b_{\text{sampling}}],\, [a'_{\text{sampling}}, b'_{\text{sampling}}]\big), \tag{11}$$
$$a_{\text{sampling}} = \eta(\alpha_{t-1} - \alpha_t) G_0, \quad b_{\text{sampling}} = \eta(1 - \alpha_{t-1} G_0), \quad a'_{\text{sampling}} = \eta(\alpha_{t-1} - \alpha_t) \hat{G}_0, \quad b'_{\text{sampling}} = \eta(1 - \alpha_{t-1} \hat{G}_0).$$

Likewise, we can express the correction loss term $\mathcal{L}_{\text{correction}}$ as

$$\mathcal{L}_{\text{correction}}(t, G_0) = \mathbb{E}_{q(G_t)} \mathbb{E}_{q(G_0 \mid G_t)}\, d_\phi\big([a_{\text{correction}}, b_{\text{correction}}],\, [a'_{\text{correction}}, b'_{\text{correction}}]\big), \tag{12}$$
$$a_{\text{correction}} = \eta \alpha_t G_0, \quad b_{\text{correction}} = \eta(1 - \alpha_t G_0), \quad a'_{\text{correction}} = \eta \alpha_t \hat{G}_0, \quad b'_{\text{correction}} = \eta(1 - \alpha_t \hat{G}_0).$$

Here we adopt the $d_\phi$ notation of Banerjee et al. (2005) to represent the Bregman divergence. As established in Lemmas 3-5 of Zhou et al. (2023), since both $\mathcal{L}_{\text{sampling}}$ and $\mathcal{L}_{\text{correction}}$ can be formulated as Bregman divergences, Proposition 1 of Banerjee et al. (2005) applies, demonstrating that they yield the same unbiased optimal solution. This justifies the use of $\hat{G}_0$ in the reverse diffusion process, as detailed below.

Property 1. Both $\mathcal{L}_{\text{sampling}}$ and $\mathcal{L}_{\text{correction}}$ are uniquely minimized at $\hat{G}_0 = \hat{G}_\theta(G_t, t) = \mathbb{E}_{q(G_0 \mid G_t)}[G_0]$.

2.3 EXPLORING THE DESIGN SPACE OF GBD

Many diffusion-based graph generative models offer great flexibility for technical adjustments that enhance their practical performance. Here we list four impactful dimensions of the design space of GBD: data transformation, concentration modulation, logit-domain computation, and neural-network preconditioning. We elaborate on each design dimension below and discuss our choices in these aspects in the Appendix.

Data transformation. We convert the raw data $(A, X)$ to $G_0$ through linear transformations, i.e., $G_0 = (A_0, X_0)$, where

$$A_0 = w_A \cdot A + b_A, \quad X_0 = w_X \cdot X + b_X, \tag{13}$$

with the constraints that $\min(w_A, b_A, w_X, b_X) > 0$ and $\max(w_A + b_A, w_X + b_X) \leq 1$. This operation not only ensures that all data values fall within the positive support of the beta distributions, avoiding gradient explosion when optimizing the loss function, but also provides an effective means to adjust the rate at which diffusion trajectories mix. A forward diffusion trajectory reaches a state of mixing when the initial value can no longer be discerned from its counterfactual given the current value. A suitable mixing rate ensures that the signal-to-noise ratio (SNR) of the final state in the forward diffusion process approaches zero, meeting the prerequisite for learning reverse diffusion while preserving the learnability of graph structural patterns. The scaling parameter provides macro control over the mixing rate, with a smaller value contracting the data range and promoting the arrival of the mixing state.
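A minimal sketch of the transformation in Equation 13 and its inverse follows; the particular values of $w$ and $b$ below are illustrative placeholders, not the settings used in the experiments.

```python
def transform(A, X, w_A=0.98, b_A=0.01, w_X=0.98, b_X=0.01):
    # Linear data transformation of Equation 13: maps {0,1} edges and [0,1]
    # features into the open interval (0, 1), keeping all values inside the
    # support of the beta distribution.
    assert min(w_A, b_A, w_X, b_X) > 0 and max(w_A + b_A, w_X + b_X) <= 1
    return w_A * A + b_A, w_X * X + b_X

def inverse_transform(A0, X0, w_A=0.98, b_A=0.01, w_X=0.98, b_X=0.01):
    # Undo Equation 13 after sampling, recovering the original data range.
    return (A0 - b_A) / w_A, (X0 - b_X) / w_X
```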
Concentration modulation. Another hyperparameter that offers a more refined adjustment of the mixing rate is the concentration parameter $\eta$. Higher values of $\eta$ reduce the variance of the fractional multipliers $P_t$ sampled from their corresponding beta distributions, thus delaying the arrival of the mixing state. Leveraging this property, we have devised a simple yet effective modulation strategy to differentiate the mixing times across various graph substructures. Specifically, we assign higher $\eta$ values to important positions within a graph, such as edges connecting high-centrality nodes or edges deemed significant based on domain knowledge, such as the carbon-carbon bonds in chemical molecules. For instance, when modulating $\eta$ by degree centrality, the exact operation executed upon the $\eta$ values for edge $(u, v)$ and for the features of node $u$ can be mathematically expressed as

$$\eta_{u,v} = g_A\big(\max(\deg(u), \deg(v))\big), \quad \eta_u = g_X\big(\deg(u)\big). \tag{14}$$

Here we first prepare several levels of $\eta$ values, then utilize two assignment functions, namely $g_A(\cdot)$ and $g_X(\cdot)$, to map the node degrees (or their percentiles in the degree population within one graph) to one of the choices of $\eta$ values. We have observed that this operation indeed prolongs the presence of these substructures during the forward diffusion process, which in turn leads to their earlier emergence compared to the rest of the graph during the reverse process. Additionally, we provide an alternative definition of important positions using betweenness centrality (Freeman, 1977), detailed in Appendix D.2, and also ablate its effects in Section 4.3.

We visualize the reverse process from two perspectives in Figure 2. We first obtain $\eta_{u,v}$ from degrees retrieved from the training set before sampling and then generate graphs through reverse beta diffusion. From the top row, we observe that edges linked to nodes with higher degrees (indicated by brighter colors) appear first, followed by other edges. From the bottom row, it is evident that edges connected to the first five nodes, which have higher degrees, are identified first and then progressively in descending order of degree. Notably, the nodes of the adjacency matrices in the bottom row are reordered by decreasing node degree of the final graph. Additionally, we find that the predicted graph of GBD converges to the correct topology at an early stage. We attribute the enhanced quality of generated graphs to the early emergence of these important substructures, which likely improves the reliability of generating realistic graph structures. Furthermore, this approach is particularly appealing as it allows for the flexible integration of graph inductive biases within the diffusion model framework.

Logit-domain computation. Another noteworthy design direction lies in the computation domain. Although the reverse sampling process directly implemented from Equation 5 is already effective for generating realistic graph data, we observe that migrating the computation to the logit space further enhances model performance and accelerates training convergence. One potential explanation is that the logit transformation amplifies the structural patterns of the graph when all edge weights are very close to zero at the beginning of the ancestral sampling process. Equivalent to Equation 5, the logit-domain computation can be expressed as

$$\text{logit}(G_{t-1}) = \ln\left(e^{\text{logit}(G_t)} + e^{\text{logit}(P_t)} + e^{\text{logit}(G_t) + \text{logit}(P_t)}\right). \tag{15}$$

A numerically stable implementation of this update is sketched at the end of this subsection.

Neural-network preconditioning. Finally, we employ the neural-network preconditioning technique (Karras et al., 2022) and customize it for training GBD, which involves standardizing $G_t$ before passing it to the prediction network $\hat{G}_\theta(\cdot)$. In other words, we modify Equation 7 as

$$p_\theta(G_{t-1} \mid G_t) := q\big(G_{t-1} \mid G_t, \hat{G}_\theta(\bar{G}_t, t)\big), \quad \bar{G}_t = \frac{G_t - \mathbb{E}[g_t]}{\sqrt{\text{Var}[g_t]}} \;\text{ or }\; \frac{\text{logit}(G_t) - \mathbb{E}[\text{logit}(g_t)]}{\sqrt{\text{Var}[\text{logit}(g_t)]}}. \tag{16}$$

To illustrate the application of neural-network preconditioning, we present an example of training GBD to generate graphs $G = (A, X)$. For simplicity, we assume the predictor operates in the original domain, with corresponding results for cases involving logit-domain computations provided in Appendix C.
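The promised sketch of the logit-domain update: because $\ln(e^a + e^b + e^{a+b})$ is a logsumexp over $\{a, b, a+b\}$, Equation 15 can be evaluated stably as follows. The helper name is ours for illustration, not the released implementation.

```python
import torch

def reverse_step_logit(logit_Gt, logit_Pt):
    # Equation 15: with a = logit(G_t) and b = logit(P_t), the update
    # G_{t-1} = G_t + P_t * (1 - G_t) becomes
    # logit(G_{t-1}) = ln(e^a + e^b + e^{a+b}) = logsumexp({a, b, a + b}).
    stacked = torch.stack([logit_Gt, logit_Pt, logit_Gt + logit_Pt], dim=0)
    return torch.logsumexp(stacked, dim=0)
```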
We make the following statistical assumptions regarding the marginal distributions of $a_0$ and $x_0$, the elements within the graph adjacency matrices and node feature matrices after the data transformation step: $a_0$ follows a Bernoulli-type distribution with two possible outcomes $a_{\min}, a_{\max}$, where the probability of $a_0 = a_{\max}$ is $p$, and $x_0$ follows a uniform distribution over the support $[x_{\min}, x_{\max}]$. Given that $g_t \mid g_0 \sim \text{Beta}(\eta \alpha_t g_0, \eta(1 - \alpha_t g_0))$, and by applying the law of total expectation and the law of total variance, one can derive that

$$\mathbb{E}[a_t] = \alpha_t \big(p\, a_{\max} + (1 - p)\, a_{\min}\big), \quad \text{Var}[a_t] = \frac{1}{\eta_A + 1}\big(\mathbb{E}[a_t] - \mathbb{E}[a_t]^2\big) + \frac{\eta_A}{\eta_A + 1}\, \alpha_t^2\, p(1 - p)(a_{\max} - a_{\min})^2, \tag{17}$$
$$\mathbb{E}[x_t] = \frac{1}{2}\alpha_t (x_{\min} + x_{\max}), \quad \text{Var}[x_t] = \frac{1}{\eta_X + 1}\big(\mathbb{E}[x_t] - \mathbb{E}[x_t]^2\big) + \frac{\eta_X}{12(\eta_X + 1)}\, \alpha_t^2 (x_{\max} - x_{\min})^2. \tag{18}$$

Using these computed quantities, we complete the neural-network preconditioning by standardizing $G_t$ into $\bar{G}_t = (\bar{A}_t, \bar{X}_t)$, where the adjacency matrix is transformed as $\bar{A}_t = (A_t - \mathbb{E}[a_t])/\sqrt{\text{Var}[a_t]}$ and the node feature matrix as $\bar{X}_t = (X_t - \mathbb{E}[x_t])/\sqrt{\text{Var}[x_t]}$. The standardized graph $\bar{G}_t$ is then used as input to the predictor network. The derivation of Equations 17 and 18 is given in Appendix C. While neural-network preconditioning is generally effective for stabilizing model training, we empirically find it to be particularly beneficial when training GBD with the predictor defined in the logit space, where increased variability is introduced.
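Under the two distributional assumptions above, the preconditioning statistics of Equations 17 and 18 reduce to a few lines of arithmetic. The helper names below are ours for illustration; the inputs may be Python floats or tensors.

```python
def precondition_stats_edges(alpha_t, p, a_min, a_max, eta_A):
    # Mean and variance of a_t from Equation 17, assuming the transformed edge
    # entries a_0 take values {a_min, a_max} with P(a_0 = a_max) = p.
    mean = alpha_t * (p * a_max + (1 - p) * a_min)
    var = (mean - mean**2) / (eta_A + 1) \
        + eta_A / (eta_A + 1) * alpha_t**2 * p * (1 - p) * (a_max - a_min) ** 2
    return mean, var

def precondition_stats_features(alpha_t, x_min, x_max, eta_X):
    # Mean and variance of x_t from Equation 18, assuming x_0 ~ Unif[x_min, x_max].
    mean = 0.5 * alpha_t * (x_min + x_max)
    var = (mean - mean**2) / (eta_X + 1) \
        + eta_X / (12 * (eta_X + 1)) * alpha_t**2 * (x_max - x_min) ** 2
    return mean, var

def precondition(Gt, mean, var):
    # Standardize the noisy graph before it enters the predictor (Equation 16).
    return (Gt - mean) / var**0.5
```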
3 RELATED WORK

Graph generative models. Early attempts at modeling graph distributions trace back to the Erdős–Rényi graph model (Erdős & Rényi, 1959; Erdős et al., 1960), from which a plethora of graph generative models have emerged. These models employ diverse approaches to establish the data generative process and devise optimization objectives, which in turn have significantly expanded the flexibility in modeling the distribution of graph data. Stochastic blockmodels (Holland et al., 1983; Lee & Wilkinson, 2019), latent variable models (Airoldi et al., 2009; Zhou, 2015; Caron & Fox, 2017), and their variational-autoencoder-based successors (Kipf & Welling, 2016; Hasanzadeh et al., 2019; Mehta et al., 2019; He et al., 2022) assume that edges are formed through independent pairwise node interactions, and thus factorize the probability of the graph adjacency matrix into the dot product of factor representations of nodes. Sequential models (You et al., 2018; Wang et al., 2018; Jin et al., 2018; Han et al., 2023) adopt a similar concept of node interactions but correlate these interactions by organizing them into a series of connection events. Additionally, some models treat the graph adjacency matrix as a parameterized random matrix and generate it by mapping a random vector through a feed-forward neural network (Simonovsky & Komodakis, 2018; De Cao & Kipf, 2018). In terms of optimization targets, many utilize log-likelihood-based objectives such as the negative log-likelihood (Liu et al., 2019) or evidence lower bound objectives (Kipf & Welling, 2016), while others employ generative adversarial losses (Wang et al., 2018) or reinforcement learning losses (De Cao & Kipf, 2018). Diffusion-based graph generative models (Jo et al., 2022; Haefeli et al., 2022; Vignac et al., 2023; Niu et al., 2020b;a; Chen et al., 2023; Cho et al., 2023; Kong et al., 2023), including this work, feature a unique data generation process compared to previous models. They map the observed graph structures and node features to a latent space through a stochastic diffusion process, whose reverse process can be learned by optimizing a variational lower bound (Vignac et al., 2023) or numerically solving a reverse stochastic differential equation (Jo et al., 2022).

Diffusion models. The stochastic diffusion process was introduced by Sohl-Dickstein et al. (2015) for deep unsupervised learning, and its foundational connection with deep generative models was laid down by Song & Ermon (2019) and the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020). DDPM maps a data sample to the latent space via a Markov process that gradually applies noise to the original sample, and learns a reverse process to reproduce the sample in finite steps. The optimization and sampling processes in DDPM can be interpreted through the lens of variational inference (Sohl-Dickstein et al., 2015; Ho et al., 2020) or can be formulated as score matching with Langevin dynamics (Song et al., 2021; Song & Ermon, 2019). Both approaches focus on diffusion processes that define transitions between normally distributed variables, which are proven effective for generating natural images. As the scope of generative tasks expands to discrete domains like text, diffusion models transitioning between discrete states have emerged (Austin et al., 2021), demonstrating that the choice of probabilistic distribution for modeling each noise state can significantly impact the learning task. This conclusion is also validated in the application of graph generation (Haefeli et al., 2022; Vignac et al., 2023). Further studies (Chen & Zhou, 2023; Zhou et al., 2023) have been conducted to improve diffusion models by introducing novel diffusion processes based on probabilistic distributions that better capture the intrinsic characteristics of the generation target. Among these, the beta diffusion of Zhou et al. (2023) is chosen as the foundation of our method, due to the beta distribution's proficiency in capturing sparsity and modeling range-bounded data across mixed types. These traits are commonly observed in real-world graphs (Barabási et al., 2000; Ciotti et al., 2015; Liang et al., 2023; Wang et al., 2023).

[Figure 3: Examples of graphs generated by the GBD model on Planar, SBM, QM9, and ZINC250k datasets.]

4 EXPERIMENTS

4.1 GENERIC GRAPH GENERATION

Datasets and metrics. We use five graph datasets with varying sizes, connectivity, and topology, commonly employed as benchmarks. Ego-small includes 200 sub-graphs from the Citeseer network with $4 \leq N \leq 18$ nodes. Community-small has 100 synthetic graphs with $12 \leq N \leq 20$. Grid contains 100 2D grid graphs with $100 \leq N \leq 400$. Planar includes 200 synthetic planar graphs with $N = 64$. SBM comprises 200 stochastic block model graphs with 2 to 5 communities, where $44 \leq N \leq 187$ and each community has 20 to 40 nodes. For a fair comparison, we follow the experimental setup of Jo et al. (2022; 2023), using the same train/test split. We evaluate using the maximum mean discrepancy (MMD) (Gretton et al., 2012) to compare the distributions of key graph properties between test graphs and generated graphs: degree (Deg.), clustering coefficient (Clus.), and 4-node orbit counts (Orbit). For more complex structures like Planar and SBM, we also report the eigenvalues of the graph Laplacian (Spec.) and the percentage of valid, unique, and novel graphs (V.U.N.) to assess how well the model captures both intrinsic features and global graph properties. A lower score indicates better performance for all metrics except V.U.N. More details on these metrics can be found in Appendix E.1.
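For intuition about the evaluation, below is a simplified sketch of a degree-based MMD between two sets of graphs. The reported results follow the kernel choices of the referenced evaluation protocols (Jo et al., 2022; 2023); the RBF kernel over degree histograms here is a simplified stand-in, not the benchmark implementation.

```python
import numpy as np
import networkx as nx

def degree_hist(G, max_deg=50):
    # Normalized degree histogram of a NetworkX graph, capped at max_deg.
    h = np.zeros(max_deg + 1)
    for _, d in G.degree():
        h[min(d, max_deg)] += 1
    return h / max(h.sum(), 1)

def mmd2_rbf(X, Y, sigma=1.0):
    # Squared MMD between two sample sets of histograms under an RBF kernel.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Usage with lists of nx.Graph objects `test_graphs` and `gen_graphs`:
# X = np.stack([degree_hist(g) for g in test_graphs])
# Y = np.stack([degree_hist(g) for g in gen_graphs])
# print(mmd2_rbf(X, Y))
```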
Baselines. We compare GBD against various graph generation methods, including autoregressive models: DeepGMG (Li et al., 2018), GraphRNN (You et al., 2018), and GraphAF (Shi et al., 2020); the one-shot model GraphVAE (Simonovsky & Komodakis, 2018); and flow-based models: GNF (Liu et al., 2019) and GraphDF (Luo et al., 2021). Additionally, we compare against state-of-the-art (SOTA) diffusion-based graph generative models, including score-based or continuous: EDP-GNN (Niu et al., 2020a), GDSS, GDSS + Transformer (TF) (Jo et al., 2022), GruM (Jo et al., 2023), and ConGress (Vignac et al., 2023); discrete: DiGress (Vignac et al., 2023); autoregressive: GraphARM (Kong et al., 2023); and a model with wavelet features: Wave-GD (Cho et al., 2023).

Table 1: MMD results for simple generic graph generation ("-" indicates results unavailable; GBD standard deviations in parentheses).

| Method | Ego-small Deg. | Ego-small Clus. | Ego-small Orbit | Community-small Deg. | Community-small Clus. | Community-small Orbit | Grid Deg. | Grid Clus. | Grid Orbit |
|---|---|---|---|---|---|---|---|---|---|
| DeepGMG | 0.040 | 0.100 | 0.020 | 0.220 | 0.950 | 0.400 | - | - | - |
| GraphRNN | 0.090 | 0.220 | 0.003 | 0.080 | 0.120 | 0.040 | 0.064 | 0.043 | 0.021 |
| GraphAF | 0.030 | 0.110 | 0.001 | 0.180 | 0.200 | 0.020 | - | - | - |
| GraphDF | 0.040 | 0.130 | 0.010 | 0.060 | 0.120 | 0.030 | - | - | - |
| GraphVAE | 0.130 | 0.170 | 0.050 | 0.350 | 0.980 | 0.540 | 1.619 | 0.0 | 0.919 |
| GNF | 0.030 | 0.100 | 0.001 | 0.200 | 0.200 | 0.110 | - | - | - |
| EDP-GNN | 0.052 | 0.093 | 0.007 | 0.053 | 0.144 | 0.026 | 0.455 | 0.238 | 0.328 |
| GDSS | 0.021 | 0.024 | 0.007 | 0.045 | 0.086 | 0.007 | 0.111 | 0.005 | 0.070 |
| ConGress | 0.037 | 0.064 | 0.017 | 0.020 | 0.076 | 0.006 | - | - | - |
| DiGress | 0.017 | 0.021 | 0.010 | 0.028 | 0.115 | 0.009 | - | - | - |
| GraphARM | 0.019 | 0.017 | 0.010 | 0.034 | 0.082 | 0.004 | - | - | - |
| Wave-GD | 0.012 | 0.010 | 0.005 | 0.007 | 0.058 | 0.002 | 0.144 | 0.004 | 0.021 |
| GBD | 0.011 (±0.008) | 0.014 (±0.013) | 0.002 (±0.004) | 0.002 (±0.003) | 0.060 (±0.004) | 0.002 (±0.003) | 0.045 (±0.004) | 0.011 (±0.002) | 0.040 (±0.014) |

Table 2: MMD results for complex generic graph generation.

| Method | Planar Deg. | Planar Clus. | Planar Orbit | Planar Spec. | Planar V.U.N. | SBM Deg. | SBM Clus. | SBM Orbit | SBM Spec. | SBM V.U.N. |
|---|---|---|---|---|---|---|---|---|---|---|
| Training set | 0.0002 | 0.0310 | 0.0005 | 0.0052 | 100.0 | 0.0008 | 0.0332 | 0.0255 | 0.0063 | 100.0 |
| GraphRNN | 0.0049 | 0.2779 | 1.2543 | 0.0459 | 0.0 | 0.0055 | 0.0584 | 0.0785 | 0.0065 | 5.0 |
| GRAN | 0.0007 | 0.0426 | 0.0009 | 0.0075 | 0.0 | 0.0113 | 0.0553 | 0.0540 | 0.0054 | 25.0 |
| SPECTRE | 0.0005 | 0.0785 | 0.0012 | 0.0112 | 25.0 | 0.0015 | 0.0521 | 0.0412 | 0.0056 | 52.5 |
| EDP-GNN | 0.0044 | 0.3187 | 1.4986 | 0.0813 | 0.0 | 0.0011 | 0.0552 | 0.0520 | 0.0070 | 35.0 |
| GDSS | 0.0041 | 0.2676 | 0.1720 | 0.0370 | 0.0 | 0.0212 | 0.0646 | 0.0894 | 0.0128 | 5.0 |
| GDSS+TF | 0.0036 | 0.1206 | 0.0525 | 0.0137 | 5.0 | 0.0411 | 0.0565 | 0.0706 | 0.0074 | 27.5 |
| ConGress | 0.0048 | 0.2728 | 1.2950 | 0.0418 | 0.0 | 0.0273 | 0.1029 | 0.1148 | - | 0.0 |
| DiGress | 0.0003 | 0.0372 | 0.0009 | 0.0106 | 75.0 | 0.0013 | 0.0498 | 0.0434 | 0.0400 | 74.0 |
| GruM | 0.0005 | 0.0353 | 0.0009 | 0.0062 | 90.0 | 0.0007 | 0.0492 | 0.0448 | 0.0050 | 85.0 |
| GBD | 0.0003 | 0.0353 | 0.0135 | 0.0059 | 92.5 | 0.0013 | 0.0493 | 0.0446 | 0.0047 | 75.0 |

Generating graphs with simple topology. Using the GBD theoretical model and the implementation detailed in Appendix E.1, we show that GBD excels at generating graphs of moderate size with simple topological patterns, as evidenced by the MMD metrics in Table 1. GBD outperforms the baselines in five out of nine experiments and remains statistically on par with the SOTA model in the other two. Even with larger graphs (e.g., the Grid benchmark), GBD achieved a degree MMD of 0.045, significantly surpassing all baselines, while maintaining competitive performance on the other metrics.
Moreover, previous works (Cho et al., 2023; Jo et al., 2022; Liu et al., 2019) suggested that MMD as a metric suffers from large standard deviations due to the insufficient size of the sampled graph sets. Thus, for a more comprehensive study, we repeated the evaluation using an enlarged generated sample of 1024 graphs. As shown in Appendix F.2, the results demonstrate that GBD consistently generates graphs that resemble the true data, with improved statistical significance in the experimental conclusions.

Generating graphs with complex topology. GBD also maintains a leading position in generating large graphs with complex topologies, as shown by the results in Table 2. In the experiments on the Planar and SBM datasets, GBD achieved superior or comparable MMD scores on most graph statistics, along with high V.U.N. scores, consistently ranking first or second among all baselines. These results provide stronger evidence of GBD's capability and suitability as a candidate model for generating graph data, particularly for applications where real-world instances are typically more complex in nature.

Rapid convergence in the reverse process. We conducted a convergence experiment measuring the V.U.N. metric for graphs $G_t$ generated at every 100 steps of the reverse chain, with the results shown in Figure 4. The figure highlights that GBD achieves high V.U.N. scores at early timestamps of the reverse diffusion process, reflecting rapid convergence. Notably, both GBD and the second-place model, GruM (Jo et al., 2023), share a key property of predicting $\mathbb{E}[G_0 \mid G_t]$. While GruM achieves it by designing the model upon an OU bridge mixture (Jo et al., 2023), our model attains the same property due to the inherent nature of beta diffusion.

[Figure 4: V.U.N. results for intermediate graph samples in the reverse chain on Planar (left) and SBM (right) datasets, with GBD demonstrating a clear advantage in convergence rate.]

4.2 MOLECULE GENERATION

Datasets and metrics. We consider two widely-used molecule datasets as benchmarks in Jo et al. (2022): QM9 (Ramakrishnan et al., 2014), which consists of 133,885 molecules with $N \leq 9$ nodes from 4 different node types, and ZINC250k (Irwin et al., 2012), which consists of 249,455 molecules with $N \leq 38$ nodes from 9 different node types. Molecules in both datasets have 3 edge types, namely single bond, double bond, and triple bond. Following the evaluation setting of Jo et al. (2022), we generated 10,000 molecules for each dataset and evaluated them with four metrics: the ratio of valid molecules without correction (Valid), Fréchet ChemNet Distance (FCD), neighborhood subgraph pairwise distance kernel (NSPDK), and scaffold similarity (Scaf.).

Table 3: 2D molecule generation results on QM9 ($|N| \leq 9$) and ZINC250k ($|N| \leq 38$).

| Method | QM9 Valid (%) | QM9 FCD | QM9 NSPDK | QM9 Scaf. | ZINC250k Valid (%) | ZINC250k FCD | ZINC250k NSPDK | ZINC250k Scaf. |
|---|---|---|---|---|---|---|---|---|
| MoFlow | 91.36 | 4.467 | 0.0169 | 0.1447 | 63.11 | 20.931 | 0.0455 | 0.0133 |
| GraphAF | 74.43 | 5.625 | 0.0207 | 0.3046 | 68.47 | 16.023 | 0.0442 | 0.0672 |
| GraphDF | 93.88 | 10.928 | 0.0636 | 0.0978 | 90.61 | 33.546 | 0.1770 | 0.0000 |
| EDP-GNN | 47.52 | 2.680 | 0.0046 | 0.3270 | 82.97 | 16.737 | 0.0485 | 0.0000 |
| GDSS | 95.72 | 2.900 | 0.0033 | 0.6983 | 97.01 | 14.656 | 0.0195 | 0.0467 |
| GDSS+TF | 99.68 | 0.737 | 0.0024 | 0.9129 | 96.04 | 5.556 | 0.0326 | 0.3205 |
| DiGress | 98.19 | 0.095 | 0.0003 | 0.9353 | 94.99 | 3.482 | 0.0021 | 0.4163 |
| SwinGNN | 99.71 | 0.125 | 0.0003 | - | 81.72 | 5.920 | 0.006 | - |
| GraphARM | 90.25 | 1.22 | 0.0020 | - | 88.23 | 16.26 | 0.055 | - |
| EDGE | 99.10 | 0.458 | - | 0.763 | - | - | - | - |
| GruM | 99.69 | 0.108 | 0.0002 | 0.9449 | 98.65 | 2.257 | 0.0015 | 0.5299 |
| GBD | 99.88 | 0.093 | 0.0002 | 0.9510 | 97.87 | 2.248 | 0.0018 | 0.5042 |
We provide the results on uniqueness and novelty in Appendix E.2.

Baselines. We compare GBD against the following autoregressive and one-shot graph generation methods: MoFlow (Zang & Wang, 2020), GraphAF (Shi et al., 2020), GraphDF (Luo et al., 2021), and several state-of-the-art diffusion-based graph generative models discussed previously: EDP-GNN (Niu et al., 2020a), GDSS (Jo et al., 2022), ConGress (Vignac et al., 2023), DiGress (Vignac et al., 2023), SwinGNN (Niu et al., 2020b), GraphARM (Kong et al., 2023), EDGE (Chen et al., 2023), and GruM (Jo et al., 2023). We describe the details of the implementation in Appendix E.2.

Results. As shown in Table 3, our GBD outperforms most previous diffusion-based models and is competitive with the current state-of-the-art Gaussian-based diffusion model, GruM. In particular, compared to the basic continuous diffusion model (GDSS+TF), GBD significantly outperforms it under the same graph transformer architecture. Additionally, we observe that our proposed beta-based graph diffusion model is superior to the discrete diffusion model on both 2D molecule datasets, demonstrating that our method is also capable of modeling complex structures of attributed graphs. We attribute this to the excellent ability of the beta-based graph diffusion model to capture sparse data distributions.

4.3 ADDITIONAL EXPERIMENTAL RESULTS

Ablation study on preconditioning and computation domain. We vary the options regarding the computation domain and the application of preconditioning, and summarize the results in Table 4. The combination of logit-domain computation without preconditioning can sometimes increase the difficulty of model convergence and is therefore not recommended. The listed results on Ego-small and Community-small demonstrate that both techniques are generally beneficial for achieving better model performance, and the effect of preconditioning is more evident when the computation is performed in the logit domain.

Table 4: The effect of logit-domain computation and preconditioning.

| Logit Domain | Preconditioning | Ego-small Deg. | Ego-small Clus. | Ego-small Orbit | Community-small Deg. | Community-small Clus. | Community-small Orbit |
|---|---|---|---|---|---|---|---|
| - | - | 0.015 | 0.018 | 0.004 | 0.010 | 0.076 | 0.004 |
| - | ✓ | 0.013 | 0.017 | 0.002 | 0.004 | 0.044 | 0.007 |
| ✓ | ✓ | 0.011 | 0.014 | 0.002 | 0.002 | 0.060 | 0.002 |

Ablation study on concentration modulation. To verify the effect of concentration modulation, we compared the results of GBD under different concentration modulation strategies on both generic graphs and molecule graphs. As shown in Table 5, assigning appropriate $\eta$ through the optional concentration modulation strategies helps GBD model various types of graphs, and this technique can be further adapted to different scenarios.

Table 5: The effect of concentration modulation on generic graphs and molecule graphs.

| Modulation Strategy | Planar Deg. | Planar Clus. | Planar Orbit | Planar Spec. | Planar V.U.N. | SBM Deg. | SBM Clus. | SBM Orbit | SBM Spec. | SBM V.U.N. |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o modulation | 0.0005 | 0.0357 | 0.0294 | 0.0069 | 87.5 | 0.0015 | 0.0493 | 0.0452 | 0.0051 | 72.5 |
| Betweenness Centrality | 0.0003 | 0.0354 | 0.0154 | 0.0057 | 90.0 | 0.0015 | 0.0492 | 0.0450 | 0.0053 | 70.0 |
| Degree Centrality | 0.0003 | 0.0353 | 0.0135 | 0.0059 | 92.5 | 0.0013 | 0.0493 | 0.0446 | 0.0047 | 75.0 |

| Modulation Strategy | QM9 Valid (%) | QM9 FCD | QM9 NSPDK | QM9 Scaf. | ZINC250k Valid (%) | ZINC250k FCD | ZINC250k NSPDK | ZINC250k Scaf. |
|---|---|---|---|---|---|---|---|---|
| w/o modulation | 99.73 | 0.126 | 0.0003 | 0.9475 | 97.60 | 2.412 | 0.0020 | 0.5033 |
| Carbon Bonds | 99.88 | 0.093 | 0.0002 | 0.9510 | 97.87 | 2.248 | 0.0018 | 0.5042 |
Comparison of various node feature initializations. To demonstrate that GBD has the potential to model graphs with various types of node features, we explored the impact of different initializations of the node features on modeling the joint distribution $G = (A, X)$. Specifically, the nodes can be featured by degrees, centralities, or eigenvectors; we vary the node feature initialization and summarize the results of model performance in Appendix F.2.

4.4 VISUALIZATION

We present samples from GBD trained on Planar, SBM, QM9, and ZINC250k in Figure 3. Additionally, we provide visualizations of the generative process and more generated graphs from GBD, along with a comprehensive description, in Appendix G.

5 CONCLUSION

We introduce Graph Beta Diffusion (GBD), a novel graph generation framework developed upon beta diffusion. We demonstrate that utilizing the beta distribution to define the diffusion process is beneficial for modeling the distribution of graph data, and outline four crucial design elements: data transformation, concentration modulation, logit-domain computation, and neural-network preconditioning, which consistently enhance model performance. Through extensive experiments, GBD demonstrates superior performance across various graph benchmarks, showcasing its ability to model diverse patterns within graph data. Moreover, our proposed model shows promise in modeling various types of data with discrete structures and offers valuable insights into further exploring the properties of beta diffusion.

REPRODUCIBILITY STATEMENT

Our PyTorch code is available at https://github.com/xinyangATK/GraphBetaDiffusion. The implementation follows the training and sampling algorithms described in Appendix D.1 and other technical details outlined in Appendix D.

ACKNOWLEDGMENT

M. Zhou acknowledges the support of a gift fund from Apple.

REFERENCES

Emmanuel Abbe. Community detection and stochastic block models: recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018.

Edo M Airoldi, David Blei, Stephen Fienberg, and Eric Xing. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems (NeurIPS), volume 21, 2009.

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.

Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, and Jian Zhou. Dirichlet diffusion score model for biological sequence generation. In Proceedings of the Fortieth International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2023.

Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, Joydeep Ghosh, and John Lafferty. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10), 2005.

Albert-László Barabási, Réka Albert, and Hawoong Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A: Statistical Mechanics and its Applications, 281(1-4):69–77, 2000.

Guy W Bemis and Mark A Murcko. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996.

Pietro Bongini, Monica Bianchini, and Franco Scarselli. Molecular generative graph neural networks for drug discovery. Neurocomputing, 450:242–252, 2021.
François Caron and Emily B Fox. Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(5):1295–1366, 2017.

Tianqi Chen and Mingyuan Zhou. Learning to jump: Thinning and thickening latent counts for generative modeling. In International Conference on Machine Learning, pp. 5367–5382. PMLR, 2023.

Xiaohui Chen, Jiaxing He, Xu Han, and Li-Ping Liu. Efficient and degree-guided graph generation via discrete diffusion modeling. In International Conference on Machine Learning. PMLR, 2023.

Hyuna Cho, Minjae Jeong, Sooyeon Jeon, Sungsoo Ahn, and Won Hwa Kim. Multi-resolution spectral coherence for graph generation with score-based diffusion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Valerio Ciotti, Ginestra Bianconi, Andrea Capocci, Francesca Colaiori, and Pietro Panzarasa. Degree correlations in signed social networks. Physica A: Statistical Mechanics and its Applications, 422:25–39, 2015.

Marco Conti, Andrea Passarella, and Fabio Pezzoni. A model for the generation of social network graphs. In 2011 IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks, pp. 1–6. IEEE, 2011.

Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.

Paul Erdős and Alfréd Rényi. On random graphs I. Publ. Math. Debrecen, 6(290-297):18, 1959.

Paul Erdős, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5(1):17–60, 1960.

LC Freeman. A set of measures of centrality based on betweenness. Sociometry, 1977.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

Zhiye Guo, Jian Liu, Yanli Wang, Mengrui Chen, Duolin Wang, Dong Xu, and Jianlin Cheng. Diffusion models in bioinformatics and computational biology. Nature Reviews Bioengineering, 2(2):136–154, 2024.

Kilian Konstantin Haefeli, Karolis Martinkus, Nathanaël Perraudin, and Roger Wattenhofer. Diffusion models for graphs benefit from discrete state spaces. In The First Learning on Graphs Conference, 2022.

Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2008.

Xu Han, Xiaohui Chen, Francisco JR Ruiz, and Li-Ping Liu. Fitting autoregressive graph generative models through maximum likelihood estimation. Journal of Machine Learning Research, 24(97):1–30, 2023.

Arman Hasanzadeh, Ehsan Hajiramezanali, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, and Xiaoning Qian. Semi-implicit graph variational auto-encoders. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pp. 10711–10722, 2019.

Yilin He, Chaojie Wang, Hao Zhang, Bo Chen, and Mingyuan Zhou. A variational edge partition model for supervised graph representation learning. Advances in Neural Information Processing Systems, 35:12339–12351, 2022.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
Han Huang, Leilei Sun, Bowen Du, and Weifeng Lv. Conditional diffusion based on discrete graph structures for molecular graph generation. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.

John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pp. 2323–2332. PMLR, 2018.

Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, pp. 10362–10383. PMLR, 2022.

Jaehyeong Jo, Dongki Kim, and Sung Ju Hwang. Graph generation with diffusion mixture. arXiv preprint arXiv:2302.03596, 2023.

Weonyoung Joo, Wonsung Lee, Sungrae Park, and Il-Chul Moon. Dirichlet variational autoencoder. Pattern Recognition, 107:107514, 2020.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.

Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

Lingkai Kong, Jiaming Cui, Haotian Sun, Yuchen Zhuang, B Aditya Prakash, and Chao Zhang. Autoregressive diffusion model for graph generation. In International Conference on Machine Learning, pp. 17391–17408. PMLR, 2023.

Greg Landrum et al. RDKit: Open-source cheminformatics, 2006.

Clement Lee and Darren J Wilkinson. A review of stochastic block models and extensions for graph clustering. Applied Network Science, 4(1):1–50, 2019.

Chen Li and Yoshihiro Yamanishi. TenGAN: Pure transformer encoders make an efficient discrete GAN for de novo molecular generation. In International Conference on Artificial Intelligence and Statistics, pp. 361–369. PMLR, 2024.

Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. In International Conference on Machine Learning. PMLR, 2018.

Langzhang Liang, Zenglin Xu, Zixing Song, Irwin King, Yuan Qi, and Jieping Ye. Tackling long-tailed distribution issue in graph neural networks via normalization. IEEE Transactions on Knowledge and Data Engineering, 2023.

Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, and Kevin Swersky. Graph normalizing flows. Advances in Neural Information Processing Systems, 32, 2019.

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph variational autoencoders for molecule design. Advances in Neural Information Processing Systems, 31, 2018.

Youzhi Luo, Keqiang Yan, and Shuiwang Ji. GraphDF: A discrete flow model for molecular graph generation. In International Conference on Machine Learning, pp. 7192–7203. PMLR, 2021.

Karolis Martinkus, Andreas Loukas, Nathanaël Perraudin, and Roger Wattenhofer. SPECTRE: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. In International Conference on Machine Learning, pp. 15159–15179. PMLR, 2022.

Nikhil Mehta, Lawrence Carin, and Piyush Rai. Stochastic blockmodels meet graph neural networks. In International Conference on Machine Learning (ICML), pp. 4466–4474, 2019.
Mark EJ Newman, Duncan J Watts, and Steven H Strogatz. Random graph models of social networks. Proceedings of the National Academy of Sciences, 99(suppl_1):2566–2572, 2002.

Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation invariant graph generation via score-based generative modeling. In Silvia Chiappa and Roberto Calandra (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 4474–4484. PMLR, 26–28 Aug 2020a.

Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, pp. 4474–4484. PMLR, 2020b.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7, 2014.

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.

Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. GraphAF: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2020.

Pouyan Shirzadian, Blessy Antony, Akshaykumar G Gattani, Nure Tasnina, and Lenwood S Heath. A time evolving online social network generation algorithm. Scientific Reports, 13(1):2395, 2023.

Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In Artificial Neural Networks and Machine Learning – ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I 27, pp. 412–422. Springer, 2018.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. DiGress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2023.

Haohui Wang, Baoyu Jing, Kaize Ding, Yada Zhu, Liqing Zhang, and Dawei Zhou. Characterizing long-tail categories on graphs. arXiv preprint arXiv:2305.09938, 2023.
Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. GraphGAN: Graph representation learning with generative adversarial nets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, pp. 5708–5717. PMLR, 2018.

Chengxi Zang and Fei Wang. MoFlow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617–626, 2020.

Mingyuan Zhou. Infinite edge partition models for overlapping community detection and link prediction. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38 of Proceedings of Machine Learning Research, pp. 1135–1143, 2015.

Mingyuan Zhou, Tianqi Chen, Zhendong Wang, and Huangjie Zheng. Beta diffusion. In NeurIPS 2023: Neural Information Processing Systems, 2023.

A BACKGROUND OF THE BETA DIFFUSION MODEL

In this section, we employ $z_0$ to denote the original data, already normalized to the range $[0, 1]$. The variables $z_1$ through $z_T$ denote the latent states sampled via a forward beta diffusion process, with larger subscripts indicating states closer to complete noise. The beta diffusion model (Zhou et al., 2023) is designed so that each latent state follows a beta distribution as its conditional marginal distribution:

$$q(z_t \mid z_0) = \text{Beta}\big(z_t \mid \eta \alpha_t z_0,\; \eta(1 - \alpha_t z_0)\big). \tag{19}$$

In this setup, $\eta$ serves as the concentration parameter, where larger values of $\eta$ result in a stronger concentration effect, shaping the beta distribution more sharply around its mean. The parameter $\alpha_t$ is a scheduling factor that gradually decreases from 1 to 0 as the forward diffusion process advances, controlling the degree of noise introduced at each step. To achieve this, the transition between two consecutive states in the forward diffusion process is formulated as multiplication by an attenuating factor, i.e.,

$$z_t = z_{t-1} \cdot q_t, \quad q_t \sim \text{Beta}\big(\eta \alpha_t z_0,\; \eta(\alpha_{t-1} - \alpha_t) z_0\big), \quad t \in [1, T]. \tag{20}$$

To demonstrate that the forward process defined via Equation 20 results in the conditional marginal distribution given in Equation 19, one can utilize the property that

$$(z_t,\; z_{t-1} - z_t,\; 1 - z_{t-1}) \sim \text{Dir}\big(\eta \alpha_t z_0,\; \eta(\alpha_{t-1} - \alpha_t) z_0,\; \eta(1 - \alpha_{t-1} z_0)\big), \quad t \in [1, T], \tag{21}$$

and then apply mathematical induction to establish the desired conclusion. It is worth noting that a generic triplet $(\pi_1, \pi_2, \pi_3)$ from the Dirichlet distribution in Equation 21 can be generated using the stick-breaking construction. Specifically, $\pi_1$ is first sampled from $\text{Beta}(\eta \alpha_t z_0, \eta(1 - \alpha_t z_0))$. Next, $\pi_2$ is created by multiplying $(1 - \pi_1)$ by a factor sampled from $\text{Beta}(\eta(\alpha_{t-1} - \alpha_t) z_0, \eta(1 - \alpha_{t-1} z_0))$. Finally, $\pi_3$ is obtained by subtracting $(\pi_1 + \pi_2)$ from one. This aligns with the reverse transition process: once $z_t$ is obtained, $z_{t-1}$ can be reconstructed by adding an increment, generated in the same manner as $\pi_2$, to $z_t$, i.e.,

$$z_{t-1} = z_t + (1 - z_t) \cdot p_t, \quad p_t \sim \text{Beta}\big(\eta(\alpha_{t-1} - \alpha_t) z_0,\; \eta(1 - \alpha_{t-1} z_0)\big), \quad t \in [1, T]. \tag{22}$$

With the reverse transition process formalized in Equation 22, one can derive the reverse transition distribution as

$$q(z_{t-1} \mid z_t, z_0) = \frac{1}{1 - z_t} \text{Beta}\left(\frac{z_{t-1} - z_t}{1 - z_t} \,\Big|\, \eta(\alpha_{t-1} - \alpha_t) z_0,\; \eta(1 - \alpha_{t-1} z_0)\right), \quad t \in [1, T]. \tag{23}$$

In practice, to implement the reverse diffusion process, Zhou et al. (2023) define the ancestral sampling distribution as follows:

$$p_\theta(z_{t-1} \mid z_t) := q\big(z_{t-1} \mid z_t, f_\theta(z_t, t)\big), \tag{24}$$

with $f_\theta$ denoting a neural network with parameters $\theta$ that predicts $z_0$ from the latent state $z_t$ and time stamp $t$.
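The reverse process of Equations 22-24 amounts to the following loop. This is a minimal PyTorch sketch with illustrative names; `f_theta` is assumed to map $(z_t, t)$ to a prediction of $z_0$ in $(0, 1)$.

```python
import torch

@torch.no_grad()
def ancestral_sample(f_theta, z_T, alphas, eta):
    # Reverse beta diffusion in the original domain (Eqs. 22-24).
    # alphas: schedule [alpha_0 = 1, alpha_1, ..., alpha_T], decreasing to ~0.
    z = z_T
    T = len(alphas) - 1
    for t in range(T, 0, -1):
        z0_hat = f_theta(z, t).clamp(1e-6, 1 - 1e-6)  # keep beta parameters valid
        p = torch.distributions.Beta(
            eta * (alphas[t - 1] - alphas[t]) * z0_hat,
            eta * (1 - alphas[t - 1] * z0_hat),
        ).sample()
        z = z + p * (1 - z)  # thickening step, Equation 22
    return z
```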
B ANALYTICAL EXPRESSIONS OF THE OPTIMIZATION OBJECTIVE

Recall that the optimization objective is expressed as

$$\mathcal{L} = \sum_{t=2}^{T} \big[(1 - \omega)\, \mathcal{L}_{\text{sampling}}(t, G_0) + \omega\, \mathcal{L}_{\text{correction}}(t, G_t)\big], \quad \omega \in [0, 1], \tag{25}$$

with the components defined as follows:

$$\mathcal{L}_{\text{sampling}}(t, G_0) = \mathbb{E}_{q(G_t, G_0)}\, \text{KL}\big(p_\theta(G_{t-1} \mid G_t)\, \|\, q(G_{t-1} \mid G_t, G_0)\big), \tag{26}$$
$$\mathcal{L}_{\text{correction}}(t, G_0) = \mathbb{E}_{q(G_t, G_0)}\, \text{KL}\big(q(G_\tau \mid \hat{G}_\theta(G_t, t))\, \|\, q(G_\tau \mid G_0)\big). \tag{27}$$

To derive the analytical forms of $\mathcal{L}_{\text{sampling}}$ and $\mathcal{L}_{\text{correction}}$, we employ the following property of the beta distribution (Joo et al., 2020; Zhou et al., 2023):

Property 2. The KL divergence between two beta distributions $\text{Beta}(\alpha_1, \beta_1)$ and $\text{Beta}(\alpha_2, \beta_2)$ is given by:

$$\text{KL}\big(\text{Beta}(\alpha_1, \beta_1)\, \|\, \text{Beta}(\alpha_2, \beta_2)\big) = \ln \frac{\text{B}(\alpha_2, \beta_2)}{\text{B}(\alpha_1, \beta_1)} + (\alpha_1 - \alpha_2)\big[\psi(\alpha_1) - \psi(\alpha_1 + \beta_1)\big] + (\beta_1 - \beta_2)\big[\psi(\beta_1) - \psi(\alpha_1 + \beta_1)\big], \tag{28}$$

where $\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is the beta function and $\psi(\cdot)$ denotes the digamma function.

With the probability distributions $q(G_{t-1} \mid G_t, G_0)$ and $p_\theta(G_{t-1} \mid G_t)$ defined in Equations 6 and 7, the parameters $(\alpha_1, \beta_1, \alpha_2, \beta_2)$ are instantiated as follows:

$$\alpha_1 = \eta(\alpha_{t-1} - \alpha_t) \hat{G}_0, \quad \beta_1 = \eta(1 - \alpha_{t-1} \hat{G}_0), \quad \alpha_2 = \eta(\alpha_{t-1} - \alpha_t) G_0, \quad \beta_2 = \eta(1 - \alpha_{t-1} G_0).$$

Substituting these into Equation 28, we can express the KL term within $\mathcal{L}_{\text{sampling}}$ as

$$\begin{aligned} \text{KL}\big(p_\theta(G_{t-1} \mid G_t)\, \|\, q(G_{t-1} \mid G_t, G_0)\big) &= \ln \Gamma\big(\eta(\alpha_{t-1} - \alpha_t) G_0\big) + \ln \Gamma\big(\eta - \eta \alpha_{t-1} G_0\big) - \ln \Gamma\big(\eta - \eta \alpha_t G_0\big) \\ &\quad - \ln \Gamma\big(\eta(\alpha_{t-1} - \alpha_t) \hat{G}_0\big) - \ln \Gamma\big(\eta - \eta \alpha_{t-1} \hat{G}_0\big) + \ln \Gamma\big(\eta - \eta \alpha_t \hat{G}_0\big) \\ &\quad + \eta(\alpha_{t-1} - \alpha_t)(\hat{G}_0 - G_0)\, \psi\big(\eta(\alpha_{t-1} - \alpha_t) \hat{G}_0\big) \\ &\quad + \eta \alpha_{t-1}(G_0 - \hat{G}_0)\, \psi\big(\eta - \eta \alpha_{t-1} \hat{G}_0\big) + \eta \alpha_t(\hat{G}_0 - G_0)\, \psi\big(\eta - \eta \alpha_t \hat{G}_0\big), \end{aligned} \tag{29}$$

where $\hat{G}_0 := \hat{G}_\theta(G_t, t)$. Similarly, the KL term within $\mathcal{L}_{\text{correction}}$ can be derived as

$$\begin{aligned} \text{KL}\big(q(G_\tau \mid \hat{G}_\theta(G_t, t))\, \|\, q(G_\tau \mid G_0)\big) &= \ln \Gamma(\eta \alpha_t G_0) + \ln \Gamma(\eta - \eta \alpha_t G_0) - \ln \Gamma(\eta \alpha_t \hat{G}_0) - \ln \Gamma(\eta - \eta \alpha_t \hat{G}_0) \\ &\quad + \eta \alpha_t(\hat{G}_0 - G_0)\big[\psi(\eta \alpha_t \hat{G}_0) - \psi(\eta - \eta \alpha_t \hat{G}_0)\big]. \end{aligned} \tag{30}$$
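Property 2 translates directly into code. Below is a minimal PyTorch sketch, which can be cross-checked against `torch.distributions.kl_divergence` between two `Beta` objects; the function name is ours for illustration.

```python
import torch
from torch.special import gammaln, digamma

def kl_beta(alpha1, beta1, alpha2, beta2):
    # KL(Beta(alpha1, beta1) || Beta(alpha2, beta2)), following Equation 28.
    log_B1 = gammaln(alpha1) + gammaln(beta1) - gammaln(alpha1 + beta1)
    log_B2 = gammaln(alpha2) + gammaln(beta2) - gammaln(alpha2 + beta2)
    return (log_B2 - log_B1
            + (alpha1 - alpha2) * (digamma(alpha1) - digamma(alpha1 + beta1))
            + (beta1 - beta2) * (digamma(beta1) - digamma(alpha1 + beta1)))
```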
C MATHEMATICAL EXPRESSIONS FOR ELEMENTS IN NEURAL-NETWORK PRECONDITIONING

In this section, we present the full derivation of the expressions for the mean and variance of $a_t$ and $x_t$ shown in Equations 17 and 18, and extend these conclusions to $\mathrm{logit}(a_t)$ and $\mathrm{logit}(x_t)$. To establish these results, we begin by introducing the following property of the beta distribution:

Property 3 Given that $g_t \mid g_0 \sim \mathrm{Beta}(\eta\alpha_t g_0, \eta(1 - \alpha_t g_0))$, one can derive that $\mathbb{E}[g_t \mid g_0] = \alpha_t g_0$ and $\mathrm{Var}[g_t \mid g_0] = \frac{\alpha_t g_0(1 - \alpha_t g_0)}{\eta + 1}$.

Let $\mu := \mathbb{E}[g_0]$ and $\sigma^2 := \mathrm{Var}[g_0]$. By applying the law of total expectation and the law of total variance, we obtain the following results:

$\mathbb{E}[g_t] = \alpha_t\mu, \quad \mathrm{Var}[g_t] = \frac{\alpha_t\mu - \alpha_t^2(\mu^2 + \sigma^2)}{\eta + 1} + \alpha_t^2\sigma^2.$  (31)

Their counterparts in the logit domain are expressed as

$\mathbb{E}[\mathrm{logit}(g_t)] = \mathbb{E}[\psi(\eta\alpha_t g_0)] - \mathbb{E}[\psi(\eta - \eta\alpha_t g_0)],$  (32)

$\mathrm{Var}[\mathrm{logit}(g_t)] = \mathbb{E}[\psi^{(1)}(\eta\alpha_t g_0)] + \mathbb{E}[\psi^{(1)}(\eta(1 - \alpha_t g_0))] + \mathrm{Var}[\psi(\eta\alpha_t g_0)] + \mathrm{Var}[\psi(\eta(1 - \alpha_t g_0))],$  (33)

with $\psi(\cdot)$ and $\psi^{(1)}(\cdot)$ denoting the digamma and trigamma functions.

In the example presented in Section 2.3, we assume that $a_0$ follows a categorical distribution with outcomes $a_{\min}$ and $a_{\max}$, where $P(a_0 = a_{\max}) = p$. This leads to the expected value $\mu = p\, a_{\max} + (1 - p)\, a_{\min}$ and variance $\sigma^2 = p(1 - p)(a_{\max} - a_{\min})^2$. Substituting these quantities into Equation 31, we obtain

$\mathbb{E}[a_t] = \alpha_t\big(p\, a_{\max} + (1 - p)\, a_{\min}\big), \quad \mathrm{Var}[a_t] = \frac{1}{\eta_A + 1}\big(\mathbb{E}[a_t] - \mathbb{E}[a_t]^2\big) + \frac{\eta_A}{\eta_A + 1}\, \alpha_t^2\, p(1 - p)(a_{\max} - a_{\min})^2.$

For node features, we assume that $x_0$ is uniformly distributed over the interval $[x_{\min}, x_{\max}]$. This gives a mean of $\mu = (x_{\min} + x_{\max})/2$ and a variance of $\sigma^2 = (x_{\max} - x_{\min})^2/12$, and the corresponding mean and variance of $x_t$ can then be expressed as

$\mathbb{E}[x_t] = \frac{\alpha_t}{2}(x_{\min} + x_{\max}), \quad \mathrm{Var}[x_t] = \frac{1}{\eta_X + 1}\big(\mathbb{E}[x_t] - \mathbb{E}[x_t]^2\big) + \frac{\eta_X}{12(\eta_X + 1)}\, \alpha_t^2 (x_{\max} - x_{\min})^2.$

Under the same probabilistic assumptions for $a_0$ and $x_0$, we can derive the expressions for the terms in Equations 32 and 33, presented in the following remarks:

Remark 1 Given that $a_0$ has two potential outcomes $\{a_{\min}, a_{\max}\}$ with $P(a_0 = a_{\max}) = p$, the computation of $\mathbb{E}[\mathrm{logit}(g_t)]$ and $\mathrm{Var}[\mathrm{logit}(g_t)]$ involves several key components, including:

$\mathbb{E}[\psi(\eta\alpha_t a_0)] = p\, \psi(\eta\alpha_t a_{\max}) + (1 - p)\, \psi(\eta\alpha_t a_{\min}),$
$\mathbb{E}[\psi(\eta - \eta\alpha_t a_0)] = p\, \psi(\eta - \eta\alpha_t a_{\max}) + (1 - p)\, \psi(\eta - \eta\alpha_t a_{\min}),$
$\mathrm{Var}[\psi(\eta\alpha_t a_0)] = p(1 - p)\big(\psi(\eta\alpha_t a_{\max}) - \psi(\eta\alpha_t a_{\min})\big)^2,$
$\mathrm{Var}[\psi(\eta - \eta\alpha_t a_0)] = p(1 - p)\big(\psi(\eta - \eta\alpha_t a_{\max}) - \psi(\eta - \eta\alpha_t a_{\min})\big)^2,$
$\mathbb{E}[\psi^{(1)}(\eta\alpha_t a_0)] = p\, \psi^{(1)}(\eta\alpha_t a_{\max}) + (1 - p)\, \psi^{(1)}(\eta\alpha_t a_{\min}),$
$\mathbb{E}[\psi^{(1)}(\eta - \eta\alpha_t a_0)] = p\, \psi^{(1)}(\eta - \eta\alpha_t a_{\max}) + (1 - p)\, \psi^{(1)}(\eta - \eta\alpha_t a_{\min}).$  (34)

Remark 2 Let $x_0$ be uniformly distributed as $\mathrm{Unif}[x_{\min}, x_{\max}]$. We denote by $K$ the number of sub-intervals used for numerical integration via the trapezoidal rule. Similar to Remark 1, we present the expressions for the components in the logit domain as follows:

$\mathbb{E}[\psi(\eta\alpha_t x_0)] = \frac{1}{\eta\alpha_t(x_{\max} - x_{\min})}\big(\ln\Gamma(\eta\alpha_t x_{\max}) - \ln\Gamma(\eta\alpha_t x_{\min})\big),$
$\mathbb{E}[\psi(\eta - \eta\alpha_t x_0)] = \frac{1}{\eta\alpha_t(x_{\max} - x_{\min})}\big(\ln\Gamma(\eta - \eta\alpha_t x_{\min}) - \ln\Gamma(\eta - \eta\alpha_t x_{\max})\big),$
$\mathrm{Var}[\psi(\eta\alpha_t x_0)] \approx \max\Big\{\frac{1}{K}\sum_{i=0}^{K} \frac{\psi^2\big(\eta\alpha_t(x_{\min} + \frac{i}{K}(x_{\max} - x_{\min}))\big)}{2^{\delta(i=0) + \delta(i=K)}} - \mathbb{E}[\psi(\eta\alpha_t x_0)]^2,\; 0\Big\},$
$\mathrm{Var}[\psi(\eta - \eta\alpha_t x_0)] \approx \max\Big\{\frac{1}{K}\sum_{i=0}^{K} \frac{\psi^2\big(\eta - \eta\alpha_t(x_{\min} + \frac{i}{K}(x_{\max} - x_{\min}))\big)}{2^{\delta(i=0) + \delta(i=K)}} - \mathbb{E}[\psi(\eta - \eta\alpha_t x_0)]^2,\; 0\Big\},$
$\mathbb{E}[\psi^{(1)}(\eta\alpha_t x_0)] = \frac{1}{\eta\alpha_t(x_{\max} - x_{\min})}\big(\psi(\eta\alpha_t x_{\max}) - \psi(\eta\alpha_t x_{\min})\big),$
$\mathbb{E}[\psi^{(1)}(\eta - \eta\alpha_t x_0)] = \frac{1}{\eta\alpha_t(x_{\max} - x_{\min})}\big(\psi(\eta - \eta\alpha_t x_{\min}) - \psi(\eta - \eta\alpha_t x_{\max})\big),$  (35)

where $\delta(\cdot)$ denotes the indicator function.
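As a quick sanity check on these preconditioning statistics, the following hedged NumPy/SciPy sketch compares the analytic moments of Equations 31 and 32 for the two-point attribute distribution of Remark 1 against Monte Carlo draws; the values of $a_{\min}$, $a_{\max}$, and $p$ are illustrative, and the logit sample uses the same log-gamma difference that underlies Algorithm 2 below.

```python
# Hedged check of Equations 31-32 for the two-point a0 of Remark 1;
# all concrete values below are illustrative assumptions.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
eta, alpha_t = 30.0, 0.6
a_min, a_max, p = 0.09, 0.99, 0.3
mu = p * a_max + (1 - p) * a_min
sigma2 = p * (1 - p) * (a_max - a_min) ** 2

# Equation 31 via the laws of total expectation and total variance.
E_at = alpha_t * mu
V_at = (alpha_t * mu - alpha_t**2 * (mu**2 + sigma2)) / (eta + 1) + alpha_t**2 * sigma2

# Equation 32 with the expectations listed in Remark 1.
outcomes, probs = np.array([a_max, a_min]), np.array([p, 1 - p])
E_logit = probs @ (digamma(eta * alpha_t * outcomes)
                   - digamma(eta - eta * alpha_t * outcomes))

# Monte Carlo: a_t | a_0 ~ Beta(eta*alpha_t*a0, eta*(1 - alpha_t*a0)),
# simulated as U/(U+V) with gamma variables; logit(a_t) = ln U - ln V.
a0 = rng.choice(outcomes, size=500_000, p=probs)
U = rng.gamma(eta * alpha_t * a0)
V = rng.gamma(eta * (1.0 - alpha_t * a0))
a_t, logit_at = U / (U + V), np.log(U) - np.log(V)
print(f"E[a_t]:    analytic {E_at:.4f} vs MC {a_t.mean():.4f}")
print(f"Var[a_t]:  analytic {V_at:.5f} vs MC {a_t.var():.5f}")
print(f"E[logit]:  analytic {E_logit:.4f} vs MC {logit_at.mean():.4f}")
```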
D DETAILS OF GBD

D.1 TRAINING AND SAMPLING

We provide pseudo-code for the training and sampling procedures of our generative framework in both the original domain and the logit domain. Specifically, Algorithm 3 and Algorithm 1 show the training and sampling procedures in the original domain, respectively. In practice, we migrate GBD to the logit domain, as shown in Algorithm 4 and Algorithm 2, in most cases.

Algorithm 1 Sampling, in the original domain.
Require: number of time steps T = 1000, default concentration parameter η = 30, predictor Ĝ_θ.
1: (Optional) Assign values to η via Equation 38
2: Sample G_T = (A_T, X_T) ∼ p(A_T, X_T)
3: for t = T to 1 do
4:     G_in ← (G_t − E[G_t]) / √Var[G_t]
5:     (Â′_0, X̂′_0) ← Ĝ_θ(G_in, t)
6:     Â_0 ← w_A Â′_0 + b_A
7:     X̂_0 ← w_X X̂′_0 + b_X
8:     Ĝ_0 ← (Â_0, X̂_0)
9:     P_t ∼ Beta(η(α_{t−1} − α_t) Ĝ_0, η(1 − α_{t−1} Ĝ_0))
10:    G_{t−1} ← G_t + P_t ⊙ (1 − G_t)
11: end for
12: return (A_0 − b_A)/w_A and (X_0 − b_X)/w_X

Algorithm 2 Sampling, in the logit domain.
Require: same defaults as Algorithm 1.
1: Sample logit(G_T) ∼ p(logit(A_T), logit(X_T))
2: for t = T to 1 do
3:     G_in ← (logit(G_t) − E[logit(G_t)]) / √Var[logit(G_t)]
4:     (Â′_0, X̂′_0) ← Ĝ_θ(G_in, t)
5:     Â_0 ← w_A Â′_0 + b_A
6:     X̂_0 ← w_X X̂′_0 + b_X
7:     Ĝ_0 ← (Â_0, X̂_0)
8:     U_t ∼ Gamma(η(α_{t−1} − α_t) Ĝ_0, 1)
9:     V_t ∼ Gamma(η(1 − α_{t−1} Ĝ_0), 1)
10:    logit(P_t) ← ln U_t − ln V_t
11:    Obtain logit(G_{t−1}) from Equation 15
12: end for
13: G_0 ← sigmoid(logit(G_0))
14: return (A_0 − b_A)/w_A and (X_0 − b_X)/w_X

Algorithm 3 Training, in the original domain.
Require: number of time steps T = 1000, default concentration parameter η = 30, predictor Ĝ_θ, default node influence factor γ = 0.5, input graph batch B = {G^(k) = (A^(k), X^(k))}_{k=1}^{K}, default learning rate λ = 0.002, optimization steps M.
1: (Optional) Assign values to η via Equation 38
2: for step = 1 to M do
3:     Initialize L_X and L_A with 0
4:     for k = 1 to K do
5:         t ∼ Unif(1, ..., T)
6:         α_t, α_{t−1} ← schedule(t), schedule(t − 1)
7:         A_0 ← w_A A^(k) + b_A
8:         X_0 ← w_X X^(k) + b_X
9:         G_0 ← (A_0, X_0)
10:        G_t ∼ Beta(η α_t G_0, η(1 − α_t G_0))
11:        G_in ← (G_t − E[G_t]) / √Var[G_t]
12:        (Â′_0, X̂′_0) ← Ĝ_θ(G_in, t)
13:        Â_0 ← w_A Â′_0 + b_A
14:        X̂_0 ← w_X X̂′_0 + b_X
15:        L_A ← L_A + L(A_0, Â_0, η, α_t, α_{t−1})
16:        L_X ← L_X + L(X_0, X̂_0, η, α_t, α_{t−1})
17:    end for
18:    θ ← θ − (λ/K) ∇_θ(L_A + γ L_X)
19: end for

Algorithm 4 Training, in the logit domain.
Require: number of time steps T, concentration parameter η, predictor Ĝ_θ, node influence factor γ, input graph batch B, learning rate λ, optimization steps M; same default values as Algorithm 3.
1: for step = 1 to M do
2:     Initialize L_X and L_A with 0
3:     for k = 1 to K do
4:         t ∼ Unif(1, ..., T)
5:         α_t, α_{t−1} ← schedule(t), schedule(t − 1)
6:         A_0 ← w_A A^(k) + b_A
7:         X_0 ← w_X X^(k) + b_X
8:         G_0 ← (A_0, X_0)
9:         U_t ∼ Gamma(η α_t G_0, 1)
10:        V_t ∼ Gamma(η(1 − α_t G_0), 1)
11:        logit(G_t) ← ln U_t − ln V_t
12:        G_in ← (logit(G_t) − E[logit(G_t)]) / √Var[logit(G_t)]
13:        (Â′_0, X̂′_0) ← Ĝ_θ(G_in, t)
14:        Â_0 ← w_A Â′_0 + b_A
15:        X̂_0 ← w_X X̂′_0 + b_X
16:        L_A ← L_A + L(A_0, Â_0, η, α_t, α_{t−1})
17:        L_X ← L_X + L(X_0, X̂_0, η, α_t, α_{t−1})
18:    end for
19:    θ ← θ − (λ/K) ∇_θ(L_A + γ L_X)
20: end for

D.2 DETAILS OF CONCENTRATION MODULATION

Here we elaborate on the concentration modulation strategies mentioned in Section 2.3.

Concentration modulation for general graph generation For general graph generation, we provide two strategies based on node-level centralities, namely degree centrality and betweenness centrality. The degree centrality of a node $u$ in an undirected graph can be formulated as

$C_d(u) = C_{\mathrm{in}}(u) = C_{\mathrm{out}}(u) = \mathrm{Deg}(u),$  (36)

where $C_{\mathrm{in}}(u)$, $C_{\mathrm{out}}(u)$, and $\mathrm{Deg}(u)$ denote the in-degree, out-degree, and total degree of node $u$, which coincide in an undirected graph. The betweenness centrality of a node $u$ in an undirected graph can be formulated as

$C_b(u) = \sum_{s \neq u \neq t} \frac{g_{st}(u)}{g_{st}},$  (37)

where $g_{st}$ is the total number of shortest paths from node $s$ to node $t$, and $g_{st}(u)$ is the number of those paths that pass through $u$. For large graphs, exact computation of betweenness centrality can be time-consuming, so approximation algorithms based on random sampling are often employed. In practice, we use the NetworkX library (Hagberg et al., 2008) for this computation. With these two metrics measuring node centrality, denoted $C(u)$ in general, the modulated η can be expressed as

$\eta_{u,v} = g_A\big(\max(C(u), C(v))\big), \quad \eta_u = g_X\big(C(u)\big),$  (38)

where $g_A(\cdot)$ and $g_X(\cdot)$ are two assignment functions that map the node centrality to one of the predefined η values.

Concentration modulation for molecule graph generation For molecule graph generation, we provide a straightforward strategy that regards the carbon atom as the most important node, and the bonds connected to carbon atoms as the most important edges in a molecule graph. Specifically, we rank the importance of the various carbon-atom bonds in the following order: carbon-carbon bonds, carbon-nitrogen bonds, carbon-oxygen bonds, carbon-sulfur bonds, and so on. We can then assign the predefined η to different nodes and edges depending on their importance in the molecule graph.
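A minimal sketch of the centrality-based assignment in Equation 38 follows; the bucketing function `assign_eta` is a hypothetical instance of $g_A$ and $g_X$, using the η values and split points reported in Appendix E.1, and the BA-network is only an example input.

```python
# Hedged sketch of Equation 38: map node centralities to concentration
# parameters; thresholds and eta values follow Appendix E.1.
import networkx as nx

def assign_eta(c):
    """Map a normalized centrality c in [0, 1] to an eta value, using
    the split points [1.0, 0.8, 0.4, 0.1] and etas [10000, 100, 30, 10]."""
    for eta, lower in zip([10000.0, 100.0, 30.0], [0.8, 0.4, 0.1]):
        if c >= lower:
            return eta
    return 10.0  # centralities below 0.1

G = nx.barabasi_albert_graph(20, 2, seed=0)  # e.g., a small BA-network
cent = nx.degree_centrality(G)               # normalized degrees in [0, 1]
# cent = nx.betweenness_centrality(G)        # betweenness-based variant

eta_node = {u: assign_eta(cent[u]) for u in G}               # eta_u
eta_edge = {(u, v): assign_eta(max(cent[u], cent[v]))        # eta_{u,v}
            for u, v in G.edges}
print(sorted(set(eta_edge.values())))
```

Taking the maximum of the two endpoint centralities means an edge incident to any hub receives a high η, which stabilizes those critical entries early in the reverse chain.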
D.3 MODEL ARCHITECTURE

We leverage the graph transformer network introduced in Dwivedi & Bresson (2020) and Vignac et al. (2023) across all graph generation tasks. Each graph transformer layer consists of a graph attention module, fully-connected layers, and layer normalization. It employs a self-attention module to update node features, and then uses FiLM layers (Perez et al., 2018) to incorporate edge features and global features. Since the transformed data fall within the range [0, 1], we apply the sigmoid function to the outputs for node features and adjacency matrices to model the one-hot encoded representations of nodes and edges.

D.4 SCHEDULE OF DIFFUSION PROCESS IN GBD

Following Zhou et al. (2023), we employ a sigmoid diffusion schedule defined as $\alpha_t = 1/(1 + e^{-c_0 - (c_1 - c_0)t})$ throughout all experiments, where $c_0 = 10$ and $c_1 = -13$.
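The sketch below evaluates this schedule at a few time steps; treating the time argument as normalized by the total number of steps $T$ is our assumption, made so that $\alpha_t$ decays from roughly 1 at $t = 0$ to roughly 0 at $t = T$.

```python
# Hedged sketch of the sigmoid schedule in Appendix D.4; normalizing
# t by T is an assumption of this illustration.
import numpy as np

def alpha_schedule(t, T=1000, c0=10.0, c1=-13.0):
    """alpha_t = 1 / (1 + exp(-c0 - (c1 - c0) * t / T))."""
    s = t / T
    return 1.0 / (1.0 + np.exp(-c0 - (c1 - c0) * s))

ts = np.array([0, 250, 500, 750, 1000])
print(np.round(alpha_schedule(ts), 5))  # decays monotonically from ~1 to ~0
```

With $c_0 = 10$ and $c_1 = -13$, the schedule is nearly flat at both ends and steepest in the middle of the trajectory, concentrating most of the noise injection around the midpoint of the forward process.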
E EXPERIMENTAL DETAILS

E.1 GENERAL GRAPH GENERATION

Datasets We evaluate our model on synthetic and real datasets of varying size and connectivity, previously used as benchmarks in the literature (Cho et al., 2023; Jo et al., 2022):

Ego-small (Sen et al., 2008) consists of 200 small real sub-graphs from the Citeseer network dataset with 4 ≤ N ≤ 18.

Community-small consists of 100 randomly generated synthetic graphs with 12 ≤ N ≤ 20, where each graph is constructed from two equal-sized communities, each generated by the Erdős–Rényi model (Erdős et al., 1960) with p = 0.7; 0.05N inter-community edges are added with uniform probability, as in previous works (Jo et al., 2022; Niu et al., 2020b).

Grid consists of 100 randomly generated standard 2D grid graphs with 100 ≤ N ≤ 400, where the maximum number of edges per node is 4 since all nodes are arranged in a regular lattice.

Planar comprises 200 synthetic planar graphs, each containing N = 64 nodes.

SBM includes 200 synthetic stochastic block model graphs, where the number of communities is randomly sampled from 2 to 5 and the number of nodes in each community is randomly sampled from 20 to 40. The probability of edges within communities is 0.3 and that of edges between communities is 0.05.

Evaluation metrics For a fair comparison, we follow the experimental and evaluation settings of Jo et al. (2022; 2023), using the same train/test split, where 80% of the data is used as the training set and the remaining 20% as the test set. We adopt maximum mean discrepancy (MMD) as our evaluation metric to compare three graph property distributions between the test graphs and an equal number of generated graphs: degree (Deg.), clustering coefficient (Clus.), and the count of orbits with 4 nodes (Orbit). Note that we use the Gaussian Earth Mover's Distance (EMD) kernel to compute the MMDs, following previous work (Jo et al., 2022; Cho et al., 2023). Additionally, for graphs with more complex structures, such as Planar and SBM, we report the MMD of the eigenvalues of the graph Laplacian (Spec.) and the percentage of valid, unique, and novel graphs (V.U.N.) to measure whether the model has learned the intrinsic feature distribution and global properties of the graphs. A lower value is better for all of these metrics except V.U.N. Specifically, a graph is defined as a valid planar graph if it is connected and planar. Following the statistical test introduced in Martinkus et al. (2022), we determine that a graph is a valid SBM graph if and only if it has a community count between 2 and 5 and a node count inside each community between 20 and 40.

Implementation details We follow the evaluation setting of Jo et al. (2022) and Cho et al. (2023), generating graphs of the same size as the test data in each run, and we report the mean and standard deviation obtained from 3 independent runs for each dataset. We take the baseline results from Cho et al. (2023), except for the results of ConGress in Tables 1 and 8, which we obtained by running its corresponding open-source code. For a fair comparison, we adopt the Graph Transformer (Dwivedi & Bresson, 2020; Vignac et al., 2023) as the neural network used in GDSS+Transformer (Jo et al., 2022), DiGress (Vignac et al., 2023), and DruM (Jo et al., 2023). We set the diffusion steps to 1000 for all the diffusion models. For the important hyperparameters mentioned in Section 2.3, we typically set Scale = 0.9 and Shift = 0.09, and use η = [10000, 100, 30, 10] for normalized degrees falling in the intervals delimited by the split points [1.0, 0.8, 0.4, 0.1], respectively. In practice, we set the threshold to 0.5 to quantize the generated continuous adjacency matrix.

E.2 2D MOLECULE GENERATION

Datasets We utilize two widely used molecular datasets as benchmarks, as described in Jo et al. (2023): QM9 (Ramakrishnan et al., 2014), consisting of 133,885 molecules with N ≤ 9 nodes from 4 different node types, and ZINC250k (Irwin et al., 2012), consisting of 249,455 molecules with N ≤ 38 nodes from 9 node types. Molecules in both datasets have 3 edge types, namely single, double, and triple bonds. Following the standard procedure in the literature (Shi et al., 2020; Luo et al., 2021; Jo et al., 2022; 2023), we kekulize the molecules using the RDKit library (Landrum et al., 2006) and remove the hydrogen atoms from the molecules in the QM9 and ZINC250k datasets.

Evaluation metrics Following the evaluation setting of Jo et al. (2022), we generate 10,000 molecules for each dataset and evaluate them with four metrics: the ratio of valid molecules without correction (Val.); the Fréchet ChemNet Distance (FCD), which evaluates the chemical properties of the molecules by measuring the distance between the feature vectors of generated molecules and those in the test set using ChemNet; the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK), which assesses the quality of the graph structure by measuring the MMD between the generated molecular graphs and the molecular graphs from the test set; and Scaffold Similarity (Scaf.), which evaluates the ability to generate similar substructures by measuring the cosine similarity of the frequencies of Bemis-Murcko scaffolds (Bemis & Murcko, 1996).

Implementation details We follow the evaluation setting of Jo et al. (2022; 2023) to generate 10,000 molecules and evaluate them against the test data for each dataset. We quote the baseline results from Jo et al. (2023). For a fair comparison, we adopt the Graph Transformer (Dwivedi & Bresson, 2020; Vignac et al., 2023) as the neural network used in GDSS+Transformer (TF) (Jo et al., 2022), DiGress (Vignac et al., 2023), and DruM (Jo et al., 2023). We apply an exponential moving average (EMA) to the parameters during sampling and set the diffusion steps to 1000 for all the diffusion models. For both QM9 and ZINC250k, we encode nodes and edges as one-hot vectors and set Scale = 0.9 and Shift = 0.09. For the η modulated in molecule generation, guided by chemical knowledge, we apply η = [10000, 100, 100, 100, 30] to carbon-carbon, carbon-nitrogen, carbon-oxygen, carbon-fluorine, and other possible bonds, respectively. For the η of nodes, we apply η = [10000, 100, 100, 30] to carbon atoms, nitrogen atoms, oxygen atoms, and other possible atoms, respectively. As described in Section 2.3, applying the appropriate η to different node and edge types prolongs the presence of the related substructures during the diffusion process.
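The following minimal sketch illustrates this chemistry-informed η assignment; the lookup tables and helper names are hypothetical, with unlisted bond and atom types falling back to the default η = 30.

```python
# Hedged sketch of the chemistry-informed eta assignment described
# above; keys and defaults are illustrative, not the released code.
ETA_BOND = {("C", "C"): 10000.0, ("C", "N"): 100.0,
            ("C", "O"): 100.0, ("C", "F"): 100.0}
ETA_ATOM = {"C": 10000.0, "N": 100.0, "O": 100.0}

def eta_for_bond(a: str, b: str, default: float = 30.0) -> float:
    """Look up eta for an unordered pair of atom symbols."""
    return ETA_BOND.get(tuple(sorted((a, b))), default)

def eta_for_atom(a: str, default: float = 30.0) -> float:
    """Look up eta for a single atom symbol."""
    return ETA_ATOM.get(a, default)

print(eta_for_bond("N", "C"), eta_for_atom("S"))  # 100.0 30.0
```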
In practice, we set the threshold to 0.5 to quantize the generated continuous adjacency matrix, and an entry of the discrete adjacency matrix is 0 after quantization if and only if all values in each dimension are 0.

Complete results on 2D molecule generation We provide additional results, including Uniqueness (Uniq.) and Novelty, on 2D molecule generation in Tables 6 and 7.

Table 6: Additional 2D molecule generation results on the QM9 dataset (|N| ≤ 9).

Method      Valid (%)   FCD      NSPDK    Scaf.    Uniq (%)   Novelty (%)
MoFlow      91.36       4.467    0.0169   0.1447   98.65      94.72
GraphAF     74.43       5.625    0.0207   0.3046   88.64      86.59
GraphDF     93.88       10.928   0.0636   0.0978   98.58      98.54
EDP-GNN     47.52       2.680    0.0046   0.3270   99.25      86.58
GDSS        95.72       2.900    0.0033   0.6983   98.46      86.27
GDSS+TF     99.68       0.737    0.0024   0.9129   -          -
DiGress     98.19       0.095    0.0003   0.9353   96.67      25.58
SwinGNN     99.71       0.125    0.0003   -        96.25      17.34
GraphARM    90.25       1.22     0.0020   -        95.62      70.39
EDGE        99.10       0.458    -        0.763    100.0      -
GruM        99.69       0.108    0.0002   0.9449   96.90      24.15
GBD         99.88       0.093    0.0002   0.9510   97.12      26.32

Table 7: Additional 2D molecule generation results on the ZINC250k dataset (|N| ≤ 38).

Method      Valid (%)   FCD      NSPDK    Scaf.    Uniq (%)   Novelty (%)
MoFlow      63.11       20.931   0.0455   0.0133   99.99      100.0
GraphAF     68.47       16.023   0.0442   0.0672   98.64      99.99
GraphDF     90.61       33.546   0.1770   0.0000   99.63      100.0
EDP-GNN     82.97       16.737   0.0485   0.0000   99.79      100.0
GDSS        97.01       14.656   0.0195   0.0467   99.64      100.0
GDSS+TF     96.04       5.556    0.0326   0.3205   -          -
DiGress     94.99       3.482    0.0021   0.4163   99.97      99.99
SwinGNN     81.72       5.920    0.006    -        99.98      99.91
GraphARM    88.23       16.26    0.055    -        99.46      100.0
GruM        98.65       2.257    0.0015   0.5299   99.97      99.98
GBD         97.87       2.248    0.0018   0.5042   99.97      99.99

E.3 COMPUTING RESOURCES

For all experiments, we used the PyTorch (Paszke et al., 2019) framework to implement GBD and trained the model on NVIDIA GeForce RTX 4090 and RTX A5000 GPUs.

F ADDITIONAL EXPERIMENTAL RESULTS

F.1 EVALUATION WITH LARGER SAMPLE SIZE

As described in Section 4.1, the small number of nodes and the insufficient number of sampled graphs can lead to large standard deviations when evaluating against the reference graphs on smaller datasets. We therefore evaluate a larger number of generated graphs and report the results in Table 8. We observe that our proposed GBD outperforms previous continuous and discrete diffusion models on both smaller datasets. Furthermore, GBD surpasses the wavelet-based diffusion model (Wave-GD) by a wide margin on the Community-small dataset, as evidenced by both the means and standard deviations. Specifically, GBD achieves 85.0%, 90.5%, and 40.0% improvements over Wave-GD in the mean MMDs of Degree, Cluster, and Orbit, respectively, indicating that our proposed model is capable of generating smaller graphs that are closer to the data distribution with better stability.
Table 8: Generic graph generation results with an enlarged sample size (1024 graphs).

                      Ego-small                               Community-small
Method       Deg.      Clus.     Orbit     Avg.      Deg.      Clus.     Orbit     Avg.
GraphRNN     0.040     0.050     0.060     0.050     0.030     0.010     0.010     0.017
GNF          0.010     0.030     0.001     0.014     0.120     0.150     0.020     0.097
EDP-GNN      0.010     0.025     0.003     0.013     0.006     0.127     0.018     0.050
GDSS         0.023     0.020     0.005     0.016     0.029     0.068     0.004     0.034
ConGress     0.030     0.050     0.008     0.030     0.004     0.047     0.001     0.017
             (±0.001)  (±0.003)  (±0.001)  -         (±0.000)  (±0.000)  (±0.000)  -
DiGress      0.009     0.031     0.003     0.014     0.003     0.009     0.001     0.004
             (±0.000)  (±0.002)  (±0.000)  -         (±0.000)  (±0.001)  (±0.000)  -
Wave-GD      0.010     0.018     0.005     0.011     0.016     0.077     0.001     0.031
             (±0.001)  (±0.003)  (±0.002)  -         (±0.000)  (±0.006)  (±0.002)  -
GBD          0.007     0.011     0.003     0.010     0.002     0.007     0.001     0.003
             (±0.000)  (±0.001)  (±0.000)  -         (±0.000)  (±0.001)  (±0.000)  -

F.2 COMPARISON ON VARIOUS NODE FEATURE INITIALIZATIONS

Node feature initialization We initialize node representations with each of the following node-level features:

Degree (one-hot): a categorical representation of a node's degree. It encodes the degree as a binary vector in which the position corresponding to the node's actual degree is set to 1, while all other positions are 0.

Degree (normalized): a continuous representation of a node's degree, scaled to a value between 0 and 1.

Centrality: we adopt the normalized betweenness centrality as the initial node feature. It measures a node's importance based on its role in connecting other nodes, quantifying the fraction of shortest paths between all pairs of nodes that pass through the given node; the normalized version scales this value to lie between 0 and 1.

Eigenvectors: following Jo et al. (2022), we adopt the first two eigenvectors associated with non-zero eigenvalues as initial node features.

Results As shown in Table 9, GBD outperforms GDSS+TF and ConGress by a large margin in all MMDs when the node representation exhibits sparsity and long-tailedness. Additionally, GBD achieves competitive performance compared with these Gaussian-based diffusion models when the node features are initialized with eigenvectors. This demonstrates that our proposed GBD can model graphs with flexible node features, indicating its potential for modeling graphs with more informative features.

G VISUALIZATION

We follow the implementation described in Section 2.3, and the nodes in all adjacency matrices are reordered by decreasing node degree. We observe that edges associated with high-degree nodes are the first to be identified, and identification then spreads in decreasing order of degree on both datasets in Appendix G.1. It is worth noting that the reverse beta diffusion converges rapidly, yielding generated graphs with the correct topology at an early stage. This shows that our proposed GBD can further exploit the potential benefits of beta diffusion, producing valid graphs with stability and high quality. For the generated molecule graphs shown in Appendix G.3, we observe that GBD successfully generates valid and high-quality 2D molecules, verifying its ability to model attributed graphs. More generated graphs are presented in the following.
Table 9: The effect of feature initialization on Community-small and Ego-small.

Community-small
Node Feature  Degree (one-hot)        Degree (normalized)     Centralities            Eigenvectors
Method        Deg.    Clus.   Orbit   Deg.    Clus.   Orbit   Deg.    Clus.   Orbit   Deg.    Clus.   Orbit
GDSS+TF       0.008   0.080   0.005   0.009   0.077   0.005   0.010   0.075   0.004   0.005   0.061   0.003
ConGress      0.024   0.072   0.006   0.020   0.076   0.006   0.013   0.079   0.005   0.004   0.067   0.003
GBD           0.002   0.060   0.002   0.004   0.059   0.003   0.003   0.059   0.003   0.004   0.064   0.002

Ego-small
Node Feature  Degree (one-hot)        Degree (normalized)     Centralities            Eigenvectors
Method        Deg.    Clus.   Orbit   Deg.    Clus.   Orbit   Deg.    Clus.   Orbit   Deg.    Clus.   Orbit
GDSS+TF       0.016   0.020   0.004   0.018   0.023   0.005   0.017   0.026   0.006   0.013   0.015   0.002
ConGress      0.045   0.059   0.015   0.037   0.064   0.017   0.032   0.057   0.014   0.020   0.029   0.011
GBD           0.011   0.014   0.002   0.013   0.017   0.002   0.015   0.012   0.003   0.017   0.016   0.002

G.1 GENERATIVE PROCESS OF GBD ON GENERAL DATASETS

We visualize the generative process of GBD on the Community-small and Ego-small datasets in Figures 5 and 6, respectively.

Figure 5: Visualization of the generative process of GBD on the Community-small dataset.

Figure 6: Visualization of the generative process of GBD on the Ego-small dataset.

G.2 GENERATIVE PROCESS OF GBD ON COMPLEX GRAPHS

We provide visualizations of the complex graphs generated by GBD on the Planar and SBM datasets in Figure 7 and Figure 8, respectively.

Figure 7: Visualization of the generated graphs of GBD on the Planar dataset.

Figure 8: Visualization of the generated graphs of GBD on the SBM dataset.

G.3 GENERATED GRAPHS OF GBD ON 2D MOLECULE DATASETS

We provide visualizations of the 2D molecules generated by GBD on the QM9 and ZINC250k datasets in Figure 9 and Figure 10, respectively.

Figure 9: Visualization of the generated graphs of GBD on the QM9 dataset.

Figure 10: Visualization of the generated graphs of GBD on the ZINC250k dataset.

G.4 GENERATED GRAPHS OF GBD ON BA-NETWORKS

We provide a visualization of the BA-networks (n = 20, m = 2) generated by GBD in Figure 11.

Figure 11: Visualization of the generated graphs of GBD on the BA-network (n = 20, m = 2).