# Graph Generation with Diffusion Mixture

Jaehyeong Jo 1 *, Dongki Kim 1 *, Sung Ju Hwang 1 2

*Equal contribution. 1 Korea Advanced Institute of Science and Technology (KAIST), 2 DeepAuto.ai. Correspondence to: Jaehyeong Jo, Dongki Kim, Sung Ju Hwang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures. Although diffusion models have recently achieved notable success in graph generation, they are ill-suited for modeling the topological properties of graphs, since learning to denoise noisy samples does not explicitly learn the graph structures to be generated. To tackle this limitation, we propose a generative framework that models the topology of graphs by explicitly learning the final graph structures of the diffusion process. Specifically, we design the generative process as a mixture of endpoint-conditioned diffusion processes that is driven toward the predicted graph, resulting in rapid convergence. We further introduce a simple parameterization of the mixture process and develop an objective for learning the final graph structure, which enables maximum likelihood training. Through extensive experimental validation on general graph and 2D/3D molecule generation tasks, we show that our method outperforms previous generative models, generating graphs with correct topology and with both continuous (e.g., 3D coordinates) and discrete (e.g., atom types) features. Our code is available at https://github.com/harryjo97/GruM.

## 1. Introduction

Generation of graph-structured data has emerged as a crucial task for real-world problems such as drug discovery (Simonovsky & Komodakis, 2018), protein design (Ingraham et al., 2019), and program synthesis (Brockschmidt et al., 2019). To tackle the challenge of learning the underlying distribution of graphs, deep generative models have been proposed, including models based on generative adversarial networks (GANs) (De Cao & Kipf, 2018; Martinkus et al., 2022), recurrent neural networks (RNNs) (You et al., 2018), and variational autoencoders (VAEs) (Jin et al., 2018).

Recently, diffusion models have achieved state-of-the-art performance on the generation of graph-structured data (Niu et al., 2020; Jo et al., 2022; Hoogeboom et al., 2022). These models learn the generation process as the time reversal of a forward process that corrupts graphs by gradually adding noise, destroying their topological properties. Since the generative process is derived from the unknown score function (Song et al., 2021) or noise (Ho et al., 2020), existing graph diffusion models aim to estimate them in order to denoise the data from noise, and are commonly referred to as denoising diffusion models (Figure 1 (a)).

Despite their success, learning the score or noise is fundamentally ill-suited for the generation of graphs. In contrast to other types of data such as images, the key to generating valid graphs is accurately modeling the discrete structures that determine topological properties such as connectivity or clusteredness. However, learning the score or noise does not explicitly model these features, as it aims to gradually denoise the corrupted structures.
Thereby it is challenging for the diffusion models to recover the topological properties, which leads to failure cases even for small graphs. A way to more accurately generate graphs with correct topology would be directly learning the final graph and its structure, instead of learning how to denoise a noisy version of the original graphs. However, predicting the final graph structure of the diffusion process is difficult since the prediction would be highly inaccurate in the early steps of the diffusion process, and such an inaccurate prediction may lead the process in the wrong direction resulting in invalid results. Few existing works (Hoogeboom et al., 2021; Austin et al., 2021; Vignac et al., 2023) based on denoising diffusion models aim to predict the probability of the final states by parameterizing the denoising process, but it is only applicable to categorical data with a finite number of states and thus cannot generate graphs with continuous features, which is not suitable for tasks such as 3D molecule generation. To address these limitations of existing graph diffusion models, we propose a novel framework that explicitly models the Graph Generation with Diffusion Mixture graph topology, by learning the prediction of the resulting graph of the generative process which is represented as a weighted mean of graph data (Figure 1 (b)). Specifically, we construct a diffusion process via the mixture of endpointconditioned Ornstein-Uhlenbeck processes for which the drift drives the process in the direction of the predicted graph, differing from the denoising diffusion process used in previous works (Section 3.1). In order to model the mixture of the diffusion process, we develop a simple parameterization of the graph generative model with respect to the prediction of the final graph. We further derive an efficient training objective for learning the graph prediction, which guarantees to maximize the likelihood of our generative model (Section 3.2). Thanks to its ability to capture accurate graph structures, our framework achieves fast convergence to the correct graph topology in an early sampling step (Figure 1 (c) and Figure 3 (Right)). We experimentally validate our method on diverse realworld graph generation tasks. We first validate it on general graph generation benchmarks with synthetic and real-world graphs, on which it outperforms previous deep graph generative models including graph diffusion models, by being able to generate valid graphs with correct topologies. We further validate our method on 2D and 3D molecule generation tasks to demonstrate its ability to generate graphs with both the continuous and discrete features, on which ours generates a significantly larger number of valid and stable molecules compared to the state-of-the-art baselines. Our main contributions can be summarized as follows: We observe that previous diffusion models cannot accurately model the graph structures as they learn to denoise at each step without considering the topology of the graphs to be generated. To fix such a myopic behavior of previous diffusion models, we propose a new graph generation framework that captures the graph structures by directly predicting the final graph of the diffusion process modeled by a mixture of endpoint-conditioned diffusion processes. We develop a simple parameterization of the graph generative model for modeling the mixture process and present a simulation-free training objective for graph prediction. 
Our method significantly outperforms previous graph diffusion models on the generation of diverse real and synthetic graphs, as well as on 2D/3D molecule generation tasks, by being able to generate graphs with accurate topologies, and both the discrete and continuous features. 2. Related Work Diffusion Models Diffusion models have been shown to successfully generate high-quality samples from diverse data domains such as images (Dhariwal & Nichol, 2021; Sa- haria et al., 2022) and videos (Ho et al., 2022). Despite their success, existing diffusion models for graphs (Niu et al., 2020; Jo et al., 2022) often fail to generate graphs with correct structures since they learn to estimate the score or noise for the denoising process which does not explicitly capture the final graph and its structure. To address these limitations, we propose a graph diffusion framework that models the generative process as a mixture of diffusion processes, which learns to predict the final graph with valid topology instead of predicting the denoising function at each step. This promotes our generative process to be driven toward the prediction of the final graph, resulting in generation of valid graphs with correct topology. Diffusion Bridge Process A line of recent works has improved the generative framework of diffusion models by leveraging the diffusion bridge processes, i.e., processes conditioned to the endpoints. Schrödinger Bridge (Bortoli et al., 2021b; Chen et al., 2022) aims to find both the forward and the backward process that transforms two distributions back and forth using iterative proportional fittings that require heavy computations. More recent works (Peluchetti, 2021; Wu et al., 2022; Liu et al., 2023; Jo & Hwang, 2023) consider learning the generation process as a mixture of diffusion processes instead of reversing the noising process as in denoising diffusion models, which we describe in detail in Appendix A.11. However, previous works aim to approximate the drifts of the diffusion processes which cannot accurately capture the discrete structure of graphs as it does not explicitly learn the graph structures to be generated. Moreover, learning the drift could be problematic as the drift of the diffusion process diverges near the terminal time. Instead, we present a new approach to parameterizing the mixture process with the prediction of the final graph which allows it to model valid graph topology. Graph Generative Models Deep generative models for graphs either generate nodes and edges in an autoregressive manner or all the nodes and edges at once using VAE (Jin et al., 2018), RNN (You et al., 2018), normalizing flow (Zang & Wang, 2020; Shi et al., 2020; Luo et al., 2021), and GAN (De Cao & Kipf, 2018; Martinkus et al., 2022). However, these models show poor performance due to restrictive model architectures for modeling the likelihood or their inability to model the permutation equivariant nature of graphs. Recently, diffusion models for graphs (Jo et al., 2022; Hoogeboom et al., 2022; Vignac et al., 2023) have made large progress, but either fail to capture the graph topology or are not applicable to general tasks due to the architectural restriction of the framework. In our work, we introduce a diffusion framework that predicts the final graph structure instead of denoising noisy graphs. Our method largely outperforms existing models (Jo et al., 2022; Vignac et al., 2023; Hoogeboom et al., 2022) on generation tasks including general graphs as well as 2D and 3D molecules. 
Figure 1: Illustration of the graph generative process. (a) Denoising diffusion model, (b) GruM (ours), (c) Graph mixture. For GruM, we design the generative process as a mixture of endpoint-conditioned diffusion processes (Eq. (3)), namely the OU bridge mixture (Eq. (6)), which is driven toward the graph mixture (green) by its drift (Eq. (8)). GruM in (b) successfully generates graphs with valid topology by predicting the final result via learning the graph mixture as a weighted mean of data (Eq. (1)). The predicted graph of GruM converges in an early stage to the correct topology, as visualized in (c). In contrast, previous denoising diffusion models in (a) often fail to capture the correct topology as they learn the score or noise for denoising (red), without explicit knowledge of the final graph structure.

## 3. Graph Diffusion Mixture

In this section, we present our graph generation framework, Graph Diffusion Mixture (GruM), for modeling valid topology of graphs using a mixture of diffusion processes.

### 3.1. Designing Graph Generative Process

The key to generating graph-structured data is understanding the underlying topology of graphs, which is crucial to determining their validity, since a slight modification of the edges may significantly change the structure and its attributes, for example planarity or molecular properties. However, previous diffusion models fail to do so as their objective is to denoise the noisy graphs, in which the topology is only implicitly captured (Figure 1 (a)) from the noisy structure. To overcome this limitation, we propose a graph diffusion framework that can directly learn accurate graph structures and capture valid topology. Throughout the paper, we represent a graph with $N$ nodes as a pair $G=(X,A)$, where $X \in \mathbb{R}^{N\times F}$ is the node feature matrix with feature dimension $F$ and $A \in \mathbb{R}^{N\times N}$ is the weighted adjacency matrix that defines the connections between nodes.

**Graph Mixture** Our goal is to directly predict the final graph of the diffusion process that transports a prior distribution to the data distribution $\Pi^*$. To be specific, for a graph diffusion process represented as a trajectory of random variables $\{G_\tau=(X_\tau,A_\tau)\}_{\tau\in[0,T]}$, we aim to predict the terminus $G_T \sim \Pi^*$ of the process given the current state $G_t$. However, identifying the exact $G_T$ at an early stage of the process is problematic, since a prediction based on a $G_t$ that carries almost no information would be highly inaccurate and could lead the process in the wrong direction. To address this problem, we present a different approach of predicting the probable graph, which we define as a weighted mean of all possible final results (Figure 1 (b)). Since the probability of a graph $\mathbf{g}$ being the final result is equal to the transition probability of the process, denoted $p_{T|t}(\mathbf{g}\,|\,\cdot)$, we define the probable graph given the current state $G_t$ via the expectation over graphs as follows:

$$D(G_t, t) = \int \mathbf{g}\; p_{T|t}(\mathbf{g}\,|\,G_t)\, d\mathbf{g}, \quad (1)$$

which we refer to as the graph mixture of the process, visualized in Figure 1. In order to explicitly model this, we construct a generative process driven toward the graph mixture using a mixture of diffusion processes, which we describe in the following paragraphs.
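To make Eq. (1) concrete, the following sketch computes the graph mixture for a toy setting in which the data distribution is supported on a finite set of graphs, so the expectation reduces to a posterior-weighted average of candidate final graphs. For illustration only, we assume the transition density $p_{T|t}$ is an isotropic Gaussian centered at the current state (as for a simple Brownian forward dynamic); the transition density actually used by GruM is that of the OU process introduced below.

```python
import numpy as np

def graph_mixture(G_t, t, T, dataset, sigma2=1.0):
    """Toy version of Eq. (1): D(G_t, t) = sum_i w_i * g_i with w_i ∝ p_{T|t}(g_i | G_t).
    For illustration we assume a Brownian transition density
    p_{T|t}(g | G_t) = N(g; G_t, sigma2 * (T - t) * I)."""
    var = sigma2 * (T - t)
    # log-weights of each candidate final graph under the assumed transition density
    log_w = np.array([-np.sum((g - G_t) ** 2) / (2 * var) for g in dataset])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # weighted mean of all possible final graphs (here: adjacency matrices)
    return sum(w_i * g for w_i, g in zip(w, dataset))

# Two toy "graphs" represented by 3x3 adjacency matrices.
ring = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

G_t = 0.5 * ring + 0.1 * np.random.randn(3, 3)   # current noisy state
print(graph_mixture(G_t, t=0.9, T=1.0, dataset=[ring, path]))
```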
**Ornstein-Uhlenbeck Bridge Process** As a building block of our generative framework, we leverage diffusion processes with fixed endpoints, namely the diffusion bridge processes. We propose to use a family of bridge processes, the Ornstein-Uhlenbeck (OU) bridge processes, which enable simulation-free training of our generative model while providing flexibility for modeling the complex generative process for graphs. Given an OU process $\mathbb{Q}$ modeled by the following SDE:

$$\mathbb{Q}:\; dG_t = \alpha\sigma_t^2 G_t\, dt + \sigma_t\, dW_t, \quad (2)$$

where $\alpha$ is a constant, $\sigma_t$ is a noise schedule, and $W_t$ is the standard Wiener process, the OU bridge process $\mathbb{Q}^{\mathbf{g}}$ is the process $\mathbb{Q}$ pinned to a fixed terminal point $\mathbf{g}$. Using Doob's h-transform (Doob & Doob, 1984), we can derive the OU bridge process $\mathbb{Q}^{\mathbf{g}}$ as follows (we provide a detailed derivation of the bridge process in Appendix A.1):

$$dG_t = \underbrace{\left[\alpha\sigma_t^2 G_t + \frac{\sigma_t^2}{v_t}\left(\frac{\mathbf{g}}{u_t} - G_t\right)\right]}_{\eta^{\mathbf{g}}(G_t,\, t)} dt + \sigma_t\, dW_t, \quad (3)$$

where the scalar functions $u_t$ and $v_t$ are defined as follows:

$$u_t = \exp\left(\alpha\int_t^T \sigma_\tau^2\, d\tau\right), \quad v_t = \frac{1}{2\alpha}\left(1 - u_t^{-2}\right). \quad (4)$$

The endpoint of $\mathbb{Q}^{\mathbf{g}}$ is fixed to $G_T = \mathbf{g}$, since the drift $\eta^{\mathbf{g}}(\cdot, t)$ forces the trajectory $G_t$ in the direction of $\mathbf{g}$. Although there exists a more general class of bridge processes with non-linear drift (see Appendix A.1), they have intractable transition probabilities and require expensive SDE simulation to obtain trajectories. In contrast, the OU bridge processes yield tractable transition probabilities due to their affine nature and allow the training of our generative model to be simulation-free, which we further discuss in Section 3.2. Note that the Brownian bridge process used in previous works (Wu et al., 2022; Liu et al., 2022) is a special case of the OU bridge process when $\alpha \to 0$ (see Appendix A.1). In particular, we can write the OU bridge process of Eq. (3) for graphs $\mathbf{g}=(\mathbf{x},\mathbf{a})$ as a system of SDEs:

$$\begin{cases}
dX_t = \left[\alpha_1\sigma_{1,t}^2 X_t + \frac{\sigma_{1,t}^2}{v_{1,t}}\left(\frac{\mathbf{x}}{u_{1,t}} - X_t\right)\right] dt + \sigma_{1,t}\, dW_{1,t} \\[4pt]
dA_t = \left[\alpha_2\sigma_{2,t}^2 A_t + \frac{\sigma_{2,t}^2}{v_{2,t}}\left(\frac{\mathbf{a}}{u_{2,t}} - A_t\right)\right] dt + \sigma_{2,t}\, dW_{2,t}
\end{cases} \quad (5)$$

With the OU bridge processes in hand, we develop a framework for predicting the final graph via the graph mixture.

**Diffusion Mixture for Graph Generation** As the graph mixture in Eq. (1) is a weighted mean of the final graphs, it can conceptually be modeled by aggregating endpoint-conditioned processes with respect to the weights defining the graph mixture. Inspired by the diffusion mixture framework (Peluchetti, 2021; Wu et al., 2022; Liu et al., 2022), we design the generation process by mixing the OU bridge processes whose endpoints are drawn from the data distribution, where we leverage the diffusion mixture representation (Brigo, 2008; Peluchetti, 2021). This yields the SDE representation of a mixture process as a weighted mean of the SDEs of the diffusion processes (we provide a formal definition of the mixture representation in Appendix A.2). Specifically, we mix a collection of OU bridge processes $\{\mathbb{Q}^{\mathbf{g}} : \mathbf{g} = (\mathbf{x},\mathbf{a}) \sim \Pi^*\}$ to construct the generation process, where the mixture process is modeled by the SDE:

$$\begin{cases}
dX_t = \eta_1(X_t, A_t, t)\, dt + \sigma_{1,t}\, dW_{1,t} \\
dA_t = \eta_2(X_t, A_t, t)\, dt + \sigma_{2,t}\, dW_{2,t}
\end{cases} \quad (6)$$

with $G_0 = (X_0, A_0)$ following an arbitrary prior distribution $\Gamma$, and the drifts $\eta_1$ and $\eta_2$ defined as follows:

$$\begin{pmatrix} \eta_1(X_t, A_t, t) \\ \eta_2(X_t, A_t, t) \end{pmatrix} = \int \begin{pmatrix} \eta_1^{\mathbf{x}}(X_t, t) \\ \eta_2^{\mathbf{a}}(A_t, t) \end{pmatrix} \frac{p_t^{\mathbf{g}}(G_t)}{p_t(G_t)}\, \Pi^*(d\mathbf{g}) \quad (7)$$

for $G_t = (X_t, A_t)$, where $p_t^{\mathbf{g}}$ is the marginal density of the bridge process $\mathbb{Q}^{\mathbf{g}}$, and $p_t(\cdot) := \int p_t^{\mathbf{g}}(\cdot)\, \Pi^*(d\mathbf{g})$ is the marginal density of the mixture process.
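As a minimal sketch of Eqs. (2)–(4), the snippet below evaluates the coefficients $u_t$, $v_t$ and the bridge drift $\eta^{\mathbf{g}}$ of Eq. (3). The constant noise schedule and the value of $\alpha$ are arbitrary illustrative choices, not the settings used in the experiments.

```python
import numpy as np

T, alpha = 1.0, 1.0
sigma = lambda t: 1.0          # constant noise schedule (illustrative choice)

def u(t):
    # u_t = exp(alpha * ∫_t^T sigma_tau^2 dtau), Eq. (4); closed form for a constant schedule
    return np.exp(alpha * (T - t) * sigma(t) ** 2)

def v(t):
    # v_t = (1 / (2*alpha)) * (1 - u_t^{-2}), Eq. (4)
    return (1.0 - u(t) ** -2) / (2.0 * alpha)

def bridge_drift(G_t, t, g):
    # eta^g(G_t, t) of Eq. (3): pulls the trajectory toward the fixed endpoint g
    s2 = sigma(t) ** 2
    return alpha * s2 * G_t + (s2 / v(t)) * (g / u(t) - G_t)

g = np.array([[0., 1.], [1., 0.]])      # fixed endpoint (a 2-node graph)
G_t = np.zeros((2, 2))                  # current state
print(bridge_drift(G_t, t=0.5, g=g))
```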
Notably, the terminal distribution of the mixture process $\mathbb{Q}^{\Pi^*}$ is equal to the data distribution $\Pi^*$ by construction. We refer to this mixture process as the OU bridge mixture.

Remarkably, the mixture process $\mathbb{Q}^{\Pi^*}$ can be explicitly represented in terms of the graph mixture. We derive a parameterization of $\mathbb{Q}^{\Pi^*}$ from the SDE representation of the OU bridge process in Eq. (3) as follows (see Appendix A.3 for the derivation):

$$\begin{cases}
\eta_1(X_t, A_t, t) = \alpha_1\sigma_{1,t}^2 X_t + \frac{\sigma_{1,t}^2}{v_{1,t}}\left(\frac{D^X(X_t, A_t, t)}{u_{1,t}} - X_t\right) \\[4pt]
\eta_2(X_t, A_t, t) = \alpha_2\sigma_{2,t}^2 A_t + \frac{\sigma_{2,t}^2}{v_{2,t}}\left(\frac{D^A(X_t, A_t, t)}{u_{2,t}} - A_t\right)
\end{cases} \quad (8)$$

where $D^X$ and $D^A$ are defined as weighted means of the node features and the adjacency matrices, respectively:

$$D^{\Pi^*}(G_t, t) := \begin{pmatrix} D^X(G_t, t) \\ D^A(G_t, t) \end{pmatrix} = \int \begin{pmatrix} \mathbf{x} \\ \mathbf{a} \end{pmatrix} \frac{p_t^{\mathbf{g}}(G_t)}{p_t(G_t)}\, \Pi^*(d\mathbf{g}). \quad (9)$$

Notice that from the definition of the transition distribution, we can derive the following identity:

$$D^{\Pi^*}(G_t, t) = \int \mathbf{g}\, \frac{p_t^{\mathbf{g}}(G_t)}{p_t(G_t)}\, \Pi^*(d\mathbf{g}) = \int \mathbf{g}\, p_{T|t}(\mathbf{g}\,|\,G_t)\, d\mathbf{g},$$

which shows that $D^{\Pi^*}(\cdot, t)$ coincides with the graph mixture of $\mathbb{Q}^{\Pi^*}$ as in Eq. (1). As a result, $D^{\Pi^*}(\cdot, t)$ acts as the prediction of the final graph at time $t$, where $D^X$ and $D^A$ are the predicted node features and adjacency matrix, respectively, in the form of a weighted mean of the data. Viewing the graph mixture as the weighted mean in Eq. (9), it converges to the final graph of the mixture process, since the weight $p_t^{\mathbf{g}}(G_t)/p_t(G_t)$ of the bridge process converges to one if $\mathbf{g}$ corresponds to the final graph of the trajectory and to zero otherwise. This convergence is achieved at an early stage, as visualized in Figure 1 (c) and Appendix E.2, and we further analyze the convergence behavior with respect to the coefficient $\alpha$ and the noise schedule $\sigma_t$ in Appendix D.2.

A key observation is that the drift of the OU bridge mixture in Eq. (8) closely resembles the drift of the OU bridge process in Eq. (3), except that the final graph $\mathbf{g}$ is replaced by the graph mixture. From this observation, we can see that the trajectory of the mixture process is guided by the drift in the direction of $D^{\Pi^*}(\cdot, t)$, i.e., driven toward the graph mixture that converges to a graph in the data distribution $\Pi^*$. Therefore, if we can estimate the graph mixture of this process, we can build a generative model upon the mixture process without relying on the score function or noise, where the graph structures and their topological attributes are explicitly modeled by the graph mixture.

Before introducing the training objective for learning the graph mixture, we discuss the difference between our framework and denoising diffusion models. Our generative process is modeled by a mixture of bridge processes that by construction describes an exact transport from the prior distribution to the data distribution, whereas the time reversal of denoising diffusion models is not an exact transport to the data distribution for finite time (Franzese et al., 2023). We provide further discussion of the differences between our mixture process and denoising diffusion processes in Appendix A.10.

### 3.2. Generation Framework Using Graph Mixture

**Training Objectives** Our goal is to design a generative model that explicitly learns the graph topology. To this end, we leverage the OU bridge mixture parameterized by the graph mixture, where we estimate the graph mixture using a neural network $s_\theta(\cdot, t)$, which corresponds to directly learning the graph structures. In particular, we show that estimating the graph mixture guarantees maximizing the likelihood of our generative model. For the rest of this section, we represent the system of SDEs in Eq. (6) as a single SDE with respect to $G_t$ for notational simplicity.
We propose to define the generative model $\mathbb{P}^\theta$ that approximates the mixture process $\mathbb{Q}^{\Pi^*}$ as follows:

$$\mathbb{P}^\theta:\; dG_t = \eta_\theta(G_t, t)\, dt + \sigma_t\, dW_t, \quad \eta_\theta(G_t, t) = \alpha\sigma_t^2 G_t + \frac{\sigma_t^2}{v_t}\left(\frac{s_\theta(G_t, t)}{u_t} - G_t\right), \quad (10)$$

where $s_\theta$ is intended to estimate the graph mixture $D^{\Pi^*}$. In order to model the drift $\eta_\theta$, we provide a tractable objective for estimating the graph mixture, which guarantees maximizing the likelihood of our generative model $\mathbb{P}^\theta$. Leveraging the Girsanov theorem (Øksendal, 2003), we upper-bound the KL divergence between $\Pi^*$ and the terminal distribution of $\mathbb{P}^\theta$, denoted $p_T^\theta$, as follows (see Appendix A.5 for a detailed derivation of the objective):

$$D_{KL}\!\left(\Pi^* \,\|\, p_T^\theta\right) \le D_{KL}\!\left(\mathbb{Q}^{\Pi^*} \,\|\, \mathbb{P}^\theta\right) \le \mathbb{E}\!\left[\int_0^T \gamma_t^2 \left\| s_\theta(G_t, t) - D^{\Pi^*}(G_t, t) \right\|^2 dt\right] + C_1 = \mathbb{E}\!\left[\int_0^T \gamma_t^2 \left\| s_\theta(G_t, t) - G_T \right\|^2 dt\right] + C_2, \quad (11)$$

where $\gamma_t := \sigma_t/(u_t v_t)$, $C_1$ and $C_2$ are constants independent of $\theta$, and the expectation is computed over samples $G$ from the OU bridge mixture $\mathbb{Q}^{\Pi^*}$.

During training, $G_t$ from the mixture process $\mathbb{Q}^{\Pi^*}$ can be easily obtained without simulating the SDE. Notice that $G_t$ follows the distribution $p_{t|0,T}(G_t|G_0, G_T)$, the distribution of the mixture process $\mathbb{Q}^{\Pi^*}$ at time $t$ given the endpoints $G_0$ and $G_T$. By construction, the OU bridge mixture with fixed endpoints $G_0$ and $G_T$ coincides with the OU process (Eq. (2)) conditioned on these endpoints, and $p_{t|0,T}(G_t|G_0, G_T)$ corresponds to the marginal probability of the OU process with fixed endpoints $G_0$ and $G_T$. Using Bayes' theorem, we derive that the distribution $p(G_t|G_0, G_T)$ is a Gaussian resulting from the product of Gaussian distributions, where the mean $\mu_t^*$ and the covariance $\Sigma_t^*$ have the following analytical forms (see Appendix A.6 for the derivation):

$$\mu_t^* = \frac{\sinh(\varphi_T - \varphi_t)}{\sinh(\varphi_T)}\, G_0 + \frac{\sinh(\varphi_t)}{\sinh(\varphi_T)}\, G_T, \quad \Sigma_t^* = \frac{\sinh(\varphi_T - \varphi_t)\,\sinh(\varphi_t)}{\alpha\, \sinh(\varphi_T)}\, I, \quad (12)$$

where $\varphi_t := \alpha\int_0^t \sigma_\tau^2\, d\tau$. Thereby the training of GruM is simulation-free, and our approach is 17.5 times faster than training that relies on expensive SDE simulation (Wu et al., 2022). In particular, Eq. (12) sheds light on the connection between the OU bridge mixture and the stochastic interpolant (Albergo et al., 2023) between the distributions $\Gamma$ and $\Pi^*$:

$$G_t = \frac{\sinh(\varphi_T - \varphi_t)}{\sinh(\varphi_T)}\, G_0 + \frac{\sinh(\varphi_t)}{\sinh(\varphi_T)}\, G_T + \left(\frac{\sinh(\varphi_T - \varphi_t)\,\sinh(\varphi_t)}{\alpha\, \sinh(\varphi_T)}\right)^{1/2} Z, \quad (13)$$

where $G_0 \sim \Gamma$, $G_T \sim \Pi^*$, and $Z \sim \mathcal{N}(0, I)$. Note that the goal of Eq. (11) is to model the drift of the OU bridge mixture parameterized by the graph mixture, where $s_\theta$ is trained to estimate the graph mixture instead of the exact graph $G_T$; we refer to this objective as graph mixture matching. Learning the graph mixture not only allows us to directly model the structure of the final graph and its topological properties, but also guarantees that our generative model closely approximates the data distribution. Additionally, we discuss the difference between learning the graph mixture and the training objectives of denoising diffusion models in Appendix A.9 and A.10. We summarize the training process in Algorithm 1 and provide the details in Appendix B.2.
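The sketch below mirrors the training and sampling procedures of Algorithms 1 and 2 (shown next): a training pair is formed by sampling $G_t$ in closed form via Eq. (13) and regressing the model output onto the final graph with weight $\gamma_t^2$ (Eq. (11)), and samples are generated by Euler–Maruyama simulation of $\mathbb{P}^\theta$ in Eq. (10). The constant noise schedule, step count, and placeholder "oracle" predictor are assumptions made only for illustration; the actual schedules and architectures are described in Appendix B.

```python
import numpy as np

T, alpha = 1.0, 1.0
sigma = lambda t: 1.0                   # constant noise schedule (illustrative choice)
phi   = lambda t: alpha * t             # phi_t of Eq. (12) for the constant schedule

def u(t): return np.exp(phi(T) - phi(t))               # Eq. (4)
def v(t): return (1.0 - u(t) ** -2) / (2.0 * alpha)     # Eq. (4)

def sample_interpolant(G0, GT, t):
    """Sample G_t ~ p_{t|0,T}( . | G0, GT) in closed form via Eq. (13)."""
    a = np.sinh(phi(T) - phi(t)) / np.sinh(phi(T))
    b = np.sinh(phi(t)) / np.sinh(phi(T))
    var = np.sinh(phi(T) - phi(t)) * np.sinh(phi(t)) / (alpha * np.sinh(phi(T)))
    return a * G0 + b * GT + np.sqrt(var) * np.random.randn(*np.shape(GT))

def training_loss(s_theta, G0, GT, t):
    """Single-sample estimate of the integrand of Eq. (11) (Algorithm 1, steps 4-6)."""
    gamma = sigma(t) / (u(t) * v(t))
    G_t = sample_interpolant(G0, GT, t)
    return gamma ** 2 * np.sum((s_theta(G_t, t) - GT) ** 2)

def sample(s_theta, N, K=100, eps=1e-3):
    """Algorithm 2: Euler-Maruyama simulation of P^theta (Eq. (10)) from t=0 to T-eps."""
    dt, G, t = (T - eps) / K, np.random.randn(N, N), 0.0
    for _ in range(K):
        s2 = sigma(t) ** 2
        eta = alpha * s2 * G + (s2 / v(t)) * (s_theta(G, t) / u(t) - G)
        G = G + eta * dt + sigma(t) * np.sqrt(dt) * np.random.randn(N, N)
        t += dt
    return (G > 0.5).astype(float)      # quantize to a 0-1 adjacency matrix if needed

target = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
s_theta = lambda G, t: target           # placeholder "oracle" predictor (assumption)
print(training_loss(s_theta, np.random.randn(3, 3), target, t=0.3))
print(sample(s_theta, N=3))
```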
**Algorithm 1** Training
Input: Model $s_\theta$, constant $\epsilon$
For each epoch:
1: Sample graph $G$ from the training set
2: $N \leftarrow$ number of nodes of $G$
3: Sample $t \in [0, T-\epsilon]$ and $G_0 \sim \mathcal{N}(0, I_N)$
4: Sample $G_t \sim p_{t|0,T}(G_t|G_0, G)$ ▷ Eq. (13)
5: $\gamma_t \leftarrow \sigma_t/(u_t v_t)$
6: $\mathcal{L}_\theta \leftarrow \gamma_t^2 \| s_\theta(G_t, t) - G \|^2$
7: Update $\theta$ using $\mathcal{L}_\theta$

**Algorithm 2** Sampling
Input: Trained model $s_\theta$, number of sampling steps $K$, diffusion step size $dt$
1: Sample number of nodes $N$ from the training set
2: $G_0 \sim \mathcal{N}(0, I_N)$ ▷ Start from noise
3: $t \leftarrow 0$
4: for $k = 1$ to $K$ do
5: $\quad \eta \leftarrow \alpha\sigma_t^2 G_t + \frac{\sigma_t^2}{v_t}\left(\frac{1}{u_t} s_\theta(G_t, t) - G_t\right)$
6: $\quad w \sim \mathcal{N}(0, I_N)$
7: $\quad G_{t+dt} \leftarrow G_t + \eta\, dt + \sigma_t \sqrt{dt}\, w$ ▷ Euler-Maruyama step
8: $\quad t \leftarrow t + dt$
9: end for
10: $G \leftarrow \mathrm{quantize}(G_t)$ ▷ Quantize if necessary
11: Return: Graph $G$

**Sampling from GruM** Using the trained model $s_\theta$ to compute the drift $\eta_\theta$ of the parameterized mixture process in Eq. (10), we generate samples by simulating $\mathbb{P}^\theta$ from time $t = 0$ to $t = T$ with initial samples drawn from the prior distribution. Note that we generate the node features and the adjacency matrices simultaneously using the system of SDEs of the form of Eq. (6); solving the SDEs is similar to sampling from denoising diffusion models and does not require additional time. We summarize the sampling process in Algorithm 2 and describe the details in Appendix B.4.

**Advantages of Our Framework** We conclude this section by explaining the advantages of our framework. First, GruM can directly model the graph topology by predicting the graph structures via the graph mixture, instead of capturing them implicitly via noise or score. Furthermore, our framework is not restricted by the type of data to be generated, making it applicable to both continuous and discrete data, for example 3D molecules with both discrete atom types and continuous coordinates. From the perspective of the model hypothesis space, learning the graph mixture is considerably easier than previous objectives such as learning the score function or the drift of the diffusion process: while the graph mixture is supported inside the bounded data space, the score function or the drift tends to diverge near the terminal time, which can be problematic for the model to learn. Furthermore, we can exploit the inductive bias of the graph data for learning the graph mixture, which is critical as it dramatically reduces the hypothesis space. To be specific, we can leverage prior knowledge of the graph representation, such as one-hot encoding or categorical types, by adding an additional function at the last layer of the model $s_\theta$, for instance a softmax function for one-hot encoded node features and a sigmoid function for 0-1 adjacency matrices (we provide more details in Appendix B.3). We experimentally verify these advantages in Section 4.4.

## 4. Experiments

### 4.1. General Graph Generation

We validate GruM on general graph generation tasks to show that it can generate valid graph topology.

**Datasets and Metrics** We evaluate the quality of generated graphs on three synthetic and real datasets used as benchmarks in previous works (Martinkus et al., 2022; Vignac et al., 2023): Planar, Stochastic Block Model (SBM), and Proteins (Dobson & Doig, 2003). We follow the evaluation setting of Martinkus et al. (2022) using the same data split. We measure the maximum mean discrepancy (MMD) of four graph statistics between the set of generated graphs and the test set: degree (Deg.), clustering coefficient (Clus.), count of orbits with 4 nodes (Orbit), and the eigenvalues of the graph Laplacian (Spec.). To verify that the model truly learns the distribution, we also report the percentage of valid, unique, and novel (V.U.N.) graphs, where validity is defined as satisfying the specific property of each dataset. We provide further details in Appendix C.1.
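As an illustration of the V.U.N. metric described above, the sketch below counts generated graphs that are valid (here: planar and connected, as for the Planar dataset), unique within the generated set, and novel with respect to the training set. The use of NetworkX and isomorphism tests for uniqueness and novelty is an assumption of this sketch; the exact validity criteria per dataset are given in Appendix C.1.

```python
import networkx as nx

def vun(generated, training):
    """Fraction of generated graphs that are valid (planar, connected), unique, and novel."""
    count = 0
    for i, g in enumerate(generated):
        valid = nx.check_planarity(g)[0] and nx.is_connected(g)
        unique = all(not nx.is_isomorphic(g, h) for h in generated[:i])
        novel = all(not nx.is_isomorphic(g, h) for h in training)
        count += valid and unique and novel
    return count / len(generated)

training_set = [nx.grid_2d_graph(4, 4)]            # planar training graph
generated_set = [nx.grid_2d_graph(4, 4),           # planar but not novel
                 nx.path_graph(16),                # planar, unique, novel
                 nx.complete_graph(6)]             # K6 is non-planar -> invalid
print(vun(generated_set, training_set))            # 1/3
```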
**Baselines** We compare our method against the following graph generative models: GraphRNN (You et al., 2018), an autoregressive model based on RNNs; GRAN (Liao et al., 2019), an autoregressive model with attention; SPECTRE (Martinkus et al., 2022), a one-shot model based on GANs; EDP-GNN (Niu et al., 2020), a score-based model for adjacency matrices; GDSS (Jo et al., 2022) and ConGress (Vignac et al., 2023), continuous diffusion models; and DiGress (Vignac et al., 2023), a discrete diffusion model. We provide the details of training and sampling of our GruM in Appendix B and describe further implementation details, including the hyperparameters, in Appendix C.1.

**Results** Table 1 shows that our method outperforms all the baselines on all datasets. In particular, ours achieves the highest validity (V.U.N.), as it accurately learns the underlying topology of the graphs. Notably, our method outperforms DiGress by a large margin in V.U.N., even though we do not use the specific prior distributions or structural feature augmentation utilized in DiGress. We provide an ablation study on the model architecture in Appendix D.2 to validate that the superior performance of GruM comes from its ability to accurately model the graph topology by predicting the graph mixture. We provide visualizations of the generated graphs and the generative process of GruM in Appendix E, showing that it can accurately capture the attributes of each dataset.

Table 1: Generation results on the general graph datasets. Best results are highlighted in bold, where smaller MMD and larger V.U.N. indicate better results. Hyphen (-) denotes out-of-resources runs taking more than 2 weeks.

Planar (synthetic, |V| = 64):

| Method | Deg. | Clus. | Orbit | Spec. | V.U.N. |
| --- | --- | --- | --- | --- | --- |
| Training set | 0.0002 | 0.0310 | 0.0005 | 0.0052 | 100.0 |
| GraphRNN | 0.0049 | 0.2779 | 1.2543 | 0.0459 | 0.0 |
| GRAN | 0.0007 | 0.0426 | 0.0009 | 0.0075 | 0.0 |
| SPECTRE | 0.0005 | 0.0785 | 0.0012 | 0.0112 | 25.0 |
| EDP-GNN | 0.0044 | 0.3187 | 1.4986 | 0.0813 | 0.0 |
| GDSS | 0.0041 | 0.2676 | 0.1720 | 0.0370 | 0.0 |
| ConGress | 0.0048 | 0.2728 | 1.2950 | 0.0418 | 0.0 |
| DiGress | 0.0003 | 0.0372 | 0.0009 | 0.0106 | 75 |
| GruM (Ours) | 0.0005 | 0.0353 | 0.0009 | 0.0062 | 90.0 |

SBM (synthetic, 44 ≤ |V| ≤ 187):

| Method | Deg. | Clus. | Orbit | Spec. | V.U.N. |
| --- | --- | --- | --- | --- | --- |
| Training set | 0.0008 | 0.0332 | 0.0255 | 0.0063 | 100.0 |
| GraphRNN | 0.0055 | 0.0584 | 0.0785 | 0.0065 | 5.0 |
| GRAN | 0.0113 | 0.0553 | 0.0540 | 0.0054 | 25.0 |
| SPECTRE | 0.0015 | 0.0521 | 0.0412 | 0.0056 | 52.5 |
| EDP-GNN | 0.0011 | 0.0552 | 0.0520 | 0.0070 | 35.0 |
| GDSS | 0.0212 | 0.0646 | 0.0894 | 0.0128 | 5.0 |
| ConGress | 0.0273 | 0.1029 | 0.1148 | - | 0.0 |
| DiGress | 0.0013 | 0.0498 | 0.0434 | 0.0400 | 74 |
| GruM (Ours) | 0.0007 | 0.0492 | 0.0448 | 0.0050 | 85.0 |

Proteins (real, 100 ≤ |V| ≤ 500):

| Method | Deg. | Clus. | Orbit | Spec. |
| --- | --- | --- | --- | --- |
| Training set | 0.0003 | 0.0068 | 0.0032 | 0.0009 |
| GraphRNN | 0.0040 | 0.1475 | 0.5851 | 0.0152 |
| GRAN | 0.0479 | 0.1234 | 0.3458 | 0.0125 |
| SPECTRE | 0.0056 | 0.0843 | 0.0267 | 0.0052 |
| EDP-GNN | - | - | - | - |
| GDSS | 0.0861 | 0.5111 | 0.732 | 0.0748 |
| ConGress | - | - | - | - |
| DiGress | - | - | - | - |
| GruM (Ours) | 0.0019 | 0.0660 | 0.0345 | 0.0030 |

Figure 2: (Left) Topology analysis. We compare the Spec. MMD and V.U.N. of the graph mixture from GruM against the implicit predictions computed from GDSS, ConGress, and DiGress, for which we provide details in Appendix C.1. (Middle) MMD between the test set and the graph mixture of GruM through the generative process. (Right) The complexity of GruM with and without the inductive bias, measured by the Frobenius norm of the Jacobian of the models.
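For reference, the MMD metrics in Table 1 compare distributions of a graph statistic (e.g., node degrees) between generated and test graphs. A minimal sketch using degree histograms with a Gaussian total-variation kernel follows; the kernel and bandwidth are illustrative assumptions, while the paper follows the evaluation protocol of Martinkus et al. (2022).

```python
import numpy as np

def random_graph(n, p):
    """Random symmetric 0-1 adjacency matrix (Erdos-Renyi), no self-loops."""
    A = np.triu((np.random.rand(n, n) < p).astype(float), k=1)
    return A + A.T

def degree_hist(A, max_deg=20):
    """Normalized degree histogram of a 0-1 adjacency matrix."""
    deg = A.sum(axis=1).astype(int)
    h = np.bincount(deg, minlength=max_deg + 1)[: max_deg + 1].astype(float)
    return h / max(h.sum(), 1.0)

def gaussian_tv_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the total-variation distance between histograms."""
    tv = 0.5 * np.abs(x - y).sum()
    return np.exp(-0.5 * tv ** 2 / sigma ** 2)

def mmd2(set_a, set_b, kernel=gaussian_tv_kernel):
    """Squared MMD between two sets of histograms."""
    k_aa = np.mean([kernel(x, y) for x in set_a for y in set_a])
    k_bb = np.mean([kernel(x, y) for x in set_b for y in set_b])
    k_ab = np.mean([kernel(x, y) for x in set_a for y in set_b])
    return k_aa + k_bb - 2.0 * k_ab

generated = [degree_hist(random_graph(16, 0.3)) for _ in range(8)]
test_set  = [degree_hist(random_graph(16, 0.5)) for _ in range(8)]
print(mmd2(generated, test_set))
```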
**Topology Analysis** To show how learning the graph mixture results in graphs with correct topology, we conduct an analysis of the graph mixture. Figure 2 (Left) demonstrates that GruM achieves the spectral property of the final graph at an early stage by explicitly modeling the topology via the graph mixture. In contrast, GDSS and ConGress fail to recover the spectral properties as they model the topology only implicitly by predicting the noise or score functions. Further, ours recovers the spectral property faster than DiGress, resulting in graphs with higher validity. In particular, we observe that the V.U.N. of the estimated graph mixture increases after the desired spectral property is achieved, reaching 90% V.U.N. This shows that predicting the final graph allows us to better capture the global topology. Moreover, we plot the MMD results of GruM through the generative process in Figure 2 (Middle), which demonstrates that the local characteristics of the predicted graph rapidly converge to those of the graphs from the training set.

### 4.2. 2D Molecule Generation

We further validate GruM on 2D molecule generation tasks to show that it can accurately generate graphs with both the node features and the topology of the target graphs.

**Datasets and Metrics** We evaluate the quality of generated 2D molecules on two molecule datasets used as benchmarks in Jo et al. (2022): QM9 (Ramakrishnan et al., 2014) and ZINC250k (Irwin et al., 2012). Following the evaluation setting of Jo et al. (2022), we evaluate the models with four metrics: Validity is the percentage of valid molecules among the generated ones without any post-hoc correction. FCD (Preuer et al., 2018) measures the distance between the generated molecules and the test set in chemical space. NSPDK MMD (Costa & De Grave, 2010) evaluates the quality of the graph structure compared to the test set. Scaffold similarity (Scaf.) evaluates the ability to generate substructures similar to those of the test set. We provide more details in Appendix C.2.

**Baselines** We compare against the following molecular graph generative models: MoFlow (Zang & Wang, 2020), a one-shot flow-based model; GraphAF (Shi et al., 2020) and GraphDF (Luo et al., 2021), autoregressive flow-based models; and EDP-GNN, GDSS, ConGress, and DiGress, the diffusion models described previously. We describe further implementation details in Appendix C.2.

Table 2: Generation results on the 2D molecule datasets. We report the mean of 3 different runs. Best results are highlighted in bold. We provide the results of uniqueness, novelty, and variance in Appendix D.1.

QM9 (|V| ≤ 9):

| Method | Valid (%) | FCD | NSPDK | Scaf. |
| --- | --- | --- | --- | --- |
| Training set | 100.0 | 0.0398 | 0.0001 | 0.9719 |
| MoFlow (Zang & Wang, 2020) | 91.36 | 4.467 | 0.0169 | 0.1447 |
| GraphAF (Shi et al., 2020) | 74.43 | 5.625 | 0.0207 | 0.3046 |
| GraphDF (Luo et al., 2021) | 93.88 | 10.928 | 0.0636 | 0.0978 |
| EDP-GNN (Niu et al., 2020) | 47.52 | 2.680 | 0.0046 | 0.3270 |
| GDSS (Jo et al., 2022) | 95.72 | 2.900 | 0.0033 | 0.6983 |
| DiGress (Vignac et al., 2023) | 98.19 | 0.095 | 0.0003 | 0.9353 |
| GruM (Ours) | 99.69 | 0.108 | 0.0002 | 0.9449 |

ZINC250k (|V| ≤ 38):

| Method | Valid (%) | FCD | NSPDK | Scaf. |
| --- | --- | --- | --- | --- |
| Training set | 100.0 | 0.0615 | 0.0001 | 0.8395 |
| MoFlow (Zang & Wang, 2020) | 63.11 | 20.931 | 0.0455 | 0.0133 |
| GraphAF (Shi et al., 2020) | 68.47 | 16.023 | 0.0442 | 0.0672 |
| GraphDF (Luo et al., 2021) | 90.61 | 33.546 | 0.1770 | 0.0000 |
| EDP-GNN (Niu et al., 2020) | 82.97 | 16.737 | 0.0485 | 0.0000 |
| GDSS (Jo et al., 2022) | 97.01 | 14.656 | 0.0195 | 0.0467 |
| DiGress (Vignac et al., 2023) | 94.99 | 3.482 | 0.0021 | 0.4163 |
| GruM (Ours) | 98.65 | 2.257 | 0.0015 | 0.5299 |

**Results** Table 2 shows that our method achieves the highest validity on both datasets, verifying that GruM can generate valid molecules without correction. Further, ours outperforms the baselines in the FCD and NSPDK metrics, demonstrating that the molecules synthesized by GruM are similar to the molecules of the training set in both chemical and graph-structural aspects. In particular, ours achieves the highest scaffold similarity, indicating that it is able to generate substructures similar to those of the training set. We visualize the generated molecules in Appendix E.1.

### 4.3. 3D Molecule Generation

To show that GruM is able to generate graphs with both continuous and discrete features, we validate it on 3D molecule generation tasks, which involve discrete atom types and continuous coordinates.

**Datasets and Metrics** We evaluate the generated 3D molecules on two standard molecule datasets used as benchmarks in Hoogeboom et al. (2022): QM9 (Ramakrishnan et al., 2014) (up to 29 atoms) and GEOM-DRUGS (Axelrod & Gomez-Bombarelli, 2022) (up to 181 atoms). Following Hoogeboom et al. (2022), both datasets include hydrogen atoms. For GEOM-DRUGS, we select the 30 lowest-energy conformations for each molecule. We evaluate the quality of the generated molecules with two stability metrics: Atom stability is the percentage of atoms with valid valency. Molecule stability is the percentage of generated molecules all of whose atoms are stable. We provide more details in Appendix C.3.

**Baselines** We compare GruM against 3D molecule generative models: G-Schnet (Gebauer et al., 2019), an autoregressive model based on 3D point sets; EN-Flow (Satorras et al., 2021), a flow-based model; GDM and EDM (Hoogeboom et al., 2022), denoising diffusion models; Bridge (Wu et al., 2022), a diffusion model based on the diffusion mixture that learns to approximate the drift; and Bridge+Force (Wu et al., 2022), which adds a physical force to the drift. For ours, we follow the training setting of Hoogeboom et al. (2022) using the same EGNN architecture (Satorras et al., 2021). We describe further implementation details in Appendix C.3.

Figure 3: (Left) Generation results on the 3D molecule datasets. Best results are highlighted in bold; each result is the average of 3 different runs. The baseline results are taken from Hoogeboom et al. (2022) and Wu et al. (2022). (Right) Convergence of the generative process. We compare the convergence of the graph mixture from GruM and the implicit prediction computed from EDM, measuring the convergence (L2 distance) and the molecule stability of the predictions.

QM9 (up to 29 atoms) and GEOM-DRUGS (up to 181 atoms):

| Method | QM9 Atom Stab. (%) | QM9 Mol. Stab. (%) | DRUGS Atom Stab. (%) | DRUGS Mol. Stab. (%) |
| --- | --- | --- | --- | --- |
| G-Schnet (Gebauer et al., 2019) | 95.7 | 68.1 | - | - |
| EN-Flow (Satorras et al., 2021) | 85.0 | 4.9 | 75.0 | 0.0 |
| GDM (Hoogeboom et al., 2022) | 97.0 | 63.2 | 75.0 | 0.0 |
| EDM (Hoogeboom et al., 2022) | 98.7 ± 0.1 | 82.0 ± 0.4 | 81.3 | 0.0 |
| Bridge (Wu et al., 2022) | 98.7 ± 0.1 | 81.8 ± 0.2 | 81.0 ± 0.7 | 0.0 |
| Bridge+Force (Wu et al., 2022) | 98.8 ± 0.1 | 84.6 ± 0.2 | 82.4 ± 0.7 | 0.0 |
| GruM (Ours) | 98.81 ± 0.03 | 87.34 ± 0.19 | 82.96 ± 0.12 | 0.51 ± 0.03 |

**Results** As shown in the table of Figure 3 (Left), our method yields the highest atom stability among all baselines on both datasets. Furthermore, ours achieves higher molecule stability since we directly model the topology by learning the graph mixture. Moreover, GruM outperforms Bridge+Force (Wu et al., 2022) even though GruM does not require a task-dependent prior force and is trained in a simulation-free manner. Notably, our method achieves non-zero molecule stability on the GEOM-DRUGS dataset, which consists of large molecules with up to 181 atoms. We visualize the generated molecules and the generative process of GruM in Appendix E, demonstrating that we can predict the final molecule at an early stage of the process, leading to stable molecules. We further observe that GruM generates 1.5 times more connected molecules than EDM, as shown in Table 8 of the Appendix.
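To make the stability metrics concrete, a minimal sketch follows: an atom counts as stable if the number of bonds it forms matches an allowed valency for its element, and a molecule is stable if all of its atoms are. The valency table below is a truncated illustration that ignores formal charges; the exact rules follow Hoogeboom et al. (2022).

```python
import numpy as np

# Illustrative valency table (subset; the real evaluation follows Hoogeboom et al., 2022).
ALLOWED_VALENCY = {"H": {1}, "C": {4}, "N": {3}, "O": {2}, "F": {1}}

def stability(atom_types, bond_orders):
    """atom_types: list of element symbols; bond_orders: (n, n) matrix of bond orders."""
    valency = bond_orders.sum(axis=1)
    atom_stable = np.array(
        [int(v) in ALLOWED_VALENCY[a] for a, v in zip(atom_types, valency)]
    )
    atom_stability = atom_stable.mean()          # fraction of atoms with valid valency
    mol_stability = float(atom_stable.all())     # 1 if every atom is stable
    return atom_stability, mol_stability

# Methane: carbon bonded to four hydrogens.
atoms = ["C", "H", "H", "H", "H"]
bonds = np.zeros((5, 5))
bonds[0, 1:] = bonds[1:, 0] = 1.0
print(stability(atoms, bonds))   # (1.0, 1.0)
```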
**Stability Analysis** To further investigate the superior performance of our framework in generating stable molecules, we analyze the convergence and stability of the predictions. Figure 3 (Right) shows the convergence of the predicted graph from GruM and of the implicit prediction from EDM computed from the estimated noise. We observe that for GruM, the predicted graphs converge rapidly to the final result. After convergence, the stability of GruM increases further as it has sufficient steps to calibrate the details and produce valid molecules, which is visualized in the generative process of Figure 19 of the Appendix. In contrast, the implicit predictions of EDM converge slowly since EDM does not explicitly learn information about the final result, which leads to lower stability. This analysis shows that learning the final graph is significantly superior in capturing the correct topology compared to previous diffusion models.

### 4.4. Further Analysis

We conduct analyses to investigate the advantages of our framework explained in Section 3.2.

**Exploiting Inductive Bias** To validate that exploiting the inductive bias of the graph data is critical, we compare GruM against a variant without the additional function at the last layer of the model. Figure 2 (Right) shows the complexity of the models $s_\theta$ trained on the Planar dataset, where the transformation at the last layer significantly reduces the model complexity required for predicting the final graph. In particular, the larger complexity gap at the late stage of the diffusion process suggests that exploiting the inductive bias is crucial for learning valid structures and their topology.

**Comparison with Learning Drift** To verify that learning the graph mixture as in our framework is superior to learning the drift, we compare against Bridge (Wu et al., 2022), which models the drift of the mixture process. Table 3 shows that ours outperforms Bridge, especially in molecule stability, since learning the drift is challenging due to its diverging nature and cannot model the topology directly. We further show in Appendix D.2 that learning the drift performs poorly on general graph generation tasks and fails to produce the correct topology.

**Early Stopping for the Generative Process** In Figure 2 (Left) and (Middle), the V.U.N. and MMD results of GruM on the Planar dataset demonstrate that the estimated graph mixture converges to the exact destination at early sampling steps, accurately capturing both the global topology and the local graph characteristics. This allows us to early-stop the diffusion process, which reduces the generation time by up to 20% on this task. The generation results on the SBM and Proteins datasets in Appendix D.2 show a similar tendency.

## 5. Conclusion

In this work, we proposed a new diffusion-based graph generation framework, GruM, that explicitly models the topology of graphs. Unlike existing graph diffusion models that learn to denoise, our framework learns to predict the final graph of the generative process through the graph mixture, thereby accurately capturing valid graph structures and their topological features. Specifically, we construct the generation process as a mixture of diffusion bridges whose drift drives the generation toward the predicted graph, which converges at an early stage, differing from the denoising diffusion process.
We extensively validated our framework on diverse graph generation tasks, including 2D/3D molecular generation, on which ours significantly outperforms previous graph generation methods. A promising future direction would be the generalization to domains other than graphs where the topology of the data is important, such as proteins and manifolds. Impact Statement This paper presents work whose goal is to advance the field of deep generative models, specifically for graph-structured data. We believe that our work can accelerate the discovery of feasible drugs and improve the quality of human lives by recommending drug candidates in silico, which reduces time-consuming wet lab experiments performed by experts. However, one might maliciously use our framework to generate toxic substances or narcotics harmful to humans or the environment. Acknowledgement This work was supported by Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIT) (No.2019-0-00075 Artificial Intelligence Graduate School Program(KAIST)), Google Research Grant and Google Cloud Research Credits program with the award (XKCVN0JU-8K3R-65LK, 40KR-GJX5-XH4X-PU3L), Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.2022-0-00713), and the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. RS-2023-00256259). Graph Generation with Diffusion Mixture Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. ar Xiv:2303.08797, 2023. Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326, 1982. Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing System, 2021. Axelrod, S. and Gomez-Bombarelli, R. Geom, energyannotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1):1 14, 2022. Bemis, G. W. and Murcko, M. A. The properties of known drugs. 1. molecular frameworks. Journal of medicinal chemistry, 39(15):2887 2893, 1996. Bortoli, V. D., Doucet, A., Heng, J., and Thornton, J. Simulating diffusion bridges with score matching. ar Xiv:2111.07243, 2021a. Bortoli, V. D., Thornton, J., Heng, J., and Doucet, A. Diffusion schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, 2021b. Brigo, D. The general mixture-diffusion sde and its relationship with an uncertain-volatility option model with volatility-asset decorrelation. ar Xiv:0812.4052, 2008. Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polozov, O. Generative code modeling with graphs. In International Conference on Learning Representations, 2019. Campbell, A., Benton, J., De Bortoli, V., Rainforth, T., Deligiannidis, G., and Doucet, A. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 2022. Chen, T., Liu, G., and Theodorou, E. A. Likelihood training of schrödinger bridge using forward-backward sdes theory. In International Conference on Learning Representations, 2022. Corlay, S. Properties of the ornstein-uhlenbeck bridge. ar Xiv preprint ar Xiv:1310.5617, 2013. Costa, F. and De Grave, K. Fast neighborhood subgraph pairwise distance kernel. 
In International Conference on Machine Learning, 2010. De Cao, N. and Kipf, T. Molgan: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018. Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. ar Xiv:2105.05233, 2021. Dobson, P. D. and Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771 783, 2003. Doob, J. L. and Doob, J. Classical potential theory and its probabilistic counterpart, volume 549. Springer, 1984. Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. ar Xiv:2012.09699, 2020. Franzese, G., Rossi, S., Yang, L., Finamore, A., Rossi, D., Filippone, M., and Michiardi, P. How much is enough? a study on diffusion times in score-based generative models. Entropy, 25(4):633, 2023. Gebauer, N. W. A., Gastegger, M., and Schütt, K. Symmetryadapted generation of 3d point sets for the targeted discovery of molecules. In Advances in Neural Information Processing Systems, 2019. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020. Ho, J., Salimans, T., Gritsenko, A. A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. ar Xiv:2204.03458, 2022. Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. In Advances in Neural Information Processing Systems, 2021. Hoogeboom, E., Satorras, V. G., Vignac, C., and Welling, M. Equivariant diffusion for molecule generation in 3d. In International Conference on Machine Learning, 2022. Ingraham, J., Garg, V. K., Barzilay, R., and Jaakkola, T. S. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems, 2019. Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52(7):1757 1768, 2012. Jin, W., Barzilay, R., and Jaakkola, T. S. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2018. Graph Generation with Diffusion Mixture Jo, J. and Hwang, S. J. Generative modeling on manifolds through mixture of riemannian diffusion processes. arxiv:2310.07216, 2023. Jo, J., Lee, S., and Hwang, S. J. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, 2022. Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Advances in Neural Information Processing Systems, 2021. Landrum, G. et al. Rdkit: Open-source cheminformatics software, 2016. URL http://www. rdkit. org/, https://github. com/rdkit/rdkit, 2016. Liao, R., Li, Y., Song, Y., Wang, S., Hamilton, W., Duvenaud, D. K., Urtasun, R., and Zemel, R. Efficient graph generation with graph recurrent attention networks. Advances in Neural Information Processing Systems, 2019. Liu, X., Wu, L., Ye, M., and Liu, Q. Let us build bridges: Understanding and extending diffusion generative models. ar Xiv:2208.14699, 2022. Liu, X., Wu, L., Ye, M., and Liu, Q. Learning diffusion bridges on constrained domains. In International Conference on Learning Representations, 2023. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. ar Xiv:1711.05101, 2017. Luo, Y., Yan, K., and Ji, S. 
Graphdf: A discrete flow model for molecular graph generation. International Conference on Machine Learning, 2021. Martinkus, K., Loukas, A., Perraudin, N., and Wattenhofer, R. SPECTRE: spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. In International Conference on Machine Learning, 2022. Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation invariant graph generation via score-based generative modeling. In AISTATS, 2020. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 2019. Peluchetti, S. Non-denoising forward-time diffusions. Openreview, 2021. Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. C. Film: Visual reasoning with a general conditioning layer. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence,, pp. 3942 3951. AAAI Press, 2018. Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G. Fréchet chemnet distance: a metric for generative models for molecules in drug discovery. Journal of chemical information and modeling, 58(9):1736 1741, 2018. Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data, 1(1):1 7, 2014. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Lopes, R. G., Ayan, B. K., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022. Satorras, V. G., Hoogeboom, E., Fuchs, F., Posner, I., and Welling, M. E(n) equivariant normalizing flows. In Advances in Neural Information Processing Systems, 2021. Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2020. Simonovsky, M. and Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. In ICANN, 2018. Song, Y. and Ermon, S. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, 2020. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Särkkä, S. and Solin, A. Applied Stochastic Differential Equations. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2019. Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., and Frossard, P. Digress: Discrete denoising diffusion for graph generation. In International Conference on Learning Representations, 2023. Wu, L., Gong, C., Liu, X., Ye, M., and Liu, Q. Diffusionbased molecule generation with informative prior bridges. In Advances in Neural Information Processing Systems, 2022. Graph Generation with Diffusion Mixture Ye, M., Wu, L., and Liu, Q. First hitting diffusion models. In Advances in Neural Information Processing Systems, 2022. You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. 
Graphrnn: Generating realistic graphs with deep autoregressive models. In International Conference on Machine Learning, 2018. Zang, C. and Wang, F. Moflow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617 626, 2020. Øksendal, B. Stochastic Differential Equations. Universitext. Springer Berlin Heidelberg, 2003. Graph Generation with Diffusion Mixture Organization The Appendix is organized as follows: In Section A, we provide the derivations of the results from the main paper. In Section B, we explain the details of our generative framework including the training objectives, the sampling method, and the model architectures. In Section C, we provide experimental details for the generation tasks and further present additional experimental results in Section D. In Section E, we visualize the generated graphs and molecules, with visualized generative processes. Finally, in Section F, we discuss the limitations of our work. A. Derivations A.1. Diffusion bridge processes Here we derive the Ornstein-Uhlenbeck (OU) bridge process using Doob s h-transform (Doob & Doob, 1984) and show that the Brownian bridge process is a special case of the OU bridge process. We further discuss a general class of bridge processes and explain the advantage of the OU bridge process. Ornstein-Uhlenbeck bridge process First, we consider the simple case when the reference process is given as a standard OU process without a time-dependent diffusion coefficient: ˆQ : d Gt = αGtdt + d Wt, (14) where α is a constant. Then the Doob s h-transform on ˆQ yields the representation of an endpoint-conditioned process ˆQg := ˆQ( |GT = g) defined by the following SDE: ˆQg : d Gt = h αGt + Gtlog ˆp T |t(g|Gt) i dt + d Wt, (15) where ˆp T |t(g|Gt) is the transition probability from time t to T of the standard OU process in Eq. (14). Since the standard OU process has a linear drift, the transition probability is Gaussian, i.e. ˆp T |t(g|Gt) = N(g; µt, Σt), where the mean µt and the covariance Σt satisfies the following ODEs (derived from the results of Eq.(5.50) and Eq.(5.51) of Särkkä & Solin (2019)): dt = αµt , dΣt dt = I + 2αΣt. (16) The ODE with respect to Σt can be modified as: d dte 2αtΣt = e 2αt I, (17) which give the following closed-form solutions: µt = ˆut Gt , Σt = 1 2α ˆu2 t 1 I for ˆut = eα(T t). (18) Therefore, the SDE representation of the standard OU bridge process with fixed endpoint g is given as follows: ˆQg : d Gt = αGt + 2α 1 ˆu 2 t ˆut Gt dt + d Wt. (19) Now we derive the bridge process for the general OU process with a time-dependent diffusion coefficient defined by the following SDE: Q: d Gt = ασ2 t Gtdt + σtd Wt, (20) where σt is a scalar function. Since the time change (Section 8.5. of Øksendal (2003)) with βt = R t 0 σ2 τdτ of ˆQ in Eq. (14) is equivalent to Q of Eq. (20), the transition probability p T |t(g|Gt) of the general OU process satisfies the following: p T |t(g|Gt) = ˆpβT |βt(g|Gt) (21) Graph Generation with Diffusion Mixture Thereby, the OU bridge process conditioned on the endpoint g is defined by the following SDE: Qg : d Gt = ασ2 t Gt + σ2 t vt ut Gt dt + σtd Wt, (22) where the scalar function ut and vt are given as: ut = eα(βT βt) = exp α Z T t σ2 τdτ , vt = 1 2α(1 u 2 t ). (23) Note that the OU bridge process, also known as the constrained OU process, was studied theoretically in previous works (Corlay, 2013; Peluchetti, 2021; Bortoli et al., 2021a). 
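As a quick numerical sanity check of Eqs. (22)–(23), the sketch below simulates the OU bridge with Euler–Maruyama and confirms that the trajectory ends near the prescribed endpoint; it also checks that for small $\alpha$ the coefficient $v_t$ approaches the Brownian-bridge coefficient $\beta_T - \beta_t$ discussed next. The constant noise schedule and step count are arbitrary choices made only for this check.

```python
import numpy as np

T = 1.0
sigma = lambda t: 1.0                      # constant schedule, so beta_t = t

def simulate_ou_bridge(g, alpha, steps=2000, eps=1e-4):
    """Euler-Maruyama simulation of the OU bridge SDE in Eq. (22), pinned at G_T = g."""
    dt = (T - eps) / steps
    G, t = np.zeros_like(g), 0.0
    for _ in range(steps):
        u_t = np.exp(alpha * (T - t))                  # Eq. (23)
        v_t = (1.0 - u_t ** -2) / (2.0 * alpha)
        s2 = sigma(t) ** 2
        drift = alpha * s2 * G + (s2 / v_t) * (g / u_t - G)
        G = G + drift * dt + sigma(t) * np.sqrt(dt) * np.random.randn(*g.shape)
        t += dt
    return G

g = np.array([1.0, -2.0, 0.5])
print(simulate_ou_bridge(g, alpha=1.0))        # close to the endpoint g

# Brownian-bridge limit (Eq. (24)): as alpha -> 0, v_t -> beta_T - beta_t.
t, alpha = 0.3, 1e-6
u_t = np.exp(alpha * (T - t))
v_t = (1.0 - u_t ** -2) / (2.0 * alpha)
print(v_t, T - t)                               # nearly equal
```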
However, we are the first to validate the effectiveness of the OU bridge processes for modeling the generative process through extensive experiments, especially for the generation of graphs in diverse tasks including the generation of general graphs as well as 2D and 3D molecular graphs. Brownian bridge process We show that the Brownian bridge process is a special case of the OU bridge process. When the constant α of the OU bridge process approaches 0, the scalar function ut converges to 1 that leads to the convergence of vt as follows: 2α(1 u 2 t ) = 1 1 e 2α(βT βt) βT βt, which is due to the Taylor expansion of the exponential function. Therefore, the OU bridge process for α 0 is modeled by the following SDE: Qg bb : d Gt = σ2 t βT βt (g Gt) dt + σtd Wt, (24) which is equivalent to the SDE representation of the Brownian bridge process. Compared to the OU bridge process in Eq. (22), the Brownian bridge process has a simpler SDE representation with less flexibility for designing the generative process as the process is solely determined by the noise schedule σt. Note that the Brownian bridge is an endpoint-conditioned process with respect to a reference Brownian Motion defined by the following SDE: d Gt = σtd Wt, (25) which is a diffusion process without drift, and also a special case of the OU process that is used for the reference process of the OU bridge process. More bridge processes Wu et al. (2022) proposes an approach for designing a more general class of diffusion bridges using the Lyapunov function method. Starting from a simple Brownian bridge Qg bb, we can create a new bridge process by adding an extra drift term as follows: Qg bb,f : d Gt = σtft(Gt) | {z } extra drift + σ2 t βT βt (g Gt) dt + σtd Wt, (26) for ft satisfying EG Qg bb,f [ ft(Gt) 2] < . (27) Qg bb,f of Eq. (26) is still a bridge process with endpoint g since the drift of the Brownian bridge (i.e. Eq. (24)) dominates the extra drift term due to the condition of Eq. (27). Moreover, Wu et al. (2022) introduces problem-dependent prior f inspired by physical energy functions. These general bridge processes could be used for our framework to construct a mixture process for modeling the generative process, as described in Section 3.1. If the explicit SDE representation for the general bridges is accessible, the mixture process can be represented by leveraging the diffusion mixture representation, and further the Brownian bridge could be replaced with the OU bridge process. However, in contrast to constructing the generative process as a mixture of the OU bridge processes, using the mixture of the general bridge processes results in difficulty during training; Training a generative model that approximates the mixture of the general bridge processes requires expensive SDE simulation due to the intractable transition probability. We show through extensive experiments that for our approach, the family of OU bridge processes is sufficient to model the complex generation process while the generative model can be trained in a simulation-free manner. Graph Generation with Diffusion Mixture A.2. Diffusion mixture representation In this section, we provide the formal definition of the diffusion mixture representation (Brigo, 2008; Peluchetti, 2021). Consider a collection of diffusion processes {Qλ : λ Λ} defined by the SDEs: Qλ : d Zλ t = ηλ(Zt, t)dt + σλ t d Wλ t , Zλ 0 pλ 0 (28) where Wλ t are independent standard Wiener processes and pλ 0 are the initial distributions. Denote pλ t as the marginal density of the process Qλ. 
A.2. Diffusion mixture representation

In this section, we provide the formal definition of the diffusion mixture representation (Brigo, 2008; Peluchetti, 2021). Consider a collection of diffusion processes {Q^λ : λ ∈ Λ} defined by the SDEs:

Q^λ : dZ^λ_t = η^λ(Z^λ_t, t) dt + σ^λ_t dW^λ_t,  Z^λ_0 ∼ p^λ_0,  (28)

where W^λ_t are independent standard Wiener processes and p^λ_0 are the initial distributions. Denote by p^λ_t the marginal density of the process Q^λ. Further, define the mixture of marginal densities and the mixture of initial distributions with respect to a mixing distribution L on the collection Λ as follows:

p_t(z) = ∫_Λ p^λ_t(z) L(dλ),  p_0(z) = ∫_Λ p^λ_0(z) L(dλ).  (29)

Then there exists a diffusion process that induces the marginal density p_t, and this diffusion process is modeled by the following SDE:

Q^L : dZ_t = η(Z_t, t) dt + σ_t dW_t,  Z_0 ∼ p_0,  (30)

where the drift and diffusion coefficients are given as the weighted means of the corresponding coefficients of Q^λ:

η(z, t) = ∫_Λ η^λ(z, t) ( p^λ_t(z) / p_t(z) ) L(dλ),  σ_t² = ∫_Λ (σ^λ_t)² ( p^λ_t(z) / p_t(z) ) L(dλ).  (31)

Below, we provide a proof of this statement.

Proof. It is enough to show that p_t defined in Eq. (29) is the solution to the Fokker-Planck equation corresponding to Eq. (30), which is given as follows:

∂q_t(z)/∂t = −∇_z · ( q_t(z) η(z, t) − (1/2) σ_t² ∇_z q_t(z) ),  (32)

where q_t denotes the marginal density of Eq. (30). Using the definition of Eq. (29) and the corresponding Fokker-Planck equations with respect to Q^λ for λ ∈ Λ, we derive the following result:

∂/∂t ∫_Λ p^λ_t(z) L(dλ) = ∫_Λ ∂_t p^λ_t(z) L(dλ)
 = ∫_Λ −∇_z · ( η^λ(z, t) p^λ_t(z) − (1/2) (σ^λ_t)² ∇_z p^λ_t(z) ) L(dλ)
 = −∇_z · ∫_Λ ( η^λ(z, t) p^λ_t(z) − (1/2) (σ^λ_t)² ∇_z p^λ_t(z) ) L(dλ)
 = −∇_z · ( p_t(z) ∫_Λ η^λ(z, t) ( p^λ_t(z) / p_t(z) ) L(dλ) − (1/2) ∇_z p_t(z) ∫_Λ (σ^λ_t)² ( p^λ_t(z) / p_t(z) ) L(dλ) )
 = −∇_z · ( p_t(z) η(z, t) − (1/2) σ_t² ∇_z p_t(z) ),  (33)

which proves that p_t is the solution to the Fokker-Planck equation of Eq. (32).

A.3. OU bridge mixture

Now we use the diffusion mixture representation described in Appendix A.2 to derive the OU bridge mixture. Consider a mixture of the collection of OU bridge processes with endpoints in the data distribution, i.e., {Q^g : g ∼ Π}. We mix this collection of processes with the data distribution Π as the mixing distribution, which is represented by the following SDE:

Q^Π : dG_t = [ ∫ ( α σ_t² G_t + ( σ_t² / v_t ) ( g/u_t − G_t ) ) ( p^g_t(G_t) / p_t(G_t) ) Π(dg) ] dt + σ_t dW_t
 = [ α σ_t² G_t + ( σ_t² / v_t ) ( (1/u_t) ∫ g ( p^g_t(G_t) / p_t(G_t) ) Π(dg) − G_t ) ] dt + σ_t dW_t
 = [ α σ_t² G_t + ( σ_t² / v_t ) ( D^Π(G_t, t)/u_t − G_t ) ] dt + σ_t dW_t,  (34)

where p_t(z) = ∫ p^g_t(z) Π(dg) is used for the second equality and the definition of the graph mixture (Eq. (9)) is used for the last equality.
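To build intuition for the graph mixture D^Π(G_t, t) appearing in Eq. (34), here is a toy 1D illustration of ours under simplifying assumptions (point-mass prior at 0, the Brownian-bridge special case of Eq. (24) with σ = 1, and a two-point data distribution); in this setting each bridge marginal is the standard Gaussian marginal of a Brownian bridge, so D^Π reduces to a posterior-weighted average of the endpoints.

```python
import numpy as np

# Toy setting: prior is a point mass at 0, the data distribution Pi puts equal mass on two
# endpoints. For a Brownian bridge from 0 to g on [0, T] with sigma = 1, the marginal at
# time t is N(g * t / T, t * (T - t) / T).
endpoints = np.array([-2.0, 3.0])     # support of Pi (hypothetical toy data)
weights = np.array([0.5, 0.5])        # Pi(dg)
t, T = 0.6, 1.0

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def graph_mixture(z):
    """D^Pi(z, t): posterior-weighted mean of the endpoints given G_t = z (cf. Eq. (9)/(34))."""
    likelihoods = gaussian_pdf(z, endpoints * t / T, np.sqrt(t * (T - t) / T))
    posterior = weights * likelihoods
    posterior = posterior / posterior.sum()       # p^g_t(z) Pi(dg) / p_t(z)
    return float(np.dot(posterior, endpoints))

print(graph_mixture(-1.0))  # close to -2: the trajectory is likely heading to the left endpoint
print(graph_mixture(1.5))   # close to  3: likely heading to the right endpoint
```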
A.4. Reverse-time diffusion process of the OU bridge mixture

Here we derive the reverse-time diffusion process of GruM, i.e., the time reversal of the OU bridge mixture. Since the generative process of GruM transports the prior distribution Γ to the data distribution Π, its time reversal transports Π to Γ. We show that it has a similar SDE representation to Eq. (34). We derive the reverse process of the OU bridge mixture by constructing a mixture of the reverse processes of each OU bridge process. To be precise, for the mixture process Q := ∫ Q^g dΠ, the reverse process of Q, denoted by Q̄, is equal to the mixture process ∫ Q̄^x dΓ, where Q̄^x is the reverse process of the bridge process Q^g with starting point x.

For simplicity of the representation, we first derive the time reversal of general bridge processes, where the reference process is given as

dG^ref_t = μ(G^ref_t, t) dt + σ_t dW_t,  (35)

with marginal density denoted by p_t. In order to obtain the reverse-time diffusion process, we leverage the reverse-time SDE representation (Anderson, 1982; Song et al., 2021) as follows:

dḠ^ref_t = [ −μ(Ḡ^ref_t, T−t) + σ²_{T−t} ∇ log q_t(Ḡ^ref_t) ] dt + σ_{T−t} dW_t,  (36)

where q_t = p_{T−t} is the marginal density of the reversed process {Ḡ^ref_t}_{t∈[0,T]}. The bridge process of Eq. (36) with fixed endpoint x ∼ Γ can then be derived by using Doob's h-transform (Doob & Doob, 1984) as follows:

Q̄^x : dG_t = [ −μ(G_t, T−t) + σ²_{T−t} ∇_{G_t} log q_t(G_t) + σ²_{T−t} ∇_{G_t} log q_{T|t}(x|G_t) ] dt + σ_{T−t} dW_t,  (37)

which is the reverse process of the conditioned process of G^ref with starting point x and endpoint g ∼ Π fixed. Using the fact that q_t = p_{T−t}, we can see that

q_t(y) q_{T|t}(x|y) = q(G_T = x, G_t = y) = ( q(G_T = x, G_t = y) / q_T(x) ) q_T(x) = p_{T−t|0}(y|x) q_T(x),  (38)

and since ∇_{G_t} log q_T(x) = 0 for fixed x, Eq. (37) can be simplified as follows:

Q̄^x : dG_t = [ −μ(G_t, T−t) + σ²_{T−t} ∇_{G_t} log p_{T−t|0}(G_t|x) ] dt + σ_{T−t} dW_t.  (39)

Finally, the mixture of the bridge processes {Q̄^x : x ∼ Γ} can be derived using the diffusion mixture representation as follows:

Q̄ : dG_t = [ −μ(G_t, T−t) + σ²_{T−t} ∫ ∇_{G_t} log p_{T−t|0}(G_t|x) ( q^x_t(G_t) / q_t(G_t) ) Γ(dx) ] dt + σ_{T−t} dW_t,  (40)

where q^x_t is the marginal density of Q̄^x and q_t is the marginal density of the mixture process Q̄, defined as q_t(·) := ∫ q^x_t(·) Γ(dx).

Using the result of Eq. (40), we can now derive the time reversal of the OU bridge mixture by setting μ(z, t) = α σ_t² z. Since the transition distributions of the OU process satisfy the following (we provide the closed-form mean and covariance of the transition distribution in Eq. (56)):

p_{T−t|0}(z|x) = N( z; ū_t x, ū_t² v̄_t I )  for  ū_t := exp( α ∫_0^{T−t} σ_τ² dτ ),  v̄_t := (1/(2α)) ( 1 − ū_t^{−2} ),  (41)

the log gradient of the transition distribution can be computed as follows:

∇_z log p_{T−t|0}(z|x) = −( 1 / (ū_t² v̄_t) ) ( z − ū_t x ).  (42)

Thereby, the reverse-time diffusion process of the OU bridge mixture is given by:

Q̄ : dG_t = [ −α σ²_{T−t} G_t + ( σ²_{T−t} / (ū_t² v̄_t) ) ( ū_t D^Γ(G_t, t) − G_t ) ] dt + σ_{T−t} dW_t,  G_0 ∼ Π,  (43)

where D^Γ(·, t) is the graph mixture of Q̄ defined as follows:

D^Γ(G_t, t) = ∫ x ( q^x_t(G_t) / q_t(G_t) ) Γ(dx).  (44)

Since Q̄ describes the diffusion process from the data distribution to the prior distribution, it can be considered a perturbation process. Further, we can observe that the time reversal of the OU bridge mixture is non-linear with respect to G_t in general, and completely different from the forward process (i.e., perturbation process) of denoising diffusion models such as the VESDE or VPSDE (Song et al., 2021).

Note that the reverse process of the OU bridge mixture perfectly transports the data distribution Π to the arbitrary prior distribution Γ, in the sense that the terminal distribution exactly matches Γ for finite terminal time T. On the other hand, the forward process of denoising diffusion models, for example the VPSDE (Song et al., 2021), does not perfectly transport the data distribution to the prior distribution: the terminal distribution of the forward process is approximately Gaussian but not exactly Gaussian for finite T, although the mismatch is small for sufficiently large T. This is because the forward process requires infinite T in order to decouple the prior distribution Γ from the data distribution Π. In conclusion, the generative process of GruM is different from that of denoising diffusion models, which naturally follows from the fact that the time reversal of the OU bridge mixture is different from the forward processes of denoising diffusion models.

A.5. Derivation of the graph mixture matching objective

We provide the derivation of our graph mixture matching objective, corresponding to Eq. (11).
First, we leverage the Girsanov theorem (Øksendal, 2003) to upper-bound the KL divergence between the data distribution Π and the terminal distribution of P^θ, denoted by p^θ_T:

D_KL( Π ‖ p^θ_T ) ≤ D_KL( Q^Π ‖ P^θ )  (45)
 = D_KL( Q^Π_0 ‖ P^θ_0 ) + E_{Q^Π} [ log ( dQ^Π / dP^θ ) ]  (46)
 = −E_{G∼Q^Π} [ log p^θ_0(G_0) ] + E_{G∼Q^Π} [ (1/2) ∫_0^T ‖ σ_t^{−1} ( η_θ(G_t, t) − η(G_t, t) ) ‖² dt ] + C  (47)
 = −E_{G∼Q^Π} [ log p^θ_0(G_0) ] + E_{G∼Q^Π} [ (1/2) ∫_0^T γ_t² ‖ s_θ(G_t, t) − D^Π(G_t, t) ‖² dt ] + C,  (48)

where γ_t := σ_t / (u_t v_t), p^θ_0 is a predetermined prior distribution that is easy to sample from, for instance a Gaussian distribution, and C is a constant independent of θ. Note that the first inequality is the data processing inequality. The expectation in Eq. (48) corresponds to Eq. (11).

Furthermore, the expectation of Eq. (48) can be written as follows:

E_{G∼Q^Π} [ ∫_0^T γ_t² ‖ s_θ(G_t, t) − D^Π(G_t, t) ‖² dt ]
 = E_{G∼Q^Π} [ ∫_0^T γ_t² ‖ s_θ(G_t, t) − G_T + G_T − D^Π(G_t, t) ‖² dt ]
 = E_{G∼Q^Π} [ ∫_0^T γ_t² ‖ s_θ(G_t, t) − G_T ‖² dt ] + 2E + C_1,  (49)

where E and C_1 are defined as:

E = E_{G∼Q^Π} [ ∫_0^T γ_t² ⟨ s_θ(G_t, t) − G_T, G_T − D^Π(G_t, t) ⟩ dt ],
C_1 = E_{G∼Q^Π} [ ∫_0^T γ_t² ‖ G_T − D^Π(G_t, t) ‖² dt ].  (50)

From the definition of the graph mixture (Eq. (9)), the following identity holds for all t ∈ [0, T]:

E_{G∼Q^Π} [ G_T | G_t ] = D^Π(G_t, t),  (51)

which gives the following result:

E = E_{G∼Q^Π} [ ∫_0^T γ_t² ⟨ s_θ(G_t, t), G_T − D^Π(G_t, t) ⟩ dt ] − E_{G∼Q^Π} [ ∫_0^T γ_t² ⟨ G_T, G_T − D^Π(G_t, t) ⟩ dt ]  (52)
 = 0 − E_{G∼Q^Π} [ ∫_0^T γ_t² ⟨ G_T, G_T − D^Π(G_t, t) ⟩ dt ],  (53)

where the first term vanishes by taking the conditional expectation with respect to G_t and applying Eq. (51), and the remaining term is a constant independent of θ. Therefore, we can conclude that minimizing Eq. (48) is equivalent to minimizing the following loss:

E_{G∼Q^Π} [ ∫_0^T γ_t² ‖ s_θ(G_t, t) − G_T ‖² dt ],  (54)

which corresponds to the graph mixture matching presented in Eq. (11).

A.6. Analytical computation of graph mixture matching

In order to practically use the graph mixture matching objective (Eq. (11)), we provide the analytical form of the distribution p_{t|0,T}(G_t|G_0, G_T). Notice that, by construction, the OU bridge mixture with a fixed starting point G_0 and a fixed endpoint G_T coincides with the reference OU process in Eq. (20) conditioned on the same starting point G_0 and endpoint G_T. Thereby, p_{t|0,T}(G_t|G_0, G_T) equals the corresponding conditional distribution of the reference OU process of Eq. (20), whose marginals we denote by p in what follows. Using Bayes' theorem, we can derive the following:

p(G_t|G_0, G_T) = p(G_t, G_T|G_0) / p(G_T|G_0) = p(G_T|G_t, G_0) p(G_t|G_0) / p(G_T|G_0) = p(G_T|G_t) p(G_t|G_0) / p(G_T|G_0),  (55)

where the last equality is due to the Markov property of the OU process. Note that the transition distributions of the reference OU process are Gaussian with the following mean and covariance:

p_{b|a}(G_b|G_a) = N( G_b; u_{b|a} G_a, u²_{b|a} v_{b|a} I )  for 0 ≤ a < b ≤ T,
where u_{b|a} := exp( α ∫_a^b σ_τ² dτ ),  v_{b|a} := (1/(2α)) ( 1 − u_{b|a}^{−2} ).  (56)

Therefore, the distribution p(G_t|G_0, G_T), being proportional to a product of Gaussian densities, is also Gaussian, where the mean μ*_t and the covariance Σ*_t have the analytical forms:

μ*_t = ( v_{T|t} / (u_{t|0} v_{T|0}) ) G_0 + ( v_{t|0} / (u_{T|t} v_{T|0}) ) G_T,  Σ*_t = ( v_{T|t} v_{t|0} / v_{T|0} ) I.  (57)

The mean and the covariance can be simplified by using the hyperbolic sine function as follows:

μ*_t = ( sinh(φ_T − φ_t) / sinh(φ_T) ) G_0 + ( sinh(φ_t) / sinh(φ_T) ) G_T,
Σ*_t = (1/α) ( sinh(φ_T − φ_t) sinh(φ_t) / sinh(φ_T) ) I,  (58)

where φ_t := α β_t = α ∫_0^t σ_τ² dτ.
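The closed form in Eq. (58) is what makes simulation-free training possible: an intermediate state G_t can be drawn directly from (G_0, G_T) without integrating the SDE. Below is a minimal PyTorch sketch of ours (a constant noise schedule and dense tensors of matching shape are assumptions, not the authors' implementation).

```python
import math
import torch

def sample_bridge_marginal(G0, GT, t, alpha=0.5, sigma=1.0, T=1.0):
    """Sample G_t ~ p_{t|0,T}(. | G0, GT) via the sinh formulas of Eq. (58).
    Assumes a constant noise schedule, so phi_t = alpha * sigma**2 * t."""
    phi_t, phi_T = alpha * sigma**2 * t, alpha * sigma**2 * T
    coef0 = math.sinh(phi_T - phi_t) / math.sinh(phi_T)   # weight on the starting point G0
    coefT = math.sinh(phi_t) / math.sinh(phi_T)           # weight on the endpoint GT
    var = math.sinh(phi_T - phi_t) * math.sinh(phi_t) / (alpha * math.sinh(phi_T))
    return coef0 * G0 + coefT * GT + math.sqrt(var) * torch.randn_like(G0)

# usage: draw an intermediate state between a noise sample and a data sample at t = 0.4
G0, GT = torch.randn(8, 8), torch.ones(8, 8)
Gt = sample_bridge_marginal(G0, GT, t=0.4)
```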
A.7. GruM as a stochastic interpolant

Recently, Albergo et al. (2023) introduced the concept of stochastic interpolants, which unifies the framework of diffusion models from the perspective of continuous-time stochastic processes. From the results of Eq. (58), we can represent the OU bridge mixture as a stochastic interpolant between the distributions Γ and Π as follows:

G_t = ( sinh(φ_T − φ_t) / sinh(φ_T) ) G_0 + ( sinh(φ_t) / sinh(φ_T) ) G_T + ( (1/α) sinh(φ_T − φ_t) sinh(φ_t) / sinh(φ_T) )^{1/2} Z,  (59)

where G_0, G_T, and Z are random variables sampled independently from the distributions Γ, Π, and N(0, I), respectively. Eq. (59) shows that G_t is linear in both the starting point G_0 ∼ Γ and the endpoint G_T ∼ Π. Note that our proposed graph mixture matching is different from the loss introduced in Albergo et al. (2023), as graph mixture matching does not require estimation of the score function. We further derive the score function of our GruM in Section A.9.

A.8. Understanding the informative prior as regularizing the graph mixture

Wu et al. (2022) introduce incorporating prior information into the generative process, for example injecting physical and statistical information. To be specific, given a generative process

dG_t = η(G_t, t) dt + σ_t dW_t,

Wu et al. (2022) modify the drift by adding a prior function f(·, t) as follows:

dG_t = [ σ_t f(G_t, t) + η(G_t, t) ] dt + σ_t dW_t,  with modified drift η_R(G_t, t) := σ_t f(G_t, t) + η(G_t, t),  (60)

where f(·, t) is designed to be a force defined as f(·, t) = −∇E(·), with E(·) a problem-dependent energy function. Although Wu et al. (2022) show that incorporating prior information is beneficial for the generation of stable molecules or realistic 3D point clouds, how this modification leads to better performance was not fully explained. Notably, from the perspective of our framework, we can interpret the incorporation of the prior information as modifying the generative path toward an energy-regularized result. To be precise, given a generative process modeled by the OU bridge mixture as in Eq. (34), adding the prior function f(·, t) to the drift can be written as follows:

η_R(G_t, t) = α σ_t² G_t + ( σ_t² / v_t ) ( ( D^Π(G_t, t) + (u_t v_t / σ_t) f(G_t, t) ) / u_t − G_t ),  (61)

which is equivalent to regularizing the graph mixture with the weighted prior function as follows:

D^Π_R(G_t, t) := D^Π(G_t, t) + ( u_t v_t / σ_t ) f(G_t, t).  (62)

Since the weight of the prior function converges to 0 through the generative process,

u_t v_t / σ_t = ( exp( α ∫_t^T σ_τ² dτ ) − exp( −α ∫_t^T σ_τ² dτ ) ) / (2α σ_t) → 0  as t → T,

we can see that D^Π_R converges to the original graph mixture D^Π, where the rate of convergence is determined by the prior function. By defining f(·, t) = −∇E(·), where E is an energy function, for example the potential energy for 3D molecules or the Riesz energy for 3D point clouds, the regularized graph mixture has the following representation:

D^Π_R(G_t, t) = D^Π(G_t, t) − ( u_t v_t / σ_t ) ∇E(G_t).  (63)

Thereby, D^Π_R follows a path that reduces the energy function E through the generative process. Therefore, the generative process is guided toward the regularized graph mixture, which results in samples that achieve desired physical properties, for instance, stable 3D-structured molecules or point clouds.
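As a small illustration of Eqs. (62)-(63) (a sketch under our own assumptions; the quadratic energy below is a hypothetical placeholder, not from the paper), the regularization simply shifts the predicted graph mixture along the negative energy gradient with a weight u_t v_t / σ_t that decays to zero as t → T.

```python
import torch

def regularized_graph_mixture(D_pred, Gt, energy_fn, t, alpha=0.5, sigma=1.0, T=1.0):
    """D^Pi_R(G_t, t) = D^Pi(G_t, t) - (u_t * v_t / sigma_t) * grad E(G_t)   (Eq. 63).
    D_pred: estimate of the graph mixture; energy_fn: differentiable scalar energy."""
    u_t = torch.exp(torch.tensor(alpha * sigma**2 * (T - t)))
    v_t = (1.0 - u_t**-2) / (2.0 * alpha)
    weight = u_t * v_t / sigma                         # vanishes as t -> T, recovering D^Pi
    Gt = Gt.detach().requires_grad_(True)
    grad_E = torch.autograd.grad(energy_fn(Gt).sum(), Gt)[0]
    return D_pred - weight * grad_E

# usage with a toy quadratic energy (hypothetical): E(G) = 0.5 * ||G||^2
toy_energy = lambda G: 0.5 * (G ** 2).sum(dim=-1)
D_reg = regularized_graph_mixture(torch.zeros(4, 3), torch.randn(4, 3), toy_energy, t=0.2)
```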
A.9. Associated probability flow ODE of GruM

Since we have derived the reverse-time diffusion process of the OU bridge mixture in Section A.4, we can further derive its associated probability flow ODE (Song et al., 2021), i.e., a deterministic process that shares the same marginal density with the OU bridge mixture. First, the OU bridge mixture is modeled by the following SDE:

dG_t = [ α σ_t² G_t + ( σ_t² / v_t ) ( D^Π(G_t, t)/u_t − G_t ) ] dt + σ_t dW_t,

where the scalar functions u_t and v_t and the graph mixture D^Π are defined as:

u_t = exp( α ∫_t^T σ_τ² dτ ),  v_t = (1/(2α)) ( 1 − u_t^{−2} ),  D^Π(G_t, t) = ∫ g ( p^g_t(G_t) / p_t(G_t) ) Π(dg).

Then, using the results of Section A.4, the reverse-time diffusion process of the OU bridge mixture is modeled by the following SDE:

dG_t = [ −α σ²_{T−t} G_t + ( σ²_{T−t} / (ū_t² v̄_t) ) ( ū_t D^Γ(G_t, t) − G_t ) ] dt + σ_{T−t} dW_t,

where the scalar functions ū_t and v̄_t and the reversed graph mixture D^Γ are defined as:

ū_t = exp( α ∫_0^{T−t} σ_τ² dτ ),  v̄_t = (1/(2α)) ( 1 − ū_t^{−2} ),  D^Γ(G_t, t) = ∫ x ( q^x_t(G_t) / q_t(G_t) ) Γ(dx).

From the relation between a diffusion process and its reverse-time diffusion process (for instance, Eq. (35) and Eq. (36)), the score function of the OU bridge mixture can be computed as follows:

∇_{G_t} log p_t(G_t) = ( 1 / v_t ) ( D^Π(G_t, t)/u_t − G_t ) + ( 1 / (ū²_{T−t} v̄_{T−t}) ) ( ū_{T−t} D^Γ(G_t, T−t) − G_t ).  (64)

Therefore, the associated probability flow ODE can be derived as follows:

dG_t/dt = α σ_t² G_t + ( σ_t² / v_t ) ( D^Π(G_t, t)/u_t − G_t ) − (1/2) σ_t² ∇_{G_t} log p_t(G_t)  (65)
 = α σ_t² G_t + ( σ_t² / (2v_t) ) ( D^Π(G_t, t)/u_t − G_t ) − ( σ_t² / (2 ū²_{T−t} v̄_{T−t}) ) ( ū_{T−t} D^Γ(G_t, T−t) − G_t ).  (66)

To practically use the probability flow ODE as a generative model, the graph mixtures D^Π(·, t) and D^Γ(·, t) should be approximated by neural networks s_θ(·, t) and s_φ(·, t), respectively. s_θ can be trained using graph mixture matching (Eq. (54)), and s_φ can be trained in a similar way, where the trajectories are sampled from the reverse-time process of the OU bridge mixture. In particular, from the result of Eq. (64), we can see that learning the score function of the mixture process is not interchangeable with learning the graph mixture, since the score function additionally requires knowledge of the reversed graph mixture D^Γ. Our mixture process thus differs from the denoising diffusion processes, for which learning the score function is equivalent to recovering clean data from its corrupted version (Kingma et al., 2021). The difference originates from the construction of the generative process: denoising diffusion processes are derived by reversing forward noising processes, while our mixture process is built as a mixture of bridge processes without relying on the time-reversal approach. We further discuss the difference between our framework and denoising diffusion models in Section A.10.

A.10. Comparison with Denoising Diffusion Models

Here we explain in detail the difference between our framework and previous denoising diffusion models.

Comparison of the generative processes. The main difference from denoising diffusion models (Ho et al., 2020; Song et al., 2021) lies in the generative process. While denoising diffusion models derive the generative process by reversing the forward noising process, our method constructs the generative process using the mixture of OU bridge processes described in Eq. (3), which does not rely on the time-reversal approach. Due to this difference, our method exhibits two distinct properties. First, the mixture process defines an exact transport from an arbitrary prior distribution to the data distribution by construction. In contrast, denoising diffusion processes are not an exact transport to the data distribution, since the forward noising processes require infinitely long diffusion time in order to guarantee convergence to the prior distribution (Franzese et al., 2023). Furthermore, our framework does not suffer from the restrictions of denoising diffusion models: denoising diffusion models require p_prior to be approximately independent of the data distribution Π, e.g., Gaussian, as the perturbation process decouples p_prior from Π, and this decoupling requires infinitely long diffusion time T.
On the contrary, our framework does not have any constraints on the prior distribution pprior and does not require large T, since the OU bridge mixture can be defined between two arbitrary distributions for any T > 0, where its drift forces the process to the terminal distribution regardless of the initial distribution. Therefore, the OU bridge mixture provides flexibility for our generative framework in choosing the prior distribution and the finite terminal time while maintaining the generative process to be an exact transport from the prior to the data distribution. Comparison of the training objectives We further compare our training objective in Eq. (54) with the training objectives of denoising diffusion models. First, we clarify that learning the graph mixture is not equivalent to learning the score function for the mixture process of Gru M. As derived in Eq. (64) of Section A.9, the score function of the OU bridge mixture additionally requires the knowledge of the reversed graph mixture DΓ, thus learning the score function needs to predict not only the graph mixture but also the reversed graph mixture. In contrast, the training objectives of denoising diffusion models are interchangeable (Kingma et al., 2021), i.e., learning the score function of the denoising diffusion process is equivalent to recovering clean data from its corrupted version. This difference in the training objective originates from the difference in the generative process, which we have discussed in detail in the previous paragraphs. Furthermore, our training objective differs from the objectives of previous works (Saharia et al., 2022) that aim to recover clean data from its corrupted version. While our method learns the graph mixture, i.e. the probable graph represented as the weighted mean of data, Saharia et al. (2022) aims to predict the exact final result which could be problematic as the prediction would be highly inaccurate in early steps which may lead the process in the wrong direction. It should be noted that the goal of Eq. (54) is to estimate the graph mixture, i.e. the weighted mean of data, not to predict the exact graph as in Saharia et al. (2022). This is because Eq. (54) is derived from Eq. (11) which minimizes the difference between our model prediction and the ground truth graph mixture. We emphasize that learning graph mixture (Eq. (11)) is only feasible for the OU bridge mixture and cannot be used for denoising diffusion models due to the difference in the generative processes. In the perspective of the mathematical formulation of the training objective, Eq. (54) differs from the objective of Saharia et al. (2022) in two parts: (1) The computation of the expectation for the squared error loss term is different. The expectation is computed by sampling from the trajectory of the diffusion process, where our Gru M uses the OU bridge mixture while previous works use the denoising diffusion process. These two processes are not the same and therefore result in different objectives. (2) The weight function in the loss is different. The weight function γt of Eq. (11) is different from the weight function used in denoising diffusion models, and γt is derived to guarantee that minimizing Eq. (54) is equivalent to minimizing the KL divergence between the data distribution and the terminal distribution of our approximated process. 
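For reference, the weight γ_t used in Eq. (11)/(54) takes a simple closed form under the OU bridge mixture, γ_t = σ_t / (u_t v_t) (as also used in Algorithm 3 below). A short sketch of ours, assuming a constant noise schedule:

```python
import math

def loss_weight(t, alpha=0.5, sigma=1.0, T=1.0):
    """gamma_t = sigma_t / (u_t * v_t), the weight of the graph mixture matching loss."""
    u_t = math.exp(alpha * sigma**2 * (T - t))
    v_t = (1.0 - u_t**-2) / (2.0 * alpha)
    return sigma / (u_t * v_t)

print([round(loss_weight(t), 3) for t in (0.1, 0.5, 0.9)])
# the weight grows as t -> T, emphasizing accurate predictions near the end of the process
```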
Another line of works on discrete diffusion models (Austin et al., 2021; Hoogeboom et al., 2021; Vignac et al., 2023) aims to predict the probabilities of each state of the final data to be generated. Since these works predict the probabilities, they are limited to data with a finite number of states and cannot be applied to data with continuous features. In contrast, our method directly predicts the weighted mean of the data (i.e., graph mixture) instead of the probabilities, which can be applied to data with continuous features, for example, 3D molecules as well as the discrete data, which we experimentally validate to be effective. It is worth noting that our Gru M is a continuous diffusion model, and thereby our framework can leverage the advanced sampling strategies that reduce the sampling time or improve sample quality (Campbell et al., 2022), whereas the discrete diffusion models are forced to use a simple ancestral sampling strategy. Graph Generation with Diffusion Mixture Table 3: Comparison of graph diffusion models. Explicitly model Simulation-free Arbitrary prior Does not require large Learning object Model graph topology training distribution diffusion time T is bounded prediction EDM (Hoogeboom et al., 2022) Noise GDSS (Jo et al., 2022) Score Bridge (Wu et al., 2022) Drift Gru M (Ours) Data A.11. Additional explanation on relevant works Related works on diffusion bridges Recent works (Peluchetti, 2021; Wu et al., 2022; Ye et al., 2022; Liu et al., 2023) introduce learning the generative process using a mixture of diffusion bridge processes, instead of learning to reverse the noising process as in denoising diffusion models. Peluchetti (2021) introduces a diffusion mixture representation that constructs a generation process as a mixture of the bridge processes. Wu et al. (2022) injects physical information into the process by adding informative prior to the drift, while Ye et al. (2022) and Liu et al. (2023) extend the bridge process to constrained domains. Related works on graph generative models Recently, graph diffusion models (Niu et al., 2020; Jo et al., 2022; Hoogeboom et al., 2022; Vignac et al., 2023) have made large progress on generating general graphs as well as molecular graphs. EDP-GNN (Niu et al., 2020) aims to generate the adjacency matrix by learning the score function of the denoising diffusion process, while GDSS (Jo et al., 2022) proposes a framework for generating the nodes and edges simultaneously by learning the joint score function that captures the node-edge dependency. However, learning the score function is ill-suited for modeling the graph topology as it does not explicitly consider the graph structures, and further could be problematic due to the diverging score function. On the other hand, discrete diffusion model (Vignac et al., 2023) proposes to model the noising process as successive graph edits, preserving the discrete structure during the diffusion process. However, this is not a desirable solution for real-world graph generation tasks since it applies to graphs with categorical node and edge attributes, and cannot be alone applied to graphs with continuous features, such as the 3D coordinates of atoms. We summarize the comparison between closely related graph diffusion models in Table 3. B. Details of Gru M In this section, we provide the details of the training and sampling methods of Gru M, describe the models used in our experiments, and further discuss the hyperparameters of Gru M. B.1. 
Overview We provide the pseudo-code of the training and sampling of our generative framework in Algorithm 3 and 4, respectively. We further provide the implementation details of the training and sampling for each generation task in Section C. B.2. Training objectives Random permutation The general graph datasets, namely Planar and SBM, contain only 200 graphs. Thus to ensure the permutation invariant nature of the graph dataset, we apply random permutation to the graphs of the training set during training. To be specific, for a graph data G = (X, A) in the training set and random permutation matrix P , we use the permuted data (P T X, P T AP ) for training, where P T denotes the transposed matrix. We empirically found that this leads to more stable training. Simplified loss We provide the explicit form of simplified loss explained in Section 3.2, which uses constant loss coefficient c instead of the time-dependent γt as follows: L(θ) = EG QΠ 1 0 c2 sθ(Gt, t) GT 2dt . (67) We empirically found that using this loss is beneficial for the generation of continuous features such as eigenvectors of the graph Laplacian or 3D coordinates. Graph Generation with Diffusion Mixture Algorithm 3 Training of Gru M Input: Model sθ, constant ϵ For each epoch: 1: Sample graph G from the training set 2: N number of nodes of G 3: Sample t [0, T ϵ] and G0 N(0, IN) 4: Sample Gt pt|0,T (Gt|G0, G) Section A.6 5: γt σt/utvt 6: Lθ γ2 t sθ(Gt, t) G 2 Eq. (54) 7: Update θ using Lθ Algorithm 4 Sampling of Gru M Input: Trained model sθ, number of sampling steps K, diffusion step size dt 1: Sample number of nodes N from the training set. 2: G0 N(0, IN) Start from noise 3: t 0 4: for k = 1 to K do 5: η ασ2 t Gt + σ2 t vt 1 ut sθ(Gt, t) Gt 6: w N(0, IN) 7: Gt+dt ηdt + σt dtw Euler-Maruyama 8: t t + dt 9: end for 10: G quantize(Gt) Quantize if necessary 11: Return: Graph G Algorithm 5 PC Sampler for Gru M Input: Trained models sθ and sϕ (described in Section A.9), number of sampling steps K, number of correction steps per prediction M, diffusion step size dt, score-to-noise ratio r Output: Sampled graph G 1: Sample number of nodes N from the training set. 2: G0 N(0, IN) Start from noise 3: t 0 4: for k = 1 to K do 5: η ασ2 t Gt + σ2 t vt 1 ut sθ(Gt, t) Gt 6: w N(0, IN) 7: Gt ηdt + σt dtw Predictor 8: for m = 1 to M do Corrector loop 9: D, D sθ( Gt, t), sϕ( Gt, T t) 10: s Compute_Score(D, D, Gt) Eq.(64) 11: w N(0, IN) 12: ϵ 2 (r w 2/ s 2)2 13: Gt Corrector( Gt, s, ϵ) 14: end for 15: Gt+dt Gt 16: t t + dt 17: end for 18: G quantize(Gt) Quantize if necessary 19: Return: Graph G Attributed graphs Especially for the generation of attributed graphs G = (X, A), the graph mixture matching for X and A can be derived from Eq. (11). Specifically, for the model sθ( , t) = (s X θ ( , t), s A θ ( , t)), we use the following objective: L(θ) = EG QΠ 1 0 γ2 1,t s X θ (Gt, t) XT 2dt + λ 0 γ2 2,t s A θ (Gt, t) AT 2dt (68) where λ is the hyperparameter. We use λ = 5 for all our experiments and empirically observed that changing λ did not make much difference for sufficient training epochs. B.3. Model architecture For the general graph and 2D molecule generation tasks, we leverage the graph transformer network introduced in Dwivedi & Bresson (2020) and Vignac et al. (2023). The node features and the adjacency matrices are updated with multiple attention layers with global features obtained by the self-attention-based Fi LM layers (Perez et al., 2018). We additionally use the higher-order adjacency matrices following Jo et al. (2022). 
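To make Algorithms 3 and 4 above concrete, here is a compact PyTorch-style sketch of one training step and the Euler-Maruyama sampler. This is our own simplified illustration: the model interface (any network mapping (G_t, t) to a tensor of the same shape), the constant noise schedule, and the single dense tensor in place of separate node/edge channels are assumptions, and the closed-form bridge sampling follows Eq. (58).

```python
import math
import torch

ALPHA, SIGMA, T, EPS = 0.5, 1.0, 1.0, 1e-3   # assumed constant noise schedule

def u_v(t):
    u = math.exp(ALPHA * SIGMA**2 * (T - t))
    return u, (1.0 - u**-2) / (2.0 * ALPHA)

def training_step(model, optimizer, G_data):
    """One step of Algorithm 3: sample t and G_t in closed form, regress onto the endpoint."""
    t = float(torch.empty(1).uniform_(0.0, T - EPS))
    G0 = torch.randn_like(G_data)                              # prior sample
    phi_t, phi_T = ALPHA * SIGMA**2 * t, ALPHA * SIGMA**2 * T
    mean = (math.sinh(phi_T - phi_t) * G0 + math.sinh(phi_t) * G_data) / math.sinh(phi_T)
    var = math.sinh(phi_T - phi_t) * math.sinh(phi_t) / (ALPHA * math.sinh(phi_T))
    Gt = mean + math.sqrt(var) * torch.randn_like(G_data)      # G_t ~ p_{t|0,T}   (Eq. 58)
    u_t, v_t = u_v(t)
    gamma = SIGMA / (u_t * v_t)                                # loss weight of Eq. (54)
    loss = gamma**2 * ((model(Gt, t) - G_data) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, shape, n_steps=1000):
    """Algorithm 4: Euler-Maruyama simulation of the generative SDE of Eq. (34)."""
    Gt, dt = torch.randn(shape), (T - EPS) / n_steps
    for k in range(n_steps):
        t = k * dt
        u_t, v_t = u_v(t)
        drift = ALPHA * SIGMA**2 * Gt + (SIGMA**2 / v_t) * (model(Gt, t) / u_t - Gt)
        Gt = Gt + drift * dt + SIGMA * math.sqrt(dt) * torch.randn(shape)
    return Gt        # quantize afterwards if the features are discrete
```

In the actual framework, the node features and adjacency matrices are handled as separate outputs with the weighting of Eq. (68); the sketch collapses them into a single tensor for brevity.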
For general graph generation tasks, we add the sigmoid function to the output of the model since the entries of the weighted mean of the data are supported in the interval [0, 1]. For 2D molecule generation tasks, we apply the softmax function to the output of the node features to model the one-hot encoded atom types, and further apply the sigmoid function to the output of the adjacency matrices. Note that we do not use the structural augmentation as in Vignac et al. (2023). For the 3D molecule generation task, we use EGNN (Satorras et al., 2021) to model the E(3) equivariance of the molecule data. We additionally add a softmax function at the last layer to model the one-hot encoded atom types. B.4. Sampling from Gru M Sampling from the generative model requires solving the SDE of Eq. (10). If our model sθ can closely approximate the graph mixture, a simple Euler-Maruyama method would be enough to simulate the generative model, which is true for most of the experiments. Since sθ is trained on the marginal density pt, Gt outside of pt could cause incorrect predictions Graph Generation with Diffusion Mixture that lead to an undesired result. To address the limitation, we may leverage the predictor-corrector (PC) sampling method introduced in Song et al. (2021). Using the corrector method such as Langevin dynamics (Song et al., 2021), we force Gt to be drawn from pt which ensures a more accurate estimation of the graph mixture. The score function to be used for the corrector can be computed as in Eq. (64) of Section A.9. We provide the pseudo-code of the predictor-only sampler and the PC sampler for our Gru M in Algorithm 4 and 5. Note that our Gru M does not require additional time for sampling compared to the denoising diffusion models, since the generation is equivalent to solving the SDE where the drift is computed each step from the forward pass of the model. B.5. Hyperparameters of Gru M The generative process of Gru M modeled as the OU bridge mixture is uniquely determined with the noise schedule σt and constant α. Through our experiments, we use α = 1/2 and a decreasing linear noise schedule, starting from σ2 0 and ends in σ2 0 defined as follows: σ2 t = (1 t)σ2 0 + tσ2 1 for 0 < σ1 < σ0 < 1 (69) We perform a grid search for the hyperparameters σ0 and σ1 in {0.4, 0.6, 0.8, 1.0} and {0.1, 0.2, 0.3}, respectively, where the search space slightly differs for each generation task. C. Experimental Details C.1. General graph generation Datasets and evaluation metrics We evaluate the quality of generated graphs on three graph datasets used as benchmarks in Martinkus et al. (2022): Planar graph dataset consists of 200 synthetic planar graphs where each graph has 64 nodes. We determine that a graph is a valid Planar graph if it is connected and planar. Stochastic Block Model (SBM) graph dataset consists of 200 synthetic stochastic block model graphs with the number of communities uniformly sampled between 2 and 5, where the number of nodes in each community is uniformly sampled between 20 and 40. The probability of the inter-community edges and the intra-community edges are 0.3 and 0.05, respectively. We determine that a graph is a valid SBM graph if it has the number of communities between 2 and 5, the number of nodes in each community between 20 and 40, and further using the statistical test introduced in Martinkus et al. (2022). Proteins graph dataset (Dobson & Doig, 2003) consists of 918 real protein graphs with up to 500 nodes in each graph. 
The protein graph is constructed by considering each amino acid as a node and connecting two nodes if the corresponding amino acids are less than 6 Angstrom. For our experiments, we use the datasets provided by Martinkus et al. (2022). We follow the evaluation setting of Liao et al. (2019) using total variation (TV) distance for measuring MMD which is considerably fast compared to using the earth mover s distance (EMD) kernel, especially for large graphs. Moreover, we use the V.U.N. metric following Martinkus et al. (2022) that measures the proportion of valid, unique, and novel graphs among the generated graphs, where the validness is defined as satisfying the specific property of the dataset explained above. The baseline results are taken from Vignac et al. (2023) or obtained by running the open-source codes. Implementation details We follow the standard setting of Martinkus et al. (2022) using the same data split and evaluation procedures. We report the baseline results taken from Martinkus et al. (2022) and Vignac et al. (2023), or the results obtained from running the open-source codes using the hyperparameters given by the original work. We could not report the results of EDP-GNN (Niu et al., 2020) and Di Gress (Vignac et al., 2023) on the Proteins dataset as they took more than 2 weeks. For the diffusion models including our proposed method, we set the diffusion steps to 1000 for a fair comparison. For our proposed Gru M, we train our model for 30,000 epochs for all datasets using a constant learning rate with Adam W optimizer (Loshchilov & Hutter, 2017) and weight decay 10 12, applying the exponential moving average (EMA) to the parameters (Song & Ermon, 2020). We perform the hyperparameter search explained in Section B.5 for the lowest MMD and highest V.U.N. results. We initialize the node features with the eigenvectors of the graph Laplacian of the adjacency matrices, which we further scale with constant. During training (Algorithm 3), we sample the noise for the adjacency matrices to be symmetric with zero diagonals. During generation (Algorithm 4), we generate both the node features and adjacency matrices starting from noise, and we quantize the entries of the resulting adjacency matrices. Empirically, we found that the entries of the resulting sample lie very close to the desired values 0 and 1, for which the L1 distance between the resulting sample and the quantized sample is smaller than 10 2. In Figure 2 (Left), we measure the Spec. MMD and V.U.N. of our method and the baselines as follows: For Gru M we evaluate the predicted graph mixture. For Di Gress, we evaluate the prediction obtained from the predicted probability Graph Generation with Diffusion Mixture of each state. For GDSS and Con Gress, we evaluate the implicit prediction computed from the estimated score or noise following Eq. (16) of Hoogeboom et al. (2022). The Spec. MMD and the V.U.N. are measured after quantizing the predicted adjacency matrix with thresholding at 0.5. C.2. 2D molecule generation Datasets and evaluation metrics We evaluate the quality of generated 2D molecules on two molecule datasets used as benchmarks in Jo et al. (2022). QM9 (Ramakrishnan et al., 2014) dataset consists of 133,885 molecules with up to 9 heavy atoms of four types. ZINC250k (Irwin et al., 2012) dataset consists of 249,455 molecules with up to 38 heavy atoms of 9 types. Molecules in both datasets have 3 edge types, namely single bond, double bond, and triple bond. 
For our experiments, we follow the standard procedure (Shi et al., 2020; Luo et al., 2021; Jo et al., 2022) of kekulizing the molecules using the RDKit library (Landrum et al., 2016) and removing the hydrogen atoms from the molecules in the QM9 and ZINC250k datasets. We evaluate the models with four metrics: Validity is the percentage of the valid molecules among the generated without any post-hoc correction such as valency correction or edge resampling. Fréchet Chem Net Distance (FCD) (Preuer et al., 2018) measures the distance between the feature vectors of generated molecules and the test set using Chem Net, evaluating the chemical properties of the molecules. Neighborhood subgraph pairwise distance kernel (NSPDK) MMD (Costa & De Grave, 2010) measures the MMD between the generated molecular graphs and the molecular graphs from the test set, assessing the quality of the graph structure. Scaffold similarity (Scaf.) measures the cosine similarity of the frequencies of Bemis-Murcko scaffolds (Bemis & Murcko, 1996), evaluating the ability to generate similar substructures. See Section C.2 for more details. Among these, FCD and NSPDK MMD metrics measure the distribution similarity between the test dataset and generated samples while scaffold similarity evaluates the similarity of the generated scaffolds. The baseline results are taken from Jo et al. (2022) or obtained by running open-source codes. Implementation details We report the results of the baselines taken from Jo et al. (2022), except for the results of the Scaffold similarity (Scaf.) and the results of Di Gress, which we obtained by running the open-source codes. Especially, the 2D molecule generation results of Di Gress on the QM9 dataset are different compared to the results reported in its original paper, since we have used the preprocessed dataset following the setting of Jo et al. (2022) for a fair comparison with other baselines. We set the diffusion steps to 1000 for all the diffusion models. For our Gru M, we train our model sθ with a constant learning rate with Adam W optimizer and weight decay 10 12, applying the exponential moving average (EMA) to the parameters. We perform the hyperparameter search similar to that of the general graph generation tasks. Especially for Gru M, we follow Jo et al. (2022) by using the adjacency matrices in the form of A {0, 1, 2, 3}N N where N is the maximum number of atoms in a molecule and each entries indicating the bond types: 0 for no bond, 1 for the single bond, 2 for the double bond and 3 for the triple bond. Further, we scale the entries with a constant scale of 3 in order to bound the input of the model in the interval [0, 1], and rescale the final sample of the generation process to recover the bond types. We also sample the noise for the adjacency matrices to be symmetric with zero diagonals during training. We quantize the entries of the resulting adjacency matrices to obtain the discrete bond types {0, 1, 2, 3}. Empirically, we found that the entries of the resulting sample lie very close to the desired bond types {0, 1, 2, 3}, for which the L1 distance between the resulting sample and the quantized sample is approximately 0. C.3. 3D molecule generation Datasets and evaluation metrics We evaluate the quality of generated 3D molecules on two molecule datasets used as benchmarks in Hoogeboom et al. (2022). QM9 (Ramakrishnan et al., 2014) dataset consists of 133,885 molecules with up to 29 atoms of five types including hydrogen atoms. 
The node features of the QM9 dataset include the one-hot representation of atom type and atom number. GEOM-DRUGS (Axelrod & Gomez-Bombarelli, 2022) dataset consists of 430,000 molecules with up to 181 atoms of fifteen types including hydrogen atoms. GEOM-DRUGS dataset contains different conformations for each molecule. Among many conformations, the 30 lowest energy conformations for each molecule are retained. For the GEOM-DRUGS dataset, we utilize the one-hot representation of atom type without the atom number. To evaluate the generated molecules, we measure the atom and molecule stability by predicting the bond type between atoms with the standard bond lengths and then checking the valency. Implementation details We follow the standard setting of Hoogeboom et al. (2022) for a fair comparison: for the QM9 experiment, we use EGNN with 256 hidden features and 9 layers and train the model, and for the GEOM-DRUGS experiment, we use EGNN with 256 hidden features and 4 layers and train the model. We report the results of the baselines taken from Hoogeboom et al. (2022) and Wu et al. (2022). In Figure 3 (Right), we compute the implicit prediction using the Graph Generation with Diffusion Mixture Table 4: 2D molecule generation results on the QM9 dataset. The baseline results are taken from Jo et al. (2022) or obtained by running the open-source codes. Best results are highlighted in bold. Method Valid (%) FCD NSPDK Scaf. Uniq (%) Novelty (%) Mo Flow (Zang & Wang, 2020) 91.36 1.23 4.467 0.595 0.017 0.003 0.1447 0.0521 98.65 0.57 94.72 0.77 Graph AF (Shi et al., 2020) 74.43 2.55 5.625 0.259 0.021 0.003 0.3046 0.0556 88.64 2.37 86.59 1.95 Graph DF (Luo et al., 2021) 93.88 4.76 10.928 0.038 0.064 0.000 0.0978 0.1058 98.58 0.25 98.54 0.48 EDP-GNN (Niu et al., 2020) 47.52 3.60 2.680 0.221 0.005 0.001 0.3270 0.1151 99.25 0.05 86.58 1.85 GDSS (Jo et al., 2022) 95.72 1.94 2.900 0.282 0.003 0.000 0.6983 0.0197 98.46 0.61 86.27 2.29 Di Gress (Vignac et al., 2023) 98.19 0.23 0.095 0.008 0.0003 0.000 0.9353 0.0025 96.67 0.24 25.58 2.36 Gru M (ours) 99.69 0.19 0.108 0.006 0.0002 0.000 0.9449 0.0054 96.90 0.15 24.15 0.80 Table 5: 2D molecule generation results on the ZINC250k dataset. The baseline results are taken from Jo et al. (2022) or obtained by running the open-source codes. Best results are highlighted in bold. Method Valid (%) FCD NSPDK Scaf. Uniq (%) Novelty (%) Mo Flow (Zang & Wang, 2020) 63.11 5.17 20.931 0.184 0.046 0.002 0.0133 0.0052 99.99 0.01 100.00 0.00 Graph AF (Shi et al., 2020) 68.47 0.99 16.023 0.451 0.044 0.005 0.0672 0.0156 98.64 0.69 99.99 0.01 Graph DF (Luo et al., 2021) 90.61 4.30 33.546 0.150 0.177 0.001 0.0000 0.0000 99.63 0.01 100.00 0.00 EDP-GNN (Niu et al., 2020) 82.97 2.73 16.737 1.300 0.049 0.006 0.0000 0.0000 99.79 0.08 100.00 0.00 GDSS (Jo et al., 2022) 97.01 0.77 14.656 0.680 0.019 0.001 0.0467 0.0054 99.64 0.13 100.00 0.00 Di Gress (Vignac et al., 2023) 94.99 2.55 3.482 0.147 0.0021 0.0004 0.4163 0.0533 99.97 0.01 99.99 0.01 Gru M (ours) 98.65 0.25 2.257 0.084 0.0015 0.0003 0.5299 0.0441 99.97 0.03 99.98 0.02 estimated noise following Eq. (16) of Hoogeboom et al. (2022). For our Gru M, we train our model sθ for 1,300 epochs with batch size 256 for the QM9 experiment, and for 13 epochs with batch size 64 for the GEOM-DRUGS experiment. We apply EMA to the parameters of the model with a coefficient of 0.999 and use Adam W optimizer with learning rate 10 4 and weight decay 10 12. 
The 3D coordinates and charge values are scaled as 4 and 0.1, respectively, and we use the simplified loss with a constant c = 100. We perform the hyperparameter search with smaller values for coordinates in {0.1, 0.2, 0.3} and higher values for node features in {0.6, 0.7, 0.8, 0.9, 1.0}. For the generation, we use the Euler-Maruyama predictor. C.4. Computing resources For all experiments, we use NVIDIA Ge Force RTX 3090 and 2080 Ti and implement the source code with Py Torch (Paszke et al., 2019). D. Additional Experimental Results D.1. 2D molecule generaation We provide the standard deviation results in Table 4 and Table 5. We additionally report the following two metrics: Novelty is the proportion of the molecules generated that are valid and not in the training set, and Uniqueness is the proportion of the molecules generated that are valid and unique, where valid molecules are the ones that do not violate the chemical valency rule. D.2. Further analysis 0.00 0.25 0.50 0.75 1.00 Diffusion time Gru M (Ours) w/o Inductive Bias Drift Figure 8: Model complexity comparison of Gru M and Drift. Comparison with learning the drift To verify that learning the graph mixture as in our Gru M is superior compared to learning the drift, we additionally report the generation result of the variant of Gru M which learns the drift, similar to Wu et al. (2022), on the Planar dataset. Table in Figure 4 shows that learning the drift, denoted as Drift in the table, performs poorly generating only 15% valid, novel, and unique graphs. The generated Planar graphs in Figure 4 demonstrate that learning the drift fails to capture the correct topology. Further, to verify why learning the drift fails to capture the correct topology, we compare the complexity of the models for learning different objectives. As shown in Figure 8, the complexity of learning the drift (Drift) is significantly higher than learning the graph Graph Generation with Diffusion Mixture Figure 4: (Left) Generation results on the Planar dataset. Best results are highlighted in bold, where smaller MMD and larger V.U.N. indicate better results. (Right) Generated graphs by learning the drift. Visualized graphs are randomly sampled without curation. Deg. Clus. Orbit Spec. V.U.N. Training set 0.0002 0.0310 0.0005 0.0052 100.0 Graph RNN (You et al., 2018) 0.0049 0.2779 1.2543 0.0459 0.0 SPECTRE (Martinkus et al., 2022) 0.0005 0.0785 0.0012 0.0112 25.0 EDP-GNN (Niu et al., 2020) 0.0044 0.3187 1.4986 0.0813 0.0 GDSS (Jo et al., 2022) 0.0041 0.2676 0.1720 0.0370 0.0 Di Gress (Vignac et al., 2023) 0.0003 0.0372 0.0009 0.0106 75 Drift 0.0008 0.0845 0.0075 0.0126 15 Gru M (Ours) 0.0005 0.0353 0.0009 0.0062 90.0 Figure 5: MMD results of graph mixture of Gru M through the generative process. Dotted lines indicate MMDs of training set. (a) Planar (b) SBM (c) Proteins Sampled Graph Graph Mixture Figure 6: Convergence of sampled graphs and graph mixtures with varying σ0 and σ1 values. mixture (Gru M) for all time steps. Moreover, learning the drift is much harder compared to learning the graph mixture without exploiting the graph structure (w/o Inductive Bias). In particular, the complexity gap dramatically increases at the late stage of the diffusion process, because the drift diverges approaching the terminal time while the graph mixture is supported inside the data space, as discussed in Section 3.2. Early Stopping for Generative Process In Figure 2 (Left) and (Middle), the V.U.N. 
and the MMD results of Gru M in the Planar dataset demonstrate that the estimated graph mixture converges to the final result at early sampling steps, accurately capturing both the global topology and local graph characteristics. This allows us to early-stop the diffusion process, which reduces the generation time by up to 20% on this task. We provide additional MMD results of the generative processes in Figure 5, which show that the estimated graph mixture Graph Generation with Diffusion Mixture Figure 7: The experimental results for the variant of EDM where it aims to predict the final result (EDM-Var.). (Left) Generation results on the 3D molecule QM9 datasets. Best results are highlighted in bold where the higher stability indicates better results. (Right) Convergence of the generative process. We compare the convergence of the graph mixture from Gru M, the implicit prediction computed from the estimated noise of EDM, and the predicted result of EDM-Var. We measure the convergence with L2 distance and further visualize the molecule stability of the predictions through the generative process. QM9 (|V | 29) Method Atom Stab.(%) Mol. Stab.(%) G-Schnet (Gebauer et al., 2019) 95.7 68.1 GDM (Hoogeboom et al., 2022) 97.0 63.2 EDM (Hoogeboom et al., 2022) 98.7 82.0 Bridge (Wu et al., 2022) 98.7 81.8 Bridge+Force (Wu et al., 2022) 98.8 84.6 EDM-Var. 94.02 35.95 Gru M (Ours) 98.81 87.34 0 500 1000 Timesteps L2-distance Mol. Stability (%) Gru M: L2 EDM: L2 EDM-Var: L2 Gru M: Stab. EDM: Stab. EDM-Var: Stab. converges to the final result around 800 diffusion steps for all datasets. Role of the diffusion coefficient We can observe that the generative process of Gru M is uniquely determined by the constant α and the diffusion coefficient σt. These two coefficients control the convergence behavior of the diffusion process: large α and small σt lead to a drift with a large norm that forces the trajectory to converge quickly. Here, we demonstrate the effect of the diffusion coefficient σt on the convergence of the generative process. Figure 6 (Sampled Graph) shows that the smaller values of σt (i.e. 0.2 0.1) lead to faster convergence of the trajectory to the final result, compared to the larger σt. This is due to the fast convergence of each bridge process with small σt. Especially, as shown in Figure 6 (Graph Mixture), large σt for the continuous feature (i.e., 3D coordinates) leads to slower convergence of the graph mixture since it destroys the topology of graphs and makes it hard to predict the final result. Graph prediction through EDM Additionally, we compare our Gru M with the variant of EDM (Hoogeboom et al., 2022) which learns to predict the final result of the denoising process instead of learning the noise. Table of Figure 7 shows the generation result of this variant, denoted as EDM-Var., on the 3D molecule QM9 dataset. EDM-Var. exhibits the lowest atom stability and extremely low molecule stability of less than 40%, which performs significantly worse than Gru M as well as the original EDM. This is because EDM-Var. depends on a single deterministic prediction during the generative process, and the inaccurate prediction of the final result at the early step of the generative process leads the process in the wrong direction resulting in invalid molecules, as discussed in the Introduction and Section 3.1. 
On the other hand, our Gru M predicts the final grpah of the generative process using the graph mixture which represents the probable graph as a weighted mean of the data, thereby guiding the process in the right direction resulting in valid molecules with correct topology. We further provide the convergence results of EDM-Var. in Figure 7, which demonstrates that the prediction of Gru M converges significantly faster than that of EDM and EDM-Var. The inaccurate prediction of EDM-Var. results in slower convergence and low molecule stability. Analysis on the model architecture As shown in Table 6 and 7, GDSS using graph transformer architecture shows improved performance over original GDSS but is still outperformed by our Gru M with a large margin in V.U.N, FCD, and NSPDK. These results verify that the superior performance of Gru M comes from its ability to accurately model the topology of the final graph to be generated. Graph Generation with Diffusion Mixture Table 6: General graph generation results with GDSS using graph transformer. Synthetic, |V | = 64 Synthetic, 44 |V | 187 Deg. Clus. Orbit Spec. V.U.N. Deg. Clus. Orbit Spec. V.U.N. Training set 0.0002 0.0310 0.0005 0.0052 100.0 0.0008 0.0332 0.0255 0.0063 100.0 GDSS 0.0041 0.2676 0.1720 0.0370 0.0 0.0212 0.0646 0.0894 0.0128 5.0 GDSS+Transformer 0.0036 0.1206 0.0525 0.0137 5.0 0.0411 0.0565 0.0706 0.0074 27.5 Con Gress 0.0048 0.2728 1.2950 0.0418 0.0 0.0273 0.1029 0.1148 - 0.0 Di Gress 0.0003 0.0372 0.0009 0.0106 75 0.0013 0.0498 0.0434 0.0400 74 Gru M (Ours) 0.0005 0.0353 0.0009 0.0062 90.0 0.0007 0.0492 0.0448 0.0050 85.0 Table 7: 2D molecule generation results with GDSS using graph transformer. QM9 (|V | 9) ZINC250k (|V | 38) Method Valid (%) FCD NSPDK Scaf. Valid (%) FCD NSPDK Scaf. Training set 100.0 0.0398 0.0001 0.9719 100.0 0.0615 0.0001 0.8395 GDSS 95.72 2.900 0.0033 0.6983 97.01 14.656 0.0195 0.0467 GDSS+Transformer 99.68 0.737 0.0024 0.9129 96.04 5.556 0.0326 0.3205 Di Gress 98.19 0.095 0.0003 0.9353 94.99 3.482 0.0021 0.4163 Gru M (Ours) 99.69 0.108 0.0002 0.9449 98.65 2.257 0.0015 0.5299 E. Visualization In this section, we visualize the generated graphs and molecules of Gru M, and further provide visualization of the diffusion processes for diverse generation tasks. E.1. Generated samples of Gru M General graphs Graphs from the training set and the generated graphs of Gru M are visualized in Figure 9, 10, and 11. The visualized graphs are randomly selected from the training set and the generated graph set. Note that we visualize the entire graph for the Proteins dataset, unlike Martinkus et al. (2022) which visualizes the largest connected component since it fails to consistently generate connected graphs. For Gru M, we found that 92% of the generated Proteins graphs are connected. (a) Training set (b) Gru M (Ours) Figure 9: Visualization of graphs from the Planar dataset and the generated graphs of Gru M. Graph Generation with Diffusion Mixture (a) Training set (b) Gru M (Ours) Figure 10: Visualization of graphs from the SBM dataset and the generated graphs of Gru M. (a) Training set (b) Gru M (Ours) Figure 11: Visualization of graphs from the Proteins dataset and the generated graphs of Gru M. Graph Generation with Diffusion Mixture 2D molecules We provide the visualization of the molecules from the training set and the generated 2D molecules in Figure 12 and 13. These molecules are randomly selected from the training set and the generated molecule set. 
(a) Training set (b) Gru M (Ours) Figure 12: Visualization of molecules from the QM9 dataset and the generated molecules of Gru M for the 2D molecule generation experiment. (a) Training set (b) Gru M (Ours) Figure 13: Visualization of the molecules from the ZINC250k dataset and the generated molecules of Gru M for the 2D molecule generation experiment. Graph Generation with Diffusion Mixture Table 8: Fraction of connected graphs on GEOM-DRUGS experiment. Methods Connected (%) EDM 37.70 0.79 Gru M (Ours) 56.57 0.31 3D molecules We visualize the generated molecules for the 3D molecule generation experiment in Figure 14 and 15. Note that the visualized molecules are all stable. For the GEOM-DRUGS experiment, we observe that a few of the generated molecules are not connected as pointed out in Hoogeboom et al. (2022). To measure how many graphs are connected, we report the fraction of the connected graphs, taking the average of 3 different runs. Table 8 shows that Gru M can generate a significantly larger number of connected molecules compared to EDM (Hoogeboom et al., 2022). (a) Training Set (b) Gru M (Ours) Figure 14: Visualization of the molecules from the QM9 dataset and the generated molecules of Gru M for the 3D molecule generation experiment. (a) Training Set (b) Gru M (Ours) Figure 15: Visualization of the molecules from the GEOM-DRUGS dataset and the generated molecules of Gru M for the 3D molecule generation experiment. Graph Generation with Diffusion Mixture E.2. Generative process of Gru M Here we visualize the generative process of Gru M. We visualize the generative process of general graphs in Figure 16, 17, and 18. We also visualize the generative process of the 3D molecules in Figure 19. We further provide the animation of the generative process in https://github.com/harryjo97/Gru M. t=0 t=0.25 t=0.5 t=0.75 t=1 t=0 t=0.25 t=0.5 t=0.75 t=1 Figure 16: Visualization of the generative process of Gru M. We visualize the graph mixture from Gru M on the Planar dataset. t=0 t=0.25 t=0.5 t=0.75 t=1 t=0 t=0.25 t=0.5 t=0.75 t=1 Figure 17: Visualization of the generative process of Gru M. We visualize the graph mixture from Gru M on the SBM dataset. t=0 t=0.3 t=0.7 t=0.8 t=1 t=0 t=0.3 t=0.7 t=0.8 t=1 Figure 18: Visualization of the generative process of Gru M. We visualize the graph mixture from Gru M on the Proteins dataset. Graph Generation with Diffusion Mixture Figure 19: Visualizations of the 3D molecule generative process of Gru M on QM9 dataset (Top) and GEOM-DRUGS dataset (Bottom). For each dataset, we visualize the trajectory Gt in the first row, and we visualize the estimated graph mixtures from Gru M in the second row. Note that the visualized molecules are stable. The atom types and the 3D coordinates of the atoms inside the green circles are calibrated after the convergence of the graph mixtures, where the convergence is achieved at an early stage. 34 Graph Generation with Diffusion Mixture F. Limitation Limitation We proposed a novel diffusion-based graph generation framework that directly predicts the final graph of the generative process as a weighted mean of data, thereby accurately capturing the valid structures and the topological properties. We have shown that our framework is able to generate graphs with correct topology for diverse graph generation tasks, including 2D/3D molecular generation, on which ours significantly outperforms previous graph generation methods. While Gru M shows superior performance, future work would benefit from improving our framework. 
First, the likelihood of the generative process of GruM cannot be directly computed from the training objective. In order to compute the likelihood, one could derive the associated probability flow ODE of GruM as described in Section A.9, but this requires training an additional model for estimating the reversed graph mixture. Furthermore, the proposed framework focuses on unconditional graph generation tasks. We could design a conditional framework of GruM by training a model s_θ(G_t, t, c) for a given condition (i.e., class label) c to estimate the c-conditional graph mixture defined as follows:

D^Π_c(G_t, t) := ∫ g ( p^g_t(G_t) / p_t(G_t) ) Π_c(dg),  Π_c := { g : g ∈ Π with label c }.  (70)

Intuitively, the generative process of the modified OU bridge mixture, in which the graph mixture is replaced by D^Π_c(G_t, t), is guided by the conditional graph mixture and terminates in the conditioned distribution Π_c. We leave this conditional framework as future work.
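For concreteness, a minimal sketch of how such a conditional variant could be wired (purely illustrative and hedged: the wrapper, the embedding, and the backbone interface are our own assumptions, not part of this work):

```python
import torch
import torch.nn as nn

class ConditionalMixtureModel(nn.Module):
    """Wraps a backbone s_theta(G_t, t, c) estimating the c-conditional graph mixture (Eq. 70)."""

    def __init__(self, backbone, num_classes, embed_dim):
        super().__init__()
        self.backbone = backbone                      # any network taking (G_t, t, cond_emb)
        self.class_embed = nn.Embedding(num_classes, embed_dim)

    def forward(self, Gt, t, c):
        cond = self.class_embed(c)                    # label embedding passed to the backbone
        return self.backbone(Gt, t, cond)

# Sampling would proceed exactly as in Algorithm 4, with the prediction s_theta(G_t, t)
# replaced by s_theta(G_t, t, c), so the drift is driven by the c-conditional graph mixture.
```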