Published in Transactions on Machine Learning Research (04/2025)

Cometh: A continuous-time discrete-state graph diffusion model

Antoine Siraudin (antoine.siraudin@log.rwth-aachen.de), RWTH Aachen; Fragkiskos Malliaros, CentraleSupélec, Inria, Université Paris-Saclay; Christopher Morris, RWTH Aachen

Reviewed on OpenReview: https://openreview.net/forum?id=nuN1mRrrjX

Discrete-state denoising diffusion models have led to state-of-the-art performance in graph generation, especially in the molecular domain. Recently, they have been transposed to continuous time, allowing more flexibility in the reverse process and a better trade-off between sampling efficiency and quality. However, they have not yet been applied to graphs. Here, to leverage the benefits of both approaches for graphs, we propose Cometh, a continuous-time discrete-state graph diffusion model tailored to the specificities of graph data. In addition, we successfully replace the set of structural features previously used in discrete graph diffusion models with a single random-walk-based encoding, providing a simple and principled way to boost the model's expressive power. Empirically, we show that integrating continuous time leads to significant improvements across various metrics over state-of-the-art discrete-state diffusion models on a large set of molecular and non-molecular benchmark datasets. In terms of valid, unique, and novel (VUN) samples, Cometh obtains a near-perfect performance of 99.5% on the planar graph dataset and outperforms DiGress by 12.6% on the large GuacaMol dataset.

1 Introduction

Denoising diffusion models (Ho et al., 2020; Song et al., 2020) are among the most prominent and successful classes of generative models.
Intuitively, these models learn to denoise diffusion trajectories and produce new samples by drawing noise and recursively denoising it, often outperforming competing architectures in tasks such as image and video generation (Sohl-Dickstein et al., 2015; Yang et al., 2023). Recently, a large set of works, e.g., Chen et al. (2023); Jo et al. (2022; 2024); Vignac et al. (2022), aimed to leverage diffusion models for graph generation, e.g., the generation of molecular structures. One class of such models embeds the graphs into a continuous space and adds Gaussian noise to the node features and graph adjacency matrix (Jo et al., 2022). However, such noise destroys the graphs' sparsity, resulting in complete, noisy graphs without meaningful structural information, making it difficult for the denoising network to capture the structural properties of the data. Therefore, discrete-state graph diffusion models such as DiGress (Vignac et al., 2022) have been proposed, exhibiting competitive performance against their continuous-state counterparts. These models utilize a categorical corruption process (Austin et al., 2021), making them more suited to the discrete structure of graph data. In parallel, the above Gaussian-noise-based diffusion models have been extended to continuous time (Song et al., 2020), i.e., relying on a continuous-time stochastic process (Capasso & Bakstein, 2021), by formulating the forward process as a stochastic differential equation. In addition, discrete-state diffusion models have recently been transposed to continuous time (Campbell et al., 2022; Sun et al., 2022), relying on continuous-time Markov chains (CTMCs). Unlike their discrete-time counterparts, which fix a time scale during training, they allow training on a continuous time scale and leave the choice of the time discretization strategy to the sampling stage.
Hence, incorporating continuous time enables a better balance between sampling efficiency and quality while providing greater flexibility in designing the reverse process, as various CTMC simulation tools can be utilized (Campbell et al., 2022; Sun et al., 2022). However, extending continuous-time discrete-state diffusion models to graphs is not straightforward. Unlike other discrete data, such as text, where all tokens share the same support, nodes and edges in graphs have distinct attributes and must be handled separately. Furthermore, the noise models used for other data, particularly the uniform noise model, which yields very dense graphs, may be suboptimal for sparse, real-world graphs (Vignac et al., 2022).

Present work Hence, to leverage the benefits of both approaches for graph generation, i.e., discrete-state and continuous-time, we propose Cometh, a continuous-time discrete-state graph diffusion model, integrating graph data into a continuous-time diffusion model framework; see Figure 1 for an overview of Cometh. Concretely, we

1. propose a new noise model adapted to graph specificities, featuring distinct noising processes for nodes and edges, and extend the marginal transitions previously proposed for graph data to the continuous-time setting;
2. successfully replace the structural features used in most previous discrete graph diffusion models with a random-walk-based encoding, and prove that it generalizes most of the features used in DiGress, representing a simple and elegant way to boost the model's expressivity and reach state-of-the-art performance;
3. empirically show that integrating continuous time into a discrete-state graph diffusion model leads to state-of-the-art results on synthetic and established molecular benchmark datasets across various metrics.
For example, in terms of VUN (valid, unique, and novel) samples, Cometh obtains a near-perfect performance of 99.5% on the planar graph dataset and outperforms DiGress by 12.6% on the large GuacaMol dataset. Overall, Cometh is a state-of-the-art graph diffusion model that combines the benefits of a discrete state space with the flexibility of a continuous time scale in its sampling algorithm design.

1.1 Related work

Diffusion models are a prominent class of generative models successfully applied to many data modalities, such as images, videos, or point clouds (Yang et al., 2023). Graph generation is a well-studied task with various application domains, such as molecule generation, floorplan generation, or abstract syntax tree generation (Shabani et al., 2023; Shi et al., 2019). We can roughly categorize graph generation approaches into auto-regressive models, such as Kong et al. (2023); You et al. (2018); Zhao et al. (2024); Jang et al. (2024b), and one-shot models, such as diffusion models. The main advantage of one-shot models over auto-regressive ones is that they generate the whole graph in a single step and do not require any complex procedure to select a node ordering. On the other hand, auto-regressive models are more flexible regarding the size of the generated graph, which can remain unknown beforehand, and they usually enjoy more favorable complexity since they do not require as many sampling steps as diffusion models. While the first diffusion models for graph generation leveraged continuous-state spaces (Niu et al., 2020), they are now largely replaced by discrete-state diffusion models (Haefeli et al., 2022; Vignac et al., 2022) using a discrete time scale. However, using discrete time constrains the sampling scheme to a particular form called ancestral sampling, which prevents the exploration of alternative sampling strategies that could reduce sampling time or enhance sample quality.
Recently, there has been a surge of interest in continuous-time discrete-state diffusion models (Lou et al., 2023; Gat et al., 2024; Shi et al., 2024), which have demonstrated promising performance on text generation tasks, rivaling GPT-2 models of comparable scale.

Figure 1: Overview of Cometh. During training, the graph transformer learns to predict the clean graph G(0) from a noisy graph G(t) with a cross-entropy (CE) loss. Unlike previous discrete-time diffusion models, Cometh performs transitions at any time t ∈ [0, 1]. During sampling, the clean graph G(0) is first predicted and used to compute the reverse rate R̂_{t,θ}(G, G′) as defined in Equation (2). Next, a τ-leaping step is performed to sample G(t − τ), with the step length fixed to τ. Optionally, multiple corrector steps can be applied at t − τ, which experimentally improves sample quality.

Another line of research considers lifting the graphs into a continuous-state space and applying Gaussian noise to the node and edge feature matrices (Niu et al., 2020; Jo et al., 2022; 2024). Such continuous noise smoothly handles the generation of continuous features, such as atomic coordinates in molecular graphs (Jo et al., 2024). The above Gaussian-noise-based diffusion models have many successful applications in computational biology (Corso et al., 2023; Yim et al., 2023). However, they almost exclusively consider point-cloud generation, focusing on modeling the geometry of the molecules and ignoring structural information. In addition, some hybrid approaches jointly model the 2D molecular graphs and their 3D geometry (Hua et al., 2024; Le et al., 2023; Vignac et al., 2023). These models usually rely on continuous noise for the atomic coordinates and discrete noise for the atom and edge types.
Recent works have tried to scale graph generative models in size (Bergmeister et al., 2023; Luo et al., 2023; Qin et al., 2023). Such frameworks are often built on top of previously proposed approaches, e.g., SparseDiff (Qin et al., 2023) is based on DiGress. Therefore, these scaling methods are likely to apply to our approach. Concurrently with our work, Xu et al. (2024) proposed a continuous-time discrete-state graph diffusion model. While they experiment with an MPNN as the backbone with limited success, we effectively utilize RRWP (Relative Random Walk Probabilities) as a positional encoding. Moreover, our experimental evaluation of Cometh is more comprehensive, as we successfully implement the predictor-corrector mechanism (Diethelm et al., 2004) and include a conditional generation experiment.

2 Background

In the following, we overview the continuous-time discrete-state diffusion framework on which Cometh builds. We provide a complete description of this framework in Appendix B.1, providing intuitions and technical details, and refer to Yang et al. (2023) for a general introduction to diffusion models.

2.1 Continuous-time discrete diffusion

In the discrete diffusion setting, we aim to model a discrete data distribution p_data(z(0)), where z(0) ∈ Z and Z is a finite set with cardinality S := |Z|. A continuous-time diffusion model (Campbell et al., 2022; Sun et al., 2022; Lou et al., 2023) is a stochastic process running from time t = 0 to t = 1. In the following, we denote the marginal distribution of the state z(t) ∈ Z at time t by q_t(z(t)), and q_{t|s}(z(t) | z(s)) denotes the conditional distribution of the state z(t) given the state z(s) ∈ Z at some time s ∈ [0, 1]. We also denote by δ_{z,z′} the Kronecker delta, which equals 1 if z = z′ and 0 otherwise.
We define a forward process that gradually transforms the marginal distribution q_0(z(0)) = p_data(z(0)) into q_1(z(1)), which is close to an easy-to-sample prior distribution p_ref(z(1)), e.g., a uniform categorical distribution. We define the forward process as a continuous-time Markov chain (CTMC). The current state z(t) alternates between resting in the current state and transitioning to another state, where a transition rate matrix R(t) ∈ R^{S×S} controls the dynamics of the CTMC. Formally, the forward process is defined through the infinitesimal transition probability from z(t) to another state z′ ∈ Z, for an infinitesimal time step dt between time t and t + dt,

$$q_{t+dt|t}\big(z' \mid z(t)\big) := \delta_{z(t),\,z'} + R(t)\big(z(t), z'\big)\,dt,$$

where R(t)(z(t), z′) denotes the entry of R(t) that gives the rate from z(t) to z′. Intuitively, the process is more likely to transition to a state z′ for which R(t)(z(t), z′) is high, and R(t) is designed so that q_1(z(1)) closely approximates the prior distribution p_ref. The generative process is formulated as the time reversal of the forward process, therefore interpolating from q_1(z(1)) to p_data(z(0)). The rate of this reverse CTMC, R̂(t), is intractable (Campbell et al., 2022) and has to be modeled by a parametrized approximation, i.e.,

$$\hat{R}_{t,\theta}(z, z') = R(t)(z', z) \sum_{z(0)} \frac{q_{t|0}\big(z' \mid z(0)\big)}{q_{t|0}\big(z \mid z(0)\big)}\, p^{\theta}_{0|t}\big(z(0) \mid z\big), \quad \text{for } z' \neq z, \qquad (2)$$

where p^θ_{0|t}(z(0) | z) is the denoising neural network with parameters θ. For efficient training, the conditional distribution q_{t|0}(z(t) | z(0)) needs to be analytically available; see Appendix B.1 for more details. As demonstrated in Campbell et al. (2022), this property is achieved by choosing R(t) = β(t) R_b, with β(t) ∈ R the noise schedule and R_b ∈ R^{S×S} a constant base rate matrix. We can now generate samples by simulating the reverse process from t = 1 to t = 0. Different algorithms can be employed for this purpose, such as Euler's method (Sun et al., 2022), τ-leaping, and the predictor-corrector scheme (Campbell et al., 2022).
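To make the choice R(t) = β(t) R_b concrete, the following minimal numerical sketch (an illustration under an assumed uniform base rate and β ≡ 1, not the paper's implementation) computes q_{t|0} in closed form: because P = 11⊤/S is idempotent, the matrix exponential reduces to exp(c(P − I)) = e^{−c} I + (1 − e^{−c}) P, with c = ∫₀ᵗ β(s) ds = t here.

```python
import numpy as np

def uniform_transition_matrix(S, t):
    """Closed-form q_{t|0} for R(t) = beta(t) * Rb with the uniform base rate
    Rb = (1/S) 11^T - I and beta ≡ 1.

    Since P = (1/S) 11^T is idempotent, exp(t (P - I)) = e^{-t} I + (1 - e^{-t}) P:
    with probability e^{-t} the state has never jumped; otherwise it is uniform."""
    P = np.full((S, S), 1.0 / S)
    return np.exp(-t) * np.eye(S) + (1.0 - np.exp(-t)) * P

def sample_forward(z0, S, t, rng):
    """Draw z(t) ~ q_{t|0}(. | z(0) = z0) in one shot, without simulating the chain."""
    return int(rng.choice(S, p=uniform_transition_matrix(S, t)[z0]))
```

The Chapman-Kolmogorov identity Q(s) Q(t) = Q(s + t) is what allows training to sample an arbitrary time t directly instead of simulating the chain step by step.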
In particular, the predictor-corrector method consists in applying, after each τ-leaping step using R̂_{t,θ}(z_d(t), z_d), several corrector steps using the rate defined in Campbell et al. (2022), i.e., R̂_{t,c} = R̂_{t,θ} + R(t). This rate admits q_t(z(t)) as its stationary distribution, which means that applying the corrector steps brings the distribution of noisy graphs at time t closer to the marginal distribution of the forward process.

3 A Continuous-Time Discrete-State Graph Diffusion Model

Here, we present our Cometh framework, a continuous-time discrete-state diffusion model for graph generation. Let ⟦m, n⟧ := {m, …, n} ⊂ N. We denote an n-order attributed graph as a tuple G := (G, X, E), where G := (V(G), E(G)) is a graph, X ∈ {0, 1}^{n×a}, for a > 0, is a node feature matrix, and E ∈ {0, 1}^{n×n×b}, for b > 0, is an edge feature tensor. Note that node and edge features are considered to be discrete and are consequently one-hot encoded. For notational convenience, in the following, we denote the graph at time t ∈ [0, 1] by G(t), the node feature of node i ∈ V(G) at time t by x_i(t) ∈ ⟦1, a⟧, and similarly the edge feature of edge (i, j) ∈ E(G) at time t by e_ij(t) ∈ ⟦1, b⟧. In addition, we treat the absence of an edge as a special edge with a unique edge feature. By 1, we denote an all-one vector of suitable size, by I the identity matrix of suitable size, while 0 denotes the all-zero matrix of suitable size. Moreover, by a⊤, we denote the transpose of the vector a.

3.1 Forward process factorization

Considering the full graph state space would result in a prohibitively large number of states, making it impossible to construct a transition matrix. Therefore, we assume that the forward process factorizes and that the noise propagates independently on each node and edge, enabling us to consider node and edge state spaces separately.
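This factorization can be illustrated with a short sketch (assuming, for illustration only, a uniform noise model with β ≡ 1; Cometh's actual noise model uses distinct node/edge rates and marginal transitions): each node label and each upper-triangular edge label is noised independently with the single-variable closed-form transition, and edges are mirrored to keep the graph undirected.

```python
import numpy as np

def noisy_state(z0, n_states, t, rng):
    # Single-variable forward sample under uniform noise:
    # with probability e^{-t} keep z0, otherwise resample uniformly.
    return z0 if rng.random() < np.exp(-t) else int(rng.integers(n_states))

def noise_graph(node_labels, edge_labels, a, b, t, rng):
    """Apply the factorized forward process to an n-order attributed graph.

    node_labels: (n,) ints in [0, a); edge_labels: (n, n) symmetric ints in [0, b),
    where one edge label encodes 'no edge'. Nodes and the i < j edge entries are
    noised independently, per the factorized forward process."""
    n = len(node_labels)
    x_t = np.array([noisy_state(z, a, t, rng) for z in node_labels])
    e_t = edge_labels.copy()
    for i in range(n):
        for j in range(i + 1, n):
            e_t[i, j] = e_t[j, i] = noisy_state(edge_labels[i, j], b, t, rng)
    return x_t, e_t
```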
Formally, let G = (G, X, E) be an n-order attributed graph; then we have

$$q_{t+dt|t}\big(G(t+dt) \mid G(t)\big) := \prod_{i=1}^{n} q_{t+dt|t}\big(x_i(t+dt) \mid x_i(t)\big) \prod_{1 \le i < j \le n} q_{t+dt|t}\big(e_{ij}(t+dt) \mid e_{ij}(t)\big).$$

In the training loss, the edge term is weighted by a factor λ > 1 so that the model prioritizes predicting a correct graph structure over predicting correct node labels.

3.4 Simple and powerful positional encodings with RRWP

In all our experiments, we use the graph transformer proposed by Vignac et al. (2022); see Figure 4 in the appendix. Relying on the fact that discrete diffusion models preserve the sparsity of noisy graphs, they propose a large set of features to compute at each denoising step to boost the expressivity of the model. This set includes Laplacian features, connected-component features, and node- and graph-level cycle counts. Even though this set of features has been successfully used in follow-up works, e.g., Vignac et al. (2023), Qin et al. (2023), Igashov et al. (2023), no theoretical or experimental study exists to investigate the relevance of those particular features. In addition, a rich literature on encodings in graph learning has been developed, e.g., LapPE (Kreuzer et al., 2021), SignNet (Lim et al., 2023), RWSE (Dwivedi et al., 2021), SPE (Huang et al., 2023), which leads us to believe that powerful encodings developed for discriminative models should be transferred to the generative setting. Specifically, in our experiments, we leverage the relative random walk probabilities (RRWP) encoding, introduced in Ma et al. (2023). Denoting by A the adjacency matrix of a graph G, by D the diagonal degree matrix, and by M = D^{-1}A the degree-normalized adjacency matrix, for each pair of nodes (i, j) ∈ V(G)², the RRWP encoding computes

$$P^{K}_{ij} := \big[\, I_{ij},\; M_{ij},\; (M^{2})_{ij},\; \ldots,\; (M^{K-1})_{ij} \,\big], \qquad (4)$$

where K refers to the maximum length of the random walks. The entry P^K_{ii} corresponds to the RRWP encoding of node i; therefore, we leverage these diagonal entries as node encodings. This encoding provides an efficient and elegant solution for boosting model expressivity and performance through a unified encoding.
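Equation (4) is straightforward to compute by iterating powers of the degree-normalized adjacency matrix; a self-contained sketch (illustrative, not the authors' code):

```python
import numpy as np

def rrwp(A, K):
    """RRWP encoding (Ma et al., 2023): stacks I, M, M^2, ..., M^{K-1} with
    M = D^{-1} A the degree-normalized adjacency matrix.

    Returns an (n, n, K) tensor; the diagonal entries P[i, i] serve as node
    encodings. Isolated nodes get a zero row in M (a guard, an assumption here)."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    inv_deg = np.where(deg > 0, 1.0 / np.maximum(deg, 1.0), 0.0)
    M = inv_deg[:, None] * A
    P = np.empty((n, n, K))
    Mk = np.eye(n)                      # M^0 = I
    for k in range(K):
        P[:, :, k] = Mk
        Mk = Mk @ M                     # next random-walk power
    return P
```

For example, on the path graph 1-2-3, the two-step walk from the first endpoint splits evenly between the two endpoints, which the K = 3 encoding captures in its third slice.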
In the following, we show that the RRWP encoding can (approximately) determine if two nodes lie in the same connected component, approximate the size of the largest connected component, and count small cycles.

Theorem 2 (Informal). For n ∈ N, let G_n denote the set of n-order graphs, and for a graph G ∈ G_n let V(G) := ⟦1, n⟧. Then RRWP composed with a universally approximating feed-forward neural network can

1. determine if two vertices are in the same connected component,
2. approximate the size of the largest connected component in G,
3. approximately count the number of p-cycles, for p < 5, in which a node is contained.

See Appendix B.3 for the detailed formal statements. However, we can also show that RRWP encodings cannot detect if a node is contained in a large cycle of a given graph. We say the RRWP encoding counts the number of p-cycles, for p ≥ 2, if there do not exist two graphs, one containing at least one p-cycle while the other does not, whose RRWP encodings are equivalent.

Proposition 3. For p ≥ 8, the RRWP encoding does not count the number of p-cycles.

Hence, for p ≥ 8 and K ≥ 0, there exist a graph G containing a p-cycle C and two vertex pairs (r, s), (v, w) ∈ V(G)² such that (r, s) is contained in C while (v, w) is not, and P^K_{vw} = P^K_{rs}.

3.5 Equivariance properties

Since graphs are invariant to node reordering, it is essential to design methods that capture this fundamental property of the data. Relying on the similarities between Cometh and DiGress, we establish that Cometh is permutation-equivariant and that our loss is permutation-invariant. We also establish that the τ-leaping sampling scheme and the predictor-corrector yield exchangeable distributions, i.e., the model assigns the same probability to every permutation of a graph. Since those results mainly stem from proofs outlined in Vignac et al. (2022), we defer them to Appendix B.4.
3.6 Conditional generation

While unconditional generation is essential to designing an efficient diffusion model, conditioning the generation on some high-level property is critical in numerous real-world applications (Corso et al., 2023; Lee & Min, 2022). Vignac et al. (2022) used classifier guidance, which relies on a trained unconditional model guided by a regressor on the target property. Instead, we rely on the classifier-free guidance procedure (Ho & Salimans, 2021), which consists in jointly optimizing a conditional model p^θ_{0|t}(G(0) | G(t), y) and an unconditional model p^θ_{0|t}(G(0) | G(t)) by randomly dropping the conditioner y during training. We provide further details concerning conditional generation in Appendix B.5.

4 Experiments

In the following, we empirically evaluate Cometh on synthetic and real-world graph generation benchmarks. For all datasets, the results obtained with the raw model are denoted as Cometh, while the results using the predictor-corrector are referred to as Cometh-PC. We sample from Cometh using the τ-leaping algorithm as described in Campbell et al. (2022), a procedure that we detail in Appendix C.1. We also conduct several ablation studies in Appendix E. Our code is available at github.com/asiraudin/Cometh.

4.1 Synthetic graph generation: Planar and SBM

Here, we outline experiments regarding synthetic graph generation.

Datasets and metrics We first validate Cometh on two benchmarks proposed by Martinkus et al. (2022), Planar and SBM. We measure four standard metrics to assess the ability of our model to capture structural properties of the graph distributions, i.e., degree (Degree), clustering coefficient (Cluster), orbit count (Orbit), and the eigenvalues of the graph Laplacian (Spectrum). The reported values are the maximum mean discrepancies (MMD) between those metrics evaluated on the generated graphs and the test set.
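For reference, an MMD between two sets of per-graph statistics can be sketched as follows. This is a generic squared-MMD estimator with a Gaussian RBF kernel, shown for illustration only; the benchmark of Martinkus et al. (2022) uses its own kernel choices per metric.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased squared maximum mean discrepancy between samples X (m, d) and
    Y (n, d) under the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    def gram(A, B):
        # Pairwise squared distances, then the RBF kernel matrix.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
```

In this setting, X and Y would hold, e.g., degree histograms of the generated graphs and of the test graphs; identical distributions give a value near zero.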
We also report the percentage of valid, unique, and novel (VUN) samples among the generated graphs to further assess the ability of our model to capture the properties of the targeted distributions correctly. We provide a detailed description of the metrics in Appendix D.1.

Baselines We evaluate our model against several graph diffusion models, namely DiGress (Vignac et al., 2022) and GruM (Jo et al., 2024), two autoregressive models, GRAN (Liao et al., 2019) and GraphRNN (You et al., 2018), and a GAN, SPECTRE (Martinkus et al., 2022). We re-evaluated the previous state-of-the-art models DiGress and GruM over five runs to provide error bars for the results.

Results See Table 1. On Planar, Cometh yields very good results across all metrics; it is outperformed by a couple of baselines on Degree and Orbit, but those baselines achieve a much lower VUN. The sampling quality benefits from the predictor-corrector scheme, with a near-perfect VUN score. On SBM, we also obtain state-of-the-art results on all metrics. However, we found that, overall, the predictor-corrector did not improve performance on this dataset.

Table 1: Synthetic graph generation results. We report the mean of five runs, as well as 95% confidence intervals. We reproduced the baseline results for DiGress and GruM; the rest is taken from Jo et al. (2024). The best results are highlighted in bold.
| Model | Degree | Cluster | Orbit | Spectrum | VUN [%] |
|---|---|---|---|---|---|
| *Planar graphs* | | | | | |
| GraphRNN | 24.5 | 9.0 | 2508 | 8.8 | 0 |
| GRAN | 3.5 | 1.4 | 1.8 | 1.4 | 0 |
| SPECTRE | 2.5 | 2.5 | 2.4 | 2.1 | 25 |
| DiGress | 0.8 ± 0.0 | 4.1 ± 0.3 | 0.5 ± 0.0 | – | 76.0 ± 0.1 |
| GruM | 2.6 ± 1.7 | 1.3 ± 0.3 | 10.0 ± 7.7 | 1.4 ± 0.2 | 91.0 ± 5.7 |
| DisCo | 1.2 ± 0.5 | 1.3 ± 0.5 | 1.7 ± 0.7 | – | 83.6 ± 2.1 |
| Cometh | 2.1 ± 1.3 | 1.5 ± 0.4 | 3.1 ± 3.0 | 1.3 ± 0.3 | 92.5 ± 3.7 |
| Cometh-PC | 2.0 ± 0.9 | 1.1 ± 0.1 | 7.7 ± 3.8 | 1.3 ± 0.2 | 99.5 ± 0.9 |
| *Stochastic block model* | | | | | |
| GraphRNN | 6.9 | 1.7 | 3.1 | 1.0 | 5 |
| GRAN | 14.1 | 1.7 | 2.1 | 0.9 | 25 |
| SPECTRE | 1.9 | 1.6 | 1.6 | 0.9 | 53 |
| DiGress | 1.7 ± 0.1 | 5.0 ± 0.1 | 3.6 ± 0.4 | – | 74.0 ± 4.0 |
| GruM | 2.6 ± 1.0 | 1.5 ± 0.0 | 1.8 ± 0.4 | 0.9 ± 0.2 | 69.0 ± 8.5 |
| DisCo | 0.8 ± 0.2 | 0.8 ± 0.4 | 2.0 ± 0.5 | – | 66.2 ± 1.4 |
| Cometh | 2.4 ± 1.1 | 1.5 ± 0.0 | 1.7 ± 0.2 | 0.8 ± 0.1 | 77.0 ± 5.3 |
| Cometh-PC | 1.4 ± 0.4 | 1.5 ± 0.0 | 2.0 ± 0.1 | 0.8 ± 0.1 | 75.5 ± 4.0 |

Ablation study We also conducted an ablation study on the positional encoding to compare the benefits of the RRWP encoding against the feature set used in DiGress (see Table 2). While both approaches achieve comparable results across most distribution metrics, RRWP significantly outperforms DiGress's feature set in terms of VUN.

Table 2: Ablation study on the positional encoding for synthetic graphs. We report the mean of five runs, as well as 95% confidence intervals.

| Encoding | Degree | Cluster | Orbit | Spectrum | VUN [%] |
|---|---|---|---|---|---|
| *Planar graphs* | | | | | |
| RRWP | 2.1 ± 1.3 | 1.5 ± 0.3 | 3.1 ± 3.0 | 1.2 ± 0.2 | 92.5 ± 3.67 |
| DiGress features | 2.2 ± 1.1 | 2.2 ± 0.3 | 18.0 ± 7.4 | 1.3 ± 0.2 | 67.5 ± 3.7 |
| *Stochastic block model* | | | | | |
| RRWP | 2.4 ± 1.1 | 1.5 ± 0.0 | 1.7 ± 0.2 | 0.8 ± 0.1 | 77.0 ± 5.26 |
| DiGress features | 2.3 ± 1.2 | 1.5 ± 0.0 | 1.3 ± 0.2 | 0.9 ± 0.2 | 64.5 ± 6.41 |

4.2 Small-molecule generation: QM9

Here, we outline experiments regarding small-molecule generation.

Datasets and metrics To assess the ability of our method to model attributed graph distributions, we evaluate its performance on the standard QM9 dataset (Wu et al., 2018). Molecules are kekulized using the RDKit library, removing hydrogen atoms. We use the same split as Vignac et al. (2022), with 100k molecules for training, 10k for testing, and the remaining data for the validation set.
We want to stress that this split differs from Jo et al. (2022), who use 120k molecules for training and the rest as a test set. We choose the former version of this dataset because it allows for selecting the best checkpoints based on the evaluation of the ELBO on the validation set. We report the Validity over 10k molecules, as evaluated by RDKit sanitization, as well as the Uniqueness, FCD, using the MOSES benchmark, and NSPDK. Appendix D.2 provides a complete description of the metrics.

Table 3: Molecule generation results on QM9. We report the mean of five runs, as well as 95% confidence intervals. The best results are highlighted in bold. Baseline results are taken from Jang et al. (2024a).

| Model | Validity | Uniqueness | Valid & Unique | FCD | NSPDK |
|---|---|---|---|---|---|
| GDSS | 95.72 | 98.46 | 94.25 | 2.9 | 0.003 |
| DiGress | 99.01 | 96.34 | 95.39 | 0.25 | 0.001 |
| GraphARM | 90.25 | 95.26 | 85.97 | 1.22 | 0.002 |
| HGGT | 99.22 | 95.65 | 94.90 | 0.40 | 0.000 |
| Cometh | 99.57 ± 0.07 | 96.76 ± 0.17 | 96.34 ± 0.2 | 0.25 ± 0.01 | 0.000 ± 0.00 |
| Cometh-PC | 99.69 ± 0.03 | 96.66 ± 0.08 | 96.36 ± 0.07 | 0.25 ± 0.01 | 0.000 ± 0.00 |

Table 4: Molecule generation on MOSES. We report the mean of five runs, as well as 95% confidence intervals. The best results are highlighted in bold, and the second-best results are underlined.

| Model | Class | Val. | Val. & Uni. | VUN | Filters | FCD | SNN | Scaf |
|---|---|---|---|---|---|---|---|---|
| VAE | Smiles | 97.7 | 97.5 | 67.8 | 99.7 | 0.57 | 0.58 | 5.9 |
| JT-VAE | Fragment | 100 | 100 | 99.9 | 97.8 | 1.00 | 0.53 | 10 |
| GraphInvent | Autoreg. | 96.4 | 96.2 | – | 95.0 | 1.22 | 0.54 | 12.7 |
| DiGress | One-shot | 85.7 | 85.7 | 81.4 | 97.1 | 1.19 | 0.52 | 14.8 |
| DisCo | One-shot | 88.3 | 88.3 | 86.3 | 95.6 | 1.44 | 0.50 | 15.1 |
| Cometh | One-shot | 87.0 ± 0.2 | 86.9 ± 0.2 | 83.8 ± 0.2 | 97.2 ± 0.1 | 1.44 ± 0.02 | 0.51 ± 0.0 | 15.9 ± 0.8 |
| Cometh-PC | One-shot | 90.5 ± 0.3 | 90.4 ± 0.3 | 83.7 ± 0.2 | 99.1 ± 0.1 | 1.27 ± 0.02 | 0.54 ± 0.0 | 16.0 ± 0.7 |

Baselines We evaluate our model against several recent graph generation models, including two diffusion models, DiGress (Vignac et al., 2022) and GDSS (Jo et al., 2022), and two autoregressive models, HGGT (Jang et al.,
2024a) and GraphARM (Kong et al., 2023).

Results We report results using 500 denoising steps for a fair comparison, as in Vignac et al. (2022). Cometh performs very well on this simple molecular dataset, notably outperforming its discrete-time counterpart DiGress in terms of valid and unique samples, with similar FCD and NSPDK. We experimentally find that the predictor-corrector offers only a slight improvement in the number of valid and unique samples on this dataset, with comparable FCD and NSPDK.

4.3 Molecule generation on large datasets: MOSES and GuacaMol

Here, we outline experiments regarding large-scale molecule generation.

Datasets and benchmarks We also evaluate our model on two much larger molecular datasets, MOSES (Polykovskiy et al., 2020) and GuacaMol (Brown et al., 2019). The former is a refinement of the ZINC database and includes 1.9M molecules, with 1.6M allocated to training. The latter is derived from the ChEMBL database and comprises 1.4M molecules, of which 1.1M are used for training. We use a preprocessing step similar to Vignac et al. (2022) for the GuacaMol dataset, which filters out molecules that cannot be mapped from SMILES to graph and back to SMILES. Both datasets come with their own metrics and baselines, which we briefly describe here. As for QM9, we report Validity, as well as the percentage of Valid & Unique (Val. & Uni.) samples and Valid, Unique, and Novel (VUN) samples for both datasets. We also report Filters, FCD, SNN, and Scaf for MOSES, as well as KL div and FCD for GuacaMol. These metrics are designed to evaluate the model's capability to capture the chemical properties of the learned distributions. We provide a detailed description of those metrics in Appendix D.3.

Results Similarly to previous graph generation models, Cometh does not match the performance of molecule generation methods that incorporate domain-specific knowledge, especially SMILES-based models (see Table 4).
However, Cometh further bridges the gap between graph diffusion models and those methods, outperforming DiGress in terms of validity by a large margin.

Table 5: Molecule generation on GuacaMol. We report the mean of five runs, as well as 95% confidence intervals. Unlike MOSES, the GuacaMol benchmark reports scores, so higher is better. The best results are highlighted in bold, and the second-best results are underlined.

| Model | Class | Val. | Val. & Uni. | VUN | KL div | FCD |
|---|---|---|---|---|---|---|
| LSTM | Smiles | 95.9 | 95.9 | 87.4 | 99.1 | 91.3 |
| NAGVAE | One-shot | 92.9 | 95.5 | 88.7 | 38.4 | 0.9 |
| MCTS | One-shot | 100.0 | 100.0 | 95.4 | 82.2 | 1.5 |
| DiGress | One-shot | 85.2 | 85.2 | 85.1 | 92.9 | 68.0 |
| DisCo | One-shot | 86.6 | 86.6 | 86.5 | 92.6 | 59.7 |
| Cometh | One-shot | 94.4 ± 0.2 | 94.4 ± 0.2 | 93.5 ± 0.3 | 94.1 ± 0.4 | 67.4 ± 0.3 |
| Cometh-PC | One-shot | 98.9 ± 0.1 | 98.9 ± 0.1 | 97.6 ± 0.2 | 96.7 ± 0.2 | 72.7 ± 0.2 |

Table 6: Conditional molecule generation results on QM9. We report the mean of five runs, as well as 95% confidence intervals.

| Model | µ MAE | µ Val | HOMO MAE | HOMO Val | µ & HOMO MAE | µ & HOMO Val |
|---|---|---|---|---|---|---|
| DiGress | 0.81 | – | 0.56 | – | 0.87 | – |
| Cometh | 0.67 ± 0.02 | 88.8 ± 0.5 | 0.32 ± 0.01 | 94.1 ± 0.8 | 0.58 ± 0.01 | 92.5 ± 0.7 |

On GuacaMol (see Table 5), Cometh obtains excellent performance in terms of VUN samples, with an impressive 12.6% improvement over DiGress. The LSTM model still surpasses our graph diffusion model on the FCD metric. This may be because we train on a subset of the original dataset, whereas the LSTM is trained directly on SMILES.

4.4 Conditional generation

We perform conditional generation on QM9 following the setting of Vignac et al. (2022). We target two molecular properties, the dipole moment µ and the highest occupied molecular orbital energy (HOMO). We sample 100 properties from the test set for each experiment and use them as conditioners to generate ten molecules each. We estimate the properties of the sampled molecules using the Psi4 library (Smith et al.,
2020) and report the mean absolute error (MAE) between the estimated properties of the generated set and the targeted properties. We report our results against DiGress in Table 6. Overall, Cometh outperforms DiGress by a large margin, with 18%, 43%, and 33% improvements on µ, HOMO, and both targets, respectively. These performance improvements indicate the superiority of classifier-free guidance over classifier guidance for conditional graph generation.

5 Conclusion

Here, to leverage the benefits of continuous-time and discrete-state diffusion models, we proposed Cometh, a continuous-time discrete-state graph diffusion model, integrating graph data into a continuous-time diffusion model framework. We introduced a new noise model adapted to graph specificities, using different node and edge rates and a tailored marginal distribution and noise schedule. In addition, we successfully replaced the structural features of DiGress with a single encoding with provable expressivity guarantees, removing unnecessary features. Empirically, we showed that integrating continuous time leads to significant improvements across various metrics over state-of-the-art discrete-state diffusion models on a large set of molecular and non-molecular benchmark datasets.

References

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 2021.

Andreas Bergmeister, Karolis Martinkus, Nathanaël Perraudin, and Roger Wattenhofer. Efficient and scalable graph generation through iterative local expansion. arXiv preprint, 2023.

Nathan Brown, Marco Fiscato, Marwin H. S. Segler, and Alain C. Vaucher. GuacaMol: Benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling, 59(3):1096-1108, 2019.
Andrew Campbell, Joe Benton, Valentin De Bortoli, Thomas Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 2022.

Vincenzo Capasso and David Bakstein. Introduction to Continuous-Time Stochastic Processes. Springer, 2021.

Xiaohui Chen, Jiaxing He, Xu Han, and Li-Ping Liu. Efficient and degree-guided graph generation via discrete diffusion modeling. In International Conference on Machine Learning, 2023.

Gabriele Corso, Bowen Jing, Regina Barzilay, Tommi Jaakkola, et al. DiffDock: Diffusion steps, twists, and turns for molecular docking. In International Conference on Learning Representations, 2023.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 5(4):455, 1992.

Kai Diethelm, Neville J. Ford, and Alan D. Freed. Detailed error analysis for a fractional Adams method. Numerical Algorithms, 2004.

Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Graph neural networks with learnable structural and positional representations. In International Conference on Learning Representations, 2021.

Martin Fürer. On the combinatorial power of the Weisfeiler-Lehman algorithm. In International Conference on Algorithms and Complexity, 2017.

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching. Advances in Neural Information Processing Systems, 37:133345–133385, 2024.

Kilian Konstantin Haefeli, Karolis Martinkus, Nathanaël Perraudin, and Roger Wattenhofer. Diffusion models for graphs benefit from discrete state spaces. In The First Learning on Graphs Conference, 2022.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

Jonathan Ho, Ajay Jain, and Pieter Abbeel.
Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Chenqing Hua, Sitao Luan, Minkai Xu, Zhitao Ying, Jie Fu, Stefano Ermon, and Doina Precup. MUDiff: Unified diffusion for complete molecule generation. In Learning on Graphs Conference, 2024.

Yinan Huang, William Lu, Joshua Robinson, Yu Yang, Muhan Zhang, Stefanie Jegelka, and Pan Li. On the stability of expressive positional encodings for graph neural networks. In International Conference on Learning Representations, 2023.

Ilia Igashov, Arne Schneuing, Marwin Segler, Michael M. Bronstein, and Bruno Correia. RetroBridge: Modeling retrosynthesis with Markov bridges. In International Conference on Learning Representations, 2023.

Yunhui Jang, Dongwoo Kim, and Sungsoo Ahn. Graph generation with $K^2$-trees. In The Twelfth International Conference on Learning Representations, 2024a.

Yunhui Jang, Seul Lee, and Sungsoo Ahn. A simple and scalable representation for graph generation. In The Twelfth International Conference on Learning Representations, 2024b.

Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, pp. 10362–10383, 2022.

Jaehyeong Jo, Dongki Kim, and Sung Ju Hwang. Graph generation with diffusion mixture. arXiv preprint arXiv:2302.03596, 2024.

Lingkai Kong, Jiaming Cui, Haotian Sun, Yuchen Zhuang, B. Aditya Prakash, and Chao Zhang. Autoregressive diffusion model for graph generation. In International Conference on Machine Learning, 2023.

Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 2021.

Tuan Le, Julian Cremer, Frank Noe, Djork-Arné Clevert, and Kristof T. Schütt.
Navigating the design space of equivariant diffusion-based generative models for de novo 3D molecule generation. In International Conference on Learning Representations, 2023.

Myeonghun Lee and Kyoungmin Min. MGCVAE: multi-objective inverse design via molecular graph conditional variational autoencoder. Journal of Chemical Information and Modeling, 62(12):2943–2950, 2022.

Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, Will Hamilton, David K. Duvenaud, Raquel Urtasun, and Richard Zemel. Efficient graph generation with graph recurrent attention networks. Advances in Neural Information Processing Systems, 2019.

Derek Lim, Joshua David Robinson, Lingxiao Zhao, Tess E. Smidt, Suvrit Sra, Haggai Maron, and Stefanie Jegelka. Sign and basis invariant networks for spectral graph representation learning. In International Conference on Learning Representations, 2023.

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. In International Conference on Machine Learning, 2023.

Tianze Luo, Zhanfeng Mo, and Sinno Jialin Pan. Fast graph generation via spectral diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Liheng Ma, Chen Lin, Derek Lim, Adriana Romero-Soriano, Puneet K. Dokania, Mark Coates, Philip Torr, and Ser-Nam Lim. Graph inductive biases in transformers without message passing. arXiv preprint, 2023.

Karolis Martinkus, Andreas Loukas, Nathanaël Perraudin, and Roger Wattenhofer. SPECTRE: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. In International Conference on Machine Learning, 2022.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models.
In International Conference on Machine Learning, 2021.

Matteo Ninniri, Marco Podda, and Davide Bacciu. Classifier-free graph diffusion for molecular property targeting. arXiv preprint, 2023.

Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, 2020.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, et al. Molecular Sets (MOSES): a benchmarking platform for molecular generation models. Frontiers in Pharmacology, 11:565644, 2020.

Kristina Preuer, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter, and Günter Klambauer. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736–1741, 2018.

Yiming Qin, Clement Vignac, and Pascal Frossard. Sparse training of discrete diffusion models for graph generation. arXiv preprint, 2023.

Gaurav Rattan and Tim Seppelt. Weisfeiler-Leman and graph spectra. In Symposium on Discrete Algorithms, pp. 2268–2285, 2023.

Mohammad Amin Shabani, Sepidehsadat Hosseini, and Yasutaka Furukawa. HouseDiffusion: Vector floorplan generation via a diffusion model with discrete and continuous denoising. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5466–5475, 2023.

Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. GraphAF: a flow-based autoregressive model for molecular graph generation.
In International Conference on Learning Representations, 2019.

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems, 37:103131–103167, 2024.

Daniel G. A. Smith, Lori A. Burns, Andrew C. Simmonett, Robert M. Parrish, Matthew C. Schieber, Raimondas Galvelis, Peter Kraus, Holger Kruse, Roberto Di Remigio, Asem Alenaizan, et al. Psi4 1.4: Open-source software for high-throughput quantum chemistry. The Journal of Chemical Physics, 152(18), 2020.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020.

Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. arXiv preprint, 2022.

Zhicong Tang, Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Improved vector quantized diffusion models. arXiv preprint, 2022.

Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. DiGress: Discrete denoising diffusion for graph generation. In International Conference on Learning Representations, 2022.

Clement Vignac, Nagham Osman, Laura Toni, and Pascal Frossard. MiDi: Mixed graph and 3D denoising diffusion for molecule generation. In ICLR 2023 Machine Learning for Drug Discovery Workshop, 2023.

Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
Zhe Xu, Ruizhong Qiu, Yuzhong Chen, Huiyuan Chen, Xiran Fan, Menghai Pan, Zhichen Zeng, Mahashweta Das, and Hanghang Tong. Discrete-state continuous-time diffusion for graph generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.

Jason Yim, Brian L. Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. SE(3) diffusion model with application to protein backbone generation. In International Conference on Machine Learning, 2023.

Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, 2018.

Lingxiao Zhao, Xueying Ding, and Leman Akoglu. Pard: Permutation-invariant autoregressive diffusion for graph generation. In Advances in Neural Information Processing Systems, 2024.

We provide proofs and additional theoretical background in Appendix B. We give details on our implementation in Appendix C, experimental details in Appendix D, and additional experimental results in Appendix E. Finally, we provide visualizations of generated samples in Appendix G.

B Theoretical details

Here, we outline the theoretical details of our Cometh architecture.

B.1 Continuous-time discrete diffusion

Here, we provide a more detailed description of the continuous-time, discrete-state diffusion model introduced in Campbell et al. (2022). We begin by recalling our notation. We aim to model a discrete data distribution $p_\text{data}(z(0))$, where $z(0) \in \mathcal{Z}$ and $\mathcal{Z}$ is a finite set with cardinality $S := |\mathcal{Z}|$. In the following, the state is denoted by $z(t) \in \mathcal{Z}$, where time is denoted by $t \in [0, 1]$, and $z(t) \in \{0, 1\}^S$ is its one-hot encoding.
The marginal distribution at time $t$ is denoted by $q_t(z(t))$, and the conditional distribution of the state $z(t)$ given the state $z(s)$ at some time $s \in [0, 1]$ by $q_{t|s}(z(t) \mid z(s))$. We also denote by $\delta_{\bar z, z}$ the Kronecker delta, which equals 1 if $\bar z = z$ and 0 otherwise. A continuous-time, discrete-state diffusion model builds upon continuous-time Markov chains (CTMCs). CTMCs are continuous-time processes in which the state $z(t)$ alternates between remaining in the current state and transitioning to another state. The dynamics of the CTMC are governed by a rate matrix $R(t) \in \mathbb{R}^{S \times S}$, where $S$ represents the number of possible states. We denote by $R(t)(z(t), \bar z)$ the transition rate from the state $z(t)$ to another state $\bar z$. Precisely, a CTMC satisfies three differential equations:

$$\text{(forward)} \quad \partial_t q_{t|s}(\bar z \mid z) = \sum_y q_{t|s}(y \mid z)\, R(t)(y, \bar z),$$
$$\text{(backward)} \quad \partial_s q_{t|s}(z \mid \bar z) = -\sum_y R(s)(\bar z, y)\, q_{t|s}(z \mid y),$$
$$\text{(forward, marginals)} \quad \partial_t q_t(\bar z) = \sum_y q_t(y)\, R(t)(y, \bar z),$$

where $z, \bar z, y \in \mathcal{Z}$ and $t, s \in [0, 1]$. The infinitesimal probability of a transition from $z(t)$ to another state $\bar z \in \mathcal{Z}$, for an infinitesimal time step $\mathrm{d}t$ between time $t$ and $t + \mathrm{d}t$, is given by

$$q_{t+\mathrm{d}t|t}(\bar z \mid z(t)) := \begin{cases} R(t)(z(t), \bar z)\,\mathrm{d}t & \text{if } \bar z \neq z(t), \\ 1 + R(t)(z(t), z(t))\,\mathrm{d}t & \text{if } \bar z = z(t), \end{cases} \;=\; \delta_{\bar z, z(t)} + R(t)(z(t), \bar z)\,\mathrm{d}t.$$

The rate matrix must satisfy the following conditions,

$$R(t)(z(t), \bar z) \geq 0 \text{ for } \bar z \neq z(t), \quad \text{and} \quad R(t)(z(t), z(t)) = -\sum_{\bar z \neq z(t)} R(t)(z(t), \bar z) < 0.$$

The second condition ensures that $q_{t+\mathrm{d}t|t}(\cdot \mid z(t))$ sums to 1. Intuitively, the rate matrix contains instantaneous transition rates, i.e., the number of transitions per unit of time. Therefore, the higher $R(t)(z(t), \bar z)$, the more likely the transition from $z(t)$ to $\bar z$. Since the diagonal coefficients of the rate matrix can be derived from the off-diagonal ones, we will define only the latter in the following. The CTMC is initialized with $q_0(z(0)) = p_\text{data}(z(0))$.
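As a concrete illustration of these conditions, the following sketch (toy values, not the paper's actual rates) builds a small uniform rate matrix, checks that its rows sum to zero, and verifies that the resulting infinitesimal transition kernel is a valid categorical distribution:

```python
import numpy as np

# Toy uniform CTMC rate matrix over S states: R(z, z') = 1/S for z' != z,
# with the diagonal set so that each row sums to zero.
S = 4
R = np.full((S, S), 1.0 / S)
np.fill_diagonal(R, 0.0)
np.fill_diagonal(R, -R.sum(axis=1))  # R(z, z) = -sum_{z' != z} R(z, z')

assert np.allclose(R.sum(axis=1), 0.0)       # rows sum to zero

# Infinitesimal transition kernel: q_{t+dt|t}(z' | z) = delta_{z',z} + R(z, z') dt.
dt = 1e-3
Q_step = np.eye(S) + R * dt
assert np.allclose(Q_step.sum(axis=1), 1.0)  # each row is a distribution
assert (Q_step >= 0).all()                    # holds for small enough dt
```

For larger `dt`, the diagonal entries of `Q_step` could become negative, which is why the kernel is only valid infinitesimally; finite-time transitions use the matrix exponential derived below.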
The key challenge in designing the rate matrix is ensuring that the forward process converges to a well-known categorical distribution that we can later use as a prior distribution during the generative process. In the discrete case, such a distribution can be, e.g., a uniform distribution, a discretized-Gaussian distribution, or an absorbing distribution (Campbell et al., 2022). The reverse process can also be formulated as a CTMC. Using similar notation as for the forward process, the reverse process is defined through

$$q_{t|t+\mathrm{d}t}(z(t) \mid \bar z) = \delta_{z(t), \bar z} + \hat R(t)(\bar z, z(t))\,\mathrm{d}t,$$

where $\hat R(t)$ is the rate matrix of the reverse process. Similar to discrete-time diffusion models, this reverse rate can be expressed as (Campbell et al., 2022, Proposition 1)

$$\hat R(t)(\bar z, z(t)) = R(t)(z(t), \bar z) \sum_{z(0)} \frac{q_{t|0}(\bar z \mid z(0))}{q_{t|0}(z(t) \mid z(0))}\, q_{0|t}(z(0) \mid z(t)), \quad \text{for } \bar z \neq z(t).$$

Since the true reverse process $q_{0|t}(z(0) \mid z(t))$ is intractable, we approximate it using a neural network $p^\theta_{0|t}(z(0) \mid z)$ parameterized by $\theta$, yielding the approximate reverse rate

$$\hat R^{t,\theta}(z, \bar z) = R(t)(\bar z, z) \sum_{z(0)} \frac{q_{t|0}(\bar z \mid z(0))}{q_{t|0}(z \mid z(0))}\, p^\theta_{0|t}(z(0) \mid z), \quad \text{for } \bar z \neq z.$$

Diffusion models are typically optimized by minimizing the negative ELBO on the negative log-likelihood, i.e., $-\log p^\theta_0(z(0))$. Campbell et al. (2022, Proposition 2) provide an expression for the ELBO. Although it is not used in this work, we include it for completeness,

$$L_{CT}(\theta) = T\, \mathbb{E}_{t \sim \mathcal{U}([0,T]),\, q_t(z),\, r(\bar z \mid z)} \Big[ \sum_{z' \neq z} \hat R^{t,\theta}(z, z') - Z(t) \log \hat R^{t,\theta}(\bar z, z) \Big] + C,$$

where $C$ is a constant independent of $\theta$, $Z(t) = \sum_{z' \neq z} R(t)(z, z')$, and $r(\bar z \mid z) = (1 - \delta_{\bar z, z})\, R(t)(z, \bar z) / Z(t)$. For efficient optimization, it is essential to express $q_{t|0}(z \mid z(0))$, and hence $q_t(z) = \sum_{z(0)} q_{t|0}(z \mid z(0))\, q_0(z(0))$, in closed form. In this context, the rate matrix $R(t)$ must be designed so that $q_{t|0}(z \mid z(0))$ has a closed-form expression. Campbell et al.
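To make the approximate reverse rate concrete, here is a minimal numerical sketch: a toy forward rate, toy conditionals $q_{t|0}$, and a random stand-in for the denoiser output $p^\theta_{0|t}$ are combined to evaluate $\hat R^{t,\theta}(z, \bar z)$ for every candidate target state. All quantities are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4

# Toy forward rate at time t (uniform off-diagonal rates; diagonal unused here).
R = np.full((S, S), 0.25)
np.fill_diagonal(R, 0.0)

# Toy conditionals q_{t|0}(. | z0): row z0 holds the distribution over noisy states.
q_t0 = rng.dirichlet(np.ones(S), size=S)

# Stand-in for the denoiser output p_theta(z0 | z_t) at the current state z_t.
z_t = 2
p_denoise = rng.dirichlet(np.ones(S))

# hat{R}(z_t, z_bar) = R(z_bar, z_t) * sum_z0 [q(z_bar|z0)/q(z_t|z0)] p(z0|z_t).
R_hat = np.zeros(S)
for z_bar in range(S):
    if z_bar == z_t:
        continue  # the reverse rate is defined only for z_bar != z_t
    R_hat[z_bar] = R[z_bar, z_t] * np.sum(p_denoise * q_t0[:, z_bar] / q_t0[:, z_t])

assert (R_hat >= 0).all() and R_hat[z_t] == 0.0
```

The sum over clean states is what makes the reverse process depend on the denoising network: the forward-process ratios are reweighted by the network's belief about $z(0)$.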
(2022) established that, when $R(t)$ and $R(t')$ commute for any $t$ and $t'$, the transition probability matrix can be written as

$$\bar Q(t) := q_{t|0}(z(t) \mid z(0)) = \exp\Big( \int_0^t R(s)\, \mathrm{d}s \Big),$$

where $(\bar Q(t))_{ij} = q(z(t) = i \mid z(0) = j)$. This condition is met when the rate is written as $R(t) = \beta(t) R_b$, where $\beta$ is a time-dependent scalar and $R_b$ is a constant base rate matrix. In that case, the forward process can be further refined as

$$(\bar Q(t))_{kl} = q_{t|0}(z(t) = k \mid z(0) = l) = \Big( P \exp\Big( \Lambda \int_0^t \beta(s)\, \mathrm{d}s \Big) P^{-1} \Big)_{kl},$$

where $R_b = P \Lambda P^{-1}$ and $\exp$ refers to the element-wise exponential. Given a one-hot encoded data sample $z(0)$, we can sample a noisy state $z(t)$ from a categorical distribution with probability vector $z(0)^\top \bar Q(t)$. Since most real-world data is multi-dimensional, the above framework needs to be extended to $D$ dimensions. This is done by assuming that each dimension is noised independently, so that the forward process factorizes as

$$q_{t+\mathrm{d}t|t}(\bar z \mid z) = \prod_{d=1}^{D} q_{t+\mathrm{d}t|t}(\bar z_d \mid z_d),$$

where $q_{t+\mathrm{d}t|t}(\bar z_d \mid z_d)$ is the uni-dimensional forward process in the $d$-th dimension. Campbell et al. (2022, Proposition 3) establish how to write the forward and reverse rates in the $D$-dimensional case,

$$R(t)(\bar z, z) = \sum_{d=1}^{D} \delta_{\bar z^{\setminus d}, z^{\setminus d}}\, R(t)_d(\bar z_d, z_d),$$
$$\hat R^{t,\theta}(z, \bar z) = \sum_{d=1}^{D} \delta_{\bar z^{\setminus d}, z^{\setminus d}}\, R(t)_d(\bar z_d, z_d) \sum_{z(0)_d} \frac{q_{t|0}(\bar z_d \mid z(0)_d)}{q_{t|0}(z_d \mid z(0)_d)}\, p^\theta_{0|t}(z(0)_d \mid z), \quad \text{for } \bar z_d \neq z_d.$$

In brief, assuming that a transition cannot occur in two different dimensions simultaneously, the multi-dimensional rates are equal to the uni-dimensional rates in the dimension of transition. Importantly, while the dimensions are independent in the forward process, they are not in the reverse process, since the whole state is given as input to $p^\theta_{0|t}(z(0)_d \mid z)$. Finally, we need a practical way to simulate the reverse process over finite time intervals for $D$-dimensional data. To that end, we follow Campbell et al. (2022) and use the $\tau$-leaping algorithm. The first step is to sample $z(1)$ from the prior $p_\text{ref}(z(1))$.
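The eigendecomposition route and the dimension-wise factorization can be sketched numerically as follows. The base rate matrix and marginal are toy values chosen for illustration (a rank-one construction so that the eigendecomposition is well behaved); the denoiser plays no role in the forward direction:

```python
import numpy as np

rng = np.random.default_rng(0)
S, D = 4, 6                                # states per dimension, number of dimensions
m = np.array([0.4, 0.3, 0.2, 0.1])         # toy target marginal
Rb = np.outer(np.ones(S), m) - np.eye(S)   # toy base rate matrix (rank-one minus identity)
beta_bar = 1.2                             # toy value of \int_0^t beta(s) ds

# Q(t) via the eigendecomposition route: Rb = P Lambda P^{-1}.
lam, P = np.linalg.eig(Rb)
Q = (P @ np.diag(np.exp(beta_bar * lam)) @ np.linalg.inv(P)).real
assert np.allclose(Q.sum(axis=1), 1.0)     # each row is a distribution

# The D-dimensional forward process factorizes: noise each dimension independently.
z0 = rng.integers(0, S, size=D)            # clean multi-dimensional state
zt = np.array([rng.choice(S, p=Q[z0_d]) for z0_d in z0])
assert zt.shape == (D,) and ((0 <= zt) & (zt < S)).all()
```

Because the per-dimension kernels share the same $\bar Q(t)$, noising a whole graph reduces to one categorical draw per node and per edge.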
The sampling procedure is as follows. At each iteration, we keep $z(t)$ and $\hat R^{t,\theta}(z, \bar z)$ constant and simulate the reverse process over a time interval of length $\tau$. That is, we count all the transitions between $t$ and $t - \tau$ and apply them simultaneously. The number of transitions in each dimension $z(t)_d$ of the current state $z(t)$, between $z(t)_d$ and $\bar z_d$, is Poisson distributed with mean $\tau \hat R^{t,\theta}(z(t)_d, \bar z_d)$. In a state space with no ordinal structure, multiple transitions in one dimension are meaningless, and we reject them. In addition, we experiment with the predictor-corrector scheme. After each predictor step using $\hat R^{t,\theta}(z(t)_d, \bar z_d)$, we can also apply several corrector steps using the rate defined in Campbell et al. (2022), i.e., $\hat R^{t,c} = \hat R^{t,\theta} + R(t)$. The transitions using the corrector rate are counted in the same way as for the predictor. This rate admits $q_t(z(t))$ as its stationary distribution, which means that applying corrector steps brings the distribution of noisy graphs at time $t$ closer to the marginal distribution of the forward process.

B.2 Noise schedule

Here, we provide a proof of Proposition 1, as well as some intuition on the prior distribution.

Proposition 4 (Proposition 1 in the main body). For a CTMC $(z(t))_{t \in [0,1]}$ with rate matrix $R(t) = \beta(t) R_b$ and $R_b = \mathbf{1} m^\top - I$, the forward process can be written as

$$\bar Q(t) = e^{-\bar\beta_t} I + (1 - e^{-\bar\beta_t})\, \mathbf{1} m^\top,$$

where $(\bar Q(t))_{ij} = q(z(t) = i \mid z(0) = j)$ and $\bar\beta_t = \int_0^t \beta(s)\, \mathrm{d}s$.

Proof. Since $\mathbf{1} m^\top$ is a rank-one matrix with trace 1, it is diagonalizable and has only one non-zero eigenvalue, equal to $\mathrm{tr}(\mathbf{1} m^\top) = 1$. Therefore,

$$R_b = \mathbf{1} m^\top - I = P D P^{-1} - I = P (D - I) P^{-1},$$

with $D = \mathrm{diag}(1, 0, \ldots, 0)$. Denoting $\bar\beta_t = \int_0^t \beta(s)\, \mathrm{d}s$,

$$\bar Q(t) = P \exp\big( \bar\beta_t (D - I) \big) P^{-1} = P \big( e^{-\bar\beta_t} I + (1 - e^{-\bar\beta_t}) D \big) P^{-1} = e^{-\bar\beta_t} I + (1 - e^{-\bar\beta_t})\, \mathbf{1} m^\top.$$

We now wish to elaborate on the link between Proposition 4 and the choice of the prior distribution.
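Proposition 4 can be checked numerically: the matrix exponential of $\bar\beta_t R_b$ (here computed with a truncated Taylor series, with toy values for $m$ and $\bar\beta_t$) matches the closed form, and for large $\bar\beta_t$ every row collapses to the marginal $m$:

```python
import numpy as np

S = 5
m = np.array([0.35, 0.25, 0.2, 0.1, 0.1])  # toy marginal distribution
Rb = np.outer(np.ones(S), m) - np.eye(S)   # Rb = 1 m^T - I
beta_bar = 0.8                             # toy value of \bar{beta}_t

# Left-hand side: exp(beta_bar * Rb) via a truncated Taylor series.
Q, term = np.eye(S), np.eye(S)
for k in range(1, 40):
    term = term @ (beta_bar * Rb) / k
    Q = Q + term

# Right-hand side: the closed form of Proposition 4.
Q_closed = np.exp(-beta_bar) * np.eye(S) \
    + (1 - np.exp(-beta_bar)) * np.outer(np.ones(S), m)
assert np.allclose(Q, Q_closed, atol=1e-10)

# For large beta_bar, every row of Q(t) approaches m, i.e., the prior.
Q_late = np.exp(-20.0) * np.eye(S) + (1 - np.exp(-20.0)) * np.outer(np.ones(S), m)
assert np.allclose(Q_late, np.outer(np.ones(S), m), atol=1e-8)
```

The second check is exactly the argument made below: once $e^{-\bar\beta_t}$ is negligible, the noisy state is distributed according to $m$ regardless of the clean state.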
Recall our noise schedule,

$$\beta(t) = \frac{\alpha \pi}{2} \sin\Big( \frac{\pi t}{2} \Big) \quad \text{and} \quad \int_0^t \beta(s)\, \mathrm{d}s = \alpha \Big( 1 - \cos\Big( \frac{\pi t}{2} \Big) \Big).$$

When $t = 1$, it holds that $e^{-\bar\beta_t} = e^{-\alpha}$, and therefore $\bar Q(1) = e^{-\alpha} I + (1 - e^{-\alpha})\, \mathbf{1} m^\top$. When $\alpha$ is large enough, $e^{-\alpha} \approx 0$ and $\bar Q(1) \approx \mathbf{1} m^\top$. Denoting by $(\bar Q(1))_j$ the $j$-th row of $\bar Q(1)$, it holds that $(\bar Q(1))_j = q(z(1) \mid z(0) = j) \approx m$. In other terms, whatever the value of $z(0)$, $z(1)$ is sampled from the same categorical distribution with probability vector $m$. Therefore,

$$q_1(z(1)) = \sum_{j \in \mathcal{Z}} q_{1|0}(z(1) \mid z(0) = j)\, q_0(z(0) = j) \approx \sum_{j \in \mathcal{Z}} m\, q_0(z(0) = j) = m = p_\text{ref}(z(1)).$$

The first equality is obtained by marginalizing over $z(0)$, the approximation by using the fact that $q_{1|0}(z(1) \mid z(0) = j) \approx m$ for all possible $z(0)$, and the last equality by using that $\sum_{j \in \mathcal{Z}} q_0(z(0) = j) = 1$ and that $m$ is independent of $j$. In the $D$-dimensional case, since the forward process factorizes, we get

$$q_{1|0}(z(1) \mid z(0)) \approx \prod_{d=1}^{D} m = p_\text{ref}(z(1)),$$

where $d \in \llbracket 1, D \rrbracket$ denotes the $d$-th dimension of $z \in \mathcal{Z}^D$.

B.3 RRWP properties

In the following, we formally study the properties of the RRWP encoding.

Notation. A graph $G$ is a pair $(V(G), E(G))$ with finite sets of vertices or nodes $V(G)$ and edges $E(G) \subseteq \{\{u, v\} \subseteq V(G) \mid u \neq v\}$. If not stated otherwise, we set $n := |V(G)|$, and the graph is of order $n$. For ease of notation, we denote the edge $\{u, v\} \in E(G)$ by $(u, v)$ or $(v, u)$. An $n$-order attributed graph is a triple $G = (G, X, E)$, where $G = (V(G), E(G))$ is a graph, $X \in \{0, 1\}^{n \times a}$, for $a > 0$, is a node feature matrix, and $E \in \{0, 1\}^{n \times n \times b}$, for $b > 0$, is an edge feature tensor. Here, we set $V(G) := \llbracket 1, n \rrbracket$. The neighborhood of $v \in V(G)$ is denoted by $N(v) := \{u \in V(G) \mid (v, u) \in E(G)\}$. In our experiments, we leverage the relative random-walk probabilities (RRWP) encoding, introduced in Ma et al. (2023). Denoting by $A$ the adjacency matrix of a graph $G$, by $D$ the diagonal degree matrix, and by $M = D^{-1} A$ the degree-normalized adjacency matrix, for each pair of nodes $(i, j)$, the RRWP encoding computes

$$P^K_{ij} := \big[ I_{ij},\, M_{ij},\, M^2_{ij},\, \ldots,\, M^{K-1}_{ij} \big], \quad (5)$$

where $K$ refers to the maximum length of the random walks.
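Equation (5) is straightforward to compute by accumulating powers of the degree-normalized adjacency matrix; the following sketch does so for a toy 4-node path graph:

```python
import numpy as np

# RRWP encoding for a toy 4-node path graph, walks of length up to K-1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
M = np.diag(1.0 / A.sum(axis=1)) @ A       # degree-normalized adjacency D^{-1} A

K = 4
powers = [np.eye(len(A))]
for _ in range(K - 1):
    powers.append(powers[-1] @ M)          # M, M^2, ..., M^{K-1}
P = np.stack(powers, axis=-1)              # P[i, j] = (I_ij, M_ij, ..., M^{K-1}_ij)

assert P.shape == (4, 4, K)
node_enc = np.stack([P[i, i] for i in range(4)])  # diagonal entries as node encodings
assert node_enc.shape == (4, K)
```

The diagonal slices `P[i, i]` are the node encodings discussed next; the off-diagonal slices serve as edge encodings.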
The entry $P^K_{ii}$ corresponds to the RRWP encoding of node $i$; we therefore leverage these entries as node encodings. This encoding alone suffices to train our graph diffusion model and attain state-of-the-art results. In the following, we show that the RRWP encoding can (approximately) determine whether two nodes lie in the same connected component and approximate the size of the largest connected component.

Proposition 5. For $n \in \mathbb{N}$, let $\mathcal{G}_n$ denote the set of $n$-order graphs, and for a graph $G \in \mathcal{G}_n$ let $V(G) := \llbracket 1, n \rrbracket$. Assume that the graph $G$ has $c$ connected components, and let $C \in \{0, 1\}^{n \times c}$ be a matrix such that the $i$-th row $C_i$ is a one-hot encoding indicating which of the $c$ connected components the vertex $i$ belongs to. Then, for any $\varepsilon > 0$, there exists a feed-forward neural network $\mathrm{FNN} \colon \mathbb{R}^{n \times n} \to [0, 1]^{n \times c}$ such that $\| \mathrm{FNN}(M^{n-1}) - C \|_F \leq \varepsilon$.

Proof. Let $R := M^{n-1}$. First, since the graphs have $n$ vertices, the longest path in the graphs has length $n - 1$. Hence, two vertices $v, w \in V(G)$, with $v \neq w$, are in the same connected component if, and only if, $R_{vw} \neq 0$. Hence, applying a sign activation function to $R$ pointwise, we get a matrix over $\{0, 1\}$ with the same property. Further, by adding $D_n \in \{0, 1\}^{n \times n}$, an $n \times n$ diagonal matrix with ones on the main diagonal, to this matrix, this property also holds in the case $v = w$. In addition, there exists a permutation matrix $P_n$ such that applying it to the above matrix results in a block-diagonal matrix $B \in \{0, 1\}^{n \times n}$ such that $B_v = B_w$, for $v, w \in V(G)$, if, and only if, the vertices $v$ and $w$ are in the same connected component. Since $n$ is finite, the set of such matrices $B$ is finite and hence compact. Hence, we can find a continuous function mapping each possible row $B_v$, for $v \in V(G)$, to the corresponding one-hot encoding of the connected component.
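The observation underlying the proof can be sketched numerically. For robustness on toy bipartite components (where a single even power can vanish on odd-distance pairs), this sketch aggregates all walk-length powers collected in the RRWP tensor rather than inspecting one power in isolation:

```python
import numpy as np

# Two components: a path {0, 1, 2} and an edge {3, 4}.
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (3, 4)]:
    A[u, v] = A[v, u] = 1.0
M = np.diag(1.0 / A.sum(axis=1)) @ A
n = len(A)

# Accumulate I + M + M^2 + ... + M^{n-1}; an entry is positive iff some walk
# of length at most n-1 connects the two vertices.
acc, power = np.eye(n), np.eye(n)
for _ in range(n - 1):
    power = power @ M
    acc += power

same_comp = acc > 0
assert same_comp[0, 2] and same_comp[3, 4]       # within a component
assert not same_comp[0, 3] and not same_comp[2, 4]  # across components
```

Applying a pointwise threshold to `acc` yields exactly the kind of $\{0, 1\}$ component-indicator matrix that the proof feeds into a continuous row-labeling function.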
Since all functions applied after the sign function are continuous, we can approximate the above composition of functions via a two-layer feed-forward neural network, leveraging the universal approximation theorem (Cybenko, 1992; Leshno et al., 1993).

Similarly, we can also approximate the size of the largest component in a given graph.

Proposition 6. For $n \in \mathbb{N}$, let $\mathcal{G}_n$ denote the set of $n$-order graphs, and for $G \in \mathcal{G}_n$ let $V(G) := \llbracket 1, n \rrbracket$. Assume that $S$ is the number of vertices in the largest connected component of the graph $G$. Then, for any $\varepsilon > 0$, there exists a feed-forward neural network $\mathrm{FNN} \colon \mathbb{R}^{n \times n} \to [1, n]$ such that $| \mathrm{FNN}(M^{n-1}) - S | \leq \varepsilon$.

Proof. By the proof of Proposition 5, we get a block-diagonal matrix $B \in \{0, 1\}^{n \times n}$ such that $B_{uv} = 1$ if, and only if, $u$ and $v$ are in the same connected component. Hence, by column-wise summation, we get the number of vertices in each connected component, and a maximum operation extracts the largest entry. Since all of the above functions are continuous, we can approximate their composition via a two-layer feed-forward neural network, leveraging the universal approximation theorem (Cybenko, 1992; Leshno et al., 1993).

Moreover, we can show that RRWP encodings can (approximately) count the number of $p$-cycles, for $p < 5$, in which a node is contained. A $p$-cycle is a cycle on $p$ vertices.

Proposition 7. For $n \in \mathbb{N}$, let $\mathcal{G}_n$ denote the set of $n$-order graphs, and for $G \in \mathcal{G}_n$ let $V(G) := \llbracket 1, n \rrbracket$. Assume that $c \in \mathbb{N}^n$ contains, for all vertices in $G$, the number of $p$-cycles the vertex is contained in, for $p \in \{3, 4\}$. Then, for any $\varepsilon > 0$, there exists a feed-forward neural network $\mathrm{FNN} \colon \mathbb{R}^{n \times n} \to \mathbb{R}^n$ such that $\| \mathrm{FNN}(P^{n-1}) - c \|_2 \leq \varepsilon$.

Proof. For $p \in \{3, 4\}$, Vignac et al. (2022, Appendix B.2) provide simple linear-algebraic equations for the number of $p$-cycles each vertex of a given graph is contained in, based on powers of the adjacency matrix, which can be expressed as compositions of linear mappings, i.e., continuous functions. We can extract these matrices from $P^{n-1}$.
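The simplest such linear-algebraic count, the triangle (3-cycle) count per vertex, can be sketched directly with adjacency powers (here on the raw adjacency matrix for clarity; the proof recovers the relevant matrices from the RRWP tensor):

```python
import numpy as np

# K4 minus one edge: nodes 0, 1, 2, 3, with edge (2, 3) removed.
A = np.ones((4, 4)) - np.eye(4)
A[2, 3] = A[3, 2] = 0.0

# Number of triangles through each node: diag(A^3) / 2, since each triangle
# contributes two closed walks of length 3 (one per direction).
A3 = np.linalg.matrix_power(A, 3)
tri = np.diag(A3) / 2

# Triangles are {0,1,2} and {0,1,3}: nodes 0 and 1 lie in two, nodes 2 and 3 in one.
assert list(tri) == [2.0, 2.0, 1.0, 1.0]
```

The 4-cycle count admits a similar, slightly longer expression in powers of $A$, which is what makes both counts expressible as continuous functions of the encoding.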
Further, note that the domain of these mappings is compact. Hence, we can approximate this composition of functions via a two-layer feed-forward neural network, leveraging the universal approximation theorem (Cybenko, 1992; Leshno et al., 1993).

However, we can also show that RRWP encodings cannot detect whether a node is contained in a large cycle of a given graph. We say that an encoding, e.g., RRWP, counts the number of $p$-cycles, for $p \geq 2$, if there do not exist two graphs, one containing at least one $p$-cycle while the other does not, whose RRWP encodings are equivalent.

Proposition 8. For $p \geq 8$, the RRWP encoding does not count the number of $p$-cycles.

Proof. First, by Rattan & Seppelt (2023), the RRWP encoding does not distinguish more pairs of non-isomorphic graphs than the so-called $(1, 1)$-dimensional Weisfeiler-Leman algorithm. Secondly, the latter algorithm is strictly weaker than the 3-dimensional Weisfeiler-Leman algorithm in distinguishing non-isomorphic graphs (Rattan & Seppelt, 2023, Theorem 1.4). However, by Fürer (2017, Theorem 4), the 3-dimensional Weisfeiler-Leman algorithm cannot count 8-cycles. Hence, the result follows.

Hence, the above proposition implies the following result.

Corollary 9. For $p \geq 8$ and $K \geq 0$, there exists a graph $G$ containing a $p$-cycle $C$ and two vertex pairs $(r, s), (v, w) \in V(G)^2$ such that $(r, s)$ is contained in $C$ while $(v, w)$ is not, and $P^K_{vw} = P^K_{rs}$.

B.4 Equivariance properties

In this section, we prove that our model is equivariant (Proposition 10) and that our loss is permutation-invariant (Proposition 11), relying on Vignac et al. (2022, Lemmas 3.1 and 3.2). We also prove exchangeability in Proposition 12. Let us start by defining the notation for a graph permutation. Denoting by π a permutation, π acts on the attributed graph G = (G, X, E) as πG = (V (πG), E(πG)) where V (πG) = {π(1), . . .
, π(n)} and $E(\pi G) = \{(\pi(i), \pi(j)) \mid (v_i, v_j) \in E(G)\}$; $\pi X$ is the matrix obtained by permuting the rows of $X$ according to $\pi$, i.e., $(\pi X)_i = x_{\pi^{-1}(i)}$; similarly, $\pi E$ is the tensor obtained by permuting the entries $e_{ij}$ of $E$ according to $\pi$, i.e., $(\pi E)_{ij} = e_{\pi^{-1}(i)\pi^{-1}(j)}$.

Proposition 10 (Equivariance). The DiGress graph transformer using RWSE as node encodings and RRWP as edge encodings is permutation equivariant.

Proof. We recall the three sufficient conditions stated in Vignac et al. (2022) for ensuring permutation equivariance of the DiGress architecture, namely: their set of structural and spectral features is equivariant; all the blocks of their graph transformer architecture are permutation equivariant; and the layer normalization is equivariant. Replacing the first condition with the permutation-equivariant nature of the RRWP-based node and edge encodings completes the proof.

We now derive a more thorough proof of the permutation invariance of the loss than Vignac et al. (2022, Lemma 3.2), relying on the permutation-equivariant nature of both the forward process and the denoising neural network.

Proposition 11 (Permutation invariance of the loss). The cross-entropy loss defined in Equation 3 is invariant to permutations of the input graph $G(0)$.

Proof. Given a graph $G = (G, X, E)$, we denote by $\hat G = (\hat G, \hat X, \hat E)$ the clean graph predicted by the neural network, and by $\pi G = (\pi G, \pi X, \pi E)$ a permutation of this graph, for an arbitrary permutation $\pi$. Let us now establish that the loss function is permutation-invariant.
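The equivariance of the RRWP encoding used in Proposition 10 can be verified numerically: permuting the graph and then encoding gives the same result as encoding and then permuting both node axes of the encoding tensor. A toy 6-cycle and an arbitrary permutation suffice:

```python
import numpy as np

# RRWP of a graph: K powers of the degree-normalized adjacency M = D^{-1} A.
def rrwp(A, K=3):
    M = np.diag(1.0 / A.sum(axis=1)) @ A
    out, power = [np.eye(len(A))], np.eye(len(A))
    for _ in range(K - 1):
        power = power @ M
        out.append(power)
    return np.stack(out, axis=-1)

n = 6
A = np.zeros((n, n))
for i in range(n):                          # a 6-cycle graph
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

perm = np.array([2, 0, 5, 1, 3, 4])         # an arbitrary permutation of the nodes
P = np.eye(n)[perm]                          # its permutation matrix

# Encoding the relabeled graph equals relabeling the encoding on both node axes.
lhs = rrwp(P @ A @ P.T)
rhs = np.einsum('ia,jb,abk->ijk', P, P, rrwp(A))
assert np.allclose(lhs, rhs)
```

This is exactly the property that replaces the first condition of Vignac et al. (2022) in the proof above.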
We recall the loss function for a permutation $\pi$ of the clean data sample $G(0)$,

$$L_{CE}(\pi G(0)) := \mathbb{E}_{t \in [0,1],\, p_\text{data}(\pi G(0)),\, q(\pi G(t) \mid \pi G(0))} \Big[ -\sum_i \log p^\theta_{0|t}\big(x(0)_{\pi(i)} \mid \pi G(t)\big) - \lambda \sum_{i < j} \log p^\theta_{0|t}\big(e(0)_{\pi(i)\pi(j)} \mid \pi G(t)\big) \Big].$$

Since the forward process factorizes over nodes and edges, it is permutation equivariant, and by Proposition 10 so is the denoising network; each summand for the permuted graph therefore equals the corresponding summand for the original graph, so the sums, and hence the loss, are unchanged, i.e., $L_{CE}(\pi G(0)) = L_{CE}(G(0))$.

Sampling with $\tau$-leaping:

Sample $G(1) \sim p_\text{ref}$; $t \leftarrow 1$
while $t > 0.01$ do
    for $i = 1$ to $n$, for $\bar x \in X$ do    ▷ count transitions on node $i$
        $\hat R^{t,\theta}_X(x(t)_i, \bar x) \leftarrow R^t_X(\bar x, x(t)_i) \sum_{x(0)_i} \frac{q_{t|0}(\bar x \mid x(0)_i)}{q_{t|0}(x(t)_i \mid x(0)_i)}\, p^\theta_{0|t}(x(0)_i \mid G(t))$, for $\bar x \neq x(t)_i$
        sample $j_{x(t)_i, \bar x} \sim \mathcal{P}\big(\tau \hat R^{t,\theta}_X(x(t)_i, \bar x)\big)$
    for $i, j = 1$ to $n$, $i < j$, for $\bar e \in E$ do    ▷ count transitions on edge $ij$
        $\hat R^{t,\theta}_E(e(t)_{ij}, \bar e) \leftarrow R^t_E(\bar e, e(t)_{ij}) \sum_{e(0)_{ij}} \frac{q_{t|0}(\bar e \mid e(0)_{ij})}{q_{t|0}(e(t)_{ij} \mid e(0)_{ij})}\, p^\theta_{0|t}(e(0)_{ij} \mid G(t))$, for $\bar e \neq e(t)_{ij}$
        sample $j_{e(t)_{ij}, \bar e} \sim \mathcal{P}\big(\tau \hat R^{t,\theta}_E(e(t)_{ij}, \bar e)\big)$
    for $i = 1$ to $n$, for $\bar x \in X$ do    ▷ apply unique node transitions or discard
        if $j_{x(t)_i, \bar x} = 1$ and $\sum_{\bar x'} j_{x(t)_i, \bar x'} = 1$ then $x(t-\tau)_i \leftarrow \bar x$
    for $i, j = 1$ to $n$, $i < j$, for $\bar e \in E$ do    ▷ apply unique edge transitions or discard
        if $j_{e(t)_{ij}, \bar e} = 1$ and $\sum_{\bar e'} j_{e(t)_{ij}, \bar e'} = 1$ then $e(t-\tau)_{ij} \leftarrow \bar e$
    $t \leftarrow t - \tau$
end
$x(0)_i \leftarrow \arg\max p^\theta_{0|t}(x(0)_i \mid G(t))$, $e(0)_{ij} \leftarrow \arg\max p^\theta_{0|t}(e(0)_{ij} \mid G(t))$    ▷ last pass
return $G(0)$

Figure 3: Training and sampling algorithms of Cometh.

D.1 Synthetic graph generation

We evaluate our method on two datasets from the SPECTRE benchmark (Martinkus et al., 2022), with 200 graphs each. Planar contains planar graphs with 64 nodes, and SBM contains graphs drawn from a stochastic block model with up to 187 nodes. We use the same split as the original paper, with 128 graphs for training, 40 for testing, and the rest as a validation set. Similar to Jo et al. (2024), we apply random permutations to the graphs at each training epoch. We report five metrics from the SPECTRE benchmark: four MMD metrics between the test set and the generated set, and the VUN metric on the generated graphs.
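The $\tau$-leaping update at the core of the sampling loop can be sketched for node states as follows, with random toy rates standing in for the learned reverse rates:

```python
import numpy as np

rng = np.random.default_rng(0)
S, n = 4, 5          # states per node, number of nodes (toy sizes)
tau = 0.01
x = rng.integers(0, S, size=n)              # current node states at time t

# Toy reverse rates hat{R}(x_i, x') for every node and candidate state;
# the rate of staying in the current state is not defined, so it is set to 0.
rates = rng.random((n, S)) * 5.0
rates[np.arange(n), x] = 0.0

# tau-leaping: transition counts over an interval of length tau are Poisson.
counts = rng.poisson(tau * rates)           # shape (n, S)

for i in range(n):
    total = counts[i].sum()
    if total == 1:                          # apply the unique transition
        x[i] = int(np.argmax(counts[i]))
    # total == 0: state unchanged; total > 1: rejected, since multiple
    # transitions in one dimension are meaningless in a categorical state space.

assert x.shape == (n,) and ((0 <= x) & (x < S)).all()
```

Edge states are updated in exactly the same way, one Poisson draw per candidate edge type, before decrementing $t$ by $\tau$.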
The MMD metrics measure the Maximum Mean Discrepancy between statistics of the test and generated sets, namely the degree (Degree) distribution, the clustering coefficient (Cluster) distribution, the counts of orbits in subgraphs of size four (Orb.), and the eigenvalues (Spectrum) of the graph Laplacian. The Valid, Unique, and Novel (VUN) metric measures the percentage of generated graphs that are valid, unique, and non-isomorphic to any graph in the training set. For the MMD metrics, we report ratios of the MMD computed between the generated and the test set over the MMD computed between the train and the test set. In fact, due to the small size of the datasets, the statistics we compute vary slightly between the train and the test set. Formally,

$$r = \frac{\mathrm{MMD}(\text{generated}, \text{test})^2}{\mathrm{MMD}(\text{train}, \text{test})^2}. \quad (7)$$

Note that we actually compute the squared MMD, following common practice (Martinkus et al., 2022; Vignac et al., 2022). On Planar, we report results using $\tau = 0.002$, i.e., using 500 $\tau$-leaping steps. We also evaluate our model using 10 corrector steps after each predictor step when $t < 0.1T$, with $\tau = 0.002$, for 1 000 $\tau$-leaping steps. We found our best results using $\tau_c = 0.7$. On SBM, we report results using $\tau = 0.001$, i.e., using 1 000 $\tau$-leaping steps.

Figure 4: Overview of the DiGress graph transformer (outer product with scaling, node- and edge-wise MLPs, and encoding/feature inputs).

D.2 Small molecule generation: QM9

We evaluate our model on QM9 (Wu et al., 2018) to assess its ability to model attributed graph distributions. The molecules are kekulized using the RDKit library, and hydrogen atoms are removed, following the standard preprocessing pipeline for this dataset. Edges can have three types, namely single bonds, double bonds, and triple bonds, as well as one additional type for the absence of an edge.
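Equation (7) can be sketched with a simple RBF-kernel squared MMD (the biased V-statistic) on one-dimensional stand-ins for graph statistics; the samples and kernel bandwidth are toy choices for illustration:

```python
import numpy as np

# Biased (V-statistic) squared MMD with an RBF kernel between two 1-D samples.
def mmd2(x, y, sigma=1.0):
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
test = rng.normal(0.0, 1.0, 400)       # e.g. degree statistics of test graphs
train = rng.normal(0.0, 1.0, 400)      # same distribution: small MMD^2
generated = rng.normal(2.0, 1.0, 400)  # shifted stand-in for generated statistics

assert abs(mmd2(test, test)) < 1e-12   # identical samples give zero

# Ratio from Equation (7): generated-vs-test over train-vs-test squared MMD.
r = mmd2(generated, test) / mmd2(train, test)
assert r > 1.0                         # shifted samples score worse than train
```

A ratio near 1 means the generated statistics are no further from the test set than the training set is, which is the best one can expect given the dataset sizes.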
The atom types are listed in Table 7. We use the same split as Vignac et al. (2022), i.e., 100k molecules for training, 13k for testing, and the rest (20 885 molecules) as a validation set. We choose this split over the one proposed by Jo et al. (2022) because it leaves a validation set on which to evaluate the ELBO and select the checkpoints minimizing this quantity. As a consequence, our training set contains roughly 20k fewer molecules than what most graph generation works use. At the sampling stage, we generate 10k molecules. We evaluate four metrics. Validity is evaluated by sanitizing the molecules and converting them to SMILES strings using the RDKit library. If a molecule is disconnected, its largest fragment is selected as the sample. We then evaluate Uniqueness among the valid molecules. As stated by Vignac et al. (2022), evaluating novelty on QM9 makes little sense, since this dataset is an enumeration of all stable molecules with the above atom types and nine or fewer atoms. We also evaluate the Fréchet ChemNet Distance (FCD), which embeds the generated set and the test set using the ChemNet neural network and compares the resulting distributions using the Wasserstein-2 distance (Preuer et al., 2018). Finally, we evaluate the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK) between the test set and the generated set, which measures the structural similarity between those two distributions.

D.3 Molecule generation on large datasets

We further evaluate Cometh on two large molecule generation benchmarks, MOSES (Polykovskiy et al., 2020) and GuacaMol (Brown et al., 2019). The molecules are processed in the same way as for QM9 and have the same edge types. The atom types for both datasets are listed in Table 7. The filtration procedure for GuacaMol consists of converting the SMILES strings into graphs and checking that the original SMILES can be recovered from the graphs; the molecules for which this conversion is not possible are discarded.
We use the code of Vignac et al. (2022) to perform this procedure. We use the standard split provided for each dataset. Both datasets come with their own benchmarking libraries. For GuacaMol, we use the distribution-learning benchmark. During the sampling stage, we generate 25k molecules for MOSES and 18k for GuacaMol, which is sufficient for both datasets, as their metrics are evaluated on 10k molecules sampled from the provided generated SMILES.

Table 7: Details of the molecular datasets. The number of molecules for GuacaMol is computed after filtration.

Dataset     Number of molecules   Size      Atom types
QM9         133 885               1 to 9    C, N, O, F
MOSES       1 936 962             8 to 27   C, N, S, O, F, Cl, Br
GuacaMol    1 398 223             2 to 88   C, N, O, F, B, Br, Cl, I, P, S, Se, Si

We now elaborate on the metrics for each dataset. For both datasets, we report Validity, defined in the same manner as for QM9, the percentage of Valid and Unique (Val. & Uni.) samples, and the percentage of Valid, Unique, and Novel (VUN) samples. We prefer the latter two metrics over Uniqueness and Novelty alone, as they better assess a model's performance than separately reporting all three metrics (Validity, Uniqueness, and Novelty). The MOSES benchmark also computes metrics comparing the generated set to a scaffold test set, from which we report the Fréchet ChemNet Distance (FCD); the Similarity to the Nearest Neighbor (SNN), which averages, over the generated molecules, the Tanimoto similarity between a generated molecule's fingerprint and the most similar fingerprint in a reference set; and the Scaffold similarity (Scaf), which compares the frequencies of Bemis-Murcko scaffolds in the generated set and a reference set. Finally, we report the Filters metric, indicating the percentage of generated molecules that successfully pass the filters applied when constructing the dataset.
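The SNN metric just described can be sketched as follows. This is a minimal sketch where fingerprints are represented as Python sets of "on" bit indices; the actual MOSES benchmark computes Morgan fingerprints with RDKit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of
    'on' bit indices: |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def snn(generated_fps, reference_fps):
    """SNN: average, over generated molecules, of the Tanimoto
    similarity to the nearest (most similar) reference molecule."""
    return sum(max(tanimoto(g, r) for r in reference_fps)
               for g in generated_fps) / len(generated_fps)
```

A high SNN thus indicates that every generated molecule has a close structural analogue in the reference set.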
The GuacaMol benchmark computes two scores: the Fréchet ChemNet Distance (FCD) score and the KL divergence (KL) between the distributions of a set of physicochemical descriptors in the training set and the generated set. We report results using τ = 0.002, i.e., 500 denoising steps, on both datasets. The predictor-corrector experiments use τ = 0.004 and 10 corrector steps for 500 denoising steps. For both datasets, we use τc = 1.5.

D.4 Conditional generation

We perform conditional generation experiments on QM9, targeting two properties: the dipole moment µ and the highest occupied molecular orbital (HOMO) energy. They are well suited for evaluating conditional generation because they can be estimated using the Psi4 library (Smith et al., 2020). We trained models sweeping over p_uncond ∈ {0.1, 0.2} and explored different values of s in {1, ..., 6} during sampling. We obtained our best results using p_uncond = 0.1 and s = 1. We evaluated our method in the same setting as Vignac et al. (2022) during inference. We sampled 100 molecules from the test set, extracted their dipole moment and HOMO values, and generated 10 molecules targeting those properties. We estimated the HOMO energy and the dipole moment of the generated molecules, and we report the Mean Absolute Error (MAE) between the estimated properties and the corresponding targets. To efficiently incorporate the conditioner y, we implemented two ideas proposed by Ninniri et al. (2023). First, instead of using y solely as a global feature, we incorporated it as an additional feature of each node and edge. Second, we trained a two-layer neural network to predict the size of the molecule given the target properties, rather than sampling it from the empirical dataset distribution. Our empirical observations indicate that this approach enhances performance. As of the time of writing, no official implementation has been released for Ninniri et al. (2023), making it impossible to reproduce their results.
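The condition-dropping probability p_uncond and the scale s above are the usual ingredients of classifier-free guidance. One common way to combine the conditional and unconditional predictions for a categorical distribution is sketched below; the exact combination rule used by the model is an assumption on our part, not taken from the paper.

```python
import math

def guided_probs(p_cond, p_uncond, s):
    """Classifier-free-guidance-style combination for a categorical
    distribution (one common formulation, assumed here):
        log p_guided ∝ s * log p_cond + (1 - s) * log p_uncond
    s = 1 recovers the purely conditional prediction. Probabilities
    are assumed strictly positive."""
    logits = [s * math.log(pc) + (1 - s) * math.log(pu)
              for pc, pu in zip(p_cond, p_uncond)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

During training, the condition y is replaced by a null token with probability p_uncond, so that a single network provides both p_cond and p_uncond at sampling time.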
Additionally, since they do not report validity in their experiments on QM9, we chose not to include their results as a baseline to avoid unfair comparisons.

D.5 Compute Resources

Experiments on QM9, Planar, and SBM were carried out using a single V100 or A10 GPU for both training and sampling. The training time on QM9 is 6 hours, while training on SBM and Planar takes approximately two and a half days. We trained the models on MOSES and GuacaMol using two A100 GPUs. To sample from these models, we used a single A100 GPU. Training on MOSES takes approximately two days, while training on GuacaMol requires four days.

E Additional Experiments

E.1 Ablation on the number of steps

To demonstrate how the continuous-time approach allows trading off sampling quality against efficiency, we perform an ablation study on τ. Results are presented in Tables 8 to 10. For readability, we report the number of model evaluations, equal to 1/τ, instead of τ. Overall, the model achieves decent performance across all datasets with just 50 steps. Increasing the number of model evaluations to 500-700 enhances performance to a state-of-the-art level. Beyond this point, performance saturates, and models using 1 000 steps do not necessarily outperform those with fewer evaluations, as seen with SBM and Planar.

Table 8: Ablation study on the number of steps for synthetic graphs. We report the mean of 5 runs, as well as 95% confidence intervals. The best results are highlighted in bold.
Number of steps   Degree         Cluster      Orbit          Spectrum    VUN [%]
Planar graphs
10                103.0 ± 6.8    11.0 ± 0.1   305.4 ± 10.6   5.3 ± 0.2   0.0 ± 0.0
50                3.6 ± 1.3      3.3 ± 0.2    12.4 ± 4.5     1.3 ± 0.2   41.5 ± 4.51
100               3.1 ± 1.2      2.3 ± 0.3    5.2 ± 2.5      1.4 ± 0.2   76.0 ± 2.23
300               3.4 ± 1.1      1.7 ± 0.5    6.2 ± 3.9      1.3 ± 0.1   86.5 ± 3.82
500               2.1 ± 1.3      1.5 ± 0.3    3.1 ± 3.0      1.2 ± 0.2   92.5 ± 3.67
700               2.4 ± 1.2      1.5 ± 0.2    2.2 ± 1.2      1.1 ± 0.2   94.0 ± 2.23
1000              1.9 ± 1.0      1.9 ± 0.2    2.7 ± 1.7      1.5 ± 0.2   89.5 ± 3.51
Stochastic block model
10                166.6 ± 16.1   1.8 ± 0.1    3.3 ± 0.2      2.2 ± 0.2   14.0 ± 4.72
50                2.6 ± 0.6      1.5 ± 0.0    1.9 ± 0.2      0.9 ± 0.1   62.0 ± 5.44
100               1.4 ± 0.7      1.5 ± 0.0    1.7 ± 0.2      0.9 ± 0.1   70.5 ± 5.26
300               2.4 ± 0.6      1.5 ± 0.0    1.8 ± 0.3      0.8 ± 0.0   65.5 ± 5.94
500               2.4 ± 1.1      1.5 ± 0.0    1.7 ± 0.2      0.8 ± 0.1   77.0 ± 5.26
700               2.6 ± 0.9      1.5 ± 0.0    1.6 ± 0.1      0.9 ± 0.1   69.0 ± 4.06
1000              1.8 ± 0.7      1.5 ± 0.0    1.7 ± 0.4      0.8 ± 0.1   67.5 ± 3.10

Table 9: Ablation study on the number of steps for QM9. We report the mean of 5 runs, as well as 95% confidence intervals.

Number of steps   Validity       Uniqueness     Valid & Unique   FCD           NSPDK
10                88.69 ± 0.36   98.57 ± 0.13   87.42 ± 0.46     0.84 ± 0.02   0.001 ± 0.0
50                99.07 ± 0.05   96.78 ± 0.16   95.88 ± 0.21     0.25 ± 0.01   0.000 ± 0.0
100               99.42 ± 0.06   96.81 ± 0.06   96.24 ± 0.05     0.26 ± 0.01   0.000 ± 0.0
300               99.53 ± 0.02   96.57 ± 0.12   96.12 ± 0.11     0.25 ± 0.01   0.000 ± 0.0
500               99.57 ± 0.07   96.76 ± 0.17   96.34 ± 0.2      0.25 ± 0.01   0.000 ± 0.0
700               99.53 ± 0.05   96.65 ± 0.15   96.2 ± 0.15      0.25 ± 0.01   0.000 ± 0.0
1000              99.57 ± 0.07   96.79 ± 0.08   96.37 ± 0.14     0.25 ± 0.01   0.000 ± 0.0

E.2 Ablation on the noise model

To emphasize the impact of using marginal transitions instead of uniform transitions, we trained models on Planar and SBM with the uniform noise model (see Table 11). While the uniform model performs competitively with marginal transitions on SBM, the VUN score remains significantly higher with marginal transitions. On Planar, marginal transitions demonstrate substantially superior performance compared to uniform transitions.

Table 10: Ablation study on the number of steps for MOSES. We report the mean of five runs, as well as 95% confidence intervals.
The best results are highlighted in bold.

Number of steps   Val.         Val. & Uni.   VUN          Filters      FCD           SNN          Scaf
10                26.1 ± 0.2   26.1 ± 0.2    26.0 ± 0.2   59.9 ± 0.6   7.88 ± 0.13   0.36 ± 0.0   8.9 ± 1.1
50                82.9 ± 0.3   82.9 ± 0.3    80.5 ± 0.3   94.6 ± 0.1   1.54 ± 0.01   0.49 ± 0.0   18.4 ± 1.0
100               85.8 ± 0.2   85.7 ± 0.1    82.9 ± 0.2   96.5 ± 0.1   1.43 ± 0.01   0.5 ± 0.0    17.2 ± 0.6
300               86.9 ± 0.2   86.9 ± 0.2    83.8 ± 0.2   97.1 ± 0.1   1.44 ± 0.02   0.51 ± 0.0   17.8 ± 1.0
500               87.0 ± 0.2   86.9 ± 0.2    83.8 ± 0.2   97.2 ± 0.1   1.44 ± 0.02   0.51 ± 0.0   15.9 ± 0.8
700               87.2 ± 0.2   87.1 ± 0.2    83.9 ± 0.2   97.2 ± 0.1   1.43 ± 0.02   0.51 ± 0.0   15.9 ± 0.4
1000              87.2 ± 0.2   87.2 ± 0.2    84.0 ± 0.2   97.2 ± 0.1   1.44 ± 0.01   0.51 ± 0.0   17.3 ± 0.9

Table 11: Ablation study on the noise model for synthetic graphs. We report the mean of 5 runs, as well as 95% confidence intervals.

Noise Model   Degree       Cluster     Orbit        Spectrum    VUN [%]
Planar graphs
Marginal      2.1 ± 1.3    1.5 ± 0.3   3.1 ± 3.0    1.2 ± 0.2   92.5 ± 3.67
Uniform       14.3 ± 4.2   3.8 ± 0.6   14.2 ± 6.1   1.7 ± 0.2   32.5 ± 4.4
Stochastic block model
Marginal      2.4 ± 1.1    1.5 ± 0.0   1.7 ± 0.2    0.8 ± 0.1   77.0 ± 5.26
Uniform       1.6 ± 0.3    1.5 ± 0.0   1.8 ± 0.3    0.9 ± 0.1   63.5 ± 6.6

E.3 Ablation on the loss function

To justify our choice of cross-entropy as the loss function instead of the ELBO, we provide results of our experiments on QM9 using the ELBO as the loss function. Figure 5 presents the validation performance of both approaches in terms of validity over 512 samples. While the cross-entropy loss quickly reaches near-perfect validity, the model trained with the ELBO saturates below 80%. This significant performance gap on a simple dataset like QM9 underscores the inefficiency of the ELBO as a loss function for Cometh.

Figure 5: Validation results on QM9 using the cross-entropy and ELBO losses.

E.4 Ablation on the noise schedule

We also performed a simple ablation study on QM9 to compare the performance of our cosine noise schedule against the exponential noise schedule β(t) = αγ^t log γ, with α = 0.8 and γ = 2. Overall, the cosine schedule performs better on every metric except Uniqueness.
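The two schedules compared in this ablation can be sketched as follows. This is a minimal sketch: the exponential rate and its parameters are taken from the text above, while the cosine parameterization shown is the common Nichol & Dhariwal form and is our assumption, not necessarily the paper's exact definition.

```python
import math

ALPHA, GAMMA = 0.8, 2.0  # exponential-schedule parameters from the ablation

def beta_exp(t):
    """Exponential rate schedule: beta(t) = alpha * gamma**t * log(gamma)."""
    return ALPHA * GAMMA ** t * math.log(GAMMA)

def int_beta_exp(t):
    """Closed form of the integral of beta_exp over [0, t]:
    alpha * (gamma**t - 1)."""
    return ALPHA * (GAMMA ** t - 1.0)

def alpha_bar_cosine(t, s=0.008):
    """Cosine schedule (assumed Nichol & Dhariwal-style): the
    keep-probability at time t in [0, 1], normalized so that
    alpha_bar(0) = 1."""
    f = lambda u: math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0.0)
```

Under a CTMC forward process, exp(-int_beta_exp(t)) plays the role of the probability of keeping the current state up to time t, so having a closed-form integral makes the forward marginals cheap to evaluate at any continuous time.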
Table 12: Ablation on the noise schedule on QM9. We report the mean of five runs, as well as 95% confidence intervals. The best results are highlighted in bold.

Model    Validity       Uniqueness     Valid & Unique   FCD           NSPDK
Cosine   99.57 ± 0.07   96.76 ± 0.17   96.34 ± 0.2      0.25 ± 0.01   0.000 ± 0.00
Exp      98.28 ± 0.15   97.0 ± 0.13    95.34 ± 0.15     0.31 ± 0.01   0.001 ± 0.00

F Limitations

Although our model advances the state of the art across all considered benchmarks, it still incurs quadratic complexity in the number of nodes, a common issue in graph diffusion models. This problem could be alleviated by adapting methods like EDGE (Chen et al., 2023), which scale discrete graph diffusion to large graph generation. Additionally, our approach does not support the generation of continuous features and is restricted to categorical attributes. To generate continuous features, it would have to be combined with a continuous-state diffusion model, resulting in an approach similar to that of Vignac et al. (2023).

Figure 6: Samples from Cometh on Planar (top) and SBM (bottom).

Figure 7: Samples from Cometh on QM9.

Figure 8: Samples from Cometh on MOSES.

Figure 9: Samples from Cometh on GuacaMol. The samples on this dataset exhibit some failure cases, such as disconnected molecules or 3-cycles of carbon atoms.