Published in Transactions on Machine Learning Research (01/2025)

# Directed Graph Generation with Heat Kernels

**Marc T. Law** (marcl@nvidia.com), NVIDIA
**Karsten Kreis** (kkreis@nvidia.com), NVIDIA
**Haggai Maron** (hmaron@nvidia.com), NVIDIA, Technion

Reviewed on OpenReview: https://openreview.net/forum?id=60Gi1w6hte

**Abstract.** Existing work on graph generation has, so far, mainly focused on undirected graphs. In this paper, we propose a denoising autoencoder-based generative model that exploits the global structure of directed graphs (also called digraphs) via their Laplacian dynamics and enables one-shot generation. Our noising encoder uses closed-form expressions based on the heat equation to corrupt its digraph input with uniform noise. Our decoder reconstructs the corrupted representation by exploiting the global topological information of the graph contained in its random walk Laplacian matrix. Our approach generalizes a special class of exponential kernels over discrete structures, called diffusion kernels or heat kernels, to the non-symmetric case via Reproducing Kernel Banach Spaces (RKBS). This connection with heat kernels provides us with a geometrically motivated algorithm related to Gaussian processes and to dimensionality reduction techniques such as Laplacian eigenmaps. It also allows us to interpret and exploit the eigenproperties of the Laplacian matrix. We provide an experimental analysis of our approach on different types of synthetic datasets and show that our model is able to generate directed graphs that follow the distribution of the training dataset even when it is multimodal.
## 1 Introduction

The representation of directed graphs (or digraphs) has recently attracted interest from the machine learning community (Clough & Evans, 2017; Sim et al., 2021) as they can naturally describe causal relations (Bombelli et al., 1987), spatiotemporal events in chronological order (Law & Lucas, 2023), or stochastic processes such as Markov chains (Norris, 1998). Existing work on digraph representation has focused on discriminative tasks such as classification or link prediction. In this work, we consider the task of digraph generation. One possible application is the modeling of new causal systems that follow the same distribution as a given training set.

Most of the machine learning literature on graph generation focuses on undirected graphs (Liao et al., 2019; Niu et al., 2020; You et al., 2018). The goal is to generate plausible graphs of the same nature as those in a given training dataset (e.g., molecules (Vignac et al., 2023)). Existing approaches can be divided mostly into two categories: auto-regressive and one-shot. Auto-regressive approaches (Liao et al., 2019; You et al., 2018) start by generating small graphs to which sets of nodes and their corresponding edges are iteratively added until the final graph reaches a certain criterion (e.g., size). One-shot approaches, on the other hand, generate all the nodes and edges of the generated graphs in a single step. One-shot approaches were shown to be more efficient than auto-regressive ones due to the lack of intermediate steps, which can also lead to worse generative performance because of error accumulation at each step, and the fact that one-shot

Figure 1: Our framework can be viewed as a denoising autoencoder.
Our heat diffusion encoder maps a perturbed adjacency matrix $\bar{A} \in \{0, 1\}^{n \times n}$ to a noisy node representation matrix $X(T) \in [0, 1]^{n \times d}$ that is given as input to a decoder that reconstructs the edges ($n = 5$, $d = 7$ in the figure).

approaches can directly work with the global structure of the graph via its Laplacian matrix instead of arbitrary subgraphs (Martinkus et al., 2022). Among the one-shot approaches, Spectre (Martinkus et al., 2022) learns the distribution of the most informative eigenvectors of the Laplacian matrices. Nonetheless, Spectre does not generalize to digraphs as it relies on properties of Laplacian matrices that are satisfied only for undirected graphs (see explanation in Section 5). DiGress (Vignac et al., 2023) considers separate representations for nodes and edges, to which discrete noise is added. DiGress is formulated as a classification problem: a denoising decoder classifies the category or existence of edges and nodes. However, DiGress also requires spectral features from Beaini et al. (2021) that are valid only for undirected graphs, as they rely on symmetric scalar products. In conclusion, none of the existing one-shot approaches can be easily adapted to digraphs.

**Contributions.** We propose a one-shot approach that generates digraphs in a single step. The training of our framework is illustrated in Fig. 1. Unlike previous approaches, we exploit the eigenproperties of the Laplacian matrix that remain valid when the graph is directed, so we can effectively use its global structure during the generation process. To this end, we propose a denoising autoencoder (Vincent et al., 2008) whose noising encoder is not a learned neural network but exploits closed-form expressions based on the heat equation for digraphs (Veerman & Lyons, 2020), which effectively encodes the global topological information of the graph into node features.
We propose to introduce noise via a nonhomogeneous term that makes our node representations tend to a stochastic matrix with all its elements equal. Our denoising decoder is a neural network trained to reconstruct the original node representations and adjacency matrix of the graph. We explain below the main setup and building blocks of our approach.

### 1.1 Problem definition and overview of our approach

During training, we are given a set of digraphs $\{G_1, \dots, G_m\}$ represented by their adjacency matrices $A_1, \dots, A_m$. Our goal at inference time is to generate new digraphs that follow the same distribution.

**Training.** At training time, a training graph $G_i$ is represented by its adjacency matrix $A_i$. First, this matrix is perturbed using data augmentation techniques such as edge perturbation (Ding et al., 2022). The resulting perturbed adjacency matrix $\bar{A}_i$ is used as input to our first main component, called the Heat Diffusion Encoder, which generates a noisy representation, called $X(T)$. Our encoder produces outputs that are close to a uniform distribution in order to approximate maximum entropy (or lack of information). A detailed description of the heat diffusion encoder can be found in Section 3. Next, our second component, called the Denoising Decoder, takes $X(T)$ as input and is trained to predict an adjacency matrix that is as similar as possible to the original (unperturbed) matrix $A_i$. The whole training process is illustrated in Fig. 1.

**Sampling new graphs after training.** At inference time, an adjacency matrix is randomly generated and given as input to our autoencoder to generate a novel digraph. We discuss other continuous sampling strategies in Appendix G.2.

## 2 Preliminaries on the Heat Equation

This section introduces our notation and a self-contained introduction to heat diffusion on digraphs. These concepts will be used in Section 3 to define a closed-form expression for our heat diffusion encoder.
We refer the reader to Veerman & Lyons (2020) for more details on Laplacian matrices of digraphs, to Chung & Yau (1999); Kondor & Lafferty (2002); Belkin & Niyogi (2003) for heat diffusion kernel-based methods on undirected graphs, and to Appendix C for the details of the equations in this section and the next.

One main difference from other types of data, such as images, is that graphs lie in a discrete space whose topological information depends on the adjacency of its nodes. Moreover, graphs are often sparse in terms of adjacency. Due to the discrete nature of graphs, calculating similarities between nodes has been a challenge. To tackle this, heat kernel approaches (Chung & Yau, 1999; Kondor & Lafferty, 2002) express the notion of discrete local neighborhood of nodes in terms of a global similarity over the nodes of an undirected graph. More precisely, heat kernels are generalizations of Gaussian kernels where the Euclidean distance used in Gaussian kernels is replaced by a more general distance that takes into account the neighborhood information of the Laplacian matrix (Kondor & Lafferty, 2002). The information in node features is propagated along their neighbors via the heat equation. We propose to add a nonhomogeneous term that introduces noise along the heat diffusion process. We explain in Appendix E how our approach generalizes heat kernels to digraphs.

**Notation.** We denote the identity matrix by $I$ and the all-ones vector by $\mathbf{1}$. We consider a graph $G = (V, E)$ defined by its set of $n$ nodes $V = \{v_i\}_{i=1}^n$ and its set of edges $E \subseteq V \times V$. Its adjacency matrix $A \in \{0, 1\}^{n \times n}$ satisfies $A_{ij} = 1$ iff $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise. If $G$ is undirected, then we can simply consider that $A$ is symmetric (i.e., $\forall i, j,\ A_{ij} = A_{ji}$). However, we consider the more general case where $A$ is not constrained to be symmetric. Although optional, we add self-loops by constraining $A$ to satisfy $\forall i,\ A_{ii} = 1$.
The in-degree diagonal matrix $D \in \mathbb{R}_+^{n \times n}$ is defined so that $\forall i,\ D_{ii} = \sum_j A_{ji}$. In other words, we have $D := \mathrm{diag}(\mathbf{1}^\top A)$. Let us define the matrix $S := AD^{-1}$. $S$ is column stochastic (i.e., $\forall i,\ \sum_j S_{ji} = 1$ and $\forall i, j,\ S_{ij} \geq 0$). We consider in this section that we are given some matrix $N := X(0) \in \mathbb{R}^{n \times d}$ where the $i$-th row of $N$ is the initial $d$-dimensional feature representation of $v_i$ (i.e., $X(t)$ with $t = 0$). The matrix $N$ could be arbitrarily defined or given. In practice, we train $N$ jointly with our denoising decoder, and we explain its training process in Section 4 and Appendix B. In this section, we consider that $N$ is fixed.

**Laplacian dynamics.** We define the negative of the random walk Laplacian matrix as $L := S - I = AD^{-1} - I$. The matrix $L$ can be viewed as a matrix form of the discrete Laplace operator, which approximates the continuous Laplace-Beltrami operator in differential geometry (Belkin & Niyogi, 2003). Given a twice-differentiable real-valued function $f$ defined on some manifold, the Laplace-Beltrami operator is defined as the divergence of the gradient $\mathrm{grad} f$ and provides us with an estimate of how far apart $f$ maps nearby points. Since $L$ is not symmetric in general, we can use either $L$ or $L^\top$ as they have different left and right eigenvectors. We then denote $\Delta \in \{L, L^\top\}$. In the main paper, we consider only the case $\Delta = L$, which is called the diffusion model in the graph community (Veerman & Lyons, 2020). To avoid confusion with diffusion models in the machine learning literature, we call it heat diffusion. We give in Appendix D all the formulae to solve our problem when $\Delta = L^\top$ (called the consensus model (De Groot, 1974)).

**Heat equation.** The global information of the graph $G$ is diffused via the following heat equation:

$$\frac{d}{dt} X(t) = \Delta X(t) + Q(t), \quad (1)$$

where $X(t)$ is the representation of the nodes at time $t \geq 0$, $N = X(0)$, and $Q(t) \in \mathbb{R}^{n \times d}$ is a heat source term that introduces noise in the node representations. If $\forall t \geq 0,\ Q(t) = 0$, then $Q$ is called homogeneous. It is called nonhomogeneous otherwise.
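As a concrete illustration of these definitions, the matrices $D$, $S$ and $\Delta = L$ can be built in a few lines of NumPy; the 4-node digraph below is a hypothetical toy example, not one from the paper:

```python
import numpy as np

# Toy digraph on n = 4 nodes, with self-loops (A[i, i] = 1) and
# A[i, j] = 1 iff there is an edge (v_i, v_j).
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

# In-degree diagonal matrix D = diag(1^T A), i.e. D_ii = sum_j A_ji.
D = np.diag(A.sum(axis=0))

# Column-stochastic matrix S = A D^{-1}, and Delta = L = S - I,
# the negative of the random walk Laplacian.
S = A @ np.linalg.inv(D)
Delta = S - np.eye(4)

print(S.sum(axis=0))      # each column of S sums to 1
print(Delta.sum(axis=0))  # each column of Delta sums to 0
```

The column sums confirm the two properties used throughout the paper: $S$ is column stochastic, so each column of $\Delta$ sums to zero.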
The solution of equation 1 can be found in standard textbooks such as Edwards et al. (2020) and is written in equation 2. For any formulation of $Q$, equation 1 is solved by the following equation when $\Delta$ is constant over time:

$$\forall t \geq 0, \quad X(t) = e^{t\Delta} X(0) + \int_0^t e^{(t-s)\Delta} Q(s)\, ds = Z(t) + F(t), \quad (2)$$

where $e^{t\Delta}$ denotes the matrix exponential of the matrix $t\Delta$. To simplify notation, we define $Z(t) := e^{t\Delta} X(0)$ and $F(t) := \int_0^t e^{(t-s)\Delta} Q(s)\, ds$.

**Interpretation of the heat equation.** The above heat equation can be interpreted as a continuous way of performing message passing in the graph $G$ over continuous time $t$ by exploiting the neighborhood information included in $\Delta$. Nonetheless, our continuous message passing approach introduces noise due to the presence of the nonhomogeneous term. Each column of $N = X(0)$ contains some initial signal of the nodes, and the information of the Laplacian matrix and the noise are jointly diffused in those signals as $t$ increases, following the heat equation in equation 2. One difference from standard diffusion processes is the use of the global topological information of the graph via $e^{t\Delta}$ over time $t$. The noise is introduced via the term $F(t)$. If $Q$ is homogeneous, then $\forall t \geq 0,\ F(t) = 0$ (i.e., there is no noise) and equation 2 reduces to $\forall t \geq 0,\ X(t) = Z(t)$.

## 3 Heat Diffusion Encoder

As in standard denoising autoencoders (Vincent et al., 2008), we define a noising process that maps $X(0)$ to an informative representation close to some analytically tractable distribution (from which we can sample at inference time). To this end, we consider that equation 2 is the output of our heat diffusion encoder, which is given $X(0)$ and $\Delta$ as input and constructs a noisy representation $X(T)$, where $T > 0$ is an arbitrary time constant defined in Section 4.
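The homogeneous part $Z(t) = e^{t\Delta} X(0)$ of equation 2 can be computed directly with a matrix exponential. The sketch below uses a truncated Taylor series as a minimal stand-in for a library routine such as `scipy.linalg.expm`; the random graph and features are illustrative assumptions:

```python
import numpy as np

def expm(M, terms=60):
    """Truncated Taylor series for e^M (adequate for the small norms used
    here); a library routine such as scipy.linalg.expm would normally be used."""
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(0)
n, d = 5, 3
A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 1.0)                          # self-loops
Delta = A @ np.diag(1.0 / A.sum(axis=0)) - np.eye(n)

# Column-stochastic initial features N = X(0): normalize each column.
N = np.exp(rng.standard_normal((n, d)))
N /= N.sum(axis=0)

Z_1 = expm(1.0 * Delta) @ N   # Z(1) = e^{Delta} X(0)
```

Because $e^{t\Delta}$ is column stochastic (as discussed in the next section), `Z_1` again has nonnegative entries and unit column sums.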
In Proposition 1, we propose to define the heat source term $Q$ so that $X(T)$ is similar to a constant matrix, which we call $M$, that has all its elements equal to the same value, in order to approximate maximum entropy (or lack of information) in the node representations when $t = T$. In Section 4, we train a decoder that reconstructs the nodes and edges when given some noisy $X(T)$.

**Formulation of the heat source term.** Our goal is to formulate the heat source term $Q$ so that $X(T)$ tends to some non-informative matrix $M$ as $T$ tends to $+\infty$. To this end, we first note that the matrix $e^{t\Delta}$ is column stochastic for all $t \geq 0$ (see explanation in Appendix F). In order for $Z(t) = e^{t\Delta} N$ to also be column stochastic for all $t \geq 0$, we add the constraint that $N = X(0)$ is column stochastic (see proof in Appendix C.3). Since each column of the resulting node representations is a probability distribution vector, we define each column of $M$ as the uniform probability distribution vector, which corresponds to the maximum entropy probability distribution vector. In other words, we define the column stochastic uniform noise matrix as $M := \{\frac{1}{n}\}^{n \times d}$. By definition, the matrix $M$ is constant. Each column of $M$ also corresponds to the expected value of a random variable following a flat Dirichlet distribution.

**Proposition 1.** To satisfy $\lim_{T \to +\infty} X(T) = M$ where $M = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, we formulate $Q$ and $F$ as follows:

$$Q(s) := \alpha e^{-\alpha s} e^{s\Delta} \left( R - e^{\beta\Delta} X(0) \right) \implies F(t) = (1 - e^{-\alpha t})\, e^{t\Delta} \left( R - e^{\beta\Delta} X(0) \right), \quad (3)$$

where $\alpha > 0$ is a noise diffusivity rate hyperparameter, $\beta \geq 0$ is a hyperparameter that can be tuned to further control the Laplacian dynamics, and $R := e^{-T\Delta} M \in \mathbb{R}^{n \times d}$ is a constant matrix for some arbitrary time constant $T > 0$ defined in Section 4. See Appendix C for details and proofs.
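Proposition 1 can be sanity-checked numerically: integrating the nonhomogeneous heat equation 1 with the source term of equation 3 recovers the closed form at $t = T$. The sketch below assumes $\beta = 0$, uses forward Euler integration, and a Taylor-series matrix exponential; all concrete values (graph, $\alpha$, step size) are illustrative assumptions:

```python
import numpy as np

def expm(M, terms=60):
    # Truncated Taylor series for e^M; stand-in for scipy.linalg.expm.
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(0)
n, d, T, alpha = 4, 2, 1.0, 2.0
A = (rng.random((n, n)) < 0.5).astype(float)
np.fill_diagonal(A, 1.0)
Delta = A @ np.diag(1.0 / A.sum(axis=0)) - np.eye(n)
X0 = np.exp(rng.standard_normal((n, d))); X0 /= X0.sum(axis=0)
M = np.full((n, d), 1.0 / n)
R = expm(-T * Delta) @ M                          # R = e^{-T Delta} M

# Forward Euler integration of dX/dt = Delta X + Q(t) over [0, T], with
# Q(s) = alpha e^{-alpha s} e^{s Delta} (R - X0)  (equation 3, beta = 0).
# E_s tracks e^{s Delta} incrementally: e^{(k+1)h Delta} = e^{kh Delta} e^{h Delta}.
h = 1e-4
E_h, E_s, X = expm(h * Delta), np.eye(n), X0.copy()
for k in range(int(T / h)):
    Q_k = alpha * np.exp(-alpha * k * h) * (E_s @ (R - X0))
    X = X + h * (Delta @ X + Q_k)
    E_s = E_s @ E_h

# Closed form at t = T: X(T) = e^{-alpha T} Z(T) + (1 - e^{-alpha T}) M.
X_closed = np.exp(-alpha * T) * (expm(T * Delta) @ X0) + (1 - np.exp(-alpha * T)) * M
```

The two results match up to the Euler discretization error, confirming that the closed-form $F(t)$ in equation 3 indeed solves equation 1 with this source term.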
With the above proposition, $X(t)$ in equation 2 can be written only as a function of $X(0)$ and $\Delta$:

$$\forall t \geq 0, \quad X(t) = e^{t\Delta} X(0) + (e^{-\alpha t} - 1)\, e^{(t+\beta)\Delta} X(0) + (1 - e^{-\alpha t})\, e^{t\Delta} R. \quad (4)$$

If we set $\beta = 0$, then $e^{\beta\Delta} = I$ by definition of the matrix exponential, and we get a simpler formulation of equation 4: $\beta = 0 \implies \forall t,\ X(t) = e^{-\alpha t} Z(t) + (1 - e^{-\alpha t})\, e^{t\Delta} R$. In this case, we have the following simple formulation when $t = T$:

$$\beta = 0 \implies X(T) = e^{-\alpha T} Z(T) + (1 - e^{-\alpha T}) M, \quad (5)$$

which is column stochastic, and we call $1 - e^{-\alpha T}$ the noise ratio at time $T$. Since $\lim_{T \to +\infty} e^{-\alpha T} = 0$, one can verify that Proposition 1 satisfies our initial goal because equation 5 implies $\lim_{T \to +\infty} X(T) = M$.

**Backward formulation.** In the following, we consider that $\beta = 0$. However, our approach can be generalized to any $\beta \geq 0$. If $\Delta$ is given, we can write $X(t)$ as a function of $X(t + \tau)$ for any time step $\tau \geq 0$:

$$\beta = 0 \implies \forall t \geq 0, \quad X(t) = e^{\alpha\tau} e^{-\tau\Delta} X(t + \tau) + (1 - e^{\alpha\tau})\, e^{t\Delta} R. \quad (6)$$

However, we assume that the denoising decoder that we train in Section 4 does not have access to $\Delta$ at inference time. Therefore, the decoder has to reconstruct the set of nodes and/or edges when given only $X(T)$.

**Efficient forward process.** As explained in Vignac et al. (2023), we can build an efficient generative model for the following reasons: the noisy representation $X(t)$ has a closed-form expression that depends only on $X(0)$ and $\Delta$ (see equation 4), so we can directly calculate $X(t)$ for all $t \geq 0$ without adding noise iteratively; and the ground truth node representation $Z(t)$ can be written in closed form when given only $\Delta$ and either $X(0)$ or $X(t)$. In practice, we generate a perturbed adjacency matrix $\bar{A}$, from which we generate its Laplacian dynamics matrix $\bar\Delta$ and noisy representation $\bar{X}(T) = e^{-\alpha T} e^{T\bar\Delta} X(0) + (1 - e^{-\alpha T}) M$ that is given as input to a node decoder that has to reconstruct the ground truth node representation $Z(t) = e^{t\Delta} X(0)$ for some $t \in [0, T]$.
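A minimal sketch of the resulting encoder (equation 5 with $\beta = 0$) shows the interpolation between the informative $Z(T)$ and the uniform matrix $M$ as the noise ratio grows; the graph, features, and $\alpha$ values are illustrative assumptions:

```python
import numpy as np

def expm(M, terms=60):
    # Truncated Taylor series for e^M; stand-in for scipy.linalg.expm.
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(0)
n, d, T = 5, 3, 1.0
A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 1.0)
Delta = A @ np.diag(1.0 / A.sum(axis=0)) - np.eye(n)
N = np.exp(rng.standard_normal((n, d))); N /= N.sum(axis=0)
M = np.full((n, d), 1.0 / n)              # maximum-entropy limit

def encoder(alpha):
    """Closed-form noisy representation X(T) of equation 5 (beta = 0)."""
    w = np.exp(-alpha * T)                # 1 - w is the noise ratio at time T
    return w * (expm(T * Delta) @ N) + (1 - w) * M

# Larger alpha -> larger noise ratio -> X(T) closer to M.
dists = [np.linalg.norm(encoder(a) - M) for a in (0.1, 2.3, 10.0)]
```

With these (assumed) settings, `dists` is strictly decreasing, each `encoder(alpha)` is column stochastic, and every entry stays in the interval $\left[\frac{1 - e^{-\alpha T}}{n}, \frac{1 + (n-1) e^{-\alpha T}}{n}\right]$.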
This data augmentation technique acts as a regularizer: the model learns to reconstruct node representations similar to the training distribution when given a noisy input. The limit distribution $\lim_{T \to +\infty} X(T) = M$ does not depend on $X(0)$. It is also worth noting that all the elements of $X(T)$ lie in the interval $\left[\frac{1 - e^{-\alpha T}}{n}, \frac{1 + (n-1) e^{-\alpha T}}{n}\right]$. In practice, we choose appropriate values of $T > 0$ and $\alpha > 0$ so that sufficient information about the graph is preserved in the noisy node representation, and so that denoised edges can be recovered from it. Specifically, we diffuse toward a distribution that can be well approximated by an analytic distribution (e.g., we can sample from a (flat) symmetric Dirichlet distribution (Kotz et al., 2004)) while preserving sufficient information about $X(0)$ to perform denoising. Moreover, $X(t)$ is column stochastic when $t = 0$ and $t \geq T$, but $X(t)$ might contain some negative elements when $t \in (0, T)$ due to the formulation of $R$. This is not a problem in practice since our goal is to reconstruct $Z(t)$, which is column stochastic for all $t \geq 0$.

## 4 Denoising Decoder and Sampling

The previous section defines a noising heat diffusion encoder that does not require learning a neural network: given an initial node representation matrix $N = X(0)$ and a Laplacian dynamics matrix $\Delta$, it generates in closed form a noisy representation at some arbitrary time $T > 0$. We now propose a multi-task learning formulation to train decoders that reconstruct the original input. We assume that the decoders are not directly given the Laplacian matrix $\Delta$ at inference time. Our first task learns a neural network (called the node decoder) that predicts the denoised node representation $Z(t)$ where $t \in [0, T]$ (we recall that $Z(0) = X(0)$). Our second and last task jointly learns another neural network (called the edge decoder) to predict edges.
This is similar to the way link prediction tasks are solved, and we observe in practice that the learned representations hold information from the graph in the form of the Laplacian singular vectors. In Section 4.3, we propose a sampling strategy to generate digraphs at inference time.

### 4.1 Training of the denoising decoders and data augmentation

**Setup.** We consider the task where, during training, we are given $m$ digraphs $\{G_i = (V_i, E_i)\}_{i=1}^m$ drawn from some distribution $\mathcal{G}$ and of different sizes. Our goal is to generate graphs that follow the same distribution. Each graph $G_i$ is represented by its adjacency matrix $A_i \in \{0, 1\}^{n_i \times n_i}$ where $n_i := |V_i|$ is the number of nodes. The Laplacian dynamics matrix of $G_i$ is $\Delta_i = A_i\, \mathrm{diag}(\mathbf{1}^\top A_i)^{-1} - I$ (i.e., the negative of the random walk Laplacian, following Section 2).

**Node representation.** We explain here how we define the node representations of the training graphs. Let $n_{\max} := \max_i n_i$ denote the number of nodes of the largest graph in the training set. We define some matrix $O \in \mathbb{R}^{n_{\max} \times d}$ where $d > 0$ is an arbitrary hyperparameter. For each graph $G_i$, we define its column stochastic initial node representation matrix $N_i = X_i(0) \in [0, 1]^{n_i \times d}$ as the upper submatrix of $O$ whose columns are $\ell_1$-normalized with the softmax operator, which corresponds to applying a mask and renormalizing. To simplify the notation, we write $N$ instead of $N_i$ since we consider that all graphs of the same size share the same initial node representation. $N$ is trained by training $O$ (see details about their training in Appendix B).

**Algorithm 1: Generation of digraphs at inference time**

Input: node representations $O \in \mathbb{R}^{n_{\max} \times d}$, hyperparameters $T, \alpha > 0$ and $\mu, \rho \in [0, 1]$
1. Sample $n \leq n_{\max}$. Define $N \in \mathbb{R}^{n \times d}$ as the upper submatrix of $O$ followed by a softmax operation so that its columns are $\ell_1$-normalized.
2. Generate a discrete adjacency matrix $A \in \{0, 1\}^{n \times n}$ such that $\forall i \neq j,\ A_{ij} \sim \mathrm{Bernoulli}(\mu)$ and $\forall i,\ A_{ii} = 1$.
3. Apply data augmentation to obtain a perturbed matrix $\bar{A}$ (e.g., $\bar{A} = A \oplus C$ s.t.
$\forall i \neq j,\ C_{ij} \sim \mathrm{Bernoulli}(\rho)$).
4. Calculate the diagonal matrix $D \in \mathbb{R}_+^{n \times n}$ such that $D_{ii} = \sum_j \bar{A}_{ji}$. Define $\bar\Delta := \bar{A} D^{-1} - I$.
5. Define $B$ as $e^{T\bar\Delta}$ or, optionally, as the rank-$s$ approximation of $e^{T\bar\Delta}$ via truncated SVD.
6. Give the matrix $e^{-\alpha T} B N + (1 - e^{-\alpha T}) M$ as input to the edge decoder, which returns an adjacency matrix.

We arbitrarily define the values of $T > 0$ and $\alpha > 0$ so that the noise ratio $(1 - e^{-\alpha T})$ is close to 1 (see Fig. 2). Following equation 5, we define $X_i(T) := e^{-\alpha T} e^{T\Delta_i} N + (1 - e^{-\alpha T}) M$. In practice, we apply data augmentation on $X_i(T)$ during training as explained below.

**Edge perturbation (Ding et al., 2022).** One can apply data augmentation via edge perturbation, which can be interpreted as injecting noise by considering the perturbed adjacency matrix $\bar{A}_i = A_i \oplus C$ instead of $A_i$, where $\oplus$ is the logical XOR operator, and the zero-diagonal corruption matrix $C \in \{0, 1\}^{n_i \times n_i}$ has its non-diagonal elements equal to 1 with probability $\rho \in [0, 1]$, and 0 with probability $(1 - \rho)$. Following Veličković et al. (2019), we set $\rho \leq 1/n_i$ and we sample a new matrix $C$ each time it is used. We call $\bar\Delta_i := \bar{A}_i\, (\mathrm{diag}(\mathbf{1}^\top \bar{A}_i))^{-1} - I$ the Laplacian dynamics matrix obtained from $\bar{A}_i$. We obtain the formulations $\bar{Z}_i(T) := e^{T\bar\Delta_i} N$ and $\bar{X}_i(T) := e^{-\alpha T} \bar{Z}_i(T) + (1 - e^{-\alpha T}) M$. If $\rho = 0$, we have $\bar{X}_i(T) = X_i(T)$.

**Permutation of adjacency matrices.** $N$ is the same for all graphs of the same size. To promote permutation invariance of our model, we can replace $A_i$ and $\Delta_i$ by $P A_i P^\top$ and $P \Delta_i P^\top$, where $P \in \{0, 1\}^{n_i \times n_i}$ is a (randomly sampled) permutation matrix. This is equivalent to replacing $e^{T\Delta_i} N$ by $P e^{T\Delta_i} P^\top N$. This can be seen as augmenting the training set with adjacency matrices of isomorphic digraphs. In Appendix G.3, we experimentally show that using this kind of data augmentation technique does not have a negative impact on the optimization of our loss function. $\bar{X}_i(T)$ is given as input to a node decoder $\varphi$ and an edge decoder $\psi$ during training, as described below.

**Node decoding task.**
Our node decoder $\varphi$ takes the noisy node representation $\bar{X}_i(T)$ as input, and its goal is to reconstruct some target node representation matrix $T_i$ that does not contain noise. In practice, we formulate the training loss of our node decoder as $\mathcal{L}_{\mathrm{node}}(i) := \|\varphi(\bar{X}_i(T)) - T_i\|_F^2$ where $\|\cdot\|_F$ is the Frobenius norm, and we arbitrarily define $T_i := Z_i(1) = e^{\Delta_i} N$ where $\Delta_i$ is the ground truth Laplacian dynamics matrix. Since each row of $\bar{X}_i(T)$ represents a node of the graph, we ideally want our model to be equivariant to the order of the rows of $\bar{X}_i(T)$. For this reason, we formulate $\varphi$ as an attention-based permutation-invariant neural network called the Set Transformer (Lee et al., 2019). $\varphi$ considers each row of $\bar{X}_i(T)$ as an element of a set of node representations, and it is robust to the order of the rows. Implementation details can be found in Appendix B.

**Edge decoding task.** We call our edge decoder $\psi$, and we denote the $p$-th row of $\psi(\bar{X}_i(T))$ by $\psi(\bar{X}_i(T))_p$. Our edge decoder predicts whether or not there exists a directed edge between pairs of nodes. We formulate the term:

$$\mathcal{L}_{\mathrm{edge}}(i) := \sum_{p \neq q} H\left( \omega\left[ \psi(\bar{X}_i(T))_p,\ \psi(\bar{X}_i(T))_q \right],\ A_{i,pq} \right)$$

where $[\cdot, \cdot]$ denotes concatenation, $\omega$ is a learned multilayer perceptron (MLP), and $H$ is the cross-entropy loss. Our edge decoder and node decoder share a common backbone (see architecture details in Appendix B). It is worth noting that if the goal is to generate undirected graphs, then the concatenation operation can be replaced by a symmetric operation such as addition. The training loss that we minimize is:

$$\sum_{i=1}^m \mathcal{L}_{\mathrm{edge}}(i) + \gamma\, \mathcal{L}_{\mathrm{node}}(i), \quad (7)$$

where $\gamma \geq 0$ is a regularization parameter. Since both $T_i$ and $\bar{X}_i(T)$ depend on $N$, we optimize equation 7 by jointly training $\varphi$, $\psi$ and $N$ via gradient descent in order to reconstruct the training graphs. See Appendix B for implementation details.

### 4.2 Other considerations

**Class-conditional generation.**
To add class label information, we give as input to both decoders the concatenation of a matrix $Y_i \in \{0, 1\}^{n_i \times |C|}$ to $\bar{X}_i(T)$, where $|C|$ is the number of categories and each row of $Y_i$ is a one-hot vector whose nonzero index corresponds to the category of the graph $G_i$. This sampling strategy is known as conditional sampling (Zhu et al., 2022). The rest of the method is unchanged.

Figure 2: The noise ratio $f(T) = 1 - e^{-\alpha T}$ in $X(T)$ as a function of $T$ for different values of $\alpha$ ($\alpha \in \{5, 2, 1, 0.5, 0.25, 0.1\}$).

**Choice of the final step $T$.** The matrix $\bar{X}_i(T)$ is given as input to the decoders to reconstruct $G_i$. We ideally want $\bar{X}_i(T) = e^{-\alpha T} \bar{Z}_i(T) + (1 - e^{-\alpha T}) M$ to be similar to the matrix $M$. This similarity depends on both $T$ and $\alpha$, and $\bar{X}_i(T)$ tends to $M$ as $T$ or $\alpha$ tends to $+\infty$. We provide a detailed discussion of the impact of $T$ and $\alpha$ in Appendix F. We found that setting $T = 1$ and choosing $\alpha$ large enough works well in practice (e.g., $\alpha = 2.3$ implies $1 - e^{-\alpha T} \approx 0.9$, which means that about 90% of the values of $\bar{X}_i(T)$ are noise). However, the optimal values of both $T$ and $\alpha$ can be determined via cross-validation depending on the task. Fig. 2 illustrates the noise ratio for different values of $\alpha$ as a function of $T$. It is worth noting that we want $\bar{X}_i(T)$ to be similar to $M$ so that sampling a similar matrix at inference time is easy. On the other hand, we also want $\bar{X}_i(T)$ to preserve enough information so that our neural networks can reconstruct $T_i$ and $A_i$ (i.e., the node and edge information of the graph) from it.

**Learning $N$.** It is known in the heat kernel literature (Belkin & Niyogi, 2003; Chung & Yau, 1999) that the coarse structure of an undirected graph is contained in the subset of eigenvectors of its Laplacian matrix that correspond to its smallest eigenvalues, and this can be used for dimensionality reduction. PageRank (Page, 1998) uses the same motivation and exploits the leading eigenvector of a nonsymmetric stochastic matrix.
Spectre (Martinkus et al., 2022) exploits this observation by generating a symmetric Laplacian matrix spanned by a set of leading eigenvectors with bounded real eigenvalues. Since our eigenvalues and eigenvectors are usually complex and not unitary (when the adjacency matrix is not symmetric), we consider related linear algebra properties such as column spaces and singular vectors, which allow us to work with real values. We experimentally observe in Section 6.1 that the leading singular vectors of the learned matrix $Z_i(t) = e^{t\Delta_i} N$ and of $e^{t\Delta_i}$ tend to be strongly correlated, which suggests that our model learns $N$ so that it maps to the leading left singular vectors of $e^{t\Delta_i}$. This observation is mainly qualitative. We also observed that using a large number of columns $d$ to represent $N$ helps in practice to recover edges.

**Approximations.** We show in Appendix E that our approach corresponds to a non-symmetric heat kernel method induced in a Reproducing Kernel Banach Space (RKBS) with a column stochastic and non-symmetric kernel matrix $K = e^{-\alpha T} e^{T\Delta_i} + \frac{1 - e^{-\alpha T}}{n_i} \mathbf{1}\mathbf{1}^\top$. $K$ has the same eigenvectors as the Laplacian dynamics matrix $\Delta_i$ when $\alpha = 0$ (see Appendix F), and it then contains the structure of the graph. To scale our method to large graphs, we also propose to replace $e^{t\Delta_i}$ by its rank-$s$ approximation obtained with truncated Singular Value Decomposition (SVD), where $s \leq n_i$. We replace $e^{t\Delta_i}$ by the product of two rectangular rank-$s$ matrices, which greatly reduces memory usage when $s \ll n_i$. Although the obtained matrix is not stochastic when $s < n_i$, we observe in Section 6.2 that the training graphs can be fully reconstructed with our model while saving memory. The SVD can be preprocessed offline before training. Gaussian kernels are specific heat kernels that use a squared Euclidean distance, and the relationship between heat kernels and Gaussian Processes (GPs) is known in the literature (Kondor & Lafferty, 2002).
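The rank-$s$ approximation can be sketched as follows; the sizes are illustrative assumptions, and a truncated Taylor series again stands in for a library matrix exponential:

```python
import numpy as np

def expm(M, terms=80):
    # Truncated Taylor series for e^M; stand-in for scipy.linalg.expm.
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(0)
n, s = 40, 10
A = (rng.random((n, n)) < 0.15).astype(float)
np.fill_diagonal(A, 1.0)
Delta = A @ np.diag(1.0 / A.sum(axis=0)) - np.eye(n)
K = expm(Delta)                     # dense n x n propagator e^{Delta}

# Truncated SVD: keep two rectangular rank-s factors (2 n s floats)
# instead of the dense n x n matrix (n^2 floats).
U, sig, Vt = np.linalg.svd(K)
left, right = U[:, :s] * sig[:s], Vt[:s]
K_s = left @ right

# By Eckart-Young, the Frobenius error of the best rank-s approximation
# equals the l2 norm of the discarded singular values.
err = np.linalg.norm(K - K_s, "fro")
```

Only the two factors `left` and `right` need to be stored and multiplied against $N$, which is where the memory saving comes from when $s \ll n$.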
Using a truncated SVD corresponds to a low-rank approximation of the kernel matrix w.r.t. the Frobenius norm, which is similar to one of the approximation methods for GPs discussed in Williams & Rasmussen (2006, Chapter 8.1).

### 4.3 Sampling strategy for generation of novel digraphs

Our sampling algorithm to generate novel digraphs is given in Algorithm 1, while our training algorithm is detailed in Algorithm 2. During sampling at inference time, we do not have access to input graphs from the dataset. We therefore construct graphs that, when given to our heat diffusion encoder, diffuse toward noisy graphs similar to those encountered after diffusing training graphs. In this way, our denoising decoders can successfully produce denoised graphs similar to those of the dataset. This is conceptually similar to variational autoencoders (Kingma & Welling, 2014), where during sampling the encoding distribution is approximated by a simple prior distribution.

How can we then analytically construct suitable input graphs at inference time? One solution is to generate a matrix with each column sampled from a flat Dirichlet distribution (Kotz et al., 2004) and give it as input to the decoders to generate a digraph. This works well when the training graphs (of a given category) are all similar to each other. However, it was observed in Vignac et al. (2023) that this kind of continuous sampling tends to destroy the graph's sparsity and creates very noisy graphs in practice. When the distribution of the graphs is multimodal, we found that sampling discrete adjacency matrices and applying standard data augmentation techniques for graphs, both during training and sampling, allows our model to sample graphs from the different modes. We study continuous sampling strategies in Appendix G.2. We propose one discrete sampling algorithm in Algorithm 1.
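Algorithm 1 can be sketched end to end as follows. The decoder here is a trivial thresholding stand-in for the trained edge decoder (which the paper implements as a neural network), and all hyperparameter values are illustrative assumptions:

```python
import numpy as np

def expm(M, terms=60):
    # Truncated Taylor series for e^M; stand-in for scipy.linalg.expm.
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def sample_digraph(O, edge_decoder, T=1.0, alpha=2.3, mu=0.3, rho=0.05, rng=None):
    """Sketch of Algorithm 1: sample a random input digraph, diffuse it with
    the closed-form heat encoder, and let the edge decoder denoise it."""
    if rng is None:
        rng = np.random.default_rng()
    n_max, d = O.shape
    n = int(rng.integers(2, n_max + 1))            # step 1: sample n <= n_max
    N = np.exp(O[:n]); N /= N.sum(axis=0)          # column softmax of upper submatrix
    A = (rng.random((n, n)) < mu).astype(int)      # step 2: Bernoulli(mu) edges
    np.fill_diagonal(A, 1)
    C = (rng.random((n, n)) < rho).astype(int)     # step 3: edge perturbation (XOR)
    np.fill_diagonal(C, 0)
    A_bar = A ^ C
    D = np.diag(A_bar.sum(axis=0).astype(float))   # step 4: Delta = A D^{-1} - I
    Delta = A_bar @ np.linalg.inv(D) - np.eye(n)
    M = np.full((n, d), 1.0 / n)
    w = np.exp(-alpha * T)                         # steps 5-6: noisy decoder input
    X_T = w * (expm(T * Delta) @ N) + (1 - w) * M
    return edge_decoder(X_T)

# Usage with a trivial stand-in decoder (thresholded pairwise similarity):
O = np.random.default_rng(0).standard_normal((8, 5))
fake_decoder = lambda X: (X @ X.T > np.median(X @ X.T)).astype(int)
A_gen = sample_digraph(O, fake_decoder, rng=np.random.default_rng(1))
```

In the actual method, `edge_decoder` would be the trained network $\psi$ (with the pair scorer $\omega$), which turns the noisy node representation into a binary adjacency matrix.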
Let $\mu \in (0, 1]$ denote the ratio of pairs of distinct nodes that are adjacent in the training set. We first generate an adjacency matrix $A \in \{0, 1\}^{n \times n}$ such that each of its non-diagonal elements is assigned the value 1 with probability $\mu$, and 0 otherwise. Following the motivation of denoising autoencoders, our decoders are trained to construct an (unperturbed) sample similar to training samples when given some noisy input. In Appendix G.2, we observe that our discrete sampling strategy is competitive with other continuous sampling strategies in terms of performance.

## 5 Related Work

Graph generative approaches can be divided into two categories: auto-regressive models and one-shot models. Auto-regressive models (Liao et al., 2019; You et al., 2018) generate a succession of graphs $G_1, G_2, \dots, G_T$ such that $\forall i,\ G_i \subseteq G_{i+1}$, and return the last generated graph $G_T$. At each iteration, the graph $G_i$ is given as input to a neural network that generates $G_{i+1}$ by adding new nodes and their edges. Most of these models are typically slower than one-shot approaches, which generate all the nodes and edges of a graph in a single step. The three main one-shot approaches in the machine learning literature are Top-n (Vignac & Frossard, 2022), Spectre (Martinkus et al., 2022) and DiGress (Vignac et al., 2023). Other one-shot methods such as (Kwon et al., 2020; Mercado et al., 2021) are dedicated to molecular graphs and do not generalize to other tasks. Although Top-n is one-shot, it assumes symmetric similarity functions between nodes that are inappropriate for digraphs.

Spectre (Martinkus et al., 2022) considers the generation of undirected graphs via their normalized graph Laplacian matrix $L_n := I - D^{-1/2} A D^{-1/2}$, which is symmetric positive semi-definite and admits an eigendecomposition of the form $U^{-1} \Lambda U$ where $U^{-1} = U^\top$ and both the diagonal matrix $\Lambda$ and $U$ are real-valued.
They exploit the intuition that the coarse structure of the graph lies on a Stiefel manifold that contains the eigenvectors of the $k$ smallest eigenvalues of the Laplacian. Spectre (Martinkus et al., 2022) then trains a neural network that generates an adjacency matrix by sampling the $k$ smallest eigenvalues and their corresponding eigenvectors while exploiting their Stiefel manifold structure. The authors mention that their work can be extended to the random-walk Laplacian matrix $L_r := I - AD^{-1}$ since its right eigenvectors are the same as $D^{1/2}U$ (up to column-wise $\ell_2$ normalization), and its left eigenvectors can be formulated in a similar way. However, when the graph is directed and $A$ is not symmetric, $U$ is complex and not unitary. The information of the Laplacian matrix then does not lie on a complex Stiefel manifold, and Spectre can therefore not easily be extended to digraphs. DiGress (Vignac et al., 2023) is a denoising diffusion model for graphs. Instead of using the discrete Laplacian operator as we propose, they represent their nodes as a function of time $t \in \mathbb{N}$ as follows: $X(t) = (\alpha_t I + (1 - \alpha_t) d^{-1} \mathbf{1}\mathbf{1}^\top) X(t-1)$ where $X(t)$ is row-stochastic and $\alpha_t \in (0, 1)$. DiGress relies on spectral features from Beaini et al. (2021) designed for undirected graphs by using a symmetric similarity function between nodes. It is therefore not appropriate for digraphs.

[Figure 3: Correlations between the left singular vectors of $e^{-t\Delta_i}$ and $e^{-t\Delta_i}N$; (left) Erdős-Rényi graphs, (right) stochastic block model graphs (more results in Fig. 4).]

Diffusion processes have been used in score-based generative models (Ho et al., 2020; Sohl-Dickstein et al., 2015; Song et al., 2021) to generate undirected graphs (Jo et al., 2022; Niu et al., 2020) or adjacency matrices in general (Yan et al., 2024). Nonetheless, both Jo et al. (2022) and Niu et al.
(2020) exploit Graph Neural Network architectures that require a symmetric adjacency matrix, so these methods are not easily adaptable to digraphs. Moreover, both approaches perturb each data dimension independently (i.e., without accounting for the global structure of the graph via its Laplacian). On the other hand, SwinGNN (Yan et al., 2024) considers diffusion models to reconstruct arbitrary adjacency matrices that do not have to be symmetric. Generally, in such score-based generative diffusion models, the input converges towards an entirely uninformative pure noise distribution, and generation requires a slow and iterative denoising process. In contrast, our approach encodes the global structure of the graph into the node representations via the Laplacian dynamics and effectively encodes the graph topology in a noisy, yet informative distribution at $t = T$, from which the denoised nodes and edges can be efficiently predicted in one shot. In other words, in contrast to score-based diffusion models, where the role of the diffusion process is purely to gradually perturb the data and destroy information, in our model the primary role of the (Laplacian) diffusion process is to additionally encode the graph structure into the node representations. Heat diffusion for the purpose of information propagation in that fashion has, for instance, been used for learning on curved 3D surfaces (Sharp et al., 2022). Our method thus differs from the widely used score-based diffusion models and is specifically designed to avoid their slow, iterative synthesis process. Instead, our model can be seen as a denoising autoencoder (Vincent et al., 2008), since it corrupts the input data with optional edge perturbation and with a nonhomogeneous heat diffusion process.
During sampling, we also approximate the encoding distribution at $t = T$ by an analytically tractable distribution, which is similar to variational autoencoders (Kingma & Welling, 2014), where a simple prior distribution in latent space models the encoding distribution. Hence, our approach can be seen as related to GraphVAE (Simonovsky & Komodakis, 2018), which, however, uses learned encoder neural networks instead of a heat diffusion process to encode small graphs in a latent space. Moreover, GraphVAE only tackles undirected graph synthesis. We emphasize this connection by proposing a sampling method that directly samples from a flat Dirichlet distribution in Appendix G.2. It is worth noting that it was shown in Vincent (2011) that score matching techniques (Hyvärinen & Dayan, 2005) can be seen as training a specific type of denoising autoencoder. Our approach could be adapted to a score-based generative model with iterative synthesis. This could be done by replacing our deterministic nonhomogeneous heat source term by a Wiener process using a Dirichlet diffusion score as done, for example, in Avdeyev et al. (2023). However, the method would be computationally expensive for a large number of nodes since (1) each node would be associated with a different beta distribution, (2) there exists no known closed-form solution for this kind of linear stochastic differential equation, and (3) each intermediate time step would require calculating some matrix exponential involving $\Delta$ in our case. Instead, we propose a framework similar to Martinkus et al. (2022) to efficiently work with singular vectors of the Laplacian matrix and generate (directed) graphs in a one-shot manner.

6 Experiments

We evaluate the generative power of our method, which we call Digraph Generation with Diffusion Kernels (DGDK), focusing on digraphs only, for which our method was designed. We use Adam (Kingma & Ba, 2014) as optimizer.
DGDK works better as the number of columns $d$ of $N$ becomes larger. Experimental details can be found in Appendix B. Due to lack of space, we add other experiments in Appendix G.

6.1 Column space of the learned representations

We adapt the intuition of Spectre (Martinkus et al., 2022) to digraphs. To this end, we experimentally observe in this subsection that our learned representations are strongly correlated with the leading singular vectors of $e^{-t\Delta_i}$. In our first qualitative experiment, we apply conditional sampling (Zhu et al., 2022) and consider two categories of digraphs: Erdős-Rényi (Erdős et al., 1960) with $p = 0.6$, and stochastic block model (Holland et al., 1983) with 3 main blocks that are connected to each other by following the transition probability matrix:

$$\Pi = \begin{pmatrix} 0.9 & 0.35 & 0.2 \\ 0.2 & 0.75 & 0.2 \\ 0.2 & 0.25 & 0.8 \end{pmatrix}$$

The above transition matrix $\Pi$ between blocks cannot correspond to undirected graphs since undirected graphs would require $\Pi$ to be symmetric. Each category contains $m = 100$ training graphs with $n = 15$ nodes each, and we set the number of columns of $N$ to $d = 150$. The size of the first two blocks is $n/4$ each, and the last block contains the remaining $n/2$ nodes. In this setup, our training graphs are directed and their adjacency matrices are not symmetric. As explained in Section 5, standard baselines such as Spectre cannot be exploited. If we define $T_i = e^{-\Delta_i} N \in [0, 1]^{n \times d}$ (i.e., we consider $t = 1$), the goal of the node decoder in Section 4.1 is to reconstruct $T_i$, whose column space is by definition included in the column space of $e^{-\Delta_i}$ (i.e., spanned by its columns). $T_i$ is in general not a square matrix (i.e., $n \neq d$) and thus does not possess eigenvectors. We then consider the SVD $e^{-t\Delta_i} N = U_1 \Lambda_1 V_1^\top$, and we write the SVD $e^{-t\Delta_i} = U_2 \Lambda_2 V_2^\top$. The singular values in both $\Lambda_1$ and $\Lambda_2$ are ordered in descending order. Figure 3 illustrates the absolute values of the matrix product $U_2^\top U_1$.
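As a sketch, the correlation matrix shown in Figure 3 can be computed as follows (toy directed graph and random $N$; all variable names are ours):

```python
import numpy as np
from scipy.linalg import expm

def singular_vector_correlations(Delta, N, t=1.0):
    """Return |U2^T U1| where U1 holds the left singular vectors of
    exp(-t*Delta) @ N and U2 those of exp(-t*Delta); entry (j, k) is
    the absolute cosine between the j-th left singular vector of
    exp(-t*Delta) and the k-th one of exp(-t*Delta) @ N."""
    E = expm(-t * Delta)
    U1, _, _ = np.linalg.svd(E @ N)
    U2, _, _ = np.linalg.svd(E)
    return np.abs(U2.T @ U1)

# toy directed graph: each node i points to nodes i+1 and i+2 (mod n)
n, d = 6, 10
A = np.zeros((n, n))
for i in range(n):
    A[(i + 1) % n, i] = A[(i + 2) % n, i] = 1.0
Delta = np.eye(n) - A / A.sum(axis=0, keepdims=True)  # random-walk Laplacian
rng = np.random.default_rng(0)
N = rng.random((n, d))
N /= N.sum(axis=0, keepdims=True)  # column-stochastic N
C = singular_vector_correlations(Delta, N)
```

Since the columns of both singular vector matrices are unit vectors, every entry of `C` lies in $[0, 1]$.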
Since both $U_1$ and $U_2$ have $\ell_2$-normalized columns, each element of $U_2^\top U_1$ is the cosine between two singular vectors. A cosine of 0 indicates orthogonality, hence independence, whereas higher absolute values indicate (cosine) correlations. The top left part of each plot corresponds to the leading singular vectors whereas the bottom right corresponds to the singular vectors with lower singular values. As one can see, there is a strong cosine correlation between the leading singular vectors of $e^{-t\Delta_i}$ and of $e^{-t\Delta_i}N$. This suggests that $N$ is learned so as to preserve the most informative singular vectors of $e^{-t\Delta_i}$. We exploit this observation in Section 6.2 by working with a low-rank approximation of $e^{-t\Delta_i}$.

6.2 Using approximations of the target matrix

We evaluate the generative power of DGDK in the class-conditional generation task. We use a rank-$s$ approximation of $e^{-T\Delta_i}$ via a truncated SVD, and we formulate our target node matrix $T_i = Z_i(T)$ where $T = 1$ and $\alpha = 2.3$. We consider the case where the number of nodes satisfies $\forall i, n_i = 21$, and $s = 15$. The two categories follow the same properties as in Section 6.1, and contain 3,000 non-isomorphic training graphs per category; only the number of nodes per graph is different. During training, each mini-batch contains 10 graphs per category. DGDK manages to reconstruct all the edges of the training graphs when given noisy representations $X_i(T)$. This shows the effectiveness of our low-rank approximation approach. We sample 10,000 test graphs per category by using Algorithm 1 with class-conditional generation (i.e., we provide the desired category as input). None of the generated test graphs are isomorphic to one another nor to the training graphs. This means that our model obtains uniqueness and novelty scores of 100%. We also report in Table 1 standard evaluation metrics based on the squared Maximum Mean Discrepancy (MMD) (O'Bray et al., 2022) between the training set and the test set.
These evaluation metrics measure the distance between the training and test distributions w.r.t. some graph properties. We adapt the descriptor functions in O'Bray et al. (2022) to directed graphs.

Table 1: Squared MMD distances.

                              Erdős-Rényi (p = 0.6)             Stochastic block model (3 blocks)
MMD metric                    Degree    Clustering  Spectrum    Degree    Clustering  Spectrum
DGDK (ours)                   1.1e-4    1.0e-3      1.3e-5      1.2e-4    2.0e-4      6.2e-6
SwinGNN (Yan et al., 2024)    1.3e-4    2.4e-2      4.2e-3      1.2e-4    3.7e-2      5.1e-3
GRAN (Liao et al., 2019)      1.5e-4    5.6e-2      1.9e-4      1.3e-4    8.3e-2      3.6e-4

Degree distribution histogram. Given a graph $G = (V, E)$, we create an $n$-dimensional histogram by evaluating the in-degree $\deg(v)$ for $v \in V$. The $i$-th position of the resulting histogram is the number of nodes with in-degree $i$. We $\ell_1$-normalize the histogram so that it sums to 1.

Clustering coefficient. The local clustering coefficient of a node $v_i$ in a directed graph is formulated as $C_i := \frac{|\{(v_j, v_k) \in E \,:\, v_j \in N_i, v_k \in N_i\}|}{|N_i|(|N_i| - 1)} \in [0, 1]$ where $N_i = \{v_j : (v_i, v_j) \in E \text{ or } (v_j, v_i) \in E\}$. It measures to what extent $v_i$ forms a clique. The different values $C_1, \ldots, C_n$ are binned into a $b$-dimensional histogram. We set $b = 100$.

Laplacian spectrum. In the directed case, the eigenvalues of $S = I - L$ are complex, but their absolute values are upper bounded by 1 since $S$ is column stochastic. We bin their absolute values into a 100-dimensional histogram.

We report in Table 1 the scores of DGDK, GRAN (Liao et al., 2019) (the state-of-the-art auto-regressive baseline that can be extended to digraphs), and SwinGNN (Yan et al., 2024) (a diffusion approach that can be applied to arbitrary adjacency matrices). It is worth noting that we train a different GRAN and SwinGNN model for each category instead of using a class-conditional generation approach.
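The directed degree descriptor and the MMD estimate can be sketched as follows (a simplified illustration with our own variable names; the in-degree convention and the exact RBF bandwidth parameterization are our assumptions):

```python
import numpy as np

def in_degree_histogram(A):
    """n-dimensional l1-normalized histogram whose i-th entry is the
    fraction of nodes with in-degree i. We assume the convention that
    A[u, v] = 1 encodes an edge v -> u, so in-degrees are row sums."""
    n = A.shape[0]
    indeg = A.sum(axis=1).astype(int)
    hist = np.bincount(indeg, minlength=n)[:n].astype(float)
    return hist / hist.sum()

def squared_mmd(X, Y, sigma=10.0):
    """Biased estimate of the squared MMD between two sets of
    descriptor vectors under an RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    def gram(P, Q):
        d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

def random_digraph(n, p, rng):
    A = (rng.random((n, n)) < p).astype(int)
    np.fill_diagonal(A, 0)
    return A

rng = np.random.default_rng(0)
graphs_a = [random_digraph(15, 0.6, rng) for _ in range(8)]
graphs_b = [random_digraph(15, 0.6, rng) for _ in range(8)]
Da = np.stack([in_degree_histogram(A) for A in graphs_a])
Db = np.stack([in_degree_histogram(A) for A in graphs_b])
mmd = squared_mmd(Da, Db)
```

The biased estimator above equals the squared distance between the empirical kernel mean embeddings, so it is always nonnegative and vanishes when the two descriptor sets coincide.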
Our MMD scores are close to 0, which means that the training and test graphs follow similar distributions. Moreover, DGDK outperforms both baselines on the clustering and spectrum evaluation metrics. This suggests that learning the global structure of the graph via its Laplacian in a single shot is beneficial for generation (for example compared to GRAN, which sequentially considers multiple local subproblems).

6.3 Additional experiments

We present additional experiments in Appendix G. In Appendix G.2, we consider an experiment similar to the one in Section 6.2 with digraphs containing different numbers of nodes (from 180 to 200). We compare different sampling strategies and show that strategies that exploit the learned matrix $N$ tend to perform better. In Appendix G.4, we consider a case where the training distribution contains multiple modes, and DGDK generates samples similar to the different modes.

7 Discussion

We discuss here different aspects of our approach that are explained in detail in the appendix.

7.1 Scalability

One main advantage of our approach is that we use efficient closed-form solutions for the encoder that can be preprocessed offline to obtain efficient training times. As explained in Appendix B, in practice, on an NVIDIA GeForce RTX 3090 with 24 GB of VRAM, we manage to train 10,000 iterations on the dataset introduced in Section 6.2 in about 1 hour when the batch size is 20 (10 training graphs per category). We can also train in 1 hour 10,000 iterations on the dataset introduced in Appendix G.2, with larger graphs containing up to 200 nodes and a batch size of 2 (1 training graph per category). The main limitation to scaling to larger graphs is memory. Indeed, our low-rank approximation technique presented in Section 4.2 allows us to save memory, which then grows linearly in the size of the training set. Nonetheless, larger graphs require more VRAM.
7.2 Connection with Heat Kernels and Reproducing Kernel Banach Spaces (RKBS)

Understanding heat kernels and RKBS is not essential to run our method. Nonetheless, it provides a connection with existing work that considers graphs and kernel methods (Kondor & Lafferty, 2002; Belkin & Niyogi, 2003). In particular, this connection is used in Belkin & Niyogi (2003) to create low-dimensional representations of the nodes, and in Kondor & Lafferty (2002) to perform classification by using the kernel matrix as a similarity measure between nodes. In both cases, their graphs are undirected and their kernel matrix is symmetric, which allows them to exploit Reproducing Kernel Hilbert Spaces (RKHS). In our case, we propose to add a nonhomogeneous term to introduce noise, and we show that the resulting non-symmetric matrix $\mathbf{K}$ still corresponds to a non-symmetric kernel matrix induced by an RKBS. This motivates our low-rank approximation, inspired by kernel methods, to reduce the memory used, which then grows linearly in the number of nodes instead of quadratically. It is worth noting that since the kernel matrix $\mathbf{K}$ is non-symmetric, the standard framework of RKHS that relies on (symmetric) inner products cannot be used. Our RKBS interpretation introduces an elegant alternative that is useful both theoretically and in practice.

8 Conclusion and Limitations

We have proposed a one-shot generative model that samples digraphs and is similar in essence to denoising autoencoders. Our encoder exploits closed-form expressions to add noise to a digraph, and our decoder is trained to recover the global structure of the graph via its Laplacian dynamics. We show how our framework generalizes heat kernels and is able to simulate the training distribution. We also propose a low-rank approximation of the heat kernel matrix to possibly scale to large graphs.
Limitations. Although our approach is scalable and produces digraphs of different sizes, one limitation is that our decoder is deterministic and its output is determined by its input. Nonetheless, the different sampling strategies that we study thoroughly in Appendix G.2 can be used to generate a large set of different types of input. This allows our model to generate a diverse set of outputs. Another limitation of our work is the lack of experiments on real-world datasets. This is due to the fact that we did not find an appropriate real-world dataset on which to try our approach. For instance, even the genome dataset introduced in Marbach et al. (2012) contains three different graphs, each for a different type of gene. Our goal is to work with a large number of graphs that follow the same type of distribution. In order to work with the dataset in Marbach et al. (2012), we would instead need to generate multiple subgraphs from the three large graphs, which is different from the task we are interested in.

Broader Impact Statement

Our work is not focused on applications. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted in the paper.

Acknowledgments

We thank Nicholas Sharp for helpful discussions about this project and the anonymous reviewers for their feedback.

References

Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, and Jian Zhou. Dirichlet diffusion score model for biological sequence generation. International Conference on Machine Learning, 2023.

Dominique Beaini, Saro Passaro, Vincent Létourneau, Will Hamilton, Gabriele Corso, and Pietro Liò. Directional graph networks. In International Conference on Machine Learning, pp. 748-758. PMLR, 2021.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
Luca Bombelli, Joohan Lee, David Meyer, and Rafael D Sorkin. Space-time as a causal set. Physical Review Letters, 59(5):521, 1987.

Fan Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, pp. R12, 1999.

James R Clough and Tim S Evans. Embedding graphs in Lorentzian spacetime. PLoS ONE, 12(11):e0187301, 2017.

Morris H De Groot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118-121, 1974.

Kaize Ding, Zhe Xu, Hanghang Tong, and Huan Liu. Data augmentation for deep graph learning: A survey. SIGKDD Explorations Newsletter, 24(2):61-77, dec 2022. ISSN 1931-0145. doi: 10.1145/3575637.3575646. URL https://doi.org/10.1145/3575637.3575646.

C Henry Edwards, David E Penney, and David T Calvis. Differential Equations and Linear Algebra, eBook. Pearson Higher Ed, 2020.

Paul Erdős, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17-60, 1960.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, 2022.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Risi Imre Kondor and John D. Lafferty. Diffusion kernels on graphs and other discrete input spaces.
In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, pp. 315-322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1558608737.

Samuel Kotz, Narayanaswamy Balakrishnan, and Norman L Johnson. Continuous Multivariate Distributions, Volume 1: Models and Applications, volume 1. John Wiley & Sons, 2004.

Youngchun Kwon, Dongseon Lee, Youn-Suk Choi, Kyoham Shin, and Seokho Kang. Compressed graph representation for scalable molecular graph generation. Journal of Cheminformatics, 12(1):1-8, 2020.

Marc T. Law and James Lucas. Spacetime representation learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=qV_M_rhYajc.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744-3753. PMLR, 2019.

Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, Charlie Nash, William L. Hamilton, David Duvenaud, Raquel Urtasun, and Richard Zemel. Efficient graph generation with graph recurrent attention networks. In NeurIPS, 2019.

Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, and Gustavo Stolovitzky. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796-804, 2012.

Karolis Martinkus, Andreas Loukas, Nathanaël Perraudin, and Roger Wattenhofer. Spectre: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. In International Conference on Machine Learning, pp. 15159-15179. PMLR, 2022.

Rocío Mercado, Tobias Rastemo, Edvard Lindelöf, Günter Klambauer, Ola Engkvist, Hongming Chen, and Esben Jannik Bjerrum. Graph networks for molecular design.
Machine Learning: Science and Technology, 2(2):025023, 2021.

James Mercer. XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209(441-458):415-446, 1909.

Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, pp. 4474-4484. PMLR, 2020.

James R Norris. Markov Chains. Number 2. Cambridge University Press, 1998.

Leslie O'Bray, Max Horn, Bastian Rieck, and Karsten Borgwardt. Evaluation metrics for graph generative models: Problems, pitfalls, and practical solutions. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=tBtoZYKd9n.

Larry Page. The PageRank citation ranking: Bringing order to the web. Technical Report, Stanford University, 1998.

Caroline E Seely. Non-symmetric kernels of positive type. Annals of Mathematics, pp. 172-176, 1919.

Nicholas Sharp, Souhaib Attaiki, Keenan Crane, and Maks Ovsjanikov. DiffusionNet: Discretization agnostic learning on surfaces. ACM Trans. Graph., 41(3), mar 2022.

Aaron Sim, Maciej Wiatrak, Angus Brayne, Páidí Creed, and Saee Paliwal. Directed graph embeddings in pseudo-Riemannian manifolds. International Conference on Machine Learning (ICML), 2021.

Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In Věra Kůrková, Yannis Manolopoulos, Barbara Hammer, Lazaros Iliadis, and Ilias Maglogiannis (eds.), Artificial Neural Networks and Machine Learning, ICANN 2018, pp. 412-422, Cham, 2018. Springer International Publishing.

Alex J Smola and Bernhard Schölkopf. Learning with Kernels, volume 4. Citeseer, 1998.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.

Guohui Song, Haizhang Zhang, and Fred J Hickernell. Reproducing kernel Banach spaces with the l1 norm. arXiv preprint arXiv:1101.4388, 2011.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

JJP Veerman and Robert Lyons. A primer on Laplacian dynamics in directed graphs. arXiv preprint arXiv:2002.02605, 2020.

Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rklz9iAcKQ.

Clement Vignac and Pascal Frossard. Top-n: Equivariant set and graph generation without exchangeability. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=-Gk_IPJWvk.

Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. DiGress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=UaAD-Nu86WX.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096-1103, 2008.

Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
Qi Yan, Zhengyang Liang, Yang Song, Renjie Liao, and Lele Wang. SwinGNN: Rethinking permutation invariance in diffusion models for graph generation. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=abfi5plvQ4.

Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, pp. 5708-5717. PMLR, 2018.

Haizhang Zhang, Yuesheng Xu, and Jun Zhang. Reproducing kernel Banach spaces for machine learning. Journal of Machine Learning Research, 10(12), 2009.

Yanqiao Zhu, Yuanqi Du, Yinkai Wang, Yichen Xu, Jieyu Zhang, Qiang Liu, and Shu Wu. A survey on deep graph generation: Methods and applications. arXiv preprint arXiv:2203.06714, 2022.

Algorithm 2: Training algorithm (for each mini-batch)
Input: Node representations $O \in \mathbb{R}^{n_{max} \times d}$, hyperparameters $T > 0$, $\alpha > 0$, $t \geq 0$, Bernoulli factor $\rho \in [0, 1]$
Initialize the mini-batch loss as $L_{\text{mini-batch}} = 0$
for each graph $G_i$ with adjacency matrix $A_i$ and Laplacian matrix $\Delta_i$ in the mini-batch do
    if promote permutation invariance is true then
        Generate a random permutation matrix $P \in \{0,1\}^{n_i \times n_i}$. $A_i \leftarrow P^\top A_i P$. $\Delta_i \leftarrow P^\top \Delta_i P$
    end if
    if data augmentation is true then
        Generate a perturbation matrix $C \in \{0,1\}^{n_i \times n_i}$ s.t. $\forall i, C_{ii} = 0$ and $\forall i \neq j, C_{ij} \sim \text{Bernoulli}(\rho)$
        $\tilde{A}_i \leftarrow A_i \oplus C$. $\tilde{\Delta}_i \leftarrow I - \tilde{A}_i \,\text{diag}(\mathbf{1}^\top \tilde{A}_i)^{-1}$
    else
        $\tilde{\Delta}_i \leftarrow \Delta_i$
    end if
    Define $N \in \mathbb{R}^{n_i \times d}$ as the upper submatrix of $O$ followed by a softmax operation for $\ell_1$-normalized columns
    Define $B$ as $e^{-T\tilde{\Delta}_i}$ or optionally as the rank-$s$ approximation of $e^{-T\tilde{\Delta}_i}$ via truncated SVD
    $X_i(T) \leftarrow e^{-\alpha T} B N + (1 - e^{-\alpha T}) M$
    Define $E$ as $e^{-t\Delta_i}$ or optionally as the rank-$s$ approximation of $e^{-t\Delta_i}$ via truncated SVD
    $T_i \leftarrow e^{-\alpha t} E N$
    $L_{\text{mini-batch}} \leftarrow L_{\text{mini-batch}} + L_{\text{edge}}(i) + \gamma L_{\text{node}}(i)$ where $\gamma \geq 0$ is a regularization parameter
end for
Optimize $L_{\text{mini-batch}}$ with Adam (Kingma & Ba, 2014)

Our appendix is structured as follows: Section B provides experimental details, including the architecture of the neural networks. Section C provides the details of the equations for our heat diffusion encoder. Section D provides the necessary details to solve the consensus problem. Section E explains the connection of our method with heat kernels. Section F studies the impact of $T$ and $\alpha$ on the eigenvalues of the different matrices involved in our model. Section G presents additional experimental results.

B Experimental details

Setup. We ran all our experiments on a single desktop with an NVIDIA GeForce RTX 3090 GPU (with 24 GB of VRAM) and 64 GB of RAM. We coded our project in PyTorch. We use the double precision format to define our tensors; this is important in the current version of PyTorch to obtain an accurate matrix exponential. Our training algorithm is illustrated in Algorithm 2 and follows the different data augmentation techniques mentioned in Section 4.

Node representation matrix N. We define $n_{max} \in \mathbb{N}$ as the maximum number of nodes in a graph of the training set, and $d$ as the number of columns of the initial node representation $O \in \mathbb{R}^{n_{max} \times d}$. In practice, we set $d = 150$ and we initialize each element of $O$ by sampling from the normal distribution with mean 1 and standard deviation 1. Other hyperparameter values could be used. For a given graph $G_i$ of $n_i$ nodes, we define $N \in \mathbb{R}^{n_i \times d}$ as the upper submatrix of $O$, and it is $\ell_1$-normalized by using a column-wise softmax operator. Equation 7 is minimized via standard gradient descent by training jointly $\psi$, $\varphi$ and $N$ (and hence $O$). In practice, we find that the larger $d$, the better the performance. However, our approach is limited by the amount of available memory. In Section 6.2, we train the model for 60,000 iterations.
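A minimal sketch of the data augmentation and noising steps of Algorithm 2 (our variable names; reading the edge perturbation as an elementwise XOR and $M$ as the uniform column-stochastic matrix are our assumptions):

```python
import numpy as np
from scipy.linalg import expm

def permute(A, rng):
    """Random relabeling of the nodes, A <- P^T A P (promotes
    permutation invariance)."""
    perm = rng.permutation(A.shape[0])
    return A[np.ix_(perm, perm)]

def perturb_edges(A, rho, rng):
    """Flip each off-diagonal entry of A independently with
    probability rho (Bernoulli edge perturbation)."""
    C = (rng.random(A.shape) < rho).astype(A.dtype)
    np.fill_diagonal(C, 0)
    return np.bitwise_xor(A, C)

def random_walk_laplacian(A):
    """Delta = I - A diag(1^T A)^{-1} (column-normalized)."""
    d = A.sum(axis=0, keepdims=True).astype(float)
    d[d == 0] = 1.0  # guard against empty columns (toy safeguard)
    return np.eye(A.shape[0]) - A / d

def noisy_representation(A, N, T=1.0, alpha=2.3):
    """X(T) = e^{-alpha T} e^{-T Delta} N + (1 - e^{-alpha T}) M."""
    n = A.shape[0]
    # M equals (1/n) 1 1^T N when N is column stochastic
    M = np.full((n, N.shape[1]), 1.0 / n)
    B = expm(-T * random_walk_laplacian(A))
    return np.exp(-alpha * T) * B @ N + (1 - np.exp(-alpha * T)) * M

rng = np.random.default_rng(0)
n, d = 8, 12
A = (rng.random((n, n)) < 0.5).astype(np.int64)
np.fill_diagonal(A, 0)
A = perturb_edges(permute(A, rng), rho=0.05, rng=rng)
A[(np.arange(n) + 1) % n, np.arange(n)] = 1  # keep a directed cycle so no column is empty (toy)
N = rng.random((n, d)); N /= N.sum(axis=0, keepdims=True)
X_T = noisy_representation(A, N)
```

Since $e^{-T\Delta}$ and $N$ are column stochastic, the columns of `X_T` sum to 1.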
We recall that each mini-batch contains 10 training graphs per category, hence 20 graphs per mini-batch. We make this choice because this is the maximum amount we manage to load in our GPU VRAM. The training algorithm takes about one hour per 10,000 iterations, so about 6 hours in total. We use a regularization parameter of $\gamma = 100$ and a step size/learning rate of 0.0001. Following O'Bray et al. (2022), we calculate our MMD evaluation metric by using a radial basis function (RBF) kernel with $\sigma = 10$. We report performance for other values of $\sigma$ in Appendix G.2.

Backbone. The common backbone of the node and edge decoders is:
- a linear layer $\mathbb{R}^d \to \mathbb{R}^{d'}$ where $d' = 1000$;
- a set transformer (Lee et al., 2019) with four attention blocks, each of dimensionality 1000, with one head, 20 inducing points and 20 seed vectors. Each row-wise feedforward layer of an attention block contains 3 linear layers $\mathbb{R}^{d'} \to \mathbb{R}^{d'}$ with ReLU activation function. We choose one head so that the global structure of the Laplacian matrix is not separately processed by the different heads;
- a multilayer perceptron (MLP) with one linear layer $\mathbb{R}^{d'} \to \mathbb{R}^{d''}$ (with $d'' = 600$), followed by four linear layers $\mathbb{R}^{d''} \to \mathbb{R}^{d''}$, and one linear layer $\mathbb{R}^{d''} \to \mathbb{R}^{d'''}$ where $d''' = 200$.

The head of the node decoder is an MLP with an initial layer $\mathbb{R}^{d'''} \to \mathbb{R}^{d''}$ followed by one linear layer $\mathbb{R}^{d''} \to \mathbb{R}^{d''}$, and one linear layer $\mathbb{R}^{d''} \to \mathbb{R}^{d}$. If the input is of size $\mathbb{R}^{n_i \times d}$, this returns a matrix of the same size. We also use column-wise normalization with softmax on the output.

Edge decoder. The output of the backbone described above is a node representation of size $d''' = 200$ for each node. We concatenate them (as described in Section 4.1) to obtain representations of pairs of nodes of size $2d''' = 400$. They are given as input to an MLP with one linear layer $\mathbb{R}^{2d'''} \to \mathbb{R}^{d''}$ (with $d'' = 600$), followed by 5 linear layers $\mathbb{R}^{d''} \to \mathbb{R}^{d''}$, followed by a linear layer $\mathbb{R}^{d''} \to \mathbb{R}^2$ that is used for the cross-entropy loss (one element for the absence of an edge, and the other for the existence of an edge).
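A minimal sketch of the edge decoder described above, with randomly initialized weights (the exact layer dimensions are our reading of the text and should be treated as assumptions):

```python
import numpy as np

def edge_decoder_head(H, W, b):
    """Sketch: concatenate node embeddings pairwise and apply an MLP
    producing 2 logits per ordered pair (no-edge / edge)."""
    n, d = H.shape
    # pairs[i, j] = [H[i], H[j]] for every ordered pair of nodes
    pairs = np.concatenate([np.repeat(H[:, None, :], n, axis=1),
                            np.repeat(H[None, :, :], n, axis=0)], axis=-1)
    x = pairs.reshape(n * n, 2 * d)
    for Wi, bi in zip(W[:-1], b[:-1]):
        x = np.maximum(x @ Wi + bi, 0.0)   # linear + ReLU
    logits = x @ W[-1] + b[-1]             # final linear layer, 2 logits
    return logits.reshape(n, n, 2)

rng = np.random.default_rng(0)
d_node, d_hid = 200, 600
# 2*d_node -> d_hid, then 5 x (d_hid -> d_hid), then d_hid -> 2
dims = [2 * d_node] + [d_hid] * 6 + [2]
W = [rng.normal(0, 0.01, (a, c)) for a, c in zip(dims[:-1], dims[1:])]
b = [np.zeros(c) for c in dims[1:]]
H = rng.normal(size=(6, d_node))           # backbone output for 6 nodes
logits = edge_decoder_head(H, W, b)
```

The two logits per ordered pair feed a standard cross-entropy loss over the presence or absence of each directed edge.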
Alternatively, one could output a real value for each pair of nodes and use a binary cross-entropy loss. We use ReLU as the activation function between all the linear layers. In all our experiments, we give the same weighting to the positive and negative edges (i.e., presence or absence of an edge).

C Details of Equations for the Heat Diffusion Encoder

C.1 Nonhomogeneous heat source term

We now give the details of the equations in Section 3. We recall the formulation of $Q$ in Proposition 1:

$$Q(s) = \alpha e^{-\alpha s} e^{-s\Delta} (R - e^{-\beta\Delta} X(0)) \qquad (9)$$

which implies the following formulation of $F$:

$$F(t) := \int_0^t e^{-(t-s)\Delta} Q(s)\, ds = e^{-t\Delta} \int_0^t e^{s\Delta} Q(s)\, ds \qquad (10)$$

$$= \int_0^t \alpha e^{-(t-s)\Delta} e^{-s\Delta} e^{-\alpha s} (R - e^{-\beta\Delta} X(0))\, ds = e^{-t\Delta} \int_0^t \alpha e^{-\alpha s} (R - e^{-\beta\Delta} X(0))\, ds \qquad (11)$$

$$= e^{-t\Delta} \left[ -e^{-\alpha s} (R - e^{-\beta\Delta} X(0)) \right]_{s=0}^{s=t} = (1 - e^{-\alpha t})\, e^{-t\Delta} (R - e^{-\beta\Delta} X(0)) \qquad (12)$$

C.2 Node representation over time

We assume in this subsection that $t \geq 0$. Equation 4 is written:

$$X(t) = e^{-t\Delta}\Big(X(0) + (e^{-\alpha t} - 1)\, e^{-\beta\Delta} X(0) + (1 - e^{-\alpha t})\, R\Big) \qquad (13)$$

For any nonnegative time step $\tau \geq 0$, we can write $X(t + \tau)$ as a function of $X(t)$ and vice versa.
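As a numerical sanity check (a toy sketch with our own variable choices), the closed form in equation 13 can be compared against direct explicit-Euler integration of the ODE $\dot{X} = -\Delta X + Q(t)$:

```python
import numpy as np
from scipy.linalg import expm

# Toy setup: complete digraph on 4 nodes, column-stochastic X(0),
# uniform R; alpha, beta, t_end are illustrative values.
n, d, alpha, beta, t_end = 4, 2, 2.3, 0.5, 1.0
rng = np.random.default_rng(0)
A = np.ones((n, n)) - np.eye(n)
Delta = np.eye(n) - A / A.sum(axis=0, keepdims=True)
X0 = rng.random((n, d)); X0 /= X0.sum(axis=0, keepdims=True)
R = np.full((n, d), 1.0 / n)
src = R - expm(-beta * Delta) @ X0   # R - e^{-beta Delta} X(0)

def closed_form(t):
    """Equation 13, with the common factor e^{-t Delta} kept outside."""
    inner = (X0 + (np.exp(-alpha * t) - 1) * (expm(-beta * Delta) @ X0)
             + (1 - np.exp(-alpha * t)) * R)
    return expm(-t * Delta) @ inner

# explicit Euler integration of dX/dt = -Delta X + Q(t) with
# Q(s) = alpha e^{-alpha s} e^{-s Delta} (R - e^{-beta Delta} X(0))
steps = 20000
h = t_end / steps
X = X0.copy()
Es = np.eye(n)                # running value of e^{-s Delta}
Eh = expm(-h * Delta)
for k in range(steps):
    s = k * h
    Q = alpha * np.exp(-alpha * s) * (Es @ src)
    X = X + h * (-Delta @ X + Q)
    Es = Es @ Eh
diff = np.abs(X - closed_form(t_end)).max()
```

The two solutions agree up to the first-order discretization error of the Euler scheme.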
$$X(t + \tau) = e^{-(t+\tau)\Delta}\Big(X(0) + (e^{-\alpha (t+\tau)} - 1)\, e^{-\beta\Delta} X(0) + (1 - e^{-\alpha (t+\tau)})\, R\Big)$$
$$= e^{-\tau\Delta} X(t) + (e^{-\alpha (t+\tau)} - e^{-\alpha t})\, e^{-(t+\tau)\Delta} (e^{-\beta\Delta} X(0) - R)$$

From the equation above, we find:

$$X(t) = e^{\tau\Delta} X(t + \tau) - (e^{-\alpha t} - e^{-\alpha (t+\tau)})\, e^{-t\Delta} (R - e^{-\beta\Delta} X(0)) \qquad (14)$$

When $\beta = 0$, we have:

$$X(T) = e^{-\alpha T} e^{-T\Delta} X(0) + (1 - e^{-\alpha T})\, e^{-T\Delta} R = e^{-\alpha T} Z(T) + (1 - e^{-\alpha T}) M \qquad (15)$$

Let us assume that $\beta = 0$; equation 13 can then be written as:

$$X(t) = e^{-t\Delta}\big(e^{-\alpha t} X(0) + (1 - e^{-\alpha t}) R\big) \qquad (16)$$

which implies

$$X(0) = e^{\alpha t} e^{t\Delta} X(t) + (1 - e^{\alpha t}) R \qquad (17)$$

Equation 17 implies the following formulation of $X(0)$ as a function of $X(t + \tau)$:

$$X(0) = e^{\alpha (t+\tau)} e^{(t+\tau)\Delta} X(t + \tau) + (1 - e^{\alpha (t+\tau)}) R \qquad (18)$$

From equation 14, and by setting $\beta = 0$, we obtain:

$$X(t) = e^{\tau\Delta} X(t + \tau) - (e^{-\alpha t} - e^{-\alpha (t+\tau)})\, e^{-t\Delta} (R - X(0)) \qquad (19)$$
$$= e^{\tau\Delta} X(t + \tau) + (e^{-\alpha (t+\tau)} - e^{-\alpha t})\, e^{-t\Delta} R + (e^{-\alpha t} - e^{-\alpha (t+\tau)})\, e^{-t\Delta} X(0) \qquad (20)$$

By using equation 18, the last term of equation 20 can be rewritten:

$$(e^{-\alpha t} - e^{-\alpha (t+\tau)})\, e^{-t\Delta} X(0) = (e^{-\alpha t} - e^{-\alpha (t+\tau)})\, e^{-t\Delta} \big(e^{\alpha (t+\tau)} e^{(t+\tau)\Delta} X(t + \tau) + (1 - e^{\alpha (t+\tau)}) R\big) \qquad (21)$$
$$= (e^{\alpha \tau} - 1)\, e^{\tau\Delta} X(t + \tau) + (e^{-\alpha t} - e^{-\alpha (t+\tau)})(1 - e^{\alpha (t+\tau)})\, e^{-t\Delta} R \qquad (22)$$

Equation 20 is then rewritten:

$$X(t) = e^{\alpha \tau} e^{\tau\Delta} X(t + \tau) + \big(e^{-\alpha (t+\tau)} - e^{-\alpha t} + (e^{-\alpha t} - e^{-\alpha (t+\tau)})(1 - e^{\alpha (t+\tau)})\big)\, e^{-t\Delta} R \qquad (23)$$
$$= e^{\alpha \tau} e^{\tau\Delta} X(t + \tau) + \big(e^{-\alpha (t+\tau)} - e^{-\alpha t} + e^{-\alpha t} - e^{\alpha \tau} - e^{-\alpha (t+\tau)} + 1\big)\, e^{-t\Delta} R \qquad (24)$$
$$X(t) = e^{\alpha \tau} e^{\tau\Delta} X(t + \tau) + (1 - e^{\alpha \tau})\, e^{-t\Delta} R \qquad (25)$$

C.3 Stochasticity of the node representation matrix

In Section 3, we mention that if the matrices $e^{-t\Delta}$ and $N$ are both column stochastic for all $t \geq 0$ (Veerman & Lyons, 2020), then $Z(t) = e^{-t\Delta} N$ is also column stochastic for all $t \geq 0$. This is easily verified. Let us assume that two matrices $B$ and $C$ are column stochastic. They then have nonnegative elements, and they satisfy $\mathbf{1}^\top B = \mathbf{1}^\top$ and $\mathbf{1}^\top C = \mathbf{1}^\top$. The matrix $BC$ is column stochastic because it has nonnegative elements and satisfies $\mathbf{1}^\top (BC) = (\mathbf{1}^\top B) C = \mathbf{1}^\top C = \mathbf{1}^\top$.

D Consensus

We now give the formulae and constraints for the consensus model (i.e., when $\Delta = L^\top$) (De Groot, 1974). In this model, the matrix $e^{-t\Delta}$ is row stochastic for all $t \geq 0$.
We then constrain both $N = X(0)$ and the matrix $M := \frac{1}{d} \mathbf{1}_{n \times d}$ to be row stochastic; this implies that $Z(t)$ is row stochastic for all $t \geq 0$. In this case, we have $\forall t \geq 0$, $e^{t\Delta} M = M$. We then also define $R := M$. In the consensus model, $X(t) = e^{-\alpha t} Z(t) + (1 - e^{-\alpha t}) M$ is row stochastic for all $t \geq 0$. A detailed comparison between the diffusion and consensus models is discussed in Veerman & Lyons (2020, Section 6). The consensus model would be appropriate in contexts where each row of $X(0)$ is a one-hot vector corresponding to the category of the node. However, it might suffer from degenerate cases where some rows of $X(t)$ do not depend on the same rows at their initial time $t = 0$ due to the properties of the left eigenvectors of $L$.

E Non-symmetric Heat Kernels

We explain how the heat kernel framework in Kondor & Lafferty (2002) is a special case of our encoder (see Section 3) when the kernel function is symmetric and the source term $Q$ is homogeneous. This connection motivates the study of the column space of the Laplacian matrix that is the foundation of many heat kernel methods (Belkin & Niyogi, 2003; Kondor & Lafferty, 2002).

Heat kernels (Kondor & Lafferty, 2002). We first explain how our approach can be seen as a non-symmetric heat kernel when the term $Q$ is homogeneous. We recall that a function $K : X \times X \to \mathbb{R}$ defined on some nonempty set $X$ is called a kernel function if it satisfies Mercer's theorem (Mercer, 1909). In other words, it satisfies $\int_X \int_X K(p,q) f(p) f(q)\, dp\, dq \geq 0$ for every square-integrable function $f$, or $\sum_{p \in X} \sum_{q \in X} K(p,q) f_p f_q \geq 0$ for all sets of real coefficients $\{f_p\}$ in the discrete case, which is the case we are interested in. Kernel functions are usually defined to be symmetric (i.e., $\forall p, q$, $K(p,q) = K(q,p)$) so as to define a Reproducing Kernel Hilbert Space (Smola & Schölkopf, 1998), and the symmetry of the kernel function between pairs of nodes is reasonable when the graph is undirected.
However, kernels are not necessarily symmetric (Seely, 1919), and one can define a Reproducing Kernel Banach Space (RKBS) on $X := V$ equipped with the $\ell_1$ norm (Song et al., 2011; Zhang et al., 2009) by considering the kernel matrix $K \in [0,1]^{n \times n}$ defined such that $K = e^{T\Delta}$ with $K_{pq} = K(p,q)$. Let us define $[0,1]^X := \{f : X \to [0,1]\}$ the Banach space of functions mapping $X$ into $[0,1]$. We also define the linear operators $f(\cdot) := \sum_{i=1}^n \alpha_i K(x_i, \cdot)$, $g(\cdot) := \sum_{j=1}^m \beta_j K(\cdot, x'_j)$ and the bilinear form $\langle f, g \rangle := \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j K(x_i, x'_j)$ where $n \in \mathbb{N}$, $m \in \mathbb{N}$, $\alpha_i \in \mathbb{R}$, $\beta_j \in \mathbb{R}$, and $x_1, \ldots, x_n, x'_1, \ldots, x'_m \in X$ are arbitrary. The resulting RKBS generalizes the heat kernels in Kondor & Lafferty (2002) to directed graphs, although it restricts the functions to map into $[0,1]$ instead of $\mathbb{R}$.

When $Q$ is nonhomogeneous, we have $X(T) = e^{-\alpha T} Z(T) + (1 - e^{-\alpha T}) M = \left( e^{-\alpha T} e^{T\Delta} + \frac{1 - e^{-\alpha T}}{n} \mathbf{1}\mathbf{1}^\top \right) N$ since $\frac{1}{n}\mathbf{1}\mathbf{1}^\top N = M$ when $N$ is column stochastic. This leads to the kernel matrix:
$$K = e^{-\alpha T} e^{T\Delta} + \frac{1 - e^{-\alpha T}}{n} \mathbf{1}\mathbf{1}^\top$$
As in Laplacian eigenmap approaches for dimensionality reduction (Chung & Yau, 1999), $K$ can be thought of as an operator on functions defined on the nodes of the graph, and we obtain the node representation matrix $KN \in [0,1]^{n \times d}$. The fact that $KN$ is column stochastic allows us to upper bound the $\ell_1$ norm of its columns by 1 (it is equal to 1), and then satisfy the properties of an RKBS on $X$ with the $\ell_1$ norm (Song et al., 2011).

We now explain how our framework falls into the framework of Song et al. (2011); Zhang et al. (2009). Let us consider the case $V = X$, which implies $|X| = n$. We consider the non-symmetric function $K$ defined as $K_{pq} = K(p,q)$ where the kernel matrix $K = e^{-\alpha T} e^{T\Delta} + \frac{1 - e^{-\alpha T}}{n} \mathbf{1}\mathbf{1}^\top \in [0,1]^{n \times n}$ is column stochastic. Since $K(p,q)$ is nonnegative for all $p \in X$ and $q \in X$, a sufficient condition to satisfy $\sum_{p \in X} \sum_{q \in X} K(p,q) f_p f_q \geq 0$ is to constrain $\{f_p\}_{p \in X}$ to be a set of nonnegative coefficients. In our experiments, we set $\{f_p\}$ to be a set of nonnegative coefficients that sum to 1 (i.e., column stochastic).
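To make the construction concrete, here is a small numerical check (our own sketch, with a randomly generated digraph) that the kernel matrix $K$ above is nonnegative and column stochastic, and that the Mercer-type condition holds for nonnegative coefficients.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
n, T, alpha = 6, 1.0, 1.0

# Random column-stochastic matrix S and Laplacian dynamics Delta = S - I.
S = rng.random((n, n))
S /= S.sum(axis=0, keepdims=True)
Delta = S - np.eye(n)

# Non-symmetric kernel matrix K = e^{-alpha T} e^{T Delta} + (1 - e^{-alpha T})/n 11^T.
K = np.exp(-alpha * T) * expm(T * Delta) \
    + (1 - np.exp(-alpha * T)) / n * np.ones((n, n))

# K is nonnegative and column stochastic (columns sum to 1).
assert np.all(K >= 0) and np.allclose(K.sum(axis=0), 1.0)

# With nonnegative coefficients f_p, the Mercer-type condition
# sum_p sum_q K(p, q) f_p f_q >= 0 holds trivially.
f = rng.random(n)
f /= f.sum()          # nonnegative coefficients summing to 1
assert f @ K @ f >= 0
```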
The above explanation did not require the notion of RKBS. We can nonetheless use Proposition 5 of Zhang et al. (2009), which is stated as follows: if the input space $X$ is a finite set, then any nontrivial function $K$ on $X \times X$ is the reproducing kernel of some RKBS on $X$. In our case, $K$ is nontrivial because $K$ is full rank. Our kernel matrix (which is full rank, and column stochastic, hence with nonnegative elements) naturally satisfies the first three requirements of Song et al. (2011). It also satisfies the relaxation of their fourth requirement in their Section 6. We recall their requirements:

(A1) for all sequences $\{x_p : p \in \{1, \ldots, n\}\} \subset X$ of pairwise distinct sampling points, the matrix $K := [K(p,q)] \in \mathbb{R}^{n \times n}$ is nonsingular.
- (A1) is satisfied because $K$ is full rank in our case.

(A2) $K$ is bounded, namely, $|K(s,t)| \leq M$ for some positive constant $M$ and all $s, t \in \{1, \ldots, n\}$.
- (A2) is satisfied because $K$ is column stochastic, so $\forall s, t$, $|K(s,t)| \leq 1$. This is satisfied whenever $1 \leq M$.

(A3) for all pairwise distinct $p \in X$, $j \in \mathbb{N}$, and $c$ having finite $\ell_1$-norm, $\sum_{j=1}^{\infty} c_j K(x_j, x) = 0$ for all $x \in X$ implies $c = 0$.
- (A3) is satisfied because $\forall j$, $K(x_j, x_j) > 0$ and $\forall j \neq i$, $K(x_j, x_i) \geq 0$ in our case. Therefore, (A3) can be satisfied only if $c = 0$.

The relaxation of (A4) in their Section 6 can be formulated as follows. Let us write $K$ as:
$$K = \begin{pmatrix} K_{1:(n-1),1:(n-1)} & K_{1:(n-1),n} \\ K_{n,1:(n-1)} & K_{n,n} \end{pmatrix}$$
where $K_{1:(n-1),1:(n-1)} \in [0,1]^{(n-1) \times (n-1)}$ and $K_{1:(n-1),n} \in [0,1]^{n-1}$ are submatrices of $K$. The relaxation of (A4) is satisfied if there exists some $\beta_n$ such that:
$$\left\| (K_{1:(n-1),1:(n-1)})^{-1} K_{1:(n-1),n} \right\|_{\ell_1} \leq \beta_n$$
is satisfied. Since $K_{1:(n-1),n}$ has its values in $[0,1]$, we have to be able to bound the values of $(K_{1:(n-1),1:(n-1)})^{-1}$. We can bound them by exploiting the (inverse of the) eigenvalues of $K$ that depend on $T$ and $\alpha$ (see equation 29) since we know that $\forall r$, $|\lambda_r + 1| \leq 1$.
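The requirements above can be illustrated numerically on a random instance (an illustrative sketch; the computed `beta_n` is just the finite quantity that the relaxed (A4) requires to be bounded, not a tight bound).

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
n, T, alpha = 8, 1.0, 1.0

# Kernel matrix K for a random column-stochastic S with Delta = S - I.
S = rng.random((n, n))
S /= S.sum(axis=0, keepdims=True)
Delta = S - np.eye(n)
K = np.exp(-alpha * T) * expm(T * Delta) \
    + (1 - np.exp(-alpha * T)) / n * np.ones((n, n))

# (A1): K is nonsingular (full rank).
assert np.linalg.matrix_rank(K) == n
# (A2): boundedness, |K(s, t)| <= 1 since K is column stochastic.
assert np.all(np.abs(K) <= 1.0)
# Relaxed (A4): the l1 norm that must admit a finite upper bound beta_n.
A_block = K[:n - 1, :n - 1]
b_block = K[:n - 1, n - 1]
beta_n = np.linalg.norm(np.linalg.solve(A_block, b_block), ord=1)
assert np.isfinite(beta_n)
```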
It is worth noting that this section has proven that it is possible to formulate an RKBS to represent our node similarities. However, the kernel matrix $K$ is given as input to our algorithm via the adjacency matrix $A$. One could instead define some node representation space that would induce the matrix $K$ by using the theory of RKBS, rather than considering that $K$ is given as input to the algorithm.

F Difference between T and α

To understand the difference in impact between $T$ and $\alpha$, we need to study the eigenvalues of the kernel matrix described in Appendix E: $K = e^{-\alpha T} e^{T\Delta} + \frac{1 - e^{-\alpha T}}{n} \mathbf{1}\mathbf{1}^\top \in [0,1]^{n \times n}$. It is worth noting that $S$, $\Delta = -L$ and $e^{T\Delta}$ all have the same sets of right and left eigenvectors. The only difference is their sets of eigenvalues. Since $S$ is column stochastic, it is diagonalizable, its spectral radius is 1, and it has at least one eigenvalue equal to 1 with $\mathbf{1}^\top$ as left eigenvector. The number of eigenvalues of $S$ that are equal to 1 is the number of reaches of the graph (Veerman & Lyons, 2020). A reach of a directed graph is a maximal unilaterally connected set (see Veerman & Lyons (2020) for details). By definition of $\Delta = S - I$, $\Delta$ and $S$ have the same eigenvectors, and those that correspond to the eigenvalue 1 of $S$ correspond to the eigenvalue 0 of $\Delta$, and to the eigenvalue $e^0 = 1$ of $e^{t\Delta}$ for all $t \in \mathbb{R}$. The matrix $e^{t\Delta}$ is column stochastic for all $t \geq 0$ because its spectral radius is then 1, it has at least one eigenvalue equal to 1 with $\mathbf{1}^\top$ as left eigenvector, and it has the same eigenvectors as the column stochastic matrix $S$.

Let us note $\Delta = U \Lambda U^{-1}$ the eigendecomposition of $\Delta$. The eigendecomposition of $S$ is $S = U(\Lambda + I)U^{-1}$, and the eigendecomposition of $e^{t\Delta}$ is $e^{t\Delta} = U e^{t\Lambda} U^{-1}$. Let us consider that the first row of $U^{-1}$ is $\gamma \mathbf{1}^\top$ where $\gamma \neq 0$ is an appropriate factor (i.e., the first row of $U^{-1}$ is collinear to $\mathbf{1}^\top$), and let us note:
$$\Lambda = \mathrm{diag}\left( \lambda_1, \lambda_2, \ldots, \lambda_{n-1}, \lambda_n \right)$$
We know that $\lambda_1 = 0$ since $S$ is column stochastic. Moreover, both $\frac{1}{n}\mathbf{1}\mathbf{1}^\top$ and $e^{T\Delta}$ have $\mathbf{1}^\top$ as left eigenvector with corresponding eigenvalue equal to 1, so $K = e^{-\alpha T} e^{T\Delta} + (1 - e^{-\alpha T}) \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ also has $\mathbf{1}^\top$ as left eigenvector with corresponding eigenvalue equal to 1. The eigendecomposition of $K$ is then $K = V \Phi V^{-1}$ where $V \neq U$ in general, but the first row of $V^{-1}$ is collinear to $\mathbf{1}^\top$. The diagonal matrix $\Phi$ is written:
$$\Phi = \mathrm{diag}\left( 1,\; e^{T(\lambda_2 - \alpha)},\; \ldots,\; e^{T(\lambda_{n-1} - \alpha)},\; e^{T(\lambda_n - \alpha)} \right) \qquad (29)$$
It is worth noting that all the nonzero eigenvalues $\lambda_r$ of $\Delta$ have negative real part by definition of $S$ (i.e., since the spectral radius of $S$ is 1). If $\alpha = 0$, then $\lambda_r = 0 \implies e^{T(\lambda_r - \alpha)} = 1$ for all $r$. If $\alpha > 0$ and $T > 0$, then the real part of $\lambda_r - \alpha$ is negative for all $r$, which implies $|e^{T(\lambda_r - \alpha)}| < |e^0| = 1$. If $\alpha > 0$ and $T > 0$, we also have, for all $r$, $\lim_{T \to +\infty} |e^{T(\lambda_r - \alpha)}| = 0$ and $\lim_{\alpha \to +\infty} |e^{T(\lambda_r - \alpha)}| = 0$. From equation 29, the main difference between $T > 0$ and $\alpha > 0$ is that $T$ acts as a multiplicative factor on the eigenvalues inside the exponential, whereas $\alpha > 0$ only has an impact on the real part of the eigenvalue inside the exponential.

G Additional Experimental Results

G.1 Study of the column space of the learned representations

Fig. 4 illustrates additional qualitative results from the experiments in Section 6.1, showing cosine values between the singular vectors (ordered by magnitude of their singular values) of $e^{t_i\Delta}$ and $e^{t_i\Delta} N$. To improve visualization, we set all the absolute values lower than 0.3 to 0. One can see that most cosine correlations appear along the diagonal. This shows that the singular vectors of the different matrices are correlated. Although some absolute values are high in the bottom right corner, their corresponding singular values are much smaller compared to those in the top left corner, so the overall importance of their correlation is weaker.
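Returning to Appendix F, the eigenstructure of $K$ derived there can be confirmed numerically: for a random dense column-stochastic $S$ (a single reach, so the eigenvalue 0 of $\Delta$ is simple), the spectrum of $K$ should be $\{1\} \cup \{e^{T(\lambda_r - \alpha)}\}_{r \geq 2}$. This is our own verification snippet, not the paper's code.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
n, T, alpha = 6, 1.3, 0.7

S = rng.random((n, n))
S /= S.sum(axis=0, keepdims=True)   # dense column-stochastic S: one reach
Delta = S - np.eye(n)
K = np.exp(-alpha * T) * expm(T * Delta) \
    + (1 - np.exp(-alpha * T)) / n * np.ones((n, n))

lam = np.linalg.eigvals(Delta)
# Remove the (simple) eigenvalue 0, i.e., the one with the largest real part.
lam_nonzero = np.delete(lam, np.argmax(lam.real))
# Predicted spectrum of K: {1} together with {e^{T (lambda_r - alpha)}}.
predicted = np.concatenate(([1.0], np.exp(T * (lam_nonzero - alpha))))
observed = np.linalg.eigvals(K)

# Compare the two (conjugate-symmetric) eigenvalue multisets.
order_p = np.lexsort((predicted.imag, predicted.real.round(8)))
order_o = np.lexsort((observed.imag, observed.real.round(8)))
assert np.allclose(predicted[order_p], observed[order_o], atol=1e-8)

# All non-unit eigenvalues have modulus < 1 when alpha > 0 and T > 0.
assert np.all(np.abs(np.exp(T * (lam_nonzero - alpha))) < 1.0)
```

Increasing $T$ rescales every eigenvalue inside the exponential, while increasing $\alpha$ only shifts its real part, which is the difference discussed above.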
It is worth noting that we use conditional sampling in this experiment, so our model has to jointly learn representations that are relevant for both categories.

Figure 4: Correlations between the left singular vectors of $e^{t_i\Delta}$ and $e^{t_i\Delta} N$. (left) Erdős-Rényi graphs; (right) stochastic block model graphs.

G.2 Approximation techniques for larger graphs

In this subsection, we consider a task similar to the one described in Section 6.2. The goal in Section 6.2 was to show that digraphs could be entirely reconstructed with our approach by using low-rank approximations of the Laplacian for small graphs. In other words, if we give some training adjacency matrix $A \in \{0,1\}^{n \times n}$ to our noising encoder, then our decoder is able to reconstruct all the elements of $A$. For simplicity, we did not consider data augmentation techniques in Section 6.2.

We now describe our experimental setup to generate larger graphs of different sizes. As in Section 6.2, we consider class-conditional digraph generation with two categories: Erdős-Rényi with $p = 0.4$, and a stochastic block model with 5 blocks and the following transition matrix between the blocks:
$$\begin{pmatrix} 0.9 & 0.2 & 0.4 & 0.2 & 0.4 \\ 0.3 & 0.9 & 0.15 & 0.5 & 0.45 \\ 0.4 & 0 & 0.95 & 0.05 & 0.4 \\ 0 & 0.3 & 0.4 & 0.75 & 0.45 \\ 0.1 & 0.4 & 0.4 & 0.15 & 0.7 \end{pmatrix}$$
All the graphs contain $n_i$ nodes where $n_i \in \{180, 181, \ldots, 200\}$. The first block contains 40 nodes, the second block contains 20 nodes, the third and fourth blocks contain 35 nodes each, and the last block contains from 50 to 70 nodes. We set $\mu = 0.4$, $\rho = 1/n_i$ and $s = 50$ for the rank-$s$ approximation of $e^{T\Delta_i} \in [0,1]^{n_i \times n_i}$ in this experiment. At inference time, we sample digraphs that contain $n_i$ nodes where $n_i \in \{180, 181, \ldots, 200\}$. Quantitative results are reported in Table 2 and Table 3. Once again, DGDK outperforms GRAN and SwinGNN on evaluation metrics that take into account global properties of the graph.
Data augmentation slightly improves performance. In both Table 2 and Table 3, we report the MMD scores for different values of the variance parameter of the RBF kernel: $\sigma^2 = 100$, 10 and 1. We recall that we use $\sigma^2 = 100$ in Table 1. We call Discrete DGDK the sampling strategy described in Algorithm 1, where we sample discrete adjacency matrices $A \in \{0,1\}^{n \times n}$ as described in line 2 of Algorithm 1. Continuous DGDK corresponds to the same sampling strategy, except that we sample the non-diagonal elements of $A$ uniformly in the continuous interval $[0,1]$ instead of $\{0,1\}$, while keeping the constraint $\forall i$, $A_{ii} = 1$. The two methods are competitive with the baselines. On the other hand, we also report scores when we sample each column of the input directly from a flat Dirichlet distribution. This strategy does not exploit the learned matrix $N$ and is outperformed by our other sampling strategies, although it still outperforms the GRAN baseline.

Table 2: Squared MMD distances over 5 random initializations (average ± standard deviation) for the Erdős-Rényi distribution (p = 0.4). The three groups of rows correspond to $\sigma^2 = 100$, 10 and 1, respectively.

| MMD metric | Degree | Clustering | Spectrum |
|---|---|---|---|
| Sampling from flat Dirichlet distribution | 0.00091 ± 0.00043 | 0.0079 ± 0.0026 | 0.00193 ± 0.0004 |
| Continuous DGDK (with data augmentation) | 0.00074 ± 0.00004 | 0.0068 ± 0.0005 | 0.00085 ± 0.0001 |
| Discrete DGDK (with data augmentation) | 0.00073 ± 0.00002 | 0.0067 ± 0.0006 | 0.00085 ± 0.0001 |
| Discrete DGDK (without data augmentation) | 0.00082 ± 0.00003 | 0.0069 ± 0.0004 | 0.00092 ± 0.0003 |
| SwinGNN | 0.00091 ± 0.00084 | 0.0091 ± 0.0012 | 0.00110 ± 0.0008 |
| GRAN | 0.00118 ± 0.00124 | 0.0138 ± 0.0035 | 0.00143 ± 0.0013 |
| Sampling from flat Dirichlet distribution | 0.0093 ± 0.0065 | 0.077 ± 0.0298 | 0.0202 ± 0.003 |
| Continuous DGDK (with data augmentation) | 0.0074 ± 0.0005 | 0.067 ± 0.0049 | 0.0085 ± 0.001 |
| Discrete DGDK (with data augmentation) | 0.0072 ± 0.0002 | 0.065 ± 0.0009 | 0.0085 ± 0.001 |
| Discrete DGDK (without data augmentation) | 0.0083 ± 0.0004 | 0.071 ± 0.0014 | 0.0123 ± 0.002 |
| SwinGNN | 0.0090 ± 0.0035 | 0.079 ± 0.0021 | 0.0113 ± 0.006 |
| GRAN | 0.0121 ± 0.0103 | 0.143 ± 0.0241 | 0.0149 ± 0.012 |
| Sampling from flat Dirichlet distribution | 0.103 ± 0.098 | 0.57 ± 0.14 | 0.184 ± 0.0340 |
| Continuous DGDK (with data augmentation) | 0.076 ± 0.009 | 0.57 ± 0.08 | 0.084 ± 0.0122 |
| Discrete DGDK (with data augmentation) | 0.071 ± 0.002 | 0.54 ± 0.02 | 0.083 ± 0.0092 |
| Discrete DGDK (without data augmentation) | 0.084 ± 0.004 | 0.60 ± 0.02 | 0.103 ± 0.0104 |
| SwinGNN | 0.102 ± 0.064 | 0.64 ± 0.10 | 0.101 ± 0.0126 |
| GRAN | 0.142 ± 0.105 | 1.39 ± 0.22 | 0.145 ± 0.1156 |

Table 3: Squared MMD distances over 5 random initializations (average ± standard deviation) for the stochastic block model (5 blocks). The three groups of rows correspond to $\sigma^2 = 100$, 10 and 1, respectively.

| MMD metric | Degree | Clustering | Spectrum |
|---|---|---|---|
| Sampling from flat Dirichlet distribution | 0.00031 ± 0.0002 | 0.0069 ± 0.0026 | 0.00232 ± 0.0004 |
| Continuous DGDK (with data augmentation) | 0.00017 ± 0.0001 | 0.0057 ± 0.0002 | 0.00041 ± 0.0003 |
| Discrete DGDK (with data augmentation) | 0.00015 ± 0.0001 | 0.0039 ± 0.0023 | 0.00038 ± 0.0003 |
| Discrete DGDK (without data augmentation) | 0.00031 ± 0.0001 | 0.0046 ± 0.0025 | 0.00041 ± 0.0004 |
| SwinGNN | 0.00046 ± 0.0009 | 0.0245 ± 0.0094 | 0.00831 ± 0.0103 |
| GRAN | 0.00053 ± 0.0008 | 0.0654 ± 0.0057 | 0.02472 ± 0.0144 |
| Sampling from flat Dirichlet distribution | 0.0021 ± 0.001 | 0.067 ± 0.020 | 0.0187 ± 0.002 |
| Continuous DGDK (with data augmentation) | 0.0018 ± 0.001 | 0.059 ± 0.006 | 0.0040 ± 0.003 |
| Discrete DGDK (with data augmentation) | 0.0015 ± 0.001 | 0.056 ± 0.001 | 0.0039 ± 0.003 |
| Discrete DGDK (without data augmentation) | 0.0027 ± 0.002 | 0.059 ± 0.004 | 0.0043 ± 0.004 |
| SwinGNN | 0.0041 ± 0.003 | 0.264 ± 0.062 | 0.0793 ± 0.089 |
| GRAN | 0.0056 ± 0.004 | 0.664 ± 0.091 | 0.2629 ± 0.124 |
| Sampling from flat Dirichlet distribution | 0.0255 ± 0.0133 | 0.59112 ± 0.342 | 0.172 ± 0.031 |
| Continuous DGDK (with data augmentation) | 0.0186 ± 0.0132 | 0.58531 ± 0.162 | 0.041 ± 0.036 |
| Discrete DGDK (with data augmentation) | 0.0154 ± 0.0093 | 0.48165 ± 0.012 | 0.038 ± 0.032 |
| Discrete DGDK (without data augmentation) | 0.0331 ± 0.0103 | 0.56420 ± 0.025 | 0.048 ± 0.042 |
| SwinGNN | 0.0424 ± 0.0251 | 0.68073 ± 0.351 | 0.388 ± 0.724 |
| GRAN | 0.0553 ± 0.0421 | 1.56743 ± 0.931 | 1.253 ± 1.123 |
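For concreteness, the two input sampling strategies can be sketched as follows. This is our own reading of the description above, not the paper's Algorithm 1; the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_input_adjacency(n, mu, discrete=True):
    """Sample the decoder input A as described in Section G.2:
    off-diagonal entries are Bernoulli(mu) ("Discrete DGDK") or
    Uniform[0, 1] ("Continuous DGDK"); the diagonal is fixed to 1."""
    if discrete:
        A = (rng.random((n, n)) < mu).astype(float)
    else:
        A = rng.random((n, n))
    np.fill_diagonal(A, 1.0)
    return A

A_disc = sample_input_adjacency(190, mu=0.4, discrete=True)
A_cont = sample_input_adjacency(190, mu=0.4, discrete=False)
assert set(np.unique(A_disc)) <= {0.0, 1.0}          # binary entries
assert np.all(np.diag(A_cont) == 1.0)                # unit diagonal kept
assert np.all((A_cont >= 0) & (A_cont <= 1))         # continuous entries in [0, 1]
```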
G.3 Data augmentation by using permutations of Laplacian matrices

In this subsection, we study the impact of the data augmentation technique that adds adjacency matrices of isomorphic digraphs, as explained in Section 4.2. Figure 5 illustrates the loss value of equation 7 as a function of the number of iterations, with and without this data augmentation technique. In this setup, we use $\gamma = 100$.

Figure 5: Loss values obtained when optimizing equation 7 with or without data augmentation by adding permutations of training adjacency matrices.

We use the experimental setup described in Section 6.2 with 3,000 training graphs per category and 10 graphs per category per batch. Each epoch corresponds to 300 iterations. Both loss curves follow similar patterns, although the one corresponding to data augmentation has a slightly higher loss value. One reason for the low impact of this data augmentation technique is that the matrix $N$ is jointly learned with nonlinear decoders $\phi$ and $\psi$ that are robust to this kind of transformation.

G.4 Additional experiment: multimodal categories

In Section 6.2, the categories are unimodal and it is assumed that they can be conditioned on at test time. We now show that, if the distribution of the training set is multimodal and the modes are not given to the model, our model is able to sample graphs from the different modes. We consider the following training set containing four modes of the same size: (1) a stochastic block model with a single block and probability $p = 0.28$, (2) two (disjoint and connected) components of the same size with $p = 0.48$ each, (3) three components of the same size with $p = 0.78$ each, and (4) four components of the same size with $p = 0.97$ each. This corresponds to an average edge ratio of $\mu \approx 0.24$. Fig. 7 illustrates some graphs generated with Algorithm 1 when $\gamma = 1$ and $\alpha = 1$; they follow the training distribution.
We emphasize that blocks are disjoint only in this subsection, for visualization purposes; in the previous subsections, some edges connect different blocks.

Impact of γ. The regularization parameter $\gamma$ in equation 7 acts as a tradeoff between learning the edge decoder and the node decoder. If $\gamma = 0$, then the node decoder is not trained. Our goal is to predict the adjacency matrix from a random matrix in $[0,1]^{n \times d}$ given as input to the decoders. Only the edge decoder is useful for this task. We found that the loss function does not converge during training when $\gamma = 0$. This suggests that the data augmentation technique that adds edge perturbation to the adjacency matrix is important, at least for this experiment. A positive value of $\gamma$ is necessary for the loss to decrease during training and to learn meaningful decoders. This implies that the node decoder learns the global structure of the graph by reconstructing node representations that depend on the ground truth Laplacian. This is beneficial to the edge decoder, as both decoders share a common backbone.

Figure 6: Digraphs generated when α = 0.

Impact of α. Using $\alpha = 0$ results in generated samples with only one block (see Fig. 6). As $\alpha$ increases, the number of components per sample increases (see Fig. 7). This is because the adjacency matrices sampled with a Bernoulli distribution in Algorithm 1 correspond to one component. As $\alpha$ increases, the information of the sampled adjacency matrix gets partially lost, so the decoder is able to reproduce graphs similar to the training distribution and samples graphs uniformly from all the modes. We provide an ablation study on the impact of the noise diffusivity rate hyperparameter $\alpha \geq 0$. When $\alpha = 0$, more than 97% of the generated graphs contain only one block. Some generated graphs with $\alpha = 0$ are illustrated in Figure 6.
When $\alpha = 1$ (i.e., $1 - e^{-\alpha T} \approx 0.63$), 19% of the generated graphs contain a single block, 26% contain 2 blocks, 30% contain 3 blocks, 20% contain 4 blocks, 2% contain 5 blocks, 2% contain 6 blocks, and 1% contain 7 blocks. The distribution is similar to that of the training set, which is uniformly distributed (i.e., 25% for each mode). Some graphs generated when $\alpha = 1$ are illustrated in Figure 7. When $\alpha = 2.3$ (i.e., $1 - e^{-\alpha T} \approx 0.9$), 17% of the generated graphs contain a single block, 14% contain 2 blocks, 20% contain 3 blocks, 20% contain 4 blocks, 8% contain 5 blocks, 5% contain 6 blocks, 8% contain 7 blocks, and the remaining 8% contain up to 15 blocks. Some graphs generated when $\alpha = 2.3$ are illustrated in Figure 8. Many generated graphs contain isolated nodes.

We report in Table 4 the MMD scores that compare the training set with the generated graphs for different values of $\alpha$. The performance shows that $\alpha$ has to be chosen carefully so that noise is introduced, but not in excess. It is worth noting that GRAN is not appropriate in this experiment since it has to be given the maximum number of nodes in the graph. Our model is given a number of nodes and has to generate one or multiple (disconnected) graphs that follow the training distribution.

Table 4: Squared MMD distances for the experiments on the multimodal dataset in Section G.4 for different values of α.

| MMD metric | Degree | Clustering | Spectrum |
|---|---|---|---|
| α = 0 | 2.6 × 10⁻³ | 1.7 × 10⁻³ | 3.3 × 10⁻⁴ |
| α = 1 | 0.9 × 10⁻³ | 6.3 × 10⁻⁴ | 2.3 × 10⁻⁴ |
| α = 2.3 | 1.6 × 10⁻³ | 1.0 × 10⁻³ | 5.9 × 10⁻⁴ |

Figure 7: Digraphs generated when α = 1.0.

Figure 8: Digraphs generated when α = 2.3.
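The noise levels quoted in the ablation ($1 - e^{-\alpha T} \approx 0.63$ for $\alpha = 1$ and $\approx 0.9$ for $\alpha = 2.3$) are consistent with $T = 1$; this value of $T$ is our inference from those numbers, not stated explicitly in this appendix. A quick check:

```python
import numpy as np

# Noise level 1 - e^{-alpha T} for the values of alpha in the ablation,
# assuming T = 1 (inferred from 1 - e^{-alpha T} ~ 0.63 at alpha = 1).
T = 1.0
for alpha in (0.0, 1.0, 2.3):
    level = 1.0 - np.exp(-alpha * T)
    print(f"alpha = {alpha}: 1 - exp(-alpha * T) = {level:.2f}")

assert abs((1.0 - np.exp(-1.0 * T)) - 0.63) < 0.01   # alpha = 1
assert abs((1.0 - np.exp(-2.3 * T)) - 0.90) < 0.01   # alpha = 2.3
```

With $\alpha = 0$ the noise level is exactly 0, which matches the observation that generated samples then contain a single block: the sampled adjacency matrix is passed through essentially unperturbed.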