# Discrete Graph Auto-Encoder

Published in Transactions on Machine Learning Research (03/2024)

Yoann Boget (yoann.boget@hesge.ch), Geneva School for Business Administration, HES-SO; University of Geneva
Magda Gregorova (magda.gregorova@thws.de), Center for Artificial Intelligence and Robotics (CAIRO), Technische Hochschule Würzburg-Schweinfurt (THWS)
Alexandros Kalousis (alexandros.kalousis@hesge.ch), Geneva School for Business Administration, HES-SO

Reviewed on OpenReview: https://openreview.net/forum?id=bZ80b0wb9

Despite advances in generative methods, accurately modeling the distribution of graphs remains challenging, primarily because of the absence of a predefined or inherent unique graph representation. Two main strategies have emerged to tackle this issue: 1) restricting the number of possible representations by sorting the nodes, or 2) using permutation-invariant/equivariant functions, specifically Graph Neural Networks (GNNs). In this paper, we introduce a new framework named Discrete Graph Auto-Encoder (DGAE), which leverages both strategies' strengths and mitigates their respective limitations. In essence, we propose a two-step strategy. We first use a permutation-equivariant auto-encoder to convert graphs into sets of discrete latent node representations, each node being represented by a sequence of quantized vectors. In the second step, we sort the sets of discrete latent representations and learn their distribution with a specifically designed autoregressive model based on the Transformer architecture. Through multiple experimental evaluations, we demonstrate the competitive performance of our model in comparison to the existing state of the art across various datasets. Several ablation studies further support the merits of our method.

1 Introduction

Graph generative models are a significant research area with broad potential applications.
While most existing models focus on generating molecules for de novo drug design, there has been interest in using these models for tasks such as material design (Lu et al., 2020), protein design (Ingraham et al., 2019), code modeling (Brockschmidt et al., 2019), semantic graph modeling in natural language processing (Chen et al., 2018; Klawonn & Heim, 2018), or scene graph modeling in robotics (Li et al., 2017).

Despite advances in generative methods, accurately modeling the distribution of graphs remains a complex task. The main challenge lies in the absence of predefined or inherent unique graph representations. Given a graph of n nodes, there are n! permutations, each corresponding to a distinct representation of the same graph. Two main strategies have emerged to tackle this issue:

1. The first accepts multiple representations of the same graph but restricts their quantity by sorting the nodes with a heuristic such as Breadth-First Search (BFS), as in (You et al., 2018; Shi et al., 2020; Luo et al., 2021).
2. The alternative approach employs models that are inherently invariant to node-ordering permutations. This method achieves permutation invariance through the use of Graph Neural Networks, as in (Vignac et al., 2023; Jo et al., 2022).

Both methods have their own challenges, which we elaborate on in Section 2. Our main contribution is a new generative model designed to address this issue of multiple graph representations. We use a permutation-equivariant auto-encoder to convert graphs into sets of node embeddings and to reconstruct graphs from sets of node embeddings. Contrary to graphs, any sorting algorithm yields a unique representation for sets. We therefore transform the sets into sequences and leverage an autoregressive model to learn the implicit distribution over sets. In implementing this framework, we introduce new methods and techniques or adapt existing ones.
We enhance the node embeddings by introducing an innovative feature-augmentation strategy. To ease the modeling of the latent distribution, we remove unnecessary information and reduce the size of the latent space by quantizing the node representations. For pragmatic considerations, however, we first partition the node embeddings and then quantize these partitions. As a result of this sequentialization, the latent representations form sequences spanning two dimensions: the partitions and the nodes. To model this distribution, we introduce a novel Transformer architecture designed to take advantage of this 2-dimensional structure. The development and subsequent evaluation of our model yield several significant contributions, which we summarize as follows:

- We introduce a novel generative model for graphs, leveraging a graph-to-set discrete auto-encoder invariant to permutations.
- We develop a new Transformer architecture, the 2D-Transformer, specifically designed to model sequences across two dimensions.
- We design a new technique for graph feature augmentation, which we call p-path features.
- For the first time, we empirically evaluate various feature-augmentation strategies via an ablation study.
- Through empirical analysis, we demonstrate the competitive performance of our model in comparison to the existing state of the art across various datasets.

In the following, Section 2 presents the foundational concepts and issues of graph modeling. Section 3 discusses the literature on graph generation. In Section 4, we introduce our Discrete Graph Auto-Encoder. Lastly, Section 5 presents the experimental evaluation of our model, demonstrating its competitive performance against baseline models on various datasets, including both simple and annotated graphs.

2 Background

In this section, we provide the necessary background on graph modeling.
We define some notation and then review the challenges of learning graph representations, which mainly stem from the graph isomorphism problem. Finally, we introduce Graph Neural Networks (GNNs) and discuss some of their limitations.

2.1 Notation

We define a simple (unannotated and undirected) graph as a tuple G = (V, E), where V is the set of nodes, whose cardinality is n, and E is the set of edges between these nodes. Given two nodes ν_i and ν_j in V, we represent the edge connecting them as ϵ_{i,j} = (ν_i, ν_j) ∈ E. For annotated graphs, we introduce two additional functions 𝒱 and ℰ that map the nodes and the edges to their respective attributes: 𝒱(ν_i) = x_i ∈ R^R and ℰ(ϵ_{i,j}) = e_{i,j} ∈ R^S. In the case of univariate discrete attributes, the categorical attributes are one-hot encoded, and R and S are the number of node and edge categories, respectively. We denote an annotated graph as G = (V, E, 𝒱, ℰ).

2.2 Graph isomorphism

A fundamental property of graphs is their invariance with respect to the ordering of the nodes in the representation. To put it differently, any permutation applied to the node ordering yields an identical graph. Therefore, there are n! possible isomorphic representations of the same graph. In the following, we employ the notation π(·) to indicate that we represent the graph under a specific node ordering. The graph isomorphism problem, which consists of determining whether two graphs are identical or distinct, follows from these multiple possible representations (Köbler et al., 1993). Algorithms that offer a unique representation for each isomorphism class are computationally at least as expensive as solving the graph isomorphism problem¹, which has not been proven to be solvable in polynomial time (Helfgott et al., 2017). As a result, algorithms that uniquely sort the nodes of any graph are at least as expensive as algorithms solving the graph isomorphism problem.
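To make the multiplicity of representations concrete, the following sketch (using an arbitrary 3-node path graph as a hypothetical example) shows that permuting the node ordering changes the adjacency matrix entry-wise while leaving permutation-invariant quantities, such as the degree multiset, unchanged:

```python
import numpy as np

# Illustrative example: permuting the node ordering of an adjacency
# matrix yields a different matrix that represents the same graph.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])  # path graph 1-2-3

perm = np.array([2, 0, 1])          # a node permutation pi
P = np.eye(3, dtype=int)[perm]      # corresponding permutation matrix
A_pi = P @ A @ P.T                  # adjacency matrix of pi(G)

# The matrices differ entry-wise, yet encode the same graph:
assert not np.array_equal(A, A_pi)
# The degree multiset (a permutation invariant) is unchanged:
assert sorted(A.sum(0)) == sorted(A_pi.sum(0))
```

With n nodes, each of the n! choices of `perm` gives such a distinct but isomorphic adjacency matrix.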
Graph isomorphism is a significant issue for graph generative models. Without a specific solution, generative models must learn a new set of parameters for each permutation. Given the vast number of possible permutations of each graph, this brute-force approach is unreasonable. In fact, to the best of our knowledge, no generative model for graphs follows such a procedure.

2.3 Generative models and graph isomorphism

Previous generative models, detailed in Section 3, have predominantly pursued two strategies to deal with the ordering issue: sequentialization and invariance to permutation.

2.3.1 Sequentialization

The first method aims to establish an ordering of the nodes that yields a sequentialization of the graph. Most models adopt algorithms sorting nodes in a non-unique manner. Breadth-first search (BFS), used for instance in (You et al., 2018; Shi et al., 2020; Luo et al., 2021), is the most common heuristic for ordering nodes in graph generative models. We presume that BFS is also a convenient way of traversing the graph and acts as an inductive bias for auto-regressive models. The main drawback of BFS is that the number of potential representations can still be considerable. In the worst case (the star graph), BFS does not reduce the number of possible representations at all.

2.3.2 Invariance to permutation

The alternative approach circumvents the issue by developing models invariant to permutation by construction. These models rely on permutation-invariant and permutation-equivariant functions. Permutation invariance is defined as:

f(π(G)) = f(G)   (1)

A function f is said to be permutation-equivariant if:

f(π(G)) = π(f(G))   (2)

Notably, equivariance to permutation implies that input and output have the same cardinality. In practice, Graph Neural Networks (GNNs) provide permutation invariance and permutation equivariance. In the next section (2.4), we introduce the fundamental principles of GNNs.

2.4 Graph neural networks

GNN layers are permutation-equivariant functions.
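Equations (1) and (2) can be checked numerically. The sketch below uses a shared per-node linear map as a toy equivariant function and a summed readout as a toy invariant function; both constructions are illustrative choices, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "graph" reduced to a node-feature matrix X; pi acts by permuting rows.
X = rng.normal(size=(5, 3))
perm = rng.permutation(5)

W = rng.normal(size=(3, 3))
equivariant = lambda X: X @ W                       # shared per-node map
invariant = lambda X: equivariant(X).sum(axis=0)    # commutative readout

# Eq. (2): f(pi(G)) = pi(f(G))
assert np.allclose(equivariant(X[perm]), equivariant(X)[perm])
# Eq. (1): f(pi(G)) = f(G)
assert np.allclose(invariant(X[perm]), invariant(X))
```

The summed readout illustrates the aggregation discussed next: composing an equivariant map with a commutative aggregation yields an invariant function.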
Since the cardinalities of the input and the output of such functions must match, GNNs are said to be flat².

¹The graph isomorphism problem involves determining whether two graphs are isomorphic. Establishing a unique representation for each isomorphism class can be considered a way to solve the problem: if the representations match, then the graphs are isomorphic.
²This does not mean that the dimension of each node representation, usually a vector, cannot change.

An aggregating function providing graph-level information often follows the GNN layers. Such an aggregation with a commutative function, such as summation or averaging, results in a permutation-invariant function.

2.4.1 Message Passing Neural Networks

Message Passing Neural Networks (MPNNs) (Scarselli et al., 2009) represent a prevalent class of Graph Neural Networks (GNNs), even though other classes of GNNs have gained in popularity, in particular GNNs where each node shares information with all the other nodes, as, for instance, in (Yun et al., 2019). The core concept underlying MPNNs involves updating the representation of each node by aggregating information from its neighboring nodes, i.e., those with which it shares an edge. Consequently, after l layers, each node aggregates information from its l-hop neighborhood. We refer to this l-hop neighborhood as the receptive field of the node features; each layer thus increases the size of the receptive field.

Message Passing Layer. The update of the l-th layer of an MPNN involves two steps. The first step computes the message:

e^{l+1}_{i,j} = f^l_{edge}(x^l_i, x^l_j, e^l_{i,j})   (3)

The second step aggregates the messages to compute a new node representation:

x^{l+1}_i = f^l_{node}(⊕_{j∈N(i)} e^{l+1}_{i,j})   (4)

In these equations, the functions f are small neural networks, ⊕ denotes any commutative aggregation function, and N(i) is the set of nodes connected to node i.
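A minimal NumPy sketch of Equations (3)-(4) may help. The linear maps, ReLU nonlinearity, and sum aggregation below are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def mpnn_layer(X, E, A, W_edge, W_node):
    """One message-passing step (Eqs. 3-4): linear f_edge / f_node with a
    ReLU nonlinearity, and sum aggregation over the neighborhood N(i)."""
    n, _ = X.shape
    # Message / edge update: f_edge([x_i, x_j, e_ij]) for every pair (i, j).
    pair = np.concatenate([
        np.repeat(X[:, None, :], n, axis=1),   # x_i broadcast over j
        np.repeat(X[None, :, :], n, axis=0),   # x_j broadcast over i
        E,                                     # e_ij
    ], axis=-1)
    E_new = np.maximum(pair @ W_edge, 0.0)
    # Node update: aggregate messages over neighbors only (masked by A).
    agg = (A[..., None] * E_new).sum(axis=1)
    X_new = np.maximum(agg @ W_node, 0.0)
    return X_new, E_new

n, d_x, d_e = 4, 3, 2
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], float)
X = rng.normal(size=(n, d_x))
E = rng.normal(size=(n, n, d_e))
W_edge = rng.normal(size=(2 * d_x + d_e, d_e))
W_node = rng.normal(size=(d_e, d_x))
X_new, E_new = mpnn_layer(X, E, A, W_edge, W_node)
assert X_new.shape == (n, d_x) and E_new.shape == (n, n, d_e)
```

Because every operation acts either per node pair or through a commutative sum, the layer is permutation-equivariant: relabeling the nodes of the input permutes the output accordingly.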
Note that the message e^{l+1}_{i,j} can also be interpreted as an edge representation and passed to the next layer. In the following, we favor this interpretation.

Limitations. MPNNs present the advantage of addressing the graph isomorphism problem but exhibit some shortcomings. Specifically, two challenges have been widely acknowledged: oversmoothing and the limits of their expressive power. Firstly, all node and edge features tend to the same value as the number of layers increases, a phenomenon called oversmoothing (Li et al., 2018; Oono & Suzuki, 2020). Because of it, one should limit the depth of MPNNs to a few layers. Secondly, MPNNs exhibit limited expressive capacity in the sense that they cannot distinguish specific pairs of non-isomorphic graphs (Morris et al., 2019). Standard MPNNs are, at best, as expressive as the Weisfeiler-Lehman (1-WL) test (Xu et al., 2019; Loukas, 2019). Consequently, they fail to detect some basic substructures (Arvind et al., 2020; Chen et al., 2020). We illustrate such a failure in Appendix D. Two strategies have been proposed to enhance the expressive capability of standard MPNNs. The first is to develop GNNs more powerful than MPNNs (Maron et al., 2019; Vignac et al., 2020); so far, such GNNs are computationally much more expensive and, therefore, impractical. The second approach augments the node features with synthetic attributes, including spectral embeddings, cycle counts, and the integration of random noise. Feature augmentation enhances the ability of GNNs to capture the graph structure. Using spectral embeddings (Laplacian positional encodings), cycle counts, or random noise as additional synthetic features is now usual in graph generation (Krawczuk et al., 2021; Vignac et al., 2023). We present the details of these features in Appendix D.
In Section 4.1.2, we propose a new method for feature augmentation called p-path features, and in Section 5.4 we assess experimentally the effect of the various methods on our model.

3 Related work

In this section, we briefly review existing work on graph generative models. As exposed in Section 2, generative models for graphs have followed two main strategies to address the graph representation issue: sequentialization and invariance to permutation.

3.1 Sequential generation

Sequentialization serves dual purposes: it restricts the number of possible graph representations and facilitates auto-regressive graph generation. Notably, all models following this approach are auto-regressive. Graph canonization is computationally expensive, as discussed in Section 2. To the best of our knowledge, GraphGen (Goyal et al., 2020) stands as the only model employing this particular strategy for generic graphs. Most works only seek to reduce the number of graph representations by adopting a Breadth-First Search (BFS) based method. We can further categorize auto-regressive models into two classes. The first class, which includes the seminal GraphRNN (You et al., 2018), employs a recurrent framework that necessitates maintaining and updating a global graph-level hidden state. This approach inherently introduces long-range dependencies, as nodes that are close in the graph topology can be distant in the sequential representation. The second class of auto-regressive models sidesteps long-range dependencies by recomputing the state of the partially generated graph at each step (Shi et al., 2020; Luo et al., 2021; Liao et al., 2019; Kong et al., 2023). Although this alleviates the long-range dependency issues, it drastically increases the computational cost. Furthermore, auto-regressive models, generating nodes and edges one by one, are inherently slow during the generation phase.
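The worst case for BFS-based ordering mentioned in Section 2.3.1 is easy to verify. The sketch below (a minimal pure-Python BFS over adjacency dicts, both our own illustrative choices) enumerates the BFS orderings of a 5-node star from its hub and finds that every ordering of the leaves is valid:

```python
from collections import deque
from itertools import permutations

def bfs_order(adj, root):
    """One BFS node ordering; the neighbor order in `adj` breaks ties."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return tuple(order)

# Star graph: node 0 is the hub, nodes 1..4 are leaves. Any ordering of
# the leaves gives a valid BFS traversal, so for a star graph BFS does
# not reduce the number of representations at all.
orders = set()
for leaves in permutations([1, 2, 3, 4]):
    adj = {0: list(leaves), **{l: [0] for l in [1, 2, 3, 4]}}
    orders.add(bfs_order(adj, 0))
assert len(orders) == 24  # 4! distinct orderings of the same graph
```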
This slowness is especially pronounced for the second class of models, as highlighted by the experimental results in Section 5.3.2 and in (Jo et al., 2022). GRAN (Liao et al., 2019) attempts to address this speed issue by generating multiple nodes and edges simultaneously, albeit with a trade-off in generation performance. The models discussed so far are generic, meaning they operate independently of any specific domain knowledge. However, many models are tailored to particular domains, notably molecule generation, which represents a predominant application of graph generative modeling. Such domain-specific models often employ canonical representations. The canonical SMILES notation, for instance, is commonly used to represent molecules, as in (Gómez-Bombarelli et al., 2018; Kusner et al., 2017). Recently, powerful models representing molecules as 3D objects have emerged (Hoogeboom et al., 2022; Xu et al., 2023). While plenty of domain-specific sequential models exist (Liu et al., 2018; Samanta et al., 2020; Jin et al., 2018; 2020; Kuznetsov & Polykovskiy, 2021; Li et al., 2018), in this work we focus on developing a generic graph model.

3.2 Models invariant to permutations

The second main strategy to address the multiple potential representations of a graph consists of using models invariant to permutations. By definition, these models, also referred to as "one-shot", are not sequential. These models implement invariance or equivariance through GNNs and are thus insensitive to the ordering of the graphs during training. Various methods, such as Generative Adversarial Networks (GANs), Normalizing Flows (NFs), and diffusion/score-based models, have been proposed following this strategy, each with its merits and issues. In GANs, a discriminator (or critic) classifies (or scores) graphs as true (from the dataset) or generated, while a generator aims to maximize the error of the discriminator.
MolGAN (De Cao & Kipf, 2018) proposes a permutation-invariant discriminator. GG-GAN (Krawczuk et al., 2021) adds a permutation-equivariant generator, and GrannGAN (Boget et al., 2023) proposes to generate only the node and edge attributes on top of sampled skeletons. However, adversarial training is known to be unstable: MolGAN exhibits mode collapse and low diversity, GG-GAN also suffers from a lack of diversity, while GrannGAN does not generate graphs, but only their attributes.

Normalizing Flows are designed to learn an invertible mapping between the data distribution and a known continuous distribution, typically the standard normal distribution. Both GraphNVP (Madhawa et al., 2019) and MoFlow (Zang & Wang, 2020) use a two-step process, first generating an adjacency tensor, followed by an annotation matrix conditioned on this tensor. GNF (Liu et al., 2019) needs only one step but focuses on generating simple graphs only. However, we hypothesize that, even with dequantization, efficiently spanning the continuous latent space using discrete objects remains a significant challenge. Indeed, based on their empirical evaluations, while these models surpass the performance of GAN-based approaches, they do not model graph distributions as effectively as auto-regressive models and those relying on diffusion/score-based methods.

Diffusion and score-based models gradually introduce Gaussian noise into the data and learn to denoise it at each noise level. For graphs, some models have suggested adding Gaussian noise to the adjacency matrix (Yang et al., 2019) and to the annotation matrix (Jo et al., 2022). By doing so, these models generate fully connected graphs, facilitating direct information sharing among all nodes. However, as stated in (Vignac et al., 2023), the continuous noise compromises the graph structures, leading them to collapse into fully connected graphs.
To address this concern, DiGress (Vignac et al., 2023) introduced a discrete diffusion model for graphs, which integrates discrete noise to maintain a coherent graph structure at every noise level. In practice, however, the discretization does not appear to bring significant improvement over GDSS. Whether continuous or discrete, the diffusion/score-based approach effectively captures the global structure of the graph and has, as of now, produced the best results, with DiGress and GDSS (Jo et al., 2022) considered the state of the art. However, the extensive number of denoising steps required by these models makes the generation process comparatively slow, as shown experimentally and reported in Section 5.

3.3 Graph Auto-Encoders

Closer to our work, Kipf and Welling (Kipf & Welling, 2016) proposed permutation-equivariant graph auto-encoders early on, namely the Graph Auto-Encoder (GAE) and the Variational Graph Auto-Encoder (VGAE). However, they use them as feature extractors for downstream tasks rather than for generation. In fact, neither of these two models captures the interactions and dependencies between node embeddings. One of our main contributions consists of modeling these interactions and dependencies, which are necessary for generation. On a side note, we also remark that they use a simple activated scalar product as a decoder, which can neither capture interactions between multiple node embeddings nor decode node and edge attributes. Like most auto-encoders in generative modeling, we use a decoder mirroring the encoder.

Lastly, GraphVAE (Simonovsky & Komodakis, 2018) adopts a distinct approach to handle the multiple possible graph representations. A permutation-invariant encoder captures a graph-level latent representation, losing the original node ordering in the aggregation. Then, a decoder built with a standard feed-forward neural network aims to reconstruct the original graph, yielding a graph in an arbitrary permutation.
Consequently, during the training phase, the model requires aligning the original representation with its reconstructed counterpart. This comparison necessitates an expensive graph-matching procedure with a computational complexity of order O(n⁴). Unlike GraphVAE, our auto-encoder is permutation-equivariant end-to-end. It does not capture the graph representation in a single vector but in a set of node embeddings, each capturing its local neighborhood, so that we can reconstruct the graph from its set of node embeddings. Since the sequentialization of sets does not suffer from the same problem as that of graphs, we can model the set of node embeddings as a sequence, leveraging an auto-regressive model. To sum up, we follow the permutation-invariance strategy to train a graph-to-set auto-encoder, and then we follow the sequentialization strategy to model the distribution over sets.

4 Discrete Graph To Set Auto-Encoder

In a nutshell, our approach involves a two-step strategy. The first step is a graph auto-encoder. We encode a graph into a set of n node embeddings. For practical reasons detailed hereunder, we partition each node embedding into C vectors and quantize each vector independently. Then, we decode the quantized latent representation, aiming to recover the original graph. This first step generates a discrete latent distribution over sets of node embeddings, yet this latent distribution remains unknown. In the second step, our objective is to learn this latent distribution. Rather than directly modeling a distribution over sets, we transform the sets into sequences. We then leverage a model based on the Transformer architecture to learn these sequences auto-regressively. The rationale behind this two-step strategy stems from using permutation-equivariant functions, particularly Message Passing Neural Networks (MPNNs), to project the graph into a latent space and decode it.
This approach sidesteps the node-ordering issue, making the computation of the reconstruction loss straightforward, without needing a matching procedure. However, as discussed in Section 2.4.1, MPNNs present inherent limitations, such as being constrained in depth and only capturing information from their L-hop neighborhood, L being the number of layers in the MPNN. Consequently, MPNNs are unable to model interactions between distant nodes. We therefore introduce the second step, which models the latent distribution, to address these long-range dependencies between node embeddings. In this section, we present and motivate these two steps in detail (Sections 4.1 and 4.2), and we explicitly explain how we perform generation (Section 4.3). The source code of our model is publicly available at https://github.com/yoboget/dgae.

4.1 Discrete Graph Auto-Encoder

In this initial step, our primary objectives when encoding graphs are twofold. Firstly, we aim to learn node embeddings that facilitate the accurate reconstruction of the original graph. Secondly, we strive to create a latent distribution that is handy to learn in the second step. To achieve this, we enhance the input graph representation with feature augmentation (Section 4.1.1). We use a Message Passing Neural Network (MPNN) to encode a graph G into a set of node embeddings Z^h = {z^h_1, …, z^h_n}, mapping a space over graphs to a space over sets, f_encoder: G → Z^h (Section 4.1.2). Partitioning and quantizing the node embeddings results in a handy discrete latent space with known support, facilitating the modeling of the latent distribution in the second step. Specifically, we partition each node embedding into a sequence of C vectors z^h_i = (z^h_{i,1}, …, z^h_{i,C}). We then quantize each vector independently (Section 4.1.3), replacing it with its closest neighbor in a codebook. Finally, another MPNN decodes the sets of quantized vectors Z^q, also called codewords, back to graphs Ĝ, formally f_decoder: Z^q → Ĝ.
Figure 1 visually summarizes the entire procedure.

Figure 1: Diagram of our auto-encoder. 1. The encoder is an MPNN transforming the graph into a set of node embeddings Z^h. 2. The elements of the set Z^h are partitioned and quantized, producing a set of codeword sequences Z^q. 3. The decoder, another MPNN, takes the set Z^q and reconstructs the original graph.

4.1.1 Input Graph and Feature Augmentation

As explained in Section 2.4, conventional MPNNs cannot identify certain fundamental graph substructures. Such a shortcoming adversely impacts the information captured by the encoder. Inevitably, the node embeddings miss the information the encoder fails to capture, preventing accurate reconstruction. A common mitigation strategy involves enhancing MPNN capacities by augmenting the input features with synthetic attributes. We employ this approach to generate more informative node embeddings, allowing for better reconstruction performance. Here, we propose a new augmentation scheme that aggregates information from the p-path neighborhood. Specifically, we first create virtual edges (with a vector of zeros as attributes) between nodes that are connected by paths of length 2 to p (the edges between adjacent nodes, i.e., those connected by a path of length 1, are already represented by a vector of edge attributes). We then concatenate a synthetic p-dimensional vector to each vector of edge attributes, whose i-th entry corresponds to the number of paths of length i connecting the two endpoints. Similarly, we augment the node features with a p-dimensional vector, whose i-th entry indicates the number of paths of length i emanating from the node. In Appendix D, we provide more details on how we compute the path lengths, as well as information about the other feature-augmentation schemes used in our experiments. Our method presents two benefits compared to other methods.
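A rough sketch of this augmentation scheme can be built from powers of the adjacency matrix. Note the caveat in the comments: the paper counts *paths* (its Appendix D gives the exact computation), whereas adjacency powers count *walks*, which coincide with paths only for lengths 1 and 2 on simple graphs; treat this as an approximation, and the function name as our own:

```python
import numpy as np

def walk_features(A, p=3):
    """Node and edge features counting walks of length 1..p.

    Hypothetical sketch: (A^k)[i, j] counts *walks* of length k, which
    over-counts the paper's *path* counts for k >= 3 on simple graphs.
    """
    n = A.shape[0]
    Ak = np.eye(n, dtype=int)
    edge_feats, node_feats = [], []
    for _ in range(p):
        Ak = Ak @ A                        # (A^k)[i, j] = #walks of length k
        edge_feats.append(Ak)
        node_feats.append(Ak.sum(axis=1))  # walks of each length from node i
    # Virtual edges: non-adjacent node pairs reachable by a walk of
    # length 2..p (the length-1 case is an existing edge already).
    reach = (sum(edge_feats[1:]) > 0) & (A == 0)
    np.fill_diagonal(reach, False)
    return (np.stack(node_feats, axis=-1),
            np.stack(edge_feats, axis=-1),
            reach)

# 4-cycle: 0-1-2-3-0
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
node_f, edge_f, virtual = walk_features(A, p=3)
assert node_f.shape == (4, 3) and edge_f.shape == (4, 4, 3)
assert virtual[0, 2] and not virtual[0, 1]  # 0-2 linked by length-2 walks
```

The first channel of `node_f` is simply the node degree; higher channels summarize the surrounding substructure, which is the information the augmented encoder exploits.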
First, unlike other methods, our p-path augmentation strategy produces synthetic attributes for both nodes and edges, providing richer information about the substructures around each node. Second, the virtual edges aggregate information from a larger neighborhood. The computation of the synthetic features is a one-time preprocessing step. In our experiments, we use the p-path method with p = 3. Our solution is related to the recent k-hop message passing presented in (Feng et al., 2022), both using path-length information. Our proposition differs on two main points. First, we use the path information to create synthetic features, while k-hop message passing uses this information to compute different message-passing functions for each path length. Second, we take into account the number of paths of length p between two nodes, while they only consider whether there is at least one such path. In Section 5.4, we empirically show that, compared to other methods, ours increases the reconstruction performance of our auto-encoder the most.

4.1.2 Encoder

We aim to encode graphs as sets of node embeddings, where each embedding captures its local graph structure, so that we can effectively recover the original graph from these embeddings. We use an MPNN as the encoder since MPNNs are specifically designed for, and efficient at, modeling the local neighborhood of each node. We formally define our encoder as an L-layer MPNN whose layers follow these two equations:

e^{l+1}_{i,j} = bn(f_edge([x^l_i, x^l_j, e^l_{i,j}]))   (5)

x^{l+1}_i = bn(⊕_{j∈N(i)} f_node([x^l_i, x^l_j, e^l_{i,j}]))   (6)

where bn stands for batch normalization (Ioffe & Szegedy, 2015) and [·, ·] is the concatenation operator. After the last layer, we discard the edge representations e^L_{i,j} and keep only the node representations, so that z^h_i = x^L_i ∈ R^{d_latent}. The complete graph latent representation is the set of all the node embeddings Z^h = {z^h_1, …, z^h_n}, where the superscript h indicates the set before quantization.
The superscript q will refer to the latent representations after quantization.

4.1.3 Discrete Latent Distributions

In the context of our auto-encoder, quantization consists of replacing a d-dimensional vector by its nearest neighbor in a learned codebook H ∈ R^{m×d}, m being the codebook size. By partitioning, we refer to a simple reshaping operation, where a vector of dimension d is reshaped into C vectors of dimension d/C. Both operations follow the same objectives: produce a latent distribution that is handy to model (in the second step) and flexible enough to yield good reconstruction. Specifically, the encoder outputs n node embeddings z^h_i ∈ R^{d_latent}. We partition these node embeddings into C vectors (z^h_{i,1}, …, z^h_{i,C}), where each z^h_{i,c} is in R^{d_latent/C}. Then, we quantize each vector independently. We refer to the partitioned and quantized representations as the discrete latent representation Z^q = {(z^q_{1,1}, …, z^q_{1,C}), …, (z^q_{n,1}, …, z^q_{n,C})}. Figure 2 depicts the quantization procedure. Partitioning and quantization result in discrete latent distributions.

Figure 2: Diagram of the quantization. We represent each node embedding by C partition vectors z^h_{i,c}. Then, we quantize each of these vectors by replacing it with its closest neighbor from the corresponding codebook H_c. The vectors in the codebooks are parameters learned during training.

Discrete Latent Representations. The node embeddings z^h_i, as output by the encoder, are real-valued vectors in R^{d_latent}. We make two observations regarding these vectors. First, if the original graphs have only discrete node and edge attributes, the support of their distribution is discrete: it corresponds to all the possible subgraphs within the L-hop neighborhood, where L is the number of encoder layers.
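Partitioning is literally a reshape; the sketch below (with illustrative sizes) shows that it is lossless:

```python
import numpy as np

# Partitioning: each node embedding z_i in R^{d_latent} becomes C vectors
# in R^{d_latent / C}. The sizes n, d_latent, C here are illustrative.
n, d_latent, C = 5, 12, 3
Z_h = np.arange(n * d_latent, dtype=float).reshape(n, d_latent)

partitions = Z_h.reshape(n, C, d_latent // C)   # f_p: R^d -> C x (d/C)
assert partitions.shape == (5, 3, 4)
# The inverse is another reshape, so no information is lost:
assert np.array_equal(partitions.reshape(n, d_latent), Z_h)
```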
However, enumerating the support of this discrete distribution is intractable in almost all practical scenarios, which makes modeling it directly impractical. In the continuous case, the distribution of the node embeddings is also unknown and challenging to model. Second, the encoder cannot model interactions between distant nodes. As a consequence, even if we could model the distribution of each node embedding individually, the distribution over the sets of node embeddings would remain unknown. Both observations, i.e., the potentially discrete support of the node-embedding distributions and the impossibility of modeling distant-node interactions, make the standard variational auto-encoder inappropriate for the task. More generally, the distribution of the node embeddings z^h_i, as output by the encoder, is challenging to model. On the other hand, discrete latent models have shown impressive results with both discrete and continuous data (van den Oord et al., 2017; Razavi et al., 2019). Therefore, we leverage such a model, using quantization to produce discrete latent distributions with known support.

Quantization. The quantization of the node embeddings into discrete latent representations presents many advantages. First, a discrete latent representation is a natural fit for modeling discrete structures like graphs. In particular, discrete representations avoid approximating step functions or Dirac delta functions with neural networks, which require smooth differentiable functions. Quantization also enforces that the codewords follow an m-dimensional categorical distribution, where m, the codebook size, is a hyper-parameter. Knowing that the latent vectors follow an m-dimensional categorical distribution, we can learn its parameters in the second stage of our model. Moreover, quantization acts as an information bottleneck.
Assuming no prior information, the information contained in each selected codeword is I = log2(m) bits and depends only on the codebook size m. By training the auto-encoder with a reconstruction objective, we encourage the quantization process to retain only the information most valuable for reconstruction. Also, by reducing the codebook size, we shrink the number of categories and, therefore, the number of parameters of the categorical distribution, making the latent distribution easier to learn. Limiting the codebook size also reduces the risk of codebook collapse (Kaiser et al., 2018), as observed in our experiments (see Section 5.4), favoring a better coverage of the space and so preventing sampling from out-of-distribution regions. However, restraining the information passed to the decoder makes the reconstruction harder. As a result, there is a trade-off between the reconstruction quality and the dimensionality of the categorical distribution stemming from the codebook size. We empirically show the effect of this trade-off in our experiments (see Section 5.4). We indeed observe that the reconstruction loss decreases when the codebook size increases. However, larger codebook sizes do not yield better generation performance despite better reconstruction performance. We attribute this observation to larger codebooks resulting in larger categorical distributions that are harder to learn.

Partitioning In practice, we often need a large discrete latent space to model each node representation with enough flexibility. We can obtain this flexibility either with a single codeword from a large codebook or with multiple codewords, each coming from a (much) smaller codebook. Large codebooks imply high-dimensional categorical distributions, which are inconvenient in practice. Specifically, modeling a high-dimensional categorical distribution in the second stage of our model would require an output layer whose size corresponds to the dimension of this distribution.
Large output layers, particularly those larger than their hidden layers, are ineffective in practice. Our experiments show that our model's generation performance drops for larger categorical distributions. So, instead of representing each node embedding with a single codeword, we use multiple codewords for each node representation. To that end, we partition the node embeddings z^h_i ∈ R^{d_latent} into C vectors z^h_{i,c} ∈ R^{d_latent/C} by reshaping them. In consequence, each node embedding is represented by C partition vectors. We quantize each of these partition vectors independently, each partition with its own codebook. In fact, in graphs, as in images, codebooks shared across features reflect an invariance assumption: permutation invariance for graphs and translation invariance for images. We use different codebooks for the different partitions as we do not assume any invariance across partitions. In other words, there is no reason to share codebooks across partitions because the various partitions do not belong to the same feature map. After partitioning and quantization, the latent representation of each node is a sequence of codewords. We call the set of all possible codeword arrangements the dictionary. Therefore, for a fixed codebook size m, the dictionary size is M = m^C. We empirically assess the effect of partitioning and present the results in Section 5.4.

Formalisation Specifically, the partition function f_p defines a simple reshaping operation transforming a vector of dimension d_latent into C vectors of dimension d_latent/C, that is f_p : R^{d_latent} → R^{C×(d_latent/C)}. We define the quantization function q_c : R^{d_latent/C} → {h_{1,c}, …, h_{m,c}} as the mapping of the c-th partition of node i to its closest neighbor in the corresponding codebook H_c ∈ R^{m×(d_latent/C)}, such that:

q_c(z^h_{i,c}) = z^q_{i,c} = h_{k,c} with k = argmin_g ||z^h_{i,c} − h_{g,c}||_2 (7)

Notably, the quantization procedure acts independently on every element in the set. It is, therefore, permutation-equivariant.
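As a concrete illustration, the partitioning and per-partition quantization above can be sketched in NumPy as follows. This is a minimal sketch under our own assumptions: the function names, toy sizes (n = 4 nodes, d_latent = 8, C = 2 partitions, m = 16 codewords), and random codebooks are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def partition(z, C):
    """f_p: reshape (n, d_latent) node embeddings into (n, C, d_latent // C)."""
    n, d = z.shape
    return z.reshape(n, C, d // C)

def quantize(z_parts, codebooks):
    """q_c: replace each partition vector by its nearest codeword,
    using a separate codebook H_c of shape (m, d_latent // C) per partition."""
    n, C, _ = z_parts.shape
    k = np.empty((n, C), dtype=int)   # indices k_{i,c} into the codebooks
    z_q = np.empty_like(z_parts)      # quantized vectors z^q_{i,c}
    for c in range(C):
        # squared Euclidean distance of every partition vector to every codeword
        dists = ((z_parts[:, c, None, :] - codebooks[c][None, :, :]) ** 2).sum(-1)
        k[:, c] = dists.argmin(axis=1)       # k = argmin_g ||z_{i,c} - h_{g,c}||
        z_q[:, c] = codebooks[c][k[:, c]]    # look up the selected codewords
    return z_q, k

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))                       # n=4 node embeddings
codebooks = [rng.normal(size=(16, 4)) for _ in range(2)]  # one codebook per partition
z_q, k = quantize(partition(z, C=2), codebooks)
print(z_q.shape, k.shape)            # (4, 2, 4) (4, 2)
print("dictionary size M = m^C =", 16 ** 2)
```

In a trained model the codebooks would be learned parameters (updated by EMA, see Section 4.1.5) rather than random vectors.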
We underline that a bijection exists between the quantized vectors and their corresponding indices in the codebooks. Therefore, the quantized vectors and their indices hold the same information. In the following, we refer to the set of indices corresponding to the set of quantized vectors Z^q = {(z^q_{1,1}, …, z^q_{1,C}), …, (z^q_{n,1}, …, z^q_{n,C})} as K = {(k_{1,1}, …, k_{1,C}), …, (k_{n,1}, …, k_{n,C})}, where k_{i,c} ∈ {1, …, m}.

4.1.4 Decoder

After quantization, the decoder has to recover the graph structure and, for annotated graphs, the node and edge attributes. Formally, it is a set-to-graph function f_decoder : Z^q → G. Since we have dropped any explicit representation of the graph (see Section 4.1.2), when we decode it, we assume a fully connected graph among the elements of Z^q. We subsequently feed the fully connected graph into an MPNN similar to the one used as encoder but without feature augmentation. The decoder outputs serve as logits to model the discrete distributions over the edges and, when the graph has node attributes, over the nodes. We present the implementation details in Appendix A.1.

4.1.5 Training

Our training strategy follows principles developed to train models performing quantization of the latent space (van den Oord et al., 2017). The training objectives are simultaneously reconstruction and quantization. We define a reconstruction loss for the first objective. For the second objective, we update the codebook vector centers using an Exponential Moving Average (EMA). Lastly, we add a regularization term to prevent the latent space from growing indefinitely. In the following, we present these elements and provide the implementation details in Appendix A.2.

Reconstruction loss We train the parameters of the MPNNs (encoder and decoder) to minimize the reconstruction loss.
We define it as the expected negative log-likelihood of the graph given the set of codewords Z^q:

L_recon = −E_{Z^q}[ln p_θ(G | Z^q)] (8)

Thanks to the auto-encoder's permutation-equivariance, the loss computation is straightforward since the representation order of the output corresponds to that of the input. The main difficulty comes from the quantization operation, which is non-differentiable. Therefore, no gradient can flow back to the encoder. We circumvent this difficulty by following (van den Oord et al., 2017) and passing the gradient with respect to the codeword directly to the encoder's output, bypassing the quantization operation.

Vector Quantization Objective The Vector Quantization (VQ) objective updates the vectors in the codebooks. The objective function, acting as an online k-means algorithm, updates the vectors in the codebooks using EMA, as proposed in (van den Oord et al., 2017). We adopt the k-means++ initialization for the codebooks, as proposed in (Łańcucki et al., 2020).

Commitment loss Finally, nothing prevents the embeddings from taking arbitrarily large values. To circumvent this issue, we again follow (van den Oord et al., 2017) and incorporate a commitment loss, which pushes the encoder outputs close to their corresponding codewords.

4.2 Modeling The Latent Distribution

The auto-encoder implicitly defines a distribution over the latent space Z^q, which translates to the space of the corresponding sets of indices K defined in Section 4.1.3. For generation, we need to sample from one of these distributions. So, we decide to model the distribution of indices. After sampling, we recover the corresponding codewords from their indices. Instead of learning a distribution over the sets of index partitions {(k_{1,1}, …, k_{1,C}), …, (k_{n,1}, …, k_{n,C})}, we turn the sets into sequences by sorting them. After sorting, the sequences of indices K ∈ R^{n×C} have two dimensions: the partition dimension and the node dimension.
Instead of flattening the sequence, we propose a new model based on the Transformer architecture (Vaswani et al., 2017) that leverages the 2-dimensional structure of the sequences to implement parameter sharing for computational efficiency and better generalization. We call it 2D-Transformer. Our main contribution concerns the attention function. Before presenting it, we first detail the sequentialization, the auto-regressive process, and the input and output of our model, and recall some basics of the Transformer architecture.

4.2.1 Sequentialization

The existence of multiple permutations carries over from the graphs to the sets in the latent space. The multiple possible representations, as well as the limitations of permutation-equivariant and permutation-invariant functions, are still an issue. However, unlike graphs, sorting sets into a unique representation is easy. Any sorting algorithm can sort sets, yielding a unique representation. We can then represent the latent distributions as sequences and learn them auto-regressively. Additionally, we remark that sorting is a specific kind of permutation. Since the decoder is permutation-equivariant, permuting the node embeddings in the latent space only changes the representation of the output graph but not the graph itself. So, we sequentialize the sets and learn the sequential distribution auto-regressively. In practice, we sort the set of index sequences K = {(k_{1,1}, …, k_{1,C}), …, (k_{n,1}, …, k_{n,C})} (the indices corresponding to the codewords Z^q = {(z^q_{1,1}, …, z^q_{1,C}), …, (z^q_{n,1}, …, z^q_{n,C})}, see Section 4.1.3) in increasing order, using the first partition as the primary criterion. In case of a tie, we use the index of the second partition as a secondary criterion, etc. We permute the set Z accordingly. We obtain sequences of indices K ∈ R^{n×C} and the corresponding sequence of codewords Z ∈ R^{n×C×(d_latent/C)}.
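The lexicographic sort just described, with the first partition as the primary key and later partitions as tie-breakers, can be sketched as follows. The toy indices are our own; np.lexsort is one possible implementation (it takes its keys in reverse priority order).

```python
import numpy as np

# toy set of n=4 node index tuples with C=2 partitions (the set order is arbitrary)
K = np.array([[3, 1],
              [0, 2],
              [3, 0],
              [0, 5]])

# np.lexsort sorts by its LAST key first, so we pass the partitions in
# reverse order: partition 1 as tie-breaker, partition 0 as primary key.
order = np.lexsort(tuple(K[:, c] for c in reversed(range(K.shape[1]))))
K_sorted = K[order]
print(K_sorted.tolist())   # [[0, 2], [0, 5], [3, 0], [3, 1]]

# the corresponding codewords Z are permuted with the same ordering
rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 2, 3))   # toy codeword vectors, one per (node, partition)
Z_sorted = Z[order]
```

Any stable deterministic sort would do; the point is only that every permutation of the same set maps to the same sequence.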
Since we perform the sorting operation as a preprocessing step for the second stage of our model, there is no backpropagation through the sorting operation. The ordering obtained is arbitrary, but it has the advantage of being completely generic. Domain-specific information can certainly lead to orderings that improve learning performance. We leave this question open for further domain-specific research. Regarding notation, we remove the superscript q from the latent vector z since we only work with the selected codewords in this second stage. Instead, we use the superscript to indicate the number of Transformer blocks traversed, so that z^l_{i,c} corresponds to the representation z_{i,c} after l blocks.

4.2.2 Auto-regressive setup on two dimensions

Our model uses the codeword vectors Z as input to model the distribution of indices K. Formally, we have:

P((k_{1,1}, …, k_{1,C}), …, (k_{n,1}, …, k_{n,C})) = ∏_{i=1}^{n} ∏_{c=1}^{C} P(k_{i,c} | z_{<(i,c)})

where z_{<(i,c)} denotes the codewords preceding position (i, c) in the sorted order. Since the sequences are sorted on the first partition, indices violating that order have zero probability, i.e., P(k_{i,1} < k_{i−1,1}) = 0. Even though the auto-regressive model should learn to generate the sequence following the order in the training set, we ease the task of our model by enforcing this constraint with masking.

4.2.3 Transformer

The Transformer is based on attention. The attention function takes queries q_i, keys k_i, and values v_i, all transformations of the corresponding representation z_i. The attention output is a weighted sum of the values, where the weights correspond to a compatibility function, such as the scalar product, between queries and keys. The queries correspond to the aggregating position, while keys and values correspond to aggregated positions. The original attention from (Vaswani et al., 2017) includes a scaling factor so that the attention output is computed as:

∑_{j≤i} softmax(q_i^T k_j / √d_k) v_j (13)

In practice, all attention functions are computed simultaneously using matrix multiplication. The masking M ensures that the weighted sum includes allowed positions only. We have: A = softmax(QK^T / √d_k + M)V.
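A minimal NumPy sketch of this masked scaled dot-product attention follows. The helper name and toy shapes are our own; we assume the common convention of an additive mask M holding −inf at disallowed positions (here, a causal mask so that position i attends only to j ≤ i).

```python
import numpy as np

def masked_attention(Q, K, V):
    """Causal scaled dot-product attention: A = softmax(QK^T / sqrt(d_k) + M)V."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) compatibility scores
    M = np.triu(np.full((n, n), -np.inf), k=1)   # -inf above the diagonal masks j > i
    scores = scores + M
    # row-wise softmax (subtracting the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
A = masked_attention(Q, K, V)
# the first position can only attend to itself, so its output equals v_1
print(np.allclose(A[0], V[0]))   # True
```

Multi-head attention applies this same computation in parallel on several learned projections of Q, K, and V.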
Following the original paper (Vaswani et al., 2017), we use multi-head attention, residual connections, position-wise fully connected feed-forward networks, and positional encoding.

2D Attention The main idea behind the 2D-Transformer lies in the computation of the attention function. Instead of computing the attention function for all the partitions, we compute it over the node embeddings as a whole. In addition to the benefits offered by Transformers over other auto-regressive models, such as the computational complexity per layer, the parallelizable computation, and the short path lengths between long-range dependencies (Vaswani et al., 2017), our 2D-Transformer implements parameter sharing for computational efficiency and better generalization. So, the keys and values are computed once for the whole node, so that we have: k^l_i = f^l_k(z^l_{i+1,0}), and v^l_i = f^l_v(z^l_{i+1,0}). One can interpret this as a form of parameter sharing since we use the same keys and values and, therefore, the same set of parameters for all the partitions. Nevertheless, the attention weights depend on the partition through the queries, which are different for each partition, as q^l_{i,c} = f^l_{q,c}(z^l_{i,c}). In practice, the functions f_k, f_v, and f_q are small neural networks. Thanks to the queries, the attention score q^T_{i,c} k_j also depends on both dimensions. So, the main change of our model compared to the standard Transformer appears in the attention function, which we compute as: j