Modular Flows: Differential Molecular Generation

Yogesh Verma, Samuel Kaski, Markus Heinonen
Aalto University
{yogesh.verma, samuel.kaski, markus.heinonen}@aalto.fi

Vikas Garg
YaiYai Ltd and Aalto University
vgarg@csail.mit.edu; vikas@yaiyai.fi

Abstract

Generating new molecules is fundamental to advancing critical applications such as drug discovery and material synthesis. Flows can generate molecules effectively by inverting the encoding process; however, existing flow models either require artifactual dequantization or specific node/edge orderings, lack desiderata such as permutation invariance, or induce a discrepancy between the encoding and decoding steps that necessitates post hoc validity correction. We circumvent these issues with novel continuous normalizing E(3)-equivariant flows, based on a system of node ODEs coupled as a graph PDE, that repeatedly reconcile locally toward globally aligned densities. Our models can be cast as message-passing temporal networks, and result in superlative performance on the tasks of density estimation and molecular generation. In particular, our generated samples achieve state of the art on both the standard QM9 and ZINC250K benchmarks.

1 Introduction

Figure 1: A toy illustration of ModFlow in action with a two-node graph. The two local flows, $z_1$ and $z_2$, co-evolve toward a more complex joint density, both driven by the same differential $f$.

Generative models have rapidly become ubiquitous in machine learning, with advances from image synthesis (Ramesh et al., 2022) to protein design (Ingraham et al., 2019). Molecular generation (Stokes et al., 2020) has also received significant attention owing to its promise for discovering new drugs and materials. Searching for valid molecules in prohibitively large discrete spaces is, however, challenging: estimates for drug-like structures range between $10^{23}$ and $10^{60}$, but only a tiny fraction (on the order of $10^8$) has been synthesized (Polishchuk et al., 2013; Merz et al., 2020). Thus, learning representations that exploit appropriate molecular inductive biases (e.g., spatial correlations) becomes crucial.

Earlier models focused on generating sequences based on the SMILES notation (Weininger, 1988) used in chemistry to describe molecular structures as strings. However, they were supplanted by generative models that capture valuable spatial information such as bond strengths and dihedral angles, e.g., by embedding molecular graphs via graph neural networks (GNNs) (Scarselli et al., 2009; Garg et al., 2020). Such models primarily include variants of Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Normalizing Flows (Dinh et al., 2014, 2016). Besides known issues with their training, GANs (Goodfellow et al., 2014; Maziarka et al., 2020) suffer from the well-documented problem of mode collapse, thereby generating molecules that lack diversity. VAEs (Kingma and Welling, 2013; Lim et al., 2018; Jin et al., 2018), on the other hand, are susceptible to a distributional shift between the training data and the generated samples. Moreover, optimizing for likelihood via a surrogate lower bound is likely insufficient to capture the complex dependencies inherent in the molecules.
Flows are especially appealing since, in principle, they enable estimating (and sampling from) complex data distributions using a sequence of invertible transformations on samples from a more tractable continuous distribution. Molecules are discrete, so many flow models (Madhawa et al., 2019; Honda et al., 2019; Shi et al., 2020) add noise during encoding and later apply a dequantization procedure. However, dequantization begets distortion and issues related to convergence (Luo et al., 2021). Moreover, many methods segregate the generation of atoms from bonds, so the decoded structure is often not a valid molecule and requires post hoc correction to ensure validity (Zang and Wang, 2020), effecting a discrepancy between the encoding and decoding distributions. Permutation dependence is another undesirable artifact of these methods. Some alternatives have been explored to avoid dequantization; e.g., Lippe and Gavves (2021) encode molecules in a continuous latent space via variational inference and jointly optimize a flow model for generation. Discrete graph flows (Luo et al., 2021) also circumvent the many pitfalls of dequantization by resorting to discrete latent variables and performing validity checks during the generative process. However, discrete flows follow an autoregressive procedure that requires a specific ordering of nodes and edges during training. In general, one-shot methods can generate much faster than discrete flows.

We offer a different flow-based perspective tailored to molecules. Specifically, we suggest coupled continuous normalizing E(3)-equivariant flows that derive generative capabilities from neural partial differential equation (PDE) models on graphs. Graph PDEs have been known to enable the design of new embedding methods such as variants of GNNs (Chamberlain et al., 2021), the extension of GNNs to continuous layers as Neural ODEs (Poli et al., 2019), and the accommodation of spatial information (Iakovlev et al., 2020). We instead seek to bring to the fore their efficacy and elegance as tools to help generate complex objects, such as molecules, viewed as outcomes of an interplay of co-adapting latent trajectories (i.e., underlying dynamics). Concretely, a flow is associated with each node of the graph, and these flows are conjoined as a joint ODE system conditioned on neighboring nodes. While these flows originate independently as samples from simple distributions, they adjust progressively toward more complex joint distributions as they repeatedly interact with the neighboring flows. We view molecules as samples generated from the globally aligned distributions obtained after many such local feedback iterations. We call the proposed method Modular Flows (ModFlows) to underscore that each node can be regarded as a module that coordinates with other modules. Table 1 summarizes the capabilities of ModFlow compared to some previous generative works.

Contributions. We propose to learn continuous-time, flow-based generative models, grounded on graph PDEs, for generating molecules without resorting to any validity correction. In particular, we propose ModFlow, a novel generative model based on coupled continuous normalizing E(3)-equivariant flows:
- ModFlow encapsulates essential inductive bias using PDEs, and defines multiple flows that interact locally toward a globally consistent joint density;
- we encode permutation, translation, rotation, and reflection equivariance with E(3)-equivariant GNNs adapted to molecular generation, and can leverage 3D geometric information;
- ModFlow is end-to-end trainable, non-autoregressive, and obviates the need for any external validity checks or correction;
- empirically, ModFlow achieves state-of-the-art performance on both the standard QM9 (Ramakrishnan et al., 2014) and ZINC250K (Irwin et al., 2012) benchmarks.

Table 1: A comparison of generative modeling approaches for molecules along four criteria: one-shot generation, modularity, invertibility, and continuous-time operation. The compared methods are JT-VAE (Jin et al., 2018), MRNN (Popova et al., 2019), GraphAF (Shi et al., 2020), GraphDF (Luo et al., 2021), MoFlow (Zang and Wang, 2020), GraphNVP (Madhawa et al., 2019), and ModFlow (this work).

Figure 2: A demonstration of modular flow generation. The initial Gaussian distributions $\mathcal{N}(0, I)$ evolve into complex densities $z(T)$ under $f$ and are subsequently translated into probabilities and labels.

2 Related works

Generative models. Earlier attempts at molecule generation (Kusner et al., 2017; Dai et al., 2018) aimed at representing molecules as SMILES strings (Weininger, 1988) and developed sequence generation models. A challenge for these approaches is to learn complicated grammar rules that can generate syntactically valid sequences of molecules. Recently, representing molecules as graphs has inspired new deep generative models for molecular generation (Segler et al., 2018; Samanta et al., 2018; Neil et al., 2018), ranging from VAEs (Jin et al., 2018; Kajino, 2019) to flows (Madhawa et al., 2019; Luo et al., 2021; Shi et al., 2020). The core idea is to learn to encode molecular graphs into a latent space, and subsequently decode samples from this latent space to generate new molecules (Atwood and Towsley, 2016; Xhonneux et al., 2020; You et al., 2018).

Graph partial differential equations. Graph PDEs are an emerging area that studies PDEs on structured data encoded as graphs. For instance, one can define a PDE on a graph to track the evolution of signals defined over the graph nodes under some dynamics. Graph PDEs have enabled, among other things, the design of new graph neural networks; see, e.g., GNODE (Poli et al., 2019), Neural PDE (Iakovlev et al., 2020), Neural Operator (Li et al., 2020), GRAND (Chamberlain et al., 2021), and PDE-GCN (Eliasof et al., 2021). Different from all these works, we focus on using PDEs for generative modeling of molecules (as graph-structured objects). Interestingly, the ModFlow proposed in this work may be viewed as a new equivariant temporal graph network (Rossi et al., 2020; Souza et al., 2022).

Validity oracles. A key challenge for molecular generative models is to generate valid molecules according to various criteria of molecular validity or feasibility. It is common practice to call on external chemical software as rejection oracles to reduce or exclude invalid molecules, or to perform validity checks as part of autoregressive generation (Luo et al., 2021; Shi et al., 2020; Popova et al., 2019). An important open question has been whether generative models can learn to achieve high generative validity intrinsically, i.e., without being aided by oracles or resorting to additional checks. ModFlow takes a major step toward that goal.

3 Modular Flows

We focus on unsupervised learning of an underlying graph density $p(G)$ using a dataset of observed molecular graphs $\mathcal{D} = \{G_n\}_{n=1}^{N}$. We learn a generative flow model $p_\theta(G)$ specified by flow parameters $\theta$, and use it to sample novel high-probability molecules.

3.1 Molecular Representation

Graph representation. We represent each molecular graph $G = (V, E)$ of size $M$ as a tuple of vertices $V = (v_1, \ldots, v_M)$ and edges $E \subseteq V \times V$.
Each vertex takes a value from an alphabet of atoms, $v \in \mathcal{A} = \{\mathrm{C}, \mathrm{H}, \mathrm{N}, \mathrm{O}, \mathrm{P}, \mathrm{S}, \ldots\}$, while each edge takes a value $e \in \mathcal{B} = \{1, 2, 3\}$ abstracting the bond type (i.e., single, double, or triple). We assume that, conditioned on the edges, the graph likelihood factorizes as a product of categorical distributions over vertices given their latent representations:

$$p(G) := p(V \mid E, \{z_i\}) = \prod_{i=1}^{M} \mathrm{Cat}\big(v_i \mid \sigma(z_i)\big), \qquad (1)$$

where $z_i = (z_{i\mathrm{C}}, z_{i\mathrm{H}}, \ldots) \in \mathbb{R}^{|\mathcal{A}|}$ is a set of atom scores for node $i$ such that $z_{ik} \in \mathbb{R}$ pertains to type $k \in \mathcal{A}$, and $\sigma$ is the softmax function

$$\sigma(z_i)_k = \frac{\exp(z_{ik})}{\sum_{k'} \exp(z_{ik'})}, \qquad (2)$$

which turns the real-valued scores $z_i$ into normalized probabilities. ModFlow also supports 3D molecular graphs that contain atomic coordinates and angles as additional information.

Tree representations. We can obtain an alternative representation for molecules by decomposing each molecule into a tree-like structure: certain vertices are contracted into a single node (denoted a cluster) such that the molecular graph becomes acyclic. Following Jin et al. (2018), we restrict these clusters to ring substructures present in the molecular data, in addition to the atom alphabet. Thus, we obtain an extended alphabet $\mathcal{A}_{\mathrm{tree}} = \mathcal{A} \cup \{C_1, C_2, \ldots\}$, where each cluster label $C_r$ corresponds to some ring substructure in the label vocabulary $\chi$. We then reduce the vocabulary to the 30 most commonly occurring substructures of $\mathcal{A}_{\mathrm{tree}}$. For further details, see Appendix A.2.

3.2 Differential modular flows

Normalizing flows (Kobyzev et al., 2021) provide a general recipe for constructing flexible probability distributions, used in density estimation (Cramer et al., 2021; Huang et al., 2018) and generative modeling (Zhen et al., 2020; Zang and Wang, 2020). We propose to model the atom scores $z_i(t)$ as a continuous-time normalizing flow (CNF) (Grathwohl et al., 2018) over time $t \in \mathbb{R}_+$. We assume the initial scores at time $t = 0$ follow an uninformative Gaussian base distribution $z_i(0) \sim \mathcal{N}(0, I)$ for each node $i$. Node scores evolve in parallel over time according to the differential equation

$$\dot{z}_i(t) := \frac{\partial z_i(t)}{\partial t} = f_\theta\big(t,\, z_i(t),\, z_{\mathcal{N}_i}(t),\, x_i,\, x_{\mathcal{N}_i}\big), \qquad i \in \{1, \ldots, M\}, \qquad (3)$$

where $\mathcal{N}_i = \{j : (i, j) \in E\}$ is the set of neighbors of node $i$ and $z_{\mathcal{N}_i}(t) = \{z_j(t) : j \in \mathcal{N}_i\}$ are the scores of the neighbors at time $t$; $x_i$ and $x_{\mathcal{N}_i}$ denote, respectively, the positional (2D/3D) information of $i$ and its neighbors; and $\theta$ denotes the parameters of the flow function $f$ to be learned. Stacking together all node differentials, we obtain a modular system of coupled ODEs:

$$\begin{pmatrix} \dot{z}_1(t) \\ \vdots \\ \dot{z}_M(t) \end{pmatrix} = \begin{pmatrix} f_\theta\big(t,\, z_1(t),\, z_{\mathcal{N}_1}(t),\, x_1,\, x_{\mathcal{N}_1}\big) \\ \vdots \\ f_\theta\big(t,\, z_M(t),\, z_{\mathcal{N}_M}(t),\, x_M,\, x_{\mathcal{N}_M}\big) \end{pmatrix}, \qquad (4)$$

$$z(T) = z(0) + \int_0^T \dot{z}(t)\, dt. \qquad (5)$$

This coupled system of ODEs may be viewed as a graph PDE (Iakovlev et al., 2020; Chamberlain et al., 2021), where the evolution of each node depends only on its neighbors. The joint flow induces a corresponding change in the individual densities in terms of the divergence of $f$ (Chen et al., 2018),

$$\frac{d \log p_t\big(z_i(t)\big)}{dt} = -\,\mathrm{tr}\!\left(\frac{\partial f_\theta\big(t,\, z_i(t),\, z_{\mathcal{N}_i}(t),\, x_i,\, x_{\mathcal{N}_i}\big)}{\partial z_i(t)}\right), \qquad (6)$$

starting from the base distribution $p_0(z_i(0)) = \mathcal{N}(z_i(0) \mid 0, I)$. The trace picks only the diagonal elements of the Jacobian $\partial f / \partial z$, which interprets the input from the neighbors, $z_{\mathcal{N}_i}$, as a control for each node $z_i$ at each instant $t$. An ODE solver is used for such systems, and the gradients are computed via the adjoint sensitivity method (Kolmogorov et al., 1962). This approach incurs a low memory cost and explicitly controls the numerical error. Notably, moving towards modular flows translates sparsity also to the adjoints.
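To make the coupled system of Equations (3)-(5) concrete, the following minimal PyTorch sketch evolves per-node atom scores with a shared differential conditioned on neighboring scores and pairwise distances. The `ToyModularODE` MLP, the forward-Euler `integrate` helper, and all sizes are illustrative stand-ins that we introduce here for exposition; the paper itself uses the E(3)-equivariant GNN of Section 3.3 and an adjoint-based numerical solver.

```python
import torch
import torch.nn as nn

class ToyModularODE(nn.Module):
    """Toy stand-in for the coupled node ODE of Eqs. (3)-(5): every node's
    score evolves under one shared differential conditioned on its
    neighbors' scores and on simple geometric information."""
    def __init__(self, num_atom_types, hidden=64):
        super().__init__()
        # f_theta: here a small MLP over [z_i, mean of neighbor z_j, mean distance].
        self.f = nn.Sequential(
            nn.Linear(2 * num_atom_types + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, num_atom_types),
        )

    def forward(self, t, z, pos, adj):
        # z: (M, A) node scores, pos: (M, d) coordinates, adj: (M, M) 0/1 adjacency
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        z_nbr = adj @ z / deg                                  # mean neighbor scores
        dist = torch.cdist(pos, pos)                           # pairwise distances
        d_nbr = (adj * dist).sum(-1, keepdim=True) / deg       # mean neighbor distance
        return self.f(torch.cat([z, z_nbr, d_nbr], dim=-1))

def integrate(ode, z0, pos, adj, T=1.0, steps=50):
    """Forward-Euler integration of the coupled system; a Runge-Kutta or
    adjoint-based solver would be used in practice."""
    z, dt = z0, T / steps
    for k in range(steps):
        z = z + dt * ode(k * dt, z, pos, adj)
    return z

# Toy usage: a 3-node path graph with 5 atom types.
adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
pos = torch.randn(3, 2)
z0 = torch.randn(3, 5)                       # z_i(0) ~ N(0, I)
zT = integrate(ToyModularODE(5), z0, pos, adj)
labels = zT.softmax(-1).argmax(-1)           # Eqs. (1)-(2): per-node label
```

Applying the softmax and argmax of Equations (1)-(2) to $z(T)$, as in the last line, yields one label per node.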
Proposition 1: Modular adjoints are sparser than regular adjoints. They can be computed as

$$\frac{d\lambda_i(t)}{dt} = -\sum_{j \in \mathcal{N}_i \cup \{i\}} \lambda_j(t)^\top \frac{\partial f_\theta\big(t,\, z_j(t),\, z_{\mathcal{N}_j}(t),\, x_j,\, x_{\mathcal{N}_j}\big)}{\partial z_i(t)},$$

where the partial derivatives $[\partial f / \partial z_j]_{ij}$ are sparse (see Appendix A.1 for the derivation).

3.3 Equivariant local differential

Our goal is to have a differential function $f$, the PDE operator used in Equation (4), that satisfies the natural equivariances and invariances of molecules. Specifically, this function must be (i) translation equivariant: translating the input results in an equivalent translation of the output; (ii) rotation (and reflection) equivariant: rotating the input results in an equivalent rotation of the output; and (iii) permutation equivariant: permuting the input results in the same permutation of the output. We therefore choose the E(3)-equivariant GNN (EGNN) (Satorras et al., 2021), which is translation, rotation, and reflection equivariant (E(n)), and permutation equivariant with respect to an input set of points (see Appendix A.3 for details). EGNN takes as input the node embeddings as well as the geometric information (polar coordinates in 2D and spherical polar coordinates in 3D). Interestingly, ModFlow can be viewed as a message-passing temporal graph network (Rossi et al., 2020; Souza et al., 2022), as shown next.

Proposition 2: Modular Flows can be cast as message-passing Temporal Graph Networks (TGNs). The operations are listed in Table 2, where ModFlow is subjected to a single layer of EGNN (see Appendix A.4 for more details).

Table 2: ModFlow as a temporal graph network (TGN). Adopting the TGN notation from Rossi et al. (2020), $v_i$ is a node-wise event on $i$; $e_{ij}$ denotes an (asymmetric) interaction between $i$ and $j$; $s_i$ is the memory of node $i$; $t$ and $t^-$ denote time, with $t^-$ being the time of the last interaction before $t$, e.g., $s_i(t^-)$ is the memory of $i$ just before time $t$; and msg and agg are learnable functions (e.g., MLPs) that compute, respectively, the individual and the aggregated messages. For ModFlow, we use $r_{ij}$ to denote the spatial difference $x_i - x_j$, and $a_{ij}$ to denote the attributes of the edge between $i$ and $j$. The functions $\phi_e$, $\phi_x$, and $\phi_h$ are as defined in Satorras et al. (2021).

| Operation | TGN | ModFlow |
| --- | --- | --- |
| Edge | $m_{ij}(t) = \mathrm{msg}\big(s_i(t^-), s_j(t^-), t, e_{ij}(t)\big)$ | $m_{ij}(t) = \phi_e\big(z_i(t), z_j(t), \lVert r_{ij}(t) \rVert^2, a_{ij}\big)$ |
| Aggregation | $\bar{m}_i(t) = \mathrm{agg}\big(\{m_{ij}(t) \mid j \in \mathcal{N}_i\}\big)$ | $m_i(t) = \sum_{j \in \mathcal{N}_i} m_{ij}$; $\;\hat{m}_{ij}(t) = r_{ij}(t)\, \phi_x\big(m_{ij}(t)\big)$; $\;\hat{m}_i(t) = C \sum_{j \in \mathcal{N}_i} \hat{m}_{ij}(t)$ |
| Memory state | $s_i(t) = \mathrm{mem}\big(\bar{m}_i(t), s_i(t^-)\big)$ | $x_i(t+1) = x_i(t) + \hat{m}_i(t)$ |
| Node | $z_i(t) = \sum_{j \in \mathcal{N}_i} h\big(s_i(t), s_j(t), e_{ij}(t), v_i(t), v_j(t)\big)$ | $z_i(t+1) = \phi_h\big(z_i(t), m_i(t)\big)$ |

3.4 Training objective

Normalizing flows are predominantly trained to minimize $\mathrm{KL}[p_{\mathrm{data}} \,\|\, p_\theta]$, i.e., the KL divergence between the unknown data distribution $p_{\mathrm{data}}$ and the flow-generated distribution $p_\theta$. This objective is equivalent to maximizing $\mathbb{E}_{p_{\mathrm{data}}}[\log p_\theta]$ (Papamakarios et al., 2021). However, note that the discrete graphs $G$ and the continuous atom scores $z(t)$ reside in different spaces. Thus, in order to apply flows, a mapping between the observation space and the flow space is needed. Earlier approaches use dequantization to turn a graph $G$ into a distribution of latent states, and argmax to deterministically map latent states to graphs (Zang and Wang, 2020).

Figure 3: Plate diagram showing both the inference and generative components of ModFlow: $z(0)$ evolves to $z(T)$ under $f$, the softmax yields $\sigma(z(T))$, and the argmax yields $G$; inference traverses the flow in reverse via $f^{-1}$.
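As a rough illustration of the EGNN-style differential behind Section 3.3 and the right-hand column of Table 2, the sketch below implements one E(n)-equivariant message-passing layer in the spirit of Satorras et al. (2021): $\phi_e$, $\phi_x$, and $\phi_h$ are small MLPs here, and the layer sizes and degree-normalized aggregation constant are our illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    """Minimal E(n)-equivariant message-passing layer (after Satorras et al., 2021):
    phi_e builds edge messages from invariant inputs, phi_x updates coordinates
    equivariantly, and phi_h updates node scores."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(2 * dim + 1, hidden), nn.SiLU(),
                                   nn.Linear(hidden, hidden))
        self.phi_x = nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(),
                                   nn.Linear(hidden, 1))
        self.phi_h = nn.Sequential(nn.Linear(dim + hidden, hidden), nn.SiLU(),
                                   nn.Linear(hidden, dim))

    def forward(self, z, x, adj):
        # z: (M, dim) node scores, x: (M, d) coordinates, adj: (M, M) 0/1 adjacency
        M = z.size(0)
        diff = x[:, None, :] - x[None, :, :]                 # r_ij = x_i - x_j
        dist2 = (diff ** 2).sum(-1, keepdim=True)            # ||r_ij||^2 (invariant)
        pair = torch.cat([z[:, None].expand(-1, M, -1),
                          z[None, :].expand(M, -1, -1), dist2], dim=-1)
        m_ij = self.phi_e(pair) * adj[..., None]             # messages on edges only
        m_i = m_ij.sum(1)                                    # aggregate per node
        w_ij = self.phi_x(m_ij) * adj[..., None]             # scalar weight per edge
        deg = adj.sum(1, keepdim=True).clamp(min=1)
        x_new = x + (diff * w_ij).sum(1) / deg               # equivariant coordinate update
        z_new = self.phi_h(torch.cat([z, m_i], dim=-1))      # invariant score update
        return z_new, x_new
```

Because the messages depend only on squared distances and the coordinate update is a weighted combination of the displacement vectors $x_i - x_j$, the score update is invariant and the coordinate update equivariant under rotations, reflections, and translations.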
We instead reduce the learning problem to maximizing $\mathbb{E}_{\hat{p}_{\mathrm{data}}(z(T))}[\log p_\theta(z(T))]$, where we turn the observed set of graphs $\{G_n\}$ into a set of scores $\{z_n\}$ using

$$z_n(G_n; \epsilon) = (1 - \epsilon)\, \mathrm{onehot}(G_n) + \frac{\epsilon}{|\mathcal{A}'|}\, \mathbf{1}_{M(n)} \mathbf{1}_{|\mathcal{A}'|}^\top, \qquad (7)$$

where $\mathrm{onehot}(G_n)$ is a matrix of size $M(n) \times |\mathcal{A}'|$ (i.e., rows equal to the number of nodes in $G_n$ and columns equal to the number of possible node labels) such that $[\mathrm{onehot}(G_n)]_{ik} = 1$ if $v_i = a_k \in \mathcal{A}'$, that is, if vertex $i$ is labeled with atom $k$, and $0$ otherwise; $\mathbf{1}_q$ is a vector with $q$ entries each set to $1$; $\mathcal{A}' \in \{\mathcal{A}, \mathcal{A}_{\mathrm{tree}}\}$; and $\epsilon \in [0, 1]$ is added to model the noise in estimating the posterior $p(z(T) \mid G)$ due to short-circuiting the inference process from $G$ to $z(T)$, skipping the intermediate dependencies, thereby inducing an unconditional distribution $\hat{p}_{\mathrm{data}}$ that is slightly different from the true data distribution $p_{\mathrm{data}}$. The plate diagram in Figure 3 summarizes the overall procedure. Effectively, we exploit the (non-reversible) composition of the argmax and softmax operations to transition from the continuous flow space to the discrete graph space, but skip this composition altogether in the reverse direction. Importantly, this short-circuiting allows ModFlow to keep the forward and backward flows between $z(0)$ and $z(T)$ completely aligned (i.e., reversible), unlike previous approaches.

We maximize the following objective over the $N$ training graphs:

$$\arg\max_\theta\; \mathcal{L} = \mathbb{E}_{\hat{p}_{\mathrm{data}}(z)}\big[\log p_\theta(z)\big] \qquad (8)$$
$$\approx \sum_{n=1}^{N} \log p_T\big(z(T) = z_n\big) \qquad (9)$$
$$= \sum_{n=1}^{N} \sum_{i=1}^{M(n)} \left[ \log p_0\big(z_i(0)\big) - \int_0^T \mathrm{tr}\!\left(\frac{\partial f_\theta\big(t,\, z_i(t),\, z_{\mathcal{N}_i}(t),\, x_i,\, x_{\mathcal{N}_i}\big)}{\partial z_i(t)}\right) dt \right], \qquad (10)$$

which factorizes over the size $M(n)$ of the $n$-th training molecule. The encoding probability follows from Equation (6), where $z(0)$ can be traced by traversing the flow $f$ backward in time, starting from $z_n$ at time $t = T$ until $t = 0$. In practice, we solve the ODE integrals using a numerical solver such as Runge-Kutta. We thus delegate this task to a general solver of the form $\mathrm{ODESolve}(z, f_\theta, T)$, where the map $f_\theta$ is applied for $T$ steps starting with $z$. An optimizer optim is also required for updating $\theta$.

3.5 Molecular generation

Given a molecular structure, we can generate novel molecules by sampling an initial state $z(0) \sim \mathcal{N}(0, I)$ and running the modular flow forward in time for $T$ steps to obtain $z(T)$. This procedure maps a tractable base distribution $p_0$ to a more complex distribution $p_T$. We then apply argmax to pick the most probable label assignment for each node (Zang and Wang, 2020). We outline the procedures for training and generation in Algorithm 1 and Algorithm 2, respectively.

Algorithm 1 Training ModFlow
Require: dataset $\mathcal{D}$, iterations $n_{\mathrm{iter}}$, batch size $B$, number of batches $n_B$
1: Initialize parameters $\theta$ of ModFlow (EGNN)
2: for $k = 1, \ldots, n_{\mathrm{iter}}$ do
3:   for $b = 1, \ldots, n_B$ do
4:     Sample $\mathcal{D}_b = \{G_1, \ldots, G_B\}$ from $\mathcal{D}$
5:     Define $z_b(T) := \{z_r(T) : G_r \in \mathcal{D}_b\}$
6:     Set $z_b(T)$ to $z_b(G_b; \epsilon)$
7:     $\mathcal{L}_b = \frac{1}{B} \sum_{G_r \in \mathcal{D}_b} \log p_\theta(z_r(T))$, using $z_b(0) = \mathrm{ODESolve}(z_b(T), f_\theta^{-1}, T)$
8:   end for
9:   $\theta \leftarrow \mathrm{optim}\big(\frac{1}{n_B} \sum_{b=1}^{n_B} \mathcal{L}_b;\, \theta\big)$
10: end for

Algorithm 2 Generating with ModFlow
1: Sample $z(0) \sim \mathcal{N}(0, I)$
2: $z(T) = \mathrm{ODESolve}(z(0), f_\theta, T)$
3: Assign labels by $\mathrm{argmax}(\sigma(z(T)))$
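The per-molecule likelihood of Equations (8)-(10) can be illustrated with the sketch below, which traverses the flow backward from smoothed one-hot targets and accumulates the node-wise trace terms by brute force. It reuses the `ode(t, z, pos, adj)` signature of the earlier toy sketch; the `eps` value, the Euler discretization, and the exact trace computation are simplifications we assume for exposition (the paper relies on a Runge-Kutta solver and adjoint-based gradients).

```python
import math
import torch

def smoothed_onehot(labels, num_types, eps=0.05):
    """z_n(G_n; eps) = (1 - eps) * onehot(G_n) + eps / |A|, as in Eq. (7)."""
    z = torch.nn.functional.one_hot(labels, num_types).float()
    return (1 - eps) * z + eps / num_types

def log_likelihood(ode, zT, pos, adj, T=1.0, steps=20):
    """Approximate log p_T(z(T)) of Eq. (10): traverse the flow backward with
    Euler steps, accumulating the integral of tr(df_i/dz_i) along the path.
    Parameter gradients would use the adjoint method in practice."""
    M, A = zT.shape
    z, dt, trace_int = zT.clone(), T / steps, torch.zeros(M)
    for k in reversed(range(steps)):
        z = z.detach().requires_grad_(True)
        dz = ode(k * dt, z, pos, adj)
        # Exact node-wise trace by brute force (fine for small A; Hutchinson's
        # estimator or the modular sparsity would be exploited at scale).
        tr = torch.zeros(M)
        for i in range(M):
            for a in range(A):
                g = torch.autograd.grad(dz[i, a], z, retain_graph=True)[0]
                tr[i] += g[i, a].item()
        trace_int += dt * tr
        z = (z - dt * dz).detach()          # Euler step backward toward z(0)
    log_p0 = -0.5 * ((z ** 2).sum(-1) + A * math.log(2 * math.pi))
    return (log_p0 - trace_int).sum()       # summed over the molecule's nodes
```

With the toy three-node example from the earlier sketch, for instance, `log_likelihood(ToyModularODE(5), smoothed_onehot(torch.tensor([0, 2, 1]), 5), pos, adj)` would score one labeled graph under the current flow.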
4 Experiments

We first demonstrate the ability of Modular Flows (ModFlow) to learn highly discontinuous synthetic patterns on 2D grids. We also evaluate ModFlow models trained, variously, on (i) 2D coordinates, (ii) 3D coordinates, (iii) 2D coordinates + tree representation, and (iv) 3D coordinates + tree representation on the tasks of molecular generation and optimization.

Our results show that ModFlow compares favorably to other prominent flow and non-flow based molecular generative models, including GraphDF (Luo et al., 2021), GraphNVP (Madhawa et al., 2019), MRNN (Popova et al., 2019), and GraphAF (Shi et al., 2020). Notably, ModFlow achieves state-of-the-art results without validity checks or post hoc correction. We also provide results of our ablation studies to underscore the relevance of geometric features and equivariance toward this superlative empirical performance.

4.1 Density Estimation

Figure 4: ModFlow can accurately learn to reproduce complex, discontinuous graph patterns.

We generated our synthetic data in the following way. We considered two variants of a chessboard pattern, namely, (i) a 4 x 4 grid where every node takes a binary value, 0 or 1, and neighboring nodes have different values; and (ii) a 16 x 16 grid where the nodes in each 4 x 4 block all take the same value (0 or 1), different from the adjacent blocks. We also experimented with a 20 x 20 grid describing alternating stripes of 0s and 1s. Figure 4 shows that ModFlow can learn neural differential functions $f_\theta$ that reproduce the patterns almost perfectly, indicating sufficient capacity to model complex patterns. That is, ModFlow is able to transform the initial Gaussian distribution into different multi-modal and discontinuous distributions.

4.2 Molecule Generation

Data. We trained and evaluated all the models on the ZINC250K (Irwin et al., 2012) and QM9 (Ramakrishnan et al., 2014) datasets. The ZINC250K set contains 250,000 drug-like molecules, each consisting of up to 38 atoms. The QM9 set contains 134,000 stable small organic molecules with atoms from the set {C, H, O, N, F}. The molecules are processed into kekulized form with hydrogens removed using the RDKit software (Landrum et al., 2013).

Table 3: Random generation on QM9 (top) and ZINC250K (bottom) without post hoc validity corrections. Baseline results are taken from Luo et al. (2021). Higher values are better for all columns.

QM9:
| Method | Validity % | Uniqueness % | Novelty % | Reconstruction % |
| --- | --- | --- | --- | --- |
| GVAE | 60.2 | 9.3 | 80.9 | 96.0 |
| GraphNVP | 83.1 | 99.2 | 58.2 | 100 |
| GRF | 84.5 | 66 | 58.6 | 100 |
| GraphAF | 67 | 94.2 | 88.8 | 100 |
| GraphDF | 82.7 | 97.6 | 98.1 | 100 |
| MoFlow | 89.0 | 98.5 | 96.4 | 100 |
| ModFlow (2D-EGNN) | 96.2 ± 1.7 | 99.5 | 100 | 100 |
| ModFlow (3D-EGNN) | 98.3 ± 0.7 | 99.1 | 100 | 100 |
| ModFlow (JT-2D-EGNN) | 97.9 ± 1.2 | 99.2 | 100 | 100 |
| ModFlow (JT-3D-EGNN) | 99.1 ± 0.8 | 99.3 | 100 | 100 |

ZINC250K:
| Method | Validity % | Uniqueness % | Novelty % | Reconstruction % |
| --- | --- | --- | --- | --- |
| MRNN | 65 | 99.89 | 100 | n/a |
| GVAE | 7.2 | 9 | 100 | 53.7 |
| GCPN | 20 | 99.97 | 100 | n/a |
| GraphNVP | 42.6 | 94.8 | 100 | 100 |
| GRF | 73.4 | 53.7 | 100 | 100 |
| GraphAF | 68 | 99.1 | 100 | 100 |
| GraphDF | 89 | 99.2 | 100 | 100 |
| MoFlow | 50.3 | 99.9 | 100 | 100 |
| ModFlow (2D-EGNN) | 94.8 ± 1.0 | 99.4 | 100 | 100 |
| ModFlow (3D-EGNN) | 95.4 ± 1.2 | 99.7 | 100 | 100 |
| ModFlow (JT-2D-EGNN) | 97.4 ± 1.4 | 99.1 | 100 | 100 |
| ModFlow (JT-3D-EGNN) | 98.1 ± 0.9 | 99.3 | 100 | 100 |

Setup. We adopt common quality metrics to evaluate molecular generation. Validity is the fraction of generated molecules that satisfy the chemical valency of each atom. Uniqueness is the fraction of generated molecules that are unique (i.e., not duplicates of other generated molecules). Novelty is the fraction of generated molecules that are not present in the training data. Reconstruction is the fraction of molecules that can be reconstructed from their encodings. Here, we strictly limit ourselves to comparing all methods on their validity scores without resorting to external correction. We trained each model with 5 random weight initializations, and generated 50,000 molecular graphs for evaluation.
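For concreteness, the metrics of the Setup paragraph could be computed from generated SMILES strings roughly as in the sketch below; the function name and the RDKit-based validity criterion (successful parsing and sanitization) are our illustrative choices, not the paper's exact evaluation code.

```python
from rdkit import Chem

def evaluate_samples(generated_smiles, training_smiles):
    """Illustrative computation of validity, uniqueness, and novelty (Table 3)."""
    # Validity: fraction of samples that RDKit can parse and sanitize.
    mols = (Chem.MolFromSmiles(s) for s in generated_smiles)
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]  # canonical SMILES
    validity = len(valid) / max(len(generated_smiles), 1)
    # Uniqueness: fraction of valid molecules with distinct canonical SMILES.
    unique = set(valid)
    uniqueness = len(unique) / max(len(valid), 1)
    # Novelty: fraction of unique molecules absent from the training set.
    train_canon = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    novelty = len(unique - train_canon) / max(len(unique), 1)
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}
```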
We report the mean and the standard deviation of the scores across these multiple runs.

Implementation. The models were implemented in PyTorch (Paszke et al., 2019). The EGNN used only a single layer with an embedding dimension of 32. We trained with the Adam optimizer (Kingma and Ba, 2014) for 50-100 epochs (until the training loss became stable), with batch size 1000 and learning rate 0.001. ModFlow is significantly faster than autoregressive models such as GraphAF and GraphDF. For more details, see Appendix A.5.

Results. Table 3 reports the performance on QM9 (top) and ZINC250K (bottom). ModFlow achieves state-of-the-art results across all metrics. Notably, its reconstruction rate is 100% (similar to other flow models); in addition, the novelty (100%) and uniqueness (about 99%) scores are also very high. Moreover, ModFlow surpasses the other methods on validity (95%-99%). In Appendix A.6, we document additional evaluations with respect to the MOSES metrics that assess the overall quality of the generated molecules, as well as the distributions of chemical properties. All these results substantiate the promise of ModFlow as an effective tool for molecular generation.

4.3 Property-targeted Molecular Optimization

The task of molecular optimization is to search for molecules that have better chemical properties. We choose the standard quantitative estimate of drug-likeness (QED) as our target chemical property; QED measures the potential of a molecule to be characterized as a drug.

Figure 5: Samples of molecules generated by ModFlow on (a) the QM9 dataset and (b) the ZINC250K dataset. More examples are shown in Appendix A.7.

We used a pre-trained ModFlow model $f$ to encode molecules $\mathcal{M}$ into their embeddings $Z = f(\mathcal{M})$, and applied linear regression to obtain QED scores $Y$ from these embeddings. We then interpolate in the latent space of each molecule along the direction of increasing QED via several gradient-ascent steps, i.e., updates of the form $Z' = Z + \lambda \frac{dY}{dZ}$, where $\lambda$ denotes the length of the search step. The final embedding thus obtained is decoded into a new molecule via the reverse mapping $f^{-1}$.

Figure 6: Example of chemical property optimization on the ZINC250K dataset. Given the left-most molecule, we interpolate in latent space along the direction that maximizes its QED property.

Figure 7: Example of chemical property optimization on the QM9 dataset. Given the left-most molecule, we interpolate in latent space along the direction that maximizes its QED property.

Table 4: Performance in terms of the best QED scores (baselines are taken from Luo et al. (2021)).

| Method | 1st | 2nd | 3rd |
| --- | --- | --- | --- |
| ZINC (dataset) | 0.948 | 0.948 | 0.948 |
| JT-VAE | 0.925 | 0.911 | 0.910 |
| GCPN | 0.948 | 0.947 | 0.945 |
| MRNN | 0.844 | 0.799 | 0.736 |
| GraphAF | 0.948 | 0.948 | 0.947 |
| GraphDF | 0.948 | 0.948 | 0.948 |
| MoFlow | 0.948 | 0.948 | 0.948 |
| ModFlow (2D-EGNN) | 0.948 | 0.941 | 0.937 |
| ModFlow (3D-EGNN) | 0.948 | 0.937 | 0.931 |
| ModFlow (JT-2D-EGNN) | 0.947 | 0.941 | 0.939 |
| ModFlow (JT-3D-EGNN) | 0.948 | 0.948 | 0.945 |

Figures 6 and 7 show examples of molecules decoded from the learned latent space using this procedure, starting with molecules having a low QED score. Note that the number of valid molecules decoded back varies with the query molecule. We report the discovered novel molecules, sorted by their QED scores, in Table 4. Clearly, ModFlow is able to find novel molecules with high QED scores.
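The latent-space search of Section 4.3 can be sketched as follows. The `encode` and `decode` callables are hypothetical stand-ins for a pretrained flow and its inverse mapping between molecules and flat latent vectors, QED values are computed with RDKit's QED module, and the step length and step count are illustrative; decoded outputs may or may not correspond to valid molecules.

```python
import torch
from rdkit import Chem
from rdkit.Chem import QED

def optimize_qed(encode, decode, smiles, steps=10, lam=0.5):
    """Sketch of the QED-targeted latent optimization: encode, fit a linear
    QED regressor, take gradient-ascent steps in latent space, then decode."""
    # 1) Encode molecules and fit a linear model Y = Z w + b by least squares.
    Z = torch.stack([encode(s) for s in smiles])                       # (N, D)
    y = torch.tensor([QED.qed(Chem.MolFromSmiles(s)) for s in smiles])
    Zb = torch.cat([Z, torch.ones(len(smiles), 1)], dim=1)
    w = torch.linalg.lstsq(Zb, y.unsqueeze(1)).solution.squeeze(1)     # (D + 1,)
    direction = w[:-1]                            # dY/dZ for a linear predictor
    # 2) Updates Z' = Z + lam * dY/dZ, then decode via the reverse mapping.
    out = []
    for z in Z:
        for _ in range(steps):
            z = z + lam * direction
        out.append(decode(z))
    return out
```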
4.4 Ablation Studies

We also performed ablation experiments to gain further insights into ModFlow, as described next.

E(3)-equivariant versus non-equivariant. Molecules exhibit translational and rotational symmetries, so we conducted an ablation study to quantify the effect of incorporating these symmetries into our model. We compare the results obtained using an EGNN with those of a non-equivariant graph convolutional network (GCN). For this purpose, we used a 3-layer GCN with layer sizes 64-32-32. The validity scores in Table 5 provide strong evidence in favor of modeling the symmetries explicitly in the proposed Modular Flows.

Table 5: Random generation performance on the ZINC250K and QM9 datasets with E(3)-EGNN versus GCN.

| Dataset | Method | Validity % | Uniqueness % | Novelty % |
| --- | --- | --- | --- | --- |
| ZINC250K | ModFlow (3D-EGNN) | 95.4 ± 1.2 | 99.7 | 100 |
| ZINC250K | ModFlow (GCN) | 90.3 ± 1.9 | 99.7 | 100 |
| QM9 | ModFlow (3D-EGNN) | 98.3 ± 0.7 | 99.1 | 100 |
| QM9 | ModFlow (GCN) | 93.3 ± 0.5 | 98.8 | 100 |

2D versus 3D. Finally, we study whether including information about the 3D coordinates improves the model. Note that the EGNN-coupled differential function obtains either the 2D or the 3D positions as polar coordinates, where the 3D positions have an extra degree of freedom. Table 6 shows that transitioning from 2D to 3D improves the mean validity score.

Table 6: Random generation on the ZINC250K and QM9 datasets with 2D versus 3D features.

| Dataset | Method | Validity % | Uniqueness % | Novelty % |
| --- | --- | --- | --- | --- |
| ZINC250K | ModFlow (3D-EGNN) | 95.4 ± 1.2 | 99.7 | 100 |
| ZINC250K | ModFlow (2D-EGNN) | 94.8 ± 1.0 | 99.4 | 100 |
| QM9 | ModFlow (3D-EGNN) | 98.3 ± 0.7 | 99.1 | 100 |
| QM9 | ModFlow (2D-EGNN) | 96.2 ± 1.7 | 99.5 | 100 |

5 Conclusion

We proposed ModFlow, a new generative flow model in which multiple flows interact locally according to a coupled ODE, resulting in accurate modeling of graph densities and high-quality molecular generation without any validity checks or correction. Interesting avenues open up, including the design of (a) more nuanced mappings between discrete and continuous spaces, and (b) extensions of modular flows to (semi-)supervised settings.

6 Acknowledgments

The calculations were performed using resources within the Aalto University Science-IT project. This work has been supported by the Academy of Finland under the HEALED project (grant 13342077).

References

James Atwood and Don Towsley. Diffusion-convolutional neural networks. Advances in Neural Information Processing Systems, 29, 2016.
G. Richard Bickerton, Gaia V. Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L. Hopkins. Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90-98, 2012.
Andrew M. Bradley. PDE-constrained optimization and the adjoint method. Technical report, 2019.
Ben Chamberlain, James Rowbottom, Maria I. Gorinova, Michael Bronstein, Stefan Webb, and Emanuele Rossi. GRAND: Graph neural diffusion. In International Conference on Machine Learning, pages 1407-1418. PMLR, 2021.
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
Eike Cramer, Alexander Mitsos, Raul Tempone, and Manuel Dahmen. Principal component density estimation for scenario generation using normalizing flows, 2021. URL https://arxiv.org/abs/2104.10410.
Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data, 2018. URL https://arxiv.org/abs/1802.08786.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
Moshe Eliasof, Eldad Haber, and Eran Treister. PDE-GCN: Novel architectures for graph neural networks motivated by partial differential equations. Advances in Neural Information Processing Systems, 34, 2021.
Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1(1):1-11, 2009.
Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pages 3419-3430. PMLR, 2020. URL http://proceedings.mlr.press/v119/garg20c.html.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. URL https://arxiv.org/abs/1406.2661.
Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
Shion Honda, Hirotaka Akita, Katsuhiko Ishiguro, Toshiki Nakanishi, and Kenta Oono. Graph residual flow for molecular graph generation. arXiv preprint arXiv:1909.13521, 2019.
Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows, 2018. URL https://arxiv.org/abs/1804.00779.
Valerii Iakovlev, Markus Heinonen, and Harri Lähdesmäki. Learning continuous-time PDEs from sparse data with graph neural networks. arXiv preprint arXiv:2006.08956, 2020.
John Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf.
John J. Irwin, Teague Sterling, Michael M. Mysinger, Erin S. Bolstad, and Ryan G. Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757-1768, 2012.
Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
Hiroshi Kajino. Molecular hypergraph grammar with its application to molecular optimization. In International Conference on Machine Learning, pages 3183-3191. PMLR, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2013. URL https://arxiv.org/abs/1312.6114.
Ivan Kobyzev, Simon J. D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964-3979, 2021. doi: 10.1109/tpami.2020.2992934.
Andrei Nikolaevich Kolmogorov, Ye. F. Mishchenko, and Lev Semenovich Pontryagin. A probability problem of optimal control. Technical report, Joint Publications Research Service, Arlington, VA, 1962.
Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder, 2017. URL https://arxiv.org/abs/1703.01925.
Greg Landrum et al. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling, 2013.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020.
Jaechang Lim, Seongok Ryu, Jin Woo Kim, and Woo Youn Kim. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of Cheminformatics, 10(1):1-9, 2018.
Phillip Lippe and Efstratios Gavves. Categorical normalizing flows via continuous transformations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=-GLNZeVDuik.
Youzhi Luo, Keqiang Yan, and Shuiwang Ji. GraphDF: A discrete flow model for molecular graph generation, 2021. URL https://arxiv.org/abs/2102.01189.
Kaushalya Madhawa, Katushiko Ishiguro, Kosuke Nakago, and Motoki Abe. GraphNVP: An invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600, 2019.
Lukasz Maziarka, Agnieszka Pocha, Jan Kaczmarczyk, Krzysztof Rataj, Tomasz Danel, and Michał Warchoł. Mol-CycleGAN: A generative model for molecular optimization. Journal of Cheminformatics, 12(1):1-18, 2020.
Kenneth M. Merz, Gianni De Fabritiis, and Guo-Wei Wei. Generative models for molecular design. Journal of Chemical Information and Modeling, 60(12):5635-5636, 2020. doi: 10.1021/acs.jcim.0c01388.
Daniel Neil, Marwin Segler, Laura Guasch, Mohamed Ahmed, Dean Plumbley, Matthew Sellwood, and Nathan Brown. Exploring deep recurrent models with reinforcement learning for molecule design. 2018.
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1-64, 2021.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
Michael Poli, Stefano Massaroli, Junyoung Park, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. Graph neural ordinary differential equations. arXiv preprint arXiv:1911.07532, 2019.
Pavel G. Polishchuk, Timur I. Madzhidov, and Alexandre Varnek. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design, 27(8):675-679, 2013.
Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, Artur Kadurin, Simon Johansson, Hongming Chen, Sergey Nikolenko, Alan Aspuru-Guzik, and Alex Zhavoronkov. Molecular Sets (MOSES): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 2020.
Mariya Popova, Mykhailo Shvets, Junier Oliva, and Olexandr Isayev. MolecularRNN: Generating realistic molecular graphs with optimized properties, 2019. URL https://arxiv.org/abs/1905.13372.
Kristina Preuer, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter, and Gunter Klambauer. Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736-1741, 2018.
Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022. URL https://arxiv.org/abs/2204.06125.
Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs, 2020. URL https://arxiv.org/abs/2006.10637.
Bidisha Samanta, Abir De, Niloy Ganguly, and Manuel Gomez-Rodriguez. Designing random graph models using variational autoencoders with applications to chemical design. arXiv preprint arXiv:1802.05283, 2018.
Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks, 2021. URL https://arxiv.org/abs/2102.09844.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2009. doi: 10.1109/TNN.2008.2005605.
Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, and Mark P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4(1):120-131, 2018.
Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. GraphAF: A flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.
Amauri Souza, Diego Mesquita, Samuel Kaski, and Vikas Garg. Provably expressive temporal graph networks. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
Jonathan M. Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M. Donghia, Craig R. MacNair, Shawn French, Lindsey A. Carfrae, Zohar Bloom-Ackermann, Victoria M. Tran, Anush Chiappino-Pepe, Ahmed H. Badran, Ian W. Andrews, Emma J. Chory, George M. Church, Eric D. Brown, Tommi S. Jaakkola, Regina Barzilay, and James J. Collins. A deep learning approach to antibiotic discovery. Cell, 180(4):688-702.e13, 2020. doi: 10.1016/j.cell.2020.01.021.
David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31-36, 1988.
Scott A. Wildman and Gordon M. Crippen. Prediction of physicochemical parameters by atomic contributions. Journal of Chemical Information and Computer Sciences, 39(5):868-873, 1999.
Louis-Pascal Xhonneux, Meng Qu, and Jian Tang. Continuous graph neural networks. In International Conference on Machine Learning, pages 10432-10441. PMLR, 2020.
Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, pages 5708-5717. PMLR, 2018.
Chengxi Zang and Fei Wang. MoFlow: An invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 617-626, 2020.
Xingjian Zhen, Rudrasis Chakraborty, Liu Yang, and Vikas Singh. Flow-based generative models for learning manifold to manifold mappings, 2020. URL https://arxiv.org/abs/2012.10013.