# Heterophily-informed Message Passing

Published in Transactions on Machine Learning Research (04/2025)

Haishan Wang (haishan.wang@aalto.fi), Aalto University
Arno Solin (arno.solin@aalto.fi), Aalto University
Vikas Garg (vgarg@csail.mit.edu), YaiYai Ltd and Aalto University

Reviewed on OpenReview: https://openreview.net/forum?id=9fPinz1iH2

Abstract

Graph neural networks (GNNs) are known to be vulnerable to oversmoothing due to their implicit homophily assumption. We mitigate this problem with a novel scheme that regulates the aggregation of messages, modulating the type and extent of message passing locally and thereby preserving both the low- and high-frequency components of information. Our approach relies solely on learnt embeddings, obviating the need for auxiliary labels and thus extending the benefits of heterophily-aware embeddings to broader applications, e.g., generative modelling. Our experiments, conducted across various data sets and GNN architectures, demonstrate performance enhancements and reveal heterophily patterns across standard classification benchmarks. Furthermore, application to molecular generation showcases notable performance improvements on chemoinformatics benchmarks.

1 Introduction

Methods for ubiquitous graph-structured data with abundant topological information have advanced the field of graph representation learning in recent years. Graph neural networks (GNNs) have emerged as prominent deep learning models in this domain (Hamilton, 2020). A key feature of GNNs is the message-passing (MP) scheme, inspired by belief propagation, which facilitates the processing of local topology while maintaining computational efficiency (Dai et al., 2016). The MP scheme enables local information exchange between nodes and their neighbours, implicitly assuming strong homophily, i.e., the tendency of nodes to connect with others that have similar labels or features. This assumption turns out to be reasonable in settings such as social data (McPherson et al., 2001), regional planning (Gerber et al., 2013), and citation networks (Ciotti et al., 2016). However, heterophilous graphs exist in many scenarios, e.g., fraud transaction networks (Pandit et al., 2007) and actor co-occurrence networks (Tang et al., 2009). They violate the homophily assumption, leading to sub-optimal performance (Zhu et al., 2020; 2021; Chien et al., 2021; Lim et al., 2021; Wang et al., 2023) owing to oversmoothing (Li et al., 2018), which results from the flattening of high-frequency information (Wu et al., 2023) by MP schemes.

A conceptual way to characterize homophily is by examining the neighbours of each node in a graph. For example, in a graph representing a molecule, a fully homophilous molecule only has bonds between atoms of the same type, while a heterophilous molecule has bonds between atoms of different types. However, in practice, the node labels necessary for computing homophily are often missing due to lack of information, or are unavailable due to privacy concerns. Instead, heterophily typically stems from more intricate properties of the graph, which need to be learned from data. To address this problem, we introduce a general heterophily-informed MP scheme that carefully accounts for and exploits the heterophily properties of graph data. Many previous works analyze the impact of heterophily on GNN performance and design innovative structures to mitigate it (Zhu et al., 2020; Liu et al., 2021a; Yan et al., 2022; Ma et al., 2021).
Some studies offer deeper insights into how heterophily affects model expressiveness (Ma et al., 2021; Luan et al., 2022; Mao et al., 2023; Luan et al., 2023). However, most of these works provide a specialized model and only focus on simple node classification tasks. In contrast, we aim to design a simple yet flexible structure that can be applied to any GNN. We verify its effectiveness not only on classification tasks but also on the more challenging task of molecular generation, which requires models to learn the data distribution over graph embeddings.

Our contributions  We introduce a novel heterophily-informed MP scheme, serving as an approach for general GNNs to utilize data heterophily, designed to learn graph structures and node features across varying degrees of homophily and heterophily. The scheme improves various GNN architectures for node classification across different data domains. Furthermore, we apply our approach to a flow-based graph generation model (HetFlows). Our key contributions are summarized below:

- Conceptual and technical: We propose an architecture-independent approach that encodes homophily/heterophily patterns for general GNNs, offers flexible applications, and highlights the necessity of data heterophily as a model prior.
- Methodological: We design a heterophily-informed message-passing scheme focusing on node-specific heterophily patterns and apply it to (i) various classic GNNs for node classification and (ii) an invertible flow-based model (HetFlows) for molecule generation.
- Empirical: We demonstrate the benefits of our idea by benchmarking node classification accuracy and, in molecule generation, by evaluating the generated molecules with an extensive set of chemoinformatics metrics.

Notable advantages of our model include enhancing performance on different graph learning tasks without adding extra parameters, offering flexible application across various GNN architectures, and revealing the homophily pattern match between message passing and data sets.

2 Related Work

Graph heterophily  Graph heterophily refers to the tendency of nodes with different labels to connect, as opposed to graph homophily. This property can be measured by node homophily (Pei et al., 2019), edge homophily (Zhu et al., 2020), or more carefully designed metrics (Lim et al., 2021; Platonov et al., 2023). Recent research shows the importance and benefits of considering heterophily in graph learning. For example, empirical experiments (Zhu et al., 2022) reveal the heterophilic nature of structural attacks and validate that incorporating heterophily enhances model robustness to adversarial attacks. Similarly to oversmoothing, heterophily challenges graph learning through less discriminative node representations; Bodnar et al. (2022) link GNN performance on heterophilic graphs to the oversmoothing problem via cellular sheaf theory.

Heterophily-aware GNNs  Numerous techniques have been developed to address the degradation of GNN performance in heterophilic settings. Some approaches expand the concept of neighbour sets, such as aggregating messages from farther hops of neighbours (Abu-El-Haija et al., 2019; Wang & Derr, 2021) or searching for potential new neighbours (Pei et al., 2019; Jin et al., 2021; Li et al., 2022).
Others focus on refining the messages during the aggregation process, for example by differentiating neighbours through specific filters (Luan et al., 2022; Yan et al., 2022; Wang et al., 2023) or by collecting embeddings from previous layers, as in jumping knowledge (Xu et al., 2018) and generalized PageRank techniques (Chien et al., 2021). These methods typically require specialized structures or ignore the local homophily during message passing. In contrast, we aim to capture local homophily differences with minimal structural modifications, making the solution flexible and widely applicable. Heterophily-informed MP can be viewed as performing adaptive message modulation before aggregation.

Molecule representation and generation  Early works in molecule generation (Kusner et al., 2017; Guimaraes et al., 2017; Gómez-Bombarelli et al., 2018; Dai et al., 2018) primarily used sequence models to encode SMILES (Simplified Molecular-Input Line-Entry System) strings (Weininger et al., 1989), posing generation as an autoregressive problem. However, the mapping from molecules to SMILES is not continuous, so similar molecules can be assigned vastly different string representations. Graphs provide an elegant abstraction to encode the interactions between the atoms in a molecule. Thus the field has gravitated towards representing molecules as (geometric) graphs and using powerful graph encoders based on GNNs, for example in adversarial models (De Cao & Kipf, 2018; You et al., 2018), energy-based models (Liu et al., 2021b), diffusion models (Hoogeboom et al., 2022; Xu et al., 2023), neural ODEs (Verma et al., 2022), and flow-based models (Shi et al., 2019; Luo et al., 2021; Zang & Wang, 2020).

3 Heterophily-informed Message Passing

We propose an architecture-independent approach that encodes homophily/heterophily patterns for general GNNs and later apply it to GNNs in node classification and graph generation setups.

3.1 Prerequisites: Message Passing

Graph neural networks (GNNs) have emerged as a potent paradigm for learning from graph-structured data, where the challenges include diverse graph sizes and varying structures (Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2019; Garg et al., 2020). Consider a graph $G = (V, E)$ with nodes $V$ and edges $E$. We denote the corresponding node features as $X = \{x_v \in \mathbb{R}^{n_a} \mid v \in V\}$ and edge features as $E = \{e_{uv} \in \mathbb{R}^{n_b} \mid (u, v) \in E\}$, where $n_a, n_b$ are the feature dimensions of nodes and edges. For each node $v \in V$, its embedding at the $k$th layer is denoted $h_v^{(k)}$, and $h^{(k)} = \{h_v^{(k)} \mid v \in V\}$. These embeddings evolve through a sequence of transformations across a GNN of depth $K$ by the message-passing scheme (Hamilton, 2020):

$$m_{uv}^{(k)} = \mathrm{MESSAGE}^{(k)}\big(h_u^{(k)}, e_{uv}\big), \quad u \in N(v), \tag{1}$$
$$h_v^{(k+1)} = \mathrm{UPDATE}^{(k)}\big(h_v^{(k)}, m_{N(v)}^{(k)}\big), \tag{2}$$

for $k \in \{0, 1, \ldots, K\}$. Here $N(v)$ denotes the neighbour set of node $v$, and both $\mathrm{UPDATE}^{(k)}$ and $\mathrm{MESSAGE}^{(k)}$ are arbitrary differentiable functions. Messages from all neighbours of $v$ are aggregated in the multiset $m_{N(v)}^{(k)} = \{\!\{ m_{uv}^{(k)} \mid u \in N(v) \}\!\}$. Importantly, the function $\mathrm{UPDATE}^{(k)}$ needs to be permutation invariant over this message set $m_{N(v)}^{(k)}$ (e.g., by resorting to operations like summation or taking the maximum).
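To make the scheme concrete, below is a minimal PyTorch sketch of one such layer with sum aggregation. The two-layer MLPs standing in for MESSAGE and UPDATE and the PyTorch-Geometric-style `edge_index` layout are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class MPLayer(nn.Module):
    """One generic message-passing layer, Eqs. (1)-(2), with sum aggregation."""
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        # MESSAGE^(k): builds m_uv from the sender embedding h_u and edge feature e_uv
        self.message = nn.Sequential(nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU())
        # UPDATE^(k): combines h_v with the aggregated multiset of messages
        self.update = nn.Sequential(nn.Linear(2 * node_dim, node_dim), nn.ReLU())

    def forward(self, h, edge_index, edge_attr):
        # edge_index[0] = senders u, edge_index[1] = receivers v (PyG convention)
        u, v = edge_index
        m_uv = self.message(torch.cat([h[u], edge_attr], dim=-1))   # Eq. (1)
        # Summation is permutation invariant over the multiset m_N(v)
        agg = torch.zeros_like(h).index_add_(0, v, m_uv)
        return self.update(torch.cat([h, agg], dim=-1))             # Eq. (2)
```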
However, in practice, a naïve aggregation strategy typically mixes messages, especially in heterophilic localities, leading to the oversmoothing problem (Zhu et al., 2020; 2021; Chien et al., 2021; Lim et al., 2021; Wang et al., 2023).

3.2 Heterophily-informed Message Passing

Our method encodes the heterophily assumption into the MP scheme of the GNN, denoted $\mathrm{GNN}_\gamma(X \mid E) = h^{(K)}$. The indicator $\gamma \in \Gamma = \{\text{orig.}, \text{hom.}, \text{het.}\}$ specifies the heterophily preference of the GNN as a hyperparameter: whether it leans towards homophily (hom.), heterophily (het.), or the original structure (orig.). $\mathrm{GNN}_\text{orig.}$ is exactly the original GNN described in Sec. 3.1, and we name $\mathrm{GNN}_\text{het.}$ and $\mathrm{GNN}_\text{hom.}$ after the HetMP and HomMP modes of $\mathrm{GNN}_\text{orig.}$. Referring to Eq. (1) and Eq. (2), the messages undergo a scaling preprocessing step before being sent forward to the subsequent layer:

$$m_{uv}^{(k)} = \mathrm{MESSAGE}^{(k)}\big(h_u^{(k)}, e_{uv}\big), \quad u \in N(v), \tag{3}$$
$$m_{N(v)}^{(k)} = \{\!\{ \alpha_{uv,\gamma}^{(k)}\, m_{uv}^{(k)} \mid u \in N(v) \}\!\}, \tag{4}$$
$$h_v^{(k+1)} = \mathrm{UPDATE}^{(k)}\big(h_v^{(k)}, m_{N(v)}^{(k)}\big), \tag{5}$$

where the scaling factors are

$$\alpha_{uv,\gamma}^{(k)} = \begin{cases} H(u, v), & \text{if } \gamma = \text{hom.}, \\ 1, & \text{if } \gamma = \text{orig.}, \\ 1 - H(u, v), & \text{if } \gamma = \text{het.}, \end{cases} \tag{6}$$

and $H$ denotes the node homophily; the embeddings are initialized as $h^{(0)} = X$. Since we aim to learn embeddings in place of node labels, in practice we replace the traditional label-based definition and define the homophily, i.e., the attraction to similarity between embeddings, as the cosine similarity $H(u, v) = S_{\cos}(h_u^{(k)}, h_v^{(k)})$ at the relevant layer in Eq. (6). For more representative embeddings, the three channels (HomMP, HetMP, and the original model) can be combined through an additional linear layer into a mixed model (mix.) named MixMP. App. A.5 provides an example of a GCN-based MixMP.

Figure 1: Comparison of heterophily-informed MP with the original GNN on a graph describing the 3-Aminophenol molecule. The three channels (heterophilous, original, homophilous) show how γ controls the scaling factor in Eq. (6) and leads to different message-passing behaviour, given the same input.

The message-passing process of an example GNN and its HetMP and HomMP versions is visualized in Fig. 1 as three channels. The input is the example molecule 3-Aminophenol; node colouring corresponds to the embeddings, with similar colours indicating closer embeddings. The thickness of a bond corresponds to the scaling factor $\alpha_\gamma$ (which reflects the similarity between node pairs) at the last layer. The homophilous channel (hom.) decreases information exchange between dissimilar neighbours, introducing friction into the message flow. The heterophilous channel (het.) encourages faster message spreading in higher-frequency localities, while slowing it down as neighbours become similar. In this simple example, the original channel (orig.) shows the typical oversmoothing issue: all node embeddings become similar after a few layers, while the other two channels mitigate it with different convergence resistances, thus improving model effectiveness.
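As a concrete illustration, the scaling step of Eqs. (4) and (6) can be sketched as follows; the function name and the PyG-style `edge_index` layout are our assumptions, and this is a sketch rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def scale_messages(h, edge_index, m_uv, gamma="het."):
    """Scale messages per Eq. (6): alpha = H (hom.), 1 (orig.), or 1 - H (het.),
    with H(u, v) the cosine similarity of the current embeddings."""
    u, v = edge_index
    H = F.cosine_similarity(h[u], h[v], dim=-1)      # H(u, v) per edge
    if gamma == "hom.":
        alpha = H
    elif gamma == "het.":
        alpha = 1.0 - H
    else:                                            # "orig."
        alpha = torch.ones_like(H)
    return alpha.unsqueeze(-1) * m_uv                # Eq. (4)
```

A MixMP layer then runs all three channels on the same messages and mixes the resulting embeddings with an additional linear layer (cf. App. A.5).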
3.3 Application to Molecular Generation

For generative modelling of molecular graphs, we propose a graph generation model based on heterophily-informed MP: HetFlows. The model is built on the normalizing-flow model MoFlow (Zang & Wang, 2020), which generates molecules with an associated likelihood, enabling uncertainty estimation and the application of Bayesian tools. Our model is split into two main components: a bond flow and an atom flow. The bond flow learns the molecular topology, while the atom flow assigns atomic details to this structure.

Prerequisites  In general, normalizing flows offer a methodological approach to modelling distributions based on the change-of-variables law of probabilities. This is achieved by applying a chain of reversible, bijective transformations between simple variables (e.g., Gaussian) and the target data variables, and updating the transformations to minimize the negative log-likelihood (Dinh et al., 2014). Affine coupling layers (ACLs) introduce reversible transformations to normalizing flows that ensure efficient computation of the log-determinant of the Jacobian (Kingma & Dhariwal, 2018). ACLs maintain reversibility by updating part of the input using the remaining part. A series of ACLs $\{f_1, \ldots, f_T\}$ builds a reversible flow between the Gaussian and the target distribution, $f = f_T \circ \cdots \circ f_1$.

Training and loss  Given a molecule graph $G = (X, E)$, the atom flow $f_a$ and bond flow $f_b$ map the graph into embeddings which follow Gaussian distributions $\mathcal{N}_a, \mathcal{N}_b$:

$$f_a(X \mid E) = h_a \sim \mathcal{N}_a, \qquad f_b(E) = h_b \sim \mathcal{N}_b. \tag{7}$$

The model is trained to minimize the negative log-likelihoods (NLLs) of the data by gradient descent. The loss function $\mathcal{L}$ decomposes into two parts, since HetFlows contains two flows: $\mathcal{L} = \mathcal{L}_a + \mathcal{L}_b$. The atom loss $\mathcal{L}_a$ is defined as

$$\mathcal{L}_a = -\log p_X(X) = -\log p(h_a) - \log \left| \det \frac{\partial f_a}{\partial X} \right|. \tag{8}$$

The bond loss $\mathcal{L}_b$ is defined analogously.

Generation process  Given a trained model $f_a^*, f_b^*$ with fixed parameters, embeddings $h_a, h_b \sim \mathcal{N}_a, \mathcal{N}_b$ are sampled randomly from the Gaussians. The embeddings then generate the bond features $E = (f_b^*)^{-1}(h_b)$ and atom features $X = (f_a^*)^{-1}(h_a \mid E)$ in sequence, which requires the reversibility of the flow model. Finally, the molecule is reconstructed as $G = (X, E)$. Alternatively, the bond features can be obtained by sampling the adjacency matrix from the real data distribution (denoted by +true adj. later), which isolates the separability of the model components.

HetFlows shares the same structure as MoFlow (Zang & Wang, 2020): ACL-based normalizing flows. The coupling function stores most of the parameters of an ACL and serves as its most essential part. The main difference between the two methods is the message-passing scheme of the GNNs used as coupling functions in all ACLs of the flow: MoFlow contains classical graph convolutional networks (Kipf & Welling, 2017), whereas HetFlows substitutes a MixMP version for them. Further details of HetFlows are given in App. A, including an introduction to normalizing flows and ACLs, a mathematical model description, the training and generation processes of HetFlows, the loss function, and a proof of model reversibility.

4 Experiments

We demonstrate the effects of heterophily-informed MP both on discriminative node classification benchmarks and in molecule generation settings. Sec. 4.1 compares the node classification performance of various types of GNNs with their variants (HetMP, HomMP, and MixMP versions) across 15 data sets. In Sec. 4.2, we demonstrate how our MixMP GCN blocks improve flow-based molecule generation, directly showing the benefits of our MP scheme.

4.1 Node Classification Benchmarks

Data sets  The node classification experiments belong to the class of semi-supervised learning tasks. We evaluated on 5 homophilic data sets from citation networks (Yang et al., 2016) (Cora, PubMed, CiteSeer) and co-purchase graphs (Shchur et al., 2018) (Computers, Photo), and on 10 heterophilic data sets including hyperlink networks (Pei et al., 2019) (Cornell, Wisconsin, Texas), Wikipedia networks (Rozemberczki et al., 2021) (Chameleon, Squirrel), and the heterophilous graph benchmark of Platonov et al. (2023) (Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions). The graph homophily $H_{ei}$ is measured by the class-insensitive edge homophily ratio (Lim et al., 2021).
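For reference, the following is a sketch of this homophily measure as we read it from Lim et al. (2021): the per-class edge homophily, shifted by the class proportion, clamped at zero, and averaged over classes. The function name and the undirected `edge_index` layout (both edge directions present) are our assumptions.

```python
import torch

def class_insensitive_edge_homophily(edge_index, y, num_classes):
    """H_ei of Lim et al. (2021): per-class edge homophily minus the class
    proportion, clamped at zero and averaged over classes."""
    u, v = edge_index
    h_ei = 0.0
    for k in range(num_classes):
        mask = y[u] == k                        # edges whose sender is in class k
        if mask.sum() == 0:
            continue
        h_k = (y[v][mask] == k).float().mean()  # fraction linking back to class k
        p_k = (y == k).float().mean()           # class proportion |C_k| / n
        h_ei += torch.clamp(h_k - p_k, min=0.0)
    return h_ei / (num_classes - 1)
```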
Setups  The models were implemented in PyTorch (Paszke et al., 2019) and PyTorch Geometric (PyG) (Fey & Lenssen, 2019). The base GNNs in Sec. 4.1 are the PyG-implemented versions. Each model and its variants (HetMP, HomMP) contain 2 layers with 128 dimensions for all hidden layers. All models are trained with the AdamW optimizer (Loshchilov & Hutter, 2019), learning rate 0.001, and drop-out ratio 0.2. The data splits follow a 60%/20%/20% training/validation/test partition. Each configuration (data set and model) is tested with 10 random model initializations and data splits.

Table 1: We shed light on how our homo-/heterophilous message passing can boost standard node classification benchmark accuracy (mean ± std). Data sets are sorted according to their homophily level $H_{ei}$. For each data set (column), four classic GNN architectures (GCN, GAT, GIN, GraphSAGE) are tested, grouped with their corresponding HetMP/HomMP/MixMP modes (het., hom., or mix.). The best result within each 4-mode group is determined with a paired t-test (5% significance level). Our MixMP models perform either equally well or significantly better in all cases, with an average accuracy improvement of 3.84 %-points over the original model. Note that the accuracy of binary-class data sets is reported as the ROC-AUC score, following the convention of Platonov et al. (2023).

| Model | Texas | Minesweeper | Roman-empire | Squirrel | Cornell | Chameleon | Questions | Wisconsin | Amazon-ratings | Tolokers | CiteSeer | PubMed | Computers | Cora | Photo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Homophily $H_{ei}$ | 0.00 | 0.01 | 0.02 | 0.03 | 0.06 | 0.06 | 0.08 | 0.10 | 0.13 | 0.18 | 0.63 | 0.66 | 0.70 | 0.77 | 0.77 |
| GCN | 58.4±4.6 | 52.3±0.6 | 47.8±0.8 | 26.9±1.4 | 42.7±4.9 | 37.1±3.0 | 53.2±0.4 | 52.4±6.0 | 49.1±0.6 | 58.0±2.9 | 76.7±1.3 | 88.5±0.5 | 91.6±0.5 | 88.0±0.9 | 94.4±0.5 |
| GCN+hom. | 58.6±5.1 | 57.3±1.0 | 56.6±0.6 | 29.6±1.7 | 41.4±5.4 | 40.4±3.0 | 52.8±0.2 | 54.5±3.3 | 47.1±0.5 | 57.7±3.2 | 76.8±1.4 | 88.7±0.5 | 91.5±0.5 | 86.9±0.9 | 94.4±0.4 |
| GCN+het. | 60.0±7.7 | 71.5±1.8 | 60.2±0.5 | 28.8±1.3 | 50.0±8.2 | 28.6±1.5 | 52.1±0.9 | 53.9±3.8 | 46.4±0.6 | 55.1±1.1 | 69.6±1.2 | 83.9±0.5 | 78.3±1.4 | 82.6±1.1 | 85.4±1.6 |
| GCN+mix. | 56.8±5.3 | 74.5±2.3 | 71.5±0.5 | 32.0±2.7 | 48.9±6.2 | 38.9±3.9 | 53.1±0.6 | 52.0±6.1 | 49.7±0.8 | 58.7±1.6 | 76.9±1.1 | 88.6±0.5 | 91.2±0.6 | 87.7±0.7 | 94.6±0.6 |
| GAT | 57.8±4.6 | 54.4±1.6 | 47.1±0.6 | 29.4±1.5 | 45.7±6.3 | 43.3±2.9 | 52.4±0.4 | 50.8±8.7 | 48.8±0.5 | 57.0±2.7 | 76.2±1.1 | 87.1±0.5 | 91.2±0.3 | 86.6±0.9 | 94.1±0.5 |
| GAT+hom. | 59.5±4.9 | 55.5±3.2 | 57.6±1.0 | 32.6±1.5 | 47.0±7.5 | 43.6±2.4 | 53.0±0.7 | 54.9±5.8 | 47.2±0.7 | 55.7±3.2 | 75.9±1.4 | 88.2±0.5 | 90.9±0.3 | 85.9±1.1 | 93.6±0.6 |
| GAT+het. | 59.7±5.5 | 52.7±3.8 | 48.0±1.3 | 29.4±1.1 | 58.9±9.4 | 27.9±1.6 | 51.3±0.8 | 53.3±6.8 | 44.1±0.6 | 50.5±0.5 | 67.5±1.7 | 79.6±0.9 | 84.0±2.0 | 75.1±1.4 | 87.3±1.3 |
| GAT+mix. | 59.2±4.7 | 75.1±2.4 | 68.9±0.9 | 34.4±1.4 | 50.5±7.5 | 46.4±2.6 | 54.8±0.9 | 55.3±6.3 | 50.2±0.5 | 62.4±2.7 | 75.7±1.3 | 87.8±0.7 | 91.1±0.3 | 85.7±1.4 | 94.1±0.3 |
| GraphSAGE | 73.2±5.3 | 73.2±1.7 | 78.9±0.4 | 36.5±1.9 | 70.5±3.7 | 50.2±1.9 | 56.0±0.8 | 77.1±6.5 | 52.7±0.6 | 60.5±1.6 | 76.7±1.0 | 88.1±0.5 | 90.9±0.3 | 87.7±1.5 | 95.5±0.4 |
| GraphSAGE+hom. | 70.5±5.5 | 68.4±1.4 | 78.6±0.6 | 36.2±1.6 | 67.3±4.1 | 50.8±2.6 | 55.7±0.7 | 80.2±5.2 | 51.6±0.5 | 59.7±1.5 | 77.4±0.7 | 88.9±0.4 | 90.6±0.4 | 87.5±1.1 | 95.5±0.4 |
| GraphSAGE+het. | 75.9±6.8 | 71.0±1.9 | 78.1±0.7 | 35.5±1.4 | 72.2±7.6 | 50.7±2.4 | 56.1±0.7 | 80.0±4.3 | 52.2±0.6 | 57.7±1.4 | 76.0±0.8 | 87.7±0.5 | 89.4±0.6 | 86.6±1.3 | 95.0±0.6 |
| GraphSAGE+mix. | 73.0±6.5 | 73.1±1.9 | 78.7±0.7 | 36.4±2.3 | 68.1±6.5 | 50.6±2.5 | 57.2±1.2 | 80.2±5.0 | 53.5±0.7 | 59.2±2.0 | 76.5±0.9 | 88.4±0.5 | 90.3±0.5 | 87.6±0.8 | 95.2±0.5 |
| GIN | 57.0±6.9 | 55.1±3.2 | 48.2±1.2 | 26.5±3.1 | 46.8±7.1 | 36.8±5.6 | 51.8±3.0 | 44.9±6.8 | 50.2±0.7 | 53.3±6.9 | 71.0±1.5 | 86.1±0.8 | 41.2±3.1 | 84.0±1.1 | 35.3±7.1 |
| GIN+hom. | 58.1±6.8 | 58.2±4.6 | 65.9±0.8 | 24.6±1.0 | 46.8±6.6 | 36.0±3.8 | 55.7±4.6 | 49.0±6.3 | 49.2±0.7 | 53.0±6.4 | 71.2±1.9 | 87.4±0.8 | 42.8±3.8 | 84.3±1.3 | 37.7±8.0 |
| GIN+het. | 59.5±8.0 | 65.7±9.8 | 58.6±1.1 | 26.5±2.0 | 50.3±10.7 | 35.9±5.6 | 50.8±2.6 | 52.0±3.8 | 48.2±1.2 | 53.7±5.8 | 69.3±1.7 | 81.1±14.9 | 43.1±9.3 | 81.9±1.1 | 38.4±15.7 |
| GIN+mix. | 58.4±6.8 | 60.5±7.0 | 64.0±1.0 | 28.9±1.8 | 49.5±6.7 | 36.5±4.1 | 59.3±3.9 | 50.6±7.8 | 48.2±1.5 | 56.3±7.0 | 72.6±2.2 | 86.6±0.9 | 57.3±14.7 | 84.3±1.5 | 78.2±13.6 |

Figure 2: Comparison over 15 data sets and 4 GNN architectures in terms of improvement in accuracy (%-points), plotted against the data set homophily $H_{ei}$. (a) Accuracy improvement of MixMP; (b) accuracy improvement of HetMP and HomMP. Here $a_\text{orig}$, $a_\text{hom}$, $a_\text{het}$, and $a_\text{mix}$ denote the accuracy of the original model and its corresponding HomMP, HetMP, and MixMP versions. In Fig. 2a, each point is the average performance over 10 random seeds for a specific data set and GNN structure, illustrating the accuracy advantage of MixMP over the original model. In Fig. 2b, the y-axis denotes the accuracy improvement of HetMP and HomMP over the original structure.

Baselines  Four different GNN architectures are chosen as baseline models: Graph Convolutional Network (GCN) (Kipf & Welling, 2017), Graph Attention Network (GAT) (Veličković et al., 2018), Graph Isomorphism Network (GIN) (Xu et al., 2019), and GraphSAGE (Hamilton et al., 2017). For more details about the data split rules and base GNNs, please refer to App. B.

Results  The comprehensive benchmark results in Table 1 report the node classification accuracy (or ROC-AUC score for binary labels); each value is the average of 10 random runs of the four base models and their HetMP, HomMP, and MixMP versions across 15 graph domains, with the data homophily displayed. For each GNN and data set, the best result among the four modes is determined via the paired t-test. For a more intuitive visualization, Fig. 2 illustrates the classification accuracy benefits of heterophily-informed MP and the patterns revealed by the accuracy discrepancies. A further comparison with other heterophily-aware baseline models is given in App. B.4.

Analysis  Table 1 demonstrates that, across the 10 heterophilic data sets, our proposed MixMP performs comparably to or better than the baseline in 38 out of 40 cases, evaluated over four types of GNN architectures. This strong performance is further supported by Fig. 2a, where the majority of data points lie on or above the zero line, indicating that MixMP consistently enhances node classification performance. Notably, the improvement is stronger on heterophilous data sets than on homophilous ones (with minor outliers from GIN), suggesting that our model modification incorporates data heterophily as a structural prior, whereas homophilic data benefit little.
The consistent performance gains across diverse combinations of settings underscore the generalization capability of our approach, which could also be applied to a wider range of GNN structures and data domains.

In Fig. 2b, the accuracy differences $a_\text{het} - a_\text{orig}$ and $a_\text{hom} - a_\text{orig}$ are mostly concentrated in the top-left and bottom-right regions. For homophilic data, HomMP has a minor impact on accuracy, whereas HetMP even adversely affects performance. In contrast, heterophilic data benefit from both HomMP and HetMP on the node classification task. These observations provide empirical evidence for the implicit homophily assumption underlying the traditional MP scheme. Furthermore, both HomMP and HetMP mitigate the issue of oversmoothing in heterophilic data, as visualized in Fig. 1.

Only minor performance improvements are observed on the Cornell, Wisconsin, and Texas data sets, likely attributable to their small graph sizes (183–251 nodes) and the resulting limited statistical significance. In contrast, our MixMP achieves substantial, significant improvements on Minesweeper and Roman-empire, which are specifically designed to evaluate GNNs under heterophily (Platonov et al., 2023). These results further validate the advantages of our approach in scenarios where heterophily is a critical factor. In conclusion, the node classification experiments highlight the capability of MixMP to incorporate data heterophily as a prior and to enhance node-level graph representation learning, especially on heterophilic data.

4.2 Molecule Generation

Molecules express varying levels of homo-/heterophily, so molecular modelling provides an interesting benchmark for our proposed MP scheme. We demonstrate the impact of accounting for heterophilic message passing in a variety of common benchmark tasks for molecule generation and modelling, with results on a wide range of chemoinformatics metrics.

Setups  HetFlows in Sec. 4.2 is built on GNNs with 4 layers and flows of depth $k_a = 27$, $k_b = 10$ (for qm9) and $k_a = 38$, $k_b = 10$ (for zinc-250k). For the generation task, we select the best-performing model using the FCD score, as suggested by Polykovskiy et al. (2020). We report numbers without post hoc validity corrections.

Data sets  We consider two common molecule data sets: qm9 and zinc-250k. The qm9 data set (Ramakrishnan et al., 2014) comprises 134k stable small organic molecules composed of atoms from the set {C, H, O, N, F}. These molecules have been processed into their kekulized forms with hydrogens removed using the RDKit software (Landrum et al., 2013). The zinc-250k data set (Irwin et al., 2012) contains 250k drug-like molecules, each with up to 38 atoms of 9 different types.

Chemoinformatics metrics  We compare methods through an extensive set of chemoinformatics metrics that perform both sanity checks (validity, uniqueness, and novelty) on the generated molecule corpus and quantify properties of the molecules: nearest-neighbour similarity (SNN), fragment similarity (Frag), scaffold similarity (Scaf), internal diversity (IntDiv1 and IntDiv2), and Fréchet ChemNet Distance (FCD). We also show score histograms and distribution distances for solubility (logP), synthetic accessibility (SA), drug-likeness (QED), and molecular weight. For computing the metrics, we use the MOSES benchmarking platform (Polykovskiy et al., 2020) and the RDKit open-source cheminformatics software (Landrum et al., 2013).
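As an illustration of the per-molecule property scores (the distributional metrics such as FCD, SNN, Frag, and Scaf come from the MOSES platform), a short RDKit sketch; the SMILES strings are example inputs of our choosing:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs, Descriptors, QED

mol = Chem.MolFromSmiles("Nc1cccc(O)c1")   # 3-Aminophenol (cf. Fig. 1)
logp = Descriptors.MolLogP(mol)            # octanol-water partition coefficient
qed = QED.qed(mol)                         # quantitative estimation of drug-likeness
mw = Descriptors.MolWt(mol)                # molecular weight
# The SA score lives in RDKit's contrib folder (sascorer); import path may vary.

# Tanimoto similarity between Morgan fingerprints, as used in Fig. 4
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles("Oc1ccccc1"), 2, nBits=2048)
sim = DataStructs.TanimotoSimilarity(fp1, fp2)
```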
The Data row in the metric tables is based on averages over 10 randomly sampled sets (1000 molecules per set) from the data. For the metrics, we simulate 10 batches of 1000 molecules and compare them to a held-out reference set (20% of the data; the other 80% is used for training). Full details on the 14 metrics we use are included in App. D.

Figure 3: Chemoinformatics statistics for data (qm9) and generated molecules from HetFlows (ours), MoFlow, and GraphDF. We report histograms for the octanol-water partition coefficient (logP), synthetic accessibility score (SA), quantitative estimation of drug-likeness (QED), and molecular weight.

Table 2: Chemoinformatics summary statistics for random generation on the qm9 molecule data. A full listing of all 14 metrics is given in Table A6. The results show that our message-passing modification of MoFlow (resulting in HetFlows) achieves better results on FCD, SNN, Frag, and Scaf, and retains competitive performance on the other metrics.

| | FCD ↓ | Validity ↑ | Novelty ↑ | SNN ↑ | Frag ↑ | Scaf ↑ | IntDiv1 ↑ |
|---|---|---|---|---|---|---|---|
| Data (qm9) | 0.40 | 1.00±0.00 | 0.62±0.02 | 0.54±0.00 | 0.94±0.01 | 0.76±0.03 | 0.92±0.00 |
| GraphDF | 10.76 | – | 0.98±0.00 | 0.35±0.00 | 0.61±0.01 | 0.09±0.07 | 0.87±0.00 |
| MoFlow | 7.55 | 0.95±0.01 | 0.96±0.01 | 0.32±0.00 | 0.60±0.03 | 0.04±0.03 | 0.92±0.00 |
| HetFlow (ours) | 4.04 | 0.92±0.01 | 0.92±0.01 | 0.34±0.00 | 0.80±0.02 | 0.04±0.03 | 0.91±0.00 |
| MoFlow+true adj. | 4.45 | 1.00±0.00 | 0.85±0.01 | 0.38±0.00 | 0.70±0.03 | 0.31±0.08 | 0.92±0.00 |
| HetFlow+true adj. | 1.46 | 1.00±0.00 | 0.74±0.01 | 0.43±0.00 | 0.85±0.02 | 0.52±0.05 | 0.92±0.00 |

Table 3: Chemoinformatics summary statistics for random generation on the zinc-250k molecule data set. A full listing of all 14 metrics is given in Table A7.

| | FCD ↓ | Validity ↑ | Novelty ↑ | SNN ↑ | Frag ↑ | Scaf ↑ | IntDiv1 ↑ |
|---|---|---|---|---|---|---|---|
| Data (zinc-250k) | 1.44 | 1.00±0.00 | 0.02±0.00 | 0.51±0.00 | 1.00±0.00 | 0.28±0.02 | 0.87±0.00 |
| GraphDF | 34.30 | – | 1.00±0.00 | 0.23±0.00 | 0.35±0.01 | 0.00±0.00 | 0.88±0.00 |
| MoFlow | 23.33 | 0.89±0.01 | 1.00±0.00 | 0.27±0.00 | 0.79±0.01 | 0.01±0.00 | 0.88±0.00 |
| HetFlow (ours) | 23.72 | 0.87±0.01 | 1.00±0.00 | 0.26±0.00 | 0.77±0.01 | 0.01±0.00 | 0.88±0.00 |
| MoFlow+true adj. | 8.21 | 0.94±0.01 | 1.00±0.00 | 0.33±0.00 | 0.89±0.01 | 0.07±0.02 | 0.87±0.00 |
| HetFlow+true adj. | 8.24 | 0.93±0.01 | 1.00±0.00 | 0.34±0.00 | 0.91±0.01 | 0.10±0.03 | 0.87±0.00 |

Baselines  For random generation, we include baseline results for methods with publicly available pre-trained models; trained models are required for computing chemoinformatics metrics beyond the trivial sanity checks (validity, uniqueness, and novelty). We compare against GraphDF (Luo et al., 2021) and MoFlow (Zang & Wang, 2020), the current state-of-the-art flow models for molecular generation.

Results on qm9  For the qm9 data set, the main chemoinformatics summary statistics are given in Table 2 and the descriptive distributions in Fig. 3. HetFlows+true adj. achieves the best validity, SNN, Frag, and Scaf, and notably the lowest FCD among the flow-based models. Full benchmark results are available in the extended listing in Table A6 in the appendix.

Results on zinc-250k  For the zinc-250k data set, the main chemoinformatics summary statistics are given in Table 3 and the descriptive distributions in Fig. A7. While the zinc-250k data set is more complicated, HetFlows+true adj. achieves the best novelty, SNN, Frag, and Scaf, and has competitive performance on the other metrics. Full benchmark results are available in the extended listing in Table A7.
Analysis  HetFlows emerges as a robust and versatile molecular generation model, adept at balancing fidelity, diversity, and molecular properties. Notably, the validity on both qm9 and zinc-250k is higher than 85%, supporting the model's reliability. As discussed in Sec. 3.3, HetFlows is built on MoFlow (Zang & Wang, 2020): it replaces the classic graph convolution in MoFlow's GCN module with its MixMP version. In other words, the only difference between MoFlow and HetFlows is the GNN, or more precisely, the MP scheme. As Table 2 and Table 3 show, HetFlows is superior to MoFlow under both adjacency-matrix generation styles on qm9 and zinc-250k. Fig. 3 shows that the chemoinformatics statistics of molecules generated by HetFlows fit the original distribution better than those of MoFlow in both generation styles. This provides evidence of the advantage of utilizing heterophily within message passing, as intuited in Fig. 1.

Figure 4: Structured latent-space exploration (qm9). Our approach yields a sensibly structured latent space, as qualitatively demonstrated by this nearest-neighbour search in the latent space with the seed molecule on the left and neighbours with the Tanimoto similarity (ranging from 1 to 0) given for each molecule. For results on zinc-250k, see Fig. A6 in the appendix.

Visualizing the continuous latent space  Inspired by Zang & Wang (2020), we examine the learned latent space of our method on both qm9 and zinc-250k, presented in Fig. 4 and Fig. A6, respectively. Qualitatively, we note that the latent space appears smooth and that the molecules near the seed molecule resemble the input and have high Tanimoto similarity (Rogers & Hahn, 2010).

4.3 Ablation Studies

To further verify that the effects we see are due to our proposed heterophily-informed MP scheme, we include ablation studies on different adjacency-matrix generation strategies and on parameter sharing in the GNNs.

Ablation study: adjacency matrix generation  As mentioned in Sec. 3.3, the adjacency matrix can be generated by the bond model or sampled directly from the real distribution. Both approaches appear in the literature as a basis for modelling (see, e.g., the discussion in Verma et al., 2022). With sampled adjacency matrices, the main task of the model is to put correct labels on a given graph topology. As the comparisons in Table 2 and Table 3 show, the adjacency matrix generated by the bond model limits generation performance compared with the sampling alternative. However, this approach can be considered a method in its own right.

Ablation study: parameter sharing of the MixMP GNN  MixMP combines the three message-passing channels shown in Fig. 1; these channels may or may not share parameters, corresponding to whether $\mathrm{MESSAGE}^{(k)}_\gamma$ in Eq. (3) is the same function for all $\gamma \in \Gamma$ (sketched below). To investigate whether the improved expressiveness of MixMP GNNs stems from the two extra channels, models with the two settings are compared in Table A8 and Table A9. Based on the chemoinformatics metrics, the random generation outputs of the two settings are similar. This indicates that the model's strength stems from the more expressive structure, as intuited in Fig. 1, and not from a larger parameter count.
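A minimal sketch of the two settings (module names hypothetical, with simple linear MESSAGE functions): the shared variant registers one weight set under all three channel keys, while the separate variant allocates one per channel.

```python
import torch.nn as nn

class MixMPMessages(nn.Module):
    """MESSAGE functions for the three channels: shared weights vs. separate."""
    def __init__(self, dim, shared=True):
        super().__init__()
        if shared:
            lin = nn.Linear(dim, dim)  # one weight set, reused by every channel
            self.message = nn.ModuleDict({g: lin for g in ("orig", "hom", "het")})
        else:
            self.message = nn.ModuleDict(
                {g: nn.Linear(dim, dim) for g in ("orig", "hom", "het")})
```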
5 Conclusions and Discussion

We have presented a heterophily-informed message-passing scheme, a flexible plug-in module for GNNs that accounts for a heterophily prior, countering the oversmoothing vulnerability prevalent in existing GNN-based methodologies. By adjusting message passing to discern (dis-)similarities between nodes, our method offers a more nuanced representation of the intricate balance between affinities and repulsions. In the experiments, we demonstrated our approach both on standard discriminative node classification benchmarks and by applying it inside a generative flow model (which we call HetFlows). The experimental results show the versatility of the proposed scheme and its ability to enhance embedding expressiveness across multiple graph domains. The analysis underscores the necessity of aligning data homophily with corresponding model assumptions. We consider this approach a promising tool for heterophilic graph learning.

One limitation of heterophily-informed MP is its variable effectiveness across data sets of different homophily levels. As discussed in Sec. 4.1, the scheme brings limited advantages on homophilic data and on small heterophilic data sets. This also suggests clear potential for improving various MP mechanisms, for instance through better estimation of graph homophily and more intelligent integration of the different channels. In molecular generation, the dependency on the current adjacency-matrix generation method (the bond model) likely restricts HetFlows's performance, as the key coupling functions, based on CNNs, capture limited structural information. With a sampled adjacency matrix, it successfully generates molecules that are valid, novel, diverse, and chemoinformatically similar to those in the existing distribution. A reference implementation of the methods is available at https://github.com/AaltoML/heterophily-imp.

Acknowledgments

This work was supported by the Research Council of Finland (grants 342077, 362408, 339730), Saab-WASP (grant 411025), and the Jane and Aatos Erkko Foundation (grant 7001703). We acknowledge the computational resources provided by the Aalto Science-IT project and CSC – IT Center for Science, Finland.

References

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pp. 21–29. PMLR, 2019.

Cristian Bodnar, Francesco Di Giovanni, Benjamin Chamberlain, Pietro Lio, and Michael Bronstein. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in GNNs. Advances in Neural Information Processing Systems (NeurIPS), 35:18527–18541, 2022.

Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized PageRank graph neural network. In International Conference on Learning Representations (ICLR), 2021.

Valerio Ciotti, Moreno Bonaventura, Vincenzo Nicosia, Pietro Panzarasa, and Vito Latora. Homophily and missing links in citation networks. EPJ Data Science, 5:1–14, 2016.

Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data.
In Proceedings of The 33rd International Conference on Machine Learning (ICML), volume 48 of Proceedings of Machine Learning Research, pp. 2702–2711. PMLR, 2016.

Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.

Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning (ICML), pp. 3419–3430. PMLR, 2020.

Elisabeth R Gerber, Adam Douglas Henry, and Mark Lubell. Political homophily and collaboration in regional planning networks. American Journal of Political Science, 57(3):598–610, 2013.

Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, and Alán Aspuru-Guzik. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843, 2017.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30 (NeurIPS), pp. 1025–1035. Curran Associates, Inc., 2017.

William L Hamilton. Graph Representation Learning. Morgan & Claypool Publishers, 2020.

Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning (ICML), pp. 8867–8887. PMLR, 2022.

John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.

Di Jin, Zhizhi Yu, Cuiying Huo, Rui Wang, Xiao Wang, Dongxiao He, and Jiawei Han. Universal graph convolutional networks. In Advances in Neural Information Processing Systems 34 (NeurIPS), pp. 10654–10664. Curran Associates, Inc., 2021.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning (ICML), pp. 2323–2332. PMLR, 2018.

Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pp. 10236–10245. Curran Associates, Inc., 2018.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder.
In International Conference on Machine Learning (ICML), pp. 1945–1954. PMLR, 2017.

Greg Landrum et al. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling, 2013.

AA Leman and Boris Weisfeiler. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsiya, 2(9):12–16, 1968.

Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Xiang Li, Renyu Zhu, Yao Cheng, Caihua Shan, Siqiang Luo, Dongsheng Li, and Weining Qian. Finding global homophily in graph neural networks when meeting heterophily. In Proceedings of the 39th International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, pp. 13242–13256. PMLR, 2022.

Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. Advances in Neural Information Processing Systems (NeurIPS), 34:20887–20902, 2021.

Meng Liu, Zhengyang Wang, and Shuiwang Ji. Non-local graph neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):10270–10276, 2021a.

Meng Liu, Keqiang Yan, Bora Oztekin, and Shuiwang Ji. GraphEBM: Molecular graph generation with energy-based models. arXiv preprint arXiv:2102.00546, 2021b.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Revisiting heterophily for graph neural networks. Advances in Neural Information Processing Systems (NeurIPS), 35:1362–1375, 2022.

Sitao Luan, Chenqing Hua, Minkai Xu, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, Jie Fu, Jure Leskovec, and Doina Precup. When do graph neural networks help with node classification: Investigating the homophily principle on node distinguishability. arXiv preprint arXiv:2304.14274, 2023.

Youzhi Luo, Keqiang Yan, and Shuiwang Ji. GraphDF: A discrete flow model for molecular graph generation. In International Conference on Machine Learning (ICML), pp. 7192–7203. PMLR, 2021.

Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. Is homophily a necessity for graph neural networks? arXiv preprint arXiv:2106.06134, 2021.

Haitao Mao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. Demystifying structural disparity in graph neural networks: Can one size fit all? arXiv preprint arXiv:2306.01323, 2023.

Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.

Shashank Pandit, Duen Horng Chau, Samuel Wang, and Christos Faloutsos. NetProbe: A fast and scalable system for fraud detection in online auction networks. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), pp. 201–210. Association for Computing Machinery, 2007. ISBN 9781595936547.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library.
In Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 8026–8037. Curran Associates, Inc., 2019.

Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In International Conference on Learning Representations (ICLR), 2019.

Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, and Liudmila Prokhorenkova. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In International Conference on Learning Representations (ICLR), 2023.

Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, et al. Molecular sets (MOSES): A benchmarking platform for molecular generation models. Frontiers in Pharmacology, 11:565644, 2020.

Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7, 2014.

David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.

Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021.

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.

Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. GraphAF: A flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations (ICLR), 2019.

Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 807–816, 2009.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

Yogesh Verma, Samuel Kaski, Markus Heinonen, and Vikas Garg. Modular flows: Differential molecular generation. In Advances in Neural Information Processing Systems 35 (NeurIPS), pp. 12409–12421. Curran Associates, Inc., 2022.

Yu Wang and Tyler Derr. Tree decomposed graph neural network. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM '21), pp. 2040–2049. Association for Computing Machinery, 2021.

Yuelin Wang, Kai Yi, Xinliang Liu, Yu Guang Wang, and Shi Jin. ACMP: Allen-Cahn message passing with attractive and repulsive forces for graph neural networks. In International Conference on Learning Representations (ICLR), 2023.

David Weininger, Arthur Weininger, and Joseph L Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Sciences, 29(2):97–101, 1989.

Lirong Wu, Haitao Lin, Bozhen Hu, Cheng Tan, Zhangyang Gao, Zicheng Liu, and Stan Z Li. Beyond homophily and homogeneity assumption: Relation-based frequency adaptive graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2023.

Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks.
In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp. 5453–5462. PMLR, 2018.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.

Minkai Xu, Alexander S Powers, Ron O. Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3D molecule generation. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pp. 38592–38610. PMLR, 2023.

Yujun Yan, Milad Hashemi, Kevin Swersky, Yaoqing Yang, and Danai Koutra. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. In IEEE International Conference on Data Mining (ICDM), pp. 1287–1292. IEEE, 2022.

Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of The 33rd International Conference on Machine Learning (ICML), volume 48 of Proceedings of Machine Learning Research, pp. 40–48. PMLR, 2016.

Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems 31 (NeurIPS), pp. 6410–6421. Curran Associates, Inc., 2018.

Chengxi Zang and Fei Wang. MoFlow: An invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617–626, 2020.

Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems 33 (NeurIPS), pp. 7793–7804. Curran Associates, Inc., 2020.

Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. Graph neural networks with heterophily. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11168–11176, 2021.

Jiong Zhu, Junchen Jin, Donald Loveland, Michael T Schaub, and Danai Koutra. How does heterophily impact the robustness of graph neural networks? Theoretical connections and practical implications. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2637–2647, 2022.

This appendix is organized as follows. App. A presents HetFlows in detail: prerequisite knowledge on normalizing flows and affine coupling layers, the training process, the generation process, the loss function, the MixMP GNNs in HetFlows, computational considerations, and the reversibility proof of the model. App. B gives the details of the node classification experiments, including the experimental environment, data set information, and the message-passing details of the GNNs used as base models. App. C provides the experimental environment, data sets, and additional experimental results for molecule generation, together with the property optimization algorithm, its corresponding results, and model selection details. App. D summarizes and describes the metrics used in the molecular generation experiments.

A HetFlows Details

This section presents all the technical details of HetFlows.
A.1 Prerequisites: Normalizing Flows with Affine Coupling Layers

Affine coupling layers (ACLs) introduce reversible transformations to normalizing flows that ensure efficient computation of the log-determinant of the Jacobian (Kingma & Dhariwal, 2018). Typically, an affine coupling layer, denoted $\mathrm{ACL}_{(f, M)}$, contains a binary masking matrix $M \in \{0, 1\}^{m \times n}$ and a coupling function $f$ which determines the affine transformation parameters. The input $X \in \mathbb{R}^{m \times n}$ is split into $X_1 = M \odot X$ and $X_2 = (1 - M) \odot X$ by masking, where $\odot$ denotes the Hadamard (element-wise) product. Here, $X_1$ is the masked part of the input that undergoes the transformation, and $X_2$ is the part that provides the parameters for this transformation via the coupling function and stays invariant inside the ACL. The output is the combination of the transformed part and the fixed part, as visualized in Fig. A5, given by

$$\mathrm{ACL}_{(f, M)}(X) = M \odot (S \odot X_1 + T) + (1 - M) \odot X_2, \tag{9}$$

where $\log S, T = f(X_2)$. The binary masking ensures that only part of the input is transformed, allowing the model to update certain features while fixing others, which enables the model's reversibility. The coupling functions of the flow capture intricate data characteristics, and it is into these that we incorporate the heterophily priors.

Normalizing flows offer a methodological approach to modelling distributions based on the change-of-variables law of probabilities. This is achieved by applying a chain of reversible, bijective transformations between simple variables (e.g., Gaussian) and the target data variables, and updating the transformations to minimize the negative log-likelihood (Dinh et al., 2014). Given a target distribution $z_0 \sim p_z$, we initialize a flow $f = f_T \circ \cdots \circ f_1$. The flow reaches the trivial variable $z_T \sim \mathcal{N}(\mu, \sigma^2)$ through a series of invertible functions $z_i = f_i(z_{i-1})$, $i = 1, 2, \ldots, T$. The goal of normalizing flows is to minimize the negative log-likelihood (NLL) of the data:

$$\mathcal{L} = -\log p_z(z_0) = -\log \mathcal{N}(z_T \mid \mu, \sigma^2) - \log \left| \det \frac{\partial f}{\partial z_0} \right|. \tag{10}$$

Figure A5: The affine coupling layer. The coupling is defined through a coupling function f and a binary masking matrix M (see Eq. (9)).

The power of normalizing flows lies in their bijectiveness. This ensures that no information from the data is lost during these transformations. Thus, the transformed distribution can be pulled back to the original space using the inverses of the transformation functions, providing a bridge between the Gaussian and the intricate target distribution.
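A minimal PyTorch sketch of Eq. (9) and its inverse (cf. App. A.7.1), assuming a coupling function f whose output splits into log S and T; this is an illustration under those assumptions, not the exact HetFlows code.

```python
import torch
import torch.nn as nn

class ACL(nn.Module):
    """Affine coupling layer, Eq. (9): Y = M*(S*X1 + T) + (1 - M)*X2."""
    def __init__(self, f, mask):
        super().__init__()
        self.f = f                         # coupling function: log S, T = f(X2)
        self.register_buffer("M", mask)    # binary mask

    def forward(self, x):
        x1, x2 = self.M * x, (1 - self.M) * x
        log_s, t = self.f(x2).chunk(2, dim=-1)
        y = self.M * (x1 * log_s.exp() + t) + x2
        logdet = (self.M * log_s).sum()    # log|det J| term for the NLL, Eq. (10)
        return y, logdet

    def inverse(self, y):
        y2 = (1 - self.M) * y              # X2 is recovered directly, Eq. (37)
        log_s, t = self.f(y2).chunk(2, dim=-1)
        x1 = self.M * (self.M * y - t) / log_s.exp()   # Eq. (46)
        return x1 + y2
```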
A.2 Training Process

A molecule graph $G$ contains atom (node) features $X \in \mathbb{R}^{n \times n_a}$ and bond (edge) features $E \in \mathbb{R}^{n \times n \times n_b}$, where $n_a$ and $n_b$ denote the numbers of atom and bond types, respectively. $(X)_i$ denotes the one-hot encoded type of the $i$th atom in molecule $G$; similarly, $(E)_{ij}$ denotes the one-hot encoding of the chemical bond between the $i$th and $j$th atoms. Our model HetFlows maps the molecule $G$ to embeddings $h_a$ and $h_b$ that follow Gaussian distributions:

$$h_a \sim p_a = \mathcal{N}(\mu_a, \sigma_a^2), \qquad h_b \sim p_b = \mathcal{N}(\mu_b, \sigma_b^2). \tag{11}$$

Bond flow  The bond flow $f_b = \mathrm{ACL}^b_{k_b} \circ \cdots \circ \mathrm{ACL}^b_1$ consists of a series of ACLs with convolutional neural networks (CNNs) as coupling functions, $\mathrm{ACL}^b_i = \mathrm{ACL}_{(\mathrm{CNN}_i, M^b_i)}$ with $M^b_i \in \{0, 1\}^{n \times n \times n_b}$, $i = 1, 2, \ldots, k_b$, where $k_b$ denotes the number of layers. The bond embeddings $h_b = h_b^{(k_b)} = f_b(h_b^{(0)})$ are updated layer by layer:

$$h_b^{(i)} = \mathrm{ACL}^b_i\big(h_b^{(i-1)}\big), \quad i = 1, 2, \ldots, k_b, \tag{12}$$

initialized from the bond tensor $h_b^{(0)} = E$.

Heterophilous atom flow  The atom flow $f_a$ contains $k_a$ affine coupling layers. Each layer consists of a masking matrix $M_i$ and a GNN coupling function $\mathrm{GNN}^\Gamma$:

$$\mathrm{ACL}^a_i = \mathrm{ACL}_{(\mathrm{GNN}^\Gamma_i, M_i)}, \quad i = 1, 2, \ldots, k_a, \tag{13}$$

where $(M_i)_{j,k} = \mathbb{1}_{j \equiv i \ (\mathrm{mod}\ n)}$, and $\Gamma = \{\text{orig.}, \text{hom.}, \text{het.}\}$ indicates the MP schemes introduced in Sec. 3.2. $\mathrm{GNN}^\Gamma$ is a MixMP GNN combining these three schemes, as described in App. A.5. All GNNs in this context derive their graph topology from the bond tensor $E$. The embeddings are initialized from the atom features, $h_a^{(0)} = X$, and are updated through each affine coupling layer as

$$h_a^{(i)} = \mathrm{ACL}^a_i\big(h_a^{(i-1)} \mid E\big), \quad i = 1, 2, \ldots, k_a. \tag{14}$$

The final node embedding is $h_a = h_a^{(k_a)} = f_a(h_a^{(0)})$.

Training target  The training goal of HetFlows is to decrease the distance between the transformed variables $\{h_a, h_b\}$ and the target distributions $p_a, p_b$ assumed in Eq. (11). The loss function combines the NLLs of both the atom and bond flows, $\mathcal{L} = \mathcal{L}_a + \mathcal{L}_b$ (details in App. A.4), where each NLL takes the form of Eq. (10). The aim of training is to find model parameters that minimize the loss:

$$f_a^*, f_b^* = \arg\min_{f_a, f_b} \mathcal{L}. \tag{15}$$

A.3 Generation Process

Given a trained HetFlows model, with fitted atom flow $f_a^*$ and bond flow $f_b^*$ (where $*$ denotes fixed parameters), the procedure for generating molecules is as follows.

1. Sampling embeddings: Start by randomly sampling embeddings $h_a \sim p_a$ and $h_b \sim p_b$ from the Gaussian distributions in Eq. (11).
2. Obtaining the bond tensor: The bond tensor is derived by applying the inverse of the bond flow to the sampled embedding: $E = (f_b^*)^{-1}(h_b)$.
3. Recovering graph topology: From the bond tensor $E$, the graph topology can be deduced. This topology is essential for the node feature generation in the next step.
4. Generating node features: With the bond tensor in place, the node features are produced by applying the inverse of the atom flow to the sampled atom embedding: $X = (f_a^*)^{-1}(h_a \mid E)$.
5. Molecule recovery: Finally, a molecule $G = (X, E)$ is reconstructed from the random embeddings $[h_a, h_b]$.

The topology generation process (Steps 2–3) can alternatively be achieved by sampling the adjacency matrix from the real (data) distribution, which is denoted by +true adj.

Reversibility of HetFlows  To ensure that the molecular embeddings produced by HetFlows can be inverted back, it is crucial to establish the reversibility of these processes. A formal proof of the reversibility of ACL blocks and of HetFlows is provided in App. A.7.

A.4 Loss Function

The loss function of HetFlows is the sum of the flow NLLs, $\mathcal{L} = \mathcal{L}_a + \mathcal{L}_b$, consisting of the atom and bond flow terms. The bond flow loss $\mathcal{L}_b$ follows the NLL in Eq. (10):

$$\mathcal{L}_b = -\log p(h_b) - \sum_{i=1}^{k_b} \log \left| \det \frac{\partial\, \mathrm{ACL}^b_i}{\partial h_b^{(i-1)}} \right|. \tag{16}$$

Similarly, the loss $\mathcal{L}_a$ for the atom flow is constructed as

$$\mathcal{L}_a = -\log p(h_a) - \sum_{i=1}^{k_a} \log \left| \det \frac{\partial\, \mathrm{ACL}^a_i}{\partial h_a^{(i-1)}} \right|. \tag{17}$$
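To tie Eqs. (12), (14), (16), and (17) together, the following sketches how such an NLL can be accumulated over ACL blocks, assuming each block returns its output together with its log-determinant (as in the ACL sketch in App. A.1) and a standard-normal base distribution; an illustration, not the exact HetFlows implementation.

```python
import math

def flow_nll(layers, x):
    """Sketch of a flow NLL (Eqs. (10), (16), (17)): push x through the ACL
    chain, accumulate per-layer log-determinants, and score the result under
    a standard-normal base distribution."""
    h, total_logdet = x, 0.0
    for layer in layers:               # h^(i) = ACL_i(h^(i-1)), Eqs. (12)/(14)
        h, logdet = layer(h)
        total_logdet = total_logdet + logdet
    log_p_h = -0.5 * (h ** 2).sum() - 0.5 * h.numel() * math.log(2 * math.pi)
    return -(log_p_h + total_logdet)   # L = -log p(h) - sum_i log|det J_i|
```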
A.4 Loss Function

The loss function of HetFlows is the sum of the flow NLLs, $\mathcal{L} = \mathcal{L}_a + \mathcal{L}_b$, with one term each for the atom and bond flows. Following Eq. (10), the bond flow loss is

$$\mathcal{L}_b = -\log p_b(h_b) - \sum_{i=1}^{k_b} \log \left| \det \frac{\partial\, \mathrm{ACL}^b_i}{\partial h_b^{(i-1)}} \right|, \tag{16}$$

and, similarly, the loss for the atom flow is

$$\mathcal{L}_a = -\log p_a(h_a) - \sum_{i=1}^{k_a} \log \left| \det \frac{\partial\, \mathrm{ACL}^a_i}{\partial h_a^{(i-1)}} \right|. \tag{17}$$

A.5 The MixMP Graph Neural Network

In the code implementation, the MixMP GCN, $\mathrm{GNN}^\Gamma$ ($\Gamma = \{\text{orig.}, \text{hom.}, \text{het.}\}$), consists of convolutional layers, batch normalization layers, and a series of linear layers. The convolutional layer is an improved version of the graph convolutional layer of GCNs (Kipf & Welling, 2017) with three channels of MP; each channel transfers messages between neighbours with a preference determined by homophily/heterophily.

Given input $h \in \mathbb{R}^{n \times n_a}$ and bond tensor $E \in \mathbb{R}^{n \times n \times n_b}$, let $A_i = E[:,:,i] \in \mathbb{R}^{n \times n}$, $i = 1, \dots, n_b$, denote the adjacency matrices; here $n_a$ and $n_b$ are the feature dimensions of the nodes and bonds. The messages are scaled with scaling matrices $H_\gamma \in \mathbb{R}^{n \times n}$, $(H_\gamma)_{ij} = \alpha_{ij,\gamma}$. The homophily factors $\alpha_{uv,\gamma}$ differ across the channels indicated by $\gamma \in \Gamma$:

$$\alpha_{uv,\gamma} = \begin{cases} H(u,v), & \text{if } \gamma = \text{hom.}, \\ 1, & \text{if } \gamma = \text{orig.}, \\ 1 - H(u,v), & \text{if } \gamma = \text{het.}, \end{cases} \tag{18}$$

where $H(u,v) = S_{\cos}\big(h_u^{(k)}, h_v^{(k)}\big)$ is the cosine similarity. The node embeddings are updated by the messages transferred in these three channels; for each edge channel $i$,

$$\hat h_i = \mathrm{cat}\big( (H_{\text{orig.}} \odot A_i)\, h,\ (H_{\text{hom.}} \odot A_i)\, h,\ (H_{\text{het.}} \odot A_i)\, h \big)\, W_i, \tag{19}$$

for $i = 1, \dots, n_b$, where $\{W_i \in \mathbb{R}^{3 n_a \times n_{\text{out}}}\}_{i=1}^{n_b}$ are the model parameters, $\mathrm{cat}$ denotes the concatenation operator, and $n_{\text{out}}$ is the expected output dimension. The output of this convolutional layer is the sum over all edge types:

$$\hat h = \sum_{i=1}^{n_b} \hat h_i \in \mathbb{R}^{n \times n_{\text{out}}}. \tag{20}$$

In conclusion, given the input $h \in \mathbb{R}^{n \times n_a}$ and bond tensor $E \in \mathbb{R}^{n \times n \times n_b}$, the convolutional layer generates the output

$$\mathrm{GNN}^\Gamma(h \mid E) = \hat h \in \mathbb{R}^{n \times n_{\text{out}}}. \tag{21}$$
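A minimal dense-tensor sketch of this triple-channel convolution (Eqs. (18)-(21)) is given below; the module name, the plain dense adjacency handling, and the batch-free shapes are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixMPConv(nn.Module):
    """Sketch of the MixMP convolution: three MP channels scaled by
    cosine-similarity homophily factors (Eq. (18)), concatenated and
    projected per edge type (Eq. (19)), then summed (Eq. (20))."""

    def __init__(self, n_a: int, n_out: int, n_b: int):
        super().__init__()
        # one projection W_i per edge (bond) type
        self.W = nn.ModuleList(
            nn.Linear(3 * n_a, n_out, bias=False) for _ in range(n_b))

    def forward(self, h, E):  # h: (n, n_a), E: (n, n, n_b)
        # pairwise cosine similarity H(u, v) between node embeddings
        Hsim = F.cosine_similarity(h.unsqueeze(1), h.unsqueeze(0), dim=-1)
        H_hom, H_orig, H_het = Hsim, torch.ones_like(Hsim), 1.0 - Hsim
        out = 0.0
        for i, lin in enumerate(self.W):
            A = E[:, :, i]                          # adjacency of edge type i
            msgs = torch.cat([(H_orig * A) @ h,     # original channel
                              (H_hom * A) @ h,      # homophilous channel
                              (H_het * A) @ h], dim=-1)
            out = out + lin(msgs)                   # \hat h_i, summed over types
        return out                                  # (n, n_out)
```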
A.6 Computational Considerations

For convenient calculation of the log-likelihood, every transformation of variables requires the Jacobian (i.e., $\partial Z^{(l+1)} / \partial Z^{(l)}$). Therefore, all complicated modules (e.g., GNNs, MLPs) are built inside the coupling structure: part of the input is updated by the scaling matrix $S$ and translation matrix $T$, which depend only on the other part of the input.

A.7 Proof of Reversibility of HetFlows

Both the atom and bond models of HetFlows rely on ACL blocks. As introduced in App. A.1, these blocks are inherently reversible: they can process an input to produce an output, and can also take that output and revert it to the original input without loss of information. Beyond the ACL blocks, the operations used within the model are primarily simple concatenations or permutations, which are straightforward and do not affect the overall reversibility. Given that the individual components (both atom and bond flows) are reversible and the operations performed on the data are straightforward, HetFlows as a whole is reversible. A formal proof of the model's reversibility is provided below.

A.7.1 Reversibility of the ACL

Claim The affine coupling layer (ACL) defined in App. A.1, the atom model $f_a$, and the bond model $f_b$ defined in App. A.2 are reversible.

Setup Assume an ACL (Kingma & Dhariwal, 2018), as defined in App. A.1, with coupling function $f$ and masking matrix $M \in \{0,1\}^{m \times n}$. Given input $X \in \mathbb{R}^{m \times n}$, the output $Y$ is calculated as

$$Y = \mathrm{ACL}_{(f,M)}(X) = M \odot (S \odot X_1 + T) + (1-M) \odot X_2, \tag{22}$$

where $\log S, T = f(X_2)$, and $X_1, X_2$ are split from the input by masking:

$$X_1 = M \odot X, \qquad X_2 = (1-M) \odot X. \tag{23}$$

We seek to recover $X$ from $f$, $M$, and $Y$.

Reversibility from output to input Since $M$ is binary, we have

$$M \odot M = M, \tag{24}$$
$$(1-M) \odot (1-M) = 1-M, \tag{25}$$
$$M \odot (1-M) = (1-M) \odot M = 0, \tag{26}$$

and hence

$$X = \big(M + (1-M)\big) \odot X \tag{27}$$
$$\phantom{X} = M \odot X + (1-M) \odot X \tag{28}$$
$$\phantom{X} = X_1 + X_2. \tag{29}$$

Splitting the output $Y$ into $Y_1, Y_2$ by the masking matrix,

$$Y_1 = M \odot Y, \qquad Y_2 = (1-M) \odot Y, \tag{30}$$

and combining with Eq. (22), we obtain

$$Y_1 = M \odot Y \tag{31}$$
$$\phantom{Y_1} = M \odot \big( M \odot (S \odot X_1 + T) + (1-M) \odot X_2 \big) \tag{32}$$
$$\phantom{Y_1} = M \odot (S \odot X_1 + T), \tag{33}$$

$$Y_2 = (1-M) \odot Y \tag{34}$$
$$\phantom{Y_2} = (1-M) \odot \big( M \odot (S \odot X_1 + T) + (1-M) \odot X_2 \big) \tag{35}$$
$$\phantom{Y_2} = (1-M) \odot \big( M \odot (S \odot X_1 + T) + (1-M) \odot (1-M) \odot X \big) \tag{36}$$
$$\phantom{Y_2} = (1-M) \odot X = X_2. \tag{37}$$

Now $\log S, T = f(X_2) = f(Y_2)$ are recovered from $Y$. Notice that

$$M \odot (Y_1 - T) \oslash S \tag{38}$$
$$= M \odot \big( M \odot (S \odot X_1 + T) - T \big) \oslash S \tag{39}$$
$$= \big( M \odot S \odot X_1 + M \odot T - M \odot T \big) \oslash S \tag{40}$$
$$= (M \odot S \odot X_1) \oslash S \tag{41}$$
$$= M \odot X_1 \tag{42}$$
$$= M \odot M \odot X \tag{43}$$
$$= X_1, \tag{44}$$

provided that $(S)_{i,j} \neq 0$ for all $i, j$, (45) where $\oslash$ denotes element-wise division. Since we define $S$ as the exponential of part of the coupling function output, the elements of $S$ are all strictly positive. Then

$$X = \mathrm{ACL}^{-1}_{(f,M)}(Y) = X_1 + X_2 = M \odot (Y_1 - T) \oslash S + Y_2 = M \odot (M \odot Y - T) \oslash S + (1-M) \odot Y, \tag{46}$$

where $\log S, T = f(X_2) = f\big((1-M) \odot Y\big)$. Eq. (46) shows how the input is recovered from the output; thus the ACL block is reversible.

A.7.2 Reversibility of the Bond Model

For the bond model $f_b = \mathrm{ACL}^b_{k_b} \circ \cdots \circ \mathrm{ACL}^b_1$: since each $\mathrm{ACL}^b_i$, $i = 1, \dots, k_b$, is reversible, we can write $f_b^{-1} = (\mathrm{ACL}^b_1)^{-1} \circ \cdots \circ (\mathrm{ACL}^b_{k_b})^{-1}$, which is the inverse function of $f_b$.

A.7.3 Reversibility of the Atom Model

For the atom model $f_a = \mathrm{ACL}^a_{k_a} \circ \cdots \circ \mathrm{ACL}^a_1$: since each $\mathrm{ACL}^a_i$, $i = 1, \dots, k_a$, is reversible, we can write $f_a^{-1} = (\mathrm{ACL}^a_1)^{-1} \circ \cdots \circ (\mathrm{ACL}^a_{k_a})^{-1}$, which is the inverse function of $f_a$.

B Experiment Details: Node Classification

B.1 Hardware

All models in this experiment were trained on a Linux cluster equipped with NVIDIA V100 GPUs. The training time and memory requirements for a single model were (for all modes orig., hom., het., mix. and for all architectures):

Texas: 10 min, 2 GB
Cornell: 10 min, 2 GB
Wisconsin: 10 min, 2 GB
Squirrel: 15 min, 16 GB
Chameleon: 10 min, 4 GB
CiteSeer: 10 min, 2 GB
Computers: 20 min, 16 GB
PubMed: 10 min, 4 GB
Cora: 10 min, 2 GB
Photo: 15 min, 8 GB
Roman-empire: 10 min, 2 GB
Amazon-ratings: 10 min, 2 GB
Minesweeper: 10 min, 2 GB
Tolokers: 20 min, 16 GB
Questions: 15 min, 8 GB

B.2 Data Sets

Fifteen data sets are selected to conduct comprehensive and discriminative node classification tasks. To examine patterns associated with varying levels of data homophily, half of the data sets are chosen to have higher homophily, while the other half are more heterophilic. The data domains include citation networks (Yang et al., 2016) (Cora, PubMed, CiteSeer), co-purchase graphs (Shchur et al., 2018) (Computers, Photo), hyperlink networks (Pei et al., 2019) (Cornell, Wisconsin, Texas), Wikipedia networks (Rozemberczki et al., 2021) (Chameleon, Squirrel), and the heterophilous graph benchmark (Platonov et al., 2023) (Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions).

Statistical information Descriptive statistics of all data sets used for the node classification tasks in Sec. 4.1 are displayed in Table A4. Here, $N_{\text{graphs}}$, $N_{\text{nodes}}$ and $N_{\text{edges}}$ denote the total numbers of graphs, nodes and edges in the data set, respectively; $D_{\text{feat}}$ and $N_{\text{class}}$ denote the dimension of the node features and the number of classes; and $H_n$, $H_e$ and $H_{ei}$ denote the average node homophily (Pei et al., 2019), average edge homophily (Zhu et al., 2020) and average class-insensitive edge homophily ratio (Lim et al., 2021) of the data set.

Data split rule For the three heterophilic data sets Cornell, Wisconsin and Texas, the data are split into train/validation/test sets with the 10 fixed seeds from GEOM-GCN (Pei et al., 2019). For the other data sets, the splitting is randomized and controlled via the Python torch package. Each result aggregates 10 random runs over data splits and model initializations.

B.3 GNN Algorithm Details

Four classic GNNs are utilized as base models in the node classification task of Sec. 4.1. We present each algorithm by giving the message-passing details of a single convolutional layer.
Graph Convolutional Networks (GCN) The classic graph convolutional operator was proposed by Kipf & Welling (2017). Given the embedding $h_v^{(k)}$ of node $v \in V$ at the $k$-th layer, the message for node $u$ from node $v$ is defined as

$$m_{uv}^{(k)} = \frac{e_{u,v}}{\sqrt{\hat d_u \hat d_v}}\, h_v^{(k)}, \tag{47}$$

where $e_{u,v}$ denotes the weight of edge $(u,v)$ and $\hat d_v = 1 + \sum_{u \in N(v)} e_{u,v}$ denotes the weighted degree of node $v$ with self-loop. The node embeddings are then updated with the message set $\{m_{uv} \mid v \in N(u) \cup \{u\}\}$ collected from all neighbours:

$$h_u^{(k+1)} = \Theta^\top \sum_{v \in N(u) \cup \{u\}} m_{uv}^{(k)}, \tag{48}$$

where $\Theta$ denotes the model parameters. (A minimal dense sketch of this layer is provided after Table A5.)

Graph Attention Networks (GAT) Veličković et al. (2018) proposed the graph attention operator, inspired by the transformer. Given the embedding $h_v^{(k)}$ of node $v \in V$ at the $k$-th layer, the message for node $u$ from node $v$ is defined as

$$m_{uv}^{(k)} = \begin{cases} \alpha_{u,v}\, \Theta_t h_u^{(k)}, & \text{if } u = v, \\ \alpha_{u,v}\, \Theta_s h_v^{(k)}, & \text{if } u \neq v, \end{cases} \tag{49}$$

where $\Theta_s, \Theta_t$ denote the model parameters and $\alpha_{u,v}$ denotes the attention value from node $v$ to node $u$:

$$\alpha_{u,v} = \frac{\exp\big( \mathrm{LR}(a_s^\top \Theta_s h_u + a_t^\top \Theta_t h_v) \big)}{\sum_{w \in N(u) \cup \{u\}} \exp\big( \mathrm{LR}(a_s^\top \Theta_s h_u + a_t^\top \Theta_t h_w) \big)}, \tag{50}$$

where $\mathrm{LR}$ denotes the leaky rectified linear unit. The node embeddings are then updated by the neighbour message set:

$$h_u^{(k+1)} = \sum_{v \in N(u) \cup \{u\}} m_{uv}^{(k)}. \tag{51}$$

Graph Isomorphism Network (GIN) The graph isomorphism operator was created by Xu et al. (2019) based on the Weisfeiler-Lehman (WL) graph isomorphism test (Leman & Weisfeiler, 1968). Given the embedding $h_v^{(k)}$ of node $v \in V$ at the $k$-th layer, the message for node $u$ from node $v$ is the node embedding itself:

$$m_{uv}^{(k)} = \begin{cases} h_v^{(k)}, & \text{if } u \neq v, \\ (1 + \varepsilon)\, h_u^{(k)}, & \text{if } u = v, \end{cases} \tag{52}$$

where $\varepsilon$ controls the self-weight. The node embeddings are then updated through another neural network $h_\theta$:

$$h_u^{(k+1)} = h_\theta\!\left( \sum_{v \in N(u) \cup \{u\}} m_{uv}^{(k)} \right), \tag{53}$$

where $h_\theta$ is often defined as a simple MLP.

GraphSAGE Hamilton et al. (2017) highlight the sample-and-aggregate approach in their GraphSAGE operator. Given the embedding $h_v^{(k)}$ of node $v \in V$ at the $k$-th layer, the message for node $u$ from node $v$ is the node embedding itself:

$$m_{uv}^{(k)} = h_v^{(k)}, \tag{54}$$

and the node embeddings are updated as

$$h_u^{(k+1)} = \Theta_1 m_{uu} + \Theta_2 \frac{1}{|N(u)|} \sum_{v \in N(u)} m_{uv}, \tag{55}$$

where $\Theta_1, \Theta_2$ are the model parameters.

Table A4: Statistics of the data sets for the node classification experiments in Sec. 4.1.

| Dataset | N_graphs | N_nodes | N_edges | D_feat | N_class | H_n | H_e | H_ei |
| Texas | 1 | 183 | 325 | 1703 | 5 | 0.104 | 0.108 | 0.001 |
| Minesweeper | 1 | 10000 | 78804 | 7 | 2 | 0.683 | 0.683 | 0.009 |
| Roman-empire | 1 | 22662 | 65854 | 300 | 18 | 0.046 | 0.047 | 0.021 |
| Squirrel | 1 | 5201 | 217073 | 2089 | 5 | 0.219 | 0.224 | 0.026 |
| Cornell | 1 | 183 | 298 | 1703 | 5 | 0.106 | 0.131 | 0.059 |
| Chameleon | 1 | 2277 | 36101 | 2325 | 5 | 0.249 | 0.235 | 0.063 |
| Questions | 1 | 48921 | 307080 | 301 | 2 | 0.898 | 0.840 | 0.079 |
| Wisconsin | 1 | 251 | 515 | 1703 | 5 | 0.134 | 0.196 | 0.097 |
| Amazon-ratings | 1 | 24492 | 186100 | 300 | 5 | 0.376 | 0.380 | 0.127 |
| Tolokers | 1 | 11758 | 1038000 | 10 | 2 | 0.634 | 0.595 | 0.180 |
| CiteSeer | 1 | 3327 | 9104 | 3703 | 6 | 0.706 | 0.736 | 0.627 |
| PubMed | 1 | 19717 | 88648 | 500 | 3 | 0.792 | 0.802 | 0.664 |
| Computers | 1 | 13752 | 491722 | 767 | 10 | 0.785 | 0.777 | 0.700 |
| Cora | 1 | 2708 | 10556 | 1433 | 7 | 0.825 | 0.810 | 0.766 |
| Photo | 1 | 7650 | 238162 | 745 | 8 | 0.836 | 0.827 | 0.772 |

Table A5: Performance comparison of GraphSAGE+mix. with other heterophily-aware GNNs (test accuracy, mean ± std); the best result in each column is achieved by ACM-GCN.

| Method | Texas | Cornell | Wisconsin | CiteSeer | PubMed | Cora |
| Homophily H_ei | 0.00 | 0.06 | 0.10 | 0.63 | 0.66 | 0.77 |
| GraphSAGE+mix. | 79.7 ± 5.9 | 75.1 ± 4.0 | 84.5 ± 1.1 | 77.0 ± 1.0 | 89.3 ± 0.4 | 87.8 ± 1.2 |
| GraphSAGE | 79.0 ± 1.2 | 71.4 ± 1.2 | 64.8 ± 5.1 | 78.2 ± 0.3 | 86.8 ± 0.1 | 86.6 ± 0.3 |
| ACMP | 86.2 ± 3.0 | 85.4 ± 7.0 | 86.1 ± 4.0 | 75.0 ± 1.0 | 78.9 ± 1.0 | 84.9 ± 0.6 |
| H2GCN | 85.9 ± 3.5 | 86.2 ± 4.7 | 87.5 ± 1.8 | 80.0 ± 0.7 | 87.8 ± 0.3 | 87.5 ± 0.6 |
| GPRGNN | 92.9 ± 0.6 | 91.4 ± 0.7 | 93.8 ± 2.4 | 67.6 ± 0.4 | 85.1 ± 0.1 | 79.5 ± 0.4 |
| Geom-GCN | 67.6 (N/A) | 60.8 (N/A) | 64.1 (N/A) | 78.0 (N/A) | 90.0 (N/A) | 85.3 (N/A) |
| ACM-GCN | 94.9 ± 2.9 | 94.8 ± 3.8 | 95.8 ± 2.0 | 81.7 ± 1.0 | 90.7 ± 0.5 | 88.6 ± 1.2 |
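As a concrete reference for the simplest of these operators, the GCN layer of Eqs. (47)-(48) can be written densely in a few lines; this is an illustrative re-derivation with unit self-loop weights, not the benchmarked implementation (which relies on standard sparse GNN libraries).

```python
import torch

def gcn_layer(H, A, Theta):
    """One GCN layer (Eqs. (47)-(48)) in dense form.
    H: (n, d) node embeddings, A: (n, n) weighted adjacency, Theta: (d, d_out)."""
    n = A.size(0)
    A_hat = A + torch.eye(n)                # add self-loops with unit weight
    d_hat = A_hat.sum(dim=1)                # weighted degrees with self-loop
    D_inv_sqrt = torch.diag(d_hat.rsqrt())  # diag(\hat d^{-1/2})
    # normalize e_{uv} / sqrt(\hat d_u \hat d_v), aggregate, then project
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ Theta
```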
B.4 Comparison with Other Heterophily-aware GNN Baselines

We select GraphSAGE+mix. for comparison with baselines on node classification tasks over six data sets in Table A5. The baselines comprise the base model and other heterophily-aware GNNs: ACMP (Wang et al., 2023), H2GCN (Zhu et al., 2021), GPRGNN (Chien et al., 2021), Geom-GCN (Pei et al., 2019) and ACM-GCN (Luan et al., 2022). GraphSAGE+mix. is generally better than its original model, GraphSAGE, demonstrating the representation enhancement brought by our heterophilous message-passing scheme. ACM-GCN is the best-performing algorithm overall in this comparison. N/A means that the reported result is not available from the baseline paper.

C Experiment Details: Molecule Generation

C.1 Hardware

All models in this experiment were trained on a cluster equipped with NVIDIA A100 GPUs. The training time and memory requirements for single models were (for all modes orig., hom. or het., and for all base models):

qm9: 10 h, 8 GB
zinc-250k: 90 h, 16 GB

C.2 Data Sets

Two classic molecule data sets (qm9 and zinc-250k), widely used in academia, are chosen for the molecular generation task. The qm9 data set (Ramakrishnan et al., 2014) comprises 134k stable small organic molecules composed of atoms from the set {C, H, O, N, F}. The zinc-250k data set (Irwin et al., 2012) contains 250k drug-like molecules, each with up to 38 atoms of 9 different types.

Data split In this experiment, all data sets are split with ratio train/test = 80/20%.

Figure A6: Structured latent-space exploration (zinc-250k). Example of a nearest-neighbour search in the latent space, with the seed molecule on the left and the neighbours, with their Tanimoto similarity (1 to 0), given for each molecule. For results on qm9, see Fig. 4 in the main paper.

Figure A7: Chemoinformatics statistics for data (zinc-250k) and molecules generated by HetFlows (ours), MoFlow, and GraphDF. Histograms of the octanol-water partition coefficient (logP), synthetic accessibility score (SA), quantitative estimation of drug-likeness (QED), and molecular weight.

Table A6: Full benchmark metrics for random generation using qm9 (reporting mean ± std).

| Model | Validity ↑ | Uniqueness ↑ | Novelty ↑ | SNN ↑ | Frag ↑ | Scaf ↑ | IntDiv1 ↑ |
| Data (qm9) | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.62 ± 0.02 | 0.54 ± 0.00 | 0.94 ± 0.01 | 0.76 ± 0.03 | 0.92 ± 0.00 |
| GraphDF | - | 1.00 ± 0.00 | 0.98 ± 0.00 | 0.35 ± 0.00 | 0.61 ± 0.01 | 0.09 ± 0.07 | 0.87 ± 0.00 |
| MoFlow | 0.95 ± 0.01 | 1.00 ± 0.00 | 0.96 ± 0.01 | 0.32 ± 0.00 | 0.60 ± 0.03 | 0.04 ± 0.03 | 0.92 ± 0.00 |
| HetFlow (ours) | 0.92 ± 0.01 | 0.99 ± 0.00 | 0.92 ± 0.01 | 0.34 ± 0.00 | 0.80 ± 0.02 | 0.04 ± 0.03 | 0.91 ± 0.00 |
| MoFlow+true adj. | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.85 ± 0.01 | 0.38 ± 0.00 | 0.70 ± 0.03 | 0.31 ± 0.08 | 0.92 ± 0.00 |
| HetFlow+true adj. | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.74 ± 0.01 | 0.43 ± 0.00 | 0.85 ± 0.02 | 0.52 ± 0.05 | 0.92 ± 0.00 |
| EDM | 0.92 ± 0.01 | 1.00 ± 0.00 | 0.56 ± 0.02 | 0.48 ± 0.01 | 0.92 ± 0.01 | 0.65 ± 0.03 | 0.92 ± 0.00 |

| Model | IntDiv2 ↑ | Filters ↑ | FCD ↓ | logP ↓ | SA ↓ | QED ↓ | Weight ↓ |
| Data (qm9) | 0.90 ± 0.00 | 0.64 ± 0.02 | 0.40 ± 0.02 | 0.04 ± 0.01 | 0.03 ± 0.01 | 0.00 ± 0.00 | 0.32 ± 0.08 |
| GraphDF | 0.86 ± 0.00 | 0.69 ± 0.02 | 10.76 ± 0.21 | 0.16 ± 0.03 | 0.27 ± 0.02 | 0.05 ± 0.00 | 19.72 ± 0.54 |
| MoFlow | 0.90 ± 0.00 | 0.55 ± 0.02 | 7.55 ± 0.23 | 0.40 ± 0.02 | 0.41 ± 0.02 | 0.04 ± 0.00 | 3.75 ± 0.08 |
| HetFlow (ours) | 0.90 ± 0.00 | 0.62 ± 0.02 | 4.04 ± 0.24 | 0.11 ± 0.02 | 0.34 ± 0.01 | 0.03 ± 0.00 | 4.40 ± 0.11 |
| MoFlow+true adj. | 0.90 ± 0.00 | 0.60 ± 0.01 | 4.45 ± 0.11 | 0.65 ± 0.02 | 0.56 ± 0.02 | 0.04 ± 0.00 | 0.74 ± 0.14 |
| HetFlow+true adj. | 0.90 ± 0.00 | 0.62 ± 0.01 | 1.46 ± 0.09 | 0.26 ± 0.02 | 0.36 ± 0.02 | 0.02 ± 0.00 | 1.12 ± 0.18 |
| EDM | 0.90 ± 0.00 | 0.61 ± 0.01 | 0.96 ± 0.08 | 0.16 ± 0.04 | 0.15 ± 0.04 | 0.01 ± 0.00 | 1.87 ± 0.16 |
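The Tanimoto similarities reported in Fig. A6 (and underlying the SNN column of Table A6) reduce to standard RDKit calls; the sketch below assumes Morgan fingerprints of radius 2 with 2048 bits, which is a common default rather than a documented choice of the paper.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between two molecules via Morgan fingerprints."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)
```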
C.3 Further Results

Latent-space exploration We provide further results for structured latent-space exploration. Example explorations for qm9 and zinc-250k are shown in Fig. 4 and Fig. A6.

Full metric tables We include full listings of all 14 metrics (described in App. D) considered in the random generation tasks for qm9 and zinc-250k. The values are listed in Tables A6 and A7, respectively.

Reconstruction example For better understanding, we provide reconstruction examples on qm9 from intermediate layers in Fig. A8.

Table A7: Full benchmark metrics for random generation using zinc-250k (reporting mean ± std).

| Model | Validity ↑ | Uniqueness ↑ | Novelty ↑ | SNN ↑ | Frag ↑ | Scaf ↑ | IntDiv1 ↑ |
| Data (zinc-250k) | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.02 ± 0.00 | 0.51 ± 0.00 | 1.00 ± 0.00 | 0.28 ± 0.02 | 0.87 ± 0.00 |
| GraphDF | - | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.23 ± 0.00 | 0.35 ± 0.01 | 0.00 ± 0.00 | 0.88 ± 0.00 |
| MoFlow | 0.89 ± 0.01 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.27 ± 0.00 | 0.79 ± 0.01 | 0.01 ± 0.00 | 0.88 ± 0.00 |
| HetFlow (ours) | 0.87 ± 0.01 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.26 ± 0.00 | 0.77 ± 0.01 | 0.01 ± 0.00 | 0.88 ± 0.00 |
| MoFlow+true adj. | 0.94 ± 0.01 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.33 ± 0.00 | 0.89 ± 0.01 | 0.07 ± 0.02 | 0.87 ± 0.00 |
| HetFlow+true adj. | 0.93 ± 0.01 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.34 ± 0.00 | 0.91 ± 0.01 | 0.10 ± 0.03 | 0.87 ± 0.00 |

| Model | IntDiv2 ↑ | Filters ↑ | FCD ↓ | logP ↓ | SA ↓ | QED ↓ | Weight ↓ |
| Data (zinc-250k) | 0.86 ± 0.00 | 0.59 ± 0.01 | 1.44 ± 0.01 | 0.05 ± 0.01 | 0.03 ± 0.01 | 0.01 ± 0.00 | 2.18 ± 0.39 |
| GraphDF | 0.87 ± 0.00 | 0.54 ± 0.01 | 34.30 ± 0.30 | 1.28 ± 0.03 | 1.70 ± 0.03 | 0.30 ± 0.00 | 149.27 ± 1.55 |
| MoFlow | 0.86 ± 0.00 | 0.51 ± 0.02 | 23.33 ± 0.35 | 0.15 ± 0.02 | 0.82 ± 0.03 | 0.25 ± 0.01 | 56.71 ± 2.64 |
| HetFlow (ours) | 0.87 ± 0.00 | 0.51 ± 0.01 | 23.72 ± 0.19 | 0.50 ± 0.04 | 0.99 ± 0.03 | 0.25 ± 0.01 | 51.90 ± 1.65 |
| MoFlow+true adj. | 0.86 ± 0.00 | 0.67 ± 0.02 | 8.21 ± 0.22 | 0.64 ± 0.04 | 0.54 ± 0.03 | 0.04 ± 0.00 | 4.23 ± 0.51 |
| HetFlow+true adj. | 0.86 ± 0.00 | 0.75 ± 0.01 | 8.24 ± 0.17 | 0.82 ± 0.04 | 0.41 ± 0.03 | 0.04 ± 0.00 | 5.05 ± 0.77 |

Figure A8: Step-by-step generation (qm9). Snapshots of reconstructed molecules when fixing the bond model and collecting node embeddings from intermediate layers i = 0, 4, 8, 10, 16, 20, 26.

Homophily distribution of generated molecules Additionally, we visualize the node homophily of generated molecules for both qm9 and zinc-250k, together with the estimated node homophily histograms from the outputs of the different models (see Fig. A9). The adjacency matrix generation strategy has only a minor influence on the homophily distribution of generated molecules, indicating that these statistics depend mostly on the atom model rather than the bond model.
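For reference, the node homophily of Pei et al. (2019) used in these histograms, i.e., the fraction of each node's neighbours sharing its label (here, atom type), admits a short sketch; the helper name and tensor layout are illustrative assumptions.

```python
import torch

def node_homophily(atom_types, adj):
    """Average node homophily (Pei et al., 2019): for each node, the fraction
    of its neighbours with the same label, averaged over non-isolated nodes.
    atom_types: (n,) integer labels, adj: (n, n) binary adjacency matrix."""
    same = (atom_types.unsqueeze(0) == atom_types.unsqueeze(1)).float()  # label match
    deg = adj.sum(dim=1)                                # node degrees
    frac = (adj * same).sum(dim=1) / deg.clamp(min=1)   # per-node homophily ratio
    return frac[deg > 0].mean()                         # ignore isolated nodes
```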
Ablation study The random generation results of HetFlows with and without parameter sharing in the MixMP GNNs are shown in Tables A8 and A9.

Table A8: Ablation study of MixMP GNN parameter sharing on qm9.

| Model | FCD ↓ | Validity ↑ | Novelty ↑ | SNN ↑ | Frag ↑ | Scaf ↑ | IntDiv1 ↑ |
| Data (qm9) | 0.40 ± 0.02 | 1.00 ± 0.00 | 0.62 ± 0.02 | 0.54 ± 0.00 | 0.94 ± 0.01 | 0.76 ± 0.03 | 0.92 ± 0.00 |
| HetFlow+true adj. | 1.46 ± 0.09 | 1.00 ± 0.00 | 0.74 ± 0.01 | 0.43 ± 0.00 | 0.85 ± 0.02 | 0.52 ± 0.05 | 0.92 ± 0.00 |
| HetFlow+true adj.+share para. | 1.66 ± 0.07 | 1.00 ± 0.00 | 0.77 ± 0.01 | 0.42 ± 0.00 | 0.86 ± 0.02 | 0.48 ± 0.05 | 0.92 ± 0.00 |

Figure A9: Node homophily distributions of generated molecules for qm9 (HetFlow+true adj., HetFlow, GraphDF, MoFlow, MoFlow+true adj., EDM) and zinc-250k (the same models, without EDM).

Table A9: Ablation study of MixMP GNN parameter sharing on zinc-250k.

| Model | FCD ↓ | Validity ↑ | Novelty ↑ | SNN ↑ | Frag ↑ | Scaf ↑ | IntDiv1 ↑ |
| Data (zinc-250k) | 1.44 ± 0.01 | 1.00 ± 0.00 | 0.02 ± 0.00 | 0.51 ± 0.00 | 1.00 ± 0.00 | 0.28 ± 0.02 | 0.87 ± 0.00 |
| HetFlow+true adj. | 8.24 ± 0.17 | 0.93 ± 0.01 | 1.00 ± 0.00 | 0.34 ± 0.00 | 0.91 ± 0.01 | 0.10 ± 0.03 | 0.87 ± 0.00 |
| HetFlow+true adj.+share para. | 8.11 ± 0.21 | 0.93 ± 0.01 | 1.00 ± 0.00 | 0.32 ± 0.00 | 0.93 ± 0.01 | 0.05 ± 0.02 | 0.87 ± 0.00 |

Table A10: Performance on molecule property optimization in terms of the best QED scores; scores taken from the corresponding papers (JTVAE score from Luo et al., 2021; Verma et al., 2022).

| Method | 1st | 2nd | 3rd |
| Data (zinc-250k) | 0.948 | 0.948 | 0.948 |
| JTVAE | 0.925 | 0.911 | 0.910 |
| GCPN | 0.948 | 0.947 | 0.946 |
| GraphAF | 0.948 | 0.948 | 0.947 |
| GraphDF | 0.948 | 0.948 | 0.948 |
| MoFlow | 0.948 | 0.948 | 0.948 |
| ModFlow | 0.948 | 0.948 | 0.945 |
| HetFlows | 0.948 | 0.948 | 0.947 |

C.4 Property Optimization

In the property optimization task, models demonstrate their capability to find novel molecules that optimize specific chemical properties not present in the training data set, a critical component of drug discovery. For our study, we focus on maximizing the QED property. We train HetFlows on zinc-250k and evaluate its performance against other state-of-the-art models (Verma et al., 2022; Luo et al., 2021; Zang & Wang, 2020; Shi et al., 2019; Jin et al., 2018; You et al., 2018). The results, given in Table A10, show that the top three novel molecule candidates identified by HetFlows (not part of the zinc-250k data set) exhibit QED values on par with those from zinc-250k and other state-of-the-art methods.

Algorithm Given a pre-trained HetFlows model $f$ and a training set $D$ containing molecule-property pairs $\{G, y\}$, we introduce an extra simple MLP $g_\theta$ which maps the molecular embedding $f(G)$ to a predicted property

$$\hat y = g_\theta(f(G)), \tag{56}$$

trained on the data set by optimizing the parameters:

$$\theta^* = \arg\min_\theta \sum_{(G,y) \in D} \mathrm{MSE}\big(g_\theta(f(G)),\, y\big). \tag{57}$$

Then molecule candidates $\{G_i\}_{i=1}^{k}$ with the top-$k$ property values in the data set $D$ are chosen. New embeddings are explored by gradient ascent on the label predicted by $g_{\theta^*}$, starting from these candidates:

$$h_{i,j} = h_{i,j-1} + \delta \left. \frac{\partial g_{\theta^*}(h)}{\partial h} \right|_{h = h_{i,j-1}}, \quad j = 1, \dots, N, \qquad h_{i,0} = f(G_i), \quad i = 1, \dots, k, \tag{58}$$

where $\delta$ denotes the search step length and $N$ is the number of iterations. These embeddings can be recovered into a molecule set:

$$D' = \{ f^{-1}(h_{i,j}) \}_{i=1,\dots,k;\ j=1,\dots,N}. \tag{59}$$

Finally, $D' \setminus D$ gives the set of novel molecules with correspondingly high target properties.
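A compact sketch of this latent search loop (Eqs. (56)-(59)) follows; `flow`, `predictor`, and the `.inverse` method are hypothetical handles for the trained HetFlows model $f$, the MLP $g_{\theta^*}$, and the inverse mapping $f^{-1}$, assuming a differentiable predictor.

```python
import torch

def latent_property_search(flow, predictor, seeds, delta=0.1, n_steps=50):
    """Gradient-ascent search in the flow's latent space (Eq. (58)).
    seeds: embeddings h_{i,0} = f(G_i) of the top-k training molecules."""
    candidates = []
    for h0 in seeds:
        h = h0.clone().requires_grad_(True)
        for _ in range(n_steps):
            score = predictor(h).sum()               # predicted property g_theta(h)
            (grad,) = torch.autograd.grad(score, h)  # d g_theta / d h
            h = (h + delta * grad).detach().requires_grad_(True)  # ascent step
            candidates.append(flow.inverse(h.detach()))  # f^{-1}(h_{i,j}), Eq. (59)
    return candidates  # D'; removing training molecules gives D' \ D
```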
Generation results In our experiments, $g_\theta$ is a simple 3-layer MLP with 16 hidden nodes, the data set $D$ is zinc-250k, and the target property $y$ is QED. The set $D' \setminus D$ provides 17 molecules with a QED score of 0.948. The top-3 QED scores and molecular SMILES are listed below:

1. Cc1cc(C(=O)NC2CC2)c(C)n1-c1ccc2c(c1)OCCO2.N, QED 0.947936,
2. Cc1cccnc1NC(=O)C1CC(=O)N(C)C1c1ccccc1, QED 0.947505,
3. Cc1nn(C)c(C)c1C(=O)NCC1CC12CCc1ccccc12, QED 0.947317.

Baselines The baseline scores of GCPN (You et al., 2018), GraphAF (Shi et al., 2019), GraphDF (Luo et al., 2021), MoFlow (Zang & Wang, 2020) and ModFlow (Verma et al., 2022) are acquired from the corresponding papers. The score of JTVAE (Jin et al., 2018) is acquired from Zang & Wang (2020) and Verma et al. (2022).

C.5 Model Selection

The ranges of the HetFlows hyperparameters are as follows:

- Residual connection in the ACL coupling function: [True, False].
- Parameter-sharing mode of $\ln \sigma_a, \ln \sigma_b$: [0, 1, 2]. Here $\sigma_a^2, \sigma_b^2$ are the variances of the embedding distributions $\mathcal{N}_a, \mathcal{N}_b$ in Eq. (7):
  - mode 0: the distribution variances of both node and edge embeddings are fixed to 1, i.e., $\ln \sigma_a = \ln \sigma_b = 0$;
  - mode 1: the distribution variances of node and edge embeddings share one parameter, $\ln \sigma_a = \ln \sigma_b$, which can be optimized during training;
  - mode 2: the distribution variances of node and edge embeddings are separate parameters $\ln \sigma_a, \ln \sigma_b$, both of which can be optimized during training.

The best-performing model is selected using the FCD score, as suggested in Polykovskiy et al. (2020). The HetFlows model reported in the paper is selected with the hyperparameters: residual connection in the ACL coupling function: False; parameter-sharing mode of $\ln \sigma_a, \ln \sigma_b$: 1.

D Description of Chemoinformatics Metrics

For benchmarking, model selection, comparison, and explorative analysis, we use the following 14 metrics. The metrics are presented in detail in the work by Polykovskiy et al. (2020), which introduced the MOSES benchmarking platform. The metric calculation makes heavy use of the RDKit open-source cheminformatics software (https://www.rdkit.org/). We briefly summarize the metrics below.

Sanity check metrics

1. Validity Fraction (in [0, 1]) of the molecules that produce valid SMILES representations. This is a sanity check for how well the model captures explicit chemical constraints such as proper valence. Higher values are better, as a low value can indicate that the model does not properly capture chemical structure. We report numbers without post hoc validity corrections.

2. Uniqueness Fraction (in [0, 1]) of the molecules that are unique. This is a sanity check based on the SMILES string representations of the generated molecules. Higher values are better, as a low value can indicate that the model has collapsed and produces only a few typical molecules.

3. Novelty Fraction (in [0, 1]) of the generated molecules that are not present in the training set. Higher values are better, as a low value can indicate overfitting to the training data set.

Summary statistics

4. Similarity to a nearest neighbour (SNN) The average Tanimoto similarity (Jaccard coefficient) in [0, 1] between the generated molecules and their nearest neighbours in the reference data set. Higher is better: if the generated molecules are far from the reference set, the similarity to the nearest neighbour will be low.

5. Fragment similarity (Frag) Measures the similarity (in [0, 1]) of the distributions of BRICS fragments (substructures) in the generated set vs. the original data set. If the molecules in the two sets share many of the same fragments in similar proportions, the Frag metric will be close to 1 (higher is better).
6. Scaffold similarity (Scaf) Measures the similarity (in [0, 1]) of the distributions of Bemis-Murcko scaffolds (molecule ring structures, linker fragments, and carbonyl groups) in the generated set vs. the original data set. This metric is calculated similarly to the fragment similarity metric by counting substructure occurrences in the data, and both can be high even if the two sets do not contain the same molecules.

7. Internal diversity (IntDiv1) A measure (in [0, 1]) of the chemical diversity within the generated set of molecules. Higher values are better and signal higher diversity in the generated set; low values can signal mode collapse.

8. Internal diversity (IntDiv2) A measure (in [0, 1]) of the chemical diversity within the generated set of molecules. The interpretation is similar to IntDiv1, but with stronger penalization of the Tanimoto similarity in calculating the diversity.

9. Filters This metric is specific to the MOSES benchmarking platform (see Polykovskiy et al., 2020). It gives the fraction (in [0, 1]) of generated molecules that pass the filters applied during data set construction. In practice, these filters may reject chemically valid molecules containing fragments that are not of interest in the MOSES data set (filtered with medicinal chemistry filters). Thus, this metric is not of primary interest to us, but it gives a view of the match with the MOSES data set.

10. Fréchet ChemNet distance (FCD) Analogous to the Fréchet Inception Distance (FID) used in image generation, FCD compares the feature distributions of real and generated molecules using a pre-trained model (ChemNet). Lower values are better.

Descriptive distributions

11. Octanol-water partition coefficient (logP) A logarithmic measure of the relationship between lipophilicity (fat solubility) and hydrophilicity (water solubility) of a set of molecules. Large values indicate a substance more soluble in fat-like solvents such as n-octanol; small values indicate higher solubility in water. We report both histograms of logP and a summary statistic in terms of the Wasserstein distance between the generated and reference distributions (smaller is better).

12. Synthetic accessibility score (SA) A metric that estimates how easily a chemical molecule can be synthesized. It provides a quantitative value indicating the relative difficulty or ease of synthesizing a molecule, with a lower SA score suggesting that a molecule is more easily synthesized and a higher score suggesting greater complexity or difficulty. We report both histograms of SA and a summary statistic in terms of the Wasserstein distance between the generated and reference distributions (smaller is better).

13. Quantitative estimation of drug-likeness (QED) A metric designed to provide a quantitative measure of how drug-like a molecule is, i.e., the likelihood that a molecule possesses properties consistent with most known drugs, estimated from a variety of molecular descriptors. We report both histograms of QED and a summary statistic in terms of the Wasserstein distance between the generated and reference distributions (smaller is better).

14. Molecular weight (Weight) The sum of atomic weights in a molecule. We report both histograms of molecular weights and a summary statistic in terms of the Wasserstein distance between the generated and reference distributions (smaller is better).
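For reference, the sanity-check metrics (1-3) and the QED property (metric 13) reduce to a few RDKit calls; the snippet below is a minimal sketch of how such quantities can be computed, not the MOSES implementation used for the reported numbers.

```python
from rdkit import Chem
from rdkit.Chem import QED

def sanity_metrics(gen_smiles, train_smiles):
    """Validity, uniqueness, and novelty (metrics 1-3) from SMILES lists.
    Assumes the training SMILES are all parseable."""
    mols = [Chem.MolFromSmiles(s) for s in gen_smiles]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]  # canonicalize
    validity = len(valid) / len(gen_smiles)
    unique = set(valid)
    uniqueness = len(unique) / max(len(valid), 1)
    train = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in train_smiles}
    novelty = len(unique - train) / max(len(unique), 1)
    return validity, uniqueness, novelty

# QED (metric 13) for a single molecule:
mol = Chem.MolFromSmiles("Cc1cccnc1NC(=O)C1CC(=O)N(C)C1c1ccccc1")
print(QED.qed(mol))  # drug-likeness score in [0, 1]
```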