# supercharging_graph_transformers_with_advective_diffusion__6936b044.pdf Supercharging Graph Transformers with Advective Diffusion Qitian Wu 1 Chenxiao Yang 2 Kaipeng Zeng 3 Michael Bronstein 4 5 The capability of generalization is a cornerstone for the success of modern learning systems. For non-Euclidean data, e.g., graphs, that particularly involves topological structures, one important aspect neglected by prior studies is how machine learning models generalize under topological shifts. This paper proposes ADVDIFFORMER, a physics-inspired graph Transformer model designed to address this challenge. The model is derived from advective diffusion equations which describe a class of continuous message passing process with observed and latent topological structures. We show that ADVDIFFORMER has provable capability for controlling generalization error with topological shifts, which in contrast cannot be guaranteed by graph diffusion models, i.e., the generalization of common graph neural networks in continuous space. Empirically, the model demonstrates superiority in various predictive tasks across information networks, molecular screening and protein interactions1. 1. Introduction Learning representations for non-Euclidean data is essential for geometric deep learning. Graph-structured data in particular has attracted increasing attention, as graphs are a very popular mathematical abstraction for systems of relations and interactions that can be applied from microscopic scales (e.g. molecules) to macroscopic ones (social networks). Common frameworks for learning on graphs include graph neural networks (GNNs) (Scarselli et al., 2008; Gilmer et al., 2017; Kipf & Welling, 2017), which operate by propagating information between adjacent nodes of the graph networks, 1Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard 2Toyota Technological Institute at Chicago 3Shanghai Jiao Tong University 4University of Oxford 5Aithyra. Correspondence to: Qitian Wu . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). 1Codes are available at https://github.com/ qitianwu/Adv DIFFormer and graph Transformers (Ying et al., 2021; Wu et al., 2021; Rampásek et al., 2022; Wu et al., 2022b), which propagate information among arbitrary node pairs through global attention. GNNs can be seen as discretized versions of local diffusion equations on graphs (Atwood & Towsley, 2016; Klicpera et al., 2019; Chamberlain et al., 2021a), while graph Transformers can be considered as the counterparts of non-local diffusion (Wu et al., 2023; 2024c). Linking graph learning models with diffusion equations offers powerful tools from the domain of partial differential equations (PDEs), allowing us to study the expressive power (Bodnar et al., 2022), behaviors such as over-smoothing (Rusch et al., 2023) and over-squashing (Topping et al., 2022), the settings of missing features (Rossi et al., 2022), and guide architectural choices (Di Giovanni et al., 2022). While significant efforts have been devoted to understanding the expressive power of graph learning models, the generalization capabilities of such methods are largely an open question. Recent works attempt to study generalization of graph learning models (particularly GNNs) from various perspectives such as extrapolation in feature space (Xu et al., 2021), subgroup fairness (Ma et al., 2021), invariance principle (Wu et al., 2022a), feature propagation (Yang et al., 2023), causality (Wu et al., 2024a), and training dynamics (Yang et al., 2024). However, most of these works focus on the distribution shifts of features and labels. In many critical real-world settings, the training and testing graph topologies can be generated from different distributions (e.g., molecular structures with diverse drug likeness) (Koh et al., 2021; Hu et al., 2021; Bazhenov et al., 2023; Zhang et al., 2023), a phenomenon we refer to as topological distribution shift . This can be a predominant nature of non Euclidean data in contrast with commonly studied feature and label shifts in Euclidean space. Despite its practical significance, how to enable graph learning models to generalize under topological shifts still remains unclear. In this paper, we aim to address generalization under topological shifts through the lens of a physics-inspired graph learning model dubbed as Advective Diffusion Transformer (ADVDIFFORMER). The model is derived from advective diffusion equations which describe a class of continuous message passing process with observed and latent topological structures. On top of this, we connect advective diffusion with a graph Transformer architecture for gener- Supercharging Graph Transformers with Advective Diffusion advection: local message passing diffusion: global attentive propagation latent space Advective Diffusion Transformer Figure 1: Illustration of Advective Diffusion Transformers. alization against topological shifts (Fig. 1): the non-local diffusion term (instantiated as global attention) aims to capture latent interactions learned from data; the advection term (instantiated as local message passing) accommodates the topological features pertaining to observed graphs. To justify the model designs, we show that the closed-form solution of this advective diffusion system possesses the capability to control the generalization error caused by topological shifts, which further guarantees the desired level of generalization. In contrast, commonly used local diffusion models that can be considered as a simplified variant of our model leads to the generalization error whose upper bound exponentially grows with topological shifts. For implementation, we resort to numerical scheme based on the Padé-Chebyshev theory (Golub & Van Loan, 1989) for efficiently computing the diffusion equation s closed-form solution. Experiments show that our model offers superior generalization performance across various downstream predictive tasks in diverse domains, including information networks, molecular screening, and protein interactions. 2. Background and Preliminaries We recapitulate diffusion equations on manifolds (Freidlin & Wentzell, 1993; Medvedev, 2014) and their connection with graph learning. Diffusion on Riemannian manifolds. Let Ωdenote an abstract domain, which we assume here to be a Riemannian manifold (Eells & Sampson, 1964). A key feature distinguishing an n-dimensional Riemannian manifold from a Euclidean space is the fact that it is only locally Euclidean, in the sense that at every point u Ωone can construct n-dimensional Euclidean tangent space TuΩ = Rn that locally models the structure of Ω. The collection of such spaces (referred to as the tangent bundle and denoted by TΩ) is further equipped with a smoothly-varying inner product (Riemannian metric). Now consider some quantity (e.g., temperature) as a function of the form q : Ω R, which we refer to as a scalar field. Similarly, we can define a (tangent) vector field Q : Ω TΩ, associating to every point u on a manifold a tangent vector Q(u) TuΩ, which can be thought of as a local infinitesimal displacement. We use Q(Ω) and Q(TΩ) to denote the functional spaces of scalar and vector fields, respectively. The gradient operator : Q(Ω) Q(TΩ) takes scalar fields into vector fields representing the local direction of the steepest change of the field. The divergence operator is the adjoint of the gradient and maps in the opposite direction, : Q(TΩ) Q(Ω). A manifold diffusion process models the evolution of a quantity (e.g., chemical concentration) due to its difference across spatial locations on Ω. Denoting by q(u, t) : Ω [0, ) R the quantity over time t, the process is described by a PDE (diffusion equation) (Romeny, 2013): t = (S(u, t) q(u, t)) , t 0, u Ω, with initial conditions q(u, 0) = q0(u) and possibly additional boundary conditions if Ωhas a boundary. S denotes the diffusivity of the domain. It is typical to distinguish between an isotropic (location-independent diffusivity), non-homogeneous (location-dependent diffusivity S = s(u) R), and anisotropic (locationand directiondependent S(u) Rn n) settings. In the cases studied below, we assume the dependence of diffusivity on locations is via a function of the quantity itself, i.e., S = S(q(u, t)). Diffusion on Graphs. Recent works adopt diffusion equations as a foundation principle for graph representation learning (Chamberlain et al., 2021a;b; Thorpe et al., 2022; Bodnar et al., 2022; Choi et al., 2023; Rusch et al., 2023; Wu et al., 2024b), employing analogies between calculus on manifolds and graphs. Let G = (V, E) be a graph with nodes V and edges E, represented by the |V| |V| adjacency matrix A. Let X = [xu]u V denote a |V| D matrix of node features, analogous to scalar fields on manifolds. The graph gradient ( X)uv = xv xu defines edge features for (u, v) E, analogous to vector fields on manifolds. Similarly, the graph divergence of edge features E = [euv](u,v) E, defined as the adjoint ( E)u = P v:(u,v) E euv, produces node features. Diffusion models replace discrete GNN layers with continuous time-evolving node embeddings Z(t) = [zu(t)], where zu(t) : [0, ) Rd driven by the diffusion equation: t = (S(Z(t); A) Z(t)) , t 0, (1) with initial conditions Z(0) = ϕenc(X) where ϕenc is a node-wise MLP encoder and w.l.o.g., the diffusivity Supercharging Graph Transformers with Advective Diffusion S(Z(t); A) over the graph can be defined as a |V| |V| matrix-valued function dependent on A, which measures the rate of information flows between node pairs. With the graph gradient and divergence, Eqn. 1 becomes t = (C(Z(t); A) I)Z(t), 0 t T, (2) with initial conditions Z(0) = ϕenc(X) where C(Z(t); A) is a |V| |V| coupling matrix associated with the diffusivity. Eqn. 2 yields a dynamics from t = 0 to an arbitrary given stopping time T, where the latter yields node representations for prediction, e.g., ˆY = ϕdec(Z(T)). The coupling matrix determines the interactions between different nodes in the graph, and its common instantiations include normalized graph adjacency (non-parametric) and learnable attention matrix (parametric), in which cases the finite-difference numerical iterations for solving Eqn. 2 correspond to the discrete propagation layers of common GNNs (Chamberlain et al., 2021a) and graph Transformers (Wu et al., 2023; 2024c) (see Appendix A for details). It is typical to tacitly make a closed-world assumption, i.e., the graph topologies of training and testing data are generated from the same distribution. However, the challenge of generalization arises when the testing topology is different from the training one. In such an open-world regime, it still remains unclear how graph diffusion equations and, more broadly, learning-based models on graphs (e.g., GNNs) extrapolate and generalize to new unseen structures. 3. Generalization by Advective Diffusion As a prerequisite for analyzing the generalization behaviors of learning-based models on graphs, we need to characterize how topological shifts occur in nature. In general sense, extrapolation is impossible without any exposure to the new data or prior knowledge about the data-generating mechanism. In our work, we assume testing data is strictly unknown during training, in which case structural assumptions become necessary for authorizing generalization. 3.1. Problem Formulation: Data Generation Hypothesis We present the causal mechanism of graph data generation in Fig. 2 as a hypothesis, inspired by the graph limits (Lovász & Szegedy, 2006; Medvedev, 2014) and random graph models (Snijders & Nowicki, 1997). In graph theory, the topology of a graph G = (V, E) can be assumed to be generated by a graphon (or continuous graph limit), a random symmetric measurable function W : [0, 1]2 [0, 1], which is an unobserved latent variable. In our work, we generalize this data-generating mechanism to include alongside graph adjacency also node features and labels: i) Each node u V has a latent i.i.d. variable Uu U[0, 1]. The node features are a random variable X = [Xu] gen- Figure 2: The data-generating mechanism with topological shifts caused by environment E. The solid (resp. dashed) nodes represents observed (resp. latent) random variables. erated from each Uu through a certain node-wise function Xu = g(Uu; W). We denote by matrix X a particular realization of the random variable X. ii) Similarly, the graph adjacency A = [Auv] is a random variable generated through a pairwise function Auv = h(Uu, Uv; W, E) additionally dependent on the environment E. The change of E happens when it transfers from training to testing, resulting in a different distribution of A. We denote by A a particular realization of the adjacency matrix. iii) The label Y can be specified in certain forms. As we assume in below, Y is generated by a function over sets, Y = r({Uv V}, A; W). Denote by Y a realization of Y . The above process formalizes the data-generating mechanism behind various data of inter-dependent nature, where the graph data (X, A, Y) is generated from the joint distribution p(X, A, Y |E) with a specific environment. The learning problem boils down to finding parameters θ of a parametric function Γθ(A, X) that establishes the predictive mapping from observed node features X and graph adjacency A to the label Y. Γθ is typically implemented as a GNN, which is expected to possess sufficient expressive power (in the sense that θ such that Γθ(A, X) Y) as well as generalization capability under topological shifts (i.e., when the observed graph topology varies from training to testing, which in our model amounts to the change in E). While significant attention in the literature has been devoted to the former property (Morris et al., 2019; Xu et al., 2019; Bouritsas et al., 2023; Papp et al., 2021; Balcilar et al., 2021; Bodnar et al., 2022); the latter is largely an open question. 3.2. Proposed Model: Advective Diffusion Transformers To deal with the problem formulated in the previous subsection, we next present a new diffusion equation model that offers a provable level of generalization in the data-generating mechanism as stipulated in Sec. 3.1. The model is inspired by a particular class of diffusion equations, called advective diffusion, that describe common physical phenomenons driven by both diffusion and advection effects. Supercharging Graph Transformers with Advective Diffusion Advective Diffusion Equations. The classic advective diffusion is commonly used for characterizing physical systems with convoluted quantity transfers, where the term advection refers to the evolution caused by the movement of the diffused quantity (Chandrasekhar, 1943). Consider the abstract domain Ωof our interest defined in Sec. 2, and assume V (u, t) TuΩ(a vector field in Ω) to denote the velocity of the particle at location u and time t. The advective diffusion of the physical quantity q on Ωis governed by the PDE as (Leveque, 1992): q(u,t) (S(u, t) q(u, t)) + β (V (u, t) q(u, t)) , (3) where t 0, u Ωand β 0 is a weight for the advection term. For example, if we consider q(u, t) as the water salinity in a river, then Eqn. 3 describes the temporal evolution of salinity at each location that equals to the spatial transfers of both diffusion process (caused by the concentration difference of salt and S reflects the molecular diffusivity in the water) and advection process (caused by the movement of the water and V characterizes the flowing directions). Advective Diffusion on Graphs. Similarly, on a graph G = (V, E), we can define the velocity for each node u as a |V|-dimensional vector-valued function V(t) = [vu(t)]. We thus have ( (V(t) Z(t)))u = P v V vuv(t)zv(t) and the advective diffusion equation on graphs: t = [C(Z(t)) + βV(t) I] Z(t), 0 t T. (4) We next instantiate the coupling matrix and the velocity to endow the model with generalizability under topological shifts, by drawing inspirations from physical phenomenons. Non-local diffusion as global attention. The diffusion process led by the concentration gradient acts as an internal driving force, where the diffusivity keeps invariant across environments (e.g., the molecular diffusivity stays constant in different rivers). With this intuition in mind, we consider the non-local diffusion operator allowing instantaneous information flows among arbitrary locations (Chasseigne et al., 2006). In the context of learning on graphs, the non-local diffusion can be seen as generalizing the feature propagation to a complete or fully-connected (latent) graph (Wu et al., 2023; 2024c), in contrast with common GNNs that allow message passing only between neighboring nodes. Formally speaking, we can define the gradient and divergence operators on a complete graph: ( X)uv = xv xu (u, v V) and ( E)u = P v V euv (u V). This resonates with the latent interactions among nodes, determined by the underlying data manifold, that induce all-pair information flows over a complete graph and stay invariant w.r.t. the change of E. We thus instantiate C as a global attention matrix that computes the similarities between arbitrary node pairs: C = [cuv]u,v V, cuv = η(zu(0), zv(0)) P w V η(zu(0), zw(0)), (5) where η is a learnable pairwise similarity function. Advection as local message passing. The advection process driven by the directional movement belongs to an external force, with the velocity depending on contexts (e.g., different rivers). This is analogous to the environment-sensitive graph topology that is informative for prediction in specific environments. We instantiate the velocity as the normalized graph adjacency, i.e., V = D 1/2AD 1/2, reflecting observed structural information, where D is the diagonal degree matrix of A. Then our advective diffusion model can be formulated as: t = [C + βV I] Z(t), 0 t T, (6) with initial conditions Z(0) = ϕenc(X) where β [0, 1] is a hyper-parameter. The integration of non-local diffusion (implemented through attention) and advection (implemented as MPNNs) give rise to a new architecture, which we call Advective Diffusion Transformer (ADVDIFFORMER). Remark. Eqn. 6 has a closed-form solution Z(t) = e (I C βV)t Z(0). A special case of β = 0 (no advection) can be used in situations where the graph structure is not useful. Moreover, one can extend Eqn. 6 to a non-linear equation with time-dependent C(Z(t)), in which case the equation has no closed-form solution and needs numerical schemes for solving. Similarly to (Di Giovanni et al., 2022), we found in our experiments a simple linear diffusion to be sufficient to yield promising performance. We therefore leave the study of the non-linear variant for the future. 3.3. Theoretical Justification We proceed to analyze the generalization capability of our proposed model w.r.t. topological shifts as defined in Sec 3.1. We are interested in the generalization error of Γθ instantiated as the continuous diffusion model in Eqn. 6, when transferring from training data generated with the environment Etr to testing data generated with Ete. The latter causes varied graph topologies as stipulated in Sec. 3.1. We denote by {(X(i), A(i), Y(i))}Ntr i the training data set sized Ntr generated from p(X, A, Y |E = Etr), and l( , ) any bounded loss function. The training error (i.e., empirical risk) can be defined as Remp(Γθ; Etr) i=1 l(Γθ(X(i), A(i)), Y(i)). (7) Our target is to reduce the generalization error on testing data generated from p(X, A, Y |E = Ete): R(Γθ; Ete) E(X ,A ,Y ) p(X,A,Y |E=Ete)[l(Γθ(X , A ), Y )]. (8) Particularly, if Ete = Etr, the learning setting degrades to the standard one commonly studied in the closed-world Supercharging Graph Transformers with Advective Diffusion assumption, wherein the in-distribution generalization error has an upper bound (Shalev-Shwartz & Ben-David, 2014): R(Γθ; Etr) Remp(Γθ; Etr) Din(Γθ, Etr, Ntr) = 2H(Γθ) + O p log(1/δ)/Ntr , (9) where H(Γθ) denotes the Rademacher complexity of the function class induced by Γθ, and Din(Γθ, Etr, Ntr) is determined by training data size and model complexity. When Ete = Etr that occurs in the open-world regime, i.e., our focused learning setting, the analysis becomes more difficult due to the topological shifts. In the diffusion equation (either Eqn. 6 or 2), the change of graph topologies leads to the change of node representations (solution of the diffusion equation Z(T)). Thereby, the output of the diffusion process can be expressed as Z(T; A) = f(Z(0), A). Our first result below decouples the out-of-distribution generalization gap R(Γθ; Ete) Remp(Γθ; Etr) into three error terms. Theorem 3.1. Assume l and ϕdec are Lipschitz continuous. For any graph data generated with the mechanism of Sec. 3.1, it holds with the probability 1 δ that the generalization gap of Γθ satisfies |R(Γθ; Ete) Remp(Γθ; Etr)| Din(Γθ, Etr, Ntr) + O(EA p(A|Etr),A p(A|Ete)[ Z(T; A ) Z(T; A) 2]) | {z } Dood model(Γθ,Etr,Ete) + O(E(A,Y) p(A,Y |Etr),(A ,Y ) p(A,Y |Ete) [ Y Y 2]) | {z } Dood label(Etr,Ete) Remark. Since Din is independent of the testing data generated with Ete = Etr, the impact of topological shifts on the out-of-distribution generalization error is largely dependent on Dood model and Dood label: the former reflects the variation magnitude of Z(T; A) yielded by Γθ w.r.t. varying topologies; the latter measures the difference of labels generated with different environments. Notice that Dood label is fully determined by the data-generating mechanism, while Dood model is mainly dependent on the model Γθ, particularly the sensitivity of node representations w.r.t. topological shifts. We thus next zoom in on the specific design of Γθ as Eqn. 6, and the next result shows the upper bound of the change rate of Z(T; A) w.r.t. variation of graph topologies. Theorem 3.2. For any graph data generated with the mechanism of Sec. 3.1, if g is injective, then the model Eqn. 6 can reduce the variation magnitude of the node representation Z(T; A ) Z(T; A) 2 to any order O(ψ( A 2)) where ψ denotes an arbitrary polynomial function, A = A A and A = D 1/2AD 1/2. This suggests that the advective diffusion model is capable of controlling the change rate of node representations to arbitrary rates w.r.t. A 2. The injectivity of g is a mild condition since g establishes a mapping from a smooth and compact latent space to a high-dimensional space. Applying Theorem 3.1 we have the generalization error of the advective diffusion model. Corollary 3.3. On the same condition of Theorem 3.1 and 3.2, the model-dependent generalization error bound of Eqn 6 can be reduced to arbitrary polynomial orders w.r.t. topological shifts, i.e., Dood model(Γθ, Etr, Etr) = O(EA p(A|Etr),A p(A|Ete)[ψ( A 2)]). This implies that the generalization error of the model in Eqn. 6 can be controlled within an arbitrary rate w.r.t. A 2. The model has provable capacity for achieving a desired level of generalization with topological shifts. To verify the efficacy of the model, we consider synthetic data that simulates the topological shifts as defined in Sec. 3.1 and investigate into three types of topological shifts as shown in Fig. 3 (see experimental details in Sec. 5.1). We found that our model (ADVDIFFORMERI and ADVDIFFORMER-S whose implementation is presented in Sec. 4) keeps the testing error nearly constant as A 2 increases. 3.4. Comparisons with Other Models To further illuminate the effectiveness of the proposed model, we next compare with two related models that are commonly adopted and can be considered as the simplified variants of our model. Local Diffusion Model. We first consider a typical model instantiation, i.e., local diffusion equation on graphs, wherein the model discards the advection term in Eqn. 6 and degrades to Eqn. 2 with the coupling matrix dependent on A. In such a situation, the propagation of node signals is constrained within connected neighbored nodes. The common choice for the coupling matrix is the symmetric normalized graph adjacency matrix A = D 1/2AD 1/2. In this case, the finite-difference iteration for solving Eqn. 2 would induce the discrete propagation layers akin to the message passing rule of SGC (Wu et al., 2019) and GCN (Kipf & Welling, 2017) if the feature transformation and nonlinearity are neglected (see more illustration in Appendix A). Given the constant coupling matrix C, Eqn. 2 has a closedform solution Z(t) = e (I C)t Z(0). However, unlike the advective diffusion model, the change rate of Z(T; A) w.r.t. A = A A produced by the local diffusion model has an exponential upper bound. Proposition 3.4. For local diffusion model (defined by Eqn. 2) with the coupling matrix C = D 1/2AD 1/2 or C = D 1A, the yielded node representation satisfies Z(T; A ) Z(T; A) 2 = O( A 2 exp ( A 2T)). The label prediction ˆY = ϕdec(Z(T; A)) could be highly sensitive to the change of the graph topology. Pushing fur- Supercharging Graph Transformers with Advective Diffusion 0 10 20 30 40 50 60 ||ΔA||2 Mean Square Error (a) Homophily Shift Diff Linear Diff Multi Layer Diff Time Diff Non Local Adv DIFFormer l Adv DIFFormer s 60 70 80 90 100 110 120 ||ΔA||2 Mean Square Error (b) Density Shift 0 20 40 60 80 100 ||ΔA||2 Mean Square Error (c) Block Shift Figure 3: Testing errors (y-axix) w.r.t. differences in graph topologies (x-axis) on synthetic datasets that simulate the topological distribution shifts according to the data generation hypothesis of Fig. 2. ther, we have the following corollary on the generalization capability of local diffusion models under topological shifts. Corollary 3.5. Under the same condition as in Theorem 3.1, for diffusion models Eqn. 2 with the normalized graph adjacency as the coupling matrix, the model-dependent generalization error on testing data generated with Ete = Etr has an upper bound: Dood model(Γθ, Etr, Etr) = O(EA p(A|Etr),A p(A|Ete)[ A 2 exp ( A 2T)]). By definition in Sec. 3.1, the graph adjacency is a realization of a random variable A = h(Uu, Uv; W, E) dependent on a varying environment E. Topological shifts caused by different distributions of A s between training and testing environments may result in large Dood model. 2 This result together with Theorem 3.1 suggests that local diffusion models may lead to undesirably poor generalization in cases where models are expected to be insensitive to the change of topologies. For example, for situations where the groundtruth labels do not dramatically change with topological shifts (i.e., Dood label is small), local diffusion models may induce large Dood model that prejudices generalization. The above conclusion can be extended to models with layer-wise feature transformations and non-linearity (see Appendix B.6 for illustration). While the upper bound result does not mean the exponentially growing error would definitely happen in practice, our empirical comparison in Fig. 3 shows that the generalization error of local diffusion models (Diff-Linear, Diff-Multi Layer and Diff-Time) on test data grows superlinearly w.r.t. A 2 across three types of topological shifts (see experimental details in Sec. 5.1). Non-Local Diffusion Model. We next discuss the nonlocal diffusion model without the advection term, which as previously mentioned can be essentially treated as a generalization of local diffusion to a complete or fully-connected 2The influence of topology variation is inherently associated with h. For example, if one considers h as the stochastic block model (Snijders & Nowicki, 1997), then the change of E may lead to generated graph data with different edge probabilities. In the case of real-world data with intricate topological patterns, the functional forms of h can be more complex, consequently inducing different types of topological shifts. graph. The corresponding diffusion equation still exhibits the form of Eqn. 2 but allows non-zero entries for arbitrary (u, v) s in the coupling matrix to accommodate the all-pair information flows. Then we can easily derive a result that if Y is conditionally independent from A with given {Uu}u V in the data generation hypothesis of Sec. 3.1, the non-local diffusion model (i.e., Eqn. 2 with the attention-based coupling matrix) leads to the generalization gap R(Γθ; Ete) Remp(Γθ; Etr) Din(Γθ, Etr, Ntr), (10) which holds with the probability 1 δ. The assumption of conditional independence between Y and A, however, can be violated in many situations where labels strongly correlate with observed graph structures. Furthermore, the performance on testing data (i.e., what we care about) depends on both the model s expressiveness and generalization. The non-local diffusion alone, discarding any observed topology, has insufficient expressiveness for capturing the structural information. By contrast, the advective diffusion model proposed in Sec. 3.2 that accommodates the observed structures can provably generalize under topological shifts without the conditional independence assumption between Y and A. This makes the advective diffusion model more powerful in real cases. Furthermore, as empirically validated in Fig. 3, the non-local diffusion model (Diff-Non Local) indeed yields comparably stable yet obviously inferior performance to our models (ADVDIFFORMER-I and ADVDIFFORMER-S). 4. Model Implementation The remaining question concerning model implementation boils down to how to solve the advective diffusion equation Eqn. 6. One straightforward solution is to harness the scheme adopted by (Chen et al., 2018) for back-propagation through PDE dynamics. However, since it is known that the equation has a closed-form solution e (I C βV)t, we resort to a implementation-wise simpler method by computing the solution instead of numerically solving the equation. Nevertheless, direct computation of the matrix exponential through eigendecomposition is computationally intractable for large matrices. As an alternative, we leverage numer- Supercharging Graph Transformers with Advective Diffusion ical techniques based on series expansion that produces two model versions. Due to space limit, we present the main ideas in this subsection and defer details on model implementation to Appendix D.1. ADVDIFFORMER-I uses a numerical method based on the extension of Padé-Chebyshev theory to rational fractions (Golub & Van Loan, 1989; Gallopoulos & Saad, 1992), which has shown empirical success in 3D shape analysis (Patané, 2014). The matrix exponential is approximated by solving multiple linear systems (see more details and derivations in Appendix C) and we generalize it as a multihead network where each head propagates in parallel: Lh = (1 + θ)I Ch βV, h = 1, , H, h=1 ϕ(h) F C(linsolver(Lh, Z(0))). (11) The linsolver computes the matrix inverse Zh = (Lh) 1Z(0) and can be efficiently implemented via torch.linalg.solve() that enables automated differentiation. Each head contains propagation with the precomputed attention Ch and node-wise transformation ϕ(h) F C. ADVDIFFORMER-S resorts to approximation by finite geometric series (see Appendix C for derivations): Ph = Ch + β A, h = 1, , H, h=1 ϕ(h) F C([Z(0), Ph Z(0), , (Ph)KZ(0)]). This model aggregates K-order propagated results with the propagation matrix Ph in each head. One advantage of this model version lies in its good scalability with linear complexity w.r.t. the number of nodes in the feed-forward computation (see detailed illustration in Appendix D.1.2). 5. Experiments We apply our model to a wide variety of downstream tasks of disparate scales and granularities that involve topological shifts led by distinct factors. Due to the diversity of datasets and tasks, the competing models that are applicable to specific cases can vary case by case, so the goal of our experiments is to showcase the wide applicability and superiority of ADVDIFFORMER against commonly used GNNs and graph Transformers as well as bespoke methods tailored for specific tasks. In the following, we delve into each case separately with the overview of experimental setup and discussions. More detailed dataset information is provided in Appendix E.1. Details on baselines and hyper-parameters are deferred to Appendix E.2 and E.3, respectively. 5.1. Synthetic Datasets To validate our model and theoretical analysis, we create synthetic datasets simulating the data generation hypothesis Table 1: Results on Arxiv and Twitch, where we use time and spatial contexts for data splits, respectively. We report the Accuracy ( ) for three testing sets of Arxiv and average ROC-AUC ( ) for all testing graphs of Twitch (results for each case are reported in Appendix F.1). Top performing methods are marked as first/second/third. OOM indicates out-of-memory error. Arxiv (2018) Arxiv (2019) Arxiv (2020) Twitch (avg) MLP 49.91 0.59 47.30 0.63 46.78 0.98 61.12 0.16 GCN 50.14 0.46 48.06 1.13 46.46 0.85 59.76 0.34 GAT 51.60 0.43 48.60 0.28 46.50 0.21 59.14 0.72 SGC 51.40 0.10 49.15 0.16 46.94 0.29 60.86 0.13 GDC 51.53 0.42 49.02 0.51 47.33 0.60 61.36 0.10 GRAND 52.45 0.27 50.18 0.18 48.01 0.24 61.65 0.23 A-DGNs 50.91 0.41 47.54 0.61 45.79 0.39 60.11 0.09 CDE 50.54 0.21 47.31 0.52 45.32 0.26 60.69 0.10 Graph Trans OOM OOM OOM 61.65 0.23 Graph GPS 51.11 0.19 48.91 0.34 46.46 0.95 62.13 0.34 DIFFormer 50.45 0.94 47.37 1.58 44.30 2.02 62.11 0.11 ADVDIFFORMER-S 53.41 0.48 51.53 0.60 49.64 0.54 62.51 0.07 in Sec. 3.1. We instantiate h as a stochastic block model which generates edges Auv according to block numbers (b), intra-block edge probability (p1) and inter-block edge probability (p2). Then we study three types of topological distribution shifts: homophily shift (changing p2 with fixed p1); density shift (changing p1 and p2); and block shift (varying b). The predictive task is node regression. More details on data generation are presented in Appendix E.1.1. Fig. 3 plots the testing error (i.e., Mean Square Error) w.r.t. differences in graph topologies A 2 (i.e., the gap between training and testing graphs) in three cases. We compare our model (ADVDIFFORMER-I and ADVDIFFORMER-S) with other diffusion-based models as competitors. The latter includes Diff-Linear (graph diffusion with constant C), Diff-Multi Layer (the extension of Diff-Linear with intermediate feature transformations), Diff Time (graph diffusion with time-dependent C(Z(t))) and Diff-Non Local (non-local diffusion with the global attentionbased C(Z(t))). The results show that three local graph diffusion models exhibit clear performance degradation, i.e., the regression error grows super-linearly w.r.t. topological shifts, while our two models yield consistently low error across environments. In contrast, the non-local diffusion model produces comparably stable performance yet inferior to our models due to its ignorance of the useful information in input graphs. These empirical observations are consistent with our theoretical results presented in Sec. 3.3 and 3.4. 5.2. Real-World Datasets We next evaluate ADVDIFFORMER on real-world datasets with more complex topological shifts concerning non Euclidean data in a diverse set of applications. Information Networks. We first consider citation networks Arxiv (Hu et al., 2020) and social networks Supercharging Graph Transformers with Advective Diffusion Table 2: Results of RMSE ( ) for node regression and edge regression on dynamic networks of protein-protein interactions. Node Regression Edge Regression Valid Test Average Test Worst Valid Test Average Test Worst MLP 0.768 0.011 0.672 0.014 0.768 0.014 0.150 0.004 0.192 0.003 0.204 0.003 GCN 1.791 0.023 1.308 0.013 1.797 0.007 0.185 0.003 0.196 0.001 0.213 0.001 GAT 1.255 0.022 1.057 0.030 1.708 0.067 0.210 0.010 0.204 0.006 0.216 0.010 SGC 1.622 0.004 1.154 0.006 1.616 0.002 0.193 0.000 0.191 0.001 0.209 0.001 Graph Trans 3.798 1.146 3.203 0.889 3.795 1.123 0.189 0.005 0.189 0.008 0.202 0.003 Graph GPS 0.713 0.050 0.671 0.040 0.803 0.081 0.168 0.004 0.182 0.007 0.216 0.019 DIFFormer 0.672 0.046 0.637 0.034 0.710 0.028 0.171 0.007 0.183 0.005 0.197 0.003 ADVDIFFORMER-I 0.681 0.010 0.643 0.019 0.679 0.021 0.159 0.002 0.166 0.006 0.184 0.011 ADVDIFFORMER-S 0.547 0.040 0.574 0.028 0.644 0.040 0.156 0.006 0.167 0.004 0.188 0.010 Adv DIFFormer (0.697) Ground Truth GCN (0.685) GAT (0.664) Graph GPS (0.694) DIFFormer (0.674) Figure 4: Testing cases for molecular mapping operators generated by different models with averaged testing Accuracy ( ) reported. The task is to generate subgraph-level partitions (marked by different colors) resembling the ground-truth. Twitch (Rozemberczki et al., 2021) with graph sizes ranging from 2K to 0.2M, where we use the scalable version ADVDIFFORMER-S. To introduce topological shifts, we partition the data according to publication years and geographic information for Arxiv and Twitch, respectively. The predictive task is node classification, and we follow the common practice comparing Accuracy (resp. ROCAUC) for Arxiv (resp. Twitch). We compare with three types of state-of-the-art baselines: (i) classical GNNs (GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018) and SGC (Wu et al., 2019)); (ii) diffusion-based GNNs (GDC (Klicpera et al., 2019), GRAND (Chamberlain et al., 2021a), A-DGNs (Gravina et al., 2023) and CDE (Zhao et al., 2023)), and (iii) graph Transformers (Graph Trans (Wu et al., 2021), Graph GPS (Rampásek et al., 2022), and the diffusion-based DIFFormer (Wu et al., 2023)). Appendix E.2 presents detailed descriptions for these models. Table 1 reports the results, showing that our model offers significantly superior generalization for node classification. Protein Interactions. We then test on protein-protein interactions (Fu & He, 2022). Each node denotes a protein with a time-aware gene expression value and the edges indicate co-expressed protein pairs at each time. The dataset consists of 12 dynamic networks each of which is obtained by one protein identification method and records the metabolic cycles of yeast cells. The networks have distinct topological features (e.g., distribution of cliques) (Fu & He, 2022), and we use 6/1/5 networks for train/valid/test. To test the gen- eralization of the model in different scenarios, we consider two predictive tasks: i) node regreesion for gene expression values (measured by RMSE), and 2) edge regression for predicting the co-expression correlation coefficients (measured by RMSE). Table 2 reports the averaged scores and worst-case scores across all testing graphs. The results show that ADVDIFFORMER-S and ADVDIFFORMER-I are ranked in the top three (resp. two) models in terms of the averaged (resp. worst-case) testing performance. Moreover, ADVDIFFORMER-S performs better in node regression tasks, while ADVDIFFORMER-I exhibits (slightly) better competitiveness for edge regression. The possible reason might be that ADVDIFFORMER-I can better exploit high-order structural information as the matrix inverse can be treated as ADVDIFFORMER-S with K . Molecular Mapping Operator Generation. We next consider the generation of molecular coarse-grained mapping operators, an important step for molecular dynamics simulation, aiming to find a representation of how atoms are grouped in a molecule (Li et al., 2020). The task is a graph segmentation problem which can be modeled as predicting edges that indicate where to partition the graph. We use the relative molecular mass to split the data and test how the model extrapolates to larger molecules. Fig. 4 compares the testing cases (with more cases shown in Appendix F.1) generated by different models, which shows the more accurate estimation of our model (we use ADVDIFFORMER-S for experiments) that demonstrates desired generalization. Supercharging Graph Transformers with Advective Diffusion 0 0.2 0.5 0.8 1.0 1.5 2.0 5.0 10.0 β Test 1 (2018) Test 2 (2019) Test 3 (2020) 0 0.3 0.5 0.8 1.0 β 1.44 1.46 1.48 1.50 1.52 1.54 1.56 1.58 0 0.3 0.5 0.8 1.0 β Figure 5: Analysis of β on Arxiv and node regression (nr) and edge regression (er) tasks on DPPIN. Hyper-Parameter Analysis. The hyper-parameter β controls the importance weight for the advection term. Fig. 5 shows the model performance of ADVDIFFORMER-S on Arxiv and DPPIN with different β s. We found that the optimal settings for β can be different across datasets and tasks. For node classification on Arxiv, the model gives the best performance with β [0.7, 1.0]. The performance degrades when β is too small (<0.5) or too large (>2.0). The reason could be that the graph structural information is useful for the predictive task on Arxiv yet too much emphasis on the graph structure can lead to undesired generalization. Differently, for DPPIN, we found that using smaller β can bring up more satisfactory performance across node regression and edge regression tasks. In particular, setting β = 0, in which case the advection term is completely dropped, can yield optimal performance for the node regression task. This is possibly because the graph structure is uninformative and pure global attention can learn generalizable topological patterns from latent interactions. To sum up, in practice, the model enables much flexibility for adjusting the weight on the advection effect (the importance of observed structural information) to accommodate the diversity of graph-structured data. More hyper-parameter analysis (w.r.t. θ and K) is deferred to Appendix F.2. Ablation Studies. We defer ablation studies on the diffusion and advection terms of our model to Appendix F.2. 6. Conclusions This paper studies generalization with topological shifts, a largely open question in machine learning, and the insights in this work open new possibilities of leveraging diffusion PDEs as principled guidance for navigating generalizable neural network models. As exemplified in this work, our proposed Advective Diffusion Transformer, inspired by advective diffusion equations, has provable potentials for generalization and shows superior performance in various downstream tasks across different scenarios. Impact Statement This paper presents work whose goal is to advance the current understandings for the generalization of neural network models. In general sense, improving the generalizability of the model is important for many aspects associated with AI s societal responsibilities, such as addressing the observational bias in training data and promoting the fairness of the outcome on the test set. Specifically speaking, our paper explores the generalization capability of neural network models operated on varying graph topologies, which has implications on several significant applications such as social network analysis, drug discovery and healthcare. The results and methodology presented in this work can shed lights on enhancing the trustworthiness and reliability of machine learning models in the open world regime. Acknowledgement QW is supported in part by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. MB is partially supported by the EPSRC Turing AI World Leading Research Fellowship No. EP/X040062/1 and EPSRC AI Hub No. EP/Y028872/1. Atwood, J. and Towsley, D. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993 2001, 2016. Balcilar, M., Renton, G., Héroux, P., Gaüzère, B., Adam, S., and Honeine, P. Analyzing the expressive power of graph neural networks in a spectral perspective. In International Conference on Learning Representations, 2021. Bazhenov, G., Kuznedelev, D., Malinin, A., Babenko, A., and Prokhorenkova, L. Evaluating robustness and uncertainty of graph models under structural distributional shifts. ar Xiv preprint ar Xiv:2302.13875, 2023. Bodnar, C., Di Giovanni, F., Chamberlain, B., Liò, P., and Bronstein, M. Neural sheaf diffusion: A topological perspective on heterophily and oversmoothing in gnns. Advances in Neural Information Processing Systems, 35: 18527 18541, 2022. Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, M. M. Improving graph neural network expressivity via sub- Supercharging Graph Transformers with Advective Diffusion graph isomorphism counting. IEEE Trans. Pattern Anal. Mach. Intell., 45(1):657 668, 2023. Chamberlain, B., Rowbottom, J., Gorinova, M. I., Bronstein, M. M., Webb, S., and Rossi, E. GRAND: graph neural diffusion. In International Conference on Machine Learning (ICML), pp. 1407 1418, 2021a. Chamberlain, B. P., Rowbottom, J., Eynard, D., Giovanni, F. D., Dong, X., and Bronstein, M. M. Beltrami flow and neural diffusion on graphs. In Advances in Neural Information Processing Systems (Neur IPS), 2021b. Chandrasekhar, S. Stochastic problems in physics and astronomy. Reviews of modern physics, 15(1):1, 1943. Chasseigne, E., Chaves, M., and Rossi, J. D. Asymptotic behavior for nonlocal diffusion equations. Journal de mathématiques pures et appliquées, 86(3):271 291, 2006. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in neural information processing systems, 2018. Choi, J., Hong, S., Park, N., and Cho, S.-B. Gread: Graph neural reaction-diffusion equations. In International Conference on Machine Learning, 2023. Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. Di Giovanni, F., Rowbottom, J., Chamberlain, B. P., Markovich, T., and Bronstein, M. M. Understanding convolution on graphs via energies. Transactions on Machine Learning Research, 2022. Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. Co RR, abs/2012.09699, 2020. Eells, J. and Sampson, J. H. Harmonic mappings of riemannian manifolds. American journal of mathematics, 86(1): 109 160, 1964. Freidlin, M. I. and Wentzell, A. D. Diffusion processes on graphs and the averaging principle. The Annals of probability, pp. 2215 2245, 1993. Fu, D. and He, J. Dppin: A biological repository of dynamic protein-protein interaction network data. In 2022 IEEE International Conference on Big Data (Big Data), pp. 5269 5277. IEEE, 2022. Gallopoulos, E. and Saad, Y. Efficient solution of parabolic equations by krylov approximation methods. SIAM journal on scientific and statistical computing, 13(5):1236 1264, 1992. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263 1272, 2017. Golub, G. H. and Van Loan, C. F. Matrix computations. John Hopkins University Press, 1989. Gravina, A., Bacciu, D., and Gallicchio, C. Anti-symmetric DGN: a stable architecture for deep graph networks. In International Conference on Learning Representations, 2023. Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359 366, 1989. Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, 2020. Hu, W., Fey, M., Ren, H., Nakata, M., Dong, Y., and Leskovec, J. Ogb-lsc: A large-scale challenge for machine learning on graphs. ar Xiv preprint ar Xiv:2103.09430, 2021. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. Klicpera, J., Weißenberger, S., and Günnemann, S. Diffusion improves graph learning. In Advances in neural information processing systems, 2019. Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B., Haque, I., Beery, S. M., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C., and Liang, P. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning (ICML), pp. 5637 5664, 2021. Leveque, R. J. Numerical methods for conservation laws, volume 214. Springer, 1992. Li, Z., Wellawatte, G. P., Chakraborty, M., Gandhi, H. A., Xu, C., and White, A. D. Graph neural network based coarse-grained mapping prediction. Chemical science, 11 (35):9524 9531, 2020. Lovász, L. and Szegedy, B. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96(6):933 957, 2006. Ma, J., Deng, J., and Mei, Q. Subgroup generalization and fairness of graph neural networks. In Advances in Neural Information Processing Systems, 2021. Supercharging Graph Transformers with Advective Diffusion Medvedev, G. S. The nonlinear heat equation on dense graphs and graph limits. SIAM Journal on Mathematical Analysis, 46(4):2743 2766, 2014. Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, pp. 4602 4609, 2019. Papp, P. A., Martinkus, K., Faber, L., and Wattenhofer, R. Dropgnn: Random dropouts increase the expressiveness of graph neural networks. In Advances in Neural Information Processing Systems, pp. 21997 22009, 2021. Patané, G. Laplacian spectral distances and kernels on 3d shapes. Pattern Recognition Letters, 47:102 110, 2014. Rampásek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. In Advances in Neural Information Processing Systems, pp. 5998 6008, 2022. Romeny, B. M. H. Geometry-driven diffusion in computer vision, volume 1. Springer Science & Business Media, 2013. Rossi, E., Kenlay, H., Gorinova, M. I., Chamberlain, B. P., Dong, X., and Bronstein, M. M. On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features. In Learning on Graphs Conference, pp. 11 1. PMLR, 2022. Rozemberczki, B., Allen, C., and Sarkar, R. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2), 2021. Rusch, T. K., Chamberlain, B. P., Mahoney, M. W., Bronstein, M. M., and Mishra, S. Gradient gating for deep multi-rate learning on graphs. In International Conference on Learning Representations, 2023. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE transactions on neural networks, 20(1):61 80, 2008. Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Snijders, T. A. and Nowicki, K. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of classification, 14(1):75 100, 1997. Thorpe, M., Xia, H., Nguyen, T., Strohmer, T., Bertozzi, A. L., Osher, S. J., and Wang, B. GRAND++: graph neural diffusion with a source term. In International Conference on Learning Representations (ICLR), 2022. Topping, J., Giovanni, F. D., Chamberlain, B. P., Dong, X., and Bronstein, M. M. Understanding over-squashing and bottlenecks on graphs via curvature. In International Conference on Learning Representations, 2022. Van Loan, C. The sensitivity of the matrix exponential. SIAM Journal on Numerical Analysis, 14(6):971 981, 1977. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998 6008, 2017. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018. Wu, F., Jr., A. H. S., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Q. Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861 6871, 2019. Wu, Q., Zhang, H., Yan, J., and Wipf, D. Handling distribution shifts on graphs: An invariance perspective. In International Conference on Learning Representations, 2022a. Wu, Q., Zhao, W., Li, Z., Wipf, D., and Yan, J. Nodeformer: A scalable graph structure learning transformer for node classification. In Advances in Neural Information Processing Systems, 2022b. Wu, Q., Yang, C., Zhao, W., He, Y., Wipf, D., and Yan, J. Difformer: Scalable (graph) transformers induced by energy constrained diffusion. In International Conference on Learning Representations, 2023. Wu, Q., Nie, F., Yang, C., Bao, T., and Yan, J. Graph outof-distribution generalization via causal intervention. In The Web Conference, pp. 850 860, 2024a. Wu, Q., Nie, F., Yang, C., and Yan, J. Learning divergence fields for shift-robust graph representations. In International Conference on Machine Learning, 2024b. Wu, Q., Wipf, D., and Yan, J. Neural message passing induced by energy-constrained diffusion. ar Xiv preprint ar Xiv:2409.09111, 2024c. Wu, Z., Jain, P., Wright, M. A., Mirhoseini, A., Gonzalez, J. E., and Stoica, I. Representing long-range context for graph neural networks with global attention. In Advances in Neural Information Processing Systems, 2021. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. Supercharging Graph Transformers with Advective Diffusion Xu, K., Li, J., Zhang, M., Du, S. S., Kawarabayashi, K., and Jegelka, S. How neural networks extrapolate: From feedforward to graph neural networks. In International Conference on Learning Representations (ICLR), 2021. Yang, C., Wu, Q., Wang, J., and Yan, J. Graph neural networks are inherently good generalizers: Insights by bridging gnns and mlps. In International Conference on Learning Representations, 2023. Yang, C., Wu, Q., Wipf, D., Sun, R., and Yan, J. How graph neural networks learn: Lessons from training dynamics. In International Conference on Machine Learning, 2024. Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T. Do transformers really perform bad for graph representation? In Advances in Neural Information Processing Systems, 2021. Zhang, X., Wang, L., Helwig, J., Luo, Y., Fu, C., Xie, Y., Liu, M., Lin, Y., Xu, Z., Yan, K., et al. Artificial intelligence for science in quantum, atomistic, and continuum systems. ar Xiv preprint ar Xiv:2307.08423, 2023. Zhao, K., Kang, Q., Song, Y., She, R., Wang, S., and Tay, W. P. Graph neural convection-diffusion with heterophily. In International Joint Conference on Artificial Intelligence, 2023. Supercharging Graph Transformers with Advective Diffusion A. Connection between Diffusion Equations and Message Passing In this section, we provide a systematically introduction on the fundamental connections between graph diffusion equations and neural message passing, as supplementary technical background for our analysis and methodology presented in the main text. Consider graph diffusion equations of the generic form t = (C(Z(t); A) I)Z(t), 0 t T, with initial conditions Z(0) = ϕenc(X). (13) As demonstrated by existing works, e.g., (Chamberlain et al., 2021a), using finite-difference numerical schemes for solving Eqn. 13 would induce the message passing neural networks of various forms. The latter is recognized as the common paradigm in modern graph neural networks and Transformers whose layer-wise updating aggregates the embeddings of other nodes to compute the embeddings for the next layer. A.1. Graph Neural Networks as Local Diffusion Consider the explicit Euler s scheme as the commonly used finite-difference method for approximately solving the differential equations, and Eqn. 13 will induce the discrete iterations with step size τ: Z(k+1) Z(k) τ (C(Z(k); A) I)Z(k). (14) With some re-arranging we have Z(k+1) = (1 τ)Z(k) + τC(Z(k); A)Z(k), (15) with the initial states Z(0) = ϕenc(X). The above updating equation gives one-layer update through residual connection and propagation with C(Z(k); A). There are some well-known graph neural network architectures that can be derived with different instantiations of the coupling matrix. Simplifying Graph Convolution (SGC). If one considers C(Z(k); A) = A = D 1/2AD 1/2, then we will get the one-layer updating rule: Z(k+1) = (1 τ)Z(k) + τD 1/2AD 1/2Z(k). (16) This can be seen as one-layer propagation of SGC (Wu et al., 2019) with residual connection, and when τ = 1 it becomes exactly the SGC layer. Since SGC model does not involve feature transformation layers and non-linearity throughout the message passing, one often uses a pre-computed propagation matrix for one-step convolution that is much faster than the multi-layer convolution: Z(K) = PKZ(0), P = (1 τ)I + τD 1/2AD 1/2. (17) Graph Convolution Networks (GCN). The GCN network inserts feature transformation layers in-between the propagation layers. This can be achieved by considering K stacked piece-wise diffusion equations, where the k-th dynamics is given by the differential equation with time boundaries: t = (C I)Z(t; k), t [tk 1, tk], with initial conditions Z(tk 1; k) = ϕ(k) int(Z(tk 1; k 1)), (18) where ϕ(k) int denotes the node-wise feature transformation of the k-th layer. Assume C = D 1/2AD 1/2. Then consider one-step feed-forward of the explicit Euler scheme for Eqn. 18, and one can obtain the updating rule at the k-th layer: Z(k+1) = ϕ(k+1) int (1 τ)Z(k) + τD 1/2AD 1/2Z(k) . (19) This corresponds to one GCN layer (Kipf & Welling, 2017) if one considers ϕ(k+1) int as a fully-connected neural layer with Re LU activation and simply sets τ = 1. High-Order Propagation. Besides the explicit numerical scheme, one can also utilize the implicit scheme and multi-step schemes (e.g., Runge-Kutta) for solving the diffusion equation, and the induced updating form will involve high-order information (Chamberlain et al., 2021a). Supercharging Graph Transformers with Advective Diffusion A.2. Graph Transformers as Non-Local Diffusion The original architectures of Transformers (Vaswani et al., 2017) involve self-attention layers as the key module, where the attention measures the pairwise influence between arbitrary token pairs in the input. There are recent works, e.g., (Dwivedi & Bresson, 2020; Ying et al., 2021; Wu et al., 2021; Rampásek et al., 2022; Wu et al., 2022b) transferring the Transformer architectures originally designed for sequence inputs into graph-structured data, and the attention is computed for arbitrary node pairs in the graph, which can be seen as a counterpart of non-local diffusion (Wu et al., 2023; 2024c). In specific, the coupling matrix allows non-zero entries for arbitrary location pairs and can be instantiated as a global attention network. Then using the explicit Euler s scheme as Eqn. 15 we can obtain the self-attention propagation layer of common Transformers: Z(k+1) = (1 τ)Z(k) + τC(k)Z(k), c(k) uv = η(z(k) u , z(k) v ) P w V η(z(k) u , z(k) w ) . (20) For obtaining the fully-connected layers and non-linear activations adopted in Transformers, one can inherit the spirit of GCN and extend the diffusion model to K piece-wise equations as Eqn. 18. B. Proofs for Technical Results B.1. Proof for Theorem 3.1 According to the data generation hypothesis in Fig. 2, for given node latents Uu s, we can decompose the joint distribution into (we omit all conditions on U for brevity) p(X, A, Y |E) = p(X|E)p(A|E)p(Y |A, E). (21) Also, by definition in Sec. 3.1 we have p(X|E = Etr) = p(X|E = Ete), (22) p(Y |A, E = Etr) = p(Y |A, E = Ete). (23) Therefore we have p(X, A, Y |E) = p(X)p(A|E)p(Y |A). We next consider the gap between R(Γθ; Etr) and R(Γθ; Ete): |R(Γθ; Ete) R(Γθ; Etr)| = E(X ,A ,Y ) p(X,A,Y |E=Ete)[l(Γθ(X , A ), Y )] E(X,A,Y) p(X,A,Y |E=Etr)[l(Γθ(X, A), Y)] = EX p(X),A p(A|E=Ete),Y p(Y |A=A ))[l(Γθ(X , A ), Y )] EX p(X),A p(A|E=Etr),Y p(Y |A=A))[l(Γθ(X, A), Y)] EX p(X),A p(A|E=Ete),Y p(Y |A=A ))[l(Γθ(X , A ), Y )] EX p(X),A p(A|E=Etr),A p(A|E=Ete),Y p(Y |A=A ))[l(Γθ(X, A), Y )] + EX p(X),A p(A|E=Etr),A p(A|E=Ete),Y p(Y |A=A ))[l(Γθ(X, A), Y )] EX p(X),A p(A|E=Etr),Y p(Y |A=A))[l(Γθ(X, A), Y)] = EX p(X),A p(A|E=Etr),A p(A|E=Ete),Y p(Y |A=A ))[l(Γθ(X, A ), Y ) l(Γθ(X, A), Y )] + EX p(X),A p(A|E=Etr),Y p(Y |A=A),A p(A|E=Ete),Y p(Y |A=A ))[l(Γθ(X, A), Y ) l(Γθ(X, A), Y)] EX p(X),A p(A|E=Etr),A p(A|E=Ete),Y p(Y |A=A )) |l(Γθ(X, A ), Y ) l(Γθ(X, A), Y )| + EX p(X),A p(A|E=Etr),Y p(Y |A=A),A p(A|E=Ete),Y p(Y |A=A )) |l(Γθ(X, A), Y ) l(Γθ(X, A), Y)| . Moreover, due to the Lipschitz continuity of l and ϕdec, we have |l(Γθ(X, A ), Y ) l(Γθ(X, A), Y )| L1 Z(T; A ) Z(T; A) 2 , (25) |l(Γθ(X, A), Y ) l(Γθ(X, A), Y)| L2 Y Y 2 , (26) where L1 and L2 denote the Lipschitz constants. Combing Eqn. 25 and Eqn. 26 with Eqn. 24, we have |R(Γθ; Ete) R(Γθ; Etr)| L1 EA p(A|Etr),A p(A|Ete) [ Z(T; A ) Z(T; A) 2] + L2 E(A,Y) p(A,Y |Etr),(A ,Y ) p(A,Y |Ete) [ Y Y 2] . (27) The conclusion for the main theorem can be obtained via combining Eqn. 27 and Eqn. 9 using the triangle inequality. Supercharging Graph Transformers with Advective Diffusion B.2. Proof for Theorem 3.2 For the advective diffusion equation with the coupling matrix C pre-computed by attention network η(zu(0), zv(0)) and fixed velocity V = D 1/2AD 1/2, we have its closed-form solution Z(t) = e (I C βV)t Z(0), t 0. (28) To prove the perturbation bound w.r.t. the change of graph structures, we introduce the following lemma. Lemma B.1. Let X, E Rn n, and let || || be a submultiplicative matrix norm. Suppose there exist constants M 1, ω 0 such that for all Y Rn n, ||e Y|| Meω Y . Then the following perturbation bound holds: e X+E e X E M 2 eω( X + E ). (29) Proof. Define path X(s) := X + s E for s [0, 1]. Then: e X+E e X = Z 1 d dse X(s) ds. (30) Using the integral form of the derivative of the matrix exponential (Van Loan, 1977), we have: d dse X(s) = Z 1 0 e(1 θ)X(s)EeθX(s) dθ. (31) e X+E e X = Z 1 0 e(1 θ)X(s)EeθX(s) dθ ds. (32) Taking norms and applying submultiplicativity: e X+E e X Z 1 0 e(1 θ)X(s) E eθX(s) dθds. (33) Using the growth bound assumption: e(1 θ)X(s) Meω(1 θ) X(s) , eθX(s) Meωθ X(s) . (34) Multiplying these we have: e(1 θ)X(s) eθX(s) M 2eω X(s) . (35) Note that X(s) = X + s E X + E , so: e X+E e X E M 2 Z 1 0 eω X+s E dθds E M 2 eω( X + E ). (36) The existence of M and ω is guaranteed for every consistent matrix norm (Van Loan, 1977) such as spectral norm 2 considered in our analysis. Let L = I C βV and L = I C βV . We then apply the above lemma that holds even if L and L are not commutable: e L T e LT 2 M 2T L L 2 eωT L 2 eωT L L 2. (37) For spectral norm || ||2 in our case, the above result holds for M = 1 and ω = 1. We next prove the bound for the last term of Eqn. 37 by construction. Notice that the initial states are given by the encoder MLP: Z(0) = ϕenc(X). According to our data generation hypothesis in Fig. 2, we know that node embeddings are generated from the latents of each node (we use uu to denote the realization of Uu), i.e., xu = g(uu; W) and the graph adjacency is generated through a pair-wise function auv = h(uu, uv; W, E). Since g is injective, we assume g 1 as its inverse mapping. Supercharging Graph Transformers with Advective Diffusion We define by η ϕenc the function composition of η and ϕenc that establishes a mapping from input node features X to the attention-based coupling C. According to the universal approximation results that hold for MLPs on the compact set (Hornik et al., 1989), we can construct a mapping induced by η ϕenc to obtain a propagation matrix in the form of C = C (β +ϵ)V, where C is independent from A and ϵ > 0 is an arbitrary small number. To be specific, the construction of the mapping can be achieved by η ϕenc = m h g 1: g 1 maps the input feature xu to uu; h maps (uu, uv) to auv; m maps auv to cuv, where cuv denotes the (u, v)-th entry of C. Then consider the difference of node representations under topological shifts and we have (C +βV ) (C+βV) 2 = ϵ O( A 2). Since A 2 is bounded, for any positive integer m, there exists ϵ > 0 such that exp (ϵ A 2) A m 2 . Therefore, we have the conclusion: e||L L||2 = e (C +βV ) (C+βV) 2 O( A m 2 ). (38) The theorem can be concluded by combining the result of Eqn. 37. B.3. Proof for Corollary 3.3 The conclusion follows by combing the results of Theorem 3.1 and Theorem 3.2. B.4. Proof for Proposition 3.4 The diffusion equation with the constant coupling matrix C has a closed-form solution Z(t) = e (I C)t Z(0), t 0. To prove the proposition, we need to derive the bound of e (I C )T e (I C)T 2 for any C = C. According to the result (3.5) of (Van Loan, 1977) we have e (I C )T e (I C)T 2 T C C 2 e (I C)T 2e (C C)T 2. (39) Given the fact C C = A A = A, we have e (I C )T e (I C)T 2 = O( A 2 exp( A 2T)). (40) This gives rise to the conclusion that Z(T; A ) Z(T; A) 2 = O( A 2 exp( A 2T)), (41) and we conclude the proof for the proposition. B.5. Proof for Corollary 3.5 By combing the results of Theorem 3.1 and Proposition 3.4, we have Dood model(Γθ, Etr, Ete) = O(EA p(A|Etr),A p(A|Ete)[ Z(T; A ) Z(T; A) 2]) O(EA p(A|Etr),A p(A|Ete)[ A 2 exp( A 2T)]). (42) B.6. Extension with Feature Transformations The conclusion of Proposition 3.4 and Corollary 3.5 can be extended to the cases incorporating feature transformations and non-linear activation in-between propagation layers used in common GNNs, like GCN (Kipf & Welling, 2017). In particular, the diffusion model becomes the piece-wise diffusion equations with K dynamics components as defined by Eqn. 18: t = (C I)Z(t; k), t [tk 1, tk], with initial conditions Z(tk 1; k) = ϕ(k) int(Z(tk 1; k 1)), (43) Supercharging Graph Transformers with Advective Diffusion where ϕ(k) int denotes the node-wise feature transformation of the k-th layer. Based on this, can re-use the reasoning line of proofs for Proposition 3.4 to each component, and arrive at the exponential bound of node representation within the k-th dynamics: Z(tk; A , k) Z(tk; A, k) 2 = O( A 2 exp ( A 2(tk tk 1))). (44) By stacking the results for each component, one can obtain the variation magnitude of the node representation yielded by the whole trajectory Z(T; A ) Z(T; A) 2 = O( A 2 exp ( A 2T)). (45) C. Approximation Strategies for Diffusion PDE Solutions The closed-form solutions of linear diffusion equations often involve the form of matrix exponential e Lt, which is intractable for computing its exact value. There are many established techniques based on numerical approximations, e.g., series expansion, in this fundamental challenge. In our presented model in Sec. 4, we propose two implementation versions based on two approximation ways for handling the closed-form solution of the advective diffusion equations on graphs. Approximation with Linear Systems. One scalable scheme proposed by (Gallopoulos & Saad, 1992) is via the extension of the minimax Padé-Chebyshev theory to rational fractions (Golub & Van Loan, 1989). This approximation technique has been utilized by (Patané, 2014) as an effective and efficient method for spectrum-free computation of the diffusion distances in 3D shape analysis. In specific, the matrix exponential of the form e Lt is approximated by the combination of multiple matrix inverses: i=1 αi(L + θi I) 1, (46) where αi and θi can be pre-defined parameters (Gallopoulos & Saad, 1992). To unleash the capacity of neural networks, in Sec. 4, our model implementation (ADVDIFFORMER-I) extends this scheme to a multi-head network where each head contributes to propagation with independently parameterized attention networks. The matrix inverse is computed with the linear system solver that is available in common deep learning tools (e.g., Py Torch) and supports automatic differentiation. Approximation with Geometric Series. When the graph sizes become large, the matrix inverse can be computationally expensive. For better scalability, we can use the geometric series for approximation: (L + θi I) 1 = k=0 ( 1)kθ (k+1) i Lk k=0 ( 1)kθ (k+1) i Lk. (47) In this way, the matrix exponential can be approximately computed via a combination of finite series: k=0 ( 1)kθ (k+1) i Lk. (48) In our model, the closed-form solution for the PDE induces L = (I C βV), and the summation in Eqn. 48 can be expressed as a weighted sum of Pk = (C + βV)k for k = 0, , K. Our model implementation (ADVDIFFORMER-S) proposed in Sec. 4 generalizes the weighted sum to a one-layer neural network. D. Model Implementations and Algorithms In this section, we provide detailed and self-contained descriptions about our model architectures in Appendix D.1. Then in Appendix D.2, we discuss how to apply our model to various graph-structured data with additional input information. To make the presentation clear and focused on the model implementation side, we will re-define some notations that are originally defined in Sec. 4, where we formulate the model with the terminology of the PDE domain. D.1. Model Architectures The model takes a graph G = (V, E, X, A) as input, and output prediction in the downstream tasks. We assume the number of nodes in the graph |V| = N, node feature matrix X RN D and graph adjacency matrix A {0, 1}N N. We use D to denote the diagonal degree matrix of A. The normalized adjacency is denoted by A = D 1/2AD 1/2, and 1 is an all-one N-dimensional column vector. In this subsection, we assume G has no edge weight or edge feature for presentation, and with loss of generality, we will discuss how to incorporate these additional attributes in Appendix D.2. Supercharging Graph Transformers with Advective Diffusion D.1.1. INSTANTIATIONS AND PARAMETERIZATIONS Our model is comprised of three modules: the encoder ϕenc, the decoder ϕdec, and the propagation network in-between the first two. Encoder: The node features X = [xu]u V RN D are first mapped to embeddings in the latent space Z(0) = [z(0) u ]u V RN d via the encoder: Z(0) = ϕenc(X). The encoder ϕenc( ) is instantiated as a shallow MLP with non-linear activation (e.g., Re LU). Propagation: The propagation network converts the initial node embeddings Z(0) to the node representations Z = [zu]u V RN d (where Z(0) and Z are the re-defined counterparts of Z(0) and Z(T), respectively, presented in Sec. 4). The propagation network is implemented via a multi-head network with H heads involving the attention network η(h)( , ) and feature transformation network ϕ(h) F C( ). The latter is instantiated as a fully-connected layer WO,h, and the attention network is instantiated as a normalized dot-product positive similarity function: η(h)(z(0) u , z(0) v ) = 1 + WQ,hz(0) u WQ,hz(0) u 2 ! WK,hz(0) v WK,hz(0) v 2 Ch = {c(h) uv }, c(h) uv = η(h)(z(0) u , z(0) v ) P w V η(h)(z(0) u , z(0) w ) , where WQ,h Rd d and WK,h Rd d are trainable weights for query and key, respectively, of the h-th head. Then the node representations will be computed in different ways by two models. For ADVDIFFORMER-I, the node representations are calculated via Lh = (1 + θ)I Ch β A, h = 1, , H Zh = linsolver(Lh, Z(0)), h = 1, , H h=1 Zh WO,h, where WO,h Rd d is a trainable weight matrix. Alg. 1 summarizes the feed-forward computation of ADVDIFFORMER-I. For ADVDIFFORMER-S, the node representations are computed by Ph = Ch + β A, h = 1, , H Z(k) h = Ph Z(k 1) h , k = 1, K, h = 1, , H h=1 [Z(0) h , Z(1) h , , Z(K) h ]WO,h, where WO,h R(K+1)d d is a trainable weight matrix. To accelerate the computation of Eqn. 51, we can inherit the strategy used in (Wu et al., 2023) and alter the order of matrix products, which reduces the time and space complexity to O(N) (see Appendix D.1.2 for detailed illustration). Alg. 2 presents the feed-forward computation of ADVDIFFORMER-S that only requires O(N) algorithmic complexity. Decoder: The decoder ϕdec( ) transforms the node representations into prediction. Depending on the specific downstream tasks, the decoder can be implemented in different ways: (node-level prediction): ˆyu = MLP(zu) (graph-level prediction): ˆy = MLP(Sum Pooling({zu}u V)) (edge-level prediction): ˆyuv = MLP([zu, zv]). (52) In particular, the softmax activation is used for output in classification tasks. For training, we adopt standard loss functions, i.e., cross-entropy for classification and mean square loss for regression. Supercharging Graph Transformers with Advective Diffusion Algorithm 1 Feed-Forward of the Model ADVDIFFORMER-I. INPUT: Node feature matrix X and normalized adjacency matrix A. Z(0) = ϕenc(X) for h = 1, , H do ZQ,h = WQ,hz(0) u WQ,hz(0) u 2 u V , ZK,h = WK,hz(0) u WK,hz(0) u 2 u V Uh = 11 + ZQ,h(ZK,h) Ch = diag 1 (Uh1) Uh Lh = (1 + θ)I Sh β A Zh = linsolver(Lh, Z) Z = PH h=1 Zh WO,h OUTPUT: Node representations Z and predicted labels with ϕdec(Z). Algorithm 2 Feed-Forward of the Model ADVDIFFORMER-S (with O(N) complexity). INPUT: Node feature matrix X and normalized adjacency matrix A. Z(0) = ϕenc(X) for h = 1, , H do ZQ,h = WQ,hz(0) u WQ,hz(0) u 2 u V , ZK,h = WK,hz(0) u WK,hz(0) u 2 u V Nh = diag 1 N + ZQ,h (ZK,h) 1 Z(0) h = Z(0) for k = 1, , K do Z(k) h = Nh h 1 1 Z(k 1) h + ZQ,h (ZK,h) Z(k 1) h i + β AZ(k 1) h Zh = [Z0) h , Z(1) h , , Z(K) h ] Z = PH h=1 Zh WO,h OUTPUT: Node representations Z and predicted labels with ϕdec(Z). D.1.2. ACCELERATION OF ADVDIFFORMER-S WITH LINEAR COMPLEXITY We illustrate how to achieve the propagation of ADVDIFFORMER-S in Eqn. 51 with O(N) complexity. With the query and key matrices defined by ZQ,h = WQ,hz(0) u WQ,hz(0) u 2 u V and ZK,h = WK,hz(0) u WK,hz(0) u 2 u V , the attention matrix Ch in Eqn. 49 is computed by (in the matrix form used for implementation) Ch = diag 1 N + ZQ,h (ZK,h) 1 11 + ZQ,h (ZK,h) . (53) Computing the above result requires O(N 2) time and space complexity. Still, if we consider the feature propagation with Ch, we have Ch Z(k) h = diag 1 N + ZQ,h (ZK,h) 1 11 + ZQ,h (ZK,h) Z(k) h = diag 1 N + ZQ,h (ZK,h) 1 h 1 1 Z(k) h + ZQ,h (ZK,h) Z(k) h i , (54) where the equality is achieved by altering the order of matrix products. The above computation only requires O(N) time and space complexity. The feed-forward computation of ADVDIFFORMER-S with O(N) acceleration is summarized in Alg. 2. D.2. Applicability of Our Model In the main paper, we assume unweighted graphs without edge attribute features for model formulation. Without loss of generality, we next discuss how to extend our model to handle the edge weights and edge features. Edge Weights. For weighted graphs, the adjacency matrix A would become a real matrix where the entry auv denotes Supercharging Graph Transformers with Advective Diffusion the weight on the edge (u, v) E. In this situation, we still have the corresponding normalized adjacency A = D 1A or A = D 1/2AD 1/2, where D = diag([du]u V) and du = P v,(u,v) E auv. Our model implementations can be trivially generalized to this case by using A as the propagation matrix for local message passing. Edge Features. If the graph contains edge features, denoted by E = [euv](u,v) E R|E| D , we introduce an encoding layer WE RD d for mapping the edge features into embeddings in the latent space and then incorporate them with node embeddings. In specific, we first compute the edge-to-node signals: M = [mu]u V, mu = X v,(u,v) E Au,v WEeuv. (55) For ADVDIFFORMER-I, we can modify Eqn. 50 as Lh = (1 + θ)I Ch β A, Zh = linsolver Lh, (Z(0) + M) , h=1 Zh WO,h. For ADVDIFFORMER-S, we can modify Eqn. 51 to be Ph = Ch + β A, Z(k) = Ph(Z(k 1) + M), k = 1, K, h=1 [Z(0), Z(1), , Z(K)]WO,h. E. Experiment Details We supplement details for our experiments, regarding datasets, competitors, and implementations, for facilitating the reproducibility. E.1. Datasets The datasets we use for the experiments in Sec. 5 span diverse domains and learning tasks. We summarize the statistics and brief descriptions for each dataset in Table 3, with the detailed information presented in the following subsections. Table 3: Statistics and descriptions for experimental datasets. Dataset #Nodes #Edges #Graphs Train/Val/Test Split Task Metric Synthetic-h 1,000 14,064 - 32,066 12 SBM (Homophily) Node Regression RMSE Synthetic-d 1,000 7,785 - 13,912 12 SBM (Density) Node Regression RMSE Synthetic-b 1,000 14,073 - 59,936 12 SBM (Block Number) Node Regression RMSE Twitch 1,912 - 9,498 31,299 - 153,138 7 Geographic Domain Node Classification ROC-AUC Arxiv 169,343 1,166,243 1 Publication Time Node Classification Accuracy OGB-BACE 10 - 97 10 - 101 1,513 Molecular Scaffold Graph Classification ROC-AUC OGB-SIDER 1 - 492 0 - 505 1,427 Molecular Scaffold Graph Classification ROC-AUC DPPIN-nr 143 - 5,003 22 - 25,924 12 Protein Identification Method Node Regression RMSE DPPIN-er 143 - 5,003 22 - 25,924 12 Protein Identification Method Edge Regression RMSE DPPIN-lp 143 - 5,003 22 - 25,924 12 Protein Identification Method Link Prediction ROC-AUC HAM 8 - 25 7 - 29 1,987 Relative Molecular Mass Edge Classification Accuracy E.1.1. SYNTHETIC DATASETS The synthetic datasets used in Sec. 5.1 simulate the graph data generation in Sec. 3.1, where the topological distribution shifts are caused by the difference of environments across training and testing data. In specific, we generate graphs of Supercharging Graph Transformers with Advective Diffusion |V| = 1000 nodes, with the node features X, graph adjacency matrix A and labels Y generated by the following process. Each node u V is assigned with a scalar uu randomly sampled from the uniform distribution U[0, 1]. For the generation of node features X = [xu]u V, we instantiate the node-wise function g as a 2-layer MLP with Re LU activation and 4-dimensional output. Then the node feature xu is generated through xu = MLP(uu). For the generation of graph adjacency A = [auv]u,v V, we instantiate the pairwise function h as the stochastic block model (Snijders & Nowicki, 1997) which generates edges according to the intra-block edge probability (p1) and the inter-block edge probability (p2). We map the nodes into b blocks by the following rule: for node u V, we assign it to the k-th block if vu [ k 1 b ) (where 1 k b). Then the edge auv is randomly generated from a bernoulli distribution with p1 if u and v are in the same block, and p2 otherwise. For the generation of labels Y, we consider the regression tasks and each node has a label yu generated through an ensemble model of a 2-layer GCN and a 1-layer DIFFormer (without using the graph-based propagation) with random initializations: Y = gcn(U, A) + difformer(U, A), where U = [uu]u V. Using the above data generation, we create 12 graphs with the indices #1 #12, and use the graph #1 for training, the graph #2 for validation, and the graphs #3 #12 for testing. The topological distribution shifts are introduced in three different ways as described in Sec. 5.1, where in each case, the detailed configurations for p1, p2 and b are illustrated below. Homophily Shift: p1 = 0.1, b = 5 and p2 = 0.01 + 0.05 1 12 (i 1) for the graph #i. Density Shift: b = 5, p1 = 0.1 + 0.1 1 12 (i 1) and p2 = 0.01 + 0.1 1 12 (i 1) for the graph #i. Block Shift: p1 = 0.1, p2 = 0.01 and b = 5 + (i 1) for the graph #i. E.1.2. INFORMATION NETWORKS The citation network Arxiv provided by (Hu et al., 2020) consists of a single graph with 0.16M nodes, where each node represents a paper with the publication year (ranging from 1960 to 2020) and a subarea id (from 40 different subareas in total). The node attribute features are 128-dimensional obtained by averaging the word embeddings of the paper s title and abstract. The edges are given by the citation relationship between papers. The predictive task is to estimate the paper s subarea. We use the publication years to split the data: papers published before 2014 for training, within the range from 2014 to 2017 for validation, and on 2018/2019/2020 for testing. Since there is a single graph, to increase the difficulty of generalization, we consider the inductive setting: the testing nodes are not contained in the training graph. Table 5 demonstrates the dissimilar statistics for training/validation/testing graphs, manifesting the existence of topological shifts. Following the common practice, we use Accuracy as the evaluation metric. Table 4: Statistics for training/validation/testing graphs on Arxiv. There is a single citation network that augments with time evolving, and with the data splits in the inductive setting, the previous graph is contained by the subsequent one. Train (1960-2014) Valid (2015-2017) Test 1 (2018) Test 2 (2019) Test 3 (2020) # Target Nodes 41,125 49,816 29,799 39,711 8,892 # All Nodes 41,125 90,941 120,740 160,451 169,343 # All Edges 102,316 374,839 622,466 1,061,197 1,166,243 Max Degrees 275 3,036 6,251 12,006 13,161 Avg Degrees 4.98 8.24 10.31 13.23 13.77 Twitch (Rozemberczki et al., 2021) is comprised of seven dis-connected graphs, where each node represents a Twitch user and edges indicate the friendship. Each graph is collected from the social newtork in a particular region, including DE, ENGB, ES, FR, PTBR, RU and TW. The node features are multi-hot with 2,545 dimensions indicating the user s profile. The predictive task is to classify the gender of the user. The seven networks with sizes ranging from 2K to 9K have distinct structural characteristics (such as densities and maximum degrees) as observed by (Wu et al., 2022a). We therefore split the data according to the geographic information: use the network DE for training, ENGB for validation, and the remaining networks for testing. The evaluation metric is ROC-AUC for binary classification. Supercharging Graph Transformers with Advective Diffusion E.1.3. BIOLOGICAL PROTEIN INTERACTIONS DPPIN (Fu & He, 2022) contains 12 individual dynamic network datasets at different scales, and each dataset is a dynamic protein-protein interaction network that describes the protein-level interactions of yeast cells. Each graph dataset is obtained by one protein identification method and consists of 36 graph snapshots, wherein each node denotes a protein that has a sequence of 1-dimensional continuous features with 36 time stamps. This records the evolution of gene expression values within metabolic cycles of yeast cells. The edges in the graph are determined by co-expressed protein pairs at one time, and each edge is associated with a co-expression correlation coefficient. We consider the predictive tasks within each graph snapshot and ignore the temporal evolution between different snapshots. In specific, we use the graph topology of each snapshot as the observed graph adjacency A and use the gene expression values at the previous 10 time steps as node features X. On top of this, we consider three different predictive tasks: 1) node regression for gene expression value at the current time (measured by RMSE); 2) edge regression for predicting the co-expression correlation coefficient (measured by RMSE); 3) link prediction for identifying co-expressed protein pairs (measured by ROC-AUC). Given the fact that each graph dataset has distinct sizes (ranging from 143 to 5,003 nodes) and distributions of 3-cliques and 4-cliques (ranging from 0 to hundreds) (Fu & He, 2022), we consider the dataset-level data splitting and use 6/1/5 graph datasets for training/validation/testing, which introduces topological distribution shifts. E.1.4. MOLECULAR MAPPING OPERATOR GENERATION The Human Annotated Mappings (HAM) dataset (Li et al., 2020) consists of 1,206 molecules with expert annotated mapping operators, i.e., a representation of how atoms are grouped in a molecule. The latter segments the atoms of a molecule into groups of varying sizes. As an important step in molecular dynamics simulation, generating coarse-grained mapping operators aims to reproduce the mapping operators produced by experts. This task can be modeled as a graph segmentation problem (Li et al., 2020) which takes a molecule graph as input and outputs the labels for each edge that indicates if there is cut needed to partition the source and end atoms into different groups. For data splits, we calculate the relative molecular mass of each molecule using the RDKit package3, and rank the molecules with increasing mass. Then we use the first 70% molecules for training, the following 15% for validation, and the remaining for testing. This splitting protocol partitions molecules with different weights, and requires generalization from small molecules in the training set to larger molecules in the testing set. Table 5: The range of relative molecular mass for training/validation/testing molecules in HAM. Train Valid Test Relative Molecular Mass 108.18 273.34 273.34 311.14 311.14 762.94 E.2. Competitors In our experiments, we compare with peer encoder backbones for graph learning tasks. The competitors span three aspects: 1) classical GNNs, 2) diffusion-based GNNs, and 3) graph Transformers. We briefly introduce the competitors and illuminate their connections with our model. GCN (Kipf & Welling, 2017) is a popular model that propagates node embeddings over observed graphs for computing node representations, which can be seen as the discretized version of graph diffusion equations with feature transformations. GAT (Velickovic et al., 2018) introduces attention networks for computing pairwise weights for neighboring nodes in the graph and propagates node signals with adaptive strengths given by the attention weights. GAT can be seen as the discretized version of the graph diffusion equation with time-dependent coupling matrices. SGC (Wu et al., 2019) proposes to simplify the GCN architecture by removing the feature transformations in-between propagation layers, reducing multi-layer propagation to one-layer. This brings up significant acceleration for training and inference. SGC can be seen as the discretization of the linear diffusion equation on graphs. 3https://github.com/rdkit/rdkit Supercharging Graph Transformers with Advective Diffusion GDC (Klicpera et al., 2019) extends the graph convolution operator to graph diffusion convolution derived from the linear diffusion equation on graphs. We use its implementation version based on the heat kernel for diffusion coefficients. GRAND (Chamberlain et al., 2021a) proposes graph neural diffusion, a continuous PDE model, that generalizes manifold diffusion to graphs and then uses numerical schemes to solve the PDE. We compare with its linear version that implements the linear graph diffusion equation. A-DGNs (Gravina et al., 2023) is a stable graph neural architecture inspired by ODE on graphs that has provable capability to preserve long-range information between nodes and avoid gradient vanishing or explosion in training. CDE (Zhao et al., 2023) is a recently proposed continuous model derived from convection diffusion equations that is designed for addressing heterophilic graphs. Graph Trans (Wu et al., 2021) is a recently proposed Transformer for graph-structured data that satisfies the permutationinvariant property. The model architecture sequentially combines GNNs and Transformers in order, where the GNN can learn local, short-range structures and the Transformer can capture global, long-range relationships. Graph GPS (Rampásek et al., 2022) introduces a scalable and powerful Transformer model class for graph data and achieves state-of-the-art results on molecular property prediction benchmarks. We use its scalable implementation version with the Performer attentions (Choromanski et al., 2021). DIFFormer (Wu et al., 2023) is a scalable Transformer inspired by diffusion on graphs. The model is comprised of principled attention layers, which implements the diffusion iterations minimizing a global energy. The architecture integrates graph-based feature propagation and global attention in each layer. We use its version with simple diffusivity that only requires linear complexity and yields state-of-the-art results on some large-graph benchmarks. E.3. Implementation Details Computation Systems. All the experiments are run on NVIDIA 3090 with 24GB memory. The environment is based on Ubuntu 18.04.6, Cuda 11.6, Pytorch 1.13.0 and Pytorch Geometric 2.1.0. Evaluation Protocol. For all the experiments, we run the training and evaluation of each model with five independent trials, and report the mean and standard deviation results in our tables and figures. In each run, we train the model with a fixed budget of epochs and record the testing performance produced by the epoch where the model yields the best performance on validation data. Hyper-Parameters. We use the grid search for hyper-parameter tuning on the validation dataset with the searching space described below. For information networks, hidden size d {32, 64, 128}, learning rate {0.0001, 0.001}, head number H {1, 2, 4}, the weight for local message passing β {0.2, 0.5, 0.8, 1.0}, and the order of propagation (only used for ADVDIFFORMER-S) K {1, 2, 4}. For molecular datasets, hidden size d = 256, learning rate {0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005}, dropout {0.0, 0.1, 0.3, 0.5}, head number H {1, 2, 4}, the weight for local message passing β {0.5, 0.75, 1.0}, the coefficient for identity matrix (only used for ADVDIFFORMER-I) θ {0.5, 1.0}, and the order of propagation (only used for ADVDIFFORMER-S) K {1, 2, 3, 4}. For protein interaction networks, hidden size d {32, 64}, learning rate {0.01, 0.001, 0.0001}, head number H {1, 2, 4}, the weight for local message passing β {0.3, 0.5, 0.8, 1.0}, the coefficient for identity matrix (only used for ADVDIFFORMER-I) θ {0.5, 1.0}, and the order of propagation (only used for ADVDIFFORMER-S) K {1, 2, 3, 4}. F. Additional Experimental Results In this section, we supplement more experimental results including additional results for main experiments, ablation studies and hyper-parameter analysis. Supercharging Graph Transformers with Advective Diffusion 1 2 3 4 5 6 K Test 1 (2018) Test 2 (2019) Test 3 (2020) 1 2 3 4 5 6 K 1 2 3 4 5 6 K Figure 6: Model performance on Arxiv and DPPIN with different settings of K. The latter involves node regression (nr) and edge regression (er) tasks. F.1. Supplementary Results for Main Experiments In Table 6, we present the ROC-AUC for each graph of Twitch. In Fig. 7 and 8, we show the generated results for more testing cases of molecular mapping operators in HAM. Table 6: Result of ROC-AUC for each graph on Twitch where we use nodes in different networks to split the training, validation and testing data. Train (DE) Valid (ENGB) Test 1 (ES) Test 2 (FR) Test 3 (PTBR) Test 4 (RU) Test 5 (TW) MLP 75.26 1.49 63.48 0.15 65.19 0.37 62.25 0.28 65.01 0.19 54.92 0.33 58.23 0.13 GCN 69.55 0.34 60.76 0.21 63.75 0.44 61.56 0.56 63.26 0.42 54.51 0.21 55.72 0.28 GAT 69.28 1.14 59.80 0.42 62.81 1.16 60.65 0.92 63.13 1.25 53.80 0.27 55.31 0.94 SGC 71.68 0.33 61.98 0.07 65.12 0.15 63.06 0.12 64.14 0.19 55.17 0.06 56.83 0.20 GDC 80.73 1.69 62.14 0.30 66.33 0.25 60.70 0.51 64.21 0.23 56.60 0.24 58.97 0.37 GRAND 79.17 0.74 62.48 0.39 66.52 0.23 61.62 0.62 64.44 0.73 56.42 0.38 59.27 0.57 A-DGNs 78.91 0.52 61.52 0.34 65.82 0.21 60.59 0.56 63.49 0.63 55.74 0.32 58.31 0.53 CDE 80.21 0.35 62.51 0.21 65.62 0.17 60.93 0.55 63.92 0.57 55.79 0.31 58.42 0.42 Graph Trans 79.17 0.74 62.48 0.39 66.52 0.23 61.62 0.62 64.44 0.73 56.42 0.38 59.27 0.57 Graph GPS 74.49 1.35 63.40 0.31 66.85 0.32 63.74 0.37 65.03 0.58 56.39 0.39 58.63 0.83 DIFFormer 73.12 0.52 63.06 0.09 66.68 0.15 64.44 0.13 65.23 0.20 55.75 0.12 58.91 0.37 ADVDIFFORMER-S 75.46 0.28 63.53 0.14 66.78 0.14 63.35 0.10 65.68 0.06 56.27 0.06 60.48 0.21 F.2. Ablation Studies and Hyper-Parameter Analaysis We next conduct more analysis on our proposed model by ablation studies on some key components and investigating the impact of hyper-parameters. Diffusion and Advection. We conduct ablation studies on the advection term (i.e., the local message passing) and the diffusion term (i.e., the global attention). In Table 7 we report the results for ADVDIFFORMER-S on Arxiv, which shows that the two modules are indeed effective for producing superior generalization on node classification tasks. Table 7: Ablation studies for ADVDIFFORMER-S on Arxiv. Train (1960-2014) Valid (2015-2017) Test 1 (2018) Test 2 (2019) Test 3 (2020) ADVDIFFORMER 63.79 0.66 55.25 0.14 53.41 0.48 51.53 0.60 49.64 0.54 ADVDIFFORMER w/o diffusion 64.65 1.10 55.00 0.12 52.45 0.27 50.18 0.18 48.01 0.24 ADVDIFFORMER w/o advection 61.84 0.79 54.31 0.24 51.64 0.59 49.65 0.53 47.06 0.69 Impact of K. The hyper-parameter K (used for ADVDIFFORMER-S) controls the number of propagation orders in the model. In fact, the value of K would impact how the structural information is utilized by the model. If K is small, the model only utilizes the low-order structure, and large K enables the usage of high-order structural information. Fig. 6 presents the model performance on Arxiv and DPPIN with K ranging from 1 to 6. We observe that the optimal settings for K are different across cases, and using larger K can not necessarily yield better performance. This is because in these cases, the low-order structural information is informative enough for desired generalization. Supercharging Graph Transformers with Advective Diffusion Impact of θ. Finally, we study the impact of θ used for computing Lh in ADVDIFFORMER-I. Table 8 shows the performance of ADVDIFFORMER-I on DPPIN with different θ s. We found that using θ close to 1 can bring up stably good performance, which is consistently manifested by experiments on other cases. Still, if θ is too small, e.g., close to 0, it would sometimes lead to numerical instability. This is due to that, in such a case, the matrix Lh could become a singular matrix. Table 8: Testing accuracy of ADVDIFFORMER-I with different θ s in the edge regression task on DPPIN. θ 0 0.5 1.0 2.0 Accuracy 0.241 0.154 0.147 0.149 G. Current Limitations and Future Works The generalization analysis in the current work focuses on the data-generating mechanism as described in Fig. 2 which is inspired and generalized by the random graph model. While this mechanism can in principle reflect real-world data generation process in various graph-structured data, in the open-world regime, there could exist situations involving topological distribution shifts by diverse factors or their combination. Future works can extend our framework for such cases where inter-dependent data is generated with different causal mechanisms. Another future research direction lies in the instantiation of the diffusion and advection operators in our model. Besides our choice of MPNN architecture to implement the advection process, other possibilities include structural and positional embeddings. We leave this line of exploration for the future, along with the analysis for the generalization capabilities of more general (e.g., non-linear) versions of the advective diffusion equation and other architectural choices. Supercharging Graph Transformers with Advective Diffusion Ground Truth GCN (1.00) GAT (0.69) Graph GPS (0.92) Difformer (0.92)Ad Adv DIFFormer (1.00) Ground Truth GCN (0.22) GAT (0.22) Graph GPS (0.22) Difformer (0.22) Adv DIFFormer (1.00) Ground Truth GCN (1.00) GAT (0.89) Graph GPS (0.86) Difformer (0.86) Adv DIFFormer (1.00) Ground Truth GCN (0.54) GAT (0.87) Graph GPS (0.74) Difformer (0.74) Adv DIFFormer (0.90) Ground Truth GCN (0.70) GAT (0.70) Graph GPS (0.63) Difformer (0.60) Adv DIFFormer (0.83) Figure 7: Additional testing cases for molecular mapping operators generated by different models and the expert annotations (ground-truth). For each case, we report the score (the higher is better) that measures the closeness between the generated results and the expert annotations. Supercharging Graph Transformers with Advective Diffusion Ground Truth GCN (0.87) GAT (0.77) Graph GPS (0.85) Difformer (0.87) Adv DIFFormer (1.00) Ground Truth GCN (0.84) GAT (0.76) Graph GPS (0.70) Difformer (0.64) Adv DIFFormer (1.00) Ground Truth GCN (0.72) GAT (0.74) Graph GPS (0.56) Difformer (0.57) Adv DIFFormer (1.00) Ground Truth GCN (0.77) GAT (0.77) Graph GPS (0.77) Difformer (0.89) Adv DIFFormer (1.00) Ground Truth GCN (0.64) GAT (0.51) Graph GPS (0.51) Difformer (0.51) Adv DIFFormer (1.00) Figure 8: Additional testing cases for molecular mapping operators generated by different models and the expert annotations (ground-truth). For each case, we report the score (the higher is better) that measures the closeness between the generated results and the expert annotations.