# Beltrami Flow and Neural Diffusion on Graphs

Benjamin P. Chamberlain (Twitter Inc., bchamberlain@twitter.com), James Rowbottom (Twitter Inc.), Davide Eynard (Twitter Inc.), Francesco Di Giovanni (Twitter Inc.), Xiaowen Dong (University of Oxford), Michael M. Bronstein (Twitter Inc. and Imperial College London)

**Abstract.** We propose a novel class of graph neural networks based on the discretised Beltrami flow, a non-Euclidean diffusion PDE. In our model, node features are supplemented with positional encodings derived from the graph topology and jointly evolved by the Beltrami flow, producing simultaneously continuous feature learning and topology evolution. The resulting model generalises many popular graph neural networks and achieves state-of-the-art results on several benchmarks.

## 1 Introduction

The majority of graph neural networks (GNNs) are based on the message passing paradigm [30], wherein node features are learned by means of a non-linear propagation on the graph. Multiple recent works have pointed to the limitations of the message passing approach. These include limited expressive power [80, 95, 9, 7], the related oversmoothing problem [60, 62], and the bottleneck phenomenon [1, 93], which render such approaches inefficient, especially in deep GNNs. Multiple alternatives have been proposed, among which are higher-order methods [54, 7] and decoupling the propagation and input graphs by modifying the topology, often referred to as graph rewiring. Topological modifications can take different forms, such as graph sampling [32], kNN graphs [43], using the complete graph [86, 1], latent graph learning [89, 36], or multi-hop filters [92, 73]. However, there is no agreement in the literature on when and how to modify the graph, nor a single principled framework for doing so.

A somewhat underappreciated fact is that GNNs are intimately related to diffusion equations [16], a connection that was exploited in the early work of Scarselli et al. [76]. Diffusion PDEs have been historically important in computer graphics [83, 11, 51, 64], computer vision [13, 18, 6], and image processing [65, 82, 90, 85, 26, 12], where they created an entire trend of variational and PDE-based approaches. In machine learning and data science, diffusion equations underpin such popular manifold learning methods as eigenmaps [5] and diffusion maps [20], as well as the family of PageRank algorithms [63, 14]. In deep learning, differential equations are used as models of neural networks [16, 19, 25, 94, 71, 98] and for physics-informed learning [72, 22, 75, 21, 81, 47].

**Main contributions.** In this paper, we propose a novel class of GNNs based on the discretised non-Euclidean diffusion PDE in a joint positional and feature space, inspired by the Beltrami flow [82] used two decades ago in the image processing literature for edge-preserving image denoising. We show that the discretisation of the spatial component of the Beltrami flow offers a principled view of positional encoding and graph rewiring, whereas the discretisation of the temporal component can replace GNN layers with more flexible adaptive numerical schemes. Based on this model, we introduce Beltrami Neural Diffusion (BLEND), which generalises a broad range of GNN architectures and shows state-of-the-art performance on many popular benchmarks. In a broader perspective, our
approach explores new tools from PDEs and differential geometry that are less well known in the graph ML community.

## 2 Background

**Beltrami flow.** Kimmel et al. [40, 82, 39] considered images as 2-manifolds (parametric surfaces) $(\Sigma, g)$ embedded in some larger ambient space as $z(u) = (u, \alpha x(u)) \in \mathbb{R}^{d+2}$, where $\alpha > 0$ is a scaling factor, $u = (u_1, u_2)$ are the 2D positional coordinates of the pixels, and $x$ are the $d$-dimensional colour or feature coordinates (with $d = 1$ or $3$ for grayscale or RGB images, or $d = k^2$ when using $k \times k$ patches as features [12]). In these works, the image is evolved along the gradient flow of a functional $S[z, g]$ called the Polyakov action [68], which roughly measures the smoothness of the embedding.[^1] For images embedded in Euclidean space, with the functional $S$ minimised with respect to both the embedding $z$ and the metric $g$, one obtains the following PDE:

$$\frac{\partial z(u, t)}{\partial t} = \Delta_G z(u, t), \qquad z(u, 0) = z(u), \qquad t \geq 0, \tag{1}$$

with boundary conditions as appropriate. Here $\Delta_G$ is the Laplace-Beltrami operator, the Laplacian operator induced on $\Sigma$ by the Euclidean space we embed the image into. Namely, the embedding of the manifold allows us to pull back the Euclidean distance structure on the image: the distance between two nearby points $u$ and $u + du$ is given by

$$d\ell^2 = du^\top G(u)\, du = du_1^2 + du_2^2 + \alpha^2 \sum_{i=1}^{d} dx_i^2, \tag{2}$$

where $G = I + \alpha^2 (\nabla_u x(u))^\top \nabla_u x(u)$ is a $2 \times 2$ matrix called the Riemannian metric. The fact that the distance is a combination of the positional component (distance between pixels in the plane, $\|u - u'\|$) and the colour component (distance between the colours of the pixels, $\|x(u) - x(u')\|$) is crucial, as it allows edge-preserving image diffusion.

When dealing with images, the evolution of the first two components $(z_1, z_2) = u$ is a nuisance amounting to the reparametrisation of the manifold and can be ignored. For grayscale images (the case when $d = 1$ and $z = (u_1, u_2, x)$), this is done by projection along the dimension $z_3$, in which case the Beltrami flow takes the form of an inhomogeneous diffusion equation of $x$,

$$\frac{\partial x(u, t)}{\partial t} = \frac{1}{\sqrt{\det G(u, t)}}\, \mathrm{div}\!\left(\frac{\nabla x(u, t)}{\sqrt{\det G(u, t)}}\right). \tag{3}$$

*Figure 1: Two interpretations of the Beltrami flow: position-dependent bilateral kernel (top) and a Gaussian passed on the manifold (bottom).*

The diffusivity

$$\frac{1}{\sqrt{\det G}} = \frac{1}{\sqrt{1 + \alpha^2 \|\nabla x\|^2}}, \tag{4}$$

determining the speed of diffusion at each point, can be interpreted as an edge indicator: diffusion is weak across edges, where $\|\nabla x\| \gg 1$. The result is an adaptive diffusion [65] popular in image processing due to its ability to denoise images while preserving their edges. For cases with $d > 1$ (multiple colour channels), equation (3) is applied to each channel separately; however, the metric $G$ couples the channels, which results in their gradients becoming aligned [38].

**Special cases.** In the limit case $\alpha = 0$, equation (3) becomes the simple homogeneous isotropic diffusion $\partial_t x = \mathrm{div}(\nabla x) = \Delta x$, where $\Delta = \partial^2/\partial u_1^2 + \partial^2/\partial u_2^2$ is the standard Euclidean Laplacian operator. The solution is given in closed form as the convolution of the initial image and a Gaussian kernel with time-dependent variance,

$$x(u, t) = x(u, 0) \star \frac{1}{(4\pi t)^{d/2}}\, e^{-\|u\|^2 / 4t}, \tag{5}$$

and can be considered a simple linear low-pass filtering.

[^1]: Explicitly, by minimising the functional with respect to the embedding one finds Euler-Lagrange (EL) equations that can be used to dictate the evolution process of the embedding itself.
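As a concrete illustration of the flow (3)-(4), the following minimal NumPy sketch performs one explicit update step on a grayscale image. The finite-difference stencil, the periodic boundary handling, and the values of `alpha` and `tau` are illustrative assumptions of this sketch, not the schemes used in the cited works; with `alpha = 0` it reduces to the isotropic diffusion just described.

```python
import numpy as np

def beltrami_step(x, alpha=10.0, tau=0.05):
    """One explicit Euler step of the edge-preserving flow (3)-(4) for a grayscale image x.
    Minimal illustrative sketch: periodic boundaries, forward/backward differences."""
    # forward differences approximating the image gradient
    gx = np.roll(x, -1, axis=0) - x
    gy = np.roll(x, -1, axis=1) - x
    # diffusivity (4): close to 1 in flat regions, small across strong edges;
    # with alpha = 0, a == 1 and the update reduces to the isotropic heat equation
    a = 1.0 / np.sqrt(1.0 + alpha**2 * (gx**2 + gy**2))
    # divergence (backward differences) of the diffusivity-weighted gradient
    fx, fy = a * gx, a * gy
    div = (fx - np.roll(fx, 1, axis=0)) + (fy - np.roll(fy, 1, axis=1))
    return x + tau * a * div   # update along (3)

# img = np.random.rand(64, 64); smoothed = beltrami_step(img)
```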
In the limit $t \to \infty$, the image becomes constant and equal to the average colour.[^2] Another interpretation of the Beltrami flow is passing a Gaussian on the manifold (see Figure 1, bottom), which can locally be expressed as non-linear filtering with the bilateral kernel [85] dependent on the joint positional and colour distance (Figure 1, top),

$$x(u, t) = \frac{1}{(4\pi t)^{d/2}} \int_{\mathbb{R}^2} x(v, 0)\, e^{-\|u - v\|^2 / 4t}\, e^{-\alpha^2 \|x(u, 0) - x(v, 0)\|^2 / 4t}\, dv. \tag{6}$$

For $\alpha = 0$, the bilateral filter (6) reduces to a simple convolution with a time-dependent Gaussian.

## 3 Discrete Beltrami flow on graphs

We now develop the analogy of the Beltrami flow for graphs. We consider a graph to be a discretisation of a continuous structure (manifold), and show that the evolution of the feature coordinates in time amounts to message passing layers in GNNs, whereas the evolution of the positional coordinates amounts to graph rewiring, which is used in some GNN architectures.

### 3.1 Graph Beltrami flow

Let $G = (V = \{1, \ldots, n\}, E)$ be an undirected graph, where $V$ and $E$ denote the node and edge sets, respectively. We further assume node-wise $d$-dimensional features $x_i \in \mathbb{R}^d$ for $i = 1, \ldots, n$. Denote by $z_i = (u_i, \alpha x_i)$ the embedding of the graph in a joint space $\mathcal{C} \times \mathbb{R}^d$, where $\mathcal{C}$ is a $d'$-dimensional space with a metric $d_\mathcal{C}$ representing the node coordinates (for simplicity, we will assume $\mathcal{C} = \mathbb{R}^{d'}$ unless otherwise stated). We refer to $u_i$, $x_i$, and $z_i$ as the positional, feature, and joint coordinates of node $i$, respectively, and arrange them into the matrices $\mathbf{U}$, $\mathbf{X}$, and $\mathbf{Z}$ of sizes $n \times d'$, $n \times d$, and $n \times (d' + d)$.

For images, the Beltrami flow amounts to evolving the embedding $z$ along $\mathrm{div}(a(z)\nabla z)$, with $a$ a diffusivity map.[^3] Thus, we consider the graph Beltrami flow to be the discrete diffusion equation

$$\frac{\partial z_i(t)}{\partial t} = \sum_{j : (i,j) \in E'} a(z_i(t), z_j(t))\,(z_j(t) - z_i(t)), \qquad z_i(0) = z_i, \qquad i = 1, \ldots, n, \quad t \geq 0. \tag{7}$$

We motivate our definition as follows: $g_{ij} = z_j - z_i$ and $d_i = \sum_{j : (i,j) \in E'} g_{ij}$ are the discrete analogues of the gradient $\nabla z$ and the divergence $\mathrm{div}(g)$, both with respect to a graph $(V, E')$ that can be interpreted as the numerical stencil for the discretisation of the continuous Laplace-Beltrami operator in (3). Note that $E'$ can be different from the input $E$ (referred to as "rewiring"). As discussed in Section 3.3, most GNNs use $E' = E$ (the input graph is used for diffusion, no rewiring). Alternatively, the positional coordinates of the nodes can be used to define a new graph topology, either with $E'(\mathbf{U}) = \{(i, j) : d_\mathcal{C}(u_i, u_j) < r\}$ for some radius $r > 0$, or using $k$ nearest neighbours. This new rewiring is either precomputed using the input positional coordinates (i.e., $E' = E'(\mathbf{U}(0))$) or updated throughout the diffusion (i.e., $E'(t) = E'(\mathbf{U}(t))$). Therefore, (7) can be compactly rewritten as $\frac{\partial z_i(t)}{\partial t} = \mathrm{div}\,(a(z(t))\,\nabla z_i(t))$.

The function $a$ is the diffusivity controlling the diffusion strength between nodes $i$ and $j$ and is assumed to be normalised: $\sum_{j : (i,j) \in E'} a(z_i, z_j) = 1$. The dependence of the diffusivity on the embedding $z$ matches the smooth PDE analysed in e.g. [82, Section 4.2] and is consistent with the form of attention mechanism used in e.g. [88, 86].

[^2]: Assuming appropriate boundary conditions.
[^3]: Note that in equation (3) the diffusivity function $a$ coincides with $(\det G(u, t))^{-\frac{1}{2}}$. Similarly to [82, Section 4.2], we have neglected the extra term $1/a$ appearing in (3).
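The following NumPy sketch evaluates the right-hand side of (7) for a small graph. It uses a bilateral-style (Gaussian) diffusivity purely as an illustrative stand-in for the learned attention introduced in Section 4; the function name, `sigma`, and the normalisation scheme are assumptions of this sketch.

```python
import numpy as np

def graph_beltrami_rhs(Z, edges, sigma=1.0):
    """dZ/dt in (7). Z is an n x (d' + d) array of joint coordinates z_i = (u_i, alpha * x_i);
    edges is an iterable of directed pairs (i, j) defining the (possibly rewired) graph E'."""
    n = Z.shape[0]
    # unnormalised bilateral-style weights; BLEND learns this diffusivity as attention
    w = {(i, j): np.exp(-np.sum((Z[i] - Z[j]) ** 2) / sigma ** 2) for (i, j) in edges}
    deg = np.zeros(n)
    for (i, j), wij in w.items():
        deg[i] += wij
    dZ = np.zeros_like(Z)
    for (i, j), wij in w.items():
        a_ij = wij / deg[i]              # normalised so that sum_j a(z_i, z_j) = 1
        dZ[i] += a_ij * (Z[j] - Z[i])    # sum over neighbours of a(z_i, z_j)(z_j - z_i)
    return dZ
```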
In matrix form, we can also rewrite (7) as

$$\frac{\partial}{\partial t}\,(\mathbf{U}(t), \mathbf{X}(t)) = \big(\mathbf{A}(\mathbf{U}(t), \mathbf{X}(t)) - \mathbf{I}\big)\,(\mathbf{U}(t), \mathbf{X}(t)), \qquad \mathbf{U}(0) = \mathbf{U}, \quad \mathbf{X}(0) = \alpha\mathbf{X}, \quad t \geq 0, \tag{8}$$

where we emphasise the evolution of both the positional and feature components, coupled through the matrix-valued function $\mathbf{A}$ representing the diffusivity,

$$a_{ij}(t) = \begin{cases} a\big((u_i(t), x_i(t)), (u_j(t), x_j(t))\big) & (i, j) \in E'(\mathbf{U}(t)) \\ 0 & \text{otherwise.} \end{cases}$$

The graph Beltrami flow produces an evolution of the joint positional and feature coordinates, $\mathbf{Z}(t) = (\mathbf{U}(t), \mathbf{X}(t))$. In Section 3.3 we will show how the evolution of the feature coordinates $\mathbf{X}(t)$ results in feature diffusion or message passing on the graph, the core of GNNs.

As noted in Section 2, in the smooth case the Beltrami flow is obtained as the gradient flow of an energy functional minimised with respect to both the embedding and the metric on the surface (an image). When the embedding takes values in Euclidean space, this leads to equations of the form (3) with no channel mixing and an exact form of the diffusivity determined by the pull-back $G$ of the Euclidean metric. To further motivate our approach, it is tempting to investigate whether a similar conclusion can be attained here. Although in the discrete case the operation of pull-back is not well-defined, we are able to derive that the gradient flow of a modified graph Dirichlet energy gives rise to an equation of the form (7). We note, though, that the gradient flow does not recover the exact form of the diffusivity implemented in this paper. This is not a limitation of the theory and should be expected: by requiring the gradient flow to avoid channel mixing and imitate the image analogy in [82] and by inducing a discrete pull-back condition, we are imposing constraints on the problem. We leave the theoretical implications for future work and refer to the Supplementary Materials for a more thorough discussion, including definitions and proofs.

**Theorem 1.** Under structural assumptions on the diffusivity, the graph Beltrami flow (7) is the gradient flow of the discrete Polyakov functional.

### 3.2 Numerical solvers

**Explicit vs implicit schemes.** Equation (7) is solved numerically, which in the simplest case is done by replacing the continuous time derivative with a forward time difference:

$$\frac{z_i^{(k+1)} - z_i^{(k)}}{\tau} = \sum_{j : (i,j) \in E'(\mathbf{U}^{(k)})} a\big(z_i^{(k)}, z_j^{(k)}\big)\,\big(z_j^{(k)} - z_i^{(k)}\big). \tag{9}$$

Here $k$ denotes the discrete time index (iteration) and $\tau$ is the time step (discretisation parameter). Rewriting (9) compactly in matrix-vector form leads to the explicit Euler scheme

$$\mathbf{Z}^{(k+1)} = \big(\mathbf{I} + \tau(\mathbf{A}^{(k)} - \mathbf{I})\big)\,\mathbf{Z}^{(k)} = \mathbf{Q}^{(k)}\mathbf{Z}^{(k)}, \tag{10}$$

where $a_{ij}^{(k)} = a(z_i^{(k)}, z_j^{(k)})$ and the matrix $\mathbf{Q}^{(k)}$ (the diffusion operator) is given by

$$q_{ij}^{(k)} = \begin{cases} 1 - \tau \sum_{l : (i,l) \in E'} a_{il}^{(k)} & i = j \\ \tau a_{ij}^{(k)} & (i, j) \in E'(\mathbf{U}^{(k)}) \\ 0 & \text{otherwise.} \end{cases}$$

The solution to the diffusion equation is computed by applying scheme (10) multiple times in sequence, starting from some initial $\mathbf{Z}^{(0)}$. It is explicit because the update of $\mathbf{Z}^{(k+1)}$ is done directly by the application of the diffusion operator $\mathbf{Q}^{(k)}$ to $\mathbf{Z}^{(k)}$ (as opposed to implicit schemes of the form $\mathbf{Z}^{(k)} = \mathbf{Q}^{(k)}\mathbf{Z}^{(k+1)}$, arising from backward time differences, that require inversion of the diffusion operator [91]).

**Multi-step and adaptive schemes.** Higher-order approximations of the temporal derivative amount to using intermediate fractional steps, which are then linearly combined. Runge-Kutta (RK) [74, 44], ubiquitously used in numerical analysis, is a classical family of explicit numerical schemes that includes Euler as a particular case.
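A minimal dense NumPy sketch of the explicit and implicit single-step schemes described above, assuming a row-normalised diffusivity matrix `A` with zero diagonal; function names are illustrative, and sparse operators would be used in practice.

```python
import numpy as np

def diffusion_operator(A, tau):
    """Q = I + tau * (A - diag(row sums of A)), cf. the definition of Q^{(k)} after (10)."""
    return np.eye(A.shape[0]) + tau * (A - np.diag(A.sum(axis=1)))

def explicit_step(Z, A, tau=0.5):
    """Forward (explicit) Euler: Z^{(k+1)} = Q^{(k)} Z^{(k)}."""
    return diffusion_operator(A, tau) @ Z

def implicit_step(Z, A, tau=0.5):
    """Backward (implicit) Euler: solve (I - tau*(A - D)) Z^{(k+1)} = Z^{(k)};
    unconditionally stable, but requires a linear solve (inverting the operator)."""
    B = np.eye(A.shape[0]) - tau * (A - np.diag(A.sum(axis=1)))
    return np.linalg.solve(B, Z)
```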
The Dormand-Prince method (DOPRI) [24] is an RK method based on fifth- and fourth-order approximations, the difference between which is used as an error estimate guiding the time step size [78].

**Adaptive spatial discretisation and rewiring.** Many numerical PDE solvers also employ adaptive spatial discretisation. The choice of the stencil (mesh) for spatial derivatives is made based on the character of the solution at these points; in the simulation of phenomena such as shock waves, it is often desirable to use denser sampling in the respective regions of the domain, which can change in time. A class of techniques for adaptive rewiring of the spatial discretisation is known as Moving Mesh (MM) methods [33]. Interpreting the graph $E'$ in (7) as the numerical stencil for the discretisation of the continuous Laplace-Beltrami operator in (3), we can regard rewiring as a form of MM.

### 3.3 Relation to graph neural networks

Equation (9) has the structure of many GNN architectures of the attentional type [10], where the discrete time index $k$ corresponds to a (convolutional or attentional) layer of the GNN and multiple diffusion iterations amount to a deep GNN. In the diffusion formalism, the time parameter $t$ acts as a continuous analogy of the layers, in the spirit of neural differential equations [19]. Typical GNNs amount to explicit single-step (Euler) discretisation schemes, whereas our continuous interpretation can exploit more efficient numerical schemes.

**GNNs as instances of graph Beltrami flow.** The graph Beltrami framework leads to a family of graph neural networks that generalise many popular architectures (see Table 1). For example, GAT [88] can be obtained as a particular setting of our framework where the input graph is fixed ($E' = E$) and only the feature coordinates $\mathbf{X}$ are evolved. Equation (10) in this case becomes

$$x_i^{(k+1)} = x_i^{(k)} + \tau \sum_{j : (i,j) \in E} a\big(x_i^{(k)}, x_j^{(k)}\big)\,\big(x_j^{(k)} - x_i^{(k)}\big) \tag{11}$$

and corresponds to the update formula of GAT with a residual connection and the assumption of no non-linearity between the layers. The role of the diffusivity is played by a learnable parametric attention function, which is generally time-dependent: $a(z_i^{(k)}, z_j^{(k)}, k)$. This results in separate attention parameters per layer $k$, which can be learned independently. Our intentionally simplistic choice of a time-independent attention function amounts to parameter sharing across layers. We will show in Section 5.1 that this leads to a smaller model that is less likely to overfit.

Another popular architecture, MoNet [57], uses linear diffusion of the features with weights dependent on the structure of the graph expressed as "pseudo-coordinates", which can be cast as attention of the form $a(u_i, u_j)$. Transformers [87] can be interpreted as feature diffusion on a fixed complete graph with $E = V \times V$ [10]. Positional encoding (used in Transformers as well as in several recent GNN architectures [9, 27]) amounts to attention dependent on both $\mathbf{X}$ and $\mathbf{U}$, which allows the diffusion to adapt to the local structure of the graph; importantly, the positional coordinates $\mathbf{U}$ are precomputed. Similarly, Deep Sets [97] and PointNet [70] architectures can be interpreted as GNNs applied on a graph with no edges ($E = \emptyset$), where each node is treated independently of the others [10]. DIGL [43] performs graph rewiring as a pre-processing step using Personalised PageRank as positional coordinates, which are then fixed and not evolved.
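To make this rewiring concrete, the following NumPy sketch computes dense personalised PageRank positional coordinates (written out as equation (12) below) and builds the kNN graph $E'(\mathbf{U})$ from them. The dense matrix inverse and the row-normalisation choice are illustrative assumptions suitable only for small graphs, not the DIGL implementation.

```python
import numpy as np

def ppr_positional_encoding(W, beta=0.85):
    """Dense personalised PageRank matrix, cf. equation (12); W is the n x n adjacency matrix.
    Row i is used as the n-dimensional positional encoding u_i of node i."""
    n = W.shape[0]
    W_rw = W / W.sum(axis=1, keepdims=True)                # random-walk normalisation
    return (1.0 - beta) * np.linalg.inv(np.eye(n) - beta * W_rw)

def knn_rewire(U, k=16):
    """E'(U): connect each node to its k nearest neighbours in positional space."""
    D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                            # no self-loops
    neighbours = np.argsort(D, axis=1)[:, :k]
    return [(i, j) for i in range(U.shape[0]) for j in neighbours[i]]
```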
In the point cloud methods DGCNN [89] and DGM [36], the graph is constructed based on the feature coordinates $\mathbf{X}$ and rewired adaptively (in our notation, $E' = E'(\mathbf{X}(t))$).

| Method | Evolution | Diffusivity | Graph $(V, E')$ | Discretisation |
| --- | --- | --- | --- | --- |
| ChebNet | Features $\mathbf{X}$ | Fixed $a_{ij}$ | Fixed $E$ | Explicit, fixed step |
| GAT | Features $\mathbf{X}$ | $a(x_i, x_j)$ | Fixed $E$ | Explicit, fixed step |
| MoNet | Features $\mathbf{X}$ | $a(u_i, u_j)$ | Fixed $E$ | Explicit, fixed step |
| Transformer | Features $\mathbf{X}$ | $a((u_i, x_i), (u_j, x_j))$ | Fixed $E = V \times V$ | Explicit, fixed step |
| DeepSet/PointNet | Features $\mathbf{X}$ | $a(x_i)$ | Fixed $E = \emptyset$ | Explicit, fixed step |
| DIGL | Features $\mathbf{X}$ | $a(x_i, x_j)$ | Fixed $E(\mathbf{U})$ | Explicit, fixed step |
| DGCNN/DGM | Features $\mathbf{X}$ | $a(x_i, x_j)$ | Adaptive $E(\mathbf{X})$ | Explicit, fixed step |
| Beltrami | Positions $\mathbf{U}$ + Features $\mathbf{X}$ | $a((u_i, x_i), (u_j, x_j))$ | Adaptive $E(\mathbf{U})$ | Explicit adaptive step / Implicit |

*Table 1: GNN architectures interpreted as particular instances of our framework. Attentional variant.*

**Graph rewiring.** Multiple authors have recently argued in favour of decoupling the input graph from the graph used for diffusion. Such rewiring can take the form of graph sampling [32] to address scalability issues, data denoising [43], removal of information bottlenecks [1], or larger multi-hop filters [73]. The graph construction can also be made differentiable and a task-specific rewiring can be learned [89, 36]. The statement of Klicpera et al. [43] that "diffusion improves graph learning", leading to the eponymous paradigm (DIGL), can be understood as a form of diffusion on the graph connectivity independent of the features. Specifically, the authors used as node positional encoding the Personalised PageRank (PPR), which can be interpreted as the steady state of a diffusion process,

$$\mathbf{U}_{\mathrm{PPR}} = \sum_{k \geq 0} (1 - \beta)\beta^k\, \mathbf{W}_{\mathrm{RW}}^k = (1 - \beta)\big(\mathbf{I} - \beta\, \mathbf{W}_{\mathrm{RW}}\big)^{-1}, \qquad 0 < \beta < 1, \tag{12}$$

where $\mathbf{W}_{\mathrm{RW}}$ is the random walk matrix of the graph and $\beta \in (0, 1)$ is a parameter such that $1 - \beta$ represents the restart probability. The resulting positional encoding of dimension $d' = n$ can be used to rewire the graph by kNN sampling, which corresponds to using $E' = E'(\mathbf{U}_{\mathrm{PPR}})$ in our framework.

**Numerical schemes.** All the aforementioned GNN architectures can be seen as an explicit discretisation of equation (7) with fixed step size. On the other hand, our continuous diffusion framework offers the additional advantage of employing more efficient numerical schemes with adaptive step size. Graph rewiring of the form $E' = E'(\mathbf{U}(t))$ can be interpreted as adaptive spatial discretisation (a moving mesh method).

### 3.4 Extensions

*Figure 2: Graph Beltrami flow with hyperbolic positional coordinates.*

**Non-Euclidean geometry.** There are multiple theoretical and empirical arguments [52, 17] in favour of using hyperbolic spaces to represent real-life "small-world" graphs (in particular, scale-free networks can be obtained as kNN graphs in such spaces [8]). Our framework allows using a non-Euclidean metric $d_\mathcal{C}$ for the positional coordinates $\mathbf{U}$ (Figure 2). In Section 5 we show that hyperbolic positional encodings allow a significant reduction in model size with only a marginal degradation of performance.

**Time-dependent diffusivity.** The diffusivity function $a$, which we assumed time-independent and which leads to parameter sharing across layers (i.e., updates of the form $\mathbf{Z}^{(k+1)} = \mathbf{Q}(\mathbf{Z}^{(k)}, \theta)\mathbf{Z}^{(k)}$), can be made time-dependent, with updates of the form $\mathbf{Z}^{(k+1)} = \mathbf{Q}(\mathbf{Z}^{(k)}, \theta^{(k)})\mathbf{Z}^{(k)}$, where $\theta$ and $\theta^{(k)}$ denote shared and layer-dependent parameters, respectively.

**Onsager diffusion.** As we noted, the Beltrami flow diffuses each channel separately.
A more general variant of diffusion allowing for feature mixing is the Onsager diffusion [61] of the form $\frac{\partial \mathbf{Z}(t)}{\partial t} = \mathbf{Q}(\mathbf{Z}(t))\,\mathbf{Z}(t)\,\mathbf{W}(t)$, where the matrix-valued function $\mathbf{W}$ acts across the channels. GCN [42] can be regarded as a particular setting thereof, with updates of the form $\mathbf{X}^{(k+1)} = \mathbf{A}\mathbf{X}^{(k)}\mathbf{W}$.

**MPNNs.** Finally, we note that the Beltrami flow amounts to linear aggregation with non-linear coefficients, or the attentional flavour of GNNs [10]. The more general message passing flavour [30] is possible using a generic non-linear equation of the form $\frac{\partial \mathbf{Z}(t)}{\partial t} = \Psi(\mathbf{Z}(t))$.

## 4 BLEND: Beltrami Neural Diffusion

Beltrami Neural Diffusion (BLEND) is a novel class of graph neural network architectures derived from the graph Beltrami framework. We assume an input graph $G = (V, E)$ with $n$ nodes and $d$-dimensional node-wise features represented as a matrix $\mathbf{X}_{\mathrm{in}}$. We further assume a $d'$-dimensional positional encoding $\mathbf{U}_{\mathrm{in}}$ of the graph nodes. BLEND architectures implement a learnable joint diffusion process of $\mathbf{U}$ and $\mathbf{X}$ and run it for time $T$ to produce output node embeddings $\mathbf{Y}$:

$$\mathbf{Z}(0) = \big(\phi(\mathbf{U}_{\mathrm{in}}), \psi(\mathbf{X}_{\mathrm{in}})\big), \qquad \mathbf{Z}(T) = \mathbf{Z}(0) + \int_0^T \frac{\partial \mathbf{Z}(t)}{\partial t}\, dt, \qquad \mathbf{Y} = \xi(\mathbf{Z}(T)),$$

where $\phi$, $\psi$ are learnable positional and feature encoders and $\xi$ is a learnable decoder (possibly changing the output dimensions). Here the $\alpha$ in equations (2) and (8) is absorbed by $\psi$ and made learnable. The derivative $\frac{\partial \mathbf{Z}(t)}{\partial t}$ is given by the graph Beltrami flow equation (8), where the diffusivity function (attention) $a$ is also learnable. The choice of attention function depends on the geometry of the positional encoding; for Euclidean encodings we find that the scaled dot product attention [86] performs well, in which case

$$a(z_i, z_j) = \mathrm{softmax}\!\left(\frac{(\mathbf{W}_K z_i)^\top \mathbf{W}_Q z_j}{d_k}\right), \tag{13}$$

where $\mathbf{W}_K$ and $\mathbf{W}_Q$ are learned matrices and $d_k$ is a hyperparameter. A minimal illustrative sketch of this pipeline is given after Table 2 below.

## 5 Experimental results

In this section, we compare the proposed Beltrami framework to popular GNN architectures on standard node classification benchmarks and provide a detailed study of the choice of the positional encoding space. Additional experiments and implementation details, including runtimes and hyperparameter tuning, are given in the Supplementary Materials. The code is available at https://github.com/twitter-research/graph-neural-pde.

**Datasets.** In our experiments, we use the following datasets: Cora [56], Citeseer [77], Pubmed [58], Coauthor CS [79], Amazon Computers and Photo [55], and OGB-arxiv [35]. Since many works using the first three datasets rely on the Planetoid splits [96], we included them in Table 2, together with a more robust evaluation on 100 random splits with 20 random initialisations [79].

**Baselines.** We compare to the following GNN architectures: GCN [42], GAT [88], MoNet [57], and GraphSAGE [32], and recent ODE-based GNN models: CGNN [94], GDE [67], GODE [98], and two versions of LanczosNet [49]. We use two variants of our method: using the fixed input graph (BLEND) and using a kNN graph in the positional coordinates (BLEND-kNN).

### 5.1 Node Classification

| Method | Cora | Citeseer | Pubmed |
| --- | --- | --- | --- |
| GCN | 81.9 ± 0.8 | 69.5 ± 0.9 | 79.0 ± 0.5 |
| GAT | 82.8 ± 0.5 | 71.0 ± 0.6 | 77.0 ± 1.3 |
| MoNet | 82.2 ± 0.7 | 70.0 ± 0.6 | 77.7 ± 0.6 |
| GS-maxpool | 77.4 ± 1.0 | 67.0 ± 1.0 | 76.6 ± 0.8 |
| Lanczos | 79.5 ± 1.8 | 66.2 ± 1.9 | 78.3 ± 0.3 |
| AdaLanczos | 80.4 ± 1.1 | 68.7 ± 1.0 | 78.1 ± 0.4 |
| CGNN | 81.7 ± 0.7 | 68.1 ± 1.2 | 80.2 ± 0.3 |
| GDE | 83.8 ± 0.5 | 72.5 ± 0.5 | 79.9 ± 0.3 |
| GODE | 83.3 ± 0.3 | 72.4 ± 0.6 | 80.1 ± 0.3 |
| BLEND | 84.2 ± 0.6 | 74.4 ± 0.7 | 80.7 ± 0.7 |
| BLEND-kNN | 83.1 ± 0.8 | 73.3 ± 0.9 | 81.5 ± 0.5 |

*Table 2: Performance (test accuracy ± std) of different GNN models using Planetoid splits.*
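For concreteness, the following PyTorch sketch assembles the pieces of Section 4: positional and feature encoders, the attention diffusivity (13) as the right-hand side of the flow, integration to time $T$, and a decoder. It assumes the `torchdiffeq` package for the ODE solver; class and argument names are illustrative, and the per-node softmax loop is written for clarity rather than efficiency. It is a sketch, not the authors' released implementation.

```python
import torch
from torchdiffeq import odeint   # assumed solver package providing adaptive schemes such as dopri5

class BeltramiFlow(torch.nn.Module):
    """Right-hand side of the graph Beltrami flow (8) with the attention diffusivity (13)."""
    def __init__(self, edge_index, d_joint, d_k=16):
        super().__init__()
        self.edge_index = edge_index                      # 2 x |E'| tensor of (source, target) pairs
        self.WK = torch.nn.Linear(d_joint, d_k, bias=False)
        self.WQ = torch.nn.Linear(d_joint, d_k, bias=False)
        self.d_k = d_k

    def forward(self, t, z):                              # z: n x (d' + d) joint coordinates
        src, dst = self.edge_index
        logits = (self.WK(z)[dst] * self.WQ(z)[src]).sum(-1) / self.d_k
        a = torch.zeros_like(logits)
        for i in dst.unique():                            # softmax over each node's neighbourhood
            m = dst == i                                  # (explicit loop for clarity, not speed)
            a[m] = torch.softmax(logits[m], dim=0)
        # sum_j a(z_i, z_j)(z_j - z_i), accumulated per target node
        return torch.zeros_like(z).index_add_(0, dst, a.unsqueeze(-1) * (z[src] - z[dst]))

class BLENDSketch(torch.nn.Module):
    """Encode (U, X), integrate the flow to time T, decode. Names and sizes are illustrative."""
    def __init__(self, d_pos_in, d_in, d_pos, d, d_out, edge_index, T=4.0):
        super().__init__()
        self.phi = torch.nn.Linear(d_pos_in, d_pos)       # positional encoder phi
        self.psi = torch.nn.Linear(d_in, d)               # feature encoder psi (absorbs alpha)
        self.flow = BeltramiFlow(edge_index, d_pos + d)
        self.xi = torch.nn.Linear(d_pos + d, d_out)       # decoder xi
        self.T = T

    def forward(self, U_in, X_in):
        z0 = torch.cat([self.phi(U_in), self.psi(X_in)], dim=-1)
        zT = odeint(self.flow, z0, torch.tensor([0.0, self.T]), method='dopri5')[-1]
        return self.xi(zT)
```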
In these experiments, we followed the methodology of [79], using 20 random weight initialisations for datasets with fixed Planetoid splits and 100 random splits otherwise. Where available, results from [79] are reported. Hyperparameters with the highest validation accuracy were chosen, and results are reported on a test set that is used only once. Hyperparameter search used Ray Tune [50] with a thousand random trials using an asynchronous hyperband scheduler with a grace period and half-life of ten epochs. The code to reproduce our results is included with the submission and will be released publicly following the review process. Experiments ran on AWS p2.8xlarge machines, each with 8 Tesla V100-SXM2 GPUs.

**Implementation details.** For all datasets excepting ogb-arxiv, the adaptive explicit Dormand-Prince scheme was used as the numerical solver; for ogb-arxiv, we used the Runge-Kutta method. For the two smallest datasets (Cora and Citeseer), we performed direct backpropagation through each step of the numerical integrator. For the larger datasets, to reduce memory complexity, we use Pontryagin's maximum principle to propagate gradients backwards in time [69] (a sketch of this adjoint-based training step is shown after Table 3). For the larger datasets, kinetic energy and Jacobian regularisation [28, 37] was employed. The regularisation ensures the learned dynamics is well-conditioned and easily solvable by a numerical solver, which reduced training time. We use constant initialisation for the attention weights $\mathbf{W}_K$, $\mathbf{W}_Q$, so training starts from a well-conditioned system that induces small regularisation penalty terms [28].

The space complexity of BLEND is dominated by evaluating the attention (13) over edges and is $\mathcal{O}(|E'|(d + d'))$, where $E'$ is the edge set following rewiring, $d$ is the dimension of the features, and $d'$ is the dimension of the positional encoding. The runtime complexity is $\mathcal{O}(|E'|(d + d')(E_b + E_f))$, split between the forward and backward pass, and can be dominated by either depending on the number of function evaluations ($E_b$, $E_f$).

| Method | Cora | Citeseer | Pubmed | Coauthor CS | Computers | Photo | ogb-arxiv |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GCN | 81.5 ± 1.3 | 71.9 ± 1.9 | 77.8 ± 2.9 | 91.1 ± 0.5 | 82.6 ± 2.4 | 91.2 ± 1.2 | 71.74 ± 0.29 |
| GAT | 81.8 ± 1.3 | 71.4 ± 1.9 | 78.7 ± 2.3 | 90.5 ± 0.6 | 78.0 ± 19 | 85.7 ± 20 | 73.01 ± 0.19 |
| GAT-ppr | 81.6 ± 0.3 | 68.5 ± 0.2 | 76.7 ± 0.3 | 91.3 ± 0.1 | 85.4 ± 0.3 | 90.9 ± 0.3 | – |
| MoNet | 81.3 ± 1.3 | 71.2 ± 2.0 | 78.6 ± 2.3 | 90.8 ± 0.6 | 83.5 ± 2.2 | 91.2 ± 2.3 | – |
| GS-mean | 79.2 ± 7.7 | 71.6 ± 1.9 | 77.4 ± 2.2 | 91.3 ± 2.8 | 82.4 ± 1.8 | 91.4 ± 1.3 | 71.49 ± 0.27 |
| GS-maxpool | 76.6 ± 1.9 | 67.5 ± 2.3 | 76.1 ± 2.3 | – | 85.0 ± 1.1 | 90.4 ± 1.3 | – |
| CGNN | 81.4 ± 1.6 | 66.9 ± 1.8 | 66.6 ± 4.4 | 92.3 ± 0.2 | 80.3 ± 2.0 | 91.4 ± 1.5 | 58.70 ± 2.50 |
| GDE | 78.7 ± 2.2 | 71.8 ± 1.1 | 73.9 ± 3.7 | 91.6 ± 0.1 | 82.9 ± 0.6 | 92.4 ± 2.0 | 56.66 ± 10.9 |
| BLEND | 84.8 ± 0.9 | 75.9 ± 1.3 | 79.5 ± 1.4 | 92.9 ± 0.2 | 86.9 ± 0.6 | 92.9 ± 0.6 | 72.56 ± 0.1 |
| BLEND-kNN | 82.5 ± 0.9 | 73.4 ± 0.5 | 80.9 ± 0.7 | 92.3 ± 0.1 | 86.7 ± 0.6 | 93.5 ± 0.3 | – |

*Table 3: Performance (test accuracy ± std) of different GNN models using random splits. The OGB GAT reference has 1.5M parameters vs. our 70K. BLEND-kNN pre-processes the graph using the DIGL methodology [43] (Section 3.3), which constructs an n-dimensional representation of each node (an n-by-n matrix) and then sparsifies it into a kNN graph. The ogb-arxiv dataset has >150K nodes and goes OOM; this is not a limitation of BLEND, but of DIGL. Other forms of initial rewiring are possible, but we chose to compare with DIGL (arguably the most popular graph rewiring), and so this result is missing.*

Tables 2-3 summarise the results of our experiments. BLEND outperforms other GNNs in most of the experiments, showing state-of-the-art results on some datasets.
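As referenced in the implementation details above, a memory-efficient alternative to backpropagating through every solver step is the adjoint method. The following hedged sketch uses the `torchdiffeq` conventions assumed earlier (the ODE function must be an `nn.Module` so its parameters are visible to the adjoint pass); the kinetic energy and Jacobian regularisers are omitted, and the training-loop details are illustrative.

```python
import torch
from torchdiffeq import odeint_adjoint   # assumed dependency; gradients via an adjoint ODE

def train_step(blend, U_in, X_in, labels, mask, optimizer):
    """One illustrative training step using adjoint backpropagation.
    Memory stays constant in the number of solver steps: instead of storing intermediate
    states, gradients are recovered by integrating a second ODE backwards in time
    (Pontryagin's maximum principle)."""
    z0 = torch.cat([blend.phi(U_in), blend.psi(X_in)], dim=-1)
    zT = odeint_adjoint(blend.flow, z0, torch.tensor([0.0, blend.T]), method='dopri5')[-1]
    loss = torch.nn.functional.cross_entropy(blend.xi(zT)[mask], labels[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```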
Another important point to note is that the compared GNNs use different sets of parameters per layer, whereas in BLEND, due to our choice of a time-independent attention, parameters are shared. This results in significantly fewer parameters: for comparison, the OGB versions of GCN, SAGE, and GAT used in the ogb-arxiv experiment require 143K, 219K, and 1.63M parameters respectively, compared to only 70K in BLEND.

### 5.2 Positional encoding

*Figure 3: An ablation study showing BLEND with and without positional encodings, as well as GAT, the most similar conventional GNN, with positional encodings added.*

In the second experiment, we investigated the impact of the positional encodings and varied their dimensionality and underlying geometry, in order to showcase the flexibility of our framework. Figure 3 shows that for all datasets BLEND is superior to a Euclidean model where positional encodings are not used, which corresponds to $\mathbf{Z} = \mathbf{X}$ ("BLEND w/o positional" in Figure 3), and to a version of GAT where attention is over a concatenation of the same positional encodings used in BLEND and the features. The only exception is Coauthor CS, where the performance without positional encodings is unchanged.

We experimented with three forms of positional encoding: DIGL PPR embeddings of dimension $n$ [43], DeepWalk embeddings [66] of dimensions 16-256, and hyperbolic embeddings in the Poincaré ball [15, 59] of dimension 2-16. Positional encodings are calculated as a preprocessing step and input as the $\mathbf{U}$ coordinates to BLEND. We calculated DeepWalk node embeddings using PyTorch Geometric's Node2Vec implementation with parameters $p = 1$, $q = 1$, context size 20, and 16 walks per node. We trained with walk length ranging between 40 and 120 and took the embeddings with the highest accuracy for the link prediction task. Shallow hyperbolic embeddings were generated using the HGCN [17] implementation with the default parameters provided by the authors.

Figure 4 (left) compares the performance of the best model using DIGL $n$-dimensional positional encodings against the best performing $d'$-dimensional hyperbolic positional encodings, with $d'$ tuned over the range 2-16. The average performance with DIGL positional encodings is 85.48, compared to 85.28 for hyperbolic encodings. Only in one case do the DIGL encodings outperform the best hyperbolic encodings. Figure 4 (right) shows the change in performance with the dimension $d'$ of the positional encoding using a fixed hyperparameter configuration. As expected, we observe a monotonic increase in performance as a function of $d'$. Importantly, most of the performance is captured by either 16-dimensional hyperbolic or Euclidean positional encodings, with a small additional margin gained by going up to $d' = n$, which is impractical for larger graphs.

*Figure 4: Left: performance comparison between Euclidean and hyperbolic positional embeddings. Right: results of the positional embeddings ablation (embedding types HYP2-HYP16, DW16-DW256, GDC). Hyperbolic or Euclidean embeddings with $d' = 16$ obtain performance comparable to Euclidean embeddings with $d' = n \gg 16$.*

### 5.3 Additional ablations

*Figure 5: Step size against accuracy for the explicit Euler scheme compared to the adaptive DOPRI5 scheme (step sizes 0.25-8).*
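As a concrete reference for the step-size comparison reported in Figure 5, the following hedged sketch shows how the fixed-step explicit Euler and adaptive DOPRI5 settings differ under the `torchdiffeq` conventions assumed in the earlier sketches; the function name and solver options are illustrative, not the exact configuration used in our experiments.

```python
import torch
from torchdiffeq import odeint   # assumed dependency, as in the earlier sketches

def integrate(func, z0, T=4.0, method='dopri5', step_size=None):
    """Integrate a Beltrami right-hand side `func` (e.g. the BeltramiFlow module of Section 4)
    from t = 0 to t = T. method='euler' with an explicit `step_size` reproduces the fixed-step
    setting of Figure 5; method='dopri5' lets the solver choose steps from an error estimate."""
    t_span = torch.tensor([0.0, T])
    options = {'step_size': step_size} if step_size is not None else None
    return odeint(func, z0, t_span, method=method, options=options)[-1]

# z_T = integrate(blend.flow, z0, method='euler', step_size=0.5)  # one point on Figure 5's x-axis
# z_T = integrate(blend.flow, z0, method='dopri5')                # adaptive baseline
```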
In addition to studying the effect of positional encodings, we performed ablation studies on the step size used in the integrator as well as on different forms of attention. In Figure 5 we studied the effect of changing the step size of the integrator using the explicit Euler method, with a fixed terminal time set to be the optimal terminal time. The left-hand side of the figure shows the performance using the adaptive step size DOPRI5 scheme for comparison. DOPRI5 gives the most consistent performance and is best on three of the six datasets. Details of additional attention functions and their relative performance can be found in the Supplementary Material.

## 6 Related work

**Image processing, computer vision, and graphics.** Following the Perona-Malik scheme [65], PDE-based approaches flourished in the 1990s, with multiple versions of anisotropic [90] and non-Euclidean [82] diffusion used primarily for image denoising. The realisation of these ideas in the form of nonlinear filters [85, 12] was adopted in industry. PDE-based methods were also used for low-level vision tasks including inpainting [6] and image segmentation [13, 18]. In computer graphics, fundamental solutions ("heat kernels") of diffusion equations were used as shape descriptors [83]. The closed-form expression of such solutions using the Laplace-Beltrami operator served as inspiration for some of the early approaches to GNNs [34, 23, 42, 45].

**Neural differential equations.** The interpretation of neural networks as discretised differential equations ("neural ODEs") [19] was an influential recent result with multiple follow-up works [25, 28, 53, 46]. In deep learning on graphs, this mindset has been applied to GNN architectures [2, 67] and continuous message passing [94]. Continuous GNNs were also explored in [31], who, similarly to [76], addressed the solutions of fixed-point equations. Ordinary Differential Equations on Graph Networks (GODE) [98] approach the problem using the technique of invertible ResNets. Finally, [75] used graph-based ODEs to generate physics simulations.

**Physics-inspired learning.** Solving PDEs with deep learning has been explored by [72]. Neural networks appeared in [47] to accelerate PDE solvers with applications in the physical sciences; these have been applied to problems where the PDE can be described on a graph [48]. [4] consider the problem of predicting fluid flow and use a PDE inside a GNN. These approaches differ from ours in that they solve a given PDE, whereas we use the notion of discretising PDEs as a principle to understand and design GNNs.

**Neural ODEs on graphs.** The most similar work to this is GRAND [16], of which BLEND can be considered a non-Euclidean extension. In addition, there are several other works that apply the neural ODE framework to graphs. In GDE [67], GODE [98], and CGNN [94], the goal is to adapt neural ordinary differential equations to graphs. In contrast, we consider non-Euclidean partial differential equations and regard GNNs as particular instances of their discretisation (both spatial and temporal). We can naturally use spaces with any metric, in particular extending recent works on hyperbolic graph embeddings. None of the previous techniques explores the link to differential geometry. More specifically, GDE, GODE, and CGNN consider neural ODEs of the canonical form $\frac{\partial x}{\partial t} = f(x, t, \theta)$, where $f$ is a graph neural network (GODE), the message passing component of a GNN (CGNN), or restricted to layers of bijective functions on graphs (GDE).
Furthermore, in CGNN only ODEs with closed-form solutions are considered. [75], on the other hand, is quite distinct, as they are not concerned with GNN design and instead use graph-based ODEs to generate physics simulations.

## 7 Conclusion

We developed a new class of graph neural networks based on the discretisation of a non-Euclidean diffusion PDE called Beltrami flow. We represent the graph structure and node features as a manifold in a joint continuous space, whose evolution by a parametric diffusion PDE (driven by the downstream learning task) gives rise to feature learning, positional encoding, and possibly also graph rewiring. Our experimental results show very good performance on popular benchmarks with a small fraction of the parameters used by other GNN models. Perhaps most importantly, our framework establishes important links between GNNs, differential geometry, and PDEs, fields with a rich history and profound results that in our opinion are still insufficiently explored in the graph ML community.

**Future directions.** While we show that our framework generalises many popular GNN architectures, we seek to use the graph as a numerical representation of an underlying continuous space. This view of the graph as an approximation of a continuous latent structure is a common paradigm of manifold learning [84, 5, 20] and network geometry [8]. If adopted in graph ML, this mindset offers a rigorous mathematical framework for formalising and generalising some of the recent trends in the field, including the departure from the input graph as the basis for message passing [1], latent graph inference [41, 89, 29, 36], higher-order [7, 54] and directional [3] message passing (which can be expressed as anisotropic diffusion arising from additional structure of the underlying continuous space and different discretisations of the PDEs, e.g. based on finite elements), and exploiting efficient numerical PDE solvers [19]. We leave these exciting directions for future research.

**Societal impact.** GNNs have recently become increasingly utilised in industrial applications, e.g. in recommender systems and social networks, and hence could potentially lead to a negative societal impact if used improperly. We would like to emphasise that our paper does not study such potential negative applications, and the mathematical framework we develop could help to interpret and understand existing GNN models. We believe that better understanding of ML models is key to managing their potential societal implications and preventing their negative impact.

**Limitations.** The assumption that the graph can be modelled as a discretisation of some continuous space makes our framework applicable only to cases where the edge and node features are continuous in nature. Applications e.g. to knowledge graphs with categorical attributes could only be addressed by first embedding such attributes in a continuous space. Finally, the structural result presented in Theorem 1 that links our model to the Polyakov action has not been implemented. This will be addressed in future work.

## 8 Acknowledgements

We thank Nils Hammerla and Gabriele Corso for feedback on an early version of this manuscript. MB and JR are supported in part by ERC Consolidator grant no. 724228 (LEMAN).

## References

[1] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In ICLR, 2021.

[2] Pedro H. C. Avelar, Anderson R. Tavares, Marco Gori, and Luis C. Lamb. Discrete and continuous deep residual learning over graphs. arXiv:1911.09554, 2019.
[3] Dominique Beaini, Saro Passaro, Vincent Létourneau, William L Hamilton, Gabriele Corso, and Pietro Liò. Directional graph networks. arXiv:2010.02863, 2020.

[4] Filipe de Avila Belbute-Peres, Thomas Economon, and Zico Kolter. Combining differentiable PDE solvers and graph neural networks for fluid flow prediction. In ICML, 2020.

[5] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.

[6] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In PACMCGIT, 2000.

[7] Cristian Bodnar, Fabrizio Frasca, Yu Guang Wang, Nina Otter, Guido Montúfar, Pietro Liò, and Michael Bronstein. Weisfeiler and Lehman go topological: Message passing simplicial networks. arXiv:2103.03212, 2021.

[8] Marian Boguna, Ivan Bonamassa, Manlio De Domenico, Shlomo Havlin, Dmitri Krioukov, and M Ángeles Serrano. Network geometry. Nature Reviews Physics, pages 1-22, 2021.

[9] Giorgos Bouritsas, Fabrizio Frasca, Stefanos Zafeiriou, and Michael M Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. arXiv:2006.09252, 2020.

[10] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv:2104.13478, 2021.

[11] Michael M Bronstein and Iasonas Kokkinos. Scale-invariant heat kernel signatures for non-rigid shape recognition. In CVPR, 2010.

[12] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In ICCV, 2005.

[13] Vicent Caselles, Ron Kimmel, and Guillermo Sapiro. Geodesic active contours. IJCV, 22(1):61-79, 1997.

[14] Soumen Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW, 2007.

[15] Benjamin P. Chamberlain, James Clough, and Marc Peter Deisenroth. Neural embeddings of graphs in hyperbolic space. arXiv:1705.10359, 2017.

[16] Benjamin P. Chamberlain, James Rowbottom, Maria Gorinova, Michael M. Bronstein, Stefan Webb, and Emanuele Rossi. GRAND: Graph neural diffusion. In ICML, pages 1407-1418, 2021.

[17] Ines Chami, Rex Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. In NeurIPS, 2019.

[18] Tony F Chan and Luminita A Vese. Active contours without edges. IEEE Trans. Image Processing, 10(2):266-277, 2001.

[19] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In NeurIPS, pages 6571-6583, 2018.

[20] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5-30, 2006.

[21] Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. 2020.

[22] Miles Cranmer, Rui Xu, Peter Battaglia, and Shirley Ho. Learning symbolic physics with graph networks. arXiv:1909.05862, 2019.

[23] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.

[24] John R Dormand and Peter J Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19-26, 1980.

[25] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In NeurIPS, pages 3134-3144, 2019.

[26] Frédo Durand and Julie Dorsey. Fast bilateral filtering for the display of high-dynamic-range images. In Proc. Computer Graphics and Interactive Techniques, 2002.
[27] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv:2012.09699, 2020.

[28] Chris Finlay, Jörn-Henrik Jacobsen, Levon Nurbekyan, and Adam M Oberman. How to train your neural ODE: The world of Jacobian and kinetic regularization. In ICML, 2020.

[29] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In ICML, 2019.

[30] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.

[31] Fangda Gu, Heng Chang, Wenwu Zhu, Somayeh Sojoudi, and Laurent El Ghaoui. Implicit graph neural networks. In NeurIPS, 2020.

[32] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.

[33] DF Hawken, JJ Gottlieb, and JS Hansen. Review of some adaptive node-movement techniques in finite-element and finite-difference solutions of partial differential equations. Journal of Computational Physics, 95(2):254-302, 1991.

[34] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv:1506.05163, 2015.

[35] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv:2005.00687, 2020.

[36] Anees Kazi, Luca Cosmo, Nassir Navab, and Michael Bronstein. Differentiable graph module (DGM) graph convolutional networks. arXiv:2002.04999, 2020.

[37] Jacob Kelly, Jesse Bettencourt, Matthew James Johnson, and David Duvenaud. Learning differential equations that are easy to solve. In NeurIPS, 2020.

[38] Ron Kimmel. Numerical Geometry of Images: Theory, Algorithms, and Applications. Springer, 2012.

[39] Ron Kimmel, Ravi Malladi, and Nir Sochen. Images as embedded maps and minimal surfaces: movies, color, texture, and volumetric medical images. IJCV, 39(2):111-129, 2000.

[40] Ron Kimmel, Nir Sochen, and Ravi Malladi. From high energy physics to low level vision. pages 236-247. Springer, 1997.

[41] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In ICML, 2018.

[42] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[43] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In NeurIPS, volume 32, 2019.

[44] Wilhelm Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Z. Math. Phys., 46:435-453, 1901.

[45] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Processing, 2017.

[46] Xuechen Li, Ricky Tian Qi Chen, Ting-Kam Leonard Wong, and David Duvenaud. Scalable gradients for stochastic differential equations. In Artificial Intelligence and Statistics, 2020.

[47] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. 2020.

[48] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Multipole graph neural operator for parametric partial differential equations. In NeurIPS, 2020.

[49] Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S Zemel.
LanczosNet: Multi-scale deep graph convolutional networks. In ICLR, 2019.

[50] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. arXiv:1807.05118, 2018.

[51] Roee Litman and Alexander M Bronstein. Learning spectral descriptors for deformable shape correspondence. PAMI, 36(1):171-180, 2013.

[52] Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic graph neural networks. arXiv:1910.12892, 2019.

[53] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: Stabilizing neural ODE networks with stochastic noise. 2019.

[54] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. In NeurIPS, pages 2153-2164, 2019.

[55] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings Information Retrieval, 2015.

[56] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127-163, 2000.

[57] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 2017.

[58] Galileo Namata, Ben London, Lise Getoor, Bert Huang, and UMD EDU. Query-driven active surveying for collective classification. In Proceedings Mining and Learning with Graphs, 2012.

[59] Maximillian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, pages 3779-3788. PMLR, 2018.

[60] Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. 2019.

[61] Lars Onsager. Reciprocal relations in irreversible processes. I. Physical Review, 37(4):405, 1931.

[62] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.

[63] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, 1999.

[64] Giuseppe Patané. Laplacian spectral kernels and distances for geometry processing and shape analysis. Computer Graphics Forum, 35(2):599-624, 2016.

[65] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. PAMI, 12(7):629-639, 1990.

[66] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701-710, 2014.

[67] Michael Poli, Stefano Massaroli, Junyoung Park, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. Graph neural ordinary differential equations. pages 6571-6583, 2019.

[68] Alexander M Polyakov. Quantum geometry of bosonic strings. Physics Letters B, 103(3):207-210, 1981.

[69] Lev Semenovich Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 2018.

[70] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.

[71] Alejandro F. Queiruga, N. Benjamin Erichson, Dane Taylor, and Michael W. Mahoney. Continuous-in-depth neural networks. CoRR, 2020.
[72] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (Part II): Data-driven discovery of nonlinear partial differential equations. CoRR, 2017.

[73] Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael Bronstein, and Federico Monti. SIGN: Scalable inception graph neural networks. arXiv:2004.11198, 2020.

[74] Carl Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46(2):167-178, 1895.

[75] Alvaro Sanchez-Gonzalez, Victor Bapst, Kyle Cranmer, and Peter Battaglia. Hamiltonian graph networks with ODE integrators. 2019.

[76] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 27(8):61-80, 2009.

[77] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.

[78] Lawrence F Shampine. Some practical Runge-Kutta formulas. Mathematics of Computation, 46(173):135-150, 1986.

[79] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv:1811.05868, 2018.

[80] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. JMLR, 12:2539-2561, 2011.

[81] Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vlimant. Graph neural networks in particle physics. Machine Learning: Science and Technology, 2(2):021001, 2020.

[82] Nir Sochen, Ron Kimmel, and Ravi Malladi. A general framework for low level vision. IEEE Trans. Image Processing, 7(3):310-318, 1998.

[83] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. Computer Graphics Forum, 28(5):1383-1392, 2009.

[84] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

[85] Carlo Tomasi and Roberto Manduchi. Bilateral filtering for gray and color images. In ICCV, 1998.

[86] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998-6008, 2017.

[87] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

[88] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.

[89] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Trans. Graphics, 38(5):1-12, 2019.

[90] Joachim Weickert. Anisotropic Diffusion in Image Processing. Teubner Stuttgart, 1998.

[91] Joachim Weickert, Bart M. Ter Haar Romeny, and Max A. Viergever. Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Trans. Image Processing, 7(3):398-410, 1998.

[92] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In ICML. PMLR, 2019.

[93] Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. Graph information bottleneck. NeurIPS, 2020.

[94] Louis-Pascal A. C. Xhonneux, Meng Qu, and Jian Tang. Continuous graph neural networks. In ICML, 2020.
[95] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.

[96] Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pages 40-48. PMLR, 2016.

[97] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In NeurIPS, 2017.

[98] Juntang Zhuang, Nicha Dvornek, Xiaoxiao Li, and James S. Duncan. Ordinary differential equations on graph networks. Technical Report 1, 2020.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] In Limitations (S6) and Extensions (S3.4)
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] Section 6
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] Theorem 1 is dealt with in the Supplementary Material
   (b) Did you include complete proofs of all theoretical results? [Yes] Supplementary Material
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In supplementary material
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Hyperparameters are included in the Supplementary Material and splits are addressed in Section 5
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Results in Section 5 include error bars
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] AWS, details at the end of 5.1 and supplementary material
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Papers have been cited for the most relevant OS libraries
   (b) Did you mention the license of the assets? [No]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]