# Capturing Graphs with Hypo-Elliptic Diffusions

Csaba Toth (Mathematical Institute, University of Oxford) toth@maths.ox.ac.uk
Darrick Lee (Mathematical Institute, University of Oxford) leed@maths.ox.ac.uk
Celia Hacker (Department of Mathematics, EPFL) celia.hacker@epfl.ch
Harald Oberhauser (Mathematical Institute, University of Oxford) oberhauser@maths.ox.ac.uk

*Equal contribution; order determined by random coin flip.

Abstract

Convolutional layers within graph neural networks operate by aggregating information about local neighbourhood structures; one common way to encode such substructures is through random walks. The distribution of these random walks evolves according to a diffusion equation defined using the graph Laplacian. We extend this approach by leveraging classic mathematical results about hypo-elliptic diffusions. This results in a novel tensor-valued graph operator, which we call the hypo-elliptic graph Laplacian. We provide theoretical guarantees and efficient low-rank approximation algorithms. In particular, this gives a structured approach to capturing long-range dependencies on graphs that is robust to pooling. Besides the attractive theoretical properties, our experiments show that this method competes with graph transformers on datasets requiring long-range reasoning, yet scales only linearly in the number of edges as opposed to quadratically in the number of nodes.

1 Introduction

Obtaining a latent description of the non-Euclidean structure of a graph is central to many applications. One common approach is to construct a set of features for each node that represents the local neighbourhood of this node; pooling these node features then provides a latent description of the whole graph. A classic way to arrive at such node features is by random walks: at the given node one starts a random walk and extracts a summary of the local neighbourhood from its sample trajectories. We revisit this random walk construction, inspired by two classical mathematical results:

Hypo-elliptic Laplacian. In the Euclidean case of Brownian motion $B = (B_t)_{t \ge 0}$ evolving in $\mathbb{R}^n$, the quantity $u(t, x) = \mathbb{E}[f(B_t) \mid B_0 = x]$ solves the heat equation $\partial_t u = \Delta u$ on $[0, \infty) \times \mathbb{R}^n$ with $u(0, x) = f(x)$. Seminal work of Gaveau [23] in the 1970s shows that if one replaces $f(B_t)$ in the expectation by a functional of the whole trajectory, $F(B_s : s \in [0, t])$, then a path-dependent heat equation can be derived in which the classical Laplacian is replaced by the hypo-elliptic Laplacian.

Free Algebras. A simple way to capture a sequence (for us, a sequence of nodes visited by a random walk on a graph) is to associate with each sequence element an element in an algebra¹ and multiply these algebra elements together. If the algebra multiplication is commutative, the sequential structure is lost, but if it is non-commutative, this captures the order in the sequence. In fact, by using the free associative algebra, this can be done faithfully, and linear functionals of this algebra correspond to functionals on the space of sequences. We leverage these ideas from the Euclidean case of $\mathbb{R}^d$ to the non-Euclidean case of graphs.

¹An algebra is a vector space where one can multiply elements, e.g. the set of $n \times n$ matrices with matrix multiplication. This multiplication can be non-commutative, e.g. $A \cdot B \neq B \cdot A$ for general matrices $A, B$.
In particular, we construct features for a given node by sampling from a random walk started at this node, but instead of averaging over the end points, we average over path-dependent functions. Put informally, instead of asking a random walker that started at a node "What do you see now?" after $k$ steps, we ask "What have you seen along your way?". The above notions from mathematics, the hypo-elliptic Laplacian and the free algebra, allow us to formalize this in the form of a generalized graph diffusion equation, and we develop algorithms that make this a scalable method.

Related Work. From the ML literature, [49, 27] popularized combining random walks with deep learning architectures to capture walk histories. Such ideas have been incorporated, sometimes implicitly, into graph neural networks (GNNs) [53, 9, 54, 18, 29, 5, 33], which in turn build on convolutional approaches [37, 38, 27], as well as their combination with attention or message passing [45, 65, 24], and more recent improvements [72, 46, 43, 15] that provide and improve on theoretical guarantees. Another classic approach is given by graph kernels; see [7] for a recent survey. In particular, the seminal paper [36] explored the connection between diffusion equations and random walkers in a kernel learning context. More recently, [14] proposed sequence kernels to capture the random walk history. Furthermore, [17] uses the signature kernel maximum mean discrepancy (MMD) [16] as a metric for trees, which implicitly relies on the algebra of tensors that we use, and [48] aggregates random walk histories to derive a kernel for graphs. Moreover, the concept of network motifs [44, 55] relates to similar ideas that describe a graph by node sequences. Further, the Bethe Hessian [52] has been successfully used in spectral clustering and shares the same goal of capturing path-dependence via a "deformed Laplacian", although the mathematical approach is very different from ours. Directly related to our approach is the topic of learning diffusion models on graphs [35, 13, 60, 20, 11]. While similar ideas on random walks and diffusion for graph learning have been developed by different communities, our proposed method leverages these perspectives by capturing random walk histories through a novel diffusion operation. Our main mathematical influence is the seminal work of Gaveau [23] from the 1970s, which shows how Brownian motion can be lifted into a Markov process evolving in a free algebra to capture path-dependence. This leads to a heat equation governed by the hypo-elliptic Laplacian. These insights had a large influence on PDE theory, see [51, 31], but it seems that their discrete counterpart on graphs has received no attention despite the well-developed literature on random walks on graphs and general non-Euclidean objects [68, 19, 26, 63]. A key challenge in going from theory to application is handling the computational complexity. To do so, we build on ideas from [62] to design effective algorithms for the hypo-elliptic graph diffusion.

Contribution and Outline. We introduce the hypo-elliptic graph Laplacian, which allows us to effectively capture random walk histories through a generalized diffusion model. In Section 3, we introduce the hypo-elliptic variants of standard graph matrices such as the adjacency matrix and (normalized) graph Laplacians. These hypo-elliptic variants are formulated in terms of tensor-valued rather than scalar-valued matrices, and can be manipulated using linear algebra in the same manner as in the classical setting.
The hypo-elliptic Laplacian leads to a corresponding diffusion model, and in Theorem 1 we show that the solution to this generalized diffusion equation summarizes the microscopic picture of the entire history of random walks, and not just their location after $k$ steps. This solution provides a rich description of the local neighbourhood about a node, which can either be used directly as node features or be pooled over the graph to obtain a latent description of the graph. Theorem 2 shows that these node features characterize random walks on the graph, and we provide an analogous statement for graph features in Appendix E. One can solve the hypo-elliptic graph diffusion equation directly with linear algebra, but this is computationally prohibitive, and Theorem 3 provides an efficient low-rank approximation. Finally, Section 5 provides experiments and benchmarks. A particular focus is to test the ability of our model to capture long-range interactions between nodes, and its robustness to pooling operations, which makes it less susceptible to the "over-squashing" phenomenon [2].

2 Sequence Features by Non-Commutative Multiplication

We define the space of sequences of arbitrary finite length in $\mathbb{R}^d$ by $\mathrm{Seq}(\mathbb{R}^d) := \bigcup_{k \ge 0} (\mathbb{R}^d)^{k+1}$, where elements are sequences denoted by $x = (x_0, x_1, \ldots, x_k) \in (\mathbb{R}^d)^{k+1}$. Assume we are given an injective map, which we call the algebra lifting, $\varphi : \mathbb{R}^d \to H$, from $\mathbb{R}^d$ into an algebra $H$. We can use this to define a sequence feature map²
$$\tilde\varphi : \mathrm{Seq}(\mathbb{R}^d) \to H, \qquad \tilde\varphi(x) = \varphi(\delta_0 x) \cdot \varphi(\delta_1 x) \cdots \varphi(\delta_k x), \tag{1}$$
where $\delta_0 x := x_0$ and $\delta_i x := x_i - x_{i-1}$ for $i \ge 1$ denote the increments of a sequence $x = (x_0, \ldots, x_k)$. This map associates to any sequence $x \in \mathrm{Seq}(\mathbb{R}^d)$ an element of the algebra $H$. If the multiplication in $H$ were commutative, then the map $\tilde\varphi$ would carry no information about the order of increments, i.e. $\varphi(\delta_0 x) \cdots \varphi(\delta_k x) = \varphi(\delta_{\sigma(0)} x) \cdots \varphi(\delta_{\sigma(k)} x)$ for any permutation $\sigma$ of $\{0, \ldots, k\}$. However, if the multiplication in $H$ is "non-commutative enough", we expect $\tilde\varphi$ to be injective.

A Free Construction. Many choices for $H$ are possible, but intuitively it makes sense to use the "most general object" for $H$. The mathematically rigorous approach is to use the free algebra over $\mathbb{R}^d$, and we give a summary in Appendix B. Despite this abstract motivation, the algebra $H$ has a concrete form: it is realized as a sequence of tensors over $\mathbb{R}^d$ of increasing degree, and is defined by
$$H := \{v = (v_0, v_1, v_2, \ldots) : v_m \in (\mathbb{R}^d)^{\otimes m},\ m \in \mathbb{N},\ \|v\| < \infty\},$$
where by convention $(\mathbb{R}^d)^{\otimes 0} = \mathbb{R}$, and we describe the norm $\|v\|$ in the paragraph below. For example, if $v = (v_m)_{m \ge 0} \in H$, then $v_0$ is a scalar, $v_1$ is a vector, $v_2 \in (\mathbb{R}^d)^{\otimes 2}$ is a $d \times d$ matrix, and so on. The vector space structure of $H$ is given by addition and scalar multiplication according to $v + w := (v_m + w_m)_{m \ge 0} \in H$ and $\lambda v := (\lambda v_m)_{m \ge 0} \in H$ for $\lambda \in \mathbb{R}$, and the algebra structure is given by
$$v \otimes w := \Big( \sum_{i+j=m} v_i \otimes w_j \Big)_{m \ge 0}. \tag{2}$$

An Inner Product. If $e_1, \ldots, e_d$ is a basis of $\mathbb{R}^d$, then every tensor $v_m \in (\mathbb{R}^d)^{\otimes m}$ can be written as $v_m = \sum_{1 \le i_1, \ldots, i_m \le d} c_{i_1, \ldots, i_m}\, e_{i_1} \otimes \cdots \otimes e_{i_m}$. This allows us to define an inner product $\langle \cdot, \cdot \rangle_m$ on $(\mathbb{R}^d)^{\otimes m}$ by extending
$$\langle e_{i_1} \otimes \cdots \otimes e_{i_m},\, e_{j_1} \otimes \cdots \otimes e_{j_m} \rangle_m = \begin{cases} 1 & \text{if } i_1 = j_1, \ldots, i_m = j_m, \\ 0 & \text{otherwise}, \end{cases} \tag{3}$$
to all of $(\mathbb{R}^d)^{\otimes m}$ by linearity. This gives us an inner product $\langle v, w \rangle := \sum_{m \ge 0} \langle v_m, w_m \rangle_m$ on $H$ such that $H$ is a Hilbert space; in particular, we get a norm $\|v\| := \sqrt{\langle v, v \rangle}$. To sum up, the space $H$ has a rich structure: it has a vector space structure, it has an algebra structure (a non-commutative product), and it is a Hilbert space (an inner product between elements of $H$ gives a scalar).

²There are variants of this sequence feature map, which are discussed in Appendix G.
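To make the sequence feature map concrete, the following is a minimal NumPy sketch (illustrative only, not the paper's implementation) that computes $\tilde\varphi(x)$ for short sequences, truncating $H$ at degree $M$ and using the tensor exponential lifting introduced in the next section as the running example. Swapping the order of two increments changes the degree-2 tensor, which shows how non-commutativity retains the order of a sequence.

```python
import numpy as np

M, d = 3, 2   # truncation degree of H and dimension of the state space R^d

def tensor_exp(u, M):
    """Truncated tensor exponential: (1, u, u⊗u/2!, ..., u^{⊗M}/M!)."""
    out, cur = [np.array(1.0)], np.array(1.0)
    for m in range(1, M + 1):
        cur = np.multiply.outer(cur, u) / m          # builds u^{⊗m}/m! iteratively
        out.append(cur)
    return out

def tensor_mul(v, w, M):
    """Product in the truncated algebra H, cf. (2): (v⊗w)_m = Σ_{i+j=m} v_i ⊗ w_j."""
    return [sum(np.multiply.outer(v[i], w[m - i]) for i in range(m + 1))
            for m in range(M + 1)]

def seq_feature(x, M):
    """Sequence feature map (1) with φ = tensor_exp: φ(δ_0 x) ⊗ ... ⊗ φ(δ_k x)."""
    incs = [x[0]] + [x[i] - x[i - 1] for i in range(1, len(x))]
    feat = tensor_exp(incs[0], M)
    for u in incs[1:]:
        feat = tensor_mul(feat, tensor_exp(u, M), M)
    return feat

def inner(v, w):
    """Inner product on H: <v, w> = Σ_m <v_m, w_m>_m."""
    return sum(float(np.sum(vm * wm)) for vm, wm in zip(v, w))

# Two toy sequences visiting the same points in a different order.
x = [np.array([0., 0.]), np.array([1., 0.]), np.array([1., 1.])]   # step right, then up
y = [np.array([0., 0.]), np.array([0., 1.]), np.array([1., 1.])]   # step up, then right
fx, fy = seq_feature(x, M), seq_feature(y, M)
print(np.round(fx[2], 2))   # degree-2 tensor of x ...
print(np.round(fy[2], 2))   # ... differs from that of y: the order of increments is retained
print(round(inner(fx, fy), 4))
```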
Characterizing Random Walks. From Equation (1), we have constructed a map $\tilde\varphi$ that maps a sequence $x \in \mathrm{Seq}(\mathbb{R}^d)$ of arbitrary length into the space $H$ (see Appendix C for further details). Our aim is to apply this to the sequence of node attributes corresponding to random walks on a graph. Therefore, the expectation of $\tilde\varphi$ should be able to characterize the distribution of the random walk. Formally, the map $\tilde\varphi$ is characteristic if the map $\mu \mapsto \mathbb{E}_{x \sim \mu}[\tilde\varphi(x)]$ from the space of probability measures on $\mathrm{Seq}(\mathbb{R}^d)$ into $H$ is injective. Indeed, if the chosen lifting $\varphi$ satisfies some mild conditions, this holds for $\tilde\varphi$; see Appendix C and [16, 62].

Linear Functionals. The quantity $\mathbb{E}_{x \sim \mu}[\tilde\varphi(x)]$ characterizes the probability measure $\mu$, but it is valued in the infinite-dimensional Hilbert space $H$. Using the inner product, we can instead consider
$$\langle \ell, \mathbb{E}_{x \sim \mu}[\tilde\varphi(x)] \rangle \quad \text{for } \ell = (\ell_0, \ell_1, \ell_2, \ldots, \ell_M, 0, \ldots) \in H \text{ and } M \ge 1, \tag{4}$$
which is equivalent to knowing $\mathbb{E}_{x \sim \mu}[\tilde\varphi(x)]$; i.e. the set of functionals (4) characterizes $\mu$. This is analogous to how one can use either the moment generating function of a real-valued random variable or its sequence of moments to characterize its distribution; the former is one infinite-dimensional object (a function), the latter is an infinite sequence of scalars. We extend a key insight from [62] in Section 4: a linear functional $\langle \ell, \mathbb{E}_{x \sim \mu}[\tilde\varphi(x)] \rangle$ can be efficiently approximated without directly computing $\mathbb{E}_{x \sim \mu}[\tilde\varphi(x)]$ or storing large tensors.

The Tensor Exponential. While we will continue to keep $\varphi$ arbitrary for our main results (see [62] and Appendix G for other choices), we will use the tensor exponential
$$\exp_\otimes : \mathbb{R}^d \to H, \qquad \exp_\otimes(v) := \Big( \frac{v^{\otimes m}}{m!} \Big)_{m \ge 0} = \Big( 1,\ v,\ \frac{v \otimes v}{2!},\ \ldots \Big), \tag{5}$$
as the primary example throughout this paper and in the experiments in Section 5. With this choice, the induced sequence feature map is the discretized version of a classical object in analysis, called the path signature; see Appendix C.

3 Hypo-Elliptic Diffusions

Throughout this section, we fix a labelled graph $G = (V, E, f)$; that is, $V = \{1, \ldots, n\}$ is a set of $n$ nodes, $E$ denotes the set of edges, and $f : V \to \mathbb{R}^d$ is the set of continuous node attributes³, which map each node to an element of the vector space $\mathbb{R}^d$. Two nodes $i, j \in V$ are adjacent if $(i, j) \in E$ is an edge, and we denote this by $i \sim j$. The adjacency matrix $A$ of a graph is defined by $A_{i,j} = 1$ whenever $i \sim j$, and $0$ otherwise. We denote by $\deg(i)$ the number of nodes that are adjacent to node $i$.

Random Walks on Graphs. Let $(B_k)_{k \ge 0}$ be the simple random walk on the nodes $V$ of $G$, where the initial node is chosen uniformly at random. The transition matrix of this time-homogeneous Markov chain is
$$P_{i,j} := \mathbb{P}(B_k = j \mid B_{k-1} = i) = \begin{cases} \frac{1}{\deg(i)} & i \sim j, \\ 0 & \text{otherwise}. \end{cases}$$
Denote by $(L_k)_{k \ge 0}$ the random walk lifted to the node attributes in $\mathbb{R}^d$, that is,
$$L_k := f(B_k). \tag{6}$$
Recall that the normalized graph Laplacian for random walks is defined as $L = I - D^{-1} A$, where $D$ is the diagonal degree matrix; in particular, the entry-wise definition is
$$L_{i,j} = \begin{cases} -\frac{1}{\deg(i)} & i \sim j, \\ 1 & i = j, \\ 0 & \text{otherwise}. \end{cases}$$
The discrete graph diffusion equation for $U_k \in \mathbb{R}^{n \times d}$ is given by
$$U_k - U_{k-1} = -L U_{k-1}, \qquad U_0^{(i)} = f(i), \tag{7}$$
where the initial condition $U_0 \in \mathbb{R}^{n \times d}$ is specified by the node attributes.⁴ The probabilistic interpretation of the solution to this diffusion equation is classical and given as
$$U_k = \big( \mathbb{E}[L_k \mid B_0 = i] \big)_{i=1}^n = P^k U_0. \tag{8}$$
This allows us to compute the solution $U_k$ using the transition matrix $P = I - L$.

³The labels given by the labelled graph are called attributes, while the computed updates are called features.
⁴The attributes over all nodes are given by an $n \times d$ matrix; in particular, $U_k^{(i)}$ is the $i$-th row of the matrix.
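For contrast with the hypo-elliptic version developed next, here is a minimal NumPy sketch of the classical diffusion (7)-(8) on a toy path graph; the graph and attributes are made up for illustration.

```python
import numpy as np

# Toy labelled graph: path 1 - 2 - 3 - 4 with 2-dimensional node attributes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
f = np.array([[0., 1.], [1., 0.], [0., 1.], [1., 1.]])   # one attribute row per node

deg = A.sum(axis=1)
P = A / deg[:, None]            # random-walk transition matrix P = D^{-1} A
L = np.eye(len(A)) - P          # normalized graph Laplacian L = I - D^{-1} A

# Classical graph diffusion (7): U_k - U_{k-1} = -L U_{k-1}, i.e. U_k = P U_{k-1} = P^k U_0.
k, U = 3, f.copy()
for _ in range(k):
    U = P @ U                   # after the loop, U[i] = E[f(B_k) | B_0 = i], cf. (8)
print(np.round(U, 3))
```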
Random Walks on Algebras. We now incorporate the history of a random walker by considering the quantity
$$\mathbb{E}[\tilde\varphi(\mathbf{L}_k) \mid B_0 = i] = \mathbb{E}[\varphi(\delta_0 \mathbf{L}_k) \otimes \cdots \otimes \varphi(\delta_k \mathbf{L}_k) \mid B_0 = i], \tag{9}$$
where $\mathbf{L}_k := (L_0, \ldots, L_k)$ denotes the history of the lifted walk. Since $\tilde\varphi$ captures the whole history of the random walk $\mathbf{L}_k$ over node attributes, we expect this expectation to provide a much richer summary of the neighborhood of node $i$ than $\mathbb{E}[L_k \mid B_0 = i]$. The price, however, is computational complexity, since (9) is $H$-valued. We first show that, analogous to (7), the quantity (9) satisfies a diffusion equation that can be computed with linear algebra. To do so, we develop a graph analogue of the hypo-elliptic Laplacian and replace the scalar entries of the matrices with entries from the algebra $H$.

Matrix Rings over Algebras. We first revisit the adjacency matrix $A \in \mathbb{R}^{n \times n}$ and replace it by the tensor adjacency matrix $\tilde A = (\tilde A_{i,j}) \in H^{n \times n}$; that is, $\tilde A$ is a matrix whose entries are not scalars but elements of the algebra $H$. The matrix $A$ has an entry at $i, j$ if nodes $i$ and $j$ are connected; $\tilde A$ replaces the $i, j$ entry with an element of $H$ that tells us how the node attributes of $i$ and $j$ differ,
$$\tilde A_{i,j} = \begin{cases} \varphi(f(j) - f(i)) & i \sim j, \\ 0 & \text{otherwise}. \end{cases} \tag{10}$$
Matrix multiplication works for elements of $H^{n \times n}$ by replacing scalar multiplication with multiplication in $H$, that is, $(\tilde B \tilde C)_{i,j} = \sum_{k=1}^n \tilde B_{i,k} \otimes \tilde C_{k,j}$ for $\tilde B, \tilde C \in H^{n \times n}$, where $\tilde B_{i,k} \otimes \tilde C_{k,j}$ denotes multiplication in $H$ as in Equation (2). For the classical adjacency matrix $A$, the $k$-th power counts walks in the graph: $(A^k)_{i,j}$ is the number of walks of length $k$ from node $i$ to node $j$. We can take powers of $\tilde A$ in the same way as in the classical case, where
$$(\tilde A^k)_{i,j} = \sum_{x} \varphi(\delta_1 x) \otimes \cdots \otimes \varphi(\delta_k x),$$
where the sum is taken over all length-$k$ walks $x = (f(i), \ldots, f(j))$ from node $i$ to node $j$ (full details are provided in Appendix D). Since $\tilde\varphi(x)$ characterizes each walk $x$, the entry $(\tilde A^k)_{i,j}$ can be interpreted as a summary of all walks which connect nodes $i$ and $j$.

Hypo-elliptic Graph Diffusion. Similar to the tensor adjacency matrix, we define the hypo-elliptic graph Laplacian as the $n \times n$ matrix $\tilde L = I - D^{-1} \tilde A \in H^{n \times n}$, where $D$ is the degree matrix embedded into $H^{n \times n}$ at tensor degree 0. The entry-wise definition is
$$\tilde L_{i,j} = \begin{cases} -\frac{\varphi(f(j) - f(i))}{\deg(i)} & i \sim j, \\ 1 & i = j, \\ 0 & \text{otherwise}. \end{cases} \tag{11}$$
We can now formulate the hypo-elliptic graph diffusion equation for $v_k \in H^n$ as
$$v_k - v_{k-1} = -\tilde L v_{k-1}, \qquad v_0^{(i)} = \varphi(f(i)). \tag{12}$$
Analogous to the classical graph diffusion (8), the hypo-elliptic graph diffusion (12) has a probabilistic interpretation in terms of $\mathbf{L}_k$, as shown in Theorem 1 (the proof is given in Appendix D).

Theorem 1. Let $k \in \mathbb{N}$, let $\mathbf{L}_k = (L_0, \ldots, L_k)$ be the lifted random walk from (6), and let $\tilde P = I - \tilde L$ be the normalized tensor adjacency matrix. The solution to the hypo-elliptic graph diffusion equation (12) is
$$v_k = \big( \mathbb{E}[\varphi(\delta_1 \mathbf{L}_k) \otimes \cdots \otimes \varphi(\delta_k \mathbf{L}_k) \mid B_0 = i] \big)_{i=1}^n = \tilde P^k \mathbf{1}_H,$$
where $\mathbf{1}_H \in H^n$ denotes the vector whose entries all equal the unit $1_H$ of $H$. Furthermore, if $F \in H^{n \times n}$ is the diagonal matrix with $F_{i,i} = \varphi(f(i))$, then $F v_k = (\mathbb{E}[\tilde\varphi(\mathbf{L}_k) \mid B_0 = i])_{i=1}^n$.

In the classical diffusion equation, $U_k$ captures the concentration of the random walkers over the nodes after $k$ time steps. In the hypo-elliptic diffusion equation, $v_k$ captures summaries of random walk histories after $k$ time steps, since $\tilde\varphi(\mathbf{L}_k)$ summarizes the whole trajectory $\mathbf{L}_k = (L_0, \ldots, L_k)$ and not only the endpoint $L_k$.
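The following NumPy sketch illustrates Theorem 1 by brute force on a toy graph: elements of $H$ are represented as tuples of tensors truncated at degree $M$, and the iteration $v_k = \tilde P v_{k-1}$ is started from $\mathbf{1}_H$, matching the closed form $v_k = \tilde P^k \mathbf{1}_H$. This is a didactic reference implementation only (the graph and attributes are invented); Section 4 gives the algorithm that avoids storing tensors.

```python
import numpy as np

M, d = 2, 2   # truncation degree of H and attribute dimension

def texp(u):   # truncated tensor exponential, the lifting φ of Equation (5)
    out, cur = [np.array(1.0)], np.array(1.0)
    for m in range(1, M + 1):
        cur = np.multiply.outer(cur, u) / m
        out.append(cur)
    return out

def tmul(v, w):   # product in H truncated at degree M, cf. Equation (2)
    return [sum(np.multiply.outer(v[i], w[m - i]) for i in range(m + 1)) for m in range(M + 1)]

def tadd(v, w):
    return [vm + wm for vm, wm in zip(v, w)]

one_H = [np.array(1.0)] + [np.zeros((d,) * m) for m in range(1, M + 1)]   # unit of H

# Toy labelled graph: path 0 - 1 - 2 - 3 with 2-dimensional attributes.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
f = np.array([[0., 1.], [1., 0.], [0., 1.], [1., 1.]])
n, deg = len(A), A.sum(1)

def lifted_transition(v):
    """One diffusion step: (P̃ v)_i = Σ_{j ~ i} (1/deg(i)) φ(f(j) - f(i)) ⊗ v_j."""
    out = []
    for i in range(n):
        acc = [np.zeros((d,) * m) for m in range(M + 1)]
        for j in range(n):
            if A[i, j]:
                term = tmul(texp(f[j] - f[i]), v[j])
                acc = tadd(acc, [tm / deg[i] for tm in term])
        out.append(acc)
    return out

# Theorem 1: v_k = P̃^k 1_H, so v_k[i] = E[φ(δ_1 L) ⊗ ... ⊗ φ(δ_k L) | B_0 = i].
k = 3
v = [one_H for _ in range(n)]
for _ in range(k):
    v = lifted_transition(v)
print(np.round(v[0][1], 3))   # degree-1 part of the history summary at node 0
print(np.round(v[0][2], 3))   # degree-2 part: retains the order in which attributes were seen
```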
Node Features and Graph Features. Theorem 1 can then be used to compute features $\Phi(i) \in H$ for individual nodes as well as a feature $\Psi(G)$ for the entire graph. The former is given by the $i$-th component $v_k^{(i)}$ of the solution $v_k = (v_k^{(i)})_{i=1,\ldots,n} \in H^n$ of Equation (12),
$$\Phi(i) := v_k^{(i)} = \mathbb{E}[\tilde\varphi(\mathbf{L}_k) \mid B_0 = i] = (F \tilde P^k v_0)^{(i)} \in H.$$
Since the random walk $B$ chooses the starting node $B_0$ uniformly at random, the latter can be computed by mean pooling the node features, which also has a probabilistic interpretation as
$$\Psi(G) := \mathbb{E}[\tilde\varphi(\mathbf{L}_k)] = n^{-1}\big(\mathbf{1}_H^T F \tilde P^k v_0\big) \in H, \tag{13}$$
where $\mathbf{1}_H := (1_H, \ldots, 1_H) \in H^n$ is the all-ones vector in $H^n$ and $1_H$ denotes the unit in $H$.

Characterizing Graphs with Random Walks. The graph and node features obtained through the hypo-elliptic diffusion equation are highly descriptive: they characterize the entire history of the random walk process if one also includes the time parametrization, as described in Appendix C.

Theorem 2. Suppose $\Psi$ is the graph feature map from Equation (13) induced by the tensor exponential algebra lifting including time parametrization. Let $G$ and $G'$ be two labelled graphs, and let $\mathbf{L}_k = (L_0, \ldots, L_k)$ and $\mathbf{L}'_k = (L'_0, \ldots, L'_k)$ be the corresponding $k$-step lifted random walks as defined in Equation (6). Then $\Psi(G) = \Psi(G')$ if and only if the distributions of $\mathbf{L}_k$ and $\mathbf{L}'_k$ are equal.

It is instructive to contrast this result with the classical diffusion case; the latter only uses the marginal distribution of $L_k$ to capture the graph structure, which, at least intuitively, has much less expressive power. Indeed, in Appendix E we show that already for elementary graphs this leads to large differences in expressive power. Further, an analogous result holds for the node features, and we prove both results in Appendix E. While we use the tensor exponential in this article, many other choices of $\tilde\varphi$ are possible and result in graph and node features with such properties: under mild conditions, if the algebra lifting $\varphi : \mathbb{R}^d \to H$ characterizes measures on $\mathbb{R}^d$, the resulting node feature map $\Phi$ characterizes the random walk, see [62], which in turn implies the above results. Possible variations are discussed in Appendix G.

General (Hypo-elliptic) Diffusions and Attention. One can consider more general diffusion operators, such as the normalized Laplacian $K$ of a weighted graph. We define its lifted operator $\tilde K \in H^{n \times n}$ analogous to Equation (11), resulting in a generalization of Theorem 1 with $\tilde K$ replacing $\tilde L$. In the flavour of convolutional GNNs [8], we consider a weighted adjacency matrix $A \in \mathbb{R}^{n \times n}$ with
$$A_{i,j} = \begin{cases} c_{i,j} & i \sim j, \\ 0 & \text{otherwise}, \end{cases}$$
for $c_{i,j} > 0$. The corresponding normalized Laplacian $K$ is given by $K = I - D^{-1} A$, where $D$ is a diagonal matrix with $D_{i,i} = \sum_{j \in \mathcal{N}(i)} c_{i,j}$. A common way to learn the coefficients while sharing parameters across graphs is to model them as $c_{i,j} = \exp(a(f(i), f(j)))$ using a local attention mechanism $a : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ [65]. In our implementation, we use additive attention [4] given by $a(f(i), f(j)) = \mathrm{LeakyReLU}_{0.2}(W_s f(i) + W_t f(j))$, where $W_s, W_t \in \mathbb{R}^{1 \times d}$ are linear transformations for the source and target nodes, but different attention mechanisms can also be used, e.g. scaled dot-product attention [64]. Then the corresponding transition matrix $P = D^{-1} A$ is given by $P_{i,j} = \mathrm{softmax}_{k \in \mathcal{N}(i)}\big(a(f(i), f(k))\big)_j$, and the lifted transition matrix is defined as
$$\tilde P_{i,j} = \begin{cases} P_{i,j}\, \varphi(f(j) - f(i)) & i \sim j, \\ 0 & \text{otherwise}. \end{cases}$$
The statements of Theorem 1 immediately generalize to this variation by taking the expectation with respect to the corresponding non-uniform random walk. Hence, in this case the use of attention can be interpreted as learning the transition probabilities of a random walk on the graph.
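A minimal NumPy sketch of the attention-weighted transition matrix described above, assuming additive attention with random (untrained) weights $W_s, W_t$ on a toy graph; the resulting $P$ can replace the uniform random-walk transition matrix in the lifted diffusion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 2
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)   # toy adjacency
f = rng.normal(size=(n, d))                                                     # toy attributes

# Additive attention scores a(f(i), f(j)) = LeakyReLU_0.2(W_s f(i) + W_t f(j)).
W_s, W_t = rng.normal(size=(1, d)), rng.normal(size=(1, d))
scores = (f @ W_s.T) + (f @ W_t.T).T                  # scores[i, j] = W_s f(i) + W_t f(j)
scores = np.where(scores > 0, scores, 0.2 * scores)   # LeakyReLU with negative slope 0.2

# Softmax over each node's neighbourhood gives the learned transition matrix P = D^{-1} A.
masked = np.where(A > 0, scores, -np.inf)
P = np.exp(masked - masked.max(axis=1, keepdims=True))
P = P / P.sum(axis=1, keepdims=True)
print(np.round(P, 3))   # rows sum to 1 and P_ij > 0 only for i ~ j; then P̃_ij = P_ij φ(f(j) - f(i))
```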
4 Efficient Algorithms for Deep Learning

The previous sections show that the node feature $\Phi(i)$ provides a structured description of the neighborhood of node $i$, and it is instructive to think of a linear functional $\langle \ell, \Phi(i) \rangle$ as answering a specific question about the node neighbourhood; see Appendix E for examples. The naive computation of $\langle \ell, \Phi(i) \rangle$ by first computing $\Phi(i)$ and then taking the inner product is too expensive, especially when $\ell = (\ell_0, \ldots, \ell_M, 0, \ldots) \in H$ for large $M$. To address this, we revisit two observations from [62]: first, for a rank-1 functional $\ell \in H$, the computation of $\langle \ell, \Phi(i) \rangle$ is cheap. Second, the restriction to small $M$ limits the expressive power, but this can be counteracted by composition: any choice of $d$ functionals $\ell_1, \ldots, \ell_d \in H$ gives a label update $f(i) \mapsto (\langle \ell_j, \Phi(i) \rangle)_{j=1,\ldots,d} \in \mathbb{R}^d$ for the graph. Repeating such a label update a few times with low degree $M$ and rank-1 functionals turns out to be as powerful as computing one update with general functionals of arbitrarily high $M$. The first observation should not be too surprising given the popularity of low-rank approximations; the second observation is reminiscent of constructing a high-degree polynomial by composing low-degree polynomials⁵, or of the width-vs-depth phenomenon in neural nets, and we give more details below.

Computing Rank-1 Functionals. First, we focus on a rank-1 linear functional $\ell \in H$ given as
$$\ell = (\ell_m)_{m \ge 0} \quad \text{with} \quad \ell_m = u_{M-m+1} \otimes \cdots \otimes u_M \ \text{ and } \ \ell_m = 0 \text{ for } m > M, \tag{14}$$
where $u_m \in \mathbb{R}^d$ for $m = 1, \ldots, M$, for a fixed $M \ge 1$. Theorem 3 shows that for such $\ell$, the computation of $\langle \ell, \hat\Phi(i) \rangle$, where $\hat\Phi(i)$ is the node feature without the basepoint, can be done (a) efficiently, by factoring this low-rank structure into the recursive computation, and (b) simultaneously for all nodes $i \in V$ in parallel. This can then be used to compute rank-$R$ functionals for $R > 1$ and to compute $\langle \ell, \Phi(i) \rangle$; see Appendix F, where we also provide a pseudocode implementation.

Theorem 3. Let $\ell$ be as in (14) and define $f_{k,m} \in \mathbb{R}^n$ for $m = 1, \ldots, M$ by
$$f_{1,m} := \frac{1}{m!}\,\big(P \odot C^{u_{M-m+1}} \odot \cdots \odot C^{u_M}\big)\,\mathbf{1},$$
where $\mathbf{1}^T := (1, \ldots, 1) \in \mathbb{R}^n$ is the all-ones vector, and for $2 \le k$ and $1 \le m \le M$ recursively by
$$f_{k,m} := P f_{k-1,m} + \sum_{r=1}^{m} \frac{1}{r!}\,\big(P \odot C^{u_{M-m+1}} \odot \cdots \odot C^{u_{M-m+r}}\big)\, f_{k-1,m-r}, \tag{15}$$
with the convention $f_{k-1,0} := \mathbf{1}$, where the matrix $C^u = (C^u_{i,j}) \in \mathbb{R}^{n \times n}$ is defined as
$$C^u_{i,j} = \begin{cases} \langle u, f(j) - f(i) \rangle & i \sim j, \\ 0 & \text{otherwise}. \end{cases}$$
Here $\odot$ denotes element-wise⁶ multiplication, while juxtaposition denotes matrix multiplication. Then it holds for $i \in V$, random walk length $k \in \mathbb{Z}_+$, and tensor degree $m = 1, \ldots, M$ that $f_{k,m}(i) = \langle \ell_m, \hat\Phi_k(i) \rangle$, where $\hat\Phi_k(i) = \mathbb{E}[\varphi(\delta_1 \mathbf{L}_k) \otimes \cdots \otimes \varphi(\delta_k \mathbf{L}_k) \mid B_0 = i]$.

Overall, Eq. (15) computes $f_{k,m}(i)$ for all $i \in V$, $k = 1, \ldots, K$, and $m = 1, \ldots, M$ in $O(K M^2 N_E + M N_E d)$ operations, where $N_E \in \mathbb{N}$ denotes the number of edges; see App. F. In particular, one does not need to compute $\Phi(i) \in H$ directly or store large tensors (see the numerical sketch below).

Graph Labelling Layers. Fixing $d$ rank-1 functionals $\ell_1, \ldots, \ell_d \in H$ induces a label update $f(i) \mapsto (\langle \ell_j, \Phi(i) \rangle)_{j=1,\ldots,d} \in \mathbb{R}^d$. Theorem 3 allows us to compute this update in parallel for all nodes in $V$. Such a label update is similar to a hidden layer in a neural network (NN), and we can stack such updates; see Figure 3 in App. H.1. As in NNs, the $d$ functionals in each "graph labelling layer" are optimized by gradient descent. Finally, note that a rank-$R$ functional is the sum of $R$ rank-1 functionals, so we can immediately carry out the same construction with rank-$R$ functionals by adding a mixing layer.

⁵For example, $1 + x + x^2$ composed with $1 + 2x^2$ yields the degree-4 polynomial $1 + (1 + 2x^2) + (1 + 2x^2)^2$.
⁶For example, $(A \odot B)_{i,j} = A_{i,j} B_{i,j}$.
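The following NumPy sketch implements the recursion (15) of Theorem 3 for a single rank-1 functional and checks it against a brute-force enumeration of all $K$-step walks. The toy graph, attributes, and functional are randomly generated, and the code is purely illustrative, not the paper's released implementation.

```python
import numpy as np
from math import factorial

# Toy graph, attributes and rank-1 functional (all made up for illustration).
rng = np.random.default_rng(0)
n, d, M, K = 5, 3, 3, 4
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], float)
f = rng.normal(size=(n, d))
u = rng.normal(size=(M, d))        # u[m-1] plays the role of u_m in ℓ_M = u_1 ⊗ ... ⊗ u_M

deg = A.sum(1)
P = A / deg[:, None]               # random-walk transition matrix

def C(vec):
    """C^u of Theorem 3: C^u_{ij} = <u, f(j) - f(i)> for i ~ j, else 0."""
    s = f @ vec
    return A * (s[None, :] - s[:, None])

# Recursion (15): F[(k, m)](i) = <ℓ_m, Φ̂_k(i)>, with the convention F[(k, 0)] = 1.
F = {(0, 0): np.ones(n), **{(0, m): np.zeros(n) for m in range(1, M + 1)}}
for k in range(1, K + 1):
    F[(k, 0)] = np.ones(n)
    for m in range(1, M + 1):
        acc, PC = P @ F[(k - 1, m)], P.copy()
        for r in range(1, m + 1):
            PC = PC * C(u[M - m + r - 1])       # P ⊙ C^{u_{M-m+1}} ⊙ ... ⊙ C^{u_{M-m+r}}
            acc = acc + (PC @ F[(k - 1, m - r)]) / factorial(r)
        F[(k, m)] = acc

# Brute-force check at node 0: average <ℓ_M, φ(δ_1) ⊗ ... ⊗ φ(δ_K)> over all K-step walks.
def pairing(walk):
    """<u_1⊗...⊗u_M, exp(δ_1)⊗...⊗exp(δ_K)> via a small dynamic programme over how many
    of the u's have been matched after each step of the walk."""
    dp = np.zeros(M + 1); dp[0] = 1.0
    for s in range(1, len(walk)):
        inc, new = f[walk[s]] - f[walk[s - 1]], np.zeros(M + 1)
        for used in range(M + 1):
            for r in range(M - used + 1):
                coeff = np.prod([u[used + t] @ inc for t in range(r)]) / factorial(r)
                new[used + r] += dp[used] * coeff
        dp = new
    return dp[M]

def walks(i, k):
    if k == 0:
        yield [i], 1.0
    else:
        for w, p in walks(i, k - 1):
            for j in range(n):
                if A[w[-1], j]:
                    yield w + [j], p / deg[w[-1]]

brute = sum(p * pairing(w) for w, p in walks(0, K))
print(np.isclose(F[(K, M)][0], brute))   # True: the recursion matches the walk expectation
```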
To sum up, a graph labelling layer is determined by the random walk length $k$, the maximal tensor degree $M$, and the maximal tensor rank $R$; the functionals themselves are then found by optimization. Using a single layer of low-rank functionals limits the expressiveness, but stacking layers allows us, in practice, to approximate general high-degree functionals. Some theoretical results can be found in [62]; however, here we simply appeal to the analogy with NNs, where stacking simple transformations provides a flexible functional class with good inductive bias.

5 Experiments

We implemented the above approach and call the resulting model Graph2Tens Networks, since it represents the neighbourhood of a node as a sequence of tensors, which is further pushed through a linear mapping constrained to low tensor rank, similarly to how neural networks linearly transform their inputs pre-activation. A conceptual difference is that in our case the non-linearity is applied first and the projection second, although the computation is coupled between these steps. We provide further experiments and ablation studies of our models in Appendix H.

Experimental Setup. The aim of our main experiment is to test the following key properties of our model: (1) the ability to capture long-range interactions between nodes in a graph, and (2) robustness to pooling operations, which makes it less susceptible to the over-squashing phenomenon [2]. We do this by following the experiments in [70]. In particular, we show that our model is competitive with previous approaches for retaining long-range context in graph-level learning tasks, but without computing all pairwise interactions between nodes, thus keeping the influence distribution localized [73]. We further give a detailed ablation study in Appendix H.2 to show the robustness of our model to various architectural choices. As a second experiment, we follow previous applications of diffusion approaches to graphs, which have mostly considered inductive learning tasks, e.g. on the citation datasets [13, 60, 11]. Our experiments on these datasets are available in Appendix H.3, where the model performs on par with short-range GNN models but does not seem to benefit from added long-range information a priori. However, when labels are dropped in a k-hop sanitized way as in [50], the performance decrease is less pronounced.

Datasets. We use two biological graph classification datasets (NCI1 and NCI109) that contain around 4000 biochemical compounds represented as graphs with 30 nodes on average [67, 1]. The task is to predict whether a compound exhibits anti-lung-cancer activity. The dataset is split in a ratio of 80%/10%/10% for training, validation and testing. Previous work [2] has found that GNNs that only summarize local structural information can be greatly outperformed by models that are able to account for global contextual relationships through the use of fully-adjacent layers. This was further improved on by [70], where a local neighbourhood encoder consisting of a GNN stack was upgraded with a Transformer submodule [64] for learning global interactions.

Model Details. We build a GNN architecture primarily motivated by the GraphTrans (small) model from [70], and only tune the pre- and post-processing layer(s), random walk length, functional degree, and optimization settings.
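The G2TN layers referred to below are, in essence, the graph labelling layers of Section 4. As a minimal illustration (not the released architecture, which additionally uses attention, JK-connections, normalization, and a mixing layer for rank-$R$ functionals), the following sketch wraps the recursion of Theorem 3 into a label update with several rank-1 functionals and stacks two such updates; all weights are random here, whereas in the model they are learned by gradient descent.

```python
import numpy as np
from math import factorial

def g2tn_layer(A, X, U, K=3, M=2):
    """A single graph labelling layer: output channel j at node i is <ℓ_j, Φ̂_K(i)> for a rank-1,
    degree-M functional ℓ_j built from U[j] = (u_1, ..., u_M), computed via Theorem 3."""
    n, deg = len(A), A.sum(1)
    P = A / deg[:, None]
    out = []
    for u in U:                                   # one rank-1 functional per output channel
        F = {(0, 0): np.ones(n), **{(0, m): np.zeros(n) for m in range(1, M + 1)}}
        for k in range(1, K + 1):
            F[(k, 0)] = np.ones(n)
            for m in range(1, M + 1):
                acc, PC = P @ F[(k - 1, m)], P.copy()
                for r in range(1, m + 1):
                    s = X @ u[M - m + r - 1]
                    PC = PC * (A * (s[None, :] - s[:, None]))   # accumulate P ⊙ C^{u_...}
                    acc = acc + (PC @ F[(k - 1, m - r)]) / factorial(r)
                F[(k, m)] = acc
        out.append(F[(K, M)])                     # top-degree component; variants may mix degrees
    return np.stack(out, axis=1)                  # shape (n, number of channels): new node labels

# Stack two layers on a toy graph; the functionals are random here but learned in the model.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
X = rng.normal(size=(4, 3))                       # initial node attributes, d = 3
U1 = rng.normal(size=(8, 2, 3))                   # 8 channels, degree M = 2, input dim 3
h1 = g2tn_layer(A, X, U1, K=3, M=2)
U2 = rng.normal(size=(4, 2, 8))                   # next layer consumes the 8-dimensional labels
h2 = g2tn_layer(A, h1, U2, K=3, M=2)
print(h2.shape)                                   # (4, 4)
```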
In detail, a preprocessing MLP layer with 128 hidden units is followed by a stack of 4 G2TN layers, each with random walk length 5, maximal rank 128, and maximal tensor degree 2, all equipped with JK-connections [73] and a max aggregator. Afterwards, the node features are combined into a graph-level representation using gated attention pooling [41]. The pooled features are transformed by a final MLP layer with 256 hidden units and then fed into a softmax classification layer. The pre- and post-processing MLP layers employ skip-connections [30]. Both MLP and G2TN layers are followed by layer normalization [3], where G2TN layers normalize their rank-1 functionals independently across different tensor degrees, which corresponds to a particular realization of group normalization [69]. We randomly drop 10% of the features of all hidden layers during training [58]. The attentional variant, G2T(A)N, also randomly drops 10% of its edges and uses 8 attention heads [65]. Training is performed by minimizing the categorical cross-entropy loss with an $\ell_2$ regularization penalty of $10^{-4}$. For optimization, Adam [32] is used with a batch size of 128 and an initial learning rate of $10^{-3}$ that is decayed via a cosine annealing schedule [42] over 200 epochs. Further intuition about the model and the architectural choices is available in Appendix H.1.

Baselines. We compare against (1) the baseline models reported in [70], (2) variations of GraphTrans, and (3) other recently proposed hierarchical approaches for long-range graph tasks [50]. Groups of models in Table 1 are separated by dashed lines if they were reported in separate papers, and the first citation after the model name is where the result first appeared. The number of GNN layers in HGNet is not discussed by [50], and we report it as implied by their code. We organize the models into three groups divided by solid lines: (a) baselines that only apply neighbourhood aggregations and hierarchical or global pooling schemes, (b) baselines that first employ a local neighbourhood encoder and afterwards fully densify the graph in one way or another so that all nodes interact with each other directly, and (c) our models, which we emphasize thematically belong to (a).

Table 1: Comparison of classification accuracies on the NCI biological datasets; we report mean and standard deviation over 10 random seeds for our models.

| Model | GNN Type | GNN Count | NCI1 (%) | NCI109 (%) |
| --- | --- | --- | --- | --- |
| Set2Set [40, 66] | GCN | 3 | 68.6 ± 1.9 | 69.8 ± 1.2 |
| SortPool [40, 75] | GCN | 3 | 73.8 ± 1.0 | 74.0 ± 1.2 |
| SAGPool_h [40] | GCN | 3 | 67.5 ± 1.1 | 67.9 ± 1.4 |
| SAGPool_g [40] | GCN | 3 | 74.2 ± 1.2 | 74.1 ± 0.8 |
| GIN [21, 72] | GIN | 8 | 80.0 ± 1.4 | - |
| GCN + VN [74, 24] | GCN | 2 | 71.5 | - |
| HGNet-EdgePool [74, 54] | GCN+RGCN | 3 + 2 | 77.1 | - |
| HGNet-Louvain [74, 54] | GCN+RGCN | 3 + 2 | 75.1 | - |
| GIN + FA [2, 72] | GIN | 8 | 81.5 ± 1.2 | - |
| GraphTrans (small) [70, 64] | GCN | 3 | 81.3 ± 1.9 | 79.2 ± 2.2 |
| GraphTrans (large) [70, 64] | GCN | 4 | 82.6 ± 1.2 | 82.3 ± 2.6 |
| G2T(A)N (ours) | G2T(A)N | 4 | 81.9 ± 1.2 | 78.0 ± 2.3 |
| G2TN (ours) | G2TN | 4 | 80.7 ± 2.5 | 78.9 ± 2.5 |

Results. In Table 1, we report the mean and standard deviation of classification accuracy computed over 10 different seeds. Overall, both our models improve over all baselines in group (a) on both datasets, by up to 1.9% on NCI1 and 4.8% on NCI109. In group (b), G2T(A)N is outperformed only by GraphTrans (large) on NCI1, and only by 0.7%. Interestingly, the attention-free variant, G2TN, performs better on NCI109, where it comes in only very slightly below GraphTrans (small).
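For reference, the hyperparameters described under Model Details above are collected into a single configuration sketch; the field names are illustrative and do not correspond to the released code's API.

```python
from dataclasses import dataclass

@dataclass
class G2TNConfig:
    """Hyperparameters of the NCI experiments as described above; field names are illustrative."""
    pre_mlp_units: int = 128          # preprocessing MLP
    num_g2tn_layers: int = 4          # stacked G2TN layers
    walk_length: int = 5              # random walk length per layer
    max_tensor_degree: int = 2        # maximal tensor degree M
    max_tensor_rank: int = 128        # maximal tensor rank R
    jk_aggregator: str = "max"        # JK-connections with a max aggregator
    pooling: str = "gated_attention"  # graph-level readout
    post_mlp_units: int = 256         # final MLP before the softmax classifier
    dropout: float = 0.1              # feature dropout on all hidden layers
    attention_heads: int = 8          # G2T(A)N variant only (which also drops 10% of edges)
    l2_penalty: float = 1e-4
    batch_size: int = 128
    learning_rate: float = 1e-3       # Adam, decayed with cosine annealing
    epochs: int = 200

print(G2TNConfig())
```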
Discussion. The previous experiments demonstrate that our approach performs very favourably on long-range reasoning tasks compared to GNN-based alternatives without global pairwise node interactions. Several of the works we compare against have focused on extending GNNs to larger neighbourhoods via specifically designed graph coarsening and pooling operations, and we emphasize two important points: (1) our approach can efficiently capture large neighbourhoods without any need for coarsening, and (2) it already performs well with simple mean pooling, as justified by Theorem 2 and experimentally supported by the ablation studies in Appendix H.2. Although the Transformer-based GraphTrans slightly outperforms our model, potentially due to its ability to learn global interactions, it is not entirely clear how much of the global graph structure it is able to infer from interactions of short-range neighbourhood summaries. Finally, Transformer models can be bottlenecked by their quadratic complexity in the number of nodes, while our approach only scales with the number of edges; hence it can be more favourable for large sparse graphs in terms of computation.

6 Conclusion

Inspired by classical results from analysis [23], we introduce the hypo-elliptic graph Laplacian. This yields a diffusion equation that generalizes the classical probabilistic interpretation via random walks, but now takes the walk history into account. In addition to several attractive theoretical guarantees, we provide scalable algorithms. Our experiments show that this can lead to large improvements over baselines for long-range reasoning tasks. A promising future research theme is to develop, in this hypo-elliptic context, the improvements known for the classical Laplacian, including lazy random walks [71], nonlinear diffusions [12], and source/sink terms [60]. Another theme could be to extend the geometric study [61] of over-squashing to this hypo-elliptic point of view, which is naturally tied to sub-Riemannian geometry [59]. A limitation of our theoretical results is that only partial results exist for iterations of low-rank approximations, and expanding these is an interesting (algebra-heavy) topic.

Acknowledgments and Disclosure of Funding

Csaba Toth was supported by a Mathematical Institute Award from the University of Oxford. Celia Hacker and Darrick Lee were supported by NCCR-Synapsy Phase-3 SNSF grant number 51NF40185897. Darrick Lee and Harald Oberhauser were supported by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). Harald Oberhauser was also supported by the DataSig Program [EP/S026347/1] and the Oxford-Man Institute.

References

[1] PubChem. https://pubchem.ncbi.nlm.nih.gov/.
[2] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.
[5] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 29, 2016.
[6] Patric Bonnier, Chong Liu, and Harald Oberhauser. Adapted topologies and higher rank signatures. arXiv preprint arXiv:2005.08897, 2020.
[7] Karsten Borgwardt, Elisabetta Ghisu, Felipe Llinares-López, Leslie O Bray, and Bastian Rieck. Graph kernels: State-of-the-art and future challenges. ar Xiv preprint ar Xiv:2011.03854, 2020. [8] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. ar Xiv:2104.13478 [cs, stat], May 2021. [9] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Le Cun. Spectral networks and locally connected networks on graphs. ar Xiv preprint ar Xiv:1312.6203, 2013. [10] J. Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283 319, September 1970. [11] Benjamin Paul Chamberlain, James Rowbottom, Davide Eynard, Francesco Di Giovanni, Xiaowen Dong, and Michael M. Bronstein. Beltrami flow and neural diffusion on graphs. Co RR, abs/2110.09443, 2021. [12] Benjamin Paul Chamberlain, James Rowbottom, Maria I. Gorinova, Stefan D. Webb, Emanuele Rossi, and Michael M. Bronstein. GRAND: Graph Neural Diffusion. In The Symbiosis of Deep Learning and Differential Equations, September 2021. [13] Benjamin Paul Chamberlain, James R. Rowbottom, Maria I. Gorinova, Stefan Webb, Emanuele Rossi, and Michael M. Bronstein. Grand: Graph neural diffusion. In ICML, 2021. [14] Dexiong Chen, Laurent Jacob, and Julien Mairal. Convolutional kernel networks for graph- structured data. In International Conference on Machine Learning, pages 1576 1586. PMLR, 2020. [15] Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with gnns. Advances in neural information processing systems, 32, 2019. [16] Ilya Chevyrev and Harald Oberhauser. Signature moments to characterize laws of stochastic processes. ar Xiv:1810.10971 [math, stat], 2018. [17] Thomas Cochrane, Peter Foster, Varun Chhabra, Maud Lemercier, Terry Lyons, and Cristopher Salvi. Sk-tree: a systematic malware detection algorithm on streaming trees via the signature kernel. In 2021 IEEE International Conference on Cyber Security and Resilience (CSR), pages 35 40. IEEE, 2021. [18] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29, 2016. [19] Persi Diaconis. Group representations in probability and statistics. Lecture notes-monograph series, 11:i 192, 1988. [20] Ahmed A. A. Elhag, Gabriele Corso, Hannes Stärk, and Michael M. Bronstein. Graph anisotropic diffusion, 2022. [21] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In International Conference on Learning Representations, 2019. [22] Michel Fliess. Fonctionnelles causales non linéaires et indéterminées non commutatives. Bulletin de la société mathématique de France, 109:3 40, 1981. [23] Bernard Gaveau. Principe de moindre action, propagation de la chaleur et estimees sous elliptiques sur certains groupes nilpotents. Acta Mathematica, 139(none):95 153, January 1977. [24] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pages 1263 1272. PMLR, 2017. [25] Daniele Grattarola and Cesare Alippi. Graph neural networks in tensorflow and keras with spektral [application notes]. 
IEEE Computational Intelligence Magazine, 16(1):99 106, 2021. [26] Alexander Grigoryan. Heat kernel and analysis on manifolds, volume 47. American Mathemat- ical Soc., 2009. [27] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855 864, 2016. [28] Ben Hambly and Terry Lyons. Uniqueness for the signature of a path of bounded variation and the reduced path group. Ann. of Math., 171(1):109 167, 2010. [29] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. [30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. [31] Lars Hörmander. Hypoelliptic second order differential equations. Acta Mathematica, 119:147 171, 1967. [32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, [33] Thomas Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations, 2017. [34] Franz J. Kiraly and Harald Oberhauser. Kernels for sequentially ordered data. J. Mach. Learn. Res., 20(31):1 45, 2019. [35] Johannes Klicpera, Stefan Weissenberger, and Stephan Günnemann. Diffusion improves graph learning. Ar Xiv, abs/1911.05485, 2019. [36] Risi Kondor. Diffusion kernels on graphs and other discrete structures. In ICML 2002, 2002. [37] Yann Le Cun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995. [38] Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. [39] Darrick Lee and Robert Ghrist. Path Signatures on Lie Groups. ar Xiv:2007.06633 [cs, math, stat], 2020. [40] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. In International conference on machine learning, pages 3734 3743. PMLR, 2019. [41] Yujia Li, Richard Zemel, Marc Brockschmidt, and Daniel Tarlow. Gated graph sequence neural networks. In Proceedings of ICLR 16, 2016. [42] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017. [43] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. Advances in neural information processing systems, 32, 2019. [44] Ron Milo, Shai S. Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri B. Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298 5594:824 7, 2002. [45] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5115 5124, 2017. [46] Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gau- rav Rattan, and Martin Grohe. Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4602 4609, July 2019. [47] Galileo Namata, Ben London, Lise Getoor, Bert Huang, and U Edu. Query-driven active surveying for collective classification. 
In 10th International Workshop on Mining and Learning with Graphs, volume 8, page 1, 2012. [48] Giannis Nikolentzos and Michalis Vazirgiannis. Random walk graph neural networks. Advances in Neural Information Processing Systems, 33:16211 16222, 2020. [49] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social repre- sentations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701 710, 2014. [50] Ladislav Rampášek and Guy Wolf. Hierarchical graph neural nets can capture long-range interactions. In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), pages 1 6. IEEE, 2021. [51] Linda Preiss Rothschild and Elias M. Stein. Hypoelliptic differential operators and nilpotent groups. Acta Mathematica, 137:247 320, 1976. [52] Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral clustering of graphs with the bethe hessian. In NIPS, 2014. [53] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61 80, 2008. [54] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593 607. Springer, 2018. [55] Alice C Schwarze and Mason A Porter. Motifs for processes on networks. SIAM Journal on Applied Dynamical Systems, 20(4):2516 2557, 2021. [56] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi- Rad. Collective classification in network data. AI magazine, 29(3):93 93, 2008. [57] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. ar Xiv preprint ar Xiv:1811.05868, 2018. [58] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929 1958, 2014. [59] Robert S Strichartz. Sub-riemannian geometry. Journal of Differential Geometry, 24(2):221 263, [60] Matthew Thorpe, Tan Minh Nguyen, Hedi Xia, Thomas Strohmer, Andrea Bertozzi, Stanley Osher, and Bao Wang. GRAND++: Graph neural diffusion with a source term. In International Conference on Learning Representations, 2022. [61] Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature, 2021. [62] Csaba Toth, Patric Bonnier, and Harald Oberhauser. Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections. In International Conference on Learning Representations, 2021. [63] N Th Varopoulos. Analysis and geometry on groups. Cambridge Tracts in Math., 100, 1992. [64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [65] Petar Veliˇckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. [66] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In ICLR, 2016. [67] Nikil Wale, Ian A. Watson, and George Karypis. 
Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347 375, March 2008. [68] Wolfgang Woess. Random Walks on Infinite Graphs and Groups. Cambridge Tracts in Mathe- matics. Cambridge University Press, Cambridge, 2000. [69] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3 19, 2018. [70] Zhanghao Wu, Paras Jain, Matthew Wright, Azalia Mirhoseini, Joseph E Gonzalez, and Ion Stoica. Representing long-range context for graph neural networks with global attention. Advances in Neural Information Processing Systems, 34:13266 13279, 2021. [71] Louis-Pascal A. C. Xhonneux, Meng Qu, and Jian Tang. Continuous graph neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 20, pages 10432 10441. JMLR.org, July 2020. [72] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. [73] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pages 5453 5462. PMLR, 2018. [74] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, pages 4805 4815, Red Hook, NY, USA, December 2018. Curran Associates Inc. [75] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-second AAAI conference on artificial intelligence, 2018. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] While our approach is less susceptible to the over-squashing phenomenon and empirically performs well on learning long-range interactions, we find that our methods are less beneficial for inductive learning tasks. This is stated explicitly in Section 5, with experiments shown in Appendix H.3. (c) Did you discuss any potential negative societal impacts of your work? [No] This article introduce a method which can improve the performance on graph neural networks. We don t think there are any foreseeable negative societal impacts beyond generic points that apply to any graph learning method. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] All theorem statements contain the full set of assumptions. (b) Did you include complete proofs of all theoretical results? [Yes] Proofs to all of our theoretical results are contained in the appendix, with specific references to sections within the main text. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experi- mental results (either in the supplemental material or as a URL)? [Yes] , see Appendix H (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] This is detailed in Section 5, with further discussion in Appendix H.2. 
(c) Did you report error bars (e.g., with respect to the random seed after running exper- iments multiple times)? [Yes] All experiments are run using multiple seeds, and the standard deviation over these seeds is provided. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix H. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes] We include it in our code. (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include code as part of supplementary material. (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]