# Directional Graph Networks

Dominique Beaini *1, Saro Passaro *2, Vincent Létourneau 1 3, William L. Hamilton 3 4, Gabriele Corso 2, Pietro Liò 2

*Equal contribution. 1InVivo AI, Montreal, Canada. 2Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom. 3MILA, Montreal, Canada. 4McGill University, Montreal, Canada. Correspondence to: Dominique Beaini, Saro Passaro. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

The lack of anisotropic kernels in graph neural networks (GNNs) strongly limits their expressiveness, contributing to well-known issues such as over-smoothing. To overcome this limitation, we propose the first globally consistent anisotropic kernels for GNNs, allowing for graph convolutions that are defined according to topologically-derived directional flows. First, by defining a vector field in the graph, we develop a method of applying directional derivatives and smoothing by projecting node-specific messages into the field. Then, we propose the use of the Laplacian eigenvectors as such vector field. We show that the method generalizes CNNs on an n-dimensional grid and is provably more discriminative than standard GNNs with respect to the Weisfeiler-Lehman (1-WL) test. We evaluate our method on different standard benchmarks and see a relative error reduction of 8% on the CIFAR10 graph dataset, of 11% to 32% on the molecular ZINC dataset, and a relative increase in precision of 1.6% on the MolPCBA dataset. An important outcome of this work is that it enables graph networks to embed directions in an unsupervised way, thus allowing a better representation of the anisotropic features in different physical or biological problems.

1. Introduction

One of the most important distinctions between convolutional neural networks (CNNs) and graph neural networks (GNNs) is that CNNs allow for any convolutional kernel, while most GNN methods are limited to symmetric kernels (also called isotropic kernels) (Kipf & Welling, 2016; Gilmer et al., 2017). There are some implementations of asymmetric kernels using gated mechanisms (Bresson & Laurent, 2017; Veličković et al., 2017), motif attention (Peng et al., 2019), edge features (Gilmer et al., 2017), port numbering (Sato et al., 2019) or the 3D structure of molecules (Klicpera et al., 2019). However, to the best of our knowledge, there are currently no methods that allow asymmetric graph kernels that depend on the full graph structure or on directional flows; existing approaches depend only on local structures or local features. This is in contrast to images, which exhibit canonical directions: the horizontal and vertical axes. The absence of an analogous concept in graphs makes it difficult to define directional message passing and to produce an analogue of the directional frequency filters (or Gabor filters) widely used in image processing (Olah et al., 2020). In fact, there is ample evidence that directional filtering is fundamental to image processing (Kang et al., 2017; Antoine & Murenzi, 1996; Yue Lu & Minh N. Do, 2005). We propose a novel idea for GNNs: use vector fields in the graph to define directions for the propagation of information. An overview of this framework is presented in figure 1.
Using this approach, the usual message-passing structure of a GNN is projected onto globally-defined directions so that the contribution of each neighbouring node nv is weighted by its alignment with the vector fields at the receiving node nu. This enables our method to propagate information via directional derivatives or smoothing of the features.

In order to define globally consistent directional fields over general graphs, we propose to use the gradients of the low-frequency eigenvectors φk of the graph Laplacian, since they are known to capture key information about the global structure of graphs (Chavel, 1984; Chung et al., 1997; Grebenkov & Nguyen, 2013). In particular, these eigenvectors can be used to define optimal partitions of the nodes in a graph, to give a natural ordering (Levy, 2006), and to find the dominant directions of the graph diffusion process (Chung & Yau, 2000; Saerens et al., 2004). Further, we show that they generalize the horizontal and vertical directional flows in a grid (see figure 2), allowing them to guide the aggregation and mimic the asymmetric and directional kernels present in computer vision. In fact, we demonstrate mathematically that our work generalizes CNNs, by reproducing all convolutional kernels of radius R in an n-dimensional grid, while also bringing the powerful data augmentation capabilities of reflection, rotation or distortion of the directions. Additionally, we also prove that our directional graph networks (DGNs) are more discriminative than standard GNNs with regard to the Weisfeiler-Lehman 1-WL test, confirming an increase of expressiveness.

Figure 1. Overview of the steps required to aggregate messages in the direction of the eigenvectors. Pre-computed steps: the a-directional adjacency matrix is given as input; the Laplacian matrix is computed; its eigenvectors are computed and sorted by eigenvalue, keeping the first k non-constant ones with a complexity of O(Ek); the gradients of the eigenvectors define the vector fields over the edges (alternatively, a known direction of the graph can be encoded as a field), with each row of a field normalized by its L1-norm; the directional smoothing and directional derivative aggregation matrices are created from the normalized fields. Graph neural network steps: the node feature matrix is aggregated via matrix products with the a-directional and directional aggregation matrices, taking the absolute value of the directional derivative aggregation due to the sign ambiguity of the eigenvectors; the results are column-concatenated; as in GCN, each aggregation is followed by a multi-layer perceptron (MLP) applied to the columns of the concatenated features, the only step with learned parameters, before being passed to the next GNN layer.
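To make the pre-computed steps of figure 1 concrete, the following is a minimal NumPy sketch of that pipeline under simplifying assumptions (dense matrices, an unweighted graph with no isolated nodes); the function name and structure are ours and are not taken from the released code.

```python
import numpy as np

def dgn_precompute(A, k=2, eps=1e-8):
    """Pre-computed steps of figure 1: normalized Laplacian, first k non-trivial
    eigenvectors, their gradient fields, and the two aggregation matrices."""
    n = len(A)
    L_norm = np.eye(n) - A / A.sum(axis=1, keepdims=True)    # L_norm = I - D^{-1} A
    eigval, eigvec = np.linalg.eig(L_norm)
    order = np.argsort(eigval.real)
    phis = eigvec.real[:, order[1:k + 1]]                    # skip the constant phi_0
    fields, B_av, B_dx = [], [], []
    for i in range(k):
        F = (phis[None, :, i] - phis[:, None, i]) * A        # F_uv = phi(v) - phi(u) on edges
        F_hat = F / (np.abs(F).sum(1, keepdims=True) + eps)  # row-wise L1 normalization
        fields.append(F)
        B_av.append(np.abs(F_hat))                           # directional smoothing matrix
        B_dx.append(F_hat - np.diag(F_hat.sum(1)))           # directional derivative matrix
    return fields, B_av, B_dx
```

A full implementation would instead use sparse matrices and an iterative eigensolver for the first k eigenvectors, which is what would give the O(Ek) pre-computation cost mentioned in the caption.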
We further show that our DGN model theoretically and empirically allows for efficient message passing across distant communities, which counteracts the well-known problem of over-smoothing in GNNs. Alternative methods reduce the impact of over-smoothing by using skip connections (Luan et al., 2019), global pooling (Alon & Yahav, 2020), or randomly dropping edges during training time (Rong et al., 2020), but without solving the underlying problem.

Our method distinguishes itself from other spectral GNNs since the literature usually uses the low frequencies to estimate local Fourier transforms in the graph (Levie et al., 2018; Xu et al., 2019). Instead, we do not try to approximate the Fourier transform, but only to define a directional flow at each node and guide the aggregation.

We tested our method on 5 standard datasets from (Dwivedi et al., 2020) and (Hu et al., 2020), using two types of architectures, and either using or ignoring edge features. In all cases, we observed state-of-the-art results from the proposed DGN, with relative improvements of 8% on CIFAR10, 11-32% on ZINC, 0.8% on MolHIV and 1.6% on MolPCBA. Most of the improvement is attributed to the directional derivative aggregator, highlighting our method's ability to capture directional high-frequency signals in graphs.

2. Theoretical development

2.1. Intuitive overview

One of the biggest limitations of current GNN methods compared to CNNs is the inability to do message passing in a specific direction, such as the horizontal one in a grid graph. In fact, it is difficult to define directions or coordinates based solely on the shape of the graph. The lack of directions strongly limits the discriminative abilities of GNNs to understand local structures and simple feature transformations. Most GNNs are invariant to the permutation of the neighbours' features, so the signal received by a node is not influenced by swapping the features of two neighbours. Therefore, several layers in a deep network will be employed to understand these simple changes instead of being used for higher-level features, leading to problematic phenomena such as over-squashing (Alon & Yahav, 2020).

Figure 2. Possible directional flows in different types of graphs. The node coloring is a potential map and the edges represent the gradient of the potential, with the arrows in the direction of the flow. The first 3 columns present the arccosine of the normalized eigenvectors (acos φ̂) as node coloring, and their gradients represented as edge intensity. The last column presents examples of inductive bias introduced in the choice of direction. (a) The eigenvectors 1 and 2 are the horizontal and vertical flows of the grid. (b) The eigenvectors 1 and 2 are the flows in the longest and second-longest directions. (c) The eigenvectors 1, 2 and 3 flow respectively in the South-North, suburbs-to-city-center and West-East directions. We ignore φ0 since it is constant and has no direction.

In the first part of the theoretical development, we develop the mathematical theory for general vector fields F. Intuitively, defining a vector field over a graph corresponds to assigning a scalar weight to each edge corresponding to the magnitude of the flow in that direction. Note that F has the same shape as the adjacency matrix and the same zero entries. As an example, a left-to-right flow in a grid corresponds to a matrix with positive values over all left-to-right edges, negative values over the right-to-left edges and 0 on the vertical edges.
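As a tiny illustration (ours, not from the paper), the following lines encode such a left-to-right field on a 2x3 grid, where F has the same sparsity pattern as the adjacency matrix:

```python
import numpy as np

# Nodes of a 2x3 grid, numbered row-major:
# 0-1-2
# | | |
# 3-4-5
F = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (3, 4), (4, 5)]:   # horizontal edges
    F[u, v], F[v, u] = +1.0, -1.0               # positive left-to-right, negative back
# Vertical edges (0,3), (1,4), (2,5) keep F = 0: no flow across them.
print(F)
```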
In the second part, we set F to be the gradient of the low-frequency eigenvectors of the Laplacian. Using this directional field, we show that the expressiveness of GNNs can be improved, while providing intuitive directional flows over a variety of graphs (see figure 2). For example, we prove that in grid-shaped graphs some of these eigenvectors correspond to the horizontal and vertical flows. Similarly, we observe in the Minnesota map that the first 3 non-constant eigenvectors produce logical directions, namely South/North, suburb/city, and West/East.

Another important contribution also noted in figure 2 is the ability to define any kind of directional flow based on prior knowledge of the problem. Hence, instead of relying on eigenvectors to find directions in a map, we can simply use the cardinal directions or the rush-hour traffic flow.

2.2. Overview of the theoretical contributions

Vector fields in a graph. Using directions in a graph is novel and not intuitive, so our first step is to define a simple nomenclature where we use a vector field to define a directional flow at each node.

Directional smoothing and derivatives. To make use of vector fields over graphs, we define aggregation matrices that can either smooth the signal (low-pass filter) or compute its derivative (high-pass filter) according to the directions specified by the vector field.

Gradient of the Laplacian eigenvectors. We show that using the gradient of the low-frequency eigenvectors of the graph Laplacian generates interpretable vector fields that counteract the over-smoothing problem.

Generalization of CNNs. We demonstrate that, when applied to a grid graph, the eigenvector-based directional aggregation generalizes convolutional neural networks.

Comparison to the Weisfeiler-Lehman (WL) test. We prove that the proposed DGN is more expressive than the 1-WL test, and thus more expressive than ordinary GNNs.

2.3. Vector fields in a graph

This section presents the ideas of differential geometry applied to graphs, with the goal of finding proper definitions of scalar products, gradients and directional derivatives. For reference, see for example (Bronstein et al., 2017; Grebenkov & Nguyen, 2013; Grady & Polimeni, 2010).

Let $G = (V, E)$ be a graph with $V$ the set of vertices and $E \subseteq V \times V$ the set of edges. The graph is undirected, meaning that $(i, j) \in E$ iff $(j, i) \in E$. Define the vector spaces $L^2(V)$ and $L^2(E)$ as the sets of maps $V \to \mathbb{R}$ and $E \to \mathbb{R}$, with $x, y \in L^2(V)$ and $F, H \in L^2(E)$, and scalar products

$$\langle x, y \rangle_{L^2(V)} := \sum_{i \in V} x(i)\, y(i), \qquad \langle F, H \rangle_{L^2(E)} := \sum_{(i,j) \in E} F_{(i,j)} H_{(i,j)}$$

Think of $E$ as the tangent space to $V$ and of $L^2(E)$ as the set of vector fields on the space $V$, with each row $F_{i,:}$ representing a vector at the $i$-th node, and the element $F_{i,j}$ being the component of the vector going from node $i$ to $j$ through edge $e_{ij}$. Note that, with $n$ the number of nodes in $G$, any $x \in L^2(V)$ can be represented as an $n$-dimensional vector and any $F \in L^2(E)$ can be represented as an $n \times n$ matrix.

Define the pointwise scalar product as the map $L^2(E) \times L^2(E) \to L^2(V)$ taking two vector fields and returning their inner product at each point of $V$; at the node $i$ it is defined by equation 2.

$$\langle F, H \rangle_i := \sum_{j : (i,j) \in E} F_{i,j} H_{i,j} \quad (2)$$

In equation 3, we define the gradient $\nabla$ as a mapping $L^2(V) \to L^2(E)$ and the divergence $\mathrm{div}$ as a mapping $L^2(E) \to L^2(V)$, thus leading to an analogue of the directional derivative in equation 4.

$$(\nabla x)_{(i,j)} := x(j) - x(i), \qquad (\mathrm{div}\, F)_i := \sum_{j : (i,j) \in E} F_{(i,j)} \quad (3)$$
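A small worked example (ours) of equation 3 in NumPy; the sanity check at the end uses the standard fact that the divergence of the gradient equals the negative combinatorial Laplacian applied to x.

```python
import numpy as np

def graph_gradient(x, A):
    """(grad x)_{(i,j)} = x(j) - x(i) on the edges of A (eq. 3), zero elsewhere."""
    return (x[None, :] - x[:, None]) * A

def graph_divergence(F, A):
    """(div F)_i = sum over j with (i,j) in E of F_{(i,j)} (eq. 3)."""
    return (F * A).sum(axis=1)

# Sanity check on the path graph 0-1-2 with features x.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([0.0, 1.0, 3.0])
G = graph_gradient(x, A)                              # antisymmetric: G[0,1] = 1, G[1,2] = 2
L = np.diag(A.sum(1)) - A                             # combinatorial Laplacian L = D - A
print(np.allclose(graph_divergence(G, A), -L @ x))    # div(grad x) = -L x
```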
Definition 1. The directional derivative of the function $x$ on the graph $G$ in the direction of the vector field $\hat{F}$, where each vector is of unit norm, is

$$D_{\hat{F}} x(i) := \langle \nabla x, \hat{F} \rangle_i = \sum_{j : (i,j) \in E} \big(x(j) - x(i)\big)\, \hat{F}_{i,j} \quad (4)$$

$|F|$ will denote the element-wise absolute value of $F$ and $\|F_{i,:}\|_{L^p}$ the $L^p$-norm of the $i$-th row of $F$. We also define the forward/backward directions as the positive/negative parts of the field $F$.

2.4. Directional smoothing and derivatives

Next, we show how the vector field $F$ is used to guide the graph aggregation by projecting the incoming messages. Specifically, we define the weighted aggregation matrices $B_{av}$ and $B_{dx}$ that compute the directional smoothing and directional derivative of the node features, as presented visually in figure 1-d.

The directional average matrix $B_{av}$ is the weighted aggregation matrix such that all weights are positive and all rows have an $L^1$-norm equal to 1, as shown in equation 5 and theorem 2.1, with a proof in appendix D.1.

$$B_{av}(F)_{i,:} = \frac{|F_{i,:}|}{\|F_{i,:}\|_{L^1} + \epsilon} \quad (5)$$

The variable $\epsilon$ is an arbitrarily small positive number used to avoid floating-point errors. The $L^1$-norm denominator is a local row-wise normalization. The aggregator works by assigning a large weight to the elements in the forward or backward direction of the field, while assigning a small weight to the other elements, with a total weight of 1.

Theorem 2.1 (Directional smoothing). The operation $y = B_{av} x$ is the directional average of $x$, in the sense that $y_u$ is the mean of $x_v$, weighted by the direction and amplitude of $F$. Here, $x_v$ are the features at the nodes $v$ neighbouring $u$, and $y_u$ is the directional smoothing at node $u$.

The directional derivative matrix $B_{dx}$ is defined in (6) and theorem 2.2, with the proof in appendix D.2. Again, the denominator is a local row-wise normalization, but it can be replaced by a global normalization. $\mathrm{diag}(a)$ is a square diagonal matrix with diagonal entries given by $a$. The aggregator works by subtracting the projected backward message from the forward message (similar to a centered derivative), with an additional diagonal term to balance both directions.

$$B_{dx}(F)_{i,:} = \hat{F}_{i,:} - \mathrm{diag}\Big(\sum_j \hat{F}_{:,j}\Big)_{i,:}, \qquad \hat{F}_{i,:} = \frac{F_{i,:}}{\|F_{i,:}\|_{L^1} + \epsilon} \quad (6)$$

Theorem 2.2 (Directional derivative). Suppose $\hat{F}$ has rows of unit $L^1$-norm. The operation $y = B_{dx}(\hat{F}) x$ is the centered directional derivative of $x$ in the direction of $F$, in the sense of equation 4, i.e.

$$y = D_{\hat{F}} x = \Big(\hat{F} - \mathrm{diag}\Big(\sum_j \hat{F}_{:,j}\Big)\Big)\, x$$

These aggregators are directional, interpretable and complementary, making them ideal choices for GNNs. We discuss the choice of aggregators in more detail in appendix A, while also providing alternative aggregation matrices such as the center-balanced smoothing, the forward-copy, the phantom zero-padding, and the hardening of the aggregators using softmax/argmax on the field. We further provide a visual interpretation of the $B_{av}$ and $B_{dx}$ aggregators in figure 3. Interestingly, we also note in appendix A.1 that $B_{av}$ and $B_{dx}$ yield respectively the mean and Laplacian aggregations when $F$ is a vector field such that all entries are constant, $F_{ij} = C$.
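The following short numerical check (ours, on a toy 5-cycle with a random potential) confirms that the matrix form of equation 6 reproduces the pointwise directional derivative of equation 4, and that the rows of $B_{av}$ from equation 5 sum to approximately 1:

```python
import numpy as np

n, eps = 5, 1e-8
A = np.zeros((n, n))
for i in range(n):                                    # 5-cycle
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
rng = np.random.default_rng(0)
x, phi = rng.normal(size=n), rng.normal(size=n)

F = (phi[None, :] - phi[:, None]) * A                 # field = gradient of phi on the edges
F_hat = F / (np.abs(F).sum(1, keepdims=True) + eps)   # unit L1 rows
B_av = np.abs(F_hat)                                  # eq. (5)
B_dx = F_hat - np.diag(F_hat.sum(1))                  # eq. (6)

# Theorem 2.2: (B_dx x)_i = sum_j (x_j - x_i) * F_hat_ij, i.e. eq. (4).
pointwise = np.array([sum((x[j] - x[i]) * F_hat[i, j] for j in range(n)) for i in range(n)])
print(np.allclose(B_dx @ x, pointwise))               # True
print(B_av.sum(1))                                    # each row sums to ~1 (theorem 2.1)
```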
2.5. Gradient of the Laplacian eigenvectors as interpretable vector fields

In this section we give theoretical support for the choice of gradients of the eigenfunctions of the Laplacian as sensible vectors along which to do directional message passing, since they are interpretable and help reduce over-smoothing. This section gives a theoretical ground to the intuitive directions presented in figure 2, and is the motivation behind steps (b-c) in figure 1.

Figure 3. Illustration of how the directional aggregation works at a node nv, with the arrows representing the direction and intensity of the field F: the directional smoothing aggregation is the absolute weighted sum of the neighbouring features divided by the sum of the absolute weights, while the directional derivative aggregation combines the weighted forward and backward derivatives.

As usual, the combinatorial, degree-normalized and symmetric-normalized Laplacians are defined as

$$L = D - A, \qquad L_{norm} = D^{-1} L, \qquad L_{sym} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}} \quad (7)$$

The eigenvectors of these matrices are known to capture many essential properties of graphs, making them a natural foundation for directional message passing. For example, the Laplacian eigenvectors corresponding to the smallest eigenvalues (i.e., the low-frequency eigenvectors) effectively capture the community structure of a graph, and these eigenvectors also play the role of Fourier modes in graph signal processing (Hamilton, 2020). Indeed, the Laplacian eigenvectors hold such rich information about graph structure that their study is the focus of the mathematical subfield of spectral graph theory (Chung et al., 1997).

In order to illustrate the utility of these eigenvectors in the context of GNNs, we show that the low-frequency eigenvectors provide a natural direction that allows us to pass messages between distant nodes in a graph. In particular, we show in theorem 2.3 (proved in appendix D.3) that by passing information in the direction of φ1, the eigenvector associated with the lowest non-trivial frequency of $L_{norm}$, DGNs can efficiently share information between distant nodes of the graph by reducing the diffusion distance between them. This idea is reflected in figure 2, where we see that the eigenvectors of the Laplacian give directions that correspond to a natural notion of distance on real-world graphs. In the next paragraphs, we will prove that following the gradient of the eigenvectors allows us to effectively reduce the heat-kernel distance between pairs of nodes.

Consider the transition matrix $W = D^{-1} A$. Its entries can be used to define a random walk with probability to move from node $x$ to node $y$ equal to $p_1(x, y) = \frac{1}{d_x}$ if $x$ and $y$ are neighbors and 0 if not. Notice that the probability to transition from $x$ to $y$ in $k$ steps is given by the $(x, y)$ entry of the matrix $W^k$. This matrix is also called the discrete heat kernel, $p_k(x, y) = (W^k)_{x,y}$. Given the Markov process $X_k$ defined by the transition matrices $W^k$, we can define a continuous-time random walk on the same graph in the following way. Let $N_t$ be a Poisson random variable of mean $t$; the continuous-time random walk is defined by $X_t := X_{N_t}$ with transition probability $q_t(x, y) = P(X_t = y \mid X_0 = x)$. In (Barlow, 2017), the following identity is shown:

$$q_t(x, y) = \sum_{k \geq 0} e^{-t} \frac{t^k}{k!}\, p_k(x, y)$$

or, in matrix form, $q_t = e^{t(W - I)} = e^{-t L_{norm}}$. This transition probability is also called the continuous-time heat kernel because it satisfies the continuous-time heat equation on graphs, $\frac{d}{dt} q_t = -L_{norm} q_t$.
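As a quick numerical illustration (ours), the matrix form of the continuous-time heat kernel can be checked against the truncated Poisson-weighted series on a small path graph:

```python
import math
import numpy as np
from scipy.linalg import expm

n = 6
A = np.diag(np.ones(n - 1), 1); A = A + A.T            # path graph 0-1-...-5
W = A / A.sum(1, keepdims=True)                        # random-walk transition matrix D^{-1} A
L_norm = np.eye(n) - W                                 # degree-normalized Laplacian

t = 1.5
q_t = expm(-t * L_norm)                                # q_t = e^{t(W - I)} = e^{-t L_norm}
series = sum(np.exp(-t) * t**k / math.factorial(k) * np.linalg.matrix_power(W, k)
             for k in range(40))                       # sum_k e^{-t} t^k / k! * p_k
print(np.allclose(q_t, series))                        # True up to truncation error
```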
In (Coifman & Lafon, 2006) the following distance is defined.

Definition 2 (Diffusion distance). The diffusion distance at time $t$ between the nodes $x, y$ is

$$d_t(x, y) := \Big( \sum_{z \in V} \big( q_t(x, z) - q_t(y, z) \big)^2 \Big)^{\frac{1}{2}}$$

The diffusion distance is small when there is a high probability that two random walks starting at $x$ and $y$ meet at time $t$. The diffusion distance is used as a model of how the data at a node $x$ influences a node $y$ in a GNN. The symmetrisation of the heat kernel in the diffusion distance and the use of continuous time are slight departures from the actual process of information diffusion in a GNN, but they allow us to describe the important phenomena with much simpler statements.

Definition 3 (Gradient step). Suppose the two neighboring nodes $x$ and $z$ are such that $\phi(z) - \phi(x)$ is maximal among the neighbors of $x$; then we will say $z$ is obtained from $x$ by taking a step in the direction of the gradient $\nabla\phi$.

Theorem 2.3 (Gradient steps reduce diffusion distance). Let $x, y$ be nodes such that $\phi_1(x) < \phi_1(y)$. Let $x'$ be the node obtained from $x$ by taking one step in the direction of $\nabla\phi_1$; then there is a constant $C$ such that for $C \leq t$ we have $d_t(x', y) < d_t(x, y)$, with the reduction in distance being proportional to $e^{-\lambda_1}$.

From this theorem, we see that moving from node $x$ to node $x'$ by following the gradient of the eigenvector $\phi_1$ is guaranteed to reduce the heat-kernel distance to a destination node $y$. While the theorem always holds for $\phi_1$, it should be true for higher-frequency eigenvectors if the graph has added structure, for example if it is an approximation of a surface or a higher-dimensional manifold.

In the context of GNNs, theorem 2.3 also has implications for the well-known problems of over-smoothing and over-squashing (Alon & Yahav, 2020; Hamilton, 2020). In most GNN models, node representations become over-smoothed after several rounds of message passing, as the representations tend to reach a mean-field equilibrium equivalent to the stationary distribution of a random walk (Hamilton, 2020). Researchers have also highlighted the related issue of over-squashing, which reflects the inability of GNNs to propagate informative signals between distant nodes in a graph (Alon & Yahav, 2020). Both these problems are related to the fact that the influence of one node's input on the final representation of another node in a GNN is correlated with the diffusion distance between the nodes (Xu et al., 2018b). Theorem 2.3 highlights how the DGN approach can alleviate these issues. In particular, the Laplacian eigenfunctions reveal directions that can counteract over-smoothing and over-squashing by allowing efficient propagation of information between distant nodes instead of following a diffusion process. Finally, it is interesting to note that by selecting different eigenvectors as a basis of directions, our method further aligns with a theorem that multiple independent aggregators are needed to distinguish neighbourhoods of nodes with continuous features (Corso et al., 2020).
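To make definitions 2-3 and theorem 2.3 concrete, here is an illustrative sketch (ours) on a path graph; it simply prints the two diffusion distances, which theorem 2.3 predicts are ordered with the post-gradient-step distance being the smaller one once t is large enough:

```python
import numpy as np
from scipy.linalg import expm

n = 10
A = np.diag(np.ones(n - 1), 1); A = A + A.T
L_norm = np.eye(n) - A / A.sum(1, keepdims=True)
eigval, eigvec = np.linalg.eig(L_norm)
phi1 = eigvec.real[:, np.argsort(eigval.real)[1]]            # lowest non-trivial eigenvector

def diffusion_distance(qt, x, y):                            # definition 2
    return np.sqrt(((qt[x] - qt[y]) ** 2).sum())

x, y = (0, n - 1) if phi1[0] < phi1[-1] else (n - 1, 0)      # ensure phi1(x) < phi1(y)
neighbours = np.flatnonzero(A[x])
x_prime = neighbours[np.argmax(phi1[neighbours] - phi1[x])]  # gradient step (definition 3)

qt = expm(-3.0 * L_norm)                                     # heat kernel at t = 3
print(diffusion_distance(qt, x_prime, y), diffusion_distance(qt, x, y))
```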
2.6. Choosing a basis of the Laplacian eigenspace

When using eigenvectors of the Laplacian $\phi_i$ to define directions in a graph, we need to keep in mind that there is never a single eigenvector associated with an eigenvalue, but a whole eigenspace. If an eigenvalue has multiplicity $k$, the associated eigenspace has dimension $k$, and any collection of $k$ orthogonal vectors could be chosen as a basis of that space and as vectors for the definition of the aggregation matrices $B$ defined in the previous sections.

Disconnected graphs. When a graph is disconnected, the eigenfunctions are simply the combination of the eigenfunctions of each connected component. Hence, one must consider $\phi_i$ as the $i$-th eigenvector of each component taken separately.

Normalizing the eigenvectors. For an eigenvalue of multiplicity 1, there are always two unit-norm eigenvectors of opposite sign, which poses a problem during the directional aggregation. We can make a choice of sign and later take the absolute value (i.e. $B_{av}$ in equation 5). An alternative that applies to multiplicities higher than 1 is to take samples of orthonormal bases of the eigenspace and use each choice to augment the training (see section 2.10).

Multiplicities greater than 1. Although multiplicities higher than one do happen for low frequencies (square grids have a multiplicity of 2 for $\lambda_1$), this is not common in real-world graphs since it suggests symmetries in the graph, which are uncommon. Furthermore, we found no $\lambda_1$ multiplicity greater than 1 in the ZINC and PATTERN datasets. We further discuss these rare cases and how to deal with them in appendix C.4.

Orthogonal directions. Although all $\phi$ are orthogonal, their gradients, used to define directions, are not always locally orthogonal (e.g. there are many horizontal flows in the grid). This concern is left to be addressed in future work.

2.7. Generalization of the convolution on a grid

In this section we show that our method generalizes CNNs by allowing the definition of any radius-R convolutional kernel in grid-shaped graphs. The radius-R kernel at node $u$ is a convolutional kernel that takes the weighted sum of all nodes $v$ at a distance $d(u, v) \leq R$. Consider the lattice graph $\Gamma$ of size $N_1 \times N_2 \times \ldots \times N_n$ where each vertex is connected to its direct non-diagonal neighbours. We know from Lemma D.1 that, for each dimension, there is an eigenvector that is only a function of this specific dimension. For example, the lowest-frequency eigenvector $\phi_1$ always flows in the direction of the longest dimension. Hence, the Laplacian eigenvectors of the grid can play a role analogous to the axes in Euclidean space, as shown in figure 2.

With this knowledge, we show in theorem 2.4 (proven in D.6) that we can generalize all convolutional kernels in an n-dimensional grid. This is a strong result since it demonstrates that our DGN framework generalizes CNNs when applied on a grid, thus closing the gap between GNNs and the highly successful CNNs on image tasks.

Theorem 2.4 (Generalization of the radius-R convolutional kernel in a lattice). For an n-dimensional lattice, any convolutional kernel of radius R can be realized by a linear combination of directional aggregation matrices and their compositions.

As an example, figure 4 shows how a linear combination of the first and m-th aggregators $B(\nabla\phi_1)$ and $B(\nabla\phi_m)$ realizes a kernel on an $N \times M$ grid, where $m = N/M$ and $N > M$. Note that when the size of a given dimension is an integer multiple of another dimension, e.g. $N = M$ or $N = 3M$, then the m-th eigenvalue has a multiplicity of 2. Hence, the eigenvector used to define the direction is not unique. This does not void theorem 2.4 since the eigenvectors flowing in the horizontal/vertical directions are still valid choices.

Figure 4. Realization of a radius-1 convolution using the proposed aggregators. $I_x$ is the input feature map, $*$ the convolution operator, $I_y$ the convolution result, and $B_i = B(\nabla\phi_i)$.
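A small numerical illustration (ours) of the lattice case: on a 4x7 grid, the first non-trivial eigenvector of the combinatorial Laplacian L = D - A varies only along the longer axis, so its gradient plays the role of the horizontal direction in figure 2(a).

```python
import numpy as np

Ny, Nx = 4, 7
n = Ny * Nx
idx = lambda r, c: r * Nx + c
A = np.zeros((n, n))
for r in range(Ny):
    for c in range(Nx):
        if c + 1 < Nx:
            A[idx(r, c), idx(r, c + 1)] = A[idx(r, c + 1), idx(r, c)] = 1.0
        if r + 1 < Ny:
            A[idx(r, c), idx(r + 1, c)] = A[idx(r + 1, c), idx(r, c)] = 1.0
L = np.diag(A.sum(1)) - A
eigval, eigvec = np.linalg.eigh(L)
phi1 = eigvec[:, 1].reshape(Ny, Nx)          # first non-trivial eigenvector, on the grid
print(np.round(phi1, 3))                     # all rows are (nearly) identical
print(np.ptp(phi1, axis=0).max())            # ~0: no variation along the short axis
```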
2.8. Extending the radius of the aggregation kernel

Having aggregation kernels for neighbours at distance 2 or 3 is important to improve the expressiveness of GNNs and their ability to understand patterns, and to reduce the number of layers required. However, the lack of directions in GNNs strongly limits the radius of the kernels since, given a graph of regular degree $d$, a mean/sum aggregation at radius R will result in a heavy over-squashing of $O(d^R)$ messages. Using the directional fields, we can enumerate different paths, thus assigning a different weight to different R-distant neighbours. This method, proposed in appendix A.7, avoids the over-squashing. (Empirical results on this extension are left for future work.)

2.9. Comparison with the Weisfeiler-Lehman (WL) test

We also compare the expressiveness of directional graph networks with the classical WL graph isomorphism test, which is often used to classify the expressivity of graph neural networks (Xu et al., 2018a). In theorem 2.5 (proven in appendix D.7) we show that DGNs are capable of distinguishing pairs of graphs that the 1-WL test (and so ordinary GNNs) cannot differentiate.

Theorem 2.5 (Comparison with the 1-WL test). DGNs using the mean aggregator, any directional aggregator of the first Laplacian eigenvector and injective degree-scalers are strictly more powerful than the 1-WL test.

2.10. Data augmentation

Another theoretical result is that the directions in the graph allow us to replicate some of the most common data augmentation techniques used in computer vision, namely reflection, rotation and distortion. The main difference is that, instead of modifying the image (such as a 5° rotation), the proposed transformation is applied to the vector field defining the aggregation kernel (thus rotating the kernel by 5° without changing the image). This offers the advantage of avoiding any pre-processing of the data since the augmentation is done directly on the kernel at each iteration of the training. In appendix B.1, we explain how we can generalize some image augmentation techniques by defining the reflection, rotation and distortion augmentations in our DGN framework, and provide early results on the CIFAR10 dataset in appendix B.2.

3. Implementation

We implemented the models using the DGL and PyTorch libraries and we provide the code at https://github.com/Saro00/DGN. We test our method on standard benchmarks from (Dwivedi et al., 2020) and (Hu et al., 2020), namely ZINC, CIFAR10, PATTERN, MolHIV and MolPCBA, with more details on the datasets and how we enforce a fair comparison in appendix C.1.

For the empirical experiments we inserted our proposed aggregation method into two different types of message-passing architecture used in the literature: a simple convolutional architecture similar to the one present in GCN (equation 9a) (Kipf & Welling, 2016) and a more complex and general one typical of MPNNs (equation 9b) (Gilmer et al., 2017), with or without edge features $e_{ji}$. The time complexity of our approach is $O(Em)$, which is identical to PNA (Corso et al., 2020), where $E$ is the number of edges and $m$ the number of aggregators, with an additional $O(Ek)$ to pre-compute the first $k$ eigenvectors, as explained in appendix C.2.

$$X_i^{(t+1)} = U\Big( \bigoplus_{(j,i) \in E} X_j^{(t)} \Big) \quad \text{(9a)}$$

$$X_i^{(t+1)} = U\Big( \bigoplus_{(j,i) \in E} M\big( X_i^{(t)},\, X_j^{(t)},\, \underbrace{e_{ji}}_{\text{optional}} \big) \Big) \quad \text{(9b)}$$

Here, $\bigoplus$ is an operator which concatenates the results of multiple aggregators, $X$ are the node features, $M$ is a linear transformation and $U$ a multi-layer perceptron (MLP).
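For illustration, here is a minimal NumPy sketch (ours, not the released DGL/PyTorch implementation) of the simple architecture of equation 9a: the mean and directional aggregations are column-concatenated and passed through the MLP U; the absolute value on the derivative aggregation reflects the sign-ambiguity handling described below.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dgn_simple_layer(X, A, B_av, B_dx, W1, W2, eps=1e-8):
    """One layer in the spirit of eq. (9a): concatenate the a-directional mean
    aggregation with the directional ones, then apply a two-layer MLP U."""
    mean_agg = (A @ X) / (A.sum(1, keepdims=True) + eps)    # isotropic mean aggregation
    dir_av = B_av @ X                                       # directional smoothing
    dir_dx = np.abs(B_dx @ X)                               # abs() handles the sign ambiguity of phi
    H = np.concatenate([mean_agg, dir_av, dir_dx], axis=1)  # column concatenation
    return relu(H @ W1) @ W2                                # MLP U applied to every node

# Toy usage with hypothetical shapes; B_av and B_dx would come from the pre-computation.
rng = np.random.default_rng(0)
n, d, h = 6, 3, 8
A = np.ones((n, n)) - np.eye(n)                             # dummy fully-connected graph
B_av = B_dx = np.eye(n)                                     # placeholders for the real matrices
X = rng.normal(size=(n, d))
W1, W2 = rng.normal(size=(3 * d, h)), rng.normal(size=(h, d))
print(dgn_simple_layer(X, A, B_av, B_dx, W1, W2).shape)     # (6, 3)
```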
This simple architecture of equation 9a is shown visually in steps (f-g) of figure 1. We further use degree scalers $S(d, \alpha)$, defined below, to scale the aggregation results according to each node's degree, as proposed by the PNA model (Corso et al., 2020). Here, $d$ is the degree of a given node, $\delta$ is the average node degree in the training set, and $\alpha$ is a parameter set to $-1$ for degree attenuation and $1$ for degree amplification. Note that each degree scaler is applied to the result of each aggregator, and the results are concatenated.

$$S(d, \alpha) = \left( \frac{\log(d + 1)}{\delta} \right)^{\alpha}, \qquad \delta = \frac{1}{|\text{train}|} \sum_{i \in \text{train}} \log(d_i + 1)$$

We tested the directional aggregators across the datasets using the gradients of the first $k$ eigenvectors $\nabla\phi_{1,\ldots,k}$ as the underlying vector fields. Here, $k$ is a hyperparameter, usually 1 or 2, but it could be bigger for high-dimensional graphs. To deal with the arbitrary sign of the eigenvectors, we take the absolute value of the result of equation 6, making it invariant to a reflection of the field. In case of a disconnected graph, $\phi_i$ is the $i$-th eigenvector of each connected component. Despite the numerous aggregators proposed in appendix A, only $B_{dx}$ and $B_{av}$ are tested empirically.

The metrics used to measure the performance of a model are enforced for each dataset and provided by (Dwivedi et al., 2020) and (Hu et al., 2020). In particular, we use the mean absolute error (MAE), the accuracy (acc), the area under the receiver operating characteristic curve (ROC-AUC), and the average precision (AP).

4. Results and discussion

Directional aggregation. Using the benchmarks introduced in section 3, we present in figure 5 a fair comparison of various aggregation strategies using the same parameter budget and hyperparameters. We see a consistent boost in performance for the simple, complex and complex-with-edges models using directional aggregators compared to the mean-aggregator baseline.

With our theoretical analysis in mind, we expected to perform well on PATTERN since the flows of the first eigenvectors are meaningful directions in a stochastic block model (i.e., these eigenvectors tend to correlate with community membership). The results match our expectations, outperforming all the previous models. In particular, we see a significant improvement in the molecular datasets (ZINC, MolHIV and MolPCBA) when using the directional aggregators, especially for the derivative aggregation $B^1_{dx}$ (denoted dx1 in figure 5). We believe this is due to the capacity to efficiently move messages across opposite parts of the molecule and to better understand the role of atom pairs. We further believe that the derivative aggregator is better able to capture high-frequency directional signals, similarly to the Gabor filters in computer vision. Further, the thesis that DGNs can bridge the gap between CNNs and GNNs is supported by the clear improvements on CIFAR10 over the baselines.

Improvements over positional embeddings. Dwivedi et al. (2020) proposed the use of the eigenvectors as positional encodings. However, our experiments with the positional encoding of the first 2 non-trivial eigenvectors, denoted pos1 and pos2 in figure 5, showed no clear improvement on most datasets. In fact, Dwivedi et al. noted that many eigenvectors and high network depths are required for improvements, yet we outperform their results with fewer parameters, less depth, and only 1-2 eigenvectors, further motivating their use as directional flows instead of positional encodings.
Comparison to the literature. In order to compare our model with the literature, we fine-tuned it on the various datasets and report its performance in figure 6. We observe that DGN provides significant improvements across all benchmarks, highlighting the importance of anisotropic kernels that are dependent on the graph topology. Note that the results in Figure 6 are better than those in Figure 5 since the fine-tuned models use a more exhaustive parameter search and the min/max aggregators proposed in PNA (Corso et al., 2020) alongside the directional aggregators.

Figure 5. Test set results using a parameter budget of 100k with the same hyperparameters as (Corso et al., 2020), except MolPCBA with a budget of 7M. The low-frequency Laplacian eigenvectors are used to define the directions, except for CIFAR10, which uses the coordinates of the image. For brevity, we denote dxi and avi as the directional derivative $B^i_{dx}$ and smoothing $B^i_{av}$ aggregators of the i-th direction. We also denote posi as the i-th eigenvector used as positional encoding for the mean aggregator.

| Aggregators | ZINC Simple (MAE) | ZINC Complex (MAE) | ZINC Complex-E (MAE) | PATTERN Simple (% acc) | PATTERN Complex (% acc) | CIFAR10 Simple (% acc) | CIFAR10 Complex (% acc) | MolHIV Simple (% ROC-AUC) | MolPCBA Complex (% AP) | MolPCBA Complex-E (% AP) |
|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.316 | 0.353 | 0.262 | 80.77 | 83.34 | 55.9 | 62.8 | 75.1 | 26.04 | 26.38 |
| mean pos1 | 0.349 | 0.332 | 0.297 | 80.76 | 83.74 | | | 75.8 | 26.97 | 27.50 |
| mean pos1 pos2 | 0.344 | 0.330 | 0.284 | 84.51 | 81.25 | | | 76.1 | 26.03 | 25.65 |
| mean dx1 | 0.296 | 0.233 | 0.191 | 84.22 | 83.44 | | | 78.0 | 26.79 | 27.91 |
| mean dx1 dx2 | 0.337 | 0.271 | 0.205 | 81.61 | 86.62 | 52.9 | 69.8 | 76.5 | 27.16 | 26.55 |
| mean av1 | 0.317 | 0.332 | 0.276 | 84.54 | 83.21 | | | 78.4 | 25.97 | 26.66 |
| mean av1 av2 | 0.367 | 0.332 | 0.260 | 85.12 | 85.38 | 60.6 | 65.1 | 77.1 | 25.61 | 26.67 |
| mean dx1 av1 | 0.290 | 0.245 | 0.192 | 85.17 | 86.68 | | | 79.0 | 26.40 | 27.47 |

5. Conclusion

The proposed DGN method allows us to address many problems of GNNs, including the lack of anisotropy, the low expressiveness, over-smoothing and over-squashing. For the first time in graph networks, we generalize the directional properties of CNNs and their data augmentation capabilities. Based on the intuitive idea that the low-frequency eigenvectors of the graph Laplacian give an interpretable directional flow, we backed our work with a set of strong theoretical results showing that these eigenvectors are important in connecting nodes that are far away and in improving the expressiveness with regard to the WL test. The work is also supported by strong empirical results, and we believe it will give rise to a new family of directional GNNs. In fact, we introduce in the appendix different avenues for future work, including the hardening of the aggregators (A.4), the introduction of a zero-padding at the boundaries (A.6), the implementation of radius-R kernels (A.7), and the full study of directional data augmentation (B.1). Future methods could also improve the choice of multiple directions beyond the selection of the k lowest frequencies.

Broader Impact. This work will extend the usability of graph networks to all problems with engineering or physically defined directions, thus making GNNs a new laboratory for signal processing, physics, material science, and molecular and cell biology.
In fact, the anisotropy present in a wide variety of systems could be expressed as vector fields (spinor, tensor) compatible with the DGN framework, without the need for eigenvectors. One example is magnetic anisotropy in metals, alloys and organic molecules, which is dependent on the relative orientation to the magnetic field. Other examples are the response of materials to high electromagnetic fields; all kinds of field propagation in crystal lattices (vibrations, heat, shear and frictional forces, Young's modulus, light refraction, birefringence); multi-body or liquid motion; magnons and solitons in different media; fracture propagation; traffic modelling; developmental biology and embryology; and the design of novel materials and constrained structures. Finally, applications based on neural operators for ODEs/PDEs may benefit as well.

Figure 6. Fine-tuned results of the DGN model against models from (Dwivedi et al., 2020) and (Hu et al., 2020): GCN (Kipf & Welling, 2016), GraphSage (Hamilton et al., 2017), GIN (Xu et al., 2018a), GAT (Veličković et al., 2017), MoNet (Monti et al., 2017), GatedGCN (Bresson & Laurent, 2017) and PNA (Corso et al., 2020). All the models use 100k parameters, except those marked with *, which use 300k to 6.5M. In ZINC the DGN aggregators are {mean, dx1, max, min}, in PATTERN {mean, dx1, av1}, in CIFAR10 {mean, dx1, dx2, max}, in MolHIV {mean, dx1, av1, max, min}, in MolPCBA {mean, sum, max, dx1}. Mean and uncertainty are taken over 4 runs for ZINC, PATTERN and CIFAR10 and 10 runs for MolHIV and MolPCBA.

| All models | ZINC no edge feat. (MAE) | ZINC edge feat. (MAE) | PATTERN no edge feat. (% acc) | CIFAR10 no edge feat. (% acc) | CIFAR10 edge feat. (% acc) | MolHIV no edge feat. (% ROC-AUC) | MolPCBA (% AP) |
|---|---|---|---|---|---|---|---|
| GCN | 0.469 ± 0.002 | | 65.880 ± 0.074 | 54.46 ± 0.10 | | 76.06 ± 0.97 * | 20.20 ± 0.24 * |
| GIN | 0.408 ± 0.008 | | 85.590 ± 0.011 | 53.28 ± 3.70 | | 75.58 ± 1.40 * | 22.66 ± 0.28 * |
| GraphSage | 0.410 ± 0.005 | | 50.516 ± 0.001 | 66.08 ± 0.24 | | | |
| GAT | 0.463 ± 0.002 | | 75.824 ± 1.823 | 65.48 ± 0.33 | | | |
| MoNet | 0.407 ± 0.007 | | 85.482 ± 0.037 | 53.42 ± 0.43 | | | |
| GatedGCN | 0.422 ± 0.006 | 0.363 ± 0.009 | 84.480 ± 0.122 | 69.19 ± 0.28 | 69.37 ± 0.48 | | |
| PNA | 0.320 ± 0.032 | 0.188 ± 0.004 | 86.567 ± 0.075 | 70.46 ± 0.44 | 70.47 ± 0.72 | 79.05 ± 1.32 * | 28.38 ± 0.35 * |
| DGN | 0.219 ± 0.010 | 0.168 ± 0.003 | 86.680 ± 0.034 | 72.70 ± 0.54 | 72.84 ± 0.42 | 79.70 ± 0.97 | 28.85 ± 0.30 * |

References

Alon, U. and Yahav, E. On the bottleneck of graph neural networks and its practical implications. arXiv:2006.05205 [cs, stat], 2020. URL http://arxiv.org/abs/2006.05205.

Antoine, J. P. and Murenzi, R. Two-dimensional directional wavelets and the scale-angle representation. Signal Processing, 52(3):259-281, 1996. ISSN 0165-1684. doi: 10.1016/0165-1684(96)00065-5. URL https://www.sciencedirect.com/science/article/pii/0165168496000655.

Barlow, M. T. Random Walks and Heat Kernels on Graphs. London Mathematical Society Lecture Note Series. Cambridge University Press, 2017.

Bresson, X. and Laurent, T. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017. ISSN 1053-5888, 1558-0792. doi: 10.1109/MSP.2017.2693418. URL http://arxiv.org/abs/1611.08097.

Chavel, I. Eigenvalues in Riemannian Geometry. Academic Press, 1984.

Chung, F. and Yau, S. T. Discrete Green's functions. Journal of Combinatorial Theory, Series A, 91(1):191-214, 2000. ISSN 0097-3165. doi: 10.1006/jcta.2000.3094. URL http://www.sciencedirect.com/science/article/pii/S0097316500930942.
Chung, F., Graham, F., on Recent Advances in Spectral Graph Theory, C. C., (U.S.), N. S. F., Society, A. M., and of the Mathematical Sciences, C. B. Spectral Graph Theory. CBMS Regional Conference Series. Conference Board of the Mathematical Sciences, 1997. ISBN 9780821803158. URL https://books.google.ca/books?id=4IK8DgAAQBAJ.

Coifman, R. R. and Lafon, S. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5-30, 2006. ISSN 1063-5203. doi: 10.1016/j.acha.2006.04.006. URL https://www.sciencedirect.com/science/article/pii/S1063520306000546. Special Issue: Diffusion Maps and Wavelets.

Corso, G., Cavalleri, L., Beaini, D., Liò, P., and Veličković, P. Principal neighbourhood aggregation for graph nets. arXiv preprint arXiv:2004.05718, 2020.

Doshi, V. and Eun, D. Y. Fiedler vector approximation via interacting random walks. arXiv:2002.00283 [math], 2020. doi: 10.1145/3379487. URL http://arxiv.org/abs/2002.00283.

Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.

Fiedler, M. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23:298-305, 1973. doi: 10.21136/CMJ.1973.101168.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1263-1272. JMLR.org, 2017.

Grady, L. J. and Polimeni, J. Discrete Calculus: Applied Analysis on Graphs for Computational Science. Springer, 2010.

Grebenkov, D. S. and Nguyen, B.-T. Geometrical structure of Laplacian eigenfunctions. SIAM Review, 55(4):601-667, Jan 2013. ISSN 1095-7200. doi: 10.1137/120880173. URL http://dx.doi.org/10.1137/120880173.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024-1034, 2017.

Hamilton, W. L. Graph Representation Learning. Morgan and Claypool, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Islam, M. A., Jia, S., and Bruce, N. D. B. How much position information do convolutional neural networks encode? arXiv:2001.08248 [cs], 2020. URL http://arxiv.org/abs/2001.08248.

Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. arXiv:1802.04364 [cs, stat], 2018. URL http://arxiv.org/abs/1802.04364.

Kang, E., Min, J., and Ye, J. C. A deep convolutional neural network using directional wavelets for low-dose x-ray CT reconstruction. Medical Physics, 44(10):e360-e375, 2017. ISSN 2473-4209. doi: 10.1002/mp.12344.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Klicpera, J., Groß, J., and Günnemann, S. Directional message passing for molecular graphs. ICLR 2020, 2019. URL https://openreview.net/forum?id=B1eWbxStPH.

Knyazev, B., Taylor, G. W., and Amer, M.
Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems, pp. 4204-4214, 2019.

Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

Krizhevsky, A., 2009.

Lanczos, C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office, Los Angeles, CA, 1950.

Levie, R., Monti, F., Bresson, X., and Bronstein, M. M. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. arXiv:1705.07664 [cs], 2018. URL http://arxiv.org/abs/1705.07664.

Levy, B. Laplace-Beltrami eigenfunctions towards an algorithm that understands geometry. In IEEE International Conference on Shape Modeling and Applications 2006 (SMI'06), pp. 13-13, 2006. doi: 10.1109/SMI.2006.21.

Luan, S., Zhao, M., Chang, X.-W., and Precup, D. Break the ceiling: Stronger multi-scale deep graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 10943-10953, 2019.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.

Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115-5124, 2017.

O'Gara, S. and McGuinness, K. Comparing data augmentation strategies for deep image classification. Session 2: Deep Learning for Computer Vision, 2019. doi: 10.21427/148b-ar75. URL https://arrow.tudublin.ie/impstwo/7.

Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. An overview of early vision in InceptionV1. Distill, 5(4):e00024.002, 2020. ISSN 2476-0757. doi: 10.23915/distill.00024.002. URL https://distill.pub/2020/circuits/early-vision.

Peng, H., Li, J., Gong, Q., Wang, S., Ning, Y., and Yu, P. S. Graph convolutional neural networks via motif-based attention. arXiv:1811.08270 [cs], 2019. URL http://arxiv.org/abs/1811.08270.

Rong, Y., Huang, W., Xu, T., and Huang, J. DropEdge: Towards deep graph convolutional networks on node classification. ICLR 2020, pp. 17, 2020.

Saerens, M., Fouss, F., Yen, L., and Dupont, P. The principal components analysis of a graph, and its relationships to spectral clustering. In European Conference on Machine Learning, pp. 371-383. Springer, 2004.

Sato, R., Yamada, M., and Kashima, H. Approximation ratios of graph neural networks for combinatorial problems. arXiv preprint arXiv:1905.10261, 2019.

Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019. ISSN 2196-1115. doi: 10.1186/s40537-019-0197-0. URL https://doi.org/10.1186/s40537-019-0197-0.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Xu, B., Shen, H., Cao, Q., Qiu, Y., and Cheng, X. Graph wavelet neural network. arXiv:1904.07785 [cs, stat], 2019. URL http://arxiv.org/abs/1904.07785.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018a.

Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks.
In International Conference on Machine Learning, pp. 5453-5462, 2018b.

Yue Lu and Minh N. Do. The finer directional wavelet transform. In Proceedings (ICASSP '05), IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, volume 4, pp. 573-576. IEEE, 2005. ISBN 978-0-7803-8874-1. doi: 10.1109/ICASSP.2005.1416073. URL http://ieeexplore.ieee.org/document/1416073/.