# Comparing Graph Transformers via Positional Encodings

Mitchell Black*¹, Zhengchao Wan*², Gal Mishne², Amir Nayyeri¹, Yusu Wang²

Abstract

The distinguishing power of graph transformers is tied to the choice of positional encoding: features used to augment the base transformer with information about the graph. There are two primary types of positional encoding: absolute positional encodings (APEs) and relative positional encodings (RPEs). APEs assign features to each node and are given as input to the transformer. RPEs instead assign a feature to each pair of nodes, e.g., the shortest-path distance, and are used to augment the attention block. A priori, it is unclear which method is better for maximizing the power of the resulting graph transformer. In this paper, we aim to understand the relationship between these different types of positional encodings. Interestingly, we show that graph transformers using APEs and RPEs are equivalent in their ability to distinguish non-isomorphic graphs. In particular, we demonstrate how to interchange APEs and RPEs while maintaining their distinguishing power in terms of graph transformers. However, in the case of graphs with node features, we show that RPEs may have an advantage over APEs. Based on our theoretical results, we provide a study of different APEs and RPEs, including the shortest-path and resistance distances and the recently introduced stable and expressive positional encoding (SPE), and compare their distinguishing power in terms of transformers. We believe our work will help navigate the vast number of positional encoding choices and provide guidance on the future design of positional encodings for graph transformers.

*Equal contribution. ¹School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA. ²Halıcıoğlu Data Science Institute, University of California San Diego, San Diego, California, USA. Correspondence to: Mitchell Black. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Graph transformers (GTs) (Dwivedi & Bresson, 2021; Ying et al., 2021) have recently emerged as a competitor to message passing neural networks (MPNNs) (Gilmer et al., 2017) for graph representation learning. While MPNNs only consider the immediate neighbors of a node when updating that node's feature, GTs consider all nodes in the graph. Accordingly, graph transformers can capture global information about a graph, which can potentially be used to address the issue of oversquashing faced by MPNNs. Likewise, the attention mechanism of transformers can address the issue of oversmoothing faced by MPNNs. See (Müller et al., 2024). However, one caveat in applying transformers to graph data is that transformers are equivariant, meaning they treat nodes symmetrically and cannot distinguish two nodes at different positions in the graph. Accordingly, much of the research on graph transformers has been focused on the design of positional encodings that incorporate information about the graph structure into the input or architecture of transformers. There are currently two major types of positional encodings: Absolute positional encodings (APEs) encode the graph structure as an embedding of the vertices $\phi : V \to \mathbb{R}^d$. APEs are either added to or concatenated with the initial vertex features that are the input to the transformer.
Typical examples are the Laplacian eigenvectors (Dwivedi & Bresson, 2021) or information about random walks (Ramp aˇsek et al., 2022). Recent works have also proposed learnable APEs (Lim et al., 2022; Huang et al., 2024). Relative positional encodings (RPEs) encode the graph structure as an embedding of pairs of vertices ψ : V V Rd. RPEs are incorporated into a GT via a modified attention mechanism. Examples of RPEs include the shortestpath distance (Ying et al., 2021), resistance distance (Zhang et al., 2023), and heat kernels (Choromanski et al., 2022). The adjacency matrix can also serve as an RPE, which will result in an RPE GT with the same distinguishing power as MPNNs (Corollary 4.17). Although different kinds of APEs and RPEs have been proposed to encode graph structure and enhance the performance of GTs, there is a lack of understanding of how different positional encodings compare e.g., shortest-path Comparing Graph Transformers via Positional Encodings vs. resistance distance for distinguishing non-isomorphic graphs. Only a few works (Zhang et al., 2023) have attempted to compare different RPEs, let alone conduct a systematic study comparing APEs and RPEs. This results in a lack of guidance for the design of positional encodings for GTs: which type of positional encoding absolute or relative should be used in practice? Furthermore, this could lead to inefficiency in constructing positional encodings. In some recent works on designing APEs like Basis Net (Lim et al., 2022) and the stable and expressive positional encoding (SPE) (Huang et al., 2024), the authors first construct well-defined RPEs and then use some extra machinery to artificially transform RPEs into APEs. This raises the natural question of whether the transition is worthwhile given the expensive extra computation. Our contributions. We establish a theoretical framework for comparing positional encodings, not only for those in the same category, i.e., absolute or relative, but also across APEs and RPEs. Our main contributions are as follows: (1) In Section 3, we establish that specific types of APE and RPE GTs (cf. Definition 2.2) are equivalent in their ability to distinguish non-isomorphic graphs. In particular, we show how to map APEs to RPEs and vice versa while maintaining their distinguishing power. We empirically validate our theoretical results on several graph classification datasets. (2) We show that RPEs may have an advantage over APEs for distinguishing graphs with node features. In particular, we show that transforming an RPE to an APE results in a GT that is able to distinguish strictly fewer graphs with node features than a GT that uses the RPE directly. (3) The techniques for proving the above results enable us to compare the distinguishing power of different positional encodings. In Section 4, we provide a case study on several popular APEs and RPEs by comparing their distinguishing power. For example, we prove that GTs using SPE (Huang et al., 2024) as an APE are stronger than the GTs using resistance distance as an RPE (Zhang et al., 2023). (4) In the process of establishing our main results, we identify a new variant of the Weisfeiler-Lehman (WL) test (Weisfeiler & Leman, 1968) that we call the RPE-aug WL test that allows us to connect the distinguishing power of RPEs and corresponding GTs. This RPE-aug WL test is interesting in itself as a way to further the theoretical formulation of the representation power of graph learning frameworks. 
We generalize the equivalence between the WL and 2-WL tests to our RPE-aug WL test and establish the equivalence between the resistance distance and the pseudoinverse of the Laplacian in terms of distinguishing power via the RPE-aug WL test. Furthermore, we identify a class of RPEs resulting in a family of RPE GTs at least as strong as MPNNs.

2. Positional Encodings and Graph Transformers

In this section, we introduce some preliminaries on positional encodings and graph transformers, as well as some auxiliary results that will be used to prove our main results.

Let $G = (V, E)$ denote an unweighted graph, which can be either directed or undirected. A featured graph $(G, X)$ includes node features $X \in \mathbb{R}^{|V| \times d}$, indicating a $d$-dimensional feature $X(v)$ for each node $v \in V$. Let $A$ denote the adjacency matrix, $D$ the diagonal matrix of the degrees, and $L = D - A$ the graph Laplacian matrix. To prevent ambiguity when working with multiple graphs, we use $G$ as a subscript in our notation, such as $V_G$ for the vertex set or $X_G$ for the node features of graph $G$. For two graphs $G$ and $H$, a graph isomorphism is a bijection $\sigma : V_G \to V_H$ such that $\{u, v\} \in E_G$ if and only if $\{\sigma(u), \sigma(v)\} \in E_H$. Two graphs $G$ and $H$ are isomorphic if there is a graph isomorphism $\sigma : V_G \to V_H$.

2.1. Positional Encodings

A positional encoding (PE) of a graph is a way of summarizing structural or positional information of the graph. In their most general form, there are two types of positional encodings we consider in this paper:

Definition 2.1 (Positional Encodings). An absolute positional encoding (APE) $\phi$ assigns each graph $G$ a map $\phi_G : V_G \to \mathbb{R}^{\ell}$ such that for any two isomorphic graphs $G$ and $H$ and graph isomorphism $\sigma : V_G \to V_H$, one has that $\phi_G = \phi_H \circ \sigma$. A relative positional encoding (RPE) $\psi$ assigns each graph $G$ a map $\psi_G : V_G \times V_G \to \mathbb{R}^k$ such that for any two isomorphic graphs $G$ and $H$ and graph isomorphism $\sigma : V_G \to V_H$, one has that $\psi_G = \psi_H \circ (\sigma \times \sigma)$.

An APE is also naturally expressed as a matrix in $\mathbb{R}^{|V_G| \times \ell}$ and an RPE as a tensor in $\mathbb{R}^{|V_G| \times |V_G| \times k}$.

Note: While node features and APEs both assign a vector to each node, we emphasize that the difference between node features and APEs is that APEs are dependent on the topology of a graph, while two isomorphic graphs can have different node features.

One of the simplest APEs is the degree map $\deg$: for any graph $G$, $\deg_G(u)$ is the degree of the vertex $u$. Any graph distance is an RPE, e.g., shortest-path distance (SPD) or resistance distance (RD). Laplacian eigenvectors are commonly used as APEs (Dwivedi & Bresson, 2021; Kreuzer et al., 2021). However, there is no unique choice of eigenvectors for a given eigenspace. Therefore, Laplacian eigenvectors are not a well-defined APE by our definition, as isomorphic graphs may have different eigenvector encodings due to the choice of basis. However, the projection matrices onto the eigenspaces are a well-defined RPE (Lim et al., 2022).

2.2. Graph Transformers

A transformer is an architecture $T$ composed of multi-headed attention layers, $T = \mathrm{MHA}^{(L)} \circ \cdots \circ \mathrm{MHA}^{(1)}$, where a multi-headed attention layer with heads $1 \le h \le H$ is a map $\mathrm{MHA}^{(l)}$ that sends $X^{(l)} \in \mathbb{R}^{n \times d}$ to $\mathrm{MHA}^{(l)}(X^{(l)}) = X^{(l+1)} \in \mathbb{R}^{n \times d}$ as defined below:

$$A^{(l,h)}(X^{(l)}) = \mathrm{softmax}\left( \frac{X^{(l)} W_Q^{(l,h)} \big(X^{(l)} W_K^{(l,h)}\big)^{T}}{\sqrt{d_h}} \right)$$

$$Y^{(l)} = X^{(l)} + \sum_{h=1}^{H} A^{(l,h)}(X^{(l)})\, X^{(l)} W_V^{(l,h)} W_O^{(l,h)}$$

$$X^{(l+1)} = Y^{(l)} + \sigma\big(Y^{(l)} W_1^{(l)}\big) W_2^{(l)}$$

where $W_Q^{(l,h)}, W_K^{(l,h)}, W_V^{(l,h)} \in \mathbb{R}^{d \times d_h}$, $W_O^{(l,h)} \in \mathbb{R}^{d_h \times d}$, $W_1^{(l)} \in \mathbb{R}^{d \times d_r}$, $W_2^{(l)} \in \mathbb{R}^{d_r \times d}$, and $\sigma$ is an activation function.
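To make the layer above concrete, the following is a minimal NumPy sketch of a single multi-headed attention layer following these equations. It is not the authors' implementation: the weights are random placeholders, the dimension names `d`, `d_h`, `d_r`, and `H` simply mirror the text, and the activation defaults to ReLU for brevity (the paper notes GELU is the typical choice).

```python
import numpy as np

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def mha_layer(X, params, sigma=lambda z: np.maximum(z, 0)):
    """One multi-headed attention layer acting on X in R^{n x d}.

    params holds per-head weights W_Q, W_K, W_V in R^{d x d_h} and
    W_O in R^{d_h x d}, plus residual-MLP weights W_1 in R^{d x d_r}
    and W_2 in R^{d_r x d}.
    """
    d_h = params["W_Q"][0].shape[1]
    Y = X.copy()
    for W_Q, W_K, W_V, W_O in zip(params["W_Q"], params["W_K"],
                                  params["W_V"], params["W_O"]):
        A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d_h))  # n x n attention
        Y = Y + A @ X @ W_V @ W_O                            # accumulate head outputs
    return Y + sigma(Y @ params["W_1"]) @ params["W_2"]      # residual MLP

# Tiny usage example with H = 2 heads and random placeholder weights.
rng = np.random.default_rng(0)
n, d, d_h, d_r, H = 5, 8, 4, 16, 2
params = {
    "W_Q": [rng.normal(size=(d, d_h)) for _ in range(H)],
    "W_K": [rng.normal(size=(d, d_h)) for _ in range(H)],
    "W_V": [rng.normal(size=(d, d_h)) for _ in range(H)],
    "W_O": [rng.normal(size=(d_h, d)) for _ in range(H)],
    "W_1": rng.normal(size=(d, d_r)),
    "W_2": rng.normal(size=(d_r, d)),
}
X = rng.normal(size=(n, d))
print(mha_layer(X, params).shape)  # (n, d)
```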
Here $Q$, $K$, $V$ refer to "query", "key", and "value", respectively, in the terminology of the attention literature (Vaswani et al., 2017), and $d_h$ and $d_r$ are the dimension of the hidden layer and the dimension of the residual layer, respectively. The activation function $\sigma$ is usually the Gaussian error linear unit (GELU) (Hendrycks & Gimpel, 2016).

Fact 1 (Transformers are Permutation Equivariant). Let $T$ be a transformer. Let $X \in \mathbb{R}^{n \times d}$. Let $P \in \mathbb{R}^{n \times n}$ be a permutation matrix. Then $P\,T(X) = T(PX)$.

When applying a transformer to a graph $G$, the input is a set of node features $X_G \in \mathbb{R}^{|V_G| \times d}$; however, initial node features often do not contain any information about the topology of a graph, so two graphs with the same multiset of features but different topologies will have the same output from a transformer. Graph transformers augment the transformer by incorporating information about the input graph in the form of absolute or relative positional encodings.

Definition 2.2 (Graph Transformers). An APE graph transformer with APE $\phi$ ($\phi$-APE-GT) is a transformer that either concatenates the node features $X_G \in \mathbb{R}^{|V_G| \times d}$ with the APE $\phi_G \in \mathbb{R}^{|V_G| \times \ell}$, i.e., $\widehat{X}_G = [X_G \mid \phi_G] \in \mathbb{R}^{|V_G| \times (d+\ell)}$, or adds the node features and APE, i.e., $\widehat{X}_G = X_G + \phi_G$, before passing them through a transformer. An RPE graph transformer with RPE $\psi$ ($\psi$-RPE-GT) is a transformer whose attention layer is modified as follows for a graph $G$ with node features $X \in \mathbb{R}^{|V_G| \times d}$:

$$A(X) = f_1(\psi_G) \odot \mathrm{softmax}\left( \frac{X W_Q (X W_K)^{T}}{\sqrt{d_h}} \right) + f_2(\psi_G),$$

where $f_1, f_2 : \mathbb{R}^k \to \mathbb{R}$ are functions applied entrywise to the tensor $\psi_G \in \mathbb{R}^{|V_G| \times |V_G| \times k}$ and $\odot$ denotes the entrywise product.

The original transformer (Vaswani et al., 2017) used sines and cosines as APEs for sequences. Transformers for sequences with RPEs were subsequently proposed by Shaw et al. (2018). The first instance of an APE-GT for graphs was proposed by Dwivedi & Bresson (2021) using the Laplacian eigenvectors. The first instance of an RPE-GT was proposed by Ying et al. (2021) using the shortest-path distance, while the form of RPE-GT we consider was proposed by Zhang et al. (2023). Although other ways of incorporating an RPE into a transformer have been proposed (Mialon et al., 2021; Ma et al., 2023), in this paper, APE-GT and RPE-GT only refer to transformers of the forms in Definition 2.2.

2.3. Properties of RPEs

In this subsection, we describe two special types of RPEs that will be useful later.

Diagonally-Aware RPEs. One benefit of using graph distances as RPEs compared to, e.g., the adjacency matrix is their distinctive diagonal: the diagonal entries are all zeros, in contrast to the positive off-diagonal terms. We characterize this property as "diagonal-awareness". As we will see throughout the paper, this property is important as it allows various algorithms to distinguish the feature at a node from all other features.

Definition 2.3. An RPE $\psi$ is diagonally-aware if for any two graphs $G$ and $H$ and vertices $v \in V_G$ and $x, y \in V_H$, $\psi_G(v, v) = \psi_H(x, y)$ only if $x = y$.

Diagonal-awareness is a very mild condition, as one can always augment an RPE so that it becomes diagonally-aware:

Definition 2.4. The diagonal augmentation $D\psi$ of an RPE $\psi$ is defined, for a graph $G$, as the stacking $D\psi_G := (I_G, \psi_G)$, where $I_G$ is the identity matrix.

For any graph $G$, the diagonal augmentation $D\psi_G(u, v)$ has a 1 in the first coordinate iff $u = v$ and a 0 otherwise. Therefore, $D\psi$ is diagonally-aware:

Proposition 2.5. Let $\psi$ be an RPE. The diagonal augmentation $D\psi$ is diagonally-aware.

Asymmetric RPEs. Most popular RPEs for undirected graphs are symmetric.
However, for directed graphs, it is natural to consider asymmetric RPEs such as the adjacency matrix, the directed distance matrix, or the Laplacian matrix. We identify the following special type of (possibly asymmetric) RPEs, which will be useful later: Definition 2.6 (Pseudo-symmetric RPEs). Let ψ be an RPE valued in Rk for either directed or undirected graphs. Let f : Rk Rk be any injective function such that f f = Id. Then, ψ is (f-)pseudo-symmetric if ψG(u, v) = f(ψG(v, u)) for any graph G and any vertices u, v VG. Comparing Graph Transformers via Positional Encodings Examples of f satisfying f f = Id include the identify map, the reflection map, and any coordinate switching map. Obviously, any symmetric RPE ψ is Idpseudo-symmetric. Any skew-symmetric matrix is pseudosymmetric for f(x) = x. If we consider complex-valued RPEs, then any Hermitian matrix is pseudo-symmetric for f(x) = x, the complex conjugate map. Although the adjacency or the Laplacian matrix of a directed graph can also serve as RPEs, they are in general not pseudo-symmetric. In order to guarantee pseudo-symmetry, given any RPE ψ, we can define an augmented RPE Sψ := (ψ, ψT ), i.e., for any graph G, Sψ G := (ψG, ψT G). Lemma 2.7 (Pseudo-symmetric augmentation). Let ψ be any RPE. Define f : R2k R2k by letting (x, y) 7 (y, x) for any x, y Rk. Then, Sψ is f-pseudo-symmetric. 3. Comparison of APE-GT and RPE-GT In this section, we establish the equivalence of the distinguishing power of APE and RPE GTs for distinguishing graphs without node features. This is done by showing that any RPE can be mapped to an APE without changing its distinguishing power and vice versa. Our strategy is to relate the distinguishing power of APE and RPEs, as well as the maps between the two, to variants of the Weisfeiler-Lehman tests. This approach generalizes the method of Zhang et al. (2023) for studying the expressive power of the Graphormer GD. Our main results are summarized in Figure 1. However, for the case of graphs with node features, we prove that our map from RPEs to APEs can result in a strict decrease in the distinguishing power of the corresponding GTs. In this way, our results suggest that (1) the two types of positional encodings are theoretically equivalent for solving graph isomorphism problems but (2) RPE GTs may be better suited for tasks where graph node features are involved. Our theoretical results are tested empirically in Appendix C. 3.1. Distinguishing Power: PEs vs. Graph Transformers Positional encodings provide a way of distinguishing different, non-isomorphic graphs. In their most basic form, two graphs can be distinguished by a positional encoding if it takes different values on the two graphs. For graphs with node features, a positional encoding can distinguish two graphs if it takes different values on nodes with different features. Specifically, two featured graphs (G, XG) and (H, XH) are indistinguishable by an APE ϕ if the concatenation of their features with their images by the APE ϕ are the same multiset, i.e., {{(XG(v), ϕG(v)) : v VG}} = {{(XH(v), ϕH(x)) : x VH}}, and G and H are indistinguishable by an RPE ψ if {{(XG(u), XG(v), ψG(u, v)) : (u, v) VG VG}} = {{(XH(x), XH(y), ψH(x, y)) : (x, y) VH VH}}. Likewise, two graphs are indistin- guishable by a graph transformer if their outputs have the same multiset of node features. Next, we compare the distinguishing power of positional encodings and GTs using these positional encodings. APE vs. APE-GT. 
Interestingly, APE-GTs are only as strong as their initial APEs in distinguishing power. This was previously noted by Müller et al. (2024, Section 2.1), although we include a short proof (which follows from the permutation-equivariance of transformers) in Appendix A.4 for completeness.

Lemma 3.1 (Equivalence of APEs and APE-GT). Any two graphs $(G, X_G)$ and $(H, X_H)$ are indistinguishable by an APE $\phi$ iff $(G, X_G)$ and $(H, X_H)$ are indistinguishable by all $\phi$-APE-GTs.

RPE vs. RPE-GT. An analogous result does not hold for RPEs and RPE-GTs, as using an RPE in an RPE-GT can increase its distinguishing power; see Appendix A.1 for an example. In fact, one can obtain a stronger notion of distinguishing power of RPEs via a variation of the Weisfeiler-Lehman (WL) graph isomorphism test (Weisfeiler & Leman, 1968). This new notion of indistinguishability for RPEs will be equivalent to the indistinguishability of RPE-GTs.

We first recall the standard WL algorithm, which iteratively assigns colors to the vertices of a graph and utilizes the multisets of node colors to distinguish non-isomorphic graphs. The color updating process is defined explicitly as follows:

Definition 3.2. Let $G$ be a graph. The Weisfeiler-Lehman (WL) algorithm assigns a coloring to the vertices $\chi_G^{(t)} : V_G \to C_t$ for each $t \ge 0$ defined by

$$\chi_G^{(0)}(v) = 1 \quad \text{(1 denotes an arbitrary constant)},$$
$$\chi_G^{(t+1)}(v) = \big(\chi_G^{(t)}(v),\ \{\{\chi_G^{(t)}(u) : \{v, u\} \in E_G\}\}\big).$$

However, to understand the distinguishing power of RPE transformers, we need to consider the following augmented WL test for graphs with node features and RPEs.

Definition 3.3. Let $\psi$ be an RPE. Let $(G, X_G)$ be a featured graph. The Weisfeiler-Lehman algorithm augmented with RPE $\psi$ (RPE-aug WL or $\psi$-WL) assigns a coloring to the vertices $\chi_{\psi_G}^{(t)} : V_G \to C_t$ for each $t \ge 0$ defined by

$$\chi_{\psi_G}^{(0)}(v) = X_G(v),$$
$$\chi_{\psi_G}^{(t+1)}(v) = \big(\chi_{\psi_G}^{(t)}(v),\ \{\{(\chi_{\psi_G}^{(t)}(u), \psi_G(v, u)) : u \in V_G\}\}\big).$$

Note: For unfeatured graphs, $\psi$-WL is defined by setting $\chi_{\psi_G}^{(0)}(v) = 1$ (where 1 is an arbitrary constant). We adopt this convention for all later constructions.

[Figure 1. Illustration of main results (a diagram relating APE-GT, APEs, RPE-GT, RPE-aug WL, RPE-2-WL, Deep Sets, and EGNs via Theorems 3.5, 3.8, 3.10 and Lemmas 3.6, 3.7, 3.9). Arrows denote non-decreasing in distinguishing power. Our main results are the two red arrows on the left. The proofs of the two theorems are illustrated by other parts of the diagram. Our contributions are in bold. The two-head arrow from RPE-2-WL to APEs indicates that the non-decreasing property only holds for unfeatured graphs.]

Note that the RPE-aug WL algorithm has a different color updating mechanism from the WL algorithm: RPE-aug WL considers all nodes in the update stage, whereas WL considers only neighbors. However, when the RPE is the adjacency matrix $A$, $A$-WL is equivalent to WL (cf. Proposition B.16). Hence, we consider RPE-aug WL to be a generalization of the WL algorithm. The definition of RPE-aug WL is inspired by the GD-WL test of Zhang et al. (2023), where the new feature is updated as

$$\hat{\chi}_{\psi_G}^{(t+1)}(v) = \{\{(\hat{\chi}_{\psi_G}^{(t)}(u), \psi_G(v, u)) : u \in V\}\},$$

where $\psi_G$ is a graph distance. We note that GD-WL is equivalent to our $\psi$-WL; in GD-WL, the previous color of the node can be deduced from the fact that $\psi_G(v, v) = 0$ since $\psi_G$ is a graph distance.

The WL algorithm has a family of higher-order generalizations called the $k$-WL algorithms that assign colors to $k$-tuples of nodes.
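To make the $\psi$-WL refinement of Definition 3.3 concrete, here is a small Python sketch (not the authors' code) that computes the multiset of $\psi$-WL colors. The example at the end uses the adjacency matrix as the RPE on two graphs matching the description of Appendix A.1 (inferred from the text to be a 4-cycle and a triangle with a pendant vertex); the graph choice and all helper names are illustrative.

```python
from collections import Counter
import numpy as np

def psi_wl_histogram(n, psi, X=None, rounds=None):
    """Multiset of psi-WL colors (Definition 3.3) after `rounds` refinements.

    n      : number of vertices, labeled 0..n-1
    psi    : psi(u, v) -> hashable RPE value for the ordered pair (u, v)
    X      : optional node features (X[v] must be hashable); defaults to 1
    rounds : number of refinement rounds (n rounds always suffice to stabilize)
    """
    colors = [X[v] if X is not None else 1 for v in range(n)]
    for _ in range(rounds if rounds is not None else n):
        colors = [
            (colors[v],  # previous color of v, plus the multiset over all u
             tuple(sorted(Counter((colors[u], psi(v, u)) for u in range(n)).items(),
                          key=repr)))
            for v in range(n)
        ]
    return Counter(colors)

# Example in the spirit of Appendix A.1: both graphs have 4 vertices and 4 edges,
# so their adjacency matrices are indistinguishable as multisets, yet A-WL
# (adjacency matrix used as the RPE) tells them apart.
C4  = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])  # 4-cycle
T3p = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])  # triangle + pendant
hist_G = psi_wl_histogram(4, lambda u, v: int(C4[u, v]))
hist_H = psi_wl_histogram(4, lambda u, v: int(T3p[u, v]))
print(hist_G == hist_H)  # False: A-WL distinguishes the two graphs
```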
While Definition 3.3 augments the WL test by using RPEs within the feature updating step, we can also directly use RPEs as the initial colors for 2-WL.

Definition 3.4. Let $\psi$ be an RPE. Let $(G, X)$ be a featured graph. The 2-Weisfeiler-Lehman algorithm with RPE $\psi$ (RPE-2-WL or $\psi$-2-WL) assigns a color to each pair of vertices, $\chi_2^{(t)} : V \times V \to C_t$ for each $t \ge 0$, defined by¹

$$\chi_{2,\psi_G}^{(0)}(u, v) = (X_G(u), X_G(v), \psi_G(u, v)),$$
$$\chi_{2,\psi_G}^{(t+1)}(u, v) = \big(\chi_{2,\psi_G}^{(t)}(u, v),\ \{\{\chi_{2,\psi_G}^{(t)}(u, w) : w \in V\}\},\ \{\{\chi_{2,\psi_G}^{(t)}(w, v) : w \in V\}\}\big).$$

¹In the literature (Chen et al., 2020), a pair $(u, v)$ is colored differently from ours at the 0-th step by considering both its isomorphism type and its RPE: $\chi_{2,\psi}^{(0)}(u, v) = (\mathbb{1}_{u=v}, A_{uv}, \psi(u, v))$. We ignore the first two terms, as one can easily augment $\psi$ with these terms to obtain a new RPE that reflects the isomorphism type if needed; see Section 4.5 for a further discussion.

Because RPEs are dependent on the graph structure and will be the same (up to permutation) for two isomorphic graphs, we can use the RPE-aug WL or RPE-2-WL as heuristics to distinguish non-isomorphic graphs. We say two featured graphs $(G, X_G)$ and $(H, X_H)$ are $\psi$-WL indistinguishable if $\{\{\chi_{\psi_G}^{(t)}(v) : v \in V_G\}\} = \{\{\chi_{\psi_H}^{(t)}(v) : v \in V_H\}\}$ for all $t \ge 0$. We denote this as $\chi_\psi(G) = \chi_\psi(H)$. Likewise, they are $\psi$-2-WL indistinguishable if $\{\{\chi_{2,\psi_G}^{(t)}(u, v) : u, v \in V_G\}\} = \{\{\chi_{2,\psi_H}^{(t)}(u, v) : u, v \in V_H\}\}$ for all $t \ge 0$. We denote this as $\chi_{2,\psi}(G) = \chi_{2,\psi}(H)$.

$\psi$-WL and $\psi$-2-WL indistinguishability can be used as a heuristic test for a special type of graph isomorphism. Two featured graphs $(G, X_G)$ and $(H, X_H)$ are feature isomorphic if there is a graph isomorphism $\sigma : V_G \to V_H$ such that $X_G(v) = X_H(\sigma(v))$ for all $v \in V_G$. Two feature isomorphic graphs will be $\psi$-WL and $\psi$-2-WL indistinguishable.

Fact 2. If $(G, X_G)$ and $(H, X_H)$ are feature isomorphic, then $(G, X_G)$ and $(H, X_H)$ are $\psi$-WL indistinguishable and $\psi$-2-WL indistinguishable for any RPE $\psi$.

Interestingly, these two WL tests, RPE-aug WL and RPE-2-WL, are equivalent for pseudo-symmetric RPEs. See Appendix A.2 for a proof.

Theorem 3.5 (Equivalence of RPE-aug WL and RPE-2-WL). Let $\psi$ be a pseudo-symmetric RPE. Then two featured graphs $(G, X_G)$ and $(H, X_H)$ are $\psi$-WL indistinguishable iff they are $\psi$-2-WL indistinguishable.

This result generalizes the well-known fact that the WL and 2-WL tests are equivalent; see (Huang & Villar, 2021). Finally, in contrast to Lemma 3.1, we have the following slight generalization of (Zhang et al., 2023, Theorem 4), which only considers the case where the RPE is a distance function. See Appendix A.6 for a sketch of the proof.

Lemma 3.6 (Equivalence of RPE-aug WL and RPE-GT). Let $\psi$ be a diagonally-aware RPE. Let $(G, X_G)$ and $(H, X_H)$ be featured graphs. Then $(G, X_G)$ and $(H, X_H)$ are indistinguishable by the $\psi$-WL test iff $(G, X_G)$ and $(H, X_H)$ are indistinguishable by all $\psi$-RPE-GTs.

3.2. Main Results: APE vs RPE Transformers

In this section, we show that any APE can be turned into an RPE without changing its distinguishing power in terms of transformers and vice versa.

Mapping APEs to RPEs. Let $S$ be any set, and let $\mathrm{Mul}_2(S)$ denote the collection of all 2-element multisubsets of $S$. For a map $f : S \to \mathbb{R}^d$, define the map $h_f : \mathrm{Mul}_2(S) \to \mathbb{R}^d$ as $h_f(\{\{x, y\}\}) = f(x) + f(y)$. The map $h_f$ is a special case of the Deep Sets network (Zaheer et al., 2017) as it operates on multisets. Accordingly, we refer to $h_f$ as a Deep Set network in this paper.
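As a small illustration (a sketch, not the authors' implementation), the following builds $h_f$ for a toy MLP $f$ with random placeholder weights and applies it pairwise to an APE. This is exactly the symmetric pairwise construction used in the next paragraph to turn an APE into an RPE; the dimensions and all names are arbitrary.

```python
import numpy as np

def make_h(f):
    """Given f : R^l -> R^d, return h_f acting on 2-element multisets {{x, y}}:
    h_f({{x, y}}) = f(x) + f(y).  The sum makes h_f independent of the order."""
    return lambda x, y: f(x) + f(y)

# A toy f: one hidden-layer MLP with fixed random weights (placeholder only).
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 4))
f = lambda x: np.maximum(x @ W1, 0) @ W2

h_f = make_h(f)

# Turn an APE phi (one row per vertex) into a pairwise encoding:
# psi_f(u, v) = h_f({{phi(u), phi(v)}}).
phi = rng.normal(size=(5, 3))                       # APE for a 5-vertex graph
psi_f = np.array([[h_f(phi[u], phi[v]) for v in range(5)] for u in range(5)])

print(psi_f.shape)                                  # (5, 5, 4): an RPE tensor
print(np.allclose(psi_f[1, 2], psi_f[2, 1]))        # True: symmetric by construction
```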
We propose the following way of mapping APEs to RPEs: given a function f and an APE ϕ, we define the RPE ψf for a graph G as ψf G(u, v) := hf({{ϕG(u), ϕG(v)}}}). Our key observation is that given an APE ϕ, there exists a universal function f whose induced RPE ψf distinguishes the same pairs of graphs as ϕ (Lemma 3.7). The main Theorem 3.8 will then follow. Proofs of these results can be found in Appendix A.7. Lemma 3.7. Let ϕ be an APE. Then there is a function f such that any two featured graphs (G, XG) and (H, XH) are indistinguishable by ϕ iff (G, XG) and (H, XH) cannot be distinguished by the ψf-2-WL test. Theorem 3.8. For any APE ϕ, there exists a function f such that any two graphs (G, XG) and (H, XH) are indistinguishable by all ϕ-APE-GTs iff they are indistinguishable by all ψf-RPE-GTs. Mapping RPEs to APEs. In the other direction, we can transform an RPE ψ to an APE ϕg by passing ψ through a 2-equivariant graph network (2-EGN) g. (See (Maron et al., 2019b) or Appendix A.3 for a definition of 2-EGN.) As 2-EGNs are permutation equivariant, the output ϕg will be a well-defined APE in the sense of Definition 2.1. This is the approach taken by Lim et al. (2022) to compute an APE from the eigenspaces of the graph Laplacian. It turns out that mapping RPEs to APEs using 2-EGNs can preserve their ability to distinguish non-featured graphs; however, mapping an RPE to an APE may decrease its ability to distinguish featured graphs. Proofs of the following theorems and example can be found in Appendix A.8. Lemma 3.9. Let ψ be a diagonally-aware RPE. For any 2-EGN g, if (G, XG) and (H, XH) are indistinguishable by the ψ-2-WL test then (G, XG) and (H, XH) are indistinguishable by ϕg. Moreover, for any finite set of unfeatured graphs G, there is a 2-EGN g such that if G, H G are indistinguishable by ϕg, then G and H are indistinguishable by the ψ-2-WL test. Theorem 3.10. Let ψ be a diagonally-aware RPE. For any 2-EGN g, if (G, XG) and (H, XH) are indistinguishable by all ψ-RPE-GTs, then (G, XG) and (H, XH) are indistinguishable by ϕg. Moreover, for any finite set of unfeatured graphs G, there is a 2-EGN g such that if G, H G are indistinguishable by all ϕg-APE-GTs, then G and H are indistinguishable by all ψ-RPE-GTs. Example 1. There exists an RPE ψ and featured graphs (G, XG) and (H, XH) that are distinguishable by ψ-2-WL but are indistinguishable by ϕg for any 2-EGN g. Remark 3.11. Even in the case of unfeatured graphs, there is still a slight asymmetry in the statements of the results for APEs (Theorem 3.8) and RPEs (Theorem 3.10), as the result for APEs does not have the restriction on a finite set of graphs. This restriction arises because there is no known universal 2-EGN equal in power to the RPE-2-WL test. This is in part due to the fact that known 2-EGNs would need an unbounded hidden dimension and number of layers to distinguish all pairs of graphs (Maron et al., 2019a). More expressive RPE to APE maps. Theorem 3.10 shows that RPEs can be transformed into equally-powerful (for unfeatured graphs) APEs using 2-EGNs. However, it is worth noting that RPEs can be turned into APEs that are even more powerful than the original RPE using k EGNs for k > 2 (Maron et al., 2019a). In the extreme case of k Ω(n), k-EGNs can distinguish all pairs of nonisomorphic graphs with n vertices. Such techniques for constructing APEs from RPEs have been previously explored in the literature (Huang et al., 2024); for example, Expressive Basis Net (Lim et al., 2022). 
However, the downside to these techniques is that their computational cost increases with their expressive power; $k$-EGNs take $\Theta(n^k)$ time as they operate on tensors of size $\Theta(n^k)$. In practice, techniques that map RPEs to APEs typically only use 2-EGNs because of the computational cost of higher $k$-EGNs.

Restrictions and Implications. Our main findings (Theorem 3.8 and Theorem 3.10) show that APE-GTs and RPE-GTs have comparable distinguishing capabilities for unfeatured graphs. However, in practice, learning the Deep Set (to serve as $f$) in Theorem 3.8 or the 2-EGN (to serve as $g$) in Theorem 3.10 may not always yield the desired maps. Hence, it remains an interesting open question to figure out whether it is theoretically easier to learn the Deep Set or the 2-EGN. For example, one would benefit in designing architectures from understanding which method requires a lower dimension for hidden layers. Finally, we would like to point out that converting RPEs to APEs to fit a specific graph transformer architecture is not recommended, as shown by Example 1. See also the discussion in Remark 4.5 and Appendix C.3 for some empirical validation of this point.

4. Comparing Graph Transformers with Different Positional Encodings

In this section, we compare the distinguishing power of graph transformers with different (absolute or relative) positional encodings. The main results are summarized in Figure 2.

[Figure 2. Hierarchy of PEs: a diagram relating the adjacency and Laplacian matrices, WL & MPNN, combinatorially-aware RPEs, SPD, RD, spectral kernels, powers of $L$, $L^\dagger$, and SPE (via Proposition 4.10, Corollary 4.3, Lemma B.13, and Theorems 4.4, 4.6, 4.16). The arrows indicate that the corresponding positional encoding is less strong than the one it points to in terms of distinguishing power. The two-head arrow indicates that the non-decreasing property only holds for unfeatured graphs. The dotted arrow between SPD and RD refers to some partial evidence (cf. Theorem B.18) that RD is stronger than SPD in some respects; however, it is an open question how the two compare as RPEs.]

Intuitively, by Theorem 3.8, APE-GTs can be turned into RPE-GTs preserving the distinguishing power. This enables us to compare APE-GTs with RPE-GTs. When comparing two RPE-GTs, by Lemma 3.6 we simply need to compare the corresponding RPEs via the RPE-aug WL test. For two RPEs $\psi_1$ and $\psi_2$, we say that $\psi_1$-WL is at least as strong as $\psi_2$-WL if for any two featured graphs $(G, X_G)$ and $(H, X_H)$ such that $\chi_{\psi_1}(G) = \chi_{\psi_1}(H)$, we have $\chi_{\psi_2}(G) = \chi_{\psi_2}(H)$. Moreover, we say that $\psi_1$-WL and $\psi_2$-WL are equally strong if for any two featured graphs $(G, X_G)$ and $(H, X_H)$, we have that $\chi_{\psi_1}(G) = \chi_{\psi_1}(H)$ iff $\chi_{\psi_2}(G) = \chi_{\psi_2}(H)$.

Some highlights of this section include a comparison of SPE-APE-GTs with RD-RPE-GTs (Theorem 4.4), identification of a class of RPEs whose corresponding RPE-GTs are at least as strong as MPNNs (Proposition 4.10 and Corollary 4.17), and suggesting new RPEs for directed graphs (Proposition 4.12).

4.1. Resistance Distance, Spectral Kernels, and SPE

Many popular APEs (such as BasisNet (Lim et al., 2022) and SPE (Huang et al., 2024)) and RPEs (such as RD (Zhang et al., 2023)) are defined using the spectrum of the Laplacian $L$. While BasisNet and SPE have been shown to be equivalent (Huang et al., 2024, Proposition 3.2), it is unclear how either of these compares to RD in terms of transformers. More generally, it is unknown how much information RD-WL carries about the spectrum of $L$.
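Definition 4.1 below makes the spectral-kernel view precise. As a concrete reference point (a short sketch based on the standard formulas, not code from the paper), the following computes the Laplacian pseudoinverse $L^\dagger$ as the spectral kernel for $f(x) = x^{-1}$, the resistance distance via $\mathrm{RD}(u,v) = L^\dagger(u,u) + L^\dagger(v,v) - 2L^\dagger(u,v)$, and a heat kernel; the 4-vertex path graph used at the end is an arbitrary example.

```python
import numpy as np

def laplacian(A):
    """Unnormalized graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

def spectral_kernel(L, f):
    """K^f = sum over nonzero eigenvalues of f(lambda_i) z_i z_i^T."""
    lam, Z = np.linalg.eigh(L)
    K = np.zeros_like(L, dtype=float)
    for l, z in zip(lam, Z.T):
        if l > 1e-9:                      # skip the zero eigenvalue(s)
            K += f(l) * np.outer(z, z)
    return K

def resistance_distance(A):
    """RD(u, v) = L+(u,u) + L+(v,v) - 2 L+(u,v), with L+ the pseudoinverse of L."""
    Lp = spectral_kernel(laplacian(A), lambda l: 1.0 / l)   # this is exactly L+
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

# Path graph on 4 vertices: on a tree, resistances equal shortest-path distances.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
RD = resistance_distance(A)
print(np.round(RD, 3))                                       # e.g. RD(0, 3) = 3.0
H1 = spectral_kernel(laplacian(A), lambda l: np.exp(-l))     # heat kernel H(1)
```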
In this section, we establish that SPE-APE-GTs are stronger than resistance distance RPE-GTs (Theorem 4.4). As one step to prove this, we show that two graphs are indistinguishable by RD-WL iff they are indistinguishable by $L^\dagger$-WL, where $L^\dagger$ is the pseudoinverse of $L$ (Corollary 4.3). This result gives a partial answer to the question of how much information RD-WL carries about the Laplacian spectrum. This is weaker than saying that RD-WL indistinguishable graphs have the same $L^\dagger$ up to permutation, but it does show that the $L^\dagger$ of these graphs are indistinguishable when combined with RPE-aug WL (and thus RPE-GTs). In fact, our result about RD-WL is a corollary of a stronger result about the equivalence of RPE-aug WL tests with the spectral distance and the spectral kernel as RPEs.

Definition 4.1. Let $L = \sum_{i=2}^{n} \lambda_i z_i z_i^{T}$ be the spectral decomposition, where $\{\lambda_2, \ldots, \lambda_n\}$ are the non-zero eigenvalues and $\{z_2, \ldots, z_n\}$ is an orthonormal basis of eigenvectors. Let $f : \mathbb{R}_+ \to \mathbb{R}_+$ be a function. The spectral kernel (Hammond et al., 2011) corresponding to $f$ is the matrix $K^f_G = \sum_{i=2}^{n} f(\lambda_i) z_i z_i^{T}$. The spectral distance corresponding to $f$ is defined as $d^f_G(u, v) := \sqrt{K^f_G(u, u) + K^f_G(v, v) - 2K^f_G(u, v)}$.

Example 2. The diffusion distance (at time $t$) (Coifman & Lafon, 2006) is the spectral distance $D^f$ for the function $f(x) = e^{-tx}$. The heat kernel is the corresponding spectral kernel, denoted by $H(t) = \sum_{i=2}^{n} e^{-\lambda_i t} z_i z_i^{T}$, which has been proposed as an RPE by Choromanski et al. (2022).

Note: Although eigenvectors are not unique due to the choice of bases for eigenspaces, the spectral kernel and spectral distance are unique up to graph isomorphism and so are well-defined RPEs. Let $D^f$ denote the RPE assigning the matrix of spectral distances $d^f_G$ to a graph $G$, and let $K^f$ denote the RPE assigning the matrix of the spectral kernel $K^f_G$.

Theorem 4.2. Let $f : \mathbb{R}_+ \to \mathbb{R}_+$. $D^f$-WL is at least as strong as $K^f$-WL. $K^f$-WL with diagonal augmentation is at least as strong as $D^f$-WL.

Theorem 4.2 is a consequence of the fact that the spectral distance and kernel are the distance matrix and Gram matrix, respectively, of a point cloud in $\mathbb{R}^d$. A variant of the WL algorithm for general distance matrices of point clouds was studied by delle Rose et al. (2023) as a heuristic for determining if two point clouds are isometric, so our proof may have implications for these algorithms as well. A proof of this theorem can be found in Appendix B.2.

The resistance distance (RD) is the squared spectral distance of the function $f(x) = x^{-1}$, and the pseudoinverse of the Laplacian $L^\dagger$ is the corresponding spectral kernel, so Theorem 4.2 applies to RD and $L^\dagger$. Interestingly, RD has additional properties that allow us to drop the assumption of diagonal augmentation. A proof of the following theorem can be found in Appendix B.2.

Corollary 4.3. RD-WL and $L^\dagger$-WL are equally strong.

Finally, as an immediate consequence of Corollary 4.3 (combined with Theorem 3.10), we compare RD-RPE-GTs with existing APE-GTs using the so-called Stable and Expressive Positional Encoding (SPE) (Huang et al., 2024). A proof of this theorem can be found in Appendix B.3.

Theorem 4.4. For unfeatured graphs, diagonally augmented SPE-APE-GT is at least as strong as RD-RPE-GT.

Remark 4.5. In Appendices C.1 and C.2, we empirically validate this theorem by showing that SPE-APE-GTs perform similarly to RD-RPE-GTs on some unfeatured graph classification/isomorphism tasks.
However, when node features are involved, the theorem may not hold since the 2-EGN involved in the construction of SPE-APE-GT may have weaker distinguishing power than the RD-RPE-GT (cf. Example 1). In Appendix C.3, we empirically observe that for graph regression tasks when node features are involved, using RD-RPE-GT can indeed give us an edge in performance over SPE-APE-GT. 4.2. Resistance Distance and Cut Edges. The first proposed RPE for graph transformers was the shortest-path distance (SPD) (Ying et al., 2021). Later, Zhang et al. (2023) proposed to use the stack of RD and SPD as an RPE. It is natural to wonder how RD-WL compares with SPD-WL. While this remains an open question, we prove that RD-WL is at least as strong as SPD-WL in at least one regard. Zhang et al. (2023) showed that SPD-WL could distinguish cut edges but not cut vertices in unfeatured graphs. Furthermore, the authors showed that RD-WL could distinguish cut vertices. However, it was not known whether or not RD-WL could distinguish cut edges in a graph. We answer this open question in the affirmative: RD-WL can distinguish cut edges in a graph. See Theorem B.18 for a precise theorem statement and proof. However, our experiments on the BREC dataset (Wang & Zhang, 2024) suggest (but do not prove) that RD-WL may not be strictly stronger than SPD-WL, as there are pairs of graphs in this dataset that an SPD-RPE-GT was able to learn to distinguish, while we were unsuccessful in training a RDRPE-GT to distinguish these graphs. See Appendix C.2 for details. 4.3. Powers of Matrices and Spectral Kernels Another common type of RPE is the stack of powers of some matrix associated with the graph; for example, GRIT (Ma et al., 2023) uses a stack of powers of the random-walk adjacency matrix of a graph called the relative randomwalk positional encoding (RRWP). In this section, we show that using stacks of various matrices as an RPE is at least as strong as any spectral kernel RPEs for any f : R+ R+. Proofs for this section can be found in Appendix B.4. We first prove that using sufficiently many powers of the Laplacian is at least as strong as any spectral kernel. Theorem 4.6. (I, L, . . . , L2n 1)-WL is at least as strong as Kf-WL on graphs with at most n nodes. Remark 4.7. While it may seem excessive to consider 2n powers of L to match the power of a single spectral kernel, this is not computationally more expensive than computing the spectral kernel in the first place, which requires adding n matrices f(λi)ziz T i of size n2, which takes O(n3) time. A variant of RRWP that uses stacks powers of the symmetrically-normalized adjacency matrix b A = D 1/2AD 1/2 is at least as strong as spectral kernels using the symmetrically-normalized Laplacian b Kf (see Appendix B.4 for definition). Theorem 4.8. (I, b A, . . . , b A2n 1)-WL is at least as strong as b Kf-WL on graphs with at most n nodes. Finally, we show that stacked heat kernels H(t) (cf. Example 2) are also at least as strong as any spectral kernel. Theorem 4.9. (I, H(1), . . . , H(2n 1))-WL is at least as strong as Kf-WL on graphs with at most n nodes. 4.4. RPEs: Common Matrices for Graphs There are multiple common matrices used to characterize graphs, such as the adjacency matrix, Laplacian matrix and their normalized versions. We will examine their corresponding RPEs and compare their distinguishing power. Undirected graphs. For undirected graphs, we show that various common matrices lead to equivalent RPEs. Let e A = D 1A. 
Let b L = I b A and e L = I e A denote the symmetrically-normalized and the random-walk graph Laplacians, respectively. These, along with A, L and b A, give rise to six RPEs and hence five corresponding RPEaug WL tests. The following result suggests that in practice, perhaps there is no need to use more than one such type of information with RPE-GTs and one can interchangeably use any of the matrix as RPE. Comparing Graph Transformers via Positional Encodings Proposition 4.10. A, b A, e A, L, b L and e L induce equivalent RPE-aug WL tests. In particular, all of these RPE-aug WL tests are equally strong as the WL test. Directed graphs. Directed graphs are sometimes turned into a magnetic graph to obtain a Hermitian matrix called the magnetic Laplacian (Fanuel et al., 2018). Let G be a directed graph with adjacency matrix A. Let A := (A + AT )/2 and let D denote the corresponding degree matrix. Then, given a parameter α, the magnetic Laplacian is defined as Lα := D T α A , where T α ij = exp(ι 2πα sgn(Aji Aij)) and ι is the imaginary unit. As mentioned in Section 2, Lα can serve as a pseudo-symmetric RPE. In fact, Lα can be obtained via the pseudo-symmetric augmentation (A, AT ) of A (cf. Lemma 2.7) as well as an extra augmentation via D : Lemma 4.11. Lα = f α((D , A, AT )) where f α : R3 R is applied to each channel n n such that f α(x, y, z) := x exp(ι2πα sgn(z y)) y + z In this way, we obtain an RPE (D , A, AT ) more general than the magnetic Laplacian matrix. Hence, instead of using magnetic Laplacian as positional encoding, one should consider using (D , A, AT )-RPE-GT. Proposition 4.12. The (D , A, AT )-WL test is at least as strong as the Lα-WL test for any α. 4.5. RPEs: Combinatorial-Awareness and WL It is an interesting question what information about a graph a transformer can extract from a positional encoding. In Definition 2.2, the learnable functions in an RPE-GT fi are applied elementwise to the RPEs. Moreover, the functions are also applied elementwise to the attention scores. In other words, given an RPE ψ, even though certain graph structure can be obtained indirectly through more complicated operations on ψ, such hidden information may not be accessible to the transformer because of the limited set of operations it can perform on the RPE. For example, while a graph can be reconstructed from its resistance distance matrix, an RPEGT may not have access to the edges of a graph as it cannot perform the necessary operations to reconstruct the graph. One basic piece of information a GT may want to access through an RPE is of course the adjacency matrix. We identify a desirable type of RPE where a GT will have this information. Definition 4.13. An RPE ψ is combinatorially-aware if for any two graphs G and H and vertices u, v VG and x, y VH, then ψG(u, v) = ψH(x, y) iff one of the following conditions hold: (i) {u, v} EG and {x, y} EH (ii) {u, v} / EG and {x, y} / EH. As it turns out, elementwise operations are insufficient for reconstructing the graph structure from the resistance distance, as evidenced by the following example. Example 3. The shortest-path distance (SPD) is combinatorially-aware as SPD(u, v) = 1 iff {u, v} E. The resistance distance is, however, not combinatoriallyaware. See Appendix B.7 for a counterexample. As in the case of diagonal-awareness, given any RPE, one can always augment it to be combinatorially-aware. Definition 4.14 (Combinatorial Augmentation). Let ψ be any RPE. 
We define its combinatorial augmentation Cψ as follows: for any graph G, Cψ G(u, v) := (AG(u, v), ψG(u, v)) for any u, v V , where AG denotes the adjacency matrix of G. This in fact provides an equivalent characterization of combinatorially-aware RPEs. Proposition 4.15. An RPE ψ is combinatorially-aware if for any G, the following condition holds for all u, v, x, y VG: ψG(u, v) = ψG(x, y) iff Cψ G(u, v) = Cψ G(x, y). The reason combinatorially-aware RPEs are important is because RPE-aug WL with combinatorially-aware RPEs are at least as strong as the WL test (Weisfeiler & Leman, 1968). A proof of this theorem can be found in Appendix B.6. Theorem 4.16. Let ψ be a combinatorially-aware RPE. Then, ψ-WL is at least as strong as WL. Theorem 4.16 is of interest because the WL test is equally strong at distinguishing graphs as message passing graph neural networks (MPNNs) (Xu et al., 2018). Corollary 4.17. Let ψ be a combinatorially-aware RPE. Then ψ-RPE-GTs are at least as strong as MPNNs. 5. Conclusion and Open Questions We have established a framework for comparing different types of positional encodings. Future work can continue this line of work of comparing different specific positional encodings, e.g. shortest-path vs. resistance distance or resistance distance vs. SPE. Furthermore, our theoretical results provide some suggestions for designing new positional encodings, e.g., one may not want to convert RPEs into APEs as done in the literature. Finally, it would be interesting to shift the focus of research on positional encodings for graph transformers from coarse-grained distinguishing-power results (what pairs of graphs can GTs distinguish) to more fine-grained results on expressive power, for example, approximation results (Azizian & Lelarge, 2021) or results using RPE-aug WL-inspired distances to study graph transformers (Chen et al., 2022; 2023; B oker et al., 2024). Comparing Graph Transformers via Positional Encodings Acknowledgements This work was supported by the NSF (Grants CCF-2217058 for GM, ZW, and YW; CCF-2112665 and CCF-2310411 for YW; and CCF-2311180 and CCF-1941086 for MB and AN). MB would like to thank YW for supporting a visit to UCSD where this research was initiated. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of potentially more effective graph learning models due to our work, none of which we feel must be specifically highlighted here. Azizian, W. and Lelarge, M. Expressive power of invariant and equivariant graph neural networks. In ICLR 2021International Conference on Learning Representations, 2021. B oker, J., Levie, R., Huang, N., Villar, S., and Morris, C. Fine-grained expressivity of graph neural networks. Advances in Neural Information Processing Systems, 36, 2024. Chen, S., Lim, S., M emoli, F., Wan, Z., and Wang, Y. Weisfeiler-Lehman meets Gromov-Wasserstein. In International Conference on Machine Learning, pp. 3371 3416. PMLR, 2022. Chen, S., Lim, S., M emoli, F., Wan, Z., and Wang, Y. The Weisfeiler-Lehman distance: Reinterpretation and connection with GNNs. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 404 425. PMLR, 2023. Chen, Z., Chen, L., Villar, S., and Bruna, J. Can graph neural networks count substructures? Advances in neural information processing systems, 33:10383 10395, 2020. Choromanski, K., Lin, H., Chen, H., Zhang, T., Sehanobish, A., Likhosherstov, V., Parker-Holder, J., Sarlos, T., Weller, A., and Weingarten, T. 
From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 3962 3983. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr. press/v162/choromanski22a.html. Coifman, R. R. and Lafon, S. Diffusion maps. Applied and computational harmonic analysis, 21(1):5 30, 2006. delle Rose, V., Kozachinskiy, A., Rojas, C., Petrache, M., and Barcelo, P. Three iterations of (d 1)-WL test distinguish non isometric clouds of d-dimensional points. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview. net/forum?id=9yh Ycjsdab. Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021. Dwivedi, V. P., Joshi, C. K., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. Journal of Machine Learning Research, 24 (43):1 48, 2023. Fanuel, M., Ala ız, C. M., Fern andez, A., and Suykens, J. A. Magnetic eigenmaps for the visualization of directed networks. Applied and Computational Harmonic Analysis, 44(1):189 199, 2018. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263 1272. PMLR, 2017. Hammond, D. K., Vandergheynst, P., and Gribonval, R. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129 150, 2011. Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). ar Xiv preprint ar Xiv:1606.08415, 2016. Huang, N. T. and Villar, S. A short tutorial on the Weisfeiler Lehman test and its variants. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8533 8537. IEEE, 2021. Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graph neural networks. In The Twelfth International Conference on Learning Representations, 2024. Jin, B., Zhang, Y., Meng, Y., and Han, J. Edgeformers: Graph-empowered transformers for representation learning on textual-edge networks. ar Xiv preprint ar Xiv:2302.11050, 2023. Kreuzer, D., Beaini, D., Hamilton, W., L etourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618 21629, 2021. Lim, D., Robinson, J. D., Zhao, L., Smidt, T., Sra, S., Maron, H., and Jegelka, S. Sign and basis invariant networks for spectral graph representation learning. In The Eleventh Comparing Graph Transformers via Positional Encodings International Conference on Learning Representations, 2022. Ma, L., Lin, C., Lim, D., Romero-Soriano, A., Dokania, P. K., Coates, M., Torr, P., and Lim, S.-N. Graph inductive biases in transformers without message passing. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 23321 23337. PMLR, 23 29 Jul 2023. URL https:// proceedings.mlr.press/v202/ma23c.html. Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. 
Advances in neural information processing systems, 32, 2019a. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019b. Mialon, G., Chen, D., Selosse, M., and Mairal, J. Graphit: Encoding graph structure in transformers, 2021. M uller, L., Galkin, M., Morris, C., and Ramp aˇsek, L. Attending to graph transformers. Transactions on Machine Learning Research, 2024. ISSN 28358856. URL https://openreview.net/forum? id=Hhbq HBBrf Z. Ramp aˇsek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501 14515, 2022. Shaw, P., Uszkoreit, J., and Vaswani, A. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464 468, 2018. Spielman, D. Spectral and algebraic graph theory. Available at http://cswww.cs.yale.edu/homes/spielman/sagt/sagt.pdf (2021/12/01), 2019. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Wang, Y. and Zhang, M. An empirical study of realized GNN expressiveness. In International Conference on Machine Learning. PMLR, 2024. Weisfeiler, B. and Leman, A. The reduction of a graph to canonical form and the algebra which appears therein. nti, Series, 2(9):12 16, 1968. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2018. Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877 28888, 2021. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. Advances in neural information processing systems, 30, 2017. Zhang, B., Luo, S., Wang, L., and He, D. Rethinking the expressive power of GNNs via graph biconnectivity. In The Eleventh International Conference on Learning Representations, 2023. Zhu, W., Wen, T., Song, G., Wang, L., and Zheng, B. On structural expressive power of graph transformers. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 23, pp. 3628 3637, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701030. doi: 10. 1145/3580305.3599451. URL https://doi.org/ 10.1145/3580305.3599451. Comparing Graph Transformers via Positional Encodings A. Details from Section 3 A.1. Example of RPE vs RPE-aug WL Figure 3. Left: G. Right: H In this section, we show that RPE-aug WL can distinguish graphs that are indistinguishable by just their RPE. By Lemma 3.6, this also means the graphs are distinguishable by some RPE transformer. Consider the graphs G and H in Figure 3 and the RPE A that assigns each graph its adjacency matrix. Both graphs have 4 vertices and 4 edges, so their adjacency matrices AG and AH are indistinguishable. However, after one iteration of A-WL, all vertices of G will have the same color, while the degree 1 vertex in H will have a different color from all vertices in G. Therefore, G and H are distinguishable by A-WL. A.2. 
Equivalence of RPE-aug WL and RPE-2-WL: Proof of Theorem 3.5 Our proof of Theorem 3.5 is based on the following lemma. Lemma A.1. Let ψ be a pseudo-symmetric RPE. Let (G, XG) and (H, XH) be featured graphs. Let u, v VG and x, y VH. Then χ(t) 2,ψG(u, v) = χ(t) 2,ψH(x, y) iff ψG(u, v) = ψH(x, y), χ(t) ψG(u) = χ(t) ψH(x), and χ(t) ψG(v) = χ(t) ψH(y). Proof of Lemma A.1. We prove this by induction on t. For t = 0, this is follows directly from definition as χ(t) ψG(v) = XG(v) and χ(0) 2,ψG(u, v) = (XG(u), XG(v), ψG(u, v)). Now inductively suppose this is true for some value t; we will show this is true for t + 1. First assume that χ(t+1) 2,ψG (u, v) = χ(t+1) 2,ψH (x, y). Then, it follows from the definition that 1. χ(t) 2,ψG(u, v) = χ(t) 2,ψH(x, y). 2. {{χ(t) 2,ψG(u, w) : w VG}} = {{χ(t) 2,ψH(x, z) : z VH}}; 3. {{χ(t) 2,ψG(w, v) : w VG}} = {{χ(t) 2,ψH(z, y) : z VH}}. By induction, Item 1 implies that 4. ψG(u, v) = ψH(x, y), and by pseudo-symmetry, ψG(v, u) = ψH(y, x); 5. χ(t) ψG(u) = χ(t) ψH(x) and χ(t) ψG(v) = χ(t) ψH(y); By item 2, there is a bijection σ : VG VH such that χ(t) 2,ψG(u, w) = χ(t) 2,ψH(x, σ(w)). The inductive hypothesis implies that ψG(u, w) = ψH(x, σ(w)) and χ(t) ψG(w) = χ(t) ψH(σ(w)). Therefore, the ψ-WL colors of u and x at t + 1 are the same as χ(t+1) ψG (u) = (χ(t) ψG(u), {{ (χ(t) ψG(w), ψG(u, w)) : w VG }}) = (χ(t) ψH(x), {{ (χ(t) ψH(z), ψH(x, z)) : z VH }}) = χ(t+1) ψH (x). By item 3, there is a bijection σ : VG VH such that χ(t) 2,ψG(w, v) = χ(t) 2,ψH(σ (w), y). The inductive hypothesis implies that ψG(w, v) = ψH(σ (w), y) and χ(t) ψG(w) = χ(t) ψH(σ (w)). By pseudo-symmetry, one also has that ψG(v, w) = Comparing Graph Transformers via Positional Encodings ψH(y, σ (w)). Therefore, the ψ-WL colors of v and y at t + 1 are the same as χ(t+1) ψG (v) = (χ(t) ψG(v), {{ (χ(t) ψG(w), ψG(v, w)) : w VG }}) = (χ(t) ψH(y), {{ (χ(t) ψH(y), ψH(y, z)) : z VH }}) = χ(t+1) ψH (y). Conversely, assume that ψG(u, v) = ψH(x, y), χ(t+1) ψG (u) = χ(t+1) ψH (x), and χ(t+1) ψG (v) = χ(t+1) ψH (y). By definition of the ψ- WL, we have that χ(t) ψG(u) = χ(t) ψH(x) and χ(t) ψG(v) = χ(t) ψH(y). Hence, by induction, we have that χ(t) 2,ψG(u, v) = χ(t) 2,ψH(x, y). Moreover, there exist bijections σux : VG VH and σvy : VG VH such that 1. for any w VG, χ(t) ψG(w) = χ(t) ψH(σux(w)) and ψG(u, w) = ψH(x, σux(w)); 2. for any w VG, χ(t) ψG(w) = χ(t) ψH(σvy(w)) and ψG(v, w) = ψH(y, σvy(w)), and by pseudo-symmetry, we also have that ψG(w, v) = ψH(σvy(w), y). For any given w VG, by induction, one has that χ(t) 2,ψG(u, w) = χ(t) 2,ψH(x, σux(w)) and χ(t) 2,ψG(w, v) = χ(t) 2,ψH(σvy(x), y). Therefore, {{χ(t) 2,ψG(u, w) : w VG}} = {{χ(t) 2,ψH(x, z) : z VH}} and {{χ(t) 2,ψG(w, v) : w VG}} = {{χ(t) 2,ψH(z, y) : z VH}}. This implies that χ(t+1) 2,ψG (u, v) = χ(t+1) 2,ψH (x, y) which concludes the proof. Now we are ready to prove our main theorem in this section. Theorem 3.5 (Equivalence of RPE-aug WL and RPE-2-WL.). Let ψ be a pseudo-symmetric RPE. Then two featured graphs (G, XG) and (H, XH) are ψ-WL indistinguishable iff they are ψ-2-WL indistinguishable. Proof of Theorem 3.5. For the forward direction, we will show that for all t 0, if {{χ(t+1) ψG (u) : u VG}} = {{χ(t+1) ψH (v) : v VH}}, then {{χ(t) 2,ψG(u, v) : u, v VG}} = {{χ(t) 2,ψH(x, y) : x, y VH}}. This implies that if G and H are indistinguishable by the ψ-WL, they are indistinguishable by the ψ-2-WL. We first prove the base case of t = 0. 
As {{χ(1) ψG(u) : u VG}} = {{χ(1) ψH(v) : v VH}}, then there is a bijection σ : VG VH such that χ(1) ψG(u) = χ(1) ψG(σ(u)). Moreover, by the definition of χ(1) ψ , for each u V , there is a bijection σu : VG VH such that ψG(u, v) = ψH(σ(u), σu(v)). Therefore, we can define a bijection Φ : VG VG VH VH defined Φ(u, v) = (σ(u), σu(v)) such that ψG(u, v) = ψH(Φ(u, v)). Therefore, {{χ(0) 2,ψG(u, v) : u, v VG}} = {{χ(0) 2,ψH(x, y) : x, y VH}}. We now prove that for each t 1, if {{χ(t+1) ψG (u) : u VG}} = {{χ(t+1) ψH (v) : v VH}}, then {{χ(t) 2,ψG(u, v) : u, v VG}} = {{χ(t) 2,ψH(x, y) : x, y VH}}. Assume that both graphs are assigned k(t+1) colors by χ(t+1). More specifically, assume there are partitions VG = V (t+1) G,1 V (t+1) G,k(t+1) and VH = V (t+1) H,1 V (t+1) H,k(t+1) such that 1. |V (t+1) G,i | = |V (t+1) H,i |; 2. for any u V (t+1) G,i and v V (t+1) G,j , χ(t+1) ψG (u) = χ(t+1) ψG (v) iff i = j; 3. for any u V (t+1) G,i and x V (t+1) H,i , χ(t+1) ψG (u) = χ(t+1) ψH (x). We can define analogous partitions for χ(t) ψ . These partitions induce a partition on the set of pairs of vertices VG VG = 1 i k(t+1) 1 j k(t) V (t+1) G,i V (t) G,j and VH VH = 1 i k(t+1) 1 j k(t) V (t+1) H,i V (t) H,j; see Figure 4 for an illustration of the decomposition. We show that χ(t) 2,ψ has the same multiset of colors on V (t+1) G,i V (t) G,j and V (t+1) H,i V (t) H,j. By the definition of the partitions, there exists a bijection σ(t+1) i : V (t+1) G,i V (t+1) H,i such that χ(t+1) ψG (u) = χ(t+1) ψH (σ(t+1) i (u)) for any u V (t+1) G,i . By definition of ψ-WL, we have that χ(t) ψG(u) = χ(t) ψH(σ(t+1) i (u)) for any u V (t+1) G,i . Furthermore, for any u V (t+1) G,i , {{ (χ(t) ψG(w), ψG(u, w)) : w VG }} = {{ (χ(t) ψH(z), ψH(σ(t+1) i (u), z)) : z VH }}. Comparing Graph Transformers via Positional Encodings V (t) G,1 V (t) G,2 V (t+1) G,1 V (t+1) G,2 V (t+1) G,3 V (t+1) G,4 Figure 4. Decomposition of VG VG. Now for each 1 j k(t), consider the sub-multiset {{ (χ(t) ψG(w), ψG(u, w)) : w V (t) G,j }} of {{ (χ(t) ψG(w), ψG(u, w)) : w VG }}, then we have that {{ (χ(t) ψG(w), ψG(u, w)) : w V (t) G,j }} = {{ (χ(t) ψH(z), ψH(σ(t+1) i (u), z)) : z V (t) H,j }}. This is because when j = b, one has that χ(t) ψG(w) = χ(t) ψH(z) for w V (t) G,j and z V (t) H,b. {{(ψG(u, w), χ(t) ψG(u), χ(t) ψG(w)) : w V (t) G,j}} ={{(ψH(σ(t+1) i (u), z), χ(t) ψH(σ(t+1) i (u)), χ(t) ψH(z)) : z V (t) H,j}}. Therefore, by Lemma A.1, one has that {{χ(t) 2,ψG(u, w) : w V (t) G,j}} = {{χ(t) 2,ψH(σ(t+1) i (u), z) : z V (t) H,j}}. Since σ(t+1) i is bijective, we conclude that {{χ(t) 2,ψG(u, v) : u V (t+1) G,i , v V (t) G,j}} = {{χ(t) 2,ψH(x, y) : x V (t+1) H,i , y V (t) H,j}}. Taking the multiset union over 1 i k(t+1) and 1 j k(t), this implies that {{χ(t) 2,ψG(u, v) : u, v VG}} = {{χ(t) 2,ψH(x, y) : x, y VH}}. Conversely, we will show that for t 0, if {{χ(t) 2,ψG(u, v) : u, v VG}} = {{χ(t) 2,ψH(x, y) : x, y VH}}, then {{χ(t) ψG(u) : u VG}} = {{χ(t) ψH(x) : x VH}}. For the base case of t = 0, we have that {{(XG(u), XG(v), ψG(u, v)) : u, v VG}} = {{(XH(x), XH(y), ψH(x, y)) : x, y VH}}. Therefore, {{(XG(u), XG(v)) : u, v VG}} = {{(XH(x), XH(y)) : x, y VH}} and hence {{XG(u) =: u VG}}2 = {{XH(x) : x VH}}2. Now we apply the following fact: Fact 3. Let A and B be two finite multisets. If A2 = B2, then A = B. Then, we have that {{χ(0) ψG(u) = XG(u) : u VG}} = {{χ(0) ψH(x) = XH(x) : x VH}}. Now assume that for k 1, we have that {{χ(t) 2,ψG(u, v) : u, v VG}} = {{χ(t) 2,ψH(x, y) : x, y VH}}. 
Lemma A.1 implies that {{(ψG(u, v), χ(t) ψG(u), χ(t) ψG(v)) : u, v VG}} = {{(ψH(x, y), χ(t) ψH(x), χ(t) ψH(y)) : x, y VH}}. Dropping the ϕ term, one has that {{(χ(t) ψG(u), χ(t) ψG(v)) : u, v VG}} = {{(χ(t) ψH(x), χ(t) ψH(y)) : x, y VH}}. Comparing Graph Transformers via Positional Encodings This implies that the following Cartesian products are equal: {{χ(t) ψG(u) : u VG}}2 = {{χ(t) ψH(x) : x VH}}2. Then, by Fact 3 again, we have that {{χ(t) ψG(u) : u VG}} = {{χ(t) ψH(x) : x VH}}. This means that G and H are indistinguishable by ψ-WL. We now end this proof by proving Fact 3. Proof of Fact 3. We let C := A B denote the set of all distinct elements in A and B. Let C = {c1, . . . , cn} and assume that A contains r A i many ci and B contains r B i many ci for each i = 1, . . . , n. Then, A2 contains r A i r A j many (ci, cj) and B2 contains r B i r B j many (ci, cj). Therefore, A2 = B2 implies that r A i r A j = r B i r B j for all i, j = 1, . . . , n. For each i = 1, . . . , n, we let j = i and hence (r A i )2 = (r B i )2. Since we are dealing with nonnegative integers, we must have that r A i = r B i . Therefore, A = B. A.3. RPE-2-WL vs 2-EGN Equivariant or invariant graph networks (EGNs and IGNs) (Maron et al., 2019b) are another important type of graph neural networks. These models will be important when trying to understand the distinguishing power of positional encodings later. Definition A.2 (EGNs). A kth order Equivariant Graph Network (k-EGN) is a function F : Rnk d0 Rns do for s k of the form F = h m L(T ) σ σ L(1) where h and L(t) : Rnkt 1 dt 1 Rnkt dt are equivariant linear layers for each t = 1, . . . , T such that kt k, k0 = k and k L = s, σ is an activation function, and m is a pointwise MLP. When s = 0, the resulting k-EGN is also called a kth order Invariant Graph Network (k-IGN). Note: Some implementations of 2-IGNs and 2-EGNs (e.g. (Maron et al., 2019b)) stack the adjacency matrix to the input W. In this paper, unless otherwise stated, the adjacency matrix is not stacked to the input. EGNs are permutation equivariant in the sense that permutation of the order of input tensors will result in a corresponding permutation of the order of output tensors. Likewise, IGNs are permutation invariant as any permutation of the input tensors results in the same output. We are interested in 2-EGNs and 2-IGNs since they are closely related to the 2-WL graph isomorphism test. Let ψ be an RPE valued in Rk and let g : Rn2 k Rn l be any 2-EGN. Recall that for any graph G with n vertices, the RPE ψG can be represented as a tensor in Rn2 k. If XG Rn l are node features for G, let row(XG), col(XG) Rn2 l be the tensors where each row or column respectively are the node feature XG, e.g. row(XG)[i, :, :] = XG and col(XG)[:, j, :] = XG for all 1 i, j n. We now show that 2-EGNs with the input [ψG, row(XG), col(XG)] Rn2 (k+2l) have the same distinguishing powers as the ψ-2-WL test. Proposition A.3 (Equivalence of RPE-2-WL and 2-EGN). Let ψ be a diagonally-aware RPE valued in Rk. Let G be any finite set of featured graphs. Then there exists a 2-EGN g : Rn2 k Rn l such that any (G, XG), (H, XH) G can be distinguished by g with the input [ψ, row(X), col(X)] iff G and H can be distinguished by the ψ-2-WL test. This result is a generalization of existing results showing that 2-IGNs of the form g : Rn2 k Rm are equivalent to the 2-WL test (Maron et al., 2019a; Chen et al., 2020). Proof of Proposition A.3. We will prove the forward and backward direction of this statement separately in the following lemmas. 
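Before the two lemmas, here is a small sketch of the input tensor $[\psi_G, \mathrm{row}(X_G), \mathrm{col}(X_G)]$ fed to the 2-EGN in Proposition A.3, assuming $\psi_G$ is stored as a dense $n \times n \times k$ (or $n \times n$) array and $X_G$ as an $n \times l$ array; the helper name is ours.

```python
import numpy as np

def egn_input(psi, X):
    """Stack [psi_G, row(X_G), col(X_G)] into an (n, n, k + 2l) tensor.

    row(X)[i, j, :] = X[j]   (every row slice is a copy of X)
    col(X)[i, j, :] = X[i]   (every column slice is a copy of X)
    """
    n, l = X.shape
    row = np.broadcast_to(X[None, :, :], (n, n, l))
    col = np.broadcast_to(X[:, None, :], (n, n, l))
    if psi.ndim == 2:                  # scalar-valued RPE stored as a matrix
        psi = psi[:, :, None]
    return np.concatenate([psi, row, col], axis=-1)

# Example: adjacency matrix as a scalar-valued RPE for a path on 3 nodes,
# with 2-dimensional node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.arange(6, dtype=float).reshape(3, 2)
W = egn_input(A, X)
assert W.shape == (3, 3, 1 + 2 * 2)
```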
However, we emphasize that only Lemma A.5 needs the assumption of diagonal-awareness. Moreover, we emphasize that the proofs of both of these lemmas only slightly adapt existing results connecting 2-IGNs to the 2-WL test.

Lemma A.4 (2-EGNs are stronger than RPE-2-WL). Let $\psi$ be an RPE valued in $\mathbb{R}^k$. Let $\mathcal{G}$ be a finite set of graphs. Then, there exists a 2-EGN $g : \mathbb{R}^{n^2 \times (k+2l)} \to \mathbb{R}^{n \times m}$ such that if any $(G, X_G), (H, X_H) \in \mathcal{G}$ can be distinguished by the $\psi$-2-WL test, then $\{\{g([\psi_G, \mathrm{row}(X_G), \mathrm{col}(X_G)])(u) : u \in V_G\}\} \neq \{\{g([\psi_H, \mathrm{row}(X_H), \mathrm{col}(X_H)])(v) : v \in V_H\}\}$.

Proof. This follows from the proof by Maron et al. (2019a, Theorem 1) showing that there exists a 2-EGN that can distinguish any two graphs from a finite set of graphs as well as the classical 2-WL algorithm. (While their statement of Theorem 1 is only for two graphs, their proof explains why this can be generalized to a finite set of graphs.) We explain how their result generalizes to the RPE-2-WL test.

Their proof is broken into two steps. First, the authors show that given a tensor $W \in \mathbb{R}^{n^2 \times (e+1)}$ of a diagonal feature matrix stacked with the adjacency matrix, there is a 2-equivariant linear layer $f' : \mathbb{R}^{n^2 \times (e+1)} \to \mathbb{R}^{n^2 \times 4(e+2)}$ such that, for each pair $(u, v)$, the vector $f'(W)[u, v, :]$ corresponds to the isomorphism type of $(u, v)$, which is the initial color of $(u, v)$ in the 2-WL test. Second, they show that there is a $d$-layer 2-EGN $f = L^{(d)} \circ \sigma \circ \cdots \circ \sigma \circ L^{(1)} : \mathbb{R}^{n^2 \times k} \to \mathbb{R}^{n^2 \times l}$ that can implement $d$ iterations of the 2-WL algorithm; that is to say, $f(A_G)(u, v) = f(A_H)(x, y)$ iff $\chi^{(d)}(u, v) = \chi^{(d)}(x, y)$. Finally, they show that there exists an MLP $m$, together with a final layer $h$ of summing over all entries, such that $g = h \circ m \circ f$ implements the 2-WL test, i.e. $g(A_G) = g(A_H)$ iff $\chi(G) = \chi(H)$.

The difference between the 2-WL test and the RPE-2-WL test lies in the initialization step. In the classical 2-WL test, a pair $(u, v)$ is colored with its isomorphism type, while in the RPE-2-WL test, the initial color is $(X_G(u), X_G(v), \psi_G(u, v))$. If the input to the 2-EGN is the RPE matrix concatenated with the node features (i.e. the initial coloring), we can skip the initialization step and just perform the update steps of the RPE-2-WL through a 2-EGN $f = L^{(d)} \circ \sigma \circ \cdots \circ \sigma \circ L^{(1)} : \mathbb{R}^{n^2 \times k} \to \mathbb{R}^{n^2 \times l}$. Finally, instead of taking the sum of all entries, we can apply the operation of row sum (which is equivariant) to the output of $f$ to obtain our target 2-EGN $g : \mathbb{R}^{n^2 \times k} \to \mathbb{R}^{n \times l}$.

Lemma A.5 (RPE-2-WL is stronger than 2-EGNs). Let $\psi$ be a diagonally-aware RPE. Let $(G, X_G)$ and $(H, X_H)$ be graphs. If $G$ and $H$ cannot be distinguished by $\psi$-2-WL, then $(G, X_G)$ and $(H, X_H)$ cannot be distinguished by any 2-EGN with input $[\psi, \mathrm{row}(X), \mathrm{col}(X)]$.

Proof. This follows directly from the proof showing that if two graphs cannot be distinguished by the 2-WL, they cannot be distinguished by 2-EGNs (Chen et al., 2020, Lemma 4.2). The only reason we cannot apply this proof directly is that it relies on the fact that the initial colors of the 2-WL, and therefore all subsequent colors, are diagonally-aware (Chen et al., 2020, Lemma D.1), which is not necessarily the case for the initial colors of RPE-2-WL for general RPEs. However, if the RPE is diagonally-aware, then our lemma can be proven following the proof of Chen et al. (2020, Lemma 4.2).

Importance of diagonal-awareness. In the proof of Proposition A.3 (specifically in Lemma A.5), we show that RPE-2-WL is at least as strong as 2-EGNs if the RPE is diagonally-aware.
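The diagonal augmentation used later in the appendix, $\psi \mapsto (\psi, I)$, enforces diagonal-awareness by stacking an identity channel onto the RPE, so a diagonal entry can never take the same value as an off-diagonal entry. A minimal sketch (the helper name and the dense-array representation are ours); it is applied here to the RPE $\psi_H$ from the counterexample discussed next.

```python
import numpy as np

def diagonal_augment(psi):
    """Return the diagonally augmented RPE (psi, I): append an extra channel
    that is 1 on the diagonal and 0 elsewhere."""
    if psi.ndim == 2:
        psi = psi[:, :, None]
    n = psi.shape[0]
    eye = np.eye(n)[:, :, None]
    return np.concatenate([psi, eye], axis=-1)

# The 2x2 RPE assigned to H in the counterexample that follows:
psi_H = np.array([[0.0, 1.0], [1.0, 0.0]])
print(diagonal_augment(psi_H)[..., -1])   # the appended identity channel
```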
We emphasize that if we drop the assumption of diagonal-awareness, the lemma does not hold. Therefore, a 2-EGN can be strictly stronger than the RPE-2-WL test for non-diagonally-aware RPEs, as shown by the following simple example. Consider the two graphs on two vertices $G$ and $H$ where $G$ has an edge and $H$ does not. (In this example, the graphs have no node features.) Consider the (artificial) RPE $\psi$ that assigns
$$\psi_G = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \psi_H = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
Note that $\psi$ is not diagonally-aware, as the values 0 and 1 appear on both the diagonal and off-diagonal elements. The $\psi$-2-WL colorings for $t = 0$ are
$$\chi^{(0)}_{2,\psi}(G) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \chi^{(0)}_{2,\psi}(H) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
These are, in fact, already stable $\psi$-2-WL colorings, as another iteration of $\psi$-2-WL gives the same partition. This means further iterations of $\psi$-2-WL won't be able to distinguish these two graphs. For example,
$$\chi^{(1)}_{2,\psi}(G) = \begin{pmatrix} (1, (\{\{0,1\}\}, \{\{0,1\}\})) & (0, (\{\{0,1\}\}, \{\{0,1\}\})) \\ (0, (\{\{0,1\}\}, \{\{0,1\}\})) & (1, (\{\{0,1\}\}, \{\{0,1\}\})) \end{pmatrix},$$
$$\chi^{(1)}_{2,\psi}(H) = \begin{pmatrix} (0, (\{\{0,1\}\}, \{\{0,1\}\})) & (1, (\{\{0,1\}\}, \{\{0,1\}\})) \\ (1, (\{\{0,1\}\}, \{\{0,1\}\})) & (0, (\{\{0,1\}\}, \{\{0,1\}\})) \end{pmatrix}.$$
As $\chi^{(0)}_{2,\psi}(G)$ and $\chi^{(0)}_{2,\psi}(H)$ have the same multiset of colors $\{\{0, 0, 1, 1\}\}$, they are indistinguishable by $\psi$-2-WL. However, there is a 2-IGN that distinguishes these graphs. Specifically, one invariant linear map is $B : \mathbb{R}^{n \times n} \to \mathbb{R}$ defined by $B(W) = \sum_{v \in V} W_{vv}$. (In the language of the original IGN paper (Maron et al., 2019b), this corresponds to the partition $\gamma = \{\{1, 2\}\}$.) Then $B(\psi_G) = 2 \neq 0 = B(\psi_H)$, so this 2-IGN distinguishes $\psi_G$ and $\psi_H$.

A.4. Equivalence of APEs and APE-GTs: Proof of Lemma 3.1

Lemma 3.1 (Equivalence of APEs and APE-GT). Any two graphs $(G, X_G)$ and $(H, X_H)$ are indistinguishable by an APE $\phi$ iff $(G, X_G)$ and $(H, X_H)$ are indistinguishable by all $\phi$-APE-GTs.

Proof. Let $T$ be any transformer, and let $(G, X_G)$ and $(H, X_H)$ be indistinguishable by $\phi$. The initial inputs to $T$ are $\hat{X}_G = [X_G \mid \phi_G]$ and $\hat{X}_H = [X_H \mid \phi_H]$. As $G$ and $H$ are indistinguishable by $\phi$, there is a permutation $P$ such that $P\hat{X}_G = \hat{X}_H$; however, as transformers are permutation-equivariant (Fact 1), then $T(\hat{X}_H) = T(P\hat{X}_G) = PT(\hat{X}_G)$, so the theorem follows. Now we prove the converse. Consider the transformer $T$ with all weight matrices set to the zero matrix. The transformer $T$ is exactly the identity map. Thus, if $G$ and $H$ are distinguishable by $\phi$, then they are distinguishable by $T$.

A.5. Isomorphic implies WL Indistinguishable: Proof of Fact 2

Fact 2. If $(G, X_G)$ and $(H, X_H)$ are feature isomorphic, then $(G, X_G)$ and $(H, X_H)$ are $\psi$-WL indistinguishable and $\psi$-2-WL indistinguishable for any RPE $\psi$.

Proof. Let $(G, X_G)$ and $(H, X_H)$ be feature isomorphic graphs connected by an isomorphism $\sigma : V_G \to V_H$. Let $\psi$ be an RPE. We will prove the stronger statement that $\chi^{(t)}_{\psi_G}(u) = \chi^{(t)}_{\psi_H}(\sigma(u))$ for all $u \in V_G$ and $t \geq 0$. For the base case of $t = 0$, this is trivial as $\chi^{(0)}_{\psi_G}(u) = X_G(u) = X_H(\sigma(u)) = \chi^{(0)}_{\psi_H}(\sigma(u))$ for all $u \in V_G$ by the definition of feature isomorphism. Now suppose the statement is true for $t$, and we will show it is true for $t + 1$. Let $u \in V_G$. Consider the colors $\chi^{(t+1)}_{\psi_G}(u) = (\chi^{(t)}_{\psi_G}(u), \{\{(\chi^{(t)}_{\psi_G}(v), \psi_G(u, v)) : v \in V_G\}\})$ and $\chi^{(t+1)}_{\psi_H}(\sigma(u)) = (\chi^{(t)}_{\psi_H}(\sigma(u)), \{\{(\chi^{(t)}_{\psi_H}(y), \psi_H(\sigma(u), y)) : y \in V_H\}\})$. By induction, we know that the first coordinates $\chi^{(t)}_{\psi_G}(u) = \chi^{(t)}_{\psi_H}(\sigma(u))$ are equal. Moreover, by the inductive hypothesis and the definition of an RPE (which is invariant under graph isomorphisms), we know that $(\chi^{(t)}_{\psi_G}(v), \psi_G(u, v)) = (\chi^{(t)}_{\psi_H}(\sigma(v)), \psi_H(\sigma(u), \sigma(v)))$ for every $v \in V_G$, which implies the second coordinates are equal. Therefore, the theorem holds.
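A quick numerical companion to the diagonal-awareness counterexample in Appendix A.3 above (the two graphs on two vertices with the artificial, non-diagonally-aware RPE); `trace_readout` is our name for the invariant map $B$.

```python
import numpy as np

# The artificial RPE: identity matrix for G (one edge), anti-diagonal for H.
psi_G = np.array([[1.0, 0.0], [0.0, 1.0]])
psi_H = np.array([[0.0, 1.0], [1.0, 0.0]])

# With no node features, the initial psi-2-WL colors are just the RPE values;
# as multisets they agree, and the text shows the coloring is already stable,
# so psi-2-WL cannot separate the two graphs.
print(sorted(psi_G.ravel()) == sorted(psi_H.ravel()))     # True

# The invariant linear map B(W) = sum_v W_vv (partition gamma = {{1, 2}} in
# Maron et al. (2019b)) is a 2-IGN readout that does separate them.
def trace_readout(W):
    return float(np.trace(W))

print(trace_readout(psi_G), trace_readout(psi_H))          # 2.0 vs 0.0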
A similar argument shows that χ(t) 2,ψG(u, v) = χ(t) 2,ψH(σ(u), σ(v)) for all u, v VG and t 0. A.6. Equivalence of RPE-aug WL and RPE-GTs: Proof of Lemma 3.6 Lemma 3.6 (Equivalence of RPE-aug WL and RPE-GT). Let ψ be a diagonally-aware RPE. Let (G, XG) and (H, XH) be featured graphs. Then (G, XG) and (H, XH) are indistinguisable by the ψ-WL test iff (G, XG) and (H, XH) are indistinguisable by all ψ-RPE-GTs. Proof. We will denote the input at the l-th layer of the RPE-GT for graph G as X(l) G R|VG| ℓ, and for each node u VG, its feature is denoted by X(l) G (u). Note X(0) G (u) = XG(u). We adopt similar notations for H. For the forward direction, it suffices to show that if G and H are indistinguishable by the ψ-WL test, they are indistinguishable by any ψ-RPE-GT. We will prove the stronger statement that if the multiset of colors {{χ(t) ψG(u) : v VG}} = {{χ(t) ψH(x) : Comparing Graph Transformers via Positional Encodings x VH}}, then any bijection σ(t) : VG VH such that χ(t) ψG(u) = χ(t) ψH(σ(t)(u)) will satisfy the following property: X(t) G (u) = X(t) H (σ(t)(u)) for all u VG, and hence {{X(t) G (u) : u VG}} = {{X(t) H (x) : x VH}}. We will prove this by induction on t. The key observation is that the only part of the transformer architecture that is not applied separately to the node features X(t)(u) is the multiplication by the attention matrix A(l,h)(X(l))X(l), meaning for all other parts of the transformer, if the inputs X(t) G (u) = X(t) H (x) are equal, then the outputs at u and x are equal. For the base case of t = 0, the statement holds as the initial node features equals the ψ-WL colors for t = 0, i.e. {{χ(0) ψG(v) : v VG}} = {{X(0) G (v) : v VG}}. Now suppose the theorem inductively holds for some t. Assume that {{χ(t+1) ψG (u) : v VG}} = {{χ(t+1) ψH (x) : x VH}} and let σ(t+1) : VG VH denote any bijection such that χ(t+1) ψG (u) = χ(t+1) ψH (σ(t+1)(u)) for any u VG. By the definition of χψ and the inductive hypothesis, we have that {{χ(t) ψG(u) : v VG}} = {{χ(t) ψH(x) : x VH}}, χ(t) ψG(u) = χ(t+1) ψH (σ(t)(u)) and X(t) G (u) = X(t) H (σ(t+1)(u)) for all u VG. Now for any v VG and y VH such that X(t) G (v) = X(t) H (y), we have that X(t) G (u)WQ(X(t) G (v)WK)T = X(t) H (σ(t+1)(u))WQ(X(t) H (y)WK)T (1) for any matrices WQ and WK. Now, for any u VG, χ(t+1) ψG (u) = χ(t+1) ψH (σ(t+1)(u)) implies that {{(χ(t) ψG(v), ψG(u, v)) : v VG}} = {{(χ(t) ψH(y), ψH(σ(t+1)(u), y)) : y VH}}. Hence, {{(X(t) G (v), ψG(u, v)) : v VG}} = {{(X(t) H (y), ψH(σ(t+1)(u), y)) : y VH}}. Note that for any given value a, {{ψG(u, v) : v VG, X(t) G (v) = a}} = {{ψH(σ(t+1)(u), y) : y VH, X(t) H (y) = a}}. (2) Then, we have that [A(t,h)(X(t) G )X(t) G ](u) = X v A(t,h)(X(t) G )(u, v)X(t) G (v) v f1(ψG(u, v))softmax(X(t) G (u)WQ(X(t) G ( )WK)T / p dh + f2(ψG(u, )))(v)X(t) G (v) y f1(ψH(σ(t+1)(u), y))softmax(X(t) H (σ(t+1)(u))WQ(X(t) H ( )WK)T / p dh + f2(ψH(σ(t+1)(u), )))(y)X(t) H (y) = [A(t,h)(X(t) H )X(t) H ](σ(t+1)(u)). Here the third equation follows from Equation (1) and Equation (2). It is then easy to check that X(t+1) G (u) = X(t+1) H (σ(t+1)(u)) for any u VG and this concludes the proof. For the other direction, the proof follows from the one of Zhang et al. (2023, Theorem 4); More precisely, Zhang et al. (2023) are dealing with the case when the RPE ψ is taking value from a finite set D for graphs with bounded sizes. Furthermore, their RPE ψ is a distance (or a stack of distance functions) so that ψG(u, v) = 0 iff u = v VG for any graph G. In this way, elements in D can be written as 0 = r0 < r1 < r H. 
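The attention computation manipulated in the displayed equalities above uses the RPE in two places: $f_2$ biases the pre-softmax scores and $f_1$ gates the post-softmax weights. The following single-head sketch follows that expression only; the value and output projections of the full RPE-GT layer are omitted, and the particular $f_1$, $f_2$ below are placeholders of our own choosing, not the paper's architecture.

```python
import numpy as np

def rpe_attention_head(X, psi, WQ, WK, f1, f2):
    """One RPE-modulated attention head:
    A(X)[u, v] = f1(psi[u, v]) * softmax_v( <X[u]WQ, X[v]WK> / sqrt(d_h) + f2(psi[u, v]) ),
    followed by the aggregation sum_v A(X)[u, v] X[v] used in the proof above."""
    d_h = WQ.shape[1]
    scores = (X @ WQ) @ (X @ WK).T / np.sqrt(d_h) + f2(psi)   # (n, n)
    scores -= scores.max(axis=1, keepdims=True)               # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)                   # softmax over v
    attn = f1(psi) * attn                                      # RPE gating
    return attn @ X

# Toy usage: a scalar (distance-like) RPE and random projection weights.
rng = np.random.default_rng(0)
n, d, d_h = 4, 3, 2
X = rng.normal(size=(n, d))
psi = rng.integers(0, 3, size=(n, n)).astype(float)
WQ = rng.normal(size=(d, d_h))
WK = rng.normal(size=(d, d_h))
out = rpe_attention_head(X, psi, WQ, WK,
                         f1=lambda p: 1.0 / (1.0 + p),         # placeholder f1
                         f2=lambda p: -p)                       # placeholder f2
assert out.shape == (n, d)
```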
Note that, for one step of ψ-WL, the second argument can be decomposed by {{(χ(t) ψG(u), ψG(v, u)) : u VG}} = [ ri {{(χ(t) ψG(u), ri) : ψG(v, u) = ri, u VG}}. Comparing Graph Transformers via Positional Encodings Then, Zhang et al. (2023) explicitly constructed H + 1 different RPE attention heads Ah RPE which can simulate {{(χ(t) ψG(u), ri) : ψG(v, u) = rh, u VG}} in an injective manner for each h = 0, . . . , H. In this way, they argue that by concatenating these heads, one can simulate their GD-WL test where each step is updated by bχ(t+1) ψG (v) = {{(bχ(t) ψG(u), ψG(v, u)) : u VG}}. But as mentioned already in Section 3.1, this GD-WL test is equivalent to the ψ-WL test when ψ is a graph distance (or a stack of graph distances), so their one layer RPE transformer can simulate one step of ψ-WL in an injective manner. Recall that GD-WL is equivalent to ψ-WL due to the fact that ψ takes value 0 on diagonal entries and ψ takes non-zero values on off-diagonal entries. This is the reason why one can drop the first argument in (χ(t) ψG(v), {{(χ(t) ψG(u), ψG(v, u)) : u V }}) without losing distinguishing power. Hence, the same argument in the proof of (Zhang et al., 2023, Theorem 4) can be applied to the setting of diagonally-aware RPEs due to the following simple result. Lemma A.6. When ψ is a diagonally-aware RPE, then ψ-WL is equivalent to the following variant of ψ-WL: bχ(0) ψG(v) = XG(v), bχ(t+1) ψG (v) = {{(bχ(t) ψG(u), ψG(v, u)) : u V }}. Now, to prove our theorem, we simply let D := {ψG(u, v) : u, v VG} {ψH(x, y) : x, y VH} and apply the same argument as in the proof of (Zhang et al., 2023, Theorem 4) to conclude the proof. A.7. Mapping APEs to RPEs: Proof of Lemma 3.7 and Theorem 3.8 We first recall a key lemma from Xu et al. (2018). For any set S, we let Mul(S) denote the collection of all finite multi-subsets of S and recall that Mul2(S) denotes the collection of all multi-subsets of size 2 in S. Lemma A.7 ((Xu et al., 2018, Lemma 5)). If S is countable, then there exists a function f : S R such that the map hf : Mul(S) R defined by sending hf(X) = P x X f(x) is injective. As a direct consequence, we immediately have the following corollary. Corollary A.8. If S is countable, then there exists a function f : S R such that the map hf : Mul2(S) R defined by sending hf({{x, y}}) = f(x) + f(y) is injective. Lemma 3.7. Let ϕ be an APE. Then there is a function f such that any two featured graphs (G, XG) and (H, XH) are indistinguishable by ϕ iff (G, XG) and (H, XH) cannot be distinguished by the ψf-2-WL test. Proof. Let S = G G im ϕG where G is the set of all finite graphs. We first note that S is countable as the set of all finite graphs is countable. Let f be any function such that hf that is injective on S; such a function must exist by Corollary A.8 as S is countable. Assume that G and H are indistinguishable by the ψf-2-WL test. This, in particular, implies that the multisets of initial colors are the same: {{(XG(u), XG(v), ψf G(v, w)) : u, v VG}} = {{(XH(x), XH(y), ψf H(x, y)) : x, y VH}}. By the definition of ψf and the fact that hf is injective, one has that {{(XG(u), XG(v), ϕG(u), ϕG(v)) : u, v VG}} = {{(XH(x), XH(y), ϕH(x), ϕH(y)) : x, y VH}}. Hence, the following two Cartesian products of multisets are equal: {{(XG(u), ϕG(u)) : u VG}}2 = {{(XH(x), ϕH(x)) : x VH}}2. By Fact 3, we have that {{(XG(u), ϕG(u)) : u VG}} = {{(XH(x), ϕH(x)) : x VH}}. Conversely, assume that (G, XG) and (H, XH) are indistinguishable by ϕ. Then {{(XG(u), ϕG(u) : u VG}} = {{(XH(x), ϕH(x)) : x VH}}. 
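To make the APE-to-RPE construction used in this proof concrete: $\psi^f_G(u, v) = f(\phi_G(u)) + f(\phi_G(v))$, where $f$ is chosen so that $h_f$ is injective on the (countable, here finite) set of APE values (Corollary A.8). The particular injective $f$ below, which maps the distinct APE values to powers of 3, is our own illustrative choice and is only valid for finitely many values.

```python
import numpy as np

def make_injective_f(values):
    """Map each distinct APE value to a power of 3, so that sums
    f(x) + f(y) uniquely determine the unordered pair {{x, y}}
    (cf. Corollary A.8; this particular choice of f is ours)."""
    distinct = sorted({tuple(np.atleast_1d(v)) for v in values})
    lookup = {v: 3.0 ** i for i, v in enumerate(distinct)}
    return lambda v: lookup[tuple(np.atleast_1d(v))]

def ape_to_rpe(phi, f):
    """psi^f(u, v) = f(phi(u)) + f(phi(v))."""
    fvals = np.array([f(p) for p in phi])
    return fvals[:, None] + fvals[None, :]

# Toy usage: a 1-dimensional APE on 4 nodes.
phi = np.array([[0.3], [0.3], [0.7], [1.1]])
f = make_injective_f(phi)
psi_f = ape_to_rpe(phi, f)
assert psi_f.shape == (4, 4) and np.allclose(psi_f, psi_f.T)  # symmetric by construction
```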
Let σ : VG VH be a bijection such that ϕG(v) = ϕH(σ(v)) and XG(v) = XH(σ(v)) for all v VG. We will prove that (G, XG) and (H, XH) are indistinguishable by ψf-WL for any function f. In fact, we will prove the stronger statement that χ(t) ψf G(v) = χ(t) ψf H(σ(v)) for all t 0. This is stronger than proving they are ψf-WL indistinguishable as we use the same bijection for all iterations t. Comparing Graph Transformers via Positional Encodings We prove this by induction on the iteration t. For t = 0, this is true as the initial ψf-WL colors χ(t) ψf H(v) are just the node features XG(v), which are implied by the definition of indistinguishability by an APE. Now suppose this is true for some iteration t; we will show it is true for t + 1. Let v VG. The inductive hypothesis implies that the first element of the colors are equal χ(t) ψf G(v) = χ(t) ψf H(σ(v)). Moreover, if we consider the second element of the colors, we find that {{(χ(t) ψf G(w), ψf G(v, w)) : w VG}} ={{(χ(t) ψf G(w), f(ϕG(v)) + f(ϕG(w))) : w VG}} (defintion of ψf G) ={{(χ(t) ψf G(w), f(ϕH(σ(v))) + f(ϕH(σ(w)))) : w VG}} (defintion of σ) ={{(χ(t) ψf H(σ(w)), f(ϕH(σ(v)) + f(ϕH(σ(w))))) : w VG}} (inductive hypothesis) ={{(χ(t) ψf H(σ(w)), ψf H(σ(v), σ(w))) : w VG}} (defintion of ψf H) ={{(χ(t) ψf H(w), ψf H(σ(v), w)) : w VH}}. (σ is a bijection) Therefore, χ(t+1) ψf G (v) = χ(t+1) ψf H (σ(v)) as claimed. Theorem 3.8. For any APE ϕ, there exists a function f such that any two graphs (G, XG) and (H, XH) are indistinguishable by all ϕ-APE-GTs iff they are indistinguishable by all ψf-RPE-GTs. Proof. By Lemma 3.1, two graphs (G, XG) and (H, XH) are indistinguishable by all APE transformers with APE ϕ iff they are indistinguishable by ϕ. By Lemma 3.7, this is equivalent to being indistinguishable by ψf-2-WL. Hence, by Lemma 3.6, this is equivalent to being indistinguishable by any RPE transformer with RPE ψf. A.8. Mapping RPE to APE: Proof of Theorem 3.10 Lemma 3.9. Let ψ be a diagonally-aware RPE. For any 2-EGN g, if (G, XG) and (H, XH) are indistinguishable by the ψ-2-WL test then (G, XG) and (H, XH) are indistinguishable by ϕg. Moreover, for any finite set of unfeatured graphs G, there is a 2-EGN g such that if G, H G are indistinguishable by ϕg, then G and H are indistinguishable by the ψ-2-WL test. Proof. Let (G, XG) and (H, XH) be featured graphs that are indistinguishable by ψ-2-WL. By Lemma A.5, this means that for any 2-EGN g of the appropriate dimension, g([ψG, row(XG), col(XG)]) = g([ψG, row(XG), col(XG)]). In particular, for any 2-EGN h of appropriate dimension, there is a 2-EGN gh such that gh([ψG, XGXT G]) = [h(ψG), X2 G]; the 2-EGN gh runs h on the first columns of [ψG, row(XG), col(XG)] and then copies the mean of the rows of row(XG) on the remaining columns. The output of gh is exactly the APE ψh G concatenated to the node features XG. For the backward direction of the case of graphs without node features, this follows directly from Lemma A.4. Theorem 3.10. Let ψ be a diagonally-aware RPE. For any 2-EGN g, if (G, XG) and (H, XH) are indistinguishable by all ψ-RPE-GTs, then (G, XG) and (H, XH) are indistinguishable by ϕg. Moreover, for any finite set of unfeatured graphs G, there is a 2-EGN g such that if G, H G are indistinguishable by all ϕg-APE-GTs, then G and H are indistinguishable by all ψ-RPE-GTs. Proof. By Lemma 3.6 and Theorem 3.5, G and H are indistinguisable by an RPE transformer with RPE ψ iff G and H are indistinguisable by the ψ-2-WL test. 
By Lemma 3.9, this is equivalent to $G$ and $H$ being indistinguishable by $\phi_g$. By Lemma 3.1, this is equivalent to $G$ and $H$ being indistinguishable by any APE transformer with APE $\phi_g$.

We now show that RPEs are strictly stronger than APEs derived from RPEs for graphs with node features.

Example 1. There exists an RPE $\psi$ and featured graphs $(G, X_G)$ and $(H, X_H)$ that are distinguishable by $\psi$-2-WL but are indistinguishable by $\phi_g$ for any 2-EGN $g$.

Figure 5. Left: $(G, X_G)$. Right: $(H, X_H)$.

Proof. The graphs $G$ and $H$ are both the cycle graph on 4 vertices, $C_4$. The RPE $\psi$ is the adjacency matrix. The node features are $X_G = [1, 2, 3, 4]$ and $X_H = [1, 3, 2, 4]$. See Figure 5. We first prove these graphs are distinguishable by $\psi$-WL. If we consider the first $\psi$-WL colors $\chi^{(1)}_\psi$, we see that $G$ has a node of color $(1, \{\{(2, 1), (3, 0), (4, 1)\}\})$; however, $H$ has no node of that color, as the only node with feature 1 has color $(1, \{\{(3, 1), (2, 0), (4, 1)\}\})$. We next prove that they are indistinguishable by the APE $\phi_g$ for any 2-EGN $g$. The first thing to observe is that the APE $\phi_g$ will assign the same value to each node in $G$ and $H$; this follows from the definition of APE and the fact that for any two nodes $u, v \in V_G$, there is a graph isomorphism $\sigma : V_G \to V_G$ such that $\sigma(u) = v$; likewise for any two nodes $x, y \in V_H$ or nodes $u \in V_G$ and $x \in V_H$. Therefore, as $(G, X_G)$ and $(H, X_H)$ have the same multiset of node features and $\phi_g$ has uniform entries, they cannot be distinguished by $\phi_g$.

B. Details from Section 4

B.1. A Lemma for RPE-aug WL Composition

In this section, we will repeatedly use the following useful property of composing the RPE-aug WL test. Let $\psi$ be an RPE valued in $\mathbb{R}^d$, and let $f : \mathbb{R}^d \to \mathbb{R}$ be a function. Consider the composition $f \circ \psi$ defined by $(f \circ \psi)_G(u, v) := f(\psi_G(u, v))$ for any graph $G$. One naturally wonders about the distinguishing power of $\psi$-WL versus $f \circ \psi$-WL. In fact, $\psi$-WL will always be at least as strong as $f \circ \psi$-WL, and under mild conditions on the input, there will exist a function $f$ such that $\psi$-WL and $f \circ \psi$-WL are equally strong.

Proposition B.1. Let $\psi$ be an RPE. Let $G$ and $H$ be graphs. Let $u \in V_G$ and $v \in V_H$.

1. If $\chi^{(t)}_{\psi_G}(u) = \chi^{(t)}_{\psi_H}(v)$, then for any function $f : \mathbb{R}^d \to \mathbb{R}$, $\chi^{(t)}_{f \circ \psi_G}(u) = \chi^{(t)}_{f \circ \psi_H}(v)$. In particular, $\psi$-WL is as strong as $f \circ \psi$-WL.
2. For any set $\Omega \subseteq \mathbb{R}^d$ and any function $f : \mathbb{R}^d \to \mathbb{R}$ injective on $\Omega$, if $\operatorname{im} \psi_G \cup \operatorname{im} \psi_H \subseteq \Omega$ and $\chi^{(t)}_{f \circ \psi_G}(u) = \chi^{(t)}_{f \circ \psi_H}(v)$, then $\chi^{(t)}_{\psi_G}(u) = \chi^{(t)}_{\psi_H}(v)$. In particular, $f \circ \psi$-WL is as strong as $\psi$-WL.

Proof. Item 1 follows from (Zhu et al., 2023, Theorem 3). For item 2, let $f$ be any function that is injective on $\Omega$; such a function always exists when $\Omega$ is finite. The claim follows by induction on the iteration $t$.

Remark B.2. It is quite common that RPEs will be drawn from a finite set $\Omega \subseteq \mathbb{R}^d$. In particular, if we consider any RPE defined on unweighted graphs with $n$ vertices that is purely a function of the graph topology, e.g., the shortest-path distance or a Laplacian eigenspace projection, then because there are only finitely many unweighted graphs on $n$ vertices, the image of the PE in $\mathbb{R}^d$ will be finite.

B.2. Proof of Theorem 4.2 and Corollary 4.3

Euclidean Distance and Gram Matrices. Let $P = \{p_1, \ldots, p_n\} \subseteq \mathbb{R}^d$ be a point cloud. The Gram matrix of $P$ is denoted $G_P \in \mathbb{R}^{n \times n}$, where $(G_P)_{ij} = p_i^T p_j$. The distance matrix of $P$ is denoted $D_P \in \mathbb{R}^{n \times n}$, where $(D_P)_{ij} = \|p_i - p_j\|$. A point cloud $P$ is centered if $\sum_{i=1}^n p_i = 0$.

Theorem B.3.
Assume that for any graph G, there is an embedding of the nodes PG : V Rd such that im PG is centered. Let ψd and ψg be RPEs that are the distance and Gram matrix of the point set im PG, respectively. Then, ψd-WL is at least as strong as are as ψg-WL. Conversely, ψg-WL with diagonal augmentation is at least as strong as ψd. The proof is based on the following key lemma. We define the squared distance matrix of P to be the matrix D2 P Rn n where (D2 P )ij = pi pj 2. Lemma B.4. Let ψd and ψg be RPEs such that, for any graph G, ψd,G and ψg,G are the squared distance and Gram matrix of a centered point set PG Rd. Then ψd-WL is at least as strong as are as ψg-WL. Conversely, ψg-WL with diagonal augmentation is at least as strong as ψd. Proof of Theorem B.3 assuming Lemma B.4. As the square root function is injective, then by Proposition B.1, G and H are indistinguishable by ψd-WL if and only if D2 P and D2 Q are indistinguishable by ψg-WL. Now we proceed to prove Lemma B.4. Lemma B.5. Let P = {p1, . . . , pn} Rd be a centered point cloud. For any 1 k n, (GP )ii = 1 n Pn i=1(D2 P )ik 1 2n2 Pn i=1 Pn j=1(D2 P )ij. Proof. Observe that (D2 p)ij = p T i pi + p T j pj 2p T i pj. Therefore, the first term simplifies to i=1 (D2 P )ik =p T k pk + 1 i=1 p T i pi 2 i=1 p T i pk =p T k pk + 1 i=1 p T i pi. (as Pn i=1 pi = 0) The second term simplifies to: j=1 (D2 P )ij = 1 p T i pi + p T j pj 2p T i pj np T i pi + j=1 p T j pj j=1 2p T i pj np T i pi + j=1 p T j pj (as Pn i=1 pi = 0) i=1 2np T i pi i=1 p T i pi. i=1 (D2 P )ik 1 2n2 j=1 (D2 P )ij =p T k pk + 1 i=1 p T i pi 1 i=1 p T i pi =p T k pk =(GP )kk. Comparing Graph Transformers via Positional Encodings Lemma B.6. Let ψd and ψg be RPEs such that, for any graph G, ψd,G and ψg,G are the squared distance and Gram matrix of a centered point set PG Rd. Let (G, XG) and (H, XH) be featured graphs. If χ(2) ψd,G(u) = χ(2) ψd,H(x), then (GPG)uu = (GPH)xx. Proof. We will prove this using Lemma B.5. First, the fact that χ(1) ψd,G(u) = χ(1) ψd,H(x) implies {{(D2 PG)uv : v VG}} = {{(D2 PH)xy : y VH}}, which in particular implies that 1 v VG(D2 PG)uv = 1 y VH(D2 PH)xy. Second, as χ(2) ψd,G(u) = χ(2) ψd,H(x), then as the second coordinate of these colors are equal, there is a bijection σ : VG VH such that χ(1) ψd,G(v) = χ(1) ψd,H(σ(v)) for all v VG. In particular, this implies that {{(D2 PG)vw : w VG}} = {{(D2 PH)σ(v)z : x VH}} for all v VG. Therefore, PG and PH have the same multiset of distances, and 1 2n2 P w VG(D2 PG)vw = 1 2n2 P z VH(D2 PH)yz. These two observations and Lemma B.5 imply the lemma. Proposition B.7 (Forward Direction of Lemma B.4). Let ψd and ψg be RPEs such that, for any graph G, ψd,G and ψg,G are the squared distance and Gram matrix of a centered point set PG Rd. Let (G, XG) and (H, XH) be featured graphs. Let u VG and x VH. If χ(t+2) ψd,G (u) = χ(t+2) ψd,H (x), then χ(t) ψg,G(u) = χ(t) ψg,H(x). Moreover, if (G, XG) and (H, XH) are indistinguishable by ψd-WL, then (G, XG) and (H, XH) are indistinguishable by ψg-WL. Proof. We prove this by induction on t. For t = 0, this is obvious as χ(2) ψd,G(u) = χ(2) ψd,H(x) implies that χ(0) ψg,G(u) = XG(u) = XH(v) = χ(0) ψg,H(x) Now suppose the proposition is true for a natural number t, and let u VG and x VH such that χ(t+3) ψd,G (u) = χ(t+3) ψd,H (x). By induction, we know that χ(t+2) ψd,G (u) = χ(t+2) ψd,H (x) implies χ(t) ψg,G(u) = χ(t) ψg,H(x), so the first coordinates of χ(t+1) ψg,G (u) and χ(t+1) ψg,H (x) are equal. We now show the second coordinates are equal. 
The fact that χ(t+3) ψd,G (u) = χ(t+3) ψd,H (x) implies there is a bijection σ : VG VH such that (χ(t+2) ψd,G (v), (D2 PG)uv) = (χ(t+2) ψd,H (σ(v)), (D2 PH)xσ(v)) for all v VG. As χ(t+2) ψd,G (v) = χ(t+2) ψd,H (σ(v)) implies (GPG)vv = (GPH)σ(v)σ(v) by Lemma B.6, then we know that (GPG)uv = 1 2 (GPG)vv + (GPG)uu (D2 PG)uv = 1 2 (GPH)σ(v)σ(v) + (GPH)xx (D2 PH)xσ(v) = (GPQ)xσ(v). Moreover, χ(t+2) ψd,G (v) = χ(t+2) ψd,H (σ(v)) implies χ(t) ψg,G(v) = χ(t) ψg,H(σ(v)) by the inductive hypothesis. Therefore, as σ is a bijection, the second coordinates of χ(t+1) ψg,G (u) and χ(t+1) ψg,H (x) are equal: {{(χ(t) ψg,G(v), (GPG)uv) : v VG}} = {{(χ(t) ψg,H(y), (GPH)xy) : y VH}}. Proposition B.8 (Backward Direction of Lemma B.4). Let ψd and ψg be RPEs such that, for any graph G, ψd,G and ψg,G are the squared distance and Gram matrix of a centered point set PG Rd. Let ψ g = (ψg, I) be the diagonal augmentation of ψg. Let (G, XG) and (H, XH) with featured graphs. Let u VG and x VH. If χ(t+1) ψ g,G (u) = χ(t+1) ψ g,H (x), then χ(t) ψd,G(u) = χ(t) ψd,H(x). In particular, ψ g-WL with diagonal augmentation is at least as strong as ψd-WL. Proof. We prove this by induction on t. For t = 0, this is obvious as χ(1) ψ g,G(u) = χ(1) ψ g,H(x) implies χ(0) ψ g,G(u) = XG(u) = XH(v) = χ(0) ψ g,H(x) and equivalently χ(0) ψd,G(u) = XG(u) = XH(v) = χ(0) ψd,H(x). Now suppose the proposition is true for a natural number t, and suppose that χ(t+2) ψ g,G (u) = χ(t+2) ψ g,H (x). By induction, we know that χ(t+1) ψ g,G (u) = χ(t+1) ψ g,H (x) implies χ(t) ψd,G(u) = χ(t) ψd,H(x), so the first coordinates of χ(t+1) ψd,G (u) and χ(t+1) ψd,H (x) are equal. We now show the second coordinates are equal. As χ(t+2) ψ g,G (u) = χ(t+2) ψ g,H (x), then there is a bijection σ : VG VH such that (χ(t+1) ψ g,G (v), (GPG, I)uv) = (χ(t+1) ψ g,H (σ(v)), (GPG, I)xσ(v)) for all v VG. The diagonal augmentation implies that Comparing Graph Transformers via Positional Encodings (GPG, I)uu = (GPH, I)xx as a diagonal element of (GPG, I) can only equal a diagonal element of (GPH, I). By a similar argument, the fact that χ(t+1) ψ g,G (v)) = χ(t+1) ψ g,H (σ(v)) implies that (GPG, I)vv = (GPH, I)σ(v)σ(v). Therefore, (D2 PG)uv =(GPG)uu + (GPG)vv 2(GPG)uv =(GPH)xx + (GPH)σ(v)σ(v) 2(GPH)xσ(v) = (D2 PH)xσ(v). Therefore, the second coordinates of χ(t+1) ψd,G (u) and χ(t+1) ψd,H (x) are equal, i.e., {{(χ(t) ψd,G(v), (D2 PG)uv) : v VG}} = {{(χ(t) ψd,H(v), (D2 PH)uv) : v VH}}. Now we can prove Theorem 4.2 below: Theorem 4.2. Let f : R+ R+. Df-WL is at least as strong as Kf-WL. Kf-WL with diagonal augmentation is at least as strong as Df-WL. Proof of Theorem 4.2. Note that for given graph G with n points, any spectral decomposition of the Laplacian L = Pn i=2 λiziz T i , and any function f : R+ R+, we can define the point cloud PG = { p f(L)1v : v V }, where p f(L) = Pn i=2 p f(λi)ziz T i The spectral kernel Kf G is the Gram matrix of this point cloud and the spectral distance Df G is the distance matrix of this point cloud. (While different choice of bases for the eigenspace result in different point clouds PG, these point clouds will always have the same Gram and distance matrices regardless of choice of basis.) This point cloud is centered as P v V Pn i=2 p f(L)λiziz T i 1v = 0 by the fact that each eigenvector zi is orthogonal to the all-1s vector, which is the eigenvector of L corresponding to λ1 = 0. The result for RPE-aug WL with RPEs Kf G and Df G holds by directly applying Theorem B.3. 
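To make the constructions in the proof of Theorem 4.2 concrete, the following sketch builds the point cloud $P_G = \{\sqrt{f(L)}\mathbf{1}_v : v \in V\}$, its Gram matrix (the spectral kernel $K^f$) and its distance matrix (the spectral distance $D^f$), and checks the special case $f(\lambda) = 1/\lambda$, which recovers the Laplacian pseudoinverse and the resistance distance used in the next subsection. This is a numerical illustration only; the function names are ours and the graph is assumed connected.

```python
import numpy as np

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

def spectral_kernel_and_distance(A, f):
    """Spectral kernel K^f = sum_{i>=2} f(lambda_i) z_i z_i^T and spectral
    distance D^f, realized as the Gram and distance matrices of the centered
    point cloud { sqrt(f(L)) 1_v : v in V }.  Assumes a connected graph, so
    only the first Laplacian eigenvalue is (numerically) zero."""
    lam, Z = np.linalg.eigh(laplacian(A))            # ascending eigenvalues
    fl = np.concatenate([[0.0], [f(x) for x in lam[1:]]])
    # Rows of P are the points sqrt(f(L)) 1_v written in the eigenbasis; this
    # is an isometry, so the Gram and distance matrices are unchanged.
    P = Z * np.sqrt(fl)
    K = P @ P.T                                      # Gram matrix = K^f
    D2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    return K, np.sqrt(D2)

# Check on the 4-cycle with f(x) = 1/x: K^f is the Laplacian pseudoinverse and
# the *squared* spectral distance is the resistance distance.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
K, D = spectral_kernel_and_distance(A, lambda x: 1.0 / x)
Lp = np.linalg.pinv(laplacian(A))
assert np.allclose(K, Lp)
assert np.allclose(D ** 2, np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2 * Lp)
assert np.allclose(K @ np.ones(4), 0)                # the point cloud is centered
```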
RD-WL and L -WL In this section, we prove Corollary 4.3 by showing that it is unnecessary to perform diagonal augmentation on L in order for L -WL to be as strong as RD-WL. In particular, we show that vertices u and v with the same L -WL color after one iteration have the same diagonal entry L (u, u) = L (v, v). This is the point in the proof of Proposition B.8 where diagonal augmentation is used, so because this holds for L -WL, we can drop the diagonal augmentation. For background, the pseudoinverse of the Laplacian is L = Pn i=2 1 λi ziz T i , and the resistance distance between nodes u and v in a graph is RD(u, v) = (1u 1v)T L (1u 1v), where 1u is the vector whose u-th entry is 1 and other entries are 0. The resistance distance is a squared spectral distance, but remarkably, it is also a metric (see for example (Spielman, 2019, Section 12.8)), which is not true in general for the square of distances. In particular, we prove the following lemma: Lemma B.9. Let G and H be graphs. Let u VG and v VH such that χ(1) L G(u) = χ(1) L H(v). Then L G(u, u) = L H(v, v). The proof is based on several auxiliary results. Lemma B.10. Let G be a connected graph with more than 1 vertex and let L denote its graph Laplacian. Then, for any u = v V , we have that L (v, v) > L (u, v). Proof. By the fact that RD(u, v) = L (u, u) + L (v, v) 2L (u, v), we have that L (v, v) L (u, v) = 1 2(RD(u, v) + L (v, v) L (u, u)) Therefore, to prove that L (v, v) > L+ uv, it suffices to prove that RD(u, v) > L (u, u) L (v, v). Comparing Graph Transformers via Positional Encodings By Lemma B.5, we have that L (u, u) L (v, v) = 1 w V RD(u, w) X w V RD(v, w) w V RD(u, w) RD(v, w) w V \{u,v} RD(u, w) RD(v, w) (as RD(v, u) = 0) w V \{u,v} RD(u, v) (Triangle inequality.) n RD(u, v) < RD(u, v). Lemma B.11. Let G be a graph and let L denote its graph Laplacian. Let u V . Then either 1. L (u, u) > L (u, v) for all v V , or 2. L (u, v) = 0 for all v V . Proof. Let G1, , Gk denote the connected components of G. Then, we have that L = L1 Lk and thus L = L 1 L k. For each connected component Gl, if it contains more than 1 vertex, then by Lemma B.10, L l satisfies the condition that the diagonal entries are strictly larger than all other entries in the corresponding rows. If Gl contains only 1 vertex, then L l = 0 and hence the corresponding row and column in L is always 0. This concludes the proof. Now, we finish the proof of Lemma B.9. Proof of Lemma B.9. By the definition of the L -WL test, χ(1) L G(u) = χ(1) L H(v) implies that {{L G(u, w) : w VG}} = {{L H(v, x) : x VH}}. By Lemma B.11, we know that either the multisets {{L G(u, w) : w VG}} = {{L H(v, x) : x VH}} are all 0s, in which case L G(u, u) = 0 = L H(v, v), or they have the same maximal element, which is L G(u, u) = L H(v, v). B.3. SPE APE-GT is Stronger than RD-RPE-GT Background. First, we provide the definition of SPE (Huang et al., 2024). Denote the eigenvalues and eigenvectors of the Laplacian as λ1 . . . λ|V | and z1, . . . , z|V |. Let Λ := [λ1, . . . , λ|V |] R|V | and V := [z1| |z|V |] R|V | |V |, so L = V diag(Λ)V T . Stable and expressive positional encoding (SPE) is the APE SPE(V, Λ) = g(V diag(f(Λ))V T ), where f : Rn Rn is a permutation-equivariant function like an elementwise MLP or Deep Sets (Zaheer et al., 2017) and g : Rn n Rn is any graph neural network. In our context, we assume that f is a Deep Sets network and g is a 2-EGN. 
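A schematic sketch of the SPE shape just described, $\mathrm{SPE}(V, \Lambda) = g(V \operatorname{diag}(f(\Lambda)) V^T)$, with hand-written stand-ins for the learned components: `deepsets_f` is a toy Deep Sets-style map on the eigenvalues and `toy_g` a toy permutation-equivariant readout $\mathbb{R}^{n\times n} \to \mathbb{R}^n$; the actual SPE of Huang et al. (2024) learns both, with $g$ a GNN/2-EGN. The intermediate matrix `rspe` below is exactly the quantity isolated as the relative-SPE in the next paragraph.

```python
import numpy as np

def deepsets_f(lam, w1=0.7, w2=0.3):
    """Toy permutation-equivariant map on the eigenvalue multiset:
    f(Lambda)_i = w1 * lam_i + w2 * mean(Lambda)."""
    return w1 * lam + w2 * lam.mean()

def toy_g(W):
    """Toy permutation-equivariant readout R^{n x n} -> R^n:
    g(W)_v = W_vv + mean_u W_vu."""
    return np.diag(W) + W.mean(axis=1)

def spe_like(A):
    """SPE-shaped APE: g(V diag(f(Lambda)) V^T), with the toy f and g above."""
    L = np.diag(A.sum(axis=1)) - A
    lam, V = np.linalg.eigh(L)
    rspe = V @ np.diag(deepsets_f(lam)) @ V.T        # the RPE-valued inner term
    return toy_g(rspe)                               # one value per node

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
phi = spe_like(A)
# Permutation-equivariance check: relabeling the graph permutes the encoding.
P = np.eye(4)[[2, 0, 3, 1]]
assert np.allclose(spe_like(P @ A @ P.T), P @ phi)
```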
By removing the final layer of 2-EGN from SPE, one obtains a natural RPE that we call the relative-SPE (RSPE): RSPE(V, Λ) = V diag(f(Λ))V T . Lemma B.12. If a chosen RSPE is diagonally-aware (or diagonally augmented), then there exists a 2-EGN g such that RSPE-RPE-GT is equivalent to SPE-APE-GT in terms of distinguishing power. Proof. This follows from Theorem 3.10. Notice that when f is an elementwise function, RSPE reduces to a spectral kernel and we hence have the following result; however, RSPE is more general than spectral kernels as f(Λ)i may be a function of all the eigenvalues, not just the ith eigenvalue. Accordingly, we have the following lemma. Comparing Graph Transformers via Positional Encodings Lemma B.13. RSPE can compute any spectral kernel. Based on the lemma, we are able to prove our main result in this section. Theorem 4.4. For unfeatured graphs, diagonally augmented SPE-APE-GT is at least as strong as RD-RPE-GT. Proof. The precise statement of the result is that there exists a choice of RSPE with the diagonal augmentation such that its corresponding SPE-APE-GT is at least as strong as RD-RPE-GT. By Lemma B.12, we know that there exists a 2-EGN g such that RSPE-RPE-GT is equivalent to SPE-APE-GT in terms of distinguishing power. Therefore, it suffices to show that RSPE-RPE-GT is at least as strong as RD-RPE-GT. Since the pseudoinverse of the Laplacian L is a spectral kernel, by Lemma B.13 RSPE can be chosen so that its corresponding RSPE-WL is at least as strong as L -WL. Since the extra diagonal augmentation of the two RPEs will not change the relative distinguishing power of their RPE-aug WL test, by Lemma 3.6 and the fact that RD is diagonally-aware, we conclude the proof. B.4. Powers of Matrices and Spectral Kernels: Proofs from Section 4.3 We want to show that the multidimensional RPE (I, L, . . . , Lk) is equivalent to spectral kernels in distinguishing power. To prove this, we will use a basic result from the interpolation literature showing that any function on k points can be fit exactly by a (k 1)-degree polynomial p. Lemma B.14. Let f : R R be any function and let {x1, . . . , xn} R. Then there exists an (n 1)-degree polynomial p : R R such that p(xi) = f(xi) for all 1 i n. Theorem 4.6. (I, L, . . . , L2n 1)-WL is at least as strong as Kf-WL on graphs with at most n nodes. Proof. Let G and H be graphs such that Kf-WL distinguishes G and H. Let {λG,1, . . . , λG,n} and {λH,1, . . . , λH,n} be the spectra of the Laplacians LG and LH respectively, and denote their union {λG 1 , . . . , λG n } {λG 1 , . . . , λG n } = {λ0, . . . , λ2n 1} (without loss of generality, we assume these elements are distinct). Let p(x) = P2n 1 i=0 cixi be the polynomial guaranteed by Lemma B.14 for the point cloud {λ0, . . . , λ2n 1} and for the function f in defining Kf. Let pl : R2n R be the linear function with the same coefficients as p, i.e. pl(x) = P2n 1 i=0 cixi. By linearity, if we apply pl elementwise on the tensor (I, LG, . . . , L2n 1 G ), we find that it equals Kf for G: pl(I, LG, . . . , L2n 1 G ) = i=0 ci Li G = j=2 ciλi G,jzjz T j = j=2 p(λG,j)zjz T j = j=2 f(λG,j)zjz T j = Kf G The same calculation holds for H. Therefore, as Kf-WL distinguishes G and H, then pl (I, L, . . . , L2n 1)-WL distinguishes G and H. Moreover, Proposition B.1 implies (I, L, . . . , L2n 1)-WL distinguishes G and H. As the above argument can be repeated (using different polynomials p) to show that (I, L, . . . , L2n 1) can distinguish any pair of graphs that Kf can, then (I, L, . . . 
, L2n 1)-WL is as strong as Kf-WL on graphs with n nodes. The following theorem uses normalized versions of the adjacency matrix and Laplacian. Recall that D denotes the degree matrix, A denotes the adjacency matrix, and L denotes the graph Laplacian. The symmetrically-normalized adjacency matrix is b A = D 1/2AD 1/2, and its symmetrically-normalized Laplacian is b L = D 1/2AD 1/2 = I b A. We let µ1 µn denote eigenvalues of b A and let ν1 νn denote eigenvalues of b L Note that νi = 1 µi. Moreover, b A and b L they have the same set of eigenvectors {x1, . . . , xn} (up to choice of orthonormal basis). For a function f : R+ R+, we define the normalized spectral kernel as b Kf = Pn i=2 f(νi)xix T i . Theorem 4.8. (I, b A, . . . , b A2n 1)-WL is at least as strong as b Kf-WL on graphs with at most n nodes. Proof. This theorem follows as the eigenvalues µi of b A and νi of b L can be matched as µi = 1 νi. The rest of the proof is the same as Theorem 4.6 by using polynomials to fit the function bf(x) = f(1 x). Theorem 4.9. (I, H(1), . . . , H(2n 1))-WL is at least as strong as Kf-WL on graphs with at most n nodes. Comparing Graph Transformers via Positional Encodings Proof. This theorem follows as the eigenvalues of H(t) are e tλi = (e λi)t, so the proof is the same as the proof of Theorem 4.6 by using polynomials to fit the function bf(x) = f( log(x)). B.5. Common Matrices: Proof of Proposition 4.10 Proposition 4.10. A, b A, e A, L, b L and e L induce equivalent RPE-aug WL tests. In particular, all of these RPE-aug WL tests are equally strong as the WL test. Proof. By Proposition B.16 below, we know that A-WL is equivalent to the WL test. Hence, we only need to show that all other RPE-aug WL tests are equivalent to A-WL. Choose any featured graphs (G, XG) and (H, XH). Pick u VG and v VH. First of all, we observe the following fact. Fact 4. χ(1) AG(u) = χ(1) AH(v) simplies that deg G(u) = deg H(v). e A: We inductively prove that χ(t) AG(u) = χ(t) AH(v) iff χ(t) e AG(u) = χ(t) e AH(v). The case when t = 0 trivially holds. Assume that the statement holds for some t 0. Then, we first assume that χ(t+1) AG (u) = χ(t+1) AH (v). This implies that χ(1) AG(u) = χ(1) AH(v). By Fact 4, we have that deg G(u) = deg H(v). By induction, we have that χ(t) e AG(u) = χ(t) e AH(v) and {{(χ(t) e AG(x), e AG(u, x)) : x VG}} = {{(χ(t) e AH(y), e AH(v, y)) : y VH}}. Hence, we have that {{(χ(t) e AG(x), AG(u, x)/ deg G(u)) : x VG}} = {{(χ(t) e AH(y), AH(v, y)/ deg H(v)) : y VH}}, and thus χ(t+1) e AG (u) = χ(t+1) e AH (v). Conversely, assume that χ(t+1) e AG (u) = χ(t+1) e AH (v). This implies that χ(1) e AG(u) = χ(1) e AH(v). Similarly to Fact 4, we have that deg G(u) = deg H(v). Then, the above argument can be repeated to show that χ(t+1) AG (u) = χ(t+1) AH (v). Therefore, e A-WL is equivalent to A-WL. b A: We prove that b A-WL is equivalent to e A-WL and hence A-WL by showing that for any u VG and v VH, one has for all t 0 that χ(t) b AG(u) = χ(t) b AH(v) implies χ(t) e AG(u) = χ(t) e AH(v) and χ(t+1) e AG (u) = χ(t+1) e AH (v) implies χ(t) b AG(u) = χ(t) b AH(v) The case when t = 0 holds trivially. Now assume that the statement holds for any t 1. χ(t+1) b AG (u) = χ(t+1) b AH (v) implies that (χ(t) b AG(u), {{(χ(t) b AG(x), AG(u, x)/ p deg G(u) deg G(x)) : x VG}}) = (χ(t) b AH(v), {{(χ(t) b AH(y), AH(v, y)/ p deg H(v) deg H(y)) : y VH}}). By counting the number of non-zero elements on both sides, we have that deg G(u) = deg H(v). 
Hence, (χ(t) b AG(u), {{(χ(t) b AG(x), deg G(u), AG(u, x)/ p deg G(x)) : x VG}}) = (χ(t) b AH(v), {{(χ(t) b AH(y), deg H(v), AH(v, y)/ p deg H(y)) : y VH}}) (χ(t) b AG(u), {{(χ(t) b AG(x), deg G(u), AG(u, x)) : x VG}}) = (χ(t) b AH(v), {{(χ(t) b AH(y), deg H(v), AH(v, y)) : y VH}}). Comparing Graph Transformers via Positional Encodings Finally, by the inductive hypothesis we have that χ(t+1) e AG (u) = (χ(t) e AG(u), {{(χ(t) e AG(x), e AG(u, x)) : x VG}}) = (χ(t) e AH(v), {{(χ(t) e AH(y), e AH(v, y)) : y VH}}) = χ(t+1) e AH (v). For the other direction, assume that χ(t+2) e AG (u) = χ(t+2) e AH (v). This implies that (χ(t+1) e AG (u), {{(χ(t+1) e AG (x), e AG(u, x)) : x VG}}) = (χ(t+1) e AH (v), {{(χ(t+1) e AH (y), e AH(v, y)) : y VH}}). It is obvious that deg G(u) = deg H(v). Let σ : VG VH denote a bijection such that (χ(t+1) e AG (x), e AG(u, x)) = (χ(t+1) e AH (σ(x)), e AH(v, σ(x))) for all x VG. Then, we have that deg G(x) = deg H(σ(x)). Hence, (χ(t+1) e AG (x), b AG(u, x)) = (χ(t+1) e AH (σ(x)), b AH(v, σ(x))). By the inductive hypothesis, we have that (χ(t) b AG(x), b AG(u, x)) = (χ(t) b AH(σ(x)), b AH(v, σ(x))). Therefore, χ(t+1) b AG (u) = (χ(t) b AG(u), {{(χ(t) b AG(x), b AG(u, x)) : x VG}}) = (χ(t) b AH(v), {{(χ(t) b AH(y), b AH(v, y)) : y VH}}) = χ(t+1) b AH (v). L, b L, e L: Note that the graph Laplacian L is combinatorially-aware. Hence by Theorem 4.16, L-WL is at least as strong as the WL test and hence A-WL test. For the other direction, we prove the stronger result that χ(t) AG(u) = χ(t) AH(v) implies that χ(t) LG(u) = χ(t) LH(v) for any u VG and v VH. The statement trivially holds when t = 0. Assume that the statement holds for some t 0. Then, we assume that χ(t+1) AG (u) = χ(t+1) AH (v). This implies that χ(1) AG(u) = χ(1) AH(v). By Fact 4, we have that deg G(u) = deg H(v). By definition, χ(t) AG(u) = χ(t) AH(v) and hence by induction hypothesis, χ(t) LG(u) = χ(t) LH(v). Now, from the fact that {{(χ(t) AG(x), AG(u, x)) : x VG}} = {{(χ(t) AH(y), AH(v, y)) : y VH}}, we obtain a bijection σ : VG VH such that (χ(t) AG(x), AG(u, x)) = (χ(t) AH(σ(x)), AH(v, σ(x))) for all x VG. We assume that σ(u) = v (otherwise we can modify σ so that the assumption holds). Then, by inductive hypothesis we have that χ(t) LG(x) = χ(t) LH(σ(x)) for all x VG. Furthermore, from the fact that AG(u, x) = AH(v, σ(x)), σ(u) = v and deg G(u) = deg H(v), we have that LG(u, x) = LH(v, σ(x)) for all x VG. Hence, {{(χ(t) LG(x), LG(u, x)) : x VG}} = {{(χ(t) LH(y), LH(v, y)) : y VH}}. Therefore, χ(t+1) LG (u) = χ(t+1) LH (v). This concludes the proof. Therefore, L-WL is equivalent to A-WL. Similarly, b L-WL is equivalent to b A-WL and e L-WL is equivalent to e A-WL. Hence, all of these RPE-aug WL tests are equivalent to the WL test. B.6. Combinatorial-Awareness: Proof of Theorem 4.16 Theorem 4.16. Let ψ be a combinatorially-aware RPE. Then, ψ-WL is at least as strong as WL. For any two featured graphs (G, XG) and (H, XH), we prove the following stronger lemma that immediately implies Theorem 4.16. Lemma B.15. Let ψ be a combinatorially-aware RPE. Let u G and v H. If χ(t) ψG(u) = χ(t) ψH(v), then χ(t) G (u) = χ(t) H (v). Proof. We prove this by induction on the iteration. The case t = 0 holds trivially by definition of the two WL tests. Now inductively suppose the theorem is true for iteration t. Let u G and v H such that χ(t+1) ψG (u) = χ(t+1) ψH (v). First, by definition, χ(t) ψG(u) = χ(t) ψH(v). This and the inductive hypothesis imply χ(t) G (u) = χ(t) H (v). 
Comparing Graph Transformers via Positional Encodings Next, by definition, there is a bijection σ : VG VH such that (χ(t) ψG(w), ψG(u, w)) = (χ(t) ψH(σ(w)), ψH(v, σ(w))) for all w VG. This implies ψG(u, w) = ψH(v, σ(w)) if and only if both {u, w} and {v, σ(w)} are edges or nonedges; therefore, {{χ(t) ψG(w) : {u, w} EG}} = {{χ(t) ψH(x) : {v, x} EH}}. This and the inductive hypothesis imply {{χ(t) G (w) : {u, w} EG}} = {{χ(t) H (x) : {v, x} EH}}. This and the observation from the previous paragraph imply χ(t+1) G (u) = χ(t+1) H (v). Note that the adjacency matrix A is trivially combinatorially-aware and hence A-WL is at least as strong as the WL. In fact, the other direction also holds and hence we have that the A-WL and WL are equally strong. Proposition B.16 (RPE-aug WL generalizes WL). Consider the RPE given by the adjacency matrix A. Given two graphs G and H, then χ(G) = χ(H) iff χA(G) = χA(H). Proof. We only need to prove the direction that the WL is at least as strong as the A-WL. Let (G, XG) and (H, XH) be two featured graphs. We prove the following stronger statement: Lemma B.17. Assume that {{χ(t) G (u) : u VG}} = {{χ(t) H (v) : v VH}}. Let u G and v H. If χ(t) G (u) = χ(t) H (v), then χ(t) AG(u) = χ(t) AH(v) and hence {{χ(t) AG(u) : u VG}} = {{χ(t) AH(v) : v VH}}. Proof of Lemma B.17. We prove this by induction on the iteration. The case t = 0 holds trivially by definition of the two WL tests. Now inductively suppose the theorem is true for iteration t. Let u G and v H such that χ(t+1) G (u) = χ(t+1) H (v). By definition, χ(t) G (u) = χ(t) H (v) and {{χ(t) G (w) : {u, w} EG}} = {{χ(t) H (x) : {v, x} EH}}. By induction assumption, we have that χ(t) G (u) = χ(t) H (v). Furthermore, there is a bijection σ : NG(u) NH(v) such that χ(t) G (w) = χ(t) H (σ(w)) for all w NG(u). Now, since {{χ(t) G (x) : x VG}} = {{χ(t) H (y) : y VH}}, we have that {{χ(t) G (x) : x VG\NG(u)}} = {{χ(t) H (y) : y VH\NH(v)}}. Hence there exists a bijection τ : VG\NG(u) VH\NH(v) such that χ(t) G (x) = χ(t) H (τ(x)) for all x VG\NG(u). Now, we define a bijection Φ : VG VH as the union of σ and τ. In this way, we have that for any x VG, AG(u, x) = AH(v, Φ(x)) and χ(t) A (x) = χ(t) A (Φ(x)). Hence, we have that (χ(t) AG(u), {{(χ(t) AG(x), AG(u, x)) : x VG}}) = (χ(t) AH(v), {{(χ(t) AH(y), AH(v, y)) : y VH}}). This implies that χ(t+1) AG (u) = χ(t+1) AH (v) and we hence concludes the proof. This concludes the proof. B.7. Resistance Distance is not Combinatorially-Aware In Figure 6, we show a graph G so that RD(1, 3) = RD(4, 5) = 1 but {4, 5} EG and {1, 3} / EG. This proves that the resistance distance is not combinatorially-aware. Figure 6. RD(1, 3) = RD(4, 5) = 1. Comparing Graph Transformers via Positional Encodings B.8. Resistance Distance and Block Cut-Edge Component Trees For a connected graph G, a cut edge is an edge e EG such that G\{e} is disconnected. An (edge-)biconnected component of G is a maximal subset U VG such that the subgraph of G induced by U is connected and has no cut edges. The block cut-edge tree of G is the tree whose vertices are the edge-biconnected components BCC(G) of G and its edges are {{C1, C2} : C1 = C2 BCC(G), {u, v} EG such that u C1 & v C2}, i.e. the edges of the block cut-edge tree correspond to cut edges in G that connect two edge-biconnected components. Similarly, a cut vertex is a vertex v VG such that G \ {v} is disconnected. In this section, we prove the following theorem. Theorem B.18. Let G and H be unfeatured graphs. 1. Let {u, v} EG and {x, y} EH. 
If {{χ(t) RDG(u), χ(t) RDG(v)}} = {{χ(t) RDH(x), χ(t) RDH(y)}} for all t 0, then {u, v} is a cut edge if and only if {x, y} is a cut edge. 2. If χRD(G) = χRD(H), then the block cut-edge trees of G and H are isomorphic. Background The following are well-known facts about effective resistance that will be useful for proving this theorem. Lemma B.19. Let G be a graph with {u, v} EG. Then RDG(u, v) 1. Moreover, RDG(u, v) = 1 if and only if {u, v} is a cut edge. Lemma B.20. Let G be a graph with u, v, w V . Then RDG(u, v) + RDG(v, w) = RDG(u, w) if and only if v is a cut vertex such that u and w are disconnected in G \ {v}. Notation For any pair of vertices u and v of a graph G, their resistance distance is denoted by RDG(u, v); however, we may drop the subscript G when it is clear from the context. Recall that the notation χRD(G) = χRD(H) means that two graphs G and H have the same multisets of the RD-WL colorings of their vertices for all t 0, i.e. {{χ(t) RDG(v) : v VG}} = {{χ(t) RDH(v) : v VH}}, where χRDG and χRDH are RD-WL colorings of G and H, respectively. Similarly, we define χRDG(u) as the ordered tuple (χ(0) RDG(u), χ(1) RDG(u), . . .). In other words, χRDG(u) = χRDH(v) if and only if χ(t) RDG(u) = χ(t) RDH(v) for all t 0. We call these the colors of the graphs and nodes, respectively. We prove the following lemma about this new notation. Lemma B.21. Let u VG and u VH be such that χRDG(u) = χRDH(u ). Then {{(χRDG(v), RDG(u, v)) : v VG}} = {{(χRDH(v ), RDH(u , v )) : v VH}}. Proof. If not, then there is some iteration t such that {{(χ(t) RDG(v), RDG(u, v)) : v VG}} = {{(χ(t) RDH(v ), RDH(u , v )) : v VH}}. In this case, clearly χ(t+1) RDG (u) = χ(t+1) RDH (u ). Lemma B.22. Let C be any set of colors. Let u VG and u VH be such that χRDG(u) = χRDH(u ). Let D(u) (respectively, D(u )) be the multisets of the resistance distances of all vertices of G (respectively, H) with color C from u (respectively, u ), i.e. D(u) = {{RDG(u, v) : v VG, χRDG(v) C}} and D(u ) = {{RDH(u , v ) : v VH, χRDH(v) C}}. Then D(u) = D(u ). In particular, the mean, minimum, and maximum of the multisets D(u) and D(u ) are equal. Proof. If these sets are not the same, then by Lemma B.21, it cannot be the case that χRDG(u) = χRDH(u ). Lemma B.23. Let {u, v} be a cut edge in G, and let Su and Sv be the vertex sets of the connected components of G \ {u, v} such that u Su and v Sv. Let Cuv V be the subset of all vertices that have RD-WL-color χRDG(u) or χRDH(v). Then Su Cuv = {u} or Sv Cuv = {v}. Proof. Let x be the farthest vertex of Cuv (with respect to the resistance distance) from the set {u, v} in G. Assume x = u and x = v, otherwise the statement of the lemma trivially holds. Without loss of generality assume x Su. We show that Cuv Sv = {v}. Comparing Graph Transformers via Positional Encodings Let y Sv, and y = v. Let my, mu an mv be the maximum (resistance) distance of y, u and v from any vertex in Cuv. Note mu = RD(u, x), and mv = RD(v, x) = 1 + RD(u, x) because of the choice of x. Further, we have my RD(y, v) + RD(v, x) = RD(y, v) + mv > mv, and that my RD(y, u) + RD(u, x) = RD(y, u) + mu = RD(y, v) + 1 + mu > mu as {u, v} is a cut edge. Thus, my = mu and my = mv. In particular, the list of distances of both v and u to vertices in Cu,v are different from the list of distances of y to these vertices. Hence, by Lemma B.22, the color χRDG(y) is different from both of the colors χRDG(u) and χRDG(v). Theorem B.24. Let G and H be two graphs with the same RD-WL colorings. 
Let {u, v} EG be a cut edge, and let {u , v } EH be such that χRDG(u) = χRDH(u ) and χRDG(v) = χRDH(v ). Then, {u , v } is a cut edge in H. Proof. Let Su and Sv be the vertex sets of the connected components of G\{u, v} such that u Su and v Sv. Also, let CG uv (respectively, CH u v ) be the set of all vertices that are colored χRDG(u) or χRDG(v) in G (respectively, in H). By Lemma B.23, and without loss of generality, we assume CG uv Sv = {v}. Therefore, the closest vertex of CG uv to v is u, which is at distance 1 to v since {u, v} is a cut edge; the vertex u is the closest to v as there are no other vertices besides v in CG uv Sv, and any vertex in CG uv Su is at distance at least 1 from v via the series formula of the resistance distance. Since χRDG(v) = χRDH(v ), the closest vertex in CH u v to v has distance 1 from v otherwise χRDH(v ) would be different from χRDG(v) by Lemma B.22. Moreover, RDH(u , v ) 1 as {u , v } EH. Thus, RDH(u , v ) = 1 and, by Lemma B.19, {u , v } is a cut edge. Lemma B.25. Let G and H be two graphs with the same RD-WL colorings. Let {u, v} be a cut edge in G and Su, Sv be the partitions of G\{u, v}. Let {u , v } be a cut edge in H such that χRDG(u) = χRDH(u ) and χRDG(v) = χRDH(v ), and let S u , S v be partitions of H\{u , v }. Let C be any subset of colors used in these equivalent colorings. Let Dv = {{d1, . . . , dk}} be the list of distances of vertices with color C in Sv from v and let Lu = {{ℓ1, . . . , ℓt}} be the list of distances of vertices with color C in Su from u. Similarly, let D v = {{d 1, . . . , d k }} be the list of distances of vertices with color C in S v from v and let L u = {{ℓ 1, . . . , ℓ t }} be the list of distances of vertices with color C in S u from u . Then, we have Dv = D v and Lu = L u . In particular, |Dv| = |D v | and |Lu| = |L u |. Proof. Without loss of generality, by permuting indices, let DI = Dv D v = {{d1, . . . , dp}} = {{d 1, . . . , d p}}, and LI = Lu L u = {{ℓ1, . . . , ℓq}} = {{ℓ 1, . . . , ℓ q}}. Hence, D v = {{d1, . . . , dp, d p+1, . . . , d k }}, and L u = {{ℓ1, . . . , ℓq, ℓ q+1, . . . , ℓ t }}. By the definition of DI, {{d p+1, . . . , d k }} {{dp+1, . . . , dk}} = , (3) and {{ℓ q+1, . . . , ℓ t }} {{ℓq+1, . . . , ℓt}} = . (4) Comparing Graph Transformers via Positional Encodings Since χRDG(v) = χRDH(v ), by Lemma B.22, the list of distances of vertices with color C from v in G is the same as the list of vertices with color C from v in H. As {u, v} and {u , v } are cut edges, then Dv (Lu + 1) = D v (L u + 1), where +1 is applied to all elements of a multiset. Specifically, {{d1, . . . , dk, ℓ1 + 1, . . . , ℓt + 1}} = {{d 1, . . . , d k , ℓ 1 + 1, . . . , ℓ t + 1}} = {{d1, . . . , dp, d p+1, . . . d k , ℓ1 + 1, . . . , ℓq + 1, ℓ q+1 + 1 . . . ℓ t + 1}} Therefore, removing DI and LI from these multisets, we conclude {{dp+1, . . . , dk, ℓq+1 + 1, . . . , ℓt + 1}} = {{d p+1, . . . d k , ℓ q+1 + 1 . . . ℓ t + 1}} Equations (3) and (4) imply {{dp+1, . . . , dk}} = {{ℓ q+1 + 1 . . . ℓ t + 1}}, {{ℓq+1 + 1, . . . , ℓt + 1}} = {{d p+1, . . . d k }}. (5) On the other hand, since the list of distances of vertices with color C from u in G is the same as the list of vertices with color C from u in H, i.e. (Dv + 1) Lu = (D v + 1) L u . Specifically, {{d1 + 1, . . . , dk + 1, ℓ1, . . . , ℓt}} = {{d 1 + 1, . . . , d k + 1, ℓ 1, . . . , ℓ t }} = {{d1 + 1, . . . , dp + 1, d p+1 + 1, . . . d k + 1, ℓ1, . . . , ℓq, ℓ q+1 . . . ℓ t }}. {{dp+1 + 1, . . . , dk + 1, ℓq+1, . . . , ℓt}} = {{d p+1 + 1, . . . d k + 1, ℓ q+1 . . . ℓ t }}. 
Again, Equations (3) and (4), imply {{dp+1 + 1, . . . , dk + 1}} = {{ℓ q+1 . . . ℓ t }}, {{ℓq+1, . . . , ℓt}} = {{d p+1 + 1, . . . d k + 1}}. (6) Combining Equations (5) and (6), we conclude {{ℓq+1, . . . , ℓt}} = {{d p+1 + 1, . . . d k + 1}} = {{ℓq+1 + 2, . . . , ℓt + 2}}. Thus, {{ℓq+1, . . . , ℓt}} is empty, implying Lu = L u , and Dv = D v , as desired. Lemma B.26. Let G and H be two graphs with the same RD-WL colorings. Let (i) {u, v} be a cut edge in G, (ii) u VH such that χRDG(u) = χRDH(u ), (iii) v VH such that χRDH(v ) = χRDG(v), and RDH(u , v ) = 1. Then, {u , v } must be a cut edge in H. Note that the lemma does not trivially follow from Lemma B.19 since {u , v } is not assumed to be an edge a priori. Proof. Let Cuv be the set of all vertices that are colored χRDG(u) or χRDG(v) in G. By Lemma B.23, Su Cuv = {u} or Sv Cuv = {v}. Assume that Sv Cuv = {v}; the other case is similar. Suppose, to derive a contradiction, that {u , v } is not a cut edge. Let w be a neighbor of v that resides in the same connected component as u in H\{v }. (In particular, if v is not a cut vertex w can be any neighbor of v ). Such a w exists and is distinct from u , otherwise {u , v } would be a cut edge. We know that RDH(w , u ) < RDH(w , v ) + RDH(v , u ) = RDH(w , v ) + 1; the triangle inequality implies the inequality, and it is a strict inequality because, in the case of equality, v would a cut vertex separating w and u , contradicting the choice of w . Let ℓ= RDH(w , v ), and note that ℓ< 1, as otherwise {w , v } is a cut edge between v and u implying that RDH(u , v ) > 1. Comparing Graph Transformers via Positional Encodings Since χRDG(v) = χRDH(v ), there must exist a vertex w VG such that χRDG(w) = χRDH(w ) and RDG(v, w) = ℓ< 1. Since all vertices in Su are at a distance at least one from v, we have w Sv. Let Lw be the list of distances of vertices in Cuv from w. Since Sv Cuv = {v}, the smallest distance in Lw is ℓand the second smallest distance in ℓ+ 1, to v and u, respectively. Let Cu v be the set of all vertices in H that are colored χRDH(u ) = χRDG(u) or χRDH(v ) = χRDG(v). Then, let Lw be the list of all distances of w to vertices in Cu v . Since χRDH(w ) = χRDG(w), then Lw = Lw by Lemma B.22. In particular, the smallest two numbers in Lw are ℓand ℓ+ 1. We know RDH(w , v ) = ℓ, and by the triangle inequality RDH(w , u ) RDH(w , v ) + 1 = ℓ+ 1. Since, the second smallest number in the list is ℓ+ 1, and not smaller, the inequality must be tight, i.e. RDH(w , u ) = RDH(w , v ) + 1 = RDH(w , v ) + RDH(v , u ). Therefore, v is a cut vertex separating u and w , contradicting the choice of w . Lemma B.27. Let G and H be two graphs with the same RD-WL colorings. Let {u, v} be a cut edge in G and {u , v } a cut edge in H such that χRDG(u) = χRDH(u ) and χRDG(v) = χRDH(v ). Furthermore, let Bv (resp. B v ) be the biconnected component of v (resp. v ) in G (resp. H). Finally, let K (resp. K ) be the set of all cut edges except {u, v} (resp. {u , v }) with one endpoint in Bv (resp. {u , v }). Then there exists a bijection between f : K K , such that for any {x, y} K and {w, z} = f({x, y}), if x in on v s side in G\{x, y} and w is on v s side in H\{w, z}, then χRDG(x) = χRDH(w) and χRDG(y) = χRDH(z). Proof. Let Sv (resp. S v ) be the vertex set of the connected component of G\{u, v} (resp. H\{u , v }) that contains v (resp. v ). Let X = {x1, . . . , xk} be the set of endpoints of K that are in Bv ordered such that RD(v, x1) . . . RD(v, xk). 
(Note that each xi may be connected to multiple cut edges of K.) By Lemma B.25, there exists X = {x 1, . . . , x k} S v such that for every i [k] (1) χRDG(xi) = χRDH(x i) and (2) RDG(v, xi) = RDH(v , x i). (Let CX be the set of all vertices that are colored by the colors {χRDG(x1), . . . , χRDG(xk)} in Sv. Note, CX is a (possibly proper) superset of {x1, . . . , xk}. Therefore, by Lemma B.25, there exists a set C X Sv , such that there is a one-to-one correspondence between CX and C X with corresponding pairs have the same color and distance to v and v . The existence of X with the aforementioned properties follows as X CX.) For each i [k], let Yi be the set of all vertices that are adjacent to xi with a cut edge in K. Since χRDG(xi) = χRDH(x i), by Lemma B.26, for each i [k], there exists Y i such that (1) for all y Y i , {x i, y } is a cut edge, and (2) the multisets {{χRDG(y)}}y Yi = {{χRDH(y )}}y Y i . Next, we show for every i [k] and every y Y i , RDH(v , x i) < RDH(v , y i), or equivalently, v is in the connected component of x i in H\{x i, y }. We prove this by induction on i. For each i [k], and each y Yi, let Sy be the set of all vertices on the y side in G\{xi, y}, and let Si = S y Yi Sy. For all h < i, and each y Y h, let S y be the set of all vertices on the y side in H\{x h, y }, and let S h = S y Y h S y . Note Sv\ S h