Aligning Transformers with Weisfeiler-Leman

Luis Müller 1, Christopher Morris 1

1 Department of Computer Science, RWTH Aachen University, Germany. Correspondence to: Luis Müller. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Graph neural network architectures aligned with the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the k-WL hierarchy have shown promising empirical results, employing transformers for higher orders of k remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the k-WL hierarchy, showing stronger expressivity results for each k, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets.

1. Introduction

Message-passing graph neural networks (GNNs) are the de-facto standard in graph learning (Gilmer et al., 2017; Kipf & Welling, 2017; Scarselli et al., 2009; Xu et al., 2019a). However, due to their purely local mode of aggregating information, they suffer from limited expressivity in distinguishing non-isomorphic graphs in terms of the 1-dimensional Weisfeiler-Leman algorithm (1-WL) (Morris et al., 2019; Xu et al., 2019a). Hence, recent works (Azizian & Lelarge, 2021; Maron et al., 2019a; Morris et al., 2020; 2023; 2022) introduced higher-order GNNs, aligned with the k-dimensional Weisfeiler-Leman (k-WL) hierarchy for graph isomorphism testing (Cai et al., 1992), resulting in more expressivity with an increase in the order k > 1. The k-WL hierarchy draws from a rich history in graph theory (Babai, 1979; Cai et al., 1992; Grohe, 2017; Weisfeiler & Leman, 1968), offering a deep theoretical understanding of k-WL-aligned GNNs. In contrast, graph transformers (GTs) (Glickman & Yahav, 2023; He et al., 2023; Ma et al., 2023; Müller et al., 2024; Rampášek et al., 2022; Ying et al., 2021) recently demonstrated state-of-the-art empirical performance. However, they draw their expressive power mostly from positional or structural encodings (PEs in the following), making it challenging to understand these models in terms of an expressivity hierarchy such as the k-WL. In addition, their empirical success relies on modifying the attention mechanism (Glickman & Yahav, 2023; Ma et al., 2023; Ying et al., 2021) or using additional message-passing GNNs (He et al., 2023; Rampášek et al., 2022). To still benefit from the theoretical power offered by the k-WL, previous works (Kim et al., 2021; 2022) aligned transformers with k-IGNs, showing that transformer layers can approximate invariant linear layers (Maron et al., 2019b). Crucially, Kim et al.
(2022) designed pure transformers, requiring no architectural modifications of the standard transformer, and instead draw their expressive power from an appropriate tokenization, i.e., the encoding of a graph as a set of input tokens. Pure transformers provide several benefits over graph transformers that use message-passing or modified attention, such as being directly applicable to innovations for transformer architectures, most notably Performers (Choromanski et al., 2021) and FlashAttention (Dao et al., 2022) to reduce the runtime or memory demands of transformers, as well as the Perceiver (Jaegle et al., 2021), enabling multi-modal learning. Unfortunately, the framework of Kim et al. (2022) does not allow for a feasible transformer with provable expressivity strictly greater than the 1-WL due to an O(n^6) runtime complexity and the requirement of 203 attention heads resulting from their alignment with IGNs. This poses the question of whether a more feasible hierarchy of pure transformers exists.

[Figure 1: diagram relating 1-WL, (2, 1)-WL, 1-GT, and (2, 1)-GT, ordered along the direction "approximate more functions".] Figure 1: Overview of our theoretical results, aligning transformers with the established k-WL hierarchy. Forward arrows point to more powerful algorithms or neural architectures. A ≻ B (A ⪰ B, A ≡ B) denotes that algorithm A is strictly more powerful than (at least as powerful as, equally powerful as) B. The relations between the boxes in the lower row stem from Cai et al. (1992) and Morris et al. (2022).

Present work. Here, we offer theoretical improvements for pure transformers. Our guiding question is: Can we design a hierarchy of increasingly expressive pure transformers that are also feasible in practice? In this work, we make significant progress toward answering this question in the affirmative. First, in Section 3, we improve the expressivity of pure transformers for every order k > 0, while keeping the same runtime complexity as Kim et al. (2022), by aligning transformers closely with the Weisfeiler-Leman hierarchy. Secondly, we demonstrate the benefits of this close alignment by showing that our results directly yield transformers aligned with the (k, s)-WL hierarchy, recently proposed to increase the scalability of higher-order variants of the Weisfeiler-Leman algorithm (Morris et al., 2022). These transformers then form a hierarchy of increasingly expressive pure transformers. We show in Section 4 that our transformers can be naturally implemented with established node-level PEs such as the Laplacian PEs in Kreuzer et al. (2021). Lastly, we show in Section 5 that our transformers are feasible in practice and have more expressivity than the 1-WL. In particular, we obtain very close to state-of-the-art results on the large-scale PCQM4Mv2 dataset (Hu et al., 2021) and further show that our transformers are highly competitive with GNNs when fine-tuned on small-scale molecular benchmarks (Hu et al., 2020a). See Figure 1 for an overview of our theoretical results and Table 1 for a comparison of our pure transformers with the transformers in Kim et al. (2021) and Kim et al. (2022).

Related work. Many graph learning architectures with higher-order Weisfeiler-Leman expressive power exist, most notably δ-k-GNNs (Morris et al., 2020), SpeqNets (Morris et al., 2022), k-IGNs (Maron et al., 2019b;c), PPGN (Azizian & Lelarge, 2021; Maron et al., 2019a), and the more recent PPGN++ (Puny et al., 2023). Moreover, Lipman et al. (2020) devise a low-rank attention module possessing the same power as the folklore 2-WL.
Bodnar et al. (2021) propose CIN with an expressive power of at least the 3-WL. However, while theoretically intriguing, higher-order GNNs often fail to deliver state-of-the-art performance on real-world problems (Azizian & Lelarge, 2021; Morris et al., 2020; 2022), making theoretically grounded architectures less relevant in practice. Graph transformers with higher-order expressive power are Graphormer-GD (Zhang et al., 2023) as well as the higher-order graph transformers in Kim et al. (2021) and Kim et al. (2022). However, Graphormer-GD is less expressive than the 3-WL (Zhang et al., 2023). Further, Kim et al. (2021) and Kim et al. (2022) align transformers with k-IGNs and, thus, obtain the theoretical expressive power of the corresponding k-WL. However, they do not empirically evaluate their transformers for k > 2. For k = 2, Kim et al. (2022) propose Token GT, a pure transformer with n + m tokens per graph, where n is the number of nodes and m is the number of edges. This transformer has 2-WL expressivity, which, however, is the same as 1-WL expressivity (Morris et al., 2023). From a theoretical perspective, the transformers in Kim et al. (2022) still require impractical assumptions such as bell(2k) attention heads, where bell(2k) is the 2k-th Bell number (Maron et al., 2019a),¹ resulting in 203 attention heads for a transformer with a provable expressive power strictly stronger than the 1-WL. In addition, Kim et al. (2022) introduce special encodings called node and type identifiers that are theoretically necessary but, as argued in Appendix A.5 in (Kim et al., 2022), not ideal in practice. For an overview of the Weisfeiler-Leman hierarchy in graph learning, see Morris et al. (2023).

¹For example, bell(2·3) = 203, bell(2·4) = 4 140, bell(2·5) = 115 975. In comparison, GPT-3, a large transformer with 175B parameters, has only 96 attention heads (Brown et al., 2020).

2. Background

We consider node-labeled graphs G := (V(G), E(G), ℓ) with n nodes and without self-loops and without isolated nodes, where V(G) is the set of nodes, E(G) is the set of edges, and ℓ : V(G) → N assigns an initial color or label to each node. For convenience of notation, we always assume an arbitrary but fixed ordering over the nodes such that each node corresponds to a number in [n]. Further, A(G) ∈ {0, 1}^{n×n} denotes the adjacency matrix, where A(G)_{ij} = 1 if, and only if, nodes i and j share an edge. We also construct a node feature matrix F ∈ R^{n×d} that is consistent with ℓ, i.e., for nodes i and j in V(G), F_i = F_j if, and only if, ℓ(i) = ℓ(j). Note that, for a finite subset of N, we can always construct F, e.g., with a one-hot encoding of the initial colors. We call pairs of nodes (i, j) ∈ V(G)^2 node pairs or 2-tuples. For a matrix X ∈ R^{n×d} whose rows represent the embeddings of the nodes of a graph G, we also write X_v ∈ R^d or X(v) ∈ R^d to denote the row corresponding to node v ∈ V(G). Further, we define a learnable embedding as a function mapping a subset S ⊂ N to a d-dimensional Euclidean space, parameterized by a neural network, e.g., a neural network applied to a one-hot encoding of the elements in S. Moreover, ‖·‖_F denotes the Frobenius norm. Finally, we refer to some transformer architecture's concrete set of parameters, including its tokenization, as a parameterization. See Appendix C for a complete description of our notation. Further, see Appendix E.1 for a formal description of the k-WL and its variants.
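As a small illustration of the objects just fixed, the following minimal NumPy sketch builds the adjacency matrix A(G) and a label-consistent node feature matrix F via a one-hot encoding; the toy graph and labels are purely illustrative.

```python
import numpy as np

# Illustrative graph: a path on 4 nodes with initial node labels (colors).
n = 4
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 1, 1, 0]  # initial colors ell(v)

# Adjacency matrix A(G) in {0,1}^{n x n} (undirected, no self-loops).
A = np.zeros((n, n), dtype=np.int64)
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Node feature matrix F in R^{n x d}, consistent with ell:
# F_i = F_j if, and only if, ell(i) = ell(j), realized by a one-hot encoding.
d = max(labels) + 1
F = np.eye(d)[labels]

print(A)
print(F)
```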
3. Expressive power of transformers on graphs

Here, we consider the (standard) transformer (Vaswani et al., 2017), a stack of alternating blocks of multi-head attention and fully-connected feed-forward networks; see Appendix D for a definition. To align the above transformer with the k-WL, we will provide appropriate input tokens X^{(0,k)} to the standard transformer for each k ≥ 1 and subsequently show that the t-th layer of the transformer can then simulate the t-th iteration of some k-order Weisfeiler-Leman algorithm. To gradually introduce our theoretical framework, we begin with tokenization for k = 1, obtaining a transformer with the expressive power of at least the 1-WL, and afterward generalize the tokenization to higher orders k.

3.1. Transformers with 1-WL expressive power

A commonly used baseline in prior work (Müller et al., 2024; Rampášek et al., 2022) is to employ a standard transformer with Laplacian PEs (Kreuzer et al., 2021). Here, we start by characterizing such a transformer in terms of 1-WL expressivity and introduce our theoretical framework. Concretely, we propose a tokenization for a standard transformer to be at least as expressive as the 1-WL. We first formalize our tokenization and then derive how 1-WL expressivity follows. Let G = (V(G), E(G), ℓ) be a graph with n nodes and feature matrix F ∈ R^{n×d}, consistent with ℓ. Then, we initialize n token embeddings X^{(0,1)} ∈ R^{n×d} as

X^{(0,1)} := F + P,    (1)

where we call P ∈ R^{n×d} structural embeddings, encoding structural information for each token. For node v, we define

P(v) := FFN(deg(v) + PE(v)),    (2)

where deg : V(G) → R^d is a learnable embedding of the node degree, PE : V(G) → R^d is a node-level PE such as the Laplacian PE (Kreuzer et al., 2021) or SPE (Huang et al., 2024), and FFN : R^d → R^d is a multi-layer perceptron. For the PE, we require that it enables us to distinguish whether two nodes share an edge in G, which we formally define as follows.

Definition 1 (Adjacency-identifying). Let G = (V(G), E(G), ℓ) be a graph with n nodes. Let X^{(0,1)} ∈ R^{n×d} denote the initial token embeddings according to Equation (1) with structural embeddings P ∈ R^{n×d}. Let further

P̄ := (1/√d_k) · PW_Q (PW_K)^T,

where W_Q, W_K ∈ R^{d×d}. Then P is adjacency-identifying if there exist W_Q, W_K such that for any row i and column j,

P̄_{ij} = max_k P̄_{ik}  if, and only if,  A(G)_{ij} = 1,

where each row can have between one and (n − 1) maxima (for connected graphs). Further, an approximation Q ∈ R^{n×d} to P is sufficiently adjacency-identifying if ‖P − Q‖_F < ϵ, for any ϵ > 0.

As the name suggests, the property of being (sufficiently) adjacency-identifying allows us to identify the underlying adjacency matrix during the attention computation. In Section 4, we introduce a slight generalization of the commonly used Laplacian PEs (Kreuzer et al., 2021; Rampášek et al., 2022), which we refer to as LPE, and show that structural embeddings are sufficiently adjacency-identifying when using LPE or SPE (Huang et al., 2024) as node-level PEs. We call a standard transformer using the above tokenization with sufficiently adjacency-identifying node-level PEs the 1-GT.

Let us now understand why the above token embeddings with degree encodings and adjacency-identifying PEs are sufficient to obtain 1-WL expressivity. To this end, we revisit a 1-WL-expressive GNN from Grohe (2021), which updates node representations F^{(t)} at layer t as

F^{(t)} := FFN(F^{(t−1)} + 2A(G)F^{(t−1)}),    (3)

where FFN is again a feed-forward neural network and F^{(t)}_i contains the representation of node i ∈ V(G) at layer t.
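For concreteness, here is a minimal PyTorch sketch of the tokenization in Equations (1)–(2) and of the GNN update in Equation (3). The FFN widths, the GELU activation, and the cap on the degree embedding are illustrative choices and not necessarily the configuration used in the paper; the node-level PE is assumed to be precomputed.

```python
import torch
import torch.nn as nn

class OneGTTokenizer(nn.Module):
    """Sketch of Equations (1)-(2): X^(0,1) = F + P with P(v) = FFN(deg(v) + PE(v))."""

    def __init__(self, d: int, max_degree: int = 64):
        super().__init__()
        self.deg = nn.Embedding(max_degree, d)  # learnable degree embedding (capped for the sketch)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, F: torch.Tensor, A: torch.Tensor, pe: torch.Tensor) -> torch.Tensor:
        degrees = A.sum(dim=-1).long().clamp(max=self.deg.num_embeddings - 1)
        P = self.ffn(self.deg(degrees) + pe)    # structural embeddings, Equation (2)
        return F + P                            # token embeddings, Equation (1)

def gnn_layer(F_prev: torch.Tensor, A: torch.Tensor, ffn: nn.Module) -> torch.Tensor:
    """The 1-WL-expressive GNN update of Equation (3): F^(t) = FFN(F^(t-1) + 2 A(G) F^(t-1))."""
    return ffn(F_prev + 2.0 * A @ F_prev)
```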
Contrast the GNN update in Equation (3) with the update of a 1-GT layer with a single head, given as

X^{(t,1)} := FFN(X^{(t−1,1)} + softmax(X̄^{(t−1,1)}) X^{(t−1,1)} W_V),

where

X̄^{(t−1,1)} := (1/√d_k) · X^{(t−1,1)} W_Q (X^{(t−1,1)} W_K)^T

denotes the unnormalized attention matrix at layer t. If we now set W_V = 2I, where I is the identity matrix, we obtain

X^{(t,1)} := FFN(X^{(t−1,1)} + 2 softmax(X̄^{(t−1,1)}) X^{(t−1,1)}).

At this point, if we could reconstruct the adjacency matrix with softmax(X̄^{(t−1,1)}), we could simulate the 1-WL-expressive GNN of Equation (3) with a single attention head. However, the attention matrix is right-stochastic, meaning its rows sum to 1. Unfortunately, this is not expressive enough to reconstruct the adjacency matrix, which generally is not right-stochastic. Thus, we aim to reconstruct the row-normalized adjacency matrix Â(G) := D^{−1}A(G), where D ∈ R^{n×n} is the diagonal degree matrix, such that D_{ii} is the degree of node i. Indeed, Â(G) is right-stochastic. Further, element-wise multiplication of row i of Â(G) with the degree of i recovers A(G). As we show, adjacency-identifying PEs are, in fact, sufficient for softmax(X̄^{(t−1,1)}) to approximate Â(G) arbitrarily closely. We show that then, we can use the degree embeddings deg(i) to de-normalize the i-th row of Â(G) and obtain

X^{(t,1)} = FFN(X^{(t−1,1)} + 2A(G)X^{(t−1,1)}).

Showing that a transformer can simulate the GNN in Grohe (2021) implies the connection to the 1-WL. We formally prove the above in the following theorem, showing that the 1-GT can simulate the 1-WL; see Appendix H for proof details.

Theorem 2. Let G = (V(G), E(G), ℓ) be a labeled graph with n nodes and F ∈ R^{n×d} be a node feature matrix consistent with ℓ. Further, let C^1_t : V(G) → N denote the coloring function of the 1-WL at iteration t. Then, for all iterations t ≥ 0, there exists a parametrization of the 1-GT such that

C^1_t(v) = C^1_t(w) ⟺ X^{(t,1)}(v) = X^{(t,1)}(w),

for all nodes v, w ∈ V(G).

Having set the stage for theoretically aligned transformers, we now generalize the above to higher orders k > 1.

3.2. Transformers with k-WL expressive power

Here, we propose a tokenization for a standard transformer, operating on n^k tokens, to be strictly more expressive than the k-WL, for an arbitrary but fixed k > 1. Subsequently, to make the architecture more practical, we reduce the number of tokens while remaining strictly more expressive than the 1-WL. Again, we consider a labeled graph G = (V(G), E(G), ℓ) with n nodes and feature matrix F ∈ R^{n×d}, consistent with ℓ. Intuitively, to surpass the limitations of the 1-WL, the k-WL colors ordered subgraphs instead of a single node. More precisely, the k-WL colors the tuples from V(G)^k for k ≥ 2 instead of the nodes. Of central importance to our tokenization is consistency with the initial coloring of the k-WL and recovering the adjacency information between k-tuples imposed by the k-WL, both of which we will describe hereafter; see Appendix E for a formal definition of the k-WL. The initial color of a k-tuple v := (v_1, …, v_k) ∈ V(G)^k under the k-WL depends on its atomic type and the labels ℓ(v_1), …, ℓ(v_k). Intuitively, the atomic type describes the structural dependencies between the elements of a tuple. We can represent the atomic type of v by a k × k matrix K over {1, 2, 3}. That is, the entry K_{ij} is 1 if (v_i, v_j) ∈ E(G), 2 if v_i = v_j, and 3 otherwise; see Appendix C for a formal definition of the atomic type. Hence, to encode the atomic type as a real vector, we can learn an embedding from the set of k × k matrices to R^e, where e > 0 is the embedding dimension of the atomic type.
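To make this construction concrete, the following small NumPy sketch computes the k × k matrix K over {1, 2, 3} for a given tuple on a toy graph; the learned embedding of this matrix is what is denoted atp below.

```python
import numpy as np

def atomic_type_matrix(tup, A):
    """K_ij = 1 if (v_i, v_j) is an edge, 2 if v_i = v_j, 3 otherwise."""
    k = len(tup)
    K = np.full((k, k), 3, dtype=np.int64)
    for i, vi in enumerate(tup):
        for j, vj in enumerate(tup):
            if vi == vj:
                K[i, j] = 2
            elif A[vi, vj] == 1:
                K[i, j] = 1
    return K

# Example: a 3-tuple on the path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
print(atomic_type_matrix((0, 1, 1), A))
```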
For a tuple v, we denote this learned atomic-type embedding by atp(v). Apart from the initial colors for tuples, the k-WL also imposes a notion of adjacency between tuples. Concretely, we define

φ_j(v, w) := (v_1, …, v_{j−1}, w, v_{j+1}, …, v_k),

i.e., φ_j(v, w) replaces the j-th component of the tuple v with the vertex w. We say that two tuples are adjacent or j-neighbors if they differ only in the j-th component (or are equal, in the case of self-loops). Now, to construct token embeddings consistent with the initial colors under the k-WL, we initialize n^k token embeddings X^{(0,k)} ∈ R^{n^k×d}, one for each k-tuple. For order k and embedding dimension d of the tuple-level tokens, we first compute node-level tokens X^{(0,1)} as in Equation (1), and in particular the structural embeddings P(v) for each node v, with embedding dimension d, and then concatenate the node-level embeddings along the embedding dimension to construct tuple-level embeddings, which are then projected down to fit the embedding dimension d. Specifically, we define the token embedding of a k-tuple v = (v_1, …, v_k) as

X^{(0,k)}(v) := [X^{(0,1)}(v_i)]_{i=1}^{k} W + atp(v),    (4)

where W ∈ R^{dk×d} is a projection matrix. Intuitively, the above construction ensures that the token embeddings respect the initial node colors and the atomic type. Analogously to our notion of adjacency-identifying, we use the structural embeddings P(v_i) in each node embedding X^{(0,1)}(v_i) to identify the j-neighborhood adjacency between tuples v and w in the j-th attention head. To reconstruct the j-neighborhood adjacency, we show that it is sufficient to identify the nodes in each tuple-level token. Hence, we define the following requirements for the structural embeddings P.

Definition 3 (Node-identifying). Let G = (V(G), E(G), ℓ) be a graph with n nodes. Let X^{(0,k)} ∈ R^{n^k×d} denote the initial token embeddings according to Equation (4) with structural embeddings P ∈ R^{n×d}. Let further

P̄ := (1/√d_k) · PW_Q (PW_K)^T,

where W_Q, W_K ∈ R^{d×d}. Then P is node-identifying if there exist W_Q, W_K such that for any row i and column j,

P̄_{ij} = max_k P̄_{ik}  if, and only if,  i = j.

Further, an approximation Q ∈ R^{n×d} of P is sufficiently node-identifying if ‖P − Q‖_F < ϵ, for any ϵ > 0.

As we will show, the above requirement allows the attention to distinguish whether two tuples share the same nodes, intuitively, by counting the number of node-to-node matches between two tuples, which is sufficient to determine whether the tuples are j-neighbors. For structural embeddings P to be node-identifying, it suffices, for example, that P has an orthogonal sub-matrix. In Section 4, we show that structural embeddings are also sufficiently node-identifying when using LPE or SPE (Huang et al., 2024) as node-level PEs. It should also be mentioned that our definition of node-identifying structural embeddings generalizes the node identifiers in Kim et al. (2022), i.e., their node identifiers are node-identifying. Still, structural embeddings exist that are (sufficiently) node-identifying and do not qualify as node identifiers in Kim et al. (2022).
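The following PyTorch sketch illustrates the tokenization of Equation (4) for small graphs. The atomic-type embedding is simplified to a linear map over the flattened matrix K, a stand-in for a proper learnable embedding over the finite set of k × k matrices; all sizes are illustrative.

```python
import itertools
import torch
import torch.nn as nn

class KTupleTokenizer(nn.Module):
    """Sketch of Equation (4): X^(0,k)(v) = [X^(0,1)(v_1) | ... | X^(0,1)(v_k)] W + atp(v)."""

    def __init__(self, k: int, d: int):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(k * d, d, bias=False)   # W in R^{dk x d}
        self.atp = nn.Linear(k * k, d)                # simplified stand-in for the atomic-type embedding

    def forward(self, X_node: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        n = X_node.size(0)
        tokens = []
        for tup in itertools.product(range(n), repeat=self.k):   # all n^k tuples
            node_part = torch.cat([X_node[v] for v in tup])      # concatenated node-level tokens
            K = torch.full((self.k, self.k), 3.0)                # atomic-type matrix over {1, 2, 3}
            for i, vi in enumerate(tup):
                for j, vj in enumerate(tup):
                    K[i, j] = 2.0 if vi == vj else (1.0 if A[vi, vj] == 1 else 3.0)
            tokens.append(self.proj(node_part) + self.atp(K.flatten()))
        return torch.stack(tokens)                               # shape (n^k, d)
```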
We call a standard transformer using the tokenization of Equation (4) with sufficiently node- and adjacency-identifying structural embeddings the k-GT. We then show the connection of the k-GT to the k-WL, resulting in the following theorem; see Appendix I for proof details.

Theorem 4. Let G = (V(G), E(G), ℓ) be a labeled graph with n nodes and k ≥ 2, and let F ∈ R^{n×d} be a node feature matrix consistent with ℓ. Let C^k_t denote the coloring function of the k-WL at iteration t. Then, for all iterations t ≥ 0, there exists a parametrization of the k-GT such that

C^k_t(v) = C^k_t(w) ⟺ X^{(t,k)}(v) = X^{(t,k)}(w),

for all k-tuples v, w ∈ V(G)^k.

Note that, similar to Theorem 2, the above theorem gives a lower bound on the expressivity of the k-GT. We now show that the k-GT is strictly more powerful than the k-WL by showing that the k-GT can also simulate the δ-k-WL, a k-WL variant that, for each j-neighbor φ_j(v, w), additionally considers whether (v_j, w) ∈ E(G) (Morris et al., 2020). That is, the δ-k-WL distinguishes for each j-neighbor whether the replaced node and the replacing node share an edge in G. Morris et al. (2020) showed that the δ-k-WL is strictly more expressive than the k-WL. We use this to show that the k-GT is strictly more expressive than the k-WL, implying the following result; see again Appendix I for proof details.

Theorem 5. For k > 1, the k-GT is strictly more expressive than the k-WL.

Note that the k-GT uses n^k tokens. However, for larger k and large graphs, the number of tokens might present an obstacle in practice. Luckily, our close alignment of the k-GT with the k-WL hierarchy allows us to directly benefit from a recent result in Morris et al. (2022), reducing the number of tokens while maintaining an expressivity guarantee. Specifically, Morris et al. (2022) define the set of (k, s)-tuples V(G)^k_s as

V(G)^k_s := {v ∈ V(G)^k | #comp(G[v]) ≤ s},

where #comp(H) denotes the number of connected components in the subgraph H, G[v] denotes the ordered subgraph of G induced by the nodes in v, and s ≤ k is a hyper-parameter. Morris et al. (2022) then define the (k, s)-LWL as a k-WL variant that only considers the tuples in V(G)^k_s and additionally only uses a subset of adjacent tuples; see (Morris et al., 2022) and Appendix E for a detailed description of the (k, s)-LWL. Fortunately, the runtime of the (k, s)-LWL depends only on k, s, and the sparsity of the graph, resulting in a more efficient and scalable k-WL variant. We can now directly translate this modification to our token embeddings. Specifically, we call the k-GT using only the token embeddings X^{(0,k)}(v) where v ∈ V(G)^k_s the (k, s)-GT. Now, Morris et al. (2022, Theorem 1) show that, for k > 1, the (k + 1, 1)-LWL is strictly stronger than the (k, 1)-LWL. Further, note that the (1, 1)-LWL is equal to the 1-WL. Using this result, we can prove the following; see Appendix I for proof details.

Theorem 6. For all k > 1, the (k, 1)-GT is strictly more expressive than the (k − 1, 1)-GT. Further, the (2, 1)-GT is strictly more expressive than the 1-WL.

Note that the (2, 1)-GT only requires O(n + m) tokens, where m is the number of edges, and consequently has a runtime complexity of O((n + m)^2), the same as Token GT. Hence, the (2, 1)-GT improves the expressivity result of Token GT, which is shown to have 1-WL expressive power (Kim et al., 2022), while having the same runtime complexity.
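As an illustration of the token reduction, the following sketch enumerates V(G)^k_s by keeping only those k-tuples whose induced subgraph has at most s connected components; the adjacency-list representation and the toy graph are assumptions made for the example.

```python
import itertools
from collections import deque

def num_components(nodes, adj):
    """Number of connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, comps = set(), 0
    for start in nodes:
        if start in seen:
            continue
        comps += 1
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return comps

def ks_tuples(n, adj, k, s):
    """All (k, s)-tuples: k-tuples whose induced subgraph has at most s components."""
    return [t for t in itertools.product(range(n), repeat=k)
            if num_components(t, adj) <= s]

# Example: path 0-1-2-3. (2, 1)-tuples are the self-pairs and the ordered edges,
# i.e., n + 2m = 4 + 6 = 10 tokens, which is O(n + m).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(len(ks_tuples(4, adj, k=2, s=1)))
```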
Table 1: Comparison of our theoretical results with Kim et al. (2021) and Kim et al. (2022). Highlighted are aspects in which our results improve over Kim et al. (2022). Here, n denotes the number of nodes, and m denotes the number of edges. We denote with ≻A that the transformer is strictly more expressive than algorithm A. Note that for pure transformers, the squared complexity can be relaxed to linear complexity when applying a linear attention approximation such as the Performer (Choromanski et al., 2021).

| Transformer | Provable expressivity | Runtime complexity | # Heads | Pure transformer |
|---|---|---|---|---|
| Sparse Transformer (Kim et al., 2021) | Gilmer et al. (2017) | O(m) | 1 | ✗ |
| Token GT (Kim et al., 2022) | 1-WL | O((n + m)^2) | 15 | ✓ |
| 1-GT (ours) | 1-WL | O(n^2) | 1 | ✓ |
| (2, 1)-GT (ours) | ≻ 1-WL | O((n + m)^2) | 2 | ✓ |
| Higher-Order Transformer (Kim et al., 2021) | k-WL | O(n^{2k}) | 1 | ✗ |
| k-order Transformer (Kim et al., 2022) | k-WL | O(n^{2k}) | bell(2k) | ✓ |
| k-GT (ours) | ≻ k-WL | O(n^{2k}) | 2k | ✓ |

In summary, our alignment of standard transformers with the k-WL hierarchy has brought forth a variety of improved theoretical results for pure transformers, providing a strict increase in provable expressive power for every order k > 0 compared to previous works. See Table 1 for a direct comparison of our results with Kim et al. (2021) and Kim et al. (2022). Most importantly, Theorem 6 leads to a hierarchy of increasingly expressive pure transformers, taking one step closer to answering our guiding question in Section 1. In what follows, we make our transformers feasible in practice.

4. Implementation details

Here, we discuss the implementation details of pure transformers. In particular, we demonstrate that Laplacian PEs (Kreuzer et al., 2021) are sufficient to be used as node-level PEs for our transformers. Afterward, we introduce order transfer, a way to transfer model weights between different orders of the WL hierarchy; see Appendix A for additional implementation details.

4.1. Node- and adjacency-identifying PEs

Here, we develop PEs based on Laplacian eigenvectors and -values that are both sufficiently node- and adjacency-identifying, avoiding computing separate PEs for node and adjacency identification; see Appendix A.1 for details about the graph Laplacian and some background on PEs based on the spectrum of the graph Laplacian. Let λ := (λ_1, …, λ_l)^T denote the vector of the l smallest, possibly repeated, eigenvalues of the graph Laplacian for some graph with n nodes, and let V ∈ R^{n×l} be a matrix such that the i-th column of V is the eigenvector corresponding to eigenvalue λ_i. We will now briefly introduce LPE, based on Kreuzer et al. (2021), as well as SPE from Huang et al. (2024), and show that these node-level PEs are node- and adjacency-identifying.

LPE. Here, we define a slight generalization of the Laplacian PEs in Kreuzer et al. (2021) that enables sufficiently node- and adjacency-identifying PEs. Concretely, we define LPE as

LPE(V, λ) = ρ([ϕ(V_1^T, λ + ϵ), …, ϕ(V_n^T, λ + ϵ)]),    (5)

where ϵ ∈ R^l is a learnable, zero-initialized vector, which we introduce to show our result. Here, ϕ : R^2 → R^d is an FFN that is applied row-wise, and ρ : R^{l×d} → R^d is a permutation-equivariant neural network, applied row-wise. One can recover the Laplacian PEs in Kreuzer et al. (2021) by setting ϵ = 0 and implementing ρ as first performing a sum over the first dimension of its input, resulting in a d-dimensional vector, and then subsequently applying an FFN to obtain a d-dimensional output vector. Note that then, Equation (5) forms a Deep Set (Zaheer et al., 2017) over the set of eigenvector components, paired with their (ϵ-perturbed) eigenvalues. While slightly deviating from the Laplacian PEs defined in Kreuzer et al. (2021), the ϵ vector is necessary to ensure that no spectral information is lost when applying ρ. We will now show that LPE is sufficiently node- and adjacency-identifying. Further, this result holds irrespective of whether the Laplacian is normalized. As a result, LPE can be used with the 1-GT, k-GT, and (k, s)-GT; see Appendix F.3 for proof details.

Theorem 7. Structural embeddings with LPE as node-level PE are sufficiently node- and adjacency-identifying.
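A minimal PyTorch sketch of Equation (5), with ρ implemented as the sum-then-FFN variant described above; layer widths, the GELU activations, and the toy graph in the usage example are illustrative and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LPE(nn.Module):
    """Sketch of Equation (5). For node i, phi is applied row-wise to the l pairs
    (V_{i,j}, lambda_j + eps_j); rho sums over the l rows and applies an FFN."""

    def __init__(self, l: int, d: int):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(l))  # learnable, zero-initialized
        self.phi = nn.Sequential(nn.Linear(2, d), nn.GELU(), nn.Linear(d, d))
        self.rho = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, V: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        # V: (n, l) eigenvectors as columns; lam: (l,) eigenvalues.
        n, l = V.shape
        lam_eps = (lam + self.eps).expand(n, l)          # (n, l)
        pairs = torch.stack([V, lam_eps], dim=-1)        # (n, l, 2)
        per_component = self.phi(pairs)                  # (n, l, d)
        return self.rho(per_component.sum(dim=1))        # (n, d)

# Usage sketch: eigendecomposition of the graph Laplacian of a path on 4 nodes.
A = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
L = torch.diag(A.sum(-1)) - A
lam, V = torch.linalg.eigh(L)
pe = LPE(l=4, d=16)(V, lam)
print(pe.shape)  # torch.Size([4, 16])
```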
SPE. We also show that SPE (Huang et al., 2024) are sufficiently node- and adjacency-identifying. Huang et al. (2024) define the SPE encodings as

SPE(V, λ) = ρ([V diag(ϕ_1(λ)) V^T, …, V diag(ϕ_n(λ)) V^T]),

where ϕ_1, …, ϕ_n : R^l → R^l are equivariant FFNs and ρ : R^{n×l} → R^d is a permutation-equivariant neural network, applied row-wise. For example, ρ can be implemented by first performing a sum over the first dimension of its input, resulting in an l-dimensional vector, and then subsequently applying an FFN to obtain a d-dimensional output vector. SPE have the benefit of being stable, meaning a small perturbation in the graph structure causes only a small change in the resulting SPE encoding. As a result, stability can be related to OOD generalization (Huang et al., 2024). We can show that SPE are sufficiently node- and adjacency-identifying. As a result, SPE encodings can be used with the 1-GT, k-GT, and (k, s)-GT; see Appendix F.4 for proof details.

Theorem 8. Structural embeddings with SPE as node-level PE are sufficiently node- and adjacency-identifying.

4.2. Order transfer

Here, we describe order transfer, a strategy for effectively using the k-GT for k > 2 in practice. The idea of order transfer consists of pre-training a transformer on a lower-order tokenization, such as on the (2, 1)-GT tokens, and subsequently fine-tuning the pre-trained weights on a higher-order transformer, such as the 2-GT, 3-GT, or the (3, 1)-GT. We evaluate order transfer empirically in Section 5.2. Order transfer could be useful when pre-training a higher-order model from scratch is infeasible, but using a higher-order model on a smaller downstream task is feasible and beneficial to performance. Examples of such scenarios are (a) a large difference in the number of graphs in the pre-training dataset compared to the fine-tuning task, (b) a large difference in the size of the graphs in the pre-training dataset compared to the fine-tuning task, or (c) a specific fine-tuning task benefiting strongly from higher-order expressivity. As a result, order transfer is a promising application of expensive yet expressive higher-order models.

5. Experimental evaluation

Here, we conduct an experimental evaluation of our theoretical results. In particular, we consider two settings. To motivate the use of large-scale pure transformers for graph learning, we evaluate the empirical performance of our transformers under a pre-training/fine-tuning paradigm typically employed with large-scale language, vision, and graph models (Devlin et al., 2019; Dosovitskiy et al., 2021; Ying et al., 2021). This approach is especially promising in molecular learning, where downstream datasets often only contain a few thousand examples (Hu et al., 2020b; Méndez-Lucio et al., 2022); see Section 5.1 for pre-training and Section 5.2 for fine-tuning. Moreover, we benchmark the (2, 1)-GT and (3, 1)-GT on how well these models can leverage their expressivity in practice; see Section 5.3. The source code for all experiments is available at https://github.com/luis-mueller/wl-transformers.

5.1. Pre-training

For pre-training, we train on PCQM4Mv2, one of the largest molecular regression datasets available (Hu et al., 2021). The dataset consists of roughly 3.8M molecules, and the task is to predict the HOMO-LUMO energy gap. Here, we pre-train the (2, 1)-GT for 2M steps with a cosine learning rate schedule with 60K warm-up steps on two NVIDIA A100 GPUs; see Table 6 in Appendix B for details on the hyper-parameters.
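A minimal sketch of the described schedule, i.e., linear warm-up for 60K steps followed by cosine decay over 2M total steps, combined with the peak learning rate and weight decay listed in Table 6; the choice of AdamW and the particular scheduler implementation are assumptions made for illustration, not necessarily the exact training code.

```python
import math
import torch

def cosine_with_warmup(optimizer, warmup_steps=60_000, total_steps=2_000_000):
    """Linear warm-up followed by cosine decay (sketch of the pre-training schedule)."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage sketch with the peak learning rate and weight decay from Table 6.
model = torch.nn.Linear(8, 1)  # placeholder for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)
scheduler = cosine_with_warmup(optimizer)
```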
For model evaluation, we use the code provided by Hu et al. (2021), available at https://github.com/snap-stanford/ogb. Further, to demonstrate the effectiveness of our transformers on large node-level tasks, we additionally benchmark the (2, 1)-GT on CS and PHOTO, two transductive node classification datasets (Shchur et al., 2018). On all three tasks, we compare to Graphormer (Ying et al., 2021), a strong graph transformer with modified attention, and Token GT (Kim et al., 2022) as a pure transformer baseline. We use 12 transformer layers, a hidden dimension of 768, 16 attention heads, and a GELU non-linearity (Hendrycks & Gimpel, 2016). As node-level PEs, we compare both LPE and SPE, as defined in Section 4. We present our results in Table 2 and observe that the (2, 1)-GT further closes the gap between pure transformers and graph transformers with graph inductive bias such as Graphormer (Ying et al., 2021). Considering that the (2, 1)-GT still has relatively low graph inductive bias, and in light of recent advances in the quality and size of pre-training datasets (Beaini et al., 2024), we expect this gap to further shrink with increasing data scale. Further, we find that on all three datasets, LPE are favorable or comparable to SPE.

Table 2: Comparison of the (2, 1)-GT to Graphormer and Token GT. Results on PCQM4Mv2 over a single random seed, as well as CS and PHOTO over 7 random seeds, where we also report the standard deviation. Results for baselines are taken from Kim et al. (2022). We highlight the best and second-best model on each dataset.

| Model | PCQM4Mv2 (Validation MAE) | CS (Accuracy) | PHOTO (Accuracy) |
|---|---|---|---|
| Graphormer | 0.0864 | 0.791 ± 0.015 | 0.894 ± 0.004 |
| Token GT | 0.0910 | 0.903 ± 0.004 | 0.949 ± 0.007 |
| (2, 1)-GT + LPE | 0.0870 | 0.924 ± 0.008 | 0.933 ± 0.013 |
| (2, 1)-GT + SPE | 0.0888 | 0.920 ± 0.002 | 0.933 ± 0.011 |

5.2. Fine-tuning

Here, we describe our fine-tuning experiments. When fine-tuning, we re-use the pre-trained weights for the transformer layers and randomly initialize new weights for task-specific feature encodings and the task head. To demonstrate that fine-tuning large pre-trained transformers for small downstream tasks is feasible even with a smaller compute budget, we ensure that all experiments can be run on a single NVIDIA A10 GPU with 24GB RAM.

Molecular regression. To determine whether pre-training can improve the fine-tuning performance of our transformers, we choose ALCHEMY (12K), a small-scale molecular dataset with 12K molecules (Morris et al., 2022). Specifically, we evaluate three settings: (a) training the (2, 1)-GT from scratch, (b) fine-tuning the pre-trained (2, 1)-GT, and (c) order transfer from the (2, 1)-GT to the (3, 1)-GT, where we reuse the pre-trained weights from the (2, 1)-GT pre-training but learn a new tokenizer with (3, 1)-GT tokens. We evaluate these settings both for LPE and SPE as node-level PEs. We compare the results to three GNNs with node-level PEs based on the eigenvectors and -values of the graph Laplacian: SignNet, BasisNet, and SPE (Huang et al., 2024); see Table 3 for results. Here, we find that even without pre-training, the (2, 1)-GT already performs well on ALCHEMY. Most notably, the (2, 1)-GT with SPE without fine-tuning already performs on par with SignNet. Nonetheless, we observe significant improvements through pre-training the (2, 1)-GT. Most notably, fine-tuning the pre-trained (2, 1)-GT with LPE or SPE beats all GNN baselines.
Interestingly, training the (2, 1)-GT with SPE from scratch results in much better performance than training the (2, 1)-GT with LPE from scratch. However, once pre-trained, the (2, 1)-GT performs better with LPE than with SPE. Moreover, we observe no improvements when performing order transfer to the (3, 1)-GT. However, the (3, 1)-GT with LPE and pre-trained weights from the (2, 1)-GT beats SignNet (Lim et al., 2023), a strong GNN baseline. Further, the (3, 1)-GT with SPE performs on par with SPE in Huang et al. (2024), the best of our GNN baselines. Interestingly, order transfer improves over the (2, 1)-GT trained from scratch for both LPE and SPE. As a result, we hypothesize that LPE and SPE provide sufficient expressivity for this task but that pre-training is required to fully leverage their potential. Further, we hypothesize that the added (3, 1)-GT tokens lead to overfitting. We conclude that pre-training can help our transformers' downstream performance. Further, order transfer is a promising technique for fine-tuning on downstream tasks, particularly those that benefit from higher-order representations. However, its benefits might be nullified in the presence of sufficiently expressive node-level PEs in combination with large-scale pre-training.

Table 3: Effect of pre-training on fine-tuning performance on ALCHEMY (12K). Pre-trained weights for both the (2, 1)-GT and the (3, 1)-GT are taken from the (2, 1)-GT model in Table 2. We report mean and standard deviation over 3 random seeds. Baseline results for SignNet, BasisNet, and SPE are taken from Huang et al. (2024). We highlight the best, second-best, and third-best model.

| Model | Pre-trained | ALCHEMY (12K) |
|---|---|---|
| SignNet | ✗ | 0.113 ± 0.002 |
| BasisNet | ✗ | 0.110 ± 0.001 |
| SPE | ✗ | 0.108 ± 0.001 |
| (2, 1)-GT + LPE | ✗ | 0.124 ± 0.001 |
| (2, 1)-GT + LPE | ✓ | 0.101 ± 0.001 |
| (3, 1)-GT + LPE | ✓ | 0.114 ± 0.001 |
| (2, 1)-GT + SPE | ✗ | 0.112 ± 0.000 |
| (2, 1)-GT + SPE | ✓ | 0.103 ± 0.002 |
| (3, 1)-GT + SPE | ✓ | 0.108 ± 0.001 |

Molecular classification. Here, we evaluate whether fine-tuning a large pre-trained transformer can be competitive with a strong GNN baseline on five small-scale molecular classification tasks, namely BBBP, BACE, CLINTOX, TOX21, and TOXCAST (Hu et al., 2020a). The number of molecules in any of these datasets is lower than 10K, making them an ideal choice to benchmark whether large-scale pre-trained models such as the (2, 1)-GT can compete with a task-specific GNN. Specifically, following Tönshoff et al. (2023), we aim for a fair comparison by carefully designing and hyper-parameter-tuning a GINE model (Xu et al., 2019b) with residual connections, batch normalization, dropout, GELU non-linearities, and, most crucially, the same node-level PEs as the (2, 1)-GT. We followed Hu et al. (2020b) in choosing the GINE layer over other GNN layers due to its guaranteed 1-WL expressivity. It is worth noting that our GINE with LPE encodings significantly outperforms the GIN in Hu et al. (2020b) without pre-training on all five datasets, demonstrating the quality of this baseline. Interestingly, GINE with SPE as node-level PE underperforms against the GIN in Hu et al. (2020b) on all but one dataset. In addition, we report the best pre-trained model for each task in Hu et al. (2020b). We fine-tune the (2, 1)-GT for 5 epochs and train the GINE model for 45 epochs. In addition, we also train the (2, 1)-GT for 45 epochs from scratch to study the impact of pre-training on the molecular classification tasks.
Note that we do not pre-train our GINE model, to evaluate whether a large pre-trained transformer can beat a small GNN model trained from scratch; see Table 4 for results. First, we find that the pre-trained (2, 1)-GT with SPE is better than the corresponding GINE with SPE on all five datasets. The pre-trained (2, 1)-GT with LPE is better than GINE with LPE on all but one dataset. Moreover, the (2, 1)-GT with LPE as well as the (2, 1)-GT with SPE are each better than or on par with the best pre-trained GIN in Hu et al. (2020b) on two out of five datasets. When studying the impact of pre-training, we observe that pre-training mostly leads to performance improvements. The notable exception is the TOXCAST dataset, where the pre-trained (2, 1)-GT underperforms even the GIN baselines without pre-training. Hence, we conclude that the pre-training/fine-tuning paradigm is viable for applying pure transformers such as the (2, 1)-GT to small-scale problems.

Table 4: Fine-tuning results on small OGB molecular datasets compared to a fully-equipped and -tuned GIN. Results over 6 random seeds. We highlight the best, second-best, and third-best model for each dataset. Ties are broken by smaller standard deviation. For comparison, we also report pre-training results from Hu et al. (2020b).

| Model | Pre-trained | BBBP (AUROC) | BACE (AUROC) | CLINTOX (AUROC) | TOX21 (AUROC) | TOXCAST (AUROC) |
|---|---|---|---|---|---|---|
| GIN (Hu et al., 2020b) | ✗ | 0.658 ± 0.045 | 0.701 ± 0.054 | 0.580 ± 0.044 | 0.740 ± 0.008 | 0.634 ± 0.006 |
| GIN (Hu et al., 2020b) | ✓ | 0.688 ± 0.008 | 0.845 ± 0.007 | 0.737 ± 0.028 | 0.783 ± 0.003 | 0.665 ± 0.003 |
| GINE + LPE | ✗ | 0.670 ± 0.016 | 0.763 ± 0.007 | 0.824 ± 0.016 | 0.763 ± 0.011 | 0.659 ± 0.005 |
| GINE + SPE | ✗ | 0.613 ± 0.036 | 0.677 ± 0.047 | 0.641 ± 0.066 | 0.709 ± 0.023 | 0.593 ± 0.018 |
| (2, 1)-GT + LPE | ✗ | 0.703 ± 0.025 | 0.786 ± 0.019 | 0.821 ± 0.045 | 0.763 ± 0.003 | 0.667 ± 0.007 |
| (2, 1)-GT + LPE | ✓ | 0.679 ± 0.012 | 0.815 ± 0.017 | 0.867 ± 0.020 | 0.780 ± 0.005 | 0.642 ± 0.007 |
| (2, 1)-GT + SPE | ✗ | 0.694 ± 0.020 | 0.776 ± 0.030 | 0.845 ± 0.017 | 0.758 ± 0.006 | 0.657 ± 0.007 |
| (2, 1)-GT + SPE | ✓ | 0.703 ± 0.013 | 0.793 ± 0.025 | 0.859 ± 0.021 | 0.762 ± 0.003 | 0.637 ± 0.003 |

5.3. Expressivity tests

Since we only provide expressivity lower bounds, we empirically investigate the expressive power of the k-GT. To this end, we evaluate on the BREC benchmark, which offers fine-grained and scalable expressivity tests (Wang & Zhang, 2023). The benchmark consists of 400 graph pairs that range from being 1-WL to 4-WL indistinguishable and pose a challenge even to the most expressive models; see Table 5 for the expressivity results of the (2, 1)-GT and (3, 1)-GT on BREC. We mainly compare to our GINE baseline, Graphormer (Ying et al., 2021) as a graph transformer baseline, and the 3-WL as a potential expressivity upper bound. In addition, we compare our GINE model, the (2, 1)-GT, and the (3, 1)-GT for both LPE and SPE. We find that SPE encodings consistently lead to better performance. Further, we observe that increased expressivity also leads to better performance, as for both LPE and SPE, the (2, 1)-GT beats our GINE baseline, and the (3, 1)-GT beats the (2, 1)-GT across all tasks. Finally, we find that both the (2, 1)-GT and the (3, 1)-GT improve over Graphormer while still being outperformed by the 3-WL, irrespective of the choice of PE.

Table 5: Results on the BREC benchmark over a single seed. Baseline results are taken from Wang & Zhang (2023). We highlight the best model for each PE and category (excluding the 3-WL, which is not a model and merely serves as a reference point). We additionally report the results of Graphormer from Wang & Zhang (2023).

| Model | PE | Basic | Regular | Extension | CFI | All |
|---|---|---|---|---|---|---|
| GINE | LPE | 19 | 20 | 37 | 3 | 79 |
| (2, 1)-GT | LPE | 36 | 28 | 51 | 3 | 118 |
| (3, 1)-GT | LPE | 55 | 46 | 55 | 3 | 159 |
| GINE | SPE | 56 | 48 | 93 | 20 | 217 |
| (2, 1)-GT | SPE | 60 | 50 | 98 | 18 | 226 |
| (3, 1)-GT | SPE | 60 | 50 | 97 | 21 | 228 |
| Graphormer | — | 16 | 12 | 41 | 10 | 79 |
| 3-WL | — | 60 | 50 | 100 | 60 | 270 |

6. Conclusion

In this work, we propose a hierarchy of expressive pure transformers that are also feasible in practice. We improve existing pure transformers such as Token GT (Kim et al., 2022) in several aspects, both theoretically and empirically. Theoretically, our hierarchy has stronger provable expressivity for each k and is more scalable.
For example, our (2, 1)-GT and (3, 1)-GT have a provable expressivity strictly above the 1-WL but are also feasible in practice. Empirically, we verify our claims about practical feasibility and show that our transformers improve over existing pure transformers, closing the gap between pure transformers and graph transformers with strong graph inductive bias on the large-scale PCQM4Mv2 dataset. Further, we show that fine-tuning our pre-trained transformers is a feasible and resource-efficient approach for applying pure transformers to small-scale datasets. We discover that higher-order transformers can efficiently re-use pre-trained weights from lower-order transformers during fine-tuning, indicating a promising direction for utilizing higher-order expressivity to improve results on downstream tasks. Future work could explore aligning transformers to other variants of Weisfeiler-Leman. Finally, pre-training pure transformers on truly large-scale datasets such as those recently proposed by Beaini et al. (2024) could be a promising direction for graph transformers.

Impact statement

This paper presents work whose goal is to advance the field of machine learning. Our work has many potential societal consequences, none of which must be specifically highlighted here.

Acknowledgements

CM and LM are partially funded by a DFG Emmy Noether grant (468502433) and the RWTH Junior Principal Investigator Fellowship under Germany's Excellence Strategy.

References

Anderson, W. N. and Morley, T. D. Eigenvalues of the Laplacian of a graph. Linear and Multilinear Algebra, 18(2):141–145, 1985.

Azizian, W. and Lelarge, M. Characterizing the expressive power of invariant and equivariant graph neural networks. In International Conference on Learning Representations, 2021.

Babai, L. Lectures on graph isomorphism. University of Toronto, Department of Computer Science. Mimeographed lecture notes, 1979.

Beaini, D., Huang, S., Cunha, J. A., Li, Z., Moisescu-Pareja, G., Dymov, O., Maddrell-Mander, S., McLean, C., Wenkel, F., Müller, L., Mohamud, J. H., Parviz, A., Craig, M., Koziarski, M., Lu, J., Zhu, Z., Gabellini, C., Klaser, K., Dean, J., Wognum, C., Sypetkowski, M., Rabusseau, G., Rabbany, R., Tang, J., Morris, C., Koutis, I., Ravanelli, M., Wolf, G., Tossou, P., Mary, H., Bois, T., Fitzgibbon, A. W., Banaszewski, B., Martin, C., and Masters, D. Towards foundational models for molecular learning on large-scale multi-task datasets. In International Conference on Learning Representations, 2024.

Bodnar, C., Frasca, F., Otter, N., Wang, Y. G., Liò, P., Montúfar, G., and Bronstein, M. M. Weisfeiler and Lehman go cellular: CW networks. In Advances in Neural Information Processing Systems, 2021.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.
M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., Mc Candlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020. Cai, J., Fürer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identifications. Combinatorica, 12(4):389 410, 1992. Chen, Z., Villar, S., Chen, L., and Bruna, J. On the equivalence between graph isomorphism testing and function approximation with gnns. In Advances in Neural Information Processing Systems, 2019. Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems, 2022. Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pretraining of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics, 2019. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. Geerts, F. and Reutter, J. L. Expressiveness and approximation properties of graph neural networks. In International Conference on Learning Representations, 2022. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 2017. Glickman, D. and Yahav, E. Diffusing graph attention. Ar Xiv preprint, 2023. Grohe, M. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Cambridge University Press, 2017. Grohe, M. The logic of graph neural networks. In Symposium on Logic in Computer Science, 2021. He, X., Hooi, B., Laurent, T., Perold, A., Le Cun, Y., and Bresson, X. A generalization of vit/mlp-mixer to graphs. In International Conference on Machine Learning, 2023. Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. Ar Xiv preprint, 2016. Aligning Transformers with Weisfeiler Leman Horn, R. A. and Johnson, C. R. Matrix Analysis, 2nd Edition. Cambridge University Press, 2012. Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, 2020a. Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V. S., and Leskovec, J. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020b. Hu, W., Fey, M., Ren, H., Nakata, M., Dong, Y., and Leskovec, J. OGB-LSC: A large-scale challenge for machine learning on graphs. In Neur IPS: Datasets and Benchmarks Track, 2021. Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graphs. In International Conference on Learning Representations, 2024. 
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 2021. Kim, J., Oh, S., and Hong, S. Transformers generalize deepsets and can be extended to graphs & hypergraphs. In Advances in Neural Information Processing Systems, 2021. Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners. In Advances in Neural Information Processing Systems, 2022. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. Kreuzer, D., Beaini, D., Hamilton, W. L., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. In Advances in Neural Information Processing Systems, 2021. Lim, D., Robinson, J. D., Zhao, L., Smidt, T. E., Sra, S., Maron, H., and Jegelka, S. Sign and basis invariant networks for spectral graph representation learning. In International Conference on Learning Representations, 2023. Lipman, Y., Puny, O., and Ben-Hamu, H. Global attention improves graph networks generalization. Ar Xiv preprint, 2020. Ma, L., Lin, C., Lim, D., Romero-Soriano, A., Dokania, K., Coates, M., H.S. Torr, P., and Lim, S.-N. Graph Inductive Biases in Transformers without Message Passing. In International Conference on Machine Learning, 2023. Malkin, P. N. Sherali Adams relaxations of graph isomorphism polytopes. Discrete Optimization, 2014. Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. In Advances in Neural Information Processing Systems, 2019a. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019b. Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. In International Conference on Machine Learning, 2019c. Méndez-Lucio, O., Nicolaou, C. A., and Earnshaw, B. Mole: a molecular foundation model for drug discovery. Ar Xiv preprint, 2022. Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, 2019. Morris, C., Rattan, G., and Mutzel, P. Weisfeiler and Leman go sparse: Towards higher-order graph embeddings. In Advances in Neural Information Processing Systems, 2020. Morris, C., Rattan, G., Kiefer, S., and Ravanbakhsh, S. Speq Nets: Sparsity-aware permutation-equivariant graph networks. In International Conference on Machine Learning, 2022. Morris, C., L., Y., Maron, H., Rieck, B., Kriege, N. M., Grohe, M., Fey, M., and Borgwardt, K. Weisfeiler and Leman go machine learning: The story so far. Journal of Machine Learning Research, 2023. Müller, L., Galkin, M., Morris, C., and Rampásek, L. Attending to graph transformers. Transactions on Machine Learning Research, 2024. Puny, O., Lim, D., Kiani, B. T., Maron, H., and Lipman, Y. Equivariant polynomials for graph neural networks. In International Conference on Machine Learning, 2023. Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 2022. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61 80, 2009. 
Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. arXiv preprint, 2018.

Tönshoff, J., Ritzert, M., Rosenbluth, E., and Grohe, M. Where did the gap go? Reassessing the long-range graph benchmark. arXiv preprint, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

Wang, Y. and Zhang, M. Towards better evaluation of GNN expressiveness with BREC dataset. arXiv preprint, 2023.

Weisfeiler, B. and Leman, A. The reduction of a graph to canonical form and the algebra which appears therein. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019a.

Xu, K., Wang, L., Yu, M., Feng, Y., Song, Y., Wang, Z., and Yu, D. Cross-lingual knowledge graph alignment via graph matching neural network. In Annual Meeting of the Association for Computational Linguistics, 2019b.

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems, 2021.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, 2017.

Zhang, B., Luo, S., Wang, L., and He, D. Rethinking the expressive power of GNNs via graph biconnectivity. In International Conference on Learning Representations, 2023.

A. Additional implementation details

Here, we outline additional implementation details that make our pure transformers more feasible in practice.

A.1. Positional encodings based on Laplacian eigenmaps

Encoding. Laplacian eigenmaps are a popular choice for PEs in GNNs and graph transformers (Kreuzer et al., 2021; Lim et al., 2023; Müller et al., 2024; Rampášek et al., 2022). In particular, Lim et al. (2023) show that their PEs based on Laplacian eigenmaps generalize other PEs, such as those based on random walks or heat kernels. In what follows, we will briefly review some background on Laplacian eigenvectors and -values as input to machine learning models and then show how such PEs can be constructed to be node- and adjacency-identifying. For an n-order graph G, the graph Laplacian is defined as L := D − A(G), where D denotes the degree matrix and A(G) denotes the adjacency matrix. We then consider the eigendecomposition L = VΣV^T, where the column vectors of V are the eigenvectors and the diagonal matrix Σ contains the corresponding eigenvalues of the graph Laplacian. Since L is real and symmetric, such a decomposition always exists; see, e.g., Corollary 2.5.11 in Horn & Johnson (2012). In addition, the eigenvalues of L are real and non-negative. Some works also consider the normalized graph Laplacian L̃ := D^{−1/2} L D^{−1/2} (Kreuzer et al., 2021; Lim et al., 2023). The statements we made for the graph Laplacian also hold for the normalized graph Laplacian, i.e., L̃ is real and symmetric, and its eigenvalues are real and non-negative. Further, the eigenvectors of real and symmetric matrices are orthonormal (Lim et al., 2023).
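The following NumPy sketch computes the graph Laplacian, its normalized variant, and their eigendecomposition for a toy graph, and checks the orthonormality of the eigenvectors; it is illustrative only.

```python
import numpy as np

# Illustrative path graph on 4 nodes (no isolated nodes, so D is invertible).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))

L = D - A                                      # graph Laplacian L = D - A(G)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt           # normalized graph Laplacian

eigvals, V = np.linalg.eigh(L)                 # real, non-negative eigenvalues
print(np.allclose(V.T @ V, np.eye(4)))         # eigenvectors are orthonormal -> True
print(eigvals.round(3))
```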
We need to address the following to use Laplacian eigenvectors for machine learning. Let v be an eigenvector of a matrix M with eigenvalue λ. Then, −v is also an eigenvector of M with eigenvalue λ. As a result, we want a machine learning model to be invariant to the signs of the eigenvectors. Kreuzer et al. (2021) address this issue by randomly flipping the sign of each eigenvector during training for the model to learn this invariance from data. Concretely, given k eigenvectors v_1, …, v_k, we uniformly sample k independent sign values s_1, …, s_k ∈ {−1, 1} and then plug the eigenvectors s_1·v_1, …, s_k·v_k into the model during training.

A.2. Implementation of LPE and SPE

Here, we describe our concrete implementations of LPE and SPE based on our definitions in Section 4. For LPE, we implement ρ by performing a sum over its input's first dimension and applying an FFN, as described in Section 4. We choose this implementation to stay as close as possible to the Laplacian PEs in Kreuzer et al. (2021). For SPE, we follow the implementation in Huang et al. (2024). Concretely, let

Q := [V diag(ϕ_1(λ)) V^T, …, V diag(ϕ_n(λ)) V^T] ∈ R^{n×n×l}

be the input tensor to ρ. Huang et al. (2024) propose to partition Q along the second axis into n matrices of size n × l and then to apply a GIN to each of those n matrices in parallel, where we use the adjacency matrix of the original graph. Concretely, the GIN maps each n × l matrix to a matrix of shape n × d, where d is the output dimension of ρ. Finally, the output of ρ is the sum of all n × d matrices. We adopt this implementation for our experiments. For prohibitively large graphs, we propose the following modification to the above implementation of ρ. Specifically, we let V_{:m} denote the matrix containing as columns the eigenvectors corresponding to the m smallest eigenvalues. Then, we define

Q_{:m} := [V diag(ϕ_1(λ)) V_{:m}^T, …, V diag(ϕ_n(λ)) V_{:m}^T] ∈ R^{n×m×l}    (6)

and use the same ρ as for the case m = n.

A.3. Atomic types from edge embeddings

Here, we discuss a practical implementation of the atomic type embeddings atp of Section 3.2. In particular, we use the fact that most graph learning datasets also include edge types, for example, the bond type in molecular datasets. Now, for an undirected graph G and a tuple v = (v_1, …, v_k) ∈ V(G)^k_s, let E(v_i, v_j) ∈ R^d denote a learnable embedding of the edge type between nodes v_i and v_j, where we additionally set E(v_i, v_j) = 0 if, and only if, v_i and v_j do not share an edge in G. Further, since we consider graphs without self-loops, we can assign a special learnable vector s to E(v_i, v_j) for v_i = v_j. Then, we can define

atp(v) := [E(v_i, v_j)]_{1 ≤ i < j ≤ k} W,

where W ∈ R^{d(k²−k)/2×d} is a learnable weight matrix projecting the concatenated edge embeddings to the target dimension d. To see that the above faithfully encodes the atomic type, recall the matrix K over {1, 2, 3} determining the atomic type that we introduced in Section 3.2. For undirected graphs, for all i ≠ j, K_{ij} = 1 if E(v_i, v_j) ≠ 0 and E(v_i, v_j) ≠ s, K_{ij} = 2 if E(v_i, v_j) = s, and K_{ij} = 3 if E(v_i, v_j) = 0. As a result, by including additional edge information into the k-GT, we simultaneously obtain atomic type embeddings atp.

B. Additional experimental details

Here, we present the hyper-parameters used during pre-training in Table 6. Further, we describe the hyper-parameter selection strategies employed for the different experiments.

Table 6: Hyper-parameters for (2, 1)-GT pre-training on PCQM4Mv2.

| Parameter | Value |
|---|---|
| Learning rate | 2e−4 |
| Weight decay | 0.1 |
| Attention dropout | 0.1 |
| Post-attention dropout | 0.1 |
| Batch size | 256 |
| # gradient steps | 2M |
| # warmup steps | 60K |
| Precision | bfloat16 |
B. Additional experimental details

Here, we present the hyper-parameters used during pre-training in Table 6. Further, we describe the hyper-parameter selection strategies employed for the different experiments.

Table 6: Hyper-parameters for (2, 1)-GT pre-training on PCQM4MV2.

Parameter               Value
Learning rate           2e-4
Weight decay            0.1
Attention dropout       0.1
Post-attention dropout  0.1
Batch size              256
# gradient steps        2M
# warmup steps          60K
Precision               bfloat16

Across all experiments, we always select the hyper-parameters based on the best validation score and then evaluate on the test set. For CS and PHOTO, we set the learning rate to 0.0001 and tune dropout over {0.5, 0.1}, the hidden dimension over {512, 1024}, the number of attention heads over {1, 2, 4}, and the number of eigenvalues used over {64, 96}. For the node classification datasets, we disable the learning rate scheduler and train for 100 epochs. Due to the large number of nodes and edges in these datasets, we use the SPE node-level PEs according to Equation (6) with m = 384. For ALCHEMY, we only tune the learning rate. Specifically, for the (2, 1)-GT we sweep over {0.001, 0.0005, 0.0003} and for the (3, 1)-GT, due to the additional computational demand, over {0.0005, 0.0003}. For the (2, 1)-GT we train for 2000 epochs and for the (3, 1)-GT we train for 500 epochs, again due to the additional computational demand. For the molecular classification datasets, we tune the transformers on learning rate and dropout. Specifically, we sweep the learning rate over {0.0005, 0.0001, 0.00005, 0.00001} and dropout over {0.5, 0.1}. To ensure that the GINE baseline is representative, we invest more time in hyper-parameter tuning. Specifically, we sweep the number of layers over {3, 6, 12} and the hidden dimension over {384, 768}. Further, we sweep the learning rate over {0.0005, 0.0001, 0.00005, 0.00001} and dropout over {0.5, 0.1}. Finally, we sweep the choice of pooling function over {mean, sum}, used for pooling the node embeddings of the GINE into a single graph-level representation. This results in 12 times as many hyper-parameter configurations for the GINE baseline as for the transformers. For BREC, we use a batch size of 16, 6 layers, and a hidden dimension of 768 for both the GNN and the (2, 1)-GT. For the (3, 1)-GT, we use a batch size of 2, 3 layers, and a hidden dimension of 192.

C. Extended notation

The neighborhood of a vertex v in V(G) is denoted by N(v) := {u ∈ V(G) | (v, u) ∈ E(G)}, and the degree of a vertex v is |N(v)|. Two graphs G and H are isomorphic, written G ≅ H, if there exists a bijection φ: V(G) → V(H) preserving the adjacency relation, i.e., (u, v) is in E(G) if and only if (φ(u), φ(v)) is in E(H). Then φ is an isomorphism between G and H. In the case of labeled graphs, we additionally require that ℓ(v) = ℓ(φ(v)) for v in V(G), and similarly for attributed graphs. We further define the atomic type atp: V(G)^k → N, for k > 0, such that atp(v) = atp(w) for v and w in V(G)^k if and only if the mapping φ: V(G)^k → V(G)^k with v_i ↦ w_i induces a partial isomorphism, i.e., we have v_i = v_j ⟺ w_i = w_j and (v_i, v_j) ∈ E(G) ⟺ (φ(v_i), φ(v_j)) ∈ E(G). Let M ∈ R^{n×p} and N ∈ R^{n×q} be two matrices; then [M ‖ N] ∈ R^{n×(p+q)} denotes column-wise matrix concatenation. Further, let M ∈ R^{p×n} and N ∈ R^{q×n} be two matrices; then their row-wise matrix concatenation stacks M on top of N, yielding a matrix in R^{(p+q)×n}. For a matrix X ∈ R^{n×d}, we denote by X_i the i-th row vector. In the case where the rows of X correspond to nodes in a graph G, we use X_v to denote the row vector corresponding to the node v ∈ V(G).

D. Transformers

Here, we define the (standard) transformer (Vaswani et al., 2017), a stack of alternating blocks of multi-head attention and fully-connected feed-forward networks. In each layer t > 0, given token embeddings X^(t−1) ∈ R^{L×d} for L tokens, we compute

X^(t) := FFN( X^(t−1) + [ h_1(X^(t−1)) ‖ ... ‖ h_M(X^(t−1)) ] W_O ),    (7)

where [· ‖ ·] denotes column-wise concatenation of matrices, M is the number of heads, h_i denotes the i-th transformer head, W_O ∈ R^{Md_v×d} denotes a final projection matrix applied to the concatenated heads, and FFN denotes a feed-forward neural network applied row-wise. We define the i-th head as

h_i(X) := softmax( (1/√d_k) (X W_{Q,i}) (X W_{K,i})^T ) X W_{V,i},    (8)

where the softmax is applied row-wise, W_{Q,i}, W_{K,i} ∈ R^{d×d_k}, W_{V,i} ∈ R^{d×d_v}, and d_v and d are the head dimension and embedding dimension, respectively. We omit layer indices t and optional bias terms for clarity.
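The following is a minimal PyTorch sketch of Equations (7) and (8) as written above (a residual multi-head attention block followed by a row-wise FFN, without layer normalization or biases); it is illustrative only, and the class and parameter names are ours.

```python
import math
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Minimal sketch of Equations (7) and (8): multi-head self-attention over the
    token dimension, concatenated heads, output projection, residual, and FFN."""

    def __init__(self, d: int, num_heads: int, d_k: int, d_v: int):
        super().__init__()
        self.W_Q = nn.ModuleList(nn.Linear(d, d_k, bias=False) for _ in range(num_heads))
        self.W_K = nn.ModuleList(nn.Linear(d, d_k, bias=False) for _ in range(num_heads))
        self.W_V = nn.ModuleList(nn.Linear(d, d_v, bias=False) for _ in range(num_heads))
        self.W_O = nn.Linear(num_heads * d_v, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.d_k = d_k

    def head(self, X, i):
        # Equation (8): softmax((X W_Q,i)(X W_K,i)^T / sqrt(d_k)) X W_V,i
        scores = self.W_Q[i](X) @ self.W_K[i](X).transpose(-2, -1) / math.sqrt(self.d_k)
        return torch.softmax(scores, dim=-1) @ self.W_V[i](X)

    def forward(self, X):
        # Equation (7): FFN(X + [h_1(X) || ... || h_M(X)] W_O)
        heads = torch.cat([self.head(X, i) for i in range(len(self.W_Q))], dim=-1)
        return self.ffn(X + self.W_O(heads))

# Usage: 10 tokens with embedding dimension 32.
layer = TransformerLayer(d=32, num_heads=4, d_k=8, d_v=8)
print(layer(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```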
E. Weisfeiler Leman

Here, we discuss additional background on the Weisfeiler Leman hierarchy. We begin by describing the Weisfeiler Leman algorithm, starting with the 1-WL. The 1-WL or color refinement is a well-studied heuristic for the graph isomorphism problem, originally proposed by Weisfeiler & Leman (1968).² Intuitively, the algorithm determines if two graphs are non-isomorphic by iteratively coloring or labeling vertices. Formally, let G = (V, E, ℓ) be a labeled graph. In each iteration t > 0, the 1-WL computes a vertex coloring C^1_t : V(G) → N, depending on the coloring of the neighbors. That is, in iteration t > 0, we set

C^1_t(v) := RELABEL( C^1_{t−1}(v), {{ C^1_{t−1}(u) | u ∈ N(v) }} ),

for all vertices v in V(G), where RELABEL injectively maps the above pair to a unique natural number which has not been used in previous iterations. In iteration 0, the coloring is C^1_0 := ℓ. To test if two graphs G and H are non-isomorphic, we run the above algorithm in parallel on both graphs. If the two graphs have a different number of vertices colored c in N at some iteration, the 1-WL distinguishes the graphs as non-isomorphic. It is easy to see that the algorithm cannot distinguish all non-isomorphic graphs (Cai et al., 1992). Several researchers, e.g., Babai (1979); Cai et al. (1992), devised a more powerful generalization of the former, today known as the k-dimensional Weisfeiler Leman algorithm (k-WL), operating on k-tuples of vertices rather than single vertices.

²Strictly speaking, the 1-WL and color refinement are two different algorithms. The 1-WL considers neighbors and non-neighbors to update the coloring, resulting in a slightly higher expressive power when distinguishing vertices in a given graph; see (Grohe, 2021) for details. For brevity, we consider both algorithms to be equivalent.
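To make color refinement concrete, the following is a minimal sketch (ours, not the authors' code) of the 1-WL test: the injective RELABEL is realized by hashing the pair of the old color and the sorted multiset of neighbor colors, refinement stops once the partition stabilizes, and the two graphs are refined jointly on their disjoint union so that color ids are shared; the helper names are illustrative.

```python
from collections import Counter

def one_wl_colors(adj_list, labels):
    """Sketch of 1-WL color refinement: iteratively relabel every vertex with an
    injective encoding of (own color, multiset of neighbor colors) until stable."""
    colors = dict(labels)  # vertex -> initial color, i.e., C^1_0 := l
    while True:
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj_list[v])))
            for v in adj_list
        }
        # RELABEL: injectively map each distinct signature to a fresh color id.
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in adj_list}
        # The partition only ever refines, so an unchanged number of colors means stability.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors

def one_wl_distinguishes(adj_g, labels_g, adj_h, labels_h):
    """Run 1-WL on the disjoint union (shared RELABEL) and compare color histograms."""
    offset = max(adj_g) + 1
    adj = {**adj_g, **{v + offset: [u + offset for u in nbrs] for v, nbrs in adj_h.items()}}
    labels = {**labels_g, **{v + offset: c for v, c in labels_h.items()}}
    colors = one_wl_colors(adj, labels)
    hist_g = Counter(colors[v] for v in adj_g)
    hist_h = Counter(colors[v + offset] for v in adj_h)
    return hist_g != hist_h

# Usage: two triangles vs. one 6-cycle (both 2-regular, so 1-WL cannot tell them apart).
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
uniform = {v: 0 for v in range(6)}
print(one_wl_distinguishes(two_triangles, uniform, hexagon, uniform))  # False
```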
E.1. The k-dimensional Weisfeiler Leman algorithm

Due to the shortcomings of the 1-WL or color refinement in distinguishing non-isomorphic graphs, several researchers, e.g., Babai (1979); Cai et al. (1992), devised a more powerful generalization of the former, today known as the k-dimensional Weisfeiler-Leman algorithm, operating on k-tuples of vertices rather than single vertices. Intuitively, to surpass the limitations of the 1-WL, the k-WL colors vertex-ordered k-tuples instead of single vertices. More precisely, given a graph G, the k-WL colors the tuples from V(G)^k for k ≥ 2 instead of the vertices. By defining a neighborhood between these tuples, we can define a coloring similar to the 1-WL. Formally, let G be a graph, and let k ≥ 2. In each iteration t ≥ 0, the algorithm, similarly to the 1-WL, computes a coloring C^k_t : V(G)^k → N. In the first iteration, t = 0, two tuples v and w in V(G)^k get the same color if they have the same atomic type, i.e., C^k_0(v) := atp(v). Then, for each iteration t > 0, C^k_t is defined by

C^k_t(v) := RELABEL( C^k_{t−1}(v), M_t(v) ),    (9)

with M_t(v) the multiset

M_t(v) := ( {{ C^k_{t−1}(ϕ_1(v, w)) | w ∈ V(G) }}, ..., {{ C^k_{t−1}(ϕ_k(v, w)) | w ∈ V(G) }} ),    (10)

where ϕ_j(v, w) := (v_1, ..., v_{j−1}, w, v_{j+1}, ..., v_k). That is, ϕ_j(v, w) replaces the j-th component of the tuple v with the vertex w. Hence, two tuples are adjacent or j-neighbors if they differ in the j-th component (or are equal, in the case of self-loops). Two tuples v and w with the same color in iteration (t−1) get different colors in iteration t if there exists a j in [k] such that the number of j-neighbors of v and w, respectively, colored with a certain color is different. We run the k-WL algorithm until convergence, i.e., until, for some t in N,

C^k_t(v) = C^k_t(w) ⟺ C^k_{t+1}(v) = C^k_{t+1}(w)

holds for all v and w in V(G)^k. Similarly to the 1-WL, to test whether two graphs G and H are non-isomorphic, we run the k-WL in parallel on both graphs. If the two graphs have a different number of k-tuples colored c, for some c in N, the k-WL distinguishes the graphs as non-isomorphic. By increasing k, the algorithm becomes more powerful in distinguishing non-isomorphic graphs, i.e., for each k ≥ 2, there are non-isomorphic graphs distinguished by the (k + 1)-WL but not by the k-WL (Cai et al., 1992). In the following, we define some variants of the k-WL.

E.2. The δ-k-dimensional Weisfeiler Leman algorithm

Malkin (2014) introduced the following variant of the k-WL, the δ-k-WL, which updates k-tuples according to

M^adj_t(v) := ( {{ (C^{k,adj}_{t−1}(ϕ_1(v, w)), adj(v_1, w)) | w ∈ V(G) }}, ..., {{ (C^{k,adj}_{t−1}(ϕ_k(v, w)), adj(v_k, w)) | w ∈ V(G) }} ),

where

adj(v, w) := 1 if (v, w) ∈ E(G), and 0 otherwise,

resulting in the coloring function C^{k,adj}_t : V(G)^k → N. Morris et al. (2020) showed that this variant is slightly more powerful in distinguishing non-isomorphic graphs compared to the k-WL.

E.3. The local δ-k-dimensional Weisfeiler Leman algorithm

Morris et al. (2020) introduced a more efficient variant of the k-WL, the local δ-k-dimensional Weisfeiler Leman algorithm (δ-k-LWL), which updates k-tuples according to

M^δ_t(v) := ( {{ C^{k,δ}_{t−1}(ϕ_1(v, w)) | w ∈ N(v_1) }}, ..., {{ C^{k,δ}_{t−1}(ϕ_k(v, w)) | w ∈ N(v_k) }} ),

resulting in the coloring function C^{k,δ}_t : V(G)^k → N.

E.4. The local (k, s)-dimensional Weisfeiler Leman algorithm

Let G be a graph. Then #comp(G) denotes the number of (connected) components of G. Further, let k ≥ 1 and 1 ≤ s ≤ k. Then

V(G)^k_s := {v ∈ V(G)^k | #comp(G[v]) ≤ s}

is the set of (k, s)-tuples of nodes, i.e., k-tuples which induce (sub-)graphs with at most s (connected) components. In contrast to the algorithms of Equation (9), the (k, s)-LWL colors tuples from V(G)^k_s instead of the entire V(G)^k, resulting in the coloring C^{k,s}_t : V(G)^k_s → N. For more details, see Morris et al. (2022).

E.5. Comparing k-WL variants

Let A_1 and A_2 denote k-WL-like algorithms. We write A_1 ⊑ A_2 if A_1 distinguishes all non-isomorphic pairs that A_2 does, and A_1 ≡ A_2 if both directions hold. The corresponding strict relation is denoted by ⊏. We can extend these relations to neural architectures as follows. Given two non-isomorphic graphs G and H, a neural network architecture distinguishes them if a parameter assignment exists such that h_G ≠ h_H.
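To make the tuple-refinement procedure of Equations (9) and (10) concrete, the following is a minimal sketch (ours, not the authors' code) of the k-WL on all n^k tuples: the initial color is an injective encoding of the atomic type, and each refinement step collects, for every position j, the multiset of colors of the j-neighbors. The exponential cost of enumerating V(G)^k is exactly what the δ-k-LWL and the (k, s)-LWL above mitigate by restricting the neighborhoods and the tuple set, respectively; the helper names are illustrative.

```python
from itertools import product

def atomic_type(graph_adj, labels, v):
    """Initial color C^k_0(v): the pattern of equalities and adjacencies within the
    tuple, together with the node labels (an injective encoding of atp)."""
    k = len(v)
    pattern = tuple(
        (v[i] == v[j], v[j] in graph_adj[v[i]]) for i in range(k) for j in range(k)
    )
    return (pattern, tuple(labels[u] for u in v))

def k_wl(graph_adj, labels, k):
    """Sketch of the k-WL (Equations (9) and (10)): refine colors of all k-tuples
    using, for each position j, the multiset of colors of the j-neighbors."""
    nodes = list(graph_adj)
    tuples = list(product(nodes, repeat=k))
    colors = {v: atomic_type(graph_adj, labels, v) for v in tuples}
    while True:
        signatures = {}
        for v in tuples:
            multisets = tuple(
                tuple(sorted(colors[v[:j] + (w,) + v[j + 1:]] for w in nodes))
                for j in range(k)  # phi_j(v, w) swaps the j-th component with w
            )
            signatures[v] = (colors[v], multisets)
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in tuples}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors

# Usage: stable 2-WL coloring of a path on three vertices.
path = {0: {1}, 1: {0, 2}, 2: {1}}
coloring = k_wl(path, {0: 0, 1: 0, 2: 0}, k=2)
print(len(set(coloring.values())))  # number of stable 2-tuple colors
```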
E.6. The Weisfeiler Leman hierarchy and permutation-invariant function approximation

The Weisfeiler Leman hierarchy is a purely combinatorial algorithm for testing graph isomorphism. However, the graph isomorphism function, mapping non-isomorphic graphs to different values, is the hardest permutation-invariant function to approximate. Hence, the Weisfeiler Leman hierarchy has strong ties to the capabilities of GNNs to approximate permutation-invariant or -equivariant functions over graphs. For example, Morris et al. (2019); Xu et al. (2019a) showed that the expressive power of any possible GNN architecture is limited by the 1-WL in terms of distinguishing non-isomorphic graphs. Azizian & Lelarge (2021) refined these results by showing that if an architecture is capable of simulating the k-WL (on the set of n-order graphs) and allows the application of universal neural networks on vertex features, it will be able to approximate any permutation-equivariant function below the expressive power of the k-WL; see also (Chen et al., 2019). Hence, if one shows that one architecture distinguishes more graphs than another, it follows that the corresponding GNN can approximate more functions. These results were refined in (Geerts & Reutter, 2022) for color refinement, taking into account the number of iterations of the k-WL.

E.7. A generalized adjacency matrix

Finally, we define a generalization of the adjacency matrix from nodes to k-tuples. Specifically, let G be a graph with n nodes and let γ ∈ {−1, 1}. Then, for all j ∈ [k], the generalized adjacency matrix A^(k,j,γ) ∈ {0, 1}^{n^k × n^k} is defined as

A^(k,j,γ)_il := 1 if ∃ w ∈ V(G): u_l = ϕ_j(u_i, w) and adj(u_{ij}, w), and 0 else, for γ = 1,
A^(k,j,γ)_il := 1 if ∃ w ∈ V(G): u_l = ϕ_j(u_i, w) and not adj(u_{ij}, w), and 0 else, for γ = −1,    (11)

where u_i denotes the i-th k-tuple in a fixed but arbitrary ordering over V(G)^k, u_{ij} denotes the j-th node of u_i, and γ controls whether the generalized adjacency matrix considers (γ = 1) j-neighbors where the swapped nodes are adjacent in G or (γ = −1) j-neighbors where the swapped nodes are not adjacent in G; all j-neighbors are covered by the sum of both cases. With the generalized adjacency matrix as defined above, we can represent the local and global j-neighborhood adjacency defined for the δ-k-WL with A^(k,j,1) and A^(k,j,−1), respectively. Further, the j-neighborhood adjacency of the k-WL can be described via A^(k,j,1) + A^(k,j,−1), and the local j-neighborhood adjacency of the δ-k-LWL can be described with A^(k,j,1).

F. Structural embeddings

Before we give the missing proofs from Section 4, we develop a theoretical framework to analyze structural embeddings.

F.1. Sufficiently node- and adjacency-identifying

We begin by proving the following lemma, which shows an important bound for sufficiently node- and adjacency-identifying matrices.

Lemma 9. Let P, Q ∈ R^{n×d} be matrices such that

‖P − Q‖_F < ϵ,

for an arbitrary but fixed ϵ > 0. Let P be node- or adjacency-identifying with projection matrices W_Q, W_K ∈ R^{d×d}, and let

P̃ = (1/√d_k) (P W_Q)(P W_K)^T, and Q̃ = (1/√d_k) (Q W_Q)(Q W_K)^T.

Then, there exists a monotonic strictly increasing function f such that ‖P̃ − Q̃‖_F < f(ϵ).

Proof. Our goal is to show that if the error between P and Q is bounded, so is the error between P̃ and Q̃. We show that this error can be bounded by a monotonic strictly increasing function f, i.e., if ϵ_1 < ϵ_2, then f(ϵ_1) < f(ϵ_2). We first prove the existence of f. First, note that we can write

P̃ − Q̃ = (1/√d_k) [ P W_Q (P W_K − Q W_K)^T + (P W_Q − Q W_Q)(Q W_K)^T ].

Note further that we are guaranteed that

‖P‖_F > 0,    (12)
‖W_Q‖_F > 0,    (13)
‖W_K‖_F > 0,    (14)

since otherwise at least one of the above matrices is zero, in which case P̃ = 0, which is not node- or adjacency-identifying in general, a contradiction.
Now we can write, || P Q||F = 1 dk PWQ(PWK QWK)T + (PWQ QWQ)(QWK)T F PWQ(PWK QWK)T F + (PWQ QWQ)(QWK)T F PWQ F (PWK QWK)T F + (QWK)T F PWQ QWQ F PWQ F PWK QWK F + QWK F PWQ QWQ F P Q F + Q F = 1 dk WQ F WK F P Q F P F + Q F (e)< ϵ dk WQ F WK F P F + Q F Here, we used (a) the triangle inequality, (b) the Cauchy-Schwarz inequality (c) the fact that for any matrix X, ||X||F = ||XT ||F, (d) again Cauchy-Schwarz and finally (e) the Lemma statement, namely that ||P Q||F < ϵ combined with the facts in Equation (12), (13) and (14), guaranteeing that || P Q||F > 0. We set f(ϵ) := ϵ dk WQ F WK F P F + Q F which is clearly monotonic strictly increasing, since the norms as well as 1 dk are non-negative and Equation (12), (13) and (14) ensure that f(ϵ) > 0. As a result, we can now write || P Q||F < f(ϵ). This concludes the proof. Aligning Transformers with Weisfeiler Leman We use sufficiently node or adjacency-identifying matrices for approximately recovering weighted indicator matrices, which we define next. Definition 10 (Weighted indicators). Let x = (x1, . . . , xn) {0, 1}n be an n-dimensional binary vector. We call x := x Pn i=1 xi , the weighted indicator vector of x. Further, let X {0, 1}n n be a binary matrix. Let now X Rn n be a matrix such that the i-th row of X is the weighted indicator vector of the i-th row of X. We call X the weighted indicator matrix of X. The following lemma ties (sufficiently) node or adjacency-identifying matrices and weighted indicators together. Lemma 11. Let P Rn n be a matrix and let X be a binary matrix with weighted indicator matrix X, such that for every i, j [n], Pij = max k Pik if, and only, if Xij = 1. Then, for a matrix Q Rn n and all ϵ > 0, there exist an δ > 0 and a b > 0 such that if P Q F < δ, then, softmax b Q X F < ϵ, where softmax is applied row-wise. Proof. We begin by reviewing how the softmax acts on a vector z = (z1, . . . , zn) Rn. Let zmax := maxi zi be the maximum value in z. Further, let x = (x1, . . . , xn) {0, 1}n be a binary vector such that ( 1 zi = zmax 0 else . Now, for a scalar b > 0, lim b inf softmax b z = x Pn i=1 xi , i.e., as b goes to infinity, the softmax converges b z to the weighted indicator vector x = x Pn i=1 xi . Let us now generalize this to matrices. Specifically, let P Rn n be a matrix, let X be a binary matrix with weighted indicator matrix X such that for every i, j [n], Pij = max k Pik if and only if Xij = 1. Then, for a scalar b > 0, lim b inf softmax b P = X, which follows from the fact that the softmax is applied independently to each row and each row of X is a weighted indicator vector. We now show the proof statement. First, note that for any b > 0 we can choose a δ < f(b), where f : R R is some strictly monotonically decreasing function of b that shrinks faster than linear, e.g., f(b) = 1 b2 . This function is well-defined since by assumption b > 0. As a result an increase in b implies a non-linearly growing decrease of softmax b P softmax b Q F and thus, lim b inf softmax b Q = X, Aligning Transformers with Weisfeiler Leman x1 x2 x3 (a) 0.0 0.25 0.5 0.75 1 Figure 2: Visual explanation of the "opposing forces" in Lemma 11. In (a) before softmax: We consider three numbers x1, x2 and x3, where x2 and x3 are less than δ apart. In (b) after softmax: An increase in b (blue) pushes the maximum value x3 away from x1 and x2. However, the approximation with δ acts stronger (red). As a result, x1 gets pushed closer to 0, but x2 and x3 get pushed closer. 
In (c) after softmax: Further increasing b makes x1 converge to 0, but the approximation with δ pushes x2 and x3 closer together, and the softmax maps both values approximately to 1 2 (here depicted with the same dot). Hence, with a sufficiently close approximation, we can approximate the weighted indicator matrix X. and we have that for all ϵ > 0 there exists a b > 0 and a δ < f(b) such that softmax b Q X F < ϵ. This concludes the proof. In the above proof, the approximation with δ and the scaling with b act as two "opposing forces". The proof then chooses b such that ϵ acts stronger than b and hence with b inf, the approximation converges to the weighted indicator matrix; see Figure 2 for a visual explanation of this concept. From the definitions of sufficiently node and adjacency-identifying matrices, we can prove the following statement. Lemma 12. Let G be a graph with adjacency matrix A(G) and degree matrix D. Let Q be a sufficiently adjacency-identifying matrix. Then, there exists a b > 0 and projection matrices WQ, WK Rd d with Q := 1 dk (QWQ)(QWK)T such that softmax b Q D 1A(G) F < ϵ, for all ϵ > 0. Proof. Note that D 1A(G) is a weighted indicator matrix, since A(G) has binary entries and left-multiplication with D 1 results in dividing every element in row i of A(G) by the number of 1 s in row i of A(G), or formally, i = h A(G)i1 Pn j=1 A(G)ij . . . A(G)in Pn j=1 A(G)ij i , where we note that j=1 A(G)ij. Further, since Q is sufficiently adjacency-identifying, there exists an adjacency-identifying matrix P and projection matrices WQ and WK, such that for P = 1 dk (PWQ)(PWK)T , Aligning Transformers with Weisfeiler Leman it holds that Pij = max k Pik, if and only if A(G) = 1. Recall the definition of sufficiently adjacency-identifying, namely that we can choose projection matrices such that ||P Q||F < ϵ0, for all ϵ0. Then, according to Lemma 9, for matrix Q = 1 dk (QWQ)(QWK)T , it holds that P Q F < f(ϵ0), where f is a strictly monotonically increasing function of ϵ0. We can then apply Lemma 11 to P, Q, A(G) as the binary matrix and D 1A(G) as its weighted indicator matrix and, for every ϵ, choose ϵ0 small enough such that there exists a b > 0 with softmax b Q D 1A(G) F < ϵ. This concludes the proof. Finally, we generalize the above result to arbitrary k via the generalized adjacency matrix. Lemma 13 (Approximating the generalized adjacency matrix). Let G be graph with n nodes and let k > 1. Let further, Q Rn d/k be sufficiently node and adjacency-identifying structural embeddings. Let v := (v1, . . . , vk) V (G)k be the i-th and let u := (u1, . . . , uk) V (G)k be the l-th k-tuple in a fixed but arbitrary ordering over V (G)k. Let A(k,j,γ) denote the weighted indicator matrix of the generalized adjacency matrix in Equation (11). Let Z Rnk nk be the unnormalized attention matrix such that Zil := 1 dk Q(vj) k j=1WQ Q(uj) k j=1WK T , with projection matrices WQ, WK Rd d. Then, for every j [k] and every γ { 1, 1}, there exist projection matrices WQ, WK such that for all i [nk] softmax Zi1 . . . Zink A(k,j,γ) i F < ε, for any ε > 0. Proof. Our proof strategy is to first construct the unnormalized attention matrix from a nodeand adjacency-identifying matrix and then to invoke Lemma 11 to relax the approximation to the sufficiently nodeand adjacency-identifying Q. First, by definition of sufficiently node and adjacency-identifying matrices, the existence of Q implies the existence of a node and adjacency-identifying matrix P. 
We will now give projection matrices WQ and WK such that softmax Z i1 . . . Z ink A(k,j,γ) i F < ε. (15) for any ε > 0, with Z il := 1 dk P(vj) k j=1WQ P(uj) k j=1WK T , where we recall that v = (v1, . . . , vk) V (G)k is the i-th and u = (u1, . . . , uk) V (G)k is the l-th k-tuple in a fixed but arbitrary ordering over V (G)k. To show Equation (15), we expand sub-matrices in WQ and WK for each j, i.e., we write WQ,1 ... WQ,k Aligning Transformers with Weisfeiler Leman WK,1 ... WK,k where for all j [k], Q(vj) is projected by WQ,j and Q(uj) is projected by WK,j. We further expand the sub-matrices of each WQ,j and WK,j, writing WQ,j = WQ,j N WQ,j A WK,j = WK,j N WK,j A Since P is node-identifying, there exists projection matrices WQ, N and WK, N such that pilj := 1 dk P WQ, N (P WK, N )T is maximal if and only if ui,j = ul,j is the same node. Further, if P is adjacency-identifying, there exists projection matrices WQ, A and WK, A such that qilj := 1 dk P WQ, A (P WK, A )T , is maximal if and only if nodes ui,j and ul,j share an edge in G. Consequently, we also know that then qilj is maximal if and only if nodes ui,j and ul,j do not share an edge in G. Note that because we assume that G has no self-loops, qiij = 0 for all i [n] and all j [k]. Now, let us write out the unnormalized attention score Z il as Z il = 1 dk P(vj) k j=1WQ P(uj) k j=1WK T j=1 P(vj)WQ,j P(uj)WK,j T j=1 P(vj)WQ,j N P(uj)WK,j N T + 1 dk j=1 P(vj)WQ,j A P(uj)WK,j A T , where in the second line we simply write out the dot-product and in the last line we split the sum to distinguish nodeand adjacency-identifying terms. Now for all γ and a fixed j, we set WQ,j N = 0 and WK,j N = 0 and WQ,o N = b WQ, N and WK,o N = WK, N for o = j, for some b > 0 that we need to be arbitary to use it later in Lemma 11. We now distinguish between the different choices of γ. In the first case, let γ > 0. Then, we set WQ,j A = WQ, A and WK,j A = WK, A . In the third case, γ < 0. Then, we set WQ,j A = WQ, A and WK,j A = WK, A . Then, in all cases, for tuples ui and ul, Z il = γ qilj + X o =j b pilo. (16) Note that b is the same for all pairs ui, ul and the dot-product is positive. We again distinguish between two cases. In the first case, let γ > 0. Then, the above sum attains its maximum value if and only if o = j : ui,o = ul,o Ailj = 1. Now, since ui,o and ul,o denote nodes at the o-th component of ui and ul, respectively, the above statement is equivalent to saying that ul is a j-neighbor of ui and that ul is adjacent to ui. In the second case, let γ < 0. Then, the above sum attains its maximum value if and only if o = j : ui,o = ul,o Ailj = 0. Aligning Transformers with Weisfeiler Leman Again, since ui,o and ul,o denote nodes at the o-th component of ui and ul, respectively, the above statement is equivalent to saying that ul is a j-neighbor of ui and that ul is not adjacent to ui. Now, recall the generalized adjacency matrix in Equation (11) as A(k,j,γ) il := ( 1 w V (G): ul = ϕj(ui, w) adj(uij, w) 0 else γ = 1 ( 1 w V (G): ul = ϕj(ui, w) adj(uij, w) 0 else. γ = 1 Then, we can say that for our construction of Z , for all i, l [nk] Z il = max k Z ik if and only if Ak,j,γ il = 1. Consequently, we can apply Lemma 11 to Z , Z, Ak,j,γ as the binary matrix and Ak,j,γ as its weighted indicator matrix and obtain that for each ϵ > 0 there exists a b > 0 such that softmax Zi1 . . . Zink Ak,j,γ i F < ε. This completes the proof. F.2. 
adjacency-identifying via graph Laplacian Here, we show a few results for how factorizations of the (normalized) graph Laplacian can be adjacency-identifying. Lemma 14. Let G be a graph with graph Laplacian L. Then, a matrix P is nodeand adjacency-identifying if there exists matrices WQ and WK such that 1 dk (PWQ)(PWK)T = L. Proof. Note that the graph Laplacian is node-identifying, since all off-diagonal elements are 0 and the diagonal is always > 0 since we consider graphs G without self-loops and without isolated nodes. Note further, that if there exists matrices WQ and WK such that 1 dk (PWQ)(PWK)T = L, then there also exist matrix WQ = WQ such that 1 dk (PWQ )(PWK)T = L. Now, note that the negative graph Laplacian is L = A(G) D. Because we subtract the degree matrix from the adjacency matrix, the maximum element of each row of the negative graph Laplacian is 1. Since D is diagonal, for each row i and each column j, Lij = 1 A(G)ij = 1, for i = j. Further, in the case where i = j, Lij 0, since we consider graphs G without self-loops. Hence, we obtain that Lij = max k ( Lik) if and only if A(G)ij = 1. This shows the statement. Lemma 15. Let G be a graph with graph Laplacian L. Then, a matrix P is node and adjacency-identifying if Aligning Transformers with Weisfeiler Leman Proof. Our goal is to show that there exist matrices WQ and WK, such that 1 dk (PWQ)(PWK)T = L, and then to invoke Lemma 14. We set where I is the identity matrix. Then, 1 dk (PWQ)(PWK)T = PPT = L. This shows the statement. Lemma 16. Let G be a graph with graph Laplacian L. Then, a matrix P Q is nodeand adjacency-identifying if Proof. Our goal is to show that there exist matrices WQ and WK, such that 1 dk (PWQ)(PWK)T = L, and then to invoke Lemma 14. We set WQ = dk I 0 WK = 0 dk I , where I is the identity matrix and 0 is the all-zero matrix. Then, P Q WQ P Q WK T = PQT = L. This shows the statement. For the next two lemmas, we first briefly define permutation matrices as binary matrices M that are right-stochastic, i.e., its rows sum to 1, and such that each column of M is 1 at exactly one position and 0 elsewhere. It is well know that for permutation matrices M it holds that MT M = MMT = I, where I is the identity matrix. We now state the following lemmas. Lemma 17. Let G be a graph with graph Laplacian L. Let L = UΣUT be the eigendecomposition of L. Then, for any permutation matrix M, the matrix UΣ 1 2 M is adjacency-identifying. Proof. We have that UΣ 1 2 M(UΣ 1 2 M)T = UΣ 1 2 MMT Σ 1 2 UT Hence, by Lemma 15, UΣ 1 2 M is adjacency-identifying. Lemma 18. Let G be a graph with graph Laplacian L and normalized graph Laplacian L. Let L = UΣUT be the eigendecomposition of L and let D denote the degree matrix. Then, for any permutation matrix M the matrix D 1 2 UΣ 1 2 M is adjacency-identifying. Aligning Transformers with Weisfeiler Leman Proof. We have that D 1 2 UΣ 1 2 M(D 1 2 UΣ 1 2 M)T = D 1 2 UΣ 1 2 MMT Σ 1 2 UT D 1 2 = D 1 2 UΣUT D 1 2 = D 1 2 LD 1 2 = D 1 2 (I D 1 Hence, by Lemma 15, D 1 2 UΣ 1 2 is node or adjacency-identifying. Here, we show that the node-level PEs LPE as defined in Equation (5) are sufficiently nodeand adjacency-identifying. We begin with the following useful lemma. Lemma 19. Let P Rn d be structural embeddings with sub-matrices Q1 Rn d and Q2 Rn d , i.e., P = Q1 Q2 , where d = d +d . If one of Q1, Q2 is (sufficiently) node-identifying, then P is (sufficiently) node-identifying. If one of Q1, Q2 is (sufficiently) adjacency-identifying, then P is (sufficiently) adjacency-identifying. 
If Q1 is (sufficiently) node-identifying and Q2 is (sufficiently) adjacency-identifying or vice versa, then P is (sufficiently) node and adjacency-identifying. Proof. Let a sub-matrix Q be (sufficiently) nodeor adjacency-identifying. Let WQ,Q and WK,Q be the corresponding projection matrices with which Q is (sufficiently) nodeor adjacency-identifying. Then, if Q = Q1, we define WQ = WQ,Q 0 0 0 WK = WQ,K 0 0 0 and if Q = Q2, we define WQ = 0 0 WQ,Q 0 WK = 0 0 WQ,K 0 In both cases, we have that PWQ = QWQ,Q 0 PWK = QWQ,K 0 and consequently, PWQ(PWK)T = QWQ,Q(QWQ,K)T . Hence if Q1 is (sufficiently) node-identifying, then P is (sufficiently) node-identifying. If Q2 is (sufficiently) nodeidentifying, then P is (sufficiently) node-identifying. If Q1 is (sufficiently) adjacency-identifying, then P is (sufficiently) adjacency-identifying. If Q2 is (sufficiently) adjacency-identifying, then P is (sufficiently) adjacency-identifying. If Q1 is (sufficiently) node-identifying and Q2 is (sufficiently) adjacency-identifying or vice versa, then P is (sufficiently) node and adjacency-identifying. This concludes the proof. We will now prove a slightly more general statement than Theorem 7. Aligning Transformers with Weisfeiler Leman Theorem 20 (Slightly more general than Theorem 7 in main text). Structural embeddings with LPE according to Equation (5) as node-level PEs with embedding dimension d are sufficiently nodeand adjacency-identifying, irrespective of whether the underlying Laplacian is normalized or not. Further, for graphs with n nodes, the statement holds for d (2n + 1). Proof. We begin by stating that the domain of LPE is compact since eigenvectors are unit-norm and eigenvalues are bounded by twice the maximum node degree for graphs without self-loops (Anderson & Morley, 1985). Further, the domain of the structural embeddings is compact, since LPE is compact and the node degrees are finite. Hence, the domain of deg is compact. Let G be a graph with graph Laplacian L. Let L = UΣUT be the eigendecomposition of L, where the i-th column of U contains the i-th eigenvector, denoted vi, and the i-th diagonal entry of Σ contains the i-th eigenvalue, denoted λi. Recall that structural embeddings with LPE as node-level PEs are defined as P(v) = FFN(deg(v) + LPE(v)), for all v V (G). Since the domains of both deg and LPE are compact, there exist parameters for both of these embeddings such that without loss of generality we may assume that for v V (G), deg(v) + LPE(v) = deg(v) LPE(v) Rd and that deg and LPE are instead embedded into some smaller p-dimensional and s-dimensional sub-spaces, respectively, where it holds that d = p + s. To show node and adjacency identifiability, we divide up the s-dimensional embedding space of LPE into two s/2-dimensional sub-spaces node and adj, i.e., we write LPE(v) = node(v) adj(v) Rn s, node(v) = ρnode k X j=1 ϕnode(vji, λj + ϵj) adj(v) = ρadj k X j=1 ϕadj(vji, λj + ϵj) , and where we have chosen ρ to be a sum over the first dimension of its input, followed by FFN ρnode and ρadj, for node and adjacency, respectively. Note that we have written out Equation (5) for a single node v. For convenience, we also fix an arbitrary ordering over the nodes in V (G) and define deg(v1) ... deg(vn) node(v1) ... node(vn) adj(v1) ... adj(vn) where vi is the i-th node in our ordering. 
Further, note that node and adj are Deep Sets (Zaheer et al., 2017) over the set Mi := {(vji, λj + ϵj)}k j=1, where vj is again the j-th eigenvector with corresponding eigenvalue λj and ϵj is a learnable scalar. With Deep Sets we can universally approximate permutation invariant functions. We will use this fact in the following, where we show that there exists Aligning Transformers with Weisfeiler Leman a parameterization of QN and QA such that QN is sufficiently node-identifying and QA is sufficiently adjacency-identifying. Then, it follows from Lemma 19 that P is sufficiently node and adjacency-identifying. We will further use QD later in the proof. We begin by showing that QN is sufficiently node-identifying. Observe that the eigenvector matrix U is already nodeidentifying since U forms an orthonormal basis and hence UUT is the n-dimensional identity matrix I. Clearly for I it holds that Iij = max k Iik, if and only if i = j. Moreover, let M be any permutation matrix, i.e., each column of M is 1 at exactly one position and 0 else and the rows of M sum to 1. Then, UMMT UT = UUT = I is also node-identifying. We will now approximate UM for some M with node arbitrarily close. Specifically, for the i-th node vi in our node ordering, we choose ρnode and ϕnode such that node(vi) Ui M F < ϵ, for all ϵ > 0 and all i. Since Deep Sets can universally approximate permutation invariant functions, it remains to show that there exists a permutation invariant function f such that f(Mi) = Ui M for all i and for some M. To this end, note that for a graph G, there are only at most n unique eigenvalues of the corresponding (normalized) graph Laplacian. Hence, we can choose ϵj such that λj + ϵj is unique for each unique j. In particular, let where we choose δ < minl,o |λl λo| > 0, i.e., the smallest non-zero difference between two eigenvalues. We now define f as f({(vji, λj + ϵj)}k j=1) = v1i . . . vki = Ui M, where the order of the components is according to the sorted λj + ϵj in ascending order. This order is reflected in some permutation matrix M. Hence, f is permutation invariant and can be approximated by a Deep Set arbitrarily close. As a result, we have node(vi) Ui M F < ϵ, for all i and an arbitrarily small ϵ > 0. In matrix form, we have that UM QN F < ϵ and since UM is node-identifying, we can invoke Lemma 9, to conclude that QN is sufficiently node-identifying. As a result, P has a sufficiently node-identifying sub-space and is thus, also sufficiently node-identifying according to Lemma 19. We continue by showing that QA is sufficiently adjacency-identifying. Note that according to Lemma 17, UΣ 1 2 is adjacencyidentifying. We will now approximate UΣ 1 2 with adj arbitrarily close. Specifically, for the i-th node vi in our node ordering, we choose ρadj and ϕadj such that adj(vi) UΣ 1 2 i F < ϵ, for all ϵ > 0 and all i. To this end, we first note that right-multiplication of U by Σ 1 2 is equal to multiplying the i-th column of U, i.e., the eigenvector vi, with the j-th diagonal element of Σ 1 2 , i.e., p λj. Hence, for the j-node vj it holds 1 2 i = v1j λ1 . . . vnj λn Rn, where vij denotes the j-th component of vi. Since Deep Sets can universally approximate permutation invariant functions, it remains to show that there exists a permutation invariant function f such that f(Mi) = UΣ 1 2 i M for all i and for some M. To this end, note that for a graph G, there are only at most n unique eigenvalues of the corresponding (normalized) graph Laplacian. 
Hence, we can choose ϵj such that λj + ϵj is unique for each unique j. In particular, let Aligning Transformers with Weisfeiler Leman where we choose δ < minl,o |λl λo| > 0, i.e., the smallest non-zero difference between two eigenvalues. We now define f as f({(vji, λj + j δ)}k j=1) = v1i λ1 + ϵ1 + δ . . . vki λk + k δ , where the order of the components is according to the sorted λj + ϵj in ascending order. This order is reflected in some permutation matrix M, which we will use next. Now, since we can choose δ arbitrarily low, we can choose it such that f({(vji, λj + j δ)}k j=1) UΣ 1 2 i M||F < ϵ, for any ϵ > 0. Further, f is permutation invariant and can be approximated by a Deep Set arbitrarily close. As a result, we have adj(vi) UΣ 1 2 i M F < ϵ, for all i and an arbitrarily small ϵ > 0. In matrix form, we have that UΣ 1 2 M QA F < ϵ. Now, we need to distinguish between the non-normalized and normalized Laplacian. In the first case, we consider the graph Laplacian underlying the eigendecomposition to be non-normalized, i.e., L = D A(G). First, we know from Lemma 17 that UΣ 1 2 M is adjacency-identifying. Further, we know from Lemma 9 that then QA is sufficiently adjacency-identifying, since QA can approximate UΣ 1 2 M arbitrarily close. As a result, P has a sufficiently adjacency-identifying sub-space and is, thus, also sufficiently adjacency-identifying. In the second case, we consider the graph Laplacian underlying the eigendecomposition to be normalized, i.e., L = I D 1 2 . Here, recall our construction for P, namely P(v) = FFN deg(v) node(v) adj(v) or in matrix form P = FFN QD QN QA . We will now use FFN to approximate the following function f, defined as f QD QN QA = QD QN D 1 2 QA . Note that a FFN can approximate such a function arbitrarily close since our domain is compact and left-multiplication by D 1 2 is equivalent to multiplying the i-th row of QA with di, where di is the degree of node i. Further, the i-th row of QD is an embedding deg(vi) of di. We can choose this embedding to be deg(vi) = di 0 . . . 0 Rp, where we write di into the first component and pad the remaining vector with zeros to fit the target dimension p of deg. Hence, with the FFN we can approximate a sub-space containing D 1 2 QA arbitrarily close. Further, since we already showed that QA can approximate UΣ 1 2 M arbitrarily close, we can thus approximate D 1 2 UΣ 1 2 M arbitrarily close. First, we know from Lemma 18 that D 1 2 UΣ 1 2 M is adjacency-identifying. Further, we know from Lemma 9 that then D 1 2 QA is sufficiently adjacency-identifying, since D 1 2 QA can approximate D 1 2 UΣ 1 2 QA arbitrarily close. As a result, P has a sufficiently adjacency-identifying sub-space and is, thus, also sufficiently adjacency-identifying according to Lemma 19. Finally, for QD we need p 1, for QN and QA we need s 2 n. As a result, the above statements hold for d (2n + 1). This concludes the proof. Here, we show that SPE are sufficiently nodeand adjacency-identifying. We begin with useful lemma. Aligning Transformers with Weisfeiler Leman Lemma 21. Let G be a graph with graph Laplacian L and normalized graph Laplacian L. Then, if for a node-level PE Q with compact domain there exists matrices WQ and WK such that 1 dk (QWQ)(QWK)T = L, there exists a parameterization of the structural embedding P with Q as node-level PE such that P is sufficiently nodeand adjacency-identifying. Proof. Recall that structural embeddings with P as node-level PEs are defined as P(v) = FFN(deg(v) + Q(v)), for all v V (G). 
Since the domains of both deg and Q are compact, there exists parameters for both of these embeddings such that without loss of generality we may assume that for v V (G), deg(v) + Q(v) = deg(v) Q(v) Rd and that deg and Q are instead embedded into some smaller p-dimensional and s-dimensional sub-spaces, respectively, where it holds that d = p + s. Next, we choose the embedding deg(v) to be deg(v) = dv 0 . . . 0 Rp, where dv is the degree of node v and we write dv into the first component and pad the remaining vector with zeros to fit the target dimension p of deg. We will now use FFN to approximate the following function f, defined as f deg(v) Q(v) = deg(v) dv Q(v) . Note that a FFN can approximate such a function arbitrarily close since our domain is compact. Hence, we have that P(v) deg(v) dv Q(v) F < ϵ1, for all ϵ1 > 0. Further, in matrix form this becomes deg(v1) ... deg(vn) for all ϵ2 > 0, since left multiplication of a matrix by D 1 2 corresponds to an element-wise multiplication of the i-th row of Q with p dvi, the square-root of the degree of the i-th node vi in an arbitrary but fixed node ordering. Now, since we have that there exists matrices WQ and WK such that 1 dk (QWQ)(QWK)T = L, then, for the same matrices WQ and WK we know that 1 dk (D 1 2 QWQ)(D 1 2 QWK)T = D 1 2 LD 1 2 = L Finally, since P has a sub-matrix that can approximate D 1 2 Q arbitrarily close and since, by Lemma 15, D 1 2 Q is nodeand adjacency-identifying, P is sufficiently nodeand adjacency-identifying. This shows the statement. We now show Theorem 8. Theorem 22 (Theorem 8 in the main paper). Structural embeddings with SPE as node-level PE are sufficiently nodeand adjacency-identifying. Aligning Transformers with Weisfeiler Leman Proof. We begin by stating that the domain of SPE is compact, since eigenvectors are unit-norm and eigenvalues are bounded by twice the maximum node degree for graphs without self-loops (Anderson & Morley, 1985). Further, the domain of the structural embeddings is compact, since SPE is compact and the node degrees are finite and hence the domain of deg is compact. Let G be a graph with (normalized or non-normalized) graph Laplacian L. Let L = VΣVT be the eigendecomposition of L, where the i-th column of V contains the i-th eigenvector, denoted vi, and the i-th diagonal entry of Σ contains the i-th eigenvalue, denoted λi. Recall that structural embeddings with SPE as node-level PEs are defined as P(v) = FFN(deg(v) + SPE(v)), for all v V (G), where SPE(v) = ρ Vdiag(ϕ1(λ))VT . . . Vdiag(ϕn(λ))VT (v). To show node and adjacency identifiability, we first replace the neural networks ρ, ϕ1, . . . , ϕn in SPE with functions permutation equivariant functions g, h1, . . . , hn over the same domain and co-domain, respectively, and choose g and h1, . . . , hn such that the resulting encoding, which we call f SPE, is sufficiently nodeand adjacency-identifying. Then, we show that ρ and ϕ1, . . . , ϕn can approximate g and h1, . . . , hn arbitrarily close, which gives the proof statement. We first define, for all ℓ [n], such that λℓis the ℓ-th sorted eigenvalue where ties are broken arbitrarily3 and λℓis the ℓ-th entry of hℓ(λ). Note that due to the sorting, hℓis equivariant to permutations of the nodes. We then define the matrix Mi as i-th input to g in f SPE, for all i [n]. We have Mi := Vdiag(h1(λ))VT i . . . Vdiag(hn(λ))VT i j v1j h1(λ)j vij . . . P j v1j hn(λ)j vij ... P j vnj h1(λ)j vij . . . P j vnj hn(λ)j vij 1 2 1 vi1 . . . v1n λ 1 2n vin ... 1 2 1 vi1 . . . 
vnn λ where the last equality follows from the fact that hℓ(λ)j = 0 if and only if ℓ= j. Now, we choose a linear set function f, satisfying f({a x | x X}) = a f(X), for all a R and all X R, e.g., sum or mean. Since f is a function over sets, f is equivariant to the ordering of X. We now define the function g step-by-step. In g, we first apply f to the columns of Mi for all i [n]. To this end, let Mij denote the set of entries in the j-column of matrix Mi. Then, we define fi = f(Mi1) . . . f(Min) 1 2 1 vi1 f({vo1 | o [n]}) . . . λ 1 2n vin f({von | o [n]}) i , 3Note that despite eigenvalues with higher multiplicity, the output of hℓis the same, no matter how ties are broken. Aligning Transformers with Weisfeiler Leman We then multiply fi element-wise to each row of Mi and obtain a matrix 1 2 1 )2 (vi1)2 . . . v1n (λ 1 2 1 )2 (vin)2 1 2 1 )2 (vi1)2 . . . vnn (λ 1 2n)2 (vin)2 v11 λ1 (vi1)2 . . . v1n λn (vin)2 ... vn1 λ1 (vi1)2 . . . vnn λn (vin)2 Further, we divide each row of Mi element-wise by fi and obtain a matrix v11 f({vo1 | o [n]}) . . . v1n f({von | o [n]}) ... vn1 f({vo1 | o [n]}) . . . vnn f({von | o [n]}) Hence, we also need to ensure that f({voj | o [n]}) = 0, for all j [n]. Now, we define P = X and have that (PQT )k,l = X vkℓ λℓ f({voℓ| o [n]}) X i v2 iℓ vlℓ f({voℓ| o [n]}) ℓ vkℓ vlℓ λℓ, where we recall that the eigenvectors are normalized and hence, P i v2 iℓ= 1. In the last step of g, we return the matrix P Q Rn 2n. We will now show that this matrix is nodeand adjacency-identifying. To this end, note that (PQT )k,l = (VΣVT )k,l and hence, PQT = VΣVT = L. We distinguish between two cases. If L is the non-normalized graph Laplacian, then according to Lemma 16, the matrix P Q is nodeand adjacency-identifying and hence, according to Lemma 19, f SPE is nodeand adjacency-identifying. Further, if L is the normalized graph Laplacian, according to Lemma 21, f SPE is sufficiently nodeand adjacency-identifying. To summarize, we have shown that there exist functions g, h1, . . . , hn such that if, in SPE, we replace ρ with g and ϕℓwith hℓ, then the resulting encoding f SPE is sufficiently nodeand adjacency-identifying. To complete the proof, we will now show Aligning Transformers with Weisfeiler Leman that we can approximate g, h1, . . . , hn with ρ, ϕ1, . . . , ϕn arbitrarily close. Since our domain is compact, we can use ϕℓto approximate hℓarbitrarily close, i.e., we have that ϕℓ hℓ F < ϵ, for all ϵ > 0. Further, g consists of a sequence of permutation equivariant steps and is thus, permutation equivariant. Since the domain of g is compact and, by construction, g and ρ have the same domain and co-domain, and since ρ can approximate any permutation equivariant function on its domain and co-domain arbitrarily close, we have that ρ can also approximate g arbitrarily close. As a result, SPE can approximate a sufficiently nodeand adjacency-identifying encoding arbitrarily close, and hence, by definition, we have that SPE is also sufficiently nodeand adjacency-identifying. This completes the proof. G. Learnable embeddings Here, we pay more attention to learnable embeddings, which play an important role throughout our work. Although our definition of learnable embeddings is fairly abstract, learnable embeddings are very commonly used in practice, e.g., in tokenizers of language models (Vaswani et al., 2017). Since many components of graphs, such as the node labels, edge types, or node degrees are discrete, learnable embeddings are useful to embed these discrete features into an embedding space. 
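The following is a minimal sketch (ours, not the authors' code) of such learnable embeddings for discrete graph features in PyTorch: each discrete value indexes a trainable row of an embedding table, and the resulting vectors can be joined either by summation (fixed dimension) or by concatenation (growing dimension), which is the trade-off discussed next.

```python
import torch
import torch.nn as nn

# Minimal sketch of learnable embeddings for discrete graph features (illustrative only):
# each discrete value indexes a trainable row of an embedding table.
d = 16
node_label_emb = nn.Embedding(num_embeddings=10, embedding_dim=d)   # e.g., atom types
degree_emb = nn.Embedding(num_embeddings=32, embedding_dim=d)       # e.g., node degrees

node_labels = torch.tensor([0, 3, 3, 7])   # four nodes with discrete labels
degrees = torch.tensor([2, 1, 3, 2])

# Joining by sum keeps the dimension d fixed ...
summed = node_label_emb(node_labels) + degree_emb(degrees)                             # (4, d)
# ... while concatenation grows the dimension with every additional embedding.
concatenated = torch.cat([node_label_emb(node_labels), degree_emb(degrees)], dim=-1)   # (4, 2d)
```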
In our work, different embeddings are typically joined via sum, e.g., in Equation (1) or Equation (4). Summing embeddings is much more convenient in practice since every embedding has the same dimension d. In contrast, joining embeddings via concatenation can lead to very large d, unless the underlying embeddings are very low-dimensional. However, in our theorems joining embeddings via concatenation is much more convenient. To bridge theory and practice in this regard, in what follows we establish under what conditions summed embeddings can express concatenated embeddings and vice versa. Specifically, we show that summed embeddings can still act as if concatenating lower-dimensional embeddings but with more flexibility in terms of joining different embeddings. The core idea is that we show under which operations one may replace a learnable embedding e to Rd with a lower-dimensional learnable embedding e to Rd while preserving the injectivity of e. This property is useful since it allows us to, for example, add two learnable embeddings without losing expressivity. As a result, we can design a tokenizer that is practically useful without sacrificing theoretical guarantees. We begin with a projection of a concatenation of learnable embeddings. Lemma 23 (Projection of learnable embeddings). Let e 1, . . . , e k be learnable embeddings from N to Rp for some p 1. If d k then, there exists learnable embeddings e1, . . . , ek as well as a projection matrix W Rd k k p such that for v1, . . . , vk N ei(vi) k i=1W = e i(vi) k i=1 Rk p, where it holds for all v, w N and all i {1, . . . , k} that ei(v) = ei(w) e i(v) = e i(w). Proof. For a learnable embedding e and some v N, we denote with e(v)j the j-th element of the vector e(v). Since learnable embeddings are from N to Rd, for every i we define ei such that for every v N, ei(v)1 = e i(v) and ei(v)j = 0 for j > 0. Now, we define W as follows. For row l and column j, we set Wlj = 1 if (l 1) is a multiple of d, i.e., l = 1, (d + 1), 2(d + 1), . . . . Otherwise we set Wlj = 0. Intuitively, applying W to ei(vi) k i=1 will select the first, d-th, 2d-th, . . . , kd-th, element of ei(vi) k i=1, corresponding to v1, . . . , vk, respectively, according to our construction of ei. Now, we have that ei(vi) k i=1W = e 1(v1) . . . e k(vk) = e i(vi) k i=1. Further we have by construction that for all v, w N and all i {1, . . . , k} ei(v) = ei(w) e i(v) = e i(w). This shows the statement. Aligning Transformers with Weisfeiler Leman Next, we turn to a composition of learnable embeddings, namely the structural embeddings P, as defined in Section 3.1. In particular, we show next that a projection of multiple structural embeddings preserves nodeand adjacency identifiability of the individual structural embeddings in the new embedding space. Lemma 24 (Projection of structural embeddings). Let G be a graph with n nodes and let P Rn d be structural embeddings for G. Let k > 1. If d k (2n + 1) then, there exists a parameterization of P as well as a projection matrix W Rk d k (2n+1) such that for any v1, . . . , vk V (G), P(vi) k i=1W = P (vi) k i=1 Rn d, where P Rn d is also a structural embedding for G and it holds that if P is sufficiently nodeand adjacency-identifying, then so is P . Proof. We know from Theorem 20 that there exists sufficiently nodeand adjacency-identifying structural embeddings P in Rn (2d+1). Since d k (2n + 1), we define P = P 0 , where 0 Rn (k 1) (2n+1) is an all-zero matrix. 
Since P has a sufficiently nodeand adjacency-identifying subspace P , by Lemma 19, P is sufficiently nodeand adjacency-identifying. Further, by setting W = In k (2n+1), where In k (2n+1) are the first n rows of the k (2n + 1)-dimensional identity matrix and 0 Rn (k 1) (2n+1) is again an all-zero matrix. Then, we have that for nodes v1, . . . , vk V (G), P(vi) k i=1W = P (vi) k i=1 Rn d, and P is by definition sufficiently nodeand adjacency-identifying, from which the statement follows. We continue by showing that our construction of token embeddings in Equation (1) can be equally well represented by another form that will be easier to use in proving Theorem 2. Lemma 25. Let G be a graph with node features F Rn d. For every tokenization X(0,1) Rd according to Equation (1) with (sufficiently) adjacency reconstructing structural embeddings P(v), there exists a parameterization of X(0,1) such that for node v V (G) as X(0,1)(v) = F (v) 0 deg (v) P (v) , with P (v) = FFN (deg (v) + PE(v)), such that F (v) Rp, 0 Rp, deg : N Rr and FFN : Rd Rs for d = 2p + r + s and where it holds for every v, w V (G) that F(v) = F(w) F (v) = F (w) and deg(v) = deg(w) deg (v) = deg (w) and P(v) = P(w) P (v) = P (w) and P (v) is (sufficiently) adjacency-identifying. Proof. We set p = (d (2n + 1))/2 1, r = 2 and s = (2n + 1). Indeed, we have that 2p + r + s = 2((d (2n + 1))/2 1) + 2 + (2n + 1) = d (2n + 1) + (2n + 1) = d. We continue by noting that the node features F are mapped from N to Rd and can be obtained from a learnable embedding. More importantly, we can define F (v) = ℓ(v) 0 . . . 0 , Aligning Transformers with Weisfeiler Leman where F (v) Rd is another learnable embedding for which it holds that F(v) = F(w) F (v) = F (w), for all v, w V (G). Further, we can write deg(v) + PE(v) = deg (v) PE(v) , where deg is another learnable embedding for the node degrees mapping to Rr. Moreover, we know from Theorem 20 that there exist (sufficiently) adjacency-identifying structural embeddings of dimension (2n + 1). We choose the FFN in Equation (2) such that P(v) = 0 deg (v) P , where 0 R2p is an all-zero vector and P = FFN (deg (v) + PE(v)), with FFN : Rd Rs, another FFN inside of FFN. Note that such a FFN exists without approximation. Further according to Lemma 24, P is still (sufficiently) adjacency-identifying. Hence, we can write X(0,1)(v) = F (v) + P (v) = F (v) 0 deg (v) FFN (deg (v) + PE(v)) , where through the concatenation it is easy to verify that indeed X(0,1)(v) has the desired properties, namely that F(v) = F(w) F (v) = F (w) and deg(v) = deg(w) deg (v) = deg (w) and P(v) = P(w) P (v) = P (w) and P (v) is (sufficiently) adjacency-identifying. This shows the statement. Lastly, we show a similar result for Equation (4). Lemma 26. Let G be a graph with node features F Rn d. For every tokenization X(0,k) Rd according to Equation (4) with nodeand adjacency reconstructing structural embeddings P(v), there exists a parameterization of X(0,k) such that for k-tuple v = (v1, . . . , vk) V (G)k, X(0,k)(v) = F (v1) . . . F (vk) 0 deg (v1) . . . deg (vk) P (v1) . . . P (vk) atp (v) , with P (v) = FFN (deg (v) + PE(v)), such that F (v) Rp, 0 Rk2 p, deg : N Rr, atp : N Ro and P Rs for d = k2 p + k p + k r + k s + o and where it holds for every v, w V (G), F (v) = F(w) F (v) = F (w) and deg(v) = deg(w) deg (v) = deg (w) and P(v) = P(w) P (v) = P (w) and P (v) is node and adjacency-identifying and for every v, w V (G)k atp(v) = atp(w) atp (v) = atp (w). Aligning Transformers with Weisfeiler Leman Proof. 
We set p = 1, r = 2, o = 1 and s = d k2 3k 1 k . Indeed, we have that k2 p + k p + k r + k s + o = k2 + 3k + d k2 3k 1 + 1 = d. Recall Equation (4) as X(0,k)(v) = F(vi) k i=1WF + P(vi) k i=1WP + atp(v), where WF Rd k d and WP Rd k d are projection matrices, P are structural embeddings and v = (v1, . . . , vk) is a k-tuple. We continue by noting that the node features F are mapped from N to Rd and can be obtained from a learnable embedding. More importantly, we can define F (v) = ℓ(v) for which it clearly holds that F(v) = F(w) F (v) = F (w), for all v, w V (G). Invoking Lemma 23, there exists learnable embeddings F and a projection matrix WF, Rk d k p such that F(vi) k i=1WF = F (v1) . . . F (vk) Rk p. Note that since we do not have an assumption on d, we can make d arbitrarily large to ensure that d k. Hence, by defining WF = WF, 0 , where 0 Rk d (d k p) is an all-zero matrix, we can write F(vi) k i=1WF = F (v1) . . . F (vk) 0 Rd, where 0 Rd k p is an all-zero vector. Moreover, we know from Theorem 7 that there exists adjacency-identifying structural embeddings of dimension (2n + 1). We choose the FFN in Equation (2) such that P(v) = 0 deg (v) P (v) , where 0 Rk2 p+k p is an all-zero vector and P (v) = FFN (deg (v) + PE(v)), with FFN : Rd Rs, another FFN inside of FFN. Note that such a FFN exists without approximation. Further, according to Lemma 24, P is still sufficiently adjacency-identifying, and there exists a projection matrix WP, Rk d d o such that P(vi) k i=1WP, = 0 deg (v1) P (v1) . . . 0 deg (vk) P (vk) Rd o. But then, there also exists a permutation matrix Q {0, 1}d o d o such that P(vi) k i=1WP, Q = 0 deg (v1) . . . deg (vk) P (v1) . . . P (vk) Rd o, where we simply re-order the entries of P(vi) k i=1WP, . Note that now, 0 Rk2 p+k p, as we grouped the zero-vectors of P(vi) k i=1WP, into a single zero-vector of size k2 p + k p. Note that since we do not have an assumption on d, we can make d arbitrarily large to ensure that d k (2n + 1). Hence, by defining WP = WP, Q 0 , where 0 Ro is an all-zero vector, we can write P(vi) k i=1WP = 0 deg (v1) . . . deg (vk) P (v1) . . . P (vk) 0 Rd, Aligning Transformers with Weisfeiler Leman where P (vi) Rs. Further, since P is by assumption sufficiently nodeand adjacency-identifying, Lemma 24 guarantees that then so is P . Finally, for tuple v V (G)k, we set atp(v) = 0 atp (v) , where atp (v) = atp(v) and we recall that atp maps to N and 0 Rd 1 is an all-zero vector. Then, we can write X(0,k)(v) = F(vi) k i=1WF + P(vi) k i=1WP + atp(v) = F (v1) . . . F (vk) 0 deg (v1) . . . deg (vk) P (v1) . . . P (vk) atp (v) , where it holds for every v, w V (G), F (v) = F(w) F (v) = F (w) and deg(v) = deg(w) deg (v) = deg (w) and P(v) = P(w) P (v) = P (w) where P (v) is node and adjacency-identifying and for every v, w V (G)k atp(v) = atp(w) atp (v) = atp (w). This shows the statement. H. Expressivity of 1-GT Here, we prove Theorem 2 in the main paper. We first restate Theorem VIII.4 of (Grohe, 2021), showing that a simple GNN can simulate the 1-WL. Lemma 27 (Grohe (2021), Theorem VIII.4). Let G = (V (G), E(G), ℓ) be a graph with n nodes with adjacency matrix A(G) and node feature matrix X(0) := F Rn d consistent with ℓ. Further, assume a GNN that for each layer, t > 0, updates the vertex feature matrix X(t) := FFN X(t 1) + 2A(G)X(t 1) . Then, for all t 0, there exists a parameterization of FFN such that C1 t (v) = C1 t (w) X(t)(v) = X(t)(w), for all vertices v, w V (G). We now give the proof for a slight generalization of Theorem 2. 
Specifically, we relax the adjacency-identifying condition to sufficiently adjacency-identifying. Theorem 28 (Generalization of Theorem 2 in main paper). Let G = (V (G), E(G), ℓ) be a labeled graph with n nodes and F Rn d be a node feature matrix consistent with ℓ. Let C1 t denote the coloring function of the 1-WL at iteration t. Then, for all iterations t 0, there exists a parametrization of the 1-GT with sufficiently adjacency-identifying structural embeddings, such that C1 t (v) = C1 t (w) X(t,1)(v) = X(t,1)(w), for all nodes v, w V (G). Proof. According to Lemma 25, there exists a parameterization of X(0,1) such that for each v V (G), X(0,1)(v) = F (v) + P (v) = F (v) 0 deg (v) P (v) , Aligning Transformers with Weisfeiler Leman where it holds that F(v) = F(w) F (v) = F (w) (17) and deg(v) = deg(w) deg (v) = deg (w) and P(v) = P(w) P (v) = P (w) and P (v) is adjacency-identifying. Further, F (v), 0 Rp, deg (v) Rr and P (v) Rs, for some choice of p, r, s where d = 2p + r + s, specified in Lemma 25. We will use this parameterization for X(0,1) throughout the rest of the proof. We will now prove the statement by induction over t. Since by definition C1 0(v) = C1 0(w) F(v) = F(w), the base case follows from Equation (17). We let F(t)(v) denote the representation of the color of node v at iteration t. Initially, we set F(t)(v) = F (v). Further, we define the matrix Demb Rn r such that for the i-th row of Demb,i = deg (vi), where vi is the i-th node in a fixed but arbitrary node ordering. Hence, we can write X(0,1)(v) = F(0)(v) 0 deg (v) P (v) , for all v V (G) and in matrix form X(0,1) = F(0) 0 Demb P . Now, assume that the statement holds up to iteration t 1. For the induction, we show C1 t (v) = C1 t (w) F(t)(v) = F(t)(w). (18) That is, we compute the 1-WL-equivalent aggregation using the node color representations F(t) and use the remaining columns in X(0,1) to keep track of degree and structural embeddings. Clearly, if Equation (18) holds for all t, then the statement follows. Thereto, we aim to update the vertex representations such that X(t,1) = F(t) 0 Demb P , (19) where, following Lemma 27, F(t) := FFN F(t 1) + 2A(G)F(t 1) . (20) Hence, it remains to show that our graph transformer layer can update vertex representations according to Equation (19). To this end, we will require only a single transformer head in each layer. Specifically, we want to compute h1(X(t 1,1)) = 0 D 1A(G))F(t 1) 0 0 , (21) where D 1 denotes the inverse of the degree matrix and I denotes the identity matrix with n rows. Note that since we only have one head, the head dimension dv = d. We begin by re-stating Equation (8) with expanded sub-matrices and then derive the instances necessary to obtain the head Equation (21). We re-state projection weights WQ and WK with expanded sub-matrices as WQ 1 WQ 2 WQ 3 WQ 4 WK 1 WK 2 WK 3 WK 4 Aligning Transformers with Weisfeiler Leman where WQ 1 , WK 1 Rp d, WQ 2 , WK 2 Rp d, WQ 3 , WK 3 Rr d, WQ 4 , WK 4 Rs d. We then define Z(t 1) := 1 dk (X(t 1,1)WQ)(X(t 1,1)WK)T = 1 dk (X(t 1,1) WQ 1 WQ 2 WQ 3 WQ 4 WK 1 WK 2 WK 3 WK 4 where X(t 1,1) = F(t 1) 0 Demb P , by the induction hypothesis. By setting WQ 1 , WQ 2 , WQ 3 , WK 1 , WK 2 , WK 3 to zero, we have Z(t 1) = 1 dk (P WQ 4 )(P WK 4 )T . Now, let P := 1 dk (P WQ P )(P WK P )T , for some weight matrices WQ P and WK P . 
In the following, for clarity of presentation, we assume $\mathrm{softmax}(Z^{(t-1)}) = D^{-1}A(G)$, although we only approximate it arbitrarily closely. However, by choosing $\varepsilon$ small enough, we can still approximate the matrix $X^{(t,1)}$, see below, arbitrarily closely. We now again expand the sub-matrices of $W^V$ and express Equation (8) as
$$h_1(X^{(t-1,1)}) = \mathrm{softmax}\bigl(Z^{(t-1)}\bigr)\, X^{(t-1,1)} \begin{bmatrix} W^V_1 \\ W^V_2 \\ W^V_3 \\ W^V_4 \end{bmatrix},$$
where $W^V_1 \in \mathbb{R}^{p \times d}$, $W^V_2 \in \mathbb{R}^{p \times d}$, $W^V_3 \in \mathbb{R}^{r \times d}$, and $W^V_4 \in \mathbb{R}^{s \times d}$. By setting
$$W^V_1 = \bigl[\mathbf{0} \;\; I \;\; \mathbf{0} \;\; \mathbf{0}\bigr], \qquad W^V_2 = W^V_3 = W^V_4 = \mathbf{0},$$
where $I \in \mathbb{R}^{p \times p}$, we can approximate
$$h_1(X^{(t-1,1)}) = \bigl[\mathbf{0} \;\; D^{-1}A(G)F^{(t-1)} \;\; \mathbf{0} \;\; \mathbf{0}\bigr]$$
arbitrarily closely.

We now conclude our proof as follows. Recall that the transformer computes the final representation $X^{(t,1)}$ as
$$X^{(t,1)} = \mathrm{FFN}_{\mathrm{final}}\Bigl(X^{(t-1,1)} + h_1(X^{(t-1,1)})W^O\Bigr) = \mathrm{FFN}_{\mathrm{final}}\Bigl(\bigl[F^{(t-1)} \;\; \mathbf{0} \;\; D_{\mathrm{emb}} \;\; P'\bigr] + \bigl[\mathbf{0} \;\; D^{-1}A(G)F^{(t-1)} \;\; \mathbf{0} \;\; \mathbf{0}\bigr]W^O\Bigr) \overset{W^O := I}{=} \mathrm{FFN}_{\mathrm{final}}\Bigl(\bigl[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\bigr]\Bigr),$$
for some $\mathrm{FFN}_{\mathrm{final}}$. We now show that there exists an $\mathrm{FFN}_{\mathrm{final}}$ such that
$$X^{(t,1)} = \mathrm{FFN}_{\mathrm{final}}\Bigl(\bigl[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\bigr]\Bigr) = \bigl[F^{(t)} \;\; \mathbf{0} \;\; D_{\mathrm{emb}} \;\; P'\bigr],$$
which then implies the proof statement. To this end, we show that there exists a composition of functions $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg} \circ f_{\times 2}$ such that
$$\bigl(f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg} \circ f_{\times 2}\bigr)\Bigl(\bigl[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\bigr]\Bigr) = \bigl[F^{(t)} \;\; \mathbf{0} \;\; D_{\mathrm{emb}} \;\; P'\bigr],$$
where the functions are applied independently to each row. We denote
$$X^{(t-1)}_{\mathrm{upd}} := \bigl[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\bigr] \in \mathbb{R}^{n \times d}$$
and obtain, for node $v \in V(G)$,
$$X^{(t-1)}_{\mathrm{upd}}(v) = \Bigl[F^{(t-1)}(v) \;\; \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; \deg'(v) \;\; P'(v)\Bigr] \in \mathbb{R}^{d}.$$
We can now define $\deg'(v) = \bigl[|N(v)| \;\; 0\bigr] \in \mathbb{R}^{2}$. Since our domain is compact, there exist choices of $f_{\mathrm{FFN}}, f_{\mathrm{add}}, f_{\deg}, f_{\times 2}$ such that $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg} \circ f_{\times 2} \colon \mathbb{R}^{d} \to \mathbb{R}^{d}$ is continuous. As a result, $\mathrm{FFN}_{\mathrm{final}}$ can approximate $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg} \circ f_{\times 2}$ arbitrarily closely. Concretely, we define
$$f_{\times 2}\Bigl(\Bigl[F^{(t-1)}(v) \;\; \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr]\Bigr) = \Bigl[F^{(t-1)}(v) \;\; 2\textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr],$$
where $f_{\times 2}$ multiplies the second column with 2. Next, we define
$$f_{\deg}\Bigl(\Bigl[F^{(t-1)}(v) \;\; 2\textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr]\Bigr) = \Bigl[F^{(t-1)}(v) \;\; 2\Bigl(\textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w)\Bigr)|N(v)| \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr] = \Bigl[F^{(t-1)}(v) \;\; 2\textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr],$$
where $f_{\deg}$ multiplies the second column with the degree of node $v$ in the third column. Next, we define
$$f_{\mathrm{add}}\Bigl(\Bigl[F^{(t-1)}(v) \;\; 2\textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr]\Bigr) = \Bigl[F^{(t-1)}(v) + 2\textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; \mathbf{0} \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr],$$
where we add the second column to the first column and set the elements of the second column to zero. Finally, we define
$$f_{\mathrm{FFN}}\Bigl(\Bigl[F^{(t-1)}(v) + 2\textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; \mathbf{0} \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr]\Bigr) = \Bigl[\mathrm{FFN}\bigl(F^{(t-1)}(v) + 2\textstyle\sum_{w \in N(v)} F^{(t-1)}(w)\bigr) \;\; \mathbf{0} \;\; |N(v)| \;\; 0 \;\; P'(v)\Bigr] = \bigl[F^{(t)}(v) \;\; \mathbf{0} \;\; \deg'(v) \;\; P'(v)\bigr],$$
where FFN denotes the FFN in Equation (20), from which the last equality follows. Clearly, applying $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg} \circ f_{\times 2}$ to each row $i$ of $X^{(t-1)}_{\mathrm{upd}}$ results in
$$\bigl(f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg} \circ f_{\times 2}\bigr)\bigl(X^{(t-1)}_{\mathrm{upd}}\bigr) = \bigl[F^{(t)} \;\; \mathbf{0} \;\; D_{\mathrm{emb}} \;\; P'\bigr],$$
from which our proof follows.

I. Expressivity of higher-order transformers

Here, we prove Theorem 4, Theorem 5, and Theorem 6 in Section 3.2 in the main paper. To this end, we first construct new higher-order GNNs aligned with the k-WL, δ-k-WL, δ-k-LWL, and (k, s)-WL, respectively. We will use these higher-order GNNs as an intermediate step to show the expressivity results for the k-GT and (k, s)-GT.
I.1. Higher-order GNNs

Here, we generalize the GNN from Grohe (2021) to higher orders $k$. Higher-order GNNs with the same expressivity have been proposed in prior works by Azizian & Lelarge (2021) and Morris et al. (2019). However, our GNNs have a special form that can be computed by a transformer. To this end, we extend Theorem VIII.4 in Grohe (2021) to the k-WL. Specifically, for each $k > 1$ we devise neural architectures with k-WL expressivity. Afterwards, we show that transformers can simulate these architectures.

Formally, let $S \subset \mathbb{N}$ be a finite subset. First, we show that multisets over $S$ can be injectively mapped to a value in the interval $(0, 1)$, which is a variant of Lemma VIII.5 in Grohe (2021). Here, we outline a streamlined version of its proof, highlighting the key intuition behind representing multisets as $m$-ary numbers. Let $M \subseteq S$ be a multiset with multiplicities $a_1, \dots, a_k$ over $k$ distinct values. We define the order of the multiset as $\sum_{i=1}^{k} a_i$. We can write such a multiset as a sequence $x^{(1)}, \dots, x^{(l)}$, where $l$ is the order of the multiset. Note that the order of the sequence is arbitrary and that for $i \neq j$ it is possible to have $x^{(i)} = x^{(j)}$. We call such a sequence an $M$-sequence of length $l$. We now prove the following, a slight variation of Grohe (2021).

Lemma 29. For a finite $m \in \mathbb{N}$, let $M \subseteq S$ be a multiset of order $m - 1$ and let $x_i \in S$ denote the $i$-th number in a fixed but arbitrary ordering of $S$. Given a mapping $g \colon S \to (0, 1)$ with $g(x_i) := m^{-i}$, and an $M$-sequence of length $l$ given by $x^{(1)}, \dots, x^{(l)}$ with positions $i^{(1)}, \dots, i^{(l)}$ in $S$, the sum
$$\sum_{j \in [l]} g\bigl(x^{(j)}\bigr) = \sum_{j \in [l]} m^{-i^{(j)}}$$
is unique for every unique $M$.

Proof. By assumption, let $M \subseteq S$ denote a multiset of order $m - 1$. Further, let $x^{(1)}, \dots, x^{(l)} \in M$ be an $M$-sequence with positions $i^{(1)}, \dots, i^{(l)}$ in $S$. Given our fixed ordering of the numbers in $S$, we can equivalently write $M = ((a_1, x_1), \dots, (a_n, x_n))$, where $a_i$ denotes the multiplicity of the $i$-th number in $M$ with position $i$ in our ordering over $S$. Note that for a number $m^{-i}$ there exists a corresponding $m$-ary number $0.0\dots1\dots$, with the 1 at the $i$-th digit. Then the sum
$$\sum_{j \in [l]} g\bigl(x^{(j)}\bigr) = \sum_{j \in [l]} m^{-i^{(j)}} = \sum_{i \in S} a_i m^{-i} \in (0, 1),$$
and its $m$-ary representation reads $0.a_1 \dots a_n$. Note that $a_i = 0$ if and only if there exists no $j$ such that $i^{(j)} = i$. Since the order of $M$ is $m - 1$, it holds that $a_i < m$. Hence, it follows that the above sum is unique for each unique multiset $M$, implying the result.

Recall that $S \subset \mathbb{N}$ and that we fixed an arbitrary ordering over $S$. Intuitively, we use the finiteness of $S$ to map each number therein to a fixed digit of the numbers in $(0, 1)$. The finiteness of $m$ ensures that at each digit we have sufficient bandwidth to encode each $a_i$. Now that we have seen how to encode multisets over $S$ as numbers in $(0, 1)$, we review some fundamental operations on the $m$-ary numbers defined above. We will refer to a decimal number $m^{-i}$ as corresponding to the $m$-ary number whose $i$-th digit after the decimal point is 1 and whose remaining digits are 0, and vice versa. To begin with, addition between decimal numbers implements counting in $m$-ary notation, i.e., $m^{-i} + m^{-j}$ corresponds to an $m$-ary number with ones at digit positions $i$ and $j$ if $i \neq j$, and to an $m$-ary number with a 2 at digit position $i = j$ otherwise. We used counting in the proof of the previous result to represent the multiplicities of a multiset. Next, multiplication between decimal numbers implements shifting in $m$-ary notation, i.e., $m^{-i} \cdot m^{-j}$ corresponds to the $m$-ary number with a 1 at digit position $i + j$. Shifting further applies to general decimal numbers in $(0, 1)$: let $x \in (0, 1)$ correspond to an $m$-ary number with $l$ digits, $0.a_1 \dots a_l$. Then $m^{-i} \cdot x$ corresponds to $0.0 \dots 0\, a_1 \dots a_l$, with the digits $a_1, \dots, a_l$ now occupying positions $i+1, \dots, i+l$.
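The following toy check (exact rational arithmetic, ad hoc names, not the paper's code) illustrates Lemma 29 and the shifting operation as it will be used with the scalars $\beta_j$ in the proof of Proposition 30 below: summing $m^{-i}$ over a multiset is injective, and multiplying a second multiset's code by $m^{-C}$, where $C$ stands in for $|S|$, moves it into a disjoint digit range so that pairs of multisets are encoded injectively as well.

```python
# Toy check of the m-ary multiset encoding; C and max_order are illustrative choices.
from fractions import Fraction
from itertools import combinations_with_replacement

C = 3                      # number of possible colours, stand-in for |S|
max_order = 4              # multisets of order at most m - 1
m = max_order + 1

def encode(multiset):
    """Sum of m^{-i} over the multiset's colour indices (1-based positions)."""
    return sum(Fraction(1, m ** (i + 1)) for i in multiset)

# every multiset of order <= m - 1 over colours {0, 1, 2} gets a distinct code
multisets = [ms for l in range(1, m)
             for ms in combinations_with_replacement(range(C), l)]
codes = [encode(ms) for ms in multisets]
assert len(set(codes)) == len(codes), "collision: encoding is not injective"

# packing two multisets M1, M2 into one number via the shift beta = m^{-C}
beta = Fraction(1, m ** C)
packed = {(m1, m2): encode(m1) + beta * encode(m2)
          for m1 in multisets for m2 in multisets}
assert len(set(packed.values())) == len(packed), "collision after shifting"
print(f"{len(codes)} multisets and {len(packed)} multiset pairs encoded injectively")
```

Exact fractions are used so that the digit argument of the lemma is checked without floating-point error; the neural architectures below only need to approximate these values arbitrarily closely.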
We continue by deriving a neural architecture simulating the k-WL.

Proposition 30. Let $G = (V(G), E(G), \ell)$ be an $n$-order (node-)labeled graph. Then, for all $t \geq 1$, there exist a function $g^{(t)}$ and scalars $\beta_1, \dots, \beta_k$ with
$$g^{(t)}(u) := \mathrm{FFN}\Bigl(g^{(t-1)}(u) + \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} g^{(t-1)}\bigl(\phi_j(u, w)\bigr)\Bigr),$$
where $g^{(0)}(u)$ is initialized consistently with $\ell$ and the atomic type, such that for all $u, v \in V(G)^k$,
$$C^k_t(u) = C^k_t(v) \iff g^{(t)}(u) = g^{(t)}(v). \tag{22}$$

Proof. Before we start, let us recall the relabeling computed by the k-WL for a $k$-tuple $u$ as
$$C^k_t(u) := \mathrm{RELABEL}\Bigl(C^k_{t-1}(u), M^{(t)}_1(u), \dots, M^{(t)}_k(u)\Bigr), \quad \text{with} \quad M^{(t)}_j(u) := \{\{C^k_{t-1}(\phi_j(u, w)) \mid w \in V(G)\}\}.$$
To show our result, we show that there exist scalars $\beta_1, \dots, \beta_k$ such that the $m$-ary representations computed for $C^k_{t-1}(u)$ and $M^{(t)}_1(u), \dots, M^{(t)}_k(u)$ occupy pairwise non-colliding digit ranges. To this end, we show that a weighted sum can represent multiset counting in different exponent ranges of $m$-ary numbers in $(0, 1)$. We then simply invoke Lemma 29 to show that we can map each unique tuple $(C^k_{t-1}(u), M^{(t)}_1(u), \dots, M^{(t)}_k(u))$ to a unique number in $(0, 1)$. Finally, the FFN is responsible for the relabeling. Each possible color under $C^k_t$ is identified with a unique number in $[n^k]$, as the maximum possible number of unique colors under $C^k_t$ is $n^k$. We then fix an arbitrary ordering over $[n^k]$.

We show the statement via induction over $t$. Let $m > 0$ be such that $m - 1$ is the order of the multiset $M^{(0)}_j(u)$. Note that this is the same $m$ for each $u$. For a $k$-tuple $u$ with initial color $C^k_0(u)$ at position $i$, we choose $g^{(0)}$ so as to approximate $m^{-i}$ arbitrarily closely, i.e.,
$$\bigl\lVert g^{(0)}(u) - m^{-i} \bigr\rVert_F < \epsilon,$$
for an arbitrarily small $\epsilon > 0$. By choosing $\epsilon$ small enough, we have that $g^{(0)}(u)$ is unique for every unique position $i$ and hence Equation (22) holds for all pairs of $k$-tuples for $t = 0$. Note that by construction $i \leq n^k$ and hence, for a tuple $u$ at position $i$, the $m$-ary number corresponding to $m^{-i}$ is non-zero in at most the first $n^k$ digits.

For the inductive case, we assume that $C^k_{t-1}(u) = C^k_{t-1}(v) \iff g^{(t-1)}(u) = g^{(t-1)}(v)$. Further, we assume that for a tuple $u$ whose color at iteration $t-1$ has position $i$,
$$\bigl\lVert g^{(t-1)}(u) - m^{-i} \bigr\rVert_F < \epsilon,$$
for an arbitrarily small $\epsilon > 0$. Let now $u \in V(G)^k$. For $w \in V(G)$, we say that the colors of the $j$-neighbors $\phi_1(u, w), \dots, \phi_k(u, w)$ of $u$ have positions $l^{(1)}(w), \dots, l^{(k)}(w)$ in our ordering. Then,
$$\sum_{w \in V(G)} m^{-l^{(j)}(w)}$$
is unique for each unique
$$M^{(t)}_j(u) = \{\{C^k_{t-1}(\phi_j(u, w)) \mid w \in V(G)\}\}, \tag{23}$$
according to Lemma 29. We now obtain a unique value for the tuple $\bigl(M^{(t)}_1(u), \dots, M^{(t)}_k(u)\bigr)$ by setting $\beta_j := m^{-(n^k)j}$, for $j \in [k]$. Specifically, let $a_1, \dots, a_{n^k}$ denote the multiplicities of the multiset $M^{(t)}_j(u)$ and let $0.a_1 \dots a_{n^k}$ denote the $m$-ary number that corresponds to the sum
$$\sum_{w \in V(G)} m^{-l^{(j)}(w)}.$$
Note that since each $m^{-l^{(j)}(w)}$ corresponds to a color under the k-WL at iteration $t-1$, all digits after the $n^k$-th digit are zero. Then, multiplying the above sum with $\beta_j$ results in a shift in $m$-ary notation and, hence, the $m$-ary number that corresponds to the term
$$\beta_j \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{24}$$
can be written as
$$0.\underbrace{0 \dots 0}_{n^k j \text{ digits}} \underbrace{a_1 \dots a_{n^k}}_{\text{positions } n^k j + 1, \dots, n^k j + n^k} \underbrace{0 \dots 0}_{\text{positions } n^k (j+1) + 1, \dots}.$$
As a result, for all $l \neq j$, the non-zero digits of the $m$-ary number that corresponds to $\beta_l \sum_{w \in V(G)} m^{-l^{(l)}(w)}$ do not collide with the non-zero digits of the output of Equation (24) and hence, the sum
$$\sum_{j \in [k]} \beta_j \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{25}$$
is unique for each unique tuple $\bigl(M^{(t)}_1(u), \dots, M^{(t)}_k(u)\bigr)$. Now let $u$ be a tuple whose color at iteration $t-1$ has position $i$ in our fixed but arbitrary ordering. Then, $g^{(t-1)}(u)$ approximates the number $m^{-i}$ arbitrarily closely. Note that by the induction hypothesis, the $m$-ary number that corresponds to $m^{-i}$ is non-zero in at most the first $n^k$ digits, as $n^k$ is the maximum possible number of colors attainable under the k-WL. Since the smallest shift in Equation (25) is by $n^k$ (for $j = 1$) and since the $m$-ary number that corresponds to $m^{-i}$ is non-zero in at most the first $n^k$ digits, the sum of $m^{-i}$ and Equation (25) has no intersecting non-zero digits. As a result,
$$m^{-i} + \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{26}$$
is unique for each tuple $\bigl(C^k_{t-1}(u), M^{(t)}_1(u), \dots, M^{(t)}_k(u)\bigr)$. Now, by the induction hypothesis, we can approximate each $m^{-i}$ with $g^{(t-1)}(u)$ arbitrarily closely. Further, we can approximate each $m^{-l^{(j)}(w)}$ with $g^{(t-1)}(\phi_j(u, w))$ arbitrarily closely. As a result, we can approximate Equation (26) with
$$g^{(t-1)}(u) + \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} g^{(t-1)}\bigl(\phi_j(u, w)\bigr)$$
arbitrarily closely, i.e.,
$$\Bigl\lVert g^{(t-1)}(u) + \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} g^{(t-1)}\bigl(\phi_j(u, w)\bigr) - \Bigl(m^{-i} + \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} m^{-l^{(j)}(w)}\Bigr) \Bigr\rVert_F < \epsilon,$$
for an arbitrarily small $\epsilon$. Hence, by choosing $\epsilon$ small enough,
$$g^{(t-1)}(u) + \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} g^{(t-1)}\bigl(\phi_j(u, w)\bigr)$$
is also unique for each unique tuple $\bigl(C^k_{t-1}(u), M^{(t)}_1(u), \dots, M^{(t)}_k(u)\bigr)$. Finally, since for finite graphs of order $n$ there exists only a finite number of such tuples, there exists a continuous function mapping the output of Equation (26) to $m^{-i'}$, where $i'$ is the position of the color $C^k_t(u)$ in $[n^k]$, for all $u \in V(G)^k$. We approximate this function with FFN arbitrarily closely and obtain
$$\bigl\lVert g^{(t)}(u) - m^{-i'} \bigr\rVert < \epsilon,$$
for an arbitrarily small $\epsilon > 0$. This concludes the proof.

We finish with a few corollaries regarding variants of the k-WL.

Corollary 31. Let $G = (V(G), E(G), \ell)$ be an $n$-order (node-)labeled graph. Then, for each $k$-tuple $u := (u_1, \dots, u_k)$ and each $t > 0$, we can equivalently express the coloring of the δ-k-WL as
$$C^{\delta,k}_t(u) := \mathrm{RELABEL}\Bigl(C^{\delta,k}_{t-1}(u), \{\{C^{\delta,k}_{t-1}(\phi_1(u, w)) \mid w \in N(u_1)\}\}, \{\{C^{\delta,k}_{t-1}(\phi_1(u, w)) \mid w \in V(G) \setminus N(u_1)\}\}, \dots, \{\{C^{\delta,k}_{t-1}(\phi_k(u, w)) \mid w \in N(u_k)\}\}, \{\{C^{\delta,k}_{t-1}(\phi_k(u, w)) \mid w \in V(G) \setminus N(u_k)\}\}\Bigr).$$

With the above statement, we can directly derive a δ-variant of Proposition 30. To this end, let $N_j(u)$ denote the set of vertices adjacent to the $j$-th node in $u$.

Corollary 32. Let $G = (V(G), E(G), \ell)$ be an $n$-order (vertex-)labeled graph and assume a vertex feature matrix $F \in \mathbb{R}^{n \times d}$ that is consistent with $\ell$. Then, for all $t \geq 1$, there exist a function $g^{(t)}$ and scalars $\alpha_1, \dots, \alpha_k, \beta_1, \dots, \beta_k$ with
$$g^{(t)}(u) := \mathrm{FFN}\Bigl(g^{(t-1)}(u) + \sum_{j \in [k]} \Bigl[\alpha_j \sum_{w \in N_j(u)} g^{(t-1)}\bigl(\phi_j(u, w)\bigr) + \beta_j \sum_{w \in V(G) \setminus N_j(u)} g^{(t-1)}\bigl(\phi_j(u, w)\bigr)\Bigr]\Bigr),$$
such that for all $u, v \in V(G)^k$,
$$C^{\delta,k}_t(u) = C^{\delta,k}_t(v) \iff g^{(t)}(u) = g^{(t)}(v). \tag{27}$$
The proof is the same as for Proposition 30. Further, we can also recover the δ-k-LWL variant of Proposition 30 by setting $\beta_j = 0$. Lastly, we again recover Proposition 30 by setting $\alpha_j = \beta_j$.
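For reference, the following compact sketch implements the δ-k-WL refinement exactly as restated in Corollary 31, for an unlabeled graph; it is our own illustrative code with ad hoc names, and it is the refinement that the neural architectures above, and the transformers below, are shown to simulate.

```python
# Plain delta-k-WL refinement over k-tuples; a reference sketch, not the paper's code.
from itertools import product

def delta_k_wl(adj, k, iterations):
    """adj: adjacency matrix as a list of 0/1 lists; returns colours of all k-tuples."""
    n = len(adj)
    tuples = list(product(range(n), repeat=k))
    # initial colour: atomic type of u, i.e. the ordered equality/adjacency pattern
    colors = {u: tuple((int(u[a] == u[b]), adj[u[a]][u[b]])
                       for a in range(k) for b in range(k)) for u in tuples}
    for _ in range(iterations):
        sigs = {}
        for u in tuples:
            sig = [colors[u]]
            for j in range(k):
                local = sorted(colors[u[:j] + (w,) + u[j + 1:]]
                               for w in range(n) if adj[u[j]][w])
                nonadj = sorted(colors[u[:j] + (w,) + u[j + 1:]]
                                for w in range(n) if not adj[u[j]][w])
                sig.append((tuple(local), tuple(nonadj)))
            sigs[u] = tuple(sig)
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {u: relabel[sigs[u]] for u in tuples}
    return colors

# usage: 2-tuples of a 4-cycle; diagonal tuples, edges and non-edges get distinct colours
cycle = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
colors = delta_k_wl(cycle, k=2, iterations=2)
print(sorted(set(colors.values())))
```

Setting the non-adjacent multiset aside recovers the δ-k-LWL, and merging the two multisets per position recovers the k-WL, mirroring the choices $\beta_j = 0$ and $\alpha_j = \beta_j$ above.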
I.2. Higher-order pure transformers

Here, we use the results from the previous section to show the expressivity results of Theorem 4, Theorem 5, and Theorem 6 in the main paper. In fact, we prove a slightly stronger result than Theorem 4, from which Theorem 5 and Theorem 6 then also follow directly.

Theorem 33 (Generalization of Theorem 4 in the main paper). Let $G = (V(G), E(G), \ell)$ be a labeled graph with $n$ nodes, let $k \geq 2$, and let $F \in \mathbb{R}^{n \times d}$ be a node feature matrix consistent with $\ell$. Let $C^k_t$ denote the coloring function of the k-WL, δ-k-WL, or δ-k-LWL at iteration $t$. Then, for all iterations $t \geq 0$, there exists a parametrization of the k-GT such that
$$C^k_t(v) = C^k_t(w) \iff X^{(t,k)}(v) = X^{(t,k)}(w),$$
for all $k$-tuples $v, w \in V(G)^k$. If $C^k_t$ is the coloring function of the k-WL, it suffices that the structural embeddings of the k-GT are sufficiently node-identifying. Otherwise, we require the structural embeddings of the k-GT to be both sufficiently node- and adjacency-identifying.

Proof. We begin by recalling Equation (4) as
$$X^{(0,k)}(v) := \bigl[F(v_i)\bigr]_{i=1}^{k} W^F + \bigl[P(v_i)\bigr]_{i=1}^{k} W^P + \mathrm{atp}(v),$$
where $P$ are structural embeddings and $\mathrm{atp}$ is a learnable embedding of the atomic type. We further know from Lemma 26 that we can write
$$X^{(0,k)}(v) = \bigl[F'(v_1) \;\cdots\; F'(v_k) \;\; \mathbf{0} \;\; \deg'(v_1) \;\cdots\; \deg'(v_k) \;\; P'(v_1) \;\cdots\; P'(v_k) \;\; \mathrm{atp}'(v)\bigr], \tag{28}$$
where it holds for every $v, w \in V(G)$ that $F(v) = F(w) \iff F'(v) = F'(w)$, $\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w)$, and $P(v) = P(w) \iff P'(v) = P'(w)$, where $P'(v)$ is sufficiently node- and adjacency-identifying, and for every $v, w \in V(G)^k$ that $\mathrm{atp}(v) = \mathrm{atp}(w) \iff \mathrm{atp}'(v) = \mathrm{atp}'(w)$. Further, $F'(v) \in \mathbb{R}^{p}$, $\mathbf{0} \in \mathbb{R}^{k^2 p}$, $\deg' \colon \mathbb{N} \to \mathbb{R}^{r}$, $\mathrm{atp}' \colon \mathbb{N} \to \mathbb{R}^{o}$ and $P' \in \mathbb{R}^{s}$, for some choice of $p, r, s, o$ with $d = k^2 p + kp + kr + ks + o$, as specified in Lemma 26. We will use this parameterization for $X^{(0,k)}$ throughout the rest of the proof.

We prove our statement by induction over the iteration $t$. For the base case, notice that the initial color of a tuple $v$ depends on the atomic type and the node labeling. In Equation (28), we encode the atomic type with $\mathrm{atp}'(v)$ and the node labels by concatenating the features $F'(v_1), \dots, F'(v_k)$ of the $k$ nodes $v_1, \dots, v_k$ in $v$. The concatenation of both node labels and atomic type is clearly injective, and so $C^k_0(v) = C^k_0(w) \iff X^{(0,k)}(v) = X^{(0,k)}(w)$ for any two $v, w \in V(G)^k$.

Before continuing with the induction, we define some additional notation. Throughout the induction, we denote the color representation of a tuple $v$ at iteration $t$ as $F^{(t)}(v)$. Further, we initially define $F^{(0)}(v) = \bigl[F'(v_i)\bigr]_{i=1}^{k}$, $\deg'(v) = \bigl[\deg'(v_i)\bigr]_{i=1}^{k}$ and $P'(v) = \bigl[P'(v_i)\bigr]_{i=1}^{k}$. Hence, we can rewrite
$$X^{(0,k)}(v) = \bigl[F^{(0)}(v) \;\; \mathbf{0} \;\; \deg'(v) \;\; P'(v) \;\; \mathrm{atp}'(v)\bigr].$$
For the induction step, we show
$$C^k_t(v) = C^k_t(w) \iff X^{(t,k)}(v) = X^{(t,k)}(w), \tag{29}$$
that is, we compute the k-WL-, δ-k-LWL-, or δ-k-WL-equivalent aggregation. Clearly, if Equation (29) holds for all $t$, then the proof statement follows. Thereto, we show that the standard transformer updates the representation of the tuple $u_i = (u_1, \dots, u_k)$, following Corollary 32, as
$$X^{(t,k)}(u_i) = \Bigl[\mathrm{FFN}\bigl(F^{(t-1)}(u_i) + H(X^{(t-1,k)}(u_i))\bigr) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Bigr], \tag{30}$$
$$H(X^{(t-1,k)}(u_i)) := \sum_{j \in [k]} \Bigl[\alpha_j \sum_{w \in N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr) + \beta_j \sum_{w \in V(G) \setminus N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr)\Bigr],$$
for $\alpha_1, \dots, \alpha_k, \beta_1, \dots, \beta_k \in \mathbb{R}$ that we pick such that $H(X^{(t-1,k)}(u_i))$ is equivalent to the neural architecture in Corollary 32. Note that we obtain the k-WL for $\alpha_j = \beta_j$ for all $j \in [k]$. Then,
$$H(X^{(t-1,k)}(u_i)) = \sum_{j \in [k]} \alpha_j \sum_{w \in V(G)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr).$$
Further, we obtain the δ-k-LWL for $\beta_j = 0$ for all $j \in [k]$. Then,
$$H(X^{(t-1,k)}(u_i)) = \sum_{j \in [k]} \alpha_j \sum_{w \in N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr).$$
Hence, it remains to show that the standard transformer layer can update tuple representations according to Equation (30). To this end, we will require $2k$ transformer heads $h^{1}_1, \dots, h^{1}_k, h^{-1}_1, \dots, h^{-1}_k$ in each layer. Specifically, in the first $k$ heads, we want to compute
$$h^{1}_j(X^{(t-1,k)}(u_i)) = \alpha_j \sum_{w \in N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr). \tag{31}$$
In the remaining $k$ heads, we want to compute
$$h^{-1}_j(X^{(t-1,k)}(u_i)) = \beta_j \sum_{w \in V(G) \setminus N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr). \tag{32}$$
In both of the above cases, for a head $h^{\gamma}_j$, $\gamma \in \{1, -1\}$ denotes the type of the head and $j \in [k]$ denotes the $j$-neighbors the head aggregates over. For the head dimension, we set $d_v = d$.

For each $j$, recall the definition of the standard transformer head at tuple $u_i$ at position $i$ as
$$h^{\gamma}_j(X^{(t-1,k)}(u_i)) = \mathrm{softmax}\Bigl(\bigl[Z^{(t-1,\gamma)}_{i1} \;\cdots\; Z^{(t-1,\gamma)}_{in^k}\bigr]\Bigr) \bigl[X^{(t-1,k)}(u_1) \;\cdots\; X^{(t-1,k)}(u_{n^k})\bigr]^{T} W^{V},$$
where
$$Z^{(t-1,\gamma)}_{il} := \frac{1}{\sqrt{d_k}} \bigl(X^{(t-1,k)}(u_i)W^{Q,\gamma}\bigr)\bigl(X^{(t-1,k)}(u_l)W^{K,\gamma}\bigr)^{T} \in \mathbb{R} \tag{33}$$
is the unnormalized attention score between tuples $u_i$ and $u_l$, with
$$X^{(t-1,k)}(u_i) = \bigl[F^{(t-1)}(u_i) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\bigr]$$
for all $i \in [n^k]$, and
$$W^{Q,\gamma} = \begin{bmatrix} W^Q_F \\ W^Q_Z \\ W^Q_D \\ W^{Q,\gamma}_P \\ W^Q_A \end{bmatrix}, \qquad W^{K,\gamma} = \begin{bmatrix} W^K_F \\ W^K_Z \\ W^K_D \\ W^{K,\gamma}_P \\ W^K_A \end{bmatrix}, \qquad W^{V} = \begin{bmatrix} W^V_F \\ W^V_Z \\ W^V_D \\ W^V_P \\ W^V_A \end{bmatrix},$$
where $F^{(t-1)}(u_i)$ is projected by $W^Q_F, W^K_F, W^V_F$, the zero block $\mathbf{0}$ is projected by $W^Q_Z, W^K_Z, W^V_Z$, $\deg'(u_i)$ is projected by $W^Q_D, W^K_D, W^V_D$, $P'(u_i)$ is projected by $W^{Q,\gamma}_P, W^{K,\gamma}_P, W^V_P$, and $\mathrm{atp}'(u_i)$ is projected by $W^Q_A, W^K_A, W^V_A$. Note that only the sub-matrices $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ differ for different $\gamma$.

We now specify the projection matrices $W^{Q,\gamma}$, $W^{K,\gamma}$ and $W^{V}$ in a way that allows the attention head $h^{\gamma}_j$ to dynamically recover the $j$-neighborhood adjacency as well as the adjacency between $j$-neighbors in the attention matrix. To this end, in heads $h^{1}_j$ and $h^{-1}_j$ we set
$$W^Q_F = W^K_F = \mathbf{0}, \quad W^Q_Z = W^K_Z = W^V_Z = \mathbf{0}, \quad W^Q_D = W^K_D = W^V_D = \mathbf{0}, \quad W^Q_A = W^K_A = W^V_A = \mathbf{0}.$$
The remaining non-zero sub-matrices are $W^V_F$, $W^{Q,\gamma}_P$, and $W^{K,\gamma}_P$, which we define next.

We begin by defining $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$. Specifically, we want to choose $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ such that the attention matrix in head $h^{\gamma}_j$ can approximate the generalized adjacency matrix $A^{(k,j,\gamma)}$ in Equation (11). To this end, we simply invoke Lemma 13, guaranteeing that there exist $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ such that for each $\varepsilon > 0$ and each $\gamma \in \{1, -1\}$,
$$\Bigl\lVert \mathrm{softmax}\Bigl(\bigl[Z^{(t-1,\gamma)}_{i1} \;\cdots\; Z^{(t-1,\gamma)}_{in^k}\bigr]\Bigr) - A^{(k,j,\gamma)}_{i} \Bigr\rVert_F < \varepsilon,$$
i.e., we can approximate the generalized adjacency matrix arbitrarily closely for each $k > 1$, $j \in [k]$ and $\gamma \in \{1, -1\}$. In the following, for clarity of presentation, we assume
$$\mathrm{softmax}\Bigl(\bigl[Z^{(t-1,\gamma)}_{i1} \;\cdots\; Z^{(t-1,\gamma)}_{in^k}\bigr]\Bigr) = A^{(k,j,\gamma)}_{i},$$
although we only approximate it arbitrarily closely. However, by choosing $\varepsilon$ small enough, we can still approximate the matrix $X^{(t,k)}$, see below, arbitrarily closely. We now set
$$W^{V} = \bigl[I^{kp}_{n} \;\; \mathbf{0}\bigr],$$
where $I^{kp}_{n}$ denotes the first $n$ rows of the $kp$-by-$kp$ identity matrix if $kp > n$, or else the first $kp$ columns of the $n$-by-$n$ identity matrix, and $\mathbf{0} \in \mathbb{R}^{n \times (d - kp)}$ is an all-zero matrix. The above yields
$$h^{\gamma}_j(X^{(t-1,k)}(u_i)) = \bigl[A^{(k,j,\gamma)}_{i1} \;\cdots\; A^{(k,j,\gamma)}_{in^k}\bigr] \begin{bmatrix} F^{(t-1)}(u_1) \\ \vdots \\ F^{(t-1)}(u_{n^k}) \end{bmatrix} = \begin{cases} \dfrac{1}{d_{ij}} \displaystyle\sum_{w \in N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr), & \gamma = 1, \\[2ex] \dfrac{1}{n - d_{ij}} \displaystyle\sum_{w \in V(G) \setminus N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr), & \gamma = -1, \end{cases}$$
with $d_{ij}$ the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$.
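To make the role of the attention matrix concrete, the following sketch builds, for $k = 2$ and $\gamma = 1$, a row-stochastic $j$-neighborhood matrix over all 2-tuples and checks that multiplying it with the tuple feature matrix reproduces the normalized local aggregation computed by the heads $h^{1}_j$ above. It is a plausible stand-in for the generalized adjacency matrix of Equation (11), not a verbatim implementation; indices are 0-based and all names are our own.

```python
# Row-stochastic j-neighbourhood aggregation over 2-tuples; an illustrative stand-in.
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

tuples = list(product(range(n), repeat=2))           # all 2-tuples, n^2 tokens
index = {t: i for i, t in enumerate(tuples)}

def generalized_adjacency(j):
    """Row-stochastic matrix over 2-tuples realizing the local j-neighbourhood (gamma = 1)."""
    M = np.zeros((n ** 2, n ** 2))
    for u in tuples:
        nbrs = [int(w) for w in np.nonzero(A[u[j]])[0]]   # neighbours of the j-th entry
        for w in nbrs:
            v = (w, u[1]) if j == 0 else (u[0], w)        # phi_j(u, w)
            M[index[u], index[v]] = 1.0 / len(nbrs)
    return M

rng = np.random.default_rng(1)
F = rng.normal(size=(n ** 2, 5))                          # toy tuple features F^(t-1)

for j in (0, 1):
    M = generalized_adjacency(j)
    u = (0, 2)                                            # spot-check one tuple
    nbrs = [int(w) for w in np.nonzero(A[u[j]])[0]]
    explicit = sum(F[index[(w, u[1]) if j == 0 else (u[0], w)]] for w in nbrs) / len(nbrs)
    assert np.allclose(M[index[u]] @ F, explicit)
print("row-stochastic j-neighbourhood aggregation matches the explicit normalized sum")
```

The $\gamma = -1$ case is analogous, with rows supported on the non-adjacent $j$-neighbors and normalized by $n - d_{ij}$.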
We now conclude our proof as follows. Recall that the standard transformer layer computes the final representation $X^{(t,k)}(u_i)$ as
$$X^{(t,k)}(u_i) = \mathrm{FFN}\Bigl(X^{(t-1,k)}(u_i) + \bigl[h^{1}_1(X^{(t-1,k)})(u_i) \;\cdots\; h^{1}_k(X^{(t-1,k)})(u_i) \;\; h^{-1}_1(X^{(t-1,k)})(u_i) \;\cdots\; h^{-1}_k(X^{(t-1,k)})(u_i)\bigr] W^{O}\Bigr).$$
We can now satisfy the requirements in Equation (31) and Equation (32) as follows. Recall that for the first $k$ heads we set $\gamma = 1$, and in the remaining $k$ heads we set $\gamma = -1$. Further, since we have $2k$ heads, $W^{O} \in \mathbb{R}^{2k d_v \times d}$. We set
$$W^{O}_{ij} = \begin{cases} \alpha_j, & kp < i = j \leq k + kp, \\ \beta_j, & k + kp < i = j \leq 2k + kp, \\ 0, & \text{else}, \end{cases}$$
i.e., we leave the first $kp$ diagonal elements zero and then fill the next $2k$ diagonal elements of $W^{O}$ with the $\alpha_j$ and $\beta_j$, respectively. All other elements are 0. We now obtain
$$\bigl[h^{1}_1(X^{(t-1,k)})(u_i) \;\cdots\; h^{-1}_k(X^{(t-1,k)})(u_i)\bigr] W^{O} = \Bigl[\mathbf{0} \;\; \tfrac{\alpha_1}{d_{i1}} \textstyle\sum_{w \in N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \tfrac{\alpha_k}{d_{ik}} \textstyle\sum_{w \in N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \tfrac{\beta_1}{n - d_{i1}} \textstyle\sum_{w \in V(G) \setminus N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \tfrac{\beta_k}{n - d_{ik}} \textstyle\sum_{w \in V(G) \setminus N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \mathbf{0}\Bigr],$$
where the first zero vector is in $\mathbb{R}^{kp}$ and the second zero vector is in $\mathbb{R}^{k(r+s)+o}$. We now define the vector
$$X(u_i) := X^{(t-1,k)}(u_i) + \bigl[h^{1}_1(X^{(t-1,k)})(u_i) \;\cdots\; h^{-1}_k(X^{(t-1,k)})(u_i)\bigr] W^{O} = \Bigl[F^{(t-1)}(u_i) \;\; \tfrac{\alpha_1}{d_{i1}} \textstyle\sum_{w \in N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \tfrac{\alpha_k}{d_{ik}} \textstyle\sum_{w \in N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \tfrac{\beta_1}{n - d_{i1}} \textstyle\sum_{w \in V(G) \setminus N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \tfrac{\beta_k}{n - d_{ik}} \textstyle\sum_{w \in V(G) \setminus N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Bigr].$$
To simplify terms, we additionally define
$$F_{\alpha}(u_i) := \Bigl[\tfrac{\alpha_1}{d_{i1}} \textstyle\sum_{w \in N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \tfrac{\alpha_k}{d_{ik}} \textstyle\sum_{w \in N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Bigr],$$
$$F_{\beta}(u_i) := \Bigl[\tfrac{\beta_1}{n - d_{i1}} \textstyle\sum_{w \in V(G) \setminus N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \tfrac{\beta_k}{n - d_{ik}} \textstyle\sum_{w \in V(G) \setminus N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Bigr],$$
and consequently we can write
$$X(u_i) = \bigl[F^{(t-1)}(u_i) \;\; F_{\alpha}(u_i) \;\; F_{\beta}(u_i) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\bigr].$$
We additionally define $\deg'(u_j) = \bigl[d_{ij} \;\; n - d_{ij}\bigr] \in \mathbb{R}^{2}$, where $d_{ij}$ is again the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$. Note that $X(u_i)$ represents all the information we feed into the final FFN. Specifically, we obtain the updated representation of tuple $u_i$ as $X^{(t,k)}(u_i) = \mathrm{FFN}(X(u_i))$. We now show that there exists an FFN such that
$$X^{(t,k)}(u_i) = \mathrm{FFN}\bigl(X(u_i)\bigr) = \bigl[F^{(t)}(u_i) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\bigr],$$
which then implies the proof statement.

It is worth pausing at this point to remind ourselves what each element in $X(u_i)$ represents. To this end, we use PyTorch-like array slicing, i.e., for a vector $x$, $x[a : b]$ corresponds to the sub-vector of length $b - a$, for $b > a$, containing the $(a+1)$-st to $b$-th element of $x$. E.g., for a vector $x = (x_1, x_2, x_3, x_4, x_5)^{T}$, $x[1 : 4] = (x_2, x_3, x_4)^{T}$. Now, we interpret $X(u_i)$ by its sub-vectors. Concretely,
1. $X(u_i)[0 : kp] \in \mathbb{R}^{kp}$ corresponds to the previous color representation $F^{(t-1)}(u_i)$ of tuple $u_i$.
2. $F_{\alpha}(u_i)[(j-1)kp : jkp] \in \mathbb{R}^{kp}$ corresponds to the degree-normalized sum over adjacent $j$-neighbors, $\tfrac{\alpha_j}{d_{ij}} \sum_{w \in N_j(u_i)} F^{(t-1)}(\phi_j(u_i, w))$.
3. $F_{\beta}(u_i)[(j-1)kp : jkp] \in \mathbb{R}^{kp}$ corresponds to the degree-normalized sum over non-adjacent $j$-neighbors, $\tfrac{\beta_j}{n - d_{ij}} \sum_{w \in V(G) \setminus N_j(u_i)} F^{(t-1)}(\phi_j(u_i, w))$.
4. $\deg'(u_i)[2(j-1)] = d_{ij}$, where $d_{ij}$ is the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$.
5. $\deg'(u_i)[2(j-1) + 1] = n - d_{ij}$.

To this end, we show that there exists a composition of functions $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg}$ such that
$$X^{(t,k)}(u_i) = \bigl(f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg}\bigr)\bigl(X(u_i)\bigr) = \bigl[F^{(t)}(u_i) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\bigr],$$
where the functions are applied independently to each row. Since our domain is compact, there exist choices of $f_{\mathrm{FFN}}, f_{\mathrm{add}}, f_{\deg}$ such that $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg} \colon \mathbb{R}^{d} \to \mathbb{R}^{d}$ is continuous. As a result, FFN can approximate $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg}$ arbitrarily closely.
Concretely, we define
$$f_{\deg}\bigl(X(u_i)\bigr) = \Bigl[F^{(t-1)}(u_i) \;\; d_{i1}\tfrac{\alpha_1}{d_{i1}} \textstyle\sum_{w \in N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; d_{ik}\tfrac{\alpha_k}{d_{ik}} \textstyle\sum_{w \in N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; (n - d_{i1})\tfrac{\beta_1}{n - d_{i1}} \textstyle\sum_{w \in V(G) \setminus N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; (n - d_{ik})\tfrac{\beta_k}{n - d_{ik}} \textstyle\sum_{w \in V(G) \setminus N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Bigr],$$
where, for each $j \in [k]$, $f_{\deg}$ multiplies
1. $F_{\alpha}(u_i)[(j-1)kp : jkp]$ with $\deg'(u_i)[2(j-1)]$, and
2. $F_{\beta}(u_i)[(j-1)kp : jkp]$ with $\deg'(u_i)[2(j-1) + 1]$.

Overloading the notation above, we now redefine
$$F_{\alpha}(u_i) := \Bigl[\alpha_1 \textstyle\sum_{w \in N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \alpha_k \textstyle\sum_{w \in N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Bigr],$$
$$F_{\beta}(u_i) := \Bigl[\beta_1 \textstyle\sum_{w \in V(G) \setminus N_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\cdots\; \beta_k \textstyle\sum_{w \in V(G) \setminus N_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Bigr],$$
i.e., without the degree normalization, and consequently we can write
$$f_{\deg}\bigl(X(u_i)\bigr) = \bigl[F^{(t-1)}(u_i) \;\; F_{\alpha}(u_i) \;\; F_{\beta}(u_i) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\bigr].$$
Next, we define
$$\bigl(f_{\mathrm{add}} \circ f_{\deg}\bigr)\bigl(X(u_i)\bigr) = \Bigl[F^{(t-1)}(u_i) + \textstyle\sum_{j \in [k]} \bigl(F_{\alpha}(u_i)[(j-1)kp : jkp] + F_{\beta}(u_i)[(j-1)kp : jkp]\bigr) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Bigr],$$
where $f_{\mathrm{add}}$ sums up $F^{(t-1)}(u_i)$ with $F_{\alpha}(u_i)[(j-1)kp : jkp]$ and $F_{\beta}(u_i)[(j-1)kp : jkp]$ for each $j$. Afterwards, $f_{\mathrm{add}}$ sets $f_{\deg}(X(u_i))[kp : 2k^2p + kp] = \mathbf{0} \in \mathbb{R}^{2k^2 p}$. Now, finally, note from the definition of $F_{\alpha}(u_i)$ and $F_{\beta}(u_i)$ that
$$\sum_{j \in [k]} \bigl(F_{\alpha}(u_i)[(j-1)kp : jkp] + F_{\beta}(u_i)[(j-1)kp : jkp]\bigr) = H\bigl(X^{(t-1,k)}(u_i)\bigr),$$
where we recall that we defined
$$H\bigl(X^{(t-1,k)}(u_i)\bigr) = \sum_{j \in [k]} \Bigl[\alpha_j \sum_{w \in N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr) + \beta_j \sum_{w \in V(G) \setminus N_j(u_i)} F^{(t-1)}\bigl(\phi_j(u_i, w)\bigr)\Bigr].$$
Hence, we have that
$$\bigl(f_{\mathrm{add}} \circ f_{\deg}\bigr)\bigl(X(u_i)\bigr) = \bigl[F^{(t-1)}(u_i) + H(X^{(t-1,k)}(u_i)) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\bigr].$$
Finally, we define
$$\bigl(f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\deg}\bigr)\bigl(X(u_i)\bigr) = \Bigl[\mathrm{FFN}\bigl(F^{(t-1)}(u_i) + H(X^{(t-1,k)}(u_i))\bigr) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Bigr] = \bigl[F^{(t)}(u_i) \;\; \mathbf{0} \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\bigr],$$
where FFN denotes the FFN in Equation (30), from which the last equality follows. Thank you for sticking around until the end. This completes the proof.

The above proof shows that the k-GT can simulate the δ-k-WL. Since the δ-k-WL is strictly more expressive than the k-WL (Morris et al., 2020), Theorem 5 follows directly from Theorem 4. Further, the above proof shows that the k-GT can simulate the δ-k-LWL. If we simulate the δ-k-LWL with the k-GT while only using token embeddings for tuples in $V(G)^k_s$, Theorem 6 follows directly.
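Returning to the post-processing step of the proof above, the following toy check makes the block bookkeeping of $f_{\deg}$ and $f_{\mathrm{add}}$ concrete: it assembles a row with the layout $[F,\, F_{\alpha}\text{-blocks},\, F_{\beta}\text{-blocks},\, \deg']$ (omitting $P'$ and $\mathrm{atp}'$), undoes the degree normalization, and sums the blocks onto $F$, recovering $F + H$ from Equation (30) before the final FFN is applied. Shapes and names are ad hoc assumptions of ours, not the paper's code.

```python
# Toy check of the f_deg / f_add block bookkeeping; ad hoc shapes, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
k, p, n = 2, 3, 7                     # toy sizes: tuple order, feature width, graph size
kp = k * p

F = rng.normal(size=kp)               # previous colour representation F^(t-1)(u_i)
alpha = rng.normal(size=k)            # the scalars alpha_j and beta_j
beta = rng.normal(size=k)
d = rng.integers(1, n, size=k)        # degrees d_ij of the nodes in the tuple
loc = [rng.normal(size=kp) for _ in range(k)]    # stand-ins for sum_{w in N_j} F(phi_j(u_i, w))
glob = [rng.normal(size=kp) for _ in range(k)]   # stand-ins for the non-adjacent sums

# the row X(u_i) as assembled in the proof: normalized, scaled aggregation blocks
f_alpha = [alpha[j] / d[j] * loc[j] for j in range(k)]
f_beta = [beta[j] / (n - d[j]) * glob[j] for j in range(k)]
deg_emb = np.concatenate([[d[j], n - d[j]] for j in range(k)])
row = np.concatenate([F, *f_alpha, *f_beta, deg_emb])    # (P' and atp' omitted here)

def f_deg(x):
    """Multiply the j-th alpha block by d_ij and the j-th beta block by n - d_ij."""
    x = x.copy()
    for j in range(k):
        dj, ndj = x[kp + 2 * k * kp + 2 * j], x[kp + 2 * k * kp + 2 * j + 1]
        x[kp + j * kp: kp + (j + 1) * kp] *= dj
        x[kp + k * kp + j * kp: kp + (k + j + 1) * kp] *= ndj
    return x

def f_add(x):
    """Sum the 2k de-normalized blocks onto the first block and zero them out."""
    x = x.copy()
    for j in range(2 * k):
        x[:kp] += x[kp + j * kp: kp + (j + 1) * kp]
    x[kp: kp + 2 * k * kp] = 0.0
    return x

out = f_add(f_deg(row))
H = sum(alpha[j] * loc[j] + beta[j] * glob[j] for j in range(k))
assert np.allclose(out[:kp], F + H)   # first block now equals F + H from Equation (30)
print("f_add(f_deg(X)) recovers F + H before the final FFN")
```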