# Expressivity-Preserving GNN Simulation

Fabian Jogl, Machine Learning Research Unit, Center for Artificial Intelligence and Machine Learning, TU Wien, Vienna, Austria, fabian.jogl@tuwien.ac.at
Maximilian Thiessen, Machine Learning Research Unit, TU Wien, Vienna, Austria, maximilian.thiessen@tuwien.ac.at
Thomas Gärtner, Machine Learning Research Unit, TU Wien, Vienna, Austria, thomas.gaertner@tuwien.ac.at

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

We systematically investigate graph transformations that enable standard message passing to simulate state-of-the-art graph neural networks (GNNs) without loss of expressivity. Using these, many state-of-the-art GNNs can be implemented with message passing operations from standard libraries, eliminating many sources of implementation issues and allowing for better code optimization. We distinguish between weak and strong simulation: weak simulation achieves the same expressivity only after several message passing steps, while strong simulation achieves this after every message passing step. Our contribution leads to a direct way to translate common operations of non-standard GNNs to graph transformations that allow for strong or weak simulation. Our empirical evaluation shows competitive predictive performance of message passing on transformed graphs for various molecular benchmark datasets, in several cases surpassing the original GNNs.

1 Introduction

We systematically investigate which variants of non-standard message passing graph neural networks can be implemented with standard message passing graph neural networks (MPNNs) on a transformed graph without losing expressivity. While MPNNs are a very popular type of graph neural network (GNN), they are not able to represent every function on the space of graphs [Xu et al., 2019, Morris et al., 2019] due to their limited expressivity: for every MPNN, there are pairs of non-isomorphic graphs that always have the same embedding. A common approach to create more expressive GNNs is to change the message passing function of MPNNs. If a GNN is more expressive than MPNNs because it adapts the message passing function, we call this non-standard message passing. Examples of this are message passing variants that operate on subgraphs [Frasca et al., 2022, Bevilacqua et al., 2021] or tuples of nodes [Morris et al., 2019, 2020, 2022]. It is known that some variants of non-standard message passing can be implemented with MPNNs on a transformed graph without losing expressivity, as demonstrated by Morris et al. [2019, 2020, 2022] and Qian et al. [2022]. We formalize this as an MPNN simulating a GNN: a GNN can be simulated if there exists a graph transformation such that combining an MPNN with that graph transformation is at least as expressive as the GNN. As simulation only requires graph transformations as pre-processing, it allows the use of off-the-shelf MPNN implementations, which simplifies and speeds up the implementation of GNNs. Another advantage is that simulation is programming-framework agnostic: only the graph needs to be transformed, independently of how the MPNN is implemented. Finally, simulation allows GNNs to be exchanged easily in existing workflows without requiring changes to the model. Despite the benefits of simulation, this approach has not been thoroughly investigated: there is currently no formal definition of simulation and it is not known which GNNs can be simulated.
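To make this workflow concrete, the following is a minimal sketch (ours, not the authors' implementation; their code is linked in Section 6) of what simulation looks like in practice, assuming PyTorch Geometric: all GNN-specific logic lives in a graph-to-graph pre-transform, while the model itself is an unmodified off-the-shelf MPNN. The transform `simulate_transform` is a hypothetical placeholder for a real graph transformation.

```python
# Sketch only: the graph transformation is ordinary pre-processing,
# the learner is a standard GIN-style MPNN from an existing library.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GINConv, global_add_pool


def simulate_transform(data: Data) -> Data:
    """Hypothetical graph transformation R∘T applied once as pre-processing.

    Here it only appends a placeholder node feature; a real structure-to-graph
    encoding would add vertices, edges, and edge features (see Section 3).
    """
    extra = torch.zeros(data.num_nodes, 1)
    data.x = extra if data.x is None else torch.cat([data.x.float(), extra], dim=-1)
    return data


class PlainMPNN(torch.nn.Module):
    """Off-the-shelf MPNN; it never needs to know which GNN it simulates."""

    def __init__(self, in_dim: int, hidden: int, out_dim: int, layers: int = 4):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        for i in range(layers):
            mlp = torch.nn.Sequential(
                torch.nn.Linear(in_dim if i == 0 else hidden, hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden, hidden),
            )
            self.convs.append(GINConv(mlp))
        self.readout = torch.nn.Linear(hidden, out_dim)

    def forward(self, x, edge_index, batch):
        for conv in self.convs:
            x = conv(x, edge_index).relu()
        return self.readout(global_add_pool(x, batch))
```

In PyTorch Geometric, such a transform would typically be passed as the `pre_transform` argument of a dataset class, so the transformation is computed once and cached rather than recomputed in every epoch.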
In this paper, we define simulation and provide sufficient criteria for non-standard message passing to be simulated. Related Work. Xu et al. [2019] and Morris et al. [2019] proved that MPNNs have limited expressivity. This lead to the development of new GNNs that have higher expressivity than MPNNs. Such GNNs often operate on structures that differ from graphs, for example (1) Morris et al. [2019, 2020, 2022] proposed GNNs operating on k-tuples of nodes, (2) Bodnar et al. [2021a,b] proposed GNNs on topological structures, and (3) Bevilacqua et al. [2021] proposed GNNs operating on subgraphs. The idea of implementing non-standard message passing by standard message passing dates back to at least Otto [1997] who showed that instead of performing the higher-order message passing of k-WL it is possible to use 1-WL (classical message passing) on a transformed structure [Grohe et al., 2021]. To the best of our knowledge, the first GNN that has been implemented through a graph transformation is k-GNN [Morris et al., 2019], which is a generalization from k-WL to GNNs. Already Morris et al. [2019] refer to this concept as simulation. In follow up work, Morris et al. [2020] and Morris et al. [2022] implemented similar GNNs with non-standard message passing as MPNNs together with graph transformations. Similarly, k-OSWL [Qian et al., 2022] was designed to utilize WL. However, none of these papers have formalized the idea of simulation for GNNs (or WL) explicitly. To this end, we [Jogl et al., 2022a,b] showed that CW Networks [Bodnar et al., 2021a], DS [Bevilacqua et al., 2021], DSS [Bevilacqua et al., 2021], and δ-k-GNNs/WL [Morris et al., 2020] can be implemented as an MPNN on a transformed graph. A similar high-level idea was proposed in a positional paper by Veliˇckovi c [2022]. In parallel, Hajij et al. [2023] introduced combinatorial complexes that generalize structures such as graphs or cell complexes and showed that computations on these complexes can be realized as message passing on graphs. While simulation has been used and advocated in the past there is no unified solution to obtaining graph transformations, theoretical justification, or thorough empirical evaluation. We fill this gap and find many GNNs can be simulated by MPNNs on transformed graphs. More details can be found in Appendix A and Appendix B. Our Approach. We introduce simulation of non-standard message passing, which formalizes the idea of implementing a GNN, which uses non-standard message passing, through an MPNN on a transformed graph. A message passing algorithm can be strongly simulated if an MPNN together with a graph transformation can achieve the same expressivity in every iteration of message passing. To prove that many GNNs can be strongly simulated, we define the class of augmented message passing (AMP) algorithms which contains many common GNNs. We prove that all AMP algorithms can be strongly simulated and present a meta algorithm to generate the necessary graph transformations. This allows us to show that eleven recently proposed GNNs can be strongly simulated. Specifically, the GNNs by Morris et al. [2019, 2020, 2022] and Qian et al. [2022] perform AMP. This gives an additional mathematical justification to their approach by proving that their GNNs can be implemented with MPNNs without losing expressivity over an implementation with non-standard message passing. 
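As an illustration of the kind of graph transformation involved, the following toy sketch (ours; the actual construction is recalled in Appendix B and underlies Corollary 3.6) builds a transformed graph for simulating 2-WL / 2-GNNs: vertices are 2-tuples of nodes, two tuples are connected if they differ in exactly one position j, and j is stored as an edge feature. Function and variable names are ours.

```python
# Toy construction of the 2-tuple graph on which standard message passing
# can mimic higher-order (2-WL-style) message passing.
from itertools import product


def two_tuple_graph(nodes, edges):
    edge_set = set(edges) | {(v, u) for u, v in edges}
    tuples = list(product(nodes, repeat=2))
    # Initial "atomic type" of a tuple: are its two entries equal / adjacent?
    feat = {t: (t[0] == t[1], t in edge_set) for t in tuples}
    new_edges = []
    for s, t in product(tuples, repeat=2):
        if s == t:
            continue
        diff = [j for j in range(2) if s[j] != t[j]]
        if len(diff) == 1:  # s and t are j-neighbours: they differ only at position j
            new_edges.append((s, t, {"j": diff[0]}))
    return tuples, feat, new_edges


verts, feats, rel = two_tuple_graph([0, 1, 2], [(0, 1), (1, 2)])
```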
Furthermore, we investigate three constructions that demonstrate the limits of strong simulation: time dependent neighborhoods, nested aggregations, and non-pairwise message passing. We prove that these constructions either cannot be strongly simulated efficiently or cannot be strongly simulated at all. However, if we are only interested in the MPNN achieving the same expressivity as the GNN with non-standard message passing after a sufficient number of layers, it is possible to implement all three constructions with an MPNN. We call this weak simulation: a message passing algorithm can be weakly simulated if there exists a non-negative integer ζ such that an MPNN together with a graph transformation can achieve the same expressivity as the message passing algorithm in one iteration after every ζ iterations of message passing. We show that Message Passing Simplicial Networks [Bodnar et al., 2021b], CW Networks [Bodnar et al., 2021a], DSS [Bevilacqua et al., 2021], and K-hop message passing [Feng et al., 2022] can be weakly simulated. Finally, we evaluate a representative set of graph transformation empirically. Our graph transformations lead to competitive performance compared to the simulated algorithms and often lead to more accurate predictions. Main Contributions. We introduce the concept of strong and weak simulation (Section 3) of message passing. This generalizes existing ideas behind the implementation of several GNNs [Morris et al., 2019, 2020, 2022, Qian et al., 2022]. We provide an automated way of proving that a GNN can be simulated and deriving the necessary graph transformations. We prove that there exist architectures that cannot be strongly simulated (Section 4) but only weakly simulated (Section 5). Our empirical evaluation (Section 6) demonstrates that simulation achieves competitive performance. 2 Background A graph G is a triple G = (V, E, F) consisting of a set of vertices V , a set of directed edges E {(x, y) | x, y V, x = y}, and a function F that assigns a feature to every vertex and edge. For a set of objects U and integer l > 0, we denote by 2U = {X | X U} the powerset of U, i.e., the set of all subsets of U, and by U l = {(u1, . . . , ul) | u1, . . . , ul U} the set of all tuples of length l built from U. In this paper we work on a generalization of graphs we call relational structures that use a generalized notion of neighborhood. For a set of objects U, we call any function N : U 2(U ℓ) a neighborhood function. A neighborhood function assigns to every object a set of tuples of length ℓwhich we call its neighbors. We use ℓ(N) to denote the length of these tuples. We call any tuple (w, u) a directed edge where u U and w N(u) (the in-neighbours) with ℓ(N) = 1. For an integer k > 0, we use [[k]] to denote the set {1, . . . , k} and {{ }} to denote a multiset. Definition 2.1. (Relational Structure) Let k 0 be an integer, U be a set of objects, and let N1, . . . , Nk be neighborhood functions over U. Furthermore, let F be a function such that F(u) assigns a feature to every u U and F i((x, u)) assigns a feature to every directed edge with i [[k]] and x Ni(u) with ℓ(Ni) = 1. Then, the tuple X = (U, N1, . . . Nk, F) is a relational structure. Figure 1: Left: graph. Center: regular cell complex built from the graph through a graph-to-structure encoding [Bodnar et al., 2021a]. Vertices correspond to 0-dimensional cells, edges to 1-dimensional cells ( yellow ) and induced cycles to 2-dimensional cells ( blue ). 
Right: a graph created by structure-to-graph encoding the regular cell complex to a graph. Vertices corresponds to cells as indicated by color. If it is clear from context from which neighborhood Ni an edge (x, u) is, we will simply write its features as F((x, u)) instead of F i((x, u)). If the number of neighborhoods is important we call such a structure a k-relational structure. Note, that a graph G = (V, E, F) is a special case of a 1relational structure with only one neighborhood NG(v) = {w | (w, v) EG}, i.e., G = (V, NG, F). For graphs G, H we refer to their vertices, edges, and features by adding the graph as a subscript. We say that G, H are isomorphic if there exists an edge and feature preserving bijective function α : VG VH between their vertices: for every vertex v VG it holds that FG(v) = FH(α(v)) and for every x, y VG, it holds that (x, y) EG if and only if (α(x), α(y)) EH with FG((x, y)) = FH((α(x), α(y))). No polynomial time algorithm is known which decides whether two graphs are isomorphic [Babai, 2016]. However, heuristics such as those based on colorings and color update functions allow us to distinguish many pairs of non-isomorphic graphs. Furthermore, they allow us to model the ability of GNNs to distinguish graphs [Xu et al., 2019, Morris et al., 2019]. For a relational structure over a set of objects U, a coloring is a function c : U χ where χ is a known set of colors. Suppose we have two disjoint sets U and U with two colorings c over U and c over U . Then, we define the joint coloring c c for every x (U U ) as c(x) if x U and c (x) otherwise. For two colorings c and d defined over the same set of objects U, we say that c refines d if for every pair of objects u, r U it holds that cu = cr du = dr. Let U, U be sets of objects with U U. Let c be a coloring of U and d a coloring of U . Then, c refines d if for every pair of objects u, r U it holds that cu = cr du = dr. Definition 2.2. (Color Update Function) Let k 1 be an integer and X a set of k-relational structures. A color update function takes any X = (U, N1, . . . , Nk, F) X, a coloring c of U, and a u U and outputs a new color ϕ(u, X, c) of u. We denote the new coloring of the whole set U as ϕ(X, c). We assume color update functions ϕ to be computable and denote the time to compute ϕ with respect to an input U as τϕ(|U|). Color update functions are used to iteratively update an initial coloring c0 such that ct+1 = ϕ(X, ct) for every t 0. We denote t applications of ϕ to compute ct as ϕt(X, c0). We also say that ϕ0(X, c0) = c0. The initial coloring is often given by the features c0 = F|U, meaning the color c0 is given by the domain restriction of F to the set of objects U. Then, the color of each object u U is c0 u = Fu. If we are not given features, then we use a constant coloring, i.e., a coloring that assigns the same (arbitrary) color to each object. Color update functions can sometimes be used to determine whether two graphs are isomorphic. For this, a coloring is computed for both graphs. If at some iteration the colorings of the two graphs are different this implies that the graphs are not isomorphic. If the colorings are never different, then the result of this method is inconclusive. A common color update function is the Weisfeiler-Leman algorithm (WL). In this work we define WL to use edge features. Definition 2.3. 
(WL) The Weisfeiler-Leman algorithm (WL) is a color update function operating on a graph G = (V, E, F) defined as

WL(v, G, c) = HASH(cv, {{(cu, F((u, v))) | u ∈ NG(v)}}) for v ∈ V,

where HASH is an injective mapping to colors. Message passing graph neural networks (MPNNs) can be seen as a generalization of WL where the colors correspond to learned node embeddings and each layer of the MPNN corresponds to a color update function. The color update function in layer t ≥ 1 of an MPNN is defined as

ϕ(v, G, c) = COMB^t(cv, AGG^t({{(cx, F((x, v))) | x ∈ NG(v)}})),

where COMB^t and AGG^t are learnable functions, e.g., multi-layer perceptrons (MLPs); here, we define MPNNs to use edge features, which is almost always the case in practice. We refer to WL and MPNNs as standard message passing. It has been shown that MPNNs are at most as expressive as WL, i.e., they can only distinguish non-isomorphic graphs that WL can distinguish. As there exist pairs of graphs that WL cannot distinguish, there also exist graphs that MPNNs cannot distinguish [Xu et al., 2019, Morris et al., 2019]. A common approach to improve the expressivity of standard message passing is to apply a function T that transforms graphs into other relational structures. We call this mapping T from a graph to a relational structure a graph-to-structure encoding. An example of a graph-to-structure encoding can be seen in Figure 1. This mapping T is combined with a color update function ϕ tailored to this new structure. We use non-standard message passing to refer to color update functions that operate on relational structures that are not graphs. Some examples of non-standard message passing are k-WL [Immerman and Lander, 1990], which operates on k-tuples of nodes, and CW Networks [Bodnar et al., 2021a], which operate on regular cell complexes.

3 Strong Simulation

We show that standard message passing together with graph transformations can achieve the same expressivity as many algorithms with non-standard message passing. As standard message passing operates on graphs, we need to map relational structures to graphs. Note that merely guaranteeing at least the same expressivity as a color update function ϕ in each iteration can easily be achieved by, e.g., using the final coloring of ϕ as node features. To avoid such a trivial solution, we enforce straightforward restrictions on the generated graphs.

Definition 3.1. (Structure-to-graph encoding) Let R be a mapping R : X ↦ G that maps relational structures X = (U, N1, . . . , Nk, F) to graphs G = (V, E, F'). We call R a structure-to-graph encoding if it can be written as R(X) = Rfeat(Rgraph(U, N1, . . . , Nk), F) such that 1. Rgraph maps a relational structure without features to a graph G = (V, E, Fgraph), and 2. Rfeat((V, E, Fgraph), F) = (V, E, F') creates the features F' by concatenating each (node or edge) feature from Fgraph with the corresponding feature from F if it exists.

As a graph-to-structure encoding T maps graphs to structures and a structure-to-graph encoding R maps structures to graphs, the composition R ∘ T maps graphs to graphs. We call such functions R ∘ T graph transformations; an example can be seen in Figure 1. As we define relational structures over sets of objects, this implies that R is permutation equivariant with respect to a permutation of the objects. Furthermore, since R is a function it is also deterministic. Next, we define strong simulation of color update functions.

Definition 3.2. (Strong Simulation) Let ϕ be a color update function. Let R be a structure-to-graph encoding that runs in O(τϕ(|U|)) for every relational structure with object set U and creates a graph with vertex set V ⊇ U. We consider two arbitrary relational structures from the domain of ϕ, say X = (U, N1, . . . , Nk, F) and X' = (U', N'1, . . . , N'k, F'). Let (V1, E1, F1) = R(X) and (V2, E2, F2) = R(X'). We say ϕ can be strongly simulated under R if for every t ≥ 0 it holds that

WL^t(R(X), F1|V1) ⊔ WL^t(R(X'), F2|V2) refines ϕ^t(X, F|U) ⊔ ϕ^t(X', F'|U').

Intuitively, a color update function ϕ can be strongly simulated if, instead of running t iterations of ϕ on a relational structure X, we can run WL on the graph R(X) for t iterations and achieve at least the same expressivity in every iteration. Strong simulation is a stronger property than expressivity: strong simulation implies being at least as expressive in every iteration, whereas expressivity usually refers only to the final coloring. This guarantees that WL on R(X) is always at least as expressive as ϕ and not just in some iteration. There exist MPNNs that are as expressive as WL in every iteration [Xu et al., 2019]. Thus, every layer of an MPNN can be as expressive as the corresponding iteration of WL. Note that as long as we restrict the size of the graphs (the number of vertices), strong simulation allows us to learn a color update function ϕ, e.g., a GNN, exactly. Indeed, if we combine each layer of an MPNN with a large enough MLP (with an appropriate activation), we can learn to map the MPNN's embeddings to the output of ϕ. This problem of fitting a finite number of points exactly is well studied and known as memorization; see, e.g., Yun et al. [2019] for an overview.

To strongly simulate a large number of GNNs we introduce a class of color update functions we call augmented message passing (AMP). AMP is designed to capture many GNNs and we prove that AMP can be strongly simulated. AMP makes use of a building block we call atoms. Atoms model operations of color update functions such as colors, constants, or features that are focused on single vertices or pairs of vertices. AMP extends atoms with more general operations that are not necessarily focused on a single vertex, such as function applications or aggregations over neighborhoods.

Definition 3.3. (Atom) Let X = (U, N1, . . . , Nk, F) be a relational structure. Let v, w ∈ U and let cv, cw be the colors of v and w. Then, an atom ρ(v, cv, w, cw) is a function that has exactly one of the following four forms: (A1) ρ(v, cv, w, cw) = k(v, w) for a constant k(v, w) that only depends on v and w but not on the colors, (A2) ρ(v, cv, w, cw) ∈ {F(v), F(w), F((w, v))}, where F((w, v)) is only part of this set if the edge (w, v) is part of at least one neighborhood of X, (A3) ρ(v, cv, w, cw) ∈ {cv, cw}, or (A4) ρ(v, cv, w, cw) = (ρ1(v, cv, w, cw), . . . , ρm(v, cv, w, cw)) with atoms ρi for all i ∈ [[m]].

Definition 3.4. (Augmented Message Passing) Let X = (U, N1, . . . , Nk, F) be a relational structure with ℓ(Ni) = 1 for all i ∈ [[k]]. We call a color update function ϕ augmented message passing (AMP) if for all u ∈ U and colorings c of X, it can recursively be defined as exactly one of the following four forms: (S1) ϕ(u, X, c) = ρ(u, cu, u, cu) where ρ is an atom, (S2) ϕ(u, X, c) = {{ρ(u, cu, w, cw) | w ∈ Ni(u)}} where ρ is any atom and i ∈ [[k]] is any neighborhood index, (S3) ϕ(u, X, c) = (ϕ1(u, X, c), . . . , ϕm(u, X, c)) where all ϕ1, . . .
, ϕm are AMP, or (S4) ϕ(u, X, c) = f (ϕ (u, X, c)) where f maps colors to colors and ϕ is AMP. Having defined AMP, we next prove that AMP can be strongly simulated by providing a structure-tograph encoding for every color update function from AMP. Full proofs are in Appendix C. Theorem 3.5. Augmented message passing can be strongly simulated. Proof sketch. We define augmented message encoding (AME) which returns a structure-to-graph encoding for every given AMP algorithm. Let ϕ be AMP that operates on a relational structure X = (U, N1, . . . , Nk, F). AMEϕ(X) returns a graph G with vertex set VG = U. Any constant value atom (A1) in ϕ and any feature in F is stored in the vertex or edge features of G. If messages are passed between two objects then the graph will have an edge between the corresponding vertices with an edge feature encoding the neighborhood. We prove that every AMP ϕ can be strongly simulated under the structure-to-graph encoding AMEϕ. AMP does not only contain variants of WL, but also GNN layers such as the layer of GSN [Bouritsas et al., 2022]. We say a GNN can be strongly simulated, if all of its layers correspond to color update functions that can be strongly simulated by the same structure-to-graph encoding R. This means Figure 2: Left: nested aggregations. Right: nonpairwise message passing of 2-tuples. In both examples the state of two vertices gets combined and sent to the blue vertex. Note that for non-pairwise aggregations we have an order on the nodes in a tuple whereas for nested aggregations the aggregated vertices are unordered. Figure 3: WL on the graph on the left (right) side weakly simulates the nested aggregations (non-pairwise message passing) from Figure 2 by adding additional vertices ( yellow ). This shows how weak simulation of both constructions is done in a similar way with the difference being that for non-pairwise aggregation we encode the order in the tuples on the edges. that for every relational structure X and every t 0, WL achieves at least the same expressivity in iteration t on R(X) as layer t of the GNN on X. We prove that this is always the case for color update functions that only differ in their use of function applications (S4) (see Appendix C.3). Different layers of a GNN commonly only differ by function applications (S4) as these correspond to learned functions. As many GNN layers use AMP, we can state one of the main results of this paper. Corollary 3.6. The following GNNs and variants of WL can be strongly simulated: 1. V VC-GNN [Sato et al., 2019] 7. k-OSWL / OSAN [Qian et al., 2022] 2. k-WL / GNN [Morris et al., 2019] 8. Mk GNN [Papp and Wattenhofer, 2022] 3. δ-k-(L)WL / (L)GNN [Morris et al., 2020] 9. GMP [Wijesinghe and Wang, 2022] 5. GSN-e and GSN-v [Bouritsas et al., 2022] 10. Shortest Path Networks [Abboud et al., 2022] 4. (k, s)-LWL / Speq Net [Morris et al., 2022] 11. Generalized Distance WL [Zhang et al., 2023] 6. DS-WL / GNN [Bevilacqua et al., 2021] Proof sketch. The proof consists of showing that the color update function underlying the nonstandard message passing is AMP. By Theorem 3.5 this implies that the color update function can be simulated. Runtime Complexity. AME creates a vertex for every object and edges for every pair of objects that exchange messages. This implies that, creating the encoding requires the same amount of time as one iteration of AMP. Furthermore, classical message passing on the transformed graphs passes the same number of messages as AMP in every iteration. 
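As a concrete illustration of this construction, here is a simplified sketch (ours; the full procedure is Algorithm 4 in Appendix C) of an AME-style structure-to-graph encoding: it creates one vertex per object and one directed edge per exchanged message, records the neighborhood index as an edge feature, and carries the original object features over to the vertices. The input format and names are ours.

```python
# Simplified AME-style encoding of a relational structure into a plain graph.
def encode_structure(objects, neighborhoods, features):
    """objects: iterable of hashable objects.
    neighborhoods: list of dicts, neighborhoods[i][u] = set of in-neighbours of u.
    features: dict mapping each object to its original feature (may be missing)."""
    vertices = list(objects)
    node_feat = {u: features.get(u) for u in vertices}
    edges, edge_feat = [], {}
    for i, nbh in enumerate(neighborhoods):
        for u, in_nbrs in nbh.items():
            for w in in_nbrs:
                e = (w, u)  # a message is sent from w to u
                if e not in edge_feat:
                    edges.append(e)
                    edge_feat[e] = set()
                edge_feat[e].add(i)  # which neighbourhood(s) this edge encodes
    return vertices, node_feat, edges, edge_feat
```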
Thus, the overall runtime complexity of strongly simulating AMP is equivalent to that of running AMP itself. Example. We illustrate AMP on the example of WL with triangle counts and without edge features. For every v VG, we denote by v the number of triangles in the graph containing v. We define the color update function of WL with triangle counts: ϕ(v, G, c) = (cv, {{(cu, u) | u NG(v)}}) . We show that WL with triangle counts is AMP. Consider the tuple (cu, u), this consists of a color cu atom (A3) with a constant u atom (A1) combined into a tuple atom (A4). Thus, (cu, u) is an atom which means that {{(cu, u) | u NG(v)}} is AMP that aggregates this atom over the neighborhood NG (S2). Finally, this AMP is combined with the color atom cv (A1) and combined into a tuple AMP (S3). Thus, the color update function of WL with triangle counts performs AMP which implies that it can be strongly simulated. For this it suffices to attach v to the features of every vertex v V . 4 Limits of Strong Simulation To show the limits of strong simulation we identify three constructions that either cannot be strongly simulated or cannot be strongly simulated efficiently: time dependent neighborhoods, nested ag- gregations, and non-pairwise message passing. These constructions have previously been used by Deac et al. [2022], Bevilacqua et al. [2021], Feng et al. [2022], Bodnar et al. [2021b], and Bodnar et al. [2021a]. We begin with time dependent neighborhoods which are used in the alternating message passing of Deac et al. [2022]. We illustrate them via the following example. Consider a color update function that updates the color of a vertex a based on the color of b in iteration t but not in iteration t . As it depends on the iteration whether a and b are neighbors, we call this time dependent neighborhoods. To model this in a graph, we require an edge between a and b in iteration t or we would lose expressivity whenever the color of b is important for the coloring of a. However, this means that the color of a will also be updated with the color of b in iteration t . Thus, we can strongly simulate time dependent neighborhoods but we cannot do so efficiently (i.e., with the same number of sent messages), as we might need to send many more messages. Nested aggregations are a form of color update functions used by Bevilacqua et al. [2021] and Feng et al. [2022]. Consider the following color update function from DSS-WL [Bevilacqua et al., 2021] ϕ(v, X, c) = (cv, {{{{cy | y N(x)}} | x N (v)}}) (see Figure 2 left).2 We call this nesting of two aggregations ({{c | N( )}}) a nested aggregation. To strongly simulate this we need an edge from every vertex of S x N (v) N(x) to v. Furthermore, we need a way to group vertices N(x) together for every x N (v). We prove that this is impossible with a structure-to-graph encoding and thus that nested-aggregations cannot be strongly simulated. The third construction that we cannot strongly simulate is non-pairwise message passing which has been used by Bodnar et al. [2021b,a]. Non-pairwise message passing refers to aggregating colors as ordered tuples over a neighborhood N with ℓ(N) > 1. As an example, suppose we want to update the color of vertex v based on the ordered tuple of colors (ca, cb) of vertices a and b (see Figure 2 right). This is different from the aggregation by WL or AMP as those aggregations are unordered, i.e., {{ca, cb}}. 
An issue arises when multiple such ordered tuples have to be used to update a color, as there is no way this can be done within the concept of strong simulation. Similar to nested aggregations, non-pairwise message passing cannot be strongly simulated. Note that non-pairwise message passing is a special case of nested-aggregations where an additional order on the aggregated objects is given as is demonstrated in Figure 2. Theorem 4.1. Nested aggregations and non-pairwise message passing cannot be strongly simulated. Proof sketch. To prove this (in Appendix D), we construct two relational structures that can be distinguished with nested aggregations in one iteration. These two structures have the same adjacencies but different features. We prove that no structure-to-graph encoding can create two graphs that one iteration of WL can distinguish. By the definition of graph-to-structure encoding, it is not possible for the edges of the resulting graph to depend on the original features. Thus, the two relational structure will lead to two graphs that have the same edges. We prove that WL cannot distinguish these two graphs in one iteration. The proof for non-pairwise message passing works similarly. 5 Weak Simulation We have shown in the previous section that not all color update functions can be strongly simulated. In this section, we introduce the more general concept of weak simulation and show that color update functions can be weakly simulated that are impossible to strongly simulate. Weak simulation differs from strong simulation by requiring only the same expressivity after certain number of iterations instead of every iteration. Definition 5.1. (Weak Simulation) Let ϕ be a color update function. Let R be a structure-to-graph encoding that runs in O(τϕ(|U|)) for every relational structure with object set U and creates a graph with vertex set V U. We consider two arbitrary relational structures from the domain of ϕ, say X = (U, N1, . . . , Nk, F) and X = (U , N 1, . . . , N k, F ). Let (V1, E1, F1) = R(X) and (V2, E2, F2) = R(X ). We say ϕ can be weakly simulated under R with simulation factor ζ 1 if for every t 0 it holds that WLζ t(R(X), F1|V1) WLζ t(R(X ), F2|V2) refines ϕt(X, F|U) ϕt(X , F |U ). 2We have altered the notation of DSS-WL from the original paper for the sake of simplicity. The simulation factor ζ measures the relative slowdown of weak simulation with respect to the original GNN. It follows that strong simulation is a special case of weak simulation as every strongly simulatable algorithm is weakly simulatable with a simulation factor of 1. Next, we prove that it is possible to weakly simulate non-pairwise aggregations and nested aggregations. This shows that weak simulation contains a larger set of color update functions than strong simulation. The full proof can be found in Appendix E. Theorem 5.2. Nested aggregations and non-pairwise message passing can be weakly simulated with simulation factor 2. Proof sketch. For weak simulation of nested aggregations consider the following construction. Suppose a nested aggregation ct y | y N(x) | x N(v) . For every x N(v) we create incoming edges from N(x) and an outgoing edge (x, v) (see Figure 3 left). This allows for weak simulation with a simulation factor of 2. Non-pairwise message passing can be weakly simulated in a similar way. Suppose we intend to update the color of vertex v based on the ordered tuple of colors (ca, cb) of vertices a and b. Then, we create a vertex x and an edge (x, v) for this message. 
Furthermore, we create an edge (a, x) labeled with 1 to indicate that it is the first element in the tuple and an edge (b, x) labeled with 2 (see Figure 3 right). This construction weakly simulates the non-pairwise message passing with simulation factor 2. As a consequence, we can weakly simulate the following algorithms. Corollary 5.3. The following GNNs and variants of WL can be weakly simulated: 1. Message Passing Simplicial Networks [Bodnar et al., 2021b] 2. CW Networks [Bodnar et al., 2021a] 3. DSS [Bevilacqua et al., 2021] 4. K-hop message passing and KP-GNNs [Feng et al., 2022] Discussion. It is possible to design graph transformations that result in smaller graphs, enabling more efficient weak simulation than the transformations automatically following from Theorem 5.2. We demonstrate that this is possible for all algorithms from Corollary 5.3. For Message Passing Simplicial Networks and CW Networks it is possible to construct graphs that require no additional vertices for weak simulation (see Appendix F.1). Similarly, it is possible to weakly simulate DSS without creating a large number of additional edges by adding a second copy of the original graph if we accept increasing the simulation factor to 3 as a trade-off (see Appendix F.2). Interestingly, K-hop message passing and KP-GNNs induce an additional order on the message passing which even allows for efficient strong simulation (see Appendix F.3). We did not find any architectures where weak simulation led to large increases in time and space complexity. However, in general, we cannot rule out the possibility of architectures that exhibit such an increase. This indicates the possibility for more powerful non-standard message passing, opening a promising direction for future work. 6 Experiments We have shown many cases in which graph transformation based methods achieve the same expressivity as non-standard message passing. In this section, we empirically investigate whether MPNNs together with graph transformations can also achieve similar predictive performance as the simulated GNNs with non-standard message passing. We strongly simulate DS [Bevilacqua et al., 2021], weakly simulate DSS [Bevilacqua et al., 2021] and weakly simulate CW Networks (CWN) [Bodnar et al., 2021a], as representative approaches. For a non-standard message passing algorithm, we denote the corresponding graph transformation with a bar, e.g., DSS. The code for our experiments can be found at https://github.com/ocatias/GNN-Simulation. Models. With GIN [Xu et al., 2019] and GCN [Kipf and Welling, 2017] we use two of the most common MPNNs as baselines. We combine these MPNNs with the graph transformations DS, DSS and CWN. DS, DS, DSS, and DSS require a policy that maps a graph to subgraphs, for this we chose the common 3-egonets policy that extracts the induced 3-hop neighborhood for each node [Bevilacqua et al., 2021]. We chose this policy as it creates only small subgraphs and Bevilacqua Table 1: Predictive Performance of different GNNs on 10 different datasets. BEATS MPNN is the percentage of datasets on which the model beats the corresponding MPNN (either GIN or GCN, as noted in the model name). BEATS NSMP is the percentage of datasets where the graph transformation based model beats the corresponding non-standard message passing model. BEST MODEL is the percentage of datasets where this model achieves the best results among all models. 
| MODEL | BEATS MPNN | BEATS NSMP | BEST MODEL |
|---|---|---|---|
| GIN | - | - | 0% |
| GCN | - | - | 0% |
| DS (GIN) | 60% | - | 10% |
| DS + GIN | 60% | 60% | 0% |
| DS + GCN | 70% | 30% | 0% |
| DSS (GIN) | 70% | - | 20% |
| DSS + GIN | 70% | 60% | 30% |
| DSS + GCN | 70% | 30% | 10% |
| CWN | 60% | - | 20% |
| CWN + GIN | 60% | 50% | 0% |
| CWN + GCN | 50% | 30% | 10% |

et al. [2021] reported good results for it. More details on DSS can be found in Appendix F.2. For both CWN and the corresponding graph transformation we construct cells as described by Bodnar et al. [2021a]. More details on CWN can be found in Appendices G and F.1. We apply the same training and evaluation procedure to all GNNs.

Datasets. Due to the large number of combinations of models and graph transformations, we focus on medium-sized datasets. To experimentally investigate how the graph transformations increase expressivity, we perform an initial investigation with GIN on the synthetic CSL dataset [Murphy et al., 2019, Dwivedi et al., 2023]. For real-world prediction tasks, we use all real-world datasets with fewer than 10^5 graphs that provide a train, validation, and test split and were used in Bodnar et al. [2021a], Bevilacqua et al. [2021]: ZINC [Gómez-Bombarelli et al., 2018, Sterling and Irwin, 2015], ogbg-molhiv and ogbg-moltox21 [Hu et al., 2020]. Additionally, we add seven small molecule datasets from OGB [Hu et al., 2020]. In total, we evaluate on 10 real-life datasets.³

Table 2: Accuracy on CSL. Bold results outperform GIN.

| MODEL | CSL ACCURACY |
|---|---|
| GIN | 0.1 ± 0.0 |
| CWN | **1.0 ± 0.0** |
| CWN + GIN | **1.0 ± 0.0** |
| DSS | **1.0 ± 0.0** |
| DSS + GIN | **1.0 ± 0.0** |
| DS | **1.0 ± 0.0** |
| DS + GIN | **1.0 ± 0.0** |

Setup. For real-life datasets we combine all baseline models (GCN, GIN) with all graph transformations (DS, DSS, CWN) and tune hyperparameters individually (including CWN, DS and DSS; see Appendix H). We also measure the preprocessing and training speeds for different models (details and speed results are in Appendix H.3).

Results. Table 2 shows the results on CSL. GIN achieves 10% accuracy whereas all other methods achieve perfect accuracy. This confirms that, just like non-standard message passing, MPNNs with graph transformations have superior expressivity compared to GIN. Table 1 summarizes the results on all real-life datasets. Indeed, MPNNs with graph transformations achieve the best result on half of them. Just like non-standard message passing, MPNNs with graph transformations outperform MPNNs in the majority of experiments. Notably, graph transformations combined with GIN outperform non-standard message passing in more than half of the experiments. This shows that simulated networks not only have provably at least the same expressivity but also perform surprisingly well in practice.

³ We use the following datasets: ZINC with 12k nodes, ogbg-molhiv, ogbg-moltox21, ogbg-molesol, ogbg-molbace, ogbg-molclintox, ogbg-molbbbp, ogbg-molsider, ogbg-moltoxcast, and ogbg-mollipo.

7 Discussion and Conclusion

Discussion. The purpose of simulating a GNN is to allow an MPNN to learn similar embeddings of graphs as the GNN. We expect such embeddings to model the similarity of graphs in a more fine-grained way than just distinguishing graphs. Thus, we have to avoid trivial solutions: any MPNN can be made as expressive as any other GNN when the output of the GNN is attached to the input of the MPNN. In fact, it is even possible to make an MPNN maximally expressive by precomputing the isomorphism class of the graph, e.g., a number uniquely encoding whether two graphs are isomorphic, and inputting it into the MPNN.
However, such techniques defeat the purpose of learning a graph representation in the first place and potentially require a super-polynomial runtime [Babai, 2016]. To avoid such scenarios, we introduce two requirements for weak and strong simulation: (1) the features of the resulting graph depend on the original features solely by concatenation and (2) the transformation must operate in linear time relative to the simulated color update function. These two constraints render it impossible to precompute the colors generated by the simulated method for a non-constant number of iterations and prohibit the computation of graph isomorphism classes. Implications. Our work has implications on (1) the theory of GNNs, (2) the design of new GNNs, and (3) the implementation GNNs with graph transformations. For (1): simulation allows to investigate the expressivity of different GNNs through a unified lens by analyzing the corresponding graph transformation. It should be possible to obtain VC bounds for any weakly simulatable GNN by using the results from Morris et al. [2023]. Similarly, we believe that it is possible to use Geerts and Reutter [2022] to get expressivity upper-bounds in term of the k-WL test for any weakly simulatable GNN. For (2): our theorems indicate that nested aggregations and non-pairwise message passing cannot be strongly simulated and are thus fundamentally different from the message passing paradigm. Thus, to build GNNs that go beyond MPNNs in expressivity it seems promising to investigate such constructions. For (3): instead of implementing a GNN with non-standard message passing, our theorems imply it can be implemented as a graph transformation together with an MPNN. This makes it easier to implement GNNs as of-the-shelf MPNNs can be used and makes the resulting method agnostic from the deep learning framework as the graph transformation does most of the heavy lifting. Conclusion. We introduced weak and strong simulation of non-standard message passing. Simulation allows an MPNN together with a graph transformation to obtain the same expressivity every ζ 1 layers (weak simulation) or in every layer (strong simulation) as GNNs with non-standard message passing. We have proposed a straightforward way of proving that an algorithm can be simulated and to generate the corresponding graph transformations. This generalizes previous ideas [Otto, 1997, Morris et al., 2019, 2020, 2022, Qian et al., 2022] and makes simulating GNNs accessible to anyone who intends to build novel GNNs. In total, we have shown that 15 GNNs can be simulated. We chose these 15 models to showcase the variety of GNNs that fall under augmented message passing and expect many more GNNs to be simulatable as well. In future research, we will investigate the use of simulation to analyze aspects of GNNs such as generalization, oversmoothing, and oversquashing. Acknowledgements. Part of this work has been supported by the Vienna Science and Technology Fund (WWTF) project ICT22-059. The authors would like to thank the reviewers for the feedback. We are especially thankful to our colleagues Pascal Welke and Tamara Drucks for giving extensive feedback on a draft of this paper. R. Abboud, I. I. Ceylan, M. Grohe, and T. Lukasiewicz. The surprising power of graph neural networks with random node initialization. In IJCAI, 2021. R. Abboud, R. Dimitrov, and I. I. Ceylan. Shortest path networks for graph property prediction. In Learning on Graphs Conference, 2022. L. Babai. Graph isomorphism in quasipolynomial time. 
In STOC, 2016. P. Barcel o, F. Geerts, J. L. Reutter, and M. Ryschkov. Graph neural networks with local graph parameters. In Neur IPS, 2021. B. Bevilacqua, F. Frasca, D. Lim, B. Srinivasan, C. Cai, G. Balamurugan, M. Bronstein, and H. Maron. Equivariant subgraph aggregation networks. In ICLR, 2021. L. Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com. C. Bodnar, F. Frasca, N. Otter, Y. G. Wang, P. Li o, G. Mont ufar, and M. Bronstein. Weisfeiler and Lehman go cellular: CW networks. In Neur IPS, 2021a. C. Bodnar, F. Frasca, Y. G. Wang, N. Otter, G. Montufar, P. Lio, and M. Bronstein. Weisfeiler and Lehman go topological: Message passing simplicial networks. In ICML, 2021b. G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. Transactions on Pattern Analysis and Machine Intelligence, 45(1):657 668, 2022. G. Dasoulas, L. Dos Santos, K. Scaman, and A. Virmaux. Coloring graph neural networks for node disambiguation. In IJCAI, 2020. A. Deac, M. Lackenby, and P. Veliˇckovi c. Expander graph propagation. In Learning on Graphs Conference, 2022. V. P. Dwivedi, C. K. Joshi, A. T. Luu, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1 48, 2023. S. Ebli, M. Defferrard, and G. Spreemann. Simplicial neural networks. In Neur IPS, 2020. J. Feng, Y. Chen, F. Li, A. Sarkar, and M. Zhang. How powerful are K-hop message passing graph neural networks. In Neur IPS, 2022. M. Fey and J. E. Lenssen. Fast graph representation learning with Py Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. F. Frasca, B. Bevilacqua, M. M. Bronstein, and H. Maron. Understanding and extending subgraph GNNs by rethinking their symmetries. In Neur IPS, 2022. F. Geerts and J. L. Reutter. Expressiveness and approximation properties of graph neural networks. In ICLR, 2022. M. Grohe, P. Schweitzer, and D. Wiebking. Deep Weisfeiler Leman. In SODA, 2021. R. G omez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hern andez-Lobato, B. S anchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2018. M. Hajij, K. Istvan, and G. Zamzmi. Cell complex neural networks. In Topological Data Analysis and Beyond Workshop at Neur IPS, 2020. M. Hajij, G. Zamzmi, T. Papamarkou, N. Miolane, A. Guzm an-S aenz, K. N. Ramamurthy, T. Birdal, T. K. Dey, S. Mukherjee, S. N. Samaga, N. Livesay, R. Walters, P. Rosen, and M. T. Schaub. Topological deep learning: Going beyond graph data. ar Xiv:2206.00606v3, 2023. W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. In Neur IPS, 2020. N. Immerman and E. Lander. Describing graphs: A first-order approach to graph canonization. In Complexity Theory Retrospective, pages 59 81. 1990. F. Jogl, M. Thiessen, and T. G artner. Reducing learning on cell complexes to graphs. In Workshop on Geometrical and Topological Representation Learning at ICLR, 2022a. F. Jogl, M. Thiessen, and T. G artner. Weisfeiler and Leman return with graph transformations. In International Workshop on Mining and Learning with Graphs at ECMLPKDD, 2022b. A. D. Keros, V. Nanda, and K. Subr. 
Dist2Cycle: A simplicial neural network for homology localization. In AAAI, 2022. T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. C. Morris, M. Ritzert, M. Fey, W. Hamilton, J. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, 2019. C. Morris, G. Rattan, and P. Mutzel. Weisfeiler and Leman go sparse: Towards scalable higher-order graph embeddings. In Neur IPS, 2020. C. Morris, G. Rattan, S. Kiefer, and S. Ravanbakhsh. Speq Nets: Sparsity-aware permutationequivariant graph networks. In ICML, 2022. C. Morris, F. Geerts, J. T onshoff, and M. Grohe. WL meet VC. In ICML, 2023. R. L. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro. Relational Pooling for Graph Representations. In ICML, 2019. M. Otto. Bounded variable logics and counting a study in finite models. In Volume 9 of Lecture Notes in Logic, 1997. P. A. Papp and R. Wattenhofer. A theoretical comparison of graph neural network extensions. In ICML, 2022. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. De Vito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Py Torch: An imperative style, high-performance deep learning library. In Neur IPS, 2019. C. Qian, G. Rattan, F. Geerts, C. Morris, and M. Niepert. Ordered subgraph aggregation networks. In Neur IPS, 2022. R. Sato, M. Yamada, and H. Kashima. Approximation ratios of graph neural networks for combinatorial problems. In Neur IPS, 2019. R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In SDM, 2021. T. Sterling and J. J. Irwin. ZINC 15 ligand discovery for everyone. Journal of Chemical Information and Modeling, 2015. E. Thiede, W. Zhou, and R. Kondor. Autobahn: Automorphism-based graph neural nets. In Neur IPS, 2021. P. Veliˇckovi c. Message passing all the way up. In ICLR Workshop on Geometrical and Topological Representation Learning, 2022. A. Wijesinghe and Q. Wang. A new perspective on how graph neural networks go beyond Weisfeiler Lehman? . In ICLR, 2022. K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In ICLR, 2019. C. Yun, S. Sra, and A. Jadbabaie. Small Re LU networks are powerful memorizers: a tight analysis of memorization capacity. Neur IPS, 2019. B. Zhang, S. Luo, L. Wang, and D. He. Rethinking the expressive power of GNNs via graph biconnectivity. In ICLR, 2023. A Related Work: Expressivity of GNNs In this section, we talk about research on the expressivity of GNNs and how it relates to our paper. Xu et al. [2019] and Morris et al. [2019] proved that MPNNs have limited expressivity. This lead to the development of new GNNs that have higher expressivity than MPNNs. We distinguish between GNNs that perform non-standard message passing and GNNs that use an MPNN on a transformed graph. We begin with approaches that correspond to MPNNs on a transformed graph. A common approach for this adds additional features to the graph. It has been proven that adding random features to vertices improves the expressivity of GNNs [Abboud et al., 2021, Dasoulas et al., 2020, Sato et al., 2021] which allows them to learn every function defined on a graph and thus gives them maximum expressivity. However, to do this they sacrifice determinism, and permutation invariance / equivariance. 
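The deterministic, permutation-equivariant alternative discussed next attaches precomputed structural counts to the node features. Below is a minimal sketch (ours), using the triangle counts from the running example in Section 3 and assuming a simple undirected graph whose edge index contains both edge directions.

```python
# Deterministic, permutation-equivariant feature augmentation:
# attach to every vertex the number of triangles it is contained in.
import torch


def append_triangle_counts(x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """x: [num_nodes, d] node features; edge_index: [2, num_edges], both directions."""
    n = x.size(0)
    adj = torch.zeros(n, n)
    adj[edge_index[0], edge_index[1]] = 1.0
    # For a symmetric 0/1 adjacency matrix A, diag(A^3)[v] counts closed walks of
    # length 3 through v; each triangle is counted twice (once per direction).
    triangles = torch.diagonal(adj @ adj @ adj) / 2.0
    return torch.cat([x, triangles.unsqueeze(-1)], dim=-1)
```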
Some permutation equivariant approaches extend the graph features by counting patterns such as Barcel o et al. [2021] which extend the MPNN with rooted homomorphism counts of a set of patterns. All approaches that consist of deterministically adding additional features to vertices and edges in a permutation equivariant way can be trivially strongly simulated and belong to augmented message passing. Graph Structural Networks (GSN-e and GSN-v; Bouritsas et al. [2022]) introduce a new graph convolution layer which extends messages with subgraph isomorphism counts. This cannot be directly seen as transforming graphs. However, it is very similar to adding these subgraph isomorphism counts to the vertex features and thus to an MPNN operating on a transformed graph. Next, we consider approaches that use non-standard message passing. Such approaches often operate on different structures such as (1) k-tuples of nodes, (2) topological objects such as simplical complexes or regular cell complexes, and (3) subgraphs. Type (1) passes messages between ktuples of nodes. Higher dimensional WL (k-WL) is a generalization of WL and forms a sequence of algorithms that are stronger than WL [Immerman and Lander, 1990]. Increasing k increases expressivity at the cost of exponentially increasing the runtime. Morris et al. [2019] introduced k-dimensional GNNs which extend the concept of k-WL to GNNs. For every k there exist k-GNNs that have equal expressivity as k-WL. However, as k increases, the runtime of k-GNNs grows exponentially. To combat this, Morris et al. [2020] introduced (local) δ-k dimensional WL and GNNs; Morris et al. [2022] introduce (k, s)-LWL and (k, s)-Speq Nets. These methods make use of the sparsity of the graph to reduce computation time. Type (2) operates on topological structures such as simplical complexes or regular cell complexes. Three prominent examples of this idea are Simplical Networks [Bodnar et al., 2021b], CW Networks [Bodnar et al., 2021a] and messagepassing combinatorial complex neural networks [Hajij et al., 2023]. Other algorithms that work on topological structures are Simplical Neural Networks [Ebli et al., 2020], Dist2Cycle [Keros et al., 2022], and Cell Complex Neural Networks [Hajij et al., 2020]. Type (3) decomposes the graph into subgraphs and then performs message passing on these subgraphs. Examples of these algorithms are Automorphism-based Neural Networks [Thiede et al., 2021], Subgraph GNNs [Frasca et al., 2022], and Equivariant Subgraph Aggregation Networks (ESAN; Bevilacqua et al. [2021]). ESAN uses a policy to transform a graph into a set of subgraphs and then processes them either separately (DS-WL / GNN) or in a way that allows information transfer between the subgraphs (DSS-WL / GNN). The GNN Mk [Papp and Wattenhofer, 2022] separately processes subgraphs consisting of the entire original graph with k nodes that are marked. Ordered subgraph aggregation networks (k-OSAN) [Qian et al., 2022] operate on k-ordered vertex subgraphs, k-OSAN is upper bounded in expressivity by (k +1)-WL and incomparable to k-WL. Frasca et al. [2022] proves that GNNs based on subgraphs are upper bounded by 3-WL. Finally, we list other GNNs with non-standard message passing that we can simulate: V Vc-GNNs [Sato et al., 2019] have been shown to be stronger than standard message passing GNNs in approximating algorithmic problems such as minimum dominating set. 
Wijesinghe and Wang [2022] introduce a generalized message passing framework (GMP) which allows them to inject local structure into the aggregation scheme. K-hop message passing [Feng et al., 2022] extend MPNNs from 1-hop neighborhoods to K-hop neighborhoods. KP-GNNs [Feng et al., 2022] extends K-hop message passing by including additional information about the subgraphs induced by the K-hop neighborhood. In summary, there exist many GNNs that are more expressive than MPNNs. Despite the conceptual differences between these GNNs we show that many of them are some form of augmented message passing and thus can be simulated. B Related Work: Simulating Message Passing We investigate previous approaches to simulating GNNs with MPNNs on transformed graphs by analyzing papers and source code of different architectures. We begin with algorithms from Corollary 3.6. GSN-e / GSN-v [Bouritsas et al., 2022] and DS-WL / DS-GN [Bevilacqua et al., 2021] were not implemented via graph transformations. In practice however, there is little difference between their implementation of DS-WL / DS-GN and our implementation with graph transformations: Bevilacqua et al. [2021] implement subgraphs as separate graphs while we implement them as separate subgraphs of the same graphs. For VVC-GNNs [Sato et al., 2019], we could find no indication that they were implemented with graph transformations. Mk-GNNs [Papp and Wattenhofer, 2022] is a purely theoretic work and has not been implemented by the corresponding authors. For GMP [Wijesinghe and Wang, 2022], we are not sure if their code can count as being implemented as a graph transformation as they adapt message passing layers to use their structural information. However, for this they change the definition of the layers whereas we only change the graphs. Morris et al. [2019] introduced k-GNNs and implemented them with MPNNs on a transformed graph. Morris et al. [2019] even call this simulation and give a short proof that WL can simulate k-WL on transformed graphs. This corresponds to our concept of strong simulation, and this property of k-WL was already noticed by Otto [1997] [Grohe et al., 2021]. For δ-k-(L)GNN, Morris et al. [2020] write that they implemented δ-k-(L)GNNs [ ...] using PYTORCH GEOMETRIC, using a Python-wrapped C++11 preprocessing routine to compute the computational graphs for the higher-order GNN. We used the GIN-ϵ layer to express f W1 mrg and f W2 aggr of [ ...]. An inspection of their code reveals that they create a similar graph structure as we propose in the proof of Corollary 3.6. For δ-k-GNN they create a graph where the vertices represent the k-tuples. They add an edge between two nodes if the tuples are j-neighbors, with j [[k]]. Just like in our graph transformation the edge type encodes whether the tuples are local or global j-neighbors and the neighborhood j. For (k, s)-Speq Net (or (k, s)-LWL), Morris et al. [2022] write that (k, s)-Speq Nets were implemented [ ...] using Py Torch Geometric, using a Python-wrapped C++11 preprocessing routine to compute the computational graphs for the higher-order GNNs. We used the GIN-ϵ layer to express f W1 mrg and f W2 aggr of [ ...]. Similar to our graph transformation from Corollary 3.6, they create a new graph where each node corresponds to an ordered subgraph v V k G of size k with at most s connected components in this subgraph. Just like in our graph transformation, edges are added between nodes if their tuples only differ at position j [[k]] and the nodes at position j are neighbors. 
Additionally, the edge features in the transformed graph encode the edge features from the original graph in a way that allows them to encode j. Thus, their graph transformation follows ours. Both Morris et al. [2020] and Morris et al. [2022] reference Morris et al. [2019] to argue that there exist instantiations powerful enough for f W1 mrg and f W2 aggr. For k-OSWL / OSAN, Qian et al. [2022] state in their paper [ ...] the computation of k-OSWL s coloring relies on the simple and easy-toimplement 1-WL. It follows that the authors of these four papers were aware of simulation. However, to the best of our knowledge they have never formalized or fully explained their idea. In contrast, we fully formalize the idea of simulation and provide proofs that show that these graph transformations provably simulate the corresponding GNNs. Additionally, we provide recipes for obtaining graph transformations for augmented message passing algorithms. To the best of our knowledge, so far Simplical Message Passing Networks [Bodnar et al., 2021b], CWN [Bodnar et al., 2021a], DSS-GNN / WL [Bevilacqua et al., 2021], K-hop message passing and KP-GNNs [Feng et al., 2022] have not been implemented using weak simulation. However, another topological deep learning network the combinatorial complex neural network [Hajij et al., 2023] has been implemented based on GNNs. For this, the authors use a transformation to an augmented Hasse graph by transforming cells to vertices and adding edges between vertices corresponding to adjacent cells. This is similar to CWN, our proposed transformation from regular cell complexes to graph. Specifically, Hajij et al. [2023] state the following about their implementation Topo Embed X: Topo Embed X [...] supports higher-order representation learning on cell complexes, simplicial complexes, and CCs. [...] Topo Embed X converts a given higher-order domain into a subgraph of the corresponding augmented Hasse graph, and then utilizes existing graph representation learning algorithms to compute the embedding of elements of this subgraph. Finally, Veliˇckovi c [2022] proposes that all functions over graphs that are of interest can be computed with message passing over a transformed graph. They call this augmented message passing which inspired our use of this term. They discuss how to perform some particular variants of non-standard message passing through standard message passing. This is helpful and demonstrates the wide applicability of this approach. However, their work is intended as a positional paper and thus mostly stays high level and does not provide sufficient formal proofs to back up their claims. Providing the necessary proofs is non-trivial: once one understands simulation it is straightforward to apply it to a large number of architectures and it is simple to prove simulatability for any one given architecture. However, difficulty arises when one tries to prove this for a large number of architectures. Thus, the difficulty is not proving simulatability but defining a framework that allows us to prove this for many different architectures. We provide such a framework based on color update functions. Furthermore, Veliˇckovi c [2022] does not provide a formal definition of simulation (or in their case augmented message passing). Indeed, it is trivially possible to build maximally expressive MPNNs by precomputing the isomorphism class of the input graph. 
Thus, without a restriction on the transformation, the general statement that a GNN can be implemented as an MPNN plus a transformation is only meaningful if the transformation is in some sense close to what the GNN does. We solve this issue through our careful definition of structure-to-graph encodings, which restricts how the transformation can access the features of the structure, and through the definitions of weak and strong simulation, which pose a runtime restriction on the transformation.

C Proofs for Strong Simulation (Section 3)

C.1 Rewriting AMPs (Lemma C.1)

We use the following lemma, which allows us to rewrite AMPs without losing the strong simulation property. For a function ϕ we use IM[ϕ] to denote the image of ϕ.

Lemma C.1. Let ϕ be a color update function. We are given a set of objects with neighborhood functions from the domain of ϕ and colors. Let Γ(ϕ) be a color update function such that there exists an injective (bijective) mapping A between IM[ϕ] and IM[Γ(ϕ)]. Then, a coloring obtained from Γ(ϕ) refines the coloring obtained from ϕ. Furthermore, if the mapping A is bijective, the coloring from ϕ also refines the coloring from Γ(ϕ).

This lemma allows us to simplify AMPs by replacing function applications (S4) with the identity function, i.e., f(ϕ) is replaced by ϕ. The corresponding mapping is bijective if f is injective. Furthermore, we can use this to flatten tuples (A4)/(S3), i.e., (. . . , (ϕ1, . . . , ϕn), . . .) is replaced by (. . . , ϕ1, . . . , ϕn, . . .). Flattening tuples with the same structure is bijective. Thus, we can apply these transformations without losing the simulatable property. We prove the correctness of the above lemma.

Proof. Let A be the injective mapping from the colors assigned to objects by ϕ to those assigned by Γ(ϕ). For an arbitrary iteration, let $\pi_w, \pi_p$ be the colors assigned to $w \in W$, $p \in P$ by Γ(ϕ) and $c_w, c_p$ the colors assigned to them by ϕ. We start by showing that if there exists the required injective mapping, then the coloring π from Γ(ϕ) refines the coloring c from ϕ. For this, we assume that ϕ can distinguish w and p and show that this implies that Γ(ϕ) can also distinguish them. Formally, we show $c_w \neq c_p \Rightarrow \pi_w \neq \pi_p$, which is the contrapositive of the definition of refinement. Suppose $c_w \neq c_p$; then from $\pi_w = A(c_w)$, $\pi_p = A(c_p)$, and the injectivity of A it follows that $\pi_w \neq \pi_p$. Next, we show that if A is bijective then the colorings from ϕ and Γ(ϕ) refine each other. We assume that A is bijective. From the above argument it then follows that the coloring from Γ(ϕ) refines the coloring from ϕ. It remains to show that the coloring from ϕ refines that from Γ(ϕ). Similar to above, we show that if Γ(ϕ) can distinguish w and p, then so can ϕ. Thus, we assume $\pi_w \neq \pi_p$ and show $c_w \neq c_p$. Using the bijectivity of A, it follows from our assumption that $A^{-1}(\pi_w) \neq A^{-1}(\pi_p)$ and thus that $c_w \neq c_p$. This concludes the proof.

Algorithm 1 Helper algorithm for atoms
Input: atom ρ of type (A1), (A2), (A3); graph (V, E, Fgraph); object w; object q; indices i, j, l.
Output: updated graph G = (V, E, Fgraph)
if ρ is of type (A2) or (A3) then
  return (V, E, Fgraph)
else
  / ρ(w, cw, q, cq) is a constant k(w, q) /
  if w is q then
    Fgraph(w) := ENC(Fgraph(w), i, j, l, k(w, w))
  else
    Fgraph((q, w)) := ENC(Fgraph((q, w)), i, j, l, k(w, q))
  end if
end if
return (V, E, Fgraph)

Algorithm 2 Helper algorithm for (S2)
Input: AMP (S2) of shape $\{\{\rho(w, c^t_w, q, c^t_q) \mid q \in N_s(w)\}\}$ where ρ is an atom and ℓ(Ns) = 1; graph (V, E, Fgraph); relational structure (U, Ns) without features; indices i, j.
Output: updated graph G = (V, E, Fgraph) for u U do for q Ns(u) do e := (q, u) if e / E then E := E {e} Fgraph(e) = empty vector end if Fgraph(e) := ENC(Fgraph(e), i, j, 0, s) end for end for / ρ has the shape (ρ1(w, cw, Ω1, cq), . . . , ρm(w, cw, Ωm, cq)) / for l [[m]] do / ρl is either an (A1), (A2) or an (A3) / for u U do for q Ns(u) do (V, E, Fgraph) := Algorithm 1(ρl, (V, E, Fgraph), u, q, i, j, l) end for end for end for return (V, E, Fgraph) C.2 Augmented Message Passing can be Simulated (Theorem 3.5) We are interested in showing the following theorem. Theorem 3.5. Augmented message passing can be strongly simulated. To prove the above theorem we define augmented message encoding (AME) in Algorithm 4. For this we assume a function ENC: (x, i, j, l, c) 7 x , that updates a feature x to x by attaching the color c at the indices i, j, l. We assume that applying ENC multiple times never overwrites information if different indices are used. For example, we can think of the feature x as a 3d-tensor such that ENC(x, i, j, l, c) updates this vector by writing the entry c at position i, j, l. Observe, that in Algorithm 4 it never happens that the same indices are used, thus whenever AME encodes information in the graph, this information does not get overwritten by latter operations. We prove the theorem by showing that AME+WL can simulate any given AMP. Proof. (Theorem 3.5) Let ϕ be an arbitrary augmented message passing algorithm. We consider two arbitrary relational structures from the domain of ϕ, say X = (U, N1, . . . , Nk, F) and X = (U , N 1, . . . , N k, F ). Let (U, E1, F1) = R(X) and (U , E2, F2) = R(X ). We prove (1) that Algorithm 3 Helper algorithm for AMP Input: AMP color update function ϕ = (ϕ1, . . . , ϕn) where each individual AMP is (A1), (A2), (A3) or (S2), graph (V, E, Fgraph), relational structure (U, N1, . . . , Nk) without features where ℓ(Ns) = 1 for all s [[k]], index i. Output: updated graph G = (V, E, Fgraph) for j [[n]] do if ϕj is (A1), (A2) or (A3) then (V, E, Fgraph) := Algorithm 1(ϕj, (V, E, Fgraph), u, u, i, j, 0) end for else / ϕi has the shape {{ρ(w, cw, q, cq) | q Ns(w)}} where ρ is an atom and s [[k]] / (V, E, Fgraph) := Algorithm 2(ϕi, (V, E, Fgraph), (U, Ns), i, j) end if end for return (V, E, Fgraph) Algorithm 4 Augmented message encoding (AME) Input: relational structure X = (U, N1, . . . , Nk, F) where ℓ(Ni) = 1 for every i [[k]], AMP color update function ϕ. Output: graph G = (U, E, F ) / Rgraph / E := Fgraph := empty features Replace function applications (S4) in ϕ by the identity function, i.e., f(ϕ) becomes ϕ. Flatten all (A4), (S3) constructions in ϕ except the outermost, i.e., (. . . , (ϕ1, . . . , ϕn), . . .) becomes (. . . , ϕ1, . . . , ϕn, . . .). / ϕ has the shape (ϕ1, . . . , ϕn) where each ϕi with i [[n]] is one of (A1), (A2), (A3) or (S2) / (U, E, Fgraph) := Algorithm 3(ϕ, (U, E, Fgraph), (U, N1, . . . , Nk), 0) / Rfeat / F := empty features for u U do F (u) := concat (F(u), Fgraph(u)) end for for e E do F (e) := concat (F(e), Fgraph(e)) end for return (U, E, F ) AMEϕ is a structure-to-graph encoding such that R(X) runs in O(τϕ(|U|)); (2) for every t 0 the coloring WLt(R(X), F1|U) WLt(R(X ), F2|U ) on AMEϕ(X) refines the coloring ϕt(X, F|U) ϕt(X , F |U ). We begin by proving (1). Observe that Algorithm 4 can be split into Rgraph and Rfeat as required by the definition of structure-to-graph encoding. Furthermore, Algorithm 4 first iterates through all AMPs in ϕ and all possible objects that are relevant to these AMPs. 
Hence, Algorithm 4 runs in the same runtime as one iteration of AMP which is in O(τϕ(|U|)). Next, we prove (2). We use πt u, πt r and ct u, ct r to denote the colors assigned to vertex / object u, r (U U ) in iteration t by WL on R(X), R(X ) or ϕ on X, X , respectively. We show by induction that if WL cannot distinguish u and r in a given iteration t then neither can ϕ. Formally, we show πt u = πt r ct u = ct r which implies that WLt(R(X), F1|U) WLt(R(X ), F2|U ) refines ϕt(X, F|U) ϕt(X , F |U ) as the domain of the coloring computed by ϕ is the same as the domain of the coloring computed by WL (U U ). Algorithm 4 starts by replacing all function applications (S4) by the identity function and flattens all tuples (A4) (S3). Recall that this transformation is either injective or bijective. By applying Lemma C.1 they do not influence the strong simulation property. From now on we assume that ϕ = (ϕ1, . . . , ϕn) such that all ϕj with j [[n]] are either (A1), (A2) (A3), or (S2) and if they are an atom, then the first and third function argument are identical. We use w to denote the object the AMP is defined to update. Algorithm 3 iterates through all ϕ1, . . . , ϕn and encodes them in the graph. Base case: We show that π0 u = π0 r c0 u = c0 r. For this, we assume that π0 u = π0 r. Since the initial colors of WL correspond to the vertex features which encode the object features of X and X that are used as the initial coloring of AMP, this implies that c0 u = c0 r. Induction hypothesis: We assume that πt u = πt r ct u = ct r holds for all t T. Induction step: We show that πT +1 u = πT +1 r c T +1 u = c T +1 r . We assume πT +1 u = πT +1 r . We iterate through all AMPs from ϕ1, . . . , ϕn and show that they must all return the same result for u and r which implies c T +1 u = c T +1 r . We do a case distinction on the type of this AMP. Note that in AMP the object / edge features are referred to us F( ) whereas in the relational structures X and X they are denoted as F( ) and F ( ). Below, when we argue that two AMP constructions are equal and use the F( ) notation and neglect F ( ) for the sake of simplicity. k(u, u) != k(r, r) . This atom was encoded into the graph by Algorithm 1. These constants are part of the vertex features of the graphs R(X), R(X ) and thus are part of the initial coloring π0 u, π0 r. Since WL refines colorings, the assumption that πT +1 u = πT +1 r implies π0 u = π0 r. Thus, k(u, u) = k(r, r). F(u) != F(r) . As the object features are encoded into the vertex features of R(X), R(X ) (see Rfeat in Algorithm 4), they are part of the initial coloring π0 u, π0 r. Since WL refines colorings, the assumption that πT +1 u = πT +1 r implies π0 u = π0 r. Thus, F(u) = F(r). c T u != c T r . This atom was not encoded into the graph. From our assumption we know that πT +1 u = πT +1 r . Since WL refines colorings this implies that πT u = πT r . By applying the induction hypothesis we obtain that c T u = c T r . ρ(u, c T u , x, c T x ) | x Ni(u) != ρ(r, c T r , y, c T y ) | y Ni(r) . This AMP was encoded into the graph by Algorithm 2. Note that the neighborhood relation Ni was encoded by edges with edge features labeled i, this means that if two nodes have the same color πT +1 u = πT +1 r then for every neighborhood relation Ni there exists a bijective α : Ni(u) Ni(r) such that for all x Ni(u) it holds that πT x = πT α(x) and that the edges (x, u), (α(x), r) have the same edge features. From the induction hypothesis it then follows that c T x = c T α(x). 
Let ρ( , c T , Ω, c T Ω) = ρ1( , c T , Ω1, c T Ω), . . . , ρm( , c T , Ωm, c T Ωm) for some m 1. We show for an arbitrary j [[m]] that ρj u, c T u , Ωj, c T Ωj = ρj r, c T r , Ω j, c T Ω j where either Ωj = u, Ω j = r, or Ωj = x, Ω j = α(x). We do a case distinction on the type of ρj. (A1) ρj = k(a, b). We do a case distinction whether a = b: * Case a = b. By assumption we need to show k(u, u) != k(r, r). Recall that the vertex features Fµ(u), Fν(r) encode k(u, u) and k(r, r) where µ, ν {1, 2} (see Algorithm 1), respectively. Hence, the initial colors π0 u and π0 r also encode k(u, u) and k(r, r). As WL refines colorings and πT u = πT r it holds that π0 u = π0 r which implies that k(u, u) = k(r, r). * Case a = b. Then (b, a) forms an edge meaning we need to show that k(u, x) != k(r, α(x)). Recall that the feature Fµ((x, u)) encodes k(u, x) and that feature Fν((α(x), r) encodes k(r, α(x)) where, µ, ν {1, 2} (Algorithm 1). From the definition of α it follows that the edges (x, u) and (α(x), r) have the same features and thus that k(u, x) = k(r, α(x)). (A2) We do a case distinction: * Case F(u) != F(r). See proof of (A2) above. * Case F(x) != F(α(x)). We know that πT x = πT α(x). Recall that the object features are encoded in the vertex features and thus in the initial coloring π0 . Since WL refines colorings, it follows from πT x = πT α(x) that π0 x = π0 α(x) which implies F(x) = F(α(x)). * Case F((x, u))) != F((α(x), r)). As argued above we know that the edges (x, u) and (α(x), r) have the same features which implies F((x, u))) = F((α(x), r)). (A3) ρj != c T σ . We do a case distinction on σ: * Case σ is the vertex to update i.e., u and r. Then, we need to show that c T u != c T r . We know that πT u = πT r and by the induction hypothesis it follows that c T u = c T r . * Case σ is the neighboring vertex i.e., x and α(x). Then, we need to show that c T x != c T α(x). We combine πT x = πT α(x) with the induction hypothesis to obtain c T x = c T α(x). Thus we conclude that ρj(u, c T u , Ωj, c T x ) = ρj r, c T r , Ω j, c T α(x) which means that ρ(u, c T u , x, c T x ) = ρ r, c T r , α(x), c T α(x) . This concludes the proof of the induction step. C.3 Strongly Simulating GNNs We prove a theorem that allows us to simulate a combination of multiple AMPs that differ only by function applications (S4). This allows us to simulate GNNs whose layers correspond to AMPs. Theorem C.2. Let ϕ1, . . . , ϕl be l 1 AMPs that only differ in function application (S4). Let R = AMEψ be a structure-to-graph encoding where ψ is the AMP obtained by removing function applications from ϕ1. Then, it holds for every pair of relational structures X, X from the domain of ϕ1, . . . , ϕl and every l t 1 that t iterations of WL on R(X) and R(X ) refines the coloring produced by (ϕ1 . . . ϕt) on X and X . Note that it does not matter whether we construct ψ by removing function applications from ϕ1 or any other AMP from ϕ1, . . . , ϕl as these AMP only differ by function application. Proof. Let X = (U, N1, . . . , Nk, F) and X = (U , N 1, . . . , N k, F ) be two arbitrary relational structure from the domain of ϕ1 . . . ϕl and let l t 1. By Lemma C.1 any coloring produced by ψ refines the corresponding coloring from every AMP from ϕ1, . . . , ϕl. Thus, ψl(X, F|U) refines (ϕ1 . . . ϕl)(X, F|U) and ψl(X , F |U ) refines (ϕ1 . . . ϕl)(X , F |U ). By Theorem 3.5 we can strongly simulate ψ under R. Thus, the theorem follows. C.4 List of Strongly Simulatable Algorithms (Corollary 3.6) We prove Corollary 3.6. Corollary C.3. 
The following GNNs and variants of WL can be strongly simulated:
1. VVC-GNN [Sato et al., 2019]
2. k-WL / GNN [Morris et al., 2019]
3. δ-k-(L)WL / (L)GNN [Morris et al., 2020]
4. (k, s)-LWL / SpeqNet [Morris et al., 2022]
5. GSN-e and GSN-v [Bouritsas et al., 2022]
6. DS-WL / GNN [Bevilacqua et al., 2021]
7. k-OSWL / OSAN [Qian et al., 2022]
8. Mk-GNN [Papp and Wattenhofer, 2022]
9. GMP [Wijesinghe and Wang, 2022]
10. Shortest Path Networks [Abboud et al., 2022]
11. Generalized Distance WL [Zhang et al., 2023]

Proof. The proofs consist of showing that the update rules underlying the corresponding algorithms are AMP. Invoking Theorem 3.5 then proves that the algorithm is strongly simulatable. For algorithms that have both a GNN and a WL variant we only prove this for the WL variant, as the proof for the GNN variant is analogous. Note that most variants of WL and GNNs contain the color of the vertex they want to update in the updated color. An example of this is the color $c^{t-1}_{\mathbf{s}}$ in the definition of k-WL: $c^t_{\mathbf{s}} = \mathrm{HASH}\big(c^{t-1}_{\mathbf{s}}, (C^t_1(\mathbf{s}), \ldots, C^t_k(\mathbf{s}))\big)$ where $C^t_j(\mathbf{s}) = \mathrm{HASH}\big(\{\{c^{t-1}_{\mathbf{s}'} \mid \mathbf{s}' \in N_j(\mathbf{s})\}\}\big)$. This color can be modeled by a color atom (A3). We argue this explicitly for the first two algorithms and assume that it is clear for the later ones.

1. VVC-GNNs [Sato et al., 2019]: Let $\Delta$ be the maximum degree of the vertices. Suppose we are given a port numbering with functions $p_{\mathrm{tail}} : V \times [[\Delta]] \to V \cup \{\bot\}$ and $p_n : V \times [[\Delta]] \to [[\Delta]] \cup \{\bot\}$.4 The color update function for VVC-GNNs is $z^{l+1}_v = f^l_\theta\big(z^l_v, z^l_{p_{\mathrm{tail}}(v,1)}, p_n(v, 1), z^l_{p_{\mathrm{tail}}(v,2)}, p_n(v, 2), \ldots, z^l_{p_{\mathrm{tail}}(v,\Delta)}, p_n(v, \Delta)\big)$. Let $i \in [[\Delta]]$. The color update function applies a function $f^l_\theta$ (S4) to a tuple. We need to show that this tuple corresponds to a tuple AMP construction (S3) by showing that each element of the tuple is AMP. The first element of the tuple is $z^l_v$, which corresponds to an atom AMP (S1). Each $p_n(v, i)$ can be seen as a constant atom (A1). Each $z^l_{p_{\mathrm{tail}}(v,i)}$ can be seen as $\{\{z^l_w \mid w \in \{p_{\mathrm{tail}}(v, i)\}\}\}$ when $p_{\mathrm{tail}}(v, i) \neq \bot$ and the empty set otherwise, which defines a neighborhood function. Thus, $\big(z^l_v, z^l_{p_{\mathrm{tail}}(v,1)}, p_n(v, 1), \ldots, z^l_{p_{\mathrm{tail}}(v,\Delta)}, p_n(v, \Delta)\big)$ is an AMP tuple (S3) of a color atom (A3) $z^l_v$, constant atoms (A1), and aggregation AMPs (S2). Hence, the color update function is strongly simulatable.

2. k-WL [Morris et al., 2019]: Let $W = V(G)^k$ be the set of k-tuples of vertices of a given graph G. Then, k-WL can be written as $c^t_{\mathbf{s}} = \mathrm{HASH}\big(c^{t-1}_{\mathbf{s}}, (C^t_1(\mathbf{s}), \ldots, C^t_k(\mathbf{s}))\big)$ where $C^t_j(\mathbf{s}) = \mathrm{HASH}\big(\{\{c^{t-1}_{\mathbf{s}'} \mid \mathbf{s}' \in N_j(\mathbf{s})\}\}\big)$. Both hash functions are function application AMPs (S4) and $c^{t-1}_{\mathbf{s}}$ is a color atom AMP (S1). Here, $N_j(\mathbf{s})$ encodes the j-neighborhood of the k-tuple $\mathbf{s}$. This means that each $C^t_j(\mathbf{s})$ corresponds to an aggregation AMP (S2). Then, the combination into $(C^t_1(\mathbf{s}), \ldots, C^t_k(\mathbf{s}))$ is a tuple of AMPs (S3). Hence, the color update function is AMP, implying that k-WL is strongly simulatable.

3. δ-k-(L)WL [Morris et al., 2020]: We only show this for δ-k-WL, as δ-k-LWL is a special case obtained by removing elements from the color update function of δ-k-WL. Let $W = V(G)^k$ be the set of k-tuples of vertices of a given graph G. We use bold $\mathbf{v}$ to denote a k-tuple and $v$ to denote a vertex. For $j \in [[k]]$, the function $\phi_j(\mathbf{v}, w)$ returns the k-tuple obtained by replacing the j-th vertex of $\mathbf{v}$ by $w$. The function $\mathrm{adj}(\mathbf{v}, \mathbf{w})$ returns 1 if the two vertices that distinguish the j-neighbors $\mathbf{v}$ and $\mathbf{w}$ are neighbors in G and 0 otherwise.
Then, δ-k-WL can be written as ct+1 v = ci v, M δ, δ i (v) with M δ, δ i (v) = nn ci ϕ1(v,w), adj(v, ϕ1(v, w)) | w V (G) oo , . . . , nn ci ϕk(v,w), adj(v, ϕk(v, w)) | w V (G) oo . Let j [[k]] be arbitrary. Note, that adj(v, ϕj(v, w)) is an edge constant atom k(v, ϕj(v, w)) (A1) and that ci ϕj(v,w) is a color atom (A3) for object ϕj(v, w). Thus, ci ϕj(v,w), adj(v, ϕj(v, w)) is a tuple atom (A4). Furthermore, we can rewrite the aggre- gation by replacing ϕj(v, w) with w as ci w, k(v, w) | w Nj(v where Nj(v) = {ϕj(v, w) | w V (G)}. Thus, each nn ci ϕj(v,w), adj(v, ϕj(v, w)) | w V (G) oo is an aggregation AMP (S2) and M δ, δ i (v) is a tuple AMP (S3). Thus, the update rule of δ-k-WL is strongly simulatable. 4We think that there is a small typographical error in these functions signatures as defined by Sato et al. [2019]. There, they define them as ptail(v, i) : V V { } and pn : V { }. 4. (k, s)-LWL and (k, s)-Speq Nets [Morris et al., 2022]: This algorithm is very similar to δ-k-LWL and we use the same definition of ϕ as above. We prove this only for (k, s)-LWL as (k, s)-Speq Nets can be argued analogously. In what follows we use #com(G[v]) to denote the number of connected components in the graph induced by the k-tuple of nodes v of G i.e., the subgraph of G that contains only vertices from v and all edges from E(G) incident to these vertices. For k 1 and 1 s k, let V (G)k s = v V k | #com(G[v]) s . Similarly to δ-k-LWL, (k, s)-LWL assigns a color to k-tuples. However, (k, s)-LWL works on elements from V (G)k s. Hence, W = V (G)k s. The update function can the be defined as: ci+1 v = ci v, M δ,k,s i (v) M δ,k,s i (v) = nn ci ϕ1(v,w) | w N(v1) and ϕ1(v, w) V (G)k s oo , . . . , nn ci ϕk(v,w) | w N(vk) and ϕk(v, w) V (G)k s oo where vi is the i-th element of k-tuple v. Similar to above, we can rewrite the aggregation in all multisets. Let i [[k]], then we define Ni(v) = ϕi(v, w)) | w N(vi) and ϕi(v, w)) V (G)k s , which allows us to rewrite the multisets as ci x | x Ni(v . This implies that each of the multisets is an aggregation AMP (S2) and that M δ,k,s i (v) is a tuple AMP (S3). With this it follows that the color update function is simulatable. 5. GSN-e and GSN-v [Bouritsas et al., 2022]: We only prove this for GSN-v as GSN-e can be argued similarly. Let W = V (G). The update function behind GNS-v is defined as ht+1 v = UPt+1 ht v, mt+1 v with mt+1 v = M t+1 ht v, ht u, x V v , x V u , eu,v | u N(v) . In this definition UPt+1 is an arbitrary function approximator such as an MLP and M t+1 is a neighborhood aggregation function meaning meaning an arbitrary function that operates on multisets. We can see M t+1, UPt+1 as function applications (S4). Additionally, x V v , x V u and eu,v are vectors encoding subgraph isomorphism counts with respect to some set of patterns. These isomorphism counts correspond to atom constants (A1): k(v, v), k(u, u) and k(u, v). As ht v and ht u are color atoms (A3) and eu,v is a constant atom (A1) it follows that ht v, ht u, x V v , x V u , eu,v is a tuple atom (A4) and ht v, ht u, x V v , x V u , eu,v | u N(v) is an aggregation AMP (S2). Applying the function M t+1 (S4) to this AMP yields mt+1 v . Thus, the update rule is strongly simulatable. 6. DS-WL [Bevilacqua et al., 2021]: For a given graph G and a policy π that compute subgraphs we define the set of objects W as the disjoint union of the vertices in each subgraph W = S S π(G)V (S). We identify each object from W by its original vertex v V and subgraph S π(G) that created it. 
This means that cv,S is the color of vertex v from subgraph S. Then, the update rule of DS-WL can be written as ct+1 v,S = HASH ct v,S, ct x,S | (x, S) N(v, S) where N(v, S) = {(x, S) | x NS(v)} is the set of all neighbors of vertex v in subgraph S. Hence, ct x,S | (x, S) N(v, S) is an aggregation AMP (S2) which implies that the update rule is strongly simulatable. 7. k-OSWL and k-OSANs [Qian et al., 2022]: We only argue the case of k-OSWL as the proof works similarly for k-OSANs. In what follows we use g Gk to denote a k tuple of vertices from V (G) corresponding to k-vertex induced subgraph of G. We use Gk to denote the set of all k-vertex induced subgraphs of G. The algorithm computes colors for each combination of a vertex with k-vertex induced subgraph combination i.e. W = V (G) Gk. It uses the color update function: Ci+1 v,g = RELABEL Ci v,g, Ci u,g | u . Here, is either NG(v) the neighborhood relation in the graph G, or V (G) the set of all vertices. Since both neighborhood allows us to define corresponding neighborhood functions it follows that Ci u,g | u is an aggregation AMP (S2) and thus that the color update function is strongly simulatable. 8. Mk-GNNs [Papp and Wattenhofer, 2022]: As Papp and Wattenhofer [2022] note, their approach is related to ESAN [Bevilacqua et al., 2021]. For a graph G with some set of marked nodes the following update rule that computes the colorings: ht+1 u = UPDATE ht u, at+1 u at+1 u =AGGRmarked ht v | v NM(u) + AGGRmarked ht v | v NU(u) . Where NM and NU are the neighborhood relations referring to marked (unmarked) neighbors in the subgraphs, respectively. Thus, {{ht v | v NM(u)}} and {{ht v | v NU(u)}} are aggregation AMPs (S2). We can model at u as a function ϕ((x, y)) = AGGRmarked(x) + AGGRmarked(y) with x = {{ht v | v NM(u)}} and y = {{ht v | v NU(u)}}. Hence, at u is a function AMP (S4) applied to a tuple AMP (S3). Thus, the color update function is strongly simulatable. 9. GMP [Wijesinghe and Wang, 2022]: The color update function underlying GMP has the shape mt a = AGGREGATEN Avu, ht u | u N(v) , mt v = AGGREGATEN Avu | u N(v) ht v ct+1 v = COMBINE mt v, mt a . Here, Avu R are structural coefficients encoding local structures in the graph around v, u. Note that we can model AGGREGATEN Avu | u N(v) as a constant atom (A1) k(v, v). Thus, we can then write mt v = k(v, v) ht v = fmul(k(v, v), ht v) where fmul(x, y) = x y. Hence, mt v is a function application (S4) to a tuple (S3) of a constant node specific value (A1) together and a color (A3). Furthermore, mt a is an aggregation (S2) of tuples (A4) of a constant value (A1) combined with a color (A3). Thus, the color update function is strongly simulatable. 10. Shortest Path Networks [Abboud et al., 2022]: Let G = (V, E, F) be a graph, v V a vertex and i 1 an integer. We denote by Ni(v) the i-hop shortest path neighborhood i.e., the set of all vertices that can be reached from v in a shortest path of length i. Then, the color update function of shortest path message passing graph neural networks (SP-MPNNs) are defined as ct+1 v = COM ct v, ct v, AGG1 ct u | u N1(v) , . . . , AGGk ct u | u Nk(v) where k 1 is an integer, COM is a combination function and AGG... are functions that map a multisets of colors to a single color. Let W = V . Obviously, COM and all AGG functions are function applications (S4). It is clear that ct v corresponds to a color atom (A3). Next we investigate the term (ct v, AGGj ({{ct u | u Nj(v)}})) where j [[k]]. 
In this term, {{ct u | u Nj(v)}} is the aggregation AMP (S2) of color atoms (A3). Thus, (ct v, AGGj ({{ct u | u Nj(v)}})) is a tuple AMP (S3). It follows that the color update functions of SP-MPNNs corresponds to AMP which implies that SP-MPNN can be strongly simulated. 11. Generalized Distance WL [Zhang et al., 2023]: Note that this generalizes shortest path networks. Let G = (V, E, F) be a graph. Let d G( , ) be an arbitrary distance metric i.e., a function d G : V V R. The color update function of generalized distance Weisfeiler Leman (GD-WL) is defined as ct+1 v = HASH ({{(d G(u, v), ct u) | u V }}). Let W = V . It follows that {{(d G(u, v), ct u) | u V }} is an aggregation AMP (S2) that aggregates a tuple atom (A4) (d G(u, v), ct u) built from a constant atom (A1) d G(u, v) and a color atom (A3) ct u. As HASH corresponds to a function application (S4) we conclude that GD-WL is AMP and can be strongly simulated. D Proofs for Limits of Strong Simulation (Section 4) D.1 Cannot Simulate Nested Aggregation (Theorem 4.1 Part 1) We prove the correctness of Theorem 4.1 for the nested aggregations. Theorem 4.1. Nested aggregations and non-pairwise message passing cannot be strongly simulated. Proof. We define two relational structures X1, X2 that can be distinguished in the first round of message passing with nested aggregations, but cannot be distinguished in the first round of WL when representing the relational structures as graphs. Consider the set of objects U = {x, a, b, c, d, h, h } with neighborhood functions: N1(a) = N1(b) = N1(c) = N1(d) = {x}, N2(x) = {h, h }, N3(h) = {a, b}, N3(h ) = {c, d}. In the above definition, the neighborhood function returns the empty set for all other objects. Both relational structures will have the same objects and neighborhood functions but different initial features: F1(a) = F1(b) = , F1(c) = F1(d) = , F1(x) = F1(h) = F1(h ) = , F2(a) = F2(d) = , F2(b) = F2(c) = , and F2(x) = F2(h) = F2(h ) = . We define the two relational structures as X1 = (U, N1, N2, N3, F1) and X2 = (U, N1, N2, N3, F2). We define a color update function with nested-aggregations over these relations structures: ct+1 v = HASH ct v, ct w | w N1(v) , ct y | y N3(x) | x N2(v) , where ct v is the color of object v in iteration t. We use π1, π2 to denote the color c1 x for X1 and X2, respectively. By definition we know that π1 = HASH ( , , {{{{ , }} , {{ , }}}}) , π2 = HASH ( , , {{{{ , }} , {{ , }}}}) . Obviously π1 = π2. Suppose there exists a structure-to-graph encoding R such that WL can distinguish the color of the vertex x corresponding to object x for the two graphs R(X1), R(X2). We use τ1, τ2 to denote the color of vertex x after one iteration of WL on R(X1), R(X2), respectively. Since the color of x depends on a, b, c, d, we can assume that there are edges from these vertices to x in both graphs. Since R is a structure-to-graph encoding it makes use of the mappings Rgraph and Rfeat. By definition of Rgraph it has no access to the features thus the graph created for X1, X2 by Rgraph are identical. Hence, the graphs R(X1) and R(X2) only differ by the features assigned to a, b, c, d. By definition of WL, the colors τ1, τ2 only depend on the features of the neighbors of x and are independent from any other vertices (or adjacencies between other vertices). Let Y = y1, . . . , ym be all neighbors of x that are not a, b, c, d. 
Note that the features of the vertices of Y is the same for both graphs as all vertices from Y are either not from U or they are assigned the same color by F1 and F2. We add the relational structure Xi (for i {1, 2}) as a subscript to the color notation to indicate where the color is originally coming from, e.g., c0 x,X1 is the initial color of vertex x in graph R(X1). By definition of WL it holds that: τ1 = HASH c0 x,X1, c0 a,X1, c0 b,X1, c0 c,X1, c0 d,X1 c0 y,X1 | y Y , τ2 = HASH c0 x,X2, c0 a,X2, c0 b,X2, c0 c,X2, c0 d,X2 c0 y,X2 | y Y . As Rgraph assigns the same features to x and all vertices of Y for the two graphs and Rfeat concatenates the vertex features with the features from the structures, it follows that c0 x,X1 = c0 x,X2 and c0 y,X1 | y Y = c0 y,X2 | y Y . As we intend to argue that there exists no R such that τ1 = τ2 we intend to argue that c0 a,X1, c0 b,X1, c0 c,X1, c0 d,X1 != c0 a,X2, c0 b,X2, c0 c,X2, c0 d,X2 . By definition the color of a vertex v {a, b, c, d} in graph R(Xi) (with i {1, 2}) is the color assigned to v by Rgraph(Xi): c0 v,Xi = concat(Fgraph,i(v), Fi(v)). We argue that all v {a, b, c, d} have the same feature Fgraph,i(v) for all i {1, 2}. First, any two v, w {a, b, c, d} that are part of the same set in N3 ({a, b}, {c, d}) are necessarily assigned the same feature as they are indistinguishable for Rgraph based on the input (U, N1, N2, N3). Second, for two v, w {a, b, c, d} that are not part of the same set in N3 (e.g. v = a, w = c), Rgraph can detect that they are part of different sets but cannot encode this into the vertex features. Any such encoding would treat the sets {a, b} and {c, d} differently which is not possible since the function operates on (multi)sets and is thus permutation equivariant. Together with the fact that {{F1(a), F1(b), F1(c), F1(d)}} = {{F2(a), F2(b), F2(c), F2(d)}} it follows that τ1 = τ2. This contradicts the initial assumption and proves that no structure-to-graph encoding R exists that allows for strong simulation on X1 and X2. This proves the theorem. D.2 Non-pairwise message passing cannot be simulated (Theorem 4.1 Part 2) We prove the correctness of Theorem 4.1 for non-pairwise message passing. Theorem 4.1. Nested aggregations and non-pairwise message passing cannot be strongly simulated. Proof. The proof is similar to the case of nested aggregation (Appendix D.1). We define two relational structures X1, X2 that can be distinguished in the first round of message passing with non-pairwise aggregations, but cannot be distinguished in the first round of WL when representing the relational structures as graphs. Consider the set of objects U = {x, a, b, c, d} with neighborhood functions: N1(a) = N1(b) = N1(c) = N1(d) = {x}, N2(x) = {(a, b), (c, d)}. In the above definition, the neighborhood function returns the empty set for all other objects. Both relational structures will have the same objects and neighborhood functions but different initial features: F1(a) = F1(b) = , F1(c) = F1(d) = , F1(x) = , F2(a) = F2(d) = , F2(b) = F2(c) = , and F2(x) = . We define the two relational structures as X1 = (U, N1, N2, F1) and X2 = (U, N1, N2, F2). We define a color update function with non-pairwise aggregations over these relational structures: ct+1 v = HASH ct v, ct w | w N1(v) , (ct x, ct y) | (x, y) N2(v) , where ct v is the color of object v in iteration t. We use π1, π2 to denote the color c1 x for X1 and X2, respectively. By definition we know that π1 = HASH ( , , {{( , ) ( , )}}) , π2 = HASH ( , , {{( , ) , ( , )}}) . 
Obviously π1 = π2. Suppose there exists a structure-to-graph encoding R such that WL can distinguish the color of the vertex x corresponding to object x for the two graphs R(X1), R(X2). We use τ1, τ2 to denote the color of vertex x after one iteration of WL on R(X1), R(X2), respectively. Since the color of x depends on a, b, c, d, we can assume that there are edges from these vertices to x in both graphs. Since R is a structure-to-graph encoding it makes use of the mappings Rgraph and Rfeat. By definition of Rgraph it has no access to the features thus the graph created for X1, X2 by Rgraph are identical. Hence, the graphs R(X1) and R(X2) only differ by the features assigned to a, b, c, d. By definition of WL, the colors τ1, τ2 only depend on the features of the neighbors of x and are independent from any other vertices (or adjacencies between other vertices). Let Y = y1, . . . , ym be all neighbors of x that are not a, b, c, d. Note that the features of the vertices of Y is the same for both graphs as all vertices from Y are either not from U or they are assigned the same color by F1 and F2. We add the relational structure Xi (for i {1, 2}) as a subscript to the color notation to indicate where the color is originally coming from, e.g., c0 x,X1 is the initial color of vertex x in graph R(X1). By definition of WL it holds that: τ1 = HASH c0 x,X1, c0 a,X1, c0 b,X1, c0 c,X1, c0 d,X1 c0 y,X1 | y Y , τ2 = HASH c0 x,X2, c0 a,X2, c0 b,X2, c0 c,X2, c0 d,X2 c0 y,X2 | y Y . As Rgraph assigns the same features to x and all vertices of Y for the two graphs and Rfeat concatenate the vertex features with the features from the structures, it follows that c0 x,X1 = c0 x,X2 and c0 y,X1 | y Y = c0 y,X2 | y Y . As we intend to argue that there exists no R such that τ1 = τ2 we intend to argue that c0 a,X1, c0 b,X1, c0 c,X1, c0 d,X1 != c0 a,X2, c0 b,X2, c0 c,X2, c0 d,X2 . By definition the color of a vertex v from a, b, c, d in graph Xi from X1, X2 where Fgraph,i(v) is the color assigned to v by Rgraph(Xi): c0 v,Xi = concat(Fgraph,i(v), Fi(v)). We prove the the two multisets of color are equal by showing c0 a,X1 = c0 a,X2, c0 b,X1 = c0 d,X2, c0 c,X1 = c0 c,X2 and c0 d,X1 = c0 b,X2. This follows from the observation that two vertices x VR(X1) and y RR(X2) are indistinguishable by Rgraph if they are in the same position of a tuple in N2. For example, both b and d are in position 2 of their respective tuple. This means that Rgraph has to assign the same features to both b and d implying Fgraph,1(b) = Fgraph,2(d). As F1(b) = F2(d) it follows that c0 b,X1 = c0 d,X2. An analogous argument allows us to show the rest of the equalities which implies τ1 = τ2. This contradicts the initial assumption and shows that no structure-to-graph encoding R exists that allows for strong simulation of non-pairwise message passing on X1 and X2. This proves the theorem. E Proofs for Weak Simulation (Section 5) E.1 Weak simulation of nested aggregations and non-pairwise message passing (Theorem 5.2) We prove the correctness of Theorem 5.2. Theorem 5.2. Nested aggregations and non-pairwise message passing can be weakly simulated with simulation factor 2. For this we define the class of generalized augmented message passing (g AMP) color update functions. This class contains augmented message passing (AMP) but also non-pairwise message passing and nested aggregations. Definition E.1. (Generalized Augmented Message Passing. Let X = (U, N1, . . . , Nk, F) be a relational structure. 
We call a color update function ϕ generalized augmented message passing (g AMP) if for all u U and coloring c of X, it can recursively be defined as exactly one of the following five forms: (W1) ϕ(u, X, c) = ψ(u, (U, M1, . . . , Mr, F), c) where ψ is AMP and {M1, . . . , Mr} is a subset of {N1, . . . , Nk} only containing neighborhood functions N where ℓ(N) = 1. (W2) ϕ(u, X, c) = {{(cu1, . . . , cum) | (u1, . . . , um) Ni(u)}} where i [[k]]. This is called non-pairwise message passing. (W3) ϕ(u, X, c) = {{ψ(x, (U, M1, . . . Mr, F), c) | x Ni(u)}} where ψ is AMP that contains a neighborhood aggregation (S2), i [[k]] and {M1, . . . , Mr} is a subset of {N1, . . . , Nk} only containing neighborhood functions N where ℓ(N) = 1. This construction is called a nested aggregation. (W4) ϕ(u, X, c) = (ϕ1(u, X, c), . . . , ϕn(u, X, c) where ϕ1, . . . , ϕn are g AMP. (W5) ϕ(u, X, c) = f(ψ(u, X, c)) where f maps colors to colors and ψ is g AMP. We want to prove that we can weakly simulate all coloring algorithms in this class. For this we define a structure-to-graph encoding in Algorithm 6 for every color update function in this class. We use the same encoding function ENC as defined in Appendix C. We assume that if a feature function is called with an vertex, object or edge that is assigned no features, it returns a fixed default color. We prove that combining Algorithm 6 with WL is at least as expressive as the corresponding non-standard message passing algorithm. Proof. (Theorem 5.2) This proof works similarly to the proof of Theorem 3.5. We prove weak simulatability for ζ = 2. Let ϕ be an arbitrary generalized augmented message passing algorithm. We consider two arbitrary relational structures from the domain of ϕ, say X = (U, N1, . . . , Nk, F) and X = (U , N 1, . . . , N k, F ). Let (V1, E1, F1) = R(X) and (V2, E2, F2) = R(X ). We prove (1) that g AMEϕ is a structure-to-graph encoding that runs in O(τϕ(|U|)); (2) for every t 0 the coloring WL2t(R(X), F1|V1) WL2t(R(X ), F2|V2) on g AMEϕ(X) refines the coloring ϕt(X, F|U) ϕt(X , F |U ). We begin by proving (1). Observe that Algorithm 6 can be split into Rgraph and Rfeat as required by the definition of structure-to-graph encoding. Furthermore, Algorithm 6 first iterates through all g AMPs in ϕ and all possible objects that are relevant to these g AMPS. Hence, Algorithm 6 runs in the same runtime as one iteration of g AMP which is in O(τϕ(|U|)). Next, we prove (2). We use πt u, πt r and ct u, ct r to denote the colors assigned to vertex / object u, r (U U ) in iteration t by WL on R(X), R(X ) or ϕ on X, X , respectively. Let u, r (U U ) be arbitrary objects / vertices. We show by induction that if WL cannot distinguish u and r in iteration 2t then ϕ cannot distinguish them in iteration t. Formally, we show that π2t u = π2t r ct u = ct r which implies that WL2t(R(X), F1|V1) WL2t(R(X ), F2|V2) refines ϕt(X, F|U) ϕt(X , F |U ) as (U U ) (V1 V2) (by Algorithm 5). Algorithm 6 starts by replacing all function applications (S4), (W5) by the identity function and flattens all tuples (A4), (S3), (W4). Recall that this transformation is either injective or bijective. By applying Lemma C.1 they do not influence the strong simulation property and thus also do not influence the weak simulation property. From now on we assume that ϕ = (ϕ1, . . . , ϕn) such that all ϕj with j [[n]] are either (W1) (which means they are one of (A1), (A2) (A3), (S2)), (W2), or (W3) and if they are an atom, then the first and third function argument are identical. 
We use w to denote the object the g AMP is defined to update. Algorithm 3 iterates through all ϕ1, . . . , ϕn and encodes them in the graph. Base case: We show that π0 u = π0 r c0 u = c0 r. For this, we assume that π0 u = π0 r. As the initial colors correspond to vertex or object features and Algorithm 6 encodes the object features in the vertex features it follows that c0 u = c0 r. Algorithm 5 Rgraph for generalized augmented message encoding Input: relational structure X = (U, N1, . . . , Nk) without features, g AMP color update function ϕ that has the shape (ϕ1, . . . , ϕn) where each ϕi with i [[n]] is one of (A1), (A2), (A3), (S2), (W2) or (W3). Output: graph G = (V, E, Fgraph) V := W E := Fgraph := empty features for i [[n]] do if ϕi is (A1), (A2) or (A3) then (V, E, Fgraph) := Algorithm 1(ϕi, (V, E, Fgraph), u, u, i, 0, 0) end for else if ϕi is (S2) which aggregates on neighborhood N then (V, E, Fgraph) := Algorithm 2(ϕi, (V, E, Fgraph, (U, N), i, 0) else if ϕi is (W2) then / ϕi has the shape (ct u1, . . . , ct um) | (u1, . . . , um) Nj(u) where j [[k]] / for u U do for (u1, . . . , um) Nj(u) do V := V u(u1,...,um) E := E {(uu1,...,um, u)} F(u(u1,...,um)) := ENC(empty features, i, 0, 0, 1) F(uu1,...,um, u) := ENC(empty features, i, 0, 0, 1) for r [[m]] do E := E {(ur, uu1,...,um)} F(ur, uu1,...,um) := ENC(empty features, i, r, 0, 1) end for end for end for else if ϕi is (W3) then / ϕi has shape {{ψ(x, (U, M1, . . . Mr, F), c) | x Nj(u)}} where ψ is AMP / Replace function applications (S4) in ψ by the identity function, i.e., f(φ) becomes φ. Flatten tuples (A4), (S3) in ψ except outermost, i.e., (. . . , (φ1, . . . , φr), . . .) becomes (. . . , φ1, . . . , φr, . . .). / ψ has shape (ψ1, . . . , ψn) where each ψi with i [[n]] is one of (A1), (A2), (A3) or (S2) / (V, E, Fgraph) := Algorithm 3(ψ, (V, E, Fgraph), (U, M1, . . . Mr), i) for u U do for x Nj(u) do E := E {(x, u)} F(x, u) := ENC(empty features, i, 0, 0, 1) end for end for end if end for return (V, E, Fgraph) Induction hypothesis: We assume that π2t u = π2t r ct u = ct r holds for all t T. Induction step: We show that π2(T +1) u = π2(T +1) r c T +1 u = c T +1 r . We assume π2(T +1) u = π2(T +1) r which is equivalent to π2T +2 u = π2T +2 r . We prove for an arbitrary g AMP from ϕ1, . . . , ϕn that it returns the same result for u and r which implies c T +1 u = c T +1 r . We do a case distinction on the type of this g AMP. Note that in g AMP and AMP the object / edge features are referred to us F( ) whereas in the relational structures X and X they are denoted as F( ) and F ( ). Below, when we argue that two AMP constructions are equal and use the F( ) notation and neglect F ( ) for the sake of simplicity. Algorithm 6 Generalized augmented message encoding (g AME) Input: relational structure X = (U, N1, . . . , Nk, F), g AMP color update function ϕ. Output: graph G = (V, E, F ) Replace function applications (W5), (S4) in ϕ by the identity function, i.e., f(ϕ) becomes ϕ. Flatten all (W4), (A4), (S3) constructions in ϕ except the outermost, i.e., (. . . , (ϕ1, . . . , ϕn), . . .) becomes (. . . , ϕ1, . . . , ϕn, . . .). / ϕ has the shape (ϕ1, . . . , ϕn) where each ϕi with i [[n]] is one of (A1), (A2), (A3), (S2), (W2) or (W3) / / Rgraph / (V, E, Fgraph) := Algorithm 5((U, N1, . . . , Nk), ϕ) / Rfeat / F := empty features for v V do F (v) := concat (F(v), Fgraph(v)) end for for e E do F (e) := concat (F(e), Fgraph(e)) end for return (V, E, F ) ϕs(u, X, c) != ϕs(r, X, c) . Then, ϕs is one of (A1), (A2), (A3), (S2). 
Observe that, then Algorithm 6 performs the same operations as Algorithm 4: for atoms (A1), (A2), (A3) it instantiates Algorithm 1 and for (S2) it instantiates Algorithm 2. We apply the induction step of the proof of Theorem 3.5 in a slightly altered form: instead of proving πT +1 u = πT +1 r c T +1 u = c T +1 r we instead prove π2T +1 u = π2T +1 r c T u = c T r . Note, that by the induction hypothesis we already know the stronger fact π2T +2 u = π2T +2 r . However, we only use the fact π2T +1 u = π2T +1 r so that we can later apply this argument in a context where only the weaker π2T +1 u = π2T +1 r holds. We obtain a proof for π2T +1 u = π2T +1 r c T u = c T r from the induction step in the proof of Theorem 3.5 by replacing πT and πT +1 by π2T and π2T +1 , respectively. Thus, we know that for AMP ϕs it holds that ϕs(u, X, c) = ϕs(r, X, c). (W2) ( (c T u1, . . . , c T um) | (u1, . . . , um) Nj(u) != (c T r1, . . . , c T rm) | (r1, . . . , rm) Nj(r) ). From π2T +2 u = π2T +2 r it follows that there exists a bijective function α : NG(u) NG(r) that maps vertices assigned the same color by π2T +1 to another. Note, that NG(u) is the set of all neighboring vertices of vertex u in the graph generated from X or X . By the construction of the edge features, it follows that α maps vertices from {uu1,...,um | (u1, . . . , um) Nj(u)} to {rr1,...,rm | (r1, . . . , rm) Nj(r)}. Let β be the restriction of the function α to these two sets as its domain and image, respectively. This implies that nn π2T +1 uu1,...,um | (u1, . . . , um) Nj(u) oo = nn π2T +1 rr1,...,rm | (r1, . . . , rm) Nj(r) oo . Let x {uu1,...,um | (u1, . . . , um) Nj(u)} be arbitrary with x = uu1,...,um. Then, we call β(x) = rr1,...,rm. Note that all neighbors that send messages to x are u1, . . . , um. Similarly, all neighbors of β(x) that send messages to β(x) are r1, . . . , rm. Since π2T +1 x = π2T +1 β(x) it holds that there exists a bijective function γ(x) : {u1, . . . , um} {r1, . . . , rm} mapping vertices assigned the same color by π2T to each other. Note for every i [[m]] the edge connecting ui to x is labeled with i and the same holds for ri and β(x). Thus, γ(ui) = ri which implies u2T 1 , . . . , u2T m = r2T 1 , . . . , r2T m . From the induction hypothesis it holds that c T u1, . . . c T um = c T r1, . . . c T rm . From the fact that β is a bijection and by iterating over all possible vertices for x be we obtain (c T u1, . . . , c T um) | (u1, . . . , um) Nj(u) = (c T r1, . . . , c T rm) | (r1, . . . , rm) Nj(r) . (W3) ( θ(x, X , c T ) | x Nj(u) != θ(x, X c T ) | x Nj(r) ) where θ is AMP, X = (U, M1, . . . , Mr, F) and all neighborhoods of X have ℓ( ) = 1. We use the function α defined above. By definition all vertices from Nj(u) and Nj(r) are connected to u and r by an edge with a special edge label encoding j. Thus, there exists a restriction from a bijective α to a bijective β : Nj(u) Nj(r) such that for every x Nj(u) it holds that π2T +1 x = π2T +1 β(x) . Next, we can apply the above induction step proof for a g AMP that correspond to AMP (W1). From this proof it follows that for every x Nj(u) it holds that θ(x, X , c T ) = θ(β(x), X , c T ). Together with the fact that β is bijective we obtain that ( θ(x, X , c T ) | x Nj(u) = θ(x, X , c T ) | x Nj(r) ) which contradicts the initial assumption. This concludes the proof. E.2 List of Weakly Simulatable Algorithms (Corollary 5.3) Next, we leverage Theorem 5.2 to prove Corollary 5.3. Corollary E.2. The following GNNs and variants of WL can be weakly simulated: 1. 
Message Passing Simplicial Networks [Bodnar et al., 2021b] 2. CW Networks [Bodnar et al., 2021a] 3. DSS [Bevilacqua et al., 2021] 4. K-hop message passing and KP-GNNs [Feng et al., 2022] Proof. Note that this has already been proven for Message Passing Simplicial Networks and CW Networks in Theorem F.1. We start by proving the corollary for DSS [Bevilacqua et al., 2021] by showing that its color update rule performs g AMP. We reuse the notation from the proof of Corollary 3.6. The update rule for DSS-WL can be written as ct+1 v,S = HASH ct v,S, ct x,S | x NS(v) , Ct v, Ct x | x NG(v) where NS(v) is the set of all neighbors of vertex v in subgraph S, NG(v) is the set of all neighbors of v in the original graph, and Ct v contains the colors of v across different subgraphs Ct v = ct v,S | S π(G) and v V (S ) . We define the set of objects W as the disjoint union of all vertices in subgraphs W = S S π(G)V (S) and identify each object from W by its original vertex and the subgraph it created. The updaterule applies a HASH function (W5) to a tuple (W4) of other g AMPs. The first element of the tuple is ct v,S which is a color atom (A3). The second element is an aggregation over neighbors (S2).5 The third element can also be thought as an aggregation over neighbors (S2) Ct v = ct v,S | S π(G) and v V (S ) = ct v,S | S N S(v, S) where N S(v, S) = {S | S π(G) and v V (S )}. Finally, the fourth element of the tuple is a nested aggregation (W3) that aggregates over the AMP defined in the previous sentence (with x instead of v) for all x NG(v). Thus, the color update function performs g AMP and DSS can be weakly simulated. 5There is some more subtlety to this. The notation NS implies a separate neighborhood function NS for each subgraph S which would give an unbounded number of different neighborhood functions which does not directly work with our definitions. Instead, we need to create a new neighborhood function N1((v, s)) = {x | x NS(v)}. The same holds for the following definition of N S. Next, we prove the corollary for KP-GNNs [Feng et al., 2022], K-hop message passing [Feng et al., 2022] will follow as a special case of this. The color update function of KP-GNNs is defined as ml+1,k v = MESl,normal k nn (hl u, euv) | u Qk,t v,G oo + f(Gk,t v,G), f(Gk,t v,G) = EMB((E(Qk,t v,G), Ck k )), hl+1,k v = UPDl k ml+1,k v , hl v hl+1 v = COMBINEl hl+1,k v | k = 1, 2, . . . , K . Here, t {gd, spd} with gd = graph diffusion and spd = shortest path distance. The set Qk,gd v,G contains all nodes from V (G) that can be reached within k-steps of random walk diffusion steps from v. The set Qk,spd v,G contains all nodes from V (G) whose shortest-path to node v is exactly k. Let W = V (G) {vk | v V (G), k [[K]]} be a set of objects with associated features F. We define three neighborhood functions N1, N2, N3: 1. N1(v) = {vk | k [[K]]}, 2. N2(vk) = {v}, 3. N3(vk) = Qk,t v,G. For all other vertices the neighborhood functions return the empty set. We prove that hl+1 v performs g AMP it inside out, starting with ml+1,k v . We represent ml+1,k v as AMP ϕ(vk, (W, N1, N2, N3, F), hl).6 Let ρ(vk, hl vk, u, hl u) = (hl u, euvk) where euvk is an edge constant depending on u, vk. By definition ρ is an atom that is constructed by combining a color atom (A3) with a constant atom (A1) into a tuple atom (A4). Note that MESl,normal k {{(hl u, euv) | u Qk,t v,G}} is AMP that applies the function MESl,normal k (S4) to an AMP that aggregates (S2) the atom ρ(v, hl v, u, hl u) over all u N3(vk). 
Next, f(Gk,t v,G) can be seen as a constant atom (A1) k(vk, vk) = f(Gk,t v,G). Let φ, ϕ be two AMPs, then we define the addition AMP as f+((φ, ϕ)) = φ + ϕ. Hence, we can represent ml+1,k v as an AMP ϕ(vk, (W, N1, N2, N3, F), hl) = f+ MESl,normal k nn (hl u, euv) | u Qk,t v,G oo , f(Gk,t v,G) . Next, we argue that hl+1,k v is AMP represented as φ(vk, (W, N1, N2, N3, F), hl). Observe, that we can write φ as a function application of UPDl k to a tuple ϕ(vk, (W, N1, N2, N3, F), hl x | x N2(vk) where hl x | x N2(vk) is AMP (S2) that aggregates over the only N2 neighbor. As N2(vk) = {v} it holds that φ(vk, (W, N1, N2, N3, F) corresponds to hl+1,k v . Finally, we argue that hl+1 v is g AMP ψ(v, (W, N1, N2, N3, F). We write ψ as a function application (W5) of COMBINEl to the nested aggregation (W3) φ(vk, hl, W, N1, N2, N3) | vk N1(v) . Thus, ψ(v, (W, N1, N2, N3, F) corresponds to hl+1 v . As hl+1 v is a g AMP it follows by Theorem 5.2 that KP-GNN can be weakly simulated. We obtain a proof that K-hop message passing [Feng et al., 2022] can be weakly simulated as special case of the above proof by removing the term f(Gk,t v,G) from the definition of ml+1,k v . 6Note that atoms, AMP and g AMP are generally defined over all objects. However, by construction this AMP will only be used to aggregate colors of objects vk with v V (G), k [[K]]. Thus, we denote it only for these objects despite it being defined over all objects. We use the similar simplifications throughout this proof. F Handcrafted Efficient Graph Transformations F.1 CWN: Efficient Weak Simulation of CW Networks We define a transformation CWN that allows for more efficient weak simulation of CW Networks. We prove the following theorem: Theorem F.1. Message Passing Simplicial Networks [Bodnar et al., 2021b] and CW Networks [Bodnar et al., 2021a] can be weakly simulated by CWN with a simulation factor of 2 such that the number of vertices is equal to the number of cells. Proof. We prove this theorem for CW Networks as they a generalization of Message Passing Simplicial Networks. We begin with definitions required for the proof. A regular cell complexes consists of different cells each with a fixed dimension. We have a boundary relation given on the cells. For a cell σ we call B(σ) = {τ | τ σ} its boundary adjacent cells and C(σ) = {τ | σ τ} its co-boundary adjacent cells. We define C(σ, µ) = Cσ Cµ. For a cell σ we call N (σ) = {τ | δ such that σ δ and τ δ} its upper adjacent cells. For a color ct at iteration t we use the notation ct F(σ) to denote the color of all objects in F(σ) at iteration t meaning ct F(σ) = {{ct x | x F(σ}}. For a given cell σ in a regular cell complex X a CW Network computes its color in iteration t + 1 with the color update function ct+1 σ = UPD1 σ, X, ct = HASH ct σ, ct B(σ), ct (σ) . where ct (σ) performs the non pairwise message passing ct (σ) = ct µ, ct δ | µ N (σ) and δ C(σ, µ) . It is straightforward to show that the color update function UPD1 performs generalized augmented message passing and that CWN can thus be weakly simulated with a simulation factor of 2. However, for this Algorithm 6 would generate more vertices than cells. Instead, we present a hand-crafted graph transformation that requires no additional vertices. We define a color update function that replaces the non pairwise message passing into separate pairwise message aggregations: ct+1 σ = UPD2 σ, X, ct = HASH πt σ, πt B(σ), πt N (σ), πt N (σ) , where N (σ) = {δ | µ N (σ) and δ C(σ, µ)}. 
We require the dimension of cells to be encoded in the initial coloring. Note that UPD2 performs augmented message passing: πt σ is a color atom (A3) and πt B(σ), πt N (σ), πt N (σ) correspond to a AMPs (S2) that aggregates a color over a neighborhood from B(σ), N (σ), N (σ). Hence, we can strongly simulate UPD2 by Theorem 3.5. Next, we show that for every t 0 the coloring produced in iteration 2t of UPD2 refines the coloring produced by t iterations of UPD1 (CWN). Combining this with the fact that we can strongly simulated UPD2 means that the coloring produced by 2t iterations of strongly simulating UPD2 refines the coloring from t iterations of CWN. This proves that we can weakly simulated CWN with a simulation factor of 2. We use CWN to denote the transformation AMEUPD2. Let t 0 be an integer. Let X, X be two arbitrary regular cell complexes. Let σ, ν be two arbitrary cells from these cell complexes. We show by induction on t that if UPD2 is unable to distinguish σ from ν in iteration 2t, then UPD1 is unable to distinguish cell σ from cell ν in iteration t. We use ct σ to denote the color assigned to cell σ by the coloring from iteration t of UPD1. We use πt σ to denote the color assigned to vertex σ in from iteration t of UPD2. To prove the theorem, we show that π2t σ = π2t ν ct σ = ct ν. The base case (t = 0) holds trivially, as the features of the cells are encoded in the vertex features. Next, we assume the statement holds for t = T and prove that it holds for t = T + 1. This means we assume π2(T +1) σ = π2(T +1) ν and have to show: c T +1 σ = c T σ , c T B(σ), c T (σ) != c T ν , c T B(ν), c T (ν) = c T +1 ν . From π2(T +1) σ = π2(T +1) ν it follows that π2T +2 σ = π2T +2 ν and π2T σ = π2T ν . By the induction hypothesis it then follows that c T σ = c T ν . From π2T +2 σ = π2T +2 ν it follows that πB(σ)2T +1 = Figure 4: A graph (left) transformed into bags of subgraphs (center) by a policy and the result of applying DSS to the graph and bags of subgraphs (right). Yellow vertices represent the original graph and red vertices the subgraphs. πB(ν)2T +1. Hence, there exists a bijective function α : B(σ) B(ν) that maps vertices with the same color assigned by π2T +1 to another, meaning for all x B(σ) it holds that π2T +1 x = π2T +1 α(x) and π2T x = π2T α(x). The existence of α and the induction hypothesis imply that c T B(σ) = c T B(ν). Similarly, there exist a function β : N (σ) N (ν) that bijectively map vertices assigned the same color by π2T +1 to another. Thus, for every x N (σ) it holds that π2T +1 x = π2T +1 β(x) and π2T x = π2T β(x). Furthermore, it holds that for every x N (σ) there exists a bijective function γx : B(x) B(α(x)) that maps cells to another that get assigned the same color by π2T . Combining this yields π2T x , π2T y | y B(x), x N (σ) = π2T x , π2T y | y B(x), x N (ν) . By construction each y has a higher dimension than x. This is encoded in the features, thus it follows that π2T x , π2T y | y B(x), x N (σ) = π2T x , π2T y | y B(x), x N (ν) . From the induction hypothesis it follows that c T x , c T y | y B(x), x N (σ) = c T x , c T y | y B(x), x N (ν) . (1) We prove that this is equivalent to showing that c T (σ) = c T (ν). For this, we show that for every cell ρ it holds that {(µ, δ) | µ B(δ), δ N (ρ)} != {(µ, δ) | µ N (ρ) and δ C(ρ, µ)}. First, we show that the set on the left is a subset of the set on the right. Suppose an arbitrary element (µ, δ) with µ B(δ) and δ N (ρ) from the left set. Note that µ B(δ) implies µ δ. Furthermore, δ N (ρ) means ρ δ. 
Thus, δ C(ρ, µ). Finally, this also implies that µ N (ρ). Hence, (µ, δ) is an element of the right set. Next, we show that the set on the right is a subset of the left set. Let (µ, δ) with µ N (ρ) and δ C(ρ, µ) be an arbitrary element from the set on the right. From δ C(ρ, µ) it follows that ρ δ and µ δ. By the definition of B, it follows that ρ B(δ) and µ B(δ). It remains to show that δ N (ρ) meaning we need to show that δ {y | x N (δ) and y C(δ, x)}. By definition we know that µ N (δ) and that δ C(δ, µ). Hence, δ N (ρ). This implies that (µ, δ) is an element of the set on the left, meaning that the set on the right is a subset of the set on the left. Thus, the two sets are equal. We use this to rewrite Equation 1 into c T x , c T y | x N (σ) and y C(σ, x) = c T x , c T y | x N (ν) and y C(ν, x) . (2) Hence, c T (σ) = c T (ν). This concludes the proof. F.2 DSS: Weakly Simulating DSS Efficiently Recall that the color update function of DSS WL is defined as ct+1 v,S = HASH ct v,S, ct x,S | x NS(v) , Ct v, Ct x | x NG(v) where NS(v) is the set of all neighbors of vertex v in subgraph S, and Ct v contains the colors of v across different subgraphs Ct v = ct v,S | S π(G) and v V (S ) . Let v V (G). According to the graph transformation from Algorithm 6 with simulation factor 2 we would create a separate helper vertex for each vertex from X = {x | x NG(v)} where each x X is connected to all instances of x across the different subgraphs. A straightforward way to do this, is to add an additional copy of the original graph. A drawback of the graph transformation from Algorithm 6 is that because of Ct v = ct v,S | S π(G) and v V (S ) we need to create edges between every copy of v among all subgraphs, that means we create up to |V | |π(G)|2 extra edges. We can remedy this by adding edges between v in the original graph and all instances of v in the other subgraphs. Thus, it suffices to create at most |E(G)| + |V (G)| |π(G)| edges. Similarly, for every v V (G) and all subgraphs S π(G) containing v we need to add edges to the corresponding helper node for all neighbors NG(v). Interestingly, it is not necessary to add these edges and instead it is possible to rely on the edges generated by adding an additional copy of the original graph. However, this means that for a vertex v it takes three steps to obtain the information {{Ct x | x NG(v)}}. Thus, this comes at the cost of increasing the simulation factor to 3. We call the resulting graph transformation DSS. Figure 4 shows an example of DSS. Definition F.2 (DSS). Given a graph G and policy π we define Gπ = (Vπ, Eπ, Xπ, Yπ) to be the graph obtained by DSS. Where S π(G) G V (S), S π(G) G E(S) {{v G, v S} | S π(G) and v V (S)}. For unlabeled graphs we introduce a vertex feature that encodes whether it was created from a subgraph S π(G) or was created from G. For labeled graphs we extend the features correspondingly. We add edge features that allow us to distinguish whether an edge was created due from E(S) or from {{v G, v S} | S π(G) and v V (S)}. We show that for arbitrary policies and graphs WL on the transformed graphs is at least as expressive as DSS-WL on the original graphs. Theorem F.3. DSS weakly simulates DSS-WL with a simulation factor of 3. Proof. Let G and H be two graphs and π a policy. We use τ t v,S to denote the coloring of vertex v S obtained by iteration t of WL on Gπ or Hπ, and τ t v,G to denote the color in iteration t of WL of vertex v that was created from the original graph G (respectively τ t v,H). 
We show that for arbitrary policies and graphs, WL on the transformed graphs is at least as expressive as DSS-WL on the original graphs.

Theorem F.3. DSS weakly simulates DSS-WL with a simulation factor of 3.

Proof. Let $G$ and $H$ be two graphs and $\pi$ a policy. We use $\tau^t_{v,S}$ to denote the color of vertex $v_S$ obtained in iteration $t$ of WL on $G_\pi$ or $H_\pi$, and $\tau^t_{v,G}$ to denote the color in iteration $t$ of WL of the vertex $v_G$ that was created from the original graph $G$ (respectively $\tau^t_{v,H}$). We use $c^t_{v,S}$ to denote the color of vertex $v$ in subgraph $S$ obtained in the $t$-th iteration of DSS-WL. We prove by induction on the iteration $t$ of DSS-WL that for every $t \geq 0$ it holds that
$$\forall v \in V(G),\, w \in V(H),\, S \in \pi(G),\, T \in \pi(H): \quad \tau^{3t}_{v,S} = \tau^{3t}_{w,T} \Rightarrow c^t_{v,S} = c^t_{w,T}.$$

Base case: All vertices $v_S \in V(G_\pi)$ and $w_T \in V(H_\pi)$ are assigned the same initial color as their counterparts in DSS-WL (except for the additional indicator that they were created from a subgraph). Thus, from $\tau^0_{v,S} = \tau^0_{w,T}$ it follows that $c^0_{v,S} = c^0_{w,T}$.

Induction hypothesis: We assume the statement holds for all $t \leq n$, where $n$ is an arbitrary nonnegative integer.

Induction step: Let $v \in V(G)$, $w \in V(H)$, $S \in \pi(G)$, $T \in \pi(H)$. We assume that $\tau^{3(n+1)}_{v,S} = \tau^{3(n+1)}_{w,T}$ holds and want to show $c^{n+1}_{v,S} = c^{n+1}_{w,T}$. The assumption is equivalent to $\tau^{3n+3}_{v,S} = \tau^{3n+3}_{w,T}$, which also implies $\tau^{3n}_{v,S} = \tau^{3n}_{w,T}$. We need to show that
1. $c^n_{v,S} = c^n_{w,T}$,
2. $\{\!\{c^n_{x,S} \mid x \in N_S(v)\}\!\} = \{\!\{c^n_{y,T} \mid y \in N_T(w)\}\!\}$,
3. $C^n_v = C^n_w$, and
4. $\{\!\{C^n_x \mid x \in N_G(v)\}\!\} = \{\!\{C^n_y \mid y \in N_H(w)\}\!\}$.

From the induction hypothesis and $\tau^{3n}_{v,S} = \tau^{3n}_{w,T}$ it immediately follows that $c^n_{v,S} = c^n_{w,T}$.

We want to show that $\{\!\{c^n_{x,S} \mid x \in N_S(v)\}\!\} = \{\!\{c^n_{y,T} \mid y \in N_T(w)\}\!\}$. From $\tau^{3n+3}_{v,S} = \tau^{3n+3}_{w,T}$ it follows that there exists a bijection $\beta: N_{G_\pi}(v_S) \to N_{H_\pi}(w_T)$ such that for every $x \in N_{G_\pi}(v_S)$ it holds that $\tau^{3n+2}_x = \tau^{3n+2}_{\beta(x)}$, which implies $\tau^{3n}_x = \tau^{3n}_{\beta(x)}$. Additionally, all neighbors of $v_S$ are from the subgraph $S$ or from the original graph $G$, and the vertex features allow us to distinguish between these two types. Thus, $\beta$ maps vertices from $V(S)$ to vertices from $V(T)$. Note that these are exactly the vertices whose colors appear in $\{\!\{c^n_{x,S} \mid x \in N_S(v)\}\!\}$ and $\{\!\{c^n_{y,T} \mid y \in N_T(w)\}\!\}$. Thus, by combining this with the induction hypothesis we obtain that $\{\!\{c^n_{x,S} \mid x \in N_S(v)\}\!\} = \{\!\{c^n_{y,T} \mid y \in N_T(w)\}\!\}$.

We want to show that $C^n_v = C^n_w$. We use the function $\beta$ defined in the previous paragraph. Due to the vertex features we know that $\beta$ must map $v_G$ to $w_H$, as there exists only one neighbor with the created-from-the-original-graph feature among the neighbors of $v_S$ and of $w_T$. Hence, $\tau^{3n+2}_{v,G} = \tau^{3n+2}_{w,H}$, which implies $\tau^{3n+1}_{v,G} = \tau^{3n+1}_{w,H}$. This implies that there exists a bijective function $\gamma$ mapping neighbors of $v_G$ to neighbors of $w_H$ such that they are assigned the same color by $\tau^{3n}$. The vertices $v_G$ and $w_H$ have two types of neighbors: the neighbors $N_G(v)$ and $N_H(w)$ from the original graphs, and the copies of $v$ and $w$ across the different subgraphs. By the vertex features we can distinguish these two types of neighbors. Thus, $\gamma$ maps copies of $v$ from the different subgraphs to copies of $w$ from the different subgraphs. As these vertices correspond to $v_{S'}$ and $w_{T'}$ for all $S' \in \pi(G)$ and $T' \in \pi(H)$, we can combine this with the induction hypothesis to obtain that
$$C^n_v = \{\!\{c^n_{v,S'} \mid S' \in \pi(G) \text{ and } v \in V(S')\}\!\} = \{\!\{c^n_{w,T'} \mid T' \in \pi(H) \text{ and } w \in V(T')\}\!\} = C^n_w.$$

Finally, we want to show that $\{\!\{C^n_x \mid x \in N_G(v)\}\!\} = \{\!\{C^n_y \mid y \in N_H(w)\}\!\}$. Since the copies of $v$ and $w$ created from the original graphs can be uniquely identified by their vertex features among $N_{G_\pi}(v_S)$ and $N_{H_\pi}(w_T)$, we know that $\beta$ maps these two vertices to one another and thus that $\tau^{3n+2}_{v,G} = \tau^{3n+2}_{w,H}$. Hence, there exists a bijective function $\sigma: N_G(v) \to N_H(w)$ such that for every neighbor $x \in N_G(v)$ it holds that $\tau^{3n+1}_{x,G} = \tau^{3n+1}_{\sigma(x),H}$. We want to show that for every such neighbor $x \in N_G(v)$ it holds that $C^n_x = C^n_{\sigma(x)}$. Let $x$ be one such vertex. As $\tau^{3n+1}_{x,G} = \tau^{3n+1}_{\sigma(x),H}$, we can use the same argument as in the previous paragraph to obtain that $C^n_x = C^n_{\sigma(x)}$. Hence, $\{\!\{C^n_x \mid x \in N_G(v)\}\!\} = \{\!\{C^n_y \mid y \in N_H(w)\}\!\}$. This concludes the induction step and shows that the theorem holds.
F.3 Strong Simulation of K-hop GNNs and KP-GNNs

We prove the following theorem:

Theorem F.4. K-hop message passing and KP-GNNs [Feng et al., 2022] can be strongly simulated.

We prove the theorem only for KP-GNNs, as K-hop message passing is a special case of KP-GNNs. In the proof we define a new message passing algorithm that we call simple KP-GNN (sKP-GNN). We prove that (1) sKP-GNNs are as expressive as KP-GNNs in every iteration, and (2) sKP-GNN is strongly simulatable. The theorem then follows from the combination of (1) and (2). We begin with the definition of sKP-GNN.

Definition F.5 (sKP-GNN). The color update function of sKP-GNNs is defined as
$$m^{l+1,k}_v = \mathrm{MES}^{l,\mathrm{normal}}_k\big(\{\!\{(h^l_u, e_{uv}) \mid u \in Q^{k,t}_{v,G}\}\!\}\big) + f(G^{k,t}_{v,G}), \qquad f(G^{k,t}_{v,G}) = \mathrm{EMB}\big((E(Q^{k,t}_{v,G}), C^k_k)\big),$$
$$h^{l+1,k}_v = \mathrm{UPD}^l_k\big(m^{l+1,k}_v, h^l_v\big), \qquad h^{l+1}_v = \mathrm{COMBINE}^l\big(h^{l+1,1}_v, \ldots, h^{l+1,K}_v\big).$$

Note that sKP-GNN only differs from KP-GNN in the definition of $h^{l+1}_v$:
$$\text{sKP-GNN: } h^{l+1}_v = \mathrm{COMBINE}^l\big(h^{l+1,1}_v, \ldots, h^{l+1,K}_v\big) \quad\text{and}\quad \text{KP-GNN: } h^{l+1}_v = \mathrm{COMBINE}^l\big(\{\!\{h^{l+1,k}_v \mid k = 1, 2, \ldots, K\}\!\}\big).$$
It is important to note that this change from a nested aggregation to a tuple is only possible because there is a natural order on the multisets that are aggregated (the hop size $k$). We prove that sKP-GNN is as expressive as KP-GNN in every iteration.

Lemma F.6. Suppose both sKP-GNN and KP-GNN use an injective COMBINE function. Then, for every graph $G = (V, E, F)$ and integer $l \geq 0$ it holds that the coloring $\text{sKP-GNN}^l(G, F|_V)$ refines the coloring $\text{KP-GNN}^l(G, F|_V)$.

Proof. Let $G = (V, E, F)$ and $G' = (V', E', F')$ be two graphs, let $v, w \in (V \cup V')$, and let $l \geq 0$ and $K \geq 1$ be integers. We furthermore assume that the COMBINE function is injective. We prove that
$$\mathrm{COMBINE}^l\big(h^{l+1,1}_v, \ldots, h^{l+1,K}_v\big) = \mathrm{COMBINE}^l\big(h^{l+1,1}_w, \ldots, h^{l+1,K}_w\big) \tag{3}$$
implies
$$\mathrm{COMBINE}^l\big(\{\!\{h^{l+1,k}_v \mid k = 1, 2, \ldots, K\}\!\}\big) = \mathrm{COMBINE}^l\big(\{\!\{h^{l+1,k}_w \mid k = 1, 2, \ldots, K\}\!\}\big). \tag{4}$$
From Equation 3 and the assumption that the COMBINE function is injective it follows that $\big(h^{l+1,1}_v, \ldots, h^{l+1,K}_v\big) = \big(h^{l+1,1}_w, \ldots, h^{l+1,K}_w\big)$. Thus, for every $j \in [\![K]\!]$ it holds that $h^{l+1,j}_v = h^{l+1,j}_w$. This implies Equation 4.

Lemma F.7. sKP-GNN can be strongly simulated.

Proof. The proof reuses large parts of the proof that KP-GNNs can be weakly simulated (Theorem 5.3). As sKP-GNN only differs from KP-GNN in the definition of $h^{l+1}_v$, it follows from that proof that $h^{l+1,k}_v$ is an AMP for every $k \in [\![K]\!]$. Then, $h^{l+1}_v$ is an AMP that applies the function $\mathrm{COMBINE}^l$ (S4) to a tuple (S3) built from the AMPs $h^{l+1,1}_v, \ldots, h^{l+1,K}_v$. From Theorem 3.5 it follows that sKP-GNN can be strongly simulated.

Proof (Theorem F.4). The theorem follows from Lemma F.6 and Lemma F.7.

G Intuition for CWN and the Graph Transformation CWN

Figure 5: Left: a graph. Center: the regular cell complex built from the graph through a graph-to-structure encoding [Bodnar et al., 2021a]. Vertices correspond to 0-dimensional cells, edges to 1-dimensional cells (yellow), and induced cycles to 2-dimensional cells (blue). Right: the graph created by structure-to-graph encoding the regular cell complex. Vertices correspond to cells as indicated by color.

CW Networks [Bodnar et al., 2021a, CWN] perform message passing on topological structures called regular cell complexes. A regular cell complex is built out of individual cells of some dimension. Similar to message passing on graphs, CWN computes a representation (or color) for each cell by aggregating from the neighboring cells.
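To make this aggregation scheme more concrete, the following is a minimal, hypothetical sketch of one color-refinement step in the spirit of the CWN update (UPD1 in Appendix F.1). It assumes the regular cell complex is given by plain dictionaries encoding the boundary relation, the upper adjacency, and the common co-boundaries; all names are illustrative and colors are assumed to be integers.

```python
def cwn_refine(colors, boundary, upper, common_cobnd):
    """One WL-style refinement step over the cells of a regular cell complex.

    colors:        dict cell -> int color
    boundary:      dict cell -> list of cells in B(cell)
    upper:         dict cell -> list of upper-adjacent cells N_up(cell)
    common_cobnd:  dict (cell, cell) -> list of common co-boundary cells C(., .)
    """
    new_colors = {}
    for sigma, c in colors.items():
        # Multisets are represented as sorted tuples so that the result does not
        # depend on iteration order; hash() stands in for an injective HASH.
        b_colors = tuple(sorted(colors[tau] for tau in boundary.get(sigma, [])))
        up_colors = tuple(sorted(
            (colors[tau], colors[delta])
            for tau in upper.get(sigma, [])
            for delta in common_cobnd.get((sigma, tau), [])
        ))
        new_colors[sigma] = hash((c, b_colors, up_colors))
    return new_colors
```

The weak simulation in Appendix F.1 realizes roughly this computation with two standard message passing steps per CWN iteration on the transformed graph.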
To apply CWNs to graphs, a regular cell complex has to be constructed from the graph (see Figure 5); we call this process graph-to-structure encoding. For this, vertices are transformed into 0-dimensional cells and edges into 1-dimensional cells. In this process, additional structures such as (induced) cycles or cliques can also be lifted to cells to improve the expressiveness and connectivity of the structure. In the experiments we use a lifting map that lifts induced cycles of length at most 6 to cells. Figure 5 shows how a graph is transformed into a regular cell complex and how our structure-to-graph encoding CWN transforms the complex back into a graph. Note that in our implementation of CWN we use undirected instead of directed edges for the sake of simplicity. When transforming a cell to a vertex, we set each vertex feature to the most common value of that feature among the vertices making up the cell. For example, a cell that corresponds to a cycle on vertices $v, w, p$ is transformed into a vertex whose features are the most common features among those of $v$, $w$, and $p$. Similarly, the features of edges connecting cells are the most common of the edge features between the vertices making up the cells.

H More Details about the Experiments

We provide further details about our experiments. In Section H.1 we elaborate on our setup, in Section H.2 we present more details on the results from the main paper, and in Section H.3 we present an additional evaluation of transformation and training speed. The code for our experiments can be found at https://github.com/ocatias/GNN-Simulation. All models are implemented in PyTorch Geometric [Fey and Lenssen, 2019]. We use WandB [Biewald, 2020] to track our experiments. Unfortunately, CWN requires an old version of PyTorch [Paszke et al., 2019], meaning that we have to train it on older GPUs. CWN training and all speed evaluations are performed on an NVIDIA GeForce GTX 1080 GPU. All other experiments are performed on NVIDIA GeForce RTX 3080 GPUs.

Type Pooling. In our transformations DSS and CWN, vertices are assigned additional features that encode the type of a vertex. In CWN, these features encode whether the cell was built from a vertex, edge, or cycle. In DSS, these features encode whether a vertex is part of the original graph or of a subgraph. Inspired by a pooling operation by Bodnar et al. [2021a], we propose type pooling. The idea is that vertices with different types contain different information and have varying importance. Thus, instead of pooling over all vertices in a graph at once, we pool over all vertices of each type separately. To get a single output for the entire graph, we then apply a multilayer perceptron with a ReLU activation function to each pooled vector and sum the results. Let POOL be a pooling operation such as summation. Let $T$ be the set of types, let $H_t$ with $t \in T$ be the multiset of all representations of vertices of type $t$, and let $\mathrm{MLP}_t$ be a multilayer perceptron with a ReLU activation function. Then, type pooling computes
$$\text{Type-Pool} = \sum_{t \in T} \mathrm{MLP}_t\big(\mathrm{POOL}(H_t)\big).$$
We use type pooling for all combinations of an MPNN plus DS, DSS, and CWN. Note that for DS all nodes have just one type.
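To make this concrete, the following is a minimal PyTorch sketch of type pooling for a single graph (a hypothetical illustration, not our exact implementation). It assumes `vertex_type` holds the integer type of each vertex and uses mean pooling as POOL.

```python
import torch
import torch.nn as nn

class TypePooling(nn.Module):
    """Pool vertex embeddings separately per vertex type, apply a
    type-specific MLP to each pooled vector, and sum the results."""

    def __init__(self, num_types: int, dim: int):
        super().__init__()
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_types)
        ])

    def forward(self, h: torch.Tensor, vertex_type: torch.Tensor) -> torch.Tensor:
        # h: (num_vertices, dim) vertex embeddings of one graph
        # vertex_type: (num_vertices,) integer type of each vertex
        out = h.new_zeros(h.size(1))
        for t, mlp in enumerate(self.mlps):
            mask = vertex_type == t
            if mask.any():
                pooled = h[mask].mean(dim=0)  # POOL(H_t); sum pooling also works
                out = out + mlp(pooled)
        return out
```

In a batched setting, the Python loop over types would typically be replaced by scatter operations over batch and type indices.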
Hyperparameter Tuning. Table 3 contains all hyperparameter grids for the real-life tasks. During preliminary experiments we found that it is crucial to tune the pooling operation and the number of layers to achieve strong results. For DS(S) we do not tune the pooling function, as the code provided by Bevilacqua et al. [2021] does not directly support this. Since CWN is very slow, we use a smaller hyperparameter grid, which still contains a superset of the hyperparameters used by Bodnar et al. [2021a].

Table 3: Hyperparameter grid used for the real-life datasets. The learning rate is only tuned on OGB tasks. On ZINC all models used the same initial learning rate that decays over time.

PARAMETER           | GIN, GCN       | DSS, DS        | CIN ON OGB     | CIN ON ZINC
EMBEDDING DIMENSION | 64, 256        | 64, 256        | 64, 256        | 64, 256
NUMBER OF LAYERS    | 1, ..., 5      | 1, ..., 5      | 1, 2, 3        | 3, 4, 5
LEARNING RATE       | 10^-3, 10^-4   | 10^-3, 10^-4   | 10^-3, 10^-4   | 10^-3
POOLING OPERATION   | SUM, MEAN      | MEAN           | SUM, MEAN      | SUM, MEAN
DROPOUT RATE        | 0.5            | 0.5            | 0.5            | 0.5

Setup Real-Life Datasets. For all real-life datasets we tune the hyperparameters on the validation set and evaluate the best hyperparameters on the test set 10 times. We report the average and standard deviation of the metric that is typically used on the given dataset. For all OGB datasets, we train with a batch size of 32 for 100 epochs with a fixed learning rate to allow for fast training of many models. For ZINC we train as described by Bevilacqua et al. [2021], Bodnar et al. [2021a], and Dwivedi et al. [2023], i.e., for up to 500 epochs with a batch size of 128 and an initial learning rate of $10^{-3}$ that is reduced by a factor of 0.5 every time the validation result has not improved for 20 epochs. If the learning rate dips below $10^{-5}$, the training stops.

Setup CSL. For the synthetic CSL dataset, we only evaluate GIN plus the graph transformations and compare to the results reported by Bevilacqua et al. [2021] and Bodnar et al. [2021a]. We perform 10-fold cross validation on this dataset. For all models with graph transformations we replicate the hyperparameters used by the corresponding non-standard message passing algorithms as described in the original papers. For GIN, we tune hyperparameters as described for the real-life datasets.

H.2 Results on Real-Life Datasets

Tables 4 and 5 show the prediction quality of our models on all real-life datasets.

Table 4: Bold results are better than the corresponding message passing baseline, red results are graph-transformation-based methods that are better than the corresponding non-standard message passing variation. ↓ means lower is better, ↑ means higher is better.

MODEL     | MOLHIV ROC-AUC ↑ | MOLTOX21 ROC-AUC ↑ | MOLESOL RMSE ↓ | ZINC MAE ↓
GIN       | 77.7 ± 1.0       | 76.1 ± 0.5         | 0.946 ± 0.086  | 0.306 ± 0.035
GCN       | 76.6 ± 1.0       | 74.9 ± 0.6         | 0.928 ± 0.049  | 0.456 ± 0.021
MLP       | 72.2 ± 0.9       | 71.2 ± 0.7         | 1.128 ± 0.021  | 1.44 ± 0.001
DS        | 75.4 ± 1.4       | 75.4 ± 0.6         | 0.986 ± 0.051  | 0.161 ± 0.008
DS + GIN  | 77.2 ± 1.2       | 77.0 ± 0.6         | 0.924 ± 0.03   | 0.143 ± 0.008
DS + GCN  | 75.4 ± 0.8       | 76.6 ± 0.6         | 0.872 ± 0.037  | 0.285 ± 0.044
DS + MLP  | 69.8 ± 6.1       | 70.8 ± 0.8         | 1.292 ± 0.115  | 1.281 ± 0.002
DSS       | 76.8 ± 1.3       | 77.0 ± 1.0         | 0.864 ± 0.024  | 0.105 ± 0.005
DSS + GIN | 74.4 ± 1.3       | 78.1 ± 0.7         | 0.853 ± 0.015  | 0.149 ± 0.008
DSS + GCN | 74.8 ± 2.1       | 76.3 ± 0.7         | 0.868 ± 0.022  | 0.221 ± 0.028
DSS + MLP | 71.9 ± 0.8       | 69.7 ± 0.5         | 1.272 ± 0.025  | 1.283 ± 0.006
CWN       | 78.6 ± 1.5       | 75.2 ± 0.9         | 1.117 ± 0.051  | 0.13 ± 0.003
CWN + GIN | 78.4 ± 1.5       | 74.4 ± 0.7         | 0.958 ± 0.021  | 0.137 ± 0.007
CWN + GCN | 78.1 ± 1.0       | 74.2 ± 0.9         | 0.933 ± 0.036  | 0.155 ± 0.013
CWN + MLP | 71.8 ± 0.8       | 71.0 ± 0.5         | 1.217 ± 0.032  | 1.045 ± 0.006

H.3 Speed Evaluation

Setup. We evaluate the preprocessing and training speed on ZINC for GIN and our graph transformations against DS(S) and CWN. We select the hyperparameters for all GNNs such that they all have roughly $8 \cdot 10^5$ trainable parameters. To measure the transformation speed, we begin the time measurement after the graphs have been loaded into memory and stop before we store the final objects. For our graph transformations, this measures how long it takes to run the graph transformations.
For CWN, this measures how long it takes to create the cell complexes from the graphs. Additionally, we measure how long it takes to train the model for 100 epochs on ZINC, evaluating it after every epoch. We average all time measurements over 10 trials.

Table 5: Bold results are better than the corresponding message passing baseline, red results are graph-transformation-based methods that are better than the corresponding non-standard message passing variation. ↓ means lower is better, ↑ means higher is better.

MODEL     | MOLBACE ROC-AUC ↑ | MOLCLINTOX ROC-AUC ↑ | MOLBBBP ROC-AUC ↑ | MOLSIDER ROC-AUC ↑ | MOLTOXCAST ROC-AUC ↑ | MOLLIPO RMSE ↓
GIN       | 75.3 ± 4.9        | 87.1 ± 1.5           | 66.8 ± 2.3        | 56.8 ± 1.0         | 64.2 ± 0.7           | 0.787 ± 0.026
GCN       | 76.9 ± 3.8        | 88.2 ± 1.2           | 66.6 ± 2.1        | 57.5 ± 0.9         | 63.2 ± 0.5           | 0.793 ± 0.028
MLP       | 41.6 ± 11.2       | 57.1 ± 3.9           | 63.7 ± 1.8        | 58.9 ± 1.4         | 62.8 ± 0.3           | 1.048 ± 0.009
DS        | 79.2 ± 1.9        | 85.7 ± 2.9           | 67.4 ± 1.3        | 60.6 ± 1.1         | 66.3 ± 0.5           | 0.736 ± 0.022
DS + GIN  | 75.3 ± 3.5        | 81.4 ± 4.2           | 66.2 ± 1.8        | 59.3 ± 0.9         | 66.7 ± 0.5           | 0.716 ± 0.014
DS + GCN  | 77.7 ± 2.8        | 83.8 ± 2.4           | 66.0 ± 1.1        | 61.0 ± 0.9         | 66.1 ± 0.4           | 0.754 ± 0.018
DS + MLP  | 71.8 ± 5.6        | 55.4 ± 5.0           | 63.6 ± 2.1        | 52.9 ± 1.6         | 58.5 ± 5.5           | 1.048 ± 0.011
DSS       | 74.5 ± 7.4        | 84.6 ± 2.6           | 67.5 ± 1.5        | 57.5 ± 1.5         | 66.4 ± 0.8           | 0.652 ± 0.009
DSS + GIN | 77.2 ± 3.0        | 86.4 ± 1.2           | 64.3 ± 2.5        | 59.0 ± 1.1         | 66.8 ± 0.4           | 0.76 ± 0.018
DSS + GCN | 74.8 ± 3.9        | 87.3 ± 1.6           | 67.3 ± 2.0        | 62.0 ± 1.0         | 66.1 ± 0.6           | 0.767 ± 0.027
DSS + MLP | 72.8 ± 4.3        | 51.9 ± 7.7           | 63.5 ± 2.0        | 51.6 ± 2.0         | 62.0 ± 0.8           | 1.054 ± 0.01
CWN       | 75.0 ± 2.7        | 84.7 ± 2.4           | 69.5 ± 1.6        | 57.6 ± 1.6         | 64.7 ± 0.9           | 0.755 ± 0.018
CWN + GIN | 75.2 ± 3.3        | 85.4 ± 2.4           | 68.1 ± 2.4        | 59.1 ± 1.0         | 64.5 ± 1.1           | 0.749 ± 0.013
CWN + GCN | 72.2 ± 3.8        | 88.0 ± 1.6           | 61.5 ± 3.0        | 62.4 ± 1.1         | 64.4 ± 0.6           | 0.787 ± 0.016
CWN + MLP | 38.1 ± 0.6        | 55.7 ± 4.8           | 62.7 ± 1.6        | 59.9 ± 1.1         | 63.1 ± 0.5           | 1.055 ± 0.01

Table 6: Speed comparison of different GNNs. TRANSFORM is the time to transform the graphs into the required structure to perform message passing over. TRAIN is the time to train a model for 100 epochs. COMBINED is the combined time of transformation and training. Bold indicates the best result for a given non-standard message passing algorithm and its corresponding graph transformation.

MODEL     | TRANSFORM (SEC.) | TRAIN (SEC.) | COMBINED (SEC.)
GIN       | NA               | 170 ± 1      | 170 ± 1
CWN       | 37 ± 1           | 3259 ± 16    | 3296 ± 17
CWN + GIN | 67 ± 1           | 286 ± 6      | 353 ± 7
DSS       | 95 ± 4           | 955 ± 5      | 1050 ± 9
DSS + GIN | 183 ± 8          | 1466 ± 7     | 1649 ± 15
DS        | 95 ± 4           | 1235 ± 5     | 1330 ± 9
DS + GIN  | 171 ± 8          | 1060 ± 5     | 1231 ± 13

Results. Table 6 shows the results of the speed evaluation. We can see that our graph transformations are roughly twice as slow as the alternative transformations required by the non-standard message passing algorithms. However, this runtime is still significantly smaller than the time it takes to train the model a single time. Thus, the runtime of the graph transformations is insignificant, as a transformation only needs to be performed once, whereas the model might need to be trained dozens of times to find good hyperparameters. With respect to the training time, DSS+GIN is roughly 54% slower than DSS and DS+GIN is roughly 17% faster than DS. Interestingly, CWN+GIN is 11 times faster than CWN. While it might seem surprising that our graph-transformation-based variant of CWN is 11 times faster than the original implementation, the reasons might be: (1) a bug, for example in how the original code of CWN interacts with our environment; (2) CWN requires a large amount of code built on top of PyTorch Geometric, whereas CWN+GIN just performs standard message passing; (3) the implementation of CWN by Bodnar et al. [2021a] is not compatible with newer versions of PyTorch (Geometric), meaning that CWN+GIN can make use of speed increases due to improvements in PyTorch (Geometric).
Reasons (2) and (3) highlight a strength of graph transformations: they are largely platform and implementation independent, as the transformed graphs can be stored in file formats that do not depend on the deep learning framework used.