Fine-grained Expressivity of Graph Neural Networks

Jan Böker (RWTH Aachen University), Ron Levie (Technion - Israel Institute of Technology), Ningyuan Huang (Johns Hopkins University), Soledad Villar (Johns Hopkins University), Christopher Morris (RWTH Aachen University)

Abstract. Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the 1-dimensional Weisfeiler-Leman test (1-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resolves this issue by considering continuous extensions of both 1-WL and MPNNs to graphons. Concretely, we show that the continuous variant of 1-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the level of difficulty in separating them. We identify the finest topology where MPNNs separate points and prove a universal approximation theorem. Consequently, we provide a theoretical framework for graph and graphon similarity combining various topological variants of classical characterizations of the 1-WL. In particular, we characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the 1-WL and MPNNs on graphons. Empirically, we validate our theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts. Moreover, we evaluate different MPNN architectures based on their ability to preserve graph distances, highlighting the significance of our continuous 1-WL test in understanding MPNNs' expressivity.

1 Introduction

Graph-structured data is widespread across several application domains, including chemo- and bioinformatics [11, 101], image analysis [107], and social-network analysis [38], explaining the recent growth in developing and analyzing machine learning methods tailored to graphs. In recent years, message-passing graph neural networks (MPNNs) [25, 50, 92] emerged as the dominant paradigm, and alongside the growing prominence of MPNNs, numerous works [89, 115] analyzed MPNNs' expressivity. The analysis, typically based on combinatorial techniques such as the 1-dimensional Weisfeiler-Leman test (1-WL) for the graph isomorphism problem [113, 114], provides explanations of MPNNs' limitations (see [92] for a thorough survey). However, since the graph isomorphism problem only concerns whether the graphs are exactly the same, it only gives insights into MPNNs' ability to distinguish graphs. Hence, such approaches cannot quantify the graphs' degree of similarity. Nonetheless, understanding the similarities induced by MPNNs is crucial for precisely quantifying their generalization abilities [82, 94], stability [44], or robustness properties [60].

Present work. To address these shortcomings, we show how to integrate MPNNs into the theory of iterated degree measures first developed by Grebík and Rocha [52], which generalizes the 1-WL and its characterizations to graphons [78].
Integrating MPNNs into this theory allows us to identify the finest topology in which MPNNs separate points, allowing us to prove a universal approximation theorem for graphons. Inspired by the Weisfeiler-Leman distance [26], we show that metrics on measures also integrate beautifully into the theory of iterated degree measures. Concretely, we define metrics δP via the Prokhorov metric [99] and δW via an unbalanced Wasserstein metric [97] that metrize the compact topology of iterated degree measures. By leveraging this theory, we show that two graphons are close in these metrics if, and only if, the output of all possible MPNNs, up to a specific Lipschitz constant and number of layers, is close as well. This refines the result of Chen et al. [26], which shows only one direction of this equivalence, i.e., graphs similar in their metric produce similar MPNN outputs. We consider graphons without node-feature information to focus purely on MPNNs' ability to distinguish their structure. Our main result offers a topological generalization of classical characterizations of the 1-WL, showing that the above metrics represent the optimal approach for defining a metric variant of the 1-WL. Informally, the main result states the equivalence of our metrics δP and δW, the tree distance δT of Böker [18], the Euclidean distance of MPNNs' output, and tree homomorphism densities. These metrics arise from the topological variants of the 1-WL test, fractional isomorphisms, MPNNs, and tree homomorphism counts.

Theorem 1 (informal). The following are equivalent for all graphons U and W:
1. U and W are close in δP (or, alternatively, δW).
2. U and W are close in δT.
3. MPNN outputs on U and W are close for all MPNNs with Lipschitz constant C and L layers.
4. Homomorphism densities in U and W are close for all trees up to order k.

Up to now, except for the connection between the tree distance and tree homomorphism densities by Böker [18], these equivalences were only known to hold on a discrete level where graphs are either exactly isomorphic or not. The closeness statements in the above theorem are epsilon-delta statements, i.e., for every ε > 0, there is a δ > 0 such that, if graphons are δ-close in one distance measure, they are ε-close in the other distance measures, where the constants are independent of the actual graphons. In particular, for graphs, these constants are independent of their number of vertices. Theorem 1 is formally stated and proved in Appendix C.4. A key point in the proof is to consider compact operators (graphons) as limits of graphs. Empirically, we verify our findings by demonstrating that untrained MPNNs yield competitive predictive performance on established graph-level prediction tasks. Further, we evaluate the usefulness of our derived metrics for studying different MPNN architectures. Our theoretical and empirical results also provide an efficient lower bound on the graph distances in Böker [18], Chen et al. [26] by using the Euclidean distance of MPNN outputs. In summary, we quantify which distance MPNNs induce, leading to a more fine-grained understanding of their expressivity and separation capabilities. Our results provide a deeper understanding of MPNNs' capacity to capture graph structure, precisely determining when they can and when they cannot assign similar and dissimilar vectorial representations to graphs.
Our work establishes the first rigorous connection between the similarity of graphs and their learned vectorial representations, paving the way for a more detailed understanding of MPNNs' expressivity and their connection to graph structure.

1.1 Related work and motivation

In the following, we discuss relevant related work and provide additional background and motivation.

MPNNs. Following Gilmer et al. [50], Scarselli et al. [105], MPNNs learn a vectorial representation, i.e., a d-dimensional real-valued vector, representing each vertex in a graph by iteratively aggregating information from neighboring vertices. Subsequently, MPNNs compute a single vectorial representation of a given graph by aggregating these vectorial vertex representations. Notable instances of this architecture include, e.g., Duvenaud et al. [36], Hamilton et al. [61], and Velickovic et al. [111], which can be subsumed under the message-passing framework introduced in Gilmer et al. [50]. In parallel, approaches based on spectral information were introduced in, e.g., Bruna et al. [23], Defferrard et al. [33], Gama et al. [43], Kipf and Welling [73], Levie et al. [75], and Monti et al. [87], all of which descend from early work in Baskin et al. [14], Goller and Küchler [51], Kireev [74], Merkwirth and Lengauer [84], Micheli [85], Micheli and Sestito [86], Scarselli et al. [105], and Sperduti and Starita [108].

Figure 1: Illustration of the procedure to compute the distance δP between graphs G and H. Columns A and C show the colors obtained by 1-WL iterations on graphs G and H, respectively. Columns B and D show the iterated degree measures (IDMs) $i_{G,h}$ and $i_{H,h}$ for iterations $h = 1, 2$ (see Eq. (1)) and the output distributions of iterated degree measures (DIDMs) $\nu_G$ and $\nu_H$ (see Eq. (2)). Column E depicts the recursive construction to compute the distance δP between the IDMs from columns B and D (outlined in Section 3, detailed in Appendix C.2.2).

Expressivity and limitations of MPNNs. The expressivity of an MPNN is the architecture's ability to express or approximate different functions over a domain, e.g., graphs. High expressivity means the neural network can represent many functions over this domain. In the literature, the expressivity of MPNNs is modeled mathematically based on two main approaches, algorithmic alignment with graph isomorphism tests [88] and universal approximation theorems [7, 48]. Works following the first approach study whether an MPNN, by choosing appropriate weights, can distinguish the same pairs of non-isomorphic graphs as the 1-WL or its more powerful generalization, the k-WL. Here, an MPNN distinguishes two non-isomorphic graphs if it can compute different vectorial representations for the two graphs. Specifically, Morris et al. [89] and Xu et al. [115] showed that the 1-WL limits the expressive power of any possible MPNN architecture in distinguishing non-isomorphic graphs. In turn, these results have been generalized to the k-WL, see, e.g., Azizian and Lelarge [7], Geerts [47], Maron et al. [81], Morris et al. [89, 91, 93]. Works following the second approach study which functions over the domain of graphs can be approximated arbitrarily closely by an MPNN [7, 28, 48, 79]; see also the next paragraph. Further, see Appendix B for an extended discussion of related work on MPNNs' expressivity.
Limitations of current universal approximation theorems for MPNNs. Universal approximation theorems assume that the domain of the network is a compact metric space and show that a neural network can approximate any continuous function over this space. Current approaches studying MPNNs' universal approximation capabilities employ the (graph) edit distance to define the metric on the space of graphs, e.g., see [7]. However, the edit distance is not a natural notion of similarity for practical machine learning on graphs. That is, any two graphs on different numbers of vertices are far apart, not fully reflecting the similarity of real-world graphs. More generally, the same holds for any pair of non-isomorphic graphs. Hence, any rewiring of a graph leads to a far-away graph. However, we would like to interpret the rewiring of a small number of edges in a large graph as a small perturbation close to the original graph. Additionally, since the edit metric is not compact, the Stone-Weierstrass theorem, the primary tool in universal approximation analysis, cannot be applied directly to the whole space of graphs. For example, [26] uses other non-compact metrics to circumvent this, artificially choosing a compact subset of graphs from the whole space by uniformly limiting the size of the graphs and the edge weights. Alternatively, Chen et al. [28] resorted to the graph signal viewpoint, allowing real-valued edge features, and showed that the algorithmic alignment of GNNs with graph isomorphism algorithms can be utilized to prove universal approximation theorems for MPNNs. In contrast, here we suggest using graph similarity measures from graphon analysis, for which simple graphs of arbitrarily different sizes can be close to each other and by which the space of all graphs is dense in the compact metric space of graphons, allowing us to use the Stone-Weierstrass theorem directly; see also Appendix B.2 for an extended discussion on graph metrics beyond the edit distance.

Graphon theory. The book of Lovász [78] provides a thorough treatment of graphons, which emerged as limit objects for sequences of graphs in the theory of dense graph limits developed by Borgs et al. [20, 21], Lovász and Szegedy [77]. These limit objects allow the completion of the space of all graphs to a compact metric space. When endowed with the cut distance, first introduced by Frieze and Kannan [41], the space of graphons is compact [78, Theorem 9.23]. We can interpret graphons in two ways. First, as a weighted graph on the continuous vertex set [0, 1]. Second, we can think of every point from [0, 1] as an infinite set of vertices and of two points in [0, 1] as being connected by a random bipartite graph with edge density given by the graphon. This second point of view naturally leads to using graphons as generative models of graphs and to the theory of graph limits. Our work follows the first point of view, and we emphasize that we do not use graphons as a generative model. This also means that we do not need large graphs for the asymptotics to work since every graph and, in particular, every small graph is also a graphon. Grebík and Rocha [52] generalized the 1-WL test and various of its characterizations to graphons, while Böker [19] did this for the k-WL test.

Graphon theory in graph machine learning. Keriven et al. [64], Maskey et al. [83], Ruiz et al. [103] use graphons to analyze graph signal processing and MPNNs.
These papers assume a single fixed graphon generating the data, i.e., any graph from the dataset is randomly sampled from this graphon, and showed that spectral MPNNs on graphs converge to spectral MPNNs on graphons as the size of the sampled graphs increases. Maskey et al. [82] developed a generalization analysis of MPNNs, assuming a pre-defined finite set of graphons. Further, Keriven et al. [65] compared the expressivity of two types of spectral MPNNs on spaces of graphons, assuming graphons are Lipschitz continuous kernels. To this end, the metric on the graphon space is taken as the $L_\infty$ distance between graphons as functions. However, the paper does not directly characterize the separation power of the studied classes of MPNNs, and it requires the choice of an arbitrary compact subset to perform the analysis. In contrast, in the current paper, we use graphon analysis to endow the domain of definition of MPNNs, the set of all graphs, with a well-behaved structure describing a notion of natural graph similarity and allowing us to analyze properties of MPNNs regardless of any model of the data distribution.

2 Background

Here, we provide the necessary background and define our notation.

Analysis. We denote the Lebesgue measure on $[0,1]$ by $\lambda$ and consider measurability w.r.t. the Borel $\sigma$-algebra on $[0,1]$. Let $(X, \mathcal{B})$ be a standard Borel space, where we sometimes just write $X$ when $\mathcal{B}$ is understood and then use $\mathcal{B}(X)$ to explicitly denote $\mathcal{B}$. For a measure $\mu$ on $X$, we let $\|\mu\| := \mu(X)$ denote its total mass, and for a standard Borel space $(Y, \mathcal{C})$ and a measurable map $f\colon X \to Y$, let the push-forward $f_*\mu$ of $\mu$ via $f$ be defined by $f_*\mu(A) := \mu(f^{-1}(A))$ for every $A \in \mathcal{C}$. Let $\mathcal{P}(X)$ and $\mathcal{M}_{\le 1}(X)$ denote the spaces of all probability measures on $X$ and of all measures of total mass at most one on $X$, respectively. Let $C_b(X)$ denote the set of all bounded continuous real-valued functions on $X$. We endow $\mathcal{P}(X)$ and $\mathcal{M}_{\le 1}(X)$ with the topology generated by the maps $\mu \mapsto \int_X f \, d\mu$ for $f \in C_b(X)$, the weak topology (the weak* topology in functional analysis); see [63, Section 17.E] or [17, Chapter 8]. Then, $\mathcal{P}(X)$ and $\mathcal{M}_{\le 1}(X)$ are again standard Borel spaces, and if $K$ is a compact metric space, then $\mathcal{P}(K)$ and $\mathcal{M}_{\le 1}(K)$ are compact metrizable; see [63, Theorem 17.22]. For a sequence $(\mu_i)_i$ of measures and a measure $\mu$, we have $\mu_i \to \mu$ if and only if $\int_X f \, d\mu_i \to \int_X f \, d\mu$ for every $f \in C_b(X)$, and for measures $\mu, \nu$, we have $\mu = \nu$ if and only if $\int_X f \, d\mu = \int_X f \, d\nu$ for every $f \in C_b(X)$. In both statements, we may replace $C_b(X)$ by a dense (w.r.t. the sup norm) subset. See Appendix A.3 for basics on topology. We denote a function $f\colon A \to B$ also by $f(\cdot)$ or $f_\bullet$ in case the evaluation of $f$ on a point $a \in A$ is denoted by $f(a)$ or $f_a$, respectively.

Graphs and graphons. A graph $G$ is a pair $(V(G), E(G))$ with a finite set of vertices, or nodes, $V(G)$ and a set of edges $E(G) \subseteq \{\{u,v\} \subseteq V(G) \mid u \neq v\}$. If not otherwise stated, we set $n := |V(G)|$, and the graph is of order $n$. We also call the graph $G$ an $n$-order graph. For ease of notation, we denote the edge $\{u,v\}$ in $E(G)$ by $uv$ or $vu$. The neighborhood of $v$ in $V(G)$ is denoted by $N(v) := \{u \in V(G) \mid vu \in E(G)\}$, and the degree of a vertex $v$ is $|N(v)|$. Two graphs $G$ and $H$ are isomorphic, and we write $G \simeq H$, if there exists a bijection $\varphi\colon V(G) \to V(H)$ preserving the adjacency relation, i.e., $uv$ is in $E(G)$ if and only if $\varphi(u)\varphi(v)$ is in $E(H)$. Then $\varphi$ is an isomorphism between $G$ and $H$. A kernel is a measurable function $U\colon [0,1]^2 \to \mathbb{R}$, and a symmetric measurable function $W\colon [0,1]^2 \to [0,1]$ is called a graphon. The set of all graphons is denoted by $\mathcal{W}$. Graphons generalize graphs in the following way.
Every graph $G$ can be viewed as a graphon $W_G$ by partitioning $[0,1]$ into $n$ intervals $(I_v)_{v \in V(G)}$, each of mass $1/n$, and letting $W_G(x,y)$ for $x \in I_u$, $y \in I_v$ be one or zero depending on whether $uv$ is an edge in $G$ or not. The homomorphism density $t(F, W)$ of a graph $F$ in a graphon $W$ is
$t(F, W) := \int_{[0,1]^{V(F)}} \prod_{ij \in E(F)} W(x_i, x_j) \, d\bar{x}$,
where $\bar{x}$ is the tuple of variables $x_v$ for $v \in V(F)$.

The tree distance. A graphon $W$ defines an operator $T_W\colon L_2([0,1]) \to L_2([0,1])$ on the space $L_2([0,1])$ of square-integrable functions modulo equality almost everywhere by setting $(T_W f)(x) := \int_{[0,1]} W(x,y) f(y) \, d\lambda(y)$ for every $x \in [0,1]$ and every $f \in L_2([0,1])$. Böker [18] defined the tree distance of two graphons $U$ and $W$ by
$\delta_T(U, W) := \inf_S \sup_{f,g} \, \lvert \langle f, (T_U S - S T_W) g \rangle \rvert$,
where the supremum is taken over all measurable functions $f, g\colon [0,1] \to [0,1]$ and the infimum is taken over all Markov operators $S$, i.e., operators $S\colon L_2([0,1]) \to L_2([0,1])$ such that $S(f) \ge 0$ for every $f \ge 0$, $S(\mathbb{1}_{[0,1]}) = \mathbb{1}_{[0,1]}$, and $S^*(\mathbb{1}_{[0,1]}) = \mathbb{1}_{[0,1]}$ also for the Hilbert adjoint $S^*$. Markov operators are the infinite-dimensional analog of doubly stochastic matrices; see [39] for a thorough treatment of Markov operators. One can verify that the tree distance is a lower bound for the cut distance [18].

Iterated measures. Here, we define iterated degree measures (IDMs), which are basically sequences of measures, by adapting the definition of Grebík and Rocha [52]. Let $\mathbb{M}_0 := \{1\}$ and inductively define $\mathbb{M}_{h+1} := \mathcal{M}_{\le 1}(\mathbb{M}_h)$ for every $h \ge 0$. Then, the spaces $\mathbb{M}_0, \mathbb{M}_1, \dots$ are all compact metrizable. For $0 \le h < \infty$, inductively define the projection $p_{h+1,h}\colon \mathbb{M}_{h+1} \to \mathbb{M}_h$ by letting $p_{1,0}$ be the trivial map and, for $h > 0$, letting $p_{h+1,h}(\alpha) := (p_{h,h-1})_* \alpha$ for every $\alpha \in \mathbb{M}_{h+1} = \mathcal{M}_{\le 1}(\mathbb{M}_h)$. This extends to $p_{h,\ell}\colon \mathbb{M}_h \to \mathbb{M}_\ell$ for $0 \le \ell \le h < \infty$ by composition in the obvious way. Let
$\mathbb{M} := \mathbb{M}_\infty := \big\{ (\alpha_h)_h \in \prod_{h \in \mathbb{N}} \mathbb{M}_h \mid p_{h+1,h}(\alpha_{h+1}) = \alpha_h \text{ for every } h \in \mathbb{N} \big\}$
be the inverse limit of $\mathbb{M}_0, \mathbb{M}_1, \dots$; see the Kolmogorov consistency theorem [63, Theorem 17.20]. Then, $\mathbb{M}$ is compact metrizable [52, Claim 6.2]. For $0 \le h < \infty$, let $p_{\infty,h}\colon \mathbb{M} \to \mathbb{M}_h$ denote the projection to the $h$th component. We remark that we slightly simplify the definition of Grebík and Rocha [52] by not including the previous IDMs from $\mathbb{M}_{h'}$ for $h' \le h$ in $\mathbb{M}_{h+1}$ and by directly defining $\mathbb{M}$ as the inverse limit, corresponding to the definition of the space P in [52]. These changes yield equivalent definitions that simplify the exposition.

The 1-WL for graphons. See Appendix A.1 for the standard definition of 1-WL on graphs. Grebík and Rocha [52] generalized 1-WL to graphons by defining a map $i_W\colon [0,1] \to \mathbb{M}$, mapping every point of the graphon $W \in \mathcal{W}$ to an iterated degree measure as follows. First, inductively define the map $i_{W,h}\colon [0,1] \to \mathbb{M}_h$ by setting $i_{W,0}(x) := 1$ for every $x \in [0,1]$ and
$i_{W,h+1}(x) := \Big[ A \mapsto \int_{i_{W,h}^{-1}(A)} W(x,y) \, d\lambda(y) \Big]$,   (1)
for all $x \in [0,1]$, $A \in \mathcal{B}(\mathbb{M}_h)$, and $h \ge 0$. Intuitively, $i_{W,h}(x)$ is the color assigned to point $x$ after $h$ iterations of 1-WL. Observe that $i_{W,1}(x)$ encodes the degree of point $x$, and $i_{W,h}(x)$ for $h > 1$ represents the iterated degree sequence information. Then, we define $i_W := i_{W,\infty}\colon [0,1] \to \mathbb{M}$ by $i_W(x) := i_{W,\infty}(x) := \prod_{h \in \mathbb{N}} i_{W,h}(x)$ and let
$\nu_W := \nu_{W,\infty} := (i_W)_* \lambda \in \mathcal{P}(\mathbb{M})$   (2)
be the distribution of iterated degree measures (DIDM) of $W$. In other words, $\nu_W(A)$ is the volume that the colors in $A$ occupy in the graphon domain $[0,1]$. Then, the 1-WL test on graphons is the mapping that takes a graphon $W \in \mathcal{W}$ and returns $\nu_W$. In addition to $\nu_W$, we also define $\nu_{W,h} := (i_{W,h})_* \lambda \in \mathcal{P}(\mathbb{M}_h)$ for $0 \le h < \infty$, corresponding to running 1-WL for $h$ rounds.
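To make the induced graphon $W_G$ and the homomorphism density $t(F, W)$ defined above concrete, the following minimal NumPy sketch builds the step-function graphon of a small graph and estimates a tree homomorphism density by Monte Carlo integration of the defining integral. It is an illustration under our own naming, not code from the paper's release.

```python
import numpy as np

def induced_graphon(A):
    """Return W_G as a step function: W_G(x, y) = A[floor(n*x), floor(n*y)]."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    def W(x, y):
        return A[min(int(n * x), n - 1), min(int(n * y), n - 1)]
    return W

def hom_density_mc(edges, num_vertices, W, samples=20000, seed=None):
    """Monte Carlo estimate of t(F, W) = E[ prod_{ij in E(F)} W(x_i, x_j) ]
    with x_1, ..., x_k i.i.d. uniform on [0, 1]."""
    rng = np.random.default_rng(seed)
    x = rng.random((samples, num_vertices))
    vals = np.ones(samples)
    for i, j in edges:
        vals *= np.array([W(a, b) for a, b in zip(x[:, i], x[:, j])])
    return vals.mean()

# Example: density of the path on 3 vertices (two edges) in the graphon of a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
W = induced_graphon(A)
print(hom_density_mc([(0, 1), (1, 2)], 3, W))  # approx (sum_v deg(v)^2) / n^3 = 16/64 = 0.25
```

Tree homomorphism densities of exactly this kind are the quantities appearing in item 4 of Theorem 1.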
While every DIDM of a graphon is a measure from the compact space $\mathcal{P}(\mathbb{M})$, not every measure in $\mathcal{P}(\mathbb{M})$ is the DIDM of a graphon. Grebík and Rocha [52] address this by giving a definition of a DIDM that is independent of a specific graphon. For us, it suffices to remark that the set $D_h := \{\nu_{W,h} \mid W \text{ a graphon}\} \subseteq \mathcal{P}(\mathbb{M}_h)$ is compact, as it is the image of the compact space of graphons [78, Theorem 9.23] under a continuous function [52]. For us, this means that $D_h$ and $\mathcal{P}(\mathbb{M}_h)$ can be used interchangeably in our arguments, and we do not have to be overly careful with distinguishing them. For simplicity, we simply stick to $\mathcal{P}(\mathbb{M}_h)$ and refer to all elements of $\mathcal{P}(\mathbb{M}_h)$ as DIDMs.

Message-passing graph neural networks. MPNNs learn a $d$-dimensional real-valued vector for each vertex in a graph by aggregating information from neighboring vertices; see Appendix A.2 for more details. Here, we consider MPNNs where both the update functions and the readout functions are Lipschitz continuous and use sum aggregation normalized by the order of the graph. Formally, we first let $\varphi = (\varphi_i)_{i=0}^{L}$ denote a tuple of continuous functions $\varphi_0\colon \mathbb{R}^0 \to \mathbb{R}^{d_0}$ and $\varphi_t\colon \mathbb{R}^{d_{t-1}} \to \mathbb{R}^{d_t}$ for $t \in [L]$, where we simply view $\varphi_0$ as an element of $\mathbb{R}^{d_0}$. Furthermore, let $\psi$ denote a continuous function $\psi\colon \mathbb{R}^{d_L} \to \mathbb{R}^d$. For a graph $G$, an MPNN initializes a feature $h^{(0)}_v := \varphi_0 \in \mathbb{R}^{d_0}$. Then, for $t \in [L]$, we compute $h^{(t)}\colon V(G) \to \mathbb{R}^{d_t}$ and the single graph-level feature $h_G \in \mathbb{R}^d$ after $L$ layers by
$h^{(t)}_v := \varphi_t\Big( \tfrac{1}{|V(G)|} \sum_{u \in N(v)} h^{(t-1)}_u \Big)$ and $h_G := \psi\Big( \tfrac{1}{|V(G)|} \sum_{v \in V(G)} h^{(L)}_v \Big)$.
For a graphon $W \in \mathcal{W}$, an MPNN initializes a feature $h^{(0)}_x := \varphi_0 \in \mathbb{R}^{d_0}$ for $x \in [0,1]$. Then, for $t \in [L]$, we compute $h^{(t)}\colon [0,1] \to \mathbb{R}^{d_t}$ and the single graphon-level feature $h_W \in \mathbb{R}^d$ after $L$ layers by
$h^{(t)}_x := \varphi_t\Big( \int_{[0,1]} W(x,y)\, h^{(t-1)}_y \, d\lambda(y) \Big)$ and $h_W := \psi\Big( \int_{[0,1]} h^{(L)}_x \, d\lambda(x) \Big)$.
This generalizes the previous definition, i.e., for a graph $G$ and its (induced) graphon $W_G$, we have $h_G = h_{W_G}$ and $h^{(t)}_v = h^{(t)}_x$ for all $t \in [L]$, $v \in V(G)$, and $x \in I_v$; see Appendix C.1.

We now extend the definition of MPNNs to IDMs. While the above definition of $h^{(t)}$ depends on a specific graphon $W$, an IDM already carries the aggregated information of its neighborhood. Hence, the initial feature $h^{(0)}_\alpha := \varphi_0 \in \mathbb{R}^{d_0}$ for $\alpha \in \mathbb{M}_0$ and the maps $h^{(t)}\colon \mathbb{M}_t \to \mathbb{R}^{d_t}$ are defined for all IDMs at once. The intuition is that MPNNs cannot assign, at layer $t$, different feature values to nodes that have the same color at step $t$ of the graphon 1-WL algorithm. Hence, it is enough to only consider an assignment between colors and feature values, and not consider the graph/graphon structure directly. Then, for a DIDM $\nu \in \mathcal{P}(\mathbb{M}_L)$, we define the single DIDM-level feature $h_\nu \in \mathbb{R}^d$. Formally, we let
$h^{(t)}_\alpha := \varphi_t\Big( \int_{\mathbb{M}_{t-1}} h^{(t-1)} \, d\alpha \Big)$ and $h_\nu := \psi\Big( \int_{\mathbb{M}_L} h^{(L)} \, d\nu \Big)$.
That is, messages are aggregated via the IDM itself. In addition to $h^{(t)}\colon \mathbb{M}_t \to \mathbb{R}^{d_t}$ and $h_\bullet\colon \mathcal{P}(\mathbb{M}_L) \to \mathbb{R}^d$, we define $h^{(t)}\colon \mathbb{M} \to \mathbb{R}^{d_t}$ and $h_\bullet\colon \mathcal{P}(\mathbb{M}) \to \mathbb{R}^d$ by setting $h^{(t)}_\alpha := h^{(t)}_{p_{\infty,t}(\alpha)}$ for every $\alpha \in \mathbb{M}$ and $h_\nu := h_{(p_{\infty,L})_* \nu}$ for every $\nu \in \mathcal{P}(\mathbb{M})$; it will always be clear from the context which of these functions we mean. These definitions of MPNNs on IDMs extend the previous definitions of MPNNs on graphons via the following identities. For a graphon $W \in \mathcal{W}$, we have $h_W = h_{\nu_{W,L}} = h_{\nu_W}$ and $h^{(t)}_x = h^{(t)}_{i_{W,t}(x)} = h^{(t)}_{i_W(x)}$ for almost every $x \in [0,1]$; see Appendix C.1. That is, the feature values of an MPNN on a graphon $W$ are equal to the feature values of that MPNN on the (D)IDM computed by the 1-WL on $W$.
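The graph-level formulas above (sum aggregation normalized by $|V(G)|$, followed by a mean readout) are straightforward to implement. The following NumPy sketch is our own illustration, with a fixed random linear-plus-ReLU map standing in for a learned Lipschitz update $\varphi_t$ and the identity standing in for $\psi$; it is not the paper's reference implementation. Note that $(AH)/n$ is exactly the integral $\int_{[0,1]} W_G(x, \cdot)\, h^{(t-1)} \, d\lambda$ evaluated on the step-function graphon $W_G$, which is why the graph and graphon definitions agree.

```python
import numpy as np

def mpnn_forward(A, phis, psi):
    """Compute h_G as defined above: h_v^{(0)} = phi_0 (a constant vector),
    h_v^{(t)} = phi_t( (1/n) * sum_{u in N(v)} h_u^{(t-1)} ),
    h_G = psi( (1/n) * sum_v h_v^{(L)} ).
    `phis[0]` is the constant initial feature; `phis[1:]` and `psi` are callables."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    H = np.tile(phis[0], (n, 1))                 # h^{(0)}: every vertex gets phi_0
    for phi_t in phis[1:]:
        H = phi_t((A @ H) / n)                   # normalized sum over neighbors
    return psi(H.mean(axis=0))                   # graph-level readout

# A toy 1-layer model with a Lipschitz update (ReLU of a fixed random linear map).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
phis = [np.ones(4), lambda Z: np.maximum(Z @ W1, 0.0)]
psi = lambda z: z                                # identity readout

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])  # a path on 3 vertices
print(mpnn_forward(A, phis, psi))
```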
We call a tuple $\varphi$ as defined above an ($L$-layer) MPNN model if $\varphi_t$ is Lipschitz continuous on $\{\int_{\mathbb{M}_{t-1}} h^{(t-1)} \, d\alpha \mid \alpha \in \mathbb{M}_t\}$ for every $t \in [L]$, and we call $\psi$ as defined above Lipschitz if it is Lipschitz continuous on $\{\int_{\mathbb{M}_L} h^{(L)} \, d\nu \mid \nu \in \mathcal{P}(\mathbb{M}_L)\}$. We use $\|\cdot\|_L$ to denote the Lipschitz constants on these sets. In this paper, $\varphi$ and $\psi$ always denote an MPNN model and a Lipschitz function, respectively. We use the term $\infty$-layer MPNN model to refer to an $L$-layer MPNN model for an arbitrary $L$.

3 Metrics on iterated degree measures

Chen et al. [26] recently introduced the Weisfeiler-Leman distance, a polynomial-time computable pseudometric on graphs combining the 1-WL test with the well-known Wasserstein metric from optimal transport [112], where their approach resembles that of iterated degree measures as introduced by Grebík and Rocha [52]. To use metrics from optimal transport, Chen et al. [26] resorted to mean aggregation instead of sum aggregation to obtain probability measures instead of finite measures with total mass at most one. Using mean aggregation, however, is different from the 1-WL test, which relies on sum aggregation. That is, sum aggregation allows the algorithm to start with a constant coloring, something impossible with mean aggregation, where a constant initial coloring would potentially lead to a constant coloring throughout. Chen et al. [26] circumvented this problem by encoding vertex degrees and the total number of vertices in the initial coloring. Here, we show that the Prokhorov metric [99] and an unbalanced variant of the Wasserstein metric can be beautifully integrated into the theory of iterated degree measures, eliminating the need to work around the limits of mean aggregation. Both metrics metrize the weak topology, which is precisely the topology the space $\mathbb{M}_h$ of IDMs is endowed with; see Section 2.

In modern-day literature, the Prokhorov metric is usually only defined for probability measures [35, Section 11.3], yet the original definition by Prokhorov [99] already was for finite measures. That is, let $(S, d)$ be a complete separable metric space with Borel $\sigma$-algebra $\mathcal{B}$. For a subset $A \subseteq S$ and $\varepsilon \ge 0$, let $A^\varepsilon := \{y \in S \mid d(x,y) < \varepsilon \text{ for some } x \in A\}$, and define the Prokhorov metric $\mathsf{P}$ on $\mathcal{M}_{\le 1}(S)$ by
$\mathsf{P}(\mu, \nu) := \inf\{\varepsilon > 0 \mid \mu(A) \le \nu(A^\varepsilon) + \varepsilon \text{ and } \nu(A) \le \mu(A^\varepsilon) + \varepsilon \text{ for every } A \in \mathcal{B}\}$.
As the name suggests, $\mathsf{P}$ is a metric on $\mathcal{M}_{\le 1}(S)$ [99, Section 1.4], and moreover, convergence in $\mathsf{P}$ is equivalent to convergence in the weak topology [99, Theorem 1.11]. For the unbalanced Wasserstein metric, let $\mu, \nu \in \mathcal{M}_{\le 1}(S)$, where we assume $\|\nu\| \le \|\mu\|$ without loss of generality, and define
$\mathsf{W}(\mu, \nu) := \|\mu\| - \|\nu\| + \inf_{\gamma \in M(\mu,\nu)} \int_{S \times S} d(x,y) \, d\gamma(x,y)$,
where $M(\mu,\nu)$ is the set of all measures $\gamma \in \mathcal{M}_{\le 1}(S \times S)$ such that $(p_1)_*\gamma \le \mu$ and $(p_2)_*\gamma = \nu$. Here, $p_1$ and $p_2$ are the projections from $S \times S$ onto $S$ to the first and the second component, respectively, i.e., $(p_1)_*\gamma(A) = \gamma(A \times S)$ and $(p_2)_*\gamma(A) = \gamma(S \times A)$. We prove that $\mathsf{W}$ is a well-defined metric on $\mathcal{M}_{\le 1}(S, \mathcal{B})$ that coincides with the Wasserstein distance [35, Section 11.8] on probability measures. Furthermore, it satisfies $\mathsf{W}(\mu,\nu) \le 2\,\mathsf{P}(\mu,\nu) \le 4\sqrt{\mathsf{W}(\mu,\nu)}$ for all $\mu, \nu \in \mathcal{M}_{\le 1}(S)$, which implies that it metrizes the weak topology; see Appendix C.2.

The metric $\mathsf{P}$ is used to define the metric $d_{\mathsf{P},h}$ on $\mathbb{M}_h$ for $0 \le h \le \infty$ as follows. Let $d_{\mathsf{P},0}$ be the trivial metric on the one-point space $\mathbb{M}_0$ and, for $h \ge 0$, inductively let $d_{\mathsf{P},h+1}$ be the Prokhorov metric on $\mathbb{M}_{h+1} = \mathcal{M}_{\le 1}(\mathbb{M}_h)$ induced by the metric space $(\mathbb{M}_h, d_{\mathsf{P},h})$. Then, define $d_{\mathsf{P},\infty}$ on $\mathbb{M}$ by setting $d_{\mathsf{P}}(\alpha, \beta) := d_{\mathsf{P},\infty}(\alpha, \beta) := \sup_{h \in \mathbb{N}} \tfrac{1}{h}\, d_{\mathsf{P},h}(\alpha_h, \beta_h)$ for $\alpha, \beta \in \mathbb{M}$. The factor of $1/h$ is included in this definition on purpose to ensure that $d_{\mathsf{P},\infty}$ metrizes the product topology and not the uniform topology. The metric $d_{\mathsf{W},h}$ on $\mathbb{M}_h$ is defined completely analogously via $\mathsf{W}$ instead of $\mathsf{P}$. The metrics $d_{\mathsf{P},h}$ and $d_{\mathsf{W},h}$ on $\mathbb{M}_h$ allow us, for example, to compare the IDM of a point in a graphon to the IDM of a point in another graphon. To compare two graphons' distributions on iterated degree measures, we let $\delta_{\mathsf{P},h}$ be the Prokhorov metric on $(\mathcal{P}(\mathbb{M}_h), d_{\mathsf{P},h})$ for $0 \le h \le \infty$ and again define $\delta_{\mathsf{W},h}$ analogously via the distance $\mathsf{W}$. We note that these metrics directly apply to graphons $U, W \in \mathcal{W}$ by simply comparing their DIDMs $\nu_{U,h}$ and $\nu_{W,h}$.
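To make these definitions concrete on finite graphs, the following sketch first computes the unbalanced Wasserstein metric $\mathsf{W}$ between two discrete measures by solving the defining linear program directly, and then unrolls the recursive construction of $d_{\mathsf{W},h}$ and $\delta_{\mathsf{W},h}$ for two graphs. This is a naive, unoptimized transcription for illustration only; it is not the polynomial-time min-cost-flow computation behind Theorem 2 below, and all helper names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def uw(mu, nu, D):
    """Unbalanced Wasserstein W(mu, nu) for discrete weight vectors mu, nu with cost
    matrix D: mass difference plus the cheapest coupling gamma >= 0 whose row sums
    are dominated by mu and whose column sums equal nu (arguments swapped if needed)."""
    if nu.sum() > mu.sum():                       # ensure ||nu|| <= ||mu||
        mu, nu, D = nu, mu, D.T
    m, k = D.shape
    A_ub = np.kron(np.eye(m), np.ones(k))         # row sums    <= mu
    A_eq = np.kron(np.ones(m), np.eye(k))         # column sums == nu
    res = linprog(D.reshape(-1), A_ub=A_ub, b_ub=mu, A_eq=A_eq, b_eq=nu, bounds=(0, None))
    return (mu.sum() - nu.sum()) + res.fun

def delta_W(A_G, A_H, h):
    """Unroll the recursion: D[v, w] holds d_{W,t} between the IDMs of vertex v of G
    and vertex w of H after t rounds; the return value is delta_{W,h}(G, H)."""
    A_G, A_H = np.asarray(A_G, float), np.asarray(A_H, float)
    n, m = len(A_G), len(A_H)
    D = np.zeros((n, m))                          # d_{W,0} is trivial
    for _ in range(h):
        D_new = np.zeros((n, m))
        for v in range(n):
            for w in range(m):
                # IDM of v after one more round: mass 1/n on the IDM of each neighbor.
                D_new[v, w] = uw(A_G[v] / n, A_H[w] / m, D)
        D = D_new
    # Compare the two DIDMs: uniform distributions over the vertices, ground distance D.
    return uw(np.full(n, 1.0 / n), np.full(m, 1.0 / m), D)

# Example: a triangle vs. a path on three vertices.
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(delta_W(tri, path, h=2))                    # ~0.296: strictly positive, 1-WL separates them
```

For two graphs that 1-WL cannot distinguish (e.g., two regular graphs of equal degree and order), the returned value is zero at every depth h, since their DIDMs coincide.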
Theorem 2. Let $0 \le h \le \infty$. The metrics $d_{\mathsf{P},h}$ and $d_{\mathsf{W},h}$ are well-defined and metrize the topology of $\mathbb{M}_h$. The metrics $\delta_{\mathsf{P},h}$ and $\delta_{\mathsf{W},h}$ are well-defined and metrize the topology of $\mathcal{P}(\mathbb{M}_h)$. Moreover, these metrics are computable on graphs in time polynomial in the size of the input graphs and $h$, up to an additive error of $\varepsilon$ in the case of $d_{\mathsf{W},\infty}$ and $\delta_{\mathsf{W},\infty}$.

While these metrics are polynomial-time computable, in Appendix C.2, we derive the same impractical upper bound of $O(h\, n^5 \log n)$ for $\delta_{\mathsf{W},h}$ as Chen et al. [26] get for their Weisfeiler-Leman distance and the even worse bound of $O(h\, n^7)$ for $\delta_{\mathsf{P},h}$. This means that these metrics are not suitable as a computational tool in practice. Theorem 1 hints that we can instead use easy-to-compute MPNNs to lower-bound these metrics, which leads to our experiments in Section 6.

MPNNs are Lipschitz in the metrics we defined, where the Lipschitz constant only depends on basic properties of the MPNN model. That is, if two graphons are close in our metrics, then the MPNN outputs for all MPNN models up to a specific Lipschitz constant are close. Formally, let $\varphi = (\varphi_i)_{i=0}^{L}$ be an MPNN model with $L$ layers, and for $t \in \{0, \dots, L\}$, let $\varphi^{\le t} := (\varphi_i)_{i=0}^{t}$. Then, we inductively define the Lipschitz constant $C_\varphi \ge 0$ of $\varphi$ by $C_{\varphi^{\le 0}} := 0$ and $C_{\varphi^{\le t}} := \|\varphi_t\|_L \big( \|h^{(t-1)}\|_\infty + C_{\varphi^{\le t-1}} \big)$ for $t > 0$. This essentially depends on the product of the Lipschitz constants of the functions in $\varphi$ and on the bounds for the MPNN output values, which are finite since a continuous function on a compact set attains its maximum. Including these bounds in the constant is necessary since we consider sum aggregation. That is, a constant function mapping all inputs to some $c \in \mathbb{R}$ has Lipschitz constant zero, but when integrated against measures of total mass zero and one, for example, the difference of the outputs is $c$. We define $C_{(\varphi,\psi)}$ for Lipschitz $\psi$ analogously by essentially viewing $(\varphi, \psi)$ as an MPNN model.

Lemma 3. Let $\varphi$ be an $L$-layer MPNN model for $L \in \mathbb{N}$ and let $\psi$ be Lipschitz. Then,
$\|h^{(L)}_\alpha - h^{(L)}_\beta\|_2 \le C_\varphi \, d_{\mathsf{W},L}(\alpha, \beta)$ and $\|h_\mu - h_\nu\|_2 \le C_{(\varphi,\psi)} \, \delta_{\mathsf{W},L}(\mu, \nu)$
for all $\alpha, \beta \in \mathbb{M}_L$ and all $\mu, \nu \in \mathcal{P}(\mathbb{M}_L)$, respectively. These inequalities also hold for $d_{\mathsf{W},\infty}$ and $\delta_{\mathsf{W},\infty}$ with an additional factor of $L$ in the Lipschitz constant.

4 Universality of message-passing graph neural networks

In this section, we prove a universal approximation theorem for MPNNs on IDMs and DIDMs, deriving our main result from it. For $0 \le L \le \infty$, let $\mathcal{N}^n_L \subseteq C(\mathbb{M}_L, \mathbb{R}^n)$ denote the set of all functions $h^{(L)}\colon \mathbb{M}_L \to \mathbb{R}^n$ for an $L$-layer MPNN model $\varphi$ with $d_L = n$. Similarly, let $\mathcal{NN}^n_L := \{h_\bullet \mid \varphi \text{ an } L\text{-layer MPNN model}, \ \psi\colon \mathbb{R}^{d_L} \to \mathbb{R}^n \text{ Lipschitz}\} \subseteq C(\mathcal{P}(\mathbb{M}_L), \mathbb{R}^n)$ be the set of all functions computed by an MPNN after a global readout. Our universal approximation theorem, Theorem 4, shows that all continuous functions on IDMs and DIDMs, i.e., functions on graphons that are invariant w.r.t. (the colors of) the 1-WL test, can be approximated by MPNNs.
Hence, our result extends the universal approximation result of Chen et al. [26] for measure Markov chains in two ways. First, measure Markov chains are restricted to finite spaces by definition, which is not the case for graphons and our universal approximation theorem. Second, the spaces $\mathbb{M}_L$ and $\mathcal{P}(\mathbb{M}_L)$ are compact, which means we obtain a universal approximation theorem for the whole space of graphons, including all graphs, not restricted to an artificially chosen compact subset.

Theorem 4. Let $0 \le L \le \infty$. Then, $\mathcal{N}^1_L$ is dense in $C(\mathbb{M}_L, \mathbb{R})$ and $\mathcal{NN}^1_L$ is dense in $C(\mathcal{P}(\mathbb{M}_L), \mathbb{R})$.

The proof of Theorem 4 is elegant and does not rely on encoding the 1-WL test as an MPNN. That is, it follows by inductive applications of the Stone-Weierstrass theorem [35, Theorem 2.4.11] combined with the definition of IDMs. It is strikingly similar to the proof of Grebík and Rocha [52] for a similar result concerning tree homomorphism densities; see Appendix D. While the second statement of Theorem 4, i.e., the graphon-level approximation, is interesting in its own right, the crux of Theorem 4 lies in its first statement, namely, that $\mathcal{N}^1_L$ is dense in $C(\mathbb{M}_L, \mathbb{R})$, immediately implying that the topology induced by MPNNs on $\mathcal{P}(\mathbb{M}_L)$ is the weak topology, i.e., the topology we endowed this space with in Section 2.

Corollary 5. Let $0 \le L \le \infty$ and $n > 0$. Let $\nu \in \mathcal{P}(\mathbb{M}_L)$ and let $(\nu_i)_i$ be a sequence with $\nu_i \in \mathcal{P}(\mathbb{M}_L)$. Then, $\nu_i \to \nu$ if and only if $h_{\nu_i} \to h_\nu$ for all $L$-layer MPNN models $\varphi$ and Lipschitz $\psi\colon \mathbb{R}^{d_L} \to \mathbb{R}^n$.

By combining standard compactness arguments with Theorem 2 and Corollary 5, we can now prove that two graphons are close in our metrics if and only if the output of all possible MPNNs, up to a specific constant and number of layers, is close. Formally, the forward direction of this equivalence is just Lemma 3, while the backward direction reads as follows.

Theorem 6. Let $n > 0$ be fixed. For every $\varepsilon > 0$, there are $L \in \mathbb{N}$, $C > 0$, and $\delta > 0$ such that, for all graphons $U$ and $W$, if $\|h_U - h_W\|_2 \le \delta$ for every $L'$-layer MPNN model $\varphi$ and Lipschitz $\psi\colon \mathbb{R}^{d_{L'}} \to \mathbb{R}^n$ with $L' \le L$ and $C_{(\varphi,\psi)} \le C$, then $\delta_{\mathsf{P}}(U, W) \le \varepsilon$.

We stress that the constants $L$, $C$, and $\delta$ in the theorem statement are independent of the graphons $U$ and $W$. The proof of Theorem 6 is simple: one assumes that the statement does not hold to obtain two sequences of counterexamples, which have to have convergent subsequences by compactness, and the limit graphons allow us to derive a contradiction. This establishes the equivalence between our metrics and MPNNs stated in Theorem 1. Analogous reasoning, together with the universality of tree homomorphism densities [52], cf. Appendix D, yields the equivalence between our metrics and tree homomorphism densities. Then, the missing equivalence to the tree distance follows from the result of Böker [18], which connects the tree distance to tree homomorphism densities. See Appendix C.4 for the formal statements of all these equivalences as epsilon-delta statements and their proofs.

5 Extension to graphons with signals

Our focus in this work lies on MPNNs' ability to distinguish the structure of graphons. However, the definitions of IDMs, MPNNs, and our metrics can be adapted to graphons with signals, i.e., graphons $W$ equipped with a measurable signal function $\ell\colon [0,1] \to K$, where $(K, d)$ is some fixed compact metric space. In the following, we briefly sketch how to do this. First, replace the one-point space $\mathbb{M}_0$ by $(K, d)$ and modify the 1-WL for graphons by setting $i_{(W,\ell),0} := \ell$, i.e., use the signal function as the initial coloring.
For $h > 0$, adapt the definition of the IDM spaces $\mathbb{M}_h$ and of the refinement rounds $i_{(W,\ell),h}$ of the 1-WL for graphons to include the previous IDM of a point in its new IDM, as in the original definition of Grebík and Rocha [52]. Omitting the previous IDM, as we have done before in our definition, is only reasonable if the initial coloring is constant. Then, since $\mathbb{M}_0 = (K, d)$ is a compact metric space, the spaces $\mathbb{M}_h$ are compact metrizable as before. Modify the definition of an MPNN model $\varphi$ such that $\varphi_0$ is a Lipschitz function $K \to \mathbb{R}^{d_0}$. Finally, adapt the definition of the metrics $d_{\mathsf{P},h}$ and $d_{\mathsf{W},h}$ by letting $d_{\mathsf{P},0} := d_{\mathsf{W},0} := d$ be the metric of the compact metric space $(K, d)$ and then adapting $d_{\mathsf{P},h}$ and $d_{\mathsf{W},h}$ for $h > 0$ to the modified definition of $\mathbb{M}_h$ by using the product metric. Then, the proofs of our results can be adapted to this more general setting. In particular, one can prove a universal approximation theorem since Lipschitz functions on $K$ are dense in $C(K)$.

6 Experimental evaluation

In the following, we investigate the applicability of our theory to real-world prediction tasks. Specifically, we answer the following questions.

Q1 To what extent do our graph metrics δP and δW act as a proxy for distances between MPNNs' vectorial representations?
Q2 Our theoretical results imply that untrained MPNNs can be as effective as their trained counterparts when using enough of them. Can untrained MPNNs remain competitive when only using a finite number of them (measured by the hidden dimensionality)?

The source code of all methods and evaluation protocols is available at https://github.com/nhuang37/finegrain_expressivity_GNN. We conducted all experiments on a server with 256 GB RAM and four NVIDIA RTX A5000 GPU cards.

Fine-grained expressivity comparisons of MPNNs. To answer Q1, we construct a graph sequence that converges in our graph metrics. Given an MPNN, we compute the sequence of its embeddings on such graphs and the corresponding embedding distance using the ℓ2-norm. Hence, comparing different MPNNs amounts to comparing the convergence rate of their Euclidean embedding distances with respect to the graph distances. Concretely, we simulate a sequence of 50 random graphs $\{G_i\}$ for $i \in [50]$, each with 30 vertices, using the stochastic block model, where $G_i \sim \mathrm{SBM}(p, q_i)$, with $p = 0.5$ and $q_i \in [0.1, 0.5]$ increasing equidistantly. Let $G$ denote the last graph in the sequence, and observe that $G$ is sampled from an Erdős-Rényi model, i.e., $G \sim \mathrm{ER}(p)$. For $i \in [50]$, we compute the Wasserstein distance $\delta_{\mathsf{W},h}(G_i, G)$ and the Euclidean distance $\|h_{G_i} - h_G\|_2$. (We compute $\delta_{\mathsf{W},h}$ via a min-cost-flow algorithm and terminate at most three iterations after the Weisfeiler-Leman colors stabilize.) For demonstration purposes, we compare two common MPNN layers, GIN [115] and GraphConv [89], using sum aggregation normalized by the graph's order, varying the hidden dimensions and the number of layers. Figure 2 visualizes their normalized embedding distance and normalized graph distance, with an increasing number of hidden dimensions from left to right. GIN (top row) and GraphConv (bottom row) produce more discriminative embeddings as the number of hidden dimensions increases, supporting Theorem 4. Each point corresponds to a different (untrained) MPNN. Note that we interpret the hidden dimension w (i.e., width) as concatenating w random MPNNs. We observe similar behavior when increasing the number of layers; see Figure 3 in the appendix. Untrained GraphConv embeddings are more robust than untrained GIN embeddings regarding the choice of hidden dimensions and number of layers.
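The following self-contained NumPy sketch mirrors this Q1 setup with an untrained random-weight MPNN of the normalized-sum form used throughout the paper: it samples the SBM sequence and reports the Euclidean embedding distance to the final (Erdős-Rényi) graph. It is only an illustration; the paper's experiments use GIN and GraphConv implementations, also track $\delta_{\mathsf{W},h}$, and average over many models, none of which is reproduced here, and all parameter choices below are ours.

```python
import numpy as np

def sample_sbm(n, p, q, rng):
    """Two equal blocks: within-block edge probability p, between-block probability q."""
    z = np.arange(n) % 2
    probs = np.where(z[:, None] == z[None, :], p, q)
    A = np.triu((rng.random((n, n)) < probs).astype(float), 1)
    return A + A.T

def random_mpnn_embedding(A, weights):
    """Untrained MPNN with normalized sum aggregation and mean readout,
    mirroring the earlier forward-pass sketch; `weights` are fixed random matrices."""
    n = A.shape[0]
    H = np.ones((n, weights[0].shape[0]))
    for Wt in weights:
        H = np.maximum(((A @ H) / n) @ Wt, 0.0)
    return H.mean(axis=0)

rng = np.random.default_rng(1)
dims = 64
weights = [rng.normal(size=(dims, dims)) / np.sqrt(dims) for _ in range(3)]

p, n = 0.5, 30
qs = np.linspace(0.1, 0.5, 50)
graphs = [sample_sbm(n, p, q, rng) for q in qs]
h_last = random_mpnn_embedding(graphs[-1], weights)   # q = p: an Erdos-Renyi graph
for q, A in zip(qs[::10], graphs[::10]):
    d = np.linalg.norm(random_mpnn_embedding(A, weights) - h_last)
    print(f"q = {q:.2f}  embedding distance to last graph = {d:.4f}")
```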
Figure 4 in the appendix shows the same experiments on the real-world dataset MUTAG, part of the TUDataset [90]. We observe that increasing the number of hidden dimensions improves performance. Nonetheless, increasing the number of layers seems to first improve and then degrade performance. This observation coincides with the downstream graph classification performance, as discussed in the next section.

Figure 2: MPNNs preserve graph distance better when increasing the number of hidden dimensions. Comparatively, untrained GIN embeddings are more sensitive than untrained GraphConv embeddings to changes in the number of hidden dimensions.

The surprising effectiveness of untrained MPNNs. To answer Q2, we compare popular MPNN architectures, i.e., GIN and GraphConv, with their untrained counterparts. For untrained MPNNs, we freeze their randomly initialized input and hidden layer weights and only optimize the output layer(s) used for the final prediction. We benchmark on a subset of the established TUDataset [90]. For each dataset, we run paired experiments of trained and untrained MPNNs on the same ten random (train/test) splits and 10-fold cross-validation splits, using the evaluation protocol outlined in Morris et al. [90]. We report the test accuracy with the 10-run standard deviation in Table 1 and the mean training time per epoch with standard deviation in Table 2 in the appendix.

Table 1: Untrained MPNNs show competitive performance with trained MPNNs given sufficiently large hidden dimensionality (3 layers, 512 hidden dimensions). To be consistent with our theory, we use standard architectures with sum aggregation, layer-wise 1/|V(G)| normalization, and mean pooling, denoted by MPNN-m. We report the mean accuracy ± standard deviation over ten data splits.

| Accuracy | MUTAG | IMDB-BINARY | IMDB-MULTI | NCI1 | PROTEINS | REDDIT-BINARY |
| --- | --- | --- | --- | --- | --- | --- |
| GIN-m (trained) | 79.01 ± 2.24 | 69.96 ± 1.43 | 46.29 ± 0.76 | 78.61 ± 0.34 | 73.51 ± 0.47 | 89.73 ± 0.37 |
| GIN-m (untrained) | 82.56 ± 3.12 | 70.70 ± 0.60 | 47.59 ± 0.95 | 77.82 ± 0.55 | 73.45 ± 0.30 | 82.32 ± 0.45 |
| GraphConv-m (trained) | 81.62 ± 2.08 | 59.14 ± 1.93 | 38.75 ± 1.62 | 63.28 ± 0.6 | 71.49 ± 0.67 | 82.4 ± 0.19 |
| GraphConv-m (untrained) | 78.03 ± 1.57 | 65.77 ± 1.32 | 43.29 ± 0.96 | 62.36 ± 0.45 | 71.83 ± 0.42 | 77.15 ± 0.29 |

Table 1 and Table 2 show that untrained MPNNs with sufficient hidden dimensionality perform competitively with trained MPNNs while being significantly faster, with 20%-46% time savings. As shown in Figure 6 in the appendix, increasing the hidden dimension (i.e., the number of MPNN models) improves the performance of untrained MPNNs. Our theory states that it is necessary to use all MPNNs to preserve graph distance, which is impossible to compute in practice. Nonetheless, our experiment shows that using enough of them suffices for graph classification tasks. We also investigate the effect of the number of layers, i.e., the number of 1-WL iterations in the IDM. As shown in Figure 7 in the appendix, increasing the number of layers first improves and then degrades untrained MPNNs' performance, which is likely due to the changes in MPNNs' ability to preserve graph distance observed in Figure 5 in the appendix.
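As a rough illustration of the untrained-MPNN protocol described above (randomly initialized, frozen message-passing layers; only the readout is optimized), here is a short PyTorch Geometric sketch. It is not the paper's MPNN-m implementation or evaluation protocol: hyperparameters are placeholders, and the paper's layer-wise 1/|V(G)| normalization inside the aggregation is only crudely approximated by rescaling each layer's output.

```python
import torch
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GraphConv, global_mean_pool

class UntrainedGNN(torch.nn.Module):
    """Random message-passing layers that stay frozen; only the readout is trained."""
    def __init__(self, in_dim, hidden=512, num_layers=3, num_classes=2):
        super().__init__()
        self.convs = torch.nn.ModuleList(
            [GraphConv(in_dim if i == 0 else hidden, hidden, aggr='add') for i in range(num_layers)]
        )
        for p in self.convs.parameters():
            p.requires_grad_(False)               # freeze the randomly initialized MPNN layers
        self.readout = torch.nn.Linear(hidden, num_classes)  # the only trained part

    def forward(self, x, edge_index, batch):
        n_per_graph = torch.bincount(batch).float()[batch].unsqueeze(-1)
        for conv in self.convs:
            # Rough proxy for 1/|V(G)| normalization: rescale each layer's output.
            x = torch.relu(conv(x, edge_index)) / n_per_graph
        return self.readout(global_mean_pool(x, batch))

dataset = TUDataset(root='data', name='MUTAG')
loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = UntrainedGNN(dataset.num_features, num_classes=dataset.num_classes)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)

for epoch in range(10):
    for data in loader:
        opt.zero_grad()
        out = model(data.x, data.edge_index, data.batch)
        loss = torch.nn.functional.cross_entropy(out, data.y)
        loss.backward()
        opt.step()
```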
7 Conclusion

This work devised a deeper understanding of MPNNs' capacity to capture graph structure, precisely determining when they learn similar vectorial representations. To this end, we developed a comprehensive theory of graph metrics on graphons, demonstrating that two graphons are close in our metrics if, and only if, the outputs of all possible MPNNs are close, offering a more nuanced understanding of their ability to capture graph structure similarity. In addition, we established a connection between the continuous extensions of 1-WL and MPNNs to graphons, the tree distance, and tree homomorphism densities. Our experimental study confirmed the validity of our theory in real-world prediction tasks. In summary, our work establishes the first rigorous connection between the similarity of graphs and their learned vectorial representations, paving the way for a more nuanced understanding of MPNNs' expressivity and robustness abilities and their connection to graph structure. Looking forward, future research could focus on extending all characterizations of Theorem 1 to graphons with signals. While we briefly sketched in Section 5 how to do this for MPNNs and the metrics we defined, this still presents a challenge, as the tree distance and tree homomorphism densities do not readily generalize to graphons with signals. A different direction could be an extension of our theory to different aggregation functions like max- or mean-aggregation. Since the 1-WL paradigm of summing over neighbors is crucial in our proofs, it is not clear how one would approach this. Additionally, further quantitative versions of the equivalences in Theorem 1, not resorting to epsilon-delta-style statements, and generalizing our results to the k-WL are interesting avenues for future exploration.

Acknowledgments and disclosure of funding

This research project was started at the BIRS 2022 Workshop "Deep Exploration of non-Euclidean Data with Geometric and Topological Representation Learning" held at the UBC Okanagan campus in Kelowna, B.C. Jan Böker is funded by the European Union (ERC, SymSim, 101054974). Views and opinions expressed are, however, those of the author only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. Ron Levie is partially funded by ISF (Israel Science Foundation) grant #1937/23. Ningyuan Huang is partially supported by the MINDS Data Science Fellowship from Johns Hopkins University. Soledad Villar is partially funded by the NSF-Simons Research Collaboration on the Mathematical and Scientific Foundations of Deep Learning (MoDL) (NSF DMS 2031985), NSF CISE 2212457, ONR N00014-22-1-2126, and an Amazon AI2AI Faculty Research Award. Christopher Morris is partially funded by a DFG Emmy Noether grant (468502433) and the RWTH Junior Principal Investigator Fellowship under Germany's Excellence Strategy.

References

[1] A. Aamand, J. Y. Chen, P. Indyk, S. Narayanan, R. Rubinfeld, N. Schiefer, S. Silwal, and T. Wagner. Exponentially improving the complexity of simulating the Weisfeiler-Lehman test with graph neural networks. Ar Xiv preprint, 2022. 20 [2] R. Abboud, I. I. Ceylan, M. Grohe, and T. Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Joint Conference on Artificial Intelligence, pages 2112 2118, 2021. 20 [3] V. Arvind, J. Köbler, G. Rattan, and O. Verbitsky. On the power of color refinement. In International Symposium on Fundamentals of Computation Theory, pages 339 350, 2015. 17 [4] V. Arvind, F. Fuhlbrück, J. Köbler, and O. Verbitsky. On Weisfeiler-Leman invariance: Subgraph counts and related graph properties.
In International Symposium on Fundamentals of Computation Theory, pages 111 125, 2019. 17 [5] A. Atserias and E. N. Maneva. Sherali-adams relaxations and indistinguishability in counting logics. SIAM Journal on Computing, 42(1):112 137, 2013. 17 [6] A. Atserias, L. Mancinska, D. E. Roberson, R. Sámal, S. Severini, and A. Varvitsiotis. Quantum and non-signalling graph isomorphisms. Journal of Combinatorial Theory, Series B, pages 289 328, 2019. 17 [7] W. Azizian and M. Lelarge. Characterizing the expressive power of invariant and equivariant graph neural networks. In International Conference on Learning Representations, 2021. 3, 20 [8] L. Babai. Lectures on graph isomorphism. University of Toronto, Department of Computer Science. Mimeographed lecture notes, October 1979, 1979. 17 [9] L. Babai. Graph isomorphism in quasipolynomial time. In Symposium on Theory of Computing, pages 684 697, 2016. 17 [10] L. Babai and L. Kucera. Canonical labelling of graphs in linear average time. In Symposium on Foundations of Computer Science, pages 39 46, 1979. 17 [11] A.-L. Barabasi and Z. N. Oltvai. Network biology: Understanding the cell s functional organization. Nature Reviews Genetics, 5(2):101 113, 2004. 1 [12] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. L. Reutter, and J. P. Silva. The logical expressiveness of graph neural networks. In International Conference on Learning Representations, 2020. 20 [13] P. Barceló, F. Geerts, J. L. Reutter, and M. Ryschkov. Graph neural networks with local graph parameters. In Advances in Neural Information Processing Systems, pages 25280 25293, 2021. 20 [14] I. I. Baskin, V. A. Palyulin, and N. S. Zefirov. A neural device for searching direct correlations between structures and properties of chemical compounds. Journal of Chemical Information and Computer Sciences, 37(4):715 721, 1997. 2 [15] C. Berkholz, P. S. Bonsma, and M. Grohe. Tight lower and upper bounds for the complexity of canonical colour refinement. Theory of Computing Systems, 60(4):581 614, 2017. 17 [16] P. Billingsley. Probability and Measure. Wiley, May 1995. 21 [17] V. I. Bogachev. Measure Theory. Springer Science & Business Media, Jan. 2007. 4, 28 [18] J. Böker. Graph Similarity and Homomorphism Densities. In 48th International Colloquium on Automata, Languages, and Programming, volume 198, pages 32:1 32:17, 2021. 2, 5, 8, 20, 34, 35 [19] J. Böker. Weisfeiler-Leman Indistinguishability of Graphons. Ar Xiv preprint, 2021. 4 [20] C. Borgs, J. T. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6): 1801 1851, Dec. 2008. 4, 20 [21] C. Borgs, J. T. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Convergent sequences of dense graphs II. Multiway cuts and statistical physics. Annals of Mathematics, 176(1):151 219, 2012. 4 [22] G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. Ar Xiv preprint, 2020. 20 [23] J. Bruna, W. Zaremba, A. Szlam, and Y. Le Cun. Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representation, 2014. 2 [24] J. Cai, M. Fürer, and N. Immerman. An optimal lower bound on the number of variables for graph identifications. Combinatorica, 12(4):389 410, 1992. 17 [25] I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré, and K. Murphy. Machine learning on graphs: A model and comprehensive taxonomy. Ar Xiv preprint, 2020. 
1 [26] S. Chen, S. Lim, F. Mémoli, Z. Wan, and Y. Wang. Weisfeiler-Lehman meets Gromov-Wasserstein. In International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 3371 3416, 2022. 2, 3, 6, 7, 8, 20, 25, 30, 43 [27] S. Chen, S. Lim, F. Mémoli, Z. Wan, and Y. Wang. The Weisfeiler-Lehman distance: Reinterpretation and connection with GNNs. Ar Xiv preprint, 2023. 20 [28] Z. Chen, S. Villar, L. Chen, and J. Bruna. On the equivalence between graph isomorphism testing and function approximation with gnns. In Advances in Neural Information Processing Systems, pages 15868 15876, 2019. 3, 20 [29] Z. Chen, L. Chen, S. Villar, and J. Bruna. Can graph neural networks count substructures? In Advances in Neural Information Processing Systems, 2020. 17 [30] Z. Chen, L. Chen, S. Villar, and J. Bruna. Can graph neural networks count substructures? Advances in neural information processing systems, 33:10383 10395, 2020. 20 [31] C.-Y. Chuang and S. Jegelka. Tree mover s distance: Bridging graph metrics and stability of graph neural networks. Ar Xiv preprint, 2022. 20 [32] G. Dasoulas, L. D. Santos, K. Scaman, and A. Virmaux. Coloring graph neural networks for node disambiguation. In International Joint Conference on Artificial Intelligence, pages 2126 2132, 2020. 20 [33] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3837 3845, 2016. 2 [34] H. Dell, M. Grohe, and G. Rattan. Lovász meets Weisfeiler and Leman. In International Colloquium on Automata, Languages, and Programming, pages 40:1 40:14, 2018. 17, 18 [35] R. M. Dudley. Real Analysis and Probability. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2 edition, 2002. 7, 8, 18, 19, 26, 27, 28, 30 [36] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224 2232, 2015. 2 [37] Z. Dvoˇrák. On recognizing graphs by numbers of homomorphisms. Journal of Graph Theory, 64 (4):330 342, 2010. 18 [38] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010. 1 [39] T. Eisner, B. Farkas, M. Haase, and R. Nagel. Operator Theoretic Aspects of Ergodic Theory. Springer, Nov. 2015. 5 [40] J. Elstrodt. Maßund Integrationstheorie. Springer, Feb. 2011. 23 [41] A. Frieze and R. Kannan. Quick Approximation to Matrices and Applications. Combinatorica, 19 (2):175 220, Feb. 1999. 4, 20 [42] M. Fürer. On the combinatorial power of the Weisfeiler-Lehman algorithm. In International Conference on Algorithms and Complexity, pages 260 271, 2017. 17 [43] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro. Convolutional neural network architectures for signals supported on graphs. IEEE Transactions on Signal Processing, 67(4):1034 1049, 2019. 2 [44] F. Gama, J. Bruna, and A. Ribeiro. Stability properties of graph neural networks. IEEE Transactions on Signal Processiong, 68:5680 5695, 2020. 1 [45] U. García-Palomares and M. Evarist Giné. On the linear programming approach to the optimality property of Prokhorov s distance. Journal of Mathematical Analysis and Applications, 60(3): 596 600, Oct. 1977. 24, 28, 29 [46] B. Garel and J.-C. Massé. 
Calculation of the Prokhorov distance by optimal quantization and maximum flow. ASt A Advances in Statistical Analysis, 93(1):73 88, Mar. 2009. 24 [47] F. Geerts. The expressive power of kth-order invariant graph networks. Ar Xiv preprint, 2020. 3, 20 [48] F. Geerts and J. L. Reutter. Expressiveness and approximation properties of graph neural networks. In International Conference on Learning Representations, 2022. 3, 20 [49] F. Geerts, F. Mazowiecki, and G. A. Pérez. Let s agree to degree: Comparing graph convolutional networks in the message-passing framework. In International Conference on Machine Learning, pages 3640 3649, 2021. 20 [50] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263 1272, 2017. 1, 2, 18 [51] C. Goller and A. Küchler. Learning task-dependent distributed representations by backpropagation through structure. In International Conference on Neural Networks, pages 347 352, 1996. 2 [52] J. Grebík and I. Rocha. Fractional Isomorphism of Graphons. Ar Xiv preprint, Feb. 2021. 1, 4, 5, 6, 8, 19, 34, 36 [53] M. Grohe. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Cambridge University Press, 2017. 17 [54] M. Grohe. Word2vec, Node2vec, Graph2vec, X2vec: Towards a theory of vector embeddings of structured data. In Symposium on Principles of Database Systems, pages 1 16, 2020. 18 [55] M. Grohe. The logic of graph neural networks. In Symposium on Logic in Computer Science, pages 1 17, 2021. 17 [56] M. Grohe. The descriptive complexity of graph neural networks. Ar Xiv preprint, abs/2303.04613, 2023. 20 [57] M. Grohe and M. Otto. Pebble games and linear equations. Journal of Symbolic Logic, 80(3): 797 844, 2015. 17 [58] M. Grohe, P. Schweitzer, and D. Wiebking. Deep Weisfeiler Leman. Ar Xiv preprint, 2020. 17 [59] M. Grohe, M. Lichter, and D. Neuen. The iteration number of the Weisfeiler-Leman algorithm. Co RR, abs/2301.13317, 2023. 17 [60] S. Günnemann. Graph neural networks: Adversarial robustness. In L. Wu, P. Cui, J. Pei, and L. Zhao, editors, Graph Neural Networks: Foundations, Frontiers, and Applications, pages 149 176. Springer, 2022. 1 [61] W. L. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024 1034, 2017. 2 [62] N. Immerman and E. Lander. Describing graphs: A first-order approach to graph canonization. In Complexity Theory Retrospective: In Honor of Juris Hartmanis on the Occasion of His Sixtieth Birthday, July 5, 1988, pages 59 81, 1990. 17 [63] A. S. Kechris. Classical Descriptive Set Theory. Springer, 1995. 4, 5 [64] N. Keriven, A. Bietti, and S. Vaiter. Convergence and stability of graph convolutional networks on large random graphs. In Advances in Neural Information Processing Systems, pages 21512 21523, 2020. 4 [65] N. Keriven, A. Bietti, and S. Vaiter. On the universality of graph neural networks on large random graphs. In Advances in Neural Information Processing Systems, 2021. 4 [66] S. Kiefer. Power and Limits of the Weisfeiler-Leman Algorithm. Ph D thesis, Department of Computer Science, RWTH Aachen University, 2020. 17 [67] S. Kiefer and B. D. Mc Kay. The iteration number of Colour Refinement. In International Colloquium on Automata, Languages, and Programming, pages 73:1 73:19, 2020. 17 [68] S. Kiefer and D. Neuen. The power of the Weisfeiler-Leman algorithm to decompose graphs. 
SIAM Journal on Discrete Mathematics, 36(1):252 298, 2022. 17 [69] S. Kiefer and P. Schweitzer. Upper bounds on the quantifier depth for graph differentiation in first order logic. In Symposium on Logic in Computer Science, pages 287 296, 2016. 17 [70] S. Kiefer, P. Schweitzer, and E. Selman. Graphs identified by logics with counting. In International Symposium on Mathematical Foundations of Computer Science, pages 319 330, 2015. 17 [71] S. Kiefer, I. Ponomarenko, and P. Schweitzer. The Weisfeiler-Leman dimension of planar graphs is at most 3. Journal of the ACM, 66(6):44:1 44:31, 2019. 17 [72] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. 18 [73] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. 2 [74] D. B. Kireev. Chemnet: A novel neural network based method for graph/property mapping. Journal of Chemical Information and Computer Sciences, 35(2):175 180, 1995. 2 [75] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1): 97 109, 2019. 2 [76] M. Lichter, I. Ponomarenko, and P. Schweitzer. Walk refinement, walk logic, and the iteration number of the Weisfeiler-Leman algorithm. In Symposium on Logic in Computer Science, pages 1 13, 2019. 17 [77] L. Lovász and B. Szegedy. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96(6):933 957, Nov. 2006. 4, 20 [78] L. Lovász. Large Networks and Graph Limits., volume 60 of Colloquium Publications. American Mathematical Society, 2012. 1, 4, 5, 19 [79] T. Maehara and H. NT. A simple proof of the universality of invariant/equivariant graph neural networks. Ar Xiv preprint, 2019. 3, 20 [80] P. N. Malkin. Sherali adams relaxations of graph isomorphism polytopes. Discrete Optimization, pages 73 97, 2014. 17 [81] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. In Advances in Neural Information Processing Systems, pages 2153 2164, 2019. 3, 20 [82] S. Maskey, R. Levie, Y. Lee, and G. Kutyniok. Generalization analysis of message passing neural networks on large random graphs. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. 1, 4 [83] S. Maskey, R. Levie, and G. Kutyniok. Transferability of graph neural networks: An extended graphon approach. Applied and Computational Harmonic Analysis, 63:48 83, 2023. 4 [84] C. Merkwirth and T. Lengauer. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159 1168, 2005. 2 [85] A. Micheli. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 20(3):498 511, 2009. 2 [86] A. Micheli and A. S. Sestito. A new neural network model for contextual processing of graphs. In Italian Workshop on Neural Nets Neural Nets and International Workshop on Natural and Artificial Immune Systems, pages 10 17, 2005. 2 [87] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5425 5434, 2017. 2 [88] C. Morris. The power of the Weisfeiler-Leman algorithm for machine learning with graphs. 
In International Joint Conference on Artificial Intelligence, pages 4543–4550, 2021. 3
[89] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, pages 4602–4609, 2019. 1, 3, 9, 20
[90] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. CoRR, 2020. 9, 43
[91] C. Morris, G. Rattan, and P. Mutzel. Weisfeiler and Leman go sparse: Towards higher-order graph embeddings. In Advances in Neural Information Processing Systems, 2020. 3, 20
[92] C. Morris, Y. Lipman, H. Maron, B. Rieck, N. M. Kriege, M. Grohe, M. Fey, and K. Borgwardt. Weisfeiler and Leman go machine learning: The story so far. ArXiv preprint, 2021. 1, 18, 20
[93] C. Morris, G. Rattan, S. Kiefer, and S. Ravanbakhsh. SpeqNets: Sparsity-aware permutation-equivariant graph networks. In International Conference on Machine Learning, pages 16017–16042, 2022. 3, 20
[94] C. Morris, F. Geerts, J. Tönshoff, and M. Grohe. WL meet VC. ArXiv preprint, abs/2301.11039, 2023. 1
[95] H. Nguyen and T. Maehara. Graph homomorphism convolution. In International Conference on Machine Learning, pages 7306–7316, 2020. 20
[96] J. B. Orlin. Max flows in O(nm) time, or better. In Symposium on Theory of Computing, pages 765–774, 2013. doi: 10.1145/2488608.2488705. 26
[97] O. Pele and M. Werman. A Linear Time Histogram Metric for Improved SIFT Matching. In Computer Vision - ECCV 2008, pages 495–508, 2008. ISBN 978-3-540-88690-7. 2, 26
[98] O. Pele and M. Werman. Fast and robust Earth Mover's Distances. In IEEE International Conference on Computer Vision, pages 460–467, Sept. 2009. 26, 30
[99] Y. V. Prokhorov. Convergence of Random Processes and Limit Theorems in Probability Theory. Theory of Probability & Its Applications, 1956. 2, 6, 7, 23
[100] O. Puny, D. Lim, B. T. Kiani, H. Maron, and Y. Lipman. Equivariant polynomials for graph neural networks. ArXiv preprint, abs/2302.11556, 2023. 20
[101] P. Reiser, M. Neubert, A. Eberhard, L. Torresi, C. Zhou, C. Shao, H. Metni, C. van Hoesel, H. Schopmans, T. Sommer, et al. Graph neural networks for materials science and chemistry. Communications Materials, 3(1):93, 2022. 1
[102] E. Rosenbluth, J. Tönshoff, and M. Grohe. Some might say all you need is sum. ArXiv preprint, abs/2302.11603, 2023. 20
[103] L. Ruiz, L. F. O. Chamon, and A. Ribeiro. Graphon signal processing. IEEE Transactions on Signal Processing, 69:4961–4976, 2021. 4
[104] R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In SIAM International Conference on Data Mining, pages 333–341, 2021. 20
[105] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. 2, 18
[106] G. Schay. Nearest Random Variables with Given Distributions. The Annals of Probability, 2(1):163–166, 1974. 24, 28, 29
[107] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 29–38, 2017. 1
[108] A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3):714–735, 1997. 2
[109] G. Tinhofer. Graph isomorphism and theorems of Birkhoff type. Computing, 36(4):285–300, Dec. 1986. 18
[110] G. Tinhofer.
A note on compact graphs. Discrete Applied Mathematics, 30(2):253–264, Feb. 1991. 18
[111] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. 2
[112] C. Villani. Optimal Transport: Old and New. Springer Science & Business Media, Oct. 2008. 6
[113] B. Weisfeiler. On Construction and Identification of Graphs. Springer, 1976. 1, 17
[114] B. Weisfeiler and A. Leman. The reduction of a graph to canonical form and the algebra which appears therein. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968. English translation by G. Ryabov is available at https://www.iti.zcu.cz/wl2018/pdf/wl_paper_translation.pdf. 1, 17
[115] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. 1, 3, 9, 20

A Additional background

Here, we provide additional background.

A.1 The Weisfeiler Leman algorithm and related topics

The 1-WL or color refinement is a well-studied heuristic for the graph isomorphism problem, originally proposed by Weisfeiler and Leman [114]. (Strictly speaking, the 1-WL and color refinement are two different algorithms. That is, the 1-WL considers neighbors and non-neighbors to update the coloring, resulting in a slightly higher expressive power when distinguishing vertices in a given graph; see [55] for details. For brevity, we consider both algorithms to be equivalent.) Intuitively, the algorithm determines if two graphs are non-isomorphic by iteratively coloring or labeling vertices. Given an initial coloring or labeling of the vertices of both graphs, e.g., their degree or application-specific information, in each iteration, two vertices with the same label get different labels if the number of identically labeled neighbors is unequal. These labels induce a vertex partition, and the algorithm terminates when, at some iteration, it no longer refines the current partition, i.e., when a stable coloring or stable partition is obtained. Then, if the number of vertices annotated with a specific label differs between the two graphs, we can conclude that the two graphs are not isomorphic. It is easy to see that the algorithm cannot distinguish all non-isomorphic graphs [24]. Nonetheless, it is a powerful heuristic that can successfully test isomorphism for a broad class of graphs [10].

Formally, let $G = (V(G), E(G), \ell)$ be a labeled graph, i.e., a graph equipped with a (vertex-)label function $\ell \colon V(G) \to \mathbb{N}$. In each iteration $t > 0$, the 1-WL computes a vertex coloring $C^1_t \colon V(G) \to \mathbb{N}$, depending on the coloring of the neighbors. That is, in iteration $t > 0$, we set
$$C^1_t(v) := \mathsf{RELABEL}\bigl(C^1_{t-1}(v), \{\{ C^1_{t-1}(u) \mid u \in N(v) \}\}\bigr),$$
for all vertices $v \in V(G)$, where $\mathsf{RELABEL}$ injectively maps the above pair to a unique natural number, which has not been used in previous iterations. In iteration 0, the coloring is $C^1_0 := \ell$. To test if two graphs G and H are non-isomorphic, we run the above algorithm in parallel on both graphs. If the two graphs have a different number of vertices colored $c \in \mathbb{N}$ at some iteration, the 1-WL distinguishes the graphs as non-isomorphic. Moreover, if the number of colors between two iterations $t$ and $t+1$ does not change, i.e., the cardinalities of the images of $C^1_t$ and $C^1_{t+1}$ are equal, or, equivalently, $C^1_t(v) = C^1_t(w) \iff C^1_{t+1}(v) = C^1_{t+1}(w)$ for all vertices $v$ and $w$ in $V(G)$, the algorithm terminates. For such $t$, we define the stable coloring $C^1_\infty(v) := C^1_t(v)$ for $v \in V(G)$. The stable coloring is reached after at most $\max\{|V(G)|, |V(H)|\}$ iterations [53]. The algorithm can be generalized to the k-WL, leading to a strict boost in expressive power in distinguishing non-isomorphic graphs.
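The refinement above can be made concrete in a few lines of code. The following is a minimal sketch (our own illustration, not code from the paper) in which graphs are given as adjacency lists and a dictionary plays the role of RELABEL; to compare two graphs, the refinement is run on their disjoint union so that the resulting colors are comparable.

```python
from collections import Counter

def color_refinement(adj, labels=None):
    """1-WL/color refinement. adj: dict vertex -> iterable of neighbors;
    labels: optional dict of initial labels. Returns the stable coloring."""
    colors = {v: (0 if labels is None else labels[v]) for v in adj}
    for _ in range(len(adj)):  # the stable coloring is reached after at most |V| rounds
        # New color = (old color, multiset of neighbors' old colors), injectively relabeled.
        sigs = {v: (colors[v], tuple(sorted(Counter(colors[u] for u in adj[v]).items())))
                for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: relabel[sigs[v]] for v in adj}
        if len(set(new_colors.values())) == len(set(colors.values())):
            break  # the partition was not refined: stable coloring reached
        colors = new_colors
    return colors

def wl_distinguishes(adj_G, adj_H):
    """Run 1-WL on the disjoint union of G and H and compare color histograms."""
    union = {("G", v): [("G", u) for u in adj_G[v]] for v in adj_G}
    union.update({("H", v): [("H", u) for u in adj_H[v]] for v in adj_H})
    colors = color_refinement(union)
    hist_G = Counter(c for (side, _), c in colors.items() if side == "G")
    hist_H = Counter(c for (side, _), c in colors.items() if side == "H")
    return hist_G != hist_H  # True means the graphs are provably non-isomorphic
```

For example, on the six-cycle versus the disjoint union of two triangles, `wl_distinguishes` returns False, reflecting the well-known fact that the 1-WL cannot separate these two graphs.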
Properties of the Weisfeiler Leman algorithm

The Weisfeiler Leman algorithm constitutes one of the earliest and most natural approaches to isomorphism testing [113, 114], and the theory community has heavily investigated it over the last few decades [53]. Moreover, the fundamental nature of the k-WL is evident from various connections to other fields such as logic, optimization, counting complexity, and quantum computing. The power and limitations of the k-WL can be neatly characterized in terms of logic and descriptive complexity [8, 62], Sherali-Adams relaxations of the natural integer linear optimization problem for the graph isomorphism problem [5, 57, 80], homomorphism counts [34], quantum isomorphism games [6], and graph decompositions [68]. In their seminal paper, Cai et al. [24] showed that, for each k, there exists a pair of non-isomorphic graphs of size O(k) that the k-WL does not distinguish. Kiefer [66] thoroughly surveys more background and related results concerning the expressive power of the k-WL. For k = 1, the power of the algorithm has been completely characterized [3, 70]. Moreover, upper bounds on the running time [15] and the number of iterations for k = 1 [67], for the folklore k = 2 [69, 76], and for general k [59] have been shown. For k in {1, 2}, Arvind et al. [4] studied the abilities of the folklore k-WL to detect and count fixed subgraphs, extending the work of Fürer [42]. The former was refined in [29]. Kiefer et al. [71] showed that the folklore 3-WL completely captures the structure of planar graphs. The algorithm for logarithmic k plays a prominent role in the recent result of [9] improving the best-known running time for the graph isomorphism problem. Recently, Grohe et al. [58] introduced the framework of Deep Weisfeiler Leman algorithms, which allows the design of more powerful graph isomorphism tests than Weisfeiler Leman type algorithms. Finally, the emerging connections between the Weisfeiler Leman paradigm and graph learning are described in two recent surveys [54, 92].

Fractional isomorphisms

A matrix $X \in \mathbb{R}^{I \times J}$ is called doubly stochastic if all entries are non-negative and we have $\sum_{i \in I} X_{ij} = 1$ for every $j \in J$ and $\sum_{j \in J} X_{ij} = 1$ for every $i \in I$, i.e., all columns and rows sum to one. Note that such a matrix is necessarily square. A fractional isomorphism between graphs G and H is a doubly stochastic matrix $X \in \mathbb{R}^{V(G) \times V(H)}$ such that $A_G X = X A_H$, where $A_G$ and $A_H$ are the adjacency matrices of G and H, respectively. G and H are called fractionally isomorphic if there is a fractional isomorphism between G and H. A result due to Tinhofer [109, 110] states that the 1-WL test does not distinguish G and H if and only if they are fractionally isomorphic.

Graph homomorphisms

Given two graphs F and G, a homomorphism from F to G is a mapping $h \colon V(F) \to V(G)$ such that $h(u)h(v) \in E(G)$ for every $uv \in E(F)$. Let $\hom(F, G)$ denote the number of homomorphisms from F to G. Then, $t(F, G) := \hom(F, G)/|V(G)|^{|V(F)|}$ is called the homomorphism density of F in G. We have $t(F, G) = t(F, W_G)$ for the graphon $W_G$ induced by G. A result due to Dvořák [37] and Dell et al. [34] states that the 1-WL test does not distinguish G and H if and only if we have $t(T, G) = t(T, H)$ for every tree T.
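For very small graphs, hom(F, G) and t(F, G) can be evaluated directly from the definition by enumerating all maps from V(F) to V(G). The following brute-force sketch (our own illustration; the graph encodings and names are ours) is exponential in |V(F)| and only meant to make the definition concrete.

```python
from itertools import product

def hom(F_vertices, F_edges, G_adj):
    """Count homomorphisms from F to G by brute force over all maps V(F) -> V(G)."""
    count = 0
    for image in product(list(G_adj), repeat=len(F_vertices)):
        h = dict(zip(F_vertices, image))
        # Every edge of F must be mapped onto an edge of G.
        if all(h[v] in G_adj[h[u]] for u, v in F_edges):
            count += 1
    return count

def hom_density(F_vertices, F_edges, G_adj):
    """t(F, G) = hom(F, G) / |V(G)|^|V(F)|."""
    return hom(F_vertices, F_edges, G_adj) / len(G_adj) ** len(F_vertices)

# Example: the path on three vertices (a tree) in the 4-cycle: 16 / 4**3 = 0.25.
C4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(hom_density([0, 1, 2], [(0, 1), (1, 2)], C4))
```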
A.2 Message-passing graph neural networks

Intuitively, MPNNs learn a vectorial representation, i.e., a d-dimensional real-valued vector, representing each vertex in a graph by aggregating information from neighboring vertices. Formally, let G be a graph with initial vertex features $h^{(0)}_v \in \mathbb{R}^d$ for $v \in V(G)$. An MPNN architecture consists of a stack of neural network layers, i.e., a composition of permutation-invariant parameterized functions. Similarly to the 1-WL, each layer aggregates local neighborhood information, i.e., the neighbors' features, around each vertex and then passes this aggregated information on to the next layer. Following Gilmer et al. [50] and Scarselli et al. [105], in each layer $t > 0$, we compute vertex features
$$h^{(t)}_v := \mathsf{UPD}^{(t)}\Bigl(h^{(t-1)}_v, \mathsf{AGG}^{(t)}\bigl(\{\{ h^{(t-1)}_u \mid u \in N(v) \}\}\bigr)\Bigr) \in \mathbb{R}^d,$$
where $\mathsf{UPD}^{(t)}$ and $\mathsf{AGG}^{(t)}$ may be differentiable parameterized functions, e.g., neural networks. (Strictly speaking, Gilmer et al. [50] consider a slightly more general setting in which vertex features are computed by $h^{(t)}_v := \mathsf{UPD}^{(t)}\bigl(h^{(t-1)}_v, \mathsf{AGG}^{(t)}(\{\{ (h^{(t-1)}_v, h^{(t-1)}_u, l(v, u)) \mid u \in N_G(v) \}\})\bigr)$.) In the case of graph-level tasks, e.g., graph classification, one uses
$$h_G := \mathsf{READOUT}\bigl(\{\{ h^{(L)}_v \mid v \in V(G) \}\}\bigr) \in \mathbb{R}^d,$$
to compute a single vectorial representation based on learned vertex features after iteration L. Again, $\mathsf{READOUT}$ may be a differentiable parameterized function. To adapt the parameters of the above three functions, they are optimized end-to-end, usually through a variant of stochastic gradient descent, e.g., [72], together with the parameters of a neural network used for classification or regression.
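As a concrete instance of the update and readout equations above, the following NumPy sketch (our own illustrative code) uses sum aggregation, a ReLU update, and mean readout; these are one possible choice of AGG, UPD, and READOUT, not the paper's specific architecture.

```python
import numpy as np

def mpnn_forward(A, X, layers):
    """A simple MPNN with sum aggregation.

    A: (n, n) adjacency matrix; X: (n, d0) initial vertex features;
    layers: list of (W_self, W_agg, b) parameter triples.
    Returns the graph-level representation h_G via mean readout."""
    H = X
    for W_self, W_agg, b in layers:
        M = A @ H                                        # AGG: sum over neighbors' features
        H = np.maximum(H @ W_self + M @ W_agg + b, 0.0)  # UPD: ReLU of a linear map
    return H.mean(axis=0)                                # READOUT: mean over vertices

# Tiny example: a triangle with constant initial features and one random layer.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
X = np.ones((3, 4))
layers = [(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), np.zeros(8))]
print(mpnn_forward(A, X, layers).shape)  # (8,)
```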
A.3 Topology

We recall some basics from Dudley [35]. Given a set $X$, a topology on $X$ is a collection $\mathcal{T} \subseteq 2^X$ of subsets of $X$ with $\emptyset, X \in \mathcal{T}$ such that $\mathcal{T}$ is closed under finite intersections and arbitrary unions. Then, the elements of $\mathcal{T}$ are called open sets and $(X, \mathcal{T})$ is called a topological space. Given two topological spaces $(X, \mathcal{T})$ and $(Y, \mathcal{U})$, a function $f \colon X \to Y$ is called continuous if $f^{-1}(U) \in \mathcal{T}$ for every $U \in \mathcal{U}$. If $S$ and $I$ are any sets and $f_i \colon S \to X_i$ a function, where $(X_i, \mathcal{T}_i)$ is a topological space, for every $i \in I$, then there is a smallest topology $\mathcal{T}$ on $S$ for which every $f_i$ is continuous, called the topology generated by the $f_i$. Given a topological space $(X, \mathcal{T})$, one can define nets and the notion of convergence of a net; nets do in general topological spaces what sequences do in metric spaces. Given a pseudometric space $(X, d)$, the collection $\mathcal{T}$ of all unions of open balls $B(x, r) := \{y \in X \mid d(x, y) < r\}$, where $x \in X$ and $r > 0$, is a topology, and $\mathcal{T}$ is called metrizable and metrized by $d$. If $\mathcal{T}$ is metrizable, then a set $U \subseteq X$ is open if and only if, for every $x \in U$ and every sequence $x_i \to x$, there is some $j$ with $x_i \in U$ for every $i \geq j$. Hence, two metrizable topologies are equal if they have the same convergent sequences. A topological space $(X, \mathcal{T})$ is called Hausdorff if, for all $x \neq y$ in $X$, there are open sets $U$ and $V$ with $x \in U$, $y \in V$, and $U \cap V = \emptyset$. A pseudometric space $(X, d)$ is Hausdorff if and only if $d$ is a metric. A topological space $(X, \mathcal{T})$ is called separable if $X$ has a countable dense subset, where a subset of $X$ is called dense if its closure is $X$. A topological space $(X, \mathcal{T})$ is called compact if, whenever $\mathcal{U} \subseteq \mathcal{T}$ and $X = \bigcup \mathcal{U}$, there is a finite $\mathcal{V} \subseteq \mathcal{U}$ such that $X = \bigcup \mathcal{V}$. If $(X, \mathcal{T})$ is a compact topological space and $f \colon X \to Y$ a surjective continuous function from $X$ to another topological space $(Y, \mathcal{U})$, then $(Y, \mathcal{U})$ is compact. By Tychonoff's theorem, the product of any collection of compact topological spaces is compact w.r.t. the product topology. A metric space $(X, d)$ is compact if and only if $(X, d)$ is complete and totally bounded. Every compact metric space is separable.

Given a set $X$, a σ-algebra is a collection $\mathcal{A} \subseteq 2^X$ of subsets of $X$ with $\emptyset, X \in \mathcal{A}$ such that $\mathcal{A}$ is closed under complements and countable unions. Then, $(X, \mathcal{A})$ is called a measurable space. For every collection $\mathcal{C} \subseteq 2^X$, there is a smallest σ-algebra including $\mathcal{C}$, called the σ-algebra generated by $\mathcal{C}$. A σ-algebra that is generated by a topology is called a Borel σ-algebra; its elements are called Borel sets. A measurable space $(X, \mathcal{B})$ is called a standard Borel space if there is a separable completely metrizable topology $\mathcal{T}$ on $X$ such that $\mathcal{B}$ is the σ-algebra generated by $\mathcal{T}$. Here, a topology is called completely metrizable if there is a metric $d$ such that $(X, d)$ is a complete metric space and $d$ metrizes $\mathcal{T}$. Hence, every compact metric space with its Borel σ-algebra is a standard Borel space. Let $(X, \mathcal{A})$ be a measurable space. A function $\mu \colon \mathcal{A} \to [0, \infty]$ is called a measure if $\mu(\emptyset) = 0$ and it is countably additive, i.e., whenever $A_n \in \mathcal{A}$ for $n = 1, 2, \ldots$ are pairwise disjoint, then $\mu(\bigcup_{n \geq 1} A_n) = \sum_{n \geq 1} \mu(A_n)$. The measure $\mu$ is called finite if $\mu(X) < \infty$.

A.4 The Stone Weierstrass theorem

The Stone Weierstrass theorem is the primary tool we use for proving our universal approximation theorem. Here, we state the formulation from [35, Theorem 2.4.11]. An algebra is a vector space that is additionally closed under multiplication. Recall that, for a compact Hausdorff space $K$, we denote by $C(K, \mathbb{R})$ the set of all real-valued continuous functions on $K$ and $d_{\sup}(f, g) := \sup_{x \in K} |f(x) - g(x)|$ for $f, g \in C(K, \mathbb{R})$.

Theorem 7 (Stone Weierstrass theorem). Let $K$ be a compact Hausdorff space and let $\mathcal{F}$ be an algebra included in $C(K, \mathbb{R})$ such that $\mathcal{F}$ separates points and contains the constants. Then, $\mathcal{F}$ is dense in $C(K, \mathbb{R})$ for $d_{\sup}$.

A.5 The cut distance

The cut norm of a kernel $U$ is $\|U\|_\square := \sup_{S,T} \bigl| \int_{S \times T} U(x, y) \, dx \, dy \bigr|$, where the supremum is taken over all measurable sets $S, T \subseteq [0, 1]$. The cut distance of two graphons $U$ and $W$ is defined by $\delta_\square(U, W) := \inf_\varphi \|U^\varphi - W\|_\square$, where the infimum is taken over all invertible measure-preserving maps $\varphi \colon [0, 1] \to [0, 1]$ and $U^\varphi(x, y) := U(\varphi(x), \varphi(y))$ for all $x, y \in [0, 1]$. See the book of Lovász [78] for a thorough exposition.

A.6 Distributions on iterated degree measures

Grebík and Rocha [52] give the following definition of a DIDM, which we slightly adapt to our definitions. The Kolmogorov consistency theorem yields that, for every $\alpha \in M$, there is a unique $\mu_\alpha \in M_{\leq 1}(M)$ such that $(p_{\infty,h})_* \mu_\alpha = \alpha_{h+1}$ for every $h \in \mathbb{N}$. Then, a probability measure $\nu \in P(M)$ is called a distribution on iterated degree measures (DIDM) if $\mu_\alpha$ is absolutely continuous w.r.t. $\nu$ with the Radon-Nikodym derivative satisfying $0 \leq \frac{d\mu_\alpha}{d\nu} \leq 1$ for $\nu$-almost every $\alpha$; intuitively, this ensures that, if we have some mass $m$ of points of color $c$, then every point in the graph may have at most $m$ neighbors of color $c$. Then, $\nu_W$ for a graphon $W$ satisfies this definition, and, conversely, every DIDM defines a kernel on $M \times M$ that is bounded by one almost everywhere.

B Extended related work

Here, we provide an extended discussion of related work.

B.1 Limitations of MPNNs and the 1-WL

Recently, connections between GNNs and Weisfeiler Leman type algorithms have been shown [12, 49, 89, 115]. Specifically, Morris et al. [89] and Xu et al.
[115] showed that the 1-WL limits the expressive power of any possible GNN architecture in distinguishing non-isomorphic graphs. In turn, these results have been generalized to the k-WL, see, e.g., Azizian and Lelarge [7], Geerts [47], Maron et al. [81], Morris et al. [89, 91, 93], and connected to permutation-equivariant functions approximation over graphs, see, e.g., Azizian and Lelarge [7], Chen et al. [28], Geerts and Reutter [48], Maehara and NT [79]. Further, Aamand et al. [1] devised an improved analysis using randomization. Recent works have extended the expressive power of GNNs, e.g, by using random features [2, 32, 104], equivariant graph polynomials [100], or homomorphism and subgraph counts [13, 22, 30, 95], see Morris et al. [92] for a thorough survey. Recently, Grohe [56] showed tight connections between GNNs expressivity and circuit complexity. Moreover, Rosenbluth et al. [102] investigated the expressive power of different aggregation functions beyond sum aggregation. B.2 Graph metrics beyond edit distance As discussed before, the edit distance does not capture the structural similarity of graphs well. A metric that arguably captures the similarity of graphs well is the cut distance, which was introduced by Frieze and Kannan [41] and provides the central notion of convergence in the theory of dense graph limits. Graph similarity can also be defined via graph homomorphism densities, and as part of the aforementioned theory, Borgs et al. [20], Lovász and Szegedy [77] proved that this is equivalent to the cut distance. Böker [18] introduced a variant of the cut distance based on the well-known system of linear equations defining fractional isomorphism, the tree distance, which is equivalent to similarity defined via tree homomorphism densities. To study MPNNs stability properties, Chuang and Jegelka [31] introduced Tree Mover s Distance to compare two graphs via hierarchical optimal transport between their computation trees. Similarly, Chen et al. [26] introduced the Weisfeiler Leman distance, a polynomial-time computable pseudometric on graphs by combining techniques from optimal transport with the 1-WL, and gave an interpretation in terms of stochastic processes [27]. Here, we collect proofs that we omitted from the main body of the paper. C.1 Equivalence of message-passing graph neural networks Here, we show that, first, for a graph G, the output of an MPNN on G equals the output of the MPNN on the corresponding graphon WG and, second, the output of an MPNN on a graphon W equals the output of the corresponding MPNN on the corresponding DIDM νW of W. Hence, it suffices to consider MPNNs on DIDMs. We note that, while φ is an MPNN model and ψ is Lipschitz in the following theorems, we actually do not need Lipschitz continuity and continuous functions would suffice. Theorem 8. Let G be a graph and φ be an L-layer MPNN model. Let (Iv)v V (G) be the partition of [0, 1] used in the construction of WG from G. Then, for all t [L], v V (G), and x Iv, h(t) v = h(t) x . Proof. We prove the claim by induction on t. Base of the induction. The first space of colors is M0 = {1}. Then, for all v V (G) and x Iv, h(0) v = φ0 = h(0) x . Induction step. The induction assumption is that h(t 1) v = h(t 1) x for all v V (G) and x Iv. Then, for all v V (G) and x Iv, we have h(t) x = φt [0,1] WG(x, y)h(t 1) y dλ(y) = φt Iu WG(x, y)h(t 1) y dλ(y) Iu WG(x, y)h(t 1) u dλ(y) u N(v) h(t 1) u Theorem 9. Let G be a graph, let φ be an L-layer MPNN model, and let ψ be Lipschitz. 
Let (Iv)v V (G) be the partition of [0, 1] used in the construction WG from G. Then, h G = h WG. Proof. With Theorem 8, we get [0,1] h(L) x dλ(x) = ψ X Iv h(L) x dλ(x) Iv h(L) v dλ(x) = ψ 1 |V (G)| v V (G) h(L) v To prove the equivalence of an MPNN on a graphon W and the corresponding DIDM νW , we need the following definition. Let δ: [0, 1] R be a nonnegative measurable function. Then, define the measure νδ by for measurable A [0, 1]. The following lemma, which can be interpreted as the well-definedness of the change of variable z = δ(y) , is taken from [16, Theorem 16.11]. Lemma 10. Let δ : [0, 1] R be a nonnegative measurable function. Then, a measurable function f : [0, 1] R is integrable with respect to νδ if and only if fδ is integrable with respect to λ, in which case Z A f dνδ = Z holds for every measurable A. Theorem 11. Let W W and φ be an L-layer MPNN model. Then, for every t [L] and almost every x [0, 1], h(t) x = h(t) i W,t(x) = h(t) i W (x). Proof. We only have to show the first equality. Then, the second follows directly from the definitions of h(t) : M Rdt and i W . Base of the induction. The first space of colors is M0 = {1}. Then, h(0) x = φ0 = h(0) i W,t(x). Induction step. The induction assumption is that h(t 1) x = h(t 1) i W,t 1(x) for almost every x [0, 1]. Let us prove the induction step for t. We have h(t) x = φt [0,1] W(x, y)h(t 1) y dλ(y) . Recall that for any A B(Mh 1), i W,t(x) (A) = Z i 1 W,t 1(A) W(x, y) dλ(y), so, by the notation νW (x, ) introduced before Lemma 10, i W,t(x) = (i W,t 1) νW (x, ). Hence, by Lemma 10 with δ = W(x, ) and f = h(t 1) , we have h(t) i W,t(x) = φt Mt 1 h(t 1) di W,t(x) = φt Mt 1 h(t 1) d(i W,t 1) νW (x, ) [0,1] h(t 1) i W,t 1 dνW (x, ) [0,1] W(x, y)h(t 1) i W,t 1(y) dλ(y) . Hence, by the induction assumption, h(t) x = h(t) i W,t(x). Theorem 12. Let W W, let φ be an L-layer MPNN model, and let ψ be Lipschitz. Then, h W = hνW,L = hνW . Proof. We only have to prove the first inequality; the second follows from the first and the definition of νW . The first inequality follows from Theorem 11: hνW,L = ψ Z ML h(L) dνW,L = ψ Z ML h(L) d(i W,L) λ [0,1] h(L) i W,L(x) dλ(x) [0,1] h(L) x dλ(x) C.2 Metrics on iterated degree measures We first prove Theorem 2 for the metrics based on the Prokhorov metric. Then, we prove the inequalities between the Prokhorov metric P and the unbalanced Wasserstein metric W from Section 3 and deduce Theorem 2 for the unbalanced Wasserstein metric from it. After that, we prove Lemma 3. C.2.1 The Prokhorov metric To begin, let us restate Theorem 2 for better readability. Theorem 2. Let 0 h . The metrics d P,h and d W,h are well-defined and metrize the topology of Mh. The metrics δP,h and δW,h are well-defined and metrize the topology of P(Mh). Moreover, these metrics are computable on graphs in time polynomial in the size of the input graphs and h, up to an additive error of ε in the case of d W, and δW, . Let (S, d) be a complete separable metric space with Borel σ-algebra B. Recall that, for a subset A S and ε 0, one defines Aε := {y S | d(x, y) < ε for some x A}. Then, the Prokhorov metric P on M 1(S, B) is given by P(µ, ν) := inf{ε > 0 | µ(A) ν(Aε) + ε and ν(A) µ(Aε) + ε for every A B}. It is not hard to see that we can replace Aε by Aε] := {y S | d(x, y) ε for some x A} in the above definition without changing the value P(µ, ν). 
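For intuition, the Prokhorov distance of two measures with small finite support can be approximated directly from this definition by enumerating all subsets A and binary-searching over ε (our own exponential-time sketch, assuming both measures have total mass at most one, as for measures in M≤1(S)); the polynomial-time linear-programming approach follows in Appendix C.2.2.

```python
from itertools import combinations

def prokhorov_bruteforce(points, dist, mu, nu, tol=1e-9):
    """Prokhorov distance of two finite measures on `points`, straight from the definition.

    points: list of support points; dist: callable (x, y) -> distance;
    mu, nu: dicts point -> mass (total mass at most one). Exponential in |points|."""
    subsets = [set(c) for r in range(len(points) + 1) for c in combinations(points, r)]

    def ok(eps):
        for A in subsets:
            A_eps = {y for y in points if any(dist(x, y) <= eps for x in A)}
            mA, nA = sum(mu[x] for x in A), sum(nu[x] for x in A)
            mAe, nAe = sum(mu[y] for y in A_eps), sum(nu[y] for y in A_eps)
            if mA > nAe + eps or nA > mAe + eps:
                return False
        return True

    lo, hi = 0.0, 1.0  # total mass at most one, so the distance is at most one
    while hi - lo > tol:  # the feasible epsilons form an up-set, so binary search works
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

# Two unit point masses at distance 0.3 have Prokhorov distance 0.3.
d = lambda x, y: abs(x - y)
print(round(prokhorov_bruteforce([0.0, 0.3], d, {0.0: 1.0, 0.3: 0.0}, {0.0: 0.0, 0.3: 1.0}), 6))
```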
Moreover, if µ ν , then the definition simplifies to P(µ, ν) = inf{ε > 0 | µ(A) ν(Aε) + ε for every A B} since, if µ(A) ν(Aε) + ε for every A B, then ν(B) µ(Bε) + ε + ν µ µ(Bε) + ε for every B B [40, Lemma 4.30]. As the name suggests, P is a metric on M (S) [99, Section 1.4], and Prokhorov also proved that convergence in P is equivalent to convergence in the weak topology. Theorem 13 ([99, Theorem 1.11]). Let (S, d) be a complete separable metric space. Then, (M (S), P) is a complete separable metric space, and convergence in P is equivalent to weak convergence of measures. The well-definedness of d P,h and δP,h follows from an inductive application of this theorem. Lemma 14. Let 0 h . The metric d P,h is well-defined and metrizes the topology of Mh. The metric δP,h is well-defined and metrizes the topology of P(Mh). Proof. First, we consider d P,h. Let h N. To see why we even have to prove that d P,h is well-defined, note that the measures in Mh+1 = M 1(Mh) = M 1(Mh, B(Mh)) by definition are functions µ: B(Mh) R, where B(Mh) depends on the topology of Mh, which is the weak topology in our case. Hence, it is crucial that d P,h metrizes the weak topology on Mh; otherwise the sets M 1(Mh, B(Mh)) and M 1(Mh, d P,h) might not even be the same. For the induction basis h = 0, this claim trivially holds. Then, the induction hypothesis yields that d P,h is well-defined and metrizes the weak topology on Mh. Then, d P,h+1 is well-defined as the Borel σ-algebra of (Mh, d P,h) equals B(Mh) and, by Theorem 13, convergence in d P,h+1 is equivalent to weak convergence on M 1(Mh, d P,h), which is just weak convergence on M 1(Mh, B(Mh)) = Mh+1. Hence, the topology induced by d P,h+1 is equal to the weak topology on Mh+1 as both spaces are metrizable. For d P, , the claim directly follows since d P, is just a variant of the product metric, which metrizes the product topology. The claim for δP,h then follows from the just proven claim for d P,h by the same reasoning as in the inductive step. The following lemma states that the distances d P,h behave as expected, i.e., the distance of IDMs can only increase when h increases. Lemma 15. Let 0 h h < . Then, d P,h(ph ,h(α), ph ,h(β)) d P,h (α, β) for all α, β Mh. Proof. It suffices to show that d P,h(ph+1,h(α), ph+1,h(β)) d P,h+1(α, β), for all α, β Mh+1, where h 0. We prove this by induction on h, where for h = 0, the statement trivially holds. For the inductive step, we assume without loss of generality that α β , which directly implies ph+1,h(α) ph+1,h(β) . We show that if ε > 0 such that α(A) β(Aε) + ε for every A B(Mh+1), then also ph+1,h(α)(A) ph+1,h(β)(Aε) + ε. For such an ε, we have ph+1,h(α)(A) = (ph,h 1) α(A) = α(p 1 h,h 1(A)) β((p 1 h,h 1(A))ε) + ε β(p 1 h,h 1(Aε)) + ε = ph+1,h(β)(Aε) + ε, where β((p 1 h,h 1(A))ε) β(p 1 h,h 1(Aε)) holds by the induction hypothesis: We show that (p 1 h,h 1(A))ε p 1 h,h 1(Aε). Let x (p 1 h,h 1(A))ε. Then, there is a y (p 1 h,h 1(A)) such that d P,h(x, y) < ε. By the inductive hypothesis, we have d P,h 1(ph,h 1(x), ph,h 1(y)) d P,h(x, y) < ε. Since ph,h 1(y) A, we have ph,h 1(x) Aε, and thus, x p 1 h,h 1(A). C.2.2 Computability of the Prokhorov metric It remains to prove that d P,h and δP,h are polynomial-time computable. A paper of Garel and Massé [46] is concerned with computing the Prokhorov metric of (possibly non-discrete) probability distributions. 
To this end, they first quantize the distributions and then show that the Prokhorov metric of finitely-supported probability distributions can be computed exactly by solving a series of maximum-flow problems; this latter part is based on results by Schay [106] and García-Palomares and Evarist Giné [45], which essentially shows how the Prokhorov metric can be formulated as a linear-optimization problem. By generalizing this linear program to finite measures, we can adapt the algorithm of Garel and Massé [46] to obtain an algorithm for computing the Prokhorov distance of finite measures. Theorem 16. Let µ, ν M (S), where (S, d) is a finite metric space with S = {x1, . . . , xn}. Then, the Prokhorov metric P(µ, ν) can be computed in time polynomial in n and the number of bits needed to encode d, µ, and ν. Proof. Without loss of generality, we assume that µ ν . Then, P(µ, ν) = inf{ε > 0 | µ(A) ν(Aε) + ε for every A S} = inf{ε > 0 | µ(A) ν(Aε]) + ε for every A S} = inf{ε 0 | ε ρ(ε)}, where ρ(ε) := inf{η > 0 | µ(A) ν(Aε]) + η for every A S} for ε > 0. The following claim generalizes an observation of Schay [106, Theorem 1] and García Palomares and Evarist Giné [45, Lemma] to finite measures and shows that the value of ρ(ε) can be expressed as a linear program. Claim 17. Let ε > 0. Then, ρ(ε) = µ max i,j=1 1[d(xi, xj) ε] xij, where the maximum is taken over variables xij such that Pn i=1 xij = ν({xj}) for every j [n], Pn j=1 xij µ({xi}) for every i [n], and xij 0 for all i, j [n]. Proof. For the sake of brevity, define pi := µ({xi}) and qi := ν({xi}) for i [n] and dε ij := 1[d(xi, xj) ε] for all i, j [n]. We have ρ(ε) = max A S µ(A) ν(Aε]) = max vj {0,1}, wi {0,1}, vj dε ij 1+wi i [n] (piwi qivi) = max 0 vj 1, 0 wi 1, vj dε ij 1+wi i [n] (piwi qivi) where the first equality holds by definition of ρ(ε), the second since one can view the sets A and Aε as vectors v and w, respectively, and the third, since the optimum value of a linear program is attained on an extreme point of the convex polytope specifying the feasible solutions, which are 0-1-vectors for this specific linear program. Moreover, since values vj > 1 and wi < 0 can always be set to 1 and 0, respectively, without decreasing the value of the target function, this further equals to max vj 0, wi 1, vj dε ij 1+wi i [n] (piwi qivi) = µ min ui 0, vj 0, ui+vj dε ij i [n] (piui + qivi), where the second equality results from the substitution ui := 1 wi. Then, considering the dual to the linear minimization problem yields that this further equals to ρ(ε) = µ max i,j=1 dε ij xij, where the maximum is taken over variables xij such that Pn i=1 xij qj for every j [n], Pn j=1 xij pi for every i [n], and xij 0 for all i, j [n]. Since we can always increase the variables xij such that Pn i=1 xij = qj for every j [n] holds without decreasing the target function, we obtain the claim. With this claim, we can use a polynomial-time method for solving linear programs to compute ρ(ε) in time polynomial in n and the number of bits needed to encode d, µ, and ν. Alternatively, it is straightforward to express this linear program as a maximum-flow problem, which can also be solved in polynomial time. The mapping ε 7 ρ(ε) is a non-increasing step function whose jumps can only occur at points in D := {d(xi, xj) | i, j [n]}. Let D := D {0}. Then, we have P(µ, ν) = inf{ε 0 | ρ(ε) ε} = min {ε D | ρ(ε) ε} {ρ(ε) | ε D with ρ(ε) > ε} , which can be computed in time polynomial in n and the number of bits needed to encode d, µ, and ν since D contains at most n2 values. 
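A direct transcription of this procedure into code (our own sketch using scipy.optimize.linprog; variable and function names are ours) first evaluates ρ(ε) through the linear program of Claim 17 and then minimizes over the candidate set D ∪ {0}:

```python
import numpy as np
from scipy.optimize import linprog

def prokhorov_lp(D, p, q):
    """Prokhorov distance of finite measures p, q (arrays of point masses, total mass at most
    one) on a finite metric space with distance matrix D, following Theorem 16 and Claim 17."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if p.sum() < q.sum():              # w.l.o.g. assume ||p|| >= ||q||
        p, q = q, p
    n = len(p)

    def rho(eps):
        # maximize sum_{i,j} 1[D_ij <= eps] * x_ij  s.t.  sum_i x_ij = q_j, sum_j x_ij <= p_i, x >= 0
        c = -(D <= eps).astype(float).ravel()    # x is the flattened n x n matrix (row-major)
        A_eq = np.zeros((n, n * n))
        A_ub = np.zeros((n, n * n))
        for j in range(n):
            A_eq[j, j::n] = 1.0                  # column sums: sum_i x_ij = q_j
        for i in range(n):
            A_ub[i, i * n:(i + 1) * n] = 1.0     # row sums: sum_j x_ij <= p_i
        res = linprog(c, A_ub=A_ub, b_ub=p, A_eq=A_eq, b_eq=q, bounds=(0, None))
        return p.sum() + res.fun                 # rho(eps) = ||p|| - max(...)

    candidates = sorted(set(D.ravel()) | {0.0})
    return min(eps if rho(eps) <= eps else rho(eps) for eps in candidates)
```

Alternatively, each ρ(ε) can be obtained by solving the equivalent maximum-flow problem mentioned in the proof; we use a generic LP solver here only for brevity.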
Theorem 16 can then be used to compute the distance δP,h of two given graphs. Intuitively, we run the 1-WL on these graphs in parallel and additionally, after the ith refinement round, compute a matrix Di containing the distances d P,i between any pair of colors. This matrix can be computed from the new colors and Di 1. In the end, we use Dh to compute δP,h. This is essentially the same approach that Chen et al. [26] use for showing that their Weisfeiler Leman distance is polynomial-time computable. Theorem 18. Let 0 h < , and let G and H be graphs. Then, δP,h(G, H) can be computed in time polynomial in h and the size of G and H. Proof. We inductively compute Dh RV (G) V (H) with Dh,uv = d P,h(i G,h(u), i H,h(v)). In a practical setting, it makes sense to index Dh by the colors/IDMs of the vertices instead in a practical setting to reduce the size of the matrix and avoid unnecessary computations of the same value. However, for the sake of simplicity, we stick to this definition of Dh here. For h = 0, this is just the all-zero matrix. After having computed Dn, for all pairs of vertices u V (G), v V (H), we use the matrix Dh to compute the measures i G,h+1(u) and i H,h+1(v) and then use Theorem 16 to compute their distance w.r.t. d P,h+1. We note that we do not need the distances between the IDMs of two vertices u, v V (G) or u, v V (H) for this. Hence, the matrix Dh RV (G) V (H) suffices to compute Dh+1 RV (G) V (H). Furthermore, we note that we actually do not need to know the precise element of Mh assigned to a vertex to do so: we basically run color refinement in parallel and use its renamed colors after h rounds as our space S. From the coloring after h refinement rounds and the matrix Dh, another application of Theorem 16 yields the distance δP,h(G, H) = δP,h(νG,h, νH,h). Let us give a rough analysis of the running time of the algorithm from Theorem 18. Let n be the maximum number of vertices in the two graphs G and H. Then, we have to compute the Prokhorov metric of two IDMs h n2 times and once for the final distance δP,h(G, H). To compute a single such Prokhorov metric, we have to solve n2 maximum-flow problems on a graph with n vertices and n2 edges, where a single maximum-flow problem takes time O(n3) [96]. Hence, the overall running time is O(h n2 n2 n3) = O(h n7); We will see that the distance δW,h can be computed in time O(h n5 log n), which is much better, but nevertheless not ideal for practical purposes. This shows why we need a different way of approximating these distances for practical purposes and leads to our experiments in Section 6. For two graphs G and H on n and m vertices, respectively, the 1-WL test converges after at most max{n, m} refinement rounds, i.e., the color partitions are stable and cannot further be refined in subsequent rounds. For d P,h, however, it is conceivable that d P,h still changes even after having obtained stable color partitions. More precisely, by Lemma 15, d P,h may still increase, i.e., we have a second convergence step where the distances d P,h converge after the color partitions have already stabilized. However, for the Prokhorov distance, we can argue that this takes, at most, polynomially many steps. Theorem 19. Let G and H be graphs. Then, δP, (G, H) can be computed in polynomial time. Proof. Let n and m be the number of vertices of G and H, respectively. 
Then, the masses that an IDM assigns to a color class are all of the form i/(n m) for a natural number i, which implies that the resulting Prokhorov distance between such IDMs also is of this form. This inductively yields that all entries of the distance matrices in the computation of δP,h(G, H) for h N are of this form and, with the monotonicity of Lemma 15, we get that all n m entries can increase at most n m times, i.e., the distance matrix stays constant after at most (n m)2 steps. Thus, the claim follows. When combined with the previous analysis, this yields that δP, (G, H) can be computed in time O(n4 n7) = O(n11), where n is the maximum number of vertices in the two input graphs. C.2.3 The unbalanced Wasserstein metric Let us now turn our attention to our unbalanced Wasserstein metric. To this end, let (S, d) be a complete separable metric space with Borel σ-algebra B. We first note that the Wasserstein distance of two probability measures µ, ν P(S) is W(µ, ν) := inf γ C(µ,ν) S S d(x, y) dγ(x, y), where C(µ, ν) is the set of all couplings of µ and ν, i.e., probability measures γ P(S S) with marginals µ and ν or, formally, (p1) γ = µ and (p2) γ = ν. In Section 3, for finite measures µ, ν M 1(S), where where we assume µ ν without loss of generality, we defined the unbalanced Wasserstein distance W(µ, ν) := µ ν + inf γ M(µ,ν) S S d(x, y) dγ(x, y), where M(µ, ν) is the set of all measures γ M 1(S S) such that (p1) γ µ and (p2) γ = ν. We first note that W is well-defined as M(µ, ν) is non-empty since 1 µ (µ ν) M(µ, ν). On probability measures, this definition coincides with the Wasserstein distance [35, Section 11.8], and on finite histograms, it coincides with the Earth Mover s Distance \ EMD1 introduced by Pele and Werman [97, 98]. To prove that W indeed is a metric on M 1(S), we follow the proof of Pele and Werman [97] and relate our definition of W to the classical definition on probability measures as follows: For a complete separable metric space (S, d) where any two elements have a distance at most one and a measure µ M 1(S, d) we define the space (S , d ) and the probability measure µ P(S , d ) by adding a new element of mass 1 µ that has distance one to all other elements. Formally, we let S := S { }, d (x, y) := d(x, y) if x, y S, 1 otherwise for all x, y S, and µ (A) := µ(A \ { }) + (1 µ ) if A, µ(A) otherwise for every A B(S , d ). For measures obtained this way, we have the following. Lemma 20. Let (S, d) be a complete separable metric space where any two elements have a distance at most one and µ, ν M 1(S, d). Then, W(µ , ν ) = W(µ, ν). Proof. Without loss of generality, we assume that µ ν . To show that W(µ , ν ) W(µ, ν), we show that a γ M(µ, ν) yields a γ C(µ , ν ) such that Z S S d (x, y) dγ (x, y) = µ ν + Z S S d(x, y) dγ(x, y). (3) To this end, we define γ C(µ , ν ) by γ (C) := γ(C (S S)) + (µ (p1) γ)({x S | (x, ) C}) + (1 µ )1[( , ) C] for every C B(S S). It is easy to verify that γ actually is a probability measure with marginals µ and ν . Then, Z S S d (x, y) dγ (x, y) = Z S S d (x, y) dγ (x, y) + Z { } S d (x, y) dγ (x, y) S { } d (x, y) dγ (x, y) { } { } d (x, y) dγ (x, y) S S d(x, y) dγ(x, y) + 0 + µ (p1) γ + 0 S S d(x, y) dγ(x, y). For the other direction W(µ , ν ) W(µ, ν), we show that a γ C(µ , ν ) restricts to a γ M(µ, ν) such that (3) holds. Let γ denote the restriction of γ to B(S S). We prove that γ M(µ, ν). First, we have (p1) γ(A) = γ(A S) γ (A S ) = (p1) γ (A) = µ (A) = µ(A), and, analogously, (p2) γ(A) ν(A) for every A B(S). 
Now, assume that (p2) γ(A) < ν(A) for some A B(S). Then, ν = γ(S A) + γ(S A) < ν(A) + ν( A) = ν , which is a contradiction. Next, observe that γ ({ } S) + γ ({ } { }) = 1 γ (S S ) = 1 µ (S) = 1 µ , and also γ (S { })+γ ({ } { }) = 1 ν . By the previous paragraph, we have γ (S S) = ν , which yields that γ ({ } S) + γ (S { }) + γ ({ } { }) = 1 ν . We get γ ({ } S) = 0 and, hence, γ ({ } { }) = 1 µ and, thus, γ (S { }) = µ ν . From this we obtain Z S S d (x, y) dγ (x, y) = Z S S d(x, y) dγ(x, y) + γ ({ } S) + γ (S { }) S S d(x, y) dγ(x, y). Corollary 21. Let (S, d) be a separable metric space. Then, W is a metric on M 1(S, d). Proof. Follows immediately from Lemma 20 and the fact that W is a metric on P(S , d ), cf. [35, Lemma 11.8.3]. To show that W also metrizes the weak topology, we show that it is upper-bounded by the Prokhorov metric and lower-bounded by another metric that we define in the following. For two µ, ν M 1(S), let K(µ, ν) := sup f L 1, f 1 Z fdµ Z fdν and BL(µ, ν) := sup f BL 1 Z fdµ Z fdν be the Kantorovich-Rubinshtein distance and the Bounded-Lipschitz distance of µ and ν, respectively, where f BL := f L + f for f : S R. We clearly have BL(µ, ν) K(µ, ν) 2BL(µ, ν). If (S, d) is separable, then K and β metrize the weak topology [17, Theorem 8.3.2]. We lower-bound the unbalanced Wasserstein distance W by the Kantorovich-Rubinshtein distance K and upper-bound it by the Prokhorov distance P (with a factor of two). This directly implies that W metrizes the weak topology. These inequalities for probability measures are easily found in the literature, see Dudley [35], and we generalize them to finite measures of total mass at most one. There are two ways to go on about this. First, one could use Lemma 20 and show that the analogous statement holds for the Prokhorov distance P; then, these inequalities follow from the ones for probability measures. Secondly, one can prove these inequalities directly, which is what we do here as the proofs provide a better understanding of these metrics, although they are not as straightforward as one might expect. In particular, the proof of upper-bounding the Wasserstein distance by the Prokhorov distance in [35] is quite complicated, and we follow the simpler proofs outlined in Schay [106] and García-Palomares and Evarist Giné [45], which use the duality of linear programming. We remark that the inequalities between the Prokhorov distance P and the unbalanced Wasserstein distance W presented in Section 3 are a special case of the following lemma. Lemma 22. Let (S, d) be a complete separable metric space. Then, for all µ, ν M 1(S), BL(µ, ν) K(µ, ν) W(µ, ν) 2P(µ, ν) 4 p Proof. In the following, we assume µ ν without loss of generality. The first inequality follows immediately from the definition. For the second inequality, we first prove the following claim. Claim 23. Let f : S Rn be Lipschitz. Then, S fdν 2 f ( µ ν ) + f L Z S S d(x, y) dγ(x, y) for every γ M(µ, ν). Proof. Let γ M(µ, ν). Then, S f d(p1) γ + Z S f d(p1) γ Z S fd(µ (p1) γ) 2 + S f d(p1) γ Z S fd(p2) γ 2 S fd(µ (p1) γ) 2 + S S (f(x) f(y))dγ(x, y) 2 S f 2d(µ (p1) γ) + Z S S f(x) f(y) 2dγ(x, y) S f d(µ (p1) γ) + Z S S f L d(x, y) dγ(x, y) = f µ (p1) γ + f L Z S S d(x, y) dγ(x, y) = f ( µ ν ) + f L Z S S d(x, y) dγ(x, y). The claim implies that, for an f : S R with f L 1 and f 1, we have R S fdν W(µ, ν), and since this holds for every such f, we get K(µ, ν) W(µ, ν). 
For the third inequality, we follow Schay [106] and García-Palomares and Evarist Giné [45], who proved this inequality for probability measures by using the duality of linear programming in the finite case and then lifting this result to the infinite case. Note that we already used this connection to linear programming in as Claim 17 to show that the Prokhorov metric is efficiently computable in Theorem 16. Assume that µ ν without loss of generality. If S is finite (or µ and ν are finitely supported), we use that P(µ, ν) = inf{ε > 0 | ρ(ε) ε} for the function ρ defined in the proof of Theorem 16. Claim 17 expresses the value ρ(ε) for an ε > 0 as a linear program and we can interpret the solution to this linear program as a measure γε M(µ, ν) that satisfies ρ(ε) = µ γε( ε), where ε := {(x, y) S S | d(x, y) ε}. Choose a sequence εn with εn P(µ, ν) and ρ(εn) εn. Then, εn ρ(εn) = µ γεn( εn) = µ γεn + γεn εn = µ ν + γεn and, hence, W(µ, ν) µ ν + Z S S d(x, y) dγεn(x, y) εn d(x, y) dγεn(x, y) + Z εn d(x, y) dγεn(x, y) µ ν + εn γεn + d γεn which yields that W(µ, ν) 2P(µ, ν). We can also consider the limit of the γεn. More precisely, since P(S S) is compact, the sequence γεn, by possibly considering a subsequence, converges to a measure γ of the same total mass, which is ν . For this, note that results for probability measures apply since we can just rescale a sequence since all measures have the same total mass. We then have γ M(µ, ν), cf. the case for S infinite, and for ε > P(µ, ν), ε µ ν + lim inf n γεn µ ν + lim inf n γεn εn lim inf n εn = P(µ, ν), where the first inequality holds by the Portmanteau theorem, which again implies W(µ, ν) 2P(µ, ν) as before. The infinite case works by approximating µ and ν by finite measures and is mostly analogous to [45] as we can rescale a sequence of measures of the same total mass to probability measures. More precisely, one weakly approximates µ and ν by sequences of measures µn and νn of total mass µ and ν , respectively, supported by the finite set Sn for n 1. Then, limn P(µn, νn) = P(µ, ν) by the triangle inequality. For n 0, take γn M(µn, νn) of total mass ν for µn and νn as defined in the finite case, i.e., we have µ ν + γ ε P(µn, νn) for every ε > P(µn, νn). Since µn and νn are uniformly tight as weakly convergent sequences, the sequence γn is also uniformly tight and, by possibly considering a subsequence, converges to a measure γ of total mass ν . By continuity of p2, we then have that (p2) γn converges to (p2) γ. Since (p2) γn = νn converges to ν, we get that (p2) γ = ν. Moreover, one can show that (p2) γn µn implies that (p2) γ µ: For a closed set C B(S) and an η > 0, approximate C by the Lipschitz function f(x) := max{0, 1 d(x,C) η }. Then, 1C f 1Cη and, hence, (p2) γ(C) = Z S 1Cd(p2) γ Z S fd(p2) γ = lim n S fd(p2) γn S 1Cηdµ = µ(Cη). Since this holds for every η > 0 and C is closed, we get (p2) γ(C) µ(C). Then, since (p2) γ is regular by Ulam s theorem, we get (p2) γ(A) = sup{(p2) γ(K) | K compact, K A, K B(S)} µ(A) for every A B(S). All in all, we have γ M(µ, ν). For every ε > P(µ, ν), we have ε > P(µn, νn) if n is large enough since P(µn, νn) converges to P(µ, ν). Then, ε µ ν + lim inf n γn ε lim inf n P(µn, νn) = P(µ, ν), where the first inequality follows from the Portmanteau theorem. Then, we get W(µ, ν) µ ν + ε γ + d γ ε ε + P(µ, ν), similarly to the finite case. Since this holds for every ε > P(µ, ν), the claim follows. 
For the last inequality, we show that P(µ, ν) 2 p BL(µ, ν) by following the proof of [35, Theorem 11.3.3] for probability measures. We have P(µ, ν) = inf{ε > 0 | µ(A) ν(Aε) + ε for every A B(S)} and fix an A B(S) and an ε > 0. Let f(x) := max{0, 1 d(x,A) ε }. Then, f BL 1 + 1 ε and, hence, S fdν + f BL BL(µ, ν) S fdν + 1 + 1 BL(µ, ν) ν(Aε) + 1 + 1 which implies that P(µ, ν) max ε, 1 + 1 ε BL(µ, ν) . For ε := p BL(µ, ν), this yields that P(µ, ν) BL(µ, ν) + p BL(µ, ν). Hence, if BL(µ, ν) 1, then P(µ, ν) 2 p BL(µ, ν). If BL(µ, ν) > 1, then P(µ, ν) 2 p BL(µ, ν) trivially holds since P(µ, ν) 1 as µ and ν have total mass at most one. Lemma 24. Let 0 h . The metric d W,h is well-defined and metrizes the topology of Mh. The metric δW,h is well-defined and metrizes the topology of P(Mh). Proof. By Theorem 13 and Lemma 22, for a complete separable metric space (S, d), we have that (M (S), W) is a complete separable metric space and convergence in W is equivalent to weak convergence of measures. Then, the proof is analogous to Lemma 14. C.2.4 Computability of the unbalanced Wasserstein metric It only remains to argue that d W,h and δW,h are polynomial-time computable. This can be done in the same way as in Theorem 16, i.e., one again constructs a matrix of distances that one updates while running the 1-WL in parallel on the two input graphs. Here, the metric W can be computed in polynomial-time by using the same flow network [98] as Chen et al. [26] did in their proof for the polynomial-time computability of their Weisfeiler Leman distance. Following their arguments, one then obtains a running time of O(h n5 log n) for δW,h, where h N and n is the maximum number of vertices in the two input graphs. For d W, and δW, , in contrast to Theorem 16, we cannot argue that all distances are of the form i/(n m), where n and m are the numbers of vertices of the input graphs: with every iteration, we possibly get an additional factor of 1/(n m). However, we can start rounding the distances after log(1/ε) iterations to integer multiples of 1/(n m)log2(1/ε), which results in an additive error of ε. C.2.5 Continuity of message-passing graph neural networks It remains to prove Lemma 3. Recall that, for an L-layer MPNN model φ = (φi)L i=0, we inductively defined the Lipschitz constant Cφ 0 of φ by Cφ0 := 0 for t = 0 and Cφt := φt L ( h(t 1) + Cφt 1) for t > 0. If additionally ψ is Lipschitz, then C(φ,ψ) := ψ L ( h(L) + Cφ). Lemma 3. Let φ be an L-layer MPNN model for L N and ψ be Lipschitz. Then, h(L) α h(L) β 2 Cφ d W,L(α, β) and hµ hν 2 C(φ,ψ) δW,L(µ, ν) for all α, β ML and all µ, ν P(ML), respectively. These inequalities also hold for d W, and δW, with an additional factor of L in the Lipschitz constant. Proof. We first note that, by Claim 23, for a complete separable metric space (S, d) and a Lipschitz function f : S Rn, we have S fdν 2 f BL W(µ, ν). for all µ, ν M 1(S). Let us now prove the first inequality by induction on L. For L = 0, the statement trivially holds since M0 is the one-point space. For the inductive step, we have h(L) α h(L) β 2 = ML 1 h(L 1) dα φL ML 1 h(L 1) dβ 2 ML 1 h(L 1) dα Z ML 1 h(L 1) dβ 2 φL L h(L 1) BL W(α, β) = φL L ( h(L 1) + h(L 1) L) d W,L(α, β) φL L ( h(L 1) + CφL 1) d W,L(α, β) = Cφ d W,L(α, β) for all α, β Mh by the induction hypothesis. Hence, we get the first inequality. The second equality then follows from the first by the same reasoning as in the inductive step above. It remains to prove the inequalities for d W, and δW, . 
These follow from the two just-proven inequalities. First, for all α, β M 1(M), we have h(L) α h(L) β 2 = h(L) p ,L(α) h(L) p ,L(β) 2 Cφ d W,L(p ,L(α), p ,L(β)) Cφ L d W, (α, β). Second, for all µ, ν P(M), we have hµ hν 2 = h(p ,L) µ h(p ,L) ν 2 C(φ,ψ) δW,L((p ,L) µ, (p ,L) ν) C(φ,ψ) L δW, (µ, ν). C.3 Universality of message-passing graph neural networks Here, we prove our universal approximation theorem for MPNNs on iterated degree measures, which states that the sets N 1 t and NN 1 t are dense in C(Mt, R) and C(P(Mt), R), respectively. It follows, by simple inductive applications of the Stone Weierstrass theorem, cf. Appendix A.4, to the set N 1 t . Essentially, we show that N 1 t satisfies all requirements of the Stone Weierstrass theorem. Then, an application of the Stone Weierstrass theorem combined with the definition of iterated degree measures yields that N 1 t+1 separates points, which allows us to show that N 1 t+1 again satisfies all requirements of the Stone Weierstrass theorem. In the following, to stress the dependence of h(t) and h on the MPNN model φ and on the MPNN model φ and the Lipschitz function ψ, we sometimes write φ(t) and (φ, ψ) instead, respectively. Lemma 25. Let 0 t . The set N 1 t is closed under multiplication and linear combinations, contains 1Mt, and separates points of Mt. Proof. Base of the induction. For t = 0, the claim trivially holds as N 1 0 contains precisely the constant functions and M0 = {1} is the one-point space. Induction step. Let t > 0. Clearly, N 1 t contains the all-one function 1Mt since we can always choose φt in an MPNN model to be the all-one function. Next, consider two functions φ(t) and φ (t) from N 1 t for t-layer MPNN models φ and φ . Define ψmul((x, y)T ) := x y and ψadd((x, y)T ) := x + c y for all x, y R, where c R is fixed. Then, define the MPNN model φmul := (φ0 φ 0, . . . , φt 1 φ t 1, ψmul (φt φ t)) and define φadd analogously via ψadd. Note that φmul is in fact an MPNN model since multiplication on a compact, and hence, a bounded subset of R2 is Lipschitz continuous. Then, φ(t) φ (t) = φ(t) mul and φ(t) + c φ (t) = φ(t) add , which implies that N 1 t is closed under multiplication and linear combinations. Finally, let α, β Mt = M 1(Mt 1) with α = β. By the induction hypothesis, the set N 1 t 1 is closed under multiplication and linear combinations, contains 1Mt 1, and separates points of Mt 1. Hence, it is a subalgebra of C(Mt 1) that separates points and contains the constants. By the Stone Weierstrass theorem, N 1 t 1 is dense in C(Mt 1), which means that there is a (t 1)-layer MPNN model φ with output dimension one such that Z Mt 1 φ(t 1) dα = Z Mt 1 φ(t 1) dβ. Define the t-layer MPNN model φ := (φ0, . . . , φt 1, ψid), where ψid(x) := x for every x R. Then, φ (t) N 1 t , separates α and β since φ α (t) = Z Mt 1 φ(t 1) dα = Z Mt 1 φ(t 1) dβ = φ β (t), cf. Section 2. The claim for N 1 now easily follows: N 1 contains the all-one function 1M = 1M0 p ,0 N 1 and separates points of M since, if we have α, β M with α = β, there is some t N such that αt = βt. By the already proven claim for N 1 t , there is a t-layer MPNN model φ with output dimension one such that φ(t) αt = φ(t) βt . Hence, φ(t) p ,t(α) = φ(t) αt = φ(t) βt = φ(t) p ,t(β) for φ(t) p ,t N 1 . To see that N 1 is closed under multiplication and linear combinations, let φ(t) p ,t and φ (t ) p ,t be two functions from N 1 , where φ and φ are t-layer and t -layer MPNN models with output dimension one, respectively. We assume that t t ; the other case is analogous. 
Then, φ (t ) p ,t = φ (t ) pt,t p ,t by the definition of M . For now, assume that φ (t ) pt,t N 1 t . Then, by the already proven claim for N 1 t , we have φ(t) (φ (t ) pt,t ) N 1 t , which means that (φ(t) p ,t) (φ (t ) p ,t ) = (φ(t) (φ (t ) pt,t )) p ,t N 1 . This implies that N 1 is closed under multiplication, and by analogous reasoning, we get that it is closed under linear combinations. To finish the proof, it remains to show that φ (t ) pt,t N 1 t , which follows from the following claim. Claim 26. Let n, t 0. Let φ be a t-layer MPNN-model with dt = n. Then, h(t) pt+1,t N n t+1. Proof. We prove the claim by induction on t. Base of the induction. For t = 0, we have h(0) α = φ0 for every α M0, i.e., the only element of M0 is mapped to φ0 Rn. Let φ1 : Rn Rn be the constant function that maps all inputs to φ0 and define the 1-layer MPNN model φ := (φ0, φ1). Then, φ(0) p1,0(α) = φ0 = φ1 M0 φ0dα = φα (1) for every α M1, i.e., φ(0) p1,0 = φ (1) N n 1 . Induction step. Let t > 0 and φ be a t-layer MPNN model. Then, for every α Mt+1, we have (φ(t) pt+1,t)(α) = φt Mt 1 φ(t 1) dpt+1,t(α) = φt Mt 1 φ(t 1) d(pt,t 1) (α) Mt φ(t 1) pt,t 1 dα . By the induction hypothesis, we have φ(t 1) pt,t 1 N dt t , i.e., φ(t 1) pt,t 1 = φ (t) for a t-layer MPNN model φ . Hence, we have (φ(t) pt+1,t)(α) = φt Mt φ (t) dα = φ α (t) for every α Mt+1, where φ := (φ 0, . . . , φ t, φt) is an MPNN model with t + 1 layers. Hence, φ(t) pt+1,t N n t+1. With Lemma 25, we immediately obtain Theorem 4, which we restate here for better readability. Theorem 4. Let 0 L . Then, N 1 L is dense in C(ML, R) and NN 1 L is dense in C(P(ML), R). Proof. By Lemma 25, the Stone Weierstrass theorem is applicable to N 1 L, and hence, N 1 L is dense in C(ML, R). We can then use this to show that NN 1 L is dense in C(P(ML), R). By the same arguments as in the inductive step in the proof of Lemma 25, NN 1 L is closed under multiplication and linear combinations, contains the all-one function, and separates points of P(ML). Hence, an application of the Stone Weierstrass theorem yields that NN 1 L is dense in C(P(ML), R). Theorem 4 then yields Corollary 5. Corollary 5. Let 0 L and n > 0. Let ν P(ML) and (νi)i be a sequence with νi P(ML). Then, νi ν if and only if hνi hν for all L-layer MPNN models φ and Lipschitz ψ: Rd L Rn. Proof. First, let n = 1. When restricted to functions h NN n L of the form hν = R ML h(L) dν, i.e., the function ψ is the identity, the claim follows since N 1 L is dense in C(ML, R) by Theorem 4 and the definition of the weak topology on P(ML), cf. Section 2. Since the function ψ in the definition of h NN n L, which, in general, is of the form hν = ψ R ML h(L) dν , is continuous, the equivalence also holds when considering all functions in the set NN n L. Finally, since one can always consider the projection to a single component and conversely map a single real number to a vector of these numbers, the equivalence also holds in the case n > 1. C.4 Equivalence of distances Here, we formally state and prove Theorem 1. Let us first recall the informal statement from the main body of the paper. Theorem 1 (informal). The following are equivalent for all graphons U and W: 1. U and W are close in δP (or alternatively δW). 2. U and W are close in δT . 3. MPNN outputs on U and W are close for all MPNNs with Lipschitz constant C and L layers. 4. Homomorphism densities in U and W are close for all trees up to order k. Let us give an overview of the proof. The implication (1) (3) is just Lemma 3, and its converse is Theorem 6. 
Convergence of measures in P(M) in the weak topology is equivalent to convergence in δP (or δW) by Theorem 2, convergence of all MPNN outputs by Corollary 5, convergence of tree homomorphism densities by Grebík and Rocha [52], cf. Appendix D, and thus, convergence in the tree distance by Böker [18]. Therefore, the remaining equivalences all follow with a compactness argument similar to the one in Theorem 6; this was also used by Böker [18] to prove the equivalence of (2) and (4). Let us formally state the equivalence of all mentioned notions of convergence in the following theorem. Theorem 27. Let (Wi)i be a sequence of graphons Wi : [0, 1]2 [0, 1], and let W : [0, 1]2 [0, 1] be a graphon. Then, the following are equivalent: 1. δP(Wi, W) 0 (or equivalently δW(Wi, W) 0). 2. δT (Wi, W) 0. 3. h Wi h W for every MPNN model φ and Lipschitz ψ: Rd L Rn, where n > 0. 4. t(T, Wi) t(T, W) for every tree T. 5. νWi νW . Proof. (1) and (3) are equivalent to (5) by Theorem 2 and Corollary 5. (2) is equivalent to (4) by [18], which in turn is equivalent to (5) by the result of Grebík and Rocha [52], cf. Appendix D. We note that we only state Theorem 27 for graphons and not elements of P(M) since the tree distance of [18] is only defined for graphons and not for probability measures. We further note that the following non-approximative variant of Theorem 27 also holds. Theorem 28. Let U and W be graphons. Then, the following are equivalent: 1. δP(U, W) = 0 (or equivalently δW(U, W) = 0). 2. δT (U, W) = 0. 3. h U = h W for every MPNN model φ and Lipschitz ψ: Rd L Rn, where n > 0. 4. t(T, U) = t(T, W) for every tree T. 5. νU = νW . Proof. (2) is equivalent to (4) by [18]. The other equivalences follow as in Theorem 27 since P(M) is Hausdorff. Let us now demonstrate how compactness can be used to deduce Theorem 1 from Theorem 27. For the direction (1) (3), we actually do not need compactness since Lemma 3 already is a slightly stronger statement. For the sake of completeness, we nevertheless state it as an epsilon-delta statement. Theorem 29. Let n > 0 be fixed. For every L N, C > 0, and ε > 0, there is a δ > 0 such that, for all graphons U and W, if δP(U, W) δ, then h U h W 2 ϵ for every L -layer MPNN model φ and Lipschitz ψ: Rd L Rn with L L and C(φ,ψ) C. Proof. Follows immediately from Lemma 3. Let us formally prove the converse direction (3) (1). Theorem 6. Let n > 0 be fixed. For every ε > 0, there are L N, C > 0, and δ > 0 such that, for all graphons U and W, if h U h W 2 δ for every L -layer MPNN model φ and Lipschitz ψ: Rd L Rn with L L and C(φ,ψ) C, then δP(U, W) ε. Proof. Assume that there is an ε > 0 such that such L N, C > 0, and δ > 0 do not exist. Then, for every k 0, there are graphons Uk and Wk with h Uk h Wk 2 1/k for every MPNN model φ and Lipschitz ψ with output dimension n, at most k layers and C(φ,ψ) k but also δP(Uk, Wk) > ε. By the compactness of the graphon space, there are subsequences (Uik)k and (Wik)k of (Uk)k and (Wk)k converging to graphons e U and f W, respectively, in the cut distance and, hence, also in the tree distance. Let φ be an L-layer MPNN model and ψ: Rd L Rn be Lipschitz. Then, by Theorem 27, also (h Uik )k and (h Wik )k converge to he U and hf W , respectively. Hence, he U hf W 2 he U h Uik 2 + h Uik h Wik 2 + h Wik hf W 2 k 0 by the assumption, i.e., he U = hf W . Since this holds for every MPNN model and Lipschitz ψ, we have δP(e U, f W) = 0 by Theorem 28. 
Then, however,
\[
\delta_P(U_{i_k}, W_{i_k}) \le \delta_P(U_{i_k}, \widetilde{U}) + \delta_P(\widetilde{U}, \widetilde{W}) + \delta_P(\widetilde{W}, W_{i_k}) \xrightarrow{\;k \to \infty\;} 0
\]
since $(U_{i_k})_k$ and $(W_{i_k})_k$ converge to $\widetilde{U}$ and $\widetilde{W}$, respectively, also in $\delta_P$ by Theorem 27. This contradicts the assumption that $\delta_P(U_k, W_k) > \varepsilon$ for every $k \ge 0$.

In the above theorem, we could simply replace $\delta_P$ by $\delta_W$ or $\delta_T$; the proof is analogous. We can also state the equivalence of these metrics explicitly, e.g., graphons that are close in $\delta_P$ are close in $\delta_T$ and vice versa.

Theorem 30. For every $\varepsilon > 0$, there is a $\delta > 0$ such that, for all graphons $U$ and $W$, if $\delta_P(U, W) \le \delta$, then $\delta_T(U, W) \le \varepsilon$.

Proof. Assume that there is an $\varepsilon > 0$ such that such a $\delta > 0$ does not exist. Then, for every $k \ge 0$, there are graphons $U_k$ and $W_k$ with $\delta_P(U_k, W_k) \le 1/k$ but $\delta_T(U_k, W_k) > \varepsilon$. By the compactness of the graphon space, there are subsequences $(U_{i_k})_k$ and $(W_{i_k})_k$ of $(U_k)_k$ and $(W_k)_k$ converging to graphons $\widetilde{U}$ and $\widetilde{W}$, respectively, in the cut distance and, hence, also in the tree distance, which in turn, by Theorem 27, means that they also converge in $\delta_P$. Combining this with the assumption yields $\delta_P(\widetilde{U}, \widetilde{W}) = 0$, which implies $\delta_T(\widetilde{U}, \widetilde{W}) = 0$ by Theorem 28. This, however, contradicts $\delta_T(U_k, W_k) > \varepsilon$ since $(U_{i_k})_k$ and $(W_{i_k})_k$ converge to $\widetilde{U}$ and $\widetilde{W}$, respectively, in $\delta_T$.

Theorem 31. For every $\varepsilon > 0$, there is a $\delta > 0$ such that, for all graphons $U$ and $W$, if $\delta_T(U, W) \le \delta$, then $\delta_P(U, W) \le \varepsilon$.

Proof. Analogous to Theorem 30.

The equivalence between the tree distance and tree homomorphism densities was proven in [18]. Let us state it here for the sake of completeness. Note that the statements are essentially the same as Theorem 29 and Theorem 6, with the constant $C$ and the number of layers $L$ replaced by the order $k$ of the trees.

Theorem 32 ([18, Corollary 20]). For every tree $T$ and every $\varepsilon > 0$, there is a $\delta > 0$ such that, for all graphons $U$ and $W$, if $\delta_T(U, W) \le \delta$, then $|t(T, U) - t(T, W)| \le \varepsilon$.

Theorem 33 ([18, Corollary 21]). For every $\varepsilon > 0$, there are $k > 0$ and $\delta > 0$ such that, for all graphons $U$ and $W$, if $|t(T, U) - t(T, W)| \le \delta$ for every tree $T$ on at most $k$ vertices, then $\delta_T(U, W) \le \varepsilon$.

D Tree homomorphism densities

We briefly repeat the proof of Grebík and Rocha [52] of a universal approximation theorem for (linear combinations of) tree homomorphism density functions. The purpose of this is twofold. First, it serves as a reference since we use this result in a part of the proof of Theorem 1. Secondly, it illustrates the striking similarities between tree homomorphism densities and MPNNs. Not only do both tree homomorphism densities and MPNNs satisfy a universal approximation theorem, but the theorem statements and proofs in this section for tree homomorphism densities and in Appendix C.3 for MPNNs are nearly identical. Hence, tree homomorphism densities and MPNNs are two different ways of arriving at the same end goal: a set of functions that has certain properties and, in particular, describes the similarity of graphons modulo the 1-WL test.

The following definition is implicit in Grebík and Rocha [52]. For a rooted tree $T$ of height at most $h \in \mathbb{N}$, we inductively define the homomorphism density function $t_h(T, \cdot)\colon \mathbb{M}_h \to [0,1]$ by letting $t_h(T, \cdot) := \mathbb{1}_{\mathbb{M}_h}$ if the height of $T$ is zero, and
\[
t_h(T, \alpha) := \int_{\mathbb{M}_{h-1}} t_{h-1}(T_1, \cdot) \, d\alpha \;\cdots\; \int_{\mathbb{M}_{h-1}} t_{h-1}(T_k, \cdot) \, d\alpha
\]
for every $\alpha \in \mathbb{M}_h$ if $h > 0$, where $T_1, \dots, T_k$ are the trees rooted at the children of the root of $T$. Intuitively, $t_h(T, \alpha)$ is the homomorphism density of $T$ when the root of $T$ is mapped to a point of color $\alpha$. Additionally, we define $t_\infty(T, \cdot)\colon \mathbb{M} \to [0,1]$ by $t_\infty(T, \cdot) := t_h(T, \cdot) \circ p_{\infty,h}$.
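To make this recursion concrete, here is a minimal NumPy sketch (our own illustration, not part of the formal development) that evaluates it when a finite graph is viewed as a graphon: the measure $\alpha$ attached to a vertex $v$ then assigns mass $1/n$ to each neighbor of $v$, so the integral $\int t_{h-1}(T_i, \cdot)\, d\alpha$ becomes $(1/n)\sum_{u} A_{vu}\, t_{h-1}(T_i, u)$. The function names rooted_density and hom_density are ours and purely illustrative.

```python
import numpy as np

# A rooted tree is encoded as a nested tuple of its root's subtrees,
# e.g. () is a single vertex and ((), ()) is a root with two leaf children.

def rooted_density(tree, A):
    """t_v(T) for every vertex v of the graph with adjacency matrix A,
    via the recursion t_v(T) = prod_i (1/n) * sum_u A[v, u] * t_u(T_i)."""
    n = A.shape[0]
    out = np.ones(n)                      # height-zero tree: constant 1
    for child in tree:                    # subtrees T_1, ..., T_k at the root's children
        out *= (A @ rooted_density(child, A)) / n
    return out

def hom_density(tree, A):
    """Unrooted homomorphism density t(T, G), i.e., the average of t_v(T) over v."""
    return rooted_density(tree, A).mean()

# Example: the path on three vertices, rooted at its center, in the path graph P3.
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
print(hom_density(((), ()), A))           # 2/9, matching hom(P3, P3) / 3^3 = 6/27
```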
One can show that $t_\infty(T, \cdot)$ is well defined, i.e., that the definition is independent of $h$, by a simple induction.

Lemma 34. Let $T$ be a rooted tree of height at most $h$. Then, $t_{h+1}(T, \cdot) = t_h(T, \cdot) \circ p_{h+1,h}$.

Let $0 \le h \le \infty$. For a rooted tree $T$ of height at most $h$ and $\nu \in \mathcal{P}(\mathbb{M}_h)$, define $t(T, \nu) := \int_{\mathbb{M}_h} t_h(T, \cdot) \, d\nu$. For the DIDM $\nu_{W,h}$ associated with a graphon $W$, this then simply equals the homomorphism density of $T$ in $W$ when viewing $T$ as unrooted.

Lemma 35 ([52, Proposition 7.3]). Let $0 \le h \le \infty$, let $T$ be a rooted tree of height at most $h$, and let $W \in \mathcal{W}$ be a graphon. Then, $t(T, W) = t(T, \nu_{W,h}) = t(T, \nu_W)$, where $T$ in the expression $t(T, W)$ is treated as unrooted.

For $0 \le h \le \infty$, let $\mathcal{T}_h := \{ t_h(T, \cdot) \mid T \text{ a rooted tree of height at most } h \} \subseteq C(\mathbb{M}_h, \mathbb{R})$, and let $\operatorname{span}(\mathcal{T}_h)$ be the set of linear combinations of functions in $\mathcal{T}_h$. The set $\mathcal{T}_h$ is not closed under linear combinations, which is why we have to consider the set $\operatorname{span}(\mathcal{T}_h)$ to obtain a sub-algebra of $C(\mathbb{M}_h, \mathbb{R})$ to which the Stone–Weierstrass theorem is then applicable. In fact, $\mathcal{T}_h$ is clearly not dense in $C(\mathbb{M}_h, \mathbb{R})$ since the codomain of every function in $\mathcal{T}_h$ is $[0,1]$. Other than that, the proof of the following lemma is identical to the proof of Lemma 25.

Lemma 36 ([52, Proposition 7.5]). Let $0 \le h \le \infty$. The set $\mathcal{T}_h$ is closed under multiplication, contains $\mathbb{1}_{\mathbb{M}_h}$, and separates points of $\mathbb{M}_h$.

Proof. Essentially, the proof is analogous to that of Lemma 25. When showing that $\mathcal{T}_h$ separates points of $\mathbb{M}_h$, one cannot apply the Stone–Weierstrass theorem directly since the set $\mathcal{T}_h$ is not closed under linear combinations. However, one can apply it to $\operatorname{span}(\mathcal{T}_h)$ instead and obtain that there always is a separating linear combination. Then, if a linear combination separates two points, one of the functions appearing in it has to separate these points, too.

For $0 \le h \le \infty$, let $\mathcal{TT}_h := \{ \nu \mapsto t(T, \nu) \mid T \text{ a rooted tree of height at most } h \} \subseteq C(\mathcal{P}(\mathbb{M}_h), \mathbb{R})$, and let $\operatorname{span}(\mathcal{TT}_h)$ be the set of linear combinations of functions in $\mathcal{TT}_h$. We then obtain the following universal approximation theorem for linear combinations of tree homomorphism density functions.

Theorem 37. Let $0 \le h \le \infty$. Then, $\operatorname{span}(\mathcal{T}_h)$ is dense in $C(\mathbb{M}_h, \mathbb{R})$ and $\operatorname{span}(\mathcal{TT}_h)$ is dense in $C(\mathcal{P}(\mathbb{M}_h), \mathbb{R})$.

Proof. $\operatorname{span}(\mathcal{T}_h)$ has all the properties listed in Lemma 36 and is additionally closed under linear combinations, which means that the Stone–Weierstrass theorem is applicable; this immediately yields the first claim. The first claim, together with another application of the Stone–Weierstrass theorem, then yields the second claim, where, as in the proof of Lemma 36, we use that if a function in $\operatorname{span}(\mathcal{TT}_h)$ separates two probability measures $\mu, \nu \in \mathcal{P}(\mathbb{M}_h)$, then so does a function in $\mathcal{TT}_h$.

Lemma 38. Let $0 \le h \le \infty$. Let $(\nu_i)_i$ be a sequence of measures from $\mathcal{P}(\mathbb{M}_h)$ and let $\nu \in \mathcal{P}(\mathbb{M}_h)$. Then, $\nu_i \to \nu$ if and only if $t(T, \nu_i) \to t(T, \nu)$ for every rooted tree $T$ of height at most $h$.

Proof. Follows from Theorem 37 and the definition of the weak topology. Here, we use that the integral and the limit are linear, which means that we can replace the set $\operatorname{span}(\mathcal{T}_h)$ by the set $\mathcal{T}_h$ in order to obtain the exact statement of this lemma.
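As a small numerical illustration of this kind of convergence statement (cf. Lemma 38 and Theorem 27), Erdős–Rényi graphs $G(n, p)$ converge to the constant graphon $p$, and the homomorphism density of the path on three vertices accordingly approaches $p^2$. The snippet below is our own sanity check of this well-known fact, not one of the paper's experiments; it uses that the homomorphism count of the path on three vertices equals $\sum_v \deg(v)^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
for n in (50, 200, 800):
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                            # symmetric adjacency, zero diagonal
    deg = A.sum(axis=1)
    # hom(P3, G) counts walks of length two, i.e., sum_v deg(v)^2; divide by n^3
    print(n, (deg ** 2).sum() / n ** 3, p ** 2)
```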
Let us conclude this section with what is known as the counting lemma in the world of graph homomorphisms, or, in other words, a tree analogue of Lemma 3. Formally, we prove that tree homomorphism density functions are Lipschitz in the metrics we defined in Section 3, where the Lipschitz constant depends only on the order of the tree. This implies that if two graphons are close in one of our metrics, then all tree homomorphism densities for trees up to a certain order are close. Hence, in comparison to Lemma 3, the order of the tree takes the place of the Lipschitz constant and the number of layers of an MPNN model.

Lemma 39. Let $h \in \mathbb{N}$ and let $T$ be a rooted tree of height at most $h$. Then,
\[
|t_h(T, \alpha) - t_h(T, \beta)| \le |E(T)| \cdot d_{W,h}(\alpha, \beta)
\quad\text{and}\quad
|t(T, \mu) - t(T, \nu)| \le |V(T)| \cdot \delta_{W,h}(\mu, \nu)
\]
for all $\alpha, \beta \in \mathbb{M}_h$ and all $\mu, \nu \in \mathcal{P}(\mathbb{M}_h)$, respectively. These inequalities also hold for $d_{W,\infty}$ and $\delta_{W,\infty}$ with an additional factor of $h$ in the Lipschitz constant.

Proof. We prove the first inequality by induction on the height of $T$. If it is zero, the statement trivially holds. For the inductive step, let $T_1, \dots, T_k$ denote the trees rooted at the children of the root of $T$. Then,
\[
|t_h(T, \alpha) - t_h(T, \beta)|
= \Bigl| \prod_{i \in [k]} \int_{\mathbb{M}_{h-1}} t_{h-1}(T_i, \cdot) \, d\alpha - \prod_{i \in [k]} \int_{\mathbb{M}_{h-1}} t_{h-1}(T_i, \cdot) \, d\beta \Bigr|
\le \sum_{i \in [k]} \Bigl| \int_{\mathbb{M}_{h-1}} t_{h-1}(T_i, \cdot) \, d\alpha - \int_{\mathbb{M}_{h-1}} t_{h-1}(T_i, \cdot) \, d\beta \Bigr|
\]
\[
\le \sum_{i \in [k]} \lVert t_{h-1}(T_i, \cdot) \rVert_{\mathrm{BL}} \, \mathrm{BL}(\alpha, \beta)
\le \sum_{i \in [k]} \bigl(1 + \lVert t_{h-1}(T_i, \cdot) \rVert_{\mathrm{L}}\bigr) \, \mathrm{W}(\alpha, \beta)
\le \sum_{i \in [k]} \bigl(1 + |E(T_i)|\bigr) \, \mathrm{W}(\alpha, \beta)
= |E(T)| \, \mathrm{W}(\alpha, \beta) = |E(T)| \, d_{W,h}(\alpha, \beta),
\]
which proves the claim. The second inequality then follows from the first. We have
\[
|t(T, \mu) - t(T, \nu)|
= \Bigl| \int_{\mathbb{M}_h} t_h(T, \cdot) \, d\mu - \int_{\mathbb{M}_h} t_h(T, \cdot) \, d\nu \Bigr|
\le \lVert t_h(T, \cdot) \rVert_{\mathrm{BL}} \, \mathrm{BL}(\mu, \nu)
\le \bigl(1 + |E(T)|\bigr) \, \mathrm{W}(\mu, \nu)
= |V(T)| \, \delta_{W,h}(\mu, \nu).
\]
Let $h(T)$ denote the height of $T$. Then,
\[
|t_\infty(T, \alpha) - t_\infty(T, \beta)|
= \bigl| t_{h(T)}\bigl(T, p_{\infty,h(T)}(\alpha)\bigr) - t_{h(T)}\bigl(T, p_{\infty,h(T)}(\beta)\bigr) \bigr|
\le |E(T)| \cdot d_{W,h(T)}\bigl(p_{\infty,h(T)}(\alpha), p_{\infty,h(T)}(\beta)\bigr)
\le |E(T)| \cdot h(T) \cdot d_{W,\infty}(\alpha, \beta)
\]
for all $\alpha, \beta \in \mathbb{M}$, and
\[
|t(T, \mu) - t(T, \nu)|
= \bigl| t\bigl(T, (p_{\infty,h(T)})_* \mu\bigr) - t\bigl(T, (p_{\infty,h(T)})_* \nu\bigr) \bigr|
\le |V(T)| \cdot \delta_{W,h(T)}\bigl((p_{\infty,h(T)})_* \mu, (p_{\infty,h(T)})_* \nu\bigr)
\le |V(T)| \cdot h(T) \cdot \delta_{W,\infty}(\mu, \nu)
\]
for all $\mu, \nu \in \mathcal{P}(\mathbb{M})$, where the last inequality follows from the definition of $d_{W,\infty}$.

E Additional experimental results

Here, we provide additional experimental results.

E.1 Graph distance preservation

Figure 3: Increasing the number of layers first improves and then degrades the ability of MPNNs to preserve graph distances. Untrained GIN embeddings are more sensitive to changes in the number of layers than untrained GraphConv embeddings.

Figure 4: For MUTAG, MPNNs preserve graph distances better when the number of hidden dimensions increases.

Figure 5: For MUTAG, MPNNs preserve graph distances worse when the depth increases beyond a certain threshold.

E.2 Untrained MPNNs

Table 2: Untrained MPNNs are significantly faster than trained MPNNs. We report the mean training time (in milliseconds) per epoch ± std over 200 epochs.

Time per epoch (ms)     MUTAG          IMDB-BINARY     IMDB-MULTI      NCI1            PROTEINS       REDDIT-BINARY
GIN (trained)           17.09 ± 0.04   101.23 ± 0.07   148.09 ± 0.03   321.52 ± 0.07   90.31 ± 0.06   291.41 ± 0.18
GIN (untrained)          9.17 ± 0.00    65.09 ± 0.01    98.49 ± 0.17   180.54 ± 0.02   48.52 ± 0.00   210.29 ± 0.03
GraphConv (trained)     12.02 ± 0.01    80.59 ± 0.07   118.28 ± 0.03   226.31 ± 0.07   65.01 ± 0.06   228.08 ± 0.18
GraphConv (untrained)    8.08 ± 0.00    64.44 ± 0.01    95.41 ± 0.17   163.08 ± 0.02   45.95 ± 0.00   185.76 ± 0.03

Table 3: With a smaller hidden dimensionality (3 layers, 128 hidden dimensions), untrained GIN-m remains competitive with trained GIN-m, whereas untrained GraphConv-m is less competitive than trained GraphConv-m. We report the mean accuracy ± std over 10 data splits.

Accuracy                  MUTAG          IMDB-BINARY    IMDB-MULTI     NCI1           PROTEINS       REDDIT-BINARY
GIN-m (trained)           78.8 ± 2.05    69.76 ± 1.44   45.63 ± 1.07   78.8 ± 0.58    73.11 ± 0.7    84.15 ± 0.71
GIN-m (untrained)         78.34 ± 2.0    70.31 ± 0.74   46.75 ± 1.17   71.86 ± 0.29   73.17 ± 0.63   78.8 ± 0.5
GraphConv-m (trained)     81.96 ± 1.78   67.69 ± 1.56   44.71 ± 0.98   64.06 ± 0.45   71.37 ± 0.61   81.9 ± 0.35
GraphConv-m (untrained)   66.75 ± 0.42   62.33 ± 0.87   42.54 ± 1.47   62.31 ± 0.35   71.51 ± 0.58   74.09 ± 0.34
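For concreteness, the following NumPy sketch illustrates the kind of computation a randomly initialized, untrained MPNN with sum aggregation and sum pooling performs, roughly in the spirit of the MPNN-s setting evaluated in Table 4 below. It is a simplified stand-in rather than the exact architectures used in our experiments, and the name untrained_gin_embedding is ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def untrained_gin_embedding(A, X, hidden=128, layers=3, seed=0):
    """Graph-level embedding from a randomly initialized GIN-style MPNN:
    each layer applies a random two-layer MLP to H + A @ H (sum aggregation),
    and the vertex features are sum-pooled at the end. No training involved."""
    rng = np.random.default_rng(seed)       # same seed => same random weights for all graphs
    H = X
    for _ in range(layers):
        W1 = rng.normal(scale=1.0 / np.sqrt(H.shape[1]), size=(H.shape[1], hidden))
        W2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, hidden))
        H = relu(relu((H + A @ H) @ W1) @ W2)
    return H.sum(axis=0)                     # sum pooling over the vertices

# Example: the same untrained model separates a triangle from a path on three vertices.
A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.ones((3, 1))                          # constant initial features (no node labels)
diff = untrained_gin_embedding(A_triangle, X) - untrained_gin_embedding(A_path, X)
print(np.linalg.norm(diff))                  # Euclidean distance of the two MPNN outputs
```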
Table 4: For standard MPNNs using sum aggregation, sum pooling, and no $1/|V(G)|$ normalization (denoted MPNN-s), untrained models also show performance competitive with their trained counterparts (3 layers, 128 hidden dimensions). We report the mean accuracy ± std over 10 data splits.

Accuracy                  MUTAG          IMDB-BINARY    IMDB-MULTI     NCI1           PROTEINS       REDDIT-BINARY
GIN-s (trained)           80.99 ± 3.57   69.06 ± 1.31   44.59 ± 0.91   76.91 ± 0.44   72.72 ± 0.93   80.33 ± 1.62
GIN-s (untrained)         81.78 ± 3.29   69.44 ± 1.37   47.53 ± 0.86   77.59 ± 0.45   70.23 ± 0.80   83.22 ± 0.75
GraphConv-s (trained)     70.74 ± 2.47   57.39 ± 2.20   35.89 ± 2.00   56.89 ± 1.14   63.6 ± 0.97    54.82 ± 2.35
GraphConv-s (untrained)   71.35 ± 2.28   52.06 ± 1.62   33.31 ± 1.09   57.12 ± 1.88   68.19 ± 1.26   52.45 ± 1.63

Table 5: Trained and untrained MPNNs (3 layers, 512 hidden dimensions) with mean aggregation and mean (graph-level) pooling, denoted MPNN-mean. We report the mean accuracy ± std over ten data splits. Although our theoretical results do not apply to mean aggregation, untrained MPNNs are still competitive with their trained counterparts.

Accuracy                     MUTAG          IMDB-BINARY    IMDB-MULTI     NCI1           PROTEINS       REDDIT-BINARY
GIN-mean (trained)           74.63 ± 2.93   49.48 ± 1.56   33.70 ± 1.35   73.74 ± 0.45   71.53 ± 0.93   50.04 ± 0.70
GIN-mean (untrained)         72.46 ± 2.56   49.18 ± 1.83   33.03 ± 1.12   77.16 ± 0.39   70.33 ± 0.95   49.90 ± 0.83
GraphConv-mean (trained)     65.87 ± 3.24   49.32 ± 1.35   33.15 ± 1.19   54.39 ± 1.25   66.76 ± 0.96   49.68 ± 0.82
GraphConv-mean (untrained)   63.30 ± 3.55   48.80 ± 1.91   32.51 ± 0.90   55.84 ± 0.53   70.73 ± 0.69   49.39 ± 0.48

Figure 6: Increasing the hidden dimension (number of functions) improves the performance of untrained MPNNs, supporting Theorem 6.

Figure 7: Increasing the number of message-passing layers (number of 1-WL iterations) first improves and then degrades the performance of untrained MPNNs.

Table 6: Classification accuracy of 1-NN based on our graph distances. The results for $d_{\mathrm{WL}}$ are taken from [26], using node degrees as initial labels.

Accuracy                         MUTAG          IMDB-BINARY
$d^{(3)}_{\mathrm{WL}}$ [26]     91.1 ± 4.3     69.4 ± 3.9
$d^{(3)}_{\mathrm{WLLB}}$ [26]   85.2 ± 3.5     69.8 ± 3.3
$\delta_{W,3}$                   87.89 ± 4.11   66.50 ± 4.15
$\delta_{W,\infty}$ ¹            86.32 ± 4.21   66.60 ± 3.93

¹ $\delta_{W,\infty}$ computed up to at most three iterations after the coloring stabilizes.

E.3 Evaluating graph distances in graph classification tasks

Our main theoretical results and empirical validation focus on the relationship between MPNNs and their induced graph distances, suggesting that the Euclidean distance between MPNN embeddings is an efficient lower bound on these graph distances. More precisely, the time complexity of computing $\delta_{P,h}$ or $\delta_{W,h}$ is $O(h \cdot n^5 \log n)$ (the same order as the WL distance in [26]). In contrast, the time complexity of computing the distance between $h$-layer, $d$-dimensional MPNN embeddings (with $\varphi_t$ and $\psi$ chosen as compositions of linear maps and pointwise nonlinearities) is $O(h \cdot n \cdot |E(G)| + n \cdot d)$, which is substantially cheaper, especially for the large, sparse graphs that are ubiquitous in real-world applications.

For ablation purposes, we evaluate our graph distances $\delta_{P,h}$ and $\delta_{W,h}$ on graph classification tasks to examine how well they separate graphs. For comparison, we follow the same setup as Chen et al. [26]. Specifically, for each selected graph dataset from the TUDataset collection [90], we compute the pairwise distances between all graphs in the dataset. We then perform graph classification via a 1-nearest-neighbor (1-NN) classifier, which classifies a test graph based on the label of the closest training graph, as measured by our graph distance. We use a 90/10 training/test random split, repeated ten times, following the same random data splits as [26]. Table 6 shows the mean classification accuracy of 1-NN using our graph distances and demonstrates that they achieve classification performance competitive with the WL distance of [26].
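To illustrate the evaluation protocol, a 1-NN classifier over a precomputed pairwise distance matrix can be set up as in the sketch below. The helper one_nn_accuracy is hypothetical, and the matrix D is assumed to already contain the pairwise $\delta_{W,h}$ (or MPNN-embedding) distances of a dataset; here a random symmetric matrix stands in for it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def one_nn_accuracy(D, y, seed=0):
    """1-NN accuracy from a precomputed pairwise graph-distance matrix D
    (shape n_graphs x n_graphs) and graph labels y, using a 90/10 random split."""
    idx = np.arange(len(y))
    train, test = train_test_split(idx, test_size=0.1, random_state=seed, stratify=y)
    clf = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
    clf.fit(D[np.ix_(train, train)], y[train])   # distances among training graphs
    pred = clf.predict(D[np.ix_(test, train)])   # distances from test to training graphs
    return float((pred == y[test]).mean())

# Toy example: a random symmetric matrix stands in for the actual graph distances.
rng = np.random.default_rng(0)
D = rng.random((50, 50))
D = (D + D.T) / 2.0
np.fill_diagonal(D, 0.0)
y = rng.integers(0, 2, size=50)
print(np.mean([one_nn_accuracy(D, y, seed=s) for s in range(10)]))  # mean over ten splits
```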