# Weisfeiler-Lehman Meets Gromov-Wasserstein

Samantha Chen*¹, Sunhyuk Lim*², Facundo Mémoli*³, Zhengchao Wan*⁴, Yusu Wang*¹ ⁴

The Weisfeiler-Lehman (WL) test is a classical procedure for graph isomorphism testing. The WL test has also been widely used both for designing graph kernels and for analyzing graph neural networks. In this paper, we propose the Weisfeiler-Lehman (WL) distance, a notion of distance between labeled measure Markov chains (LMMCs), of which labeled graphs are special cases. The WL distance is polynomial time computable and is also compatible with the WL test in the sense that the former is positive if and only if the WL test can distinguish the two involved graphs. The WL distance captures and compares subtle structures of the underlying LMMCs and, as a consequence, it is more discriminating than the distance between graphs used for defining the state-of-the-art Wasserstein Weisfeiler-Lehman graph kernel. Inspired by the structure of the WL distance, we identify a neural network architecture on LMMCs which turns out to be universal w.r.t. continuous functions defined on the space of all LMMCs (which includes all graphs) endowed with the WL distance. Finally, the WL distance turns out to be stable w.r.t. a natural variant of the Gromov-Wasserstein (GW) distance for comparing metric Markov chains that we identify. Hence, the WL distance can also be construed as a polynomial time lower bound for the GW distance, which is in general NP-hard to compute.

*Equal contribution. ¹Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA. ²Max Planck Institute for Mathematics in the Sciences, Leipzig, Saxony, Germany. ³Department of Mathematics and Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA. ⁴Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, California, USA. Correspondence to: Zhengchao Wan.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

The Weisfeiler-Lehman (WL) test (Lehman & Weisfeiler, 1968) is a classical procedure which provides a polynomial time proxy for testing graph isomorphism. It is efficient and can distinguish most pairs of graphs in linear time (Babai & Kucera, 1979; Babai & Luks, 1983). The WL test has a close relationship with graph neural networks (GNNs), both in the design of GNN architectures and in terms of characterizing their expressive power. For example, Xu et al. (2018) showed that graph isomorphism networks (GINs) have the same discriminative power as the WL test in distinguishing whether two graphs are isomorphic or not. Recently, Azizian et al. (2020) showed that message passing graph neural networks (MPNNs) are universal with respect to the continuous functions, defined on the set of graphs (with the topology induced by a specific variant of the graph edit distance), that are at most as discriminative as the WL test. However, the WL test is only suitable for testing graph isomorphism and cannot directly quantitatively compare graphs. This state of affairs naturally suggests identifying a distance function between graphs so that two graphs have positive distance iff they can be distinguished by the WL test.
We note that there have been WL-inspired graph kernels which can quantitatively compare graphs (Shervashidze et al., 2011; Togninalli et al., 2019). However, these either cannot handle continuous node features naturally or do not have the same discriminative power as the WL test.

New work and connections to related work. Our work provides novel connections between the WL test, GNNs and the Gromov-Wasserstein distance. The central object we define in this paper is a distance between graphs which has the same discriminative power as the WL test. We do this by combining ideas inherent to the WL test with optimal transport (OT) (Villani, 2009). We call this distance the Weisfeiler-Lehman (WL) distance. We show that two graphs are at zero WL distance if and only if they cannot be distinguished by the WL test. Moreover, the WL distance can be computed in polynomial time.

Furthermore, our WL distance is actually defined on a more general and flexible type of objects called labeled measure Markov chains (LMMCs), of which labeled graphs (i.e., graphs with node features) are special cases. LMMCs can naturally model the interaction between graphs and their node labels (node features), and thus provide a new OT-based perspective for comparing node-labeled graphs. Besides graphs, the LMMC framework also encompasses continuous objects such as Riemannian manifolds and graphons. It is worth noting that the idea of combining OT and Markov chains/heat kernels has long been used to study notions of curvature of geometric objects, e.g., graphs or Riemannian manifolds (von Renesse & Sturm, 2005; Ollivier, 2009).

Our definition of the WL distance is able to capture and compare subtle geometric and combinatorial structures of the underlying LMMCs. This allows us to establish various lower bounds for the WL distance which are not just useful in practical computations but also clarify its discriminating power relative to existing approaches. In particular, we carry out experiments which demonstrate the effectiveness of our WL distance (and its lower bounds) in graph comparison tasks.¹

Furthermore, based on the hierarchy inherent to the WL distance, we are able to identify a neural network architecture on the collection of all LMMCs, which we call MCNNs (for Markov chain NNs). We show that MCNNs have the same discriminative power as the WL test when applied to graphs; at the same time, they have the desired universal approximation property w.r.t. continuous functions defined on the space of all LMMCs (including the space of graphs) equipped with the WL distance. It turns out that from MCNNs one can recover Wasserstein Weisfeiler-Lehman (WWL) graph kernels (Togninalli et al., 2019) and, in particular, we show that a slight variant of a key pseudo-distance between graphs defined in (Togninalli et al., 2019) serves as a lower bound for our WL distance. This indicates that the WL distance has a stronger discriminating ability than WWL graph kernels (see Appendix A.3 for an infinite family of examples).

Finally, we observe that our formulation of the WL distance resembles the Gromov-Wasserstein (GW) distance (Mémoli, 2007; 2011; Peyré et al., 2016; Sturm, 2012; Vayer et al., 2020; Chowdhury & Mémoli, 2019), which is an OT-based distance between metric measure spaces and has recently been widely used in shape matching and machine learning. We hence identify a special variant of the GW distance between Markov chain metric spaces (MCMSs) (including all graphs).
Our version of the GW distance implements a certain multiscale comparison of MCMSs; it vanishes only when the two MCMSs are isomorphic, but it leads to NP-hard optimization problems. Interestingly, it turns out that the poly-time computable WL distance is not only stable w.r.t. (i.e., upper bounded by) this variant of the GW distance, but can also be construed as a variant of the third lower bound (TLB) of this GW distance, as in (Mémoli, 2011).

¹Our code is available at https://github.com/chens5/WLdistance

Proofs of results and details can be found in the Appendix.

2. Preliminaries

2.1. The Weisfeiler-Lehman Test

A labeled graph is a graph $G = (V_G, E_G)$ endowed with a label function $\ell_G : V_G \to Z$, where the labels (i.e., node features) are taken from some set $Z$. Common label functions include the degree label (i.e., $\ell_G : V_G \to \mathbb{N}$ sends each $v \in V_G$ to its degree, denoted by $\deg_G(v)$) and the constant label (assigning a constant to all vertices). For a node $v \in V_G$, let $N_G(v)$ denote the set of neighbors of $v$ in $G$. Below, we describe the Weisfeiler-Lehman hierarchy for a given labeled graph $(G, \ell_G)$.

Definition 1 (Weisfeiler-Lehman hierarchy). Given any labeled graph $(G, \ell_G)$, we consider the following hierarchy of multisets, which we call the Weisfeiler-Lehman hierarchy:

Step 1. For each $v \in V_G$ we compute the pair
$$\ell^{(1)}_{(G,\ell_G)}(v) := \big(\ell_G(v), \{\!\{\ell_G(v') : v' \in N_G(v)\}\!\}\big).$$

Step k. For each $v \in V_G$ we compute the pair
$$\ell^{(k)}_{(G,\ell_G)}(v) := \Big(\ell^{(k-1)}_{(G,\ell_G)}(v), \big\{\!\!\big\{\ell^{(k-1)}_{(G,\ell_G)}(v') : v' \in N_G(v)\big\}\!\!\big\}\Big).$$

Here, $\{\!\{\cdot\}\!\}$ denotes multisets. In the literature, $\ell^{(k)}_{(G,\ell_G)}(v)$ is usually mapped to a common space of labels such as $\mathbb{N}$ through a hash function, a step which we do not require in this paper. We induce, at each step $k$, a multiset
$$L_k((G, \ell_G)) := \big\{\!\!\big\{\ell^{(k)}_{(G,\ell_G)}(v) : v \in V_G\big\}\!\!\big\}.$$

Definition 2 (Weisfeiler-Lehman test). For each integer $k \geq 0$, we compare $L_k((G_1, \ell_{G_1}))$ with $L_k((G_2, \ell_{G_2}))$. If there exists $k \geq 0$ so that $L_k((G_1, \ell_{G_1})) \neq L_k((G_2, \ell_{G_2}))$, then we conclude that the two labeled graphs are non-isomorphic; otherwise we say that the two labeled graphs pass the WL test and that the two graphs are "possibly isomorphic".

2.2. Probability Measures and Optimal Transport

For any measurable space $Z$, we will denote by $\mathcal{P}(Z)$ the collection of all probability measures on $Z$. When $Z$ is a metric space $(Z, d_Z)$, we further require that every $\alpha \in \mathcal{P}(Z)$ has finite 1-moment, i.e., $\int_Z d_Z(z, z_0)\,\alpha(dz) < \infty$ for any $\alpha \in \mathcal{P}(Z)$ and any fixed $z_0 \in Z$.

Pushforward Maps. Given two measurable spaces $X$ and $Y$ and a measurable map $\psi : X \to Y$, the pushforward map induced by $\psi$ is the map $\psi_\# : \mathcal{P}(X) \to \mathcal{P}(Y)$ sending $\alpha$ to $\psi_\#\alpha$, where for any measurable $B \subseteq Y$, $\psi_\#\alpha(B) := \alpha\big(\psi^{-1}(B)\big)$. In the case when $X$ is finite and $Y$ is a metric space, $\psi_\#\alpha$ obviously has finite 1-moment and is thus an element of $\mathcal{P}(Y)$.

Couplings and the Wasserstein Distance. For measurable spaces $X$ and $Y$, given $\alpha \in \mathcal{P}(X)$ and $\beta \in \mathcal{P}(Y)$, $\gamma \in \mathcal{P}(X \times Y)$ is called a coupling between $\alpha$ and $\beta$ if $(p_X)_\#\gamma = \alpha$ and $(p_Y)_\#\gamma = \beta$, where $p_X : X \times Y \to X$ and $p_Y : X \times Y \to Y$ are the canonical projections; e.g., the product measure $\alpha \otimes \beta$ is one such coupling. Let $\mathcal{C}(\alpha, \beta)$ denote the set of all couplings between $\alpha$ and $\beta$. Given a metric space $(Z, d_Z)$, for $\alpha, \beta \in \mathcal{P}(Z)$, we define the ($\ell^1$-)Wasserstein distance between them as follows:
$$d_W(\alpha, \beta) := \inf_{\gamma \in \mathcal{C}(\alpha,\beta)} \int_{Z \times Z} d_Z(z, z')\,\gamma(dz \times dz').$$
By (Villani, 2009, Proposition 2.1), the infimum above is always achieved by some $\gamma \in \mathcal{C}(\alpha, \beta)$, which we call an optimal coupling between $\alpha$ and $\beta$.
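For concreteness, the following is a minimal sketch of how the $\ell^1$-Wasserstein distance between two finitely supported measures with labels in $\mathbb{R}^d$ can be computed in practice, assuming the POT library (`ot.dist`, `ot.emd2`); the function name `wasserstein_1` and the toy data are ours for illustration and are not part of the paper's released code.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wasserstein_1(points_a, weights_a, points_b, weights_b):
    """W_1 between two finitely supported measures on R^d.

    points_*: (n, d) arrays of support points; weights_*: (n,) probability vectors.
    """
    # Ground cost: pairwise Euclidean distances d_Z(z, z').
    M = ot.dist(points_a, points_b, metric="euclidean")
    # ot.emd2 solves the linear program over couplings C(alpha, beta)
    # and returns the optimal transport cost defining d_W.
    return ot.emd2(weights_a, weights_b, M)

# Example: two measures on the real line.
alpha_pts = np.array([[0.0], [1.0], [2.0]])
alpha_wts = np.array([0.5, 0.25, 0.25])
beta_pts = np.array([[0.0], [3.0]])
beta_wts = np.array([0.5, 0.5])
print(wasserstein_1(alpha_pts, alpha_wts, beta_pts, beta_wts))
```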
Hierarchy of Probability Measures. An important ingredient in this paper is the following construction: given a finite set $X$ and a metric space $Z$, a map of the form $\psi : X \to \mathcal{P}(Z)$ induces $\psi_\# : \mathcal{P}(X) \to \mathcal{P}(\mathcal{P}(Z))$, which involves the space of probability measures over probability measures, i.e., $\mathcal{P}(\mathcal{P}(Z))$. Inductively, we define the family of spaces $\mathcal{P}^k(Z)$, called the hierarchy of probability measures:

1. $\mathcal{P}^1(Z) := \mathcal{P}(Z)$;
2. $\mathcal{P}^{(k+1)}(Z) := \mathcal{P}\big(\mathcal{P}^k(Z)\big)$ for $k \geq 1$.

If $Z$ is complete and separable then, when endowed with $d_W$, $\mathcal{P}(Z)$ is also complete and separable (Villani (2009, Theorem 6.18)). By induction, for each $k \in \mathbb{N}$, $\mathcal{P}^k(Z)$ is also a complete and separable metric space. This hierarchy will be critical in our development of the WL distance.

The Gromov-Wasserstein Distance. We call a triple $\mathcal{X} = (X, d_X, \mu_X)$ a metric measure space (MMS) if $(X, d_X)$ is a metric space and $\mu_X$ is a (Borel) probability measure on $X$ with full support. Given any $\mathcal{X} = (X, d_X, \mu_X)$ and $\mathcal{Y} = (Y, d_Y, \mu_Y)$, for any coupling $\gamma \in \mathcal{C}(\mu_X, \mu_Y)$, we define its distortion by
$$\mathrm{dis}(\gamma) := \int\!\!\int |d_X(x, x') - d_Y(y, y')|\,\gamma(dx \times dy)\,\gamma(dx' \times dy').$$
Then, the ($\ell^1$-)Gromov-Wasserstein (GW) distance between $\mathcal{X} = (X, d_X, \mu_X)$ and $\mathcal{Y} = (Y, d_Y, \mu_Y)$ is defined as follows (Mémoli, 2011):
$$d_{GW}(\mathcal{X}, \mathcal{Y}) := \inf_{\gamma \in \mathcal{C}(\mu_X, \mu_Y)} \mathrm{dis}(\gamma), \qquad (1)$$
where we omit the usual $\tfrac{1}{2}$ factor for simplicity.

2.3. Markov Chains

Given a finite set $X$, we call any map $m^X_\bullet : X \to \mathcal{P}(X)$ a Markov kernel on $X$. Of course, Markov kernels can be represented as transition matrices, but we adopt this more flexible language. A probability measure $\mu_X \in \mathcal{P}(X)$ is called a stationary distribution w.r.t. $m^X$ if for every measurable subset $A \subseteq X$ we have
$$\mu_X(A) = \int_X m^X_x(A)\,\mu_X(dx).$$
The existence of stationary distributions is guaranteed by the Perron-Frobenius Theorem (Saloff-Coste, 1997). A measure Markov chain (MMC) is any tuple $\mathcal{X} = (X, m^X, \mu_X)$ where $X$ is a finite set, $m^X$ is a Markov kernel on $X$ and $\mu_X$ is a fully supported stationary distribution w.r.t. $m^X$.

Definition 3 (Labeled measure Markov chain). Given any metric space $Z$, which we refer to as the metric space of labels, a $Z$-labeled measure Markov chain (($Z$-)LMMC for short) is a tuple $(\mathcal{X}, \ell_X)$ where $\mathcal{X}$ is a MMC and $\ell_X : X \to Z$ is a continuous map.

For technical reasons, throughout this paper we assume that the metric space of labels $Z$ is complete and separable.² We let $\mathcal{M}^L(Z)$ denote the collection of all $Z$-LMMCs. The following definition of isomorphism between LMMCs is similar to that of labeled graph isomorphism.

²These assumptions are mild: they encompass finite metric spaces, compact metric spaces, closed subsets of Euclidean spaces and also the set of all integers with the usual metric.

Definition 4. Two $Z$-LMMCs $(\mathcal{X}, \ell_X)$ and $(\mathcal{Y}, \ell_Y)$ are said to be isomorphic if there exists a bijective map $\psi : X \to Y$ such that $\ell_X(x) = \ell_Y(\psi(x))$ and $\psi_\# m^X_x = m^Y_{\psi(x)}$ for all $x \in X$, and $\psi_\# \mu_X = \mu_Y$.

Labeled Graphs as LMMCs. Any labeled graph induces a family of LMMCs, which we explain as follows.

Definition 5 ($q$-Markov chains on graphs). For any graph $G$ and parameter $q \in [0, 1)$, we define the $q$-Markov chain $m^{G,q}$ associated to $G$ as follows: for any $v \in V_G$,
$$m^{G,q}_v := \begin{cases} q\,\delta_v + \dfrac{1-q}{\deg_G(v)} \displaystyle\sum_{v' \in N_G(v)} \delta_{v'}, & N_G(v) \neq \emptyset,\\[6pt] \delta_v, & N_G(v) = \emptyset. \end{cases}$$
We further let $\deg'_G(v) := \deg_G(v)$ if $N_G(v) \neq \emptyset$ and $\deg'_G(v) := 1$ otherwise. Then, it is easy to see that $\mu_G := \big(\sum_{v' \in V_G} \deg'_G(v')\big)^{-1} \sum_{v \in V_G} \deg'_G(v)\,\delta_v$ is a stationary distribution for $m^{G,q}$ for all $q \in [0, 1]$.

For any $q \in [0, 1)$, we let $\mathcal{X}_q(G) := (V_G, m^{G,q}, \mu_G)$ and call $(\mathcal{X}_q(G), \ell_G)$ a graph induced LMMC. When $q = 0$, we also let $m^G := m^{G,0}$ and let $\mathcal{X}(G) := \mathcal{X}_0(G)$.
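Below is a small sketch, assuming NetworkX and NumPy, of how the $q$-Markov chain of Definition 5 and its stationary distribution can be assembled for a finite graph; the helper name `q_markov_chain` is ours for illustration, and the assertion simply checks stationarity numerically.

```python
import numpy as np
import networkx as nx

def q_markov_chain(G, q=0.0):
    """Build the q-Markov chain of Definition 5 and its stationary distribution.

    Returns (nodes, M, mu): M[i, j] = m^{G,q}_{v_i}({v_j}) and mu is the
    stationary distribution proportional to deg'_G.
    """
    nodes = list(G.nodes())
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    M = np.zeros((n, n))
    degp = np.zeros(n)                      # deg'_G
    for v in nodes:
        i = idx[v]
        nbrs = list(G.neighbors(v))
        if nbrs:                            # q*delta_v + (1-q)/deg * sum_{v'} delta_{v'}
            M[i, i] += q
            for u in nbrs:
                M[i, idx[u]] += (1.0 - q) / len(nbrs)
            degp[i] = len(nbrs)
        else:                               # isolated vertex: delta_v
            M[i, i] = 1.0
            degp[i] = 1.0
    mu = degp / degp.sum()
    assert np.allclose(mu @ M, mu)          # mu is stationary for m^{G,q}
    return nodes, M, mu

G = nx.path_graph(4)
nodes, M, mu = q_markov_chain(G, q=0.5)
```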
One has the following desirable property for graph induced LMMCs.

Proposition 2.1. For any $q \in [0, 1)$, $(G_1, \ell_{G_1})$ is isomorphic to $(G_2, \ell_{G_2})$ as labeled graphs iff $(\mathcal{X}_q(G_1), \ell_{G_1})$ is isomorphic to $(\mathcal{X}_q(G_2), \ell_{G_2})$ as LMMCs.

3. The WL Distance

A non-empty finite multiset $M$ of elements from a given set $S$ encodes information about the multiplicity of each $s \in S$ in $M$. This suggests that one might consider the probability measure $\mu_M$ on $S$ induced by $M$:
$$\mu_M(s) := \frac{m(s)}{\sum_{t \in S} m(t)}, \quad s \in S,$$
where $m(s)$ denotes the multiplicity of $s$ in $M$. This point of view permits reinterpreting the multisets appearing in the WL hierarchy (see Definition 1) through the language of probability measures, which will eventually lead us to a distance between graphs.

Definition 6 (Weisfeiler-Lehman measure hierarchy). Given any $Z$-LMMC $(\mathcal{X}, \ell_X)$, we let $l^{(0)}_{(\mathcal{X},\ell_X)} := \ell_X$ and produce the following label functions, whose codomains span a certain hierarchy of probability measures:

Step 1. For each $x \in X$, we have $\big(l^{(0)}_{(\mathcal{X},\ell_X)}\big)_\# m^X_x \in \mathcal{P}(Z)$. Hence we in fact have the function
$$l^{(1)}_{(\mathcal{X},\ell_X)} := \big(l^{(0)}_{(\mathcal{X},\ell_X)}\big)_\# \circ m^X_\bullet : X \to \mathcal{P}(Z).$$

Step k. For each integer $k \geq 2$, we inductively define
$$l^{(k)}_{(\mathcal{X},\ell_X)} := \big(l^{(k-1)}_{(\mathcal{X},\ell_X)}\big)_\# \circ m^X_\bullet : X \to \mathcal{P}^k(Z).$$

We then induce at each step $k$ a probability measure $L_k((\mathcal{X}, \ell_X)) := \big(l^{(k)}_{(\mathcal{X},\ell_X)}\big)_\# \mu_X \in \mathcal{P}^{(k+1)}(Z)$.

$l^{(k)}_{(\mathcal{X},\ell_X)}$ should be compared to $\ell^{(k)}_{(G,\ell_G)}$, and $L_k((\mathcal{X}, \ell_X))$ should be compared to $L_k((G, \ell_G))$ from the WL hierarchy (cf. Definition 1). See Figure 1 for an illustration of the WL measure hierarchy of a graph induced LMMC and its comparison with the corresponding WL hierarchy. We will show later that, up to a certain change of labels, the WL measure hierarchy for a graph induced LMMC captures all the information contained in the WL hierarchy of the original graph (see Proposition 3.3).

We now define the Weisfeiler-Lehman distance based on the WL measure hierarchy.

Definition 7 (Weisfeiler-Lehman distance). For each integer $k \geq 0$ and any metric space of labels $Z$, we define the Weisfeiler-Lehman (WL) distance of depth $k$ between the $Z$-LMMCs $(\mathcal{X}, \ell_X)$ and $(\mathcal{Y}, \ell_Y)$ as
$$d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) := d_W\big(L_k((\mathcal{X}, \ell_X)), L_k((\mathcal{Y}, \ell_Y))\big), \qquad (2)$$
where the $d_W$ above takes place in $\mathcal{P}^k(Z)$. We also define the (absolute) Weisfeiler-Lehman distance by
$$d_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) := \sup_{k \geq 0}\, d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)).$$

Example 1. We write down explicit formulas for $d^{(k)}_{WL}$ when $k = 0$ and $1$. When $k = 0$, it is easy to see that
$$d^{(0)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) = d_W\big((\ell_X)_\#\mu_X, (\ell_Y)_\#\mu_Y\big),$$
which agrees with the Wasserstein distance between the global label distributions $(\ell_X)_\#\mu_X$ and $(\ell_Y)_\#\mu_Y$ (cf. a similar concept for MMSs (Mémoli, 2011)). When $k = 1$, we have that
$$d^{(1)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) = \inf_{\gamma \in \mathcal{C}(\mu_X,\mu_Y)} \int d_W\big((\ell_X)_\# m^X_x, (\ell_Y)_\# m^Y_y\big)\,\gamma(dx \times dy),$$
implementing the comparison of local label distributions.

The following proposition states that the WL distance becomes more discriminating as the depth increases.

Proposition 3.1. Let $k \geq 0$ be any integer. Given any two $Z$-LMMCs $(\mathcal{X}, \ell_X)$ and $(\mathcal{Y}, \ell_Y)$, we have that $d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) \leq d^{(k+1)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y))$.

As a first step towards understanding the WL distance, we show that $d_{WL}$ is a pseudo-distance. We discuss its relationship with the WL test in Proposition 3.3 below.

Proposition 3.2. $d_{WL}$ (resp. $d^{(k)}_{WL}$ for $k \geq 0$) defines a pseudo-distance³ on the collection $\mathcal{M}^L(Z)$.

³By pseudo-distance, we mean that $d_{WL}$ (resp. $d^{(k)}_{WL}$) is symmetric and satisfies the triangle inequality, but non-isomorphic LMMCs can have zero $d_{WL}$ (resp. $d^{(k)}_{WL}$) distance.
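The depth-1 formula from Example 1 is directly computable for finite chains. The following sketch, assuming the POT library and Euclidean labels, first builds the matrix of inner Wasserstein distances between the local label distributions and then solves the outer transport problem over $\mathcal{C}(\mu_X, \mu_Y)$; the function name `d1_wl` is ours for illustration.

```python
import numpy as np
import ot

def d1_wl(MX, muX, labelsX, MY, muY, labelsY):
    """Depth-1 WL distance of Example 1 for finite R^d-labeled MMCs.

    MX, MY: row-stochastic transition matrices; muX, muY: stationary distributions;
    labelsX, labelsY: (n, d) arrays with ell_X(x), ell_Y(y).
    """
    D = ot.dist(labelsX, labelsY, metric="euclidean")   # d_Z on label values
    nX, nY = len(muX), len(muY)
    C = np.zeros((nX, nY))
    for x in range(nX):
        for y in range(nY):
            # d_W((ell_X)_# m^X_x, (ell_Y)_# m^Y_y): the pushforwards put mass
            # MX[x, :] and MY[y, :] on the label points, with ground cost D.
            C[x, y] = ot.emd2(MX[x], MY[y], D)
    # Outer optimal transport between the stationary distributions.
    return ot.emd2(muX, muY, C)
```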
3.1. Comparison with the WL Test

Given the apparent similarity between the WL hierarchy and the WL measure hierarchy, it should not be surprising that (as we will show later in Proposition 3.3), on graphs, $d_{WL}$ essentially has the same discriminative power as the WL test, i.e., those pairs of graphs which can be distinguished by the WL test are the same as those with $d_{WL} > 0$. The WL distance therefore should be interpreted as a quantification of the degree to which the graphs fail to pass the WL test.

Figure 1. Illustration of the WL (measure) hierarchy. The graph $G$ shown in the middle of the figure is assigned the degree label $\ell_G$. We explicitly present two steps of the WL hierarchy of $(G, \ell_G)$ (on the left) and of the WL measure hierarchy of $(\mathcal{X}(G), \ell_G)$ (on the right). Every probability measure is represented as a histogram. For $i = 1, 2$, $\ell_i$ and $l_i$ are abbreviations for $\ell^{(i)}_{(G,\ell_G)}$ and $l^{(i)}_{(\mathcal{X}(G),\ell_G)}$, respectively. Notice how the WL measure hierarchy interprets the multisets from the WL hierarchy as probability measures.

However, there is an apparent loss of information in the WL measure hierarchy due to the normalization inherent in probability measures. This could result in certain cases when $d_{WL}((\mathcal{X}_q(G_1), \ell_{G_1}), (\mathcal{X}_q(G_2), \ell_{G_2})) = 0$ but $(G_1, \ell_{G_1})$ and $(G_2, \ell_{G_2})$ are distinguished by the WL test. See the examples in Appendix A.5. However, as we explain next, with appropriate label transformations, the discriminative power of $d_{WL}$ is the same as that of the WL test.

For any metric space of labels $Z$, consider any injective map $g : Z \times \mathbb{N} \times \mathbb{N} \to Z_1$ where $Z_1$ is another metric space of labels. A trivial example of such an injective $g$ is given by letting $Z_1 := Z \times \mathbb{N} \times \mathbb{N}$ and letting $g$ be the identity map. Now, given any labeled graph $(G, \ell_G : V_G \to Z)$, we generate a new label function $\ell^g_G := g(\ell_G, \deg_G(\cdot), |V_G|) : V_G \to Z_1$. Intuitively, this is understood as relabeling $G$ via the map $g$. Modulo this change of label function, we establish that $d_{WL}$ has the same discriminative power as the WL test.

Proposition 3.3. For any $q \in (\tfrac{1}{2}, 1)$, the WL test distinguishes two labeled graphs $(G_1, \ell_{G_1})$ and $(G_2, \ell_{G_2})$ iff $d_{WL}\big((\mathcal{X}_q(G_1), \ell^g_{G_1}), (\mathcal{X}_q(G_2), \ell^g_{G_2})\big) > 0$.

Although it may seem that we are injecting more information into labels, this extra relabeling in fact does not affect the outcome of the WL test:

Lemma 3.4. The WL test distinguishes $(G_1, \ell_{G_1})$ and $(G_2, \ell_{G_2})$ iff it distinguishes $(G_1, \ell^g_{G_1})$ and $(G_2, \ell^g_{G_2})$.

By Proposition 3.3 and a convergence result pertaining to the WL test (Krebs & Verbitsky, 2015), one has the following convergence result for $d^{(k)}_{WL}$ (which implies that, to determine whether two graph induced LMMCs satisfy $d_{WL} = 0$, one only needs to inspect $d^{(k)}_{WL}$ for finitely many $k$).

Corollary 3.5. For any $q \in (\tfrac{1}{2}, 1)$ and any two labeled graphs $(G_1, \ell_{G_1})$ and $(G_2, \ell_{G_2})$, if $d^{(k)}_{WL}\big((\mathcal{X}_q(G_1), \ell^g_{G_1}), (\mathcal{X}_q(G_2), \ell^g_{G_2})\big) = 0$ holds for each $k = 0, \ldots, |V_{G_1}| + |V_{G_2}|$, we then have that $d_{WL}\big((\mathcal{X}_q(G_1), \ell^g_{G_1}), (\mathcal{X}_q(G_2), \ell^g_{G_2})\big) = 0$.

Results from (Babai & Kucera, 1979) imply that the WL test can certify isomorphism of random graphs (with degree labels) with high probability. Then, immediately by Proposition 3.3, we have that with high probability $d_{WL}$ generates positive distance for non-isomorphic random-graph-induced LMMCs.
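For reference, the following is a minimal sketch of the classical color-refinement procedure behind Definitions 1 and 2, assuming NetworkX; the helper `wl_test_distinguishes` is ours for illustration. It compares the multisets $L_k$ for finitely many refinement steps ($|V_{G_1}| + |V_{G_2}|$ by default, cf. the convergence result of Krebs & Verbitsky (2015) invoked in Corollary 3.5).

```python
import networkx as nx

def wl_test_distinguishes(G1, lab1, G2, lab2, num_iters=None):
    """Classical WL test (Definitions 1 and 2): return True iff the multisets
    L_k differ at some step k, i.e., the graphs are certified non-isomorphic."""
    if num_iters is None:
        num_iters = len(G1) + len(G2)
    c1 = {v: lab1[v] for v in G1}
    c2 = {v: lab2[v] for v in G2}
    for _ in range(num_iters + 1):
        # Compare the multisets L_k((G_i, l_{G_i})).
        if sorted(c1.values()) != sorted(c2.values()):
            return True
        # One refinement step: pair the current label with the multiset of the
        # neighbors' labels (stringified so it can serve as a new discrete label).
        c1 = {v: str((c1[v], sorted(c1[u] for u in G1.neighbors(v)))) for v in G1}
        c2 = {v: str((c2[v], sorted(c2[u] for u in G2.neighbors(v)))) for v in G2}
    return False

G1, G2 = nx.star_graph(3), nx.path_graph(4)      # claw vs. 4-path (cf. Appendix A.5)
lab1 = {v: 1 for v in G1}                        # constant labels
lab2 = {v: 1 for v in G2}
print(wl_test_distinguishes(G1, lab1, G2, lab2)) # True: multisets differ at k = 1
```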
3.2. A Lower Bound for $d^{(k)}_{WL}$

The WL measure hierarchy, defined through consecutive steps of pushforward maps, can be related to a certain sequence of Markov kernels which we explain next. Given a MMC $\mathcal{X} = (X, m^X, \mu_X)$ and any $k \in \mathbb{N}$, the $k$-step Markov kernel, denoted by $m^{X,(k)}$, is defined inductively as follows: for any $x \in X$, when $k = 1$, $m^{X,(1)}_x := m^X_x$, and when $k \geq 2$, for $A \subseteq X$,
$$m^{X,(k)}_x(A) := \int_X m^{X,(k-1)}_{x'}(A)\, m^X_x(dx').$$
If we represent $m^X$ by a transition matrix $M_X$, then the matrix corresponding to $m^{X,(k)}$ is the $k$th power of $M_X$.

Recall the formula for $d^{(1)}_{WL}$ from Example 1. We define a certain quantity (for each $k \in \mathbb{N}$) which arises by replacing the Markov kernels in that formula with $k$-step Markov kernels:
$$d^{(k)}_{WLLB}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) := \inf_{\gamma \in \mathcal{C}(\mu_X,\mu_Y)} \int d_W\big((\ell_X)_\# m^{X,(k)}_x, (\ell_Y)_\# m^{Y,(k)}_y\big)\,\gamma(dx \times dy).$$
Notice that the $d_W$ above takes place in $\mathcal{P}(Z)$, whereas the $d_W$ in Equation (2) defining $d^{(k)}_{WL}$ takes place in $\mathcal{P}^k(Z)$. Of course, we have that $d^{(1)}_{WLLB} = d^{(1)}_{WL}$. It turns out that for each $k \geq 1$, $d^{(k)}_{WLLB}$ is a lower bound for $d^{(k)}_{WL}$.

Proposition 3.6. For any $(\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y) \in \mathcal{M}^L(Z)$ and any integer $k \geq 1$ we have that $d^{(k)}_{WLLB}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) \leq d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y))$.

For any fixed $k \in \mathbb{N}$ and any two finite $\mathbb{R}$-LMMCs $(\mathcal{X}, \ell_X)$ and $(\mathcal{Y}, \ell_Y)$, if we let $n := \max(|X|, |Y|)$, then computing $d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y))$ can be done in $O(n^5 \log(n)\, k)$ time, whereas the total time complexity for computing $d^{(k)}_{WLLB}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y))$ is $O(n^3 \log(nk))$. Hence it is far more efficient to compute $d^{(k)}_{WLLB}$ than $d^{(k)}_{WL}$. Details can be found in Appendix A.7.
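The lower bound $d^{(k)}_{WLLB}$ is straightforward to implement for real-valued labels: raise the transition matrices to the $k$th power, compute the pairwise Wasserstein distances between the pushforward label distributions, and solve one outer transport problem. A minimal sketch assuming NumPy and POT follows; for simplicity it calls a generic solver for the inner one-dimensional Wasserstein distances, whereas the $O(n^3 \log(nk))$ bound above relies on the closed-form quantile computation of $W_1$ on $\mathbb{R}$.

```python
import numpy as np
import ot

def dk_wllb(MX, muX, labelsX, MY, muY, labelsY, k):
    """Lower bound d^(k)_WLLB: replace the depth-k nesting by k-step kernels.

    Labels are assumed real-valued here ((n,) arrays), so the inner Wasserstein
    distances are one-dimensional.
    """
    MXk = np.linalg.matrix_power(MX, k)               # k-step kernels m^{X,(k)}
    MYk = np.linalg.matrix_power(MY, k)
    D = np.abs(labelsX[:, None] - labelsY[None, :])   # |ell_X(x') - ell_Y(y')|
    nX, nY = len(muX), len(muY)
    C = np.zeros((nX, nY))
    for x in range(nX):
        for y in range(nY):
            # d_W((ell_X)_# m^{X,(k)}_x, (ell_Y)_# m^{Y,(k)}_y) in P(R)
            C[x, y] = ot.emd2(MXk[x], MYk[y], D)
    return ot.emd2(muX, muY, C)                       # infimum over C(muX, muY)
```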
4. WL Distance Inspired Neural Networks

We now focus on the case when the metric space of labels $Z$ is Euclidean, i.e., $Z = \mathbb{R}^d$, and define a family of real functions on $\mathcal{M}^L(\mathbb{R}^d)$ called Markov chain neural networks (MCNNs). We both study the discriminative power and establish a universality result for this family of functions.

For any function $\phi : \mathbb{R}^i \to \mathbb{R}^j$, we define the map $q_\phi : \mathcal{P}(\mathbb{R}^i) \to \mathbb{R}^j$ sending $\alpha \in \mathcal{P}(\mathbb{R}^i)$ to the average $\int_{\mathbb{R}^i} \phi(x)\,\alpha(dx)$. Based on $q_\phi$, we define two types of maps:

(1) $F_\phi : \mathcal{M}^L(\mathbb{R}^i) \to \mathcal{M}^L(\mathbb{R}^j)$ sending $(\mathcal{X}, \ell_X)$ to $(\mathcal{X}, \ell^\phi_X)$, where $\ell^\phi_X : X \to \mathbb{R}^j$ is defined by $x \mapsto q_\phi\big((\ell_X)_\# m^X_x\big)$.

(2) $S_\phi : \mathcal{M}^L(\mathbb{R}^i) \to \mathbb{R}^j$ sending $(\mathcal{X}, \ell_X)$ to $q_\phi\big((\ell_X)_\# \mu_X\big)$.

Below, we only focus on maps $q_\phi$ when $\phi = \phi_\sigma$ is a multilayer perceptron (MLP) with a single hidden layer, i.e., any map of the form $\phi_\sigma(x) := C\,\sigma(Wx + b)$, where $C, W$ are matrices, $b$ is any vector, and $\sigma$ represents elementwise application of a fixed activation function $\sigma$. For technical reasons, we also assume that $\sigma$ is Lipschitz⁴ and non-polynomial. For example, one can choose $\sigma$ to be ReLU or sigmoid. Then, for any sequence of MLPs $\phi_i : \mathbb{R}^{d_{i-1}} \to \mathbb{R}^{d_i}$ for $i = 1, \ldots, k+1$, and any MLP $\psi : \mathbb{R}^{d_{k+1}} \to \mathbb{R}$, we define a map of the following form, which we call a $k$-layer Markov chain neural network (MCNN$_k$):
$$\psi \circ S_{\phi_{k+1}} \circ F_{\phi_k} \circ \cdots \circ F_{\phi_1} : \mathcal{M}^L(\mathbb{R}^d) \to \mathbb{R}. \qquad (3)$$
Note the resemblance between our MCNNs and message passing neural networks (MPNNs) for graphs (Gilmer et al., 2017): specifically, $(\ell_X)_\# m^X_x$ is analogous to the AGGREGATION operation, $q_\phi$ is analogous to the UPDATE operation, and $\psi \circ S_\phi$ corresponds to the readout function that appears in the context of MPNNs.

⁴Note that any MLP with a Lipschitz activation function is Lipschitz. This fact will be used in later proofs.

Example 2 (Relation with WWL graph kernels). MCNNs recover the framework of Wasserstein Weisfeiler-Lehman (WWL) graph kernels w.r.t. continuous attributes (Togninalli et al., 2019): consider any labeled graph $(G, \ell_G : V_G \to \mathbb{R}^d)$ and any $q \in [0, 1)$. Let $\phi : \mathbb{R}^d \to \mathbb{R}^d$ be any continuous map. Applying $q_\phi$ to $(\mathcal{X}_q(G), \ell_G)$ (see Definition 5), then for any $v \in V_G$ such that $N_G(v) \neq \emptyset$, we have
$$\ell^\phi_G(v) = \int_{\mathbb{R}^d} \phi(t)\,(\ell_G)_\# m^{G,q}_v(dt) = q\,\phi(\ell_G(v)) + \frac{1-q}{\deg_G(v)} \sum_{v' \in N_G(v)} \phi(\ell_G(v')).$$
Notice that if we further let $q = \tfrac{1}{2}$ and $\phi : \mathbb{R}^d \to \mathbb{R}^d$ be the identity map $\mathrm{id}$ (by relaxing the assumption that $\phi$ is a MLP), then we have
$$\ell^{\mathrm{id}}_G(v) = \frac{1}{2}\Big(\ell_G(v) + \frac{1}{\deg_G(v)} \sum_{v' \in N_G(v)} \ell_G(v')\Big).$$
This is exactly how labels are updated in the WWL graph kernel framework. A slight modification of the ground distance computation in the WWL framework generates a lower bound for $d^{(k)}_{WL}$, which implies that the WL distance is more capable of discriminating labeled graphs than WWL graph kernels. This is confirmed by the examples in Appendix A.3.

We let $\mathcal{NN}_k(\mathbb{R}^d)$ denote the collection of all MCNN$_k$. Below, we show that $\mathcal{NN}_k(\mathbb{R}^d)$ has the same discriminative power as the WL distance.

Proposition 4.1. Given any $(\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y) \in \mathcal{M}^L(\mathbb{R}^d)$,

1. if $d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) = 0$, then for every $h \in \mathcal{NN}_k(\mathbb{R}^d)$ one has that $h((\mathcal{X}, \ell_X)) = h((\mathcal{Y}, \ell_Y))$;
2. if $d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) > 0$, then there exists $h \in \mathcal{NN}_k(\mathbb{R}^d)$ such that $h((\mathcal{X}, \ell_X)) \neq h((\mathcal{Y}, \ell_Y))$.

Recall from Corollary 3.5 that, given any $q \in (\tfrac{1}{2}, 1)$, any injective map $g$ and any labeled graphs $(G_1, \ell_{G_1})$ and $(G_2, \ell_{G_2})$, we need at most $2n$ steps to determine whether $d_{WL}\big((\mathcal{X}_q(G_1), \ell^g_{G_1}), (\mathcal{X}_q(G_2), \ell^g_{G_2})\big) = 0$, where $n = \max(|V_{G_1}|, |V_{G_2}|)$. Consequently, MCNNs have the same discriminative power as the WL test:

Corollary 4.2. For any $\tfrac{1}{2} < q < 1$, the WL test distinguishes the labeled graphs $(G_1, \ell_{G_1})$ and $(G_2, \ell_{G_2})$ iff there exists $h \in \mathcal{NN}_{2n}(\mathbb{R}^d)$ for which $h\big(\mathcal{X}_q(G_1), \ell^g_{G_1}\big) \neq h\big(\mathcal{X}_q(G_2), \ell^g_{G_2}\big)$.

Since MPNNs also have the same discriminative power as the WL test (Xu et al., 2018), we know that our MCNNs can separate all pairs of graphs that MPNNs can separate.

Next, we establish a universal approximation theorem for MCNNs. We first introduce some notation. In general, a pseudometric space canonically induces a metric space by identifying points at $0$ distance (see Burago et al. (2001, Proposition 1.1.5)). We let $\mathcal{M}^L_k(\mathbb{R}^d)$ denote the metric space induced by the pseudometric space $(\mathcal{M}^L(\mathbb{R}^d), d^{(k)}_{WL})$. As a direct consequence of Proposition 4.1, every $h \in \mathcal{NN}_k(\mathbb{R}^d)$ induces a real function (which we still denote by $h$) on $\mathcal{M}^L_k(\mathbb{R}^d)$. Then, our MCNNs are actually universal w.r.t. continuous functions defined on $\mathcal{M}^L_k(\mathbb{R}^d)$ (see the proof in Appendix B.3.2).

Theorem 4.3. For any $k \in \mathbb{N}$, let $K \subseteq \mathcal{M}^L_k(\mathbb{R}^d)$ be any compact subspace. Then⁵, $\overline{\mathcal{NN}_k(\mathbb{R}^d)} = C(K, \mathbb{R})$, i.e., MCNNs are dense in the space of continuous real-valued functions on $K$.

⁵For simplicity of notation, we still use $\mathcal{NN}_k(\mathbb{R}^d)$ to denote the induced functions on $\mathcal{M}^L_k(\mathbb{R}^d)$ with domain restricted to $K$.

Our universality result resembles the one established in (Azizian et al., 2020) for message passing neural networks (MPNNs), which proves that MPNNs can universally approximate continuous functions on graphs with bounded size which are less or equally as discriminative as the WL test. Compared with their result, we remark that our universality result applies to the collection of all LMMCs (and hence all graphs) with no restriction on their size. Moreover, although for simplicity LMMCs are restricted to finite spaces throughout the paper, our MCNNs and universality result can potentially be extended to more general LMMCs, including continuous objects such as manifolds and graphons.
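As a concrete (non-optimized) illustration of Equation (3), the following NumPy sketch evaluates a $k$-layer MCNN on a finite LMMC given explicit MLP weights; all function and parameter names are ours for illustration, and a real implementation would of course learn the weights.

```python
import numpy as np

def mlp(x, W, b, C):
    """Single-hidden-layer MLP phi_sigma(x) = C sigma(Wx + b) with ReLU sigma."""
    return C @ np.maximum(W @ x + b, 0.0)

def F_phi(M, labels, params):
    """F_phi: replace each node label by q_phi((ell_X)_# m^X_x), i.e. the
    m^X_x-weighted average of phi over the current labels."""
    W, b, C = params
    phi_vals = np.stack([mlp(labels[i], W, b, C) for i in range(len(labels))])
    return M @ phi_vals            # row x: sum_{x'} m^X_x({x'}) phi(ell(x'))

def S_phi(mu, labels, params):
    """S_phi: mu_X-weighted average of phi over the final labels (readout)."""
    W, b, C = params
    phi_vals = np.stack([mlp(labels[i], W, b, C) for i in range(len(labels))])
    return mu @ phi_vals

def mcnn_forward(M, mu, labels, layer_params, readout_params, psi_params):
    """k-layer MCNN of Eq. (3): psi o S_{phi_{k+1}} o F_{phi_k} o ... o F_{phi_1}."""
    h = labels
    for params in layer_params:    # k applications of F_phi
        h = F_phi(M, h, params)
    pooled = S_phi(mu, h, readout_params)
    W, b, C = psi_params           # final MLP psi: R^{d_{k+1}} -> R
    return float(mlp(pooled, W, b, C))

# Toy usage with random weights on a 4-point MMC with 2-dimensional labels.
rng = np.random.default_rng(0)
n, d = 4, 2
M = np.full((n, n), 1.0 / n)
mu = np.full(n, 1.0 / n)
labels = rng.normal(size=(n, d))
rand = lambda dout, din, h=8: (rng.normal(size=(h, din)), np.zeros(h), rng.normal(size=(dout, h)))
out = mcnn_forward(M, mu, labels, [rand(3, d), rand(3, 3)], rand(3, 3), rand(1, 3))
```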
5. Relationship with the GW Distance

Given a LMMC $(\mathcal{X}, \ell_X)$, the label $\ell_X$ induces the following pseudo-distance on $X$: $d_X(x, x') := d_Z(\ell_X(x), \ell_X(x'))$ for $x, x' \in X$. This suggests structures which are closely related to LMMCs: Markov chain metric spaces (MCMSs for short). A MCMS is any tuple $(\mathcal{X}, d_X)$ where $\mathcal{X} = (X, m^X, \mu_X)$ is a finite MMC and $d_X$ is a proper distance on $X$. Obviously, endowing a MCMS $(\mathcal{X}, d_X)$ with a label function $\ell_X$ and forgetting $d_X$ produces a LMMC $(\mathcal{X}, \ell_X)$. We let $\mathcal{M}^{MS}$ denote the collection of all MCMSs. We now construct a Gromov-Wasserstein type distance between MCMSs and study its relationship with the WL distance.

5.1. The GW Distance between MCMSs

Recall from Equation (1) the definition of the standard GW distance between MMSs. Intuitively, in order to identify a suitable GW-like distance between MCMSs, we would like to incorporate a comparison between Markov kernels into Equation (1). Towards this goal, for each $k \in \mathbb{N}$, we consider a special type of maps $\nu^{(k)}_{\bullet,\bullet} : X \times Y \to \mathcal{P}(X \times Y)$ which are defined similarly to how we define $k$-step Markov kernels and satisfy that, for any $x \in X$ and $y \in Y$, $\nu^{(k)}_{x,y}$ is a coupling between the $k$-step Markov kernels $m^{X,(k)}_x$ and $m^{Y,(k)}_y$; see Appendix A.2 for the precise definition. We refer to $\nu^{(k)}_{\bullet,\bullet}$ as a $k$-step coupling between $k$-step Markov kernels. We let $\mathcal{C}^{(k)}(m^X, m^Y)$ denote the collection of all such $k$-step couplings $\nu^{(k)}_{\bullet,\bullet}$.

Definition 8. For any $k \geq 1$ and any MCMSs $(\mathcal{X}, d_X)$ and $(\mathcal{Y}, d_Y)$, we define the $k$-distortion of any pair $(\gamma, \nu^{(k)}_{\bullet,\bullet})$, where $\gamma \in \mathcal{C}(\mu_X, \mu_Y)$ and $\nu^{(k)}_{\bullet,\bullet} \in \mathcal{C}^{(k)}(m^X, m^Y)$, as
$$\mathrm{dis}^{(k)}\big(\gamma, \nu^{(k)}_{\bullet,\bullet}\big) := \int\!\!\int\!\!\int \big|d_X(x, x'') - d_Y(y, y'')\big|\,\nu^{(k)}_{x',y'}(dx'' \times dy'')\,\gamma(dx' \times dy')\,\gamma(dx \times dy).$$
This notion of distortion implements a multiscale reweighting of the coupling $\gamma$ through the $k$-step coupling $\nu^{(k)}_{\bullet,\bullet}$. Then, the $k$-Gromov-Wasserstein distance between the MCMSs $(\mathcal{X}, d_X)$ and $(\mathcal{Y}, d_Y)$ is defined by
$$d^{(k)}_{GW}((\mathcal{X}, d_X), (\mathcal{Y}, d_Y)) := \inf_{\substack{\gamma \in \mathcal{C}(\mu_X,\mu_Y)\\ \nu^{(k)}_{\bullet,\bullet} \in \mathcal{C}^{(k)}(m^X, m^Y)}} \mathrm{dis}^{(k)}\big(\gamma, \nu^{(k)}_{\bullet,\bullet}\big).$$
We then define the (absolute) Gromov-Wasserstein distance between MCMSs by
$$d^{MCMS}_{GW}((\mathcal{X}, d_X), (\mathcal{Y}, d_Y)) := \sup_k\, d^{(k)}_{GW}((\mathcal{X}, d_X), (\mathcal{Y}, d_Y)).$$

Proposition 5.1. $d^{MCMS}_{GW}$ defines a proper⁶ distance on the collection $\mathcal{M}^{MS}$ modulo isomorphism of MCMSs.

⁶Unlike the case for $d_{WL}$, two MCMSs have zero $d^{MCMS}_{GW}$ distance iff they are isomorphic. The precise definition of isomorphism between MCMSs is postponed to Definition 15 in Appendix B.4.

Example 3 (MMS induced MCMS). Given a metric measure space $\mathcal{X} = (X, d_X, \mu_X)$, we produce a MCMS $M(\mathcal{X})$ whose underlying MMC is $(X, m^X, \mu_X)$, obtained by letting $m^X_x := \mu_X$ (for every $x \in X$) be the constant Markov kernel, and whose metric is $d_X$. It is easy to check that $\mu_X$ is a stationary distribution w.r.t. $m^X$. Then, for any two metric measure spaces $\mathcal{X} = (X, d_X, \mu_X)$ and $\mathcal{Y} = (Y, d_Y, \mu_Y)$, and any $k \geq 1$, we have that $d^{(k)}_{GW}(M(\mathcal{X}), M(\mathcal{Y})) = d^{bi}_{GW}(\mathcal{X}, \mathcal{Y})$, where $d^{bi}_{GW}$ denotes a decoupled version of the Gromov-Wasserstein distance between metric measure spaces (see Appendix A.4), which is of course independent of $k$.

$d^{bi}_{GW}$ is in general NP-hard to compute (Scetbon et al., 2021), which (via Example 3) implies that $d^{MCMS}_{GW}$ is also NP-hard to compute. See Appendix A.6 for a basic computable lower bound estimate of $d^{MCMS}_{GW}$. In the next section, we establish more sophisticated lower bounds for $d^{MCMS}_{GW}$ involving the WL distance.
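To make Definition 8 concrete in the finite setting, the sketch below evaluates the $k$-distortion for a given coupling $\gamma$ and a given $k$-step coupling $\nu^{(k)}$ stored as dense arrays; it only evaluates the objective, since minimizing it over all admissible pairs (i.e., computing $d^{(k)}_{GW}$) is NP-hard in general, as discussed above. The function name and array layout are ours for illustration.

```python
import numpy as np

def k_distortion(dX, dY, gamma, nu):
    """dis^(k)(gamma, nu) of Definition 8 for finite MCMSs.

    dX: (nX, nX) and dY: (nY, nY) distance matrices; gamma: (nX, nY) coupling of
    (muX, muY); nu: (nX, nY, nX, nY) array with nu[x', y'] a coupling of the
    k-step kernels m^{X,(k)}_{x'} and m^{Y,(k)}_{y'}.
    """
    # Average the k-step coupling under gamma; this yields a k-step coupling
    # between muX and muY (cf. Appendix A.2).
    gamma_k = np.einsum("ab,abcd->cd", gamma, nu)
    # |d_X(x, x'') - d_Y(y, y'')| indexed by (x, y, x'', y'').
    diff = np.abs(dX[:, None, :, None] - dY[None, :, None, :])
    # Integrate against gamma_k(dx'' dy'') and gamma(dx dy).
    return float(np.einsum("xy,xyuv,uv->", gamma, diff, gamma_k))
```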
5.2. The WL Distance vs. the GW Distance

Given a MCMS $(\mathcal{X}, d_X)$, fix any $x \in X$. Then, one can endow $\mathcal{X}$ with the label function $d_X(x, \cdot) : X \to \mathbb{R}$. This gives rise to the LMMC $(\mathcal{X}, d_X(x, \cdot))$. Then, we have the following lower bound of $d^{(k)}_{GW}$ in terms of $d^{(k)}_{WL}$:

Proposition 5.2. For each $k \geq 1$ and for any MCMSs $(\mathcal{X}, d_X)$ and $(\mathcal{Y}, d_Y)$, one has that
$$d^{(k)}_{GW}((\mathcal{X}, d_X), (\mathcal{Y}, d_Y)) \geq \inf_{\gamma \in \mathcal{C}(\mu_X,\mu_Y)} \int d^{(k)}_{WL}\big((\mathcal{X}, d_X(x, \cdot)), (\mathcal{Y}, d_Y(y, \cdot))\big)\,\gamma(dx \times dy).$$

Remark 5.3. When $(\mathcal{X}, d_X)$ and $(\mathcal{Y}, d_Y)$ are induced from MMSs $\mathcal{X}$ and $\mathcal{Y}$, as shown in Example 3, the left-hand side of the above inequality coincides with the decoupled GW distance. We point out that the right-hand side is also independent of $k$ and actually coincides with the third lower bound (TLB) for the GW distance as defined in (Mémoli, 2011). See Appendix B.4.4 for more details. Hence, the proposition above can be viewed as a generalization of the TLB to the setting of MCMSs.

In general, a MCMS $(\mathcal{X}, d_X)$ endowed with any label function $\ell_X : X \to Z$ induces a LMMC $(X, m^X, \mu_X, \ell_X)$, which we denote by $(\mathcal{X}, \ell_X)$. If we assign label functions to all MCMSs in a suitably coherent way, we obtain that the WL distance between the induced LMMCs is stable w.r.t. the GW distance between the corresponding MCMSs.

Definition 9 (Label invariant of MCMSs). Given any metric space of labels $Z$, a $Z$-valued label invariant of MCMSs is a map $\ell$ which, by definition, sends each MCMS $(\mathcal{X}, d_X) \in \mathcal{M}^{MS}$ to a label function $\ell_X : X \to Z$. One such label invariant $\ell$ will be said to be stable if, for all $(\mathcal{X}, d_X), (\mathcal{Y}, d_Y) \in \mathcal{M}^{MS}$, $k \in \mathbb{N}$, and for any $\gamma \in \mathcal{C}(\mu_X, \mu_Y)$, $\nu^{(k)}_{\bullet,\bullet} \in \mathcal{C}^{(k)}(m^X, m^Y)$, we have
$$\int\!\!\int d_Z(\ell_X(x'), \ell_Y(y'))\,\nu^{(k)}_{x,y}(dx' \times dy')\,\gamma(dx \times dy) \leq \mathrm{dis}^{(k)}\big(\gamma, \nu^{(k)}_{\bullet,\bullet}\big).$$
By $\mathcal{L}(Z)$ we will denote the collection of all stable $Z$-valued label invariants.

Example 4. One immediate example of a stable label invariant is the eccentricity function $\mathrm{ecc}$ (see Appendix B.4.5 for a proof): for any MCMS $(\mathcal{X}, d_X)$, $\mathrm{ecc}_X(x) := \int_X d_X(x, x')\,\mu_X(dx')$ for $x \in X$.

Then, if one assigns stable labels to MCMSs, one has that the WL distance between the induced LMMCs is stable w.r.t. the GW distance.

Proposition 5.4. For every stable label invariant $\ell \in \mathcal{L}(Z)$, $k \in \mathbb{N}$ and $(\mathcal{X}, d_X), (\mathcal{Y}, d_Y) \in \mathcal{M}^{MS}$ we have that
$$d^{(k)}_{WL}((\mathcal{X}, \ell_X), (\mathcal{Y}, \ell_Y)) \leq d^{(k)}_{GW}((\mathcal{X}, d_X), (\mathcal{Y}, d_Y)).$$
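Combining Example 4 with Proposition 5.4 gives a cheap, computable lower bound for $d^{(k)}_{GW}$: label each MCMS by its eccentricity function and compute the WL distance between the resulting $\mathbb{R}$-labeled LMMCs. A minimal sketch assuming NumPy follows; the function name is ours for illustration, and it can be combined with the `d1_wl` sketch from Section 3.

```python
import numpy as np

def eccentricity_labels(dX, muX):
    """Example 4: ecc_X(x) = int_X d_X(x, x') mu_X(dx'), a stable label invariant."""
    return dX @ muX

# By Proposition 5.4 (with k = 1), the depth-1 WL distance between the
# ecc-labeled LMMCs lower-bounds d^(1)_GW between the underlying MCMSs, e.g.
#   d1_wl(MX, muX, eccentricity_labels(dX, muX)[:, None],
#         MY, muY, eccentricity_labels(dY, muY)[:, None])
# reusing the d1_wl sketch from Section 3.
```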
6. Experimental Results

We provide some results showing the effectiveness of our WL distance in terms of comparing graphs. We conduct both 1-NN and SVM graph classification experiments and evaluate the performance of both our lower bound, $d^{(k)}_{WLLB}$, and our WL distance, $d^{(k)}_{WL}$, against the WWL kernel/distance (Togninalli et al., 2019), the WL kernel and the WL optimal assignment (WL-OA) kernel (Kriege et al., 2016). We note that the WWL kernel of (Togninalli et al., 2019) is a state-of-the-art graph kernel. We use the degree label for both $d^{(k)}_{WL}$ and $d^{(k)}_{WLLB}$. More details on the experimental setup and extra experiments can be found in Appendix A.8.

Table 1. 1-Nearest Neighbor classification accuracy.

| METHOD | MUTAG | PROTEINS | PTC-FM | PTC-MR | IMDB-B | IMDB-M | COX2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $d^{(k)}_{WL}$ | 92.1 ± 6.3 | 63.0 ± 3.5 | 62.2 ± 8.5 | 56.2 ± 6.3 | 70.0 ± 4.3 | 41.3 ± 4.8 | 76.1 ± 5.5 |
| $d^{(k)}_{WLLB}$ | 87.3 ± 1.9 | 66.2 ± 2.2 | 62.5 ± 8.5 | 57.8 ± 6.8 | 69.9 ± 2.5 | 40.6 ± 3.8 | 81.2 ± 5.3 |
| WWL | 85.1 ± 6.5 | 64.7 ± 2.8 | 58.2 ± 8.5 | 54.3 ± 7.9 | 65.0 ± 3.3 | 40.0 ± 3.3 | 76.1 ± 5.6 |

Table 2. SVM classification accuracy.

| METHOD | MUTAG | PROTEINS | PTC-FM | PTC-MR | IMDB-B | IMDB-M | COX2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $d^{(k)}_{WL}$ | 89.9 ± 6.4 | 72.6 ± 3.1 | 63.6 ± 7.0 | 57.9 ± 7.9 | 76.2 ± 4.1 | 51.6 ± 4.0 | 78.1 ± 0.8 |
| $d^{(k)}_{WLLB}$ | 90.0 ± 5.6 | 70.5 ± 1.0 | 63.3 ± 5.6 | 59.0 ± 8.3 | 75.1 ± 2.2 | 52.0 ± 1.8 | 78.1 ± 0.8 |
| WWL | 85.3 ± 7.3 | 72.9 ± 3.6 | 62.2 ± 6.1 | 63.0 ± 7.4 | 70.8 ± 5.4 | 50.0 ± 5.3 | 78.2 ± 0.8 |
| WL | 85.5 ± 1.6 | 71.6 ± 0.6 | 56.6 ± 2.1 | 56.2 ± 2.0 | 72.4 ± 0.7 | 50.9 ± 0.4 | 78.4 ± 1.1 |
| WL-OA | 86.3 ± 2.1 | 72.6 ± 0.7 | 58.4 ± 2.0 | 54.2 ± 1.6 | 73.0 ± 1.1 | 50.2 ± 1.1 | 78.8 ± 1.3 |

1-NN Classification. In this case, both $d^{(k)}_{WLLB}$ and $d^{(k)}_{WL}$ slightly outperform the WWL distance on all datasets we tested; see Table 1. Overall, the classification accuracies of $d^{(k)}_{WLLB}$ and $d^{(k)}_{WL}$ were close to those of the WWL distance. These results illustrate the close relationship between $d^{(k)}_{WL}$ and the WWL distance that was outlined in Section 4.

SVM Classification. See Table 2. First, we observe that our lower bound kernel slightly outperforms the WWL kernel for MUTAG, IMDB-B, and IMDB-M. For the other datasets, $d^{(k)}_{WLLB}$ had comparable classification accuracy with the other methods, coming within one to two percent of WWL and WL-OA. The $d^{(k)}_{WL}$ kernel had similar classification accuracy to $d^{(k)}_{WLLB}$, but only outperformed the $d^{(k)}_{WLLB}$ kernel on PTC-FM, PROTEINS, and IMDB-B. Note that our lower bound distance $d_{WLLB}$ performs similarly to our WL distance $d_{WL}$ but is more efficient to compute. See Appendix A.8.3 for the runtime comparison.

7. Conclusion and Future Directions

In this paper, we proposed the WL distance, a quantitative extension of the WL test for measuring the dissimilarity between objects in a fairly general family called LMMCs. The WL distance possesses interesting connections with graph kernels, GNNs and the GW distance. In order to more directly compare the WL test with the WL distance without resorting to relabeling, one future direction is to redefine the WL distance via positive measures and unbalanced (Gromov-)Wasserstein distances (Liero et al., 2018; Séjourné et al., 2021; De Ponti & Mondino, 2020).

Whereas our paper focuses on the application of our WL distance to the graph setting, LMMCs can be used to model not just graphs but also far more general objects such as Riemannian manifolds (equipped with heat kernels) or graphons. Then, our neural network architecture (MCNN) has the potential to be applied to point sets sampled from manifolds as well, and to serve as the limiting object when studying the convergence of GNNs. It is also interesting to extend our WL distance to a higher order version that is analogous to the higher order k-WL test and k-GNNs. We conjecture that a suitable notion of k-WL distance will converge to the Gromov-Wasserstein distance for MCMSs as k tends to infinity. Finally, we point out that our WL distance can actually be regarded as a certain instance of optimal transport with special couplings (see Appendix A.2). It is of interest to explore the relation between our WL distance and some other distances between stochastic processes defined via optimal transport with special couplings (Backhoff et al., 2017; Moulos, 2021; O'Connor et al., 2022).

Acknowledgements

This work is partially supported by NSF-DMS-1723003, NSF-CCF-1740761, NSF-RI-1901360, NSF-CCF-1839356, NSF-IIS-2050360, NSF-CCF-2112665 and BSF-2020124.

References

Azizian, W. et al. Expressive power of invariant and equivariant graph neural networks. In International Conference on Learning Representations, 2020.

Babai, L. and Kucera, L. Canonical labelling of graphs in linear average time. In 20th Annual Symposium on Foundations of Computer Science (SFCS 1979), pp. 39-46. IEEE, 1979.

Babai, L. and Luks, E. M. Canonical labeling of graphs. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, pp. 171-183, 1983.

Backhoff, J., Beiglböck, M., Lin, Y., and Zalashko, A. Causal transport in discrete time and applications. SIAM Journal on Optimization, 27(4):2528-2562, 2017.

Burago, D., Burago, Y., and Ivanov, S. A course in metric geometry, volume 33. American Mathematical Soc., 2001.

Chowdhury, S. and Mémoli, F.
The Gromov-Wasserstein distance between networks and stable network invariants. Information and Inference: A Journal of the IMA, 8(4): 757 787, 2019. De Ponti, N. and Mondino, A. Entropy-transport distances between unbalanced metric measure spaces. ar Xiv preprint ar Xiv:2009.10636, 2020. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263 1272. PMLR, 2017. Krebs, A. and Verbitsky, O. Universal covers, color refinement, and two-variable counting logic: Lower bounds for the depth. In 2015 30th Annual ACM/IEEE Symposium on Logic in Computer Science, pp. 689 700. IEEE, 2015. Kriege, N. M., Giscard, P.-L., and Wilson, R. On valid optimal assignment kernels and applications to graph classification. Advances in Neural Information Processing Systems, 29:1623 1631, 2016. Lehman, A. and Weisfeiler, B. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsiya, 2(9): 12 16, 1968. Weisfeiler-Lehman meets Gromov-Wasserstein Liero, M., Mielke, A., and Savar e, G. Optimal entropytransport problems and a new Hellinger-Kantorovich distance between positive measures. Inventiones mathematicae, 211(3):969 1117, 2018. Loosli, G., Canu, S., and Ong, C. S. Learning svm in kre ın spaces. IEEE transactions on pattern analysis and machine intelligence, 38(6):1204 1216, 2015. Luss, R. and d Aspremont, A. Support vector machine classification with indefinite kernels. Mathematical Programming Computation, 1(2):97 118, 2009. M emoli, F. On the use of Gromov-Hausdorff distances for shape comparison. In Botsch, M., Pajarola, R., Chen, B., and Zwicker, M. (eds.), Eurographics Symposium on Point-Based Graphics. The Eurographics Association, 2007. ISBN 978-3-905673-51-7. doi: 10.2312/SPBG/ SPBG07/081-090. M emoli, F. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11(4):417 487, 2011. Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. Tudataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020. URL www.graphlearning. io. Moulos, V. Bicausal optimal transport for Markov chains via dynamic programming. In 2021 IEEE International Symposium on Information Theory (ISIT), pp. 1688 1693. IEEE, 2021. Ollivier, Y. Ricci curvature of Markov chains on metric spaces. Journal of Functional Analysis, 256(3):810 864, 2009. O Connor, K., Mc Goff, K., and Nobel, A. B. Optimal transport for stationary Markov chains via policy iteration. Journal of Machine Learning Research, 23(45): 1 52, 2022. Pele, O. and Werman, M. Fast and robust earth mover s distances. In 2009 IEEE 12th international conference on computer vision, pp. 460 467. IEEE, 2009. Peyr e, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pp. 2664 2672. PMLR, 2016. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta numerica, 8:143 195, 1999. Saloff-Coste, L. Lectures on finite Markov chains. In Lectures on probability theory and statistics, pp. 301 413. Springer, 1997. Scetbon, M., Peyr e, G., and Cuturi, M. Linear-time Gromov Wasserstein distances using low rank couplings and costs. ar Xiv preprint ar Xiv:2106.01128, 2021. Schmitzer, B. and Schn orr, C. 
Modelling convex shape priors and matching based on the Gromov-Wasserstein distance. Journal of mathematical imaging and vision, 46(1):143 159, 2013. S ejourn e, T., Vialard, F.-X., and Peyr e, G. The unbalanced Gromov-Wasserstein distance: Conic formulation and relaxation. In 35th Conference on Neural Information Processing Systems, 2021. Shervashidze, N., Schweitzer, P., Van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011. Sturm, K.-T. The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces. ar Xiv preprint ar Xiv:1208.0434, 2012. Titouan, V., Redko, I., Flamary, R., and Courty, N. Cooptimal transport. Advances in Neural Information Processing Systems, 33, 2020. Togninalli, M., Ghisu, E., Llinares-L opez, F., Rieck, B., and Borgwardt, K. Wasserstein Weisfeiler-Lehman graph kernels. Advances in Neural Information Processing Systems, 32:6439 6449, 2019. Vallender, S. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784 786, 1974. Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Fused Gromov-Wasserstein distance for structured objects. Algorithms, 13(9):212, 2020. Villani, C. Topics in optimal transportation, volume 58. American Mathematical Soc., 2003. Villani, C. Optimal transport: old and new, volume 338. Springer, 2009. von Renesse, M.-K. and Sturm, K.-T. Transport inequalities, gradient estimates, entropy and Ricci curvature. Communications on pure and applied mathematics, 58(7): 923 940, 2005. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2018. Weisfeiler-Lehman meets Gromov-Wasserstein A. Extra Details A.1. Useful Facts about Couplings Here we collect some useful facts about couplings which will be used in subsequent proofs. Lemma A.1. Let X, Y be finite metric spaces and let Z be a complete and separable metric space. Let ϕX : X Z and ϕY : Y Z be measurable maps. Consider any µX P(X) and µY P(Y ). Then, we have that d W((ϕX)#µX, (ϕY )#µY ) = inf γ C(µX,µY ) d Z(ϕX(x), ϕY (y))γ(dx dy). Proof. The proof is based on the following result. Lemma A.2. Let X, Y be finite metric spaces and let Z be a metric space of labels. Let ϕX : X Z and ϕY : Y Z be measurable maps. Consider any µX P(X) and µY P(Y ). If we let ϕ := ϕX ϕY , then we have that ϕ#C(µX, µY ) = C ((ϕX)#µX, (ϕY )#µY ) Proof of Lemma A.2. Since X and Y are finite, ϕX(X) and ϕY (Y ) are discrete sets. Then, the lemma follows directly from Proposition 4.5 in (Schmitzer & Schn orr, 2013). d W((ϕX)#µX, (ϕY )#µY ) = inf γ C((ϕX)#µX,(ϕY )#µY ) d Z(z1, z2)γ(dz1 dz2) = inf γ C(µX,µY ) d Z(z1, z2) ϕ#γ(dz1 dz2) = inf γ C(µX,µY ) d Z(ϕX(x), ϕY (y))γ(dx dy). The following lemma is a direct consequence of (Villani, 2009, Corollary 5.22) Lemma A.3. For any complete and separable metric space Z, there exists a measurable map ϕ : P(Z) P(Z) P(Z Z) so that for every α, β P(Z), ϕ(α, β) is an optimal coupling between α and β. A.2. A Characterization of the WL Distance Recall from Definition 7 that the WL distance of depth k is the Wasserstein distance between the local distributions of labels generated at the kth step of the WL hierarchy. In this section, we prove that in fact d (k) WL can be characterized through a novel variant of the Wasserstein distance between the distributions of initial labels (i.e., (ℓX)#µX and (ℓY )#µY ). 
A.2.1. k-STEP COUPLINGS We introduce a convenient notation which will be used in the sequel. Definition 10. Suppose a measurable space Z, a probability measure γ P(Z), and a measurable map ν : Z P(Z) are given. Then, we define the average of ν under γ, denoted by ν γ, which is still a probability measure on Z: For any measurable A Z, ν γ(A) := Z Z νz(A) γ(dz). The operation will be useful for constructing a special type of couplings between Markov kernels. Given two MMCs X and Y, we introduce the following notion of k-step coupling between Markov kernels m X and m Y : Weisfeiler-Lehman meets Gromov-Wasserstein k = 1: A 1-step coupling between m X and m Y is any measurable map ν (1) , : X Y P(X Y ) such that ν(1) x,y C(m X x , m Y y ) for any x X and y Y . k 2: We say a map ν (k) , : X Y P(X Y ) is a k-step coupling between the Markov kernels m X and m Y if there exist a (k 1)-step coupling ν (k 1) , and a 1-step coupling ν (1) , such that ν (k) x,y = ν (k 1) , ν (1) x,y, x X, y Y. Lemma A.4. For any k-step coupling ν (k) , , one has that ν(k) x,y C m X, k x , m Y, k y for every x X and y Y . Proof. We prove the statement by induction on k. When k = 1, the statement is trivially true. Assume that the statement is true for some k 1. Then, for any measurable A X, we have that ν (k+1) x,y (A Y ) = Z ν (k) x ,y (A Y ) ν (1) x,y(dx dy ) m X, k x (A) ν (1) x,y(dx dy ) X m X, k x (A) m X x (dx ) = m X, (k+1) x (A). Similarly, for any measurable B Y , we have that ν (k+1) x,y (X B) = m Y, (k+1) y (B). Hence ν(k+1) x,y C m X, (k+1) x , m Y, (k+1) y and thus we conclude the proof. Henceforth, we denote by C(k) m X , m Y the collection of all k-step couplings between Markov kernels m X and m Y . Definition 11. Given any k N, any coupling γ C(µX, µY ) and any k-step coupling ν (k) , C(k) m X , m Y , we define a probability measure on X Y as follows: µ (k) := ν (k) , γ. (5) We call µ(k) defined as above a k-step coupling between µX and µY . We let C(k)(µX, µY ) denote the collection of all k-step couplings between µX and µY and we also let C(0)(µX, µY ) := C(µX, µY ). As defined above, k-step couplings are indeed couplings. Lemma A.5. Any µ(k) C(k)(µX, µY ) is a coupling between µX and µY . Proof. For any measurable A X, we have that µ (k)(A Y ) = Z ν (k) x,y(A Y ) γ(dx dy) m X, k x (A) γ(dx dy) X m X, k x (A) µX(dx) Similarly, for any measurable B Y , we have that µ(k)(X B) = µY (B). Therefore, µ(k) C(µX, µY ). Weisfeiler-Lehman meets Gromov-Wasserstein Lemma A.6. We have the following hierarchy of k-step couplings: C (0)(µX, µY ) C (1)(µX, µY ) C (2)(µX, µY ) Proof. We prove the following inclusion by induction on k = 0, 1, . . . : C (k)(µX, µY ) C (k+1)(µX, µY ) (6) When k = 0, we only need to check that any γ(1) C(1)(µX, µY ) is a coupling between µX and µY . Assume that γ (1) = ν (1) , γ for some ν (1) , C(1) m X , m Y and γ C(µX, µY ). For any measurable set A X, we have that γ (1)(A Y ) = Z ν (1) x,y(A Y ) γ(dx dy) m X x (A) γ(dx dy) X m X x (A) µX(dx) Similarly, for any measurable B Y we have that γ (1)(X B) = µY (B). Hence, γ(1) C(µX, µY ). Now, assume that Equation (6) holds for some k 0. For any γ(k+2) C(k+2)(µX, µY ), we assume that γ (k+2) = ν (k+2) , γ, where ν (k+2) , C(k+2) m X , m Y and γ C(µX, µY ). Then, there exist ν (k+1) , C(k+1) m X , m Y and ν (1) , C(1) m X , m Y such that ν (k+2) x,y = ν (k+1) , ν (1) x,y, x X, y Y. γ (k+2) = Z ν (k+1) x ,y ν (1) x,y(dx dy ) γ(dx dy) ν (k+1) x ,y Z ν (1) x,y(dx dy ) γ(dx dy) ν (k+1) x ,y γ (1)(dx dy ). 
Here that γ(1) := ν (1) , γ belongs to C(µX, µY ) follows from the case k = 0. Hence, by the induction assumption, γ(k+2) C(k+1)(µX, µY ) which concludes the proof. A.2.2. A CHARACTERIZATION OF d (k) WL VIA k-STEP COUPLINGS Now, we characterize d (k) WL via k-step couplings defined in the previous section. Weisfeiler-Lehman meets Gromov-Wasserstein Theorem A.7. Given any integer k 0 and any two Z-LMMC (X, ℓX) and (Y, ℓY ), we have that d (k) WL((X, ℓX), (Y, ℓY )) = inf γ(k) C(k)(µX,µY ) d Z(ℓX(x), ℓY (y))γ (k)(dx dy). Proof of Theorem A.7. The case k = 0 holds trivially. Now, for any k 1 and for any x X and y Y , by Lemma A.1 we have that d W l (k) (X,ℓX )(x), l (k) (Y,ℓY )(y) = inf νx,y C(m X x ,m Y y ) d W l (k 1) (X,ℓX )(x ), l (k 1) (Y,ℓY )(y ) νx,y(dx dy ). Since (x, y) 7 (m X x , m Y y ) is measurable by definition of Markov kernels, by Lemma A.3 we have that there exists a measurable map ν , : X Y P(X Y ) such that for every x X and y Y , νx,y is optimal, i.e., d W l (k) (X,ℓX )(x), l (k) (Y,ℓY )(y) = Z d W l (k 1) (X,ℓX )(x ), l (k 1) (Y,ℓY )(y ) νx,y(dx dy ). (7) Hence, we have the following formulas. d (k) WL((X, ℓX), (Y, ℓY )) l (k) (X,ℓX ) # µX, l (k) (Y,ℓY ) d W l (k) (X,ℓX )(x), l (k) (Y,ℓY )(y) γ(dx dy), here γ C(µX, µY ) is chosen to be optimal d W l (k 1) (X,ℓX )(x1), l (k 1) (Y,ℓY )(y1) (ν1)x,y(dx1 dy1)γ(dx dy), d W l (0) (X,ℓX )(xk), l (0) (Y,ℓY )(yk) (νk)xk 1,yk 1(dxk dyk) (ν1)x,y(dx1 dy1)γ(dx dy). Here, each (νi) , C(1)(m X , m Y ) for i = 1, . . . , k is optimal in the sense of Equation (7). From the above equations, we identify a probability measure on X Y for every x X and y Y as follows ν (k) x,y := Z (νk)xk 1,yk 1 (νk 1)xk 2,yk 2(dxk 1 dyk 1) (ν1)x,y(dx1 dy1). It is obvious that ν (k) , is a k-step coupling and thus d (k) WL((X, ℓX), (Y, ℓY )) =d W l (k) (X,ℓX ) # µX, l (k) (Y,ℓY ) d W l (0) (X,ℓX )(x ), l (0) (Y,ℓY )(y ) ν (k) x,y(dx dy )γ(dx dy) inf γ(k) C(k)(µX,µY ) d W(ℓX(x), ℓY (y)) γ (k)(dx dy), Conversely, we have that given any γ(k) := ν (k) , γ C(k)(µX, µY ), where γ C(µX, µY ) and ν (k) , C(k) m X , m Y can be written for every x X and y Y as follows ν (k) x,y := Z (νk)xk 1,yk 1 (νk 1)xk 2,yk 2(dxk 1 dyk 1) (ν1)x,y(dx1 dy1), Weisfeiler-Lehman meets Gromov-Wasserstein the following inequalities hold: Z d W(ℓX(x), ℓY (y)) γ (k)(dx dy) d W l (0) (X,ℓX )(xk), l (0) (Y,ℓY )(yk) ν (k) x,y(dxk dyk)γ(dx dy) d W l (0) (X,ℓX )(xk), l (0) (Y,ℓY )(yk) (νk)xk 1,yk 1(dxk dyk) (ν1)x,y(dx1 dy1)γ(dx dy) d W l (1) (X,ℓX )(xk 1), l (1) (Y,ℓY )(yk 1) (νk 1)xk 2,yk 2(dxk 1 dyk 1) (ν1)x,y(dx1 dy1)γ(dx dy) d W l (k 1) (X,ℓX )(x1), l (k 1) (Y,ℓY )(y1) (ν1)x,y(dx1 dy1)γ(dx dy) d W l (k) (X,ℓX )(x), l (k) (Y,ℓY )(y) γ(dx dy) l (k) (X,ℓX ) # µX, l (k) (Y,ℓY ) Infimizing over all γ and ν (k) , , one concludes the proof. A.3. The Wasserstein Weisfeiler-Lehman Graph Kernel and its Relationship with d WL The Wasserstein Weisfeiler-Lehman graph kernel deals with graphs with either categorical or continuous (i.e., Euclidean) labels (Togninalli et al., 2019). We only consider WWL graph kernel w.r.t. continuous labels since categorical labels usually don t come equipped with a distance. Even if we artificially defined a distance on a set of categorical labels (e.g. using Euclidean distance via one-hot encoding), we point out that there wouldn t be a clear relation between WWL graph kernel and the WL distance since at each step of WWL graph kernel a hash function is invoked to map into a single label a pair consisting of a label and a multiset of labels. 
Now, we describe the WWL framework w.r.t. continuous labels as follows. For technical reasons, we assume that all graphs involved in this section are such that all their connected components have cardinality at least 2 (i.e. no graph contains an isolated vertex). Given a labeled graph (G, ℓG : VG Rd), the label function is updated for a fixed number k of iterations according to the equation below for i = 0, . . . , k 1, where ℓ0 G := ℓG. v VG, ℓi+1 G (v) := 1 ℓi G(v) + 1 deg G(v) v NG(v) ℓi G(v ) Then, for each i = 0, . . . , k there is a label function ℓ (i) (G,ℓG) : VG Rd. Define the stacked label function Lk G as follows: Lk G := ℓ0 G, . . . , ℓk G : VG Rd (k+1). Now, given any two labeled graphs (G1, ℓG1) and (G2, ℓG2), Togninalli et al. (2019) first computed Lk G1 and Lk G2, then computed the Wasserstein distance between their induced distributions (which we call the WWL distance) and finally, built a kernel upon this Wasserstein distance. If we let λGi denote the uniform measure on VGi, then we express their WWL distance via pushforward of uniform measures as follows. D (k)((G1, ℓG1), (G2, ℓG2)) := d W Lk G1 # λG1, Lk G2 # λG2 . (8) Now, if we instead of uniform measures consider the stationary distributions µG1 and µG2 w.r.t. m G1, 1 2 and m G2, 1 Weisfeiler-Lehman meets Gromov-Wasserstein Figure 2. In this figure we show two labeled graphs and each of them has 3n + 2 vertices. Each of the graphs has n vertices with label 1, n vertices with label 1, and n + 2 vertices with label 0. respectively, we define the following variant of D(k): ˆD (k)((G1, ℓG1), (G2, ℓG2)) := d W Lk G1 # µG1, Lk G2 # µG2 , (9) which is the distance which we will relate to our WL distance next. In fact, we then prove that ˆD(k)((G1, ℓG1), (G2, ℓG2)) actually provides a lower bound for d (k) WL((Xq(G1), ℓG1), (Xq(G2), ℓG2)) when q = 1 Proposition A.8. For any two labeled graphs (G1, ℓG1 : VG1 Rd) and (G2, ℓG2 : VG2 Rd), one has that for q = 1 2 and any k N, ˆD (k)((G1, ℓG1), (G2, ℓG2)) k d (k) WL((Xq(G1), ℓG1), (Xq(G2), ℓG2)). The proposition will be proved after we provide some examples and remarks. Example 5 (d (k) WL is more discriminating than ˆD(k)). In this example, we construct a family of pairs of graphs so that ˆD(k) between the pairs is always zero but d WL between the pairs is positive. For any n 2, consider the two (3n + 2)-point labeled graphs shown in Figure 2. It is easy to see that for each i = 1, 2, and any v VGi, 1. if ℓGi(v) = 0, then ℓk Gi(v) = 0 for all k = 0, 1, . . .; 2. if ℓGi(v) = 1, then ℓk Gi(v) = 1 2k for all k = 0, 1, . . .. Hence, for any k = 0, . . ., we have that # µG1 = Lk G2 # µG2 = 4n + 2 6n + 2δ(0, ,0) + n 6n + 2δ(1, ,2 k) + n 6n + 2δ( 1, , 2 k). Therefore, ˆD(k)((G1, ℓG1), (G2, ℓG2)) = 0 for all k = 0, . . .. For proving that d WL > 0, we first analyze l (1) (X(G1),ℓG1 )(v1) for any v1 VG1. 1. If ℓG1(v1) = 1, then l (1) (X(G1),ℓG1 )(v1) = 1 2. If ℓG1(v1) = 0 and v1 is neither the leftmost nor the rightmost vertex, then l (1) (X(G1),ℓG1 )(v1) = 6 Weisfeiler-Lehman meets Gromov-Wasserstein 3. If v1 is either the leftmost or the rightmost vertex, then l (1) (X(G1),ℓG1 )(v1) = δ0. We then analyze l (1) (X(G2),ℓG2 )(v2) for any v2 VG2. 1. If ℓG2(v2) = 1, then l (1) (X(G2),ℓG2 )(v2) = 1 2. If v2 is the center vertex, then l (1) (X(G2),ℓG2 )(v2) = 2n + 1 3n + 1δ0 + n 2(3n + 1)δ1 + n 2(3n + 1)δ 1. 3. If ℓG2(v2) = 0 and v2 is not the center vertex, then l (1) (X(G2),ℓG2 )(v2) = δ0. 
Hence, it is clear that when n > 1 d (1) WL((Xq(G1), ℓG1), (Xq(G2), ℓG2)) = d W l (1) (X(G1),ℓG1 ) # µG1, l (1) (X(G2),ℓG2 ) Remark A.9. Our WL distance formulation is flexible and we can of course relax it by allowing the comparison of measure Markov chains with general reference probability measures which are not necessarily stationary. In that case, we can directly compare D(k) defined in Equation (8) with d (k) WL. More precisely, for any graph G, we can replace the stationary distribution inherent to Xq(G) with the uniform measure and hence obtain a new measure Markov chain X u q (G). Then, the same proof technique used for proving Proposition A.8 can be used for proving that D (k)((G1, ℓG1), (G2, ℓG2)) k d (k) WL((X u q (G1), ℓG1), (X u q (G2), ℓG2)). Moreover, we can show that d (k) WL is strictly more discriminating than D(k) via the same pairs of graphs as in Example 5. The proof of Proposition A.8 is based on the following basic fact about the Wasserstein distance: Lemma A.10. Let Z be a complete and separable metric space. Endow Z Z with any product metric d Z Z such that d Z Z((z1, z2), (z3, z4)) d Z(z1, z3) + d Z(z2, z4) z1, z2, z3, z4 Z. For example, one can let d Z Z((z1, z2), (z3, z4)) := q (d Z(z1, z3))2 + (d Z(z2, z4))2. Given any complete and separable metric space X, for any i = 1, 2 and any measurable maps fi, gi : X Z, if we let hi := (fi, gi) : X Z Z, then for any µ1, µ2 P(X) d W((h1)#µ1, (h2)#µ2) d W((f1)#µ1, (f2)#µ2) + d W((g1)#µ1, (g2)#µ2) . Proof of Lemma A.10. For any γf, γg C(µ1, µ2), we define a probability measure ν P(Z Z Z Z) as follows: for any measurable A, A , B, B Z ν(A A B B ) := (f1 f2)#γf(A B) (g1 g2)#γg(A B ). It is easy to show that ν C ((h1)#µ1, (h2)#µ2). Then, d W((h1)#µ1, (h2)#µ2) Z d Z Z((z1, z2), (z3, z4)) ν(dz1 dz2 dz3 dz4) (d Z(z1, z3) + d Z(z2, z4)) ν(dz1 dz2 dz3 dz4) d Z(z1, z3) (f1 f2)#γf(dz1 dz3) + Z d Z(z2, z4) (g1 g2)#γg(dz2 dz4) By Lemma A.1, infimizing over all γf, γg C(µ1, µ2), we obtain the conclusion. Weisfeiler-Lehman meets Gromov-Wasserstein Proof of Proposition A.8. Since Lk G := (ℓ0 G, . . . , ℓk G), by inductively applying Lemma A.10, we have that ˆD (k)(G1, G2) = d W Lk G1 # µG1, Lk G2 i=1 d W ℓi G1 # µG1, ℓi G2 Choose ϕj := id : Rd Rd to be the identity map for each j = 1, . . . , k, then using notation from Appendix B.3.1, we have that ℓj Gi = ℓ (ϕ,j) Gi , i = 1, 2 and j = 1, . . . , k. Then, by Equation (19), we conclude that ˆD (k)(G1, G2) l (i) (Xq(G1),ℓG1 ) # µG1, l (i) (Xq(G2),ℓG2 ) i=1 d (i) WL((Xq(G1), ℓG1), (Xq(G2), ℓG2)) . By Proposition 3.1, we have that ˆD (k)(G1, G2) k d (k) WL((Xq(G1), ℓG1), (Xq(G2), ℓG2)) . A.4. A Decoupled Version of the Gromov-Wasserstein Distance For simplicity, in this section we will assume that the cardinality of all underlying spaces to always be finite. The Gromov-Wasserstein distance d GW was proposed as a measure of dissimilarity between two metric measure spaces; see Section 2.2 for its definition and also (M emoli, 2011) for its more general version involving a parameter p [1, ]. Note that one can define a variant of the standard GW distance by considering two coupling measures γ, γ independently, and use γ γ instead of γ γ in Equation (1). This version of the GW distance was implicit in the optimization procedure followed in (M emoli, 2011) and has been explicitly considered in (S ejourn e et al., 2021; Titouan et al., 2020), and this is closely connected to our GW distance between MCMSs (see Definition 8) as shown in Example 3. 
Here we give the definition of this decoupled variant of the GW distance. Definition 12 (Decoupled Gromov-Wasserstein distance). Suppose two metric measure spaces X = (X, d X, µX) , Y = (Y, d Y , µY ) are given. We define the decoupled Gromov-Wasserstein distance dbi GW(X, Y) in the following way: dbi GW(X, Y) := inf γ,γ C(µX,µY ) |d X(x, x ) d Y (y, y )|γ (dx dy )γ(dx dy). Obviously, dbi GW(X, Y) d GW(X, Y) in general. Furthermore, this inequality is actually tight as one can see in the following remark. Remark A.11. We let ΓX,Y (x, y, x , y ) := |d X(x, x ) d Y (y, y )| for any x, x X and y, y Y . If the kernel ΓX,Y : X Y X Y R is negative semi-definite, then one can show that dbi GW(X, Y) = d GW(X, Y) by invoking S ejourn e et al. (2021, Theorem 4). More precisely, if γ, γ are the optimal coupling measures achieving the infimum in the definition of dbi GW(X, Y), then both γ and γ are optimal for d GW, i.e., dbi GW(X, Y) = ΓX,Y L1(γ γ ) = ΓX,Y L1(γ γ) = ΓX,Y L1(γ γ ) = d GW(X, Y). Just like the original version, dbi GW also becomes a legitimate metric on the collection of metric measure spaces. This is another contribution of our work. Proposition A.12. The decoupled Gromov Wasserstein distance dbi GW is a legitimate metric on MMS. Proof. Symmetry is obvious. We need to prove the triangle inequality plus the fact that dbi GW(X, Y) = 0 happens if and only if X and Y are isomorphic. The if part is trivial. For the other direction we proceed as follows. Suppose that dbi GW(X, Y) = 0. By Lemma B.11 and the compactness of C(µX, µY ) for the weak topology (see Villani (2003, p.49)), there must be optimal couplings γ, γ C(µX, µY ) such that X (x ,y ) X Y |d X(x, x ) d Y (y, y )| γ (x , y ) γ(x, y) = 0. (10) Weisfeiler-Lehman meets Gromov-Wasserstein Claim 1. There exists an isometry φ : X Y such that {(x, φ(x)) : x X} = supp(γ) = supp(γ ). Proof of Claim 1. By Equation (10), we have that d X(x, x ) = d Y (y, y ) (11) for any (x, y) supp(γ) and (x , y ) supp(γ ). Fix an arbitrary x X. Then, since both µX and µY are fully supported and X, Y are finite, there must exist y, y Y such that (x, y) supp(γ) and (x, y ) supp(γ ). Then, y = y by Equation (11). Now, if there exists y Y such that (x, y ) supp(γ), then similarly, we have that y = y and thus y = y. In other words, for each x X, there exists a unique y Y such that (x, y) supp(γ). Similarly, this same y Y is unique such that (x, y) supp(γ ). Hence, we define φ : X Y by letting φ(x) be the unique y Y such that (x, φ(x)) supp(γ). It is obvious that φ is bijective and satisfies that {(x, φ(x)) : x X} = supp(γ) = supp(γ ). By Equation (11), we conclude that φ is an isometry. Based on the claim above, consider an arbitrary Borel subset A X. Then, µX(A) = γ(A Y ) = γ(A Y A φ(A)) = γ(A φ(A)) = µY (φ(A)). Hence, φ is a isomorphism between X and Y. Finally, for the triangle inequality, fix finite metric measure space X,Y, and Z. Notice first that for all x, x X, y, y Y , and z, z Z, ΓX,Y (x, y, x , y ) ΓX,Z(z, x, z , x ) + ΓZ,Y (y, z, y , z ). Next, fix arbitrary coupling measures γ1, γ 1 C(µX, µZ) and γ2, γ 2 C(µZ, µY ). By the Gluing Lemma (see Villani (2003, Lemma 7.6)), there exist probability measures π, π P(X Y Z) with marginals γ1, γ 1 on X Z and γ2, γ 2 on Z Y . Let γ3, γ 3 be the marginal of π, π on X Y . Then, by the triangle inequality of L1 norm, dbi GW(X, Y) ΓX,Y L1(γ3 γ 3) = ΓX,Y L1(π π ) ΓX,Z L1(π π ) + ΓZ,Y L1(π π ) = ΓX,Z L1(γ1 γ 1) + ΓZ,Y L1(γ2 γ 2). 
Since the choices of $\gamma_1, \gamma_1', \gamma_2, \gamma_2'$ are arbitrary, by taking the infimum one concludes that $d^{bi}_{GW}(\mathcal{X}, \mathcal{Y}) \leq d^{bi}_{GW}(\mathcal{X}, \mathcal{Z}) + d^{bi}_{GW}(\mathcal{Z}, \mathcal{Y})$.

A.5. Examples When $d_{WL}$ Fails to Separate Graphs

Example 6 (Constant labels). Let $G_1$ be a claw and $G_2$ a path, each with four nodes; see Figure 3. Let the label functions $\ell_{G_i}$, $i = 1, 2$, be constant and equal to $1$ for both graphs. In the first step of the WL test, we find $L_1((G_1, \ell_{G_1})) = \{\{(1, \{\{1, 1, 1\}\}), (1, \{\{1\}\}), (1, \{\{1\}\}), (1, \{\{1\}\})\}\}$ and $L_1((G_2, \ell_{G_2})) = \{\{(1, \{\{1\}\}), (1, \{\{1\}\}), (1, \{\{1, 1\}\}), (1, \{\{1, 1\}\})\}\}$. Since $L_1((G_1, \ell_{G_1})) \neq L_1((G_2, \ell_{G_2}))$, $(G_1, \ell_{G_1})$ and $(G_2, \ell_{G_2})$ are recognized as non-isomorphic by the WL test. Notice that within the first step the WL test collects degree information, and comparing $L_1((G_1, \ell_{G_1}))$ with $L_1((G_2, \ell_{G_2}))$ is equivalent to comparing the multisets of degrees of $G_1$ and $G_2$. However, for $d_{WL}((X(G_1), \ell_{G_1}), (X(G_2), \ell_{G_2}))$ (abbreviated to $d_{WL}(X(G_1), X(G_2))$ in Figure 3), because of the normalization inherent to the Markov chains $m^{G_1}$ and $m^{G_2}$, each step inside the hierarchy pertaining to the WL distance cannot collect degree information when the labels are constant.

Figure 3. Illustration of Example 6. Notice that if we start from constant labels, one step of the WL hierarchy will collect degree information for the vertices. In contrast, because of the normalization of the Markov kernel, a single step of the WL measure hierarchy with constant labels will not be able to accumulate the same information.

Example 7 (Degree label). Let $G_1$ be a two-point graph consisting of a single edge with vertex set $\{v_1, v_2\}$. Let $G_2$ be a four-point graph consisting of two disjoint edges $\{u_1, u_2\}$ and $\{u_3, u_4\}$; see Figure 4. For each $i = 1, 2$, let $\ell_{G_i}$ be the degree label function. In the first step of the WL test, $L_1((G_1, \ell_{G_1})) = \{\{(1, \{\{1\}\}), (1, \{\{1\}\})\}\}$ and $L_1((G_2, \ell_{G_2})) = \{\{(1, \{\{1\}\}), (1, \{\{1\}\}), (1, \{\{1\}\}), (1, \{\{1\}\})\}\}$. Hence, already in the first step of the WL test the two labeled graphs are distinguished as non-isomorphic. In the case of $d_{WL}((X(G_1), \ell_{G_1}), (X(G_2), \ell_{G_2}))$, notice that $(\ell_{G_1})_{\#} m^{G_1}_{x}(\{z\}) = 1$ if $z = 1$ and $0$ otherwise (i.e. $(\ell_{G_1})_{\#} m^{G_1}_{x} = \delta_1$) for both $x = v_1$ and $x = v_2$. Hence, $l^{(1)}_{(X(G_1),\ell_{G_1})}(v_1) = l^{(1)}_{(X(G_1),\ell_{G_1})}(v_2)$. Similarly, for $G_2$, $l^{(1)}_{(X(G_2),\ell_{G_2})}(u_1) = \cdots = l^{(1)}_{(X(G_2),\ell_{G_2})}(u_4) = l^{(1)}_{(X(G_1),\ell_{G_1})}(v_1)$. It is not hard to show inductively that for each $k \in \mathbb{N}$, $l^{(k)}_{(X(G_1),\ell_{G_1})}(v_1) = l^{(k)}_{(X(G_1),\ell_{G_1})}(v_2) = l^{(k)}_{(X(G_2),\ell_{G_2})}(u_1) = \cdots = l^{(k)}_{(X(G_2),\ell_{G_2})}(u_4)$. Then, for each $k \in \mathbb{N}$, $L_k((X(G_1), \ell_{G_1})) = \big(l^{(k)}_{(X(G_1),\ell_{G_1})}\big)_{\#} \mu_{G_1} = \big(l^{(k)}_{(X(G_2),\ell_{G_2})}\big)_{\#} \mu_{G_2} = L_k((X(G_2), \ell_{G_2}))$ and thus $d^{(k)}_{WL}((X(G_1), \ell_{G_1}), (X(G_2), \ell_{G_2})) = 0$, which implies that $d_{WL}((X(G_1), \ell_{G_1}), (X(G_2), \ell_{G_2})) = 0$. Notice that the standard WL test with degree labels is able to capture (and therefore compare) information about the number of nodes in the graph. On the other hand, $(X(G_1), \ell_{G_1})$ and $(X(G_2), \ell_{G_2})$ cannot be distinguished by the WL distance because of the normalization of the reference measures $\mu_{G_1}$ and $\mu_{G_2}$.

Figure 4. Illustration of Example 7. One step of the WL hierarchy with degree labels can distinguish graphs of different sizes whereas the normalization of $\mu_{G_1}$ and $\mu_{G_2}$ does not allow graph size to be distinguished when the label function is degree.

A.6. A Basic Lower Bound for $d^{MCMS}_{GW}$

One can produce basic lower bounds for $d^{MCMS}_{GW}$ by invoking the notion of diameter for MCMSs, which we define below. We first introduce the one point MCMS.
Example 8. The one point MCMS is the tuple := ({ }, (0), δ , δ ). Definition 13 (MCMS diameter). For each k 1 and a MCMS (X, d X), we define diam (k) MCMS((X, d X)) := d (k) GW((X, ℓX), ), diam MCMS((X, d X)) := d MCMS GW ((X, d X), ). Notice that C(µX, µ ) = {µX δ } and C(k) m X , δ = n m X, k δ o . Then, it turns out that the diameter of X is independent of k: diam (k) MCMS((X, d X)) = d (k) GW((X, d X), ) = Z X d X(x, x ) m X, k x (dx )µX(dx )µX(dx) X d X(x, x )µX(dx )µX(dx) By the triangle inequality, one can prove the following result. Proposition A.13. For all two MCMSs (X, d X),(Y, d Y ) and k 1, we have X d X(x, x )µX(dx )µX(dx) Z Y d Y (y, y )µY (dy )µY (dy) d (k) GW((X, d X), (Y, d Y )) , X d X(x, x )µX(dx )µX(dx) Z Y d Y (y, y )µY (dy )µY (dy) d MCMS GW ((X, d X), (Y, d Y )) . A.7. More Details on the Complexity of Computing the WL Distance In the following subsections, we provide an algorithm for computing the WL distance and complexity analysis for both computing the WL distance and its lower bound defined in Section 3.2. Weisfeiler-Lehman meets Gromov-Wasserstein A.7.1. COMPUTATION OF THE WL DISTANCE In this section, we devise an algorithm (with pseudocode in Algorithm 1) for computing d (k) WL and establish the following complexity analysis. Proposition A.14. For any fixed k N, computing d (k) WL between any LMMCs (X, ℓX) and (Y, ℓY ) can be achieved in time at most O(k n5 log(n)) where n = max(|X|, |Y |). Recall from Equation (2) that the WL distance of depth k is defined as d (k) WL((X, ℓX), (Y, ℓY )) := d W l (k) (X,ℓX ) # µX, l (k) (Y,ℓY ) = inf γ C(µX,µY ) d W l (k) (X,ℓX )(x), l (k) (Y,ℓY )(y) γ(dx dy). In order to compute d (k) WL((X, ℓX), (Y, ℓY )), we must first compute d W(l (k) (X,ℓX )(x), l (k) (Y,ℓY )(y)) for each x X and y Y . To do this, we introduce some notation. For each i = 1, . . . , k, we let Ci denote the |X| |Y | matrix such that for each x X and y Y , Ci(x, y) := d W l (i) (X,ℓX )(x), l (i) (Y,ℓY )(y) . We also let C0 denote the matrix such that C0(x, y) := ℓX(x) ℓY (y) for each x X and y Y . Then, our task is to compute the matrix Ck. For this purpose, we consecutively compute the matrix Ci for i = 1, . . . , k: Given matrix Ci 1, since l (i) (X,ℓX )(x) = l (i 1) (X,ℓX ) # m X x and l (i) (Y,ℓY )(y) = l (i 1) (Y,ℓY ) # m Y y , computing d W l (i) (X,ℓX )(x), l (i) (Y,ℓY )(y) = inf γ C(m X x ,m Y y ) d W l (i 1) (X,ℓX )(x), l (i 1) (Y,ℓY )(y) γ(dx dy). is reduced to solving the optimal transport problem with Ci 1 as the cost matrix and m X x and m Y y as the source and target distributions, which can be done in O(n3 log(n)) time (Pele & Werman, 2009). Thus, for each i, computing Ci given that we know Ci 1, requires O(n2 n3 log(n)). Finally, we need O(n3 log(n)) time to compute d (k) WL((X, ℓX), (Y, ℓY )) based on solving an optimal transport problem with cost matrix Ck and with µX and µY being the source and target distributions, respectively. Therefore, the total time needed to compute d (k) WL((X, ℓX), (Y, ℓY )) is k O(n5 log(n)) + O(n3 log(n)) = O(k n5 log(n)). For any n N, d (2n) WL generates a distance between graph induced LMMCs with size bounded by n. By Corollary 3.5, d (2n) WL has the same discriminating power as the WL test in separating graphs with size bounded by n. Now, given labeled graphs (G1, ℓG1) and (G2, ℓG2) so that max (|VG1|, |VG2|) n, computing d (2n) WL((Xq(G1), ℓG1), (Xq(G2), ℓG2)) takes time at most O(n6 log(n)). A.7.2. 
COMPUTATION OF THE LOWER BOUND DISTANCE Recall from Section 3.2 that the WL lower bound distance was defined as d (k) WLLB((X, ℓX), (Y, ℓY )) := inf γ C(µX,µY ) d W (ℓX)#m X, k x , (ℓY )#m Y, k y γ(dx dy). Given two finite LMMCs, (X, ℓX : X R) where X = X = {x1, x2, ..., xn} , m X , µX and (Y, ℓY : Y R) where Y = Y = {y1, y2, ..., ym} , m Y , µY we represent their Markov kernels as two transition matrices, MX and MY, respectively. Then, k-Markov kernels m X, k and m Y, k are expressed as matrices M k X and M k Y, respectively. Assume that n m. Then computing the k-Markov Weisfeiler-Lehman meets Gromov-Wasserstein Algorithm 1 d (k) WL computation 1: Input: The depth k N, and two finite LMMCs X = {x1, x2, ..., xn} , m X , µX, ℓX : X R and Y = {y1, ..., ym} , m Y , µY , ℓY : Y R 2: Initialization: P = C = zeros(n, m) 3: for i [n], j [m] do 4: P(i, j) = |ℓX(xi) ℓY (yj)| 5: end for 6: for l [k] do 7: for i [n], j [m] do 8: C(i, j) = infγ C m X xi,m Y yj a [n],b [m] P(a, b)γ(a, b) 9: end for 10: P = C 11: end for D = infγ C(µX,µY ) P i [n],j [m] C(i, j)γ(i, j) 12: Output: D kernels of X and Y will require O(n3 log(k)) time where O(n3) is time needed for matrix multiplication. Then since (ℓX)#m X, k x and (ℓY )#m Y, k y are both distributions in R, by (Vallender, 1974), d W (ℓX)#m X, k x , (ℓY )#m Y, k y can be computed in O(n) time for each x X and y Y . Finally, computing d (k) WLLB can be formulated as finding the optimal transport cost where each entry of the cost matrix is defined as d W (ℓX)#m X, k x , (ℓY )#m Y, k y and the source and target distributions are µX, µY respectively. Recall from the previous section that µX and µY are normalized degree distributions for X and Y. Therefore, the overall time complexity is O(n3 log(k)) + O(n3 log(n)) = O(n3 log(kn)). A.8. Experiments A.8.1. EXPERIMENTAL SETUP We use several publicly available graph benchmark datasets from TUDatasets (Morris et al., 2020) and evaluate the performance of our WL distance d (k) WL distance as well as d (k) WLLB (lower bound of our d WL which is more efficient to compute) through two types of graph classification experiments compared with several representative methods. Note that for all of our experiments, we use q = 0.6 to transform every graph G into the Markov chain Xq(G). We use 1-Nearest Neighbors classifier in the first graph classification experiment and for the second experiment, we use support vector machines (SVM). For the first experiment, we compare classification accuracies with the WWL distance (Togninalli et al., 2019) (see Equation (8)). For the second graph classification task, we run an SVM using the indefinite kernel matrices exp γd (k) WLLB and exp γd (k) WL , which are seen as noisy observations of the true positive semi-definite kernels (Luss & d Aspremont, 2009). We also evaluate the classification accuracies of d WLLB and d (k) WL using KSVM (Loosli et al., 2015) for their indefinite kernel matrices. Additionally, for the SVM method, we cross validate the parameter C {10 3, . . . , 103} and the parameter γ {10 3, . . . , 103}. We compare classification accuracies with the WWL kernel (Togninalli et al., 2019), the WL kernel (Shervashidze et al., 2011), and the Weisfeiler-Lehman optimal assignment kernel (WL-OA) (Kriege et al., 2016). Note that we only use WWL distance in the 1-NN graph classification experiment since the WL-OA and WL kernels are not defined in terms of a distance unlike the WWL kernel. 
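As a concrete companion to Algorithm 1 and the complexity discussion of Appendix A.7.2, the following is a minimal sketch, not the authors' released code, of how the lower bound $d^{(k)}_{WLLB}$ used in these experiments can be computed for two graph-induced LMMCs. It assumes NumPy, SciPy, and the POT library, adjacency matrices without isolated vertices, one real-valued label per vertex, and the laziness parameter q of Definition 5; all function names are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D W_1 between weighted samples
import ot  # POT: Python Optimal Transport


def markov_chain(A, q=0.6):
    """Markov kernel m^{G,q} and its stationary (degree-proportional) measure."""
    deg = A.sum(axis=1)
    M = q * np.eye(len(A)) + (1 - q) * A / deg[:, None]  # assumes no isolated vertex
    return M, deg / deg.sum()


def wl_lower_bound(A1, ell1, A2, ell2, k=2, q=0.6):
    """Sketch of d^(k)_WLLB for two graphs with real-valued vertex labels ell1, ell2."""
    M1, mu1 = markov_chain(A1, q)
    M2, mu2 = markov_chain(A2, q)
    P1 = np.linalg.matrix_power(M1, k)  # rows are the k-step kernels from each vertex
    P2 = np.linalg.matrix_power(M2, k)
    # C[x, y] = W_1 between the pushforward label distributions (distributions on R)
    C = np.array([[wasserstein_distance(ell1, ell2, P1[i], P2[j])
                   for j in range(len(A2))] for i in range(len(A1))])
    return ot.emd2(mu1, mu2, C)  # final OT between the reference measures


# Hypothetical usage: a triangle vs. a 3-vertex path, each with its degree label.
A_tri = np.ones((3, 3)) - np.eye(3)
A_path = np.diag([1.0, 1.0], 1) + np.diag([1.0, 1.0], -1)
print(wl_lower_bound(A_tri, A_tri.sum(1), A_path, A_path.sum(1), k=2))
```

Since the labels are real-valued, each entry of the cost matrix is a one-dimensional Wasserstein distance, which keeps the per-entry cost linear after sorting, as in the complexity analysis above.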
In addition to the full accuracies for k {1, 2, 3, 4} for d (k) WL and d (k) WLLB with the degree label, call this f1, we also evaluate d (k) WL and d (k) WLLB with the label function f2(G, v) = 1 |VG| + deg G(v) for any graph G and vertex v VG. Note that f2 is a relabeling of any constant label function, which assigns a constant c to each vertex, via the injective map g : {c} N N R sending (c, n1, n2) to n1 + 1 n2 as described in Section 3.1. So under f2, d (k) WL is as discriminative as the k-step WL test. Thus, we also evaluate the performance of the WWL distance/kernel, WL, and WL-OA kernels using only degree label. Additionally, we report the best accuracies for WWL, WL, and WL-OA for iterations 1, . . . , 4. Weisfeiler-Lehman meets Gromov-Wasserstein Table 3. 1-Nearest Neighbor classification accuracy. Let f1(G, v) = deg G(v), f2(G, v) = 1 |VG| + deg G(v) METHOD MUTAG PROTEINS PTC-FM PTC-MR IMDB-B IMDB-M COX2 d (1) WL, f1 90.5 6.5 61.8 4.3 60.0 8.5 53.9 7.1 70.1 4.7 41.1 3.9 73.8 3.6 d (2) WL, f1 92.1 6.3 60.8 4.4 62.2 7.4 56.2 6.3 69.9 4.2 41.1 4.7 74.2 4.5 d (3) WL, f1 91.1 4.3 60.8 3.5 59.4 8.2 54.0 7.7 69.4 3.9 41.0 4.8 74.2 3.9 d (4) WL, f1 90.1 4.8 63.0 3.8 59.1 8.3 54.2 6.8 70.2 4.3 41.3 4.8 76.1 5.5 d (1) WL, f2 91.6 7.1 63.3 4.4 57.5 6.0 51.9 9.3 71.4 4.5 40.6 5.3 72.5 4.5 d (2) WL, f2 91.1 5.8 62.4 3.4 58.2 8.2 56.2 7.6 70.4 4.5 41.6 4.3 74.0 4.7 d (3) WL, f2 91.5 5.8 63.4 3.9 58.5 7.9 53.4 8.4 71.4 5.9 40.6 4.3 74.6 4.4 d (4) WL, f2 92.6 4.8 63.3 4.9 58.5 8.0 54.8 7.9 71.2 5.1 40.7 4.8 75.9 4.9 d (1) WLLB, f1 87.3 1.9 64.0 2.3 62.5 8.5 57.4 6.8 69.0 3.9 40.6 3.8 75.1 3.8 d (2) WLLB, f1 86.8 3.7 66.2 2.2 60.0 8.1 53.4 6.4 69.4 3.2 40.1 3.6 75.1 3.8 d (3) WLLB, f1 85.2 3.5 64.6 2.2 58.0 1.1 54.5 9.1 69.8 3.3 40.1 3.9 81.2 5.3 d (4) WLLB, f1 84.7 3.1 65.4 2.3 58.0 1.1 52.0 9.1 69.9 2.5 40.1 3.6 80.4 2.3 d (1) WLLB, f2 87.3 2.5 64.7 1.4 62.5 7.4 57.8 6.8 69.0 3.9 40.4 3.6 75.5 3.7 d (2) WLLB, f2 86.3 3.6 65.6 2.2 60.0 8.1 53.4 6.4 69.2 3.2 40.2 3.6 77.0 4.9 d (3) WLLB, f2 85.3 3.6 64.3 1.1 58.0 10.8 54.7 9.1 69.7 3.1 40.1 3.9 80.4 4.4 d (4) WLLB, f2 84.7 3.0 64.8 1.8 58.0 10.9 52.0 9.2 69.4 2.5 40.2 3.8 80.4 4.4 WWL 85.1 6.5 64.7 2.8 58.2 8.5 54.3 7.9 65.0 3.3 40.0 3.3 76.1 5.6 A.8.2. EXTRA EXPERIMENTAL RESULTS In Table 3 and Table 4, we have included the 1-NN and SVM classification accuracies for k {1, 2, 3, 4}, respectively. In general, using a KSVM with kernels exp γd (k) WLLB and exp γd (k) WL yields classification accuracies which are similar to the classification accuracy of a standard SVM with the indefinite kernels. However, note that for the kernel exp γd (k) WL , KSVM has a slightly higher classification accuracy on both PTC-FM and IMDB-B than standard SVM. For the kernel exp γd (k) WLLB with the labels generated by f2, KSVM has a much lower classification accuracy on both PROTEINS and IMDB-B than standard SVM. A.8.3. TIME COMPARISON We compare the runtimes of d (k) WL and d (k) WLLB for k = 1, 2. For our runtime comparisons, we use LMMCs induced by Erd os-Renyi graphs of sizes varying from 5 nodes to 100 nodes (with the degree label function and q = 0.6). Note that while the runtime for d (k) WLLB does not change much between k = 1 and k = 2, the d (k) WL distance shows a significant increase in the time needed to compute distance between two graphs from k = 1 to k = 2. B.1. Proofs from Section 2 B.1.1. PROOF OF PROPOSITION 2.1 The only if part is obvious. To prove the if part, we assume (Xq(G1), ℓG1) is isomorphic to (Xq(G2), ℓG2). 
Then, there exists a bijective map ψ : VG1 VG2 such that ψ#m G1,q v = m G2,q ψ(v), ψ#µG1 = µG2 and ℓG1(v) = ℓG2(ψ(v)) for all v VG1. Now, by the definition of m G1,q and m G2,q (see Definition 5), one can easily check that deg G1(v) = 0 m G1,q v (v) = m G2,q ψ(v)(ψ(v)) = 1 deg G2(ψ(v)) = 0. So, consider the case when m G1,q v (v) < 1. This implies deg G1(v) > 0 and deg G2(ψ(v)) > 0. In this case, again by the definition of m G1,q and m G2,q , one can show that v, v VG1 are adjacent m G1,q v (v ) = m G2,q ψ(v)(ψ(v )) > 0 ψ(v), ψ(v ) VG2 are adjacent. Weisfeiler-Lehman meets Gromov-Wasserstein Figure 5. Comparison of runtime of WWL distance against d (k) WL and its lower bound d (k) WLLB. Hence, G1 and G2 are isomorphic as we required. B.2. Proofs from Section 3 B.2.1. PROOF OF THE CLAIM IN EXAMPLE 1 By Lemma A.1 we have that d (1) WL((X, ℓX), (Y, ℓY )) = d W(L1((X, ℓX)), L1((Y, ℓY ))) l (1) (X,ℓX ) # µX, l (1) (Y,ℓY ) = inf γ C(µX,µY ) d W l (1) (X,ℓX )(x), l (1) (Y,ℓY )(y) γ(dx dy) = inf γ C(µX,µY ) d W (ℓX)#m X x , (ℓY )#m Y y γ(dx dy). B.2.2. PROOF OF PROPOSITION 3.1 This proposition follows directly from Lemma A.6 and Theorem A.7. B.2.3. PROOF OF PROPOSITION 3.2 It is obvious that when (X, ℓX) is isomorphic to (Y, ℓY ), d (k) WL((X, ℓX), (Y, ℓY )) = 0 for all k N and thus d WL((X, ℓX), (Y, ℓY )) = 0. It follows directly from Equation (2) that d (k) WL satisfies the triangle inequality. Hence, d WL := supk 1 d (k) WL also satisfies the triangle inequality. B.2.4. PROOF OF LEMMA 3.4 We first assume that the WL test cannot distinguish (G1, ℓG1) and (G2, ℓG2), i.e., Lk((G1, ℓG1)) = Lk((G2, ℓG2)) for all k = 0, 1, . . .. We then prove that Lk((G1, ℓg G1)) = Lk((G2, ℓg G2)) for all k = 0, 1, . . .. The assumption Lk((G1, ℓG1)) = Lk((G2, ℓG2)) for all k = 0, 1, . . . immediately implies that |VG1| = |VG2|. Then, it suffices to show that for any v1 VG1 and v2 VG2 ℓ (k+1) (G1,ℓG1 )(v1) = ℓ (k+1) (G2,ℓG2 )(v2) = ℓ (k) (G1,ℓg G1 )(v1) = ℓ (k) (G2,ℓg G2 )(v2), k = 0, 1, . . . . (12) We prove Equation (12) by induction on k. When k = 0, for any v1 VG1 and v2 VG2, if ℓ1 (G1,ℓG1)(v1) = ℓ1 (G2,ℓG2)(v2), Weisfeiler-Lehman meets Gromov-Wasserstein then (ℓG1(v1), {{ℓG1(v), v NG1(v1)}}) = (ℓG2(v2), {{ℓG2(v), v NG2(v2)}}). It follows that ℓG1(v1) = ℓG2(v2) and deg G1(v1) = deg G2(v2). Then, by injectivity of g, one has that ℓg G1(v1) = ℓg G2(v2). Now, we assume that Equation (12) holds for some k 0. For the case of k + 1, note that ℓ (k+2) (G1,ℓG1 )(v1) = ℓ (k+2) (G2,ℓG2 )(v2) implies that ℓ (k+1) (G1,ℓG1 )(v1), nn ℓ (k+1) (G1,ℓG1 )(v), v NG1(v1) oo = ℓ (k+1) (G2,ℓG2 )(v2), nn ℓ (k+1) (G2,ℓG2 )(v), v NG2(v2) oo . Hence, ℓ (k+1) (G1,ℓG1 )(v1) = ℓ (k+1) (G2,ℓG2 )(v2) and there exists a bijection ψ : NG1(v1) NG2(v2) such that ℓ (k+1) (G1,ℓG1 )(v) = ℓ (k+1) (G2,ℓG2 )(ψ(v)) for any v NG1(v1). By the induction assumption, we then have that ℓ (k) (G1,ℓg G1 )(v1) = ℓ (k) (G2,ℓg G2 )(v2) and (G1,ℓg G1 )(v) = ℓ (k) (G2,ℓg G2 )(ψ(v)) for any v NG1(v1). This implies that (G1,ℓg G1 )(v1), nn ℓ (k) (G1,ℓg G1 )(v), v NG1(v1) oo = ℓ (k) (G2,ℓg G2 )(v2), nn ℓ (k) (G2,ℓg G2 )(v), v NG2(v2) oo and thus ℓ (k+1) (G1,ℓg G1 )(v1) = ℓ (k+1) (G2,ℓg G2 )(v2). Therefore, Lk (G1, ℓg G1) = Lk (G2, ℓg G2) for all k = 0, 1, . . . and thus the WL test cannot distinguish (G1, ℓg G1) and (G2, ℓg G2). Conversely, we assume that the WL test cannot distinguish (G1, ℓg G1) and (G2, ℓg G2), i.e., Lk (G1, ℓg G1) = Lk (G2, ℓg G2) for all k = 0, 1, . . .. We then prove that Lk((G1, ℓG1)) = Lk((G2, ℓG2)) for all k = 0, 1, . . .. 
The proof is similar to the one for the other direction. First, the assumption Lk (G1, ℓg G1) = Lk (G2, ℓg G2) for all k = 0, 1, . . . implies that |VG1| = |VG2|. Then, it suffices to show that for any v1 VG1 and v2 VG2 (G1,ℓg G1 )(v1) = ℓ (k) (G2,ℓg G2 )(v2) = ℓ (k) (G1,ℓG1 )(v1) = ℓ (k) (G2,ℓG2 )(v2), k = 0, 1, . . . . (13) We prove Equation (13) by induction on k. When k = 0, for any v1 VG1 and v2 VG2, if ℓg G1(v1) = ℓg G2(v2), then by injectivity of g, we have that ℓG1(v1) = ℓG2(v2). Now, we assume that Equation (13) holds for some k 0. For the case of k + 1, note that ℓ (k+1) (G1,ℓg G1 )(v1) = ℓ (k+1) (G2,ℓg G2 )(v2) implies that ℓ (k) (G1,ℓg G1 )(v1), nn ℓ (k) (G1,ℓg G1 )(v), v NG1(v1) oo = ℓ (k) (G2,ℓg G2 )(v2), nn ℓ (k) (G2,ℓg G2 )(v), v NG2(v2) oo . By the induction assumption, it is easy to see that ℓ (k) (G1,ℓG1 )(v1), nn ℓ (k) (G1,ℓG1 )(v), v NG1(v1) oo = ℓ (k) (G2,ℓG2 )(v2), nn ℓ (k) (G2,ℓG2 )(v), v NG2(v2) oo and thus ℓ (k+1) (G1,ℓG1 )(v1) = ℓ (k+1) (G2,ℓG2 )(v2). Therefore, Lk((G1, ℓG1)) = Lk((G2, ℓG2)) for all k = 0, 1, . . . and thus the WL test cannot distinguish (G1, ℓG1) and (G2, ℓG2). B.2.5. PROOF OF PROPOSITION 3.3 By Lemma 3.4, we only need to prove that the WL test cannot distinguish (G1, ℓg G1) and (G2, ℓg G2) iff d WL (Xq(G1), ℓg G) , Xq(G2), ℓg G2 = 0. For this purpose, we need to introduce some new notions. For any metric space Z, we let Mult Pow(Z) denote the collection of all finite multisets of Z (including the empty set). We inductively define a family of sets Zk as follows: 1. Z1 := Z Mult Pow(Z); 2. for k 1, Zk+1 := Zk Mult Pow(Zk). Then, we inductively define a family of maps ϕk q : Zk P k(Z) as follows: Weisfeiler-Lehman meets Gromov-Wasserstein 1. define ϕ1 q : Z1 P(Z) by (z, A) Z Mult Pow(Z) 7 ( qδz + 1 q z A δz , A = 2. for k 1, define ϕk+1 q : Zk+1 P (k+1)(Z) by (z, A) Zk Mult Pow(Zk) 7 ( qδϕkq (z) + 1 q z A δϕkq (z ), A = δϕkq (z), A = . Lemma B.1. For any labeled graph (G, ℓG : VG Z) and any q [0, 1], one has that for any k N l (k) (Xq(G),ℓG) = ϕk q ℓ (k) (G,ℓG) : VG P k(Z). (14) Proof of Lemma B.1. We prove by induction on k. When k = 1, for any v V , if NG(v) = we have that l (1) (Xq(G),ℓG)(v) = (ℓG)#m G,q v = q δℓG(v) + 1 q v NG(v) δℓG(v ) = ϕ1 q ((ℓG(v), {{ℓG(v ) : v NG(v)}})) = ϕ1 q(ℓ (1) (G,ℓG)(v)). If NG(v) = then we have that l (1) (Xq(G),ℓG)(v) = (ℓG)#m G,q v = δℓG(v) = ϕ1 q ((ℓG(v), )) = ϕ1 q(ℓ (1) (G,ℓG)(v)). Now, we assume that Equation (14) holds for some k 1. Then, for k + 1 and for any v V , if NG(v) = , we have that l (k+1) (Xq(G),ℓG)(v) = l (k) (Xq(G),ℓG) # m G,q v = q δl(k) (Xq(G),ℓG)(v) + 1 q v NG(v) δl(k) (Xq(G),ℓG)(v ) = q δϕk q ℓ(k) (G,ℓG)(v) + 1 q v NG(v) δϕkq ℓ(k) (G,ℓG)(v ) = ϕk+1 q ℓ (k) (G,ℓG)(v), nn ℓ (k) (G,ℓG)(v ) : v NG(v) oo = ϕk+1 q ℓ (k+1) (G,ℓG)(v) . If NG(v) = then we have that l (k+1) (Xq(G),ℓG)(v) = l (k) (Xq(G),ℓG) # m G,q v = δl(k) (Xq(G),ℓG)(v) = δϕk q ℓ(k) (G,ℓG)(v) = ϕk+1 q ℓ (k) (G,ℓG)(v), = ϕk+1 q ℓ (k+1) (G,ℓG)(v) . This concludes the proof. Lemma B.2. Fix any 1 2 < q < 1 and any labeled graphs (G1, ℓG1) and (G2, ℓG2). Assume that the labels satisfy that for any v1 VG1 and v2 VG2, we have that ℓG1(v1) = ℓG2(v2) implies deg(v1) = deg(v2). Then, one has that for any v1 VG1, v2 VG2, ℓ (k) (G1,ℓG1 )(v1) = ℓ (k) (G2,ℓG2 )(v2) iff l (k) (Xq(G1),ℓG1 )(v1) = l (k) (Xq(G2),ℓG2 )(v2). (15) Proof of Lemma B.2. By Lemma B.1, we have that ℓ (k) (G1,ℓG1 )(v1) = ℓ (k) (G2,ℓG2 )(v2) implies that l (k) (Xq(G1),ℓG1 )(v1) = l (k) (Xq(G2),ℓG2 )(v2). For the other direction, we prove by induction on k. 
Weisfeiler-Lehman meets Gromov-Wasserstein When k = 1, we first note that l (1) (Xq(G1),ℓG1 )(v1) = q δℓG1(v1) + 1 q deg(v1) P v NG1(v1) δℓG1(v) if NG1(v1) = and l (1) (Xq(G1),ℓG1 )(v1) = δℓG1(v1) otherwise. Since 1 2 < q < 1, l (1) (Xq(G1),ℓG1 )(v1) = l (1) (Xq(G2),ℓG2 )(v2) implies that δℓG1(v1) = δℓG2(v2). Hence ℓG1(v1) = ℓG2(v2) and thus deg G1(v1) = deg G2(v2). This implies that NG1(v1) = iff NG2(v2) = . If NG1(v1) = , then obviously, we have that ℓ (1) (G1,ℓG1 )(v1) = (ℓG1(v1), ) = (ℓG2(v2), ) = ℓ (1) (G2,ℓG2 )(v2). If otherwise NG1(v1) = , then l (1) (Xq(G1),ℓG1 )(v1) = l (1) (Xq(G2),ℓG2 )(v2) again implies that 1 q deg(v1) P v NG1(v1) δℓG1(v) = 1 q deg(v2) P v NG2(v2) δℓG2(v). Hence, P v NG1(v1) δℓG1(v) = P v NG2(v2) δℓG2(v) and thus {{ℓG1(v) : v NG1(v1)}} = {{ℓG2(v) : v NG2(v2)}}. Therefore, ℓ (1) (G1,ℓG1 )(v1) = ℓ (1) (G2,ℓG2 )(v2). Now, we assume that Equation (15) holds for some k 1. Note that l (k+1) (Xq(G1),ℓG1 )(v1) = q δl(k) (Xq(G1),ℓG1 )(v1) + 1 q deg(v1) P v NG1(v1) δl(k) (Xq(G1),ℓG1 )(v), NG1(v1) = δl(k) (Xq(G1),ℓG1 )(v1), NG1(v1) = Then, for k + 1, the assumptions 1 2 < q < 1 and l (k+1) (Xq(G1),ℓG1 )(v1) = l (k+1) (Xq(G2),ℓG2 )(v2) imply that δl(k) (Xq(G1),ℓG1 )(v1) = δl(k) (Xq(G2),ℓG2 )(v2). Hence l (k) (Xq(G1),ℓG1 )(v1) = l (k) (Xq(G2),ℓG2 )(v2). By the induction assumption we have that ℓ (k) (G1,ℓG1 )(v1) = ℓ (k) (G2,ℓG2 )(v2). It is not hard to see that then ℓ (1) (G1,ℓG1 )(v1) = ℓ (1) (G2,ℓG2 )(v2) and thus deg G1(v1) = deg G2(v2). Then, similarly as in the case k = 1, we have two situations. If NG1(v1) = , then we have that ℓ (k+1) (G1,ℓG1 )(v1) = ℓ (k) (G1,ℓG1 )(v1), = ℓ (k) (G2,ℓG2 )(v2), = ℓ (k+1) (G2,ℓG2 )(v2). If otherwise NG1(v1) = , then l (k+1) (Xq(G1),ℓG1 )(v1) = l (k+1) (Xq(G2),ℓG2 )(v2) again implies that 1 q deg(v1) P v NG1(v1) δl(k) (Xq(G1),ℓG1 )(v) = 1 q deg(v2) P v NG2(v2) δl(k) (Xq(G2),ℓG2 )(v). Hence, P v NG1(v1) δl(k) (Xq(G1),ℓG1 )(v) = P v NG2(v2) δl(k) (Xq(G2),ℓG2 )(v) and thus nn l (k) (Xq(G1),ℓG1 )(v) : v NG1(v1) oo = nn l (k) (Xq(G2),ℓG2 )(v) : v NG2(v2) oo . Then, by the induction assumption again, we have that nn ℓ (k) (G1,ℓG1 )(v) : v NG1(v1) oo = nn ℓ (k) (G2,ℓG2 )(v) : v NG2(v2) oo . Therefore, ℓ (k+1) (G1,ℓG1 )(v1) = ℓ (k+1) (G2,ℓG2 )(v2). This concludes the proof. Now, we are ready to prove that the WL test cannot distinguish (G1, ℓg G1) and (G2, ℓg G2) iff d WL (Xq(G1), ℓg G) , Xq(G2), ℓg G2 = 0. It suffices to show that for any k = 0, 1, . . ., Lk((G1, ℓg 1)) = Lk((G2, ℓg 2)) iff l (k) (Xq(G1),ℓg G1 ) # µG1 = l (k) (Xq(G2),ℓg G2 ) Fix any k = 0, . . .. We first assume that Lk((G1, ℓg 1)) = Lk((G2, ℓg 2)). Then, it is obvious that |VG1| = |VG2| and moreover, there exists a bijection ψ : VG1 VG2 such that ℓ (k) (G1,ℓg G1 )(v) = ℓ (k) (G2,ℓg G2 )(ψ(v)) for any v VG1. This implies the following facts: 1. By injectivity of g, deg G1(v) = deg G2(ψ(v)) for any v VG1. Hence, deg G1(v) = deg G2(ψ(v)) for any v VG1. Weisfeiler-Lehman meets Gromov-Wasserstein 2. By Lemma B.1, for any v VG1 we have that (Xq(G1),ℓg G1 )(v) = ϕk q ℓ (k) (G1,ℓg G1 )(v) = ϕk q ℓ (k) (G2,ℓg G2 )(ψ(v)) = l (k) (Xq(G2),ℓg G2 )(ψ(v)). (Xq(G1),ℓg G1 ) deg G1(v) P v VG1 deg G1(v )δl(k) (Xq(G1),ℓg G1 )(v) deg G2(ψ(v)) P v VG1 deg G2(ψ(v ))δl(k) (Xq(G2),ℓg G2 )(ψ(v)) deg G2(v) P v VG2 deg G2(v )δl(k) (Xq(G2),ℓg G2 )(v) (Xq(G2),ℓg G2 ) Conversely, we assume that l (k) (Xq(G1),ℓg G1 ) # µG1 = l (k) (Xq(G2),ℓg G2 ) # µG2. Then, deg G1(v) P v VG1 deg G1(v )δl(k) (Xq(G1),ℓg G1 )(v) = X deg G2(v) P v VG2 deg G2(v )δl(k) (Xq(G2),ℓg G2 )(v). 
(17) Then, for any v1 VG1, there exists v2 VG2 such that l (k) (Xq(G1),ℓg G1 )(v1) = l (k) (Xq(G2),ℓg G2 )(v2). If k = 0, then ℓg G1(v1) = l (0) (Xq(G1),ℓg G1 )(v1) = l (0) (Xq(G2),ℓg G2 )(v2) = ℓg G2(v2). Otherwise, we assume that k > 0. Since 1 2 < q < 1, we have that l (k 1) (Xq(G1),ℓg G1 )(v1) = l (k 1) (Xq(G2),ℓg G2 )(v2). Inductively, we still obtain that ℓg G1(v1) = l (0) (Xq(G1),ℓg G1 )(v1) = l (0) (Xq(G2),ℓg G2 )(v2) = ℓg G2(v2). Hence, by injectivity of g, we have that |VG1| = |VG2|, deg G1(v1) = deg G2(v2) and deg G1(v1) = deg G2(v2). Then, it is easy to see from Equation (17) that n v VG1 : l (k) (Xq(G1),ℓg G1 )(v) = l (k) (Xq(G1),ℓg G1 )(v1) o = n v VG2 : l (k) (Xq(G2),ℓg G2 )(v) = l (k) (Xq(G2),ℓg G2 )(v2) o . It is obvious that ℓg Gi for i = 1, 2 satisfy the condition in Lemma B.2. Then, by Lemma B.2, we have that ℓ (k) (G1,ℓG1 )(v1) = ℓ (k) (G2,ℓG2 )(v2) and that n v VG1 : ℓ (k) (G1,ℓG1 )(v) = ℓ (k) (G1,ℓG1 )(v1) o = n v VG2 : ℓ (k) (G2,ℓG2 )(v) = ℓ (k) (G2,ℓG2 )(v2) o . Therefore, Lk (G1, ℓg G1) = Lk (G2, ℓg G2) . B.2.6. PROOF OF COROLLARY 3.5 It turns out that one only needs finite steps to determine whether the WL test can distinguish two labeled graphs (Krebs & Verbitsky, 2015). More precisely: Proposition B.3. For any labeled graphs (G1, ℓG1) and (G2, ℓG2), Lk((G1, ℓG1)) = Lk((G2, ℓG2)) holds for all k = 0, . . . , |VG1| + |VG2| if and only if Lk((G1, ℓG1)) = Lk((G2, ℓG2)) holds for all k 0. Hence, this corollary is a direct consequence of Proposition 3.3 and Proposition B.3. Weisfeiler-Lehman meets Gromov-Wasserstein B.3. Proofs from Section 4 B.3.1. PROOF OF PROPOSITION 4.1 We need the following lemma: Lemma B.4. For any C-Lipschitz function ϕ : Ri Rj, we have that the map qϕ : P(Ri) Rj is C-Lipschitz. Proof of Lemma B.4. For any α, β P(Ri), pick any γ C(α, β). Then, we have that |qϕ(α) qϕ(β)| = Ri ϕ(x)α(dx) Z Ri ϕ(x)β(dx) Ri Ri(ϕ(x) ϕ(y)) γ(dx dy) Ri Ri |ϕ(x) ϕ(y)| γ(dx dy) Ri Ri |x y| γ(dx dy). Since γ C(α, β) is arbitrary, we have that |qϕ(α) qϕ(β)| C d W(α, β). Hence qϕ is C-Lipschitz. Now, we start to prove item 1. We introduce some notation. Given a MCNNk h := ψ Sϕk+1 Fϕk Fϕ1 and any (X, ℓX) ML(Z), we let X, ℓ (ϕ,i) X := Fϕi Fϕ1((X, ℓX)) (18) Now, we assume that for i = 1, . . . , k, the MLP ϕi is Ci-Lipschitz for some Ci > 0 (note that MLPs with the Lipschitz activation function σ specified in Section 4 are Lipschitz). Then, by Lemma B.4, we have that qϕi is a Ci-Lipschitz map for i = 1, . . . , k. Then, we prove that d W ℓ (ϕ,k) X # µX, ℓ (ϕ,k) Y # µY Πk i=1Ci d (k) WL((X, ℓX), (Y, ℓY )). (19) Given Equation (19), if d (k) WL((X, ℓX), (Y, ℓY )) = 0, then d W ℓ (ϕ,k) X # µX, ℓ (ϕ,k) Y # µY = 0 and thus ℓ (ϕ,k) X # µX = ℓ (ϕ,k) Y # µY . Hence, the MCNN h = ψ Sϕk+1 Fϕk Fϕ1 satisfies that h((X, ℓX)) = ψ qϕk+1 ℓ (ϕ,k) X # µX = ψ qϕk+1 ℓ (ϕ,k) Y # µY = h((Y, ℓY )). To prove Equation (19), it suffices to prove that for any x X and y Y (see Lemma A.1), ℓ (ϕ,k) X (x) ℓ (ϕ,k) Y (y) Πk i=1Ci d W l (k) (X,ℓX )(x), l (k) (Y,ℓY )(y) . We prove the above inequality by proving the following inequality inductively on j = 1, . . . , k: ℓ (ϕ,j) X (x) ℓ (ϕ,j) Y (y) Πj i=1Ci d W l (j) (X,ℓX )(x), l (j) (Y,ℓY )(y) . (20) Weisfeiler-Lehman meets Gromov-Wasserstein When j = 0, we have that ℓ (ϕ,0) X = ℓX = l (0) (X,ℓX ) and ℓ (ϕ,0) Y = ℓY = l (0) (Y,ℓY ). Therefore, Equation (20) obviously holds (we let Π0 i=1Ci := 1). We now assume that Equation (20) holds for some j 0. 
For j + 1, we have that Πj+1 i=1Ci d W l (j+1) (X,ℓX )(x), l (j+1) (Y,ℓY )(y) = Πj+1 i=1Ci d W l (j) (X,ℓX ) # m X x , l (j) (Y,ℓY ) = Cj+1 inf γ C(m X x ,m Y y ) Πj i=1Ci d W l (j) (X,ℓX )(x ), l (j) (Y,ℓY )(y ) γ(dx dy ) Cj+1 inf γ C(m X x ,m Y y ) ℓ (ϕ,j) X (x ) ℓ (ϕ,j) Y (y ) γ(dx dy ) = Cj+1 d W ℓ (ϕ,j) X # m X x , ℓ (ϕ,j) Y qϕj+1 ℓ (ϕ,j) X # m X x qϕj+1 ℓ (ϕ,j) Y = ℓ (ϕ,j+1) X (x) ℓ (ϕ,j+1) Y (y) . This concludes the induction step and thus the proof of item 1. Next, we prove item 2. The proof is based on the following two basic results: Lemma B.5. For any d N and any α, β P(Rd), if α = β, then there exists a Lipschitz function ϕ : Rd R such that Z Rd ϕ(x)α(dx) = Z Rd ϕ(x)β(dx). Proof of Lemma B.5. By Kantorovich duality (see for example Remark 6.5 in (Villani, 2009)), d W(α, β) = sup Rd ϕ(x)α(dx) Z Rd ϕ(x)β(dx) : ϕ : Rd R is 1-Lipschitz. Since α = β, we have that d W(α, β) > 0, and thus there exists a 1-Lipschitz ϕ : Rd R such that R Rd ϕ(x)α(dx) = R Rd ϕ(x)β(dx). Lemma B.6. For any d N and any α, β P(Rd), if α = β and they are both compactly supported, then there exists a single-hidden-layer MLP ϕ : Rd R (with the activation function σ specified in Section 4) such that Z Rd ϕ(x)α(dx) = Z Rd ϕ(x)β(dx). Proof of Lemma B.6. By Lemma B.5, there exists a Lipschitz ψ : Rd R such that R Rd ψ(x)α(dx) = R Rd ψ(x)β(dx). Let ε := R Rd ψ(x)α(dx) R Rd ψ(x)β(dx) > 0. Let K := supp(α) supp(β). Since both probability measures are compactly supported, we have that K is compact. Then, by the classic universal approximation theorem (Pinkus, 1999, Theorem 3.1) and by the assumption that σ is Lipschitz and non-polynomial, there exists a single-hidden-layer MLP ϕ : Rd R such that supx K |ϕ(x) ψ(x)| < ε 4. Therefore, Rd ϕ(x)α(dx) Z Rd ψ(x)α(dx) Z Rd |ϕ(x) ψ(x)| α(dx) = Z K |ϕ(x) ψ(x)| α(dx) ε and similarly, Rd ϕ(x)β(dx) Z Rd ψ(x)β(dx) ε Rd ϕ(x)α(dx) = R Rd ϕ(x)β(dx). Weisfeiler-Lehman meets Gromov-Wasserstein Now, given any (X, ℓX) and (Y, ℓY ) such that d (k) WL((X, ℓX), (Y, ℓY )) > 0, we have that l (k) (X,ℓX ) # µX, l (k) (Y,ℓY ) = d (k) WL((X, ℓX), (Y, ℓY )) > 0. Then, we prove that for each i = 1, . . . , k, there exists a MLP ϕi : Rdi 1 Rdi for suitable dimensions di 1 and di such that x X, y Y, ℓ (ϕ,i) X (x) = ℓ (ϕ,i) Y (y) iff l (i) (X,ℓX )(x) = l (i) (Y,ℓY )(y). (21) Given Equation (21), it is obvious that l (k) (X,ℓX ) # µX = l (k) (Y,ℓY ) # µY implies that ℓ (ϕ,k) X # µX = ℓ (ϕ,k) Y # µY . Then, by Lemma B.6 and by the fact that X and Y are finite, there exists a MLP ϕk+1 : Rdk R such that Z Rd ϕk+1(t) ℓ (ϕ,k) X # µX(dt) = Z Rd ϕk+1(t) ℓ (ϕ,k) Y Then, by the classic universal approximation theorem (Pinkus, 1999, Theorem 3.1) again, there exists a MLP ψ : R R which distinguishes the two numbers above. Hence, we have that ψ Sϕk+1 Fϕk Fϕ1(X) = ψ Z Rd ϕk+1(t) ℓ (ϕ,k) X Rd ϕk+1(t) ℓ (ϕ,k) Y = ψ Sϕk+1 Fϕk Fϕ1(Y). To conclude the proof, we prove Equation (21) by induction on i = 1, . . . , k. When i = 1, let A1 := (x, y) X Y : (ℓX)#m X x = (ℓY )#m Y y . Since X and Y are finite, A1 is a finite set. We enumerate elements in A1 and write A1 = {(x1, y1), . . . , (xd1, yd1)}. By Lemma B.6, for each j = 1, . . . , d1, there exists a MLP ϕj 1 : Rd R such that Z Rd ϕj 1(t)(ℓX)#m X xj(dt) = Z Rd ϕj 1(t)(ℓY )#m Y yj(dt). We then let ϕ1 := ϕ1 1, ϕ2 1, . . . , ϕd1 1 : Rd Rd1. It is not hard to see that ϕ1 is still a MLP with a single hidden layer and it satisfies that x X, y Y, (ℓX)#m X x = (ℓY )#m Y y iff Z Rd ϕ1(t)(ℓX)#m X x (dt) = Z Rd ϕ1(t)(ℓY )#m Y y (dt). 
Equivalent speaking, x X, y Y, l (1) (X,ℓX )(x) = l (1) (Y,ℓY )(y) iff ℓ (ϕ,1) X (x) = ℓ (ϕ,1) Y (y). Now, we assume that Equation (21) holds for some i 1. For i + 1, we let Ai+1 := n (x, y) X Y : ℓ (ϕ,i) X # m X x = ℓ (ϕ,i) Y # m Y y o . Since X and Y are finite, Ai+1 is a finite set. We enumerate elements in Ai+1 and write Ai+1 = {(x1, y1), . . . , (xdi+1, ydi+1)}. By Lemma B.6, for each j = 1, . . . , di+1, there exists a MLP ϕj i+1 : Rdi R such that Z Rdi ϕj i+1(t) ℓ (ϕ,i) X # m X xj(dt) = Z Rdi ϕj i+1(t) ℓ (ϕ,i) Y # m Y yj(dt). We then let ϕi+1 := ϕ1 i+1, ϕ2 i+1, . . . , ϕdi+1 i+1 : Rdi Rdi+1. Again ϕi+1 is a MLP with a single hidden layer and it satisfies that x X and y Y, # m X x = ℓ (ϕ,i) Y # m Y y iff Z Rdi ϕi+1(t) ℓ (ϕ,i) X # m X x (dt) = Z Rdi ϕi+1(t) ℓ (ϕ,i) Y # m Y y (dt). Weisfeiler-Lehman meets Gromov-Wasserstein Equivalent speaking, x X, y Y, ℓ (ϕ,i) X # m X x = ℓ (ϕ,i) Y # m Y y iff ℓ (ϕ,i+1) X (x) = ℓ (ϕ,i+1) Y (y). By the induction assumption, x X, y Y we have that ℓ (ϕ,i) X (x) = ℓ (ϕ,i) Y (y) iff l (i) (X,ℓX )(x) = l (i) (Y,ℓY )(y). This implies that # m X x = ℓ (ϕ,i) Y # m Y y iff l (i) (X,ℓX ) # m X x = l (i) (Y,ℓY ) # m Y y iff l (i+1) (X,ℓX )(x) = l (i+1) (Y,ℓY )(y). Therefore, ℓ (ϕ,i+1) X (x) = ℓ (ϕ,i+1) Y (y) iff l (i+1) (X,ℓX )(x) = l (i+1) (Y,ℓY )(y) and we thus conclude the proof. B.3.2. PROOF OF THEOREM 4.3 We first prove a relaxation of Theorem 4.3. Let NN c k(Rd) := {ψ Sϕk+1 Fϕk Fϕ1 : MLPs ϕi, i = 1, . . . , k + 1 and continuous ψ.}. Note that NN k(Rd) NN c k(Rd) since the latter relaxes the assumption in NN k(Rd) that ψ is a MLP. Then, we prove the following lemma: Lemma B.7. For any k N, let K ML k (Rd) be any compact subspace. Then, NN c k(Rd) = C(K, R). The proof of the lemma is based on the following Stone-Weierstrass theorem. Lemma B.8 (Stone-Weierstrass). Let X be a compact space. Let F C(X, R) be a subalgebra containing the constant function 1. If moreover F separates points, then F is dense in C(X, R). NN c k(Rd) Contains 1. Given any choice of ϕis, we let ψ : Rdk+1 R be the constant map 1. Then, the corresponding function h := ψ Sϕk+1 Fϕk Fϕ1 1 NN k(Rd). NN c k(Rd) Separates Points. This follows from item 2 in Proposition 4.1. NN c k(Rd) Is a Subalgebra. By Equation (19), Lemma B.4 and the continuity of ψ, we have that NN c k(Rd) C(K, R). Next, we show that NN c k(Rd) is, in fact, a subalgebra of C(K, R). Given any constant C and function h = ψ Sϕk+1 Fϕk Fϕ1 NN c k(Rd), we have that C h = C ψ Sϕk+1 Fϕk Fϕ1 = (C ψ) Sϕk+1 Fϕk Fϕ1 NN c k(Rd). Then, we show that the sum and the product of any h1 = ψ Sϕk+1 Fϕk Fϕ1 and h2 = ψ S ϕk+1 F ϕk F ϕ1 belong to NN c k(Rd). We define Φ1 := (ϕ1, ϕ1) : Rd Rd1 R d1, and for each 2 i k + 1, we define Φi = ϕi ϕi : Rdi 1 R di 1 Rdi R di. It is not hard to see that for each i = 1, . . . , k + 1, Φi is also a single-hidden-layer MLP with the activation function σ. Indeed, if we write ϕi(x) = Ciσ (Wix + bi) for any x Rdi, and ϕi( x) = Ciσ Wi x + bi for any x R di, then for Rdi R di, we have that Φ1(x) = C1 C1 Weisfeiler-Lehman meets Gromov-Wasserstein = Ci 0 0 Ci σ Wi 0 0 Wi We let P : Rdk+1 R dk+1 Rdk+1 and P : Rdk+1 R dk+1 R dk+1 denote projection maps. Then, we can rewrite h1 and h2 as follows. Claim 2. h1 = ψ Sϕk+1 Fϕk Fϕ1 = ψ P SΦk+1 FΦk FΦ1 and h2 = ψ S ϕk+1 F ϕk F ϕ1 = ψ P SΦk+1 FΦk FΦ1. Proof of Claim 2. Recall notation from Equation (18). Then, we first prove inductively on i = 1, . . . , k that for any X K ℓ (Φ,i) X (x) = ℓ (ϕ,i) X (x), ℓ ( ϕ,i) X (x) , x X. 
(22) When i = 1, ℓ (Φ,1) X (x) = qΦ1((ℓX)#m X x )) = (qϕ1((ℓX)#m X x ), q ϕ1((ℓX)#m X x )) = ℓ (ϕ,1) X (x), ℓ ( ϕ,1) X (x) . Now, we assume that Equation (22) holds for some i 1. Then, for i + 1, we have that ℓ (Φ,i+1) X (x) = qΦi+1 ℓ (Φ,i) X # m X x = qΦi+1 ℓ (ϕ,i) X , ℓ ( ϕ,i) X = qΦi+1 ℓ (ϕ,i) X # m X x ℓ ( ϕ,i) X = qϕi+1 ℓ (ϕ,i) X # m X x , q ϕi+1 ℓ ( ϕ,i) X = ℓ (ϕ,i+1) X (x), ℓ ( ϕ,i+1) X (x) , which concludes the proof of Equation (22). SΦk+1 FΦk FΦ1((X, ℓX)) = qΦk+1 ℓ (Φ,k) X # µX = qΦk+1 ℓ (ϕ,k) X , ℓ ( ϕ,k) X = qΦk+1 ℓ (ϕ,k) X # µX ℓ ( ϕ,k) X = qϕk+1 ℓ (ϕ,k) X # µX , q ϕk+1 ℓ ( ϕ,k) X = Sϕk+1 Fϕk Fϕ1((X, ℓX)), S ϕk+1 F ϕk F ϕ1((X, ℓX)) . Therefore, ψ Sϕk+1 Fϕk Fϕ1 = ψ P SΦk+1 FΦk FΦ1 and similarly, ψ S ϕk+1 F ϕk F ϕ1 = ψ P SΦk+1 FΦk FΦ1. Given these claims, we then have that ψ Sϕk+1 Fϕk Fϕ1 + ψ S ϕk+1 F ϕk F ϕ1 = (ψ P + ψ P) SΦk+1 FΦk FΦ1 NN c k(Rd) ψ Sϕk+1 Fϕk Fϕ1 ψ S ϕk+1 F ϕk F ϕ1 = (ψ P ψ P) SΦk+1 FΦk FΦ1 NN c k(Rd). This concludes the proof of Lemma B.7. Now we finish proving Theorem 4.3 by showing that NN k(Rd) is dense in NN c k(Rd) when restricted to the compact set K. Choose any h = ψ Sϕk+1 Fϕk Fϕ1 NN c k(Rd). By Equation (19) and Lemma B.4 again, we have that the map Sϕk+1 Fϕk Fϕ1 : ML k (Rd) Rdk+1 is continuous. Hence, K := Sϕk+1 Fϕk Fϕ1(K) is a compact subspace of Rdk+1. Now, for any ε > 0, by the universal approximation theorem for MLP (Pinkus, 1999, Theorem 3.1), there exists a single-hidden-layer MLP ψ : Rdk+1 R such that supx K ψ(x) ψ(x) < ε. If we let h := ψ Sϕk+1 Fϕk Fϕ1, then h NN k(Rd) and moreover, sup (X,ℓX) K h((X, ℓX)) h((X, ℓX)) < ε. This implies that NN k(Rd) is dense in NN c k(Rd) when restricted to the compact set K. This concludes the proof. Weisfeiler-Lehman meets Gromov-Wasserstein B.4. Proofs from Section 5 B.4.1. PROOF OF PROPOSITION 5.1 The proof is rather lengthy, and we start with some preliminary definitions and lemmas. Definition 14. Suppose two finite metric spaces X and Y are given. We say a sequence of measurable maps {(νn) , : X Y P(X Y )}n N weakly converges to ν , : X Y P(X Y ) if {(νn)x,y}n N P(X Y ) weakly converges to νx,y P(X Y ) for all (x, y) X Y . Lemma B.9. Suppose two finite metric spaces X and Y are given. If a sequence of probability measures {γn}n N P(X Y ) weakly converges to γ P(X Y ) and a sequence of measurable maps {(νn) , : X Y P(X Y )}n N weakly converges to ν , : X Y P(X Y ), then the sequence {(νn) , γn}n N also weakly converges to ν , γ. Proof. Fix an arbitrary continuous bounded map φ : X Y R. Then, X Y φ(x, y) (νn) , γn(dx dy) Z X Y φ(x, y) ν , γ(dx dy) X Y φ(x, y) (νn) , γn(dx dy) Z X Y φ(x, y) ν , γn(dx dy) X Y φ(x, y) ν , γn(dx dy) Z X Y φ(x, y) ν , γ(dx dy) . By the weak convergence of {(νn) , }n N and by applying the bounded convergence theorem, we have that Z X Y φ(x, y) (νn) , γn(dx dy) = Z X Y φ(x, y) (νn)x ,y (dx dy)γn(dx dy ) converges to Z X Y φ(x, y) ν , γn(dx dy) = Z X Y φ(x, y) νx ,y (dx dy)γn(dx dy ). Also, since {γn}n N weakly converges to γ, by finiteness of X and Y we have that Z X Y φ(x, y) ν , γn(dx dy) = Z X Y φ(x, y) νx ,y (dx dy)γn(dx dy ) converges to Z X Y φ(x, y) ν , γ(dx dy) = Z X Y φ(x, y) νx ,y (dx dy)γ(dx dy ). X Y φ(x, y) (νn) , γn(dx dy) R X Y φ(x, y) ν , γ(dx dy) converges to zero as we required. This completes the proof. Lemma B.10. Suppose two MCMSs (X, d X), (Y, d Y ), k 1, and a sequence of k-step couplings {(ν(k) n ) , }n N C(k) m X , m Y are given. Then, there is a k-step coupling ν (k) , C(k) m X , m Y to which the sequence {(ν(k) n ) , }n N converges. Proof. 
The proof is by induction. k = 1 case is obvious since C(m X x , m Y y ) is compact w.r.t. the weak topology (see Villani (2003, p.49)) for all (x, y) X Y . Now, suppose the claim holds up to some k 1. Consider k + 1 case. By the definition, each (k + 1)-step coupling (ν(k+1) n ) , C(k+1) m X , m Y can be expressed in the following way: Weisfeiler-Lehman meets Gromov-Wasserstein (ν (k+1) n )x,y = Z X Y (ν (k) n )x ,y (µ (1) n )x,y(dx dy ) for all (x, y) X Y for some (ν(k) n ) , C(k) m X , m Y and (µ(1) n ) , C(1) m X , m Y . Then, by the inductive assumption, there are ν (k) , C(k) m X , m Y and µ (1) , C(1) m X , m Y such that the sequence {(ν(k) n ) , }n N weakly converges to ν (k) , , and the sequence {(µ(1) n ) , }n N weakly converges to µ (1) , . Then, for each (x, y) X Y , (ν (k+1) n )x,y = Z X Y (ν (k) n )x ,y (µ (1) n )x,y(dx dy ) = (ν (k) n ) , (µ (1) n )x,y weakly converges to ν (k+1) x,y = Z X Y ν (k) x ,y µ (1) x,y(dx dy ) = ν (k) , µ (1) x,y by Lemma B.9. This completes the proof. Lemma B.11 (M emoli (2011, Lemma 10.3)). Let (Z, d Z) be a compact metric space and φ : Z Z R be a Lipschitz map w.r.t. the L1 metric on Z Z: ˆd Z Z((z1, z2), (z 1, z 2)) := d Z(z1, z 1) + d Z(z2, z 2) for all (z1, z2), (z 1, z 2) Z Z. Also, for each γ P(Z), we define a map pφ,γ in the following way: Z φ(z, z ) γ(dz ). If a sequence {µn}n N P(Z) weakly converges to µ, then pφ,µn uniformly converges to pφ,µ. Corollary B.12. For any two MCMSs (X, d X), (Y, d Y ), and k 1, there exist a coupling measure γ C(µX, µY ) and a k-step coupling ν (k) , C(k) m X , m Y such that d (k) GW((X, d X), (Y, d Y )) = dis (k) γ, ν (k) , . Proof. First of all, we define φ : (X Y ) (X Y ) R by sending any ((x, y), (x , y )) (X Y ) (X Y ) to |d X(x, x ) d Y (y, y )|. By Definition 8, there are a sequence of coupling measures {γn}n N C(µX, µY ) and a sequence of k-step couplings {(ν(k) n ) , }n N C(k) m X , m Y such that dis (k)(γ, (ν (k) n ) , ) = Z |d X(x, x ) d Y (y, y )| (ν (k) n )x ,y (dx dy )γn(dx dy )γn(dx dy) pφ,(ν(k) n ) , γn(x, y) γn(dx dy) d (k) GW((X, d X), (Y, d Y )) + 1 for each n 1. Now, since C(µX, µY ) is compact w.r.t. the weak topology (see p.49 of (Villani, 2003)), there is a coupling measure γ C(µX, µY ) such that γn γ weakly. Also, by Lemma B.10, there is a k-step coupling ν (k) , C(k) m X , m Y such that (ν(k) n ) , ν (k) , weakly. Weisfeiler-Lehman meets Gromov-Wasserstein pφ,ν(k) , γ(x, y) γn(dx dy) Z pφ,(ν(k) n ) , γn(x, y) γn(dx dy), pφ,ν(k) , γ(x, y) γ(dx dy) Z pφ,ν(k) , γ(x, y) γn(dx dy), pφ,ν(k) , γ(x, y) γ(dx dy) Z pφ,(ν(k) n ) , γn(x, y) γn(dx dy). It is easy to see that Cn = An + Bn and thus |Cn| |An| + |Bn|. By Lemma B.9 and Lemma B.11, An converges to zero. Also, Bn converges to zero by the assumption that X, Y are finite and that γn weakly converges to γ. Hence, Cn converges to zero. Therefore, dis (k) γ, ν (k) , = Z pφ,ν(k) , γγ(dx dy) pφ,(ν(k) n ) , γnγn(dx dy) d (k) GW((X, d X), (Y, d Y )) Since we always have that dis (k) γ, ν (k) , d (k) GW((X, d X), (Y, d Y )) , we conclude that dis (k) γ, ν (k) , = d (k) GW((X, d X), (Y, d Y )) . Lemma B.13 (Gluing of k-step couplings). Suppose three MCMSs (X, d X), (Y, d Y ), (Z, d Z), k 1, and k-step couplings µ (k) , C(k)(m X , m Z ), η (k) , C(k)(m Z , m Y ) are given. Then, there are probability measures π (k) , , : X Y Z P(X Y Z) and k-step coupling ν (k) , C(k) m X , m Y such that ν(k) x,y, µ(k) x,z, and η(k) z,y are the marginals of π(k) x,y,z for any (x, y, z) X Y Z. Proof. The proof is by induction on k. First, consider k = 1 case. 
Fix an arbitrary (x, y, z) X Y Z. Let π (1) x,y,z(x , y , z ) := ( µ(1) x,z(x ,z )µ(1) z,y(z ,y ) m Z z (z ) , m Z z (z ) > 0 0, m Z z (z ) = 0 for each (x , y , z ) X Y Z. Observe that (x ,y ,z ) X Y Z π (1) x,y,z(x , y , z ) = X z Z,m Z z (z )>0 µ(1) z,y(z , y ) x X µ (1) x,z(x , z ) z Z,m Z z (z )>0 µ(1) z,y(z , y ) m Zz (z ) m Z z (z ) = X z Z,m Z z (z )>0 y Y µ (1) z,y(z , y ) = 1. Hence, π(1) x,y,z P(X Y Z). Now, let ν(1) x,y(x , y ) := P z Z π(1) x,y,z(x , y , z ) for each (x , y ) X Y . Then, for fixed x X, Weisfeiler-Lehman meets Gromov-Wasserstein y Y ν (1) x,y(x , y ) = X z Z,m Z z (z )>0 µ(1) x,z(x , z )µ(1) z,y(z , y ) m Zz (z ) = X z Z,m Z z (z )>0 µ(1) x,z(x , z ) y Y µ (1) z,y(z , y ) z Z,m Z z (z )>0 µ(1) x,z(x , z ) m Zz (z ) m Z z (z ) = X z Z,m Z z (z )>0 µ (1) x,z(x , z ) = m X x (x ). Similarly, for each fixed y Y , one can prove P x X ν(1) x,y(x , y ) = m Y y (y ). Hence, indeed ν (1) , C(1) m X , m Y . Now, suppose the claim holds up to some k 1. We consider k + 1 case. For a (k + 1)-step coupling µ (k+1) , C(k+1)(m X , m Z ), there are k-step coupling µ (k) , C(k)(m X , m Z ) and 1-step coupling µ (1) , C(1)(m X , m Z ) such that µ (k+1) x,z (x , z ) = X (x ,z ) X Y µ (k) x ,z (x , z ) µ (1) x,z(x , z ) for any (x, z), (x , z ) X Z. Similarly, for a (k + 1)-step coupling η (k+1) , C(k+1)(m Z , m Y ), there are k-step coupling η (k) , C(k)(m Z , m Y ) and 1-step coupling η (1) , C(1)(m Z , m Y ) such that η (k+1) z,y (z , y ) = X (z ,y ) Z Y η (k) z ,y (z , y ) η (1) z,y(z , y ) for any (z, y), (z , y ) Z Y . Because of the inductive assumption, we have π (k) , , : X Y Z P(X Y Z) and π (1) , , : X Y Z P(X Y Z) satisfying the claim. Then, let π (k+1) x,y,z(x , y , z ) := X (x ,y ,z ) X Y Z π (k) x ,y ,z (x , y , z ) π (1) x,y,z(x , y , z ), and let ν(k+1) x,y (x , y ) := P z Z π(k+1) x,y,z(x , y , z ). Then, we have that ν (k+1) x,y (x , y ) = X (x ,y ,z ) X Y Z π (k) x ,y ,z (x , y , z ) π (1) x,y,z(x , y , z ) (x ,y ,z ) X Y Z π (1) x,y,z(x , y , z ) X z Z π (k) x ,y ,z (x , y , z ) (x ,y ,z ) X Y Z π (1) x,y,z(x , y , z ) ν (k) x ,y (x , y ) (x ,y ) X Y ν (k) x ,y (x , y ) X z Z π (1) x,y,z(x , y , z ) (x ,y ) X Y ν (k) x ,y (x , y ) ν (k) x,y(x , y ). Since the choice of (x, y, z) x y z is arbitrary, now we have ν (k+1) , C(k+1) m X , m Y as we required. Hence, this concludes the proof. Now we start to prove Proposition 5.1. First of all, d MCMS GW is obviously symmetric. Next, we prove that d MCMS GW ((X, d X), (Y, d Y )) = 0 happens if and only if (X, d X) and (Y, d Y ) are isomorphic. To do this, we first provide a precise definition of MCMS isomorphism. Definition 15. Two MCMSs (X, d X) and (Y, d Y ) are said to be isomorphic if there exists an isometry ψ : X Y such that ψ#µX = µY and ψ#m X x = m Y ψ(x) for all x X. Weisfeiler-Lehman meets Gromov-Wasserstein When are (X, d X) and (Y, d Y ) are isomorphic, without loss of generality, we simply assume that (Y, d Y ) = (X, d X). Claim 3. Let µX denote the diagonal coupling between µX and itself, i.e., x X µX(x)δ(x,x). Then, for each k N, there exists ν (k) , C(k)(m X , m X ) such that µX = ν (k) , µX. Assume the claim for now. Then, we have that for each k N d (k) GW((X, d X), (Y, d Y )) dis µX, ν (k) , |d X(x, x ) d X(x1, x 1)|ν (k) , µX(dx dx 1) µX(dx dx1) |d X(x, x ) d X(x, x )|µX(dx)µX(dx ) = 0. Hence, d MCMS GW ((X, d X), (Y, d Y )) = 0, Proof of Claim 3. 
We prove inductively on k N that there exists ν (k) , C(k)(X, X) so that ν(k) x,x = m X, k x is the diagonal coupling between m X, k x and itself for each x X. For k = 1, we define ν (1) , as follows: ν (1) x,x := ( m X x m X x x = x m X x x = x . Since X is finite, obviously we have that ν (1) , C(1)(X, X). Assume that the statement holds for some k 0. Now, for k+1, by the induction assumption, there exists ν (k) , C(k)(X, X) so that ν(k) x,x = m X, k x . We define ν (k+1) , C(k+1)(X, X) as follows ν (k+1) x,x := Z ν (k) x1,x 1ν (1) x,x (dx1 dx 1), x, x X. Now, for any x X, we have that ν (k+1) x,x = Z ν (k) x1,x 1ν (1) x,x(dx1 dx 1) x X m X x (x ) X x X m X, k x (x )δ(x ,x ) x X m X, k x (x )m X x (x ) x X m X, (k+1) x (x )δ(x ,x ) = m X, (k+1) x . Now, we turn to prove the claim. For each k N, let ν (k) , C(k)(X, X) be such that ν(k) x,x = m X, k x is the diagonal Weisfeiler-Lehman meets Gromov-Wasserstein coupling between m X, k x and itself for each x X. Then, Z ν (k) x,x µX(dx dx ) = X x X ν (k) x,x µX(x) x,x X m X, k x (x )δ(x ,x ) µX(x) x X m X, k x (x )µX(x) x X µX(x )δ(x ,x ) Now, we assume that d MCMS GW ((X, d X), (Y, d Y )) = 0 for some MCMSs (X, d X) and (Y, d Y ). Then, d (1) GW((X, d X), (Y, d Y )) = 0. By Corollary B.12, there exist optimal ν , C(1) m X , m Y and γ C(µX, µY ) such that |d X(x, x ) d Y (y, y )|νx ,y (dx dy )γ(dx dy )γ(dx dy) = 0. We let γ := ν , γ. Notice that γ C(µX, µY ). Since X and Y are finite, we rewrite the integral above as finite sums: (x ,y ) X Y |d X(x, x ) d Y (y, y )| γ (x , y ) γ(x, y) = 0. By Claim 1, there exists an isometry φ : X Y such that {(x, φ(x)) : x X} = supp(γ) = supp(γ ). Since γ, γ C(µX, µY ), this immediately implies that for any Borel subset A X, one has µX(A) = γ(A Y ) = γ(A φ(A)) = γ(X φ(A)) = µY (φ(A)). Hence, φ#µX = µY . Moreover, we have that x X µX(x)δ(x,φ(x)). Then, by the definition of γ , we have that X x X µX(x)δ(x,φ(x)) = γ = X x1 X,y1 Y γ(x1, y1)νx1,y1 x1 X,y1 Y γ(x1, y1)νx1,y1(x, y) x1 X γ(x1, φ(x1))νx1,φ(x1)(x, y) x1 X µX(x1)νx1,φ(x1)(x, y) Weisfeiler-Lehman meets Gromov-Wasserstein By comparing coefficients for the Dirac delta measures above, one has that νx1,φ(x1)(x, y) = ( 0, if y = φ(x) νx1,φ(x1)(x, y), if y = φ(x) for any x1 X. This means that {(x, φ(x)) : x X} supp[νx1,φ(x1)]. Since νx1,φ(x1) C m X x1, m Y φ(x1) , for any Borel subset A X, one has that m X x1(A) = νx1,φ(x1)(A Y ) = νx1,φ(x1)(A φ(A)) = νx1,φ(x1)(X φ(A)) = m Y φ(x1)(φ(A)). By an argument similar to the one for proving φ#µX = µY , we have that φ#m X x1 = m Y φ(x1), x1 X. Therefore, (X, d X) is isomorphic to (Y, d Y ). Finally, we prove that d MCMS GW satisfies the triangle inequality. It suffices to prove that for each k N, d (k) GW satisfies the triangle inequality. Fix arbitrary three MCMSs (X, d X), (Y, d Y ), and (Z, d Z). Recall the notation ΓX,Y which defines a function sending (x, y, x , y ) X Y X Y to |d X(x, x ) d Y (y, y )|. Then, for any x, x X, y, y Y , and z, z Z, we obviously have that ΓX,Y (x, y, x , y ) ΓX,Z(x, z, x , z ) + ΓZ,Y (z, y, z , y ). Now, fix arbitrary µ (k) , C(k)(m X , m Z ), γX,Z C(µX, µZ), η (k) , C(k)(m Z , m Y ), and γZ,Y C(µZ, µY ). Then, by the Gluing Lemma (see Villani (2003, Lemma 7.6)), there exists a probability measure α P(X Y Z) with marginals γX,Z and γZ,Y on X Z and Z Y , respectively. Let γX,Y be the marginal of π on X Y which belongs to C(µX, µY ). 
By Lemma B.13, there exists ν (k) , C(k) m X , m Y such that ν(k) x,y, µ(k) x,z, η(k) z,y are the marginals of some probability measure π(k) x,y,z P(X Y Z) for any x, y, z X Y Z. Then, because of the triangle inequality for L1-norm, d (k) GW((X, d X), (Y, d Y )) ΓX,Y (x, y, x , y )ν (k) x ,y (dx dy )γX,Y (dx dy )γX,Y (dx dy) ΓX,Y (x, y, x , y )π (k) x ,y ,z (dx dy dz )α(dx dy dz )α(dx dy dz) ΓX,Z(x, z, x , z )π (k) x ,y ,z (dx dy dz )α(dx dy dz )α(dx dy dz) ΓZ,Y (z, y, z , y )π (k) x ,y ,z (dx dy dz )α(dx dy dz )α(dx dy dz) ΓX,Z(x, z, x , z )µ (k) x ,z (dx dz )γX,Z(dx dz )γX,Z(dx dz) ΓZ,Y (z, y, z , y )η (k) z ,y (dx dy dz )γZ,Y (dz dy )γZ,Y (dz dy). Since the choice of µ (k) , , γX,Z, η (k) , , γZ,Y are arbitrary, by taking the infimum one concludes that d (k) GW((X, d X), (Y, d Y )) d (k) GW((X, d X), (Z, d Z)) + d (k) GW((Z, d Z), (Y, d Y )) as we required. Then, we have that d MCMS GW := supk 0 d (k) GW satisfies the triangle inequality. Weisfeiler-Lehman meets Gromov-Wasserstein B.4.2. PROOF OF THE CLAIM IN EXAMPLE 3 The proof is based on the following lemma. Lemma B.14. Given two MMSs X and Y, for their corresponding MCMSs M(X) and M(Y) we have that 1. C(k+1) m X , m Y C(k) m X , m Y for all k 1. 2. For any coupling measure γ C(µX, µY ), the constant map ν , γ belongs to C(k) m X , m Y for all k 1. Proof. We first prove item 1. Since m X x = µX and m Y y = µY for all x X and y Y , observe that m X, k x = µX and m Y, k y = µY for all x X, y Y , and k 1. Hence, C(k) m X , m Y C(1) m X , m Y for all k 1 by Lemma A.4. Now, for any k 2, fix an arbitrary ν (k+1) , C(k+1) m X , m Y . Then, by the definition, there are ν (k) , C(k) m X , m Y and ν (1) , C(1) m X , m Y = C(µX, µY ) such that ν (k+1) x,y = Z X Y ν (k) x ,y ν (1) x,y(dx dy ) for all (x, y) X Y . Again, by the definition, there are ν (k 1) , C(k) m X , m Y and µ (1) , C(1) m X , m Y = C(µX, µY ) such that ν (k) x,y = Z X Y ν (k 1) x ,y µ (1) x,y(dx dy ) for all (x, y) X Y . Therefore, ν (k+1) x,y = Z X Y ν (k 1) x ,y µ (1) x ,y (dx dy ) ν (1) x,y(dx dy ) X Y ν (k 1) x ,y π (1) x,y(dx dy ) for all (x, y) X Y where π (1) , := R X Y µ (1) x ,y ν (1) , (dx dy ) C(2) m X , m Y C(1) m X , m Y . Hence, ν (k+1) , C(k) m X , m Y by the definition. The first item is proved. Next, we prove the second item. The proof is by induction on k. Fix a coupling measure γ C(µX, µY ) and the constant map ν , γ. Obviously, ν , C(1) m X , m Y . Then, we also have that the constant map µX µY C(1) m X , m Y . Now, suppose the claim holds up to some k 1. Consider k + 1 case. Observe that X Y γ µX µY (dx dy ) where γ C(k) m X , m Y by the inductive assumption and the constant map µX µY C(1) m X , m Y . Hence, γ C(k+1) m X , m Y by the definition. This completes the proof. Now, fix arbitrary couplings γ, γ C(µX, µY ) and consider the constant map ν , γ . Then, by the second item of Lemma B.14, ν , C(k) m X , m Y . Hence, Weisfeiler-Lehman meets Gromov-Wasserstein d (k) GW(M(X), M(Y)) Z |d X(x, x ) d Y (y, y )| νx ,y (dx dy )γ(dx dy )γ(dx dy) |d X(x, x ) d Y (y, y )| γ (dx dy )γ(dx dy )γ(dx dy) |d X(x, x ) d Y (y, y )| γ (dx dy )γ(dx dy). Since the choice of γ, γ are arbitrary, one concludes that d (k) GW((M(X), M(Y)) dbi GW(X, Y). For the reverse direction, choose arbitrary γ C(µX, µY ) and a k-step coupling ν (k) , C(k) m X , m Y . Let X Y ν (k) x ,y γ(dx dy ). |d X(x, x ) d Y (y, y )| ν (k) x ,y (dx dy )γ(dx dy )γ(dx dy) |d X(x, x ) d Y (y, y )| γ (dx dy )γ(dx dy) dbi GW(X, Y). 
Since the choice of γ and ν (k) , are arbitrary, one concludes that d (k) GW(M(X), M(Y)) dbi GW(X, Y). Hence, d (k) GW(M(X), M(Y)) = dbi GW(X, Y) as we required. B.4.3. PROOF OF PROPOSITION 5.2 For any x X, we let ℓx X := d X(x, ). For i = 1, . . . , k, let l(i) x : X P i(R) be the shorthand for the ith WL measure hierarchy l (k) (X,ℓx X ) generated from the label d X(x, ). We similarly define ℓy Y and l(i) y : Y P i(R) for any y Y and each i = 1, . . . , k. For any ν (k) , C(k) m X , m Y , there exist (νi) , C(1) m X , m Y for i = 1, . . . , k such that ν (k) x,y = Z (νk)xk 1,yk 1 (νk 1)xk 2,yk 2(dxk 1 dyk 1) (ν1)x,y(dx1 dy1) Weisfeiler-Lehman meets Gromov-Wasserstein for any x X and y Y . Hence, for any γ C(µX, µY ), we have that Z |d X(x, x ) d Y (y, y )|ν (k) x ,y (dx dy )γ(dx dy )γ(dx dy) |d X(x, x ) d Y (y, y )|(νk)xk 1,yk 1(dx dy ) (ν1)x ,y (dx1 dy1)γ(dx dy )γ(dx dy) d W (ℓx X)#m X xk 1, (ℓy Y )#m Y yk 1 (νk 1)xk 2,yk 2(dxk 1 dyk 1) (ν1)x ,y (dx1 dy1)γ(dx dy )γ(dx dy) d W l (1) x (xk 1), l (1) y (yk 1) (νk 1)xk 2,yk 2(dxk 1 dyk 1) (ν1)x ,y (dx1 dy1)γ(dx dy )γ(dx dy) d W l (k) x (x ), l (k) y (y ) γ(dx dy )γ(dx dy) d (k) WL((X, ℓx X), (Y, ℓy Y )) γ(dx dy) d (k) GW((X, d X), (Y, d Y )) = inf γ C(µX,µY ),ν(k) , C(k)(m X ,m Y ) |d X(x, x ) d Y (y, y )|ν (k) x ,y (dx dy )γ(dx dy )γ(dx dy) inf γ C(µX,µY ) d (k) WL((X, ℓx X), (Y, ℓy Y )) γ(dx dy). B.4.4. PROOF OF THE STATEMENT IN REMARK 5.3 We first recall the third lower bound (TLB) from (M emoli, 2011): TLB(X, Y) := inf γ C(µX,µY ) inf γ C(µX,µY ) |d X(x, x ) d Y (y, y )|γ (dx dy ) where we omit the 1 2 factor from (M emoli, 2011) for simplicity of presentation. We adopt notation from the previous section. Notice that inf γ C(µX,µY ) |d X(x, x ) d Y (y, y )|γ (dx dy ) = d (1) WL((X, ℓx X), (Y, ℓy Y )). We hence have that TLB(X, Y) = inf γ C(µX,µY ) d (1) WL((X, ℓx X), (Y, ℓy Y ))γ(dx dy). For any k N, we show that TLB(X, Y) = inf γ C(µX,µY ) d (k) WL((X, ℓx X), (Y, ℓy Y ))γ(dx dy) Weisfeiler-Lehman meets Gromov-Wasserstein by the lemma below. Lemma B.15. For any k N we have that d (k) WL((X, ℓx X), (Y, ℓy Y )) = d (1) WL((X, ℓx X), (Y, ℓy Y )). Proof. By item 2 in Lemma B.14 and Theorem A.7, we have that d (k) WL((X, ℓx X), (Y, ℓy Y )) = inf γ C(k)(µX,µY ) |d X(x, x ) d Y (y, y )|γ (dx dy ) = inf ν , C(k)(m X ,m Y ),µ C(µX,µY ) |d X(x, x ) d Y (y, y )|νx1,y1(dx dy )µ(dx1 dy1) inf µ C(µX,µY ) |d X(x, x ) d Y (y, y )|µ(dx dy )µ(dx1 dy1) = inf µ C(µX,µY ) |d X(x, x ) d Y (y, y )|µ(dx dy ) = d (1) WL((X, ℓx X), (Y, ℓy Y )). By Proposition 3.1, we conclude that d (k) WL((X, ℓx X), (Y, ℓy Y )) = d (1) WL((X, ℓx X), (Y, ℓy Y )). B.4.5. PROOF OF THE STATEMENT IN EXAMPLE 4 Given X and Y, we have for any γ C(µX, µY ) and ν (k) , that Z |ecc X(x ) ecc Y (y ))| ν (k) x,y(dx dy )γ(dx dy) X d X(x , x )µX(dx ) Z Y d Y (y , y )µY (y ) ν (k) x,y(dx dy )γ(dx dy) d X(x , x )γ(dx dy ) Z d Y (y , y )γ(dx dy ) ν (k) x,y(dx dy )γ(dx dy) |d X(x , x ) d Y (y , y )| ν (k) x,y(dx dy )γ(dx dy)γ(dx dy ). Hence, we conclude that ecc is stable. B.4.6. PROOF OF PROPOSITION 5.4 By Theorem A.7 we have that d (k) WL((X, ℓX), (Y, ℓY )) = inf γ(k) C(k)(µX,µY ) d Z(ℓX(x), ℓY (y))γ (k)(dx dy) = inf γ C(µX,µY ),ν(k) , C(k)(m X ,m Y ) d Z(ℓX(x ), ℓY (y )) ν (k) x,y(dx dy )γ(dx dy) inf γ C(µX,µY ),ν(k) , C(k)(m X ,m Y ) dis (k) γ, ν (k) , = d (k) GW((X, d X), (Y, d Y )) . The inequality follows from the fact that ℓ is stable. Hence we conclude the proof. Weisfeiler-Lehman meets Gromov-Wasserstein Table 4. SVM classification accuracy. 
Let f1(G, v) = deg G(v), f2(G, v) = 1 |VG| + deg G(v). For d (k) WL and d WLLB, we show SVM classification accuracies using both the indefinite kernel as well as with a KSVM (Loosli et al., 2015). METHOD MUTAG PROTEINS PTC-FM PTC-MR IMDB-B IMDB-M COX2 d (1) WL, f1 87.7 6.2 71.7 3.3 57.6 6.4 55.5 4.5 74.5 4.1 51.3 3.0 78.1 0.8 d (2) WL, f1 89.9 6.4 71.3 3.3 59.3 3.7 54.9 6.3 75.0 3.0 51.4 3.4 78.1 0.8 d (3) WL, f1 87.6 8.8 72.6 3.1 59.0 3.7 57.8 7.9 74.9 5.1 51.6 4.0 78.1 0.8 d (4) WL, f1 87.7 4.1 72.4 4.1 62.1 3.9 56.7 3.7 75.9 2.7 51.4 3.2 78.1 0.8 d (1) WL, f1, KSVM 88.3 4.6 72.1 2.6 61.0 3.9 52.9 7.6 75.6 5.1 51.4 4.8 77.5 2.8 d (2) WL, f1, KSVM 87.8 5.7 72.8 2.4 58.1 5.0 56.9 5.8 74.5 3.4 51.4 4.1 77.9 1.0 d (3) WL, f1, KSVM 88.8 6.8 72.6 3.7 61.3 4.6 55.2 6.1 75.0 3.8 51.8 3.8 78.7 1.7 d (4) WL, f1, KSVM 87.2 5.5 73.5 2.3 58.1 6.3 53.4 6.3 73.3 3.7 51.2 3.2 77.7 1.3 d (1) WL, f2 87.3 8.2 71.1 3.2 58.7 7.6 55.2 5.5 74.1 4.1 50.5 4.5 78.1 0.8 d (2) WL, f2 86.2 7.4 73.5 2.8 60.2 5.3 54.0 6.4 75.0 4.5 51.4 3.9 78.1 0.8 d (3) WL, f2 88.8 5.4 74.5 2.9 61.6 5.3 59.2 6.4 75.4 5.4 50.8 4.0 77.0 1.5 d (4) WL, f2 87.2 5.8 73.9 3.5 60.4 5.1 54.3 7.7 75.7 3.7 50.8 3.2 78.1 0.8 d (1) WL, f2, KSVM 88.8 7.9 74.3 3.3 63.3 2.3 60.1 6.3 76.2 4.1 51.5 3.3 78.1 0.8 d (2) WL, f2, KSVM 87.7 9.4 74.5 3.8 63.2 5.9 54.0 6.6 75.8 4.5 51.0 3.6 78.1 1.7 d (3) WL, f2, KSVM 86.6 5.9 74.2 2.3 62.7 7.6 58.1 6.1 73.9 3.9 51.7 3.1 79.2 1.8 d (4) WL, f2, KSVM 88.2 5.3 74.0 2.8 63.6 7.0 56.4 6.9 76.1 3.8 51.9 3.5 78.3 2.5 d (1) WLLB, f1 87.9 5.9 68.0 1.3 59.6 6.4 57.4 8.1 74.7 2.5 52.0 1.8 78.1 0.8 d (2) WLLB, f1 89.4 5.2 68.9 1.9 58.6 5.7 59.0 8.3 75.1 2.2 50.8 1.6 78.1 0.8 d (3) WLLB, f1 90.0 5.6 68.6 1.6 57.3 6.2 58.7 8.1 75.2 2.1 51.0 1.6 78.1 0.8 d (4) WLLB, f1 89.4 5.2 66.7 2.0 58.2 6.1 56.6 7.2 74.5 2.0 50.3 1.4 77.5 2.1 d (1) WLLB, f1, KSVM 88.2 5.7 72.0 1.9 60.1 4.8 54.3 4.8 75.7 3.2 50.2 3.3 78.1 1.9 d (2) WLLB, f1, KSVM 88.3 10.7 70.6 4.3 59.3 4.5 55.8 8.5 75.2 3.6 50.6 3.6 77.1 2.2 d (3) WLLB, f1, KSVM 88.8 5.4 71.7 2.7 58.1 6.6 54.6 4.5 75.4 3.8 50.8 4.1 78.1 0.8 d (4) WLLB, f1, KSVM 87.2 5.8 71.6 4.3 60.5 5.8 52.6 7.6 76.5 2.9 51.5 3.1 78.3 1.1 d (1) WLLB, f2 88.9 4.5 70.5 1.0 60.5 5.4 56.6 8.0 75.0 2.5 52.0 1.8 78.1 0.8 d (2) WLLB, f2 90.0 4.2 70.0 1.4 58.0 5.6 58.5 8.9 75.1 2.2 50.8 1.6 78.1 0.8 d (3) WLLB, f2 90.0 4.2 70.3 2.4 56.4 6.2 58.8 7.8 75.2 2.1 51.0 1.6 77.9 1.3 d (4) WLLB, f2 90.0 4.2 70.2 1.8 58.3 5.6 58.2 6.9 74.5 2.0 50.3 1.4 78.2 0.8 d (1) WLLB, f2, KSVM 86.7 7.8 32.0 4.2 61.8 2.7 52.0 5.4 27.1 4.5 20.0 2.8 78.1 0.8 d (2) WLLB, f2, KSVM 82.9 8.7 30.9 1.1 63.3 5.6 54.3 7.2 20.1 4.4 19.8 2.7 78.1 0.8 d (3) WLLB, f2, KSVM 86.1 4.9 30.6 2.1 61.6 6.5 55.5 6.3 25.2 1.1 18.0 1.8 78.1 0.8 d (4) WLLB, f2, KSVM 86.1 7.2 30.7 2.0 58.7 6.0 52.1 7.1 33.7 8.7 17.0 3.2 78.1 0.8 WWL 85.3 7.3 72.9 3.6 62.2 6.1 63.0 7.4 72.5 3.7 50.0 5.3 78.2 0.8 WL 85.5 1.6 71.6 0.6 56.6 2.1 56.2 2.0 72.4 0.7 50.9 0.4 78.4 1.1 WL-OA 86.3 2.1 72.6 0.7 58.4 2.0 54.2 1.6 73.0 1.1 50.2 1.1 78.8 1.3
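The SVM accuracies in Table 4 follow the protocol of Appendix A.8.1. The snippet below is a minimal sketch of that type of evaluation, not the authors' exact pipeline (in particular it omits the KSVM variant and nested cross-validation): it assumes a precomputed symmetric matrix D of pairwise d^(k)_WL or d^(k)_WLLB distances between the graphs of a dataset, a NumPy array y of class labels, and scikit-learn; the indefinite matrix exp(-gamma * D) is passed to the SVM as a precomputed kernel, and the function names are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC


def svm_cv_accuracy(D, y, C=1.0, gamma=1.0, n_splits=10, seed=0):
    """Cross-validated accuracy of an SVM on the (indefinite) kernel exp(-gamma * D)."""
    K = np.exp(-gamma * D)
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(K, y):
        clf = SVC(C=C, kernel="precomputed")
        clf.fit(K[np.ix_(tr, tr)], y[tr])                  # train-vs-train block
        accs.append(clf.score(K[np.ix_(te, tr)], y[te]))   # test-vs-train block
    return float(np.mean(accs))


# Hypothetical grid over the hyper-parameter ranges mentioned in Appendix A.8.1.
# best_acc = max(svm_cv_accuracy(D, y, C=c, gamma=g)
#                for c in 10.0 ** np.arange(-3, 4) for g in 10.0 ** np.arange(-3, 4))
```

scikit-learn does not check positive semi-definiteness of a precomputed kernel, which matches the treatment above of exp(-gamma * d) as a noisy observation of a true positive semi-definite kernel.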