# A Practical, Progressively-Expressive GNN

Lingxiao Zhao (Carnegie Mellon University) lingxiao@cmu.edu · Louis Härtel (RWTH Aachen University) haertel@informatik.rwth-aachen.de · Neil Shah (Snap Inc.) nshah@snap.com · Leman Akoglu (Carnegie Mellon University) lakoglu@andrew.cmu.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Message passing neural networks (MPNNs) have become a dominant flavor of graph neural networks (GNNs) in recent years. Yet, MPNNs come with notable limitations; namely, they are at most as powerful as the 1-dimensional Weisfeiler-Leman (1-WL) test in distinguishing graphs in a graph isomorphism testing framework. To this end, researchers have drawn inspiration from the k-WL hierarchy to develop more expressive GNNs. However, current k-WL-equivalent GNNs are not practical even for small values of k, as k-WL becomes combinatorially more complex as k grows. At the same time, several works have found great empirical success in graph learning tasks without highly expressive models, implying that chasing expressiveness with a coarse-grained ruler like k-WL is often unnecessary for practical tasks. To truly understand the expressiveness-complexity tradeoff, one desires a more fine-grained ruler that can increase expressiveness more gradually. Our work puts forth such a proposal: namely, we first propose the (k, c)(≤)-SETWL hierarchy, with greatly reduced complexity compared to k-WL, achieved by moving from k-tuples of nodes to sets of at most k nodes whose induced subgraph has at most c connected components. We show favorable theoretical results for this hierarchy in relation to k-WL, and concretize it via (k, c)(≤)-SETGNN, which is as expressive as (k, c)(≤)-SETWL. Our model is practical and progressively-expressive, increasing in power with k and c. We demonstrate its effectiveness on several benchmark datasets, achieving several state-of-the-art results with runtime and memory usage applicable to practical graphs. We open source our implementation at https://github.com/LingxiaoShawn/KCSetGNN.

## 1 Introduction

In recent years, graph neural networks (GNNs) have gained considerable attention [58, 60] for their ability to tackle various node-level [35, 56], link-level [51, 66], and graph-level [9, 40] learning tasks, given their ability to learn rich representations for complex graph-structured data. The common template for designing GNNs follows the message passing paradigm; these so-called message-passing neural networks (MPNNs) are built by stacking layers which encompass feature transformation and aggregation operations over the input graph [25, 39]. Despite their advantages, MPNNs have several limitations, including oversmoothing [12, 47, 67], oversquashing [2, 55], inability to distinguish node identities [63] and positions [62], and limited expressive power [61]. Since Xu et al.'s [61] seminal work showed that MPNNs are at most as powerful as the first-order Weisfeiler-Leman (1-WL) test in the graph isomorphism (GI) testing framework, there have been several follow-up works on improving the understanding of GNN expressiveness [3, 15].
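To make the 1-WL ceiling concrete before moving on, the following small sketch (ours, not from the paper; it assumes networkx is installed) shows two non-isomorphic graphs that 1-WL, and hence any standard MPNN, cannot tell apart: two disjoint triangles versus a single 6-cycle, both 2-regular.

```python
# Minimal illustration of the 1-WL limitation discussed above (assumes networkx >= 2.6).
# Two disjoint triangles and one 6-cycle are non-isomorphic, yet both are 2-regular,
# so 1-WL (color refinement) assigns them identical color histograms.
import networkx as nx

g1 = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))  # two triangles
g2 = nx.cycle_graph(6)                                        # one hexagon

h1 = nx.weisfeiler_lehman_graph_hash(g1, iterations=3)
h2 = nx.weisfeiler_lehman_graph_hash(g2, iterations=3)

print(h1 == h2)                      # True: 1-WL cannot tell them apart
print(nx.is_isomorphic(g1, g2))      # False: they are not isomorphic
```

Any MPNN whose layer-wise aggregation is no more powerful than 1-WL produces identical graph embeddings for these two inputs when node features are uniform.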
Figure 1: Main steps of (k, c)(≤)-SETGNN. Given a graph and (k, c), we build the (k, c)-bipartite super-graph (middle) containing sets with up to k nodes and up to c connected components in their induced subgraphs, on which a base GNN assigns initial colors. Bidirectional bipartite GNN layers with forward-backward message passing learn set embeddings, which are pooled into a graph embedding. The size of the super-graph, and accordingly its expressiveness, grows progressively with increasing k and c. The 2-bipartite message passing generalizes normal GNNs, edge GNNs, and line graphs (see Appendix A.12).

In response, the community proposed many GNN models to overcome such limitations [4, 52, 68]. Several of these aim to design powerful higher-order GNNs which are increasingly expressive [9, 33, 41, 43], drawing inspiration from the k-WL hierarchy [54]. A key limitation of such higher-order models that reach beyond 1-WL is their poor scalability; in fact, these models can only be applied to small graphs with small k in practice [40, 41, 43] due to combinatorial growth in complexity. On the other hand, it remains an open question whether practical graph learning tasks indeed need such complex and extremely expressive GNN models. Historically, Babai et al. [5] showed that almost all graphs on n nodes can be distinguished by 1-WL. In other contexts like node classification, researchers have encountered superior generalization performance with graph-augmented multi-layer perceptron (GA-MLP) models [37, 50] compared to MPNNs, despite the former being strictly less expressive than the latter [13]. Considering that increased model complexity has negative implications in terms of overfitting and generalization [29], it is worth re-evaluating continued efforts to pursue maximally expressive GNNs in practice. Ideally, one could study these models' generalization performance on various real-world tasks by increasing k (expressiveness) in the k-WL hierarchy. Yet, given the impracticality of this too-coarse ruler of expressiveness, which becomes infeasible beyond k>3, one desires a more fine-grained ruler, that is, a new hierarchy whose expressiveness grows more gradually. Such a hierarchy could enable us to gradually build more expressive models which admit improved scaling, and avoid unnecessary leaps in model complexity for tasks which do not require them, guarding against overfitting.

Present Work. In this paper, we propose such a hierarchy, and an associated practical, progressively-expressive GNN model, called (k, c)(≤)-SETGNN, whose expressiveness can be modulated through k and c. In a nutshell, we take inspiration from k-WL, yet achieve practicality through three design ideas which simplify key bottlenecks of scaling k-WL. First, we move away from the k-tuples of the original k-WL to k-multisets (unordered), and then to k(≤)-sets. We demonstrate that these steps drastically reduce the number of nodes in the k-WL graph while retaining significant expressiveness (Secs. 3.2-3.3). Second, by considering the underlying sparsity of an input graph, we reduce scope to (k, c)(≤)-sets, which consist of at most k nodes whose induced subgraph comprises at most c ≤ k connected components (a toy enumeration of such sets is sketched at the end of this introduction). This also yields a massive reduction in the number of nodes while improving practicality; i.e., small values of c on sparse graphs allow one to increase k well beyond 3 in practice (Sec. 3.4).
Third, we design a super-graph architecture that consists of a sequence of k-1 bipartite graphs over which we learn embeddings for our (k, c)(≤)-sets, using bidirectional message passing. These embeddings can be pooled to yield a final graph embedding (Secs. 4.1-4.2). We also speed up initializing colors for the (k, c)(≤)-sets, for c > 1, substantially reducing computation and memory (Sec. 4.3). Fig. 1 overviews our proposed framework.

Experimentally, our (k, c)(≤)-SETGNN outperforms existing state-of-the-art GNNs on simulated expressiveness tasks as well as real-world graph-level tasks, achieving new best results on substructure counting and ZINC-12K, respectively. We show that generalization performance reflects the increasing expressiveness with k and c. Our proposed scalable designs allow us to train models with, e.g., k=10 or c=3 with practical running time and memory requirements.
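As a concrete but hedged illustration of the central object introduced above, the brute-force sketch below (ours, for intuition only; the released implementation linked in the abstract constructs these objects far more efficiently) enumerates all (k, c)(≤)-sets of a small graph and links each m-set to the (m+1)-sets containing it, mirroring the bipartite super-graph structure of Figure 1. The containment-based super-edges and the helper names are our assumptions for illustration.

```python
# A minimal sketch (assuming networkx; not the paper's optimized implementation) of the
# (k, c)(<=)-set construction: enumerate all node sets with at most k nodes whose induced
# subgraph has at most c connected components, and link each m-set to the (m+1)-sets that
# contain it, giving the sequence of k-1 bipartite blocks. Enumeration here is brute force
# over all combinations, so it is only suitable for tiny graphs.
from itertools import combinations
import networkx as nx

def kc_sets(G, k, c):
    """Return {m: list of m-sets (as frozensets)} for m = 1..k, keeping only sets
    whose induced subgraph has at most c connected components."""
    sets_by_size = {m: [] for m in range(1, k + 1)}
    for m in range(1, k + 1):
        for nodes in combinations(G.nodes, m):
            if nx.number_connected_components(G.subgraph(nodes)) <= c:
                sets_by_size[m].append(frozenset(nodes))
    return sets_by_size

def bipartite_blocks(sets_by_size, k):
    """Super-edges between consecutive set sizes: an m-set is linked to every
    (m+1)-set that contains it (containment is our assumed super-edge relation)."""
    blocks = {}
    for m in range(1, k):
        smaller = set(sets_by_size[m])
        blocks[m] = [(s, t) for t in sets_by_size[m + 1]
                     for s in map(frozenset, combinations(t, m)) if s in smaller]
    return blocks

G = nx.cycle_graph(6)                                # toy input graph
supernodes = kc_sets(G, k=3, c=2)
superedges = bipartite_blocks(supernodes, k=3)
print({m: len(v) for m, v in supernodes.items()})    # supernode counts per set size
print({m: len(v) for m, v in superedges.items()})    # super-edge counts per bipartite block
```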
## 2 Related Work

MPNN limitations. Given the understanding that MPNNs have expressiveness upper-bounded by 1-WL [61], several researchers investigated what else MPNNs cannot learn. To this end, Chen et al. [15] showed that MPNNs cannot count induced connected substructures of 3 or more nodes, and, together with [15], Arvind et al. [3] showed that MPNNs can only count star-shaped patterns. Loukas [38] further proved several results regarding decision problems on graphs (e.g., subgraph verification, cycle detection), finding that MPNNs cannot solve these problems unless strict conditions on depth and width are met. Moreover, Garg et al. [22] showed that many standard MPNNs cannot compute properties such as longest or shortest cycles, diameters, or the existence of cliques.

Improving expressivity. Several works aim to address the expressiveness limitations of MPNNs. One approach is to inject features into the MPNN aggregation, motivated by Loukas [38], who showed that MPNNs can be universal approximators when nodes are sufficiently distinguishable. Sato et al. [53] show that injecting random features better enables MPNNs to solve problems like minimum dominating set and maximum matching. You et al. [63] inject cycle counts as node features, motivated by the limitation of MPNNs not being able to count cycles. Others proposed utilizing subgraph isomorphism counting to empower MPNNs with substructure information they cannot learn on their own [9, 10]. Earlier work by You et al. [62] adds positional features to distinguish nodes which naïve MPNN embeddings would not. Recently, several works also propose utilizing subgraph aggregation; Zhang et al. [65], Zhao et al. [68], and Bevilacqua et al. [8] propose subgraph variants of the WL test which are no less powerful than 3-WL. A recent follow-up work [21] shows that rooted-subgraph-based extensions with 1-WL as the kernel are bounded by 3-WL. The community continues to explore several distinct avenues for overcoming these limitations: Murphy et al. [46] propose a relational pooling mechanism which sums over all permutations of a permutation-sensitive function to achieve above-1-WL expressiveness. Balcilar et al. [6] generalize spatial and spectral MPNNs and show that instances of spectral MPNNs are more powerful than 1-WL. Azizian et al. [4] and Geerts et al. [24] unify expressivity and approximation-ability results for existing GNNs.

k-WL-inspired GNNs. The k-WL test captures higher-order interactions in graph data by considering all k-tuples, i.e., size-k ordered multisets defined over the set of nodes. While highly expressive, it does not scale to practical graphs beyond a very small k (k = 3 pushes the practical limit even for small graphs). However, several works propose designs which can achieve k-WL expressiveness in theory and strive to make them practical. Maron et al. [41] propose a general class of invariant graph networks, k-IGNs, with exact k-WL expressivity [23], which are however not scalable. Morris et al. have several works on k-WL and its variants, such as k-LWL [42], k-GNN [43], and δ-k-WL-GNN [44]. Both k-LWL and k-GNN use a variant of k-WL that only considers k-sets; they are strictly weaker than k-WL but much more practical. In our paper we argue that k(≤)-sets should be used with additional designs to retain as much expressivity as possible while remaining practical. The δ-k-WL-GNN works with k-tuples but sparsifies the connections among tuples by considering locality. Qian et al. [48] extend the k-tuple formulation to subgraph networks but face the same scalability bottleneck as k-WL. A recent concurrent work, SpeqNet [45], proposes to reduce the number of tuples by restricting the number of connected components of the tuples' induced subgraphs; this idea is independently explored in our paper in Sec. 3.4. None of the four variants proposed by Morris et al. has realizations beyond k = 3 in their experiments, while we manage to reach k = 10. Interestingly, a recent work [34] links graph transformers with k-IGNs, with better (or equal) expressivity, yet it is still not scalable.

## 3 A practical progressively-expressive isomorphism test: (k, c)(≤)-SETWL

We first motivate our method (k, c)(≤)-SETGNN from the GI testing perspective by introducing (k, c)(≤)-SETWL. We first introduce notation and background on k-WL (Sec. 3.1). Next, we show how to increase the practicality of k-WL without reducing much of its expressivity, by removing node ordering (Sec. 3.2) and node repetitions (Sec. 3.3), and by leveraging graph sparsity (Sec. 3.4). We then give the complexity analysis (Sec. 3.5). We close the section by extending the idea to k-FWL, which is as expressive as (k+1)-WL (Sec. 3.6). All proofs are included in Appendix A.4.
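Before the formal treatment, a quick back-of-the-envelope comparison (ours, not from the paper) shows why moving from ordered k-tuples to k(≤)-sets, as done in Secs. 3.2-3.3, already shrinks the number of objects to color; the additional (k, c) restriction of Sec. 3.4 reduces this further on sparse graphs, by an amount that depends on the input graph.

```python
# A small side calculation (not from the paper) comparing how many objects k-WL-style
# methods track: ordered k-tuples (n^k), k-multisets (C(n+k-1, k)), and the k(<=)-sets
# used here (sum of C(n, i) for i <= k). The (k, c) restriction on connected components
# shrinks the last count further on sparse graphs, but that depends on the input graph.
from math import comb

def num_tuples(n, k):
    return n ** k

def num_multisets(n, k):
    return comb(n + k - 1, k)

def num_sets_le_k(n, k):
    return sum(comb(n, i) for i in range(1, k + 1))

n, k = 100, 4
print(num_tuples(n, k))      # 100_000_000 ordered 4-tuples
print(num_multisets(n, k))   #   4_421_275 4-multisets
print(num_sets_le_k(n, k))   #   4_087_975 sets with at most 4 nodes
```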
Notation: Let $G = (V(G), E(G), l_G)$ be an undirected, colored graph with nodes $V(G)$, edges $E(G)$, and a color-labeling function $l_G : V(G) \to C$, where $C = \{c_1, ..., c_d\}$ denotes a set of $d$ distinct colors. Let $[n] = \{1, 2, 3, ..., n\}$. Let $(\cdot)$ denote a tuple (ordered multiset), $\{\{\cdot\}\}$ denote a multiset (a set which allows repetition), and $\{\cdot\}$ denote a set. We define $\vec{v} = (v_1, ..., v_k)$ as a k-tuple, $\tilde{v} = \{\{v_1, ..., v_k\}\}$ as a k-multiset, and $\hat{v} = \{v_1, ..., v_k\}$ as a k-set. Let $\vec{v}[x/i] = (v_1, ..., v_{i-1}, x, v_{i+1}, ..., v_k)$ denote replacing the i-th element in $\vec{v}$ with $x$, and analogously for $\tilde{v}[x/i]$ and $\hat{v}[x/i]$ (multisets and sets are assumed to be represented with a canonical ordering). When $v_i \in V(G)$, let $G[\vec{v}]$, $G[\tilde{v}]$, $G[\hat{v}]$ denote the subgraph of $G$ induced by the nodes in $\vec{v}$, $\tilde{v}$, $\hat{v}$, respectively (keeping repetitions). An isomorphism between two graphs $G$ and $G'$ is a bijective mapping $p : V(G) \to V(G')$ which satisfies $\forall u, v \in V(G),\ (u, v) \in E(G) \iff (p(u), p(v)) \in E(G')$ and $\forall u \in V(G),\ l_G(u) = l_{G'}(p(u))$. Two graphs $G, G'$ are isomorphic if there exists an isomorphism between them, which we denote as $G \cong G'$.

### 3.1 Preliminaries: the k-Weisfeiler-Leman (k-WL) Graph Isomorphism Test

The 1-dimensional Weisfeiler-Leman (1-WL) test, also known as the color refinement algorithm [49], is a widely celebrated approach to (approximately) test GI. Although extremely fast and effective for most graphs (1-WL can provide canonical forms for all but an $n^{-1/7}$ fraction of n-vertex graphs [5]), it fails to distinguish members of many important graph families, such as regular graphs. The more powerful k-WL algorithm first appeared in [57], and extends the coloring of vertices (or 1-tuples) to that of k-tuples. k-WL is progressively expressive with increasing k, and can distinguish any finite set of graphs given a sufficiently large k. k-WL has many interesting connections to logic, games, and linear equations [11, 28]. Another variant of k-WL is called k-FWL (Folklore WL), such that (k+1)-WL is equally expressive as k-FWL [11, 26]. Our work focuses on k-WL.

k-WL iteratively recolors all $n^k$ k-tuples defined on a graph $G$ with $n$ nodes. At iteration 0, each k-tuple $\vec{v} = (v_1, ..., v_k) \in V(G)^k$ is initialized with a color given by its atomic type $\mathrm{at}_k(G, \vec{v})$. Assuming $G$ has $d$ colors, $\mathrm{at}_k(G, \vec{v}) \in \{0, 1\}^{2\binom{k}{2}+kd}$ is an ordered vector encoding: the first $2\binom{k}{2}$ entries indicate, for every pair $1 \le i < j \le k$, whether $v_i = v_j$ and whether $(v_i, v_j) \in E(G)$, and the remaining $kd$ entries one-hot encode the color of each $v_i$.

(1) When $c_1 > c_2$, (k, $c_1$)(≤)-SETGNN is more expressive than (k, $c_2$)(≤)-SETGNN, and (2) when $k_1 > k_2$, ($k_1$, c)(≤)-SETGNN is more expressive than ($k_2$, c)(≤)-SETGNN.

### 4.2 Bidirectional Sequential Message Passing

The t-th layer of (k, c)(≤)-SETGNN (Eq. (10) and Eq. (11)) essentially propagates information back and forth on the super-graph $S_{k,c\text{-}swl}(G)$, which is a sequence of k-1 bipartite graphs (see the middle of Figure 1), in parallel for all supernodes. We propose to change this to bidirectional blockwise sequential message passing, which we call (k, c)(≤)-SETGNN*, defined as follows: for $m = k-1$ down to $1$ and every $m$-set $\hat{v}$,
$$h^{(t+\frac{1}{2})}(\hat{v}) = \mathrm{MLP}^{(t)}_{m,1}\Big(h^{(t)}(\hat{v}),\ \sum_{\hat{u} \in N_{m+1}(\hat{v})} h^{(t+\frac{1}{2})}(\hat{u})\Big),$$
and for $m = 2$ up to $k$ and every $m$-set $\hat{v}$,
$$h^{(t+1)}(\hat{v}) = \mathrm{MLP}^{(t+\frac{1}{2})}_{m,2}\Big(h^{(t+\frac{1}{2})}(\hat{v}),\ \sum_{\hat{u} \in N_{m-1}(\hat{v})} h^{(t+1)}(\hat{u})\Big),$$
where $N_{m+1}(\hat{v})$ and $N_{m-1}(\hat{v})$ denote $\hat{v}$'s neighbors among the $(m+1)$-sets and $(m-1)$-sets of the super-graph, respectively (with the convention $h^{(t+\frac{1}{2})} := h^{(t)}$ for k-sets and $h^{(t+1)} := h^{(t+\frac{1}{2})}$ for 1-sets).

Notice that (k, c)(≤)-SETGNN* has lower memory usage: (k, c)(≤)-SETGNN loads the complete super-graph (k-1 bipartite blocks), while (k, c)(≤)-SETGNN* loads one of the k-1 bipartite blocks at a time, which is beneficial for GPUs with limited memory. What is more, for a small, finite t it is even more expressive than (k, c)(≤)-SETGNN. We provide implementations of both parallel and sequential message passing in the official GitHub repository, while we only report the performance of (k, c)(≤)-SETGNN* given its efficiency and better expressivity.

Theorem 7. For any $t \in \mathbb{N}$, the t-layer (k, c)(≤)-SETGNN* is more expressive than the t-layer (k, c)(≤)-SETGNN. As $t \to \infty$, (k, c)(≤)-SETGNN is as expressive as (k, c)(≤)-SETGNN*.

### 4.3 Improving Supernode Initialization

Next we describe how to improve the supernode initialization (Eq. (9)) and greatly reduce computational and memory overhead for c > 1 without losing any expressivity. We achieve this via the fact that a graph with c components can be viewed as a set of c connected components. Formally,

Theorem 8. Let $G$ be a graph with $c$ connected components $C_1, ..., C_c$, and let $G'$ be a graph also with $c$ connected components $C'_1, ..., C'_c$. Then $G$ and $G'$ are isomorphic if and only if $\exists p : [c] \to [c]$ such that $\forall i \in [c]$, $C_i$ and $C'_{p(i)}$ are isomorphic.

The theorem implies that we only need to apply Eq. (9) to supernodes with a single component; for any supernode $\hat{v}$ whose induced subgraph has components $\hat{u}_1, ..., \hat{u}_c$, we can obtain its encoding as $h^{(0)}(\hat{v}) = \mathrm{DeepSet}(\{h^{(0)}(\hat{u}_1), ..., h^{(0)}(\hat{u}_c)\})$ without passing its induced subgraph to the base GNN. This eliminates the heavy computation of encoding a large number of induced subgraphs. We give the algorithm for building the connections between $\hat{v}$ and its single components in Appendix A.6.
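To make Secs. 4.2-4.3 concrete, here is a schematic PyTorch sketch of one bidirectional blockwise sequential layer. It is our simplification under assumed data layouts (per-size embedding tensors and per-block edge indices), not the released implementation, and it omits details such as the supernode initialization of Sec. 4.3 and final pooling.

```python
# A schematic PyTorch sketch (our own simplification, not the released implementation) of the
# bidirectional blockwise sequential message passing of Sec. 4.2. `edge_index[m]` links
# m-sets to (m+1)-sets (row 0: m-set id, row 1: (m+1)-set id); `h[m]` holds m-set embeddings.
import torch
import torch.nn as nn

class BidirectionalSetLayer(nn.Module):
    def __init__(self, k, dim):
        super().__init__()
        mlp = lambda: nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.k = k
        self.backward_mlps = nn.ModuleDict({str(m): mlp() for m in range(1, k)})      # m = k-1..1
        self.forward_mlps  = nn.ModuleDict({str(m): mlp() for m in range(2, k + 1)})  # m = 2..k

    def forward(self, h, edge_index):
        # h: {m: [N_m, dim]}, edge_index: {m: LongTensor [2, E_m]} for m = 1..k-1
        half = dict(h)
        for m in range(self.k - 1, 0, -1):                      # backward: larger -> smaller sets
            src, dst = edge_index[m][1], edge_index[m][0]       # from (m+1)-sets to m-sets
            agg = torch.zeros_like(h[m]).index_add_(0, dst, half[m + 1][src])
            half[m] = self.backward_mlps[str(m)](torch.cat([h[m], agg], dim=-1))
        out = dict(half)
        for m in range(2, self.k + 1):                          # forward: smaller -> larger sets
            src, dst = edge_index[m - 1][0], edge_index[m - 1][1]
            agg = torch.zeros_like(half[m]).index_add_(0, dst, out[m - 1][src])
            out[m] = self.forward_mlps[str(m)](torch.cat([half[m], agg], dim=-1))
        return out

# Toy usage: k = 3, random supernode counts and bipartite connections.
k, dim = 3, 16
counts = {1: 6, 2: 9, 3: 7}
h = {m: torch.randn(counts[m], dim) for m in range(1, k + 1)}
edge_index = {m: torch.stack([torch.randint(0, counts[m], (20,)),
                              torch.randint(0, counts[m + 1], (20,))]) for m in range(1, k)}
layer = BidirectionalSetLayer(k, dim)
out = layer(h, edge_index)            # a graph embedding could then pool, e.g., out[1].sum(0)
```

The backward loop updates m-sets from their (m+1)-set neighbors in decreasing order of m, and the forward loop then updates them from their (m-1)-set neighbors in increasing order, matching the sequential scheme that lets only one bipartite block reside on the GPU at a time.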
Table 1: Simulation data performance. Columns Triangle through 4-Cycle report substructure-counting MAE; columns IsConnected through Radius report graph-property log10(MSE). For (k, c)(≤)-SETGNN*, the (k, c) values that achieve the reported performance are given in the last row. (ACC: accuracy, MA[S]E: mean abs.[sq.] error)

| Method | EXP (ACC) | SR25 (ACC) | Triangle | Tailed Tri. | Star | 4-Cycle | IsConnected | Diameter | Radius |
|---|---|---|---|---|---|---|---|---|---|
| GCN | 50% | 6.67% | 0.4186 | 0.3248 | 0.1798 | 0.2822 | -1.7057 | -2.4705 | -3.9316 |
| GIN | 50% | 6.67% | 0.3569 | 0.2373 | 0.0224 | 0.2185 | -1.9239 | -3.3079 | -4.7584 |
| PNA | 50% | 6.67% | 0.3532 | 0.2648 | 0.1278 | 0.2430 | -1.9395 | -3.4382 | -4.9470 |
| PPGN | 100% | 6.67% | 0.0089 | 0.0096 | 0.0148 | 0.0090 | -1.9804 | -3.6147 | -5.0878 |
| GIN-AK+ | 100% | 6.67% | 0.0123 | 0.0112 | 0.0150 | 0.0126 | -2.7513 | -3.9687 | -5.1846 |
| PNA*-AK+ | 100% | 6.67% | 0.0118 | 0.0138 | 0.0166 | 0.0132 | -2.6189 | -3.9011 | -5.2026 |
| (k, c)(≤)-SETGNN* | 100% | 100% | 0.0073 | 0.0075 | 0.0134 | 0.0075 | -5.4667 | -4.0800 | -5.1603 |
| (k, c) used | (≥3, ≥2) | (≥4, ≥1) | (3, 2) | (4, 1) | (3, 2) | (4, 1) | (4, 2) | (4, 1) | (2, 2) |

Table 2: Train and test MAE on the substructure counting tasks for varying k and c. Notice the orders-of-magnitude drop in Test MAE per task as k grows.

| k | c | Triangle Train | Triangle Test | Tailed Tri. Train | Tailed Tri. Test | Star Train | Star Test | 4-Cycle Train | 4-Cycle Test |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 0.9941 ± 0.2623 | 1.1409 ± 0.1224 | 1.1506 ± 0.2542 | 0.8695 ± 0.0781 | 1.5348 ± 2.0697 | 2.3454 ± 0.8198 | 1.2159 ± 0.0292 | 0.8361 ± 0.1171 |
| 3 | 1 | 0.0311 ± 0.0025 | 0.0088 ± 0.0001 | 0.0303 ± 0.0108 | 0.0085 ± 0.0018 | 0.0559 ± 0.0019 | 0.0151 ± 0.0006 | 0.1351 ± 0.0058 | 0.1893 ± 0.0030 |
| 4 | 1 | 0.0321 ± 0.0008 | 0.0151 ± 0.0074 | 0.0307 ± 0.0085 | 0.0075 ± 0.0012 | 0.0687 ± 0.0104 | 0.0339 ± 0.0009 | 0.0349 ± 0.0007 | 0.0075 ± 0.0002 |
| 5 | 1 | 0.0302 ± 0.0070 | 0.0208 ± 0.0042 | 0.0553 ± 0.0009 | 0.0189 ± 0.0024 | 0.0565 ± 0.0078 | 0.0263 ± 0.0023 | 0.0377 ± 0.0057 | 0.0175 ± 0.0036 |
| 6 | 1 | 0.0344 ± 0.0024 | 0.0247 ± 0.0085 | 0.0357 ± 0.0017 | 0.0171 ± 0.0000 | 0.0560 ± 0.0000 | 0.0168 ± 0.0022 | 0.0356 ± 0.0014 | 0.0163 ± 0.0064 |
| 2 | 2 | 0.3452 ± 0.0329 | 0.4029 ± 0.0053 | 0.2723 ± 0.0157 | 0.2898 ± 0.0055 | 0.0466 ± 0.0025 | 0.0242 ± 0.0006 | 0.2369 ± 0.0123 | 0.2512 ± 0.0029 |
| 3 | 2 | 0.0234 ± 0.0030 | 0.0073 ± 0.0009 | 0.0296 ± 0.0074 | 0.0100 ± 0.0009 | 0.0640 ± 0.0003 | 0.0134 ± 0.0006 | 0.0484 ± 0.0135 | 0.0194 ± 0.0065 |
| 4 | 2 | 0.0587 ± 0.0356 | 0.0131 ± 0.0010 | 0.0438 ± 0.0140 | 0.0094 ± 0.0002 | 0.0488 ± 0.0008 | 0.0209 ± 0.0063 | 0.0464 ± 0.0037 | 0.0110 ± 0.0020 |

## 5 Experiments

We design experiments to answer the following questions. Q1. Performance: How does (k, c)(≤)-SETGNN compare to SOTA expressive GNNs? Q2. Varying k and c: How does the progressively increasing expressiveness reflect on generalization performance? Q3. Computational requirements: Is (k, c)(≤)-SETGNN feasible on practical graphs w.r.t. running time and memory usage?

Datasets. To inspect expressive power, we use four different types of simulation datasets: 1) EXP [1] contains 600 pairs of graphs that 1-WL and 2-WL fail to distinguish, which we turn into a binary classification task by assigning the two graphs in each pair to different classes; 2) SR25 [6] has 15 strongly regular graphs (which 3-WL fails to distinguish) with 25 nodes each, which we transform into a 15-way classification task; 3) substructure counting (i.e., triangle, tailed triangle, star, and 4-cycle) tasks on a random graph dataset [15]; 4) graph property regression (i.e., connectedness, diameter, radius) tasks on a random graph dataset [17]. We also evaluate performance on two real-world graph learning tasks: 5) ZINC-12K [19] and 6) QM9 [59] for molecular property regression. See Table 6 in Appendix A.7 for detailed dataset statistics.

Baselines. We use GCN [36], GIN [61], PNA [17], PPGN [40], PF-GNN [18], and GNN-AK [68] as baselines on the simulation datasets. On ZINC-12K we also reference CIN [9] directly from the literature. Most baseline results are taken from [68]. Finally, we compare to GINE [32] on QM9. Hyperparameter and model configurations are described in Appendix A.8.
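For orientation, the ground-truth targets of the four counting tasks can be computed exactly with standard counting identities; the sketch below (ours, using networkx and well-known closed forms, not the dataset's own generation code) counts triangles, tailed triangles, 3-stars, and 4-cycles.

```python
# Closed-form counts of the four substructures used in the counting tasks (a sketch based on
# standard identities, not the dataset's generation script). A "star" here means a 3-star
# (one center with three neighbors) and a "tailed triangle" is a triangle with one pendant edge.
from math import comb
import networkx as nx
import numpy as np

def substructure_counts(G):
    A = nx.to_numpy_array(G)
    deg = A.sum(axis=1)
    tri_per_node = nx.triangles(G)                       # triangles through each node
    triangles = sum(tri_per_node.values()) // 3
    stars = int(sum(comb(int(d), 3) for d in deg))       # choose 3 neighbors of a center
    tailed = int(sum(t * (G.degree(v) - 2) for v, t in tri_per_node.items()))
    closed_4_walks = np.trace(np.linalg.matrix_power(A, 4))
    four_cycles = int(round((closed_4_walks - 2 * (deg ** 2).sum() + 2 * G.number_of_edges()) / 8))
    return {"triangle": triangles, "tailed_triangle": tailed, "star": stars, "4-cycle": four_cycles}

print(substructure_counts(nx.complete_graph(4)))
# {'triangle': 4, 'tailed_triangle': 12, 'star': 4, '4-cycle': 3}
```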
### 5.2 Results

Theorem 2 shows that k-MULTISETWL is able to distinguish CFI(k) graphs. The result also holds for k(≤)-SETWL, as the proof does not use repetitions. We implemented the construction of CFI(k) graphs for any k. Empirically, we found that (k, c)(≤)-SETGNN is able to distinguish CFI(k) for k = 3, 4, 5 (larger k runs out of memory). This empirically verifies its theoretical expressivity.

Table 1 shows the performance results on all the simulation datasets. Less expressive models such as GCN, GIN, and PNA underperform on EXP and SR25, while the 3-WL-equivalent PPGN excels only on EXP. Notably, (k, c)(≤)-SETGNN achieves 100% discriminating power on EXP with any k ≥ 3 and c ≥ 2, and on SR25 with any k ≥ 4 and c ≥ 1. (k, c)(≤)-SETGNN also outperforms the baselines on all substructure counting tasks, as well as on two out of three graph property prediction tasks (except Radius), with a significant gap, for relatively small values of k and c.

In Table 2, we show the train and test MAEs on the substructure counting tasks for individual values of k and c. As expected, performance improves with increasing k when c is fixed, and vice versa. It is notable that orders-of-magnitude improvements in test error occur when moving from k=2 to k=3 for the triangle tasks as well as the star task, while a similarly large drop is obtained at k=4 for the 4-cycle task; this is expected, as triangles and 4-cycles have 3 and 4 nodes, respectively. Similar observations hold for the graph property tasks as well (see Appendix A.9).

Table 3: SetGNN achieves SOTA on ZINC-12K (test MAE).

| Method | Test MAE |
|---|---|
| GatedGCN | 0.363 ± 0.009 |
| GCN | 0.321 ± 0.009 |
| PNA | 0.188 ± 0.004 |
| DGN | 0.168 ± 0.003 |
| GIN | 0.163 ± 0.004 |
| GINE | 0.157 ± 0.004 |
| HIMP | 0.151 ± 0.006 |
| PNA | 0.140 ± 0.006 |
| GSN | 0.115 ± 0.012 |
| PF-GNN | 0.122 ± 0.010 |
| GIN-AK+ | 0.080 ± 0.001 |
| CIN | 0.079 ± 0.006 |
| (k, c)(≤)-SETGNN* | 0.0750 ± 0.0027 |

In addition to simulation datasets, we evaluate our (k, c)(≤)-SETGNN* on real-world data; Table 3 shows our performance on ZINC-12K. Our method achieves a new state-of-the-art performance, with a mean absolute error (MAE) of 0.0750, using k=5 and c=2. In Table 4 we show the test and validation MAE along with training loss for varying k and c on ZINC-12K. For a fixed c, validation MAE and training loss both follow a first-decreasing and later-increasing trend with increasing k, potentially owing to the difficulty of fitting with too many sets (i.e., supernodes) and edges in the super-graph. Similar results are shown for QM9 in Table 5. For comparison we also show GINE performance using both 4 and 6 layers, both of which perform significantly worse than (k, c)(≤)-SETGNN*.
Table 4: (k, c)(≤)-SETGNN* performance on ZINC-12K for varying (k, c). Test MAE is reported at the lowest validation MAE.

| k | c | Train loss | Val. MAE | Test MAE |
|---|---|---|---|---|
| 2 | 1 | 0.1381 ± 0.0240 | 0.2429 ± 0.0071 | 0.2345 ± 0.0131 |
| 3 | 1 | 0.1172 ± 0.0063 | 0.2298 ± 0.0060 | 0.2252 ± 0.0030 |
| 4 | 1 | 0.0693 ± 0.0111 | 0.1645 ± 0.0052 | 0.1636 ± 0.0052 |
| 5 | 1 | 0.0643 ± 0.0019 | 0.1593 ± 0.0051 | 0.1447 ± 0.0013 |
| 6 | 1 | 0.0519 ± 0.0064 | 0.0994 ± 0.0093 | 0.0843 ± 0.0048 |
| 7 | 1 | 0.0543 ± 0.0048 | 0.0965 ± 0.0061 | 0.0747 ± 0.0022 |
| 8 | 1 | 0.0564 ± 0.0152 | 0.0961 ± 0.0043 | 0.0732 ± 0.0037 |
| 9 | 1 | 0.0817 ± 0.0274 | 0.0909 ± 0.0094 | 0.0824 ± 0.0056 |
| 10 | 1 | 0.0894 ± 0.0266 | 0.1060 ± 0.0157 | 0.0950 ± 0.0102 |
| 2 | 2 | 0.1783 ± 0.0602 | 0.2913 ± 0.0102 | 0.2948 ± 0.0210 |
| 3 | 2 | 0.0640 ± 0.0072 | 0.1668 ± 0.0078 | 0.1391 ± 0.0102 |
| 4 | 2 | 0.0499 ± 0.0043 | 0.1029 ± 0.0033 | 0.0836 ± 0.0010 |
| 5 | 2 | 0.0483 ± 0.0017 | 0.0899 ± 0.0056 | 0.0750 ± 0.0027 |
| 6 | 2 | 0.0530 ± 0.0064 | 0.0927 ± 0.0050 | 0.0737 ± 0.0006 |
| 7 | 2 | 0.0547 ± 0.0036 | 0.0984 ± 0.0047 | 0.0784 ± 0.0043 |
| 3 | 3 | 0.0798 ± 0.0062 | 0.1881 ± 0.0076 | 0.1722 ± 0.0086 |
| 4 | 3 | 0.0565 ± 0.0059 | 0.1121 ± 0.0066 | 0.0869 ± 0.0026 |
| 5 | 3 | 0.0671 ± 0.0156 | 0.1091 ± 0.0097 | 0.0920 ± 0.0054 |

Table 5: (k, c)(≤)-SETGNN* performance on QM9 for varying (k, c). Test MAE is reported at the lowest validation MAE.

| k | c | Train loss | Val. MAE | Test MAE |
|---|---|---|---|---|
| 2 | 1 | 0.0376 ± 0.0005 | 0.0387 ± 0.0007 | 0.0389 ± 0.0008 |
| 3 | 1 | 0.0308 ± 0.0010 | 0.0386 ± 0.0017 | 0.0379 ± 0.0010 |
| 4 | 1 | 0.0338 ± 0.0003 | 0.0371 ± 0.0005 | 0.0370 ± 0.0006 |
| 5 | 1 | 0.0299 ± 0.0017 | 0.0343 ± 0.0008 | 0.0341 ± 0.0009 |
| 6 | 1 | 0.0226 ± 0.0004 | 0.0296 ± 0.0007 | 0.0293 ± 0.0007 |
| 7 | 1 | 0.0208 ± 0.0005 | 0.0289 ± 0.0007 | 0.0269 ± 0.0003 |
| 2 | 2 | 0.0367 ± 0.0007 | 0.0398 ± 0.0004 | 0.0398 ± 0.0004 |
| 3 | 2 | 0.0282 ± 0.0013 | 0.0358 ± 0.0009 | 0.0356 ± 0.0007 |
| 4 | 2 | 0.0219 ± 0.0004 | 0.0280 ± 0.0008 | 0.0278 ± 0.0008 |
| 5 | 2 | 0.0175 ± 0.0003 | 0.0267 ± 0.0005 | 0.0251 ± 0.0006 |
| 3 | 3 | 0.0391 ± 0.0107 | 0.0428 ± 0.0057 | 0.0425 ± 0.0052 |
| 4 | 3 | 0.0219 ± 0.0011 | 0.0301 ± 0.0010 | 0.0286 ± 0.0004 |
| GINE (L=4) | | 0.0507 ± 0.0014 | 0.0478 ± 0.0003 | 0.0479 ± 0.0004 |
| GINE (L=6) | | 0.0440 ± 0.0009 | 0.0440 ± 0.0009 | 0.0451 ± 0.0009 |

### 5.3 Computational requirements

Figure 2: (k, c)(≤)-SETGNN*'s footprint scales practically with both k and c in terms of memory (a) and running time (b) on ZINC-12K.

We next investigate how increasing k and c changes the computational footprint of (k, c)(≤)-SETGNN* in practice. Fig. 2 shows that increasing these parameters expectedly increases both memory consumption (in MB, in (a)) and runtime (in seconds per epoch, in (b)). Notably, since larger k and c increase our model's expressivity, we observe that suitable choices for k and c allow us to practically realize these increasingly expressive models on commodity hardware. With conservative values of c (e.g., c=1), we are able to pass messages between sets of up to k (e.g., 10) nodes, far larger than k-WL-style higher-order models can achieve (k ≤ 3).

## 6 Conclusion

Our work is motivated by the impracticality of higher-order GNN models based on the k-WL hierarchy, which makes it challenging to study how much expressiveness real-world tasks truly necessitate. To this end, we proposed (k, c)(≤)-SETWL, a more practical and progressively-expressive hierarchy with theoretical connections to k-WL and drastically lower complexity. We also designed and implemented a practical model, (k, c)(≤)-SETGNN(*), whose expressiveness is gradually increased by larger k and c. Our model achieves strong performance, including several new best results on graph-level tasks like ZINC-12K and expressiveness-relevant tasks like substructure counting, while being practically trainable on commodity hardware.

## References

[1] R. Abboud, I. I. Ceylan, M. Grohe, and T. Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), 2021.
[2] U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=i80OPhOCVH2.

[3] V. Arvind, F. Fuhlbrück, J. Köbler, and O. Verbitsky. On Weisfeiler-Leman invariance: Subgraph counts and related graph properties. Journal of Computer and System Sciences, 113:42–59, 2020.

[4] W. Azizian and M. Lelarge. Expressive power of invariant and equivariant graph neural networks. arXiv preprint arXiv:2006.15646, 2020.

[5] L. Babai, P. Erdos, and S. M. Selkow. Random graph isomorphism. SIAM Journal on Computing, 9(3):628–635, 1980.

[6] M. Balcilar, P. Héroux, B. Gauzere, P. Vasseur, S. Adam, and P. Honeine. Breaking the limits of message passing graph neural networks. In International Conference on Machine Learning, pages 599–608. PMLR, 2021.

[7] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[8] B. Bevilacqua, F. Frasca, D. Lim, B. Srinivasan, C. Cai, G. Balamurugan, M. M. Bronstein, and H. Maron. Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910, 2021.

[9] C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler and Lehman go cellular: CW networks. Advances in Neural Information Processing Systems, 34:2625–2640, 2021.

[10] G. Bouritsas, F. Frasca, S. P. Zafeiriou, and M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[11] J.-Y. Cai, M. Fürer, and N. Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.

[12] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3438–3445, 2020.

[13] L. Chen, Z. Chen, and J. Bruna. On graph neural networks versus graph-augmented MLPs. arXiv preprint arXiv:2010.15116, 2020.

[14] Z. Chen, L. Li, and J. Bruna. Supervised community detection with line graph neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1g0Z3A9Fm.

[15] Z. Chen, L. Chen, S. Villar, and J. Bruna. Can graph neural networks count substructures? Advances in Neural Information Processing Systems, 33:10383–10395, 2020.

[16] K. Choudhary and B. DeCost. Atomistic line graph neural network for improved materials property predictions. npj Computational Materials, 7(1):1–8, 2021.

[17] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković. Principal neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Systems, volume 33, pages 13260–13271, 2020.

[18] M. H. Dupty, Y. Dong, and W. S. Lee. PF-GNN: Differentiable particle filtering based approximation of universal graph representations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=oh4TirnfSem.

[19] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.

[20] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric.
In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

[21] F. Frasca, B. Bevilacqua, M. M. Bronstein, and H. Maron. Understanding and extending subgraph GNNs by rethinking their symmetries. arXiv preprint arXiv:2206.11140, 2022.

[22] V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pages 3419–3430. PMLR, 2020.

[23] F. Geerts. The expressive power of kth-order invariant graph networks. arXiv preprint arXiv:2007.12035, 2020.

[24] F. Geerts and J. L. Reutter. Expressiveness and approximation properties of graph neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=wIzUeM3TAU.

[25] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.

[26] M. Grohe. The logic of graph neural networks. In 2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pages 1–17. IEEE, 2021.

[27] M. Grohe. Multiset WL. Personal communication, 2022.

[28] M. Grohe and M. Otto. Pebble games and linear equations. The Journal of Symbolic Logic, 80(3):797–844, 2015.

[29] D. M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004.

[30] L. Hella. Logical hierarchies in PTIME. Information and Computation, 129(1):1–19, 1996.

[31] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[32] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJlWWJSFDH.

[33] N. Keriven and G. Peyré. Universal invariant and equivariant graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.

[34] J. Kim, T. D. Nguyen, S. Min, S. Cho, M. Lee, H. Lee, and S. Hong. Pure transformers are powerful graph learners. arXiv, abs/2207.02505, 2022. URL https://arxiv.org/abs/2207.02505.

[35] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[36] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[37] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997, 2018.

[38] A. Loukas. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l2bp4YwS.

[39] Y. Ma, X. Liu, T. Zhao, Y. Liu, J. Tang, and N. Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.

[40] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. Advances in Neural Information Processing Systems, 32, 2019.

[41] H. Maron, E. Fetaya, N. Segol, and Y. Lipman. On the universality of invariant networks. In International Conference on Machine Learning, pages 4363–4371. PMLR, 2019.

[42] C. Morris, K. Kersting, and P. Mutzel.
Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In 2017 IEEE International Conference on Data Mining (ICDM), pages 327–336. IEEE, 2017.

[43] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4602–4609, 2019.

[44] C. Morris, G. Rattan, and P. Mutzel. Weisfeiler and Leman go sparse: Towards scalable higher-order graph embeddings. Advances in Neural Information Processing Systems, 33:21824–21840, 2020.

[45] C. Morris, G. Rattan, S. Kiefer, and S. Ravanbakhsh. SpeqNets: Sparsity-aware permutation-equivariant graph networks. arXiv preprint arXiv:2203.13913, 2022.

[46] R. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro. Relational pooling for graph representations. In International Conference on Machine Learning, pages 4663–4673. PMLR, 2019.

[47] K. Oono and T. Suzuki. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947, 2019.

[48] C. Qian, G. Rattan, F. Geerts, C. Morris, and M. Niepert. Ordered subgraph aggregation networks. arXiv preprint arXiv:2206.11168, 2022.

[49] R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of Graph Theory, 1(4):339–363, 1977.

[50] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti. SIGN: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.

[51] A. Sankar, Y. Liu, J. Yu, and N. Shah. Graph neural networks for friend ranking in large-scale social platforms. In Proceedings of the Web Conference 2021, pages 2535–2546, 2021.

[52] R. Sato. A survey on the expressive power of graph neural networks. CoRR, abs/2003.04078, 2020. URL http://dblp.uni-trier.de/db/journals/corr/corr2003.html#abs-2003-04078.

[53] R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 333–341. SIAM, 2021.

[54] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011.

[55] J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522, 2021.

[56] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[57] B. Weisfeiler. On construction and identification of graphs. In Lecture Notes in Mathematics. Citeseer, 1976.

[58] S. Wu, F. Sun, W. Zhang, X. Xie, and B. Cui. Graph neural networks in recommender systems: a survey. ACM Computing Surveys (CSUR), 2020.

[59] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

[60] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.

[61] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations, ICLR '19, pages 1–17, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.

[62] J. You, R. Ying, and J.
Leskovec. Position-aware graph neural networks. In International Conference on Machine Learning, pages 7134–7143. PMLR, 2019.

[63] J. You, J. Gomes-Selman, R. Ying, and J. Leskovec. Identity-aware graph neural networks. arXiv preprint arXiv:2101.10320, 2021.

[64] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.

[65] M. Zhang and P. Li. Nested graph neural networks. Advances in Neural Information Processing Systems, 34:15734–15747, 2021.

[66] M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin. Revisiting graph neural networks for link prediction. 2020.

[67] L. Zhao and L. Akoglu. PairNorm: Tackling oversmoothing in GNNs. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkecl1rtwB.

[68] L. Zhao, W. Jin, L. Akoglu, and N. Shah. From stars to subgraphs: Uplifting any GNN with local structure awareness. ICLR, 2022.