# A Practical, Progressively-Expressive GNN

Lingxiao Zhao (Carnegie Mellon University) lingxiao@cmu.edu · Louis Härtel (RWTH Aachen University) haertel@informatik.rwth-aachen.de · Neil Shah (Snap Inc.) nshah@snap.com · Leman Akoglu (Carnegie Mellon University) lakoglu@andrew.cmu.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Message passing neural networks (MPNNs) have become a dominant flavor of graph neural networks (GNNs) in recent years. Yet, MPNNs come with notable limitations; namely, they are at most as powerful as the 1-dimensional Weisfeiler-Leman (1-WL) test in distinguishing graphs in a graph isomorphism testing framework. To this end, researchers have drawn inspiration from the k-WL hierarchy to develop more expressive GNNs. However, current k-WL-equivalent GNNs are not practical even for small values of k, as k-WL becomes combinatorially more complex as k grows. At the same time, several works have found great empirical success in graph learning tasks without highly expressive models, implying that chasing expressiveness with a coarse-grained ruler like k-WL is often unnecessary for practical tasks. To truly understand the expressiveness-complexity tradeoff, one desires a more fine-grained ruler that can increase expressiveness more gradually. Our work puts forth such a proposal: namely, we first propose the (k, c)(≤)-SETWL hierarchy, with greatly reduced complexity compared to k-WL, achieved by moving from k-tuples of nodes to sets of at most k nodes whose induced subgraph has at most c connected components. We show favorable theoretical results for this hierarchy in relation to k-WL, and concretize it via (k, c)(≤)-SETGNN, which is as expressive as (k, c)(≤)-SETWL. Our model is practical and progressively-expressive, increasing in power with k and c. We demonstrate its effectiveness on several benchmark datasets, achieving several state-of-the-art results with runtime and memory usage applicable to practical graphs. We open source our implementation at https://github.com/LingxiaoShawn/KCSetGNN.

## 1 Introduction

In recent years, graph neural networks (GNNs) have gained considerable attention [58, 60] for their ability to tackle various node-level [35, 56], link-level [51, 66], and graph-level [9, 40] learning tasks, given their ability to learn rich representations for complex graph-structured data. The common template for designing GNNs follows the message passing paradigm; these so-called message-passing neural networks (MPNNs) are built by stacking layers which encompass feature transformation and aggregation operations over the input graph [25, 39]. Despite their advantages, MPNNs have several limitations, including oversmoothing [12, 47, 67], oversquashing [2, 55], inability to distinguish node identities [63] and positions [62], and limited expressive power [61]. Since Xu et al.'s [61] seminal work showed that MPNNs are at most as powerful as the first-order Weisfeiler-Leman (1-WL) test in the graph isomorphism (GI) testing framework, there have been several follow-up works on improving the understanding of GNN expressiveness [3, 15].
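To make the 1-WL ceiling concrete before moving on, the following small sketch (ours, not from the paper; it assumes networkx is installed) shows two non-isomorphic graphs that 1-WL, and hence any standard MPNN, cannot tell apart: two disjoint triangles versus a single 6-cycle, both 2-regular.

```python
# Minimal illustration of the 1-WL limitation discussed above (assumes networkx >= 2.6).
# Two disjoint triangles and one 6-cycle are non-isomorphic, yet both are 2-regular,
# so 1-WL (color refinement) assigns them identical color histograms.
import networkx as nx

g1 = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))  # two triangles
g2 = nx.cycle_graph(6)                                        # one hexagon

h1 = nx.weisfeiler_lehman_graph_hash(g1, iterations=3)
h2 = nx.weisfeiler_lehman_graph_hash(g2, iterations=3)

print(h1 == h2)                      # True: 1-WL cannot tell them apart
print(nx.is_isomorphic(g1, g2))      # False: they are not isomorphic
```

Any MPNN whose layer-wise aggregation is no more powerful than 1-WL produces identical graph embeddings for these two inputs when node features are uniform.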
Figure 1: Main steps of (k, c)(≤)-SETGNN. Given a graph and (k, c), we build the (k, c)-bipartite super-graph (middle) containing sets with up to k nodes and up to c connected components in their induced subgraphs, on which a base GNN assigns initial colors. Bidirectional bipartite GNN layers with forward-backward message passing learn set embeddings, which are pooled into a graph embedding. The size of the super-graph, and accordingly its expressiveness, grows progressively with increasing k and c. The 2-bipartite message passing generalizes normal GNNs, edge GNNs, and line graphs (see Appendix A.12).

In response, the community proposed many GNN models to overcome such limitations [4, 52, 68]. Several of these aim to design powerful higher-order GNNs which are increasingly expressive [9, 33, 41, 43], drawing inspiration from the k-WL hierarchy [54]. A key limitation of such higher-order models that reach beyond 1-WL is their poor scalability; in fact, these models can only be applied to small graphs with small k in practice [40, 41, 43] due to combinatorial growth in complexity. On the other hand, it remains an open question whether practical graph learning tasks indeed need such complex and extremely expressive GNN models. Historically, Babai et al. [5] showed that almost all graphs on n nodes can be distinguished by 1-WL. In other contexts like node classification, researchers have encountered superior generalization performance with graph-augmented multi-layer perceptron (GA-MLP) models [37, 50] compared to MPNNs, despite the former being strictly less expressive than the latter [13]. Considering that increased model complexity has negative implications in terms of overfitting and generalization [29], it is worth re-evaluating continued efforts to pursue maximally expressive GNNs in practice. Ideally, one could study these models' generalization performance on various real-world tasks by increasing k (expressiveness) in the k-WL hierarchy. Yet, given the impracticality of this too-coarse ruler of expressiveness, which becomes infeasible beyond k>3, one desires a more fine-grained ruler, that is, a new hierarchy whose expressiveness grows more gradually. Such a hierarchy could enable us to gradually build more expressive models which admit improved scaling, and avoid unnecessary leaps in model complexity for tasks which do not require them, guarding against overfitting.

Present Work. In this paper, we propose such a hierarchy, and an associated practical, progressively-expressive GNN model, called (k, c)(≤)-SETGNN, whose expressiveness can be modulated through k and c. In a nutshell, we take inspiration from k-WL, yet achieve practicality through three design ideas which simplify key bottlenecks of scaling k-WL. First, we move away from the k-tuples of the original k-WL to k-multisets (unordered), and then to k(≤)-sets. We demonstrate that these steps drastically reduce the number of nodes in the k-WL graph while retaining significant expressiveness (Secs. 3.2-3.3). Second, by considering the underlying sparsity of an input graph, we reduce scope to (k, c)(≤)-sets, which consist of at most k nodes whose induced subgraph comprises at most c ≤ k connected components (a toy enumeration of such sets is sketched at the end of this introduction). This also yields a massive reduction in the number of nodes while improving practicality; i.e., small values of c on sparse graphs allow one to increase k well beyond 3 in practice (Sec. 3.4).
Third, we design a super-graph architecture that consists of a sequence of k-1 bipartite graphs over which we learn embeddings for our (k, c)(≤)-sets, using bidirectional message passing. These embeddings can be pooled to yield a final graph embedding (Secs. 4.1-4.2). We also speed up initializing colors for the (k, c)(≤)-sets, for c > 1, substantially reducing computation and memory (Sec. 4.3). Fig. 1 overviews our proposed framework.

Experimentally, our (k, c)(≤)-SETGNN outperforms existing state-of-the-art GNNs on simulated expressiveness tasks as well as real-world graph-level tasks, achieving new best results on substructure counting and ZINC-12K, respectively. We show that generalization performance reflects the increasing expressiveness with k and c. Our proposed scalable designs allow us to train models with, e.g., k=10 or c=3 with practical running time and memory requirements.
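As a concrete but hedged illustration of the central object introduced above, the brute-force sketch below (ours, for intuition only; the released implementation linked in the abstract constructs these objects far more efficiently) enumerates all (k, c)(≤)-sets of a small graph and links each m-set to the (m+1)-sets containing it, mirroring the bipartite super-graph structure of Figure 1. The containment-based super-edges and the helper names are our assumptions for illustration.

```python
# A minimal sketch (assuming networkx; not the paper's optimized implementation) of the
# (k, c)(<=)-set construction: enumerate all node sets with at most k nodes whose induced
# subgraph has at most c connected components, and link each m-set to the (m+1)-sets that
# contain it, giving the sequence of k-1 bipartite blocks. Enumeration here is brute force
# over all combinations, so it is only suitable for tiny graphs.
from itertools import combinations
import networkx as nx

def kc_sets(G, k, c):
    """Return {m: list of m-sets (as frozensets)} for m = 1..k, keeping only sets
    whose induced subgraph has at most c connected components."""
    sets_by_size = {m: [] for m in range(1, k + 1)}
    for m in range(1, k + 1):
        for nodes in combinations(G.nodes, m):
            if nx.number_connected_components(G.subgraph(nodes)) <= c:
                sets_by_size[m].append(frozenset(nodes))
    return sets_by_size

def bipartite_blocks(sets_by_size, k):
    """Super-edges between consecutive set sizes: an m-set is linked to every
    (m+1)-set that contains it (containment is our assumed super-edge relation)."""
    blocks = {}
    for m in range(1, k):
        smaller = set(sets_by_size[m])
        blocks[m] = [(s, t) for t in sets_by_size[m + 1]
                     for s in map(frozenset, combinations(t, m)) if s in smaller]
    return blocks

G = nx.cycle_graph(6)                                # toy input graph
supernodes = kc_sets(G, k=3, c=2)
superedges = bipartite_blocks(supernodes, k=3)
print({m: len(v) for m, v in supernodes.items()})    # supernode counts per set size
print({m: len(v) for m, v in superedges.items()})    # super-edge counts per bipartite block
```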
## 2 Related Work

MPNN limitations. Given the understanding that MPNNs have expressiveness upper-bounded by 1-WL [61], several researchers investigated what else MPNNs cannot learn. To this end, Chen et al. [15] showed that MPNNs cannot count induced connected substructures of 3 or more nodes, and, together with [15], Arvind et al. [3] showed that MPNNs can only count star-shaped patterns. Loukas [38] further proved several results regarding decision problems on graphs (e.g., subgraph verification, cycle detection), finding that MPNNs cannot solve these problems unless strict conditions on depth and width are met. Moreover, Garg et al. [22] showed that many standard MPNNs cannot compute properties such as longest or shortest cycles, diameters, or the existence of cliques.

Improving expressivity. Several works aim to address the expressiveness limitations of MPNNs. One approach is to inject features into the MPNN aggregation, motivated by Loukas [38], who showed that MPNNs can be universal approximators when nodes are sufficiently distinguishable. Sato et al. [53] show that injecting random features better enables MPNNs to solve problems like minimum dominating set and maximum matching. You et al. [63] inject cycle counts as node features, motivated by the limitation of MPNNs not being able to count cycles. Others proposed utilizing subgraph isomorphism counting to empower MPNNs with substructure information they cannot learn on their own [9, 10]. Earlier work by You et al. [62] adds positional features to distinguish nodes which naïve MPNN embeddings would not. Recently, several works also propose utilizing subgraph aggregation; Zhang et al. [65], Zhao et al. [68], and Bevilacqua et al. [8] propose subgraph variants of the WL test which are no less powerful than 3-WL. A recent follow-up work [21] shows that rooted-subgraph-based extensions with 1-WL as the kernel are bounded by 3-WL. The community continues to explore several distinct avenues for overcoming these limitations: Murphy et al. [46] propose a relational pooling mechanism which sums over all permutations of a permutation-sensitive function to achieve above-1-WL expressiveness. Balcilar et al. [6] generalize spatial and spectral MPNNs and show that instances of spectral MPNNs are more powerful than 1-WL. Azizian et al. [4] and Geerts et al. [24] unify expressivity and approximation-ability results for existing GNNs.

k-WL-inspired GNNs. The k-WL test captures higher-order interactions in graph data by considering all k-tuples, i.e., size-k ordered multisets defined over the set of nodes. While highly expressive, it does not scale to practical graphs beyond a very small k (k = 3 pushes the practical limit even for small graphs). However, several works propose designs which can achieve k-WL expressiveness in theory and strive to make them practical. Maron et al. [41] propose a general class of invariant graph networks, k-IGNs, with exact k-WL expressivity [23], which are however not scalable. Morris et al. have several works on k-WL and its variants, such as k-LWL [42], k-GNN [43], and δ-k-WL-GNN [44]. Both k-LWL and k-GNN use a variant of k-WL that only considers k-sets; they are strictly weaker than k-WL but much more practical. In our paper we argue that k(≤)-sets should be used with additional designs to retain as much expressivity as possible while remaining practical. The δ-k-WL-GNN works with k-tuples but sparsifies the connections among tuples by considering locality. Qian et al. [48] extend the k-tuple formulation to subgraph networks but face the same scalability bottleneck as k-WL. A recent concurrent work, SpeqNet [45], proposes to reduce the number of tuples by restricting the number of connected components of the tuples' induced subgraphs; this idea is independently explored in our paper in Sec. 3.4. None of the four variants proposed by Morris et al. has realizations beyond k = 3 in their experiments, while we manage to reach k = 10. Interestingly, a recent work [34] links graph transformers with k-IGNs, with better (or equal) expressivity, yet it is still not scalable.

## 3 A practical progressively-expressive isomorphism test: (k, c)(≤)-SETWL

We first motivate our method (k, c)(≤)-SETGNN from the GI testing perspective by introducing (k, c)(≤)-SETWL. We first introduce notation and background on k-WL (Sec. 3.1). Next, we show how to increase the practicality of k-WL without reducing much of its expressivity, by removing node ordering (Sec. 3.2) and node repetitions (Sec. 3.3), and by leveraging graph sparsity (Sec. 3.4). We then give the complexity analysis (Sec. 3.5). We close the section by extending the idea to k-FWL, which is as expressive as (k+1)-WL (Sec. 3.6). All proofs are included in Appendix A.4.
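Before the formal treatment, a quick back-of-the-envelope comparison (ours, not from the paper) shows why moving from ordered k-tuples to k(≤)-sets, as done in Secs. 3.2-3.3, already shrinks the number of objects to color; the additional (k, c) restriction of Sec. 3.4 reduces this further on sparse graphs, by an amount that depends on the input graph.

```python
# A small side calculation (not from the paper) comparing how many objects k-WL-style
# methods track: ordered k-tuples (n^k), k-multisets (C(n+k-1, k)), and the k(<=)-sets
# used here (sum of C(n, i) for i <= k). The (k, c) restriction on connected components
# shrinks the last count further on sparse graphs, but that depends on the input graph.
from math import comb

def num_tuples(n, k):
    return n ** k

def num_multisets(n, k):
    return comb(n + k - 1, k)

def num_sets_le_k(n, k):
    return sum(comb(n, i) for i in range(1, k + 1))

n, k = 100, 4
print(num_tuples(n, k))      # 100_000_000 ordered 4-tuples
print(num_multisets(n, k))   #   4_421_275 4-multisets
print(num_sets_le_k(n, k))   #   4_087_975 sets with at most 4 nodes
```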
Notation: Let $G = (V(G), E(G), l_G)$ be an undirected, colored graph with nodes $V(G)$, edges $E(G)$, and a color-labeling function $l_G : V(G) \to C$, where $C = \{c_1, ..., c_d\}$ denotes a set of $d$ distinct colors. Let $[n] = \{1, 2, 3, ..., n\}$. Let $(\cdot)$ denote a tuple (ordered multiset), $\{\{\cdot\}\}$ denote a multiset (a set which allows repetition), and $\{\cdot\}$ denote a set. We define $\vec{v} = (v_1, ..., v_k)$ as a k-tuple, $\tilde{v} = \{\{v_1, ..., v_k\}\}$ as a k-multiset, and $\hat{v} = \{v_1, ..., v_k\}$ as a k-set. Let $\vec{v}[x/i] = (v_1, ..., v_{i-1}, x, v_{i+1}, ..., v_k)$ denote replacing the i-th element in $\vec{v}$ with $x$, and analogously for $\tilde{v}[x/i]$ and $\hat{v}[x/i]$ (multisets and sets are assumed to be represented with a canonical ordering). When $v_i \in V(G)$, let $G[\vec{v}]$, $G[\tilde{v}]$, $G[\hat{v}]$ denote the subgraph of $G$ induced by the nodes in $\vec{v}$, $\tilde{v}$, $\hat{v}$, respectively (keeping repetitions). An isomorphism between two graphs $G$ and $G'$ is a bijective mapping $p : V(G) \to V(G')$ which satisfies $\forall u, v \in V(G),\ (u, v) \in E(G) \iff (p(u), p(v)) \in E(G')$ and $\forall u \in V(G),\ l_G(u) = l_{G'}(p(u))$. Two graphs $G, G'$ are isomorphic if there exists an isomorphism between them, which we denote as $G \cong G'$.

### 3.1 Preliminaries: the k-Weisfeiler-Leman (k-WL) Graph Isomorphism Test

The 1-dimensional Weisfeiler-Leman (1-WL) test, also known as the color refinement algorithm [49], is a widely celebrated approach to (approximately) test GI. Although extremely fast and effective for most graphs (1-WL can provide canonical forms for all but an $n^{-1/7}$ fraction of n-vertex graphs [5]), it fails to distinguish members of many important graph families, such as regular graphs. The more powerful k-WL algorithm first appeared in [57], and extends the coloring of vertices (or 1-tuples) to that of k-tuples. k-WL is progressively expressive with increasing k, and can distinguish any finite set of graphs given a sufficiently large k. k-WL has many interesting connections to logic, games, and linear equations [11, 28]. Another variant of k-WL is called k-FWL (Folklore WL), such that (k+1)-WL is equally expressive as k-FWL [11, 26]. Our work focuses on k-WL.

k-WL iteratively recolors all $n^k$ k-tuples defined on a graph $G$ with $n$ nodes. At iteration 0, each k-tuple $\vec{v} = (v_1, ..., v_k) \in V(G)^k$ is initialized with a color given by its atomic type $\mathrm{at}_k(G, \vec{v})$. Assuming $G$ has $d$ colors, $\mathrm{at}_k(G, \vec{v}) \in \{0, 1\}^{2\binom{k}{2}+kd}$ is an ordered vector encoding: the first $2\binom{k}{2}$ entries indicate, for every pair $1 \le i < j \le k$, whether $v_i = v_j$ and whether $(v_i, v_j) \in E(G)$, and the remaining $kd$ entries one-hot encode the color of each $v_i$.

(1) When $c_1 > c_2$, (k, $c_1$)(≤)-SETGNN is more expressive than (k, $c_2$)(≤)-SETGNN, and (2) when $k_1 > k_2$, ($k_1$, c)(≤)-SETGNN is more expressive than ($k_2$, c)(≤)-SETGNN.

### 4.2 Bidirectional Sequential Message Passing

The t-th layer of (k, c)(≤)-SETGNN (Eq. (10) and Eq. (11)) essentially propagates information back and forth on the super-graph $S_{k,c\text{-}swl}(G)$, which is a sequence of k-1 bipartite graphs (see the middle of Figure 1), in parallel for all supernodes. We propose to change this to bidirectional blockwise sequential message passing, which we call (k, c)(≤)-SETGNN*, defined as follows: for $m = k-1$ down to $1$ and every $m$-set $\hat{v}$,
$$h^{(t+\frac{1}{2})}(\hat{v}) = \mathrm{MLP}^{(t)}_{m,1}\Big(h^{(t)}(\hat{v}),\ \sum_{\hat{u} \in N_{m+1}(\hat{v})} h^{(t+\frac{1}{2})}(\hat{u})\Big),$$
and for $m = 2$ up to $k$ and every $m$-set $\hat{v}$,
$$h^{(t+1)}(\hat{v}) = \mathrm{MLP}^{(t+\frac{1}{2})}_{m,2}\Big(h^{(t+\frac{1}{2})}(\hat{v}),\ \sum_{\hat{u} \in N_{m-1}(\hat{v})} h^{(t+1)}(\hat{u})\Big),$$
where $N_{m+1}(\hat{v})$ and $N_{m-1}(\hat{v})$ denote $\hat{v}$'s neighbors among the $(m+1)$-sets and $(m-1)$-sets of the super-graph, respectively (with the convention $h^{(t+\frac{1}{2})} := h^{(t)}$ for k-sets and $h^{(t+1)} := h^{(t+\frac{1}{2})}$ for 1-sets).

Notice that (k, c)(≤)-SETGNN* has lower memory usage: (k, c)(≤)-SETGNN loads the complete super-graph (k-1 bipartite blocks), while (k, c)(≤)-SETGNN* loads one of the k-1 bipartite blocks at a time, which is beneficial for GPUs with limited memory. What is more, for a small, finite t it is even more expressive than (k, c)(≤)-SETGNN. We provide implementations of both parallel and sequential message passing in the official GitHub repository, while we only report the performance of (k, c)(≤)-SETGNN* given its efficiency and better expressivity.

Theorem 7. For any $t \in \mathbb{N}$, the t-layer (k, c)(≤)-SETGNN* is more expressive than the t-layer (k, c)(≤)-SETGNN. As $t \to \infty$, (k, c)(≤)-SETGNN is as expressive as (k, c)(≤)-SETGNN*.

### 4.3 Improving Supernode Initialization

Next we describe how to improve the supernode initialization (Eq. (9)) and greatly reduce computational and memory overhead for c > 1 without losing any expressivity. We achieve this via the fact that a graph with c components can be viewed as a set of c connected components. Formally,

Theorem 8. Let $G$ be a graph with $c$ connected components $C_1, ..., C_c$, and let $G'$ be a graph also with $c$ connected components $C'_1, ..., C'_c$. Then $G$ and $G'$ are isomorphic if and only if $\exists p : [c] \to [c]$ such that $\forall i \in [c]$, $C_i$ and $C'_{p(i)}$ are isomorphic.

The theorem implies that we only need to apply Eq. (9) to supernodes with a single component; for any supernode $\hat{v}$ whose induced subgraph has components $\hat{u}_1, ..., \hat{u}_c$, we can obtain its encoding as $h^{(0)}(\hat{v}) = \mathrm{DeepSet}(\{h^{(0)}(\hat{u}_1), ..., h^{(0)}(\hat{u}_c)\})$ without passing its induced subgraph to the base GNN. This eliminates the heavy computation of encoding a large number of induced subgraphs. We give the algorithm for building the connections between $\hat{v}$ and its single components in Appendix A.6.
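To make Secs. 4.2-4.3 concrete, here is a schematic PyTorch sketch of one bidirectional blockwise sequential layer. It is our simplification under assumed data layouts (per-size embedding tensors and per-block edge indices), not the released implementation, and it omits details such as the supernode initialization of Sec. 4.3 and final pooling.

```python
# A schematic PyTorch sketch (our own simplification, not the released implementation) of the
# bidirectional blockwise sequential message passing of Sec. 4.2. `edge_index[m]` links
# m-sets to (m+1)-sets (row 0: m-set id, row 1: (m+1)-set id); `h[m]` holds m-set embeddings.
import torch
import torch.nn as nn

class BidirectionalSetLayer(nn.Module):
    def __init__(self, k, dim):
        super().__init__()
        mlp = lambda: nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.k = k
        self.backward_mlps = nn.ModuleDict({str(m): mlp() for m in range(1, k)})      # m = k-1..1
        self.forward_mlps  = nn.ModuleDict({str(m): mlp() for m in range(2, k + 1)})  # m = 2..k

    def forward(self, h, edge_index):
        # h: {m: [N_m, dim]}, edge_index: {m: LongTensor [2, E_m]} for m = 1..k-1
        half = dict(h)
        for m in range(self.k - 1, 0, -1):                      # backward: larger -> smaller sets
            src, dst = edge_index[m][1], edge_index[m][0]       # from (m+1)-sets to m-sets
            agg = torch.zeros_like(h[m]).index_add_(0, dst, half[m + 1][src])
            half[m] = self.backward_mlps[str(m)](torch.cat([h[m], agg], dim=-1))
        out = dict(half)
        for m in range(2, self.k + 1):                          # forward: smaller -> larger sets
            src, dst = edge_index[m - 1][0], edge_index[m - 1][1]
            agg = torch.zeros_like(half[m]).index_add_(0, dst, out[m - 1][src])
            out[m] = self.forward_mlps[str(m)](torch.cat([half[m], agg], dim=-1))
        return out

# Toy usage: k = 3, random supernode counts and bipartite connections.
k, dim = 3, 16
counts = {1: 6, 2: 9, 3: 7}
h = {m: torch.randn(counts[m], dim) for m in range(1, k + 1)}
edge_index = {m: torch.stack([torch.randint(0, counts[m], (20,)),
                              torch.randint(0, counts[m + 1], (20,))]) for m in range(1, k)}
layer = BidirectionalSetLayer(k, dim)
out = layer(h, edge_index)            # a graph embedding could then pool, e.g., out[1].sum(0)
```

The backward loop updates m-sets from their (m+1)-set neighbors in decreasing order of m, and the forward loop then updates them from their (m-1)-set neighbors in increasing order, matching the sequential scheme that lets only one bipartite block reside on the GPU at a time.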
Table 1: Simulation data performance. Columns Triangle through 4-Cycle report substructure-counting MAE; columns IsConnected through Radius report graph-property log10(MSE). For (k, c)(≤)-SETGNN*, the (k, c) values that achieve the reported performance are given in the last row. (ACC: accuracy, MA[S]E: mean abs.[sq.] error)

| Method | EXP (ACC) | SR25 (ACC) | Triangle | Tailed Tri. | Star | 4-Cycle | IsConnected | Diameter | Radius |
|---|---|---|---|---|---|---|---|---|---|
| GCN | 50% | 6.67% | 0.4186 | 0.3248 | 0.1798 | 0.2822 | -1.7057 | -2.4705 | -3.9316 |
| GIN | 50% | 6.67% | 0.3569 | 0.2373 | 0.0224 | 0.2185 | -1.9239 | -3.3079 | -4.7584 |
| PNA | 50% | 6.67% | 0.3532 | 0.2648 | 0.1278 | 0.2430 | -1.9395 | -3.4382 | -4.9470 |
| PPGN | 100% | 6.67% | 0.0089 | 0.0096 | 0.0148 | 0.0090 | -1.9804 | -3.6147 | -5.0878 |
| GIN-AK+ | 100% | 6.67% | 0.0123 | 0.0112 | 0.0150 | 0.0126 | -2.7513 | -3.9687 | -5.1846 |
| PNA*-AK+ | 100% | 6.67% | 0.0118 | 0.0138 | 0.0166 | 0.0132 | -2.6189 | -3.9011 | -5.2026 |
| (k, c)(≤)-SETGNN* | 100% | 100% | 0.0073 | 0.0075 | 0.0134 | 0.0075 | -5.4667 | -4.0800 | -5.1603 |
| (k, c) used | (≥3, ≥2) | (≥4, ≥1) | (3, 2) | (4, 1) | (3, 2) | (4, 1) | (4, 2) | (4, 1) | (2, 2) |

Table 2: Train and test MAE on the substructure counting tasks for varying k and c. Notice the orders-of-magnitude drop in Test MAE per task as k grows.

| k | c | Triangle Train | Triangle Test | Tailed Tri. Train | Tailed Tri. Test | Star Train | Star Test | 4-Cycle Train | 4-Cycle Test |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 0.9941 ± 0.2623 | 1.1409 ± 0.1224 | 1.1506 ± 0.2542 | 0.8695 ± 0.0781 | 1.5348 ± 2.0697 | 2.3454 ± 0.8198 | 1.2159 ± 0.0292 | 0.8361 ± 0.1171 |
| 3 | 1 | 0.0311 ± 0.0025 | 0.0088 ± 0.0001 | 0.0303 ± 0.0108 | 0.0085 ± 0.0018 | 0.0559 ± 0.0019 | 0.0151 ± 0.0006 | 0.1351 ± 0.0058 | 0.1893 ± 0.0030 |
| 4 | 1 | 0.0321 ± 0.0008 | 0.0151 ± 0.0074 | 0.0307 ± 0.0085 | 0.0075 ± 0.0012 | 0.0687 ± 0.0104 | 0.0339 ± 0.0009 | 0.0349 ± 0.0007 | 0.0075 ± 0.0002 |
| 5 | 1 | 0.0302 ± 0.0070 | 0.0208 ± 0.0042 | 0.0553 ± 0.0009 | 0.0189 ± 0.0024 | 0.0565 ± 0.0078 | 0.0263 ± 0.0023 | 0.0377 ± 0.0057 | 0.0175 ± 0.0036 |
| 6 | 1 | 0.0344 ± 0.0024 | 0.0247 ± 0.0085 | 0.0357 ± 0.0017 | 0.0171 ± 0.0000 | 0.0560 ± 0.0000 | 0.0168 ± 0.0022 | 0.0356 ± 0.0014 | 0.0163 ± 0.0064 |
| 2 | 2 | 0.3452 ± 0.0329 | 0.4029 ± 0.0053 | 0.2723 ± 0.0157 | 0.2898 ± 0.0055 | 0.0466 ± 0.0025 | 0.0242 ± 0.0006 | 0.2369 ± 0.0123 | 0.2512 ± 0.0029 |
| 3 | 2 | 0.0234 ± 0.0030 | 0.0073 ± 0.0009 | 0.0296 ± 0.0074 | 0.0100 ± 0.0009 | 0.0640 ± 0.0003 | 0.0134 ± 0.0006 | 0.0484 ± 0.0135 | 0.0194 ± 0.0065 |
| 4 | 2 | 0.0587 ± 0.0356 | 0.0131 ± 0.0010 | 0.0438 ± 0.0140 | 0.0094 ± 0.0002 | 0.0488 ± 0.0008 | 0.0209 ± 0.0063 | 0.0464 ± 0.0037 | 0.0110 ± 0.0020 |

## 5 Experiments

We design experiments to answer the following questions. Q1. Performance: How does (k, c)(≤)-SETGNN compare to SOTA expressive GNNs? Q2. Varying k and c: How does the progressively increasing expressiveness reflect on generalization performance? Q3. Computational requirements: Is (k, c)(≤)-SETGNN feasible on practical graphs w.r.t. running time and memory usage?

Datasets. To inspect expressive power, we use four different types of simulation datasets: 1) EXP [1] contains 600 pairs of graphs that 1-WL and 2-WL fail to distinguish, which we turn into a binary classification task by assigning the two graphs in each pair to different classes; 2) SR25 [6] has 15 strongly regular graphs (which 3-WL fails to distinguish) with 25 nodes each, which we transform into a 15-way classification task; 3) substructure counting (i.e., triangle, tailed triangle, star, and 4-cycle) tasks on a random graph dataset [15]; 4) graph property regression (i.e., connectedness, diameter, radius) tasks on a random graph dataset [17]. We also evaluate performance on two real-world graph learning tasks: 5) ZINC-12K [19] and 6) QM9 [59] for molecular property regression. See Table 6 in Appendix A.7 for detailed dataset statistics.

Baselines. We use GCN [36], GIN [61], PNA [17], PPGN [40], PF-GNN [18], and GNN-AK [68] as baselines on the simulation datasets. On ZINC-12K we also reference CIN [9] directly from the literature. Most baseline results are taken from [68]. Finally, we compare to GINE [32] on QM9. Hyperparameter and model configurations are described in Appendix A.8.
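For orientation, the ground-truth targets of the four counting tasks can be computed exactly with standard counting identities; the sketch below (ours, using networkx and well-known closed forms, not the dataset's own generation code) counts triangles, tailed triangles, 3-stars, and 4-cycles.

```python
# Closed-form counts of the four substructures used in the counting tasks (a sketch based on
# standard identities, not the dataset's generation script). A "star" here means a 3-star
# (one center with three neighbors) and a "tailed triangle" is a triangle with one pendant edge.
from math import comb
import networkx as nx
import numpy as np

def substructure_counts(G):
    A = nx.to_numpy_array(G)
    deg = A.sum(axis=1)
    tri_per_node = nx.triangles(G)                       # triangles through each node
    triangles = sum(tri_per_node.values()) // 3
    stars = int(sum(comb(int(d), 3) for d in deg))       # choose 3 neighbors of a center
    tailed = int(sum(t * (G.degree(v) - 2) for v, t in tri_per_node.items()))
    closed_4_walks = np.trace(np.linalg.matrix_power(A, 4))
    four_cycles = int(round((closed_4_walks - 2 * (deg ** 2).sum() + 2 * G.number_of_edges()) / 8))
    return {"triangle": triangles, "tailed_triangle": tailed, "star": stars, "4-cycle": four_cycles}

print(substructure_counts(nx.complete_graph(4)))
# {'triangle': 4, 'tailed_triangle': 12, 'star': 4, '4-cycle': 3}
```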
### 5.2 Results

Theorem 2 shows that k-MULTISETWL is able to distinguish CFI(k) graphs. The result also holds for k(≤)-SETWL, as the proof does not use repetitions. We implemented the construction of CFI(k) graphs for any k. Empirically, we found that (k, c)(≤)-SETGNN is able to distinguish CFI(k) for k = 3, 4, 5 (larger k runs out of memory). This empirically verifies its theoretical expressivity.

Table 1 shows the performance results on all the simulation datasets. Less expressive models such as GCN, GIN, and PNA underperform on EXP and SR25, while the 3-WL-equivalent PPGN excels only on EXP. Notably, (k, c)(≤)-SETGNN achieves 100% discriminating power on EXP with any k ≥ 3 and c ≥ 2, and on SR25 with any k ≥ 4 and c ≥ 1. (k, c)(≤)-SETGNN also outperforms the baselines on all substructure counting tasks, as well as on two out of three graph property prediction tasks (except Radius), with a significant gap, for relatively small values of k and c.

In Table 2, we show the train and test MAEs on the substructure counting tasks for individual values of k and c. As expected, performance improves with increasing k when c is fixed, and vice versa. It is notable that orders-of-magnitude improvements in test error occur when moving from k=2 to k=3 for the triangle tasks as well as the star task, while a similarly large drop is obtained at k=4 for the 4-cycle task; this is expected, as triangles and 4-cycles have 3 and 4 nodes, respectively. Similar observations hold for the graph property tasks as well (see Appendix A.9).

Table 3: SetGNN achieves SOTA on ZINC-12K (test MAE).

| Method | Test MAE |
|---|---|
| GatedGCN | 0.363 ± 0.009 |
| GCN | 0.321 ± 0.009 |
| PNA | 0.188 ± 0.004 |
| DGN | 0.168 ± 0.003 |
| GIN | 0.163 ± 0.004 |
| GINE | 0.157 ± 0.004 |
| HIMP | 0.151 ± 0.006 |
| PNA | 0.140 ± 0.006 |
| GSN | 0.115 ± 0.012 |
| PF-GNN | 0.122 ± 0.010 |
| GIN-AK+ | 0.080 ± 0.001 |
| CIN | 0.079 ± 0.006 |
| (k, c)(≤)-SETGNN* | 0.0750 ± 0.0027 |

In addition to simulation datasets, we evaluate our (k, c)(≤)-SETGNN* on real-world data; Table 3 shows our performance on ZINC-12K. Our method achieves a new state-of-the-art performance, with a mean absolute error (MAE) of 0.0750, using k=5 and c=2. In Table 4 we show the test and validation MAE along with training loss for varying k and c on ZINC-12K. For a fixed c, validation MAE and training loss both follow a first-decreasing and later-increasing trend with increasing k, potentially owing to the difficulty of fitting with too many sets (i.e., supernodes) and edges in the super-graph. Similar results are shown for QM9 in Table 5. For comparison we also show GINE performance using both 4 and 6 layers, both of which perform significantly worse than (k, c)(≤)-SETGNN*.
Table 4: (k, c)(≤)-SETGNN* performance on ZINC-12K for varying (k, c). Test MAE is reported at the lowest validation MAE.

| k | c | Train loss | Val. MAE | Test MAE |
|---|---|---|---|---|
| 2 | 1 | 0.1381 ± 0.0240 | 0.2429 ± 0.0071 | 0.2345 ± 0.0131 |
| 3 | 1 | 0.1172 ± 0.0063 | 0.2298 ± 0.0060 | 0.2252 ± 0.0030 |
| 4 | 1 | 0.0693 ± 0.0111 | 0.1645 ± 0.0052 | 0.1636 ± 0.0052 |
| 5 | 1 | 0.0643 ± 0.0019 | 0.1593 ± 0.0051 | 0.1447 ± 0.0013 |
| 6 | 1 | 0.0519 ± 0.0064 | 0.0994 ± 0.0093 | 0.0843 ± 0.0048 |
| 7 | 1 | 0.0543 ± 0.0048 | 0.0965 ± 0.0061 | 0.0747 ± 0.0022 |
| 8 | 1 | 0.0564 ± 0.0152 | 0.0961 ± 0.0043 | 0.0732 ± 0.0037 |
| 9 | 1 | 0.0817 ± 0.0274 | 0.0909 ± 0.0094 | 0.0824 ± 0.0056 |
| 10 | 1 | 0.0894 ± 0.0266 | 0.1060 ± 0.0157 | 0.0950 ± 0.0102 |
| 2 | 2 | 0.1783 ± 0.0602 | 0.2913 ± 0.0102 | 0.2948 ± 0.0210 |
| 3 | 2 | 0.0640 ± 0.0072 | 0.1668 ± 0.0078 | 0.1391 ± 0.0102 |
| 4 | 2 | 0.0499 ± 0.0043 | 0.1029 ± 0.0033 | 0.0836 ± 0.0010 |
| 5 | 2 | 0.0483 ± 0.0017 | 0.0899 ± 0.0056 | 0.0750 ± 0.0027 |
| 6 | 2 | 0.0530 ± 0.0064 | 0.0927 ± 0.0050 | 0.0737 ± 0.0006 |
| 7 | 2 | 0.0547 ± 0.0036 | 0.0984 ± 0.0047 | 0.0784 ± 0.0043 |
| 3 | 3 | 0.0798 ± 0.0062 | 0.1881 ± 0.0076 | 0.1722 ± 0.0086 |
| 4 | 3 | 0.0565 ± 0.0059 | 0.1121 ± 0.0066 | 0.0869 ± 0.0026 |
| 5 | 3 | 0.0671 ± 0.0156 | 0.1091 ± 0.0097 | 0.0920 ± 0.0054 |

Table 5: (k, c)(≤)-SETGNN* performance on QM9 for varying (k, c). Test MAE is reported at the lowest validation MAE.

| k | c | Train loss | Val. MAE | Test MAE |
|---|---|---|---|---|
| 2 | 1 | 0.0376 ± 0.0005 | 0.0387 ± 0.0007 | 0.0389 ± 0.0008 |
| 3 | 1 | 0.0308 ± 0.0010 | 0.0386 ± 0.0017 | 0.0379 ± 0.0010 |
| 4 | 1 | 0.0338 ± 0.0003 | 0.0371 ± 0.0005 | 0.0370 ± 0.0006 |
| 5 | 1 | 0.0299 ± 0.0017 | 0.0343 ± 0.0008 | 0.0341 ± 0.0009 |
| 6 | 1 | 0.0226 ± 0.0004 | 0.0296 ± 0.0007 | 0.0293 ± 0.0007 |
| 7 | 1 | 0.0208 ± 0.0005 | 0.0289 ± 0.0007 | 0.0269 ± 0.0003 |
| 2 | 2 | 0.0367 ± 0.0007 | 0.0398 ± 0.0004 | 0.0398 ± 0.0004 |
| 3 | 2 | 0.0282 ± 0.0013 | 0.0358 ± 0.0009 | 0.0356 ± 0.0007 |
| 4 | 2 | 0.0219 ± 0.0004 | 0.0280 ± 0.0008 | 0.0278 ± 0.0008 |
| 5 | 2 | 0.0175 ± 0.0003 | 0.0267 ± 0.0005 | 0.0251 ± 0.0006 |
| 3 | 3 | 0.0391 ± 0.0107 | 0.0428 ± 0.0057 | 0.0425 ± 0.0052 |
| 4 | 3 | 0.0219 ± 0.0011 | 0.0301 ± 0.0010 | 0.0286 ± 0.0004 |
| GINE (L=4) | | 0.0507 ± 0.0014 | 0.0478 ± 0.0003 | 0.0479 ± 0.0004 |
| GINE (L=6) | | 0.0440 ± 0.0009 | 0.0440 ± 0.0009 | 0.0451 ± 0.0009 |

### 5.3 Computational requirements

Figure 2: (k, c)(≤)-SETGNN*'s footprint scales practically with both k and c in terms of memory (a) and running time (b) on ZINC-12K.

We next investigate how increasing k and c changes the computational footprint of (k, c)(≤)-SETGNN* in practice. Fig. 2 shows that increasing these parameters expectedly increases both memory consumption (in MB, in (a)) and runtime (in seconds per epoch, in (b)). Notably, since larger k and c increase our model's expressivity, we observe that suitable choices for k and c allow us to practically realize these increasingly expressive models on commodity hardware. With conservative values of c (e.g., c=1), we are able to pass messages between sets of up to k (e.g., 10) nodes, far larger than k-WL-style higher-order models can achieve (k ≤ 3).

## 6 Conclusion

Our work is motivated by the impracticality of higher-order GNN models based on the k-WL hierarchy, which makes it challenging to study how much expressiveness real-world tasks truly necessitate. To this end, we proposed (k, c)(≤)-SETWL, a more practical and progressively-expressive hierarchy with theoretical connections to k-WL and drastically lower complexity. We also designed and implemented a practical model, (k, c)(≤)-SETGNN(*), whose expressiveness is gradually increased by larger k and c. Our model achieves strong performance, including several new best results on graph-level tasks like ZINC-12K and expressiveness-relevant tasks like substructure counting, while being practically trainable on commodity hardware.

## References

[1] R. Abboud, I. I. Ceylan, M. Grohe, and T. Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), 2021.
[2] U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=i80OPhOCVH2.

[3] V. Arvind, F. Fuhlbrück, J. Köbler, and O. Verbitsky. On Weisfeiler-Leman invariance: Subgraph counts and related graph properties. Journal of Computer and System Sciences, 113:42–59, 2020.

[4] W. Azizian and M. Lelarge. Expressive power of invariant and equivariant graph neural networks. arXiv preprint arXiv:2006.15646, 2020.

[5] L. Babai, P. Erdos, and S. M. Selkow. Random graph isomorphism. SIAM Journal on Computing, 9(3):628–635, 1980.

[6] M. Balcilar, P. Héroux, B. Gauzere, P. Vasseur, S. Adam, and P. Honeine. Breaking the limits of message passing graph neural networks. In International Conference on Machine Learning, pages 599–608. PMLR, 2021.

[7] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[8] B. Bevilacqua, F. Frasca, D. Lim, B. Srinivasan, C. Cai, G. Balamurugan, M. M. Bronstein, and H. Maron. Equivariant subgraph aggregation networks. arXiv preprint arXiv:2110.02910, 2021.

[9] C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler and Lehman go cellular: CW networks. Advances in Neural Information Processing Systems, 34:2625–2640, 2021.

[10] G. Bouritsas, F. Frasca, S. P. Zafeiriou, and M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[11] J.-Y. Cai, M. Fürer, and N. Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.

[12] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3438–3445, 2020.

[13] L. Chen, Z. Chen, and J. Bruna. On graph neural networks versus graph-augmented MLPs. arXiv preprint arXiv:2010.15116, 2020.

[14] Z. Chen, L. Li, and J. Bruna. Supervised community detection with line graph neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1g0Z3A9Fm.

[15] Z. Chen, L. Chen, S. Villar, and J. Bruna. Can graph neural networks count substructures? Advances in Neural Information Processing Systems, 33:10383–10395, 2020.

[16] K. Choudhary and B. DeCost. Atomistic line graph neural network for improved materials property predictions. npj Computational Materials, 7(1):1–8, 2021.

[17] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković. Principal neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Systems, volume 33, pages 13260–13271, 2020.

[18] M. H. Dupty, Y. Dong, and W. S. Lee. PF-GNN: Differentiable particle filtering based approximation of universal graph representations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=oh4TirnfSem.

[19] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.

[20] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric.
In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

[21] F. Frasca, B. Bevilacqua, M. M. Bronstein, and H. Maron. Understanding and extending subgraph GNNs by rethinking their symmetries. arXiv preprint arXiv:2206.11140, 2022.

[22] V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pages 3419–3430. PMLR, 2020.

[23] F. Geerts. The expressive power of kth-order invariant graph networks. arXiv preprint arXiv:2007.12035, 2020.

[24] F. Geerts and J. L. Reutter. Expressiveness and approximation properties of graph neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=wIzUeM3TAU.

[25] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.

[26] M. Grohe. The logic of graph neural networks. In 2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pages 1–17. IEEE, 2021.

[27] M. Grohe. Multiset WL. Personal communication, 2022.

[28] M. Grohe and M. Otto. Pebble games and linear equations. The Journal of Symbolic Logic, 80(3):797–844, 2015.

[29] D. M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004.

[30] L. Hella. Logical hierarchies in PTIME. Information and Computation, 129(1):1–19, 1996.

[31] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[32] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJlWWJSFDH.

[33] N. Keriven and G. Peyré. Universal invariant and equivariant graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.

[34] J. Kim, T. D. Nguyen, S. Min, S. Cho, M. Lee, H. Lee, and S. Hong. Pure transformers are powerful graph learners. arXiv, abs/2207.02505, 2022. URL https://arxiv.org/abs/2207.02505.

[35] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[36] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[37] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997, 2018.

[38] A. Loukas. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l2bp4YwS.

[39] Y. Ma, X. Liu, T. Zhao, Y. Liu, J. Tang, and N. Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.

[40] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. Advances in Neural Information Processing Systems, 32, 2019.

[41] H. Maron, E. Fetaya, N. Segol, and Y. Lipman. On the universality of invariant networks. In International Conference on Machine Learning, pages 4363–4371. PMLR, 2019.

[42] C. Morris, K. Kersting, and P. Mutzel.
Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In 2017 IEEE International Conference on Data Mining (ICDM), pages 327–336. IEEE, 2017.

[43] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4602–4609, 2019.

[44] C. Morris, G. Rattan, and P. Mutzel. Weisfeiler and Leman go sparse: Towards scalable higher-order graph embeddings. Advances in Neural Information Processing Systems, 33:21824–21840, 2020.

[45] C. Morris, G. Rattan, S. Kiefer, and S. Ravanbakhsh. SpeqNets: Sparsity-aware permutation-equivariant graph networks. arXiv preprint arXiv:2203.13913, 2022.

[46] R. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro. Relational pooling for graph representations. In International Conference on Machine Learning, pages 4663–4673. PMLR, 2019.

[47] K. Oono and T. Suzuki. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947, 2019.

[48] C. Qian, G. Rattan, F. Geerts, C. Morris, and M. Niepert. Ordered subgraph aggregation networks. arXiv preprint arXiv:2206.11168, 2022.

[49] R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of Graph Theory, 1(4):339–363, 1977.

[50] E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti. SIGN: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.

[51] A. Sankar, Y. Liu, J. Yu, and N. Shah. Graph neural networks for friend ranking in large-scale social platforms. In Proceedings of the Web Conference 2021, pages 2535–2546, 2021.

[52] R. Sato. A survey on the expressive power of graph neural networks. CoRR, abs/2003.04078, 2020. URL http://dblp.uni-trier.de/db/journals/corr/corr2003.html#abs-2003-04078.

[53] R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 333–341. SIAM, 2021.

[54] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011.

[55] J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522, 2021.

[56] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[57] B. Weisfeiler. On construction and identification of graphs. In Lecture Notes in Mathematics. Citeseer, 1976.

[58] S. Wu, F. Sun, W. Zhang, X. Xie, and B. Cui. Graph neural networks in recommender systems: a survey. ACM Computing Surveys (CSUR), 2020.

[59] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

[60] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.

[61] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations, ICLR '19, pages 1–17, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.

[62] J. You, R. Ying, and J.
Leskovec. Position-aware graph neural networks. In International Conference on Machine Learning, pages 7134–7143. PMLR, 2019.

[63] J. You, J. Gomes-Selman, R. Ying, and J. Leskovec. Identity-aware graph neural networks. arXiv preprint arXiv:2101.10320, 2021.

[64] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.

[65] M. Zhang and P. Li. Nested graph neural networks. Advances in Neural Information Processing Systems, 34:15734–15747, 2021.

[66] M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin. Revisiting graph neural networks for link prediction. 2020.

[67] L. Zhao and L. Akoglu. PairNorm: Tackling oversmoothing in GNNs. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkecl1rtwB.

[68] L. Zhao, W. Jin, L. Akoglu, and N. Shah. From stars to subgraphs: Uplifting any GNN with local structure awareness. ICLR, 2022.