# WL meet VC

Christopher Morris *1, Floris Geerts *2, Jan Tönshoff 1, Martin Grohe 1

*Equal contribution. 1Department of Computer Science, RWTH Aachen University, Aachen, Germany. 2Department of Computer Science, University of Antwerp, Antwerp, Belgium. Correspondence to: Christopher Morris. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the 1-dimensional Weisfeiler-Leman algorithm (1-WL). Here, the 1-WL is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik-Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the 1-WL. Second, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the 1-WL and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.

1. Introduction

Graph-structured data are prevalent across application domains ranging from chemo- and bioinformatics (Barabasi & Oltvai, 2004; Jumper et al., 2021; Stokes et al., 2020) to image (Simonovsky & Komodakis, 2017) and social-network analysis (Easley & Kleinberg, 2010), indicating the importance of machine learning methods for such data. Nowadays, there are numerous approaches for machine learning on graph-structured data, most notably those based on graph kernels (Borgwardt et al., 2020; Kriege et al., 2020) or graph neural networks (GNNs) (Chami et al., 2020; Gilmer et al., 2017; Morris et al., 2021). Here, graph kernels (Shervashidze et al., 2011) based on the 1-dimensional Weisfeiler-Leman algorithm (1-WL) (Weisfeiler & Leman, 1968), a well-studied heuristic for the graph isomorphism problem, and corresponding GNNs (Morris et al., 2019; Xu et al., 2019) have recently advanced the state of the art in supervised vertex- and graph-level learning (Morris et al., 2021). Further, based on the k-dimensional Weisfeiler-Leman algorithm (k-WL), the 1-WL's more powerful generalization, several works generalized GNNs to higher-order GNNs (k-GNNs), resulting in provably more expressive architectures, e.g., Azizian & Lelarge (2021); Geerts & Reutter (2022); Maron et al. (2019); Morris et al. (2019; 2020b; 2021; 2022). While devising provably expressive GNN-like architectures is a meaningful endeavor, it only partially addresses the challenges of machine learning with graphs. That is, expressiveness results reveal little about an architecture's ability to generalize to graphs outside the training set. Surprisingly, only a few notable contributions study GNNs' generalization behavior, e.g., Garg et al. (2020); Kriege et al. (2018); Liao et al. (2021); Maskey et al. (2022); Scarselli et al. (2018).
However, these approaches express GNNs' generalization ability using only classical graph parameters, e.g., maximum degree, number of vertices, or edges, which cannot fully capture the complex structure of real-world graphs. Further, most approaches study generalization in the non-uniform regime, i.e., assuming that the GNNs operate on graphs of a pre-specified order. Moreover, they only investigate the case k = 1, i.e., standard GNNs, ignoring more expressive generalizations; see the previous paragraph.

1.1. Present work

This paper investigates the influence of graph structure and the parameters' encoding lengths on GNNs' generalization by tightly connecting the 1-WL's expressivity and GNNs' Vapnik-Chervonenkis (VC) dimension. Specifically, our contributions are:

1. In the non-uniform regime, we prove tight bounds on GNNs' VC dimension. We show that GNNs' VC dimension depends tightly on the number of equivalence classes computed by the 1-WL over a set of graphs; see Propositions 3.1 and 3.2. Moreover, our results easily extend to the k-WL and many recent expressive GNN extensions.
2. In the uniform regime, i.e., when graphs can have arbitrary order, we show that GNNs' VC dimension is lower and upper bounded by the largest bitlength of their weights; see Proposition 3.5.
3. In both the uniform and non-uniform regimes, GNNs' VC dimension depends logarithmically on the number of colors computed by the 1-WL and polynomially on the number of parameters; see Theorem 3.6.
4. Empirically, we show that our theoretical findings hold in practice.

[Figure 1: Overview of our results for bounded-width GNNs (bounds from Prop. 3.5, Thm. 3.4, Thm. 3.6, and Props. 3.1 and 3.2). Green and red boxes denote VC dimension bounds. Here, m_{n,d,L} denotes the number of graphs of order at most n with boolean d-dimensional features distinguishable by the 1-WL after L iterations.]

Overall, our results provide new insights into GNNs' generalization behavior and how graph structure and parameters influence it. Specifically, our results imply that a complex graph structure, captured by the 1-WL, results in worse generalization performance. The same holds for increasing the encoding length of the GNN's parameters. Importantly, our theory provides the first link between expressivity results and generalization ability. Moreover, our results establish the first lower bounds for GNNs' VC dimension. See Figure 1 for a high-level overview of our results.

1.2. Related work

In the following, we discuss relevant related work.

GNNs. Recently, GNNs (Gilmer et al., 2017; Scarselli et al., 2009) emerged as the most prominent graph representation learning architecture. Notable instances of this architecture include, e.g., Duvenaud et al. (2015); Hamilton et al. (2017), and Veličković et al. (2018), which can be subsumed under the message-passing framework introduced in Gilmer et al. (2017). In parallel, approaches based on spectral information were introduced in, e.g., Bruna et al. (2014); Defferrard et al. (2016); Gama et al. (2019); Kipf & Welling (2017); Levie et al. (2019), and Monti et al. (2017), all of which descend from early work in Baskin et al. (1997); Goller & Küchler (1996); Kireev (1995); Merkwirth & Lengauer (2005); Micheli & Sestito (2005); Micheli (2009); Scarselli et al. (2009), and Sperduti & Starita (1997).
Limits of GNNs and more expressive architectures. Recently, connections between GNNs and Weisfeiler-Leman-type algorithms have been shown (Barceló et al., 2020; Geerts et al., 2021; Morris et al., 2019; Xu et al., 2019). Specifically, Morris et al. (2019) and Xu et al. (2019) showed that the 1-WL limits the expressive power of any possible GNN architecture in terms of distinguishing non-isomorphic graphs. In turn, these results have been generalized to the k-WL, see, e.g., Azizian & Lelarge (2021); Geerts (2020); Maron et al. (2019); Morris et al. (2019; 2020b; 2022), and connected to the approximation of permutation-equivariant functions over graphs, see, e.g., Chen et al. (2019); Maehara & NT (2019); Azizian & Lelarge (2021); Geerts & Reutter (2022). Further, Aamand et al. (2022) devised an improved analysis using randomization. Recent works have extended the expressive power of GNNs, e.g., by encoding vertex identifiers (Murphy et al., 2019; Vignac et al., 2020), using random features (Abboud et al., 2021; Dasoulas et al., 2020; Sato et al., 2021), equivariant graph polynomials (Puny et al., 2023), homomorphism and subgraph counts (Barceló et al., 2021; Bouritsas et al., 2020; Nguyen & Maehara, 2020), spectral information (Balcilar et al., 2021), simplicial (Bodnar et al., 2021b) and cellular complexes (Bodnar et al., 2021a), persistent homology (Horn et al., 2022), random walks (Tönshoff et al., 2021; Martinkus et al., 2022), graph decompositions (Talak et al., 2021), relational (Barceló et al., 2022), distance (Li et al., 2020) and directional information (Beaini et al., 2021), subgraph information (Bevilacqua et al., 2022; Cotta et al., 2021; Feng et al., 2022; Frasca et al., 2022; Huang et al., 2022; Morris et al., 2021; Papp et al., 2021; Papp & Wattenhofer, 2022; Qian et al., 2022; Thiede et al., 2021; Wijesinghe & Wang, 2022; You et al., 2021; Zhang & Li, 2021; Zhao et al., 2022; Zhang et al., 2023a), and biconnectivity (Zhang et al., 2023b). See Morris et al. (2021) for an in-depth survey on this topic. Geerts & Reutter (2022) devised a general approach for bounding the expressive power of a large variety of GNNs utilizing the 1-WL or k-WL. Recently, Kim et al. (2022) showed that transformer architectures (Müller et al., 2023) can simulate the 2-WL. Grohe (2023) showed tight connections between GNNs' expressivity and circuit complexity. Moreover, Rosenbluth et al. (2023) investigated the expressive power of different aggregation functions beyond sum aggregation.

GNNs' generalization capabilities. Scarselli et al. (2018) used classical techniques from learning theory (Karpinski & Macintyre, 1997) to show that the VC dimension (Vapnik, 1995) of GNNs with piece-wise polynomial activation functions on a fixed graph, under various assumptions, is in O(P² n log n), where P is the number of parameters and n is the order of the input graph. We note here that Scarselli et al. (2018) analyzed a different type of GNN not aligned with modern GNN architectures (Gilmer et al., 2017). Garg et al. (2020) showed that the empirical Rademacher complexity, e.g., (Mohri et al., 2018), of a specific, simple GNN architecture, using sum aggregation, is bounded in the maximum degree, the number of layers, the Lipschitz constants of the activation functions, and the norms of the parameter matrices. We note here that their analysis assumes weight sharing across layers. Liao et al. (2021) refined these results via a PAC-Bayesian approach, further refined in Ju et al. (2023).
Maskey et al. (2022) used random graph models to show that GNNs' generalization ability depends on the (average) number of vertices in the resulting graphs. Verma & Zhang (2019) studied the generalization abilities of 1-layer GNNs in a transductive setting based on algorithmic stability. Similarly, Esser et al. (2021) used stochastic block models to study the transductive Rademacher complexity (El-Yaniv & Pechyony, 2007; Tolstikhin & Lopez-Paz, 2016) of standard GNNs. Moreover, Kriege et al. (2018) leveraged results from graph property testing (Goldreich, 2010) to study the sample complexity of learning to distinguish various graph properties, e.g., planarity or triangle-freeness, using graph kernels (Borgwardt et al., 2020; Kriege et al., 2020). We stress that all of the above approaches only consider classical graph parameters to bound the generalization abilities of GNNs. Finally, Yehudai et al. (2021) showed negative results for GNNs' ability to generalize to larger graphs. However, the generalization properties of GNNs, and in particular their connection to expressivity, are understood to a lesser extent. See Appendix A for an overview of the Weisfeiler-Leman algorithm's theoretical properties.

2. Preliminaries

Let N := {1, 2, 3, . . .}. For n ≥ 1, let [n] := {1, . . . , n} ⊂ N. We use {{. . .}} to denote multisets, i.e., the generalization of sets allowing for multiple instances of each of their elements.

Graphs. A graph G is a pair (V(G), E(G)) with a finite set of vertices or nodes V(G) and a set of edges E(G) ⊆ {{u, v} ⊆ V(G) | u ≠ v}. If not otherwise stated, we set n := |V(G)|, and the graph is of order n. We also call the graph G an n-order graph. For ease of notation, we denote the edge {u, v} in E(G) by (u, v) or (v, u). In the case of directed graphs, the set E(G) ⊆ {(u, v) ∈ V(G) × V(G) | u ≠ v}, and a directed acyclic graph (DAG) is a directed graph with no directed cycles. A (vertex-)labeled graph G is a triple (V(G), E(G), ℓ) with a (vertex-)label function ℓ: V(G) → N. Then ℓ(v) is the label of v, for v in V(G). An attributed graph G is a triple (V(G), E(G), a) with a graph (V(G), E(G)) and a (vertex-)attribute function a: V(G) → R^{1×d}, for some d > 0. That is, contrary to labeled graphs, we allow for vertex annotations from an uncountable set. Then a(v) is an attribute or feature of v, for v in V(G). Equivalently, we define an n-order attributed graph G := (V(G), E(G), a) as a pair (G, L), where G = (V(G), E(G)) and L ∈ R^{n×d} is a vertex feature matrix. Here, we identify V(G) with [n]. For a matrix L ∈ R^{n×d} and v in [n], we denote by L_v ∈ R^{1×d} the v-th row of L, such that L_v := a(v). We also write R^d for R^{1×d}. The neighborhood of v in V(G) is denoted by N(v) := {u ∈ V(G) | (v, u) ∈ E(G)}, and the degree of a vertex v is |N(v)|. In the case of directed graphs, N⁺(u) := {v ∈ V(G) | (v, u) ∈ E(G)} and N⁻(u) := {v ∈ V(G) | (u, v) ∈ E(G)}. The in-degree and out-degree of a vertex v are |N⁺(v)| and |N⁻(v)|, respectively. Two graphs G and H are isomorphic, and we write G ≃ H, if there exists a bijection φ: V(G) → V(H) preserving the adjacency relation, i.e., (u, v) is in E(G) if and only if (φ(u), φ(v)) is in E(H). Then φ is an isomorphism between G and H. In the case of labeled graphs, we additionally require that ℓ(v) = ℓ(φ(v)) for v in V(G), and similarly for attributed graphs.

2.1. The Weisfeiler-Leman algorithm

We here describe the 1-WL and refer to Appendix B for the k-WL.
The 1-WL or color refinement is a well-studied heuristic for the graph isomorphism problem, originally proposed by Weisfeiler & Leman (1968). Intuitively, the algorithm determines if two graphs are non-isomorphic by iteratively coloring or labeling vertices. Given an initial coloring or labeling of the vertices of both graphs, e.g., their degree or application-specific information, in each iteration, two vertices with the same label get different labels if the number of identically labeled neighbors is unequal. These labels induce a vertex partition, and the algorithm terminates when, after some iteration, it does not refine the current partition, i.e., when a stable coloring or stable partition is obtained. Then, if the number of vertices annotated with a specific label is different in both graphs, we can conclude that the two graphs are not isomorphic. It is easy to see that the algorithm cannot distinguish all non-isomorphic graphs (Cai et al., 1992). Nonetheless, it is a powerful heuristic that can successfully test isomorphism for a broad class of graphs (Babai & Kucera, 1979). (Footnote 1: Strictly speaking, the 1-WL and color refinement are two different algorithms. That is, the 1-WL considers neighbors and non-neighbors to update the coloring, resulting in a slightly higher expressive power when distinguishing vertices in a given graph; see (Grohe, 2021) for details. For brevity, we consider both algorithms to be equivalent.)

Formally, let G = (V(G), E(G), ℓ) be a labeled graph. In each iteration, t > 0, the 1-WL computes a vertex coloring C¹_t : V(G) → N, depending on the coloring of the neighbors. That is, in iteration t > 0, we set

C¹_t(v) := RELABEL( C¹_{t−1}(v), {{ C¹_{t−1}(u) | u ∈ N(v) }} ),

for all vertices v in V(G), where RELABEL injectively maps the above pair to a unique natural number, which has not been used in previous iterations. In iteration 0, the coloring C¹_0 := ℓ. To test if two graphs G and H are non-isomorphic, we run the above algorithm in parallel on both graphs. If the two graphs have a different number of vertices colored c in N at some iteration, the 1-WL distinguishes the graphs as non-isomorphic. Moreover, if the number of colors between two iterations, t and (t+1), does not change, i.e., the cardinalities of the images of C¹_t and C¹_{t+1} are equal, or, equivalently, C¹_t(v) = C¹_t(w) if and only if C¹_{t+1}(v) = C¹_{t+1}(w), for all vertices v and w in V(G), the algorithm terminates. For such t, we define the stable coloring C¹_∞(v) = C¹_t(v), for v in V(G). The stable coloring is reached after at most max{|V(G)|, |V(H)|} iterations (Grohe, 2017). We define the color complexity of a graph G as the number of colors computed by the 1-WL after |V(G)| iterations on G.

Due to the shortcomings of the 1-WL or color refinement in distinguishing non-isomorphic graphs, several researchers, e.g., Babai (1979); Cai et al. (1992), devised a more powerful generalization of the former, today known as the k-dimensional Weisfeiler-Leman algorithm, operating on k-tuples of vertices rather than single vertices; see Appendix B for details.
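To make the refinement step above concrete, the following is a minimal Python sketch of color refinement on a single labeled graph. The function name, the dictionary-based graph representation, and the shared RELABEL dictionary are our own illustrative choices, not code from the paper; passing the same `relabel` dictionary for several graphs makes their colors comparable, which is what the parallel run on two graphs described above requires.

```python
from collections import Counter

def wl_refine(adj, labels, num_iterations, relabel=None):
    """Minimal sketch of 1-WL / color refinement.

    adj: dict mapping each vertex to a list of its neighbors.
    labels: dict mapping each vertex to its initial label (iteration 0).
    relabel: shared dict implementing the injective RELABEL map; pass the same
        dict for several graphs so that their colors are comparable.
    Returns the final coloring and the per-iteration color histograms.
    """
    relabel = {} if relabel is None else relabel
    colors = dict(labels)
    histograms = [Counter(colors.values())]
    for _ in range(num_iterations):
        new_colors = {}
        for v in adj:
            # RELABEL: map (own color, multiset of neighbor colors) injectively
            # to a natural number, reusing the number for identical signatures.
            signature = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            new_colors[v] = relabel.setdefault(signature, len(relabel))
        if len(set(new_colors.values())) == len(set(colors.values())):
            break  # partition not refined further: stable coloring reached
        colors = new_colors
        histograms.append(Counter(colors.values()))
    return colors, histograms
```

Comparing the per-iteration histogram sequences of two graphs run with a shared `relabel` dictionary then mirrors the non-isomorphism test described above.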
2.2. Graph Neural Networks

Intuitively, GNNs learn a vectorial representation, i.e., a d-dimensional real-valued vector, representing each vertex in a graph by aggregating information from neighboring vertices. Formally, let G = (V(G), E(G), ℓ) be a labeled graph with initial vertex features h^(0)_v ∈ R^d that are consistent with ℓ. That is, each vertex v is annotated with a feature h^(0)_v ∈ R^d such that h^(0)_v = h^(0)_u if and only if ℓ(v) = ℓ(u), e.g., a one-hot encoding of the labels ℓ(u) and ℓ(v). Alternatively, h^(0)_v can be an attribute or a feature of the vertex v, e.g., physical measurements in the case of chemical molecules. A GNN architecture consists of a stack of neural network layers, i.e., a composition of permutation-equivariant parameterized functions. Similarly to the 1-WL, each layer aggregates local neighborhood information, i.e., the neighbors' features around each vertex, and then passes this aggregated information on to the next layer. Following Gilmer et al. (2017) and Scarselli et al. (2009), in each layer, t > 0, we compute vertex features

h^(t)_v := UPD^(t)( h^(t−1)_v, AGG^(t)( {{ h^(t−1)_u | u ∈ N(v) }} ) ) ∈ R^d,    (1)

where UPD^(t) and AGG^(t) may be differentiable parameterized functions, e.g., neural networks. (Footnote 2: Strictly speaking, Gilmer et al. (2017) consider a slightly more general setting in which vertex features are computed by h^(t)_v := UPD^(t)( h^(t−1)_v, AGG^(t)( {{ (h^(t−1)_v, h^(t−1)_u, ℓ(v, u)) | u ∈ N(v) }} ) ), where ℓ(v, u) denotes the edge label of the edge (v, u).) In the case of graph-level tasks, e.g., graph classification, one uses

h_G := READOUT( {{ h^(L)_v | v ∈ V(G) }} ) ∈ R,    (2)

to compute a single vectorial representation based on the learned vertex features after iteration L. (Footnote 3: For simplicity, we assume GNNs to return scalars on graphs. This makes the definition of VC dimension more concise.) Again, READOUT may be a differentiable parameterized function. To adapt the parameters of the above three functions, they are optimized end-to-end, usually through a variant of stochastic gradient descent, e.g., Kingma & Ba (2015), together with the parameters of a neural network used for classification or regression. See Appendix C for a definition of (higher-order) k-GNNs.

Notation. In the subsequent sections, we use the following notation. We denote the class of all (labeled) graphs by G, the class of all graphs with d-dimensional, real-valued vertex features by G_d, the class of all graphs with d-dimensional boolean vertex features by G^B_d, the class of all graphs with an order of at most n and d-dimensional vertex features by G_{d,n}, and the class of all graphs with d-dimensional vertex features and of color complexity at most u by G_{d,≤u}. Further, we consider the following classes of GNNs. We denote the class of all GNNs consisting of L layers with an (L+1)-th readout layer by GNN(L), the subset of GNN(L) whose aggregation, update, and readout functions have a width of at most d by GNN(d, L), and the subset of GNN(L) whose aggregation function is a summation and whose update and readout functions are single-layer perceptrons of width at most d by GNNslp(d, L). More generally, we consider the class GNNmlp(d, L) of GNNs using summation for aggregation and such that update and readout functions are multilayer perceptrons (MLPs), all of width at most d. We refer to elements of GNNmlp(d, L) as simple GNNs. See Appendix E for details. We stress that simple GNNs are already expressive enough to be equivalent to the 1-WL in distinguishing non-isomorphic graphs.
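As a concrete illustration of Equations (1) and (2), the following PyTorch sketch implements a GNN in the spirit of the simple GNNs above: sum aggregation over neighbors, an MLP update, and a sum readout followed by an MLP. It is a minimal sketch under our own design choices (dense adjacency matrices, concatenating a vertex's own feature with the aggregate), not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer in the sense of Eq. (1):
    sum aggregation over neighbors followed by an MLP update."""
    def __init__(self, d):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, h, adj):
        # adj: dense (n, n) float adjacency matrix; adj @ h sums the
        # neighbors' features and plays the role of AGG.
        agg = adj @ h
        # UPD acts on the vertex's own feature and the aggregated features.
        return self.update(torch.cat([h, agg], dim=-1))


class SimpleGNN(nn.Module):
    """L message-passing layers, then a sum readout and an MLP, cf. Eq. (2)."""
    def __init__(self, d_in, d, num_layers):
        super().__init__()
        self.encoder = nn.Linear(d_in, d)
        self.layers = nn.ModuleList([SimpleGNNLayer(d) for _ in range(num_layers)])
        self.readout = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x, adj):
        h = self.encoder(x)                # initial features h^(0)
        for layer in self.layers:
            h = layer(h, adj)              # h^(t) from h^(t-1)
        return self.readout(h.sum(dim=0))  # graph-level output h_G
```

In GNNslp(d, L), the update and readout MLPs above would be replaced by single-layer perceptrons.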
VC dimension of GNNs. For a class C of GNNs and a class X of graphs, VC-dim_X(C) is the maximal number m of graphs G_1, . . . , G_m in X that can be shattered by C. Here, G_1, . . . , G_m are shattered if for any τ ∈ {0, 1}^m there exists a GNN gnn in C such that for all i in [m]:

gnn(G_i) ≥ 2/3 if τ_i = 1, and gnn(G_i) ≤ 1/3 if τ_i = 0.    (3)

The above definition can straightforwardly be generalized to k-GNNs. Bounding the VC dimension directly implies an upper bound on the generalization error; see Appendix D and (Vapnik, 1995; Mohri et al., 2018) for details.

Bitlength of GNNs. Below, we study the dependence of GNNs' VC dimension on the bitlength of their weights. Given an L-layer GNN with a set of parameters Θ, the GNN's bitlength is the maximum number of bits needed to encode each weight in Θ and the parameters specifying the activation functions. We define the bitlength of a class of GNNs as the maximum bitlength across all GNNs in the class.

3. WL meet VC

We first consider the non-uniform regime, i.e., we assume an upper bound on the graphs' order. Given the connection between GNNs and the 1-WL (Morris et al., 2019; Xu et al., 2018), GNNs' ability to shatter a set of graphs can easily be related to distinguishability by the 1-WL. Indeed, let S be a collection of graphs that GNNs can shatter. Hence, for each pair of graphs in S, we have a GNN that distinguishes them. By the results of (Morris et al., 2019; Xu et al., 2019), this implies that the graphs in S are pairwise 1-WL distinguishable. In other words, when considering the VC dimension of GNNs on a class S of graphs with a bounded number m of 1-WL distinguishable graphs, then m is also an upper bound on the VC dimension of GNNs on graphs in S. For example, let us first consider the VC dimension of GNNs on the class G^B_{d,n} consisting of graphs of an order of at most n with d-dimensional boolean features. Let m_{n,d,L} be the maximal number of graphs in G^B_{d,n} distinguishable by the 1-WL after L iterations. Then, the same argument as above implies that m_{n,d,L} is also the maximal number of graphs in G^B_{d,n} that can be shattered by L-layer GNNs, as is stated next.

Proposition 3.1. For all n, d, and L, it holds that VC-dim_{G^B_{d,n}}(GNN(L)) ≤ m_{n,d,L}.

This upper bound holds regardless of the choice of aggregation, update, and readout functions used in the GNNs. We next show a matching lower bound for the VC dimension of GNNs on graphs in G^B_{d,n}. In fact, the lower bound already holds for simple GNNs of width O(n·m_{n,d,L}).

Proposition 3.2. For all n, d, and L, all m_{n,d,L} 1-WL-distinguishable graphs of order at most n with d-dimensional boolean features can be shattered by sufficiently wide L-layer GNNs. Hence, VC-dim_{G^B_{d,n}}(GNN(L)) = m_{n,d,L}.

The lower bound follows from the fact that GNNs are as powerful as the 1-WL (Morris et al., 2019; Xu et al., 2018). Indeed, Morris et al. (2019) showed that L iterations of the 1-WL on graphs in G^B_{d,n} can be simulated by a simple L-layer GNN of width O(n). To shatter all m_{n,d,L} graphs, we first simulate the 1-WL on all m_{n,d,L} graphs combined using a simple L-layer GNN of width in O(n·m_{n,d,L}). We then define a readout layer whose weights can be used to shatter the input graphs based on the computed 1-WL vertex colors in the graphs. We note that the above two results can be straightforwardly generalized to k-GNNs, i.e., their VC dimension is tightly connected to the expressive power of the k-WL, and to other recent extensions of GNNs; see Appendix C.1.

We now consider the uniform regime, i.e., we assume no upper bound on the graphs' order. Since the number m_{n,d,L} of 1-WL distinguishable graphs increases for growing n, Proposition 3.2 implies that the VC dimension of L-layer GNNs on the class G^B_d of all graphs with d-dimensional boolean features but of arbitrary order is unbounded.

Corollary 3.3. For all d and L ≥ 1, it holds that VC-dim_{G^B_d}(GNN(L)) = ∞.
The proof of this result requires update and readout functions in GNNs of unbounded width. Using a different, bit-extraction-based proof technique, we can strengthen the previous result such that fixed-width GNNs can be considered.

Theorem 3.4. For all d and L at least two, it holds that VC-dim_{G^B_d}(GNN(d, L)) = ∞.

Again, this result holds even for the class of simple GNNs. The theorem relies on the following result, which is of independent interest.

Proposition 3.5. There exists a family F_b of simple 2-layer GNNs of width two and bitlength O(b) using piece-wise linear activation functions such that its VC dimension is exactly b.

Theorem 3.4 directly follows from Proposition 3.5. That is, to show that the VC dimension is infinite for GNNs in GNN(d, L) with d, L ≥ 2, we leverage Proposition 3.5, which implies that we can shatter an arbitrary number of graphs by such GNNs, provided that they have bit precision O(b). The GNNs in Theorem 3.4 have arbitrary-precision real weights, so they can also shatter these graphs. Since this works for any b, Theorem 3.4 follows. Proposition 3.5, in turn, is proved as follows. For the upper bound, we observe that there are only exponentially many (in b) GNNs in F_b, from which the upper bound immediately follows. Indeed, classical VC theory implies a bound on the VC dimension for finite classes of GNNs, logarithmic in the number of GNNs in the class. The lower bound proof is more challenging and requires constructing the collection F_b of simple GNNs that can shatter b graphs belonging to G^B_d. We remark that the b graphs used are of order O(2^b) and have O(2^b) 1-WL vertex colors.

Finally, we show bounded VC dimension when both the width and the number of layers of the GNNs, and the input graphs, are restricted. We obtain the bound by leveraging state-of-the-art VC dimension bounds for feedforward neural networks (FNNs) by Bartlett et al. (2019a). To control the number of parameters, we consider the class GNNslp(d, L) of simple GNNs in which update and readout functions are single-layer perceptrons of bounded width d. We specify this class of GNNs using P = d(2dL + L + 1) + 1 parameters and the choice of activation functions in the perceptrons. Regarding the class of input graphs, we consider the class G_{d,≤u} consisting of graphs having d-dimensional features and color complexity at most u. Note that we only bound the number of colors appearing in a single graph and not the number of colors appearing in all graphs of the class.

Intuitively, the parameter u comes into play because we can reduce input graphs by combining 1-WL-equivalent vertices. The reduced graph has u vertices, and its edges have weights. One can run extended GNNs, taking edge weights into account, on the reduced graph without loss of information. Moreover, one can tie extended GNNs to FNNs. Hence, the parameter n can be replaced by u when analyzing the VC dimension of the GNN-related FNNs.

We now relate G_{d,n} and G_{d,≤u}. Since any graph of order at most n has at most n 1-WL colors, G_{d,n} ⊆ G_{d,≤n}. The bound below thus also complements the upper bound on G^B_{d,n} given earlier, but now for fixed-width GNNs. However, G_{d,≤u} may contain graphs of arbitrary order. For example, all regular graphs (of the same degree) belong to G_{d,≤1}. We also remark that there is no upper bound on the number of 1-WL distinguishable graphs for G_{d,≤u}, because we only bound the number of colors appearing in a single graph and not the number of colors appearing in all graphs of the class. For example, all regular graphs of arbitrary degrees are in G_{0,≤1}, but regular graphs of different degrees can be distinguished. As such, the bounds obtained earlier do not apply. Finally, in the bound below, input graphs can have real features.
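The following is a minimal sketch of the reduction described above: given a stable 1-WL coloring (for example from the `wl_refine` sketch in Section 2.1), vertices of the same color are merged into a single vertex of a reduced graph whose edges carry counts, so the reduced graph has at most u vertices, where u is the color complexity. The function name and the dictionary-based representation are our own; this illustrates the idea only, not the exact construction used in the proof.

```python
from collections import Counter, defaultdict

def reduce_by_wl_colors(adj, colors):
    """Merge 1-WL-equivalent vertices into a weighted 'reduced' graph.

    adj: dict vertex -> list of neighbors; colors: a stable 1-WL coloring.
    Returns (color_counts, edge_weights), where edge_weights[(c, c2)] is the
    number of neighbors of color c2 that a vertex of color c has.
    """
    color_counts = Counter(colors.values())
    edge_weights = defaultdict(int)
    done = set()
    for v, neighbors in adj.items():
        c = colors[v]
        if c in done:
            # Under a stable coloring, all vertices of one color see the same
            # multiset of neighbor colors, so one representative suffices.
            continue
        done.add(c)
        for u in neighbors:
            edge_weights[(c, colors[u])] += 1
    return color_counts, dict(edge_weights)
```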
Theorem 3.6. Assume d and L in N, and GNNs in GNNslp(d, L) using piece-wise polynomial activation functions with p > 0 pieces and degree δ ≥ 0. Let P = d(2dL + L + 1) + 1 be the number of parameters in the GNNs. For all u in N,

VC-dim_{G_{d,≤u}}(GNNslp(d, L)) ∈
  O(P log(puP))                  if δ = 0,
  O(LP log(puP))                 if δ = 1,
  O(LP log(puP) + L²P log(δ))    if δ > 1.

We note that the above result can be straightforwardly generalized to k-GNNs. These upper bounds, concerning the dependency on u, cannot be improved by more than a constant factor. Indeed, for the b graphs used in Proposition 3.5, u = O(2^b). Moreover, the simple GNNs in F_b belong to GNNslp(2, 2) and use piece-wise linear activation functions with p = 4 and δ = 1. Since they shatter b = O(log(u)) graphs, this matches the upper bound up to constant factors.

3.1. Implications of our theoretical results

Our results include the first lower bounds on GNNs' VC dimension and the first inherent connection between GNNs' VC dimension and the expressive power of the 1-WL. In particular, we find that a larger bit precision leads to a higher VC dimension and sample complexity. Therefore, our findings provide an additional incentive for reducing the precision of GNNs, besides the already-known benefits of reduced training and inference time. Furthermore, we show that the VC dimension bounds for GNNs can be tightened when considering graphs of low color complexity. This connection to the number of colors indicates that graphs with complex structures, captured by the 1-WL, may lead to a higher VC dimension. Moreover, if the 1-WL can distinguish a large set of graphs of a given dataset, our results imply that a sufficiently expressive GNN will require a large set of training samples to generalize well. Therefore, we can use the 1-WL to quickly assess GNNs' generalization ability on a given dataset. The same relation holds for k-GNNs and the k-WL. Moreover, our results extend easily to recent, more expressive GNN enhancements, e.g., subgraph-based (Bouritsas et al., 2020) or subgraph-enhanced GNNs (Bevilacqua et al., 2022; Qian et al., 2022). Hence, our results lead to a better understanding of how expressivity influences the learning performance of modern GNN architectures; see Appendix C.1.
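Following the suggestion above to use the 1-WL as a quick proxy for generalization ability, the sketch below counts distinct 1-WL color-histogram sequences across a dataset. This corresponds to the "# Histograms" quantity reported in the experiments (Table 2) and is a rough stand-in for the number of 1-WL-distinguishable graphs; it reuses the hypothetical `wl_refine` sketch from Section 2.1 and is our own illustration, not the authors' evaluation code.

```python
def count_wl_histograms(dataset, num_iterations):
    """Count distinct 1-WL color-histogram sequences across a dataset.

    dataset: iterable of (adj, labels) pairs, as in the wl_refine sketch, with
        initial labels drawn from a vocabulary shared across graphs.
    A shared RELABEL dictionary makes colors comparable across graphs; graphs
    with identical histogram sequences are treated as 1-WL-indistinguishable.
    """
    shared_relabel = {}
    seen = set()
    for adj, labels in dataset:
        _, histograms = wl_refine(adj, labels, num_iterations,
                                  relabel=shared_relabel)
        key = tuple(tuple(sorted(h.items())) for h in histograms)
        seen.add(key)
    return len(seen)
```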
Table 1: Train and test classification accuracies with different numbers of parameters, using five layers, studying how the number of parameters influences generalization.

| Dimension | Split | ENZYMES | MCF-7 | MCF-7H | MUTAGENICITY | NCI1 | NCI109 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | Train | 31.0 ±3.1 | 92.1 ±0.4 | 92.1 ±0.4 | 79.7 ±0.9 | 76.7 ±7.3 | 66.4 ±9.5 |
| 4 | Test | 25.3 ±5.2 | 92.4 ±0.2 | 92.3 ±0.3 | 75.8 ±0.9 | 72.4 ±7.1 | 63.6 ±9.1 |
| 16 | Train | 76.8 ±6.4 | 96.0 ±0.1 | 96.5 ±0.3 | 92.7 ±2.7 | 88.4 ±6.2 | 86.4 ±0.9 |
| 16 | Test | 41.7 ±9.4 | 93.2 ±0.5 | 93.1 ±0.4 | 79.8 ±2.2 | 76.1 ±2.1 | 78.6 ±1.9 |
| 256 | Train | 98.2 ±3.6 | 99.7 ±<0.1 | 99.9 ±<0.1 | 100.0 ±0.0 | 99.8 ±<0.1 | 97.5 ±2.1 |
| 256 | Test | 54.7 ±2.4 | 94.0 ±0.2 | 93.6 ±0.2 | 80.7 ±1.0 | 81.8 ±1.5 | 82.1 ±1.0 |
| 1 024 | Train | 99.8 ±0.2 | 99.8 ±<0.1 | 99.8 ±0.1 | 99.9 ±0.2 | 99.8 ±0.1 | 98.6 ±1.0 |
| 1 024 | Test | 54.3 ±2.3 | 93.8 ±0.3 | 93.6 ±0.2 | 81.7 ±0.8 | 80.5 ±1.0 | 82.9 ±0.9 |

[Figure 2: Difference between train and test accuracy over 500 training epochs for feature dimensions in {4, 16, 256, 1 024}; panel (a) shows ENZYMES. See Figure 5 in the appendix for additional results.]

4. Limitations, possible road maps, and future work

Although the results are the first ones explicitly drawing a tight connection between expressivity and generalization, there are still many open questions. First, excluding Proposition 3.1, the results investigate specific GNN classes using sum aggregation. Hence, the results should be extended to specific GNN layers commonly used in practice, such as that in Xu et al. (2019), and the effect of different aggregation functions, such as max or mean, should be studied in detail. Moreover, the results are only meaningful for discretely labeled graphs. Hence, they should be extended to attributed graphs. Secondly, although the experimental results in Section 5 suggest that our VC dimension bounds hold in practice to some extent, it is well known that they do not explain the generalization behavior of deep neural networks in the over-parameterized regime (Bartlett et al., 2019b), trained with variants of stochastic gradient descent. Therefore, it is a future challenge to understand how graph structure influences the generalization properties of over-parameterized GNNs, trained with variants of stochastic gradient descent, and what role the Weisfeiler-Leman algorithm plays in this context.

5. Experimental evaluation

In the following, we investigate how well the VC dimension bounds from the previous section hold in practice. Specifically, we answer the following questions.

Q1 How does the number of parameters influence GNNs' generalization performance?
Q2 How does the number of 1-WL-distinguishable graphs influence GNNs' generalization performance?
Q3 How does the bitlength influence a GNN's ability to fit random data?

The source code of all methods and evaluation procedures is available at https://www.github.com/chrsmrrs/wl_vs_vc.

Datasets. To investigate questions Q1 and Q2, we used the datasets ENZYMES (Borgwardt et al., 2005; Schomburg et al., 2004), MCF-7 (Yan et al., 2008), MCF-7H (Yan et al., 2008), MUTAGENICITY (Kazius et al., 2005; Riesen & Bunke, 2008), and NCI1 and NCI109 (Wale et al., 2008; Shervashidze et al., 2011), provided by Morris et al. (2020a). See Table 3 for dataset statistics and properties.
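The datasets above are part of the TUDataset collection (Morris et al., 2020a), and the experiments below are implemented with PyTorch Geometric (Fey & Lenssen, 2019), so one plausible way to load them is via its TUDataset class, sketched below under our own assumptions; the exact name strings (e.g., 'Mutagenicity' versus 'MUTAGENICITY'), the batch size, and the loading code in the authors' repository may differ.

```python
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

# Dataset identifiers assumed to match those of the TUDataset collection.
names = ['ENZYMES', 'MCF-7', 'MCF-7H', 'Mutagenicity', 'NCI1', 'NCI109']
datasets = {name: TUDataset(root='data', name=name) for name in names}

# 90%/10% random split, as in the experimental protocol described below;
# the batch size is not specified in the paper and is chosen for illustration.
dataset = datasets['ENZYMES'].shuffle()
split = int(0.9 * len(dataset))
train_loader = DataLoader(dataset[:split], batch_size=32, shuffle=True)
test_loader = DataLoader(dataset[split:], batch_size=32)
```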
Table 2: Train and test classification accuracies using different numbers of layers and a feature dimension of 64, studying how the number of different color histograms influences generalization.

| Layers | Split | ENZYMES | MCF-7 | MCF-7H | MUTAGENICITY | NCI1 | NCI109 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Train | 40.7 ±0.5 | 91.7 ±0.1 | 91.8 ±0.1 | 77.2 ±0.3 | 74.5 ±0.3 | 73.1 ±0.5 |
| 0 | Test | 33.7 ±1.6 | 91.9 ±<0.1 | 91.2 ±0.1 | 75.7 ±1.2 | 67.9 ±1.3 | 71.5 ±0.7 |
| 0 | Difference | 7.0 ±1.9 | -0.2 ±0.1 | 1.0 ±0.1 | 1.5 ±1.3 | 6.5 ±1.2 | 1.6 ±0.6 |
| 0 | # Histograms | 385 | 11 533 | 19 625 | 2 819 | 2 889 | 2 929 |
| 1 | Train | 66.7 ±3.6 | 91.8 ±0.1 | 92.1 ±<0.1 | 90.9 ±0.1 | 92.0 ±1.5 | 83.4 ±1.5 |
| 1 | Test | 52.3 ±5.0 | 91.9 ±<0.1 | 91.4 ±0.1 | 82.0 ±1.0 | 78.6 ±1.3 | 76.1 ±1.0 |
| 1 | Difference | 14.4 ±5.2 | <0.1 ±0.1 | 0.1 ±0.1 | 8.9 ±1.3 | 13.4 ±0.9 | 7.3 ±0.7 |
| 1 | # Histograms | 595 | 25 417 | 26 037 | 3 624 | 3 906 | 3 950 |
| 2 | Train | 93.5 ±2.1 | 92.0 ±0.2 | 91.9 ±0.3 | 96.9 ±1.9 | 98.3 ±0.5 | 91.1 ±0.5 |
| 2 | Test | 62.7 ±7.2 | 92.1 ±0.1 | 91.0 ±0.6 | 82.5 ±1.0 | 80.5 ±1.3 | 78.1 ±1.5 |
| 2 | Difference | 39.9 ±5.5 | -0.1 ±0.2 | 1.0 ±0.3 | 14.4 ±1.0 | 17.8 ±1.0 | 13.0 ±1.5 |
| 2 | # Histograms | 595 | 26 872 | 27 353 | 4 239 | 4 027 | 4 055 |
| 3 | Train | 98.0 ±2.5 | 92.1 ±0.3 | 92.1 ±0.2 | 99.4 ±0.9 | 99.8 ±0.1 | 93.6 ±1.2 |
| 3 | Test | 58.7 ±5.3 | 92.1 ±0.2 | 91.5 ±0.2 | 82.8 ±1.0 | 83.5 ±0.7 | 77.8 ±1.8 |
| 3 | Difference | 39.4 ±2.8 | 0.1 ±0.2 | 1.0 ±0.1 | 16.6 ±1.0 | 16.3 ±0.7 | 15.8 ±1.4 |
| 3 | # Histograms | 595 | 27 048 | 27 524 | 4 317 | 4 039 | 4 067 |
| 4 | Train | 99.8 ±0.3 | 92.0 ±0.1 | 92.2 ±0.2 | 99.1 ±0.2 | 99.8 ±<0.1 | 96.9 ±1.0 |
| 4 | Test | 62.7 ±2.5 | 92.1 ±0.1 | 91.5 ±0.2 | 82.7 ±0.8 | 83.2 ±0.4 | 79.8 ±1.2 |
| 4 | Difference | 37.1 ±2.5 | -0.1 ±0.1 | 1.0 ±0.1 | 16.4 ±0.7 | 16.6 ±0.4 | 17.2 ±0.8 |
| 4 | # Histograms | 595 | 27 059 | OOM | 4 317 | 4 039 | 4 067 |
| 5 | Train | 98.9 ±1.9 | 92.1 ±0.2 | 92.4 ±0.2 | 99.9 ±0.2 | 99.8 ±0.0 | 97.7 ±0.9 |
| 5 | Test | 57.0 ±3.9 | 92.3 ±0.2 | 91.6 ±0.2 | 83.0 ±0.8 | 84.1 ±1.1 | 79.6 ±0.5 |
| 5 | Difference | 41.9 ±2.9 | -0.2 ±0.2 | 1.0 ±0.2 | 16.9 ±0.7 | 15.7 ±1.1 | 18.1 ±0.5 |
| 5 | # Histograms | 595 | OOM | OOM | 4 317 | 4 039 | 4 067 |
| 6 | Train | 99.4 ±0.8 | 92.0 ±0.2 | 92.2 ±0.2 | 99.1 ±1.9 | 99.6 ±0.6 | 95.2 ±1.9 |
| 6 | Test | 54.0 ±2.3 | 92.2 ±0.2 | 91.4 ±0.4 | 83.5 ±1.0 | 83.4 ±1.3 | 79.2 ±1.3 |
| 6 | Difference | 44.4 ±1.9 | -0.2 ±0.1 | 1.0 ±0.2 | 15.6 ±1.2 | 16.2 ±0.9 | 16.0 ±2.1 |
| 6 | # Histograms | 595 | OOM | OOM | 4 317 | 4 039 | 4 067 |

For question Q3, to investigate the influence of bitlength on GNNs' VC dimension, we probed how well GNNs can fit random data. Hence, the experiments on these datasets aim at empirically verifying the VC dimension bounds concerning bitlength. To that end, we created a synthetic dataset; see Appendix G.3. Since it is challenging to simulate different bitlengths without specialized hardware, we resorted to simulating an increased bitlength via an increased feature dimension; see Appendix G.4. All experiments are therefore conducted with standard 32-bit precision. We also experimented with 64-bit precision but observed no clear difference. Furthermore, 16-bit precision proved numerically unstable in this setting.

Neural architectures. For the experiments regarding Q1 and Q2, we used the simple GNN layer described in Appendix G.1 with a ReLU activation function, ignoring possible edge labels. To answer question Q1, we fixed the number of layers to five and chose the feature dimension d in {4, 16, 256, 1 024}. To answer Q2, we set the feature dimension d to 64 and chose the number of layers from {0, . . . , 6}. For all experiments, we used sum pooling and a two-layer MLP for the final classification. To investigate Q3, we used the architecture described in Appendix G.2. In essence, we used a 2-layer MLP for the message generation function in each GNN layer and added batch normalization (Ioffe & Szegedy, 2015) before each non-linearity; we fixed the number of layers to 3 and varied the feature dimension d in {4, 16, 64, 256}.

Experimental protocol and model configuration. For the experiments regarding Q1 and Q2, we uniformly and at random chose 90% of a dataset for training and the remaining 10% for testing. We repeated each experiment five times and report mean test accuracies and standard deviations.
We optimized the standard cross-entropy loss for 500 epochs using the Adam optimizer (Kingma & Ba, 2015). Moreover, we used a learning rate of 0.001 across all experiments and no learning rate decay or dropout. For Q3, we set the learning rate to 10⁻⁴ and the number of epochs to 100 000, and repeated each experiment 50 times. All architectures were implemented using PyTorch Geometric (Fey & Lenssen, 2019) and executed on a workstation with 128GB RAM and an NVIDIA Tesla V100 with 32GB memory.

5.1. Results and discussion

In the following, we answer questions Q1 to Q3.

Q1. See Table 1 and Figure 2 (and Figure 5 in the appendix). Increasing the feature dimension d increases the average difference between train and test accuracies across all datasets. For example, on the ENZYMES dataset, the difference increases from around 5% for d = 4 to more than 45% for d = 1 024. However, we also observe that the difference does not increase when reaching near-perfect training accuracies, i.e., going from d = 256 to d = 1 024 does not increase the difference. Hence, the results show that the number of parameters plays a crucial role in GNNs' generalization ability, in accordance with Theorem 3.6.

Q2. See Table 2. The results indicate that the number of 1-WL-distinguishable graphs (m_{n,d,L}) influences GNNs' generalization properties. For example, on the MUTAGENICITY dataset, after two iterations, the number of unique histograms computed by the 1-WL stabilizes, and similarly, the generalization error stabilizes as well. Similar effects can be observed for the ENZYMES, NCI1, and NCI109 datasets. Hence, our results largely confirm Propositions 3.1 and 3.2.

[Figure 3: GNNs' ability to fit the synthetic dataset for different feature dimensions in {4, 16, 64, 256}; accuracy [%] as a function of the graph order |V| (10 to 90).]

Q3. See Figure 3. Increasing the feature dimension boosts the model's capacity to fit random class labels, indicating that an increased bitlength implies an increased VC dimension. For example, for an order of 70, a GNN using a feature dimension of 4 cannot reach an accuracy of over 75%. In contrast, feature dimensions 64 and 256 can almost fit such data. Moreover, for larger graphs, up to order 90, a GNN with a feature dimension of 256 can almost perfectly fit random class labels, with a feature dimension of 64 only slightly worse, confirming Proposition 3.5.

6. Conclusion

We investigated GNNs' generalization capabilities through the lens of VC dimension theory in different settings. Specifically, when not assuming a bound on the graphs' order, we showed that the VC dimension tightly depends on the bitlength of the GNNs' weights. We further showed that the number of colors computed by the 1-WL, besides the number of parameters and layers, influences the VC dimension. When a bound on the graphs' order is known, we upper- and lower-bounded GNNs' VC dimension via the maximal number of graphs distinguishable by the 1-WL. Thus, our theory provides the first link between expressivity results and generalization. Further, our theory also applies to a large set of recently proposed GNN enhancements.

Acknowledgements. Christopher Morris is partially funded by a DFG Emmy Noether grant (468502433) and an RWTH Junior Principal Investigator Fellowship under Germany's Excellence Strategy. Martin Grohe is partially funded by the European Union (ERC, SymSim, 101054974).
Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. Aamand, A., Chen, J. Y., Indyk, P., Narayanan, S., Rubinfeld, R., Schiefer, N., Silwal, S., and Wagner, T. Exponentially improving the complexity of simulating the Weisfeiler-Lehman test with graph neural networks. Ar Xiv preprint, 2022. Abboud, R., Ceylan, I. I., Grohe, M., and Lukasiewicz, T. The surprising power of graph neural networks with random node initialization. In Joint Conference on Artificial Intelligence, pp. 2112 2118, 2021. Arvind, V., K obler, J., Rattan, G., and Verbitsky, O. On the power of color refinement. In International Symposium on Fundamentals of Computation Theory, pp. 339 350, 2015. Arvind, V., Fuhlbr uck, F., K obler, J., and Verbitsky, O. On Weisfeiler-Leman invariance: Subgraph counts and related graph properties. In International Symposium on Fundamentals of Computation Theory, pp. 111 125, 2019. Atserias, A. and Maneva, E. N. Sherali-adams relaxations and indistinguishability in counting logics. SIAM Journal on Computing, 42(1):112 137, 2013. Atserias, A., Mancinska, L., Roberson, D. E., S amal, R., Severini, S., and Varvitsiotis, A. Quantum and nonsignalling graph isomorphisms. Journal of Combinatorial Theory, Series B, pp. 289 328, 2019. Azizian, W. and Lelarge, M. Characterizing the expressive power of invariant and equivariant graph neural networks. In International Conference on Learning Representations, 2021. Babai, L. Lectures on graph isomorphism. University of Toronto, Department of Computer Science. Mimeographed lecture notes, October 1979, 1979. Babai, L. Graph isomorphism in quasipolynomial time. In Symposium on Theory of Computing, pp. 684 697, 2016. Babai, L. and Kucera, L. Canonical labelling of graphs in linear average time. In Symposium on Foundations of Computer Science, pp. 39 46, 1979. Balcilar, M., H eroux, P., Ga uz ere, B., Vasseur, P., Adam, S., and Honeine, P. Breaking the limits of message passing graph neural networks. In International Conference on Machine Learning, pp. 599 608, 2021. Barabasi, A.-L. and Oltvai, Z. N. Network biology: Understanding the cell s functional organization. Nature Reviews Genetics, 5(2):101 113, 2004. Barcel o, P., Kostylev, E. V., Monet, M., P erez, J., Reutter, J. L., and Silva, J. P. The logical expressiveness of graph neural networks. In International Conference on Learning Representations, 2020. Barcel o, P., Geerts, F., Reutter, J. L., and Ryschkov, M. Graph neural networks with local graph parameters. In Advances in Neural Information Processing Systems, pp. 25280 25293, 2021. Barcel o, P., Galkin, M., Morris, C., and Orth, M. A. R. Weisfeiler and leman go relational. Ar Xiv preprint, 2022. Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20:63:1 63:17, 2019a. Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Ar Xiv preprint, 2019b. Baskin, I. I., Palyulin, V. A., and Zefirov, N. S. A neural device for searching direct correlations between structures and properties of chemical compounds. Journal of Chemical Information and Computer Sciences, 37(4): 715 721, 1997. Beaini, D., Passaro, S., L etourneau, V., Hamilton, W. L., Corso, G., and Li o, P. 
Directional graph networks. In International Conference on Machine Learning, pp. 748 758, 2021. Berkholz, C., Bonsma, P. S., and Grohe, M. Tight lower and upper bounds for the complexity of canonical colour refinement. Theory of Computing Systems, 60(4):581 614, 2017. Bevilacqua, B., Frasca, F., Lim, D., Srinivasan, B., Cai, C., Balamurugan, G., Bronstein, M. M., and Maron, H. Equivariant subgraph aggregation networks. In International Conference on Learning Representations, 2022. Bodnar, C., Frasca, F., Otter, N., Wang, Y. G., Li o, P., Mont ufar, G., and Bronstein, M. M. Weisfeiler and Lehman go cellular: CW networks. In Advances in Neural Information Processing Systems, pp. 2625 2640, 2021a. Bodnar, C., Frasca, F., Wang, Y., Otter, N., Mont ufar, G. F., Li o, P., and Bronstein, M. M. Weisfeiler and Lehman go topological: Message passing simplicial networks. In International Conference on Machine Learning, pp. 1026 1037, 2021b. Borgwardt, K. M., Ong, C. S., Sch onauer, S., Vishwanathan, S. V. N., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21 (Supplement 1):i47 i56, 2005. Borgwardt, K. M., Ghisu, M. E., Llinares-L opez, F., O Bray, L., and Rieck, B. Graph kernels: State-of-the-art and future challenges. Foundations and Trends in Machine Learning, 13(5 6), 2020. Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, M. M. Improving graph neural network expressivity via subgraph isomorphism counting. Ar Xiv preprint, 2020. Bruna, J., Zaremba, W., Szlam, A., and Le Cun, Y. Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representation, 2014. Cai, J., F urer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identifications. Combinatorica, 12(4):389 410, 1992. Chami, I., Abu-El-Haija, S., Perozzi, B., R e, C., and Murphy, K. Machine learning on graphs: A model and comprehensive taxonomy. Ar Xiv preprint, 2020. Chen, Z., Villar, S., Chen, L., and Bruna, J. On the equivalence between graph isomorphism testing and function approximation with gnns. In Advances in Neural Information Processing Systems, pp. 15868 15876, 2019. Chen, Z., Chen, L., Villar, S., and Bruna, J. Can graph neural networks count substructures? In Advances in Neural Information Processing Systems, 2020. Cotta, L., Morris, C., and Ribeiro, B. Reconstruction for powerful graph representations. In Advances in Neural Information Processing Systems, pp. 1713 1726, 2021. Dasoulas, G., Santos, L. D., Scaman, K., and Virmaux, A. Coloring graph neural networks for node disambiguation. In International Joint Conference on Artificial Intelligence, pp. 2126 2132, 2020. Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837 3845, 2016. Dell, H., Grohe, M., and Rattan, G. Lov asz meets Weisfeiler and Leman. In International Colloquium on Automata, Languages, and Programming, pp. 40:1 40:14, 2018. Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., G omez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224 2232, 2015. Easley, D. and Kleinberg, J. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010. El-Yaniv, R. and Pechyony, D. 
Transductive rademacher complexity and its applications. In Annual Conference on Learning Theory, pp. 157 171, 2007. Esser, P. M., Vankadara, L. C., and Ghoshdastidar, D. Learning theory can (sometimes) explain generalisation in graph neural networks. In Advances in Neural Information Processing Systems, pp. 27043 27056, 2021. Feng, J., Chen, Y., Li, F., Sarkar, A., and Zhang, M. How powerful are k-hop message passing graph neural networks. In Advances in Neural Information Processing Systems, 2022. Fey, M. and Lenssen, J. E. Fast graph representation learning with Py Torch Geometric. In International Conference on Learning Representations, Workshop on Representation Learning on Graphs and Manifolds, 2019. Frasca, F., Bevilacqua, B., Bronstein, M. M., and Maron, H. Understanding and extending subgraph GNNs by rethinking their symmetries. Co RR, 2022. F urer, M. On the combinatorial power of the Weisfeiler Lehman algorithm. In International Conference on Algorithms and Complexity, pp. 260 271, 2017. Gama, F., Marques, A. G., Leus, G., and Ribeiro, A. Convolutional neural network architectures for signals supported on graphs. IEEE Transactions on Signal Processing, 67(4):1034 1049, 2019. Garg, V. K., Jegelka, S., and Jaakkola, T. S. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning, pp. 3419 3430, 2020. Geerts, F. The expressive power of kth-order invariant graph networks. Ar Xiv preprint, 2020. Geerts, F. and Reutter, J. L. Expressiveness and approximation properties of graph neural networks. In International Conference on Learning Representations, 2022. Geerts, F., Mazowiecki, F., and P erez, G. A. Let s agree to degree: Comparing graph convolutional networks in the message-passing framework. In International Conference on Machine Learning, pp. 3640 3649, 2021. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263 1272, 2017. Goldreich, O. Introduction to testing graph properties. In Property Testing. Springer, 2010. Goller, C. and K uchler, A. Learning task-dependent distributed representations by backpropagation through structure. In International Conference on Neural Networks, pp. 347 352, 1996. Grohe, M. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Cambridge University Press, 2017. Grohe, M. Word2vec, Node2vec, Graph2vec, X2vec: Towards a theory of vector embeddings of structured data. In Symposium on Principles of Database Systems, pp. 1 16, 2020. Grohe, M. The logic of graph neural networks. In Symposium on Logic in Computer Science, pp. 1 17, 2021. Grohe, M. The descriptive complexity of graph neural networks. Ar Xiv preprint, 2023. Grohe, M. and Otto, M. Pebble games and linear equations. Journal of Symbolic Logic, 80(3):797 844, 2015. Grohe, M., Schweitzer, P., and Wiebking, D. Deep Weisfeiler Leman. Ar Xiv preprint, 2020. Hamilton, W. L., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024 1034, 2017. Horn, M., Brouwer, E. D., Moor, M., Moreau, Y., Rieck, B., and Borgwardt, K. M. Topological graph neural networks. In International Conference on Learning Representations, 2022. Huang, Y., Peng, X., Ma, J., and Zhang, M. Boosting the cycle counting power of graph neural networks with I2GNNs. Ar Xiv preprint, 2022. Immerman, N. and Lander, E. 
Describing graphs: A firstorder approach to graph canonization. In Complexity Theory Retrospective: In Honor of Juris Hartmanis on the Occasion of His Sixtieth Birthday, July 5, 1988, pp. 59 81, 1990. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448 456, 2015. Ju, H., Li, D., Sharma, A., and Zhang, H. R. Generalization in graph neural networks: Improved pac-bayesian bounds on graph diffusion. Ar Xiv preprint, 2023. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., ˇZ ıdek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., and Hassabis, D. Highly accurate protein structure prediction with Alpha Fold. Nature, 2021. Karpinski, M. and Macintyre, A. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54 (1):169 176, 1997. Kazius, J., Mc Guire, R., and Bursi, R. Derivation and validation of toxicophores for mutagenicity prediction. Journal Medicinal Chemistry, 48(13):312 320, 2005. Kiefer, S. Power and Limits of the Weisfeiler-Leman Algorithm. Ph D thesis, Department of Computer Science, RWTH Aachen University, 2020. Kiefer, S. and Mc Kay, B. D. The iteration number of Colour Refinement. In International Colloquium on Automata, Languages, and Programming, pp. 73:1 73:19, 2020. Kiefer, S. and Schweitzer, P. Upper bounds on the quantifier depth for graph differentiation in first-order logic. In Symposium on Logic in Computer Science, pp. 287 296, 2016. Kiefer, S., Schweitzer, P., and Selman, E. Graphs identified by logics with counting. In International Symposium on Mathematical Foundations of Computer Science, pp. 319 330, 2015. Kiefer, S., Ponomarenko, I., and Schweitzer, P. The Weisfeiler-Leman dimension of planar graphs is at most 3. Journal of the ACM, 66(6):44:1 44:31, 2019. Kim, J., Nguyen, T. D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners. Ar Xiv preprint, 2022. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. Kireev, D. B. Chemnet: A novel neural network based method for graph/property mapping. Journal of Chemical Information and Computer Sciences, 35(2):175 180, 1995. Kriege, N. M., Morris, C., Rey, A., and Sohler, C. A property testing framework for the theoretical expressivity of graph kernels. In International Joint Conference on Artificial Intelligence, pp. 2348 2354, 2018. Kriege, N. M., Johansson, F. D., and Morris, C. A survey on graph kernels. Applied Network Science, 5(1):6, 2020. Levie, R., Monti, F., Bresson, X., and Bronstein, M. M. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97 109, 2019. Li, P., Wang, Y., Wang, H., and Leskovec, J. Distance encoding: Design provably more powerful neural networks for graph representation learning. 
In Advances in Neural Information Processing Systems, 2020. Liao, R., Urtasun, R., and Zemel, R. S. A PAC-Bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations, 2021. Lichter, M., Ponomarenko, I., and Schweitzer, P. Walk refinement, walk logic, and the iteration number of the Weisfeiler-Leman algorithm. In Symposium on Logic in Computer Science, pp. 1 13, 2019. Maehara, T. and NT, H. A simple proof of the universality of invariant/equivariant graph neural networks. Ar Xiv preprint, 2019. Malkin, P. N. Sherali Adams relaxations of graph isomorphism polytopes. Discrete Optimization, pp. 73 97, 2014. Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. In Advances in Neural Information Processing Systems, pp. 2153 2164, 2019. Martinkus, K., Papp, P. A., Schesch, B., and Wattenhofer, R. Agent-based graph neural networks. Ar Xiv preprint, 2022. Maskey, S., Lee, Y., Levie, R., and Kutyniok, G. Generalization analysis of message passing neural networks on large random graphs. In Advances in Neural Information Processing Systems, 2022. Merkwirth, C. and Lengauer, T. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159 1168, 2005. Micheli, A. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 20(3):498 511, 2009. Micheli, A. and Sestito, A. S. A new neural network model for contextual processing of graphs. In Italian Workshop on Neural Nets Neural Nets and International Workshop on Natural and Artificial Immune Systems, pp. 10 17, 2005. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2018. Monti, F., Boscaini, D., Masci, J., Rodol a, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model cnns. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5425 5434, 2017. Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, pp. 4602 4609, 2019. Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. TUDataset: A collection of benchmark datasets for learning with graphs. Ar Xiv preprint, 2020a. Morris, C., Rattan, G., and Mutzel, P. Weisfeiler and Leman go sparse: Towards higher-order graph embeddings. In Advances in Neural Information Processing Systems, 2020b. Morris, C., L., Y., Maron, H., Rieck, B., Kriege, N. M., Grohe, M., Fey, M., and Borgwardt, K. Weisfeiler and Leman go machine learning: The story so far. Ar Xiv preprint, 2021. Morris, C., Rattan, G., Kiefer, S., and Ravanbakhsh, S. Speq Nets: Sparsity-aware permutation-equivariant graph networks. In International Conference on Machine Learning, pp. 16017 16042, 2022. M uller, L., Galkin, M., Morris, C., and Ramp asek, L. Attending to graph transformers. Co RR, abs/2302.04181, 2023. Murphy, R. L., Srinivasan, B., Rao, V. A., and Ribeiro, B. Relational pooling for graph representations. In International Conference on Machine Learning, pp. 4663 4673, 2019. Nguyen, H. and Maehara, T. Graph homomorphism convolution. In International Conference on Machine Learning, pp. 7306 7316, 2020. Papp, P. A. and Wattenhofer, R. A theoretical comparison of graph neural network extensions. In International Conference on Machine Learning, pp. 
17323 17345, 2022. Papp, P. A., K. Martinkus, L. F., and Wattenhofer, R. Drop GNN: Random dropouts increase the expressiveness of graph neural networks. In Advances in Neural Information Processing Systems, 2021. Puny, O., Lim, D., Kiani, B. T., Maron, H., and Lipman, Y. Equivariant polynomials for graph neural networks. Ar Xiv preprint, 2023. Qian, C., Rattan, G., Geerts, F., Morris, C., and Niepert, M. Ordered subgraph aggregation networks. In Advances in Neural Information Processing Systems, 2022. Riesen, K. and Bunke, H. IAM graph database repository for graph based pattern recognition and machine learning. In Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshop, pp. 287 297, 2008. Rosenbluth, E., T onshoff, J., and Grohe, M. Some might say all you need is sum. Ar Xiv preprint, 2023. Sato, R., Yamada, M., and Kashima, H. Random features strengthen graph neural networks. In SIAM International Conference on Data Mining, pp. 333 341, 2021. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61 80, 2009. Scarselli, F., Tsoi, A. C., and Hagenbuchner, M. The Vapnik Chervonenkis dimension of graph and recursive neural networks. Neural Networks, pp. 248 259, 2018. Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D. Brenda, the enzyme database: updates and major new developments. Nucleic acids research, pp. D431 3, 2004. Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, pp. 2539 2561, 2011. Simonovsky, M. and Komodakis, N. Dynamic edgeconditioned filters in convolutional neural networks on graphs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 29 38, 2017. Sperduti, A. and Starita, A. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3):714 35, 1997. Stokes, J., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N., Mac Nair, C., French, S., Carfrae, L., Bloom-Ackerman, Z., Tran, V., Chiappino-Pepe, A., Badran, A., Andrews, I., Chory, E., Church, G., Brown, E., Jaakkola, T., Barzilay, R., and Collins, J. A deep learning approach to antibiotic discovery. Cell, pp. 688 702.e13, 2020. Talak, R., Hu, S., Peng, L., and Carlone, L. Neural trees for learning on graphs. Ar Xiv preprint, 2021. Thiede, E. H., Zhou, W., and Kondor, R. Autobahn: Automorphism-based graph neural nets. In Advances in Neural Information Processing Systems, pp. 29922 29934, 2021. Tolstikhin, I. O. and Lopez-Paz, D. Minimax lower bounds for realizable transductive classification. Ar Xiv preprint, 2016. T onshoff, J., Ritzert, M., Wolf, H., and Grohe, M. Graph learning with 1D convolutions on random walks. Ar Xiv preprint, 2021. Vapnik, V. N. The Nature of Statistical Learning Theory. Springer, 1995. Veliˇckovi c, P., Cucurull, G., Casanova, A., Romero, A., Li o, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018. Verma, S. and Zhang, Z. Stability and generalization of graph convolutional neural networks. In International Conference on Knowledge Discovery & Data Mining, pp. 1539 1548, 2019. Vignac, C., Loukas, A., and Frossard, P. Building powerful and equivariant graph neural networks with structural message-passing. In Advances in Neural Information Processing Systems, 2020. Wale, N., Watson, I. 
A., and Karypis, G. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14 (3):347 375, 2008. Weisfeiler, B. On Construction and Identification of Graphs. Springer, 1976. Weisfeiler, B. and Leman, A. The reduction of a graph to canonical form and the algebra which appears therein. Nauchno-Technicheskaya Informatsia, 2 (9):12 16, 1968. English translation by G. Ryabov is available at https://www.iti.zcu.cz/wl2018/ pdf/wl_paper_translation.pdf. Wijesinghe, A. and Wang, Q. A new perspective on how graph neural networks go beyond weisfeiler-lehman? . In International Conference on Learning Representations, 2022. Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5449 5458, 2018. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. Yan, X., Cheng, H., Han, J., and Yu, P. S. Mining significant graph patterns by leap search. In International Conference on Management of Data, pp. 433 444, 2008. Yehudai, G., Fetaya, E., Meirom, E. A., Chechik, G., and Maron, H. From local structures to size generalization in graph neural networks. In International Conference on Machine Learning, pp. 11975 11986, 2021. You, J., Gomes-Selman, J., Ying, R., and Leskovec, J. Identity-aware graph neural networks. In AAAI Conference on Artificial Intelligence, pp. 10737 10745, 2021. Zhang, B., Feng, G., Du, Y., He, D., and Wang, L. A complete expressiveness hierarchy for subgraph gnns via subgraph weisfeiler-lehman tests. Co RR, abs/2302.07090, 2023a. Zhang, B., Luo, S., Wang, L., and He, D. Rethinking the expressive power of gnns via graph biconnectivity. Ar Xiv preprint, 2023b. Zhang, M. and Li, P. Nested graph neural networks. In Advances in Neural Information Processing Systems, pp. 15734 15747, 2021. Zhao, L., Jin, W., Akoglu, L., and Shah, N. From stars to subgraphs: Uplifting any GNN with local structure awareness. In International Conference on Learning Representations, 2022. A. Extended related work In the following, we discuss more related work. Expressive power of k-WL The Weisfeiler Leman algorithm constitutes one of the earliest and most natural approaches to isomorphism testing (Weisfeiler, 1976; Weisfeiler & Leman, 1968), and the theory community has heavily investigated it over the last few decades (Grohe, 2017). Moreover, the fundamental nature of the k-WL is evident from various connections to other fields such as logic, optimization, counting complexity, and quantum computing. The power and limitations of the k-WL can be neatly characterized in terms of logic and descriptive complexity (Babai, 1979; Immerman & Lander, 1990), Sherali-Adams relaxations of the natural integer linear optimization problem for the graph isomorphism problem (Atserias & Maneva, 2013; Grohe & Otto, 2015; Malkin, 2014), homomorphism counts (Dell et al., 2018), and quantum isomorphism games (Atserias et al., 2019). In their seminal paper, Cai et al. (1992) showed that, for each k, a pair of non-isomorphic graphs of size O(k) exists not distinguished by the k-WL. Kiefer (2020) gives a thorough survey of more background and related results concerning the expressive power of the k-WL. For k = 1, the power of the algorithm has been completely characterized (Arvind et al., 2015; Kiefer et al., 2015). 
Moreover, upper bounds on the running time (Berkholz et al., 2017) and the number of iterations for k = 1 (Kiefer & McKay, 2020) and the non-oblivious k = 2 (Kiefer & Schweitzer, 2016; Lichter et al., 2019) have been shown. For k in {1, 2}, Arvind et al. (2019) studied the ability of the (non-oblivious) k-WL to detect and count fixed subgraphs, extending the work of Fürer (2017). The former was refined in Chen et al. (2020). Kiefer et al. (2019) showed that the non-oblivious 3-WL completely captures the structure of planar graphs. The algorithm for logarithmic k plays a prominent role in the recent result of Babai (2016) improving the best-known running time for the graph isomorphism problem. Recently, Grohe et al. (2020) introduced the framework of Deep Weisfeiler Leman algorithms, which allows the design of more powerful graph isomorphism tests than Weisfeiler Leman-type algorithms. Finally, the emerging connections between the Weisfeiler Leman paradigm and graph learning are described in two recent surveys (Grohe, 2020; Morris et al., 2021).

B. Oblivious k-WL

Intuitively, to surpass the limitations of the 1-WL, the k-WL colors ordered subgraphs instead of a single vertex.4 More precisely, given a graph G, the k-WL, for $k \geq 2$, colors the tuples from $V(G)^k$ instead of the vertices. By defining a neighborhood between these tuples, we can define a coloring similar to the 1-WL. Formally, let G be a graph, and let $k \geq 2$. In each iteration $t \geq 0$, the algorithm, similarly to the 1-WL, computes a coloring $C^k_t \colon V(G)^k \to \mathbb{N}$. In the first iteration, $t = 0$, two tuples $\mathbf{v}$ and $\mathbf{w}$ in $V(G)^k$ get the same color if they have the same atomic type, i.e., $C^k_0(\mathbf{v}) := \mathrm{atp}(\mathbf{v})$. Here, the atomic type $\mathrm{atp} \colon V(G)^k \to \mathbb{N}$, for $k > 0$, is defined such that $\mathrm{atp}(\mathbf{v}) = \mathrm{atp}(\mathbf{w})$ for $\mathbf{v}$ and $\mathbf{w}$ in $V(G)^k$ if and only if the mapping $\varphi \colon v_i \mapsto w_i$ induces a partial isomorphism, i.e., $v_i = v_j \Leftrightarrow w_i = w_j$ and $(v_i, v_j) \in E(G) \Leftrightarrow (\varphi(v_i), \varphi(v_j)) \in E(G)$. Then, for each iteration $t > 0$, $C^k_t$ is defined by
$$C^k_t(\mathbf{v}) := \mathrm{RELABEL}\bigl(C^k_{t-1}(\mathbf{v}),\, M_t(\mathbf{v})\bigr),$$
with $M_t(\mathbf{v})$ the multiset
$$M_t(\mathbf{v}) := \bigl(\{\!\{C^k_{t-1}(\phi_1(\mathbf{v}, w)) \mid w \in V(G)\}\!\}, \dots, \{\!\{C^k_{t-1}(\phi_k(\mathbf{v}, w)) \mid w \in V(G)\}\!\}\bigr),$$
where $\phi_j(\mathbf{v}, w) := (v_1, \dots, v_{j-1}, w, v_{j+1}, \dots, v_k)$. That is, $\phi_j(\mathbf{v}, w)$ replaces the $j$-th component of the tuple $\mathbf{v}$ with the vertex $w$. Hence, two tuples are adjacent or $j$-neighbors if they differ in the $j$-th component (or are equal, in the case of self-loops). Consequently, two tuples $\mathbf{v}$ and $\mathbf{w}$ with the same color in iteration $(t-1)$ get different colors in iteration $t$ if there exists a $j$ in $[k]$ such that the number of $j$-neighbors of $\mathbf{v}$ and $\mathbf{w}$, respectively, colored with a certain color differs. We run the k-WL algorithm until convergence, i.e., until, for $t$ in $\mathbb{N}$,
$$C^k_t(\mathbf{v}) = C^k_t(\mathbf{w}) \Longleftrightarrow C^k_{t+1}(\mathbf{v}) = C^k_{t+1}(\mathbf{w})$$
holds for all $\mathbf{v}$ and $\mathbf{w}$ in $V(G)^k$. For such $t$, we define $C^k_\infty(\mathbf{v}) := C^k_t(\mathbf{v})$ for $\mathbf{v}$ in $V(G)^k$. At convergence, we call the partition of $V(G)^k$ induced by $C^k_\infty$ the stable partition. We set $C^k_\infty(v) := C^k_\infty(v, \dots, v)$ and refer to this as the color of the vertex $v$. Similarly to the 1-WL, to test whether two graphs G and H are non-isomorphic, we run the k-WL in parallel on both graphs; a small code sketch of the refinement just described follows below.

4There exist two definitions of the k-WL, the so-called oblivious k-WL and the folklore or non-oblivious k-WL; see Grohe (2021). There is a subtle difference in how they aggregate neighborhood information. Within the graph learning community, it is customary to abbreviate the oblivious k-WL as k-WL, a convention we follow in this paper.
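As an illustration of the refinement just described, here is a minimal Python sketch (our own code, not the authors' implementation): graphs are given as adjacency dictionaries, every ordered k-tuple is colored starting from its atomic type, and colors are refined by the multisets of colors of its j-neighbors. The function and variable names (`oblivious_k_wl_colors`, `triangle`, `path`) are illustrative choices, not from the paper.

```python
import itertools
from collections import Counter

def oblivious_k_wl_colors(adj, k, iterations):
    # Color every ordered k-tuple, starting from its atomic type, and refine by the
    # multisets of colors of its j-neighbors, j = 1, ..., k (oblivious k-WL as above).
    vertices = list(adj)

    def atomic_type(tup):
        # Equality pattern and adjacency pattern of the ordered tuple.
        eq = tuple(tup[i] == tup[j] for i in range(k) for j in range(k))
        ed = tuple(tup[j] in adj[tup[i]] for i in range(k) for j in range(k))
        return (eq, ed)

    colors = {v: atomic_type(v) for v in itertools.product(vertices, repeat=k)}
    for _ in range(iterations):
        colors = {
            v: (colors[v],
                tuple(frozenset(Counter(colors[v[:j] + (w,) + v[j + 1:]]
                                        for w in vertices).items())
                      for j in range(k)))
            for v in colors
        }
    # Colors are canonical nested objects, so histograms are comparable across graphs.
    return Counter(colors.values())

# Two small non-isomorphic graphs whose 2-WL color histograms differ.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(oblivious_k_wl_colors(triangle, 2, 2) != oblivious_k_wl_colors(path, 2, 2))  # True
```

Because the colors are kept as canonical nested objects rather than arbitrary integer labels, the resulting histograms can be compared directly across graphs, which mirrors the non-isomorphism test discussed next (the paper compares vertex colors, i.e., colors of diagonal tuples; comparing all tuple colors is a closely related check).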
Then, if the two graphs have a different number of vertices colored $c$, for some $c$ in $\mathbb{N}$, the k-WL distinguishes the graphs as non-isomorphic. By increasing k, the algorithm becomes more powerful in distinguishing non-isomorphic graphs, i.e., for each $k \geq 2$, there are non-isomorphic graphs distinguished by the (k+1)-WL but not by the k-WL (Cai et al., 1992). For a finite set of graphs $\mathcal{S} \subseteq \mathcal{G}$, we run the algorithm in parallel over all graphs in the set $\mathcal{S}$.

C. k-order GNNs

By generalizing Equation (1) in Section 2.2, following Morris et al. (2019; 2020b; 2021), we can derive k-GNNs computing features for all k-tuples in $V(G)^k$, for $k > 0$, defined over the set of vertices of an attributed graph $G = (V(G), E(G), a)$ with features from $\mathbb{R}^d$. Concretely, in each layer $t > 0$, for each k-tuple $\mathbf{v} = (v_1, \dots, v_k)$ in $V(G)^k$, we compute a feature
$$h^{(t)}_{\mathbf{v}} := \mathrm{UPD}^{(t)}\Bigl(h^{(t-1)}_{\mathbf{v}},\, \mathrm{AGG}^{(t)}\bigl(\{\!\{h^{(t-1)}_{\phi_1(\mathbf{v}, w)} \mid w \in V(G)\}\!\}, \dots, \{\!\{h^{(t-1)}_{\phi_k(\mathbf{v}, w)} \mid w \in V(G)\}\!\}\bigr)\Bigr).$$
Initially, for $t = 0$, we set $h^{(0)}_{\mathbf{v}} := \mathrm{UPD}\bigl([\mathrm{atp}(\mathbf{v}), a(v_1), \dots, a(v_k)]\bigr) \in \mathbb{R}^d$, i.e., the atomic type and the attributes of a given k-tuple determine its initial feature. In the above, UPD, $\mathrm{UPD}^{(t)}$, and $\mathrm{AGG}^{(t)}$ may be differentiable parameterized functions, e.g., neural networks. In the case of graph-level tasks, e.g., graph classification, one additionally uses
$$h_G := \mathrm{READOUT}\bigl(\{\!\{h^{(L)}_{\mathbf{v}} \mid \mathbf{v} = (v, \dots, v),\; v \in V(G)\}\!\}\bigr) \in \mathbb{R}^d$$
to compute a single vectorial graph representation based on the learned k-tuple features after iteration L.

C.1. Transferring VC bounds from GNNs to k-order GNNs and other more expressive architectures

In the following, we briefly sketch how Propositions 3.1 and 3.2, Corollary 3.3, and Theorem 3.6 can be lifted to k-GNNs. First, observe that we can simulate the computation of a k-GNN via a GNN on a suitably defined auxiliary graph. That is, the auxiliary graph contains a vertex for each k-tuple, and an edge labeled $j$ connects two k-tuples if they are $j$-neighbors, for $j$ in $[k]$; see Morris et al. (2021) for details. Using a 1-WL-equivalent GNN taking edge labels into account, we can extend Propositions 3.1 and 3.2 and Corollary 3.3 to k-GNNs. Similar reasoning applies to Theorem 3.6, i.e., we can apply the proof technique from Appendix F.3 to this auxiliary graph.

C.1.1. ARCHITECTURES BASED ON SUBGRAPH INFORMATION

Further, we note that Propositions 3.1 and 3.2 also easily extend to recent GNN enhancements, e.g., subgraph-based GNNs (Bouritsas et al., 2020) or subgraph-enhanced GNNs (Bevilacqua et al., 2022; Qian et al., 2022). Since suitably defined variations of the 1-WL, incorporating subgraph information at initialization, upper bound these architectures' expressive power, we can directly apply the reasoning behind the proofs of Propositions 3.1 and 3.2 to these cases. Hence, the architectures' VC dimensions are also tightly related to the number of graphs distinguishable by the respective 1-WL variants.

D. Relationship between VC dimension and generalization error

If we can bound the VC dimension of a hypothesis class $\mathcal{C}$ of GNNs, we directly get insights into its generalization ability, i.e., the difference between the empirical error $R_S(h)$ and the true error $R_{\mathcal{D}}(h)$ for $h \in \mathcal{C}$ and a data-generating distribution $\mathcal{D}$.

Theorem D.1. Let $\mathcal{C}$ be a class of GNNs with finite VC dimension $\text{VC-dim}(\mathcal{C}) = d$. Then, for all $\varepsilon > 0$ and $\delta \in (0, 1)$, using $m \geq c\,(d + \log(1/\delta))/\varepsilon^2$ samples, for a universal constant $c$, and for all data-generating distributions $\mathcal{D}$, we have
$$\Pr_{S \sim \mathcal{D}^m}\bigl(\forall h \in \mathcal{C} \colon |R_S(h) - R_{\mathcal{D}}(h)| \leq \varepsilon\bigr) \geq 1 - \delta.$$
This result was first proven by Vladimir Vapnik and Alexey Chervonenkis in the 1960s; see, e.g., Mohri et al. (2018) for a proof. The short sketch below illustrates how the resulting generalization gap shrinks with the sample size.
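To make Theorem D.1 concrete, the following snippet evaluates one standard VC uniform-convergence bound on the gap $|R_S(h) - R_{\mathcal{D}}(h)|$ (cf. Mohri et al., 2018). The precise constants differ between textbook statements, so both the formula's constants and the chosen numbers (d = 50, delta = 0.05) are illustrative assumptions, and the function name is ours.

```python
import math

def vc_generalization_gap(m, d, delta):
    # One standard VC uniform-convergence bound on |R_S(h) - R_D(h)| that holds
    # simultaneously for all h with probability >= 1 - delta (cf. Mohri et al., 2018).
    # Constants vary between versions of the theorem; the numbers are illustrative.
    return (math.sqrt(2 * d * math.log(math.e * m / d) / m)
            + math.sqrt(math.log(1 / delta) / (2 * m)))

# The guaranteed gap shrinks as the sample grows, for a class of VC dimension d = 50.
for m in (1_000, 10_000, 100_000):
    print(m, round(vc_generalization_gap(m, d=50, delta=0.05), 3))
```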
E. Simple GNNs

We here provide more detail on the simple GNNs mentioned in Section 2.2. That is, for given d and L in $\mathbb{N}$, we define the class $\mathsf{GNN}_{\mathrm{mlp}}(d, L)$ of simple GNNs as L-layer GNNs for which, according to Equation (1), for each $t$ in $[L]$, the aggregation function $\mathrm{AGG}^{(t)}$ is simply summation and the update function $\mathrm{UPD}^{(t)}$ is a multilayer perceptron $\mathrm{mlp}^{(t)} \colon \mathbb{R}^{2d} \to \mathbb{R}^d$ of width at most d. Similarly, the readout function in Equation (2) consists of a multilayer perceptron $\mathrm{mlp} \colon \mathbb{R}^d \to \mathbb{R}$ applied to the sum of all vertex features computed in layer L.5 More specifically, GNNs in $\mathsf{GNN}_{\mathrm{mlp}}(d, L)$ compute on a graph $(G, L)$ in $\mathcal{G}_d$, for each $v \in V(G)$,
$$h^{(t)}_v := \mathrm{mlp}^{(t)}\Bigl(h^{(t-1)}_v,\, \sum_{u \in N(v)} h^{(t-1)}_u\Bigr) \in \mathbb{R}^d, \qquad (4)$$
for $t$ in $[L]$ and $h^{(0)}_v := L_v$, and
$$h_G := \mathrm{mlp}\Bigl(\sum_{v \in V(G)} h^{(L)}_v\Bigr) \in \mathbb{R}. \qquad (5)$$
We also consider an even simpler subclass $\mathsf{GNN}_{\mathrm{slp}}(d, L)$ of $\mathsf{GNN}_{\mathrm{mlp}}(d, L)$ in which the multilayer perceptrons are in fact single-layer perceptrons. That is, Equation (4) is replaced by
$$h^{(t)}_v := \sigma_t\Bigl(h^{(t-1)}_v W^{(t)}_1 + \sum_{u \in N(v)} h^{(t-1)}_u W^{(t)}_2 + b^{(t)}\Bigr) \in \mathbb{R}^d, \qquad (6)$$
where $W^{(t)}_1 \in \mathbb{R}^{d \times d}$ and $W^{(t)}_2 \in \mathbb{R}^{d \times d}$ are weight matrices, $b^{(t)} \in \mathbb{R}^{1 \times d}$ is a bias vector, and $\sigma_t \colon \mathbb{R} \to \mathbb{R}$ is an activation function, for $t$ in $[L]$. Similarly, Equation (5) is replaced by
$$h_G := \sigma_{L+1}\Bigl(\sum_{v \in V(G)} h^{(L)}_v w + b\Bigr) \in \mathbb{R}, \qquad (7)$$
with $w \in \mathbb{R}^{d \times 1}$ a weight vector and $b \in \mathbb{R}$ a bias value of the final readout layer. Also, $\sigma_{L+1} \colon \mathbb{R} \to \mathbb{R}$ is an activation function. We can thus represent elements in $\mathsf{GNN}_{\mathrm{slp}}(d, L)$ more succinctly by the following tuple of parameters,
$$\Theta = \bigl(W^{(1)}_1, W^{(1)}_2, b^{(1)}, \dots, W^{(L)}_1, W^{(L)}_2, b^{(L)}, w, b\bigr),$$
together with the tuple of activation functions $\sigma = (\sigma_1, \dots, \sigma_L, \sigma_{L+1})$. We can equivalently view $\Theta$ as an element of $\mathbb{R}^{d(2dL + L + 1) + 1}$. Each $\Theta \in \mathbb{R}^{d(2dL + L + 1) + 1}$ and $\sigma = (\sigma_1, \dots, \sigma_{L+1})$ induces a permutation-invariant graph function
$$\mathrm{gnn}_{\Theta,\sigma} \colon \mathcal{G}_d \to \mathbb{R} \colon (G, L) \mapsto \mathrm{gnn}_{\Theta,\sigma}(G, L) := h_G,$$
with $h_G$ as defined in Equation (7).

5For simplicity, we assume that all feature dimensions of the layers are fixed to $d \in \mathbb{N}$ and that the readout layer returns a scalar.

F. Missing proofs

In the following, we outline missing proofs from the main paper.

F.1. Proofs of Proposition 3.1 and Proposition 3.2

We start with the general upper bound on the VC dimension in terms of the number of graphs distinguishable by the 1-WL.

Proposition F.1 (Proposition 3.1 in the main text). For all n, d, and L, the maximal number of graphs of order at most n with d-dimensional boolean features that can be shattered by L-layer GNNs is bounded by the maximal number $m_{n,d,L}$ of 1-WL-distinguishable graphs. That is,
$$\text{VC-dim}_{\mathcal{G}^{\mathrm{B}}_{d,n}}\bigl(\mathsf{GNN}(L)\bigr) \leq m_{n,d,L}.$$
Proof. Clearly, every set $S$ of $m_{n,d,L} + 1$ graphs from $\mathcal{G}^{\mathrm{B}}_{d,n}$ contains at least two graphs $G$ and $G'$ not distinguishable by the 1-WL. Since GNNs cannot distinguish 1-WL-indistinguishable graphs (Morris et al., 2019; Xu et al., 2019), they cannot tell $G$ and $G'$ apart and hence cannot shatter $S$. Hence, the VC dimension is at most $m_{n,d,L}$.

We next show a corresponding lower bound. In fact, the lower bound already holds for the class of simple GNNs of arbitrary width, that is, for GNNs in $\mathsf{GNN}_{\mathrm{mlp}}(L) := \bigcup_{d \in \mathbb{N}} \mathsf{GNN}_{\mathrm{mlp}}(d, L)$.

Proposition F.2 (Proposition 3.2 in the main paper). For all n, d, and L, all $m_{n,d,L}$ 1-WL-distinguishable graphs of order at most n with d-dimensional boolean features can be shattered by sufficiently wide L-layer GNNs. Hence,
$$\text{VC-dim}_{\mathcal{G}^{\mathrm{B}}_{d,n}}\bigl(\mathsf{GNN}(L)\bigr) = m_{n,d,L}.$$
Proof. For all $i \in [m_{n,d,L}]$, choose $G_i \in \mathcal{G}^{\mathrm{B}}_{d,n}$ such that $S = \{G_1, \dots, G_{m_{n,d,L}}\}$ consists of the maximum number of graphs in $\mathcal{G}^{\mathrm{B}}_{d,n}$ pairwise distinguishable by the 1-WL after L iterations; a brute-force sketch of counting such 1-WL equivalence classes for tiny graphs follows below.
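As an aside to the proof, the quantity $m_{n,d,L}$ can be estimated by brute force for very small graphs. The sketch below is our own simplification (graphs of order exactly n, a constant vertex feature instead of boolean d-dimensional features), so the printed count only illustrates how 1-WL equivalence classes are tallied; all function names are ours.

```python
import itertools
from collections import Counter

def wl1_histogram(adj, labels, num_iterations):
    # 1-WL color refinement; colors are kept as canonical nested objects so that
    # histograms are directly comparable across different graphs.
    colors = {v: labels[v] for v in adj}
    for _ in range(num_iterations):
        colors = {v: (colors[v], frozenset(Counter(colors[u] for u in adj[v]).items()))
                  for v in adj}
    return Counter(colors.values())

def count_wl1_classes(n, num_iterations):
    # Enumerate all graphs on exactly n vertices with a constant vertex feature and
    # count how many distinct 1-WL color histograms appear after the given number
    # of iterations -- a brute-force, simplified stand-in for m_{n,d,L}.
    histograms = set()
    pairs = list(itertools.combinations(range(n), 2))
    for mask in range(2 ** len(pairs)):
        adj = {v: set() for v in range(n)}
        for i, (u, v) in enumerate(pairs):
            if mask >> i & 1:
                adj[u].add(v)
                adj[v].add(u)
        hist = wl1_histogram(adj, {v: 1 for v in adj}, num_iterations)
        histograms.add(frozenset(hist.items()))
    return len(histograms)

print(count_wl1_classes(4, 2))  # distinct 1-WL histograms among all graphs on 4 vertices
```

Since the actual statement concerns graphs of order at most n with boolean d-dimensional features, the true $m_{n,d,L}$ is at least as large as this restricted count.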
We next show that the class of simple GNNs that are wide enough, that is, $\mathsf{GNN}_{\mathrm{mlp}}(d', L)$ for large enough $d'$, is sufficiently rich to shatter $S$. That is, we show that for each $\mathcal{T} \subseteq S$ there is a $\mathrm{gnn}_{\mathcal{T}}$ in $\mathsf{GNN}_{\mathrm{mlp}}(d', L)$ such that, for all $i \in [m_{n,d,L}]$,
$$\mathrm{gnn}_{\mathcal{T}}(G_i) = \begin{cases} 1 & \text{if } G_i \in \mathcal{T}, \\ 0 & \text{otherwise.} \end{cases}$$
This shows that $S$ is shattered by $\mathsf{GNN}_{\mathrm{mlp}}(d', L)$ and hence its VC dimension is at least $|S| = m_{n,d,L}$, as desired.

Overview of the construction Intuitively, we will show that $\mathsf{GNN}_{\mathrm{mlp}}(d', L)$, with $d'$ large enough, is powerful enough to return a one-hot encoding of the color histograms of graphs in $S$. That is, there is a simple GNN gnn in $\mathsf{GNN}_{\mathrm{mlp}}(d', L)$ which, in the MLP of its readout layer, embeds a graph $G$ in $S$ as a vector $h_G \in \{0, 1\}^{m_{n,d,L}}$ satisfying $(h_G)_i = 1$ if and only if $G$ and $G_i$ have the same 1-WL color histogram, for $i \in [m_{n,d,L}]$. Then, we extend the readout multilayer perceptron of gnn by one more layer such that, on input $G$, the revised GNN evaluates to the scalar
$$g_G := \mathrm{sign}\bigl(h_G w^{\top} - 1\bigr) \in \{0, 1\},$$
with $w \in \mathbb{R}^{1 \times m_{n,d,L}}$. We observe that, given $\mathcal{T} \subseteq S$, it suffices to let the parameter vector $w$ be the indicator vector of $\mathcal{T}$. Indeed, this ensures that $g_G = 1$ if and only if $G$ is in a color class included in $\mathcal{T}$. We can explore all such subsets $\mathcal{T}$ of $S$ by varying $w$; hence, this GNN shatters $S$.

Encoding 1-WL colors via GNNs We proceed with the construction of the required GNN. For simplicity of exposition, in the description below we construct GNN layers of non-uniform width; one can easily obtain uniform width by padding each layer. First, by Morris et al. (2019, Theorem 2), there exists a GNN architecture with feature dimension (at most) $n$ and consisting of $L$ layers such that, for each $G_i$ in $S$, it computes 1-WL-equivalent vertex features $f_v \in \mathbb{R}^{1 \times n}$ for $v \in V(G_i)$. That is, for vertices $v$ and $w$ in $V(G_i)$, it holds that
$$f_v = f_w \iff C^1_L(v) = C^1_L(w).$$
We note here that we can construct a single GNN architecture for all graphs by applying Morris et al. (2019, Theorem 2) to the disjoint union of the graphs in $S$. This increases the width from $n$ to $n m_{n,d,L}$.

Encoding 1-WL histograms via GNNs Moreover, again by (Morris et al., 2020b, Theorem 2), there exist $W \in \mathbb{R}^{n m_{n,d,L} \times n m_{n,d,L}}$ and $b \in \mathbb{R}^{n m_{n,d,L}}$ such that
$$\sigma\Bigl(\sum_{v \in V(G)} f_v W + b\Bigr) = \sigma\Bigl(\sum_{v \in V(H)} f_v W + b\Bigr) \iff h_G = h_H,$$
for graphs $G$ and $H$ in $S$. We use ReLU as the activation function $\sigma$ here, just as in Morris et al. (2019); other activation functions could be used as well (Grohe, 2021). Hence, for each graph in $S$, we have a vector in $\mathbb{R}^{1 \times n m_{n,d,L}}$ uniquely encoding it. Since the number of vertices $n$ is fixed, there exists a number $M \in \mathbb{N}$ such that $M\sigma\bigl(\sum_{v \in V(G)} f_v W + b\bigr) \in \mathbb{N}^{1 \times n m_{n,d,L}}$ for all $G$ in $S$. Moreover, observe that there exists a matrix $W'' \in \mathbb{N}^{n m_{n,d,L} \times 2 m_{n,d,L}}$ such that
$$M\sigma\Bigl(\sum_{v \in V(G)} f_v W + b\Bigr) W'' = M\sigma\Bigl(\sum_{v \in V(H)} f_v W + b\Bigr) W'' \iff h_G = h_H,$$
for graphs $G$ and $H$ in $S$. For example, we can set
$$W'' := \begin{pmatrix} K^{n m_{n,d,L}-1} & \cdots & K^{n m_{n,d,L}-1} \\ \vdots & & \vdots \\ K^{0} & \cdots & K^{0} \end{pmatrix} \in \mathbb{N}^{n m_{n,d,L} \times 2 m_{n,d,L}}$$
for sufficiently large $K > 1$. Hence, the above GNN architecture computes a vector $k_G \in \mathbb{N}^{2 m_{n,d,L}}$ containing $2 m_{n,d,L}$ occurrences of a natural number $\mathrm{col}(G)$ uniquely encoding the color histogram of each graph $G$ in $S$. We next turn $k_G$ into our desired $h_G$ as follows. We first define an intermediate vector $h'_G$ whose entries will be used to check which color histogram is returned. More specifically, we define
$$h'_G := \mathrm{lsig}\bigl(k_G\,\mathrm{diag}(w') + b''\bigr),$$
with $w' = (1, -1, 1, -1, \dots, 1, -1) \in \mathbb{R}^{2 m_{n,d,L}}$ and $b'' = (-c_1 + 1,\, c_1 + 1,\, -c_2 + 1,\, c_2 + 1,\, \dots,\, -c_{m_{n,d,L}} + 1,\, c_{m_{n,d,L}} + 1) \in \mathbb{R}^{2 m_{n,d,L}}$, where $c_i$ is the number encoding the $i$-th color histogram. We note that, for odd $i$,
$$(h'_G)_i := \mathrm{lsig}\bigl(\mathrm{col}(G) - c_i + 1\bigr) = \begin{cases} 1 & \text{if } \mathrm{col}(G) \geq c_i, \\ 0 & \text{otherwise,} \end{cases}$$
and for even i, (h G)i := lsig( col(G) + ci + 1) = ( 1 col(G) ci 0 otherwise. In other words, ((h G)i, (h G)i+1) are both 1 if and only if col(G) = ci. We thus obtain h G by combining ((h G)i, (h G)i+1) using an AND encoding (e.g., lsig(x + y 1)) applied to pairs of consecutive entries in h G. That is, h G := lsig 1 0 0 1 0 0 0 1 0 0 1 0 ... ... ... ... 0 0 1 0 0 1 (1, 1, . . . , 1) We thus see that a 3-layer MLP suffices for the readout layer of the simple GNN, finishing the proof. We remark that the maximal width is 2nmn,d,L, so we can take d = 2nmn,d,L. F.2. Proof of Proposition 3.5 We now prove Proposition 3.5. Proposition F.3 (Proposition 3.5 in the main text). There exists a family Fb of simple 2-layer GNNs of width two and of bitlength O(b) using a piece-wise linear activation such that its VC dimension is exactly b. Proof. We first show the lower bound. We fix some n 1. We shall construct a family of GNNs whose weights have bitlength O(n) and a family of n graphs shattered by these GNNs. Thereto, for all x = (x1, . . . , xn) in {0, 1}n, we let i=1 (2 2i+1 + xi2 2i). Written in binary, we have ρ(x) = 0.1x11x21x3 . . . 1xn. Observe that 1 2 ρ(x) 1. (8) For 1 k n, we let ρk(x) := ρ (xk+1, . . . , xk+n) = i=1 (2 2i+1 + xk+i2 2i), where xk+i := 0 for k + i > n. Then it follows from (8) that 1 2 ρk(x) 1. (9) We claim that ρk(x) = 22kρ(x) i=1 22(k i)+1 i=1 22(k n i)+1 ! | {z } =:ak i=1 22(k i)xi | {z } :=bk(x) Indeed, we have 22(k i)+1 + xi22k 2i 22(k i)+1 + xi22(k i) i=n+1 22(k i)+1 i=1 22(k i)+1 i=n+1 22(k i)+1 + i=1 xi22(k i) + 22(k i)+1 + xi22(k i) i=1 22(k i)+1 i=1 22(k n i)+1 + i=1 xi22(k i) + 2 2i+1 + xk+i2 2i i=1 22(k i)+1 i=1 22(k n i)+1 + i=1 xi22(k i) + xk + ρk(x) = ak + bk(x) + xk + ρk(x), which proves (10). Now let ck(x) := bk(x) + ak + 1. Then by (9) and (10), we have 2 4kρ(x) ck(x) xk. (11) For x = (x1, . . . , xn) and y = (y1, . . . , yn) in {0, 1}n, we write x =k y if xi = yi for some i < k. Observe that x =k y implies bk(x) bk(y) 4 and thus ck(x) ck(y) 4. (12) Let A : R R be the continuous piecewise-linear function defined by 0 if x < 0, 2x if 0 x < 1 2 x < 1, 3 2x if 1 x < 3 Since xk {0, 1}, by (11) we have xk = A 4kρ(x) ck(x) . (13) If follows from (12) that for y with y =k x we have A 4kρ(x) ck(y) = 0. (14) Let Ck := n ck(y) y {0, 1}no . Then xk = X c Ck A 4kρ(x) c . (15) Note that the only dependence on x of the right-hand side of (15) is in ρ(x), because Ck does not depend on x. Observe that |Ck| = 2k 1, because ck(y) only depends on y1, . . . , yk 1 {0, 1} and is distinct for distinct values of the yi. We have i=1 22(k i)+1 i=1 22(k n i)+1 | {z } =:s 1 i=0 4i s = 2 Thus 2 3 4k 1 1 ak 2 Furthermore, i=1 22(k i) = i=1 4k i = 4 Thus 2 3 4k 1 c 2 3 4k 1 1 + 1 = 4k 1. (16) Now for each ρ in R we construct a 2-layer GNN Gρ as follows: Initially, all nodes v carry the 1-dimensional feature h(0) v := 1. The first layer computes the 2-dimensional feature h(1) v,1 h(1) v,2 h(1) v,1 := X w N(v) ρ h(0) w ρ, h(1) v,2 := X w N(v) h(0) w 1. The second layer computes the 1-dimensional feature h(2) v defined by w N(v) h(1) w,2 The readout functions just takes the sum of all the h(2) v . We define a graph Fk as follows. The graph Fk is a forest of height 2. Fk has a root node rc for every c Ck. Each rc has a child sc and 4k additional children tc,1, . . . , tc,4k. The tc,i are leaves. Each sc has children uc,1, . . . , uc,c. The uc,i are leaves. Now we run the GNN Gρ on Fk with ρ = ρ(x) for some x = (x1, . . . , xn) in {0, 1}n. We have h(0) v = 1 for all v in V (Fk). 
h(1) tc,i,1 = h(1) uc,i,1, h(1) sc,1 = cρ, h(1) rc,1 = 4kρ, h(1) tc,i,2 = h(1) uc,i,1 = 0, h(1) sc,2 = c, h(1) rc,2 = 4k. h(2) tc,i = A 4k = 0 h(2) uc,i = A c = 0 h(2) sc = A cρ 4k = 0 h(2) rc = A 4kρ c = ( xk if c = ck(x) 0 otherwise by (13) and (14). To see that the first three equalities hold, recall that A(x) = 0 only if 0 < x < 3 2. Thus A( 4k) = 0. Moreover, by (16) we have 2 c and thus A(c) = 0. Finally, A cρ 4k = 0 because ρ < 1 and c 4k 1 by (16) and therefore cρ 4k < 0. As there is exactly one node rc with c = ck(x), the readout is P v V (Fk) h(2) v = xk. Hence Gρ(x)(Fk) = xi Thus the GNNs Gρ(x) for x {0, 1}n shatter the set {F1, . . . , Fn}. Since the bitlength is upper bounded by O(b) and the number of parameters in the above construction is constant, the hypothesis set is finite, and the upper bound follows from standard learning-theoretic results; see, e.g., (Mohri et al., 2018). F.3. Proof of Theorem 3.6 In the following, we outline the proof of Theorem 3.6. First, we define feedforward neural networks and show how simple GNNs can be interpreted as such. Feedforward neural networks A feedforward neural network (FNN) is specified by a tuple N = (N, β, γ) where N describes the underlying architecture and where β and γ define the parameters or weights. More specifically, N = V N , EN , i1, . . . , ip, o1, . . . , oq, αN where (V N , EN ) is a finite DAG with p input nodes i1, . . . , ip of in-degree 0, and q output nodes o1, . . . , oq of out-degree 0. No other nodes have inor out-degree zero. Moreover, αN is a function assigning to each node v V N \{i1, . . . , ip} an activation function α(v) : R R. Furthermore, the function β : V N \{i1, . . . , ip} R is a function assigning biases to nodes, and finally, the function γ : EN R assigns weights to edges. For an FNN N, we define its size s as the number of biases and weights, that is s = |V N | p + |EN |. Given an FNN N = (N, β, γ), we get a function fnn N : Rp Rq defined as follows. For all v in V N , we define a function h N v : Rp R such for a = (a1, . . . , ap) in Rp, h N v (a) := ( aj if v = ij for j [p], u N+(v) γ(u, v)h N u (a) + β(v) otherwise. Finally, fnn N : Rp Rq is defined as a 7 fnn N(a) := h N o1(a), . . . , h N oq(a) . Simple GNNs as FNNs We next connect simple GNNs in GNNslp(d, L) to FNNs. As described in Section E such GNNs are specified by L+1 activation functions σ := (σ1, . . . , σL+1) and a weight vector Θ in Rd(2d L+L+1)+1 describing weight matrices and bias vectors in all the layers. We show that for any attributed graph of order at most n G = (G, L) in Gn,d with G = (V (G), E(G)) and L in Rn d there exists an architecture NG(σ) such that for any weight assignment Θ in Rd(2d L+L+1)+1 of the GNN, there exists βΘ : V NG R and γΘ : ENG R, satisfying gnnΘ,σ(G, L) = fnn N=(NG(σ),βΘ,γΘ)(L ), (17) where L in Rnd is the (column-wise) concatenation of the rows of the matrix L. Moreover, NG(σ) is of polynomial size in the number of vertices and edges in G, the feature dimension d, and the number of layers L. Furthermore, NG(σ) has only a single output node o. The idea behind the construction of NG(σ) is to consider the tree unraveling or unrolling, see, e.g., (Morris et al., 2020b), of the computation of gnnΘ,σ(G, L) but instead of a tree we represent the computation more concisely as a DAG. The DAG NG(σ) is defined as follows. The node set V NG consists of the following nodes. We have input nodes iv,j for v in V (G) and j in [d] which will take the vertex labels Lvj in R as value. 
For each t in [L], we include the following nodes: n(t) v,j for v in V (G), j in [d]. Finally, we have a single output node o. We thus have d(L + 1)|V (G)| + 1 nodes in total. The edge set ENG consists of the following edges. We have edges encoding the adjacency structure of the graph G in every layer. More specifically, we have an edge eu,j,v,k,t := (n(t 1) u,j , n(t) v,k) whenever u in N(v) {v} and where u and v are in V (G), j and k in [d], and t in {2, . . . , L}. We also have edges from the input nodes iu,j to n(1) v,k for all u in NG(v) {v} and where u and v are in V (G) and j and k in [d]. Finally, we have edges connecting the last layer nodes to the output, i.e., edges ev,j := (n(L) v,j , o) for all v in V (G) and j in [d]. We thus have d|V (G)| + d2((L 1)(E(G) + V (G)) + (E(G) + V (G))) edges in total. Finally, we define the activation functions. αN (n(t) v,j) := σt for all v in V (G), j in [d] and t in [L], and αN (o) := σL+1. This fixes the architecture NG(σ). We next verify Equation (17). Let Θ in Rd(2d L+L+1)+1 and G = (G, L) in Gn,d. Let NG(σ) be the architecture defined above for the graph G. We define βΘ and γΘ, as follows. βΘ := V NG R is such that βΘ(n(t) v,j) = b(t) j for all v in V (G), j in [d] and t in [L]. We also set βΘ(o) = b. γΘ : ENG R is such that γΘ(eu,j,v,k,t) := W (2,t) jk if u = v and γΘ(eu,j,v,k,t) := W (1,t) jk otherwise, and γΘ(ev,j) = wj, for u and v in V (G), j and k in [d], and t in [L]. Note that we share weights across edges that correspond to the same edge in the underlying graph. Now, if we denote by f (t) v the feature vector in Rd computed in the tth layer by the GNN gnnΘ,σ(G, L), then it is readily verified, by induction on the layers, that for N = (NG, ασ, βΘ, γΘ): h N n(t) v,j = f (t) v,j and thus h N o := σL+1 j [d] wjf (L) vj + b from which Equation (17) follows. We next expand the construction by obtaining an FNN that simulates GNNs on multiple input graphs. More specifically, consider a set G consisting of m graphs G1 = (G1, L1), . . . , Gm = (Gm, Lm) in Gn,d and a GNN in GNNslp(d, L) using activation functions σ = (σ1, . . . , σL+1) in its layers. We first construct an FNN architecture NGi(σ) for each graph separately, as explained above, such that for every Θ in RP , there exists βΘ and γΘ such that gnnΘ,σ(Gi, Li) = fnn NGi:=(NGi(σ),βΘ,γΘ)(L i), with L i is the concatenation of rows in Li, as before. Then, let NG(σ) be the FNN architecture obtained as the disjoint union of NG1(σ), . . . , NGm(σ). If we denote by oi the output node of NGi(σ) in NG(σ), then we have again that for every Θ in RP , there exists βΘ and γΘ such that gnnΘ,σ(Gi) = h NG:=(NG(σ),βΘ,γΘ) oi (L ) for all i in [m], where L := (L 1, . . . , L m). We recall that, for t in [L], the nodes in NG(σ) are of the form ν(t),g v,j for v in VGg, j in [d] and g in [m]. In layer L + 1, we have the output nodes o1, . . . , om. If the order of the graphs in G is at most n, then every layer, except the last one, has ndm nodes. The last layer only has m nodes. Piece-wise polynomial activation functions A piece-wise polynomial activation function σp,δ : R R is specified by a partition of R into p intervals Ij and corresponding polynomials pj(x) of degree at most δ, for j in [p]. That is, σp,δ(x) = pj(x) if x in Ij. Examples of σp,δ(x) are: sign(x): R R: x 7 1x 0 for which p = 2 and δ = 0, relu(x): R R: x 7 max(0, x) for which p = 2 and δ = 1, and lsig(x): R R: x 7 max(0, min(1, x)) for which p = 3 and δ = 1. 
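To make the definition concrete, the three examples above can be encoded as follows (a sketch using our own representation of the pieces; `piecewise_poly` and its arguments are not notation from the paper).

```python
import bisect

def piecewise_poly(breakpoints, coefficient_lists):
    # sigma_{p,delta} as described above: `breakpoints` are the p - 1 interval
    # boundaries, `coefficient_lists[j]` the coefficients (lowest degree first) of
    # the polynomial used on the j-th interval.
    def sigma(x):
        j = bisect.bisect_right(breakpoints, x)
        return sum(c * x ** i for i, c in enumerate(coefficient_lists[j]))
    return sigma

sign = piecewise_poly([0.0], [[0.0], [1.0]])                    # p = 2, delta = 0
relu = piecewise_poly([0.0], [[0.0], [0.0, 1.0]])               # p = 2, delta = 1
lsig = piecewise_poly([0.0, 1.0], [[0.0], [0.0, 1.0], [1.0]])   # p = 3, delta = 1
assert (relu(-2.0), lsig(0.25), lsig(3.0)) == (0.0, 0.25, 1.0)
```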
Piece-wise linear activation functions are of the form σp,1, i.e., they are defined in terms of linear polynomials. The parameters of an activation function σp,δ consist of the coefficients of the polynomials involved and the boundary points (numbers) of the intervals in the partition of R. Proof of Theorem 3.6 We next derive upper bounds on the VC dimension of GNNs by the approach used in Bartlett et al. (2019a), where they used it for bounding the VC dimension of FNNs using piecewise polynomial activation functions. Their approach allows for recovering known bounds on the VC dimension of FNNs in a unified manner. As we will see, the bounds by Bartlett et al. (2019a) for FNNs naturally translate to bounds for GNNs. Assume d and L in N. In this section, we will consider the subclass of GNNs in GNNslp(d, L) that use piece-wise polynomial activation functions with p > 0 pieces and degree δ 0. As explained in Section E, d(2d L + L + 1) + 1 is the total number of (learnable) parameters for our GNNs in GNNslp(d, L). As shorthand notation, we define P := d(2d L + L + 1) + 1. We first bound VC-dim Gd,n GNNslp(d, L) and then use this bound to obtain a bound for VC-dim Gd, u GNNslp(d, L) . We take G consisting of m graphs G1 = (G1, L1), . . . , Gm = (Gm, Lm) in Gn,d and consider the FNN architecture NG(σ) define above with output nodes o1, . . . , om. Let tresh : R R such that tresh(x) = 1 if x 2/3 and tresh(x) = 0 if x 1/3. We will bound K := tresh(h NG o1 (L )), . . . , tresh(h NG om (L )) : NG := (NG(σ), βΘ, γΘ), Θ RP , as this number describes how many 0/1 patterns can occur when Θ ranges over RP . These 0/1 patterns correspond, by the construction of NG and the semantics of its output nodes, to how many of the input graphs in G can be shattered. To bound K using the approach in Bartlett et al. (2019a) we need to slightly change the activation function σL+1 used in the FNN architecture. The reason is that Bartlett et al. use the sign function to turn a real-valued function into a 0/1-valued function. In contrast, we use the tresh function described above. Let σ := (σ1, . . . , σL, σL+1 1/3). We will bound K by bounding K := sign(h NG o1 (L )), . . . , sign(h NG om (L )) : NG := (NG(σ ), βΘ, γΘ), Θ RP . Note that K K because if tresh(σL+1)(x) = 1 then σL+1(x) 2/3 and hence σL+1(x) 1/3 > 0 and hence sign(σL+1(x) 1/3) = 1. Similarly, tresh(σL+1)(x) = 0 then σL+1(x) 1/3 and hence σL+1(x) 1/3 0 and hence sign(σL+1(x) 1/3) = 0. Then, if VC-dim Gd,n GNNslp(d, L) = m then K 2m. We thus bound K in terms of a function κ in m and then use 2m κ(m) to find an upper bound for m, i.e., an upper bound for VC-dim Gd,n GNNslp(d, L) . To bound K we can now use the approach of Bartlett et al. (2019a). In a nutshell, the entire parameter space RP is partitioned into pieces S1, . . . , Sℓ such that whenever Θ and Θ belong to the same piece (i) they incur the same sign pattern in {0, 1}m; and (ii) each h NG o1 (L ) is a polynomial of degree at most 1 + LδL. For δ = 0, these are polynomials in d + 1 variables, for δ > 0, the number of variables is P. Crucial in Bartlett s approach is the following lemma. Lemma F.4 (Lemma 17 in Bartlett et al. (2019a)). Let p1(x), . . . , pr(x) be polynomials of degree at most δ and in variables x satisfying |x| r, where | | denotes the number of components of a vector. Then sign(p1(Θ)), . . . , sign(pr(Θ)) | Θ R|x| 2 2erδ Given property (ii) of the pieces S1, . . . , Sℓ, we can apply the above lemma to the polynomials h NG o1 (L ), . . . 
, h NG om (L ) and, provided that the number of variables is at most m, obtain a bound for K by ℓ2 2em d+1 d+1 , when δ = 0, and K ℓ2 2em(1+LδL) P P for δ > 0. It then remains to bound the number of parts ℓ. Bartlett et al. show how to do this inductively (on the number of layers), again using Lemma F.4. More precisely, every node in the FNN architecture is associated with a number of polynomials. In layer t we have nmd nodes (number of computation nodes), and we associate with each node p polynomials (number of breakpoints of activation function) of degree at most 1 + (t 1)δt 1 and have (2d + 1)d variables for δ = 0 and (2d + 1)dt variables for δ > 0. We then get, for δ = 0, K 2L 2edmnp and for δ > 0, t=1 2 2edmnp(1 + (t 1)δt 1) (2d+1)dt 2 2em(1 + LδL) These are precisely the bounds given in Bartlett et al. (2019a) applied to our FNN. It is important, however, to note that this upper bound is only valid when Lemma F.4 can be applied, and hence, the number of variables must be smaller than the number of polynomials. For t in [L], we must have that the number of variables is less than the number of polynomials. We have nmdp polynomials, and up to layer t we have (2d + 1)dt parameters (variables). Hence, we must have (2d + 1)dt nmdp or (2d + 1)t nmp, and also D m or P nmdp for δ > 0. For δ = 0, we need (2d + 1)d nmdp (or (2d + 1) nmp) and d + 1 m. The following conditions are sufficient P m for δ > 0, and 2d + 1 m for δ = 0. ( ) FNN size reduction based on 1-WL We next bring in 1-WL into consideration by collapsing computation nodes in NG in each layer based on their equivalence with regards to 1-WL. In other words, if we assume that the graphs to be shattered have at most u vertex colors, then we have at most u rather than n computation nodes per graph. This implies that the parameter n in the above expression can be replaced by u. As a consequence, following Bartlett et al. (2019a), using the weighted AM GM inequality on the right-hand side of the inequalities (18) and (19), we obtain a bound for the VC dimension by finding maximal m satisfying, for δ = 0, 2m 2L+1 2ep(ud L + 1)m and for δ > 0, 2m 2L+1 2ep ud PL t=1(1 + (t 1)δt 1) + (1 + LδL) m 2 (2d + 1)d + P 2 (2d+1)d+P Such m are found in (Bartlett et al., 2019a), resulting in the following bounds. Proposition F.5 ((Bartlett et al., 2019a) modified for our FNN NG). VC-dim Gd, u GNNslp(d, L) O(Ld2 log(p(ud L + 1))) if δ = 0 O(L2d2 log(p(ud L + 1))) if δ = 1 O(L2d2 log(p(ud L + 1)) + L3d2 log(δ)) if δ > 1. We can simplify this to O(P log(pu P)), O(LP log(pu P)), and O(LP log(pu P) + L2P log(δ)), respectively. We note that since these are larger than P, the condition ( ) is satisfied. G. Additional experimental data and results Here, we report on additional experimental details, results, and dataset generation. G.1. Simple GNN layer used for Q1 and Q2 The simple GNN layer used in Q1 and Q2 updates the feature of vertex v at layer t via f (t) v := σ f (t 1) v W(t) 1 + X u N(v) f (t 1) u W(t) 2 Rd, (20) where W(t) 1 and W(t) 2 Rd d are parameter matrices. In the experiments, we used re LU activation functions. G.2. GNN architecture used for Q3 For the experiments on Q3 we extend the simple GNN layer from Equation (20) used in Q1 and Q2. We update the feature of vertex v at layer t via f (t) v := σ BN f (t 1) v W(t) + X u N(v) mlp(t)(f (t 1) u ) Rd, (21) where BN is a batch normalization module (Ioffe & Szegedy, 2015) and mlp(t) is a two-layer perceptron with architecture, Linear BN re LU Linear. 
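For concreteness, the two layer types of Equations (20) and (21) can be sketched in PyTorch as follows, assuming a dense adjacency matrix A with A[v, u] = 1 iff u is in N(v); the class names and the toy example are ours, and this is not the exact experimental code.

```python
import torch
from torch import nn

class SimpleLayer(nn.Module):
    # Equation (20): f_v <- relu(f_v W1 + sum_{u in N(v)} f_u W2), no bias.
    def __init__(self, d):
        super().__init__()
        self.w1 = nn.Linear(d, d, bias=False)
        self.w2 = nn.Linear(d, d, bias=False)

    def forward(self, a, f):
        return torch.relu(self.w1(f) + a @ self.w2(f))

class MLPMessageLayer(nn.Module):
    # Equation (21): messages come from a two-layer MLP (Linear -> BN -> ReLU -> Linear)
    # and the aggregated result is batch-normalized before the ReLU.
    def __init__(self, d):
        super().__init__()
        self.w = nn.Linear(d, d, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.BatchNorm1d(d), nn.ReLU(), nn.Linear(d, d))
        self.bn = nn.BatchNorm1d(d)

    def forward(self, a, f):
        return torch.relu(self.bn(self.w(f) + a @ self.mlp(f)))

# Toy usage on a 3-vertex path graph with d = 4 features.
a = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
f = torch.randn(3, 4)
print(SimpleLayer(4)(a, f).shape, MLPMessageLayer(4)(a, f).shape)
```

In practice one would use sparse message passing (e.g., via PyTorch Geometric) rather than the dense product used here for brevity.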
We therefore use a normalized two-layer MLP to generate messages in each layer, as sketched above. We found this change necessary to ensure smooth convergence on the challenging synthetic task posed by Q3, where the GNN has to memorize an arbitrary binary graph labeling. Moreover, in the experiments, we used ReLU activation functions.

G.3. Synthetic dataset generation

To address Q3, we aim to empirically estimate how well GNNs of different sizes can fit arbitrary binary labelings of graphs. We construct a synthetic dataset that focuses on a simple class of trees. Formally, for two natural numbers $m$ and $n$ in $\mathbb{N}_0$, we define the graph $T_{m,n} = (V_{m,n}, E_{m,n})$ as a directed tree with vertex set $V_{m,n} = \{v_0, \dots, v_{m+n+2}\}$. The root $v_0$ has two children $v_1$ and $v_2$. The remaining $m + n$ vertices form the leaves such that vertex $v_1$ has $m$ children and $v_2$ has $n$ children. Figure 4 provides a visual example.

Figure 4: A visualization of a $T_{m,n}$ (a root whose first child has $m$ leaf children and whose second child has $n$ leaf children; drawing omitted).

Table 3: Dataset statistics and properties.

Dataset        Number of graphs   Number of classes/targets   Number of nodes   Number of edges   Node labels   Edge labels
ENZYMES        600                6                           32.6              62.1
MCF-7          27 770             2                           26.4              28.5
MCF-7H         27 770             2                           47.3              49.4
MUTAGENICITY   4 337              2                           30.3              30.8
NCI1           4 110              2                           29.9              32.3
NCI109         4 127              2                           29.7              32.1

For a chosen $k \in \mathbb{N}$, $k \geq 4$, we define
$$\mathcal{T}_k = \bigl\{\, T_{m,n} \;\big|\; 0 \leq m \leq \lfloor (k-3)/2 \rfloor,\; n = k - 3 - m \,\bigr\}.$$
Therefore, $\mathcal{T}_k$ contains all distinct graphs $T_{m,n}$ with $|V_{m,n}(T_{m,n})| = k$. In particular, we observe $|\mathcal{T}_k| = \lfloor (k-3)/2 \rfloor + 1$. For $k \in \{10, 20, \dots, 90\}$, we aim to test how well a GNN with a given feature dimension $d \in \{4, 16, 64, 256\}$ can learn binary labelings $y \colon \mathcal{T}_k \to \{0, 1\}$ of $\mathcal{T}_k$. The labeling $y$ is obtained by sampling the label $y(T)$ uniformly at random for all $T$ in $\mathcal{T}_k$. We then train a GNN model with stochastic gradient descent to minimize the binary cross-entropy on the dataset $(\mathcal{T}_k, y)$. For each combination of $k$ and $d$, we repeat the experiment 50 times. We resample a new labeling $y$ and a new random initialization of the GNN model in each repetition.

G.4. Simulating bitlength via higher feature dimension

Here, we outline how to simulate a higher bitlength via a higher feature dimension. Assume the simple GNN layer of Equation (20). Clearly, we can express the matrices $W^{(t)}_1$ and $W^{(t)}_2$ as the sum of $k$ matrices with smaller bitlength, e.g., $W^{(t)}_2 = W^{(1,t)}_2 + \dots + W^{(k,t)}_2$. Hence, we can rewrite the aggregation in Equation (20) as
$$\sum_{u \in N(v)} f^{(t-1)}_u \bigl[W^{(1,t)}_2, \dots, W^{(k,t)}_2\bigr] M \in \mathbb{R}^d,$$
where $[\cdot]$ denotes column-wise matrix concatenation and $M \in \{0, 1\}^{kd \times d}$ is a matrix such that
$$\sigma\Bigl(f^{(t-1)}_v W^{(t)}_1 + \sum_{u \in N(v)} f^{(t-1)}_u \bigl[W^{(1,t)}_2, \dots, W^{(k,t)}_2\bigr] M\Bigr) = \sigma\Bigl(f^{(t-1)}_v W^{(t)}_1 + \sum_{u \in N(v)} f^{(t-1)}_u W^{(t)}_2\Bigr),$$
i.e., the matrix $M$ sums together columns of the aggregated features such that they have feature dimension $d$.

Figure 5: Difference between train and test accuracy for different feature dimensions in {4, 16, 256, 1 024} (one panel per dataset, epochs on the x-axis, the train/test accuracy gap in % on the y-axis; plots omitted).
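A quick numerical check of the identity behind the construction in Appendix G.4 (a NumPy sketch; the variable names and the integer-valued "low-bitlength" pieces are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
# Low-bitlength pieces whose sum is the original weight matrix W_2^(t).
w2_parts = [rng.integers(-2, 3, size=(d, d)).astype(float) for _ in range(k)]
w2 = sum(w2_parts)
concat = np.concatenate(w2_parts, axis=1)      # [W_2^(1,t), ..., W_2^(k,t)], shape (d, k*d)
m = np.vstack([np.eye(d)] * k)                 # M in {0,1}^(kd x d): stacked identities
f = rng.normal(size=(5, d))                    # aggregated neighbor features of 5 vertices
assert np.allclose(f @ concat @ m, f @ w2)     # both aggregations coincide
```

Widening the aggregation from d to kd columns and summing them back with M reproduces the original layer exactly, which is how a higher effective bitlength is simulated via a higher feature dimension in the experiments.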