# G2N2: Weisfeiler and Lehman go grammatical

Jason Piquenot, Aldo Moscatelli, Maxime Bérar, Pierre Héroux, Jean-Yves Ramel, Romain Raveaux, Sébastien Adam

LITIS Lab, University of Rouen Normandy, France; LIFAT Lab, University of Tours, France

This paper introduces a framework for formally establishing a connection between a portion of an algebraic language and a Graph Neural Network (GNN). The framework leverages Context-Free Grammars (CFG) to organize algebraic operations into generative rules that can be translated into a GNN layer model. As CFGs derived directly from a language tend to contain redundancies in their rules and variables, we present a grammar reduction scheme. By applying this strategy, we define a CFG that conforms to the third-order Weisfeiler-Lehman (3-WL) test using MATLANG. From this 3-WL CFG, we derive a GNN model, named G2N2, which is provably 3-WL compliant. Through various experiments, we demonstrate the superior efficiency of G2N2 compared to other 3-WL GNNs across numerous downstream tasks. Specifically, one experiment highlights the benefits of grammar reduction within our framework.

1 Introduction

In the last few years, the Weisfeiler-Lehman (WL) hierarchy, based on the eponymous polynomial-time isomorphism test (Lehman and Weisfeiler (1968)), has been the most common way to characterise the expressive power of Graph Neural Networks (GNNs) (Morris et al. (2019); Bodnar et al. (2021b;a); Zhang et al. (2023b)). A founding result was the proof that Message Passing Neural Networks (MPNNs) (Gilmer et al. (2017); Wu et al. (2020)) are at most as powerful as the first-order WL test (1-WL) (Morris et al. (2019); Xu et al. (2019)). As a consequence of this result, many subsequent contributions have focused on going beyond this 1-WL limit to reach more expressive GNNs. For instance, subgraph-based GNNs (Chen et al. (2020); Zhang and Li (2021); Zhao et al. (2022)) succeed in surpassing 1-WL expressive power but are still bounded by 3-WL (Frasca et al. (2022)).

One way to ensure k-WL expressive power is to mimic one iteration of the k-WL test (Maron et al. (2019a)) for each GNN layer. Taking as root the colouring and hashing steps of the k-WL algorithm, Maron et al. (2019a) shows that k-IGN, built on the basis of equivariant operators defined for IGN (Maron et al. (2019b)), is as powerful as the k-WL test. Since k-IGN works on k-th order tensors and since the cardinal of the basis is equal to the 2k-th Bell number, it is limited in practice by both the layer input memory consumption and the cardinal of the IGN operator basis, even for k = 3 (Li et al. (2020)). Concurrently, Provably Powerful Graph Network (PPGN) was also proposed in Maron et al. (2019a). It is able to mimic the second-order Folklore WL test (2-FWL, known to be equivalent to the 3-WL test (Huang and Villar (2021))) colouring and hashing steps with MLPs that are coupled together with matrix multiplication. Since PPGN only relies on matrices, it is a more tractable 3-WL architecture than 3-IGN (Zhang et al. (2023a)).

Taking an algebraic point of view, the groundbreaking paper Geerts (2020) reformulates the 1-WL and 3-WL tests as languages based on specific subsets of algebraic operations applied on the adjacency matrix. These fragments of the matrix language MATLANG (Brijder et al. (2019)), called ML(L1) and ML(L3), are shown to be as expressive as 1-WL and 3-WL (Geerts (2020)). Derived from this result, a model called GNNML1 was proposed in Balcilar et al.
(2020). GNNML1 is proven to be 1-WL equivalent since it is able to generate any sentence of ML(L1). A more expressive model called GNNML3 was proposed in the same paper, but it is only shown to be more expressive than 1-WL. This gap is due to the lack of a systematic procedure for deriving a GNN model from a given language fragment.

In this paper, we remove this bottleneck by proposing a generic methodology to produce a GNN from any fragment of an algebraic language, opening a new way to ensure expressiveness. The rationale behind our framework is to instantiate a language fragment by a reduced set of generative rules, translated into layer components of a GNN. Starting from the operation set L3, we build an exhaustive Context-Free Grammar (CFG) able to generate ML(L3). This CFG is reduced to remove unnecessary operations among the rules while keeping the equivalence with 3-WL. From the variables of this reduced CFG, GNN inputs are easily deduced. Then, the rules of the CFG determine the GNN layer update functions. As a result of this methodology, we propose a new model called Grammatical Graph Neural Network (G2N2) that is provably 3-WL. The contributions of this work are the following: (i) a generic framework to design a GNN from any fragment of an algebraic language; (ii) the instantiation of the framework on ML(L3), resulting in G2N2, a provably 3-WL GNN; (iii) an experimental validation of the set of rules; (iv) numerous experiments demonstrating that G2N2 outperforms existing 3-WL GNNs on various downstream tasks.

The paper is structured as follows. Section 2 introduces the necessary background by defining MATLANG, its link with WL, and CFGs. Section 3 describes our framework and presents the resulting G2N2 architecture, which is experimentally evaluated in section 4.

2 From MATLANG and Weisfeiler-Lehman to Context-Free Grammars and Languages

Let $G = (V, E)$ be an undirected graph where $V = [\![1, n]\!]$ is the set of $n$ nodes and $E \subseteq V \times V$ is the set of edges. The adjacency matrix $A \in \{0, 1\}^{n \times n}$ represents the connectivity of $G$.

Definition 2.1 (MATLANG (Brijder et al. (2019))) MATLANG is a matrix language with an allowed operation set $\{+, \cdot, \odot, {}^T, \mathrm{Tr}, \mathrm{diag}, \mathbb{1}, \times, f\}$ denoting respectively matrix addition, matrix and element-wise multiplications, transpose and trace computations, diagonal matrix creation from a vector, column vector of ones generation, scalar multiplication, and element-wise function applied on a scalar, a vector or a matrix. Restricting the set of operations to a subset $\mathcal{L}$ defines a fragment of MATLANG denoted ML($\mathcal{L}$). $s(X) \in \mathbb{R}$ is a sentence in ML($\mathcal{L}$) if it consists of consistent consecutive operations in $\mathcal{L}$, operating on a given matrix $X$ and resulting in a scalar value. As an example, $s(X) = \mathbb{1}^T (X^2 \odot \mathrm{diag}(\mathbb{1}))\, \mathbb{1}$ is a sentence of ML($\{\cdot, {}^T, \mathbb{1}, \mathrm{diag}, \odot\}$) computing the trace of $X^2$.

Equivalences between ML(L1) and ML(L3), with $L_1 = \{\cdot, {}^T, \mathbb{1}, \mathrm{diag}\}$ and $L_3 = \{\cdot, {}^T, \mathbb{1}, \mathrm{diag}, \odot\}$, and respectively the 1-WL and 3-WL tests are shown in Geerts (2020): two graphs are indistinguishable by the 1-WL (resp. 3-WL) test if and only if applying any sentence of ML(L1) (resp. ML(L3)) to their adjacency matrices gives the same scalar. Adding $\{+, \times, f\}$ does not improve the expressive power of the fragment (Geerts (2020)). Transposed in a Machine Learning context, a MATLANG-based GNN will inherit the 3-WL expressive power of ML(L3) if it is able to generate any sentence of the fragment while learning the downstream task.
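As a concrete illustration (ours, not from the paper), the following NumPy snippet evaluates the example sentence above on a random symmetric 0/1 matrix and checks that it indeed computes the trace of $X^2$; only operations from $L_3$ are used.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 5))
X = np.triu(X, 1) + np.triu(X, 1).T        # symmetric 0/1 matrix, e.g. an adjacency matrix

ones = np.ones((X.shape[0], 1))            # the MATLANG operation "1": column vector of ones

# s(X) = 1^T (X^2 ⊙ diag(1)) 1, built only from operations in L3 = {·, ^T, 1, diag, ⊙}
s = ones.T @ ((X @ X) * np.diag(ones[:, 0])) @ ones

assert np.isclose(s.item(), np.trace(X @ X))   # the sentence computes Tr(X^2)
print(s.item())
```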
To reach this objective, we will instantiate the fragment as a Context-Free Language, entirely described by a set of production rules. Figure 7 in appendix A illustrates the process of sentence generation from a grammar.

Definition 2.2 (Context-Free Grammar and Language) A Context-Free Grammar (CFG) $G$ is a 4-tuple $(V, \Sigma, R, S)$ with $V$ a finite set of variables, $\Sigma$ a finite set of terminal symbols, $R \subseteq V \times (V \cup \Sigma)^*$ a finite set of rules, and $S$ a start variable. $R$ completely describes a CFG with the convention that $S$ is placed on the top left. $B$ is a Context-Free Language (CFL) if there exists a CFG $G$ such that $B = L(G) := \{w \mid w \in \Sigma^* \text{ and } S \overset{*}{\Rightarrow} w\}$, where $S \overset{*}{\Rightarrow} w$ denotes that $S$ can be transformed into $w$ by applying an arbitrary number of rules of $G$.

3 From ML(L3) to the 3-WL G2N2

In this section, the proposed generic framework is described and instantiated on the ML(L3) fragment to generate our G2N2 model. As shown by Figure 1, three steps are involved: (1) defining the exhaustive CFG that generates the language, (2) reducing the exhaustive CFG, (3) translating the variables and the rules of the reduced CFG into GNN inputs and model layers. To keep the expressive power of the language at each step, the equivalence between the successive representations must be ensured.

Figure 1: Overview of the proposed framework instantiated on ML(L3).

3.1 From ML(L3) to the exhaustive CFG GL3

The first step of the framework translates the language fragment into an exhaustive CFG (variables, terminal symbols and rules). For ML(L3), the variables of the exhaustive CFG denoted GL3 are defined using the following proposition, proved in appendix A.2.

Proposition 3.1 For any square matrix of size $n \times n$, operations in L3 can only produce square matrices of the same size, row or column vectors of size $n$, or scalars.

In the context of our study, as in Geerts (2020), ML(L3) is applied on the adjacency matrix. Thus, proposition 3.1 ensures that GL3 variables are restricted to square matrix ($M$), column vector ($V_c$), row vector ($V_r$) and scalar ($S$). Once the variables are defined, the production rules of GL3 are obtained by enumerating all possible operations in ML(L3) that produce such variables. The rule $M \to A$ is added in order to be compliant with Geerts (2020). All the rules composing GL3 are synthesised in equation 1, where $\mid$ denotes the classical OR operator since a variable can be produced by different rules. They fully characterise the CFG; elements that are not variables in the rule set are said to be terminal symbols. The following theorem ensures that the language generated by GL3 is ML(L3). Thus GL3 is as expressive as ML(L3).

Theorem 3.1 For GL3 defined by

$$
\begin{aligned}
S &\to (V_r)(V_c) \mid \mathrm{diag}(S) \mid SS \mid (S \odot S) \\
V_c &\to (V_c \odot V_c) \mid MV_c \mid (V_r)^T \mid V_cS \mid \mathbb{1} \\
V_r &\to (V_r \odot V_r) \mid V_rM \mid (V_c)^T \mid SV_r \\
M &\to (M \odot M) \mid MM \mid (M)^T \mid \mathrm{diag}(V_c) \mid (V_c)(V_r) \mid A
\end{aligned}
\tag{1}
$$

we have $L(\mathrm{GL3}) = $ ML(L3).

The full proof is provided in appendix A.2. Its idea is the following. As any operation in the rules of GL3 belongs to L3, it is clear that $L(\mathrm{GL3}) \subseteq$ ML(L3). The reciprocal inclusion is proven by induction over the number of ML(L3) operations. Given the result of theorem 3.1, the next step reduces the CFG by exploiting the redundancies in the exhaustive set of rules and variables.

3.2 From GL3 to r-GL3

An example of redundancy can be observed in the following proposition, proved in the appendix (see A.2).
Proposition 3.2 For any square matrix $M$, column vector $V_c$ and row vector $V_r$, we have
$$M \odot (V_c V_r) = \mathrm{diag}(V_c)\, M\, \mathrm{diag}(V_r).$$

The following theorem guarantees that the reduced grammar below preserves expressiveness.

Theorem 3.2 (ML(L3) reduced CFG) Let r-GL3 be defined by

$$
\begin{aligned}
V_c &\to MV_c \mid \mathbb{1} \\
M &\to (M \odot M) \mid MM \mid \mathrm{diag}(V_c) \mid A
\end{aligned}
\tag{2}
$$

r-GL3 is as expressive as GL3.

Proof. For any scalars $S, S'$, since $\mathrm{diag}(S)$, $SS'$ and $S \odot S'$ produce a scalar, the only way to produce a scalar from other variables is to pass through a vector dot product. Hence the scalar variable $S$ and its rules can be removed from GL3 without loss of expressive power. Since $\mathrm{diag}(v)\, w = v \odot w$ for any vectors $v, w$, the vector Hadamard product can be removed from the vector rules. Proposition 3.2 allows us to remove $V_cV_r$ from the rules of $M$, since the results of the subsequent mandatory operations $MM$ or $MV_c$ can be obtained with other combinations. At this stage, the following intermediate CFG i-GL3 is as expressive as GL3 since it can compute any vector of GL3.

$$
\begin{aligned}
V_c &\to MV_c \mid (V_r)^T \mid \mathbb{1} \\
V_r &\to V_rM \mid (V_c)^T \\
M &\to (M \odot M) \mid MM \mid (M)^T \mid \mathrm{diag}(V_c) \mid A
\end{aligned}
$$

Since the remaining $M$ rules preserve symmetry, $(M)^T$, the variable $V_r$ and its rules can be removed. This leads to r-GL3 as defined in equation 2.

From these two steps, the resulting CFG r-GL3 possesses the expressive power of the fragment ML(L3). The next step is the translation of r-GL3 into a GNN layer.

3.3 From r-GL3 to a G2N2 layer model

In r-GL3, any vector $V_c$ or matrix $M$ is produced by applying a sequence of rules on $A$ and $\mathbb{1}$. As a consequence, every matrix or vector can be attained through an iterative rule selection procedure using matrix and vector memories that store intermediate variables. Figure 2 describes this procedure: each iteration starts by choosing a rule in r-GL3 before selecting the corresponding inputs in the memories. Applying the selected rule produces a new matrix or a new vector, which is added to the appropriate memory. A minimal sketch of this procedure is given below.
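As an illustration (not from the paper), the following NumPy sketch mimics this iterative procedure on a toy adjacency matrix: starting from the memories {A} and {1}, it applies r-GL3 rules to grow the matrix and vector memories, and it also checks the identity of Proposition 3.2 numerically. The graph and the chosen sequence of rules are arbitrary.

```python
import numpy as np

# toy 4-node path graph (hypothetical example data)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
ones = np.ones((4, 1))

matrix_memory = [A]        # stores the M variables produced so far
vector_memory = [ones]     # stores the Vc variables produced so far

def apply_rule(rule, M=None, N=None, v=None):
    """One iteration of the procedure: apply a selected r-GL3 rule to selected inputs."""
    if rule == "hadamard":   return M * N            # M -> (M ⊙ M)
    if rule == "matmul":     return M @ N            # M -> MM
    if rule == "diag":       return np.diagflat(v)   # M -> diag(Vc)
    if rule == "matvec":     return M @ v            # Vc -> MVc
    raise ValueError(rule)

# a few iterations: each produces a new matrix or vector and stores it in its memory
matrix_memory.append(apply_rule("matmul", M=A, N=A))             # A^2
vector_memory.append(apply_rule("matvec", M=A, v=ones))          # A1 (degree vector)
matrix_memory.append(apply_rule("diag", v=ones))                 # diag(1) = I
matrix_memory.append(apply_rule("hadamard",
                                M=matrix_memory[1], N=matrix_memory[2]))  # A^2 ⊙ I = D
assert np.allclose(matrix_memory[3], np.diagflat(vector_memory[1]))

# numerical check of Proposition 3.2: M ⊙ (Vc Vr) = diag(Vc) M diag(Vr)
Vc, Vr = vector_memory[1], vector_memory[1].T
lhs = A * (Vc @ Vr)
rhs = np.diagflat(Vc) @ A @ np.diagflat(Vr)
assert np.allclose(lhs, rhs)
```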
Translating this iterative procedure into a GNN based on a sequence of layers requires a memory management strategy and a selection mechanism for both rules and inputs, while taking into account learning issues related to downstream tasks.

The matrix memory aims at storing the variables $M$ produced by successive applications of r-GL3 rules. This memory is represented by a third-order tensor $C^{(l)}$ where the produced matrices (i.e. edge embeddings in a GNN context) are stacked across layers on the third dimension. In the same way, the vector memory is dedicated to the variables $V_c$ that correspond to node embeddings. It is represented as a matrix $H^{(l)}$ where the produced vectors are stacked on the second dimension. $C^{(l)}$ and $H^{(l)}$ are the inputs of the $l$-th GNN layer, which produces $C^{(l+1)}$ and $H^{(l+1)}$ as output, as depicted in Figure 3 describing a G2N2 layer. While the memory of the iterative procedure grows with each iteration, a tractable GNN architecture constrains the stacking dimension to be set to a given value at each layer. In order to mimic the rule selection procedure of Figure 2, a G2N2 layer applies a selection among the outputs produced by all the rules. Such a strategy enables the parallel computation of several occurrences of any rule with multiple inputs. Hence, parameterised quantities $b_\odot$, $b_\cdot$, $b_{\mathrm{diag}}$, $b_{MV_c}$ of the rules $(M \odot M)$, $(MM)$, $\mathrm{diag}(V_c)$, $MV_c$ are computed in parallel, taking as input linear combinations $L_i$ of slices of $C^{(l)}$ and slices of $H^{(l)}$. These linear combinations are able to select among the inputs $C^{(l)}$ and $H^{(l)}$ through a learning paradigm.

Both the matrix rule outputs and the tensor $C^{(l)}$ (obtained through a skip connection which guarantees the memory persistence) are fed to MLP$_M$, which produces the output tensor $C^{(l+1)}$ with a selected third dimension size $S^{(l+1)}$. This MLP allows at the same time to simulate the rule selection, to compress the matrix output of the layer to a fixed size, and to learn a point-wise function for solving specific downstream tasks. It relates to the set of operations $\{+, \times, f\}$ of MATLANG and does not modify the expressive power (Geerts (2020); Maron et al. (2019a)). The output $H^{(l+1)}$ is obtained similarly through MLP$_{V_c}$. Figure 3 describes the whole model of a G2N2 layer. Formally, the update equations are:

$$C^{(l+1)} = \mathrm{MLP}_M\big(C^{(l)} \,\|\, L_1(C^{(l)}) \odot L_2(C^{(l)}) \,\|\, L_3(C^{(l)}) \cdot L_4(C^{(l)}) \,\|\, \mathrm{diag}(L_6(H^{(l)}))\big), \tag{3}$$

$$H^{(l+1)} = \mathrm{MLP}_{V_c}\big(H^{(l)} \,\|\, L_5(C^{(l)}) \cdot L_7(H^{(l)})\big), \tag{4}$$

where $\|$ is the concatenation, MLP$_M$ and MLP$_{V_c}$ are learnable MLPs, and the $L_i$ are learnable linear blocks acting on the third dimension of $C^{(l)}$ or the second dimension of $H^{(l)}$: $L_{1,2} : \mathbb{R}^{S^{(l)}} \to \mathbb{R}^{b_\odot^{(l)}}$, $L_{3,4} : \mathbb{R}^{S^{(l)}} \to \mathbb{R}^{b_\cdot^{(l)}}$, $L_5 : \mathbb{R}^{S^{(l)}} \to \mathbb{R}^{b_{MV_c}^{(l)}}$, $L_6 : \mathbb{R}^{f_n^{(l)}} \to \mathbb{R}^{b_{\mathrm{diag}}^{(l)}}$, $L_7 : \mathbb{R}^{f_n^{(l)}} \to \mathbb{R}^{b_{MV_c}^{(l)}}$, $\mathrm{MLP}_M : \mathbb{R}^{S^{(l)}+b_\odot^{(l)}+b_\cdot^{(l)}+b_{\mathrm{diag}}^{(l)}} \to \mathbb{R}^{S^{(l+1)}}$, and $\mathrm{MLP}_{V_c} : \mathbb{R}^{f_n^{(l)}+b_{MV_c}^{(l)}} \to \mathbb{R}^{f_n^{(l+1)}}$. A minimal PyTorch sketch of these update equations is given below.
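The following PyTorch sketch is our own illustration of equations (3) and (4), not the authors' released code: it assumes dense single-graph tensors with $C$ stored channel-first as $(S, n, n)$ and $H$ as $(n, f)$, and it sets all rule widths $b_\odot = b_\cdot = b_{\mathrm{diag}} = b_{MV_c}$ to a common value $b$, as in appendix C.

```python
import torch
import torch.nn as nn

class G2N2Layer(nn.Module):
    """Minimal sketch of equations (3)-(4); dense, single graph, channel-first tensors."""

    def __init__(self, S_in, f_in, S_out, f_out, b):
        super().__init__()
        # L1..L5 mix the S_in slices of C, L6-L7 mix the f_in columns of H
        self.L = nn.ModuleList([nn.Linear(S_in, b, bias=False) for _ in range(5)])
        self.L6 = nn.Linear(f_in, b, bias=False)
        self.L7 = nn.Linear(f_in, b, bias=False)
        self.mlp_M = nn.Sequential(nn.Linear(S_in + 3 * b, S_out), nn.ReLU(),
                                   nn.Linear(S_out, S_out))
        self.mlp_Vc = nn.Sequential(nn.Linear(f_in + b, f_out), nn.ReLU(),
                                    nn.Linear(f_out, f_out))

    def mix(self, lin, C):
        # linear block on the slice (channel) dimension of C: (S, n, n) -> (b, n, n)
        return lin(C.permute(1, 2, 0)).permute(2, 0, 1)

    def forward(self, C, H):
        had = self.mix(self.L[0], C) * self.mix(self.L[1], C)                 # (M ⊙ M)
        mul = torch.bmm(self.mix(self.L[2], C), self.mix(self.L[3], C))       # (MM)
        dia = torch.diag_embed(self.L6(H).t())                                # diag(Vc): (b, n, n)
        mv = torch.einsum('bij,jb->ib', self.mix(self.L[4], C), self.L7(H))   # MVc: (n, b)

        C_new = torch.cat([C, had, mul, dia], dim=0)            # concatenation of equation (3)
        C_new = self.mlp_M(C_new.permute(1, 2, 0)).permute(2, 0, 1)
        H_new = self.mlp_Vc(torch.cat([H, mv], dim=1))          # equation (4)
        return C_new, H_new


# toy usage on a 5-node graph: C(0) stacks A (and would also stack edge features), H(0) = 1
n = 5
A = (torch.rand(n, n) < 0.4).float().triu(1)
A = A + A.t()
C0, H0 = A.unsqueeze(0), torch.ones(n, 1)
layer = G2N2Layer(S_in=1, f_in=1, S_out=8, f_out=8, b=4)
C1, H1 = layer(C0, H0)
print(C1.shape, H1.shape)    # torch.Size([8, 5, 5]) torch.Size([5, 8])
```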
3.4 G2N2 architecture and its expressive power

Figure 4 depicts the global G2N2 architecture. The inputs are $H^{(0)}$ and $C^{(0)}$. $H^{(0)}$, of size $n \times (f_n + 1)$, is the node feature matrix concatenated with $\mathbb{1}$. $C^{(0)} \in \mathbb{R}^{n \times n \times (f_e + 1)}$ is a stacking on the third dimension of the adjacency matrix $A$ and the extended adjacency tensor $E$ of size $n \times n \times f_e$, where $f_e$ is the number of edge features. After the last layer, permutation-equivariant readout functions are applied on both $H^{(l_{end})}$ and the diagonal and off-diagonal components of $C^{(l_{end})}$. Readout outputs are then fed to a dedicated decision layer.

Figure 2: 4-step iterative procedure. (1) Rule selection. (2) Input selection: inputs relative to the chosen rule are selected from the matrix and/or vector memories (opaque matrices). (3) Rule computation. (4) Output storage: the produced output is stored in its relative memory.

Figure 3: L1-L5 combine the $S^{(l)}$ slices of $C^{(l)}$ into $2b_\odot$, $2b_\cdot$ and $b_{MV_c}$ matrices. L6-L7 combine the $f_n^{(l)}$ columns of $H^{(l)}$ into $b_{\mathrm{diag}}$ and $b_{MV_c}$ vectors. From the outputs of L1-L7, multiple occurrences of the r-GL3 rules $(M \odot M)$, $(MM)$, $\mathrm{diag}(V_c)$ and $MV_c$ are computed. The obtained outputs and the layer inputs are fed to MLP$_M$ and MLP$_{V_c}$, providing the layer outputs $C^{(l+1)}$ and $H^{(l+1)}$.

Figure 4: Model of the G2N2 architecture from the graph to the output. Each layer updates node and edge embeddings, and readout functions are applied independently on $H^{(k)}$ and the diagonal and non-diagonal elements of $C^{(k)}$.

Theorem 3.3 (Expressive power of G2N2) G2N2 is able to produce any matrix and vector of L(r-GL3). It is as expressive as 3-WL.

Proof. We show that G2N2 at layer $l$ can produce all matrices and vectors r-GL3 can produce after $l$ iterations. It is true for $l = 1$. Indeed, at the first iteration of r-GL3, we obtain the matrices $I$, $A$, $A^2$ and the vectors $\mathbb{1}$ and $A\mathbb{1}$. Since any $L_i(C^{(0)})$ for $i \in [\![1, 5]\!]$ is a linear combination of $A$ and $I$, G2N2 can produce those vectors and matrices in one layer. Suppose that there exists $l > 0$ such that G2N2 can produce any of the matrices and vectors r-GL3 can produce after $l$ iterations. We denote by $\mathcal{A}_l$ the set of those matrices and by $\mathcal{V}_l$ the set of those vectors. At the $(l+1)$-th iteration, we have $\mathcal{A}_{l+1} = \{M \odot N,\, MN,\, \mathrm{diag}(V_c) \mid M, N \in \mathcal{A}_l,\, V_c \in \mathcal{V}_l\}$ and $\mathcal{V}_{l+1} = \{MV_c \mid M \in \mathcal{A}_l,\, V_c \in \mathcal{V}_l\}$. Let $M, N \in \mathcal{A}_l$ and $V_c \in \mathcal{V}_l$; then by hypothesis G2N2 can produce $M$, $N$ at layer $l$. Since the $L_i$ produce at least two different linear combinations of matrices or vectors in respectively $\mathcal{A}_l$ and $\mathcal{V}_l$, $MN$, $M \odot N$, $MV_c$ and $\mathrm{diag}(V_c)$ are reachable at layer $l + 1$. Thus $\mathcal{A}_{l+1}$ is included in the set of matrices G2N2 can produce at layer $l + 1$ and $\mathcal{V}_{l+1}$ is included in the set of vectors G2N2 can produce at layer $l + 1$.

3.5 Discussion: G2N2 in the 3-WL GNN literature

Positioning w.r.t. Maron et al. (2019a). From the PPGN layer description (see Figure 2 of Maron et al. (2019a)), one can build the following CFG:

$$M \to MM \mid \mathrm{diag}(\mathbb{1}) \mid A \tag{5}$$

where $M \to \mathrm{diag}(\mathbb{1})$ and $M \to A$ represent inputs of the architecture, as for G2N2. Compared to r-GL3, the $V_c$ variable and the rules $M \to M \odot M$, $\mathrm{diag}(V_c)$ and $MV_c$ are missing. As a consequence, the 3-WL expressive power of PPGN is not formally inherited from ML(L3). As stated in the introduction, it relies on the ability of PPGN to mimic the 2-FWL colouring and hashing steps. Its capacity to implement the colouring step relies on MLP universality. This explains why PPGN can approximate the missing rules of r-GL3. To guarantee such an approximation, a certain width and depth of the MLPs are needed. G2N2 does not suffer from these computational constraints since it only needs to provide linear combinations as arguments of the operations.

3-IGN processes sets of third-order tensors. As a consequence, it cannot be described by a CFG derived from ML(L3). However, we can connect our approach with k-IGN. For k-IGN, the expressive power is related to MLPs and to the basis of linear equivariant operators defined in Maron et al. (2019b). In some ways, these operators can be linked to the algebraic operations of our framework. An example of such a link is given in appendix A.3 for 2-IGN.

Positioning w.r.t. Balcilar et al. (2021). In appendix A.1, we show that GNNML1 (Balcilar et al. (2021)) can be seen as the GNN resulting from our framework applied on ML(L1). Concerning GNNML3, a CFG can also be deduced from its layer:

$$V_c \to C^1V_c \mid \cdots \mid C^kV_c \mid V_c \odot V_c \mid \mathbb{1}$$

where the matrices $C^1, \ldots, C^k$ are defined using the adjacency matrix, the exponential pointwise function, and the matrix Hadamard product. As some rules and variables are missing compared to r-GL3, it cannot formally inherit the expressive power of ML(L3).

4 Experiments

This section is dedicated to the experimental validation of both the framework and G2N2. It answers four questions, Q1-Q4. Q1 concerns the validation of the reduced grammars. Q2 and Q3 relate to the performance of G2N2 on downstream regression/classification tasks. Q4 concerns the spectral ability of the model. Experimental settings are detailed in appendix C.

Q1: Is the reduction of grammar relevant and optimal? This experiment aims at investigating the impact of the CFG reduction scheme through the comparison of different models built using GL3 (see Figure 9 in appendix A), i-GL3 (see Figure 10 in appendix A) and r-GL3 (see Figure 3). The comparison is completed by an ablation study whose aim is to investigate the importance of each rule of r-GL3. We use a graph regression benchmark called QM9, which is composed of 130K small molecules (Ramakrishnan et al. (2014); Wu et al. (2018)). For this study, we focus on the regression target R2, which is known to be the most difficult to predict. As in Maron et al. (2019a), the dataset is randomly split into training, validation, and test sets with respective ratios of 0.8, 0.1 and 0.1. The edge and vector embeddings are always of size 32.
The results are presented in Figure 5, where each model is represented in a 2-D space using the Mean Absolute Error (MAE) of the best validation model on the test set and the number of parameters of the model. These results corroborate the theoretical results of section 3, discussed in greater detail in appendix A.5: the MAE scores are comparable for GL3, i-GL3 and r-GL3, while the number of parameters is divided by 2 when reducing from GL3 to r-GL3. As expected, removing rules from r-GL3 leads to a drop in MAE performance. The study also offers insights into the weight of each operation in the model and enables informed pruning of the GNN if the full expressiveness is not required.

Figure 5: Comparison of model size and MAE performance on the QM9 R2 target for GNNs derived from GL3. Each GNN model is denoted by its set of rules. Over-reduced grammars from r-GL3 are denoted with a \, whereas ∪ denotes the addition of a rule to the set.

Q2: Does G2N2 perform better than other 3-WL GNNs for regression? For this second question, we also use the QM9 dataset, but we consider the 12 regression targets. The dataset is randomly split into training, validation, and test sets with the same ratios as in Q1. The experimental settings are detailed in appendix C. G2N2 results are compared to those in Huang et al. (2023); Maron et al. (2019a), including 1-GNN and 1-2-3-GNN (Morris et al. (2019)), DTNN (Wu et al. (2018)), Deep LRP (Chen et al. (2020)), NGNN (Zhang and Li (2021)), I2-GNN (Huang et al. (2023)) and PPGN (Maron et al. (2019a)). The metric is the MAE of the best validation model on the test set. The mean epoch duration is measured on the same device for comparison between G2N2 and PPGN. As in Maron et al. (2019a), we made two experiments. The first one consists in learning one target at a time, while the second learns every target at once. In the first experiment, we set $S^{(l)} = f_n^{(l)} = 64$ and in the second $S^{(l)} = f_n^{(l)} = 32$. Partial results focusing on the two best models are given in Table 1. Complete results and experimental settings are given in appendix C. In both cases, G2N2 obtains the best results while learning faster.

Table 1: Results on the QM9 dataset focusing on the best methods. In the two left columns, each target is learned separately, while in the two right columns all targets are learned at the same time. The metric is MAE, the lower, the better. Complete results can be found in Table 4.

| Target | PPGN (single) | G2N2 (single) | PPGN (all) | G2N2 (all) |
|---|---|---|---|---|
| µ | 0.0934 | 0.0703 | 0.231 | 0.102 |
| α | 0.318 | 0.127 | 0.382 | 0.196 |
| ϵhomo | 0.00174 | 0.00172 | 0.00276 | 0.0021 |
| ϵlumo | 0.0021 | 0.00153 | 0.00287 | 0.00211 |
| ∆ϵ | 0.0029 | 0.00253 | 0.0029 | 0.00287 |
| R2 | 3.78 | 0.342 | 16.07 | 1.19 |
| ZPVE | 0.000399 | 0.0000951 | 0.00064 | 0.0000151 |
| U0 | 0.022 | 0.0169 | 0.234 | 0.0502 |
| U | 0.0504 | 0.0162 | 0.234 | 0.0503 |
| H | 0.0294 | 0.0176 | 0.229 | 0.0503 |
| G | 0.024 | 0.0214 | 0.238 | 0.0504 |
| Cv | 0.144 | 0.0429 | 0.184 | 0.0707 |
| T / ep | 129 s | 98 s | 131 s | 57 s |

Q3: Does G2N2 perform better than other 3-WL GNNs for classification? For graph classification, we evaluate G2N2 on the classical TUD benchmark (Morris et al. (2020)), using the evaluation protocol of Xu et al. (2019). Results of GNNs and graph kernels are taken from Bouritsas et al. (2022). Since the numbers of node and edge features are very different from one dataset to another, the parameter settings for each of the 6 experiments related to these datasets can be found in Table 5 of appendix C. Partial results focusing on G2N2 performance are given in Table 2. Complete results can be seen in Table 6 of appendix C. G2N2 achieves better than rank 2 for five of the six datasets.
Table 2: Results of G2N2 on the TUD datasets compared to the best GNN competitor. The rank of G2N2 within GNNs is in parentheses. The metric is accuracy, the higher, the better. Complete results can be seen in Table 6.

| Dataset | G2N2 | rank | Best GNN competitor |
|---|---|---|---|
| MUTAG | 92.5 ± 5.5 | 1 (1) | 92.2 ± 7.5 |
| PTC | 72.3 ± 6.3 | 1 (1) | 68.2 ± 7.2 |
| Proteins | 80.1 ± 3.7 | 1 (1) | 77.4 ± 4.9 |
| NCI1 | 82.8 ± 0.9 | 5 (3) | 83.5 ± 2.0 |
| IMDB-B | 76.8 ± 2.8 | 3 (3) | 77.8 ± 3.3 |
| IMDB-M | 54.0 ± 2.9 | 2 (2) | 54.3 ± 3.3 |

Q4: Can G2N2 learn band-pass filters in the spectral domain? As shown in Balcilar et al. (2020), the spectral ability of a GNN, and particularly its ability to operate as a band-pass filter, is an important property of a model for certain downstream tasks. In order to assess the spectral ability of G2N2 and answer Q4, we use the protocol and node regression dataset of Balcilar et al. (2021). The R2 score is used to compare performance. Table 3 reports the comparison of G2N2 to CHEBNET (Hammond et al. (2011)), PPGN and GNNML3, citing the results from Balcilar et al. (2021). CHEBNET and GNNML3 are spectrally designed and manage to learn low-pass, high-pass, and band-pass filters. For the three filter types, G2N2 reaches comparable performance. In appendix B, a theoretical analysis shows that a 3-WL GNN is able to approximate any type of filter. As shown in the table, PPGN fails to learn band-pass filters. This result, which contradicts the previous theoretical result, is related to memory and complexity issues. Hence, as explained before, PPGN needs a deeper and wider architecture for this task, which cannot be reached for 900-node graphs (Balcilar et al. (2021)).

Table 3: R2 score on spectral filtering node regression. Results are a median of 10 runs.

| Method | Low-pass | High-pass | Band-pass |
|---|---|---|---|
| CHEBNET | 0.9995 | 0.9901 | 0.8217 |
| GNNML3 | 0.9995 | 0.9909 | 0.8189 |
| PPGN | 0.9991 | 0.9925 | 0.1041 |
| G2N2 | 0.9996 | 0.9994 | 0.8206 |

5 Conclusion

Designing provably expressive GNNs has been the target of many recent works. In this paper, we have proposed a new theoretical framework for designing such models. Taking as input a language fragment, i.e. a set of algebraic operations, the framework uses reduced Context-Free Grammars to drive the generation of graph neural architectures with provable expressive power. The framework provides insights about the importance of algebraic operations in the resulting model, as shown by the experimental grammar ablative study. Such results can be useful for improving the performance vs. computational cost trade-off for a given task. Through the application of the framework to the ML(L3) fragment, the paper also proposed the provably 3-WL G2N2 model. In addition to these theoretical guarantees, G2N2 is also shown to be efficient for solving graph learning downstream tasks through several experiments on regression, classification and spectral filtering benchmarks. In all cases, G2N2 outperforms 3-WL GNNs, while being more tractable. Beyond these results, we are convinced that our contributions open the door to the design of models surpassing 3-WL, taking as root a language manipulating tensors of greater order (Geerts and Reutter (2022)). Moreover, the framework is not limited to GNN models since many other learning paradigms can be modelled with algebraic languages.

Acknowledgements

The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under grant ANR-21-CE23-0025 (CoDeGNN project). The authors acknowledge the support of the ANR and the Région Normandie under grant ANR-20-THIA-0021 (HAISCoDe project).

References

M. Balcilar, G. Renton, P. Héroux, B. Gaüzère, S.
Adam, and P. Honeine. Analyzing the expressive power of graph neural networks in a spectral perspective. In International Conference on Learning Representations, 2020.

M. Balcilar, P. Héroux, B. Gaüzère, P. Vasseur, S. Adam, and P. Honeine. Breaking the limits of message passing graph neural networks. In International Conference on Machine Learning, pages 599–608. PMLR, 2021.

C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler and lehman go cellular: CW networks. Advances in Neural Information Processing Systems, 34:2625–2640, 2021a.

C. Bodnar, F. Frasca, Y. Wang, N. Otter, G. F. Montufar, P. Lio, and M. Bronstein. Weisfeiler and lehman go topological: Message passing simplicial networks. In International Conference on Machine Learning, pages 1026–1037. PMLR, 2021b.

G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):657–668, 2022.

R. Brijder, F. Geerts, J. V. den Bussche, and T. Weerwag. On the expressive power of query languages for matrices. ACM Trans. Database Syst., 44(4):15:1–15:31, 2019.

T. Cai, S. Luo, K. Xu, D. He, T.-Y. Liu, and L. Wang. Graphnorm: A principled approach to accelerating graph neural network training. In International Conference on Machine Learning, pages 1204–1215. PMLR, 2021.

Z. Chen, L. Chen, S. Villar, and J. Bruna. Can graph neural networks count substructures? Advances in Neural Information Processing Systems, 33:10383–10395, 2020.

P. de Haan, T. S. Cohen, and M. Welling. Natural graph networks. Advances in Neural Information Processing Systems, 33:3636–3646, 2020.

S. S. Du, K. Hou, R. R. Salakhutdinov, B. Poczos, R. Wang, and K. Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in Neural Information Processing Systems, 32, 2019.

F. Frasca, B. Bevilacqua, M. M. Bronstein, and H. Maron. Understanding and extending subgraph GNNs by rethinking their symmetries. In Advances in Neural Information Processing Systems, 2022.

F. Geerts. On the expressive power of linear algebra on graphs. Theory of Computing Systems, Oct 2020.

F. Geerts and J. L. Reutter. Expressiveness and approximation properties of graph neural networks. In International Conference on Learning Representations, 2022.

J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.

D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011. ISSN 1063-5203. doi: https://doi.org/10.1016/j.acha.2010.04.005.

N. T. Huang and S. Villar. A short tutorial on the weisfeiler-lehman test and its variants. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8533–8537. IEEE, 2021.

Y. Huang, X. Peng, J. Ma, and M. Zhang. Boosting the cycle counting power of graph neural networks with I2-GNNs. In The Eleventh International Conference on Learning Representations, 2023.

T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, 2017.

S. Kolouri, N. Naderializadeh, G. K. Rohde, and H. Hoffmann. Wasserstein embedding for graph learning. arXiv preprint arXiv:2006.09430, 2020.

A. Lehman and B. Weisfeiler.
A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsiya, 2(9):12–16, 1968.

P. Li, Y. Wang, H. Wang, and J. Leskovec. Distance encoding: Design provably more powerful neural networks for graph representation learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4465–4478. Curran Associates, Inc., 2020.

H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. Advances in Neural Information Processing Systems, 32, 2019a.

H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019b.

C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4602–4609, 2019.

C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. Tudataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020.

R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7, 2014.

N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011.

Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. Moleculenet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.

K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.

B. Zhang, C. Fan, S. Liu, K. Huang, X. Zhao, J. Huang, and Z. Liu. The expressive power of graph neural networks: A survey. arXiv preprint arXiv:2308.08235, 2023a.

B. Zhang, G. Feng, Y. Du, D. He, and L. Wang. A complete expressiveness hierarchy for subgraph GNNs via subgraph weisfeiler-lehman tests. In International Conference on Machine Learning, 2023b.

M. Zhang and P. Li. Nested graph neural networks. Advances in Neural Information Processing Systems, 34:15734–15747, 2021.

M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

L. Zhao, W. Jin, L. Akoglu, and N. Shah. From stars to subgraphs: Uplifting any GNN with local structure awareness. In International Conference on Learning Representations, 2022.

These appendices provide additional content to the main paper G2N2: Weisfeiler and Lehman go grammatical.

A CFG and GNN

A.1 From ML(L1) to a 1-WL GNN

In this subsection, the reduction framework described in section 3 is applied to the fragment ML(L1), as shown by Figure 6.

Figure 6: Overview of the proposed framework instantiated on ML(L1).

To determine the variables of the CFG, the following proposition is necessary.

Proposition A.1 For any square matrix of size $n \times n$, operations in L1 can only produce square matrices of size $n \times n$, row or column vectors of size $n$, or scalars.
Proof. Let $M$ be a square matrix of size $n \times n$. We first need to prove that L1 can produce square matrices of size $n \times n$, row and column vectors of size $n$, and scalars. Indeed $\mathbb{1} := \mathbb{1}(M)$ is a column vector of size $n$, $\mathbb{1}^T$ is a row vector of size $n$, $\mathbb{1}^T\mathbb{1}$ is a scalar and $M$ is a square matrix of size $n \times n$. Then let $N \in \mathbb{R}^{n \times n}$, $v \in \mathbb{R}^n$, $w \in (\mathbb{R}^n)^*$, and $s \in \mathbb{R}$ be words of ML(L1). We have:

$$
\begin{array}{llll}
MN \in \mathbb{R}^{n \times n} & Mv \in \mathbb{R}^n & wM \in (\mathbb{R}^n)^* & wv \in \mathbb{R} \\
vw \in \mathbb{R}^{n \times n} & \mathbb{1}(v) \in \mathbb{R}^n & v^T \in (\mathbb{R}^n)^* & \mathbb{1}(w) \in \mathbb{R} \\
M^T \in \mathbb{R}^{n \times n} & w^T \in \mathbb{R}^n & sw \in (\mathbb{R}^n)^* & \mathrm{diag}(s) \in \mathbb{R} \\
\mathrm{diag}(v) \in \mathbb{R}^{n \times n} & \mathbb{1}(M) \in \mathbb{R}^n & & ss' \in \mathbb{R} \\
 & vs \in \mathbb{R}^n & & \mathbb{1}(s) \in \mathbb{R}
\end{array}
$$

Since this is an exhaustive list of everything ML(L1) can produce with these words, we can conclude.

Theorem A.1 (ML(L1) reduced CFG) The following CFG, denoted r-GL1, is as expressive as 1-WL:

$$V_c \to \mathrm{diag}(V_c)V_c \mid AV_c \mid \mathbb{1} \tag{6}$$

Proof. Proposition A.1 leads to only four variables: $M$ for the square matrices, $V_c$ for the column vectors, $V_r$ for the row vectors and $S$ for the scalars. We define a CFG GL1 where the rules of a given variable are all the possible operations in ML(L1) that produce this variable:

$$
\begin{aligned}
S &\to (V_r)(V_c) \mid \mathrm{diag}(S) \mid SS \\
V_c &\to MV_c \mid (V_r)^T \mid V_cS \mid \mathbb{1} \\
V_r &\to V_rM \mid (V_c)^T \mid SV_r \\
M &\to MM \mid (M)^T \mid \mathrm{diag}(V_c) \mid (V_c)(V_r) \mid A
\end{aligned}
\tag{7}
$$

As any operation in the rules of GL1 belongs to L1, it is clear that $L(\mathrm{GL1}) \subseteq$ ML(L1). For the other inclusion, a simple inductive proof on the maximal number of rules shows that any sentence produced by ML(L1) can be derived from GL1. We then have ML(L1) = $L(\mathrm{GL1})$.

For any scalars $s, s'$, since $\mathrm{diag}(s)$ and $ss'$ produce a scalar, the only way to produce a scalar from another variable is to pass through a vector dot product. It implies that to generate scalars, we only need to be able to generate vectors. We can then reduce GL1 by removing the scalar variable $S$ and setting $V_c$ as the start variable:

$$
\begin{aligned}
V_c &\to MV_c \mid (V_r)^T \mid \mathbb{1} \\
V_r &\to V_rM \mid (V_c)^T \\
M &\to MM \mid (M)^T \mid \mathrm{diag}(V_c) \mid (V_c)(V_r) \mid A
\end{aligned}
$$

To ensure that the start variable is $V_c$, a mandatory subsequent operation will be $MV_c$ for any matrix variable $M$. As a consequence, by associativity of the matrix multiplication, $MM$ and $(V_c)(V_r)$ can be removed from the rules of $M$:

$$
\begin{aligned}
V_c &\to MV_c \mid (V_r)^T \mid \mathbb{1} \\
V_r &\to V_rM \mid (V_c)^T \\
M &\to (M)^T \mid \mathrm{diag}(V_c) \mid A
\end{aligned}
$$

Since diag produces symmetric matrices and $A$ is symmetric, $(M)^T$ does not play any role here. As a consequence, we can focus on the column vectors and we obtain r-GL1.

Figure 7 shows how the CFG GL1 produces the sentence $\mathbb{1}^TA\mathbb{1}$; a small numerical illustration is given below.

Figure 7: GL1 generating the sentence $\mathbb{1}^TA\mathbb{1}$. From the start variable, variables are replaced by applying rules until only terminal symbols remain.
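For illustration (ours), the following snippet spells out one possible derivation of the sentence $\mathbb{1}^TA\mathbb{1}$ in GL1, in the spirit of Figure 7, and evaluates the resulting terminal sentence on a toy adjacency matrix; the intermediate derivation steps are our own choice, not necessarily those shown in the figure.

```python
import numpy as np

# one possible derivation: S ⇒ (Vr)(Vc) ⇒ (Vc)^T(Vc) ⇒ 1^T(Vc) ⇒ 1^T(MVc) ⇒ 1^T(AVc) ⇒ 1^T A 1
derivation = ["S", "(Vr)(Vc)", "(Vc)^T(Vc)", "1^T(Vc)", "1^T(MVc)", "1^T(AVc)", "1^T A 1"]
print(" => ".join(derivation))

# evaluating the terminal sentence on a toy adjacency matrix: 1^T A 1 counts twice the edges
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
ones = np.ones((3, 1))
print((ones.T @ A @ ones).item())   # 4.0 = 2 x number of edges
```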
A.2 Proofs of section 3

This subsection is dedicated to the proofs of the propositions and theorem of section 3. Firstly, we prove proposition 3.1.

Proof. Since $L_1 \subset L_3$, we only need to check what the rule associated with the matrix Hadamard product can produce. Let $M \in \mathbb{R}^{n \times n}$ and $N \in \mathbb{R}^{n \times n}$ be words ML(L3) can produce; we have $M \odot N \in \mathbb{R}^{n \times n}$. We can conclude.

Secondly, we prove proposition 3.2.

Proof. Let $M$ be a square matrix and $V_c$, $V_r$ be respectively a column and a row vector. For any $i, j$ we have
$$
\begin{aligned}
(M \odot (V_cV_r))_{i,j} &= M_{i,j}(V_cV_r)_{i,j} = (V_c)_i M_{i,j}(V_r)_j \\
&= \sum_l \mathrm{diag}(V_c)_{i,l} M_{l,j}(V_r)_j = (\mathrm{diag}(V_c)\, M)_{i,j}(V_r)_j \\
&= \sum_l (\mathrm{diag}(V_c)\, M)_{i,l}\, \mathrm{diag}(V_r)_{l,j} = (\mathrm{diag}(V_c)\, M\, \mathrm{diag}(V_r))_{i,j}
\end{aligned}
$$
We only use the commutativity of the scalar product here.

Eventually, we prove theorem 3.1.

Proof. As any operation in the rules of GL3 belongs to L3, it is clear that $L(\mathrm{GL3}) \subseteq$ ML(L3). Let $k$ be a positive integer. We denote by $M^k_{L_3}$, $Vc^k_{L_3}$, $Vr^k_{L_3}$ and $S^k_{L_3}$ the sets of matrices, column vectors, row vectors and scalars that can be produced with at most $k$ operations in L3 from $A$. We also denote by $M^k_{G}$, $Vc^k_{G}$, $Vr^k_{G}$ and $S^k_{G}$ the sets of matrices, column vectors, row vectors and scalars that can be produced with at most $k$ rules applied in GL3. For $k = 0$, we have $Vc^0_{L_3} = Vr^0_{L_3} = S^0_{L_3} = \emptyset$, and thus $Vc^0_{L_3} \subseteq Vc^0_{G}$, $Vr^0_{L_3} \subseteq Vr^0_{G}$ and $S^0_{L_3} \subseteq S^0_{G}$. Moreover $M^0_{L_3} = \{A\}$ and $M^0_{G} = \{A\}$. Suppose that there exists $k \geq 0$ such that $M^k_{L_3} \subseteq M^k_{G}$, $Vc^k_{L_3} \subseteq Vc^k_{G}$, $Vr^k_{L_3} \subseteq Vr^k_{G}$ and $S^k_{L_3} \subseteq S^k_{G}$. Then, since the rules of GL3 are composed of the exhaustive operations in L3, we have $M^{k+1}_{L_3} \subseteq M^{k+1}_{G}$, $Vc^{k+1}_{L_3} \subseteq Vc^{k+1}_{G}$, $Vr^{k+1}_{L_3} \subseteq Vr^{k+1}_{G}$ and $S^{k+1}_{L_3} \subseteq S^{k+1}_{G}$. By induction, we have that ML(L3) $\subseteq L(\mathrm{GL3})$ and we can conclude that $L(\mathrm{GL3}) = $ ML(L3).

A.3 CFGs describing existing architectures

The following examples show how CFGs can be used to characterise GNNs.

Proposition A.2 (GCN CFG) The following CFG, strictly less expressive than ML(L1), represents GCN (Kipf and Welling (2017)):

$$V_c \to CV_c \mid \mathbb{1} \tag{8}$$

where $C = \mathrm{diag}((A + I)\mathbb{1})^{-\frac{1}{2}} (A + I)\, \mathrm{diag}((A + I)\mathbb{1})^{-\frac{1}{2}}$.

In GCN, the only grammatical operation is the message passing given by $CV_c$, where $C$ is the convolution support. The other parts of the model are linear combinations of vectors and MLPs, which correspond to $+$, $\times$ and $f$ in the language. Since $+$, $\times$ and $f$ do not affect the expressive power of the language (Geerts (2020)), they do not appear in the grammar. Actually, any MPNN based on $k$ convolution supports $C^i$ included in ML(L1) can be described by the following CFG, which is strictly less expressive than ML(L1):

$$V_c \to C^1V_c \mid \cdots \mid C^kV_c \mid \mathbb{1} \tag{9}$$

GNNML1 is an architecture provably 1-WL equivalent (Balcilar et al. (2021)) with the following node update:

$$H^{(l+1)} = H^{(l)}W^{(l,1)} + AH^{(l)}W^{(l,2)} + H^{(l)}W^{(l,3)} \odot H^{(l)}W^{(l,4)}, \tag{10}$$

where $H^{(l)}$ is the matrix of node embeddings at layer $l$ and the $W^{(l,i)}$ are learnable weight matrices. For any vectors $v, w$, since $\mathrm{diag}(v)\, w = v \odot w$, the following CFG that describes GNNML1 is equivalent to r-GL1:

$$V_c \to V_c \odot V_c \mid AV_c \mid \mathbb{1} \tag{11}$$

From r-GL3 to MPNNs and PPGN. We have already shown that most MPNNs can be written with operations in r-GL1; since $L_1 \subset L_3$, the same holds for r-GL3. PPGN can also be written with r-GL3. Indeed, at each layer, PPGN applies the matrix multiplication on matched matrices on the third dimension, an operation included in r-GL3. The node features are stacked on the third dimension as diagonal matrices, and the diag operation is also included in r-GL3. As all operations in PPGN are included, r-GL3 generalises the PPGN layer. Actually, the following CFG describes the PPGN layer:

$$M \to MM \mid \mathrm{diag}(\mathbb{1}) \mid A \tag{12}$$

Figure 8: Partition of four-index tuples.

2-IGN CFG. For $p \in [\![1, 15]\!]$, we define $B^p \in (\mathbb{R}^n)^{\otimes 4}$ as follows: $B^p_{i,j,k,l} = 1$ if $(i, j, k, l) \in \alpha_p$ and $0$ otherwise, where $(\alpha_p)$ corresponds to the 15 manners to partition four elements, which can be seen in Figure 8. As shown in Maron et al. (2019b), $(B^p)$ is a basis of the set of equivariant linear operators from $(\mathbb{R}^n)^{\otimes 2}$ to $(\mathbb{R}^n)^{\otimes 2}$. For the proof in the paper, two isomorphisms $\mathrm{vec} : (\mathbb{R}^n)^{\otimes 2} \to \mathbb{R}^{n^2}$ and $\mathrm{mat} : (\mathbb{R}^n)^{\otimes 4} \to (\mathbb{R}^{n^2})^{\otimes 2}$ were defined for any tensor $T \in (\mathbb{R}^n)^{\otimes 4}$, matrices $M \in (\mathbb{R}^{n^2})^{\otimes 2}$, $N \in (\mathbb{R}^n)^{\otimes 2}$ and vector $v \in \mathbb{R}^{n^2}$:

$$\mathrm{mat}(T)_{i,j} = T_{i//n,\, i\%n,\, j//n,\, j\%n}, \qquad \mathrm{mat}^{-1}(M)_{i,j,k,l} = M_{in+j,\, kn+l},$$
$$\mathrm{vec}(N)_i = N_{i//n,\, i\%n}, \qquad \mathrm{vec}^{-1}(v)_{i,j} = v_{in+j}.$$

We can then define the binary operation $\star$ as follows:

$$T \star N = \mathrm{vec}^{-1}(\mathrm{mat}(T)\,\mathrm{vec}(N)).$$

Actually, we obtain the following operation:

$$(T \star N)_{i,j} = \sum_{k,l} T_{i,j,k,l}\, N_{k,l}.$$

We have all we need to proceed with writing 2-IGN as a grammar. The idea is to compute the application of each basis operator to any matrix with a set of rules.

$$(B^1 \star N)_{i,j} = \sum_{k,l} B^1_{i,j,k,l}\, N_{k,l} = \begin{cases} N_{i,i} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
It is pretty easy to see that

$$(B^2 \star N)_{i,j} = \sum_{k,l} B^2_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l \neq i} N_{i,l} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Here, it is a sum over the matrix row avoiding the diagonal: $B^2 \star N = \mathrm{diag}((N \odot J)\mathbb{1})$.

$$(B^3 \star N)_{i,j} = \sum_{k,l} B^3_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l \neq i} N_{l,i} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Here, it is a sum over the matrix column avoiding the diagonal: $B^3 \star N = \mathrm{diag}((N \odot J)^T\mathbb{1})$.

$$(B^4 \star N)_{i,j} = \sum_{k,l} B^4_{i,j,k,l}\, N_{k,l} = \begin{cases} N_{j,j} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is the projection of the corresponding column diagonal element: $B^4 \star N = (\mathbb{1}\mathbb{1}^T(N \odot I)) \odot J$.

$$(B^5 \star N)_{i,j} = \sum_{k,l} B^5_{i,j,k,l}\, N_{k,l} = \begin{cases} N_{i,i} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is the projection of the corresponding row diagonal element: $B^5 \star N = ((N \odot I)\mathbb{1}\mathbb{1}^T) \odot J$.

$$(B^6 \star N)_{i,j} = \sum_{k,l} B^6_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{k \neq l} N_{k,l} - \sum_{l \neq i} N_{i,l} - \sum_{l \neq i} N_{l,i} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

One can recognise $B^2$ and $B^3$: $B^6 \star N = (\mathbb{1}^T(N \odot J)\mathbb{1})I - B^2 \star N - B^3 \star N$.

$$(B^7 \star N)_{i,j} = \sum_{k,l} B^7_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l \neq i} N_{i,l} - N_{i,j} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is just a sum over the row avoiding the element: $B^7 \star N = (\mathbb{1}\mathbb{1}^T(N \odot J) - N) \odot J$.

$$(B^8 \star N)_{i,j} = \sum_{k,l} B^8_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l} N_{l,i} - N_{j,i} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is just a sum over the column corresponding to the row, avoiding the transpose element: $B^8 \star N = ((N \odot J)\mathbb{1}\mathbb{1}^T - N^T) \odot J$.

$$(B^9 \star N)_{i,j} = \sum_{k,l} B^9_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l \neq i} N_{j,l} - N_{j,i} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is just a sum over the row corresponding to the column, avoiding the transpose element: $B^9 \star N = (\mathbb{1}\mathbb{1}^T(N \odot J) - N^T) \odot J$.

$$(B^{10} \star N)_{i,j} = \sum_{k,l} B^{10}_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l} N_{l,j} - N_{i,j} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is just a sum over the column avoiding the element: $B^{10} \star N = ((N \odot J)\mathbb{1}\mathbb{1}^T - N) \odot J$.

$$(B^{11} \star N)_{i,j} = \sum_{k,l} B^{11}_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l} N_{l,l} - N_{i,i} - N_{j,j} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is just a sum over the diagonal avoiding the two corresponding diagonal elements: $B^{11} \star N = (\mathbb{1}^T(N \odot I)\mathbb{1})J - B^3 \star N - B^4 \star N$.

$$(B^{12} \star N)_{i,j} = \sum_{k,l} B^{12}_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{l} N_{l,l} - N_{i,i} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

It is just a sum over the diagonal avoiding the corresponding diagonal element: $B^{12} \star N = (\mathbb{1}^T(N \odot I)\mathbb{1})J - (\mathbb{1}\mathbb{1}^T(N \odot I)) \odot I$.

$$(B^{13} \star N)_{i,j} = \sum_{k,l} B^{13}_{i,j,k,l}\, N_{k,l} = \begin{cases} N_{i,j} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It selects the non-diagonal part: $B^{13} \star N = N \odot J$.

$$(B^{14} \star N)_{i,j} = \sum_{k,l} B^{14}_{i,j,k,l}\, N_{k,l} = \begin{cases} N_{j,i} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It selects the transposed non-diagonal part: $B^{14} \star N = N^T \odot J$.

$$(B^{15} \star N)_{i,j} = \sum_{k,l} B^{15}_{i,j,k,l}\, N_{k,l} = \begin{cases} \sum_{k \neq l} N_{k,l} - \sum_{l \neq i} N_{i,l} - \sum_{l \neq i} N_{l,i} - \sum_{l \neq j} N_{j,l} - \sum_{l \neq j} N_{l,j} + N_{i,j} + N_{j,i} & \text{if } i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$

It is in fact a composition of other elements of the basis:

$$B^{15} \star N = (\mathbb{1}^T(N \odot J)\mathbb{1})J - B^7 \star N - B^8 \star N - B^9 \star N - B^{10} \star N + B^{13} \star N + B^{14} \star N.$$

From all this, we can deduce the following grammar that generates 2-IGN:

$$
\begin{aligned}
M &\to V_c\mathbb{1}^T \mid M \odot J \mid M \odot I \mid A \\
V_c &\to MV_c \mid \mathbb{1}
\end{aligned}
$$

As one can see, there are fewer operations in the CFG than operators in the basis. A numerical check of two of the identities above is given below.
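As a quick check (our own illustration), the following snippet verifies two of the closed forms above against their element-wise definitions; it assumes that $J$ denotes the matrix with zero diagonal and ones elsewhere, consistently with $B^{13} \star N = N \odot J$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
N = rng.normal(size=(n, n))
I = np.eye(n)
J = np.ones((n, n)) - I          # assumption: J = 11^T - I (zero diagonal, ones elsewhere)
ones = np.ones((n, 1))

# B2: sum over the row avoiding the diagonal, placed on the diagonal
lhs_B2 = np.diagflat([N[i].sum() - N[i, i] for i in range(n)])
rhs_B2 = np.diagflat((N * J) @ ones)
assert np.allclose(lhs_B2, rhs_B2)              # B2 ⋆ N = diag((N ⊙ J) 1)

# B4: the diagonal element of the corresponding column, projected off-diagonal
lhs_B4 = np.array([[N[j, j] if i != j else 0.0 for j in range(n)] for i in range(n)])
rhs_B4 = (ones @ ones.T @ (N * I)) * J
assert np.allclose(lhs_B4, rhs_B4)              # B4 ⋆ N = (11^T (N ⊙ I)) ⊙ J
```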
A.4 GNNs derived from different grammars

This subsection is dedicated to a description of the GNNs derived from the different grammars of the Q1 experiment (section 4). Figure 9 depicts a layer of a GNN derived from the exhaustive CFG GL3. The resulting architecture inherits 3-WL expressive power from theorem 3.1. Figure 10 describes a GNN derived from i-GL3, the CFG obtained during the reduction process of the framework of section 3. Since ${}^T$ is missing from r-GL3, Figure 11 describes a GNN derived from a grammar containing r-GL3 and $M \to M^T$.

A.5 G2N2 expressiveness at fixed depth

The following proposition ensures that, for architectures with at most three layers, the GNN derived from the exhaustive CFG GL3, called E-G2N2, and G2N2 have the same expressive power.

Proposition A.3 (Graph isomorphism (WL) expressiveness at fixed depth) Let $k \in \{1, 2, 3\}$; G2N2 and E-G2N2 have the same separative power after $k$ layers.

Proof. Remark that, for scalars $s$ and $s'$, if $ss'$ separates two graphs then $s$ or $s'$ already separates those graphs.

Figure 9: Layer of a GNN derived from GL3.

Figure 10: Layer of a GNN derived from i-GL3.

Figure 11: Layer of a GNN derived from r-GL3 $\cup \{M \to M^T\}$.

We denote by $M^{(k)}_G$, $Vc^{(k)}_G$ and $S^{(k)}_G$ respectively the sets of matrices, vectors and scalars that G2N2 can compute at layer $k$. We also denote by $M^{(k)}_E$, $Vc^{(k)}_E$, $Vr^{(k)}_E$ and $S^{(k)}_E$ respectively the sets of matrices, column and row vectors and scalars that E-G2N2 can compute at layer $k$. Remark that, thanks to the skip connection in both layers, these sets are increasing for inclusion with respect to $k$; for example $S^{(k)}_G \subseteq S^{(k+1)}_G$. We will show that $S^{(k)}_G = S^{(k)}_E$ for $k \in \{1, 2, 3\}$.

First of all, we have $M^{(0)}_E = M^{(0)}_G = \{A\}$ and $Vc^{(0)}_E = Vr^{(0)}_E = Vc^{(0)}_G = \{\mathbb{1}\}$ by construction of the architectures.

Comparing the two architectures with only one layer, we have $Vc^{(1)}_E = Vr^{(1)}_E = Vc^{(1)}_G$ since $\mathbb{1} \odot \mathbb{1} = \mathbb{1}$, but $M^{(1)}_E = M^{(1)}_G \cup \{vw^T,\ v, w \in Vc^{(0)}_G\}$. Since $\{vw^T,\ v, w \in Vc^{(0)}_G\} = \{\mathbb{1}\mathbb{1}^T\}$ and this matrix can be approximated by the bias of any of the matrix linear blocks of G2N2, we have $M^{(1)}_E = M^{(1)}_G$. Thus $S^{(1)}_E = S^{(1)}_G$.

Comparing the two architectures with two layers, we have $Vc^{(2)}_E = Vr^{(2)}_E = Vc^{(2)}_G$ since, as in Maron et al. (2019a), the vector Hadamard product can be approximated with a multilayer perceptron, but $M^{(2)}_E = M^{(2)}_G \cup \{vw^T,\ v, w \in Vc^{(1)}_G\}$. Let $N = vw^T$ with $v, w \in Vc^{(1)}_G$; then applying a readout function on $N$ results in $\mathbb{1}^TN\mathbb{1} = (\mathbb{1}^Tv)(w^T\mathbb{1})$. From that we have $S^{(2)}_E = S^{(2)}_G$, since the final decision multilayer perceptron can approximate $ss'$ for any scalars $s, s'$. So our hypothesis is true for $k = 2$.

Assume now that we compare the architectures with $k = 3$ layers. Since $M^{(2)}_E = M^{(2)}_G \cup \{vw^T,\ v, w \in Vc^{(1)}_G\}$, we have for the matrix case
$$M^{(3)}_E = M^{(3)}_G \cup \{vw^T,\ v, w \in Vc^{(2)}_G\} \cup \{M(vw^T) \text{ or } (vw^T)M,\ M \in M^{(2)}_E,\ v, w \in Vc^{(1)}_G\} \cup \{M \odot (vw^T),\ M \in M^{(2)}_E,\ v, w \in Vc^{(1)}_G\}.$$
We have three cases here. For $N \in \{vw^T,\ v, w \in Vc^{(2)}_G\}$, it is the same situation as for $k = 2$. For $N = M(vw^T)$ with $M \in M^{(2)}_E$ and $v, w \in Vc^{(1)}_G$, either $M \in M^{(2)}_G$ or $M \in \{vw^T,\ v, w \in Vc^{(1)}_G\}$. In the first case $Mv \in Vc^{(3)}_G$ and, since $Vc^{(1)}_G \subseteq Vc^{(3)}_G$, we are in the same case as for $k = 2$. In the second case, there exist $v'$ and $w' \in Vc^{(1)}_G \subseteq Vc^{(2)}_G$ such that $M = v'(w')^T$, which means that after the readout function we have $\mathbb{1}^TN\mathbb{1} = (\mathbb{1}^Tv')((w')^Tv)(w^T\mathbb{1})$. For $N = M \odot (vw^T)$ with $M \in M^{(2)}_E$ and $v, w \in Vc^{(1)}_G$, either $M \in M^{(2)}_G$ or $M \in \{vw^T,\ v, w \in Vc^{(1)}_G\}$. In the first case, we have $\mathbb{1}^T(M \odot (vw^T))\mathbb{1} \in S^{(3)}_G$; indeed, $\mathbb{1}^T(M \odot (vw^T))\mathbb{1} = v^TMw = \mathbb{1}^T(v \odot Mw)$. For the vector case, we have $Vc^{(3)}_E = Vr^{(3)}_E = Vc^{(3)}_G \cup \{vw^Tv',\ v, w \in Vc^{(1)}_G,\ v' \in Vc^{(2)}_G\}$. Since $w^Tv' \in S^{(3)}_G$ for all $w, v' \in Vc^{(2)}_G$, we have $S^{(3)}_E = S^{(3)}_G$.

B Spectral response of ML(L3)

The graph Laplacian is the matrix $L = D - A$ (or $L_n = I - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ for the normalised Laplacian), where $D$ is the diagonal degree matrix. Since $L$ is positive semidefinite, its eigendecomposition is $L = U\mathrm{diag}(\lambda)U^T$ with $U \in \mathbb{R}^{n \times n}$ orthogonal and $\lambda \in \mathbb{R}^n_+$. By analogy with the convolution theorem, one can define graph filtering in the frequency domain by $\tilde{x} = U\mathrm{diag}(\Omega(\lambda))U^Tx$, where $\Omega$ is the filter applied in the spectral domain.

Lemma B.1 Given the adjacency matrix $A$ of a graph, ML(L3) can compute the graph Laplacian $L$ and the normalised Laplacian $L_n$ of this graph.

Proof. ML(L3) can produce $A^2 \odot I$, which is equal to $D$. Thus it can compute $L = D - A$. For the normalised Laplacian, since the point-wise application of a function does not improve the expressive power of ML(L3) (Geerts (2020)), $D^{-\frac{1}{2}}$ is reachable by ML(L3). Thus, the normalised Laplacian $I - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ can be computed. A numerical illustration of these identities is given below.
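As a sanity check (ours, not from the paper), the following NumPy snippet builds $D$, $L$ and the normalised Laplacian from $A$ using only operations available in ML(L3) (matrix product, Hadamard product, diag, $\mathbb{1}$ and a point-wise function) and compares them with the direct definitions.

```python
import numpy as np

# toy adjacency matrix of an undirected graph (hypothetical example)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
ones = np.ones((4, 1))
I = np.diagflat(ones)                               # diag(1)

D = (A @ A) * I                                     # A^2 ⊙ I equals the degree matrix (Lemma B.1)
L = D - A                                           # graph Laplacian
D_inv_sqrt = np.where(D > 0, D, 1.0) ** -0.5 * I    # point-wise function f applied to D
L_n = I - D_inv_sqrt @ A @ D_inv_sqrt               # normalised Laplacian

assert np.allclose(np.diag(D), A.sum(axis=1))
assert np.allclose(L, np.diagflat(A.sum(axis=1)) - A)
```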
As in Balcilar et al. (2020), we define the spectral response $\phi \in \mathbb{R}^n$ of $C \in \mathbb{R}^{n \times n}$ as $\phi(\lambda) = \mathrm{diagonal}(U^TCU)$, where diagonal extracts the diagonal of a given square matrix. Using the spectral response, Balcilar et al. (2020) shows that most existing MPNNs act as low-pass filters, while high-pass and band-pass filters are experimentally proved to be necessary to increase model expressive power.

Theorem B.2 For any continuous filter $\Omega$ in the spectral domain of the normalised Laplacian, there exists a matrix in ML(L3) such that its spectral response approximates $\Omega$.

Proof. The spectrum of the normalised Laplacian is included in $[0, 2]$, which is compact. Thanks to the Stone-Weierstrass theorem, any continuous function on it can be approximated by a polynomial function. We just have to ensure the existence of a matrix in ML(L3) such that its spectral response is a polynomial function. For $k \in \mathbb{N}$, the spectral response of $L^k$ is $\lambda^k$ since we have
$$U^TL^kU = U^T(U\mathrm{diag}(\lambda)U^T)^kU = U^TU\mathrm{diag}(\lambda)^kU^TU = \mathrm{diag}(\lambda)^k.$$
From Lemma B.1, ML(L3) can compute $L$, and thus it can compute $L^k$ for any $k \in \mathbb{N}$. Since ML(L3) can produce all the matrices with a monomial spectral response and since the function that maps a given matrix to its spectral response is linear, ML(L3) can produce any matrix with a polynomial spectral response.

This section shows that a 3-WL GNN should be able to approximate any type of filter.

C Experiments

C.1 Experimental setting

In the experiments, all the linear blocks of a layer are set to the same width $S^{(l)} = b_\odot^{(l)} = b_\cdot^{(l)} = b_{\mathrm{diag}}^{(l)} = b_{MV_c}^{(l)}$. This means that MLP$^{(l)}_M$ takes as input a third-order tensor of dimensions $n \times n \times 4S^{(l)}$ and MLP$^{(l)}_{V_c}$ takes as input a matrix of dimensions $n \times 2S^{(l)}$. At each layer, the MLP depth is always 2 and the intermediate layer doubles the input dimension.

For the QM9 experiments, there are 4 edge attributes and 11 node features. We use 3 layers with $S^{(l)} = f_n^{(l)} = 64$ when learning one target at a time and $S^{(l)} = f_n^{(l)} = 32$ in the other experiment, for $l \in \{1, 2, 3\}$. The vector readout function is a sum over the components of $H^{(3)}$ and the matrix readout function is a sum over the components of the diagonal and the off-diagonal parts of $C^{(3)}$. Finally, 3 fully connected layers, with respective dimensions (512/256/(1 or 12)), are applied before using an absolute error loss. Complete results on this dataset can be found in Table 4.

Table 4: Results on the QM9 dataset, predicting each target at a time. The metric is MAE, the lower, the better.

| Target | 1-GNN | 1-2-3-GNN | DTNN | Deep LRP | NGNN | I2-GNN | PPGN | G2N2 |
|---|---|---|---|---|---|---|---|---|
| µ | 0.493 | 0.476 | 0.244 | 0.364 | 0.428 | 0.428 | 0.0934 | 0.0703 |
| α | 0.78 | 0.27 | 0.95 | 0.298 | 0.29 | 0.230 | 0.318 | 0.127 |
| ϵhomo | 0.00321 | 0.00337 | 0.00388 | 0.00254 | 0.00265 | 0.00261 | 0.00174 | 0.00172 |
| ϵlumo | 0.00355 | 0.00351 | 0.00512 | 0.00277 | 0.00297 | 0.00267 | 0.0021 | 0.00153 |
| ∆ϵ | 0.0049 | 0.0048 | 0.0112 | 0.00353 | 0.0038 | 0.0038 | 0.0029 | 0.00253 |
| R2 | 34.1 | 22.9 | 17.0 | 19.3 | 20.5 | 18.64 | 3.78 | 0.342 |
| ZPVE | 0.00124 | 0.00019 | 0.00172 | 0.00055 | 0.0002 | 0.00014 | 0.000399 | 0.0000951 |
| U0 | 2.32 | 0.0427 | 2.43 | 0.413 | 0.295 | 0.211 | 0.022 | 0.0169 |
| U | 2.08 | 0.111 | 2.43 | 0.413 | 0.361 | 0.206 | 0.0504 | 0.0162 |
| H | 2.23 | 0.0419 | 2.43 | 0.413 | 0.305 | 0.269 | 0.0294 | 0.0176 |
| G | 1.94 | 0.0469 | 2.43 | 0.413 | 0.489 | 0.261 | 0.024 | 0.0214 |
| Cv | 0.27 | 0.0944 | 2.43 | 0.129 | 0.174 | 0.0730 | 0.144 | 0.0429 |

The parameter settings for each of the 6 experiments related to the TUD datasets can be found in Table 5. Complete results on these datasets are given in Table 6.
Table 5: G2N2 parameter details for each dataset in our experiments on TUD.

| Parameters | MUTAG | PTC | Proteins | NCI1 | IMDB-B | IMDB-M |
|---|---|---|---|---|---|---|
| node features | 7 | 22 | 3 | 37 | 1 | 1 |
| edge features | 6 | 0 | 0 | 0 | 0 | 0 |
| # of G2N2 layers = $l_m$ | 3 | 3 | 3 | 3 | 3 | 3 |
| $f_n^{(l)}$, $l \in [\![1, l_m]\!]$ | 16 | 32 | 32 | 64 | 32 | 32 |
| $S^{(l)}$, $l \in [\![1, l_m]\!]$ | 16 | 32 | 32 | 64 | 32 | 32 |
| readout dimension | 256/128/1 | 512/256/1 | 512/256/1 | 128/64/1 | 512/256/1 | 512/256/3 |
| loss | BCEloss | BCEloss | BCEloss | BCEloss | BCEloss | CEloss |

Table 6: Results on the TUD datasets. The metric is accuracy, the higher, the better.

| Method | MUTAG | PTC | Proteins | NCI1 | IMDB-B | IMDB-M |
|---|---|---|---|---|---|---|
| WL kernel (Shervashidze et al. (2011)) | 90.4 ± 5.7 | 59.9 ± 4.3 | 75.0 ± 3.1 | 86.0 ± 1.8 | 73.8 ± 3.9 | 50.9 ± 3.8 |
| GNTK (Du et al. (2019)) | 90.0 ± 8.5 | 67.9 ± 6.9 | 75.6 ± 4.2 | 84.2 ± 1.5 | 76.9 ± 3.6 | 52.8 ± 4.6 |
| DGCNN (Zhang et al. (2018)) | 85.8 ± 1.8 | 58.6 ± 2.5 | 75.5 ± 0.9 | 74.4 ± 0.5 | 70.0 ± 0.9 | 47.8 ± 0.9 |
| IGN (Maron et al. (2019b)) | 83.9 ± 13.0 | 58.5 ± 6.9 | 76.6 ± 5.5 | 74.3 ± 2.7 | 72.0 ± 5.5 | 48.7 ± 3.4 |
| GIN (Xu et al. (2019)) | 89.4 ± 5.6 | 64.6 ± 7.0 | 76.2 ± 2.8 | 82.7 ± 1.7 | 75.1 ± 5.1 | 52.3 ± 2.8 |
| PPGNs (Maron et al. (2019a)) | 90.6 ± 8.7 | 66.2 ± 6.6 | 77.2 ± 4.7 | 83.2 ± 1.1 | 73.0 ± 5.8 | 50.5 ± 3.6 |
| Natural GN (de Haan et al. (2020)) | 89.4 ± 1.60 | 66.8 ± 1.79 | 71.7 ± 1.04 | 82.7 ± 1.35 | 74.8 ± 2.01 | 51.3 ± 1.50 |
| WEGL (Kolouri et al. (2020)) | N/A | 67.5 ± 7.7 | 76.5 ± 4.2 | N/A | 75.4 ± 5.0 | 52.3 ± 2.9 |
| GIN+GraphNorm (Cai et al. (2021)) | 91.6 ± 6.5 | 64.9 ± 7.5 | 77.4 ± 4.9 | 82.7 ± 1.7 | 76.0 ± 3.7 | N/A |
| GSNs (Bouritsas et al. (2022)) | 92.2 ± 7.5 | 68.2 ± 7.2 | 76.6 ± 5.0 | 83.5 ± 2.0 | 77.8 ± 3.3 | 54.3 ± 3.3 |
| G2N2 | 92.5 ± 4.3 | 72.3 ± 6.3 | 80.1 ± 3.7 | 82.8 ± 0.9 | 76.8 ± 2.8 | 54.0 ± 2.9 |

C.4 Spectral dataset

This dataset is composed of three 2D grids of size 30x30, for respectively training, validation and testing. We use 3 layers of G2N2 with $S^{(l)} = 32$ and $f_n^{(l)} = 32$ for $l \in \{1, 2, 3\}$. Our readout function is the identity over the last node embedding and a sum over the rows of the last edge embedding. We finally apply two fully connected layers on the output of the readout function and then use a Mean Square Error (MSE) loss to compare the output to the ground truth. A small numerical sketch of the spectral response defined in appendix B is given below.
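To connect appendix B with this node-regression task, here is a small NumPy illustration (ours, using a smaller grid than the 30x30 one of the dataset) that computes the spectral response $\phi(\lambda) = \mathrm{diagonal}(U^TCU)$ of powers of the normalised Laplacian of a grid graph and checks that the response of $L_n^k$ is $\lambda^k$, as used in the proof of Theorem B.2.

```python
import numpy as np

def grid_adjacency(h, w):
    """4-neighbour adjacency matrix of an h x w 2D grid."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:
                A[i, i + 1] = A[i + 1, i] = 1
            if r + 1 < h:
                A[i, i + w] = A[i + w, i] = 1
    return A

A = grid_adjacency(6, 6)                           # small grid instead of the 30x30 one
D = np.diagflat(A.sum(axis=1))
D_inv_sqrt = np.diagflat(np.diag(D) ** -0.5)
L_n = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

lam, U = np.linalg.eigh(L_n)                       # eigendecomposition L_n = U diag(λ) U^T

def spectral_response(C):
    return np.diag(U.T @ C @ U)                    # φ(λ) = diagonal(U^T C U)

for k in range(1, 4):
    assert np.allclose(spectral_response(np.linalg.matrix_power(L_n, k)), lam ** k)
```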