# Rethinking pooling in graph neural networks

Diego Mesquita¹, Amauri H. Souza², Samuel Kaski¹,³
¹Aalto University  ²Federal Institute of Ceará  ³University of Manchester
{diego.mesquita, samuel.kaski}@aalto.fi, amauriholanda@ifce.edu.br
Equal contribution.

Abstract

Graph pooling is a central component of a myriad of graph neural network (GNN) architectures. As an inheritance from traditional CNNs, most approaches formulate graph pooling as a cluster assignment problem, extending the idea of local patches in regular grids to graphs. Despite the wide adherence to this design choice, no work has rigorously evaluated its influence on the success of GNNs. In this paper, we build upon representative GNNs and introduce variants that challenge the need for locality-preserving representations, either using randomization or clustering on the complement graph. Strikingly, our experiments demonstrate that using these variants does not result in any decrease in performance. To understand this phenomenon, we study the interplay between convolutional layers and the subsequent pooling ones. We show that the convolutions play a leading role in the learned representations. In contrast to the common belief, local pooling is not responsible for the success of GNNs on relevant and widely-used benchmarks.

## 1 Introduction

The success of graph neural networks (GNNs) [3, 36] in many domains [9, 15, 25, 41, 46, 50] is due to their ability to extract meaningful representations from graph-structured data. Similarly to convolutional neural networks (CNNs), a typical GNN sequentially combines local filtering, nonlinearity, and (possibly) pooling operations to obtain refined graph representations at each layer. Whereas the convolutional filters capture local regularities in the input graph, the interleaved pooling operations reduce the graph representation while ideally preserving important structural information.

Although strategies for graph pooling come in many flavors [26, 30, 47, 49], most GNNs follow a hierarchical scheme in which the pooling regions correspond to graph clusters that, in turn, are combined to produce a coarser graph [4, 7, 13, 21, 47, 48]. Intuitively, these clusters generalize the notion of local neighborhood exploited in traditional CNNs and allow for pooling graphs of varying sizes. The cluster assignments can be obtained via deterministic clustering algorithms [4, 7] or be learned in an end-to-end fashion [21, 47]. Also, one can leverage node embeddings [21], graph topology [8], or both [47, 48], to pool graphs. We refer to these approaches as local pooling. Together with attention-based mechanisms [24, 26], the notion that clustering is a must-have property of graph pooling has been tremendously influential, resulting in an ever-increasing number of pooling schemes [14, 18, 21, 27, 48].

Implicit in any pooling approach is the belief that the quality of the cluster assignments is crucial for GNN performance. Nonetheless, to the best of our knowledge, this belief has not been rigorously evaluated. Misconceptions not only hinder new advances but also may lead to unnecessary complexity and obfuscate interpretability. This is particularly critical in graph representation learning, as we have seen a clear trend towards simplified GNNs [5, 6, 11, 31, 43].

In this paper, we study the extent to which local pooling plays a role in GNNs.
In particular, we choose representative models that are popular or claim to achieve state-of-the-art performance and simplify their pooling operators by eliminating any clustering-enforcing component. We either apply randomized cluster assignments or operate on complement graphs. Surprisingly, the empirical results show that the non-local GNN variants exhibit comparable, if not superior, performance to the original methods in all experiments.

To understand our findings, we design new experiments to evaluate the interplay between convolutional layers and pooling, and analyze the learned embeddings. We show that graph coarsening in both the original methods and our simplifications leads to homogeneous embeddings. This is because successful GNNs usually learn low-pass filters at early convolutional stages. Consequently, the specific way in which we combine nodes for pooling becomes less relevant.

In a nutshell, the contributions of this paper are: i) we show that popular and modern representative GNNs do not perform better than simple baselines built upon randomization and non-local pooling; ii) we explain why the simplified GNNs work and analyze the conditions for this to happen; and iii) we discuss the overall impact of pooling in the design of efficient GNNs. Aware of common misleading evaluation protocols [10, 11], we use benchmarks on which GNNs have proven to beat structure-agnostic baselines. We believe this work presents a sanity check for local pooling, suggesting that novel pooling schemes should rely on ablation studies to validate their effectiveness.

Notation. We represent a graph $G$, with $n > 0$ nodes, as an ordered pair $(A, X)$ comprising a symmetric adjacency matrix $A \in \{0, 1\}^{n \times n}$ and a matrix of node features $X \in \mathbb{R}^{n \times d}$. The matrix $A$ defines the graph structure: two nodes $i, j$ are connected if and only if $A_{ij} = 1$. We denote by $D$ the diagonal degree matrix of $G$, i.e., $D_{ii} = \sum_j A_{ij}$. We denote the complement of $G$ by $\bar{G} = (\bar{A}, X)$, where $\bar{A}$ has zeros on its diagonal and $\bar{A}_{ij} = 1 - A_{ij}$ for $i \neq j$.

## 2 Exposing local pooling

### 2.1 Experimental setup

Models. To investigate the relevance of local pooling, we study three representative models. We first consider GRACLUS [8], an efficient graph clustering algorithm that has been adopted as a pooling layer in modern GNNs [7, 34]. We combine GRACLUS with a sum-based convolutional operator [28]. Our second choice is the popular differentiable pooling model (DIFFPOOL) [47]. DIFFPOOL is the pioneering approach to learned pooling schemes and has served as inspiration to many methods [14]. Last, we look into the graph memory network (GMN) [21], a recently proposed model that reports state-of-the-art results on graph-level prediction tasks. Here, we focus on local pooling mechanisms and expect the results to be relevant for a large class of models whose principle is rooted in CNNs.

Tasks and datasets. We use four graph-level prediction tasks as running examples: predicting the constrained solubility of molecules (ZINC, [20]); classifying chemical compounds regarding their activity against lung cancer (NCI1, [40]); categorizing ego-networks of actors w.r.t. the genre of the movies in which they collaborated (IMDB-B, [45]); and classifying handwritten digits (Superpixels MNIST, [1, 10]). The datasets cover graphs with no, discrete, and continuous node features. For completeness, we also report results on five other broadly used datasets in Section 3.1. Statistics of the datasets are available in the supplementary material (Section A.1).
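As a concrete reference for these benchmarks, the sketch below shows one way to load them with the dataset wrappers shipped in PyTorch Geometric, the library the paper's code base builds on [12, 33]. The root paths and the `subset`/`split` arguments are illustrative choices of ours, not the authors' exact setup.

```python
# Hedged sketch: loading the four running-example datasets with PyTorch Geometric.
from torch_geometric.datasets import TUDataset, ZINC, MNISTSuperpixels

nci1 = TUDataset(root="data/NCI1", name="NCI1")              # discrete node features
imdb = TUDataset(root="data/IMDB", name="IMDB-BINARY")       # no node features
zinc = ZINC(root="data/ZINC", subset=True, split="train")    # regression: constrained solubility
smnist = MNISTSuperpixels(root="data/MNIST", train=True)     # continuous superpixel features

print(len(nci1), len(imdb), len(zinc), len(smnist))
```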
Evaluation. We split each dataset into train (80%), validation (10%), and test (10%) data. For the regression task, we use the mean absolute error (MAE) as performance metric. We report statistics of the performance metrics over 20 runs with different seeds. Similarly to the evaluation protocol in [10], we train all models with Adam [22] and apply learning rate decay, ranging from an initial $10^{-3}$ down to $10^{-5}$, with decay ratio of 0.5 and patience of 10 epochs. Also, we use early stopping based on the validation accuracy. For further details, we refer to Appendix A in the supplementary material.

Notably, we do not aim to benchmark the performance of the GNN models. Rather, we want to isolate local pooling effects. Therefore, for model selection, we follow guidelines provided by the original authors or in benchmarking papers and simply modify the pooling mechanism, keeping the remaining model structure untouched. All methods were implemented in PyTorch [12, 33] and our code is available at https://github.com/AaltoPML/Rethinking-pooling-in-GNNs.

Figure 1: Comparison between GRACLUS and COMPLEMENT. The plots show the distribution of the performances over 20 runs. In all tasks, GRACLUS and COMPLEMENT achieve similar performance. The results indicate that pooling nearby nodes is not relevant to obtain a successful pooling scheme.

### Case 1: Pooling with off-the-shelf graph clustering

We first consider a network design that resembles standard CNNs. Following architectures used in [7, 12, 13], we alternate graph convolutions [28] and pooling layers based on graph clustering [8]. At each layer, a neighborhood aggregation step combines each node feature vector with the features of its neighbors in the graph. The features are linearly transformed before running through a component-wise non-linear function (e.g., ReLU). In matrix form, the convolution is

$$Z^{(l)} = \operatorname{ReLU}\left(X^{(l-1)} W_1^{(l)} + A^{(l-1)} X^{(l-1)} W_2^{(l)}\right) \quad \text{with } (A^{(0)}, X^{(0)}) = (A, X), \qquad (1)$$

where $W_1^{(l)}, W_2^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ are model parameters, and $d_l$ is the embedding dimension at layer $l$.

The next step consists of applying the GRACLUS algorithm [8] to obtain a cluster assignment matrix $S^{(l)} \in \{0, 1\}^{n_{l-1} \times n_l}$ mapping each node to its cluster index in $\{1, \dots, n_l\}$, with $n_l < n_{l-1}$ clusters. We then coarsen the features by max-pooling the nodes in the same cluster:

$$X^{(l)}_{kj} = \max_{i:\, S^{(l)}_{ik} = 1} Z^{(l)}_{ij}, \quad k = 1, \dots, n_l, \qquad (2)$$

and coarsen the adjacency matrix such that $A^{(l)}_{ij} = 1$ iff clusters $i$ and $j$ have neighbors in $G^{(l-1)}$:

$$A^{(l)} = S^{(l)\top} A^{(l-1)} S^{(l)} \quad \text{with } A^{(l)}_{kk} = 0, \; k = 1, \dots, n_l. \qquad (3)$$

Clustering the complement graph. The clustering step embodies the idea that good pooling regions, like their CNN counterparts, should group nearby nodes. To challenge this intuition, we follow the opposite idea and set pooling regions by grouping nodes that are not connected in the graph. In particular, we compute the assignments $S^{(l)}$ by applying GRACLUS to the complement graph $\bar{G}^{(l-1)}$ of $G^{(l-1)}$. Note that we only employ the complement graph to compute the cluster assignments $S^{(l)}$. With the assignments in hand, we apply the pooling operation (Equations 2 and 3) using the original graph structure. Henceforth, we refer to this approach as COMPLEMENT.
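To make the pooling step concrete, here is a minimal sketch of Equations 2 and 3 for a dense adjacency matrix. The `cluster_fn` argument is a placeholder for Graclus or any other clustering routine returning node-to-cluster indices; passing the complement adjacency to it yields the COMPLEMENT variant. Function and variable names are ours, not the authors' implementation.

```python
import torch

def complement(adj: torch.Tensor) -> torch.Tensor:
    """Adjacency of the complement graph: 1 - A off the diagonal, zeros on the diagonal."""
    n = adj.size(0)
    return (1.0 - adj) * (1.0 - torch.eye(n))

def coarsen(z: torch.Tensor, adj: torch.Tensor, assign: torch.Tensor):
    """Pooling step of Eqs. (2)-(3) for a dense graph.

    z:      [n, d] node embeddings after the convolution (Eq. 1)
    adj:    [n, n] adjacency matrix of the current graph
    assign: [n]    cluster index of each node
    """
    # Relabel clusters to 0..n_l-1 and build the one-hot assignment matrix S
    _, assign = torch.unique(assign, return_inverse=True)
    n_clusters = int(assign.max()) + 1
    S = torch.zeros(z.size(0), n_clusters)
    S[torch.arange(z.size(0)), assign] = 1.0
    # Eq. (2): max-pool the embeddings of nodes that share a cluster
    x_pooled = torch.stack([z[assign == k].max(dim=0).values for k in range(n_clusters)])
    # Eq. (3): clusters are connected iff they contain neighboring nodes; zero the diagonal
    a_pooled = (S.t() @ adj @ S > 0).float()
    a_pooled.fill_diagonal_(0)
    return x_pooled, a_pooled

def complement_pool(z, adj, cluster_fn):
    """COMPLEMENT: compute assignments on the complement graph, pool on the original one."""
    assign = cluster_fn(complement(adj))   # cluster_fn stands in for Graclus
    return coarsen(z, adj, assign)
```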
Results. Figure 1 shows the distribution of the performance of the standard approach (GRACLUS) and its variant that operates on the complement graph (COMPLEMENT). In all tasks, both models perform almost on par and the distributions have a similar shape.

Table 1: GRACLUS and COMPLEMENT performances are similar to those of the best-performing isotropic GNNs (GIN and GraphSAGE) in [11] or [10]. For ZINC, lower is better.

| Models | NCI1 | IMDB-B | SMNIST | ZINC |
|---|---|---|---|---|
| GraphSAGE | 76.0 ± 1.8 | 69.9 ± 4.6 | 97.1 ± 0.02 | 0.41 ± 0.01 |
| GIN | 80.0 ± 1.4 | 66.8 ± 3.9 | 94.0 ± 1.30 | 0.41 ± 0.01 |
| GRACLUS | 80.1 ± 1.6 | 69.9 ± 3.5 | 94.3 ± 0.34 | 0.42 ± 0.02 |
| COMPLEMENT | 80.2 ± 1.4 | 70.9 ± 3.9 | 94.5 ± 0.33 | 0.43 ± 0.02 |

Despite their simplicity, GRACLUS and COMPLEMENT are strong baselines (see Table 1). For instance, Errica et al. [11] report GIN [44] as the best-performing model for the NCI1 dataset, achieving 80.0 ± 1.4 accuracy. This is indistinguishable from COMPLEMENT's performance (80.1 ± 1.6). We observe the same trend on IMDB-B: GraphSAGE [16] obtains 69.9 ± 4.6 and COMPLEMENT scores 70.6 ± 5.1, a less than 0.6% difference. Using the same data splits as [10], COMPLEMENT and GIN perform within a margin of 1% in accuracy (SMNIST) and 0.01 in absolute error (ZINC).

Figure 2: Boxplots for DIFFPOOL (DP) and its random variants: Uniform U(0, 1), Normal N(0, 1), and Bernoulli B(0.3). For all datasets, at least one of the random models achieves higher average accuracy than DIFFPOOL. Also, random pooling does not consistently lead to higher variance. The results show that the learned pooling assignments are not relevant for the performance of DIFFPOOL.

### Case 2: Differentiable pooling

DIFFPOOL [47] uses a GNN to learn cluster assignments for graph pooling. At each layer $l$, the soft cluster assignment matrix $S^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$ is

$$S^{(l)} = \operatorname{softmax}\left(\operatorname{GNN}^{(l)}_1(A^{(l-1)}, X^{(l-1)})\right) \quad \text{with } (A^{(0)}, X^{(0)}) = (A, X). \qquad (4)$$

The next step applies $S^{(l)}$ and a second GNN to compute the graph representation at layer $l$:

$$X^{(l)} = S^{(l)\top} \operatorname{GNN}^{(l)}_2(A^{(l-1)}, X^{(l-1)}) \quad \text{and} \quad A^{(l)} = S^{(l)\top} A^{(l-1)} S^{(l)}. \qquad (5)$$

During training, DIFFPOOL employs a sum of three loss functions: i) a supervised loss; ii) the Frobenius norm between $A^{(l)}$ and the Gramian of the cluster assignments, at each layer, i.e., $\sum_l \| A^{(l)} - S^{(l)} S^{(l)\top} \|_F$; and iii) the entropy of the cluster assignments at each layer. The second loss is referred to as the link prediction loss and enforces nearby nodes to be pooled together. The third loss penalizes the entropy, encouraging sharp cluster assignments.

Random assignments. To confront the influence of the learned cluster assignments, we replace $S^{(l)}$ in Equation 4 with a normalized random matrix $\operatorname{softmax}(\tilde{S}^{(l)})$. We consider three distributions:

$$\text{(Uniform)} \;\; \tilde{S}^{(l)}_{ij} \sim U(a, b) \qquad \text{(Normal)} \;\; \tilde{S}^{(l)}_{ij} \sim \mathcal{N}(\mu, \sigma^2) \qquad \text{(Bernoulli)} \;\; \tilde{S}^{(l)}_{ij} \sim B(\alpha) \qquad (6)$$

We sample the assignment matrix before training starts and do not propagate gradients through it.

Figure 3: MAE obtained over random permutations of each graph in the validation set. Both Invariant and fully Random produce similar predictions throughout the training.

Results. Figure 2 compares DIFFPOOL against the randomized variants. In all tasks, the highest average accuracy is due to a randomized approach. Nonetheless, there is no clear winner among all methods. Notably, the variances obtained with the random pooling schemes are not significantly higher than those from DIFFPOOL.
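As an illustration of the randomized variants, the sketch below samples a fixed assignment matrix as in Equation 6, row-normalizes it with a softmax, and applies the coarsening of Equation 5. The `gnn_embed` placeholder and the function names are ours; this is a minimal sketch, not the authors' implementation.

```python
import torch

def random_assignments(n_in: int, n_out: int, dist: str = "uniform") -> torch.Tensor:
    """Fixed, non-learned soft assignments (Eq. 6): sampled once, gradients detached."""
    if dist == "uniform":
        s = torch.rand(n_in, n_out)                           # U(0, 1)
    elif dist == "normal":
        s = torch.randn(n_in, n_out)                          # N(0, 1)
    elif dist == "bernoulli":
        s = torch.bernoulli(torch.full((n_in, n_out), 0.3))   # B(0.3)
    else:
        raise ValueError(dist)
    return torch.softmax(s, dim=-1).detach()                  # row-normalized, frozen

def diffpool_coarsen(x, adj, S, gnn_embed):
    """Eq. (5) with a (random or learned) assignment matrix S."""
    z = gnn_embed(adj, x)          # embedding GNN (GNN_2 in Eq. 5); placeholder callable
    x_pooled = S.t() @ z           # pooled node features
    adj_pooled = S.t() @ adj @ S   # coarsened (soft) adjacency
    return x_pooled, adj_pooled
```

Because the matrix is sampled before training and detached, it contributes no parameters and receives no gradient, matching the setup described above.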
Remark 1. Permutation invariance is an important property of most GNNs that assures consistency across different graph representations. However, the randomized variants break the invariance of DIFFPOOL. A simple fix consists of taking $S^{(l)} = X^{(l-1)} \tilde{S}$, where $\tilde{S} \in \mathbb{R}^{d_{l-1} \times n_l}$ is a random matrix. Figure 3 compares the randomized variants with and without this fix w.r.t. the validation error on artificially permuted graphs during training on the ZINC dataset. Results suggest that the variants are approximately invariant.

### Case 3: Graph memory networks

Graph memory networks (GMNs) [21] consist of a sequence of memory layers stacked on top of a GNN, also known as the initial query network. We denote the output of the initial query network by $Q^{(0)}$. The first step in a memory layer computes kernel matrices between input queries $Q^{(l-1)} = [q^{(l-1)}_1, \dots, q^{(l-1)}_{n_{l-1}}]$ and multi-head keys $K^{(l)}_h = [k^{(l)}_{1h}, \dots, k^{(l)}_{n_l h}]$:

$$S^{(l)}_{ijh} \propto \left(1 + \frac{\| q^{(l-1)}_i - k^{(l)}_{jh} \|^2}{\tau}\right)^{-\frac{\tau+1}{2}}, \qquad h = 1, \dots, H, \qquad (7)$$

where $H$ is the number of heads and $\tau$ is the degrees of freedom of the Student's t-kernel. We then aggregate the multi-head assignments $S^{(l)}_h$ into a single matrix $S^{(l)}$ using a $1 \times 1$ convolution followed by row-wise softmax normalization. Finally, we pool the node embeddings $Q^{(l-1)}$ according to their soft assignments $S^{(l)}$ and apply a single-layer neural net:

$$Q^{(l)} = \operatorname{ReLU}\left(S^{(l)\top} Q^{(l-1)} W^{(l)}\right). \qquad (8)$$

In this notation, queries $Q^{(l)}$ correspond to node embeddings $X^{(l)}$ for $l > 0$. Also, note that the memory layer does not leverage graph structure information, as it is fully condensed into $Q^{(0)}$. Following [21], we use a GNN as query network. In particular, we employ a two-layer network with the same convolutional operator as in Equation 1.

The loss function employed to learn GMNs consists of a convex combination of: i) a supervised loss; and ii) the Kullback-Leibler divergence between the learned assignments and their self-normalized squares. The latter aims to enforce sharp soft assignments, similarly to the entropy loss in DIFFPOOL.

Remark 2. Curiously, the intuition behind the loss that allegedly improves cluster purity might be misleading. For instance, uniform cluster assignments, the ones the loss was engineered to avoid, are a perfect minimizer for it. We provide more details in Appendix D.

Simplifying GMN. The principle behind GMNs consists of grouping nodes based on their similarities to learned keys (centroids). To scrutinize this principle, we propose two variants. In the first, we replace the kernel in Equation 7 by the Euclidean distance to fixed keys drawn from a uniform distribution. In contrast to vanilla GMNs, the resulting assignments group nodes that are farthest from a key. The second variant replaces the multi-head assignment matrices with a fixed matrix whose entries are independently sampled from a uniform distribution. We refer to these approaches, respectively, as DISTANCE and RANDOM.
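The following is a minimal sketch of the assignment step in Equation 7 and of the two simplifications, assuming dense tensors; all names are ours. DISTANCE scores nodes by their Euclidean distance to fixed random keys, and RANDOM skips the pairwise computation altogether.

```python
import torch

def student_t_assignments(queries, keys, tau: float = 1.0):
    """Eq. (7): soft assignments from a Student's t-kernel between queries and keys (one head).

    queries: [n, d] node embeddings (Q^{(l-1)});  keys: [m, d] learned keys/centroids.
    """
    d2 = torch.cdist(queries, keys).pow(2)                # squared Euclidean distances
    s = (1.0 + d2 / tau).pow(-(tau + 1.0) / 2.0)          # t-kernel similarities
    return s / s.sum(dim=-1, keepdim=True)                # row-normalize

def distance_assignments(queries, fixed_keys):
    """DISTANCE variant: raw distances to fixed (untrained) keys instead of the kernel."""
    s = torch.cdist(queries, fixed_keys)                  # larger score = farther from key
    return torch.softmax(s, dim=-1)

def random_memory_assignments(n_nodes: int, n_clusters: int):
    """RANDOM variant: a fixed uniform random assignment matrix, no pairwise distances."""
    return torch.softmax(torch.rand(n_nodes, n_clusters), dim=-1)

def memory_pool(queries, S, W):
    """Eq. (8): pool embeddings with the soft assignments and a single-layer net."""
    return torch.relu(S.t() @ queries @ W)
```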
Figure 4: GMN versus its randomized variants DISTANCE and RANDOM. The plots show that pooling based on a dissimilarity measure (distance) instead of a similarity (kernel) does not have a negative impact on performance. The same holds true when we employ random assignments. Both variants achieve lower MAE for the regression task (ZINC).

Results. Figure 4 compares GMN with its simplified variants. For all datasets, DISTANCE and RANDOM perform on par with GMN, with slightly better MAE for the ZINC dataset. Also, the variants present no noticeable increase in variance. It is worth mentioning that the simplified GMNs are naturally faster to train as they have significantly fewer learned parameters. In the case of DISTANCE, keys are taken as constants once sampled. Additionally, RANDOM bypasses the computation of the pairwise distances in Equation 7, which dominates the time of the forward pass in GMNs. On the largest dataset (SMNIST), DISTANCE takes up to half the time of GMN (per epoch), whereas RANDOM is up to ten times faster than GMN.

Figure 5: Non-local pooling: activations for three arbitrary graphs, before and after the first pooling layer. (Panels: COMPLEMENT [35, 19, 23] → [18, 10, 12]; Unif. (DIFFPOOL) [35, 19, 23] → [10, 10, 10]; Rand. (GMN) [35, 19, 23] → [10, 10, 10].) The convolutional layers learn to output similar embeddings within a same graph. As a result, methods become approximately invariant to how embeddings are pooled. Darker tones denote lower activation values. To better assess homogeneity, we normalize the color range of each embedding matrix individually. The number of nodes for each embedding matrix is given inside brackets.

The results in the previous section are counter-intuitive. We now analyze the factors that led to these results. We show that the convolutions preceding the pooling layers learn representations which are approximately invariant to how nodes are grouped. In particular, GNNs learn smooth node representations at the early layers. To experimentally show this, we remove the initial convolutions that perform this early smoothing. As a result, all networks experience a significant drop in performance. Finally, we show that local pooling offers no benefit in terms of accuracy to the evaluated models.

Pooling learns approximately homogeneous node representations. The results with the random pooling methods suggest that any convex combination of the node features enables us to extract good graph representations. Intuitively, this is possible if the nodes display similar activation patterns before pooling. If we interpret convolutions as filters defined over node features/signals, this phenomenon can be explained if the initial convolutional layers extract low-frequency information across the graph input channels/embedding components. To evaluate this intuition, we compute the activations before the first pooling layer and after each of the subsequent pooling layers. Figure 5 shows activations for the random pooling variants of DIFFPOOL and GMN, and the COMPLEMENT approach on ZINC.

The plots in Figure 5 validate that the first convolutional layers produce node features which are relatively homogeneous within the same graph, especially for the randomized variants. The networks learn features that resemble vertical bars. As expected, the pooling layers accentuate this phenomenon, extracting even more similar representations. We report embeddings for other datasets in Appendix E.

Even methods based on local pooling tend to learn homogeneous representations. As one can notice from Figure 6, DIFFPOOL and GMN show smoothed patterns in the outputs of their initial pooling layer. This phenomenon explains why the performance of the randomized approaches matches that of their original counterparts.
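One simple way to quantify the homogeneity visible in Figures 5 and 6 is to measure how little the node embeddings of a graph vary across nodes. The sketch below computes the average per-feature standard deviation, a statistic of our choosing rather than one used in the paper; the tensors are placeholders standing in for real activations.

```python
import torch

def within_graph_spread(embeddings: torch.Tensor) -> float:
    """Average per-feature standard deviation across the nodes of one graph.

    A value close to zero means the nodes carry nearly identical embeddings, so any
    convex combination of them (i.e., any pooling assignment) yields a similar result.
    """
    return embeddings.std(dim=0).mean().item()

# Placeholder activations: before the first pooling layer and after it (more homogeneous).
z_before = torch.randn(35, 64)
z_after = torch.randn(10, 64) * 0.1
print(within_graph_spread(z_before), within_graph_spread(z_after))
```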
The results suggest that the loss terms designed to enforce local clustering are either not beneficial to the learning task or are obfuscated by the supervised loss. This observation does not apply to GRACLUS, as it employs a deterministic clustering algorithm, separating the possibly competing goals of learning hierarchical structures and minimizing a supervised loss.

Figure 6: Local pooling: activations for the same graphs from Figure 5, before and after the first pooling layer. (Panels: GRACLUS [35, 19, 23] → [21, 11, 13]; DIFFPOOL [35, 19, 23] → [10, 10, 10]; GMN [35, 19, 23] → [10, 10, 10].) Similar to the simplified variants, their standard counterparts learn homogeneous embeddings, especially for DIFFPOOL and GMN. The low variance across node embedding columns illustrates this homogeneity. Darker tones denote lower values and colors are not comparable across embedding matrices. Brackets show the number of nodes after each layer for each graph.

Figure 7: Supervised loss on the validation set for the ZINC and SMNIST datasets. Scaling up the unsupervised loss has no positive impact on the performance of DIFFPOOL.

To further gauge the impact of the unsupervised loss on the performance of these GNNs, we compare two DIFFPOOL models trained with the link prediction loss multiplied by the weighting factors $\lambda = 10^0$ and $\lambda = 10^3$. Figure 7 shows the validation curves of the supervised loss for ZINC and SMNIST. We observe that the supervised losses for models with both of the $\lambda$ values converge to a similar point, at a similar rate. This validates that the unsupervised loss (link prediction) has little to no effect on the predictive performance of DIFFPOOL. We have observed a similar behavior for GMNs, which we report in the supplementary material.

Discouraging smoothness. In Figure 6, we observe homogeneous node representations even before the first pooling layer. This naturally poses a challenge for the upcoming pooling layers to learn meaningful local structures. These homogeneous embeddings correspond to low-frequency signals defined on the graph nodes. In the general case, achieving such patterns is only possible with more than a single convolution. This can be explained from a spectral perspective. Since each convolutional layer corresponds to filters that act linearly in the spectral domain, a single convolution cannot filter out specific frequency bands. These ideas have already been exploited to develop simplified GNNs [31, 43] that compute fixed polynomial filters in a normalized spectrum.

Table 2: All networks experience a decrease in performance when implemented with a single convolution before pooling (columns "= 1"). For ZINC, lower is better.

| Model | NCI1 (= 1) | NCI1 (> 1) | IMDB-B (= 1) | IMDB-B (> 1) | SMNIST (= 1) | SMNIST (> 1) | ZINC (= 1) | ZINC (> 1) |
|---|---|---|---|---|---|---|---|---|
| GRACLUS | 76.6 | 80.1 | 69.7 | 69.9 | 89.1 | 94.3 | 0.458 | 0.429 |
| COMPLEMENT | 74.2 | 80.2 | 69.8 | 70.9 | 81.0 | 94.5 | 0.489 | 0.431 |
| DIFFPOOL | 71.3 | 76.9 | 68.5 | 68.5 | 66.6 | 89.6 | 0.560 | 0.459 |
| U-DIFFPOOL | 72.8 | 76.8 | 69.0 | 68.7 | 66.4 | 89.4 | 0.544 | 0.457 |
| GMN | 77.1 | 80.2 | 68.9 | 69.9 | 87.4 | 92.4 | 0.469 | 0.409 |
| Rand. GMN | 76.4 | 80.2 | 66.6 | 66.8 | 90.0 | 94.0 | 0.465 | 0.400 |

Remarkably, using multiple convolutions to obtain initial embeddings is common practice [18, 26]. To evaluate its impact, we apply a single convolution before the first pooling layer in all networks. We then compare these networks against the original implementations. Table 2 displays the results.
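The spectral argument above can be made concrete with the simplified (linear) filters of [31, 43]: stacking $k$ linear convolutions with the normalized adjacency acts on the graph spectrum as the polynomial response $(1 - \lambda)^k$, which only becomes markedly low-pass as $k$ grows. The sketch below computes this response on a toy graph; the graph and all names are illustrative choices of ours.

```python
import numpy as np

# Toy graph: a 6-node ring (purely illustrative).
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

# Normalized adjacency with self-loops, as in the simplified GNNs of [31, 43].
A_tilde = A + np.eye(n)
d = A_tilde.sum(axis=1)
S = A_tilde / np.sqrt(np.outer(d, d))          # D^{-1/2} (A + I) D^{-1/2}

# Frequencies: eigenvalues of the corresponding normalized Laplacian L = I - S.
lam = np.linalg.eigvalsh(np.eye(n) - S)

# A stack of k linear convolutions applies the polynomial response (1 - lambda)^k.
for k in (1, 2, 4):
    print(f"k={k}: frequency response", np.round((1 - lam) ** k, 3))
# For k = 1 the response is a plain linear function of the frequency; as k grows it
# concentrates on low frequencies, i.e., on smooth (homogeneous) node signals.
```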
All models report a significant drop in performance with a single convolutional layer. On NCI1, the methods obtain accuracies about 4% lower, on average. Likewise, GMN and GRACLUS report a 4% performance drop on SMNIST. With a single initial GraphSAGE convolution, the performances of DIFFPOOL and its uniform variant drop to as low as 66.4% on SMNIST. The only dataset on which this dramatic drop is not observed is IMDB-B, which has constant node features and therefore may not benefit from the additional convolutions. Note that, under reduced expressiveness, pooling far-away nodes seems to impose a negative inductive bias, as COMPLEMENT consistently fails to rival the performance of GRACLUS.

Intuitively, the number of convolutions needed to achieve smooth representations depends on the richness of the initial node features. Many popular datasets rely on one-hot features and might require a small number of initial convolutions. Disentangling these effects is outside the scope of this work.

Is pooling overrated? Our experiments suggest that local pooling does not play an important role in the performance of GNNs. A natural next question is whether local pooling presents any advantage over global pooling strategies. We run a GNN with global mean pooling on top of three convolutional layers; the results are: 79.65 ± 2.07 (NCI1), 71.05 ± 4.58 (IMDB-B), 95.20 ± 0.18 (SMNIST), and 0.443 ± 0.03 (ZINC). These performances are on par with our previous results obtained using GRACLUS, DIFFPOOL, and GMN.

Regardless of how surprising this may sound, our findings are consistent with results reported in the literature. For instance, GraphSAGE networks outperform DIFFPOOL in a rigorous evaluation protocol [10, 11], although GraphSAGE is the convolutional component of DIFFPOOL. Likewise, but in the context of attention, Knyazev et al. [24] argue that, except under certain circumstances, general attention mechanisms are negligible or even harmful. Similar findings also hold for CNNs [35, 39].

Figure 8: Performance gap between GRACLUS and COMPLEMENT for different hyperparameter settings (2–5 layers, embedding dimensions 16–128). Numbers denote the performance of GRACLUS (negative MAE for ZINC, AUROC for MOLHIV, and accuracy for the rest). Red tones denote cases in which COMPLEMENT obtains better performance. For all datasets, the performance gap is not larger than 0.05 (or 5% in accuracy). The predominant light tones demonstrate that both models obtain overall similar results, with no clear trend toward specific hyperparameter choices.

Permutation invariance is a relevant property for GNNs as it guarantees that the network's predictions do not vary with the graph representation. For completeness, we state in Appendix C the permutation invariance of the simplified GNNs discussed in Section 2.

### 3.1 Additional results

More datasets. We consider four additional datasets commonly used to assess GNNs, three of which are part of the TU datasets [29]: PROTEINS, NCI109, and DD; and one from the recently proposed Open Graph Benchmark (OGB) framework [17]: MOLHIV. Since Ivanov et al.
[19] showed that the IMDB-B dataset is affected by a serious isomorphism bias, we also report results on the cleaned version of the IMDB-B dataset, hereafter referred to as IMDB-C.

Table 3: Results for the additional datasets. Again, we observe that non-local pooling yields results as good as local pooling.

| Models | IMDB-C | PROTEINS | NCI109 | DD | MOLHIV |
|---|---|---|---|---|---|
| GRACLUS | 70.5 ± 8.3 | 72.6 ± 4.1 | 77.7 ± 2.6 | 72.1 ± 5.2 | 74.6 ± 1.8 |
| COMPLEMENT | 70.1 ± 7.0 | 75.3 ± 3.1 | 77.0 ± 2.4 | 72.1 ± 3.6 | 76.4 ± 1.9 |
| DIFFPOOL | 70.9 ± 6.7 | 75.2 ± 4.0 | 72.2 ± 1.5 | 76.9 ± 4.4 | 75.1 ± 1.3 |
| N-DIFFPOOL | 70.3 ± 7.5 | 74.6 ± 3.8 | 71.8 ± 2.1 | 78.3 ± 4.1 | 74.8 ± 1.3 |
| GMN | 68.4 ± 7.1 | 74.8 ± 3.3 | 77.0 ± 1.7 | 73.9 ± 4.6 | 73.7 ± 1.8 |
| Random-GMN | 70.2 ± 8.0 | 74.8 ± 3.9 | 76.9 ± 2.0 | 74.3 ± 3.6 | 73.5 ± 2.6 |

Table 3 shows the results for the additional datasets. Similarly to what we have observed in the previous experiments, non-local pooling performs on par with local pooling. The largest performance gap occurs for the PROTEINS and MOLHIV datasets, on which COMPLEMENT achieves accuracy around 3% higher than GRACLUS, on average. The performance of all methods on IMDB-C does not differ significantly from their performance on IMDB-B. However, removing the isomorphisms in IMDB-B reduces the dataset size, which clearly increases variance.

Another pooling method. We also provide results for MINCUTPOOL [2], a recently proposed pooling scheme based on spectral clustering. This scheme integrates an unsupervised loss which stems from a relaxation of a MINCUT objective and learns to assign clusters in a spectrum-free way. We compare MINCUTPOOL with a random variant for all datasets employed in this work. Again, we find that a random pooling mechanism achieves comparable results to its local counterpart. Details are given in Appendix B.

Sensitivity to hyperparameters. As mentioned in Section 2.1, this paper does not intend to benchmark the predictive performance of models equipped with different pooling strategies. Consequently, we did not exhaustively optimize model hyperparameters. One may wonder whether our results hold for a broader set of hyperparameter choices. Figure 8 depicts the performance gap between GRACLUS and COMPLEMENT for a varying number of pooling layers and embedding dimensionality over a single run. We observe the greatest performance gap in favor of COMPLEMENT (e.g., 5 layers and 32 dimensions on ZINC), which does not amount to an improvement greater than 0.05 in MAE. Overall, we find that the choice of hyperparameters does not significantly increase the performance gap between GRACLUS and COMPLEMENT.

## 4 Related works

Graph pooling usually falls into global and hierarchical approaches. Global methods aggregate the node representations either via simple flattening procedures, such as summing (or averaging) the node embeddings [44], or via more sophisticated set aggregation schemes [32, 49]. On the other hand, hierarchical methods [7, 14, 18, 21, 26, 27, 47] sequentially coarsen graph representations over the network's layers. Notably, Knyazev et al. [24] provide a unified view of local pooling and node attention mechanisms, and study the ability of pooling methods to generalize to larger and noisy graphs. Also, they show that the effect of attention is not crucial and sometimes can be harmful, and propose a weakly-supervised approach.

Simple GNNs. Over the last years, we have seen a surge of simplified GNNs. Wu et al. [43] show that removing the nonlinearity in GCNs [23] does not negatively impact their performance on node classification tasks. The resulting feature extractor consists of a low-pass-type filter.
In [31], the authors take this idea further, evaluate the resilience to feature noise, and provide insights on GCN-based designs. For graph classification, Chen et al. [6] report that linear convolutional filters followed by nonlinear set functions achieve competitive performance against modern GNNs. Cai and Wang [5] propose a strong baseline based on local node statistics for non-attributed graph classification tasks.

Benchmarking GNNs. Errica et al. [11] demonstrate how the lack of rigorous evaluation protocols affects reproducibility and hinders new advances in the field. They found that structure-agnostic baselines outperform popular GNNs on at least three commonly used chemical datasets. Coming to the rescue of GNNs, Dwivedi et al. [10] argue that this lack of proper evaluation stems mainly from the use of small datasets. To tackle this, they introduce a new benchmarking framework with datasets that are large enough for researchers to identify key design principles and assess statistically relevant differences between GNN architectures. Similar issues related to the use of small datasets are reported in [37].

Understanding pooling and attention in regular domains. In the context of CNNs for object recognition tasks, Ruderman et al. [35] evaluate the role of pooling in CNNs with respect to deformation stability. The results show that pooling is not necessary for appropriate deformation stability. The explanation lies in the network's ability to learn smooth filters across the layers. Sheng et al. [38] and Zhao et al. [51] propose random pooling as a fast alternative to conventional pooling in CNNs. Regarding attention-based models, Wiegreffe and Pinter [42] show that attention weights usually do not provide meaningful explanations for predictions. These works demonstrate the importance of properly assessing core assumptions in deep learning.

## 5 Conclusion

In contrast to the ever-increasing influx of GNN architectures, very few works rigorously assess which design choices are crucial for efficient learning. Consequently, misconceptions can be widely spread, influencing the development of models built on flawed intuitions. In this paper, we study the role of local pooling and its impact on the performance of GNNs. We show that most GNN architectures employ convolutions that can quickly lead to smooth node representations. As a result, the pooling layers become approximately invariant to specific cluster assignments. We also found that clustering-enforcing regularization is usually innocuous. In a series of experiments on accredited benchmarks, we show that extracting local information is not a necessary principle for efficient pooling. By shedding new light onto the role of pooling, we hope to contribute to the community in at least two ways: i) providing a simple sanity check for novel pooling strategies; and ii) deconstructing misconceptions and wrong intuitions related to the benefits of graph pooling.

## Acknowledgments and Disclosure of Funding

This work was funded by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI, and grants 294238, 292334 and 319264). We acknowledge the computational resources provided by the Aalto Science-IT Project.

## Broader impact

Graph neural networks (GNNs) have become the de facto learning tools in many valuable domains such as social network analysis, drug discovery, recommender systems, and natural language processing. Nonetheless, the fundamental design principles behind the success of GNNs are only partially understood.
This work takes a step further in understanding local pooling, one of the core design choices in many GNN architectures. We believe this work will help researchers and practitioners better choose where to invest their time and resources to build more accurate GNNs.

## References

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] F. M. Bianchi, D. Grattarola, and C. Alippi. Spectral clustering with graph neural networks for graph pooling. In International Conference on Machine Learning (ICML), 2020.
[3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[4] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
[5] C. Cai and Y. Wang. A simple yet effective baseline for non-attribute graph classification. arXiv:1811.03508, 2018.
[6] T. Chen, S. Bian, and Y. Sun. Are powerful graph neural nets necessary? A dissection on graph classification. arXiv:1905.04579, 2019.
[7] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[8] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944–1957, 2007.
[9] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
[10] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking graph neural networks. arXiv:2003.00982, 2020.
[11] F. Errica, M. Podda, D. Bacciu, and A. Micheli. A fair comparison of graph neural networks for graph classification. In International Conference on Learning Representations (ICLR), 2020.
[12] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In Workshop track of the International Conference on Learning Representations (ICLR), 2019.
[13] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] H. Gao and S. Ji. Graph U-Nets. In International Conference on Machine Learning (ICML), 2019.
[15] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), 2017.
[16] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[17] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[18] J. Huang, Z. Li, N. Li, S. Liu, and G. Li. AttPool: Towards hierarchical feature representation in graph convolutional networks via attention mechanism.
In International Conference on Computer Vision (ICCV), 2019.
[19] S. Ivanov, S. Sviridov, and E. Burnaev. Understanding isomorphism bias in graph data sets. arXiv:1910.12091, 2019.
[20] W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning (ICML), 2018.
[21] A. H. Khasahmadi, K. Hassani, P. Moradi, L. Lee, and Q. Morris. Memory-based graph networks. In International Conference on Learning Representations (ICLR), 2020.
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[23] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
[24] B. Knyazev, G. W. Taylor, and M. Amer. Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[25] L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Computer Vision and Pattern Recognition (CVPR), 2018.
[26] J. Lee, I. Lee, and J. Kang. Self-attention graph pooling. In International Conference on Machine Learning (ICML), 2019.
[27] Y. Ma, S. Wang, C. C. Aggarwal, and J. Tang. Graph convolutional networks with eigenpooling. In International Conference on Knowledge Discovery & Data Mining (KDD), 2019.
[28] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
[29] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In Workshop track of the International Conference on Machine Learning (ICML), 2020.
[30] R. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro. Relational pooling for graph representations. In International Conference on Machine Learning (ICML), 2019.
[31] H. NT and T. Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv:1905.09550, 2019.
[32] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR), 2016.
[33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NeurIPS Workshop), 2017.
[34] S. Rhee, S. Seo, and S. Kim. Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.
[35] A. Ruderman, N. C. Rabinowitz, A. S. Morcos, and D. Zoran. Pooling is neither necessary nor sufficient for appropriate deformation stability in CNNs. arXiv:1804.04438, 2018.
[36] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[37] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann. Pitfalls of graph neural network evaluation. arXiv:1811.05868, 2018.
[38] J. Sheng, C. Chen, C. Fu, and C. J. Xue. EasyConvPooling: Random pooling with easy convolution for accelerating training and testing. arXiv:1806.01729, 2018.
[39] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M.
Riedmiller. Striving for Simplicity: The All Convolutional Net. In Workshop track of the International Conference on Learning Representations (ICLR), 2015.
[40] N. Wale, I. A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
[41] C. Y. Wee, C. Liu, A. Lee, J. S. Poh, H. Ji, and A. Qiu. Cortical graph neural network for AD and MCI diagnosis and transfer learning across populations. NeuroImage: Clinical, 23, 2019.
[42] S. Wiegreffe and Y. Pinter. Attention is not not explanation. In Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[43] F. Wu, A. H. Souza Jr, T. Zhang, C. Fifty, T. Yu, and K. Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning (ICML), 2019.
[44] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.
[45] P. Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In International Conference on Knowledge Discovery and Data Mining (KDD), 2015.
[46] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In International Conference on Knowledge Discovery & Data Mining (KDD), 2018.
[47] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[48] H. Yuan and S. Ji. StructPool: Structured graph pooling via random fields. In International Conference on Learning Representations (ICLR), 2020.
[49] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[50] Y. Zhang, P. Qi, and C. D. Manning. Graph convolution over pruned dependency trees improves relation extraction. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
[51] G. Zhao, J. Wang, and Z. Zhang. Random shifting for CNN: a solution to reduce information loss in down-sampling layers. In International Joint Conference on Artificial Intelligence (IJCAI), 2017.