Published as a conference paper at ICLR 2021

SIMPLE SPECTRAL GRAPH CONVOLUTION

Hao Zhu, Piotr Koniusz
Australian National University, Canberra, Australia
Data61/CSIRO, Canberra, Australia
{hao.zhu,piotr.koniusz}@anu.edu.au
The code is available at https://github.com/allenhaozhu/SSGC.

ABSTRACT

Graph Convolutional Networks (GCNs) are leading methods for learning graph representations. However, without specially designed architectures, the performance of GCNs degrades quickly with increased depth. As the aggregated neighborhood size and the neural network depth are two completely orthogonal aspects of graph representation, several methods focus on summarizing the neighborhood by aggregating K-hop neighborhoods of nodes while using shallow neural networks. However, these methods still encounter oversmoothing, and suffer from high computation and storage costs. In this paper, we use a modified Markov Diffusion Kernel to derive a variant of GCN called Simple Spectral Graph Convolution (S2GC). Our spectral analysis shows that the simple spectral graph convolution used in S2GC is a trade-off between low- and high-pass filter bands, which captures the global and local contexts of each node. We provide two theoretical claims which demonstrate that we can aggregate over a sequence of increasingly larger neighborhoods compared to competitors while limiting severe oversmoothing. Our experimental evaluation shows that S2GC with a linear learner is competitive in text and node classification tasks. Moreover, S2GC is comparable to other state-of-the-art methods for node clustering and community prediction tasks.

1 INTRODUCTION

In the past decade, deep learning has become mainstream in computer vision and machine learning. Although deep learning has been applied with great success to feature extraction on Euclidean grid-structured data, the data in many practical scenarios lies on non-Euclidean structures, whose processing poses a challenge for deep learning. By defining a convolution operator between a graph and a signal, Graph Convolutional Networks (GCNs) generalize Convolutional Neural Networks (CNNs) to graph-structured inputs with node attributes. Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017) unify graph convolution as two functions: a transformation function and an aggregation function. An MPNN iteratively propagates node features along the adjacency structure of the graph over a number of rounds. Despite their enormous success in applications such as social media, traffic analysis, biology, recommendation systems and even computer vision, many current GCN models use a fairly shallow setting: recent models such as GCN (Kipf & Welling, 2016) achieve their best performance with 2 layers. In other words, 2-layer GCN models aggregate nodes within the two-hop neighborhood and thus have no ability to extract information from K-hop neighborhoods for K > 2. Moreover, stacking more layers and adding non-linearities tends to degrade the performance of these models. This phenomenon is called oversmoothing (Li et al., 2018a): as the number of layers increases, the representations of the nodes in a GCN converge to similar values that are no longer distinguishable from one another. Even adding residual connections, an effective trick for training very deep CNNs, merely slows down the oversmoothing issue (Kipf & Welling, 2016) in GCNs. It appears that deep GCN models gain nothing but performance degradation from the deep architecture.
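To make the oversmoothing effect concrete, the following minimal numerical sketch (illustrative only, not taken from the paper or its released code) repeatedly applies the symmetrically normalized adjacency matrix with self-loops (defined formally in Section 2) to random node features and tracks how quickly the node representations collapse onto nearly identical values.

```python
# Illustrative sketch of oversmoothing: repeated propagation with the
# renormalized adjacency matrix shrinks the spread of node representations.
import numpy as np

rng = np.random.default_rng(0)

n = 50
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1)
A = A + A.T                              # random symmetric adjacency, no self-loops

A_tilde = A + np.eye(n)                  # add self-loops
d = A_tilde.sum(1)
T = A_tilde / np.sqrt(np.outer(d, d))    # D^{-1/2} (A + I) D^{-1/2}

H = rng.standard_normal((n, 16))         # random node features
for k in range(1, 65):
    H = T @ H                            # one propagation (aggregation) step
    if k in (2, 8, 32, 64):
        diffs = H[:, None, :] - H[None, :, :]
        spread = np.sqrt((diffs ** 2).sum(-1)).mean()
        print(f"K = {k:2d}  mean pairwise distance = {spread:.4f}")
# The printed spread shrinks rapidly with K: node representations become
# nearly indistinguishable, which is the oversmoothing phenomenon.
```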
One solution is to widen the receptive field of the aggregation function while limiting the depth of the network, because the required neighborhood size and the neural network depth can be regarded as two separate aspects of design. To this end, SGC (Wu et al., 2019) captures the context of K-hop neighbors in the graph by applying the K-th power of the normalized adjacency matrix in a single layer of the neural network. This scheme is also used for attributed graph clustering (Zhang et al., 2019). However, SGC also suffers from oversmoothing as K → ∞, as shown in Theorem 1. PPNP and APPNP (Klicpera et al., 2019a) replace the power of the normalized adjacency matrix with the Personalized PageRank matrix to solve the oversmoothing problem. Although APPNP relieves the oversmoothing problem, it employs a non-linear operation which requires costly computation of the derivative of the filter, due to the non-linearity applied on top of the multiplication of the feature matrix with learnable weights. In contrast, we show that our approach enjoys a free derivative computed in the feed-forward step due to the use of a linear model. Furthermore, APPNP aggregates over multiple k-hop neighborhoods (k = 0, ..., K), but its weighting scheme favors either the global or the local context, making it difficult if not impossible to find a good value of the balancing parameter. In contrast, our approach aggregates over k-hop neighborhoods in a well-balanced manner. GDC (Klicpera et al., 2019b) further extends APPNP by generalizing Personalized PageRank (Page et al., 1999) to an arbitrary graph diffusion process. GDC has more expressive power than SGC (Wu et al., 2019), PPNP and APPNP (Klicpera et al., 2019a), but it leads to a dense transition matrix which makes computation and storage intractable for large graphs, although the authors suggest that a shrinkage method can be used to sparsify the generated transition matrix. Noteworthy are also the orthogonal research directions of Sun et al. (2019); Koniusz & Zhang (2020); Elinas et al. (2020), which improve the performance of GCNs by perturbation of the graph, high-order aggregation of features, and variational inference, respectively.

To tackle the above issues, we propose a Simple Spectral Graph Convolution (S2GC) network for node clustering and node classification in semi-supervised and unsupervised settings. By analyzing the Markov Diffusion Kernel (Fouss et al., 2012), we obtain a very simple and effective spectral filter: we aggregate k-step diffusion matrices over k = 0, ..., K steps, which is equivalent to aggregating over neighborhoods of gradually increasing sizes. Moreover, we show that our design incorporates larger neighborhoods compared to SGC and copes better with oversmoothing. We explain that limiting the overdominance of the largest neighborhoods in the aggregation step limits oversmoothing while preserving the large context of each node. We also show via spectral analysis that S2GC is a trade-off between low- and high-pass filter bands, which leads to capturing the global and local contexts of each node. Moreover, we show how S2GC and APPNP (Klicpera et al., 2019a) are related and explain why S2GC captures a range of neighborhoods better than APPNP. Our experimental results include node clustering, unsupervised and semi-supervised node classification, node property prediction and supervised text classification. We show that S2GC is highly competitive, often significantly outperforming state-of-the-art methods.
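The aggregation sketched above admits a very short implementation. The snippet below is a simplified illustration (not the authors' reference code, which is available at the repository linked earlier): it precomputes S2GC-style features as the average of the k-step propagated features for k = 0, ..., K using the renormalized transition matrix; the function name s2gc_features and the default K are our own choices.

```python
# Simplified sketch of the aggregation described above (not the reference code):
# average the k-step propagations, (1/(K+1)) * sum_{k=0}^{K} T^k X, instead of
# SGC's single term T^K X.  Sparse matrices keep the precomputation cheap.
import numpy as np
import scipy.sparse as sp

def s2gc_features(adj: sp.spmatrix, X: np.ndarray, K: int = 16) -> np.ndarray:
    """Average the 0..K-hop propagated features (illustrative helper)."""
    n = adj.shape[0]
    A_tilde = adj + sp.eye(n)                    # add self-loops
    d = np.asarray(A_tilde.sum(1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    T = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # renormalized transition matrix

    H = np.asarray(X, dtype=float)
    out = H.copy()                               # k = 0 term
    for _ in range(K):
        H = T @ H                                # k-step propagation
        out += H                                 # accumulate every neighborhood size
    return out / (K + 1)
```

A logistic regression (or any other linear model) trained on these precomputed features plays the role of the linear learner mentioned above, so no derivative of the filter itself is needed during training.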
2 PRELIMINARIES

Notations. Let G = (V, E) be a simple and connected undirected graph with n nodes and m edges. We use {1, ..., n} to denote the node indices of G, whereas $d_j$ denotes the degree of node j in G. Let A be the adjacency matrix and D be the diagonal degree matrix. Let $\tilde{A} = A + I_n$ denote the adjacency matrix with added self-loops and $\tilde{D}$ the corresponding diagonal degree matrix, where $I_n \in \mathbb{S}^n_{++}$ is the identity matrix. Finally, let $X \in \mathbb{R}^{n \times d}$ denote the node feature matrix, where each node $v$ is associated with a d-dimensional feature vector $X_v$. The normalized graph Laplacian matrix is defined as $L = I_n - D^{-1/2} A D^{-1/2} \in \mathbb{S}^n_{+}$, that is, a symmetric positive semidefinite matrix with eigendecomposition $U \Lambda U^\top$, where $\Lambda$ is a diagonal matrix holding the eigenvalues of L, and $U \in \mathbb{R}^{n \times n}$ is a unitary matrix consisting of the eigenvectors of L.

Spectral Graph Convolution (Defferrard et al., 2016). We consider spectral convolutions on graphs defined as the multiplication of a signal $x \in \mathbb{R}^n$ with a filter $g_\theta$ parameterized by $\theta \in \mathbb{R}^n$ in the Fourier domain:

$$g_\theta(L) \star x = U g_\theta(\Lambda) U^\top x, \quad (1)$$

where the parameter $\theta \in \mathbb{R}^n$ is a vector of spectral filter coefficients. One can understand $g_\theta$ as a function operating on the eigenvalues of L, that is, $g_\theta(\Lambda)$. To avoid the eigendecomposition, $g_\theta(\Lambda)$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $T_k(\Lambda)$ up to the K-th order (Defferrard et al., 2016):

$$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}), \quad (2)$$

with a rescaled $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_n$, where $\lambda_{max}$ denotes the largest eigenvalue of L and $\theta \in \mathbb{R}^K$ is now a vector of Chebyshev coefficients.

Vanilla Graph Convolutional Network (GCN) (Kipf & Welling, 2016). The vanilla GCN is a first-order approximation of spectral graph convolutions. If one sets K = 1, $\theta_0 = 2$, and $\theta_1 = -1$ in Eq. 2, one obtains the convolution operation $g_\theta(L) \star x = (I_n + D^{-1/2} A D^{-1/2}) x$. Finally, the renormalization trick, which replaces the matrix $I_n + D^{-1/2} A D^{-1/2}$ by its normalized version $\tilde{T} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} = (D + I_n)^{-1/2} (A + I_n) (D + I_n)^{-1/2}$, leads to the GCN layer with a non-linear activation σ:

$$H^{(l+1)} = \sigma(\tilde{T} H^{(l)} W^{(l)}). \quad (3)$$

Graph Diffusion Convolution (GDC) (Klicpera et al., 2019b). A generalized graph diffusion is given by the diffusion matrix:

$$S = \sum_{k=0}^{\infty} \theta_k T^k, \quad (4)$$

with weighting coefficients $\theta_k$ and a generalized transition matrix T. Eq. 4 can be regarded as related to the Taylor expansion of matrix-valued functions. Thus, the choice of $\theta_k$ and $T^k$ must at least ensure that Eq. 4 converges. Klicpera et al. (2019b) provide two special cases acting as low-pass filters, i.e., the heat kernel and the kernel based on random walks with restarts. If S denotes the resulting diffusion matrix, treated as an adjacency matrix, and D is its diagonal degree matrix, the corresponding graph diffusion convolution is then defined as $D^{-1/2} S D^{-1/2} x$. Note that $\theta_k$ can be a learnable parameter, or it can be chosen in some other way. Many works use the expansion in Eq. 4, but different choices of $\theta_k$ realise very different filters, making each method unique.
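As a concrete instance of Eq. 4, the sketch below (our own illustration, not the GDC reference implementation) truncates the series after K terms and uses the Personalized PageRank coefficients $\theta_k = \alpha(1-\alpha)^k$ from Klicpera et al. (2019b); note that the resulting matrix S is generally dense, which is exactly the computational and storage issue mentioned in the introduction.

```python
# Illustrative, truncated instance of the generalized graph diffusion in Eq. 4
# with Personalized PageRank weights theta_k = alpha * (1 - alpha)^k.
# Assumes every node has at least one edge (the graph is connected, cf. Notations).
import numpy as np

def diffusion_matrix(A: np.ndarray, alpha: float = 0.15, K: int = 32) -> np.ndarray:
    """S = sum_{k=0}^{K} theta_k T^k with T = D^{-1/2} A D^{-1/2} (dense result)."""
    d = A.sum(1)
    T = A / np.sqrt(np.outer(d, d))              # symmetric transition matrix
    S = np.zeros_like(T)
    T_power = np.eye(A.shape[0])                 # T^0
    for k in range(K + 1):
        S += alpha * (1.0 - alpha) ** k * T_power
        T_power = T_power @ T                    # advance to T^{k+1}
    return S

# The diffusion convolution then applies the (renormalized) S to a signal x,
# e.g. D^{-1/2} S D^{-1/2} x with D the degree matrix of S, as described above.
```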
Simple Graph Convolution (SGC) (Wu et al., 2019). A classical MPNN (Gilmer et al., 2017) averages, in each layer, the hidden representations among 1-hop neighbors. This implies that each node in the K-th layer obtains feature information from all nodes that are K hops away in the graph. By hypothesizing that the non-linearity between GCN layers is not critical, SGC captures information from the K-hop neighborhood in the graph by applying the K-th power of the transition matrix in a single neural network layer. SGC can be regarded as a special case of GDC without the non-linearity and without the normalization by $D^{-1/2}$ if we set $\theta_K = 1$ and $\theta_i = 0$ for all $i \neq K$.
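For comparison with the diffusion sketch above, the following minimal illustration (again not the original SGC code) precomputes the SGC features by applying the K-th power of the renormalized transition matrix from Eq. 3 to the feature matrix, after which a single linear classifier is trained on the result; the helper name sgc_features and the training indices in the usage comment are hypothetical.

```python
# Minimal sketch of SGC-style precomputation: only the K-th power of the
# renormalized transition matrix \tilde{T} (Eq. 3) is applied to the features.
import numpy as np
import scipy.sparse as sp

def sgc_features(adj: sp.spmatrix, X: np.ndarray, K: int = 2) -> np.ndarray:
    """Return \tilde{T}^K X (illustrative helper)."""
    n = adj.shape[0]
    A_tilde = adj + sp.eye(n)                    # add self-loops
    d = np.asarray(A_tilde.sum(1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    T = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # \tilde{T} from Eq. 3
    H = np.asarray(X, dtype=float)
    for _ in range(K):
        H = T @ H                                # after the loop, H = \tilde{T}^K X
    return H

# Hypothetical usage with labelled training nodes `train_idx` and labels `y`,
# e.g. with sklearn's LogisticRegression as the single linear layer:
# clf = LogisticRegression(max_iter=1000).fit(sgc_features(adj, X)[train_idx], y)
```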