# p-Laplacian Based Graph Neural Networks

Guoji Fu¹  Peilin Zhao¹  Yatao Bian¹

¹Tencent AI Lab, Shenzhen, China. Correspondence to: Guoji Fu, Yatao Bian. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Graph neural networks (GNNs) have demonstrated superior performance for semi-supervised node classification on graphs, as a result of their ability to exploit node features and topological information. However, most GNNs implicitly assume that the labels of nodes and their neighbors in a graph are the same or consistent, which does not hold in heterophilic graphs, where the labels of linked nodes are likely to differ. Moreover, when the topology is non-informative for label prediction, ordinary GNNs may work significantly worse than simply applying multi-layer perceptrons (MLPs) on each node. To tackle the above problem, we propose a new p-Laplacian based GNN model, termed as pGNN, whose message passing mechanism is derived from a discrete regularization framework and can be theoretically explained as an approximation of a polynomial graph filter defined on the spectral domain of p-Laplacians. The spectral analysis shows that the new message passing mechanism works as low-high-pass filters, thus rendering pGNNs effective on both homophilic and heterophilic graphs. Empirical studies on real-world and synthetic datasets validate our findings and demonstrate that pGNNs significantly outperform several state-of-the-art GNN architectures on heterophilic benchmarks while achieving competitive performance on homophilic benchmarks. Moreover, pGNNs can adaptively learn aggregation weights and are robust to noisy edges.

1. Introduction

In this paper, we explore the usage of graph neural networks (GNNs) for semi-supervised node classification on graphs, especially when the graphs admit strong heterophily or noisy edges. Semi-supervised learning on graphs is ubiquitous in a lot of real-world scenarios, such as user classification in social media (Kipf & Welling, 2017), protein classification in biology (Velickovic et al., 2018), molecular property prediction in chemistry (Duvenaud et al., 2015), and many others (Marcheggiani & Titov, 2017; Satorras & Estrach, 2018). Recently, GNNs have become the de facto choice for processing graph-structured data. They can exploit the node features and the graph topology by propagating and transforming the features over the topology in each layer and thereby learn refined node representations. A series of GNN architectures have been proposed, including graph convolutional networks (Bruna et al., 2014; Henaff et al., 2015; Niepert et al., 2016; Kipf & Welling, 2017; Wu et al., 2019), graph attention networks (Velickovic et al., 2018; Thekumparampil et al., 2018), and other representatives (Hamilton et al., 2017; Xu et al., 2018; Pei et al., 2020). Most of the existing GNN architectures work under the homophily assumption, i.e., the labels of nodes and their neighbors in a graph are the same or consistent, which is also commonly used in graph clustering (Bach & Jordan, 2004; von Luxburg, 2007; Liu & Han, 2013) and semi-supervised learning on graphs (Belkin et al., 2004; Hein, 2006; Nadler et al., 2009).
However, recent studies (Zhu et al., 2020; 2021; Chien et al., 2021) show that, in contrast to their success on homophilic graphs, most GNNs fail to work well on heterophilic graphs, in which linked nodes are more likely to have distinct labels. Moreover, GNNs can even fail on graphs where the topology is not helpful for label prediction. In these cases, propagating and transforming node features over the graph topology can lead to worse performance than simply applying multi-layer perceptrons (MLPs) on each of the nodes independently.

Several recent works have been proposed to deal with the heterophily issues of GNNs. Zhu et al. (2020) find that heuristically combining ego-, neighbor-, and higher-order embeddings improves GNN performance on heterophilic graphs. Zhu et al. (2021) use a compatibility matrix to model the homophily or heterophily level of a graph. Chien et al. (2021) incorporate the generalized PageRank algorithm with graph convolutions so as to jointly optimize node feature and topological information extraction for both homophilic and heterophilic graphs. However, the problem of GNNs on graphs with non-informative topologies (or noisy edges) remains open.

Unlike previous works, we tackle the above issues of GNNs by proposing a discrete p-Laplacian based message passing scheme, termed p-Laplacian message passing. It is derived from a discrete regularization framework and is theoretically verified as an approximation of a polynomial graph filter defined on the spectral domain of the p-Laplacian. Spectral analysis of p-Laplacian message passing shows that it works as low-high-pass filters¹ and thus is applicable to both homophilic and heterophilic graphs. Moreover, when p ≠ 2, our theoretical results indicate that it can adaptively learn aggregation weights in terms of the variation of node embeddings on edges (measured by the graph gradient (Amghibech, 2003; Zhou & Schölkopf, 2005; Luo et al., 2010)), and work as low-pass or low-high-pass filters on a node according to the local variation of node embeddings around the node (measured by the norm of graph gradients).

¹ Note that if the low frequencies and high frequencies dominate the middle frequencies (the frequencies around the cutoff frequency), we say that the filter works as a low-high-pass filter.

Based on p-Laplacian message passing, we propose a new GNN architecture, called pGNN, to enable GNNs to work with heterophilic graphs and graphs with non-informative topologies. Several existing GNN architectures, including SGC (Wu et al., 2019), APPNP (Klicpera et al., 2019), and GPRGNN (Chien et al., 2021), can be shown to be analogous to pGNN with p = 2. Our empirical studies on real-world benchmark datasets (homophilic and heterophilic) and synthetic datasets (cSBM (Deshpande et al., 2018)) demonstrate that pGNNs obtain the best performance on heterophilic graphs and competitive performance on homophilic graphs against state-of-the-art GNNs. Moreover, experimental results on graphs with different levels of noisy edges show that pGNNs work much more robustly than GNN baselines and even as well as MLPs on graphs with completely random edges. Additional experiments (reported in Appendix F.5) illustrate that integrating pGNNs with existing GNN architectures (i.e., GCN (Kipf & Welling, 2017) and JKNet (Xu et al., 2018)) can significantly improve their performance on heterophilic graphs.

In conclusion, our contributions can be summarized as below: (1) New methodologies. We propose p-Laplacian message passing and pGNN to adapt GNNs to heterophilic graphs and graphs where the topology is non-informative for label prediction. (2) Superior performance.
We empirically demonstrate that pGNNs are superior on heterophilic graphs and competitive on homophilic graphs against state-of-the-art GNNs. Moreover, pGNNs work robustly on graphs with noisy edges or non-informative topologies. (3) Theoretical justification. We theoretically demonstrate that p-Laplacian message passing works as low-high-pass filters and that the message passing iteration is guaranteed to converge with proper settings. (4) New paradigm of designing GNN architectures. We bridge the gap between the discrete regularization framework and GNNs, which could further inspire researchers to develop new graph convolutions or message passing schemes using other regularization techniques with explicit assumptions on graphs.

Due to the space limit, we defer the discussions on related work and future work and all proofs to the Appendix. Code is available at https://github.com/guoji-fu/pGNNs.

2. Preliminaries and Background

Notation. Let $G = (V, E, \mathbf{W})$ be an undirected graph, where $V = \{1, 2, \ldots, N\}$ is the set of nodes, $E \subseteq V \times V$ is the set of edges, and $\mathbf{W} \in \mathbb{R}^{N \times N}$ is the adjacency matrix with $W_{i,j} = W_{j,i}$, $W_{i,j} > 0$ for $[i,j] \in E$ and $W_{i,j} = 0$ otherwise. $\mathcal{N}_i = \{j : [i,j] \in E\}$ denotes the set of neighbors of node $i$, and $\mathbf{D} \in \mathbb{R}^{N \times N} = \mathrm{diag}(D_{1,1}, \ldots, D_{N,N})$ denotes the diagonal degree matrix with $D_{i,i} = \sum_{j=1}^{N} W_{i,j}$ for $i = 1, \ldots, N$. $f : V \to \mathbb{R}$ and $g : E \to \mathbb{R}$ are functions defined on the vertices and edges of $G$, respectively. $\mathcal{F}_V$ denotes the Hilbert space of functions on $V$ endowed with the inner product $\langle f, f' \rangle_{\mathcal{F}_V} := \sum_{i \in V} f(i) f'(i)$; $\mathcal{F}_E$ is defined similarly. We also denote $[K] = \{1, 2, \ldots, K\}$ for $K \in \mathbb{N}$, and we use $\|\mathbf{x}\| = \|\mathbf{x}\|_2 = (\sum_{i=1}^{d} x_i^2)^{1/2}$ for $\mathbf{x} \in \mathbb{R}^d$ to denote the Frobenius norm of a vector.

Problem Formulation. Given a graph $G = (V, E, \mathbf{W})$, each node $i \in V$ has a feature vector $\mathbf{X}_{i,:}$, the $i$-th row of $\mathbf{X}$, and a subset of nodes in $G$ have labels from a label set $\mathcal{L} = \{1, \ldots, L\}$. The goal of GNNs for semi-supervised node classification on $G$ is to train a GNN $\mathcal{M} : V \to \mathcal{L}$ and predict the labels of unlabeled nodes.

Homophily and Heterophily. The homophily or heterophily of a graph describes the relation of labels between linked nodes in the graph. The level of homophily of a graph can be measured by $\mathcal{H}(G) = \mathbb{E}_{i \in V}\big[|\{j \in \mathcal{N}_i : y_j = y_i\}| / |\mathcal{N}_i|\big]$ (Pei et al., 2020; Chien et al., 2021), where $|\{j \in \mathcal{N}_i : y_j = y_i\}|$ denotes the number of neighbors of $i \in V$ that share the same label as $i$ (i.e., $y_i = y_j$); $\mathcal{H}(G) \to 1$ corresponds to strong homophily while $\mathcal{H}(G) \to 0$ indicates strong heterophily. We say that a graph is a homophilic (heterophilic) graph if it has strong homophily (heterophily).

Graph Gradient. The graph gradient of an edge $[i,j]$, $i, j \in V$, is defined as a measurement of the variation of a function $f : V \to \mathbb{R}$ on the edge $[i,j]$.² 

² $f$ can be a vector-valued function $f : V \to \mathbb{R}^c$ for some $c \in \mathbb{N}$; here we use $f : V \to \mathbb{R}$ for better illustration.

Definition 1 (Graph Gradient). Given a graph $G = (V, E)$ and a function $f : V \to \mathbb{R}$, the graph gradient is an operator $\nabla : \mathcal{F}_V \to \mathcal{F}_E$ defined for all $[i,j] \in E$ as
$$(\nabla f)([i,j]) := \sqrt{\frac{W_{i,j}}{D_{j,j}}}\, f(j) - \sqrt{\frac{W_{i,j}}{D_{i,i}}}\, f(i). \qquad (1)$$
For $[i,j] \notin E$, $(\nabla f)([i,j]) := 0$. The graph gradient of a function $f$ at a vertex $i$, $i \in [N]$, is defined as $\nabla f(i) := ((\nabla f)([i,1]), \ldots, (\nabla f)([i,N]))$, and its Frobenius norm is given by $\|\nabla f(i)\| := \big(\sum_{j=1}^{N} (\nabla f)^2([i,j])\big)^{1/2}$, which measures the variation of $f$ around node $i$.
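To make Definition 1 concrete, the sketch below computes the graph gradient of a scalar signal and the per-node gradient norms with NumPy. It is only an illustration of Eq. (1): the dense matrices, the function names, and the toy path graph are our own assumptions and not part of the paper's implementation.

```python
import numpy as np

def graph_gradient(W, f):
    """Graph gradient of a scalar signal f (Definition 1, Eq. (1)).

    W : (N, N) symmetric adjacency matrix, f : (N,) signal on the nodes.
    Returns G with G[i, j] = sqrt(W_ij / D_jj) * f(j) - sqrt(W_ij / D_ii) * f(i)
    (zero for non-edges), and the per-node norms ||grad f(i)||.
    Assumes there are no isolated nodes (all degrees are positive).
    """
    D = W.sum(axis=1)                                  # degrees D_ii
    G = np.sqrt(W / D[None, :]) * f[None, :] - np.sqrt(W / D[:, None]) * f[:, None]
    node_norms = np.linalg.norm(G, axis=1)             # ||grad f(i)||, i = 1..N
    return G, node_norms

# Tiny example: a path graph 0 - 1 - 2 with signal values [0, 1, 4].
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
G, norms = graph_gradient(W, np.array([0., 1., 4.]))
```

For a multi-channel signal $f : V \to \mathbb{R}^c$, the same construction applies with $f(i) \in \mathbb{R}^c$ and the norm taken over channels as well.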
We measure the variation of $f$ over the whole graph $G$ by $S_p(f)$, defined for $p \geq 1$ as
$$S_p(f) := \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\big\|(\nabla f)([i,j])\big\|^p = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\Big\|\sqrt{\tfrac{W_{i,j}}{D_{j,j}}}\, f(j) - \sqrt{\tfrac{W_{i,j}}{D_{i,i}}}\, f(i)\Big\|^p. \qquad (2)$$
Note that the definition of $S_p$ here is different from the p-Dirichlet form in Zhou & Schölkopf (2005).

Graph Divergence. The graph divergence is defined as the adjoint of the graph gradient:

Definition 2 (Graph Divergence). Given a graph $G = (V, E)$ and functions $f : V \to \mathbb{R}$, $g : E \to \mathbb{R}$, the graph divergence is an operator $\mathrm{div} : \mathcal{F}_E \to \mathcal{F}_V$ which satisfies
$$\langle \nabla f, g \rangle = \langle f, -\mathrm{div}\, g \rangle. \qquad (3)$$
The graph divergence can be computed by
$$(\mathrm{div}\, g)(i) = \sum_{j=1}^{N}\sqrt{\frac{W_{i,j}}{D_{i,i}}}\big(g([i,j]) - g([j,i])\big). \qquad (4)$$
Fig. 4 in Appendix E.1 gives a tiny illustrative example of the graph gradient and graph divergence.

Graph p-Laplacian Operator. By the definitions of the graph gradient and graph divergence, we reach the definition of the graph p-Laplacian operator as below.

Definition 3 (Graph p-Laplacian³). Given a graph $G = (V, E)$ and a function $f : V \to \mathbb{R}$, the graph p-Laplacian is an operator $\Delta_p : \mathcal{F}_V \to \mathcal{F}_V$ defined by
$$\Delta_p f := -\frac{1}{2}\,\mathrm{div}\big(\|\nabla f\|^{p-2}\,\nabla f\big), \quad \text{for } p \geq 1, \qquad (5)$$
where $\|\cdot\|^{p-2}$ is element-wise, i.e., $\|\nabla f(i)\|^{p-2} = \big(\|(\nabla f)([i,1])\|^{p-2}, \ldots, \|(\nabla f)([i,N])\|^{p-2}\big)$.

³ Note that the definition adopted here is slightly different from the one used in Zhou & Schölkopf (2005), where $\|\cdot\|^{p-2}$ is not element-wise, and from the one used in some literature such as Amghibech (2003) and Bühler & Hein (2009), where $(\Delta_p f)(i) = \sum_{j=1}^{N}\frac{W_{i,j}}{D_{i,i}}|f(i) - f(j)|^{p-2}(f(i) - f(j))$ for $p > 1$ and $p = 1$ is not allowed.

Substituting Eq. (1) and Eq. (4) into Eq. (5), we obtain
$$(\Delta_p f)(i) = \sum_{j \in \mathcal{N}_i}\sqrt{\frac{W_{i,j}}{D_{i,i}}}\,\big\|(\nabla f)([i,j])\big\|^{p-2}\Big(\sqrt{\tfrac{W_{i,j}}{D_{i,i}}}\, f(i) - \sqrt{\tfrac{W_{i,j}}{D_{j,j}}}\, f(j)\Big). \qquad (6)$$
The graph p-Laplacian is positive semi-definite, $\langle f, \Delta_p f \rangle = S_p(f) \geq 0$, and we have
$$\frac{\partial S_p(f)}{\partial f(i)} = p\,(\Delta_p f)(i). \qquad (7)$$

When $p = 2$, $\Delta_2$ is referred to as the ordinary Laplacian operator and $\Delta_2 = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$; when $p = 1$, $\Delta_1$ is referred to as the curvature operator and $\Delta_1 f := -\frac{1}{2}\mathrm{div}(\nabla f / \|\nabla f\|)$. Note that the Laplacian $\Delta_2$ is a linear operator, while in general for $p \neq 2$ the p-Laplacian is nonlinear, since $\Delta_p(af) \neq a\,\Delta_p(f)$ for $a \in \mathbb{R}$.
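The following sketch applies the graph p-Laplacian of Eq. (6) to a scalar signal and checks that, for p = 2, it reduces to the ordinary Laplacian $\Delta_2 = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$. It is a dense NumPy illustration under our own naming; edges with a vanishing gradient are given zero weight, anticipating the convention used for the message passing iteration in Section 3.

```python
import numpy as np

def p_laplacian_apply(W, f, p=2.0):
    """Apply the graph p-Laplacian (Eq. (6)) to a scalar signal f.

    (Delta_p f)(i) = sum_j sqrt(W_ij / D_ii) * ||(grad f)([i, j])||^(p-2)
                     * ( sqrt(W_ij / D_ii) f(i) - sqrt(W_ij / D_jj) f(j) ).
    Dense NumPy sketch; assumes no isolated nodes.
    """
    D = W.sum(axis=1)
    Si = np.sqrt(W / D[:, None])             # sqrt(W_ij / D_ii)
    Sj = np.sqrt(W / D[None, :])             # sqrt(W_ij / D_jj)
    G = Sj * f[None, :] - Si * f[:, None]    # graph gradient, Eq. (1)
    weight = np.zeros_like(G)                # ||grad||^(p-2), set to 0 on zero gradients
    mask = np.abs(G) > 0
    weight[mask] = np.abs(G[mask]) ** (p - 2)
    return (Si * weight * (Si * f[:, None] - Sj * f[None, :])).sum(axis=1)

# Sanity check on a path graph: for p = 2 this matches (I - D^{-1/2} W D^{-1/2}) f.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
f = np.array([0., 1., 4.])
D = W.sum(axis=1)
L2 = np.eye(3) - W / np.sqrt(np.outer(D, D))
assert np.allclose(p_laplacian_apply(W, f, p=2.0), L2 @ f)
```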
3. p-Laplacian based Graph Neural Networks

In this section, we derive the p-Laplacian message passing scheme from a p-Laplacian regularization framework and present pGNN, a new GNN architecture developed upon the new message passing scheme. We demonstrate that the p-Laplacian message passing scheme is guaranteed to converge with proper settings and provide an upper-bounding risk of pGNNs in Appendix C.1.

3.1. p-Laplacian Regularization Framework

Given an undirected graph $G = (V, E)$ and a signal function with $c$ ($c \in \mathbb{N}$) channels $f : V \to \mathbb{R}^c$, let $\mathbf{X} = (\mathbf{X}_{1,:}^\top, \ldots, \mathbf{X}_{N,:}^\top)^\top \in \mathbb{R}^{N \times c}$ with $\mathbf{X}_{i,:} \in \mathbb{R}^{1 \times c}$, $i \in [N]$, denote the node features of $G$, and let $\mathbf{F} = (\mathbf{F}_{1,:}^\top, \ldots, \mathbf{F}_{N,:}^\top)^\top \in \mathbb{R}^{N \times c}$ be a matrix whose $i$-th row vector $\mathbf{F}_{i,:} \in \mathbb{R}^{1 \times c}$, $i \in [N]$, represents the value of $f$ at the $i$-th vertex of $G$. We present a p-Laplacian regularization problem whose cost function is defined as
$$\mathcal{L}_p(\mathbf{F}) := S_p(\mathbf{F}) + \mu \sum_{i=1}^{N}\|\mathbf{F}_{i,:} - \mathbf{X}_{i,:}\|^2, \qquad (8)$$
where $\mu \in (0, \infty)$ and the optimal embeddings are $\mathbf{F}^* = \arg\min_{\mathbf{F}} \mathcal{L}_p(\mathbf{F})$. The first term of the right-hand side in Eq. (8) is a measurement of the variation of the signal over the graph based on the p-Laplacian. As we will discuss later, different choices of $p$ result in different smoothness constraints on the signals. The second term is the constraint that the optimal signals $\mathbf{F}^*$ should not change too much from the input signal $\mathbf{X}$, and $\mu$ provides a trade-off between these two constraints.

Regularization with p = 2. If $p = 2$, the solution of Eq. (8) satisfies $\Delta_2\mathbf{F} + \mu(\mathbf{F} - \mathbf{X}) = 0$ and we can obtain the closed form (Zhou & Schölkopf, 2005)
$$\mathbf{F}^* = \mu(\Delta_2 + \mu\mathbf{I}_N)^{-1}\mathbf{X}. \qquad (9)$$
Then, we can use the following iteration algorithm to obtain an approximation of Eq. (9):
$$\mathbf{F}^{(k+1)} = \alpha\,\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\mathbf{F}^{(k)} + \beta\mathbf{X}, \qquad (10)$$
where $k$ represents the iteration index, $\alpha = \frac{1}{1+\mu}$ and $\beta = \frac{\mu}{1+\mu} = 1 - \alpha$. The iteration converges to the closed-form solution as $k$ goes to infinity (Zhou et al., 2003; Zhou & Schölkopf, 2005). We can relate the result here to the personalized PageRank (PPR) algorithm (Page et al., 1999; Klicpera et al., 2019):

Theorem 1 (Relation to personalized PageRank (Klicpera et al., 2019)). $\mu(\Delta_2 + \mu\mathbf{I}_N)^{-1}$ in the closed-form solution of Eq. (9) is equivalent to the personalized PageRank matrix.

Regularization with p > 1. For $p > 1$, the solution of Eq. (8) satisfies $p\,\Delta_p\mathbf{F} + 2\mu(\mathbf{F} - \mathbf{X}) = 0$. By Eq. (6) we have that, for all $i \in [N]$,
$$p\sum_{j \in \mathcal{N}_i}\sqrt{\tfrac{W_{i,j}}{D_{i,i}}}\,\big\|(\nabla\mathbf{F})([i,j])\big\|^{p-2}\Big(\sqrt{\tfrac{W_{i,j}}{D_{i,i}}}\,\mathbf{F}_{i,:} - \sqrt{\tfrac{W_{i,j}}{D_{j,j}}}\,\mathbf{F}_{j,:}\Big) + 2\mu(\mathbf{F}_{i,:} - \mathbf{X}_{i,:}) = 0.$$
Let $\mathbf{M}^{(k)} \in \mathbb{R}^{N \times N}$, $\alpha^{(k)} = \mathrm{diag}(\alpha^{(k)}_{1,1}, \ldots, \alpha^{(k)}_{N,N})$, and $\beta^{(k)} = \mathrm{diag}(\beta^{(k)}_{1,1}, \ldots, \beta^{(k)}_{N,N})$. Based on this, we can construct a similar iterative algorithm to obtain a solution (Zhou & Schölkopf, 2005): for all $i \in [N]$,
$$\mathbf{F}^{(k+1)}_{i,:} = \alpha^{(k)}_{i,i}\sum_{j \in \mathcal{N}_i}\frac{M^{(k)}_{i,j}}{\sqrt{D_{i,i}D_{j,j}}}\,\mathbf{F}^{(k)}_{j,:} + \beta^{(k)}_{i,i}\mathbf{X}_{i,:}, \qquad (11)$$
where, for all $i, j \in [N]$,
$$M^{(k)}_{i,j} = W_{i,j}\Big\|\sqrt{\tfrac{W_{i,j}}{D_{i,i}}}\,\mathbf{F}^{(k)}_{i,:} - \sqrt{\tfrac{W_{i,j}}{D_{j,j}}}\,\mathbf{F}^{(k)}_{j,:}\Big\|^{p-2}, \qquad (12)$$
$$\alpha^{(k)}_{i,i} = \frac{1}{\sum_{j \in \mathcal{N}_i} M^{(k)}_{i,j}/D_{i,i} + 2\mu/p}, \qquad \beta^{(k)}_{i,i} = \frac{2\mu}{p}\,\alpha^{(k)}_{i,i}. \qquad (13)$$
Note that when $\big\|\sqrt{W_{i,j}/D_{i,i}}\,\mathbf{F}^{(k)}_{i,:} - \sqrt{W_{i,j}/D_{j,j}}\,\mathbf{F}^{(k)}_{j,:}\big\| = 0$, we set $M^{(k)}_{i,j} = 0$. It is easy to see that Eq. (10) is the special case of Eq. (14) with $p = 2$.

Remark 1 (Discussion on p = 1). For $p = 1$, when $f$ is a real-valued function ($c = 1$), $\Delta_1 f$ is a step function, which could make the stationarity condition of the objective function Eq. (8) problematic. Additionally, $\Delta_1 f$ is not continuous at $(\nabla f)([i,j]) = 0$. Therefore, $p = 1$ is not allowed when $f$ is a real-valued function. On the other hand, note that there is a Frobenius norm in $\Delta_p f$. When $f$ is a vector-valued function ($c > 1$), the step function in $\Delta_1 f$ only appears on the axes. The stationarity condition will be fine if the node embeddings $\mathbf{F}$ are not a matrix whose row vectors have only one non-zero element, which is true for many graphs; $p = 1$ may work for these graphs. Overall, we suggest using $p > 1$ in practice, but $p = 1$ may also work for graphs with multi-channel signals. We conduct experiments for $p > 1$ ($p = 1.5, 2, 2.5$) and $p = 1$ in Sec. 5.

3.2. p-Laplacian Message Passing and pGNN Architecture

p-Laplacian Message Passing. Rewriting Eq. (11) in matrix form, we obtain
$$\mathbf{F}^{(k+1)} = \alpha^{(k)}\mathbf{D}^{-1/2}\mathbf{M}^{(k)}\mathbf{D}^{-1/2}\mathbf{F}^{(k)} + \beta^{(k)}\mathbf{X}. \qquad (14)$$
Eq. (14) provides a new message passing mechanism, named p-Laplacian message passing.

Remark 2. $\alpha\mathbf{D}^{-1/2}\mathbf{M}\mathbf{D}^{-1/2}$ in Eq. (14) can be regarded as the learned aggregation weights at each step of message passing. It suggests that p-Laplacian message passing can adaptively tune the aggregation weights during the course of learning, which will be demonstrated theoretically and empirically later. $\beta\mathbf{X}$ in Eq. (14) can be regarded as a residual unit, which helps the model escape from the over-smoothing issue (Chien et al., 2021).
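As a concrete reading of Eqs. (11)-(14), the sketch below runs K steps of p-Laplacian message passing on dense NumPy arrays. Function and variable names are our own; it assumes a symmetric adjacency matrix with no isolated nodes, and it sets $M^{(k)}_{i,j} = 0$ on edges with a vanishing gradient, as stated above. For p = 2 it reduces to the iteration in Eq. (10).

```python
import numpy as np

def p_laplacian_message_passing(W, X, p=2.5, mu=0.1, K=10):
    """Dense NumPy sketch of K steps of p-Laplacian message passing (Eqs. (11)-(14)).

    W : (N, N) symmetric adjacency matrix, X : (N, c) node features, F^(0) = X.
    Assumes no isolated nodes; for p = 2 this reduces to Eq. (10).
    """
    D = W.sum(axis=1)                                       # degrees D_ii
    sqrtW = np.sqrt(W)
    F = X.astype(float).copy()
    for _ in range(K):
        Fs = F / np.sqrt(D)[:, None]                        # rows F_i / sqrt(D_ii)
        # ||sqrt(W_ij/D_ii) F_i - sqrt(W_ij/D_jj) F_j|| = sqrt(W_ij) ||F_i/sqrt(D_ii) - F_j/sqrt(D_jj)||
        grad_norm = sqrtW * np.linalg.norm(Fs[:, None, :] - Fs[None, :, :], axis=-1)
        M = np.zeros_like(W)                                # Eq. (12); M_ij = 0 on zero gradients
        mask = (W > 0) & (grad_norm > 0)
        M[mask] = W[mask] * grad_norm[mask] ** (p - 2)
        alpha = 1.0 / (M.sum(axis=1) / D + 2.0 * mu / p)    # Eq. (13)
        beta = (2.0 * mu / p) * alpha
        prop = M / np.sqrt(np.outer(D, D))                  # D^{-1/2} M^{(k)} D^{-1/2}
        F = alpha[:, None] * (prop @ F) + beta[:, None] * X # Eq. (11) / Eq. (14)
    return F
```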
We present the following theorem to show the shrinking property of p-Laplacian message passing.

Theorem 2 (Shrinking Property of p-Laplacian Message Passing). Given a graph $G = (V, E, \mathbf{W})$ with node features $\mathbf{X}$, let $\mathbf{F}^{(k)}$, $\mathbf{M}^{(k)}$, $\alpha^{(k)}$, and $\beta^{(k)}$ be updated according to Equations (11) to (13) for $k = 0, 1, \ldots, K$ with $\mathbf{F}^{(0)} = \mathbf{X}$. Then, for any $p > 1$, there exists a positive real value $\mu > 0$, which depends on $\mathbf{X}$, $G$, and $p$, such that $\mathcal{L}_p(\mathbf{F}^{(k+1)}) \leq \mathcal{L}_p(\mathbf{F}^{(k)})$.

Proof see Appendix D.2. Theorem 2 shows that, with a proper positive real value $\mu$ and $p > 1$, the loss of the objective function Eq. (8) is guaranteed to decline after taking one step of p-Laplacian message passing. Theorem 2 also demonstrates that the iteration in Equations (11) to (13) is guaranteed to converge for $p > 1$ with a proper $\mu$, which is chosen depending on the input graph and $p$.

pGNN Architecture. We design the architecture of pGNNs using p-Laplacian message passing. Given node features $\mathbf{X} \in \mathbb{R}^{N \times c}$, the number of node labels $L$, the number of hidden units $h$, the maximum number of iterations $K$, and $\mathbf{M}$, $\alpha$, and $\beta$ updated by Equations (12) and (13), respectively, we give the pGNN architecture as follows:
$$\mathbf{F}^{(0)} = \mathrm{ReLU}(\mathbf{X}\Theta^{(1)}), \qquad (15)$$
$$\mathbf{F}^{(k+1)} = \alpha^{(k)}\mathbf{D}^{-1/2}\mathbf{M}^{(k)}\mathbf{D}^{-1/2}\mathbf{F}^{(k)} + \beta^{(k)}\mathbf{F}^{(0)}, \qquad (16)$$
$$\mathbf{Z} = \mathrm{softmax}(\mathbf{F}^{(K)}\Theta^{(2)}), \qquad (17)$$
where $k = 0, 1, \ldots, K-1$, $\mathbf{Z} \in \mathbb{R}^{N \times L}$ is the output probability matrix whose entry $Z_{i,j}$ is the estimated probability that the label of node $i \in [N]$ is $j \in [L]$ given features $\mathbf{X}$ and graph $G$, and $\Theta^{(1)} \in \mathbb{R}^{c \times h}$ and $\Theta^{(2)} \in \mathbb{R}^{h \times L}$ are the first- and second-layer parameters of the neural network.
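A minimal sketch of the forward computation in Eqs. (15)-(17), again in plain NumPy and under our own naming: Theta1 and Theta2 stand for the trainable parameters $\Theta^{(1)}$ and $\Theta^{(2)}$ (which in practice would be learned with an autodiff framework), and the propagation step is the same p-Laplacian message passing as above, with $\mathbf{X}$ replaced by $\mathbf{F}^{(0)}$ in the residual term.

```python
import numpy as np

def pgnn_forward(W, X, Theta1, Theta2, p=2.0, mu=0.1, K=10):
    """Sketch of the pGNN forward pass, Eqs. (15)-(17), for dense inputs.

    Theta1 : (c, h), Theta2 : (h, L) play the roles of Theta^(1) and Theta^(2).
    Assumes a symmetric adjacency matrix with no isolated nodes.
    """
    D = W.sum(axis=1)
    F0 = np.maximum(X @ Theta1, 0.0)                        # Eq. (15): ReLU(X Theta^(1))
    F = F0.copy()
    for _ in range(K):                                      # Eq. (16)
        Fs = F / np.sqrt(D)[:, None]
        grad_norm = np.sqrt(W) * np.linalg.norm(Fs[:, None, :] - Fs[None, :, :], axis=-1)
        M = np.zeros_like(W)
        mask = (W > 0) & (grad_norm > 0)
        M[mask] = W[mask] * grad_norm[mask] ** (p - 2)
        alpha = 1.0 / (M.sum(axis=1) / D + 2.0 * mu / p)
        beta = (2.0 * mu / p) * alpha
        F = alpha[:, None] * ((M / np.sqrt(np.outer(D, D))) @ F) + beta[:, None] * F0
    logits = F @ Theta2                                     # Eq. (17)
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)                 # row-wise softmax
```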
Remark 3 (Connection to existing GNN variants). The message passing scheme of pGNNs is different from that of several GNN variants (say, GCN, GAT, and GraphSAGE), which repeatedly stack message passing layers. In contrast, it is similar to SGC (Wu et al., 2019), APPNP (Klicpera et al., 2019), and GPRGNN (Chien et al., 2021). SGC is an approximation to the closed form in Eq. (9) (Fu et al., 2020). By Theorem 1, it is easy to see that APPNP, which uses PPR to propagate the node embeddings, is analogous to pGNN with $p = 2$, termed 2.0GNN. APPNP and 2.0GNN work analogously and effectively on homophilic graphs. 2.0GNN can also work effectively on heterophilic graphs by letting $\Theta^{(2)}$ be negative. However, APPNP fails on heterophilic graphs as its PPR weights are fixed (Chien et al., 2021). Unlike APPNP, GPRGNN, which adaptively learns the generalized PageRank (GPR) weights, works similarly to 2.0GNN on both homophilic and heterophilic graphs. However, GPRGNN needs more supervised information in order to learn optimal GPR weights. On the contrary, pGNNs need less supervised information to obtain similar results because $\Theta^{(2)}$ acts like a hyperplane for classification. pGNNs could therefore work better under weak supervision. Our analysis is also verified by the experimental results in Sec. 5.

We also provide an upper-bounding risk of pGNNs in Theorem 4 in Appendix C.1 to study the effect of the hyperparameter $\mu$ on the performance of pGNNs. Theorem 4 shows that the risk of pGNNs is upper-bounded by the sum of three terms: the risk of label prediction using only the original node features $\mathbf{X}$, the norm of the p-Laplacian diffusion on $\mathbf{X}$, and the magnitude of the noise in $\mathbf{X}$. $\mu$ controls the trade-off between these three terms: the smaller $\mu$ is, the more weight is put on the p-Laplacian diffusion term and the noise term and the less weight on the other term, and vice versa.

4. Spectral Views of p-Laplacian Message Passing

In this section, we theoretically demonstrate that p-Laplacian message passing is an approximation of a polynomial graph filter defined on the spectral domain of the p-Laplacian. We show by spectral analysis that p-Laplacian message passing works as low-high-pass filters.

p-Eigenvalues and p-Eigenvectors of the Graph p-Laplacian. We first introduce the definitions of p-eigenvalues and p-eigenvectors of the p-Laplacian. Let $\phi_p : \mathbb{R} \to \mathbb{R}$ be defined as $\phi_p(u) = |u|^{p-2}u$ for $u \in \mathbb{R}$, $u \neq 0$. Note that $\phi_2(u) = u$. For notational simplicity, we denote $\phi_p(\mathbf{u}) = (\phi_p(u_1), \ldots, \phi_p(u_N))$ for $\mathbf{u} \in \mathbb{R}^N$ and $\Phi_p(\mathbf{U}) = (\phi_p(\mathbf{U}_{:,1}), \ldots, \phi_p(\mathbf{U}_{:,N}))$ for $\mathbf{U} \in \mathbb{R}^{N \times N}$, where $\mathbf{U}_{:,i} \in \mathbb{R}^N$ is the $i$-th column vector of $\mathbf{U}$.

Definition 4 (p-Eigenvector and p-Eigenvalue). A vector $\mathbf{u} \in \mathbb{R}^N$ is a p-eigenvector of $\Delta_p$ if it satisfies the equation $(\Delta_p\mathbf{u})_i = \lambda\,\phi_p(u_i)$ for all $i \in [N]$, where $\lambda \in \mathbb{R}$ is a real value referred to as a p-eigenvalue of $\Delta_p$ associated with the p-eigenvector $\mathbf{u}$.

Definition 5 (p-Orthogonality (Luo, Huang, Ding, and Nie, 2010)). Given two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^N$ with $\mathbf{u}, \mathbf{v} \neq 0$, we say that $\mathbf{u}$ and $\mathbf{v}$ are p-orthogonal if $\phi_p(\mathbf{u})^\top\phi_p(\mathbf{v}) = \sum_{i=1}^{N}\phi_p(u_i)\phi_p(v_i) = 0$.

Luo et al. (2010) demonstrated that the p-eigenvectors of $\Delta_p$ are p-orthogonal to each other (see Theorem 5 in Appendix C.2 for details). Therefore, the space spanned by multiple p-eigenvectors of $\Delta_p$ is p-orthogonal. Additionally, we demonstrate that the p-eigen-decomposition of $\Delta_p$ is given by $\Delta_p = \Phi_p(\mathbf{U})\Lambda\Phi_p(\mathbf{U})^\top$ (see Theorem 6 in Appendix C.3 for details), where $\mathbf{U}$ is a matrix of p-eigenvectors of $\Delta_p$ and $\Lambda$ is a diagonal matrix whose diagonal contains the p-eigenvalues of $\Delta_p$.

Graph Convolutions based on p-Laplacian. Based on Theorem 5, the graph Fourier transform $\hat{f}$ of any function $f$ on the vertices of $G$ can be defined as the expansion of $f$ in terms of $\Phi_p(\mathbf{U})$, where $\mathbf{U}$ is the matrix of p-eigenvectors of $\Delta_p$: $\hat{f} = \Phi_p(\mathbf{U})^\top f$. Similarly, the inverse graph Fourier transform is given by $f = \Phi_p(\mathbf{U})\hat{f}$. Therefore, a signal $\mathbf{X} \in \mathbb{R}^{N \times c}$ filtered by a spectral filter $\hat{g}_\theta$ can be expressed formally as $g_\theta \star \mathbf{X} = \Phi_p(\mathbf{U})\hat{g}_\theta(\Lambda)\Phi_p(\mathbf{U})^\top\mathbf{X}$, where $\Lambda$ denotes a diagonal matrix whose diagonal corresponds to the p-eigenvalues $\{\lambda_l\}_{l=0,\ldots,N-1}$ of $\Delta_p$ and $\hat{g}_\theta(\Lambda)$ denotes a diagonal matrix whose diagonal corresponds to the spectral filter coefficients. Let $\hat{g}_\theta$ be a polynomial filter defined as $\hat{g}_\theta(\lambda_l) = \sum_{k=0}^{K-1}\theta_k\lambda_l^k$, where the parameter $\theta = [\theta_0, \ldots, \theta_{K-1}] \in \mathbb{R}^K$ is a vector of polynomial coefficients. By the p-eigen-decomposition of the p-Laplacian, we have
$$g_\theta \star \mathbf{X} = \sum_{k=0}^{K-1}\theta_k\,\Phi_p(\mathbf{U})\Lambda^k\Phi_p(\mathbf{U})^\top\mathbf{X} = \sum_{k=0}^{K-1}\theta_k\,\Delta_p^k\mathbf{X}. \qquad (18)$$

Theorem 3. The K-step p-Laplacian message passing is a K-order polynomial approximation to the graph filter given by Eq. (18).

Proof see Appendix D.3. Theorem 3 indicates that p-Laplacian message passing is implicitly a polynomial spectral filter defined on the spectral domain of the p-Laplacian.
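To give a concrete feel for Eq. (18), the sketch below evaluates the polynomial filter for the familiar case p = 2, where $\Delta_2 = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ is linear and the powers $\Delta_2^k\mathbf{X}$ can be accumulated directly; for general p the same expression involves powers of the nonlinear operator $\Delta_p$, which is what p-Laplacian message passing approximates. The names and the dense representation are our own assumptions.

```python
import numpy as np

def polynomial_filter_p2(W, X, theta):
    """Eq. (18) specialised to p = 2: filtered signal = sum_k theta_k * Delta_2^k @ X,
    with Delta_2 = I - D^{-1/2} W D^{-1/2}."""
    D = W.sum(axis=1)
    L = np.eye(W.shape[0]) - W / np.sqrt(np.outer(D, D))   # Delta_2
    out = np.zeros_like(X, dtype=float)
    Lk_X = X.astype(float)                                 # Delta_2^0 X
    for theta_k in theta:
        out += theta_k * Lk_X
        Lk_X = L @ Lk_X                                    # next power applied to X
    return out
```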
Spectral Analysis of p-Laplacian Message Passing. Here, we analyze the spectral properties of p-Laplacian message passing. We can approximately view p-Laplacian message passing as a linear combination of $K$ spectral filters $g(\Lambda)^{(0)}, g(\Lambda)^{(1)}, \ldots, g(\Lambda)^{(K-1)}$, with each spectral filter defined as $g(\Lambda)^{(k)} := (\alpha\mathbf{D}^{-1/2}\mathbf{M}\mathbf{D}^{-1/2})^k$, where $M_{i,j} = W_{i,j}\big\|\sqrt{W_{i,j}/D_{i,i}}\,\mathbf{F}_{i,:} - \sqrt{W_{i,j}/D_{j,j}}\,\mathbf{F}_{j,:}\big\|^{p-2}$ for $i, j \in [N]$ and $\mathbf{F}$ is the matrix of node embeddings. We can study the properties of p-Laplacian message passing by studying the spectral properties of $\alpha\mathbf{D}^{-1/2}\mathbf{M}\mathbf{D}^{-1/2}$, as given below.

Proposition 1. Given a connected graph $G = (V, E, \mathbf{W})$ with node embeddings $\mathbf{F}$ and the p-Laplacian $\Delta_p$ with p-eigenvectors $\{\mathbf{u}^{(l)}\}_{l=0,1,\ldots,N-1}$ and p-eigenvalues $\{\lambda_l\}_{l=0,1,\ldots,N-1}$, let $g_p(\lambda_{i-1}) := \alpha_{i,i}\sum_j D^{-1/2}_{i,i}M_{i,j}D^{-1/2}_{j,j}$ for $i \in [N]$ be the filters defined on the spectral domain of $\Delta_p$, where $M_{i,j} = W_{i,j}\|(\nabla f)([i,j])\|^{p-2}$, $(\nabla f)([i,j])$ is the graph gradient on the edge between nodes $i$ and $j$, and $\|\nabla f(i)\|$ is the norm of the graph gradient at $i$. Let $N_i$ denote the number of edges connected to $i$, $N_{\min} = \min\{N_j\}_{j \in [N]}$, and $k = \arg\max_j\big(\{|u^{(l)}_j|/\sqrt{D_{l,l}}\}_{j \in [N];\, l=0,\ldots,N-1}\big)$. Then:

1. When $p = 2$, $g_p(\lambda_{i-1})$ works as low-high-pass filters.

2. When $p > 2$, if $\|\nabla f(i)\| \leq 2(p-1)/(p-2)$, $g_p(\lambda_{i-1})$ works as low-high-pass filters on node $i$, and $g_p(\lambda_{i-1})$ works as low-pass filters on $i$ when $\|\nabla f(i)\| \geq 2(p-1)/(p-2)$.

3. When $1 \leq p < 2$, if $0 \leq \|\nabla f(i)\| \leq 2(2/N_k)^{1/(p-2)}$, $g_p(\lambda_{i-1})$ works as low-pass filters on node $i$, and $g_p(\lambda_{i-1})$ works as low-high-pass filters on $i$ when $\|\nabla f(i)\| \geq 2(2/N_k)^{1/(p-2)}$. Specifically, when $p = 1$, $N_k$ can be replaced by $N_{\min}$.

Proof see Appendix D.7. Proposition 1 shows that when $p \neq 2$, p-Laplacian message passing adaptively works as low-pass or low-high-pass filters on node $i$ in terms of the degree of local node embedding variation around $i$, i.e., the norm of the graph gradient $\|\nabla f(i)\|$ at node $i$. When $p = 2$, p-Laplacian message passing works as low-high-pass filters on node $i$ regardless of the value of $\|\nabla f(i)\|$. When $p > 2$, p-Laplacian message passing works as low-pass filters on node $i$ for large $\|\nabla f(i)\|$ and works as low-high-pass filters for small $\|\nabla f(i)\|$; therefore, pGNNs with $p > 2$ can work very effectively on graphs with strong homophily. When $1 \leq p < 2$, p-Laplacian message passing works as low-pass filters for small $\|\nabla f(i)\|$ and works as low-high-pass filters for large $\|\nabla f(i)\|$; thus, pGNNs with $1 \leq p < 2$ can work effectively on graphs with low homophily, i.e., heterophilic graphs.

Remark 4. We say that a filter works as a low-pass filter if the low frequencies dominate the other frequencies, and that it works as a low-high-pass filter if the low frequencies and high frequencies dominate the middle frequencies, i.e., the frequencies around the cutoff frequency.

Remark 5. Previous works (Wu et al., 2019; Klicpera et al., 2019) have shown that GCN, SGC, and APPNP work as low-pass filters. The reason is that they add self-loops to the adjacency matrix, which shrinks the spectral range of the Laplacian from [0, 2] to approximately [0, 1.5] and causes them to work as low-pass filters (Wu et al., 2019). On the contrary, we do not add self-loops to the adjacency matrix, and therefore the spectral filter works as low-high-pass filters in our case for p = 2.

5. Empirical Studies

Here, we empirically study the effectiveness of pGNNs for semi-supervised node classification using real-world benchmark and synthetic datasets with heterophily and strong homophily. The experimental results also validate the theoretical findings presented previously.

Datasets and Experimental Setup. We use seven homophilic benchmark datasets: the citation graphs Cora, CiteSeer, and PubMed (Sen et al., 2008), the Amazon co-purchase graphs Computers and Photo, and the coauthor graphs CS and Physics (Shchur et al., 2018), and six heterophilic benchmark datasets: the Wikipedia graphs Chameleon and Squirrel (Rozemberczki et al., 2021), the Actor co-occurrence graph, and the webpage graphs Wisconsin, Texas, and Cornell (Pei et al., 2020). The node classification tasks are conducted in the transductive setting. Following Chien et al. (2021), we use the sparse splitting (2.5%/2.5%/95%) and the dense splitting (60%/20%/20%) to randomly split the homophilic and heterophilic graphs into training/validation/testing sets, respectively. Note that we directly used the data from the PyTorch Geometric library (Fey & Lenssen, 2019), which does not transform Chameleon and Squirrel to undirected graphs; this is different from Chien et al. (2021), where they did so. Dataset statistics and their levels of homophily are presented in Appendix E.
Table 1: Results on heterophilic benchmark datasets. Averaged accuracy (%) over 100 runs (mean ± std). Best results are outlined in bold and results within the 95% confidence interval of the best are outlined in underlined bold in the original paper.

| Method | Chameleon | Squirrel | Actor | Wisconsin | Texas | Cornell |
|---|---|---|---|---|---|---|
| MLP | 48.02 ± 1.72 | 33.80 ± 1.05 | 39.68 ± 1.43 | 93.56 ± 3.14 | 79.50 ± 10.62 | 80.30 ± 11.38 |
| GCN | 34.54 ± 2.78 | 25.28 ± 1.55 | 31.28 ± 2.04 | 61.93 ± 3.00 | 56.54 ± 17.02 | 51.36 ± 4.59 |
| SGC | 34.76 ± 4.55 | 25.49 ± 1.63 | 30.98 ± 3.80 | 66.94 ± 2.58 | 59.99 ± 9.95 | 44.39 ± 5.88 |
| GAT | 45.16 ± 2.10 | 31.41 ± 0.98 | 34.11 ± 1.28 | 65.64 ± 6.29 | 56.41 ± 13.01 | 43.94 ± 7.33 |
| JKNet | 33.28 ± 3.59 | 25.82 ± 1.58 | 29.77 ± 2.61 | 61.08 ± 3.71 | 59.65 ± 12.62 | 55.34 ± 4.43 |
| APPNP | 36.18 ± 2.81 | 26.85 ± 1.48 | 31.26 ± 2.52 | 64.59 ± 3.49 | 82.90 ± 5.08 | 66.47 ± 9.34 |
| GPRGNN | 43.67 ± 2.27 | 31.27 ± 1.76 | 36.63 ± 1.22 | 88.54 ± 4.94 | 80.74 ± 6.76 | 78.95 ± 8.52 |
| 1.0GNN | 48.86 ± 1.95 | 33.75 ± 1.50 | 40.62 ± 1.25 | 95.37 ± 2.06 | 84.06 ± 7.41 | 82.16 ± 8.62 |
| 1.5GNN | 48.74 ± 1.62 | 33.33 ± 1.45 | 40.35 ± 1.35 | 95.24 ± 2.01 | 84.46 ± 7.79 | 78.47 ± 6.87 |
| 2.0GNN | 48.77 ± 1.87 | 33.60 ± 1.47 | 40.07 ± 1.17 | 91.15 ± 2.76 | 87.96 ± 6.27 | 72.04 ± 8.22 |
| 2.5GNN | 48.80 ± 1.77 | 33.79 ± 1.45 | 39.80 ± 1.31 | 87.08 ± 2.69 | 83.01 ± 6.80 | 70.31 ± 8.84 |

Baselines. We compare pGNN with seven models, including MLP, GCN (Kipf & Welling, 2017), SGC (Wu et al., 2019), GAT (Velickovic et al., 2018), JKNet (Xu et al., 2018), APPNP (Klicpera et al., 2019), and GPRGNN (Chien et al., 2021). We use the PyTorch Geometric library to implement all baselines except GPRGNN. For GPRGNN, we use the code released by the authors (https://github.com/jianhao2016/GPRGNN). The details of the hyperparameter settings are deferred to Appendix E.3.

Superior Performance on Real-World Heterophilic Datasets. The results on homophilic benchmark datasets are deferred to Appendix F.1; they show that pGNNs obtain competitive performance against state-of-the-art GNNs on homophilic datasets. Table 1 summarizes the results on heterophilic benchmark datasets. It shows that pGNNs significantly dominate the baselines, and 1.0GNN obtains the best performance on all heterophilic graphs except the Texas dataset, for which 2.0GNN is the best. We also observe that MLP works very well and significantly outperforms most GNN baselines, which indicates that the graph topology is not informative for label prediction on these heterophilic graphs. Therefore, propagating and transforming node features over the graph topology can lead to worse performance than MLP. Unlike ordinary GNNs, pGNNs can adaptively learn aggregation weights and ignore edges that are not informative for label prediction, and thus can work better. This confirms the theoretical findings presented in previous sections. Note that GAT can also learn aggregation weights, i.e., the attention weights. However, the aggregation weights learned by GAT are significantly distinct from those of pGNNs, as we show next.

Interpretability of the Learned Aggregation Weights of pGNNs. We showcase the interpretability of the learned aggregation weights $\alpha_{i,i}D^{-1/2}_{i,i}M_{i,j}D^{-1/2}_{j,j}$ of pGNNs by studying their entropy distribution, along with the attention weights of GAT, on real-world datasets. Denote by $\{A_{i,j}\}_{j \in \mathcal{N}_i}$ the aggregation weights of node $i$ over its neighbors. For GAT, $\{A_{i,j}\}_{j \in \mathcal{N}_i}$ are the attention weights (in the first layer), and for pGNNs they are $\alpha_{i,i}D^{-1/2}_{i,i}M_{i,j}D^{-1/2}_{j,j}$. For any node $i$, $\{A_{i,j}\}_{j \in \mathcal{N}_i}$ forms a discrete probability distribution over all its neighbors, with entropy given by $H(\{A_{i,j}\}_{j \in \mathcal{N}_i}) = -\sum_{j \in \mathcal{N}_i} A_{i,j}\log(A_{i,j})$. Low entropy means a high degree of concentration and vice versa. An entropy of zero means all aggregation weights or attention is on one source node.
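For reference, the entropy $H(\{A_{i,j}\}_{j \in \mathcal{N}_i})$ used in this analysis can be computed per node as in the small sketch below; the explicit renormalization is our own illustrative choice so that the weights form a probability distribution over the neighbors.

```python
import numpy as np

def aggregation_entropy(A_row):
    """Entropy H({A_ij}) = -sum_j A_ij log A_ij of one node's aggregation weights."""
    A_row = np.asarray(A_row, dtype=float)
    A_row = A_row / A_row.sum()              # normalise to a probability distribution
    A_row = A_row[A_row > 0]                 # convention: 0 * log 0 = 0
    return float(-(A_row * np.log(A_row)).sum())

# A node with 8 neighbours: uniform weights give the maximum entropy log(8) ~ 2.079,
# while putting all weight on a single neighbour gives entropy 0.
print(aggregation_entropy(np.full(8, 1 / 8)))
print(aggregation_entropy([1, 0, 0, 0, 0, 0, 0, 0]))
```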
The uniform distribution has the highest entropy of $\log(D_{i,i})$. Fig. 1 reports the results on Computers and Wisconsin; we defer results on other datasets to Appendix F.2 due to the space limit. Fig. 1 shows that the aggregation weight entropy distributions of GAT and pGNNs on Computers (homophilic) are both similar to the uniform case. This indicates that the original graph topology of Computers is very helpful for label prediction, and therefore GNNs work very well on Computers. However, for Wisconsin (heterophilic), the entropy distribution of pGNNs is significantly different from that of GAT and the uniform case. Most of the entropy of pGNNs is around zero, which means that most aggregation weights are on one source node. It indicates that the original graph topology of Wisconsin is not helpful for label prediction, which explains why MLP works well on Wisconsin. On the contrary, the entropy distribution of GAT is similar to the uniform case, and therefore GAT works similarly to GCN and is significantly worse than pGNNs on Wisconsin. Similar results are observed in the experiments on more datasets in Appendix F.2.

[Figure 1: Aggregation weight entropy distributions on Computers and Wisconsin. Panel titles: 1.0GNN (Acc: 85.24%), 1.5GNN (Acc: 85.3%), 2.5GNN (Acc: 83.68%), GAT (Acc: 82.73%), and 1.0GNN (Acc: 98.15%), 1.5GNN (Acc: 96.3%), 2.5GNN (Acc: 84.26%), GAT (Acc: 62.96%). Low entropy means a high degree of concentration, and vice versa; an entropy of zero means all aggregation weights are on one source node.]

Results on cSBM Datasets. We examine the performance of pGNNs on heterophilic graphs whose topology is informative for label prediction using synthetic graphs generated by cSBM (Deshpande et al., 2018) with $\phi \in \{-1, -0.75, \ldots, 1\}$. We use the same settings of cSBM as used in Chien et al. (2021); due to the space limit, we refer the reader to Chien et al. (2021) for more details of the cSBM dataset. Fig. 2 reports the results on cSBM using the sparse splitting (for results on cSBM with the dense splitting, see Appendix F.3).

[Figure 2: Averaged accuracy (%) on cSBM (sparse split) for 20 runs, for $\phi$ from $-1$ to $1$. Best viewed in color.]

Fig. 2 shows that when $\phi \leq -0.5$ (heterophilic graphs), 2.0GNN obtains the best performance, and pGNNs and GPRGNN significantly dominate the others. This validates the effectiveness of pGNNs on heterophilic graphs. Moreover, 2.0GNN works better than GPRGNN, which again confirms that 2.0GNN is superior under weak supervision (2.5% training rate), as stated in Remark 3. Note that 1.0GNN and 1.5GNN are not better than 2.0GNN; the reason could be that the iteration algorithm Eq. (11) with $p = 1, 1.5$ is not as stable as the one with $p = 2$. When the graph topology is almost non-informative for label prediction ($\phi = -0.25, 0$), the performance of pGNNs is close to MLP and they outperform the other baselines. Again, this validates that pGNNs can erase non-informative edges and work as well as MLP, and confirms the statements in Theorem 4. When the graph is homophilic ($\phi \geq 0.25$), 1.5GNN is the best on weakly homophilic graphs ($\phi = 0.25, 0.5$) and pGNNs work competitively with all GNN baselines on strongly homophilic graphs ($\phi \geq 0.75$).

Results on Datasets with Noisy Edges.
We conduct experiments to evaluate the performance of pGNNs on graphs with noisy edges by randomly adding edges to the graphs and randomly removing the same number of original edges. We define the random edge rate as r := #random edges / #all edges. The experiments are conducted on 4 homophilic datasets (Computers, Photo, CS, Physics) and 2 heterophilic datasets (Wisconsin, Texas) with r = 0.25, 0.5, 1. Fig. 3 reports the results on Computers and Wisconsin; we defer more results to Appendix F.4.

[Figure 3: Averaged accuracy (%) on graphs with noisy edges for 20 runs. Best viewed in color.]

Fig. 3 shows that pGNNs significantly outperform all baselines. Specifically, 1.5GNN obtains the best performance on Computers, and 1.5GNN and 2.0GNN even work as well as MLP on Computers with completely random edges (r = 1). For Wisconsin, 1.0GNN is the best, and 1.0GNN and 1.5GNN significantly dominate the others. We also observe that APPNP and GPRGNN, whose architectures are analogous to 2.0GNN, work better than other GNNs. Nevertheless, they are significantly outperformed by pGNNs overall. Similar results are observed in the experiments conducted on more datasets, as presented in Appendix F.4.

6. Conclusion

We have addressed the problem of generalizing GNNs to heterophilic graphs and graphs with noisy edges. To this end, we derived a novel p-Laplacian message passing scheme from a discrete regularization framework and proposed a new pGNN architecture. We theoretically demonstrated that our method works as low-high-pass filters and is thereby applicable to both homophilic and heterophilic graphs. We empirically validated our theoretical results and showed the advantages of our methods on heterophilic graphs and graphs with non-informative topologies. Like most existing spectral-based GNN models, e.g., GCN (Kipf & Welling, 2017) and SGC (Wu et al., 2019), the main restriction of pGNNs is the relatively high space cost compared to GraphSAGE (Hamilton et al., 2017), especially for extremely large graphs. Integrating pGNNs and p-Laplacian message passing with popular subgraph sampling techniques so that pGNNs (or their variants) can scale to large graphs would be interesting future work. We refer the reader to Appendix B for further discussions on the potential extensions of pGNNs.

Acknowledgements

Guoji would like to thank Prof. Xin Yao and Dr. Yunwen Lei for their sincere and selfless help and support.

Reproducibility Statement

In order to ensure reproducibility, we have made efforts in the following respects: (1) provide our code as supplementary material; (2) provide self-contained proofs of the main claims in Appendices C and D; (3) provide more details on the experimental configurations in Appendix E and experimental results in Appendix F. All the datasets are publicly available as described in the main text. The code of pGNNs is available at https://github.com/guoji-fu/pGNNs.

References

Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., and Alemi, A. A. Watch your step: Learning node embeddings via graph attention. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, NeurIPS 2018, December 3-8, 2018, Montreal, Canada, pp. 9198-9208, 2018.

Abu-El-Haija, S., Kapoor, A., Perozzi, B., and Lee, J. N-GCN: Multi-scale graph convolution for semi-supervised node classification. In Globerson, A. and Silva, R.
(eds.), Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, volume 115 of Proceedings of Machine Learning Research, pp. 841 851. AUAI Press, 2019. Amghibech, S. Eigenvalues of the discrete p-laplacian for graphs. Ars Combinatoria, 67:283 302, 2003. Atwood, J. and Towsley, D. Diffusion-convolutional neural networks. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, Neur IPS 2016, December 5-10, 2016, Barcelona, Spain, pp. 1993 2001, 2016. Bach, F. and Jordan, M. Learning spectral clustering. Advances in Neural Information Processing Systems, NIPS 2004, 16(2):305 312, 2004. p-Laplacian Based Graph Neural Networks Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez Gonzalez, A., Zambaldi, V. F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., G ulc ehre, C ., Song, H. F., Ballard, A. J., Gilmer, J., Dahl, G. E., Vaswani, A., Allen, K. R., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. Co RR, abs/1806.01261, 2018. Belkin, M. and Niyogi, P. Towards a theoretical foundation for laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289 1308, 2008. Belkin, M., Matveeva, I., and Niyogi, P. Regularization and semi-supervised learning on large graphs. In The 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1-4, 2004, volume 3120, pp. 624 638, 2004. Belkin, M., Niyogi, P., and Sindhwani, V. On manifold regularization. In Cowell, R. G. and Ghahramani, Z. (eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005. Society for Artificial Intelligence and Statistics, 2005. Bruna, J., Zaremba, W., Szlam, A., and Le Cun, Y. Spectral networks and locally connected networks on graphs. In The 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, 2014. B uhler, T. and Hein, M. Spectral clustering based on the graph p-laplacian. In Danyluk, A. P., Bottou, L., and Littman, M. L. (eds.), Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM International Conference Proceeding Series, pp. 81 88. ACM, 2009. Chen, J., Ma, T., and Xiao, C. Fastgcn: Fast learning with graph convolutional networks via importance sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. Chien, E., Peng, J., Li, P., and Milenkovic, O. Adaptive universal generalized pagerank graph neural network. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, Neur IPS 2016, December 5-10, 2016, Barcelona, Spain, pp. 3837 3845, 2016. Deshpande, Y., Sen, S., Montanari, A., and Mossel, E. Contextual stochastic block models. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. 
(eds.), Advances in Neural Information Processing Systems 31, Neur IPS 2018, December 3-8, 2018, Montr eal, Canada, pp. 8590 8602, 2018. Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., G omez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, Neur IPS 2015 December 7-12, 2015, Montreal, Quebec, Canada, pp. 2224 2232, 2015. Fey, M. and Lenssen, J. E. Fast graph representation learning with pytorch geometric. Co RR, abs/1903.02428, 2019. Fu, G., Hou, Y., Zhang, J., Ma, K., Kamhoua, B. F., and Cheng, J. Understanding graph neural networks from graph signal denoising perspectives. Co RR, abs/2006.04386, 2020. Gama, F., Bruna, J., and Ribeiro, A. Stability properties of graph neural networks. IEEE Transactions on Signal Processing, 68:5680 5695, 2020. Garg, V. K., Jegelka, S., and Jaakkola, T. S. Generalization and representational limits of graph neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 3419 3430. PMLR, 2020. Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems 17, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pp. 529 536, 2004. Hamilton, W. L., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, Neur IPS 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 1024 1034, 2017. Hein, M. Uniform convergence of adaptive graph-based regularization. In The 19th Annual Conference on Learning Theory, COLT 2006, Pittsburgh, PA, USA, June 2225, 2006, volume 4005, pp. 50 64, 2006. Henaff, M., Bruna, J., and Le Cun, Y. Deep convolutional networks on graph-structured data. Co RR, abs/1506.05163, 2015. p-Laplacian Based Graph Neural Networks Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Huang, Q., He, H., Singh, A., Lim, S., and Benson, A. R. Combining label propagation and simple models outperforms graph neural networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. Klicpera, J., Bojchevski, A., and G unnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. Li, G., M uller, M., Thabet, A. K., and Ghanem, B. Deepgcns: Can gcns go as deep as cnns? In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 9266 9275. IEEE, 2019. Li, Q., Han, Z., and Wu, X. 
Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3538 3545, 2018. Liao, R., Zhao, Z., Urtasun, R., and Zemel, R. S. Lanczosnet: Multi-scale deep graph convolutional networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. Liu, F., Chakraborty, S., Li, F., Liu, Y., and Lozano, A. C. Bayesian regularization via graph laplacian. Bayesian Analysis, 9(2):449 474, 2014. Liu, J. and Han, J. Spectral clustering. In Aggarwal, C. C. and Reddy, C. K. (eds.), Data Clustering: Algorithms and Applications, pp. 177 200. CRC Press, 2013. Liu, X., Jin, W., Ma, Y., Li, Y., Liu, H., Wang, Y., Yan, M., and Tang, J. Elastic graph neural networks. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 6837 6849. PMLR, 2021. Loukas, A. What graph neural networks cannot learn: depth vs width. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Luo, D., Huang, H., Ding, C. H. Q., and Nie, F. On the eigenvectors of p-laplacian. Machine Learning, 81(1): 37 51, 2010. Marcheggiani, D. and Titov, I. Encoding sentences with graph convolutional networks for semantic role labeling. In Palmer, M., Hwa, R., and Riedel, S. (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 1506 1515. Association for Computational Linguistics, 2017. Nadler, B., Srebro, N., and Zhou, X. Semi-supervised learning with the graph laplacian: The limit of infinite unlabelled data. Advances in Neural Information Processing Systems, NIPS 2009, 22:1330 1338, 2009. Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In Balcan, M. and Weinberger, K. Q. (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 2014 2023. JMLR.org, 2016. Niyogi, P. Manifold regularization and semi-supervised learning: some theoretical analyses. Journal of Machine Learning Research, 14(1):1229 1250, 2013. NT, H. and Maehara, T. Revisiting graph neural networks: All we have is low-pass filters. Co RR, abs/1905.09550, 2019. Oono, K. and Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Page, L., Brin, S., Motwani, R., and Winograd, T. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Info Lab, 1999. Pei, H., Wei, B., Chang, K. C., Lei, Y., and Yang, B. Geomgcn: Geometric graph convolutional networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. p-Laplacian Based Graph Neural Networks Rozemberczki, B., Allen, C., and Sarkar, R. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2), 2021. Satorras, V. G. and Estrach, J. B. Few-shot learning with graph neural networks. 
In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93 106, 2008. Shchur, O., Mumme, M., Bojchevski, A., and G unnemann, S. Pitfalls of graph neural network evaluation. Co RR, abs/1811.05868, 2018. Sindhwani, V., Niyogi, P., Belkin, M., and Keerthi, S. Linear manifold regularization for large scale semisupervised learning. In Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, volume 28, 2005. Slepcev, D. and Thorpe, M. Analysis of p-laplacian regularization in semi-supervised learning. Co RR, abs/1707.06213, 2017. Smola, A. J. and Kondor, R. Kernels and regularization on graphs. In Sch olkopf, B. and Warmuth, M. K. (eds.), Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, volume 2777 of Lecture Notes in Computer Science, pp. 144 158. Springer, 2003. Thekumparampil, K. K., Wang, C., Oh, S., and Li, L. Attention-based graph neural network for semisupervised learning. Co RR, abs/1803.03735, 2018. van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86): 2579 2605, 2008. van Engelen, J. E. and Hoos, H. H. A survey on semisupervised learning. Machine Learning, 109(2):373 440, 2020. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Li o, P., and Bengio, Y. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, 2018. Velickovic, P., Fedus, W., Hamilton, W. L., Li o, P., Bengio, Y., and Hjelm, R. D. Deep graph infomax. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. Verma, S. and Zhang, Z. Stability and generalization of graph convolutional neural networks. In The 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 1539 1548, 2019. von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395 416, 2007. Wang, H. and Leskovec, J. Unifying graph convolutional neural networks and label propagation. Co RR, abs/2002.06755, 2020. Wu, F., Jr., A. H. S., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Q. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 6861 6871, 2019. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning System, 32(1):4 24, 2021. Xinyi, Z. and Chen, L. Capsule graph neural network. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm assan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 5449 5458. PMLR, 2018. 
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2017 New Orleans, LA, USA, May 6-9, 2019, Conference Track Proceedings, 2019. Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W. L., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, Neur IPS 2018, December 3-8, 2018, Montr eal, Canada, pp. 4805 4815, 2018. Ying, Z., Bourgeois, D., You, J., Zitnik, M., and Leskovec, J. Gnnexplainer: Generating explanations for graph neural networks. In Advances in Neural Information Processing Systems 32, Neur IPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 9240 9251, 2019. p-Laplacian Based Graph Neural Networks Zeng, H., Zhou, H., Srivastava, A., Kannan, R., and Prasanna, V. K. Graphsaint: Graph sampling based inductive learning method. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Zhou, D. and Sch olkopf, B. Regularization on discrete spaces. In The 27th DAGM Symposium, Vienna, Austria, August 31 - September 2, 2005, volume 3663, pp. 361 368, 2005. Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Sch olkopf, B. Learning with local and global consistency. In Thrun, S., Saul, L. K., and Sch olkopf, B. (eds.), Advances in Neural Information Processing Systems 16, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pp. 321 328. MIT Press, 2003. Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. Graph neural networks: A review of methods and applications. AI Open, 1:57 81, 2020. Zhou, X. and Belkin, M. Semi-supervised learning by higher order regularization. In Gordon, G. J., Dunson, D. B., and Dud ık, M. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, volume 15 of JMLR Proceedings, pp. 892 900. JMLR.org, 2011. Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., and Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems 33, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Zhu, J., Rossi, R. A., Rao, A., Mai, T., Lipka, N., Ahmed, N. K., and Koutra, D. Graph neural networks with heterophily. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, pp. 11168 11176. AAAI Press, 2021. Zhu, X., Ghahramani, Z., and Lafferty, J. D. Semisupervised learning using gaussian fields and harmonic functions. In Fawcett, T. and Mishra, N. (eds.), Machine Learning, Proceedings of the Twentieth International Conference, ICML 2003, August 21-24, 2003, Washington, DC, USA, pp. 912 919. AAAI Press, 2003. Zitnik, M. and Leskovec, J. Predicting multicellular function through multi-layer tissue networks. Bioinform., 33 (14):i190 i198, 2017. p-Laplacian Based Graph Neural Networks A Related Work 15 B Discussions and Future Work 16 C Additional Theorems 17 C.1 Theorem 4 (Upper-Bounding Risk of p GNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.2 Theorem 5 (p-Orthogonal Theorem (Luo et al., 2010)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
17 C.3 Theorem 6 (p-Eigen-Decomposition of p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.4 Theorem 7 (Bounds of p-Eigenvalues) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 D Proof of Theorems 18 D.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 D.2 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 D.3 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.4 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 D.5 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 D.6 Proof of Theorem 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 D.7 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 E Dataset Statistics and Hyperparameters 28 E.1 Illustration of Graph Gradient and Graph Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 E.2 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 E.3 Hyperparameter Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 F Additional Experiments 30 F.1 Experimental Results on Homophilic Benchmark Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F.2 Experimental Results of Aggregation Weight Entropy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F.3 Experimental Results on c SBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F.4 Experimental Results on Graphs with Noisy Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F.5 Experimental Results of Intergrating p GNNs with GCN and JKNet . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F.6 Experimental Results of p GNNs on PPI Dataset for Inductive Learning . . . . . . . . . . . . . . . . . . . . . . . . . 36 F.7 Experimental Results of p GNNs on OGBN ar Xiv Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 F.8 Running Time of p GNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 F.9 Experimental Results on Benchmark Datasets for 64 Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 F.10 Training Curves for p = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 F.11 Visualization Results of Node Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 p-Laplacian Based Graph Neural Networks A. Related Work Graph Neural Networks. Graph neural networks (GNNs) are a variant of neural networks for graph-structured data, which can propagate and transform the node features over the graph topology and exploit the information in the graphs. Graph convolutional networks (GCNs) are one type of GNNs whose graph convolution mechanisms or the message passing schemes were mainly inspired by the field of graph signal processing. Bruna et al. 
(2014) defined a nonparametric graph filter using Fourier coefficients. Defferrard et al. (2016) introduced Chebyshev polynomials to avoid the computationally expensive eigen-decomposition of the Laplacian and to obtain localized spectral filters. GCN (Kipf & Welling, 2017) used a first-order approximation and a reparameterization trick to simplify the spectral filters and obtain a layer-wise graph convolution. SGC (Wu et al., 2019) further simplifies GCN by removing the non-linear transition functions between layers. Chen et al. (2018) proposed importance sampling to design an efficient variant of GCN. Xu et al. (2018) explored a jumping knowledge architecture that flexibly leverages different neighborhood ranges for each node to enable better structure-aware representations. Atwood & Towsley (2016); Liao et al. (2019); Abu-El-Haija et al. (2019) exploited multi-scale information by diffusing multi-hop neighbor information over the graph topology. Wang & Leskovec (2020) used label propagation to improve GCNs. Klicpera et al. (2019) incorporated personalized PageRank into GCNs. Liu et al. (2021) introduced an l1-norm-based graph smoothing term to enhance the local smoothness adaptivity of GNNs. Hamilton et al. (2017); Zeng et al. (2020) proposed sampling and aggregation frameworks to extend GCNs to inductive learning settings. Another variant of GNNs is graph attention networks (Velickovic et al., 2018; Thekumparampil et al., 2018; Abu-El-Haija et al., 2018), which use attention mechanisms to adaptively learn aggregation weights based on the node features. There are many other works on GNNs (Pei et al., 2020; Ying et al., 2018; Xinyi & Chen, 2019; Velickovic et al., 2019; Zeng et al., 2020); we refer to Zhou et al. (2020); Battaglia et al. (2018); Wu et al. (2021) for a comprehensive review. Most GNN models implicitly assume that the labels of nodes and their neighbors should be the same or consistent, which does not hold for heterophilic graphs. Zhu et al. (2020) investigated the issues of GNNs on heterophilic graphs and proposed to separately learn the embeddings of the ego-node and its neighborhood. Zhu et al. (2021) proposed a framework to model the heterophily or homophily level of graphs. Chien et al. (2021) incorporated generalized PageRank into graph convolutions to adapt GNNs to heterophilic graphs. There are also several recent works on understanding and interpreting GNNs. Li et al. (2018); Ying et al. (2019); Fu et al. (2020) showed that spectral graph convolutions work by conducting Laplacian smoothing on the graph signals, and Wu et al. (2019); NT & Maehara (2019) demonstrated that GCN and SGC work as low-pass filters. Gama et al. (2020) studied the stability properties of GNNs. Xu et al. (2019); Oono & Suzuki (2020); Loukas (2020) studied the expressiveness of GNNs. Verma & Zhang (2019); Garg et al. (2020) studied the generalization and representation power of GNNs.

Graph-based Semi-supervised Learning. Graph-based semi-supervised learning works under the assumption that the labels of a node and its neighbors shall be the same or consistent. Many methods have been proposed in the last decade. For instance, Smola & Kondor (2003); Zhou et al. (2003); Belkin et al. (2004) use Laplacian regularization techniques to force the labels of linked nodes to be the same or consistent. Zhou & Schölkopf (2005) introduce discrete regularization techniques based on the p-Laplacian to impose different regularizations on the node features.
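To make the flavor of this discrete regularization concrete, the sketch below evaluates a p-Dirichlet smoothness term plus a least-squares fitting term on a toy graph. It assumes the graph gradient convention (∇F)([i, j]) = sqrt(W_ij/D_jj) F_j − sqrt(W_ij/D_ii) F_i illustrated in Appendix E.1; the precise objective optimized by pGNN is Eq. (8) in the main text, so this should be read as an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def graph_gradient(W, F):
    """(Assumed) discrete graph gradient on each directed edge [i, j]:
    (grad F)([i, j]) = sqrt(W_ij / D_jj) * F_j - sqrt(W_ij / D_ii) * F_i."""
    D = W.sum(axis=1)                       # node degrees
    grad = {}
    for i, j in zip(*np.nonzero(W)):
        grad[(i, j)] = np.sqrt(W[i, j] / D[j]) * F[j] - np.sqrt(W[i, j] / D[i]) * F[i]
    return grad

def p_regularization_objective(W, F, X, p=2.0, mu=0.1):
    """p-Dirichlet energy plus a least-squares fitting term:
    0.5 * sum_{(i,j)} ||grad F([i, j])||^p + mu * ||F - X||_F^2 (a sketch of Eq. (8))."""
    grad = graph_gradient(W, F)
    smoothness = 0.5 * sum(np.linalg.norm(g) ** p for g in grad.values())
    fitting = mu * np.linalg.norm(F - X) ** 2
    return smoothness + fitting

if __name__ == "__main__":
    # Toy 4-node graph with random features; compare the energy for a few values of p.
    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    X = np.random.default_rng(0).normal(size=(4, 3))    # observed node features
    for p in (1.0, 2.0, 2.5):
        print(p, p_regularization_objective(W, X, X, p=p, mu=0.1))
```

Smaller values of p penalize large feature jumps across individual edges less aggressively, which is the intuition behind preferring p < 2 on heterophilic or noisy graphs.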
Label propagation (Zhu et al., 2003) recursively propagates the labels of the labeled nodes over the graph topology and uses the converged results to make predictions. To mention but a few, we refer to Zhou & Schölkopf (2005); van Engelen & Hoos (2020) for a more comprehensive review.

B. Discussions and Future Work

In this section, we discuss the future work of pGNNs. Our theoretical and experimental results could lead to several potential extensions of pGNNs.

New Paradigm of Designing GNN Architectures. We bridge the gap between the discrete regularization framework, graph-based semi-supervised learning, and GNNs, which provides a new paradigm for designing GNN architectures. Following this paradigm, researchers could introduce more regularization techniques, e.g., Laplacian regularization (Smola & Kondor, 2003; Belkin et al., 2004), manifold regularization (Sindhwani et al., 2005; Belkin et al., 2005; Niyogi, 2013), high-order regularization (Zhou & Belkin, 2011), Bayesian regularization (Liu et al., 2014), and entropy regularization (Grandvalet & Bengio, 2004), and consider more explicit assumptions on graphs, e.g., the homophily assumption, the low-density region assumption (i.e., the decision boundary is likely to lie in a low data-density region), and the manifold assumption (i.e., the high-dimensional data lie on a low-dimensional manifold), to develop new graph convolutions or message passing schemes for graphs with specific properties and to generalize GNNs to a much broader range of graphs. Moreover, the paradigm also enables us to explicitly study the behaviors of the designed graph convolutions or message passing schemes using the theory of regularization (Belkin & Niyogi, 2008; Niyogi, 2013; Slepcev & Thorpe, 2017).

Applications of pGNNs to learning on graphs with noisy topologies. The empirical results on graphs with noisy edges (as shown in Fig. 3 and Tables 6 and 7) show that pGNNs are very robust to noisy edges, which suggests applying p-Laplacian message passing and pGNNs in graph learning scenarios where the graph topology may be severely corrupted.

Integrating with existing GNN architectures. As shown in Table 9, the experimental results on heterophilic benchmark datasets illustrate that integrating GCN and JKNet with pGNNs can significantly improve their performance on heterophilic graphs. This shows that pGNN could be used as a plug-and-play component integrated into existing GNN architectures to improve their performance in real-world applications.

Inductive learning for pGNNs. pGNNs are shown to be very effective for inductive learning on the PPI dataset, as reported in Table 10. pGNNs even outperform GAT on PPI while using far fewer parameters than GAT. This suggests promising extensions of pGNNs to inductive learning on graphs.

C. Additional Theorems

C.1. Theorem 4 (Upper-Bounding Risk of pGNN)

Theorem 4 (Upper-Bounding Risks of pGNNs). Given a graph G = (V, E, W) with N nodes, let X ∈ R^{N×c} be the node features and y ∈ R^N be the node labels, and let M^(k), α^(k), β^(k), F^(k) be updated according to Equations (12) to (14) for k = 0, 1, ..., K−1 with F^(0) = X and K ∈ N. Assume that G is d-regular, that the ground-truth node features satisfy X′ = X + ε, where ε ∈ R^{N×c} represents the noise in the node features, and that there exists an L-Lipschitz function σ : R^{N×c} → R^N such that σ(X′) = y.
let y(k+1) = α(k)D 1/2M(k)D 1/2σ(F(k)) + β(k)σ F(0) , we have i=1 |yi yi| 1 i=1 β(K 1) i,i |yi σ(Xi,:)| i=1 α(K 1) i,i (K 1) p F(K 1) i,: + M (l) i,j d i=1 (1 β(K 1) i,i ) ϵi,: . Proof see Appendix D.4. Theorem 4 shows that the risk of p GNNs is upper-bounded by the sum of three terms: The first term of the r.h.s in the above inequation represents the risk of label prediction using only the original node features X, the second term is the norm of p-Laplacian diffusion on the node features X, and the third term is the magnitude of the noise in the node features. αi,i and βi,i control the trade-off between these three terms and they are related to the hyperparameter µ in Eq. (10). The smaller µ, the smaller βi,i and larger αi,i, thus the more important of the p-Laplacian diffusion term but also the more effect from the noise. Therefore, for graphs whose topological information is not helpful for label prediction, we could impose more weights on the first term by using a large µ so that p GNNs work more like MLPs which simply learn on node features. While for graphs whose topological information is helpful for label prediction, we could impose more weights on the second term by using a small µ so that p GNNs can benefit from p-Laplacian smoothing on node features. In practice, to choose a proper value of µ one may first simply apply MLPs on the node features to have a glance at the helpfulness of the node features. If MLPs work very well, there is not much space for the graph s topological information to further improve the prediction performance and we may choose a large µ. Otherwise, there could be a large chance for the graph s topological information to further improve the performance and we should choose a small µ. C.2. Theorem 5 (p-Orthogonal Theorem (Luo et al., 2010)) Theorem 5 (p-Orthogonal Theorem (Luo et al., 2010)). If u(l) and u(r) are two eigenvectors of p-Laplacian p associated with two different non-zero eigenvalues λl and λr, W is symmetric and p 1, then u(l) and u(r) are p-orthogonal up to the second order Taylor expansion. Theorem 5 implies that ϕp(u)(l) ϕp(u(r)) 0, for all l, r = 0, . . . , N 1 and λl = λr. Therefore, the space spanned by the multiple eigenvectors of the graph p-Laplacian is p-orthogonal. C.3. Theorem 6 (p-Eigen-Decomposition of p) Theorem 6 (p-Eigen-Decomposition of p). Given the p-eigenvalues {λl R}l=0,1,...,N 1, and the p-eigenvectors {u(l) RN}l=0,1,...,N 1 of p-Laplacian p and u(l) p = (PN i=1(u(l) i )p)1/p = 1, let U be a matrix of p-eigenvectors with U = (u(0), u(1), . . . , u(N 1)) and Λ be a diagonal matrix with Λ = diag(λ0, λ1, . . . , λN 1), then the p-eigendecomposition of p-Laplacian p is given by p = Φp(U)ΛΦp(U) . When p = 2, it reduces to the standard eigen-decomposition of the Laplacian matrix. Proof see Appendix D.5. p-Laplacian Based Graph Neural Networks C.4. Theorem 7 (Bounds of p-Eigenvalues) Theorem 7 (Bounds of p-Eigenvalues). Given a graph G = (V, E, W), if G is connected and λ is a p-eigenvalue associated with the p-eigenvector u of p, let Ni denotes the number of edges connected to node i, Nmin = min{Ni}i=1,2,...,N, and k = arg max({|ui|/ p Di,i}i=1,2,...,N), then 1. for p 2, 0 λ 2p 1; 2. for 1 < p < 2, 0 λ 2p 1 Nk; 3. for p = 1, 0 λ Nmin. Proof see Appendix D.6. D. Proof of Theorems D.1. Proof of Theorem 1 Proof. Let i be the one-hot indicator vector whose i-th element is one and the other elements are zero. 
Then, we can obtain the personalized Page Rank on node i, denoted as πPPR(i), by using the recurrent equation (Klicpera et al., 2019): π(k+1) PPR (i) = αD 1/2WD 1/2π(k) PPR(i) + βi, where k is the iteration step, 0 < α < 1 and β = (1 α) represents the restart probability. Without loss of generality, suppose π(0) PPR(i) = i. Then we have, π(k) PPR(i) = αD 1/2WD 1/2π(k 1) PPR (i) + βi = αD 1/2WD 1/2 αD 1/2W D 1/2π(k 2) PPR (i) + βi + βi = αD 1/2WD 1/2 2 π(t 2) PPR (i) + βαD 1/2WD 1/2i + βi = αD 1/2WD 1/2 k π(0) PPR(i) + β αD 1/2WD 1/2 t i = αD 1/2WD 1/2 k i + β αD 1/2WD 1/2 t i Since 0 < α < 1 and the eigenvalues of D 1/2WD 1/2 in [ 1, 1], we have αD 1/2WD 1/2 k = 0, and we also have αD 1/2WD 1/2 t = IN αD 1/2WD 1/2 1 . πPPR(i) = lim k π(k) PPR(i) = β IN αD 1/2WD 1/2 1 i = β (α 2 + (1 α)IN) 1 i = µ( 2 + µIN) 1i, where we let α = 1 1+µ and β = µ 1+µ, µ > 0. Then the fully personalized Page Rank matrix can be obtained by substituting i with IN: ΠPPR = µ( 2 + µIN) 1. p-Laplacian Based Graph Neural Networks D.2. Proof of Theorem 2 Proof. By the definition of Lp(f) in Eq. (8), we have for some positive real value µ, µ > 0 Wi,j Dj,j Fj,: i=1 Fi,: Xi,: 2. and by Eq. (12), M (k) i,j := Wi,j Di,i F(k) i,: Wi,j Dj,j F(k) j,: Then, we have F(k) i,: = p Di,i F(k) i,: Wi,j Dj,j F(k) j,: Di,i F(k) i,: Wi,j Dj,j F(k) j,: + 2µ(F(k) i,: Xi,:) M (k) i,j Di,i F(k) i,: M (k) i,j p Di,i Dj,j F(k) j,: + 2µ(F(k) i,: Xi,:) M (k) i,j Di,i + 2µ M (k) i,j p Di,i Dj,j F(k) j,: + 2µ M (k) i,j p Di,i Dj,j F(k) j,: + β(k) i,i Xi,: F(k) i,: F(k+1) i,: , which indicates that F(k) i,: F(k+1) i,: = α(k) i,i p Lp(F(k)) For all i, j [N], v R1 c, denote by Lp(F(k) i,: ) := Lp(F(k)) M (k) i,j := Wi,j Di,i (F(k) i,: + v) Wi,j Dj,j F(k) j,: α (k) i,i := 1 M (k) i,j Di,i + 2µ β (k) i,i := 2µ p α (k) i,i F (k+1) i,: := α (k) i,i M (k) i,j p Di,i Dj,j F(k) j,: + β i,i Xi,:. p-Laplacian Based Graph Neural Networks Lp(F(k) i,: + v) Lp(F(k) i,: ) F(k) i,: + v F (k+1) i,: p F(k) i,: F(k+1) i,: α (k) i,i v + F(k) i,: F (k) i,: p F(k) i,: F(k+1) i,: α (k) i,i v + α (k) i,i p α (k) i,i F (k+1) i,: + p α(k) i,i F(k+1) i,: α (k) i,i v + p M (k) i,j Di,i M (k) i,j Di,i M (k) i,j p Di,i Dj,j F(k) j,: 2µ M (k) i,j p Di,i Dj,j F(k) j,: + 2µ α (k) i,i v + p M (k) i,j Di,i M (k) i,j Di,i M (k) i,j p Di,i Dj,j F(k) j,: + M (k) i,j p Di,i Dj,j F(k) j,: M (k) i,j Di,i + 2µ M (k) i,j M (k) i,j Di,i v + p M (k) i,j M (k) i,j Di,i F(k) i,: M (k) i,j M (k) i,j p Di,i Dj,j F(k) j,: M (k) i,j Di,i + 2µ p + o (p, v, X, G) Therefore, there exists some real positive value µ o (p, v, X, G) > 0 such that Lp(F(k) i,: + v) Lp(F(k) i,: ) p N X M (k) i,j Di,i + 2µ α(k) i,i v . (19) Let γ = (γ1, . . . , γN) RN and η RN c. By Taylor s theorem, we have: Lp(F(k) i,: + γiηi,:) = Lp(F(k) i,: ) + γi 0 Lp(F(k) i,: + ϵγiηi,:), ηi,: dϵ = Lp(F(k) i,: ) + γi ηi,:, Lp(F(k) i,: ) + γi 0 Lp(F(k) i,: + ϵγiηi,:) Lp(F(k) i,: ), ηi,: dϵ Lp(F(k) i,: ) + γi ηi,:, Lp(F(k) i,: ) + γi 0 Lp(F(k) i,: + ϵγiηi,:) Lp(F(k) i,: ) ηi,: dϵ Lp(F(k) i,: ) + γi ηi,:, Lp(F(k) i,: ) + p 2α(k) i,i γ2 i ηi,: 2 Let η = Lp(F(k)) and choose some positive real value µ which depends on X, G, p and p > 1, i.e. µ o (p, X, G). p-Laplacian Based Graph Neural Networks By Eq. (19), we have for all i [N], Lp(F(k) i,: γ Lp(F(k) i,: )) Lp(F(k) i,: ) γi Lp(F(k) i,: ), Lp(F(k) i,: ) + p 2α(k) i,i γ2 i Lp(F(k) i,: ) 2 = Lp(F(k) i,: ) p 2α(k) i,i γi = Lp(F(k) i,: ) p γi α(k) i,i p Lp(F(k) i,: ) 2. 
Then for all i [N], when 0 γi 2α(k) i,i p , we have Sp(F(k) i,: γi Sp(F(k) i,: )) Sp(F(k) i,: ) and γi = α(k) i,i p minimizes Sp(F(k) i,: γi Sp(F(k) i,: )). Therefore, Lp(F(k+1)) = Lp(F(k) 1 p α(k) Lp(F(k))) Lp(F(k)). D.3. Proof of Theorem 3 Proof. Without loss of generality, suppose F(0) = X. Denote M(k) = D 1/2M(k)D 1/2, by Eq. (14), we have for K 2, F(K) = α(K 1)D 1/2M(K 1)D 1/2F(K 1) + β(K 1)X = α(K 1) MK 1F(K 1) + β(K 1)X = α(K 1) MK 1 α(K 2) MK 2F(K 2) + β(K 2)X + β(K 1)X = α(K 1)α(K 2) MK 1 MK 2F(K 2) + α(K 1) MK 1β(K 2)X + β(K 1)X k=0 α(k) ! K 1 Y l=K k α(l) M(l) ! β(K 1 k)X + β(K 1)X k=0 α(k) ! K 1 Y l=K k α(l) M(l) ! β(K 1 k)X + β(K 1)X. (20) Recall Equations (12) and (13), we have M (k) i,j = Wi,j p Di,i F(k) i,: Wi,j Dj,j F(k) j,: , for all i, j = 1, 2, . . . , N, and α(k) i,i = 1 PN j=1 M (k) i,j Di,i + 2µ , for all i = 1, 2, . . . , N. Note that the eigenvalues of M are not infinity and 0 < αi,i < 1 for all i = 1, . . . , N. Then we have k=0 α(k) = 0, k=0 α(k) ! K 1 Y p-Laplacian Based Graph Neural Networks lim K F(K) = lim K l=K k α(l) M(l) ! β(K 1 k)X + β(K 1)X By Equations (6) and (12), we have Di,i ( f)([j, i]) p 2f(i) Di,i Dj,j ( f)([j, i]) p 2f(j) Di,i Dj,j f(j). (22) By Eq. (13), we have N X M (k) i,j Di,i = 1 α(k) i,i 2µ Equations (22) and (23) show that (k) p = α(k) 1 2µ which indicates α(k) M(k) = IN 2µ p α(k) α(k) (k) p . (25) Eq. (25) shows that α(k) M(k) is linear w.r.t p and therefore can be expressed by a linear combination in terms of p: α(k) M(k) = θ (k) p, (26) where θ = diag(θ 0, θ 1, . . . , θ N 1) are the parameters. Therefore, we have lim K F(K) = lim K l=K k α(l) M(l) ! β(K 1 k)X + β(K 1)X l=K k θ (l) p β(K 1 k)X + β(K 1)X k=1 β(K 1 k) K 1 Y l=K k θ (l) ! k p X + β(K 1)X k=0 θ (k) k p X, where θ (k) = diag(θ (k) 1 , θ (k) 2 , . . . , θ (k) N ) defined as θ (0) = β(K 1) and θ (k) i = β(K 1 k) i,i l=K k θ (l) i , for k = 1, 2, . . . , K 1. Let θ = (θ0, θ1, . . . , θK 1) defined as θk = PN i=1 θ (k) i for all k = 0, 1, . . . , K 1, then lim K F(K) = lim K k=1 θk k p X + θ0X k=0 θk k p X. Therefore complete the proof. p-Laplacian Based Graph Neural Networks D.4. Proof of Theorem 4 Proof. The first-order Taylor expansion with Peano s form of remainder for σ at X i,: is given by: σ(F(K 1) j,: ) = σ(X i,:) + σ(X i,:) X F(K 1) j,: X i,: + o( F(K 1) j,: X i,: ). Note that in general the output non-linear layer σ( ) is simple. Here we assume that it can be well approximated by the first-order Taylor expansion and we can ignore the Peano s form of remainder. For all i = 1, . . . , N, Di,i = Dj,j = d, we have αi,i PN j=1 D 1/2 i,i M (K 1) i,j D 1/2 j,j + β(K 1) i,i = 1. Then yi α(K 1) i,i M (K 1) i,j p Di,i Dj,j σ FK 1 j,: β(K 1) i,i σ (Xi,:) yi β(K 1) i,i σ (Xi,:) α(K 1) i,i M (K 1) i,j p σ X i,: + σ X i,: F(K 1) j,: X i,: ! 
yi α(K 1) i,i M (K 1) i,j p Di,i Dj,j yi β(K 1) i,i σ(Xi,:) α(K 1) i,i M (K 1) i,j p F(K 1) j,: X i,: β(K 1) i,i (yi σ(Xi,:)) α(K 1) i,i M (K 1) i,j p F(K 1) j,: Xi,: ϵi,: β(K 1) i,i |yi σ(Xi,:)| + α(K 1) i,i M (K 1) i,j p F(K 1) j,: Xi,: + (1 β(K 1) i,i ) σ(X i,:) X ϵ i,: β(K 1) i,i |yi σ(Xi,:)| + α(K 1) i,i M (K 1) i,j p F(K 1) j,: Xi,: + (1 β(K 1) i,i ) σ(X i,:) X β(K 1) i,i |yi σ(Xi,:)| + α(K 1) i,i L M (K 1) i,j F(K 1) j,: Xi,: + (1 β(K 1) i,i )L ϵi,: = β(K 1) i,i |yi σ(Xi,:)| + α(K 1) i,i L M (K 1) i,j F(K 1) j,: F(K 1) i,: + F(K 1) i,: Xi,: + (1 β(K 1) i,i )L ϵi,: = β(K 1) i,i |yi σ(Xi,:)| + α(K 1) i,i L p F(K 1) i,: + M (K 1) i,j M (K 2) i,j F(K 2) j,: Xi,: + (1 β(K 1) i,i )L ϵi,: = β(K 1) i,i |yi σ(Xi,:)| + α(K 1) i,i L (K 1) p F(K 1) i,: + M (l) i,j d + (1 β(K 1) i,i )L ϵi,: = β(K 1) i,i |yi σ(Xi,:)| + α(K 1) i,i L (K 1) p F(K 1) i,: + M (l) i,j d + (1 β(K 1) i,i )L ϵi,: p-Laplacian Based Graph Neural Networks i=1 |yi yi| 1 i=1 β(K 1) i,i |yi σ(Xi,:)| i=1 α(K 1) i,i (K 1) p F(K 1) i,: + M (l) i,j d i=1 (1 β(K 1) i,i ) ϵi,: D.5. Proof of Theorem 6 Proof. Note that i=1 ϕp(ui)ui = i=1 ui p 2u2 i = i=1 |ui|p = u p p = 1, then we have p U = Φp(U)Λ = Φp(U)ΛΦ(U) U. Therefore, p = Φp(U)ΛΦp(U) . When p = 2, by Φ2(U) = U, we get 2 = Φ2(U)ΛΦ2(U) = UΛU . D.6. Proof of Theorem 7 Proof. By the definition of graph p-Laplacian, we have for all i = 1, 2, . . . , N, Wi,j Dj,j uj Wi,j Dj,j uj Then, for all i = 1, 2, . . . , N, λ = 1 ϕp(ui) Wi,j Dj,j uj Wi,j Dj,j uj = 1 ui p 2ui Wi,j Dj,j uj Wi,j Dj,j uj Wi,j Dj,j uj p 2 Di,i Wi,j p Wi,j Dj,j uj Di,i Wi,j p Wi,j Dj,j uj Di,i Wi,j p p-Laplacian Based Graph Neural Networks Let l = arg max{ ui }i=1,2,...,N, the above equation holds for all i = 1, 2, . . . , N, then Dl,l Wl,j p Dl,l Wl,j p Dl,l Wl,j p When p = 1, Di,i Wi,j p where the last inequality holds by using the Cauchy-Schwarz inequality. The above inequality holds for all i = 1, 2, . . . , N, therefore, When p > 1, we have for i = 1, 2, . . . , N, Di,i Wi,j p Without loss of generality, let k = arg max({|ui|/ p Di,i}i=1,2,...,N). Because the above inequality holds for all i = 1, 2, . . . , N, then we have p-Laplacian Based Graph Neural Networks For 1 < p < 2, Wk,j Dk,k 2p 1 Wk,j Dk,k = 2p 1p D.7. Proof of Proposition 1 Proof. We proof Proposition 1 based on the bounds of p-eigenvalues as demonstrated in Theorem 7. By Eq. (6) and Eq. (12), we have Di,i ( f)([j, i]) p 2f(i) Di,i Dj,j ( f)([j, i]) p 2f(j) Di,i Dj,j f(j). (27) By Eq. (13), we have M (k) i,j Di,i = 1 α(k) i,i 2µ Equations (27) and (28) show that (k) p = α(k) 1 2µ D 1/2M(k)D 1/2, (29) which indicates α(k)D 1/2M(k)D 1/2 = IN 2µ p α(k) α(k) (k) p . (30) For i = 1, 2, . . . , N, let α := ( α1, . . . , αN), αi := 1/ PN j=1 Mi,j Di,i , then Di,i Dj,j = (1 2µ p αi,i) αi,iλi PN j=1 Mi,j Di,i PN j=1 Mi,j 1 1 PN j=1 Mi,j = 1 1 + 2µ αi p (1 αiλi) , (31) Recall the Eq. (12) that Mi,j = Wi,j Wi,j Dj,j Fj,: = Wi,j ( f)([i, j]) p 2, 1. When p = 2, for all i = 1, . . . , N, αi = 1 and 0 λi 1 2, g2(λi 1) works as low-high-pass filters. 2. When p > 2, by Theorem 7 we have for all i = 1, . . . , N, 0 λi 1 2p 1. If 0 αi 21 p, then 0 1 αiλi 1, which indicates that gp(λi 1) works as a low-pass filter; If αi > 21 p, then gp(λi 1) works as low-high-pass p-Laplacian Based Graph Neural Networks filters. Since Wi,j ( f)([i, j]) p 2 j=1 ( f)([i, j]) 2(p 2) j=1 ( f)([i, j]) 2(p 2) which indicates that αi f(i) 2 p. 0 αi 21 p directly implies that 0 f(i) 2 p 21 p, i.e. f(i) 2(p 1)/(p 2) and when f(i) 2 p 21 p, i.e. f(i) 2(p 1)/(p 2), αi 21 p always holds. 
Therefore, if f(i) 2(p 1)/(p 2), gp(λi 1) works as low-high-pass filters on node i; If gp(λi 1) works as a low-pass filter, f(i) 2(p 1)/(p 2). 3. When 1 p < 2, by Theorem 7 we have for all i = 1, . . . , N, 0 λi 1 2p 1 Nk. If 0 αi 21 p/ Nk, 0 1 αiλi 1, which indicates that gp(λi 1) work as low-pass filters; If αi 21 p/ Nk, gp(λi 1) work as low-high-pass filters. By ( f)([i, j]) p 2 , 1 PN j=1 Wi,j Di,i ( f)([i, j]) p 2 ( f)([i, j]) p 2 Di,i ( f)([i, j]) p 2 ( f)([i, j]) p 2 ( f)([i, j]) p 2 ! αi = 1 PN i=1 Mi,j Di,i = 1 PN j=1 Wi,j Di,i ( f)([i, j]) p 2 ( f)([i, j]) p 2 Di,i ( f)([i, j]) 2 p αi 21 p/ Nk directly implies that f(i) 2 p 21 p/ Nk, i.e. f(i) 2(2 Nk)1/(p 2) and when 0 f(i) 2 p 21 p/ Nk, i.e. 0 f(i) 2(2 Nk)1/(p 2), 0 αi 21 p/ Nk always holds. Therefore, if 0 f(i) 2(2 Nk)1/(p 2), gp(λi 1) work as low-pass filters; If gp(λi 1) work as low-high-pass filters, f(i) 2 2 Nk 1/(p 2). Specifically, when p = 1, by Theorem 7 we have for all i = 1, . . . , N, 0 λi 1 2p 1 Nmin. Following the same derivation above we attain if 0 f(i) 2(2 Nmin)1/(p 2), gp(λi 1) work as low-pass filters; If gp(λi 1) work as both low-high-pass filters, f(i) 2 2 Nmin 1/(p 2).

E. Dataset Statistics and Hyperparameters

E.1. Illustration of Graph Gradient and Graph Divergence

Figure 4: A tiny example illustrating (a) the graph gradient and (b) the graph divergence. Best viewed in color.

E.2. Dataset Statistics

Table 2 summarizes the dataset statistics and the homophily levels H(G) of all benchmark datasets. Note that the homophily scores reported here differ from those reported by Chien et al. (2021): there is a bug in their code when computing the homophily scores (division with torch integers), which caused their homophily scores to be smaller. Additionally, we directly used the data from the PyTorch Geometric library (Fey & Lenssen, 2019), which does not transform Chameleon and Squirrel into undirected graphs, unlike Chien et al. (2021), who did so.

Table 2: Statistics of datasets.

Dataset     #Class   #Feature   #Node   #Edge    Training   Validation   Testing   H(G)
Cora        7        1433       2708    5278     2.5%       2.5%         95%       0.825
CiteSeer    6        3703       3327    4552     2.5%       2.5%         95%       0.717
PubMed      3        500        19717   44324    2.5%       2.5%         95%       0.792
Computers   10       767        13381   245778   2.5%       2.5%         95%       0.802
Photo       8        745        7487    119043   2.5%       2.5%         95%       0.849
CS          15       6805       18333   81894    2.5%       2.5%         95%       0.832
Physics     5        8415       34493   247962   2.5%       2.5%         95%       0.915
Chameleon   5        2325       2277    31371    60%        20%          20%       0.247
Squirrel    5        2089       5201    198353   60%        20%          20%       0.216
Actor       5        932        7600    26659    60%        20%          20%       0.221
Wisconsin   5        251        499     1703     60%        20%          20%       0.150
Texas       5        1703       183     279      60%        20%          20%       0.097
Cornell     5        1703       183     277      60%        20%          20%       0.386
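The homophily level H(G) quoted above can be computed directly from the edge index and node labels; casting to float before dividing avoids the integer-division bug mentioned above. The snippet below is a minimal sketch that assumes H(G) denotes edge homophily (the fraction of edges whose endpoints share a label); the paper's exact definition may differ, so treat it as illustrative.

```python
import torch

def edge_homophily(edge_index: torch.Tensor, y: torch.Tensor) -> float:
    """Fraction of edges whose two endpoints have the same label.

    edge_index: LongTensor of shape [2, num_edges]; y: LongTensor of node labels.
    Casting to float before dividing avoids the integer-division pitfall noted above.
    """
    src, dst = edge_index[0], edge_index[1]
    same_label = (y[src] == y[dst]).sum()
    return (same_label.float() / edge_index.size(1)).item()

# Toy example: a 4-node path graph 0-1-2-3 with labels [0, 0, 1, 1].
edge_index = torch.tensor([[0, 1, 2, 1, 2, 3],
                           [1, 2, 3, 0, 1, 2]])
y = torch.tensor([0, 0, 1, 1])
print(edge_homophily(edge_index, y))  # 4 of the 6 directed edges match -> ~0.667
```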
E.3. Hyperparameter Settings

We set the number of layers to 2, the maximum number of epochs to 1000, the early stopping patience to 200 epochs, and the weight decay to 0 or 0.0005 for all models. The other hyperparameters for each model are listed below:

1.0GNN, 1.5GNN, 2.0GNN, 2.5GNN:
- Number of hidden units: 16
- Learning rate: {0.001, 0.01, 0.05}
- Dropout rate: {0, 0.5}
- µ: {0.01, 0.1, 0.2, 1, 10}
- K: {4, 6, 8}

- Number of hidden units: 16
- Learning rate: {0.001, 0.01}
- Dropout rate: {0, 0.5}

- Number of hidden units: 16
- Learning rate: {0.001, 0.01}
- Dropout rate: {0, 0.5}

- Number of hidden units: 16
- Learning rate: {0.2, 0.01}
- Dropout rate: {0, 0.5}
- K: 2

- Number of hidden units: 8
- Number of attention heads: 8
- Learning rate: {0.001, 0.005}
- Dropout rate: {0, 0.6}

- Number of hidden units: 16
- Learning rate: {0.001, 0.01}
- Dropout rate: {0, 0.5}
- K: 10
- α: {0.1, 0.5, 0.7, 1}

- The number of GCN-based layers: 2
- The layer aggregation: LSTM with 16 channels and 4 layers
- Number of hidden units: 16
- Learning rate: {0.001, 0.01}
- Dropout rate: {0, 0.5}
- K: 10
- α: {0.1, 0.5, 0.7, 1}

- Number of hidden units: 16
- Learning rate: {0.001, 0.01, 0.05}
- Dropout rate: {0, 0.5}
- K: 10
- α: {0, 0.1, 0.2, 0.5, 0.7, 0.9, 1}
- dprate: {0, 0.5, 0.7}

F. Additional Experiments

F.1. Experimental Results on Homophilic Benchmark Datasets

Competitive Performance on Real-World Homophilic Datasets. Table 3 summarizes the average accuracy (micro-F1 score) and standard deviation of semi-supervised node classification on the homophilic benchmark datasets. Table 3 shows that the performance of pGNNs is very close to that of APPNP, JKNet, and GCN on the Cora, CiteSeer, and PubMed datasets, and slightly outperforms all baselines on the Computers, Photo, CS, and Physics datasets. Moreover, we observe that pGNNs outperform GPRGNN on all homophilic datasets, which confirms that pGNNs work better under weak supervised information (2.5% training rate), as discussed in Remark 3. We also see that all GNN models work significantly better than MLP on all homophilic datasets, which illustrates that the graph topological information is helpful for these label prediction tasks. Notably, 1.0GNN is slightly worse than the other pGNNs with larger p, which suggests using p ≥ 2 for homophilic graphs. Overall, the results in Table 3 indicate that pGNNs obtain competitive performance against all baselines on homophilic datasets.

Table 3: Results on homophilic benchmark datasets. Averaged accuracy (%) for 100 runs. Best results are outlined in bold and the results within the 95% confidence interval of the best results are outlined in underlined bold. OOM denotes out of memory.

Method   Cora   CiteSeer   PubMed   Computers   Photo   CS   Physics
MLP      43.47±3.82  46.95±2.15  78.95±0.49  66.11±2.70  76.44±2.83  86.24±1.43  92.58±0.83
GCN      76.23±0.79  62.43±0.81  83.72±0.27  84.17±0.59  90.46±0.48  90.33±0.36  94.46±0.08
SGC      77.19±1.47  64.10±1.36  79.26±0.69  84.32±0.59  89.81±0.57  91.06±0.05  OOM
GAT      75.62±1.01  61.28±1.09  83.60±0.22  82.72±1.29  90.48±0.57  89.96±0.27  93.96±0.21
JKNet    77.19±0.98  63.32±0.95  82.54±0.43  79.94±2.47  88.29±1.64  89.69±0.66  93.92±0.32
APPNP    79.58±0.59  63.02±1.10  84.80±0.22  83.32±1.11  90.42±0.53  91.54±0.24  94.93±0.06
GPRGNN   76.10±1.30  61.60±1.69  83.16±0.84  82.78±1.87  89.81±0.66  90.59±0.38  94.72±0.16
1.0GNN   77.59±0.69  63.19±0.98  83.21±0.30  84.46±0.89  90.69±0.66  91.46±0.50  94.72±0.37
1.5GNN   78.86±0.75  63.80±0.79  83.65±0.17  85.03±0.90  90.91±0.50  92.12±0.40  94.90±0.16
2.0GNN   78.93±0.60  63.65±1.08  84.19±0.22  84.39±0.85  90.40±0.63  92.28±0.47  94.93±0.14
2.5GNN   78.87±0.57  63.28±0.97  84.45±0.18  83.85±0.87  89.82±0.64  91.94±0.40  94.87±0.11
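The table captions above report mean accuracy over repeated runs and highlight results that fall within the 95% confidence interval of the best result. A minimal sketch of that bookkeeping is shown below; the exact highlighting rule is not spelled out in the text, so the normal-approximation interval used here is an assumption.

```python
import numpy as np

def mean_and_ci95(accs):
    """Mean accuracy and half-width of a normal-approximation 95% CI over runs
    (an assumed convention, not necessarily the paper's exact rule)."""
    accs = np.asarray(accs, dtype=float)
    return accs.mean(), 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))

def within_ci_of_best(results):
    """Mark every method whose mean lies inside the best method's 95% CI."""
    stats = {m: mean_and_ci95(a) for m, a in results.items()}
    best_mean, best_hw = max(stats.values(), key=lambda s: s[0])
    return {m: (mean, mean >= best_mean - best_hw) for m, (mean, _) in stats.items()}

# Hypothetical accuracies (%) over a few runs, purely to exercise the helper.
results = {"2.0GNN": [84.1, 84.3, 84.0], "APPNP": [84.7, 84.9, 84.8], "GCN": [83.6, 83.8, 83.7]}
print(within_ci_of_best(results))
```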
F.2. Experimental Results of Aggregation Weight Entropy Distribution

Here we present the visualization results of the learned aggregation weight entropy distributions of pGNNs and GAT on all benchmark datasets. Fig. 5 and Fig. 6 show the results obtained on the homophilic and heterophilic benchmark datasets, respectively. We observe from Fig. 5 that the aggregation weight entropy distributions learned by pGNNs and GAT on homophilic benchmark datasets are similar to the uniform case, which indicates that aggregating and transforming node features over the original graph topology is very helpful for label prediction. This explains why pGNNs and the GNN baselines obtain similar performance on homophilic benchmark datasets and why all GNN models significantly outperform MLP. In contrast to the results on homophilic graphs shown in Fig. 5, Fig. 6 shows that the aggregation weight entropy distributions of pGNNs on heterophilic benchmark datasets are very different from those of GAT and from the uniform case. We observe from Fig. 6 that the entropy of most of the aggregation weights learned by pGNNs is around zero, which means that most aggregation weights are concentrated on one source node. This indicates that the graph topological information in these heterophilic benchmark graphs is not helpful for label prediction. Therefore, propagating and transforming node features over the graph topology could lead to worse performance than MLPs, which is consistent with the heterophilic benchmark results, where MLP performs significantly better than most GNN baselines and close to pGNNs.
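For reference, the per-node entropy statistic visualized in Figures 5 and 6 can be computed as follows. This is a minimal sketch assuming each node's learned aggregation weights over its source nodes are normalized into a distribution before taking the Shannon entropy; an entropy of zero then corresponds to all weight sitting on a single source node, matching the figure captions.

```python
import numpy as np

def aggregation_weight_entropy(weights_per_node):
    """Shannon entropy (in nats) of each node's normalized aggregation weights.

    weights_per_node: list of 1-D arrays, one per target node, holding the
    non-negative weights assigned to its source nodes. Zero entropy means all
    of the weight is concentrated on a single source node.
    """
    entropies = []
    for w in weights_per_node:
        w = np.asarray(w, dtype=float)
        p = w / w.sum()                 # normalize into a distribution
        p = p[p > 0]                    # treat 0 * log(0) as 0
        entropies.append(float(-(p * np.log(p)).sum()))
    return np.array(entropies)

# Toy example: a fully concentrated node vs. uniform weights over 4 neighbors.
print(aggregation_weight_entropy([[1.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]))
# -> [0.0, ~1.386]
```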
F.3. Experimental Results on cSBM

In this section we present the experimental results on cSBM using the sparse splitting and the dense splitting settings, respectively. We used the same settings as in Chien et al. (2021), in which the number of nodes n = 5000, the number of features f = 2000, and ε = 3.25 for all experiments. Table 4 reports the results on cSBM with the sparse splitting setting, which are also presented in Fig. 2 and discussed in Sec. 5. Table 5 reports the results on cSBM with the dense splitting setting.

Figure 5: Aggregation weight entropy distributions of the homophilic benchmark graphs (one row per dataset; panels show 1.0GNN, 1.5GNN, 2.5GNN, and GAT, annotated with test accuracy). Low entropy means a high degree of concentration and vice versa. An entropy of zero means all aggregation weights are on one source node.

Figure 6: Aggregation weight entropy distributions of the heterophilic benchmark graphs (one row per dataset; panels show 1.0GNN, 1.5GNN, 2.5GNN, and GAT, annotated with test accuracy). Low entropy means a high degree of concentration and vice versa. An entropy of zero means all aggregation weights are on one source node.

Table 4: Results on cSBM with the sparse splitting setting. Average accuracy (%) for 20 runs. Best results are outlined in bold and the results within the 95% confidence interval of the best results are outlined in underlined bold.

Method   φ = -1   φ = -0.75   φ = -0.5   φ = -0.25   φ = 0   φ = 0.25   φ = 0.5   φ = 0.75   φ = 1
MLP      49.72±0.36  51.42±1.83  59.21±1.01  61.57±0.38  61.70±0.30  59.92±1.88  57.20±0.62  54.48±0.48  50.09±0.51
GCN      57.24±1.15  58.19±1.46  57.30±1.30  51.97±0.44  54.45±1.38  64.70±2.38  82.45±1.35  91.31±0.54  76.07±3.30
SGC      55.98±1.48  58.56±1.40  56.97±0.54  51.54±0.22  52.69±2.36  64.14±1.05  79.88±1.57  90.37±0.09  75.94±0.92
GAT      59.72±2.23  60.20±2.14  55.38±1.96  50.15±0.55  53.05±1.40  64.00±2.03  81.04±1.71  90.37±1.33  78.24±1.95
JKNet    49.70±0.39  49.75±0.79  49.65±0.52  48.93±0.48  52.36±2.09  62.76±2.54  87.10±1.52  97.43±0.36  97.69±0.52
APPNP    48.45±0.98  49.65±0.46  53.31±0.89  56.58±0.58  60.10±0.65  65.02±2.23  82.95±1.38  95.49±0.43  89.85±0.60
GPRGNN   97.26±0.66  94.81±0.91  82.14±0.47  61.15±2.55  60.20±0.76  62.90±2.22  83.61±1.28  96.96±0.41  98.01±0.71
1.0GNN   95.75±1.21  93.06±1.13  77.39±4.21  61.38±0.39  61.80±0.29  65.73±2.11  85.85±3.24  96.80±0.87  97.40±1.10
1.5GNN   95.90±3.01  94.10±4.57  73.08±2.59  61.44±0.30  61.77±0.35  66.01±1.88  90.57±0.71  97.38±0.43  97.76±0.86
2.0GNN   98.37±0.78  96.32±1.50  84.93±0.39  61.13±0.51  61.79±0.34  63.55±1.73  88.55±1.05  97.56±0.16  97.94±0.39
2.5GNN   97.74±0.99  96.78±0.44  83.21±2.12  61.30±0.41  61.74±0.34  62.88±2.31  79.64±2.15  95.71±0.34  97.25±0.58

Table 5: Results on cSBM with the dense splitting setting. Average accuracy (%) for 20 runs. Best results are outlined in bold and the results within the 95% confidence interval of the best results are outlined in underlined bold.

Method   φ = -1   φ = -0.75   φ = -0.5   φ = -0.25   φ = 0   φ = 0.25   φ = 0.5   φ = 0.75   φ = 1
MLP      50.37±0.60  65.22±0.92  75.82±0.65  81.18±0.55  79.86±0.69  79.97±0.57  75.03±0.89  67.53±0.68  51.96±0.69
GCN      83.14±0.49  82.59±0.48  77.17±0.59  58.58±0.41  61.18±1.06  82.59±0.50  92.20±0.27  97.21±0.27  97.10±0.12
SGC      78.35±0.36  82.13±0.09  77.76±0.12  59.14±0.57  60.31±0.63  81.96±0.34  91.68±0.13  96.56±0.09  96.87±0.05
GAT      92.99±0.86  90.89±0.60  87.02±0.80  68.40±1.60  61.98±1.16  82.92±0.51  92.05±0.73  97.28±0.25  98.04±0.46
JKNet    68.95±9.05  79.21±7.67  67.97±5.22  56.12±4.10  58.33±1.70  80.15±0.80  91.21±0.50  97.62±0.25  98.32±0.21
APPNP    49.86±0.39  50.47±0.89  65.28±0.68  73.98±0.64  79.37±0.66  86.60±0.73  92.45±0.39  97.67±0.14  97.65±0.49
GPRGNN   99.06±0.25  97.14±0.31  94.59±0.32  83.84±0.69  78.81±1.30  85.85±1.01  92.08±0.81  97.49±0.22  98.46±0.15
1.0GNN   98.19±0.28  94.38±0.44  86.40±1.00  80.57±0.43  80.21±0.42  87.32±0.47  92.42±0.62  97.52±0.33  98.37±0.26
1.5GNN   98.88±0.16  95.62±0.21  86.87±1.22  80.70±0.71  80.28±0.31  86.29±0.43  92.40±0.24  97.56±0.25  98.24±0.32
2.0GNN   99.21±0.09  96.91±0.16  92.96±0.31  80.83±0.61  80.04±0.49  84.96±0.60  91.18±0.27  97.41±0.14  98.45±0.14
2.5GNN   99.21±0.14  96.94±0.16  93.28±0.37  80.93±0.44  80.28±0.38  83.83±0.70  86.10±0.39  96.28±0.43  97.76±0.18

Table 5 shows that pGNNs obtain the best performance on weakly homophilic graphs (φ = 0, 0.25) while achieving competitive performance against GPRGNN on strongly heterophilic graphs (φ = -0.75, -1) and competitive performance with state-of-the-art GNNs on strongly homophilic graphs (φ = 0.75, 1). We also observe that GPRGNN is slightly better than pGNNs on weakly heterophilic graphs (φ = -0.25, -0.5), which suggests that GPRGNN can work very well when strong supervised information is available (60% training rate and 20% validation rate). However, as shown in Table 4, pGNNs work better than GPRGNN under weak supervised information (2.5% training rate and 2.5% validation rate) on all heterophilic graphs. This result is reasonable: as discussed in Remark 3 in Sec. 3.2, GPRGNN can adaptively learn the generalized PageRank (GPR) weights and works similarly to 2.0GNN on both homophilic and heterophilic graphs.
However, it needs more supervised information in order to learn the optimal GPR weights. In contrast, pGNNs need less supervised information to obtain similar results because Θ(2) acts like a hyperplane for classification. Therefore, pGNNs can work better under weak supervised information.

F.4. Experimental Results on Graphs with Noisy Edges

Here we present more experimental results on graphs with noisy edges. Table 6 reports the results on homophilic graphs (Computers, Photo, CS, Physics) and Table 7 reports the results on heterophilic graphs (Wisconsin, Texas). We observe from Tables 6 and 7 that pGNNs dominate all baselines. Moreover, pGNNs are even slightly better than MLP when the graph topologies are completely random, i.e., the noisy edge rate r = 1. We also observe that the performance of GCN, SGC, and JKNet on homophilic graphs dramatically degrades as the noisy edge rate r increases, while it does not change much on the heterophilic graphs. This is reasonable: the original graph topological information is very helpful for label prediction on these homophilic graphs, so adding noisy edges and removing the same number of original edges significantly degrades the performance of ordinary GNNs. On the other hand, the original graph topological information in Wisconsin and Texas is not helpful for label prediction; therefore, adding noisy edges and removing original edges on these heterophilic graphs does not affect their performance much.

F.5. Experimental Results of Integrating pGNNs with GCN and JKNet

Here we further conduct experiments to study whether pGNNs can be integrated into existing GNN architectures and improve their performance on heterophilic graphs. We use two popular GNN architectures: GCN (Kipf & Welling, 2017) and JKNet (Xu et al., 2018). To incorporate pGNNs with GCN, we use the pGNN layer as the first layer of the combined model, termed pGNN + GCN, and a GCN layer as the second layer. Specifically, we use the aggregation weights αD^{-1/2}MD^{-1/2} learned by the pGNN in the first layer as the input edge weights of the GCN layer in the second layer. To combine pGNN with JKNet, we use pGNN layers as the GNN layers in the JKNet framework, termed pGNN + JKNet. Tables 8 and 9 report the experimental results on the homophilic and heterophilic benchmark datasets, respectively. We observe from Table 8 that integrating pGNNs with GCN and JKNet does not improve their performance on homophilic graphs: the performance of GCN slightly degrades after incorporating pGNNs, and the performance of JKNet also slightly degrades on Cora, CiteSeer, and PubMed but improves on Computers, Photo, CS, and Physics. This is reasonable since GCN and JKNet can already predict well on these homophilic benchmark datasets based on their original graph topology. However, for the heterophilic benchmark datasets, Table 9 shows significant improvements over GCN and JKNet after integrating with pGNNs. Moreover, pGNN + JKNet obtains strong performance on all heterophilic benchmark datasets and is even better than pGNNs on Squirrel. The results of Table 9 demonstrate that integrating pGNNs with GCN and JKNet can significantly improve their performance on heterophilic graphs.

Table 6: Results on homophilic graphs with random edges. Average accuracy (%) for 20 runs. Best results are outlined in bold and the results within the 95% confidence interval of the best results are outlined in underlined bold. OOM denotes out of memory.
Method Computers Photo r = 0.25 r = 0.5 r = 1 r = 0.25 r = 0.5 r = 1 MLP 66.11 2.70 66.11 2.70 66.11 2.70 76.44 2.83 76.44 2.83 76.44 2.83 GCN 74.70 1.72 62.16 2.76 8.95 6.90 81.43 0.76 75.52 3.59 12.78 5.20 SGC 75.15 1.08 66.96 1.05 15.79 7.47 82.22 0.36 77.80 0.49 13.57 3.63 GAT 76.44 1.81 68.34 2.61 11.58 7.70 82.70 1.31 77.20 2.10 13.74 5.14 JKNet 56.74 6.48 46.11 8.43 12.50 6.56 73.46 6.74 64.18 4.06 15.66 6.10 APPNP 78.23 1.84 74.57 2.25 66.67 2.68 87.63 1.05 86.22 1.73 75.55 1.72 GPRGNN 77.30 2.24 77.11 1.80 66.85 1.65 85.95 1.05 85.64 1.22 77.46 1.44 1.0GNN 75.14 14.95 63.26 20.67 41.60 16.17 87.97 0.70 84.47 3.05 41.17 18.15 1.5GNN 81.79 1.33 78.12 2.08 66.04 2.73 88.09 1.18 86.20 1.61 68.78 8.97 2.0GNN 80.34 1.07 76.90 1.93 67.17 1.63 87.65 0.94 87.06 1.50 77.07 1.83 2.5GNN 79.14 1.51 75.49 1.25 64.95 2.27 87.38 0.85 86.11 1.10 76.65 1.46 Method CS Physics r = 0.25 r = 0.5 r = 1 r = 0.25 r = 0.5 r = 1 MLP 86.24 1.43 86.24 1.43 86.24 1.43 92.58 0.83 92.58 0.83 92.58 0.83 GCN 81.05 0.59 68.37 0.85 7.72 2.39 89.02 0.16 80.45 0.34 19.78 3.94 SGC 83.41 0.01 71.98 0.12 8.00 1.43 OOM OOM OOM GAT 80.11 0.67 68.66 1.42 8.49 2.39 88.72 0.61 82.05 1.86 22.39 5.04 JKNet 81.35 0.74 71.30 2.14 11.43 1.18 87.98 0.97 81.90 2.27 26.38 5.80 APPNP 88.63 0.68 87.56 0.51 76.90 0.96 93.46 0.12 92.81 0.24 90.49 0.33 GPRGNN 85.77 0.81 83.89 1.54 72.79 2.24 92.18 0.29 90.96 0.48 91.77 0.41 1.0GNN 90.27 0.86 89.56 0.81 86.60 1.22 94.35 0.39 94.23 0.27 92.97 0.36 1.5GNN 91.27 0.40 90.50 0.71 84.40 1.84 94.34 0.21 93.77 0.29 92.51 0.35 2.0GNN 90.97 0.49 89.98 0.50 80.84 1.48 94.14 0.18 93.30 0.31 91.72 0.44 2.5GNN 89.90 0.45 89.00 0.59 76.82 2.11 93.61 0.30 92.77 0.26 91.16 0.47 Table 7: Results on heterophilic graphs with random edges. Average accuracy (%) for 20 runs. Best results are outlined in bold and the results within 95% confidence interval of the best results are outlined in underlined bold. Method Wisconsin Texas r = 0.25 r = 0.5 r = 1 r = 0.25 r = 0.5 r = 1 MLP 93.56 3.14 93.56 3.14 93.56 3.14 79.50 10.62 79.50 10.62 79.50 10.62 GCN 62.31 8.12 59.44 5.76 64.21 4.49 41.56 8.89 44.69 23.05 40.31 18.26 SGC 64.68 7.34 62.36 2.64 51.81 2.63 42.50 5.49 40.94 18.34 23.81 14.54 GAT 65.37 9.04 60.05 9.12 60.05 7.46 39.50 9.29 34.88 21.59 29.38 11.53 JKNet 64.91 13.07 51.39 10.36 57.41 2.57 47.75 7.30 46.62 23.23 40.69 13.57 APPNP 70.19 9.04 60.32 4.70 72.64 4.73 66.69 13.46 63.25 9.87 69.81 7.76 GPRGNN 90.97 3.83 87.50 3.86 87.55 2.97 74.25 7.25 76.75 14.05 80.69 5.87 1.0GNN 94.91 2.73 95.97 2.00 95.97 2.27 81.50 9.24 82.12 11.09 81.81 5.67 1.5GNN 94.58 1.25 95.19 2.18 94.95 2.79 82.50 6.39 78.12 5.30 78.50 7.98 2.0GNN 90.46 2.79 90.97 4.22 91.44 2.27 86.06 5.17 69.38 11.47 63.50 8.90 2.5GNN 82.45 3.93 88.24 2.79 84.40 1.98 80.00 10.83 56.62 10.01 52.31 10.58 p-Laplacian Based Graph Neural Networks Table 8: The results of p GNNs + GCN and p GNNs + JKNet on homophilic benchmark dataset. Averaged accuracy (%) for 20 runs. Best results are outlined in bold and the results within 95% confidence interval of the best results are outlined in underlined bold. 
Method          Cora   CiteSeer   PubMed   Computers   Photo   CS   Physics
GCN             76.23±0.79  62.43±0.81  83.72±0.27  84.17±0.59  90.46±0.48  90.33±0.36  94.46±0.08
1.0GNN + GCN    72.37±1.35  60.56±1.59  82.14±0.31  83.75±1.05  90.24±1.12  89.60±0.46  94.59±0.33
1.5GNN + GCN    72.72±1.39  60.23±1.80  82.21±0.22  83.89±0.74  90.00±0.68  89.48±0.45  94.70±0.18
2.0GNN + GCN    72.39±1.55  60.19±1.60  82.24±0.23  83.92±1.09  90.17±0.83  89.60±0.71  94.51±0.39
2.5GNN + GCN    72.85±1.19  59.68±1.85  82.23±0.34  83.69±0.92  90.02±1.09  89.53±0.68  94.58±0.31
JKNet           77.19±0.98  63.32±0.95  82.54±0.43  79.94±2.47  88.29±1.64  89.69±0.66  93.92±0.32
1.0GNN + JKNet  75.67±1.54  60.38±1.65  81.68±0.44  83.19±1.36  89.71±1.05  90.26±0.72  94.27±0.69
1.5GNN + JKNet  76.40±1.59  60.67±1.93  82.42±0.35  82.78±2.09  90.25±1.03  90.76±0.75  94.82±0.34
2.0GNN + JKNet  76.75±1.26  61.05±1.48  82.50±0.53  82.36±2.39  89.31±1.39  90.33±0.63  94.70±0.33
2.5GNN + JKNet  76.48±1.28  60.97±0.97  82.56±1.04  81.45±1.55  89.21±1.10  89.66±0.68  94.29±0.59

Table 9: The results of pGNNs + GCN and pGNNs + JKNet on heterophilic benchmark datasets. Averaged accuracy (%) for 20 runs. Best results are outlined in bold and the results within the 95% confidence interval of the best results are outlined in underlined bold.

Method          Chameleon   Squirrel   Actor   Wisconsin   Texas   Cornell
GCN             34.54±2.78  25.28±1.55  31.28±2.04  61.93±3.00  56.54±17.02  51.36±4.59
1.0GNN + GCN    48.52±1.89  34.78±1.11  32.37±3.12  68.52±3.75  67.94±12.60  67.81±7.61
1.5GNN + GCN    48.85±2.13  34.61±1.11  32.37±2.48  66.25±3.95  65.62±11.99  64.88±9.19
2.0GNN + GCN    48.71±2.24  35.06±1.18  32.72±2.02  66.34±4.51  65.94±7.63   68.62±6.55
2.5GNN + GCN    49.53±2.19  34.40±1.60  32.40±3.23  67.18±3.50  68.31±9.18   66.06±9.56
JKNet           33.28±3.59  25.82±1.58  29.77±2.61  61.08±3.71  59.65±12.62  55.34±4.43
1.0GNN + JKNet  49.00±2.09  35.56±1.34  40.74±0.98  95.23±2.43  80.25±6.87   78.38±8.14
1.5GNN + JKNet  48.77±2.22  35.98±0.93  40.22±1.27  94.86±2.00  80.38±9.79   72.25±9.83
2.0GNN + JKNet  48.88±1.63  35.77±1.73  40.16±1.31  88.84±2.78  86.12±5.59   74.75±7.81
2.5GNN + JKNet  49.04±1.95  35.78±1.87  40.00±1.12  85.42±3.86  79.06±7.60   76.81±7.66

F.6. Experimental Results of pGNNs on PPI Dataset for Inductive Learning

Additionally, we conduct comparison experiments of pGNNs against GAT on the PPI dataset (Zitnik & Leskovec, 2017) using the inductive learning setting of Velickovic et al. (2018) (20 graphs for training, 2 graphs for validation, and 2 graphs for testing). We use a three-layer GAT architecture with 256 hidden units, with 1 attention head for GAT (1 head) and 4 attention heads for GAT (4 heads). For pGNNs we use three pGNN layers with an MLP layer as the first layer, set µ = 0.01 and K = 1, and use 256 hidden units for pGNN-256 and 512 hidden units for pGNN-512. The experimental results are reported in Table 10.

Table 10: Results on PPI datasets. Averaged micro-F1 scores for 10 runs. Best results are outlined in bold.

GAT (1 head)   0.917±0.041
GAT (4 heads)  0.972±0.002
1.0GNN-256     0.961±0.003
1.5GNN-256     0.967±0.008
2.0GNN-256     0.968±0.006
2.5GNN-256     0.973±0.002
1.0GNN-512     0.978±0.005
1.5GNN-512     0.977±0.008
2.0GNN-512     0.981±0.006
2.5GNN-512     0.978±0.005

From Table 10 we observe that 2.5GNN performs slightly better on PPI than GAT with 4 attention heads, and the other pGNNs are very close to it. Moreover, all pGNNs significantly outperform GAT with one attention head. The results of pGNNs on PPI are impressive: pGNNs have far fewer parameters than GAT with 4 attention heads while obtaining very competitive performance on PPI.
When we use more hidden units (512 hidden units), pGNN-512 significantly outperforms GAT while still having fewer parameters. This illustrates the potential of applying pGNNs to inductive learning on graphs.

F.7. Experimental Results of pGNNs on OGBN-arXiv Dataset

Here we present the experimental results of pGNNs on the OGBN-arXiv dataset (Hu et al., 2020). We use the official data split of OGBN-arXiv and a three-layer pGNN architecture with 256 hidden units, µ = 0.5, and K = 2. We also combine pGNNs with the correct-and-smooth model (C&S) (Huang et al., 2021) and introduce residual units. The results of MLP, GCN, JKNet, DeepGCN (Li et al., 2019), GCN with residual units, and the C&S model are taken from the OGBN-arXiv leaderboard.5 Table 11 summarizes the results of pGNNs against the baselines.

5 https://ogb.stanford.edu/docs/leader_nodeprop/#ogbn-arxiv

Table 11: Results on the OGBN-arXiv dataset. Average accuracy (%) for 10 runs. Best results are outlined in bold.

Method                             OGBN-arXiv
MLP                                55.50±0.23
GCN                                71.74±0.29
JKNet (GCN-based)                  72.19±0.21
DeepGCN                            71.92±0.16
GCN + residual (6 layers)          72.86±0.16
GCN + residual (8 layers) + C&S    72.97±0.22
GCN + residual (8 layers) + C&S v2 73.13±0.17
1GNN                               72.40±0.19
2GNN                               72.45±0.20
3GNN                               72.58±0.23
1GNN + residual (6 layers) + C&S   72.96±0.22
2GNN + residual (6 layers) + C&S   73.13±0.20
3GNN + residual (6 layers) + C&S   73.23±0.16

We observe from Table 11 that pGNNs outperform MLP, GCN, JKNet, and DeepGCN. The performance of pGNNs can be further improved by combining them with the C&S model and residual units, and 3GNN + residual (6 layers) + C&S obtains the best performance against the baselines.

F.8. Running Time of pGNNs

Tables 12 and 13 report the averaged running time of pGNNs and the baselines on the homophilic and heterophilic benchmark datasets, respectively.

Table 12: Efficiency on homophilic benchmark datasets. Averaged running time per epoch (ms) / averaged total running time (s). OOM denotes out of memory.

Method   Cora   CiteSeer   PubMed   Computers   Photo   CS   Physics
MLP      7.7 ms / 5.27s    8.1 ms / 5.37s    7.8 ms / 5.52s    8.8 ms / 5.45s    8.4 ms / 5.34s    10.5 ms / 8.18s   14.6 ms / 12.78s
GCN      82.2 ms / 6.1s    84.2 ms / 6.1s    85 ms / 6.13s     85.2 ms / 7.07s   83.6 ms / 6.08s   85 ms / 9.68s     90 ms / 13.8s
SGC      89.5 ms / 4.96s   74.7 ms / 4.86s   80.6 ms / 5.28s   109 ms / 5.21s    85.9 ms / 4.96s   213.6 ms / 8.01s  OOM
GAT      534.8 ms / 13.06s 313.6 ms / 13.36s 314.6 ms / 13.97s 441.3 ms / 24.62s 309.8 ms / 15.96s 454 ms / 21.87s   436.9 ms / 40.9s
JKNet    95.4 ms / 20.07s  101.1 ms / 19.58s 105.4 ms / 20.8s  106.1 ms / 29.72s 97.9 ms / 21.18s  102.7 ms / 24.94s 119.2 ms / 40.83s
APPNP    86.7 ms / 11.6s   86.3 ms / 11.98s  85.5 ms / 11.97s  92.1 ms / 15.75s  86 ms / 12.19s    90.5 ms / 17.36s  99.6 ms / 25.89s
GPRGNN   86.5 ms / 12.42s  195.8 ms / 12.6s  88.6 ms / 12.59s  93.3 ms / 15.98s  86.7 ms / 12.65s  92 ms / 17.8s     217.1 ms / 26.33s
1.0GNN   96 ms / 20.12s    98.1 ms / 19.81s  100.2 ms / 21.74s 151.4 ms / 64.08s 121.3 ms / 34.07s 109.7 ms / 25.03s 122.9 ms / 49.59s
1.5GNN   98.2 ms / 20.19s  97 ms / 20.26s    100.2 ms / 22.6s  140.3 ms / 64.08s 120 ms / 34.22s   112.3 ms / 25.11s 127.9 ms / 49.54s
2.0GNN   98.1 ms / 20.11s  96.3 ms / 19.97s  99.3 ms / 22.17s  141 ms / 64.04s   129.3 ms / 34.14s 104.7 ms / 24.93s 124.6 ms / 49.35s
2.5GNN   96.6 ms / 20.12s  92.9 ms / 20.16s  103 ms / 22.17s   141.6 ms / 64.01s 128.1 ms / 34.22s 110.8 ms / 25.07s 124 ms / 49.39s

Table 13: Efficiency on heterophilic benchmark datasets.
Averaged running time per epoch (ms) / averaged total running time (s). Method Chameleon Squirrel Actor Wisconsin Texas Cornell MLP 7.7 ms / 5.29s 8 ms / 5.44s 8.6 ms / 5.4s 7.7 ms / 5.16s 7.9 ms / 5.22s 7.6 ms / 5.19s GCN 83.4 ms / 6.1s 83.2 ms / 6.2s 90.7 ms / 6.07s 83.5 ms / 5.94s 80.7 ms / 5.96s 87.1 ms / 5.92s SGC 78.1 ms / 4.93s 110.9 ms / 5.21s 77.1 ms / 4.71s 73.2 ms / 4.52s 74.2 ms / 4.79s 71.3 ms / 4.8s GAT 374.9 ms / 13.49s 324.2 ms / 17.15s 420 ms / 13.82s 317.5 ms / 12.68s 357.9 ms / 12.38s 383.3 ms / 12.45s JKNet 102.4 ms / 21.15s 101 ms / 22.84s 97.2 ms / 21.24s 98.5 ms / 21.07s 103.6 ms / 20.92s 102.2 ms / 20.79s APPNP 87.1 ms / 12.12s 98.8 ms / 12.41s 87.2 ms / 11.81s 84.2 ms / 11.83s 86 ms / 11.9s 83.1 ms / 11.94s GPRGNN 93 ms / 12.98s 86.1 ms / 13.01s 94.2 ms / 13.01s 84.3 ms / 12.66s 92 ms / 12.64s 89.1 ms / 12.6s 1.0GNN 107.3 ms / 22.43s 116.3 ms / 30.92s 117.8 ms / 23.6s 94.5 ms / 18.47s 92 ms / 18.83s 92.7 ms / 18.97s 1.5GNN 97.2 ms / 22.54s 115 ms / 31.04s 119.2 ms / 23.47s 93.3 ms / 18.64s 90.8 ms / 19.09s 94.9 ms / 18.88s 2.0GNN 98.7 ms / 22.37s 114.8 ms / 31.14s 100.8 ms / 23.73s 92.2 ms / 19.09s 92.5 ms / 18.72s 98 ms / 18.64s 2.5GNN 97.9 ms / 22.38s 115.9 ms / 31.09s 97.3 ms / 23.77s 92.8 ms / 19.03s 91 ms / 18.84s 90.7 ms / 18.83s F.9. Experimental Results on Benchmark Datasets for 64 Hidden Units Table 14: Results on heterophilic benchmark datasets for 64 hidden units. Averaged accuracy (%) for 20 runs. Best results outlined in bold and the results within 95% confidence interval of the best results are outlined in underlined bold. Method Chameleon Squirrel Actor Wisconsin Texas Cornell MLP 46.55 0.90 33.83 0.59 38.40 0.76 93.91 2.47 87.51 8.53 86.75 8.22 GCN 34.74 2.62 25.68 1.17 30.86 1.51 65.93 5.47 58.56 13.28 46.81 4.28 SGC 34.57 4.71 24.39 1.54 35.50 2.09 62.87 8.92 50.62 5.60 29.44 14.83 GAT 43.33 1.53 30.07 0.99 33.44 2.45 66.57 4.69 50.69 12.89 42.62 13.37 JKNet 32.69 4.47 27.18 0.76 25.72 2.75 66.57 10.53 43.88 17.10 47.69 3.25 APPNP 35.09 3.18 28.15 0.93 32.28 1.75 66.30 1.60 69.00 4.53 54.88 3.85 GPRGNN 34.65 2.86 28.56 1.35 34.58 1.45 93.70 3.12 86.50 6.04 84.75 8.38 1.0GNN 49.51 1.32 32.67 1.00 40.70 0.88 95.23 1.60 84.12 7.39 82.56 6.97 1.5GNN 49.52 1.15 33.14 1.10 39.82 1.54 94.03 2.26 86.94 6.99 86.89 6.63 2.0GNN 49.19 0.81 33.78 0.87 39.75 1.26 94.49 1.81 87.62 6.64 85.56 7.25 2.5GNN 48.93 0.74 33.31 1.27 39.47 1.20 92.13 2.16 87.25 5.57 80.56 5.28 p-Laplacian Based Graph Neural Networks Table 15: Results on homophilic benchmark datasets for 64 hidden units. Averaged accuracy (%) for 20 runs. Best results are outlined in bold and the results within 95% confidence interval of the best results are outlined in underlined bold. OOM denotes out of memory. 
Method   Cora   CiteSeer   PubMed   Computers   Photo   CS   Physics
MLP      49.05±0.82  50.67±1.25  80.32±0.40  70.58±0.82  79.44±0.79  89.48±0.50  92.84±0.62
GCN      77.65±0.42  64.72±0.52  84.13±0.12  84.56±0.79  90.16±0.88  91.14±0.10  94.75±0.04
SGC      70.32±1.87  65.77±0.99  76.27±0.94  83.24±0.81  89.43±1.03  91.11±0.10  OOM
GAT      76.97±1.18  61.28±1.62  83.57±0.23  83.84±1.93  90.54±0.56  89.68±0.42  93.91±0.20
JKNet    78.77±0.79  64.62±0.80  82.82±0.16  82.22±1.32  88.43±0.53  90.48±0.13  93.75±0.32
APPNP    79.95±0.72  65.56±0.64  84.00±0.22  83.83±0.78  90.50±0.59  91.90±0.12  94.84±0.08
GPRGNN   78.17±1.31  61.26±2.14  84.54±0.24  83.77±1.06  89.86±0.63  91.34±0.25  94.63±0.26
1.0GNN   77.11±0.39  63.17±0.89  83.14±0.46  82.64±0.98  89.60±0.69  92.53±0.22  94.86±0.24
1.5GNN   78.69±0.43  63.14±0.93  83.97±0.04  84.64±1.42  90.67±0.67  92.93±0.14  94.93±0.12
2.0GNN   79.06±0.41  63.92±1.14  84.24±0.27  84.57±0.96  90.17±0.88  92.74±0.26  95.05±0.09
2.5GNN   79.15±0.39  63.16±1.25  84.88±0.09  83.84±0.71  89.05±0.85  92.31±0.19  94.92±0.10

F.10. Training Curves for p = 1

Figure 7: The curves of training loss and testing accuracy for p = 1 (one panel per benchmark dataset; x-axis: epoch).

F.11. Visualization Results of Node Embeddings

Figure 8: Visualization of node embeddings for the Cora dataset using t-SNE (van der Maaten & Hinton, 2008); panels (a)-(d) show 1.0GNN, 1.5GNN, 2.0GNN, and 2.5GNN.

Figure 9: Visualization of node embeddings for the Computers dataset using t-SNE; panels (a)-(d) show 1.0GNN, 1.5GNN, 2.0GNN, and 2.5GNN.

Figure 10: Visualization of node embeddings for the Chameleon dataset using t-SNE; panels (a)-(d) show 1.0GNN, 1.5GNN, 2.0GNN, and 2.5GNN.

Figure 11: Visualization of node embeddings for the Wisconsin dataset using t-SNE; panels (a)-(d) show 1.0GNN, 1.5GNN, 2.0GNN, and 2.5GNN.
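The embedding visualizations in Figures 8-11 can be reproduced with a standard t-SNE projection of the learned node embeddings. The snippet below is a generic sketch: the perplexity, random seed, and the specific layer whose output is plotted are assumptions, since the text does not state them.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(embeddings: np.ndarray, labels: np.ndarray, title: str):
    """Project node embeddings to 2-D with t-SNE and color points by class label.

    embeddings: [num_nodes, dim] array of learned node representations.
    labels: [num_nodes] integer class labels, used only for coloring.
    """
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    plt.figure(figsize=(4, 4))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.axis("off")
    plt.tight_layout()
    plt.savefig(f"{title}_tsne.png", dpi=200)

if __name__ == "__main__":
    # Stand-in random embeddings; in practice these would be a pGNN's last-layer outputs.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 7, size=300)
    emb = rng.normal(size=(300, 16)) + labels[:, None].astype(float)
    plot_embeddings(emb, labels, "toy")
```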