# Graph Coarsening with Message-Passing Guarantees

Antonin Joly, IRISA, Rennes, France (antonin.joly@inria.fr)
Nicolas Keriven, CNRS, IRISA, Rennes, France (nicolas.keriven@cnrs.fr)

Graph coarsening aims to reduce the size of a large graph while preserving some of its key properties, and has been used in many applications to reduce computational load and memory footprint. For instance, in graph machine learning, training Graph Neural Networks (GNNs) on coarsened graphs leads to drastic savings in time and memory. However, GNNs rely on the Message-Passing (MP) paradigm, and classical spectral preservation guarantees for graph coarsening do not directly lead to theoretical guarantees when performing naive message-passing on the coarsened graph. In this work, we propose a new message-passing operation specific to coarsened graphs, which exhibits theoretical guarantees on the preservation of the propagated signal. Interestingly, and in a sharp departure from previous proposals, this operation on coarsened graphs is often oriented, even when the original graph is undirected. We conduct node classification tasks on synthetic and real data and observe improved results compared to performing naive message-passing on the coarsened graph.

1 Introduction

In recent years, several applications in data science and machine learning have produced large-scale graph data [20, 5]. For instance, online social networks [13] or recommender systems [40] routinely produce graphs with millions of nodes or more. To handle such massive graphs, researchers have developed general-purpose graph reduction methods [4], such as graph coarsening [32, 7]. It consists in producing a small graph from a large graph while retaining some of its key properties, and starts to play an increasingly prominent role in machine learning applications [7].

Graph Neural Networks. Machine Learning on graphs is now largely done by Graph Neural Networks (GNNs) [37, 27, 5]. GNNs are deep architectures on graphs that rely on the Message-Passing paradigm [16]: at each layer, the representation $H^l_i \in \mathbb{R}^{d_l}$ of each node $1 \le i \le N$ is updated by aggregating and transforming the representations of its neighbours at the previous layer $\{H^{l-1}_j\}_{j \in \mathcal{N}(i)}$, where $\mathcal{N}(i)$ is the neighborhood of $i$. In most examples, this aggregation can be represented as a multiplication of the node representation matrix $H^{l-1} \in \mathbb{R}^{N \times d_{l-1}}$ by a propagation matrix $S \in \mathbb{R}^{N \times N}$ related to the graph structure, followed by a fully connected transformation. That is, starting with initial node features $H^0$, the GNN $\Phi_\theta$ outputs after $k$ layers:

$$H^l = \sigma\big(S H^{l-1} \theta^l\big), \qquad \Phi_\theta(H^0, S) = H^k, \tag{1}$$

where $\sigma$ is an activation function applied element-wise (often ReLU), $\theta^l \in \mathbb{R}^{d_{l-1} \times d_l}$ are learned parameters and $\theta = \{\theta^1, \ldots, \theta^k\}$. We emphasize here the dependency of the GNN on the propagation matrix $S$. Classical choices include mean aggregation $S = D^{-1}A$ or the normalized adjacency $S = D^{-1/2} A D^{-1/2}$, with $A$ the adjacency matrix of the graph and $D$ the diagonal matrix of degrees. When adding self-loops to $A$, the latter corresponds for instance to the classical GCNconv layer [27]. An interesting example is the Simplified Graph Convolution (SGC) model [42], which consists in removing all the non-linearities ($\sigma = \mathrm{id}$, the identity function); a minimal sketch of this propagation step is given below.
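To make equation (1) and the SGC special case concrete, here is a minimal NumPy sketch (ours, not the authors' code): `gcn_propagation_matrix` builds the self-loop-normalized adjacency used by GCNconv, `gnn_forward` applies the layers of (1), and `sgc_forward` shows that with $\sigma = \mathrm{id}$ the $k$ propagations collapse into a single matrix power; all function names are illustrative.

```python
import numpy as np

def gcn_propagation_matrix(A):
    """S = D(A_hat)^{-1/2} A_hat D(A_hat)^{-1/2} with self-loops (GCNconv-style)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gnn_forward(S, H0, thetas, sigma=lambda x: np.maximum(x, 0.0)):
    """k message-passing layers H^l = sigma(S H^{l-1} theta^l), cf. equation (1)."""
    H = H0
    for theta in thetas:
        H = sigma(S @ H @ theta)
    return H

def sgc_forward(S, H0, theta, k):
    """SGC: sigma = id, so the k propagations collapse to S^k H0 theta."""
    return np.linalg.matrix_power(S, k) @ H0 @ theta
```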
Surprisingly, the authors of [42] have shown that SGC reaches quite good performances when compared to non-linear architectures and due to its simplicity, SGC has been extensively employed in theoretical analyses of GNNs [46, 26]. Graph coarsening and GNNs. In this paper, we consider graph coarsening as a preprocessing step to downstream tasks [11, 21]: indeed, applying GNNs on coarsened graphs leads to drastic savings in time and memory, both during training and inference. Additionally, large graphs may be too big to fit on GPUs, and mini-batching graph nodes is known to be a difficult graph sampling problem [14], which may no longer be required on a coarsened graph. A primary question is then the following: is training a GNN on a coarsened graph provably close to training it on the original graph? To examine this, one must study the interaction between graph coarsening and message-passing. There are many ways of measuring the quality of graph coarsening algorithms, following different criteria [10, 32, 7]. A classical objective is the preservation of spectral properties of the graph Laplacian, which gave birth to different algorithms [32, 8, 4, 24, 33]. Loukas [32] materializes this by the so-called Restricted Spectral Approximation (RSA, see Sec. 2) property: it roughly states that the frequency content of a certain subspace of graph signals is approximately preserved by the coarsening, or intuitively, that the coarsening is well-aligned with the low-frequencies of the Laplacian. Surprisingly, the RSA does not generally lead to guarantees on the message-passing process at the core of GNNs, even for very simple signals. That is, simply performing message-passing on the coarsened graph using Sc, the naive propagation matrix corresponding to S on the coarsened graph (e.g. normalized adjacency of the coarsened graph when S is the normalized adjacency of the original one) does not guarantee that the outputs of the GNN on the coarsened graph and the original graph will be close, even with high-quality RSA. Contribution. In this paper, we address this problem by defining a new propagation matrix SMP c specific to coarsened graphs, which translate the RSA bound to message-passing guarantees: we show in Sec. 3.3 that training a GNN on the coarsened graph using SMP c is provably close to training it on the original graph. The proposed matrix SMP c can be computed for any given coarsening and is not specific to the coarsening algorithm used to produce it1, as long as it produces coarsenings with RSA guarantees. Interestingly, our proposed matrix SMP c is not symmetric in general even when S is, meaning that our guarantees are obtained by performing oriented message-passing on the coarsened graph, even when the original graph is undirected. To our knowledge, the only previous work to propose a new propagation matrix for coarsened graphs is [21], where the authors obtain guarantees for a specific GNN model (APPNP [28]), which is quite different from generic message-passing. Related Work. Graph Coarsening originates from the multigrid-literature [36], and is part of a family of methods commonly referred to as graph reduction, which includes graph sampling [19], which consists in sampling nodes to extract a subgraph; graph sparsification [38, 1, 31], that focuses on eliminating edges; or more recently graph distillation [22, 45, 23], which extends some of these principles by authorizing additional informations inspired by dataset distillation [41]. 
Some of the first coarsening algorithms were linked to the graph clustering community, e.g. [9] which used recursively the Graclus algorithm [10] algorithm itself built on Metis [25]. Linear algebra technics such as the Kron reduction were also employed [32] [12]. In [32], the author presents a greedy algorithm that recursively merge nodes by optimizing some cost, with the purpose of preserving spectral properties of the coarsened Laplacian. This is the approach we use in our experiments (Sec. 4). It was followed by several similar methods with the same spectral criterion [8, 4, 24, 33]. Since modern graph often includes node features, other approaches proposed to take them into account in the coarsening process, often by learning the coarsening with specific regularized loss [29, 34]. Closer to this work, [11] proposes an optimization process to explicitely preserve the propagated features, however with no theoretical guarantees and only one step of message-passing. While these works often seek to preserve a fixed number of node features as in e.g. [29]), the RSA guarantees [32] leveraged in this paper are uniform over a whole subspace: this stronger property is required to provide guarantees for GNNs with several layers. 1Note however that SMP c must be computed during the coarsening process and included as an output of the coarsening algorithm, before eventually discarding the original graph. Graph coarsening has been intertwined with GNNs in different ways. It can serve as graph pooling [44] within the GNN itself, with the aim of mimicking the pooling process in deep convolutional models on images. In the recent literature, the terms coarsening and pooling tend to be a bit exchangeable. For instance, some procedures that were initially introduced as pooling could also be used as pre-processing step, such as Graclus [10], introduced by [9] as a pooling scheme, see also [17]. Graph pooling is often data-driven and fully differentiable, such as Diffpool [44], SAGPool [30], and DMo N [39]. Theoretical work on their ability to distinguish non homomorphic graphs after pooling have been conducted [3]. In return, GNNs can also be trained to produce data-driven coarsenings, e.g. GOREN [6] which proposes to learn new edge weights with a GNN. As mentioned before, in the framework we consider here, graph coarsening is a preprocessing step with the aim of saving time and memory during training and inference [21]. Here few works derive theoretical guarantees for GNNs and message-passing, beyond the APPNP architecture examined in [21]. To our knowledge, the proposed SMP c is the first to yield such guarantees. Outline. We start with some preliminary material on graph coarsening and spectral preservation in Sec. 2. We then present our main contribution in Sec. 3: a new propagation matrix on coarsened graphs that leads to guarantees for message-passing. As is often the case in GNN theoretical analysis, our results mostly hold for the linear SGC model, however we still outline sufficient assumptions that would be required to apply our results to general GNNs, which represent a major path for future work. In Sec. 4 we test the proposed propagation matrix on real and synthetic data, and show how it leads to improved results compared to previous works. The code is available at https://gitlab.inria.fr/anjoly/mp-guarantees-graph-coarsening, and proofs are deferred to App. A. Notations. 
For a matrix $Q \in \mathbb{R}^{n \times N}$, its pseudo-inverse $Q^+ \in \mathbb{R}^{N \times n}$ is obtained by replacing its nonzero singular values by their inverses and transposing. For a symmetric positive semi-definite (p.s.d.) matrix $L \in \mathbb{R}^{N \times N}$, we define $L^{1/2}$ by replacing its eigenvalues by their square roots, and $L^{-1/2} = (L^+)^{1/2}$. For $x \in \mathbb{R}^N$ we denote by $\|x\|_L = \sqrt{x^\top L x}$ the Mahalanobis semi-norm associated to $L$. For a matrix $P \in \mathbb{R}^{N \times N}$, we denote by $\|P\| = \max_{\|x\|=1} \|Px\|$ the operator norm of $P$, and $\|P\|_L = \|L^{1/2} P L^{-1/2}\|$. For a subspace $R$, we say that a matrix $P$ is $R$-preserving if $x \in R$ implies $Px \in R$. Finally, for a matrix $X \in \mathbb{R}^{N \times d}$, we denote its columns by $X_{:,i}$, and define $\|X\|_{:,L} = \sum_i \|X_{:,i}\|_L$.

2 Background on Graph Coarsening

We mostly adopt the framework of Loukas [32], with some generalizations. A graph $G$ with $N$ nodes is described by its weighted adjacency matrix $A \in \mathbb{R}^{N \times N}$. We denote by $L \in \mathbb{R}^{N \times N}$ a notion of symmetric p.s.d. Laplacian of the graph: classical choices include the combinatorial Laplacian $L = D - A$ with $D = D(A) := \mathrm{diag}(A\mathbf{1}_N)$ the diagonal matrix of the degrees, or the symmetric normalized Laplacian $L = I_N - D^{-1/2} A D^{-1/2}$. We denote by $\lambda_{\max}, \lambda_{\min}$ respectively the largest and smallest non-zero eigenvalues of $L$.

Coarsening matrix. A coarsening algorithm takes a graph $G$ with $N$ nodes and produces a coarsened graph $G_c$ with $n < N$ nodes. Intuitively, nodes in $G$ are grouped into super-nodes in $G_c$ (Fig. 1), with some weights to reflect their relative importance. This mapping can be represented via a coarsening matrix $Q \in \mathbb{R}^{n \times N}$: $Q_{ki} > 0$ if the $i$-th node of $G$ is mapped to the $k$-th super-node of $G_c$, and $Q_{ki} = 0$ otherwise. The lifting matrix is the pseudo-inverse $Q^+$ of the coarsening matrix, and plays the role of an inverse mapping from the coarsened graph to the original one. The coarsening ratio is defined as $r = 1 - \frac{n}{N}$; that is, the higher $r$ is, the more coarsened the graph is. A coarsening is said to be well-mapped if each node of $G$ is mapped to a unique node of $G_c$, that is, if $Q$ has exactly one non-zero value per column. Moreover, it is surjective if at least one node is mapped to each super-node: $\sum_i Q_{ki} > 0$ for all $k$. In this case, $QQ^\top$ is invertible and diagonal, and $Q^+ = Q^\top (QQ^\top)^{-1}$. Moreover, such a coarsening is said to be uniform when the mapping weights are constant within each super-node and sum to one: $Q_{ki} = 1/n_k$ for all $Q_{ki} > 0$, where $n_k$ is the number of nodes mapped to the super-node $k$. In this case the lifting matrix is particularly simple: $Q^+ \in \{0,1\}^{N \times n}$ (Fig. 1d). For simplicity, following the majority of the literature [32], in this paper we consider only well-mapped surjective coarsenings (but not necessarily uniform).

Remark 1. Non-well-mapped coarsenings may appear in the literature, e.g. when learning the matrix $Q$ via a gradient-based optimization algorithm such as Diffpool [44]. However, these methods often include regularization penalties to favor well-mapped coarsenings.

Restricted Spectral Approximation. A large part of the graph coarsening literature measures the quality of a coarsening by quantifying the modification of the spectral properties of the graph, often represented by the graph Laplacian $L$. In [32], this is done by establishing a near-isometry property for graph signals with respect to the norm $\|\cdot\|_L$, which can be interpreted as a measure of the smoothness of a signal across the graph edges. Given a signal $x \in \mathbb{R}^N$ over the nodes of $G$, we define the coarsened signal $x_c \in \mathbb{R}^n$ and the re-lifted signal $\tilde{x} \in \mathbb{R}^N$ by

$$x_c = Qx, \qquad \tilde{x} = Q^+ x_c = \Pi x \tag{2}$$

where $\Pi = Q^+ Q$; a small sketch of these operators on a toy uniform coarsening is given below.
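The following toy sketch (our own helper, dense matrices for readability) builds $Q$, $Q^+$ and $\Pi$ for a well-mapped, surjective, uniform coarsening given a hard node-to-super-node assignment, and applies the coarsening and lifting of (2).

```python
import numpy as np

def coarsening_matrices(assignment, N):
    """Q, Q^+ and Pi = Q^+ Q for a well-mapped, surjective, uniform coarsening,
    given a hard assignment: assignment[i] = index of the super-node of node i."""
    n = int(assignment.max()) + 1
    Q = np.zeros((n, N))
    Q[assignment, np.arange(N)] = 1.0
    Q /= Q.sum(axis=1, keepdims=True)          # uniform weights 1/n_k per row
    Q_plus = Q.T @ np.linalg.inv(Q @ Q.T)      # Q^+ = Q^T (Q Q^T)^{-1}, a 0/1 matrix here
    Pi = Q_plus @ Q
    return Q, Q_plus, Pi

# Toy example in the spirit of Fig. 1: 6 nodes grouped into 3 super-nodes.
assignment = np.array([0, 0, 1, 1, 1, 2])
Q, Q_plus, Pi = coarsening_matrices(assignment, N=6)
x = np.arange(6.0)
x_c = Q @ x              # coarsened signal, eq. (2)
x_tilde = Q_plus @ x_c   # re-lifted signal, equal to Pi @ x
```

On this toy assignment, $Q^+$ is exactly the 0/1 indicator matrix of Fig. 1d.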
Loukas [32] then introduces the notion of Restricted Spectral Approximation (RSA) of a coarsening algorithm, which measures how close the projection $\Pi$ is to the identity for a class of signals. Since $\Pi$ is at most of rank $n < N$, this cannot be true for all signals, but only for a restricted subspace $R \subseteq \mathbb{R}^N$. With this in mind, the RSA constant is defined as follows.

Definition 1 (Restricted Spectral Approximation constant). Consider a subspace $R \subseteq \mathbb{R}^N$, a Laplacian $L$, a coarsening matrix $Q$ and its corresponding projection operator $\Pi = Q^+Q$. The RSA constant $\epsilon_{L,Q,R}$ is defined as

$$\epsilon_{L,Q,R} = \sup_{x \in R,\ \|x\|_L = 1} \|x - \Pi x\|_L. \tag{3}$$

In other words, the RSA constant measures how much signals in $R$ are preserved by the coarsening-lifting operation, with respect to the norm $\|\cdot\|_L$. Given some $R$ and Laplacian $L$, the goal of a coarsening algorithm is then to produce a coarsening $Q$ with the smallest RSA constant possible. While the best coarsening $\arg\min_Q \epsilon_{L,Q,R}$ is generally computationally unreachable, there are many possible heuristic algorithms, often based on greedy merging of nodes. In App. B.1, we give an example of such an algorithm, adapted from [32]. In practice, $R$ is often chosen as the subspace spanned by the first eigenvectors of $L$ ordered by increasing eigenvalue: intuitively, coarsening the graph and merging nodes is more likely to preserve the low frequencies of the Laplacian. While $\epsilon_{L,Q,R} \le 1$ is required to obtain meaningful guarantees, we remark that $\epsilon_{L,Q,R}$ is not necessarily finite. Indeed, as $\|\cdot\|_L$ may only be a semi-norm, its unit ball is not necessarily compact. It is nevertheless finite in the following case.

Proposition 1. When $\Pi$ is $\ker(L)$-preserving, it holds that $\epsilon_{L,Q,R} \le \sqrt{\lambda_{\max}/\lambda_{\min}}$.

Hence, some examples where $\epsilon_{L,Q,R}$ is finite include:

Example 1. For uniform coarsenings with $L = D - A$ and a connected graph $G$, $\ker(L)$ is spanned by the constant vector², and $\Pi$ is $\ker(L)$-preserving. This is the case examined by [32].

Example 2. For positive definite Laplacians, $\ker(L) = \{0\}$. This is a deceptively simple solution for which $\|\cdot\|_L$ is a true norm. It can be obtained e.g. with $L = \delta I_N + \hat{L}$ for any p.s.d. Laplacian $\hat{L}$ and small constant $\delta > 0$. This leaves the eigenvectors unchanged and adds $\delta$ to the eigenvalues, and therefore does not alter the fundamental structure of the coarsening problem.

Given the adjacency matrix $A \in \mathbb{R}^{N \times N}$ of $G$, there are several possibilities to define an adjacency matrix $A_c$ for the graph $G_c$ [21, 29]. A natural choice that we make in this paper is

$$A_c = (Q^+)^\top A Q^+. \tag{4}$$

In the case of a uniform coarsening, $(A_c)_{k\ell}$ equals the sum of edge weights over all edges of the original graph between the nodes mapped to super-node $k$ and the nodes mapped to super-node $\ell$. Moreover, we have the following property, derived from [32].

²Note that this would also work with several connected components, if no nodes from different components are mapped to the same super-node.

Figure 1: Example of uniform coarsening. (a): original graph $G$; (b): coarsened adjacency matrix $A_c$; (c): representation of the proposed $S_c^{MP}$ when $S = A$; (d): corresponding matrices $Q$, $Q^+$, $\Pi$.

Proposition 2. Assume that the coarsening $Q$ is uniform and consider the combinatorial Laplacians $L = D(A) - A$ and $L_c = D(A_c) - A_c$. Then $L_c = (Q^+)^\top L Q^+$, and for all $x \in R$,

$$(1 - \epsilon_{L,Q,R}) \|x\|_L \le \|x_c\|_{L_c} \le (1 + \epsilon_{L,Q,R}) \|x\|_L. \tag{5}$$

This draws a link between the RSA constant and a notion of near-isometric embedding for vectors in $R$; a numerical estimation of $\epsilon_{L,Q,R}$ is sketched below.
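The RSA constant can be evaluated numerically. Here is a minimal sketch, assuming $L$ is positive definite (Example 2) and $R$ is given by a basis matrix $V$: $\epsilon_{L,Q,R}^2$ is then the largest generalized eigenvalue of the pair of quadratic forms associated to $\|(I-\Pi)x\|_L^2$ and $\|x\|_L^2$ restricted to $R$. The helper below is ours, not taken from the paper's code.

```python
import numpy as np
from scipy.linalg import eigh

def rsa_constant(L, Pi, V):
    """Numerical estimate of eps_{L,Q,R} = sup_{x in R, ||x||_L = 1} ||x - Pi x||_L,
    where R = span(V). Assumes L is positive definite (Example 2), so that the
    generalized eigenproblem below is well posed."""
    E = (np.eye(L.shape[0]) - Pi) @ V      # (I - Pi) restricted to R
    M_err = E.T @ L @ E                    # quadratic form of ||(I - Pi) x||_L^2
    M_norm = V.T @ L @ V                   # quadratic form of ||x||_L^2
    lam = eigh(M_err, M_norm, eigvals_only=True)
    return float(np.sqrt(max(lam.max(), 0.0)))

# Typical use: R spanned by the K lowest-frequency eigenvectors of L,
# e.g. evals, evecs = np.linalg.eigh(L); V = evecs[:, :K].
```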
Note that the proposition above is not necessarily true when considering normalized Laplacians, or non-uniform coarsenings. In the next section, we propose a new propagation matrix on coarsened graphs and draw a link between the RSA constant $\epsilon_{L,Q,R}$ and message-passing guarantees.

3 Message-Passing on coarsened graphs

In the previous section, we have seen that coarsening algorithms generally aim at preserving the spectral properties of the graph Laplacian, leading to small RSA constants $\epsilon_{L,Q,R}$. However, this generally does not directly translate to guarantees on the Message-Passing process that is at the core of GNNs, which, as mentioned in the introduction, is materialized by the matrix $S$. In this section, we propose a new propagation matrix such that small RSA constants lead to preserved message-passing, which then leads to guarantees for training GNNs on coarsened graphs.

3.1 A new propagation matrix on coarsened graphs

Given a graph $G$ with a propagation matrix $S$ and a coarsened graph $G_c$ with a coarsening matrix $Q$, our goal is to define a propagation matrix $S_c^{MP} \in \mathbb{R}^{n \times n}$ such that one round of message-passing on the coarsened signal $x_c$ followed by lifting is close to performing message-passing on the original graph: $Q^+ S_c^{MP} x_c \approx Sx$. Assuming that the propagation matrix $S = f_S(A)$ is the output of a function $f_S$ of the graph's adjacency matrix, the most natural choice, often adopted in the literature [11], is therefore to simply take $S_c = f_S(A_c)$, where $A_c$ is the adjacency matrix of the coarsened graph defined in (4). However, this does not generally lead to the desired guarantees: indeed, considering for instance $S = A$, we have in this case $Q^+ S_c x_c = Q^+ (Q^+)^\top A \Pi x$, which involves the quite unnatural term $Q^+(Q^+)^\top$. For other choices of normalized $S$, the situation is even less clear. Some authors propose different variants of $S_c$ adapted to specific cases [21, 44] (see Sec. 4), but none offers generic message-passing guarantees. To address this, we propose a new propagation matrix:

$$S_c^{MP} = Q S Q^+ \in \mathbb{R}^{n \times n}. \tag{6}$$

This expression is conceptually simple: it often amounts to some reweighting. For instance, when $S = A$ and in the case of a uniform coarsening, we have $(S_c^{MP})_{k\ell} = (A_c)_{k\ell}/n_k$ (Fig. 1c). Despite this simplicity, we will see that under some mild hypotheses this choice indeed leads to preservation guarantees of message-passing for coarsenings with small RSA constants.

Orientation. An important remark is that, unlike all the examples in the literature, and unlike the adjacency matrix $A_c$ defined in (4), the proposed matrix $S_c^{MP}$ is generally asymmetric, even when $S$ is symmetric. This means that our guarantees are obtained by performing directed message-passing on the coarsened graph, even when the original message-passing procedure was undirected. Conceptually, this is an important departure from previous works. However, $S_c^{MP}$ becomes more symmetric as $Q^+$ and $Q^\top$ become more similar. This is for instance the case when $Q$ induces a balanced partition, where each super-node has the same number of ancestors (which can be targeted by some pooling algorithms). On the contrary, the difference between $Q^\top$ and $Q^+$ is more pronounced when super-nodes have very different sizes, which may happen for highly irregular graphs.

3.2 Message-Passing guarantees

In this section, we show how the proposed propagation matrix (6) allows us to transfer spectral approximation guarantees to message-passing guarantees, under some hypotheses. First, we must make some technical assumptions relating to the kernel of the Laplacian.
Assumption 1. Assume that $\Pi$ and $S$ are both $\ker(L)$-preserving.

Moreover, since spectral approximation pertains to a subspace $R$, we must assume that this subspace is left unchanged by the application of $S$.

Assumption 2. Assume that $S$ is $R$-preserving.

As mentioned before, for Examples 1 and 2, the projection $\Pi$ is $\ker(L)$-preserving. Moreover, $R$ is often chosen to be the subspace spanned by the low-frequency eigenvectors of $L$, and in this case, all matrices of the form $S = \alpha I_N + \beta L$ for some constants $\alpha, \beta$ are both $\ker(L)$-preserving and $R$-preserving. Hence, for instance, a primary example in practice is to choose GCNconv [27] with $S = D(\hat{A})^{-1/2} \hat{A} D(\hat{A})^{-1/2}$ with $\hat{A} = A + I_N$, and to compute a coarsening with a good RSA constant for the Laplacian $L = (1+\delta)I_N - S$ with small $\delta > 0$ and $R$ spanned by eigenvectors of $L$. In this case, Assumptions 1 and 2 are satisfied. This is the implementation we choose in our experiments. We now state the main result of this section.

Theorem 1. Define $S_c^{MP}$ as in (6). Under Assumptions 1 and 2, for all $x \in R$,

$$\|Sx - Q^+ S_c^{MP} x_c\|_L \le \epsilon_{L,Q,R} \|x\|_L (C_S + C_\Pi) \tag{7}$$

where $C_S := \|S\|_L$ and $C_\Pi := \|\Pi S\|_L$.

Sketch of proof. The Theorem is proved in App. A. The proof is quite direct, and relies on the fact that, for this well-designed choice (6) of $S_c^{MP}$, the lifted signal is precisely $Q^+ S_c^{MP} x_c = \Pi S \Pi x$. Then, bounding the error incurred by $\Pi$ using the RSA, we show that this is indeed close to performing message-passing by $S$ on the original graph.

This theorem shows that the RSA error $\epsilon_{L,Q,R}$ directly translates to an error bound between $Sx$ and $Q^+ S_c^{MP} x_c$. As we will see in the next section, this leads to guarantees when training a GNN on the original graph and the coarsened graph. First, we discuss the two main multiplicative constants involved in Thm. 1.

Multiplicative constants. In full generality, for any matrix $M$ we have $\|M\|_L \le \|M\| \sqrt{\lambda_{\max}/\lambda_{\min}}$. Moreover, when $M$ and $L$ commute, we have $\|M\|_L \le \|M\|$. As mentioned before, choosing $S = \alpha I_N + \beta L$ for some constants $\alpha, \beta$ is a primary example that satisfies our assumptions. In this case $C_S = \|S\|_L \le \|S\|$. Then, if $S$ is properly normalized, e.g. for the GCNconv [27] example outlined above, we have $\|S\| \le 1$. For the combinatorial Laplacian $L = D - A$ however, we obtain $\|S\| \le |\alpha| + |\beta| N$. We observed in our experiments that the combinatorial Laplacian generally yields poor results for GNNs. For $C_\Pi$, in full generality we only have $C_\Pi \le C_S \|\Pi\|_L \le C_S \sqrt{\lambda_{\max}/\lambda_{\min}}$, since $\Pi$ is an orthogonal projector. However, in practice we generally observe that the exact value $C_\Pi = \|\Pi S\|_L$ is far better than this ratio of eigenvalues (e.g. we observe a ratio of roughly $C_\Pi \approx (1/10)\sqrt{\lambda_{\max}/\lambda_{\min}}$ in our experiments). Future work may examine more precise bounds in different contexts.

3.3 GNN training on coarsened graphs

In this section, we instantiate our message-passing guarantees for GNN training on coarsened graphs, with SGC as a primary example. To fix ideas, we consider a single large graph $G$, and a node-level task such as node classification or regression. Given some node features $X \in \mathbb{R}^{N \times d}$, the goal is to minimize a loss function $J : \mathbb{R}^N \to \mathbb{R}_+$ on the output of a GNN $\Phi_\theta(X, S) \in \mathbb{R}^N$ (assumed unidimensional for simplicity) with respect to the parameters $\theta$:

$$\min_{\theta \in \Theta} R(\theta) \quad \text{with} \quad R(\theta) := J(\Phi_\theta(X, S)) \tag{8}$$

where $\Theta$ is a set of parameters that we assume bounded. For instance, $J$ can be the cross-entropy between the output of the GNN and some labels on training nodes for classification, or the Mean Square Error for regression. The loss is generally minimized by first-order optimization methods on $\theta$, which requires multiple calls to the GNN on the graph $G$; a minimal sketch of this baseline training loop is given below.
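A minimal PyTorch sketch of this baseline full-graph training, with illustrative names and hyper-parameters; `model(X, S)` is assumed to implement the layers of equation (1).

```python
import torch

def train_full_graph(model, S, X, y, train_mask, epochs=200, lr=0.05):
    """Sketch of minimizing R(theta) = J(Phi_theta(X, S)) in (8) by first-order
    optimization: every gradient step calls the GNN on the whole graph."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        out = model(X, S)                               # full-graph forward pass
        loss = loss_fn(out[train_mask], y[train_mask])  # J on the training nodes
        loss.backward()
        opt.step()
    return model
```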
Roughly, the computational complexity of this approach is $O(T(N + E)D)$, where $T$ is the number of iterations of the optimization algorithm, $D$ is the number of parameters in the GNN, and $E$ is the number of nonzero elements in $S$. Instead, one may want to train on the coarsened graph $G_c$, which can be done by minimizing instead³

$$R_c(\theta) := J(Q^+ \Phi_\theta(X_c, S_c^{MP})) \tag{9}$$

where $X_c = QX$. That is, the GNN is applied on the coarsened graph, and its output is then lifted to compute the loss, which is then back-propagated to compute the gradient with respect to $\theta$. The computational complexity then becomes $O(T(n + e)D + TN)$, where $e \le E$ is the number of nonzeros in $S_c^{MP}$, and the term $TN$ is due to the lifting. As this decorrelates $N$ and $D$, it is in general much less costly. We make the following two assumptions to state our result. Since our bounds are expressed in terms of $\|\cdot\|_L$, we must handle it with the following assumption.

Assumption 3. We assume that there is a constant $C_J$ such that

$$|J(x) - J(x')| \le C_J \|x - x'\|_L. \tag{10}$$

For most loss functions, it is easy to show that $|J(x) - J(x')| \lesssim \|x - x'\|$, and when $L$ is positive definite (Example 2) then $\|\cdot\| \le \frac{1}{\sqrt{\lambda_{\min}}} \|\cdot\|_L$. Otherwise, one must handle the kernel of $L$, which may be done on a case-by-case basis or for an appropriate choice of $J$.

The second assumption relates to the activation function. It is here mostly for technical completeness, as we do not have examples where it is satisfied beyond the identity $\sigma = \mathrm{id}$, which corresponds to the SGC architecture [42] often used in theoretical analyses [46, 26].

Assumption 4. We assume that: i) $\sigma$ is $R$-preserving, that is, for all $x \in R$, we have $\sigma(x) \in R$. We discuss this constraint below. ii) $\|\sigma(x) - \sigma(x')\|_L \le C_\sigma \|x - x'\|_L$. Note that most activations are 1-Lipschitz w.r.t. the Euclidean norm, so this is satisfied when $L$ is positive definite with $C_\sigma = \sqrt{\lambda_{\max}/\lambda_{\min}}$. iii) $\sigma$ and $Q^+$ commute: $\sigma(Q^+ y) = Q^+ \sigma(y)$. This is satisfied for all uniform coarsenings, or when $\sigma$ is 1-positively homogeneous: $\sigma(\lambda x) = \lambda \sigma(x)$ for nonnegative $\lambda$ (e.g. ReLU).

Item i) above means that, when $R$ is spanned by low-frequency eigenvectors of the Laplacian, $\sigma$ does not induce high frequencies. In other words, we want $\sigma$ to preserve smooth signals. For now, the only example for which we can guarantee that Assumption 4 is satisfied is when $\sigma = \mathrm{id}$ and the GNN is linear, which corresponds to the SGC architecture [42]. As is the case with many such analyses of GNNs, non-linear activations are a major path for future work. A possible study would be to consider random geometric graphs for which the eigenvectors of the Laplacian are close to explicit functions, e.g. spherical harmonics for dot-product graphs [2]. In this case, it may be possible to explicitly prove that Assumption 4 holds, but this is out of the scope of this paper. Our result on GNNs is the following.

³Note that we apply the GNN on the coarsened graph, but still lift its output to compute the loss on the training nodes of the original graph. Another possibility would be to also coarsen the labels to directly compute the loss on the coarsened graph [21], but this is not considered here. See App. B.2 for more discussion.

Theorem 2. Under Assumptions 1-4: for all node features $X \in \mathbb{R}^{N \times d}$ such that $X_{:,i} \in R$, denoting $\theta^\star = \arg\min_{\theta \in \Theta} R(\theta)$ and $\theta_c = \arg\min_{\theta \in \Theta} R_c(\theta)$, we have

$$R(\theta_c) - R(\theta^\star) \le C \, \epsilon_{L,Q,R} \, \|X\|_{:,L} \tag{11}$$

with $C = 2 C_J C_\sigma^k C_\Theta (C_S + C_\Pi) \sum_{l=1}^{k} \bar{C}_\Pi^{\,k-l} C_S^{\,l-1}$, where $\bar{C}_\Pi := \|\Pi S \Pi\|_L$ and $C_\Theta$ is a constant that depends on the parameter set $\Theta$.

The proof of Thm. 2 is given in App. A.3. In this proof, in order to apply Theorem 1, we apply the RSA to each column of the node features.
This relies on the assumption that each column $X_{:,i}$ of the node features belongs to the preserved space $R$. This assumption seems reasonable for homophilic datasets (Cora, Citeseer) and large preserved spaces. The Theorem states that training a GNN that uses the proposed $S_c^{MP}$ on the coarsened graph by minimizing (9) yields a parameter $\theta_c$ whose excess loss compared to the optimal $\theta^\star$ is bounded by the RSA constant. Hence, spectral approximation properties of the coarsening directly translate to guarantees on GNN training. The multiplicative constants $C_S, C_\Pi$ have been discussed in the previous section, and similar remarks apply to $\bar{C}_\Pi$.

4 Experiments

Setup. We choose the propagation matrix from GCNconv [27], that is, $S = f_S(A) = D(\hat{A})^{-1/2} \hat{A} D(\hat{A})^{-1/2}$ with $\hat{A} = A + I_N$. As detailed in the previous section, we take $L = (1+\delta)I_N - S$ with $\delta = 0.001$ and $R$ spanned by the $K$ first eigenvectors of $L$ ($K = N/10$ in our experiments), ensuring that Assumptions 1 and 2 are satisfied. In our experiments, we observed that the combinatorial Laplacian $L = D - A$ gives quite poor results, as it corresponds to unusual propagation matrices $S = \alpha I_N + \beta L$, and the constant $C_S = \|S\|_L$ is very large. Hence our focus on the normalized case. On coarsened graphs, we compare five propagation matrices:

- $S_c^{MP} = QSQ^+$, our proposed matrix;
- $S_c = f_S(A_c)$, the naive choice;
- $S_c^{\mathrm{diag}} = \hat{D}^{-1/2}(A_c + C)\hat{D}^{-1/2}$, proposed in [21], where $C$ is the diagonal matrix of the $n_k$ and $\hat{D}$ the corresponding degrees. This yields theoretical guarantees for APPNP when $S$ is GCNconv;
- $S_c^{\mathrm{diff}} = QSQ^\top$, which is roughly inspired by Diffpool [44];
- $S_c^{\mathrm{sym}} = (Q^+)^\top S Q^+$, which is the lifting employed to compute $A_c$ in (4).

Figure 2: Message-Passing error for different propagation matrices.

Coarsening algorithm. Recall that the proposed $S_c^{MP}$ can be computed for any coarsening, and that the corresponding theoretical guarantees depend on the RSA constant $\epsilon_{L,Q,R}$. In our experiments, we adapt the algorithm from [32] to coarsen the graphs. It takes as input the graph $G$ and the desired coarsening ratio $r$, and outputs the propagation matrix $S_c^{MP}$ and the coarsening matrix $Q$ used for lifting. It is a greedy algorithm which successively merges edges by minimizing a certain cost. While originally designed for the combinatorial Laplacian, we simply adapt the cost to any Laplacian $L$, see App. B.1. Note however that some mathematical justifications for this approach in [32] are no longer valid for normalized Laplacians, but we find in practice that it produces good RSA constants. A major limitation of this algorithm is its computational cost, which is quite high since it involves large matrix inversions and SVD computations. Hence we limit ourselves to middle-scale graphs such as Cora [35] and Citeseer [15] and one larger graph, Reddit [18], in the following experiments. The design of more scalable coarsening algorithms with RSA guarantees is an important path for future work, but out of the scope of this paper.

Message-passing preservation guarantees. To evaluate the effectiveness of the proposed propagation matrix, we first illustrate the theoretical message-passing preservation guarantees (Thm. 1 and 2) on synthetic graphs, taken as random geometric graphs built by sampling 1000 nodes with coordinates in $[0, 1]^2$ and connecting them if their distance is under a given threshold (details in App. B.3). For each choice of coarsened propagation matrix $S_c$ listed above and different coarsening ratios, we compute numerically $\|S^k x - Q^+ S_c^k x_c\|_L$ for various signals $x \in R$, as sketched below.
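A minimal sketch of this measurement with dense NumPy matrices; `mp_error` is our own helper, and the commented lines indicate how the proposed matrix (6) and the naive choice would be built from the earlier sketches.

```python
import numpy as np

def norm_L(x, L):
    """Semi-norm ||x||_L = sqrt(x^T L x)."""
    return float(np.sqrt(max(x @ L @ x, 0.0)))

def mp_error(S, L, Q, Q_plus, S_c, x, k=6):
    """||S^k x - Q^+ S_c^k x_c||_L after k propagation steps, for any candidate
    coarsened propagation matrix S_c."""
    x_c = Q @ x
    full = np.linalg.matrix_power(S, k) @ x
    lifted = Q_plus @ (np.linalg.matrix_power(S_c, k) @ x_c)
    return norm_L(full - lifted, L)

# Candidates compared in Fig. 2 (reusing the earlier sketches):
# S_c_mp    = Q @ S @ Q_plus               # proposed matrix (6)
# A_c       = Q_plus.T @ A @ Q_plus        # coarsened adjacency (4)
# S_c_naive = gcn_propagation_matrix(A_c)  # naive choice f_S(A_c)
```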
We perform $N_p = 6$ message-passing steps to enhance the difference between propagation matrices. We also evaluate and plot in Fig. 2 the upper bound $\epsilon_{L,Q,R}(C_S + C_\Pi)\sum_{l=1}^{k} \bar{C}_\Pi^{\,k-l} C_S^{\,l-1}$ (which appears in the proof of Theorem 2 in App. A.3). We observe that our propagation matrix incurs a significantly lower error compared to other choices, and that, as expected, this error is correlated to $\epsilon_{L,Q,R}$, which is not the case for the other choices. More experiments can be found in App. B.4.

Node classification on real graphs. We then perform node classification experiments on real-world graphs, namely Cora [35] and Citeseer [15], using the public split from [43]. For simplicity, we restrict them to their largest connected component⁴, since using a connected graph is far more convenient for coarsening algorithms (details in App. B.3). The training procedure follows that of Sec. 3.3: the network is applied to the coarsened graph and coarsened node features, its output is lifted to the original graph with $Q^+$, and the labels of the training nodes of the original graph are used to compute the cross-entropy loss, which is then back-propagated to optimize the parameters $\theta$ (pseudocode in App. B.2). Despite the lifting procedure, this results in faster training than using the entire graph (e.g., by approximately 30% for a coarsening ratio of $r = 0.5$ when parallelized on GPU).

⁴Hence the slight difference with other reported results on these datasets.

For downstream tasks, we introduce a novel metric to analyze a specific coarsening: "Max acc possible". It corresponds to the optimal prediction over the super-nodes of the coarsened graph (all the nodes coarsened into a super-node receive the same prediction, optimally the majority label of this cluster). It may be hard to achieve in practice, as the optimal assignment for the validation nodes and for the training nodes can differ. It allows comparing different coarsenings for a classification task without training models on them. We test SGC [42] with $N_p = 6$ and GCNconv [27] with $N_p = 2$, where $N_p$ is the number of propagation steps, on three different coarsening ratios $r \in \{0.3, 0.5, 0.7\}$. Each classification result is averaged over 10 random trainings. Results are reported in Table 1 and Table 2. We observe that the proposed propagation matrix $S_c^{MP}$ yields better results and is more stable, especially for high coarsening ratios. The benefits are more evident when applied to the SGC architecture [42], for which Assumption 4 of Thm. 2 is actually satisfied, than for GCN, for which ReLU is unlikely to satisfy Assumption 4. It is also interesting to notice that training on coarsened graphs sometimes achieves better results than on the original graph. This may be explained by the fact that, for homophilic graphs (connected nodes are more likely to have the same label), nodes with similar labels are more likely to be merged together during the coarsening, and thus become easier to predict for the model. The detailed hyper-parameters for each model and each dataset can be found in Appendix B.5. A sketch of the coarsened training loop and of the "Max acc possible" metric is given below.
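For concreteness, here is a PyTorch sketch of the coarsened training loop of Sec. 3.3 (pseudocode in App. B.2), together with one possible implementation of the "Max acc possible" metric. All names, masks and inputs are illustrative; the learning rate and weight decay follow App. B.5, and `max_acc_possible` is our reading of the metric (majority label per super-node), not the authors' exact code.

```python
import torch

def train_on_coarsened_graph(model, S_c_mp, X_c, Q_plus, y, train_mask,
                             epochs=200, lr=0.05, weight_decay=0.01):
    """Sketch of minimizing R_c(theta) = J(Q^+ Phi_theta(X_c, S_c^MP)) in (9):
    the GNN runs on the coarsened graph, its output is lifted with Q^+ (a torch
    tensor here) and the cross-entropy is taken on the original training nodes."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        out_c = model(X_c, S_c_mp)     # predictions on the n super-nodes
        out = Q_plus @ out_c           # lifted back to the N original nodes
        loss = loss_fn(out[train_mask], y[train_mask])
        loss.backward()
        opt.step()
    return model

def max_acc_possible(assignment, y, eval_mask):
    """Every node inherits the majority label of its super-node; accuracy is
    then evaluated on the nodes selected by eval_mask."""
    y_pred = torch.empty_like(y)
    for k in assignment.unique():
        members = assignment == k
        y_pred[members] = torch.mode(y[members]).values
    return (y_pred[eval_mask] == y[eval_mask]).float().mean().item()
```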
Table 1: Accuracy in % for node classification with SGC and different coarsening ratios

| SGC | Cora r=0.3 | Cora r=0.5 | Cora r=0.7 | Citeseer r=0.3 | Citeseer r=0.5 | Citeseer r=0.7 |
|---|---|---|---|---|---|---|
| $S_c^{\mathrm{sym}}$ | 16.8 ± 3.8 | 16.1 ± 3.8 | 16.4 ± 4.7 | 17.5 ± 3.8 | 18.6 ± 4.6 | 19.8 ± 5.0 |
| $S_c^{\mathrm{diff}}$ | 50.7 ± 1.4 | 21.8 ± 2.2 | 13.6 ± 2.8 | 50.5 ± 0.2 | 30.5 ± 0.2 | 23.1 ± 0.0 |
| $S_c$ | 79.3 ± 0.1 | 78.7 ± 0.0 | 74.6 ± 0.1 | 74.1 ± 0.1 | 72.8 ± 0.1 | 72.5 ± 0.1 |
| $S_c^{\mathrm{diag}}$ | 79.9 ± 0.1 | 78.7 ± 0.1 | 77.3 ± 0.0 | 73.6 ± 0.1 | 73.4 ± 0.1 | 73.1 ± 0.4 |
| $S_c^{MP}$ (ours) | 81.8 ± 0.1 | 80.3 ± 0.1 | 78.5 ± 0.0 | 73.9 ± 0.1 | 74.6 ± 0.1 | 74.2 ± 0.1 |
| Max acc possible | 96.5 | 92.5 | 88.9 | 93.5 | 90.5 | 84.5 |
| Full graph | 81.6 ± 0.1 | | | 73.6 ± 0.0 | | |

Table 2: Accuracy in % for node classification with GCNconv and different coarsening ratios

| GCNconv | Cora r=0.3 | Cora r=0.5 | Cora r=0.7 | Citeseer r=0.3 | Citeseer r=0.5 | Citeseer r=0.7 |
|---|---|---|---|---|---|---|
| $S_c^{\mathrm{sym}}$ | 80.1 ± 1.3 | 78.1 ± 1.3 | 30.8 ± 2.5 | 71.0 ± 1.4 | 62.5 ± 11 | 52.7 ± 3.6 |
| $S_c^{\mathrm{diff}}$ | 81.9 ± 1.0 | 74.5 ± 0.9 | 62.6 ± 7.1 | 72.7 ± 0.4 | 71.2 ± 1.7 | 37.6 ± 0.9 |
| $S_c$ | 81.2 ± 0.8 | 79.9 ± 0.9 | 78.1 ± 1.0 | 71.7 ± 0.6 | 70.7 ± 1.0 | 67.1 ± 3.1 |
| $S_c^{\mathrm{diag}}$ | 81.4 ± 0.8 | 80.4 ± 0.8 | 78.6 ± 1.3 | 72.1 ± 0.6 | 70.2 ± 0.8 | 69.3 ± 1.9 |
| $S_c^{MP}$ (ours) | 82.1 ± 0.5 | 79.8 ± 1.5 | 78.2 ± 0.9 | 72.8 ± 0.8 | 72.0 ± 0.8 | 70.0 ± 1.0 |
| Max acc possible | 96.5 | 92.5 | 88.9 | 93.5 | 90.5 | 84.5 |
| Full graph | 81.6 ± 0.6 | | | 73.1 ± 1.5 | | |

Scaling to larger datasets. We performed experiments on the Reddit dataset [18], which is approximately 100 times bigger than Cora or Citeseer. The Message-Passing error for the different coarsened propagation matrices is reported in Table 3, together with the node prediction results for two coarsening ratios $r = 90\%$ and $r = 99\%$ (their numbers of nodes and edges can be found in App. B.3); the details of the hyperparameters and coarsening procedure are in App. B.6. On coarsened graphs, our propagation matrix achieved a lower Message-Passing error, close to the RSA constant computed on the coarsened graph. This is consistent with the fact that, with our propagation matrix, the Message-Passing error is bounded by Theorem 1. Similarly, for the node prediction results, our propagation matrix $S_c^{MP}$ achieves good results with the SGC model, close to the maximum accuracy possible on the given coarsening. Our propagation matrix is still competitive with the GCNconv model and achieved better results at the highest coarsening ratio. These experiments show the effectiveness of our method on large graphs, for which coarsening as a preprocessing step is crucial: indeed, on most small-scale machines with a single GPU, the Reddit dataset is too large to fit in memory and requires adapted strategies.

Table 3: Accuracy in % for node classification on the Reddit dataset and Message-Passing errors

| Reddit | SGC r=0.90 | SGC r=0.99 | GCNconv r=0.90 | GCNconv r=0.99 | MP error r=0.90 | MP error r=0.99 |
|---|---|---|---|---|---|---|
| $S_c^{\mathrm{sym}}$ | 37.1 ± 6.6 | 3.7 ± 5.5 | 48.1 ± 8.9 | 34.8 ± 4.0 | 4.73e16 | 2.07e27 |
| $S_c^{\mathrm{diff}}$ | 18.3 ± 0.0 | 14.9 ± 0.0 | 71.3 ± 1.0 | 18.7 ± 1.7 | 0.92 | 1.00 |
| $S_c$ | 87.5 ± 0.1 | 37.3 ± 0.0 | 88.0 ± 0.1 | 54.2 ± 2.4 | 2.46 | 1.75 |
| $S_c^{\mathrm{diag}}$ | 87.6 ± 0.1 | 37.3 ± 0.0 | 88.1 ± 0.2 | 55.5 ± 1.8 | 2.45 | 1.74 |
| $S_c^{MP}$ (ours) | 90.2 ± 0.0 | 64.1 ± 0.0 | 84.4 ± 0.3 | 60.3 ± 0.9 | 0.22 | 0.88 |
| Max acc possible | 93.4 | 64.7 | 93.4 | 64.7 | Not applicable | |
| Full graph | 94.9 | | Non computable (OOM) | | Not applicable | |

5 Conclusion

In this paper, we investigated the interactions between graph coarsening and Message-Passing for GNNs. Surprisingly, we found that even for high-quality coarsenings with strong spectral preservation guarantees, naive (but natural) choices of propagation matrix on the coarsened graph do not lead to guarantees with respect to message-passing on the original graph.
We thus proposed a new message-passing matrix specific to coarsened graphs, which naturally translates spectral preservation to message-passing guarantees, for any coarsening, under some hypotheses relating to the structure of the Laplacian and the original propagation matrix. We then showed that such guarantees extend to GNN, and in particular to the SGC model, such that training on the coarsened graph is provably close to training on the original one. There are many outlooks to this work. Concerning the coarsening procedure itself, which was not the focus of this paper, new coarsening algorithms could emerge from our theory, e.g. by instantiating an optimization problem with diverse regularization terms stemming from our theoretical bounds. The scalability of such coarsening algorithms is also an important topic for future work. From a theoretical point of view, a crucial point is the interaction between non-linear activation functions and the low-frequency vectors in a graph (Assumption 4). We focused on the SGC model here, but a more in-depth study of particular graph models (e.g. random geometric graphs) could shed light on this complex phenomenon, which we believe to be a major path for future work. Acknowledgments and Disclosure of Funding The authors acknowledge the fundings of France 2030, PEPR IA, ANR-23-PEIA-0008 and ANR Grand Ma ANR-21-CE23-0006. [1] Z. Allen-Zhu, Z. Liao, and L. Orecchia. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 237 245, 2015. [2] E. Araya and Y. de Castro. Latent distance estimation for random geometric graphs. Advances in Neural Information Processing Systems (Neur IPS), 32, 2019. ISSN 10495258. URL http://arxiv.org/abs/1909.06841. [3] F. M. Bianchi and V. Lachi. The expressive power of pooling in graph neural networks. Advances in neural information processing systems, 36, 2024. [4] G. Bravo Hermsdorff and L. Gunderson. A unifying framework for spectrum-preserving graph sparsification and coarsening. Advances in Neural Information Processing Systems, 32, 2019. [5] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veliˇckovi c. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. ar Xiv:2104.13478, 2021. URL http://arxiv.org/ abs/2104.13478. [6] C. Cai, D. Wang, and Y. Wang. Graph coarsening with neural networks. In 9th International conference on Learning Representations, 2021. [7] J. Chen, Y. Saad, and Z. Zhang. Graph coarsening: from scientific computing to machine learning, volume 79. Springer International Publishing, 2022. ISBN 4032402100282. doi: 10.1007/s40324-021-00282-x. URL https://doi.org/10.1007/s40324-021-00282-x. [8] Y. Chen, R. Yao, Y. Yang, and J. Chen. A gromov-wasserstein geometric view of spectrumpreserving graph coarsening. In International Conference on Machine Learning, pages 5257 5281. PMLR, 2023. [9] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29, 2016. [10] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors a multilevel approach. IEEE transactions on pattern analysis and machine intelligence, 29(11):1944 1957, 2007. [11] C. Dickens, E. Huang, A. Reganti, J. Zhu, K. Subbian, and D. Koutra. Graph coarsening via convolution matching for scalable graph neural network training. 
In Companion Proceedings of the ACM on Web Conference 2024, pages 1502 1510, 2024. [12] F. Dorfler and F. Bullo. Kron reduction of graphs with applications to electrical networks. IEEE Transactions on Circuits and Systems I: Regular Papers, 60(1):150 163, 2012. [13] D. Ediger, K. Jiang, J. Riedy, D. A. Bader, C. Corley, R. Farber, and W. N. Reynolds. Massive social network analysis: Mining twitter for social good. Proceedings of the International Conference on Parallel Processing, pages 583 593, 2010. ISSN 01903918. doi: 10.1109/ICPP. 2010.66. [14] J. Gasteiger, C. Qian, and S. Günnemann. Influence-based mini-batching for graph neural networks. 12 2022. URL http://arxiv.org/abs/2212.09083. [15] C. L. Giles, K. D. Bollacker, and S. Lawrence. Citeseer: An automatic citation indexing system, 1998. URL www.neci.nj.nec.com. [16] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural Message Passing for Quantum Chemistry. In International Conference on Machine Learning (ICML), pages 1 14, 2017. ISBN 978-1-4577-0079-8. doi: 10.1002/nme.2457. URL http://arxiv.org/abs/ 1704.01212. [17] D. Grattarola, D. Zambon, F. M. Bianchi, and C. Alippi. Understanding pooling in graph neural networks. IEEE transactions on neural networks and learning systems, 35(2):2708 2718, 2022. [18] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. [19] P. Hu and W. C. Lau. A Survey and Taxonomy of Graph Sampling. pages 1 34, 2013. URL http://arxiv.org/abs/1308.5865. [20] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. Neural Information Processing Systems (Neur IPS), (Neur IPS):1 34, 2020. URL http://arxiv.org/abs/2005.00687. [21] Z. Huang, S. Zhang, C. Xi, T. Liu, and M. Zhou. Scaling up Graph Neural Networks Via Graph Coarsening, volume 1. Association for Computing Machinery, 2021. ISBN 9781450383325. doi: 10.1145/3447548.3467256. [22] W. Jin, L. Zhao, S. Zhang, Y. Liu, J. Tang, and N. Shah. Graph condensation for graph neural networks. In International Conference on Learning Representations, 2021. [23] W. Jin, X. Tang, H. Jiang, Z. Li, D. Zhang, J. Tang, and B. Yin. Condensing graphs via one-step gradient matching. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 720 730, 2022. [24] Y. Jin, A. Loukas, and J. Ja Ja. Graph coarsening with preserved spectral properties. In International Conference on Artificial Intelligence and Statistics, pages 4452 4462. PMLR, 2020. [25] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359 392, 1998. [26] N. Keriven. Not too little, not too much: a theoretical analysis of graph (over)smoothing. Advances in Neural Information Processing Systems (Neur IPS), 2022. URL http://arxiv. org/abs/2205.12156. [27] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016. [28] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized Page Rank. 7th International Conference on Learning Representations, ICLR 2019, pages 1 15, 2019. [29] M. Kumar, A. Sharma, S. Saxena, and S. Kumar. Featured graph coarsening with similarity guarantees. In International Conference on Machine Learning, pages 17953 17975. 
PMLR, 2023. [30] J. Lee, I. Lee, and J. Kang. Self-attention graph pooling. In International conference on machine learning, pages 3734 3743. PMLR, 2019. [31] Y. T. Lee and H. Sun. Constructing linear-sized spectral sparsification in almost-linear time. SIAM Journal on Computing, 47(6):2315 2336, 2018. [32] A. Loukas. Graph reduction with spectral and cut guarantees. Journal of Machine Learning Research, 20(116):1 42, 2019. [33] A. Loukas and P. Vandergheynst. Spectrally approximating large graphs with smaller graphs. In International conference on machine learning, pages 3237 3246. PMLR, 2018. [34] T. Ma and J. Chen. Unsupervised learning of graph hierarchical abstractions with differentiable coarsening and optimal transport. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 8856 8864, 2021. [35] A. K. Mccallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning, 2000. URL www.campsearch.com. [36] J. W. Ruge and K. Stüben. Algebraic multigrid. In Multigrid methods, pages 73 130. SIAM, 1987. [37] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61 80, 2008. [38] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981 1025, 2011. [39] A. Tsitsulin, J. Palowitch, B. Perozzi, and E. Müller. Graph clustering with graph neural networks. Journal of Machine Learning Research, 24(127):1 21, 2023. [40] J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, and D. L. Lee. Billion-scale commodity embedding for E-commerce recommendation in alibaba. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 839 848, 2018. doi: 10.1145/3219819.3219869. [41] T. Wang, J.-Y. Zhu, A. Torralba, and A. A. Efros. Dataset distillation. ar Xiv preprint ar Xiv:1811.10959, 2018. [42] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861 6871. PMLR, 2019. [43] Z. Yang, W. Cohen, and R. Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pages 40 48. PMLR, 2016. [44] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31, 2018. [45] X. Zheng, M. Zhang, C. Chen, Q. V. H. Nguyen, X. Zhu, and S. Pan. Structure-free graph condensation: From large-scale graphs to condensed graph-free data. Advances in Neural Information Processing Systems, 36, 2024. [46] J. Zhu, R. A. Rossi, A. Rao, T. Mai, N. Lipka, N. K. Ahmed, and D. Koutra. Graph Neural Networks with Heterophily. 35th AAAI Conference on Artificial Intelligence, AAAI 2021, 12B: 11168 11176, 2021. We start by a small useful lemma. Lemma 1. For all ker(L)-preserving matrices M, we have Mx L M L x L. Proof. Take the orthogonal decomposition x = u + v where u ker(L) and v ker(L) . Then, since x L = v L, L 1 2 L 1 2 v = v and Mu L = 0, Mx L Mv L = L 1 2 ML 1 2 L 1 2 v M L v L A.1 Proof of Proposition 2 Proof. The fact that Lc = (Q+)LQ+ in this particular case is done by direct computation. It results that xc Lc = L 1 2 Πx and | x L xc Lc | L 1 2 x L 1 2 Πx = x Πx L ϵL,Q,R x L by definition of ϵL,Q,R. A.2 Proof of Theorem 1 Proof. 
Since x R and S is R-preserving, we have Π x L ϵL,Q,R x L where Π = IN Π, and similarly for Sx. Moreover, under Assumption 1, both Π and S are ker(L)-preserving, such that ΠSx L ΠS L x L for all x. Then Sx Q+SMP c xc L = Sx ΠSΠx L = Sx ΠSx + ΠSx ΠSΠx L = Π Sx + ΠSΠ x L Π Sx L + ΠSΠ x L ϵL,Q,R Sx L + ΠS L Π x L ϵL,Q,R Sx L + ϵL,Q,R ΠS L x L = ϵL,Q,R x L (CS + CΠ) A.3 Proof of Theorem 2 Recall that the GNN is such that H0 = X, and Hl = σ(SHl 1θl) RN dℓ, Φθ(X, S) = Hk RN Similarly, for the GNN on coarsened graph we denote by H0 c = Xc and its layers Hl c = σ(SMP c Hl 1 c θl) Rn dℓ, Φθ(Xc, SMP c ) = Hk c RN For some set of parameters θ of a GNN, we define Cθ,l = sup i j |θl ij|, Cθ,l = We start with a small lemma. Lemma 2. Define Bl = Bl(X) := X i Hl :,i L (12) Then we have Bl Cθ,l Cl SCl σ X :,L (13) Proof. From assumption 4 we have σ(x) L Cσ x L. Then, since S is ker(L)-preserving from Assumption 1, by Lemma 1 X i Hl :,i L = X i σ(SHl 1θl :,i) L Cσ X i SHl 1θl :,i L i |θl ji|) X j SHl 1 :,j L CσCθ,l CSBl 1 Since B0 = X :,L, we obtain the result Proof. We start with classical risk bounding in machine learning J(Φθc(X, S)) J(Φθ (X, S)) = J(Φθc(X, S)) J(Q+Φθc(Xc, SMP c )) + J(Q+Φθc(Xc, SMP c )) J(Q+Φθ (Xc, SMP c )) + J(Q+Φθ (Xc, SMP c )) J(Φθ (X, S)) 2 sup θ Θ |J(Φθ(X, S)) J(Q+Φθ(Xc, SMP c ))| since θc minimizes J(Q+Φθ(Xc, SMP c )). For all θ, by Assumption 3, we have |J(Φθ(X, S)) J(Q+Φθ(Xc, SMP c ))| CJ Φθ(X, S) Q+Φθ(Xc, SMP c ) L (14) We will prove a recurrence bound on i Hl :,i Q+(Hl c):,i L From Assumption 4, i σ(SHl 1(θl):,i) Q+σ(SMP c Hl 1 c (θl):,i) L i σ(SHl 1(θl):,i) σ(Q+SMP c Hl 1 c (θl):,i) L i SHl 1(θl):,i Q+SMP c Hl 1 c (θl):,i L SHl 1 :,j Q+SMP c (Hl 1 c ):,j L We then write SHl 1 :,j Q+SMP c (Hl 1 c ):,j L SHl 1 :,j Q+SMP c QHl 1 :,j L + Q+SMP c QHl 1 :,j Q+SMP c (Hl 1 c ):,j L Then note that, since both S and σ are R-preserving, for all l, i we have that (Hl):,i R. We can thus apply Theorem 1 to the first term: SHl 1 :,j Q+SMP c QHl 1 :,j L ϵL,Q,R(CS + CΠ) Hl 1 :,j L The second term is 0 when l = 1 since H0 c = QH0. Otherwise, using QQ+ = In, and since under Assumption 1 both S and Π are ker(L)-preserving, applying Lemma 1: Q+SMP c QHl 1 :,j Q+SMP c (Hl 1 c ):,j L = Q+SMP c Q(Hl 1 :,j Q+(Hl 1 c ):,j) L = ΠSΠ(Hl 1 :,j Q+(Hl 1 c ):,j) L ΠSΠ L Hl 1 :,j Q+(Hl 1 c ):,j L At the end of the day, we obtain El CσCθ,l(ϵL,Q,R(CS + CΠ)Bl 1 + CΠEl 1) Cl σ Cθ,l Cl 1 S (CS + CΠ)ϵL,Q,R X :,L + CσCθ,l CΠEl 1 using Lemma 2, and E1 ϵL,Q,RCσCθ,1(CS + CΠ) X :,L. We recognize a recursion of the form un anc + bnun 1, which leads to un Pn p=2 ap Qn i=p+1 bi + u1 Qn i=2 bi, which results in: Ek ϵL,Q,R X :,LCk σ Cθ,k, (CS + CΠ) l=1 Ck l Π Cl 1 S (15) This concludes the proof with CΘ = maxθ Θ Cθ,k. B Coarsening algorithm and experimental details B.1 Adaptation of Loukas Algorithm You can find below the pseudo-code of Loukas algorithm. This algorithm works by greedy selection of contraction sets (see below) according to some cost, merging the corresponding nodes, and iterate. The main modification is to replace the combinatorial Laplacian in the Loukas code by any Laplacian L = f L(A), and to update the adjacency matrix according to (4) at each iteration and recompute L, instead of directly updating L as in the combinatorial Laplacian case. Note that we also remove the diagonal of Ac at each iteration, as we find that it produces better results. The output of the algorithm is the resulting coarsening Q, as well as SMP c = QSQ+ for our application. 
Algorithm 1 Loukas Algorithm Require: Adjacency matrix A, Laplacian L = f L(A), propagation matrix S, a coarsening ratio r , preserved space R, maximum number of nodes merged at one coarsening step : ne 1: nobj int(N N r) the number of nodes wanted at the end of the algorithm. 2: compute cost matrix B0 V V T L 1/2 with V an orthonormal basis of R 3: Q IN 4: while n nobj do 5: Make one coarsening STEP l 6: Create candidate contraction sets. 7: For each contraction C, compute cost(C, Bl 1, Ll 1) = ΠCBl 1(BT l 1Ll 1Bl 1) 1/2 LC |C| 1 8: Sort the list of contraction set by the lowest score 9: Select the lowest scores non overlapping contraction set while the number of nodes merged is inferior to min(n nobj, ne) 10: Compute Ql, Q+ l , uniform intermediary coarsening with contraction sets selected 11: Bl Ql Bl 1 12: Q Ql Q 13: Al (Q+ l ) Al 1Q+ l diag((Q+ l ) Al 1Q+ l )1n) 14: Ll 1 = f L(Al 1) 15: n min(n nobj, ne) 16: end while 17: IF uniform coarsening THEN Q row-normalize(Ql Q) 18: Compute SMP c = QSQ+ 19: return Q, SMP c The terms ΠC and LC are some specific projection of the contraction set, their explicit definition can be find in Loukas work [32]. We did not modify them here and leave their eventual adaptation for future work. Enforcing the iterative/greedy aspect In our adaptation we also add a parameter ne to limit the number of nodes contracted at each coarsening step. In one coarsening step, when a contraction set C is selected, we merge |C| nodes. In practice Loukas proposed in its implementation to force ne = and coarsen the graph in one single iteration. We observed empirically better results by diminishing ne and combining it with enforcing the uniform coarsening (Appendix B.4). Candidate contraction Set. Candidate contractions sets come in two main flavors: they can be each two nodes linked by edges, or the neighborhood of each nodes (so-called "variation edges" and "variation neighborhood" versions). In practice, as the neighborhood are quite big in our graphs, it is not very convenient for small coarsening ratio and give generally poor results. We will use mainly the edges set as candidate contraction sets and adjust the parameter ne to control the greedy aspect of this algorithm. Uniform Coarsening At each coarsening step, in Loukas algorithm Ql is uniform by construction. Nonetheless the product of uniform coarsening is not necessarily an uniform coarsening. Then, we propose an option to force the uniform distribution in the super-nodes in the Loukas algorithm by normalize the non zero values of each line of the final coarsening matrix Q. We observe that uniform coarsening gives better results for ϵL,Q,R, and works better for our message passing guarantees. See Appendix B.4. B.2 Discussion on Training procedure The pseudocode of our training procedure is detailed in Algo. 2. Algorithm 2 Training Procedure Require: Adjacency A, node features X, desired propagation matrix S, preserved space R, Laplacian L, a coarsening ratio r 1: Q, SMP c Coarsening-algorithm(A, L, S, r, R) 2: Xc QX 3: Initialize model (SGC or GCNconv) 4: for Nepochs iterations do 5: compute coarsened prediction Φθ(SMP c , Xc) 6: uplift the predictions : Q+Φθ(SMP c , Xc) 7: compute the cross entropy loss J(Q+Φθ(SMP c , Xc)) 8: Backpropagate the gradient 9: Update θ 10: end for Note that it is different from the procedure of [21] which computes labels for the super-nodes (using the majority label in the coarsening cluster) and do not use the uplifting matrix Q+. 
We find this procedure to be less amenable to semi-supervised learning, as super-nodes may merge training and testing nodes, and prefer to uplift the output of the GNN in the original graph instead. Additionally, this preserves the theoretical guarantees of Sec. 3. Our procedure might be slightly lower but we find the uplifting operation to be of negligible cost compared to actual backpropagation. B.3 Presentation of dataset Synthetic Dataset Random geometric graph is built by sampling nodes with coordinates in [0, 1]2 and connecting them if their distance is under a given threshold. For the experiment on illustrating the message passing preservation guarantees, we sample 1000 nodes with a threshold of 0.05 (fig 3 ). Table 4: Characteristics of Cora and Cite Seer Datasets Dataset # Nodes # Edges # Train Nodes # Val Nodes # Test Nodes Cora 2,708 10,556 140 500 1,000 Cora PCC 2,485 10,138 122 459 915 Citeseer 3,327 9,104 120 500 1,000 Citeseer PCC 2,120 7,358 80 328 663 Real World datasets We restrict the well known Cora and Citeseer to their principal connected component(PCC) as it more compatible with coarsening as preprocessing. Indeed, the loukas Figure 3: Example of a random Geometric graph algorithm tend to coarsen first the smallest connected components before going to the biggest which leads to poor results for small coarsening ratio. However working with this reduced graph make the comparison with other works more difficult as it is not the same training and evaluating dataset. For the Reddit dataset ( 1 PCC) its characteristics and of its coarsened version as well of the Reddit and Cora coarsened dataset can be find in the table 5 Table 5: Characteristics of Reddit, Cora, Citeseer and its coarsen version Dataset # Nodes # Edges # Features #classes Reddit 232,965 114,615,892 602 41 Reddit90 23,298 8,642,864 602 41 Reddit99 2,331 10,838 602 41 Cora PCC 2,485 10,138 1,433 7 Cora70 746 3,716 1,433 7 Citeseer PCC 2,120 7,358 3,703 6 Citeseer70 636 2,122 3,703 6 B.4 Discussion of hyperparameters and additional experiments In the following section, we will use two different view of the same plot, to focus on different parts. We use the log-log scale (fig 4a) to put the focus on low coarsening ratio and on the upper bound. We use the linear scale (fig 4b) to compare more precisely our propagation matrix with Sdiag c and Sc for higher coarsening ratio. 10 2 10 1 r 104 SMP c Sdiff c Ssym c Sc Sdiag c CϵL,Q,R (a) Log Log scale 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 r 1.0 SMP c Sc Sdiag c Sdiff c (b) Linear Linear Scale Figure 4: Uniform coarsening with ne = 5N/100 and Normalized Laplacian Uniform coarsening We observe that forcing uniform coarsening gives better ϵL,Q,R and thus better message passing guarantees . It is shown in the figure 5 for ne = 5N/100 with N the number of Nodes of the graph (1000 here). 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 r 1.0 SMP c Sc Sdiag c Sdiff c (a) uniform coarsening 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 r 1.2 SMP c Sc Sdiag c Sdiff c (b) Non uniform coarsening Figure 5: Coarsening with ne = 5N/100 and Normalized Laplacian Bounding ne. For high coarsening ratio, we observe limits of the variation edges defined as Loukas with ne as it gives bigger ϵL,Q,R and thus worse curve for our propagation matrix in the coarsened graph (fig 6). B.5 Hyper-parameters for Table 1 and Table 2 For all experiments, we preserve K eigenvectors of the normalized Laplacian defined as L = IN(1 + δ) S with δ = 0.001 and K = 10%N where N is the number of nodes in the original graph. 
We apply our adapted version of Loukas' coarsening algorithm with n_e = 5%·N for SGC Cora, SGC Citeseer and GCN Citeseer, and with n_e as in Loukas' original variation-edges setting (no bound on n_e) for GCN Cora. For SGC Cora and SGC Citeseer, we perform 6 propagations as feature preprocessing. For GCN Cora and GCN Citeseer, we use 2 convolutional layers with a hidden dimension of 16. For all experiments, we use the Adam optimizer with a learning rate of 0.05 and a weight decay of 0.01.

B.6 Hyper-parameters for Table 3
For the experiment on the Reddit dataset, we preserve K eigenvectors of the normalized Laplacian defined as L = I_N(1 + δ) − S, with δ = 0.001 and K = 400 eigenvectors to remain computationally efficient (10%·N being too large). We apply our adapted version of Loukas' coarsening algorithm with n_e = 10%·N for SGC Reddit and GCN Reddit. We compute 6 propagations for Reddit SGC and 2 for Reddit GCN. We keep the same hidden dimension as for Cora and Citeseer. For the Reddit experiments, we use the Adam optimizer with a learning rate of 0.1 and a weight decay of 0.0.

Figure 6: Uniform coarsening for the normalized Laplacian; curves for S_c^{MP}, S_c, S_c^{diag}, S_c^{diff}. (a) n_e = 5N/100.

NeurIPS Paper Checklist
i) Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: This is a mostly theoretical paper. Theorems and their implications are described in the abstract and introduction.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
ii) Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: This is a mostly theoretical paper. Hypotheses are illustrated by examples and limitations are discussed. For experiments, scalability is discussed.
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting.
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
iii) Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Proofs are provided in the Appendix.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
iv) Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Code is included as supplementary material, and can be run on any computer.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (i) if the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm; (ii) if the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully; (iii) if the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset); (iv) we recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
v) Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Code is included as supplementary material, and uses only open-source Python libraries.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
vi) Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: All details are described in the Appendix, and the code is in the supplementary material.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
vii) Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Standard deviations are reported in the result tables.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
viii) Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: This is small-scale code: it can be run on any computer in reasonable time.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
ix) Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: This is a mostly theoretical paper. The code uses only open-source Python libraries.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
x) Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This is a mostly theoretical paper. We do not anticipate significant societal impact as a direct result of our work.
Future algorithmic work on the scalability of graph coarsening could include such a discussion, but this is relatively off-topic for this paper.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
xi) Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper does not include such models.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
xii) Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The original paper for Cora and Citeseer is cited. Details are given in the Appendix.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
xiii) New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
xiv) Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
xv) Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.