# graph_sparsification_via_mixture_of_graphs__f401a599.pdf

Published as a conference paper at ICLR 2025

GRAPH SPARSIFICATION VIA MIXTURE OF GRAPHS

Guibin Zhang1 , Xiangguo Sun2 , Yanwei Yue1 , Chonghe Jiang2, Kun Wang3 , Tianlong Chen4 , Shirui Pan5

1Tongji University 2CUHK 3NTU 4UNC-Chapel Hill 5Griffith University

Graph Neural Networks (GNNs) have demonstrated superior performance across various graph learning tasks but face significant computational challenges when applied to large-scale graphs. One effective approach to mitigate these challenges is graph sparsification, which involves removing non-essential edges to reduce computational overhead. However, previous graph sparsification methods often rely on a single global sparsity setting and uniform pruning criteria, failing to provide customized sparsification schemes for each node s complex local context. In this paper, we introduce Mixture-of-Graphs (Mo G), leveraging the concept of Mixtureof-Experts (Mo E), to dynamically select tailored pruning solutions for each node. Specifically, Mo G incorporates multiple sparsifier experts, each characterized by unique sparsity levels and pruning criteria, and selects the appropriate experts for each node. Subsequently, Mo G performs a mixture of the sparse graphs produced by different experts on the Grassmann manifold to derive an optimal sparse graph. One notable property of Mo G is its entirely local nature, as it depends on the specific circumstances of each individual node. Extensive experiments on four large-scale OGB datasets and two superpixel datasets, equipped with five GNN backbones, demonstrate that Mo G (I) identifies subgraphs at higher sparsity levels (8.67% 50.85%), with performance equal to or better than the dense graph, (II) achieves 1.47 2.62 speedup in GNN inference with negligible performance drop, and (III) boosts top-student GNN performance (1.02% on Rev GNN+OGBNPROTEINS and 1.74% on Deeper GCN+OGBG-PPA). The source code is available at https://github.com/yanweiyue/Mo G.

1 INTRODUCTION

Graph Neural Networks (GNNs) (Sun et al., 2023a; Zhou et al., 2020) have become prominent for confronting graph-related learning tasks, including social recommendation (Wu et al., 2021; Yu et al., 2022), fraud detection (Sun et al., 2022; Wang et al., 2019a; Cheng et al., 2020), drug design (Zhang & Liu, 2023), and many others (Wu et al., 2023; Sun et al., 2023b). The superiority of GNNs stems from iterative aggregation and update processes. The former accumulates embeddings from neighboring nodes via sparse matrix-based operations (e.g., sparse-dense matrix multiplication (Sp MM) and sampled dense-dense matrix multiplication (SDDMM) (Fey & Lenssen, 2019; Wang et al., 2019b)), and the latter updates the central nodes embeddings using dense matrix-based operations (e.g., Mat Mul) (Fey & Lenssen, 2019; Wang et al., 2019b). Sp MM typically contributes the most substantial part ( 70%) to the computational demands (Liu et al., 2023b; Zhang et al., 2024b), influenced largely by the graph s scale. Nevertheless, large-scale graphs are widespread in real-world scenarios (Wang et al., 2022a; Jin et al., 2021; Zhang et al., 2024a), leading to substantial computational burdens, which hinder the efficient processing of features during the training and inference, posing headache barriers to deploying GNNs in the limited resources environments.

To conquer the above challenge, graph sparsification (Chen et al., 2023; Hashemi et al., 2024) has recently seen a revival as it directly reduces the aggregation process associated with Sp MM (Liu et al., 2023b; Zhang et al., 2024b) in GNNs. Specifically, graph sparsification is a technique that approximates a given graph by creating a sparse subgraph with a subset of vertices and/or edges. Since the execution time of Sp MM is directly related to the number of edges in the graph, this method can significantly accelerate GNN training or inference. Existing efforts such as UGS (Chen et al.,

Equal Contribution, Corresponding Authors

Published as a conference paper at ICLR 2025

1-hop 2-hop 3-hop

Pruning attribution: Non-Bridge Edge

Pruning attribution: Non-homophilic Edge

Pruning attribution: Low Effective Resistance

Expert 1 Expert 2 Expert 3

Aggregated Sparse Graph

Extract 1-hop

Expert consensus

Center Node: 14 Center Node: 6

Figure 1: (Left) We illustrated the k-hop neighborhood expansion rates for nodes 6 and 14, which is proportional to the amount of message they receive as the GNN layers deepen; (Middle) The local patterns of different nodes vary, hence the attributions of edge pruning may also differ. For instance, pruning (v1, v2) might be due to its non-bridge identity, while pruning (v5, v6) could be attributed to its non-homophilic nature; (Right) The overview of our proposed Mo G.

2021), DSpar (Liu et al., 2023b), and Ada GLT (Zhang et al., 2023) have achieved notable successes, with some maintaining GNN performance even with up to 40% edge sparsity.

Beyond serving as a computational accelerator, the purpose of graph sparsification extends further. Another research line leverages graph sparsification as a performance booster to remove taskirrelevant edges and pursue highly performant and robust GNNs (Zheng et al., 2020). Specifically, it is argued that due to uncertainty and complexity in data collection, graph structures are inevitably redundant, biased, and noisy (Li et al., 2024). Therefore, employing graph sparsification can effectively facilitate the evolution of graph structures towards cleaner conditions (Zheng et al., 2020; Luo et al., 2021), and finally boost GNN performance.

However, existing sparsification methods, namely sparsifiers, whether aimed at achieving higher sparsity or seeking enhanced performance, often adopt a rigid, global approach to conduct graph sparsification, thus suffering from the inflexibility in two aspects:

❶Inflexibility of sparsity level. Previous sparsifiers globally score all edges uniformly and prune them based on a preset sparsity level (Chen et al., 2023). However, as shown in Figure 1 (Left), the degrees of different nodes vary, which leads to varying rates of k-hop neighborhood expansion. This phenomenon, along with prior work on node-wise aggregation (Lai et al., 2020; Wang et al., 2023a), suggests that different nodes require customized sparsity levels tailored to their specific connectivity and local patterns.

❷Inflexibility of sparsity criteria. Previous sparsifiers often operate under a unified guiding principle, such as pruning non-bridge edges (Wang et al., 2022b), non-homophilic edges (Gong et al., 2023), or edges with low effective resistance (Spielman & Srivastava, 2008; Liu et al., 2023b), among others. However, as illustrated in Figure 1 (Middle), the context of different nodes varies significantly, leading to varied rationales for edge pruning. Therefore, it is essential to select appropriate pruning criteria tailored to the specific circumstances of each node to customize the pruning process effectively.

Based on these observations and reflections, we propose the following challenge: Can we customize the sparsity level and pruning criteria for each node, in the meanwhile ensuring the efficiency of graph sparsification? Towards this end, we propose a novel graph sparsification method dubbed Mixture of Graphs (Mo G). It comprises multiple sparsifier experts, each equipped with distinct pruning criteria and sparsity settings, as in Figure 1 (Right). Throughout the training process, Mo G dynamically selects the most suitable sparsifier expert for each node based on its neighborhood properties. This fosters specialization within each Mo G expert, focusing on specific subsets of nodes with similar neighborhood contexts. After each selected expert prunes the 1-hop subgraph of the central nodes and outputs its sparse version, Mo G seamlessly integrates these sparse subgraphs on the Grassmann manifold in an expert-weighted manner, thereby forming an optimized sparse graph.

We validate the effectiveness of Mo G through a comprehensive series of large-scale tasks. Experiments conducted across six datasets and three GNN backbones showcase that Mo G can ❶effectively locate well-performing sparse graphs, maintaining GNN performance losslessly at satisfactory graph

Published as a conference paper at ICLR 2025

sparsity levels (8.67% 50.85%), and even only experiencing a 1.65% accuracy drop at 69.13% sparsity on OGBN-PROTEINS; ❷achieve a tangible 1.47 2.62 inference speedup with negligible performance drop; and ❸boost ROC-AUC by 1.81% on OGBG-MOLHIV, 1.02% on ogbn-proteins and enhances accuracy by 0.95% on OGBN-ARXIV compared to the vanilla backbones.

The key contributions of Mo G can be found in appendix H.2.

2 TECHNICAL BACKGROUND

Notations & Problem Formulation We consider an undirected graph G = {V, E}, with V as the node set and E the edge set. The node features of G is represented as X RN F , where N = |V| signifies the total number of nodes in the graph. The feature vector for each node vi V, with F dimensions, is denoted by xi = X[i, ]. An adjacency matrix A {0, 1}N N is utilized to depict the inter-node connectivity, where A[i, j] = 1 indicates an edge eij E, and 0 otherwise. For our task of graph sparsification, the core objective is to identify a subgraph Gsub given a sparsity ratio s%:

Gsub = {V, E \ E }, s% = |E |

where Gsub only modifies the edge set E without altering the node set V, and E denotes the removed edges, and s% represents the ratio of removed edges.

Graph Neural Networks Graph neural networks (GNNs) (Wu et al., 2020) have become pivotal for learning graph representations, achieving benchmark performances in various graph tasks at node-level (Xiao et al., 2022), edge-level (Sun et al., 2021), and graph-level (Liu et al., 2022a). At the node-level, two of the most famous frameworks are GCN (Kipf & Welling, 2017) and Graph SAGE (Hamilton et al., 2017), which leverages the message-passing neural network (MPNN) framework (Gilmer et al., 2017) to aggregate and update node information iteratively. For edge-level and graph-level tasks, GCN and Graph SAGE can be adapted by simply incorporating a predictor head or pooling layers. Nevertheless, there are still specialized frameworks like SEAL (Zhang & Chen, 2018) and Neo-GNN (Yun et al., 2021) for link prediction, and Diff Pool (Ying et al., 2018) and PNA (Corso et al., 2020) for graph classification. Regardless of the task, MPNN-style GNNs generally adhere to the following paradigm:

h(l) i = COMB h(l 1) i , AGGR{h(k 1) j : vj N(vi)} , 0 l L (2)

where L is the number of GNN layers, h(0) i = xi, and h(l) i (1 l L) denotes vi s node embedding at the l-th layer. AGGR( ) and COMB( ) represent functions used for aggregating neighborhood information and combining egoand neighbor-representations, respectively.

Graph Sparsification Graph sparsification methods can be categorized by their utility into two main types: computational accelerators and performance boosters. Regarding computational accelerators, early works aimed at speeding up traditional tasks like graph partitioning/clustering often provide theoretical assurances for specific graph properties, such as pairwise distances (Althöfer et al., 1990), cuts (Abboud et al., 2022), eigenvalue distribution (Batson et al., 2013), and effective resistance (Spielman & Srivastava, 2008). More contemporary efforts focus on the GNN training and/or inference acceleration, including methods like SGCN (Li et al., 2020b), GEBT (You et al., 2022), UGS (Chen et al., 2021), DSpar (Liu et al., 2023b), and Ada GLT (Zhang et al., 2024a). Regarding performance boosters, methods like Neural Sparse (Zheng et al., 2020) and PTDNet (Luo et al., 2021) utilize parameterized denoising networks to eliminate task-irrelevant edges. SUBLIME (Liu et al., 2022b) and Nodeformer (Wu et al., 2022) also involve refining or inferring a cleaner graph structure followed by k-nearest neighbors (k NN) sparsification.

Mixture of Experts The Mixture of Experts (Mo E) concept (Jacobs et al., 1991) traces its origins to several seminal works (Chen et al., 1999; Jordan & Jacobs, 1994). Recently, the sparse Mo E architecture (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2021; Clark et al., 2022) has regained attention due to its capacity to support the creation of vast (language) models with trillions of parameters (Clark et al., 2022; Hoffmann et al., 2022). Given its stability and generalizability, sparse Mo E is now broadly implemented in modern frameworks across various domains, including vision (Riquelme et al., 2021), multi-modal (Mustafa et al., 2022), and multi-task learning (Ma et al.,

Published as a conference paper at ICLR 2025

Ego-graph Decomposition

Input Graph 𝒢 (𝒱, ℰ)

Ego subgraphs Decompose 𝒢 into

node-specific ego graphs 𝒢𝑖

Routing Network 𝝍𝒉𝒊= 𝒉𝒊𝑾𝒈+ ϵ Softplus 𝒉𝒊𝑾𝒏 Candidate Experts total K sparisifers; select k out of K Sparsifier Selection Softmax(Top K(𝜙𝒉𝒊, k))

Routing Network 𝜙ℎ𝑖

Expert Scores 𝐸(𝟏)

Expert Scores 𝐸(𝟐)

Expert Scores 𝐸(|𝒱|)

Customized Sparsifier

Sparsity Levels

90% 85% 60%

Pruning Criterion

Jaccard Similarity: Nodes shared neighbors

Gradient Manitude: |

Mixture-of-Graph

Effective Resisatnce

Sparsifier Expert Top K(𝐶𝒩𝑣𝑖 , 𝒩𝑣𝑖 𝑠)

Criteria: C( ) Sparsity: s

Criteria Sparsity

Expert No.1

Criteria Sparsity

Expert No.2

Criteria Sparsity

Expert No.K

Expert Scores for 𝒢𝑖 {𝐸1

Sparsity Ensembling

෪ 𝒢𝑖 (No.1)

෪ 𝒢𝑖: (dense) ego-graph for 𝑣𝑖

𝑒1 𝑒2 𝑒3 𝑒4 𝑒5

𝑒1 𝑒2 𝑒3 𝑒4 𝑒5

Mixture-of-Graph

ሚ𝒢 { ෪ 𝒢1 , ෪ 𝒢1 , , ෫ 𝒢|𝒱| }

Weighted Expert Consensus

෪ 𝒢𝑖 (No.k)

𝒢(1) 𝒢(2) 𝒢(3)

𝒢(4) 𝒢(5) 𝒢(6)

Post sparsification

Figure 2: The overview of our proposed method. Mo G primarily comprises ego-graph decomposition, expert routing, expert customization, and the final graph mixture. For simplicity, we only showcase three pruning criteria including Jaccard similarity, gradient magnitude, and effective resistance.

2018; Zhu et al., 2022). As for graph learning, Mo E has been explored for applications in graph classification (Hu et al., 2022), scene graph generation (Zhou et al., 2022), molecular representation (Kim et al., 2023), graph fairness (Liu et al., 2023a), and graph diversity modeling (Wang et al., 2024).

3 METHODOLOGY

3.1 OVERVIEW

Figure 2 illustrates the workflow of our proposed Mo G. Specifically, for an input graph, Mo G first decomposes it into 1-hop ego graphs for each node. For each node and its corresponding ego graph, a routing network calculates the expert scores. Based on the router s decisions, sparsifier experts with different sparsity levels and pruning criteria are allocated to different nodes. Ultimately, a mixture of graphs is obtained based on the weighted consensus of the sparsifier experts. In the following sections, we will first detail how to route different experts in Section 3.2, then describe how to explicitly model various sparsifier experts in Section 3.3 and how to ensemble the sparse graphs output by experts on the Grassmann manifold in Section 3.4. Finally, the overall optimization process and complexity analysis of Mo G is placed in Section 3.5.

3.2 ROUTING TO DIVERSE EXPERTS

Following the classic concept of a (sparsely-gated) mixture-of-experts (Zhao et al., 2024), which assigns the most suitable expert(s) to each input sample, Mo G aims to allocate the most appropriate sparsity level and pruning criteria to each input node. To achieve this, we first decompose the input graph G = {V, E} into 1-hop ego graphs centered on different nodes, denoted as {G(1), G(2), , G(N)}, where G(i) = {V(i), E(i)}, V(i) = {vj|vj N(vi)}, E(i) = {eij|(vi, vj) E}. Assuming we have K sparsifier experts, for each node vi and its corresponding ego graph G(i), we aim to select k most suitable experts. We employ the noisy top-k gating mechanism following Shazeer et al. (2017):

Ψ(G(i)) = Softmax(Top K(ψ(xi), k)), (3) ψ(xi) = xi Wg + ϵ Softplus(xi Wn), (4)

where ψ(xi) RK is the calculated scores of vi for total K experts, Top K( ) is a selection function that outputs the largest k values, and Ψ(G(i)) Rk = [E(i) 1 , E(i) 2 , , E(i) k ] represents those for

Published as a conference paper at ICLR 2025

selected k experts. In Ψ(G(i)), ϵ N(0, 1) denotes the standard Gaussian noise, Wg RK F and Wn RK F are trainable parameters that learn clean and noisy scores, respectively.

After determining the appropriate experts, we proceed to generate different sparse graphs with diverse sparsifiers. We denote each sparsifier by κ( ), which takes in a dense graph G and outputs a sparse one G = κ(G). Based on this, for each node vi and its ego graph G(i), the routing network selects k experts that produce k sparse ego graphs. Notably, sparsifiers differ in their pruning rates (i.e. the proportion of the edges to be removed) and the pruning criteria, which will be detailed in Section 3.3. Mo G s dynamic selection of different sparsifiers for each node aids in identifying pruning strategies truly adapted to the node s local context. Formally, the mixture of k sparse graphs can be written as: d G(i) = ESMB({ e G(i) m }k m=1), e G(i) m = κm(G(i)), (5) where ESM( ) is a combination function that receives k sparse graphs and ideally outputs an emsemble version d G(i) = {d V(i), d E(i)} that preserves their desirable properties. It is noteworthy that, Mo G can seamlessly integrate with any GNN backbone after obtaining each node s sparse ego graph. Specifically, we modify the aggregation method in Equation (2) as follows:

h(l) i = COMB h(l 1) i , AGGR{h(k 1) j : vj d V(i)} . (6)

Mo G acts as a plug-and-play module that can be pre-attached to any GNN architecture, leveraging multi-expert sparsification to enhance GNNs with (1) performance improvements from removing task-irrelevant edges (validated in Section 4.3); (2) resistance to high graph sparsity through precise and customized sparsification (validated in Section 4.2). The remaining questions now are: how can we design explicitly different sparsifiers? and further, how can we develop an effective combination function that integrates the sparse graphs from different experts?

3.3 CUSTOMIZED SPARSIFIER MODELING

With the workflow of Mo G in mind, in this section, we will delve into how to design sparsifiers driven by various pruning criteria and different levels of sparsity. Revisiting graph-related learning tasks, their objective can generally be considered as learning P(Y|G), which means learning the distribution of the target Y given an input graph. Based on this, a sparsifier κ( ) can be formally expressed as follows:

g SG P(Y | G)P( G | G) X

g SG QΘ(Y | G)Qκ( G | G) (7)

where SG is a class of sparsified subgraphs of G. The second term in Equation (7) aims to approximate the distribution of Y using the sparsified graph G as a bottleneck, while the third term uses two approximation functions QΘ and Qκ for P(Y | G) and P( G | G) parameterized by Θ and κ respectively. The parameter Θ typically refers to the parameters of the GNN, while the sparsifier κ( ), on the other hand, is crafted to take an ego graph G(i) and output its sparsified version G(i), guided by a specific pruning paradigm C and sparsity sm%:

κm(G(i)) = {V(i), E(i) \ E(i) p }, E(i) p = Top K Cm(E), |E(i)| sm% , (8)

where Cm( ) acts as the m-th expert s scoring function that evaluates edge importance. We leverage long-tail gradient estimation (Liu et al., 2020) to ensure the Top K( ) operator is differentiable. Furthermore, to ensure different sparsity criteria drive the sparsifier, we implement Cm( ) as follows:

Cm(eij) = FFN (xi, xj, c(eij)) , cm(eij)

Degree: (|N(vi) + N(vj)|) /2 Jaccard Similarity: |N (vi) N(vj)|

|N (vi) N(vj)| ER: (ei ej)T L 1(ei ej) Gradient Magnitude: | L/ eij|

where FFN( ) is a feed-forward network, cm(eij) represents the prior guidance on edge significance. By equipping different sparsifiers with various priors and sparsity levels, we can customize the most appropriate pruning strategy for each node s local scenario. In practice, we select four widely-used pruning criteria including edge degree (Seo et al., 2024), Jaccard similarity (Murphy, 1996a; Satuluri et al., 2011b), effective resistance (Spielman & Srivastava, 2008; Liu et al., 2023b) and gradient magnitude (Wan & Schweitzer, 2021; Zhang et al., 2024a). Details regarding these criteria and their implementations are in Appendix B.

Published as a conference paper at ICLR 2025

3.4 GRAPH MIXTURE ON GRASSMANN MANIFOLD

After employing k sparsifiers driven by different criteria and sparsity levels, we are in need of an effective mechanism to ensemble these k sparse subgraphs and maximize the aggregation of their advantages. A straightforward approach is voting or averaging (Sagi & Rokach, 2018); however, such simple merging may fail to capture the intricate relationships among multi-view graphs (Kang et al., 2020), potentially resulting in the loss of advantageous properties from all experts. Inspired by recent advances in manifold representations (Dong et al., 2013; Bendokat et al., 2024), we develop a subspace-based sparse graph ensembling mechanism. We first provide the definition of the Grassmann manifold (Bendokat et al., 2024) as follows: Definition 1 (Grassmann manifold). Grassmann manifold Gr(n, p) is the space of n-by-p matrices (e.g., M) with orthonormal columns, where 0 p n, i.e.,

Gr(n, p) = M|M Rn p, M M = I . (10)

According to Grassmann manifold theory, each orthonormal matrix represents a unique subspace and thus corresponds to a distinct point on the Grassmann manifold (Lin et al., 2020). This applies to the eigenvector matrix of the normalized Laplacian matrix (U = L[:, : p] Rn p), which comprises the first p eigenvectors and is orthonormal (Merris, 1995), and thereby can be mapped onto the Grassmann manifold.

Consider the k sparse subgraphs { e G(i) m }k m=1, their subspace representations are {U(i) m R|N (vi)| p}k m=1. We aim to identify an oracle subspace U(i) on the Grassmann manifold, which essentially represents a graph, that serves as an informative combination of k base graphs. Formally, we present the following objective function:

min U(i) R|N (vi)| p

tr(U(i) Lm U(i)) | {z } (1) node connectivity

expert score z}|{ E(i) m d2(U(i), U(i) m ) | {z } (2) subspace distance

, s. t. U(i) U(i) = I (11)

where tr( ) calculates the trace of matrices, Lm is the graph Laplacian of G(i) m , d2(U1, U2) denotes the project distance between two subspaces (Dong et al., 2013), and E(i) m is the expert score for the m-th expert, calculated by the routing network Ψ, which determines which expert s subspace the combined subspace should more closely align with. In Equation (11), the first term is designed to preserve the original node connectivity based on spectral embedding, and the second term controls that individual subspaces are close to the final representative subspace U(i). Using the Rayleigh-Ritz Theorem (Jia & Stewart, 2001), we provide a closed-form solution for Equation (11) and obtain the graph Laplacian of the ensemble sparse graph d G(i) as follows:

Lm E(i) m U(i) U(i) . (12)

We provide detailed derivations and explanations for Equations (11) and (12) in Appendix C. Consequently, we can reformulate the function ESMB( ) in Equation (5) as follows:

ESMB({ e G(i) m }k m=1) = {D(i) d L(i), X(i)} =

Lm E(i) m U(i) U(i) , X(i) )

where D(i) is the degree matrix of vi s ego-graph. On the Grassmann manifold, the subspace ensemble effectively captures the beneficial properties of each expert s sparse graph. After obtaining the final version of each node s ego-graph, d G(i) = { d A(i), X(i)}, we conduct a post-sparsification step as the graph ensembled on the Grassmann manifold can become dense again. Specifically, we obtain the final sparsity s(i)% for vi by weighting the sparsity of each expert and sparsifying d G(i).

d G(i) {Top K( d A(i), |E(i)| s(i)%), X(i)}, s(i)% = 1

m=1 sm%. (14)

Post-sparsified d G(i) are then reassembled together into b G {d G(1), d G(2), , \ G(|V|)}, with detailed explanations in appendix I. Ultimately, the sparsified graph b G produced by Mo G can be input into any MPNN (Gilmer et al., 2017) or graph transformer (Min et al., 2022) for end-to-end training.

Published as a conference paper at ICLR 2025

3.5 TRAINING AND OPTIMIZATION

Additional Loss Functions Following classic Mo E works (Shazeer et al., 2017; Wang et al., 2024), we introduce an expert importance loss to prevent Mo G from converging to a trivial solution where only a single group of experts is consistently selected:

Importance(V) =

j=1 E(i) m , Limportance(V) = CV(Importance(V))2, (15)

where Importance(V) represents the sum of each node s expert scores across the node-set, CV( ) calculates the coefficient of variation, and Limportance ensures the variation of experts. Therefore, the final loss function combines both task-specific and Mo G-related losses, formulated as follows:

L = Ltask + λ Limportance, (16)

where λ is a hand-tuned scaling factor, with its sensitivity analysis placed in Section 4.4.

Complexity Analysis To better illustrate the effectiveness and clarity of Mo G, we provide a comprehensive algorithmic table in Appendix D and detailed complexity analysis in Appendix E. To address concerns regarding the runtime efficiency of Mo G, we have included an empirical analysis of efficiency in Section 4.5.

4 EXPERIMENTS In this section, we conduct extensive experiments to answer the following research questions: (RQ1) Can Mo G effectively help GNNs combat graph sparsity? (RQ2) Does Mo G genuinely accelerate the GNN inference? (RQ3) Can Mo G help boost GNN performance? (RQ4) How sensitive is Mo G to its key components and parameters?

4.1 EXPERIMENT SETUP Datasets and Backbones We opt for four large-scale OGB benchmarks (Hu et al., 2020), including OGBN-ARXIV, OGBN-PROTEINS and OGBN-PRODUCTS for node classification, and OGBG-PPA for graph classification. The dataset splits are given by (Hu et al., 2020). Additionally, we choose two superpixel datasets, MNIST and CIFAR-10 (Knyazev et al., 2019). We select Graph SAGE (Hamilton et al., 2017), Deeper GCN (Li et al., 2020a), and PNA (Corso et al., 2020) as the GNN backbones. More details are provided in Appendix F.

Parameter Configurations For Mo G, we adopt the m = 4 sparsity criteria outlined in Section 3.3, assigning n = 3 different sparsity levels {s1, s2, s3} to each criterion, resulting in a total of K = m n = 12 experts. We select k = 2 sparsifier experts for each node, and set the loss scaling factor λ = 1e 2 across all datasets and backbones. By adjusting the sparsity combination, we can control the global sparsity of the entire graph. We present more details on parameter settings in Appendix F.4, and a recipe for adjusting the graph sparsity in Appendix F.6.

4.2 MOG AS GRAPH SPARSIFIER (RQ1 & RQ2)

To answer RQ1 and RQ2, we comprehensively compare Mo G with eleven widely-used topologyguided sparsifiers and five semantic-guided sparsifiers, as outlined in Table 1, with more detailed explanations in Appendix F.5. The quantitative results on five datasets are shown in Tables 1 and 9 to 12 and the efficiency comparison is in Figure 3. We give the following observations (Obs.):

Obs. ❶Mo G demonstrates superior performance in both transductive and inductive settings. As shown in Tables 1, 2 and 9 to 11, Mo G outperforms other sparsifiers in both transductive and inductive settings. Specifically, for node classification tasks, Mo G achieves a 0.09% performance improvement while sparsifying 30% of the edges on OGBN-PROTEINS+Graph SAGE. Even when sparsifying 50% of the edges on OGBN-PROTEINS+Deeper GCN, the ROC-AUC only drops by 0.81%. For graph classification tasks, Mo G can remove up to 50% of the edges on MNIST with a 0.14% performance improvement, surpassing other sparsifiers by 0.99% 12.97% in accuracy.

Obs. ❷Different datasets and backbones exhibit varying sensitivities to sparsification. As shown in Tables 1 and 10, despite OGBN-PROTEINS being relatively insensitive to sparsification, sparsification at extremely high levels (e.g., 70%) causes more performance loss for Graph SAGE

Published as a conference paper at ICLR 2025

Table 1: Node classification performance comparison to state-of-the-art sparsification methods. All methods are trained using Graph SAGE, and the reported metrics represent the average of five runs. We denote methods with that do not have precise control over sparsity; their performance is reported around the target sparsity 2%. Sparsity % refers to the ratio of removed edges as defined in Section 2. OOM and OOT denotes out-of-memory and out-of-time, respectively.

Dataset OGBN-ARXIV (Accuracy ) OGBN-PROTEINS (ROC-AUC )

Sparsity % 10 30 50 70 10 30 50 70

Topology-guided

Random 70.03 1.46 68.40 3.09 64.32 7.17 61.18 10.3 76.72 0.68 75.03 2.37 73.58 3.82 72.30 5.10 Rank Degree (Voudigari et al., 2016) 68.13 3.36 67.01 4.48 65.58 5.91 62.17 9.32 77.47 0.07 76.15 1.25 75.59 1.81 74.23 3.17

Local Degree (Hamann et al., 2016) 68.94 2.55 67.20 4.29 65.45 6.04 65.59 5.90 76.20 1.20 76.05 1.35 76.09 1.31 72.88 4.52

Forest Fire (Leskovec et al., 2006) 68.39 3.10 68.10 3.39 67.36 4.13 65.22 6.27 76.50 0.90 75.37 2.03 74.29 3.11 72.11 5.29 G-Spar (Murphy, 1996b) 71.30 0.19 69.29 2.20 65.56 5.93 65.49 6.00 77.38 0.02 77.36 0.04 76.02 1.38 75.89 1.51 LSim (Satuluri et al., 2011a) 69.22 2.27 66.15 5.34 61.07 10.4 60.32 11.2 76.83 0.57 76.01 1.39 74.83 2.57 73.65 3.75 SCAN (Xu et al., 2007) 71.55 0.06 69.27 2.22 65.14 6.35 64.72 6.77 77.60 0.20 76.88 0.52 76.19 1.21 74.32 3.08 ER (Spielman & Srivastava, 2008) 71.63 0.14 69.48 2.01 69.00 2.49 67.15 4.34 OOT DSpar (Liu et al., 2023b) 71.23 0.26 68.50 2.99 64.79 6.70 63.11 8.38 77.34 0.06 77.06 0.34 76.38 1.02 75.49 1.91

Semantic-guided

UGS (Chen et al., 2021) 68.77 2.72 66.30 5.19 65.72 5.77 63.10 8.39 76.80 0.60 75.46 1.94 73.28 4.12 73.31 4.09 GEBT (You et al., 2022) 69.04 2.45 65.29 6.20 65.88 5.61 65.62 5.87 76.30 1.10 76.17 1.23 74.43 2.97 74.12 3.28 MGSpar (Wan & Schweitzer, 2021) 70.22 1.27 69.13 2.36 68.27 3.22 66.55 4.94 OOM ACE-GLT (Wang et al., 2023b) 71.88 0.39 70.14 1.35 68.08 3.41 67.04 4.45 77.59 0.19 76.14 1.26 75.43 1.97 73.28 4.12 WD-GLT (Hui et al., 2023) 71.92 0.43 70.21 1.28 68.30 3.19 66.57 4.92 OOM Ada GLT (Zhang et al., 2024a) 71.22 0.27 70.18 1.31 69.13 2.36 67.02 4.47 77.49 0.09 76.76 1.64 76.00 2.40 75.44 2.96 Mo G (Ours) 71.93 0.44 70.53 0.96 69.06 2.43 67.31 4.18 77.78 0.38 77.49 0.09 76.46 0.94 76.12 1.28

Whole Dataset 71.49 0.01 77.40 0.1

Table 2: Graph classification performance comparison to state-of-the-art sparsification methods. The reported metrics represent the average of five runs.

Dataset MNIST + PNA (Accuracy ) OGBN-PPA + Deeper GCN (Accuracy )

Sparsity % 10 30 50 70 10 30 50 70

Topology-guided

Random 94.61 2.74 87.23 10.1 84.82 12.5 80.07 17.3 75.44 1.65 73.81 4.09 71.97 5.12 69.62 7.47 Rank Degree (Voudigari et al., 2016) 96.42 0.93 94.23 3.12 92.36 4.99 89.20 8.15 75.81 1.28 74.99 2.10 74.12 2.97 70.68 6.41

Local Degree (Hamann et al., 2016) 95.95 1.40 93.37 3.98 90.11 7.24 86.24 11.1 76.43 0.66 75.87 1.22 72.11 4.98 69.93 7.16

Forest Fire (Leskovec et al., 2006) 96.75 0.60 95.42 1.93 95.03 2.32 93.10 4.25 76.38 0.71 75.33 1.76 73.18 3.91 71.49 5.60 G-Spar (Murphy, 1996b) 97.10 0.25 96.59 0.76 94.36 2.99 92.48 4.87 77.68 0.59 73.90 3.19 69.52 7.57 68.10 8.99 LSim (Satuluri et al., 2011a) 95.79 1.56 92.14 5.21 92.29 5.06 91.95 5.40 76.04 1.05 74.40 2.69 72.78 4.31 68.21 8.88 SCAN (Xu et al., 2007) 95.81 1.54 93.48 3.87 90.18 7.17 86.48 10.9 75.23 1.86 75.18 1.91 72.48 4.61 71.11 5.98 ER (Spielman & Srivastava, 2008) 94.77 2.58 93.91 3.44 93.45 3.90 91.07 6.28 77.94 0.85 75.15 1.94 73.23 3.86 72.74 4.35 DSpar (Liu et al., 2023b) 94.97 2.38 93.80 3.55 92.23 5.12 90.48 6.87 76.33 0.76 73.37 3.72 72.98 4.11 70.77 6.32

ICPG (Sui et al., 2023) 97.69 0.34 97.39 0.04 96.80 0.55 93.77 3.58 77.36 0.27 75.24 1.85 73.18 3.91 71.09 6.00 Ada GLT (Zhang et al., 2024a) 97.31 0.04 96.58 0.77 94.14 3.21 92.08 5.27 76.22 0.87 73.54 3.55 70.10 6.99 69.28 7.81 Mo G (Ours) 97.80 0.45 97.74 0.39 97.79 0.44 95.30 2.05 78.43 1.34 77.90 0.81 75.23 1.86 73.09 4.00

Whole Dataset 97.35 0.07 77.09 0.04

compared to Deeper GCN, with the former experiencing a 2.28% drop and the latter only 1.07%, which demonstrates the varying sensitivity of different GNN backbones to sparsification. Similarly, we observe in Table 2 that the MNIST dataset shows a slight accuracy increase even with 50% sparsification, whereas the OGBG-PPA dataset suffers a 1.86% performance decline, illustrating the different sensitivities to sparsification across graph datasets.

Obs. ❸Mo G can effectively accelerate GNN inference with negligible performance loss. Figure 3 illustrates the actual acceleration effects of Mo G compared to other baseline sparsifiers. It is evident that Mo G achieves 1.6 lossless acceleration on OGBN-PROTEINS+Deeper GCN and OGBN-PRODUCTS+Graph SAGE, meaning the performance is equal to or better than the vanilla backbone. Notably, on OGBN-PRODUCTS+Deeper GCN, Mo G achieves 3.3 acceleration with less than a 1.0% performance drop. Overall, Mo G provides significantly superior inference acceleration compared to its competitors.

4.3 MOG AS PERFORMANCE BOOSTER (RQ3) In the context of RQ3, Mo G is developed to augment GNN performance by selectively removing a limited amount of noisy and detrimental edges, while simultaneously preventing excessive sparsification that could degrade GNN performance. Consequently, we uniformly set the sparsity combination to {90%, 85%, 80%}. We combine Mo G with state-of-the-art GNNs on both node-level

Published as a conference paper at ICLR 2025

Figure 3: The trade-off between inference speedup and model performance for Mo G and other sparsifiers. The first and second rows represent results on Graph SAGE and Deeper GCN, respectively. The gray pentagon represents the performance of the original GNN without sparsification.

Table 3: Node classification results on OGBN-PROTEINS with Rev GNN and GAT+Bo T and graph classification results on OGBG-PPA with PAS and Deeper GCN. Mean and standard deviation values from five random runs are presented.

OGBN-PROTEINS (ROC-AUC ) OGBG-PPA (Accuracy )

Model Rev GNN GAT+Bo T PAS Deeper GCN

w/o Mo G 88.14 0.24 88.09 0.16 78.28 0.24 77.09 0.04

w/ Mo G 89.04 0.72 88.72 0.50 78.66 0.47 78.43 0.19 (Sparsity: 9.2%) (Sparsity: 12.7%) (Sparsity: 6.6%) (Sparsity: 10.8%)

and graph-level tasks. The former include Rev GNN (Li et al., 2021) and GAT+Bo T (Wang et al., 2021), which rank fourth and seventh, respectively, on the OGBN-PROTEINS benchmark, and the latter include PAS (Wei et al., 2021) and Deeper GCN (Li et al., 2020a), ranking fourth and sixth on the OGBN-PPA benchmark. We observe from Table 3:

Obs. ❹Mo G can assist the top-student backbones to learn better. Despite Rev GNN and PAS being high-ranking backbones for OGBN-PROTEINS and OGBG-PPA, Mo G still achieves non-marginal performance improvements through moderate graph sparsification: 1.02% on Rev GNN+OGBNPROTEINS and 1.74% on Deeper GCN+OGBG-PPA. This demonstrates that Mo G can effectively serve as a plugin to boost GNN performance by setting a relatively low sparsification rate.

4.4 SENSITIVITY ANALYSIS (RQ4)

To answer RQ4, we perform a sensitivity analysis on the two most important parameters in Mo G: the number of selected experts k and the expert importance loss coefficient λ. We compared the performance of Mo G when choosing different numbers of experts per node, as outlined in Figure 4. The effect of different scaling factors λ on OGBN-PROTEINS+Deeper GCN is shown in Table 4. Based on the results of the above sensitivity analysis, we observe that:

Obs. ❺Sparse expert selection helps customized sparsification. It can be observed FROM Figure 4 that the optimal k varies with the level of graph sparsity. At lower sparsity (10%), k = 1 yields relatively good performance. However, as sparsity increases to 50%, model performance peaks at k = 4, suggesting that in high sparsity environments, more expert opinions contribute to better sparsification. Notably, when k increases to 6, Mo G s performance declines, indicating that a more selective approach in sparse expert selection aids in better model generalization. For a balanced consideration of performance and computational efficiency, we set k = 2 in all experiments. We further provide sensitivity analysis results of parameter k on more datasets, as shown in Appendix G.2.

Obs. ❻Sparsifier load balancing is essential. We conduct a sensitivity analysis of the expert importance loss coefficient λ. A larger λ indicates greater variation in the selected experts. As shown

Published as a conference paper at ICLR 2025

in Table 4, λ = 0 consistently resulted in the lowest performance, as failing to explicitly enforce variation among experts leads to the model converging to a trivial solution with the same set of experts (Wang et al., 2024; Shazeer et al., 2017). Conversely, λ = 1e 1 performed slightly better than λ = 1e 2 at higher sparsity levels, supporting the findings in Obs. 5 that higher sparsity requires more diverse sparsifier experts.

Sparsity=10% Sparsity=30% Sparsity=50%

Figure 4: Sensitivity study on parameter k, i.e., how many experts are chosen per node. The results are reported based on OGBN-ARXIV+Graph SAGE.

λ 0 1e-2 1e-1

10% 81.19 0.08 83.32 0.19 83.04 0.23 30% 79.77 0.08 82.14 0.23 82.08 0.25 50% 79.40 0.06 81.79 0.21 82.04 0.20 70% 78.22 0.13 80.90 0.24 80.97 0.28

Table 4: Sensitivity study on scaling factor λ. The results are reported on OGBNPROTEINS+Deeper GCN.

Table 5: Running time efficiency comparison on OGBN-PRODUCTS+Graph SAGE. We consistently set n = 4, k = 2, corresponding to utilizing 4 pruning criteria and selecting 2 experts for each node, and vary m {1, 2, 3} to check how the training cost grows with m increasing.

Sparsity 30% 50%

Metric Per-epoch Time (s) Accuracy (%) Per-epoch Time (s) Accuracy (%)

Random 18.71 0.14 74.21 0.28 15.42 0.24 71.08 0.34 Ada GLT 23.55 0.20 77.30 0.54 21.68 0.26 74.38 0.79

Mo G(m = 1, K = 3) 20.18 0.14 77.75 0.22 18.19 0.30 76.10 0.49 Mo G(m = 2, K = 6) 21.25 0.22 78.23 0.29 19.70 0.30 76.43 0.49 Mo G(m = 3, K = 12) 23.19 0.18 78.15 0.32 20.83 0.29 76.98 0.49

4.5 EFFICIENCY ANALYSIS AND ABLATION STUDY (RQ4)

Efficiency Analysis To verify that Mo G can achieve better results with less additional training cost than the previous SOTA methods, we compare the accuracy and the time efficiency of Mo G with Ada GLT on OGBN-PRODUCT+Graph SAGE, as outlined in Table 5. We have:

Obs. ❼Mo G can achieve better accuracy with less additional training cost. It is evident in Table 5 that Mo G incurs less additional training cost compared to Ada GLT while achieving significant improvements in sparsification performance. More importantly, we demonstrate that with k = 2, Mo G does not incur significantly heavier training burdens as the number of sparsifiers increases. Specifically, at s% = 50%, the difference in per epoch time between Mo G (K = 3) and Mo G (K = 12) is only 2.63 seconds, consistent with the findings of mainstream sparse Mo E approaches (Wang et al., 2024).

Ablation Study We test three different settings of ϵ (in Equation (3)) on OGBNARXIV+Graph SAGE: (1) ϵ N(0, I), (2) ϵ = 0, and (3) ϵ = 0.2, presented in Table 15. Our key finding is that randomness in gating networks consistently benefits our model. More results and detailed analysis can be found in Appendix G.3.

5 CONCLUSION & LIMITATION

In this paper, we introduce a new graph sparsification paradigm termed Mo G, which leverages multiple graph sparsifiers, each equipped with distinct sparsity levels and pruning criteria. Mo G selects the most suitable sparsifier expert based on each node s local context, providing a customized graph sparsification solution, followed by an effective mixture mechanism on the Grassmann manifold to ensemble the sparse graphs produced by various experts. Extensive experiments on four large-scale OGB datasets and two superpixel datasets have rigorously demonstrated the effectiveness of Mo G. A potential limitation of Mo G is its current reliance on 1-hop decomposition to represent each node s local context. The performance of extending this approach to k-hop contexts remains unexplored, suggesting a possible direction for future research.

Published as a conference paper at ICLR 2025

Amir Abboud, Robert Krauthgamer, and Ohad Trabelsi. Friendly cut sparsifiers and faster gomory-hu trees. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 3630 3649. SIAM, 2022.

Ingo Althöfer, Gautam Das, David Dobkin, and Deborah Joseph. Generating sparse spanners for weighted graphs. In SWAT 90: 2nd Scandinavian Workshop on Algorithm Theory Bergen, Sweden, July 11 14, 1990 Proceedings 2, pp. 26 37. Springer, 1990.

Joshua Batson, Daniel A Spielman, Nikhil Srivastava, and Shang-Hua Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 56(8):87 94, 2013.

Thomas Bendokat, Ralf Zimmermann, and P-A Absil. A grassmann manifold handbook: Basic geometry and computational aspects. Advances in Computational Mathematics, 50(1):1 51, 2024.

Ke Chen, Lei Xu, and Huisheng Chi. Improved learning algorithms for mixture of experts in multiclass classification. Neural networks, 12(9):1229 1252, 1999.

Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning, pp. 1695 1706. PMLR, 2021.

Yuhan Chen, Haojie Ye, Sanketh Vedula, Alex Bronstein, Ronald Dreslinski, Trevor Mudge, and Nishil Talati. Demystifying graph sparsification algorithms in graph properties preservation. ar Xiv preprint ar Xiv:2311.12314, 2023.

Dawei Cheng, Xiaoyang Wang, Ying Zhang, and Liqing Zhang. Graph neural network for fraud detection via spatial-temporal attention. IEEE Transactions on Knowledge and Data Engineering, 34(8):3800 3813, 2020.

Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In ICML, pp. 4057 4086, 2022.

Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veliˇckovi c. Principal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems, 33:13260 13271, 2020.

Tim Dettmers, Luke Zettlemoyer, and Zhang. Sparse networks from scratch: Faster training without losing performance. ar Xiv preprint ar Xiv:1907.04840, 2019.

Xiaowen Dong, Pascal Frossard, Pierre Vandergheynst, and Nikolai Nefedov. Clustering on multilayer graphs via subspace analysis on grassmann manifolds. IEEE Transactions on signal processing, 62(4):905 918, 2013.

Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. ar Xiv preprint ar Xiv:2003.00982, 2020.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.

Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. ar Xiv preprint ar Xiv:1903.02428, 2019.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263 1272. PMLR, 2017.

Zheng Gong, Guifeng Wang, Ying Sun, Qi Liu, Yuting Ning, Hui Xiong, and Jingyu Peng. Beyond homophily: Robust graph anomaly detection via neural sparsification. In Proceedings of International Joint Conference on Artificial Intelligence, pp. 2104 2113, 2023.

Published as a conference paper at ICLR 2025

Michael Hamann, Gerd Lindner, Henning Meyerhenke, Christian L Staudt, and Dorothea Wagner. Structure-preserving sparsification methods for social networks. Social Network Analysis and Mining, 6:1 22, 2016.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of NIPS, 2017.

Mohammad Hashemi, Shengbo Gong, Juntong Ni, Wenqi Fan, B Aditya Prakash, and Wei Jin. A comprehensive survey on graph reduction: Sparsification, coarsening, and condensation. ar Xiv preprint ar Xiv:2402.03358, 2024.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. ar Xiv preprint ar Xiv:2203.15556, 2022.

Fenyu Hu, W Liping, L Qiang, Shu Wu, Liang Wang, and Tieniu Tan. Graphdive: graph classifcation by mixture of diverse experts. In IJCAI, 2022.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. ar Xiv preprint ar Xiv:2005.00687, 2020.

Bo Hui, Da Yan, Xiaolong Ma, and Wei-Shinn Ku. Rethinking graph lottery tickets: Graph sparsity matters. In The Eleventh International Conference on Learning Representations, 2023.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79 87, 1991.

Zhongxiao Jia and GW Stewart. An analysis of the rayleigh ritz method for approximating eigenspaces. Mathematics of computation, 70(234):637 647, 2001.

Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. Graph condensation for graph neural networks. ar Xiv preprint ar Xiv:2110.07580, 2021.

Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181 214, 1994.

Zhao Kang, Guoxin Shi, Shudong Huang, Wenyu Chen, Xiaorong Pu, Joey Tianyi Zhou, and Zenglin Xu. Multi-graph fusion for multi-view spectral clustering. Knowledge-Based Systems, 189:105102, 2020.

Suyeon Kim, Dongha Lee, Seong Ku Kang, Seonghyeon Lee, and Hwanjo Yu. Learning topologyspecific experts for molecular property prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 8291 8299, 2023.

Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Boris Knyazev, Graham W Taylor, and Mohamed Amer. Understanding attention and generalization in graph neural networks. Advances in neural information processing systems, 32, 2019.

Kwei-Herng Lai, Daochen Zha, Kaixiong Zhou, and Xia Hu. Policy-gnn: Aggregation optimization for graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 461 471, 2020.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. ar Xiv preprint ar Xiv:1810.02340, 2018.

Dmitry Lepikhin, Hyouk Joong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ar Xiv preprint ar Xiv:2006.16668, 2020.

Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data, 1:2, 2006.

Published as a conference paper at ICLR 2025

Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns. ar Xiv preprint ar Xiv:2006.07739, 2020a.

Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. Training graph neural networks with 1000 layers. In International conference on machine learning, pp. 6437 6449. PMLR, 2021.

Jiayu Li, Tianyun Zhang, Hao Tian, Shengmin Jin, Makan Fardad, and Reza Zafarani. Sgcn: A graph sparsifier based on graph convolutional networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 275 287. Springer, 2020b.

Zhixun Li, Xin Sun, Yifan Luo, Yanqiao Zhu, Dingshuo Chen, Yingtao Luo, Xiangxin Zhou, Qiang Liu, Shu Wu, Liang Wang, et al. Gslb: The graph structure learning benchmark. Advances in Neural Information Processing Systems, 36, 2024.

Guangfeng Lin, Jing Wang, Kaiyang Liao, Fan Zhao, and Wanjun Chen. Structure fusion based on graph convolutional networks for node classification in citation networks. Electronics, 9(3):432, 2020.

Chuang Liu, Yibing Zhan, Jia Wu, Chang Li, Bo Du, Wenbin Hu, Tongliang Liu, and Dacheng Tao. Graph pooling for graph neural networks: Progress, challenges, and opportunities. ar Xiv preprint ar Xiv:2204.07321, 2022a.

Junjie Liu, Zhe Xu, Runbin Shi, Ray CC Cheung, and Hayden KH So. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. ar Xiv preprint ar Xiv:2005.06870, 2020.

Yixin Liu, Yu Zheng, Daokun Zhang, Hongxu Chen, Hao Peng, and Shirui Pan. Towards unsupervised deep graph structure learning. In Proceedings of the ACM Web Conference 2022, pp. 1392 1403, 2022b.

Zheyuan Liu, Chunhui Zhang, Yijun Tian, Erchi Zhang, Chao Huang, Yanfang Ye, and Chuxu Zhang. Fair graph representation learning via diverse mixture of experts. In The Web Conference, 2023a.

Zirui Liu, Kaixiong Zhou, Zhimeng Jiang, Li Li, Rui Chen, Soo-Hyun Choi, and Xia Hu. Dspar: An embarrassingly simple strategy for efficient gnn training and inference via degree-based sparsification. Transactions on Machine Learning Research, 2023b.

Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. Learning to drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM international conference on web search and data mining, pp. 779 787, 2021.

Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1930 1939, 2018.

Russell Merris. A survey of graph laplacians. Linear and Multilinear Algebra, 39(1-2):19 31, 1995.

Erxue Min, Runfa Chen, Yatao Bian, Tingyang Xu, Kangfei Zhao, Wenbing Huang, Peilin Zhao, Junzhou Huang, Sophia Ananiadou, and Yu Rong. Transformer for graphs: An overview from architecture perspective. ar Xiv preprint ar Xiv:2202.08455, 2022.

Allan H. Murphy. The finley affair: A signal event in the history of forecast verification. Weather and Forecasting, 11(1):3 20, 1996a. doi: https://doi.org/10.1175/1520-0434(1996)011<0003: TFAASE>2.0.CO;2. URL https://journals.ametsoc.org/view/journals/ wefo/11/1/1520-0434_1996_011_0003_tfaase_2_0_co_2.xml.

Allan H Murphy. The finley affair: A signal event in the history of forecast verification. Weather and forecasting, 11(1):3 20, 1996b.

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal contrastive learning with limoe: the language-image mixture of experts. ar Xiv preprint ar Xiv:2206.02770, 2022.

Published as a conference paper at ICLR 2025

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems, volume 34, pp. 8583 8595, 2021.

Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley interdisciplinary reviews: data mining and knowledge discovery, 8(4):e1249, 2018.

Venu Satuluri, Srinivasan Parthasarathy, and Yiye Ruan. Local graph sparsification for scalable clustering. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 721 732, 2011a.

Venu Satuluri, Srinivasan Parthasarathy, and Yiye Ruan. Local graph sparsification for scalable clustering. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD 11, pp. 721 732, New York, NY, USA, 2011b. Association for Computing Machinery. ISBN 9781450306614. doi: 10.1145/1989323.1989399. URL https://doi.org/ 10.1145/1989323.1989399.

Hyunjin Seo, Jihun Yun, and Eunho Yang. TEDDY: Trimming edges with degree-based discrimination strategy. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5RUf9n Edy C.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ar Xiv preprint ar Xiv:1701.06538, 2017.

Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 563 568, 2008.

Christian L Staudt, Aleksejs Sazonovs, and Henning Meyerhenke. Networkit: A tool suite for large-scale complex network analysis. Network Science, 4(4):508 530, 2016.

Yongduo Sui, Xiang Wang, Tianlong Chen, Meng Wang, Xiangnan He, and Tat-Seng Chua. Inductive lottery ticket learning for graph neural networks. Journal of Computer Science and Technology, 2023. ISSN 1000-9000(Print) /1860-4749(Online). doi: 10.1007/s11390-023-2583-5. URL https://jcst.ict.ac.cn/en/article/doi/10.1007/s11390-023-2583-5.

Xiangguo Sun, Hongzhi Yin, Bo Liu, Hongxu Chen, Qing Meng, Wang Han, and Jiuxin Cao. Multi-level hyperedge distillation for social linking prediction on sparsely observed networks. In Proceedings of the Web Conference 2021, pp. 2934 2945, 2021.

Xiangguo Sun, Hongzhi Yin, Bo Liu, Qing Meng, Jiuxin Cao, Alexander Zhou, and Hongxu Chen. Structure learning via meta-hyperedge for dynamic rumor detection. IEEE Transactions on Knowledge and Data Engineering, 2022.

Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. All in one: Multi-task prompting for graph neural networks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2120 2131, 2023a.

Xiangguo Sun, Hong Cheng, Bo Liu, Jia Li, Hongyang Chen, Guandong Xu, and Hongzhi Yin. Self-supervised hypergraph representation learning for sociological analysis. IEEE Transactions on Knowledge and Data Engineering, 2023b.

N. Talati, H. Ye, S. Vedula, K. Chen, Y. Chen, D. Liu, Y. Yuan, D. Blaauw, A. Bronstein, T. Mudge, and R. Dreslinski. Mint: An accelerator for mining temporal motifs. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1270 1287, Los Alamitos, CA, USA, oct 2022. IEEE Computer Society. doi: 10.1109/MICRO56248.2022.00089. URL https: //doi.ieeecomputersociety.org/10.1109/MICRO56248.2022.00089.

Kale-ab Tessera, Sara Hooker, and Benjamin Rosman. Keep the gradients flowing: Using gradient flow to study sparse network optimization. ar Xiv preprint ar Xiv:2102.01670, 2021.

Elli Voudigari, Nikos Salamanos, Theodore Papageorgiou, and Emmanuel J Yannakoudakis. Rank degree: An efficient algorithm for graph sampling. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 120 129. IEEE, 2016.

Published as a conference paper at ICLR 2025

Guihong Wan and Haim Schweitzer. Edge sparsification for graphs via meta-learning. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 2733 2738. IEEE, 2021.

Daixin Wang, Jianbin Lin, Peng Cui, Quanhui Jia, Zhen Wang, Yanming Fang, Quan Yu, Jun Zhou, Shuang Yang, and Yuan Qi. A semi-supervised graph attentive network for financial fraud detection. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 598 607. IEEE, 2019a.

Haotao Wang, Ziyu Jiang, Yuning You, Yan Han, Gaowen Liu, Jayanth Srinivasa, Ramana Kompella, Zhangyang Wang, et al. Graph mixture of experts: Learning on large-scale graphs with explicit diversity modeling. Advances in Neural Information Processing Systems, 36, 2024.

Kun Wang, Yuxuan Liang, Pengkun Wang, Xu Wang, Pengfei Gu, Junfeng Fang, and Yang Wang. Searching lottery tickets in graph neural networks: A dual perspective. In The Eleventh International Conference on Learning Representations, 2022a.

Kun Wang, Guohao Li, Shilong Wang, Guibin Zhang, Kai Wang, Yang You, Xiaojiang Peng, Yuxuan Liang, and Yang Wang. The snowflake hypothesis: Training deep gnn with one node one receptive field, 2023a.

Li Wang, Wei Huang, Miao Zhang, Shirui Pan, Xiaojun Chang, and Steven Weidong Su. Pruning graph neural networks by evaluating edge properties. Knowledge-Based Systems, 256:109847, 2022b.

Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. ar Xiv preprint ar Xiv:1909.01315, 2019b.

Yangkun Wang, Jiarui Jin, Weinan Zhang, Yong Yu, Zheng Zhang, and David Wipf. Bag of tricks for node classification with graph neural networks. ar Xiv preprint ar Xiv:2103.13355, 2021.

Yuwen Wang, Shunyu Liu, Kaixuan Chen, Tongtian Zhu, Ji Qiao, Mengjie Shi, Yuanyu Wan, and Mingli Song. Adversarial erasing with pruned elements: Towards better graph lottery ticket. ar Xiv preprint ar Xiv:2308.02916, 2023b.

Lanning Wei, Huan Zhao, Quanming Yao, and Zhiqiang He. Pooling architecture search for graph classification. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 2091 2100, 2021.

Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. Selfsupervised graph learning for recommendation. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pp. 726 735, 2021.

Lingfei Wu, Yu Chen, Kai Shen, Xiaojie Guo, Hanning Gao, Shucheng Li, Jian Pei, Bo Long, et al. Graph neural networks for natural language processing: A survey. Foundations and Trends in Machine Learning, 16(2):119 328, 2023.

Qitian Wu, Wentao Zhao, Zenan Li, David P Wipf, and Junchi Yan. Nodeformer: A scalable graph structure learning transformer for node classification. Advances in Neural Information Processing Systems, 35:27387 27401, 2022.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4 24, 2020.

Shunxin Xiao, Shiping Wang, Yuanfei Dai, and Wenzhong Guo. Graph neural networks in node classification: survey and evaluation. Machine Vision and Applications, 33(1):4, 2022.

Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas AJ Schweiger. Scan: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 824 833, 2007.

Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31, 2018.

Published as a conference paper at ICLR 2025

Haoran You, Zhihan Lu, Zijian Zhou, Yonggan Fu, and Yingyan Lin. Early-bird gcns: Graphnetwork co-optimization towards more efficient gcn training and inference via drawing early-bird lottery tickets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8910 8918, 2022.

Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Lizhen Cui, and Quoc Viet Hung Nguyen. Are graph augmentations necessary? simple graph contrastive learning for recommendation. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, pp. 1294 1303, 2022.

Seongjun Yun, Seoyoon Kim, Junhyun Lee, Jaewoo Kang, and Hyunwoo J Kim. Neo-gnns: Neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems, 34:13683 13694, 2021.

Guibin Zhang, Kun Wang, Wei Huang, Yanwei Yue, Yang Wang, Roger Zimmermann, Aojun Zhou, Dawei Cheng, Jin Zeng, and Yuxuan Liang. Graph lottery ticket automated. In The Twelfth International Conference on Learning Representations, 2023.

Guibin Zhang, Kun Wang, Wei Huang, Yanwei Yue, Yang Wang, Roger Zimmermann, Aojun Zhou, Dawei Cheng, Jin Zeng*, and Yuxuan Liang*. Graph lottery ticket automated. In The International Conference on Learning Representations, 2024a.

Guibin Zhang, Yanwei Yue, Kun Wang, Junfeng Fang, Yongduo Sui, Kai Wang, Yuxuan Liang, Dawei Cheng, Shirui Pan, and Tianlong Chen. Two heads are better than one: Boosting graph sparse training via semantic and topological awareness. ar Xiv preprint ar Xiv:2402.01242, 2024b.

Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Proceedings of NIPS, 2018.

Zaixi Zhang and Qi Liu. Learning subpocket prototypes for generalizable structure-based drug design. In International Conference on Machine Learning, pp. 41382 41398. PMLR, 2023.

Xinyu Zhao, Xuxi Chen, Yu Cheng, and Tianlong Chen. Sparse moe with language guided routing for multilingual machine translation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=y SS7h H1sm L.

Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. Robust graph representation learning via neural sparsification. In International Conference on Machine Learning, pp. 11458 11468. PMLR, 2020.

Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI open, 1:57 81, 2020.

Liguang Zhou, Yuhongze Zhou, Tin Lun Lam, and Yangsheng Xu. CAME: Context-aware mixtureof-experts for unbiased scene graph generation. ar Xiv preprint ar Xiv:2208.07109, 2022.

Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, and Jifeng Dai. Uni-Perceiver-Mo E: Learning sparse generalist models with conditional moes. ar Xiv preprint ar Xiv:2206.04674, 2022.

A NOTATIONS

We conclude the commonly used notations throughout the manuscript in Table 6.

B DEATILS ON PRUNING CRITERIA

In this section, we will thoroughly explain the four pruning criteria we selected and the rationale behind these choices.

Published as a conference paper at ICLR 2025

Table 6: The notations that are commonly used in the manuscript.

Notation Definition

G = {V, E} = {A, X} Input graph A Input adjacency matrix X Node feature matrix L Graph Laplacian matrix COMB( ) GNN ego-node transformation function AGGR( ) GNN message aggregation function ESM( ) Sparse graph combination on function s% Sparsity ratio (the ratio of removed edges) vi The i-th node in G xi Node feature vector for vi G(i) The 1-hop ego-graph for vi ϕ(G(i)) Routing network K The number of total sparsifier experts k The number of selected sparsifier experts per node Wg, Wn Trainable parameters in the routing network κ(G) A graph sparsifier G(i) m The sparse ego-graph of vi produced by the m-th graph sparsifier d G(i) = {d V(i), d E(i)} The ensemble sparse graph produced by Mo G for vi E(i) p Edges removed surrounding vi cm(eij) Prior guidance on edge importance eij

Edge degree of eij is defined as follows:

Degree (eij) = 1

2 (|N(vi) + N(vj)|) . (17)

Previous methods (Wang et al., 2022b; Seo et al., 2024) have explicitly or implicitly used edge degree for graph sparsification. Intuitively, edges with higher degrees are more replaceable. (Wang et al., 2022b) further formalizes this intuition from the perspective of bridge edges. Jaccard Similarity (Murphy, 1996b) measures the similarity between two sets by computing the portion of shared neighbors between two nodes (viand vj), as defined below:

Jaccrad Similarity vi, vj) = |N(vi) N(vj)|

|N(vi) N(vj)|. (18)

Jaccard similarity is widely used for its capacity for detecting clusters, hubs, and outliers on social networks (Murphy, 1996b; Xu et al., 2007; Satuluri et al., 2011a). Effective Resistance, derived from the analogy to electrical circuits, is applied to graphs where edges represent resistors. The effective resistance of an edge is defined as the potential difference generated when a unit current is introduced at one vertex and withdrawn from the other. Once the effective resistance is calculated, a sparsified subgraph can be constructed by selecting edges with probabilities proportional to their effective resistances. Notably, (Spielman & Srivastava, 2008) proved that the quadratic form of the Laplacian for such sparsified graphs closely approximates that of the original graph. Consequently, the following inequality holds for the sparsified subgraph with high probability:

x R|V| (1 ϵ)x T Lx x T Lx (1 + ϵ)x T Lx, (19)

where L is the Laplacian of the sparsified graph, and ϵ > 0 is a small number. The insight is that effective resistance reflects the significance of an edge. Effective resistance aims to preserve the quadratic form of the graph Laplacian. This property makes it suitable for applications relying on

Published as a conference paper at ICLR 2025

the quadratic form of the graph Laplacian, such as min-cut/max-flow problems. For computation simplicity, we do not directly utilize the definition of effective resistance, and use its approximation version (Liu et al., 2023b).

Gradient Magnitude, a widely used pruning criterion, is prevalent not only in the field of graph sparsification but also in classical neural network pruning. Numerous studies (Lee et al., 2018; Tessera et al., 2021; Dettmers et al., 2019) leverage gradient magnitude to estimate parameter importance. Specifically for graph sparsification, MGSpar (Wan & Schweitzer, 2021) was the first to propose using meta-gradient to estimate edge importance. We consider gradient magnitude a crucial indicator of the graph s topological structure during training. Therefore, we explicitly design some sparsifier experts to focus on this information.

C GRAPH MIXTURE ON GRASSMANN MANIFOLD

In this section, we detail how we leverage the concept of Grassmann Manifold to effectively combine different sparse (ego-)graphs output by various sparsifiers.

According to Equation (10), each orthonormal matrix represents a unique subspace and thus corresponds to a distinct point on the Grassmann manifold (Lin et al., 2020). This applies to the eigenvector matrix of the normalized Laplacian matrix (U = L[:, : p] Rn p), which comprises the first p eigenvectors and is orthonormal (Merris, 1995), and thereby can be mapped onto the Grassmann manifold. Additionally, each row of the eigenvector matrix encapsulates the spectral embedding of each node in a p-dimensional space, where adjacent nodes have similar embedding vectors. This subspace representation, summarizing graph information, is applicable to various tasks such as clustering, classification, and graph merging (Dong et al., 2013).

In the context of Mo G, we aim to efficiently find the final version that aggregates all the excellent properties of each point s k versions of sparse ego-graph { e G(i) m }k m=1 on the Grassmann Manifold. Moreover, this should guided by the expert scores computed by the routing network in Section 3.2. Let Dm and Am denote the degree matrix and the adjacency matrix for e G(i) m (we omit the superscript ( )(i) denoting vi for simplicity in the subsequent expressions), then the normalized graph Laplacian is defined as: Lm = D 1

2 m (Dm Am)D

Given the graph Laplacian Lm for each sparse graph, we calculate the spectral embedding matrix Um through trace minimization:

min Um R|N (vm)| p tr (U m Lm Um), s. t. U m Um = I, (21)

which can be solved by the Rayleigh-Ritz theorem. As mentioned above, each point on the Grassmann manifold can be represented by an orthonormal matrix Y R|N (vi)| p whose columns span the corresponding p-dimensional subspace in R|N (vi)| p. The distance between such subspaces can be computed as a set of principal angles {θi}k i=1 between these subspaces. (Dong et al., 2013) showed that the projection distance between two subspaces Y1 and Y2 can be represented as a separate trace minimization problem:

d2 proj(Y1, Y2) =

i=1 sin2 θi = k tr (Y1Y 1 Y2Y 2 ). (22)

Based on this, we further define the projection of the final representative subspace U and the k sparse candidate subspace {Um}k m=1:

d2 proj(U, {Um}k m=1) =

m=1 d2 proj(U, Um) = p k

m=1 tr (UU Um U m), (23)

which ensures that individual subspaces are close to the final representative subspace U.

Finally, to maintain the original vertex connectivity from all k sparse ego-graphs and emphasize the connectivity relationship from more reliable sparsifiers (with higher expert scores), we propose the

Published as a conference paper at ICLR 2025

following objective function:

min Um R|N (vm)| p

m=1 tr (UU Um U m)

where E(i) m represents the expert score of the node vi s m-th sparsifier expert. Based on Equations (21) and (24), we present the overall objective:

min U(i) R|N (vi)| p

tr(U(i) Lm U(i)) | {z } (1) node connectivity

+ E(i) m d2(U(i), U(i) m ) | {z } (2) subspace distance

, s. t. U(i) U(i) = I. (25)

For simplicity, we omit the superscript (i) in the following content. Substituting Equation (23) into Equation (25), we obtain:

m=1 tr(UT Lm U) + Em

m=1 tr(UUT Um UT m)

, s.t.UT U = I. (26)

Further simplification by neglecting constant terms like Em p k yields:

m=1 tr(UT Lm U) Em

m=1 tr(UUT Um UT m), s.t.UT U = I. (27)

Reorganizing the trace form of the second term, we obtain:

k=1 Um UT m)U

, s.t.UT U = I. (28)

At this point, the optimization problem essentially becomes a trace minimization problem, and thus the solution to this minimization problem is essentially the term between UT and U, which is:

k=1 Um UT m) =

k=1 (Lm Em Um UT m). (29)

Since computations involving the Grassmann manifold unavoidably entail eigenvalue decomposition, concerns about computational complexity may arise. However, given that Mo G only operates mixtures on the ego-graph of each node, such computational burden is entirely acceptable. Specific complexity analyses are presented in Appendix E.

C.1 FURTHER DISCUSSION ON THE USE OF GRASSMANN MANIFOLD

In this section, we visualize the impact of different ego-graph ensemble methods on the eigenvalue distribution. Specifically, we select node [2458] from the OGBN-ARXIV dataset and compare its original ego-graph, the sparse ego-graphs produced by three different sparsifiers, and the ensembled ego-graphs obtained through simple averaging and Grassmann optimization. The eigenvalue distributions are shown in Figure 5. As observed, when the three candidate sparse graphs are combined through simple averaging followed by sparsification, the resulting graph s eigenvalue distribution can deviate significantly from the original distribution. In contrast, the Grassmann ensembling method effectively preserves the spectral properties of each graph view, resulting in a sparse ego-graph whose eigenvalue distribution closely aligns with that of the original graph.

D ALGORITHM WORKFLOW

The algorithm framework is presented in Algo. 1.

Published as a conference paper at ICLR 2025

0 2 4 6 8 10 Eigenvalue Index

Original Graph Sparse Graph 1 Sparse Graph 2 Sparse Graph 3 Average Graph Grassmann Sparse Graph

Figure 5: Case study on the eigenvalue distribution of ego-graph.

E COMPLEXITY ANALYSIS

In this section, we delve into a comprehensive analysis of the time and space complexity of Mo G. Without loss of generality, we consider the scenario where Mo G is applied to vanilla GCN. It is worth recalling that the forward time complexity of vanilla GCN is given by: O(L |E| D + L |V| D2), (30)

where L is the number of GNN layers, |E| and |V| denotes the number of edges and nodes, respectively, and D is the hidden dimension. Similarly, the forward space complexity of GCN is: O(L |E| + L D2 + L |V| D) (31)

When Mo G is applied to GCN, each sparsifier expert κ( ) essentially introduces additional complexity equivalent to that of an FFN( ), as depicted in Equation (9). Incorporating the Sparse Mo E-style structure, the forward time complexity of GCN+Mo G becomes: O(L |E| D + L |V| D2 + k |E| D Ds), (32)

where Ds represents the hidden dimension of the feed-forward network in Equation (9) and k denotes the number of selected experts. Similarly, the forward space complexity is increased to: O(L |E| + L D2 + L |V| D + k |E| D Ds). (33) It is noteworthy that we omit the analysis for the routing network, as its computational cost is meanwhile negligible compared to the cost of selected experts, since both Wg RK F and Wn RK F is in a much smaller dimension that the weight matrix W RF F in GCN.

Furthermore, we present the additional complexity introduced by the step of graph mixture on the Grassmann manifold. For each center node s k sparse ego-graphs, we need to compute the graph Laplacian and the eigenvector matrix, which incurs an extra time complexity of O(k ( |E|

to compute the Laplacian d L(i) of the final ensemble sparse graph, an additional complexity of O(k ( |E|

|V|)2 p) is required. In the end, the complete time complexity of Mo G is expressed as:

L |E| D + L |V| D2 | {z } vanilla GCN

+ k |E| D Ds | {z } sparsifier experts

2 p | {z } graph mixture

To empirically verify that Mo G does not impose excessive computational burdens on GNN backbones, we conduct experiments in Section 4.5 to compare the per-epoch time efficiency metric of Mo G with other sparsifiers.

Published as a conference paper at ICLR 2025

Algorithm 1: Algorithm workflow of Mo G Input :G = (A, X), GNN model f(G, Θ), , epoch number Q. Output :Sparse graph Gsub = {V, E } for iteration q 1 to Q do

/* Ego-graph decomposition */ Decompose G into ego-graph representations {G(1), G(2), , G(N)}. /* Sparsifier expert allocation */ for node i 1 to |V| do

Calculate the total K expert score of vi by routing network ψ(xi); Eq. 3 Select k sparsifier expert for node vi by Softmax(Top K(ψ(xi), k)); Eq. 3 end /* Produce sparse graph condidates */ for iteration i 1 to |V| do

for sparsifier index m 1 to m do

Sparisifier κm determines which edges to remove by E(i) p = Top K Cm(E), |E(i)| s% ; Eq. 8 Produce sparse graph candidate e G(i) = κm(G(i)) = {V(i), E(i) \ E(i) p }. end /* Ensenmble sparse graphs on Grassmann manifold */ Calculate the ensemble graph s graph Laplacian by d L(i) = Pk m=1 Lm E(i) m U(i) U(i) ; Eq. 12

Obtain vi s final sparse graph by ESM({d G(i) = {D d L(i), X(i)}); Eq. 13 Compute vi s weighted sparsity by s(i)% = 1

k Pk m=1 sm%; Eq. 14

Post-sparsify d G(i): d G(i) {Top K( d A(i), |E(i)| s(i)%), X(i)}; Eq. 14 end /* Combine ego-graphs */ b G {d G(1), d G(2), , \ G(|V|)} /* Standard GNN training */ Feed the sparse graph b G into GNN model for any kinds of downstream training ; Eq. 6 Compute loss Ltask + λ Limportance; Eq. 16 Backpropagate to update GNN f(G, Θ), routing network ψ and sparsifiers {κm}K m=1. end

F EXPERIMENTAL DETAILS

F.1 DATASET STATISTICS

We conclude the dataset statistics in Tab. 7

Table 7: Graph datasets statistics.

Dataset #Graph #Node #Edge #Classes Metric

OGBN-ARXIV 1 169,343 1,166,243 40 Accuracy OGBN-PROTEINS 1 132,534 39,561,252 2 ROC-AUC OGBN-PRODUCTS 1 2,449,029 61,859,140 47 Accuracy

OGBG-PPA 158,100 243.4 2,266.1 47 Accuracy

MNIST 70,100 50.5 564.5 10 Accuracy CIFAR-10 60,000 117.6 914.0 10 Accuracy

Published as a conference paper at ICLR 2025

F.2 EVALUATION METRICS

Accuracy represents the ratio of correctly predicted outcomes to the total predictions made. The ROC-AUC (Receiver Operating Characteristic-Area Under the Curve) value quantifies the probability that a randomly selected positive example will have a higher rank than a randomly selected negative example.

F.3 DATASET SPLITS

For node-level tasks, the data splits for OGBN-ARXIV, OGBN-PROTEINS, and OGBN-PRODUCTS were provided by the benchmark (Hu et al., 2020). Specifically, for OGBN-ARXIV, we train on papers published until 2017, validate on papers from 2018 and test on those published since 2019. For OGBN-PROTEINS, protein nodes were segregated into training, validation, and test sets based on their species of origin. For OGBN-PRODUCTS, we sort the products according to their sales ranking and use the top 8% for training, next top 2% for validation, and the rest for testing.

For graph-level tasks, we follow (Hu et al., 2020) for OGBG-PPA. Concretely, we adopt the species split, where the neighborhood graphs in the validation and test sets are extracted from protein association networks of species not encountered during training but belonging to one of the 37 taxonomic groups. This split stress-tests the model s capacity to extract graph features crucial for predicting taxonomic groups, enhancing biological understanding of protein associations. For MNIST and CIFAR-10, consistent with (Dwivedi et al., 2020), we split them to 55000 train/5000 validation/10000 test for MNIST, and 45000 train/5000 validation/10000 test for CIFAR10, respectively. We report the test accuracy at the epoch with the best validation accuracy.

F.4 PARAMETER SETTING

Backbone Parameters For node classification backbones, we utilize a 3-layer Graph SAGE with hidden_dim {128, 256}. As for Deeper GCN, we set layer_num = 28, block = res+, hidden_dim = 64. The other configurations are the same as in https://github.com/lightaime/deep_gcns_torch/tree/master/ examples/ogb/ogbn_proteins. For graph classification backbones, we leverage a 4-layer PNA with hidden_dim = 300. Rest configurations are the same as in https://github.com/ lukecavabarrett/pna.

Mo G parameters We adopt the m = 4 sparsity criteria outlined in Section 3.3, assigning n = 3 different sparsity levels {s1, s2, s3} to each criterion, resulting in a total of K = m n = 12 experts. We select k = 2 sparsifier experts for each node, and set the loss scaling factor λ = 1e 2 across all datasets and backbones.

All the experiments are conducted on NVIDIA Tesla V100 (32GB GPU), using Py Torch and Py Torch Geometric framework.

F.5 SPARSIFIER BASELINE CONFIGURATIONS

Topology-based sparsification

Rank Degree (Talati et al., 2022): The Rank Degree sparsifier initiates by selecting a random set of "seed" vertices. Then, the vertices with connections to these seed vertices are ranked based on their degree in descending order. Subsequently, the edges linking each seed vertex to its top-ranked neighbors are chosen and integrated into the sparsified graph. The newly added nodes in the graph act as new seeds for identifying additional edges. This iterative process continues until the target sparsification limit is attained. We utilize the implementation in (Chen et al., 2023). Local Degree (Hamann et al., 2016): Local Degree sparsifier, similar to Rank Degree, incorporates edges to the top deg(v)α neighbors ranked by their degree in descending order, where α [0, 1] represents the degree of sparsification. Forest Fire (Leskovec et al., 2006): Forest fire assembles burning through edges probabilistically, and we use the implementation in (Staudt et al., 2016).

Published as a conference paper at ICLR 2025

G-Spar (Murphy, 1996b): G-Spar sorts the Jaccard scores globally and then selects the edges with the highest similarity score. We opt for the code from (Staudt et al., 2016). Local Similarity (Satuluri et al., 2011a): Local Similarity ranks edges using the Jaccard score and computes log(rank(eij))/ log(deg(eij)) as the similarity score, and selects edges with the highest similarity scores. We utilize the implementation in (Chen et al., 2023). SCAN (Spielman & Srivastava, 2008): SCAN uses structural similarity (called SCAN similarity) measures to detect clusters, hubs, and outliers. We utilize the implementation in (Chen et al., 2023) DSpar (Liu et al., 2023b): DSpar is an extension of effective resistance sparsifier, which aims to reduce the high computational budget of calculating effective resistance through an unbiased approximation. We adopt their official implementation (Liu et al., 2023b).

Semantic-based sparsification

UGS (Chen et al., 2021): We utilize the official implementation from the authors. Notably, UGS was originally designed for joint pruning of model parameters and edges. Specifically, it sets separate pruning parameters for parameters and edges, namely the weight pruning ratio pθ and the graph pruning ratio pg. In each iteration, a corresponding proportion of parameters/edges is pruned. For a fairer comparison, we set pθ = 0%, while maintaining pg {5%, 10} (consistent with the original paper). GEBT (You et al., 2022): GEBT, for the first time, discovered the existence of graph earlybird (GEB) tickets that emerge at the very early stage when sparsifying GCN graphs. (You et al., 2022) has proposed two variants of graph early bird tickets, and we opt for the graphsparsification-only version, dubbed GEB Ticket Identification. Meta-gradient sparsifier (Wan & Schweitzer, 2021): The Meta-gradient sparsifier prunes edges based on their meta-gradient importance scores, assessed over multiple training epochs. Since no official implementation is provided, we carefully replicated the results following the guidelines in the original paper. ACE-GLT (Wang et al., 2023b): ACE-GLT inherits the iterative magnitude pruning (IMP) paradigm from UGS. Going beyond UGS, it suggested mining valuable information from pruned edges/weights after each round of IMP, which in the meanwhile doubled the computational cost of IMP. We utilize the official implementation provided by (Wang et al., 2023b), and set pθ = 0%, pg {5%, 10}. WD-GLT (Hui et al., 2023): WD-GLT also inherits the iterative magnitude pruning paradigm from UGS, so we also set pθ = 0%, pg {5%, 10%} across all datasets and backbones. The perturbation ratio α is tuned among {0, 1}. Since no official implementation is provided, we carefully reproduced the results according to the original paper. Ada GLT (Zhang et al., 2024a): Ada GLT revolutionizes the original IMP-based graph lottery ticket methodology into an adaptive, dynamic, and automated approach, proficient in identifying sparse graphs with layer-adaptive structures. We fix ηθ = 0%, ηg {1e 6, 1e 5, 1e 4, 1e 3, 1e 2}, ω = 2 across all datasets and backbones.

F.6 ADJUSTING GRAPH SPARSITY

In Table 8, we provide detailed guidelines on how to achieve the desired global sparsity by adjusting the three sparsity levels {s1, s2, s3} in Mo G across six datasets.

G ADDTIONAL EXPERIMENT RESULTS

G.1 RESULTS FOR RQ1

We report the performances of Mo G and other sparsifiers on OGBN-PRODUCTS in Table 9.

G.2 SENSITIVITY ANALYSIS OF PARAMETER K

Based on the experiments in Section 4.4, we further provide sensitivity analysis results on OGBNPROTEINS+Rev GNN, as shown in Table 13. It can be observed that Mo G achieves peak performance at k {2, 3} and begins to decline after k 4, which is consistent with our finding in Observation 6.

Published as a conference paper at ICLR 2025

Table 8: The recipe for adjusting graph sparsity via different sparsifier combinations.

Datasets 1 s1 1 s2 1 s3 k 1 s%

1 0.9 0.8 2 [88.0%,90.9%] 0.8 0.7 0.5 2 [69.0%,73.2%] 0.6 0.5 0.3 2 [49.5%,52.7%] 0.5 0.3 0.15 2 [27.1%, 31.6%]

OGBN-PROTEINS

1 0.9 0.8 2 [86.1%,89.3%] 0.8 0.7 0.6 2 [65.1%,69.2%] 0.6 0.5 0.4 2 [45.2%,49.3%] 0.4 0.3 0.2 2 [29.2%,31.1%]

OGBN-PRODUCTS

1 0.9 0.8 2 [90.1%,93.2%] 0.8 0.7 0.6 2 [69.3%,72.0%] 0.6 0.5 0.4 2 [51.5%,54.9%] 0.4 0.3 0.2 2 [28.7%,36.0%]

1 0.85 0.8 2 [90.4%,92.7%] 0.8 0.5 0.4 2 [67.1%,68.3%] 0.6 0.3 0.2 2 [46.2%,49.3%] 0.35 0.1 0.1 2 [29.8%,31.3%]

1 0.85 0.8 2 [90.6%,93.7%] 0.8 0.5 0.4 2 [67.5%,69.9%] 0.6 0.3 0.2 2 [47.7%,49.3%] 0.35 0.1 0.1 2 [30.1%,31.3%]

0.95 0.9 0.8 2 [86.5%,88.9%] 0.8 0.65 0.6 2 [68.0%,70.1%] 0.6 0.5 0.3 2 [47.8%,48.9%] 0.4 0.3 0.15 2 [30.1%,33.6%]

Table 9: Node classification performance comparison to state-of-the-art sparsification methods. All methods are trained using Graph SAGE, and the reported metrics represent the average of five runs. We denote methods with that do not have precise control over sparsity; their performance is reported around the target sparsity 2%. We do not report results for sparsifiers like ER for OOT issues and those like UGS for their infeasibility in inductive settings (mini-batch training).

Dataset OGBN-PRODUCTS (Accuracy )

Sparsity % 10 30 50 70

Random 76.99 1.05 74.21 3.83 71.08 6.96 67.24 10.80 Rank Degree (Voudigari et al., 2016) 76.08 1.96 74.26 3.89 71.85 6.19 70.66 7.38 Local Degree (Hamann et al., 2016) 77.19 1.58 76.40 1.64 72.77 5.27 72.48 5.56 G-Spar (Murphy, 1996b) 76.15 1.89 74.20 3.84 71.55 6.49 69.42 8.62 LSim (Satuluri et al., 2011a) 77.96 0.08 74.98 2.06 72.67 5.37 70.43 7.61 SCAN (Xu et al., 2007) 76.30 1.74 74.33 3.71 71.25 6.79 71.12 6.92 DSpar (Liu et al., 2023b) 78.25 0.21 75.11 2.93 74.57 3.47 73.16 4.88

Ada GLT (Zhang et al., 2024a) 78.19 0.15 77.30 0.74 74.38 3.66 73.04 5.00 Mo G (Ours) 78.77 0.73 78.15 0.11 76.98 1.06 74.91 3.17

Whole Dataset 78.04 0.31

Published as a conference paper at ICLR 2025

Table 10: Node classification performance comparison to state-of-the-art sparsification methods. All methods are trained using Deeper GCN, and the reported metrics represent the average of five runs. We denote methods with that do not have precise control over sparsity; their performance is reported around the target sparsity 2%. OOM and OOT denotes out-of-memory and out-of-time, respectively.

Dataset OGBN-ARXIV (Accuracy )

Sparsity % 10 30 50 70

Topology-guided

Random 70.66 1.28 68.74 3.20 65.38 6.56 63.55 8.39 Rank Degree (Voudigari et al., 2016) 69.44 2.50 67.82 4.12 65.08 6.86 63.19 8.75 Local Degree (Hamann et al., 2016) 68.77 3.17 67.92 4.02 66.10 5.84 65.97 5.97 Forest Fire (Leskovec et al., 2006) 68.70 3.24 68.95 3.99 67.23 4.71 67.29 4.65 G-Spar (Murphy, 1996b) 70.57 1.37 70.15 1.79 68.77 3.17 65.26 6.68 LSim (Satuluri et al., 2011a) 69.33 2.61 67.19 4.75 63.55 8.39 62.20 9.74 SCAN (Xu et al., 2007) 71.33 0.61 69.22 2.72 67.88 4.06 64.32 7.62 ER (Spielman & Srivastava, 2008) 71.33 0.61 69.65 2.29 69.08 2.86 67.10 4.84 DSpar (Liu et al., 2023b) 71.65 0.29 70.66 1.28 68.03 3.91 67.25 4.69

Semantic-guided

UGS (Chen et al., 2021) 72.01 0.93 70.29 1.65 68.43 3.51 67.85 4.09 GEBT (You et al., 2022) 70.22 1.72 69.40 2.54 67.84 4.10 67.49 4.45 MGSpar (Wan & Schweitzer, 2021) 70.02 1.92 69.34 2.60 68.02 3.92 65.78 6.16 ACE-GLT (Wang et al., 2023b) 72.13 0.19 71.96 0.02 69.13 2.81 67.93 4.01 WD-GLT (Hui et al., 2023) 71.92 0.02 70.21 1.73 68.30 3.64 66.57 5.37 Ada GLT (Zhang et al., 2024a) 71.98 0.04 70.44 1.50 69.15 2.79 68.05 3.89 Mo G (Ours) 72.08 0.14 71.98 0.05 69.86 2.08 68.20 -3.74

Whole Dataset 71.93 0.04

Table 11: Node classification performance comparison to state-of-the-art sparsification methods. All methods are trained using Deeper GCN, and the reported metrics represent the average of five runs. We denote methods with that do not have precise control over sparsity; their performance is reported around the target sparsity 2%. OOM and OOT denotes out-of-memory and out-of-time, respectively.

Dataset OGBN-PROTEINS (ROC-AUC )

Sparsity % 10 30 50 70

Topology-guided

Random 80.18 2.55 78.92 3.83 76.57 6.16 72.69 10.04 Rank Degree (Voudigari et al., 2016) 80.14 2.59 79.05 3.73 78.59 4.13 76.22 6.51

Local Degree (Hamann et al., 2016) 79.40 3.33 79.83 3.90 78.50 4.23 78.25 4.48

Forest Fire (Leskovec et al., 2006) 81.49 1.24 78.47 4.26 76.14 6.59 73.89 9.84 G-Spar (Murphy, 1996b) 81.56 1.17 81.12 1.61 79.13 3.60 77.45 5.28 LSim (Satuluri et al., 2011a) 80.30 2.43 79.19 3.54 77.13 5.60 77.85 4.88 SCAN (Xu et al., 2007) 81.60 1.13 80.19 2.54 81.53 1.20 78.58 4.15 ER (Spielman & Srivastava, 2008) OOT DSpar (Liu et al., 2023b) 81.46 1.27 80.57 2.16 77.41 5.32 75.35 7.39

Semantic-guided

UGS (Chen et al., 2021) 82.33 0.40 81.54 1.19 78.75 4.98 76.40 6.33 GEBT (You et al., 2022) 80.74 2.99 80.22 2.51 79.81 3.92 76.05 6.68 MGSpar (Wan & Schweitzer, 2021) OOM ACE-GLT (Wang et al., 2023b) 82.93 0.80 82.01 0.72 81.05 1.68 75.92 6.81 WD-GLT (Hui et al., 2023) OOM Ada GLT (Zhang et al., 2024a) 82.60 0.13 82.76 0.97 80.55 2.18 78.42 4.31 Mo G (Ours) 83.32 0.41 82.14 0.59 81.92 0.81 80.90 1.83

Whole Dataset 82.73 0.02

Published as a conference paper at ICLR 2025

Table 12: Graph classification performance comparison to state-of-the-art sparsification methods. All methods are trained using PNA, and the reported metrics represent the average of five runs. We denote methods with that do not have precise control over sparsity; their performance is reported around the target sparsity 2%.

Dataset CIFAR-10 (Accuracy )

Sparsity % 10 30 50 70

Random 68.04 1.70 66.81 2.93 65.35 4.39 62.14 7.60 Rank Degree (Voudigari et al., 2016) 68.27 1.77 67.14 2.60 64.05 5.69 60.22 9.52 Local Degree (Hamann et al., 2016) 68.10 1.64 67.29 2.45 64.96 4.78 61.77 8.97 G-Spar (Murphy, 1996b) 67.13 2.61 65.06 4.68 64.86 4.88 62.92 6.82 LSim (Satuluri et al., 2011a) 69.75 0.01 67.33 2.41 66.58 3.16 64.86 4.88 SCAN (Xu et al., 2007) 68.25 1.49 66.11 3.63 64.59 5.15 63.20 6.54 DSpar (Liu et al., 2023b) 68.94 0.53 66.80 2.94 64.87 4.87 64.10 5.64

Ada GLT (Zhang et al., 2024a) 69.77 0.02 67.97 1.78 65.06 4.68 64.22 5.52 Mo G (Ours) 70.04 0.30 69.80 -0.94 68.28 1.46 66.55 3.19

Whole Dataset 69.74 0.17

Table 13: Sensitivity analysis of parameter k when applying Mo G to OGBN-PROTEINS+Rev GNN.

k 1 2 3 4 5

Mo G 88.37 89.04 89.09 88.55 88.20

We also reported the impact of selecting different values of k on the per-epoch training time and inference time, when applying Mo G to OGBN-ARXIV+Graph SAGE in Table 14. It can be observed that although the training and inference cost of Mo G increases as the number of selected experts increases, this additional cost is not significant: when k doubles from 2 to 4, the inference time only increases by 23%. More importantly, we can already achieve optimal performance with k {2, 3}, so there is no need to select too many experts, therefore avoiding significant inference delay.

Table 14: Sensitivity analysis of parameter k when applying Mo G to OGBN-ARXIV+Graph SAGE.

Sparsity Selected Expert k Per-Epoch Time Inference Time Acc.

30% 2 0.213 0.140 70.53 30% 3 0.241 0.161 70.48 30% 4 0.266 0.173 70.13 30% 5 0.279 0.190 69.57

G.3 ABLATION STUDY ON THE NOISE CONTROL OF THE ROUTER NETWORK

We test three different settings of epsilon on Graph SAGE+Ogbn-Arxiv: (1) ϵ N(0, I), (2) ϵ = 0, and (3) ϵ = 0.2, and report their performance under different sparsity levels in Table 15. We can see that trainable noisy parameters always bring the greatest performance gain to the model, which is consistent with previous practices in Mo E that the randomness in the gating network is beneficial.

G.4 SENSITIVITY ANALYSIS OF PARAMETER p

How is p determined in Equation (10)? Since the egographs of individual nodes vary in size, we calculate p proportionally as p = rp% |V(vi)| , where rp% represents the ratio of selected columns in the ego-graph of vi. In our experiments, we set rp% = 50% for simplicity.

What is its impact on performance? We conduct additional tests on Graph SAGE + OGBN-ARXIV using different values of rp%, as shown in Table 16. It can be observed that while excessively small values of p negatively impact the performance of Mo G, the model remains largely insensitive to variations in p within a broader range.

Published as a conference paper at ICLR 2025

Table 15: Ablation study on the noise control of the router network Ψ. ϵ N(0, I) corresponds to the original setting in our paper, ϵ = 0 corresponds to completely remove the noise modeling, and ϵ = 0.2 corresponds to fixing the noise coefficient.

Sparsity Train Acc Valid Acc Test Acc k

10% 77.20 72.68 71.93 3 30% 76.03 71.90 70.53 3 50% 72.45 69.54 69.06 3

10% 76.87 72.05 71.27 3 30% 75.99 71.15 70.14 3 50% 72.09 68.34 67.05 3

10% 76.98 72.22 71.75 3 30% 75.98 71.48 70.27 3 50% 73.15 69.84 68.45 3

Table 16: The sensitivity analysis of rp% on Graph SAGE+OGBN-ARXIV, with s = 50.

rp% 0.1 0.3 0.5 0.7 s = 30 69.92 70.03 70.53 70.41 s = 50 68.27 69.12 69.06 69.19

G.5 EXPERIMENTS INCORPORATING GLOBAL INFORMATION

While Mo G prunes edges based on the node s local context, the edge importance evaluation function in Equation (9), Cm(eij) = FFN (xi, xj, cm(eij)), can incorporate multi-hop or even global information. Specifically, we explored two approaches and conducted supplementary experiments accordingly:

1. Expanding xv to x(K) v with aggregated K-hop features: To integrate multi-hop information, we expanded xv into

x(K) v = CONCAT(h(1) v , h(2) v , . . . , h(K) v ),

where h(k) v = 1 |Nk(v)| P

u Nk(v) xu represents the K-hop features for node v, with Nk(v) denoting the K-hop neighbors of v.

2. Incorporating cm(eij) with prior global edge significance: For the cm(eij) function in Equation (9), we can consider global edge significance as prior guidance. This involved computing edge importance metrics such as Page Rank, Betweenness Centrality, or Eigenvector Centrality across the entire graph and passing them to each ego graph to improve edge evaluation.

Table 17: Performance of Mo G-Khop with multi-hop features and Mo G-m incorporating global edge significance as prior guidance on Graph SAGE+OGBN-ARXIV (node classification, 50% sparsity) and Deeper GCN+OGBG-PPA (graph classification, 50% sparsity).

Method Graph SAGE+OGBN-ARXIV Deeper GCN+OGBG-PPA Mo G-1hop 69.06 75.23 Mo G-2hop 69.54 75.79 Mo G-3hop 69.27 76.38 Mo G-Page Rank 69.03 75.70 Mo G-Betweenness 69.14 75.68 Mo G-Eigenvector 68.74 75.14

The results in Table 17 showcase that the performance gain from integrating global information varies across different tasks: in tasks such as graph classification, which rely more heavily on global information, Mo G-3hop results in a notable 1.15% accuracy improvement compared to Mo G-1hop.

Published as a conference paper at ICLR 2025

However, in node classification tasks, the gains from both hop expansion and global priors are relatively limited.

G.6 ADDITIONAL EXPERIMENT WITH LINK PREDICTION

Table 18: Additional experiments on Deeper GCN+OGBL-COLLAB. The reported metrics represent the average of five runs. (Metric = Hits@50, Baseline = 53.53%)

Sparsity % 10 30 50

Random 52.92 47.38 44.62 Local Degree 53.57 51.80 49.92 UGS 53.65 52.25 49.67 Ada GLT 53.82 53.61 52.69 Mo G 53.80 53.77 53.28

Table 19: Additional Experiments on GIN+PUBMED. The reported metrics represent the average of five runs. (Metric = ROC-AUC, Baseline = 0.895)

Sparsity % 10 30 50

Random 0.889 0.850 0.817 Local Degree 0.905 0.875 0.846 UGS 0.902 0.862 0.839 Ada GLT 0.898 0.884 0.851 Mo G 0.910 0.893 0.862

We follow the experimental setting in Chen et al. (2021) to test link prediction on PUBMED and OGBL-COLLAB From Tables 18 and 19, we can further verify that Mo G is also effective in the link prediction task, demonstrating strong generalization across diverse datasets and backbones.

H BROADER IMPACT AND DISCUSSION

H.1 BROADER IMPACT

Mo G, as a novel concept in graph sparsification, holds vast potential for general application. It allows for the sparsification of each node based on its specific circumstances, making it well-suited to meet the demands of complex real-world scenarios such as financial fraud detection and online recommender systems, which require customized approaches. More importantly, Mo G provides a selectable pool for future sparsification, enabling various pruning algorithms to collaborate and enhance the representational capabilities of graphs.

H.2 KEY CONTRIBUTIONS OF MOG

Briefly put, our contributions can be summarized as:

Node-Granular Customization: We propose a new paradigm of graph sparsification by introducing, for the first time, a method that customizes both sparsity levels and criteria for individual node modeling based on their local context. Mo E for Graph Sparsification: We design an innovative and highly pluggable graph sparsifier, dubbed Mixture of Graphs (Mo G), which pioneers the application of Mixture-of Experts (Mo E) in graph sparsification, supported by a robust theoretical foundation rooted in the Grassmann manifold. Empirical Evidence: Our extensive experiments on seven datasets and six backbones demonstrate that Mo G is (1) a superior graph sparsifier, maintaining GNN performance losslessly at 8.67% - 50.85% sparsity levels; (2) a computational accelerator, achieving a tangible 1.47 - 2.62 inference speedup; (3) a performance booster, which boosts ROC-AUC by 1.81% on OGBG-MOLHIV, 1.02% on OGBN-PROTEINS.

Published as a conference paper at ICLR 2025

H.3 REAL-WORLD USE CASES

Sp MM frequently represents the primary computational bottleneck in GNNs, accounting for 50% 70% of the total computational load (Liu et al., 2023b), thereby severely limiting their scalability on large graphs. There are some real use cases:

Online Fraud Detection: Financial transaction graphs are often massive, incurring significant Sp MM costs during inference. The high computational cost of Sp MM hinders the rapid detection and response required for financial fraud prevention, compromising user security.

Recommender Systems: Large-scale user-item interaction graphs in recommender systems face substantial Sp MM costs, restricting GNN deployment. Mo G enhances the feasibility and responsiveness of GNNs in recommender systems.

Network Architecture Search (NAS): NAS involves optimizing parameters for each architecture. The large parameter and gradient storage requirements of GNNs intensify memory demands. Moreover, NAS evaluates numerous candidate architectures, each requiring multiple training and validation cycles, amplifying Sp MM s computational cost.

Federated Graph Learning: The large-scale graph structures in Federated Graph Learning entail significant local Sp MM computations and parameter transmission. Mo G mitigates this by extracting sparse structures from local GNNs, reducing the parameter load sent to the central server.

We believe Mo G s potential to alleviate Sp MM bottlenecks could unlock broader applications of GNNs across these domains.

I POST-SPARSIFIED EGO-GRAPHS ASSEMBLING

we provide a detailed explanation of how post-sparsified ego-graphs are assembled as follows:

b G { sign Top K( b A, |E| s%) , X}, s% = 1

i=1 s(i)%, b A =

i=1 f b A(i) ,

where f : R|N (v)| |N (v)| R|V| |V| denotes the mapping of edges from the ego-graph to the global graph. After generating the post-sparsified ego-graphs, we compute the global sparsity s% by averaging the sparsity levels of each ego-graph and applying a function f that maps each ego-graph into the global graph. Afterward, we sum the weights of unpruned edges across all ego-graphs to form the weighted adjacency matrix b A. Finally, we prune the global graph to achieve the target sparsity s%, yielding the final sparsified graph.