Published as a conference paper at ICLR 2023

CHARACTERIZING THE INFLUENCE OF GRAPH ELEMENTS

Zizhang Chen, Peizhao Li, Hongfu Liu, Pengyu Hong
Brandeis University
{zizhang2,peizhaoli,hongfuliu,hongpeng}@brandeis.edu

ABSTRACT

The influence function, a method from robust statistics, measures the change in model parameters, or in functions of model parameters, caused by the removal or modification of training instances. It is an efficient and useful post-hoc method for studying the interpretability of machine learning models without the need for expensive model retraining. Recently, graph convolution networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no preceding research on influence functions for GCNs that sheds light on the effects of removing training nodes or edges from an input graph. Since the nodes and edges in a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we started with the simple graph convolution (SGC) model, which operates on an attributed graph, and formulated an influence function to approximate the change in model parameters when a node or an edge is removed from the graph. Moreover, we theoretically analyzed the error bound of the estimated influence of removing an edge. We experimentally validated the accuracy and effectiveness of our influence estimation function. In addition, we showed that the influence function of an SGC model can be used to estimate the impact of removing training nodes or edges on the test performance of the SGC without retraining the model. Finally, we demonstrated how to use influence functions to effectively guide adversarial attacks on GCNs.

1 INTRODUCTION

Graph data is pervasive in real-world applications such as online recommendation (Shalaby et al., 2017; Huang et al., 2021; Li et al., 2021), drug discovery (Takigawa & Mamitsuka, 2013; Li et al., 2017), and knowledge management (Rizun, 2019; Wang et al., 2018), to name a few. The growing need to analyze huge amounts of graph data has inspired work on Graph Neural Networks (GNNs), which bring deep learning to graph data (Gori et al., 2005; Scarselli et al., 2005; Li et al., 2016; Hamilton et al., 2017; Xu et al., 2019b; Jiang et al., 2019). Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017; Zhang & Chen, 2018; Fan et al., 2019), the most cited GNN architecture, adopt convolution and message-passing mechanisms. To better understand GCNs from a data-centric perspective, we consider the following question: Without model retraining, how can we estimate the change in GCN parameters when the graph used for learning is perturbed by edge or node removals?

This question asks for an estimate of the counterfactual effect on the parameters of a well-trained model when basic elements of the graph are manipulated, where the ground truth of such an effect would otherwise be obtained by retraining the model. With a computational tool as the answer, we can efficiently manipulate edges or nodes in a graph to control the change in the parameters of trained GCNs. The solution would enable further applications such as graph data rectification, improved model generalization, and graph data poisoning attacks through pure data modeling. Yet, current methods for training GCNs offer limited interpretability of the interactions between the training graph and the GCN model.
More specifically, we fall short of understanding the influence of input graph elements on both the change in model parameters and the generalizability of a trained model (Ying et al., 2019; Huang et al., 2022; Yuan et al., 2021; Xu et al., 2019a; Zheng et al., 2021).

In the regime of robust statistics, an analysis tool called the influence function (Hampel, 1974; Koh & Liang, 2017) was proposed to study the counterfactual effect of training data on model performance. For independent and identically distributed (i.i.d.) data, influence functions approximately estimate the change in the model when an infinitesimal perturbation is added to the training distribution, e.g., a reweighing of some training instances. However, unlike with i.i.d. data, a manipulation of a graph incurs a knock-on effect through GCNs. For example, removing an edge cuts off all message passing that is supposed to pass through that edge, which consequently changes node representations and affects the final model optimization. Therefore, introducing influence functions to graph data and GCNs is non-trivial and requires extra consideration.

In this work, we aim to derive influence functions for GCNs. As a first attempt in this direction, we focused on Simple Graph Convolution (Wu et al., 2019). Our contributions are three-fold:

- We derived influence functions for Simple Graph Convolution. Based on influence functions, we developed computational approaches to estimate the changes in model parameters caused by two basic perturbations: edge removal and node removal.
- We derived theoretical error bounds that characterize the gap between the estimated changes and the actual changes in model parameters for both edge and node removal.
- We show that our influence analysis on the graph can be utilized to (1) rectify the training graph to improve model testing performance, and (2) guide adversarial attacks on SGC, or conduct grey-box attacks on GCNs via a surrogate SGC.

Code is publicly available at https://github.com/Cyrus9721/Characterizing_graph_influence.

2 PRELIMINARIES

In the following sections, we use a lowercase $x$ for a scalar or an entity, an uppercase $X$ for a constant or a set, a bold lowercase $\mathbf{x}$ for a vector, and a bold uppercase $\mathbf{X}$ for a matrix.

Influence Functions. Influence functions (Hampel, 1974) estimate the change in model parameters when the empirical weight distribution of i.i.d. training samples is perturbed infinitesimally. Such estimation is computationally efficient compared to leave-one-out retraining iterated over every training sample. For $N$ training instances $x$ with labels $y$, consider the empirical risk minimization (ERM)

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{x, y} \ell(x, y) + \frac{\lambda}{2} \|\theta\|_2^2$$

for some loss function $\ell(\cdot, \cdot)$, a model parameterized by $\theta$, and a regularization term. When down-weighing a training sample $(x_i, y_i)$ by an infinitesimally small fraction $\epsilon$, the perturbed ERM can be expressed as

$$\hat{\theta}(x_i; \epsilon) = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{x, y} \ell(x, y) - \epsilon \, \ell(x_i, y_i) + \frac{\lambda}{2} \|\theta\|_2^2.$$

Influence functions estimate the actual change $\hat{\theta}(x_i; \epsilon) - \hat{\theta}$ for a strictly convex and twice-differentiable $\ell(\cdot, \cdot)$:

$$\mathcal{I}(x_i; \epsilon) = \lim_{\epsilon \to 0} \frac{\hat{\theta}(x_i; \epsilon) - \hat{\theta}}{\epsilon} = H_{\hat{\theta}}^{-1} \nabla_{\hat{\theta}} \ell(x_i, y_i), \quad (1)$$

where $H_{\hat{\theta}} := \frac{1}{N} \sum_{i=1}^{N} \nabla^2_{\hat{\theta}} \ell(x_i, y_i) + \lambda I$ is the Hessian matrix with regularization at parameter $\hat{\theta}$.
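To make Eq. (1) concrete, the following minimal sketch (our illustration, not the paper's released code) computes the estimated parameter change from removing one training sample of an $\ell_2$-regularized logistic regression and compares it against actual leave-one-out retraining. The synthetic data, the sample index, and the helper names are assumptions for the example.

```python
# Minimal sketch of Eq. (1): influence of removing one training sample
# from an L2-regularized logistic regression, checked against actual
# leave-one-out retraining. Illustrative only, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, D, lam = 200, 5, 1e-2                          # assumed toy sizes

X = rng.normal(size=(N, D))
y = (X @ rng.normal(size=D) + 0.1 * rng.normal(size=N) > 0).astype(float)

def fit(X, y):
    # sklearn minimizes 0.5||w||^2 + C * sum_i loss_i, so C = 1/(lam*N)
    # matches the objective (1/N) sum_i loss_i + (lam/2)||theta||^2.
    clf = LogisticRegression(C=1.0 / (lam * len(y)), fit_intercept=False,
                             tol=1e-10, max_iter=10000)
    clf.fit(X, y)
    return clf.coef_.ravel()

theta = fit(X, y)
p = 1.0 / (1.0 + np.exp(-(X @ theta)))            # sigmoid predictions

# Per-sample gradients of the cross-entropy loss at theta_hat.
grads = (p - y)[:, None] * X                      # shape (N, D)

# Regularized Hessian: H = (1/N) sum_i p_i(1-p_i) x_i x_i^T + lam * I
W = p * (1 - p)
H = (X * W[:, None]).T @ X / N + lam * np.eye(D)

i = 7                                             # sample whose removal we study
# I(-x_i) ~ (1/N) H^{-1} grad_theta l(x_i, y_i), i.e. epsilon = 1/N in Eq. (1)
est_change = np.linalg.solve(H, grads[i]) / N

# Ground truth: retrain without sample i.
theta_loo = fit(np.delete(X, i, 0), np.delete(y, i))
print("estimated :", est_change)
print("actual    :", theta_loo - theta)
```

Note the sketch solves a linear system rather than forming $H_{\hat{\theta}}^{-1}$ explicitly; at scale one would use the implicit Hessian-vector-product techniques of Koh & Liang (2017).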
For some differentiable model evaluation function $f: \Theta \to \mathbb{R}$, such as the total model loss over a test set, the change in the evaluation result from down-weighing $(x_i, y_i)$ by $\epsilon$ can be approximated by $\nabla_{\hat{\theta}} f(\hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\hat{\theta}} \ell(x_i, y_i)$. When the size $N$ of the training data is large, setting $\epsilon = \frac{1}{N}$ approximates the change of $\hat{\theta}$ incurred by removing an entire training sample, $\mathcal{I}(x_i; \frac{1}{N}) = \mathcal{I}(-x_i)$, via linear extrapolation as $\frac{1}{N} \to 0$. Obviously, in terms of the estimated influence $\mathcal{I}$, removing a training sample has the opposite value of adding the same training sample: $\mathcal{I}(-x_i) = -\mathcal{I}(+x_i)$. In our work, we shall assume additivity of influence functions when several samples are removed, e.g., for two samples: $\mathcal{I}(-x_i, -x_j) = \mathcal{I}(-x_i) + \mathcal{I}(-x_j)$. Though efficient, influence functions on non-convex models suffer from estimation errors, due both to varying local minima and to the computational approximation of $H_{\hat{\theta}}^{-1}$ usually required for a non-invertible Hessian matrix. To introduce influence functions from i.i.d. data to graphs and precisely characterize the influence of graph elements on changes in model parameters, we consider a convex model from the GCN family called Simple Graph Convolution.

Simple Graph Convolution. By removing the non-linear activations between the layers of typical Graph Convolutional Networks, Simple Graph Convolution (SGC) (Wu et al., 2019) formulates a linear simplification of GCNs with competitive performance on various tasks (He et al., 2020; Rakhimberdina & Murata, 2019). Let $G = (V, E)$ denote an undirected attributed graph, where $V = \{v\}$ contains the vertices with corresponding features $\mathbf{X} \in \mathbb{R}^{|V| \times D}$, with $D$ the feature dimension, and $E = \{e_{ij}\}_{1 \le i < j \le |V|}$ contains the edges.
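As a companion illustration (our sketch, not from the paper), the snippet below implements SGC's $K$-step feature propagation $\mathbf{X}' = \mathbf{S}^K \mathbf{X}$ with $\mathbf{S} = \tilde{\mathbf{D}}^{-1/2}(\mathbf{A} + \mathbf{I})\tilde{\mathbf{D}}^{-1/2}$, following the definition in Wu et al. (2019), and shows how removing a single edge shifts the propagated features of every node within $K$ hops, the knock-on effect noted above. The toy path graph, features, and choice of $K$ are assumptions.

```python
# Minimal sketch of SGC's K-step feature propagation X' = S^K X,
# with S the symmetrically normalized adjacency with self-loops
# (Wu et al., 2019). Toy graph and K are illustrative assumptions.
import numpy as np

def sgc_features(A: np.ndarray, X: np.ndarray, K: int = 2) -> np.ndarray:
    """Return the propagated features S^K X for a symmetric adjacency A."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d = A_tilde.sum(axis=1)                       # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt         # symmetric normalization
    out = X
    for _ in range(K):                            # K rounds of message passing
        out = S @ out
    return out                                    # after this, SGC is linear in out

# Toy 4-node path graph 0-1-2-3; removing edge (1,2) perturbs the
# propagated features of every node within K hops of that edge.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)

before = sgc_features(A, X, K=2)
A_removed = A.copy()
A_removed[1, 2] = A_removed[2, 1] = 0.0
after = sgc_features(A_removed, X, K=2)
print("feature shift per node:\n", after - before)
```

Because the propagation is parameter-free and the classifier fitted on $\mathbf{S}^K \mathbf{X}$ is linear, the resulting objective stays convex, which is what makes SGC amenable to the influence analysis above.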