# temporal_generalization_estimation_in_evolving_graphs__0b7f0991.pdf Published as a conference paper at ICLR 2024 TEMPORAL GENERALIZATION ESTIMATION IN EVOLVING GRAPHS Bin Lu1, Tingyan Ma1, Xiaoying Gan1, Xinbing Wang1 Yunqiang Zhu2, Chenghu Zhou2, Shiyu Liang3 1Department of Electronic Engineering, Shanghai Jiao Tong University 2Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences 3John Hopcroft Center for Computer Science, Shanghai Jiao Tong University {robinlu1209, xiaokeaiyan, ganxiaoying, xwang8}@sjtu.edu.cn {zhuyq, zhouch}@igsnrr.ac.cn, lsy18602808513@sjtu.edu.cn Graph Neural Networks (GNNs) are widely deployed in vast fields, but they often struggle to maintain accurate representations as graphs evolve. We theoretically establish a lower bound, proving that under mild conditions, representation distortion inevitably occurs over time. To estimate the temporal distortion without human annotation after deployment, one naive approach is to pre-train a recurrent model (e.g., RNN) before deployment and use this model afterwards, but the estimation is far from satisfactory. In this paper, we analyze the representation distortion from an information theory perspective, and attribute it primarily to inaccurate feature extraction during evolution. Consequently, we introduce SMART, a straightforward and effective baseline enhanced by an adaptive feature extractor through self-supervised graph reconstruction. In synthetic random graphs, we further refine the former lower bound to show the inevitable distortion over time and empirically observe that SMART achieves good estimation performance. Moreover, we observe that SMART consistently shows outstanding generalization estimation on four real-world evolving graphs. The ablation studies underscore the necessity of graph reconstruction. For example, on OGB-ar Xiv dataset, the estimation metric MAPE deteriorates from 2.19% to 8.00% without reconstruction. 1 INTRODUCTION The rapid rising of Graph Neural Networks (GNN) leads to widely deployment in various applications, e.g. social network, smart cities, drug discovery (Yang & Han, 2023; Lu et al., 2022; Xu et al., 2023). However, recent studies have uncovered a notable challenge: as the distribution of the graph shifts continuously after deployment, GNNs may suffer from the representation distortion over time, which further leads to continuing performance degradation (Liang et al., 2018; Wu et al., 2022; Lu et al., 2023), as shown in Figure 1. This distribution shift may come from the continuous addition of nodes and edges, changes in network structure or the introduction of new features. This issue becomes particularly salient in applications where the graph evolves rapidly over time. 1985 1990 1995 2000 2005 2010 2015 0.35 Accuracy #Node Figure 1: GNN performance continues to decline with the rapid growth over 30 years. Consequently, a practical and urgent need is to monitor the representation distortion of GNN. An obvious method is to regularly label and test online data. However, constant human annotation is difficult to withstand the rapidly ever-growing evolution of graph after deployment. Therefore, how to proactively estimate the temporal generalization performance without annotation after deployment is a challenging problem. To solve this problem, a naive way is to collect the generalization changes through partially-observed labels before deployment, and train a recurrent neural network (RNN) (Mikolov et al., 2010) in a supervised manner. 
However, existing stud- Correspondence to: Shiyu Liang (lsy18602808513@sjtu.edu.cn) Published as a conference paper at ICLR 2024 ies (Li et al., 2018; Pareja et al., 2020) have shown that RNN itself have insufficient representation power, thereby usually concatenating a well-designed feature extractor. Unluckily, the representation distortion of static feature extractor still suffers during evolution. Other methods consider measuring the distribution distance between training and testing data through average confidence (Guillory et al., 2021), representation distance Deng & Zheng (2021) and conformal test martingales (Vovk et al., 2021). However, these methods do not directly estimate generalization performance. Meanwhile, graph evolution can be very fast and the evolving graphs are usually very different from their initial states. Hence, to deal with this problem, we propose SMART (Selfsupervised te Mpor Al gene Ralization es Timation). Since it is hard to gather label information in a rapid graph evolution after deployment, our SMART resorts to an adaptive feature extractor through self-supervised graph feature and structure reconstruction, eliminating the information gap due to the evolving distribution drift. We summarize the main contributions as follows: We theoretically prove the representation distortion is unavoidable during evolution. We consider a single-layer GCN with Leaky Re LU activation function and establish a lower bound of distortion, which indicates it is strictly increasing under mild conditions. (see Section 2.3) We propose a straightforward and effective baseline SMART for this underexplored problem, which is enhanced with self-supervised graph reconstruction to minimize the information loss in estimating the generalization performance in evolving graphs. (see Section 3) Under a synthetic Barab asi Albert random graph setting, we refine the lower bound to derive a closed-form generalization error of GNN on whole graph after deployment, and our SMART achieves a satisfied prediction performance. (see Section 4) We conduct extensive experiments on four real-world evolving graphs and our SMART consistently shows outstanding performance on different backbones and time spans. The ablation studies indicate the augmented graph reconstruction is of vital importance, e.g., our method reduces the MAPE from 8.00% to 2.19% on OGB-ar Xiv citation graphs. (see Section 5) 2 PRELIMINARY 2.1 NOTATIONS Graph. Let X Rn d denote the feature matrix where each row xi Rd of the matrix X denotes the feature vector of the node i in the graph. Let Y Rn denote the label matrix where each element yi R of the matrix Y denotes the label of the node i. Let A {0, 1}n n denote the adjacency matrix where the element aij = 1 denotes the existence of an edge between node i and j, while aij = 0 denotes the absence of an edge. Let A = A + I denote the self-looped adjacency matrix, while the diagonal matrix D = I + Diag(d1, ..., dn) denotes the self-looped degree matrix, where di = P j =i Aij denotes the degree of node i. Let L = D 1 A denote a standard normalized self-looped adjacency matrix (Perozzi et al., 2014). Graph Evolution Process. We use Gt = (At, Xt, Yt) to denote the graph at time t. We consider an ever-growing evolution process G = (Gt : t N), where N denotes set of all natural numbers. We use nt to denote the number of nodes in the graph Gt. 
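For reference, the normalized self-looped adjacency defined above can be computed per snapshot G_t as in the following sketch. This is our own illustration (not the paper's released code), assuming a dense 0/1 adjacency matrix for simplicity; sparse matrices would be used in practice for large evolving graphs.

```python
import numpy as np

def normalized_self_looped_adjacency(A: np.ndarray) -> np.ndarray:
    """Compute L = D^{-1} (A + I), the row-normalized self-looped adjacency.

    A is assumed dense, symmetric and 0/1. The self-looped degree matrix is
    D = I + Diag(d_1, ..., d_n), so each row i of (A + I) is divided by d_i + 1.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)            # add self-loops
    d_tilde = A_tilde.sum(axis=1)      # self-looped degrees d_i + 1
    return A_tilde / d_tilde[:, None]  # row-wise division = D^{-1} (A + I)
```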
We use the conditional probability P (Gt|G0, ..., Gt 1) to denote the underlying transition probability of the graph state Gt at time t conditioned on the history state G0, ..., Gt 1. 2.2 PROBLEM FORMULATION As shown in Figure 2, before the deployment at time tdeploy, we are given with a pre-trained graph neural network G that outputs a label for each node in the graph, based on the graph adjacency matrix Ak {0, 1}nk nk and feature matrix Xk Rnk d at each time k. Additionally, we are given an observation set D = {(Ak, Xk, Hk Yk)}t k=0, consisting of (1) a series of fully observed graph adjacency matrices A0, , At and node feature matrices X0, , Xt; (2) a series of partially observed label vectors H0Y0, , Ht Yt, where each observation matrix Hk {0, 1}nk nk is diagonal. Specifically, the diagonal element on the i-th row in matrix Hk is non-zero if and only if the label of the node i is observed. Published as a conference paper at ICLR 2024 Before Deployment After Deployment 𝒕𝐝𝐞𝐩𝐥𝐨𝐲 Label (partial) Label Figure 2: Illustration for generalization estimation. At each time τ after deployment at time tdeploy, we aim to predict the expected temporal generalization performance of the model G on the state Gτ = (Aτ, Xτ, Yτ), i.e., predicting the generalization error E [ℓ(G(Aτ, Xτ), Yτ)|Aτ, Xτ], given a full observation on the adjacency matrix Aτ and feature matrix Xτ. Obviously, the problem is not hard if we have a sufficient amount of labeled samples drawn from the distribution P (Yτ|G0, ..., Gτ 1, Aτ, Xτ). However, it is costly to obtain the human annotation on the constant portion of the whole graph after deployment, especially many real-world graphs grow exponentially fast. Therefore, the problem arises if we try to design a post-deployment testing performance predictor M without further human annotations after deployment. More precisely, we try to ensure that the accumulated performance prediction error is small within a period of length T tdeploy after deployment time tdeploy, where the accumulated performance prediction error is defined as E(M; G) EA,X,Y τ=tdeploy+1 (M(Aτ, Xτ) ℓ(G(Aτ, Xτ), Yτ))2 where the random matrix sequence A = (Atdeploy+1, ..., AT ), sequence X = (Xtdeploy+1, ..., XT ) and sequence Y = (Ytdeploy+1, ..., YT ) contain all adjacency matrices, all feature matrices and all label vectors from time tdeploy + 1 to T separately. 2.3 UNAVOIDABLE REPRESENTATION DISTORTION AS GRAPH EVOLVES One may ask, then, what impact the pre-trained graph neural network may have on the node representation over time? In this subsection, we show that as the graph evolves, the distortion of representation is unavoidable, which might further lead to potential performance damage. 0 3 6 9 12 Timestep Prediction Loss Figure 3: Test loss changes over time of 30 pre-trained GNN checkpoints on OGB-ar Xiv dataset. First, as shown in Figure 3, we plot the performance variation on the OGB-ar Xiv dataset (Hu et al., 2020a) with 30 different GNN checkpoints. We observe that GNN prediction performance continues to deteriorate over time. Additionally, we prove that the representation distortion strictly increasing with probability one. Before presenting the results, we first present the following several assumptions. Assumption 1 (Graph Evolution Process). The initial graph G0 = (A0, X0, Y0) has n nodes. (1) We assume that the feature matrix X0 is drawn from a continuous probability distribution supported on Rn d. (2) At each time t in the process, a new node indexed by n + t appears in the graph. 
We assume that this node connects with each node in the existing graph with a positive probability and that edges in the graph do not vanish in the process. (3) We assume that the feature vector xn+t has a zero mean conditioned on all previous graph states, i.e., E[xn+t|G0, ..., Gt 1] = 0d for all t 1. Remark 1. Assumption 1 states the following: (1) For any given subspace in the continuous probability distribution, the probability that X0 appears is zero. (2) The ever-growing graph assumption is common in both random graph analysis, like the Barab asi-Albert graph (Albert & Barab asi, 2002), and real-world scenarios. For example, in a citation network, a paper may have a citation relationships with other papers in the network, and this relationships will not disappear over time. Similarly, the purchasing relationships between users and products in e-commerce trading networks are the same. (3) The zero-mean assumption always holds, as it is a convention in deep learning to normalize the features. In this subsection, we consider the pre-trained graph model as a single-layer GCN with Leaky Re LU activation function parameterized as θ. The optimal parameters for the model are θ . However, due Published as a conference paper at ICLR 2024 to the stochastic gradient descent in optimization and quantization of models, we always obtain a sub-optimal parameter drawn from a uniform distribution U(θ , ξ). We define the expected distortion of the model output on node i at time t as the expected difference between the model output at time t and time zero on the node i, i.e., ℓt(i) = Eθ,Gt h |ft(i; θ) f0(i; θ)|2i . Theorem 1. If θ is the vectorization of the parameter set {(aj, Wj, bj)}N j=1 and its i-th coordinate θi is drawn from the uniform distribution U(θ i , ξ) centering at the i-th coordinate of the vector θ i , the expected deviation ℓτ(i) of the perturbed GCN model at the time τ 0 on the node i {1, ..., n} is lower bounded by ℓτ(i) ϕτ(i) Nβ2ξ4 1 dτ(i) 1 d0(i) where the set N0(i) denotes the neighborhood set of the node i at time 0, β is is the slope ratio for negative values in Leaky Re LU. Additionally, ϕτ(i) is strictly increasing with respect to τ 1. Remark 2. Proofs of Theorem 1 can be found in Appendix B. This theorem shows that for any node i, the expected distortion of the model output is strictly increasing over time, especially when the width N of the model is quite large in the current era of large models. Note that the continuous probability distribution and the ever-growing assumption in Assumption 1 are necessary. Otherwise, P k N0(i) xk 2 2 might be zero for some choice of X0, which would lead to the lower bound being zero for all τ and further indicate that the lower bound is not strictly increasing. 3 METHODOLOGY 3.1 A NAIVE WAY OF ESTIMATING GENERALIZATION LOSS Before deployment, we are given an observation set D = {(Ak, Xk, Hk Yk)}tdeploy k=0 . Now, we want to construct a temporal generalization loss estimator that takes the adjacency matrices and feature matrices as its input and predicts the difference between the outputs of the graph neural network G and the observed true labels at time k, i.e., ℓk = ℓ(Hk G(Ak, Xk), Hk Yk). In order to capture the temporal variation, we adopt a recurrent neural network-based model M( ; θRNN) to estimate generalization loss in the future. The RNN model sequentially takes the output of the GNN G(Ak, Xk) as its input and outputs the estimation ˆℓk at time k. 
To enhance the representation power of RNN, we usually add a non-linear feature extractor φ to capture the principal features in the inputs and the RNN model becomes ˆℓk hk = Mℓ(φ G(Ak, Xk), hk 1) Mh (φ G(Ak, Xk), hk 1) where hk denotes the hidden state in the RNN, transferring the historical information. During training, we are trying to find the optimal model M , φ such that the following empirical risk minimization problem is solved, (M , φ ) = arg min M,φ k=0 Mℓ(φ G(Ak, Xk), hk 1) ℓ(Hk G(Ak, Xk), Hk Yk) 2. After deployment time tdeploy, a naive and straightforward way of using the generalization loss estimator is directly applying the model on the graph sequence (Gτ : τ > tdeploy). This results in a population loss prediction error Eτ at time τ given by Eτ(M , φ ) = E M ℓ(φ G(Ak, Xk), hk 1) ℓ(G(Ak, Xk), Yk) 2. Before deployment, however, we are not able to get sufficient training frames. This indicates φ may only have good feature extraction performance on graphs similar to the first several graphs. After deployment, the graphs undergo significant changes, and thereby have unavoidable representation distortion, which makes the generalization estimator perform worse and worse. Published as a conference paper at ICLR 2024 3.2 INFORMATION LOSS BY REPRESENTATION DISTORTIONS To further investigate this problem, let us consider the information loss within the input graph series and the output prediction after the GNN is deployed, Information Loss I({(Aτ, Xτ)}k τ=tdeploy+1, D; ℓk) I(ˆℓk; ℓk), where I(U; V ) is the mutual information of two random variables U and V . The learning process is equivalent to minimizing the above information loss. Furthermore, it can be divided into two parts: Information Loss = I({φ G(Aτ, Xτ)}k τ=tdeploy+1, D; ℓk) I(ˆℓk; ℓk) | {z } ①Information Loss Induced by RNN + I({G(Aτ, Xτ)}k τ=tdeploy+1, D; ℓk) I({φ G(Aτ, Xτ)}k τ=tdeploy+1, D; ℓk). | {z } ②Information Loss Induced by Representation Distortion Information Loss Induced by RNN. The first part is induced by the insufficient representation power of the RNN. Specifically, ˆℓk is a function of the current state and the hidden state vector hk 1. Here, there exists information loss if hk 1 cannot fully capture all the historical information. However, this is usually inevitable due to the insufficient representation power of the RNN, limited RNN hidden dimension, and inadequate training data. Information Loss Induced by Reprentation Distortion. The second part indicates the information loss by the representation distortion of φ. Especially as the graph continues to evolve over time, the information loss correspondingly increases due to the distribution shift. According to the data-processing inequality (Beaudry & Renner, 2012) in information theory, post-processing cannot increase information. Therefore, the second part of Equation 2 holds for any time τ, I({G(Aτ, Xτ)}k τ=tdeploy+1, D; ℓk) I({φ G(Aτ, Xτ)}k τ=tdeploy+1, D; ℓk) 0. The equality holds if and only if φ G(Aτ, Xτ) is a one-to-one mapping with G(Aτ, Xτ). To minimize the information loss and achieve equality, we have to choose a bijective mapping φ, which further indicates minϕ P τ E G(Aτ, Xτ) ϕ φ G(Aτ, Xτ) 2 2 = 0. Therefore, to minimize the information loss, it is equivalent to solve the graph representation reconstruction problem. In the next subsection, we propose a reconstruction algorithm to update feature extractor and adapt to the significant graph distribution shift after deployment. 
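Before presenting SMART, a minimal PyTorch-style sketch of the RNN-based estimator M with feature extractor φ from Section 3.1 is given below. The concrete choices (a GRU cell, mean pooling over nodes to obtain a graph-level summary, a one-layer MLP for φ, and the dimensions) are our illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NaiveLossEstimator(nn.Module):
    """RNN-based estimator M: at each step k it takes the GNN output G(A_k, X_k),
    compresses it with a feature extractor phi, updates a hidden state h_k (M_h),
    and emits a generalization-loss estimate \\hat{l}_k (M_l)."""

    def __init__(self, gnn_out_dim: int, feat_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(gnn_out_dim, feat_dim), nn.ReLU())
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)   # maps hidden state to loss estimate

    def forward(self, gnn_outputs, h=None):
        # gnn_outputs: list of [n_k, gnn_out_dim] tensors, one per timestep k
        estimates = []
        for O_k in gnn_outputs:
            z_k = self.phi(O_k).mean(dim=0)           # graph-level summary of phi(G(A_k, X_k))
            h = self.rnn(z_k.unsqueeze(0), h)         # M_h: update hidden state h_k
            estimates.append(self.head(h).squeeze())  # M_l: estimate \hat{l}_k
        return torch.stack(estimates), h

def train_step(model, optimizer, gnn_outputs, observed_losses):
    """One empirical-risk-minimization step on the pre-deployment frames.

    observed_losses is a 1-D tensor whose k-th entry is the loss computed on the
    partially observed labels, l(H_k G(A_k, X_k), H_k Y_k).
    """
    optimizer.zero_grad()
    est, _ = model(gnn_outputs)
    loss = ((est - observed_losses) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```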
3.3 SMART: SELF-SUPERVISED TEMPORAL GENERALIZATION ESTIMATION

In this section, we present our method, SMART, for generalization estimation in evolving graphs, as shown in Figure 4. After deployment, since there are no human-annotated labels, we design an augmented graph reconstruction task that provides self-supervised signals to finetune the adaptive feature extractor φ at post-deployment time, thereby reducing the information loss during the dynamic graph evolution. Compared with direct graph reconstruction, our method generates more views through data augmentation, which yields more principal and robust representations.

Post-deployment Graph Reconstruction. Given the pre-trained graph model G and the evolving graph (A_k, X_k) at time k, we first obtain the feature embedding matrix O_k = G(A_k, X_k). In order to capture the evolution property of graphs, we define two reconstruction losses on the augmented feature graph (T(A_k), O_k), where T(·) is a structure transformation function, i.e., randomly adding or dropping edges (You et al., 2020; Rong et al., 2020; Liu et al., 2023).

Structure Reconstruction Loss L_s. First, we define the structure reconstruction loss L_s. Given the augmented feature graph (T(A_k), O_k), we reconstruct the adjacency matrix A_k. Specifically, the reconstructed adjacency matrix Â_k is computed by Â_k = σ(F_k F_kᵀ), F_k = φ(T(A_k), O_k), where F_k ∈ R^{N×B} and σ is the sigmoid activation function. Here we use a one-layer graph attention network (GAT) as the model φ (Veličković et al., 2018), and it is optimized with the binary cross-entropy loss between A_k and Â_k as a link prediction task, i.e., L_s = L_BCE(A_k, Â_k).

Figure 4: Overview of our proposed SMART for generalization estimation in evolving graphs.

Feature Reconstruction Loss L_f. Moreover, we perform node-level feature reconstruction on the corrupted adjacency matrix T(A_k). We use a single-layer decoder a to obtain the reconstructed features Ô_k by Ô_k = F_k aᵀ, F_k = φ(T(A_k), O_k), where a ∈ R^{B×B}, and it is optimized with the mean squared error loss L_f = ||O_k − Ô_k||².

To sum up, the reconstruction loss L_g combines the structure reconstruction loss L_s and the feature reconstruction loss L_f as

L_g(φ) = λ L_s + (1 − λ) L_f,  (3)

where λ is a proportional weight ratio balancing the two loss terms.

Pre-deployment Graph Reconstruction. We note that, before deployment, we combine the same graph reconstruction with supervised learning to improve the quality of feature extraction. Algorithm 1 in Appendix A outlines the pre-deployment warm-up training and post-deployment finetuning of SMART in detail.

4 A CLOSER LOOK AT BARABÁSI–ALBERT RANDOM GRAPHS

To theoretically verify the effectiveness of SMART, we first consider synthetic random graphs, i.e., the Barabási–Albert (BA) graph model G = (G_t : t ∈ N) (Barabási, 2009). Please refer to Appendix E for the details.

Assumption 2 (Preferential Attachment Evolving Graphs). Here we consider a node regression task. The initial graph G_0 has N_0 nodes. (1) We assume the node feature matrix X_k is a Gaussian random variable with distribution N(0, I_B), where I_B is the B×B identity matrix. (2) The node label of each node i is generated by y_i = d_i^α X_{im}, α ≥ 0, which covers quantities such as node degree and closeness centrality coefficients. (3) A single-layer GCN f(G) = LXW is given as the pre-trained graph model G.

Theorem 2.
If at each time-slot t, the Barab asi Albert random graph is grown by attaching one new node with m edges that are preferentially attached to existing nodes with high degree. To quantify the performance of GNN, the mean graph-level generalization relative error under the stationary degree distribution Q is determined by EG = 2m t N0 + t (C2EQd 2α 4 2CEQd α 4 + EQd 3), where d is the degree of nodes. C = (EQdα 1)/(βEQd 1) and di is the degree of node vi. Remark 3. Proofs of Theorem 2 can be found in the Appendix C. This theorem shows that when the node scale N0 of initial graph G0 is quite large, the graph-level error loss is approximately linearly related to time t and continues to deteriorate. To verify the above propositions, we generate a Barab asi Albert (BA) scale-free model with the following setting: the initial scale of the graph is N0 = 1000, and the evolution period is 180 timesteps. At each timestep, one vertex is added with m = 5 edges. The label of each node is the closeness centrality coefficients. The historical training time period is only 9. As we derived in Theorem 2, the actual graph-level generalization error approximates a linear growth pattern. Therefore, we consider to compare SMART with linear regression model to estimate the generalization performance. Does Linear Regression Model Work? We conduct experiments with 10 random seeds and present the empirical results in Figure 5a. (1) Generalization error exhibits linear growth, consistent with our Published as a conference paper at ICLR 2024 Table 1: Performance comparison on different Barab asi Albert graph setting. We use Mean Absolute Percentage Error (MAPE) Standard Error to evaluate the estimation on different scenarios. Barab asi Albert (N0 = 1000) Dual Barab asi Albert (N0 = 1000, m1 = 1) m = 2 m = 5 m = 10 m2 = 2 m2 = 5 m2 = 10 Linear 79.2431 74.1083 82.1677 61.8048 67.6442 38.4884 SMART 7.1817 1.2350 4.2602 0.5316 9.1173 0.1331 7.9038 1.8008 3.8288 0.1706 1.9947 0.1682 0 30 60 90 120 150 180 Timestep Pre-trained GNN Performance Full Observation Partial Observation Linear Regression Smart (Ours) (a) Performance comparison with Linear Regression and SMART. 0 30 60 90 120 150 180 Timestep Pre-trained GNN Performance Full Observation Smart (Ours) w/o Reconstruction (b) Ablation study of removing graph reconstruction in SMART. 0.02 0.00 0.02 Reduction of Reconstruction Loss Δ g Prediction Improvement (%) Finetuning Epoch (c) Prediction improvement during the graph reconstruction. Figure 5: Experimental results of SMART and its variation on BA random graph. theorem (the blue solid line). (2) Our SMART method performs significantly well in a long testing period, with a mean prediction percentage error of 4.2602% and collapses into a roughly linear model. (3) However, the linear regression model, based on the first 9-step partial observations (the green solid line), exhibits extremely poor performance. Due to limited human annotation, partial observations cannot accurately represent the performance degradation of the entire graph. We also adjust parameters in the BA graph model and introduced dual BA graph model (Moshiri, 2018) for further experiments (see Table 1). Our proposed SMART model effectively captures the evolutionary characteristics across various settings and significantly outperforms the linear regression model. Effectiveness of Graph Reconstruction. To further validate the effectiveness of graph reconstruction in SMART, we conduct following two experiments. 
(1) As shown in Figure 5b, we remove the graph reconstruction module and repeat the experiment with 10 random seeds. Due to the temporal distribution shift caused by graph evolution, the generalization estimates show significant deviations and instability. (2) We track the intermediate results during post-deployment fine-tuning, i.e., the reduction of the reconstruction loss and the prediction improvement. As depicted in Figure 5c, in the early stage of reconstruction (scatter points in light color), the prediction performance fluctuates. As the optimization continues (scatter points in dark color), the prediction performance is effectively boosted and concentrates in the upper-right corner, with an average performance improvement of 10%.

5 EXPERIMENTS ON REAL-WORLD EVOLVING GRAPHS

5.1 EXPERIMENT SETTINGS

Datasets. We use two citation datasets, a co-authorship network dataset, and a series of social network datasets for evaluation. The statistics and more details are provided in Appendix F. OGB-arXiv (Hu et al., 2020a): a citation network between arXiv papers from 40 subject areas; we divide the temporal evolution into 14 timesteps. DBLP (Galke et al., 2021): a citation network focused on computer science; we divide the temporal evolution into 17 timesteps, and the task is to predict 6 venues. Pharmabio (Galke et al., 2021): a co-authorship graph that links papers sharing common authors; we divide the graph into 30 timesteps and predict 7 journal categories. Facebook 100 (Lim et al., 2021): social networks from Facebook at five universities: Penn, Amherst, Reed, Johns Hopkins, and Cornell; the task is to predict the reported gender.

Evaluation Metrics. To evaluate the performance, we adopt the following metrics. (1) Mean Absolute Percentage Error (MAPE) quantifies the average relative difference between the predicted generalization loss and the actual observed generalization loss, calculated as MAPE = (1 / (T − t_deploy)) Σ_{τ=t_deploy+1}^{T} |ℓ̂_τ − ℓ_τ| / ℓ_τ × 100%. (2) Standard Error (SE) measures the variability of the sample mean and estimates how much it is likely to deviate from the true population mean, calculated as SE = σ/√n, where σ is the standard deviation and n is the number of samples. In addition, we compare performance on Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE); please refer to Appendix G.1.

Experiment Configurations. We adopt the vanilla graph convolutional network as the pre-trained graph model G, which is trained on the initial timestep of the graph with 10% labeling. During training, we only use the next 3 historical timesteps, where we randomly label 10% of the newly arrived nodes. The remaining timesteps are reserved for testing, where we have no label observations. We run SMART and the baselines 20 times with different random seeds.

5.2 EXPERIMENTAL RESULTS

Comparison with Baselines. In Table 2, we report the performance comparison with two existing baselines, linear regression and DoC (Guillory et al., 2021). Moreover, we compare SMART with a supervised baseline, which requires new node labels and retrains the model on the new data after deployment. We observe that SMART consistently outperforms the other two self-supervised methods (linear regression and DoC) on all evaluation metrics, demonstrating the superior temporal generalization estimation of our method. On the OGB-arXiv dataset, SMART achieves performance comparable with the supervised baseline.
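The MAPE and Standard Error values reported in the tables below can be computed with the following sketch, a straightforward reading of the definitions in Section 5.1; the function and variable names are ours.

```python
import numpy as np

def mape(predicted_losses, true_losses):
    """Mean Absolute Percentage Error over the post-deployment timesteps."""
    predicted_losses = np.asarray(predicted_losses, dtype=float)
    true_losses = np.asarray(true_losses, dtype=float)
    return np.mean(np.abs(predicted_losses - true_losses) / np.abs(true_losses)) * 100.0

def standard_error(samples):
    """Standard error of the mean across repeated runs: sigma / sqrt(n)."""
    samples = np.asarray(samples, dtype=float)
    return samples.std(ddof=1) / np.sqrt(len(samples))
```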
Table 2: Performance comparison on three academic network datasets and four GNN backbones. We use MAPE Standard Error to evaluate the estimation on different scenarios. The complete experimental results can be found in Appendix G.3. Dataset OGB-ar Xiv ( ) DBLP ( ) Pharmabio ( ) Backbone Linear Do C Supervised SMART Linear SMART Linear SMART GCN 10.5224 9.5277 1.4857 2.1354 0.4501 2.1897 0.2211 16.4991 3.4992 0.1502 32.3653 1.3405 0.2674 GAT 12.3652 12.2138 1.222 1.9027 1.1513 3.1481 0.4079 17.6388 6.6459 1.3401 29.0404 1.2197 0.2241 Graph Sage 19.5480 20.5891 0.4553 1.6918 0.4556 5.2733 2.2635 23.7363 9.9651 1.4699 31.7033 3.1448 0.6875 Trans Conv 14.9552 10.0999 0.1972 2.0473 0.5588 3.5275 1.1462 18.2188 6.4212 1.9358 31.8249 2.7357 1.1357 Estimation on Different Test Time Period. In Table 3, we demonstrate the performance of SMART over time during the evolution of graphs in five social network datasets from Facebook 100. As the evolving pattern gradually deviates from the pre-trained model on the initial graph, generalization estimation becomes more challenging. Consequently, the error in linear estimation increases. However, our SMART method maintains overall stable prediction performance. Table 3: Performance comparison on five social network datasets in Facebook 100. We divide the test time Ttest into 3 periods and evaluate the estimation performance separately. Facebook 100 [0, Ttest/3] (Ttest/3, 2Ttest/3] (2Ttest/3, Ttest] Linear SMART Linear SMART Linear SMART Penn 1.9428 0.0193 0.0041 2.0432 0.6127 0.0307 2.7219 2.2745 0.0553 Amherst 31.1095 1.4489 0.2450 49.2363 2.8280 0.9527 73.5709 4.3320 1.8799 Reed 55.6071 0.0453 0.0020 65.7536 0.0987 0.0078 73.6452 0.0318 0.0085 Johns Hopkins 8.1043 0.5893 0.0491 10.2035 0.8607 0.1661 11.5206 0.9061 0.2795 Cornell 4.5655 0.4663 0.0275 8.6622 1.0467 0.0817 12.3263 1.7311 0.1175 5.3 ABLATION STUDY To verify the effectiveness of different modules in SMART, we conducted ablation studies on four datasets with the following variants: (M1) w/o augmented graph reconstruction: Removing graph reconstruction, using RNN only for estimation. (M2) w/o RNN: Replacing RNN with an MLP for processing current step input. (M3) w/o structure or feature reconstruction: Removing either structure or feature reconstruction, using the complementary method in post-deployment fine-tuning. Published as a conference paper at ICLR 2024 M1 M2 M3a M3b Ours 0 M1 M2 M3a M3b Ours 0 M1 M2 M3a M3b Ours 0 3 Pharmabio M1 M2 M3a M3b Ours 0 M3a : w/o structure reconstruction M3b : w/o feature reconstruction Smart (Ours) M1 : w/o augmented graph reconstruction M2 : w/o RNN Figure 6: Ablation study on four representative evolving datasets. 0.1 0.3 0.5 0.7 0.9 Loss ratio λ (a) OGB-ar Xiv 0.1 0.3 0.5 0.7 0.9 Loss ratio λ 0.1 0.3 0.5 0.7 0.9 Loss ratio λ (c) Pharma Bio 0.1 0.3 0.5 0.7 0.9 Loss ratio λ (d) Cornell 4 8 16 32 64 128 RNN Dimension (e) OGB-ar Xiv 4 8 16 32 64 128 RNN Dimension 4 8 16 32 64 128 RNN Dimension (g) Pharma Bio 4 8 16 32 64 128 RNN Dimension (h) Cornell Figure 7: Hyperparamter study on proportional weight ratio λ and RNN input dimension. As seen in Figure 6, we make the following observations: (1) Comparing (M1) with our method, augmented graph reconstruction significantly impacts accurate generalization estimation, particularly evident in the OGB-ar Xiv and Pharmabio datasets, where the (M1) variant exhibits a substantial performance gap. 
(2) In the case of (M2), generalization estimation based on historical multi-step information improves prediction accuracy and stability. For instance, in the Cornell dataset, predictions using single-step information result in a larger standard error. (3) As shown by (M3a) and (M3b), removing either of the reconstruction losses leads to a performance decrease in SMART. Since evolving graphs display temporal drift in both structure and features, both graph augmented reconstruction losses are essential for mitigating information loss over time. 5.4 HYPERPARAMETER SENSITIVITY Proportional Ratio of Two Reconstruction Loss. We evaluate performance using different weight ratios λ {0, 1, 0.3, 0.5, 0.7, 0.9}, as shown in Figure 7(a)-(d). Our method is generally insensitive to the choice of λ, with λ = 0.5 being a balanced option in most cases. However, larger λ values can yield better results in specific datasets, such as DBLP and Pharma Bio, especially when the node features are simple, like one-hot encoding or TF-IDF representations. Feature Dimension of RNN Input. We compared RNN feature dimensions ranging from {4, 8, 16, 32, 64, 128}, as shown in Figure 7(e)-(h). Performance remains stable across four datasets when the feature dimension is set between 4 and 64. However, a significant performance drop occurs on the Cornell dataset when the dimension is set to 128. Setting the RNN feature dimension too high is discouraged for two reasons: (1) As shown in Equation 2, RNN input represents compressed node information over time. To enhance historical information density and effectiveness, the input dimension should be reduced, facilitated by the reconstruction loss. (2) Given limited observation samples during training, reducing the RNN input dimension helps alleviate training pressure. 6 CONCLUSIONS In this paper, we investigate a practical but underexplored problem of temporal generalization estimation in evolving graph. To this end, we theoretically show that the representation distortion is unavoidable and further propose a straightforward and effective baseline SMART. Both synthetic and real-world experiments demonstrate the effectiveness of our methods. Future work involves exploring our methods in more complicated scenarios such as evolving graph with changing label, heterogeneous graphs and spatio-temporal graphs. Published as a conference paper at ICLR 2024 ACKNOWLEDGEMENT This work was partially supported by National Key R&D Program of China (No.2022YFB3904204), NSF China (No. 62306179, 62272301, 42050105, 62020106005, 62061146002, 61960206002), and the Deep-time Digital Earth (DDE) Science Program. Shiyu Liang is also supported by National Natural Science Fund for Excellent Young Scientists Fund Program (Overseas) Optimizing and Analyzing Deep Residual Networks . R eka Albert and Albert-L aszl o Barab asi. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002. Albert-L aszl o Barab asi. Scale-free networks: a decade and beyond. Science, 325(5939):412 413, 2009. Normand J. Beaudry and Renato Renner. An intuitive proof of the data processing inequality. 12 (5 6):432 441, may 2012. ISSN 1533-7146. Weijian Deng and Liang Zheng. Are labels always necessary for classifier accuracy evaluation? In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 15069 15078. Computer Vision Foundation / IEEE, 2021. Weijian Deng, Stephen Gould, and Liang Zheng. 
What does rotation prediction tell us about classifier accuracy under varying testing environments? In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 2579 2589. PMLR, 2021. Weijian Deng, Stephen Gould, and Liang Zheng. On the strong correlation between model invariance and generalization. In Neur IPS, 2022. Lukas Galke, Benedikt Franke, Tobias Zielke, and Ansgar Scherp. Lifelong learning of graph neural networks for open-world node classification. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1 8. IEEE, 2021. Alex Graves and J urgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks, 18(5-6):602 610, 2005. Devin Guillory, Vaishaal Shankar, Sayna Ebrahimi, Trevor Darrell, and Ludwig Schmidt. Predicting with confidence on unseen distributions. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 1114 1124. IEEE, 2021. William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 1024 1034, 2017. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020a. Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id= HJl WWJSFDH. Neha Hulkund, Nicol o Fusi, Jennifer Wortman Vaughan, and David Alvarez-Melis. Interpretable distribution shift detection using optimal transport. Ar Xiv, abs/2208.02896, 2022. Published as a conference paper at ICLR 2024 Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156 165, 2017. Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser-Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. 
In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 20887 20902, 2021. Yixin Liu, Ming Jin, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and Philip S. Yu. Graph selfsupervised learning: A survey. IEEE Trans. Knowl. Data Eng., 35(6):5879 5900, 2023. Yunkai Lou, Chaokun Wang, Tiankai Gu, Hao Feng, Jun Chen, and Jeffrey Xu Yu. Time-topology analysis on temporal graphs. VLDB J., 32(4):815 843, 2023. Bin Lu, Xiaoying Gan, Weinan Zhang, Huaxiu Yao, Luoyi Fu, and Xinbing Wang. Spatio-temporal graph few-shot learning with cross-city knowledge transfer. In KDD 22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, pp. 1162 1172. ACM, 2022. Bin Lu, Xiaoying Gan, Ze Zhao, Shiyu Liang, Luoyi Fu, Xinbing Wang, and Chenghu Zhou. Graph out-of-distribution generalization with controllable data augmentation. Co RR, abs/2308.08344, 2023. Yuanfu Lu, Xiao Wang, Chuan Shi, Philip S. Yu, and Yanfang Ye. Temporal network embedding with microand macro-dynamics. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pp. 469 478. ACM, 2019. Tom as Mikolov, Martin Karafi at, Luk as Burget, Jan Cernock y, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045 1048. ISCA, 2010. Niema Moshiri. The dual-barab asi-albert model. ar Xiv preprint ar Xiv:1810.10538, 2018. P al Andr as Papp, Karolis Martinkus, Lukas Faber, and Roger Wattenhofer. Dropgnn: Random dropouts increase the expressiveness of graph neural networks. In Advances in Neural Information Processing Systems, Neur IPS 2021, December 6-14, 2021, virtual, pp. 21997 22009, 2021. Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 5363 5370. AAAI Press, 2020. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: online learning of social representations. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 14, New York, NY, USA - August 24 - 27, 2014, pp. 701 710. ACM, 2014. Published as a conference paper at ICLR 2024 Stephan Rabanser, Stephan G unnemann, and Zachary C. Lipton. Failing loudly: An empirical study of methods for detecting dataset shift. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 1394 1406, 2019. Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020. Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. 
Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In WSDM 20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, pp. 519 527. ACM, 2020. Min Shi, Yu Huang, Xingquan Zhu, Yufei Tang, Yuan Zhuang, and Jianxun Liu. GAEN: graph attention evolving networks. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pp. 1541 1547. ijcai.org, 2021a. Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjing Wang, and Yu Sun. Masked label prediction: Unified message passing model for semi-supervised classification. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pp. 1548 1554. ijcai.org, 2021b. Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. Dyrep: Learning representations over dynamic graphs. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li o, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. Vladimir Vovk, Ivan Petej, Ilia Nouretdinov, Ernst Ahlberg, Lars Carlsson, and Alex Gammerman. Retrain or not retrain: conformal test martingales for change-point detection. In Conformal and Probabilistic Prediction and Applications, 8-10 September 2021, Virtual Event, volume 152 of Proceedings of Machine Learning Research, pp. 191 210. PMLR, 2021. Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. Inductive representation learning in temporal networks via causal anonymous walks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021. Zhihao Wen and Yuan Fang. Trend: Temporal event and node dynamics for graph representation learning. In Proceedings of the ACM Web Conference 2022, pp. 1159 1169, 2022. Yingxin Wu, Xiang Wang, An Zhang, Xiangnan He, and Tat-Seng Chua. Discovering invariant rationales for graph neural networks. In International Conference on Learning Representations, 2022. Minkai Xu, Alexander S. Powers, Ron O. Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3d molecule generation. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 38592 38610. PMLR, 2023. Carl Yang and Jiawei Han. Revisiting citation prediction with cluster-aware text-enhanced heterogeneous graph neural networks. In 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3-7, 2023, pp. 682 695. IEEE, 2023. Jiaxuan You, Tianyu Du, and Jure Leskovec. ROLAND: graph learning framework for dynamic graphs. In KDD 22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, pp. 2358 2366. ACM, 2022. Published as a conference paper at ICLR 2024 Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. 
In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Le-kui Zhou, Yang Yang, Xiang Ren, Fei Wu, and Yueting Zhuang. Dynamic network embedding by modeling triadic closure process. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 571 578. AAAI Press, 2018. Dingyuan Zhu, Peng Cui, Ziwei Zhang, Jian Pei, and Wenwu Zhu. High-order proximity preserved embedding for dynamic networks. IEEE Trans. Knowl. Data Eng., 30(11):2134 2144, 2018. A ALGORITHM We illustrate the details of our proposed SMART in Algorithm 1. The learning process of SMART is divided into two stage: pre-deployment warmup training and post-deployment finetuning. Before the deployment, we conduct both the supervised learning on the generalization loss prediction with the partially-observed labels and self-supervised learning on the evolving graph structure. After the deployment, since we no longer have the label information over time, we design two augmented graph reconstruction task as a self-supervised manner to actively finetuning the feature extractor φ. Algorithm 1 SMART: Self-supervised Temporal Generalization Estimation Input: Pre-trained graph model G, observation data D, evolving graph (Ak, Xk) at time k. Output: Update generalization performance predictor M and φ with paramter θ = θM, θφ . 1: while not converged or maximum epochs not reached do Pre-deployment Warmup Training 2: Compute loss L(θ) = Ptdeploy τ=0 Mℓ(φ G(Aτ, Xτ), hτ 1) ℓ(HτG(Aτ, Xτ), HτYτ) 2; 3: Compute graph self-supervised Lg via Equation 3; 4: Update θ θ + α θ(L + Lg); 5: end while 6: for k = tdeploy + 1, , T do Post-deployment Finetuning 7: Get the newly-arrival graph (Ak, Xk); 8: while not converged or maximum epochs not reached do 9: Compute graph self-supervised Lg via Equation 3; 10: Update θφ k θφ k + β θφ k Lg; 11: end while 12: end for B PROOF FOR THEOREM 1 Proof. Let fτ(i; θ) denote the output of the GCN on the node i at time τ 0. Therefore, we have k Nτ (i) x k Wj + bj Thus, the expected loss of the parameter θ on the node i at time τ is ℓτ(i) = E h (fτ(i; θ) f0(i; θ))2i k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj Published as a conference paper at ICLR 2024 Furthermore, recall that each parameter aj U(a j, ξ) and each element Wj,k in the weight vector Wj also satisfies Wj,k U(W j , ξ). Therefore, the differences aj a j and Wj,k W j,k are all i.i.d. random variables drawn from distribution U(0, ξ). Therefore, we have k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj j=1 (aj a j) k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj j=1 (aj a j) k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj The third equality holds by the fact that the differences (aj a j) s are all i.i.d. random variables drawn from the uniform distribution U(0, ξ). Therefore, we have j=1 (aj a j) k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj Furthermore, since the differences (aj a j) are i.i.d. 
random variable drawn from the distribution U(0, ξ), we must further have j=1 (aj a j) k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj j=1 E[(aj a j)2|Aτ, Xτ] k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj Published as a conference paper at ICLR 2024 Furthermore, the leaky Re LU satisfies that |σ(u) σ(v)| β|u v|. The above inequality further implies k Nτ (i) x k Wj + bj k N0(i) x k Wj + bj k Nτ (i) x k Wj + bj 1 d0(i) k N0(i) x k Wj bj k Nτ (i)\N0(i) x k Wj + 1 dτ(i) 1 d0(i) k N0(i) x k Wj k Nτ (i)\N0(i) x k Wj 1 dτ(i) 1 d0(i) k N0(i) x k Wj 1 dτ(i) 1 d0(i) k N0(i) x k Wj Therefore, we have 1 dτ(i) 1 d0(i) k N0(i) x k Wj 1 dτ(i) 1 d0(i) k N0(i) x k (Wj W j + W j ) 1 dτ(i) 1 d0(i) k N0(i) x k (Wj W j ) 1 dτ(i) 1 d0(i) k N0(i) x k W j where the last equality comes from the fact that random vectors (Wj W j ) s are an i.i.d. random variables drawn from the uniform distribution U(0, ξ) and are also independent of the graph evolution process. Therefore, we have ℓτ(i) Nβ2ξ2 1 dτ(i) 1 d0(i) k N0(i) x k (Wj W j ) 1 dτ(i) 1 d0(i) where the last equality comes from the fact that each element in the random vector (Wj W j ) is i.i.d. random variable drawn from the uniform distribution U(0, ξ). Since the initial feature matrix X0 = (x1, ..., xn) are drawn from a continuous distribution supported on Rd, we must have with Published as a conference paper at ICLR 2024 probability one, Furthermore, we have " 1 dτ(i) 1 d0(i) 1 d0(i) EGτ To prove ϕτ(i) is strictly increasing, it suffices to prove that EGτ is decreasing with respect to τ. Since 1 dτ(i) > r dτ 1(i) < 1 where the last inequality comes from the fact that dτ 1(i) < 1 C PROOF FOR THEOREM 2 Proof. Assuming a regression task with a single-layer GCN, we compute mean squared error between prediction and ground truth as the learning objective as follows min W EX LXW Y 2 2 = min W EX ((LXW)T LXW 2(LXW)T Y + Y 2 2) (5) = min W EX (W T XT LT LXW 2W T XT LT Y + Y 2 2). (6) Since D 2 is a diagonal matrix, (LT L)ij = (AT D 2A)ij = a T i D 2aj, where ai is the i-th row in matrix A. Each node feature is independently Gaussian distributed. When i = j, EX [XT LT LX]ij = 0 (7) When i = j, EX [XT LT LX]ii = EX (x T i LT Lxi) = EX ( m=1 xij(LT L)jmxmi) (8) Similarly, when and only when j = m, EX [XT LT LX]ii = 0 Published as a conference paper at ICLR 2024 EX [XT LT LX]ii = j=1 EX (x2 ij)(LT L)jj = j=1 (LT L)jj I (9) where Pn j=1(LT L)jj = Pn j=1((D 1A)T D 1A)jj = Pn j=1(AT D 2A)jj = Pn j=1 1 dj β Consequently, the learning objective is equal to min W EX LXW Y 2 2 = min W β W T W 2W T EX (XT LT Y ) + Y 2 (10) Meanwhile, the optimal parameter of GCN equals to W = 1 β EX (XT LT Y ). β EX [XT LT Y ]i (11) m=1 (XT LT )im Ym (12) m=1 (LX)midα m Xmk (13) s=1 Lms Xsidα m Xmk (14) m=1 Lmm Xmidα m Xmk (15) Therefore, only if i = k, W k = 1 β PN m=1 Lmmdα m = 1 β PN m=1 dα 1 m C, and otherwise W i = 0. For any given node vi, we have εi = E (LXW )i Yi 2 2 E Yi 2 2 = E(LXW )2 i 2(LXW )i Yi + Yi 2 E Yi 2 2 (16) To be specific, j=1 (LX)ij W j = (LX)ik W k = C s=1 Lis Xsk (17) (LXW )i Yi = C s=1 Lis Xskdα i Xik = Cdα i Lii = Cdα 1 i (18) E(LXW )2 i = E[C s=1 Lis Xsk]2 = E[C2 N X s=1 (Lis Xsk)2] = C2 N X s=1 L2 is = C2 1 E Yi 2 = Ed2α i X2 ik = d2α i (20) Therefore, plugging to Eq. 16 ε(d) = C2d 1 2Cdα 1 + d2α = C2d 2α 1 2Cd α 1 + 1 (22) We denote the error of node i as ϵi. Therefore, the overall error of graph G under the stationary degree distribution Q is calculated as EG = EQε(d). 
Here we assume the inherent graph model Published as a conference paper at ICLR 2024 follows the Barab asi Albert model (Albert & Barab asi, 2002), where the probability of node degree equals to P(d) = 2m2 t N0 + t 1 N0 is the initial scale of graphs, m is the newly-arrival number of nodes of each time t. Consequently, the error of graph G can be further deduced as: 2m2 t N0 + t 1 d3 (C2 d 2α 1 2C d α 1 + 1) (24) = 2m2 t N0 + t(C2EQd 2α 4 2CEQd 2α 4 + EQd 3) (25) D RELATED WORK D.1 DISTRIBUTION SHIFT ESTIMATION Deep learning models are sensitive to the data distribution (Deng et al., 2022). Especially when these models are deployed online, how to accurately estimate generalization error and make decisions in advance is a crucial issue. Deng & Zheng (2021); Deng et al. (2021) estimate the recogniion performance by learning an accuracy regression model with image distribution statistics. Rabanser et al. (2019) conduct an empirical study of dataset shift by two-sample tests on pre-trained classifier. Hulkund et al. (2022) use optimal transport to identify distribution shifts in classification tasks. Vovk et al. (2021) propose conformal test martingales to detect the change-point of retraining the model. However, all these methods are designed for images, while evolving graph shows significant different characteristics due to its ever-growing topology. To our best of knowledge, we are the first to investigate the temporal generalization estimation in graphs. D.2 EVOLVING GRAPH REPRESENTATION Real world graphs show dynamic properties of both feature and topology continuously evolving over time. Existing research can be categorized into two types: (1) Temporal Graph Snapshots. Intuitively, the evolving graph can be splited into a graph time series, thereby taking a recurrent model for temporal encoding (Shi et al., 2021a; Sankar et al., 2020). DHPE (Zhu et al., 2018) gets a simple dynamic model base on SVD, which only preserves the first-order proximity between nodes. Dynamic Triad (Zhou et al., 2018) models the triadic closure process to capture dynamics and learns node embeddings at each time step. Evolve GCN (Pareja et al., 2020) proposes to use RNN to evolve the parameter of GCN , while ROLAND (You et al., 2022) recurrently updates hierarchical node embeddings. (2) Temporal Graph Events. Some other studies convert the graph representation by grouping the node and edges with time steps in a given period (Wang et al., 2021; Trivedi et al., 2019). TREND (Wen & Fang, 2022) borrows the idea of Hawkes temporal point process in time series analysis, capturing both individual and collective temporal characteristics of dynamic link formation. M2DNE (Lu et al., 2019) proposes a novel temporal network embedding with microand macro-dynamics, where a temporal attention point process is designed to capture structural and temporal properties at a fine-grained level. Lou et al. (2023) propose to evaluate the cohesiveness of temporal graphs in both topology and time dimensions. E EXPERIMENT DETAILS OF BARAB ASI-ALBERT RANDOM GRAPH Barab asi-Albert (BA) random graph (Barab asi, 2009; Albert & Barab asi, 2002), also known as the preferential attachment model, is a type of random graph used to model complex networks, particularly those that exhibit scale-free properties. BA graphs are often used to model various real-world networks, such as the World Wide Web, social networks, citation networks, etc. BA random graph has two important concepts: growth and preferential attachment. 
Growth means that the number of nodes in the network increases over time. Preferential attachment means that the more connected a node is, the more likely it is to receive new links. The process starts with an initial graph G_0 consisting of a number of nodes and edges. At each time step τ, a new node is introduced into the graph and forms m edges to existing nodes. The probability that an existing node receives an edge from the new node is proportional to its degree (number of edges) in the current graph G_{τ−1}. This means that nodes with higher degrees are more likely to receive new edges, exhibiting preferential attachment.

Figure 8: Degree Distribution of BA graphs.

The Dual Barabási–Albert (Dual BA) random graph (Moshiri, 2018) is an extension of the BA model that better captures properties of real networks of social contact. The Dual BA model is parameterized by the initial number of nodes N_0, edge counts 1 ≤ m1, m2 ≤ N_0, and a probability 0 ≤ p ≤ 1. For each new vertex, m1 new edges are added with probability p, and m2 new edges are added with probability 1 − p. The new edges are attached in the same manner as in the BA model.

In our experiments, we adopt the above two BA models as test datasets. We use the barabasi_albert_graph (footnote 1) and dual_barabasi_albert_graph (footnote 2) implementations provided by NetworkX. The detailed parameter settings (Table 4, Table 5) and the degree distributions (Figure 8, Figure 9) are shown as follows.

Table 4: Parameter setting of BA graph in synthetic experiments.
  n     m
  1000  2
  1000  5
  1000  10

Table 5: Parameter setting of Dual BA graph in synthetic experiments.
  n     m1  m2
  1000  1   2
  1000  1   5
  1000  1   10

Here, we present the complete experiment results on 4 evaluation metrics (MAPE, RMSE, MAE and Standard Error) in Table 6. Since the DoC method only applies to classification problems using the average confidence, we do not compare it with SMART in the synthetic setting.

1 Please refer to the implementation in https://networkx.org/documentation/stable/reference/generated/networkx.generators.random_graphs.barabasi_albert_graph.html
2 Please refer to the implementation in https://networkx.org/documentation/stable/reference/generated/networkx.generators.random_graphs.dual_barabasi_albert_graph.html

Figure 9: Degree Distribution of DBA graphs (panels: m2 = 2, 5, 10).

Table 6: Performance comparison on different Barabási–Albert graph settings.
Here, we present the complete experimental results on the four evaluation metrics (MAPE, RMSE, MAE and Standard Error) in Table 6. Since the DoC method only applies to classification problems through its use of average confidence, we do not compare it with SMART in the synthetic setting.

Table 6: Performance comparison on different Barabási-Albert graph settings (MAPE / RMSE / MAE).

Barabási-Albert (BA) random graph, N0 = 1000:
  m = 2    Linear: 79.2431 / 0.0792 / 0.0705    SMART: 7.1817 ± 1.2350 / 0.0057 ± 0.0008 / 0.0042 ± 0.0006
  m = 5    Linear: 74.1083 / 0.0407 / 0.0398    SMART: 4.2602 ± 0.5316 / 0.0039 ± 0.0004 / 0.0035 ± 0.0004
  m = 10   Linear: 82.1677 / 0.1045 / 0.0925    SMART: 9.1173 ± 0.1331 / 0.0077 ± 0.0010 / 0.0071 ± 0.0009
Dual BA random graph, N0 = 1000, m1 = 1:
  m2 = 2   Linear: 61.8048 / 0.0676 / 0.0615    SMART: 7.9038 ± 1.8008 / 0.0088 ± 0.0022 / 0.0069 ± 0.0017
  m2 = 5   Linear: 67.6442 / 0.0827 / 0.0771    SMART: 3.8288 ± 0.1706 / 0.0049 ± 0.0013 / 0.0040 ± 0.0010
  m2 = 10  Linear: 38.4884 / 0.0298 / 0.0297    SMART: 1.9947 ± 0.1682 / 0.0026 ± 0.0002 / 0.0023 ± 0.0003

F REAL-WORLD DATASETS

We use two citation datasets, a co-authorship network dataset and a series of social network datasets to evaluate our model's performance. We adopt inductive learning, wherein nodes and edges that emerge during testing remain unobserved during the training phase. Detailed statistics are shown in Table 7.

OGB-arXiv (Hu et al., 2020a): The OGB-arXiv dataset is a citation network in which each node represents an arXiv paper and each edge a citation between papers. On this dataset we conduct node classification over 40 distinct subject areas. Our experiment spans the years 2007 to 2020. In its initial state in 2007, OGB-arXiv comprises 4,980 nodes and 6,103 edges; as the graph evolves, the citation network grows to 169,343 nodes and 1,166,243 edges. We first pretrain graph neural networks on the 2007 graph, then use data from 2008 to 2010 to train our generalization estimation model SMART, and finally predict the generalization performance of the pretrained GNN on the graphs spanning 2011 to 2020.

DBLP (Galke et al., 2021): DBLP is also a citation network; this dataset uses conferences and journals as classes. In our experiment, DBLP spans 1999 to 2015 with 6 classes. Over the evolution of DBLP, the number of nodes increases from 6,968 to 45,407, while the number of edges grows from 25,748 to 267,227. We pretrain the graph neural network on the 1999 graph, train our model on the following three years, and use the graphs from 2004 to 2015 to assess the performance of our model.

Pharmabio (Galke et al., 2021): Pharmabio is a co-authorship graph in which each node represents a paper, with normalized TF-IDF representations of the publication title as its features. If two papers share common authors, an edge is established between the corresponding nodes. We conduct node classification on this dataset with seven classes, each representing a journal category. Pharmabio spans 1985 to 2016. The pretrained graph neural network is based on the 1985 graph with 620 nodes and 2,820 edges. We then train our estimation model using graph data from 1986 to 1988 and evaluate it over 26 consecutive years starting from 1989. By the final year, 2016, the graph has evolved to 57,559 nodes and 3,005,421 edges.

Facebook 100 (Lim et al., 2021): Facebook 100 is a social network that models the friendship of users within five universities. We perform binary node classification on this dataset, with the classes representing the gender of the users. Among these datasets, Amherst, Reed and Johns Hopkins are of smaller scale, while Penn and Cornell are larger. We sequentially evaluate our model's adaptability to datasets of different scales. All these datasets end in the year 2010, with the number of nodes varying from 865 to 38,815 and edges from 31,896 to 2,498,498.
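To make the evolving-graph setup concrete, the sketch below shows one plausible way to slice OGB-arXiv into yearly snapshots using the node_year attribute shipped with the dataset. It is an illustrative assumption about preprocessing, not necessarily the authors' exact pipeline; splits for the other datasets would follow a similar pattern.

```python
import torch
from ogb.nodeproppred import PygNodePropPredDataset
from torch_geometric.utils import subgraph

# Load OGB-arXiv; each node carries its publication year in data.node_year.
dataset = PygNodePropPredDataset(name="ogbn-arxiv", root="data/ogb")
data = dataset[0]
years = data.node_year.view(-1)

def snapshot(year: int):
    """Return (x, edge_index, y) for the subgraph of papers published up to `year`."""
    mask = years <= year
    edge_index, _ = subgraph(mask, data.edge_index, relabel_nodes=True,
                             num_nodes=data.num_nodes)
    return data.x[mask], edge_index, data.y[mask]

# Example: the graph as of 2007 (pre-training) and 2011 (first post-deployment year).
for yr in (2007, 2011):
    x, ei, y = snapshot(yr)
    print(f"{yr}: {x.size(0)} nodes, {ei.size(1)} edges")
```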
Table 7: Overview of the real-world evolving graph datasets. Penn, Amherst, Reed, Johns Hopkins and Cornell are the Facebook 100 subsets.

Dataset          #Node (Start)  #Node (End)  #Edge (Start)  #Edge (End)  Time Span  #Class
OGB-arXiv        4,980          169,343      6,103          1,166,243    14         40
DBLP             6,968          45,407       25,748         267,227      17         6
Pharmabio        620            57,559       2,820          3,005,421    30         7
Penn             5,385          38,815       47,042         2,498,498    18         2
Amherst          107            2,032        192            157,466      11         2
Reed             85             865          188            31,896       13         2
Johns Hopkins    368            4,762        1,782          338,256      14         2
Cornell          1,360          16,822       8,148          1,370,660    14         2

[Figure 10: Homophily ratio analysis on the different datasets (homogeneity ratio (%) over time, one panel per dataset).]

In addition, research on the homogeneity and heterogeneity of graphs has received widespread attention in recent years. In the context of graph data, homogeneity and heterogeneity refer to whether the nodes and edges in the graph are of the same type (homogeneous) or of different types (heterogeneous). For example, in a social network, a homogeneous graph might represent individuals (nodes) and friendships (edges), where all nodes and edges are of the same type. A heterogeneous graph, on the other hand, could represent individuals, events, and organizations as different node types, with edges representing relationships such as attended, organized, or works for. In our work, we do not restrict or pre-define the homogeneity or heterogeneity of the data. Moreover, we compute the node homophily ratio on the different experimental datasets, as shown in Figure 10. The homophily ratio differs substantially across datasets and also varies considerably over time.

G ADDITIONAL EXPERIMENTAL RESULTS OF SMART

In this section, we provide a comprehensive illustration of the experimental details for temporal generalization estimation in evolving graphs, including the experiment settings, implementation details and complete results.

G.1 EXPERIMENT SETTINGS

Baselines. We compare the performance of SMART with the following three baselines.

Linear: Linear regression is a straightforward statistical learning method for modeling the relationship between a scalar response and explanatory variables. We observe that the performance degradation on Barabási-Albert random graphs follows an approximately linear process. To estimate the GNN performance degradation on real-world datasets, we therefore adopt linear regression as a baseline for predicting the performance changes.

DoC: The Difference of Confidences (DoC) approach (Guillory et al., 2021) estimates the performance variation on a distribution that slightly differs from the training data using average confidence. DoC yields effective estimates for image classification over a variety of shifts and model architectures, outperforming common distributional distances such as the Fréchet distance or Maximum Mean Discrepancy.

[Figure 11: System framework of DoC for temporal generalization estimation in graphs (base and target distributions, featurizer, regressor, similarity and featurizer losses); adapted from Figure 2 of Guillory et al. (2021).]
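For reference, the core DoC idea can be written in a few lines: the change in average (maximum softmax) confidence between the base distribution and a target snapshot is added to the base accuracy to obtain an accuracy estimate. The sketch below illustrates only this core idea on dummy probabilities; the graph-specific variant used as a baseline here additionally trains a featurizer and regressor, as sketched in Figure 11.

```python
import numpy as np

def average_confidence(probs: np.ndarray) -> float:
    """Mean of the maximum softmax probability over all nodes (shape: [n, num_classes])."""
    return probs.max(axis=1).mean()

def doc_accuracy_estimate(base_probs: np.ndarray, base_accuracy: float,
                          target_probs: np.ndarray) -> float:
    """Estimate accuracy on the target snapshot via the difference of confidences."""
    doc = average_confidence(target_probs) - average_confidence(base_probs)
    return base_accuracy + doc

# Toy usage with random softmax outputs (illustration only).
rng = np.random.default_rng(0)
base = rng.dirichlet(np.ones(40), size=500)        # e.g., 40 classes as in OGB-arXiv
target = rng.dirichlet(np.ones(40) * 0.5, size=800)
print(doc_accuracy_estimate(base, base_accuracy=0.55, target_probs=target))
```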
Supervised: Moreover, we design a comparison method for temporal generalization estimation with supervised signals. Although labeled data are not accessible over time after deployment in our problem setting, in this experiment we allow new node labels to be acquired and the model to be retrained on the new data, which serves as an approximate upper performance limit for this problem.

GNN Architectures. To fully validate and compare the effectiveness of the different methods, we select four graph neural network architectures.

GCN (Kipf & Welling, 2017): GCN is one of the most classic graph neural network architectures; it applies a localized first-order approximation of spectral graph convolutions.

GAT (Velickovic et al., 2018): The Graph Attention Network (GAT) leverages an attention mechanism to selectively focus on different neighboring nodes when aggregating information.

GraphSAGE (Hamilton et al., 2017): GraphSAGE is a graph neural network architecture designed for scalable and efficient learning on large graphs. It addresses this challenge by sampling and aggregating information from a node's local neighborhood.

TransformerConv (Shi et al., 2021b): In light of the superior performance of Transformers in NLP, TransformerConv adapts the Transformer architecture to graph learning while taking edge features into account. The multi-head attention matrix replaces the original normalized adjacency matrix as the transition matrix for message passing.

Evaluation Metrics. To evaluate performance, we adopt the following four metrics to measure the effectiveness of temporal generalization estimation over time, where $\hat{\ell}_\tau$ and $\ell_\tau$ denote the predicted and actual generalization loss at time $\tau$, $t_{\mathrm{deploy}}$ is the deployment time and $T$ is the final time step. A small NumPy sketch of these metrics is given after the definitions.

Mean Absolute Percentage Error (MAPE) quantifies the average relative difference between the predicted and the actual generalization loss:
$$\mathrm{MAPE} = \frac{1}{T - t_{\mathrm{deploy}}} \sum_{\tau=t_{\mathrm{deploy}}+1}^{T} \frac{|\hat{\ell}_\tau - \ell_\tau|}{\ell_\tau}.$$

Root Mean Squared Error (RMSE) is the square root of the average squared difference between the predicted and actual generalization loss:
$$\mathrm{RMSE} = \sqrt{\frac{1}{T - t_{\mathrm{deploy}}} \sum_{\tau=t_{\mathrm{deploy}}+1}^{T} \left(\hat{\ell}_\tau - \ell_\tau\right)^2}.$$

Mean Absolute Error (MAE) measures the average absolute difference between the predicted and actual generalization loss:
$$\mathrm{MAE} = \frac{1}{T - t_{\mathrm{deploy}}} \sum_{\tau=t_{\mathrm{deploy}}+1}^{T} |\hat{\ell}_\tau - \ell_\tau|.$$

Standard Error (SE) measures the variability of the sample mean and estimates how much the sample mean is likely to deviate from the true population mean: $\mathrm{SE} = \sigma / \sqrt{n}$, where $\sigma$ is the standard deviation and $n$ is the number of samples.
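A minimal NumPy sketch of the four metrics is given below, assuming ell_hat and ell hold the predicted and observed generalization losses for the post-deployment steps. Note that the ± values in the result tables are presumably standard errors over repeated runs, whereas the toy example below computes the SE over the per-step errors purely for illustration.

```python
import numpy as np

def estimation_metrics(ell_hat: np.ndarray, ell: np.ndarray) -> dict:
    """MAPE, RMSE, MAE and Standard Error over the post-deployment time steps."""
    err = ell_hat - ell
    mape = np.mean(np.abs(err) / ell) * 100.0     # reported as a percentage (assumed convention)
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    se = np.std(err, ddof=1) / np.sqrt(len(err))  # standard error of the mean, for illustration
    return {"MAPE": mape, "RMSE": rmse, "MAE": mae, "SE": se}

# Toy example: predicted vs. observed generalization loss over ten post-deployment steps.
ell = np.linspace(0.9, 1.4, 10)
ell_hat = ell + np.random.default_rng(0).normal(0, 0.02, size=10)
print(estimation_metrics(ell_hat, ell))
```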
G.2 IMPLEMENTATION DETAILS

Data Labeling. To simulate real-world human annotation scenarios, we randomly label 10% of the samples during the training of the pre-trained graph neural network. Prior to deployment, at each time step we additionally label 10% of the newly appearing nodes. After deployment, no additional labels are available for newly added nodes. For consistency, we use only the first three frames to obtain labels for all real-world datasets, which is a relatively small sample size. Further enlarging the labeled data can yield additional improvements in temporal generalization estimation.

Hyperparameter Setting. We use the Adam optimizer for all experiments, and the learning rate is set uniformly to 1e-3 for all datasets. In all experiments, the pre-trained graph neural networks are equipped with batch normalization and residual connections, with a dropout rate of 0.1, and we employ the ReLU activation function. We set the hyperparameters for each dataset and specify the details in Table 8.

Table 8: Hyperparameter settings in our experiments. Penn, Amherst, Reed, Johns Hopkins and Cornell are the Facebook 100 subsets.

Dataset          GNN Layer  GNN dimension  RNN Layer  RNN dimension  loss lambda  learning rate
OGB-arXiv        3          256            1          64             0.5          0.01
DBLP             3          256            1          8              0.9          0.01
Pharmabio        2          256            1          64             0.7          0.01
Penn             2          256            1          64             0.3          0.01
Amherst          2          256            1          64             0.9          0.01
Reed             1          32             1          32             0.7          0.01
Johns Hopkins    2          256            1          64             0.1          0.01
Cornell          2          256            1          8              0.5          0.01

Hardware. All evaluated models are implemented on a server with two CPUs (Intel Xeon Platinum 8336C ×2) and GPUs (NVIDIA GeForce RTX 4090 ×8).

G.3 EXPERIMENTAL RESULTS

Comparison with different baselines. We compare SMART with the other three baselines on the three academic network datasets. As shown in Table 9, we observe a striking improvement in prediction quality: SMART consistently outperforms linear regression and DoC on all evaluation metrics, demonstrating superior temporal generalization estimation. For example, on the Pharmabio dataset, whose temporal evolution is long, vanilla linear regression and DoC yield inferior predictions owing to severe GNN representation distortion, whereas SMART performs well thanks to its self-supervised parameter updates over time. When comparing with the Supervised method, which continuously acquires annotations and is retrained during the testing phase, it approximately marks the upper limit of estimation performance. On OGB-arXiv, SMART achieves performance close to Supervised, indicating a strong ability to cope with evolving distribution shift. On Pharmabio, however, accurately estimating the generalization performance remains challenging due to its long-term evolution (30 timesteps). Temporal generalization estimation is therefore an important and challenging problem that deserves further research.

Table 9: Performance comparison of SMART and the baselines on three academic network datasets. MAPE, RMSE, MAE and Standard Error are used to evaluate the estimation performance; smaller is better.

OGB-arXiv
  MAPE   Linear 10.5224   DoC 9.5277 ± 1.4857   SMART 2.1897 ± 0.2211   Supervised 2.1354 ± 0.4501
  RMSE   Linear 0.4764    DoC 0.3689 ± 0.0400   SMART 0.1129 ± 0.0157   Supervised 0.0768 ± 0.0155
  MAE    Linear 0.4014    DoC 0.3839 ± 0.0404   SMART 0.0383 ± 0.0083   Supervised 0.0218 ± 0.0199
DBLP
  MAPE   Linear 16.4991   DoC 4.3910 ± 0.1325   SMART 3.4992 ± 0.1502   Supervised 2.5359 ± 0.4282
  RMSE   Linear 0.5531    DoC 0.1334 ± 0.0058   SMART 0.1165 ± 0.0444   Supervised 0.0914 ± 0.0175
  MAE    Linear 0.431     DoC 0.1162 ± 0.0310   SMART 0.0978 ± 0.0344   Supervised 0.0852 ± 0.0038
Pharmabio
  MAPE   Linear 32.3653   DoC 8.1753 ± 0.0745   SMART 1.3405 ± 0.2674   Supervised 0.4827 ± 0.0798
  RMSE   Linear 0.7152    DoC 0.1521 ± 0.0014   SMART 0.0338 ± 0.0136   Supervised 0.0101 ± 0.0015
  MAE    Linear 0.6025    DoC 0.1521 ± 0.0013   SMART 0.0282 ± 0.0120   Supervised 0.0088 ± 0.0026

Table 10: Performance comparison of different GNN architectures, including GCN, GAT, GraphSAGE and TransformerConv. We report MAPE ± Standard Error to evaluate the estimation in the different scenarios.
Dataset      Method   GCN                GAT                GraphSAGE          TransformerConv
OGB-arXiv    Linear   10.5224            12.3652            19.5480            14.9552
             DoC      13.5277 ± 1.4857   12.2138 ± 1.222    20.5891 ± 0.4553   10.0999 ± 0.1972
             SMART    2.1897 ± 0.2211    3.1481 ± 0.4079    5.2733 ± 2.2635    3.5275 ± 1.1462
DBLP         Linear   16.4991            17.6388            23.7363            18.2188
             DoC      4.3910 ± 0.1325    13.8735 ± 4.1744   11.9003 ± 1.8249   9.0127 ± 2.6619
             SMART    3.4992 ± 0.1502    6.6459 ± 1.3401    9.9651 ± 1.4699    6.4212 ± 1.9358
Pharmabio    Linear   32.3653            29.0404            31.7033            31.8249
             DoC      8.1753 ± 0.0745    7.4942 ± 0.0702    6.6376 ± 0.0194    5.3498 ± 0.2636
             SMART    1.3405 ± 0.2674    1.2197 ± 0.2241    3.1448 ± 0.6875    2.7357 ± 1.1357

Comparison with different GNN architectures. To evaluate the applicability of the proposed algorithm, we conduct temporal generalization estimation with different GNN architectures. As shown in Table 10, SMART achieves the best performance consistently across architectures. Moreover, different graph neural network architectures tend to capture structural and feature information in different ways. Owing to the simple model structure of GCN, SMART attains its strongest prediction performance with GCN among the architectures, which is also consistent with classical statistical learning theory.

Comparison with different time-series models. To capture the temporal drift of the GNN generalization performance, we propose an RNN-based method (i.e., LSTM). In addition, we replace the LSTM with two other time-series models, Bi-LSTM and TCN:

LSTM: Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to address the vanishing gradient problem in traditional RNNs. The ability of LSTMs to selectively remember or forget information makes them well suited for tasks involving long-term dependencies, and they have become a standard architecture for deep learning on sequential data.

Bi-LSTM (Graves & Schmidhuber, 2005): Bidirectional Long Short-Term Memory (Bi-LSTM) is an extension of the traditional LSTM architecture that enhances its ability to capture information from both past and future context.

TCN (Lea et al., 2017): A Temporal Convolutional Network (TCN) uses 1D convolutional layers to capture dependencies across different time steps in a sequence. Unlike traditional CNNs, TCNs use dilated convolutions to extend the receptive field without significantly increasing the number of parameters.

[Figure 12: Comparison with different time-series models (LSTM, Bi-LSTM and TCN): MAPE of SMART versus the baseline on the three datasets.]

Different time-series methods are worth trying, since we aim to capture the temporal changes from only a small number of training snapshots. In our experiments, as depicted in Figure 12, SMART achieves the best performance with all three time-series models on both the OGB-arXiv and Pharmabio datasets. On the DBLP dataset, SMART performs best with LSTM, while SMART equipped with Bi-LSTM or TCN performs close to DoC. To keep the experiments consistent, we adopt the same hyperparameter settings throughout. In terms of parameter size, a Bi-LSTM has nearly twice as many parameters as an LSTM, while a TCN has very few. As the parameter size increases, the prediction error generally first decreases and then increases, so it is important to choose an appropriate model and parameter setting based on the complexity of the data.
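The parameter-size comparison above is easy to verify with a few lines of PyTorch. The layer sizes below are illustrative assumptions (not the exact experimental configurations), and the two dilated 1D convolutions are only a crude stand-in for a full TCN block.

```python
import torch.nn as nn

def num_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

IN_DIM, HID_DIM = 64, 64   # illustrative sizes, not the exact experimental settings

lstm = nn.LSTM(input_size=IN_DIM, hidden_size=HID_DIM, num_layers=1, batch_first=True)
bilstm = nn.LSTM(input_size=IN_DIM, hidden_size=HID_DIM, num_layers=1,
                 batch_first=True, bidirectional=True)
# A crude TCN-style block: two dilated 1D convolutions (stand-in, not a full TCN).
tcn_like = nn.Sequential(
    nn.Conv1d(IN_DIM, HID_DIM, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(),
    nn.Conv1d(HID_DIM, HID_DIM, kernel_size=3, padding=4, dilation=4),
)

print("LSTM   :", num_params(lstm))      # baseline recurrent model
print("Bi-LSTM:", num_params(bilstm))    # roughly twice the LSTM
print("TCN-ish:", num_params(tcn_like))  # smaller than both recurrent models at these sizes
```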
Comparison with different graph data augmentation methods. Self-supervised learning is a popular and effective strategy for enriching the supervised signal, and its recent success on image datasets relies heavily on various data augmentation methods. Due to the irregular structure of graphs, existing graph augmentation methods can be categorized into topology-based and attribute-based augmentations. In our work, we adopt three common and effective graph data augmentation methods to generate different views for self-supervised graph reconstruction (a minimal implementation sketch is given at the end of this section):

DropEdge (Rong et al., 2020): Randomly deletes edges with a certain probability, thereby changing the graph structure used as the input of the GNN.

DropNode (Papp et al., 2021): Instead of deleting edges, DropNode randomly deletes nodes with a certain probability.

FeatureMask (Hu et al., 2020b): Randomly masks the features of a portion of nodes with a certain probability, thereby perturbing the node features of the graph.

Figure ?? shows the experimental results: SMART achieves the best performance on all three datasets. Because we consider both structure prediction and feature reconstruction, the overall impact of the choice of data augmentation method on performance is relatively stable.

Ablation Study and Hyperparameter Study. In the main text, due to space constraints, we present representative results on selected datasets. To demonstrate the effectiveness of SMART more comprehensively, we include complete experimental results for all eight datasets, covering both ablation and hyperparameter experiments. The correspondence between figures and experiment settings is as follows:

Figure 13: Ablation study of SMART on all datasets.
Figure 14: Hyperparameter study on the proportional weight ratio λ of SMART on all datasets.
Figure 15: Hyperparameter study on the RNN dimension of SMART on all datasets.

[Figure 13: Ablation study of SMART on all datasets. Variants: M1 (w/o augmented graph reconstruction), M2 (w/o RNN), M3a (w/o structure reconstruction), M3b (w/o feature reconstruction), and SMART (Ours).]
[Figure 14: Hyperparameter study on the proportional weight ratio λ ∈ {0.1, 0.3, 0.5, 0.7, 0.9} of SMART on all datasets.]
[Figure 15: Hyperparameter study on the RNN dimension ∈ {4, 8, 16, 32, 64, 128} of SMART on all datasets.]
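As referenced above, the following is a minimal PyTorch sketch of the three augmentations on an (x, edge_index) graph representation. It is an illustrative implementation, not the exact augmentation code used in SMART; in particular, DropNode is realized here by removing the incident edges of dropped nodes and zeroing their features, which is one of several common conventions.

```python
import torch

def drop_edge(edge_index: torch.Tensor, p: float) -> torch.Tensor:
    """Randomly delete each edge with probability p."""
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

def drop_node(x: torch.Tensor, edge_index: torch.Tensor, p: float):
    """Randomly drop nodes with probability p by removing their incident edges
    and zeroing their features (node indices are kept to avoid re-labelling)."""
    dropped = torch.rand(x.size(0)) < p
    keep_edge = ~(dropped[edge_index[0]] | dropped[edge_index[1]])
    x_aug = x.clone()
    x_aug[dropped] = 0.0
    return x_aug, edge_index[:, keep_edge]

def feature_mask(x: torch.Tensor, p: float) -> torch.Tensor:
    """Randomly mask (zero out) the features of a fraction p of the nodes."""
    masked = torch.rand(x.size(0)) < p
    x_aug = x.clone()
    x_aug[masked] = 0.0
    return x_aug

# Toy usage on a random graph with 100 nodes and 400 directed edges.
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))
print(drop_edge(edge_index, 0.2).size(1), feature_mask(x, 0.3).abs().sum().item())
```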