# SAIL: Self-Augmented Graph Contrastive Learning

Lu Yu^{2,1}, Shichao Pei^{1}, Lizhong Ding^{5}, Jun Zhou^{2}, Longfei Li^{2}, Chuxu Zhang^{4}, Xiangliang Zhang^{3,1*}
^{1} King Abdullah University of Science and Technology, Saudi Arabia; ^{2} Ant Group, Hangzhou, China; ^{3} University of Notre Dame, USA; ^{4} Brandeis University, USA; ^{5} Inception Institute of Artificial Intelligence, UAE
bruceyu.yl@alibaba-inc.com, shichao.pei@kaust.edu.sa, lizhong.ding@inceptioniai.org, {jun.zhoujun, longyao.llf}@antgroup.com, chuxuzhang@brandeis.edu, xzhang33@nd.edu
* Corresponding author.
The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22). Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper studies learning node representations with graph neural networks (GNNs) in the unsupervised scenario. Specifically, we derive a theoretical analysis and provide an empirical demonstration of the unsteady performance of GNNs over different graph datasets when the supervision signals are not appropriately defined. The performance of GNNs depends on both the node feature smoothness and the locality of the graph structure. To smooth the discrepancy between node proximity measured by graph topology and by node features, we propose SAIL, a novel Self-Augmented graph contrastive Learning framework with two complementary self-distilling regularization modules, i.e., intra- and inter-graph knowledge distillation. We demonstrate the competitive performance of SAIL on a variety of graph applications. Even with a single GNN layer, SAIL achieves consistently competitive or better performance on various benchmark datasets compared with state-of-the-art baselines.

Introduction

Graph neural networks (GNNs) have become a leading framework for learning graph representations. The key to GNNs lies in the repeated aggregation over local neighbors, which yields smoothed node representations by filtering out noise in the raw node features. Despite the many architectures proposed (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Veličković et al. 2018), learning GNN models that maintain local smoothness usually depends on supervised signals (e.g., node labels or graph labels). However, labeled information is not available in many scenarios. With the rising attention on self-supervised learning (Veličković et al. 2019; Hassani and Khasahmadi 2020), pre-training GNNs without labels has become an alternative way to learn GNN models.

There is a group of unsupervised node representation learning models in the spirit of self-supervised learning. As one of the most representative examples, predicting contextual neighbors (e.g., DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) or node2vec (Grover and Leskovec 2016)) enforces locally connected nodes to have similar representations. Self-supervising signals of this kind are designed to capture local structural dependency but discard the contribution of node feature smoothness, which has been used to improve the expressive capability of GNNs (Kipf and Welling 2017; Wu et al. 2019). Another line of work pays attention to maximizing the mutual information (MI) criterion to reach agreement across multi-view graph representations (Hassani and Khasahmadi 2020; You et al. 2020a), in which each view of the augmented graph is generated by operations on nodes, attributes, etc. However, most of these methods aim at agreement on graph-level representations (You et al. 2020a; Sun et al. 2019), which might not be suitable for node-level tasks.
Instead of creating multiple views through graph augmentation methods (Hassani and Khasahmadi 2020), recent works build upon self-augmented views created by the intermediate hidden layers of a GNN. As a pioneering work, deep graph infomax (DGI) (Veličković et al. 2019) proposes to maximize the MI between the summarized graph embedding and the node embeddings. However, the summarized graph embedding contains global context that might not be shared by all nodes. Inspired by DGI, graphical mutual information (GMI) (Peng et al. 2020) instead maximizes the edge MI between the created views of two adjacent nodes. As GMI focuses on the edge MI maximization task, it lacks a bird's-eye view of the learned node representations; the learned GNN might be biased towards performing well on edge prediction but degrade on other tasks such as node clustering or classification. Recently, some works (Mandal et al. 2021) attempt to bring the idea of meta-learning to train GNNs with meta knowledge, which can help avoid the bias caused by a single pretext task. However, the meta-GNN might contain knowledge that causes a task discrepancy issue (Tian et al. 2020; Wang et al. 2020). Given the knowledge of previous work, we ask: can we advance the expressivity of GNNs with the knowledge extracted by themselves in an unsupervised way?

To answer this question, we theoretically dissect the graph convolution operations (Theorem 1) and find that the smoothness of node representations generated by GNNs is dominated by the smoothness of neighborhood embeddings from previous layers and by the structural similarity. This suggests that improving the graph representations of a shallow layer can indirectly help obtain better node embeddings at a deep layer or the final layer. Based on this observation, we propose SAIL, a Self-Augmented graph contrastive Learning framework, in which we mainly use two different views of transformed node representations (i.e., non-linear mappings of the input node features and the final layer of the GNN). More specifically, we propose to iteratively use the smoothed node representations from the output layer of the GNN to improve the node representations of the shallow or input layer (e.g., the non-linear mapping of the input node features). A recent work (Chen et al. 2020b) shares a similar idea for the supervised setting; the difference is that it forces the knowledge to flow from low-level to high-level neural representations. Besides attempting to reach agreement on the selected views, we introduce a self-distilling module that imposes consistency regularization over node representations from both local and global perspectives. The design of the self-distilling module is inspired by a recent work (Wang and Isola 2020) on the importance of alignment and uniformity for a successful contrastive learning method. Given a distribution of positive pairs, the alignment calculates the expected similarity of connected nodes (i.e., local closeness), and the uniformity measures how well the encoded node embeddings are globally distributed. The intra-distilling module aims at forcing the learned representations and node features to have a consistent uniformity distribution. Inspired by another piece of work (Ishida et al.
2020), which shows that learning bias can be alleviated by injecting noise into the objective, we design an inter-distilling framework to align the node representations from a copied teacher model with those of a noisy student model. Through multiple runs of the inter-distilling module, we implicitly mimic a deep smoothing operation with a shallow GNN (e.g., only a single GNN layer), while avoiding the noisy information from high-order neighbors that causes the known over-smoothing issue (Chen et al. 2020a; Li, Han, and Wu 2018), since a shallow GNN only depends on the local neighbors. The proposed SAIL can thus learn a shallow but powerful GNN. Even with a single GNN layer, it achieves consistently competitive or even better performance on various benchmark datasets compared to state-of-the-art baselines. We summarize the contributions of this work as follows:

- We present SAIL, to the best of our knowledge, the first generic self-supervised framework designed for advancing the expressivity of GNNs by distilling knowledge from self-augmented views without depending on any external teacher model.
- We introduce a universal self-distilling module for unsupervised learning of graph neural networks. The presented self-distilling method brings several advantages, including but not limited to: 1) distilling knowledge from a self-created teacher model following the graph topology; 2) mimicking a deep smoothing operation with a shallow GNN by iteratively distilling knowledge from teacher models to guide a noisy student model with future knowledge.
- We demonstrate the effectiveness of the proposed method with thorough experiments on multiple benchmark datasets and various tasks, yielding consistent improvement compared with state-of-the-art baselines.

Related Work

Graph Neural Networks. In recent years, we have witnessed fast progress of graph neural networks in both methodology and applications. As one of the pioneering studies, spectral graph convolution methods (Defferrard, Bresson, and Vandergheynst 2016) generalized the convolution operation to non-Euclidean graph data. Kipf et al. (Kipf and Welling 2017) reduced the computational complexity to a first-order Chebyshev approximation under an affine assumption. NT et al. (NT and Maehara 2019) showed that the classical graph convolutional network (GCN) and its variants are just low-pass filters. At the same time, several studies (Li et al. 2019; Klicpera, Weißenberger, and Günnemann 2019) proposed to replace the standard GCN layer with normalized high-order low-pass filters (e.g., personalized PageRank, heat kernel). This conclusion helps explain why the simplified GCN proposed by Wu et al. (Wu et al. 2019) performs competitively with complicated multi-layer GNNs. Besides GCNs, many novel GNNs have been proposed, such as multi-head attention models (Veličković et al. 2018; Ma et al. 2019), recurrent graph neural networks (Liu et al. 2019), RevGNN (Li et al. 2021), and heterogeneous graph neural networks (Zhang et al. 2019).

Self-supervised Learning for GNNs. In addition to the line of work following DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) for constructing self-supervising signals, mutual information (MI) maximization (Veličković et al. 2019) over the input and output representations has emerged as an alternative solution. In analogy to discriminating whether an output image representation is generated from the input image patch or from a noisy image, Veličković et al. (Veličković et al.
2019) propose the deep graph infomax (DGI) criterion to maximize the mutual information between a high-level global graph summary vector and a local patch representation. Inspired by DGI, more and more recent methods, such as graph clustering (Bo et al. 2020), InfoGraph (Sun et al. 2019), GraphCL (You et al. 2020a), GCC (Qiu et al. 2020), and pre-training of graph neural networks (Hu et al. 2019), have been designed for learning graph representations. Most self-supervised learning methods can be distinguished by the way they perform data augmentation and by the predefined pretext tasks (Xie et al. 2021; Sun, Lin, and Zhu 2020; You et al. 2020b; Zhao et al. 2021; Xu et al. 2021). For example, graph contrastive coding (GCC) borrows the idea from momentum contrastive learning (He et al. 2020) and aims at learning transferable graph neural networks that take node structural features as input. Both GraphCL (You et al. 2020a) and InfoGraph (Sun et al. 2019) are designed for learning agreement between graph-level representations of augmented graph patches.

Methodology

Figure 1: Overall architecture of the proposed self-supervised GNN with intra- and inter-distilling modules. The intra-distilling module aims at forcing the learned representations and node features to have a consistent uniformity distribution. The inter-distilling module consists of three operations: 1) creating a teacher model by copying the target model (i.e., $\theta_t \leftarrow \theta_s$); 2) fading the target model into a student model by injecting noise into the model parameters ($l(\theta_s, \epsilon) = w\theta_s + (1-w)\epsilon$); 3) supervising the faded target model (i.e., the student model) with future knowledge (i.e., $H^t$).

Let $G = \{V, E, X\}$ denote an attributed graph, where $V$ is the node set $\{v_i \in V\}$, $E$ is the edge set, and $X \in \mathbb{R}^{N \times F}$ is the node feature matrix whose $i$-th row $x_i$ is the feature of node $v_i$. We use $A$ to represent the node relation matrix, where $a_{ij} = 1$ if there is a link between nodes $v_i$ and $v_j$, i.e., $e_{ij} \in E$, and $a_{ij} = 0$ otherwise. We define the degree matrix $D = \mathrm{diag}(d_1, d_2, \dots, d_N)$, where each element equals the row sum of the adjacency matrix, $d_i = \sum_j a_{ij}$. Our goal is to learn a graph neural encoder $\Phi(X, A \mid \theta) = H$, where $H = \{h_1, h_2, \dots, h_N\}$ are the representations learned for the nodes in $V$. A deep graph encoder $\Phi$ usually suffers from the over-smoothing problem. In this work, we instantiate a GNN with a single layer to validate the effectiveness of the proposed method in learning qualified node representations from shallow neighborhoods, but the analysis results in the following section can be easily extended to deeper GNNs. The graph neural encoder (Li et al. 2019; Wu et al. 2019) used in this paper is defined as

$$\hat{A} = \tilde{D}^{-\frac{1}{2}}(A + I)\tilde{D}^{-\frac{1}{2}}, \qquad \Phi(X, A) = \sigma(\hat{A}^2 X W) \tag{1}$$

where $W \in \mathbb{R}^{F \times F'}$ is a learnable parameter and $\sigma(\cdot)$ denotes the activation function. The vector $h_i \in \mathbb{R}^{F'}$ summarizes a subgraph containing the second-order neighbors centered around node $v_i$. We refer to $H$ and $\tilde{X} = XW$ as the self-augmented node representations obtained by transforming the raw node features $X$; $\tilde{X}$ denotes the low-level node features, which might contain lots of noisy information.
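To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the single-layer encoder, assuming a dense adjacency matrix and a ReLU activation; the class and variable names are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn


class SingleLayerGNN(nn.Module):
    """Sketch of Eq. (1): H = sigma(A_hat^2 X W), returning both views (H, X~)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # learnable W

    @staticmethod
    def normalize_adj(adj: torch.Tensor) -> torch.Tensor:
        # A_hat = D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree matrix of A + I
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
        return deg_inv_sqrt.unsqueeze(1) * a_tilde * deg_inv_sqrt.unsqueeze(0)

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        a_hat = self.normalize_adj(adj)
        x_tilde = self.W(x)                         # low-level view  X~ = XW
        h = torch.relu(a_hat @ (a_hat @ x_tilde))   # smoothed view  H = sigma(A_hat^2 X W)
        return h, x_tilde
```

Calling `SingleLayerGNN(F, F_prime)(X, A)` then yields the two self-augmented views $H$ and $\tilde{X}$ that the rest of the framework contrasts and distills.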
Definition 1 (Second-order Graph Regularization). The objective of second-order graph regularization is to minimize

$$\sum_{e_{ij} \in E} s_{ij} \, \| h_i - h_j \|_2^2 \tag{2}$$

where $s_{ij}$ is the second-order similarity, which can be defined as the cosine similarity $s_{ij} = \frac{\sum_{c \in \mathcal{N}(i) \cap \mathcal{N}(j)} \alpha_{ic}\alpha_{jc}}{\|\alpha_{i}\|_2 \|\alpha_{j}\|_2}$, and $h_i$ denotes the node representation.

Theorem 1. Suppose that a GNN aggregates node representations as $h_i^l = \sigma\big(\sum_{j \in \mathcal{N}(i) \cup v_i} \alpha_{ij} h_j^{l-1}\big)$, where $\alpha_{ij}$ stands for an element of a normalized relation matrix. If the first-order gradient of the selected activation function $\sigma(x)$ satisfies $|\sigma'(x)| \le 1$, then the graph neural operator approximately equals a second-order proximity graph regularization over the node representations.

Proof. Here we mainly focus on analyzing GNNs whose aggregation operator is a weighted sum over the neighbors, i.e., $h_i^l = \sigma\big(\sum_{j \in \mathcal{N}(i) \cup v_i} \alpha_{ij} h_j^{l-1}\big)$. Typical examples include but are not limited to GCN (Kipf and Welling 2017), where $\alpha_{ij}$ can be an element of the normalized adjacency matrix $\hat{A} = \tilde{D}^{-\frac12}(A+I)\tilde{D}^{-\frac12}$ or of $\hat{A}^2$. The node representation $h_i^l$ can be divided into three parts: the node's own representation $\alpha_{ii} h_i^{l-1}$, the sum of common-neighbor representations $S_i = \sum_{c \in \mathcal{N}(i) \cap \mathcal{N}(j)} \alpha_{ic} h_c^{l-1}$, and the sum of non-common-neighbor representations $D_i = \sum_{q \in \mathcal{N}(i) \setminus (\mathcal{N}(i) \cap \mathcal{N}(j))} \alpha_{iq} h_q^{l-1}$. Let $y = \sigma(x)$ and suppose that the selected activation function satisfies $|\sigma'(x)| \le 1$. Then $\frac{(y_1 - y_2)^2}{(x_1 - x_2)^2} = \frac{|y_1 - y_2|^2}{|x_1 - x_2|^2} \le 1$. Let us reformulate the definition of $h_i^l$ as $h_i^l = \sigma(\hat{h}_i^l)$ with $\hat{h}_i^l = \sum_{j \in \mathcal{N}(i) \cup v_i} \alpha_{ij} h_j^{l-1}$. Then $\|h_i^l - h_j^l\|_2 \le \|\hat{h}_i^l - \hat{h}_j^l\|_2$, and the distance between the representations $h_i^l$ and $h_j^l$ satisfies:

$$
\begin{aligned}
\|h_i^l - h_j^l\|_2 &\le \|\hat{h}_i^l - \hat{h}_j^l\|_2 = \|(\alpha_{ii} h_i^{l-1} - \alpha_{jj} h_j^{l-1}) + (S_i - S_j) + (D_i - D_j)\|_2 \\
&\le \|\alpha_{ii} h_i^{l-1} - \alpha_{jj} h_j^{l-1}\|_2 + \|S_i - S_j\|_2 + \|D_i - D_j\|_2 \\
&\le \underbrace{\|\alpha_{ii} h_i^{l-1} - \alpha_{jj} h_j^{l-1}\|_2}_{\text{local feature smoothness}} + \underbrace{\|D_i\|_2 + \|D_j\|_2}_{\text{non-common neighbors}} + \underbrace{\Big\|\sum_{c \in \mathcal{N}(i) \cap \mathcal{N}(j)} (\alpha_{ic} - \alpha_{jc}) h_c^{l-1}\Big\|_2}_{\text{structure proximity}}
\end{aligned} \tag{3}
$$

From Equation (3), we can see that the upper bound on the distance between a pair of node representations is mainly influenced by the local feature smoothness and the structure proximity. According to the proof shown above, if a pair of nodes $(v_i, v_j)$ has smoothed local features and similar structure proximity with many common similar neighbors (i.e., $\alpha_{ic} \approx \alpha_{jc}$), a GNN will also enforce their node representations to be similar.

Learning from Self-augmented View

From the conclusion of Theorem 1, we can see that the quality of each GNN layer is closely related to the previous layer. As the initial layer, the quality of the input features $X$ propagates from the bottom to the top layer of a given GNN model. Since a graph neural layer works as a low-pass filter (NT and Maehara 2019), its output $H$ consists of smoothed node representations obtained by filtering out the noisy information existing in the low-level features. A single GNN layer usually cannot perfectly produce clean node representations. By stacking multiple layers, a deep GNN model can repeatedly improve the representations from the previous layer; however, deep GNN models tend to over-smooth the node representations with unlimited neighborhood mixing. In this work, we attempt to improve a GNN with a shallow neighborhood by shaping the low-level node features with the relatively smoothed node representations. To overcome the above-discussed challenges, we propose to transfer the knowledge learned in the last GNN layer $H$ to shape $\tilde{X}$ from both a local and a global view. Concretely, instead of constructing a contrastive learning loss over node representations $h$ at the same GNN layer, we maximize the neighborhood-prediction probability between a node representation $h_i$ and the input node features $\tilde{x}_j$ of its neighbors.
Formally, for a sample set $\{v_i, v_j, v_k\}$ where $e_{ij} \in E$ but $e_{ik} \notin E$, the loss $\ell_{i,jk}$ is defined on the pairwise comparison of $(h_i, \tilde{x}_j)$ and $(h_i, \tilde{x}_k)$. Therefore, our self-supervised learning objective for the GNN is

$$\sum_{e_{ij} \in E,\, e_{ik} \notin E} \ell\big(\psi(h_i, \tilde{x}_j), \psi(h_i, \tilde{x}_k)\big) + \lambda R(G), \tag{4}$$

where $\ell(\cdot)$ can be an arbitrary contrastive loss function, $\psi$ is a scoring function, and $R$ is the regularization function with weight $\lambda$ that implements graph structural constraints (introduced in the next section). There are many candidates for the contrastive loss $\ell(\cdot)$; in this work we use the logistic pairwise loss $-\ln \sigma\big(\psi(h_i, \tilde{x}_j) - \psi(h_i, \tilde{x}_k)\big)$, where $\sigma(x) = \frac{1}{1 + \exp(-x)}$.

Self-distilling Graph Knowledge Regularization

The objective function defined in Eq. (4) models the interactions between output node representations $h$ and input node features $\tilde{x}$, which can be regarded as an intra-model knowledge distillation process that uses the smoothed node embeddings to denoise the low-level node features. However, the contrastive samples built from edge connectivity might fail to represent the whole picture of the node representation distribution, and can bias the learned node representations towards predicting edges. We therefore present a self-distilling method, shown in Figure 1, which consists of intra- and inter-distilling modules.

Intra-distilling module: To supplement the loss defined on individual pairwise samples, we introduce a regularization term to ensure distribution consistency between the relations of the learned node representations $H$ and those of the node features $\tilde{X}$ over a set of randomly sampled nodes. Let $LS = \{LS_1, LS_2, \dots, LS_N\}$ denote the randomly sampled pseudo relation graph, where $LS_i \subset V$ and $|LS_i| = d$ is the number of sampled pseudo local neighbors for center node $v_i$. The estimated proximity for each node in the $i$-th local structure $LS_i$ is computed as

$$S^t_{ij} = \frac{\exp(\psi(h_i, \tilde{x}_j))}{\sum_{v_j \in LS_i} \exp(\psi(h_i, \tilde{x}_j))}, \qquad S^s_{ij} = \frac{\exp(\psi(\tilde{x}_i, \tilde{x}_j))}{\sum_{v_j \in LS_i} \exp(\psi(\tilde{x}_i, \tilde{x}_j))} \tag{5}$$

where $S^t_{ij}$ and $S^s_{ij}$ denote the similarities between nodes $v_i$ and $v_j$ estimated from the two kinds of node representations. $S^t_{ij}$ acts as the teacher signal to guide the node features $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_N\}$ to agree on the relation distribution over the randomly sampled graph. For node $v_i$, the relation-distribution similarity can be measured as $S_i = \mathrm{CrossEntropy}(S^t[i,:], S^s[i,:])$. Then we compute the relation similarity distribution over all nodes as

$$R_{intra} = \sum_{i=1}^{N} S_i, \tag{6}$$

where $R_{intra}$ acts as a regularization term generated from the intra-model knowledge, pushing the learned node representations $H$ and node features $\tilde{X}$ to be consistent at the subgraph level.

Algorithm 1: SAIL
Input: graph G = {V, E, X}; hyperparameters {α, λ}
Output: learned GNN Φ
- Initialize Φ_0 by optimizing Eq. (16) without R_cross.
- For m = 1 to n:
  - If m mod τ == 0: copy the target model into the teacher (θ_t ← θ_s) and fade the target into a noisy student (θ_s ← wθ_s + (1 - w)ε).
  - Optimize L_ssl(Φ_t, Φ_s, X, A).
- Return Φ_s.
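The following is a minimal sketch of the edge-level contrastive loss of Eq. (4) with the logistic pairwise loss, and of the intra-distilling regularizer of Eqs. (5)-(6). It assumes $\psi$ is an inner product, one sampled negative per observed edge, pseudo neighbor sets $LS_i$ given as an index tensor, and a detached teacher distribution $S^t$; these choices and all names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def psi(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Scoring function psi; an inner product is assumed here for illustration.
    return (a * b).sum(dim=-1)


def edge_mi_loss(h, x_tilde, pos_src, pos_dst, neg_dst):
    # Eq. (4) with the logistic pairwise loss -log sigma(psi(h_i, x~_j) - psi(h_i, x~_k)),
    # where (i, j) is an observed edge and (i, k) a sampled non-edge.
    diff = psi(h[pos_src], x_tilde[pos_dst]) - psi(h[pos_src], x_tilde[neg_dst])
    return F.softplus(-diff).mean()          # softplus(-z) == -log sigmoid(z)


def intra_distill(h, x_tilde, pseudo_neighbors):
    # Eqs. (5)-(6): match the relation distributions S^t (from H) and S^s (from X~)
    # over d randomly sampled pseudo neighbors per node.
    # pseudo_neighbors: LongTensor [N, d] holding the sampled index sets LS_i.
    teacher_logits = psi(h.unsqueeze(1), x_tilde[pseudo_neighbors])        # [N, d]
    student_logits = psi(x_tilde.unsqueeze(1), x_tilde[pseudo_neighbors])  # [N, d]
    s_t = F.softmax(teacher_logits, dim=-1).detach()   # treated as a fixed teacher signal
    log_s_s = F.log_softmax(student_logits, dim=-1)
    # Row-wise cross-entropy between S^t[i,:] and S^s[i,:], summed over nodes (Eq. 6).
    return -(s_t * log_s_s).sum(dim=-1).sum()
```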
Inter-distilling module: The second regularization introduces the inter-distilling module to address the over-smoothing issue. The inter-distilling module guides the target GNN model by transferring the learned self-supervised knowledge. Through multiple applications of the inter-distilling module, we implicitly mimic a deep smoothing operation with a shallow GNN (e.g., a single GNN layer), while avoiding noisy information from high-order neighbors. The overall inter-distilling framework is shown in Figure 1.

We create a teacher model $\Phi_t$ by copying the target GNN model, and then inject noise into the target model, which degrades into a student model $\Phi_s$ after a fixed number of iterations. Working with the self-created teacher and student models $\{\Phi_t, \Phi_s\}$ of the same architecture shown in Eq. (1), the student model $\Phi_s(X, A) = \{H^s, \tilde{X}^s\}$ distills knowledge from the teacher model $\Phi_t$. Since no label is available, we propose to implement representation distillation (Tian, Krishnan, and Isola 2020) under the constraint of the graph structure. The knowledge distillation module consists of two parts, defined as

$$R_{inter} = KD(H^t, \tilde{X}^s \mid G) + KD(H^t, H^s \mid G) \tag{7}$$

where $H^t$ is the node representations from the teacher model $\Phi_t$, and $\tilde{X}^s = XW$. To define the module $KD(\cdot)$, we should meet several requirements: 1) the function should be easy to compute and friendly to the back-propagation strategy; and 2) it should stick to the graph structure constraint. We resort to conditional random fields (CRF) (Lafferty, McCallum, and Pereira 2001) to capture the pairwise relationships between different nodes. For a general knowledge distillation module $KD(Y, Z \mid G)$, the dependency of $Z$ on $Y$ can be given following the CRF model:

$$P(Z \mid Y) = \frac{1}{C(Y)} \exp\big(-E(Z \mid Y)\big), \tag{8}$$

where $C(\cdot)$ is the normalization factor and $E(\cdot)$ stands for the energy function, defined as

$$E(Z_i \mid Y_i) = \psi_u(Z_i, Y_i) + \psi_p(Z_i, Z_j, Y_i, Y_j) = (1 - \alpha)\|Z_i - Y_i\|_2^2 + \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij} \|Z_i - Z_j\|_2^2, \tag{9}$$

where $\psi_u$ and $\psi_p$ are the unary and pairwise energy functions, respectively, and the parameter $\alpha \in [0, 1]$ controls the relative importance of the two energy functions. When $Z$ is the node feature of the student model and $Y$ is the node representation from the teacher model $\Phi_{m-1}$, the energy function defined in Equation (9) enforces the representation of node $v_i$ from the student model to be close to its teacher representation and to its neighbor nodes. After obtaining the energy function, we can resolve the CRF objective with the mean-field approximation method by employing a simple distribution $Q(Z)$ to approximate the distribution $P(Z \mid Y)$. Specifically, $Q(Z)$ can be initialized as the product of marginal distributions, $Q(Z) = \prod_{i=1}^{N} Q_i(Z_i)$, and optimized by minimizing the KL divergence between the two distributions:

$$\arg\min\ KL\big(Q(Z)\,\|\,P(Z \mid Y)\big). \tag{10}$$

Then we can get the optimal $Q_i^*(Z_i)$ as

$$\ln Q_i^*(Z_i) = \mathbb{E}_{j \ne i}\big[\ln P(Z_j \mid Y_j)\big] + \mathrm{const}. \tag{11}$$

According to Equations (8) and (9), we can get

$$Q_i^*(Z_i) \propto \exp\Big(-(1 - \alpha)\|Z_i - Y_i\|_2^2 - \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij} \|Z_i - Z_j\|_2^2\Big), \tag{12}$$

which shows that $Q_i^*(Z_i)$ is a Gaussian function. By computing its expectation, we obtain the optimal solution for $Z_i$:

$$Z_i^* = \frac{(1 - \alpha) Y_i + \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij} Z_j}{(1 - \alpha) + \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij}}. \tag{13}$$

The cross-model knowledge distillation rule is then obtained by enforcing the node representations from the student model to have minimal metric distance to $Z_i^*$. After replacing the random variable $Y_i$ with the node representation $h_i^t$ of the teacher model $\Phi_t$, we get the final distillation regularization:

$$KD(H^t, \tilde{X}^s \mid G) = \Big\| \tilde{x}_i^s - \frac{(1 - \alpha) h_i^t + \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij} \tilde{x}_j^s}{(1 - \alpha) + \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij}} \Big\|_2^2 \tag{14}$$

$$KD(H^t, H^s \mid G) = \Big\| h_i^s - \frac{(1 - \alpha) h_i^t + \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij} h_j^s}{(1 - \alpha) + \alpha \sum_{j \in \mathcal{N}(i)} \beta_{ij}} \Big\|_2^2 \tag{15}$$

where $\tilde{x}_i^s$ denotes the feature of node $v_i$ from the student model $\Phi_s$, and $h_i^s$ denotes its output node representation.
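Below is a sketch of how the inter-distilling regularizer of Eqs. (13)-(15) and the teacher-copy/noise-injection step from Figure 1 and Algorithm 1 could be implemented. It assumes dense tensors, uniform neighbor weights for $\beta_{ij}$ (the mean-pooling instantiation described in the next paragraph), a per-node squared distance averaged over nodes, and Gaussian noise for $\epsilon$ with an arbitrary scale; all names and the noise scale are assumptions of this sketch.

```python
import copy
import torch


def kd_target(y_teacher, z_student, adj, alpha: float):
    # Eq. (13): Z*_i = ((1-alpha) Y_i + alpha * sum_j beta_ij Z_j)
    #                  / ((1-alpha) + alpha * sum_j beta_ij)
    # beta_ij is taken as uniform 1/|N(i)| (one reading of the mean-pooling choice).
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    beta = adj / deg                                   # row-normalized adjacency
    num = (1 - alpha) * y_teacher + alpha * (beta @ z_student)
    den = (1 - alpha) + alpha * beta.sum(dim=1, keepdim=True)
    return num / den


def inter_distill(h_t, h_s, x_tilde_s, adj, alpha: float = 0.5):
    # Eq. (7) with Eqs. (14)-(15): pull the student's features and representations
    # toward the CRF-derived targets built from the (frozen) teacher representations H^t.
    target_x = kd_target(h_t.detach(), x_tilde_s, adj, alpha)
    target_h = kd_target(h_t.detach(), h_s, adj, alpha)
    kd_x = ((x_tilde_s - target_x) ** 2).sum(dim=1).mean()
    kd_h = ((h_s - target_h) ** 2).sum(dim=1).mean()
    return kd_x + kd_h


@torch.no_grad()
def refresh_teacher_and_perturb(student, w: float, noise_std: float = 0.01):
    # Inter-distilling step from Figure 1 / Algorithm 1:
    # 1) copy the current model into a frozen teacher (theta_t <- theta_s);
    # 2) fade the target into a noisy student, theta_s <- w*theta_s + (1-w)*eps.
    teacher = copy.deepcopy(student)
    for p in student.parameters():
        eps = noise_std * torch.randn_like(p)          # noise distribution/scale assumed
        p.mul_(w).add_((1 - w) * eps)
    return teacher
```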
In terms of $\beta_{ij}$ in Eq. (13), we have many choices, such as attentive weights or mean pooling. In this work, we simply instantiate it with a mean-pooling operation over the node representations. Taking the two regularization terms into account, the overall self-supervised learning objective becomes

$$L_{ssl} = \sum_{e_{ij} \in E,\, e_{ik} \notin E} \ell\big(\psi(h_i^s, \tilde{x}_j^s), \psi(h_i^s, \tilde{x}_k^s)\big) + \lambda (R_{intra} + R_{inter}), \tag{16}$$

where the initial teacher model $\Phi_t$ is obtained by optimizing the proposed self-supervised objective without the cross-model distillation regularization.

Experimental Evaluation

We compare the proposed method with various state-of-the-art methods on six datasets, including three citation networks (Cora, Citeseer, Pubmed) (Kipf and Welling 2017), two product co-purchase networks (Computers, Photo), and one co-author network restricted to the Computer Science subfield (CS). The product and co-author graphs are benchmark datasets from PyTorch Geometric (Fey and Lenssen 2019).

Overall Performance

| Methods | Cora | Citeseer | PubMed | Computers | Photo | CS |
|---|---|---|---|---|---|---|
| ChebNet (Defferrard, Bresson, and Vandergheynst 2016) | 81.2 | 69.8 | 74.4 | 70.5±0.5 | 76.9±0.3 | 92.3±0.1 |
| MLP (Kipf and Welling 2017) | 55.1 | 46.5 | 71.4 | 55.3±0.2 | 71.1±0.1 | 85.5±0.1 |
| GCN (Kipf and Welling 2017) | 81.5 | 70.3 | 79.0 | 76.11±0.1 | 89.0±0.3 | 92.4±0.1 |
| SGC (Wu et al. 2019) | 81.0±0.0 | 71.9±0.1 | 78.9±0.0 | 55.7±0.3 | 69.7±0.5 | 92.3±0.1 |
| GAT (Veličković et al. 2018) | 83.0±0.7 | 72.5±0.7 | 79.0±0.3 | 71.4±0.6 | 89.4±0.5 | 92.2±0.2 |
| DisenGCN (Ma et al. 2019) | 83.7* | 73.4* | 80.5 | 52.7±0.3 | 87.7±0.6 | 93.3±0.1* |
| GMNN (Qu, Bengio, and Tang 2019) | 83.7* | 73.1 | 81.8* | 74.1±0.4 | 89.5±0.6 | 92.7±0.2 |
| GraphSAGE (Hamilton, Ying, and Leskovec 2017) | 77.2±0.3 | 67.8±0.3 | 77.5±0.5 | 78.9±0.4 | 91.3±0.5 | 93.2±0.3 |
| CAN (Meng et al. 2019) | 75.2±0.5 | 64.5±0.2 | 64.8±0.3 | 78.2±0.5 | 88.3±0.6 | 91.2±0.2 |
| DGI (Veličković et al. 2019) | 82.3±0.6 | 71.8±0.7 | 76.8±0.6 | 68.2±0.6 | 78.2±0.3 | 92.4±0.3 |
| GMI (Peng et al. 2020) | 82.8±0.7 | 73.0±0.3 | 80.1±0.2 | 62.7±0.5 | 86.2±0.3 | 92.3±0.1 |
| MVGRL (Hassani and Khasahmadi 2020) | 83.5±0.4 | 72.1±0.6 | 79.8±0.7 | 88.4±0.3* | **92.8±0.2** | 93.1±0.3 |
| SAIL | **84.6±0.3** | **74.2±0.4** | **83.8±0.1** | **89.4±0.1** | 92.5±0.1* | **93.3±0.05** |

Table 1: Accuracy (with standard deviation) of node classification (in %). The best results are highlighted in bold. Results without standard deviation are copied from the original works.

| Methods | Cora | Citeseer | PubMed | Computers | Photo | CS |
|---|---|---|---|---|---|---|
| CAN | 51.7 | 35.4 | 26.7 | 42.6 | 53.4* | 71.3 |
| GMI | 55.9 | 34.7 | 23.3 | 34.5 | 47.2 | 73.9 |
| DGI | 50.5 | 40.1* | 28.8 | 36.8 | 42.1 | 74.7* |
| MVGRL | 56.1* | 37.6 | 34.7 | 46.2* | 12.15 | 66.5 |
| SAIL | 58.1 | 44.6 | 33.3* | 49.1 | 66.5 | 76.4 |

Table 2: Node clustering performance measured by normalized mutual information (NMI) in %.

| Methods | Cora | Citeseer | PubMed | Computers | Photo | CS |
|---|---|---|---|---|---|---|
| GMNN | 87.5 | 86.9 | 88.8 | 82.1 | 86.7 | 91.7 |
| GAT | 90.5 | 89.1 | 80.5 | 84.5 | 88.4 | 92.2 |
| GCN | 82.6 | 83.2 | 88.5 | 82.1 | 86.7 | 89.7 |
| DisenGCN | 93.3 | 92.0 | 91.1 | 78.9 | 77.6 | 94.5 |
| DGI | 69.2 | 69.0 | 85.2 | 75.1 | 74.2 | 79.7 |
| MVGRL | 89.5 | 94.4 | 96.1* | 74.6 | 73.1 | 83.1 |
| CAN | 94.8 | 94.8 | 91.9 | 94.9* | 95.0 | 97.1* |
| GMI | 95.1* | 96.0* | 96.0 | 85.5 | 91.9 | 95.5 |
| SAIL | 97.3 | 98.4 | 98.5 | 94.9 | 94.6* | 97.4 |

Table 3: AUC (in %) of link prediction.

Node Classification. The GNN methods compared here include convolutional and attentive neural networks. Table 1 summarizes the overall performance. The simple GCN learned by the proposed method SAIL consistently outperforms the other baselines learned with supervised and unsupervised objectives. It is noted that DisenGCN iteratively applies an attentive routing operation to dynamically reshape node relationships in each layer; by default it has 5 layers and applies the routing 6 times, making it a typically deep GNN.
From the empirical results, we can see that SAIL empowers the single-layer GNN through iterative inter-model knowledge distillation.

Node Clustering. In the node clustering task, we evaluate the quality of the node representations learned by unsupervised methods, measured by normalized mutual information (NMI). The node representations are learned with the same experimental setting as for the node classification task. From the results shown in Table 2, we can see that the proposed method is superior to the baselines in most cases.

Link Prediction. In addition, we attempt to answer the question of whether the learned node representations preserve node proximity. The performance of each method is measured by the AUC of link prediction. All methods in this experiment have the same model configuration as in the node classification task. From the results shown in Table 3, we can see that SAIL still outperforms most of the baselines learned with supervised and unsupervised objectives. From the results in Table 1, we see that classification of nodes in the CS graph is an easy task, and most GNN models have similarly good performance. However, for link prediction (Table 3), the unsupervised models (CAN and our model) learn better representations h than those trained with supervision, obviously because the supervision information is for node classification, not for link prediction.

Exploring Locality and Feature Smoothness

Based on our theoretical understanding in Theorem 1, the representation smoothness is mainly influenced by the local structural proximity and the feature closeness. Following the ideas of recent works (Hou et al. 2020; Chen et al. 2020a), we conduct an empirical study on the node representation smoothness before and after encoding by the GNNs.

Before encoding. For a given graph and node features, we calculate the inverse of the average pairwise distance among the raw features (Hou et al. 2020) and the clustering coefficient (Watts and Strogatz 1998) to study the feature smoothness and the structural locality, respectively. Combining these with the node classification results, we empirically find that most neural encoders (e.g., GNNs, even simple MLPs) perform well on node classification in graphs like CS, which has a strong locality and large feature smoothness, as shown in Figure 2. Interestingly, for graphs with a strong locality but low node feature smoothness (e.g., Computers, Photo), unsupervised methods can leverage the graph structure to achieve better performance than supervised methods.

Figure 2: Locality and node feature smoothness in the six graphs used in the experimental evaluation. The larger the feature smoothness value, the smoother the node features.

After encoding. Chen et al. (Chen et al. 2020a) propose to use the mean average distance (MAD) between a target node and its neighbors to measure the smoothness of node representations, and the MAD gap between neighbor and remote nodes to measure over-smoothness. Let $MAD_{gap} = MAD_{rmt} - MAD_{nei}$, where $MAD_{nei}$ is the MAD between the target node and its neighbors, and $MAD_{rmt}$ is the MAD between the target node and remote nodes. If we obtain a small MAD but a relatively large MAD gap, the learned node representations are not over-smoothed. We further define a variant metric $MAD_{ratio} = \frac{MAD_{gap}}{MAD_{nei}}$ to measure the information-to-noise ratio given by the relative change of the MAD of remote nodes over neighbors.
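As a concrete reference, here is a small sketch of how these smoothness metrics could be computed, assuming cosine distance as in Chen et al. (2020a) and approximating "remote" nodes by non-adjacent nodes (the original definition uses a hop-distance threshold); function names are illustrative.

```python
import torch
import torch.nn.functional as F


def mad(h: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Mean average distance (Chen et al. 2020a): average cosine distance between
    # each target node and the nodes selected by `mask` (boolean [N, N]).
    h_norm = F.normalize(h, dim=1)
    dist = 1.0 - h_norm @ h_norm.t()                       # pairwise cosine distance
    cnt = mask.sum(dim=1).clamp(min=1)
    per_node = (dist * mask.float()).sum(dim=1) / cnt      # average over selected nodes
    return per_node[mask.any(dim=1)].mean()                # average over valid targets


def mad_metrics(h: torch.Tensor, adj: torch.Tensor) -> dict:
    # MAD_nei over neighbors, MAD_rmt over remote nodes (approximated here as
    # non-adjacent nodes), MAD_gap = MAD_rmt - MAD_nei, MAD_ratio = MAD_gap / MAD_nei.
    n = adj.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=adj.device)
    nei = (adj > 0) & ~eye
    rmt = (adj == 0) & ~eye
    mad_nei, mad_rmt = mad(h, nei), mad(h, rmt)
    gap = mad_rmt - mad_nei
    return {"MAD_nei": mad_nei, "MAD_gap": gap, "MAD_ratio": gap / mad_nei}
```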
We use the node representations obtained with the same setting as in the node classification task. The results shown in Table 5 demonstrate that the proposed method achieves the best smoothing performance (i.e., the smallest $MAD_{nei}$). In terms of the over-smoothing issue, SAIL has a relatively large $MAD_{gap}$ compared with the scale of its achieved MAD, which is captured by $MAD_{ratio}$. These empirical results indicate that the proposed method helps to tell the difference between local neighbors and remote nodes.

| Data | Metrics | GCN | GAT | DisenGCN | GMNN | CAN | DGI | GMI | MVGRL | SAIL |
|---|---|---|---|---|---|---|---|---|---|---|
|  | MADnei | 0.075 | 0.029 | 0.215 | 0.088 | 0.059 | 0.312 | 0.069 | 0.240 | 0.013 |
|  | MADgap | 0.308 | 0.083 | 0.471 | 0.390 | 0.900 | 0.557 | 0.966 | 0.661 | 0.322 |
|  | MADratio | 4.13 | 2.86 | 2.19 | 4.43 | 15.2 | 1.77 | 13.83 | 2.75 | 25.3 |
|  | MADnei | 0.049 | 0.014 | 0.194 | 0.059 | 0.057 | 0.289 | 0.122 | 0.174 | 0.007 |
|  | MADgap | 0.308 | 0.083 | 0.471 | 0.390 | 0.921 | 0.497 | 0.879 | 0.491 | 0.429 |
|  | MADratio | 4.13 | 2.86 | 2.19 | 4.43 | 15.1 | 1.72 | 6.2 | 2.82 | 61.2 |
|  | MADnei | 0.043 | 0.024 | 0.054 | 0.068 | 0.037 | 0.112 | 0.086 | 0.180 | 0.024 |
|  | MADgap | 0.155 | 0.083 | 0.224 | 0.438 | 0.757 | 0.145 | 0.833 | 0.541 | 0.294 |
|  | MADratio | 6.02 | 6.43 | 4.17 | 6.39 | 20.6 | 1.29 | 8.63 | 3.00 | 12.1 |

Table 5: Empirical analysis showing the quality of the learned node representations measured by mean average distance (MAD); each block of three rows corresponds to one dataset.

Robustness Against Incomplete Graphs. With the same experimental configuration as the link prediction task, we also validate the performance of node embeddings learned from incomplete graphs on the node classification task. According to the results in Table 4, SAIL still outperforms the baseline methods in most cases, demonstrating the robustness of its performance. It has a larger average downgrade in node classification accuracy than CAN, but still achieves better classification accuracy.

| Methods | Cora | Citeseer | PubMed | Computers | Photo | CS | Avg |
|---|---|---|---|---|---|---|---|
| GMNN | 77.2 | 68.8 | 79.8 | 70.8 | 87.5 | 91.6 | 4.1% |
| GAT | 77.7 | 65.6 | 76.7 | 69.3 | 89.4 | 90.4 | 4.0% |
| GCN | 76.0 | 67.2 | 77.7 | 67.5 | 88.7 | 89.9 | 4.5% |
| DisenGCN | 77.6 | 68.2 | 78.3 | 37.5 | 48.8 | 92.4 | 15.1% |
| CAN | 73.2 | 64.0 | 63.5 | 77.5 | 88.1 | 91.1 | 1.1% |
| DGI | 72.3 | 70.1 | 71.5 | 67.6 | 77.4 | 91.7 | 4.0% |
| GMI | 77.4 | 68.3 | 76.9 | 54.9 | 82.5 | 89.2 | 6.1% |
| MVGRL | 69.5 | 62.5 | 76.5 | 87.2 | 92.5 | 91.8 | 6.2% |
| SAIL | 81.0 | 71.3 | 81.2 | 88.5 | 92.4 | 92.7 | 2.1% |

Table 4: Accuracy (in %) of node classification after randomly removing 20% of neighbors. The last column shows the average classification accuracy downgrade compared with the results in Table 1.

Ablation Study

We conduct node classification experiments to validate the contribution of each component of the proposed SAIL, where EMI denotes the edge MI loss $\ell_{i,jk}$, Intra stands for the intra-distilling module $R_{intra}$, and Inter represents $R_{inter}$ in Eq. (16). From the results shown in Figure 3, we can see that the intra- and inter-distilling modules jointly improve the quality of the learned node representations. Combined with the edge MI maximization task, they bring a significant improvement in node classification accuracy.

Figure 3: Ablation study of the influence of the distillation components (EMI, EMI+Intra, EMI+Inter, EMI+Intra+Inter) on node classification accuracy (in %) on Cora, Citeseer, and Pubmed.

Conclusions

In this work, we propose a self-supervised learning method (SAIL) regularized by graph structure to learn unsupervised node representations for various downstream tasks. We conduct thorough experiments on node classification, node clustering, and link prediction tasks to evaluate the learned node representations. Experimental results demonstrate that SAIL helps to learn a competitive shallow GNN that outperforms state-of-the-art GNNs learned with supervised or unsupervised objectives. This initial study might shed light upon a promising way to implement self-distillation for graph neural networks.
In the future, we plan to study how to improve the robustness of the proposed method against adversarial attacks and to learn transferable graph neural networks for downstream tasks such as few-shot classification.

References

Bo, D.; Wang, X.; Shi, C.; Zhu, M.; Lu, E.; and Cui, P. 2020. Structural deep clustering network. In The Web Conference, 1400-1410.
Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; and Sun, X. 2020a. Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View. In AAAI.
Chen, Y.; Bian, Y.; Xiao, X.; Rong, Y.; Xu, T.; and Huang, J. 2020b. On Self-Distilling Graph Neural Network. In IJCAI.
Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 3844-3852.
Fey, M.; and Lenssen, J. E. 2019. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.
Grover, A.; and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In KDD, 855-864.
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NeurIPS, 1024-1034.
Hassani, K.; and Khasahmadi, A. H. 2020. Contrastive Multi-View Representation Learning on Graphs. In ICML.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR, 9729-9738.
Hou, Y.; Zhang, J.; Cheng, J.; Ma, K.; Ma, R. T. B.; Chen, H.; and Yang, M.-C. 2020. Measuring and Improving the Use of Graph Information in Graph Neural Networks. In ICLR.
Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V.; and Leskovec, J. 2019. Strategies for Pre-training Graph Neural Networks. In ICLR.
Ishida, T.; Yamane, I.; Sakai, T.; Niu, G.; and Sugiyama, M. 2020. Do we need zero training loss after achieving zero training error? In ICML.
Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
Klicpera, J.; Weißenberger, S.; and Günnemann, S. 2019. Diffusion Improves Graph Learning. In NeurIPS, 13333-13345.
Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
Li, G.; Müller, M.; Ghanem, B.; and Koltun, V. 2021. Training Graph Neural Networks with 1000 Layers. In ICML.
Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
Li, Q.; Wu, X.-M.; Liu, H.; Zhang, X.; and Guan, Z. 2019. Label efficient semi-supervised learning via graph filtering. In CVPR, 9582-9591.
Liu, Z.; Chen, C.; Li, L.; Zhou, J.; Li, X.; Song, L.; and Qi, Y. 2019. GeniePath: Graph neural networks with adaptive receptive paths. In AAAI, volume 33, 4424-4431.
Ma, J.; Cui, P.; Kuang, K.; Wang, X.; and Zhu, W. 2019. Disentangled graph convolutional networks. In ICML, 4212-4221.
Mandal, D.; Medya, S.; Uzzi, B.; and Aggarwal, C. 2021. Meta-Learning with Graph Neural Networks: Methods and Applications. arXiv preprint arXiv:2103.00137.
Meng, Z.; Liang, S.; Bao, H.; and Zhang, X. 2019. Co-embedding attributed networks. In WSDM, 393-401.
NT, H.; and Maehara, T. 2019. Revisiting Graph Neural Networks: All We Have is Low-Pass Filters. arXiv preprint arXiv:1905.09550.
Peng, Z.; Huang, W.; Luo, M.; Zheng, Q.; Rong, Y.; Xu, T.; and Huang, J. 2020. Graph Representation Learning via Graphical Mutual Information Maximization. In WWW, 259-270.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014.
DeepWalk: Online learning of social representations. In KDD, 701-710.
Qiu, J.; Chen, Q.; Dong, Y.; Zhang, J.; Yang, H.; Ding, M.; Wang, K.; and Tang, J. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In KDD, 1150-1160.
Qu, M.; Bengio, Y.; and Tang, J. 2019. GMNN: Graph Markov Neural Networks. In ICML, 5241-5250.
Sun, F.-Y.; Hoffman, J.; Verma, V.; and Tang, J. 2019. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In ICLR.
Sun, K.; Lin, Z.; and Zhu, Z. 2020. Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In AAAI, 5892-5899.
Tian, P.; Qi, L.; Dong, S.; Shi, Y.; and Gao, Y. 2020. Consistent MetaReg: Alleviating Intra-task Discrepancy for Better Meta-knowledge. In IJCAI.
Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive representation distillation. In ICLR.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph attention networks. In ICLR.
Veličković, P.; Fedus, W.; Hamilton, W. L.; Liò, P.; Bengio, Y.; and Hjelm, R. D. 2019. Deep graph infomax. In ICLR.
Wang, C.; Qiu, M.; Huang, J.; and He, X. 2020. Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining. In EMNLP, 3094-3104.
Wang, T.; and Isola, P. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 9929-9939.
Watts, D. J.; and Strogatz, S. H. 1998. Collective dynamics of small-world networks. Nature, 393(6684): 440-442.
Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying Graph Convolutional Networks. In ICML, 6861-6871.
Xie, Y.; Xu, Z.; Zhang, J.; Wang, Z.; and Ji, S. 2021. Self-supervised learning of graph neural networks: A unified review. arXiv preprint arXiv:2102.10757.
Xu, D.; Cheng, W.; Luo, D.; Chen, H.; and Zhang, X. 2021. InfoGCL: Information-Aware Graph Contrastive Learning. In NeurIPS.
You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; and Shen, Y. 2020a. Graph contrastive learning with augmentations. In NeurIPS.
You, Y.; Chen, T.; Wang, Z.; and Shen, Y. 2020b. When does self-supervision help graph convolutional networks? In ICML, 10871-10880.
Zhang, C.; Song, D.; Huang, C.; Swami, A.; and Chawla, N. V. 2019. Heterogeneous graph neural network. In KDD, 793-803.
Zhao, J.; Wen, Q.; Sun, S.; Ye, Y.; and Zhang, C. 2021. Multi-view Self-supervised Heterogeneous Graph Embedding. In ECML/PKDD, 319-334.