# Learning to Pre-train Graph Neural Networks

Yuanfu Lu¹,², Xunqiang Jiang¹, Yuan Fang³, Chuan Shi¹,⁴

¹Beijing University of Posts and Telecommunications
²WeChat Search Application Department, Tencent Inc., China
³Singapore Management University
⁴Peng Cheng Laboratory, Shenzhen, China

luyuanfu@bupt.edu.com, skd621@bupt.edu.cn, yfang@smu.edu.sg, shichuan@bupt.edu.cn

*Part of the work was done while a visiting research student at Singapore Management University.*

## Abstract

Graph neural networks (GNNs) have become the de facto standard for representation learning on graphs, which derive effective node representations by recursively aggregating information from graph neighborhoods. While GNNs can be trained from scratch, pre-training GNNs to learn transferable knowledge for downstream tasks has recently been demonstrated to improve the state of the art. However, conventional GNN pre-training methods follow a two-step paradigm: 1) pre-training on abundant unlabeled data and 2) fine-tuning on downstream labeled data, between which there exists a significant gap due to the divergence of optimization objectives in the two steps. In this paper, we conduct an analysis to show the divergence between pre-training and fine-tuning, and to alleviate such divergence, we propose L2P-GNN, a self-supervised pre-training strategy for GNNs. The key insight is that L2P-GNN attempts to learn how to fine-tune during the pre-training process in the form of transferable prior knowledge. To encode both local and global information into the prior, L2P-GNN is further designed with a dual adaptation mechanism at both node and graph levels. Finally, we conduct a systematic empirical study on the pre-training of various GNN models, using both a public collection of protein graphs and a new compilation of bibliographic graphs for pre-training. Experimental results show that L2P-GNN is capable of learning effective and transferable prior knowledge that yields powerful representations for downstream tasks. (Code and datasets are available at https://github.com/rootlu/L2P-GNN.)

## Introduction

Graph neural networks (GNNs) have emerged as the state of the art for representation learning on graphs, due to their ability to recursively aggregate information from neighborhoods on the graph, naturally capturing both graph structures and node or edge features (Zhang, Cui, and Zhu 2020; Wu et al. 2020; Dwivedi et al. 2020). Various GNN architectures with different aggregation schemes have been proposed (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Velickovic et al. 2018; Ying et al. 2018b; Hasanzadeh et al. 2019; Qu, Bengio, and Tang 2019; Pei et al. 2020; Munkhdalai and Yu 2017). Empirically, these GNNs have achieved impressive performance in many tasks, such as node and graph classification (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017), recommendation systems (Fan et al. 2019; Ying et al. 2018a) and graph generation (Li et al. 2018; You et al. 2018). However, training GNNs usually requires abundant labeled data, which are often limited and expensive to obtain. Inspired by pre-trained language models (Devlin et al. 2019; Mikolov et al. 2013) and image encoders (Girshick et al. 2014; Donahue et al. 2014; He et al. 2019), recent advances in pre-training GNNs have provided insights into reducing the labeling burden and making use of abundant unlabeled data.
The primary goal of pre-training GNNs (Navarin, Tran, and Sperduti 2018; Hu et al. 2019, 2020) is to learn transferable prior knowledge from mostly unlabeled data, which can be generalized to downstream tasks with a quick fine-tuning step. Essentially, those methods mainly follow a two-step paradigm: (1) pre-training a GNN model on a large collection of unlabeled graph data, which derives generic transferable knowledge encoding intrinsic graph properties; (2) fine-tuning the pre-trained GNN model on task-specific graph data, so as to adapt the generic knowledge to downstream tasks. However, here we argue that there exists a gap between pre-training and fine-tuning due to the divergence of the optimization objectives in the two steps. In particular, the pre-training step optimizes the GNN to find an optimal point over the pre-training graph data, whereas the fine-tuning step aims to optimize the performance on downstream tasks. In other words, the pre-training process completely disregards the need to quickly adapt to downstream tasks with a few fine-tuning updates, leaving a gap between the two steps. It is inevitable that such divergence will significantly hurt the generalization ability of the pre-trained GNN models.

**Challenges and Present Work.** In this work, we propose to alleviate the divergence between pre-training and fine-tuning. However, alleviating this divergence is non-trivial, presenting us with two key challenges. (1) How to narrow the gap caused by different optimization objectives? Existing pre-training strategies for GNNs fall into a two-step paradigm, and the optimization gap between the two steps significantly limits the ability of pre-trained GNNs to generalize to new downstream tasks. Hence, it is vital to re-examine the objective of the pre-training step to better match that of the fine-tuning step. (2) How to simultaneously preserve node- and graph-level information with completely unlabeled graph data? Existing methods either only take into account node-level pre-training (Navarin, Tran, and Sperduti 2018; Hu et al. 2019), or still require supervised information for graph-level pre-training (Hu et al. 2020). While at the node level predicting links between node pairs is naturally self-supervised, graph-level self-supervision has been seldom explored. Thus, it is crucial to devise a self-supervised strategy to pre-train graph-level representations.

To tackle these challenges, we propose L2P-GNN, a pre-training strategy for GNNs that learns to pre-train (L2P) at both node and graph levels in a fully self-supervised manner. More specifically, for the first challenge, L2P-GNN mimics the fine-tuning step within the pre-training step, and thus learns how to fine-tune during the pre-training process itself. As a result, we learn a prior that possesses the ability of quickly adapting to new downstream tasks with only a few fine-tuning updates. The proposed learning to pre-train can be deemed a form of meta-learning (Finn, Abbeel, and Levine 2017), also known as learning to learn. For the second challenge, we propose a self-supervised strategy with a dual adaptation mechanism, which is equipped with both node- and graph-level adaptations.
On one hand, the node-level adaptation takes the connectivity of node pairs as self-supervised information, so as to learn a transferable prior that encodes local graph properties. On the other hand, the graph-level adaptation is designed to preserve the global information in the graph, in which a sub-structure should be close to the whole graph in the representation space.

To summarize, this work makes the following major contributions.

- This is the first attempt to explore learning to pre-train GNNs, which alleviates the divergence between pre-training and fine-tuning objectives, and sheds a new perspective on pre-training GNNs.
- We propose a completely self-supervised GNN pre-training strategy for both node- and graph-level representations.
- We build a new large-scale bibliographic graph dataset for pre-training GNNs, and conduct extensive empirical studies on two datasets in different domains. Experimental results demonstrate that our approach consistently and significantly outperforms the state of the art.

## Related Work

GNNs have received significant attention due to the prevalence of graph-structured data (Bronstein et al. 2017). Originally proposed (Gori, Monfardini, and Scarselli 2005; Scarselli et al. 2008) as a framework for utilizing neural networks to learn node representations on graphs, this concept has been extended to convolutional neural networks using spectral methods (Defferrard, Bresson, and Vandergheynst 2016; Bruna et al. 2014; Levie et al. 2019; Xu et al. 2019a) and message passing architectures that aggregate neighbors' features (Kipf and Welling 2017; Niepert, Ahmed, and Kutzkov 2016; Hamilton, Ying, and Leskovec 2017; Velickovic et al. 2018; Abu-El-Haija et al. 2019). For a more comprehensive understanding of GNNs, we refer readers to the literature (Wu et al. 2020; Battaglia et al. 2018; Zhang, Cui, and Zhu 2020; Zhou et al. 2018).

To enable more effective learning on graphs, researchers have explored how to pre-train GNNs for node-level representations on unlabeled graph data. Navarin et al. (Navarin, Tran, and Sperduti 2018) utilize graph kernels for pre-training, while another work (Hu et al. 2019) pre-trains graph encoders with three unsupervised tasks to capture different aspects of a graph. More recently, Hu et al. (Hu et al. 2020) propose different strategies to pre-train graph neural networks at both node and graph levels, although labeled data are required at the graph level.

Along another line, meta-learning intends to learn a form of general knowledge across similar learning tasks, so that the learned knowledge can be quickly adapted to new tasks (Vilalta and Drissi 2002; Vanschoren 2018; Peng 2020). Among previous works on meta-learning, metric-based methods (Sung et al. 2018; Snell, Swersky, and Zemel 2017) learn a metric or distance function over tasks, while model-based methods (Santoro et al. 2016; Munkhdalai and Yu 2017) aim to design an architecture or training process for rapid generalization across tasks. Finally, some optimization-based methods directly adjust the optimization algorithm to enable quick adaptation with just a few examples (Finn, Abbeel, and Levine 2017; Yao et al. 2019; Lee et al. 2019; Lu, Fang, and Shi 2020).

## Learning to Pre-train: Motivation and Overview

Our key insight is the observation that there exists a divergence between pre-training and fine-tuning. In this section, we conduct an analysis to demonstrate this divergence, and further motivate a paradigm shift to learning to pre-train GNNs.

### Preliminaries
**GNNs.** Let $G = (V, E, \mathbf{X}, \mathbf{Z})$ denote a graph with nodes $V$ and edges $E$, where $\mathbf{X} \in \mathbb{R}^{|V| \times d_v}$ and $\mathbf{Z} \in \mathbb{R}^{|E| \times d_e}$ are node and edge features, respectively. A GNN involves two key computations for each node $v$ at every layer. (1) AGGREGATE operation: aggregating messages from $v$'s neighbors $\mathcal{N}_v$. (2) UPDATE operation: updating $v$'s representation from its representation in the previous layer and the aggregated messages. Formally, the $l$-th layer representation of node $v$ is given by

$$
\mathbf{h}_v^l = \Psi(\psi; \mathbf{A}, \mathbf{X}, \mathbf{Z})^l = \mathrm{UPDATE}\big(\mathbf{h}_v^{l-1}, \mathrm{AGGREGATE}(\{(\mathbf{h}_v^{l-1}, \mathbf{h}_u^{l-1}, \mathbf{z}_{uv}) : u \in \mathcal{N}_v\})\big), \tag{1}
$$

where $\mathbf{z}_{uv}$ is the feature vector of edge $(u, v)$, and $\mathbf{h}_v^0 = \mathbf{x}_v \in \mathbf{X}$ is the input layer of a GNN. $\mathbf{A}$ denotes the adjacency matrix or some normalized variant, and $\mathcal{N}_v$ denotes the neighborhood of node $v$, whose definition depends on the particular GNN variant. We abstract the composition of the two operations as one parameterized function $\Psi(\cdot)$ with parameters $\psi$.

To address graph-level tasks such as graph classification, node representations need to be further aggregated into a graph-level representation. The READOUT operation can usually be performed at the final layer as follows:

$$
\mathbf{h}_G = \Omega(\omega; \mathbf{H}^l) = \mathrm{READOUT}(\{\mathbf{h}_v^l \mid v \in V\}), \tag{2}
$$

where $\mathbf{h}_G$ is the representation of the whole graph $G$, and $\mathbf{H}^l = [\mathbf{h}_v^l]$ is the node representation matrix. READOUT is typically implemented as a simple pooling operation like sum, max or mean-pooling (Atwood and Towsley 2016; Duvenaud et al. 2015) or more complex approaches (Bruna et al. 2014; Ying et al. 2018b). We abstract READOUT as a parameterized function $\Omega(\cdot)$ with parameters $\omega$.

**Conventional GNN Pre-training.** The goal of pre-training GNNs is to learn a generic initialization for model parameters using readily available graph structures (Hu et al. 2020, 2019). Conventional pre-training strategies largely follow a two-step paradigm. (1) Pre-training a GNN model $f_\theta(\mathbf{A}, \mathbf{X}, \mathbf{Z})$ on a large graph-structured dataset (e.g., multiple small graphs or a large-scale graph). The learned parameters $\theta_0$ are expected to capture task-agnostic transferable information. (2) Fine-tuning the pre-trained GNN on downstream tasks. With multiple (say, $n$) gradient descent steps over the training data of the downstream task, the model aims to obtain the optimal parameters $\theta_n$ on the downstream task. Note that, for node-level tasks, the GNN model is $f_\theta = \Psi(\psi; \mathbf{A}, \mathbf{X}, \mathbf{Z})$, i.e., $\theta = \psi$; for graph-level tasks, the GNN model is $f_\theta = \Omega(\omega; \Psi(\psi; \mathbf{A}, \mathbf{X}, \mathbf{Z}))$, i.e., $\theta = \{\psi, \omega\}$.

Let $\mathcal{D}_{pre}$ denote the pre-training graph data, and $\mathcal{L}_{pre}$ be the loss function for pre-training. That is, the objective of pre-training is to optimize the following:

$$
\theta_0 = \arg\min_{\theta} \mathcal{L}_{pre}(f_\theta; \mathcal{D}_{pre}). \tag{3}
$$

On the other hand, the fine-tuning process aims to maximize the performance on the testing graph data $\mathcal{D}_{te}$ of the downstream task, after fine-tuning over the training graph data $\mathcal{D}_{tr}$ of the task. The so-called fine-tuning initializes the model from the pre-trained parameters $\theta_0$, and updates the GNN model $f_\theta$ with multiple gradient descent steps over (usually batched) $\mathcal{D}_{tr}$. Taking one step as an example, we have

$$
\theta_1 = \theta_0 - \eta \nabla_{\theta_0} \mathcal{L}_{fine}(f_{\theta_0}; \mathcal{D}_{tr}), \tag{4}
$$

where $\mathcal{L}_{fine}$ is the loss function of fine-tuning and $\eta$ is the learning rate.

### Learning to Pre-train GNNs

In the conventional two-step paradigm, the pre-training step is decoupled from the fine-tuning step. In particular, $\theta_0$ is pre-trained without accommodating any form of adaptation that is potentially useful for future fine-tuning on downstream tasks. The apparent divergence between the two steps would result in suboptimal pre-training.
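For concreteness, the conventional two-step paradigm of Eqs. (3)-(4) can be sketched in a few lines of illustrative PyTorch; `gnn`, `pretrain_loss`, `finetune_loss` and the data loaders below are hypothetical placeholders, not an actual implementation. Note how the pre-training loop never sees the fine-tuning objective, which is exactly the divergence discussed above.

```python
import torch

def conventional_pretrain_then_finetune(gnn, pretrain_loader, finetune_loader,
                                        pretrain_loss, finetune_loss,
                                        pretrain_lr=1e-3, eta=1e-3):
    # Step 1: pre-training -- Eq. (3): theta_0 = argmin_theta L_pre(f_theta; D_pre).
    opt = torch.optim.Adam(gnn.parameters(), lr=pretrain_lr)
    for batch in pretrain_loader:
        opt.zero_grad()
        loss = pretrain_loss(gnn, batch)   # self-supervised objective on unlabeled graphs
        loss.backward()
        opt.step()                         # theta_0 is the state after the last update

    # Step 2: fine-tuning -- Eq. (4): theta_1 = theta_0 - eta * grad L_fine(f_theta0; D_tr).
    # Step 1 is entirely unaware of this objective; the two steps are decoupled.
    opt = torch.optim.SGD(gnn.parameters(), lr=eta)
    for batch in finetune_loader:
        opt.zero_grad()
        loss = finetune_loss(gnn, batch)   # supervised objective of the downstream task
        loss.backward()
        opt.step()
    return gnn
```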
To narrow the gap between pre-training and fine-tuning, it is important to learn how to pre-train such that the pre-trained model becomes more amenable to adaptation on future downstream tasks. To this end, we propose to structure the pre-training stage to simulate the fine-tuning process on downstream tasks, so as to directly optimize the pre-trained model's quick adaptability to downstream tasks. Specifically, to pre-train a GNN model over a graph $G \in \mathcal{D}_{pre}$, we sample some sub-structures from $G$, denoted $\mathcal{D}^{tr}_{\mathcal{T}_G}$, as the training data of a simulated downstream task $\mathcal{T}_G$; similarly, we mimic the evaluation on testing sub-structures $\mathcal{D}^{te}_{\mathcal{T}_G}$ that are also sampled from $G$. Training and testing data are simulated here since the actual downstream task is unknown during pre-training. This setup is reasonable as our goal is learning how to pre-train a GNN model with the ability of adapting to new tasks quickly, rather than directly learning the actual downstream task. Formally, our pre-training aims to learn a GNN model $f_\theta$, such that after fine-tuning it on the simulated task training data $\mathcal{D}^{tr}_{\mathcal{T}_G}$, the loss on the simulated testing data $\mathcal{D}^{te}_{\mathcal{T}_G}$ is minimized. That is,

$$
\theta_0 = \arg\min_{\theta} \sum_{G \in \mathcal{D}_{pre}} \mathcal{L}_{pre}\big(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{pre}(f_\theta; \mathcal{D}^{tr}_{\mathcal{T}_G})}; \mathcal{D}^{te}_{\mathcal{T}_G}\big), \tag{5}
$$

where $\theta - \alpha \nabla_{\theta} \mathcal{L}_{pre}(f_\theta; \mathcal{D}^{tr}_{\mathcal{T}_G})$ are the parameters fine-tuned on $\mathcal{D}^{tr}_{\mathcal{T}_G}$ (still part of the pre-training data), in a similar manner as the fine-tuning step on the downstream task in Eq. (4). Moreover, $\alpha$ represents the learning rate of the fine-tuning on $\mathcal{D}^{tr}_{\mathcal{T}_G}$, which can be fixed as a hyper-parameter. Thus, the output of our pre-training, $\theta_0$, is not intended to directly optimize the training or testing data of any particular task. Instead, $\theta_0$ is optimal in the sense that it allows for quick adaptation to new tasks in general. Note that here we only show one gradient update, and yet employing multiple updates is a straightforward extension.

**Connection to other works.** Interestingly, our proposed strategy of learning to pre-train GNNs subsumes the conventional GNN pre-training as a special case. In particular, if we set $\alpha = 0$, i.e., there is no fine-tuning on $\mathcal{D}^{tr}_{\mathcal{T}_G}$, our strategy becomes equivalent to conventional pre-training approaches. Furthermore, our strategy is a form of meta-learning, in particular, model-agnostic meta-learning (MAML) (Finn, Abbeel, and Levine 2017). Meta-learning aims to learn prior knowledge from a set of training tasks that can be transferred to testing tasks. Specifically, MAML learns a prior that can be quickly adapted to new tasks by one or a few gradient updates, so that the prior, after being adapted to the so-called support set of each task, can achieve optimal performance on the so-called query set of the task. In our case, the output of our pre-training $\theta_0$ is the prior knowledge that can quickly adapt to new downstream tasks, while $\mathcal{D}^{tr}_{\mathcal{T}_G}$ and $\mathcal{D}^{te}_{\mathcal{T}_G}$ correspond to the support and query sets in MAML, respectively.
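As a minimal sketch, the objective in Eq. (5) can be optimized with a MAML-style inner/outer loop. The snippet below assumes PyTorch 2.x and its `torch.func.functional_call`; `model` is assumed to return the self-supervised pre-training loss for a batch of sub-structures, and `tasks` yields (support, query) pairs sampled from each pre-training graph. These names are illustrative placeholders; the actual L2P-GNN procedure with its dual adaptation is detailed in the next section.

```python
import torch
from torch.func import functional_call

def learn_to_pretrain(model, tasks, alpha=0.01, gamma=1e-3, epochs=1):
    outer_opt = torch.optim.Adam(model.parameters(), lr=gamma)
    for _ in range(epochs):
        for support, query in tasks:
            params = dict(model.named_parameters())
            # Inner step: simulate fine-tuning on the support set, theta' = theta - alpha * grad.
            support_loss = functional_call(model, params, (support,))
            grads = torch.autograd.grad(support_loss, list(params.values()),
                                        create_graph=True)
            adapted = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}
            # Outer step: evaluate the adapted parameters on the query set and
            # back-propagate through the inner step to update the prior theta_0 (Eq. 5).
            query_loss = functional_call(model, adapted, (query,))
            outer_opt.zero_grad()
            query_loss.backward()
            outer_opt.step()
    return model
```

Setting `alpha = 0` collapses the inner step, recovering the conventional pre-training objective, as noted above.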
## Proposed Method

In the following, we introduce our approach L2P-GNN. We first present a self-supervised base GNN model for learning graph structures in the MAML setting, followed by our dual node- and graph-level adaptations designed to simulate fine-tuning during the pre-training process.

### Self-supervised Base Model

At the core of L2P-GNN is the notion of learning to pre-train a GNN to bridge the gap between the pre-training and fine-tuning processes. Specifically, our approach can be formulated as a form of MAML. To this end, we define a task as capturing the structures and attributes in a graph, from both local and global perspectives. The meta-learned prior can then be adapted to a new task or graph.

*Figure 1: Illustration of L2P-GNN. (a/b) Task construction for a graph, where the graph $G$ is associated with a parent task $\mathcal{T}_G$ consisting of $k$ child tasks $\{\mathcal{T}^1_G, \dots, \mathcal{T}^k_G\}$. (c) Dual node- and graph-level adaptations on the support set, and the optimization of the transferable prior $\theta$ on the query set.*

**Task Construction.** Consider a set of graphs as the pre-training data, $\mathcal{D}_{pre} = \{G_1, G_2, \dots, G_N\}$. A task $\mathcal{T}_G = (\mathcal{S}_G, \mathcal{Q}_G)$ involves a graph $G$, consisting of a support set $\mathcal{S}_G$ and a query set $\mathcal{Q}_G$. We learn the prior such that, after updating by gradient descent w.r.t. the loss on the support set, it optimizes the performance on the query set, which simulates the training and testing in the fine-tuning step. As illustrated in Figs. 1(a) and (b), to promote both global and local perspectives of a graph, its corresponding task $\mathcal{T}_G$ is designed to contain $k$ child tasks, i.e., $\mathcal{T}_G = (\mathcal{T}^1_G, \mathcal{T}^2_G, \dots, \mathcal{T}^k_G)$. Each child task $\mathcal{T}^c_G$ attempts to capture a local aspect of $G$, which is defined as

$$
\mathcal{T}^c_G = \big(\mathcal{S}^c_G = \{(u, v) \sim p_E\},\ \mathcal{Q}^c_G = \{(p, q) \sim p_E\}\big), \quad \text{s.t.}\ \mathcal{S}^c_G \cap \mathcal{Q}^c_G = \emptyset, \tag{6}
$$

where the support set $\mathcal{S}^c_G$ and query set $\mathcal{Q}^c_G$ contain edges randomly sampled from the edge distribution $p_E$ of the graph, and they are mutually exclusive. In essence, child tasks incorporate the local connectivity between node pairs in a graph, and they fuse into a parent task $\mathcal{T}_G = (\mathcal{S}_G, \mathcal{Q}_G)$ to facilitate a graph-level view, where $\mathcal{S}_G = (\mathcal{S}^1_G, \mathcal{S}^2_G, \dots, \mathcal{S}^k_G)$ and $\mathcal{Q}_G = (\mathcal{Q}^1_G, \mathcal{Q}^2_G, \dots, \mathcal{Q}^k_G)$.

**Base GNN Model.** Given the parent and child tasks, we design a self-supervised base GNN model with node-level aggregation and graph-level pooling to learn node and graph representations, respectively. The key idea is to utilize the intrinsic structures of label-free graph data as self-supervision, at both node and graph levels. Specifically, the base model $f_\theta$ involves node-level aggregation $\Psi(\psi; \mathbf{A}, \mathbf{X}, \mathbf{Z})$ that aggregates node information (e.g., its local structures and attributes) to generate node representations, and graph-level pooling $\Omega(\omega; \mathbf{H})$ that further generates a graph-level representation given the node representation matrix $\mathbf{H}$.

In node-level aggregation, the node embeddings are aggregated from their neighborhoods, as defined in Eq. (1). That is, for each node $v \in G$, we obtain its representation $\mathbf{h}_v$ after $l$ iterations of $\Psi(\cdot)$. Subsequently, given a support edge $(u, v)$ in a child task $\mathcal{T}^c_G$, we optimize the self-supervised objective of predicting the link between $u$ and $v$ (Tang et al. 2015; Hamilton, Ying, and Leskovec 2017), as follows:

$$
\mathcal{L}_{node}(\psi; \mathcal{S}^c_G) = \sum_{(u,v) \in \mathcal{S}^c_G} \big[ -\ln \sigma(\mathbf{h}_u^{\top} \mathbf{h}_v) - \ln \sigma(-\mathbf{h}_u^{\top} \mathbf{h}_{v'}) \big], \tag{7}
$$

where $v'$ is a negative node sample that is not linked with node $u$ in the graph, $\sigma$ is the sigmoid function, and $\psi$ denotes the learnable parameters of $\Psi(\cdot)$. The node-level aggregation encourages linked nodes in the support set of the child tasks to have similar representations.
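A minimal sketch of Eq. (7) in PyTorch is given below, assuming node embeddings `h` produced by the GNN and support edges given as index pairs; drawing negatives uniformly at random is an illustrative choice here, not necessarily the authors' sampling scheme.

```python
import torch
import torch.nn.functional as F

def node_level_loss(h, support_edges):
    """h: [num_nodes, d] node embeddings; support_edges: LongTensor [num_edges, 2] of (u, v)."""
    u, v = support_edges[:, 0], support_edges[:, 1]
    # Uniform negative sampling (may occasionally hit a true neighbor; acceptable for a sketch).
    v_neg = torch.randint(0, h.size(0), (support_edges.size(0),), device=h.device)
    pos_score = (h[u] * h[v]).sum(dim=-1)      # h_u^T h_v for linked pairs
    neg_score = (h[u] * h[v_neg]).sum(dim=-1)  # h_u^T h_{v'} for sampled non-links
    # -ln sigma(pos) - ln sigma(-neg), summed over the support set, as in Eq. (7);
    # softplus(-x) equals -ln sigma(x), which is numerically stabler.
    return (F.softplus(-pos_score) + F.softplus(neg_score)).sum()
```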
In graph-level pooling, the graph representation $\mathbf{h}_G$ is gathered from node representations with the pooling function defined in Eq. (2). As the child tasks capture various local sub-structures of the graph, we also perform pooling on the support nodes of each child task $\mathcal{T}^c_G$ to generate the pooled representation $\mathbf{h}_{\mathcal{S}^c_G} = \Omega(\omega; \{\mathbf{h}_u \mid \forall u, v : (u, v) \in \mathcal{S}^c_G\})$. Given a parent task $\mathcal{T}_G = (\mathcal{S}_G, \mathcal{Q}_G)$, which is a fusion of all child tasks, we define the following self-supervised graph-level objective:

$$
\mathcal{L}_{graph}(\omega; \mathcal{S}_G) = -\log \sigma(\mathbf{h}_{\mathcal{S}^c_G}^{\top} \mathbf{h}_G) - \log \sigma(-\mathbf{h}_{\mathcal{S}^c_G}^{\top} \tilde{\mathbf{h}}_G), \tag{8}
$$

where $\omega$ denotes the learnable parameters of $\Omega(\cdot)$, and $\tilde{\mathbf{h}}_G$ denotes the shifted graph representation obtained by randomly shifting some dimensions of $\mathbf{h}_G$ (Velickovic et al. 2019), serving as the negative sample. Altogether, to capture both node- and graph-level information, we minimize the following loss for a graph $G$:

$$
\mathcal{L}_{\mathcal{T}_G}(\theta; \mathcal{S}_G) = \mathcal{L}_{graph}(\omega; \mathcal{S}_G) + \frac{1}{k} \sum_{c=1}^{k} \mathcal{L}_{node}(\psi; \mathcal{S}^c_G), \tag{9}
$$

where $\theta = \{\psi, \omega\}$ is the set of learnable parameters of our self-supervised base GNN model.

### Dual Adaptation

As motivated, to bridge the gap between the pre-training and fine-tuning processes, it is crucial to optimize the model's ability of quickly adapting to new tasks during pre-training itself. To this end, we propose learning to pre-train the base GNN model: we aim to learn transferable prior knowledge (i.e., $\theta = \{\psi, \omega\}$) to provide an adaptable initialization that can be quickly fine-tuned for new downstream tasks with new graph data. In particular, the learned initialization should not only encode and adapt to the local connectivity between node pairs, but also be capable of generalizing to different sub-structures of the graphs. Correspondingly, we devise the dual node- and graph-level adaptations, as illustrated in Fig. 1(c).

**Node-level Adaptation.** To simulate the procedure of fine-tuning on training data, we calculate the loss on the support set $\mathcal{S}^c_G$ in each child task $\mathcal{T}^c_G$ as shown in Eq. (7). Then, we adapt the node-level aggregation prior $\psi$ w.r.t. the support loss with one or a few gradient descent steps, to obtain the adapted prior $\psi'$ for the child tasks. For instance, when using one gradient update with a node-level learning rate $\alpha$, we have

$$
\psi' = \psi - \alpha \sum_{c=1}^{k} \frac{\partial \mathcal{L}_{node}(\psi; \mathcal{S}^c_G)}{\partial \psi}. \tag{10}
$$

**Graph-level Adaptation.** To encode how to pool node information for representing a graph, we adapt the graph-level pooling prior $\omega$ to a parent task $\mathcal{T}_G$ with one (or a few) gradient descent steps. Given $\beta$ as the graph-level learning rate, the adapted pooling prior is given by

$$
\omega' = \omega - \beta \frac{\partial \mathcal{L}_{graph}(\omega; \mathcal{S}_G)}{\partial \omega}. \tag{11}
$$

**Optimization of Transferable Prior.** With the node- and graph-level adaptations, we have adapted the prior $\theta$ to $\theta' = \{\psi', \omega'\}$, which is specific to the task $\mathcal{T}_G$. To mimic the testing process with the fine-tuned model, the base GNN model is trained by optimizing the performance of the adapted parameters $\theta'$ on the query set $\mathcal{Q}_G$ over all training tasks or graphs in $\mathcal{D}_{pre}$. That is, the transferable prior $\theta = \{\psi, \omega\}$ will be optimized through the backpropagation of the query loss given by

$$
\sum_{G \in \mathcal{D}_{pre}} \mathcal{L}_{\mathcal{T}_G}(\theta'; \mathcal{Q}_G). \tag{12}
$$

In other words, we can update the prior $\theta$ as follows:

$$
\theta \leftarrow \theta - \gamma \nabla_{\theta} \sum_{G \in \mathcal{D}_{pre}} \mathcal{L}_{\mathcal{T}_G}(\theta'; \mathcal{Q}_G), \tag{13}
$$

where $\gamma$ is the learning rate of the prior. The detailed training procedure is provided in the Appendix.

| Dataset | Biology | PreDBLP |
| --- | --- | --- |
| #subgraphs | 394,925 | 1,054,309 |
| #labels | 40 | 6 |
| #subgraphs for pre-training | 306,925 | 794,862 |
| #subgraphs for fine-tuning | 88,000 | 299,447 |

*Table 1: Statistics of the two datasets.*
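Putting the pieces together, a minimal sketch of the dual adaptation and prior update in Eqs. (10)-(13), for a single parent task and a single gradient step, might look as follows. It assumes PyTorch 2.x; `model(batch)` is assumed to return the pair of node- and graph-level losses of Eqs. (7)-(8), and parameter names starting with `psi` / `omega` are assumed to belong to the aggregation and pooling components. All of this is an illustrative reading of the method, not the released code; the outer sum over graphs in Eqs. (12)-(13) would simply wrap this step in a loop over tasks.

```python
import torch
from torch.func import functional_call

def l2p_outer_step(model, support, query, alpha, beta, outer_opt):
    params = dict(model.named_parameters())
    # Support losses; the sum over child tasks in Eq. (10) and the 1/k averaging in
    # Eq. (9) are assumed to be folded into the losses returned by the model.
    node_loss, graph_loss = functional_call(model, params, (support,))

    # Eq. (10): adapt the node-level prior psi with learning rate alpha.
    psi = {n: p for n, p in params.items() if n.startswith("psi")}
    g_psi = torch.autograd.grad(node_loss, list(psi.values()),
                                create_graph=True, retain_graph=True)
    psi_prime = {n: p - alpha * g for (n, p), g in zip(psi.items(), g_psi)}

    # Eq. (11): adapt the graph-level pooling prior omega with learning rate beta.
    omega = {n: p for n, p in params.items() if n.startswith("omega")}
    g_omega = torch.autograd.grad(graph_loss, list(omega.values()), create_graph=True)
    omega_prime = {n: p - beta * g for (n, p), g in zip(omega.items(), g_omega)}

    # Eqs. (12)-(13): evaluate the adapted prior theta' = {psi', omega'} on the query set
    # and back-propagate through the adaptation to update the transferable prior theta.
    adapted = {**params, **psi_prime, **omega_prime}
    q_node_loss, q_graph_loss = functional_call(model, adapted, (query,))
    outer_opt.zero_grad()
    (q_node_loss + q_graph_loss).backward()   # L_{T_G}(theta'; Q_G) as in Eq. (9)
    outer_opt.step()
```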
## Discussion

Here we give an analysis of the proposed L2P-GNN with respect to model generality and efficiency. Firstly, the proposed L2P-GNN is general and can be easily applied to different graph neural networks. The analysis in the motivation section demonstrates the divergence between pre-training and fine-tuning, which is widely recognized in the literature (Lv et al. 2020; Gururangan et al. 2020), whether on graph data, in natural language processing, or in computer vision. L2P-GNN directly optimizes the pre-trained model's quick adaptability to downstream tasks by simulating the fine-tuning process on downstream tasks, making it agnostic to the architecture of the underlying graph neural network.

Secondly, our L2P-GNN is efficient and can be parallelized for large-scale datasets. In L2P-GNN, for task construction, the time complexity is linear w.r.t. the number of edges, as each task contains edges sampled from the graph. For dual adaptation, the time complexity depends on the architecture of the GNN, and is at most $k$ (i.e., the number of child tasks) times the complexity of the corresponding GNN. As the number of child tasks $k$ is usually small, L2P-GNN is as efficient as other pre-training strategies for GNNs. Besides, with on-the-fly transformation of data (e.g., task construction), there is almost no memory overhead for our L2P-GNN. Detailed pseudocode of the algorithm is in the supplemental material (Appendix).

## Experiments

In this section, we present a new graph dataset for pre-training, and compare the performance of our approach and various state-of-the-art pre-training baselines. Lastly, we conduct a thorough model analysis to support the motivation and design of our pre-training strategy.

### Experimental Settings

**Datasets.** We conduct experiments on data from two domains: biological function prediction in biology (Hu et al. 2020) and research field prediction in bibliography. The biology graphs come from a public repository (http://snap.stanford.edu/gnn-pretrain), covering 394,925 protein subgraphs (Zitnik et al. 2019). We further present a new collection of bibliographic graphs called PreDBLP, purposely compiled for pre-training GNNs based on DBLP (https://dblp.uni-trier.de), which contains 1,054,309 paper subgraphs in 31 fields (e.g., artificial intelligence, data mining). Each subgraph is centered at a paper and contains the associated information of the paper. The new bibliographic dataset is publicly released, while more detailed descriptions of the construction and processing of the datasets are included in the Appendix.

| Model | Biology: GCN | Biology: GraphSAGE | Biology: GAT | Biology: GIN | PreDBLP: GCN | PreDBLP: GraphSAGE | PreDBLP: GAT | PreDBLP: GIN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No pre-train | 63.22 ± 1.06 | 65.72 ± 1.23 | 68.21 ± 1.26 | 64.82 ± 1.21 | 62.18 ± 0.43 | 61.03 ± 0.65 | 59.63 ± 2.32 | 69.01 ± 0.23 |
| EdgePred | 64.72 ± 1.06 | 67.39 ± 1.54 | 67.37 ± 1.31 | 65.93 ± 1.65 | 65.44 ± 0.42 | 63.60 ± 0.21 | 55.56 ± 1.67 | 69.43 ± 0.07 |
| DGI | 64.33 ± 1.14 | 66.69 ± 0.88 | 68.37 ± 0.54 | 65.16 ± 1.24 | 65.57 ± 0.36 | 63.34 ± 0.73 | 61.30 ± 2.17 | 69.34 ± 0.09 |
| ContextPred | 64.56 ± 1.36 | 66.31 ± 0.94 | 66.89 ± 1.98 | 65.99 ± 1.22 | 66.11 ± 0.16 | 62.55 ± 0.11 | 58.44 ± 1.18 | 69.37 ± 0.21 |
| AttrMasking | 64.35 ± 1.23 | 64.32 ± 0.78 | 67.72 ± 1.16 | 65.72 ± 1.31 | 65.49 ± 0.52 | 62.35 ± 0.58 | 53.34 ± 4.77 | 68.61 ± 0.16 |
| L2P-GNN | 66.48 ± 1.59 | 69.89 ± 1.63 | 69.15 ± 1.86 | 70.13 ± 0.95 | 66.58 ± 0.28 | 65.84 ± 0.37 | 62.24 ± 1.89 | 70.79 ± 0.17 |
| (Improv.) | (5.16%) | (6.35%) | (1.38%) | (8.19%) | (7.08%) | (7.88%) | (4.38%) | (2.58%) |

*Table 2: Experimental results (mean ± std in percent) of different pre-training strategies w.r.t. various GNN architectures. The improvements are relative to the respective GNN without pre-training.*

For the biology data, as in (Hu et al. 2020), we use 306,925 unlabeled protein ego-networks for pre-training.
In fine-tuning, we predict 40 fine-grained biological functions with 88,000 labeled subgraphs, which correspond to 40 binary classification tasks. We split the downstream data with the species split (Hu et al. 2020), and evaluate the test performance with the average ROC-AUC (Bradley 1997) across the 40 tasks. For PreDBLP, we utilize 794,862 subgraphs to pre-train a GNN model. In fine-tuning, we predict the research field with 299,447 labeled subgraphs from 6 different categories. We randomly split the downstream data and evaluate test performance with the micro-averaged F1 score. For both domains, we split the downstream data with an 8:1:1 ratio for train/validation/test sets. All downstream experiments are repeated with 10 random seeds, and we report the mean with standard deviation. The detailed statistics of the two datasets are summarized in Table 1.

**Baselines.** To contextualize the empirical results of L2P-GNN on the pre-training benchmarks, we compare against four self-supervised or unsupervised baselines: (1) the original edge prediction strategy (denoted by EdgePred) (Hamilton, Ying, and Leskovec 2017), which predicts the connectivity of node pairs; (2) Deep Graph Infomax (denoted by DGI) (Velickovic et al. 2019), which maximizes local mutual information across the graph's patch representations; (3) the context prediction strategy (denoted by ContextPred) (Hu et al. 2020), which exploits graph structures; and (4) the attribute masking strategy (denoted by AttrMasking) (Hu et al. 2020), which learns the regularities of the node and edge attributes distributed over graphs. Further details are provided in the Appendix.

**GNN Architectures and Parameter Settings.** All pre-training baselines and our L2P-GNN can be implemented for different GNN architectures. We experiment with four popular GNN architectures, namely, GCN (Kipf and Welling 2017), GraphSAGE (Hamilton, Ying, and Leskovec 2017), GAT (Velickovic et al. 2018) and GIN (Xu et al. 2019b). Implementation details are presented in the Appendix. We tune hyper-parameters w.r.t. the model performance on validation sets. The hyper-parameter settings and experimental environment are discussed in the Appendix.

### Performance Comparison

Table 2 compares the performance of L2P-GNN and state-of-the-art pre-training baselines w.r.t. four different GNN architectures. We make the following observations. (1) Overall, the proposed L2P-GNN consistently yields the best performance among all methods across architectures. Compared to the best baseline on each architecture, L2P-GNN achieves up to 6.27% and 3.52% improvements on the two datasets, respectively. We believe that such significant improvements can be attributed to the simulation of fine-tuning during the pre-training process, thereby narrowing the gap between pre-training and fine-tuning objectives. (2) Furthermore, pre-training GNNs with abundant unlabeled data is clearly helpful to downstream tasks, as our L2P-GNN brings up to 8.19% and 7.88% gains relative to non-pre-trained models on the two datasets, respectively. (3) We also notice that some baselines give surprisingly limited performance gains and yield negative transfer (Rosenstein et al. 2005) on the downstream task (i.e., the EdgePred and AttrMasking strategies w.r.t. the GAT model). The reason might be that these strategies learn information irrelevant to the downstream tasks, which harms the generalization of the pre-trained GNNs. This finding confirms previous observations (Hu et al. 2020; Rosenstein et al. 2005) that negative transfer limits the applicability and reliability of pre-trained models.
### Model Analysis

Next, we investigate the underlying mechanism of L2P-GNN: its capability to narrow the gap between pre-training and fine-tuning by learning to pre-train GNNs, the impact of the node- and graph-level adaptations on L2P-GNN's performance, and a parameter analysis. Since similar trends are observed for different GNN architectures, here we only report the results w.r.t. the GIN model.

**Comparative Analysis.** We attempt to validate whether L2P-GNN narrows the gap between pre-training and fine-tuning by learning to pre-train GNNs. Towards this end, we conduct a comparative analysis of the pre-trained GNN model before and after fine-tuning on downstream tasks (named Model-P and Model-F), and consider three perspectives for comparison: the Centered Kernel Alignment (CKA) similarity (Kornblith et al. 2019) between the parameters of Model-P and Model-F, the change in training loss (delta loss), and the change in testing performance on downstream tasks (delta ROC-AUC or Micro-F1).

*Figure 2: CKA similarity of GIN layers and changes of loss and performance on the two datasets. (a) Biology dataset; (b) PreDBLP dataset.*

As presented in Fig. 2, we observe that the CKA similarities of our L2P-GNN parameters before and after fine-tuning are generally smaller than those of the baselines, indicating that L2P-GNN undergoes larger changes so as to become more adapted to downstream tasks. Besides, the smaller changes in training loss show that L2P-GNN can easily reach the optimal point of the new tasks through rapid adaptation. This further implies that the objectives of our pre-training and downstream tasks are more aligned, resulting in a quick adaptation in the right optimization direction for downstream tasks and a much more significant improvement in testing performance. Thus, L2P-GNN indeed narrows the gap by learning how to make adaptations during the pre-training process.
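For reference, a minimal sketch of the linear CKA measure used in the comparative analysis above (Kornblith et al. 2019) is shown below; treating the two models' flattened layer parameters (or representations) as the inputs `X` and `Y` is our illustrative reading of the setup, not the authors' exact script.

```python
import torch

def linear_cka(X, Y):
    """X, Y: tensors of shape [n_samples, n_features] from Model-P and Model-F."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature column
    Y = Y - Y.mean(dim=0, keepdim=True)
    dot = torch.linalg.norm(Y.t() @ X) ** 2                       # ||Y^T X||_F^2
    norm = torch.linalg.norm(X.t() @ X) * torch.linalg.norm(Y.t() @ Y)
    return dot / norm                      # 1 means identical up to rotation/scale
```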
**Ablation Study.** As the node- and graph-level adaptations play pivotal roles in L2P-GNN, we compare two ablated variants, namely L2P-GNN-Node (with only node-level adaptation) and L2P-GNN-Graph (with only graph-level adaptation). As reported in Fig. 3(a), L2P-GNN is superior to both variants on the two datasets. The results demonstrate that both the local node-level structures and the global graph-level information are useful, and it is beneficial to model them jointly.

*Figure 3: Model analysis w.r.t. the GIN model. (a) Ablation study; (b) node- and graph-level adaptation steps $(s, t)$; (c) dimension analysis.*

**Parameter Analysis.** Lastly, we investigate the effect of the number of node- and graph-level adaptation steps $(s, t)$, as well as the dimension of node representations. We plot the performance of L2P-GNN under combinations of $0 \le s \le 3$ and $0 \le t \le 3$ in Fig. 3(b). We find that L2P-GNN is robust to different values of $s$ and $t$, except when one or both of them are zero (i.e., no adaptation at all). In particular, L2P-GNN can adapt quickly with only one gradient update in both adaptations (i.e., $s = t = 1$). Finally, we summarize the impact of the dimension in Fig. 3(c). We observe that L2P-GNN achieves the optimal performance when the dimension is 300 and is generally stable around the optimal setting, indicating that L2P-GNN is robust w.r.t. the representation dimension.

## Conclusion

In this paper, we introduce L2P-GNN, a self-supervised pre-training strategy for GNNs. We find that with conventional pre-training strategies, there exists a divergence between the pre-training and fine-tuning objectives, resulting in suboptimal pre-trained GNN models. To narrow the gap by learning how to pre-train GNNs, L2P-GNN structures the pre-training step to simulate the fine-tuning process on downstream tasks, so as to directly optimize the pre-trained model's quick adaptability to downstream tasks. At both node and graph levels, L2P-GNN is equipped with dual adaptations that utilize the intrinsic structures of label-free graph data as self-supervision to learn local and global representations simultaneously. Extensive experiments demonstrate that L2P-GNN significantly outperforms the state of the art and effectively narrows the gap between pre-training and fine-tuning.

## Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (No. U20B2045, 61772082, 61702296, 62002029), and the Tencent WeChat Rhino-Bird Focused Research Program.

## References

Abu-El-Haija, S.; Perozzi, B.; Kapoor, A.; Alipourfard, N.; Lerman, K.; Harutyunyan, H.; Steeg, G. V.; and Galstyan, A. 2019. MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing. In Proceedings of ICML, 21-29.

Atwood, J.; and Towsley, D. 2016. Diffusion-Convolutional Neural Networks. In Proceedings of NeurIPS, 1993-2001.

Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V. F.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; Gülçehre, Ç.; Song, H. F.; Ballard, A. J.; Gilmer, J.; Dahl, G. E.; Vaswani, A.; Allen, K. R.; Nash, C.; Langston, V.; Dyer, C.; Heess, N.; Wierstra, D.; Kohli, P.; Botvinick, M.; Vinyals, O.; Li, Y.; and Pascanu, R. 2018. Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261.

Bradley, A. P. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7): 1145-1159.

Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; and Vandergheynst, P. 2017. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 34(4): 18-42.

Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral Networks and Locally Connected Networks on Graphs. In Proceedings of ICLR.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Proceedings of NeurIPS, 3837-3845.

Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, 4171-4186.

Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proceedings of ICML, 647-655.

Duvenaud, D.; Maclaurin, D.; Aguilera-Iparraguirre, J.; Gómez-Bombarelli, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Proceedings of NeurIPS, 2224-2232.

Dwivedi, V. P.; Joshi, C. K.; Laurent, T.; Bengio, Y.; and Bresson, X. 2020. Benchmarking Graph Neural Networks. arXiv preprint arXiv:2003.00982.
Fan, W.; Ma, Y.; Li, Q.; He, Y.; Zhao, Y. E.; Tang, J.; and Yin, D. 2019. Graph Neural Networks for Social Recommendation. In Proceedings of WWW, 417-426.

Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of ICML, 1126-1135.

Girshick, R. B.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of CVPR, 580-587.

Gori, M.; Monfardini, G.; and Scarselli, F. 2005. A new model for learning in graph domains. In Proceedings of IJCNN, volume 2, 729-734.

Gururangan, S.; Marasovic, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; and Smith, N. A. 2020. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of ACL, 8342-8360.

Hamilton, W. L.; Ying, Z.; and Leskovec, J. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of NeurIPS, 1025-1035.

Hasanzadeh, A.; Hajiramezanali, E.; Narayanan, K. R.; Duffield, N.; Zhou, M.; and Qian, X. 2019. Semi-Implicit Graph Variational Auto-Encoders. In Proceedings of NeurIPS, 10711-10722.

He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. B. 2019. Momentum Contrast for Unsupervised Visual Representation Learning. CoRR abs/1911.05722.

Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V. S.; and Leskovec, J. 2020. Strategies for Pre-training Graph Neural Networks. In Proceedings of ICLR.

Hu, Z.; Fan, C.; Chen, T.; Chang, K.; and Sun, Y. 2019. Pre-Training Graph Neural Networks for Generic Structural Feature Extraction. CoRR abs/1905.13728.

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of ICLR.

Kornblith, S.; Norouzi, M.; Lee, H.; and Hinton, G. E. 2019. Similarity of Neural Network Representations Revisited. In Proceedings of ICML, 3519-3529.

Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-learning with differentiable convex optimization. In Proceedings of CVPR, 10657-10665.

Levie, R.; Monti, F.; Bresson, X.; and Bronstein, M. M. 2019. CayleyNets: Graph Convolutional Neural Networks With Complex Rational Spectral Filters. IEEE Trans. Signal Process. 67(1): 97-109.

Li, Y.; Vinyals, O.; Dyer, C.; Pascanu, R.; and Battaglia, P. W. 2018. Learning Deep Generative Models of Graphs. CoRR abs/1803.03324.

Lu, Y.; Fang, Y.; and Shi, C. 2020. Meta-learning on Heterogeneous Information Networks for Cold-start Recommendation. In Proceedings of KDD, 1563-1573.

Lv, S.; Wang, Y.; Guo, D.; Tang, D.; Duan, N.; Zhu, F.; Gong, M.; Shou, L.; Ma, R.; Jiang, D.; et al. 2020. Pre-training Text Representations as Meta Learning. arXiv preprint arXiv:2004.05568.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 3111-3119.

Munkhdalai, T.; and Yu, H. 2017. Meta networks. In Proceedings of ICML, 2554-2563.

Navarin, N.; Tran, D. V.; and Sperduti, A. 2018. Pre-training Graph Neural Networks with Kernels. CoRR abs/1811.06930.

Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning Convolutional Neural Networks for Graphs. In Proceedings of ICML, 2014-2023.
Pei, H.; Wei, B.; Chang, K. C.; Lei, Y.; and Yang, B. 2020. Geom-GCN: Geometric Graph Convolutional Networks. In Proceedings of ICLR.

Peng, H. 2020. A Comprehensive Overview and Survey of Recent Advances in Meta-Learning. CoRR abs/2004.11149.

Qu, M.; Bengio, Y.; and Tang, J. 2019. GMNN: Graph Markov Neural Networks. In Proceedings of ICML, 5241-5250.

Rosenstein, M. T.; Marx, Z.; Kaelbling, L. P.; and Dietterich, T. G. 2005. To transfer or not to transfer. In Proceedings of NeurIPS, 1-4.

Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of ICML, 1842-1850.

Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20(1): 61-80.

Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Proceedings of NeurIPS, 4077-4087.

Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of CVPR, 1199-1208.

Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale Information Network Embedding. In Proceedings of WWW, 1067-1077.

Vanschoren, J. 2018. Meta-learning: A survey. arXiv preprint arXiv:1810.03548.

Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In Proceedings of ICLR.

Velickovic, P.; Fedus, W.; Hamilton, W. L.; Liò, P.; Bengio, Y.; and Hjelm, R. D. 2019. Deep Graph Infomax. In Proceedings of ICLR.

Vilalta, R.; and Drissi, Y. 2002. A perspective view and survey of meta-learning. Artificial Intelligence Review 18(2): 77-95.

Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; and Yu, P. S. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.

Xu, B.; Shen, H.; Cao, Q.; Qiu, Y.; and Cheng, X. 2019a. Graph Wavelet Neural Network. In Proceedings of ICLR.

Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019b. How Powerful are Graph Neural Networks? In Proceedings of ICLR.

Yao, H.; Wei, Y.; Huang, J.; and Li, Z. 2019. Hierarchically Structured Meta-learning. In Proceedings of ICML, 7045-7054.

Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W. L.; and Leskovec, J. 2018a. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of SIGKDD, 974-983.

Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018b. Hierarchical Graph Representation Learning with Differentiable Pooling. In Proceedings of NeurIPS, 4805-4815.

You, J.; Ying, R.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018. GraphRNN: A Deep Generative Model for Graphs. CoRR abs/1802.08773.

Zhang, Z.; Cui, P.; and Zhu, W. 2020. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering.

Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; and Sun, M. 2018. Graph Neural Networks: A Review of Methods and Applications. CoRR abs/1812.08434.

Zitnik, M.; Sosič, R.; Feldman, M. W.; and Leskovec, J. 2019. Evolution of resilience in protein interactomes across the tree of life. Proceedings of the National Academy of Sciences of the United States of America.