# Graph Meta Learning via Local Subgraphs

Kexin Huang, Harvard University, kexinhuang@hsph.harvard.edu
Marinka Zitnik, Harvard University, marinka@hms.harvard.edu

## Abstract

Prevailing methods for graphs require abundant label and edge information for learning. When data for a new task are scarce, meta learning can learn from prior experiences and form much-needed inductive biases for fast adaptation to new tasks. Here, we introduce G-META, a novel meta-learning algorithm for graphs. G-META uses local subgraphs to transfer subgraph-specific information and learn transferable knowledge faster via meta gradients. G-META learns how to quickly adapt to a new task using only a handful of nodes or edges in the new task and does so by learning from data points in other graphs or related, albeit disjoint, label sets. G-META is theoretically justified as we show that the evidence for a prediction can be found in the local subgraph surrounding the target node or edge. Experiments on seven datasets and nine baseline methods show that G-META outperforms existing methods by up to 16.3%. Unlike previous methods, G-META successfully learns in challenging, few-shot learning settings that require generalization to completely new graphs and never-before-seen labels. Finally, G-META scales to large graphs, which we demonstrate on a new Tree-of-Life dataset comprising 1,840 graphs, a two-orders-of-magnitude increase in the number of graphs used in prior work.

## 1 Introduction

Graph Neural Networks (GNNs) have achieved remarkable results in domains such as recommender systems [56], molecular biology [65, 19], and knowledge graphs [49, 18]. Performance is typically evaluated after extensive training on datasets where the majority of labels are available [52, 54]. In contrast, many problems require rapid learning from only a few labeled nodes or edges in the graph. Such flexible adaptation, known as meta learning, has been extensively studied for images and language, e.g., [39, 51, 27]. However, meta learning on graphs has received considerably less research attention and has remained a problem beyond the reach of prevailing GNN models.

Meta learning on graphs generally refers to a scenario in which a model learns at two levels. At the first level, rapid learning occurs within a task, for example, when a GNN learns to classify nodes in a particular graph accurately. At the second level, this learning is guided by knowledge accumulated gradually across tasks to capture how the task structure changes across target domains [35, 4, 34]. A powerful GNN trained to meta-learn can quickly learn never-before-seen labels and relations using only a handful of labeled data points. As an example, a key problem in biology is to translate insights from non-human organisms (such as yeast, zebrafish, and mouse) to humans [66]. How can we train a GNN to effectively meta-learn on a large number of incomplete and scarcely labeled protein-protein interaction (PPI) graphs from various organisms, transfer the accrued knowledge to humans, and use it to predict the roles of protein nodes in the human PPI graph? While this is a hard task, it is only an instance of a particular graph meta-learning problem (Figure 1C). In this problem, the GNN needs to learn on a large number of graphs, each scarcely labeled with a unique label set. Then, it needs to quickly adapt to a never-before-seen graph (e.g., a PPI graph from a new organism) and never-before-seen labels (e.g., newly discovered roles of proteins).
Current methods are specialized techniques specifically designed for a particular problem and a particular task [63, 3, 24, 7, 53].

Figure 1: Graph meta-learning problems. A. Meta-learner classifies an unseen label set by observing other label sets in the same graph. B. Meta-learner learns an unseen graph by learning from other graphs with the same label set. C. Meta-learner classifies an unseen label set by learning from other label sets across multiple graphs. Unlike existing methods, G-META solves all three problems and also works for link prediction (see Section 6.1).

While these methods provide a promising approach to meta learning in GNNs, their specific strategies do not scale well nor extend to other problems (Figure 1).

**Present work.** We introduce G-META, an approach for meta learning on graphs (Figure 1); code and datasets are available at https://github.com/mims-harvard/G-Meta. The core principle of G-META is to represent every node with a local subgraph and use subgraphs to train GNNs to meta-learn. Our theoretical analysis (Section 4) suggests that the evidence for a prediction can be found in the subgraph surrounding the target node or edge when using GNNs. In contrast to G-META, earlier techniques are trained to meta-learn on entire graphs. As we show theoretically and empirically, such methods are unlikely to succeed in few-shot learning settings where the labels are scarce and scattered around multiple graphs. Furthermore, previous methods capture the overall graph structure but at the loss of finer local structures. In addition, G-META's construction of local subgraphs gives local structural representations that enable direct structural similarity comparison using GNNs, based on their connection to the Weisfeiler-Lehman test [54, 60]. Further, structural similarity enables G-META to form the much-needed inductive bias via a metric-learning algorithm [37]. Moreover, local subgraphs also allow for effective feature propagation and label smoothing within a GNN. (1) G-META is a general approach for a variety of meta learning problems on graph-structured data. While previous methods [63, 3] apply only to one graph meta-learning problem (Figure 1), G-META works for all of them (Appendix B). (2) G-META yields accurate predictors. We demonstrate G-META's performance on seven datasets and compare it against nine baselines. G-META considerably outperforms baselines by up to 16.3%. (3) G-META is scalable. By operating on subgraphs, G-META needs to examine only small graph neighborhoods. We show how G-META scales to large graphs by applying it to our new Tree-of-Life dataset comprising 1,840 graphs, a two-orders-of-magnitude increase in the number of graphs used in prior work.

## 2 Related Work

**(1) Few-shot meta learning.**
Few-shot meta learning learns from prior experiences and transfers them to a new task using only a few labeled examples [46, 41]. Meta learning methods generally fall into three categories: model-based [14, 50, 34], metric-based [37, 38, 47], and optimization-based [32, 29, 12] techniques.

**(2) Meta learning for graphs.** Recent studies incorporate graph-structured data into meta learning. Liu et al. [24] construct image category graphs to improve few-shot learning. Zhou et al. [62] use graph meta learning for fast network alignment. MetaR [7] and GMatching [53] use metric methods to generalize over new relations from a handful of associative relations in a knowledge graph. While these methods learn from a single graph, G-META can handle settings with many graphs and disjoint label sets. Further, Chauhan et al. [5] use a super-class prototypical network for graph classification. In contrast, G-META focuses on node classification and link prediction. In settings with single graphs and disjoint labels (Figure 1), Meta-GNN [63] uses gradient-based meta-learning for node classification. On a related note, GFL [55] focuses on settings with multiple graphs and shared labels across the graphs. Similarly, Meta-Graph [3] uses graph signature functions for few-shot link prediction across multiple graphs. In contrast, G-META can be used for many more problems (Figure 1). Further, Meta-GNN operates a GNN on an entire graph, whereas G-META extracts relevant local subgraphs first and then trains a GNN on each subgraph individually. In Meta-GNN, a task is defined as a batch of node embeddings, whereas in G-META, a task is given by a batch of subgraphs. This difference allows for rapid adaptation to new tasks, improves scalability, and applies broadly to meta learning problems. In summary, G-META works for all problems in Figure 1 and applies to node classification and link prediction.

**(3) Subgraphs and GNNs.** The ability to model subgraph structure is vital for numerous graph tasks, e.g., [45, 9, 58, 40]. For example, Patchy-San [30] uses local receptive fields to extract useful features from graphs. Ego-CNN uses ego graphs to find critical graph structures [43]. SEAL [61] develops theory showing that enclosing subgraphs capture graph heuristics, which we extend to GNNs here. Cluster-GCN [8] and GraphSAINT [59] use subgraphs to improve GNN scalability. G-META is the first approach to use subgraphs for meta learning.

## 3 Background and Problem Formulation

Let $\mathcal{G} = \{G_1, \ldots, G_N\}$ denote $N$ graphs. For each graph $G = (V, E, X)$, $V$ is a set of nodes, $E$ is a set of edges, and $X = \{x_1, \ldots, x_n\}$ is a set of attribute vectors, where $x_u \in \mathbb{R}^d$ is a $d$-dimensional feature vector for node $u \in V$. We denote $\mathcal{Y} = \{Y_1, \ldots, Y_M\}$ as the set of $M$ distinct labels and use $Y$ to denote a set of labels selected from $\mathcal{Y}$.
G-META's core principle is to represent nodes with local subgraphs and then use subgraphs to transfer knowledge across tasks, graphs, and sets of labels. We use $\mathcal{S}$ to denote the local subgraphs for nodes, $\mathcal{S} = \{S_1, \ldots, S_n\}$. Node classification aims to learn a GNN $f : \mathcal{S} \rightarrow \{1, \ldots, |Y|\}$ that can accurately map node $u$'s local subgraph $S_u$ to labels in $Y$ given only a handful of labeled nodes.

**Background on graph neural networks.** GNNs learn compact representations (embeddings) that capture network structure and node features. A GNN generates outputs through a series of propagation layers [13], where propagation at layer $l$ consists of the following three steps. (1) Neural message passing. The GNN computes a message $m_{uv}^{(l)} = \text{MSG}(h_u^{(l-1)}, h_v^{(l-1)})$ for every pair of linked nodes $u, v$ based on their embeddings from the previous layer, $h_u^{(l-1)}$ and $h_v^{(l-1)}$. (2) Neighborhood aggregation. The messages between node $u$ and its neighbors $\mathcal{N}_u$ are aggregated as $\hat{m}_u^{(l)} = \text{AGG}(\{m_{uv}^{(l)} \mid v \in \mathcal{N}_u\})$. (3) Update. Finally, the GNN uses a non-linear function to update the node embedding as $h_u^{(l)} = \text{UPD}(\hat{m}_u^{(l)}, h_u^{(l-1)})$, using the aggregated message and the embedding from the previous layer.

**Background on meta learning.** In meta learning, we have a meta-set $\mathcal{D}$ that consists of $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{val}}$, and $\mathcal{D}_{\text{test}}$. The meta-set consists of many tasks. Each task $\mathcal{T}_i \in \mathcal{D}$ can be divided into $\mathcal{T}_i^{\text{support}}$ and $\mathcal{T}_i^{\text{query}}$. $\mathcal{T}_i^{\text{support}}$ has $K_{\text{support}}$ labeled data points per label for learning, and $\mathcal{T}_i^{\text{query}}$ has $K_{\text{query}}$ data points per label for evaluation. The size of the label set for the meta-set is $|Y| = N$; this is called an $N$-way $K_{\text{support}}$-shot learning problem. During meta-training, for each $\mathcal{T}_i$ from $\mathcal{D}_{\text{train}}$, the model first learns from $\mathcal{T}_i^{\text{support}}$ and is then evaluated on $\mathcal{T}_i^{\text{query}}$ to see how well it performs on that task. The goal of Model-Agnostic Meta-Learning (MAML) [12] is to obtain a parameter initialization that can quickly adapt to unseen tasks, such as those in $\mathcal{D}_{\text{test}}$, using gradient information learned during meta-training. Hyperparameters are tuned via $\mathcal{D}_{\text{val}}$.

### 3.1 G-META: Problem Formulation

G-META is designed for three fundamentally different meta-learning problems (Figure 1). Shared labels refers to the situation where every task shares the same label set $Y$. Disjoint labels refers to the situation where the label sets for tasks $i$ and $j$ are disjoint, i.e., $Y_i \cap Y_j = \emptyset$. Each data point in a task $\mathcal{T}_i$ is a local subgraph $S_u$ together with its associated label $Y_u$. G-META aims to adapt to a new task $\mathcal{T}_* \sim p(\mathcal{T})$ for which only a handful of examples are available, after observing related tasks $\mathcal{T}_i \sim p(\mathcal{T})$.
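To make the episodic setup above concrete, the following is a minimal Python sketch (not the released implementation) of how an $N$-way $K$-shot meta-task over local subgraphs could be assembled. The helper names `extract_subgraph` and `sample_task` are illustrative, and the karate-club example is only a toy stand-in for the graphs in Table 1.

```python
import random
import networkx as nx

def extract_subgraph(G, u, h=2):
    """h-hop local subgraph S_u around centroid node u (Section 4)."""
    return nx.ego_graph(G, u, radius=h)

def sample_task(G, labels, n_way=3, k_support=3, k_query=5, h=2):
    """Sample one N-way K-shot meta-task; each data point is (subgraph, centroid node, class)."""
    classes = random.sample(sorted(set(labels.values())), n_way)
    support, query = [], []
    for cls_idx, cls in enumerate(classes):
        nodes = [v for v, y in labels.items() if y == cls]
        chosen = random.sample(nodes, k_support + k_query)
        examples = [(extract_subgraph(G, v, h), v, cls_idx) for v in chosen]
        support += examples[:k_support]
        query += examples[k_support:]
    return support, query

# Toy usage on a small labeled graph (a stand-in for the datasets in Table 1)
random.seed(0)
G = nx.karate_club_graph()
labels = {v: d["club"] for v, d in G.nodes(data=True)}
support, query = sample_task(G, labels, n_way=2, k_support=1, k_query=3)
print(len(support), len(query))   # 2 support and 6 query local subgraphs
```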
**Graph meta-learning problem 1: Single Graph and Disjoint Labels.** We have a graph $G$ and a distribution of label sets $p(Y|G)$. The goal is to adapt to an unseen label set $Y_* \sim p(Y|G)$ by learning from tasks with other label sets $Y_i \sim p(Y|G)$, where $Y_i \cap Y_* = \emptyset$ for every label set $Y_i$.

**Graph meta-learning problem 2: Multiple Graphs and Shared Labels.** We have a distribution of graphs $p(G)$ and one label set $Y$. The goal is to learn from graphs $G_j \sim p(G)$ and quickly adapt to an unseen graph $G_* \sim p(G)$, where $G_j$ and $G_*$ are disjoint. All tasks share the same labels.

**Graph meta-learning problem 3: Multiple Graphs and Disjoint Labels.** We have a distribution of label sets $p(Y|\mathcal{G})$ conditioned on multiple graphs $\mathcal{G}$. Each task has its own label set $Y_i$, but the same label set can appear in multiple graphs. The goal is to adapt to an unseen label set $Y_* \sim p(Y|\mathcal{G})$ by learning from disjoint label sets $Y_i \sim p(Y|\mathcal{G})$, where $Y_i \cap Y_* = \emptyset$.

## 4 Local Subgraphs and Theoretical Motivation for G-META

We start by describing how to construct local subgraphs in G-META. We then provide a theoretical justification showing that local subgraphs preserve useful information from the entire graph, and argue how subgraphs enable G-META to capture sufficient information on the graph structure, node features, and labels, and to use that information for graph meta-learning.

For node $u$, a local subgraph is defined as a subgraph $S_u = (V_u, E_u, X_u)$ induced from the node set $\{v \mid d(u, v) \leq h\}$, where $d(u, v)$ is the shortest path distance between nodes $u$ and $v$, and $h$ defines the neighborhood size. In a meta-task, subgraphs are sampled from graphs or label sets, depending on the graph meta-learning problem defined in Section 3. We then use GNNs to encode the local subgraphs. A straightforward question is whether a local subgraph loses information by excluding the nodes outside of it. Here, we show theoretically that applying a GNN to the local subgraph preserves useful information compared to applying a GNN to the entire graph.

**Preliminaries and definitions.** We use GCN [22] as an exemplar GNN to understand how nodes influence each other during neural message passing. The assumptions are based on [48] and are detailed in Appendix C. We need the following definitions. (1) Node influence $I_{u,v}$ of $v$ on $u$ in the final GNN output is $I_{u,v} = \lVert \partial h_u / \partial x_v \rVert$, where $h_u$ is the final-layer GNN embedding of $u$, the norm is any subordinate norm, and the Jacobian measures how a change in $v$ translates to a change in $u$ [48]. (2) Graph influence $I_G$ on $u$ is $I_G(u) = \lVert [I_{u,v_1}, \ldots, I_{u,v_n}] \rVert_1$, where $[I_{u,v_1}, \ldots, I_{u,v_n}]$ is a vector representing the influence of the other nodes on $u$.
(3) Graph influence loss $R_h$ is defined as $R_h(u) = I_G(u) - I_{S_u}(u)$, where $I_G(u)$ is the influence of the entire graph $G$ and $I_{S_u}(u)$ is the influence of the local subgraph $S_u$.

Next, we show how influence spreads between nodes depending on how far the nodes are from each other in a graph.

**Theorem 1 (Decaying Property of Node Influence).** Let $t$ be a path between node $u$ and node $v$ and let $D^t_{GM}$ be the geometric mean of the node degrees occurring on path $t$. Let $D^{t^*}_{GM} = \min_t \{D^t_{GM}\}$ and $h = d(u, v)$. Consider the node influence $I_{u,v}$ from $v$ to $u$. Then, $I_{u,v} \leq C / (D^{t^*}_{GM})^h$ for some constant $C$.

The proof is deferred to Appendix C. Theorem 1 states that the influence of a node $v$ on node $u$ decays exponentially as their distance $h$ increases. A side theoretical finding is that the influence is largely determined by the accumulated node degrees along the paths between the two nodes. In other words, if the paths are straight lines of nodes (low accumulated node degrees), then the node influence is high. Otherwise, if the paths consist of numerous connections to other nodes (high accumulated node degrees), the node influence is minimal. Intuitively, paths with high accumulated node degrees carry complicated messages that dampen each individual node's influence, whereas low-degree paths can pass the message directly to the target node. Since real-world graphs usually have relatively high node degrees, the node influence will be considerably low. We next show that, for a node $u$, operating a GNN on the local subgraph does not lose useful information compared to operating a GNN on the entire graph.

**Theorem 2 (Local Subgraph Preservation Property).** Let $S_u$ be the local subgraph for node $u$ with neighborhood size $h$. Let node $v^*$ be defined as $v^* = \arg\max_w \{I_{u,w} \mid w \in V \setminus V_u\}$. Let $\bar{t}$ be a path between $u$ and $v^*$ and let $D^{\bar{t}}_{GM}$ be the geometric mean of the node degrees occurring on path $\bar{t}$. Let $D^{\bar{t}^*}_{GM} = \min_{\bar{t}} \{D^{\bar{t}}_{GM}\}$. The following holds: $R_h(u) \leq C / (D^{\bar{t}^*}_{GM})^{h+1}$ for some constant $C$.

The proof is deferred to Appendix D. Theorem 2 states that the graph influence loss is bounded by an exponentially decaying term as $h$ increases. In other words, the local subgraph formulation is an $h$-th order approximation of applying a GNN to the entire graph.
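As an illustrative sanity check of the decaying-influence property (this is not one of the paper's experiments, and the simple gradient norm below is only a proxy for the Jacobian-based influence defined above), the following script propagates random features through an untrained GCN-style model on a toy Barabási-Albert graph and reports how the influence on a target node typically falls off with distance.

```python
import torch
import networkx as nx

torch.manual_seed(0)
G = nx.barabasi_albert_graph(60, 2, seed=0)      # toy graph
n, d, L = G.number_of_nodes(), 8, 3              # nodes, feature dim, GNN depth

A = torch.tensor(nx.to_numpy_array(G), dtype=torch.float32) + torch.eye(n)
A_hat = A / A.sum(dim=1, keepdim=True)           # row-normalized adjacency with self-loops

X = torch.randn(n, d, requires_grad=True)        # random node features
Ws = [torch.randn(d, d) / d ** 0.5 for _ in range(L)]

H = X
for W in Ws:                                     # L rounds of GCN-style propagation
    H = torch.tanh(A_hat @ H @ W)

u = 0
grad = torch.autograd.grad(H[u].sum(), X)[0]     # proxy influence: gradient of h_u w.r.t. all features
dist = nx.single_source_shortest_path_length(G, u)
for hop in range(1, 4):
    nodes = [v for v, dv in dist.items() if dv == hop]
    if nodes:
        influence = grad[nodes].norm(dim=1).mean().item()
        print(f"distance {hop}: mean influence {influence:.4f}")
```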
**Local subgraphs enable few-shot meta-learning.** Equipped with this theoretical justification, we now describe how local subgraphs allow G-META to do few-shot meta-learning. The construction of local subgraphs captures the following. (1) Structures. Graph structures present an alternative source of strong signal for prediction [9, 45], especially when node labels are scarce. GNN representations cannot fully capture the structure of large graphs because they are too complicated [3, 54]. However, they can learn representations that capture the structure of small graphs, such as our local subgraphs, as is evidenced by the connection to the Weisfeiler-Lehman test [54, 60]. Hence, subgraphs enable G-META to capture structural node information. (2) Features. Local subgraphs preserve useful information, as indicated by the theorems above. (3) Labels. When only a handful of nodes are labeled, it is challenging to efficiently propagate the labels through the entire graph [64, 23]. Metric-learning methods [37] learn a task-specific metric to classify query set data using the closest point from the support set, which has been shown to be an effective inductive bias [37, 42]. Equipped with subgraph representations that capture both structure and feature information, G-META uses metric learning by comparing the query subgraph embedding to the support subgraph embeddings. As such, it circumvents the problem of having too little label information for effective propagation.

Figure 2: (1) We first construct a batch of $m$ meta-training tasks and extract local subgraphs on the fly for nodes in the meta-tasks. For each task $\mathcal{T}_i$, (2) subgraphs from the support set are mini-batched and fed into a GNN parameterized by $\theta$. (3) The support set embeddings of the centroid nodes are generated, and (4) the prototypes are computed from the support centroid embeddings. Then, (5) the support set loss $\mathcal{L}_{\text{support}}$ is computed, and (6) back-propagated to update the GNN parameter $\theta$. (7) The $\mathcal{T}_i^{\text{query}}$ subgraphs are then fed into the updated GNN to (8) generate query centroid embeddings. (9) Using the support prototypes and the query embeddings, the query loss $\mathcal{L}^i_{\text{query}}$ for task $\mathcal{T}_i$ is computed. Steps (2-9) are repeated for a number of update steps. The same process repeats for the other $m$ sampled tasks, starting from the same GNN $f_\theta$. (10) The query losses from the last update step of all tasks are summed up and used to update $\theta$. Then, another batch of tasks is sampled, and steps (1-10) are repeated. For meta-testing tasks, steps (1-9) are repeated with the GNN using the meta-learned parameter $\theta^*$, which enables generalization over unseen tasks. See Algorithm 1 (Appendix E).

## 5 G-META: Meta Learning via Local Subgraphs

G-META (Figure 2) is an approach for meta-learning on graphs. Building on the theoretical motivation from Section 4, G-META first constructs local subgraphs. It then uses a GNN encoder to generate embeddings for the subgraphs. Finally, it uses a prototypical loss for inductive bias and MAML for knowledge transfer across graphs and labels. The overview is in Algorithm 1 (Appendix E).

**Neural encoding of subgraphs.** In each meta-task, we first construct a subgraph $S_u$ for each node $u$. While we use $h$-hop neighborhoods to construct subgraphs, other subgraph extraction algorithms, e.g., [6, 11], can be considered. We then feed each subgraph $S_u$ into an $h$-layer GNN to obtain an embedding for every node in the subgraph, where $h$ is set to the size of the subgraph neighborhood. The centroid node $u$'s embedding is used to represent the subgraph: $h_u = \text{Centroid}(\text{GNN}(S_u))$. Note that a centroid node embedding is a particular instantiation of our framework. One can consider alternative subgraph representations, such as subgraph neural networks [1, 57] or readout functions specified over the nodes in a subgraph [54, 5]. Notably, the local subgraphs in our study are different from computation graphs [58]: local subgraphs are not used for neural message passing; instead, we use them for meta learning. Further, our framework does not constrain subgraphs to $h$-hop neighborhoods or subgraph neural encodings to centroid embeddings.
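For concreteness, below is a minimal numpy sketch of the two steps just described: extracting the $h$-hop local subgraph $S_u$ and reading out the centroid node's embedding from a GNN run on that subgraph alone. The simple mean-aggregation layers stand in for whichever GNN encoder is used; `encode_subgraph` is an illustrative helper, not the released implementation.

```python
import numpy as np
import networkx as nx

def encode_subgraph(G, u, X, Ws, h=2):
    """Embed node u by running a simple mean-aggregation GNN on its h-hop
    local subgraph S_u and reading out the centroid node's embedding h_u.
    A minimal stand-in for the GNN encoder used in the paper."""
    S = nx.ego_graph(G, u, radius=h)                   # local subgraph S_u
    nodes = list(S.nodes())
    A = nx.to_numpy_array(S) + np.eye(len(nodes))      # adjacency with self-loops
    A = A / A.sum(axis=1, keepdims=True)               # row-normalize (mean aggregation)
    H = X[nodes]                                       # node features restricted to S_u
    for W in Ws:                                       # h propagation layers
        H = np.maximum(A @ H @ W, 0.0)                 # aggregate neighbors + ReLU
    return H[nodes.index(u)]                           # centroid node embedding h_u

# Toy usage
rng = np.random.default_rng(0)
G = nx.barabasi_albert_graph(100, 3, seed=0)
X = rng.normal(size=(100, 16))                         # node feature matrix
Ws = [rng.normal(size=(16, 16)) / 4.0 for _ in range(2)]
print(encode_subgraph(G, u=5, X=X, Ws=Ws, h=2).shape)  # (16,)
```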
**Prototypical loss.** After we obtain subgraph representations, we leverage an inductive bias between representations and labels to circumvent the issue of limited label information in few-shot settings. For each label $k$, we take the mean over support set subgraph embeddings to obtain a prototype $c_k$ as $c_k = \frac{1}{N_k} \sum_{y_j = k} h_j$. The prototype $c_k$ serves as a landmark for label $k$. Then, for each local subgraph $S_u$ in either the support or the query set, a class distribution vector $p$ is calculated via the Euclidean distance between the support prototypes and the centroid embedding: $p_k = \frac{\exp(-\lVert h_u - c_k \rVert)}{\sum_{k'} \exp(-\lVert h_u - c_{k'} \rVert)}$. Finally, we use the class distribution vectors from local subgraphs to optimize a cross-entropy loss, $\mathcal{L}(p, y) = -\sum_j y_j \log p_j$, where $y$ is the one-hot encoding of the true label.

**Optimization-based meta-learning.** To transfer structural knowledge across graphs and labels, we use MAML, an optimization-based meta-learning approach. We break the node's dependency on a graph by framing nodes as independent local subgraph classification tasks. This allows direct adaptation of MAML, since individual subgraphs can be treated like individual images in the classic few-shot meta-learning setup. More specifically, we first sample a batch of tasks, where each task consists of a set of subgraphs. During the meta-training inner loop, we perform regular stochastic gradient descent on the support loss for each task $\mathcal{T}_i$: $\theta_j = \theta_{j-1} - \alpha \nabla_\theta \mathcal{L}_{\text{support}}$. The updated parameter is then evaluated on the query set, and the query loss for task $i$ is recorded as $\mathcal{L}^i_{\text{query}}$. The above steps are repeated for a number of update steps. Then, the $\mathcal{L}^i_{\text{query}}$ from the last update step is summed across the batch of tasks, and we perform a meta-update step: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}^i_{\text{query}}$. Next, a new batch of tasks is sampled, and the same iteration is applied to the meta-updated $\theta$. During meta-testing, the same procedure is applied using the final meta-updated parameter $\theta^*$. $\theta^*$ is learned from knowledge across meta-training tasks and is the optimal parameter to adapt quickly to unseen tasks.
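The following PyTorch sketch shows how the prototypical loss and the MAML inner/outer loops described above fit together. It is a simplified stand-in rather than the paper's implementation: a small MLP replaces the subgraph GNN encoder, and subgraph inputs are assumed to be pre-batched feature tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call   # requires a recent PyTorch (torch.func)

def proto_loss(emb_support, y_support, emb_query, y_query, n_way):
    """Prototypical loss: class prototypes are mean support embeddings,
    logits are negative Euclidean distances to the prototypes."""
    protos = torch.stack([emb_support[y_support == k].mean(0) for k in range(n_way)])
    return F.cross_entropy(-torch.cdist(emb_query, protos), y_query)

def meta_loss(encoder, tasks, inner_lr=1e-2, inner_steps=5):
    """MAML-style meta-objective over a batch of tasks. `encoder` stands in
    for the subgraph GNN; each task is (x_support, y_support, x_query, y_query,
    n_way). Gradients flow through the inner loop because of create_graph=True."""
    theta = dict(encoder.named_parameters())
    total = 0.0
    for xs, ys, xq, yq, n_way in tasks:
        fast = dict(theta)                                     # task-specific copy of theta
        for _ in range(inner_steps):                           # inner-loop adaptation on support loss
            emb = functional_call(encoder, fast, (xs,))
            loss = proto_loss(emb, ys, emb, ys, n_way)
            grads = torch.autograd.grad(loss, list(fast.values()), create_graph=True)
            fast = {k: w - inner_lr * g for (k, w), g in zip(fast.items(), grads)}
        total = total + proto_loss(functional_call(encoder, fast, (xs,)), ys,
                                   functional_call(encoder, fast, (xq,)), yq, n_way)
    return total / len(tasks)

# Toy usage: 3-way 3-shot tasks with random features standing in for subgraph inputs
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
tasks = [(torch.randn(9, 16), torch.arange(3).repeat(3),
          torch.randn(15, 16), torch.arange(3).repeat(5), 3) for _ in range(4)]
loss = meta_loss(encoder, tasks)
opt.zero_grad()
loss.backward()                                                # outer (meta) update of theta
opt.step()
print(float(loss))
```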
**Attractive properties of G-META.** (1) Scalability: G-META operates on a mini-batch of local subgraphs where the subgraph size and the batch size (few shots) are both small. This allows fast computation and low memory requirements because G-META's aggregation field is smaller than that of previous methods that operate on entire graphs [3, 63]. We further increase scalability by sub-sampling subgraphs with more than 1,000 nodes. Also note that graph few-shot learning does not evaluate all samples in the graph, since in the few-shot setting the majority of labels do not exist. Thus, each mini-batch consists only of a few labeled samples (e.g., 9 in 3-way 3-shot learning), making these operations quick. (2) Inductive learning: Since G-META operates on different subgraphs in each GNN encoding, it forces inductiveness over unseen subgraphs. This inductiveness is crucial for few-shot learning, where the trained model needs to adapt to unseen nodes. Inductive learning also allows for knowledge transfer from meta-training subgraphs to meta-testing subgraphs. (3) Over-smoothing regularization: One limitation of GNNs is that connected nodes become increasingly similar after multiple iterations of propagation on the same graph. In contrast, each iteration of G-META consists of a batch of different subgraphs with various structures, sizes, and nodes, where each subgraph is fed into the GNN individually. This prevents the GNN from over-smoothing on the structure of a single graph. (4) Few-shot learning: G-META needs only a tiny number of labeled nodes for successful learning, as demonstrated in the experiments. This property is in contrast with prevailing GNNs, which require a large fraction of labeled nodes to propagate neural messages in the graph successfully. (5) Broad applicability: G-META applies to many graph meta-learning problems (Figure 1), whereas previous methods apply to at most one [3, 63]. Unlike earlier methods, G-META works for node classification and few-shot link prediction (i.e., via local subgraphs for a pair of nodes).

## 6 Experiments

**Synthetic datasets.** We use two synthetic datasets whose labels depend on nodes' structural roles [16] to confirm that G-META captures local graph structure (Table 1).

Table 1: Dataset statistics. Fold-PPI and Tree-of-Life are new datasets introduced in this study.
| Dataset | Task | # Graphs | # Nodes | # Edges | # Features | # Labels |
| --- | --- | --- | --- | --- | --- | --- |
| Synthetic Cycle | Node | 10 | 11,476 | 19,687 | N/A | 17 |
| Synthetic BA | Node | 10 | 2,000 | 7,647 | N/A | 10 |
| ogbn-arxiv | Node | 1 | 169,343 | 1,166,243 | 128 | 40 |
| Tissue-PPI | Node | 24 | 51,194 | 1,350,412 | 50 | 10 |
| FirstMM-DB | Link | 41 | 56,468 | 126,024 | 5 | 2 |
| Fold-PPI | Node | 144 | 274,606 | 3,666,563 | 512 | 29 |
| Tree-of-Life | Link | 1,840 | 1,450,633 | 8,762,166 | N/A | 2 |

(1) Cycle: we use a cycle basis graph and attach a distribution of shapes: House, Star, Diamond, and Fan [9]. The label of each node is the structural role defined by the shape. We also add random edges as noise. In the multiple graphs problem, each graph has a varying distribution of the number of shapes. (2) BA: to model local structural information in a more realistic homophily graph, we construct a Barabási-Albert (BA) graph and plant different shapes into it. We then compute the Graphlet Distribution Vector [31] for each node, which characterizes the local graph structure, and apply spectral clustering on this vector to generate the labels. For the multiple graphs problem, a varying distribution of the number of shapes is used to plant each BA graph. Details are in Appendix F.

**Real-world datasets and new meta-learning datasets.** We use three real-world datasets for node classification and two for link prediction to evaluate G-META (Table 1). (1) ogbn-arxiv is a CS citation network, where features are titles and labels are the subject areas [17]. (2) Tissue-PPI consists of 24 protein-protein interaction networks from different tissues, where features are gene signatures and labels are gene ontology functions [67, 15]. (3) Fold-PPI is a novel dataset that we constructed for the multiple graphs and disjoint labels problem. It has 144 tissue PPI networks [67], and the labels are protein structures defined in the SCOP database [2]. The features are conjoint triad protein descriptors [36]. We screen fold groups that have more than 9 unique proteins across the networks, resulting in 29 unique labels. As in many real-world graphs, the majority of nodes in Fold-PPI do not have associated labels, which also underscores the importance of graph few-shot learning. (4) FirstMM-DB [28] is the standard 3D point cloud dataset for link prediction across graphs, consisting of 41 graphs. (5) Tree-of-Life is a new dataset that we constructed based on 1,840 protein interaction networks, each originating from a different species [66]. Since node features are not provided, we use node degrees instead. Details are in Appendix F.
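As a rough illustration of the synthetic BA construction described earlier in this section (the exact generation pipeline is in Appendix F), the sketch below builds a Barabási-Albert graph, plants house-shaped motifs, and derives structural labels. Simple structural features replace the graphlet distribution vector here, so this is only a simplified stand-in with illustrative parameters.

```python
import random
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering

def plant_shape(G, shape, attach_to):
    """Attach a copy of `shape` to G via a single edge to node `attach_to`."""
    offset = max(G.nodes()) + 1
    shape = nx.relabel_nodes(shape, {v: v + offset for v in shape.nodes()})
    G = nx.compose(G, shape)
    G.add_edge(attach_to, offset)
    return G

random.seed(0)
G = nx.barabasi_albert_graph(300, 2, seed=0)                  # BA base graph
for _ in range(30):                                           # plant house-shaped motifs
    G = plant_shape(G, nx.house_graph(), random.choice(list(G.nodes())))

# Simple structural features per node (a stand-in for the graphlet
# distribution vector [31]): degree, clustering, average neighbor degree.
clust = nx.clustering(G)
nbr_deg = nx.average_neighbor_degree(G)
feats = np.array([[G.degree(v), clust[v], nbr_deg[v]] for v in G.nodes()])

labels = SpectralClustering(n_clusters=5, random_state=0).fit_predict(feats)
print(G.number_of_nodes(), labels[:10])
```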
**Experimental setup.** We follow the standard episodic training for few-shot learning [42]. We refer readers to Section 3.1 for the task setups of the various graph meta-learning problems. Note that in order to simulate real-world few-shot graph learning settings, we do not use the full set of labels provided in the dataset. Instead, we select a partition of it by constructing a fixed number of meta-tasks. Here, we describe the parameters used in this study. (1) Node classification. For disjoint label setups, we sample 5 labels for meta-testing, 5 for meta-validation, and use the rest for meta-training. In each task, 2-way 1-shot learning with 5 gradient update steps in meta-training and 10 gradient update steps in meta-testing is used for the synthetic datasets. 3-way 3-shot learning with 10 gradient update steps in meta-training and 20 gradient update steps in meta-testing is used for the real-world datasets' disjoint label settings. For multiple graphs shared labels setups, 10% (10%) of all graphs are held out for testing (validation), and the remaining graphs are used for training. For Tissue-PPI, we use the average of ten 2-way protein function tasks. (2) Link prediction. 10% of graphs are held out for testing and another 10% for validation. For each graph, the support set consists of 30% of the edges and the query set of the remaining 70%. Negative edges are sampled randomly to match the number of positive edges, and we follow the experimental setting from [61] to construct the training graph. We use 16 shots for each task, i.e., only 32 node pairs are used to predict links for an unseen graph. 10 gradient update steps in meta-training and 20 gradient update steps in meta-testing are used. Each experiment is repeated five times to calculate the standard deviation of the results. Hyperparameter selection and a recommended set of hyperparameters are in Appendix G.
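A minimal sketch of the link-prediction episode construction described above (the actual pipeline additionally follows [61] for building the training graph, which is omitted here): split a graph's edges into a 30%/70% support/query set and sample matching negative node pairs.

```python
import random
import networkx as nx

def link_prediction_task(G, support_frac=0.3, seed=0):
    """Split a graph's edges into support/query positives and sample an
    equal number of negative (non-edge) node pairs for each split."""
    rng = random.Random(seed)
    edges = list(G.edges())
    rng.shuffle(edges)
    n_support = int(support_frac * len(edges))
    pos = {"support": edges[:n_support], "query": edges[n_support:]}

    nodes = list(G.nodes())
    def sample_negatives(k):
        negatives = set()
        while len(negatives) < k:
            u, v = rng.sample(nodes, 2)
            if not G.has_edge(u, v):
                negatives.add(tuple(sorted((u, v))))
        return list(negatives)

    neg = {split: sample_negatives(len(pairs)) for split, pairs in pos.items()}
    return pos, neg

# Toy usage
G = nx.barabasi_albert_graph(200, 3, seed=0)
pos, neg = link_prediction_task(G)
print(len(pos["support"]), len(neg["support"]), len(pos["query"]), len(neg["query"]))
```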
**Nine baseline methods.** Meta-Graph [3] injects a graph signature into VGAE [21] for few-shot multi-graph link prediction. Meta-GNN [63] applies MAML [12] to Simple Graph Convolution (SGC) [52]. Few-shot Graph Isomorphism Network (FS-GIN) [54] applies GIN on the entire graph and retrieves the few-shot nodes to propagate the loss and enable learning. Similarly, Few-shot SGC (FS-SGC) [52] switches GIN to SGC as the GNN encoder. Note that these four baselines only work in a few of the graph meta-learning problems.

Table 2: Graph meta-learning performance on synthetic datasets. Reported is multi-class classification accuracy (five-fold average) for 1-shot node classification in various graph meta-learning problems. N/A means the method does not apply to the problem. Disjoint label problems use a 2-way setup. In the shared label problem, the cycle graph has 17 labels and the BA graph has 10 labels. The results use 5 gradient update steps in meta-training and 10 gradient update steps in meta-testing. The full table with standard deviations is in Appendix K. Column abbreviations: SG/DL = single graph, disjoint labels; MG/SL = multiple graphs, shared labels; MG/DL = multiple graphs, disjoint labels; all tasks are node classification.

| Method | SG/DL Cycle | SG/DL BA | MG/SL Cycle | MG/SL BA | MG/DL Cycle | MG/DL BA |
| --- | --- | --- | --- | --- | --- | --- |
| G-META (Ours) | 0.872 | 0.867 | 0.542 | 0.734 | 0.767 | 0.867 |
| Meta-Graph | N/A | N/A | N/A | N/A | N/A | N/A |
| Meta-GNN | 0.720 | 0.694 | N/A | N/A | N/A | N/A |
| FS-GIN | 0.684 | 0.749 | N/A | N/A | N/A | N/A |
| FS-SGC | 0.574 | 0.715 | N/A | N/A | N/A | N/A |
| KNN | 0.918 | 0.804 | 0.343 | 0.710 | 0.753 | 0.769 |
| No-Finetune | 0.509 | 0.567 | 0.059 | 0.265 | 0.592 | 0.577 |
| Finetune | 0.679 | 0.671 | 0.385 | 0.517 | 0.599 | 0.629 |
| ProtoNet | 0.821 | 0.858 | 0.282 | 0.657 | 0.749 | 0.866 |
| MAML | 0.842 | 0.848 | 0.511 | 0.726 | 0.653 | 0.844 |

Figure 3: Synthetic Cycle and Barabási-Albert datasets.

Table 3: Graph meta-learning performance on real-world datasets. Reported is multi-class classification accuracy (five-fold average) and standard deviation. N/A means the method does not work in the graph meta-learning problem. Note that ProtoNet and MAML can be considered ablation studies of G-META. Further details on performance are in Appendix I.
| Method | ogbn-arxiv (single graph, disjoint labels, node) | Tissue-PPI (multiple graphs, shared labels, node) | Fold-PPI (multiple graphs, disjoint labels, node) | FirstMM-DB (multiple graphs, shared labels, link) | Tree-of-Life (multiple graphs, shared labels, link) |
| --- | --- | --- | --- | --- | --- |
| G-META (Ours) | 0.451 ± 0.032 | 0.768 ± 0.029 | 0.561 ± 0.059 | 0.784 ± 0.028 | 0.722 ± 0.032 |
| Meta-Graph | N/A | N/A | N/A | 0.719 ± 0.020 | 0.705 ± 0.004 |
| Meta-GNN | 0.273 ± 0.122 | N/A | N/A | N/A | N/A |
| FS-GIN | 0.336 ± 0.042 | N/A | N/A | N/A | N/A |
| FS-SGC | 0.347 ± 0.005 | N/A | N/A | N/A | N/A |
| KNN | 0.392 ± 0.015 | 0.619 ± 0.025 | 0.433 ± 0.034 | 0.603 ± 0.072 | 0.649 ± 0.012 |
| No-Finetune | 0.364 ± 0.014 | 0.516 ± 0.006 | 0.376 ± 0.017 | 0.509 ± 0.006 | 0.505 ± 0.001 |
| Finetune | 0.359 ± 0.010 | 0.521 ± 0.013 | 0.370 ± 0.022 | 0.511 ± 0.007 | 0.504 ± 0.003 |
| ProtoNet | 0.372 ± 0.017 | 0.546 ± 0.025 | 0.382 ± 0.031 | 0.779 ± 0.020 | 0.697 ± 0.010 |
| MAML | 0.389 ± 0.021 | 0.745 ± 0.051 | 0.482 ± 0.062 | 0.758 ± 0.025 | 0.719 ± 0.012 |

We also test different meta-learning models, using the top-performing ones in [42]. We operate at the subgraph level for them since this allows comparison across all problems. No-Finetune trains on the support set and uses the trained model to classify each query example, using only the meta-testing set. KNN [10, 42] first trains a GNN using all data in the meta-training set and uses it as an embedding function; it then assigns each query example the majority label of its K closest examples in the support set. Finetune [42] uses the embedding function learned on the meta-training set, and the model is then fine-tuned on the meta-testing set. ProtoNet [37] applies prototypical learning to the subgraph embeddings, following the standard few-shot learning setup. MAML [12] replaces ProtoNet with MAML as the meta-learner. Note that the baselines ProtoNet and MAML can be considered ablations of G-META that remove MAML and the prototypical loss, respectively. All experiments use the same number of query points per label, and we report the multi-class accuracy metric.

### 6.1 Results

**Overview of results.** Synthetic dataset results are reported in Table 2 and real-world dataset results in Table 3. We see that G-META consistently achieves the best accuracy in almost all tested graph meta-learning problems and on both node classification and link prediction tasks. Notably, G-META achieves a 15.1% relative increase in the single graph disjoint labels setting and a 16.3% relative increase in the multiple graphs disjoint labels setting over the best-performing baseline. This performance boost suggests that G-META can learn across graphs and disjoint labels. It also shows that G-META obtains good predictive performance using only a few gradient update steps and a few examples on the target tasks.

**G-META captures local structural information.** In the synthetic datasets, we see that Meta-GNN, FS-GIN, and FS-SGC, which operate on the entire graph, are inferior to subgraph-based methods such as G-META.
This finding demonstrates that subgraph embeddings can capture local structural roles, whereas using the entire graph cannot, further supporting our theoretical motivation. In the single graph disjoint labels setting, KNN achieves the best result. This suggests that the subgraph representation learned from the trained embedding function is sufficient to capture the structural roles, further confirming the usefulness of the local subgraph construction.

**G-META is a highly predictive, general graph meta-learning method.** Across meta-learning models, we observe that G-META is consistently better than others such as MAML and ProtoNet. MAML and ProtoNet have volatile results across different problems and tasks, whereas G-META is stable. This result confirms our analysis of using the prototypical loss to leverage the label inductive bias and MAML to transfer knowledge across graphs and labels. By comparing G-META to the two relevant existing works, Meta-GNN and Meta-Graph, we see that G-META works across different problems, whereas the previous two methods are restricted to specific graph meta-learning problems. On the real-world datasets, we observe that No-Finetune is better than Finetune in several problems (ogbn-arxiv & Fold-PPI). This observation shows that the meta-training datasets can bias the meta-testing result, underscoring the importance of the meta-learning algorithm for achieving positive transfer in graph meta-learning.

**Local subgraphs are vital for successful graph few-shot learning.** We observe that Meta-GNN, FS-GIN, and FS-SGC, which operate on the entire graph, perform poorly. In contrast, G-META can learn a transferable signal, even when label sets are disjoint. Since the problem requires learning from only 3 shots in a large graph of around 170 thousand nodes and over 1 million edges (ogbn-arxiv), we posit that operating a GNN on the entire graph would not propagate useful information. Hence, the performance increase suggests that local subgraph construction is useful, especially in large graphs. Finally, in the Tree-of-Life link prediction problem, we studied the largest-ever graph meta-learning dataset, spanning 1,840 PPI graphs. This experiment also supports the scalability of G-META, which is enabled by local subgraphs.

**Ablation and parameter studies.** We conduct ablations on two important components, optimization-based meta-learning and the prototypical loss. Note that the baselines ProtoNet and MAML are ablations of G-META. We find that both components are important for improving predictive performance. To study parameters, we first vary the subgraph neighborhood size h. We find that h = 2 performs better across datasets than h = 1. With h = 3, in some datasets the additional neighborhood introduces noise and decreases performance, while in others it improves performance because it includes more information. We suggest using h = 2 since it has the most stable performance. We then vary the number of shots k and observe a roughly linear trend between k and predictive performance. Detailed quantitative results of the parameter studies are provided in Appendix J.

## 7 Conclusion

We introduce G-META, a scalable and inductive meta learning method for graphs. Unlike earlier methods, G-META excels at difficult few-shot learning tasks and can tackle a variety of meta learning problems. The core principle in G-META is to use local subgraphs to identify and transfer useful information across tasks.
The local subgraph approach is fundamentally different from prior studies, which use entire graphs and capture global graph structure at the loss of finer topological shapes. Local subgraphs are theoretically justified and stem from our observation that evidence for a prediction can be found in the local subgraph surrounding the target entity. G-META outperforms nine baselines across seven datasets, including a new and original dataset of 1,840 graphs.

## Broader Impact

Graphs represent an incredibly powerful data representation that has proved useful for numerous domains and application areas. At the same time, meta learning is a key area of machine learning research that has already demonstrated great potential for problems such as few-shot image recognition and neural architecture search.

**G-META advances ML research.** Our present work is at an exciting yet underexplored intersection of graph ML and meta learning, advancing the state-of-the-art and enabling practical applications of meta learning for large graph datasets. State-of-the-art GNN methods do not work when there are only a handful of labels available. Our work fills this gap by providing a novel approach that leverages related information, such as related graphs or other label sets, to aid graph few-shot learning. Numerous real-world graphs are associated with only scarce node labels or have a large percentage of missing links. Low-resource constraints (e.g., the sparsity of graphs and scarcity of labels) are a fundamental challenge in real-world graph applications. The goal of graph meta-learning is to solve key graph ML tasks, such as node classification and link prediction, under these constraints. We envision numerous impactful applications of our work, as already demonstrated in the paper. We provide a brief overview of key applications for graph meta-learning, and G-META in particular.

**G-META can expedite scientific discovery.** Science is filled with complex systems that are modeled by graphs. The labels are usually obtained through resource-intensive experiments in laboratories. Many experiments, such as those measuring protein-protein interactions (studied in this paper), cannot yet be automated. Because of that, labels (i.e., a variety of molecular properties that need to be discovered for downstream problems like drug discovery [44]) are expensive to obtain and thus very scarce. G-META can accelerate scientific discovery by learning from related sources and quickly adapting to a new, never-before-seen task of interest given only a handful of examples. For example, G-META can better annotate rare disease pathways by transferring knowledge from the mouse PPI network to the human PPI network [33]. Further, PPI networks of many species are highly incomplete; even for humans, less than 30% of all pairwise combinations of proteins have been tested for interaction so far [25]. Graph meta-learning can help label sparse networks by learning from a handful of other networks that are well-annotated.

**G-META can bring economic value.** While graphs have helped businesses make better products (e.g., [56, 26]), many graph-structured data resources remain under-utilized because of the sparsity and scarcity of labels. G-META can help tackle this problem.
For example, it can improve recommendations for a new set of products that have only a few user behavior records by transferring knowledge from previous product sets in user-product recommendation networks, or help a business expand to new locations by learning from the interconnected data about its current location.

**G-META can improve equality.** G-META can be used to facilitate the development of world regions by quickly learning from developed regions. While this has previously been successfully demonstrated on satellite image data [20], we see many new opportunities for graph data, such as identifying which transportation lines (links) should be prioritized in a rural region by learning from infrastructure networks of developed regions that were similarly rural before.

Finally, we briefly comment on potential risks of having a powerful graph meta-learning predictor. (1) Negative transfer: Meta-learning models are tricky to train and may vary across datasets and domains, which can lead to negative transfer. (2) Misuse of the methodology: Meta-learning is not universal; it works in specific settings. For example, G-META applies to few-shot settings. When labels are abundant, it might not yield considerable benefits. (3) Adversarial attacks: There are no studies on how adversarial attacks would affect meta-learning algorithms, and graph meta-learning algorithms in particular. We posit that in the few-shot setting it may be particularly easy to attack the graph, since each labeled example is vital for prediction.

## Acknowledgments and Disclosure of Funding

This work is supported, in part, by NSF grant nos. IIS-2030459 and IIS-2033384, and by the Harvard Data Science Initiative. The content is solely the responsibility of the authors. We thank Chengtai Cao, Avishek Joey Bose, and Xiang Zhang for helpful discussions.

## References

[1] Emily Alsentzer, Samuel G Finlayson, Michelle M Li, and Marinka Zitnik. Subgraph neural networks. NeurIPS, 2020.

[2] Antonina Andreeva, Eugene Kulesha, Julian Gough, and Alexey G Murzin. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Research, 48(D1):D376-D382, 2020.

[3] Avishek Joey Bose, Ankit Jain, Piero Molino, and William L Hamilton. Meta-Graph: few shot link prediction via meta learning. arXiv:1912.09867, 2019.

[4] Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.

[5] Jatin Chauhan, Deepak Nathani, and Manohar Kaul. Few-shot learning on graphs via super-classes based on graph spectral measures. In ICLR, 2020.

[6] Jie Chen and Yousef Saad. Dense subgraph extraction with application to community detection. IEEE Transactions on Knowledge and Data Engineering, 24(7):1216-1230, 2010.

[7] Mingyang Chen, Wen Zhang, Wei Zhang, Qiang Chen, and Huajun Chen. Meta relational learning for few-shot link prediction in knowledge graphs. In EMNLP-IJCNLP, Hong Kong, China, November 2019.

[8] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In KDD, pages 257-266, 2019.

[9] Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embeddings via diffusion wavelets. In KDD, pages 1320-1329, 2018.

[10] Sahibsingh A Dudani. The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, (4):325-327, 1976.

[11] Karoline Faust, Pierre Dupont, Jérôme Callut, and Jacques Van Helden. Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics, 26(9):1211-1218, 2010.
[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126-1135, 2017.

[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263-1272. JMLR.org, 2017.

[14] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.

[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, pages 1024-1034, 2017.

[16] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. RolX: structural role extraction & mining in large graphs. In KDD, pages 1231-1239, 2012.

[17] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv:2005.00687, 2020.

[18] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In WWW, pages 2704-2710, 2020.

[19] Kexin Huang, Cao Xiao, Lucas Glass, Marinka Zitnik, and Jimeng Sun. SkipGNN: Predicting molecular interactions with skip-graph networks. Scientific Reports, 10, 2020.

[20] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790-794, 2016.

[21] Thomas N Kipf and Max Welling. Variational graph auto-encoders. Bayesian Deep Learning Workshop at NIPS, 2016.

[22] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[23] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.

[24] Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Learning to propagate for graph meta-learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, NeurIPS, pages 1039-1050, 2019.

[25] Katja Luck, Dae-Kyum Kim, Luke Lambourne, Kerstin Spirohn, Bridget E Begg, Wenting Bian, Ruth Brignall, Tiziana Cafarelli, Francisco J Campos-Laborie, Benoit Charloteaux, et al. A reference map of the human binary protein interactome. Nature, 580(7803):402-408, 2020.

[26] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, pages 43-52, 2015.

[27] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730, 2018.

[28] Marion Neumann, Plinio Moreno, Laura Antanas, Roman Garnett, and Kristian Kersting. Graph kernels for object category prediction in task-dependent robot grasping. In Online Proceedings of the Eleventh Workshop on Mining and Learning with Graphs, pages 0-6, 2013.

[29] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv:1803.02999, 2018.

[30] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, pages 2014-2023, 2016.

[31] Nataša Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177-e183, 2007.

[32] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[33] Thomas Rolland, Murat Taşan, Benoit Charloteaux, Samuel J Pevzner, Quan Zhong, Nidhi Sahni, Song Yi, Irma Lemmens, Celia Fontanillo, Roberto Mosca, et al. A proteome-scale map of the human interactome network. Cell, 159(5):1212-1226, 2014.

[34] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, pages 1842-1850, 2016.

[35] Jürgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28(1):105-130, 1997.

[36] Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen, Yixue Li, and Hualiang Jiang. Predicting protein-protein interactions based only on sequences information. PNAS, 104(11):4337-4341, 2007.

[37] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4077-4087, 2017.

[38] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199-1208, 2018.

[39] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.

[40] Komal K Teru, Etienne Denis, and William L Hamilton. Inductive relation prediction by subgraph reasoning. ICML, 2020.

[41] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 2012.

[42] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, et al. Meta-Dataset: A dataset of datasets for learning to learn from few examples. ICLR, 2020.

[43] Ruo-Chun Tzeng and Shan-Hung Wu. Distributed, egocentric representations of graphs for detecting critical structures. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, ICML, pages 6354-6362, 2019.

[44] Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6):463-477, 2019.

[45] Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In ICLR, 2019.

[46] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77-95, 2002.

[47] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, NeurIPS, pages 3630-3638, 2016.

[48] Hongwei Wang and Jure Leskovec. Unifying graph convolutional neural networks and label propagation. arXiv:2002.06755, 2020.

[49] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pages 2022-2032, 2019.

[50] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv:1410.3916, 2014.

[51] Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, pages 6750-6759, 2019.
[52] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In ICML, pages 6861-6871, 2019.

[53] Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. One-shot relational learning for knowledge graphs. In EMNLP, Brussels, Belgium, 2018.

[54] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.

[55] Huaxiu Yao, Chuxu Zhang, Ying Wei, Meng Jiang, Suhang Wang, Junzhou Huang, Nitesh V Chawla, and Zhenhui Li. Graph few-shot learning via knowledge transfer. AAAI, 2020.

[56] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018.

[57] Rex Ying, Zhaoyu Lou, Jiaxuan You, Chengtao Wen, Arquimedes Canedo, and Jure Leskovec. Neural subgraph matching. arXiv:2007.03092, 2020.

[58] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNNExplainer: Generating explanations for graph neural networks. In NeurIPS, pages 9240-9251, 2019.

[59] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. GraphSAINT: Graph sampling based inductive learning method. In ICLR, 2020.

[60] Muhan Zhang and Yixin Chen. Weisfeiler-Lehman neural machine for link prediction. In KDD, pages 575-583, 2017.

[61] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In NeurIPS, pages 5165-5175, 2018.

[62] Fan Zhou, Chengtai Cao, Goce Trajcevski, Kunpeng Zhang, Ting Zhong, and Ji Geng. Fast network alignment via graph meta-learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 686-695. IEEE, 2020.

[63] Fan Zhou, Chengtai Cao, Kunpeng Zhang, Goce Trajcevski, Ting Zhong, and Ji Geng. Meta-GNN: On few-shot node classification in graph meta-learning. In CIKM, pages 2357-2360, 2019.

[64] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912-919, 2003.

[65] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457-i466, 2018.

[66] Marinka Zitnik, Marcus W Feldman, Jure Leskovec, et al. Evolution of resilience in protein interactomes across the tree of life. PNAS, 116(10):4426-4433, 2019.

[67] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190-i198, 2017.