# fewshot_relational_reasoning_via_connection_subgraph_pretraining__8e02921c.pdf Few-shot Relational Reasoning via Connection Subgraph Pretraining Qian Huang Stanford University qhwang@cs.stanford.edu Hongyu Ren Stanford University hyren@cs.stanford.edu Jure Leskovec Stanford University jure@cs.stanford.edu Few-shot knowledge graph (KG) completion task aims to perform inductive reasoning over the KG: given only a few support triplets of a new relation ./ (e.g., (chop, ./, kitchen), (read, ./, library)), the goal is to predict the query triplets of the same unseen relation ./, e.g., (sleep, ./, ?). Current approaches cast the problem in a meta-learning framework, where the model needs to be first jointly trained over many training few-shot tasks, each being defined by its own relation, so that learning/prediction on the target few-shot task can be effective. However, in real-world KGs, curating many training tasks is a challenging ad hoc process. Here we propose Connection Subgraph Reasoner (CSR), which can make predictions for the target few-shot task directly without the need for pre-training on the human curated set of training tasks. The key to CSR is that we explicitly model a shared connection subgraph between support and query triplets, as inspired by the principle of eliminative induction. To adapt to specific KG, we design a corresponding self-supervised pretraining scheme with the objective of reconstructing automatically sampled connection subgraphs. Our pretrained model can then be directly applied to target few-shot tasks on without the need for training few-shot tasks. Extensive experiments on real KGs, including NELL, FB15K-237, and Concept Net, demonstrate the effectiveness of our framework: we show that even a learning-free implementation of CSR can already perform competitively to existing methods on target few-shot tasks; with pretraining, CSR can achieve significant gains of up to 52% on the more challenging inductive few-shot tasks where the entities are also unseen during (pre)training. 1 Introduction Knowledge Graphs (KGs) are structured representations of human knowledge, where each edge represents a fact in the triplet form of (head entity, relation, tail entity) [10, 13, 17, 20]. Since KGs are typically highly incomplete yet widely used in downstream applications, predicting missing edges, i.e., KG completion, is one of the most important machine learning tasks over these large heterogeneous data structures. Deep learning based methods have achieved great success on this task [19, 21, 26], but the more challenging few-shot setting [23] is much less explored: Given a background knowledge graph, an unseen relation, and a few support edges in triplet form, the task is to predict whether this unseen relation exists between a query entity and candidate answers based on the background knowledge graph. Such a setting captures the most difficult and important case during KG completion: predict rare relations (i.e. appearing only a few times in the existing KG) and incorporating new relations into the KG efficiently. It also tests the inductive reasoning skill of the model on deriving new knowledge data-efficiently, which is critical for AI in general. indicates equal contribution. 36th Conference on Neural Information Processing Systems (Neur IPS 2022). Hypothesis Proposal Score = cosine similarity (B) Background KG Support set Sr' Query set Qr' (A) Few-shot Task (C) Connection Subgraph Reasoner Evidence Proposal Contextualize Shared hypotheses Figure 1: The few-shot KG completion problem includes (A) few-shot task that aims to learn a new relation (purple) and (B) background knowledge graph. Our CSR framework (C) first contexualizes all triplets in the background KG, then finds the shared hypothesis in the form of a connection subgraph using the Hypothesis Proposal module, and finally tests whether there is an evidence close enough to the hypothesis using Evidence Proposal module. In general all edges shown have different relation types, but here we only highlight ones in the connection subgraph with colors. Existing approaches to few-shot KG completion [3, 18, 23, 24] typically adopt the meta-learning framework [8], where the model is trained over a meta-training set consisting of many few-shot tasks created from different relations in the background knowledge graph. GMatching [23], FSRL [24] and Att-LMetric [18] are metric-based meta-learning methods that try to learn a good metric where positive query pairs are closer to representation of edges in the support set than the negative ones. Meta R [3] is an optimization-based meta-learning method that use a meta-learner to improve the optimization of the task learner, such that the task learner can quickly learn with only few examples. However, creating the meta-training set for some unknown few-shot tasks in test time is a very difficult ad hoc process in practice. On existing benchmarks [23], the training few-shot tasks and the target few-shot tasks are both randomly sampled from relations with least occurrences in the full knowledge graph, meaning the training and target relations are from the same distribution. But in reality, one has no information about the target few-shot tasks and the meta-training needs to be manually constructed out of the background knowledge graph. This is challenging since background knowledge graph often has a limited number of tasks due to the limited number of relations; creating too many meta-training tasks out of the background KG may remove a large number of edges from the KG, making it sparse and hard to learn over. Moreover, with a small meta-training set, the target few-shot tasks are very likely out of the curated meta-training set distribution, since the novel relation could be more complicated than known ones and the entities involved the target few-shot tasks can also be unseen. This then makes meta-learning based method suffer negative transfer due to distribution shift. Thus, having a method that can perform well on any novel few-shot tasks without relying on specifically designed meta-training set is crucial for real-world applications. Here we propose a novel modeling framework Connection Subgraph Reasoner (CSR) that can make prediction on the target few-shot task directly without the need for meta-learning and creation of a curated set of training few-shot tasks. Our insight is that a triplet of the unseen relation of interest can be inferred through the existence of a hypothesis in the form of a connection subgraph, i.e. a subgraph in KG that connect the two entities of the triplet. Intuitively, the connection subgraph represents the logical pattern that implies the existence of the triplet. For the (chop, ./, kitchen) example, such a connection subgraph that implies ./ is a two hop path in KG: {(chop, can be done with, knife), (knife, is located at, kitchen)}. This insight allows us to cast the few-shot link prediction as an inductive reasoning problem. Following the eliminative induction method of inductive reasoning [9], our framework first recovers this hypothesis from the support triplets by finding the connection subgraph approximately shared among the support triplets, then tests whether this hypothesis is also a connection subgraph between the query entity and a candidate answer. We show the full pipeline along with an example of connection subgraph in Figure 1. To better adapt to specific KG, we design a novel encoder-decoder architecture based on graph neural networks (GNN) to implement the two stages and a corresponding self-supervised pretraining scheme to reconstruct diverse connection subgraphs. We demonstrate that a training-free implementation of CSR via edge mask optimization can already discover the connection subgraph and reach link prediction performance competitive to many metalearning methods over real-world knowledge graphs. With pretraining and optionally meta-learning over background KG uniformly, our method achieves high performance on both transductive and inductive few-shot test tasks that involve long tails relations, which are out of distribution to the training tasks; while existing methods using meta-learning suffers from distribution shift and cannot handle inductive tasks. Over real KGs including NELL [13], FB15K-237 [20], and Concept Net [17], our method consistently exceeds or matches state-of-the-art methods in meta-training tasks free setting, and far exceeds the best existing methods by up to 52 % in the the more challenging inductive few-shot tasks where entities in the target few-shot tasks are also unseen. The implementation of CSR can be found in https://github.com/snap-stanford/csr. 2 Related Work 2.1 Few-shot Relational Learning via Meta-Learning Meta-learning is a paradigm of learning across a set of meta-training tasks and then adapting to a new task during meta-testing [8]. To the best of our knowledge, all existing methods on few-shot KG completion follow the meta-learning paradigm to address the data scarcity in the target few-shot task [3, 18, 23, 24]. Therefore, these methods require the access to a meta-training set that contains many few-shot KG completion tasks for training. On the two existing benchmarks NELL-One and Wiki-One[23] the meta-training set is constructed by sampling from long tail relations, in the same way as the target few-shot tasks are constructed. However, such a meta-training set is not given in real world application and needs to be manually constructed out of the background knowledge graph G to mimic the actual few-shot task during test time. This curation is inherently challenging because that the background knowledge graph has a limited number of relations/tasks in G , the distribution of the novel relations of interest is unknown, and the entites in the target few-shot tasks can be unseen in the background KG. In this paper, we develop a more general pretraining procedure to remove the dependency on manually created training tasks. 2.2 Few-shot Learning via Pretraining It has been shown in natural language processing [2, 16] and computer vision [4, 5, 7] domains that large-scale self-supervised pretraining can significantly improve task-agnostic few-shot learning ability. One of the most successful pretraining objectives is predicting the next token or image patch given ones seen before it. However, how to design such powerful pretraining objectives for few-shot relational learning is still under-explored. In this work, we design a well motivated self-supervised pretraining objective, i.e. recovering diverse connection subgraphs that correspond to different inductive hypothesis. We show that such a pretraining scheme can significantly improve few-shot relational learning tasks on knowledge graphs. 3 Few-shot KG Completion Few-shot KG completion is defined as follows [3, 23]: Denote the background KG that represents the known knowledge as G = (E, R, T ), where E and R represents the set of entities and relations. T = {(h, r, t)|h, t 2 E, r 2 R} represents the facts as triplets. Given a new relation r0 62 R and a support set Sr0 = {(hk, r0, tk)|hk 2 E}K k=1, we want to make predictions over a query set Qr0 = {(hj, r0, ?)|hj 2 E}J j=1. This prediction on (hj, r0, ?) is typically converted to scoring triplet (hj, r0, e) for all candidate entities e then ranking the scores. So we will proceed to consider Qr0 as directly containing full triplets (hj, r0, e) to score. We call this a K-shot KG completion task, typically the number of support is a small number (K 5). Note that existing works generally assume the entities in the few-shot tasks (support + query set) belong to the background KG. However, in real world cases, the goal of few-shot KG completion is to simulate learning of novel relations that may involve new entities not exist yet on the KG. Thus, in this paper we also consider a more challenging inductive setting where entities in the few-shot tasks do not belong to the entity set E, but new triplets about these unseen entities can be added at test time. 4 Connection Subgraph Reasoner In this section, we first discuss our main motivation from the inductive reasoning perspective and present the general framework based on it. Then we introduce both learning-free and learning-based implementations of this framework. 4.1 Inductive Reasoning Inductive Reasoning refers to the reasoning process of synthesizing a general principle from past observations, and then using this general principle to make predictions about future events [9]. Few-shot link prediction task can be seen as an inductive reasoning task with background knowledge. The key motivation of our work is eliminative induction, one of the principled methods used to reach inductive conclusions. Specifically, we consider the scientific hypothesis method: eliminating hypotheses inconsistent with observations. In the context of few-shot link prediction task, we explicitly try to find a hypothesis consistent with all examples in the support set, then test whether the the query is consistent with this hypothesis. To illustrate a simple case of this, we use the ./ example where the support triplets are (chop, ./, kitchen), (read, ./, library), and query triplet is (sleep, ./, ?). From a background KG (e.g.Concept Net), we can know a lot of knowledge in forms of triplets about these entities, such as (kitchen, is part of, a house) and (read, is done by, human) etc. We essentially want to find an induction hypothesis that explains how chop is related to kitchen in the same way that read is related to library. In other words, we want to find the shared connection pattern over the background KG that connects both two pairs of entities. In this case, we can observe that there is a simple shared 2 hop connection path that connects both pairs: {(chop, can be done with, knife), (knife, is located at, kitchen)} (1) {(read, can be done with, book), (book, is located at, library)} (2) The abstracted inductive hypothesis consistent with both examples in the support set is then 9Z, (hc, can be done with, Z) (Z, is located at, tc) =) (hc, ./, tc). (3) This hypothesis can then be used to deduce that (sleep, ./, bedroom) has a high score, since we know {(sleep, can be done with, bed), (bed, is located at, bedroom)} from the background KG. More generally, the shared connection pattern can be graph structured instead of a two-hop path, which then form a connection subgraph between the two end entities instead of a connection path (Figure 1). Here we define the connection subgraph: Let G0 = (E0, R0, T 0) be any subgraph of the background KG G, (i.e., E0 E, R0 R and T 0 T ) that satisfies the following requirement for a given pair of nodes (hc, tc) on the KG. (1) hc 2 E0 and tc 2 E0; (2) there is no disconnected component. We define the connection subgraph GC of (hc, tc) to be any such G0 where we further ignore the node identity. The key insight is that we should only consider the relation structure patterns and abstract away the node identity in order to construct a hypothesis. Then a hypothesis like Eq. 3 can be represented as a connection subgraph GC by interpreting each clause as an edge. And such a hypothesis is consistent with a support/query triplet (h, r, t) if GC is a connection subgraph of h, t. In terms of the ./ example, the triplet (sleep, ./, bedroom) is consistent with the hypothesis Eq. 3 because the connection subgraph form of the hypothesis (hc, can be done with, Z) (Z, is located at, tc) is a connection subgraph between sleep and bedroom, with hc, Z, tc corresponding to sleep, bed and bedroom respectively. Given a pair of node, we further call the KG subgraph with node identity an evidence that witnesses why the hypothesis is consistent with a connection subgraph of h, t;. Note the key difference between hypothesis and evidence is that, different pair of nodes may share the same hypothesis but each may have different evidence to support the consistency with such a hypothesis. As an example, for both (sleep, ./, bedroom) and (chop, ./, kitchen), the hypothesis is the same (hc, can be done with, Z) (Z, is located at, tc). Yet the evidences are respectively {(sleep, can be done with, bed), (bed, is located at, bedroom)} and {(chop, can be done with, knife), (knife, is located at, kitchen)} Note that this hypothesis formulation can be seen as a generalization of the typical path structured logic rules considered by the multi-hop reasoning [11, 12, 22] and rule induction [6, 15] literature. However, even with our graph structured hypothesis so far, such a strict way of testing the consistency with a hypothesis by checking of whether GC is a connection subgraph of h, t exactly would be time-consuming and brittle due to the fuzziness of target relations and the KG incompleteness. Below we develop our framework that relaxes the hypothesis representation and consistency testing criteria. 4.2 General Framework Based on the above motivation, we design the following general framework Connection Subgraph Reasoner (CSR) that includes 2 main modules: hypothesis proposal module Mp and evidence proposal module Me. CSR then has the following 3 components: (1) Triplet Contextualization. We first contextualize each triplet (h, r0, t) in support set Sr0 and query set Qr0 by retrieving its contextualized graph G(h, t) G such that it contains h, t and captures most of the immediately relevant information about the pair. There are several prior works on retrieving smaller subgraph around a triplet in KG for link prediction [19, 25]. We choose to use the enclosing subgraph proposed by Grail [19], which is the subgraph induced by all nodes that are in the k hops neigborhood of both h, t. We generally use k = 1, 2 depends on the density of KG. We also supplement with random sampling of the neighbors of h, t in case the enclosing subgraph itself is too small. We call the contextualized graphs of support triplets support graphs, and the contextualized graphs of query triplets query graphs. (2) Hypothesis Proposal. After contextualization, we would like to find the hypothesis consistent with all support graphs. We use a hypothesis proposal module Mp to generate a hypothesis from each support graph such that they are most similar to each other. Each hypothesis can be represented as a soft edge mask m : [0, 1]E over edges in the corresponding support graph. To aggregates these hypotheses to produce the embedding of one final hypothesis b, we take the mean of their GNN embedding produced by a graph encoder Menc. i=1 = Mp({G(hi, ti)|(hi, r0, ti) 2 Sr0}) Menc(G(hi, ti), mi) Mp should compare all support graphs and output masks that represent the largest common connection subgraph, i.e. s.t. 8i, j 2 1...K, s(Menc(G(hi, ti), mi), Menc(G(hj, tj), mj)) > 1 (5) (3) Hypothesis Testing. Finally, we want to test whether each query graph is consistent with the proposed hypothesis. We uses evidence proposal module Me to take in b and a query graph to output the closest evidence to the hypothesis represented by b. The score of the query is then the cosine similarity between b and the embedding of the evidence. mq = Me(b, G(hq, tq)) score = s(b, Menc(G(hq, tq), mq)) (6) Me is intended to output mq such that s(Menc(G(hq, tq), mq), b) > 1 (7) Overall, we model hypothesis softly as the aggregation of approximately shared hypothesis connection subgraphs between support graphs as found by Mp, then test its probabilistic consistency with query graph by finding how close an evidence for this hypothesis can be found in the query graph by Me. For the Menc, we use the alternative message passing in Path Con [21] so that that it does not use entity embedding as input can measures graph isomorphism : v = 1 1 + P (v = t) (9) e = σ(([sv, su, se]) W i + bi), u, v 2 N(e) (10) Menc(G, m) = max_pool(a L where L is number of layers and s1 e is random or pretrained relation embedding. Note that we concatenate the head and tail representation in the final graph representation so that the hypothesis and evidence subgraph are compared with heads and tails matched already. For Mp and Me, we introduce two specific implementations that satisfies the requirements Eq. 5 and 7 in the following sections. 4.3 CSR-OPT: Learning-free Implementation We implement Mp and Me as a learning free optimization processes, where the masks are optimized toward Eq. 5 and 7 directly. For Mp, we formulate it as a constraint optimization problem: Mp({G(hi, ti)|(hi, r0, ti) 2 Sr0}) = arg max i=1) λ H(mq) (12) s(Menc(G(hi, ti), mi), Menc(G(hj, tj), mj)) > 1 0, connectivity(mi) > 1 0 Here we add a constraint on the connectivity that measures whether the nodes in the subgraph represented by mi can be reachable from head or tail within 2 hops in the subgraph. Let A be a soft adjacency matrix of the edge mask mi 2 [0, 1]|E|. We compute whether two nodes i, j can reach each other within 2 hops as Ri,j = min((I + A + A2)[i, j], 1), then connectivity(mi) = 1 |E| mie min(Rn,hi + Rn,ti + Rn0,hi + Rn0,hi, 1) (13) We also add the entropy regularization terms H( ) to force the mask to represent a valid subgraph. Similarly, for Me: T(b, G(hq, tq)) = arg max s(b, Menc(G(hq, tq), mq)) λ H(mq) (14) We optimize each objective independently using gradient descent, but do not train them together end to end. The Menc remains random initialized, but we found that this random GNN is already powerful enough to distinguish non-isomorphic graphs as needed. Although this implementation is straight forward and learning free, it is slow and difficult to incorporate additional training. Nevertheless, it serves as a good way for directly verifying our hypothesis on the framework design. 4.4 CSR-GNN: Learning-based Implementation and Pretraining Scheme We also design a fully GNN-based encoder-decoder approach to implement Mp and Me. We use two models as building blocks: encoder f ENC(G1, m1) : G R|E| ! Rd that encodes graph G1 weighted by m1 to an embedding b, and decoder f DEC(G2, b) : G Rd ! R|E| that decodes out the weight m2 such that graph G2 weighted by m2 corresponds to the input embedding b. Intuitively, f DEC(G2, f ENC(G1, m1))) compares G1, G2 and finds the subgraph of G2 that is closest to the subgraph of G1 induced by m1. We then implement Mp as an iterative process of comparing between all pairs of support graphs using this encoder-decoder, as shown in Algorithm 1. We start with a full edge mask mi = 1 for each graph Gi. During each iteration, each graph Gj obtains edge mask mjk as a result of comparing against Gk weighted by mk. Gj then takes the shared parts between mjk by taking an element wise minimum to obtain mj. Algorithm 1 Hypothesis Proposal Module Mp of CSR-GNN Require: n support graphs G1, . . . , Gn 1: Initialize the masks of all support graphs to be all ones: mi = 1, 8i 2 1, . . . , n 2: for iter 1, . . . do 3: for j 1, . . . , n do 4: for k 1, . . . , n do 5: mjk = f DEC(Gj, f ENC(Gk, mk)) 6: end for 7: mj = mink mjk 8: end for 9: end for 10: 11: return [m1, . . . , mn] Algorithm 2 CSR-GNN Full Architecture Require: n support graphs G1, . . . , Gn, query graph Gq 1: m1, . . . , mn = Mp(G1, . . . , Gn) 2: Obtain subgraph embedding using the encoder: gi = f ENC(Gi, mi) 3: Average subgraph embedding of the supporting graphs b = 1 i gi 4: Decode masks using g from the query graph Gq: mq = f DEC(Gq, b) 5: Obtain subgraph embedding for the query graph: gq = f ENC(Gq, mq) 6: 7: return cosine_similarity(gq, b) We use f DEC( ) directly as Me. Combining these two steps, the full architecture is shown in 2, where the encoder f ENC( ) is shared with Menc. We use the same Path Con atchitecture for f DEC( ) as for Menc, except that we concatenate b to the input edge embeddings before the first layer. Encoder-Decoder Pretraining To train this encoder-decoder architecture, we simply use reconstruction: given a graph G and a randomly sampled mask m 2 R|E|, we should obtain back the same mask m after applying both encoder and decoder (Figure 2) Lossrecon = CE(m, f DEC(G, f ENC(G, m))) (15) To jointly train together with Menc, we also add contrastive loss g = Menc(G, m) (16) gpos = Menc(G, f DEC(G, f ENC(G, m))) (17) gneg = Menc(G0, f DEC(G0, f ENC(G, m))) (18) Losscontrast = max(s(gpos, g) s(gneg, g) + γ, 0) (19) During pretraining, we sample graph G and G0 by sampling random triplets per relation and contextualize them. G, G0 should correspond to different relations. Then we sample m that represents mutiple random paths connecting to either head or tail in G. 5 Experiments We evaluate our method CSR-OPT and CSR-GNN on few-shot KG completion tasks over three real world KGs NELL, FB15K-237 and Concept Net. We take NELL directly from NELL-One [23] but Sample Edge Masks Reconstruct Edge Masks Optimize Reconstruction Loss Figure 2: Pretraining of Connection Subgraph Reconstruction. Table 1: Statistics of the benchmark datasets. BG refers to the background KG available during training time. NELL FB15K-237 Concept Net #rels #entities #edges #tasks #rels #entities #edges #tasks #rels #entities #edges #tasks Trans-BG 291 68544 181109 11 200 14543 268039 30 14 790703 2541996 2 Ind-BG 291 44005 82318 - 200 11290 112477 - 14 619163 1191782 - Ind-Test 291 24539 98791 11 200 3253 155562 10 14 171540 1350214 2 add different settings; we select the fewest appearing relations as target few-shot tasks in FB15K-237 and Concept Net following [12, 23]. We summarize the statistics of all three datasets in Table 1. We also created one synthetic dataset for explanatory purpose. Across all four KGs, (1) we first evaluate our method against the previous methods on real KGs in terms of few-shot learning performance without the pre-designed training few-shot tasks set, on both transductive and inductive settings; (2) we perform extensive ablation studies on each component of our method and show that every component are indispensable in our framework; (3) we show that with a manually curated meta-train set, meta-learning based method suffers from negative transfer when facing a different target test set; (4) on the synthetic dataset, we further show that our method is able to recover complex hypothesis and evidence connection subgraphs during the prediction. We consider state-of-the-art few-shot KG completion baselines Meta R and FSRL in terms of the standard ranking metrics MRR and Hits@h, where we sample 50 negative tail candidates for each query triplet and rank them together with the positive tail entity. Hits@h measures the percentage of times that the positive tail is ranked higher than h among the negative tail candidates. We use h = 1, 5, 10. For simplicity, we only consider the number of few shot example K = 3, even though all methods here can generalize to arbitrary K. See appendix A for more details on full experiment setups. Since the methods have low variance in general, bellow we omit the standard deviations in the table and provide them instead in appendix A. 5.1 Few-shot Learning without Curated Training Tasks We first evaluate our method CSR-OPT and CSR-GNN along with baselines on the three real KGs without curated training tasks. This means that on NELL we do not use the meta-train split originally provided in NELL-One. To adapt Meta R and FSRL to this setting, we use pretrained entity and relation embeddings as in the original papers, but meta-train them on randomly sampled tasks from the background KG instead. Since our method is designed to not include entity embeddings, we add it in by concatenating the head and tail entity embedding to the representation produced by Menc in CSR-GNN. Similarly, we also perform end-to-end finetuning on the same set of randomly sampled tasks in addition to our pretraining objectives. See ablations for these modifications in Section 5.2. 5.1.1 Transductive Setting As shown in Table 2, we demonstrate that our learning-free method CSR-OPT can already give competitive performance without any training over the real dataset. With CSR-GNN, we can pretrain over these real KG to achieve higher performance exceeding/competitive to meta learning results: CSR-GNN gives 17.8% improvement of MRR over the second best method on NELL, 5% improvement on Concept Net and comes close second on FB15k-237. On FB15K-237, the graph is much denser than NELL and Concept Net so that the pretrained entity embeddings could already capture most relational structures when predicting the query triples. However, this is only limited to the transductive setting, where all entities are seen during pretraining. 5.1.2 Inductive Setting We evaluate the same set of methods in inductive version of these three datasets, where all entities involved in the testing few-shot task and their one hop neigbors are unseen in the background knowledge graph. In this setting, we do not use entity embedding for our methods, and pick the best performance for baselines between using and not using entity embedding. As shown in table 5, our methods only drops slightly in performance than in the transductive setting comparing to baselines, resulting in significantly larger gains of up to 52% in this more challenging setting. This is because our architecture is designed entirely based on topological rule and does not rely on entity embeddings, while the performance of baselines in the transductive setting rely heavily on entity embedding. To Table 2: Performance comparison on transductive few-shot tasks without curated training tasks MRR Hits@1 Hits@5 Hits@10 Meta R 0.471 0.322 0.647 0.763 FSRL 0.490 0.327 0.695 0.853 CSR-OPT 0.463 0.321 0.629 0.760 CSR-GNN 0.577 0.442 0.746 0.858 Meta R 0.805 0.740 0.881 0.937 FSRL 0.684 0.573 0.817 0.912 CSR-OPT 0.619 0.512 0.747 0.824 CSR-GNN 0.781 0.718 0.851 0.907 Concept Net Meta R 0.318 0.226 0.390 0.496 FSRL 0.577 0.469 0.695 0.753 CSR-OPT 0.559 0.450 0.692 0.736 CSR-GNN 0.606 0.496 0.735 0.777 Table 3: Inference time on NELL (inductive setting). NELL (inductive) Meta R FSRL CSR-GNN Inference Time (s) 5.52 14.24 17.50 MRR 0.355 0.180 0.511 Table 4: Inference time on NELL (transductive setting). NELL (transductive) Meta R FSRL CSR-GNN Inference Time (s) 5.50 14.49 18.33 MRR 0.471 0.490 0.577 further demonstrate this, we show in Appendix B that our methods achieve similar large gain of up to 52% under transductive setting when all entity embedding are randomized during testing, equivalent to an extreme inductive setting. Table 5: Performance comparison on inductive few-shot tasks without curated training tasks MRR Hits@1 Hits@5 Hits@10 Meta R 0.353 0.191 0.517 0.820 FSRL 0.180 0.090 0.242 0.360 CSR-OPT 0.425 0.303 0.534 0.657 CSR-GNN 0.511 0.348 0.725 0.837 Meta R 0.315 0.143 0.506 0.896 FSRL 0.453 0.299 0.571 0.922 CSR-OPT 0.554 0.429 0.727 0.844 CSR-GNN 0.624 0.479 0.833 0.894 Concept Net Meta R 0.154 0.041 0.260 0.452 FSRL 0.402 0.233 0.603 0.740 CSR-OPT 0.547 0.425 0.726 0.740 CSR-GNN 0.611 0.496 0.729 0.786 5.1.3 Inference Time We have also measured the inference time of our method and baselines on NELL. As shown in Table 3 and 4, we find that our method has comparable inference runtime compared with state-of-the-art baselines FSRL but it s slower than Meta R. The reason is that for each query triplet, our model needs to use the evidence proposal module to decode the edge masks and obtain the embeddings for the connection subgraphs. Compared with Meta R, which directly uses shallow KG embeddings to score each query triplet in Trans E style, CSR achieves much better empirical performance and also allows inductive few-shot KG link prediction. Table 6: Ablative results on NELL. Entity Embedding Finetuning Hypothesis Proposal Evidence Proposal MRR " " " " 0.577 7 " " " 0.511 7 7 " " 0.466 7 7 7 " 0.441 7 7 " 7 0.436 7 7 7 7 0.420 5.2 Ablation Study Here we conduct ablation study on each component of our proposed model CSR-GNN on the NELL dataset. Specifically we consider the four components of our models. It includes (1) whether we use the pretrained KG entity embeddings or not; (2) whether we perform additional finetuning of our model; (3) whether we perform the hypothesis proposal as shown in Algorithm 1, an alternative is to directly assign full one masks for m1, . . . , mn without detecting the common subgraph; (4) whether we perform the evidence proposal for the query triplets, if not, we will propose a full one mask for the query graph. As shown in Table 6, each component in our model is indispensable. Due to our model architecture design, without entity embeddings, MRR decreases from 0.577 to 0.511, however it s still much higher than state-of-the-art baselines with pretrained entity embeddings (0.471 for Meta R). Next we show that the two proposed stages Hypothesis Proposal and Evidence Proposal play a key role. Without either of them, it means that our model can no longer accurately perform eliminative induction over the support or the query triples, which may significantly deteriorates the performance. 5.3 Robustness to Distribution Shift To demonstrate the distribution shift problem when constructing meta-training set, we construct a new set of test few-shot learning tasks and compare the meta-learning based methods performance on this new test set against performance on the original test set when using the same meta-training set. Specifically, on the existing NELL dataset, the meta-training and test tasks/relations are sampled from the same long-tail distribution. Here we sample the new set of test few-shot learning tasks by randomly sampling the 10 test relations from the whole background KG, so that there exists a gap between the training tasks and the test tasks on the new set. Comparing ours with the state-of-the-art FSRL, we achieve comparable MRR on both test the original and our new challenging tasks (0.590 vs 0.540) while FSRL suffers greatly from the distribution gap (0.578 vs 0.459). This demonstrates the robustness of our method in handling different distribution of test tasks. 5.4 Synthetic Dataset For explanatory purpose, we construct synthetic datasets that strictly follow our assumptions so that we have the ground truth of the hypothesis connection subgraphs. In each task, all support graphs contain a shared connection graph, and the query is only True if the query graph also contains the same connection graph. We show that CSR-OPT can recover hypothesis and evidence connection subgraphs with high IOU of 0.843 and 0.992 when the hypothesis is in the form of a 4 clique. With direct supervision of ground truth hypothesis during training, CSR-GNN can also achieve high IOU of 0.809 and 0.981. See full details in Appendix C. 6 Conclusion In this paper we proposed CSR, a general framework for few-shot relational reasoning over knowledge graphs using self-supervised pretraining. Based on eliminative induction, we model hypothesis as the shared connection subgraph between the support triplets and predict the query triplets by checking evidence of the connection subgraph in the query graph. Our method achieves state-of-the-art performance across multiple datasets on few-shot link prediction without curated training tasks. Acknowledgments and Disclosure of Funding We thank Yuhuai Wu for discussions and for providing feedback on our manuscript. We also gratefully acknowledge the support of DARPA under Nos. HR00112190039 (TAMI), N660011924033 (MCS); ARO under Nos. W911NF-16-1-0342 (MURI), W911NF-16-1-0171 (DURIP); NSF under Nos. OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940 (Expeditions), NIH under No. 3U54HG010426-04S1 (Hu BMAP), Stanford Data Science Initiative, Wu Tsai Neurosciences Institute, Amazon, Docomo, GSK, Hitachi, Intel, JPMorgan Chase, Juniper Networks, KDDI, NEC, and Toshiba. Hongyu Ren is supported by Masason Foundation Ph D fellowship, Apple Ph D fellowship and Baidu Ph D scholarship. Qian Huang is supported by Open Philanthropy AI fellowship. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding entities. [1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparam- eter optimization framework. In KDD, 2019. [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. J. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. Mc Candlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. Ar Xiv, abs/2005.14165, 2020. [3] M. Chen, W. Zhang, W. Zhang, Q. Chen, and H. Chen. Meta relational learning for few-shot link prediction in knowledge graphs. In EMNLP, 2019. [4] A. Chowdhury, M. Jiang, and C. Jermaine. Few-shot image classification: Just use a library of pre-trained feature extractors and a simple classifier. In ICCV, 2021. [5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. Ar Xiv, 2021. [6] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, 2013. [7] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord. Boosting few-shot visual learning with self-supervision. In ICCV, 2019. [8] T. M. Hospedales, A. Antoniou, P. Micaelli, and A. J. Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 2021. [9] D. Hunter. No wilderness of single instances: inductive inference in law. Journal of Legal Education, 1998. [10] S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. S. Yu. A survey on knowledge graphs: Repre- sentation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 2022. [11] X. V. Lin, R. Socher, and C. Xiong. Multi-hop knowledge graph reasoning with reward shaping. In EMNLP, 2018. [12] X. Lv, Y. Gu, X. Han, L. Hou, J. Li, and Z. Liu. Adapting meta knowledge graph information for multi-hop reasoning over few-shot relations. In EMNLP-IJCNLP, 2019. [13] T. M. Mitchell, W. W. Cohen, E. R. Hruschka, P. P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. A. Platanios, A. Ritter, M. Samadi, B. Settles, R. C. Wang, D. Wijaya, A. K. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-ending learning. Communications of the ACM, 2015. [14] J. C. Platt and A. H. Barr. Constrained differential optimization. In Neur IPS, 1987. [15] M. Qu, J. Chen, L.-P. Xhonneux, Y. Bengio, and J. Tang. {RNNL}ogic: Learning logic rules for reasoning on knowledge graphs. In ICLR, 2021. [16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019. [17] R. Speer, J. Chin, and C. Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017. [18] J. Sun, Y. Zhou, and C. Zong. One-shot relation learning for knowledge graphs via neighbor- hood aggregation and paths encoding. Transactions on Asian and Low-Resource Language Information Processing, 2022. [19] K. K. Teru, E. Denis, and W. L. Hamilton. Inductive relation prediction by subgraph reasoning. In ICML, 2020. [20] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon. Representing text for joint embedding of text and knowledge bases. In EMNLP, 2015. [21] H. Wang, H. Ren, and J. Leskovec. Relational message passing for knowledge graph completion. In KDD, 2021. [22] W. Xiong, T.-L.-G. Hoang, and W. Y. Wang. Deeppath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, 2017. [23] W. Xiong, M. Yu, S. Chang, X. Guo, and W. Y. Wang. One-shot relational learning for knowledge graphs. In EMNLP, 2018. [24] C. Zhang, H. Yao, C. Huang, M. Jiang, Z. J. Li, and N. Chawla. Few-shot knowledge graph completion. In AAAI, 2020. [25] M. Zhang and Y. Chen. Link prediction based on graph neural networks. In Neur IPS, 2018. [26] Z. Zhu, Z. Zhang, L.-P. Xhonneux, and J. Tang. Neural bellman-ford networks: A general graph neural network framework for link prediction. In Neur IPS, 2021. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] We described limitation on dense KG such as FB15K-237 briefly in the experiment section. See Appendix for more discussion (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Appendix. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A] 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experi- mental results (either in the supplemental material or as a URL)? [Yes] in supplemental material. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix. (c) Did you report error bars (e.g., with respect to the random seed after running experi- ments multiple times)? [Yes] See Appendix. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [No] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]