# Revisiting Link Prediction: A Data Perspective

Published as a conference paper at ICLR 2024

Haitao Mao¹, Juanhui Li¹, Harry Shomer¹, Bingheng Li², Wenqi Fan³, Yao Ma⁴, Tong Zhao⁵, Neil Shah⁵ and Jiliang Tang¹
¹Michigan State University ²The Chinese University of Hong Kong, Shenzhen ³Hong Kong Polytechnic University ⁴Rensselaer Polytechnic Institute ⁵Snap Inc.
{haitaoma,lijuanh1,shomerha,tangjili}@msu.edu, libingheng@cuhk.edu.cn, wenqi.fan@polyu.edu.hk, may13@rpi.edu, {tzhao, nshah}@snap.com
Work was partially done while the author was a research assistant at The Hong Kong Polytechnic University.

ABSTRACT

Link prediction, a fundamental task on graphs, has proven indispensable in various applications, e.g., friend recommendation, protein analysis, and drug interaction prediction. However, since datasets span a multitude of domains, they could have distinct underlying mechanisms of link formation. Evidence in existing literature underscores the absence of a universally best algorithm suitable for all datasets. In this paper, we endeavor to explore principles of link prediction across diverse datasets from a data-centric perspective. We recognize three fundamental factors critical to link prediction: local structural proximity, global structural proximity, and feature proximity. We then unearth relationships among those factors: (i) global structural proximity only shows effectiveness when local structural proximity is deficient; (ii) there is an incompatibility between feature and structural proximity. This incompatibility leads to GNNs for Link Prediction (GNN4LP) consistently underperforming on edges where the feature proximity factor dominates. Inspired by these new insights from a data perspective, we offer practical instruction for GNN4LP model design and guidelines for selecting appropriate benchmark datasets for more comprehensive evaluations.

1 INTRODUCTION

Graphs are essential data structures that use links to describe relationships between objects. Link prediction, which aims to find missing links within a graph, is a fundamental task in the graph domain. Link prediction methods aim to estimate proximity between node pairs, often under the assumption that similar nodes are inclined to establish connections. Originally, heuristic methods (Zhou et al., 2009; Katz, 1953) were proposed to predict link existence by employing handcrafted proximity features that capture important data factors, e.g., local structural proximity and feature proximity. For example, the Common Neighbors (CN) algorithm (Zhou et al., 2009) assumes that node pairs whose one-hop neighborhoods overlap more are more likely to be connected. To remove the need for handcrafted features, deep neural networks are utilized to automatically extract high-quality proximity features. In particular, Graph Neural Networks (GNNs) (Kipf & Welling, 2017; 2016; Hamilton et al., 2017) have become increasingly popular owing to their excellence in modeling graph data. Nonetheless, vanilla GNNs fall short in capturing pairwise structural information (Zhang et al., 2021; Liang et al., 2022), e.g., neighborhood-overlap features, achieving only modest performance in link prediction. To address these shortcomings, Graph Neural Networks for Link Prediction (GNN4LP) (Zhang & Chen, 2018; Wang et al., 2022; Chamberlain et al., 2023) are proposed to incorporate different inductive biases revolving around pairwise structural information.
New designs on GNN4LP models strive to improve vanilla GNNs to capture diverse pairwise data patterns, e.g., local structural patterns (Yun et al., 2021; Wang et al., 2023), the number of paths (Zhu et al., 2021b), and structural position (Zhang & Chen, 2018). These models have found wide applicability across a myriad of real-world graph problems from multiple domains, e.g., paper recommendation, drug interaction prediction, and protein analysis (Kovács et al., 2019; Hu et al., 2020). A recent benchmark (Li et al., 2023) evaluates the performance of GNN4LP models on datasets from diverse domains and finds a performance disparity: there is no universally best-performing GNN4LP model, and even vanilla GCN can achieve the best performance on certain datasets. AbuOda et al. (2020) and Chakrabarti (2022) reveal similar phenomena across heuristic algorithms. We conjecture the main reasons for such phenomena are that (i) from a model perspective, different models often have preferred data patterns due to their distinct capabilities and inductive biases, and (ii) from a data perspective, graphs from different domains could originate from distinct underlying mechanisms of link formation. Figure 1 illustrates this disparity in the number of CNs on multiple benchmark datasets¹. Notably, edges in the OGBL-PPA and OGBL-DDI datasets tend to have many CNs. Considering both model and data perspectives, the performance disparity becomes evident: certain models perform well when their preferred data patterns align with the data mechanisms of particular datasets, but others do not. This suggests that both model and data perspectives are significant to the success of link prediction. While mainstream research focuses on designing better models (Zhang & Chen, 2018; Zhang et al., 2021), we opt to investigate a data-centric perspective on the development of link prediction. Such a perspective can provide essential guidance on model design and benchmark dataset selection for comprehensive evaluation.

Figure 1: Distribution disparity of Common Neighbors across datasets (bucketed CN counts [0-1), [1-3), [3-10), [10-25), [25-inf) for Pubmed, ogbl-collab, ogbl-ppa, and ogbl-ddi).

To analyze link prediction from a data-centric perspective, we must first understand the underlying data factors across different datasets. To achieve these goals, our study proceeds as follows: (i) Drawing inspiration from well-established literature (Huang et al., 2015; McPherson et al., 2001) in network analysis, we pinpoint three key data factors for link prediction: local structural proximity, global structural proximity, and feature proximity. Comprehensive empirical analyses confirm the importance of these three factors. (ii) In line with the empirical analysis, we present a latent space model for link prediction, providing a theoretical guarantee on the effectiveness of the empirically identified data factors. (iii) We conduct an in-depth analysis of relationships among data factors on the latent space model. Our analysis reveals an incompatibility between feature proximity and local structural proximity: the occurrence of both high feature similarity and high local structural similarity within a single edge rarely happens.
Such incompatibility sheds light on an overlooked vulnerability in GNN4LP models: they typically fall short in predicting links that primarily arise from feature proximity. (iv) Building upon these systematic understandings, we provide guidance for model design and benchmark dataset selection, along with opportunities for link prediction.

2 RELATED WORK

Link prediction aims to complete missing links in a graph, with applications ranging from knowledge graph completion (Nickel et al., 2015) to e-commerce recommendations (Huang et al., 2005). While heuristic algorithms were once predominant, Graph Neural Networks for Link Prediction (GNN4LP) built on deep learning techniques have shown superior performance in recent years. Heuristic algorithms, grounded in the principle that similar nodes are more likely to connect, encompass local structural heuristics like Common Neighbor and Adamic-Adar (Adamic & Adar, 2003), global structural heuristics like Katz and SimRank (Katz, 1953; Jeh & Widom, 2002), and feature proximity heuristics (Nickel et al., 2014; Zhao et al., 2017) that integrate additional node features. GNN4LP builds on basic GNNs (Kipf & Welling, 2016; 2017), which learn single-node structural representations by recursively aggregating neighborhoods and transforming features, equipped with pairwise decoders. GNN4LP models augment vanilla GNNs by incorporating more complicated pairwise structural information inspired by heuristic methods. For instance, NCNC (Wang et al., 2023) and NBFNet (Zhu et al., 2021b) generalize the CN and Katz heuristics with neural functions to incorporate such pairwise information, achieving efficiency and promising performance. A more detailed discussion on heuristics, GNNs, and principles in network analysis is in Appendix A.

¹ More evidence on other data properties can be found in Appendix E.

3 MAIN ANALYSIS

In this section, we conduct analyses to uncover the key data factors for link prediction and the underlying relationships among those data factors. Since the underlying data factors contributing to link formation are difficult to examine directly from datasets, we employ heuristic algorithms as a lens to reflect their relevance. Heuristic algorithms calculate similarity scores derived from different data factors to estimate the probability that two nodes should be connected. They are well-suited for this analysis as they are simple and interpretable, rooted in principles from network analysis (Murase et al., 2019; Khanam et al., 2020). Leveraging properly selected heuristic algorithms and well-established literature in network analysis, we endeavor to elucidate the underlying data factors for link prediction.

Organization. Revolving around the data perspective for link prediction, the following subsections are organized as follows. Section 3.1 focuses on identifying and empirically validating the key data factors for link prediction using corresponding heuristics. In line with the empirical significance of those factors, Section 3.2 introduces a theoretical model for link prediction, associating data factors with node distances within a latent space: links are more likely to be established between nodes with a small latent distance. Section 3.3 unveils the relationships among data factors building upon the theoretical model. We then clearly identify an incompatibility between the local structural proximity and feature proximity factors.
Specifically, incompatibility indicates that large feature proximity and large local structural proximity are unlikely to co-occur within a single edge. Section 3.4 highlights an overlooked limitation of GNN4LP models stemming from this incompatibility.

Preliminaries & Experimental Setup. $\mathcal{G} = (V, E)$ is an undirected graph, where $V$ and $E$ are the sets of $N$ nodes and $|E|$ edges, respectively. Nodes can be associated with features $X \in \mathbb{R}^{N \times d}$, where $d$ is the feature dimension. We conduct analysis on the CORA, CITESEER, PUBMED, OGBL-COLLAB, OGBL-PPA, and OGBL-DDI datasets (Hu et al., 2020; McCallum et al., 2000) with the same model setting as the recent benchmark (Li et al., 2023). Experimental and dataset details are in Appendices K and J, respectively.

3.1 UNDERLYING DATA FACTORS ON LINK PREDICTION

Motivated by well-established understandings in network analysis (Daud et al., 2020; Wang & Le, 2020; Kumar et al., 2020) and heuristic designs (Adamic & Adar, 2003; Katz, 1953), we conjecture that there are three key data factors for link prediction. (1) Local structural proximity (LSP) (Newman, 2001) corresponds to the similarity of the immediate neighborhoods of two nodes. The rationale behind LSP is rooted in the principle of triadic closure (Huang et al., 2015), which posits that two nodes with more common neighbors have a higher probability of being connected. Heuristic algorithms derived from the LSP perspective include CN, RA, and AA (Adamic & Adar, 2003), which quantify the overlap between neighborhood sets. We mainly focus on Common Neighbors (CN) in the following discussion. The CN score for nodes $i$ and $j$ is calculated as $|\Gamma(i) \cap \Gamma(j)|$, where $\Gamma(\cdot)$ denotes the neighborhood set. More analysis on other related heuristics revolving around LSP, e.g., RA and AA, can be found in Appendix C.5. (2) Global structural proximity (GSP) (Katz, 1953; Jeh & Widom, 2002) goes beyond immediate neighborhoods by considering the global connectivity between two nodes. The rationale behind GSP is that two nodes with more paths between them have a higher probability of being connected. Heuristic algorithms derived from GSP include SimRank, Katz, and PPR (Brin & Page, 2012), which extract information from the ensemble of paths. We particularly focus on the Katz heuristic in the following discussion. The Katz score for nodes $i$ and $j$ is calculated as $\sum_{l=1}^{\infty} \lambda^{l} |\text{paths}^{(l)}(i, j)|$, where $\lambda < 1$ is a damping factor indicating the importance of higher-order information, and $|\text{paths}^{(l)}(i, j)|$ counts the number of length-$l$ paths between $i$ and $j$. (3) Feature proximity (FP) (Murase et al., 2019) corresponds to the feature similarity between nodes. The rationale behind FP is the principle of feature homophily (Khanam et al., 2020; Evtushenko & Kleinberg, 2021), which posits that two nodes with more similar individual characteristics have a higher probability of being connected. There are many heuristic algorithms (Tang et al., 2013; Zhao et al., 2017) derived from the FP perspective. Nonetheless, most of them combine FP with the above structural proximity, making it difficult to analyze FP in isolation. Hence, we derive a simple heuristic called feature homophily (FH) that focuses on feature proximity alone for ease of analysis. The FH score between nodes $i$ and $j$ is calculated as $\text{dis}(x_i, x_j)$, where $x_i$ is the node feature of $i$ and $\text{dis}(\cdot)$ is a distance function. We particularly focus on FH with the cosine distance function in the following discussion.
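To make the three heuristics concrete, the following is a minimal sketch on a toy graph. It is an illustration rather than the paper's implementation: the closed form $(I - \lambda A)^{-1} - I$ for Katz and the use of cosine similarity for FH (the complement of cosine distance, which induces the same ranking with the sign reversed) are our own choices here.

```python
import numpy as np
import networkx as nx

def cn_score(G, i, j):
    """LSP: |Gamma(i) & Gamma(j)|, the one-hop neighborhood overlap."""
    return len(set(G[i]) & set(G[j]))

def katz_matrix(G, lam=0.05):
    """GSP: sum_{l>=1} lam^l * #paths_l(i, j) = (I - lam*A)^{-1} - I.
    The series converges when lam < 1 / lambda_max(A)."""
    A = nx.to_numpy_array(G)
    n = len(A)
    return np.linalg.inv(np.eye(n) - lam * A) - np.eye(n)

def fh_score(x_i, x_j):
    """FP: cosine similarity of node features (1 - cosine distance)."""
    return float(x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j) + 1e-12))

G = nx.karate_club_graph()
print(cn_score(G, 0, 1), katz_matrix(G)[0, 1])
```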
Figure 3: Overlapping ratio between top-ranked edges on different heuristic algorithms (panels (b)-(d): OGBL-COLLAB, OGBL-DDI, OGBL-PPA). Diagonal cells compare two heuristics within the same factor, while the other cells compare heuristics from different factors. FP is ignored on OGBL-DDI and OGBL-PPA due to no or weak feature quality. MRR is selected as the metric. More results with the Hits@10 metric can be found in Appendix D.

Notably, details on all the heuristics mentioned above can be found in Appendices A and B. To understand the importance of those data factors, we aim to answer the following questions: (i) Does each data factor indeed play a key role in link prediction? (ii) Does each factor provide unique rather than overlapping information?

Figure 2: Performance of heuristics corresponding to different factors.

We first examine the significance of each aforementioned factor for link prediction, based on well-established principles from network analysis. We exhibit the performance of heuristic algorithms in Figure 2 and make the following observations: (i) For datasets from the academic domain, CORA, CITESEER, PUBMED, and OGBL-COLLAB, heuristics for different factors can all achieve satisfying performance, and Katz, corresponding to the GSP factor, consistently outperforms the other heuristics; explanations of this phenomenon are presented in Section 3.3. (ii) For the OGBL-DDI and OGBL-PPA datasets, the CN heuristic corresponding to the LSP factor consistently performs best while FH performs poorly. We conjecture that this is due to low feature quality; for instance, node features in OGBL-PPA are one-hot vectors corresponding to different categories. (iii) No single heuristic algorithm consistently outperforms across all datasets, indicating data disparity; detailed discussions are in Section 4.2. The effectiveness of heuristics is very data-specific, which further highlights the importance of investigating link prediction from a data perspective.

We further investigate the relationship between heuristics from the same and from different data factors; details on the heuristic selection are in Appendix B. This includes whether heuristics from the same data factor provide similar information for link prediction and whether those from different data factors offer unique perspectives. To this end, we examine disparities in predictions among different heuristics. Similar predictions imply that two heuristics provide similar information, while divergent predictions indicate that each factor provides unique information. Predictions for node pairs can be arranged in descending order according to the predicted likelihood of being connected. We primarily focus on top-ranked node pairs, since they are likely to be predicted as links and thus largely determine the efficacy of the corresponding algorithm. If two algorithms produce similar predictions, their high-likelihood edges should overlap heavily; otherwise, the overlap should be low.
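The overlap statistic used in Figure 3 can be computed as in the sketch below. The 25% cutoff follows the description above, while the array-based interface (scores over a shared list of candidate node pairs) is a hypothetical simplification.

```python
import numpy as np

def topk_overlap(scores_a, scores_b, frac=0.25):
    """Overlap ratio between the top-`frac` ranked node pairs of two heuristics.
    Both arrays score the same candidate node pairs, in the same order."""
    k = max(1, int(frac * len(scores_a)))
    top_a = set(np.argsort(-scores_a)[:k])  # indices of the highest-scored pairs
    top_b = set(np.argsort(-scores_b)[:k])
    return len(top_a & top_b) / k

rng = np.random.default_rng(0)
s1 = rng.random(1000)
s2 = s1 + 0.1 * rng.random(1000)            # strongly correlated heuristic
print(topk_overlap(s1, s2))                 # high overlap, close to 1
print(topk_overlap(s1, rng.random(1000)))   # independent heuristic, ~0.25
```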
Experimental results are shown in Figure 3. Due to low feature quality, we exclude the OGBL-DDI and OGBL-PPA datasets when the analysis involves all three factors. The figure reports the overlap ratio between the top-ranked (25%) node pairs of two heuristics, either from the same data factor or from different data factors. We make the following observations: (i) Comparing two heuristics from the same factor, i.e., the diagonal cells, we observe that high-likelihood edges for one heuristic are also top-ranked by the other. This indicates that heuristics from the same data factor capture similar information. (ii) Comparing two heuristics derived from different factors, we observe that the overlap of top-ranked edges is much lower, especially between GSP and FP, as well as between LSP and FP. Though the GSP and LSP factors have relatively high overlap in top-ranked edges, the overlap is still much smaller than that between heuristics from the same factor. These observations suggest that (i) selecting one representative heuristic per data factor can be sufficient, since heuristics from the same factor share similar predictions, and (ii) different factors are unique, as there is little overlap in their predictions. More analyses are in Appendices E and F.

3.2 THEORETICAL MODEL FOR LINK PREDICTION BASED ON UNDERLYING DATA FACTORS

In this subsection, we rigorously propose a theoretical model for link prediction based on the important data factors empirically analyzed above. We first introduce a latent space model and then theoretically demonstrate that the model reflects the effectiveness of the LSP, GSP, and FP factors. All proofs can be found in Appendix C given space constraints. The latent space model (Hoff et al., 2002) has been widely utilized in many domains, e.g., sociology and statistics. It is typically utilized to describe proximity in a latent space, where nodes that are close in the latent space are likely to share particular characteristics. In our paper, we propose a latent space model for link prediction describing a graph with $N$ nodes, incorporating both feature and structural proximity, where each node is associated with a location in a $D$-dimensional latent space. The intuitions for modeling the feature and structural perspectives are as follows. (i) The structural perspective is of primary significance in link prediction. In line with this, the latent space model connects the link prediction problem with the latent pairwise distance $d$, which is strongly correlated with structural proximity: a small $d_{ij}$ indicates that nodes $i$ and $j$ share similar structural characteristics and have a high probability of being connected. (ii) The feature perspective provides complementary information: two nodes with high feature proximity but located distantly in the latent space should also potentially be connected. In line with this, we introduce a feature proximity parameter $\beta_{ij}$ in the latent space; a larger $\beta_{ij}$ indicates that nodes $i$ and $j$ are more likely to be connected. Considering the feature and structural perspectives together, we develop an undirected graph model inspired by (Sarkar et al., 2011).
The detailed formulation is as follows:

$$P(i \sim j \mid d_{ij}) = \begin{cases} \dfrac{1}{1 + e^{\alpha(d_{ij} - \max\{r_i, r_j\})}}\,(1 - \beta_{ij}), & d_{ij} \le \max\{r_i, r_j\} \\[4pt] \beta_{ij}, & d_{ij} > \max\{r_i, r_j\} \end{cases} \qquad (1)$$

where $P(i \sim j \mid d_{ij})$ depicts the probability of forming an undirected link between $i$ and $j$ ($i \sim j$), predicated on both features and structure. The latent distance $d_{ij}$ indicates the structural likelihood of link formation between $i$ and $j$. The feature proximity parameter $\beta_{ij} \in [0, 1]$ additionally introduces the influence from the feature perspective. Moreover, the model has two parameters $\alpha$ and $r$. $\alpha > 0$ controls the sharpness of the function; to ease the analysis, we set $\alpha = +\infty$, and discussions of the non-deterministic case with finite $\alpha$ are in Appendix C.6. $r_i$ is a connecting threshold parameter for node $i$. With $\alpha = +\infty$, $\frac{1}{1 + e^{\alpha(d_{ij} - \max\{r_i, r_j\})}} = 0$ if $d_{ij} > \max\{r_i, r_j\}$, and otherwise it equals 1. Therefore, a large $r_i$ indicates that node $i$ is more likely to form edges, leading to a potentially larger degree. Nodes in the graph can be associated with different $r$ values, allowing us to model graphs with various degree distributions. Such flexibility enables our theoretical model to be applicable to more real-world graphs. We next identify how the model reveals the different important data factors in link prediction: we (i) derive heuristic scores revolving around each factor in the latent space, and (ii) provide a theoretical foundation suggesting that each score offers a suitable bound on the probability of link formation. These theoretical results underscore the effectiveness of each factor.

Effectiveness of Local Structural Proximity (LSP). We first derive the common neighbor (CN) score in the latent space model. Since we focus on local structural proximity, the effect of features is ignored; we therefore set the FP parameter $\beta_{ij} = 0$ for ease of analysis. Considering two nodes $i$ and $j$, a common neighbor $k$ is a node connected to both. In the latent space, it must satisfy both $d_{ik} < \max\{r_i, r_k\}$ and $d_{kj} < \max\{r_k, r_j\}$, i.e., it lies in the intersection of the two balls $V(\max\{r_i, r_k\})$ and $V(\max\{r_k, r_j\})$. Notably, $V(r) = V(1)\, r^D$ is the volume of a ball of radius $r$, where $V(1)$ is the volume of a unit-radius hypersphere. Therefore, the expected number of common neighbors is proportional to the volume of the intersection of the two balls. Detailed calculations are in Appendix C.1. With this volume in the latent space, we then derive how CN provides a meaningful bound on the structural distance $d_{ij}$.

Proposition 1 (latent space distance bound with CNs). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have $d_{ij} \le 2 r^{\max}_{ij} - \left( \frac{\eta_{ij}/N - \epsilon}{V(1)} \right)^{2/D}$, where $\eta_{ij}$ is the number of common neighbors between nodes $i$ and $j$, $r^{\max}_{ij} = \max\{r_i, r_j\}$, and $V(1)$ is the volume of a unit-radius hypersphere in the $D$-dimensional space. $\epsilon$ is a term independent of $\eta_{ij}$ that vanishes as the number of nodes $N$ grows.

Proposition 1 indicates that a large number of common neighbors $\eta_{ij}$ results in a smaller latent distance $d_{ij}$, leading to a high probability of an edge connection.
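As a sanity check on this intuition, the sketch below samples the deterministic special case of the model ($\alpha = +\infty$, $\beta_{ij} = 0$, a shared radius $r_i = r$) and confirms that pairs that are close in latent space share many more neighbors. All parameter values are hypothetical choices for illustration.

```python
import numpy as np

# Deterministic special case of Eq. (1): alpha = +inf, beta = 0, shared radius r.
rng = np.random.default_rng(0)
N, D, r = 1000, 2, 0.1
Z = rng.random((N, D))                              # latent positions in [0, 1]^D
dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
A = (dist < r).astype(int)
np.fill_diagonal(A, 0)                              # connect iff d_ij < r

CN = A @ A                                          # (A^2)_{ij} = common neighbors
close = (A == 0) & (dist > 0) & (dist < 1.5 * r)    # non-adjacent but nearby pairs
far = dist > 3 * r                                  # pairs too far to share a ball
print(CN[close].mean(), CN[far].mean())             # nearby pairs share far more
```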
We then extend the above analysis from local structure to global structure with more complicated structural patterns.

Effectiveness of Global Structural Proximity (GSP). We first derive the number of paths between nodes $i$ and $j$ in the latent space; notably, most heuristics for the GSP factor can be viewed as a weighted number of paths. The key idea is to view each common neighbor as a path of length $\ell = 2$, serving as the basic element for paths of length $\ell > 2$. Suppose nodes $i, j$ are linked through a path of length $\ell$, i.e., $i = k_0 \sim k_1 \sim \cdots \sim k_{\ell-1} \sim k_\ell = j$. As we assume each node is only associated with its neighborhood, the probability that this path exists can be bounded by the decomposition $P(k_0 \sim k_1 \sim k_2)\, P(k_1 \sim k_2 \sim k_3) \cdots P(k_{\ell-2} \sim k_{\ell-1} \sim k_\ell) = \prod_{l=1}^{\ell-1} P(k_{l-1} \sim k_l \sim k_{l+1})$. Notably, each element is the common-neighbor probability discussed in Proposition 1, equivalent to a path with $\ell = 2$. We then calculate the volume corresponding to the number of paths and derive how it bounds the latent distance $d_{ij}$.

Proposition 2 (latent space distance bound with the number of paths). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have $d_{ij} \le \sum_{n=0}^{M-2} r_n + 2 r^{\max}_M - \left( \frac{\eta_\ell(i, j) - b(N, \delta)}{c(N, \delta, \ell)} \right)^{\frac{2}{D(\ell - 1)}}$, where $\eta_\ell(i, j)$ is the number of paths of length $\ell$ between $i$ and $j$ in the $D$-dimensional Euclidean latent space, and $M \in \{1, \dots, \ell - 1\}$ indexes the intermediate nodes.

Proposition 2 indicates that a large number of paths $\eta_\ell(i, j)$ results in a smaller latent distance $d_{ij}$, leading to a high probability of an edge connection. It demonstrates the effectiveness of the GSP factor.

Effectiveness of Feature Proximity (FP). We next focus on the role of the FP parameter $\beta_{ij}$. In particular, we extend Proposition 1, which ignored FP by setting $\beta_{ij} = 0$, to $\beta_{ij} \in [0, 1]$. This allows nodes that are distant in the latent space to be connected if they share similar features. Specifically, two nodes $i, j$ with latent distance $d_{ij} > \max\{r_i, r_j\}$ are connected with probability $\beta_{ij}$: instead of having no edge ($p = 0$) when $d_{ij} > \max\{r_i, r_j\}$, nodes are connected with probability $p = \beta_{ij}$. This provides a connection probability for node pairs with high FP. With this additional consideration of FP, we obtain the following proposition:

Proposition 3 (latent space distance bound with feature proximity). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have $d_{ij} \le 2 r^{\max}_{ij} - \left( \frac{\beta_{ij}\,(1 - A(r_i, r_j, d_{ij})) + A(r_i, r_j, d_{ij})}{V(1)} \right)^{2/D}$, where $\beta_{ij}$ measures the feature proximity between $i$ and $j$, $r^{\max}_{ij} = \max\{r_i, r_j\}$, and $V(1)$ is the volume of a unit-radius hypersphere in $D$-dimensional Euclidean space. $A(r_i, r_j, d_{ij})$ is the volume of the intersection of the two balls $V(r_i)$ and $V(r_j)$ in the latent space, corresponding to the expected number of common neighbors.

We can observe that when $A(r_i, r_j, d_{ij})$ is fixed, a larger $\beta_{ij}$ leads to a tighter bound, i.e., a closer distance in the latent space. Proposition 3 thus indicates that high FP results in a small latent distance $d_{ij}$, leading to a high probability of an edge connection. Notably, the conclusion extends easily to Proposition 2 on global structural proximity, with details in Appendix C.4. The above theoretical results indicate the significance of the three data factors.

3.3 INTRINSIC RELATIONSHIP AMONG UNDERLYING DATA FACTORS

In this subsection, we conduct a rigorous analysis elucidating the intrinsic relationships among the different factors, built upon the theoretical model. Our analyses are two-fold: (i) the relationship between the structural factors, i.e., LSP and GSP, and (ii) the relationship between the feature and structural factors, i.e., FP versus LSP and FP versus GSP. Proof details are in Appendix C.

The relationship between local and global structural proximity. To consider both local and global structural factors, we treat the CN algorithm as the number of paths $\eta_\ell(i, j)$ with length $\ell = 2$; the sketch below makes this view concrete.
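A small numeric check of this identity on a toy graph: powers of the adjacency matrix count walks of each length, so $(A^2)_{ij}$ recovers CN exactly, and a damped sum over lengths recovers a truncated Katz score (the damping value is a hypothetical choice).

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
i, j = 0, 33

# (A^l)_{ij} counts length-l walks; l = 2 is exactly the common-neighbor count.
print(int((A @ A)[i, j]), len(set(G[i]) & set(G[j])))  # identical values

# A truncated Katz score aggregates all lengths, down-weighting longer paths.
lam = 0.05
katz_ij = sum(lam**l * np.linalg.matrix_power(A, l)[i, j] for l in range(1, 6))
print(katz_ij)
```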
Therefore, the analysis between local and global structural factors can be cast as the influence of $\eta_\ell(i, j)$ at different lengths $\ell$. The key to the proof is to isolate the effect of $\ell$ by bounding the other $\ell$-dependent terms in Proposition 2, i.e., $\eta_\ell(i, j)$ and $c(N, \delta, \ell)$. We also ignore the feature effect to ease the structural analysis.

Lemma 1 (latent space distance bound with local and global structural proximity). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have $d_{ij} \le \sum_{n=0}^{M-2} r_n + 2 r^{\max}_M - \sqrt{2}^{\,1 - \frac{2}{D(\ell - 1)}}$, where $\sum_{n=0}^{M-2} r_n$ and $r^{\max}_M$ serve as independent variables that do not change with $\ell$.

Given the same number of paths $\eta_\ell$ with different lengths $\ell$, a small $\ell$ provides a much tighter bound, i.e., a closer distance in the latent space. The bound becomes exponentially looser as $\ell$ increases, since the hop count $\ell$ in $\sqrt{2}^{\,1 - \frac{2}{D(\ell - 1)}}$ acts as an exponential coefficient. This indicates that (i) when both LSP and GSP are sufficient, LSP provides a tighter bound and thus plays a more important role, and (ii) when LSP is deficient, e.g., the graph is sparse with few common neighbors, GSP becomes more significant. This theoretical understanding aligns with our empirical observations in Section 3.1: Figure 2 illustrates that (i) heuristics derived from GSP perform better on sparse graphs with deficient common neighbors, as shown in Figure 1, and (ii) heuristics derived from LSP perform better on dense graphs, i.e., OGBL-DDI and OGBL-PPA, which have more common neighbors.

Figure 4: Performance comparison between GNN4LP models (BUDDY, Neo-GNN, NCNC) and SAGE on the OGBL-COLLAB dataset. Bars represent the performance gap (difference in Hits relative to SAGE) on node pairs dominated by feature and structural proximity, respectively. The two panels compare FP against GSP and against LSP, respectively.

The relationship between structural and feature proximity. Our analysis then focuses on the interplay between feature and structural proximity. The key to the proof is to recognize how feature proximity affects the number of common neighbors derived from the LSP factor.

Lemma 2 (Incompatibility between LSP and FP factors). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have $\eta_{ij} = \frac{c}{1 - \beta_{ij}} + N(1 + \epsilon)$, where $\eta_{ij}$ and $\beta_{ij}$ are the number of common neighbors and the feature proximity between nodes $i$ and $j$, respectively, and $c < 0$ is an independent variable that does not change with $\beta_{ij}$ and $\eta_{ij}$. Hence $\eta_{ij}$ is negatively correlated with $\beta_{ij}$.

Lemma 2 demonstrates that node pairs with a large number of common neighbors $\eta_{ij}$ tend to have low feature proximity $\beta_{ij}$, and vice versa; a toy computation below illustrates the relation. These findings underscore the incompatibility between LSP and feature proximity: it is unlikely that both large LSP and large FP co-exist in a single node pair. This challenges the conventional wisdom, which posits that LSP tends to connect people in a way that reinforces existing FP, e.g., connecting people with similar characteristics. Our findings instead suggest that LSP can offset feature proximity. One intuitive explanation from the social network literature (Abebe et al., 2022) is that, in contexts with FP, similar individuals tend to connect; thus, if nodes with common neighbors (mutual friends) are not yet linked, their features may be quite different, and the new edges formed between nodes with high LSP actually connect individuals with low FP. A similar relationship holds between GSP and FP, with proof in Appendix C.4.
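Plugging hypothetical values into the Lemma 2 relation makes the trade-off visible: with $c < 0$ fixed, raising $\beta_{ij}$ strictly lowers $\eta_{ij}$. All constants below are made up for illustration and chosen so the count stays nonnegative over the displayed range.

```python
import numpy as np

# Hypothetical instantiation of Lemma 2: eta = c / (1 - beta) + N * (1 + eps).
c, N, eps = -20.0, 100, 0.0
beta = np.array([0.0, 0.3, 0.5, 0.7])
eta = c / (1 - beta) + N * (1 + eps)
print(np.round(eta, 1))   # [80.  71.4 60.  33.3] -- monotonically decreasing
```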
3.4 AN OVERLOOKED VULNERABILITY IN GNN4LP MODELS INSPIRED FROM DATA FACTORS

In this subsection, we delve into how the incompatibility between structural proximity and feature proximity affects the effectiveness of GNN4LP models. These models are inherently designed to learn pairwise structural representations encompassing both feature and structural proximity. Despite their strong capability, the incompatibility between structural and feature factors leads to potentially conflicting training signals: while structural proximity patterns may imply a likely link between two nodes, feature proximity patterns might suggest the opposite. Therefore, it is challenging for a single model to serve both node pairs dominated by the feature proximity factor and those dominated by the structural ones. While most research emphasizes the capability of GNN4LP models on structural proximity, the influence of this incompatibility remains under-explored.

Figure 5: The original SEAL and the proposed decoupled SEAL architectures. (a) The original coupled SEAL feeds a single GNN and readout; (b) the proposed decoupled SEAL uses two GNNs with separate readouts. $X_{\text{feat}}$ and $X_{\text{drnl}}$ are the original node features and the structural embedding via Double-Radius Node Labeling.

To validate our statement, we conduct experiments comparing vanilla GNNs, e.g., SAGE and GCN, with advanced GNN4LP models including BUDDY, Neo-GNN, and NCNC. The fundamental difference is that vanilla GNNs only learn single-node structural representations, with limited ability to capture pairwise structural factors, while GNN4LP models go beyond this. The comparison therefore sheds light on how the key capacity of GNN4LP, i.e., capturing the pairwise structural factor, behaves under the incompatibility. Comparisons are conducted on node pairs dominated by different factors, represented as $E_s \setminus E_f$ and $E_f \setminus E_s$, i.e., node pairs accurately predicted only by structural proximity and only by feature proximity, respectively, where $E_s$ and $E_f$ denote the node pairs accurately predicted with structural proximity and feature proximity, respectively. Experimental results are presented in Figure 4, where the x-axis indicates node pairs dominated by different underlying factors and the y-axis indicates the performance difference between GNN4LP models and vanilla GraphSAGE; more results on GCN can be found in Appendix E. A notable trend is that GNN4LP models generally outperform vanilla GNNs on edges governed by LSP and GSP while falling short on those governed by feature proximity. This underlines the overlooked vulnerability of GNN4LP models on node pairs dominated by the FP factor, which stems from the incompatibility between feature and structural proximity.
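A hypothetical sketch of this evaluation split follows: an edge counts as "accurately predicted" by a heuristic if it ranks in that heuristic's top fraction. The function name and the top-25% operationalization are our assumptions, not the paper's released code.

```python
import numpy as np

def factor_dominated_splits(struct_scores, feat_scores, frac=0.25):
    r"""Split test edges into E_s \ E_f (only structure predicts them) and
    E_f \ E_s (only features predict them). A heuristic 'accurately predicts'
    an edge if the edge ranks in its top-`frac` scores."""
    k = max(1, int(frac * len(struct_scores)))
    E_s = set(np.argsort(-struct_scores)[:k])
    E_f = set(np.argsort(-feat_scores)[:k])
    return E_s - E_f, E_f - E_s

rng = np.random.default_rng(1)
struct, feat = rng.random(500), rng.random(500)   # stand-in heuristic scores
only_struct, only_feat = factor_dominated_splits(struct, feat)
print(len(only_struct), len(only_feat))           # model accuracy can then be
                                                  # compared on each subset
```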
4 GUIDANCE FOR PRACTITIONERS ON LINK PREDICTION

In this section, we provide guidance for new model design and for selecting benchmark datasets for comprehensive evaluation, based on the above understandings from a data perspective.

4.1 GUIDANCE FOR THE MODEL DESIGN

Figure 6: Effectiveness of the proposed decoupled SEAL. (a) Performance of the original and decoupled SEAL on Cora, Citeseer, Pubmed, and ogbl-collab; (b) comparison between the SEAL variants and SAGE on node pairs dominated by the feature and structural factors.

In Section 3, we highlighted the incompatibility between the structural and feature proximity factors in influencing GNN4LP models: when both come into play simultaneously, they can provide conflicting supervision to the model. This understanding suggests that a model should learn the feature proximity factor and the pairwise structural factors independently before integrating their outputs, in order to mitigate the incompatibility. In particular, we apply this strategy to SEAL (Zhang & Chen, 2018), a representative GNN4LP model. Different from vanilla GNNs, which only utilize the original node features $X_{\text{feat}}$ as input, SEAL additionally employs local structural features $X_{\text{drnl}}$ obtained by double-radius node labeling (DRNL) based on structural roles. $X_{\text{feat}}$ and $X_{\text{drnl}}$ are concatenated and forwarded to a single GNN, as depicted in Figure 5(a); the GNN must therefore wrestle with the incompatibility between FP and the structural factors. Guided by the above understanding, we propose decoupled SEAL, which separates the original node features $X_{\text{feat}}$ and the local structural features $X_{\text{drnl}}$ into different GNNs, so that each dedicated GNN learns either feature patterns or pairwise structural patterns separately. The decoupled architecture is depicted in Figure 5(b) and sketched in code below. Experimental results comparing the original SEAL and our proposed decoupled SEAL are illustrated in Figure 6(a). Notably, decoupled SEAL consistently outperforms, with gains reaching up to 1.03% on the large OGBL-COLLAB dataset. Furthermore, Figure 6(b) shows comparisons with GraphSAGE, following the same setting as Figure 4: decoupled SEAL shows a reduced performance drop on node pairs dominated by the FP factor and a larger gain on those dominated by the structural factors. Code is available here.
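The following is a minimal sketch of the decoupled design in Figure 5(b), not the released implementation: two independent GNN branches encode $X_{\text{feat}}$ and $X_{\text{drnl}}$ over the enclosing subgraph, and their readouts are fused only in the final link scorer. Layer counts, hidden sizes, and the mean-pooling readout are hypothetical simplifications of SEAL, assuming PyTorch Geometric is available.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class DecoupledSEAL(nn.Module):
    """Two dedicated branches: one for raw node features (feature proximity),
    one for DRNL structural labels (pairwise structural proximity)."""
    def __init__(self, feat_dim, drnl_dim, hidden=64):
        super().__init__()
        self.gnn_feat = GCNConv(feat_dim, hidden)   # feature branch
        self.gnn_drnl = GCNConv(drnl_dim, hidden)   # structural branch
        self.scorer = nn.Linear(2 * hidden, 1)      # fusion happens only here

    def forward(self, x_feat, x_drnl, edge_index, batch):
        # Each branch encodes the enclosing subgraph from its own input signal
        # and produces its own subgraph-level readout.
        h_feat = global_mean_pool(torch.relu(self.gnn_feat(x_feat, edge_index)), batch)
        h_drnl = global_mean_pool(torch.relu(self.gnn_drnl(x_drnl, edge_index)), batch)
        return self.scorer(torch.cat([h_feat, h_drnl], dim=-1))  # link logit
```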
Table 1: The Hit@10 performance on the newly selected datasets.

| Dataset | CN | Katz | FH | MLP | SAGE | BUDDY |
|---------|-------|-------|-------|--------------|--------------|--------------|
| POWER | 12.88 | 29.85 | NA | 5.03 ± 0.88 | 6.99 ± 1.16 | 19.88 ± 1.37 |
| PHOTO | 18.34 | 7.07 | 13.78 | 12.37 ± 4.13 | 18.61 ± 5.97 | 18.09 ± 2.52 |

4.2 GUIDANCE FOR BENCHMARK DATASET SELECTION

With the recognized data factors and their relationships, we enumerate all potential combinations among the data factors, illuminating the complete dataset landscape. This allows us to categorize prevalent datasets and pinpoint scenarios not covered by them. Consequently, we introduce new datasets addressing the identified gaps and offer guidance for practitioners on more comprehensive benchmark dataset selection. In particular, we group datasets into four categories considering two main aspects: (i) From the feature perspective, we verify whether FP dominates, indicated by decent performance of FH. (ii) From the structural perspective, we verify whether GSP dominates, indicated by whether a GSP heuristic provides additional improvement over LSP (if not, then LSP dominates); Section 3.3 demonstrates that this scenario happens when LSP is inadequate. The four categories are thus: category 1, where both LSP and FP dominate; category 2, where only LSP dominates; category 3, where both GSP and FP dominate; and category 4, where only GSP dominates. Evidence in Figure 2 helps categorize existing benchmark datasets: prevalent datasets like CORA, CITESEER, and PUBMED fall into category 3, with both GSP and FP dominating, while datasets like OGBL-DDI and OGBL-PPA fall into category 2, focusing on the LSP factor. We can then clearly identify that two significant dataset categories, 1 and 4, are not covered by existing datasets. To enable a more comprehensive evaluation beyond existing benchmarks, we introduce new datasets covering these categories: the unfeatured POWER dataset in category 4 and the PHOTO dataset in category 1. The categorization of these datasets is confirmed through the experimental results in Table 1. We observe: (i) For the POWER dataset, where only GSP matters, Katz significantly outperforms the other algorithms, including the GNN4LP model BUDDY. (ii) Deep models do not show superior performance on either dataset, indicating that success on existing datasets does not extend to the new ones and suggesting room for improvement. We then provide the following guidance on benchmark dataset selection for practitioners: (i) select the algorithms that perform best on datasets belonging to the same category as the proposed one, and (ii) select datasets from one's own domain rather than from other domains. To help with this, we collect most of the existing datasets for link prediction, covering most domains including biology, transportation, web, academia, and social science, assisting in a more comprehensive evaluation aligned with real-world scenarios. Details on all datasets are in Appendix D and the repository.

5 CONCLUSION

In this work, we explore link prediction from a data perspective, elucidating three pivotal factors: LSP, GSP, and FP. Theoretical analyses uncover the underlying incompatibility between the feature and structural factors. Inspired by this incompatibility, our paper shows a positive broader impact, as we identify the overlooked biased prediction in GNN4LP models and show a potential solution to address this issue. Our understanding provides guidance for new model design and for selecting benchmark datasets for comprehensive evaluation. It also yields insights for future directions, including (1) a more careful discussion of the above fairness issue and (2) designing specific GNN4LP models for datasets in the different dataset categories mentioned in Section 4.2. Nonetheless, our paper has minor limitations, as our theoretical analysis assumes that feature proximity is an additional noise parameter rather than adaptively combining that information in the same subspace. A more comprehensive discussion of limitations, broader impact, and future work is in Appendices G, H, and I.

6 ACKNOWLEDGEMENT

This research is supported by the National Science Foundation (NSF) under grant numbers CNS 2246050, IIS1845081, IIS2212032, IIS2212144, IIS-2406648, IIS-2406647, IOS2107215, DUE 2234015, DRL 2025244 and IOS2035472, the Army Research Office (ARO) under grant number W911NF-21-1-0198, the Home Depot, Cisco Systems Inc, Amazon Faculty Award, Johnson & Johnson, JP Morgan Faculty Award and SNAP.

REFERENCES

Rediet Abebe, Nicole Immorlica, Jon Kleinberg, Brendan Lucier, and Ali Shirali. On the effect of triadic closure on network segregation. In Proceedings of the 23rd ACM Conference on Economics and Computation, pp. 249–284, 2022.

Ghadeer AbuOda, Gianmarco De Francisci Morales, and Ashraf Aboulnaga. Link prediction via higher-order motif features. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I, pp. 412–429. Springer, 2020.

Robert Ackland et al.
Mapping the US political blogosphere: Are conservative bloggers more prominent? In BlogTalk Downunder 2005 Conference, Sydney, 2005.

Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.

Aili Asikainen, Gerardo Iñiguez, Javier Ureña-Carrión, Kimmo Kaski, and Mikko Kivelä. Cumulative effects of triadic closure and homophily in social networks. Science Advances, 6(19):eaax7310, 2020.

Vladimir Batagelj and Andrej Mrvar, 2006. http://vlado.fmf.uni-lj.si/pub/networks/data/.

Sergey Brin and Lawrence Page. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks, 56(18):3825–3833, 2012.

Deepayan Chakrabarti. Avoiding biases due to similarity assumptions in node embeddings. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 56–65, 2022.

Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. Graph neural networks for link prediction with subgraph sketching. In ICLR, 2023.

Sergio Currarini, Matthew O Jackson, and Paolo Pin. An economic model of friendship: Homophily, minorities, and segregation. Econometrica, 77(4):1003–1045, 2009.

Nur Nasuha Daud, Siti Hafizah Ab Hamid, Muntadher Saadoon, Firdaus Sahran, and Nor Badrul Anuar. Applications of link prediction in social networks: A review. Journal of Network and Computer Applications, 166:102716, 2020.

Yuxiao Dong, Reid A Johnson, Jian Xu, and Nitesh V Chawla. Structural diversity and homophily: A study across more than one hundred big networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 807–816, 2017.

Anna Evtushenko and Jon Kleinberg. The paradox of second-order homophily in networks. Scientific Reports, 11(1):13360, 2021.

Hao-Ming Fu, Patrick Poirson, Kwot Sin Lee, and Chen Wang. Revisiting neighborhood-based link prediction for collaborative filtering. In Companion Proceedings of the Web Conference 2022, pp. 1009–1018, 2022.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864, 2016.

Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Zadeh. WTF: The who to follow service at Twitter. In Proceedings of the 22nd international conference on World Wide Web, pp. 505–514, 2013.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.

Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 639–648, 2020.

Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.

Hong Huang, Jie Tang, Lu Liu, Jar-Der Luo, and Xiaoming Fu.
Triadic closure pattern analysis and prediction in social networks. IEEE Transactions on Knowledge and Data Engineering, 27(12):3374–3389, 2015.

Hong Huang, Yuxiao Dong, Jie Tang, Hongxia Yang, Nitesh V Chawla, and Xiaoming Fu. Will triadic closure strengthen ties in social networks? ACM Transactions on Knowledge Discovery from Data (TKDD), 12(3):1–25, 2018.

Zan Huang, Xin Li, and Hsinchun Chen. Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, pp. 141–142, 2005.

Paul Jaccard. Distribution comparée de la flore alpine dans quelques régions des alpes occidentales et orientales. Bulletin de la Murithienne, (31):81–92, 1902.

Glen Jeh and Jennifer Widom. SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 538–543, 2002.

Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.

Kazi Zainab Khanam, Gautam Srivastava, and Vijay Mago. The homophily principle in social network analysis. arXiv preprint arXiv:2008.10383, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Thomas N Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

Lecheng Kong, Yixin Chen, and Muhan Zhang. Geodesic graph neural network for efficient graph representation learning. Advances in Neural Information Processing Systems, 35:5896–5909, 2022.

István A Kovács, Katja Luck, Kerstin Spirohn, Yang Wang, Carl Pollis, Sadie Schlabach, Wenting Bian, Dae-Kyum Kim, Nishka Kishore, Tong Hao, et al. Network-based prediction of protein interactions. Nature communications, 10(1):1240, 2019.

Ajay Kumar, Shashank Sheshar Singh, Kuldeep Singh, and Bhaskar Biswas. Link prediction techniques, applications, and performance: A survey. Physica A: Statistical Mechanics and its Applications, 553:124289, 2020.

Jure Leskovec and Julian Mcauley. Learning to discover social circles in ego networks. Advances in neural information processing systems, 25, 2012.

Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 177–187, 2005.

Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD), 1(1):2–es, 2007.

Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI conference on human factors in computing systems, pp. 1361–1370, 2010.

Juanhui Li, Harry Shomer, Haitao Mao, Shenglai Zeng, Yao Ma, Neil Shah, Jiliang Tang, and Dawei Yin. Evaluating graph neural networks for link prediction: Current pitfalls and new benchmarking. arXiv preprint arXiv:2306.10453, 2023.
Shuming Liang, Yu Ding, Zhidong Li, Bin Liang, Yang Wang, Fang Chen, et al. Can gnns learn heuristic information for link prediction? 2022.

Haoyu Liu, Ningyi Liao, and Siqiang Luo. Simga: A simple and effective heterophilous graph neural network with efficient global aggregation. arXiv preprint arXiv:2305.09958, 2023.

Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 43–52, 2015.

Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3:127–163, 2000.

Colin McDiarmid et al. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.

Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.

Yohsuke Murase, Hang-Hyun Jo, János Török, János Kertész, Kimmo Kaski, et al. Structural transition in social networks. 2019.

Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical review E, 64(2):025102, 2001.

Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical review E, 74(3):036104, 2006.

Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. Advances in Neural Information Processing Systems, 27, 2014.

Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.

Tore Opsahl. Triadic closure in two-mode networks: Redefining the global and local clustering coefficients. Social networks, 35(2):159–167, 2013.

Rose Oughtred, Chris Stark, Bobby-Joe Breitkreutz, Jennifer Rust, Lorrie Boucher, Christie Chang, Nadine Kolas, Lara O'Donnell, Genie Leung, Rochelle McAdam, et al. The biogrid interaction database: 2019 update. Nucleic acids research, 47(D1):D529–D541, 2019.

Tolutola Oyetunde, Muhan Zhang, Yixin Chen, Yinjie Tang, and Cynthia Lo. Boostgapfill: improving the fidelity of metabolic network reconstructions through integrated constraint and pattern-based methods. Bioinformatics, 33(4):608–611, 2017.

Matthew Richardson, Rakesh Agrawal, and Pedro Domingos. Trust management for the semantic web. In International semantic Web conference, pp. 351–368. Springer, 2003.

Ryan A Rossi, Anup Rao, Sungchul Kim, Eunyee Koh, and Nesreen Ahmed. From closing triangles to closing higher-order motifs. In Companion Proceedings of the Web Conference 2020, pp. 42–43, 2020.

Purnamrita Sarkar, Deepayan Chakrabarti, and Andrew W Moore. Theoretical justification of popular link prediction heuristics. In IJCAI proceedings-international joint conference on artificial intelligence, volume 22, pp. 2722, 2011.

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.

Jiliang Tang, Huiji Gao, Xia Hu, and Huan Liu. Exploiting homophily effect for trust prediction.
In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 53–62, 2013.

Gergő Tóth, Johannes Wachs, Riccardo Di Clemente, Ákos Jakobi, Bence Ságvári, János Kertész, and Balázs Lengyel. Inequality is rising where social network segregation interacts with urban topology. Nature communications, 12(1):1143, 2021.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. stat, 1050:20, 2017.

Christian Von Mering, Roland Krause, Berend Snel, Michael Cornell, Stephen G Oliver, Stanley Fields, and Peer Bork. Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417(6887):399–403, 2002.

Haorui Wang, Haoteng Yin, Muhan Zhang, and Pan Li. Equivariant and stable positional encoding for more powerful graph neural networks. arXiv preprint arXiv:2203.00199, 2022.

Hui Wang and Zichun Le. Seven-layer model in complex networks link prediction: A survey. Sensors, 20(22):6560, 2020.

Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. Neural graph collaborative filtering. In Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval, pp. 165–174, 2019a.

Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with completion for link prediction. arXiv preprint arXiv:2302.00890, 2023.

Yue Wang, Lei Chen, Yulin Che, and Qiong Luo. Accelerating pairwise SimRank estimation over static and dynamic graphs. The VLDB Journal, 28:99–122, 2019b.

Duncan J Watts and Steven H Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440–442, 1998.

Wei Wu, Bin Li, Chuan Luo, and Wolfgang Nejdl. Hashing-accelerated graph neural networks for link prediction. In Proceedings of the Web Conference 2021, pp. 2910–2920, 2021.

Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network representation learning with rich text information. In IJCAI, volume 2015, pp. 2111–2117, 2015.

Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, pp. 1–8, 2012.

Haoteng Yin, Muhan Zhang, Yanbang Wang, Jianguo Wang, and Pan Li. Algorithm and system co-design for efficient subgraph-based graph representation learning. arXiv preprint arXiv:2202.13538, 2022.

Haoteng Yin, Muhan Zhang, Jianguo Wang, and Pan Li. Surel+: Moving from walks to sets for scalable subgraph-based graph representation learning. arXiv preprint arXiv:2303.03379, 2023.

Seongjun Yun, Seoyoon Kim, Junhyun Lee, Jaewoo Kang, and Hyunwoo J Kim. Neo-gnns: Neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems, 34:13683–13694, 2021.

Akbar Zaheer and Giuseppe Soda. Network evolution: The origins of structural holes. Administrative Science Quarterly, 54(1):1–31, 2009.

Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. Graphsaint: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931, 2019.

Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information processing systems, 31, 2018.

Muhan Zhang, Zhicheng Cui, Shali Jiang, and Yixin Chen. Beyond link prediction: Predicting hyperlinks in adjacency space. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems, 34:9061–9073, 2021.

He Zhao, Lan Du, and Wray Buntine. Leveraging node attributes for incomplete relational data. In International conference on machine learning, pp. 4072–4081. PMLR, 2017.

Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71:623–630, 2009.

Yangze Zhou, Gitta Kutyniok, and Bruno Ribeiro. Ood link prediction generalization capabilities of message-passing gnns in larger test graphs. Advances in Neural Information Processing Systems, 35:20257–20272, 2022.

Fanwei Zhu, Yuan Fang, Kai Zhang, Kevin Chang, Hongtai Cao, Zhen Jiang, and Minghui Wu. Unified and incremental simrank: Index-free approximation with scheduled principle. IEEE Transactions on Knowledge and Data Engineering, 2021a.

Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural bellman-ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems, 34:29476–29490, 2021b.

Zhaocheng Zhu, Xinyu Yuan, Louis-Pascal Xhonneux, Ming Zhang, Maxime Gazeau, and Jian Tang. Learning to efficiently propagate for reasoning on knowledge graphs. arXiv preprint arXiv:2206.04798, 2022.

Table of Contents

A Related work
B Details on heuristic algorithms
C Theoretical analysis on the graph latent space model
  C.1 Extended latent space model for link prediction
  C.2 The effectiveness of local structural proximity (LSP)
  C.3 The effectiveness of global structural proximity (GSP)
  C.4 The effectiveness of feature proximity (FP) & relationship with feature proximity
  C.5 Theoretical analysis for other local structural heuristics
  C.6 Non-deterministic case for latent space model
D Descriptions on more datasets
E Additional results in main analysis
F Additional analysis among data factors
G Limitation
H Broader Impact
I Future Direction
J Datasets details
K Experimental Settings & model performance
  K.1 Experimental Settings
  K.2 Model performance

A RELATED WORK

Link prediction is a fundamental task in graph analysis. It seeks to determine the probability of a connection between two nodes in a graph, leveraging both the existing graph topology and node features. This task plays a critical role in a variety of applications, such as improving knowledge graph completion (Nickel et al., 2015), enhancing metabolic networks (Oyetunde et al., 2017), suggesting potential friends on social media platforms (Adamic & Adar, 2003; Gupta et al., 2013), predicting protein-protein interactions (Kovács et al., 2019), and refining e-commerce recommendations (Huang et al., 2005). Heuristic algorithms initially served as the primary methodology due to their simplicity, interpretability, and scalability. GNNs were then introduced, leveraging the power of deep learning to allow automatic feature extraction with superior performance.
Heuristic algorithms are based on the hypothesis that nodes with higher similarity have a higher probability of connection. The proximity score between two nodes represents their potential connection probability, tailored to the characteristics of specific node pairs and viewed from various perspectives. Generally, there are three types of heuristic approaches to measure proximity; formal definitions of representative ones are listed in Table 2. (1) Local structural heuristics. This category includes four prevalent algorithms: Common Neighbors (CN) (Newman, 2001), Jaccard (Jaccard, 1902), Adamic-Adar (AA) (Adamic & Adar, 2003), and Resource Allocation (RA) (Zhou et al., 2009). These methods are fundamentally based on the assumption that nodes sharing more common neighbors are more likely to connect. AA and RA are weighted variants of Common Neighbors: they incorporate node degree information, emphasizing that common neighbors with a lower degree have a higher influence. (2) Global structural proximity heuristics. This category includes Katz (Katz, 1953), SimRank (Jeh & Widom, 2002), and Personalized PageRank (PPR) (Brin & Page, 2012). These consider global structural patterns through the number of paths between two nodes, where more paths indicate higher similarity and a greater likelihood of connection. Specifically, Katz counts the number of all paths between two nodes, emphasizing shorter paths by penalizing longer ones with a damping factor. SimRank assumes that two nodes are similar if they are linked to similar nodes, and PPR produces a ranking personalized to a particular node based on random walks with restart. (3) Feature proximity heuristics are applicable when node attributes are available, incorporating side information about individual nodes. (Nickel et al., 2014; Zhao et al., 2017) combine graph structure with latent features and explicit features for better performance.

Graph Neural Networks. Graph Neural Networks (GNNs) have emerged as a powerful technique in deep learning, specifically designed for graph-structured data. They address the limitations of traditional neural networks in dealing with irregular data structures. GNNs learn node representations by recursively aggregating neighborhood information and transforming features. Graph Neural Networks (Kipf & Welling, 2017; 2016; Hamilton et al., 2017; Velickovic et al., 2017) aim to learn high-quality features from data instead of relying on predefined heuristics developed with domain knowledge. Despite satisfying results in many tasks that utilize structural information, (Zhang & Chen, 2018; Zhang et al., 2021; Liang et al., 2022) find that vanilla GNNs are still sub-optimal, suffering from an inability to capture important pairwise patterns for link prediction, e.g., common neighborhood. Graph Neural Networks for Link Prediction (GNN4LP) (Zhang & Chen, 2018; Zhang et al., 2021; Yun et al., 2021; Wang et al., 2022; 2023; Chamberlain et al., 2023; Zhu et al., 2021b) are then proposed, which incorporate different inductive biases to capture more pairwise information. SEAL (Zhang & Chen, 2018), Neo-GNN (Yun et al., 2021), and NCNC (Wang et al., 2023) involve more neighborhood-overlapping knowledge. BUDDY (Chamberlain et al., 2023) and NBFNet (Zhu et al., 2021b) exploit higher-order structural information, e.g., the number of paths between nodes. Nonetheless, advanced GNN4LP models with better effectiveness usually have more complicated model designs, leading to efficiency issues.
(Yin et al., 2022; 2023; Wu et al., 2021; Zhu et al., 2022; Kong et al., 2022) are then proposed to improve efficiency via approximation methods, e.g., hashing algorithms, the A* algorithm, and random walk approximations.

Node embedding models (Yang et al., 2015; Grover & Leskovec, 2016) are graph learning algorithms that transform the nodes of a graph into low-dimensional vectors, capturing the network topology and node feature properties. This approach learns representations through data-driven algorithms, effectively capturing complex patterns within the graph structure and node features for link prediction. For instance, Node2Vec (Grover & Leskovec, 2016) captures both local and global structural proximity, while TADW (Yang et al., 2015) captures both local structural proximity and feature proximity. Notably, our paper provides a comprehensive analysis with the latent space model to understand the important data factors for the link prediction task and how different link prediction algorithms work. The analysis can be used to understand all of the above algorithms, including graph embedding algorithms, GNNs, and heuristics, in a holistic manner.

Principles in Link Prediction & Network Analysis. Over the years, various theories and expert knowledge have emerged to understand and predict links in networks. We review the foundational theories shaping the landscape of link prediction and network analysis, and provide a detailed comparison with particular works. Homophily (Khanam et al., 2020) is a long-standing principle in social network analysis, stemming from the shared beliefs and thoughts of individuals. It suggests that people with aligned perspectives are likely to connect with one another, despite potential differences in their social position. (Murase et al., 2019; Evtushenko & Kleinberg, 2021; Khanam et al., 2020; Currarini et al., 2009; McPherson et al., 2001) provide further studies illustrating how homophily induces different phenomena in social networks. Triadic closure (Huang et al., 2015) is another fundamental concept in social network analysis, stemming from the observation that two individuals with a mutual friend tend to become connected themselves. (Dong et al., 2017; Rossi et al., 2020; Huang et al., 2018; Opsahl, 2013) provide further studies illustrating how triadic closure induces different phenomena in social networks. (Sarkar et al., 2011) theoretically proves the effectiveness of heuristic algorithms. The key differences between our work and (Sarkar et al., 2011) are as follows: (1) (Sarkar et al., 2011) assumes in most cases that all nodes have the same degree, while our model can adapt to graphs with all kinds of degree distributions; (2) (Sarkar et al., 2011) only focuses on the structural perspective, while our model considers effects from both features and structure; (3) (Sarkar et al., 2011) focuses on the deterministic model, while the non-deterministic one is largely ignored. More recently, (Murase et al., 2019; Asikainen et al., 2020; Abebe et al., 2022) focus on investigating the interplay between triadic closure and homophily. (Asikainen et al., 2020) first demonstrates that triadic closure intensifies the impact of homophily. Nonetheless, (Abebe et al., 2022) points out that (Asikainen et al., 2020) builds on the existence of sufficient triadic closure rather than considering triadic closure and homophily simultaneously.
With a modest modification, (Abebe et al., 2022) finds that triadic closure can introduce individuals to those unlike themselves, consequently reducing segregation. The key differences between our work and (Abebe et al., 2022) are as follows: (1) (Abebe et al., 2022) focuses on alleviating the effects of segregation (Tóth et al., 2021) via triadic closure rather than on the link prediction task as in our work; (2) (Abebe et al., 2022) focuses on a typical graph model in social science, while our work aims to understand link prediction in various domains. Beyond those principles focusing on network analysis, more theories and principles have been proposed revolving around deep learning models, especially GNN4LP models. (Zhang & Chen, 2018) proposes the γ-decaying heuristic theory, indicating that subgraph GNNs can capture sufficient structural information for link prediction. (Zhang et al., 2021) proposes the labeling trick theory, indicating how to learn the most expressive structural node representations. (Zhou et al., 2022) revolves around the size stability of the inductive OOD link prediction problem from a causal perspective.

B DETAILS ON HEURISTIC ALGORITHMS

We explain further implementation details of these heuristics as follows:

Katz. $\lambda < 1$ is a damping factor indicating the importance of higher-order information, and $|\text{paths}_l(i,j)|$ counts the number of length-$l$ paths between $i$ and $j$. We compute the Katz index (Katz, 1953) by approximating the closed-form solution $S = (I - \lambda A)^{-1} - I$ with $\lambda A + \lambda^2 A^2 + \cdots + \lambda^n A^n$, where $S$ is the score matrix. We utilize $\lambda = 0.05$ and $n = 3$ in our implementation.

PPR. $[\pi_i]_j$ is the stationary distribution probability of $j$ under a random walk from $i$ with restart; see more details in Brin & Page (2012). We adopt 1e-3 as the stopping criterion.

Table 2: Popular heuristics for link prediction, where $\Gamma(i)$ denotes the neighbor set of vertex $i$.

Common Neighbors (CN): $|\Gamma(i) \cap \Gamma(j)|$ (LSP)
Adamic-Adar (AA): $\sum_{k \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\log |\Gamma(k)|}$ (LSP)
Resource Allocation (RA): $\sum_{k \in \Gamma(i) \cap \Gamma(j)} \frac{1}{|\Gamma(k)|}$ (LSP)
Katz: $\sum_{l=1}^{\infty} \lambda^l |\text{paths}_l(i,j)|$ (GSP)
Personalized PageRank (PPR): $[\pi_i]_j + [\pi_j]_i$ (GSP)
SimRank: $\frac{\sum_{a \in \Gamma(i)} \sum_{b \in \Gamma(j)} \text{score}(a,b)}{|\Gamma(i)| \cdot |\Gamma(j)|}$ (GSP)
Feature Homophily (FH): $\text{dis}(x_i, x_j)$ (FP)

SimRank. We utilize the efficient framework (Zhu et al., 2021a) with the Local Push SimRank algorithm (Wang et al., 2019b). Code is available at https://github.com/UISim2020/UISim2020.

FH. It estimates the node feature similarity between a pair of nodes. Different feature distance metrics can be utilized for estimation; we include two: cosine distance and Euclidean distance.

Double-Radius Node Labeling. It extracts position information relative to the source and target nodes $x$ and $y$. It can be calculated as $f_l(i) = 1 + \min(d_x, d_y) + (d/2)\left[(d/2) + (d\%2) - 1\right]$, where $d_x = d(i, x)$ and $d_y = d(i, y)$ are the distances from node $i$ to the source and target nodes, respectively, and $d = d_x + d_y$.

Heuristic selection. In Section 3, we mainly focus on one default heuristic for each factor: CN for LSP, Katz for GSP, and FH for FP. Moreover, for the experiments in Section 3.1, we compare two heuristics from the same factor. The second heuristics are RA for LSP and PPR for GSP; for the FP factor, the default uses cosine distance while the second uses Euclidean distance.
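To make the definitions above concrete, here is a minimal sketch (not the paper's released code) of the three default heuristics: CN for LSP, truncated Katz for GSP, and FH for FP with cosine similarity. It assumes a binary scipy adjacency matrix and a dense feature matrix; the function name `heuristic_scores` and its exact interface are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def heuristic_scores(A, X, i, j, lam=0.05, n=3):
    """Score a candidate pair (i, j) with one heuristic per data factor.

    A: binary csr_matrix adjacency; X: dense node-feature matrix.
    lam and n follow the Katz hyperparameters stated above.
    """
    # LSP: common neighbors |Γ(i) ∩ Γ(j)|
    cn = A[i].multiply(A[j]).sum()

    # GSP: truncated Katz, lam*A + lam^2*A^2 + ... + lam^n*A^n
    S, term = csr_matrix(A.shape), A.copy()
    for p in range(1, n + 1):
        S = S + (lam ** p) * term
        if p < n:
            term = term @ A
    katz = S[i, j]

    # FP: cosine similarity of node features (higher means more similar;
    # Table 2 states FH as a distance, so negate it if a distance is needed)
    fh = X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]) + 1e-12)
    return cn, katz, fh
```

For large graphs, the truncated Katz matrix would normally be computed once and reused across pairs rather than rebuilt per query as in this sketch.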
C THEORETICAL ANALYSIS ON THE GRAPH LATENT SPACE MODEL

In Section 3.2, we introduce the latent space model and theoretically verify the effectiveness of heuristic algorithms derived from three factors: local structural proximity (LSP), global structural proximity (GSP), and feature proximity (FP). In this section, we provide all proof details for Sections 3.2 and 3.3 on the significance of the data factors and their underlying relationships. In particular, the following proofs are inspired by the latent space model proposed in (Sarkar et al., 2011). The main differences between our model and (Sarkar et al., 2011) are two-fold, with better alignment to real-world scenarios: (1) our model easily extends to graphs under arbitrary degree distributions, rather than only regular graphs in which all nodes share the same degree; and (2) our model takes node features into consideration, rather than only unfeatured graphs. Moreover, we provide a deep analysis of the interplay between feature and structural proximity.

C.1 EXTENDED LATENT SPACE MODEL FOR LINK PREDICTION

To ease the analysis, we adopt the following latent space model as the data assumption:

Definition 1 (Latent space model for link prediction). Nodes are uniformly distributed in a $D$-dimensional Euclidean space. Each node possesses a radius $r$ and corresponding volume $V(r)$. For any nodes $i, j$, the probability of link formation $P(i \sim j)$ is determined by the radii $(r_i, r_j)$ and the distance $d_{ij}$ of the nodes at its ends.
1) $V(r) = V(1)\, r^D$, where $V(r)$ is the volume of a ball of radius $r$ and $V(1)$ is the volume of a unit-radius hypersphere.
2) $Deg(i) = N V(r_i)$, where $Deg(i)$ is the degree of node $i$ and $N$ is the total number of nodes.
3) $P(i \sim j \mid d_{ij}) = 1/\left(1 + e^{\alpha(d_{ij} - \max\{r_i, r_j\})}\right) \quad (1)$
which depicts the probability of forming an undirected link between $i$ and $j$; $\alpha > 0$ controls the sharpness of the function and $r$ determines the threshold.

In order to normalize the probabilities, we assume that all points lie inside a unit-volume hypersphere in $D$ dimensions. The maximum radius $r_{MAX}$ satisfies $V(r_{MAX}) = V(1) r_{MAX}^D = 1$, i.e., $r_{MAX} = (1/V(1))^{1/D}$. For any nodes $i, j$ in the graph, we define the volume of intersection of the two balls $V(r_i)$ and $V(r_j)$ as $A(r_i, r_j, d_{ij})$, which can be bounded using hyperspheres as:
$$V(1)\left(\frac{r_i + r_j - d_{ij}}{2}\right)^D \le A(r_i, r_j, d_{ij}) \le V(1)\left((r^{max}_{ij})^2 - \frac{d_{ij}^2}{4}\right)^{D/2} \quad (2)$$
where $r^{max}_{ij} = \max\{r_i, r_j\}$. Eq. (2) above bridges $A(r_i, r_j, d_{ij})$ and $d_{ij}$ through the volume of the hypersphere. The latent space model above is well-suited for link prediction because the distance $d_{ij}$ portrays the probability of forming a link between nodes $i$ and $j$; for a given node, the most likely node it would connect to is the non-neighbor at the smallest distance. In addition, $Deg(i) = N V(r_i)$ describes the number of first-order neighbors of a node. Our latent space model can be applied to datasets where nodes are distributed independently in some latent metric space; given the positions, links are independent of each other.

C.2 THE EFFECTIVENESS OF LOCAL STRUCTURAL PROXIMITY (LSP)

Proposition 1 (latent space distance bound with CNs). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have
$$d_{ij} \le 2\sqrt{(r^{max}_{ij})^2 - \left(\frac{\eta_{ij}/N - \epsilon}{V(1)}\right)^{2/D}},$$
where $\eta_{ij}$ is the number of common neighbors between nodes $i$ and $j$, $r^{max}_{ij} = \max\{r_i, r_j\}$, and $V(1)$ is the volume of a unit-radius hypersphere in $D$-dimensional space. $\epsilon$ is a term independent of $\eta_{ij}$.
It vanishes as the number of nodes $N$ grows.

Proof: Denote $N(i)$ as the set of neighbors of node $i$ and let $Y_k$ be a random variable depending on the position of point $k$, which is 1 if $k \in N(i) \cap N(j)$ and 0 otherwise. Therefore, we have
$$E[Y_k \mid d_{ij}] = P(i \sim k \sim j \mid d_{ij}) = \int_{d_{ik}, d_{jk}} P(i \sim k \mid d_{ik})\, P(j \sim k \mid d_{jk})\, P(d_{ik}, d_{jk} \mid d_{ij})\, d(d_{ij}) = A(r_i, r_j, d_{ij}) \quad (3)$$
For the sake of brevity, we denote $E[Y_k \mid d_{ij}]$ as $E[Y_k]$. It can be easily observed that $\sum_{k=0}^{N} Y_k = \eta_{ij}$, i.e., the common neighbor (CN) number of nodes $i$ and $j$, denoted as $\eta_{ij}$ here. Through empirical Bernstein bounds (Maurer & Pontil, 2009), we have
$$\left|\sum_k Y_k / N - E[Y_k]\right| \le \sqrt{\frac{2\,\mathrm{var}_N(Y)\log(2/\delta)}{N}} + \frac{7\log(2/\delta)}{3(N-1)}$$
where $\mathrm{var}_N(Y) = \frac{\eta_{ij}(1 - \eta_{ij}/N)}{N-1}$ is the sample variance of $Y$. Setting $\epsilon = \sqrt{\frac{2\,\mathrm{var}_N(Y)\log(2/\delta)}{N}} + \frac{7\log(2/\delta)}{3(N-1)}$, we have
$$P\left[\frac{\eta_{ij}}{N} - \epsilon \le A(r_i, r_j, d_{ij}) \le \frac{\eta_{ij}}{N} + \epsilon\right] \ge 1 - 2\delta \quad (5)$$
Combining Eq. (2) and the equation above, we can get the bounds of $d_{ij}$ as:
$$V(1)\left(\frac{r_i + r_j - d_{ij}}{2}\right)^D \le \frac{\eta_{ij}}{N} + \epsilon, \qquad V(1)\left((r^{max}_{ij})^2 - \frac{d_{ij}^2}{4}\right)^{D/2} \ge \frac{\eta_{ij}}{N} - \epsilon$$
$$r_i + r_j - 2\left(\frac{\eta_{ij}/N + \epsilon}{V(1)}\right)^{1/D} \le d_{ij} \le 2\sqrt{(r^{max}_{ij})^2 - \left(\frac{\eta_{ij}/N - \epsilon}{V(1)}\right)^{2/D}}$$
The R.H.S. of the above equation, i.e., the upper bound of $d_{ij}$, decreases as $\eta_{ij}$ increases. In other words, a greater CN number $\eta_{ij}$ between two nodes $i, j$ causes the upper bound on the distance $d_{ij}$ between them to decrease, thus indicating the effectiveness of local structural proximity. Notably, the above theoretical analysis can easily extend to many other local heuristics, e.g., RA and AA (Zhou et al., 2009), which can be viewed as weighted versions of common neighbors. The key to extending the proof to those heuristics, which mainly differ from CN in their weight design, is to change the definition of $Y_k$; similar conclusions can then be proven. Details are shown in Section C.5.

C.3 THE EFFECTIVENESS OF GLOBAL STRUCTURAL PROXIMITY (GSP)

Proposition 2 (latent space distance bound with the number of paths). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have
$$d_{ij} \le \sum_{n=0}^{M-2} r_n + 2\sqrt{(r^{max}_M)^2 - \left(\frac{\eta_\ell(i,j) - b(N,\delta)}{c(N,\delta,\ell)}\right)^{\frac{2}{D(\ell-1)}}},$$
where $\eta_\ell(i,j)$ is the number of paths of length $\ell$ between $i$ and $j$ in $D$-dimensional Euclidean space, and $M \in \{1, \cdots, \ell-1\}$ indexes the intermediate nodes.

In this section, we analyze how $d_{ij}$ can be upper-bounded through the number of simple $\ell$-hop paths.

Definition 1 (simple path and set). Given nodes $i, j$ in a graph $G(V, E)$, define a simple path of length $\ell$ between $i, j$ as $\text{path}(i, k_1, k_2, \cdots, k_{\ell-2}, j)$ such that $i \sim k_1 \sim k_2 \sim \cdots \sim k_{\ell-2} \sim j$, and $S_\ell(i,j)$ as the set of all possible $\text{path}(i, k_1, k_2, \cdots, k_{\ell-2}, j)$, with $\{k_1, k_2, \cdots, k_{\ell-2}\} \subseteq V$.

Let $Y(i, k_1, k_2, \cdots, k_{\ell-2}, j)$ be a random variable which is 1 if $(i, k_1, k_2, \cdots, k_{\ell-2}, j) \in S_\ell(i,j)$ and 0 otherwise. We denote $\eta_\ell(i,j)$ as the number of paths of length $\ell$ between $i$ and $j$:
$$\eta_\ell(i,j) = \sum_{k_1, \ldots, k_{\ell-2} \in S(\ell-2)} Y(i, k_1, \ldots, k_{\ell-2}, j \mid d_{ij}) \quad (8)$$
Our goal is to infer bounds on $d_{ij}$ given the observed number $\eta_\ell(i,j)$. We first bound the maximum degree $\Delta$ as follows.

Lemma 1. $\Delta < N\left(1 + \sqrt{2\ln(1/\delta)/N}\right)$ with probability at least $1 - \delta$.

Proof: The degree $Deg(k)$ of any node $k$ is a binomial random variable with expectation $E[Deg(k)] = N V(r_k)$, where $V(r_k)$ is the volume of a hypersphere of radius $r_k$. Thus, using the Chernoff bound, $Deg(k) < N V(r_k)\left(1 + \sqrt{\frac{2\ln(1/\delta)}{N V(r_k)}}\right)$ holds with probability at least $1 - \delta$. Applying the union bound over all nodes yields the desired proposition, i.e., $\Delta < N V(r_{MAX})\left(1 + \sqrt{\frac{2\ln(1/\delta)}{N V(r_{MAX})}}\right) = N\left(1 + \sqrt{2\ln(1/\delta)/N}\right)$.

Next, we use $\Delta$ to bound the maximum possible value of $\eta_\ell(i,j)$.

Lemma 2. For any graph with maximum degree $\Delta$, we have $\eta_\ell(i,j) \le \Delta^{\ell-1}$.

Proof: This can be proved using a simple inductive argument.
If the graph is represented by adjacency matrix $M$, then the number of length-$\ell$ paths between $i$ and $j$ is given by $M^\ell(i,j)$. Trivially, $M^2_{ij}$ can be at most $\Delta$. This happens when both $i$ and $j$ have degree $\Delta$ and their neighbors form a perfect matching. Assuming this is true for all $m < \ell$, we have: $M^\ell(i,j) = \sum_p M(i,p)\, M^{\ell-1}(p,j) \le \Delta^{\ell-2} \sum_p M(i,p) \le \Delta^{\ell-1}$.

Lemma 3. For $\ell < \Delta$, $\left|\eta_\ell(i,j \mid X_1, \ldots, X_p, \ldots, X_N) - \eta_\ell(i,j \mid X_1, \ldots, X_p', \ldots, X_N)\right| \le (\ell-1)\Delta^{\ell-2}$.

Proof: The largest change in $\eta_\ell(\cdot)$ occurs when node $p$ was originally unconnected to any other node and is moved to a position where it maximally adds to the number of $\ell$-hop paths between $i$ and $j$ (or vice versa). Consider all paths where $p$ is $m$ hops from $i$ (and hence $\ell - m$ hops from $j$). From Lemma 2, the number of such paths can be at most $\Delta^{m-1}\, \Delta^{\ell-m-1} = \Delta^{\ell-2}$. Since $m \in \{1, \ldots, \ell-1\}$, the maximum change is $(\ell-1)\Delta^{\ell-2}$.

Define $P_\ell(i,j)$ as the probability of observing an $\ell$-hop path between points $i$ and $j$. Next, we compute the expected number of $\ell$-hop paths.

Theorem 1. $E[\eta_\ell(i,j)] \le \Delta^{\ell-1} \prod_{p=1}^{\ell-1} A\left(r, p r, (d_{ij} - (\ell - p - 1) r)^+\right)$, where $x^+$ is defined as $\max(x, 0)$.

Proof: Consider an $\ell$-hop path between $i, j$. For clarity of notation, let us denote the distances $d_{i,k_1}, d_{k_1,k_2}$, etc. by $a_1, a_2$, up to $a_{\ell-1}$, and the radii $r_i, r_{k_1}, \cdots, r_j$ by $r_0, r_1, \cdots, r_{\ell-1}$. We also denote the distances $d_{j k_1}, d_{j k_2}$, etc. by $d_1, d_2, \cdots, d_{\ell-1}$. Note $r^*_m = \max(r_{m-1}, r_m)$ for $m \in \{1, 2, \cdots, \ell-1\}$. From the triangle inequality, $d_{\ell-2} \le a_{\ell-1} + a_\ell \le r_{\ell-1} + r_\ell$, and by induction, $d_k \le \sum_{m=k+1}^{\ell} r_m$. Similarly, $d_1 \ge (d_{ij} - a_1)^+ \ge (d_{ij} - r_0)^+$, and by induction, $d_k \ge \left(d_{ij} - \sum_{n=0}^{k-1} r_n\right)^+$. Integrating the joint probability of an $\ell$-hop path over the intermediate distances within these limits yields
$$P_\ell(i,j) = P(i \sim k_1 \sim \cdots \sim k_{\ell-1} \sim j \mid d_{ij}) \le \prod_{p=1}^{\ell-1} A\left(r^*_p,\ \sum_{m=p+1}^{\ell} r_m,\ \left(d_{ij} - \sum_{n=0}^{p-2} r_n\right)^+\right) \quad (9)$$
Since there can be at most $\Delta^{\ell-1}$ possible paths (from Lemma 2), the theorem statement follows.

Theorem 2. With probability at least $1 - 2\delta$,
$$\eta_\ell(i,j) \le \left[N\left(1 + \sqrt{2\ln(1/\delta)/N}\right)\right]^{\ell-1} \prod_{p=1}^{\ell-1} A\left(r^*_p,\ \sum_{m=p+1}^{\ell} r_m,\ d_{ij} - \sum_{n=0}^{p-2} r_n\right) + (\ell-1)\Delta^{\ell-2}\sqrt{\frac{N\ln(1/\delta)}{2}}.$$

Proof: From McDiarmid's inequality (McDiarmid et al., 1989) and the bounded difference of Lemma 3, we have:
$$\eta_\ell(i,j) \le E[\eta_\ell(i,j)] + (\ell-1)\Delta^{\ell-2}\sqrt{\frac{N\ln(1/\delta)}{2}}$$
Combining this with Theorem 1 and Lemma 1, the inequality above can be rewritten as:
$$\eta_\ell(i,j) \le c(N,\delta,\ell) \prod_{p=1}^{\ell-1} A\left(r^*_p,\ \sum_{m=p+1}^{\ell} r_m,\ d_{ij} - \sum_{n=0}^{p-2} r_n\right) + b(N,\delta) \quad (14)$$
where $c(N,\delta,\ell)$ and $b(N,\delta)$ contain coefficients that do not vary with $d_{ij}$. Note $r^{max}_p = \max\{r^*_p, \sum_{m=p+1}^{\ell} r_m\}$; then $d_{ij}$ can be upper-bounded through the number of simple $\ell$-hop paths $\eta_\ell(i,j)$ with Eq. (2):
$$\frac{\eta_\ell(i,j) - b(N,\delta)}{c(N,\delta,\ell)} \le \prod_{p=1}^{\ell-1} A\left(r^*_p,\ \sum_{m=p+1}^{\ell} r_m,\ d_{ij} - \sum_{n=0}^{p-2} r_n\right) \le \prod_{p=1}^{\ell-1} V(1)\left((r^{max}_p)^2 - \frac{\left(d_{ij} - \sum_{n=0}^{p-2} r_n\right)^2}{4}\right)^{D/2}$$
so that, for some $M \in \{1, \cdots, \ell-1\}$,
$$d_{ij} \le \sum_{n=0}^{M-2} r_n + 2\sqrt{(r^{max}_M)^2 - \left(\frac{\eta_\ell(i,j) - b(N,\delta)}{c(N,\delta,\ell)}\right)^{\frac{2}{D(\ell-1)}}} \quad (15)$$
Since $b$ is negligible, we can observe that the upper bound decreases as $\eta_\ell(i,j)$ increases, which implies the effectiveness of GSP as well.

The relationship between global structural proximity (GSP) and local structural proximity (LSP): The detailed proof of the relationship between global and local structural proximity is as follows. The key is to view the CN algorithm as the count of the number of paths $\eta_\ell(i,j)$ with length $\ell = 2$.

Lemma 4 (latent space distance bound with local and global structural proximity). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have
$$d_{ij} \le \sum_{n=0}^{M-2} r_n + 2\sqrt{(r^{max}_M)^2 - \left(\frac{\eta_\ell(i,j) - b(N,\delta)}{c(N,\delta,\ell)}\right)^{\frac{2}{D(\ell-1)}}},$$
where $\sum_{n=0}^{M-2} r_n$ and $r^{max}_M$ serve as independent variables that do not change with $\ell$.

Proof: Substituting $c$ and $b$ into Eq. (15) and simplifying, the last term of the upper bound becomes
$$2\sqrt{(r^{max}_M)^2 - \left(\frac{\eta_\ell(i,j)}{\left[N\left(1 + \sqrt{2\ln(1/\delta)/N}\right)\right]^{\ell-1}}\right)^{\frac{2}{D(\ell-1)}}} \quad (16\text{-}19)$$
Since $\ell$ in the exponent $\frac{2}{D(\ell-1)}$ acts as an exponential coefficient, the upper bound of $d_{ij}$ grows at a surprisingly fast rate as $\ell$ increases (i.e., as the order of the structural proximity goes higher), which makes the effectiveness of structural proximity much weaker.

C.4 THE EFFECTIVENESS OF FEATURE PROXIMITY (FP) & RELATIONSHIP WITH STRUCTURAL PROXIMITY

Proposition 3 (latent space distance bound with feature proximity). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have
$$d_{ij} \le 2\sqrt{(r^{max}_{ij})^2 - \left(\frac{\beta_{ij}\left(1 - A(r_i, r_j, d_{ij})\right) + A(r_i, r_j, d_{ij})}{V(1)}\right)^{2/D}},$$
where $\beta_{ij}$ measures the feature proximity between $i$ and $j$, $r^{max}_{ij} = \max\{r_i, r_j\}$, and $V(1)$ is the volume of a unit-radius hypersphere in $D$-dimensional Euclidean space. $A(r_i, r_j, d_{ij})$ is the volume of intersection of the two balls $V(r_i)$ and $V(r_j)$ in latent space, corresponding to the expectation of common neighbors.

For graphs with node features, feature proximity (FP) is also often used as a similarity measure. In this section, we briefly analyze the effect of FP on $d_{ij}$. Since FP is not directly related to the spatial quantities of individuals in the model (e.g., radius $r$, distance $d$), we can think of it as a noise parameter that affects the probability of connecting two nodes. The original deterministic model, $i \sim j$ iff $d_{ij} < \max\{r_i, r_j\}$, relies almost exclusively on structural proximity. Introducing FP as a noise parameter $\beta_{ij}$ extends the deterministic model further into a non-deterministic one: $i \sim j$ with probability $\beta_{ij} \in (0, 1)$ if $d_{ij} > \max\{r_i, r_j\}$, or with probability $1 - \beta_{ij}$ otherwise. Now, the probability of having a common neighbor between nodes $i, j$ becomes $A_{\beta_{ij}}(r_i, r_j, d_{ij}) = \beta_{ij} + (1 - \beta_{ij}) A(r_i, r_j, d_{ij})$. Substituting $A_{\beta_{ij}}(r_i, r_j, d_{ij})$ for $A(r_i, r_j, d_{ij})$ in Eq. (2), we have:
$$V(1)\left(\frac{r_i + r_j - d_{ij}}{2}\right)^D \le A_\beta(r_i, r_j, d_{ij}) \le V(1)\left((r^{max}_{ij})^2 - \frac{d_{ij}^2}{4}\right)^{D/2} \quad (21)$$
The upper bound of $d_{ij}$ is
$$2\sqrt{(r^{max}_{ij})^2 - \left(\frac{\beta_{ij}\left(1 - A(r_i, r_j, d_{ij})\right) + A(r_i, r_j, d_{ij})}{V(1)}\right)^{2/D}}.$$
An increase in the FP value $\beta_{ij}$ reduces the upper bound of $d_{ij}$, which reveals the effectiveness of FP, since $A(r_i, r_j, d_{ij}) \in [0, 1)$ and $\beta \in (0, 1)$.

The relationship between local structural proximity (LSP) and feature proximity (FP):

Lemma 5 (Incompatibility between LSP and FP factors). For any $\delta > 0$, with probability at least $1 - 2\delta$, we have
$$\eta_{ij} = \frac{c}{1 - \beta_{ij}} + N(1 + \epsilon),$$
where $\eta_{ij}$ and $\beta_{ij}$ are the number of common neighbors and the feature proximity between nodes $i$ and $j$, respectively, and $c < 0$ is an independent variable that does not change with $\beta_{ij}$ and $\eta_{ij}$. Hence $\eta_{ij}$ is negatively correlated with $\beta_{ij}$.

Proof: Combining Eq. (5) and Eq. (21), we have for any $\delta > 0$, with probability at least $1 - 2\delta$,
$$d_{ij} \le U_{d_{ij}} = 2\sqrt{(r^{max}_{ij})^2 - \left(\frac{\beta_{ij} + (1 - \beta_{ij})\left(\eta_{ij}/N - \epsilon\right)}{V(1)}\right)^{2/D}} \quad (22)$$
where $\epsilon = \sqrt{\frac{2\,\mathrm{var}_N(Y)\log(2/\delta)}{N}} + \frac{7\log(2/\delta)}{3(N-1)}$, $\mathrm{var}_N(Y) = \frac{\eta_{ij}(1 - \eta_{ij}/N)}{N-1}$ is the sample variance of $Y$, and $\eta_{ij}$ represents the common neighbor number between nodes $i$ and $j$. Eq. (22) can be rewritten as:
$$\eta_{ij} = \frac{c(U_{d_{ij}}, r^{max}_{ij}, N)}{1 - \beta_{ij}} + N(1 + \epsilon) \quad (23)$$
$$c(U_{d_{ij}}, r^{max}_{ij}, N) = N V(1)\left((r^{max}_{ij})^2 - \left(\frac{U_{d_{ij}}}{2}\right)^2\right)^{D/2} - N \quad (24)$$
$$\le N V(1)\left((r^{max}_{ij})^2 - \left(\frac{d_{ij}}{2}\right)^2\right)^{D/2} - N \quad (25)$$
$$< N V(1) \cdot \frac{1}{V(1)} - N = 0 \quad (26)$$
Since $\beta \in (0, 1)$, we can learn from Eq. (23) that, numerically, $\eta_{ij}$ decreases as $\beta_{ij}$ increases, and vice versa. It indicates that when the ground-truth $d_{ij}$ is fixed, there is a conflict between LSP $\eta_{ij}$ and FP $\beta_{ij}$: these two cannot be increased at the same time.
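As an empirical sanity check of the mechanism behind Proposition 1, the following is a minimal simulation sketch of the deterministic latent space model (the $\alpha \to +\infty$ case), under two simplifying assumptions not in Definition 1: $D = 2$ and a single shared radius for all nodes. Among pairs that share at least one neighbor, the common-neighbor count and the latent distance should come out negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 2000, 0.08  # node count and shared radius (both hypothetical choices)
X = rng.uniform(size=(N, 2))  # nodes placed uniformly in the unit square (D = 2)

# Deterministic link rule (alpha -> +inf): i ~ j iff d_ij < r
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = (dist < r).astype(np.int64)
np.fill_diagonal(A, 0)

cn = A @ A  # cn[i, j] = number of common neighbors eta_ij
iu = np.triu_indices(N, k=1)
has_cn = cn[iu] > 0
# Among pairs sharing at least one neighbor, eta_ij and d_ij are negatively
# correlated, consistent with the upper bound of Proposition 1.
print(np.corrcoef(cn[iu][has_cn], dist[iu][has_cn])[0, 1])
```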
The relationship between global structural proximity (GSP) and feature proximity (FP): After introducing feature proximity into the model, we can rewrite Eq. (14) as:
$$\frac{\eta_\ell(i,j) - b(N,\delta)}{c(N,\delta,\ell)} \le \prod_{p=1}^{\ell-1}\left[\beta_{ij} + (1 - \beta_{ij})\, A\left(r^*_p,\ \sum_{m=p+1}^{\ell} r_m,\ d_{ij} - \sum_{n=0}^{p-2} r_n\right)\right]$$
Upper-bounding each intersection volume via Eq. (2) and isolating the factor at index $M \in \{1, \cdots, \ell-1\}$, with the fixed upper bound $U_{d_{ij}}$ on $d_{ij}$, we obtain
$$\beta_{ij} + (1 - \beta_{ij})\, V(1)\left((r^{max}_M)^2 - \frac{\left(U_{d_{ij}} - \sum_{n=0}^{M-2} r_n\right)^2}{4}\right)^{D/2} \ge \left(\frac{\eta_\ell(i,j) - b(N,\delta)}{c(N,\delta,\ell)}\right)^{1/(\ell-1)}$$
which, as in Eq. (23), expresses $\eta_\ell(i,j)$ as a function of $\beta_{ij}$ with a negative coefficient. We can thus learn that, numerically, $\beta_{ij}$ decreases as $\eta_\ell(i,j)$ increases, and vice versa. It indicates a conflict between GSP $\eta_\ell(i,j)$ and FP $\beta_{ij}$: these two cannot be increased at the same time.

C.5 THEORETICAL ANALYSIS FOR OTHER LOCAL STRUCTURAL HEURISTICS

Beyond the common neighbor number (CN), there are still many local heuristic metrics, e.g., RA and AA. In this section, we analyze the relationship between $d_{ij}$ and these metrics, which gives a broader account of the effectiveness of local structural proximity (LSP). We define RA and AA between nodes $i$ and $j$ as $\eta^{RA}_{ij} = \sum_{k \in \Gamma(i) \cap \Gamma(j)} \frac{1}{Deg(k)}$ and $\eta^{AA}_{ij} = \sum_{k \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\log(Deg(k))}$. Note $\min(Deg_{ij}) = \min_k Deg(k)$ and $\max(Deg_{ij}) = \max_k Deg(k)$ for $k \in \Gamma(i) \cap \Gamma(j)$.

For RA, let $Y_k$ be a random variable depending on the position of point $k$: $Y_k = \frac{1}{Deg(k)}$ if $k \in N(i) \cap N(j)$, and 0 otherwise. Therefore we have:
$$E[Y_k \mid d_{ij}] = \int \frac{1}{Deg(k)}\, P(i \sim k \mid d_{ik})\, P(j \sim k \mid d_{jk})\, P(d_{ik}, d_{jk} \mid d_{ij})\, d(d_{ij}) \le \frac{1}{\min(Deg_{ij})}\, A(r_i, r_j, d_{ij}) \quad (30)$$
For the sake of brevity, we denote $E[Y_k \mid d_{ij}]$ as $E[Y_k]$. Similarly, we can obtain:
$$\frac{1}{\max(Deg_{ij})}\, A(r_i, r_j, d_{ij}) \le E[Y_k] \le \frac{1}{\min(Deg_{ij})}\, A(r_i, r_j, d_{ij}) \quad (31)$$
It can be easily observed that $\sum_{k=0}^{N} Y_k = \eta^{RA}_{ij}$, i.e., the RA number of nodes $i$ and $j$, denoted as $\eta^{RA}_{ij}$ here. Through empirical Bernstein bounds (Maurer & Pontil, 2009), we have:
$$\left|\sum_k Y_k / N - E[Y_k]\right| \le \sqrt{\frac{2\,\mathrm{var}_N(Y)\log(2/\delta)}{N}} + \frac{7\log(2/\delta)}{3(N-1)}$$
where $\mathrm{var}_N(Y) = \frac{\eta^{RA}_{ij}(1 - \eta^{RA}_{ij}/N)}{N-1}$ is the sample variance of $Y$. Setting $\epsilon = \sqrt{\frac{2\,\mathrm{var}_N(Y)\log(2/\delta)}{N}} + \frac{7\log(2/\delta)}{3(N-1)}$, we have
$$P\left[\frac{\eta^{RA}_{ij}}{N} - \epsilon \le E[Y_k] \le \frac{\eta^{RA}_{ij}}{N} + \epsilon\right] \ge 1 - 2\delta$$
Combining Eq. (2) and the equation above, we can get the bounds of $d_{ij}$ as:
$$\frac{V(1)}{\max(Deg_{ij})}\left(\frac{r_i + r_j - d_{ij}}{2}\right)^D \le \frac{\eta^{RA}_{ij}}{N} + \epsilon, \qquad \frac{V(1)}{\min(Deg_{ij})}\left((r^{max}_{ij})^2 - \frac{d_{ij}^2}{4}\right)^{D/2} \ge \frac{\eta^{RA}_{ij}}{N} - \epsilon$$
$$r_i + r_j - 2\left(\frac{\eta^{RA}_{ij}/N + \epsilon}{V(1)/\max(Deg_{ij})}\right)^{1/D} \le d_{ij} \le 2\sqrt{(r^{max}_{ij})^2 - \left(\frac{\eta^{RA}_{ij}/N - \epsilon}{V(1)/\min(Deg_{ij})}\right)^{2/D}}$$
In fact, we can make a more accurate approximation in Eq. (30) if the degree distribution of the graph is known in advance. Similarly, we can provide a proof for AA, with the only difference being that $Y_k$ becomes $\frac{1}{\log(Deg(k))}$ if $k \in N(i) \cap N(j)$, and 0 otherwise.

C.6 NON-DETERMINISTIC CASE FOR THE LATENT SPACE MODEL

In this section, we extend our previous analysis, which was based on infinite $\alpha$ ($\alpha \to +\infty$) in Eq. (1), to finite $\alpha$. This helps generalize our conclusions to more diverse graphs with different underlying mechanisms. The core idea underlying almost all of our previous results has been the computation of the probability that two nodes $i$ and $j$ have a common neighbor. In the deterministic case, this is simply the volume of intersection of two hyperspheres, $A(r_i, r_j, d_{ij})$. However, in the non-deterministic case, there is no such intuitive equivalence. Therefore, our primary goal is to find the representation in the latent space model associated with the common neighbor count $\eta_{ij}$. The analysis is similar to that of Section C.2. Set $\beta = 0$, so that $P(i \sim j \mid d_{ij}) = 1/\left(1 + e^{\alpha(d_{ij} - r^{max}_{ij})}\right)$ depicts the probability of forming an undirected link between $i$ and $j$. From the triangle inequality, we have $d_{ik} < d_{ij} + d_{jk}$ and $d_{ik} > \max\{d_{ij} - d_{jk},\, d_{jk} - d_{ij}\}$.
Denote $N(i)$ as the set of neighbors of node $i$ and let $Y_k$ be a random variable depending on the position of point $k$, which is 1 if $k \in N(i) \cap N(j)$ and 0 otherwise. Therefore, we have:
$$E[Y_k \mid d_{ij}] = P(i \sim k \sim j \mid d_{ij}) = \int P(i \sim k \mid d_{ik})\, P(j \sim k \mid d_{jk})\, P(d_{ik}, d_{jk} \mid d_{ij})\, d(d_{ij})$$
Splitting the integration domain by the triangle-inequality limits above, bounding each piece by one-dimensional integrals of the logistic link $1/\left(1 + e^{\alpha(d - r^{max})}\right)$, and collecting the resulting constants $c_1(r^{max}_{ik}, \alpha)$ and $c_1(r^{max}_{jk}, \alpha)$, which do not vary with $d_{ij}$, we arrive at
$$E[Y_k \mid d_{ij}] \le e_1(r^{max}_{ik}, r^{max}_{jk}, \alpha) - e_2(r^{max}_{ik}, r^{max}_{jk}, \alpha)\,(1 + 2 d_{ij}) \quad (36)$$
where $e_1(r^{max}_{ik}, r^{max}_{jk}, \alpha), e_2(r^{max}_{ik}, r^{max}_{jk}, \alpha) > 0$ are independent terms that do not vary with $d_{ij}$. For the sake of brevity, we denote $E[Y_k \mid d_{ij}]$ as $E[Y_k]$. It can be easily observed that $\sum_{k=0}^{N} Y_k = \eta_{ij}$, i.e., the common neighbor (CN) number of nodes $i$ and $j$. Through empirical Bernstein bounds (Maurer & Pontil, 2009), we have:
$$\left|\sum_k Y_k / N - E[Y_k]\right| \le \sqrt{\frac{2\,\mathrm{var}_N(Y)\log(2/\delta)}{N}} + \frac{7\log(2/\delta)}{3(N-1)}$$
where $\mathrm{var}_N(Y) = \frac{\eta_{ij}(1 - \eta_{ij}/N)}{N-1}$ is the sample variance of $Y$. Setting $\epsilon = \sqrt{\frac{2\,\mathrm{var}_N(Y)\log(2/\delta)}{N}} + \frac{7\log(2/\delta)}{3(N-1)}$, we have
$$P\left[\frac{\eta_{ij}}{N} - \epsilon \le E[Y_k] \le \frac{\eta_{ij}}{N} + \epsilon\right] \ge 1 - 2\delta \quad (38)$$
Combining Eq. (36) and Eq. (38), we can further obtain that for any $\delta > 0$, with probability at least $1 - 2\delta$,
$$\frac{\eta_{ij}}{N} - \epsilon \le E[Y_k] \le e_1(r^{max}_{ik}, r^{max}_{jk}, \alpha) - e_2(r^{max}_{ik}, r^{max}_{jk}, \alpha)\,(1 + 2 d_{ij}) \quad (39)$$
$$d_{ij} \le \frac{e_1(r^{max}_{ik}, r^{max}_{jk}, \alpha) - \left(\eta_{ij}/N - \epsilon\right)}{2\, e_2(r^{max}_{ik}, r^{max}_{jk}, \alpha)} - \frac{1}{2} \quad (40)$$
From the above analysis, we can draw two conclusions for the non-deterministic model (finite $\alpha$): 1) an analogue of Proposition 1 holds, centering on the fact that the upper bound of $d_{ij}$ decreases as $\eta_{ij}$ increases; 2) we have succeeded in finding an upper bound on $E[Y_k]$ (in terms of $d_{ij}$) in the latent space model, which allows us to similarly obtain all the conclusions of the deterministic model, i.e., the theoretical analyses derived from our model are not constrained by $\alpha$.

D DESCRIPTIONS ON MORE DATASETS

To ensure the diversity and completeness of our analysis, we collect datasets from many different domains, e.g., biology (Watts & Strogatz, 1998; Zhang et al., 2018; Von Mering et al., 2002; Oughtred et al., 2019), transport (Batagelj & Mrvar, 2006; Watts & Strogatz, 1998), web (Ackland et al., 2005; Leskovec et al., 2009; Yang & Leskovec, 2012), academic (Newman, 2006; Leskovec et al., 2007; 2005; Yang & Leskovec, 2012), and social networks (Leskovec & Mcauley, 2012; Leskovec et al., 2010; Richardson et al., 2003). Along with domain diversity, the graphs in our datasets also cover a wide range of types and sizes. Their types include weighted and unweighted, directed and undirected. The number of nodes varies from $10^2$ to $10^6$, while the number of edges varies from $10^3$ to $10^7$. Below is a detailed description of the datasets used.

Selected datasets. We first present the selected datasets with factors ignored in the existing benchmarking datasets, as mentioned in Section 4.2. The statistics of the selected datasets are listed in Table 3. Power (Watts & Strogatz, 1998) is an electrical grid of the western US.
Each node represents a station, and each edge represents a power line between two connected stations. Amazon-Photo (Shchur et al., 2018) is a portion of the Amazon co-purchase graph (McAuley et al., 2015) in which nodes symbolize products. The presence of an edge between two nodes suggests that these products are often purchased together. The features of each node are derived from product reviews using a bag-of-words encoding. Reddit (Zeng et al., 2019) is an undirected graph from the online discussion forum. Nodes represent posts, while edges indicate posts commented on by the same user. Node features are bag-of-words representations of the posts. Flickr (Zeng et al., 2019) is an undirected graph originating from NUS-WIDE. Each node represents an image uploaded to Flickr, while edges indicate that two images share some common properties (e.g., same geographic location, same gallery, comments by the same user). Node features are the 500-dimensional bag-of-words representations of the images provided by NUS-WIDE.

Table 3: Selected Datasets Statistics.

              Power      Reddit        Amazon-Photo  Flickr
#Nodes        4,941      232,965       7,650         334,863
#Edges        6,594      114,615,892   238,162       899,756
#Features     NA         602           745           500
Mean Degree   2.67       98.38         6.21          5.69
Split Ratio   80/10/10   80/10/10      80/10/10      80/10/10
Domain        Transport  Social        Web           Social

Table 4: Web Domain Datasets Statistics.

              PB        Email-Enron  Amazon-Photo  Amazon    Google
#Nodes        1,222     36,692       7,650         334,863   875,713
#Edges        16,714    183,831      238,162       899,756   5,105,039
Mean Degree   27.36     10.02        6.21          5.69      116.49
Split Ratio   80/10/10  80/10/10     80/10/10      80/10/10  80/10/10
Domain        Web       Web          Web           Web       Web

Biology Domain. C.ele (Watts & Strogatz, 1998) is an undirected, unweighted neural network of C. elegans. Nodes represent neurons, and each edge represents a pathway between two connected neurons. E.coli (Zhang et al., 2018) is an undirected, unweighted metabolic network of E. coli. Nodes represent participating metabolites, and edges indicate that two connected metabolites can interact. Yeast (Von Mering et al., 2002) is an undirected, unweighted protein-protein interaction network in yeast. Nodes represent different kinds of proteins, and edges indicate that two proteins can interact. The statistics of the biology domain datasets are listed in Table 5.

Transport Domain. USAir (Batagelj & Mrvar, 2006) is a weighted, directed network of US airlines, in which nodes represent airports while weighted edges indicate airline frequency between two airports. Power (Watts & Strogatz, 1998) is an electrical grid of the western US. Each node represents a station, and each edge represents a power line between two connected stations. The statistics of the transport domain datasets are listed in Table 6.

Web Domain. PB (Ackland et al., 2005) is a directed network of US political blogs, where nodes represent blogs and directed edges represent links from one blog to another. Email-Enron (Leskovec et al., 2009) is a directed, unweighted graph covering around half a million emails. Nodes represent email addresses, and edges indicate that one address sent an email to another. Amazon (Yang & Leskovec, 2012) is an undirected, unweighted graph constructed from Amazon shopping data. Nodes represent products, while edges indicate that two connected products used to be bought together. Google (Leskovec et al., 2009) is a directed, unweighted graph of website hyperlinks.
Nodes represent websites, while edges indicate that one website links to another. The statistics of the web domain datasets are listed in Table 4.

Academic Domain. NS (Newman, 2006) is an unweighted, undirected network of collaborations among researchers in network science, where nodes represent scientists and edges represent collaboration relationships. Ca-AstroPh (Leskovec et al., 2007) is an undirected, unweighted collaboration network from the e-print arXiv covering scientific collaborations between authors of papers submitted to the Astro Physics category. Nodes represent authors and edges represent co-authorship. Ca-CondMat (Leskovec et al., 2007) is an undirected, unweighted collaboration network from the e-print arXiv covering scientific collaborations between authors of papers submitted to the Condensed Matter category. Nodes represent authors and edges represent co-authorship. Cit-HepPh (Leskovec et al., 2005) is a directed, unweighted citation graph from the e-print arXiv. Nodes represent papers, and edges indicate that one paper cites another. DBLP (Yang & Leskovec, 2012) is an undirected, unweighted collaboration network from the computer science domain. Nodes represent authors and edges represent co-authorship. The statistics of the academic domain datasets are listed in Table 7.

Social Domain. Facebook (Leskovec & Mcauley, 2012) is an undirected, unweighted social network. Nodes represent users, and edges represent a friendship relation between two users. Vote (Leskovec et al., 2010) is a directed, unweighted graph of Wikipedia votes. Nodes represent participants, while edges indicate that one participant voted for another. Epinions (Richardson et al., 2003) is a directed, unweighted who-trusts-whom online social network in which users decide whether to "trust" each other. Each node represents a user, and each edge indicates whether one user trusts another. Slashdot (Leskovec et al., 2010) is a directed, weighted social network. Nodes represent users, while edges indicate whether a user regards another as a friend or foe. The statistics of the social domain datasets are listed in Table 8. Reddit, Flickr, and Amazon-Photo are described above under the selected datasets.

E ADDITIONAL RESULTS IN MAIN ANALYSIS

Additional analysis on whether heuristics offer unique perspectives. In Section 3.1, we conduct an analysis of the complementary effect of different factors. Additional results on Cora and Citeseer are shown in Figure 7. More results with the Hits@10 metric can be found in Figures 8 and 9. Similar observations hold.
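The overlap statistics reported in Figures 7-9 can be reproduced with a short routine along the following lines. This is a sketch rather than the paper's released code: it assumes two score arrays aligned over the same candidate edge list, and `topk_overlap` is an illustrative name.

```python
import numpy as np

def topk_overlap(score_a, score_b, ratio=0.25):
    """Overlap ratio between the top-`ratio` ranked candidate edges of two
    heuristics; both arrays score the same candidate edge list."""
    k = int(len(score_a) * ratio)
    top_a = set(np.argsort(-score_a)[:k])   # indices of top-k edges under heuristic a
    top_b = set(np.argsort(-score_b)[:k])   # indices of top-k edges under heuristic b
    return len(top_a & top_b) / k

# Hypothetical usage with precomputed CN and Katz scores per candidate edge:
# print(topk_overlap(cn_scores, katz_scores))
```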
Table 5: Biology Domain Datasets Statistics.

              C.ele     E.coli    Yeast
#Nodes        297       1,805     2,375
#Edges        2,148     14,660    11,693
Mean Degree   14.46     12.55     9.85
Split Ratio   80/10/10  80/10/10  80/10/10
Domain        Biology   Biology   Biology

Table 6: Transport Domain Datasets Statistics.

              USAir      Power
#Nodes        332        4,941
#Edges        2,126      6,594
Mean Degree   12.82      2.67
Split Ratio   80/10/10   80/10/10
Domain        Transport  Transport

Additional analysis on the overlooked weakness in GNN4LP models. In Section 3.4, we conduct an analysis that finds the overlooked weakness of GNN4LP models. Additional comparison results with GraphSAGE are shown in Figure ??. Results of the comparison with GCN are shown in Figure 10. Similar observations can be found.

Data Analysis. In Section 1, we exhibit the data disparity via the distribution of CN scores on different datasets. Here we provide more analysis across the GSP, LSP, and FP factors in Figure 11. Similar data disparities can be found.

F ADDITIONAL ANALYSIS AMONG DATA FACTORS

In Section 3.1, we analyze the significance of data factors for link prediction and the complementary effects among them. In this section, we provide further analysis integrating all these factors for a holistic view. We ultimately investigate whether those data factors provide a complete view of link prediction with no neglected factors. Specifically, we examine whether heuristics derived from different data factors share similar hard negative pairs. Hard negative pairs, i.e., the top-ranked negative pairs, pose the main obstacle to link prediction. The negative pair set is denoted as $E_i$, where $i$ indicates the $i$-th heuristic algorithm. We conduct experiments to examine the proportion of hard negative edges remaining after including more data factors. The experimental results are shown in Figure 12. The y-axis is the proportion of remaining hard negative edges, calculated as $\frac{|\cap_{i \in S} E_i|}{|E_0|}$, where $S$ is the set of heuristic algorithms and $E_0$ is the hard negative edge set of the basis heuristic algorithm. The x-axis indicates heuristic algorithms derived from different data factors. Looking along the x-axis from left to right, we gradually add new heuristics into the candidate heuristic set, beginning with a single basic algorithm. We find that when adding each heuristic with a new factor, e.g., adding the CN algorithm, the ideal performance increases substantially, indicating the significance of each factor and its complementary effects. Such observations indicate that the three factors provide a complete view, with no factor neglected.

G LIMITATION

It is important to acknowledge certain limitations of our work, despite our research providing valuable insights into the underlying factors across different datasets for link prediction. Notably, there is a series of heuristic algorithms revolving around each factor, while we only focus on a few typical ones. To this end, we show experimental results in Section 3.1, providing a more concrete discussion of the overlap among heuristic algorithms derived from the same factor. Nonetheless, there may still remain some ignored heuristic algorithms.

Figure 7: Overlapping ratio between top-ranking edges (top 25%) for different heuristic algorithms on (a) Cora and (b) Citeseer.
Diagonals compare two heuristics within the same factor, while other cells compare heuristics derived from different factors. Experiments are conducted with the Hits@10 metric.

Figure 8: Overlapping ratio between top-ranking edges (top 25%) for different heuristic algorithms under the Hits@10 metric, on Cora, Citeseer, Pubmed, and ogbl-collab. Diagonals compare two heuristics within the same factor, while other cells compare heuristics derived from different factors.

Table 7: Academic Domain Datasets Statistics.

              NS        Ca-AstroPh  Ca-CondMat  Cit-HepPh  DBLP
#Nodes        1,589     18,772      23,133      34,546     317,080
#Edges        2,742     5,105,039   93,497      421,578    1,049,866
Mean Degree   3.45      543.90      8.08        24.41      6.62
Split Ratio   80/10/10  80/10/10    80/10/10    80/10/10   80/10/10
Domain        Academic  Academic    Academic    Academic   Academic

Table 8: Social Domain Datasets Statistics.

              Facebook  Vote      Epinions  Slashdot  Reddit       Flickr
#Nodes        4,039     7,115     75,879    82,140    232,965      89,250
#Edges        88,234    103,689   508,837   549,202   114,615,892  925,872
Mean Degree   43.69     29.15     13.41     13.37     98.38        20.75
Split Ratio   80/10/10  80/10/10  80/10/10  80/10/10  80/10/10     80/10/10
Domain        Social    Social    Social    Social    Social       Social

In order to facilitate a more feasible theoretical analysis, we have made several assumptions, the most significant being that we examine the effects of features and structure separately. Typically, we model feature proximity as an additional noise parameter. The correlation between the feature and structure perspectives is analyzed only through the conflict in their effects on link prediction. This assumption has a notable drawback, since it ignores the correlation between features and existing structural information, which restricts the generality of our theoretical analysis. Moreover, we analyze the effectiveness of global structural proximity with the number of paths rather than the number of random walks. Paths are deterministic, non-repetitive sequences connecting two nodes, while random walks are probabilistic sequences that may revisit nodes. Extending the analysis to random walks should be the next step; our current findings lay a strong foundation for further exploration. It is worth mentioning that our theoretical analysis predominantly focuses on a data perspective. We only show an empirical comparison between vanilla GNNs and GNN4LP models, while lacking theoretical analysis from a model perspective. The major obstacle is that GNN4LP model designs are so diverse that they are difficult to analyze comprehensively. We hope that these efforts will contribute significantly to the link prediction research community. Moving forward, we aspire to broaden our findings to encompass more advanced GNN4LP algorithm designs, with particular emphasis on the conflicting effects of feature similarity and structural similarity. Besides, the guidance we provide on dataset selection depends entirely on existing but ignored datasets. We acknowledge that there may still be graphs with distinguishing properties from domains not considered in our work.

H BROADER IMPACT

Link prediction is one of the fundamental tasks in the graph domain, with multiple applications. Recently, various GNN4LP models have been proposed with state-of-the-art performance.
A key aspect of GNN4LP models is capturing pairwise structural information, e.g., neighborhood-overlapping features. In this study, we find that, despite GNN4LP models being more expressive in capturing pairwise structural patterns, they inherently reveal performance degradation on edges without sufficient pairwise structural information. Such characteristics may lead to biased test predictions. For example, if a social media platform utilizes GNN4LP models to recommend friends in a social network, introverts with fewer friends will receive low-quality recommendations. Such flaws lead to bad experiences and obstacles for introverts, and a vicious spiral may follow, leading to ever worse situations. Our research offers an understanding of these limitations rather than introducing a new methodology or approach. We identify the overlooked problem and show a potential solution to address this issue. Consequently, we do not foresee any negative broader impacts stemming from our findings. We expect our work to contribute significantly to the ongoing research efforts aimed at enhancing the versatility and fairness of GNN models when applied to diverse data settings.

Figure 9: Overlapping ratio between top-ranking edges (top 25%) for different heuristic algorithms under the Hits@10 metric, on (a) ogbl-ddi and (b) ogbl-ppa. Diagonals compare two heuristics within the same factor, while other cells compare heuristics derived from different factors.

Figure 10: Performance comparison between GNN4LP models (BUDDY, Neo-GNN, NCNC) and vanilla GCN on (a) Cora, (b) Citeseer, (c) Pubmed, and (d) ogbl-collab. Two bar groups represent the performance gap on node pairs dominated by feature and structural proximity, respectively. The two sub-figures compare FP with GSP and LSP, respectively. The vanilla model often outperforms GNN4LP models on node pairs dominated by feature proximity, while GNN4LP models outperform on those dominated by structural proximity.

Figure 11: Distribution disparity of different heuristic scores across datasets (Cora, Citeseer, Pubmed, ogbl-collab, ogbl-ppa, ogbl-ddi); panels correspond to different heuristics, e.g., (c) SimRank.

Figure 12: The impact of various heuristic algorithms on the proportion of remaining hard negative edges. The y-axis depicts the remaining hard negative proportion, calculated as $|\cap_{i \in S} E_i| / |E_0|$, where $E_0$ represents the hard negative edges of the basic heuristic given in each subfigure title. Moving from left to right on the x-axis, more heuristics are incrementally added, starting from a basic one.

I FUTURE DIRECTION

In Section 4.1, we show how our understanding can help guide model design and dataset selection. It is worth noting that our findings can derive new understandings and inspire us to identify new problems in link prediction with future opportunities.
We further illustrate more of our understanding and initial thoughts on the future of link prediction as follows.

Dataset categorization for link prediction. In our work, we identify datasets with different data factors based on the performance of the test set, for ease of analysis. However, test edges are not available in real-world scenarios, so identifying the data factors becomes an essential problem. One potential direct solution is to utilize the validation set. After identifying the dominating factors, we can then select a suitable model for link prediction.

Specific GNN4LP designs for graphs in particular domains. Instead of viewing link prediction as a single monolithic problem, we suggest decoupling the problem and designing specific models for graphs with different factors. For instance, specific GNN4LP models should be designed for dense graphs like OGBL-DDI and OGBL-PPA.

Automatic machine learning for link prediction. With specific model designs for graphs in different categories, we can then combine all those components for automatic model architecture selection. The selected model should be able to adaptively decide what types of factors to use, thereby tailoring the pairwise encoding to each individual graph. It also allows non-experts to benefit from data-driven insights without expert knowledge.

Inherent connection between node classification and link prediction. Notably, homophily is a concept popularly discussed in node classification, while it is largely ignored in GNN4LP models. The interplay between homophily in link prediction and node classification is a chicken-and-egg problem. More specifically, link prediction assumes that nodes with similar features are likely to be connected, whereas node classification assumes that connected nodes are likely to share similar features. Such a relationship indicates that the principles for node classification and link prediction may benefit from each other. Recently, there has been an attempt (Liu et al., 2023) at utilizing link prediction principles for node classification. Nonetheless, there is still a lack of exploration in the other direction, utilizing node classification principles for link prediction.

Fairness problem in link prediction. In Section 3.4, we find a drawback of GNN4LP models relative to vanilla GNNs: while the overall performance is promising, GNN4LP models may not be satisfactory on edges dominated by feature proximity. Such bias can potentially affect the fairness of the models in real-world applications, as discussed in Appendix H. There should be a systematic task design for the fairness problem in link prediction.

Extension to network evolution. Network evolution (Zaheer & Soda, 2009) is about understanding and modeling how networks change over time. This includes the formation of new links as well as the deletion of existing ones. Link prediction considers the network at a fixed point in time and does not account for how the network might evolve and grow temporally. Moreover, link prediction generally considers only link addition while ignoring link deletion. We leave extending our understanding of link prediction to network evolution as future work. The core future steps for the extension are: Link deletion: despite considering node pairs with high proximity to be more likely to connect, we need to verify whether node pairs with low proximity are more likely to be deleted.
Temporal features: network evolution introduces time-series features, considers capturing the temporal evolution of link occurrences, and predicts the link occurrence probability. Temporal proximity would be an additional data factor beyond the existing three.

More applications in recommendation. Recommendation (Fu et al., 2022; Wang et al., 2019a; He et al., 2020) can be viewed as a particular link prediction problem on the bipartite graph between users and items. Nonetheless, such graphs are generally much sparser, and the mechanism underlying this specific link prediction problem remains unknown.

J DATASETS DETAILS

Table 9: Datasets Statistics.

              Cora      Citeseer  Pubmed    ogbl-collab  ogbl-ddi   ogbl-ppa
#Nodes        2,708     3,327     18,717    235,868      4,267      576,289
#Edges        5,278     4,676     44,327    1,285,465    1,334,889  30,326,273
Mean Degree   3.9       2.81      4.74      10.90        625.68     105.25
Split Ratio   85/5/10   85/5/10   85/5/10   92/4/4       80/10/10   70/20/10
Domain        Academic  Academic  Academic  Academic     Chemistry  Biology

The basic dataset statistics are shown in Table 9. Notably, the three Planetoid datasets, CORA, CITESEER, and PUBMED, are smaller toy datasets, while the OGB datasets (Hu et al., 2020) have many more nodes and edges. We include most of the OGB datasets except ogbl-citation2, because its standard evaluation differs from the other datasets: (1) most datasets adopt a shared negative-sample setting where all positive samples share the same negative sample set, whereas each positive sample in ogbl-citation2 corresponds to an individual negative sample set; and (2) ogbl-citation2 utilizes MRR as the default evaluation metric rather than the Hits@K used by the other datasets. Moreover, we can still reach convincing empirical conclusions without ogbl-citation2, since four citation graphs of varied sizes remain. For the data split, we adopt the fixed split with percentages 85/5/10% for the Planetoid datasets, which can be found at https://github.com/Juanhui28/HeaRT. For the OGB datasets, we use the fixed splits provided by the OGB benchmark (Hu et al., 2020).

K EXPERIMENTAL SETTINGS & MODEL PERFORMANCE

K.1 EXPERIMENTAL SETTINGS

We provide implementation details in the following section. Code for our paper can be found at https://anonymous.4open.science/r/LinkPrediction-5EC1/. Notably, we adopt all the settings and implementations from the recent link prediction benchmark (Li et al., 2023) for a fair comparison. Code can be found at https://github.com/Juanhui28/HeaRT. More details are as follows.

Training Settings. The binary cross-entropy loss and the Adam optimizer (Kingma & Ba, 2014) are utilized for training. For each positive training sample, we randomly select one negative sample. Each model is trained for a maximum of 9999 epochs with an early-stopping strategy. We set the early-stop patience to 50 and 20 epochs for the Planetoid and OGB datasets, respectively.

Evaluation Settings. For the OGB datasets, we utilize the default evaluation metrics provided by (Hu et al., 2020), which are Hits@50, Hits@20, and Hits@100 for the OGBL-COLLAB, OGBL-DDI, and OGBL-PPA datasets, respectively. For the Planetoid datasets, we utilize Hits@10 as the evaluation metric.

Hardware Settings. The experiments are performed on one Linux server (CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz; operating system: Ubuntu 16.04.6 LTS).
For GPU resources, eight NVIDIA Tesla V100 cards are utilized. The Python libraries we use to implement our experiments are PyTorch 1.12.1 and PyG 2.1.0.post1.

Hyperparameter Settings. For deep models, the general hyperparameter search space is shown in Table 10. The weight decay, the number of model and prediction layers, and the embedding dimension are fixed for the OGB datasets. Notably, several exceptions to the general search space occur where it would result in significant performance degradation. Adjustments are made following the optimal hyperparameters published in the respective source codes. This includes:

NCNC (Wang et al., 2023): When training on OGBL-DDI, we adhere to the suggested optimal hyperparameters used in the source code.2 Specifically, we set the number of model layers to 1, and we do not apply pretraining for NCNC, to facilitate a fair comparison.

SEAL (Zhang & Chen, 2018): Due to the computational inefficiency of SEAL, when training on CORA, CITESEER, and PUBMED we further fix the weight decay to 0. Furthermore, we adhere to the published hyperparameters3 and fix the number of model layers to 3 and the embedding dimension to 256.

BUDDY (Chamberlain et al., 2023): When training on OGBL-PPA, we incorporate the RA and normalized degree as input features while excluding the raw node features. This is based on the optimal hyperparameters published by the authors.4

2 https://github.com/GraphPKU/NeuralCommonNeighbor/
3 https://github.com/facebookresearch/SEAL_OGB/
4 https://github.com/melifluos/subgraph-sketching/

Table 10: Hyperparameter Search Ranges.

Dataset      Learning Rate  Dropout          Weight Decay     # Model Layers  # Prediction Layers  Embedding Dim
Cora         (0.01, 0.001)  (0.1, 0.3, 0.5)  (1e-4, 1e-7, 0)  (1, 2, 3)       (1, 2, 3)            (128, 256)
Citeseer     (0.01, 0.001)  (0.1, 0.3, 0.5)  (1e-4, 1e-7, 0)  (1, 2, 3)       (1, 2, 3)            (128, 256)
Pubmed       (0.01, 0.001)  (0.1, 0.3, 0.5)  (1e-4, 1e-7, 0)  (1, 2, 3)       (1, 2, 3)            (128, 256)
ogbl-collab  (0.01, 0.001)  (0, 0.3, 0.5)    0                3               3                    256
ogbl-ddi     (0.01, 0.001)  (0, 0.3, 0.5)    0                3               3                    256
ogbl-ppa     (0.01, 0.001)  (0, 0.3, 0.5)    0                3               3                    256

K.2 MODEL PERFORMANCE

We include baseline models that are both representative and scalable to most datasets. Models like NBFNet (Zhu et al., 2021b) are not included since they encounter severe out-of-memory issues on the larger OGB datasets. The detailed model performance is shown in Table 11. Since ogbl-ddi has no node features, we mark the MLP result with "N/A". Notably, there is no consistently winning solution across different settings. All results except for SimRank, PPR, and FH are from Li et al. (2023).

Table 11: Model performance on Cora, Citeseer, Pubmed, and the OGB datasets.

          Cora          Citeseer      Pubmed        ogbl-collab   ogbl-ddi       ogbl-ppa
CN        42.69         35.16         27.93         61.37         17.73          27.65
AA        42.69         35.16         27.93         64.17         18.61          32.45
RA        42.69         35.16         27.93         63.81         6.23           49.33
Katz      51.61         57.36         42.17         64.33         17.73          27.65
SimRank   50.28         52.52         37.03         21.75         OOM            OOM
PPR       32.96         31.88         32.84         54.53         7.19           4.97
FH        47.13         48.19         34.57         23.92         4.74           NA
MLP       53.59 ± 3.57  69.74 ± 2.19  34.01 ± 4.94  35.81 ± 1.08  N/A            0.45 ± 0.04
GCN       66.11 ± 4.03  74.15 ± 1.70  56.06 ± 4.83  54.96 ± 3.18  49.90 ± 7.23   29.57 ± 2.90
SAGE      63.66 ± 4.98  78.06 ± 2.26  48.18 ± 4.60  59.44 ± 1.37  49.84 ± 15.56  41.02 ± 1.94
Neo-GNN   64.10 ± 4.31  69.25 ± 1.90  56.25 ± 3.42  66.13 ± 0.61  20.95 ± 6.03   48.45 ± 1.01
BUDDY     59.47 ± 5.49  80.04 ± 2.27  46.62 ± 4.58  64.59 ± 0.46  29.60 ± 4.75   47.33 ± 1.96
NCNC      75.07 ± 1.95  82.64 ± 1.40  61.89 ± 3.54  65.97 ± 1.03  70.23 ± 12.11  62.64 ± 0.79