# PageRank Bandits for Link Prediction

Yikun Ban¹, Jiaru Zou¹, Zihao Li¹, Yunzhe Qi¹, Dongqi Fu², Jian Kang³, Hanghang Tong¹, Jingrui He¹

¹University of Illinois Urbana-Champaign, ²Meta AI, ³University of Rochester

¹{yikunb2, jiaruz2, zihaoli5, yunzheq2, htong, jingrui}@illinois.edu, ²dongqifu@meta.com, ³jian.kang@rochester.edu

Link prediction is a critical problem in graph learning with broad applications such as recommender systems and knowledge graph completion. Numerous research efforts have been directed at solving this problem, including approaches based on similarity metrics and Graph Neural Networks (GNNs). However, most existing solutions are still rooted in conventional supervised learning, which makes it challenging to adapt over time to changing customer interests and to address the inherent dilemma of exploitation versus exploration in link prediction. To tackle these challenges, this paper reformulates link prediction as a sequential decision-making process, where each link prediction interaction occurs sequentially. We propose a novel fusion algorithm, PRB (PageRank Bandits), which is the first to combine contextual bandits with PageRank for collaborative exploitation and exploration. We also introduce a new reward formulation and provide a theoretical performance guarantee for PRB. Finally, we extensively evaluate PRB in both online and offline settings, comparing it with bandit-based and graph-based methods. The empirical success of PRB demonstrates the value of the proposed fusion approach. Our code is released at https://github.com/jiaruzouu/PRB.

1 Introduction

Link prediction is an essential problem in graph machine learning, focusing on predicting whether a link will exist between two nodes. Given the ubiquitous graph data in real-world applications, link prediction has become a powerful tool in domains such as recommender systems [72] and knowledge graph completion [49, 41]. Considerable research efforts have been dedicated to solving this problem. One class of classic approaches is heuristic-based methods, which infer the likelihood of links from node similarity metrics [43, 46]. Graph Neural Networks (GNNs) have also been widely utilized for link prediction. For example, Graph Autoencoders leverage Message Passing Neural Network (MPNN) representations to predict links [29]. Recently, MPNNs have been combined with structural features to better explore pairwise relations between target nodes [73, 70, 18, 61].

Existing supervised-learning-based methods for link prediction are designed for either static [73, 70, 18, 61] or relatively dynamic [64, 55, 62, 58, 69, 19, 27, 26, 75] environments; in both cases, they (chronologically) split the dataset into training and testing sets. Due to the dynamic and evolving nature of many real-world graphs, ideal link prediction methods should adapt over time to consistently meet the contexts and goals of the serving nodes. For instance, in short-video recommender systems, both video content and user preferences change dynamically over time [28]. Another significant challenge is the dilemma of exploitation and exploration in link prediction. The learner must not only exploit past collected data to predict links with high likelihood but also explore lower-confidence target nodes to acquire new knowledge for long-term benefits. For example, in social recommendations, it
is necessary to prioritize popular users by exploiting knowledge gained from previous interactions, while also exploring the potential value of new or under-explored users to seek long-term benefits [7]. Furthermore, while existing works often analyze time and space complexity, they generally lack theoretical guarantees regarding link prediction performance. To address these challenges, in this paper, we make the following contributions:

Problem Formulation and Algorithm. We formulate the task of link prediction as sequential decision-making under the framework of contextual bandits, where each interaction of link prediction is regarded as one round of decision-making. We introduce a pseudo-regret metric to evaluate the performance of this decision process. More specifically, we propose a fusion algorithm named PRB (PageRank Bandits), which combines the exploitation-exploration balance of contextual bandits with the graph-structure utilization of PageRank [59, 42]. Compared to contextual bandit approaches, PRB leverages graph connectivity for an aggregated representation. In contrast to PageRank, it incorporates the principles of exploitation and exploration from contextual bandits to achieve a collaborative trade-off. Additionally, we extend PRB to node classification by introducing a novel transformation from node classification to link prediction, thereby broadening the applicability of PRB.

Theoretical Analysis. We introduce a new formulation of the reward function to represent the mapping from both node contexts and graph connectivity to the reward. We provide a theoretical guarantee for the link prediction performance of the proposed algorithm, demonstrating that the cumulative regret induced by PRB grows sub-linearly with respect to the number of rounds. This regret upper bound also provides insights into the relationship between the reward and the damping factor, as well as the required realization complexity of the neural function class.

Empirical Evaluation. We extensively evaluate PRB in two mainstream settings. (1) Online Link Prediction. In this setting, each link prediction is made sequentially. In each round, given a serving node, the model is required to choose the target node with the highest likelihood of forming a link with the serving node. The model then observes feedback and performs the corresponding optimization. The goal is to minimize regret over T rounds (e.g., T = 10,000). We compare PRB with state-of-the-art (SOTA) bandit-based approaches (e.g., [76, 12]), which are designed for sequential decision-making. PRB significantly outperforms these bandit-based baselines, demonstrating the success of fusing contextual bandits with PageRank for collaborative exploitation and exploration. (2) Offline Link Prediction. In this setting, both training and testing data are provided, following the typical supervised learning process. Although PRB is designed for online learning, it can be directly applied to offline learning on the training data. We then use the trained model to perform link prediction on the testing data, comparing it with SOTA GNN-based methods (e.g., [18, 61]). The superior performance of PRB indicates that principled exploitation and exploration can break the performance bottleneck in link prediction. Additionally, we conduct ablation and sensitivity studies for a comprehensive evaluation of PRB.
2 Related Work

Contextual Bandits. The first line of works studies the linear reward assumption, typically estimated via ridge regression [39, 8, 1, 60, 21, 53]. Linear UCB-based bandit algorithms [1, 9, 40] and linear Thompson Sampling [4, 2] can achieve satisfactory performance and a near-optimal regret bound of $O(\sqrt{T})$. To learn general reward functions, deep neural networks have been adapted to bandits in various ways [10, 11]. [54, 47] develop $L$-layer DNNs to learn arm embeddings and apply Thompson Sampling on the final layer for exploration. [76] introduced the first provable neural contextual bandit algorithm with a UCB-based exploration strategy, and [74] later extended this approach to the Thompson Sampling framework. [22] provides a sharper regret upper bound for neural bandits via neural online regression. These regret analyses build on recent advances in the convergence theory of over-parameterized neural networks [24, 5] and use the Neural Tangent Kernel [34, 6] to establish connections with linear contextual bandits [1]. [12, 13] retain the powerful representation ability of neural networks to learn the reward function while using another neural network for exploration. [52, 51] integrate exploitation-exploration neural networks into graph neural networks for fine-grained exploitation and exploration. Recently, neural bandits have been adapted to other learning problems, such as active learning [14, 7] and meta-learning [53].

Link Prediction Models. Three primary approaches have been identified for link prediction models. Node embedding methods [50, 30, 57, 23, 44, 45, 25] map each node to an embedding vector and leverage these embeddings to predict connections. Another approach involves link prediction heuristics [43, 15, 3, 77], which utilize hand-crafted structural features and network topology to estimate the likelihood of connections between nodes in a network. The third category employs GNNs to predict link existence; notable is the Graph Autoencoder (GAE) [36], which learns low-dimensional representations of graph-structured data through an unsupervised learning process. GAE utilizes the inner product of the MPNN representations of the target nodes to forecast links but might not capture pairwise relations between nodes effectively. More sophisticated GNN models that combine MPNN with additional structural features, such as those of [71, 70, 18], have demonstrated superior performance by integrating both node and structural attributes. One such combined architecture is SF-then-MPNN, as adopted by [71, 78]: the input graph is first enriched with structural features (SF) and then processed by the MPNN to enhance its expressivity. However, since the structural features change with each target link, the MPNN must be re-run for each link, reducing scalability. For instance, the SEAL model [71] first enhances node features by incorporating shortest-path distances and extracting k-hop subgraphs, then applies an MPNN across these subgraphs to generate more comprehensive link representations. Another combined architecture is SF-and-MPNN: models like Neo-GNN [70] and BUDDY [18] apply an MPNN to the entire graph and concatenate features such as common-neighbor counts to enhance representational fidelity. In addition, [61] developed the Neural Common Neighbor with Completion (NCNC), which utilizes the MPNN-then-SF architecture to achieve higher expressivity and address graph incompleteness.
Recently, representation learning on temporal graphs for link prediction has also been widely studied to exploit patterns in historical sequences, particularly with GNN-based methods [58, 69, 19, 64, 62, 55]. However, these approaches are still conventional supervised-learning-based methods that chronologically split the dataset into training and testing sets. Specifically, these methods train a GNN-based model on the temporal training data and then employ the trained model to predict links in the test data. In contrast, we formulate link prediction as sequential decision-making, where each link prediction is made sequentially. Node classification [16, 67, 66] is also a prominent direction in graph learning, but it is not the main focus of this paper.

3 Problem Definition

Let $G_0 = (V, E_0)$ be an undirected graph at initialization, where $V$ is the set of $n$ nodes, $|V| = n$, and $E_0 \subseteq V \times V$ represents the set of edges. $E_0$ can be an empty set in the cold-start setting or include some existing edges in the warm-start setting. Each node $v_i \in V$ is associated with a context vector $x_{0,i} \in \mathbb{R}^d$. We then formulate link prediction as a problem of sequential decision-making under the framework of contextual bandits. Suppose the learner is required to finish a total of $T$ link predictions. We adapt the above notation to the $T$ evolving graphs $\{G_t = (V, E_t)\}_{t=0}^{T-1}$ and let $[T] = \{1, \ldots, T\}$.

In a round of link prediction $t \in [T]$, given $G_{t-1} = (V, E_{t-1})$, the learner is presented with a serving node $v_t \in V$ and a set of $k$ candidate nodes $V_t = \{v_{t,1}, \ldots, v_{t,k}\} \subseteq V$, where $V_t$ is associated with the corresponding $k$ contexts $X_t = \{x_{t,1}, \ldots, x_{t,k}\}$ and $|V_t| = k$. In the scenario of social recommendation, $v_t$ can be considered the user to whom the platform (learner) intends to recommend potential friends, and the candidate users are represented by $V_t$. $V_t$ can be set as the remaining nodes, $V_t = V \setminus \{v_t\}$, or formed by some pre-selection algorithm with $V_t \subseteq V$. The goal of the learner is to predict which node in $V_t$ will form a link (edge) with $v_t$. Therefore, we can consider each node in $V_t$ as an arm and aim to select the arm with the maximal reward, i.e., the arm with the maximal probability of generating an edge with $v_t$.

For simplicity, we define the reward of link prediction as a binary reward. Let $v_{t,\hat{i}} \in V_t$ be the node selected by the learner. Then, the corresponding reward is defined as $r_{t,\hat{i}} = 1$ if the link $[v_t, v_{t,\hat{i}}]$ is actually generated; otherwise, $r_{t,\hat{i}} = 0$. After observing the reward $r_{t,\hat{i}}$, we update $E_{t-1}$ to obtain the new edge set $E_t$, and thus the new graph $G_t$. For any node $v_{t,i} \in V_t$, denote by $\mathcal{D}_{Y|x_{t,i}}$ the conditional distribution of the random reward $r_{t,i}$ with respect to $x_{t,i}$, where $Y = \{1, 0\}$. Then, inspired by the literature on contextual bandits, we define the following pseudo-regret:

$$R_T = \sum_{t=1}^{T} \Big( \mathbb{E}_{r_{t,i^*} \sim \mathcal{D}_{Y|x_{t,i^*}}}[r_{t,i^*}] - \mathbb{E}_{r_{t,\hat{i}} \sim \mathcal{D}_{Y|x_{t,\hat{i}}}}[r_{t,\hat{i}}] \Big) = \sum_{t=1}^{T} \Big( P(r_{t,i^*} = 1 \mid x_{t,i^*}) - P(r_{t,\hat{i}} = 1 \mid x_{t,\hat{i}}) \Big) \quad (3.1)$$

where $i^* = \arg\max_{v_{t,i} \in V_t} P(r_{t,i} = 1 \mid x_{t,i})$, ties are broken randomly, and $\hat{i}$ is the index of the selected node. $R_T$ reflects the performance gap between the learned model and the Bayes-optimal predictor. The goal of the learner is to minimize $R_T$.
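To make the interaction protocol concrete, here is a minimal simulation of the $T$-round loop and the pseudo-regret of Eq. (3.1) (our sketch, not part of the paper's released code; the logistic link model and the uniform-random policy are synthetic placeholders where PRB's selection rule would go):

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, d = 1000, 10, 8                  # rounds, candidate nodes per round, context dim
theta_star = rng.normal(size=d)        # hidden parameter of the toy link model

def p_link(X):
    """Toy stand-in for P(r = 1 | x); the paper only assumes a bounded reward function."""
    return 1.0 / (1.0 + np.exp(-X @ theta_star))

cum_regret = 0.0
for t in range(T):
    X = rng.normal(size=(k, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm contexts, as in Sec. 5
    probs = p_link(X)
    i_hat = rng.integers(k)                         # placeholder policy; PRB goes here
    r = rng.binomial(1, probs[i_hat])               # observed binary link feedback
    cum_regret += probs.max() - probs[i_hat]        # one summand of Eq. (3.1)
print(f"pseudo-regret after {T} rounds: {cum_regret:.1f}")
```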
4 Proposed Algorithms

Algorithm 1 describes the proposed algorithm, PRB. It integrates contextual bandits and PageRank to combine the power of balancing exploitation and exploration with graph connectivity. The first step balances exploitation and exploration in the reward mapping over node contexts, and the second step propagates the exploitation-exploration score via graph connectivity.

Algorithm 1 PRB (PageRank Bandits)
Input: $f_1$, $f_2$, $T$, $G_0$, $\eta_1, \eta_2$ (learning rates), $\alpha$ (damping factor)
1: Initialize $\theta^1_0, \theta^2_0$
2: for $t = 1, 2, \ldots, T$ do
3:   Observe serving node $v_t$, candidate nodes $V_t$, contexts $X_t$, and graph $G_{t-1}$
4:   $h_t = \mathbf{0}$
5:   for each $v_{t,i} \in V_t$ do
6:     $h_t[i] = f_1(x_{t,i}; \theta^1_{t-1}) + f_2(\phi(x_{t,i}); \theta^2_{t-1})$
7:   end for
8:   Compute $P_t$ based on $G_{t-1}$
9:   Solve $\mathbf{v}_t = \alpha P_t \mathbf{v}_t + (1 - \alpha) h_t$
10:  Select $\hat{i} = \arg\max_{v_{t,i} \in V_t} \mathbf{v}_t[i]$
11:  Observe $r_{t,\hat{i}}$
12:  if $r_{t,\hat{i}} = 1$ then
13:    Add $[v_t, v_{t,\hat{i}}]$ to $G_{t-1}$ and set the result as $G_t$
14:  else
15:    $G_t = G_{t-1}$
16:  end if
17:  $\theta^1_t = \theta^1_{t-1} - \eta_1 \nabla_{\theta^1_{t-1}} \mathcal{L}(x_{t,\hat{i}}, r_{t,\hat{i}}; \theta^1_{t-1})$
18:  $\theta^2_t = \theta^2_{t-1} - \eta_2 \nabla_{\theta^2_{t-1}} \mathcal{L}(\phi(x_{t,\hat{i}}), r_{t,\hat{i}} - f_1(x_{t,\hat{i}}; \theta^1_{t-1}); \theta^2_{t-1})$
19: end for

To exploit the node contexts, we use a neural network to estimate rewards from the node contexts. Let $f_1(\cdot; \theta^1)$ be a neural network that learns the mapping from a node context to the reward, and denote its initialized parameters by $\theta^1_0$. In round $t$, let $\theta^1_{t-1}$ be the parameters trained on the data collected in the previous $t-1$ rounds, including all selected nodes and the received rewards. Given the serving node $v_t$, for any candidate node $v_{t,i} \in V_t$, $f_1(x_{t,i}; \theta^1_{t-1})$ is the reward estimated by greedily exploiting the observed contexts, which we refer to as exploitation. Suppose $\hat{i}$ is the index of the selected node. To update $\theta^1_{t-1}$, we conduct stochastic gradient descent on the collected training sample $(x_{t,\hat{i}}, r_{t,\hat{i}})$ with the squared loss $\mathcal{L}(x_{t,\hat{i}}, r_{t,\hat{i}}; \theta^1_{t-1}) = [f_1(x_{t,\hat{i}}; \theta^1_{t-1}) - r_{t,\hat{i}}]^2 / 2$, and denote the updated parameters by $\theta^1_t$ for the next round of link prediction.

In addition to exploiting the observed contexts, we employ another neural network to estimate the potential gain of each candidate node, in terms of reward, for exploration. This idea is inspired by [12]. Denote the exploration network by $f_2(\cdot; \theta^2)$. $f_2$ learns the mapping from the node contexts and the discriminative ability of $f_1$ to the potential gain. In round $t \in [T]$, given a node context $x_{t,i}$ and its estimated reward $f_1(x_{t,i}; \theta^1_{t-1})$, the input of $f_2$ is the gradient of $f_1(x_{t,i}; \theta^1_{t-1})$ with respect to $\theta^1_{t-1}$, denoted by $\phi(x_{t,i})$, and $f_2(\phi(x_{t,i}); \theta^2_{t-1})$ is the estimated potential gain. After the learner selects the node $v_{t,\hat{i}}$ and observes the reward $r_{t,\hat{i}}$, the potential gain is defined as $r_{t,\hat{i}} - f_1(x_{t,\hat{i}}; \theta^1_{t-1})$, which is used to train $f_2$. Thus, after this interaction, we conduct stochastic gradient descent to update $\theta^2$ on the collected sample $(\phi(x_{t,\hat{i}}), r_{t,\hat{i}} - f_1(x_{t,\hat{i}}; \theta^1_{t-1}))$ with the squared loss $\mathcal{L}(\phi(x_{t,\hat{i}}), r_{t,\hat{i}} - f_1(x_{t,\hat{i}}; \theta^1_{t-1}); \theta^2_{t-1}) = [f_2(\phi(x_{t,\hat{i}}); \theta^2_{t-1}) - (r_{t,\hat{i}} - f_1(x_{t,\hat{i}}; \theta^1_{t-1}))]^2 / 2$. Denote by $\theta^2_t$ the updated parameters of $f_2$ for the next round of link prediction.
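As a concrete illustration of this exploitation-exploration pair, the following PyTorch sketch (ours; the widths follow Appendix A.1, while the learning rates are placeholder assumptions) implements $f_1$, the gradient features $\phi(\cdot)$, and the two squared-loss updates of Lines 17-18 of Algorithm 1:

```python
import torch
import torch.nn as nn

d, m = 8, 100                          # context dimension; hidden width (App. A.1)
f1 = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))   # exploitation net
p = sum(w.numel() for w in f1.parameters())
f2 = nn.Sequential(nn.Linear(p, m), nn.ReLU(), nn.Linear(m, 1))   # exploration net
opt1 = torch.optim.SGD(f1.parameters(), lr=1e-3)
opt2 = torch.optim.Adam(f2.parameters(), lr=1e-3)

def phi(x):
    """Gradient of f1 at x with respect to theta^1, flattened: the input of f2."""
    f1.zero_grad()
    f1(x).sum().backward()
    return torch.cat([w.grad.flatten() for w in f1.parameters()]).detach()

def score(x):
    """Exploitation + exploration score used to fill h_t[i] (Line 6 of Algorithm 1)."""
    g = phi(x)
    with torch.no_grad():
        return f1(x).item() + f2(g).item()

def update(x_sel, r):
    """One stochastic gradient step per network on the selected node (Lines 17-18)."""
    g = phi(x_sel)
    est = f1(x_sel).squeeze()
    loss1 = 0.5 * (est - r) ** 2                    # squared loss for f1
    opt1.zero_grad(); loss1.backward(); opt1.step()
    gain = r - est.detach()                         # potential gain that f2 predicts
    loss2 = 0.5 * (f2(g).squeeze() - gain) ** 2     # squared loss for f2
    opt2.zero_grad(); loss2.backward(); opt2.step()
```

In each round, `score` fills the candidate entries of $h_t$, and `update` is applied once to the selected node's context and observed reward.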
The reasons for setting $\phi(x_{t,i})$ as the input of $f_2$ are as follows: (1) it incorporates the information of both $x_{t,i}$ and the discriminative ability of $f_1(\cdot; \theta^1_{t-1})$; (2) the statistical form of the confidence interval for reward estimation can be regarded as a mapping from $\phi(x_{t,i})$ to the potential gain, and $f_2$ learns this unknown mapping [12].

The previous steps describe the exploitation and exploration of node contexts to facilitate decision-making in link prediction. Since graph connectivity is also crucial, we next introduce our method of integrating the bandit principle with PageRank to enable collaborative exploitation and exploration. PageRank calculates the stationary distribution of a random walker that starts from some node and iteratively moves to a random neighbor with probability $\alpha$ (damping factor) or returns to its starting position with probability $1 - \alpha$. Let $\mathbf{v}_t$ be the stationary distribution vector calculated based on the graph $G_{t-1}$. Then, $\mathbf{v}_t$ satisfies:

$$\mathbf{v}_t = \alpha P_t \mathbf{v}_t + (1 - \alpha) h_t \quad (4.1)$$

where $P_t \in \mathbb{R}^{n \times n}$ is the transition matrix built on $G_{t-1}$ and $h_t$ is typically regarded as a position vector marking the starting node. $P_t$ is computed as $D_{t-1}^{-1} A_{t-1}$, where $D_{t-1} \in \mathbb{R}^{n \times n}$ is the degree matrix of $G_{t-1}$ and $A_{t-1} \in \mathbb{R}^{n \times n}$ is the adjacency matrix of $G_{t-1}$. Here, we propose to use $h_t$ to carry the initial exploitation and exploration scores of the candidate nodes, defined as:

$$\forall v_{t,i} \in V_t: \; h_t[i] = f_1(x_{t,i}; \theta^1_{t-1}) + f_2(\phi(x_{t,i}); \theta^2_{t-1}), \qquad \forall v_i \in V \setminus V_t: \; h_t[i] = 0. \quad (4.2)$$

Therefore, $\mathbf{v}_t$ is the vector for the final decision-making based on collaborative exploitation and exploration. Some research efforts have been devoted to accelerating the calculation of Eq. (4.1) on evolving graphs, e.g., [42], which can be integrated into PRB (Line 9 of Algorithm 1) to boost its efficiency and scalability.

Figure 1: Transforming Node Classification to Link Prediction. Consider a binary node classification problem. In the left figure, given a graph, the learner tries to classify node 4 into one of two classes. First, we add two supernodes to the graph, each representing one of the classes. The node classification problem is then transformed into predicting links between node 4 and the two supernodes in the right figure. Suppose the learner predicts that a link will exist between node 4 and supernode 0. If node 4 belongs to Class 0, the reward is 1 and an edge is added between node 4 and supernode 0; otherwise, the reward is 0 and an edge is added between node 4 and supernode 1.

PRB for Node Classification. We also extend PRB to solve the problem of node classification, as illustrated in Figure 1. Consider a $k$-class classification problem. We add $k$ supernodes $\{\tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_k\}$ to the graph, representing the $k$ classes, respectively. Then, we transform the node classification problem into a link prediction problem, aiming to predict the link between the serving node and the $k$ supernodes. To be specific, in round $t \in [T]$, the learner is presented with the serving node $v_t$ and the $k$ candidate (super) nodes $\tilde{V}_t = \{\tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_k\}$ associated with the $k$ corresponding contexts $\tilde{X}_t = \{\tilde{x}_{t,1}, \tilde{x}_{t,2}, \ldots, \tilde{x}_{t,k}\}$. Recall that $x_t$ is the context associated with $v_t$. Then, we define the contexts of the supernodes as $\tilde{x}_{t,1} = [x_t^\top, \mathbf{0}, \ldots, \mathbf{0}]^\top$, $\tilde{x}_{t,2} = [\mathbf{0}, x_t^\top, \ldots, \mathbf{0}]^\top$, ..., $\tilde{x}_{t,k} = [\mathbf{0}, \mathbf{0}, \ldots, x_t^\top]^\top$, with $\tilde{x}_{t,i} \in \mathbb{R}^{kd}$, $i \in [k]$. This context definition is adopted from neural contextual bandits [12, 76]. The learner is then required to select one node from $\tilde{V}_t$. Let $\tilde{v}_{i_t}$ be the selected node and $\tilde{v}_{i_t^*}$ be the ground-truth node ($i_t^*$ is the index of the ground-truth class of node $v_t$). After observing the reward $r_{t,i_t}$: if $v_t$ belongs to class $i_t$, i.e., $i_t = i_t^*$, the reward is $r_{t,i_t} = 1$ and the edge $[v_t, \tilde{v}_{i_t}]$ is added to $G_{t-1}$; otherwise, $r_{t,i_t} = 0$ and the edge $[v_t, \tilde{v}_{i_t^*}]$ is added to $G_{t-1}$. We can then naturally apply PRB to this problem; the extended algorithm for node classification is detailed in Algorithm 2 (Appendix D).
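Both variants rely on the same fusion step, Eqs. (4.1)-(4.2). The fixed point can be obtained by plain power iteration, as in the following NumPy sketch (ours; the paper instead adopts the accelerated evolving-graph solver of [42]):

```python
import numpy as np

def prb_scores(A, h, alpha=0.85, eps=1e-6, max_iter=1000):
    """Solve v = alpha * P v + (1 - alpha) * h (Eq. 4.1) by power iteration,
    with P = D^{-1} A built from the current graph G_{t-1} (Lines 8-9, Alg. 1)."""
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0                    # guard isolated nodes against division by zero
    P = A / deg[:, None]
    v = h.copy()
    for _ in range(max_iter):
        v_new = alpha * P @ v + (1.0 - alpha) * h
        if np.abs(v_new - v).max() < eps:  # contraction with factor alpha < 1 converges
            break
        v = v_new
    return v

# Toy usage: 5 nodes on a cycle; candidates {2, 3, 4} carry scores from Eq. (4.2).
A = np.array([[0, 1, 0, 0, 1], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1], [1, 0, 0, 1, 0]], dtype=float)
h = np.array([0.0, 0.0, 0.7, 0.4, 0.9])   # h[i] = f1(x_i) + f2(phi(x_i)) on V_t, else 0
v = prb_scores(A, h)
print("selected candidate:", 2 + int(np.argmax(v[2:5])))   # Line 10 of Algorithm 1
```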
PRB-Greedy. We also introduce a greedy version of PRB, which integrates PageRank solely with contextual bandit exploitation, as outlined in Algorithm 3. We compare each algorithm variant in our experiment section.

5 Regret Analysis

In this section, we provide the theoretical analysis of PRB by bounding the regret defined in Eq. (3.1). One important step is the definition of the reward function, as this problem differs from the standard bandit setting, which focuses on the arm (node) contexts and does not take graph connectivity into account. First, we define the following general function to represent the mapping from node contexts to rewards. Given the serving node $v_t$ and an arm node $v_{t,i} \in V_t$ associated with the context $x_{t,i}$, the reward conditioned on $v_t$ and $v_{t,i}$ is assumed to be governed by the function

$$\mathbb{E}[\tilde{r}_{t,i} \mid v_t, v_{t,i}] = y(x_{t,i}) \quad (5.1)$$

where $y$ is an unknown but bounded function that can be either linear or non-linear. Next, we provide the formulation of the final reward function. In round $t \in [T]$, let $\mathbf{y}_t = [y(x_{t,i}) : v_{t,i} \in V_t]$ be the vector of expected rewards of all candidate arms. Given the graph $G_{t-1}$, its normalized adjacency matrix $P_t$, and the damping factor $\alpha$, and inspired by PageRank, the optimization problem is defined as

$$\mathbf{v}_t^* = \arg\min_{\mathbf{v}} \; \frac{\alpha}{2}\, \mathbf{v}^\top (I - P_t)\, \mathbf{v} + \frac{1-\alpha}{2}\, \|\mathbf{v} - \mathbf{y}_t\|_2^2,$$

whose optimal solution satisfies

$$\mathbf{v}_t^* = \alpha P_t \mathbf{v}_t^* + (1 - \alpha)\, \mathbf{y}_t. \quad (5.2)$$

For any candidate node $v_{t,i} \in V_t$, we define its expected reward as $\mathbb{E}_{r_{t,i} \sim \mathcal{D}_{Y|x_{t,i}}}[r_{t,i}] = \mathbf{v}_t^*[i]$. $\mathbf{v}_t^*$ is a flexible reward function that reflects the mapping relation of both node contexts and graph connectivity, and $\alpha$ is a hyper-parameter that trades off the leading roles of graph connectivity and node contexts. When $\alpha = 0$, $\mathbf{v}_t^*$ reduces to the reward function of contextual bandits [76, 12]; when $\alpha = 1$, $\mathbf{v}_t^*$ is the optimal solution determined solely by graph connectivity. Here, we assume $\alpha$ is prior knowledge. Finally, the pseudo-regret is defined as

$$R_T = \sum_{t=1}^{T} \big( \mathbf{v}_t^*[i^*] - \mathbf{v}_t^*[\hat{i}] \big) \quad (5.3)$$

where $i^* = \arg\max_{v_{t,i} \in V_t} \mathbf{v}_t^*[i]$ and $\hat{i}$ is the index of the selected node.

The regret analysis is associated with the Neural Tangent Kernel (NTK) matrix, defined as follows.

Definition 5.1 (NTK [34, 63]). Let $\mathcal{N}$ denote the normal distribution. Given all data instances $\{x_i\}_{i=1}^{Tk}$, for $i, j \in [Tk]$, define

$$H^0_{i,j} = \Sigma^0_{i,j} = \langle x_i, x_j \rangle, \qquad A^l_{i,j} = \begin{pmatrix} \Sigma^l_{i,i} & \Sigma^l_{i,j} \\ \Sigma^l_{j,i} & \Sigma^l_{j,j} \end{pmatrix},$$

$$\Sigma^l_{i,j} = 2\, \mathbb{E}_{(a,b) \sim \mathcal{N}(0, A^{l-1}_{i,j})}[\sigma(a)\sigma(b)], \qquad H^l_{i,j} = 2\, H^{l-1}_{i,j}\, \mathbb{E}_{(a,b) \sim \mathcal{N}(0, A^{l-1}_{i,j})}[\sigma'(a)\sigma'(b)] + \Sigma^l_{i,j}.$$

Then, the NTK matrix is defined as $H = (H^L + \Sigma^L)/2$.

Assumption 5.1. There exists $\lambda_0 > 0$ such that $H \succeq \lambda_0 I$.

Assumption 5.1 is standard in the literature on neural bandits [76, 74, 20, 35, 12, 10, 65] and ensures the existence of a solution for NTK regression. As is standard in contextual bandits, all node contexts are normalized to unit length. Given $x_{t,i} \in \mathbb{R}^d$ with $\|x_{t,i}\|_2 = 1$ for $t \in [T]$, $i \in [k]$, without loss of generality we define a fully connected network of depth $L \geq 2$ and width $m$:

$$f(x_{t,i}; \theta) = W_L\, \sigma(W_{L-1}\, \sigma(W_{L-2} \cdots \sigma(W_1 x_{t,i}))) \quad (5.4)$$

where $\sigma$ is the ReLU activation function, $W_1 \in \mathbb{R}^{m \times d}$, $W_l \in \mathbb{R}^{m \times m}$ for $2 \leq l \leq L-1$, $W_L \in \mathbb{R}^{1 \times m}$, and $\theta = [\mathrm{vec}(W_1)^\top, \mathrm{vec}(W_2)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$. Note that our analysis can also be readily generalized to other neural architectures such as CNNs and ResNets [5, 24]. We employ the following initialization [17] for $\theta$: for $l \in [L-1]$, each entry of $W_l$ is drawn from the normal distribution $\mathcal{N}(0, 2/m)$; each entry of $W_L$ is drawn from $\mathcal{N}(0, 1/m)$. The networks $f_1$ and $f_2$ follow the structure of $f$. Define $\mathbf{y} = [y(x_{t,i}) : t \in [T], i \in [k]]$.
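For reference, the network of Eq. (5.4) with this initialization can be written as the following PyTorch sketch (ours; bias-free to match Eq. (5.4), and only for illustration, since the analysis itself requires the over-parameterized width of Theorem 5.1):

```python
import torch
import torch.nn as nn

class WideReLUNet(nn.Module):
    """f(x; theta) = W_L sigma(W_{L-1} ... sigma(W_1 x)), Eq. (5.4), bias-free,
    with W_l ~ N(0, 2/m) for l < L and W_L ~ N(0, 1/m), as stated above."""
    def __init__(self, d, m=100, L=2):
        super().__init__()
        widths = [d] + [m] * (L - 2)               # input widths of the L-1 hidden layers
        self.hidden = nn.ModuleList([nn.Linear(w, m, bias=False) for w in widths])
        self.out = nn.Linear(m, 1, bias=False)     # W_L in R^{1 x m}
        for layer in self.hidden:
            nn.init.normal_(layer.weight, std=(2.0 / m) ** 0.5)
        nn.init.normal_(self.out.weight, std=(1.0 / m) ** 0.5)

    def forward(self, x):
        for layer in self.hidden:
            x = torch.relu(layer(x))
        return self.out(x)

f = WideReLUNet(d=8, m=100, L=3)
x = torch.randn(8)
x = x / x.norm()                                   # unit-norm context, as assumed above
print(f(x))
```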
Finally, we provide the performance guarantee stated in the following theorem.

Theorem 5.1. Given the number of rounds $T$, for any $\alpha, \delta \in (0, 1)$, suppose $m \geq \widetilde{\Omega}(\mathrm{poly}(T, L) \cdot k \log(1/\delta))$, $\eta_1 = \eta_2 = \Theta\big(\tfrac{1}{T^3 m}\big)$, and set $r_{t,i} = \tilde{r}_{t,i}$ for all $t \in [T]$, $i \in [k]$. Then, with probability at least $1 - \delta$ over the initialization, Algorithm 1 achieves the following regret upper bound:

$$R_T \leq \widetilde{O}\Big( \sqrt{\tilde{d}\, k\, T} \cdot \max\big\{ \sqrt{\tilde{d}},\; S^2 \big\} \Big) \quad (5.5)$$

where $\tilde{d} = \frac{\log \det(I + H)}{\log(1 + Tk)}$ and $S = \sqrt{\mathbf{y}^\top H^{-1} \mathbf{y}}$.

Theorem 5.1 provides a regret upper bound for PRB of order $\widetilde{O}(\tilde{d}\sqrt{kT})$ (see proofs in Appendix E). In contrast, graph-based methods (e.g., [18, 61]) lack such an upper bound on their performance. Theorem 5.1 yields several insights into PRB's performance. First, PRB's regret grows sub-linearly with respect to $T$. Second, PRB's performance is affected by the number of candidate nodes $k$: the larger the graph, the more difficult the link prediction problem. Third, $\tilde{d}$ and $S$ in the regret upper bound reflect the complexity of the neural function class required to realize the underlying reward function $\mathbf{v}_t^*$, i.e., the difficulty of learning $\mathbf{v}_t^*$. $\tilde{d}$ is the effective dimension, which measures the actual underlying dimension of the RKHS spanned by the NTK, and $S$ upper-bounds the optimal parameters in the context of NTK regression. Both are complexity terms that commonly appear in the literature on neural contextual bandits [76, 74]. In the general case $0 < \alpha < 1$, learning $\mathbf{v}_t^*$ proportionally turns into a bandit optimization problem, and the upper bound in Theorem 5.1 matches the SOTA results in neural bandits [76, 74]. In fact, the regret upper bound is closely related to the graph structure of $G_t$. In the special case $\alpha = 1$, learning $\mathbf{v}_t^*$ becomes a simple convex optimization problem (Eq. (4.1)) and PRB can find the exact optimal solution, which leads to zero regret. When $\alpha = 0$, the problem becomes a pure bandit optimization problem with the same regret upper bound as in Theorem 5.1.

6 Experiments

In this section, we begin with a comprehensive evaluation of our proposed method, PRB, against both bandit-based and graph-based baselines across online and offline link prediction settings. We then analyze the computational costs associated with each experiment and present additional ablation studies related to PRB. In the implementation of PRB, we adapt the efficient PageRank algorithm [42] to solve Eq. (4.1).

6.1 Online Link Prediction

In this subsection, we evaluate PRB in the online link prediction and node classification settings described in Sec. 3, compared with bandit-based baselines.

| Methods | MovieLens | Amazon Fashion | Facebook | GrQc |
|---|---|---|---|---|
| EE-Net | 1638 ± 15.3 | 1698 ± 19.3 | 2274 ± 27.1 | 3419 ± 16.5 |
| NeuGreedy | 1955 ± 17.3 | 1952 ± 27.4 | 2601 ± 14.2 | 3629 ± 18.2 |
| NeuralUCB | 1737 ± 16.8 | 1913 ± 18.6 | 2190 ± 16.3 | 3719 ± 16.4 |
| NeuralTS | 1683 ± 14.7 | 2055 ± 21.9 | 2251 ± 19.5 | 3814 ± 23.3 |
| PRB | 1555 ± 21.7 | 1455 ± 18.4 | 1929 ± 17.0 | 3236 ± 18.5 |

Table 1: Cumulative regret (mean ± std) of bandit-based methods on online link prediction.

Figure 2: Regret comparison of bandit-based methods on online link prediction datasets (average of 10 runs with standard deviation in shadow; detailed in Table 1).
Datasets and Setups. We use three categories of real-world datasets to compare PRB with bandit-based baselines. The details and experimental settings are as follows. (1) Recommendation datasets: MovieLens [32] and Amazon Fashion [48] (bipartite graphs). Given the user set $U$ and item set $I$, let $G_0$ be the graph with no edges, $G_0 = (V = U \cup I, E_0 = \emptyset)$. In round $t \in [T]$, we randomly select a user $v_t \in U$ and then randomly pick 100 items (arms) from $I$, including 10 of $v_t$'s purchased items, forming $V_t$. PRB runs on $G_{t-1}$ and selects an arm (node) $v_{t,\hat{i}} \in V_t$. If the selected arm $v_{t,\hat{i}}$ is an item purchased by $v_t$, the regret is 0 (reward is 1) and we add the edge $[v_t, v_{t,\hat{i}}]$ to $G_{t-1}$ to form the new graph $G_t$; otherwise, the regret is 1 (reward is 0) and $G_t = G_{t-1}$. (2) Social network datasets: Facebook [38] and GrQc [37]. Given the user set $V$, we have $G_0 = (V, E_0 = \emptyset)$. In round $t \in [T]$, we randomly select a source node $v_t$, which can be thought of as the serving user. Then, we randomly choose 100 nodes, including 10 nodes connected to $v_t$ in the original graph (with those edges removed), which form the arm pool $V_t$ associated with the context set $X_t$. PRB then selects one arm $v_{t,\hat{i}} \in V_t$. If $v_t$ and $v_{t,\hat{i}}$ are connected in the original graph, the regret is 0 and the edge $[v_t, v_{t,\hat{i}}]$ is added to $G_{t-1}$; otherwise, the regret is 1 and $G_t = G_{t-1}$. (3) Node classification datasets: Cora, Citeseer, and Pubmed from the Planetoid citation networks [68]. Recall the problem setting described in Sec. 4. Consider a $k$-class node classification problem. Given a graph $G_0 = (V, E_0 = \emptyset)$, in round $t \in [T]$ we randomly select a node $v_t \in V$ and predict its class. PRB then selects one supernode $\tilde{v}_{i_t}$. If $v_t$ belongs to class $i_t$, the regret is 0 and $[v_t, \tilde{v}_{i_t}]$ is added to $G_{t-1}$; otherwise, the regret is 1 and $G_t = G_{t-1}$.

Baselines. For bandit-based methods, we apply NeuGreedy [12], which applies a greedy exploration strategy on top of the exploitation network; NeuralUCB [76], which uses the exploitation network to learn the reward function along with a UCB-based exploration strategy; NeuralTS [74], which uses the exploitation network to learn the reward function along with a Thompson Sampling exploration strategy; and EE-Net [12], which utilizes an exploitation-exploration network pair to learn the reward function. Following [76, 12], for all methods we train each network every 50 rounds for the first 2000 rounds and then every 100 rounds for the remaining rounds. See Appendix A.1 for additional experimental setups.

Online Link Prediction. Figure 2 depicts the regret trajectories over 10,000 rounds, and Table 1 details the cumulative regret after 10,000 rounds for all methods (lower is better). Based on the regret comparison, PRB consistently outperforms all baselines across all datasets. For example, the cumulative regret at 10,000 rounds for PRB on MovieLens is considerably lower than that of the best-performing baseline, EE-Net. Similarly, on the Amazon Fashion dataset, PRB achieves the lowest regret, surpassing the strongest baseline, EE-Net, by over 14%. This trend is consistent across the Facebook and GrQc datasets, where PRB also maintains the lowest regret. The consistency of PRB's performance across these datasets suggests the importance of utilizing the graph structure formed by previous link predictions.
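For concreteness, one round of the recommendation setup above can be assembled as follows (our sketch; `user_items`, a dict mapping each user to the set of items they purchased, is a hypothetical stand-in for the preprocessed MovieLens/Amazon Fashion data):

```python
import numpy as np

def make_round(user_items, all_items, rng, pool_size=100, n_pos=10):
    """Sample a serving user and a pool of `pool_size` arms containing
    `n_pos` of the user's purchased items, as in the setup above."""
    user = rng.choice(list(user_items))
    pos = rng.choice(sorted(user_items[user]), size=n_pos, replace=False)
    neg_pool = sorted(set(all_items) - user_items[user])
    neg = rng.choice(neg_pool, size=pool_size - n_pos, replace=False)
    pool = np.concatenate([pos, neg])
    rng.shuffle(pool)
    return user, pool            # reward(v) = 1 iff v is in user_items[user]

# Toy usage with 50 synthetic users and 1000 items.
rng = np.random.default_rng(0)
user_items = {u: set(rng.choice(1000, size=30, replace=False).tolist()) for u in range(50)}
user, pool = make_round(user_items, range(1000), rng)
print(user, pool[:5])
```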
| Methods | Cora | Citeseer | Pubmed |
|---|---|---|---|
| EE-Net | 1990 ± 13.8 | 2299 ± 33.4 | 1659 ± 11.3 |
| NeuGreedy | 2826 ± 21.4 | 2543 ± 24.6 | 1693 ± 13.5 |
| NeuralUCB | 2713 ± 21.7 | 3101 ± 22.0 | 1672 ± 14.3 |
| NeuralTS | 1998 ± 15.6 | 3419 ± 39.5 | 1647 ± 11.3 |
| PRB | 1874 ± 25.6 | 2168 ± 35.7 | 1577 ± 10.7 |

Table 2: Cumulative regret (mean ± std) of bandit-based methods on online node classification.

Online Node Classification. Figure 3 and Table 2 show the regret comparison on online node classification. PRB consistently demonstrates the lowest cumulative regret at round 10,000, outperforming all other bandit methods. Overall, PRB decreases regret by 3.0%, 1.2%, and 3.5% compared to one of the best baselines, NeuralTS. This experiment demonstrates that PRB is versatile enough for applications beyond online link prediction, extending to other real-world tasks such as online node classification, and highlights PRB's advantage of fusing contextual bandits with PageRank for collaborative exploitation and exploration.

Figure 3: Regret comparison of bandit-based methods on online node classification datasets (average of 10 runs with standard deviation in shadow; detailed in Table 2).

6.2 Offline Link Prediction

In this subsection, we evaluate PRB in the offline link prediction setting against graph-based baselines, where training and testing datasets are provided, following the same evaluation process as [18, 61]. Here, we train PRB on the training dataset using the same sequential optimization method as in Sec. 6.1, and then run the trained PRB on the testing dataset. Note that, like the other baselines, PRB never sees the test data during training.

Datasets. We use real-world link prediction datasets to compare PRB with graph-based baselines. Specifically, we apply Cora, Citeseer, and Pubmed from the Planetoid citation networks [68], and ogbl-collab, ogbl-ppa, and ogbl-ddi from the Open Graph Benchmark [33] (see dataset statistics in Appendix C).

Setting. We strictly follow the experimental setup of [18] and use the Hits@k metric for evaluation. Please also refer to Appendix A.1 for additional setups.

Baselines. For graph-based methods, we choose traditional link prediction heuristics including CN [15], RA [77], and AA [3], and common GNNs including GCN [36] and SAGE [31]. We further employ SF-then-MPNN models, including SEAL [71] and NBFNet [78], as well as SF-and-MPNN models such as Neo-GNN [70] and BUDDY [18]. Additionally, we select the MPNN-then-SF models NCN [61] and NCNC [61]. The results of the baselines are sourced from Table 2 of [61].

| Methods | Cora (HR@100) | Citeseer (HR@100) | Pubmed (HR@100) | Collab (HR@50) | PPA (HR@100) | DDI (HR@20) |
|---|---|---|---|---|---|---|
| CN | 33.92 ± 0.46 | 29.79 ± 0.90 | 23.13 ± 0.15 | 56.44 ± 0.00 | 27.65 ± 0.00 | 17.73 ± 0.00 |
| AA | 39.85 ± 1.34 | 35.19 ± 1.33 | 27.38 ± 0.11 | 64.35 ± 0.00 | 32.45 ± 0.00 | 18.61 ± 0.00 |
| RA | 41.07 ± 0.48 | 33.56 ± 0.17 | 27.03 ± 0.35 | 64.00 ± 0.00 | 49.33 ± 0.00 | 27.60 ± 0.00 |
| GCN | 66.79 ± 1.65 | 67.08 ± 2.94 | 53.02 ± 1.39 | 44.75 ± 1.07 | 18.67 ± 1.32 | 37.07 ± 5.07 |
| SAGE | 55.02 ± 4.03 | 57.01 ± 3.74 | 39.66 ± 0.72 | 48.10 ± 0.81 | 16.55 ± 2.40 | 53.90 ± 4.74 |
| SEAL | 81.71 ± 1.30 | 83.89 ± 2.15 | 75.54 ± 1.32 | 64.74 ± 0.43 | 48.80 ± 3.16 | 30.56 ± 3.86 |
| NBFNet | 71.65 ± 2.27 | 74.07 ± 1.75 | 58.73 ± 1.99 | OOM | OOM | 4.00 ± 0.58 |
| Neo-GNN | 80.42 ± 1.31 | 84.67 ± 2.16 | 73.93 ± 1.19 | 57.52 ± 0.37 | 49.13 ± 0.60 | 63.57 ± 3.52 |
| BUDDY | 88.00 ± 0.44 | 92.93 ± 0.27 | 74.10 ± 0.78 | 65.94 ± 0.58 | 49.85 ± 0.20 | 78.51 ± 1.36 |
| NCN | 89.05 ± 0.96 | 91.56 ± 1.43 | 79.05 ± 1.16 | 64.76 ± 0.87 | 61.19 ± 0.85 | 82.32 ± 6.10 |
| NCNC | 89.65 ± 1.36 | 93.47 ± 0.95 | 81.29 ± 0.95 | 66.61 ± 0.71 | 61.42 ± 0.73 | 84.11 ± 3.67 |
| PRB | 92.33 ± 0.57 | 95.13 ± 1.28 | 84.54 ± 0.86 | 67.29 ± 0.31 | 63.47 ± 1.75 | 88.31 ± 4.36 |

Table 3: Results on offline link prediction benchmarks. OOM means out of GPU memory.
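The Hits@k metric above ranks each positive test edge against a shared set of negative edges; a minimal NumPy sketch consistent with the OGB evaluator's definition (ours) is:

```python
import numpy as np

def hits_at_k(pos_scores, neg_scores, k=100):
    """Fraction of positive test edges scored strictly higher than the
    k-th highest negative edge score (the OGB Hits@k convention)."""
    if len(neg_scores) < k:
        return 1.0
    kth_best_neg = np.sort(neg_scores)[-k]
    return float(np.mean(pos_scores > kth_best_neg))

# Toy usage with random scores: positives shifted up by one standard deviation.
rng = np.random.default_rng(0)
print(hits_at_k(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 5000), k=100))
```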
Comparison with Graph-based Baselines. We present the experimental results for all methods in Table 3. The results demonstrate that PRB consistently outperforms the other baselines across all six datasets. Specifically, compared to the most recent method, NCNC, PRB achieves a minimum improvement of 0.68% on the Collab dataset, a maximum of 4.2%, and an average of 2.42% across all datasets. Given that all baselines lack the perspective of exploration, these results demonstrate that fusing the exploitation and exploration of contextual bandits with learning graph connectivity through PageRank significantly enhances link prediction accuracy.

| Methods | MovieLens | Amazon Fashion | Facebook | GrQc | Cora | Citeseer | Pubmed |
|---|---|---|---|---|---|---|---|
| PRB | 1555 ± 21.7 | 1455 ± 18.4 | 1929 ± 17.0 | 3236 ± 18.5 | 1874 ± 25.6 | 2168 ± 35.7 | 1577 ± 10.7 |
| PRB-Greedy | 1892 ± 15.1 | 1567 ± 24.6 | 1994 ± 23.6 | 3332 ± 15.9 | 1932 ± 24.1 | 2194 ± 23.3 | 1634 ± 12.3 |
| PRB-(10%-G) | 1521 ± 17.6 | 1408 ± 23.5 | 1858 ± 15.7 | 3085 ± 14.3 | 1804 ± 23.5 | 2158 ± 33.1 | 1630 ± 11.5 |

Table 4: Cumulative regrets (mean ± std) of PRB variants for online link prediction and node classification.

6.3 Ablation and Sensitivity Studies

Table 4 presents the performance of different variants of PRB, including PRB-Greedy, which only uses the exploitation network, and PRB-(10%-G), which has a warm start with an additional 10% of edges in $G_0$. The results show that exploration is crucial to the final performance and that additional graph knowledge can further boost performance. Due to the space limit, we move all other experiment sections to Appendix A, including a computational cost analysis of PRB and additional ablation and sensitivity studies.

7 Conclusion

This paper introduces a fusion algorithm for link prediction, which integrates the power of contextual bandits in balancing exploitation and exploration with propagation over the graph structure by PageRank. We further provide a theoretical performance analysis for PRB, showing that the regret of the proposed algorithm grows sub-linearly. We conduct extensive experiments in link prediction to evaluate PRB's effectiveness, compared with both bandit-based and graph-based baselines.

Acknowledgments and Disclosure of Funding

This work is supported by the National Science Foundation under Award No. IIS-2117902 and DARPA (HR001121C0165). The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. In Artificial Intelligence and Statistics, pages 176–184. PMLR, 2017.

[3] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, 2003.

[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135. PMLR, 2013.

[5] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.

[6] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019.
[7] Yikun Ban, Ishika Agarwal, Ziwei Wu, Yada Zhu, Kommy Weldemariam, Hanghang Tong, and Jingrui He. Neural active learning beyond bandits. arXiv preprint arXiv:2404.12522, 2024.

[8] Yikun Ban and Jingrui He. Generic outlier detection in multi-armed bandit. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 913–923, 2020.

[9] Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pages 2335–2346, 2021.

[10] Yikun Ban, Jingrui He, and Curtiss B Cook. Multi-facet contextual bandits: A neural network perspective. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14–18, 2021, pages 35–45, 2021.

[11] Yikun Ban, Yunzhe Qi, Tianxin Wei, Lihui Liu, and Jingrui He. Meta clustering of neural bandits. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 95–106, 2024.

[12] Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. EE-Net: Exploitation-exploration neural networks in contextual bandits. In International Conference on Learning Representations, 2022.

[13] Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. Neural exploitation and exploration of contextual bandits. arXiv preprint arXiv:2305.03784, 2023.

[14] Yikun Ban, Yuheng Zhang, Hanghang Tong, Arindam Banerjee, and Jingrui He. Improved algorithms for neural active learning. Advances in Neural Information Processing Systems, 35:27497–27509, 2022.

[15] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[16] Smriti Bhagat, Graham Cormode, and S Muthukrishnan. Node classification in social networks. Social Network Data Analytics, pages 115–148, 2011.

[17] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836–10846, 2019.

[18] Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Yannick Hammerla, Michael M Bronstein, and Max Hansmire. Graph neural networks for link prediction with subgraph sketching. In The Eleventh International Conference on Learning Representations, 2022.

[19] Weilin Cong, Si Zhang, Jian Kang, Baichuan Yuan, Hao Wu, Xin Zhou, Hanghang Tong, and Mehrdad Mahdavi. Do we really need complicated model architectures for temporal networks? arXiv preprint arXiv:2302.11636, 2023.

[20] Zhongxiang Dai, Yao Shu, Arun Verma, Flint Xiaofeng Fan, Bryan Kian Hsiang Low, and Patrick Jaillet. Federated neural bandit. arXiv preprint arXiv:2205.14309, 2022.

[21] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback, 2008.

[22] Rohan Deb, Yikun Ban, Shiliang Zuo, Jingrui He, and Arindam Banerjee. Contextual bandits with online neural regression. arXiv preprint arXiv:2312.07145, 2023.

[23] Kaize Ding, Zhe Xu, Hanghang Tong, and Huan Liu. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explorations Newsletter, 24(2):61–77, 2022.

[24] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.

[25] Dongqi Fu, Yikun Ban, Hanghang Tong, Ross Maciejewski, and Jingrui He. Disco: Comprehensive and explainable disinformation detection. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 4848–4852, 2022.

[26] Dongqi Fu, Liri Fang, Ross Maciejewski, Vetle I. Torvik, and Jingrui He. Meta-learned metrics over multi-evolution temporal graphs. In Aidong Zhang and Huzefa Rangwala, editors, KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14–18, 2022, pages 367–377. ACM, 2022.

[27] Dongqi Fu and Jingrui He. SDG: A simplified and dynamic graph neural network. In Fernando Diaz, Chirag Shah, Torsten Suel, Pablo Castells, Rosie Jones, and Tetsuya Sakai, editors, SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11–15, 2021, pages 2273–2277. ACM, 2021.

[28] Chongming Gao, Kexin Huang, Jiawei Chen, Yuan Zhang, Biao Li, Peng Jiang, Shiqi Wang, Zhong Zhang, and Xiangnan He. Alleviating Matthew effect of offline reinforcement learning in interactive recommendation. arXiv preprint arXiv:2307.04571, 2023.

[29] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.

[30] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.

[31] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.

[32] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.

[33] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020.

[34] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[35] Yiling Jia, Weitong Zhang, Dongruo Zhou, Quanquan Gu, and Hongning Wang. Learning neural contextual bandits through perturbed rewards. In International Conference on Learning Representations, 2022.

[36] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[37] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2-es, 2007.

[38] Jure Leskovec and Julian Mcauley. Learning to discover social circles in ego networks. Advances in Neural Information Processing Systems, 25, 2012.

[39] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.

[40] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539–548, 2016.

[41] Zihao Li, Yuyi Ao, and Jingrui He. Sphere: Expressive and interpretable knowledge graph embedding for set retrieval. In Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang, editors, Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14–18, 2024, pages 2629–2634. ACM, 2024.

[42] Zihao Li, Dongqi Fu, and Jingrui He. Everything evolves in personalized PageRank. In Proceedings of the ACM Web Conference 2023, pages 3342–3352, 2023.

[43] David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 556–559, 2003.

[44] Xiao Lin, Jian Kang, Weilin Cong, and Hanghang Tong. BeMap: Balanced message passing for fair graph neural network. In Learning on Graphs Conference. PMLR, 2024.

[45] Xiao Lin, Zhining Liu, Dongqi Fu, Ruizhong Qiu, and Hanghang Tong. Backtime: Backdoor attacks on multivariate time series forecasting. arXiv preprint arXiv:2410.02195, 2024.

[46] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

[47] Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. arXiv preprint arXiv:1705.07347, 2017.

[48] Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.

[49] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.

[50] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.

[51] Yunzhe Qi, Yikun Ban, and Jingrui He. Neural bandit with arm group graph. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1379–1389, 2022.

[52] Yunzhe Qi, Yikun Ban, and Jingrui He. Graph neural bandits. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1920–1931, 2023.

[53] Yunzhe Qi, Yikun Ban, Tianxin Wei, Jiaru Zou, Huaxiu Yao, and Jingrui He. Meta-learning with neural bandit scheduler. Advances in Neural Information Processing Systems, 36, 2024.

[54] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.

[55] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637, 2020.

[56] Yuan Sui, Jiaru Zou, Mengyu Zhou, Xinyi He, Lun Du, Shi Han, and Dongmei Zhang. Tap4llm: Table provider on sampling, augmenting, and packing semi-structured data for large language model reasoning. arXiv preprint arXiv:2312.09039, 2023.

[57] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.

[58] Yuxing Tian, Yiyan Qi, and Fan Guo. FreeDyG: Frequency enhanced continuous-time dynamic graph model for link prediction. In The Twelfth International Conference on Learning Representations, 2023.

[59] Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Sixth International Conference on Data Mining (ICDM '06), pages 613–622. IEEE, 2006.

[60] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

[61] Xiyuan Wang, Haotong Yang, and Muhan Zhang. Neural common neighbor with completion for link prediction. In The Twelfth International Conference on Learning Representations, 2023.

[62] Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. Inductive representation learning in temporal networks via causal anonymous walks. arXiv preprint arXiv:2101.05974, 2021.

[63] Zhilei Wang, Pranjal Awasthi, Christoph Dann, Ayush Sekhari, and Claudio Gentile. Neural active learning with performance guarantees. Advances in Neural Information Processing Systems, 34, 2021.

[64] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Inductive representation learning on temporal graphs. arXiv preprint arXiv:2002.07962, 2020.

[65] Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780, 2020.

[66] Zhe Xu, Yuzhong Chen, Qinghai Zhou, Yuhang Wu, Menghai Pan, Hao Yang, and Hanghang Tong. Node classification beyond homophily: Towards a general solution. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2862–2873, 2023.

[67] Zhe Xu, Boxin Du, and Hanghang Tong. Graph sanitation with application to node classification. In Proceedings of the ACM Web Conference 2022, pages 1136–1147, 2022.

[68] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pages 40–48. PMLR, 2016.

[69] Le Yu, Leilei Sun, Bowen Du, and Weifeng Lv. Towards better dynamic graph learning: New architecture and unified library. Advances in Neural Information Processing Systems, 36:67686–67700, 2023.

[70] Seongjun Yun, Seoyoon Kim, Junhyun Lee, Jaewoo Kang, and Hyunwoo J Kim. Neo-GNNs: Neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems, 34:13683–13694, 2021.

[71] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in Neural Information Processing Systems, 31, 2018.

[72] Muhan Zhang and Yixin Chen. Inductive matrix completion based on graph neural networks. arXiv preprint arXiv:1904.12058, 2019.

[73] Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems, 34:9061–9073, 2021.

[74] Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural Thompson sampling. In International Conference on Learning Representations, 2021.

[75] Lecheng Zheng, Baoyu Jing, Zihao Li, Hanghang Tong, and Jingrui He. Heterogeneous contrastive learning for foundation models and beyond. In Ricardo Baeza-Yates and Francesco Bonchi, editors, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024, Barcelona, Spain, August 25–29, 2024, pages 6666–6676. ACM, 2024.
[76] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pages 11492–11502. PMLR, 2020.

[77] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71:623–630, 2009.

[78] Zhaocheng Zhu, Zuobai Zhang, Louis-Pascal Xhonneux, and Jian Tang. Neural Bellman-Ford networks: A general graph neural network framework for link prediction. Advances in Neural Information Processing Systems, 34:29476–29490, 2021.

[79] Jiaru Zou, Mengyu Zhou, Tao Li, Shi Han, and Dongmei Zhang. PromptIntern: Saving inference costs by internalizing recurrent prompt during large language model fine-tuning. arXiv preprint arXiv:2407.02211, 2024.

A Additional Experiments

A.1 Experiment Setups

Online Link Prediction Setups. For a fair comparison, for all bandit-based methods including PRB, the exploitation network $f_1$ is a 2-layer fully connected network of width 100. For the exploration network of EE-Net and PRB, we also use a 2-layer fully connected network of width 100. For NeuralUCB and NeuralTS, following the settings of [76, 74], we use the exploitation network $f_1$ and conduct a grid search for the exploration parameter $\nu$ over {0.001, 0.01, 0.1, 1} and for the regularization parameter $\lambda$ over {0.01, 0.1, 1}. For NeuralUCB/TS, following their settings, since storing and computing the whole gradient matrix is computationally expensive, we use a diagonal matrix approximation. For all grid-searched parameters, we choose the best value for comparison and report the average results of 10 runs for all methods. For all bandit-based methods, we use SGD as the optimizer for the exploitation network $f_1$; additionally, for EE-Net and PRB, we use the Adam optimizer for the exploration network $f_2$. For all neural networks, we conduct a grid search for the learning rate over {0.01, 0.001, 0.0005, 0.0001}. For PRB, we strictly follow the settings in [42] to implement the PageRank component; specifically, we set the damping parameter $\alpha = 0.85$ after a grid search over {0.1, 0.3, 0.5, 0.85, 0.9}, and the termination accuracy $\epsilon = 10^{-6}$. For each dataset, we first shuffle the data and then run each method for 10,000 rounds ($T = 10{,}000$). We train each network every 50 rounds when $t < 2000$ and every 100 rounds for the remaining rounds.

Offline Link Prediction Setups. For the graph-based methods, we strictly follow the experimental and hyperparameter settings in [61, 18] to reproduce the experimental results. The offline link prediction task requires graph links to play dual roles as both supervision labels and message-passing links. For all datasets, the message-passing links at training time are equal to the supervision links, while at test and validation time, disjoint sets of links are held out for supervision and never seen during training. All hyperparameters are tuned using a Weights and Biases random search, exploring hidden dimensions from 64 to 512, dropout from 0 to 1, layers from 1 to 3, weight decay from 0 to 0.001, and learning rates from 0.0001 to 0.01. The hyperparameters yielding the highest validation accuracy are selected, and results are reported on a single-use test set. For PRB, we use setups similar to those in the online setting: we utilize the exploitation network $f_1$ and the exploration network $f_2$, both of width 500. We set the number of training epochs to 100 and evaluate model performance on the validation and test datasets. We utilize the Adam optimizer for all baseline models; for the PRB implementation, we utilize the SGD optimizer for $f_1$ and the Adam optimizer for $f_2$.
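The online training cadence above can be expressed as a small helper (ours; one straightforward reading of "every 50 rounds for the first 2000 rounds, then every 100 rounds"):

```python
def should_train(t):
    """True if the networks are retrained at (1-indexed) round t:
    every 50 rounds while t < 2000, every 100 rounds afterwards."""
    return t % (50 if t < 2000 else 100) == 0

train_rounds = [t for t in range(1, 10001) if should_train(t)]
print(len(train_rounds), train_rounds[:3], train_rounds[-1])  # 120 [50, 100, 150] 10000
```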
A.2 Computational Cost Analysis

We conduct all of our experiments on an Nvidia 3060 GPU with an x64-based processor.

Time and space complexity. Let $n$ be the number of nodes, $t$ the index of the current round of link prediction, $k$ the number of candidate target nodes, $d$ the number of context dimensions, and $p$ the neural network width. For the online setting, let $m_t$ be the number of edges at round $t$. In the online link prediction setting, the time complexity of PRB is $O(kdp + m_t k)$, where the first term is the cost of calculating the exploitation-exploration score for each candidate node and the second term is the cost of running PageRank, following [42]. The space complexity is $O(n + m_t)$ to store node weights and edges. For the offline setting, let $m$ be the number of edges in the testing dataset and $F$ the number of target links to predict. Then, the inference time complexity of PRB for $F$ links is $O(ndp) + O(mF)$, where the first term is the cost of calculating the exploitation-exploration score for each node and the second term is the cost of PageRank [42].

In Figure 5, we analyze the running time of the internal components of PRB and PRB-Greedy (Algorithm 3). The comparison reveals that the random walk phase accounts for, on average, 10% (PRB) and 6.3% (PRB-Greedy) of the total running time across the seven datasets. The earlier results also demonstrate that PRB significantly outperforms EE-Net, which relies solely on the exploitation-exploration framework, by dedicating a small additional portion of time to the random walk component. By recording the total training time over 10,000 rounds, we also compare PRB with the other bandit-based baselines in Figure 4. Across all datasets, NeuralTS achieves the minimum average running time at 10.9 minutes, while PRB has the maximum at 17.5 minutes.

Figure 4: Running time comparison of PRB and bandit-based baselines.

Figure 5: Proportion of running time for PRB-Greedy (left) and PRB (right) between exploitation-exploration and random walk.

Additionally, given that the random walk component takes only a minimal portion of our algorithm's running time, the average running times are relatively close between PRB-Greedy (15.4 minutes) and NeuGreedy (14.8 minutes), and between PRB (17.5 minutes) and EE-Net (16.7 minutes). This comparative analysis reveals that while PRB incurs a somewhat longer running time, it remains competitive with established baselines while delivering a significant enhancement in performance. This observation underscores the efficacy of PRB and supports its practical utility despite the extra time cost.

Table 5 reports the inference time (one round, in seconds) of bandit-based methods on three datasets for online link prediction. Although PRB takes slightly longer, it remains within the same order of magnitude as the other baselines. We adopt the approximation methods from [42] for the PageRank component to significantly reduce computation costs while preserving good empirical performance.

| Methods | MovieLens | GrQc | Amazon |
|---|---|---|---|
| NeuralUCB | 0.11 | 0.01 | 0.02 |
| NeuGreedy | 0.14 | 0.02 | 0.03 |
| EE-Net | 0.17 | 0.03 | 0.04 |
| PRB | 0.20 | 0.03 | 0.04 |

Table 5: Inference time (s) of PRB and bandit-based methods in the online setting.

Table 6 reports the inference time (one epoch of testing, in seconds) of graph-based methods on three datasets for offline link prediction. PRB is faster than SEAL and shows competitive inference time compared to the other baselines.

| Methods | Cora | Pubmed | Collab |
|---|---|---|---|
| SEAL | 6.31 | 22.74 | 68.36 |
| Neo-GNN | 0.12 | 0.24 | 9.47 |
| BUDDY | 0.27 | 0.33 | 2.75 |
| NCNC | 0.04 | 0.07 | 1.58 |
| PRB | 0.11 | 0.58 | 3.52 |

Table 6: Inference time (s) of PRB and graph-based methods in the offline setting.
Figure 6: Regret comparison of PRB-Greedy, PRB, and PRB-Prior (mean of 10 runs with standard deviation in shadow; detailed in Tables 4 and 1).

Figure 7: Regret comparison of PRB, EE-Net, and EvePPR (mean of 10 runs with standard deviation in shadow; detailed in Table 1).

A.3 Additional Ablation and Sensitivity Studies

PRB variants. To evaluate PRB extensively in our experiments, we provide the following variants. PRB is the direct implementation of Algorithm 1; the initial graph $G_0$ contains all nodes and no edges. PRB-Greedy is the greedy version of Algorithm 1, obtained by removing the exploration network, as specified in Algorithm 3. PRB-Prior (10%-G) is Algorithm 1 with prior knowledge, obtained by revealing 10% of the training edges in the initial graph (a construction sketch is given at the end of this subsection). We apply PRB-Prior in our experiments to demonstrate how extra prior knowledge about the graph improves PRB's decision-making process.

Figure 6 and Table 4 highlight the regret comparison of the three PRB variants: PRB, PRB-Greedy, and PRB-Prior. For both online link prediction and node classification, PRB surpasses PRB-Greedy by an average of 5.8%, highlighting the benefit of the exploration network embedded within PRB. Additionally, in online link prediction, the PRB-Prior (10%-G) variant consistently outperforms its counterparts across the majority of datasets. This is particularly evident on the MovieLens and Amazon Fashion datasets, where it achieves notably low cumulative regrets of 1521 and 1408, respectively. Similarly, in online node classification, PRB-Prior (10%-G) demonstrates exceptional performance on two out of three datasets, recording cumulative regrets of 1804 on Cora and 2158 on Citeseer. These results emphasize the benefit of incorporating prior knowledge within PRB to enhance predictive accuracy.

Effectiveness of bandits and PageRank. In Figure 7, we compare the performance of PRB with that of EvePPR [42] and EE-Net [12], which represent methodologies based on PageRank and contextual bandits, respectively. On one hand, PRB significantly outperforms EvePPR by integrating the exploitation-exploration strategy, which enhances PageRank's decision-making capabilities. On the other hand, PRB surpasses EE-Net by leveraging a more comprehensive understanding of the input graph's structure and connectivity through enhanced PageRank. Overall, PRB consistently achieves lower regrets than both EvePPR and EE-Net, demonstrating the effectiveness of combining exploitation-exploration with PageRank.
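For clarity on how the variants differ only in their initial graph, the following illustrative helper (hypothetical names, assuming NetworkX; not the released code) builds $G_0$ with all nodes and, for PRB-Prior (10%-G), a revealed fraction of the training edges:

```python
import random
import networkx as nx

def build_initial_graph(nodes, train_edges, reveal_frac=0.0, seed=0):
    """G0 for PRB / PRB-Greedy (reveal_frac=0: nodes only, no edges),
    or for PRB-Prior (10%-G) (reveal_frac=0.10: 10% of training edges)."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    if reveal_frac > 0:
        rng = random.Random(seed)
        n_reveal = int(reveal_frac * len(train_edges))
        g.add_edges_from(rng.sample(list(train_edges), n_reveal))
    return g

g0_prb = build_initial_graph(range(100), [])                      # PRB
g0_prior = build_initial_graph(range(100),
                               [(i, i + 1) for i in range(99)],
                               reveal_frac=0.10)                  # PRB-Prior
```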
B Limitations

In this paper, we propose the PRB algorithm, which integrates the exploitation-exploration of contextual bandits with PageRank. We do not investigate other integration methods, such as combining this exploitation-exploration with other random-walk algorithms or with GNNs. Moreover, we evaluate PRB only on online link prediction and node classification; several other real-world tasks, such as subgraph matching and node clustering, remain unexplored. Our future research will extend PRB to these and additional related tasks [79, 56] to assess its broader applicability.

C Graph Dataset Statistics

The statistics of each dataset are shown in Table 7. Random splits use 70%, 10%, and 20% of the edges for the training, validation, and test sets, respectively.

Table 7: Dataset Statistics

                 Cora     Citeseer   Pubmed   Collab      PPA          DDI
#Nodes           2,708    3,327      18,717   235,868     576,289      4,267
#Edges           5,278    4,676      44,327   1,285,465   30,326,273   1,334,889
Splits           random   random     random   fixed       fixed        fixed
Average Degree   3.9      2.74       4.5      5.45        52.62        312.84

D Variant Algorithms

Algorithm 2: PRB-N (Node Classification)
Input: $f_1$, $f_2$, $T$, $G_0$, learning rates $\eta_1$, $\eta_2$
1:  Initialize $\theta^1_0$, $\theta^2_0$
2:  for $t = 1, 2, \ldots, T$ do
3:    Observe the serving node $v_t$, candidate nodes $V_t = \{v_1, v_2, \ldots, v_k\}$, contexts $X_t$, and graph $G_{t-1}$
4:    for each $v_{t,i} \in V_t$ do
5:      $h_t[i] = f_1(x_{t,i}; \theta^1_{t-1}) + f_2(\phi(x_{t,i}); \theta^2_{t-1})$
6:    end for
7:    Compute $P_t$ based on $G_{t-1}$
8:    Solve $v_t = \alpha P_t v_t + (1 - \alpha) h_t$
9:    Select $\hat{i} = \arg\max_{v_{t,i} \in V_t} v_t[i]$
10:   Observe $r_{t,\hat{i}}$
11:   if $r_{t,\hat{i}} = 1$ then
12:     Add $[v_t, v_{t,\hat{i}}]$ to $G_{t-1}$ and set the result as $G_t$
13:   else
14:     $G_t = G_{t-1}$
15:   end if
16:   $\theta^1_t = \theta^1_{t-1} - \eta_1 \nabla_{\theta^1_{t-1}} \mathcal{L}(x_{t,\hat{i}}, r_{t,\hat{i}}; \theta^1_{t-1})$
17:   $\theta^2_t = \theta^2_{t-1} - \eta_2 \nabla_{\theta^2_{t-1}} \mathcal{L}\big(\phi(x_{t,\hat{i}}),\, r_{t,\hat{i}} - f_1(x_{t,\hat{i}}; \theta^1_{t-1});\, \theta^2_{t-1}\big)$
18: end for

Algorithm 3: PRB-Greedy
Input: $f_1$, $T$, $G_0$, learning rate $\eta_1$
1:  Initialize $\theta^1_0$
2:  for $t = 1, 2, \ldots, T$ do
3:    Observe the serving node $v_t$, candidate nodes $V_t$, contexts $X_t$, and graph $G_{t-1}$
4:    for each $v_{t,i} \in V_t$ do
5:      $h_t[i] = f_1(x_{t,i}; \theta^1_{t-1})$
6:    end for
7:    Compute $P_t$ based on $G_{t-1}$
8:    Solve $v_t = \alpha P_t v_t + (1 - \alpha) h_t$
9:    Select $\hat{i} = \arg\max_{v_{t,i} \in V_t} v_t[i]$
10:   Observe $r_{t,\hat{i}}$
11:   if $r_{t,\hat{i}} = 1$ then
12:     Add $[v_t, v_{t,\hat{i}}]$ to $G_{t-1}$ and set the result as $G_t$
13:   else
14:     $G_t = G_{t-1}$
15:   end if
16:   $\theta^1_t = \theta^1_{t-1} - \eta_1 \nabla_{\theta^1_{t-1}} \mathcal{L}(x_{t,\hat{i}}, r_{t,\hat{i}}; \theta^1_{t-1})$
17: end for
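To make the round structure of Algorithms 2 and 3 concrete, here is a minimal illustrative sketch of one PRB-N round, written as our own simplification rather than the released implementation: f1, f2, phi, and reward_fn are callables standing in for the two networks, the exploration embedding, and the environment feedback, and restricting $P_t$ to the candidate set is a simplification. Line 8 is solved here via the closed form of Lemma E.14 instead of power iteration.

```python
import numpy as np

def prb_n_round(f1, f2, phi, X, P, edges, serving, candidates, reward_fn,
                alpha=0.85):
    """One (simplified) round of Algorithm 2, PRB-N.

    X: (k, d) contexts of the k candidates; P: (k, k) column-stochastic
    transition matrix of G_{t-1} restricted to the candidates.
    """
    k = len(candidates)
    # Lines 4-6: exploitation + exploration score for each candidate.
    h = np.array([f1(x) + f2(phi(x)) for x in X])
    # Lines 7-8: solve v = alpha * P v + (1 - alpha) * h (closed form,
    # Lemma E.14) instead of iterating.
    v = (1.0 - alpha) * np.linalg.solve(np.eye(k) - alpha * P, h)
    # Line 9: select the candidate with the highest blended score.
    i_hat = int(np.argmax(v))
    # Lines 10-15: observe the reward and add the edge only on success.
    r = reward_fn(serving, candidates[i_hat])
    if r == 1:
        edges.add((serving, candidates[i_hat]))
    # Lines 16-17 (omitted): one SGD step on f1 with target r, and one Adam
    # step on f2 with residual target r - f1(x_hat), as in Appendix A.1.
    return i_hat, r
```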
E Proof of Theorem 5.1

E.1 Preliminaries

Following the neural function definitions and lemmas of [13], given an instance $x$, we define the outputs of the hidden layers of the neural network (Eq. (5.4)): $h_0 = x$, $h_l = \sigma(W_l h_{l-1})$, $l \in [L-1]$. Then, we define the binary diagonal matrix functioning as ReLU: $D_l = \operatorname{diag}\big(\mathbb{1}\{(W_l h_{l-1})_1 > 0\}, \ldots, \mathbb{1}\{(W_l h_{l-1})_m > 0\}\big)$, $l \in [L-1]$. Accordingly, the neural network (Eq. (5.4)) is represented by

$f(x; \theta) = W_L \big(\prod_{l=1}^{L-1} D_l W_l\big) x$,  (E.1)

and its gradient with respect to $W_l$ is

$\nabla_{W_l} f = \big[h_{l-1} W_L \prod_{\tau=l+1}^{L-1} D_\tau W_\tau\big]^\top$ for $l \in [L-1]$, and $\nabla_{W_L} f = h_{L-1}^\top$ for $l = L$.  (E.2)

Here, given a constant $R > 0$, we define the following function class:

$B(\theta_0, R) = \{\theta \in \mathbb{R}^p : \|\theta - \theta_0\|_2 \le R/m^{1/4}\}$.  (E.3)

Let $L_t$ represent the squared loss function in round $t$, and let $x_1, x_2, \ldots, x_{Tk}$ represent all the context vectors presented in $T$ rounds. We then define the following instance-dependent complexity term:

$\Psi(\theta_0, R) = \inf_{\theta \in B(\theta_0, R)} \sum_{t=1}^{T} \big(f_2(x_t; \theta) - r_t\big)^2$.  (E.4)

Then, we have the following auxiliary lemmas.

Lemma E.1. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. With probability at least $1 - O(TkL)\exp(-\Omega(m\omega^{2/3}L))$ over the random initialization, for all $t \in [T]$, $i \in [k]$, and $\theta$ satisfying $\|\theta - \theta_0\|_2 \le \omega$ with $\omega \le O(L^{-9/2}[\log m]^{-3})$, it holds uniformly that (1) $|f(x_{t,i}; \theta)| \le O(1)$; (2) $\|\nabla_\theta f(x_{t,i}; \theta)\|_2 \le O(\sqrt{L})$; (3) $\|\nabla_\theta L_t(\theta_t)\|_2 \le O(\sqrt{L})$.

Lemma E.2. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. With probability at least $1 - O(TkL)\exp(-\Omega(m\omega^{2/3}L))$, for all $t \in [T]$, $i \in [k]$, and $\theta, \theta'$ satisfying $\|\theta - \theta_0\|_2 \le \omega$ and $\|\theta' - \theta_0\|_2 \le \omega$ with $\omega \le O(L^{-9/2}[\log m]^{-3})$, it holds uniformly that

$\big|f(x; \theta) - f(x; \theta') - \langle \nabla_{\theta'} f(x; \theta'), \theta - \theta' \rangle\big| \le O\big(\omega^{1/3} L^2 \sqrt{p \log m}\big)\, \|\theta - \theta'\|_2$.

Lemma E.3. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. With probability at least $1 - O(TkL)\exp(-\Omega(m\omega^{2/3}L))$, for all $t \in [T]$, $i \in [k]$, and $\theta, \theta'$ satisfying $\|\theta - \theta_0\|_2 \le \omega$ and $\|\theta' - \theta_0\|_2 \le \omega$ with $\omega \le O(L^{-9/2}[\log m]^{-3})$, it holds uniformly that

(1) $|f(x; \theta) - f(x; \theta')| \le O(\omega \sqrt{L}) + O\big(\omega^{4/3} L^2 \sqrt{p \log m}\big)$.  (E.5)

Lemma E.4 (Almost Convexity). Let $L_t(\theta) = (f(x_t; \theta) - r_t)^2 / 2$. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. With probability at least $1 - O(TkL^2)\exp[-\Omega(m\omega^{2/3}L)]$ over the randomness of the initialization, for all $t \in [T]$ and $\theta, \theta'$ satisfying $\|\theta - \theta_0\|_2 \le \omega$ and $\|\theta' - \theta_0\|_2 \le \omega$ with $\omega \le O(L^{-6}[\log m]^{-3/2})$, it holds uniformly that

$L_t(\theta') \ge L_t(\theta) + \langle \nabla_\theta L_t(\theta), \theta' - \theta \rangle - \epsilon$, where $\epsilon = O\big(\omega^{4/3} L^3 \sqrt{\log m}\big)$.

Lemma E.5 (User Trajectory Ball). Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. With probability at least $1 - O(TkL^2)\exp[-\Omega(m\omega^{2/3}L)]$ over the randomness of $\theta_0$, for any $R > 0$, it holds uniformly that $\|\theta_t - \theta_0\|_2 \le O(R/m^{1/4})$, $\forall t \in [T]$.

Lemma E.6 (Instance-dependent Loss Bound). Let $L_t(\theta) = (f(x_t; \theta) - r_t)^2 / 2$. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. With probability at least $1 - O(TkL^2)\exp[-\Omega(m\omega^{2/3}L)]$ over the randomness of the initialization, given any $R > 0$, it holds that

$\sum_{t=1}^{T} L_t(\theta_{t-1}) \le \sum_{t=1}^{T} L_t(\theta^*) + O(1) + O\Big(\frac{T L R^2}{\sqrt{m}}\Big) + O\Big(\frac{T R^{4/3} L^2 \sqrt{\log m}}{m^{1/3}}\Big)$,  (E.6)

where $\theta^* = \arg\inf_{\theta \in B(\theta_0, R)} \sum_{t=1}^{T} L_t(\theta)$.

E.2 Regret Analysis

Lemma E.7. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. In round $t \in [T]$, let $\hat{i}$ be the index selected by the algorithm. Then, for any $\delta \in (0,1)$ and $R > 0$, with probability at least $1 - \delta$, it holds uniformly for $t \in [T]$ that

$\sum_{\tau=1}^{t} E_{r_{\tau,\hat{i}}}\Big[\big| f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1}) + f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - r_{\tau,\hat{i}} \big| \,\Big|\, H_{\tau-1}\Big] \le \sqrt{t\big(\Psi(\theta_0, R) + O(1)\big)} + O(1)\sqrt{2t \log(O(1)/\delta)}$,

where $H_t = \{x_{\tau,\hat{i}}, r_{\tau,\hat{i}}\}_{\tau=1}^{t}$ represents the historical data selected by $\pi_\tau$ and the expectation is taken over the reward.

Proof. First, according to Lemma E.5, $\theta^2_0, \ldots, \theta^2_{T-1}$ all lie in $B(\theta_0, R/m^{1/4})$. Then, according to Lemma E.1, for any $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$, it holds uniformly that $|f_1(x_{t,\hat{i}}; \theta^1_t) + f_2(\phi(x_{t,\hat{i}}); \theta^2_t) - r_{t,\hat{i}}| \le O(1)$. Then, for any $\tau \in [t]$, define

$V_\tau := E_{r_{\tau,\hat{i}}}\Big[\big|f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - \big(r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1})\big)\big|\Big] - \big|f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - \big(r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1})\big)\big|$.  (E.8)

Then, we have $E[V_\tau \mid F_{\tau-1}] = 0$, where $F_{\tau-1}$ denotes the $\sigma$-algebra generated by the history $H_{\tau-1}$. Therefore, the sequence $\{V_\tau\}_{\tau=1}^{t}$ is a martingale difference sequence. Applying the Hoeffding–Azuma inequality, with probability at least $1 - \delta$,

$\underbrace{\sum_{\tau=1}^{t} E[V_\tau \mid F_{\tau-1}]}_{I_1} \;+\; \sum_{\tau=1}^{t} \big|f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - \big(r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1})\big)\big| \;\ge\; \sum_{\tau=1}^{t} E_{r_{\tau,\hat{i}}}\Big[\big|f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - \big(r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1})\big)\big| \,\Big|\, F_{\tau-1}\Big] - O(1)\sqrt{2t\log(O(1)/\delta)}$.

As $I_1$ is equal to 0, it remains to bound $I_2 := \sum_{\tau=1}^{t} \big|f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - \big(r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1})\big)\big|$. By the Cauchy–Schwarz inequality, $I_2 \le \sqrt{t \cdot I_3}$, where $I_3 := \sum_{\tau=1}^{t} \big(f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - (r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1}))\big)^2$. For $I_3$, based on Lemma E.6, for any $\theta'$ satisfying $\|\theta' - \theta^2_0\|_2 \le R/m^{1/4}$, with probability at least $1 - \delta$, we have

$I_3 \le \sum_{\tau=1}^{t} \big(f_2(\phi(x_{\tau,\hat{i}}); \theta') - (r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1}))\big)^2 + O(1) \overset{(a)}{\le} \Psi(\theta_0, R) + O(1)$,

where (a) is based on the definition of the instance-dependent complexity term. Combining the above inequalities, with probability at least $1 - \delta$,

$\sum_{\tau=1}^{t} E_{r_{\tau,\hat{i}}}\Big[\big|f_2(\phi(x_{\tau,\hat{i}}); \theta^2_{\tau-1}) - \big(r_{\tau,\hat{i}} - f_1(x_{\tau,\hat{i}}; \theta^1_{\tau-1})\big)\big| \,\Big|\, F_{\tau-1}\Big] \le \sqrt{t\big(\Psi(\theta_0, R) + O(1)\big)} + O(1)\sqrt{2t\log(O(1)/\delta)}$.

The proof is completed.
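For completeness, the concentration step above relies on the standard Hoeffding–Azuma inequality; we record it here in the generic form used (a textbook fact; $U_\tau$ and $B$ are placeholders, not notation from the paper):

```latex
% Hoeffding–Azuma inequality, in the form used in the proof of Lemma E.7:
% for a martingale difference sequence {V_tau} (E[V_tau | F_{tau-1}] = 0)
% with |V_tau| <= B almost surely, for any delta in (0, 1),
\[
   \mathbb{P}\Bigl[\ \sum_{\tau=1}^{t} V_\tau \ \ge\ B\sqrt{2t\log(1/\delta)}\ \Bigr] \ \le\ \delta .
\]
% Applied to V_tau = E[U_tau | F_{tau-1}] - U_tau, with U_tau the per-round
% prediction error, this yields, with probability at least 1 - delta,
\[
   \sum_{\tau=1}^{t} \mathbb{E}\bigl[U_\tau \mid F_{\tau-1}\bigr]
   \ \le\ \sum_{\tau=1}^{t} U_\tau \ +\ B\sqrt{2t\log(1/\delta)} .
\]
```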
Corollary E.1. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.1. For any $t \in [T]$, let $i^*$ be the index selected by some fixed policy, denoted $\pi^*$, and let $r_{t,i^*}$ be the corresponding reward. Let $\theta^{1,*}_{t-1}, \theta^{2,*}_{t-1}$ be the intermediate parameters trained by Algorithm 1 using the data selected by $\pi^*$. Then, with probability at least $1 - \delta$ over the randomness of the initialization, for any $\delta \in (0,1)$ and $R > 0$, it holds that

$\sum_{\tau=1}^{t} E\Big[\big|f_2(\phi(x_{\tau,i^*}); \theta^{2,*}_{\tau-1}) - \big(r_{\tau,i^*} - f_1(x_{\tau,i^*}; \theta^{1,*}_{\tau-1})\big)\big| \,\Big|\, \pi^*, H^*_{\tau-1}\Big] \le \sqrt{t\big(\Psi(\theta_0, R) + O(1)\big)} + O(1)\sqrt{2t\log(O(1)/\delta)}$,

where $H^*_{\tau-1} = \{x_{\tau',i^*}, r_{\tau',i^*}\}_{\tau'=1}^{\tau-1}$ represents the historical data produced by $\pi^*$ and the expectation is taken over the reward.

For brevity, define $g(x_t; \theta) = \nabla_\theta f(x_t; \theta)$.

Lemma E.8. Suppose $m$ satisfies the conditions in Theorem 5.1. With probability at least $1 - \delta$ over the initialization, there exists $\theta^* \in B(\theta_0, \widetilde{\Omega}(T^{3/2}))$ such that

$\sum_{t=1}^{T} E\big[(r_t - f(x_t; \theta^*))^2 / 2\big] \le O\Big(\big(\sqrt{\tilde{d}\log(1+Tk) - 2\log\delta} + S + 1\big)^2\, \tilde{d}\log(1+Tk)\Big)$.

Proof. We have

$\sum_{t=1}^{T} E\big[(r_t - f(x_t; \theta^*))^2\big] \le \sum_{t=1}^{T} \big(y(x_t) - f(x_t; \theta^*)\big)^2 + O(1)$
$\overset{(a)}{\le} \Big(\sqrt{\log\det(A_T) - 2\log\delta} + S + 1\Big)^2 \sum_{t=1}^{T} \|g(x_t; \theta_0)\|^2_{A_T^{-1}} + 2Tk \cdot O\Big(\frac{T^2 L^3 \sqrt{\log m}}{m^{1/3}}\Big)$
$\overset{(b)}{\le} \Big(\sqrt{\tilde{d}\log(1+Tk) - 2\log\delta} + S + 1\Big)^2 \big(\tilde{d}\log(1+Tk) + 1\big) + O(1)$,

where (a) is based on Lemma E.9, (b) is an application of Lemma 11 in [1] together with Lemma E.12, and the $O(1)$ terms are induced by the choice of $m$. Ignoring the $O(1)$ terms, the proof is completed.

Definition E.1. Given the context vectors $\{x_i\}_{i=1}^{T}$ and the rewards $\{r_i\}_{i=1}^{T}$, we define the ridge-regression estimator $\hat{\theta}_t$ via

$A_t = \lambda I + \sum_{i=1}^{t} g(x_i; \theta_0)\, g(x_i; \theta_0)^\top, \qquad b_t = \sum_{i=1}^{t} r_i\, g(x_i; \theta_0), \qquad \hat{\theta}_t = A_t^{-1} b_t$.

Lemma E.9. Suppose $m$ satisfies the conditions in Theorem 5.1. With probability at least $1 - \delta$ over the initialization, there exists $\theta^* \in B(\theta_0, \widetilde{\Omega}(T^{3/2}))$ such that, for all $t \in [T]$,

$|y(x_t) - f(x_t; \theta^*)| \le O\Big(\sqrt{\log\det(A_t) - 2\log\delta} + S + 1\Big)\, \|g(x_t; \theta_0)\|_{A_t^{-1}} + O\Big(\frac{T^2 L^3 \sqrt{\log m}}{m^{1/3}}\Big)$.

Proof. Given the set of context vectors $\{x_t\}_{t=1}^{T}$ with the ground-truth function $h$ and a fully connected neural network $f$, we have

$|y(x_t) - f(x_t; \theta^*)| \le \big| y(x_t) - \langle g(x_t; \theta_0), \hat{\theta}_t \rangle \big| + \big| f(x_t; \theta^*) - \langle g(x_t; \theta_0), \hat{\theta}_t \rangle \big|$,

where $\hat{\theta}_t$ is the ridge-regression estimator of Definition E.1. Based on Lemma E.10, there exists $\theta^* \in \mathbb{R}^p$ such that $h(x_i) = \langle g(x_i; \theta_0), \theta^* - \theta_0 \rangle$. Thus,

$\big| y(x_t) - \langle g(x_t; \theta_0), \hat{\theta}_t \rangle \big| = \big| \langle g(x_t; \theta_0), \theta^* - \theta_0 \rangle - \langle g(x_t; \theta_0), \hat{\theta}_t \rangle \big| \le \Big(\sqrt{\log\det(A_t) - 2\log\delta} + S\Big)\, \|g(x_t; \theta_0)\|_{A_t^{-1}}$,

where the final inequality is based on Theorem 2 in [1] and holds with probability at least $1 - \delta$ for any $t \in [T]$. Second, we bound

$\big| f(x_t; \theta^*) - \langle g(x_t; \theta_0), \hat{\theta}_t \rangle \big| \le \big| f(x_t; \theta^*) - \langle g(x_t; \theta_0), \theta^* - \theta_0 \rangle \big| + \big| \langle g(x_t; \theta_0), \theta^* - \theta_0 \rangle - \langle g(x_t; \theta_0), \hat{\theta}_t \rangle \big|$.

For the first term, since we initialize $f(x_t; \theta_0) = 0$,

$\big| f(x_t; \theta^*) - \langle g(x_t; \theta_0), \theta^* - \theta_0 \rangle \big| = \big| f(x_t; \theta^*) - f(x_t; \theta_0) - \langle g(x_t; \theta_0), \theta^* - \theta_0 \rangle \big|$,

and the bound follows from Lemma E.2 with $\omega = O(T^{3/2})/m^{1/4}$. Next,

$\big| \langle g(x_t; \theta_0), \theta^* - \theta_0 \rangle - \langle g(x_t; \theta_0), \hat{\theta}_t \rangle \big| = \big| \langle g(x_t; \theta_0), \theta^* - \theta_0 - \hat{\theta}_t \rangle \big| \le \|g(x_t; \theta_0)\|_{A_t^{-1}}\, \|\theta^* - \theta_0 - \hat{\theta}_t\|_{A_t} \le \|g(x_t; \theta_0)\|_{A_t^{-1}}\, \sqrt{\|A_t\|_2}\, \|\theta^* - \theta_0 - \hat{\theta}_t\|_2$.

Due to Lemma E.12 and Lemma E.11, $\|A_t\|_2\, \|\theta^* - \theta_0 - \hat{\theta}_t\|_2^2 \le (1 + O(tL)) \cdot \frac{1}{1 + O(tL)} = O(1)$. Finally, putting everything together,

$|y(x_t) - f(x_t; \theta^*)| \le \gamma_1\, \|g(x_t; \theta_0)/\sqrt{m}\|_{A_t^{-1}} + \gamma_2$,

where $\gamma_1, \gamma_2$ collect the factors above. The proof is completed.

Definition E.2. Define

$G^{(0)} = [g(x_1; \theta_0), \ldots, g(x_T; \theta_0)] \in \mathbb{R}^{p \times T}$, $\quad G_0 = [g(x_1; \theta_0), \ldots, g(x_{Tk}; \theta_0)] \in \mathbb{R}^{p \times Tk}$, $\quad \mathbf{r} = (r_1, \ldots, r_T)^\top \in \mathbb{R}^{T}$.

$G^{(0)}$ and $\mathbf{r}$ are formed by the selected contexts and observed rewards over $T$ rounds, while $G_0$ is formed by all the presented contexts. Inspired by Lemma B.2 in [76], with $\eta = m^{-1/4}$, we define the auxiliary sequence

$\theta^{(0)} = \theta_0, \qquad \theta^{(j+1)} = \theta^{(j)} - \eta\Big[ G^{(0)}\big([G^{(0)}]^\top (\theta^{(j)} - \theta_0) - \mathbf{r}\big) + \lambda (\theta^{(j)} - \theta_0) \Big]$.

Lemma E.10. Suppose $m$ satisfies the conditions in Theorem 5.1. With probability at least $1 - \delta$ over the initialization, for any $t \in [T]$ and $i \in [k]$, the following holds uniformly: $h_{u_t}(x_{t,i}) = \langle g(x_{t,i}; \theta_0), \theta^* - \theta_0 \rangle$.

Proof. Based on Lemma E.13 with a proper choice of $\epsilon$, we have

$\|G_0^\top G_0 - H\|_2 \le \|G_0^\top G_0 - H\|_F \le \epsilon \implies G_0^\top G_0 \succeq H - \epsilon I \succeq H - \lambda_0 I/2 \succeq H/2 \succ 0$.

Define $\mathbf{h} = [h_{u_1}(x_1), \ldots, h_{u_T}(x_{Tk})]^\top$. Suppose the singular value decomposition of $G_0$ is $G_0 = PAQ^\top$ with $P \in \mathbb{R}^{p \times Tk}$, $A \in \mathbb{R}^{Tk \times Tk}$, $Q \in \mathbb{R}^{Tk \times Tk}$; then $A \succ 0$. Define $\theta^* = \theta_0 + PA^{-1}Q^\top \mathbf{h}$. Then, we have

$G_0^\top (\theta^* - \theta_0) = Q A P^\top P A^{-1} Q^\top \mathbf{h} = \mathbf{h}$,

which leads to

$\sum_{t=1}^{T} \sum_{i=1}^{k} \big( h_{u_t}(x_{t,i}) - \langle g(x_{t,i}; \theta_0), \theta^* - \theta_0 \rangle \big) = 0$.

Therefore, the result holds, and

$\|\theta^* - \theta_0\|_2^2 = \mathbf{h}^\top Q A^{-2} Q^\top \mathbf{h} = \mathbf{h}^\top (G_0^\top G_0)^{-1} \mathbf{h} \le 2\, \mathbf{h}^\top H^{-1} \mathbf{h}$.  (E.15)
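As a concrete illustration of the ridge-regression estimator in Definition E.1 above, here is a small NumPy sketch (our own illustrative code, with random placeholder gradient features standing in for $g(x_i; \theta_0)$; the regularizer $\lambda I$ is made explicit, and the proofs later set $\lambda = 1$):

```python
import numpy as np

def ridge_estimator(G, r, lam=1.0):
    """Definition E.1: A_t = lam*I + sum_i g_i g_i^T, b_t = sum_i r_i g_i.

    G: (t, p) matrix whose rows are gradient features g(x_i; theta_0).
    r: (t,) observed rewards.
    """
    t, p = G.shape
    A = lam * np.eye(p) + G.T @ G          # A_t
    b = G.T @ r                            # b_t
    theta_hat = np.linalg.solve(A, b)      # A_t^{-1} b_t, without an explicit inverse
    return A, theta_hat

# Toy usage with random placeholder features:
rng = np.random.default_rng(0)
G = rng.normal(size=(50, 8))
r = rng.uniform(size=50)
A, theta_hat = ridge_estimator(G, r)
```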
Lemma E.11. There exists $\theta^* \in B(\theta_0, \widetilde{O}(T^{3/2}L + T))$ such that, with probability at least $1 - \delta$, the following hold: (1) $\|\theta^* - \theta_0\|_2 \le \widetilde{O}(T^{3/2}L + T)/m^{1/4}$; (2) $\|\theta^* - \theta_0 - \hat{\theta}_t\|_2^2 \le \frac{1}{1 + O(TL)}$.

Proof. The sequence $\theta^{(j)}$ is updated by gradient descent on the loss function

$\min_\theta \mathcal{L}(\theta) = \frac{1}{2}\big\| [G^{(0)}]^\top (\theta - \theta^{(0)}) - \mathbf{r} \big\|_2^2 + \frac{m\lambda}{2}\big\| \theta - \theta^{(0)} \big\|_2^2$.

For any $j > 0$, we have $\sqrt{T}\max_{t \in [T]} \|g(x_t; \theta_0)\|_2 \le O(\sqrt{TL})$, where the inequality holds by Lemma E.1. Finally, given $j > 0$,

$\|\theta^{(j)} - \theta^{(0)}\|_2 \le \sum_{i=1}^{j} \eta\, \Big\| G^{(0)}\big([G^{(0)}]^\top (\theta^{(i)} - \theta_0) - \mathbf{r}\big) + \lambda(\theta^{(i)} - \theta_0) \Big\|_2 \le \frac{O\big(j(T\sqrt{L} + \sqrt{T\lambda})\big)}{m^{1/4}}$.  (E.16)

For (2), by standard results of gradient descent on ridge regression, $\theta^{(j)}$ converges to the optimum $\theta^{(0)} + \hat{\theta}_t$. Therefore, we have

$\|\theta^{(j)} - \theta^{(0)} - \hat{\theta}_t\|_2^2 \le [1 - \eta\lambda]^j \cdot 2\big(\mathcal{L}(\theta^{(0)}) - \mathcal{L}(\theta^{(0)} + \hat{\theta}_t)\big)$.

By setting $\lambda = 1$ and $j = \log\big((T + O(T^2 L))^{-1}\big)/\log(1 - m^{-1/4})$, we have $\|\theta^{(j)} - \theta_0 - \hat{\theta}_t\|_2^2 \le \frac{1}{1 + O(TL)}$. Substituting this $j$ and $\lambda$ into (E.16) finishes the proof.

Lemma E.12. Suppose $m$ satisfies the conditions in Theorem 5.1. With probability at least $1 - \delta$ over the initialization, the following hold: $\|A_T\|_2 \le 1 + O(TL)$ and $\log\frac{\det(A_T)}{\det(I)} \le \tilde{d}\log(1 + Tk) + 1$.

Proof. Based on Lemma E.1, for any $t \in [T]$, $\|g(x_t; \theta_0)\|_2 \le O(\sqrt{L})$. Then, for the first item:

$\Big\| I + \sum_{t=1}^{T} g(x_t; \theta_0) g(x_t; \theta_0)^\top \Big\|_2 \le \|I\|_2 + \sum_{t=1}^{T} \big\| g(x_t; \theta_0) g(x_t; \theta_0)^\top \big\|_2 \le 1 + \sum_{t=1}^{T} \|g(x_t; \theta_0)\|_2^2 \le 1 + O(TL)$.

Next, we have

$\log\frac{\det(A_T)}{\det(I)} = \log\det\Big(I + \sum_{t=1}^{T} g(x_t; \theta_0) g(x_t; \theta_0)^\top\Big) \le \log\det\big(I + G_0^\top G_0\big)$.

Then, we have

$\log\det\big(I + G_0^\top G_0\big) = \log\det\big(I + H + (G_0^\top G_0 - H)\big) \le \log\det(I + H) + \big\langle (I + H)^{-1},\, G_0^\top G_0 - H \big\rangle \le \log\det(I + H) + \|(I + H)^{-1}\|_F\, \|G_0^\top G_0 - H\|_F \le \log\det(I + H) + \sqrt{T}\, \|G_0^\top G_0 - H\|_F \le \log\det(I + H) + 1 = \tilde{d}\log(1 + Tk) + 1$.

The first inequality is due to the concavity of $\log\det(\cdot)$; the third inequality is due to $\|(I + H)^{-1}\|_F \le \|I\|_F \le \sqrt{T}$; the fourth is because of the choice of $m$, based on Lemma E.13; the last equality follows from the definition of $\tilde{d}$. The proof is completed.

Lemma E.13. For any $\delta \in (0,1)$, if $m = \Omega\Big(\frac{L^6 \log(TkL/\delta)}{(\epsilon/Tk)^4}\Big)$, then with probability at least $1 - \delta$: $\|G_0^\top G_0 - H\|_F \le \epsilon$.

Proof. This is an application of Lemma B.1 in [76] with a proper setting of $\epsilon$.

Lemma E.14 (Exactness of PageRank [42]). When PageRank achieves the stationary distribution, $v_t = (1 - \alpha)(I - \alpha P_t)^{-1} h_t$.

Finally, we provide the proof of Theorem 5.1.

Proof. In round $t$, let $i^*$ be the optimal candidate index and $\hat{i}$ the index selected by PRB; let $y_t$ stack the ground-truth scores $y(x_{t,i})$ and let $v_t^*$ denote the stationary solution induced by $y_t$. Then,

$v_t^*[i^*] - v_t^*[\hat{i}] = v_t^*[i^*] - v_t[\hat{i}] + v_t[\hat{i}] - v_t^*[\hat{i}] \overset{(1)}{\le} v_t^*[i^*] - v_t[i^*] + v_t[\hat{i}] - v_t^*[\hat{i}] \le |v_t^*[i^*] - v_t[i^*]| + |v_t[\hat{i}] - v_t^*[\hat{i}]| \le 2\|v_t^* - v_t\|_2 \overset{(2)}{=} 2\big\| (1-\alpha)(I - \alpha P_t)^{-1} y_t - (1-\alpha)(I - \alpha P_t)^{-1} h_t \big\|_2 \le 2(1-\alpha)\big\| (I - \alpha P_t)^{-1} \big\|_2\, \|y_t - h_t\|_2$,

where (1) is by the choice of PRB (i.e., $v_t[\hat{i}] \ge v_t[i^*]$) and (2) is based on the exact solution of PageRank [42] (Lemma E.14). Let $\lambda_{\max}$ be the maximal eigenvalue of $P_t$. Because $P_t$ is a stochastic matrix, $\lambda_{\max} = 1$. Then, $I - \alpha P_t \succeq (1 - \alpha\lambda_{\max}) I = (1-\alpha) I$. Accordingly,

$2(1-\alpha)\big\|(I - \alpha P_t)^{-1}\big\|_2\, \|y_t - h_t\|_2 \le 2(1-\alpha)\Big\|\frac{1}{1-\alpha} I\Big\|_2\, \|y_t - h_t\|_2 = 2\|y_t - h_t\|_2$.

Then, based on Corollary E.1, Lemma E.8, and Lemma E.3, with the shadow parameters, we have

$\sum_{t=1}^{T} \|y_t - h_t\|_2^2 = \sum_{t=1}^{T} \sum_{v_{t,i} \in V_t} \big[y(x_{t,i}) - \big(f_2(\phi(x_{t,i}); \theta^2_{t-1}) + f_1(x_{t,i}; \theta^1_{t-1})\big)\big]^2 = \sum_{t=1}^{T} \sum_{v_{t,i} \in V_t} \big[f_2(\phi(x_{t,i}); \theta^2_{t-1}) - \big(y(x_{t,i}) - f_1(x_{t,i}; \theta^1_{t-1})\big)\big]^2 \le \widetilde{O}\big(\max(\tilde{d}, S^2)\big)$.

To sum up, by the Cauchy–Schwarz inequality,

$\sum_{t=1}^{T} \big(v_t^*[i^*] - v_t^*[\hat{i}]\big) \le 2\sum_{t=1}^{T} \|y_t - h_t\|_2 \le \widetilde{O}\Big(\sqrt{T \max(\tilde{d}, S^2)}\Big)$.

The proof is completed.
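As a quick numerical sanity check of Lemma E.14 and of the contraction step in the proof of Theorem 5.1 above, the following illustrative NumPy snippet (our own; not part of the paper's artifacts) compares the two stationary solutions. We work in the 1-norm, for which column-stochasticity directly gives $\|P_t\|_1 = 1$ and hence $\|(I - \alpha P_t)^{-1}\|_1 \le 1/(1-\alpha)$ via the Neumann series.

```python
import numpy as np

rng = np.random.default_rng(1)
k, alpha = 6, 0.85

# Column-stochastic transition matrix and two score vectors.
P = rng.random((k, k))
P /= P.sum(axis=0, keepdims=True)
h, y = rng.random(k), rng.random(k)

# Lemma E.14: stationary solution v = (1 - alpha) (I - alpha P)^{-1} h.
M = np.linalg.inv(np.eye(k) - alpha * P)
v_h = (1 - alpha) * M @ h   # blend induced by the learned scores h_t
v_y = (1 - alpha) * M @ y   # blend induced by the ground-truth scores y_t

# Contraction step of the proof: the gap between the two stationary vectors
# is controlled by the gap between the score vectors (here in the 1-norm).
assert np.linalg.norm(v_y - v_h, 1) <= np.linalg.norm(y - h, 1) + 1e-9
```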
NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: We have included discussion of our contributions in the Abstract and Introduction.

Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We have included the discussion of limitations in Appendix B.

Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations but they are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: Please see the theoretical analysis section and the appendix for details.

Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: Please see the appendix and our submitted source code.

Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We include the source code along with our submission.

Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We have provided detailed experimental settings in the Experiment and Appendix sections.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We have provided standard deviations as part of our experimental results.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: Please see the appendix, where we describe our computational infrastructure and execution times.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We confirm that this work conforms with the NeurIPS Code of Ethics.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: This paper does not have direct societal impact.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: This paper does not include any data or models that have a high risk for misuse.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We have properly credited the existing data and models we use.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?

Answer: [NA]

Justification: No new datasets are introduced.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: No human subjects are involved.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: No human subjects are involved.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.