Non-IID Transfer Learning on Graphs

Jun Wu1, Jingrui He1, Elizabeth Ainsworth1,2
1University of Illinois Urbana-Champaign
2USDA ARS Global Change and Photosynthesis Research Unit
{junwu3, jingrui, ainswort}@illinois.edu

Abstract

Transfer learning refers to the transfer of knowledge or information from a relevant source domain to a target domain. However, most existing transfer learning theories and algorithms focus on IID tasks, where the source/target samples are assumed to be independent and identically distributed. Very little effort is devoted to theoretically studying the knowledge transferability on non-IID tasks, e.g., cross-network mining. To bridge the gap, in this paper, we propose rigorous generalization bounds and algorithms for cross-network transfer learning from a source graph to a target graph. The crucial idea is to characterize the cross-network knowledge transferability from the perspective of the Weisfeiler-Lehman graph isomorphism test. To this end, we propose a novel Graph Subtree Discrepancy to measure the graph distribution shift between source and target graphs. Then the generalization error bounds on cross-network transfer learning, including both cross-network node classification and link prediction tasks, can be derived in terms of the source knowledge and the Graph Subtree Discrepancy across domains. This thereby motivates us to propose a generic graph adaptive network (GRADE) to minimize the distribution shift between source and target graphs for cross-network transfer learning. Experimental results verify the effectiveness and efficiency of our GRADE framework on both cross-network node classification and cross-domain recommendation tasks.

Introduction

Transfer learning (Pan and Yang 2009) tackles the knowledge transferability from a source domain to a relevant target domain under a distribution shift. It has been theoretically shown (Ben-David et al. 2010; Zhang et al. 2019a; Acuna et al. 2021) that the generalization performance of a learning algorithm can be improved by leveraging the knowledge from the source domain, when the source and target domains have the same labeling space (also known as domain adaptation (Pan and Yang 2009; Zhao et al. 2019)). To be more specific, the expected target error can be bounded in terms of the prediction error of the source domain and the distribution discrepancy across domains. This motivates a line of practical approaches based on domain discrepancy minimization in the latent feature space (Ganin et al. 2016; Acuna et al. 2021). However, it is noteworthy that most of the existing theoretical guarantees and empirical algorithms rely on the IID assumption that all the source/target samples are drawn independently from an identical source/target distribution. This hinders their application to tasks with non-IID data, e.g., node classification (Kipf and Welling 2017; Wu and He 2019; Tang et al. 2020) and recommendation (Zhao, Li, and Fu 2019; Zhou et al. 2021a,b) across domains.

Figure 1: Illustration of cross-network transfer learning (best viewed in color). Given a labeled source graph (color indicates node label) and an unlabeled target graph, cross-network transfer learning tackles the classification and link prediction tasks in the target graph by leveraging the auxiliary information from the source graph.
In this paper, we focus on the problem of cross-network transfer learning, where knowledge is transferred from a source graph to a target graph.¹ To be specific, we consider the following cross-network mining tasks (see Figure 1). (1) Node classification (e.g., cross-network role identification (Zhu et al. 2021)): it aims to predict the class labels of nodes in the target graph by leveraging the knowledge from a source graph with fully labeled nodes. Following (Wu et al. 2020), we consider the unsupervised learning scenario where the target domain has no label information. (2) Link prediction (e.g., cross-domain recommendation (Li and Tuzhilin 2020; Zhao, Li, and Fu 2019)): it predicts the missing links in the incomplete target graph by leveraging knowledge from a complete source graph.

¹In this paper, we use graph and network interchangeably to denote the graph-structured data in every domain.

Figure 2: Illustration of distribution shift on graphs. (a) Source and target graphs share the same graph structure, but they have different node attribute distributions (each node attribute is assumed to be a 3-dimensional feature vector, where a black box denotes 1 and a white box denotes 0). (b) Source and target graphs share a similar node attribute distribution but different graph structures.

The unique challenge of cross-network transfer learning lies in the interdependent nature of nodes within a graph. As shown in Figure 2, the distribution shift between source and target graphs can be induced by both node attributes and graph structure. We start by developing a novel distribution discrepancy measure named Graph Subtree Discrepancy between source and target graphs. This is motivated by the connection between existing message-passing graph neural networks (Hamilton, Ying, and Leskovec 2017; Xu et al. 2018, 2019) and the Weisfeiler-Lehman graph kernel (Weisfeiler and Leman 1968; Shervashidze et al. 2011). On one hand, the Weisfeiler-Lehman graph kernel shows that a nonparametric graph similarity can be decomposed into the similarity of a sequence of subtrees rooted at every node. On the other hand, message-passing graph neural networks iteratively aggregate messages from each node's local neighborhood in a parametric way. The Graph Subtree Discrepancy is therefore designed to measure the distribution shift between graphs by estimating the similarity/difference of the subtree representations learned by a message-passing graph neural network. As a result, it inherits both the high expressiveness of message-passing graph neural networks and the interpretability of the Weisfeiler-Lehman graph kernel. Based on the Graph Subtree Discrepancy, generalization error bounds can be derived for cross-network mining tasks. By empirically minimizing these error upper bounds, we propose a generic graph adaptive network (GRADE) for cross-network transfer learning. The efficacy of the proposed GRADE framework is confirmed on various cross-network mining data sets.

The major contributions of this paper are summarized as follows.
- We propose a novel Graph Subtree Discrepancy to measure the shift of node data distributions between source and target graphs. The generalization error bounds of cross-network transfer learning can then be derived based on the Graph Subtree Discrepancy.
- We propose a generic Graph Adaptive Network (GRADE) framework for cross-network transfer learning, followed by instantiations on cross-network node classification and cross-domain recommendation tasks.
- Extensive experiments demonstrate the effectiveness of our proposed GRADE framework on cross-network node classification and cross-domain recommendation tasks.

Related Work

Transfer Learning

Transfer learning (Pan and Yang 2009) refers to the knowledge transferability from a source domain to a relevant target domain. It is theoretically guaranteed (Mansour, Mohri, and Rostamizadeh 2009; Ben-David et al. 2010; Acuna et al. 2021; Wu and He 2021, 2022a,b; Wu et al. 2022a,b) that under mild conditions, the generalization performance of a learning algorithm on the target domain can be improved by leveraging the knowledge from the source domain. One common assumption behind those theoretical guarantees is that all the source/target samples are drawn independently from an identical source/target probability distribution. More recently, (Levie, Isufi, and Kutyniok 2019; Ruiz, Chamon, and Ribeiro 2020; Zhu et al. 2021) proposed to analyze the transferability of graph neural networks using graphons or ego-graphs. Nevertheless, those works explore whether graph neural networks are transferable given two graphs. In contrast, in this paper, by using a hypothesis-dependent Graph Subtree Discrepancy, we show how knowledge can be transferred across graphs. The resulting theoretical analysis provides insight into designing practical cross-network transfer learning algorithms.

Cross-Network Mining

Cross-network mining aims to exploit the informative patterns of multiple relevant networks/graphs for a variety of mining tasks, e.g., cross-network node classification (Wu et al. 2020; Zhang et al. 2019b; Zhu et al. 2021), multi-domain graph clustering (Ni et al. 2015), cross-domain recommendation (Zhao, Li, and Fu 2019; Li and Tuzhilin 2020), etc. There are two lines of solutions for exploring the knowledge transferability among different graphs. One line of work (Fang et al. 2015; Wu et al. 2020; Zhang et al. 2019b; Dai et al. 2022) extracts signature subgraphs or consistent aggregation patterns from source and target graphs without a theoretical explanation. The other line (Hu et al. 2020a,b; Qiu et al. 2020; Han et al. 2021) first pre-trains graph neural networks on a large source graph to encode general graph structures, and then fine-tunes them on the target graph to extract task-specific information. This might lead to a sub-optimal solution for unsupervised cross-network node classification tasks where no labeled target nodes are available for fine-tuning.

Preliminaries

Notation

Suppose that a graph is represented as $G = (V, E)$, where $V = \{v_1, \dots, v_n\}$ is the set of $n$ nodes and $E \subseteq V \times V$ is the edge set of the graph. In this paper, we consider attributed graphs; that is, each node is associated with a $D$-dimensional feature vector $x_v \in \mathbb{R}^D$. In the node classification task, each node is also associated with a class label $y_v \in \{1, \dots, C\}$, where $C$ is the total number of classes. The graph $G$ can also be represented by an adjacency matrix $A \in \mathbb{R}^{n \times n}$, where $A_{ij}$ is the similarity between $v_i$ and $v_j$ on the graph. In the context of cross-network mining, we denote by $G^s = (V^s, E^s, X^s)$ and $G^t = (V^t, E^t, X^t)$ the source and target graphs, respectively. The associated adjacency matrices of the source and target graphs are denoted by $A^s$ and $A^t$, respectively.
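To make the notation above concrete, the following minimal Python sketch packages a graph as the triple of adjacency matrix $A$, attribute matrix $X$, and optional labels $y$ used throughout the paper; the class name and the random toy graphs are illustrative assumptions only, not part of the GRADE implementation.

```python
import numpy as np

# A minimal container matching the Notation section: n nodes, an n x n adjacency
# matrix A, D-dimensional node attributes X, and (optionally) node labels y.
# The class name and toy data below are illustrative, not part of GRADE.
class AttributedGraph:
    def __init__(self, adj, features, labels=None):
        self.adj = np.asarray(adj, dtype=float)            # A in R^{n x n}
        self.features = np.asarray(features, dtype=float)  # X in R^{n x D}
        self.labels = labels                                # y_v in {1, ..., C}, or None
        assert self.adj.shape[0] == self.adj.shape[1] == self.features.shape[0]

    @property
    def num_nodes(self):
        return self.adj.shape[0]

# Cross-network setting: a fully labeled source graph G^s and an unlabeled target graph G^t
# that share the same 3-dimensional attribute space but may differ in structure and attributes.
rng = np.random.default_rng(0)
As = np.triu((rng.random((6, 6)) > 0.6).astype(float), k=1); As += As.T
At = np.triu((rng.random((5, 5)) > 0.6).astype(float), k=1); At += At.T
Gs = AttributedGraph(As, rng.random((6, 3)), labels=rng.integers(0, 2, size=6))
Gt = AttributedGraph(At, rng.random((5, 3)))               # no labels in the target graph
```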
Problem Setting

Following (Wu et al. 2020; Zhu et al. 2021), we formally define the cross-network transfer learning problem as follows.

Definition 1. (Cross-Network Transfer Learning) Given a source graph $G^s$ and a target graph $G^t$, cross-network transfer learning aims to improve the prediction performance of node classification or link prediction in the target graph by using knowledge from the source graph, under the assumption that the source and target graphs are related.

As illustrated in Figure 2, the distribution shift across graphs can be induced by both node attributes² and graph structure. Compared to standard transfer learning (Ben-David et al. 2010), the additional distribution shift over the complex graph structure leads to a much more challenging cross-network transfer learning problem setting.

²In this paper, we consider that source and target graphs share the same node attribute space, but they might have different node attribute distributions.

Theoretical Results

In this section, we propose a novel Graph Subtree Discrepancy (GSD) to measure the distribution shift across graphs.

Connection of WL Kernels and GNNs

The Weisfeiler-Lehman graph subtree kernel (Shervashidze et al. 2011) aims to measure the semantic similarity of a pair of input graphs. It learns a sequence of Weisfeiler-Lehman subgraphs for an input graph $G$: $\{G_0, G_1, \dots, G_m, \dots\} = \{(V, E, f_0), (V, E, f_1), \dots, (V, E, f_m), \dots\}$, where $G_0 = G$ and $f_0(v)$ denotes the raw node attributes of $v$ for any $v \in G$. The relabeling function $f_j$ ($j = 1, \dots, m, \dots$) aims to map the subtree structure rooted at $v \in G$ into a new representation at every iteration (see Definition 2). Then, the structural information around node $v$ can be represented as a sequence of Weisfeiler-Lehman subtrees $\{f_0(v), f_1(v), \dots, f_m(v), \dots\}$ with different depths $m$.

Definition 2. (Weisfeiler-Lehman Subtree (Shervashidze et al. 2011)) Given a graph $G = (V, E)$ associated with initial node attributes $f_0(v)$ for $v \in V$, the Weisfeiler-Lehman subtree of depth $m$ rooted at $v \in V$ can be represented as
$$f_m(v) := f_m\big(f_{m-1}(v); \{f_{m-1}(u)\}_{u \in N(v)}\big)$$
where $N(v)$ denotes the nearest neighbors of the root node $v$.

Note that in the original work (Shervashidze et al. 2011), the relabeling function of the WL subtree is simply given by a hashing table due to the discrete node attributes in the graph. Later, it was revealed (Hamilton, Ying, and Leskovec 2017; Wu, He, and Xu 2019; Geerts, Mazowiecki, and Perez 2021) that this WL subtree can actually recover the crucial message-passing module in many popular message-passing GNNs, where the relabeling function $f_m(\cdot)$ is instantiated by deep neural networks for learning continuous node representations.
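As a concrete illustration of Definition 2, the short sketch below runs the discrete Weisfeiler-Lehman relabeling with a hash function standing in for the relabeling function $f_m$, assuming hashable (e.g., categorical) node attributes; the function names and the toy path graph are ours, not from the paper.

```python
import hashlib

# A minimal sketch of the discrete WL relabeling in Definition 2: each iteration hashes
# the root's previous label together with the sorted multiset of its neighbors' labels,
# i.e., f_m(v) := f_m(f_{m-1}(v); {f_{m-1}(u)}_{u in N(v)}). Names are illustrative only.
def wl_relabel(neighbors, init_labels, depth):
    """neighbors: dict node -> list of neighbor nodes; init_labels: dict node -> f_0(v)."""
    labels = dict(init_labels)                 # f_0(v): raw (discrete) node attributes
    history = [dict(labels)]
    for _ in range(depth):
        new_labels = {}
        for v, nbrs in neighbors.items():
            signature = str(labels[v]) + "|" + ",".join(sorted(str(labels[u]) for u in nbrs))
            new_labels[v] = hashlib.md5(signature.encode()).hexdigest()[:8]
        labels = new_labels
        history.append(dict(labels))
    return history                             # [f_0, f_1, ..., f_depth] for every node

# Toy usage: a 4-node path graph with attributes a-b-b-a. Nodes 0 and 3 (and likewise 1
# and 2) are structurally equivalent here, so they keep identical labels at every depth.
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wl_relabel(nbrs, {0: "a", 1: "b", 2: "b", 3: "a"}, depth=2))
```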
We observe that for single-graph mining tasks, e.g., node classification and link prediction, only the message-passing philosophy of the Weisfeiler-Lehman subtree is exploited to design graph neural networks (Hamilton, Ying, and Leskovec 2017; Kipf and Welling 2017). It maps the structurally equivalent nodes within one graph to the same low-dimensional representation in a latent feature space. However, in the context of cross-network transfer learning, we highlight that the following WL subtree kernel sheds light on measuring the distribution shift between source and target graphs in the feature space learned by GNNs.

Definition 3. (Weisfeiler-Lehman Subtree Kernel (Shervashidze et al. 2011)) Given any two graphs $G = (V, E)$ with $n$ nodes and $G' = (V', E')$ with $n'$ nodes, the Weisfeiler-Lehman subtree kernel on $G$ and $G'$ with $M$ iterations is defined as:
$$k_M(G, G') = \frac{1}{n n'} \sum_{m=0}^{M} \sum_{v \in G} \sum_{v' \in G'} \delta\big(f_m(v), f_m(v')\big)$$
where $\delta(\cdot, \cdot)$ is the Dirac kernel, that is, it is 1 when its arguments are equal and 0 otherwise, and $f_m(v)$ represents the subtree pattern of depth $m$ rooted at $v$.

The WL subtree kernel on graphs $G$ and $G'$ can be rewritten as
$$k_M(G, G') = \sum_{m=0}^{M} s\big(\hat{P}(G_m), \hat{Q}(G'_m)\big)$$
where $\hat{P}$ (or $\hat{Q}$) is the empirical node distribution of graph $G$ (or $G'$), i.e., $\hat{P}(\tau \mid G_m) = \frac{1}{n} \sum_{i=1}^{n} \delta(f_m(v_i), \tau)$ for any subtree pattern $\tau$. Here $s(\cdot, \cdot)$ is an inner-product metric measuring the distribution similarity of $\hat{P}(G_m)$ and $\hat{Q}(G'_m)$, i.e., $s(\hat{P}(G_m), \hat{Q}(G'_m)) = \big\langle (\hat{P}(\tau_1 \mid G_m), \dots, \hat{P}(\tau_k \mid G_m), \dots),\; (\hat{Q}(\tau_1 \mid G'_m), \dots, \hat{Q}(\tau_k \mid G'_m), \dots) \big\rangle$.

Furthermore, we have the following observations. (1) This nonparametric graph kernel (Shervashidze et al. 2011) can be exploited to measure the distribution shift between source and target graphs by counting the occurrences of subtrees when the node attributes are discrete. However, it might suffer when continuous node attributes are used to estimate the graph similarity (or the distribution discrepancy in the context of cross-network transfer learning). (2) Previous works (Wu et al. 2020; Zhu et al. 2021; Dai et al. 2022) focus on characterizing the distribution discrepancy over only the $m$-th subtree representation. They can thereby only partially reveal the distribution discrepancy between source and target graphs in practice.

Graph Subtree Discrepancy

Inspired by the connection between the WL subtree kernel and message-passing GNNs, we propose a parametric Graph Subtree Discrepancy (GSD). GSD measures the distribution discrepancy of graphs in the latent feature space induced by message-passing GNNs. Following the WL subtree kernel (Shervashidze et al. 2011), we assume that given a graph $G = (V, E)$, the WL subtrees (i.e., $\{f_m(v) \mid v \in V\}$) of a fixed depth $m$ are conditionally independent given the WL subgraph $G_{m-1}$ at depth $m-1$, i.e., $f_m(u) \perp f_m(v) \mid G_{m-1}$. In this case, given the WL subgraph $G_{m-1}$, the subtree representations $\{f_m(v) \mid v \in V\}$ can thus be considered as IID samples. This tells us that the subtree (at depth $m$) distribution shift between source and target graphs can be measured by any existing distribution discrepancy measure, e.g., JS-divergence (Ben-David et al. 2010; Ganin et al. 2016), discrepancy distance (Mansour, Mohri, and Rostamizadeh 2009), Maximum Mean Discrepancy (Gretton et al. 2012), and f-divergence (Acuna et al. 2021). Therefore, our Graph Subtree Discrepancy can be formally defined as follows.

Definition 4. (Graph Subtree Discrepancy) Given two graphs $G^s = (V^s, E^s)$ and $G^t = (V^t, E^t)$, the graph subtree discrepancy between them is defined as:
$$d_{\mathrm{GSD}}(G^s, G^t) = \lim_{M \to \infty} \frac{1}{M+1} \sum_{m=0}^{M} d_b(G^s_m, G^t_m) \quad (1)$$
where $d_b(\cdot, \cdot)$ is the base domain discrepancy. For example, we can use the discrepancy distance (Mansour, Mohri, and Rostamizadeh 2009) to instantiate the base domain discrepancy $d_b(\cdot, \cdot)$, which is defined as
$$d_b(G^s_m, G^t_m) = \sup_{h, h' \in \mathcal{H}} \Big| \mathbb{E}_{v \in V^s}\big[L\big(h(f_m(v)), h'(f_m(v))\big)\big] - \mathbb{E}_{v \in V^t}\big[L\big(h(f_m(v)), h'(f_m(v))\big)\big] \Big| \quad (2)$$

We see that GSD recursively estimates the subtree distribution discrepancy between source and target graphs at different depths. Here the subtree representations can be learned by existing message-passing GNNs (Hamilton, Ying, and Leskovec 2017; Veličković et al. 2018; Xu et al. 2019).
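To make Definition 4 concrete, the following sketch estimates the GSD of Eq. (1) over a finite number of depths from pre-computed per-depth subtree representations, using an RBF-kernel Maximum Mean Discrepancy (one of the admissible base discrepancies listed above) in place of $d_b$; the tensor shapes, kernel bandwidth, and function names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

# Biased RBF-kernel MMD between two samples, used here as one possible base discrepancy d_b.
def mmd_rbf(xs, xt, sigma=1.0):
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return gram(xs, xs).mean() + gram(xt, xt).mean() - 2 * gram(xs, xt).mean()

# Eq. (1) truncated at depth M: average the per-depth discrepancies d_b(G^s_m, G^t_m).
def graph_subtree_discrepancy(src_layers, tgt_layers):
    """src_layers / tgt_layers: lists whose m-th entry holds the depth-m subtree
    representations f_m(v) of all source / target nodes (tensors of shape [n, d_m])."""
    assert len(src_layers) == len(tgt_layers)
    terms = [mmd_rbf(s, t) for s, t in zip(src_layers, tgt_layers)]
    return sum(terms) / len(terms)

# Toy usage with random depth-0..2 representations of 30 source and 40 target nodes.
torch.manual_seed(0)
src = [torch.randn(30, 16) for _ in range(3)]
tgt = [torch.randn(40, 16) + 0.5 for _ in range(3)]   # shifted target distribution
print(float(graph_subtree_discrepancy(src, tgt)))
```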
Error Bounds

Next, we derive the error bounds for cross-network transfer learning based on GSD. Let $\mathcal{H}$ be the hypothesis space. For any hypothesis $h \in \mathcal{H}$, the node classification error on graph $G$ of a message-passing GNN (with $L$ convolutional layers) can be defined as $\epsilon(h \circ f) = \mathbb{E}_{v \in G}[L(h(f(v)), y)]$, where $f(\cdot)$ extracts the node representation and $L(\cdot, \cdot)$ is the loss function. Without loss of generality, we focus on the commonly used GNNs with the feature extraction $f(v) = f_L(v)$ (using only the output of the final graph convolutional layer (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Veličković et al. 2018)). The following theorem shows that for cross-network node classification, the expected error in the target graph can be bounded in terms of the classification error in the source graph and the distribution discrepancy across graphs.

Theorem 1. (Cross-Network Node Classification) Assume that there are a source graph $G^s$ and a target graph $G^t$, and that the base domain discrepancy $d_b(\cdot, \cdot)$ of GSD is instantiated by the discrepancy distance (see Eq. (2)). Given a message-passing GNN with the feature extractor $f$ and a hypothesis $h \in \mathcal{H}$, the node classification error in the target graph can be bounded as follows.
$$\epsilon_t(h \circ f) \le \epsilon_s(h \circ f) + d_{\mathrm{GSD}}(G^s, G^t) + \lambda^* + R^*$$
where $\lambda^* = \mathbb{E}_{v \in V^t}[L(h^*_s(f(v)), h^*_t(f(v)))]$ measures the prediction difference of the optimal source and target hypotheses on the target nodes, and $R^* = \mathbb{E}_{v \in V^s}[L(y, h^*_s(f(v)))] + \mathbb{E}_{v \in V^t}[L(h^*_t(f(v)), y)]$ is the Bayes error on the source and target graphs. Here $y$ is the class label of $v$, and $h^*_s \in \arg\min_{h \in \mathcal{H}} \mathbb{E}_{v \in V^s}[L(h(f(v)), y)]$ and $h^*_t \in \arg\min_{h \in \mathcal{H}} \mathbb{E}_{v \in V^t}[L(h(f(v)), y)]$ are the optimal source and target hypotheses, respectively.

Compared to previous work (Zhu et al. 2021), the error bound of Theorem 1 is hypothesis-dependent. That is, the knowledge transferability can be enhanced if the message-passing GNN learns a latent feature space in which the subtree distribution shift between source and target graphs is minimized. This is in sharp contrast to previous work, which focuses on evaluating the transferability of an already trained GNN model. Therefore, Theorem 1 provides insights on designing practical cross-network transfer learning algorithms by minimizing the error upper bound.

We have similar results for cross-network link prediction. In this task, the label space is $\mathcal{Y} = \{0, 1\}$, where $y = 1$ if the link between a pair of nodes exists and $y = 0$ otherwise. In this case, the loss of intra-graph link prediction is defined as $\epsilon^{\mathrm{link}}(h \circ f) = \mathbb{E}_{(u,v) \in V \times V}[L(h([f_L(u) \,\|\, f_L(v)]), y)]$, where $[\cdot \,\|\, \cdot]$ denotes vector concatenation.

Theorem 2. (Cross-Network Link Prediction) With the assumptions of Theorem 1, let $L(y, y') = |y - y'|$ and let the hypothesis class $\mathcal{H}$ be given by multi-layer perceptrons. If the loss of link prediction is defined as $\epsilon^{\mathrm{link}}(h \circ f) = \mathbb{E}_{(u,v) \in V \times V}[L(h([f_L(u) \,\|\, f_L(v)]), y)]$, then the link prediction error in the target graph can be bounded as follows.
$$\epsilon^{\mathrm{link}}_t(h \circ f) \le \epsilon^{\mathrm{link}}_s(h \circ f) + d_{\mathrm{GSD}}(G^s, G^t) + \lambda^*_{\mathrm{link}} + R^*_{\mathrm{link}}$$
where $\lambda^*_{\mathrm{link}} = \mathbb{E}_{(u,v) \in V^t \times V^t}[L(h^*_s([f(u) \,\|\, f(v)]), h^*_t([f(u) \,\|\, f(v)]))]$ measures the difference of the optimal source and target hypotheses on the target graph, and $R^*_{\mathrm{link}} = \mathbb{E}_{(u,v) \in V^s \times V^s}[L(y, h^*_s([f(u) \,\|\, f(v)]))] + \mathbb{E}_{(u,v) \in V^t \times V^t}[L(h^*_t([f(u) \,\|\, f(v)]), y)]$ is the Bayes error. Here $h^*_s \in \arg\min_{h \in \mathcal{H}} \mathbb{E}_{(u,v) \in V^s \times V^s}[L(h([f(u) \,\|\, f(v)]), y)]$ and $h^*_t \in \arg\min_{h \in \mathcal{H}} \mathbb{E}_{(u,v) \in V^t \times V^t}[L(h([f(u) \,\|\, f(v)]), y)]$ are the optimal source and target hypotheses, respectively.

Proposed Framework

In this section, we propose a cross-network transfer learning framework named Graph Adaptive Network (GRADE).
Objective Function

The objective function of the generic cross-network transfer learning framework (GRADE) is summarized as follows.
$$\min_{\theta} \; \mathcal{C}(G^s; \theta) + \lambda \cdot d_{\mathrm{GSD}}(G^s, G^t; \theta) \quad (3)$$
where $\theta$ denotes all the trainable parameters, $\mathcal{C}(G^s; \theta)$ is the task-specific loss function on the source graph, and $d_{\mathrm{GSD}}(G^s, G^t; \theta)$ is the distribution discrepancy between source and target graphs to be minimized. $\lambda \ge 0$ is a hyper-parameter balancing these terms. Note that $\mathcal{C}(G^s; \theta)$ might also contain a task-specific loss on the target graph if label information is partially available in the target domain.

Algorithms

Following the framework of Eq. (3), we present the instantiated algorithms for two cross-network mining tasks, namely cross-network node classification (GRADE-N) and cross-domain recommendation (GRADE-R).

Cross-Network Node Classification We focus on the cross-network node classification setting from a source graph with labeled nodes to a target graph with only unlabeled nodes. The goal is to identify the class label of every node in the target domain by leveraging the relevant knowledge from the source domain. The objective function of GRADE-N can be instantiated as follows.
$$\min_{\theta_f, \theta_h} \; L\big(h(f(G^s; \theta_f); \theta_h), Y^s\big) + \lambda \cdot d_{\mathrm{GSD}}\big(f(G^s; \theta_f), f(G^t; \theta_f)\big) \quad (4)$$
where $f(\cdot)$ is the message-passing graph neural network parameterized by $\theta_f$, and $h(\cdot)$ is the classifier (an MLP in our experiments) parameterized by $\theta_h$. $L(\cdot, \cdot)$ is the cross-entropy loss for cross-network node classification in the experiments (a mean squared error loss can be applied for regression tasks). Specifically, we adopt the Graph Convolutional Network (GCN) (Kipf and Welling 2017) as the base model to extract the subtree representations of a graph. Then, the subtree pattern of depth $m$ rooted at $v$ can be represented as follows.
$$f_m(v) = \sigma\Big(\sum_{u \in \{v\} \cup N(v)} \hat{a}_{vu} W_m f_{m-1}(u)\Big) \quad (5)$$
where $\hat{A} = (\hat{a}_{vu}) \in \mathbb{R}^{n \times n}$ ($n$ is the number of nodes) is the re-normalization of the adjacency matrix $A$ with added self-loops, $W_m$ is the trainable weight matrix at the $m$-th layer, and $\sigma(\cdot)$ is a non-linear activation function. After $M$ iterations, we obtain the sequence of subtree representations rooted at $v$: $f_0(v), f_1(v), \dots, f_M(v)$, where $f_0(v)$ is the raw node attribute vector of $v$. Following (Kipf and Welling 2017), the final representation $f_M(v)$ is used to identify the class label of node $v$. In addition, we consider a finite number of iterations (e.g., $M$) of the message-passing graph neural network for estimating GSD. That is,
$$d_{\mathrm{GSD}}\big(f(G^s; \theta_f), f(G^t; \theta_f)\big) = \frac{1}{M+1} \sum_{m=0}^{M} d_b(G^s_m, G^t_m) \quad (6)$$

Cross-Domain Recommendation Cross-domain recommendation learns the user preference in the target domain by leveraging the rich information from a relevant source domain. The objective function of GRADE-R can be instantiated as follows.
$$\min_{\theta_f, \theta_{h'}} \; L\big(h'(f(G^s; \theta_f); \theta_{h'})\big) + L\big(h'(f(G^t; \theta_f); \theta_{h'})\big) + \lambda \cdot d_{\mathrm{GSD}}\big(f(G^s; \theta_f), f(G^t; \theta_f)\big) \quad (7)$$
where $f(\cdot)$ is the message-passing graph neural network parameterized by $\theta_f$, and $h'(\cdot)$ is the link prediction function parameterized by $\theta_{h'}$. More specifically, we adopt GCN (see Eq. (5)) as the base model $f(\cdot)$ to extract the subtree representations of a graph. The graph subtree discrepancy $d_{\mathrm{GSD}}(\cdot, \cdot)$ is again given by Eq. (6) over those subtree representations. In addition, following (He et al. 2017; Zhao, Li, and Fu 2019), we adopt a multi-layer perceptron as the link prediction function $h'(\cdot)$ to infer whether a link between two nodes exists. That is,
$$h'\big((u, v), y_{uv}; \theta_{h'}\big) = \mathrm{BCE}\big(\mathrm{MLP}_{\theta_{h'}}([f(u) \,\|\, f(v)]),\; y_{uv}\big)$$
where $y_{uv}$ is the link label (i.e., $y_{uv} = 1$ if $u$ and $v$ are linked, and $y_{uv} = 0$ otherwise) for any $u, v \in V^s$ or $u, v \in V^t$. Here $\mathrm{BCE}(\cdot)$ denotes the binary cross-entropy, and $\mathrm{MLP}_{\theta_{h'}}(\cdot)$ is a multi-layer perceptron.
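Before turning to the experiments, the sketch below strings Eqs. (3)-(6) together into one GRADE-N training loop: a two-layer GCN produces the depth-0..M subtree representations, the source graph contributes a cross-entropy loss, and a simple mean-embedding (linear-kernel MMD) distance stands in for the base discrepancy $d_b$. The layer sizes, the discrepancy choice, the toy random graphs, and all function names are illustrative assumptions; only the overall structure of the objective follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(adj):
    # \hat{A}: re-normalized adjacency with added self-loops, as in GCN (Kipf and Welling 2017).
    a = adj + torch.eye(adj.size(0))
    d = a.sum(1).pow(-0.5)
    return d.unsqueeze(1) * a * d.unsqueeze(0)

class GCNEncoder(nn.Module):
    """Two-layer GCN returning the full sequence of subtree representations f_0, ..., f_M."""
    def __init__(self, in_dim, hid_dim, layers=2):
        super().__init__()
        dims = [in_dim] + [hid_dim] * layers
        self.weights = nn.ModuleList([nn.Linear(dims[i], dims[i + 1], bias=False)
                                      for i in range(layers)])

    def forward(self, x, adj_norm):
        reps = [x]                                   # f_0(v): raw node attributes
        h = x
        for w in self.weights:
            h = F.relu(adj_norm @ w(h))              # Eq. (5): one message-passing step
            reps.append(h)
        return reps

def gsd_estimate(src_layers, tgt_layers):
    # Eq. (6) with a linear-kernel MMD (squared distance of mean embeddings) as d_b.
    gaps = [(s.mean(0) - t.mean(0)).pow(2).sum() for s, t in zip(src_layers, tgt_layers)]
    return sum(gaps) / len(gaps)

# Toy labeled source graph (40 nodes) and unlabeled target graph (30 nodes), 3-dim attributes.
torch.manual_seed(0)
xs, xt = torch.rand(40, 3), torch.rand(30, 3)
adj_s = normalize_adj((torch.rand(40, 40) > 0.8).float())
adj_t = normalize_adj((torch.rand(30, 30) > 0.8).float())
ys = torch.randint(0, 2, (40,))                      # source node labels only

encoder, clf = GCNEncoder(3, 16), nn.Linear(16, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(clf.parameters()), lr=1e-2)
lam = 0.02                                           # lambda used for node classification

for epoch in range(50):
    opt.zero_grad()
    src_reps, tgt_reps = encoder(xs, adj_s), encoder(xt, adj_t)
    loss = F.cross_entropy(clf(src_reps[-1]), ys)           # source task loss (Eq. (4), first term)
    loss = loss + lam * gsd_estimate(src_reps, tgt_reps)    # GSD regularizer (Eq. (4), second term)
    loss.backward()
    opt.step()
```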
Experiment

Experimental Setup

Data Sets For cross-network node classification, we use the following data sets: Airport networks (Ribeiro, Saverese, and Figueiredo 2017) (Brazil, USA, and Europe); Citation networks (Wu et al. 2020; Tang et al. 2008) (ACMv9 (A) and DBLPv8 (D)); Social networks (Shen et al. 2020; Li et al. 2015) (Blog1 (B1) and Blog2 (B2)); and Agriculture data (Wang et al. 2021) (Maize (M) and Maize UNL (MU)). For cross-domain recommendation, we evaluate the models on the Amazon data set (He and McAuley 2016). We adopt two pairs of real-world cross-domain data sets from Amazon-5cores: CD and Music, and Book and Movie. Note that most existing cross-domain recommendation algorithms (Hu, Zhang, and Yang 2018; Zhang et al. 2020) assume that source and target domains have the same group of users. To validate the effectiveness of our proposed approach, we consider two scenarios: (1) Overlapping users: following (Hu, Zhang, and Yang 2018), source and target domains have the same group of users; (2) Disjoint users: the users of the source and target domains are different.

Baselines For cross-network node classification, we use the following baselines: (1) Source only: GCN (Kipf and Welling 2017), SGC (Wu et al. 2019), GCNII (Chen et al. 2020); (2) Feature-only adaptation: DAN (Long et al. 2015), DANN (Ganin et al. 2016), MDD (Zhang et al. 2019a); (3) Cross-network adaptation: AdaGCN (Dai et al. 2022), UDA-GCN (Wu et al. 2020), EGI (Zhu et al. 2021). For cross-domain recommendation, we use the following baselines: (1) Single-domain recommendation: BPR (Rendle et al. 2009), NeuMF (He et al. 2017); (2) Cross-domain recommendation: CoNet (Hu, Zhang, and Yang 2018), PPGN (Zhao, Li, and Fu 2019), CGN (Zhang et al. 2020), EGI (Zhu et al. 2021).
Model Configuration We adopt two hidden layers in the base GCN model when implementing the GRADE³ algorithms. We set λ = 0.02 for cross-network node classification and λ = 0.1 for cross-domain recommendation.

³https://github.com/jwu4sml/GRADE

Cross-Network Node Classification

Table 1 and Table 2 provide the cross-network node classification results of GRADE-N. Here we report the classification accuracy, i.e., mean and standard deviation over 5 runs. We observe that (1) the cross-network node classification algorithms outperform the vanilla graph neural networks and the feature-only adaptation methods, and (2) in most cases, the proposed GRADE-N algorithm improves the node classification performance (up to 13%) over the baselines.

| Methods | USA→Brazil | USA→Europe | Brazil→USA | Brazil→Europe | Europe→USA | Europe→Brazil | Avg. |
|---|---|---|---|---|---|---|---|
| GCN | 0.366±0.011 | 0.371±0.004 | 0.491±0.011 | 0.452±0.012 | 0.439±0.001 | 0.298±0.022 | 0.403 |
| SGC | 0.527±0.022 | 0.430±0.009 | 0.432±0.005 | 0.479±0.000 | 0.447±0.002 | 0.481±0.011 | 0.466 |
| GCNII | 0.344±0.086 | 0.393±0.025 | 0.470±0.056 | 0.494±0.018 | 0.460±0.012 | 0.542±0.011 | 0.450 |
| DAN | 0.504±0.020 | 0.393±0.000 | 0.436±0.006 | 0.393±0.010 | 0.436±0.003 | 0.542±0.000 | 0.451 |
| DANN | 0.500±0.005 | 0.386±0.011 | 0.402±0.048 | 0.350±0.062 | 0.436±0.000 | 0.538±0.005 | 0.435 |
| MDD | 0.500±0.005 | 0.378±0.000 | 0.402±0.048 | 0.350±0.062 | 0.402±0.048 | 0.477±0.081 | 0.418 |
| AdaGCN | 0.466±0.065 | 0.434±0.004 | 0.501±0.003 | 0.486±0.021 | 0.456±0.034 | 0.561±0.081 | 0.484 |
| UDA-GCN | 0.607±0.059 | 0.388±0.007 | 0.497±0.005 | 0.510±0.019 | 0.434±0.042 | 0.477±0.024 | 0.486 |
| EGI | 0.523±0.013 | 0.451±0.011 | 0.417±0.021 | 0.454±0.046 | 0.452±0.029 | 0.588±0.011 | 0.481 |
| GRADE-N | 0.550±0.062 | 0.457±0.027 | 0.497±0.010 | 0.506±0.004 | 0.463±0.001 | 0.588±0.032 | 0.510 |

Table 1: Cross-network node classification on the Airport networks

| Methods | A→D | D→A | B1→B2 | B2→B1 |
|---|---|---|---|---|
| GCN | 0.44±0.01 | 0.57±0.05 | 0.41±0.03 | 0.45±0.04 |
| SGC | 0.43±0.00 | 0.61±0.00 | 0.33±0.11 | 0.27±0.06 |
| GCNII | 0.47±0.00 | 0.56±0.01 | 0.39±0.02 | 0.47±0.06 |
| DAN | 0.34±0.01 | 0.42±0.06 | 0.41±0.02 | 0.42±0.02 |
| DANN | 0.37±0.02 | 0.38±0.01 | 0.41±0.02 | 0.42±0.02 |
| MDD | 0.35±0.03 | 0.39±0.03 | 0.39±0.01 | 0.42±0.02 |
| AdaGCN | 0.45±0.01 | 0.57±0.04 | 0.50±0.06 | 0.52±0.03 |
| UDA-GCN | 0.52±0.03 | 0.60±0.01 | 0.47±0.01 | 0.47±0.01 |
| EGI | 0.49±0.04 | 0.40±0.01 | 0.49±0.03 | 0.52±0.01 |
| GRADE-N | 0.48±0.01 | 0.64±0.01 | 0.57±0.04 | 0.54±0.01 |

Table 2: Cross-network node classification on the citation and social networks

In addition, we investigate the effectiveness of GRADE-N on a regression task. In this case, we use the mean squared error (MSE) as the loss function of Eq. (4). Table 3 provides the results of GRADE-N on the agriculture data set. Here we use two regression evaluation metrics: mean absolute error (MAE) and R². Since MDD (Zhang et al. 2019a) focuses only on the classification setting, we use another state-of-the-art adaptation regression baseline, RSD (Chen et al. 2021), in our experiments. It is observed that our proposed GRADE-N outperforms the state-of-the-art baselines in terms of both MAE (lower is better) and R² (higher is better).

| Methods | M→MU MAE | M→MU R² | MU→M MAE | MU→M R² |
|---|---|---|---|---|
| GCN | 0.49±0.01 | 0.13±0.02 | 0.68±0.03 | 0.30±0.05 |
| GCNII | 0.47±0.02 | 0.19±0.06 | 0.69±0.04 | 0.25±0.02 |
| DANN | 0.49±0.00 | 0.10±0.01 | 0.72±0.03 | 0.19±0.07 |
| RSD | 0.52±0.01 | 0.02±0.06 | 0.73±0.01 | 0.17±0.05 |
| AdaGCN | 0.45±0.06 | 0.25±0.08 | 0.65±0.02 | 0.35±0.04 |
| UDA-GCN | 0.45±0.07 | 0.26±0.09 | 0.68±0.01 | 0.27±0.00 |
| GRADE-N | 0.35±0.04 | 0.53±0.09 | 0.65±0.01 | 0.35±0.03 |

Table 3: Plant phenotyping on the agriculture data set

Cross-Domain Recommendation

Results on overlapping users Table 4 provides the cross-domain recommendation results on the Amazon data set. Here we use three recommendation metrics to evaluate our algorithms: hit ratio (HR@k), mean reciprocal rank (MRR@k), and normalized discounted cumulative gain (NDCG@k), where k = 10.
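For completeness, a minimal sketch of the three ranking metrics just defined is given below, assuming the common leave-one-out evaluation protocol in which each test user has a single held-out positive item whose rank is computed against sampled negatives; the function names and toy ranks are ours, and the exact evaluation protocol of the baselines may differ.

```python
import numpy as np

# rank: 1-based position of the held-out positive item in the top-k list (None if absent).
def hit_ratio(rank, k=10):
    return 1.0 if rank is not None and rank <= k else 0.0

def mrr(rank, k=10):
    return 1.0 / rank if rank is not None and rank <= k else 0.0

def ndcg(rank, k=10):
    # With a single relevant item, the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1).
    return 1.0 / np.log2(rank + 1) if rank is not None and rank <= k else 0.0

ranks = [1, 4, None, 8]     # toy ranks of the held-out item for four test users
print(np.mean([hit_ratio(r) for r in ranks]),
      np.mean([mrr(r) for r in ranks]),
      np.mean([ndcg(r) for r in ranks]))
```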
Following (Hu, Zhang, and Yang 2018; Zhang et al. 2020), we only consider the users shared by both source and target domains. We have the following observations. (1) Single-domain recommendation methods study the user preference in the target domain from limited observed user-item interactions; they have inferior performance due to network sparsity. (2) Cross-domain recommendation methods improve the model performance by leveraging the user preference information from the source domain. (3) The proposed GRADE-R outperforms the state-of-the-art cross-domain recommendation baselines.

| Methods | CD→Music HR@10 | CD→Music MRR@10 | CD→Music NDCG@10 | Music→CD HR@10 | Music→CD MRR@10 | Music→CD NDCG@10 | Book→Movie HR@10 | Book→Movie MRR@10 | Book→Movie NDCG@10 |
|---|---|---|---|---|---|---|---|---|---|
| BPRMF | 0.18±0.00 | 0.06±0.00 | 0.09±0.00 | 0.26±0.01 | 0.10±0.01 | 0.13±0.01 | 0.20±0.00 | 0.07±0.00 | 0.10±0.00 |
| NeuMF | 0.27±0.01 | 0.10±0.01 | 0.15±0.01 | 0.33±0.01 | 0.11±0.02 | 0.16±0.01 | 0.29±0.02 | 0.10±0.01 | 0.15±0.01 |
| CoNet | 0.41±0.02 | 0.16±0.00 | 0.21±0.01 | 0.33±0.01 | 0.12±0.02 | 0.17±0.02 | 0.32±0.02 | 0.12±0.02 | 0.16±0.02 |
| CGN | 0.36±0.02 | 0.12±0.02 | 0.18±0.02 | 0.48±0.04 | 0.19±0.02 | 0.26±0.03 | 0.36±0.03 | 0.14±0.02 | 0.19±0.02 |
| PPGN | 0.42±0.02 | 0.18±0.01 | 0.23±0.01 | 0.56±0.04 | 0.28±0.03 | 0.34±0.04 | 0.49±0.01 | 0.24±0.01 | 0.29±0.01 |
| EGI | 0.45±0.01 | 0.20±0.00 | 0.25±0.01 | 0.60±0.01 | 0.27±0.02 | 0.34±0.01 | 0.46±0.02 | 0.22±0.02 | 0.27±0.01 |
| GRADE-R | 0.45±0.01 | 0.20±0.00 | 0.25±0.00 | 0.60±0.01 | 0.31±0.01 | 0.37±0.01 | 0.51±0.02 | 0.25±0.00 | 0.30±0.01 |

Table 4: Cross-domain recommendation on the Amazon data set with overlapping users

Results on disjoint users Table 5 provides the results when the users of the source and target domains are disjoint. Most existing cross-domain recommendation approaches (Hu, Zhang, and Yang 2018; Zhang et al. 2020) cannot be applied to this scenario, because they require the shared users to explore common knowledge across domains. Thus, we only consider the single-domain recommendation baselines BPRMF (Rendle et al. 2009) and NeuMF (He et al. 2017), and the cross-domain recommendation baseline EGI (Zhu et al. 2021). In addition, we also extend NeuMF to the cross-domain recommendation scenario by incorporating the recommendation loss in the source domain (denoted as NeuMF(S+T)). It is observed that the proposed GRADE-R outperforms the baselines by a large margin (up to 18% on HR@10) when the users are disjoint.

| Methods | CD→Music HR@10 | CD→Music MRR@10 | CD→Music NDCG@10 | Music→CD HR@10 | Music→CD MRR@10 | Music→CD NDCG@10 | Book→Movie HR@10 | Book→Movie MRR@10 | Book→Movie NDCG@10 |
|---|---|---|---|---|---|---|---|---|---|
| BPRMF | 0.17±0.01 | 0.06±0.00 | 0.08±0.00 | 0.12±0.00 | 0.04±0.00 | 0.06±0.00 | 0.17±0.00 | 0.06±0.00 | 0.08±0.00 |
| NeuMF | 0.28±0.02 | 0.10±0.01 | 0.14±0.01 | 0.13±0.01 | 0.04±0.00 | 0.06±0.00 | 0.23±0.01 | 0.08±0.00 | 0.12±0.00 |
| NeuMF(S+T) | 0.31±0.01 | 0.12±0.01 | 0.17±0.01 | 0.16±0.02 | 0.06±0.01 | 0.08±0.01 | 0.27±0.02 | 0.11±0.01 | 0.15±0.01 |
| EGI | 0.42±0.01 | 0.17±0.03 | 0.22±0.02 | 0.23±0.04 | 0.10±0.01 | 0.13±0.02 | 0.25±0.00 | 0.15±0.00 | 0.17±0.00 |
| GRADE-R | 0.42±0.00 | 0.19±0.00 | 0.24±0.00 | 0.22±0.03 | 0.11±0.00 | 0.14±0.01 | 0.32±0.06 | 0.19±0.02 | 0.22±0.03 |

Table 5: Cross-domain recommendation on the Amazon data set with disjoint users

Figure 3: Performance of GRADE-N with different (a) base discrepancies and (b) base GNNs on the social networks

Flexibility Figure 3 shows the performance of GRADE-N with different base discrepancies and base graph neural network architectures on the social networks. It shows that our GRADE framework is flexible enough to incorporate existing domain discrepancy measures (i.e., JS-divergence (Ganin et al. 2016), CORAL (Sun and Saenko 2016), MMD (Gretton et al. 2012), and MDD (Zhang et al. 2019a)) and message-passing graph neural networks (i.e., GCN (Kipf and Welling 2017), SAGE (Hamilton, Ying, and Leskovec 2017), SGC (Wu et al. 2019), and GCNII (Chen et al. 2020)).
Figure 4: Model analysis of GRADE-N: (a) hyper-parameter sensitivity (performance on B1→B2 and B2→B1 as λ varies from 0.0 to 0.05); (b) efficiency analysis (seconds per epoch as the number of nodes grows from 1k to 10k).

Hyper-parameter Sensitivity We investigate the impact of λ on GRADE-N. Figure 4(a) shows the results on the social networks. It shows that the proposed GRADE-N achieves much better performance for λ ∈ (0.01, 0.03). Thus, we use λ = 0.02 in the experiments.

Computational Efficiency We investigate the computational efficiency of the GRADE framework. Following (Kipf and Welling 2017), we report the running time (measured in seconds of wall-clock time) per epoch on synthetic source and target graphs. Both graphs have n (i.e., n_s = n_t = n) nodes and 2n edges. As shown in Figure 4(b), we observe that the wall-clock time of GRADE-N is linear with respect to the number of nodes in the source and target graphs.

Conclusion

In this paper, we study the problem of cross-network transfer learning. We start by providing a theoretical analysis of cross-network transfer learning based on the Graph Subtree Discrepancy. Then we propose a novel GRADE framework for cross-network transfer learning. Experiments demonstrate the effectiveness and efficiency of our GRADE framework.

Acknowledgements

This work is supported by the National Science Foundation under Award No. IIS-1947203, IIS-2117902, and IIS-2137468, and by Agriculture and Food Research Initiative (AFRI) grant no. 2020-67021-32799/project accession no. 1024178 from the USDA National Institute of Food and Agriculture. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

References

Acuna, D.; Zhang, G.; Law, M. T.; and Fidler, S. 2021. f-Domain Adversarial Learning: Theory and Algorithms. In International Conference on Machine Learning, 66-75.
Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning, 79(1): 151-175.
Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; and Li, Y. 2020. Simple and deep graph convolutional networks. In International Conference on Machine Learning, 1725-1735.
Chen, X.; Wang, S.; Wang, J.; and Long, M. 2021. Representation Subspace Distance for Domain Adaptation Regression. In International Conference on Machine Learning, 1749-1759. PMLR.
Dai, Q.; Wu, X.-M.; Xiao, J.; Shen, X.; and Wang, D. 2022. Graph Transfer Learning via Adversarial Domain Adaptation with Graph Convolution. IEEE Transactions on Knowledge and Data Engineering.
Fang, M.; Yin, J.; Zhu, X.; and Zhang, C. 2015. TrGraph: Cross-network transfer learning via common signature subgraphs. IEEE Transactions on Knowledge and Data Engineering, 27(9): 2536-2549.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1): 2096-2030.
Geerts, F.; Mazowiecki, F.; and Perez, G. 2021. Let's agree to degree: Comparing graph convolutional networks in the message-passing framework. In International Conference on Machine Learning, 3640-3649.
Gretton, A.; Borgwardt, K. M.; Rasch, M.
J.; Sch olkopf, B.; and Smola, A. 2012. A kernel two-sample test. The Journal of Machine Learning Research, 13(1): 723 773. Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. Advances in neural information processing systems, 30. Han, X.; Huang, Z.; An, B.; and Bai, J. 2021. Adaptive Transfer Learning on Graph Neural Networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 565 574. He, R.; and Mc Auley, J. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, 507 517. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, 173 182. Hu, G.; Zhang, Y.; and Yang, Q. 2018. Co Net: Collaborative cross networks for cross-domain recommendation. In Proceedings of the 27th ACM international conference on information and knowledge management, 667 676. Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V.; and Leskovec, J. 2020a. Strategies for Pre-training Graph Neural Networks. In International Conference on Learning Representations. Hu, Z.; Dong, Y.; Wang, K.; Chang, K.-W.; and Sun, Y. 2020b. GPT-GNN: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1857 1867. Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations. Levie, R.; Isufi, E.; and Kutyniok, G. 2019. On the transferability of spectral graph filters. In 2019 13th International conference on Sampling Theory and Applications (Samp TA), 1 5. IEEE. Li, J.; Hu, X.; Tang, J.; and Liu, H. 2015. Unsupervised streaming feature selection in social media. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 1041 1050. Li, P.; and Tuzhilin, A. 2020. DDTCDR: Deep dual transfer cross domain recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, 331 339. Long, M.; Cao, Y.; Wang, J.; and Jordan, M. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, 97 105. Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory. Ni, J.; Tong, H.; Fan, W.; and Zhang, X. 2015. Flexible and robust multi-network clustering. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 835 844. Pan, S. J.; and Yang, Q. 2009. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10): 1345 1359. Qiu, J.; Chen, Q.; Dong, Y.; Zhang, J.; Yang, H.; Ding, M.; Wang, K.; and Tang, J. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1150 1160. Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 452 461. Ribeiro, L. F.; Saverese, P. H.; and Figueiredo, D. R. 2017. struc2vec: Learning node representations from structural identity. 
In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. Ruiz, L.; Chamon, L.; and Ribeiro, A. 2020. Graphon neural networks and the transferability of graph neural networks. Advances in Neural Information Processing Systems, 33: 1702 1712. Shen, X.; Dai, Q.; Chung, F.-l.; Lu, W.; and Choi, K.- S. 2020. Adversarial deep network embedding for crossnetwork node classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2991 2999. Shervashidze, N.; Schweitzer, P.; Van Leeuwen, E. J.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler Lehman graph kernels. Journal of Machine Learning Research, 12(9). Sun, B.; and Saenko, K. 2016. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, 443 450. Springer. Tang, J.; Zhang, J.; Yao, L.; Li, J.; Zhang, L.; and Su, Z. 2008. Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 990 998. Tang, X.; Yao, H.; Sun, Y.; Wang, Y.; Tang, J.; Aggarwal, C.; Mitra, P.; and Wang, S. 2020. Investigating and mitigating degree-related biases in graph convoltuional networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 1435 1444. Veliˇckovi c, P.; Cucurull, G.; Casanova, A.; Romero, A.; Li o, P.; and Bengio, Y. 2018. Graph Attention Networks. In International Conference on Learning Representations. Wang, S.; Guan, K.; Wang, Z.; Ainsworth, E. A.; Zheng, T.; Townsend, P. A.; Li, K.; Moller, C.; Wu, G.; and Jiang, C. 2021. Unique contributions of chlorophyll and nitrogen to predict crop photosynthetic capacity from leaf spectroscopy. Journal of experimental botany, 72(2): 341 354. Weisfeiler, B.; and Leman, A. 1968. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series, 2(9): 12 16. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying graph convolutional networks. In International Conference on Machine Learning, 6861 6871. Wu, J.; and He, J. 2019. Scalable manifold-regularized attributed network embedding via maximum mean discrepancy. In Proceedings of the 28th ACM international conference on information and knowledge management, 2101 2104. Wu, J.; and He, J. 2021. Indirect Invisible Poisoning Attacks on Domain Adaptation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 1852 1862. Wu, J.; and He, J. 2022a. Domain Adaptation with Dynamic Open-Set Targets. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2039 2049. Wu, J.; and He, J. 2022b. A Unified Meta-Learning Framework for Dynamic Transfer Learning. In The Thirty-First International Joint Conference on Artificial Intelligence. Wu, J.; He, J.; Wang, S.; Guan, K.; and Ainsworth, E. 2022a. Distribution-Informed Neural Networks for Domain Adaptation Regression. In Advances in Neural Information Processing Systems. Wu, J.; He, J.; and Xu, J. 2019. DEMO-Net: Degree-specific Graph Neural Networks for Node and Graph Classification. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 406 415. Wu, J.; Tong, H.; Ainsworth, E.; and He, J. 2022b. Adaptive Knowledge Transfer on Evolving Domains. In 2022 IEEE International Conference on Big Data (Big Data), 1389 1394. IEEE. Wu, M.; Pan, S.; Zhou, C.; Chang, X.; and Zhu, X. 2020. 
Unsupervised Domain Adaptive Graph Convolutional Networks. In Proceedings of The Web Conference 2020, 1457 1467. Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How Powerful are Graph Neural Networks? In International Conference on Learning Representations. Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, 5453 5462. Zhang, Y.; Liu, T.; Long, M.; and Jordan, M. 2019a. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, 7404 7413. Zhang, Y.; Liu, Y.; Han, P.; Miao, C.; Cui, L.; Li, B.; and Tang, H. 2020. Learning Personalized Itemset Mapping for Cross-Domain Recommendation. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, 2561 2567. Zhang, Y.; Song, G.; Du, L.; Yang, S.; and Jin, Y. 2019b. DANE: Domain Adaptive Network Embedding. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 4362 4368. Zhao, C.; Li, C.; and Fu, C. 2019. Cross-domain recommendation via preference propagation graphnet. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2165 2168. Zhao, H.; Des Combes, R. T.; Zhang, K.; and Gordon, G. 2019. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, 7523 7532. Zhou, Y.; Wang, H.; He, J.; and Wang, H. 2021a. From Intrinsic to Counterfactual: On the Explainability of Contextualized Recommender Systems. ar Xiv preprint ar Xiv:2110.14844. Zhou, Y.; Xu, J.; Wu, J.; Taghavi, Z.; Korpeoglu, E.; Achan, K.; and He, J. 2021b. PURE: Positive-unlabeled recommendation with generative adversarial network. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2409 2419. Zhu, Q.; Yang, C.; Xu, Y.; Wang, H.; Zhang, C.; and Han, J. 2021. Transfer learning of graph neural networks with ego-graph information maximization. Advances in Neural Information Processing Systems, 34.