GRAPH INFERENCE LEARNING FOR SEMI-SUPERVISED CLASSIFICATION

Chunyan Xu, Zhen Cui, Xiaobin Hong, Tong Zhang, and Jian Yang
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
{cyx,zhen.cui,xbhong,tong.zhang,csjyang}@njust.edu.cn

Wei Liu
Tencent AI Lab, China
wl2223@columbia.edu

Corresponding author: Zhen Cui.

Published as a conference paper at ICLR 2020

ABSTRACT

In this work, we address semi-supervised classification of graph data, where the categories of those unlabeled nodes are inferred from labeled nodes as well as graph structures. Recent works often solve this problem via advanced graph convolution in a conventionally supervised manner, but the performance could degrade significantly when labeled data is scarce. To this end, we propose a Graph Inference Learning (GIL) framework to boost the performance of semi-supervised node classification by learning the inference of node labels on graph topology. To bridge the connection between two nodes, we formally define a structure relation by encapsulating node attributes, between-node paths, and local topological structures together, which makes the inference conveniently deducible from one node to another. For learning the inference process, we further introduce meta-optimization on structure relations from training nodes to validation nodes, such that the learnt graph inference capability can be better self-adapted to testing nodes. Comprehensive evaluations on four benchmark datasets (including Cora, Citeseer, Pubmed, and NELL) demonstrate the superiority of our proposed GIL when compared against state-of-the-art methods on the semi-supervised node classification task.

1 INTRODUCTION

A graph, which comprises a set of vertices/nodes together with connecting edges, is a formal structural representation of non-regular data. Due to its strong representation ability, it accommodates many potential applications, e.g., social networks (Orsini et al., 2017), world wide web data (Page et al., 1999), knowledge graphs (Xu et al., 2017), and protein-interaction networks (Borgwardt et al., 2007). Among these, semi-supervised node classification on graphs is one of the most interesting and popular topics. Given a graph in which some nodes are labeled, the aim of semi-supervised classification is to infer the categories of the remaining unlabeled nodes by using various priors of the graph.

While numerous previous works (Brandes et al., 2008; Zhou et al., 2004; Zhu et al., 2003; Yang et al., 2016; Zhao et al., 2019) have been devoted to semi-supervised node classification based on explicit graph Laplacian regularization, it is hard to efficiently boost the performance of label prediction due to the strict assumption that connected nodes are likely to share the same label information. With the progress of deep learning on grid-shaped images/videos (He et al., 2016), a number of graph convolutional neural network (CNN) based methods, including spectral (Kipf & Welling, 2017) and spatial methods (Niepert et al., 2016; Pan et al., 2018; Yu et al., 2018), have been proposed to learn local convolution filters on graphs in order to extract more discriminative node representations. Although graph CNN based methods have achieved considerable graph embedding capability by optimizing filters, they are confined to a conventional semi-supervised framework and lack an efficient inference mechanism on graphs.
In particular, in the case of few-shot learning, where only a small number of training nodes are labeled, this kind of method suffers a drastic drop in performance. For example, the Pubmed graph dataset (Sen et al., 2008) consists of 19,717 nodes and 44,338 edges, but only 0.3% of the nodes are labeled for the semi-supervised node classification task.

Figure 1: The illustration of our proposed GIL framework. For the problem of graph node labeling, the category information of the unlabeled nodes depends on the similarity computation between a query node (e.g., vj) and the labeled reference nodes (e.g., vi). We consider the similarity from three aspects: node attributes, the consistency of local topological structures (i.e., the circle with the dashed line), and the between-node path reachability (i.e., the red wavy line from vi to vj). Specifically, the local structures as well as node attributes are encoded as high-level features with graph convolution, while the between-node path reachability is abstracted as reachable probabilities of random walks. To better make the inference generalize to test nodes, we introduce a meta-learning strategy to optimize the learning of structure relations from training nodes to validation nodes.

These aforementioned works usually boil down to a general classification task, where the model is learnt on a training set and selected by checking a validation set. However, they do not put great effort into how to learn to infer from one node to another on a topological graph, especially in the few-shot regime. In this paper, we propose a graph inference learning (GIL) framework to teach the model itself to adaptively infer from reference labeled nodes to those query unlabeled nodes, and finally boost the performance of semi-supervised node classification when only a few labeled samples are available. Given an input graph, GIL attempts to infer the unlabeled nodes from the observed nodes by building between-node relations. The between-node relations are structured as the integration of node attributes, connection paths, and graph topological structures. That is, the similarity between two nodes is decided from three aspects: the consistency of node attributes, the consistency of local topological structures, and the between-node path reachability, as shown in Fig. 1. The local structures anchored around each node as well as the attributes of nodes therein are jointly encoded with graph convolution (Defferrard et al., 2016) for the sake of high-level feature extraction. For the between-node path reachability, we adopt the random walk algorithm to obtain the characteristics from a labeled reference node vi to a query unlabeled node vj in a given graph. Based on the computed node representations and between-node reachability, the structure relations can be obtained by computing the similarity scores/relationships from reference nodes to unlabeled nodes in a graph. Inspired by the recent meta-learning strategy (Finn et al., 2017), we learn to infer the structure relations from a training set to a validation set, which can benefit the generalization capability of the learned model. In other words, our proposed GIL attempts to learn some transferable knowledge underlying the structure relations from training samples to validation samples, such that the learned structure relations can be better self-adapted to the new testing stage.
We summarize the main contributions of this work as threefold:

- We propose a novel graph inference learning framework by building structure relations to infer unknown node labels from labeled nodes in an end-to-end way. The structure relations are well defined by jointly considering node attributes, between-node paths, and graph topological structures.
- To make the inference model better generalize to test nodes, we introduce a meta-learning procedure to optimize structure relations, which, to the best of our knowledge, is the first such attempt for graph node classification.
- Comprehensive evaluations on three citation network datasets (including Cora, Citeseer, and Pubmed) and one knowledge graph dataset (i.e., NELL) demonstrate the superiority of our proposed GIL in contrast with other state-of-the-art methods on the semi-supervised classification task.

2 RELATED WORK

Graph CNNs: With the rapid development of deep learning methods, various graph convolutional neural networks (Kashima et al., 2003; Morris et al., 2017; Shervashidze et al., 2009; Yanardag & Vishwanathan, 2015; Jiang et al., 2019; Zhang et al., 2020) have been exploited to analyze irregular graph-structured data. For better extending general convolutional neural networks to graph domains, two broad strategies have been proposed, including spectral and spatial convolution methods. Specifically, spectral filtering methods (Henaff et al., 2015; Kipf & Welling, 2017) develop convolution-like operators in the spectral domain, and then perform a series of spectral filters by decomposing the graph Laplacian. Unfortunately, the spectral-based approaches often incur high computational complexity due to the eigenvalue decomposition, especially for graphs with a large number of nodes. To alleviate this computational burden, local spectral filtering methods (Defferrard et al., 2016) were then proposed by parameterizing the frequency responses as a Chebyshev polynomial approximation. Another type of graph CNNs, namely spatial methods (Li et al., 2016; Niepert et al., 2016), performs the filtering operation by defining spatial structures over adjacent vertices. Various approaches can be employed to aggregate or sort neighboring vertices, such as diffusion CNNs (Atwood & Towsley, 2016), GraphSAGE (Hamilton et al., 2017), PSCN (Niepert et al., 2016), and Ngram CNN (Luo et al., 2017). From the perspective of data distribution, the Gaussian-induced convolution model (Jiang et al., 2019) was recently proposed to disentangle the aggregation process by encoding adjacent regions with a Gaussian mixture model.

Semi-supervised node classification: Among various graph-related applications, semi-supervised node classification has gained increasing attention recently, and various approaches have been proposed to deal with this problem, including explicit graph Laplacian regularization and graph-embedding approaches. Several classic algorithms with graph Laplacian regularization include the label propagation method using Gaussian random fields (Zhu et al., 2003), the regularization framework relying on local/global consistency (Zhou et al., 2004), and the random-walk-based sampling algorithm for acquiring the context information (Yang et al., 2016).
To further address scalable semi-supervised learning issues (Liu et al., 2012), the Anchor Graph regularization approach (Liu et al., 2010) was proposed to scale linearly with the number of graph nodes and was then applied to massive-scale graph datasets. Several graph convolution network methods (Abu-El-Haija et al., 2018; Du et al., 2017; Thekumparampil et al., 2018; Velickovic et al., 2018; Zhuang & Ma, 2018) have since been developed to obtain discriminative representations of input graphs. For example, Kipf et al. (Kipf & Welling, 2017) proposed a scalable graph CNN model, which scales linearly in the number of graph edges and learns graph representations by encoding both local graph structures and node attributes. Graph attention networks (GAT) (Velickovic et al., 2018) compute hidden representations of each node by attending to its neighbors with a self-attention strategy. By jointly considering the local- and global-consistency information, dual graph convolutional networks (Zhuang & Ma, 2018) were presented to deal with semi-supervised node classification. The critical difference between our proposed GIL and these previous semi-supervised node classification methods is that GIL adopts a graph inference strategy, defining structure relations on graphs and then leveraging a meta-optimization mechanism to learn an inference model, which, to the best of our knowledge, is the first such attempt, whereas existing graph CNNs treat semi-supervised node classification as a general classification task.

3 THE PROPOSED MODEL

3.1 PROBLEM DEFINITION

Formally, we denote an undirected/directed graph as $\mathcal{G} = \{\mathcal{V}, E, X, Y\}$, where $\mathcal{V} = \{v_i\}_{i=1}^{n}$ is the finite set of $n$ (or $|\mathcal{V}|$) vertices, $E \in \mathbb{R}^{n \times n}$ defines the adjacency relationships (i.e., edges) between vertices and represents the topology of $\mathcal{G}$, $X \in \mathbb{R}^{n \times d}$ records the explicit/implicit attributes/signals of vertices, and $Y \in \mathbb{R}^{n}$ contains the vertex labels of $C$ classes. The edge $E_{ij} = E(v_i, v_j) = 0$ if and only if vertices $v_i, v_j$ are not connected, and $E_{ij} \neq 0$ otherwise. The attribute matrix $X$ is attached to the vertex set $\mathcal{V}$, whose $i$-th row $X_{v_i}$ (or $X_{i\cdot}$) represents the attribute of the $i$-th vertex $v_i$. That is, $v_i \in \mathcal{V}$ carries a vector of $d$-dimensional signals. Associated with each node $v_i \in \mathcal{V}$ is a discrete label $y_i \in \{1, 2, \dots, C\}$.

We consider the task of semi-supervised node classification over graph data, where only a small number of vertices are labeled for model learning, i.e., $|\mathcal{V}_{\mathrm{Label}}| \ll |\mathcal{V}|$. Generally, we have three node sets: a training set $\mathcal{V}_{tr}$, a validation set $\mathcal{V}_{val}$, and a testing set $\mathcal{V}_{te}$. In the standard protocol of the prior literature (Yang et al., 2016), the three node sets share the same label space. We follow, but do not restrict our method to, this protocol. Given the training and validation node sets, the aim is to predict the node labels of testing nodes by using node attributes as well as edge connections. A standard machine learning practice used in most existing methods (Kipf & Welling, 2017; Zhou et al., 2004) is to choose the optimal classifier (trained on the training set) after checking its performance on the validation set. However, these methods essentially ignore how to extract transferable knowledge from the known labeled nodes to the unlabeled nodes, even though the graph structure itself implies node connectivity/reachability. Moreover, due to the scarcity of labeled samples, the performance of such a classifier is usually not satisfactory.
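To make the notation above concrete, below is a minimal sketch of how a graph $\mathcal{G} = \{\mathcal{V}, E, X, Y\}$ and its node splits might be laid out as arrays; the NumPy representation, the toy sizes, and the index arrays are illustrative assumptions rather than part of the paper's pipeline.

```python
import numpy as np

# Toy sizes; the paper's datasets are much larger (e.g., Pubmed: 19,717 nodes, 3 classes).
n, d, C = 6, 4, 3

# Adjacency matrix E (n x n): E[i, j] != 0 iff v_i and v_j are connected.
E = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    E[i, j] = E[j, i] = 1.0             # undirected example

X = np.random.rand(n, d)                # node attribute matrix, one d-dim row per vertex
Y = np.array([0, 0, 1, 1, 2, 2])        # discrete labels y_i in {0, ..., C-1}

# Semi-supervised splits: |V_Label| << |V|.
V_tr = np.array([0, 2])                 # labeled reference nodes (training)
V_val = np.array([4])                   # validation nodes used to meta-optimize the inference
V_te = np.array([1, 3, 5])              # testing nodes whose labels are to be inferred
```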
To address these issues, we introduce a meta-learning mechanism (Finn et al., 2017; Ravi & Larochelle, 2017; Sung et al., 2017) to learn to infer node labels on graphs. Specifically, the graph structure, between-node path reachability, and node attributes are jointly modeled in the learning process. Our aim is to learn to infer from labeled nodes to unlabeled nodes, so that the learner can perform better on a validation set and thus classify a testing set more accurately.

3.2 STRUCTURE RELATION

For convenient inference, we explicitly build a structure relation between two nodes on the topological graph. The labeled vertices (in a training set) are viewed as reference nodes, and their information can be propagated to the unlabeled vertices to improve the label prediction accuracy. Formally, given a reference node $v_i \in \mathcal{V}_{\mathrm{Label}}$, we define the score of a query node $v_j$ being similar to $v_i$ as

$$s_{i \to j} = f_r\big(f_e(\mathcal{G}_{v_i}), f_e(\mathcal{G}_{v_j}), f_P(v_i, v_j, E)\big), \quad (1)$$

where $\mathcal{G}_{v_i}$ and $\mathcal{G}_{v_j}$ may be understood as the centralized subgraphs around $v_i$ and $v_j$, respectively. $f_e$, $f_r$, and $f_P$ are three abstract functions that we explain as follows:

- Node representation $f_e(\mathcal{G}_{v_i}) \in \mathbb{R}^{d_v}$ encodes the local representation of the centralized subgraph $\mathcal{G}_{v_i}$ around node $v_i$, and may thus be understood as a local filter function on graphs. This function should not only take the signals of the nodes therein as input, but also consider the local topological structure of the subgraph for more accurate similarity computation. To this end, we perform spectral graph convolution on subgraphs to learn discriminative node features, analogous to pixel-level feature extraction from convolution maps of gridded images. The details of the feature extraction $f_e$ are described in Section 4.
- Path reachability $f_P(v_i, v_j, E) \in \mathbb{R}^{d_p}$ represents the characteristics of path reachability from $v_i$ to $v_j$. As there usually exist multiple traversal paths between two nodes, we choose this function to be the reachable probabilities of walks of different lengths from $v_i$ to $v_j$. More details are given in Section 4.
- Structure relation $f_r: \mathbb{R}^{d_v} \times \mathbb{R}^{d_v} \times \mathbb{R}^{d_p} \to \mathbb{R}$ is a relational function computing the score of $v_j$ being similar to $v_i$. This function is not exchangeable under the two orderings of the nodes, due to the asymmetric reachable relationship $f_P$. If necessary, we may easily revise it to be a symmetric function, e.g., by summarizing the two traversal directions. The score function depends on triple inputs: the local representations extracted from the subgraphs, i.e., $f_e(\mathcal{G}_{v_i})$ and $f_e(\mathcal{G}_{v_j})$, and the path reachability from $v_i$ to $v_j$.

In semi-supervised node classification, we take the training node set $\mathcal{V}_{tr}$ as the reference samples and the validation set $\mathcal{V}_{val}$ as the query samples during the training stage. Given a query node $v_j \in \mathcal{V}_{val}$, we can derive the class similarity score of $v_j$ w.r.t. the $c$-th ($c = 1, \dots, C$) category by weighting the reference samples $\mathcal{C}_c = \{v_k \,|\, y_{v_k} = c\}$. Formally, we can further revise Eqn. (1) and define the class-to-node relationship function as

$$s_{\mathcal{C}_c \to j} = \phi_r\Big(F^{\mathcal{C}_c}_{v_j} \sum_{v_i \in \mathcal{C}_c} w_{i \to j}\, f_e(\mathcal{G}_{v_i}),\; f_e(\mathcal{G}_{v_j})\Big), \quad (2)$$

$$\text{s.t.} \quad w_{i \to j} = \phi_w\big(f_P(v_i, v_j, E)\big), \quad (3)$$

where the function $\phi_w$ maps a reachability vector $f_P(v_i, v_j, E)$ into a weight value, and the function $\phi_r$ computes the similarity score between $v_j$ and the $c$-th class nodes. The normalization factor $F^{\mathcal{C}_c}_{v_j}$ of the $c$-th category w.r.t. $v_j$ is defined as

$$F^{\mathcal{C}_c}_{v_j} = \frac{1}{\sum_{v_i \in \mathcal{C}_c} w_{i \to j}}. \quad (4)$$

For the relation function $\phi_r$ and the weight function $\phi_w$, we may choose suitable subnetworks to instantiate them in practice.
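As an illustration of Eqns. (2)-(4), the sketch below scores a query node against the reference nodes of one class from already-extracted node features and reachability vectors. The small MLPs standing in for $\phi_w$ and $\phi_r$, and all tensor names, are assumptions chosen for exposition (loosely following the layer sizes reported in Section 4), not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ClassToNodeRelation(nn.Module):
    """Sketch of Eqns. (2)-(4): relate a query node to the reference nodes of one class."""

    def __init__(self, d_v: int, d_p: int, num_classes: int):
        super().__init__()
        # phi_w: maps a d_p-dim reachability vector f_P(v_i, v_j, E) to a scalar weight.
        self.phi_w = nn.Sequential(nn.Linear(d_p, 16), nn.ReLU(), nn.Linear(16, 1))
        # phi_r: one fully connected layer over the concatenated (pooled reference, query) features.
        self.phi_r = nn.Linear(2 * d_v, num_classes)

    def forward(self, ref_feats, query_feat, reach_vecs, c):
        # ref_feats:  (m, d_v) features f_e(G_{v_i}) of the m reference nodes in class C_c
        # query_feat: (d_v,)   feature  f_e(G_{v_j}) of the query node v_j
        # reach_vecs: (m, d_p) reachability vectors f_P(v_i, v_j, E)
        w = self.phi_w(reach_vecs).squeeze(-1)                      # Eqn. (3): w_{i->j}
        norm = 1.0 / (w.sum() + 1e-8)                               # Eqn. (4): F^{C_c}_{v_j}
        pooled = norm * (w.unsqueeze(-1) * ref_feats).sum(dim=0)    # weighted class aggregate
        scores = self.phi_r(torch.cat([pooled, query_feat], dim=-1))
        return scores[c]                                            # Eqn. (2): s_{C_c -> j}
```

Repeating this computation for every class $c$ yields the score vector that is fed to the softmax and cross-entropy of Eqn. (5) below.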
The detailed implementation of our model can be found in Section 4.

3.3 INFERENCE LEARNING

According to the class-to-node relationship function in Eqn. (2), given a query node $v_j$, we can obtain a score vector $\mathbf{s}_{\mathcal{C} \to j} = [s_{\mathcal{C}_1 \to j}, \dots, s_{\mathcal{C}_C \to j}] \in \mathbb{R}^C$ after computing the relations to all classes. The category with the maximum score is taken as the estimated label. Thus, we can define the loss function based on cross entropy as follows:

$$\mathcal{L} = -\sum_{c=1}^{C} y_{j,c} \log \hat{y}_{\mathcal{C}_c \to j}, \quad (5)$$

where $y_{j,c}$ is a binary indicator (i.e., 0 or 1) of class label $c$ for node $v_j$, and the softmax operation is imposed on $s_{\mathcal{C}_c \to j}$, i.e., $\hat{y}_{\mathcal{C}_c \to j} = \exp(s_{\mathcal{C}_c \to j}) / \sum_{k=1}^{C} \exp(s_{\mathcal{C}_k \to j})$. Other error functions may be chosen as the loss function, e.g., the mean squared error. In the regime of general classification, the cross-entropy loss is a standard choice that performs well.

Given a training set $\mathcal{V}_{tr}$, we expect that the best performance can be obtained on the validation set $\mathcal{V}_{val}$ after optimizing the model on $\mathcal{V}_{tr}$. Given a trained/pretrained model $\Theta = \{f_e, \phi_w, \phi_r\}$, we iteratively perform gradient updates on the training set $\mathcal{V}_{tr}$ to obtain the new model, formally,

$$\Theta' = \Theta - \alpha \nabla_{\Theta} \mathcal{L}_{tr}(\Theta), \quad (6)$$

where $\alpha$ is the update rate. Note that, in the computation of class scores, since the reference node and the query node can both come from the training set $\mathcal{V}_{tr}$, we set the computation weight $w_{i \to j} = 0$ if $i = j$ in Eqn. (3). After several iterations of gradient descent on $\mathcal{V}_{tr}$, we expect better performance on the validation set $\mathcal{V}_{val}$, i.e., $\min_{\Theta} \mathcal{L}_{val}(\Theta')$. Thus, we can perform the gradient update as follows:

$$\Theta = \Theta - \beta \nabla_{\Theta} \mathcal{L}_{val}(\Theta'), \quad (7)$$

where $\beta$ is the learning rate of the meta-optimization (Finn et al., 2017). During the training process, we may perform batch sampling from training nodes and validation nodes, instead of taking all of them at once. In the testing stage, we may take all training nodes and perform the model update according to Eqn. (6), as in the training process. The updated model is used as the final model and is then fed into Eqn. (2) to infer the class labels of the query nodes.
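The meta-optimization of Eqns. (6)-(7) follows the familiar inner/outer-loop pattern of MAML-style training (Finn et al., 2017). The following is a minimal sketch of one meta-step under a first-order approximation (the second-order term of Eqn. (7) is dropped), assuming the whole model $\Theta = \{f_e, \phi_w, \phi_r\}$ is a single PyTorch module and that `loss_on(model, nodes)` is a hypothetical helper evaluating the cross-entropy of Eqn. (5) over a node set.

```python
import torch

def meta_step(model, loss_on, V_tr, V_val, alpha=1e-3, beta=1e-3, inner_steps=1):
    """One meta-optimization step (Eqns. (6)-(7)), first-order approximation."""
    # Remember the current Theta so the inner-loop updates can be discarded afterwards.
    theta = [p.detach().clone() for p in model.parameters()]

    # Inner loop, Eqn. (6): adapt Theta -> Theta' on the training nodes.
    for _ in range(inner_steps):
        grads = torch.autograd.grad(loss_on(model, V_tr), model.parameters())
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p -= alpha * g

    # Outer loop, Eqn. (7): evaluate Theta' on the validation nodes and update the
    # original Theta with that gradient (first-order: the gradient at Theta' is reused for Theta).
    grads_val = torch.autograd.grad(loss_on(model, V_val), model.parameters())
    with torch.no_grad():
        for p, p0, g in zip(model.parameters(), theta, grads_val):
            p.copy_(p0 - beta * g)
```

In the paper's experiments both meta-learning rates are small (α = β = 0.001), and batches of training and validation nodes can be sampled in place of the full sets.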
In this section, we instantiate all modules (i.e., functions) of the aforementioned structure relation. The implementation details are given in the following.

Node Representation $f_e(\mathcal{G}_{v_i})$: The local representation at vertex $v_i$ can be extracted by performing the graph convolution operation on the subgraph $\mathcal{G}_{v_i}$. Similar to gridded images/videos, on which local convolution kernels are defined as multiple lattices with various receptive fields, the spectral graph convolution is used to encode the local representations of an input graph in our work. Given a graph sample $\mathcal{G} = \{\mathcal{V}, E, X\}$, the normalized graph Laplacian matrix is $L = I_n - D^{-1/2} E D^{-1/2} = U \Lambda U^T$, where $\Lambda$ is the diagonal matrix of its eigenvalues. The spectral graph convolution can be defined as the multiplication of the signal $X$ with a filter $g_\theta(\Lambda) = \mathrm{diag}(\theta)$ parameterized by $\theta$ in the Fourier domain: $\mathrm{conv}(X) = g_\theta(L) \ast X = U g_\theta(\Lambda) U^T X$, where the parameter $\theta \in \mathbb{R}^n$ is a vector of Fourier coefficients. To reduce the computational complexity and obtain local information, we use an approximate local filter based on Chebyshev polynomials (Defferrard et al., 2016), $g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\hat{\Lambda})$, where the parameter $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients and $T_k(\hat{\Lambda}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at $\hat{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$, a diagonal matrix of scaled eigenvalues. The graph filtering operation can then be expressed as $g_\theta(L) \ast X = \sum_{k=0}^{K-1} \theta_k T_k(\hat{L}) X$, where $T_k(\hat{L}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at the scaled Laplacian $\hat{L} = 2L/\lambda_{\max} - I_n$. Further, we can construct multi-scale receptive fields for each vertex based on the Laplacian matrix $L$, where each receptive field records the hopping neighborhood relationships around the reference vertex $v_i$ and forms a local centralized subgraph.

Path Reachability $f_P(v_i, v_j, E)$: Here we compute the probabilities of paths from vertex $i$ to vertex $j$ by employing random walks on graphs, i.e., traversing the graph from $v_i$ to $v_j$ according to the probability matrix $P$. For an input graph $\mathcal{G}$ with $n$ vertices, the random-walk transition matrix can be defined as $P = D^{-1} E$, where $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $D_{ii} = \sum_j E_{ij}$. That is to say, each element $P_{ij}$ is the probability of going from vertex $i$ to vertex $j$ in one step. A sequence of nodes from vertex $i$ to vertex $j$ is a random walk on the graph, which can be modeled as a classical Markov chain over the set of graph vertices. Under this formulation, $P^t_{ij}$ is the probability of getting from vertex $v_i$ to vertex $v_j$ in $t$ steps. This fact is easily seen by considering a $t$-step path from vertex $v_i$ to vertex $v_j$ as first taking a single step to some vertex $h$, and then taking $t-1$ steps to $v_j$. The transition probability $P^t$ in $t$ steps can thus be formulated as

$$P^t_{ij} = \begin{cases} P_{ij} & \text{if } t = 1, \\ \sum_h P_{ih} P^{t-1}_{hj} & \text{if } t > 1, \end{cases} \quad (8)$$

where each matrix entry $P^t_{ij}$ denotes the probability of starting at vertex $i$ and ending at vertex $j$ in $t$ steps. Finally, the node reachability from $v_i$ to $v_j$ can be written as a $d_p$-dimensional vector:

$$f_P(v_i, v_j, E) = [P_{ij}, P^2_{ij}, \dots, P^{d_p}_{ij}], \quad (9)$$

where $d_p$ refers to the step length of the longest path considered from $v_i$ to $v_j$.

Class-to-Node Relationship $s_{\mathcal{C}_c \to j}$: To define the node relationship $s_{i \to j}$ from $v_i$ to $v_j$, we simultaneously consider the path reachability $f_P(v_i, v_j, E)$ and the local representations $f_e(\mathcal{G}_{v_i})$ and $f_e(\mathcal{G}_{v_j})$ of nodes $v_i, v_j$. The function $\phi_w(f_P(v_i, v_j, E))$ in Eqn. (3), which maps the reachability vector $f_P(v_i, v_j, E) \in \mathbb{R}^{d_p}$ into a weight value, is implemented with two 16-dimensional fully connected layers in our experiments. The computed value $w_{i \to j}$ is further used to weight the local features at node $v_i$, $f_e(\mathcal{G}_{v_i}) \in \mathbb{R}^{d_v}$. To obtain the similarity score between $v_j$ and the $c$-th class nodes $\mathcal{C}_c$ in Eqn. (2), we concatenate two input features, where one is the weighted features of vertex $v_i$ and the other is the local features of vertex $v_j$. One fully connected layer (w.r.t. $\phi_r$) with $C$ dimensions is finally adopted to obtain the relation regression score.
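To make the two ingredients of the structure relation concrete, the sketch below computes (i) a $K$-order Chebyshev-filtered node representation and (ii) the random-walk reachability vector of Eqn. (9) for a node pair. The dense NumPy implementation and the function names are illustrative assumptions; a practical implementation would use sparse matrices, and the Chebyshev blocks would be mixed by learned coefficients inside $f_e$ rather than returned raw.

```python
import numpy as np

def chebyshev_blocks(X, E, K=2):
    """Local spectral filtering (Defferrard et al., 2016): stack T_k(L_hat) @ X for k = 0..K-1."""
    n = E.shape[0]
    deg = E.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-8)))
    L = np.eye(n) - d_inv_sqrt @ E @ d_inv_sqrt              # normalized graph Laplacian
    lam_max = np.linalg.eigvalsh(L).max()
    L_hat = 2.0 * L / lam_max - np.eye(n)                    # scaled Laplacian
    blocks = [X, L_hat @ X][:K]                              # T_0(L_hat) X, T_1(L_hat) X
    for _ in range(2, K):
        # Chebyshev recurrence: T_k = 2 L_hat T_{k-1} - T_{k-2}
        blocks.append(2.0 * L_hat @ blocks[-1] - blocks[-2])
    return np.concatenate(blocks, axis=1)                    # (n, K*d), a stand-in for the f_e input

def reachability_vector(E, i, j, d_p=3):
    """Eqns. (8)-(9): [P_ij, P^2_ij, ..., P^{d_p}_ij] with P = D^{-1} E."""
    deg = E.sum(axis=1)
    P = E / np.maximum(deg, 1e-8)[:, None]                   # row-normalized transition matrix
    Pt, vec = P.copy(), []
    for _ in range(d_p):
        vec.append(Pt[i, j])
        Pt = Pt @ P                                          # P^{t+1} = P^t P
    return np.array(vec)                                     # f_P(v_i, v_j, E)
```

The per-pair reachability vectors feed $\phi_w$ in Eqn. (3), while the filtered node features feed the relation module of Eqn. (2).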
5 EXPERIMENTS

5.1 EXPERIMENTAL SETTINGS

We evaluate our proposed GIL method on three citation network datasets, Cora, Citeseer, and Pubmed (Sen et al., 2008), and one knowledge graph dataset, NELL (Carlson et al., 2010). The statistical properties of the graph data are summarized in Table 1. Following the previous protocol in (Kipf & Welling, 2017; Zhuang & Ma, 2018), we split the graph data into a training set, a validation set, and a testing set.

Table 1: The properties (especially the label rate) of the graph datasets used for the semi-supervised classification task.

| Dataset  | Nodes  | Edges   | Classes | Features | Label Rate |
|----------|--------|---------|---------|----------|------------|
| Cora     | 2,708  | 5,429   | 7       | 1,433    | 0.052      |
| Citeseer | 3,327  | 4,732   | 6       | 3,703    | 0.036      |
| Pubmed   | 19,717 | 44,338  | 3       | 500      | 0.003      |
| NELL     | 65,755 | 266,144 | 210     | 5,414    | 0.001      |

Taking into account the graph convolution and pooling modules, we alternately stack them into a multi-layer graph convolutional network. The GIL model consists of two graph convolution layers, each followed by a mean-pooling layer, and ends with a class-to-node relationship regression module and a final softmax layer. The detailed configuration of the relationship regression module is given in the class-to-node relationship part of Section 4. The parameter $d_p$ in Eqn. (9) is set to the mean length of between-node reachability paths in the input graph. The channels of the 1st and 2nd convolutional layers are set to 128 and 256, respectively. The scale of the receptive field is 2 in both convolutional layers. The dropout rate is set to 0.5 in the convolutional and fully connected layers to avoid over-fitting, and the ReLU unit is used as the nonlinear activation function. We pre-train our proposed GIL model for 200 iterations on the training set, where the initial learning rate, decay factor, and momentum are set to 0.05, 0.95, and 0.9, respectively. We train the GIL model using stochastic gradient descent with a batch size of 100. We then further improve the inference learning capability of the GIL model for 1,200 iterations with the validation set, where the meta-learning rates α and β are both set to 0.001.

5.2 COMPARISON WITH STATE-OF-THE-ARTS

We compare the GIL approach with several state-of-the-art methods (Monti et al., 2017; Kipf & Welling, 2017; Zhou et al., 2004; Zhuang & Ma, 2018) on four graph datasets, including Cora, Citeseer, Pubmed, and NELL. The classification accuracies of all methods are reported in Table 2. Our proposed GIL significantly outperforms the graph Laplacian regularized methods on the four graph datasets, including the DeepWalk (Zhou et al., 2004), modularity clustering (Brandes et al., 2008), Gaussian fields (Zhu et al., 2003), and graph embedding (Yang et al., 2016) methods. For example, we achieve much higher performance than the DeepWalk method (Zhou et al., 2004), e.g., 74.1% vs 43.2% on the Citeseer dataset, 83.1% vs 65.3% on the Pubmed dataset, and 78.9% vs 58.1% on the NELL dataset. We find that the graph embedding method (Yang et al., 2016), which considers both label information and graph structure during sampling, obtains accuracies lower than our proposed GIL by 9.4% on the Citeseer dataset and 10.5% on the Cora dataset, respectively. This indicates that our proposed GIL can better optimize structure relations and thus improve network generalization. We further compare our proposed GIL with several existing deep graph embedding methods, including graph attention networks (Velickovic et al., 2018), dual graph convolutional networks (Zhuang & Ma, 2018), topology adaptive graph convolutional networks (Du et al., 2017), multi-scale graph convolution (Abu-El-Haija et al., 2018), etc. For example, our proposed GIL achieves a very large gain, e.g., 86.2% vs 83.3% (Du et al., 2017) on the Cora dataset, and 78.9% vs 66.0% (Kipf & Welling, 2017) on the NELL dataset. We also evaluate our proposed GIL method on a large graph dataset with a lower label rate, where it significantly outperforms existing baselines on the Pubmed dataset: 3.1% over DGCN (Zhuang & Ma, 2018), 4.1% over the classic GCN (Kipf & Welling, 2017) and TAGCN (Du et al., 2017), 3.2% over AGNN (Thekumparampil et al., 2018), and 3.6% over N-GCN (Abu-El-Haija et al., 2018).
This demonstrates that our proposed GIL performs very well on various graph datasets by building the graph inference learning process, where the limited label information and graph structures can be well exploited in the prediction framework.

Table 2: Performance comparisons (classification accuracy, %) of semi-supervised classification methods.

| Method | Cora | Citeseer | Pubmed | NELL |
|---|---|---|---|---|
| Clustering (Brandes et al., 2008) | 59.5 | 60.1 | 70.7 | 21.8 |
| DeepWalk (Zhou et al., 2004) | 67.2 | 43.2 | 65.3 | 58.1 |
| Gaussian (Zhu et al., 2003) | 68.0 | 45.3 | 63.0 | 26.5 |
| G-embedding (Yang et al., 2016) | 75.7 | 64.7 | 77.2 | 61.9 |
| DCNN (Atwood & Towsley, 2016) | 76.8 | - | 73.0 | - |
| GCN (Kipf & Welling, 2017) | 81.5 | 70.3 | 79.0 | 66.0 |
| MoNet (Monti et al., 2017) | 81.7 | - | 78.8 | - |
| N-GCN (Abu-El-Haija et al., 2018) | 83.0 | 72.2 | 79.5 | - |
| GAT (Velickovic et al., 2018) | 83.0 | 72.5 | 79.0 | - |
| AGNN (Thekumparampil et al., 2018) | 83.1 | 71.7 | 79.9 | - |
| TAGCN (Du et al., 2017) | 83.3 | 72.5 | 79.0 | - |
| DGCN (Zhuang & Ma, 2018) | 83.5 | 72.6 | 80.0 | 74.2 |
| Our GIL | 86.2 | 74.1 | 83.1 | 78.9 |

5.3 ANALYSIS

Meta-optimization: As shown in Table 3, we report the classification accuracies of several variants of our proposed GIL and the classical GCN method (Kipf & Welling, 2017) on the Cora dataset. To analyze the performance improvement brought by the graph inference learning process, we report the classification accuracies of GCN (Kipf & Welling, 2017) and our proposed GIL on the Cora dataset under two different situations, namely "only learning with the training set Vtr" and "jointly learning on the training set Vtr and the validation set Vval". "GCN /w jointly learning on Vtr & Vval" achieves a better result than "GCN /w learning on Vtr" by 3.6%, which demonstrates that the network performance can be improved by employing validation samples. When using structure relations, "GIL /w learning on Vtr" obtains an improvement of 1.9% (over "GCN /w learning on Vtr"), which can be attributed to building connections between nodes. The meta-optimization strategy ("GIL /w meta-training Vtr → Vval" vs "GIL /w learning on Vtr") yields a gain of 2.9%, which indicates that a good inference capability can be learnt through meta-optimization. It is worth noting that GIL adopts a meta-optimization strategy to learn the inference model, which is a process of migrating from a training set to a validation set. In other words, the validation set is only used to teach the model itself how to transfer to unseen data. In contrast, conventional methods often employ a validation set to tune the parameters of a model of interest.

Table 3: Performance comparisons with several GIL variants and the classical GCN method on the Cora dataset.

| Method | Setting | Acc. (%) |
|---|---|---|
| GCN (Kipf & Welling, 2017) | /w learning on Vtr | 81.4 |
| | /w jointly learning on Vtr & Vval | 84.0 |
| GIL | /w learning on Vtr | 83.3 |
| | /w meta-training Vtr → Vval | 86.2 |
| GIL + mean pooling | /w 1 conv. layer | 84.5 |
| | /w 2 conv. layers | 86.2 |
| | /w 3 conv. layers | 85.4 |
| GIL + 2 conv. layers | /w max-pooling | 85.2 |
| | /w mean pooling | 86.2 |

Network settings: We explore the effectiveness of our proposed GIL with the same mean pooling mechanism but different numbers of convolutional layers, i.e., "GIL + mean pooling" with one, two, and three convolutional layers, respectively. As shown in Table 3, the proposed GIL with two convolutional layers obtains a better performance on the Cora data than the other two network settings (i.e., GIL with one or three convolutional layers).
For example, the performance of "GIL /w 1 conv. layer + mean pooling" is 1.7% lower than that of "GIL /w 2 conv. layers + mean pooling" on the Cora dataset. Furthermore, we report the classification results of our proposed GIL using the mean and max-pooling mechanisms, respectively. GIL with mean pooling (i.e., "GIL /w 2 conv. layers + mean pooling") obtains a better result than GIL with max-pooling (i.e., "GIL /w 2 conv. layers + max-pooling"), e.g., 86.2% vs 85.2% on the Cora graph dataset. The reason may be that the graph network with two convolutional layers and the mean pooling mechanism can obtain the optimal graph embeddings, whereas when the number of network layers increases, more parameters of the graph model need to be optimized, which may lead to over-fitting.

Influence of different between-node steps: We compare the classification performance within different between-node steps for our proposed GIL and GCN (Kipf & Welling, 2017), as illustrated in Fig. 2(a). The length of between-node steps is computed as the shortest path between reference nodes and query nodes. When the step between nodes is small, both the GIL and GCN methods can predict the category information for only a small part of the unlabeled nodes in the testing set. The reason may be that the node category information can be disturbed by nearest neighboring nodes with different labels, and few nodes lie within 1 or 2 steps in the testing set. The GIL and GCN methods can infer the category information for a part of the unlabeled nodes by relying on node attributes when two nodes are not connected in the graph (i.e., step = ∞). As the length of the reachability path increases, the inference process of the GIL method becomes more difficult and more graph structure information is employed in the prediction process. GIL outperforms the classic GCN across the different between-node steps, which indicates that our proposed GIL has a better inference capability than GCN by using the meta-optimization mechanism from training nodes to validation nodes.

Figure 2: (a) Performance comparisons within different between-node steps on the Cora dataset. The accuracy equals the number of correctly classified nodes divided by the number of all testing samples, and is accumulated from step 1 to step k. (b) Performance comparisons with different label rates on the Pubmed dataset.

Influence of different label rates: We also explore the performance of the GIL method with different label rates, and the detailed results on the Pubmed dataset are shown in Fig. 2(b). When the label rate is increased multiplicatively, the performance of both GIL and GCN improves, but the relative gain narrows. The reason is that the reachable path lengths between unlabeled nodes and labeled nodes are reduced as the number of labeled nodes increases, which weakens the effect of inference learning. In the extreme case, the labels of unlabeled nodes could be determined by neighbors within 1-2 steps of reachability. In summary, our proposed GIL method is preferable when only a small ratio of nodes is labeled in the semi-supervised node classification task.

Inference learning process: Classification errors at different iterations on the validation set of the Cora dataset are illustrated in Fig. 3. Classification errors decrease rapidly from the beginning to 400 iterations, and then descend slowly from 400 to 1,200 iterations.
This demonstrates that the knowledge learned from the training samples can be transferred to infer node category information from the reference labeled nodes. The performance of semi-supervised classification can be further increased by improving the generalization capability of the graph CNN model.

Figure 3: Classification errors at different iterations on the validation set of the Cora dataset.

Table 4: Performance comparisons with different modules on the Cora dataset, where fe, fP, and fr denote node representation, path reachability, and structure relation, respectively.

| fe | fr | fP | Acc. (%) |
|---|---|---|---|
| - | - | - | 56.0 |
| ✓ | - | - | 81.5 |
| ✓ | ✓ | - | 85.0 |
| ✓ | ✓ | ✓ | 86.2 |

Module analysis: We evaluate the effectiveness of the different modules within our proposed GIL framework, including the node representation fe, the path reachability fP, and the structure relation fr. Note that the last one, fr, is defined on top of the former two, so we consider the cases in Table 4 by adding modules incrementally. When none of the modules is used, only the original node attributes are used to predict labels. The case of only using fe corresponds to the GCN method, which achieves 81.5% on the Cora dataset. The large gain from adding the relation module fr (i.e., from 81.5% to 85.0%) may be attributed to the ability of inference learning on attributes as well as local topological structures, which are implicitly encoded in fe. The path information fP can further boost the performance by 1.2%, e.g., 86.2% vs 85.0%. This demonstrates that all three modules of our method improve the graph inference learning capability.

Computational complexity: The cost of our proposed GIL is mainly spent on the computations of the node representation, between-node reachability, and class-to-node relationship, which are about $O((n_{tr} + n_{te}) \cdot e \cdot d_{in} \cdot d_{out})$, $O((n_{tr} + n_{te}) \cdot e \cdot P)$, and $O(n_{tr} \cdot n_{te} \cdot d_{out}^2)$, respectively, where $n_{tr}$ and $n_{te}$ refer to the numbers of training and testing nodes, $d_{in}$ and $d_{out}$ denote the input and output dimensions of the node representation, $e$ is approximately the average degree of a graph node, and $P$ is the step length of node reachability. Compared with the classic graph CNNs (Kipf & Welling, 2017), our proposed GIL has a slightly higher cost due to the extra inference learning process, but it completes the testing stage within several seconds on these benchmark datasets.

6 CONCLUSION

In this work, we tackled the semi-supervised node classification task with a graph inference learning method, which can better predict the categories of unlabeled nodes in an end-to-end framework. We build a structure relation to obtain the connection between any two graph nodes, encapsulating node attributes, between-node paths, and graph structure information together. To better capture transferable knowledge, our method further learns to transfer the knowledge mined from the training samples to the validation set, finally boosting the prediction accuracy for the labels of unlabeled nodes in the testing set. The extensive experimental results demonstrate the effectiveness of our proposed GIL for solving the semi-supervised learning problem, even in the few-shot paradigm. In the future, we will extend the graph inference method to handle more graph-related tasks, such as graph generation and social network analysis.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (Nos. 61972204, 61906094, U1713208), the Natural Science Foundation of Jiangsu Province (Grant Nos. BK20191283 and BK20190019), and the Tencent AI Lab Rhino-Bird Focused Research Program (No. JR201922).

REFERENCES

Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-GCN: Multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888, 2018.

James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NeurIPS, pp. 1993-2001, 2016.

Karsten M. Borgwardt, Hans-Peter Kriegel, S. V. N. Vishwanathan, and Nicol N. Schraudolph. Graph kernels for disease outcome prediction from protein-protein interaction networks. In Pacific Symposium on Biocomputing, pp. 4-15, 2007.

Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2):172-188, 2008.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844-3852, 2016.

Jian Du, Shanghang Zhang, Guanhang Wu, José M. F. Moura, and Soummya Kar. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370, 2017.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126-1135, 2017.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, pp. 1025-1035, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770-778, 2016.

Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Jiatao Jiang, Zhen Cui, Chunyan Xu, and Jian Yang. Gaussian-induced convolution for graphs. In AAAI, volume 33, pp. 4007-4014, 2019.

Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. Marginalized kernels between labeled graphs. In ICML, pp. 321-328, 2003.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In ICLR, 2016.

Wei Liu, Junfeng He, and Shih-Fu Chang. Large graph construction for scalable semi-supervised learning. In ICML, 2010.

Wei Liu, Jun Wang, and Shih-Fu Chang. Robust and scalable graph-based semisupervised learning. Proceedings of the IEEE, 100(9):2624-2638, 2012.

Zhiling Luo, Ling Liu, Jianwei Yin, Ying Li, and Zhaohui Wu. Deep learning of graphs with ngram convolutional neural networks. IEEE Transactions on Knowledge and Data Engineering, 29(10):2125-2139, 2017.

Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pp. 5115-5124, 2017.

Christopher Morris, Kristian Kersting, and Petra Mutzel. Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In ICDM, pp. 327-336, 2017.

Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, pp. 2014-2023, 2016.
Francesco Orsini, Daniele Baracchi, and Paolo Frasconi. Shift aggregate extract networks. arXiv preprint arXiv:1703.05537, 2017.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, 1999.

Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. Adversarially regularized graph autoencoder for graph embedding. In IJCAI, pp. 2609-2615, 2018.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93-93, 2008.

Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488-495, 2009.

Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.

Kiran K. Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735, 2018.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.

Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In CVPR, pp. 5410-5419, 2017.

Pinar Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In SIGKDD, pp. 1365-1374, 2015.

Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In IJCAI, pp. 3634-3640, 2018.

Tong Zhang, Zhen Cui, Chunyan Xu, Wenming Zheng, and Jian Yang. Variational pathway reasoning for EEG emotion recognition. In AAAI, 2020.

Wenting Zhao, Zhen Cui, Chunyan Xu, Chengzheng Li, Tong Zhang, and Jian Yang. Hashing graph convolution for node classification. In CIKM, 2019.

Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In NeurIPS, pp. 321-328, 2004.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pp. 912-919, 2003.

Chenyi Zhuang and Qiang Ma. Dual graph convolutional networks for graph-based semi-supervised classification. In WWW, pp. 499-508, 2018.