# deep_collective_inference__8dd71850.pdf Deep Collective Inference John Moore, Jennifer Neville Departments of Computer Science and Statistics Purdue University, West Lafayette, IN Email: moore269, neville@purdue.edu Collective inference is widely used to improve classification in network datasets. However, despite recent advances in deep learning and the successes of recurrent neural networks (RNNs), researchers have only just recently begun to study how to apply RNNs to heterogeneous graph and network datasets. There has been recent work on using RNNs for unsupervised learning in networks (e.g., graph clustering, node embedding) and for prediction (e.g., link prediction, graph classification), but there has been little work on using RNNs for node-based relational classification tasks. In this paper, we provide an end-to-end learning framework using RNNs for collective inference. Our main insight is to transform a node and its set of neighbors into an unordered sequence (of varying length) and use an LSTM-based RNN to predict the class label as the output of that sequence. We develop a collective inference method, which we refer to as Deep Collective Inference (DCI), that uses semi-supervised learning in partially-labeled networks and two label distribution correction mechanisms for imbalanced classes. We compare to several alternative methods on seven network datasets. DCI achieves up to a 12% reduction in error compared to the best alternative and a 25% reduction in error on average over all methods, for all label proportions. Introduction Collective inference is widely used to improve classification in network datasets (see e.g., Macskassy and Provost 2007). This is because many network datasets have nodes with attributes values that are correlated across the links. For example, in protein-protein interaction networks, the functions of interacting proteins in the cell are typically correlated. Similarly in social networks, friends tends to share similar interests and preferences. Thus in a partially labeled network where the attribute values of some nodes are observed, but others are unobserved, it is often helpful to learn a statistical relational model (see e.g., Getoor and Taskar 2007) and apply the model using collective classification (see e.g., Sen et al. 2008) to jointly make predictions about the set of unlabeled nodes. Recently, the use of recurrent neural networks (RNNs) and deep learning for both supervised and unsupervised Copyright c 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. tasks have produced significant performance gains across domains such as speech translation, image processing, and natural language processing (Lipton, Berkowitz, and Elkan 2015; Chung et al. 2014; Graves and Jaitly 2014; Vinyals et al. 2014; Ren, Kiros, and Zemel 2015). While research on neural network models has been active for decades, recent achievements with the models are due to the availability of larger datasets, combined with several insights on how to structure the form of the model, initialize the weights, guide the optimization process to avoid overfitting, and fix problems such as vanishing gradients. However, in the majority of domains where RNNs have been applied successfully, the examples are structured either as vectors, sequences, or matrices. Due to heterogeneous structure of graph data, it is still a relatively open question as to how to best design and exploit RNNs for learning in graphs with heterogeneous structure. There has been work on using neural networks for unsupervised learning in networks (e.g., graph clustering (Tian et al. 2014), node embeddings (Perozzi, Al-Rfou, and Skiena 2014; Tang et al. 2015)). However, when these methods have been applied for classification (e.g., link prediction (Li et al. 2014), graph classification (Yanardag and Vishwanathan 2015), node classification (Grover and Leskovec 2016)), it is typically based on using the output of unsupervised learning as features in a basic predictive model (e.g., logistic regression). The majority of previous methods for collective classification have been based on graphical models (e.g., Taskar, Abbeel, and Koller 2002; Richardson and Domingos 2006; Neville and Jensen 2007). There has been relatively little work on neural network models for supervised, end-to-end classification in partially-labeled relational graphs. One exception is the work of Monner and Reggia (2013) on Recurrent Neural Collective Classification (RNCC). In this work, we develop an RNN for semi-supervised collective inference in attributed networks. Specifically, we consider the node classification problem, where given a single partially-labeled attributed network, the goal is to learn a model to jointly predict the remaining unlabeled nodes in the network. We propose an RNN approach that uses semisupervised learning to jointly model relational structure and attributes. Our main insights are: (1) to transform a node and its set of neighbors into a random order (i.e., sequence of varying length) and use an LSTM-based RNN (Hochreiter Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) and Schmidhuber 1997) to predict the class label as the output of that sequence, (2) to use a data augmentation or objective function balancing to adjust for skewed class label distributions, and (3) to initialize predictions of unlabeled nodes with a basic relational RNN, instead of using only the known class label values for the first round of learning. We compare our proposed method to several baselines and alternatives, including state-of-the-art approaches to semi-supervised relational learning (Pfeiffer III, Neville, and Bennett 2015), network embedding (Grover and Leskovec 2016), and RNCC (Monner and Reggia 2013). We show that our approach achieves a significant reduction in classification error. We consider seven network datasets and observed up to a 12% reduction in error compared to the best alternative and a 25% reduction in error on average over all competing methods, for all label proportions. Our main contributions are the following: We develop an LSTM-based RNN for node-based relational learning and collective inference, which we refer to this method as Deep Collective Inference (DCI). We show DCI outperforms state-of-the-art methods for semi-supervised collective classification using: graphical models, node embeddings, and recurrent neural networks. We evaluate the efficacy of our modeling choices and show that: (1) sequences of neighbors based on random orderings work better than ordering by connectivity, (2) a combination of data augmentation and cross entropy balancing helps to offset class label skew significantly during learning, (3) initializing class label predictions for unknown labels using a basic relational model improves performance compared to stacking (Kou and Cohen 2007) where class probabilities are used as features instead. Background We define a graph G = V, E where vi V with i [1, n] is a node and E V V is the edge set. If eij E, there is an edge between vi and vj, otherwise there is not. Let Ni correspond to the set of neighbors of vi, that is Ni = {vj | eij E}. Let F, Y be the feature and label set over the nodes, respectively. Each vi V has a corresponding feature vector fi F. For relational classification, the input network is partially labeled and thus only some of the nodes have an associated class label (i.e., if yj Y then vj is labeled). The goal of relational classification is to learn a model from the partially labeled network and use the model to make predictions ˆy for the unlabeled nodes {vk} s.t. yk / Y. In this work we assume that Y is binary and can only take values {0, 1}. Moreover each prediction ˆyi [0, 1] represents the probability of that the class label value for vi is 1. Let VU, VL refer to nodes that are unlabeled and labeled, respectively. Recurrent Neural Networks RNNs have been used extensively in sequence prediction problems. The vanilla RNN can be described by the following equations: h(t) = σ(W hxx(t) + W hhh(t 1) + bh) ˆy(t) = softmax(W yhh(t) + by) softmax(z)j = ezj ezi for the jth entry in z where t is the sequence length, x(t) is the current element in the sequence, and h(t 1) is the network s previous state. W hx, W hh, and W yh are the matrices of weights between input and hidden layer, hidden to hidden layers, and hidden to output layers. bh and by are bias parameters. σ is the continuous, differentiable activation function. For prediction, the output ˆy(t) at each timestep t is calculated given current hidden state h(t). RNNs are known to have problems with exploding or vanishing gradients. However, the LSTM architecture overcomes the exploding/vanishing gradient problem by carefully designing the hidden units such that gradients behave well (Hochreiter and Schmidhuber 1997). Thus, in this work we use the LSTM architecture. Deep Collective Inference In this paper, we develop a deep collective inference (DCI) method, which uses an RNN for collective classification in relational network data. We refer to weights in any RNN as both the weights and bias terms. Problem Definition and Input Specification We assume as input a partially labeled graph < G, F, Y >. The goal is to learn a predictive model from the labeled nodes VL and use the model to make predictions for the unlabeled nodes VU = V VL. We first specify a noncollective version of our method to generate seed predictions known as Deep Relational Inference (DRI). Examples are constructed as follows: for a node vi, the target output is the class label yi and the input is the node s features fi and the features of its neighbors {fj | vj Ni}. We form a model to learn a mapping of the inputs [fi, {fj}vj Ni] to the output yi. Since each node has a different number of neighbors, we will transform the input into an unordered sequence (of varying length). First, we randomly order the list of neighbors (i.e., [vj1, vj2, ..., vj|Ni|]) and then we use the associated features as a sequential input to the model: xi = [fj1, fj2, ..., fj|Ni|, fi] = [x(0) i , x(1) i , ..., x(|Ni|) i ] (1) Note the last feature vector in the sequence fi corresponds to the features of the target node. Here x(t) refers to the tth neighbor (i.e., element) in the sequence and x(|Ni|) i refers to the target node. Thus the resulting inputs to the RNN will be a set of feature vector sequences: XL = {xi | vi VL}. We train DRI using a canonical RNN learning algorithm outlined in Algorithm 1. Then we perform inference to generate seed predictions on the test set, ˆyi 0 vi VU. Predictions are obtained from the class with highest probability for vi. RNN for Collective Inference To extend DRI to the collective inference setting we augment each feature vector with a class label value, i.e., xi = [fj1, yj1], [fj2, yj2], ..., [fi, ˆyi] xd = [< fb,yb >, < ff,yf >, < fi,yi >, < fe,ye >, < fa,ya >, < fd,yd > ] = [xd , xd , xd , xd , xd , xd ] (0) (1) (2) (3) (4) (5) (a) Example network and unordered sequence for node vd. x(0) x(1) x(|Ɲ|-1) x(|Ɲ|) (b) Sequential structure of RNN. f0 f1 ... fp y h1 ... hw r1 (c) Internal details of RNN structure at end of sequence. Figure 1: Illustration of model structure. = [x(0) i , x(1) i , ..., x(|Ni|) i ] (2) In this case, if vj VL then we append the true class label yj. Otherwise, we can append ˆyj. For the target instance vi, we use ˆyi, since using the true class would lead to obvious overfitting. This is illustrated for a small network in Figure 1a, where the sequence for node vd is specified based on a random ordering of its neighbors. The structure of the RNN is illustrated in Figure 1b and 1c. r is the previous recurrent output and h refers to the hidden nodes. Any generic RNN architecture can be used; however, we specify an LSTM so that gradients behave well with hidden node size, w = 10 used in evaluations. Similar to other templated models, the length of the model is determined by the length of each sequence. Each square in Figure 1b represents a local RNN, with parameters W that are shared across each the replication in the sequence. Figure 1c shows the details of the local RNNs at the end of the sequence. The input consists of the feature vector for the example at that point in the sequence the length depends on the number of node features in G. On the final hidden output of the RNN, the w outputs are aggregated using softmax to produce a predicted class label ˆy. Label Skew Corrections Skewed class label distributions motivated us to explore two different ways of correcting for skewed labels. The first involves generating more data from the rare class via data augmentation, while the second involves changing the objective function to balance between the classes. Our algorithm learns which method to use by evaluating their performance on a held out dataset where labels are known. We call this held out set VLV2 in Algorithm 3. It is constructed from a portion of the original validation set. Data Augmentation In image classification, researchers perform data augmentation by adding noise in some way to an image and using the noisy image as another training example (Krizhevsky, Sutskever, and Hinton 2012). We follow this approach and generate additional training examples by replicating examples (nodes) and then add noise by swapping some attribute values (neighbors) across examples from the same class. Algorithm 2 details our approach to data augmentation. It randomly selects two examples from the minority class, duplicates them, and then swaps at most 50% of the x attribute Algorithm 1 RNNTrain(Xt, Xv, Y, max Itr, perform Swap) 1: if perform Swap then 2: Specify Canonical Cross Entropy as objective 3: else 4: Specify Balanced Cross Entropy as objective 5: end if 6: repeat 7: Initialize the structure of the RNN with random gaussian weights W. 8: Intialize Scores = int array of length max Itr ; Initialize te = 0 ; 9: Train RNN with 1 epoch over Xt to optimize over labels using specified objective function 10: Scores[te] = Calculate loss on Xv ; te++ ; 11: until [early Stop Met(Scores, tol)] OR [te max Itr] 12: return RNN with learned weights W values across the vectors. This corresponds to randomly swapping the neighbors of the two duplicated nodes. In the line 23 of Alg. 2, the augmented dataset Xret is formed by downsampling to get to the original size of XL. We use downsampling to ensure a fair comparison (i.e., equal training sizes) to other methods that do not use data augmentation. The algorithm returns the set Xret, which has a balanced class distribution. We can then simply train on Xret whenever an RNN is trained in Algorithm 3. Balanced Cross Entropy We use balancing to adapt the objective function to imbalanced classes and therefore gradient updates when performing backpropgation. Here, we specifically modify the Cross Entropy objective function, but the adaptation is general and can be applied to any objective function, which has separate terms for each class. Recall that the cross entropy objective is: i=1 yilog( ˆyi) + (1 yi)(log(1 ˆyi)) where yi is the true label of node vi and ˆyi is the corresponding prediction. To balance the gradient updates, we can simply take the frequency of positive labels (C+) and negative labels (C ) in the training and validation sets and use these to scale the terms in the objective function. The new balanced objective function is then: i=1 C yilog( ˆyi) + C+(1 yi)(log(1 ˆyi)) Algorithm 2 Swap Aug Method(XL, YL) 1: counts0 = count Classes(YL, 0) 2: counts1 = count Classes(YL, 1) 3: small Label = 0 4: if counts0>counts1 then 5: small Label = 1 6: end if 7: class Diff = | counts0 counts1 | 8: Let XS XL be the set of sequences that have small Label 9: initialize new X to list of size=class Diff+1 10: j = 0 11: while j < length(new X) do 12: j+= 2 13: n1, n2 = Select two integers at random [0, length(XS) 1] 14: xt1 = copy(XS[n1]), xt2 = copy(XS[n2]) 15: num Swaps = min(length(xt1)/2, length(xt2)/2) 16: t1Idxs = sample num Swap integers [0, length(xt1) 1] 17: t2Idxs = sample num Swap integers [0, length(xt2) 1] 18: for each (t1idx, t2idx) in (t1Idxs, t2Idxs) do 19: swap x(t1idx) t1 , x(t2idx) t2 20: end for 21: Append xt1, xt2 to new X 22: end while 23: Xret = Sample (counts0+counts1) sequences randomly from new X + XL 24: return Xret In our experiments C+ C , but the adapted objective is general and isn t specific to a given class. Intuitively in our case, since positive labels occur less frequently, we want to upweight their effect by using C , and since negative labels occur more frequently, we want to downweight their effect by C+. The net effect is that both sets of positive and negative examples have equivalent impact on the gradient updates. The modified objective is used in Algorithm 1 if specified. DCI Semi-supervised Learning We now outline our algorithm to learn a deep collective inference (DCI) model. Algorithm 4 is a wrapper method that chooses the mechanism DCI should use to adjust for imbalanced classes. Nodes with yi YL are split into training VLT, validation 1 VLV1 , and validation 2 VLV1 in line 1. In lines 2-3, we learn a model using either swapping or the balanced objective by calling Algorithm 3. In order to select the Swapping or Balanced Cross Entropy approach, we simply return the model that performs best on the held out validation set VLV2 (lines 4-8). Algorithm 3 describes how to learn a DCI model from a partially-labeled network with a specified approach to adjusting for imbalanced class distributions. First, a DRI model is learned and applied to initialize the unlabeled prediction set Y0 U (lines 1-8). Then, the algorithm starts the semi-supervised collective inference process (lines 12-26). One iteration of collective inference consists of the following steps. New attribute input examples, X and Xval, are formed by concatenating the current predictions Ytc U and the labels YL to F (lines 16-17). Note that neighbor orders are random when constructing X on each collective iteration. If Algorithm 3 DCI Apply(G, F, Y, VU, VLT, VLV1 , VLV2 , max Itr, tol, perform Swap) 1: //Train a DRI RNN to initialize seed predictions 2: Form X by constructing an unordered input sequence xi from train Nodes, F, Y. 3: Form Xval similarly except with VLV1 4: if perform Swap then 5: X = Swap Aug(X, YL) 6: end if 7: M = RNNTrain(X, Xval, YL, max Itr, perform Swap) 8: Use M to predict ˆy0 i for each vi VU 9: Intialize both Scores V1, Scores V2 = int arrays of length max Itr; tc = 0 ; 10: Initialize the structure of the RNN with random weights Wtc. 11: //Start collective learning 12: repeat 13: if tc! = 0 then 14: Set Wtc = Wtrained tc 1 15: end if 16: Form X by constructing xi from VLT, F, Y, ˆYtc 17: Form Xval similarly except with VLV1 18: if perform Swap then 19: X = Swap Aug( X, YL) 20: end if 21: Let RNN, Wtrained tc = RNNTrain( X, Xval, YL, max Itr, perform Swap) 22: Use the RNN to predict ˆytc+1 i for each vi VU to form ˆYtc+1 23: Scores V1[tc] = Calculate loss on VLV1 24: Scores V2[tc] = Calculate loss on VLV2 25: tc++ 26: until [early Stop Met(Scores V1, tol)] OR [tc max Itr] 27: Let tbest be best collective model on VLV1 based on BAE 28: return RNN with learned weights Wtbest, predictions ˆYtbest, Scores V2[tbest] Swap Aug is used in line 19, then X is transformed to reflect swapping. Otherwise, it is not transformed and CEB is instead used as the objective function. Next parameters are re-estimated (line 21) by training the RNN. Let the parameters of the RNN at iteration tc be Wtc. We perform backpropagation and utilize early stopping methods based on the validation set. Any type of early stopping method (denoted early Stop Met) on the validation set can be used. For each iteration, tc, we obtain ˆytc i for each vi VU from the collective RNN by performing inference on the unlabeled set (line 22). The predictions on the unlabeled set are used in the next collective iteration. Except for the first iteration, note that we initialize weights according to the previous iteration s trained weights, i.e., let Wtc+1 = W trained tc instead of initializing randomly in lines 13-15. We repeat collective inference until the early stopping criteria is met on the validation set, VLV1 , or until max Itr is reached. We return the RNN with predictions that performed best in terms of BAE on the validation set VLV1 (line 28). The DCI algorithm is simultaneously learning the parameters of the RNN and making predictions for the unlabeled nodes thus it is a semi-supervised approach to learning based on the full, albeit partially-labeled, graph. In addition, DCI uses an initialization approach where the predictions are initialized with the DRI model, which DCI uses on its first iteration of collective inference. Algorithm 4 DCI(G, F, Y, VU, val1%, val2%, max Itr, tol) 1: Form VLT, VLV1 , and VLV2 from YL based on val1% and val2% 2: DCI S RNN, ˆ Y tbest S , score S = DCI Apply(G, F, Y, VU, VLT, VLV1 , VLV2 , max Itr, tol, True) 3: DCI B RNN, ˆ Y tbest B , score B = DCI Apply(G, F, Y, VU, VLT, VLV1 , VLV2 , max Itr, tol, False) 4: if score S < score B then 5: return DCI S RNN, ˆ Y tbest S 6: else 7: return DCI B RNN, ˆ Y tbest B 8: end if There are a number of DCI variations that are possible depending on different algorithmic decisions. We evaluate the algorithmic choices and found that they each improve performance significantly. See Section Empirical Evaluation for more details. Time Complexity DCI is scalable especially if run on a GPU. DCI takes O(|E| + |V|) time since it s bottleneck is performing |E| + |V| backpropagations. This could potentially be reduced to b |V | if performing truncated backprop to only b steps, which essentially corresponds to throwing away sequence input for large degree nodes. Related Work Relational Machine Learning Relational machine learning (RML) methods seek to jointly model user labels given their attributes and relational structure (Getoor and Taskar 2007). In particular, semisupervised RML approaches have been developed for partially-labeled networks (Xiang and Neville 2008; Mc Dowell and Aha 2012; Lin and Cohen 2010). Many semisupervised approaches to RML perform Expectation Maximization (EM), which can be divided into two basic steps: an E-Step that uses collective classification and an M-Step that optimizes the parameters given the current predicted labels. Current state-of-the-art methods use pseudolikelihood EM with a maximum entropy constraint in the inference step to produce better calibrated probability estimates (Pfeiffer III, Neville, and Bennett 2015). This method, which we refer to as PLEM, is the current best performing RML method for large-scale partially-labeled networks. Neural Network Models for Graphs Some recent work has used neural network models for graph clustering (Tian et al. 2014), labeling graphs (Yanardag and Vishwanathan 2015), and link prediction (Li et al. 2014). However, this work typically focuses on learning models of the graph structure alone and has not considered the development of node-based predictive models. There has been some work on unsupervised learning of node embeddings, which are then used afterwards for learning predictive models for nodes. Line (Tang et al. 2015) uses a two-stage approach to learning an embedding such that nodes that are either well-connected or share many neighbors are close. Structural Deep Network Embedding (SDNE) (Wang, Cui, and Zhu 2016) generates a network embedding via an auto-encoder architecture where 1st and 2nd order neighbors are used in their objective function. Deep Walk (Perozzi, Al-Rfou, and Skiena 2014) uses random walks and a skip-gram based approach. Node2Vec (Grover and Leskovec 2016) extends the skip-gram architecture from Deep Walk and performs various sampling strategies to sample neighborhoods differently. Currently, SDNE and Node2Vec are state of the art node embedding methods. While these methods have been shown to produce an embedding that is useful for subsequent classification, they do not directly learn to optimize class label predictions (i.e., the embedding is unsupervised) and moreover, they do not consider node attributes in their models. The Graph Neural Network (GNN) (Scarselli et al. 2009) is a specially designed recurrent neural network for graph data, which assumes node attributes and optionally specifies attributes for edges. The model takes a whole graph as input to propagate information about each node. Unfortunately, GNNs cannot be directly applied to the node classification problem since GNN training examples consist of entire labeled graphs. The work most closely related to ours is the Recurrent Neural Collective Classification (RNCC) method (Monner and Reggia 2013). During our initial work on DCI, we were unaware of the RNCC method, but the two methods are similar in that they both use neighbor information in LSTMs for collective classification. However, there are a number of key differences between the two methods: Data: RNCC does not adjust for imbalanced class label distributions, while DCI does. Architecture: RNCC s LSTM architecture models a node vi s features separately from its neighbors features (i.e., with separate weights). DCI s architecture models them jointly with shared weights. In addition, RNCC uses the hidden representation of neighbors as input features, while DCI does not. Learning: RNCC performs collective inference on every epoch of learning (i.e., epoch=collective iteration). In contrast, DCI waits until parameter estimation has locally converged in terms of epochs in order to perform the next round of collective inference. Also, RNCC uses a generalized LSTM learning algorithm (Monner and Reggia 2012), while DCI uses the standard LSTM backpropgation through time algorithm (Lipton, Berkowitz, and Elkan 2015). In our experimental evaluation next, we compare to RNCC and also investigate the impact of a number of these algorithmic decisions in ablation studies. Empirical Evaluation Data We employ seven datasets in our evaluations. Table 1 reports the number nodes, edges, density, and positive label proportion of each dataset. Dataset |V | |E| density P(+) Facebook 5906 73,374 4.2e-3 0.32 IMDB 7934 122,230 3.9e-3 0.164 Amz DVD 20000 16,118 75,596 5.8e-3 0.5 Amz DVD 7500 16,118 75,596 5.8e-3 0.21 Amz Music 64500 56,891 272,544 1.7e-4 0.5 Amz Music 7500 56,891 272,544 1.7e-4 0.08 Patents 881,187 5,302,712 1.4e-5 0.169 Table 1: Datasets The Facebook dataset is a snapshot of the Purdue University Facebook network (Pfeiffer III, Neville, and Bennett 2015). Users have two attributes: religious views, and gender, and a class label: political views. The Internet Movie Database (IMDB) is a movie dataset where we predict if a movie will have a gross revenue $50 million (Pfeiffer III, Neville, and Bennett 2015). Each movie (node) has 29 genre attributes and 9 boolean variables that record whether the average rating is greater than a particular value. Edges are created between movies that share two or more producers. Amazon DVD 20000 is a subset of the Amazon copurchase data gathered by (Leskovec, Adamic, and Huberman 2007). Nodes correspond to DVD items and edges are created via DVD copurchases. Each node has 24 attributes describing the movie s genres. The prediction task is to determine whether the item has an Amazon salesrank < 20000. Amazon DVD 7500 is the same as Amazon DVD 20000 except that the threshold for class labels is salesrank < 7500. This changes the class label distribution. Amazon Music 64500 dataset is another subset of the Amazon data where nodes are song items, and each node has 22 attributes describing its styles. The class label threshold is salesrank < 64500. Amazon Music 20000 uses the threshold for class labels is salesrank < 7500 The Patents citation network (Pfeiffer III, Neville, and Bennett 2015) consists of patent nodes and citations among them. Each patent has an attribute vector which records the TF/IDF values of it s top 50 words. We consider the Computers classification task where patents are predicted to be filed in Primary Category 2 (computer related) or not. Methodology Our metric for algorithmic comparison is the Balanced Absolute Error (BAE), which normalizes error across classes. See Pfeiffer III, Neville, and Bennett (2015) for more detail. We run Algorithm 3 and plot performance for DRI and DCI using LSTMs. We use 10 trials for all datasets. For each trial, nodes are randomly assigned to the labeled set, VL, Then the unlabeled set is formed from the remainder: VU = V VL. The following plots show performance as the proportion of the data used in the training set size is varied (i.e., |VL| V ). For example, 0.2 corresponds to 20% labeled nodes and 80% unlabeled in the network. For DCI, 12% and 3% of nodes of the whole dataset is used for VLV1 and VLV1 , respectively, which accounts for a total of 15% for validation. Thus when the training proportion is 0.2, DCI uses 5% for training, 15% for validation, and 80% for testing. Degree is concatenated to each fi F to compare to previous work (Pfeiffer III, Neville, and Bennett 2015). For our implementation, we use Theano under the library known as Blocks (van Merri enboer et al. 2015). All evaluations are performed using three computer clusters with 20 Xeon cores each and memory ranging from 64gb-256gb ram. We compare our proposed method DCI to baselines and state-of-the-art alternative methods. Unless otherwise specified, each method uses all available labeled nodes, VL, for training. LP: A label propagation baseline that uses a weighted vote of the predicted/true labels of neighbors to make predictions (Macskassy and Provost 2007). LR: An independent logistic regression baseline that uses only node features (no relational data) to predict the label optimized using L2 regularization. LR+N2V: LR but utilizing new attributes by concatenating node2vec (Grover and Leskovec 2016) features of size 128 and the original attribute vectors of each node. PLEM: The Max Ent Inf adjusted semi-supervised relational learning method from (Pfeiffer III, Neville, and Bennett 2015), specially designed for data with imbalanced class labels which is currently state-of-the art. PLEM+N2V: PLEM with added node2vec features. Note that this is not an existing method, we add node embedding features to PLEM to ascertain whether they improve relational learning methods. DCI: Weights of LSTMs are initialized (except after first iteration of collective inference) with sampled values from the Gaussian distribution. We apply Batched Gradient Descent where batch size = 100. The maximum number of epochs for any network is 200. Early Stopping is used to check if performance on the validation set does not improve in the last 10 epochs. If no improvement, training stops early, and the model performing best on the validation set is chosen. DCI was run for 100 collective iterations with the same early stopping criterion. We used w = 10 (number of hidden nodes). We did not perform hyperparameter optimization, though this could further improve results. DCI further splits VL into training VLT, validation 1 VLV1 , and validation 2 VLV2 sets. RNCC: RNCC is a similar model to DCI (Monner and Reggia 2013). We implement a version as close to it as possible. We use their RNN architecture where a given node vi s attributes ai are modeled via separate weights than neighborhood node s attributes. We utilize hidden states learned from a non-collective version of RNCC as the first hidden states to be used in collective inference. However, we use a canonical LSTM and standard backpropagation for learning, rather than the specific learning rules from (Monner and Reggia 2013). RNCC is learned with exact same hyperparameters and learning settings as (a) Facebook (b) Amazon DVD 20000 (c) Amazon DVD 7500 (d) Amazon Music 64500 (e) Amazon Music 7500 (f) IMDB Figure 2: DCI compared to alternatives LP, LR, RNCC, LR+N2V, PLEM, PLEM+N2V. described for DCI, with the exception of collective iterations. Since epoch=collective iteration for RNCC, we utilize Max Collective iteration = 200. Lastly, training and validation sets are exactly similar to DCI s, except that VLV1 and VLV2 are merged and used as one complete validation set. Comparison to alternative methods We compare DCI to the baselines LP, LR, LR+N2V, RNCC, PLEM, and PLEM+N2V. Figure 2 reports the results. The mean and standard errors of the BAE measure are plotted on each dataset. We use the same train/test splits in each model. It s important to note that amazon DVD 20000 and amazon Music 64500 both have about 50/50 class distributions, while the rest of the datasets are imbalanced. In these cases, the performance gap between DCI and competing methods is significant and clear. Amazon DVD 7500 is the only dataset where DCI does not outperform PLEM+N2V on the majority of label proportions. However, DCI does outperform it when the training size is high enough. This is possibly due to small data, large class imbalance, and/or low network density. These may not be problems for PLEM+N2V since node2vec explores more of the network besides just neighbors. However, on all other datasets DCI outperforms other state-of-the art methods for most if not all training/test splits, especially when it is not sparsely labeled. It s interesting to see that the RNCC implementation does not outperform PLEM for most datasets. This indicates that the algorithmic decisions for DCI allow DCI to significantly outperform RNCC. Because of the size of the Patents data, for efficiency we perform DCI-10 and RNCC-10, which only uses a subset of at most 10 neighbors (randomly selected). For DCI-10, we Figure 3: DCI-10 compared to alternatives RNCC, PLEM, PLEM+N2V on Patents dataset. also only run for 10 collective iterations, while keeping the rest of RNCC s learning parameters the same. We compare DCI-10 to PLEM and PLEM+N2V. See Figure 3 for the results. Note that we did not include the more naive methods in the plot, since their BAE values were too high and masked the performance of the other methods. LR has 0.4 BAE and LR+N2V has 0.19 BAE for all proportions. For this network, it is interesting that with only 10 hidden units and truncated backpropagation, DCI-10 eventually outperforms PLEM and PLEM+N2V. We expect further improvement if all neighbors are used. In summary, our evaluation show that, across seven network datasets, DCI resulted in an up to a 12% reduction in error compared to the best alternative (PLEM+N2V). Moreover, DCI achieved an average of 25% reduction in error over all methods, all datasets, and all label proportions. These results demonstrate the impact of our proposed method for improving collective inference in large-scale networks. Comparison to DCI variants We evaluate several variants of our proposed DCI method on smaller datasets to assess initial algorithmic choices. All other variants of DCI do not perform any label distribution corrections and use the whole validation set for early stopping instead of separating out a portion for VLV2 . We test on all but the largest dataset in order to compute personalized page rank vectors for each node, which is computationally intensive on large datasets. DCI-WSB: Learned without data augmentations or cross entropy balancing (WSB) DCI-ST: WSB version learned with stacking predictions instead of using predicted class labels DCI-A: Instead of random ordering, WSB version learned by ordering the top 10 neighbors in ascending order with respect to personalized page rank for each given node where the restart vector always restarts back to the given node (Jeh and Widom 2002) DCI-D: Instead of random ordering, WSB version learned by ordering the top 10 neighbors in descending order with respect to personalized page rank DCI-10-WSB: Uses only a subset of at most 10 neighbors (randomly selected) rather than the full set. DRI-WSB: Learned without label distribution corrections. Figure 4 shows a comparison between DCI, DCI-WSB, DCI-ST, DCI-A, DCI-D, DCI-10-WSB, and DRI-WSB. DRI-WSB performs worst in all evaluations compared to all collective methods, which indicates that collective inference is improving performance. In all datasets, specifying an order and running DCI-A or DCI-D result in about the same or worse performance than all other collective methods, which suggests that a page rank does not improve and sometimes degrades performance. It is not so clear to determine which is best among DCI-ST and DCI-WSB. Therefore, we summarize the gain by calculating reduction in error over all training set proportions. Overall there is small gain of 1.6% comparing DCI-WSB to DCI-ST. This shows a small improvement when we use class label predictions compared to stacking. We also perform DCI-WSB with randomly initializing predictions for the first iteration of collective inference (DCI-WSB-R) and notice these results do not show a clear winner in the plots. Therefore, we perform DCI with swapping (DCI-S), DCI with balancing (DCI-B), and their random initialization variants (DCI-S-R, DCI-B-R). Overall there is a small 0.82% gain when we initialize with predictions from DRI (DCI, DCI-S, and DCI-B) compared to using random initial predictions from the prior (DCI-R, DCIS-R, and DCI-B-R). DCI-WSB has a 0.06% gain over DCIWSB-R. DCI-S has a 1.78% gain over DCI-S-R. DCI-B has a 1.04% gain over DCI-B-R. Discussion and Conclusion Recurrent neural networks have recently produced impressive performance gains but have also been traditionally used for sequential problems where order is generally important and well-suited for structured inputs such as vectors or matrices. In this work, we provided an end-to-end learning framework by using RNNs for collective classification as opposed to a two-stop process of finding a node embedding, then using this representation in another model. Deep Collective Inference (DCI) is developed for semi-supervised learning in partially labeled networks. We proposed a data augmentation scheme and a Balanced Cross Entropy objective to balance the classes, which improves performance. We conducted experiments across seven network datasets with varying levels of label availability and class proportions. We compare to other state-of-the-art methods in relational learning, node embeddings, and RNNs. Our results are clearly superior to other methods except in sparsely labeled networks. DCI provides up to a 12% reduction in error compared to the best state-of-the-art alternative (PLEM+N2V) and a 25% reduction in error on average over the six competing methods, across all label proportions. Acknowledgements This research is supported by NSF under contract numbers IIS-1149789, IIS-1302172, CCF-0939370. We thank Joseph J. Pfeiffer III for helpful suggestions that motivated this work. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. References Chung, J.; G ulc ehre, C .; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. Co RR abs/1412.3555. Getoor, L., and Taskar, B., eds. 2007. Introduction to Statistical Relational Learning. MIT Press. Graves, A., and Jaitly, N. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Jebara, T., and Xing, E. P., eds., ICML 14, 1764 1772. JMLR Workshop and Conference Proceedings. Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In KDD 16. Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Comput. 9(8):1735 1780. Jeh, G., and Widom, J. 2002. Scaling personalized web search. In WWW 12, 271 279. ACM Press. Kou, Z., and Cohen, W. W. 2007. Stacked graphical models for efficient inference in markov random fields. In SDM, 533 538. SIAM. Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS 12. Leskovec, J.; Adamic, L. A.; and Huberman, B. A. 2007. The dynamics of viral marketing. ACM Transactions on the Web (TWEB) 1(1):5. Li, X.; Du, N.; Li, H.; Li, K.; Gao, J.; and Zhang, A. 2014. A deep learning approach to link prediction in dynamic networks. In SDM, volume 14, 289 297. SIAM. (a) Facebook (b) Amazon DVD 7500 (c) IMDB (d) Amazon DVD 20000 (e) Amazon Music 7500 (f) Amazon Music 64500 Figure 4: DCI compared to its variants on Facebook, IMDB, and Amazon datasets Lin, F., and Cohen, W. W. 2010. Semi-supervised classification of network data using very few labels. In ASONAM, 192 199. Lipton, Z. C.; Berkowitz, J.; and Elkan, C. 2015. A Critical Review of Recurrent Neural Networks for Sequence Learning. Ar Xiv e-prints. Macskassy, S. A., and Provost, F. 2007. Classification in networked data: A toolkit and a univariate case study. The Journal of Machine Learning Research 8:935 983. Mc Dowell, L., and Aha, D. 2012. Semi-supervised collective classification via hybrid label regularization. ar Xiv preprint ar Xiv:1206.6467. Monner, D., and Reggia, J. A. 2012. A generalized lstmlike training algorithm for second-order recurrent neural networks. Neural Netw. 25:70 83. Monner, D. D., and Reggia, J. A. 2013. Recurrent neural collective classification. IEEE Transactions on Neural Networks and Learning Systems 24(12):1932 1943. Neville, J., and Jensen, D. 2007. Relational Dependency Networks. JMLR 8:653 692. Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In KDD 14, 701 710. ACM. Pfeiffer III, J. J.; Neville, J.; and Bennett, P. N. 2015. Overcoming relational learning biases to accurately predict preferences in large scale networks. WWW 15. Ren, M.; Kiros, R.; and Zemel, R. S. 2015. Image question answering: A visual semantic embedding model and a new dataset. Co RR abs/1505.02074. Richardson, M., and Domingos, P. 2006. Markov Logic Networks. Mach. Learn. 62(1-2):107 136. Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. Trans. Neur. Netw. 20(1):61 80. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI magazine 29(3):93. Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. Line: Large-scale information network embedding. In WWW 15, 1067 1077. Taskar, B.; Abbeel, P.; and Koller, D. 2002. Discriminative probabilistic models for relational data. In UAI, 485 492. Tian, F.; Gao, B.; Cui, Q.; Chen, E.; and Liu, T.-Y. 2014. Learning deep representations for graph clustering. In AAAI, 1293 1299. van Merri enboer, B.; Bahdanau, D.; Dumoulin, V.; Serdyuk, D.; Warde-Farley, D.; Chorowski, J.; and Bengio, Y. 2015. Blocks and fuel: Frameworks for deep learning. Co RR abs/1506.00619. Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2014. Show and tell: A neural image caption generator. Co RR abs/1411.4555. Wang, D.; Cui, P.; and Zhu, W. 2016. Structural deep network embedding. KDD 16, 1225 1234. Xiang, R., and Neville, J. 2008. Pseudolikelihood em for within-network relational learning. In ICDM, 1103 1108. Yanardag, P., and Vishwanathan, S. 2015. Deep graph kernels. In KDD 15, 1365 1374. ACM.