# Topology Optimization based Graph Convolutional Network

Liang Yang 1,2,3, Zesheng Kang 1, Xiaochun Cao 2, Di Jin 4, Bo Yang 5,3 and Yuanfang Guo 6

1 School of Artificial Intelligence, Hebei University of Technology, China
2 State Key Laboratory of Information Security, Institute of Information Engineering, CAS, China
3 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China
4 College of Intelligence and Computing, Tianjin University, China
5 College of Computer Science and Technology, Jilin University, China
6 School of Computer Science and Engineering, Beihang University, China

yangliang@vip.qq.com, andyguo@buaa.edu.cn

## Abstract

In the past few years, semi-supervised node classification in attributed networks has developed rapidly. Inspired by the success of deep learning, researchers have adapted convolutional neural networks to develop Graph Convolutional Networks (GCN), which achieve surprising classification accuracy by considering the topological information and employing a fully connected network (FCN). However, the given network topology may also induce a performance degradation if it is employed directly in classification, because it may be highly sparse and noisy. Besides, the lack of learnable filters in GCN also limits the performance. In this paper, we propose a novel Topology Optimization based Graph Convolutional Networks (TO-GCN) to fully utilize the available information by jointly refining the network topology and learning the parameters of the FCN. According to our derivations, TO-GCN is more flexible than GCN, in which the filters are fixed and only the classifier can be updated during the learning process. Extensive experiments on real attributed networks demonstrate the superiority of the proposed TO-GCN over state-of-the-art approaches.

## 1 Introduction

Objects in the real world, which are usually modeled as nodes in graphs/networks, are likely to be connected in various ways. For example, people are connected through social networks, both offline and online; computers are connected via intranets and the Internet; proteins are connected by electrostatic forces. To explore the characteristics and mechanisms of these networks, network analysis is introduced [Newman, 2003]. In network analysis, node classification, which is also known as network partition, community detection and graph clustering, is one of the most widely studied topics [Fortunato, 2010; Fortunato and Hric, 2016; Tao et al., 2017].

Figure 1: The difference between Graph Convolutional Networks (GCN) and our proposed TO-GCN. Learnable variables are shown with a background color. GCN takes the given network topology A and node features X as input and only learns the parameters W of the FCN with the labels Z. TO-GCN jointly learns the network topology O (red line and disk) and the parameters W of the FCN to fully exploit the label information Z.

Classical node classification methods are developed based on the network topology alone. Although they contribute significantly to understanding the characteristics and mechanisms of networks, their performance is typically unsatisfactory due to the sparsity of the networks and the noise they contain. To improve the performance, various kinds of side information have been adopted.
Among them, label and content information are the most commonly utilized [Tu et al., 2016; Yang et al., 2013]. By effectively exploiting the labels and the network topology, label propagation [Zhu et al., 2003] achieves a great performance boost. Many algorithms have later been proposed to utilize pairwise constraint information, which is another kind of supervised information in the form of must-link and cannot-link constraints [Lu and Peng, 2013]. By exploring the correlations between node features and network topology, node classification in attributed networks has achieved great success and performance improvement [Wang et al., 2016]. Recently, many efforts have been made to tackle semi-supervised node classification in attributed networks, which predicts the labels by exploiting the network topology, the node features and the given labels.

Motivated by the success of deep learning [Goodfellow et al., 2016], especially convolutional neural networks (CNNs) [Krizhevsky et al., 2012], researchers have extended CNNs to process irregular data, e.g., graphs and manifolds. Among these extensions, the Graph Convolutional Network (GCN), which simplifies ChebNet [Defferrard et al., 2016], has attracted a large amount of attention due to its simplicity and high performance [Kipf and Welling, 2017]. As shown in Figure 1(A), GCN is equivalent to smoothing the node features over the neighbourhoods and processing them with a fully connected network (FCN). The success of GCN stems from the network topology, as shown in [Li et al., 2018], and from the label information, which is only employed to train the parameters of the FCN.

Unfortunately, GCN has not fully exploited the potential of the network topology, and the flexibility of the FCN is also limited. Specifically, the given network topology is not optimal due to its sparsity and noise: some nodes in the same class may not be close, while other nodes belonging to different classes may be directly connected. These phenomena have not been taken into consideration by GCN. On the other hand, the flexibility of GCN is limited compared to CNN. Typically, a CNN consists of two learnable components, convolutional layers (including pooling layers) and fully connected layers. The former can be regarded as a feature extractor with learnable filters, while the latter is equivalent to a learnable classifier. On the contrary, the filters A in GCN are fixed and only the classifier is learnable (Figure 1(A)).

To better utilize the network topology via refinement and to improve the flexibility of the network, we propose a novel Topology Optimization based Graph Convolutional Networks (TO-GCN). As shown in Figure 1(B), the given labels are utilized to simultaneously and jointly learn the network topology and the parameters of the FCN, which provides more flexibility than GCN. Specifically, with the given labels converted into pairwise constraints (must-link and cannot-link), the network topology is refined to satisfy these constraints and to support the prediction of the unknown labels. To measure the consistency between the learned network topology and the pairwise constraints, constraint propagation is formulated as the minimization of an objective function with the refined topology as its variable. This model is optimized with respect to both the refined topology and the parameters of the FCN.
Analysis of the joint optimization algorithm indicates that the learned topology is affected by both the given constraints and the classification results. Our main contributions are summarized as follows:

- We analyze the impact of the network topology on the performance of semi-supervised node classification in attributed networks and show that the label information has not been fully explored by most existing methods.
- We propose a novel Topology Optimization based Graph Convolutional Networks (TO-GCN), which jointly learns the network topology and the parameters of the FCN with respect to the given labels.
- We provide an analysis of the joint optimization algorithm to demonstrate its superiority over the straightforward separate optimization approach.

## 2 Backgrounds and Motivations

### 2.1 Notations and Definitions

Define an attributed network as $G = (V, E, X)$ with nodes/vertices $V = \{v_1, v_2, \ldots, v_N\}$ and edges $E = \{e_1, e_2, \ldots, e_M\}$. The attributes of all the vertices are represented by an attribute matrix $X \in \mathbb{R}^{N \times F}$, whose $n$th row $x_n \in \mathbb{R}^{1 \times F}$ corresponds to the attributes of vertex $v_n$ in the form of an $F$-dimensional vector. The network topology is represented by an adjacency matrix $A = [a_{ij}] \in \mathbb{R}^{N \times N}$. The elements of the adjacency matrix are binary, i.e., $a_{ij} = 1$ if an edge exists between the vertices $v_i$ and $v_j$, and $a_{ij} = 0$ otherwise. $D = \mathrm{diag}(d_1, d_2, \ldots, d_N)$, where $d_n = \sum_j a_{nj}$ is the degree of vertex $v_n$, is the degree matrix. In general, semi-supervised node classification predicts the labels $Y \in \{0, 1\}^{N \times C}$ of the vertices in $V$, given the network $G = (V, E, X)$ and some labelled nodes in the set $V_l$ in the form of $Z = [Z_{nc}] \in \{0, 1\}^{N \times C}$, where $C$ is the number of classes and $Z_{nc} = 1$ if and only if the vertex $v_n$ belongs to the $c$th class.

### 2.2 Label Propagation

Semi-supervised node classification originates from graph-based semi-supervised learning, which makes full use of the limited number of labels by exploring the graph structure. The philosophy behind these methods is the assumption that nearby vertices on a graph tend to share the same label. Thus, they can be formulated as minimizing the objective function

$$\sum_{i,j} w_{ij} (y_i - y_j)^2, \qquad (1)$$

where $w_{ij}$ is the similarity between the vertices $v_i$ and $v_j$. Unfortunately, most graph-based semi-supervised learning algorithms ignore either the network topology or the node features. When the node features are neglected, methods such as LPA [Raghavan et al., 2007] only propagate labels over the network (topological information), i.e., $w_{ij} = a_{ij}$, an element of the adjacency matrix. Other methods propagate the labels over a similarity graph constructed from the node features, without taking the topological information into consideration [Zhu et al., 2003]. In the similarity graph, $w_{ij} = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$ is the similarity between the vertices $v_i$ and $v_j$ computed from their features. The process of label propagation on a graph can be seen as refining the graph structure with the given labels.
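For concreteness, the following minimal sketch (our own illustration, not the authors' implementation) shows this style of label propagation on a feature-similarity graph: labels are iteratively spread to neighbours while the labelled nodes are clamped to their given values, which is one common way to approximately minimize Eq. (1).

```python
import numpy as np

def label_propagation(X, Z, labeled_mask, sigma=1.0, iters=100):
    """Illustrative label propagation on a feature-similarity graph.

    X: (N, F) node features; Z: (N, C) one-hot labels (rows of unlabelled
    nodes are all zeros); labeled_mask: (N,) boolean array marking V_l.
    """
    # Similarity graph w_ij = exp(-||x_i - x_j||^2 / sigma^2), as in Sec. 2.2.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / sigma ** 2)
    np.fill_diagonal(W, 0.0)

    P = W / W.sum(axis=1, keepdims=True)   # row-normalised propagation matrix
    Y = Z.astype(float).copy()
    for _ in range(iters):
        Y = P @ Y                          # spread labels to neighbours
        Y[labeled_mask] = Z[labeled_mask]  # clamp the given labels (y_n = z_n)
    return Y.argmax(axis=1)                # predicted class of each node
```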
### 2.3 Graph Convolutional Neural Networks

Recently, semi-supervised node classification algorithms tend to improve the performance by jointly considering the network topology and the node attributes (graph data). Inspired by the successful applications of deep learning to regular grid data (e.g. images and videos), researchers have adapted deep learning techniques to process irregular graph data (e.g. graphs or manifolds) [Defferrard et al., 2016; Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017; Scarselli et al., 2009; Xu et al., 2019]. To handle the irregular structure, spectral approaches apply the convolution operation directly in the spectral domain of the graph (i.e., on the eigenvalues of the graph Laplacian) by treating the node attributes as graph signals according to spectral graph theory:

$$g_\theta \star x = U g_\theta(\Lambda) U^T x, \qquad (2)$$

where $U$ and $\Lambda$ represent the eigenvectors and eigenvalues of the graph Laplacian $L = D^{-1/2}(D - A)D^{-1/2}$, i.e., $L = U \Lambda U^T$. Unfortunately, the high computational complexity of the eigendecomposition prevents spectral approaches from being applied to large graphs. To overcome this deficit, many simplifications have been proposed. ChebNet [Defferrard et al., 2016] approximates the spectral filter with $P$th-order Chebyshev polynomials as $g_\theta \star x = \sum_{p=0}^{P} \theta_p T_p(L) x$, where $T_p$ and $\theta_p$ are the Chebyshev polynomials and coefficients, respectively. GCN [Kipf and Welling, 2017] further simplifies ChebNet to

$$g_\theta \star x = \theta (I + D^{-1/2} A D^{-1/2}) x, \qquad (3)$$

by restricting the Chebyshev polynomials to first order and approximating the largest eigenvalue by 2, where $I$ denotes the identity matrix. By denoting $\tilde{A} = A + I$ and $\tilde{D}_{nn} = \sum_j \tilde{A}_{nj}$, and by generalizing one input channel $x$ and one spectral filter $\theta$ to $F$ input channels $X$ and $C$ spectral filters $\Theta \in \mathbb{R}^{F \times C}$, GCN becomes

$$H = \sigma(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta), \qquad (4)$$

where $\sigma(\cdot)$ is a nonlinear activation function, such as softmax or ReLU, as shown in Figure 1(A). According to the experiments, stacking two GCN layers, as shown in Eq. (5), provides the best performance:

$$Z = f(X, A) = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X \Theta^{(0)}) \Theta^{(1)}\big), \qquad (5)$$

where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. The parameters $\Theta^{(0)}$ and $\Theta^{(1)}$ are obtained by minimizing the cross-entropy error over the labelled nodes

$$-\sum_{n \in V_l} \sum_{c=1}^{C} Y_{nc} \log(Z_{nc}), \qquad (6)$$

where $V_l$ denotes the set of labelled nodes. Recent literature [Li et al., 2018] indicates that the graph convolution operation $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X$ in GCN is equivalent to a Laplacian smoothing operation applied to the local neighbourhood, i.e., the graph convolution essentially averages the node attributes within a local neighbourhood with weights determined by the node degrees $d_i$. The graph attention network (GAT) [Veličković et al., 2018] extends GCN by applying the attention mechanism [Bahdanau et al., 2015] to the neighbour weight assignment and formulates the weight between vertices $v_i$ and $v_j$ as

$$O_{ij} = \frac{\exp\big(c(x_i^T \Theta, x_j^T \Theta)\big)}{\sum_{k \in N(j)} \exp\big(c(x_k^T \Theta, x_j^T \Theta)\big)}, \qquad (7)$$

where $c(x, y)$ is the self-attention with a shared attentional mechanism $c: \mathbb{R}^C \times \mathbb{R}^C \rightarrow \mathbb{R}$. As can be observed, GAT is equivalent to estimating the edge weights according to the attribute similarities between the connected nodes. Mixture model networks (MoNet) [Monti et al., 2017] unify several convolutional operations on non-Euclidean structured data (graphs or manifolds) as mixture CNNs.
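As a concrete reference for Eqs. (4) and (5), the following minimal numpy sketch (dense matrices for readability; published GCN implementations use sparse operations) computes the normalized propagation matrix and the two-layer forward pass.

```python
import numpy as np

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN of Eq. (5): softmax(A_hat ReLU(A_hat X W0) W1).

    A: (N, N) binary adjacency matrix; X: (N, F) node attributes;
    W0: (F, H) and W1: (H, C) play the roles of Theta^(0) and Theta^(1).
    """
    A_tilde = A + np.eye(A.shape[0])              # add self-loops: A~ = A + I
    d_tilde = A_tilde.sum(axis=1)                 # D~_nn = sum_j A~_nj
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # A_hat = D~^{-1/2} A~ D~^{-1/2}

    H1 = np.maximum(A_hat @ X @ W0, 0.0)          # first layer with ReLU
    logits = A_hat @ H1 @ W1                      # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # row-wise softmax predictions
```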
### 2.4 Motivation

After reviewing the existing schemes, the main drawback of these methods, label propagation (LP) and graph convolutional networks (GCN), can be summarized as follows: three kinds of information are not fully utilized by the existing approaches. LP predicts the labels by combining either the network topology or the node features with the given labelled nodes. Obviously, LP approaches have not fully explored all the available information, i.e., the network topology, the node features and the given labels.

GCN in Eq. (4), which is represented in Figure 1(A), can be rewritten as

$$H = \sigma\big( \underbrace{\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}}_{\text{first term}} \; \underbrace{(X \Theta)}_{\text{second term}} \big), \qquad (8)$$

where the first term is the given graph Laplacian smoothing without any parameters, and the second term is the fully connected network (FCN) which directly takes the node features as input. [Li et al., 2018] shows that the classification performance drops significantly when the Laplacian smoothing term (first term) is removed from GCN. Therefore, the given network topology (used for Laplacian smoothing) plays a very important role in the classification task. Based on this observation in [Li et al., 2018], we further argue that *a clearer network (community) structure will improve the performance of semi-supervised node classification*. The presence of community (modular/cluster) structure, in which nodes are densely connected within communities and seldom connected across communities, is a common property of networks. For illustration, we conduct experiments on three attributed networks, Cora, Citeseer and PubMed, with the same attribute matrix and labelled nodes but three different network topologies: FCN only utilizes the node features without the graph convolution operation; GCN takes the given network topology A as input; GCN-GT employs the ground-truth membership matrix $G = [g_{ij}] \in \{0, 1\}^{N \times N}$, where $g_{ij} = 1$ if and only if the vertices $v_i$ and $v_j$ are in the same class. Obviously, the community structure of GCN-GT is clearer than that of GCN, and the community structure of GCN is clearer than that of FCN.

| Dataset  | FCN   | GCN   | GCN-GT |
|----------|-------|-------|--------|
| Citeseer | 57.1% | 72.0% | 100%   |
| Cora     | 56.2% | 81.3% | 100%   |
| PubMed   | 70.7% | 79.2% | 100%   |

Table 1: Classification results with different topologies.

As can be observed in Table 1, the performance improves as the network structure becomes clearer. This reveals that a refined network topology will benefit the classification performance. However, the network topology is fixed in GCN, which limits the flexibility of the network. Previous work has shown that the network structure can be improved with the labels [Zhu et al., 2003; Lu and Peng, 2013]. However, the given label information has not been properly integrated with the network topology in GCN. Therefore, GCN has not fully utilized the given label information.

## 3 Topology Optimization based Graph Convolutional Networks

The existing work reviewed in Section 2 motivates us to refine the network topology. Although previous methods exist that refine the network topology with the given labels, the label information has not been fully exploited. Therefore, in this paper, we utilize the given labels to simultaneously and jointly refine the network topology and learn the parameters of the FCN. The flow chart of our proposed method is shown in Figure 1(B). In this section, our network topology refinement method is introduced first. Then, the proposed Topology Optimization based Graph Convolutional Networks (TO-GCN) is presented, followed by the optimization algorithm. Finally, TO-GCN is analyzed from the model and optimization perspectives.

### 3.1 Network Topology Refinement

To refine the network topology with the given labels while keeping the topology non-negative, the refinement is modeled as a label propagation process. Under the assumption that nearby vertices in a graph tend to share the same label, Eq. (1) can be reformulated as

$$\min_Y \frac{1}{4} \sum_{i,j} a_{ij} \|y_i - y_j\|_2^2 = \min_Y \frac{1}{2} \mathrm{Tr}(Y^T L Y) \quad \text{s.t.} \quad y_n = z_n \ \ \forall v_n \in V_l, \qquad (9)$$
where $V_l$ stands for the set of labelled vertices, and $z_n \in \{0, 1\}^{1 \times C}$ and $y_n \in \{0, 1\}^{1 \times C}$ are the ground-truth and predicted labels of the vertex $v_n$, i.e., the $n$th rows of $Z$ and $Y$, respectively. $L = D - A$ is the graph Laplacian of the network with adjacency matrix $A$. By relaxing the elements of $Y$ to lie in $\mathbb{R}^{N \times C}$, the constrained integer optimization in Eq. (9) can be relaxed to

$$\min_Y \frac{1}{2} \mathrm{Tr}(Y^T L Y) + \frac{\lambda}{2} \|Y - Z\|_F^2. \qquad (10)$$

Its optimal $Y$ is also the optimal solution to the following problem according to [Wang et al., 2012]:

$$\min_Y \frac{1}{2} \mathrm{Tr}(L Y Y^T) + \frac{\lambda}{2} \|Y Y^T - Z Z^T\|_F^2. \qquad (11)$$

By denoting $Q = [Q_{ij}] = Z Z^T \in \{0, 1\}^{N \times N}$, $Q_{ij}$ equals 1 (a must-link constraint) if and only if the vertices $v_i$ and $v_j$ belong to the same class according to the given labels $Z$. Since the refinement is achieved with the given label information instead of directly obtaining the node labels $Y$, we regard $O = [O_{ij}] = Y Y^T \in \mathbb{R}^{N \times N}$ as the refined network topology and rewrite Eq. (11) as

$$\min_O \frac{1}{2} \mathrm{Tr}(L O) + \frac{\lambda}{2} \|O - Q\|_F^2. \qquad (12)$$

The must-link constraints (pairs of nodes belonging to the same class) are preserved during the transformation from Eq. (10) to Eq. (11). However, the cannot-link constraints (pairs of nodes belonging to different classes), which should reduce the similarity (edge weight) between nodes of different classes, are ignored during the transformation. Therefore, Eq. (12) is revised to penalize high similarities between nodes from different classes:

$$\min_O \frac{1}{2} \mathrm{Tr}(L O) + \frac{\lambda}{2} \|O - Q\|_F^2 + \alpha \sum_{i,j} O_{ij} \|z_i - z_j\|_2^2.$$

Let $G = [G_{ij}] \in \{0, 1\}^{N \times N}$, where $G_{ij} = 1$ if and only if the vertices $v_i$ and $v_j$ belong to different classes. Then, the objective function of our network topology refinement method becomes

$$\min_O \frac{1}{2} \mathrm{Tr}\big((L + \alpha G) O\big) + \lambda \|O - Q\|_F^2, \qquad (13)$$

where $Q$ and $G$ are the pairwise constraint matrices obtained from the given label matrix $Z$. By minimizing Eq. (13), the given topology $A$ is refined into $O$ with the help of the given labels.

### 3.2 Topology Optimization based Graph Convolutional Networks

To exploit the network topology refinement in GCN, a straightforward approach, named P-GCN, is to first perform the refinement in Eq. (13) and then run GCN on the refined network topology. However, this approach is sub-optimal, because the pre-processed refined topology only facilitates the semi-supervised node classification by providing a highly modular network structure, while ignoring its direct impact on the classification results. A detailed analysis is provided at the end of this section. Therefore, the Topology Optimization based Graph Convolutional Networks (TO-GCN) is proposed to jointly learn the network topology and the parameters of a fully connected network (FCN), as shown in Figure 1(B). Similar to GCN, the cross-entropy loss in Eq. (6) is adopted to measure the classification errors, and the final objective function is

$$\min_{W, O} \mathcal{L}_{classify} + \mathcal{L}_{refine} = \min_{W, O} -\sum_{n \in V_l} \sum_{c=1}^{C} Z_{nc} \log(H_{nc}) + \mathrm{Tr}\big((L + \alpha G) O\big) + \frac{\lambda}{2} \|O - Q\|_F^2, \qquad (14)$$

where the first term is the classification loss and the second is the topology refinement loss. $H = [H_{nc}] \in \mathbb{R}^{N \times C}$ is the prediction matrix, with

$$H = f(X, O) = \mathrm{softmax}\big(O\, \mathrm{ReLU}(O X W^{(0)}) W^{(1)}\big),$$

where $W = \{W^{(0)}, W^{(1)}\}$ are the parameters of the FCN.

### 3.3 Optimization

To jointly learn the refined network topology $O$ and the FCN parameters $W$, we alternately update one of them while fixing the other.
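Before detailing the two alternating updates, the sketch below illustrates, under our reading of Eqs. (13)-(14) rather than the authors' released code, how the pairwise constraint matrices Q (must-link) and G (cannot-link) can be assembled from the label matrix Z and how the topology-refinement loss can be evaluated.

```python
import numpy as np

def constraint_matrices(Z, labeled_mask):
    """Build the must-link matrix Q and cannot-link matrix G from the labels.

    Z: (N, C) label matrix with one-hot rows for labelled nodes and zero rows
    otherwise; labeled_mask: (N,) boolean array marking V_l.
    """
    Zl = Z * labeled_mask[:, None].astype(float)   # keep only labelled rows
    Q = Zl @ Zl.T                                  # Q = Z Z^T (must-link pairs)
    both = np.outer(labeled_mask, labeled_mask).astype(float)
    G = both * (1.0 - Q)                           # labelled pairs from different classes
    return Q, G

def refinement_loss(O, A, Q, G, alpha, lam):
    """L_refine of Eq. (14): Tr((L + alpha*G) O) + (lam/2) ||O - Q||_F^2."""
    L = np.diag(A.sum(axis=1)) - A                 # unnormalised Laplacian L = D - A
    return np.trace((L + alpha * G) @ O) + 0.5 * lam * np.sum((O - Q) ** 2)
```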
When the refined network topology $O$ is fixed, the update of the FCN parameters $W$ is identical to the update procedure in GCN, and we thus omit its details. Here, the process of updating $O$ with $W$ fixed is described in detail. Although the second and third terms in Eq. (14) possess analytical solutions, the gradient descent approach is adopted to optimize Eq. (14), because the first term does not have a closed-form solution. Since the Laplacian matrix $L$ and the constraint matrices $G$ and $Q$ are symmetric, the gradient of $\mathcal{L}_{refine}$ with respect to $O$ is

$$\frac{\partial \mathcal{L}_{refine}}{\partial O} = (L + \alpha G - \lambda Q) + \lambda O, \qquad (15)$$

where $L + \alpha G - \lambda Q$ can be computed in advance, since it does not vary across iterations.

Then, we compute the gradient of $\mathcal{L}_{classify}$ with respect to $O$. For simplicity, we consider the one-layer convolution operation, i.e., $H = \mathrm{softmax}(O X W)$. Note that the prediction for node $v_n$, $H_n = \mathrm{softmax}(O_n B)$, is only affected by $O_n$, the $n$th row of the topology matrix, where $B = XW$ is fixed when $W$ is fixed. Thus, the gradient of $\mathcal{L}_{classify}$ with respect to $O_{nk}$ is equivalent to that of $\mathcal{L}_{classify}(n) = -\sum_{c=1}^{C} Z_{nc} \log(H_{nc})$, as defined in Eq. (14), with respect to $O_{nk}$. According to the chain rule,

$$\frac{\partial \mathcal{L}_{classify}(n)}{\partial O_{nk}} = -\sum_{c=1}^{C} Z_{nc} \frac{\partial \log(H_{nc})}{\partial O_{nk}},$$

where $H_{nc} = \mathrm{softmax}(T_{nc})$ and $T_{nc} = \sum_k O_{nk} B_{kc}$. Since $\sum_{c=1}^{C} Z_{nc} = 1$ and $\frac{\partial \log(H_{nc})}{\partial T_{np}} = \delta_{cp} - H_{np}$, where $\delta_{cp}$ is the Kronecker delta with $\delta_{cp} = 1$ if and only if $c = p$, we have

$$\frac{\partial \mathcal{L}_{classify}(n)}{\partial T_{np}} = -\sum_{c=1}^{C} Z_{nc} (\delta_{cp} - H_{np}) = H_{np} - Z_{np}.$$

Besides, since $\frac{\partial T_{np}}{\partial O_{nk}} = B_{kp}$, we obtain $\frac{\partial \mathcal{L}_{classify}(n)}{\partial O_{nk}} = \sum_p (H_{np} - Z_{np}) B_{kp}$, which can be written in matrix form as

$$\frac{\partial \mathcal{L}_{classify}}{\partial O} = (H - Z) B^T = (H - Z) W^T X^T, \qquad (16)$$

since $B = XW$. By combining Eqs. (15) and (16), the total gradient in matrix form is

$$\frac{\partial \mathcal{L}}{\partial O} = (H - Z) W^T X^T + (L + \alpha G - \lambda Q) + \lambda O. \qquad (17)$$

Tips for implementation: To speed up the optimization, the refined network topology $O$ is first initialized by minimizing Eq. (13) without considering the back-propagation of the classification errors. Besides, the updating rule in Eq. (17) is only applied to the weights of the existing edges and the must-link constraints, i.e., the support of $A + Q$. Thus, the complexity of TO-GCN is $O(M)$ instead of $O(N^2)$, where $N$ and $M$ are the numbers of nodes and edges, respectively.

Remark: The extra cost of the learnable graph is the topology update in Eq. (17). For efficiency, the topology is initialized via constrained propagation, as shown in Eq. (13), and the update in Eq. (17) is only applied to the weights of the existing edges and the must-link constraints. Therefore, the additional computation is linearly proportional to the number of edges, and the proposed TO-GCN can be applied to large-scale networks.

### 3.4 Insights

Here, we provide some insights into the proposed method to further illustrate the superiority of TO-GCN and the advantage of jointly learning the refined network topology and the parameters of the FCN.

#### From the Model Perspective

To elaborate the advantage of our proposed TO-GCN over GCN, both are compared to the convolutional neural network (CNN). A CNN consists of two learnable components: convolutional layers (including pooling layers) and fully connected layers. The former can be regarded as a feature extractor with learnable filters, while the latter is equivalent to a learnable classifier. In our proposed TO-GCN (Figure 1(B)), the learnable network topology O can be regarded as the learnable filters, while the FCN with parameters W is equivalent to the learnable classifier. Therefore, TO-GCN and CNN are similar in structure. On the contrary, the filters A are fixed and only the classifier is learnable in GCN (Figure 1(A)). Therefore, TO-GCN is more flexible than GCN.
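Before turning to the optimization perspective, the following sketch shows one masked gradient step on O according to Eq. (17), using the one-layer case H = softmax(OXW) from the derivation above. The final projection that keeps O non-negative is our own assumption; the paper states that the refined topology should stay non-negative but does not spell out how this is enforced.

```python
import numpy as np

def update_topology(O, H, Z, X, W, L, G, Q, A, alpha, lam, lr):
    """One gradient step on O following Eq. (17), restricted to the support
    of A + Q (existing edges and must-link pairs), as in the implementation tips.

    H: (N, C) current predictions; Z: (N, C) labels; W: (F, C) one-layer
    classifier weights, so that B = X W; L, G, Q, A, O: (N, N) matrices.
    """
    grad = (H - Z) @ W.T @ X.T + (L + alpha * G - lam * Q) + lam * O  # Eq. (17)
    mask = ((A + Q) > 0).astype(float)     # only update existing edges / must-links
    O_new = O - lr * grad * mask
    return np.maximum(O_new, 0.0)          # assumed projection keeping O non-negative
```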
#### From the Optimization Perspective

The updating rule for O in Eq. (17) consists of three terms, whose roles are elaborated as follows. The first term is the main difference between the straightforward approach and the joint learning mechanism. $S = [S_{nk}] = (H - Z) W^T X^T$ can be regarded as the correlation between the prediction error $H - Z$ and the prediction from the node attributes $XW$. $S_{nk}$ can be written as

$$S_{nk} = (h_n - z_n)(x_k W)^T, \qquad (18)$$

where $h_n$ and $z_n$ denote the predicted and ground-truth label vectors of the vertex $v_n$, respectively.

Then, we analyze the role of $S_{nk}$. If the ground truth and the predicted label of node $v_n$ are identical, then $S_{nk} = 0$. Otherwise, if the ground-truth and predicted labels of node $v_n$ are $i$ and $j$, respectively, then

$$h_n - z_n = (0, \ldots, 0, \underbrace{1}_{j}, 0, \ldots, 0, \underbrace{-1}_{i}, 0, \ldots, 0). \qquad (19)$$

$S_{nk}$ can then be rewritten as $S_{nk} = x_k (w_{\cdot j} - w_{\cdot i})$, where $w_{\cdot j}$ is the $j$th column of $W$. Since $x_k w_{\cdot j}$ is the confidence of node $v_k$ being classified into class $j$, $x_k (w_{\cdot j} - w_{\cdot i})$ is the difference in confidence of node $v_k$ being classified into class $j$ rather than class $i$ based on the node attributes $x_k$. Therefore, updating $O$ with $S = (H - Z) W^T X^T$ reduces the weight $O_{nk}$ if the classification result for $v_k$ causes $v_n$ to be misclassified. The second term is the combination of the graph Laplacian and the constraint matrices, which reveals the impact of the constraints on the refined network. The third term acts as a regularizer that keeps the update well behaved. Overall, by jointly learning the refined network topology and the parameters of the FCN, the network topology is refined by both the given constraints and the classification results. Therefore, TO-GCN is better than the straightforward P-GCN.

## 4 Evaluations

In this section, we evaluate the proposed TO-GCN with extensive experiments on real attributed networks and provide some visualizations for better illustration.

### 4.1 Datasets and Setup

As in previous work, three citation networks, in which the vertices and edges are documents and undirected citations respectively, are utilized for performance evaluation. Node attributes are the bag-of-words representations of the documents. Each node is assigned a label based on its discipline. The statistics of the networks, Cora, CiteSeer and PubMed, are summarized in Table 2.

| Dataset  | Nodes  | Edges  | Classes | Features |
|----------|--------|--------|---------|----------|
| Citeseer | 3,327  | 4,732  | 6       | 3,703    |
| Cora     | 2,708  | 5,429  | 7       | 1,433    |
| PubMed   | 19,717 | 44,338 | 3       | 500      |

Table 2: Datasets.

Following the setup in [Yang et al., 2016] and [Kipf and Welling, 2017], the network topology, the features of all the nodes and the labels of 20 nodes per class are available for training. The performance of all methods is evaluated on 1,000 test nodes, with 500 additional nodes used for validation. Note that no balancing parameter between $\mathcal{L}_{classify}$ and $\mathcal{L}_{refine}$ is employed, which is equivalent to setting the balancing parameter to 1. α and λ, the hyperparameters for the cannot-link and must-link constraints, are set according to the results on the validation set.
### 4.2 Baselines

Our proposed model is compared to 10 state-of-the-art approaches, including the multilayer perceptron (MLP), label propagation (LP) [Zhu et al., 2003], semi-supervised embedding (SemiEmb) [Weston et al., 2012], manifold regularization (ManiReg) [Belkin et al., 2006], graph embedding (DeepWalk) [Perozzi et al., 2014], the iterative classification algorithm (ICA) [Lu and Getoor, 2003], the graph-based semi-supervised learning framework Planetoid [Yang et al., 2016], graph convolution with Chebyshev filters [Defferrard et al., 2016], graph convolutional networks (GCN) [Kipf and Welling, 2017], and mixture model networks (MoNet) [Monti et al., 2017]. All the results of the baseline approaches are either taken from their original papers or obtained with their original code and default settings. To show the superiority of the joint learning strategy, TO-GCN is also compared to the straightforward variant, P-GCN, in which the refined topology matrix is pre-computed and fed into GCN.

### 4.3 Results Analysis

| Methods   | Cora  | Citeseer | Pubmed |
|-----------|-------|----------|--------|
| MLP       | 55.1% | 46.5%    | 71.4%  |
| ManiReg   | 59.5% | 60.1%    | 70.7%  |
| SemiEmb   | 59.0% | 59.6%    | 71.7%  |
| LP        | 68.0% | 45.3%    | 63.0%  |
| DeepWalk  | 67.2% | 43.2%    | 65.3%  |
| ICA       | 75.1% | 69.1%    | 73.9%  |
| Planetoid | 75.7% | 64.7%    | 77.2%  |
| Chebyshev | 81.2% | 69.8%    | 74.4%  |
| GCN       | 81.5% | 70.3%    | 79.0%  |
| MoNet     | 81.7% | 69.9%    | 78.8%  |
| P-GCN     | 82.4% | 70.9%    | 79.3%  |
| TO-GCN    | 83.1% | 72.7%    | 79.5%  |

Table 3: Node classification results.

As shown in Table 3, the results of semi-supervised node classification indicate that the proposed TO-GCN consistently outperforms the other state-of-the-art methods as well as the straightforward P-GCN. The superior performance of TO-GCN demonstrates that both the proposed topology refinement and the joint learning mechanism are vital to the performance improvement. As can be observed, the performance improvement on the PubMed network is limited, because its network topology has not been appropriately refined due to the messy structure of PubMed and the small dimension of its features. The former aspect makes the constraint propagation inefficient, while the latter affects the learnable weights obtained from the back-propagated classification error. To demonstrate the robustness of the proposed TO-GCN compared to GCN, an experiment is conducted on Cora by adding 3% noisy edges. The accuracy degradations of GCN and TO-GCN are 11.2% and 8.7%, respectively, which also verifies the superiority of TO-GCN.

### 4.4 Visualizations

To provide more intuition about the results, we present further insights into the learned adjacency matrix. Since the networks above are sparse, it is difficult to effectively visualize their learned adjacency matrices. Therefore, we conduct these visualizations on the medium-sized Cornell network from the WebKB dataset. It consists of 195 connected webpages, which are classified into 5 classes. Each webpage is annotated with 1703-dimensional binary-valued word attributes. Since the dimension of the node content is much larger than the number of nodes, the node content is reduced to 50 dimensions via PCA to prevent overfitting. 10% of the nodes are randomly selected for training. The results are shown in Figure 2.

Figure 2: Visualizations of the learning process on the Cornell network. The subfigures show the adjacency matrix, the constraint matrix constructed from 10% labelled data, and the learned adjacency matrix, respectively. The cyan boxes contain the intra-class edges while the orange boxes contain the inter-class edges. The color brightness indicates the weights of the edges.

The nodes are re-ordered according to the ground-truth labels, and the color brightness indicates the weights of the edges. The structure of the learned graph is the combination of the network topology and the constraints constructed from the labels.
During the training process, the weights of the edges are refined by the pairwise constraints and by the back-propagation from the classifier. As can be observed, the weights of the inter-class edges (orange boxes) are significantly reduced, while those of the intra-class edges (cyan boxes) remain at a high level (Figure 2(D)).

## 5 Conclusions

In this paper, we observe that the network topology (both the structure and the edge weights) contributes significantly to semi-supervised node classification: a clear community structure (modularity) of the network usually induces a high classification accuracy. Besides this observation, existing approaches have not fully utilized the potential of three types of information, namely the network topology, the node features and the given labels. Therefore, we propose a novel Topology Optimization based Graph Convolutional Networks (TO-GCN) to fully utilize these three types of information by jointly refining the network topology and learning the parameters of the FCN. Extensive experiments on real attributed networks demonstrate the superior performance of the proposed TO-GCN against the state-of-the-art methods.

## Acknowledgments

This work was supported in part by the National Key R&D Program of China (No. 2017YFC0820106), in part by the National Natural Science Foundation of China under Grant 61503281, Grant 61802391, Grant 61772361, Grant U1636214 and Grant U1605252, in part by the Foundation for Innovative Research Groups through the National Natural Science Foundation of China under Grant 61421003, in part by the Key Program of the Chinese Academy of Sciences (No. QYZDB-SSW-JSC003), and in part by the Fundamental Research Funds for the Central Universities.

## References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

[Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399-2434, 2006.

[Defferrard et al., 2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3837-3845, 2016.

[Duvenaud et al., 2015] David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pages 2224-2232, 2015.

[Fortunato and Hric, 2016] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics Reports, 659:1-44, 2016.

[Fortunato, 2010] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75-174, 2010.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[Hamilton et al., 2017] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1025-1035, 2017.

[Kipf and Welling, 2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106-1114, 2012.
[Li et al., 2018] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, pages 3538-3545, 2018.

[Lu and Getoor, 2003] Qing Lu and Lise Getoor. Link-based classification. In ICML, pages 496-503, 2003.

[Lu and Peng, 2013] Zhiwu Lu and Yuxin Peng. Exhaustive and efficient constraint propagation: A graph-based learning approach and its applications. IJCV, 103(3):306-325, 2013.

[Monti et al., 2017] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In IEEE CVPR, pages 5425-5434, 2017.

[Newman, 2003] Mark E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.

[Niepert et al., 2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, pages 2014-2023, 2016.

[Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: online learning of social representations. In ACM SIGKDD, pages 701-710, 2014.

[Raghavan et al., 2007] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. PRE, 76(3 Pt 2):036106, 2007.

[Scarselli et al., 2009] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE TNN, 20(1):61-80, 2009.

[Tao et al., 2017] Zhiqiang Tao, Hongfu Liu, Sheng Li, Zhengming Ding, and Yun Fu. From ensemble clustering to multi-view clustering. In IJCAI, pages 2843-2849, 2017.

[Tu et al., 2016] Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. Max-margin DeepWalk: Discriminative learning of network representation. In IJCAI, pages 3889-3895, 2016.

[Veličković et al., 2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.

[Wang et al., 2012] Xiang Wang, Buyue Qian, and Ian Davidson. Labels vs. pairwise constraints: A unified view of label propagation and constrained spectral clustering. In IEEE ICDM, pages 1146-1151, 2012.

[Wang et al., 2016] Xiao Wang, Di Jin, Xiaochun Cao, Liang Yang, and Weixiong Zhang. Semantic community identification in large attribute networks. In AAAI, pages 265-271, 2016.

[Weston et al., 2012] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade - Second Edition, pages 639-655. 2012.

[Xu et al., 2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.

[Yang et al., 2013] Jaewon Yang, Julian J. McAuley, and Jure Leskovec. Community detection in networks with node attributes. In IEEE ICDM, pages 1151-1156, 2013.

[Yang et al., 2016] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, pages 40-48, 2016.

[Zhu et al., 2003] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, pages 912-919, 2003.