The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Gaussian-Induced Convolution for Graphs

Jiatao Jiang, Zhen Cui, Chunyan Xu, Jian Yang
PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
{jiatao, zhen.cui, cyx, csjyang}@njust.edu.cn

Abstract

Learning representation on graphs plays a crucial role in numerous tasks of pattern recognition. However, different from grid-shaped images/videos, on which local convolution kernels can be lattices, graphs are fully coordinate-free on vertices and edges. In this work, we propose a Gaussian-induced convolution (GIC) framework to conduct local convolution filtering on irregular graphs. Specifically, an edge-induced Gaussian mixture model is designed to encode the variations of a subgraph region by integrating edge information into weighted Gaussian models, each of which implicitly characterizes one component of subgraph variations. In order to coarsen a graph, we derive a vertex-induced Gaussian mixture model to cluster vertices dynamically according to the connection of edges, which is approximately equivalent to the weighted graph cut. We conduct our multi-layer graph convolution network on several public datasets of graph classification. The extensive experiments demonstrate that our GIC is effective and can achieve state-of-the-art results.

Introduction

As witnessed by its widespread applications, the graph is one of the most successful models for structured and semi-structured data, ranging from text (Defferrard, Bresson, and Vandergheynst 2016), bioinformatics (Yanardag and Vishwanathan 2015; Niepert, Ahmed, and Kutzkov 2016; Song et al. 2018) and social networks (Gomez, Chiem, and Delvenne 2017; Orsini, Baracchi, and Frasconi 2017) to images/videos (Marino, Salakhutdinov, and Gupta 2016; Cui et al. 2018; Cui, Yang, and others 2017). Among these applications, learning robust representations from structured graphs is the main topic. To this end, various methods have come forth in recent years; graph kernels (Yanardag and Vishwanathan 2015) and recurrent neural networks (RNNs) (Scarselli et al. 2009) are the most representative ones. Graph kernels usually take the classic R-convolution strategy (Haussler 1999) to recursively decompose graphs into atomic sub-structures and then define local similarities between them. RNN-based methods sequentially traverse neighbors with tied parameters in depth. With the increase of graph size, graph kernels suffer from the diagonal dominance of kernels (Schölkopf et al. 2002), while RNNs face an explosive number of combinatorial paths in the recursive stage.

Recently, convolutional neural networks (CNNs) (LeCun, Bengio, and Hinton 2015) have achieved breakthrough progress on representing grid-shaped image/video data. In contrast, graphs have irregular structures and are fully coordinate-free on vertices and edges. The vertices/edges are not strictly ordered and cannot be explicitly matched between two graphs.
To generalize the idea of CNNs onto graphs, we need to solve the problem that homomorphic graphs/subgraphs should produce the same responses when performing convolutional filtering. To this end, recent graph convolution methods (Defferrard, Bresson, and Vandergheynst 2016; Atwood and Towsley 2016; Hamilton, Ying, and Leskovec 2017) attempt to aggregate neighbor vertices, as shown in Fig. 1e. This kind of method actually employs a fuzzy filtering (i.e., a tied/shared filter) on neighbor vertices, because only first-order statistics (the mean) are used. Two examples are shown in Fig. 1a and Fig. 1b: although they have different structures, the responses on them are exactly equal. Oppositely, Niepert et al. (Niepert, Ahmed, and Kutzkov 2016) ranked neighbor vertices according to the weights of edges and then used different filters on the sorted vertices, as shown in Fig. 1f. However, this rigid ranking suffers several limitations: i) possibly identical responses to different structures (e.g., Fig. 1b and Fig. 1c), because the weights of edges are out of consideration after ranking; ii) information loss from node pruning for a fixed-size receptive field, as shown in Fig. 1b; iii) ranking ambiguity for equal connections, as shown in Fig. 1d; and iv) ranking sensitivity to (slight) changes of edge weights/connections.

Figure 1: Different filter operations on graph vertices. Examples of one-hop subgraphs are given in (a)-(d), where $v_0$ is the reference vertex and each vertex is assigned a signal. Panels: (e) tied filtering (weighted summation, $f(\sum_{i=1}^{5}\tilde{w}_{0i}x_i)=\sum_{i=1}^{5}\tilde{w}_{0i}f(x_i)$); (f) ranking filtering (rank and select vertices, $f_1(x_2)+f_2(x_1)+f_3(x_4)+f_4(x_5)$); (g) Gaussian-induced filtering (encode edge-induced Gaussian vectors $\mathcal{N}(\mu_c,\frac{1}{\tilde{w}_{0i}}\Sigma_c)$, $f_1(g_1(x))+f_2(g_2(x))+f_3(g_3(x))$); (h) $3\times 3$ regular filtering ($f_0(x_0)+\cdots+f_8(x_8)$). The tied filtering (e) summarizes all neighbor vertices and generates the same response to (a) and (b) under the filter f, i.e., $f(\sum_i \tilde{w}_{0i}x_i)=f(1.9)$, although the two graphs are completely different in structure. The ranking filtering (f) sorts/prunes neighbor vertices and then performs different filtering on them. It may produce the same response $f_1(1)+f_2(3)+f_3(1)+f_4(4)$ for different graphs such as (b) and (c), where the digits in red boxes denote the ranked indices and the vertex in the dashed box in (b) is pruned. Moreover, the vertex ranking is uncertain/non-unique for equal connections in (d). To address these problems, we derive the edge-induced GMM to coordinate subgraphs as shown in (g). Each Gaussian model can be viewed as one variation component (or direction) of the subgraph. Like the standard convolution (h), the Gaussian encoding is sensitive to different subgraphs, e.g., (a)-(d) will have different responses. Note that f and the $f_i$ are linear filters; the non-linear activation functions are applied to their responses.

In this paper we propose a Gaussian-induced graph convolution framework to learn graph representation. For a coordinate-free subgraph region, we design an edge-induced Gaussian mixture model (EI-GMM) to implicitly coordinate the vertices therein. Specifically, the edges are used to regularize the Gaussian models such that the variations of the subgraph can be well encoded. In analogy to the standard convolutional kernel shown in Fig. 1h, EI-GMM can be viewed as a coordinate normalization that projects the variations of a subgraph onto several Gaussian components. For example, the four subgraphs in Fig. 1a-1d will have different representations¹ through our Gaussian encoding in Fig. 1g.
To make the network inference feed-forward, we transform the Gaussian components of each subgraph into the gradient space of the multivariate Gaussian parameters, instead of employing the sophisticated EM algorithm. Then filters (or transform functions) are applied to the different Gaussian components, like latticed kernels on different directions in Fig. 1h. Further, we derive a vertex-induced Gaussian mixture model (VI-GMM) to enable dynamic coarsening of the graph. We also theoretically analyze the approximate equivalence of VI-GMM to the weighted graph cut (Dhillon, Guan, and Kulis 2007). Finally, EI-GMM and VI-GMM can be alternately stacked into an end-to-end optimization network.

In summary, our main contributions are fourfold: i) we propose an end-to-end Gaussian-induced convolutional neural network for graph representation; ii) we propose an edge-induced GMM to encode the variations of different subgraphs; iii) we derive a vertex-induced GMM to perform dynamic coarsening of graphs, which is an approximation to the weighted graph cut; iv) we verify the effectiveness of our method and report state-of-the-art results on several graph datasets.

¹ Suppose the three Gaussian models are N(0, 1), N(0, 2) and N(0, 3); then the responses on (a)-(d) are, respectively, f_1([0.49, 0.93]) + f_2([0.17, 0.65]) + f_3([0.07, 0.44]), f_1([0.35, 0.73]) + f_2([0.15, 0.58]) + f_3([0.10, 0.64]), f_1([0.35, 0.71]) + f_2([0.15, 0.39]) + f_3([0.10, 0.43]), and f_1([0.46, 0.99]) + f_2([0.18, 0.62]) + f_3([0.08, 0.42]). Please refer to the following section for details.

Related Work

Graph CNNs mainly fall into two categories: spectral and spatial methods. Spectral methods (Bruna et al. 2014; Scarselli et al. 2009; Henaff, Bruna, and LeCun 2015; Such et al. 2017; Li et al. 2018a; 2018b) construct a series of spectral filters by decomposing the graph Laplacian, which often incurs a high computational burden. To address this problem, the fast localized spectral filtering method (Defferrard, Bresson, and Vandergheynst 2016) parameterizes the frequency responses as a Chebyshev polynomial approximation. However, as shown in Fig. 1b, after summarizing all nodes, this method discards the topology structure of a local receptive field. This kind of method usually requires graphs of equal size, analogous to the fixed image sizes required by CNNs (Kipf and Welling 2017). Spatial methods attempt to define spatial structures of adjacent vertices and then perform filtering on the structured graphs. Diffusion CNN (Atwood and Towsley 2016) scans a diffusion process across each node. PATCHY-SAN (Niepert, Ahmed, and Kutzkov 2016) linearizes neighbors by sorting the weights of edges and then derives convolutional filtering on graphs, as shown in Fig. 1c. As an alternative, random-walk-based approaches are also used to define the neighborhoods (Perozzi, Al-Rfou, and Skiena 2014). For the linearized neighbors, RNNs (Li et al. 2016) can be used to model the structured sequences. Similarly, Ngram CNN (Luo et al. 2017) serializes each graph by introducing the concept of the n-gram block. GAT (Velickovic et al. 2018) weights edges through the attention mechanism.
WSC (Jiang et al. 2018) aggregates walk fields defined by random walks into Gaussian mixture models. Zhao et al. (2018) define a standard network with different graph convolutions. Besides, some variants (Hamilton, Ying, and Leskovec 2017; Duran and Niepert 2017; Zhang et al. 2018) employ the aggregation or propagation of local neighbor nodes. Different from these tied-filtering or ranking-filtering methods, we use Gaussian models to encode local variations of the graph. Also different from the recent mixture model method (Monti et al. 2017), which uses a GMM only to learn the importance of adjacent nodes, our method uses a weighted GMM to encode the distributions of local graph structures.

Figure 2: The GIC network architecture. GIC mainly contains two modules: the convolution layer (EI-GMM) and the coarsening layer (VI-GMM). GIC stacks several convolution and coarsening layers alternately and iteratively. More details can be found in the following section.

The GIC Network

Attribute Graph

Here we consider an undirected attribute graph $\mathcal{G} = (\mathcal{V}, A, X)$ of m vertices (or nodes), where $\mathcal{V} = \{v_i\}_{i=1}^{m}$ is the set of vertices, A is a (weighted) adjacency matrix, and X is a matrix of graph attributes (or signals). The adjacency matrix $A \in \mathbb{R}^{m\times m}$ records the connections between vertices: if $v_i, v_j$ are not connected, then $A(v_i, v_j) = 0$; otherwise $A(v_i, v_j) \neq 0$. We sometimes abbreviate $A(v_i, v_j)$ as $A_{ij}$. The attribute matrix $X \in \mathbb{R}^{m\times d}$ is associated with the vertex set $\mathcal{V}$, whose i-th row $X_i$ (or $X_{v_i}$) denotes a d-dimensional attribute of the i-th node (i.e., $v_i$). The graph Laplacian matrix L is defined as $L = D - A$, where $D \in \mathbb{R}^{m\times m}$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. The normalized version is written as $L_{\text{norm}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$, where I is the identity matrix. Unless otherwise specified, we use the latter. We give the definition of subgraph used in the following.

Definition 1. Given an attribute graph $\mathcal{G} = (\mathcal{V}, A, X)$, the attribute graph $\mathcal{G}' = (\mathcal{V}', A', X')$ is a subgraph of $\mathcal{G}$, denoted $\mathcal{G}' \subseteq \mathcal{G}$, if (i) $\mathcal{V}' \subseteq \mathcal{V}$, (ii) $A'$ is the submatrix of A w.r.t. the subset $\mathcal{V}'$, and (iii) $X' = X_{\mathcal{V}'}$.

Overview

The GIC network architecture is shown in Fig. 2. Given an attribute graph $\mathcal{G}^{(0)} = (\mathcal{V}^{(0)}, A^{(0)}, X^{(0)})$, where the superscript denotes the layer number, we construct multi-scale receptive fields for each vertex based on the adjacency matrix $A^{(0)}$. Each receptive field records the k-hop neighborhood relationships around the reference vertex and forms a local centralized subgraph. To encode the centralized subgraph, we project it onto edge-induced Gaussian models, each of which defines one variation direction of the subgraph. We perform different filtering operations on the different Gaussian components and aggregate all responses as the convolutional output. After the convolutional filtering, the input graph $\mathcal{G}^{(0)}$ is transformed into a new graph $\mathcal{G}^{(1)} = (\mathcal{V}^{(1)}, A^{(1)}, X^{(1)})$, where $\mathcal{V}^{(1)} = \mathcal{V}^{(0)}$ and $A^{(1)} = A^{(0)}$. To further abstract graphs, we next stack a coarsening layer on the graph $\mathcal{G}^{(1)}$. The proposed vertex-induced GMM is used to downsample the graph $\mathcal{G}^{(1)}$ into the low-resolution graph $\mathcal{G}^{(2)} = (\mathcal{V}^{(2)}, A^{(2)}, X^{(2)})$. Using the convolution and coarsening modules, we can alternately stack them into a multi-layer GIC network. With the increase of layers, the receptive field of the filters becomes larger, so higher layers extract more global graph information. In the supervised case of graph classification, we finally append a fully connected layer followed by a softmax loss layer.

Multi-Scale Receptive Fields

In the standard CNN, receptive fields can be conveniently defined as latticed spatial regions.
Thus convolution kernels on grid-shaped structures are readily available. However, the construction of convolutional kernels on graphs is intractable because graphs are coordinate-free, e.g., the vertices are unordered and the number of adjacent edges/vertices is not fixed. To address this problem, we resort to the adjacency matrix A, which expresses the connections between vertices. Since $A^k$ exactly records the k-step reachable vertices, we may construct a k-neighbor receptive field by using a k-order polynomial of A, denoted $\psi_k(A)$. In the simplest case, $\psi_k(A) = A^k$ reflects the k-hop neighborhood relationships. In order to remove the scale effect, we may normalize $\psi_k(A)$ as $\psi_k(A)\,\mathrm{diag}(\psi_k(A)\mathbf{1})^{-1}$, which describes the reachable possibility of a k-hop walk. Formally, we define the k-th scale receptive field as a subgraph.

Definition 2. The k-th scale receptive field around a reference vertex $v_i$ is a subgraph $\mathcal{G}^k_{v_i} = (\mathcal{V}', A', X')$ of the k-order graph $(\mathcal{V}, \tilde{A} = \psi_k(A), X)$, where $\mathcal{V}' = \{v_j \mid \tilde{A}_{ij} \neq 0\} \cup \{v_i\}$, $A'$ is the submatrix of $\tilde{A}$ w.r.t. $\mathcal{V}'$, and $X' = X_{\mathcal{V}'}$.

Convolution: Edge-Induced GMM

Given a reference vertex $v_i$, we can construct the centralized subgraph $\mathcal{G}^k_{v_i}$ of the k-th scale. To coordinate the subgraph, we introduce Gaussian mixture models (GMMs), each component of which may be understood as one principal direction of its variations. To encode the variations accurately, we jointly formulate the attributes of vertices and the connections of edges into the Gaussian models. The edge weight $A'(v_i, v_j)$ indicates the relevance of $v_j$ to the central vertex $v_i$: the higher the weight, the stronger the impact on $v_i$. So the weight can be incorporated into a Gaussian model by observing $v_j$ $A'(v_i, v_j)$ times. As a likelihood, this is equivalent to raising the Gaussian density to the power $A'(v_i, v_j)$, which is proportional to $\mathcal{N}(X'_{v_j}; \mu, \frac{1}{A'(v_i, v_j)}\Sigma)$. Formally, we estimate the probability density of the subgraph $\mathcal{G}^k_{v_i}$ from a $C_1$-component GMM,

$$p_{v_i}(X'_{v_j}; \Theta_1, A'_{ij}) = \sum_{c=1}^{C_1} \pi_c\, \mathcal{N}\Big(X'_{v_j}; \mu_c, \frac{1}{A'_{ij}}\Sigma_c\Big), \quad \text{s.t. } \pi_c > 0, \; \sum_{c=1}^{C_1}\pi_c = 1, \qquad (1)$$

where $\Theta_1 = \{\pi_1, \dots, \pi_{C_1}, \mu_1, \dots, \mu_{C_1}, \Sigma_1, \dots, \Sigma_{C_1}\}$ are the mixture parameters, $\{\pi_c\}$ are the mixture coefficients, $\{\mu_c, \Sigma_c\}$ are the parameters of the c-th component, and $A'_{ij} > 0$.² Intuitively, the larger the edge weight $A'_{ij}$ is, the stronger the impact of the node $v_j$ w.r.t. the reference vertex $v_i$ is. We refer to the model in Eqn. (1) as the edge-induced Gaussian mixture model (EI-GMM). In what follows, we assume all attributes of nodes are independent of each other, as is often done in signal processing. That means the covariance matrix $\Sigma_c$ is diagonal, so we denote it as $\mathrm{diag}(\sigma^2_c)$. To avoid the explicit constraints on $\pi_c$ in Eqn. (1), we adopt a softmax normalization with the re-parameterization variable $\alpha_c$, i.e., $\pi_c = \exp(\alpha_c)/\sum_{k=1}^{C_1}\exp(\alpha_k)$. Thus the entire subgraph log-likelihood can be written as

$$\zeta(\mathcal{G}^k_{v_i}) = \sum_{j=1}^{|\mathcal{V}'|} \ln p_{v_i}(X'_{v_j}; \Theta_1, A'_{ij}) = \sum_{j=1}^{|\mathcal{V}'|} \ln \sum_{c=1}^{C_1} \pi_c\, \mathcal{N}\Big(X'_{v_j}; \mu_c, \frac{1}{A'_{ij}}\Sigma_c\Big). \qquad (2)$$

To infer forward, instead of using the expectation-maximization (EM) algorithm, we use the gradients of the subgraph log-likelihood with respect to the parameters $\Theta_1$ of the EI-GMM, motivated by the recent Fisher vector work (Sanchez et al. 2013), which has been proven effective for representation. For convenience, we write $N_{jc} = \mathcal{N}(X'_{v_j}; \mu_c, \frac{1}{A'_{ij}}\sigma^2_c)$ and $Q_{jc} = \frac{\pi_c N_{jc}}{\sum_{k=1}^{C_1}\pi_k N_{jk}}$; then we can derive the gradients of the model parameters from Eqn. (2) as

$$\frac{\partial \zeta(\mathcal{G}^k_{v_i})}{\partial \mu_c} = \sum_{j} \frac{A'_{ij}\, Q_{jc}\,(X'_{v_j} - \mu_c)}{\sigma^2_c}, \qquad \frac{\partial \zeta(\mathcal{G}^k_{v_i})}{\partial \sigma_c} = \sum_{j} \frac{Q_{jc}\big(A'_{ij}(X'_{v_j} - \mu_c)^2 - \sigma^2_c\big)}{\sigma^3_c}, \qquad (3)$$

where the division of vectors is taken element-wise. Note that we do not use $\partial\zeta(\mathcal{G}^k_{v_i})/\partial\alpha_c$, since it brought no improvement in our experience.

² In practice, we normalize A into a non-negative matrix.
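As a concrete illustration of how one receptive field is encoded by Eqns. (2)-(3), the following NumPy sketch computes the edge-induced Gaussian gradient features of a single subgraph under the diagonal-covariance assumption stated above. It is only a minimal sketch: the function name `ei_gmm_encode` and its interface are our own illustration, not the authors' released implementation.

```python
import numpy as np

def ei_gmm_encode(X_sub, a_ref, pi, mu, sigma2):
    """Encode one centralized subgraph with an edge-induced GMM (sketch of Eqns. (2)-(3)).

    X_sub  : (n, d) attributes X' of the n vertices in the receptive field
    a_ref  : (n,)   positive edge weights A'_{ij} from the reference vertex to each vertex
    pi     : (C,)   mixture coefficients pi_c
    mu     : (C, d) Gaussian means mu_c
    sigma2 : (C, d) diagonal variances sigma_c^2
    Returns a (C, 2*d) array whose c-th row stacks the gradients w.r.t. mu_c and sigma_c.
    """
    diff = X_sub[:, None, :] - mu[None, :, :]                 # (n, C, d)
    var = sigma2[None, :, :] / a_ref[:, None, None]           # per-vertex variance sigma_c^2 / A'_{ij}
    log_gauss = -0.5 * (np.log(2.0 * np.pi * var) + diff ** 2 / var).sum(-1)   # (n, C)

    # Soft assignments Q_{jc} = pi_c N_{jc} / sum_k pi_k N_{jk}, computed in log space.
    log_w = np.log(pi)[None, :] + log_gauss
    Q = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    Q /= Q.sum(axis=1, keepdims=True)                         # (n, C)

    # Gradients of the subgraph log-likelihood, as in Eqn. (3).
    g_mu = np.einsum('nc,ncd->cd', Q * a_ref[:, None], diff) / sigma2
    g_sigma = np.einsum('nc,ncd->cd', Q,
                        a_ref[:, None, None] * diff ** 2 - sigma2[None, :, :]) / sigma2 ** 1.5
    return np.concatenate([g_mu, g_sigma], axis=1)
```

In the network, one such $(C_1, 2d)$ encoding would be produced per reference vertex and per scale k, and the linear filters $f_c$ of Eqn. (4) below are then applied to the rows before the ReLU.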
The gradients describe the contributions of the corresponding parameters to the generative process, and the subgraph variations are adaptively allocated to the $C_1$ Gaussian models. Finally, we ensemble all gradients w.r.t. the Gaussian models (i.e., the directions of the graph), in analogy to collecting a local square receptive field on an image. Formally, for the k-scale receptive field $\mathcal{G}^k_{v_i}$ around the vertex $v_i$, the attributes produced from the Gaussian models are filtered respectively and then aggregated,

$$F(\mathcal{G}^k_{v_i}, \Theta_1, f) = \mathrm{ReLU}\Big(\sum_{c=1}^{C_1} f_c\Big(\mathrm{Cat}\Big[\frac{\partial\zeta(\mathcal{G}^k_{v_i})}{\partial\mu_c}, \frac{\partial\zeta(\mathcal{G}^k_{v_i})}{\partial\sigma_c}\Big]\Big)\Big), \qquad (4)$$

where $\mathrm{Cat}[\cdot,\cdot]$ is a concatenation operator, $f_c$ is a linear filtering function (i.e., a convolution function), and ReLU is the rectified linear unit. Therefore, for different subgraphs we produce feature vectors of the same dimensionality, which depends only on the number of Gaussian models. If the soft assignment distribution $Q_{jc}$ is sharply peaked on one certain Gaussian for the vertex $v_j$, the vertex is projected onto only that Gaussian direction.

Coarsening: Vertex-Induced GMM

Like the standard pooling in CNNs, we need to downsample graphs so as to abstract them as well as reduce the computational cost. However, pooling on images is tailored for latticed structures and cannot be used for irregular graphs. One solution is to use a clustering algorithm to partition the vertices into several clusters and then produce a new vertex from each cluster. However, we expect that two vertices should be less likely to fall into the same cluster if there is a high transfer difficulty between them. To this end, we derive the vertex-induced Gaussian mixture model (VI-GMM) to weight each vertex. To utilize the edge information, we construct a latent observation $\phi(v_i)$ w.r.t. each vertex $v_i$ from the graph Laplacian (or the adjacency matrix if it is positive semi-definite), i.e., the kernel calculation $\langle\phi(v_i), \phi(v_j)\rangle = L_{ij}$. Moreover, for each vertex $v_i$, we define an influence factor $w_i$ on the Gaussian models. Formally, given $C_2$ Gaussian models, VI-GMM is written as

$$p(\phi(v_i); \Theta_2, w_i) = \sum_{c=1}^{C_2} \pi_c\, \mathcal{N}\Big(\phi(v_i); \mu_c, \frac{1}{w_i}\Sigma_c\Big), \quad \text{s.t. } w_i = h(X_{v_i}) > 0, \qquad (5)$$

where h is a mapping function to be learnt. To reduce the computational cost of the matrix inverse on $\Sigma$, we specify it as an identity matrix. Then we have

$$p(\phi(v_i); \Theta_2, w_i) = \sum_{c=1}^{C_2} \pi_c \Big(\frac{w_i}{2\pi}\Big)^{d/2} \exp\Big(-\frac{w_i}{2}\,\|\phi(v_i) - \mu_c\|^2\Big). \qquad (6)$$

Given a graph with m vertices, the objective is to maximize the following log-likelihood:

$$\arg\max_{\Theta_2}\; \zeta(\Theta_2) = \sum_{i=1}^{m} \ln\Big(\sum_{c=1}^{C_2} \pi_c\, \mathcal{N}\Big(\phi(v_i); \mu_c, \frac{1}{w_i}I\Big)\Big). \qquad (7)$$

To solve the model in Eqn. (7), we use the iterative expectation-maximization (EM) algorithm, which has a closed-form solution at each step and automatically satisfies the required constraints. The graphical clustering process is summarized as follows:

(1) E-Step: the posteriors, i.e., of the i-th vertex for the c-th cluster, are updated as $p_{ic} = \frac{\pi_c\, p(\phi(v_i);\theta_c, w_i)}{\sum_{k=1}^{C_2}\pi_k\, p(\phi(v_i);\theta_k, w_i)}$, where $\theta_c$ denotes the c-th Gaussian parameters and $\Theta_2 = \{\theta_1, \dots, \theta_{C_2}\}$.

(2) M-Step: we optimize the Gaussian parameters $\pi, \mu$. The parameter estimation is given by $\pi_c = \frac{1}{m}\sum_{i=1}^{m} r_{ic}$ and $\mu_c = \frac{\sum_{v_i\in\mathcal{G}_c} w_i\,\phi(v_i)}{\sum_{v_i\in\mathcal{G}_c} w_i}$. Here $\pi_c$ indicates the energy summation of all vertices assigned to cluster c, and $\mu_c$ may be understood as a doubly weighted ($w_i$, $r_{ic}$) average over cluster c.

After several iterations of the two steps, we perform a hard quantification: the i-th vertex is assigned to the class with the maximum posterior, formally, $r_{ic} = 1$ if $c = \arg\max_k p_{ik}$, and 0 otherwise. Thus we obtain the cluster matrix $P \in \{0, 1\}^{m\times C_2}$, where $P_{ic} = 1$ if the i-th vertex falls into cluster c. During coarsening, we take the maximal responses within each cluster as the attributes of the new vertex and derive the new adjacency matrix as $P^{\top} A P$.
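To make the E/M updates above concrete, here is a minimal NumPy sketch of the vertex-induced weighted EM with identity covariances. For clarity it assumes the latent observations $\phi(v_i)$ are given explicitly as vectors (the kernelized computation that avoids forming $\phi$ is discussed next), it keeps the soft posteriors $p_{ic}$ in the doubly weighted mean update (playing the role of $r_{ic}$), and the name `vi_gmm_coarsen` is our own, not an official implementation.

```python
import numpy as np

def vi_gmm_coarsen(phi, w, C, n_iter=10, seed=0):
    """Vertex-induced GMM clustering for graph coarsening (sketch of the EM steps above).

    phi : (m, d) latent vertex observations phi(v_i) (assumed explicit here)
    w   : (m,)   positive influence factors w_i = h(X_{v_i})
    C   : number of clusters, i.e., vertices of the coarsened graph
    Returns the hard assignment matrix P in {0,1}^{m x C}.
    """
    rng = np.random.default_rng(seed)
    m, d = phi.shape
    pi = np.full(C, 1.0 / C)
    mu = phi[rng.choice(m, size=C, replace=False)]       # initialize means from random vertices

    for _ in range(n_iter):
        # E-step: posteriors p_{ic} under N(phi(v_i); mu_c, (1/w_i) I), in log space.
        # The (w_i/2pi)^{d/2} factor is constant across clusters and cancels in the normalization.
        dist2 = ((phi[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (m, C)
        log_p = np.log(pi)[None, :] - 0.5 * w[:, None] * dist2
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)

        # M-step: mixture weights and doubly (p_{ic}, w_i) weighted means.
        pi = p.mean(axis=0)
        wp = p * w[:, None]                                          # (m, C)
        mu = wp.T @ phi / (wp.sum(axis=0)[:, None] + 1e-12)

    # Hard quantification: each vertex goes to its most probable cluster.
    P = np.zeros((m, C), dtype=int)
    P[np.arange(m), p.argmax(axis=1)] = 1
    return P
```

The coarsened attributes can then be obtained by max-pooling within each cluster and the new adjacency matrix as $P^{\top} A P$, as stated above.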
It is worth noting that we need not compute the concrete $\phi$ during the clustering process. The main calculation $\|\phi(v_i) - \mu_c\|^2$ in EM can be reduced to the kernel version

$$\|\phi(v_i) - \mu_c\|^2 = K_{ii} - \frac{2\sum_{v_j\in\mathcal{G}_c} w_j K_{ij}}{\sum_{v_j\in\mathcal{G}_c} w_j} + \frac{\sum_{v_j, v_k\in\mathcal{G}_c} w_j w_k K_{jk}}{(\sum_{v_j\in\mathcal{G}_c} w_j)^2},$$

where $K_{ij} = \langle\phi(v_i), \phi(v_j)\rangle$. In practice, we can use the graph Laplacian L as the kernel. In this case, we can easily reach the following proposition, which is relevant to graph cut (Dhillon, Guan, and Kulis 2007).

Proposition 1. In EM, if the kernel matrix takes the weight-regularized graph Laplacian, i.e., $K = \mathrm{diag}(w)\, L\, \mathrm{diag}(w)$, then VI-GMM is equal to an approximate optimization of the graph cut, i.e., $\min \sum_{c=1}^{C_2} \frac{\mathrm{links}(\mathcal{V}_c, \mathcal{V}\setminus\mathcal{V}_c)}{w(\mathcal{V}_c)}$, where $\mathrm{links}(\mathcal{A}, \mathcal{B}) = \sum_{v_i\in\mathcal{A}, v_j\in\mathcal{B}} A_{ij}$ and $w(\mathcal{V}_c) = \sum_{v_i\in\mathcal{V}_c} w_i$.

Experiments

Graph Classification

For graph classification, each graph is annotated with one label. We use two types of datasets: bioinformatics and social network datasets. The former contains MUTAG (Debnath et al. 1991), PTC (Toivonen et al. 2003), NCI1 and NCI109 (Wale, Watson, and Karypis 2008), ENZYMES (Borgwardt et al. 2005) and PROTEINS (Borgwardt et al. 2005). The latter contains COLLAB (Leskovec, Kleinberg, and Faloutsos 2005), REDDIT-BINARY, REDDIT-MULTI-5K, REDDIT-MULTI-12K, IMDB-BINARY and IMDB-MULTI.

Experiment Settings We verify our GIC on the above bioinformatics and social network datasets. By default, GIC consists of three graph convolution layers, each followed by a graph coarsening layer, and one fully connected layer with a final softmax layer, as shown in Fig. 2. Its configuration can simply be set as C(64)-P(0.25)-C(128)-P(0.25)-C(256)-P-FC(256), where C, P and FC denote the convolution, coarsening and fully connected layers respectively. The choices of hyperparameters are mainly inspired by the classic VGG net; for example, the coarsening factor is 0.25 (vs. 0.5 × 0.5 in VGG), and the attribute dimensions of the three convolution layers are 64-128-256 (vs. the channel numbers of conv1-3 in VGG). The scale of the receptive field and the number of Gaussian components are both set to 7. We train the GIC network with stochastic gradient descent for roughly 300 epochs with a batch size of 100, a learning rate of 0.1 and a momentum of 0.95. On the bioinformatics datasets, we exploit the labels and degrees of the vertices to generate the initial attributes of each vertex; on the social network datasets, we use the degrees of the vertices. We closely follow the experimental setup in PSCN (Niepert, Ahmed, and Kutzkov 2016): we perform 10-fold cross-validation with 9 folds for training and 1 fold for testing, repeat the experiments 10 times, and report the average accuracies.
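For reference, the default layer layout and training hyperparameters described above can be written down as a small, purely illustrative Python spec; the tuple tags 'C', 'P' and 'FC' mirror the C(64)-P(0.25)-... notation used in the text and are not an actual API.

```python
# Illustrative spec of the default GIC configuration used in the experiments.
# ('C', out_dim) = EI-GMM convolution layer, ('P', ratio) = VI-GMM coarsening layer,
# ('FC', units)  = fully connected layer followed by softmax for graph classification.
GIC_CONFIG = [
    ('C', 64),  ('P', 0.25),
    ('C', 128), ('P', 0.25),
    ('C', 256), ('P', None),   # final coarsening layer; ratio left unspecified in the text
    ('FC', 256),
]

K_SCALES = 7          # scales of the receptive field
NUM_GAUSSIANS = 7     # Gaussian components C1 per convolution layer
EPOCHS, BATCH_SIZE = 300, 100
LEARNING_RATE, MOMENTUM = 0.1, 0.95
```

Such a spec could be consumed by a builder that alternates the EI-GMM convolution and VI-GMM coarsening modules sketched earlier.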
Comparisons with the State-of-the-arts We compare our GIC with several state-of-the-art methods, including graph convolution networks (PSCN (Niepert, Ahmed, and Kutzkov 2016), DCNN (Atwood and Towsley 2016), Ngram CNN (Luo et al. 2017)), neural networks (SAEN (Orsini, Baracchi, and Frasconi 2017)), feature-based algorithms (DyF (Gomez, Chiem, and Delvenne 2017), FB (Bruna et al. 2014)), random-walk-based methods (RW (Gärtner, Flach, and Wrobel 2003)), and graph kernel approaches (GK (Shervashidze et al. 2009), DGK (Yanardag and Vishwanathan 2015), WL (Morris, Kersting, and Mutzel 2017)). The comparisons are shown in Table 1; all results come from the related literature. We have the following observations.

Deep learning based methods on graphs (including DCNN, PSCN, Ngram CNN, SAEN and ours) are superior to the conventional methods in most cases. The conventional kernel methods usually require the calculation of graph kernels with high computational complexity. In contrast, graph neural networks learn more abstract high-level features through a forward inference, which needs relatively low computation cost.

Compared with recent graph convolution methods, ours achieves better performance on most datasets, such as PTC, NCI1, NCI109, ENZYMES and PROTEINS. The main reason should be that the local variations of subgraphs are accurately described by the Gaussian component analysis.

The proposed GIC achieves state-of-the-art results on most datasets. The best performance is gained on several bioinformatics and social network datasets, including PTC, NCI1, ENZYMES, PROTEINS, COLLAB, IMDB-BINARY and IMDB-MULTI. Although the Ngram CNN, DyF, WL and SAEN approaches obtain the best performance on MUTAG, REDDIT-BINARY, NCI109, REDDIT-MULTI-5K and REDDIT-MULTI-12K respectively, our method is fully comparable to them.

Table 1: Comparisons with state-of-the-art methods (mean accuracies and standard deviations; methods: PSCN, DCNN, NGRAMCNN, FB, DYF, WL, GK, DGK, RW, SAEN, GIC).
MUTAG       means: 92.63 66.98 94.99 84.66 88.00 78.3 81.66 82.66 83.72 84.99 94.44; stds: 4.21 5.63 2.01 2.37 1.9 2.11 1.45 1.50 1.82 4.30
PTC         means: 60.00 56.60 68.57 55.58 57.15 57.26 57.32 57.85 57.04 77.64; stds: 4.82 1.72 2.30 1.47 1.41 1.13 1.30 1.30 6.98
NCI1        means: 78.59 62.61 62.90 68.27 83.1 62.28 62.48 48.15 77.80 84.08; stds: 1.89 0.96 0.34 0.2 0.29 0.25 0.50 0.42 1.77
NCI109      means: 62.86 62.43 66.72 85.2 62.60 62.69 49.75 82.86; stds: 1.13 0.20 0.2 0.19 0.23 0.60 2.37
ENZYMES     means: 18.10 29.00 33.21 53.4 26.61 27.08 24.16 62.50; stds: 1.16 1.20 1.4 0.99 0.79 1.64 5.12
PROTEINS    means: 75.89 75.96 69.97 75.04 73.7 71.67 71.68 74.22 75.31 77.65; stds: 2.76 2.98 1.34 0.65 0.5 0.55 0.50 0.42 0.70 3.21
COLLAB      means: 72.60 76.35 80.61 72.84 73.09 69.01 75.63 81.24; stds: 2.15 1.64 1.60 0.28 0.25 0.09 0.31 1.44
REDDIT-B    means: 86.30 88.98 89.51 75.3 77.34 78.04 67.63 86.08 88.45; stds: 1.58 2.26 1.96 0.3 0.18 0.39 1.01 0.53 1.60
REDDIT-5K   means: 49.10 50.83 50.31 41.01 41.27 52.24 51.58; stds: 0.70 1.83 1.92 0.17 0.18 0.38 1.68
REDDIT-12K  means: 41.32 42.37 40.30 31.82 32.22 46.72 42.98; stds: 0.42 1.27 1.41 0.08 0.10 0.23 0.87
IMDB-B      means: 71.00 71.66 72.02 72.87 72.4 65.87 66.96 64.54 71.26 76.70; stds: 2.29 2.71 4.71 4.05 0.5 0.98 0.56 1.22 0.74 3.25
IMDB-M      means: 45.23 50.66 47.34 48.12 43.89 44.55 34.54 49.11 51.66; stds: 2.84 4.10 3.56 3.56 0.38 0.52 0.76 0.64 3.40

Table 2: Node label prediction on Reddit and PPI data (micro-averaged F1 score).
METHOD                  REDDIT  PPI
RANDOM                  0.042   0.396
RAW FEATURES            0.585   0.422
DEEPWALK                0.324   -
DEEPWALK + FEATURES     0.691   -
NODE2VEC + REGRESSION   0.934   -
GRAPHSAGE-GCN           0.930   0.500
GRAPHSAGE-MEAN          0.950   0.598
GRAPHSAGE-LSTM          0.954   0.612
GIC                     0.952   0.661

Node Classification

For node classification, each node is assigned one or multiple labels, which is challenging when the label set is large. During training, we only use a fraction of the nodes and their labels, and the task is to predict the labels of the remaining nodes. Following the setting in (Hamilton, Ying, and Leskovec 2017), we conduct the experiments on the Reddit and PPI data. For a fair comparison to GraphSAGE (Hamilton, Ying, and Leskovec 2017), we use the same initial graph data, minibatch iterators, supervised loss function and neighborhood sampler. The other network parameters are the same as in graph classification, except that the coarsening layer is removed.
Table 2 summarizes the comparison results. Our GIC obtains the best performance of 0.661 on the PPI data and a comparable result of 0.952 on the Reddit data. The raw features provide important initial information for node multi-label classification: based on the raw features, DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) improves by about 0.36 (micro-F1) on the Reddit data. Meanwhile, we conduct an experiment with node2vec (Grover and Leskovec 2016) followed by a regression model for classification; our method gains better performance than node2vec. Compared with the different aggregation schemes, i.e., GCN (Kipf and Welling 2017), mean and LSTM aggregation, our GIC gives a significant improvement of about 0.16 on the PPI data and a competitive performance on the Reddit data. The results demonstrate that our approach can robustly infer the unknown labels of partially labeled graphs.

Model Analysis

EI-GMM and VI-GMM: To directly analyze the convolution filtering with EI-GMM, we compare our method with the ChebNet (Defferrard, Bresson, and Vandergheynst 2016) and GCN (Kipf and Welling 2017) approaches under the same coarsening mechanism VI-GMM. As shown in Table 3, under the same coarsening operation, our GIC is superior to ChebNet+VI-GMM and GCN+VI-GMM, which indicates that EI-GMM can indeed encode the variations of subgraphs more effectively. On the other hand, we remove the coarsening layer from our GIC; for graphs of different sizes, we pad new zero vertices to a fixed size and then concatenate the attributes of all vertices for classification. As shown in the same table, GIC still outperforms GIC without VI-GMM coarsening, which verifies the effectiveness of the coarsening layer VI-GMM.

Table 3: The verification of our convolution and coarsening.
DATASET    CHEBNET w/ VI-GMM  GCN w/ VI-GMM  GIC w/o VI-GMM  GIC
MUTAG      89.44 ± 6.30       92.22 ± 5.66   93.33 ± 4.84    94.44 ± 4.30
PTC        68.23 ± 6.28       71.47 ± 4.75   68.23 ± 4.11    77.64 ± 6.98
NCI1       73.96 ± 1.87       76.39 ± 1.08   79.17 ± 1.63    84.08 ± 1.77
NCI109     72.88 ± 1.85       74.92 ± 1.70   77.81 ± 1.88    82.86 ± 2.37
ENZYMES    52.83 ± 7.34       51.50 ± 5.50   52.00 ± 4.76    62.50 ± 5.12
PROTEINS   78.10 ± 3.37       80.09 ± 3.20   78.19 ± 2.04    77.65 ± 3.21

Table 4: Comparisons on K and C1.
DATASET    K, C1 = 1      K, C1 = 3     K, C1 = 5     K, C1 = 7
MUTAG      67.77 ± 11.05  83.88 ± 5.80  90.55 ± 6.11  94.44 ± 4.30
PTC        72.05 ± 8.02   77.05 ± 4.11  76.47 ± 5.58  77.64 ± 6.98
NCI1       71.21 ± 1.94   83.26 ± 1.17  84.47 ± 1.64  84.08 ± 1.77
NCI109     70.02 ± 1.57   81.74 ± 1.56  83.39 ± 1.65  82.86 ± 2.37
ENZYMES    33.83 ± 4.21   64.00 ± 4.42  63.66 ± 3.85  62.50 ± 5.12
PROTEINS   75.49 ± 4.00   77.47 ± 3.37  78.10 ± 2.96  77.65 ± 3.21

Table 5: Comparisons on the layer number.
DATASET    N = 2         N = 4         N = 6         N = 8
MUTAG      86.66 ± 8.31  91.11 ± 5.09  93.88 ± 5.80  94.44 ± 4.30
PTC        64.11 ± 6.55  74.41 ± 6.45  75.29 ± 6.05  77.64 ± 6.98
NCI1       71.82 ± 1.85  81.36 ± 1.07  83.01 ± 1.54  84.08 ± 1.77
NCI109     71.09 ± 2.41  80.02 ± 1.67  81.60 ± 1.83  82.86 ± 2.37
ENZYMES    42.33 ± 4.22  61.83 ± 5.55  64.83 ± 6.43  62.50 ± 5.12
PROTEINS   77.38 ± 2.97  79.81 ± 3.84  78.37 ± 4.00  77.65 ± 3.21

K and C1: The kernel size K and the number of Gaussian components C1 are the most crucial parameters. Generally, C1 should be proportional to K, because a larger receptive field usually contains more vertices (i.e., a relatively large subgraph). Thus we simply take equal values for them, K = C1 = {1, 3, 5, 7}. The experimental results are shown in Table 4. With the increase of K and C1, the performance improves in most cases. The reasons are twofold: i) with an increasing receptive field size, the convolution covers farther-hopping neighbors; ii) with the increase of C1, the variations of subgraphs are encoded more accurately.
However, larger values of K and C1 increase the computational burden, and overfitting might occur with the increase of model complexity. Take NCI109 as an example: in the first convolution layer, the encoded attributes (in Eqn. (4)) have 2 × 39 × 7 = 546 dimensions for each scale of receptive field, where 39 is the attribute dimension (w.r.t. the number of node labels) and 7 is the number of Gaussian components. Thus, for 7 scales of receptive field, the final encoded attributes have 546 × 7 = 3822 dimensions, which are mapped to 64 dimensions by the functions f = [f_1, ..., f_{C_1}] in Eqn. (4). The model therefore has 3822 × 64 = 244608 parameters in the first layer. Similarly, if the number of node labels is 2, the number of parameters sharply decreases to 18816. Besides, the parameter complexity is related to the number of classes and nodes. The comparison results in Table 4 demonstrate the trend w.r.t. the parameters K and C1 in our GIC framework.

Number of stacked layers: Here we test the number of stacked network layers with N = 2, 4, 6, 8. When N = 2, only one fully connected layer and one softmax layer are preserved. When N = 4, we add two layers: the convolution layer and the coarsening layer. Continuing to stack both, the depth of the network becomes 6 and 8. The results are shown in Table 5. Deeper networks gain better performance in most cases, because a larger receptive field is observed and more abstract structures are extracted in the top layers. Of course, there is an extra risk of overfitting due to the increase of model complexity.

An analysis of computation complexity: In the convolution layer, the computational costs of the receptive fields and the Gaussian encoding are about $O(Km^2)$ and $O(C_1 m d^2)$ respectively, where m and d are the number of nodes and the feature dimensionality. Generally, $K = C_1 \le d < m$. In the coarsening layer, the time complexity is about $O(pm^2 + md)$, where p is the iteration number of the EM algorithm. In all, supposing the whole GIC alternately stacks n convolution and coarsening layers, the entire time complexity is $O(n(K + p)m^2 + n C_1 m d^2 + nmd)$.

Conclusion

In this paper, we proposed a novel Gaussian-induced convolution network to handle general irregular graph data. Considering that previous spectral and spatial methods do not well characterize the local variations of a graph, we derived an edge-induced GMM to adaptively encode subgraph structures by projecting them onto several Gaussian components and then performing different filtering operations on each Gaussian direction, like the standard CNN filters on images. Meanwhile, we formulated graph coarsening as a vertex-induced GMM to dynamically partition a graph, which was also shown to be approximately equivalent to the weighted graph cut. Extensive experiments on two graph tasks (i.e., graph and node classification) demonstrated the effectiveness and superiority of our GIC compared with baselines and state-of-the-art methods. In the future, we would like to extend our method to more applications with irregular data.

Acknowledgments

The authors would like to thank the Chairs and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported by the National Science Fund of China under Grant Nos. 61602244, 61772276, U1713208 and 61472187, and the Program for Changjiang Scholars.

References

Atwood, J., and Towsley, D. 2016. Diffusion-convolutional neural networks. In NIPS, 1993-2001.
Borgwardt, K. M.; Ong, C. S.; Schönauer, S.; Vishwanathan, S.; Smola, A. J.; and Kriegel, H.-P. 2005. Protein function prediction via graph kernels. Bioinformatics 21(suppl 1):i47-i56.
Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2014. Spectral networks and locally connected networks on graphs. ICLR.
Cui, Z.; Xu, C.; Zheng, W.; and Yang, J. 2018. Context-dependent diffusion network for visual relationship detection. In MM, 1475-1482. ACM.
Cui, Z.; Yang, J.; et al. 2017. Spectral filter tracking. arXiv preprint arXiv:1707.05553.
Debnath, A. K.; Lopez de Compadre, R. L.; Debnath, G.; Shusterman, A. J.; and Hansch, C. 1991. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry 34(2):786-797.
Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 3844-3852.
Dhillon, I. S.; Guan, Y.; and Kulis, B. 2007. Weighted graph cuts without eigenvectors: a multilevel approach. TPAMI 29(11).
Duran, A. G., and Niepert, M. 2017. Learning graph representations with embedding propagation. In NIPS, 5119-5130.
Gärtner, T.; Flach, P.; and Wrobel, S. 2003. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines 129-143.
Gomez, L. G.; Chiem, B.; and Delvenne, J.-C. 2017. Dynamics based features for graph classification. arXiv preprint arXiv:1705.10817.
Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In SIGKDD, 855-864. ACM.
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NIPS, 1025-1035.
Haussler, D. 1999. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz.
Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
Jiang, J.; Cui, Z.; Xu, C.; Li, C.; and Yang, J. 2018. Walk-steered convolution for graph classification. arXiv preprint arXiv:1804.05837.
Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. ICLR.
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436-444.
Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In SIGKDD, 177-187. ACM.
Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. ICLR.
Li, C.; Cui, Z.; Zheng, W.; Xu, C.; Ji, R.; and Yang, J. 2018a. Action-attending graphic neural network. TIP 27(7):3657-3670.
Li, C.; Cui, Z.; Zheng, W.; Xu, C.; and Yang, J. 2018b. Spatio-temporal graph convolution for skeleton based action recognition. AAAI.
Luo, Z.; Liu, L.; Yin, J.; Li, Y.; and Wu, Z. 2017. Deep learning of graphs with ngram convolutional neural networks. TKDE 29(10):2125-2139.
Marino, K.; Salakhutdinov, R.; and Gupta, A. 2016. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844.
Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; and Bronstein, M. M. 2017. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR.
Morris, C.; Kersting, K.; and Mutzel, P. 2017. Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs. In ICDM, 327-336. IEEE.
Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In ICML, 2014-2023.
Orsini, F.; Baracchi, D.; and Frasconi, P. 2017. Shift aggregate extract networks. arXiv preprint arXiv:1703.05537.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In SIGKDD, 701-710. ACM.
Sanchez, J.; Perronnin, F.; Mensink, T.; and Verbeek, J. 2013. Image classification with the Fisher vector: Theory and practice. IJCV 105(3):222-245.
Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. TNN 20(1):61-80.
Schölkopf, B.; Weston, J.; Eskin, E.; Leslie, C.; and Noble, W. S. 2002. A kernel approach for learning from almost orthogonal patterns. In ECML, 511-528. Springer.
Shervashidze, N.; Vishwanathan, S.; Petri, T.; Mehlhorn, K.; and Borgwardt, K. 2009. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, 488-495.
Song, T.; Zheng, W.; Song, P.; and Cui, Z. 2018. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing.
Such, F. P.; Sah, S.; Dominguez, M. A.; Pillai, S.; Zhang, C.; Michael, A.; Cahill, N. D.; and Ptucha, R. 2017. Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing 11(6):884-896.
Toivonen, H.; Srinivasan, A.; King, R. D.; Kramer, S.; and Helma, C. 2003. Statistical evaluation of the predictive toxicology challenge 2000-2001. Bioinformatics 19(10):1183-1193.
Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph attention networks. ICLR.
Wale, N.; Watson, I. A.; and Karypis, G. 2008. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems 14(3):347-375.
Yanardag, P., and Vishwanathan, S. 2015. Deep graph kernels. In SIGKDD, 1365-1374. ACM.
Zhang, T.; Zheng, W.; Cui, Z.; and Li, Y. 2018. Tensor graph convolutional neural network. arXiv preprint arXiv:1803.10071.
Zhao, W.; Xu, C.; Cui, Z.; Zhang, T.; Jiang, J.; Zhang, Z.; and Yang, J. 2018. When work matters: Transforming classical network structures to graph CNN. arXiv preprint arXiv:1807.02653.