Stochastic Blockmodels meet Graph Neural Networks

Nikhil Mehta 1, Lawrence Carin 1, Piyush Rai 2

The major part of this work was done when Nikhil Mehta was at IIT Kanpur. 1 Department of Electrical and Computer Engineering, Duke University. 2 Department of Computer Science, IIT Kanpur. Correspondence to: Nikhil Mehta.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Stochastic blockmodels (SBMs) and their variants, e.g., mixed-membership and overlapping stochastic blockmodels, are latent-variable-based generative models for graphs. They have proven successful for various tasks, such as discovering community structure and link prediction on graph-structured data. Recently, graph neural networks, e.g., graph convolutional networks, have also emerged as a promising approach for learning powerful representations (embeddings) of the nodes in a graph, by exploiting graph properties such as locality and invariance. In this work, we unify these two directions by developing a sparse variational autoencoder for graphs that retains the interpretability of SBMs while also enjoying the excellent predictive performance of graph neural networks. Moreover, our framework is accompanied by a fast recognition model that enables fast inference of the node embeddings (which are of independent interest for inference in SBMs and their variants). Although we develop this framework for a particular type of SBM, namely the overlapping stochastic blockmodel, the proposed framework can be adapted readily to other types of SBMs. Experimental results on several benchmarks demonstrate encouraging link-prediction performance while learning an interpretable latent structure that can be used for community discovery.

1. Introduction

Learning the latent structure in graph-structured data (Fortunato, 2010; Goldenberg et al., 2010; Schmidt & Morup, 2013) is an important problem in a wide range of domains, such as social and biological network analysis and recommender systems. These latent structures help discover the underlying communities in a network, as well as predict potential links between nodes. Latent space models (Hoff et al., 2002) and their structured extensions, such as the stochastic blockmodel (Nowicki & Snijders, 2001) and variants like the infinite relational model (IRM) (Kemp et al., 2006), the mixed-membership stochastic blockmodel (MMSB) (Airoldi et al., 2008), and the overlapping stochastic blockmodel (OSBM) (Miller et al., 2009a; Latouche et al., 2011a), accomplish this by learning low-dimensional, interpretable node embeddings defined via structured latent variables. These embeddings can be used to identify the community membership(s) of each node in the graph, as well as for tasks such as link prediction. The overlapping stochastic blockmodel (OSBM), also known as the latent feature relational model (LFRM), is a particularly appealing model for relational data (Miller et al., 2009a; Latouche et al., 2011a; Zhu et al., 2016). The OSBM/LFRM models each node in the graph as belonging to one or more communities using a binary membership vector, and defines the link probability between any pair of nodes as a bilinear function of their community membership vectors. Despite its appealing properties, the OSBM/LFRM has a number of limitations.
In particular, although usually considered to be more expressive (Miller et al., 2009a) than models such as the IRM and MMSB, a single layer of binary node embeddings and the bilinear model for link generation can still limit the expressiveness of the OSBM/LFRM. Moreover, it has a challenging inference procedure, which primarily relies on MCMC (Miller et al., 2009a; Latouche et al., 2011a) or mean-field variational inference (Zhu et al., 2016). Although recent models have tried to improve the expressiveness of the OSBM/LFRM, e.g., by assuming a deep hierarchy of binary-vector-based node embeddings (Hu et al., 2017), inference in such models remains intractable, requiring expensive MCMC-based inference. It is therefore desirable to have a model that retains the basic spirit of the OSBM/LFRM (e.g., easy interpretability and strong link-prediction performance), but with greater expressiveness and a simpler, scalable inference procedure.

Motivated by these desiderata, we develop a deep generative framework for graph-structured data that inherits the easy interpretability of overlapping stochastic blockmodels, but is much more expressive and enjoys a fast inference procedure. Our framework is based on a novel, sparse variant of the variational autoencoder (VAE) (Kingma & Welling, 2013), designed to model graph-structured data. Our VAE-based setup comprises a nonlinear generator/decoder for the graph and a nonlinear encoder based on the graph convolutional network (GCN) (Kipf & Welling, 2016a) (although other graph neural networks can also be used). Our framework posits each node of the graph to have an embedding in the form of a sparse latent representation (modeled by a Beta-Bernoulli process (Griffiths & Ghahramani, 2011), which also enables learning the size of the embeddings). The generator/decoder part of the VAE models the probability of a link between two nodes via a nonlinear function (defined by a deep neural network) of their associated embeddings. The encoder part of the VAE consists of a fast recognition model, designed by leveraging reparameterization methods for the Beta and Bernoulli distributions (Maddison et al., 2017; Nalisnick & Smyth, 2017). The recognition model, based on stochastic gradient variational Bayes (SGVB) inference, enables fast inference of the node embeddings. In contrast, traditional stochastic blockmodels rely on iterative MCMC or variational inference procedures for inferring the node embeddings. Consequently, the SGVB inference algorithm we develop is also of independent interest, since the recognition model enables fast inference of the node embeddings in single-layer overlapping stochastic blockmodels.

2. Preliminaries

We first introduce notation and then briefly describe the overlapping stochastic blockmodel (OSBM) (Latouche et al., 2011a; Miller et al., 2009a; Zhu et al., 2016). As described in the next section, our deep generative model enriches the OSBM by endowing it with a deep architecture based on a sparse variational autoencoder, and with a fast inference algorithm based on a recognition model.

We assume that the graph is given as an adjacency matrix A ∈ {0, 1}^{N×N}, where N denotes the number of nodes. We assume A_nm = 1 if there exists a link between node n and node m, and A_nm = 0 otherwise. In addition to A, for each node we may also be provided node features. These are given in the form of an N × D matrix X, with x_n ∈ R^D being the feature vector of node n, and D being the number of observed features.
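To make the notation concrete, the following toy NumPy sketch builds a symmetric binary adjacency matrix A and a node-feature matrix X; the sizes and edges are purely illustrative and are not taken from any dataset used later.

```python
import numpy as np

# Toy graph: N = 5 nodes, D = 3 features per node (illustrative values only).
N, D = 5, 3
edges = [(0, 1), (0, 2), (1, 2), (3, 4)]   # hypothetical undirected links

A = np.zeros((N, N), dtype=int)
for n, m in edges:
    A[n, m] = A[m, n] = 1                  # A_nm = 1 iff nodes n and m are linked

X = np.random.randn(N, D)                  # node features; use the identity if none are available
```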
The OSBM (Latouche et al., 2011a; Miller et al., 2009a; Zhu et al., 2016) is a stochastic blockmodel for networks; it assumes each node n has an associated binary vector (node embedding), also termed a latent feature vector, z_n ∈ {0, 1}^K. Within the node embedding, z_nk = 1 denotes that node n belongs to cluster/community k, and z_nk = 0 otherwise. The OSBM allows each node to simultaneously belong to multiple communities, and defines the link probability between two nodes via a bilinear function of their latent feature vectors

p(A_nm = 1 | z_n, z_m, W) = σ(z_n^⊤ W z_m)    (1)

where the entry w_kℓ of W ∈ R^{K×K} affects the probability of a link between a node n belonging to cluster k and a node m belonging to cluster ℓ. The nonparametric latent feature relational model (LFRM) is a specific type of OSBM that places an Indian Buffet Process (IBP) prior (Miller et al., 2009a) on the N × K binary matrix Z = [z_1, . . . , z_N] of node-community membership vectors. Use of the IBP enables learning the number of communities. Inference in LFRM/OSBM is typically performed via MCMC or variational inference (Miller et al., 2009a; Latouche et al., 2011a; Zhu et al., 2016), which tends to be slow and often cannot scale easily to more than a few hundred nodes.

3. Deep Generative OSBM

We now present our sparse-VAE-based deep generative framework for the overlapping stochastic blockmodel. The proposed architecture, depicted in Fig. 1 (left), associates each link A_nm ∈ {0, 1} with two latent embeddings z_n and z_m (for the nodes associated with this link). Each link probability is modeled as a nonlinear function of the embeddings of its associated nodes. Unlike the standard VAE, which assumes dense, Gaussian-distributed embeddings, we impose sparsity on the node embeddings, since we wish to use them to also infer the community membership(s) of each node (one of the goals of stochastic blockmodels). This is done by modeling each embedding as a sparse vector of the form z_n = b_n ⊙ r_n, where b_n ∈ {0, 1}^K is a binary vector modeled using a stick-breaking process prior (Teh et al., 2007) and r_n ∈ R^K is a real-valued vector with a Gaussian prior. Modeling b_n using the stick-breaking prior enables learning the node embedding size K from data. Note that, unlike the OSBM/LFRM, which assumes the node embedding z_n to be a strictly binary vector, our framework models it as a sparse real-valued vector, providing a more flexible and informative representation of the nodes. In particular, this enables inference of not just a node's membership in communities, but also the strength of its membership in each of the communities it belongs to. Specifically, b_nk ∈ {0, 1} denotes whether node n belongs to cluster k or not, and r_nk ∈ R denotes the strength of that membership.

3.1. VAE Generator/Decoder

Given the node embeddings z_n = b_n ⊙ r_n, the VAE decoder generates each link in the graph as A_nm ∼ p_θ(z_n, z_m), where the probability distribution p_θ defines a decoder or generator model for the graph. This decoder can consist of one or more layers of deterministic nonlinear transformation of the node embeddings z_n.

Figure 1. (Left) The generator/decoder model in plate notation. The mapping from z_n to f_n is a deterministic nonlinear transformation, modeled by a deep neural network. (Right) The encoder model, defined by a graph convolutional network (Kipf & Welling, 2016a), which takes as input the network A and node features X (if available) and outputs the parameters of the variational distributions.
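To make the generative process of Figure 1 (left) concrete, below is a minimal NumPy sketch. The layer sizes, hyperparameter values, and the two-layer decoder are illustrative assumptions rather than the exact architecture used in our experiments; the stick-breaking construction used for π is the one given in Section 3.1 below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (not the settings used in the experiments).
N, K, alpha, sigma = 20, 10, 1.0, 1.0
W1, W2 = rng.normal(size=(K, 32)), rng.normal(size=(32, 16))

def f(z):
    """Toy two-layer decoder network f(z) with a leaky-ReLU hidden layer."""
    h = z @ W1
    h = np.where(h > 0, h, 0.01 * h)            # leaky ReLU
    return h @ W2

# Stick-breaking IBP prior over community memberships b_n (Section 3.1).
v = rng.beta(alpha, 1.0, size=K)
pi = np.cumprod(v)                               # pi_k = prod_{j<=k} v_j
b = rng.binomial(1, pi, size=(N, K))             # b_nk ~ Bernoulli(pi_k)
r = rng.normal(0.0, sigma, size=(N, K))          # membership strengths r_n ~ N(0, sigma^2 I)
z = b * r                                        # sparse embeddings z_n = b_n ⊙ r_n

# Nonlinear decoder: p(A_nm = 1) = sigmoid(f(z_n)^T f(z_m)).
fz = f(z)
link_prob = 1.0 / (1.0 + np.exp(-(fz @ fz.T)))
A = rng.binomial(1, link_prob)                   # symmetry/self-loops ignored for brevity
```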
Denoting the overall transformation for a node embedding z_n as f(z_n) = f_n, we model the probability of a link as p(A_nm = 1 | f_n, f_m) = σ(f_n^⊤ f_m), where the nonlinear function f can be modeled by a deep neural network (in our experiments, we use a deep neural network with a leaky ReLU nonlinearity in each hidden layer). Figure 1 (left) depicts the generator.

We model the binary vector b_n ∈ {0, 1}^K, denoting node-community memberships, using the stick-breaking construction of the IBP (Teh et al., 2007), which enables learning the effective K by specifying a sufficiently large truncation level K. The stick-breaking construction is given as follows

v_k ∼ Beta(α, 1),  k = 1, . . . , K    (2)
π_k = ∏_{j=1}^{k} v_j,  b_nk ∼ Bernoulli(π_k)    (3)

We further assume a Gaussian prior on the membership strengths r_n ∈ R^K, i.e., p(r_n) = N(0, σ²I).

3.2. VAE Encoder

We employ a nonlinear encoder to infer the node embedding z_n of each node, using a fast, non-iterative recognition model (Kingma & Welling, 2013). Denoting the latent variables of all the nodes collectively as {v, b, r}, we approximate the model's true posterior p(v, b, r | A, X) with a variational posterior of the form q_φ(v, b, r). For simplicity, we consider a mean-field approximation, which allows us to factorize the posterior as q_φ(v, b, r) = ∏_{k=1}^{K} ∏_{n=1}^{N} q_φ(v_nk) q_φ(b_nk) q_φ(r_nk).

Our nonlinear encoder, as shown in Fig. 1 (right), assumes variational distributions on the local variables of each node, i.e., v_n, b_n, and r_n, and defines the variational parameters of these distributions as the outputs of a graph convolutional network (GCN) (Kipf & Welling, 2016a), which takes as input the network A and the node feature matrix X. The GCN has recently emerged as a flexible encoder of graph-structured data (similar in spirit to convolutional neural networks for images), which makes it an ideal choice of encoder in our VAE-based generative model for graphs. The forward propagation rule for each layer l of the GCN is defined as H^(l) = g(Â H^(l−1) W^(l)), where H^(0) = X (X = I when no side information is present), W^(l) is the weight matrix of layer l, g(·) is a nonlinear activation, and Â is the symmetrically normalized adjacency matrix. Although here we use the vanilla GCN in our architecture, more general variants of the GCN, such as GraphSAGE (Hamilton et al., 2017), can also be used as the encoder. The variational distributions have the following forms

q_φ(v_nk) = Beta(c_nk, d_nk),  k = 1, . . . , K    (4)
q_φ(b_nk) = Bernoulli(π_nk),  k = 1, . . . , K    (5)
q_φ(r_n) = N(μ_n, diag(σ²_n))    (6)

where c_nk, d_nk, π_nk, μ_n, and σ_n are outputs of the GCN, i.e., {c_n, d_n, π_n, μ_n, σ_n}_{n=1}^{N} = GCN(A, X). We use the stochastic gradient variational Bayes (SGVB) algorithm (Kingma & Welling, 2013) to infer the parameters of the variational distributions. Details on the reparameterization and the loss formulation are provided in Section 4.

3.3. Special Cases

Existing models for graph-structured data can be seen as special cases of our framework. Recall that we model the node embeddings as z_n = b_n ⊙ r_n, and our generative model is of the form A_nm ∼ p_θ(z_n, z_m). If we ignore the community-strength latent variable r_n, i.e., z_n is defined simply as z_n = b_n (just a binary vector), and further define p_θ as a Bernoulli distribution with its probability being a bilinear function of the embeddings z_n and z_m, then we recover the OSBM/LFRM (Latouche et al., 2011a; Miller et al., 2009a).
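As a quick illustration of this special case, here is a minimal sketch of the bilinear OSBM/LFRM likelihood of Eq. (1) with binary embeddings; the membership probability and the random W below are placeholders, not learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 10

# Binary embeddings z_n = b_n and a bilinear decoder recover the OSBM/LFRM of Eq. (1).
Z = rng.binomial(1, 0.3, size=(N, K))                 # placeholder membership probability 0.3
W = rng.normal(size=(K, K))                           # community-community weight matrix
link_prob = 1.0 / (1.0 + np.exp(-(Z @ W @ Z.T)))      # sigma(z_n^T W z_m)
```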
Note, however, that while the OSBM/LFRM typically relies on MCMC or variational inference, our framework can leverage SGVB for efficient inference. Likewise, if we define z_n = r_n, i.e., a dense vector, and define p_θ as a Bernoulli distribution with its probability being a bilinear function of the embeddings, we recover the Eigenmodel or latent space model (LSM) (Hoff et al., 2002). Note that this model cannot infer K, since the binary vector b_n is not present. Finally, if p_θ is a Bernoulli distribution with its probability being a nonlinear function of the embeddings, then we recover the VGAE model (Kipf & Welling, 2016b), which can also be seen as a nonlinear extension of the LSM. Moreover, note that a key limitation of LSM and VGAE is that they cannot be used to infer the community structure (due to the non-sparse nature of z_n) and usually can only be used for link-prediction tasks.

4. Inference

We define the factorized variational posterior q_φ(v, b, r) via

q_φ(v_nk) = Beta(v_nk | c_nk(A, X), d_nk(A, X))
q_φ(b_nk) = Bernoulli(b_nk | π_nk(A, X))
q_φ(r_n) = N(μ_n(A, X), diag(σ²_n(A, X)))

where c_nk, d_nk, π_nk, μ_n, and σ_n are functions of the GCN encoder, with inputs A and X. We define the loss function L, parameterized by the inference network (encoder) parameters φ and the generator parameters θ, as the negative of the evidence lower bound (ELBO):

L(φ, θ) = Σ_n ( KL[q_φ(b_n | v_n) || p_θ(b_n | v_n)] + KL[q_φ(r_n) || p_θ(r_n)] + KL[q_φ(v_n) || p(v_n)] − E_q[log p_θ(X_n | z_n)] ) − Σ_{n,m} E_q[log p_θ(A_nm | z_n, z_m)]

where KL[q(·) || p(·)] is the Kullback-Leibler divergence between q(·) and p(·). Note that here we have also included the loss from the reconstruction of the side information X_n; we assume that the side information X and the links A are conditionally independent given the node embeddings z_{1:N}. When there is no side information, we can simply drop the E_q[log p_θ(X_n | z_n)] term from the loss function. For the encoder and decoder parameters we infer point estimates, while we learn distributions over the latent variables b, v, and r.

Our variational autoencoder for link generation is trained using stochastic gradient variational Bayes (SGVB) (Kingma & Welling, 2013). SGVB can be used to perform inference for a broad class of non-conjugate models and is therefore appealing for Bayesian nonparametric models, such as those based on the stick-breaking priors used in our framework. SGVB uses differentiable Monte Carlo (MC) expectations to learn the model parameters. Specifically, it requires a differentiable, non-centered parameterization (DNCP) (Kingma & Welling, 2014) to allow backpropagation. However, our model has expectations over Beta and Bernoulli distributions, neither of which permits easy reparameterization as required by SGVB. We leverage recent developments on reparameterizing these distributions (Maddison et al., 2017; Nalisnick & Smyth, 2017), which leads to a simple inference procedure.

Following (Nalisnick & Smyth, 2017), we approximate the Beta posterior in (4) with the Kumaraswamy distribution, defined as Kumar(x; a, b) = a b x^{a−1}(1 − x^a)^{b−1} for x ∈ (0, 1) and a, b > 0. Its closed-form inverse CDF allows easy reparameterization, and samples for v_nk (with parameters c_nk and d_nk) can be drawn using

u ∼ Uniform(0, 1),  v_nk = (1 − u^{1/d_nk})^{1/c_nk}

We compute the KL divergence between the Kumaraswamy q(v) and the Beta distribution p(v) by taking a finite approximation of the infinite sum, as described in (Nalisnick & Smyth, 2017).
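A minimal NumPy sketch of these reparameterized samplers is given below (including the Binary Concrete relaxation for b_nk described next). In practice the same transforms are written in an autodiff framework so that gradients flow through c, d, and π; the parameter values here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_kumaraswamy(c, d, rng):
    """Reparameterized sample v ~ Kumaraswamy(c, d) via its inverse CDF."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(c))
    return (1.0 - u ** (1.0 / d)) ** (1.0 / c)

def sample_binary_concrete(pi, lam, rng):
    """Relaxed Bernoulli(pi) sample via the Binary Concrete distribution
    (described next); lam is the relaxation temperature."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(pi))
    L = np.log(u) - np.log(1.0 - u)                      # logistic noise
    return 1.0 / (1.0 + np.exp(-(L + np.log(pi / (1.0 - pi))) / lam))

# Illustrative variational parameters for one node and K = 4 communities.
c, d = np.full(4, 2.0), np.full(4, 3.0)
v = sample_kumaraswamy(c, d, rng)
pi = np.cumprod(v)                                       # stick-breaking pi_k
b = sample_binary_concrete(pi, lam=0.5, rng=rng)         # relaxed b_nk during training
```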
For the Bernoulli random variables, we use the Binary Concrete distribution (Maddison et al., 2017; Jang et al., 2017) at training time, as a continuous relaxation that yields biased but low-variance gradient estimates. The KL divergence between two Bernoulli distributions is correspondingly relaxed to the KL divergence between two Binary Concrete distributions. We reparameterize b_nk, defined by a Bernoulli with probability π_nk (in (3) and (5)), as

b_nk = σ((L + logit(π_nk)) / λ),  L = log(u / (1 − u)),  u ∼ Uniform(0, 1)

where σ(·) is the sigmoid function, logit(·) is the inverse-sigmoid function, and λ is the relaxation temperature.

Structured Mean-Field: Since vanilla mean-field variational inference ignores the posterior dependence among the latent variables, we also considered structured stochastic variational inference (SSVI) (Hoffman, 2014; Hoffman et al., 2013), which allows global-local parameter dependencies and improves upon the mean-field approximation. We consider v (and its variational parameters c and d) as global parameters and impose a hierarchical structure on b_n by conditioning it on v. The variational posterior of our framework under SSVI can be factorized as q_φ(v, b, r) = ∏_{k=1}^{K} q_φ(v_k) ∏_{n=1}^{N} q_φ(b_nk | v) q_φ(r_nk), with q_φ(v_k) = Beta(c_k, d_k), q_φ(b_nk | v) = Bernoulli(π_k), and π_k = ∏_{j=1}^{k} v_j, where c_k and d_k are parameters to be learned. In practice, we have found structured mean-field to perform better than mean-field, and our model implementation uses the former.

5. Related Work

The proposed framework can be seen as bridging two lines of research on modeling graphs: (i) structured latent variable models for graphs, such as stochastic blockmodels and their variants (Kemp et al., 2006; Airoldi et al., 2008; Miller et al., 2009a; Latouche et al., 2011a); and (ii) deep learning models for graphs, such as graph convolutional networks (Kipf & Welling, 2016a). Our effort is motivated by the goal of harnessing their complementary strengths to develop a deep generative stochastic blockmodel for graphs that also enjoys an efficient inference procedure.

The most prominent stochastic blockmodels include models that associate each node with a single community (Nowicki & Snijders, 2001; Kemp et al., 2006), a mixture of communities (Airoldi et al., 2008), or an overlapping set of communities (Miller et al., 2009a; Latouche et al., 2011a; Yang & Leskovec, 2012; Zhou, 2015). While stochastic blockmodels have nice interpretability, these models usually model the links of the network as a simple bilinear function of the node embeddings, which may not capture the nonlinear interactions between nodes (Yan et al., 2011). An approach to model such nonlinear interactions was proposed in (Yan et al., 2011), using a matrix-variate Gaussian process. However, despite its modeling flexibility, inference in this model is challenging, and the model is usually infeasible to run on networks with more than 100 nodes.

There is also significant recent interest in non-probabilistic deep learning models for graphs. Some of the prominent works in this direction include DeepWalk (Perozzi et al., 2014) and graph autoencoders (GAE) (Kipf & Welling, 2016a; Hamilton et al., 2017). DeepWalk is inspired by the idea of word embeddings: it treats each node as a document, by starting a random walk at that node and treating the nodes encountered along the walk as the words in that document, and then applies document/word embedding methods to learn the embedding of each node.
In contrast, the GAE approaches are based on the idea of graph convolutional networks (GCN) (Kipf & Welling, 2016a). This line of work nicely complements our contribution, since modules like the GCN can be effectively used to design the encoder model of our deep generative framework. In particular, as noted in the model description, our encoder is essentially a GCN. We believe that such advances in graph encoding can be used as modules to design new deep generative models for relational data.

Despite the significant success of deep generative models for images and text data, there has been relatively little work on deep generative models for relational data (You et al., 2018; Hu et al., 2017; Wang et al., 2017; Kipf & Welling, 2016b). GraphRNN (You et al., 2018) learns a single representation of an entire graph to model the joint distribution over different graphs. The focus of GraphRNN is on generating small-sized graphs, whereas we focus on link prediction and community detection for a given graph. Among other existing methods, (Hu et al., 2017) proposed an extension of the LFRM via a deep hierarchy of binary latent features for each node. However, this model relies on expensive batch MCMC inference, precluding its applicability to large-scale networks. Another deep latent variable model was proposed recently in (Wang et al., 2017). However, this model also has a difficult, model-specific inference procedure, and its node embeddings are not interpretable.

Perhaps the closest in spirit to our work is the recent work on variational graph autoencoders (VGAE) (Kipf & Welling, 2016b). Graphite (Grover et al., 2018) extends the VGAE by using a multi-layer iterative decoder that alternates between message passing and graph refinement. A similar decoding scheme could also be applied in our framework; however, the focus of this work is on learning sparse, interpretable node embeddings. Both VGAE and Graphite are built on top of the standard VAE, and consequently do not have the direct interpretability of node embeddings desired by stochastic blockmodels. Our sparse construction, in contrast, leads to a model with different properties and a different inference procedure compared to (Kipf & Welling, 2016b). Moreover, our VAE architecture is nonparametric in nature and can infer the node embedding size.

6. Experiments

We report experimental results on several synthetic and real-world datasets to demonstrate the efficacy of our model. Our experimental results include quantitative comparisons on the task of link prediction, as well as qualitative results, such as using the embeddings to discover the underlying communities in the network data. The qualitative results are meant to demonstrate the expressiveness of the latent space that our model infers. The expressive nature of our model is the result of the sparse, interpretable embedding learned for each node of the graph.
We also examine the community structure on the real-world co-authorship dataset, and show that the proposed framework is able to readily capture the underlying communities. We refer to our framework as DGLFRM (Deep Generative Latent Feature Relational Model), which refers to our most general model with sparse embeddings zn = bn rn with nonlinear generator and nonlinear encoder. We also consider a variant of DGLFRM with binary embeddings zn = bn, which we refer to as DGLFRM-B (the B here denotes binary ). Note that DGLFRM-B can be seen as a deep generalization of LFRM (Miller et al., 2009b)/OSBM (Latouche et al., 2011b), with another key difference from LFRM/OSBM being the fact that we use amortized inference. 6.1. Baselines For link prediction, we compare the proposed model with four baselines, one of which is a simplified variant of DGLFRM akin to LFRM (Miller et al., 2009a), which is an overlapping stochastic blockmodel. The original LFRM, which uses MCMC-based inference, was infeasible to run on the datasets used in these experiments. On the other hand, DGLFRM with zn = bn and bilinear decoder (link generation model) is similar to LFRM, but with a much faster SGVB based inference (we will refer to this simplified variant of DGLFRM as LFRM). Among the other three baselines, Spectral Clustering (SC) and Deep Walk (DW) (Perozzi et al., 2014) learn node embeddings, which we use to compute the link probability as σ(z n zm). The third baseline is the recently proposed variational autoencoder on graphs (VGAE) (Kipf & Welling, 2016b). Note that none of these baselines can be used for community detection, since the real-valued embeddings learned by these baselines are not interpretable (unlike our model which learns sparse embeddings, with nonzeros denoting community memberships). 6.2. Datasets We consider five real-world datasets, with three datasets consisting of side information in the form of the node features, and the other two datasets having only the link information. For the link-prediction experiments, all models are provided a partially-complete network (with unknown part to be predicted). The node features (when available) are provided to all the models. The description of each data set is as follows: NIPS12: The NIPS12 coauthor network (Zhou, 2015) includes all 2037 authors in NIPS papers from volumes 1-12, with 3134 edges. This network has no side information. Yeast: The Yeast protein interaction network (Zhou, 2015) has 2361 nodes and 6646 non-self edges. This network has no side information. Cora: Cora network is a citation network consisting of 2708 documents. The datasets contain sparse bagof-words feature vectors of length 1433 for each document. These are used as node features. The network has total 5278 links. Citeseer: Citeseer is a citation network consisting of 3312 scientific publications from six categories: agents, AI, databases, human computer interaction, machine learning, and information retrieval. The side information for the dataset is the category label for each paper which is converted into a one-hot representation. These one-hot vectors are used as node features. The network has a total of 4552 links. Pubmed: A citation network consisting of 19,717 nodes. The dataset contains sparse bag-of-words feature vectors of length 500 for each document, used as node features. The network has total 44,324 links. 6.3. Link Prediction We use Area Under the ROC Curve (AUC) and Average Precision (AP) to compare our model with the other baselines for link prediction. 
For all datasets, we hold out 10% and 5% of the links as our test and validation sets, respectively, and use the validation set to tune the hyperparameters. We average the AUC-ROC and AP scores over 10 random splits of each dataset. The AUC-ROC scores of our models and the various baselines are shown in Table 1, and the AP scores are shown in Table 2.

Table 1. AUC-ROC.

| Method | NIPS12 | Yeast | Cora | Citeseer | Pubmed |
|---|---|---|---|---|---|
| SC | 0.8792 ± .0003 | 0.7886 ± .0001 | 0.8460 ± .0001 | 0.8050 ± .0001 | 0.8420 ± .0002 |
| DW | 0.8058 ± .0000 | 0.6443 ± .0003 | 0.8310 ± .0001 | 0.8050 ± .0002 | 0.8440 ± .0000 |
| VGAE | 0.8790 ± .0055 | 0.7784 ± .0002 | 0.9260 ± .0001 | 0.9080 ± .0002 | 0.9418 ± .0076 |
| LFRM | 0.8489 ± .0001 | 0.7975 ± .0006 | 0.9096 ± .0026 | 0.8965 ± .0035 | 0.9152 ± .0041 |
| DGLFRM-B | 0.8898 ± .0028 | 0.8061 ± .0003 | 0.9281 ± .0024 | 0.9007 ± .0020 | 0.9396 ± .0052 |
| DGLFRM | 0.8734 ± .0043 | 0.7856 ± .0005 | 0.9343 ± .0023 | 0.9379 ± .0032 | 0.9395 ± .0008 |

Table 2. Average Precision (AP).

| Method | NIPS12 | Yeast | Cora | Citeseer | Pubmed |
|---|---|---|---|---|---|
| SC | 0.9022 ± .0002 | 0.8440 ± .0001 | 0.8850 ± .0000 | 0.8500 ± .0100 | 0.8780 ± .0100 |
| DW | 0.8634 ± .0000 | 0.6699 ± .0002 | 0.8500 ± .0001 | 0.8360 ± .0001 | 0.8440 ± .0000 |
| VGAE | 0.9114 ± .0042 | 0.8349 ± .0002 | 0.9328 ± .0001 | 0.9200 ± .0002 | 0.9394 ± .0088 |
| LFRM | 0.8870 ± .0000 | 0.8268 ± .0005 | 0.9060 ± .0033 | 0.9118 ± .0031 | 0.9197 ± .0054 |
| DGLFRM-B | 0.9120 ± .0021 | 0.8442 ± .0002 | 0.9259 ± .0023 | 0.9153 ± .0031 | 0.9454 ± .0050 |
| DGLFRM | 0.9005 ± .0027 | 0.8388 ± .0002 | 0.9376 ± .0022 | 0.9438 ± .0073 | 0.9497 ± .0035 |

Table 3. Example of communities inferred by our model on the NIPS data.

| Cluster | Authors |
|---|---|
| Probabilistic Modeling | Sejnowski T, Hinton G, Dayan P, Jordan M, Williams C |
| Reinforcement Learning | Barto A, Singh S, Sutton R, Connolly C, Precup D |
| Robotics/Vision | Shibata T, Peper F, Thrun S, Giles C, Michel A |
| Computational Neuroscience | Baldi P, Stein C, Rinott Y, Weinshall D, Druzinsky R |
| Neural Networks | Pearlmutter B, Abu-Mostafa Y, LeCun Y, Sejnowski T, Tang A |

As shown in the tables, our models outperform the baselines on almost all datasets. We again highlight that, unlike baselines such as VGAE that cannot learn interpretable embeddings, our model also learns embeddings that can be interpreted as memberships of nodes in communities. The superior results of DGLFRM and DGLFRM-B demonstrate the benefit of our deep generative models. Their significantly better results compared to LFRM also show the benefit of endowing LFRM with a deep architecture, with a nonlinear decoder and a nonlinear encoder. The hyperparameter settings used for all experiments are included in the Supplementary Material.

We also performed an experiment to investigate the model's ability to leverage node features. As expected, the model performs better when using the features than when it ignores them. This experiment is included in the Supplementary Material.

6.4. Qualitative Analysis of Learned Embeddings

To demonstrate the interpretable nature of the embeddings learned by our model, we generate a synthetic dataset with 100 nodes and 10 communities. The dataset is generated by fixing the ground-truth communities (by creating a binary vector for each node) such that some of the nodes belong to the same communities. The adjacency matrix is then generated using a simple inner product, followed by the sigmoid operation (Figure 2a). We train on 85% of the synthetic adjacency matrix for link prediction and for visualizing the latent structure that our model learns. The latent structure obtained using DGLFRM is plotted in Figure 2(b).
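A minimal sketch of this synthetic construction is given below; the particular ground-truth assignment pattern and the constant logit shift are our illustrative assumptions, not the exact recipe used to produce Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 10

# Ground-truth overlapping memberships: a binary vector per node, chosen only so
# that groups of nodes share communities (assignment pattern is illustrative).
Z_true = np.zeros((N, K), dtype=int)
for n in range(N):
    Z_true[n, n % K] = 1                       # primary community
    if n % 3 == 0:
        Z_true[n, (n + 1) % K] = 1             # some nodes overlap into a second community

# Adjacency from an inner product followed by a sigmoid, as described above;
# the constant shift (our assumption) makes non-links unlikely.
logits = Z_true @ Z_true.T - 1.0
probs = 1.0 / (1.0 + np.exp(-logits))
A_syn = rng.binomial(1, probs)
A_syn = np.triu(A_syn, 1)
A_syn = A_syn + A_syn.T                        # symmetrize; no self-loops
```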
Figure 2(c) shows that by using only the first two dimensions of the latent structure we can reconstruct the graph reasonably well. This illustrates an important property of the stick-breaking IBP prior, which encourages the most commonly selected communities (the left-most columns in Figure 2(b)) to be dense, while communities with higher indices (columns to the right) remain sparse. This shows that DGLFRM can learn the effective number of communities in a given graph. Finally, we can quantize the latent space into discrete intervals to extract the nodes belonging to different communities. In our experiments, the learned latent structure was in fact close to the ground-truth community assignments we started with. In Figure 2(d), we compare the community structure from our model with the latent structure obtained by running VGAE. Note that the Gaussian latent structure in VGAE is dense and therefore fails to yield community memberships that are readily interpretable.

Figure 2. (a) The synthetic adjacency matrix (where white, black, and grey denote link, no-link, and hidden parts of the graph, respectively). (b) The latent structure of the synthetic data learned using DGLFRM; the truncation parameter (K) of the latent structure was fixed to 10. (c) The graph generated by DGLFRM using only the first two dimensions of the latent structure (i.e., columns 0 and 1 in (b)); columns 2-9 were all set to zero. The graph is represented as the probability of links; white (black) represents a link (no-link) with high probability. (d) The latent structure of the synthetic data learned using VGAE.

We also perform a qualitative analysis on the NIPS12 dataset. Again, we train DGLFRM and VGAE using 85% of the adjacency matrix. Table 3 shows five of the communities inferred by DGLFRM. The authors shown under each community are ordered by the strength of their community memberships (in decreasing order). As Table 3 shows, each of the communities represents a sub-field, with authors working on similar topics. Moreover, note that some authors (e.g., Sejnowski) are inferred as belonging to more than one community. This qualitative experiment demonstrates that our model can learn interpretable embeddings that can be used for tasks such as (overlapping) clustering. We have included a visualization of the latent structure learned on the NIPS12 data in the Supplementary Material.

Note that our model can infer the number of communities naturally, via the stick-breaking prior. The stick-breaking prior requires specifying a large truncation level on the number of communities, and our model effectively infers the active communities for a given truncation level. As shown in Fig. 2 (b)-(c), the posterior inference in our model is able to turn off the unnecessary columns in Z. Although we do not know the ground truth for the number of communities, the number of inferred active communities is similar to what is reported in prior work on nonparametric Bayesian overlapping stochastic blockmodels (Miller et al., 2009a). Note that VGAE embeddings require an additional step (such as K-means clustering) to cluster the nodes. Moreover, a method such as K-means cannot detect overlapping communities, and it is also sensitive to the choice of K (the estimated number of communities). For reference, we have included the clustering results on the VGAE embeddings in the Supplementary Material.
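For completeness, here is a minimal sketch of how overlapping communities (and the strength ordering used in Table 3) can be read off from the inferred variational parameters; the thresholding rule and function names are our illustrative choices, not part of a released implementation.

```python
import numpy as np

def communities_from_embeddings(b_prob, r_mean, threshold=0.5):
    """Read off overlapping community memberships from the inferred sparse
    embeddings: node n is assigned to community k when q(b_nk = 1) exceeds a
    threshold, and r_nk gives the membership strength used to rank members."""
    members = {}
    active = b_prob > threshold                       # binary membership matrix
    for k in range(b_prob.shape[1]):
        nodes = np.where(active[:, k])[0]
        strengths = r_mean[nodes, k]
        order = np.argsort(-np.abs(strengths))        # strongest members first
        members[k] = [(int(n), float(s)) for n, s in zip(nodes[order], strengths[order])]
    return members

# Illustrative usage on random posterior summaries for 5 nodes and 3 communities.
rng = np.random.default_rng(0)
comms = communities_from_embeddings(rng.uniform(size=(5, 3)), rng.normal(size=(5, 3)))
```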
7. Conclusion and Discussion

We have presented a deep generative framework for overlapping community discovery and link prediction. This work combines the interpretability of stochastic blockmodels, such as the latent feature relational model, with the modeling power of deep generative models. Moreover, leveraging a nonparametric Bayesian prior on the node embeddings enables learning the node embedding size (i.e., the number of communities) from data. Our framework is modular, and a wide variety of decoder and encoder models can be used. In particular, it can leverage recent advances in non-probabilistic autoencoders for graphs, such as the graph convolutional network (Kipf & Welling, 2016a) and its extensions (Hamilton et al., 2017). Inference in the model is based on SGVB, which does not require conjugacy. This further widens the applicability of our framework to modeling different types of networks (e.g., with weighted or count-valued edges, or power-law node degree distributions). We believe this combination of discrete-latent-variable-based stochastic blockmodels and graph neural networks will help leverage their respective strengths, and will fuel further research and advance the state of the art in (deep) generative modeling of graph-structured data.

Although SGVB inference makes our model fairly efficient, it can be scaled up further to massive networks by using mini-batch-based inference (Chen et al., 2018). Another possibility for scaling up the model is to replace the Bernoulli-logistic likelihood with a Bernoulli-Poisson link (Zhou, 2015), which enables scaling in the number of nonzeros (i.e., the number of edges) in the network. Given that our framework can work with a wide variety of decoder/generator models, such modifications can be made without much difficulty. Finally, in this work we model each node as having a single binary vector denoting its memberships in one or more communities. Another interesting extension would be to consider multiple layers of latent variables, which can model a node's membership in a hierarchy of communities (Ho et al., 2011; Blundell & Teh, 2013; Hu et al., 2017).

Acknowledgements: PR acknowledges support from a Google AI/ML faculty award and a DST-SERB Early Career Research Award. The Duke investigators acknowledge the support of DARPA and ONR.

References

Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. Mixed membership stochastic blockmodels. JMLR, 2008.

Blundell, C. and Teh, Y. W. Bayesian hierarchical community discovery. In NIPS, 2013.

Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. CoRR, abs/1801.10247, 2018. URL http://arxiv.org/abs/1801.10247.

Fortunato, S. Community detection in graphs. Physics Reports, 486(3):75-174, 2010.

Goldenberg, A., Zheng, A. X., Fienberg, S. E., and Airoldi, E. M. A survey of statistical network models. Foundations and Trends in Machine Learning, 2010.

Griffiths, T. L. and Ghahramani, Z. The Indian buffet process: An introduction and review. JMLR, 2011.

Grover, A., Zweig, A., and Ermon, S. Graphite: Iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459, 2018.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NIPS, 2017.

Ho, Q., Parikh, A., Song, L., and Xing, E. Multiscale community blockmodel for network exploration. In AISTATS, 2011.
Hoff, P. D., Raftery, A. E., and Handcock, M. S. Latent space approaches to social network analysis. JASA, 2002.

Hoffman, M. D. Stochastic structured mean-field variational inference. CoRR, abs/1404.4114, 2014.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Hu, C., Rai, P., and Carin, L. Deep generative models for relational data with side information. In International Conference on Machine Learning, pp. 1578-1586, 2017.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.

Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., and Ueda, N. Learning systems of concepts with an infinite relational model. In Proceedings of the National Conference on Artificial Intelligence, 2006.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P. and Welling, M. Efficient gradient-based inference through transformations between Bayes nets and neural nets. CoRR, abs/1402.0480, 2014. URL http://arxiv.org/abs/1402.0480.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016a.

Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016b.

Latouche, P., Birmelé, E., and Ambroise, C. Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics, 2011a.

Latouche, P., Birmelé, E., and Ambroise, C. Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics, 2011b.

Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.

Miller, K., Griffiths, T., and Jordan, M. Nonparametric latent feature models for link prediction. In NIPS, 2009a.

Miller, K. T., Griffiths, T. L., and Jordan, M. I. Nonparametric latent feature models for link prediction. In NIPS, 2009b.

Nalisnick, E. and Smyth, P. Stick-breaking variational autoencoders. In ICLR, 2017.

Nowicki, K. and Snijders, T. A. B. Estimation and prediction for stochastic blockstructures. JASA, 2001.

Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In KDD, 2014.

Schmidt, M. N. and Morup, M. Nonparametric Bayesian modeling of complex networks: An introduction. IEEE Signal Processing Magazine, 30(3), 2013.

Teh, Y. W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In AISTATS, 2007.

Wang, H., Shi, X., and Yeung, D.-Y. Relational deep learning: A deep latent variable model for link prediction. In AAAI, pp. 2688-2694, 2017.

Yan, F., Xu, Z., and Qi, Y. Sparse matrix-variate Gaussian process blockmodels for network modeling. In UAI, 2011.

Yang, J. and Leskovec, J. Community-affiliation graph model for overlapping network community detection. In ICDM, 2012.

You, J., Ying, R., Ren, X., Hamilton, W. L., and Leskovec, J. GraphRNN: A deep generative model for graphs. CoRR, abs/1802.08773, 2018. URL http://arxiv.org/abs/1802.08773.

Zhou, M. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, 2015.

Zhu, J., Song, J., and Chen, B. Max-margin nonparametric latent feature models for link prediction. arXiv preprint arXiv:1602.07428, 2016.