# augmentable_gamma_belief_networks__3cc8582f.pdf

Journal of Machine Learning Research 17 (2016) 1-44 Submitted 12/15; Revised 7/16; Published 9/16

Augmentable Gamma Belief Networks

Mingyuan Zhou mingyuan.zhou@mccombs.utexas.edu Department of Information, Risk, and Operations Management Mc Combs School of Business The University of Texas at Austin Austin, TX 78712, USA

Yulai Cong yulai cong@163.com Bo Chen bchen@mail.xidian.edu.cn National Laboratory of Radar Signal Processing Collaborative Innovation Center of Information Sensing and Understanding Xidian University Xi an, Shaanxi 710071, China

Editor: Francois Caron

To infer multilayer deep representations of high-dimensional discrete and nonnegative real vectors, we propose an augmentable gamma belief network (GBN) that factorizes each of its hidden layers into the product of a sparse connection weight matrix and the nonnegative real hidden units of the next layer. The GBN s hidden layers are jointly trained with an upward-downward Gibbs sampler that solves each layer with the same subroutine. The gamma-negative binomial process combined with a layer-wise training strategy allows inferring the width of each layer given a ﬁxed budget on the width of the ﬁrst layer. Example results illustrate interesting relationships between the width of the ﬁrst layer and the inferred network structure, and demonstrate that the GBN can add more layers to improve its performance in both unsupervisedly extracting features and predicting heldout data. For exploratory data analysis, we extract trees and subnetworks from the learned deep network to visualize how the very speciﬁc factors discovered at the ﬁrst hidden layer and the increasingly more general factors discovered at deeper hidden layers are related to each other, and we generate synthetic data by propagating random variables through the deep network from the top hidden layer back to the bottom data layer. Keywords: Bayesian nonparametrics, deep learning, multilayer representation, Poisson factor analysis, topic modeling, unsupervised learning

1. Introduction

There has been signiﬁcant recent interest in deep learning. Despite its tremendous success in supervised learning, inferring a multilayer data representation in an unsupervised manner remains a challenging problem (Bengio and Le Cun, 2007; Ranzato et al., 2007; Bengio et al., 2015). To generate data with a deep network, it is often unclear how to set the structure of the network, including the depth (number of layers) of the network and the width (number of hidden units) of each layer. In addition, for some commonly used deep

. Correspondence should be addressed to M. Zhou or B. Chen.

c 2016 Mingyuan Zhou, Yulai Cong, and Bo Chen.

Zhou, Cong, and Chen

generative models, including the sigmoid belief network (SBN), deep belief network (DBN), and deep Boltzmann machine (DBM), the hidden units are often restricted to be binary. More speciﬁcally, the SBN, which connects the binary units of adjacent layers via the sigmoid functions, infers a deep representation of multivariate binary vectors (Neal, 1992; Saul et al., 1996); the DBN (Hinton et al., 2006) is a SBN whose top hidden layer is replaced by the restricted Boltzmann machine (RBM) (Hinton, 2002) that is undirected; and the DBM is an undirected deep network that connects the binary units of adjacent layers using the RBMs (Salakhutdinov and Hinton, 2009). All these three deep networks are designed to model binary observations, without principled ways to set the network structure. Although one may modify the bottom layer to model Gaussian and multinomial observations, the hidden units of these networks are still typically restricted to be binary (Salakhutdinov and Hinton, 2009; Larochelle and Lauly, 2012; Salakhutdinov et al., 2013). To generalize these models, one may consider the exponential family harmoniums (Welling et al., 2004; Xing et al., 2005) to construct more general networks with non-binary hidden units, but often at the expense of noticeably increased complexity in training and data ﬁtting. To model realvalued data without restricting the hidden units to be binary, one may consider the general framework of nonlinear Gaussian belief networks (Frey and Hinton, 1999) that constructs continuous hidden units by nonlinearly transforming Gaussian distributed latent variables, including as special cases both the continuous SBN of Frey (1997a,b) and the rectiﬁed Gaussian nets of Hinton and Ghahramani (1997). More recent scalable generalizations under that framework include variational auto-encoders (Kingma and Welling, 2014) and deep latent Gaussian models (Rezende et al., 2014). Moving beyond conventional deep generative models using binary or nonlinearly transformed Gaussian hidden units and setting the network structure in a heuristic manner, we construct deep networks using gamma distributed nonnegative real hidden units, and combine the gamma-negative binomial process (Zhou and Carin, 2015; Zhou et al., 2015b) with a greedy-layer wise training strategy to automatically infer the network structure. The proposed model is called the augmentable gamma belief network, referred to hereafter for brevity as the GBN, which factorizes the observed or latent count vectors under the Poisson likelihood into the product of a factor loading matrix and the gamma distributed hidden units (factor scores) of layer one; and further factorizes the shape parameters of the gamma hidden units of each layer into the product of a connection weight matrix and the gamma hidden units of the next layer. The GBN together with Poisson factor analysis can unsupervisedly infer a multilayer representation from multivariate count vectors, with a simple but powerful mechanism to capture the correlations between the visible/hidden features across all layers and handle highly overdispersed counts. With the Bernoulli-Poisson link function (Zhou, 2015), the GBN is further applied to high-dimensional sparse binary vectors by truncating latent counts, and with a Poisson randomized gamma distribution, the GBN is further applied to high-dimensional sparse nonnegative real data by randomizing the gamma shape parameters with latent counts. For tractable inference of a deep generative model, one often applies either a sampling based procedure (Neal, 1992; Frey, 1997a) or variational inference (Saul et al., 1996; Frey, 1997b; Ranganath et al., 2014b; Kingma and Welling, 2014). However, conjugate priors on the model parameters that connect adjacent layers are often unknown, making it diﬃcult to develop fully Bayesian inference that infers the posterior distributions of these parameters.

Augmentable Gamma Belief Networks

It was not until recently that a Gibbs sampling algorithm, imposing priors on the network connection weights and sampling from their conditional posteriors, was developed for the SBN by Gan et al. (2015b), using the Polya-Gamma data augmentation technique developed for logistic models (Polson et al., 2012). In this paper, we will develop data augmentation technique unique for the augmentable GBN, allowing us to develop a fully Bayesian upwarddownward Gibbs sampling algorithm to infer the posterior distributions of not only the hidden units, but also the connection weights between adjacent layers. Distinct from previous deep networks that often require tuning both the width (number of hidden units) of each layer and the network depth (number of layers), the GBN employs nonnegative real hidden units and automatically infers the widths of subsequent layers given a ﬁxed budget on the width of its ﬁrst layer. Note that the budget could be inﬁnite and hence the whole network can grow without bound as more data are being observed. Similar to other belief networks that can often be improved by adding more hidden layers (Hinton et al., 2006; Sutskever and Hinton, 2008; Bengio et al., 2015), for the proposed model, when the budget on the ﬁrst layer is ﬁnite and hence the ultimate capacity of the network could be limited, our experimental results also show that a GBN equipped with a narrower ﬁrst layer could increase its depth to match or even outperform a shallower one with a substantially wider ﬁrst layer. The gamma distribution density function has the highly desired strong non-linearity for deep learning, but the existence of neither a conjugate prior nor a closed-form maximum likelihood estimate (Choi and Wette, 1969) for its shape parameter makes a deep network with gamma hidden units appear unattractive. Despite seemingly diﬃcult, we discover that, by generalizing the data augmentation and marginalization techniques for discrete data modeled with the Poisson, gamma, and negative binomial distributions (Zhou and Carin, 2015), one may propagate latent counts one layer at a time from the bottom data layer to the top hidden layer, with which one may derive an eﬃcient upward-downward Gibbs sampler that, one layer at a time in each iteration, upward samples Dirichlet distributed connection weight vectors and then downward samples gamma distributed hidden units, with the latent parameters of each layer solved with the same subroutine. With extensive experiments in text and image analysis, we demonstrate that the deep GBN with two or more hidden layers clearly outperforms the shallow GBN with a single hidden layer in both unsupervisedly extracting latent features for classiﬁcation and predicting heldout data. Moreover, we demonstrate the excellent ability of the GBN in exploratory data analysis: by extracting trees and subnetworks from the learned deep network, we can follow the paths of each tree to visualize various aspects of the data, from very general to very speciﬁc, and understand how they are related to each other. In addition to constructing a new deep network that well ﬁts high-dimensional sparse binary, count, and nonnegative real data, developing an eﬃcient upward-downward Gibbs sampler, and applying the learned deep network for exploratory data analysis, other contributions of the paper include: 1) proposing novel link functions, 2) combining the gammanegative binomial process (Zhou and Carin, 2015; Zhou et al., 2015b) with a layer-wise training strategy to automatically infer the network structure; 3) revealing the relationship between the upper bound imposed on the width of the ﬁrst layer and the inferred widths of subsequent layers; 4) revealing the relationship between the depth of the network and the model s ability to model overdispersed counts; and 5) generating multivariate high-

Zhou, Cong, and Chen

dimensional discrete or nonnegative real vectors, whose distributions are governed by the GBN, by propagating the gamma hidden units of the top hidden layer back to the bottom data layer. We note this paper signiﬁcantly extends our recent conference publication (Zhou et al., 2015a) that proposes the Poisson GBN.

2. Augmentable Gamma Belief Networks

Denoting θ(t) j RKt + as the Kt hidden units of sample j at layer t, where R+ = {x : x 0}, the generative model of the augmentable gamma belief network (GBN) with T hidden layers, from top to bottom, is expressed as

θ(T) j Gam r, 1 c(T+1) j ,

θ(t) j Gam Φ(t+1)θ(t+1) j , 1 c(t+1) j ,

θ(1) j Gam Φ(2)θ(2) j , p(2) j 1 p(2) j , (1)

where x Gam(a, 1/c) represents a gamma distribution with mean a/c and variance a/c2. For t = 1, 2, . . . , T 1, the GBN factorizes the shape parameters of the gamma distributed hidden units θ(t) j RKt + of layer t into the product of the connection weight matrix Φ(t+1)

RKt Kt+1 + and the hidden units θ(t+1) j RKt+1 + of layer t + 1; the top layer s hidden units

θ(T) j share the same vector r = (r1, . . . , r KT ) as their gamma shape parameters; and the

p(2) j are probability parameters and {1/c(t)}3,T+1 are gamma scale parameters, with c(2) j := 1 p(2) j p(2) j . We will discuss later how to measure the connection strengths between the nodes of adjacent layers and the overall popularity of a factor at a particular hidden layer. For scale identiﬁability and ease of inference and interpretation, each column of Φ(t) RKt 1 Kt + is restricted to have a unit L1 norm and hence 0 Φ(t)(k , k) 1. To complete the hierarchical model, for t {1, . . . , T 1}, we let

φ(t) k Dir η(t), . . . , η(t) , rk Gam γ0/KT , 1/c0

where φ(t) k RKt 1 + is the kth column of Φ(t); we impose c0 Gam(e0, 1/f0) and γ0 Gam(a0, 1/b0); and for t {3, . . . , T + 1}, we let

p(2) j Beta(a0, b0), c(t) j Gam(e0, 1/f0). (3)

We expect the correlations between the Kt rows (latent features) of (θ(t) 1 , . . . , θ(t) J ) to be captured by the columns of Φ(t+1). Even if Φ(t) for t 2 are all identity matrices, indicating no correlations between the latent features to be captured, our analysis in Section 3.2 will show that a deep structure with T 2 could still beneﬁt data ﬁtting by better modeling the variability of the latent features θ(1) j . Before further examining the network structure, below we ﬁrst introduce a set of distributions that will be used to either model diﬀerent types of data or augment the model for simple inference.

Augmentable Gamma Belief Networks

2.1 Distributions for Count, Binary, and Nonnegative Real Data

Below we ﬁrst describe some useful count distributions that will be used later.

2.1.1 Useful Count Distributions and Their Relationships

Let the Chinese restaurant table (CRT) distribution l CRT(n, r) represent the random number of tables seated by n customers in a Chinese restaurant process (Blackwell and Mac Queen, 1973; Antoniak, 1974; Aldous, 1985; Pitman, 2006) with concentration parameter r. Its probability mass function (PMF) can be expressed as

P(l | n, r) = Γ(r)rl

Γ(n + r)|s(n, l)|,

where l Z, Z := {0, 1, . . . , n}, and |s(n, l)| are unsigned Stirling numbers of the ﬁrst kind. A CRT distributed sample can be generated by taking the summation of n independent Bernoulli random variables as

i=1 bi, bi Bernoulli [r/(r + i 1)] .

Let u Log(p) denote the logarithmic distribution (Fisher et al., 1943; Anscombe, 1950; Johnson et al., 1997) with PMF

P(u | p) = 1 ln(1 p) pu

where u {1, 2, . . .}, and let n NB(r, p) denote the negative binomial (NB) distribution (Greenwood and Yule, 1920; Bliss and Fisher, 1953) with PMF

P(n | r, p) = Γ(n + r)

n!Γ(r) pn(1 p)r,

where n Z. The NB distribution n NB(r, p) can be generated as a gamma mixed Poisson distribution as

n Pois(λ), λ Gam [r, p/(1 p)] ,

where p/(1 p) is the gamma scale parameter. As shown in (Zhou and Carin, 2015), the joint distribution of n and l given r and p in

l CRT(n, r), n NB(r, p),

where l {0, . . . , n} and n Z, is the same as that in

n = Pl t=1 ut, ut Log(p), l Pois[ r ln(1 p)], (4)

which is called the Poisson-logarithmic bivariate distribution, with PMF

P(n, l | r, p) = |s(n, l)|rl

n! pn(1 p)r.

We will exploit these relationships to derive eﬃcient inference for the proposed models.

Zhou, Cong, and Chen

2.1.2 Bernoulli-Poisson Link and Truncated Poisson Distribution

As in Zhou (2015), the Bernoulli-Poisson (Ber Po) link thresholds a random count at one to obtain a binary variable as

b = 1(m 1), m Pois(λ), (5)

where b = 1 if m 1 and b = 0 if m = 0. If m is marginalized out from (5), then given λ, one obtains a Bernoulli random variable as b Ber 1 e λ . The conditional posterior of the latent count m can be expressed as

(m | b, λ) b Pois+(λ),

where x Pois+(λ) follows a truncated Poisson distribution, with P(x = k) = (1 e λ) 1λke λ/k! for k {1, 2, . . .}. Thus if b = 0, then m = 0 almost surely (a.s.), and if b = 1, then m Pois+(λ), which can be simulated with a rejection sampler that has a minimal acceptance rate of 63.2% at λ = 1 (Zhou, 2015). Given the latent count m and a gamma prior on λ, one can then update λ using the gamma-Poisson conjugacy. The Ber Po link shares some similarities with the probit link that thresholds a normal random variable at zero, and the logistic link that lets b Ber[ex/(1 + ex)]. We advocate the Ber Po link as an alternative to the probit and logistic links since if b = 0, then m = 0 a.s., which could lead to signiﬁcant computational savings if the binary vectors are sparse. In addition, the conjugacy between the gamma and Poisson distributions makes it convenient to construct hierarchical Bayesian models amenable to posterior simulation.

2.1.3 Poisson Randomized Gamma and Truncated Bessel Distributions

To model nonnegative data that include both zeros and positive observations, we introduce the Poisson randomized gamma (PRG) distribution as

x PRG(λ, c),

whose distribution has a point mass at x = 0 and is continuous for x > 0. The PRG distribution is generated as a Poisson mixed gamma distribution as

x Gam(n, 1/c), n Pois(λ),

in which we deﬁne Gam(0, 1/c) = 0 a.s. and hence x = 0 if and only n = 0. Thus the PMF of x PRG(λ, c) can be expressed as

f X(x | λ, c) =

n=0 Gam(x; n, 1/c)Pois(n; λ)

= e λ 1(x=0) "

λcx #1(x>0) , (6)

n!Γ(n), α > 0

Augmentable Gamma Belief Networks

is the modiﬁed Bessel function of the ﬁrst kind Iν(α) with ν ﬁxed at 1. Using the laws of total expectation and total variance, or using the PMF directly, one may show that

E[x | λ, c] = λ/c, var[x | λ, c] = 2λ/c2.

Thus the variance-to-mean ratio of the PRG distribution is 2/c, as controlled by c. The conditional posterior of n given x, λ, and c can be expressed as

f N(n | x, λ, c) = Gam(x; n, 1/c)Pois(n; λ)

PRG(x; λ, c)

= 1(x = 0)δ0 + 1(x > 0)

λcx (λcx)n 1

2 n!Γ(n) δn

= 1(x = 0)δ0 + 1(x > 0)

n=1 Bessel 1(n; 2

cxλ)δn , (7)

where we deﬁne n Bessel 1(α) as the truncated Bessel distribution, with PMF

Bessel 1(n; α) =

I 1(α)n!Γ(n), n {1, 2, . . .}.

Thus n = 0 if and only if x = 0, and n is a positive integer drawn from a truncated Bessel distribution if x > 0. In Appendix A, we plot the probability distribution functions of the proposed PRG and truncated Bessel distributions and show how they diﬀer from the randomized gamma and Bessel distributions (Yuan and Kalbﬂeisch, 2000), respectively.

2.2 Link Functions for Three Diﬀerent Types of Observations

If the observations are multivariate count vectors x(1) j ZV , where V := K0, then we link the integer-valued visible units to the nonnegative real hidden units at layer one using Poisson factor analysis (PFA) as

x(1) j Pois Φ(1)θ(1) j . (8)

Under this construction, the correlations between the K0 rows (features) of (x(1) 1 , . . . , x(1) J ) are captured by the columns of Φ(1). Detailed descriptions on how PFA is related to a wide variety of discrete latent variable models, including nonnegative matrix factorization (Lee and Seung, 2001), latent Dirichlet allocation (Blei et al., 2003), the gamma-Poisson model (Canny, 2004), discrete Principal component analysis (Buntine and Jakulin, 2006), and the focused topic model (Williamson et al., 2010), can be found in Zhou et al. (2012) and Zhou and Carin (2015). We call PFA using the GBN in (1) as the prior on its factor scores as the Poisson gamma belief network (PGBN), as proposed in Zhou et al. (2015a). The PGBN can be naturally applied to factorize the term-document frequency count matrix of a text corpus, not only extracting semantically meaningful topics at multiple layers, but also capturing the relationships between the topics of diﬀerent layers using the deep network, as discussed below in both Sections 2.3 and 4.

Zhou, Cong, and Chen

x_1 x_2 x_3 x_6

Figure 1: An example directed network of ﬁve hidden layers, with K0 = 8 visible units, [K1, K2, K3, K4, K5] = [6, 4, 3, 3, 2], and sparse connections between the units of adjacent layers.

If the observations are high-dimensional sparse binary vectors b(1) j {0, 1}V , then we factorize them using Bernoulli-Poisson factor analysis (Ber-PFA) as

b(1) j = 1 x(1) j 0 , x(1) j Pois Φ(1)θ(1) j . (9)

We call Ber-PFA with the augmentable GBN as the prior on its factor scores θ(1) j as the Bernoulli-Poisson gamma belief network (Ber Po-GBN). If the observations are high-dimensional sparse nonnegative real-valued vectors y(1) j RV +, then we factorize them using Poisson randomized gamma (PRG) factor analysis as

y(1) j Gam(x(1) j , 1/aj), x(1) j Pois Φ(1)θ(1) j . (10)

We call PRG factor analysis with the augmentable GBN as the prior on its factor scores θ(1) j as the PRG gamma belief network (PRG-GBN). We show in Figure 1 an example directed belief network of ﬁve hidden layers, with K0 = 8 visible units, with 6, 4, 3, 3, and 2 hidden units for layers one, two, three, four, and ﬁve, respectively, and with sparse connections between the units of adjacent layers.

2.3 Exploratory Data Analysis

To interpret the network structure of the GBN, we notice that

E x(1) j θ(t) j , {Φ(ℓ), c(ℓ) j }1,t =

ℓ=1 Φ(ℓ) # θ(t) j Qt ℓ=2 c(ℓ) j , (11)

E θ(t) j {Φ(ℓ), c(ℓ) j }t+1,T , r =

ℓ=t+1 Φ(ℓ) # r QT+1 ℓ=t+1 c(ℓ) j . (12)

Augmentable Gamma Belief Networks

1_3 1_2 1_1

Figure 2: Extracted from the network shown in Figure 1, the left plot is a tree rooted at node 5 1, the middle plot is a tree rooted at node 3 3, and the right plot is a subnetwork consisting of both the tree rooted at node 3 1 and the tree rooted at node 3 3.

Thus for visualization, it is straightforward to project the Kt topics/hidden units/factor loadings/nodes of layer t {1, . . . , T} to the bottom data layer as the columns of the V Kt matrix t Y

ℓ=1 Φ(ℓ), (13)

and rank their popularities using the Kt dimensional nonnegative weight vector

ℓ=t+1 Φ(ℓ) #

To measure the connection strength between node k of layer t and node k of layer t 1, we use the value of Φ(t)(k , k),

which is also expressed as φ(t) k (k ) or φ(t) k k. Our intuition is that examining the nodes of the hidden layers, via their projections to the bottom data layer, from the top to bottom layers will gradually reveal less general and more speciﬁc aspects of the data. To verify this intuition and further understand the relationships between the general and speciﬁc aspects of the data, we consider extracting a tree for each node of layer t, where t 2, to help visualize the inferred multilayer deep structure. To be more speciﬁc, to construct a tree rooted at a node of layer t, we grow the tree downward by linking the root node (if at layer t) or each leaf node of the tree (if at a layer below layer t) to all the nodes at the layer below that are connected to the root/leaf node with non-negligible weights. Note that a tree in our deﬁnition permits a node to have more than one parent, which means that diﬀerent branches of the tree can overlap with each other. In addition, we also consider extracting subnetworks, each of which consists of multiple related trees from the full deep network. For example, shown in the left of Figure 2 is the tree extracted from the network in Figure 1 using node 5 1 as the root, shown in the middle is the tree using node 3 3 as the root, and shown in the right is a subnetwork consisting of two related trees that are rooted at nodes 3 1 and 3 3, respectively.

Zhou, Cong, and Chen

2.3.1 Visualizing Nodes of Different Layers

Before presenting the technical details, we ﬁrst provide some example results obtained with the PGBN on extracting multilayer representations from the 11,269 training documents of the 20newsgroups data set (http://qwone.com/ jason/20Newsgroups/). Given a ﬁxed budget of K1 max = 800 on the width of the ﬁrst layer, with η(t) = 0.1 for all t, a ﬁve-layer deep network inferred by the PGBN has a network structure as [K1, K2, K3, K4, K5] = [386, 63, 58, 54, 51], meaning that there are 386, 63, 58, 54, and 51 nodes at layers one to ﬁve, respectively. For visualization, we ﬁrst relabel the nodes at each layer based on their weights {r(t) k }1,Kt, calculated as in (14), with a more popular (larger weight) node assigned with a smaller label. We visualize node k of layer t by displaying its top 12 words ranked according to their probabilities in Qt 1 ℓ=1 Φ(ℓ) φ(t) k , the kth column of the projected representation calculated

as in (13). We set the font size of node k of layer t proportional to r(t) k /r(t) 1 1

10 in each subplot, and color the outside border of a text box as red, green, orange, blue, or black for a node of layer ﬁve, four, three, two, or one, respectively. For better interpretation, we also exclude from the vocabulary the top 30 words of node 1 of layer one: don just like people think know time good make way does writes edu ve want say really article use right did things point going better thing need sure used little, and the top 20 words of node 2 of layer one: edu writes article com apr cs ca just know don like think news cc david university john org wrote world. These 50 words are not in the standard list of stopwords but can be considered as stopwords speciﬁc to the 20newsgroups corpus discovered by the PGBN. For the [386, 63, 58, 54, 51] PGBN learned on the 20newsgroups corpus, we plot 54 example topics of layer one in Figure 3, the top 30 topics of layer three in Figure 4, and the top 30 topics of layer ﬁve in Figure 5. Figure 3 clearly shows that the topics of layer one, except for topics 1-3 that mainly consist of common functional words of the corpus, are all very speciﬁc. For example, topics 71 and 81 shown in the ﬁrst row are about candida yeast symptoms and sex, respectively, topics 53, 73, 83, and 84 shown in the second row are about printer, msg, police radar detector, and Canadian health care system, respectively, and topics 46 and 76 shown in third row are about ice hockey and second amendment, respectively. By contrast, the topics of layers three and ﬁve, shown in Figures 4 and 5, respectively, are much less speciﬁc and can in general be matched to one or two news groups out of the 20 news groups, including comp.{graphics, os.mswindows.misc, sys.ibm.pc.hardware, sys.mac.hardware, windows.x}, rec.{autos, motorcycles}, rec.sport.{baseball, hockey}, sci.{crypt, electronics, med, space}, misc.forsale, talk. politics.{misc, guns, mideast}, and {talk.religion.misc, alt.atheism, soc.religion.christian}.

2.3.2 Visualizing Trees Rooted at The Top-Layer Hidden Units

While it is interesting to examine the topics of diﬀerent layers to understand the general and speciﬁc aspects of the corpus used to train the PGBN, it would be more informative to further illustrate how the topics of diﬀerent layers are related to each other. Thus we consider constructing trees to visualize the PGBN. We ﬁrst pick a node as the root of a tree and grow the tree downward by drawing a line from node k at layer t, the root or a leaf node of the tree, to node k at layer t 1 for all k in the set {k : Φ(t)(k , k) > τt/Kt 1},

Augmentable Gamma Belief Networks

new problem

work probably

let said years

ll long question

game hockey

season games

nhl year league

teams players

pain treatment

medical patients medicine

help skin blood problems

bit keys number

serial encrypted

bits clipper escrow algorithm

zoo spencer

work utzoo zoology

man umd kipling

wire wiring neutral outlets circuit

cable electrical

wires outlet connected

61 callison

car ecn lights uokmax

fake bird continental

yeast noring

steve symptoms

jon vitamin

dyer body quack patients

child copy women children protection pregnancy depression abstinence

sexual education

netcom andrew

new uucp internet

mark steve opinions

new shipping

offer condition

price asking

cd interested

nasa earth orbit

gov data shuttle mission

lunar spacecraft

israeli lebanese

arab israelis civilians

killed palestinian

42 graphics

ftp pub data

ray risc contact

sgi machines

pc instruction

lost san new kaldis rutgers

york houston

ﬁ ﬁnland germany

canada players ﬁnnish german

april swedish

atlanta technology

institute internet

uucp gilkey hydra lankford

dave fame smith eddie murray winﬁeld kingman

yount steve guys

fax advance

hi info information

high signal

low voltage

chip radio battery

guns ﬁrearms

crime control weapons

self criminals

defense handgun

ﬁrearm violent

gay homosexual

sexual clayton

sex homosexuals homosexuality

male virginia

state magnus

ncr atlantaga

hp print laser

ink printers deskjet postscript

bj canon quality

isc ultb bobby mozumder

snm religious

eric mom atheist

dyer spdcc glutamate

food effects

brain studies humans

blood olney foods

83 radar detector detectors

car beam illegal antenna

radio virginia

jesus christian

bible christ christians

life faith believe

man christianity

14 law government

state court

laws states public

case civil legal federal

gas stratus compound

waco children

atf cdt believe government

armenian armenians

turks armenia

turkey genocide

soviet today russian government

ride motorcycle

shaft rider motorcycles

rec denizens

54 ted colorado

thf kimbark

teel uchicago

psu large psuvm

kent apple newton private

alink ksand activities

order royalroads

74 lib libxmu

ld symbol undeﬁned

usr imake problem

84 insurance

health private

care canada coverage canadian

hospital medical

pay public national

bit card speed

port board simms memory

dod ride bikes riding motorcycle

sun left road

bnr ama honda

israel israeli jewish

land center policy

anti research palestine

35 pitt gordon

geb skepticism

cadre chastity intellect

jxp soon surrender shameful

ux illinois urbana

cka hopkins champaign

55 uga georgia michael

ai covington mcovingt

easter programs

jayne athens amateur research

car drive manual drivers

cars fuel automatic

auto lane shift

cb pack bike

iran jury farid energy bombing

iraq military

iraqi bomb civilians

6 available

ﬁle sun program

mit information

16 clipper government

key encryption

nsa phone crypto secure

keys netcom

went didn came

told saw home

left started

says took apartment

mb bus ide controller

isa bit pc drive

pp puck goal ﬂyers shots

pts scorer second lindros

polygon algorithm

points reference

edge cartridge program

66 fred software

dseg nick level process

mksol negev bedouin

76 militia amendment

arms bear second constitution

regulated government

shall state ulowell organized

tt truetype

type atm printer characters

ps character

Figure 3: Example topics of layer one of the PGBN trained on the 20newsgroups corpus.

1 thanks mail email information

help fax software info advance looking new hi

2 mac apple thanks bit

mhz card modem ram speed port board memory

3 god jesus believe christian

bible true question life christians said faith christ

4 car cars new engine miles dealer price drive

year road buy speed

5 windows dos ﬁle ﬁles thanks problem program using

run running win ms

6 team game hockey season

games nhl year play league new players teams

7 year game team baseball runs games season players

hit win league years

8 god jesus christian bible christ believe christians church

life faith man said

9 bike dod ride bikes riding motorcycle got sun

left new said ll

10 window thanks widget application

using display available program

mail server set software

11 power thanks mail output

input high work line using low phone circuit

12 government key encryption clipper

chip nsa phone public keys secure crypto escrow

13 gun guns ﬁrearms crime

control law weapons state new self government said

14 disease doctor pain new help treatment medical years said problem thanks patients

15 card monitor video vga windows thanks drivers cards

mode mail graphics ati

16 drive scsi disk hard drives mb windows thanks

controller ﬂoppy mail ide

17 space nasa new gov

work henry long doesn

said believe let ll

18 israel jews israeli arab

jewish state arabs peace

land policy new years

19 new sale shipping offer

price mail condition drive thanks asking email interested

20 space nasa new information

gov earth data orbit research program shuttle national

21 koresh fbi batf believe gas compound children stratus

waco said atf government

22 new sale shipping thanks

mail offer price condition email interested asking best

23 new uiuc believe said

cso read let ll doesn says long post

24 objective morality moral keith

frank values livesey jon

wrong caltech sgi isn

25 tax clinton taxes government

new money ll house pay look bush congress

26 law new state government

said president mr rights states information public national

27 men cramer gay homosexual

sexual optilink clayton sex homosexuals homosexuality male state

28 space nasa new work

gov henry long doesn

ll said believe let

29 pitt gordon banks geb skepticism soon cadre intellect chastity jxp surrender shameful

30 turkish armenian armenians turks armenia turkey government genocide

soviet said today new

Figure 4: The top 30 topics of layer three of the PGBN trained on the 20newsgroups corpus.

1 windows thanks ﬁle dos window using mail program

problem help ﬁles work

2 god believe jesus true question said new christian

bible life moral objective

3 car bike new cars dod engine got thanks

price road ll miles

4 thanks mail email help information fax software info

new advance looking hi

5 card thanks new mac

apple mail drive bit work video mb problem

6 space nasa new gov

work years long earth

said orbit ll year

7 team game hockey season

games nhl year play new league players teams

8 god jesus christian bible believe christ christians life

new church said man

9 key chip government clipper

encryption keys nsa phone

public escrow secure new

10 card thanks mac windows

apple mail drive bit problem work mb new

11 car bike new cars

dod engine got road

said ll miles ride

12 thanks new mail power

email work phone high

help sale price using

13 year game team baseball

games runs season players

hit win league years

14 disease new doctor pain

help treatment said years thanks problem medical work

15 israel jews israeli armenian

turkish new state government

said armenians years law

16 gun koresh fbi believe batf said government children

guns new gas compound

17 thanks drive card mail

work mac new bit apple problem power mb

18 gun guns ﬁrearms crime

law control new state weapons government said believe

19 tax clinton government new

taxes law ll state said money believe years

20 god jesus believe new

said christian bible life

read day says doesn

21 drive disk scsi hard drives mb thanks windows

controller mail ﬂoppy ide

22 thanks mail information email

new help fax software space info available send

23 israel jews israeli arab

jewish state new arabs

peace land years true

24 pitt gordon banks geb skepticism soon cadre intellect chastity jxp surrender shameful

25 new sale shipping offer price mail thanks condition drive asking email interested

26 team game year season games hockey players play

league new win baseball

27 men cramer gay new homosexual sexual optilink said

believe state government law

28 god jesus christian bible believe christ christians life

new church said faith

29 year game team baseball

games runs season players

hit new years win

30 new mail thanks key internet information email work

number ll read believe

Figure 5: The top 30 topics of layer ﬁve of the PGBN trained on the 20newsgroups corpus.

Zhou, Cong, and Chen

1 windows thanks ﬁle dos window using mail program

problem help ﬁles work

1 windows thanks ﬁle dos window using mail program

problem help ﬁles work

1 thanks mail email information

help fax software info advance looking new hi

5 windows dos ﬁle ﬁles thanks problem program using

run running win ms

10 window thanks widget application

using display available program

mail server set software

15 card monitor video vga windows thanks drivers cards

mode mail graphics ati

1 thanks mail email information

help fax software info advance new looking send

26 image jpeg gif ﬁle ﬁles color images bit format thanks windows program

6 windows dos ﬁle problem

thanks ﬁles using program

run win running ms

9 window thanks widget application

using display available program

mail server set software

16 card monitor video vga windows drivers thanks cards

mode mail graphics ati

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

fax advance

hi info information

6 available

ﬁle sun program

mit information

7 new information

national research program

year april center washington

42 graphics

ftp pub data

ray risc contact

sgi machines

pc instruction

polygon algorithm

points reference

edge cartridge program

59 au australia

oz mark philips canberra melbourne

uwa quarterdeck

69 graphics

ch comp umich

split aspects

engin radiosity

grafsys hidden

ﬁle ﬁles problem program

bit display graphics program

ac ed picture

dcs sleeve liverpool

tony simon oxford warwick

hp print laser

ink printers deskjet postscript

bj canon quality

tt truetype

type atm printer characters

ps character

widget application

display manager

set using problem

color value

68 xterm keyboard

key echo keys

set keycode sequences

ff hostname

74 lib libxmu

ld symbol undeﬁned

usr imake problem

17 card monitor

vga drivers

ati graphics

svga driver

Figure 6: A [18, 5, 4, 1, 1] tree that includes all the lower-layer nodes (directly or indirectly) linked with non-negligible weights to the top ranked node of the top layer, taken from the full [386, 63, 58, 54, 51] network inferred by the PGBN on the 11,269 training documents of the 20newsgroups corpus, with η(t) = 0.1 for all t. A line from node k at layer t to node k

at layer t 1 indicates that Φ(t)(k , k) > 3/Kt 1, with the width of the line proportional

Φ(t)(k , k). For each node, the rank (in terms of popularity) at the corresponding layer and the top 12 words of the corresponding topic are displayed inside the text box, where the text font size monotonically decreases as the popularity of the node decreases, and the outside border of the text box is colored as red, green, orange, blue, or black if the node is at layer ﬁve, four, three, two, or one, respectively.

where we set the width of the line connecting node k of layer t to node k of layer t 1

be proportional to q

Φ(t)(k , k) and use τt to adjust the complexity of a tree. In general, increasing τt would discard more weak connections and hence make the tree simpler and easier to visualize. We set τt = 3 for all t to visualize both a ﬁve-layer tree rooted at the top ranked node of the top hidden layer, as shown in Figure 6, and a ﬁve-layer tree rooted at the second ranked node of the top hidden layer, as shown in Figure 7. For the tree in Figure 6, while it is somewhat vague to determine the actual meanings of both node 1 of layer ﬁve and node 1 of layer four based on their top words, examining the more speciﬁc topics of layers three and two within the tree clearly indicate that this tree is primarily about windows, window system, graphics, information, and software, which are relatively speciﬁc concepts that are all closely related to each other. The similarities and diﬀerences between the ﬁve nodes of layer two can be further understood by examining the nodes of layer one that are connected to them. For example, while nodes 26 and 16 of layer two share their connections to multiple nodes of layer one, node 27 of layer one on image is strongly connected to node 26 of layer two but not to node 16 of layer two, and node 17 of layer one on video is strongly connected to node 16 of layer two but not to node 26 of layer two.

Augmentable Gamma Belief Networks

2 god believe jesus true question said new christian

bible life moral objective

5 god jesus believe christian

bible true question said life christians faith religion

21 objective morality moral keith

frank values livesey jon

wrong caltech sgi isn

3 god jesus believe christian

bible true question life christians said faith christ

33 islam islamic muslims muslim

jaeger gregg au qur monash bu women rushdie

24 objective morality moral keith

frank values livesey jon

wrong caltech sgi isn

5 god jesus christian bible christ believe christians life

man faith said come

17 god science believe evidence

atheism question exist true atheists existence reason belief

34 islam islamic muslims muslim

jaeger gregg au qur monash bu rushdie women

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

4 god jesus christian

bible christ christians

life faith believe

man christianity

55 uga georgia michael

ai covington

easter programs

jayne athens amateur research

kent apple newton private

alink ksand activities

order royalroads

101 matthew

king messiah prophecies

isaiah jesus testament

disciples josephus

14 law government

state court

laws states public

civil legal federal

19 god science evidence

exist atheists existence

true atheist religion

ac ed picture

dcs sleeve liverpool

tony simon oxford warwick

isc ultb bobby mozumder

snm religious

eric mom atheist

wam mangoe wingate charley

luke revelation

kmr matthew

88 truth absolute absolutes

scripture christians

truths arrogant

true arrogance

bible believe

israel israeli jewish

land center

anti research palestine

islamic muslims

jaeger gregg

bu au rushdie women

27 objective morality moral keith

frank values livesey jon

wrong caltech sgi isn

29 objective

keith frank values livesey

jon caltech

sgi murder natural

ux illinois urbana

cka hopkins champaign

ico bobbe beauchaine

robert bronx

bob sea queens

Figure 7: Analogous plot to Figure 6 for a tree on religion, rooted at node 2 of the top-layer.

Following the branches of each tree shown in both ﬁgures, it is clear that the topics become more and more speciﬁc when moving along the tree from the top to bottom. Taking the tree on religion shown in Figure 7 for example, the root node splits into two nodes when moving from layers ﬁve to four: while the left node is still mainly about religion, the right node is on objective morality. When moving from layers four to three, node 5 of layer four splits into a node about Christian and another node about Islamic. When moving from layers three to two, node 3 of layer three splits into a node about God, Jesus, & Christian, and another node about science, atheism, & question of the existence of God. When moving from layers two to one, all four nodes of layer two split into multiple topics, and they are all strongly connected to both topics 1 and 2 of layer one, whose top words are those that appear frequently in the 20newsgroups corpus.

2.3.3 Visualizing Subnetworks Consisting of Related Trees

Examining the top-layer topics shown in Figure 5, one may ﬁnd that some of the nodes seem to be closely related to each other. For example, topics 3 and 11 share eleven words out of the top twelve ones; topics 15 and 23 both have Israel and Jews as their top two words; topics 16 and 18 are both related to gun; and topics 7, 13, and 26 all share team(s), game(s), player(s), season, and league. To understand the relationships and distinctions between these related nodes, we construct subnetworks that include the trees rooted at them, as shown in Figures 17-20 in Appendix C. It is clear from Figure 17 that the top-layer topic 3 diﬀers from topic 11 in that it is not only strongly connected to topic 2 of layer four on car & bike, but also has a non-negligible connection to topic 27 of layer four on sales. It is clear from Figure 18

Zhou, Cong, and Chen

that topic 15 diﬀers from topic 23 in that it is not only about Israel & Arabs, but also about Israel, Armenia, & Turkey. It is clear from Figure 19 in that topic 16 diﬀers from topic 18 in that it is mainly about Waco siege happened in 1993 involving David Koresh, the Federal Bureau of Investigation (FBI), and the Bureau of Alcohol, Tobacco, Firearms and Explosives (BATF). It is clear from Figure 20 that topics 7 and 13 are mainly about ice hockey and baseball, respectively, and topic 26 is a mixture of both.

2.3.4 Capturing Correlations Between Nodes

For the augmentable GBN, as in (18), given the weight vector θ(1) j , we have

E x(1) j Φ(1), θ(1) j = Φ(1)θ(1) j . (15)

A distinction between a shallow augmentable GBN with T = 1 hidden layer and a deep augmentable GBN with T 2 hidden layers is that the prior for θ(1) j changes from θ(1) j

Gam(r, 1/c(2) j ) for T = 1 to θ(1) j Gam(Φ(2)θ(2) j , 1/c(2) j ) for T 2. For the GBN with T = 1, given the shared weight vector r, we have

E x(1) j Φ(1), r = Φ(1)r/c(2) j ; (16)

for the GBN with T = 2, given the shared weight vector r, we have

E x(1) j Φ(1), Φ(2), r = Φ(1)Φ(2)r . c(2) j c(3) j ; (17)

and for the GBN with T 2, given the weight vector θ(2) j , we have

E x(1) j Φ(1), Φ(2), θ(2) j = Φ(1)Φ(2)θ(2) j /c(2) j . (18)

Thus in the prior, the co-occurrence patterns of the columns of Φ(1) are modeled by only a single vector r when T = 1, but are captured in the columns of Φ(2) when T 2. Similarly, in the prior, if T t + 1, the co-occurrence patterns of the Kt columns of the projected topics Qt ℓ=1 Φ(ℓ) will be captured in the columns of the Kt Kt+1 matrix Φ(t+1). To be more speciﬁc, we show in Figure 21 in Appendix C three example trees rooted at three diﬀerent nodes of layer three, where we lower the threshold to τt = 1 to reveal more weak links between the nodes of adjacent layers. The top subplot reveals that, in addition to strongly co-occurring with the top two topics of layer one, topic 21 of layer one on medicine tends to co-occur not only with topics 7, 21, and 26, which are all common topics that frequently appear, but also with some much less common topics that are related to very speciﬁc diseases or symptoms, such as topic 67 on msg and Chinese restaurant syndrome, topic 73 on candida yeast symptoms, and topic 180 on acidophilous and astemizole (hismanal). The middle subplot reveals that topic 31 of layer two on encryption & cryptography tends to co-occur with topic 13 of layer two on government & encryption, and it also indicates that topic 31 of layer one is more purely about encryption and more isolated from government in comparison to the other topics of layer one.

Augmentable Gamma Belief Networks

The bottom subplot reveals that in layer one, topic 14 on law & government, topic 32 on Israel & Lebanon, topic 34 on Turkey, Armenia, Soviet Union, & Russian, topic 132 on Greece, Turkey, & Cyprus, topic 98 on Bosnia, Serbs, & Muslims, topic 143 on Armenia, Azeris, Cyprus, Turkey, & Karabakh, and several other very speciﬁc topics related to Turkey and/or Armenia all tend to co-occur with each other. We note that capturing the co-occurrence patterns between the topics not only helps exploratory data analysis, but also helps extract better features for classiﬁcation in an unsupervised manner and improves prediction for held-out data, as will be demonstrated in detail in Section 4.

2.4 Related Models

The structure of the augmentable GBN resembles the sigmoid belief network and recently proposed deep exponential family model (Ranganath et al., 2014b). Such kind of gamma distribution based network and its inference procedure were vaguely hinted in Corollary 2 of Zhou and Carin (2015), and had been exploited by Acharya et al. (2015) to develop a gamma Markov chain to model the temporal evolution of the factor scores of a dynamic count matrix, but have not yet been investigated for extracting multilayer data representations. The proposed augmentable GBN may also be considered as an exponential family harmonium (Welling et al., 2004; Xing et al., 2005).

2.4.1 Sigmoid and Deep Belief Networks

Under the hierarchical model in (1), given the connection weight matrices, the joint distribution of the observed/latent counts and gamma hidden units of the GBN can be expressed, similar to those of the sigmoid and deep belief networks (Bengio et al., 2015), as

P x(1) j , {θ(t) j }t {Φ(t)}t = P x(1) j Φ(1), θ(1) j "T 1 Y

t=1 P θ(t) j Φ(t+1), θ(t+1) j #

With φv: representing the vth row Φ, for the gamma hidden units θ(t) vj we have

P θ(t) vj φ(t+1) v: , θ(t+1) j , c(t+1) j+1 =

c(t+1) j+1 φ(t+1) v: θ(t+1) j

Γ φ(t+1) v: θ(t+1) j θ(t) vj φ(t+1) v: θ(t+1) j 1 e c(t+1) j+1 θ(t) vj , (19)

which are highly nonlinear functions that are strongly desired in deep learning. By contrast, with the sigmoid function σ(x) = 1/(1 + e x) and bias terms b(t+1) v , a sigmoid/deep belief network would connect the binary hidden units θ(t) vj {0, 1} of layer t (for deep belief networks, t < T 1 ) to the product of the connection weights and binary hidden units of the next layer with

P θ(t) vj = 1 φ(t+1) v: , θ(t+1) j , b(t+1) v = σ b(t+1) v + φ(t+1) v: θ(t+1) j . (20)

Comparing (19) with (20) clearly shows the distinctions between the gamma distributed nonnegative hidden units and the sigmoid link function based binary hidden units. The

Zhou, Cong, and Chen

limitation of binary units in capturing the approximately linear data structure over small ranges is a key motivation for Frey and Hinton (1999) to investigate nonlinear Gaussian belief networks with real-valued units. As a new alternative to binary units, it would be interesting to further investigate whether the gamma distributed nonnegative real units can in theory carry richer information and model more complex nonlinearities given the same network structure. Note that the rectiﬁed linear units have emerged as powerful alternatives of sigmoid units to introduce nonlinearity (Nair and Hinton, 2010). It would be interesting to investigate whether the gamma units can be used to introduce nonlinearity into the positive region of the rectiﬁed linear units.

2.4.2 Deep Poisson Factor Analysis

With T = 1, the PGBN speciﬁed by (1)-(3) and (8) reduces to Poisson factor analysis (PFA) using the (truncated) gamma-negative binomial process (Zhou and Carin, 2015), with a truncation level of K1. As discussed in (Zhou et al., 2012; Zhou and Carin, 2015), with priors imposed on neither φ(1) k nor θ(1) j , PFA is related to nonnegative matrix factorization

(Lee and Seung, 2001), and with the Dirichlet priors imposed on both φ(1) k and θ(1) j , PFA is related to latent Dirichlet allocation (Blei et al., 2003). Related to the PGBN and the dynamic model in (Acharya et al., 2015), the deep exponential family model of Ranganath et al. (2014b) also considers a gamma chain under Poisson observations, but it is the gamma scale parameters that are chained and factorized, which allows learning the network parameters using black box variational inference (Ranganath et al., 2014a). In the proposed PGBN, we chain the gamma random variables via the gamma shape parameters. Both strategies worth through investigation. We prefer chaining the shape parameters in this paper, which leads to eﬃcient upward-downward Gibbs sampling via data augmentation and makes it clear how the latent counts are propagated across layers, as discussed in detail in the following sections. The sigmoid belief network has also been recently incorporated into PFA for deep factorization of count data (Gan et al., 2015a), however, that deep structure captures only the correlations between binary factor usage patterns but not the full connection weights. In addition, neither Ranganath et al. (2014b) nor Gan et al. (2015a) provide a principled way to learn the network structure, whereas the proposed GBN uses the gamma-negative binomial process together with a greedy layer-wise training strategy to automatically infer the widths of the hidden layers, which will be described in Section 3.3.

2.4.3 Correlated and Tree-Structured Topic Models

The PGBN with T = 2 can also be related to correlated topic models (Blei and Laﬀerty, 2006; Paisley et al., 2012; Chen et al., 2013; Ranganath and Blei, 2015; Linderman et al., 2015), which typically use the logistic normal distributions to replace the topic-proportion Dirichlet distributions used in latent Dirichlet allocation (Blei et al., 2003), capturing the co-occurrence patterns between the topics in the latent Gaussian space using a covariance matrix. By contrast, the PGBN factorizes the topic usage weights (not proportions) under the gamma likelihood, capturing the co-occurrence patterns between the topics of the ﬁrst layer (i.e., the columns of Φ(1)) in the columns of Φ(2), the latent weight matrix connecting the hidden units of layers two and one. For the PGBN, the computation does not involve

Augmentable Gamma Belief Networks

matrix inversion, which is often necessary for correlated topic models without specially structured covariance matrices, and scales linearly with the number of topics, hence it is suitable to be used to capture the correlations between hundreds of or thousands of topics. As in Figures 6, 7, and 17-21, trees and subnetworks can be extracted from the inferred deep network to visualize the data. Tree-structured topic models have also been proposed before, such as those in Blei et al. (2010), Adams et al. (2010), and Paisley et al. (2015), but they usually artiﬁcially impose the tree structures to be learned, whereas the PGBN learns a directed network, from which trees and subnetworks can be extracted for visualization, without the need to specify the number of nodes per layer, restrict the number of branches per node, and forbid a node to have multiple parents.

3. Model Properties and Inference

Inference for the GBN shown in (1) appears challenging, because not only the conjugate prior is unknown for the shape parameter of a gamma distribution, but also the gradients are diﬃcult to evaluate for the parameters of the (log) gamma probability density function, which, as in (19), includes the parameters inside the (log) gamma function. To address these challenges, we consider data augmentation (van Dyk and Meng, 2001) that introduces auxiliary variables to make it simple to compute the conditional posteriors of model parameters via the joint distribution of the auxiliary and existing random variables. We will ﬁrst show that each gamma hidden unit can be linked to a Poisson distributed latent count variable, leading to a negative binomial likelihood for the parameters of the gamma hidden unit if it is margined out from the Poisson distribution; we then introduce an auxiliary count variable, which is sampled from the CRT distribution parametrized by the negative binomial latent count and shape parameter, to make the joint likelihood of the auxiliary CRT count and latent negative binomial count given the parameters of the gamma hidden unit amenable to posterior simulation. More speciﬁcally, under the proposed augmentation scheme, the gamma shape parameters will be linked to auxiliary counts under the Poisson likelihoods, making it straightforward for posterior simulation, as described below in detail.

3.1 The Upward Propagation of Latent Counts

We break the inference of the GBN of T hidden layers into T related subproblems, each of which is solved with the same subroutine. Thus for implementation, it is straightforward for the GBN to adjust its depth T. Let us denote x(t) j ZKt 1 as the observed or latent

count vector of layer t {1, . . . , T}, and x(t) vj as its vth element, where v {1, . . . , Kt 1}.

Lemma 1 (Augment-and-Conquer The Gamma Belief Network) With p(1) j := 1 e 1 and p(t+1) j := ln(1 p(t) j ) . h c(t+1) j ln(1 p(t) j ) i (21)

for t = 1, . . . , T, one may connect the observed or latent counts x(t) j ZKt 1 to the product

Φ(t)θ(t) j at layer t under the Poisson likelihood as

x(t) j Pois h Φ(t)θ(t) j ln 1 p(t) j i . (22)

Zhou, Cong, and Chen

Proof By deﬁnition (22) is true for layer t = 1. Suppose that (22) is also true for layer t > 1, then we can augment each count x(t) vj , where v {1, . . . , Kt 1}, into the summation

of Kt latent counts, which are smaller than or equal to x(t) vj as

k=1 x(t) vjk, x(t) vjk Pois h φ(t) vkθ(t) kj ln 1 p(t) j i . (23)

Let the symbol represent summing over the corresponding index and let

m(t)(t+1) kj := x(t) jk :=

v=1 x(t) vjk

represent the number of times that factor k {1, . . . , Kt} of layer t appears in observation j

and m(t)(t+1) j := x(t) j1, . . . , x(t) j Kt

. Since PKt 1 v=1 φ(t) vk = 1, we can marginalize out Φ(t) as in (Zhou et al., 2012), leading to

m(t)(t+1) j Pois h θ(t) j ln 1 p(t) j i .

Further marginalizing out the gamma distributed θ(t) j from the Poisson likelihood leads to

m(t)(t+1) j NB Φ(t+1)θ(t+1) j , p(t+1) j . (24)

Element k of m(t)(t+1) j can be augmented under its compound Poisson representation as

m(t)(t+1) kj =

x(t+1) kj X

ℓ=1 uℓ, uℓ Log(p(t+1) j ), x(t+1) kj Pois h φ(t+1) k: θ(t+1) j ln 1 p(t+1) j i .

Thus if (22) is true for layer t, then it is also true for layer t + 1.

Corollary 2 (Propagate the latent counts upward) Using Lemma 4.1 of (Zhou et al., 2012) on (23) and Theorem 1 of (Zhou and Carin, 2015) on (24), we can propagate the latent counts x(t) vj of layer t upward to layer t + 1 as

n x(t) vj1, . . . , x(t) vj Kt

x(t) vj , φ(t) v: , θ(t) j o Mult

x(t) vj , φ(t) v1θ(t) 1j PKt k=1 φ(t) vkθ(t) kj , . . . , φ(t) v Ktθ(t) Ktj PKt k=1 φ(t) vkθ(t) kj

x(t+1) kj m(t)(t+1) kj , φ(t+1) k: , θ(t+1) j CRT m(t)(t+1) kj , φ(t+1) k: θ(t+1) j . (26)

We provide a set of graphical representations in Figure 8 to describe the GBN model and illustrate the augment-and-conquer inference scheme. We provide the upward-downward Gibbs sampler in Appendix B.

Augmentable Gamma Belief Networks

( ) 1 jx ( )1

( ) 1 jθ ( ) 2

( ) 2 jθ ( ) 3

( ) T jθ ( ) T+1 jc

( ) { } 1 vjk x ( ) 1 jp

( ) 2 jθ ( ) 3

( ) { } 1 vjk x

( ) 2 jθ ( ) 3

( )( ) { } 1 2 kj m

( ) { } 1 vjk x

( ) 2 jp ( ) 2

( ) 2 jθ ( ) 3

( )( ) { } 1 2 kj m

( ) { } 1 vjk x

( ) 2 jp ( ) 2

( ) 2 jθ ( ) 3

( )( ) { } 1 2 kj m

( ) { } 1 vjk x

( ) 2 jθ ( ) 3

( )( ) { } 1 2 kj m

( ) 2 jp ( ) 2

( ) { } t vjk x ( )t

( )( ) { } t t+1 kj m

( ) t+1 jx ( ) t+1

( ) t+2 ( ) t+2 jc

( ) { } T vjk x ( ) T

( )( ) { } T T+1 kj m

( ) T+1 jx ( ) T+1 jp

Figure 8: Graphical representations of the model and data augmentation and marginalization based inference scheme. (a) graphical representation of the GBN hierarchical model. (b) an augmented representation of Poisson factor model of layer t = 1, corresponding to (23) with t = 1. (c) an alternative representation using the relationships between the Poisson and multinomial distributions, obtained by applying Lemma 4.1 of (Zhou et al., 2012) on (23) for t = 1. (d) a negative binomial distribution based representation that marginalizes out the gamma from the Poisson distributions, corresponding to (24) for t = 1. (e) an equivalent representation that introduces CRT distributed auxiliary variables, corresponding to (26) with t = 1. (f) an equivalent representation using Theorem 1 of (Zhou and Carin, 2015) on (24) and (26) for t = 1. (g) An representation obtained by repeating the same augmentation-marginalization steps described in (b)-(f) one layer at a time from layers 1 to t. (h) An representation of the top hidden layer.

Note that x(t) j = m(t)(t+1) j , and as the number of tables occupied by the customers is in the same order as the logarithm of the customer number in a Chinese restaurant process, x(t+1) kj is in the same order as ln m(t)(t+1) kj . Thus the total count of layer t + 1 as P j x(t+1) j would often be much smaller than that of layer t as P j x(t) j (though in general not as small

as a count that is in the same order of the logarithm of P j x(t) j ), and hence one may use

the total count P j x(T) j as a simple criterion to decide whether it is necessary to add more

Zhou, Cong, and Chen

layers to the GBN. In addition, if the latent count x(t) k k := P j x(t) k jk becomes close or equal

to zero, then the posterior mean of Φ(t)(k , k) could become so small that node k of layer t 1 can be considered to be disconnected from node k of layer t.

3.2 Modeling Data Variability With Distributed Representation

In comparison to a single-layer model with T = 1, which assumes that the hidden units of layer one are independent in the prior, the multilayer model with T 2 captures the correlations between them. Note that for the extreme case that Φ(t) = IKt for t 2 are all identity matrices, which indicates that there are no correlations between the features of θ(t 1) j left to be captured, the deep structure could still provide beneﬁts as it helps

model latent counts m(1)(2) j that may be highly overdispersed. For example, let us assume Φ(t) = IK2 for all t 2, then from (1) and (24) we have

m(1)(2) kj NB(θ(2) kj , p(2) j ), . . . , θ(t) kj Gam(θ(t+1) kj , 1/c(t+1) j ), . . . , θ(T) kj Gam(rk, 1/c(T+1) j ).

Using the laws of total expectation and total variance, we have

E θ(2) kj | rk = rk QT+1 t=3 c(t) j , var θ(2) kj | rk = rk

c(ℓ) j 2 # " T+1 Y

Further applying the same laws, we have

E m(1)(2) kj | rk = rkp(2) j 1 p(2) j QT+1 t=3 c(t) j ,

var m(1)(2) kj | rk = rkp(2) j 1 p(2) j 2QT+1 t=3 c(t) j

c(ℓ) j 1 #)

Thus the variance-to-mean ratio (VMR) of the count m(1)(2) kj given rk can be expressed as

VMR m(1)(2) kj | rk = 1 1 p(2) j

c(ℓ) j 1 #)

In comparison to PFA with m(1)(2) kj NB(rk, p(2) j ) given rk, with a VMR of 1/(1 p(2) j ),

the GBN with T hidden layers, which mixes the shape of m(1)(2) kj NB(θ(2) kj , p(2) j ) with a

chain of gamma random variables, increases VMR m(1)(2) kj | rk by a factor of

which is equal to 1 + (T 1)p(2) j

if we further assume c(t) j = 1 for all t 3. Therefore, by increasing the depth of the network to distribute the variability into more layers, the multilayer structure could increase its capacity to model data variability.

Augmentable Gamma Belief Networks

3.3 Learning The Network Structure With Layer-Wise Training

As jointly training all layers together is often diﬃcult, existing deep networks are typically trained using a greedy layer-wise unsupervised training algorithm, such as the one proposed in (Hinton et al., 2006) to train the deep belief networks. The eﬀectiveness of this training strategy is further analyzed in (Bengio et al., 2007). By contrast, the augmentable GBN has a simple Gibbs sampler to jointly train all its hidden layers, as described in Appendix B, and hence does not necessarily require greedy layer-wise training, but the same as these commonly used deep learning algorithms, it still needs to specify the number of layers and the width of each layer. In this paper, we adopt the idea of layer-wise training for the GBN, not because of the lack of an eﬀective joint-training algorithm that trains all layers together in each iteration, but for the purpose of learning the width of each hidden layer in a greedy layer-wise manner, given a ﬁxed budget on the width of the ﬁrst layer. The basic idea is to ﬁrst train a GBN with a single hidden layer, i.e., T = 1, for which we know how to use the gammanegative binomial process (Zhou and Carin, 2015; Zhou et al., 2015b) to infer the posterior distribution of the number of active factors; we ﬁx the width of the ﬁrst layer K1 with the number of active factors inferred at iteration B1, prune all inactive factors of the ﬁrst layer, and continue Gibbs sampling for another C1 iterations. Now we describe the proposed recursive procedure to build a GBN with T 2 layers. With a GBN of T 1 hidden layers that has already been inferred, for which the hidden units of the top layer are distributed as θ(T 1) j Gam(r, 1/c(T) j ), where r = (r1, . . . , r KT 1) , we add another layer by letting

θ(T 1) j Gam(Φ (T )θ(T) j , 1/c(T) j ), θ(T) j Gam(r, 1/c(T+1) j ), where Φ (T ) RKT 1 KTmax + and r is redeﬁned as r = (r1, . . . , r KTmax) . The key idea is with latent counts m(T)(T+1) kj upward propagated from the bottom data layer, one may marginalize out θ(T) kj , leading to

m(T)(T+1) kj NB(rk, p(T+1) j ), rk Gam(γ0/KT max, 1/c0), and hence can again rely on the shrinkage mechanism of a truncated gamma-negative binomial process to prune inactive factors (connection weight vectors, columns of Φ(T)) of layer T, making KT , the inferred layer width for the newly added layer, smaller than KT max if KT max is set to be suﬃciently large. The newly added layer and all the layers below would be jointly trained, but with the structure below the newly added layer kept unchanged. Note that when T = 1, the GBN infers the number of active factors if K1 max is set large enough, otherwise, it still assigns the factors with diﬀerent weights rk, but may not be able to prune any of them. The details of the proposed layer-wise training strategies are summarized in Algorithm 1 for multivariate count data, and in Algorithm 2 for multivariate binary and nonnegative real data.

4. Experimental Results

In this section, we present experimental results for count, binary, and nonnegative real data.

4.1 Deep Topic Modeling

We ﬁrst analyze multivariate count data with the Poisson gamma belief network (PGBN). We apply the PGBNs for topic modeling of text corpora, each document of which is represented as a term-frequency count vector. Note that the PGBN with a single hidden layer

Zhou, Cong, and Chen

is identical to the (truncated) gamma-negative binomial process PFA of Zhou and Carin (2015), which is a nonparametric Bayesian algorithm that performs similarly to the hierarchical Dirichlet process latent Dirichlet allocation of Teh et al. (2006) for text analysis, and is considered as a strong baseline. Thus we will focus on making comparison to the PGBN with a single layer, with its layer width set to be large to approximate the performance of the gamma-negative binomial process PFA. We evaluate the PGBNs performance by examining both how well they unsupervisedly extract low-dimensional features for document classiﬁcation, and how well they predict heldout word tokens. Matlab code will be available in http://mingyuanzhou.github.io/.

We use Algorithm 1 to learn, in a layer-wise manner, from the training data the connection weight matrices Φ(1), . . . , Φ(Tmax) and the top-layer hidden units gamma shape parameters r: to add layer T to a previously trained network with T 1 layers, we use BT iterations to jointly train Φ(T) and r together with {Φ(t)}1,T 1, prune the inactive factors of layer T, and continue the joint training with another CT iterations. We set the hyper-parameters as a0 = b0 = 0.01 and e0 = f0 = 1. Given the trained network, we apply the upward-downward Gibbs sampler to collect 500 MCMC samples after 500 burnins to estimate the posterior mean of the feature usage proportion vector θ(1) j /θ(1) j at the ﬁrst hidden layer, for every document in both the training and testing sets.

4.1.1 Feature Learning for Binary Classification

We consider the 20newsgroups data set that consists of 18,774 documents from 20 different news groups, with a vocabulary of size K0 = 61,188. It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. We ﬁrst consider two binary classiﬁcation tasks that distinguish between the comp.sys.ibm.pc.hardware and comp.sys.mac.hardware, and between the sci.electronics and sci.med news groups. For each binary classiﬁcation task, we remove a standard list of stop words and only consider the terms that appear at least ﬁve times, and report the classiﬁcation accuracies based on 12 independent random trials. With the upper bound of the ﬁrst layer s width set as K1 max {25, 50, 100, 200, 400, 600, 800}, and Bt = Ct = 1000 and η(t) = 0.01 for all t, we use Algorithm 1 to train a network with T {1, 2, . . . , 8} layers. Denote θj as the estimated K1 dimensional feature vector for document j, where K1 K1 max is the inferred number of active factors of the ﬁrst layer that is bounded by the pre-speciﬁed truncation level K1 max. We use the L2 regularized logistic regression provided by the LIBLINEAR package (Fan et al., 2008) to train a linear classiﬁer on θj in the training set and use it to classify θj in the test set, where the regularization parameter is ﬁve-folder cross-validated on the training set from (2 10, 2 9, . . . , 215).

As shown in Figure 9, modifying the PGBN from a single-layer shallow network to a multilayer deep one clearly improves the qualities of the unsupervisedly extracted feature vectors. In a random trial, with K1 max = 800, we infer a network structure of [K1, . . . , K8] = [512, 154, 75, 54, 47, 37, 34, 29] for the ﬁrst binary classiﬁcation task, and [K1, . . . , K8] = [491, 143, 74, 49, 36, 32, 28, 26] for the second one. Figures 9(c)-(d) also show that increasing the network depth in general improves the performance, but the ﬁrst-layer width clearly plays a critical role in controlling the ultimate network capacity. This insight is further illustrated below.

Augmentable Gamma Belief Networks

Number of layers T 1 2 3 4 5 6 7 8

Classification accuracy

(a) ibm.pc.hardware vs mac.hardware

Number of layers T 1 2 3 4 5 6 7 8

Classification accuracy

(b) sci.electronics vs sci.med

Number of layers T 2 4 6 8

Classification accuracy

86(c) ibm.pc.hardware vs mac.hardware

Number of layers T 2 4 6 8

Classification accuracy

95 (d) sci.electronics vs sci.med

K1max = 25 K1max = 50 K1max = 100 K1max = 200 K1max = 400 K1max = 600 K1max = 800

Figure 9: Classiﬁcation accuracy (%) as a function of the network depth T for two 20newsgroups binary classiﬁcation tasks, with η(t) = 0.01 for all layers. (a)-(b): the boxplots of the accuracies of 12 independent runs with K1 max = 800. (c)-(d): the average accuracies of these 12 runs for various K1 max and T. Note that K1 max = 800 is large enough to cover all active ﬁrst-layer topics (inferred to be around 500 for both binary classiﬁcation tasks), whereas all the ﬁrst-layer topics would be used if K1 max = 25, 50, 100, or 200.

4.1.2 Feature Learning for Multi-Class Classification

We test the PGBNs for multi-class classiﬁcation on 20newsgroups. After removing a standard list of stopwords and the terms that appear less than ﬁve times, we obtain a vocabulary with V = 33, 420. We set Ct = 500 and η(t) = 0.05 for all t; we set Bt = 1000 for all t if K1 max 400, and set B1 = 1000 and Bt = 500 for t 2 if K1 max > 400. We use all 11,269 training documents to infer a set of networks with Tmax {1, . . . , 5} and K1 max {50, 100, 200, 400, 600, 800}, and mimic the same testing procedure used for binary classiﬁcation to extract low-dimensional feature vectors, with which each testing document is classiﬁed to one of the 20 news groups using the L2 regularized logistic regression. Figure 10 shows a clear trend of improvement in classiﬁcation accuracy by increasing the network depth with a limited ﬁrst-layer width, or by increasing the upper bound of the width of the ﬁrst layer with the depth ﬁxed. For example, a single-layer PGBN with K1 max = 100 could add one or more layers to slightly outperform a single-layer PGBN with K1 max = 200, and a single-layer PGBN with K1 max = 200 could add layers to clearly outperform a single-layer PGBN with K1 max as large as 800. The proposed Gibbs sampler also exhibits several desirable computational properties. Each iteration of jointly training multiple layers usually only costs moderately more than that of training a single layer, e.g., with K1 max = 400, a training iteration on a single core of an Intel Xeon 2.7 GHz CPU takes about 5.6, 6.7, 7.1 seconds for the PGBN with 1, 3,

Zhou, Cong, and Chen

Number of layers T 1 2 3 4 5 6 7

Classification accuracy

K1max = 50 K1max = 100 K1max = 200 K1max = 400 K1max = 600 K1max = 800

100 200 300 400 500 600 700 800

Classification accuracy

T = 1 T = 2 T = 3 T = 4 T = 5

Figure 10: Classiﬁcation accuracy (%) of the PGBNs with Algorithm 1 for 20newsgroups multiclass classiﬁcation (a) as a function of the depth T with various K1 max and (b) as a function of K1 max with various depths, with η(t) = 0.05 for all layers. The widths of the hidden layers are automatically inferred. In a random trial, the inferred network widths [K1, . . . , K5] for K1 max = 50, 100, 200, 400, 600, and 800 are [50, 50, 50, 50, 50], [100, 99, 99, 94, 87], [200, 161, 130, 94, 63], [396, 109, 99, 82, 68], [528, 129, 109, 98, 91], and [608, 100, 99, 96, 89], respectively.

Number of layers T 1 2 3 4 5 6 7

Classification accuracy

K1max = 128

K1max = 256

K1max = 512

32 64 128 256 512

Classification accuracy

T=1 T=2 T=3 T=4 T=5

Figure 11: Analogous plots to Figure 10 with the vocabulary size restricted to be 2000, including the most frequent 2000 terms after removing a standard list of stopwords. The widths of the hidden layers are automatically inferred. In a random trial, the inferred network widths [K1, . . . , K5] for K1 max = 32, 64, 128, 256, and 512 are [32, 32, 32, 32, 32], [64, 64, 64, 59, 59], [128, 125, 118, 106, 87], [256, 224, 124, 83, 65], and [512, 187, 89, 78, 62], respectively.

and 5 layers, respectively. Since the per iteration cost increases approximately as a linear function of the inferred K1 and as a linear function of the size of the data set, given a ﬁxed computational budget, one may choose a moderate K1 max to allow adding a suﬃciently large number of hidden layers. In addition, the samplings of x(t) vkj, φ(t) k , and θ(t) kj in each layer can all be made embarrassingly parallel with blocked Gibbs sampling, and hence can potentially signiﬁcantly beneﬁt from implementing the algorithm using graphics processing units (GPUs) or other parallel computing architectures.

Examining the inferred network structure also reveals interesting details. For example, in a random trial with Algorithm 1, with η(t) = 0.05 for all t, the inferred network widths [K1, . . . , K5] for K1 max = 50, 100, 200, 400, 600, and 800 are [50, 50, 50, 50, 50], [100, 99, 99, 94, 87], [200, 161, 130, 94, 63], [396, 109, 99, 82, 68], [528, 129, 109, 98, 91], and [608,

Augmentable Gamma Belief Networks

100, 99, 96, 89], respectively. This indicates that for a network with an insuﬃcient budget on its ﬁrst-layer width, as the network depth increases, its inferred layer widths decay more slowly than a network with a suﬃcient or surplus budget on its ﬁrst-layer width; and a network with a surplus budget on its ﬁrst-layer width may only need relatively small widths for its higher hidden layers. In order to make comparison to related algorithms, we also consider restricting the vocabulary to the 2000 most frequent terms of the vocabulary after moving a standard list of stopwords. We repeat the same experiments with the same settings except that we set K1 max {32, 64, 128, 256, 512}, B1 = 1000, C1 = 500, and Bt = Ct = 500 for all t 2. We show the results in Figure 11. Again, we observe a clear trend of improvement by increasing the network depth with a limited ﬁrst-layer width, or by increasing the upper bound of the width of the ﬁrst layer with the depth ﬁxed. In a random trial with Algorithm 1, the inferred network widths [K1, . . . , K5] for K1 max = 32, 64, 128, 256, and 512 are [32, 32, 32, 32, 32], [64, 64, 64, 59, 59], [128, 125, 118, 106, 87], [256, 224, 124, 83, 65], and [512, 187, 89, 78, 62], respectively. For comparison, we ﬁrst consider the same L2 regularized logistic regression multiclass classiﬁer, trained either on the raw word counts or normalized term-frequencies of the 20newsgroups training documents using ﬁve-folder cross-validation. As summarized in Table 1 of Appendix C, when using the raw term-frequency word counts as covariates, the same classiﬁer achieves 69.8% (68.2%) accuracy on the 20newsgroups test documents if using the top 2000 terms that exclude (include) a standard list of stopwords, achieves 75.8% if using all the 61, 188 terms in the vocabulary, and achieves 78.0% if using the 33, 420 terms remained after removing a standard list of stopwords and the terms that appear less than ﬁve times; and when using the normalized term-frequencies as covariates, the corresponding accuracies are 70.8% (67.9%) if using the top 2000 terms excluding (including) stopwords, 77.6% with all the 61, 188 terms, and 79.4% with the 33, 420 selected terms. As summarized in Table 2 of Appendix C, for multi-class classiﬁcation on the same data set, with a vocabulary size of 2000 that consists of the 2000 most frequent terms after removing stopwords and stemming, the Doc NADE (Larochelle and Lauly, 2012) and the over-replicated softmax (Srivastava et al., 2013) provide the accuracies of 67.0% and 66.8%, respectively, for a feature dimension of K = 128, and provide the accuracies of 68.4% and 69.1%, respectively, for a feature dimension of K = 512. As shown in Figure 11 and summarized in Table 3 of Appendix C, with the same vocabulary size of 2000 (but diﬀerent terms due to diﬀerent preprocessing), the proposed PGBN provides 65.9% (67.5%) with T = 1 (T = 5) for K1 max = 128, and 65.9% (69.2%) with T = 1 (T = 5) for K1 max = 512, which may be further improved if we also consider the stemming step, as done in these two algorithms, for word preprocessing, or if we set the values of η(t) to be smaller than 0.05 to encourage a more complex network structure. We also summarize in Table 3 the classiﬁcation accuracies shown in Figure 10 for the PGBNs with V = 33, 420. Note that the accuracies in Tables 2 and 3 are provided to show that the PGBNs are in the same ballpark as both the Doc NADE (Larochelle and Lauly, 2012) and over-replicated softmax (Srivastava et al., 2013). Note these results are not intended to provide a head-to-head comparison, which is possible if the same data preprocessing and classiﬁer were used and the error bars were shown in Srivastava et al. (2013), or we could obtain the code to replicate the experiments using the same preprocessed data and classiﬁer.

Zhou, Cong, and Chen

Note that 79.4% achieved using the 33, 420 selected features is the best accuracy reported in the paper, which is unsurprising since all the unsupervisedly extracted latent feature vectors have much lower dimensions and are not optimized for classiﬁcation (Zhu et al., 2012; Zhou et al., 2015b). In comparison to using appropriately preprocessed highdimensional features, our experiments show that while text classiﬁcation performance often clearly deteriorates if one trains a multi-class classiﬁer on the lower-dimensional features extracted using shallow unsupervised latent feature models, one could obtain much improved results using appropriate deep generalizations. For further improvement, one may consider adding an extra supervised component into the model, which is shown to boost the classiﬁcation performance for latent Dirichlet allocation (Blei and Mcauliﬀe, 2008; Zhu et al., 2012), a shallow latent feature model related to the PGBN with a single hidden layer.

4.1.3 Perplexities for Heldout Words

In addition to examining the performance of the PGBN for unsupervised feature learning, we also consider a more direct approach that we randomly choose 30% of the word tokens in each document as training, and use the remaining ones to calculate per-heldout-word perplexity. We consider both all the 18,774 documents of the 20newsgroups corpus, limiting the vocabulary to the 2000 most frequent terms after removing a standard list of stopwords, and the NIPS12 (http://www.cs.nyu.edu/ roweis/data.html) corpus whose stopwords have already been removed, limiting the vocabulary to the 2000 most frequent terms. We set η(t) = 0.05 and Ct = 500 for all t, set B1 = 1000 and Bt = 500 for t 2, and consider ﬁve random trials. Among the Bt+Ct Gibbs sampling iterations used to train layer t, we collect one sample per ﬁve iterations during the last 500 iterations, for each of which we draw the topics {φ(1) k }k and topics weights θ(1) j , to compute the per-heldout-word perplexity using Equation (34) of Zhou and Carin (2015). This evaluation method is similar to those used in Newman et al. (2009), Wallach et al. (2009), and Paisley et al. (2012).

As shown in both Figures 12 and 13, we observe a clear trend of improvement by increasing both K1 max and T. We have also examined the topics and network structure learned on the NIPS12 corpus. Similar to the exploratory data analysis performed on the 20newsgroups corpus, as described in detail in Section 2.3, the inferred deep networks also allow us to extract trees and subnetworks to visualize various aspects of the NIPS12 corpus from general to speciﬁc and reveal how they are related to each other. We omit these details for brevity and instead provide a brief description: with K1 max = 200 and T = 5, the PGBN infers a network with [K1, . . . , K5] = [200, 164, 106, 60, 42] in one of the ﬁve random trials. The ranks, according to the weights r(t) k calculated in (14), and the top ﬁve words of three example topics for layer T = 5 are 6 network units input learning training, 15 data model learning set image, and 34 network learning model input neural; while these of ﬁve example topics of layer T = 1 are 19 likelihood em mixture parameters data, 37 bayesian posterior prior log evidence, 62 variables belief networks conditional inference, 126 boltzmann binary machine energy hinton, and 127 speech speaker acoustic vowel phonetic. It is clear that the topics of the bottom hidden layers are very speciﬁc whereas these of the top hidden layer are quite general.

Augmentable Gamma Belief Networks

25 100 200 400 600 800

25 100 200 400 600 800

Relative perplexity

T = 1 T = 2 T = 3 T = 4 T = 5

Figure 12: (a) per-heldout-word perplexity (the lower the better) for the NIPS12 corpus (using the 2000 most frequent terms) as a function of the upper bound of the ﬁrst layer width K1 max and network depth T, with 30% of the word tokens in each document used for training and η(t) = 0.05 for all t. (b) for visualization, each curve in (a) is reproduced by subtracting its values from the average perplexity of the single-layer network. In a random trial, the inferred network widths [K1, . . . , K5] for K1 max = 25, 50, 100, 200, 400, 600, and 800 are [25, 25, 25, 25, 25], [50, 50, 50, 49, 42], [100, 99, 93, 78, 54], [200, 164, 106, 60, 42], [400, 130, 83, 52, 39], [596, 71, 68, 58, 37], and [755, 57, 53, 46, 42], respectively.

25 100 200 400 600 800

T = 1 T = 2 T = 3 T = 4 T = 5

25 100 200 400 600 800

Relative perplexity

Figure 13: Analogous plots to Figure 12 for the 20newsgroups corpus (using the 2000 most frequent terms after removing a standard list of stopwords). In a random trial, the inferred network widths [K1, . . . , K5] for K1 max = 25, 50, 100, 200, 400, 600, and 800 are [25, 25, 25, 25, 25], [50, 50, 50, 50, 50], [100, 99, 99, 97, 97], [200, 194, 177, 152, 123], [398, 199, 140, 116, 105], [557, 156, 133, 118, 103], and [701, 119, 116, 112, 103], respectively.

4.1.4 Generating Synthetic Documents

We have also tried drawing θ(T) j Gam r, 1/c(T+1) j and downward passing it through a T-layer network trained on a text corpus to generate synthetic bag-of-words documents, which are found to be quite interpretable and reﬂect various general aspects of the corpus used to train the network. We consider the PGBN with [K1, . . . , K5] = [608, 100, 99, 96, 89], which is trained on the training set of the 20newsgroups corpus with K1 max = 800 and η(t) = 0.05 for all t. We set c(t) j as the median of the inferred {c(t) j }j of the training doc-

Zhou, Cong, and Chen

Number of layers T 1 2 3 4 5 6 7

Classification accuracy

K1max = 100

K1max = 200

K1max = 400

K1max = 600

K1max = 800

K1max 100 200 300 400 500 600 700 800

Classification accuracy

T=1 T=2 T=3 T=4 T=5

Figure 14: Analogous plots to Figure 10 for the Ber Po-GBNs on the binarized 20newsgroups term-document count matrix. The widths of the hidden layers are automatically inferred. In a random trial with Algorithm 2, the inferred network widths [K1, . . . , K5] for K1 max = 50, 100, 200, 400, 600, and 800 are [50, 50, 50, 50, 50], [100, 97, 95, 90, 82], [178, 145, 122, 97, 72], [184, 139, 119, 101, 75], [172, 165, 158, 138, 110], and [156, 151, 147, 134, 117], respectively.

uments for all t. Given {Φ(t)}1,T and r, we ﬁrst generate θ(T) j Gam r, 1 c(T+1) j and then downward pass it through the network by drawing nonnegative real random variables, one layer after another, from the gamma distributions as in (1). With the simulated θ(1) j , we calculate the Poisson rates for all the V words using Φ(1)θ(1) j and display the top 100 words ranked by their Poisson rates. As shown in the text ﬁle available at http://mingyuanzhou.github.io/Results/GBN-BOW.txt, the synthetic documents generated in this manner are all easy to interpret and reﬂect various general aspects of the 20newsgroups corpus on which the PGBN is trained.

4.2 Multilayer Representation for Binary Data

We apply the Ber Po-GBN to extract multilayer representations for high-dimensional sparse binary vectors. The Ber Po link is proposed in Zhou (2015) to construct edge partition models for network analysis, whose computation is mainly spent on pairs of linked nodes and hence is scalable to big sparse relational networks. That link function and its inference procedure have also been adopted by Hu et al. (2015) to analyze big sparse binary tensors. We consider the same problem of feature learning for multi-class classiﬁcation studied in detail in Section 4.1.2. We consider the same setting except that the original termdocument word count matrix is now binarized into a term-document indicator matrix, the (v, j) element of which is set as one if and only if nvj 1 and set as zero otherwise. We test the Ber Po-GBNs on the 20newsgroups corpus, with η(t) = 0.05 for all layers. As shown in Figure 14, given the same upper-bound on the width of the ﬁrst layer, increasing the depth of the network clearly improves the performance. Whereas given the same number of hidden layers, the performance initially improves and then ﬂuctuates as the upper-bound of the ﬁrst layer increases. Such kind of ﬂuctuations when K1 max reaches over 200 are expected, since the width of the ﬁrst layer is inferred to be less than 190 and hence the budget as small as K1 max = 200 is already large enough to cover all active factors.

Augmentable Gamma Belief Networks

Number of layers T 1 2 3 4 5 6 7

Classification accuracy

K1max = 25 K1max = 50

K1max = 100 K1max = 200

K1max = 400

25 50 100 200 400

Classification accuracy

T=1 T=2 T=3 T=4 T=5

Figure 15: Analogous plots to Figure 10 for the PRG-GBNs on the MNIST data set. In a random trial with Algorithm 2, the inferred network widths [K1, . . . , K5] for K1 max=50, 100, 200, and 400 are [50, 50, 50, 50, 50], [100, 100, 100, 100, 100], [200, 200, 200, 200, 200], and [400, 400, 399, 385, 321], respectively.

4.3 Multilayer Representation for Nonnegative Real Data

We use the PRG-GBN to unsupervisedly extract features from nonnegative real data. We consider the MNIST data set (http://yann.lecun.com/exdb/mnist/), which consists of 60000 training handwritten digits and 10000 testing ones. We divide the gray-scale pixel values of each 28 28 image by 255 and represent each image as a 784 dimensional nonnegative real vector. We set η(1) = 0.05 and use all training digits to infer the PRG-GBNs with Tmax {1, , 5} and K1max {50, 100, 200, 400}. We consider the same problem of feature extraction for multi-class classiﬁcation studied in detail in Section 4.1.2, and we follow the same experimental settings over there. As shown in Figure 15, both increasing the width of the ﬁrst layer and the depth of the network could clearly improve the performance in terms of unsupervisedly extracting features that are better suited for multi-class classiﬁcation. Note that the PRG distribution might not be the best distribution to ﬁt MNIST digits, but nevertheless, displaying the inferred features at various layers as images provides a straightforward way to visualize the latent structures inferred from the data and hence provides an excellent example to understand the properties and working mechanisms of the GBN. We display the projections to the ﬁrst layer of the factors Φ(t) at all ﬁve hidden layers as images for K1 max = 100 and K1 max = 400 in Figures 22 and 23, respectively, which clearly show that the inferred latent factors become increasingly more general as the layer increases. In both Figures 22 and 23, the latent factors inferred at the ﬁrst hidden layer represent ﬁlters that are only active at very particular regions of the images, those inferred at the second hidden layer represent larger parts of the hidden-written digits, and those inferred at the third and deeper layers resemble the whole digits. To visualize the relationships between the factors of diﬀerent layers, we show in Figure 24 in Appendix C a subset of nodes of each layer and the nodes of the layer below that are connected to them with non-negligible weights. It is interesting to note that unlike Lee et al. (2009) and many other following works that rely on the convolutional and pooling operations, which are pioneered by Le Cun et al. (1989), to extract hierarchical representation for images at diﬀerent spatial scales, we show that the proposed algorithm, while not breaking the images into spatial patches, is already able to learn the factors that are active on very speciﬁc spatial regions of the image in

Zhou, Cong, and Chen

the bottom hidden layer, and learn these increasingly more general factors covering larger spatial regions of the images as the number of layer increases. However, due to the lack of the ability to discover spatially localized features that can be shared at multiple diﬀerent spatial regions, our algorithm does not at all exploit the redundancies of the spatially localized features inside a single image and hence may require much more data to train. Therefore, it would be interesting to investigate whether one can introduce convolutional and pooling operations into the GBNs, which may substantially improve their performance on modeling natural images.

5. Conclusions

The augmentable gamma belief network (GBN) is proposed to extract a multilayer representation for high-dimensional count, binary, or nonnegative real vectors, with an eﬃcient upward-downward Gibbs sampler to jointly train all its layers and a layer-wise training strategy to automatically infer the network structure. A GBN of T layers can be broken into T subproblems that are solved by repeating the same subroutine, with the computation mainly spent on training the ﬁrst hidden layer. When used for deep topic modeling, the GBN extracts very speciﬁc topics at the ﬁrst hidden layer and increasingly more general topics at deeper hidden layers. It provides an excellent way for exploratory data analysis through the visualization of the inferred deep network, whose hidden units of adjacent layers are sparsely connected. Its good performance is further demonstrated in unsupervisedly extracting features for document classiﬁcation and predicting heldout word tokens. The extracted deep network can also be used to simulate very interpretable synthetic documents, which reﬂect various general aspects of the corpus that the network is trained on. When applied for image analysis, without using the convolutional and pooling operations, the GBN is already able to extract interpretable factors in the ﬁrst hidden layer that are active in very speciﬁc spatial regions and interpretable factors in deeper hidden layers with increasingly more general spatial patterns covering larger spatial regions. For big data problems, in practice one may rarely have a suﬃcient budget to allow the ﬁrst-layer width to grow without bound, thus it is natural to consider a deep network that can use a multilayer deep representation to better allocate its resource and increase its representation power with limited computational power. Our algorithm provides a natural solution to achieve a good compromise between the width of each layer and the depth of the network.

Acknowledgments

The authors would like to thank the editor and two anonymous referees for their insightful and constructive comments and suggestions, which have helped us improve the paper substantially. M. Zhou thanks Texas Advanced Computing Center for computational support. B. Chen thanks the support of the Thousand Young Talent Program of China, NSFC (61372132), NCET-13-0945, and NDPR-9140A07010115DZ01015.

Augmentable Gamma Belief Networks

0 5 10 15 20 0

6=1 , c=0.5 6=2 , c=0.5 6=3 , c=0.5 6=5 , c=1 6=9 , c=2 6=7.5 , c=1 6=0.5 , c=1

1 2 3 4 5 6 7 8 9 10 0

,=0.1 ,=1 ,=5 ,=10 ,=15 ,=20

Figure 16: Left: probability distribution functions for the Poisson randomized gamma (PRG) distribution x PRG(λ, c), where the sum of the probability mass at x = 0 and the area under the probability density function curve for x > 0 is equal to one; Right: probability mass functions for the truncated Bessel distribution n Bessel 1(α), where n {1, 2, . . .}.

Appendix A. Randomized Gamma and Bessel Distributions

Related to our work, Yuan and Kalbﬂeisch (2000) proposed the randomized gamma distribution to generate a random positive real number as

x | n, ν Gam(n + ν + 1, 1/c), n Pois(λ),

where ν > 1 and c > 0. As in Yuan and Kalbﬂeisch (2000), the conditional posterior of n can be expressed as (n | x, ν, α) Besselν(2

where we denote n Besselν(α) as the Bessel distribution with parameters ν > 1 and α > 0, with PMF

Besselν(n; α) =

Iν(α)n!Γ(n + ν + 1), n {0, 1, 2, . . .}.

Algorithms to draw Bessel random variables can be found in Devroye (2002). The proposed PRG is diﬀerent from the randomized gamma distribution of Yuan and Kalbﬂeisch (2000) in that it models both positive real numbers and exact zeros, and the proposed truncated Bessel distribution n Bessel 1(α) is diﬀerent from the Bessel distribution n Besselν(α), where ν > 1, in that it is deﬁned only on positive integers. For illustration, we show in Figure 16 the probability distribution functions of both the PRG and truncated Bessel distributions under a variety of parameter settings.

Appendix B. Upward-Downward Gibbs Sampling

Below we ﬁrst discuss Gibbs sampling for count data and then generalize it for both binary and nonnegative real data.

Zhou, Cong, and Chen

B.1 Inference for the PGBN

With Lemma 1 and Corollary 2 and the width of the ﬁrst layer being bounded by K1 max, we ﬁrst consider multivariate count observations and develop an upward-downward Gibbs sampler for the PGBN, each iteration of which proceeds as follows. Sample x(t) vjk. We can sample x(t) vjk for all layers using (25). But for the ﬁrst hidden

layer, we may treat each observed count x(1) vj as a sequence of word tokens at the vth term (in

a vocabulary of size V := K0) in the jth document, and assign the x(1) j words {vji}i=1,x(1) j one after another to the latent factors (topics), with both the topics Φ(1) and topic weights θ(1) j marginalized out, as

P(zji = k | ) η(1) + x(1) ji

V η(1) + x(1) ji k

jk + φ(2) k: θ(2) j , k {1, . . . , K1 max}, (28)

where zji is the topic index for vji and x(1) vjk := P i δ(vji = v, zji = k) counts the number of times that term v appears in document j; we use x ji to denote the count x calculated without considering word i in document j. The collapsed Gibbs sampling update equation shown above is related to the one developed in (Griﬃths and Steyvers, 2004) for latent Dirichlet allocation, and the one developed in (Zhou, 2014) for PFA using the beta-negative binomial process. When T = 1, we would replace the terms φ(2) k: θ(2) j with rk for PFA built on the gamma-negative binomial process (Zhou and Carin, 2015) (or with απk for hierarchical Dirichlet process latent Dirichlet allocation, see (Teh et al., 2006) and (Zhou, 2014) for details), and add an additional term to account for the possibility of creating an additional factor (Zhou, 2014). For simplicity, in this paper, we truncate the nonparametric Bayesian model with K1 max factors and let rk Gam(γ0/K1 max, 1/c0) if T = 1. Note that although we use collapsed Gibbs sampling inference in this paper, if one desires embarrassingly parallel inference and possibly lower computation, then one may consider explicitly sampling {φ(1) k }k and {θ(1) j }j and sampling x(1) vjk with (25).

Sample φ(t) k . Given these latent counts, we sample the factors/topics φ(t) k as

(φ(t) k | ) Dir η(t) 1 + x(t) 1 k, . . . , η(t) Kt 1 + x(t) Kt 1 k . (29)

Sample x(t+1) vj . We sample x(t+1) j using (26), where we replace the term φ(T+1) v: θ(T+1) j with rv.

Sample r. Both γ0 and c0 are sampled using related equations in (Zhou and Carin, 2015), omitted here for brevity. We sample r as

(rv | ) Gam γ0/KT + x(T+1) v , h c0 P j ln 1 p(T+1) j i 1 . (30)

Sample θ(t) j . Using (22) and the gamma-Poisson conjugacy, we sample θj as

Augmentable Gamma Belief Networks

(θ(T) j | ) Gam r + m(T)(T+1) j , h c(T+1) j ln 1 p(T) j i 1 ,

(θ(t) j | ) Gam Φ(t+1)θ(t+1) j + m(t)(t+1) j , h c(t+1) j ln 1 p(t) j i 1 ,

(θ(1) j | ) Gam Φ(2)θ(2) j + m(1)(2) j , h c(2) j ln 1 p(1) j i 1 , (31)

Sample c(t) j . With θ(t) j := PKt k=1 θ(t) kj for t T and θ(T+1) j := r , we sample p(2) j and

{c(t) j }t 3 as

(p(2) j | ) Beta a0+m(1)(2) j , b0+θ(2) j , (c(t) j | ) Gam e0+θ(t) j , h f0+θ(t 1) j i 1 , (32)

and calculate c(2) j and {p(t) j }t 3 with (21).

B.2 Handling Binary and Nonnegative Real Observations

For binary observations that are linked to the latent counts at layer one as b(1) vj = 1(x(1) vj 1), we ﬁrst sample the latent counts at layer one from the truncated Poisson distribution as

x(1) vj | b(1) vj Pois+

k=1 φ(1) vk θ(1) kj

and then sample x(t) vjk for all layers using (25).

For nonnegative real observations y(1) vj that are linked to the latent counts at layer one as

y(1) vj Gam(x(1) vj , 1/aj),

we let x(1) vj = 0 if y(1) vj = 0 and sample x(1) vj from the truncated Bessel distribution as

x(1) vj | Bessel 1

v u u tajy(1) vj

k=1 φ(1) vk θ(1) kj

if y(1) vj > 0. We let aj Gam(e0, 1/f0) in the prior and sample aj as

(aj | ) Gam

v x(1) vj , 1

f0 + P v y(1) vj

We then sample x(t) vjk for all layers using (25).

Appendix C. Additional Tables and Figures

Zhou, Cong, and Chen

Algorithm 1 The PGBN upward-downward Gibbs sampler that uses a layer-wise training strategy to train a set of networks, each of which adds an additional hidden layer on top of the previously inferred network, retrains all its layers jointly, and prunes inactive factors from the last layer. Inputs: observed counts {xvj}v,j, upper bound of the width of the ﬁrst layer K1 max, upper bound of the number of layers Tmax, number of iterations {BT , ST }1,Tmax, and hyper-parameters. Outputs: A total of Tmax jointly trained PGBNs with depths T = 1, T = 2, . . ., and T = Tmax.

1: for T = 1, 2, . . . , Tmax do Jointly train all the T layers of the network 2: Set KT 1, the inferred width of layer T 1, as KT max, the upper bound of layer T s width. 3: for iter = 1 : BT + CT do Upward-downward Gibbs sampling 4: Sample {zji}j,i using collapsed inference; Calculate {x(1) vjk}v,k,j; Sample {x(2) vj }v,j ; 5: for t = 2, 3, . . . , T do 6: Sample {x(t) vjk}v,j,k ; Sample {φ(t) k }k ; Sample {x(t+1) vj }v,j ; 7: end for 8: Sample p(2) j and Calculate c(2) j ; Sample {c(t) j }j,t and Calculate {p(t) j }j,t for t = 3, . . . , T +1; 9: for t = T, T 1, . . . , 2 do 10: Sample r if t = T; Sample {θ(t) j }j ; 11: end for 12: if iter = BT then 13: Prune layer T s inactive factors {φ(T ) k }k:x(T ) k =0;

14: let KT = P

k δ(x(T ) k > 0) and update r; 15: end if 16: end for 17: Output the posterior means (according to the last MCMC sample) of all remaining factors {φ(t) k }k,t as the inferred network of T layers, and {rk}KT k=1 as the gamma shape parameters of layer T s hidden units. 18: end for

Algorithm 2 The upward-downward Gibbs samplers for the Ber-GBN and PRG-GBN are constructed by using Lines 1-8 shown below to substitute Lines 4-11 of the PGBN Gibbs sampler shown in Algorithm 1.

1: Sample {x(1) vj }v,j using (33) for binary observations; Sample {x(1) vj }v,j using (34) and sample aj using (35) for nonnegative real observations; 2: for t = 1, 2, . . . , T do 3: Sample {x(t) vjk}v,j,k ; Sample {φ(t) k }k ; Sample {x(t+1) vj }v,j ; 4: end for 5: Sample p(2) j and Calculate c(2) j ; Sample {c(t) j }j,t and Calculate {p(t) j }j,t for t = 3, . . . , T + 1; 6: for t = T, T 1, . . . , 1 do 7: Sample r if t = T; Sample {θ(t) j }j ; 8: end for

Augmentable Gamma Belief Networks

V = 61, 188 V = 61, 188 V = 33, 420 V = 33, 420 with stopwords with stopwords remove stopwords remove stopwords with rare words with rare words remove rare words remove rare words raw word counts term frequencies raw word counts term frequencies

75.8% 77.6% 78.0% 79.4%

V = 2000 V = 2000 V = 2000 V = 2000 with stopwords with stopwords remove stopwords remove stopwords raw counts term frequencies raw counts term frequencies

68.2% 67.9% 69.8% 70.8%

Table 1: Multi-class classiﬁcation accuracies of L2 regularized logistic regression.

V = 2000, K = 128 V = 2000, K = 512 remove stopwords, stemming remove stopwords, stemming

Doc NADE 67.0% 68.4% Over-replicated softmax 66.8% 69.1%

Table 2: Multi-class classiﬁcation accuracies of the Doc NADE (Larochelle and Lauly, 2012) and over-replicated softmax (Srivastava et al., 2013).

V = 2000, K1 max = 128 V = 2000, K1 max = 256 V = 2000, K1 max = 512 remove stopwords remove stopwords remove stopwords

PGBN (T = 1) 65.9% 0.4% 66.3% 0.4% 65.9% 0.4% PGBN (T = 2) 67.1% 0.5% 67.9% 0.4% 68.3% 0.3% PGBN (T = 3) 67.3% 0.3% 68.6% 0.5% 69.0% 0.4% PGBN (T = 5) 67.5% 0.4% 68.8% 0.3% 69.2% 0.4%

V = 33, 420, K1 max = 200 V = 33, 420, K1 max = 400 V = 33, 420, K1 max = 800 remove stopwords remove stopwords remove stopwords remove rare words remove rare words remove rare words

PGBN (T = 1) 74.6% 0.6% 75.3% 0.6% 75.4% 0.4% PGBN (T = 2) 76.0% 0.6% 76.9% 0.5% 77.5% 0.4% PGBN (T = 3) 76.3% 0.8% 77.1% 0.6% 77.8% 0.4% PGBN (T = 5) 76.4% 0.5% 77.4% 0.6% 77.9% 0.3%

Table 3: Multi-class classiﬁcation accuracies of the PGBN trained with ηt = 0.05 for all t.

Zhou, Cong, and Chen

3 car bike new cars dod engine got thanks

price road ll miles

2 car bike new cars dod engine got road

miles ride said ll

27 new sale thanks shipping

mail offer price condition email interested asking best

4 car cars new engine miles dealer price drive

year road buy speed

9 bike dod ride bikes riding motorcycle got sun

left new said ll

22 new sale shipping thanks

mail offer price condition email interested asking best

3 car cars new engine miles dealer price drive

year road buy speed

8 bike dod ride bikes riding motorcycle got sun

left new said ll

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

fax advance

hi info information

7 new information

national research program

year april center washington

cars engine

miles dealer

new price drive road speed

61 callison

car ecn lights uokmax

fake bird continental

car drive manual

cars fuel automatic

auto lane shift

dod ride bikes riding motorcycle

bnr ama honda

went didn came

told saw home

left started

says took apartment

ride motorcycle

shaft rider motorcycles

rec denizens

cb pack bike

100 ryan cousineau

dog nmm stafford compdyn

bike winona springer vancouver

shoei mavenry

size head maven

altcit paint passenger

25 new sale shipping thanks

mail offer price condition email interested asking sell

new shipping

offer condition

price asking

cd interested

ac ed picture

dcs sleeve liverpool

tony simon oxford warwick

wire wiring neutral outlets

circuit cable electrical

wires outlet connected

112 package

hotel college

kids hirama

job kou minimum

11 car bike new cars

dod engine got road

said ll miles ride

Figure 17: Analogous plots to Figure 6 for a subnetwork on car & bike , consisting of two trees rooted at nodes 3 and 11, respectively, of layer one.

15 israel jews israeli armenian

turkish new state government

said armenians years law

12 israel jews israeli armenian

turkish government new said

state armenians years law

18 israel jews israeli arab

jewish state arabs peace

land policy new years

26 law new state government

said president mr rights states information public national

30 turkish armenian armenians turks armenia turkey government genocide

soviet said today new

18 israel jews israeli arab

jewish arabs state peace

land policy center new

28 law state government new

said president mr rights states information public national

32 turkish armenian armenians turks armenia turkey genocide government

soviet said today new

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

7 new information

national research program

year april center washington

14 law government

state court

laws states public

civil legal federal

israel israeli jewish

land center policy

anti research palestine

went didn came

told saw home

left started

says took apartment

israeli lebanese

arab israelis civilians

killed palestinian

4 god jesus christian

bible christ christians

life faith believe

man christianity

47 president

mr stephanopoulos

myers secretary

work working

child copy women children protection pregnancy depression abstinence

sexual education

iran jury farid energy bombing

iraq military

iraqi bomb civilians

105 adl bullock

san francisco

police arens information

anti league

south secret nuclear

naval rockefeller

island georgia

new ships navy

34 turkish armenian armenians

turks armenia

turkey genocide

soviet today russian government

98 bosnia arromdee

jhu serbs somalia

ken jyusenkyou

sides turkey ethnic intervention

greece turkish greeks cyprus minority

uv turks turkey cypriot ethnic cypriots

book urartu yalanci published

page pages forged

argic serdar quotes

23 israel jews israeli arab

jewish state new arabs

peace land years true

24 israel jews israeli arab

jewish state arabs peace

land new years policy

Figure 18: Analogous plot to Figure 6 for a subnetwork on Middle East, consisting of two trees rooted at nodes 15 and 23, respectively, of layer one.

Augmentable Gamma Belief Networks

16 gun koresh fbi believe batf said government children

guns new gas compound

10 gun koresh fbi believe batf said government guns children gas compound stratus

13 gun guns ﬁrearms crime

control law weapons state

new self government said

21 koresh fbi batf believe gas compound children stratus

waco said atf government

14 gun guns ﬁrearms crime

control law weapons state new self government public

22 koresh fbi batf believe gas compound children stratus

waco said atf government

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

7 new information

national research program

year april center washington

14 law government

state court

laws states public

civil legal federal

guns ﬁrearms

crime control weapons

self criminals

defense handgun

told saw home

left started

says took apartment

76 militia amendment

arms bear second constitution

regulated government

shall state ulowell organized

dan nuclear

viking sorenson destruction

weapon homosexuality

isu roehm unusual

4 god jesus christian

bible christ christians

life faith believe

man christianity

fbi batf gas stratus compound

waco children

atf cdt believe government

fbi udel handheld

jim jmd chopin clinton

scott children

idbsu affair

uiuc cso wpi stove brent electric

uxh ccwf stratus

18 gun guns ﬁrearms crime

law control new state weapons government said believe

18 gun guns ﬁrearms crime

control law weapons state new government self said

Figure 19: Analogous plot to Figure 6 for a subnetwork related to gun, consisting of two trees rooted at nodes 16 and 18, respectively, of layer one.

7 team game hockey season

games nhl year play new league players teams

6 team game hockey season

games nhl year play league new players teams

6 team game hockey season

games nhl year play league new players teams

4 team game hockey season

games nhl year play league new players teams

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

fax advance

hi info information

7 new information

national research program

year april center washington

team baseball

runs games season players

hit win league pitching

game hockey season

nhl year league

teams players

went didn came

told saw home

left started

says took apartment

pp puck goal ﬂyers shots

pts scorer second

57 roger maynard

gainey laurentian

uwo best gilmour business

ﬁ ﬁnland germany

canada players

ﬁnnish german

april swedish

78 gld columbia

dare gary bitnet keenan

je souviens

cunixc selanne

mike hrivnak

gtd pts andrew

city tulsa votes team

cmu hammerl

acsu clement ferguson rochester

valerie ubvms lemieux

det chi cal tor vs van

pit stl mon que

13 year game team baseball

games runs season players

hit win league years

7 year game team baseball runs games season players

hit win league years

7 year game team baseball runs games season players

hit win league years

7 year game team baseball runs games season players

hit win league years

lost san new kaldis rutgers

york houston

dave fame smith eddie murray winﬁeld kingman

yount steve guys

26 team game year season games hockey players play

league new win baseball

Figure 20: Analogous plot to Figure 6 for a subnetwork on ice hockey and baseball, consisting of three trees rooted at nodes 7, 13, and 26, respectively, of layer one.

Zhou, Cong, and Chen

14 disease doctor pain new help treatment medical years said problem thanks patients

12 disease doctor pain new help treatment medical years said problem patients thanks

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

fax advance

hi info information

7 new information

national research program

year april center washington

19 god science evidence

exist atheists existence

true atheist religion

pain treatment

medical patients medicine

help skin blood problems

went didn came

told saw home

left started

says took apartment

food chinese

taste reaction restaurant

mot ﬂavor dougb

yeast noring

steve symptoms

jon vitamin

dyer body quack patients

dyer spdcc glutamate

food effects

brain studies humans

blood olney foods

cooling nuclear

plants steam towers

hot air temperature

hiv medical

cancer number

april tobacco

hicnet volume newsletter

nye nyeda absurd

clinic bertrand absurdities

eau claire philosopher

gr medicine

roth rose alternative

rr uclink diseases berkeley

180 acidophilous

thrush outbreaks

taking astemizole

gain linked resumed hismanal mccurdy

sdsu ucsvax

190 zisfein

slmr headache

gb assignment

12 government key encryption clipper

chip nsa phone public keys secure crypto escrow

13 government encryption clipper key

chip nsa phone public secure crypto keys netcom

31 key chip bit keys number clipper serial encrypted

des escrow bits algorithm

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

7 new information

national research program

year april center washington

14 law government

state court

laws states public

civil legal federal

16 clipper government

key encryption

nsa phone crypto secure

keys netcom

37 encryption

law government

technology enforcement

chip clipper privacy

key administration

new protect

des rsa public

pgp security cryptography

crypt message

76 militia amendment

arms bear second constitution

regulated government

shall state ulowell organized

craig douglas

rob protect cmhnet

acns bullets

rwing clinton

wa administration

zinc inﬁnite

115 gtoal compression

speech graham

uk rate nren samples modems

6 available

ﬁle sun program

mit information

bit keys number

serial encrypted

bits clipper escrow algorithm

30 turkish armenian armenians turks armenia turkey government genocide

soviet said today new

32 turkish armenian armenians turks armenia turkey genocide government

soviet said today new

new problem

work probably

let said years

ll long question

netcom andrew

new uucp internet

mark steve opinions

7 new information

national research program

year april center washington

14 law government

state court

laws states public

civil legal federal

told saw home

left started

says took apartment

israeli lebanese

arab israelis civilians

killed palestinian

34 turkish armenian armenians

turks armenia

turkey genocide

soviet today russian government

98 bosnia arromdee

jhu serbs somalia

ken jyusenkyou

sides turkey ethnic intervention

greece turkish greeks cyprus minority

uv turks turkey cypriot ethnic cypriots

bm armenia armenians

azeris cyprus turkey karabakh

kpc turks conﬂict

onur yalcin armenians

henrik announced

oy homeland azerbadjan

book urartu yalanci published

page pages forged

argic serdar quotes

hitler nazis chancellor

dodd translation

persian problems

shirak problem persians

orbeli prince famous ancient

troops fourth

251 istanbul

ermeni ankara

turk ofﬁce hakkinda mezalimi ermeniler

259 kk translation

kubilay kultigin

suat politics strongest

carleton fatherland

Figure 21: Analogous plots to Figure 6, with τt = 1 to reveal more weak links. Top: the tree rooted at node 14 of layer three on medicine. Middle: the tree rooted at node 12 of layer three on encryption. Bottom: the tree rooted at node 30 of layer three on Turkey & Armenia.

Augmentable Gamma Belief Networks

Figure 22: Visualization of the inferred {Φ(1), , Φ(T )} on the MNIST data set using the PRGGBN with K1max = 100 and η(t) = 0.05 for all t. The latent factors of all layers are projected to the ﬁrst layer: (a) Φ(1), (b) Φ(1)Φ(2), (c) Φ(1)Φ(2)Φ(3), (d) Φ(1)Φ(2)Φ(3)Φ(4), and (e) Φ(1)Φ(2)Φ(3)Φ(4)Φ(5).

Figure 23: Visualization of the inferred {Φ(1), , Φ(T )} on the MNIST data set using the PRGGBN with K1max = 400 and η(t) = 0.05 for all t. The latent factors of all layers are projected to the ﬁrst layer: (a) Φ(1), (b) Φ(1)Φ(2), (c) Φ(1)Φ(2)Φ(3), (d) Φ(1)Φ(2)Φ(3)Φ(4), and (e) Φ(1)Φ(2)Φ(3)Φ(4)Φ(5).

Zhou, Cong, and Chen

Figure 24: Visualization of the network structures inferred by the PRG-GBN on the MNIST data set with K1max = 400. (a) Visualization of the factors (φ(5) 1 , φ(5) 11 , φ(5) 21 , . . . , φ(5) 111) of layer ﬁve and those of layer four that are strongly connected to them. (b) Visualization of the factors (φ(4) 1 , φ(4) 6 , φ(4) 11 , . . . , φ(4) 106) of layer four and those of layer three that are strongly connected to them. (c)the Visualization of the factors (φ(3) 1 , φ(3) 6 , φ(3) 11 , . . . , φ(3) 146) of layer three and those of layer two that are strongly connected to them. (d) Visualization of the factors (φ(2) 1 , φ(2) 6 , φ(2) 11 , . . . , φ(2) 146) of layer two and those of layer one that are strongly connected to them.

Augmentable Gamma Belief Networks

A. Acharya, J. Ghosh, and M. Zhou. Nonparametric Bayesian factor analysis for dynamic count matrices. In AISTATS, pages 1 9, 2015.

R. P. Adams, Z. Ghahramani, and M. I. Jordan. Tree-structured stick breaking for hierarchical data. In NIPS, 2010.

D. Aldous. Exchangeability and related topics. Ecole d ete de probabilit es de Saint-Flour XIII-1983, pages 1 198, 1985.

F. J. Anscombe. Sampling theory of the negative binomial and logarithmic series distributions. Biometrika, 37(3-4):358 382, 1950.

C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist., 2(6):1152 1174, 1974.

Y. Bengio and Y. Le Cun. Scaling learning algorithms towards AI. In L eon Bottou, Olivier Chapelle, D. De Coste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, pages 153 160, 2007.

Y. Bengio, I. J. Goodfellow, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2015.

D. Blackwell and J. Mac Queen. Ferguson distributions via P olya urn schemes. Ann. Statist., 1(2):353 355, 1973.

D. Blei and J. Laﬀerty. Correlated topic models. NIPS, 2006.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3: 993 1022, 2003.

D. M. Blei and J. D. Mcauliﬀe. Supervised topic models. In NIPS, 2008.

D. M. Blei, T. L. Griﬃths, and M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of ACM, 2010.

C. I. Bliss and R. A. Fisher. Fitting the negative binomial distribution to biological data. Biometrics, 9(2):176 200, 1953.

W. Buntine and A. Jakulin. Discrete component analysis. In Subspace, Latent Structure and Feature Selection Techniques. Springer-Verlag, 2006.

J. Canny. Gap: a factor model for discrete data. In SIGIR, 2004.

J. Chen, J. Zhu, Z. Wang, X. Zheng, and B. Zhang. Scalable inference for logistic-normal topic models. In NIPS, 2013.

Zhou, Cong, and Chen

S. C. Choi and R. Wette. Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics, pages 683 690, 1969.

L. Devroye. Simulating Bessel random variables. Statistics & probability letters, 57(3): 249 257, 2002.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classiﬁcation. J. Mach. Learn. Res., pages 1871 1874, 2008.

R. A. Fisher, A. S. Corbet, and C. B. Williams. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1):42 58, 1943.

B. J. Frey. Continuous sigmoidal belief networks trained using slice sampling. pages 452 458, 1997a.

B. J. Frey. Variational inference for continuous sigmoidal Bayesian networks. In Sixth International Workshop on Artiﬁcial Intelligence and Statistics. Ft. Lauderdale FL. Citeseer, 1997b.

B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193 213, 1999.

Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015a.

Z. Gan, R. Henao, D. E. Carlson, and L. Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS, 2015b.

M. Greenwood and G. U. Yule. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. R. Stat. Soc., 1920.

T. L. Griﬃths and M. Steyvers. Finding scientiﬁc topics. PNAS, 2004.

G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771 1800, 2002.

G. E. Hinton and Z. Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 352(1358):1177 1190, 1997.

G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527 1554, 2006.

C. Hu, P. Rai, and L. Carin. Zero-truncated poisson tensor factorization for massive binary tensors. In UAI, 2015.

N. L. Johnson, S. Kotz, and N. Balakrishnan. Discrete multivariate distributions, volume 165. Wiley New York, 1997.

Augmentable Gamma Belief Networks

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

H. Larochelle and S. Lauly. A neural autoregressive topic model. In NIPS, 2012.

Y. Le Cun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541 551, 1989.

D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.

H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

S. Linderman, M. Johnson, and R. P. Adams. Dependent multinomial models made easy: Stick-breaking with the P olya-Gamma augmentation. In NIPS, pages 3438 3446, 2015.

V. Nair and G. E. Hinton. Rectiﬁed linear units improve restricted Boltzmann machines. In ICML, pages 807 814, 2010.

R. M. Neal. Connectionist learning of belief networks. Artiﬁcial intelligence, 56(1):71 113, 1992.

D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. J. Mach. Learn. Res., 2009.

J. Paisley, C. Wang, and D. M. Blei. The discrete inﬁnite logistic normal distribution. Bayesian Analysis, 2012.

J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan. Nested hierarchical dirichlet processes. IEEE Trans. Pattern Anal. Mach. Intell., 2015.

J. Pitman. Combinatorial stochastic processes. Lecture Notes in Mathematics. Springer Verlag, 2006.

N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using polya-gamma latent variables. ar Xiv preprint ar Xiv:1205.0310, 2012.

R. Ranganath and D. Blei. Correlated random measures. ar Xiv:1507.00720v1, 2015.

R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, 2014a.

R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. ar Xiv, 2014b.

M Ranzato, F. J. Huang, Y.-L. Boureau, and Y. Le Cun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278 1286, 2014.

Zhou, Cong, and Chen

R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.

R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. Learning with hierarchical-deep models. IEEE Trans. Pattern Anal. Mach. Intell., pages 1958 1971, 2013.

L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean ﬁeld theory for sigmoid belief networks. Journal of artiﬁcial intelligence research, 4(1):61 76, 1996.

N. Srivastava, R. Salakhutdinov, and G. E. Hinton. Modeling documents with a deep Boltzmann machine. In UAI, 2013.

I. Sutskever and G. E. Hinton. Deep, narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11):2629 2636, 2008.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 2006.

D.A. van Dyk and X. Meng. The art of data augmentation. J. Comp. Graph. Stat., 2001.

H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML, 2009.

M. Welling, M. Rosen-Zvi, and G. E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, pages 1481 1488, 2004.

S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.

E. P. Xing, R. Yan, and A. G. Hauptmann. Mining associated text and images with dualwing harmoniums. In UAI, 2005.

L Yuan and J. D. Kalbﬂeisch. On the Bessel distribution and related problems. Annals of the Institute of Statistical Mathematics, 52(3):438 447, 2000.

M. Zhou. Beta-negative binomial process and exchangeable random partitions for mixedmembership modeling. In NIPS, pages 3455 3463, 2014.

M. Zhou. Inﬁnite edge partition models for overlapping community detection and link prediction. In AISTATS, pages 1135 1143, 2015.

M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell., 37(2):307 320, 2015.

M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, pages 1462 1471, 2012.

M. Zhou, Y. Cong, and B. Chen. The Poisson gamma belief network. In NIPS, 2015a.

M. Zhou, O. H. M. Padilla, and J. G. Scott. Priors for random count matrices derived from a family of negative binomial processes. To appear in J. Amer. Statist. Assoc., 2015b.

J. Zhu, A. Ahmed, and E. P. Xing. Med LDA: maximum margin supervised topic models. J. Mach. Learn. Res., 13(1):2237 2278, 2012.