# Convolutional Poisson Gamma Belief Network

Chaojie Wang 1, Bo Chen 1, Sucheng Xiao 1, Mingyuan Zhou 2

1 National Laboratory of Radar Signal Processing, Collaborative Innovation Center of Information Sensing and Understanding, Xidian University, Xi'an, Shaanxi, China. 2 McCombs School of Business, The University of Texas at Austin, Austin, Texas 78712, USA. Correspondence to: Bo Chen.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

For text analysis, one often resorts to a lossy representation that either completely ignores word order or embeds each word as a low-dimensional dense feature vector. In this paper, we propose convolutional Poisson factor analysis (CPFA) that directly operates on a lossless representation that processes the words in each document as a sequence of high-dimensional one-hot vectors. To boost its performance, we further propose the convolutional Poisson gamma belief network (CPGBN) that couples CPFA with the gamma belief network via a novel probabilistic pooling layer. CPFA forms words into phrases and captures very specific phrase-level topics, and CPGBN further builds a hierarchy of increasingly more general phrase-level topics. For efficient inference, we develop both a Gibbs sampler and a Weibull distribution based convolutional variational auto-encoder. Experimental results demonstrate that CPGBN can extract high-quality text latent representations that capture the word order information, and hence can be leveraged as a building block to enrich a wide variety of existing latent variable models that ignore word order.

1. Introduction

A central task in text analysis and language modeling is to effectively represent documents so as to capture their underlying semantic structures. A basic idea is to represent the words appearing in a document with a sequence of one-hot vectors, where the vector dimension is the size of the vocabulary. This preserves all textual information but results in a collection of extremely large and sparse matrices for a text corpus. Given the memory and computation constraints, it is very challenging to directly model this lossless representation. Thus existing methods often resort to simplified lossy representations that either completely ignore word order (Blei et al., 2003) or embed the words into a lower dimensional feature space (Mikolov et al., 2013).

Ignoring word order, each document is simplified as a bag-of-words count vector, the vth element of which represents how many times the vth vocabulary term appears in that document. With a text corpus simplified as a term-document frequency count matrix, a wide array of latent variable models (LVMs) have been proposed for text analysis (Deerwester et al., 1990; Papadimitriou et al., 2000; Lee & Seung, 2001; Blei et al., 2003; Hinton & Salakhutdinov, 2009; Zhou et al., 2012). Extending shallow probabilistic topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and Poisson factor analysis (PFA) (Zhou et al., 2012), steady progress has been made in inferring multi-stochastic-layer deep latent representations for text analysis (Gan et al., 2015; Zhou et al., 2016; Ranganath et al., 2015; Zhang et al., 2018). Despite the progress, completely ignoring word order can still be particularly problematic on some common text-analysis tasks, such as spam detection and sentiment analysis (Pang et al., 2002; Tang et al., 2014).
To preserve word order, a common practice is to first convert each word in the vocabulary from a high-dimensional sparse one-hot vector into a low-dimensional dense word-embedding vector. The word-embedding vectors can be either trained as part of the learning (Kim, 2014; Kalchbrenner et al., 2014), or pre-trained by some other methods on an additional large corpus (Mikolov et al., 2013). Sequentially ordered word-embedding vectors have been successfully combined with deep neural networks to address various problems in text analysis and language modeling. A typical combination method is to use the word-embedding layer as part of a recurrent neural network (RNN), especially long short-term memory (LSTM) and its variants (Hochreiter & Schmidhuber, 1997; Chung et al., 2014), achieving great success in numerous tasks that heavily rely on having high-quality sentence representations. Another popular combination method is to apply a convolutional neural network (CNN) (Lecun et al., 1998) directly to the embedding representation, treating the word-embedding layer as an image input; it has been widely used in systems for entity search, sentence modeling, product feature mining, and so on (Xu & Sarikaya, 2013; Weston et al., 2014).

In this paper, we first propose convolutional PFA (CPFA) that directly models the documents, each of which is represented without information loss as a sequence of one-hot vectors. We then boost its performance by coupling it with the gamma belief network (GBN) of Zhou et al. (2016), a multi-stochastic-hidden-layer deep generative model, via a novel probabilistic document-level pooling layer. We refer to the CPFA and GBN coupled model as convolutional Poisson GBN (CPGBN). To the best of our knowledge, CPGBN is the first unsupervised probabilistic convolutional model that infers multi-stochastic-layer latent variables for documents represented without information loss. Its hidden layers can be jointly trained with an upward-downward Gibbs sampler; this makes its inference different from greedy layer-wise training (Lee et al., 2009; Chen et al., 2013). In each Gibbs sampling iteration, the main computation is embarrassingly parallel and hence can be accelerated with Graphics Processing Units (GPUs). We also develop a Weibull distribution based convolutional variational auto-encoder to provide amortized variational inference, which further accelerates both training and testing for large corpora. Exploiting the multi-layer structure of CPGBN, we further propose a supervised CPGBN (sCPGBN), which combines the representation power of CPGBN for topic modeling and the discriminative power of deep neural networks (NNs) under a principled probabilistic framework. We show that the proposed models achieve state-of-the-art results in a variety of text-analysis tasks.

2. Convolutional Models for Text Analysis

Below we introduce CPFA and then develop a probabilistic document-level pooling method to couple CPFA with GBN, which further serves as the decoder of a Weibull distribution based convolutional variational auto-encoder (VAE).

2.1. Convolutional Poisson Factor Analysis

Denote $V$ as the vocabulary and let $D_j = (x_{j1}, \ldots, x_{jL_j})$ represent the $L_j$ sequentially ordered words of the $j$th document, which can be represented as a sequence of one-hot vectors.
For example, with vocabulary $V = \{\text{don't}, \text{hate}, \text{I}, \text{it}, \text{like}\}$, document $D_j = (\text{I}, \text{like}, \text{it})$ can be represented as $X_j = [x_{j1}, x_{j2}, x_{j3}] \in \{0, 1\}^{|V| \times L_j}$, where $x_{j1} = (0, 0, 1, 0, 0)^\top$, $x_{j2} = (0, 0, 0, 0, 1)^\top$, and $x_{j3} = (0, 0, 0, 1, 0)^\top$ are one-hot column vectors. Let us denote $x_{jvl} = X_j(v, l)$, which is one if and only if word $l$ of document $j$ matches term $v$ of the vocabulary.

To exploit a rich set of tools developed for count data analysis (Zhou et al., 2012; 2016), we first link these sequential binary vectors to sequential count vectors via the Bernoulli-Poisson link (Zhou, 2015). More specifically, we link each $x_{jvl}$ to a latent count as $x_{jvl} = 1(m_{jvl} > 0)$, where $m_{jvl} \in \mathbb{Z} := \{0, 1, \ldots\}$, and factorize the matrix $M_j = \{m_{jvl}\}_{v,l} \in \mathbb{Z}^{|V| \times L_j}$ under the Poisson likelihood. Distinct from vanilla PFA (Zhou et al., 2012), where the columns of the matrix are treated as conditionally independent, here we introduce convolution into the hierarchical model to capture the sequential dependence between the columns. We construct the hierarchical model of CPFA as

$$X_j = 1(M_j > 0), \quad M_j \sim \text{Pois}\Big(\textstyle\sum_{k=1}^{K} D_k * w_{jk}\Big), \quad w_{jk} \sim \text{Gam}(r_k, 1/c_j), \quad D_k(:) \sim \text{Dir}(\eta 1_{|V|F}), \qquad (1)$$

where $*$ denotes a convolution operator, $\mathbb{R}_+ := \{x : x \geq 0\}$, $D_k = (d_{k1}, \ldots, d_{kF}) \in \mathbb{R}_+^{|V| \times F}$ is the $k$th convolutional filter/factor/topic whose filter width is $F$, $d_{kf} = (d_{k1f}, \ldots, d_{k|V|f})^\top$, and $D_k(:) = (d_{k1}^\top, \ldots, d_{kF}^\top)^\top \in \mathbb{R}_+^{|V|F}$; the latent count matrix $M_j$ is factorized into the summation of $K$ equal-sized latent count matrices, the Poisson rates of the $k$th of which are obtained by convolving $D_k$ with its corresponding gamma distributed feature representation $w_{jk} \in \mathbb{R}_+^{S_j}$, where $S_j := L_j - F + 1$. To complete the hierarchical model, we let $r_k \sim \text{Gam}(1/K, 1/c_0)$ and $c_j \sim \text{Gam}(e_0, 1/f_0)$. Note as in Zhou et al. (2016), we may consider $K$ as the truncation level of a gamma process, which allows the number of needed factors to be inferred from the data as long as $K$ is set sufficiently large.

We can interpret $d_{kvf} := D_k(v, f)$ as the probability that the $v$th term in the vocabulary appears at the $f$th temporal location for the $k$th latent topic, and expect each $D_k$ to extract both global co-occurrence patterns, such as common topics, and local temporal structures, such as common $n$-gram phrases, where $n \leq F$, from the text corpus. Note the convolution layers of CPFA convert text regions of size $F$ (e.g., "am so happy" with $F = 3$) to feature vectors, directly learning the embedding of text regions without going through a separate learning stage for word embedding. Thus CPFA provides a potential solution for distinguishing polysemous words according to their neighboring words. The length of the representation weight vector $w_{jk}$ in our model is $S_j = L_j - F + 1$, which varies with the document length $L_j$. This distinguishes CPFA from traditional convolutional models with a fixed feature map size (Zhang et al., 2017; Miao et al., 2018; Min et al., 2019), which require either heuristic cropping or zero-padding.

Figure 1. The proposed CPGBN (upper part) and its corresponding convolutional variational inference network (lower part).
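To make the convolutional likelihood in (1) concrete, the following minimal NumPy sketch (our own illustration, not the authors' code; the settings of $K$, $F$, and the hyperparameters are assumptions) builds the one-hot matrix of the example document, forms the Poisson rate $\sum_k D_k * w_{jk}$ by convolving each $|V| \times F$ filter with its length-$S_j$ weight vector, and draws $M_j$ and $X_j$.

```python
import numpy as np

# Minimal sketch of the CPFA generative process in (1); shapes follow the paper,
# while the numerical settings (K, F, eta, r_k, c_j) are illustrative assumptions.
rng = np.random.default_rng(0)

vocab = ["don't", "hate", "I", "it", "like"]     # vocabulary, |V| = 5
doc = ["I", "like", "it"]                        # document, L_j = 3
V, L = len(vocab), len(doc)

# Lossless one-hot representation X_j in {0,1}^{|V| x L_j}
X_obs = np.zeros((V, L), dtype=int)
for l, word in enumerate(doc):
    X_obs[vocab.index(word), l] = 1

K, F = 4, 2                                      # number of topics and filter width
S = L - F + 1                                    # S_j = L_j - F + 1 valid filter positions
eta, r_k, c_j = 0.05, 0.5, 1.0

# D_k(:) ~ Dir(eta * 1_{|V|F}): each vectorized filter lies on the simplex
D = rng.dirichlet(np.full(V * F, eta), size=K).reshape(K, V, F)

# w_jk ~ Gam(r_k, 1/c_j): gamma-distributed feature representation of length S_j
w = rng.gamma(shape=r_k, scale=1.0 / c_j, size=(K, S))

# Poisson rate sum_k D_k * w_jk, where * is a 1-D "full" convolution over positions
rate = np.zeros((V, L))
for k in range(K):
    for v in range(V):
        rate[v] += np.convolve(D[k, v], w[k], mode="full")   # length F + S - 1 = L_j

M = rng.poisson(rate)           # latent count matrix M_j
X_gen = (M > 0).astype(int)     # Bernoulli-Poisson link: X_j = 1(M_j > 0)
print(X_gen.shape, X_obs.shape)  # both (|V|, L_j)
```

The "full" 1-D convolution of a length-$F$ filter row with a length-$S_j$ weight vector has length $L_j$, which is why no cropping or zero-padding of the document is needed on the generative side.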
2.2. Convolutional Poisson Gamma Belief Network

There has been significant recent interest in inferring multi-stochastic-layer deep latent representations for text analysis in an unsupervised manner (Gan et al., 2015; Zhou et al., 2016; Ranganath et al., 2015; Wang et al., 2018; Zhang et al., 2018), where word order is ignored. The key intuition behind these models, such as GBN (Zhou et al., 2016), is that words that frequently co-occur in the same document can form specific word-level topics in shallow layers; as the depth of the network increases, frequently co-occurring topics can form more general ones. Here, we propose a model that preserves word order without losing the nice hierarchical topical interpretation provided by a deep topic model. The intuition is that by preserving word order, words can first form short phrases; frequently co-occurring short phrases can then be combined to form specific phrase-level topics; and these specific phrase-level topics can form increasingly more general phrase-level topics when moving towards deeper layers.

As in Fig. 1, we couple CPFA in (1) with GBN to construct CPGBN, whose generative model with $T$ hidden layers, from top to bottom, is expressed as

$$\theta^{(T)}_j \sim \text{Gam}(r, 1/c^{(T+1)}_j), \;\ldots,\; \theta^{(t)}_j \sim \text{Gam}(\Phi^{(t+1)} \theta^{(t+1)}_j, 1/c^{(t+1)}_j), \;\ldots,\; \theta^{(1)}_j \sim \text{Gam}(\Phi^{(2)} \theta^{(2)}_j, 1/c^{(2)}_j),$$
$$w_{jk} = \pi_{jk} \theta^{(1)}_{jk}, \quad \pi_{jk} \sim \text{Dir}\big((\Phi^{(2)}_{k:} \theta^{(2)}_j / S_j) 1_{S_j}\big), \quad M_j \sim \text{Pois}\Big(\textstyle\sum_{k=1}^{K^{(1)}} D_k * w_{jk}\Big), \qquad (2)$$

where $\Phi_{k:}$ is the $k$th row of $\Phi$ and superscripts indicate layers. Note CPGBN first factorizes the latent count matrix $M_j \in \mathbb{Z}^{|V| \times L_j}$ under the Poisson likelihood into the summation of $K^{(1)}$ convolutions, the $k$th of which is between $D_k \in \mathbb{R}_+^{|V| \times F}$ and the weight vector $w_{jk} \in \mathbb{R}_+^{S_j}$. Using the relationship between the gamma and Dirichlet distributions (e.g., Lemma IV.3 of Zhou & Carin (2012)), $w_{jk} = (w_{jk1}, \ldots, w_{jkS_j}) = (\theta^{(1)}_{jk} \pi_{jk1}, \ldots, \theta^{(1)}_{jk} \pi_{jkS_j}) \in \mathbb{R}_+^{S_j}$ in (2) can be equivalently generated as

$$w_{jks} \sim \text{Gam}\big(\Phi^{(2)}_{k:} \theta^{(2)}_j / S_j, \, 1/c^{(2)}_j\big), \quad s = 1, \ldots, S_j, \qquad (3)$$

which can be seen as a specific probabilistic document-level pooling algorithm on the gamma shape parameter. For $t \in \{1, \ldots, T-1\}$, the shape parameters of the gamma distributed hidden units $\theta^{(t)}_j \in \mathbb{R}_+^{K^{(t)}}$ are factorized into the product of the connection weight matrix $\Phi^{(t+1)} \in \mathbb{R}_+^{K^{(t)} \times K^{(t+1)}}$ and the hidden units $\theta^{(t+1)}_j \in \mathbb{R}_+^{K^{(t+1)}}$ of layer $t+1$; the top layer's hidden units $\theta^{(T)}_j$ share the same $r \in \mathbb{R}_+^{K^{(T)}}$ as their gamma shape parameters; and the $c^{(t+1)}_j$ are gamma scale parameters. For scale identifiability and ease of inference, the columns of $D_k$ and $\Phi^{(t+1)} \in \mathbb{R}_+^{K^{(t)} \times K^{(t+1)}}$ are restricted to have unit $L_1$ norm. To complete the hierarchical model, we let $D_k(:) \sim \text{Dir}(\eta^{(1)} 1_{|V|F})$, $\phi^{(t)}_k \sim \text{Dir}(\eta^{(t)} 1_{K^{(t)}})$, $r_k \sim \text{Gam}(1/K^{(T)}, 1)$, and $c^{(t+1)}_j \sim \text{Gam}(e_0, 1/f_0)$.

Examining (3) shows that CPGBN provides a probabilistic document-level pooling layer, which summarizes the content coefficients $w_{jk}$ across all word positions into $\theta^{(1)}_{jk} = \sum_{s=1}^{S_j} w_{jks}$; the hierarchical structure after $\theta^{(1)}_{jk}$ can be flexibly modified according to the deep models (not restricted to GBN) to be combined with. The proposed pooling layer can be trained jointly with all the other layers, making it distinct from a usual one that often cuts off the message passing from deeper layers (Lee et al., 2009; Chen et al., 2013). We note that using pooling on the first hidden layer is related to shallow text CNNs that use document-level pooling directly after a single convolutional layer (Kim, 2014; Johnson & Zhang, 2015a), which often contributes to improved efficiency (Boureau et al., 2010; Wang et al., 2010).
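As a complement to (2) and (3), the sketch below (our own NumPy code; the layer widths, filter width, and document length are assumed values) performs top-down ancestral sampling through a three-hidden-layer CPGBN and checks the pooling identity $\theta^{(1)}_{jk} = \sum_{s=1}^{S_j} w_{jks}$.

```python
import numpy as np

# Illustrative ancestral sampling of CPGBN (2); layer widths K^{(t)}, F, and the
# document length are assumptions for this sketch. Gamma scales 1/c are set to 1.
rng = np.random.default_rng(1)

V, L, F = 5, 3, 2
S = L - F + 1
K = [8, 4, 2]                                   # K^{(1)}, K^{(2)}, K^{(3)} with T = 3

# Connection weights Phi^{(t+1)} with columns on the simplex
Phi2 = rng.dirichlet(np.ones(K[0]), size=K[1]).T   # K^{(1)} x K^{(2)}
Phi3 = rng.dirichlet(np.ones(K[1]), size=K[2]).T   # K^{(2)} x K^{(3)}
r = rng.gamma(1.0 / K[2], 1.0, size=K[2])

# Top-down gamma chain: theta^{(3)} -> theta^{(2)} -> theta^{(1)}
theta3 = rng.gamma(r, 1.0)
theta2 = rng.gamma(Phi3 @ theta3, 1.0)
theta1 = rng.gamma(Phi2 @ theta2, 1.0)

# Probabilistic pooling: w_jk = pi_jk * theta^{(1)}_{jk}, pi_jk ~ Dir over S_j positions
w = np.zeros((K[0], S))
for k in range(K[0]):
    pi = rng.dirichlet(np.full(S, (Phi2[k] @ theta2) / S))
    w[k] = pi * theta1[k]

# Summing over positions recovers the pooled first-layer feature theta^{(1)}_{jk}
assert np.allclose(w.sum(axis=1), theta1)

# Convolve with filters D_k and draw the word counts as in CPFA
D = rng.dirichlet(np.full(V * F, 0.05), size=K[0]).reshape(K[0], V, F)
rate = sum(np.stack([np.convolve(D[k, v], w[k], mode="full") for v in range(V)])
           for k in range(K[0]))
M = rng.poisson(rate)
print(M.shape)  # (|V|, L_j)
```

The assert documents the document-level pooling described above: the deeper layers only see $\theta^{(1)}_j$, while the position-specific information is carried by $\pi_{jk}$.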
2.3. Convolutional Inference Network for CPGBN

To make our model both scalable to big corpora in training and fast in out-of-sample prediction, below we introduce a convolutional inference network, which will be used in the hybrid MCMC/variational inference described in Section 3.2.

Note the usual strategy of autoencoding variational inference is to construct an inference network that maps the observations directly to their latent representations, and to optimize the encoder and decoder by minimizing the negative evidence lower bound (ELBO) $L_g = \sum_{j=1}^{J} L_g(X_j)$, where

$$L_g(X_j) = \sum_{t=2}^{T} \mathbb{E}_Q\bigg[\ln \frac{q(\theta^{(t)}_j \mid -)}{p(\theta^{(t)}_j \mid \Phi^{(t+1)}, \theta^{(t+1)}_j)}\bigg] + \sum_{k=1}^{K^{(1)}} \sum_{s=1}^{S_j} \mathbb{E}_Q\bigg[\ln \frac{q(w_{jks} \mid -)}{p(w_{jks} \mid \Phi^{(2)}, \theta^{(2)}_j)}\bigg] - \mathbb{E}_Q\big[\ln p(X_j \mid \{D_k, w_{jk}\}_{1,K^{(1)}})\big]. \qquad (4)$$

Following Zhang et al. (2018), we use the Weibull distribution to approximate the gamma distributed conditional posterior of $\theta^{(t)}_j$, as it is reparameterizable, resembles the gamma distribution, and the Kullback-Leibler (KL) divergence between the Weibull and gamma distributions is analytic. As in Fig. 1, we construct the autoencoding variational distribution as $Q = q(w_{jk} \mid -) \prod_{t=2}^{T} q(\theta^{(t)}_j \mid -)$, where

$$q(w_{jk} \mid -) = \text{Weibull}\big(\Sigma^{(1)}_{jk} + \Phi^{(2)}_{k:} \theta^{(2)}_j, \, \Lambda^{(1)}_{jk}\big), \quad q(\theta^{(t)}_j \mid -) = \text{Weibull}\big(\sigma^{(t)}_j + \Phi^{(t+1)} \theta^{(t+1)}_j, \, \lambda^{(t)}_j\big). \qquad (5)$$

The parameters $\Sigma^{(1)}_j, \Lambda^{(1)}_j \in \mathbb{R}^{K^{(1)} \times S_j}$ of $w_j = (w_{j1}, \ldots, w_{jK^{(1)}}) \in \mathbb{R}^{K^{(1)} \times S_j}$ are deterministically transformed from the observation $X_j$ using CNNs specified as

$$H^{(1)}_j = \text{relu}\big(C^{(1)}_1 * X_j + b^{(1)}_1\big), \quad \Sigma^{(1)}_j = \exp\big(C^{(1)}_2 * \text{pad}(H^{(1)}_j) + b^{(1)}_2\big), \quad \Lambda^{(1)}_j = \exp\big(C^{(1)}_3 * \text{pad}(H^{(1)}_j) + b^{(1)}_3\big),$$

where $b^{(1)}_1, b^{(1)}_2, b^{(1)}_3 \in \mathbb{R}^{K^{(1)}}$, $C^{(1)}_1 \in \mathbb{R}^{K^{(1)} \times |V| \times F}$, $C^{(1)}_2, C^{(1)}_3 \in \mathbb{R}^{K^{(1)} \times K^{(1)} \times F}$, $H^{(1)}_j \in \mathbb{R}^{K^{(1)} \times S_j}$, and $\text{pad}(H^{(1)}_j) \in \mathbb{R}^{K^{(1)} \times L_j}$ is obtained with zero-padding. The parameters $\sigma^{(t)}_j$ and $\lambda^{(t)}_j$ are transformed from $h^{(1)}_j = \text{pool}(H^{(1)}_j)$ as

$$h^{(t)}_j = \text{relu}\big(U^{(t)}_1 h^{(t-1)}_j + b^{(t)}_1\big), \quad \sigma^{(t)}_j = \exp\big(U^{(t)}_2 h^{(t)}_j + b^{(t)}_2\big), \quad \lambda^{(t)}_j = \exp\big(U^{(t)}_3 h^{(t)}_j + b^{(t)}_3\big),$$

where $b^{(t)}_1, b^{(t)}_2, b^{(t)}_3, h^{(t)}_j \in \mathbb{R}^{K^{(t)}}$, $U^{(t)}_1 \in \mathbb{R}^{K^{(t)} \times K^{(t-1)}}$, and $U^{(t)}_2, U^{(t)}_3 \in \mathbb{R}^{K^{(t)} \times K^{(t)}}$ for $t \in \{2, \ldots, T\}$.

Further, we develop sCPGBN, a supervised generalization of CPGBN, for text categorization tasks: by adding a softmax classifier on the concatenation of $\{\theta^{(t)}_j\}_{1,T}$, the loss function of the entire framework is modified as $L = L_g + \xi L_c$, where $L_c$ denotes the cross-entropy loss and $\xi$ is used to balance generation and discrimination (Higgins et al., 2017).

3. Inference

Below we describe the key inference equations for CPFA in (1), which is a single-hidden-layer version of CPGBN in (2), and provide more details in the Appendix. How the inference of CPFA, including Gibbs sampling and hybrid MCMC/autoencoding variational inference, is generalized to that of CPGBN is similar to how the inference of PFA is generalized to that of PGBN, as described in detail in Zhou et al. (2016) and Zhang et al. (2018), and is omitted here for brevity.

3.1. Gibbs Sampling

Directly dealing with the whole matrix by expanding the convolution operation with Toeplitz conversion (Bojanczyk et al., 1995) provides a straightforward solution for the inference of convolutional models: it transforms each observation matrix $M_j$ into a vector, on which the inference methods for sparse factor analysis (Carvalho et al., 2008; James et al., 2010) could then be applied. However, considering the sparsity of the document matrix consisting of one-hot vectors, directly processing these matrices without considering sparsity would bring an unnecessary burden in computation and storage. Instead, we apply data augmentation under the Poisson likelihood (Zhou et al., 2012; 2016) to upward propagate the latent count matrices $M_j$ as

$$(M_{j1}, \ldots, M_{jK} \mid -) \sim \text{Multi}(M_j; \zeta_{j1}, \ldots, \zeta_{jK}), \quad \text{where } \zeta_{jk} = \frac{D_k * w_{jk}}{\sum_{k=1}^{K} D_k * w_{jk}}.$$

Note we only need to focus on the nonzero elements of $M_{jk} \in \mathbb{Z}^{|V| \times L_j}$.
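The following sketch (our own code, with assumed shapes and variable names) illustrates this sparsity-exploiting allocation step for a single document: only the positions with $x_{jvl} = 1$ carry latent counts, which are drawn from a zero-truncated Poisson of the total rate (the standard Bernoulli-Poisson augmentation, which we assume here; the exact conditionals are in the paper's Appendix) and then split across the $K$ topics with multinomial draws.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_truncated_poisson(lam, rng):
    """Draw m ~ Poisson(lam) conditioned on m >= 1 (simple rejection sketch)."""
    while True:
        m = rng.poisson(lam)
        if m >= 1:
            return m

def augment_counts(X, D, w, rng):
    """X: (V, L) one-hot document, D: (K, V, F) filters, w: (K, S) weights."""
    K, V, F = D.shape
    L = X.shape[1]
    # Per-topic Poisson rate at every (v, l): rate_k(v, l) = (D_k * w_jk)(v, l)
    rate_k = np.stack([
        np.stack([np.convolve(D[k, v], w[k], mode="full") for v in range(V)])
        for k in range(K)
    ])                                    # shape (K, V, L)
    M_k = np.zeros((K, V, L), dtype=int)
    for v, l in zip(*np.nonzero(X)):      # loop only over observed (nonzero) words
        total = rate_k[:, v, l].sum()
        m = sample_truncated_poisson(total, rng)       # latent count m_jvl >= 1
        zeta = rate_k[:, v, l] / total                 # topic probabilities zeta_jk
        M_k[:, v, l] = rng.multinomial(m, zeta)        # split m_jvl across topics
    return M_k                            # latent count matrices M_{j1}, ..., M_{jK}

# Tiny demo with assumed shapes: V=5, L=4, K=3, F=2.
V, L, K, F = 5, 4, 3, 2
X = np.zeros((V, L), dtype=int); X[rng.integers(0, V, L), np.arange(L)] = 1
D = rng.dirichlet(np.full(V * F, 0.05), size=K).reshape(K, V, F)
w = rng.gamma(0.5, 1.0, size=(K, L - F + 1))
print(augment_counts(X, D, w, rng).sum())
```

In the paper's sampler, these per-position multinomial draws are embarrassingly parallel across the nonzero entries, which is what makes GPU acceleration effective.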
We rewrite the likelihood function by expanding the convolution operation along the dimension of $w_{jk}$ as $m_{jkvl} \sim \text{Pois}\big(\sum_{s=1}^{S_j} w_{jks} d_{kv(l-s+1)}\big)$, where $d_{kv(l-s+1)} := 0$ if $l - s + 1 \notin \{1, 2, \ldots, F\}$. Thus each nonzero element $m_{jkvl}$ can be augmented as

$$(m_{jkvl1}, \ldots, m_{jkvlS_j} \mid m_{jkvl}) \sim \text{Multi}(m_{jkvl}; \delta_{jkvl1}, \ldots, \delta_{jkvlS_j}), \qquad (6)$$

where $\delta_{jkvls} = w_{jks} d_{kv(l-s+1)} / \sum_{s=1}^{S_j} w_{jks} d_{kv(l-s+1)}$ and $(m_{jkvl1}, \ldots, m_{jkvlS_j}) \in \mathbb{Z}^{S_j}$. We can now decouple $D_k * w_{jk}$ in (1) by marginalizing out $D_k$, leading to $m_{jk\cdot\cdot s} \sim \text{Pois}(w_{jks})$, where the symbol $\cdot$ denotes summing over the corresponding index and hence $m_{jk\cdot\cdot s} = \sum_{v=1}^{|V|} \sum_{l=1}^{L_j} m_{jkvls}$. Using the gamma-Poisson conjugacy, we have $(w_{jks} \mid -) \sim \text{Gam}\big(m_{jk\cdot\cdot s} + r_k, 1/(1 + c^{(2)}_j)\big)$.

Similarly, we can expand the convolution along the other direction as $m_{jkvl} \sim \text{Pois}\big(\sum_{f=1}^{F} d_{kvf} w_{jk(l-f+1)}\big)$, where $w_{jk(l-f+1)} := 0$ if $l - f + 1 \notin \{1, 2, \ldots, S_j\}$, and obtain $(d_{jkvl1}, \ldots, d_{jkvlF} \mid m_{jkvl}) \sim \text{Multi}(m_{jkvl}; \xi_{jkvl1}, \ldots, \xi_{jkvlF})$, where $\xi_{jkvlf} = d_{kvf} w_{jk(l-f+1)} / \sum_{f=1}^{F} d_{kvf} w_{jk(l-f+1)}$ and $(d_{jkvl1}, \ldots, d_{jkvlF}) \in \mathbb{Z}^{F}$. Further applying the relationship between the Poisson and multinomial distributions, we have $\big((d_{jk1\cdot}, \ldots, d_{jk|V|\cdot}) \mid m_{jk\cdot\cdot}\big) \sim \text{Multi}\big(m_{jk\cdot\cdot}; D_k(:)\big)$, where $d_{jkv\cdot} = (d_{jkv\cdot 1}, \ldots, d_{jkv\cdot F})$ collects the counts of term $v$ over the $F$ filter positions. With the Dirichlet-multinomial conjugacy, we have $(D_k(:) \mid -) \sim \text{Dir}\big((d_{\cdot k1\cdot}, \ldots, d_{\cdot k|V|\cdot}) + \eta 1_{|V|F}\big)$, where the leading $\cdot$ denotes summing over all documents.

Exploiting the properties of the Poisson and multinomial distributions helps CPFA fully take advantage of the sparsity of the one-hot vectors, making its complexity comparable to that of a regular bag-of-words topic model that uses Gibbs sampling for inference. Note that, as the multinomial related samplings inside each iteration are embarrassingly parallel, they are accelerated with GPUs in our experiments.

Algorithm 1. Hybrid stochastic-gradient MCMC and autoencoding variational inference for CPGBN.
Set the mini-batch size $m$ and the number of dictionaries $K$; initialize the encoder parameters $\Omega$ and the model parameters $\{D_k\}_{1,K}$.
for iter = 1, 2, ... do
  Randomly select a mini-batch of $m$ documents to form a subset $X = \{X_j\}_{1,m}$;
  for j = 1, 2, ..., m do
    Draw random noise $\epsilon_j$ from a uniform distribution;
    Calculate $\nabla_\Omega L(\Omega, D; X_j, \epsilon_j)$ according to (8) and update $\Omega$;
  end for
  Sample $\{w_j\}_{1,m}$ from (5) given $\Omega$;
  Process each nonzero element of $X$ in parallel to obtain $(d_{\cdot k1\cdot}, \ldots, d_{\cdot k|V|\cdot})$ according to (6);
  Update $\{D_k\}_{1,K}$ according to (7);
end for

3.2. Hybrid MCMC/Variational Inference

While having closed-form update equations, the Gibbs sampler requires processing all documents in each iteration and hence has limited scalability. Fortunately, there has been a considerable amount of related research on scalable inference for discrete LVMs (Ma et al., 2015; Patterson & Teh, 2013). In particular, the TLASGR-MCMC of Cong et al. (2017), which exploits an elegant simplex constraint and increases sampling efficiency via the Fisher information matrix (FIM), with adaptive step-sizes for the topics of different layers, can be naturally extended to our model. The efficient TLASGR-MCMC update of $D_k$ in CPFA can be described as

$$D_k^{(\text{new})}(:) = \Big\{ D_k(:) + \frac{\varepsilon_i}{M_k} \Big[ \big(\rho (d_{\cdot k1\cdot}, \ldots, d_{\cdot k|V|\cdot}) + \eta\big) - \big(\rho\, d_{\cdot k\cdot\cdot} + \eta |V| F\big) D_k(:) \Big] + \mathcal{N}\Big(0, \frac{2\varepsilon_i}{M_k} \text{diag}(D_k(:))\Big) \Big\}_{\angle}, \qquad (7)$$

where $i$ denotes the number of mini-batches processed so far, the symbol $\cdot$ in the subscript here denotes summing over the data in a mini-batch, and the definitions of $\rho$, $\varepsilon_i$, $\{\cdot\}_\angle$, and $M_k$ are analogous to those in Cong et al. (2017) and omitted here for brevity. Similar to Zhang et al. (2018), combining TLASGR-MCMC and the convolutional inference network described in Section 2.3, we can construct a hybrid stochastic-gradient MCMC/autoencoding variational inference for CPFA.
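Before giving the per-iteration details, we note that the inference-network side of this hybrid scheme relies on two properties of the Weibull variational posterior in (5): it can be sampled through a deterministic transform of uniform noise, and its KL divergence to a gamma prior is available in closed form, so the KL terms in (4) and (8) need no Monte Carlo estimation. The sketch below is our own illustration of these two ingredients (the shape/scale values are assumptions, not settings from the paper).

```python
import numpy as np
from scipy.special import gammaln

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def sample_weibull(k, lam, rng):
    """Reparameterized draw: x = lam * (-log(1 - u))**(1/k), u ~ Uniform(0, 1)."""
    u = rng.uniform(size=np.shape(lam))
    return lam * (-np.log1p(-u)) ** (1.0 / k)

def kl_weibull_gamma(k, lam, alpha, beta):
    """Closed-form KL( Weibull(shape=k, scale=lam) || Gamma(shape=alpha, rate=beta) )."""
    return (EULER_GAMMA * alpha / k
            - alpha * np.log(lam)
            + np.log(k)
            + beta * lam * np.exp(gammaln(1.0 + 1.0 / k))
            - EULER_GAMMA - 1.0
            - alpha * np.log(beta)
            + gammaln(alpha))

# Illustrative check: with (k, lam) produced by an encoder, the reparameterized
# sample is differentiable in the encoder outputs and the KL term is analytic.
rng = np.random.default_rng(3)
k, lam = 2.0, 1.5            # Weibull shape/scale (would come from the encoder)
alpha, beta = 1.2, 1.0       # gamma prior shape/rate (assumed values)
x = sample_weibull(k, lam, rng)
print(float(x), float(kl_weibull_gamma(k, lam, alpha, beta)))
```

Because both functions are simple elementwise transforms, they port directly to automatic-differentiation frameworks such as the TensorFlow implementation mentioned below.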
More specifically, in each mini-batch based iteration, we draw a random sample of the CPFA global parameters $D = \{D_k\}_{1,K}$ via TLASGR-MCMC; given the sampled global parameters, we optimize the parameters of the convolutional inference network, denoted as $\Omega$, using the ELBO in (4), which for CPFA is simplified as

$$L_g = -\sum_{j=1}^{J} \mathbb{E}_{q(w_j \mid X_j)}\big[\ln p(X_j \mid \{D_k, w_{jk}\}_{1,K})\big] + \sum_{j=1}^{J} \mathbb{E}_{q(w_j \mid X_j)}\bigg[\ln \frac{q(w_j \mid X_j)}{p(w_j)}\bigg]. \qquad (8)$$

We summarize the proposed hybrid stochastic-gradient MCMC/autoencoding variational inference in Algorithm 1, which is implemented in TensorFlow (Abadi et al., 2016), combined with PyCUDA (Klockner et al., 2012) for more efficient computation.

4. Related Work

With the bag-of-words representation that ignores the word order information, a diverse set of deep topic models have been proposed to infer a multilayer data representation in an unsupervised manner. A main mechanism shared by these models is to connect adjacent layers by specific factorizations, which usually boosts performance (Gan et al., 2015; Zhou et al., 2016; Zhang et al., 2018). However, limited by the bag-of-words representation, they usually perform poorly on sentiment analysis tasks, which heavily rely on the word order information (Xu & Sarikaya, 2013; Weston et al., 2014). In this paper, the proposed CPGBN can be seen as a novel convolutional extension, which not only clearly remedies the loss of word order, but also inherits various virtues of deep topic models.

Benefiting from the advance of word-embedding methods, CNN-based architectures have been leveraged as encoders for various natural language processing tasks (Kim, 2014; Kalchbrenner et al., 2014). They in general directly apply a single convolution layer to the word-embedding layer, which, given a convolution filter window of size $n$, essentially acts as a detector of typical $n$-grams. More complex deep neural networks taking CNNs as their encoders and RNNs as their decoders have also been studied for text generation (Zhang et al., 2016; Semeniuta et al., 2017). However, for unsupervised sentence modeling, language decoders other than RNNs are less well studied; it was not until recently that Zhang et al. (2017) proposed a simple yet powerful, purely convolutional framework for learning sentence representations in an unsupervised manner, which is the first to force the encoded latent representation to capture the information from the entire sentence via a multi-layer CNN specification. However, such approaches are limited by requiring an additional large corpus for training word embeddings, and it is also difficult to visualize and explain the semantic meanings learned by black-box deep networks.

For text categorization, bigrams (or a combination of bigrams and unigrams) have been confirmed to provide more discriminative power than unigrams (Tan et al., 2002; Glorot et al., 2011). Motivated by this observation, Johnson & Zhang (2015a) tackle document categorization tasks by directly applying shallow CNNs, with filter width three, to one-hot encoded document matrices, outperforming both traditional n-gram and word-embedding based methods without the aid of additional training data. In addition, the shallow CNN serves as an important building block in many other supervised applications to help achieve state-of-the-art results (Johnson & Zhang, 2015b; 2017).

Figure 2. Point likelihood of CPGBNs on TREC as a function of time with various structural settings (curves: CPFA, CPGBN-2, CPGBN-3).

Figure 3. Classification accuracy (%) of the CPGBNs on TREC as a function of the depth with various structural settings (curves: CPGBN-32, CPGBN-64, CPGBN-128).
5. Experimental Results

5.1. Datasets and Preprocessing

We test the proposed CPGBN and its supervised extension (sCPGBN) on various benchmarks, including:

MR: Movie reviews with one sentence per review, where the task is to classify a review as being positive or negative (Pang & Lee, 2005).
TREC: TREC question dataset, where the task is to classify a question into one of six question types (whether the question is about abbreviation, entity, description, human, location, or numeric) (Li & Roth, 2002).
SUBJ: Subjectivity dataset, where the task is to classify a sentence as being subjective or objective (Pang & Lee, 2004).
ELEC: The ELEC dataset (Mcauley & Leskovec, 2013) consists of electronic product reviews and is part of a large Amazon review dataset.
IMDB: The IMDB dataset (Maas et al., 2011) is a benchmark dataset for sentiment analysis, where the task is to determine whether a movie review is positive or negative.

Table 1. Summary statistics for the datasets after tokenization (C: number of target classes. L: average sentence length. N: dataset size. V: vocabulary size. Vpre: number of words present in the set of pre-trained word vectors. Test: test set size, where CV means 10-fold cross validation).

| Data | MR | TREC | SUBJ | ELEC | IMDB |
|---|---|---|---|---|---|
| C | 2 | 6 | 2 | 2 | 2 |
| L | 20 | 10 | 23 | 123 | 266 |
| N | 10662 | 5952 | 10000 | 50000 | 50000 |
| V | 20277 | 8678 | 22636 | 52248 | 95212 |
| Vpre | 20000 | 8000 | 20000 | 30000 | 30000 |
| Test | CV | 500 | CV | 25000 | 25000 |

We follow the steps listed in Johnson & Zhang (2015a) to tokenize the text, where emojis such as ":-)" are treated as tokens and all characters are converted to lower case. We then select the top Vpre most frequent words to construct the vocabulary, without dropping stopwords; we map the words not included in the vocabulary to the same special token to keep all sentences structurally intact. The summary statistics of all benchmark datasets are listed in Table 1.

5.2. Inference Efficiency

In this section we show the results of the proposed CPGBN on TREC. First, to demonstrate the advantages of increasing the depth of the network, we construct three networks of different depths: $(K_1 = 32)$ for $T = 1$, $(K_1 = 32, K_2 = 16)$ for $T = 2$, and $(K_1 = 32, K_2 = 16, K_3 = 8)$ for $T = 3$. Under the same configuration of filter width $F = 3$ and the same hyperparameter setting, where $e_0 = f_0 = 0.1$ and $\eta^{(t)} = 0.05$, the networks are trained with the proposed Gibbs sampler. The trace plots of model likelihoods are shown in Fig. 2. It is worth noting that increasing the network depth in general improves the quality of data fitting, but as the complexity of the model increases, the model tends to converge more slowly in time.

Considering that the data fitting and generation ability is not necessarily strongly correlated with the performance on specific tasks, we also evaluate the proposed models on document classification. Using the same experimental settings as mentioned above, we investigate how the classification accuracy is impacted by the network structure. On each network, we apply the Gibbs sampler to collect 200 MCMC samples after 500 burn-ins to estimate the posterior mean of the feature usage weight vector $w_j$, for every document in both the training and testing sets. A linear support vector machine (SVM) (Cortes & Vapnik, 1995) is taken as the classifier on the first hidden layer, denoted as $\theta^{(1)}_j$ in (2), to make a fair comparison, where each result listed in Table 2 is the average accuracy of five independent runs.
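The evaluation protocol above (posterior-mean first-layer features followed by a linear SVM) can be sketched as follows. This is our own illustration: `posterior_mean_theta1` and `evaluate_features` are hypothetical helpers, the random arrays merely stand in for the Gibbs sampler's output, and scikit-learn's LinearSVC is our choice of SVM implementation rather than necessarily the authors'.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def posterior_mean_theta1(theta1_samples):
    """Average post-burn-in Gibbs samples of theta^{(1)}_j for every document.

    theta1_samples: array of shape (n_mcmc_samples, n_docs, K1).
    """
    return theta1_samples.mean(axis=0)

def evaluate_features(train_samples, y_train, test_samples, y_test):
    """Linear-SVM evaluation of unsupervisedly extracted features (sketch)."""
    X_train = posterior_mean_theta1(train_samples)
    X_test = posterior_mean_theta1(test_samples)
    clf = LinearSVC(C=1.0)              # C = 1.0 is an assumed setting
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Illustrative call with random placeholders standing in for Gibbs output
# (200 samples, K1 = 32, six TREC-like classes).
rng = np.random.default_rng(4)
acc = evaluate_features(
    rng.gamma(1.0, 1.0, size=(200, 100, 32)), rng.integers(0, 6, 100),
    rng.gamma(1.0, 1.0, size=(200, 50, 32)), rng.integers(0, 6, 50),
)
print(acc)
```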
Table 2. Comparison of classification accuracy on unsupervisedly extracted feature vectors and average training time (seconds per Gibbs sampling iteration across all documents) on three different datasets.

| Model | Size | MR acc. | TREC acc. | SUBJ acc. | MR time | TREC time | SUBJ time |
|---|---|---|---|---|---|---|---|
| LDA | 200 | 54.4 ± 0.8 | 45.5 ± 1.9 | 68.2 ± 1.3 | 3.93 | 0.92 | 3.81 |
| DocNADE | 200 | 54.2 ± 0.8 | 62.0 ± 0.6 | 72.9 ± 1.2 | - | - | - |
| DPFA | 200 | 55.2 ± 1.2 | 51.4 ± 0.9 | 74.5 ± 1.9 | 6.61 | 1.88 | 6.53 |
| DPFA | 200-100 | 55.4 ± 0.9 | 52.0 ± 0.6 | 74.4 ± 1.5 | 6.74 | 1.92 | 6.62 |
| DPFA | 200-100-50 | 56.1 ± 0.9 | 62.0 ± 0.6 | 78.5 ± 1.4 | 6.92 | 1.95 | 6.80 |
| PGBN | 200 | 56.3 ± 0.6 | 66.7 ± 1.8 | 76.2 ± 0.9 | 3.97 | 1.01 | 3.56 |
| PGBN | 200-100 | 56.7 ± 0.8 | 67.3 ± 1.7 | 77.3 ± 1.3 | 5.09 | 1.72 | 4.39 |
| PGBN | 200-100-50 | 57.0 ± 0.5 | 67.9 ± 1.5 | 78.3 ± 1.2 | 5.67 | 1.87 | 4.91 |
| WHAI | 200 | 55.6 ± 0.8 | 60.4 ± 1.9 | 75.4 ± 1.5 | - | - | - |
| WHAI | 200-100 | 56.2 ± 1.0 | 63.5 ± 1.8 | 76.0 ± 1.4 | - | - | - |
| WHAI | 200-100-50 | 56.4 ± 0.6 | 65.6 ± 1.7 | 76.5 ± 1.1 | - | - | - |
| CPGBN | 200 | 61.5 ± 0.8 | 68.4 ± 0.8 | 77.4 ± 0.8 | 3.58 | 0.98 | 3.53 |
| CPGBN | 200-100 | 62.4 ± 0.7 | 73.4 ± 0.8 | 81.2 ± 0.8 | 8.19 | 1.99 | 6.56 |
| CPGBN | 200-100-50 | 63.6 ± 0.8 | 74.4 ± 0.6 | 81.5 ± 0.6 | 10.44 | 2.59 | 7.87 |

Table 3. Example phrases learned from TREC by CPGBN (top 4 words of each filter column and representative visualized phrases).

| Kernel index | Topic: 1st column | Topic: 2nd column | Topic: 3rd column | Visualized phrases |
|---|---|---|---|---|
| 192nd kernel | how, cocktail, stadium, run | do, many, much, long | you, years, miles, degrees | how do you; how many years; how much degrees |
| 80th kernel | microsoft, virtual, answers.com, softball | e-mail, email, ip, brothers | address, addresses, floods, score | microsoft e-mail address; microsoft email address; virtual ip address |
| 177th kernel | who, willy, bar, hydrogen | created, wrote, fired, are | maria, angela, snoopy, caesar | who created snoopy; who fired caesar; who wrote angela |
| 47th kernel | dist, all-time, wheel, saltpepter | how, stock, 1976, westview | far, high, tall, exchange | dist how far; dist how high; dist how tall |

Fig. 3 shows a clear trend of improvement in classification accuracy, either by increasing the network depth given a limited first-layer width, or by increasing the hidden-layer width given a fixed depth.

5.3. Unsupervised Models

In our second set of experiments, we evaluate the performance of different unsupervised algorithms on the MR, TREC, and SUBJ datasets by comparing the discriminative ability of their unsupervisedly extracted latent features. We consider LDA (Blei et al., 2003) and its deep extensions, including DPFA (Gan et al., 2015) and PGBN (Zhou et al., 2016), which are trained with batch Gibbs sampling. We also consider WHAI (Zhang et al., 2018) and DocNADE (Lauly et al., 2017), which are trained with stochastic gradient descent. To make a fair comparison, we let CPGBNs have the same hidden-layer widths as the other methods, and set the filter width to 3 for the convolutional layer. Listed in Table 2 are the results of the various algorithms, where the means and error bars are obtained from five independent runs, using the code provided by the original authors. For all batch learning algorithms, we also report in Table 2 their average run time for an epoch (i.e., processing all training documents once). Clearly, given the same generative network structure, CPGBN performs the best in terms of classification accuracy, which can be attributed to its ability to utilize the word order information. The performance of CPGBN shows a clear trend of improvement as the generative network becomes deeper, which is also observed on other deep generative models including DPFA, PGBN, and WHAI.
In terms of running time, the shallow LDA could be the most efficient model compared to these more sophisticated ones, while a single-layer CPGBN achieves comparable efficiency thanks to its use of a GPU to parallelize the computation inside each iteration. Note all running times are reported based on an Nvidia GTX 1080Ti GPU.

In addition to quantitative evaluations, we have also visually inspected the inferred convolutional kernels of CPGBN, which is distinct from many existing convolutional models that build nonlinearity via black-box neural networks. In Table 3, we list several convolutional kernel elements of filter width 3 learned from TREC, using a single-hidden-layer CPGBN of size 200. We exhibit the top 4 most probable words in each column of the corresponding kernel element. It is particularly interesting to note that the words in different columns can be combined into a variety of interpretable phrases with similar semantics. CPGBN explicitly takes the word order information into consideration to extract phrases, which are then combined into a hierarchy of phrase-level topics, helping clearly improve the quality of unsupervisedly extracted features. Take the 177th convolutional kernel for example: the top word of its 1st topic is "who", its 2nd topic is a verb topic (created, wrote, fired, are), while its 3rd topic is a noun topic (maria/angela/snoopy/caesar). These word-level topics can be combined to construct phrases of the form "who" + "created/wrote/fired/are" + "maria/angela/snoopy/caesar", resulting in a phrase-level topic about human, one of the six types of questions in TREC. Note these shallow phrase-level topics will become more general in a deeper layer of CPGBN. We provide two example phrase-level topic hierarchies in the Appendix to enhance interpretability.

5.4. Supervised Models

Table 4 lists the comparison of various supervised algorithms on three common benchmarks, including SUBJ, ELEC, and IMDB. The results listed there are either quoted from published papers, or reproduced with the code provided by the original authors. We consider bag-of-words representation based supervised topic models, including sAVITM (Srivastava & Sutton, 2017), MedLDA (Zhu et al., 2014), and sWHAI (Zhang et al., 2018). We also consider three types of bag-of-n-gram models (Johnson & Zhang, 2015a), where n ∈ {1, 2, 3}, and word-embedding based methods, indicated with the suffix "-wv", including SVM-wv (Zhang & Wallace, 2017), RNN-wv, and LSTM-wv (Johnson & Zhang, 2016). In addition, we consider several related CNN based methods, including three different variants of TextCNN (Kim, 2014), namely CNN-rand, CNN-static, and CNN-non-static, as well as CNN-one-hot (Johnson & Zhang, 2015a), which is based on one-hot encoding. We construct three different sCPGBNs with T ∈ {1, 2, 3}, as described in Section 2.3. As shown in Table 4, the word-embedding based methods generally outperform the methods based on bag-of-words, which is not surprising as the latter completely ignore word order. Among all bag-of-words representation based methods, sWHAI performs the best and even achieves comparable performance to some word-embedding based methods, which illustrates the benefits of having multi-stochastic-layer latent representations.

Table 4. Comparison of classification accuracy on supervised feature extraction tasks on three different datasets.
| Model | SUBJ | ELEC | IMDB |
|---|---|---|---|
| sAVITM (Srivastava & Sutton, 2017) | 85.7 | 83.7 | 84.9 |
| MedLDA (Zhu et al., 2014) | 86.5 | 84.6 | 85.7 |
| sWHAI-layer1 (Zhang et al., 2018) | 90.6 | 86.8 | 87.2 |
| sWHAI-layer2 (Zhang et al., 2018) | 91.7 | 87.5 | 88.0 |
| sWHAI-layer3 (Zhang et al., 2018) | 92.0 | 87.8 | 88.2 |
| SVM-unigrams (Tan et al., 2002) | 88.5 | 86.3 | 87.7 |
| SVM-bigrams (Tan et al., 2002) | 89.4 | 87.2 | 88.2 |
| SVM-trigrams (Tan et al., 2002) | 89.7 | 87.4 | 88.5 |
| SVM-wv (Zhang & Wallace, 2017) | 90.1 | 85.9 | 86.5 |
| RNN-wv (Johnson & Zhang, 2016) | 88.9 | 87.5 | 88.3 |
| LSTM-wv (Johnson & Zhang, 2016) | 89.8 | 88.3 | 89.0 |
| CNN-rand (Kim, 2014) | 89.6 | 86.8 | 86.3 |
| CNN-static (Kim, 2014) | 93.0 | 87.8 | 88.9 |
| CNN-non-static (Kim, 2014) | 93.4 | 88.6 | 89.5 |
| CNN-one-hot (Johnson & Zhang, 2015a) | 91.1 | 91.3 | 91.6 |
| sCPGBN-layer1 | 93.4 ± 0.1 | 91.6 ± 0.3 | 91.8 ± 0.3 |
| sCPGBN-layer2 | 93.7 ± 0.1 | 92.0 ± 0.2 | 92.4 ± 0.2 |
| sCPGBN-layer3 | 93.8 ± 0.1 | 92.2 ± 0.2 | 92.6 ± 0.2 |

As for the n-gram based models, although they achieve comparable performance to word-embedding based methods, we find via experiments that both their performance and computation are sensitive to the vocabulary size. Among the CNN related algorithms, CNN-one-hot tends to perform better on classifying longer texts than TextCNN does, which agrees with the observations of Zhang & Wallace (2017); a possible explanation for this phenomenon is that CNN-one-hot is prone to overfitting on short documents. Moving beyond CNN-one-hot, sCPGBN helps capture the underlying high-order statistics to alleviate overfitting, as commonly observed in deep generative models (DGMs) (Li et al., 2015), and improves its performance by increasing its number of stochastic hidden layers.

6. Conclusion

We propose convolutional Poisson factor analysis (CPFA), a hierarchical Bayesian model that represents each word in a document as a one-hot vector, and captures the word order information by performing convolution on sequentially ordered one-hot word vectors. By developing a principled document-level stochastic pooling layer, we further couple CPFA with a multi-stochastic-layer deep topic model to construct the convolutional Poisson gamma belief network (CPGBN). We develop a Gibbs sampler to jointly train all the layers of CPGBN. For more scalable training and fast testing, we further introduce a mini-batch based stochastic inference algorithm that combines both stochastic-gradient MCMC and a Weibull distribution based convolutional variational auto-encoder. In addition, we provide a supervised extension of CPGBN. Example results on both unsupervised and supervised feature extraction tasks show that CPGBN combines the virtues of both convolutional operations and deep topic models, providing not only state-of-the-art classification performance, but also highly interpretable phrase-level deep latent representations.

Acknowledgements

B. Chen acknowledges the support of the Program for Young Thousand Talent by Chinese Central Government, the 111 Project (No. B18039), NSFC (61771361), NSFC for Distinguished Young Scholars (61525105), and the Innovation Fund of Xidian University. M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation and the McCombs Research Excellence Grant.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
Bojanczyk, A. W., Brent, R. P., De Hoog, F. R., and Sweet, D. R. On the stability of the Bareiss and related Toeplitz factorization algorithms. SIAM Journal on Matrix Analysis and Applications, 16(1):40-57, 1995.
Boureau, Y., Ponce, J., and Lecun, Y. A theoretical analysis of feature pooling in visual recognition. In ICML, pp. 111-118, 2010.
Carvalho, C. M., Chang, J. T., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. High-dimensional sparse factor modeling: Applications in gene expression genomics. J. Amer. Statist. Assoc., 103(484):1438-1456, 2008.
Chen, B., Polatkan, G., Sapiro, G., Blei, D. M., Dunson, D. B., and Carin, L. Deep learning with hierarchical convolutional factor analysis. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1887-1901, 2013.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Cong, Y., Chen, B., Liu, H., and Zhou, M. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In ICML, pp. 864-873, 2017.
Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. Indexing by latent semantic analysis. J. Amer. Soc. Inf. Sci., 1990.
Gan, Z., Chen, C., Henao, R., Carlson, D. E., and Carin, L. Scalable deep Poisson factor analysis for topic modeling. In ICML, pp. 1823-1832, 2015.
Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513-520, 2011.
Higgins, I., Matthey, L., Pal, A., Burgess, C. P., Glorot, X., Botvinick, M. M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, volume 3, 2017.
Hinton, G. E. and Salakhutdinov, R. Replicated softmax: An undirected topic model. In NIPS, pp. 1607-1614, 2009.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
James, G. M., Sabatti, C., Zhou, N., and Zhu, J. Sparse regulatory networks. AOAS, 4(2):663-686, 2010.
Johnson, R. and Zhang, T. Effective use of word order for text categorization with convolutional neural networks. NAACL, pp. 103-112, 2015a.
Johnson, R. and Zhang, T. Semi-supervised convolutional neural networks for text categorization via region embedding. In NIPS, pp. 919-927, 2015b.
Johnson, R. and Zhang, T. Supervised and semi-supervised text categorization using LSTM for region embeddings. ICML, pp. 526-534, 2016.
Johnson, R. and Zhang, T. Deep pyramid convolutional neural networks for text categorization. In ACL, pp. 562-570, 2017.
Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In ACL, pp. 655-665, 2014.
Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, pp. 1746-1751, 2014.
Klockner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., and Fasih, A. R. PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Computing, 38(3):157-174, 2012.
Lauly, S., Zheng, Y., Allauzen, A., and Larochelle, H. Document neural autoregressive distribution estimation. JMLR, 18(113):1-24, 2017.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In NIPS, 2001.
Lee, H., Pham, P. T., Largman, Y., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, pp. 1096-1104, 2009.
Li, C., Zhu, J., Shi, T., and Zhang, B. Max-margin deep generative models. In NIPS, pp. 1837-1845, 2015.
Li, X. and Roth, D. Learning question classifiers. In International Conference on Computational Linguistics, pp. 1-7, 2002.
Ma, Y. A., Chen, T., and Fox, E. B. A complete recipe for stochastic gradient MCMC. In NIPS, pp. 2917-2925, 2015.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In ACL, pp. 142-150, 2011.
Mcauley, J. and Leskovec, J. Hidden factors and hidden topics: understanding rating dimensions with review text. In ACM RecSys, pp. 165-172, 2013.
Miao, X., Zhen, X., Liu, X., Deng, C., Athitsos, V., and Huang, H. Direct shape regression networks for end-to-end face alignment. In CVPR, pp. 5040-5049, 2018.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111-3119, 2013.
Min, S., Chen, X., Zha, Z., Wu, F., and Zhang, Y. A two-stream mutual attention network for semi-supervised biomedical segmentation with noisy labels. AAAI, 2019.
Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, pp. 271-278, 2004.
Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pp. 115-124, 2005.
Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up? Sentiment classification using machine learning techniques. In ACL, pp. 79-86, 2002.
Papadimitriou, C., Raghavan, P., Tamaki, H., and Vempala, S. Latent semantic indexing: A probabilistic analysis. J. Computer and System Sci., 2000.
Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, pp. 3102-3110, 2013.
Ranganath, R., Tang, L., Charlin, L., and Blei, D. Deep exponential families. In AISTATS, pp. 762-771, 2015.
Semeniuta, S., Severyn, A., and Barth, E. A hybrid convolutional variational autoencoder for text generation. EMNLP, pp. 627-637, 2017.
Srivastava, A. and Sutton, C. A. Autoencoding variational inference for topic models. ICLR, 2017.
Tan, C., Wang, Y., and Lee, C. The use of bigrams to enhance text categorization. Information Processing and Management, 38(4):529-546, 2002.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., and Qin, B. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL, pp. 1555-1565, 2014.
Wang, C., Chen, B., and Zhou, M. Multimodal Poisson gamma belief network. In AAAI, 2018.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. S., and Gong, Y. Locality-constrained linear coding for image classification. In CVPR, pp. 3360-3367, 2010.
Weston, J., Chopra, S., and Adams, K. Tagspace: Semantic embeddings from hashtags. In EMNLP, pp. 1822-1827, 2014.
Xu, P. and Sarikaya, R. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 78-83, 2013.
Zhang, H., Chen, B., Guo, D., and Zhou, M. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. ICLR, 2018.
Zhang, Y. and Wallace, B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In IJCNLP, pp. 253-263, 2017.
Zhang, Y., Gan, Z., and Carin, L. Generating text via adversarial training. In NIPS Workshop on Adversarial Training, volume 21, 2016.
Zhang, Y., Shen, D., Wang, G., Gan, Z., Henao, R., and Carin, L. Deconvolutional paragraph representation learning. In NIPS, pp. 4169-4179, 2017.
Zhou, M. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, pp. 1135-1143, 2015.
Zhou, M. and Carin, L. Negative binomial process count and mixture modeling. arXiv preprint arXiv:1209.3442v1, 2012.
Zhou, M., Hannah, L., Dunson, D., and Carin, L. Beta-negative binomial process and Poisson factor analysis. In AISTATS, pp. 1462-1471, 2012.
Zhou, M., Cong, Y., and Chen, B. Augmentable gamma belief networks. JMLR, 17(1):5656-5699, 2016.
Zhu, J., Chen, N., Perkins, H., and Zhang, B. Gibbs max-margin topic models with data augmentation. JMLR, 15(1):1073-1110, 2014.