# Deep Poisson Gamma Dynamical Systems

Dandan Guo, Bo Chen, Hao Zhang
National Laboratory of Radar Signal Processing, Collaborative Innovation Center of Information Sensing and Understanding, Xidian University, Xi'an, China
gdd_xidian@126.com, bchen@mail.xidian.edu.cn, zhanghao_xidian@163.com

Mingyuan Zhou
McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA
mingyuan.zhou@mccombs.utexas.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

## Abstract

We develop deep Poisson-gamma dynamical systems (DPGDS) to model sequentially observed multivariate count data, improving previously proposed models by not only mining deep hierarchical latent structure from the data, but also capturing both first-order and long-range temporal dependencies. Using sophisticated but simple-to-implement data augmentation techniques, we derive closed-form Gibbs sampling update equations by first backward and upward propagating auxiliary latent counts, and then forward and downward sampling latent variables. Moreover, we develop stochastic gradient MCMC inference that is scalable to very long multivariate count time series. Experiments on both synthetic and a variety of real-world data demonstrate that the proposed model not only has excellent predictive performance, but also provides highly interpretable multilayer latent structure to represent hierarchical and temporal information propagation.

## 1 Introduction

The need to model time-varying count vectors $x_1, \ldots, x_T$ appears in a wide variety of settings, such as text analysis, international relation study, social interaction understanding, and natural language processing [1-9]. To model such count data, it is important not only to account for the sparsity of high-dimensional data and to be robust to over-dispersed temporal patterns, but also to capture complex dependencies both within and across time steps. To move beyond linear dynamical systems (LDS) [10] and their nonlinear generalizations [11], which often make a Gaussian assumption [12], the gamma process dynamic Poisson factor analysis (GP-DPFA) [5] factorizes the observed time-varying count vectors under the Poisson likelihood as $x_t \sim \mathrm{Pois}(\Phi\theta_t)$, and transmits temporal information smoothly by evolving the factor scores with a gamma Markov chain as $\theta_t \sim \mathrm{Gam}(\theta_{t-1}, \beta)$, which provides highly desirable strong nonlinearity. To further capture cross-factor temporal dependence, the Poisson gamma dynamical system (PGDS) [7] introduces a transition matrix $\Pi$ and lets $\theta_t \sim \mathrm{Gam}(\Pi\theta_{t-1}, \beta)$. However, these shallow models may still fall short in capturing long-range temporal dependencies [8]: given $\theta_t$, $\theta_{t+1}$ no longer depends on $\theta_{t-k}$ for any $k \geq 1$.

Deep probabilistic models are widely used to capture the relationships between latent variables across multiple stochastic layers [4, 8, 13-16]. For example, deep dynamic Poisson factor analysis (DDPFA) [8] utilizes recurrent neural networks (RNNs) [3] to capture long-range temporal dependencies of the factor scores; the latent variables and RNN parameters, however, are inferred separately. The deep temporal sigmoid belief network (DTSBN) [4] is a deep dynamic generative model defined as a sequential stack of sigmoid belief networks (SBNs), whose hidden units are typically restricted to be binary.
Although a deep structure is designed to describe complex long-range temporal dependencies, how the layers in DTSBN are related to each other lacks an intuitive interpretation, which is of paramount interest for a multilayer probabilistic model [15].

In this paper, we present deep Poisson gamma dynamical systems (DPGDS), a deep probabilistic dynamical model that takes advantage of a hierarchical structure to efficiently incorporate both between-layer and temporal dependencies, while providing rich interpretation. Moving beyond DTSBN and its binary hidden units, we build a deep dynamic directed network with gamma-distributed nonnegative real hidden units, inferring a multilayer contextual representation of multivariate time-varying count vectors. Consequently, DPGDS can handle highly overdispersed counts, capturing the correlations between the visible/hidden features across layers and over time using the gamma belief network [15]. Combining the deep and temporal structures shown in Fig. 1(a), DPGDS breaks the assumption that, given $\theta_t$, $\theta_{t+1}$ no longer depends on $\theta_{t-k}$ for $k \geq 1$, suggesting that it may better capture long-range temporal dependencies. As a result, the model allows more specific information, which is also more likely to exhibit fast temporal changes, to be transmitted through the lower layers, while allowing more general information, which is more likely to evolve slowly over time, to be transmitted through the higher layers. For example, as shown in Fig. 1(b), learned from GDELT2003 with DPGDS, when analyzing these international events the factors at lower layers are more specific and uncover relationships between individual countries, whereas those at higher layers are more general and reflect conflicts between areas consisting of several related countries, or conflicts occurring simultaneously; in addition, the latent representation $\theta_t$ at a lower layer varies more intensely than that at a higher layer. Distinct from DDPFA [8], which adopts two-stage inference, the latent variables of DPGDS can be jointly trained with both a Backward-Upward Forward-Downward (BUFD) Gibbs sampler and a sophisticated stochastic gradient MCMC (SGMCMC) algorithm that is scalable to very long multivariate time series [17-21]. Furthermore, the factors learned at each layer can refine the understanding and analysis of sequentially observed multivariate count data, which, to the best of our knowledge, may be very challenging for existing methods. Finally, based on a diverse range of real-world data sets, we show that DPGDS exhibits excellent predictive performance, inferring interpretable latent structure with well-captured long-range temporal dependencies.

## 2 Deep Poisson gamma dynamical systems

Shown in Fig. 1(a) is the graphical representation of a three-hidden-layer DPGDS. Let us denote $\theta \sim \mathrm{Gam}(a, c)$ as a gamma random variable with mean $a/c$ and variance $a/c^2$. Given a set of $V$-dimensional sequentially observed multivariate count vectors $x_1, \ldots, x_T$, represented as a $V \times T$ matrix $X$, the generative process of an $L$-hidden-layer DPGDS, from top to bottom, is expressed as

$$
\theta_t^{(L)} \sim \mathrm{Gam}\big(\tau_0 \Pi^{(L)}\theta_{t-1}^{(L)},\ \tau_0\big), \ \ldots, \quad
\theta_t^{(l)} \sim \mathrm{Gam}\big(\tau_0\big(\Phi^{(l+1)}\theta_t^{(l+1)} + \Pi^{(l)}\theta_{t-1}^{(l)}\big),\ \tau_0\big), \ \ldots,
$$
$$
\theta_t^{(1)} \sim \mathrm{Gam}\big(\tau_0\big(\Phi^{(2)}\theta_t^{(2)} + \Pi^{(1)}\theta_{t-1}^{(1)}\big),\ \tau_0\big), \qquad
x_t^{(1)} \sim \mathrm{Pois}\big(\delta_t^{(1)}\Phi^{(1)}\theta_t^{(1)}\big), \tag{1}
$$

where $\Phi^{(l)} \in \mathbb{R}_+^{K_{l-1}\times K_l}$ is the factor loading matrix at layer $l$, $\theta_t^{(l)} \in \mathbb{R}_+^{K_l}$ the hidden units of layer $l$ at time $t$, and $\Pi^{(l)} \in \mathbb{R}_+^{K_l\times K_l}$ a transition matrix of layer $l$ that captures cross-factor temporal dependencies.
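To make the generative process in (1) concrete, the following is a minimal NumPy sketch that draws one synthetic sequence from an $L$-hidden-layer DPGDS. The layer sizes, hyperparameters, and the symmetric Dirichlet draws used for $\Phi^{(l)}$ and $\Pi^{(l)}$ are placeholders chosen only for illustration (not learned values); the scaling factor $\delta_t^{(1)}$ and hyperparameter $\tau_0$ used here are defined in the text just below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dpgds(V=30, K=(20, 10, 5), T=50, tau0=1.0, delta=10.0):
    """Draw one synthetic count sequence from the DPGDS generative process of Eq. (1)."""
    L = len(K)
    sizes = (V,) + tuple(K)                                  # K_0 = V, then K_1, ..., K_L
    # Columns of Phi^(l) and Pi^(l) lie on the simplex (Dirichlet priors in the paper).
    Phi = [rng.dirichlet(np.ones(sizes[l]), size=sizes[l + 1]).T for l in range(L)]
    Pi = [rng.dirichlet(np.ones(K[l]), size=K[l]).T for l in range(L)]
    nu_top = rng.gamma(1.0, 1.0, size=K[-1])                 # top-layer factor weights nu^(L)

    theta = [np.zeros((K[l], T)) for l in range(L)]
    X = np.zeros((V, T), dtype=int)
    for t in range(T):
        for l in reversed(range(L)):                         # sample the top layer first
            shape = np.zeros(K[l])
            if l + 1 < L:
                shape += Phi[l + 1] @ theta[l + 1][:, t]     # between-layer term Phi^(l+1) theta_t^(l+1)
            elif t == 0:
                shape += nu_top                              # theta_1^(L) ~ Gam(tau0 nu^(L), tau0)
            if t > 0:
                shape += Pi[l] @ theta[l][:, t - 1]          # temporal term Pi^(l) theta_{t-1}^(l)
            theta[l][:, t] = rng.gamma(tau0 * shape, 1.0 / tau0)
        X[:, t] = rng.poisson(delta * (Phi[0] @ theta[0][:, t]))
    return X, theta

X, theta = sample_dpgds()
print(X.shape, int(X.sum()))
```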
We denote $\delta_t^{(1)} \in \mathbb{R}_+$ as a scaling factor, reflecting the scale of the counts at time $t$; one may also set $\delta_t^{(1)} = \delta^{(1)}$ for $t = 1, \ldots, T$. We denote $\tau_0 \in \mathbb{R}_+$ as a scaling hyperparameter that controls the temporal variation of the hidden units. The multilayer time-varying hidden units $\theta_t^{(l)}$ are well suited for downstream analysis, as will be shown below.

Figure 1: Graphical model and illustration for a three-hidden-layer deep Poisson gamma dynamical system (DPGDS). (a) The generative model; (b) Visualization of data and latent factors learned from GDELT2003, with the black, red, blue, and green lines denoting the observed data and the temporal trajectories of example latent factors at layers 1, 2, and 3, respectively.

DPGDS factorizes the count observation $x_t^{(1)}$ into the product of $\delta_t^{(1)}$, $\Phi^{(1)}$, and $\theta_t^{(1)}$ under the Poisson likelihood. It further factorizes the shape parameters of the gamma-distributed $\theta_t^{(l)}$ of layer $l$ at time $t$ into the sum of $\Phi^{(l+1)}\theta_t^{(l+1)}$, capturing the dependence between different layers, and $\Pi^{(l)}\theta_{t-1}^{(l)}$, capturing the temporal dependence within the same layer. At the top layer, $\theta_t^{(L)}$ depends only on $\Pi^{(L)}\theta_{t-1}^{(L)}$; at $t = 1$, $\theta_1^{(l)} \sim \mathrm{Gam}\big(\tau_0\Phi^{(l+1)}\theta_1^{(l+1)},\ \tau_0\big)$ for $l = 1, \ldots, L-1$ and $\theta_{k1}^{(L)} \sim \mathrm{Gam}\big(\tau_0\nu_k^{(L)},\ \tau_0\big)$. To complete the hierarchical model, we introduce $K_l$ factor weights $\nu^{(l)} = \big(\nu_1^{(l)}, \ldots, \nu_{K_l}^{(l)}\big)$ in layer $l$ to model the strength of each factor, and for $l = 1, \ldots, L$, we let

$$
\pi_k^{(l)} \sim \mathrm{Dir}\big(\nu_1^{(l)}\nu_k^{(l)}, \ldots, \nu_{k-1}^{(l)}\nu_k^{(l)},\ \xi^{(l)}\nu_k^{(l)},\ \nu_{k+1}^{(l)}\nu_k^{(l)}, \ldots, \nu_{K_l}^{(l)}\nu_k^{(l)}\big), \qquad
\nu_k^{(l)} \sim \mathrm{Gam}\Big(\frac{\gamma_0}{K_l},\ \beta^{(l)}\Big). \tag{2}
$$

Note that $\pi_k^{(l)}$ is the $k$th column of $\Pi^{(l)}$, and $\pi_{k_1k_2}^{(l)}$ can be interpreted as the probability of transiting from topic $k_2$ at the previous time to topic $k_1$ at the current time at layer $l$. Finally, we place Dirichlet priors on the factor loadings and draw the other parameters from a noninformative gamma prior: $\phi_k^{(l)} = \big(\phi_{1k}^{(l)}, \ldots, \phi_{K_{l-1}k}^{(l)}\big) \sim \mathrm{Dir}\big(\eta^{(l)}, \ldots, \eta^{(l)}\big)$, and $\delta_t^{(1)}, \xi^{(l)}, \beta^{(l)} \sim \mathrm{Gam}(\epsilon_0, \epsilon_0)$. Note that imposing Dirichlet distributions on the columns of $\Pi^{(l)}$ and $\Phi^{(l)}$ not only makes the latent representation more identifiable and interpretable, but also facilitates inference, as will be shown in the next section. Clearly, when $L = 1$, DPGDS reduces to PGDS [7]. In real-world applications, a binary observation can be linked to a latent count using the Bernoulli-Poisson link as $b = \mathbf{1}(n \geq 1)$, $n \sim \mathrm{Pois}(\lambda)$ [22]. A nonnegative-real-valued matrix can also be linked to a latent count matrix via a Poisson randomized gamma distribution as $x \sim \mathrm{Gam}(n, c)$, $n \sim \mathrm{Pois}(\lambda)$ [23].
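As a small illustration of the Bernoulli-Poisson link mentioned above, the sketch below draws the latent count $n$ conditioned on a binary observation $b$ and rate $\lambda$: if $b = 0$ then $n = 0$, and if $b = 1$ then $n$ follows a zero-truncated Poisson [22]. This is only a toy sketch of the link, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_latent_count(b, lam):
    """Bernoulli-Poisson link: b = 1(n >= 1), n ~ Pois(lam).
    Given b, draw n from its conditional posterior:
      b = 0  ->  n = 0;
      b = 1  ->  n ~ zero-truncated Poisson(lam)."""
    if b == 0:
        return 0
    n = 0
    while n == 0:            # simple rejection sampler; can be slow for very small lam
        n = rng.poisson(lam)
    return n

# Binary "observations" and an illustrative rate.
print([sample_latent_count(b, lam=0.5) for b in [0, 1, 1, 0, 1]])
```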
**Hierarchical structure:** To interpret the hierarchical structure of (1), we notice that $\mathbb{E}\big[x_t^{(1)} \mid \theta_t^{(l)}, \{\Phi^{(p)}\}_{p=1}^{l}\big] = \big[\prod_{p=1}^{l}\Phi^{(p)}\big]\theta_t^{(l)}$ if the temporal structure is ignored. Thus it is straightforward to interpret the columns $\phi_k^{(l)}$ by projecting them to the bottom data layer as $\big[\prod_{p=1}^{l-1}\Phi^{(p)}\big]\phi_k^{(l)}$; the projected factors are often quite specific at the bottom layer and become increasingly more general when moving upwards, as will be shown below in Fig. 5(a).

**Long-range temporal dependencies:** Using the law of total expectations on (1), for the three-hidden-layer DPGDS shown in Fig. 1(a), we have

$$
\mathbb{E}\big[x_t^{(1)} \mid \theta_{t-1}^{(1)}, \theta_{t-2}^{(2)}, \theta_{t-3}^{(3)}\big]\big/\delta_t^{(1)}
= \Phi^{(1)}\Pi^{(1)}\theta_{t-1}^{(1)}
+ \Phi^{(1)}\Phi^{(2)}\big[\Pi^{(2)}\big]^{2}\theta_{t-2}^{(2)}
+ \Phi^{(1)}\Phi^{(2)}\big(\Pi^{(2)}\Phi^{(3)} + \Phi^{(3)}\Pi^{(3)}\big)\big[\Pi^{(3)}\big]^{2}\theta_{t-3}^{(3)}, \tag{3}
$$

which suggests that $\{\Pi^{(l)}\}_{l=1}^{L}$ play the role of transmitting the latent representation across time and that, different from most existing dynamic models, DPGDS can capture and transmit long-range temporal information (often general and slowly changing over time) through its higher hidden layers.

## 3 Scalable MCMC inference

In each iteration, across layers and times, we first exploit a variety of data augmentation techniques for count data to backward and upward propagate auxiliary latent counts, with which we then downward and forward sample latent variables, leading to a Backward-Upward Forward-Downward (BUFD) Gibbs sampling algorithm.

Figure 2: Graphical representation of the model and of the data augmentation and marginalization based inference scheme. (a) An alternative representation of layer l = 1 using the relationships between the Poisson and multinomial distributions; (b) A negative binomial distribution based representation that marginalizes out the gamma from the Poisson distributions, corresponding to (4) for t = T; (c) An equivalent representation that introduces CRT distributed auxiliary variables, corresponding to (5); (d) An equivalent representation using P3, corresponding to (6); (e) An equivalent representation obtained by using P1, corresponding to (7); (f) A representation obtained by repeating the same augmentation-marginalization steps described in (a).

### 3.1 Backward and upward propagation of latent counts

Different from PGDS, which only requires backward propagation of latent counts, DPGDS requires both backward and upward propagation due to its deep hierarchical structure. To derive closed-form Gibbs sampling update equations, we exploit three useful properties for count data, denoted as P1, P2, and P3 [7, 24], respectively, as presented in the Appendix. Let us denote $x \sim \mathrm{NB}(r, p)$ as the negative binomial distribution with probability mass function $P(x = k) = \frac{\Gamma(k+r)}{k!\,\Gamma(r)}\, p^{k}(1-p)^{r}$, where $k \in \{0, 1, \ldots\}$.

First, we can augment each count $x_{vt}^{(1)}$ in (1) into the summation of $K_1$ smaller or equal latent counts as $x_{vt}^{(1)} = \sum_{k=1}^{K_1} A_{vkt}^{(1)}$, $A_{vkt}^{(1)} \sim \mathrm{Pois}\big(\delta_t^{(1)}\phi_{vk}^{(1)}\theta_{kt}^{(1)}\big)$, with $A_{kt}^{(1)} = \sum_{v=1}^{V} A_{vkt}^{(1)}$. Since $\sum_{v=1}^{V}\phi_{vk}^{(1)} = 1$ by construction, we also have $A_{kt}^{(1)} \sim \mathrm{Pois}\big(\delta_t^{(1)}\theta_{kt}^{(1)}\big)$, as shown in Fig. 2(a). We start with $\theta_T^{(1)}$ at the last time point $T$, as none of the other time-step factors depend on it in their priors. Via P2, as shown in Fig. 2(b), we can marginalize out $\theta_{kT}^{(1)}$ to obtain

$$
A_{kT}^{(1)} \sim \mathrm{NB}\Big(\tau_0\Big(\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)} + \sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big),\ g\big(\zeta_T^{(1)}\big)\Big), \tag{4}
$$

where $\zeta_T^{(1)} = \ln\big(1 + \delta_T^{(1)}/\tau_0\big)$ and $g(\zeta) = 1 - \exp(-\zeta)$. In order to marginalize out $\theta_{T-1}^{(1)}$, as shown in Fig. 2(c), we introduce an auxiliary variable following the Chinese restaurant table (CRT) distribution [24] as

$$
x_{kT}^{(2)} \sim \mathrm{CRT}\Big(A_{kT}^{(1)},\ \tau_0\Big(\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)} + \sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big)\Big). \tag{5}
$$
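For intuition, a draw $x \sim \mathrm{CRT}(m, r)$ can be generated as a sum of independent Bernoulli variables, $x = \sum_{i=1}^{m} b_i$ with $b_i \sim \mathrm{Bernoulli}\big(r/(r+i-1)\big)$ [24]. The sketch below pairs this sampler with the Poisson-multinomial augmentation of Fig. 2(a); the sizes and parameter values are hypothetical, and the code is only a schematic of these standard constructions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_crt(m, r):
    """Draw x ~ CRT(m, r): x = sum_{i=1}^m Bernoulli(r / (r + i - 1))."""
    if m == 0:
        return 0
    i = np.arange(1, m + 1)
    return int((rng.random(m) < r / (r + i - 1)).sum())

def augment_counts(x_t, Phi1, theta1_t):
    """Poisson-multinomial augmentation of Fig. 2(a): split each observed count x_vt over
    the K_1 factors with probabilities proportional to phi_vk^(1) * theta_kt^(1)."""
    V, K1 = Phi1.shape
    A = np.zeros((V, K1), dtype=int)
    for v in range(V):
        p = Phi1[v] * theta1_t
        A[v] = rng.multinomial(x_t[v], p / p.sum())
    return A                                    # A_kt^(1) = A.sum(axis=0)

# Toy illustration with hypothetical sizes.
V, K1 = 6, 3
Phi1 = rng.dirichlet(np.ones(V), size=K1).T     # columns of Phi^(1) lie on the simplex
theta1_t = rng.gamma(1.0, 1.0, size=K1)
x_t = rng.poisson(5.0, size=V)
A = augment_counts(x_t, Phi1, theta1_t)
print(A.sum(axis=0), sample_crt(int(A.sum()), r=2.0))
```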
As shown in Fig. 2(d), we re-express the joint distribution over $A_{kT}^{(1)}$ and $x_{kT}^{(2)}$ according to P3 as

$$
A_{kT}^{(1)} \sim \mathrm{SumLog}\big(x_{kT}^{(2)},\ g(\zeta_T^{(1)})\big), \qquad
x_{kT}^{(2)} \sim \mathrm{Pois}\Big[\zeta_T^{(1)}\tau_0\Big(\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)} + \sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big)\Big], \tag{6}
$$

where the sum-logarithmic (SumLog) distribution is defined as in Zhou and Carin [24]. Via P1, as in Fig. 2(e), the Poisson random variable $x_{kT}^{(2)}$ in (6) can be augmented as $x_{kT}^{(2)} = x_{kT}^{(2,1)} + x_{kT}^{(2,2)}$, where

$$
x_{kT}^{(2,1)} \sim \mathrm{Pois}\Big(\zeta_T^{(1)}\tau_0\sum_{k_1=1}^{K_1}\pi_{kk_1}^{(1)}\theta_{k_1,T-1}^{(1)}\Big), \qquad
x_{kT}^{(2,2)} \sim \mathrm{Pois}\Big(\zeta_T^{(1)}\tau_0\sum_{k_2=1}^{K_2}\phi_{kk_2}^{(2)}\theta_{k_2T}^{(2)}\Big). \tag{7}
$$

Due to the deep dynamic structure, the count $x_{kT}^{(2)}$ at layer two is thus divided into two parts: one coming from time $T-1$ at layer one, and the other from time $T$ at layer two. Furthermore, $\zeta_T^{(1)}$ acts as the scaling factor at layer two, propagated from the layer-one scaling factor $\delta_T^{(1)}$. Repeating the process all the way back to $t = 1$, and from $l = 1$ up to $l = L$, we are able to marginalize out all gamma latent variables $\{\Theta\}_{t=1,l=1}^{T,L}$ and provide closed-form conditional posteriors for all of them.

### 3.2 Backward-upward forward-downward Gibbs sampling

**Sampling auxiliary counts:** This step constitutes the backward and upward pass. Let us denote $Z_{kt}^{(l)} = \sum_{k_l=1}^{K_l} Z_{kk_lt}^{(l)}$, $Z_{k,T+1}^{(l)} = 0$, and $x_{vt}^{(1,1)} = x_{vt}^{(1)}$. Working backward for $t = T, \ldots, 2$ and upward for $l = 1, \ldots, L$, we draw

$$
\big(A_{k1t}^{(l)}, \ldots, A_{kK_lt}^{(l)}\big) \sim \mathrm{Multi}\Bigg(x_{kt}^{(l,l)};\ \frac{\phi_{k1}^{(l)}\theta_{1t}^{(l)}}{\sum_{k_l=1}^{K_l}\phi_{kk_l}^{(l)}\theta_{k_lt}^{(l)}},\ \ldots,\ \frac{\phi_{kK_l}^{(l)}\theta_{K_lt}^{(l)}}{\sum_{k_l=1}^{K_l}\phi_{kk_l}^{(l)}\theta_{k_lt}^{(l)}}\Bigg), \tag{8}
$$

$$
x_{kt}^{(l+1)} \sim \mathrm{CRT}\Big(A_{kt}^{(l)} + Z_{k,t+1}^{(l)},\ \tau_0\Big(\sum_{k_{l+1}=1}^{K_{l+1}}\phi_{kk_{l+1}}^{(l+1)}\theta_{k_{l+1}t}^{(l+1)} + \sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}\Big)\Big). \tag{9}
$$

Note that, via the deep structure, the latent counts $x_{kt}^{(l+1)}$ are influenced by the effects of both time $t-1$ at layer $l$ and time $t$ at layer $l+1$. With $p_1 := \sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}$ and $p_2 := \sum_{k_{l+1}=1}^{K_{l+1}}\phi_{kk_{l+1}}^{(l+1)}\theta_{k_{l+1}t}^{(l+1)}$, we can split the latent counts between layers $l$ and $l+1$ by

$$
\big(x_{kt}^{(l+1,l)},\ x_{kt}^{(l+1,l+1)}\big) \sim \mathrm{Multi}\big(x_{kt}^{(l+1)};\ p_1/(p_1+p_2),\ p_2/(p_1+p_2)\big), \tag{10}
$$

and then draw

$$
\big(Z_{k1t}^{(l)}, \ldots, Z_{kK_lt}^{(l)}\big) \sim \mathrm{Multi}\Bigg(x_{kt}^{(l+1,l)};\ \frac{\pi_{k1}^{(l)}\theta_{1,t-1}^{(l)}}{\sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}},\ \ldots,\ \frac{\pi_{kK_l}^{(l)}\theta_{K_l,t-1}^{(l)}}{\sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}}\Bigg). \tag{11}
$$

**Sampling hidden units $\theta_t^{(l)}$ and calculating $\zeta_t^{(l)}$:** Given the augmented latent count variables, working forward for $t = 1, \ldots, T$ and downward for $l = L, \ldots, 1$, we can sample

$$
\theta_{kt}^{(l)} \sim \mathrm{Gam}\Big(A_{kt}^{(l)} + Z_{k,t+1}^{(l)} + \tau_0\Big(\sum_{k_{l+1}=1}^{K_{l+1}}\phi_{kk_{l+1}}^{(l+1)}\theta_{k_{l+1}t}^{(l+1)} + \sum_{k_l=1}^{K_l}\pi_{kk_l}^{(l)}\theta_{k_l,t-1}^{(l)}\Big),\ \tau_0\big(1 + \zeta_t^{(l-1)} + \zeta_{t+1}^{(l)}\big)\Big), \tag{12}
$$

where $\zeta_t^{(0)} = \delta_t^{(1)}/\tau_0$ and $\zeta_t^{(l)} = \ln\big(1 + \zeta_t^{(l-1)} + \zeta_{t+1}^{(l)}\big)$. Note that if $\delta_t^{(1)} = \delta^{(1)}$ for $t = 1, \ldots, T$, then we may let $\zeta^{(l)} = -W_{-1}\big(-\exp(-1 - \zeta^{(l-1)})\big) - 1 - \zeta^{(l-1)}$, where $W_{-1}$ is the lower real branch of the Lambert W function [7, 25]. From (12), we can see that the conditional posterior of $\theta_t^{(l)}$ is parameterized not only by $\Phi^{(l+1)}\theta_t^{(l+1)}$ and $\Pi^{(l)}\theta_{t-1}^{(l)}$, which carry the information from layer $l+1$ (downward) and time $t-1$ (forward), respectively, but also by $A_{kt}^{(l)}$ and $Z_{k,t+1}^{(l)}$, which record the messages from layer $l-1$ (upward) in (8) and time $t+1$ (backward) in (11), respectively. We describe the BUFD Gibbs sampling algorithm for DPGDS in Algorithm 1 and provide more details in the Appendix.
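As a concrete illustration of the forward-downward pass, the following sketch computes the $\zeta$ recursion and performs one conditional draw of $\theta_t^{(l)}$ as in (12), assuming the augmented counts and the neighboring hidden units are already available. Variable names and sizes are placeholders, and the top-layer $\nu^{(L)}$ term at $t = 1$ is omitted for brevity; this is a sketch of the update, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def compute_zeta(delta, tau0, L, T):
    """Table zeta[l, t]: zeta_t^(0) = delta_t / tau0, zeta_{T+1}^(l) = 0, and
    zeta_t^(l) = ln(1 + zeta_t^(l-1) + zeta_{t+1}^(l)), filled backward in time."""
    zeta = np.zeros((L + 1, T + 2))
    zeta[0, 1:T + 1] = delta / tau0
    for l in range(1, L + 1):
        for t in range(T, 0, -1):
            zeta[l, t] = np.log1p(zeta[l - 1, t] + zeta[l, t + 1])
    return zeta

def sample_theta_lt(A_t, Z_tp1, Phi_up, theta_up_t, Pi, theta_tm1, tau0, zeta_below_t, zeta_tp1):
    """One conditional draw of theta_t^(l) from Eq. (12).
    Pass Phi_up/theta_up_t = None at the top layer and theta_tm1 = None at t = 1."""
    shape = A_t + Z_tp1
    if Phi_up is not None:
        shape = shape + tau0 * (Phi_up @ theta_up_t)   # downward message from layer l+1
    if theta_tm1 is not None:
        shape = shape + tau0 * (Pi @ theta_tm1)        # forward message from time t-1
    rate = tau0 * (1.0 + zeta_below_t + zeta_tp1)      # backward messages enter via zeta
    return rng.gamma(shape, 1.0 / rate)

# Toy call with hypothetical sizes K_l = 4, K_{l+1} = 2, for layer l = 1 at t = 5.
zeta = compute_zeta(delta=2.0, tau0=1.0, L=3, T=10)
theta = sample_theta_lt(A_t=np.array([3., 0., 1., 2.]), Z_tp1=np.zeros(4),
                        Phi_up=rng.dirichlet(np.ones(4), size=2).T,
                        theta_up_t=rng.gamma(1., 1., size=2),
                        Pi=rng.dirichlet(np.ones(4), size=4).T,
                        theta_tm1=rng.gamma(1., 1., size=4),
                        tau0=1.0, zeta_below_t=zeta[0, 5], zeta_tp1=zeta[1, 6])
print(theta)
```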
### 3.3 Stochastic gradient MCMC inference

Although the proposed BUFD Gibbs sampling algorithm for DPGDS has closed-form update equations, it requires processing all time-varying vectors at each iteration and hence has limited scalability [26]. To allow for scalable inference, we apply the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC algorithm described in Cong et al. [27] and Zhang et al. [26], which can be used to sample simplex-constrained global parameters [28] in a mini-batch based manner. It improves sampling efficiency via the use of the Fisher information matrix (FIM) [29], with adaptive step sizes for the latent factors and transition matrices of different layers. More specifically, for $\pi_k^{(l)}$, column $k$ of the transition matrix $\Pi^{(l)}$ of layer $l$, its sampling can be efficiently realized as

$$
\big(\pi_k^{(l)}\big)_{n+1} = \bigg[\big(\pi_k^{(l)}\big)_n + \frac{\varepsilon_n}{M_k^{(l)}}\Big(\rho\big(z_{:k}^{(l)} + \eta_{:k}^{(l)}\big) - \rho\big(z_{\cdot k}^{(l)} + \eta_{\cdot k}^{(l)}\big)\big(\pi_k^{(l)}\big)_n\Big) + \mathcal{N}\Big(0,\ \frac{2\varepsilon_n}{M_k^{(l)}}\Big[\mathrm{diag}\big(\pi_k^{(l)}\big)_n - \big(\pi_k^{(l)}\big)_n\big(\pi_k^{(l)}\big)_n^{\mathsf{T}}\Big]\Big)\bigg]_{\angle}, \tag{13}
$$

where $\varepsilon_n$ is the step size at iteration $n$, $M_k^{(l)}$ is calculated using the estimated FIM, $\rho$ rescales the mini-batch statistics to the full sequence, both $z_{:k}^{(l)}$ and its sum $z_{\cdot k}^{(l)}$ come from the augmented latent counts $Z^{(l)}$, $[\cdot]_{\angle}$ denotes the simplex constraint, and $\eta_{:k}^{(l)}$ denotes the prior of $\pi_k^{(l)}$. The update of $\Phi^{(l)}$ is the same as in Cong et al. [27], and all the other global parameters are sampled using SGNHT [20]. We provide the details of the SGMCMC for DPGDS in Algorithm 2 in the Appendix.
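The structure of such a simplex-constrained update can be sketched as a preconditioned natural-gradient step plus Gaussian noise, followed by projection back onto the simplex. In the snippet below, the step size, preconditioner $M_k^{(l)}$, mini-batch scaling $\rho$, and the final clip-and-renormalize projection are placeholder choices; it is only a schematic of the TLASGR-style update (13), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def tlasgr_simplex_step(pi_k, z_k, eta_k, rho, eps_n, M_k, floor=1e-8):
    """Schematic TLASGR-style update of one simplex-constrained column (cf. Eq. (13)).
    pi_k: current column; z_k: mini-batch latent counts for this column; eta_k: Dirichlet
    prior vector; rho: mini-batch-to-full-data scaling; eps_n: step size; M_k: FIM-based
    preconditioner (a scalar placeholder here)."""
    grad = rho * (z_k + eta_k) - rho * (z_k + eta_k).sum() * pi_k
    cov = (2.0 * eps_n / M_k) * (np.diag(pi_k) - np.outer(pi_k, pi_k))
    # The covariance diag(pi) - pi pi^T is PSD but singular, so skip the strict validity check.
    noise = rng.multivariate_normal(np.zeros_like(pi_k), cov, check_valid='ignore')
    proposal = pi_k + (eps_n / M_k) * grad + noise
    proposal = np.maximum(proposal, floor)        # crude surrogate for the simplex constraint [.]
    return proposal / proposal.sum()

K = 5
pi_k = rng.dirichlet(np.ones(K))
z_k = rng.poisson(3.0, size=K).astype(float)
print(tlasgr_simplex_step(pi_k, z_k, eta_k=np.full(K, 0.1), rho=10.0, eps_n=1e-3, M_k=1.0))
```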
## 4 Experiments

In this section, we present experimental results on a synthetic dataset and five real-world datasets. For a fair comparison, we consider PGDS [7], GP-DPFA [5], DTSBN [4], and GPDM [11], which can be considered as a dynamic generalization of the Gaussian process latent variable model of Lawrence [30], using the code provided by the authors. Note that, as shown by Schein et al. [7] and Gan et al. [4], PGDS and DTSBN are state-of-the-art count time series modeling algorithms that outperform a wide variety of previously proposed ones, such as LDS [12] and DRFM [31]. The hyperparameter settings of PGDS, GP-DPFA, GPDM, TSBN, and DTSBN are the same as in their original papers [4, 5, 7, 11]. For DPGDS, we set $\tau_0 = 1$, $\gamma_0 = 100$, $\eta_0 = 0.1$, and $\epsilon_0 = 0.1$. We use $[K^{(1)}, K^{(2)}, K^{(3)}] = [200, 100, 50]$ for both DPGDS and DTSBN and $K = 200$ for PGDS, GP-DPFA, GPDM, and TSBN. For PGDS, GP-DPFA, GPDM, and DPGDS, we run 2000 Gibbs sampling iterations as burn-in and collect 3000 samples for evaluation. We also use SGMCMC to infer DPGDS, with 5000 collection samples after 5000 burn-in steps, and use 10000 SGMCMC iterations for both TSBN and DTSBN to evaluate their performance.

### 4.1 Synthetic dataset

Following the literature [1, 4], we consider sequences of different lengths, including T = 10, 50, 100, 200, 300, 400, 500, and 600, and generate 50 synthetic bouncing ball videos for training and 30 for testing. Each video frame is a binary-valued image of size 30 x 30, describing the locations of three balls within the image. Both TSBN and DTSBN model the data with the Bernoulli likelihood, while both PGDS and DPGDS use the Bernoulli-Poisson link [22].

Figure 3: Results on the bouncing ball data set. (a) Shown in the first to third columns are the top fifteen latent factors learned by a three-hidden-layer DPGDS at layers 1, 2, and 3, respectively; (b) The average prediction errors as a function of the sequence length for various algorithms (TSBN-1, TSBN-4, DTSBN, PGDS, and DPGDS).

As shown in Fig. 3(b), the average prediction errors of all algorithms decrease as the training sequence length increases. A higher-order TSBN, TSBN-4, performs much better than the first-order TSBN does, suggesting that using high-order messages can help TSBN better pass useful information. As discussed above, since a deep structure provides a natural way to propagate high-order information for prediction, it is not surprising to find that both DTSBN and DPGDS, which are multilayer models, exhibit superior performance. Moreover, it is clear that the proposed DPGDS consistently outperforms DTSBN under all settings. Another advantage of DPGDS is that its inferred deep latent structure often has a meaningful interpretation. As shown in Fig. 3(a), for the bouncing ball data, the inferred factors at layer one represent points or pixels, those at layer two cover larger spatially contiguous regions, some of which exhibit the shape of a single bouncing ball, and those at layer three are able to capture multiple bouncing balls. In addition, we show in Appendix B the one-step prediction frames of different models.

### 4.2 Real-world datasets

Besides the binary-valued synthetic bouncing ball dataset, we quantitatively and qualitatively evaluate all algorithms on the following real-world datasets used in Schein et al. [7]. The State-of-the-Union (SOTU) dataset consists of the text of the annual SOTU speech transcripts from 1790 to 2014. The Global Database of Events, Language, and Tone (GDELT) and the Integrated Crisis Early Warning System (ICEWS) are both international relations datasets extracted from news corpora; ICEWS consists of undirected pairs of countries, while GDELT consists of directed pairs. The NIPS corpus contains the text of every NIPS conference paper from 1987 to 2003. The DBLP corpus is a database of computer science research papers. Each of these datasets is summarized as a V x T count matrix, as shown in Tab. 1. Unless specified otherwise, we choose the top 1000 most frequently used terms to form the vocabulary, i.e., we set V = 1000 for all real-data experiments.

Table 1: Top-M results on real-world text data (mean ± standard deviation over five random trials; MP: mean precision, MR: mean recall, PP: predictive precision).

| Model | Metric | GDELT (T=365) | ICEWS (T=365) | SOTU (T=225) | DBLP (T=14) | NIPS (T=17) |
|---|---|---|---|---|---|---|
| GP-DPFA | MP | 0.611±0.001 | 0.607±0.002 | 0.379±0.002 | 0.435±0.009 | 0.843±0.005 |
| | MR | 0.145±0.002 | 0.235±0.005 | 0.369±0.002 | 0.254±0.005 | 0.050±0.001 |
| | PP | 0.447±0.014 | 0.465±0.008 | 0.617±0.013 | 0.581±0.011 | 0.807±0.006 |
| PGDS | MP | 0.679±0.001 | 0.658±0.001 | 0.375±0.002 | 0.419±0.004 | 0.864±0.004 |
| | MR | 0.150±0.001 | 0.245±0.005 | 0.373±0.002 | 0.252±0.004 | 0.050±0.001 |
| | PP | 0.420±0.017 | 0.455±0.008 | 0.612±0.018 | 0.566±0.008 | 0.802±0.020 |
| GPDM | MP | 0.520±0.001 | 0.530±0.002 | 0.274±0.001 | 0.388±0.004 | 0.355±0.008 |
| | MR | 0.141±0.001 | 0.234±0.001 | 0.261±0.002 | 0.146±0.005 | 0.050±0.001 |
| | PP | 0.362±0.021 | 0.185±0.017 | 0.587±0.016 | 0.509±0.008 | 0.384±0.028 |
| TSBN | MP | 0.594±0.007 | 0.471±0.001 | 0.360±0.001 | 0.403±0.012 | 0.788±0.005 |
| | MR | 0.124±0.001 | 0.158±0.001 | 0.275±0.001 | 0.194±0.001 | 0.050±0.001 |
| | PP | 0.418±0.019 | 0.445±0.031 | 0.611±0.001 | 0.527±0.003 | 0.692±0.017 |
| DTSBN-2 | MP | 0.439±0.001 | 0.475±0.002 | 0.370±0.004 | 0.407±0.003 | 0.756±0.001 |
| | MR | 0.134±0.001 | 0.208±0.001 | 0.361±0.001 | 0.248±0.007 | 0.050±0.001 |
| | PP | 0.391±0.001 | 0.446±0.001 | 0.587±0.027 | 0.522±0.005 | 0.737±0.004 |
| DTSBN-3 | MP | 0.411±0.001 | 0.431±0.001 | 0.450±0.008 | 0.390±0.002 | 0.774±0.002 |
| | MR | 0.141±0.001 | 0.189±0.001 | 0.274±0.001 | 0.252±0.004 | 0.050±0.001 |
| | PP | 0.367±0.011 | 0.451±0.026 | 0.548±0.013 | 0.510±0.006 | 0.715±0.009 |
| DPGDS-2 | MP | 0.688±0.002 | 0.659±0.001 | 0.379±0.002 | 0.430±0.009 | 0.867±0.008 |
| | MR | 0.149±0.001 | 0.242±0.007 | 0.373±0.001 | 0.254±0.005 | 0.050±0.001 |
| | PP | 0.443±0.025 | 0.473±0.012 | 0.622±0.014 | 0.582±0.007 | 0.814±0.035 |
| DPGDS-3 | MP | 0.689±0.002 | 0.660±0.001 | 0.380±0.001 | 0.431±0.012 | 0.887±0.002 |
| | MR | 0.150±0.001 | 0.244±0.003 | 0.374±0.002 | 0.255±0.004 | 0.050±0.001 |
| | PP | 0.456±0.015 | 0.478±0.024 | 0.628±0.021 | 0.600±0.001 | 0.839±0.007 |
### 4.2.1 Quantitative comparison

For a fair and comprehensive comparison, we calculate the precision and recall at top-M [4, 5, 31, 32], which are calculated, respectively, as the fraction of the top-M words that match the true ranking of the words and the fraction that appear in the top-M ranking, with M = 50. We also use the Mean Precision (MP) and Mean Recall (MR) over all the years appearing in the training set to evaluate the different models. As another criterion, the Predictive Precision (PP) is the predictive precision for the final year, for which all the observations are held out. Similar to previous methods [4, 5], for each corpus the entire data of the last year is held out, and for the documents in the previous years we randomly partition the words of each document into 80%/20% in each trial; we conduct five random trials to report the sample mean and standard deviation. Note that, to apply GPDM, we use the Anscombe transform [33] to preprocess the count data and mitigate the mismatch between the data and the model assumption.

The results on all five datasets are summarized in Tab. 1, which clearly shows that the proposed DPGDS achieves the best performance on most of the evaluation criteria, and again a deep model often improves its performance by increasing its number of layers. To add more empirical study on scalability, we also test the efficiency of our model on a GDELT dataset (from 2001 to 2005, temporal granularity of 24 hrs, with a total of 1825 time points), which is not too large, so that we can still run DPGDS-Gibbs and GPDM.

Figure 4: MP as a function of time for GDELT.

As shown in Fig. 4, we present how various algorithms progress over time, evaluated with MP. It takes about 1000 s for DTSBN and DPGDS-SGMCMC to converge, 3.5 hrs for DPGDS-Gibbs, and 5 hrs for GPDM. Our DPGDS-SGMCMC is thus scalable and clearly outperforms both DTSBN and GPDM. We also present in Appendix C the results of DPGDS-SGMCMC on a very long time series, on which it becomes too expensive to run a batch learning algorithm.

### 4.2.2 Exploratory data analysis

Compared to previously proposed dynamical systems, the proposed DPGDS, whose inferred latent structure is simple to visualize, provides much richer interpretation. More specifically, we may not only exhibit the content of each factor (topic), but also explore both the hierarchical relationships between topics at different layers and the temporal relationships between topics at the same layer. Based on the results inferred on ICEWS 2001-2003 via a three-hidden-layer DPGDS with layer sizes 200-100-50, we show in Fig. 5 how some example topics are hierarchically and temporally related to each other, and how their corresponding latent representations evolve over time. In Fig. 5(a), we select two large-weighted topics at the top hidden layer and move down the network to include any lower-layer topics that are connected to them with sufficiently large weights. For each topic, we list all its terms whose values are larger than 1% of the largest element of the topic.
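As a sketch of how such topic lists can be produced from inferred loadings, the snippet below projects a layer-$l$ topic to the vocabulary via $\big[\prod_{p=1}^{l-1}\Phi^{(p)}\big]\phi_k^{(l)}$ and keeps the terms whose projected weights exceed 1% of the topic's largest weight. The loading matrices and vocabulary here are placeholders, not inferred values.

```python
import numpy as np

rng = np.random.default_rng(5)

def project_topic(Phi_list, l, k):
    """Project topic k of layer l to the bottom (vocabulary) layer:
    [prod_{p=1}^{l-1} Phi^(p)] phi_k^(l), with Phi_list[p-1] holding Phi^(p)."""
    proj = Phi_list[l - 1][:, k]               # phi_k^(l), length K_{l-1}
    for p in range(l - 2, -1, -1):             # multiply by Phi^(l-1), ..., Phi^(1)
        proj = Phi_list[p] @ proj
    return proj                                # length V, sums to 1

def list_topic_terms(proj, vocab, rel_threshold=0.01):
    """Keep terms whose projected weight exceeds rel_threshold of the largest weight."""
    keep = proj >= rel_threshold * proj.max()
    order = np.argsort(-proj)
    return [(vocab[v], float(proj[v])) for v in order if keep[v]]

# Placeholder loadings and vocabulary, with layer sizes V = 8, K_1 = 4, K_2 = 2.
V, K1, K2 = 8, 4, 2
Phi_list = [rng.dirichlet(np.ones(V), size=K1).T, rng.dirichlet(np.ones(K1), size=K2).T]
vocab = [f"term{v}" for v in range(V)]
print(list_topic_terms(project_topic(Phi_list, l=2, k=0), vocab))
```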
It is interesting to note that topic 2 at layer three is connected to three topics at layer two, which are characterized mainly by the interactions of Israel (ISR)-Palestinian Territory (PSE), Iraq (IRQ)-USA-Iran (IRN), and North Korea (PRK)-South Korea (KOR)-USA-China (CHN)-Japan (JPN), respectively. The activation strength of one of these three interactions, known to be dominant in general during 2001-2003, can arise not only from a large activation of topic 2 at layer three, but also from a large activation of some other topic of the same layer (layer two) at the previous time. For example, topic 41 of layer two on ISR-PSE, IND-PAK, RUS-UKR, GEO-RUS, AFG-PAK, SYR-USA, MNE-SRB could be associated with the activation of topic 46 of layer two on IND-PAK, RUS-TUR, ISR-PSE, BLR-RUS at the previous time; and topic 99 of layer two on PRK-KOR, JPN-USA, CHN-USA, CHN-KOR, CHN-JPN, USA-RUS could be associated with the activation of topic 63 of layer two on IRN-USA, CHN-USA, AUS-CHN, CHN-KOR at the previous time. Another instructive observation is that topic 140 of layer one on IRQ-USA, IRQ-GBR, IRN-IRQ, IRQ-KWT, AUS-IRQ is related not only in hierarchy to topic 34 of the higher layer on IRQ-USA, IRQ-GBR, GBR-USA, IRQ-KWT, IRN-IRQ, SYR-USA, but also in time to topic 166 of the same layer on ESP-USA, ESP-GBR, FRA-GBR, POR-USA, which are interactions between member states of the North Atlantic Treaty Organization (NATO). Based on the transitions from topic 13 on PRK-KOR to both topic 140 on IRQ-USA and topic 77 on ISR-PSE, we can find that the ongoing Iraq war and Israeli-Palestinian relations regained attention after the six-party talks [7].

To gain insight into the benefits attributable to the deep structure, Fig. 5(b) shows how the latent representations of several representative topics evolve over days. It is clear that, relative to the temporal factor trajectories at the bottom layer, which are specific to the bilateral interactions between two countries, those from higher layers vary more smoothly, and their corresponding higher-layer topics capture the multilateral interactions between multiple closely related countries. Similar phenomena have also been demonstrated in Fig. 1(b) on GDELT2003. Moreover, we find that a spike in the temporal trajectory of topic 166 (NATO) appears right before one of topic 140 (Iraq war), matching the description of Fig. 5(a) above. Also, topic 14 of layer three and its descendants, including topic 23 of layer two and topic 48 of layer one, are mainly about a breakthrough between RUS and Azerbaijan (AZE), coinciding with Putin's visit in January 2001. Additional example results for the topics and their hierarchical and temporal relationships, inferred by DPGDS on different datasets, are provided in the Appendix.
Figure 5: Topics and their temporal trajectories inferred by a three-hidden-layer DPGDS from the ICEWS 2001-2003 dataset (best viewed in color). (a) Some example topics that are hierarchically or temporally related; (b) The temporal trajectories of some inferred latent topics at layers 1, 2, and 3.

Figure 6: Learned transition structure on ICEWS 2001-2003 from the same DPGDS depicted in Fig. 5. Shown in (a)-(c) are transition matrices for layers 1, 2, and 3, respectively, with a darker color indicating a larger transition weight (between 0 and 1).

In Fig. 6, we also present a subset of the transition matrix $\Pi^{(l)}$ of each layer, corresponding to the top ten topics, some of which have been displayed in Fig. 5(b). The transition matrix $\Pi^{(l)}$ captures the cross-topic temporal dependence at layer $l$. From Fig. 6, besides the temporal transitions between topics at the same layer, we can also see that, as the layer index $l$ increases, the transition matrix $\Pi^{(l)}$ more closely approaches a diagonal matrix, meaning that the factors become more likely to transition to themselves. This matches the characteristic of DPGDS that topics at higher layers cover longer-range temporal dependencies and contain more general information, as shown in Fig. 5(a). With both the hierarchical connections between layers and the dynamic transitions within each layer, and distinct from the shallow PGDS, DPGDS is equipped with a larger capacity to model diverse temporal patterns with the help of its deep structure.
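One simple way to quantify this observation (a sketch, not part of the paper's evaluation protocol) is to compare the average self-transition probability of the inferred column-stochastic $\Pi^{(l)}$ across layers:

```python
import numpy as np

def diagonal_mass(Pi):
    """Average self-transition probability of a column-stochastic transition matrix Pi^(l)."""
    return float(np.mean(np.diag(Pi)))

# With inferred transition matrices Pi_per_layer = [Pi1, Pi2, Pi3] (placeholders here),
# a larger value at a higher layer indicates topics that tend to persist, i.e., slower dynamics.
# print([diagonal_mass(Pi) for Pi in Pi_per_layer])
```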
## 5 Conclusions

We propose deep Poisson gamma dynamical systems (DPGDS) that take advantage of a probabilistic deep hierarchical structure to efficiently capture both across-layer and temporal dependencies. The inferred latent structure provides rich interpretation for both hierarchical and temporal information propagation. For Bayesian inference, we develop both a Backward-Upward Forward-Downward Gibbs sampler and a stochastic gradient MCMC (SGMCMC) algorithm that is scalable to long multivariate count/binary time series. Experimental results on a variety of datasets show that DPGDS not only exhibits excellent predictive performance, but also provides highly interpretable latent structure.

## Acknowledgements

D. Guo, B. Chen, and H. Zhang acknowledge the support of the Program for Young Thousand Talent by the Chinese Central Government, the 111 Project (No. B18039), NSFC (61771361), NSFC for Distinguished Young Scholars (61525105), and the Innovation Fund of Xidian University. M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation.

## References

[1] I. Sutskever and G. E. Hinton, Learning multilevel distributed representations for high-dimensional sequences, in AISTATS, 2007.
[2] C. Wang, D. Blei, and D. Heckerman, Continuous time dynamic topic models, in UAI, 2008, pp. 579-586.
[3] M. Hermans and B. Schrauwen, Training and analysing deep recurrent neural networks, in NIPS, 2013, pp. 190-198.
[4] Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin, Deep temporal sigmoid belief networks for sequence modeling, in NIPS, 2015, pp. 2467-2475.
[5] A. Acharya, J. Ghosh, and M. Zhou, Nonparametric Bayesian factor analysis for dynamic count matrices, in AISTATS, 2015.
[6] L. Charlin, R. Ranganath, J. McInerney, and D. M. Blei, Dynamic Poisson factorization, in ACM RecSys, 2015, pp. 155-162.
[7] A. Schein, M. Zhou, and H. Wallach, Poisson gamma dynamical systems, in NIPS, 2016.
[8] C. Y. Gong and W. Huang, Deep dynamic Poisson factorization model, in NIPS, 2017.
[9] S. A. Hosseini et al., Recurrent Poisson factorization for temporal recommendation, in KDD, 2017, pp. 847-855.
[10] Z. Ghahramani and S. T. Roweis, Learning nonlinear dynamical systems using an EM algorithm, in NIPS, 1999, pp. 431-437.
[11] J. M. Wang, D. J. Fleet, and A. Hertzmann, Gaussian process dynamical models, in NIPS, 2006.
[12] R. E. Kalman, Mathematical description of linear dynamical systems, Journal of the Society for Industrial and Applied Mathematics, Series A: Control, vol. 1, no. 2, pp. 152-192, 1963.
[13] R. M. Neal, Connectionist learning of belief networks, Artificial Intelligence, vol. 56, no. 1, pp. 71-113, 1992.
[14] R. Ranganath, L. Tang, L. Charlin, and D. M. Blei, Deep exponential families, in AISTATS, 2014, pp. 762-771.
[15] M. Zhou, Y. Cong, and B. Chen, The Poisson gamma belief network, in NIPS, 2015, pp. 3043-3051.
[16] R. Henao, Z. Gan, J. T. Lu, and L. Carin, Deep Poisson factor modeling, in NIPS, 2015, pp. 2800-2808.
[17] Y. A. Ma, T. Chen, and E. B. Fox, A complete recipe for stochastic gradient MCMC, in NIPS, 2015, pp. 2917-2925.
[18] M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in ICML, 2011, pp. 681-688.
[19] S. Patterson and Y. W. Teh, Stochastic gradient Riemannian Langevin dynamics on the probability simplex, in NIPS, 2013, pp. 3102-3110.
[20] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven, Bayesian sampling using stochastic gradient thermostats, in NIPS, 2014, pp. 3203-3211.
[21] C. Li, C. Chen, D. Carlson, and L. Carin, Preconditioned stochastic gradient Langevin dynamics for deep neural networks, in AAAI, 2016, pp. 1788-1794.
[22] M. Zhou, Infinite edge partition models for overlapping community detection and link prediction, in AISTATS, 2015, pp. 1135-1143.
[23] M. Zhou, Y. Cong, and B. Chen, Augmentable gamma belief networks, Journal of Machine Learning Research, vol. 17, no. 163, pp. 1-44, 2016.
[24] M. Zhou and L. Carin, Negative binomial process count and mixture modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 307-320, 2015.
[25] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, On the Lambert W function, Advances in Computational Mathematics, vol. 5, no. 1, pp. 329-359, 1996.
[26] H. Zhang, B. Chen, D. Guo, and M. Zhou, WHAI: Weibull hybrid autoencoding inference for deep topic modeling, in ICLR, 2018.
[27] Y. Cong, B. Chen, H. Liu, and M. Zhou, Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC, in ICML, 2017.
[28] Y. Cong, B. Chen, and M. Zhou, Fast simulation of hyperplane-truncated multivariate normal distributions, Bayesian Analysis, vol. 12, no. 4, pp. 1017-1037, 2017.
[29] M. A. Girolami and B. Calderhead, Riemann manifold Langevin and Hamiltonian Monte Carlo methods, Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 73, no. 2, pp. 123-214, 2011.
[30] N. D. Lawrence, Probabilistic non-linear principal component analysis with Gaussian process latent variable models, Journal of Machine Learning Research, vol. 6, pp. 1783-1816, 2005.
[31] S. Han, L. Du, E. Salazar, and L. Carin, Dynamic rank factor model for text streams, in NIPS, 2014, pp. 2663-2671.
[32] P. Gopalan, F. J. R. Ruiz, R. Ranganath, and D. M. Blei, Bayesian nonparametric Poisson factorization for recommendation systems, in AISTATS, 2014.
[33] F. J. Anscombe, The transformation of Poisson, binomial and negative-binomial data, Biometrika, vol. 35, no. 3/4, pp. 246-254, 1948.
[34] D. B. Dunson and A. H. Herring, Bayesian latent variable models for mixed discrete outcomes, Biostatistics, vol. 6, no. 1, pp. 11-25, 2005.
[35] M. Zhou, L. Hannah, D. B. Dunson, and L. Carin, Beta-negative binomial process and Poisson factor analysis, in AISTATS, 2012, pp. 1462-1471.
[36] M. Zhou, Nonparametric Bayesian negative binomial factor analysis, Bayesian Analysis, vol. 13, no. 4, pp. 1061-1089, 2018.