# Stochastic Nonparametric Event-Tensor Decomposition

Shandian Zhe, Yishuai Du
School of Computing, University of Utah
zhe@cs.utah.edu, yishuai.du@utah.edu

## Abstract

Tensor decompositions are fundamental tools for multiway data analysis. Existing approaches, however, ignore the valuable temporal information that comes with data, or simply discretize it into time steps so that important temporal patterns are easily missed. Moreover, most methods are limited to multilinear decomposition forms and hence are unable to capture intricate, nonlinear relationships in data. To address these issues, we formulate event-tensors to preserve the complete temporal information for multiway data, and propose a novel Bayesian nonparametric decomposition model. Our model can (1) fully exploit the time stamps to capture the critical, causal/triggering effects between the interaction events, (2) flexibly estimate the complex relationships between the entities in tensor modes, and (3) uncover hidden structures from their temporal interactions. For scalable inference, we develop a doubly stochastic variational Expectation-Maximization algorithm to conduct an online decomposition. Evaluations on both synthetic and real-world datasets show that our model not only improves upon the predictive performance of existing methods, but also discovers interesting clusters underlying the data.

## 1 Introduction

Tensors represent the high-order interactions between entities in multiway data. Such interactions are ubiquitous in real-world applications. For instance, in online shopping, users purchase commodities under different web contexts; these interactions can be represented by a three-mode tensor (user, commodity, web context). To analyze tensor data, we use decomposition approaches, where we jointly estimate a set of latent factors for each entity and the mapping between the latent factors and tensor entry values. The latent factors can reveal hidden structures of the entities, such as clusters/communities; the mapping characterizes the entities' relationships (in terms of their factor representations) and can be used to predict missing entry values.

Despite the wide success of existing tensor decomposition algorithms (Tucker, 1966; Harshman, 1970; Kang et al., 2012; Choi and Vishwanathan, 2014), most methods assume a simple multilinear decomposition form, which might be insufficient to estimate intricate, nonlinear relationships in data. More importantly, most methods ignore the valuable temporal information that comes with data, or exploit it in a relatively coarse way. For instance, the time stamp of each interaction is usually abandoned and only the counts are used for count tensor decomposition (Chi and Kolda, 2012; Hansen et al., 2015; Hu et al., 2015b). More elegant approaches (Xiong et al., 2010; Schein et al., 2015, 2016) discretize the time stamps into steps, e.g., weeks/months, and use a set of time factors to represent each step. The tensor is hence augmented with a time mode. The decomposition may further use Markov assumptions to encourage smooth transitions between the time factors (Xiong et al., 2010). However, in each time step, the occurrences of the interactions are treated independently. Hence, important temporal patterns, such as causal/triggering effects in adjacent interactions, cannot be well modeled or captured.
To address these issues, we first formulate a new data abstraction, the event-tensor, to preserve all the time stamps for multiway data. In an event-tensor, each entry comprises a sequence of interaction events rather than a numerical value. Second, we propose a powerful Bayesian nonparametric model to decompose event-tensors (Section 3). We hybridize latent Gaussian processes and Hawkes processes to capture the various excitation effects among the observed interaction events, as well as the underlying complex relationships between the entities that participated in the events. Furthermore, we design a novel triggering function that enables discovering clusters of entities (or latent factors) in terms of excitation strengths. In addition, the triggering function allows us to flexibly specify the triggering range (say, via domain knowledge) to better capture local excitations and to control the trade-off against computational cost. Finally, to handle data where both the tensor entries and the interaction events are many, we derive a fully decomposed variational model evidence lower bound by using the Poisson process superposition theorem and the variational sparse Gaussian process framework (Titsias, 2009). Based on this bound, we develop a doubly stochastic variational Expectation-Maximization algorithm that fulfills a scalable, online decomposition (Section 4).

For evaluation, we examined our model in terms of both predictive performance and structure discovery. On three real-world datasets, our model often largely improves upon the prediction accuracy of existing methods that use Poisson processes and/or time factors to incorporate temporal information. A simulation shows that the latent factors estimated by our model clearly reflect the ground-truth clusters, while those estimated by the competing methods do not. We further examined the structures discovered by our model on the real-world datasets and found many interesting patterns, such as groups of 911 accidents with strong associations, locations of townships that are apt to have consecutive accidents, and UFO shapes that are more likely to be sighted together (Section 6).

## 2 Background

**Tensor Decomposition.** We denote a $K$-mode tensor by $\mathcal{M} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$, where $d_k$ is the dimension of the $k$-th mode, corresponding to $d_k$ entities (e.g., users or items). The entry value at location $\mathbf{i} = (i_1, \ldots, i_K)$ is denoted by $m_{\mathbf{i}}$. Given a tensor $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ and a matrix $\mathbf{U} \in \mathbb{R}^{s \times t}$, we can multiply $\mathcal{W}$ by $\mathbf{U}$ at mode $k$ when $r_k = t$. The result is a new tensor of size $r_1 \times \cdots \times r_{k-1} \times s \times r_{k+1} \times \cdots \times r_K$, with each entry computed by
$$(\mathcal{W} \times_k \mathbf{U})_{i_1 \ldots i_{k-1}\, j\, i_{k+1} \ldots i_K} = \sum_{i_k=1}^{r_k} w_{i_1 \ldots i_K}\, u_{j i_k}.$$
For decomposition, we introduce $K$ latent factor matrices, $\mathcal{U} = \{\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(K)}\}$, to represent the entities in each tensor mode; each row $\mathbf{U}^{(k)}(j,:)$ contains the latent factors of the $j$-th entity in mode $k$. The classical Tucker decomposition (Tucker, 1966) incorporates a small core tensor $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ and assumes $\mathcal{M} = \mathcal{W} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_K \mathbf{U}^{(K)}$. If we simplify Tucker decomposition by restricting $r_1 = \cdots = r_K$ and $\mathcal{W}$ to be diagonal, we reduce to the CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970). While many other decomposition methods have been proposed, e.g., (Chu and Ghahramani, 2009; Kang et al., 2012; Choi and Vishwanathan, 2014), most of them are still based on the Tucker/CP forms. However, the multilinear assumptions might be insufficient to capture intricate, highly nonlinear relationships in data.
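To make the mode-$k$ product and the CP form above concrete, here is a minimal NumPy sketch. It is a reading aid under our own naming, not the paper's implementation:

```python
# Minimal sketch of the mode-k tensor-matrix product and CP reconstruction.
# Function and variable names are ours for illustration.
import numpy as np

def mode_k_product(W, U, k):
    """Multiply tensor W by matrix U (s x t) along mode k; requires W.shape[k] == t.

    Implements (W x_k U)_{i1..i_{k-1} j i_{k+1}..iK} = sum_{i_k} w_{i1..iK} u_{j,i_k}.
    """
    Wk = np.moveaxis(W, k, 0)                    # bring mode k to the front: t x (rest)
    front_shape = Wk.shape
    Wk = Wk.reshape(front_shape[0], -1)          # unfold: t x prod(rest)
    out = U @ Wk                                 # contract mode k: s x prod(rest)
    out = out.reshape((U.shape[0],) + front_shape[1:])
    return np.moveaxis(out, 0, k)                # restore the original mode order

def cp_reconstruct(factors):
    """Rebuild a tensor from CP factor matrices U^(1..K), each of shape d_k x R."""
    R = factors[0].shape[1]
    M = np.zeros(tuple(U.shape[0] for U in factors))
    for r in range(R):
        # Rank-1 component: outer product of the r-th columns of all factors.
        component = factors[0][:, r]
        for U in factors[1:]:
            component = np.multiply.outer(component, U[:, r])
        M += component
    return M
```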
Recently, several Bayesian nonparametric tensor decomposition models (Xu et al., 2012; Zhe et al., 2016b) have been proposed, which are flexible enough to capture various nonlinear relationships in data. For example, Zhe et al. (2016b) considered each entry value $m_{\mathbf{i}}$ as a function of the corresponding latent factors, i.e., $m_{\mathbf{i}} = f([\mathbf{U}^{(1)}(i_1,:), \ldots, \mathbf{U}^{(K)}(i_K,:)])$, and placed a Gaussian process (GP) (Rasmussen and Williams, 2006) prior over $f(\cdot)$ to automatically infer the (possible) nonlinearity of $f(\cdot)$. These methods often improve upon the CP/Tucker decompositions by a large margin in missing value prediction.

**Decomposition with Temporal Information.** Practical tensors often come with temporal information, namely the time stamps of the interactions. For example, from a file access log, we can extract not only a three-mode (user, action, file) tensor, but also the time stamps of each user taking an action on a file. To use the temporal information in the decomposition, many methods discard the time stamps, use a Poisson (process) likelihood to model the interaction frequency $m_{\mathbf{i}}$ in each entry $\mathbf{i}$, $p(m_{\mathbf{i}}) \propto e^{-\lambda_{\mathbf{i}} T} \lambda_{\mathbf{i}}^{m_{\mathbf{i}}}$ (Chi and Kolda, 2012; Hu et al., 2015b), and perform the Tucker/CP decomposition over $\{\lambda_{\mathbf{i}}\}$ or $\{\log(\lambda_{\mathbf{i}})\}$. More refined approaches (Xiong et al., 2010; Schein et al., 2015, 2016) first discretize the time stamps into several steps, such as months/weeks, and augment the original tensor with a time mode. A time factor matrix $\mathbf{T}$ is then estimated in the decomposition. While the interactions from different time steps are modeled with distinct time factors, the ones in the same interval are considered independently (given the latent factors), say, being modeled by Poisson likelihoods (Schein et al., 2015, 2016). A Markov assumption might be used to encourage smoothness between the time factors. For example, Xiong et al. (2010) assigned a conditional Gaussian prior over each $\mathbf{T}(k,:)$: $p\big(\mathbf{T}(k,:) \mid \mathbf{T}(k-1,:)\big) = \mathcal{N}\big(\mathbf{T}(k,:) \mid \mathbf{T}(k-1,:),\, \sigma^2\mathbf{I}\big)$.

## 3 Model

Despite the success of existing approaches in exploiting temporal information, they entirely drop the time stamps and hence are unable to capture the important triggering or causal effects between the interactions. Such triggering effects are common in real-world applications. For example, the event that user A purchased commodity B may excite A's friend C to purchase B as well. The triggering effects are usually local and decay fast with time; dropping the time stamps and considering the event occurrences independently makes us unable to model or capture these effects. To address these issues, and hence to further capture the complex relationships and important structures underlying the interaction events, we formulate a new data abstraction, the event-tensor, to preserve all the time stamps. We then propose a powerful Bayesian nonparametric model to decompose event-tensors, discussed as follows.

### 3.1 Event-Tensor Formulation

First, let us look at the definition of event-tensors. To preserve the complete temporal information in decomposition, we relax the requirement that tensors be multidimensional arrays of numerical values. Instead, we define each entry to be a sequence of events, i.e., $m_{\mathbf{i}} = \{s_{\mathbf{i}}^1, \ldots, s_{\mathbf{i}}^{n_{\mathbf{i}}}\}$, where each $s_{\mathbf{i}}^k$ ($1 \le k \le n_{\mathbf{i}}$) is a time stamp at which the interaction happened, and $n_{\mathbf{i}}$ is the count of the events. Note that different entries correspond to distinct types of interaction events, since the involved entities (and hence latent factors) are different. We name this tensor an event-tensor.
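As a reading aid, here is a minimal sketch of one way an event-tensor could be stored in code: each observed entry index maps to its sequence of event time stamps. The names and toy data are ours for illustration; the flattened sequence `S` at the end anticipates the construction described next:

```python
# Minimal sketch of an event-tensor: each observed entry index i = (i_1, ..., i_K)
# maps to the list of time stamps of its interaction events.
from collections import defaultdict

event_tensor = defaultdict(list)

def record_event(entry_index, time_stamp):
    """Append an interaction event to the entry's event sequence."""
    event_tensor[entry_index].append(time_stamp)

# Toy events for a (user, commodity, web-context) tensor:
record_event((0, 5, 2), 1.30)   # user 0 bought item 5 under context 2 at t = 1.30
record_event((0, 5, 2), 4.75)   # the same entry holds a *sequence* of events
record_event((3, 1, 2), 2.10)

# Flatten all entries into a single time-ordered sequence
# S = [(s_1, i_1), ..., (s_N, i_N)], as described next:
S = sorted((s, i) for i, stamps in event_tensor.items() for s in stamps)
```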
Given the observed entries $\{m_{\mathbf{i}}\}$, we can flatten their event sequences to obtain a single sequence $S = [(s_1, \mathbf{i}_1), \ldots, (s_N, \mathbf{i}_N)]$, where $s_1 \le \cdots \le s_N$ are all the time stamps and each $\mathbf{i}_k$ is the entry index of the event $s_k$ ($1 \le k \le N$).

### 3.2 Nonparametric Event-Tensor Decomposition

Now, we consider a probabilistic model for event-tensor decomposition. While Poisson processes (PPs) have many nice properties and are often good choices for modeling events (Schein et al., 2015), they assume event occurrences are independent (i.e., independent increments) and hence are unable to capture the influences of the events on each other. To overcome this limitation, we use a much more expressive point process, the Hawkes process (Hawkes, 1971), to model the events in tensor entries. Given an event sequence $\{t_1, \ldots, t_n\}$, the Hawkes process defines the event rate $\lambda$ as a function of time $t$, $\lambda(t) = \lambda_0 + \sum_{t_i < t} \gamma(t, t_i)$, where $\lambda_0$ is the background rate and $\gamma(\cdot, \cdot)$ is the triggering kernel that quantifies how a past event at $t_i$ excites the occurrence of new events at $t$.
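For intuition, below is a minimal sketch of the Hawkes intensity using the common exponentially decaying kernel $\gamma(t, t_i) = a\,e^{-b(t - t_i)}$. Note that this textbook kernel is only a placeholder here, not the novel triggering function designed in this paper:

```python
# Minimal sketch of a Hawkes-process intensity with an exponential triggering
# kernel gamma(t, t_i) = a * exp(-b * (t - t_i)). The kernel choice and the
# parameter values are illustrative assumptions, not the paper's design.
import math

def hawkes_intensity(t, past_events, lam0=0.5, a=0.8, b=1.0):
    """lambda(t) = lam0 + sum_{t_i < t} a * exp(-b * (t - t_i))."""
    return lam0 + sum(a * math.exp(-b * (t - ti)) for ti in past_events if ti < t)

# Each past event transiently raises the rate, so events cluster in time:
events = [1.0, 1.2, 1.3]
print(hawkes_intensity(1.5, events))   # elevated, shortly after a burst of events
print(hawkes_intensity(9.0, events))   # close to lam0 once the excitations decay
```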