# Self-Modulating Nonparametric Event-Tensor Factorization

Zheng Wang (1), Xinqi Chu (2), Shandian Zhe (1)

(1) School of Computing, University of Utah. (2) Xjera Labs, Pte. Ltd. Correspondence to: Shandian Zhe. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Tensor factorization is a fundamental framework for analyzing high-order interactions in data. Despite the success of existing methods, the valuable temporal information is severely underused: the timestamps of the interactions are either ignored or discretized into crude steps. Although recent work formulates event-tensors to keep the timestamps in the factorization and can capture mutual excitation effects among the interaction events, it overlooks another important type of temporal influence, inhibition. In addition, it uses a local window that excludes all long-term dependencies. To overcome these limitations, we propose a self-modulating nonparametric Bayesian factorization model. We use the latent factors to construct mutually-governed, general random point processes, which can capture various short-term/long-term, excitation/inhibition effects, so as to encode the complex temporal dependencies into the factor representations. In addition, our model couples with a latent Gaussian process to estimate and fuse nonlinear yet static relationships between the entities. For efficient inference, we derive a fully decomposed model evidence lower bound that dispenses with the huge kernel matrices and the costly summations inside the rate and log-rate functions. We then develop an efficient stochastic optimization algorithm. We show the advantage of our method in four real-world applications.

1. Introduction

Interactions between multiple entities (or nodes) are common in real-world applications. For example, consumer activities can be viewed as interactions between customers, service items and providers. These (high-order) interactions are naturally represented by tensors and analyzed by tensor factorization, which learns a set of latent factors to represent each participant. With the factor representations, we can discover hidden structures within the entities, e.g., clusters and outliers, and extract useful features to make predictions in downstream tasks. While many excellent methods for tensor factorization have been proposed (Tucker, 1966; Harshman, 1970; Chu and Ghahramani, 2009; Kang et al., 2012; Choi and Vishwanathan, 2014), they mainly conduct a multilinear decomposition and may be inadequate for capturing more complex, nonlinear relationships. More importantly, they severely underuse the valuable temporal information that comes with the data. Most methods either drop the timestamps of the interactions, summarizing the events as a count tensor (Chi and Kolda, 2012; Hansen et al., 2015; Hu et al., 2015b), or discretize the timestamps into crude steps (e.g., weeks or months), ignoring the temporal dependencies within the same step (Xiong et al., 2010; Schein et al., 2015; 2016). Recently, Zhe and Du (2018) formulated event-tensors (where the tensor entries are sequences of interaction events) to preserve the accurate timestamps. They used Hawkes processes to estimate the fine-grained, triggering effects between the interactions. However, their approach overlooks inhibition, another ubiquitous effect between the events, and can still miss important temporal patterns.
In addition, to reduce the computational cost, they use a small time window to restrict the range of dependent events, and hence cannot capture long-term temporal influences of the interactions.

To overcome these limitations, we propose a self-modulating nonparametric Bayesian factorization model for event-tensors, which not only can capture static, nonlinear relationships between the entities, but also is flexible enough to capture a variety of short-term and long-term, excitation and inhibition effects among the interaction events, encoding these complex temporal effects into the factor representations. Specifically, we use the latent factors to construct a set of mutually-governed, general random point processes to sample the observed interaction events. We first use a latent Gaussian process (GP) to sample a function of the factor representations that determines the type of temporal effect between each pair of interactions. The strength of the effect is further modelled as a kernel (similarity) function of their factors. In this way, both the type and the strength of the effect are absorbed into the factors, from which we can discover the underlying temporal structures. We then couple with another latent GP to sample the base rate as a (nonlinear) function of the factors, in order to estimate and fuse complex yet static relationships between the entities. We use a scaled softplus function to additively integrate all the positive and negative influences from previous interactions to construct the rate function. For efficient inference, we take advantage of the convexity and log-concavity of the rate function, and use the sparse variational GP framework (Hensman et al., 2013) and Jensen's inequality to derive a fully decomposed model evidence lower bound (ELBO). Based on the ELBO, we develop an efficient stochastic optimization algorithm. The complexity of our algorithm is only proportional to the size of the mini-batches, while it captures all the long-term dependencies among the interactions.

For evaluation, we examined our method on three real-world datasets. Our model nearly always achieves better predictive performance than the existing methods that use Poisson processes, time factors, and Hawkes processes to incorporate temporal information. The training curve shows that our inference algorithm converges reasonably fast and is quite stable. Finally, by looking into the latent factors estimated by our approach, we found interesting and meaningful structures both within the entities and within the events. We also found interesting temporal influence patterns.

2. Background

Tensor Factorization. We denote a K-mode tensor by $\mathcal{Y} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$. The k-th mode includes $d_k$ entities or nodes (e.g., customers). Each entry is indexed by a tuple $\mathbf{i} = (i_1, \ldots, i_K)$ and stands for the interaction of the corresponding K nodes. The entry value is denoted by $y_{\mathbf{i}}$. To decompose $\mathcal{Y}$, we introduce K latent factor matrices $\mathcal{U} = \{U^1, \ldots, U^K\}$ to represent all the tensor nodes. Each $U^k = [\mathbf{u}^k_1; \ldots; \mathbf{u}^k_{d_k}]$ is $d_k \times r_k$, and each $\mathbf{u}^k_t$ contains the $r_k$ latent factors of node t in mode k. We aim to use $\mathcal{U}$ to recover the observed entries in $\mathcal{Y}$. A classical approach is Tucker decomposition (Tucker, 1966), which assumes $\mathcal{Y} = \mathcal{W} \times_1 U^1 \times_2 \cdots \times_K U^K$, where $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ is a parametric core tensor and $\times_k$ is the mode-k tensor-matrix product (Kolda, 2006), which resembles the ordinary matrix-matrix product (a small sketch is given below).
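The following is a minimal numpy sketch of the Tucker reconstruction for a 3-mode tensor. It is only an illustration of the mode products written above; the tensor sizes, factor dimensions and random values are arbitrary toy choices, not anything used in the paper.

```python
import numpy as np

# Minimal sketch of a Tucker reconstruction Y = W x_1 U^1 x_2 U^2 x_3 U^3
# for a 3-mode tensor; all sizes below are arbitrary toy values.
d1, d2, d3 = 4, 5, 6          # numbers of nodes per mode
r1, r2, r3 = 2, 3, 2          # factor dimensions per mode

rng = np.random.default_rng(0)
W = rng.normal(size=(r1, r2, r3))      # parametric core tensor
U1 = rng.normal(size=(d1, r1))         # factor matrices, one row per node
U2 = rng.normal(size=(d2, r2))
U3 = rng.normal(size=(d3, r3))

# Mode-k tensor-matrix products, written compactly with einsum:
# Y[i1, i2, i3] = sum_{a,b,c} W[a, b, c] * U1[i1, a] * U2[i2, b] * U3[i3, c]
Y = np.einsum('abc,ia,jb,kc->ijk', W, U1, U2, U3)
print(Y.shape)  # (4, 5, 6)
```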
If we set all $r_k = r$ and $\mathcal{W}$ to be diagonal, Tucker decomposition becomes CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970). While numerous tensor factorization methods have been proposed, e.g., (Chu and Ghahramani, 2009; Kang et al., 2012; Choi and Vishwanathan, 2014), most of them are inherently based on the CP or Tucker form. However, since both forms are multilinear in the latent factors, they are incapable of capturing more complicated, nonlinear relationships in data.

Factorization with Temporal Information. Real-world tensors are often supplemented with detailed temporal information, namely, the timestamps of the observed interactions. To incorporate this information, traditional methods either drop the timestamps and perform count tensor decomposition (Chi and Kolda, 2012; Hu et al., 2015b), or discretize the timestamps into time steps, e.g., weeks or months, augment the tensor with a time mode (Xiong et al., 2010; Schein et al., 2015; 2016), and jointly estimate the time factors. Both approaches can be viewed as using Poisson processes to model the interaction events, $p(y_{\mathbf{i}}) \propto e^{-\lambda_{\mathbf{i}} T} \lambda_{\mathbf{i}}^{y_{\mathbf{i}}}$, where $y_{\mathbf{i}}$ is the interaction count in entry $\mathbf{i}$ (with or without a time step) and $\lambda_{\mathbf{i}}$ is the event rate. The factorization is performed on $\{\lambda_{\mathbf{i}}\}$ or $\{\log(\lambda_{\mathbf{i}})\}$, typically with Tucker/CP forms. Despite their simplicity and convenience, these methods disregard the rich and vital temporal dependencies between the interactions, due to the independent-increment assumption of Poisson processes.

To mitigate this issue, Zhe and Du (2018) formulated event-tensors to maintain all the accurate timestamps. In an event-tensor, each entry is an event sequence of a particular interaction, rather than a numerical value. Zhe and Du (2018) modelled the observed entries as a set of mutually excited Hawkes processes (Hawkes, 1971). The rate of the events in each entry $\mathbf{i}$ is

$$\lambda_{\mathbf{i}}(t) = \lambda^0_{\mathbf{i}} + \sum_{s_n \in A(t)} k(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}})\, h_0(t - s_n), \qquad (1)$$

where $\lambda^0_{\mathbf{i}}$ is the background rate, $A(t)$ is a local time window that specifies the range of dependent events that happened before $t$ (e.g., the 50 past events nearest to $t$), $\mathbf{i}_n$ is the entry to which the interaction at time $s_n$ belongs, $\mathbf{x}_{\mathbf{i}_n}$ and $\mathbf{x}_{\mathbf{i}}$ are the latent factors associated with entries $\mathbf{i}_n$ and $\mathbf{i}$ respectively, $k(\cdot, \cdot)$ is a kernel function that measures their similarity, and $h_0(\cdot)$ is a base triggering kernel that measures how the triggering effect decays over time. From the rate function (1), we can see that the model captures the (local) excitation effects of previously occurred interactions on the current one, and that the triggering strength is (partly) encoded into the latent factors: the closer the corresponding factors of the two entries, the stronger the strength.

Although the model in (Zhe and Du, 2018) can capture fine-grained, mutual triggering effects between the interactions, it ignores another important and ubiquitous type of effect: inhibition. For example, a customer who has recently purchased a Surface laptop is unlikely to buy a MacBook; people who voted for one presidential candidate are unlikely to support another one in the short term. In practice, the interaction events can exhibit mixed excitation and inhibition effects, resulting in complex temporal dependencies. In addition, the model uses a local time window $A(t)$ to specify a small range of dependent events for each interaction (see (1) and the sketch below).
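To make (1) concrete, here is a minimal numpy sketch of this windowed Hawkes-style rate. The RBF similarity kernel, the exponential decay used for $h_0$, and all numeric values are illustrative assumptions, not the exact configuration of (Zhe and Du, 2018).

```python
import numpy as np

def rbf(x, y, ell=1.0):
    # Similarity kernel k(x, y) between the factor representations of two entries
    return np.exp(-np.sum((x - y) ** 2) / (2 * ell ** 2))

def hawkes_rate(t, x_i, events, lam0, window=50, decay=1.0):
    """Windowed Hawkes rate (1): lam0 plus, over the `window` most recent past
    events, k(x_{i_n}, x_i) * h0(t - s_n), with h0 taken as an exponential decay.
    `events` is a list of (s_n, x_{i_n}) pairs with s_n sorted increasingly."""
    past = [(s, x) for (s, x) in events if s < t][-window:]   # local window A(t)
    excitation = sum(rbf(x, x_i) * np.exp(-decay * (t - s)) for (s, x) in past)
    return lam0 + excitation

# Toy usage (hypothetical factors and timestamps).
x_i = np.array([0.2, -0.1, 0.5])
history = [(0.3, np.array([0.1, 0.0, 0.4])), (1.2, np.array([-0.8, 0.7, 0.1]))]
print(hawkes_rate(t=2.0, x_i=x_i, events=history, lam0=0.05))
```

The `[-window:]` slice is what implements the local window $A(t)$: only the most recent past events contribute, which is exactly the restriction discussed next.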
Although this restriction can save much computational cost in evaluating the rate function and its logarithm during model estimation (especially for a large number of events), it excludes all the long-term influences of the interaction events on each other, and hence can miss many interesting and valuable temporal patterns. To overcome these problems, we propose a self-modulating nonparametric event-tensor factorization model, presented as follows.

3. Model

3.1. Notations for Event-Tensor

First, let us supplement a few notations. In the event-tensor, for each observed entry $\mathbf{i}$, we denote its event sequence by $\mathbf{y}_{\mathbf{i}} = [s^1_{\mathbf{i}}, \ldots, s^{n_{\mathbf{i}}}_{\mathbf{i}}]$, i.e., the timestamps at which the interaction $\mathbf{i}$ occurred, where $n_{\mathbf{i}}$ is the number of occurrences. Note that each entry represents a particular type of interaction. We can merge the event sequences of all the observed entries into a single sequence, $S = [(s_1, \mathbf{i}_1), \ldots, (s_N, \mathbf{i}_N)]$, where $s_1 \le \ldots \le s_N$ are all the timestamps and each $\mathbf{i}_n$ indexes the entry that event $s_n$ belongs to, i.e., the particular interaction that occurred at $s_n$.

3.2. Self-Modulating Nonparametric Factorization

We now consider using the latent factors $\mathcal{U}$ to construct a general random point process that accommodates both the triggering and the inhibition effects among the interaction events. One basic assumption in (Zhe and Du, 2018) (see (1)) is that the closer (or more similar) the factor representations of two interactions, the stronger their mutual excitation effects. This is true in many applications; for example, the event that user A purchased commodity B may excite A's friend C to purchase B as well. Obviously, the factors of A and C are expected to be close because they are in the same community (i.e., friends), and so are the factor representations of the interactions (A, B) and (C, B). However, in many other cases, closer factor representations may on the contrary lead to stronger inhibition effects. For example, the event that user A has purchased Surface laptop B can strongly suppress A from buying MacBook C (as mentioned above); the event that athlete A has won the championship of game B removes the possibility that A's competitor C wins B.

Therefore, to model the strength of the temporal influence of a previously occurred interaction $\mathbf{j}$ on the current one $\mathbf{i}$, we still use a kernel function of their factor representations, $k(\mathbf{x}_{\mathbf{j}}, \mathbf{x}_{\mathbf{i}})$, where $\mathbf{x}_{\mathbf{i}} = [\mathbf{u}^1_{i_1}; \ldots; \mathbf{u}^K_{i_K}]$ and $\mathbf{x}_{\mathbf{j}} = [\mathbf{u}^1_{j_1}; \ldots; \mathbf{u}^K_{j_K}]$. However, to detect the type of the influence, we consider learning a discriminating function of the factor representations, $g(\mathbf{x}_{\mathbf{j}}, \mathbf{x}_{\mathbf{i}})$, where $g(\mathbf{x}_{\mathbf{j}}, \mathbf{x}_{\mathbf{i}}) > 0$ indicates that $\mathbf{j}$ will trigger the occurrence of $\mathbf{i}$, and otherwise inhibit it. To flexibly estimate $g(\cdot)$, we place a Gaussian process (GP) prior (Rasmussen and Williams, 2006), a nonparametric function prior that accommodates various complex functions. Hence, the latent function values $\mathbf{g}$ for every pair of observed entries follow a multivariate Gaussian distribution,

$$p(\mathbf{g}|\mathcal{U}) = \mathcal{N}\big(\mathbf{g}\,|\,\mathbf{0}, \kappa_g(X_g, X_g)\big), \qquad (2)$$

where each row of the input matrix $X_g$ corresponds to a pair of entries and is the concatenation of the associated factors, and $\kappa_g(\cdot, \cdot)$ is the covariance (kernel) function. Now, we define a raw rate function for each entry $\mathbf{i}$ that integrates both the triggering and the suppressing effects from the previous interactions,

$$\tilde{\lambda}_{\mathbf{i}}(t) = \lambda^0_{\mathbf{i}} + \sum_{s_n < t} \tanh\big(g(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}})\big)\, k(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}})\, h_0(t - s_n), \qquad (3)$$

where $\lambda^0_{\mathbf{i}}$ is the background rate, and $\tanh(\cdot)$ squashes the discriminating function into $(-1, 1)$ so that its sign decides whether a past interaction excites or inhibits the current one, while $k(\cdot, \cdot)$ and the decaying kernel $h_0(\cdot)$ control the strength of the effect. Since the raw rate can be negative, we pass it through a scaled softplus to obtain a valid (non-negative) event rate,

$$\lambda_{\mathbf{i}}(t) = \gamma_s\big(\tilde{\lambda}_{\mathbf{i}}(t)\big) = s \log\Big(1 + \exp\big(\tilde{\lambda}_{\mathbf{i}}(t)/s\big)\Big), \qquad (4)$$

where $s > 0$. It is trivial to show that when $s \rightarrow 0$, $\lambda_{\mathbf{i}}(t) \rightarrow \max\big(\tilde{\lambda}_{\mathbf{i}}(t), 0\big)$. Therefore, the scaled softplus largely maintains the additive structure of our raw rate definition in (3); a small numerical sketch is given below.
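Below is a minimal numpy sketch of the raw rate (3) and the scaled softplus (4). The RBF similarity kernel, the exponential decay for $h_0$, the stand-in for the GP-sampled function $g$, and all numeric values are illustrative assumptions. The last line checks the limiting behaviour noted above: for small $s$, the transformed rate approaches $\max(\tilde{\lambda}_{\mathbf{i}}(t), 0)$.

```python
import numpy as np

def scaled_softplus(x, s=1.0):
    # gamma_s(x) = s * log(1 + exp(x/s)), written in an overflow-safe form
    z = np.asarray(x, dtype=float) / s
    return s * (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))))

def rbf(x, y, ell=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * ell ** 2))

def raw_rate(t, x_i, events, g, lam0, decay=1.0):
    """Raw rate (3): background rate plus tanh(g) * k * h0 over ALL past events
    (no local window). The sign of g(x_j, x_i) decides excitation vs. inhibition."""
    return lam0 + sum(np.tanh(g(x_j, x_i)) * rbf(x_j, x_i) * np.exp(-decay * (t - s))
                      for (s, x_j) in events if s < t)

# Toy setup (hypothetical values); g is a fixed stand-in for the GP-sampled function.
g = lambda xj, xi: np.dot(xj, xi)           # positive value -> excitation, negative -> inhibition
x_i = np.array([0.2, -0.1, 0.5])
events = [(0.3, np.array([0.1, 0.0, 0.4])), (1.2, np.array([-0.8, 0.7, 0.1]))]

lam_tilde = raw_rate(t=2.0, x_i=x_i, events=events, g=g, lam0=-0.3)
print(scaled_softplus(lam_tilde, s=1.0))                          # valid, non-negative rate (4)
print(scaled_softplus(lam_tilde, s=1e-3), max(lam_tilde, 0.0))    # nearly equal for small s
```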
While other transformation operators are also possible, e.g., $\exp(\cdot)$, we found empirically that the scaled softplus gives superior performance. Finally, to estimate the complex yet static relationships between the entities and fuse these relationships into the latent factors, we model the background rate $\lambda^0_{\mathbf{i}}$ of each entry $\mathbf{i}$ as a nonlinear function of the associated factors, $f(\mathbf{x}_{\mathbf{i}})$. To this end, we place another GP prior over $f(\cdot)$. Then the background rate values $\mathbf{f}$ for all the observed entries are sampled from a multivariate Gaussian distribution,

$$p(\mathbf{f}|\mathcal{U}) = \mathcal{N}\big(\mathbf{f}\,|\,\mathbf{0}, \kappa_f(X_f, X_f)\big), \qquad (5)$$

where each row of $X_f$ is the concatenated factors associated with one entry, and $\kappa_f(\cdot, \cdot)$ is the covariance (kernel) function. Note that we do not need to constrain $f(\cdot) > 0$, because via the softplus transformation (4) we always obtain a non-negative event rate. We place a standard Gaussian prior over all the latent factors $\mathcal{U}$. Given the observed interaction events $S$ (from all the entries), the joint probability of our model is

$$p(S, \mathbf{g}, \mathbf{f}, \mathcal{U}) = \prod_{k}\prod_{i_k} \mathcal{N}(\mathbf{u}^k_{i_k}\,|\,\mathbf{0}, \mathbf{I})\; \mathcal{N}\big(\mathbf{g}\,|\,\mathbf{0}, \kappa_g(X_g, X_g)\big)\; \mathcal{N}\big(\mathbf{f}\,|\,\mathbf{0}, \kappa_f(X_f, X_f)\big) \prod_{\mathbf{i}} \exp\Big(-\int_0^T \lambda_{\mathbf{i}}(t)\,\mathrm{d}t\Big) \prod_{n=1}^N \lambda_{\mathbf{i}_n}(s_n), \qquad (6)$$

where $T$ is the total time span across all the events. Note that the last two factors constitute the likelihood of our mutually governed random point processes on the observed entries (Daley and Vere-Jones, 2007).

4. Algorithm

The estimation of our model is challenging. First, exact inference is infeasible for large data because the GP likelihoods (2) and (5) require us to compute $M^2 \times M^2$ and $M \times M$ covariance (kernel) matrices, respectively, and their inverses, where $M$ is the number of observed entries (i.e., distinct interactions). When $M$ is large, the computation is prohibitively costly. Second, the calculation of each rate $\lambda_{\mathbf{i}_n}(s_n)$ in the joint probability (6) needs to go through all the previously happened interactions $\{s_1, \ldots, s_{n-1}\}$ (see (3)), and is therefore expensive for a large number of events $N$. Third, due to the softplus transformation in (4), the integral over each rate function in (6) does not have a closed form and is intractable to compute. To address these challenges, we take advantage of the variational sparse GP framework (Hensman et al., 2013) and the properties of our rate function to derive a fully decomposed model evidence lower bound (ELBO). Based on the ELBO, we develop a stochastic mini-batch optimization algorithm that is efficient for both large $M$ and large $N$. Our algorithm is presented as follows.

4.1. Fully Decomposed Model Evidence Lower Bound

4.1.1. REMOVING HUGE COVARIANCE MATRICES

First, we use the variational sparse GP framework to dispense with the huge covariance matrices. We introduce pseudo inputs $Z_g = [\mathbf{z}^g_1, \ldots, \mathbf{z}^g_{m_g}]$ and $Z_f = [\mathbf{z}^f_1, \ldots, \mathbf{z}^f_{m_f}]$ for the two latent functions $g(\cdot)$ and $f(\cdot)$, respectively, where $m_g \ll M^2$ and $m_f \ll M$. We denote the function values at these pseudo inputs by $\mathbf{b}_g = [g(\mathbf{z}^g_1), \ldots, g(\mathbf{z}^g_{m_g})]$ and $\mathbf{b}_f = [f(\mathbf{z}^f_1), \ldots, f(\mathbf{z}^f_{m_f})]$, which we refer to as the pseudo outputs. Then we can augment our model by jointly sampling $\{\mathbf{f}, \mathbf{b}_f\}$ and $\{\mathbf{g}, \mathbf{b}_g\}$. Due to the GP priors over $g(\cdot)$ and $f(\cdot)$, both $\{\mathbf{g}, \mathbf{b}_g\}$ and $\{\mathbf{f}, \mathbf{b}_f\}$ follow multivariate Gaussian distributions, whose covariance (kernel) matrices are computed on $\{X_g, Z_g\}$ and $\{X_f, Z_f\}$, respectively (a small sketch of this construction is given below).
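As an illustration of how these GP inputs are formed, the following numpy sketch builds the concatenated-factor inputs $X_f$ (one row per observed entry), a set of pseudo inputs $Z_f$, and the joint covariance over $\{X_f, Z_f\}$. The plain squared-exponential kernel is an assumption (roughly consistent with the SE-ARD kernel used in the experiments), and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy factor matrices for a 3-mode event-tensor (sizes are arbitrary).
r = 2
U = [rng.normal(size=(4, r)), rng.normal(size=(5, r)), rng.normal(size=(6, r))]

# Observed entries (distinct interactions); each row of X_f concatenates the
# factors of the entry's nodes: x_i = [u^1_{i1}; u^2_{i2}; u^3_{i3}].
entries = [(0, 1, 2), (3, 0, 5), (2, 4, 1)]
X_f = np.stack([np.concatenate([U[k][i_k] for k, i_k in enumerate(i)]) for i in entries])

# Pseudo inputs Z_f live in the same concatenated-factor space, with m_f << M.
m_f = 2
Z_f = rng.normal(size=(m_f, X_f.shape[1]))

def se_kernel(A, B, ell=1.0, var=1.0):
    # Squared-exponential covariance between row sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-d2 / (2 * ell ** 2))

# Joint covariance of {f, b_f}: the GP kernel evaluated on the stacked inputs.
inputs = np.vstack([X_f, Z_f])
K_joint = se_kernel(inputs, inputs)
print(K_joint.shape)   # (M + m_f, M + m_f) = (5, 5)
```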
We can further decompose the joint prior as

$$p(\mathbf{g}, \mathbf{b}_g) = p(\mathbf{b}_g)\, p(\mathbf{g}|\mathbf{b}_g), \qquad (7)$$

where $p(\mathbf{b}_g) = \mathcal{N}\big(\mathbf{b}_g\,|\,\mathbf{0}, \kappa_g(Z_g, Z_g)\big)$ and $p(\mathbf{g}|\mathbf{b}_g) = \mathcal{N}(\mathbf{g}\,|\,\mathbf{m}_{g|b}, \Sigma_{g|b})$ is a conditional Gaussian distribution with $\mathbf{m}_{g|b} = \kappa_g(X_g, Z_g)\kappa_g(Z_g, Z_g)^{-1}\mathbf{b}_g$ and $\Sigma_{g|b} = \kappa_g(X_g, X_g) - \kappa_g(X_g, Z_g)\kappa_g(Z_g, Z_g)^{-1}\kappa_g(Z_g, X_g)$. Similarly, we can decompose

$$p(\mathbf{f}, \mathbf{b}_f) = p(\mathbf{b}_f)\, p(\mathbf{f}|\mathbf{b}_f) = \mathcal{N}\big(\mathbf{b}_f\,|\,\mathbf{0}, \kappa_f(Z_f, Z_f)\big)\, \mathcal{N}(\mathbf{f}\,|\,\mathbf{m}_{f|b}, \Sigma_{f|b}), \qquad (8)$$

where $\mathbf{m}_{f|b}$ and $\Sigma_{f|b}$ are the conditional mean and covariance matrix given $\mathbf{b}_f$, analogous to $\mathbf{m}_{g|b}$ and $\Sigma_{g|b}$. The joint probability of the augmented model is then

$$p(S, \mathbf{g}, \mathbf{b}_g, \mathbf{f}, \mathbf{b}_f, \mathcal{U}) = p(\mathbf{b}_g)\, p(\mathbf{g}|\mathbf{b}_g)\, p(\mathbf{b}_f)\, p(\mathbf{f}|\mathbf{b}_f)\, p(\mathcal{U}, S|\mathbf{f}, \mathbf{g}), \qquad (9)$$

where $p(\mathcal{U}, S|\mathbf{f}, \mathbf{g}) = \prod_{k}\prod_{i_k} \mathcal{N}(\mathbf{u}^k_{i_k}|\mathbf{0}, \mathbf{I}) \prod_{\mathbf{i}} \exp\big(-\int_0^T \lambda_{\mathbf{i}}(t)\mathrm{d}t\big) \prod_{n=1}^N \lambda_{\mathbf{i}_n}(s_n)$. Note that if we marginalize out the pseudo outputs $\mathbf{b}_g$ and $\mathbf{b}_f$, we recover the original model (6). Based on (9), we now construct a variational model evidence lower bound (ELBO) that avoids calculating the full covariance matrices $\kappa_g(X_g, X_g)$ and $\kappa_f(X_f, X_f)$, which is infeasible for large $M$. To do so, we introduce a variational posterior for $\{\mathbf{g}, \mathbf{b}_g, \mathbf{f}, \mathbf{b}_f\}$,

$$q(\mathbf{g}, \mathbf{b}_g, \mathbf{f}, \mathbf{b}_f) = q(\mathbf{b}_g)\, p(\mathbf{g}|\mathbf{b}_g)\, q(\mathbf{b}_f)\, p(\mathbf{f}|\mathbf{b}_f), \qquad (10)$$

where $q(\mathbf{b}_g) = \mathcal{N}(\mathbf{b}_g|\boldsymbol{\mu}_g, S_g)$ and $q(\mathbf{b}_f) = \mathcal{N}(\mathbf{b}_f|\boldsymbol{\mu}_f, S_f)$. We further parameterize $S_g$ and $S_f$ by their Cholesky decompositions, $S_g = L_g L_g^\top$ and $S_f = L_f L_f^\top$, to ensure positive definiteness. We then derive the ELBO from

$$\mathcal{L} = \mathbb{E}_{q(\mathbf{g}, \mathbf{b}_g, \mathbf{f}, \mathbf{b}_f)}\Big[\log \frac{p(S, \mathbf{g}, \mathbf{b}_g, \mathbf{f}, \mathbf{b}_f, \mathcal{U})}{q(\mathbf{g}, \mathbf{b}_g, \mathbf{f}, \mathbf{b}_f)}\Big] = \mathbb{E}_{q}\Big[\log \frac{p(\mathbf{b}_g)\, p(\mathbf{g}|\mathbf{b}_g)\, p(\mathbf{b}_f)\, p(\mathbf{f}|\mathbf{b}_f)\, p(\mathcal{U}, S|\mathbf{f}, \mathbf{g})}{q(\mathbf{b}_g)\, p(\mathbf{g}|\mathbf{b}_g)\, q(\mathbf{b}_f)\, p(\mathbf{f}|\mathbf{b}_f)}\Big].$$

Now we can see that the full conditional Gaussian distributions $p(\mathbf{g}|\mathbf{b}_g)$ and $p(\mathbf{f}|\mathbf{b}_f)$ are both cancelled. We only need to compute the covariance matrices of $p(\mathbf{b}_g)$ and $p(\mathbf{b}_f)$, which are $m_g \times m_g$ and $m_f \times m_f$, respectively. Hence, the computational cost is greatly reduced. Rearranging the terms, we have

$$\mathcal{L} = \log p(\mathcal{U}) - \mathrm{KL}\big(q(\mathbf{b}_g)\,\|\,p(\mathbf{b}_g)\big) - \mathrm{KL}\big(q(\mathbf{b}_f)\,\|\,p(\mathbf{b}_f)\big) - \sum_{\mathbf{i}} \mathbb{E}_q\Big[\int_0^T \lambda_{\mathbf{i}}(t)\,\mathrm{d}t\Big] + \sum_{n=1}^N \mathbb{E}_q\big[\log \lambda_{\mathbf{i}_n}(s_n)\big], \qquad (11)$$

where $p(\mathcal{U}) = \prod_{k}\prod_{i_k} \mathcal{N}(\mathbf{u}^k_{i_k}|\mathbf{0}, \mathbf{I})$ and $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback-Leibler divergence. To handle the intractable integral in (11), we rewrite it as an expectation, $\int_0^T \lambda_{\mathbf{i}}(t)\,\mathrm{d}t = \mathbb{E}_{p(t)}\big[T \lambda_{\mathbf{i}}(t)\big]$, where $p(t) = \mathrm{Uniform}(0, T)$. Then we can sample $t$ to obtain an unbiased estimate of the integral and conduct stochastic optimization (which we will discuss later). The ELBO now is

$$\mathcal{L} = \log p(\mathcal{U}) - \mathrm{KL}\big(q(\mathbf{b}_g)\,\|\,p(\mathbf{b}_g)\big) - \mathrm{KL}\big(q(\mathbf{b}_f)\,\|\,p(\mathbf{b}_f)\big) - \sum_{\mathbf{i}} \mathbb{E}_q\,\mathbb{E}_{p(t)}\big[T \lambda_{\mathbf{i}}(t)\big] + \sum_{n=1}^N \mathbb{E}_q\big[\log \lambda_{\mathbf{i}_n}(s_n)\big]. \qquad (12)$$

4.1.2. DECOMPOSING LONG SUMMATIONS

However, the computation of each rate $\lambda_{\mathbf{i}}(t)$ and log rate $\log \lambda_{\mathbf{i}_n}(s_n)$ is still quite expensive. According to (3) and (4), each couples a summation of the temporal influences (excitation or inhibition) from all the previously happened events inside the (scaled) softplus or log-softplus function, so the time complexity is (on average) $O(N)$. Since we need to compute $N$ log rates, the total complexity is $O(N^2)$, which is very costly for large $N$. To address this issue, we observe the following fact.

Lemma 4.1. The scaled softplus function $\gamma_s(x) = s \log\big(1 + \exp(x/s)\big)$ ($s > 0$) is convex, and $\log \gamma_s(x)$ is concave.

The proof is given in the supplementary material. Based on this property, we can use Jensen's inequality to further derive an ELBO that fully decomposes these expensive summations. Specifically, we first rewrite the raw rate function (3) as

$$\tilde{\lambda}_{\mathbf{i}}(t) = \lambda^0_{\mathbf{i}} + \sum_{n=1}^N \delta(s_n < t)\, h_{\mathbf{i}_n \mathbf{i}}(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}}, t - s_n),$$

where $\delta(\cdot)$ is the indicator function and $h_{\mathbf{i}_n \mathbf{i}}(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}}, t - s_n) = \tanh\big(g(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}})\big)\, k(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}})\, h_0(t - s_n)$. We then partition the observed events into mini-batches of size $Q$, $\{B_1, \ldots, B_{N/Q}\}$, and rearrange the summation as

$$\tilde{\lambda}_{\mathbf{i}}(t) = \frac{Q}{N} \sum_{k=1}^{N/Q} \Big[\lambda^0_{\mathbf{i}} + \frac{N}{Q} \sum_{n \in B_k} \delta(s_n < t)\, h_{\mathbf{i}_n \mathbf{i}}(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}}, t - s_n)\Big].$$
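To make the regrouping and the subsequent Jensen step concrete, here is a small numpy sketch with toy numbers (the per-event contributions and all constants are illustrative, not taken from the paper). It checks that the mini-batch rearrangement leaves the raw rate unchanged, and that convexity of $\gamma_s$ gives $\gamma_s(\text{average}) \le \text{average of } \gamma_s$.

```python
import numpy as np

def scaled_softplus(x, s=1.0):
    z = np.asarray(x, dtype=float) / s
    return s * (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))))

rng = np.random.default_rng(0)
N, Q = 12, 3                    # toy number of events and mini-batch size
lam0 = 0.4                      # toy background rate
h = rng.normal(size=N) * 0.2    # toy per-event contributions delta(s_n < t) * h_{i_n i}(...)

# Full raw rate vs. its mini-batch regrouping (they coincide exactly).
raw_full = lam0 + h.sum()
batches = h.reshape(N // Q, Q)
X = lam0 + (N / Q) * batches.sum(axis=1)      # X^i_k, one value per mini-batch
raw_regrouped = (Q / N) * X.sum()             # average of X^i_k over the batches
print(np.isclose(raw_full, raw_regrouped))    # True

# Convexity of gamma_s: gamma_s(average of X) <= average of gamma_s(X)  (Jensen).
print(scaled_softplus(raw_full) <= scaled_softplus(X).mean() + 1e-12)  # True
```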
Thereby, we can view the raw rate as an expectation, $\tilde{\lambda}_{\mathbf{i}}(t) = \mathbb{E}_{p(k)}[X^{\mathbf{i}}_k]$, where $p(k) = Q/N$ for $k \in \{1, \ldots, N/Q\}$ and

$$X^{\mathbf{i}}_k = \lambda^0_{\mathbf{i}} + \frac{N}{Q} \sum_{n \in B_k} \delta(s_n < t)\, h_{\mathbf{i}_n \mathbf{i}}(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}}, t - s_n).$$

Since the rate is $\lambda_{\mathbf{i}}(t) = \gamma_s\big(\tilde{\lambda}_{\mathbf{i}}(t)\big)$ and $\gamma_s(\cdot)$ is convex, we can apply Jensen's inequality to obtain $\lambda_{\mathbf{i}}(t) = \gamma_s\big(\mathbb{E}_{p(k)}[X^{\mathbf{i}}_k]\big) \le \mathbb{E}_{p(k)}\big[\gamma_s(X^{\mathbf{i}}_k)\big]$, and so

$$\mathbb{E}_q\,\mathbb{E}_{p(t)}\big[\lambda_{\mathbf{i}}(t)\big] \le \mathbb{E}_q\,\mathbb{E}_{p(t)}\,\mathbb{E}_{p(k)}\big[\gamma_s(X^{\mathbf{i}}_k)\big]. \qquad (13)$$

Similarly, the raw rate inside each log rate $\log \lambda_{\mathbf{i}_n}(s_n)$ can also be viewed as an expectation, $\tilde{\lambda}_{\mathbf{i}_n}(s_n) = \mathbb{E}_{p(k)}[Y^n_k]$, where $Y^n_k = \lambda^0_{\mathbf{i}_n} + \frac{N}{Q} \sum_{j \in B_k} \delta(s_j < s_n)\, h_{\mathbf{i}_j \mathbf{i}_n}(\mathbf{x}_{\mathbf{i}_j}, \mathbf{x}_{\mathbf{i}_n}, s_n - s_j)$. Since $\log(\gamma_s(\cdot))$ is concave, we can apply Jensen's inequality to obtain

$$\log \lambda_{\mathbf{i}_n}(s_n) = \log\Big(\gamma_s\big(\mathbb{E}_{p(k)}[Y^n_k]\big)\Big) \ge \mathbb{E}_{p(k)}\Big[\log\big(\gamma_s(Y^n_k)\big)\Big]. \qquad (14)$$

Finally, we substitute the lower bounds in (13) and (14) for each expected rate and log rate in (12), respectively. We then obtain a fully decomposed ELBO,

$$\mathcal{L}^+ = \log p(\mathcal{U}) - \mathrm{KL}\big(q(\mathbf{b}_g)\,\|\,p(\mathbf{b}_g)\big) - \mathrm{KL}\big(q(\mathbf{b}_f)\,\|\,p(\mathbf{b}_f)\big) - \sum_{\mathbf{i}} \mathbb{E}_q\,\mathbb{E}_{p(t)}\,\mathbb{E}_{p(k)}\big[T\gamma_s(X^{\mathbf{i}}_k)\big] + \sum_{n=1}^N \mathbb{E}_q\,\mathbb{E}_{p(k)}\big[\log\big(\gamma_s(Y^n_k)\big)\big]. \qquad (15)$$

In this way, we move most of the summation out of each softplus and log-softplus function, leaving only a very light summation across the mini-batch inside $X^{\mathbf{i}}_k$ and $Y^n_k$. The ELBO is additive over the observed entries, the events and the mini-batch set (via $p(k)$). Thereby, we can develop a stochastic optimization algorithm for efficient model estimation.

4.2. Stochastic Optimization

We now maximize the ELBO $\mathcal{L}^+$ in (15) to estimate the variational posterior $q$, the latent factors $\mathcal{U}$ and the other parameters. This ELBO is not analytical because the expectation terms involve the softplus or log-softplus functions and do not have closed forms. Hence, we resort to stochastic optimization. To develop an efficient algorithm, we further partition all the observed entries (i.e., distinct interactions) into mini-batches of size $C$, $\{C_1, \ldots, C_{M/C}\}$, and all the observed events into mini-batches of size $D$, $\{D_1, \ldots, D_{N/D}\}$. Note that we can also reuse the previous partition of the events, $\{B_1, \ldots, B_{N/Q}\}$. Next, we rearrange

$$\mathcal{L}^+ = \log p(\mathcal{U}) - \mathrm{KL}\big(q(\mathbf{b}_g)\,\|\,p(\mathbf{b}_g)\big) - \mathrm{KL}\big(q(\mathbf{b}_f)\,\|\,p(\mathbf{b}_f)\big) - \sum_{j=1}^{M/C} \sum_{\mathbf{i} \in C_j} \mathbb{E}_q\,\mathbb{E}_{p(t)}\,\mathbb{E}_{p(k)}\big[T\gamma_s(X^{\mathbf{i}}_k)\big] + \sum_{l=1}^{N/D} \sum_{n \in D_l} \mathbb{E}_q\,\mathbb{E}_{p(k)}\big[\log\big(\gamma_s(Y^n_k)\big)\big]. \qquad (16)$$

Then we can view the ELBO as the expectation of a stochastic ELBO,

$$\mathcal{L}^+ = \mathbb{E}_{p(k), p(j), p(l)}\big[\widehat{\mathcal{L}}^+_{k,j,l}\big], \qquad (17)$$

where $p(k) = \frac{Q}{N}$ for $k \in \{1, \ldots, \frac{N}{Q}\}$, $p(j) = \frac{C}{M}$ for $j \in \{1, \ldots, \frac{M}{C}\}$, $p(l) = \frac{D}{N}$ for $l \in \{1, \ldots, \frac{N}{D}\}$, and

$$\widehat{\mathcal{L}}^+_{k,j,l} = \log p(\mathcal{U}) - \mathrm{KL}\big(q(\mathbf{b}_g)\,\|\,p(\mathbf{b}_g)\big) - \mathrm{KL}\big(q(\mathbf{b}_f)\,\|\,p(\mathbf{b}_f)\big) - \frac{M}{C} \sum_{\mathbf{i} \in C_j} \mathbb{E}_q\,\mathbb{E}_{p(t)}\big[T\gamma_s(X^{\mathbf{i}}_k)\big] + \frac{N}{D} \sum_{n \in D_l} \mathbb{E}_q\big[\log\big(\gamma_s(Y^n_k)\big)\big].$$

Now, with (17), we can develop an efficient stochastic optimization algorithm. Each time, we first draw mini-batches $B_k$, $C_j$ and $D_l$ from $p(k)$, $p(j)$ and $p(l)$ respectively, and then seek to compute $\widehat{\mathcal{L}}^+_{k,j,l}$ as an unbiased estimate of $\mathcal{L}^+$. However, the expectation terms in $\widehat{\mathcal{L}}^+_{k,j,l}$ are still intractable. To address this issue, we use the reparameterization trick (Kingma and Welling, 2013) to generate a parameterized sample for $\mathbf{b}_f$ and for each background rate $\lambda^0_{\mathbf{i}} = f(\mathbf{x}_{\mathbf{i}})$ in $X^{\mathbf{i}}_k$ and $Y^n_k$. Since $q(\mathbf{b}_f)$ is Gaussian and $p(f(\mathbf{x}_{\mathbf{i}})|\mathbf{b}_f)$ is conditional Gaussian, it is straightforward to obtain the samples $\mathbf{b}_f = \boldsymbol{\mu}_f + L_f \boldsymbol{\eta}$ and $\lambda^0_{\mathbf{i}} = \mu^0_{\mathbf{i}} + \sigma^0_{\mathbf{i}} \epsilon$, where $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\epsilon \sim \mathcal{N}(0, 1)$, $\mu^0_{\mathbf{i}} = \kappa_f(\mathbf{x}_{\mathbf{i}}, Z_f)\kappa_f(Z_f, Z_f)^{-1}\mathbf{b}_f$ and $(\sigma^0_{\mathbf{i}})^2 = \kappa_f(\mathbf{x}_{\mathbf{i}}, \mathbf{x}_{\mathbf{i}}) - \kappa_f(\mathbf{x}_{\mathbf{i}}, Z_f)\kappa_f(Z_f, Z_f)^{-1}\kappa_f(Z_f, \mathbf{x}_{\mathbf{i}})$. Similarly, we can generate parameterized samples for $\mathbf{b}_g$ and for each $g(\mathbf{x}_{\mathbf{i}_n}, \mathbf{x}_{\mathbf{i}})$ in $X^{\mathbf{i}}_k$ and $Y^n_k$. We then generate a sample for $t$ from $p(t)$. Now, we substitute all the parameterized samples for the corresponding random variables in $X^{\mathbf{i}}_k$ and $Y^n_k$, and obtain an unbiased stochastic estimate of $\widehat{\mathcal{L}}^+_{k,j,l}$ (a schematic sketch of the reparameterized base-rate sample is given below).
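The following numpy sketch illustrates the reparameterized sample of a background rate $\lambda^0_{\mathbf{i}}$ from the sparse-GP variational posterior, using the conditional mean and variance given above. The squared-exponential kernel and all numeric values are illustrative assumptions; in practice this would be written in an autodiff framework such as PyTorch (as in the experiments), so that gradients flow through $\boldsymbol{\mu}_f$, $L_f$ and the latent factors.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_kernel(A, B, ell=1.0, var=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-d2 / (2 * ell ** 2))

# Toy variational posterior q(b_f) = N(mu_f, L_f L_f^T) over m_f pseudo outputs.
m_f, dim = 3, 6
Z_f = rng.normal(size=(m_f, dim))                   # pseudo inputs
mu_f = rng.normal(size=m_f)
L_f = np.tril(rng.normal(size=(m_f, m_f))) + m_f * np.eye(m_f)

# Reparameterized sample of the pseudo outputs: b_f = mu_f + L_f @ eta.
eta = rng.standard_normal(m_f)
b_f = mu_f + L_f @ eta

# Conditional (sparse-GP) mean and variance of f(x_i) given b_f, then a
# reparameterized sample of the background rate lambda^0_i = mu^0_i + sigma^0_i * eps.
x_i = rng.normal(size=(1, dim))                     # concatenated factors of entry i
K_zz_inv = np.linalg.inv(se_kernel(Z_f, Z_f) + 1e-6 * np.eye(m_f))
k_xz = se_kernel(x_i, Z_f)                          # shape (1, m_f)
mu0_i = (k_xz @ K_zz_inv @ b_f).item()
var0_i = (se_kernel(x_i, x_i) - k_xz @ K_zz_inv @ k_xz.T).item()
lam0_i = mu0_i + np.sqrt(max(var0_i, 0.0)) * rng.standard_normal()
print(mu0_i, var0_i, lam0_i)
```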
We then compute its gradient to obtain an unbiased estimate of the gradient of $\widehat{\mathcal{L}}^+_{k,j,l}$, which in turn is an unbiased estimate of the gradient of $\mathcal{L}^+$. We can then apply any stochastic optimization algorithm to maximize $\mathcal{L}^+$ with this stochastic gradient. Note that the computation of the stochastic gradient only involves the sampled mini-batches; therefore, the cost is greatly reduced.

4.3. Algorithm Complexity

The time complexity of our inference is $O\big(Q(C + D)m_g^3 + (C + D)m_f^3\big)$, where $Q$, $C$ and $D$ are the mini-batch sizes of the three partitions. Therefore, the computational cost is proportional to the mini-batch sizes, rather than determined by the total number of entries $M$ and events $N$. The space complexity is $O\big(m_g^2 + m_f^2 + \sum_{k=1}^K d_k r_k\big)$, which is needed to store the prior and posterior matrices for the pseudo outputs $\mathbf{b}_g$ and $\mathbf{b}_f$ and the latent factors $\mathcal{U}$.

5. Related Work

Classical tensor factorization methods include CP (Harshman, 1970) and Tucker (Tucker, 1966) decompositions, which are multilinear. Many other approaches have been proposed based on them (Shashua and Hazan, 2005; Chu and Ghahramani, 2009; Sutskever et al., 2009; Acar et al., 2011; Hoff, 2011; Kang et al., 2012; Yang and Dunson, 2013; Rai et al., 2014; Choi and Vishwanathan, 2014; Hu et al., 2015a; Rai et al., 2015), to name a few. Recently, several Bayesian nonparametric factorization models (Xu et al., 2012; Zhe et al., 2015; 2016b;a) were proposed to estimate the nonlinear relationships in data. To handle temporal information, current methods either factorize event counts instead (Chi and Kolda, 2012; Hansen et al., 2015; Hu et al., 2015b), or discretize the timestamps into crude steps and then factorize the counts in each step (Xiong et al., 2010; Schein et al., 2015; 2016; 2019). Despite their success, these methods ignore the rich temporal dependencies between the interactions and may miss important temporal patterns. Recently, Zhe and Du (2018) used Hawkes processes to capture fine-grained triggering effects among the interactions. The method outperforms the previous approaches in terms of prediction, and discovers interesting clusters with temporal meanings. However, it ignores another important temporal effect, inhibition, and hence can still fail to capture complex, mixed temporal dependency patterns. In addition, the method sets a local time window to enable efficient computation, but thereby misses the valuable, long-range influences of the events on each other. To overcome these problems, we use the latent factors to construct more flexible random point processes that detect and disentangle a variety of mixed excitation and inhibition effects, encoding them into the latent factors. In addition, we develop an efficient inference algorithm that is able to capture all kinds of short-term and long-term dependencies.

Hawkes processes (HPs) are a popular class of random point processes for studying mutual excitation among events. Many models use HPs to discover temporal relationships, e.g., (Blundell et al., 2012; Tan et al., 2016; Linderman and Adams, 2014; Du et al., 2015; He et al., 2015; Wang et al., 2017). Several pieces of work have also been proposed to improve the learning of HPs, such as nonparametric triggering kernel estimation (Zhou et al., 2013), Granger causality (Xu et al., 2016), short doubly-censored event sequences (Xu et al., 2017) and online estimation (Yang et al., 2017).
Recently, Mei and Eisner (2017) proposed the neural Hawkes process, which uses a continuous-time LSTM to model complex temporal dependencies among the events (e.g., triggering, suppression and their nonlinear interactions). They also tested the use of a softplus function to integrate different temporal effects. Distinct from their work, our method (1) uses the latent factors (rather than free parameters) to construct the point process, so as to encode the temporal effects into the factor representations, and (2) exploits the properties of the rate function to enable efficient inference, especially for large numbers of events and event types.

6. Experiment

6.1. Predictive Performance

Datasets. We first examined the predictive performance on the following three real-world datasets. (1) Taobao (https://tianchi.aliyun.com/dataset/dataDetail?dataId=53), the online shopping behaviours between 07/01/2015 and 11/30/2015 on the largest retail platform of China. We extracted a five-mode tensor (user, seller, item, category, action) of size 980 × 274 × 631 × 58 × 2, including 16,609 entries and 69,833 events. Note that the action can be either "buy" or "click". (2) UFO (https://www.kaggle.com/NUFORC/ufo-sightings/data), the UFO sighting reports in the 20th century. The dataset forms a two-mode event-tensor (UFO shape, city) of size 28 × 13,713, including 45,045 observed entries and 70,418 sighting events. (3) Crash (https://www.kaggle.com/usdot/nhtsa-traffic-fatalities), the reports of fatal traffic crashes in the US in 2015. We extracted a four-mode tensor (state, county, city, landuse-type) of size 51 × 288 × 2,098 × 5. Each entry consists of a sequence of crash events. There are 8,691 entries and 32,052 events in total. Note that while the modes are conceptually nested, the data are well organized as an event-tensor.

Figure 1. Test log-likelihood (LL) on the real-world datasets. HP-Local-{50, 100, 150} means running HP-Local with window size 50, 100 and 150. CPT-PTF-{5, 10, 20} are CPT-PTF with 5, 10 and 20 time steps.

Competing Methods. We compared with the following popular and/or state-of-the-art tensor factorization methods that incorporate temporal information. (1) CP-PTF, homogeneous Poisson process (PP) tensor factorization, which uses CP to factorize the event rate of each entry in the log domain (to ensure positiveness). (2) CPT-PTF, similar to (Schein et al., 2015), which discretizes the timestamps into steps and adds a time mode; time factors are introduced and jointly estimated with the other factors in the CP framework over the log event rates, and we assigned a conditional Gaussian prior (Xiong et al., 2010) to model the dynamics of the time factors. (3) GP-PTF, which uses GPs to estimate the log rate of each entry as a nonlinear function of the associated latent factors; this is the same as our approach in modeling the base rate. (4) CP-NPTF, non-homogeneous Poisson process tensor factorization, where the event rate is modelled as $\lambda_{\mathbf{i}}(t) = t \cdot \exp\big(\mathrm{CP}(\mathbf{i})\big)$, with $\mathrm{CP}(\mathbf{i})$ the CP factorization of entry $\mathbf{i}$. (5) GP-NPTF, identical to CP-NPTF except that we replace CP with GP in the rate modeling.
(6) HP-Local (Zhe and Du, 2018), the Hawkes-process event-tensor decomposition, which uses a local time window to model the rate and to estimate the local triggering effects among neighbouring events.

Experimental settings. For all the competing methods that use GPs, we applied the same sparse GP framework as in our method for scalable inference: we set the number of pseudo inputs to 100 and used the SE-ARD kernel, for which we initialized all the parameters to 1. For a fair comparison, all the methods were initialized with the same set of factors, independently sampled from Uniform(0, 1). We implemented our approach and HP-Local with PyTorch and used ADAM (Kingma and Ba, 2014) for stochastic optimization. For both methods, the mini-batch size was set to 100. The learning rate was chosen from $\{5 \times 10^{-4}, 10^{-3}, 3 \times 10^{-3}, 5 \times 10^{-3}, 10^{-2}\}$. We ran each method for 400 epochs, which was sufficient for convergence. All the other methods were implemented in MATLAB. For training, we used the first 40K, 40K and 20K events from Taobao, UFO and Crash, respectively. The remaining 29.8K, 30.4K and 12K events were used for test. For CPT-PTF, the number of time steps was chosen from {5, 10, 20}. For HP-Local, we varied the window size over {50, 100, 150}. Note that the event sequences for test are very long, so we can examine the performance of our method in capturing long-range temporal effects. We varied the number of factors over {2, 5, 8, 10}. We calculated the test log-likelihood of each method and report the results in Figure 1.

Results. As shown in Fig. 1a-c, our method consistently outperforms all the competing approaches, often by a large margin, which demonstrates the advantage of our method in predictive performance. Note that the results of some methods are not shown because they are much worse than all the other methods; the full results are given in the supplementary material. In general, when we increased the window size, HP-Local obtained better or similar prediction accuracy (e.g., Fig. 1a), showing that estimating long-range temporal dependencies helps improve the prediction. In most cases, the other competing methods, e.g., CP/GP-PTF and CP/GP-NPTF, are far worse than our model and HP-Local. Since these methods are based on homogeneous or non-homogeneous Poisson processes and disregard the temporal dependencies among the interaction events (except that CPT-PTF learns dynamics between the time factors), their results confirm the benefit of our model in capturing a variety of complex temporal dependencies among the events.

Figure 2. Structures of the learned latent factors on Taobao (a, b), Crash (c, d) and UFO (e); panels include (a) sellers, (c) geolocations of the state clusters, and (e) UFO shapes. Colors indicate the cluster memberships.

Figure 3. Structures of the events and their temporal influences in Taobao: (a) within clusters, (b) across clusters. The red/black lines show the excitation/inhibition effects within and across the clusters.

To examine the learning behaviour of our model, we report the test log-likelihood after each epoch on Crash with the number of factors set to 8.
As shown in Fig. 1d, our learning algorithm converges reasonably fast (around 100 epochs) and then remains quite stable after convergence.

6.2. Structure Discovery

Next, we examined whether our method can discover hidden structures in the data. To this end, we set the number of latent factors to 10 and ran our algorithm on Taobao, UFO and Crash. We then applied kernel PCA (Schölkopf et al., 1998) (with an RBF kernel) to project the latent factors onto the x-y plane, and ran k-means to find potential clusters in the (projected) factor representations of the entities (nodes). We used the elbow method (Ketchen and Shook, 1996) to select the number of clusters. As we can see from Fig. 2a-e, these factors reflect clear grouping structures. Note that the Taobao data were anonymized, so we cannot investigate the meaning of those clusters, but they can be potentially useful for tasks like recommendation (Tran et al., 2018) and click-through-rate prediction (Pan et al., 2019). For the Crash data, we show the actual geolocations of the clustered states in Figure 2c. We can see that states grouped together often neighbour each other. The clusters can roughly be seen as the middle (red), the south (blue) and the east/west coast (green). This is reasonable: states close to each other may have similar patterns of traffic crash events and their chain effects (inhibition/excitation), due to similar driving customs, road layouts and weather patterns.

We also looked into the triggering and inhibition effects among the interaction events. To this end, we represented each online shopping event on Taobao by concatenating the latent factors of the participants, i.e., user, seller, item, category and action. We then used kernel PCA to project the event representations onto the x-y plane, and ran k-means to obtain the clusters. We then randomly sampled pairs of events within and across the clusters, and show their triggering/inhibition relationships in red/black. As we can see from Fig. 3a and b, the factor representations of the events present a cluster structure. Interestingly, the effects between events within the same cluster are uniform, either excitation or inhibition (see Fig. 3a), while the effects across different clusters are mixed (see Fig. 3b). This is reasonable in that the event clusters reflect how events of one type influence each other, which is uniform, while between different types (clusters) of events the effects can be mixed. For example, buying an iPhone will inhibit one from buying another cellphone (the same event type), but may stimulate one to buy a phone case (another event type).

7. Conclusion

We have presented a self-modulating nonparametric event-tensor factorization model. Our model can capture various triggering and inhibition effects among the interaction events, in both short and long ranges. Our inference is efficient for large numbers of observed entries and events.

Acknowledgement

This work was supported by the NSF IIS-1910983.

References

Acar, E., Dunlavy, D. M., Kolda, T. G., and Morup, M. (2011). Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41-56.

Blundell, C., Beck, J., and Heller, K. A. (2012). Modelling reciprocating relationships with Hawkes processes. In Advances in Neural Information Processing Systems, pages 2600-2608.
Chi, E. C. and Kolda, T. G. (2012). On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272-1299.

Choi, J. H. and Vishwanathan, S. (2014). DFacTo: Distributed factorization of tensors. In Advances in Neural Information Processing Systems, pages 1296-1304.

Chu, W. and Ghahramani, Z. (2009). Probabilistic models for incomplete multi-dimensional arrays. In AISTATS.

Daley, D. J. and Vere-Jones, D. (2007). An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media.

Du, N., Farajtabar, M., Ahmed, A., Smola, A. J., and Song, L. (2015). Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 219-228. ACM.

Hansen, S., Plantenga, T., and Kolda, T. G. (2015). Newton-based optimization for Kullback-Leibler nonnegative tensor factorizations. Optimization Methods and Software, 30(5):1002-1029.

Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Model and conditions for an explanatory multimode factor analysis. UCLA Working Papers in Phonetics, 16:1-84.

Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83-90.

He, X., Rekatsinas, T., Foulds, J., Getoor, L., and Liu, Y. (2015). HawkesTopic: A joint model for network inference and topic modeling from text-based cascades. In International Conference on Machine Learning, pages 871-880.

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 282-290. AUAI Press.

Hoff, P. (2011). Hierarchical multilinear models for multiway data. Computational Statistics & Data Analysis, 55:530-543.

Hu, C., Rai, P., and Carin, L. (2015a). Zero-truncated Poisson tensor factorization for massive binary tensors. In UAI.

Hu, C., Rai, P., Chen, C., Harding, M., and Carin, L. (2015b). Scalable Bayesian non-negative tensor factorization for massive count data. In Proceedings, Part II, of the European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 9285, ECML PKDD 2015, pages 53-70, New York, NY, USA. Springer-Verlag New York, Inc.

Kang, U., Papalexakis, E., Harpale, A., and Faloutsos, C. (2012). GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 316-324. ACM.

Ketchen, D. J. and Shook, C. L. (1996). The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17(6):441-458.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kolda, T. G. (2006). Multilinear operators for higher-order decompositions, volume 2. United States Department of Energy.

Linderman, S. and Adams, R. (2014). Discovering latent network structure in point process data. In International Conference on Machine Learning, pages 1413-1421.

Mei, H. and Eisner, J. M. (2017). The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6754-6764.
Pan, F., Li, S., Ao, X., Tang, P., and He, Q. (2019). Warm up cold-start advertisements: Improving CTR predictions via learning to learn ID embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 695-704.

Rai, P., Hu, C., Harding, M., and Carin, L. (2015). Scalable probabilistic tensor factorization for binary and count data. In IJCAI.

Rai, P., Wang, Y., Guo, S., Chen, G., Dunson, D., and Carin, L. (2014). Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In Proceedings of the 31st International Conference on Machine Learning (ICML).

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

Schein, A., Linderman, S., Zhou, M., Blei, D., and Wallach, H. (2019). Poisson-randomized gamma dynamical systems. In Advances in Neural Information Processing Systems, pages 781-792.

Schein, A., Paisley, J., Blei, D. M., and Wallach, H. (2015). Bayesian Poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1045-1054. ACM.

Schein, A., Zhou, M., Blei, D. M., and Wallach, H. (2016). Bayesian Poisson Tucker decomposition for learning the structure of international relations. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 2810-2819. JMLR.org.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319.

Shashua, A. and Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 792-799.

Sutskever, I., Tenenbaum, J. B., and Salakhutdinov, R. R. (2009). Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems, pages 1821-1828.

Tan, X., Naqvi, S. A., Qi, A. Y., Heller, K. A., and Rao, V. (2016). Content-based modeling of reciprocal relationships using Hawkes and Gaussian processes. In UAI.

Tran, T., Lee, K., Liao, Y., and Lee, D. (2018). Regularizing matrix factorization with user and item embeddings for recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 687-696.

Tucker, L. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279-311.

Wang, Y., Ye, X., Zha, H., and Song, L. (2017). Predicting user activity level in point processes with mass transport equation. In Advances in Neural Information Processing Systems, pages 1644-1654.

Xiong, L., Chen, X., Huang, T.-K., Schneider, J., and Carbonell, J. G. (2010). Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 211-222. SIAM.

Xu, H., Farajtabar, M., and Zha, H. (2016). Learning Granger causality for Hawkes processes. In International Conference on Machine Learning, pages 1717-1726.

Xu, H., Luo, D., and Zha, H. (2017). Learning Hawkes processes from short doubly-censored event sequences. In International Conference on Machine Learning.
Xu, Z., Yan, F., and Qi, Y. (2012). Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In Proceedings of the 29th International Conference on Machine Learning (ICML).

Yang, Y. and Dunson, D. (2013). Bayesian conditional tensor factorizations for high-dimensional classification. Journal of the Royal Statistical Society B, revision submitted.

Yang, Y., Etesami, J., He, N., and Kiyavash, N. (2017). Online learning for multivariate Hawkes processes. In Advances in Neural Information Processing Systems, pages 4937-4946.

Zhe, S. and Du, Y. (2018). Stochastic nonparametric event-tensor decomposition. In Advances in Neural Information Processing Systems, pages 6856-6866.

Zhe, S., Qi, Y., Park, Y., Xu, Z., Molloy, I., and Chari, S. (2016a). DinTucker: Scaling up Gaussian process models on large multidimensional arrays. In Thirtieth AAAI Conference on Artificial Intelligence.

Zhe, S., Xu, Z., Chu, X., Qi, Y., and Park, Y. (2015). Scalable nonparametric multiway data analysis. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 1125-1134.

Zhe, S., Zhang, K., Wang, P., Lee, K.-c., Xu, Z., Qi, Y., and Ghahramani, Z. (2016b). Distributed flexible nonlinear tensor factorization. In Advances in Neural Information Processing Systems, pages 928-936.

Zhou, K., Zha, H., and Song, L. (2013). Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, pages 1301-1309.