# Stochastic Nonparametric Event-Tensor Decomposition

Shandian Zhe, Yishuai Du
School of Computing, University of Utah
zhe@cs.utah.edu, yishuai.du@utah.edu

## Abstract

Tensor decompositions are fundamental tools for multiway data analysis. Existing approaches, however, ignore the valuable temporal information that comes with data, or simply discretize it into time steps so that important temporal patterns are easily missed. Moreover, most methods are limited to multilinear decomposition forms and hence are unable to capture intricate, nonlinear relationships in data. To address these issues, we formulate event-tensors to preserve the complete temporal information for multiway data, and propose a novel Bayesian nonparametric decomposition model. Our model can (1) fully exploit the time stamps to capture the critical, causal/triggering effects between the interaction events, (2) flexibly estimate the complex relationships between the entities in tensor modes, and (3) uncover hidden structures from their temporal interactions. For scalable inference, we develop a doubly stochastic variational Expectation-Maximization algorithm to conduct an online decomposition. Evaluations on both synthetic and real-world datasets show that our model not only improves upon the predictive performance of existing methods, but also discovers interesting clusters underlying the data.

## 1 Introduction

Tensors represent the high-order interactions between entities in multiway data. Such interactions are ubiquitous in real-world applications. For instance, in online shopping, users purchase commodities under different web contexts; these interactions can be represented by a three-mode tensor (user, commodity, web context). To analyze tensor data, we use decomposition approaches, where we jointly estimate a set of latent factors for each entity and the mapping between the latent factors and tensor entry values. The latent factors can reveal hidden structures of the entities, such as clusters/communities; the mapping characterizes the entities' relationships (in terms of their factor representations) and can be used to predict missing entry values.

Despite the wide success of existing tensor decomposition algorithms (Tucker, 1966; Harshman, 1970; Kang et al., 2012; Choi and Vishwanathan, 2014), most methods assume a simple multilinear decomposition form, which might be insufficient to estimate intricate, nonlinear relationships in data. More importantly, most methods ignore the valuable temporal information that comes with data, or exploit it in a relatively coarse way. For instance, the time stamp of each interaction is usually abandoned and only the counts are used for count tensor decomposition (Chi and Kolda, 2012; Hansen et al., 2015; Hu et al., 2015b). More elegant approaches (Xiong et al., 2010; Schein et al., 2015, 2016) discretize the time stamps into steps, e.g., weeks/months, and use a set of time factors to represent each step. The tensor is hence augmented with a time mode. The decomposition may further use Markov assumptions to encourage smooth transitions between the time factors (Xiong et al., 2010). However, in each time step, the occurrences of the interactions are treated independently. Hence, important temporal patterns, such as causal/triggering effects in adjacent interactions, cannot be well modeled or captured.
To address these issues, we first formulate a new data abstraction, the event-tensor, to preserve all the time stamps for multiway data. In an event-tensor, each entry comprises a sequence of interaction events rather than a numerical value. Second, we propose a powerful Bayesian nonparametric model to decompose event-tensors (Section 3). We hybridize latent Gaussian processes and Hawkes processes to capture the various excitation effects among the observed interaction events, as well as the underlying complex relationships between the entities that participated in the events. Furthermore, we design a novel triggering function that enables discovering clusters of entities (or latent factors) in terms of excitation strengths. In addition, the triggering function allows us to flexibly specify the triggering range (say, via domain knowledge) to better capture local excitations and to control the trade-off against computational cost. Finally, to handle data where both the tensor entries and the interaction events are many, we derive a fully decomposed variational model evidence lower bound by using the Poisson process superposition theorem and the variational sparse Gaussian process framework (Titsias, 2009). Based on this bound, we develop a doubly stochastic variational Expectation-Maximization algorithm that fulfills a scalable, online decomposition (Section 4).

For evaluation, we examined our model in terms of both predictive performance and structure discovery. On three real-world datasets, our model often largely improves upon the prediction accuracy of existing methods that use Poisson processes and/or time factors to incorporate temporal information. A simulation shows that the latent factors estimated by our model clearly reflect the ground-truth clusters, while those estimated by the competing methods do not. We further examined the structures discovered by our model on the real-world datasets and found many interesting patterns, such as groups of 911 accidents with strong associations, locations of townships that are apt to have consecutive accidents, and UFO shapes that are more likely to be sighted together (Section 6).

## 2 Background

**Tensor Decomposition.** We denote a $K$-mode tensor by $\mathcal{M} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$, where $d_k$ is the dimension of the $k$-th mode, corresponding to $d_k$ entities (e.g., users or items). The entry value at location $\mathbf{i} = (i_1, \ldots, i_K)$ is denoted by $m_{\mathbf{i}}$. Given a tensor $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ and a matrix $\mathbf{U} \in \mathbb{R}^{s \times t}$, we can multiply $\mathcal{W}$ by $\mathbf{U}$ at mode $k$ when $r_k = t$. The result is a new tensor of size $r_1 \times \cdots \times r_{k-1} \times s \times r_{k+1} \times \cdots \times r_K$, with each entry computed by
$$(\mathcal{W} \times_k \mathbf{U})_{i_1 \ldots i_{k-1}\, j\, i_{k+1} \ldots i_K} = \sum_{i_k=1}^{r_k} w_{i_1 \ldots i_K}\, u_{j i_k}.$$
For decomposition, we introduce $K$ latent factor matrices, $\mathcal{U} = \{\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(K)}\}$, to represent the entities in each tensor mode; each row $\mathbf{U}^{(k)}(j,:)$ contains the latent factors of the $j$-th entity in mode $k$. The classical Tucker decomposition (Tucker, 1966) incorporates a small core tensor $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ and assumes $\mathcal{M} = \mathcal{W} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_K \mathbf{U}^{(K)}$. If we simplify Tucker decomposition by restricting $r_1 = \cdots = r_K$ and $\mathcal{W}$ to be diagonal, we reduce to the CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970). While many other decomposition methods have been proposed, e.g., (Chu and Ghahramani, 2009; Kang et al., 2012; Choi and Vishwanathan, 2014), most of them are still based on the Tucker/CP forms. However, the multilinear assumptions might be insufficient to capture intricate, highly nonlinear relationships in data.
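To make the mode-$k$ product and the CP form above concrete, here is a minimal NumPy sketch. It is a reading aid under our own naming, not the paper's implementation:

```python
# Minimal sketch of the mode-k tensor-matrix product and CP reconstruction.
# Function and variable names are ours for illustration.
import numpy as np

def mode_k_product(W, U, k):
    """Multiply tensor W by matrix U (s x t) along mode k; requires W.shape[k] == t.

    Implements (W x_k U)_{i1..i_{k-1} j i_{k+1}..iK} = sum_{i_k} w_{i1..iK} u_{j,i_k}.
    """
    Wk = np.moveaxis(W, k, 0)                    # bring mode k to the front: t x (rest)
    front_shape = Wk.shape
    Wk = Wk.reshape(front_shape[0], -1)          # unfold: t x prod(rest)
    out = U @ Wk                                 # contract mode k: s x prod(rest)
    out = out.reshape((U.shape[0],) + front_shape[1:])
    return np.moveaxis(out, 0, k)                # restore the original mode order

def cp_reconstruct(factors):
    """Rebuild a tensor from CP factor matrices U^(1..K), each of shape d_k x R."""
    R = factors[0].shape[1]
    M = np.zeros(tuple(U.shape[0] for U in factors))
    for r in range(R):
        # Rank-1 component: outer product of the r-th columns of all factors.
        component = factors[0][:, r]
        for U in factors[1:]:
            component = np.multiply.outer(component, U[:, r])
        M += component
    return M
```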
Recently, several Bayesian nonparametric tensor decomposition models (Xu et al., 2012; Zhe et al., 2016b) have been proposed, which are flexible enough to capture various nonlinear relationships in data. For example, Zhe et al. (2016b) considered each entry value $m_{\mathbf{i}}$ as a function of the corresponding latent factors, i.e., $m_{\mathbf{i}} = f([\mathbf{U}^{(1)}(i_1,:), \ldots, \mathbf{U}^{(K)}(i_K,:)])$, and placed a Gaussian process (GP) (Rasmussen and Williams, 2006) prior over $f(\cdot)$ to automatically infer the (possible) nonlinearity of $f(\cdot)$. These methods often improve upon the CP/Tucker decompositions by a large margin in missing value prediction.

**Decomposition with Temporal Information.** Practical tensors often come with temporal information, namely the time stamps of the interactions. For example, from a file access log, we can extract not only a three-mode (user, action, file) tensor, but also the time stamps of each user taking an action on a file. To use the temporal information in the decomposition, many methods discard the time stamps, use a Poisson (process) likelihood to model the interaction frequency $m_{\mathbf{i}}$ in each entry $\mathbf{i}$, $p(m_{\mathbf{i}}) \propto e^{-\lambda_{\mathbf{i}} T} \lambda_{\mathbf{i}}^{m_{\mathbf{i}}}$ (Chi and Kolda, 2012; Hu et al., 2015b), and perform the Tucker/CP decomposition over $\{\lambda_{\mathbf{i}}\}$ or $\{\log(\lambda_{\mathbf{i}})\}$. More refined approaches (Xiong et al., 2010; Schein et al., 2015, 2016) first discretize the time stamps into several steps, such as months/weeks, and augment the original tensor with a time mode. A time factor matrix $\mathbf{T}$ is then estimated in the decomposition. While the interactions from different time steps are modeled with distinct time factors, the ones in the same interval are considered independently (given the latent factors), say, being modeled by Poisson likelihoods (Schein et al., 2015, 2016). A Markov assumption might be used to encourage smoothness between the time factors. For example, Xiong et al. (2010) assigned a conditional Gaussian prior over each $\mathbf{T}(k,:)$: $p\big(\mathbf{T}(k,:) \mid \mathbf{T}(k-1,:)\big) = \mathcal{N}\big(\mathbf{T}(k,:) \mid \mathbf{T}(k-1,:),\, \sigma^2\mathbf{I}\big)$.

## 3 Model

Despite the success of existing approaches in exploiting temporal information, they entirely drop the time stamps and hence are unable to capture the important triggering or causal effects between the interactions. Such triggering effects are common in real-world applications. For example, the event that user A purchased commodity B may excite A's friend C to purchase B as well. The triggering effects are usually local and decay fast with time; dropping the time stamps and considering the event occurrences independently makes us unable to model or capture these effects. To address these issues, and hence to further capture the complex relationships and important structures underlying the interaction events, we formulate a new data abstraction, the event-tensor, to preserve all the time stamps. We then propose a powerful Bayesian nonparametric model to decompose event-tensors, discussed as follows.

### 3.1 Event-Tensor Formulation

First, let us look at the definition of event-tensors. To preserve the complete temporal information in decomposition, we relax the requirement that tensors be multidimensional arrays of numerical values. Instead, we define each entry to be a sequence of events, i.e., $m_{\mathbf{i}} = \{s_{\mathbf{i}}^1, \ldots, s_{\mathbf{i}}^{n_{\mathbf{i}}}\}$, where each $s_{\mathbf{i}}^k$ ($1 \le k \le n_{\mathbf{i}}$) is a time stamp at which the interaction happened, and $n_{\mathbf{i}}$ is the count of the events. Note that different entries correspond to distinct types of interaction events, since the involved entities (and hence latent factors) are different. We name this tensor an event-tensor.
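As a reading aid, here is a minimal sketch of one way an event-tensor could be stored in code: each observed entry index maps to its sequence of event time stamps. The names and toy data are ours for illustration; the flattened sequence `S` at the end anticipates the construction described next:

```python
# Minimal sketch of an event-tensor: each observed entry index i = (i_1, ..., i_K)
# maps to the list of time stamps of its interaction events.
from collections import defaultdict

event_tensor = defaultdict(list)

def record_event(entry_index, time_stamp):
    """Append an interaction event to the entry's event sequence."""
    event_tensor[entry_index].append(time_stamp)

# Toy events for a (user, commodity, web-context) tensor:
record_event((0, 5, 2), 1.30)   # user 0 bought item 5 under context 2 at t = 1.30
record_event((0, 5, 2), 4.75)   # the same entry holds a *sequence* of events
record_event((3, 1, 2), 2.10)

# Flatten all entries into a single time-ordered sequence
# S = [(s_1, i_1), ..., (s_N, i_N)], as described next:
S = sorted((s, i) for i, stamps in event_tensor.items() for s in stamps)
```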
Given the observed entries $\{m_{\mathbf{i}}\}$, we can flatten their event sequences to obtain a single sequence $S = [(s_1, \mathbf{i}_1), \ldots, (s_N, \mathbf{i}_N)]$, where $s_1 \le \cdots \le s_N$ are all the time stamps and each $\mathbf{i}_k$ is the entry index of the event $s_k$ ($1 \le k \le N$).

### 3.2 Nonparametric Event-Tensor Decomposition

Now, we consider a probabilistic model for event-tensor decomposition. While Poisson processes (PPs) have many nice properties and are often good choices for modeling events (Schein et al., 2015), they assume event occurrences are independent (i.e., independent increments) and hence are unable to capture the influences of the events on each other. To overcome this limitation, we use a much more expressive point process, the Hawkes process (Hawkes, 1971), to model the events in tensor entries. Given an event sequence $\{t_1, \ldots, t_n\}$, the Hawkes process defines the event rate $\lambda$ as a function of time $t$, $\lambda(t) = \lambda_0 + \sum_{t_i < t} \gamma(t, t_i)$, where $\lambda_0$ is the background rate and $\gamma(\cdot, \cdot)$ is the triggering kernel that quantifies how a past event at $t_i$ excites the occurrence of new events at $t$.
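For intuition, below is a minimal sketch of the Hawkes intensity using the common exponentially decaying kernel $\gamma(t, t_i) = a\,e^{-b(t - t_i)}$. Note that this textbook kernel is only a placeholder here, not the novel triggering function designed in this paper:

```python
# Minimal sketch of a Hawkes-process intensity with an exponential triggering
# kernel gamma(t, t_i) = a * exp(-b * (t - t_i)). The kernel choice and the
# parameter values are illustrative assumptions, not the paper's design.
import math

def hawkes_intensity(t, past_events, lam0=0.5, a=0.8, b=1.0):
    """lambda(t) = lam0 + sum_{t_i < t} a * exp(-b * (t - t_i))."""
    return lam0 + sum(a * math.exp(-b * (t - ti)) for ti in past_events if ti < t)

# Each past event transiently raises the rate, so events cluster in time:
events = [1.0, 1.2, 1.3]
print(hawkes_intensity(1.5, events))   # elevated, shortly after a burst of events
print(hawkes_intensity(9.0, events))   # close to lam0 once the excitations decay
```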