# Multivariate Hawkes Processes for Large-Scale Inference

Rémi Lemonnier,¹,² Kevin Scaman,¹,³ Argyris Kalogeratos¹
¹ CMLA, ENS Paris-Saclay, CNRS, Université Paris-Saclay, France
² Numberly, 1000Mercis group, Paris, France
³ Microsoft Research - Inria Joint Center, Palaiseau, France
{lemonnier,scaman,kalogeratos}@cmla.ens-cachan.fr

In this paper, we present a framework for fitting multivariate Hawkes processes to large-scale problems, both in the number of events in the observed history $n$ and in the number of event types $d$ (i.e. dimensions). The proposed Scalable Low-Rank Hawkes Process (SLRHP) framework introduces a low-rank approximation of the kernel matrix that makes it possible to perform the nonparametric learning of the $d^2$ triggering kernels in at most $O(ndr^2)$ operations, where $r$ is the rank of the approximation ($r \ll d, n$). This is a major improvement over existing state-of-the-art inference algorithms, which require $O(nd^2)$ operations. Furthermore, the low-rank approximation allows SLRHP to learn representative patterns of interaction between event types, which is usually valuable for the analysis of complex processes in real-world networks.

## Introduction

In many real-world phenomena, such as product adoption or information sharing, events exhibit a mutually-exciting behavior, in the sense that the occurrence of one event can increase the occurrence rate of other events. In the field of internet marketing, a client's purchasing behavior on one online shopping website can be, to a large extent, predicted by their past navigation history on other websites. In finance, arrivals of buying and selling orders for different stocks convey information about macroscopic market tendencies. In the study of information propagation, users of a social network share information, which leads to information cascades spreading throughout the social graph.

Over the past few years, the study of point processes has gained attention as the acquisition of such datasets by companies and research laboratories became simpler. However, traditional models for time series analysis, such as discrete-time autoregressive models, do not apply in this context because events happen in continuous time. Multivariate Hawkes processes (MHP) (Oakes 1975; Liniger 2009) have emerged in several fields as the gold standard for dealing with such data, e.g. earthquake prediction (Vere-Jones 1978), biology (Reynaud-Bouret et al. 2014), finance (Bauwens and Hautsch 2009; Alfonsi and Blanc 2015), and studies of social interactions (Crane and Sornette 2008). For an MHP, an event of type $u$ (e.g. a visit to a product's website) occurring at time $t$ increases the conditional occurrence rate of events of type $v$ at time $s \geq t$ (e.g. purchases of that product in the future) by a rate $g_{uv}(s - t)$. Although these processes have been extensively studied from the probabilistic point of view (stability (Brémaud and Massoulié 1996), cluster representation (Bordenave and Torrisi 2007)), their application to real-scale datasets remains quite challenging. For instance, social interaction data is at the same time big (large number of posts), high-dimensional (large number of users), and structured (social network).
Several nonparametric estimation procedures have been proposed for MHPs, relying on approaches such as moment matching (Bacry and Muzy 2014; 2016), least-squares error minimization (Hansen, Reynaud-Bouret, and Rivoirard 2015), or log-likelihood maximization (Zhou, Zha, and Song 2013b). Regardless of the approach each of these works adopts, the dependence of the stochastic occurrence rate at a given time on all past occurrences, and the fact that all $d^2$ triggering kernels $g_{uv}$ need to be estimated, imply that all these methods are quadratic in the number of events $n$ as well as in the number of dimensions $d$. They are therefore quite impractical for large datasets.

Subsequent works naturally aimed at increasing scalability. In order to reduce complexity towards $O(n)$, a nonparametric estimation procedure linear in the number of events was proposed in (Lemonnier and Vayatis 2014), relying on the memoryless property of Hawkes processes with exponential triggering kernels, thus achieving an overall complexity of $O(nd^2)$. Moreover, several other works (Tran et al. 2015; Du et al. 2015; Xu, Farajtabar, and Zha 2016) achieved a complexity of $O(n^2 d)$ by imposing a low-rank structure on the amplitude of the mutual excitation, while keeping a fixed temporal excitation pattern. Although these methods may exhibit a linear complexity in $d$, they only impose a community structure on the network via a low-rank assumption on the adjacency matrix, instead of learning an excitation function for each group independently. Note that this difference is significant, since learning independent excitation functions makes it possible to uncover groups of users sharing a similar role (e.g. influencer vs. influencee), instead of mere clusters of densely connected nodes in the network.

In this paper we introduce the Scalable Low-Rank Hawkes Processes (SLRHP) model for structured point processes, relying on a low-rank decomposition of the triggering kernels, that aims to learn representative patterns of interaction between event types. We also provide the first inference algorithm for SLRHP that has linear complexity in both the total number of events and the number of event types (i.e. dimensions), that is $O(nd)$. The inference is performed by combining minorize-maximization and self-concordant optimization techniques. In addition, if the underlying network of interactions is provided, SLRHP is capable of fully exploiting the network sparsity, which renders it practical for large structured datasets. The major advantage of the proposed SLRHP algorithm is its ability to scale up to datasets much larger than existing state-of-the-art methods can handle, while achieving performance very close to state-of-the-art competitors (in terms of prediction and inference accuracy) on synthetic as well as real datasets.

## Setup and Notations

A multivariate Hawkes process (MHP) $N(t) = \{N_u(t) : u = 1, \dots, d\}$, $t \geq 0$, is a $d$-dimensional counting process, where $N_u(t)$ is the number of events along dimension $u$ that occurred during $[0, t]$. We call an event that occurs along dimension $u$ an event of type $u$. Each one-dimensional counting process $N_u(t)$ can be influenced by the occurrence of events of other types. Without loss of generality, we consider that these mutual excitations take place along the edges of an unweighted directed graph $G = (V, E)$ of $d$ nodes with adjacency matrix $A \in \{0,1\}^{d \times d}$. Note that this setting includes in particular the standard definition of multivariate Hawkes processes.
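To make this setup concrete, the following minimal sketch simulates such a graph-restricted MHP with Ogata's thinning algorithm, using constant natural rates and exponential kernels $g_{uv}(t) = a\,A_{uv}\,\beta e^{-\beta t}$ so that the occurrence rate defined in Eq. 1 below is easy to bound. The function name, kernel choice, and all parameter values are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def simulate_mhp(mu, A, a=0.1, beta=1.0, T=100.0, seed=0):
    """Ogata's thinning for an MHP on a directed graph, with exponential
    kernels g_uv(t) = a * A[u, v] * beta * exp(-beta * t).
    mu: (d,) constant natural rates; A: (d, d) 0/1 adjacency matrix.
    Returns the event history as a list of (t_m, u_m) pairs."""
    rng = np.random.default_rng(seed)
    d, t, events = len(mu), 0.0, []
    excit = np.zeros(d)                       # excitation part of lambda_v(t)
    while True:
        lam_bar = mu.sum() + excit.sum()      # dominates the total rate after t
        dt = rng.exponential(1.0 / lam_bar)   # candidate waiting time
        t += dt
        if t >= T:
            return events
        excit *= np.exp(-beta * dt)           # memoryless decay to time t
        lam = mu + excit                      # lambda_v(t) for every type v
        if rng.uniform() * lam_bar <= lam.sum():        # thinning acceptance
            u = int(rng.choice(d, p=lam / lam.sum()))   # draw the event type
            events.append((t, u))
            excit += a * beta * A[u]          # excite the out-neighbors of u

# Illustrative usage: d = 5 event types on a sparse random directed graph.
rng = np.random.default_rng(1)
A = (rng.uniform(size=(5, 5)) < 0.3).astype(float)
history = simulate_mhp(mu=np.full(5, 0.05), A=A)
```

The process is non-explosive as long as the spectral radius of $a\,A$ stays below 1, which holds here for the small branching factor $a = 0.1$.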
Let $\mathcal{H} = (u_m, t_m)_{m=1}^{n}$ denote the event history of the process, indicating for each event $m$ its type $u_m$ and its occurrence time $t_m$. The non-negative stochastic occurrence rate of each $N_u(t)$ is then defined by:

$$\lambda_u(t) = \mu_u(t) + \sum_{m:\, t_m < t} A_{u_m u}\, g_{u_m u}(t - t_m), \tag{1}$$

where $\mu_u(t) \geq 0$ is the natural occurrence rate of events of type $u$ and $g_{vu} \geq 0$ is the triggering kernel quantifying the excitation that each event of type $v$ exerts on events of type $u$. SLRHP projects the $d$ event types onto a low-dimensional space of $r \ll d$ latent groups through a nonnegative projection matrix $P \in \mathbb{R}_+^{d \times r}$, and shares the natural rates and triggering kernels at the group level: $\mu_u(t) = \sum_i P_{ui}\, \tilde\mu_i(t)$ and $g_{vu}(t) = \sum_{i,j} P_{ui} P_{vj}\, \tilde g_{ji}(t)$. We observe $H$ independent realizations of the process, the $h$-th consisting of $n_h$ events $(u_m^h, t_m^h)$ over a time window $[0, T^h]$. In order to make the learning tractable, the group rates and kernels are approximated on a basis of exponential functions, for $\gamma, \delta > 0$ fixed hyperparameter values:

$$\tilde\mu_i^K(t) = \sum_{k=0}^{K} \beta_{i,k}\, e^{-k\gamma t}; \qquad \tilde g_{ji}^K(t) = \sum_{k=1}^{K} \alpha_{ji,k}\, e^{-k\delta t}. \tag{5}$$

Due to the memoryless property of exponential functions, this approximation allows for log-likelihood computations with complexity linear in the number of events, i.e. $O(n)$, where $n = \sum_{h=1}^{H} n_h$. Results from polynomial approximation theory also ensure fast convergence, with respect to $K$, of the optimal $\tilde\mu_i^K$ and $\tilde g_{ji}^K$ towards the true $\tilde\mu_i$ and $\tilde g_{ji}$. For instance, if $\tilde g_{ji}$ is analytic, the approximation error satisfies $\sup_{t \in [0,T]} |\tilde g_{ji}^K(t) - \tilde g_{ji}(t)| = O(e^{-K})$, which means that, for smooth enough functions, setting $K = 10$ already provides a good approximation.

We therefore search for the values of the parameters $\alpha, \beta$ that maximize the approximated log-likelihood, as well as for the most probable projection matrix $P$, conditionally on the realizations of the process, and under the constraint that the approximated natural rates and triggering kernels remain nonnegative. At a high level, this is formally expressed as:

$$\arg\max_{(P, \alpha, \beta)}\ \tilde{L}(P, \mathcal{H}; \alpha, \beta) \quad \text{s.t.} \quad \forall i, j, t:\ \tilde\mu_i^K(t) \geq 0 \ \text{ and } \ \tilde g_{ji}^K(t) \geq 0. \tag{6}$$

Above, for clarity of notation, we reformulate the log-likelihood by introducing $\tilde{L}$, which makes implicit the dependency of $L$ on the fixed hyperparameters $K$, $\delta$, and $\gamma$ of Eq. 5. Note also that limiting $K$ and $r$ to small values can be seen as a form of regularization, although more refined approaches could be considered when training on datasets of very limited size.

**Simplification with tensor notation.** In order to perform inference efficiently, we now reformulate the log-likelihood using very large and sparse tensors. We also introduce an artificial $(r{+}1)$-th dimension in the embedding space, in order to remove the linear terms of the equation and store the $\beta$ parameters as additional entries of $\alpha$. In detail, let $\alpha_{(r+1)i,k} = \beta_{i,k}$, $\alpha_{j(r+1),k} = 0$, and $P_{(d+1)i} = \mathbb{1}\{i = r{+}1\}$ (where $\mathbb{1}\{\cdot\}$ denotes the indicator function), and, $\forall u \in \{1, \dots, d\}$, $P_{u(r+1)} = 0$. The log-likelihood of the model can then be rewritten as follows:

$$\tilde{L}(P, \mathcal{H}; \alpha) = \sum_{h,m} \ln\Big(\sum_{u,v,i,j,k} P_{ui} P_{vj}\, \alpha_{ji,k}\, D_{h,m,u,v,k}\Big) - \sum_{h,u,v,i,j,k} P_{ui} P_{vj}\, \alpha_{ji,k}\, B_{h,u,v,k}, \tag{7}$$

where

$$B_{h,u,v,k} = \begin{cases} \sum_{m=1}^{n_h} J_{v,u,m}\, f_{k\delta}(T^h - t_m^h) & \text{if } v \leq d;\\ f_{k\gamma}(T^h) & \text{if } v = d+1;\\ 0 & \text{otherwise}, \end{cases} \tag{8}$$

$$D_{h,m,u,v,k} = \begin{cases} \sum_{l=1}^{n_h} I_{h,m,l,u,v}\, e^{-k\delta(t_m^h - t_l^h)} & \text{if } v \leq d;\\ \mathbb{1}\{u_m^h = u\}\, e^{-k\gamma t_m^h} & \text{if } v = d+1;\\ 0 & \text{otherwise}, \end{cases} \tag{9}$$

with $f_{kx}(t) = \frac{1 - e^{-kxt}}{kx}$ for $x \in \{\gamma, \delta\}$, $J_{v,u,m} = \mathbb{1}\{v = u_m^h\}\, A_{vu}$, and $I_{h,m,l,u,v} = \mathbb{1}\{u = u_m^h,\ v = u_l^h,\ t_l^h < t_m^h\}\, A_{vu}$. The optimization of Eq. 6 then alternates between the Hawkes parameters $\alpha$ and the projection matrix $P$.

**Hawkes parameters optimization.** For a fixed projection matrix $P$, we relax the nonnegativity constraints by introducing a small $\epsilon > 0$ and solve:

$$\hat{\alpha} = \arg\max_{\alpha}\ \sum_{h,m} \ln\big(c_{hm}^{\top}\alpha + \epsilon\big) - b^{\top}\alpha, \tag{12}$$

where $c_{hm}$ and $b$ result from contracting the tensors $D$ and $B$ with $P$, i.e. $(c_{hm})_{ji,k} = \sum_{u,v} P_{ui} P_{vj}\, D_{h,m,u,v,k}$ and $b_{ji,k} = \sum_{h,u,v} P_{ui} P_{vj}\, B_{h,u,v,k}$; we write $b(\alpha)_{hm} = c_{hm}^{\top}\alpha$ for the resulting rate at the $m$-th event of realization $h$. A feature of the optimization problem in Eq. 12 is that it verifies the self-concordance property. Self-concordant functions have the advantage of behaving nicely with barrier optimization methods, and are among the rare classes of functions for which explicit convergence rates of Newton methods are known (Boyd and Vandenberghe 2004). This is the reason why we chose to perform the unconstrained optimization using Newton's method, which requires $O(nKr^2 + K^3 r^6)$ operations. Note that, since we have $n$ events and aim to learn $K$ Hawkes parameters per pair of groups, we necessarily have $Kr^2 \leq n$. As for the second term, if we do not have $K^2 r^4 \leq n$, we can reduce the complexity by using quasi-Newton methods, which require only $O(nKr^2 + K^2 r^4) = O(nKr^2)$ operations.
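The linear-in-$n$ complexity rests entirely on the memoryless property invoked above: the per-event sums $S_{u,k}(m) = \sum_{l < m:\, u_l = u} e^{-k\delta(t_m - t_l)}$ that enter $D_{h,m,u,v,k}$ obey a one-step recursion, so each realization is processed in a single pass instead of the $O(n_h^2)$ cost of evaluating every sum from scratch. Below is a minimal sketch for one realization; the function name and the flat per-type indexing are ours, and the actual algorithm works in the $r$-dimensional projected space, replacing $d$ by $r$.

```python
import numpy as np

def excitation_sums(times, types, d, K, delta):
    """S[m, u, k-1] = sum over past events l of type u of
    exp(-k * delta * (t_m - t_l)), for k = 1..K, computed in O(n*d*K) via
    the recursion S(m) = exp(-k*delta*(t_m - t_{m-1})) * (S(m-1) + e_{u_{m-1}})."""
    n, ks = len(times), np.arange(1, K + 1)
    S = np.zeros((n, d, K))
    state = np.zeros((d, K))            # sums, kept current at the last event
    for m in range(1, n):
        state[types[m - 1]] += 1.0      # event m-1 enters with weight e^0 = 1
        state *= np.exp(-ks * delta * (times[m] - times[m - 1]))  # decay to t_m
        S[m] = state
    return S

# Illustrative usage: n = 4 events of d = 2 types, K = 3 exponentials.
S = excitation_sums(np.array([0.0, 0.5, 1.1, 2.0]),
                    np.array([0, 1, 0, 1]), d=2, K=3, delta=1.0)
```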
The computation of $c$, $b$, and $b(\alpha)$ requires multiplying sparse matrices with $O(nK\Delta)$ non-zero elements by a full matrix of $r$ columns, which yields an $O(nK\Delta r)$ complexity. Overall, the complexity of the Hawkes parameters optimization is of the order $O(nKr(\Delta + r))$.

**Projection matrix optimization.** Let $p$ be a reshaping of the projection matrix $P$ into a vector. Then $p$ is updated by solving the following maximization problem:

$$\hat{p} = \arg\max_{p}\ \sum_{h,m} \ln\big(p^{\top}\Xi_{hm}\, p\big) - p^{\top}\Psi\, p, \tag{14}$$

where $2\,(\Xi_{hm})_{ui,vj} = \sum_{k} \big(\alpha_{ji,k}\, D_{h,m,u,v,k} + \alpha_{ij,k}\, D_{h,m,v,u,k}\big)$ and $2\,\Psi_{ui,vj} = \sum_{h,k} \big(\alpha_{ji,k}\, B_{h,u,v,k} + \alpha_{ij,k}\, B_{h,v,u,k}\big)$. The maximization task is performed by a novel minorize-maximization procedure, which is summarized by the following proposition.

**Proposition 1.** *The log-likelihood is non-decreasing under the update:*

$$p^{t+1}_{ui} = p^{t}_{ui}\ \frac{\sum_{h,m}\, (\Xi_{hm}\, p^{t})_{ui} \,\big/\, \big(p^{t\top}\Xi_{hm}\, p^{t}\big)}{(\Psi\, p^{t})_{ui}}. \tag{15}$$

*Furthermore, if $p_{ui}$ is a stable fixed point of Eq. 15, then $p_{ui}$ is a local maximum of the log-likelihood.*

As previously, computing $\Xi$, $\Psi$, and all the matrix-vector products requires $O(nK\Delta r^2)$ operations, and each update necessitates $O(nd)$ operations. As we consider scenarios where there are at least a few events per dimension, the total complexity of the group-affinities optimization is $O(nK\Delta r^2)$. In total, the complexity of our optimization procedure is of the order $O(nK\sigma + nK\Delta r^2)$, and its behavior is linear w.r.t. the number of events and the number of dimensions. If we consider $K$ and $r$ as tunable hyperparameters, the complexity becomes $O(n \max(\sigma, \Delta)) \leq O(nd)$. This complexity seems close to the best achievable without making any further hypothesis. While $O(nd)$ can still be high when studying viral processes occurring over social networks of billions of nodes, the exact expression $O(nK\sigma + nK\Delta r^2)$ can be significantly lower in the case where we observe subcritical cascades ($\sigma \ll d$) occurring over a sparse network ($\Delta \ll d$).
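For intuition, the following toy transcription shows how the multiplicative update of Eq. 15 can be iterated once the matrices $\Xi_{hm}$ and $\Psi$ have been assembled. It uses small dense matrices, whereas the stated complexity relies on their sparse tensor representation; it is a sketch under our reading of Proposition 1, not the authors' implementation, and all names are illustrative.

```python
import numpy as np

def mm_project(p0, Xi_list, Psi, n_iter=100, eps=1e-12):
    """Minorize-maximization for  max_p sum_m log(p^T Xi_m p) - p^T Psi p
    with p >= 0, via the multiplicative update of Eq. 15. Nonnegativity is
    preserved since every factor in the update is nonnegative, and fixed
    points cancel the gradient of the objective."""
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(n_iter):
        num = np.zeros_like(p)
        for Xi in Xi_list:                   # one PSD matrix per event (h, m)
            Xp = Xi @ p
            num += Xp / max(p @ Xp, eps)     # half-gradient of the log terms
        p *= num / np.maximum(Psi @ p, eps)  # Eq. 15, all (u, i) at once
    return p

# Illustrative usage with random PSD stand-ins for the Xi_{hm} matrices.
rng = np.random.default_rng(0)
Xi_list = [m @ m.T for m in rng.uniform(size=(4, 6, 6))]
p_hat = mm_project(np.ones(6), Xi_list, Psi=len(Xi_list) * np.eye(6))
```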
Figure 1: (a) True and inferred triggering kernels $\tilde g_{ij}$ and natural occurrence rates $\tilde\mu_i$ for the synthetic dataset. (b) Low-dimensional embedding of the event types learned by SLRHP in the synthetic dataset. The two groups (blue and green) of event types are successfully identified.

## Experiments

### Synthetic data

In this section we illustrate the validity and precision of our method in learning the diffusion parameters of simulated Hawkes processes. More specifically, we simulate MHPs such that the event types are separated into two groups with similar activation patterns. In the context of social networks, these groups may encode influencer-influencee types of relations. We show that our inference algorithm can recover the groups and the corresponding triggering kernels consistently and with high accuracy. Note that SLRHP is more generic than this specific setting, since many notions of data structure can be captured by the low-rank approximation. However, we believe that the reported scenario is simple and intuitive, and may therefore provide a clear overview of the capabilities of our approach.

**Data generation procedure.** The procedure employed for generating the synthetic datasets is the following. We assume that the MHPs take place on a random Erdős-Rényi (Erdős and Rényi 1976) network of $d = 100$ event types, whose adjacency matrix $A$ is generated with parameter $p = 0.1$ (i.e. 10 neighbors on average). Then, we consider two distinct groups of event types and assign each event type to one of the groups at random. The natural occurrence rate $\tilde\mu_i$ of each group is fixed to a constant value chosen uniformly over $[0, 0.01]$. The triggering kernels between two groups $i$ and $j$ are generated as:

$$\tilde g_{ij}(t) = \nu_{ij}\ \frac{\sin\!\big(\frac{2\pi t}{\omega_{ij}} + \frac{\pi}{2}\,((i+j) \bmod 2)\big) + 2}{3\,(t+1)^2}, \tag{16}$$

where $\omega_{ij}$ and $\nu_{ij}$ are sampled uniformly over $[1, 10]$ and $[0, 1/50]$, respectively. These parameter intervals are chosen so that the behavior of the generated process is non-explosive (Daley and Vere-Jones 2007). The rationale behind the kernels in Eq. 16 is that they present a power-law decreasing intensity that allows long-term influence with a periodic behavior. This kind of dynamics could, for instance, represent the daytime cycles of internet users.

**Results.** Following the above procedure, we generate 8 datasets by sampling 8 different sets of parameters $\{(\omega_{ij}, \nu_{ij})_{i \leq r, j \leq r}, (\tilde\mu_i)_{i \leq r}\}$. Finally, we simulate $10^5$ i.i.d. realizations of the resulting Hawkes process, which we use as training set. The ability of SLRHP to recover the true group triggering kernels $\tilde g_{ij}$ is shown in Fig. 1(a) and evaluated by means of the normalized $L_2$ error $\frac{1}{r^2} \sum_{i,j} \frac{\|\tilde g_{ij} - \hat g_{ij}\|_2}{\|\tilde g_{ij}\|_2 + \|\hat g_{ij}\|_2}$. On average, this error is only 12.9%, with a minimum of 9.2% and a maximum of 18.9% amongst the 8 sample datasets. Moreover, the figure visually compares the inferred to the true natural occurrence rates and triggering kernel functions. In order to find the group assignments, we infer the parameters of an SLRHP of rank $r = 2$ and recover the group structure by running a clustering algorithm on the projected event types. Then, choosing the centers of the two clusters as the basis of the two-dimensional space enables the recovery of the group triggering kernels. Fig. 1(b) shows the 2D embedding learned by our inference algorithm for one of the 8 sample datasets. Two well-separated clusters appear, which indicates that the group assignments were perfectly recovered. The other 7 datasets gave similar results. These results provide strong evidence of the validity of our algorithm for inferring the underlying dynamics of MHPs in situations where there is structure in the interactions between the dimensions.

### Results on the MemeTracker dataset

Our final set of experiments is conducted on the MemeTracker (Leskovec and Krevl 2014) dataset, a benchmark corpus of $9.6 \times 10^6$ blog posts published between August 2008 and April 2009. We use the posts from the period August 2008 to December 2008 as training set, and evaluate our models on the four remaining months. An event for website $u$ is defined as the creation of a post on website $u$ containing a hyperlink towards any other website. We also consider that an edge exists between two websites if at least one hyperlink exists between them in the training set. In order to compare the inference algorithms on datasets of different sizes, prediction was performed on four subsets of the MemeTracker dataset, described in Table 1.

Table 1: Experiments on MemeTracker subsets. AUC (%) and Accuracy (%) are reported for predicting the next event to happen, using SLRHP, MEMIP, and the NAIVE approach. The experiments denoted with '*' did not finish in reasonable time.
| Name | thd | $n$ | $d$ | SLRHP time (s) | MEMIP time (s) | SLRHP AUC | MEMIP AUC | NAIVE AUC | SLRHP Acc. | MEMIP Acc. | NAIVE Acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MT1 | 50000 | 7311 | 13 | 8.34 | 3.16 | 86.3 | 86.4 | 86.1 | 99.2 | 99.2 | 93.1 |
| MT2 | 10000 | 74474 | 80 | 281 | $7.14 \times 10^3$ | 90.8 | 92.6 | 84.4 | 89.8 | 92.7 | 70.6 |
| MT3 | 5000 | 277914 | 172 | $1.95 \times 10^3$ | $1.74 \times 10^5$ | 86.2 | 91.9 | 81.2 | 87.0 | 91.6 | 67.7 |
| MT4 | 1000 | 875402 | 1075 | $3.77 \times 10^5$ | * | 87.0 | * | 85.2 | 84.7 | * | 81.3 |

The four subsets (smaller to larger) are MT1, MT2, MT3, and MT4. These subsets are created by removing the events taking place on websites that appear fewer than a fixed number of times in the training set. This threshold value (thd in Tab. 1) is, respectively, 50000, 10000, 5000, and 1000.

**Prediction task.** The task consists of predicting the next website to create a post. More specifically, for each event of the test dataset, we are interested in predicting the website on which it will take place, knowing its time of occurrence. For MEMIP and SLRHP, the prediction is achieved by scoring the websites according to $\lambda_u(t_m)$, since this value is proportional to the theoretical conditional probability for event $m$ to be of type $u$. We evaluate the predictions with two metrics: the area under the ROC curve (AUC) and a classification accuracy with a fixed number of candidate types. Due to the high bias towards major news websites (e.g. CNN), the number of candidate types has to be relatively large to see differences in the performance of the algorithms, and we set this value to 30% of the total number of event types $d$ in our experiments. This means that we consider a prediction successful if the website that eventually fires was ranked by the algorithm in the top 30% of candidates.

**Baselines.** In the following experiments, we use as main competitor the state-of-the-art MEMIP algorithm (Lemonnier and Vayatis 2014), which is, to the best of our knowledge, the only inference algorithm with linear complexity in the number of events $n$ in the training history. Previous work (Lemonnier and Vayatis 2014) also shows that this algorithm outperforms the more standard inference algorithm MMEL (Zhou, Zha, and Song 2013b) on the MemeTracker dataset. In addition, we use the NAIVE baseline, which ranks the nodes according to their frequency of appearance in the training set. Note that this is equivalent to fitting a Poisson process and, hence, does not consider mutual excitation.

**Results.** Tab. 1 summarizes the experimental results comparing the proposed SLRHP against the MEMIP and NAIVE algorithms on four subsets of the MemeTracker dataset. Each row of the table describes a dataset's characteristics, and for each method it provides the training time, AUC, and accuracy with the best parameter settings (for SLRHP, $K = 6$ and $r = 2$, except for MT3 for which $r = 3$). On small to medium-sized datasets (MT1, MT2, MT3), SLRHP achieves predictive performance on par with its main competitor MEMIP, while being orders of magnitude faster. On the larger MT4 dataset, SLRHP still runs in reasonable time while outperforming the NAIVE baseline. Note that MEMIP could not be computed in reasonable time (less than a few days) for this dataset.
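To make the evaluation protocol concrete, the sketch below computes the fixed-candidate-set accuracy reported in Table 1: for each test event, every event type is scored (for SLRHP and MEMIP the score would be the fitted intensity $\lambda_u(t_m)$; for NAIVE, the training frequency), and a prediction counts as correct when the true type ranks in the top 30% of candidates. Function and variable names are ours, not the paper's.

```python
import numpy as np

def top_share_accuracy(scores, true_types, share=0.3):
    """scores: (n_events, d) per-type scores at each test event's time;
    true_types: (n_events,) the type that actually fired. Counts a hit
    when the true type is among the top `share` fraction of candidates."""
    n_events, d = scores.shape
    k = max(1, int(share * d))                  # size of the candidate set
    top = np.argsort(-scores, axis=1)[:, :k]    # best-scored types first
    return float((top == true_types[:, None]).any(axis=1).mean())

# Sanity check: random scores should land near the 30% chance level.
rng = np.random.default_rng(0)
acc = top_share_accuracy(rng.uniform(size=(2000, 50)),
                         rng.integers(0, 50, size=2000))
```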
Figure 2: Experimental results on real data: (a) A log-log plot of the training time (secs) for the SLRHP and MEMIP algorithms against the quantity $nd$. The linear behavior for SLRHP and the super-linear behavior for MEMIP are clearly visible. (b) Sensitivity analysis of SLRHP accuracy w.r.t. the rank $r$ of the approximation used for inference, and a comparison to the best scores of the MMEL and NAIVE baselines on the MT3 dataset. (c) Low-dimensional embedding of the event types learned by SLRHP for the MT3 dataset with $r = 2$.

Fig. 2(a) shows, in log-log scale, the computational time needed by the inference algorithms on all the MemeTracker datasets, with respect to $nd$. This time is indeed linear in $nd$ for SLRHP, while super-linear for the state-of-the-art competitor from the related literature. Fig. 2(b) indicates that the accuracy measurements are relatively stable with regard to the rank $r$ of the approximation, with a maximum at $r = 3$. Finally, Fig. 2(c) shows the 2D embedding learned by SLRHP for the MT3 dataset with $r = 2$. In the embedding space, the websites seem to align along the axes, with varying amplitudes. This may indicate that the algorithm recovered two different groups, each representing similar activities, although with large variability in the activity of the websites inside each group.

## Conclusion

This work focused on modeling multivariate time series in which a very large number of event types can occur and a very large number of historical observations are available for training. The introduced framework is called Scalable Low-Rank Hawkes Processes (SLRHP), for which we developed a novel inference algorithm for parameter estimation. Theoretical complexity analysis as well as experimental results show that our approach is highly scalable, while remaining competitive in predictive performance with state-of-the-art inference algorithms.

## Acknowledgments

This research is part of the SODATECH project funded by the French Government within the "Investments for the Future: Big Data" program.

## References

Alfonsi, A., and Blanc, P. 2015. Dynamic optimal execution in a mixed-market-impact Hawkes price model. Finance and Stochastics 20(1):183–218.

Bacry, E., and Muzy, J.-F. 2014. Hawkes model for price and trades high-frequency dynamics. Quantitative Finance 14(7):1147–1166.

Bacry, E., and Muzy, J.-F. 2016. First- and second-order statistics characterization of Hawkes processes and non-parametric estimation. IEEE Transactions on Information Theory 62(4):2184–2202.

Bauwens, L., and Hautsch, N. 2009. Modelling Financial High Frequency Data Using Point Processes. Springer.

Bordenave, C., and Torrisi, G. L. 2007. Large deviations of Poisson cluster processes. Stochastic Models 23(4):593–625.

Boyd, S. P., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Brémaud, P., and Massoulié, L. 1996. Stability of nonlinear Hawkes processes. The Annals of Probability 1563–1588.

Crane, R., and Sornette, D. 2008. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105(41):15649–15653.

Daley, D. J., and Vere-Jones, D. 2007. An Introduction to the Theory of Point Processes. Springer.

Du, N.; Wang, Y.; He, N.; Sun, J.; and Song, L. 2015. Time-sensitive recommendation from recurrent user activities. In Advances in Neural Information Processing Systems 28 (NIPS), 3492–3500.

Erdős, P., and Rényi, A. 1976. On the evolution of random graphs. Selected Papers of Alfréd Rényi 2:482–525.

Hansen, N. R.; Reynaud-Bouret, P.; and Rivoirard, V. 2015. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli 21(1):83–143.

Lemonnier, R., and Vayatis, N. 2014. Nonparametric Markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate Hawkes processes. In Machine Learning and Knowledge Discovery in Databases. Springer, 161–176.
Lemonnier, R.; Scaman, K.; and Kalogeratos, A. 2017. Multivariate Hawkes processes for large-scale inference: Supplementary material. Available at: http://kalogeratos.com/psite/files/My Papers/SLRHPappendix.pdf.

Leskovec, J., and Krevl, A. 2014. SNAP Datasets: Stanford large network dataset collection. Available at: http://snap.stanford.edu/data.

Liniger, T. J. 2009. Multivariate Hawkes Processes. Ph.D. dissertation, Eidgenössische Technische Hochschule (ETH) Zürich, Nr. 18403.

Nesterov, Y.; Nemirovskii, A. S.; and Ye, Y. 1994. Interior-Point Polynomial Algorithms in Convex Programming, volume 13. SIAM.

Oakes, D. 1975. The Markovian self-exciting process. Journal of Applied Probability 69–77.

Reynaud-Bouret, P.; Rivoirard, V.; Grammont, F.; and Tuleau-Malot, C. 2014. Goodness-of-fit tests and nonparametric adaptive estimation for spike train analysis. Journal of Mathematical Neurosciences 4:3.

Tran, L.; Farajtabar, M.; Song, L.; and Zha, H. 2015. NetCodec: Community detection from individual activities. In Proceedings of the SIAM International Conference on Data Mining, 91–99.

Vere-Jones, D. 1978. Earthquake prediction: A statistician's view. Journal of Physics of the Earth 26(2):129–146.

Xu, H.; Farajtabar, M.; and Zha, H. 2016. Learning Granger causality for Hawkes processes. In Proceedings of the International Conference on Machine Learning (ICML), 1717–1726.

Zhou, K.; Zha, H.; and Song, L. 2013a. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 641–649.

Zhou, K.; Zha, H.; and Song, L. 2013b. Learning triggering kernels for multi-dimensional Hawkes processes. In Proceedings of the International Conference on Machine Learning (ICML), 1301–1309.