# Multivariate Hawkes Processes for Large-Scale Inference

Rémi Lemonnier,¹,² Kevin Scaman,¹,³ Argyris Kalogeratos¹
¹ CMLA, ENS Paris-Saclay, CNRS, Université Paris-Saclay, France
² Numberly, 1000Mercis group, Paris, France
³ Microsoft Research - Inria Joint Center, Palaiseau, France
{lemonnier,scaman,kalogeratos}@cmla.ens-cachan.fr

In this paper, we present a framework for fitting multivariate Hawkes processes to large-scale problems, both in the number of events in the observed history $n$ and in the number of event types $d$ (i.e. dimensions). The proposed Scalable Low-Rank Hawkes Process (SLRHP) framework introduces a low-rank approximation of the kernel matrix that makes it possible to perform the nonparametric learning of the $d^2$ triggering kernels in at most $O(ndr^2)$ operations, where $r$ is the rank of the approximation ($r \ll d, n$). This is a major improvement over existing state-of-the-art inference algorithms, which require $O(nd^2)$ operations. Furthermore, the low-rank approximation allows SLRHP to learn representative patterns of interaction between event types, which is usually valuable for the analysis of complex processes in real-world networks.

## Introduction

In many real-world phenomena, such as product adoption or information sharing, events exhibit a mutually-exciting behavior, in the sense that the occurrence of one event can increase the occurrence rate of other events. In the field of internet marketing, a client's purchasing behavior on one online shopping website can be, to a large extent, predicted by their past navigation history on other websites. In finance, arrivals of buying and selling orders for different stocks convey information about macroscopic market tendencies. In the study of information propagation, users of a social network share information, which leads to information cascades spreading throughout the social graph.

Over the past few years, the study of point processes has gained attention as the acquisition of such datasets by companies and research laboratories became simpler. However, traditional models for time series analysis, such as discrete-time autoregressive models, do not apply in this context because events happen in continuous time. Multivariate Hawkes processes (MHP) (Oakes 1975; Liniger 2009) have emerged in several fields as the gold standard for dealing with such data, e.g. earthquake prediction (Vere-Jones 1978), biology (Reynaud-Bouret et al. 2014), finance (Bauwens and Hautsch 2009; Alfonsi and Blanc 2015), and studies of social interactions (Crane and Sornette 2008). For an MHP, an event of type $u$ (e.g. a visit to a product's website) occurring at time $t$ increases the conditional occurrence rate of events of type $v$ at time $s \geq t$ (e.g. purchases of that product in the future) by a rate $g_{uv}(s - t)$. Although these processes have been extensively studied from the probabilistic point of view (stability (Brémaud and Massoulié 1996), cluster representation (Bordenave and Torrisi 2007)), their application to real-scale datasets remains quite challenging. For instance, social interaction data is at the same time big (large number of posts), high-dimensional (large number of users), and structured (social network).
Several nonparametric estimation procedures have been proposed for MHPs, relying on approaches such as moment matching (Bacry and Muzy 2014; 2016), least-squares error minimization (Hansen, Reynaud-Bouret, and Rivoirard 2015), or log-likelihood maximization (Zhou, Zha, and Song 2013b). Regardless of the approach each of these works adopts, the dependence of the stochastic occurrence rate at a given time on all past occurrences, and the fact that all $d^2$ triggering kernels $g_{uv}$ need to be estimated, imply that all these methods are quadratic in the number of events $n$ as well as in the number of dimensions $d$. They are therefore quite impractical for large datasets.

Subsequent works naturally aimed at increasing scalability. In order to reduce complexity towards $O(n)$, a nonparametric estimation procedure linear in the number of events was proposed in (Lemonnier and Vayatis 2014), relying on the memoryless property of Hawkes processes with exponential triggering kernels, thus achieving an overall complexity of $O(nd^2)$. Moreover, several other works (Tran et al. 2015; Du et al. 2015; Xu, Farajtabar, and Zha 2016) achieved a complexity of $O(n^2 d)$ by imposing a low-rank structure on the amplitude of the mutual excitation, while keeping a fixed temporal excitation pattern. Although these methods may exhibit a linear complexity in $d$, they only impose a community structure on the network via a low-rank assumption on the adjacency matrix, instead of learning an excitation function for each group independently. Note that this difference is significant, since learning independent excitation functions makes it possible to uncover groups of users sharing a similar role (e.g. influencer vs. influencee), instead of mere clusters of densely connected nodes in the network.

In this paper we introduce the Scalable Low-Rank Hawkes Processes (SLRHP) model for structured point processes, relying on a low-rank decomposition of the triggering kernels, that aims to learn representative patterns of interaction between event types. We also provide the first inference algorithm for SLRHP that has linear complexity in both the total number of events and the number of event types (i.e. dimensions), that is $O(nd)$. The inference is performed by combining minorize-maximization and self-concordant optimization techniques. In addition, if the underlying network of interactions is provided, SLRHP is capable of fully exploiting the network sparsity, which renders it practical for large structured datasets. The major advantage of the proposed SLRHP algorithm is its ability to scale up to datasets much larger than existing state-of-the-art methods can handle, while achieving performance very close to state-of-the-art competitors (in terms of prediction and inference accuracy) on synthetic as well as real datasets.

## Setup and Notations

A multivariate Hawkes process (MHP) $N(t) = \{N_u(t) : u = 1, \dots, d\}$, $t \geq 0$, is a $d$-dimensional counting process, where $N_u(t)$ is the number of events along dimension $u$ that occurred during $[0, t]$. We call an event that occurs along dimension $u$ an event of type $u$. Each one-dimensional counting process $N_u(t)$ can be influenced by the occurrence of events of other types. Without loss of generality, we consider that these mutual excitations take place along the edges of an unweighted directed graph $G = (V, E)$ of $d$ nodes with adjacency matrix $A \in \{0,1\}^{d \times d}$. Note that this setting includes in particular the standard definition of multivariate Hawkes processes.
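To make this setup concrete, the following minimal sketch simulates such a graph-restricted MHP with Ogata's thinning algorithm, using constant natural rates and exponential kernels $g_{uv}(t) = a\,A_{uv}\,\beta e^{-\beta t}$ so that the occurrence rate defined in Eq. 1 below is easy to bound. The function name, kernel choice, and all parameter values are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def simulate_mhp(mu, A, a=0.1, beta=1.0, T=100.0, seed=0):
    """Ogata's thinning for an MHP on a directed graph, with exponential
    kernels g_uv(t) = a * A[u, v] * beta * exp(-beta * t).
    mu: (d,) constant natural rates; A: (d, d) 0/1 adjacency matrix.
    Returns the event history as a list of (t_m, u_m) pairs."""
    rng = np.random.default_rng(seed)
    d, t, events = len(mu), 0.0, []
    excit = np.zeros(d)                       # excitation part of lambda_v(t)
    while True:
        lam_bar = mu.sum() + excit.sum()      # dominates the total rate after t
        dt = rng.exponential(1.0 / lam_bar)   # candidate waiting time
        t += dt
        if t >= T:
            return events
        excit *= np.exp(-beta * dt)           # memoryless decay to time t
        lam = mu + excit                      # lambda_v(t) for every type v
        if rng.uniform() * lam_bar <= lam.sum():        # thinning acceptance
            u = int(rng.choice(d, p=lam / lam.sum()))   # draw the event type
            events.append((t, u))
            excit += a * beta * A[u]          # excite the out-neighbors of u

# Illustrative usage: d = 5 event types on a sparse random directed graph.
rng = np.random.default_rng(1)
A = (rng.uniform(size=(5, 5)) < 0.3).astype(float)
history = simulate_mhp(mu=np.full(5, 0.05), A=A)
```

The process is non-explosive as long as the spectral radius of $a\,A$ stays below 1, which holds here for the small branching factor $a = 0.1$.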
Let $\mathcal{H} = (u_m, t_m)_{m=1}^{n}$ denote the event history of the process, indicating for each event $m$ its type $u_m$ and its occurrence time $t_m$. The non-negative stochastic occurrence rate of each $N_u(t)$ is then defined by:

$$\lambda_u(t) = \mu_u(t) + \sum_{m:\, t_m < t} A_{u_m u}\, g_{u_m u}(t - t_m), \tag{1}$$

where $\mu_u(t) \geq 0$ is the natural occurrence rate of events of type $u$ and $g_{vu} \geq 0$ is the triggering kernel quantifying the excitation that each event of type $v$ exerts on events of type $u$. SLRHP projects the $d$ event types onto a low-dimensional space of $r \ll d$ latent groups through a nonnegative projection matrix $P \in \mathbb{R}_+^{d \times r}$, and shares the natural rates and triggering kernels at the group level: $\mu_u(t) = \sum_i P_{ui}\, \tilde\mu_i(t)$ and $g_{vu}(t) = \sum_{i,j} P_{ui} P_{vj}\, \tilde g_{ji}(t)$. We observe $H$ independent realizations of the process, the $h$-th consisting of $n_h$ events $(u_m^h, t_m^h)$ over a time window $[0, T^h]$. In order to make the learning tractable, the group rates and kernels are approximated on a basis of exponential functions, for $\gamma, \delta > 0$ fixed hyperparameter values:

$$\tilde\mu_i^K(t) = \sum_{k=0}^{K} \beta_{i,k}\, e^{-k\gamma t}; \qquad \tilde g_{ji}^K(t) = \sum_{k=1}^{K} \alpha_{ji,k}\, e^{-k\delta t}. \tag{5}$$

Due to the memoryless property of exponential functions, this approximation allows for log-likelihood computations with complexity linear in the number of events, i.e. $O(n)$, where $n = \sum_{h=1}^{H} n_h$. Results from polynomial approximation theory also ensure fast convergence, with respect to $K$, of the optimal $\tilde\mu_i^K$ and $\tilde g_{ji}^K$ towards the true $\tilde\mu_i$ and $\tilde g_{ji}$. For instance, if $\tilde g_{ji}$ is analytic, the approximation error satisfies $\sup_{t \in [0,T]} |\tilde g_{ji}^K(t) - \tilde g_{ji}(t)| = O(e^{-K})$, which means that, for smooth enough functions, setting $K = 10$ already provides a good approximation.

We therefore search for the values of the parameters $\alpha, \beta$ that maximize the approximated log-likelihood, as well as for the most probable projection matrix $P$, conditionally on the realizations of the process, and under the constraint that the approximated natural rates and triggering kernels remain nonnegative. At a high level, this is formally expressed as:

$$\arg\max_{(P, \alpha, \beta)}\ \tilde{L}(P, \mathcal{H}; \alpha, \beta) \quad \text{s.t.} \quad \forall i, j, t:\ \tilde\mu_i^K(t) \geq 0 \ \text{ and } \ \tilde g_{ji}^K(t) \geq 0. \tag{6}$$

Above, for clarity of notation, we reformulate the log-likelihood by introducing $\tilde{L}$, which makes implicit the dependency of $L$ on the fixed hyperparameters $K$, $\delta$, and $\gamma$ of Eq. 5. Note also that limiting $K$ and $r$ to small values can be seen as a form of regularization, although more refined approaches could be considered when training on datasets of very limited size.

**Simplification with tensor notation.** In order to perform inference efficiently, we now reformulate the log-likelihood using very large and sparse tensors. We also introduce an artificial $(r{+}1)$-th dimension in the embedding space, in order to remove the linear terms of the equation and store the $\beta$ parameters as additional entries of $\alpha$. In detail, let $\alpha_{(r+1)i,k} = \beta_{i,k}$, $\alpha_{j(r+1),k} = 0$, and $P_{(d+1)i} = \mathbb{1}\{i = r{+}1\}$ (where $\mathbb{1}\{\cdot\}$ denotes the indicator function), and, $\forall u \in \{1, \dots, d\}$, $P_{u(r+1)} = 0$. The log-likelihood of the model can then be rewritten as follows:

$$\tilde{L}(P, \mathcal{H}; \alpha) = \sum_{h,m} \ln\Big(\sum_{u,v,i,j,k} P_{ui} P_{vj}\, \alpha_{ji,k}\, D_{h,m,u,v,k}\Big) - \sum_{h,u,v,i,j,k} P_{ui} P_{vj}\, \alpha_{ji,k}\, B_{h,u,v,k}, \tag{7}$$

where

$$B_{h,u,v,k} = \begin{cases} \sum_{m=1}^{n_h} J_{v,u,m}\, f_{k\delta}(T^h - t_m^h) & \text{if } v \leq d;\\ f_{k\gamma}(T^h) & \text{if } v = d+1;\\ 0 & \text{otherwise}, \end{cases} \tag{8}$$

$$D_{h,m,u,v,k} = \begin{cases} \sum_{l=1}^{n_h} I_{h,m,l,u,v}\, e^{-k\delta(t_m^h - t_l^h)} & \text{if } v \leq d;\\ \mathbb{1}\{u_m^h = u\}\, e^{-k\gamma t_m^h} & \text{if } v = d+1;\\ 0 & \text{otherwise}, \end{cases} \tag{9}$$

with $f_{kx}(t) = \frac{1 - e^{-kxt}}{kx}$ for $x \in \{\gamma, \delta\}$, $J_{v,u,m} = \mathbb{1}\{v = u_m^h\}\, A_{vu}$, and $I_{h,m,l,u,v} = \mathbb{1}\{u = u_m^h,\ v = u_l^h,\ t_l^h < t_m^h\}\, A_{vu}$. The optimization of Eq. 6 then alternates between the Hawkes parameters $\alpha$ and the projection matrix $P$.

**Hawkes parameters optimization.** For a fixed projection matrix $P$, we relax the nonnegativity constraints by introducing a small $\epsilon > 0$ and solve:

$$\hat{\alpha} = \arg\max_{\alpha}\ \sum_{h,m} \ln\big(c_{hm}^{\top}\alpha + \epsilon\big) - b^{\top}\alpha, \tag{12}$$

where $c_{hm}$ and $b$ result from contracting the tensors $D$ and $B$ with $P$, i.e. $(c_{hm})_{ji,k} = \sum_{u,v} P_{ui} P_{vj}\, D_{h,m,u,v,k}$ and $b_{ji,k} = \sum_{h,u,v} P_{ui} P_{vj}\, B_{h,u,v,k}$; we write $b(\alpha)_{hm} = c_{hm}^{\top}\alpha$ for the resulting rate at the $m$-th event of realization $h$. A feature of the optimization problem in Eq. 12 is that it verifies the self-concordance property. Self-concordant functions have the advantage of behaving nicely with barrier optimization methods, and are among the rare classes of functions for which explicit convergence rates of Newton methods are known (Boyd and Vandenberghe 2004). This is the reason why we chose to perform the unconstrained optimization using Newton's method, which requires $O(nKr^2 + K^3 r^6)$ operations. Note that, since we have $n$ events and aim to learn $K$ Hawkes parameters per pair of groups, we necessarily have $Kr^2 \leq n$. As for the second term, if we do not have $K^2 r^4 \leq n$, we can reduce the complexity by using quasi-Newton methods, which require only $O(nKr^2 + K^2 r^4) = O(nKr^2)$ operations.
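The linear-in-$n$ complexity rests entirely on the memoryless property invoked above: the per-event sums $S_{u,k}(m) = \sum_{l < m:\, u_l = u} e^{-k\delta(t_m - t_l)}$ that enter $D_{h,m,u,v,k}$ obey a one-step recursion, so each realization is processed in a single pass instead of the $O(n_h^2)$ cost of evaluating every sum from scratch. Below is a minimal sketch for one realization; the function name and the flat per-type indexing are ours, and the actual algorithm works in the $r$-dimensional projected space, replacing $d$ by $r$.

```python
import numpy as np

def excitation_sums(times, types, d, K, delta):
    """S[m, u, k-1] = sum over past events l of type u of
    exp(-k * delta * (t_m - t_l)), for k = 1..K, computed in O(n*d*K) via
    the recursion S(m) = exp(-k*delta*(t_m - t_{m-1})) * (S(m-1) + e_{u_{m-1}})."""
    n, ks = len(times), np.arange(1, K + 1)
    S = np.zeros((n, d, K))
    state = np.zeros((d, K))            # sums, kept current at the last event
    for m in range(1, n):
        state[types[m - 1]] += 1.0      # event m-1 enters with weight e^0 = 1
        state *= np.exp(-ks * delta * (times[m] - times[m - 1]))  # decay to t_m
        S[m] = state
    return S

# Illustrative usage: n = 4 events of d = 2 types, K = 3 exponentials.
S = excitation_sums(np.array([0.0, 0.5, 1.1, 2.0]),
                    np.array([0, 1, 0, 1]), d=2, K=3, delta=1.0)
```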
The computation of $c$, $b$, and $b(\alpha)$ requires multiplying sparse matrices with $O(nK\Delta)$ non-zero elements by a full matrix of $r$ columns, which yields an $O(nK\Delta r)$ complexity. Overall, the complexity of the Hawkes parameters optimization is of the order $O(nKr(\Delta + r))$.

**Projection matrix optimization.** Let $p$ be a reshaping of the projection matrix $P$ into a vector. Then $p$ is updated by solving the following maximization problem:

$$\hat{p} = \arg\max_{p}\ \sum_{h,m} \ln\big(p^{\top}\Xi_{hm}\, p\big) - p^{\top}\Psi\, p, \tag{14}$$

where $2\,(\Xi_{hm})_{ui,vj} = \sum_{k} \big(\alpha_{ji,k}\, D_{h,m,u,v,k} + \alpha_{ij,k}\, D_{h,m,v,u,k}\big)$ and $2\,\Psi_{ui,vj} = \sum_{h,k} \big(\alpha_{ji,k}\, B_{h,u,v,k} + \alpha_{ij,k}\, B_{h,v,u,k}\big)$. The maximization task is performed by a novel minorize-maximization procedure, which is summarized by the following proposition.

**Proposition 1.** *The log-likelihood is non-decreasing under the update:*

$$p^{t+1}_{ui} = p^{t}_{ui}\ \frac{\sum_{h,m}\, (\Xi_{hm}\, p^{t})_{ui} \,\big/\, \big(p^{t\top}\Xi_{hm}\, p^{t}\big)}{(\Psi\, p^{t})_{ui}}. \tag{15}$$

*Furthermore, if $p_{ui}$ is a stable fixed point of Eq. 15, then $p_{ui}$ is a local maximum of the log-likelihood.*

As previously, computing $\Xi$, $\Psi$, and all the matrix-vector products requires $O(nK\Delta r^2)$ operations, and each update necessitates $O(nd)$ operations. As we consider scenarios where there are at least a few events per dimension, the total complexity of the group-affinities optimization is $O(nK\Delta r^2)$. In total, the complexity of our optimization procedure is of the order $O(nK\sigma + nK\Delta r^2)$, and its behavior is linear w.r.t. the number of events and the number of dimensions. If we consider $K$ and $r$ as tunable hyperparameters, the complexity becomes $O(n \max(\sigma, \Delta)) \leq O(nd)$. This complexity seems close to the best achievable without making any further hypothesis. While $O(nd)$ can still be high when studying viral processes occurring over social networks of billions of nodes, the exact expression $O(nK\sigma + nK\Delta r^2)$ can be significantly lower in the case where we observe subcritical cascades ($\sigma \ll d$) occurring over a sparse network ($\Delta \ll d$).
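For intuition, the following toy transcription shows how the multiplicative update of Eq. 15 can be iterated once the matrices $\Xi_{hm}$ and $\Psi$ have been assembled. It uses small dense matrices, whereas the stated complexity relies on their sparse tensor representation; it is a sketch under our reading of Proposition 1, not the authors' implementation, and all names are illustrative.

```python
import numpy as np

def mm_project(p0, Xi_list, Psi, n_iter=100, eps=1e-12):
    """Minorize-maximization for  max_p sum_m log(p^T Xi_m p) - p^T Psi p
    with p >= 0, via the multiplicative update of Eq. 15. Nonnegativity is
    preserved since every factor in the update is nonnegative, and fixed
    points cancel the gradient of the objective."""
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(n_iter):
        num = np.zeros_like(p)
        for Xi in Xi_list:                   # one PSD matrix per event (h, m)
            Xp = Xi @ p
            num += Xp / max(p @ Xp, eps)     # half-gradient of the log terms
        p *= num / np.maximum(Psi @ p, eps)  # Eq. 15, all (u, i) at once
    return p

# Illustrative usage with random PSD stand-ins for the Xi_{hm} matrices.
rng = np.random.default_rng(0)
Xi_list = [m @ m.T for m in rng.uniform(size=(4, 6, 6))]
p_hat = mm_project(np.ones(6), Xi_list, Psi=len(Xi_list) * np.eye(6))
```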
Figure 1: (a) True and inferred triggering kernels $\tilde g_{ij}$ and natural occurrence rates $\tilde\mu_i$ for the synthetic dataset. (b) Low-dimensional embedding of the event types learned by SLRHP in the synthetic dataset. The two groups (blue and green) of event types are successfully identified.

## Experiments

### Synthetic data

In this section we illustrate the validity and precision of our method in learning the diffusion parameters of simulated Hawkes processes. More specifically, we simulate MHPs such that the event types are separated into two groups with similar activation patterns. In the context of social networks, these groups may encode influencer-influencee types of relations. We show that our inference algorithm can recover the groups and the corresponding triggering kernels consistently and with high accuracy. Note that SLRHP is more generic than this specific setting, since many notions of data structure can be captured by the low-rank approximation. However, we believe that the reported scenario is simple and intuitive, and may therefore provide a clear overview of the capabilities of our approach.

**Data generation procedure.** The procedure employed for generating the synthetic datasets is the following. We assume that the MHPs take place on a random Erdős-Rényi (Erdős and Rényi 1976) network of $d = 100$ event types, whose adjacency matrix $A$ is generated with parameter $p = 0.1$ (i.e. 10 neighbors on average). Then, we consider two distinct groups of event types and assign each event type to one of the groups at random. The natural occurrence rate $\tilde\mu_i$ of each group is fixed to a constant value chosen uniformly over $[0, 0.01]$. The triggering kernels between two groups $i$ and $j$ are generated as:

$$\tilde g_{ij}(t) = \nu_{ij}\ \frac{\sin\!\big(\frac{2\pi t}{\omega_{ij}} + \frac{\pi}{2}\,((i+j) \bmod 2)\big) + 2}{3\,(t+1)^2}, \tag{16}$$

where $\omega_{ij}$ and $\nu_{ij}$ are sampled uniformly over $[1, 10]$ and $[0, 1/50]$, respectively. These parameter intervals are chosen so that the behavior of the generated process is non-explosive (Daley and Vere-Jones 2007). The rationale behind the kernels in Eq. 16 is that they present a power-law decreasing intensity that allows long-term influence with a periodic behavior. This kind of dynamics could, for instance, represent the daytime cycles of internet users.

**Results.** Following the above procedure, we generate 8 datasets by sampling 8 different sets of parameters $\{(\omega_{ij}, \nu_{ij})_{i \leq r, j \leq r}, (\tilde\mu_i)_{i \leq r}\}$. Finally, we simulate $10^5$ i.i.d. realizations of the resulting Hawkes process, which we use as training set. The ability of SLRHP to recover the true group triggering kernels $\tilde g_{ij}$ is shown in Fig. 1(a) and evaluated by means of the normalized $L_2$ error $\frac{1}{r^2} \sum_{i,j} \frac{\|\tilde g_{ij} - \hat g_{ij}\|_2}{\|\tilde g_{ij}\|_2 + \|\hat g_{ij}\|_2}$. On average, this error is only 12.9%, with a minimum of 9.2% and a maximum of 18.9% amongst the 8 sample datasets. Moreover, the figure visually compares the inferred to the true natural occurrence rates and triggering kernel functions. In order to find the group assignments, we infer the parameters of an SLRHP of rank $r = 2$ and recover the group structure by running a clustering algorithm on the projected event types. Then, choosing the centers of the two clusters as the basis of the two-dimensional space enables the recovery of the group triggering kernels. Fig. 1(b) shows the 2D embedding learned by our inference algorithm for one of the 8 sample datasets. Two well-separated clusters appear, which indicates that the group assignments were perfectly recovered. The other 7 datasets gave similar results. These results provide strong evidence of the validity of our algorithm for inferring the underlying dynamics of MHPs in situations where there is structure in the interactions between the dimensions.

### Results on the MemeTracker dataset

Our final set of experiments is conducted on the MemeTracker (Leskovec and Krevl 2014) dataset, a benchmark corpus of $9.6 \times 10^6$ blog posts published between August 2008 and April 2009. We use the posts from the period August 2008 to December 2008 as training set, and evaluate our models on the four remaining months. An event for website $u$ is defined as the creation of a post on website $u$ containing a hyperlink towards any other website. We also consider that an edge exists between two websites if at least one hyperlink exists between them in the training set. In order to compare the inference algorithms on datasets of different sizes, prediction was performed on four subsets of the MemeTracker dataset, described in Table 1.

Table 1: Experiments on MemeTracker subsets. AUC (%) and Accuracy (%) are reported for predicting the next event to happen, using SLRHP, MEMIP, and the NAIVE approach. The experiments denoted with '*' did not finish in reasonable time.
| Name | thd | $n$ | $d$ | SLRHP time (s) | MEMIP time (s) | SLRHP AUC | MEMIP AUC | NAIVE AUC | SLRHP Acc. | MEMIP Acc. | NAIVE Acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MT1 | 50000 | 7311 | 13 | 8.34 | 3.16 | 86.3 | 86.4 | 86.1 | 99.2 | 99.2 | 93.1 |
| MT2 | 10000 | 74474 | 80 | 281 | $7.14 \times 10^3$ | 90.8 | 92.6 | 84.4 | 89.8 | 92.7 | 70.6 |
| MT3 | 5000 | 277914 | 172 | $1.95 \times 10^3$ | $1.74 \times 10^5$ | 86.2 | 91.9 | 81.2 | 87.0 | 91.6 | 67.7 |
| MT4 | 1000 | 875402 | 1075 | $3.77 \times 10^5$ | * | 87.0 | * | 85.2 | 84.7 | * | 81.3 |

The four subsets (smaller to larger) are MT1, MT2, MT3, and MT4. These subsets are created by removing the events taking place on websites that appear fewer than a fixed number of times in the training set. This threshold value (thd in Tab. 1) is, respectively, 50000, 10000, 5000, and 1000.

**Prediction task.** The task consists of predicting the next website to create a post. More specifically, for each event of the test dataset, we are interested in predicting the website on which it will take place, knowing its time of occurrence. For MEMIP and SLRHP, the prediction is achieved by scoring the websites according to $\lambda_u(t_m)$, since this value is proportional to the theoretical conditional probability for event $m$ to be of type $u$. We evaluate the predictions with two metrics: the area under the ROC curve (AUC) and a classification accuracy with a fixed number of candidate types. Due to the high bias towards major news websites (e.g. CNN), the number of candidate types has to be relatively large to see differences in the performance of the algorithms, and we set this value to 30% of the total number of event types $d$ in our experiments. This means that we consider a prediction successful if the website that eventually fires was ranked by the algorithm in the top 30% of candidates.

**Baselines.** In the following experiments, we use as main competitor the state-of-the-art MEMIP algorithm (Lemonnier and Vayatis 2014), which is, to the best of our knowledge, the only inference algorithm with linear complexity in the number of events $n$ in the training history. Previous work (Lemonnier and Vayatis 2014) also shows that this algorithm outperforms the more standard inference algorithm MMEL (Zhou, Zha, and Song 2013b) on the MemeTracker dataset. In addition, we use the NAIVE baseline, which ranks the nodes according to their frequency of appearance in the training set. Note that this is equivalent to fitting a Poisson process and, hence, does not consider mutual excitation.

**Results.** Tab. 1 summarizes the experimental results comparing the proposed SLRHP against the MEMIP and NAIVE algorithms on four subsets of the MemeTracker dataset. Each row of the table describes a dataset's characteristics, and for each method it provides the training time, AUC, and accuracy with the best parameter settings (for SLRHP, $K = 6$ and $r = 2$, except for MT3 for which $r = 3$). On small to medium-sized datasets (MT1, MT2, MT3), SLRHP achieves predictive performance on par with its main competitor MEMIP, while being orders of magnitude faster. On the larger MT4 dataset, SLRHP still runs in reasonable time while outperforming the NAIVE baseline. Note that MEMIP could not be computed in reasonable time (less than a few days) for this dataset.
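To make the evaluation protocol concrete, the sketch below computes the fixed-candidate-set accuracy reported in Table 1: for each test event, every event type is scored (for SLRHP and MEMIP the score would be the fitted intensity $\lambda_u(t_m)$; for NAIVE, the training frequency), and a prediction counts as correct when the true type ranks in the top 30% of candidates. Function and variable names are ours, not the paper's.

```python
import numpy as np

def top_share_accuracy(scores, true_types, share=0.3):
    """scores: (n_events, d) per-type scores at each test event's time;
    true_types: (n_events,) the type that actually fired. Counts a hit
    when the true type is among the top `share` fraction of candidates."""
    n_events, d = scores.shape
    k = max(1, int(share * d))                  # size of the candidate set
    top = np.argsort(-scores, axis=1)[:, :k]    # best-scored types first
    return float((top == true_types[:, None]).any(axis=1).mean())

# Sanity check: random scores should land near the 30% chance level.
rng = np.random.default_rng(0)
acc = top_share_accuracy(rng.uniform(size=(2000, 50)),
                         rng.integers(0, 50, size=2000))
```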
Figure 2: Experimental results on real data: (a) A log-log plot of the training time (secs) for the SLRHP and MEMIP algorithms against the quantity $nd$. The linear behavior for SLRHP and the super-linear behavior for MEMIP are clearly visible. (b) Sensitivity analysis of SLRHP accuracy w.r.t. the rank $r$ of the approximation used for inference, and a comparison to the best scores of the MMEL and NAIVE baselines on the MT3 dataset. (c) Low-dimensional embedding of the event types learned by SLRHP for the MT3 dataset with $r = 2$.

Fig. 2(a) shows, in log-log scale, the computational time needed by the inference algorithms on all the MemeTracker datasets, with respect to $nd$. This time is indeed linear in $nd$ for SLRHP, while super-linear for the state-of-the-art competitor from the related literature. Fig. 2(b) indicates that the accuracy measurements are relatively stable with regard to the rank $r$ of the approximation, with a maximum at $r = 3$. Finally, Fig. 2(c) shows the 2D embedding learned by SLRHP for the MT3 dataset with $r = 2$. In the embedding space, the websites seem to align along the axes, with varying amplitudes. This may indicate that the algorithm recovered two different groups, each representing similar activities, although with large variability in the activity of the websites inside each group.

## Conclusion

This work focused on modeling multivariate time series in which a very large number of event types can occur and a very large number of historical observations are available for training. The introduced framework is called Scalable Low-Rank Hawkes Processes (SLRHP), for which we developed a novel inference algorithm for parameter estimation. Theoretical complexity analysis as well as experimental results show that our approach is highly scalable, while remaining competitive in predictive performance with state-of-the-art inference algorithms.

## Acknowledgments

This research is part of the SODATECH project funded by the French Government within the "Investments for the Future: Big Data" program.

## References

Alfonsi, A., and Blanc, P. 2015. Dynamic optimal execution in a mixed-market-impact Hawkes price model. Finance and Stochastics 20(1):183–218.

Bacry, E., and Muzy, J.-F. 2014. Hawkes model for price and trades high-frequency dynamics. Quantitative Finance 14(7):1147–1166.

Bacry, E., and Muzy, J.-F. 2016. First- and second-order statistics characterization of Hawkes processes and non-parametric estimation. IEEE Transactions on Information Theory 62(4):2184–2202.

Bauwens, L., and Hautsch, N. 2009. Modelling Financial High Frequency Data Using Point Processes. Springer.

Bordenave, C., and Torrisi, G. L. 2007. Large deviations of Poisson cluster processes. Stochastic Models 23(4):593–625.

Boyd, S. P., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Brémaud, P., and Massoulié, L. 1996. Stability of nonlinear Hawkes processes. The Annals of Probability 1563–1588.

Crane, R., and Sornette, D. 2008. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105(41):15649–15653.

Daley, D. J., and Vere-Jones, D. 2007. An Introduction to the Theory of Point Processes. Springer.

Du, N.; Wang, Y.; He, N.; Sun, J.; and Song, L. 2015. Time-sensitive recommendation from recurrent user activities. In Advances in Neural Information Processing Systems 28 (NIPS), 3492–3500.

Erdős, P., and Rényi, A. 1976. On the evolution of random graphs. Selected Papers of Alfréd Rényi 2:482–525.

Hansen, N. R.; Reynaud-Bouret, P.; and Rivoirard, V. 2015. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli 21(1):83–143.

Lemonnier, R., and Vayatis, N. 2014. Nonparametric Markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate Hawkes processes. In Machine Learning and Knowledge Discovery in Databases. Springer, 161–176.
Lemonnier, R.; Scaman, K.; and Kalogeratos, A. 2017. Multivariate Hawkes processes for large-scale inference: Supplementary material. Available at: http://kalogeratos.com/psite/files/My Papers/SLRHPappendix.pdf.

Leskovec, J., and Krevl, A. 2014. SNAP Datasets: Stanford large network dataset collection. Available at: http://snap.stanford.edu/data.

Liniger, T. J. 2009. Multivariate Hawkes Processes. Ph.D. dissertation, Eidgenössische Technische Hochschule (ETH) Zürich, Nr. 18403.

Nesterov, Y.; Nemirovskii, A. S.; and Ye, Y. 1994. Interior-Point Polynomial Algorithms in Convex Programming, volume 13. SIAM.

Oakes, D. 1975. The Markovian self-exciting process. Journal of Applied Probability 69–77.

Reynaud-Bouret, P.; Rivoirard, V.; Grammont, F.; and Tuleau-Malot, C. 2014. Goodness-of-fit tests and nonparametric adaptive estimation for spike train analysis. Journal of Mathematical Neurosciences 4:3.

Tran, L.; Farajtabar, M.; Song, L.; and Zha, H. 2015. NetCodec: Community detection from individual activities. In Proceedings of the SIAM International Conference on Data Mining, 91–99.

Vere-Jones, D. 1978. Earthquake prediction: A statistician's view. Journal of Physics of the Earth 26(2):129–146.

Xu, H.; Farajtabar, M.; and Zha, H. 2016. Learning Granger causality for Hawkes processes. In Proceedings of the International Conference on Machine Learning (ICML), 1717–1726.

Zhou, K.; Zha, H.; and Song, L. 2013a. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 641–649.

Zhou, K.; Zha, H.; and Song, L. 2013b. Learning triggering kernels for multi-dimensional Hawkes processes. In Proceedings of the International Conference on Machine Learning (ICML), 1301–1309.