# uncovering_causality_from_multivariate_hawkes_integrated_cumulants__c356b3a0.pdf Uncovering Causality from Multivariate Hawkes Integrated Cumulants Massil Achab 1 Emmanuel Bacry 1 Stéphane Gaïffas 1 Iacopo Mastromatteo 2 Jean-François Muzy 1 3 Abstract We design a new nonparametric method that allows one to estimate the matrix of integrated kernels of a multivariate Hawkes process. This matrix not only encodes the mutual influences of each node of the process, but also disentangles the causality relationships between them. Our approach is the first that leads to an estimation of this matrix without any parametric modeling and estimation of the kernels themselves. As a consequence, it can give an estimation of causality relationships between nodes (or users), based on their activity timestamps (on a social network for instance), without knowing or estimating the shape of the activities lifetime. For that purpose, we introduce a moment matching method that fits the second-order and the third-order integrated cumulants of the process. A theoretical analysis allows us to prove that this new estimation technique is consistent. Moreover, we show on numerical experiments that our approach is indeed very robust to the shape of the kernels, and gives appealing results on the Meme Tracker database and on financial order book data. 1. Introduction In many applications, one needs to deal with data containing a very large number of irregular timestamped events that are recorded in continuous time. These events can reflect, for instance, the activity of users on a social network (Subrahmanian et al., 2016), high-frequency variations of signals in finance (Bacry & Muzy, 2014), earthquakes and aftershocks in geophysics (Ogata, 1998), crime activity (Mohler et al., 2011) or position of genes in genomics (Reynaud-Bouret & Schbath, 2010). In this context, multidimensional counting processes based models play a paramount role. Within this framework, an important task is to recover the mutual 1Ecole Polytechnique, Palaiseau, France 2Capital Fund Management, Paris, France 3Université de Corse, Corte, France. Correspondence to: Massil Achab . Proceedings of the 34 th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s). influence of the nodes, by leveraging on their timestamp patterns (Gomez-Rodriguez et al., 2013; Farajtabar et al., 2015; Xu et al., 2016). Consider a set of nodes I = {1, . . . , d}. For each i I, we observe a set Zi of events, where any τ Zi labels the occurrence time of an event related to the activity of i. The events of all nodes can be represented as a vector of counting processes N t = [N 1 t N d t ] , where N i t counts the number of events of node i until time t R+, namely N i t = P τ Zi 1{τ t}. The vector of stochastic intensities λt = [λ1 t λd t ] associated with the multivariate counting process N t is defined as λi t = lim dt 0 P(N i t+dt N i t = 1|Ft) dt for i I, where the filtration Ft encodes the information available up to time t. The coordinate λi t gives the expected instantaneous rate of event occurrence at time t for node i. The vector λt characterizes the distribution of N t, see (Daley & Vere-Jones, 2003), and patterns in the events time-series can be captured by structuring these intensities. 1.1. Hawkes processes The Hawkes process framework (Hawkes, 1971) corresponds to an autoregressive structure of the intensities in order to capture self-excitation and cross-excitation of nodes, which is a phenomenon typically observed in social networks (Crane & Sornette, 2008). Namely, N t is called a Hawkes point process if the stochastic intensities can be written as λi t = µi + 0 φij(t t )d N j t , where µi R+ is an exogenous intensity and φij are positive, integrable and causal (with support in R+) functions called kernels encoding the impact of an action by node j on the activity of node i. Note that when all kernels are zero, the process is a simple homogeneous multivariate Poisson process. 1.2. Related works Most papers use very simple parameterizations of the kernels (Yang & Zha, 2013; Zhou et al., 2013b; Farajtabar Uncovering Causality from Multivariate Hawkes Integrated Cumulants et al., 2015), they are of the form φij(t) = αijh(t) with αij R+ quantifying the intensity of the influence of j on i and h(t) a (normalized) function that characterizes the timeprofile of this influence and that is shared by all couples of nodes (i, j) (most often, it is chosen to be either exponential h(t) = βe βt or power-law h(t) = βt (β+1)). This is highly non-realistic: there is a priori no reason for assuming that the time-profile of the influence of a node j on a node i does not depend on the pair (i, j). Moreover, assuming an exponential shape or a power-law shape for h(t) arbitrarily imposes an event impact that is always instantly maximal, and that can only decrease with time, while in practice, there may exist a latency between an event and its impact. In order to improve this and have more flexibility on the shape of the kernels, nonparametric estimation is considered in (Lewis & Mohler, 2011) and extended to the multidimensional case in (Zhou et al., 2013a). An alternative method is proposed in (Bacry & Muzy, 2016) where nonparametric estimation is formulated as a Wiener-Hopf equation. Another nonparametric strategy considers a decomposition of kernels on a dictionary of function h1, . . . , h K, namely φij(t) = PK k=1 aij k hk(t), where the coefficients aij k are estimated, see (Hansen et al., 2015; Lemonnier & Vayatis, 2014) and (Xu et al., 2016), where group-lasso is used to induce a sparsity pattern on the coefficients aij k that is shared across k = 1, . . . , K. Such methods are heavy when d is large, since they rely on likelihood maximization or least squares minimization within an over-parametrized space in order to gain flexibility on the shape of the kernels. This is problematic, since the original motivation for the use of Hawkes processes is to estimate the influence and causality of nodes, the knowledge of the full parametrization of the model being of little interest by itself. 1.3. Granger Causality Since the question of a real causality is too complex in general, most econometricians agreed on the simpler definition of Granger causality (Granger, 1969). Its mathematical formulation is a statistical hypothesis test: X causes Y in the sense of Granger causality if forecasting future values of Y is more successful while taking X past values into account. In (Eichler et al., 2016), it is shown that for N t a multivariate Hawkes process, N j t does not Granger-cause N i t w.r.t N t if and only if φij(u) = 0 for u R+. Since the kernels take positive values, the latter condition is equivalent to R 0 φij(u)du = 0. In the following, we ll refer to learning the kernels integrals as uncovering causality since each integral encodes the notion of Granger causality, and is also linked to the number of events directly caused from a node to another node, as described below at Eq. (2). 1.4. Our contribution: cumulants matching Our paper solves this problem with a different and more direct approach. Instead of trying to estimate the kernels φij, we focus on the direct estimation of their integrals. Namely, we want to estimate the matrix G = [gij] where 0 φij(u) du 0 for 1 i, j d. (1) From the definition of Hawkes process as a Poisson cluster process (Jovanovi c et al., 2015), gij can be simply interpreted as the average total number of events of node i whose direct ancestor is a given event of node j (by direct we mean that interactions mediated by any other intermediate event are not counted). In that respect, G not only describes the mutual influences between nodes, but it also quantifies their direct causal relationships. Namely, introducing the counting function N i j t that counts the number of events of i whose direct ancestor is an event of j, we know from (Bacry et al., 2015) that E[d N i j t ] = gij E[d N j t ] = gijΛjdt, (2) where we introduced Λi as the intensity expectation, namely satisfying E[d N i t] = Λidt. Note that Λi does not depend on time by stationarity of N t, which is known to hold under the stability condition G < 1, where G stands for the spectral norm of G. In particular, this condition implies the non-singularity of Id G. The main idea is to estimate the matrix G directly using a matching cumulants (or moments) method. Indeed the cumulants write as centered moments, up to the third order. For higher order, they are computable using the cumulant generating function. First, we compute an estimation c M of some moments M(G), that are uniquely defined by G. Second, we look for a matrix b G that minimizes the L2 error M( b G) c M 2. This approach turns out to be particularly robust to the kernel shapes, which is not the case of all previous approaches for causality recovery with the Hawkes model. We call this method NPHC (Non Parametric Hawkes Cumulant), since our approach is of nonparametric nature. This new approach is confirmed by a theoretical analysis allowing to prove the consistency of the NPHC estimator, by using tools from Generalized Method of Moments, see (Hall, 2005), and a technical original proof that is detailed in the supplementary material. Note that moment and cumulant matching techniques proved particularly powerful for latent topic models, in particular Latent Dirichlet Allocation, see (Podosinnikova et al., 2015). Previous works (Da Fonseca & Zaatour, 2014; Aït-Sahalia et al., 2010) already used method of moments with Hawkes processes, but only in a parametric setting. Our work is the first to consider such an approach for a nonparametric counting processes framework. Uncovering Causality from Multivariate Hawkes Integrated Cumulants 2. NPHC: The Non Parametric Hawkes Cumulant method The simplest moment-based quantities M that can be explicitly written as a function of G are the integrated cumulants of the Hawkes process. 2.1. Integrated cumulants of the Hawkes process A general formula for these cumulants is provided in (Jovanovi c et al., 2015) but, as explained below, for the purpose of our method, we only need to consider cumulants up to the third order. Given 1 i, j, k d, the first three integrated cumulants of the Hawkes process can be defined as follows thanks to stationarity: Λidt = E(d N i t) (3) E(d N i td N j t+τ) E(d N i t)E(d N j t+τ) Kijkdt = Z Z E(d N i td N j t+τd N k t+τ ) + 2E(d N i t)E(d N j t+τ)E(d N k t+τ ) E(d N i td N j t+τ)E(d N k t+τ ) E(d N i td N k t+τ )E(d N j t+τ) E(d N j t+τd N k t+τ )E(d N i t) , where Eq. (3) is the mean intensity of the Hawkes process, the second-order cumulant (4) refers to the integrated covariance density matrix and the third-order cumulant (5) measures the skewness of N t. Using the Laplace transform (Bacry & Muzy, 2016) or the Poisson cluster process representation (Jovanovi c et al., 2015), one can obtain an explicit relationship between these integrated cumulants and the matrix G. If one sets R = (Id G) 1, (6) straightforward computations (see Section 5) lead to the following identities: m=1 Rimµm (7) m=1 Λm Rim Rjm (8) m=1 (Rim Rjm Ckm + Rim Cjm Rkm (9) + Cim Rjm Rkm 2Λm Rim Rjm Rkm). Our strategy is to use a convenient subset of Eqs. (3), (4) and (5) to define M, while we use Eqs. (7), (8) and (9) in order to construct the operator that maps a candidate matrix R to the corresponding cumulants M(R). By looking for b R that minimizes R 7 M(R) c M 2, we obtain, as illustrated below, good recovery of the ground truth matrix G using Equation (6). The simplest case d = 1 has been considered in (Hardiman & Bouchaud, 2014), where it is shown that one can choose M = {C11} in order to compute the kernel integral. Eq. (8) then reduces to a simple second-order equation that has a unique solution in R (and consequently a unique G) that accounts for the stability condition ( G < 1). Unfortunately, for d > 1, the choice M = {Cij}1 i j d is not sufficient to uniquely determine the kernels integrals. In fact, the integrated covariance matrix provides d(d + 1)/2 independent coefficients, while d2 parameters are needed. It is straightforward to show that the remaining d(d 1)/2 conditions can be encoded in an orthogonal matrix O, reflecting the fact that Eq. (8) is invariant under the change R OR, so that the system is under-determined. Our approach relies on using the third order cumulant tensor K = [Kijk] which contains (d3 + 3d2 + 2d)/6 > d2 independent coefficients that are sufficient to uniquely fix the matrix G. This can be justified intuitively as follows: while the integrated covariance only contains symmetric information, and is thus unable to provide causal information, the skewness given by the third order cumulant in the estimation procedure can break the symmetry between past and future so as to uniquely fix G. Thus, our algorithm consists of selecting d2 third-order cumulant components, namely M = {Kiij}1 i,j d. In particular, we define the estimator of R as b R argmin RL(R), where L(R) = (1 κ) Kc(R) d Kc 2 2 + κ C(R) b C 2 2, (10) where 2 stands for the Frobenius norm, Kc = {Kiij}1 i,j d is the matrix obtained by the contraction of the tensor K to d2 indices, C is the covariance matrix, while d Kc and b C are their respective estimators, see Equations (12), (13) below. It is noteworthy that the above mean square error approach can be seen as a peculiar Generalized Method of Moments (GMM), see (Hall, 2005). This framework allows to determine the optimal weighting matrix involved in the loss function, which is a question to be addressed in an extended version of the present work. In this work, we use the coefficient κ to scale the two terms, by setting κ = d Kc 2 2/( d Kc 2 2 + b C 2 2). Finally the estimator of G is straightforwardly obtained as b G = Id b R 1, from the inversion of Eq. (6). Let us mention an important point: the matrix inversion in the previous formula is not Uncovering Causality from Multivariate Hawkes Integrated Cumulants the bottleneck of the algorithm. Indeed, its has a complexity O(d3) which is cheap compared to the computation of the cumulants when n = maxi |Zi| d, which is typically satisfied (see next subsection). Solving the considered problem on a larger scale, say d 103, is an open question, even with state-of-the-art parametric and nonparametric approaches (Zhou et al., 2013a; Xu et al., 2016; Zhou et al., 2013b; Bacry & Muzy, 2016), where the number of components d in experiments is always around 100 or smaller. Actually, our approach leads to a much faster algorithm than the considered state-of-the-art baselines (see Tables 1 4 below). 2.2. Estimation of the integrated cumulants In this section we present explicit formulas to estimate the three moment-based quantities listed in the previous section, namely, Λ, C and K. We first assume there exists H > 0 such that the truncation from ( , + ) to [ H, H] of the domain of integration of the quantities appearing in Eqs. (4) and (5), introduces only a small error. In practice, this amounts to neglecting border effects in the covariance density and in the skewness density, and it is a good approximation if i) the support of the kernel φij(t) is smaller than H and ii) the spectral norm G is sufficiently distant from the critical point G = 1. In this case, given a realization of a stationary Hawkes process {N t : t [0, T]}, as shown in Section 5, we can write the estimators of the first three cumulants (3), (4) and (5) as τ Zi 1 = N i T T (11) N j τ+H N j τ H 2H bΛj (12) N j τ+H N j τ H 2H bΛj N k τ+H N k τ H 2H bΛk τ Zk (2H |τ τ|)+ + 4H2bΛibΛj bΛk. Let us mention the following facts. Bias. While the first cumulant ˆΛi is an unbiased estimator of Λi, the other estimators b Cij and b Kijk introduce a bias. However, as we will show, in practice this bias is small and hardly affects numerical estimations (see Section 3). This is confirmed by our theoretical analysis, which proves that if H does not grow to fast compared to T, then these estimated cumulants are consistent estimators of the theoretical cumulants (see Subsection 2.5). Complexity. The computations of all the estimators of the first, second and third-order cumulants have complexity respectively O(nd), O(nd2) and O(nd3), where n = maxi |Zi|. However, our algorithm requires a lot less than that: it computes only d2 third-order terms, of the form b Kiij, leaving us with only O(nd2) operations to perform. Symmetry. While the values of Λi, Cij and Kijk are symmetric under permutation of the indices, their estimators are generally not symmetric. We have thus chosen to symmetrize the estimators by averaging their values over permutations of the indices. Worst case is for the estimator of Kc, which involves only an extra factor of 2 in the complexity. 2.3. The NPHC algorithm The objective to minimize in Equation (10) is non-convex. More precisely, the loss function is a polynomial of R of degree 10. However, by replacing Λ and C appearing in Eq. (4) and (5) with bΛ and b C helps us to decrease the degree from 10 to 6, which makes the optimization problem less difficult. We denote e L(R) this simpler objective function, where the expectations of cumulants Λi and Cij have been replaced with their estimators in the right-hand side of Eqs. (8) and (9). Thanks to (Choromanska et al., 2015), we know that the loss function of a typical multilayer neural network with simple nonlinearities can be expressed as a polynomial function of the weights in the network, whose degree is the number of layers. Since the loss function of NPHC writes as a polynomial of degree 6, we expect good results using optimization methods designed to train deep multilayer neural networks. We used Ada Grad (Duchi et al., 2011), a variant of the Stochastic Gradient Descent algorithm. It scales the learning rate coordinate-wise using the online variance of the previous gradients, in order to captures second-order information. Our problem being non-convex, the choice of the starting point has a major effect on the convergence. Here, the key is to notice that the matrices R that match relation Eq. (8) writes C1/2OL 1/2, with L = diag(Λ) and O an orthogonal matrix. In our setting, this algorithm gave nice convergence results for O = Id. The NPHC method is described schematically in Algorithm 1. Even though our main concern is to retrieve the matrix G, let us notice we can also obtain an estimation of the baseline intensities from Eq. (3): bµ = b R 1 bΛ. An efficient implementation of this algorithm with Tensor Flow, see (Abadi et al., 2016), is available on Git Hub: https://github.com/achab/nphc. Uncovering Causality from Multivariate Hawkes Integrated Cumulants Algorithm 1 Non Parametric Hawkes Cumulant method Input: N t Output: b G 1: Estimate bΛi, b Cij, b Kiij from Eqs. (11, 12, 13) 2: Design e L(R) using the computed estimators. 3: Minimize numerically e L(R) so as to obtain b R 4: Return b G = Id b R 1. 2.4. Complexity of the algorithm Compared with existing state-of-the-art methods to estimate the kernel functions, e.g., the ordinary differential equations-based (ODE) algorithm in (Zhou et al., 2013a), the Granger Causality-based algorithm in (Xu et al., 2016), the ADM4 algorithm in (Zhou et al., 2013b), and the Wiener-Hopf-based algorithm in (Bacry & Muzy, 2016), our method has a very competitive complexity. This can be understood by the fact that those methods estimate the kernel functions, while in NPHC we only estimate their integrals. Let us recall that d is the number of components and n = maxi |Zi| d the maximum number of events on a single component. Let us recall complexities given in (Xu et al., 2016) together with other ones. The ODE-based algorithm is an EM algorithm that parametrizes the kernel function with M basis functions, each being discretized to L points. The basis functions are updated after solving M Euler-Lagrange equations. The complexity of one iteration of the algorithm is then O(Mn3d2 + ML(nd + n2)), with n the maximum number of events and d the dimension. The Granger Causality-based algorithm is similar to the previous one, without the update of the basis functions, that are Gaussian kernels. The complexity per iteration is O(Mn3d2). The algorithm ADM4 is similar to the two algorithms above, as EM algorithm as well, with only one exponential kernel as basis function. The complexity per iteration is then O(n3d2). The Wiener-Hopf-based algorithm is not iterative, on the contrary to the previous ones. It first computes the empirical conditional laws on many points, and then invert the Wiener-Hopf system, leading to a O(nd2L + d4L3) computation. Similarly, our method first computes the integrated cumulants, then minimize the objective function with Niter iterations, and invert the resulting matrix b R to obtain b G. At the end, the complexity of the NPHC method is O(nd2 + Niterd3). This is summarized in Table 1 2.5. Theoretical guarantee: consistency The NPHC method can be phrased using the framework of the Generalized Method of Moments (GMM). GMM is a generic method for estimating parameters in statistical models. In order to apply GMM, we have to find a vector-valued Table 1. Complexity of state-of-the-art methods. NPHC s complexity is very low since, especially in the regime n d. Method Total complexity ODE (Zhou et al., 2013a) O(Niter M(n3d2 + L(nd + n2))) GC (Xu et al., 2016) O(Niter Mn3d2) ADM4 (Zhou et al., 2013b) O(Nitern3d2) WH (Bacry & Muzy, 2016) O(nd2L + d4L3) NPHC O(nd2 + Niterd3) function g(X, θ) of the data, where X is distributed with respect to a distribution Pθ0, which satisfies the moment conditions: E[g(X, θ)] = 0 if and only if θ = θ0, where θ0 is the ground truth value of the parameter. Based on i.i.d. observed copies x1, . . . , xn of X, the GMM method minimizes the norm of the empirical mean over n samples, 1 n Pn i=1 g(xi, θ) , as a function of θ, to obtain an estimate of θ0. In the theoretical analysis of NPHC, we use ideas from the consistency proof of the GMM, but the proof actually relies on very different arguments. Indeed, the integrated cumulants estimators used in NPHC are not unbiased, as the theory of GMM requires, but asymptotically unbiased. Moreover, the setting considered here, where data consists of a single realization {N t} of a Hawkes process strongly departs from the standard i.i.d setting. Our approach is therefore based on the GMM idea but the proof is actually not using the theory of GMM. Now, we use the subscript T to refer quantities used or computed when we observe the process on (Nt) on [0, T], like the truncation term HT , the estimated integrated covariance b CT , or the estimated kernel norm matrix b GT . In the next equation, stands for the Hadamard product and 2 stands for the entrywise square of a matrix. We denote G0 = Id R 1 0 the true value of G, and the R2d d valued vector functions g0(R) = C RLR Kc R 2C 2[R (C RL)]R " b CT Rb LT R d Kc T R 2( b CT ) 2[R ( b CT Rb LT )]R so that e LT (R) is a weighted squared Frobenius norm of bg T (R), and bg T (R) P g0(R) under the conditions of the following theorem, where P stands for convergence in probability. Theorem 2.1 (Consistency of NPHC). Suppose that (Nt) is observed on R+ and assume that 1. g0(R) = 0 if and only if R = R0; 2. R Θ, which is a compact set; 3. the spectral radius of the kernel norm matrix satisfies G0 < 1; Uncovering Causality from Multivariate Hawkes Integrated Cumulants 4. HT and H2 T /T 0. b GT = Id arg min R Θ e LT (R) 1 P G0. Remark 1. Assumption 3 is mandatory for stability of the Hawkes process, and Assumptions 3 and 4 are sufficient to prove that the estimators of the integrated cumulants defined in Equations 11, 12 and 13 are asymptotically consistent. Assumption 2 is a very mild standard technical assumption, note Θ is compact so that the minima of the considered functionals of R are reached within Θ. Assumption 1 is a standard asymptotic moment condition, that allows to identity parameters from the integrated cumulants. The proof of the Theorem is given in the supplementary material. 3. Numerical Experiments Simulated datasets. We simulated several datasets with Ogata s Thinning algorithm (Ogata, 1981) using the opensource library tick1, each corresponding to a shape of kernel. exponential kernel: φ(t) = αβ exp( βt) (14) power law kernel: φ(t) = αβγ(1 + βt) (1+γ) (15) rectangular kernel: φ(t) = αβ1[0,1/β](t γ) (16) The integral of each kernel on its support equals α, 1/β can be regarded as a characteristic time-scale and γ is the scaling exponent for the power law distribution and a delay parameter for the rectangular one. We consider a non-symmetric block-matrix G to show that our method can effectively uncover causality between the nodes, see Figure 1. The parameter α take the same constant value on the three blocks, but we set three very different β0, β1 and β2 from one block to the other, with ratio βi+1/βi = 10 and β0 = 0.1. The matrix G has constant entries on the blocks - gij = 1/6 for dimension 10 and gij = 1/10 for dimension 100 -, and zero outside, and the number of events is roughly equal to 105 on average over the nodes. We ran the algorithm on three simulated datasets: a 10-dimensional process with rectangular kernels named Rect10, a 10-dimensional process with power law kernels named PLaw10 and a 100-dimensional process with exponential kernels named Exp100. Meme Tracker dataset. We use events of the most active sites from the Meme Tracker dataset2. This dataset contains the publication times of articles in many websites/blogs 1https://github.com/X-Data Initiative/tick 2https://www.memetracker.org/data.html from August 2008 to April 2009, and hyperlinks between posts. We extract the top 100 media sites with the largest number of documents, with about 7 million of events. We use the links to trace the flow of information and establish an estimated ground truth for the matrix G. Indeed, when an hyperlink j appears in a post in website i, the link j can be regarded as a direct ancestor of the event. Then, Eq. (2) shows gij can be estimated by N i j T /N j T = #{links j i}/N j T . Order book dynamics. We apply our method to financial data, in order to understand the self and cross-influencing dynamics of all event types in an order book. An order book is a list of buy and sell orders for a specific financial instrument, the list being updated in real-time throughout the day. This model has first been introduced in (Bacry et al., 2016), and models the order book via the following 8-dimensional point process: Nt = (P (a) t , P (b) t , T (a) t , T (b) t , L(a) t , L(b) t , C(a) t , C(b) t ), where P (a) (resp. P (b)) counts the number of upward (resp. downward) price moves, T (a) (resp. T (b)) counts the number of market orders at the ask3 (resp. at the bid) that do not move the price, L(a) (resp. L(b)) counts the number of limit orders at the ask4 (resp. at the bid) that do not move the price, and C(a) (resp. C(b)) counts the number of cancel orders at the ask5 (resp. at the bid) that do not move the price. The financial data has been provided by Quant House EUROPE/ASIA, and consists of DAX future contracts between 01/01/2014 and 03/01/2014. Baselines. We compare NPHC to state-of-the art baselines: the ODE-based algorithm (ODE) by (Zhou et al., 2013a), the Granger Causality-based algorithm (GC) by (Xu et al., 2016), the ADM4 algorithm (ADM4) by (Zhou et al., 2013b), and the Wiener-Hopf-based algorithm (WH) by (Bacry & Muzy, 2016). Metrics. We evaluate the performance of the proposed methods using the computing time, the Relative Error Rel Err(A, B) = 1 |aij| 1{aij =0}+|bij|1{aij=0} and the Mean Kendall Rank Correlation MRank Corr(A, B) = 1 i=1 Rank Corr([ai ], [bi ]), where Rank Corr(x, y) = 2 d(d 1)(Nconcordant(x, y) Ndiscordant(x, y)) with Nconcordant(x, y) the number of pairs 3i.e. buy orders that are executed and removed from the list 4i.e. buy orders added to the list 5i.e. the number of times a limit order at the ask is cancelled: in our dataset, almost 95% of limit orders are cancelled before execution. Uncovering Causality from Multivariate Hawkes Integrated Cumulants (i, j) satisfying xi > xj and yi > yj or xi < xj and yi < yj and Ndiscordant(x, y) the number of pairs (i, j) for which the same condition is not satisfied. Note that Rank Corr score is a value between 1 and 1, representing rank matching, but can take smaller values (in absolute value) if the entries of the vectors are not distinct. Figure 1. Estimated b G via NPHC on DAX order book data. Table 2. Metrics on Rect10: comparable rank correlation, strong improvement for relative error and computing time. Method ODE GC ADM4 WH NPHC Rel Err 0.007 0.15 0.10 0.005 0.001 MRank Corr 0.33 0.02 0.21 0.34 0.34 Time (s) 846 768 709 933 20 Table 3. Metrics on PLaw10: comparable rank correlation, strong improvement for relative error and computing time. Method ODE GC ADM4 WH NPHC Rel Err 0.011 0.09 0.053 0.009 0.0048 MRank Corr 0.31 0.26 0.24 0.34 0.33 Time (s) 870 781 717 946 18 Discussion. We perform the ADM4 estimation, with exponential kernel, by giving the exact value β = β0 of one block. Let us stress that this helps a lot this baseline, in comparison to NPHC where nothing is specified on the shape of the kernel functions. We used M = 10 basis functions for both ODE and GC algorithms, and L = 50 quadrature points for WH. We did not run WH on the 100-dimensional datasets, for computing time reasons, because its complexity scales with d4. We ran multiprocessed versions of the Table 4. Metrics on Exp100: comparable rank correlation, strong improvement for relative error and computing time. Method ODE GC ADM4 NPHC Rel Err 0.092 0.112 0.079 0.008 MRank Corr 0.032 0.009 0.049 0.041 Time (s) 3215 2950 2411 47 Table 5. Metrics on Meme Tracker: strong improvement in relative error, rank correlation and computing time. Method ODE GC ADM4 NPHC Rel Err 0.162 0.19 0.092 0.071 MRank Corr 0.07 0.053 0.081 0.095 Time (s) 2944 2780 2217 38 baseline methods on 56 cores, to decrease the computing time. Our method consistently performs better than all baselines, on the three synthetic datasets, on Meme Tracker and on the financial dataset, both in terms of Kendall rank correlation and estimation error. Moreover, we observe that our algorithm is roughly 50 times faster than all the considered baselines. On Rect10, PLaw10 and Exp100 our method gives very impressive results, despite the fact that it does not uses any prior shape on the kernel functions, while for instance the ADM4 baseline do. On these simulated datasets, NPHC obtains a comparable or slightly better Kendall rank correlation, but improves a lot the relative error. On Meme Tracker, the baseline methods obtain a high relative error of from 9% to 19% while our method achieves a relative error of 7% which is a strong improvement. Moreover, NPHC reaches a much better Kendall rank correlation, which proves that it leads to a much better recovery of the relative order of estimated influences than all the baselines. Indeed, it has been shown in (Zhou et al., 2013a) that kernels of Meme Tracker data are not exponential, nor power law. This partly explains why our approach behaves better. On the financial data, the estimated kernel norm matrix obtained via NPHC, see Figure 3, gave some interpretable results (see also (Bacry et al., 2016)): 1. Any 2 2 sub-matrix with same kind of inputs (i.e. Prices changes, Trades, Limits or Cancels) is symmetric. This shows empirically that ask and bid have symmetric roles. 2. The prices are mostly cross-excited, which means that a price increase is very likely to be followed by a price decrease, and conversely. This is consistent with the wavy prices we observe on financial markets. Uncovering Causality from Multivariate Hawkes Integrated Cumulants 3. The market, limit and cancel orders are strongly selfexcited. This can be explained by the persistence of order flows, and by the splitting of meta-orders into sequences of smaller orders. Moreover, we observe that orders impact the price without changing it. For example, the increase of cancel orders at the bid causes downward price moves. 4. Conclusion In this paper, we introduce a simple nonparametric method (the NPHC algorithm) that leads to a fast and robust estimation of the matrix G of the kernel integrals of a Multivariate Hawkes process that encodes Granger causality between nodes. This method relies on the matching of the integrated order 2 and order 3 empirical cumulants, which represent the simplest set of global observables containing sufficient information to recover the matrix G. Since this matrix fully accounts for the selfand crossinfluences of the process nodes (that can represent agents or users in applications), our approach can naturally be used to quantify the degree of endogeneity of a system and to uncover the causality structure of a network. By performing numerical experiments involving very different kernel shapes, we show that the baselines, involving either parametric or non-parametric approaches are very sensible to model misspecification, do not lead to accurate estimation, and are numerically expensive, while NPHC provides fast, robust and reliable results. This is confirmed on the Meme Tracker database, where we show that NPHC outperforms classical approaches based on EM algorithms or the Wiener-Hopf equations. Finally, the NPHC algorithm provided very satisfying results on financial data, that are consistent with well-known stylized facts in finance. 5. Technical details 5.1. Proof of Equation (8) We denote ν(z) the matrix νij(z) = Lz t E(d N i ud N j u+t) dudt ΛiΛj , where Lz(f) is the Laplace transform of f, and ψt = P n 1 φ( n) t , where φ( n) t refers to the nth auto-convolution of φt. Then we use the characterization of second-order statistics, first formulated in (Hawkes, 1971) and fully generalized in (Bacry & Muzy, 2016), ν(z) = (Id + L z(Ψ))L(Id + Lz(Ψ)) , where Lij = Λiδij with δij the Kronecker symbol. Since Id + Lz(Ψ) = (Id Lz(Φ)) 1, taking z = 0 in the previous equation gives ν(0) = (Id G) 1L(Id G ) 1, which gives us the result since the entry (i, j) of the last equation gives Cij = P m Λm Rim Rjm. 5.2. Proof of Equation (9) We start from (Jovanovi c et al., 2015), cf. Eqs. (48) to (51), and group some terms: m Λm Rim Rjm Rkm m Rim Rjm X n Λn Rkn L0(ψmn) m Rim Rkm X n Λn Rjn L0(ψmn) m Rjm Rkm X n Λn Rin L0(ψmn). Using the relations L0(ψmn) = Rmn δmn and Cij = P m Λm Rim Rjm, proves Equation (9). 5.3. Integrated cumulants estimators For H > 0 let us denote HN i t = N i t+H N i t H. Let us first remark that, if one restricts the integration domain to ( H, H) in Eqs. (4) and (5), one gets by permuting integrals and expectations: Λidt = E(d N i t) Cijdt = E d N i t( HN j t 2HΛj) Kijkdt = E d N i t( HN j t 2HΛj)( HN k t 2HΛk) dtΛi E ( HN j t 2HΛj)( HN k t 2HΛk) . The estimators (11) and (12) are then naturally obtained by replacing the expectations by their empirical counterparts, notably E(d N i tf(t)) dt 1 For the estimator (13), we shall also notice that E(( HN j t 2HΛj)( HN k t 2HΛk)) = Z Z 1[ H,H](t)1[ H,H](t )Cjk t t dtdt = Z (2H |t|)+Cjk t dt. We estimate the last integral with the remark above. Uncovering Causality from Multivariate Hawkes Integrated Cumulants Acknowledgements This work benefited from the support of the École Polytechnique fund raising - Data Science Initiative. The authors want to thank Marcello Rambaldi for fruitful discussions on order book data s experiments. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. ar Xiv preprint ar Xiv:1603.04467, 2016. Aït-Sahalia, Y., Cacho-Diaz, J., and Laeven, R. JA. Modeling financial contagion using mutually exciting jump processes. Technical report, National Bureau of Economic Research, 2010. Bacry, E. and Muzy, J. F. Hawkes model for price and trades high-frequency dynamics. Quantitative Finance, 14(7):1147 1166, 2014. Bacry, E. and Muzy, J.-F. Firstand second-order statistics characterization of hawkes processes and non-parametric estimation. IEEE Transactions on Information Theory, 62(4):2184 2202, 2016. Bacry, E., Mastromatteo, I., and Muzy, J.-F. Hawkes processes in finance. Market Microstructure and Liquidity, 1 (01):1550005, 2015. Bacry, E., Jaisson, T., and Muzy, J.-F. Estimation of slowly decreasing hawkes kernels: application to high-frequency order book dynamics. Quantitative Finance, pp. 1 23, 2016. Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and Le Cun, Y. The loss surfaces of multilayer networks. In AISTATS, 2015. Crane, R. and Sornette, D. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences, 105 (41), 2008. Da Fonseca, J. and Zaatour, R. Hawkes process: Fast calibration, application to trade clustering, and diffusive limit. Journal of Futures Markets, 34(6):548 579, 2014. Daley, D. J. and Vere-Jones, D. An Introduction to the Theory of Point Processes Volume I: Elementary Theory and Methods. Springer Science & Business Media, 2003. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121 2159, 2011. Eichler, M., Dahlhaus, R., and Dueck, J. Graphical modeling for multivariate hawkes processes with nonparametric link functions. Journal of Time Series Analysis, pp. n/a n/a, 2016. ISSN 1467-9892. Farajtabar, M., Wang, Y., Rodriguez, M., Li, S., Zha, H., and Song, L. Coevolve: A joint point process model for information diffusion and network co-evolution. In Advances in Neural Information Processing Systems, pp. 1945 1953, 2015. Gomez-Rodriguez, M., Leskovec, J., and Schölkopf, B. Modeling information propagation with survival theory. Proceedings of the International Conference on Machine Learning, 2013. Granger, C. W. J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424 438, 1969. ISSN 00129682, 14680262. Hall, A. R. Generalized Method of Moments. Oxford university press, 2005. Hansen, N. R., Reynaud-Bouret, P., and Rivoirard, V. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1):83 143, 2015. Hardiman, S. J. and Bouchaud, J.-P. Branching-ratio approximation for the self-exciting Hawkes process. Phys. Rev. E, 90(6):062807, December 2014. doi: 10.1103/ Phys Rev E.90.062807. Hawkes, A. G. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society. Series B (Methodological), 33(3):438 443, 1971. ISSN 00359246. Jovanovi c, S., Hertz, J., and Rotter, S. Cumulants of Hawkes point processes. Phys. Rev. E, 91(4):042802, April 2015. doi: 10.1103/Phys Rev E.91.042802. Lemonnier, R. and Vayatis, N. Nonparametric markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate hawkes processes. In Machine Learning and Knowledge Discovery in Databases, pp. 161 176. Springer, 2014. Lewis, E. and Mohler, G. A nonparametric em algorithm for multiscale hawkes processes. Journal of Nonparametric Statistics, 2011. Mohler, G. O., Short, M. B., Brantingham, P. J., Schoenberg, F. P., and Tita, G. E. Self-exciting point process modeling of crime. Journal of the American Statistical Association, 2011. Ogata, Y. On lewis simulation method for point processes. Information Theory, IEEE Transactions on, 27(1):23 31, 1981. Uncovering Causality from Multivariate Hawkes Integrated Cumulants Ogata, Y. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50(2):379 402, 1998. Podosinnikova, A., Bach, F., and Lacoste-Julien, S. Rethinking lda: moment matching for discrete ica. In Advances in Neural Information Processing Systems, pp. 514 522, 2015. Reynaud-Bouret, P. and Schbath, S. Adaptive estimation for hawkes processes; application to genome analysis. The Annals of Statistics, 38(5):2781 2822, 2010. Subrahmanian, V.S., Azaria, A., Durst, S., Kagan, V., Galstyan, A., Lerman, K., Zhu, L., Ferrara, E., Flammini, A., and Menczer, F. The darpa twitter bot challenge. Computer, 49(6):38 46, 2016. Xu, H., Farajtabar, M., and Zha, H. Learning granger causality for hawkes processes. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1717 1726, 2016. Yang, S.-H. and Zha, H. Mixture of mutually exciting processes for viral diffusion. In Proceedings of the International Conference on Machine Learning, 2013. Zhou, K., Zha, H., and Song, L. Learning triggering kernels for multi-dimensional hawkes processes. In Proceedings of the International Conference on Machine Learning, pp. 1301 1309, 2013a. Zhou, K., Zha, H., and Song, L. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. AISTATS, 2013b.