Journal of Machine Learning Research 21 (2020) 1-32. Submitted 3/15; Revised 2/20; Published 4/20.

# Sparse and Low-Rank Multivariate Hawkes Processes

**Emmanuel Bacry** (EMMANUEL.BACRY@POLYTECHNIQUE.EDU), CEREMADE, CNRS UMR 7534, Université Paris-Dauphine, Paris, France
**Martin Bompaire** (M.BOMPAIRE@CRITEO.COM), Criteo, Paris, France
**Stéphane Gaïffas** (STEPHANE.GAIFFAS@LPSM.PARIS), LPSM, CNRS UMR 8001, Université Paris-Diderot, Paris, France; DMA, CNRS UMR 8553, École Normale Supérieure, Paris, France
**Jean-François Muzy** (MUZY@UNIV-CORSE.FR), Laboratoire Sciences Pour l'Environnement, CNRS UMR 6134, Université de Corse, Corte, France

Editor: Nicolas Vayatis

**Abstract.** We consider the problem of unveiling the implicit network structure of node interactions (such as user interactions in a social network), based only on high-frequency timestamps. Our inference is based on the minimization of the least-squares loss associated with a multivariate Hawkes model, penalized by the ℓ1 norm and the trace norm of the interaction tensor. We provide a first theoretical analysis of this problem that includes sparsity and low-rank inducing penalizations. This analysis relies on a new data-driven concentration inequality for matrix martingales in continuous time with observable variance, a result of independent interest with a broad range of possible applications, since it extends to matrix martingales earlier results restricted to the scalar case. A consequence of our analysis is the construction of sharply tuned ℓ1 and trace-norm penalizations, which leads to a data-driven scaling by the variability of the information available for each node. Numerical experiments illustrate the significant improvements achieved by the use of such data-driven penalizations.

**Keywords.** Hawkes processes; sparsity; low-rank; random matrices; data-driven concentration
## 1. Introduction

Understanding the dynamics of social interactions is a challenging problem of rapidly growing interest (de Menezes and Barabási, 2004; Leskovec, 2008; Crane and Sornette, 2008; Leskovec et al., 2009), because of the large number of applications in web advertisement and e-commerce, where large-scale logs of event history are available. A common supervised approach consists in predicting labels based on declared interactions (friendship, like, follower, etc.). However, such supervision is not always available, and it does not always describe accurately the level of interaction between users: labels are often only binary while a quantification of the interaction is more interesting, declared interactions are often deprecated, and, more generally, a supervised approach is not enough to infer the latent communities of users, since the temporal patterns of user actions are much more informative. To recover latent social groups, several recent papers (Rodriguez et al., 2011; Gomez-Rodriguez et al., 2013; Daneshmand et al., 2014) consider an approach directly based on the actual actions or events of users (referred to as nodes in the following), which are fully identified by their user id and timestamp. These models assume that the data consist of a sequence of independent cascades containing the timestamps of each node. In these works, techniques from survival analysis are used to derive a tractable convex likelihood that allows one to infer the latent community structure. However, they require the data to be already segmented into sets of independent cascades, which is often unrealistic. Moreover, this setting does not allow for recurrent events (a node can be infected only once), and it cannot incorporate exogenous factors, i.e., influence from the world outside the network.

©2020 Emmanuel Bacry, Martin Bompaire, Stéphane Gaïffas and Jean-François Muzy. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.
Attribution requirements are provided at http://jmlr.org/papers/v21/15-114.html.

Another approach is based on self-exciting point processes, such as the Hawkes process (Hawkes, 1971). Previously used in geophysics (Ogata, 1998), high-frequency finance (Bacry et al., 2013, 2015) and crime-activity modeling (Mohler et al., 2011), these processes have recently been used to model user activity in social networks, see for instance Crane and Sornette (2008); Blundell et al. (2012); Zhou et al. (2013); Yang and Zha (2013). The structure of the Hawkes model allows us to capture the direct influence of a specific user's action on all future actions of all users (including himself). It encompasses in a single likelihood the decay of influence over time, the levels of interaction between nodes, which can be seen as a weighted asymmetric adjacency matrix, and a baseline intensity, which measures the level of exogeneity of a user, namely the spontaneous occurrence of an action with no influence from other nodes of the network. In this paper, we consider such a multivariate Hawkes process (MHP), and we combine convex proxies for sparsity and low rank of the adjacency tensor and the baseline intensities, which are now in common use in low-rank modeling for collaborative filtering problems (Candès and Tao, 2004, 2009). Note that this approach is also considered in Zhou et al. (2013). We provide a first theoretical analysis of the generalization error for this problem that includes sparsity and low-rank inducing penalizations (see Hansen et al. (2012) for an analysis including only entrywise ℓ1 penalization). Namely, we prove a sharp oracle inequality for our procedure, see Theorem 6 in Section 5.
This result involves a new data-driven concentration inequality for matrix martingales in continuous time, see Theorems 3 and 4 in Section 3.3. These results are of independent interest: they extend previous non-commutative versions of concentration inequalities for martingales in discrete time, see Tropp (2012). A consequence of our analysis is the construction of sharply tuned ℓ1 and trace-norm penalizations, which leads to a data-driven scaling by the variability of the information available for each node. We give empirical evidence of the improvements brought by our data-driven penalizations by conducting numerical experiments on simulated data in Section 6. Since the objectives involved are convex with a smooth component, our algorithms build upon standard batch proximal gradient descent algorithms.

## 2. The multivariate Hawkes model and the least-squares functional

Consider a finite network with d nodes (each node corresponding, for instance, to a user in a social network). For each node $j \in \{1, \dots, d\}$, we observe the timestamps $\{t_{j,1}, t_{j,2}, \dots\}$ of the actions of node j on the network (a message, a click, etc.). With each node j is associated a counting process $N_j(t) = \sum_{i \geq 1} \mathbf 1_{t_{j,i} \leq t}$, and we consider the d-dimensional counting process $N_t = [N_1(t) \cdots N_d(t)]^\top$ for $t \geq 0$. We observe this process for $t \in [0, T]$. Each $N_j$ has an intensity $\lambda_j$, meaning that
$$\mathbb P\big(N_j \text{ has a jump in } [t, t + dt] \,\big|\, \mathcal F_t\big) = \lambda_j(t)\, dt, \quad j = 1, \dots, d,$$
where $\mathcal F_t$ is the σ-field generated by $N$ up to time t. The multivariate Hawkes model assumes that each $N_j$ has an intensity $\lambda_{j,\theta}$ given by
$$\lambda_{j,\theta}(t) = \mu_j + \sum_{j'=1}^d \int_{(0,t)} \varphi_{j,j'}(t - s)\, dN_{j'}(s), \tag{1}$$
where $\mu_j \geq 0$ is the baseline intensity of j (i.e., the intensity of exogenous events of node j) and where the functions $\varphi_{j,j'} : \mathbb R^+ \to \mathbb R$ for $j, j' = 1, \dots, d$, called kernels, quantify the impact of node $j'$ on node j. Note that the integral used in Equation (1) is a Stieltjes integral, namely it simply stands for
$$\int_{(0,t)} \varphi(t - s)\, dN_{j'}(s) = \sum_{i \,:\, t_{j',i} \in [0,t)} \varphi(t - t_{j',i}).$$
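For concreteness, the intensity (1) can be evaluated directly from the event timestamps once the kernels are specified. The sketch below (our own helpers, not part of the paper's code) uses the exponential parametrization of Equation (3) introduced below, and also checks the sufficient stationarity condition used later in Section 6:

```python
import numpy as np

def intensity(t, mu, A, alphas, events):
    """Evaluate the Hawkes intensities lambda_{j,theta}(t) of Equation (1),
    with the exponential kernels of Equation (3).
    mu: (d,) baselines; A: (d, d, K) tensor; alphas: (K,) decay rates;
    events[j']: array of timestamps of node j'."""
    lam = mu.astype(float).copy()
    d, _, K = A.shape
    for jp in range(d):
        past = events[jp][events[jp] < t]   # events of j' strictly before t
        for k in range(K):
            # sum_i alpha_k * exp(-alpha_k * (t - t_{j',i}))
            excite = alphas[k] * np.exp(-alphas[k] * (t - past)).sum()
            lam += A[:, jp, k] * excite
    return lam

def is_stationary(A):
    """Sufficient stationarity check: since ||h_{j,j',k}||_1 = 1, stability
    holds when the spectral norm of sum_k A[:, :, k] is below one (the
    rescaling to 0.8 used in Section 6 enforces this)."""
    return np.linalg.norm(A.sum(axis=2), ord=2) < 1.0
```

With a single node (d = 1, K = 1) excited by one event at time 0, the intensity at t = 1 is simply the baseline plus one decayed kernel evaluation.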
In the paper, we consider general kernel functions $\varphi_{j,j'}(t)$ that can be written as
$$\varphi_{j,j'}(t) = \sum_{k=1}^K a_{j,j',k}\, h_{j,j',k}(t), \tag{2}$$
where the coefficients $a_{j,j',k}$ are the entries of a $d \times d \times K$ tensor $\mathbf A$ (i.e., $(\mathbf A)_{j,j',k} = a_{j,j',k}$) and the kernels $h_{j,j',k}(t)$ are elements of a fixed dictionary of nonnegative and causal functions $h_{j,j',k} : \mathbb R^+ \to \mathbb R^+$ such that $\|h_{j,j',k}\|_1 = 1$. In that respect, the weights $a_{j,j',1}, \dots, a_{j,j',K}$ all quantify the influence of $j'$ on j, but the particular weight $a_{j,j',k}$ quantifies it for the k-th decay function $h_{j,j',k}$. A standard choice is a dictionary of exponential kernels $h_{j,j',k}(t) = \alpha_k e^{-\alpha_k t}$ with varying memory parameters $\alpha_1, \dots, \alpha_K$. This leads to the following standard parametrization of the kernel functions, called exponential kernels:
$$\varphi_{j,j'}(t) = \sum_{k=1}^K a_{j,j',k}\, \alpha_k \exp(-\alpha_k t). \tag{3}$$
The main advantage of exponential kernels with fixed memory parameters $\alpha_1, \dots, \alpha_K$ is that they lead to a convex problem. In the general case, or when the memory parameters are unknown, the problem becomes non-convex and more challenging, and is beyond the scope of the paper. The parameter of interest is the self-excitement tensor $\mathbf A$, which can be viewed as a cross-scale (for $k = 1, \dots, K$) weighted adjacency matrix of connectivity between nodes, as illustrated in Figure 1 below.

Figure 1: Toy example with d = 10 nodes. Based on the action timestamps of the nodes, represented by vertical bars (top), we aim at recovering the vector $\mu_0$ and the tensor $\mathbf A$ of implicit influence between nodes (bottom).

The Hawkes model is particularly relevant for modeling the microscopic activity of social networks and has attracted a lot of interest in the recent literature for this kind of application (see Crane and Sornette (2008); Blundell et al. (2012); Zhou et al. (2013); Yang and Zha (2013); Linderman and Adams (2014); DuBois et al. (2013); Iwata et al.
(2013), among others), with a particular emphasis on Hansen et al. (2012), which gives first theoretical results for the Lasso used with Hawkes processes, with an application to neurobiology. The main point is that this simple autoregressive structure of the intensity allows us to capture the direct influence of a user, based on the recurrence and the patterns of his actions, by separating the intensity into a baseline and a self-exciting component, hence filtering out exogeneity in the estimation of the users' influences on each other. We introduce in this paper an estimation procedure for $\theta = (\mu, \mathbf A)$ based on the data $\{N_t : t \in [0, T]\}$. The hidden structure underlying the observed actions of the nodes is contained in $\mathbf A$. Our strategy is based on the minimization, with respect to θ, of the least-squares functional
$$R_T(\theta) = \|\lambda_\theta\|_T^2 - \frac 2T \sum_{j=1}^d \int_{[0,T]} \lambda_{j,\theta}(t)\, dN_j(t), \tag{4}$$
where $\|\lambda_\theta\|_T^2 = \frac 1T \sum_{j=1}^d \int_{[0,T]} \lambda_{j,\theta}(t)^2\, dt$ is the norm associated with the inner product
$$\langle \lambda_\theta, \lambda_{\theta'} \rangle_T = \frac 1T \sum_{j=1}^d \int_{[0,T]} \lambda_{j,\theta}(t)\, \lambda_{j,\theta'}(t)\, dt. \tag{5}$$
This least-squares functional is very natural and comes from the empirical risk minimization principle (Van De Geer, 2000; Massart, 2007; Koltchinskii, 2011; Bartlett and Mendelson, 2006): assuming that $N_j$ has an unknown ground-truth intensity $\lambda_j$ (not necessarily following the Hawkes model), the Doob-Meyer decomposition gives
$$\int_{[0,T]} \lambda_{j,\theta}(t)\, dN_j(t) = \int_{[0,T]} \lambda_{j,\theta}(t)\, \lambda_j(t)\, dt + \int_{[0,T]} \lambda_{j,\theta}(t)\, dM_j(t),$$
where $M_j(t) = N_j(t) - \int_0^t \lambda_j(s)\, ds$ is a continuous-time martingale with upward jumps of +1. Since the noise term $\int_{[0,T]} \lambda_{j,\theta}(t)\, dM_j(t)$ is centered, we obtain
$$\mathbb E[R_T(\theta)] = \mathbb E \|\lambda_\theta\|_T^2 - 2\, \mathbb E \langle \lambda_\theta, \lambda \rangle_T = \mathbb E \|\lambda_\theta - \lambda\|_T^2 - \mathbb E \|\lambda\|_T^2,$$
so that we expect a minimizer $\hat\theta$ of $R_T(\theta)$ to lead to a good estimator $\lambda_{\hat\theta}$ of λ, following the empirical risk minimization principle. As explained in Section 8 below, the noise terms can be written as
$$\int_0^t \mathbf T_s \circ dM_s$$
for a specific tensor $\mathbf T_t$ and matrix martingale $M_t$, where $\mathbf T_s \circ M_s$ stands for the tensor-matrix product defined in Section 3.1 below.
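A crude numerical sketch of the least-squares functional (4) helps fix ideas: the quadratic term is discretized on a midpoint grid, while the jump term is an exact sum over the observed events (in the actual procedure these integrals are precomputed exactly, see Section 6.4; the naming below is ours):

```python
import numpy as np

def least_squares_loss(lam_fn, events, T, n_grid=10_000):
    """Approximate R_T(theta) of Equation (4).
    The quadratic term (1/T) sum_j int_0^T lambda_j(t)^2 dt is discretized on
    a midpoint grid; the jump term sums lambda_j over the events of N_j.
    lam_fn(t): (d,) intensity vector at time t; events[j]: timestamps of node j."""
    d = len(events)
    grid = np.linspace(0.0, T, n_grid, endpoint=False) + T / (2 * n_grid)
    lam_grid = np.array([lam_fn(t) for t in grid])        # (n_grid, d)
    quad = (lam_grid ** 2).mean(axis=0) * T               # int_0^T lambda_j^2 dt
    jumps = np.array([sum(lam_fn(t)[j] for t in events[j]) for j in range(d)])
    return float((quad - 2.0 * jumps).sum() / T)
```

As a sanity check, a constant intensity $\lambda(t) \equiv c$ with exactly $c\,T/2$ events... more simply, for $\lambda \equiv 2$ on $[0, 1]$ with a single event, the quadratic term (4) equals twice the jump term and the loss vanishes.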
The next section introduces new results, of independent interest, providing data-driven deviation inequalities for the operator norm of a matrix martingale defined as the stochastic integral $\int_0^t \mathbf T_s \circ dM_s$. These results allow us, as a by-product, to control the noise terms arising in the application considered in this paper, and they lead to a sharp data-driven tuning of the penalizations used on $\mathbf A$, as explained in Section 4 below.

## 3. A new data-driven matrix martingale Bernstein inequality

An important ingredient for the theoretical results proposed in this paper is an observable deviation inequality for continuous-time matrix martingales. We first recall previous results obtained in Bacry et al. (2016b) about non-observable deviation inequalities for such objects.

### 3.1. Notations

Let $\mathbf T$ be a tensor of shape $m \times n \times p \times q$. It can be considered as a linear mapping from $\mathbb R^{p \times q}$ to $\mathbb R^{m \times n}$ according to the following tensor-matrix product:
$$(\mathbf T \circ A)_{i,j} = \sum_{k=1}^p \sum_{l=1}^q \mathbf T_{i,j;k,l}\, A_{k,l}.$$
We denote by $\mathbf T^\top$ the tensor such that $\mathbf T^\top \circ A = (\mathbf T \circ A)^\top$ (i.e., $\mathbf T^\top_{i,j;k,l} = \mathbf T_{j,i;k,l}$), and by $\mathbf T_{\bullet,\bullet;k,l}$ and $\mathbf T_{i,j;\bullet,\bullet}$ the matrices obtained when fixing the indices $k, l$ and $i, j$ respectively. Note that $(\mathbf T \circ A)_{i,j} = \operatorname{tr}(\mathbf T_{i,j;\bullet,\bullet}\, A^\top)$. If $\mathbf T$ and $\mathbf T'$ are two tensors of dimensions $m \times n \times p \times q$ and $n \times r \times p \times q$ respectively, $\mathbf T \mathbf T'$ stands for the $m \times r \times p \times q$ tensor defined as $(\mathbf T \mathbf T')_{i,j;k,l} = (\mathbf T_{\bullet,\bullet;k,l}\, \mathbf T'_{\bullet,\bullet;k,l})_{i,j}$. Accordingly, for an integer $r \geq 1$, if the $\mathbf T_{\bullet,\bullet;a,b}$ are square matrices, we denote by $\mathbf T^r$ the tensor such that $(\mathbf T^r)_{i,j;k,l} = (\mathbf T^r_{\bullet,\bullet;k,l})_{i,j}$. We also introduce $\|\mathbf T\|_{\mathrm{op};\infty} = \max_{k,l} \|\mathbf T_{\bullet,\bullet;k,l}\|_{\mathrm{op}}$, the maximum operator norm over all matrices formed by the first two dimensions of the tensor $\mathbf T$. In this paper we consider the class of $m \times n$ matrix martingales that can be written as
$$Z_{\mathbf T}(t) = \int_0^t \mathbf T_s \circ dM_s, \tag{6}$$
where $\mathbf T_s$ is a tensor with dimensions $m \times n \times p \times q$, whose components are assumed to be locally bounded predictable random functions.
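The tensor-matrix product and the adjoint tensor of Section 3.1 can be written compactly with `einsum`; a small sketch (names are ours):

```python
import numpy as np

def tensor_matrix_product(T, A):
    """(T o A)_{i,j} = sum_{k,l} T_{i,j;k,l} A_{k,l}  (Section 3.1)."""
    return np.einsum('ijkl,kl->ij', T, A)

def tensor_adjoint(T):
    """T^t with (T^t)_{i,j;k,l} = T_{j,i;k,l}, so that T^t o A = (T o A)^T."""
    return np.swapaxes(T, 0, 1)
```

One can check numerically the two identities stated in the text: $(\mathbf T \circ A)_{i,j} = \operatorname{tr}(\mathbf T_{i,j;\bullet,\bullet} A^\top)$ and $\mathbf T^\top \circ A = (\mathbf T \circ A)^\top$.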
The process $M_t$ is a $p \times q$ matrix whose entries are square-integrable martingales with a diagonal quadratic covariation matrix. More explicitly, the entries of $Z_{\mathbf T}(t)$ are given by
$$(Z_{\mathbf T}(t))_{i,j} = \sum_{k,l} \int_0^t (\mathbf T_s)_{i,j;k,l}\, (dM_s)_{k,l},$$
where the martingale $M_t$ is a matrix of compensated counting processes, $M_t = N_t - \int_0^t \lambda_s\, ds$, where $N_t$ is a $p \times q$ matrix counting process (i.e., each component is a counting process) with an intensity process $\lambda_t$ which is predictable, continuous and with finite variations (FV).

### 3.2. A non-observable matrix martingale Bernstein inequality

The next theorem (a small variation of Theorem 2 in Bacry et al. (2016b)) provides a concentration inequality for $\|Z_{\mathbf T}(t)\|_{\mathrm{op}}$, the operator norm of $Z_{\mathbf T}(t)$. Before stating the theorem, let us introduce some more notations. We define
$$b_{\mathbf T}(t) = \sup_{0 \leq s \leq t} \max\big( \|\mathbf T_s\|_{\mathrm{op};\infty},\, \|\mathbf T_s^\top\|_{\mathrm{op};\infty} \big), \tag{7}$$
and, depending on whether the tensor $\mathbf T_s$ is symmetric (i.e., $\mathbf T_s^\top = \mathbf T_s$ and $m = n$) or not, we define the following. If $\mathbf T_s$ is symmetric, we define
$$W_{\mathbf T}(s) = \mathbf T_s^2 \circ \lambda_s \tag{8}$$
and $K_{m,n} = m$. If $\mathbf T_s$ is not symmetric, we define
$$W_{\mathbf T}(s) = \begin{bmatrix} \mathbf T_s \mathbf T_s^\top \circ \lambda_s & 0 \\ 0 & \mathbf T_s^\top \mathbf T_s \circ \lambda_s \end{bmatrix} \tag{9}$$
and $K_{m,n} = m + n$. In both cases, we define
$$V_{\mathbf T}(t) = \int_0^t W_{\mathbf T}(s)\, ds. \tag{10}$$
Finally, all along the paper we denote $\varphi(x) = e^x - x - 1$ for $x \in \mathbb R$. The following concentration inequality is an easy consequence of Theorem 1 from Bacry et al. (2016b).

**Theorem 1** Let $Z_{\mathbf T}(t)$ be the $m \times n$ matrix martingale given by Equation (6). Moreover, assume that
$$\int_0^t \frac{\varphi\big( 3 \max( \|\mathbf T_s\|_{\mathrm{op};\infty}, \|\mathbf T_s^\top\|_{\mathrm{op};\infty} ) \big)}{\max\big( \|\mathbf T_s\|_{\mathrm{op};\infty}^2, \|\mathbf T_s^\top\|_{\mathrm{op};\infty}^2 \big)}\, (W_{\mathbf T}(s))_{i,j}\, ds < +\infty \tag{11}$$
for any $1 \leq i, j \leq m + n$. Then for any $\xi \in (0, 3)$ and $t, b, x > 0$, the following holds:
$$\mathbb P\Big( \|Z_{\mathbf T}(t)\|_{\mathrm{op}} \geq \frac{\varphi(\xi)}{\xi b}\, \lambda_{\max}(V_{\mathbf T}(t)) + \frac{x b}{\xi},\; b_{\mathbf T}(t) \leq b \Big) \leq K_{m,n}\, e^{-x}. \tag{12}$$
Optimizing this last inequality over ξ gives
$$\mathbb P\Big( \|Z_{\mathbf T}(t)\|_{\mathrm{op}} \geq \sqrt{2 v x} + \frac{b x}{3},\; \lambda_{\max}(V_{\mathbf T}(t)) \leq v,\; b_{\mathbf T}(t) \leq b \Big) \leq K_{m,n}\, e^{-x}. \tag{13}$$
The proof of Theorem 1 is given in Section 8.1 below. This result is a Freedman (or Bernstein) inequality for the operator norm of $Z_{\mathbf T}(t)$, which provides a deviation based on a variance term $V_{\mathbf T}(t)$ and an $L^\infty$ term $b_{\mathbf T}(t)$.
It is a strong generalization of the scalar Freedman inequality for continuous-time martingales, and it matches the scalar case exactly whenever $Z_{\mathbf T}(t)$ is scalar. A more thorough discussion of the consequences of this result is provided in Bacry et al. (2016b).

### 3.3. Data-driven matrix martingale Bernstein inequalities

Inequality (13) is of poor practical interest in situations where one observes only the jump times of the components of $Z_t$ (namely $N_t$) and not the stochastic intensity $\lambda_t$. In that respect, one needs a data-driven inequality where $V_{\mathbf T}(t)$ is replaced by its empirical version $\hat V_{\mathbf T}(t)$. If $\mathbf T_s$ is symmetric, we define
$$\hat V_{\mathbf T}(t) = \int_0^t \mathbf T_s^2 \circ dN_s,$$
while if $\mathbf T_s$ is not symmetric, we define
$$\hat V_{\mathbf T}(t) = \begin{bmatrix} \int_0^t \mathbf T_s \mathbf T_s^\top \circ dN_s & 0 \\ 0 & \int_0^t \mathbf T_s^\top \mathbf T_s \circ dN_s \end{bmatrix}.$$
The next proposition allows us to control $\lambda_{\max}(V_{\mathbf T}(t))$ by its observable counterpart $\lambda_{\max}(\hat V_{\mathbf T}(t))$ with large probability. This result is a generalization to arbitrary matrices of dimensions $m \times n$ of an analogous inequality originally proven by Hansen et al. (2012) for scalar martingales.

**Proposition 2** For any $x, b > 0$ and $\xi \in (0, 3)$ such that $\xi > \varphi(\xi)$, we have
$$\mathbb P\Big( \lambda_{\max}(V_{\mathbf T}(t)) \geq \frac{\xi}{\xi - \varphi(\xi)}\, \lambda_{\max}(\hat V_{\mathbf T}(t)) + \frac{x b^2}{\xi - \varphi(\xi)},\; b_{\mathbf T}(t) \leq b \Big) \leq K_{m,n}\, e^{-x},$$
where $K_{m,n}$ is defined as in Theorem 1. Moreover, choosing $\xi = -W_{-1}\big( -\frac{2}{3 e^{2/3}} \big) - \frac 23$ (note that $\xi \approx 0.762$), where $W_{-1}$ is the second branch of the Lambert W function, leads to
$$\mathbb P\big[ \lambda_{\max}(V_{\mathbf T}(t)) \geq 2 \lambda_{\max}(\hat V_{\mathbf T}(t)) + c b^2 x,\; b_{\mathbf T}(t) \leq b \big] \leq K_{m,n}\, e^{-x}$$
for any $x, b > 0$, with $c = 2.62$. Thanks to Proposition 2, we can establish an analogue of Theorem 1 in which $\lambda_{\max}(V_{\mathbf T}(t))$ is replaced by its data-driven version $\lambda_{\max}(\hat V_{\mathbf T}(t))$, up to a slight loss in the values of the numerical constants.

**Theorem 3** With the same notations and assumptions as in Theorem 1, one has
$$\mathbb P\Big( \|Z_{\mathbf T}(t)\|_{\mathrm{op}} \geq 2 \sqrt{v x} + c b x,\; \lambda_{\max}(\hat V_{\mathbf T}(t)) \leq v,\; b_{\mathbf T}(t) \leq b \Big) \leq 2 K_{m,n}\, e^{-x} \tag{14}$$
for any $x, b > 0$, with $c = 14.39$. The proof of Theorem 3 is given in Section 8.3 below. It follows from simple arguments combining Theorem 1 and Proposition 2.
However, this inequality is stated on the events $\{\lambda_{\max}(\hat V_{\mathbf T}(t)) \leq v\}$ and $\{b_{\mathbf T}(t) \leq b\}$, while an unconditional deviation inequality is more practical. Such a result, which involves some extra technicalities, is stated in the next theorem.

**Theorem 4** With the same conditions and notations as in Theorem 3, one has
$$\mathbb P\Big( \|Z_{\mathbf T}(t)\|_{\mathrm{op}} \geq 2 \sqrt{\lambda_{\max}(\hat V_{\mathbf T}(t))\, (x + \ell_x(t))} + c\, (x + \ell_x(t))\, (1 + b_{\mathbf T}(t)) \Big) \leq C_{m,n}\, e^{-x}, \tag{15}$$
where $C_{m,n} = \frac{\pi^4}{18 \log(2)^4}\, K_{m,n} \approx 23.45\, K_{m,n}$, $c = 14.39$ and
$$\ell_x(t) = 2 \log \log \Big( \frac{4 \lambda_{\max}(\hat V_{\mathbf T}(t))}{x} \vee 2 \Big) + 2 \log \log \big( 4\, b_{\mathbf T}(t) \vee 2 \big).$$
The proof of this theorem is given in Section 8.4. It is a result of independent interest that controls the operator norm of a matrix martingale in continuous time (with jumps of size at most 1) using only observable quantities. Along with Bacry et al. (2016b), it provides a first deviation inequality for such objects, and it can be understood as a data-driven version of the results given in Bacry et al. (2016b).

## 4. The procedure

We want to produce an estimation procedure for $\theta = (\mu, \mathbf A)$ based on the data $\{N_t : t \in [0, T]\}$. Following the empirical risk minimization principle, the estimation procedure uses the least-squares functional (4) as a goodness-of-fit criterion. In addition, we use a penalization that reduces the dimensionality of the model, namely we consider
$$\hat\theta \in \operatorname*{argmin}_{\theta = (\mu, \mathbf A) \in \mathbb R_+^d \times \mathbb R_+^{d \times d \times K}} \big\{ R_T(\theta) + \operatorname{pen}(\theta) \big\} \tag{16}$$
for a specific penalization function $\operatorname{pen}(\theta)$ described below. In particular, we want to reduce the dimensionality of $\mathbf A$, based on the prior assumption that latent factors explain the connectivity of users in the network. This leads to a low-rank assumption on $\mathbf A$, which is commonly used in collaborative filtering and matrix completion techniques (Ricci et al., 2011). Our prior assumptions on μ and $\mathbf A$ are the following.

**Sparsity of μ.** Some nodes are basically inactive and react only if stimulated. Hence, we assume that the baseline intensity vector μ is sparse.

**Sparsity of $\mathbf A$.**
A node interacts with only a fraction of the other nodes, meaning that for a fixed node j, only a few $a_{j,j',k}$ are non-zero. Moreover, a node might react at specific time scales only, namely $a_{j,j',k}$ is non-zero only for some k, for fixed $j, j'$. Hence, we assume that $\mathbf A$ is an entrywise sparse tensor.

**Low rank of $\mathbf A$.** Combining Equations (1) and (2), one can write
$$\lambda_{j,\theta}(t) = \mu_j + \sum_{j'=1}^d \sum_{k=1}^K a_{j,j',k} \int_{(0,t)} h_{j,j',k}(t - s)\, dN_{j'}(s) = \mu_j + \big\langle \operatorname{hstack}(\mathbf A)_{j,\bullet},\, \operatorname{hstack}(\mathbf H(t))_{j,\bullet} \big\rangle, \tag{17}$$
where $\mathbf H(t)$ is the $d \times d \times K$ tensor with entries
$$H_{j,j',k}(t) = \int_{(0,t)} h_{j,j',k}(t - s)\, dN_{j'}(s), \tag{18}$$
where $(X)_{j,\bullet}$ stands for the j-th row of a matrix $X$, and where hstack stands for the horizontal stacking operator $\operatorname{hstack} : \mathbb R^{d \times d \times K} \to \mathbb R^{d \times Kd}$ defined by
$$\operatorname{hstack}(\mathbf A) = \big[ \mathbf A_{\bullet,\bullet,1} \cdots \mathbf A_{\bullet,\bullet,K} \big], \tag{19}$$
where $\mathbf A_{\bullet,\bullet,k}$ stands for the $d \times d$ matrix with entries $(\mathbf A_{\bullet,\bullet,k})_{j,j'} = \mathbf A_{j,j',k}$. In view of Equation (17), all the impacts of the nodes $j'$ at time scale k on node j are encoded in the j-th row of the $d \times Kd$ matrix $\operatorname{hstack}(\mathbf A)$. Therefore, a natural assumption is that the matrix $\operatorname{hstack}(\mathbf A)$ has low rank: we assume that there exist latent factors that explain the way nodes impact other nodes through the different scales $k = 1, \dots, K$. To induce these prior assumptions on the parameters, we use a penalization based on a mixture of the ℓ1 and trace norms. These norms are, respectively, the tightest convex relaxations for sparsity and low rank, see for instance Candès and Tao (2004, 2009). They provide state-of-the-art results in compressed sensing and collaborative filtering problems, among many others. These two norms have been previously combined for the estimation of sparse and low-rank matrices, see for instance Richard et al. (2014), and Zhou et al. (2013) in the context of MHP.
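The hstack operator of Equation (19) is simply a horizontal concatenation of the K slices of the tensor; a one-liner sketch (our naming):

```python
import numpy as np

def hstack_tensor(A):
    """hstack: R^{d x d x K} -> R^{d x Kd}, Equation (19):
    the K slices A[:, :, k] are stacked side by side."""
    d, _, K = A.shape
    return np.concatenate([A[:, :, k] for k in range(K)], axis=1)
```

The j-th row of the result then collects, across all scales k, the impacts of every node $j'$ on node j, which is exactly the row structure on which the low-rank assumption is placed.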
Therefore, we consider the following penalization on the parameter $\theta = (\mu, \mathbf A)$:
$$\operatorname{pen}(\theta) = \|\mu\|_{1,\hat w} + \|\mathbf A\|_{1,\hat{\mathbf W}} + \hat\tau\, \|\operatorname{hstack}(\mathbf A)\|_*, \tag{20}$$
where the terms are, respectively, entrywise weighted ℓ1 and trace-norm penalizations given by
$$\|\mu\|_{1,\hat w} = \sum_{j=1}^d \hat w_j |\mu_j|, \qquad \|\mathbf A\|_{1,\hat{\mathbf W}} = \sum_{1 \leq j,j' \leq d,\, 1 \leq k \leq K} \hat W_{j,j',k} |A_{j,j',k}|, \qquad \|A\|_* = \sum_j \sigma_j(A),$$
where $\sigma_1(A) \geq \cdots \geq \sigma_d(A)$ are the singular values of a matrix $A$ (we take $A = \operatorname{hstack}(\mathbf A)$ in the penalization). The weights $\hat w$, $\hat{\mathbf W}$ and the coefficient $\hat\tau$ are data-driven tuning parameters described below. The choice of these weights comes from a sharp analysis of the noise terms and leads to a data-driven scaling by the variability of the information available for each node. From now on, we fix some confidence level $x > 0$, which corresponds to the probability with which the oracle inequality from Theorem 6 holds (see Section 5 below). This can be safely chosen as $x = \log T$ for instance, as done in our numerical experiments (see Section 6 below).

**Weight $\hat\tau$ for the trace-norm penalization of $\operatorname{hstack}(\mathbf A)$.** This weight comes from Corollary 7 (see Section 8.5). Let us introduce the $d \times Kd$ matrix $H(t) = \operatorname{hstack}(\mathbf H(t))$, where $\mathbf H(t)$ is the $d \times d \times K$ tensor defined by (18) and hstack is the horizontal stacking operator defined by (19). Let us also recall that $\|\cdot\|_2$ is the ℓ2-norm, and define $\|H\|_{\infty,2} = \max_{1 \leq j \leq d} \|H_{j,\bullet}\|_2$, where $H_{j,\bullet}$ stands for the j-th row of $H$. We define
$$\hat\tau = 4 \sqrt{\frac{\lambda_{\max}(\hat V(T)/T)\, \big( x + \log(2d) + \ell_\tau(T) \big)}{T}} + 28.78\, \frac{\big( x + \log(2d) + \ell_\tau(T) \big) \big( 1 + \sup_{0 \leq t \leq T} \|H(t)\|_{\infty,2} \big)}{T}, \tag{21}$$
where
$$\lambda_{\max}(\hat V(T)) = \lambda_{\max}\Big( \int_0^T H(s)^\top \operatorname{diag}(dN(s))\, H(s) \Big) \vee \max_{j=1,\dots,d} \int_0^T \|H_{j,\bullet}(s)\|_2^2\, dN_j(s),$$
$$\ell_\tau(T) = 2 \log \log \Big( \frac{4 \lambda_{\max}(\hat V(T))}{x} \vee 2 \Big) + 2 \log \log \Big( 4 \sup_{0 \leq t \leq T} \|H(t)\|_{\infty,2} \vee 2 \Big),$$
and where we use the notation $a \vee b = \max(a, b)$ for $a, b \in \mathbb R$.

**Weights $\hat w_j$ for the ℓ1-penalization of μ.** These weights are given by
$$\hat w_j = 4 \sqrt{\frac{(N_j(T)/T)\, \big( x + \log d + \ell_j(T) \big)}{T}} + 86.34\, \frac{x + \log d + \ell_j(T)}{T}, \tag{22}$$
with $\ell_j(T) = 2 \log \log \big( \frac{4 N_j(T)}{x} \vee 2 \big) + 2 \log \log 4$.
The weighting of each coordinate j in the penalization of μ is natural: it is roughly proportional to the square root of $N_j(T)/T$, which is the average intensity of events on coordinate j. The term $\ell_j(T)$ is a technical term that can be neglected in practice, see Section 6.

**Weights $\hat W_{j,j',k}$ for the ℓ1-penalization of $\mathbf A$.** Recall that the tensor $\mathbf H$ is given by (18). The weights $\hat W_{j,j',k}$ are given by
$$\hat W_{j,j',k} = 4 \sqrt{\frac{\frac 1T \int_0^T H_{j,j',k}(t)^2\, dN_j(t)\, \big( x + \log(Kd^2) + L_{j,j',k}(T) \big)}{T}} + 28.78\, \frac{\big( x + \log(Kd^2) + L_{j,j',k}(T) \big) \big( 1 + \sup_{0 \leq t \leq T} |H_{j,j',k}(t)| \big)}{T}, \tag{23}$$
where
$$L_{j,j',k}(T) = 2 \log \log \Big( \frac{4 \int_0^T H_{j,j',k}(t)^2\, dN_j(t)}{x} \vee 2 \Big) + 2 \log \log \Big( 4 \sup_{0 \leq t \leq T} |H_{j,j',k}(t)| \vee 2 \Big).$$
Once again, this is natural: the variance term $\int_0^T H_{j,j',k}(t)^2\, dN_j(t)$ is, roughly, an estimate of the variance of the self-excitement between coordinates j and $j'$ at time scale k. The term $L_{j,j',k}(T)$ is a technical term that can be neglected in practice. These weights are actually quite natural: the terms $\lambda_{\max}(\hat V(T))$ and $\int_0^T H_{j,j',k}(t)^2\, dN_j(t)$ correspond to estimates of the noise variance, i.e., the $L^2$ terms appearing in the empirical Bernstein inequalities given in Section 3.3. The terms $\sup_{0 \leq t \leq T} \|H(t)\|_{\infty,2}$ and $\sup_{0 \leq t \leq T} |H_{j,j',k}(t)|$ correspond to the $L^\infty$ terms from these Bernstein inequalities. Once again, these data-driven weights lead to a sharp tuning of the penalizations, as illustrated numerically in Section 6 below.

## 5. A sharp oracle inequality

Recall that the inner product $\langle \lambda_1, \lambda_2 \rangle_T$ is given by (5), and recall that $\|\cdot\|_T$ stands for the corresponding norm. Theorem 6 below is a sharp oracle inequality on the prediction error measured by $\|\lambda_{\hat\theta} - \lambda\|_T^2$. For the proof of oracle inequalities with a fast rate, one needs a restricted eigenvalue condition on the Gram matrix of the problem (Bickel et al., 2009; Koltchinskii, 2011). One of the weakest assumptions considered in the literature is the Restricted Eigenvalue (RE) condition. In our setting, a natural RE assumption is given in Definition 5 below.
First, we need to introduce some simple notations and definitions.

**Some notations and definitions.** If $a, b$ (resp. $A, B$ and $\mathbf A, \mathbf B$) are vectors (resp. matrices and tensors) of the same size, we always denote by $\langle a, b \rangle$ (resp. $\langle A, B \rangle$ and $\langle \mathbf A, \mathbf B \rangle$) their inner products. For matrices this can be written as $\langle A, B \rangle = \sum_{i,j} A_{i,j} B_{i,j} = \operatorname{tr}(A^\top B)$, where tr stands for the trace, while for (say, three-dimensional) tensors we write similarly $\langle \mathbf A, \mathbf B \rangle = \sum_{i,j,k} \mathbf A_{i,j,k} \mathbf B_{i,j,k}$. We define the Euclidean (Frobenius) norm for matrices and tensors simply as $\|A\|_F = \sqrt{\langle A, A \rangle}$ and $\|\mathbf A\|_F = \sqrt{\langle \mathbf A, \mathbf A \rangle}$. If $W$ (resp. $\mathbf W$) is a matrix (resp. tensor) with positive entries, we introduce the weighted entrywise ℓ1-norm given by $\|A\|_{1,W} = \langle W, |A| \rangle$ (resp. $\|\mathbf A\|_{1,\mathbf W} = \langle \mathbf W, |\mathbf A| \rangle$), where $|A|$ (resp. $|\mathbf A|$) contains the absolute values of the entries of $A$ (resp. $\mathbf A$). If $A$ is a vector, matrix or tensor, then $\|A\|_0$ is the number of non-zero entries of $A$, while $\operatorname{supp}(A)$ stands for the support of $A$ (the indices of its non-zero entries). For another vector, matrix or tensor $A'$ of the same shape, the notation $(A')_{\operatorname{supp}(A)}$ stands for the vector, matrix or tensor with the same coordinates as $A'$, where we put 0 at the indices outside of $\operatorname{supp}(A)$. We also use the notation $u \vee v = \max(u, v)$ for $u, v \in \mathbb R$. If $A = U \Sigma V^\top$ is the SVD of an $m \times n$ matrix $A$, with the columns $u_j$ of $U$ and $v_k$ of $V$ being, respectively, the orthonormal left and right singular vectors of $A$, the projection matrix onto the space spanned by the columns (resp. rows) of $A$ is given by $P_U = U U^\top$ (resp. $P_V = V V^\top$). The operator $\mathcal P_A : \mathbb R^{m \times n} \to \mathbb R^{m \times n}$ given by $\mathcal P_A(B) = P_U B + B P_V - P_U B P_V$ is the projector onto the linear space spanned by the matrices $u_j x^\top$ and $y v_k^\top$ for all $1 \leq j, k \leq \operatorname{rank}(A)$, $x \in \mathbb R^n$ and $y \in \mathbb R^m$. The projector onto the orthogonal space is given by $\mathcal P_A^\perp(B) = (I - P_U)\, B\, (I - P_V)$.

**Definition 5** Fix $\theta = (\mu, \mathbf A)$ where $\mu \in \mathbb R^d$ and $\mathbf A \in \mathbb R_+^{d \times d \times K}$, and define $A = \operatorname{hstack}(\mathbf A)$.
We define the constant $\kappa(\theta) \in (0, +\infty]$ such that, for any $\theta' = (\mu', \mathbf A')$ and $A' = \operatorname{hstack}(\mathbf A')$ satisfying
$$\frac 13 \big\|(\mu')_{\operatorname{supp}(\mu)^\perp}\big\|_{1,\hat w} + \frac 12 \big\|(\mathbf A')_{\operatorname{supp}(\mathbf A)^\perp}\big\|_{1,\hat{\mathbf W}} + \frac 12 \hat\tau \big\|\mathcal P_A^\perp(A')\big\|_* \leq 3 \big\|(\mu')_{\operatorname{supp}(\mu)}\big\|_{1,\hat w} + \frac 32 \big\|(\mathbf A')_{\operatorname{supp}(\mathbf A)}\big\|_{1,\hat{\mathbf W}} + \frac 32 \hat\tau \big\|\mathcal P_A(A')\big\|_*,$$
we have
$$\big\|(\mu')_{\operatorname{supp}(\mu)}\big\|_2 \vee \big\|(\mathbf A')_{\operatorname{supp}(\mathbf A)}\big\|_F \vee \big\|\mathcal P_A(A')\big\|_F \leq \kappa(\theta)\, \|\lambda_{\theta'}\|_T.$$
The constant $1/\kappa(\theta)$ is a restricted eigenvalue depending on the support of θ, which is naturally associated with the problem considered here. Roughly, it requires that for any parameter $\theta'$ whose support is close to that of θ (in the sense that the ℓ1 norms outside the support of θ are dominated by the ℓ1 norms inside it), the $L^2$ norm of the intensity, $\|\lambda_{\theta'}\|_T$, can be compared with the $L^2$ norm of $\theta'$ on the support of θ. Note that for a given θ we simply allow $\kappa(\theta) = +\infty$, so that the restricted eigenvalue is zero whenever the inequality is not met (which makes the statement of Theorem 6 trivial in such a case).

**Theorem 6** Fix $x > 0$, and let $\hat\theta$ be given by (16) and (20) with tuning parameters given by (21), (22) and (23). Then, the inequality
$$\|\lambda_{\hat\theta} - \lambda\|_T^2 \leq \inf_{\theta = (\mu, \mathbf A)} \Big\{ \|\lambda_\theta - \lambda\|_T^2 + 1.25\, \kappa(\theta)^2 \Big( \big\|(\hat w)_{\operatorname{supp}(\mu)}\big\|_2^2 + \big\|(\hat{\mathbf W})_{\operatorname{supp}(\mathbf A)}\big\|_F^2 + \hat\tau^2 \operatorname{rank}(\operatorname{hstack}(\mathbf A)) \Big) \Big\} \tag{24}$$
holds with probability larger than $1 - 70.35\, e^{-x}$. The proof of Theorem 6 is given in Section 8.5 below. Note that no assumption is required on the ground-truth intensity λ of the multivariate counting process $N$ in Theorem 6.
Moreover, if one neglects the technical terms $\ell_\tau(T)$, $\ell_j(T)$ and $L_{j,j',k}(T)$ from Section 4 and keeps only the dominating $L^2$ terms, which are $O(1/T)$ (while the $L^\infty$ terms are $O(1/T^2)$ in the large-T regime), we obtain the following upper bounds, up to numerical constants (denoted ≲), for the terms involved in Theorem 6:
$$\big\|(\hat w)_{\operatorname{supp}(\mu)}\big\|_2^2 \lesssim \|\mu\|_0 \max_{j \in \operatorname{supp}(\mu)} \frac 1T\, \frac{N_j(T)}{T}\, (x + \log d),$$
where $\|\mu\|_0$ stands for the sparsity of μ,
$$\big\|(\hat{\mathbf W})_{\operatorname{supp}(\mathbf A)}\big\|_F^2 \lesssim \|\mathbf A\|_0 \max_{(j,j',k) \in \operatorname{supp}(\mathbf A)} \frac 1T \cdot \frac 1T \int_0^T H_{j,j',k}(t)^2\, dN_j(t)\, (x + \log(Kd^2)),$$
where $\|\mathbf A\|_0$ stands for the sparsity of $\mathbf A$, and finally
$$\hat\tau^2 \operatorname{rank}(\operatorname{hstack}(\mathbf A)) \lesssim \operatorname{rank}(\operatorname{hstack}(\mathbf A))\, \frac 1T\, \lambda_{\max}(\hat V(T)/T)\, (x + \log(2d)).$$
Hence, Theorem 6 proves that $\hat\theta$ achieves an optimal trade-off between approximation and complexity, where the complexity is, roughly, measured by
$$\frac{\|\mu\|_0\, (x + \log d)}{T} \max_j \frac{N_j(T)}{T} + \frac{\|\mathbf A\|_0\, (x + \log(Kd^2))}{T} \max_{j,j',k} \frac 1T \int_0^T H_{j,j',k}(t)^2\, dN_j(t) + \frac{\operatorname{rank}(\operatorname{hstack}(\mathbf A))\, (x + \log(2d))}{T}\, \frac 1T\, \lambda_{\max}(\hat V(T)).$$
Note that typically $K \leq d$, so that $\log(Kd^2) \leq 3 \log d$, which means that $\log(Kd^2)$ scales as $\log d$. The complexity term depends on both the sparsity of $\mathbf A$ and the rank of $\operatorname{hstack}(\mathbf A)$. The rate of convergence has the expected shape $(\log d)/T$, recalling that T is the length of the observation interval of the process, and these terms are balanced by the empirical variance terms coming out of the new concentration results given in Section 3.3 above.

## 6. Numerical experiments

In this section we conduct experiments on synthetic datasets to evaluate the performance of our method, based on the proposed data-driven weighting of the penalizations, compared to unweighted penalizations (Zhou et al., 2013). Throughout this section, we consider the most widely used sum-of-exponentials kernel, defined in Equation (3).

### 6.1. Simulation setting

We generate Hawkes processes using Ogata's thinning algorithm (Ogata, 1981) with d = 30 nodes. The baseline intensities $\mu_j$ are constant on blocks, and we use K = 3 basis kernels $h_{j,j',k}(t) = \alpha_k e^{-\alpha_k t}$ with $\alpha_1 = 0.5$, $\alpha_2 = 2$ and $\alpha_3 = 5$.
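The proximal tools used to fit these simulations (Section 6.2 below) can be sketched generically: a FISTA loop with fixed step 1/L, together with the two proximal operators involved, entrywise soft-thresholding for the weighted ℓ1 penalty and singular-value soft-thresholding for the trace norm. This is our own minimal reimplementation, not the tick code:

```python
import numpy as np

def prox_weighted_l1(v, step, w=1.0):
    """Prox of step * ||w . v||_1: entrywise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - step * w, 0.0)

def prox_trace_norm(M, step):
    """Prox of step * ||M||_*: soft-thresholding of the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - step, 0.0)) @ Vt

def fista(grad, prox, x0, step, n_iter=200):
    """FISTA (Beck and Teboulle, 2009) with a fixed step 1/L:
    minimizes f + g, with f smooth (gradient `grad`) and g proximable (`prox`)."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_new = prox(y - step * grad(y), step)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + (t - 1.0) / t_new * (x_new - x)
        x, t = x_new, t_new
    return x
```

For instance, minimizing $\frac 12 \|x - b\|_2^2 + \|x\|_1$ with this loop recovers the soft-thresholded vector, as expected.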
We consider three examples for the slices $\mathbf A_{\bullet,\bullet,1}$, $\mathbf A_{\bullet,\bullet,2}$ and $\mathbf A_{\bullet,\bullet,3}$ of the adjacency tensor $\mathbf A$, including settings with overlapping boxes and with noisy entries over the block structure, as illustrated in Figure 2. These blocks correspond to overlapping communities reacting at different time scales. The tensor $\mathbf A$ is rescaled so that the operator norm of the matrix $\sum_{k=1}^3 \mathbf A_{\bullet,\bullet,k}$ equals 0.8, which guarantees that the process is stationary. For each simulation, we increase the length of the observation interval over $T = 5000, 7000, 10000, 15000, 20000$ and fit the procedures each time. The results are averaged over 100 separate simulations.

### 6.2. Procedures and metrics

We consider a procedure based on the minimization of the least-squares functional (4). This objective is convex, with a goodness-of-fit term that is gradient-Lipschitz, so we use first-order optimization algorithms based on proximal gradient descent. Namely, we use FISTA (Beck and Teboulle, 2009) for problems with a single penalization on $\mathbf A$ (ℓ1-norm or trace-norm penalization of $\operatorname{hstack}(\mathbf A)$), and GFB (generalized forward-backward, see Pino et al. (1999)) for the mixed ℓ1 penalization of $\mathbf A$ plus trace-norm penalization of $\operatorname{hstack}(\mathbf A)$. For both procedures we choose a fixed gradient step equal to 1/L, where L is the Lipschitz constant of the loss, namely the largest singular value of the Hessian (which is constant for this least-squares functional). We limit our algorithms to 25,000 iterations and stop when the relative decrease of the objective is less than $10^{-10}$ for FISTA and $10^{-7}$ for GFB. We only penalize $\mathbf A$ and consider the following procedures:

- **L1**: unweighted ℓ1 penalization;
- **wL1**: weighted ℓ1 penalization;
- **Nuclear**: unweighted trace-norm penalization;
- **L1Nuclear**: unweighted ℓ1 penalization plus trace-norm penalization;
- **wL1Nuclear**: weighted ℓ1 penalization plus trace-norm penalization.

Note that L1Nuclear is the same as the procedure considered in Zhou et al.
(2013); however, we use a different optimization algorithm, based on proximal gradient descent (a first-order method, which is typically faster than the ADMM-based algorithm proposed in Zhou et al. (2013)). The data-driven weights used in our procedures are the ones derived from our analysis, see (21) and (23), where we simply put x = log T. For each metric, we tune the constant in front of the ℓ1 penalization and the constant in front of the trace-norm penalization in order to obtain the best possible metric for each procedure, on average over all separate simulations. Namely, there is no test set: we simply display the best metrics obtained by each procedure, for a fair comparison.

Figure 2: Ground truth vector µ and tensor A (slices (a_{i,j,k}) for k = 1, 2, 3) in dimension 30. Each row corresponds to a different example used in our experiments.

All experiments are done using our tick library for Python 3, see Bacry et al. (2018); its GitHub page is https://github.com/X-DataInitiative/tick and its documentation is available at https://x-datainitiative.github.io/tick/. The following metrics are considered in order to assess the procedures.
- Estimation error: the relative ℓ2 estimation error of A, given by ‖Â − A‖₂² / ‖A‖₂².
- AUC: we compute the AUC (area under the ROC curve) between the binarized ground truth tensor A and the solution Â with entries scaled in [0, 1]. This allows us to quantify the ability of the procedure to detect the support of the connectivity structure between nodes.
- Kendall: we compute Kendall's tau-b between all entries of the ground truth tensor A and the solution Â.
This correlation coefficient takes values between −1 and 1 and compares the number of concordant and discordant pairs. This allows us to quantify the ability of the procedure to correctly rank the intensity of the connectivity between nodes.

6.3. Results

In Figure 3 we observe, on an instance of the problem, the strong improvements of wL1 and wL1Nuclear over L1, Nuclear and L1Nuclear respectively. We observe in particular that a sharp tuning of the penalizations, using data-driven weights, leads to a much smaller number of false positives outside the node communities (better viewed on a computer screen). In Figure 4, we compare all the procedures in terms of estimation error, AUC and Kendall coefficient, and confirm that weighted penalizations systematically lead to an improvement over unweighted L1, Nuclear and L1Nuclear.

6.4. A comparison of the least-squares and likelihood functionals

This paper considers, mostly for theoretical reasons, least squares as a goodness-of-fit for the Hawkes process. However, estimation in this model is usually achieved by minimizing the goodness-of-fit given by the negative log-likelihood. In what follows, we provide some numerical insights in order to compare both approaches objectively. First, for both functionals one can precompute some weights in order to accelerate future gradient and value computations. In both cases, the precomputations have similar complexities, unless the number of kernels K is large (see Table 1 below). However, given such precomputations, a remarkable property of least squares versus the log-likelihood is that the value and gradient computations do not depend on the total number of observed events (denoted n): the complexity is O(K²d³) for least squares, while it is O(nKd) for the log-likelihood, which means that such computations for least squares can be orders of magnitude faster whenever n ≫ Kd², which is the case in the setting considered in our experiments.
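The condition n ≫ Kd² can be made concrete with a back-of-the-envelope computation, using the dimensions of our experiments (the script below only illustrates the orders of magnitude implied by the two complexity bounds):

```python
# Per-evaluation costs quoted above, once the weights are precomputed:
# O(K^2 d^3) for least squares versus O(n K d) for the log-likelihood.
d, K, n = 30, 3, 500_000  # dimensions and event count from the experiments

least_squares_cost = K**2 * d**3   # independent of the number of events n
log_likelihood_cost = n * K * d    # grows linearly with n

print(least_squares_cost)          # 243000
print(log_likelihood_cost)         # 45000000
print(log_likelihood_cost // least_squares_cost)  # ~185x in favor of least squares
```

With d = 30 and K = 3, the break-even point Kd² is 2,700 events, so the n ≈ 500,000 events of the largest experiments put least squares far into its favorable regime.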
For instance, the experiments used to produce Figures 3 and 4 for T = 20,000 involve about n ≈ 500,000 events, with d = 30 and K = 3. Note, however, that the least-squares approach considered here does not scale well with d, because of its O(d³) complexity; we recommend using the negative log-likelihood instead whenever d is large (larger than 1000, say). The complexity of each operation is described in Table 1 below, and a numerical illustration is displayed in Figure 5, which confirms that computations with least squares are orders of magnitude faster than with the log-likelihood in the considered setting. We do not provide proofs for these complexities, since they follow from straightforward arguments; details can be found in Chapter 2 of Bompaire (2018). Another important point is related to smoothness: the negative log-likelihood does not satisfy the gradient-Lipschitz assumption, while this property is required by most first-order optimization algorithms to obtain convergence guarantees and an easy tuning of the step size used in gradient descent. Therefore, convergence for the negative log-likelihood can be very unstable, while on the contrary least squares is gradient-Lipschitz and easy to optimize, since it is a quadratic function.

Figure 3: Ground truth tensor A and tensors recovered by all procedures (slices (a_{i,j,k}) for k = 1, 2, 3). We observe that wL1 and wL1Nuclear lead to a much better support recovery, since we observe fewer false positives outside of the node communities.

Note that Bompaire et al. (2018) propose an alternative approach based on duality, in particular for the negative log-likelihood of the Hawkes
process. Herein one can observe the strong instability of standard first-order algorithms (such as the one considered here) for the negative log-likelihood.

Figure 4: Average metrics achieved by all procedures on the three considered examples of A (in the same order as the display from Figure 2), with 95% confidence bands, for increasing observation length T over repeated simulations. Weighted penalizations systematically lead to improvements over the L1, Nuclear and L1Nuclear penalizations.

                 precomputation   memory      value       gradient
Least squares    O(nK²d)          O(K²d³)     O(K²d³)     O(K²d³)
Likelihood       O(nKd)           O(nKd)      O(nKd)      O(nKd)

Table 1: From left to right: weights precomputation complexity, memory storage, value and gradient complexity for both functionals. Note that for least squares, the complexity of the value and the gradient with precomputed weights is independent of the number of events n.

Figure 5: Average time needed for weights precomputation (left) and value computation (right), with 95% confidence bands, for least squares and log-likelihood with precomputations, over repeated simulations. We observe that value computations are orders of magnitude faster for least squares (the y-scale is logarithmic on the right-hand side) and constant with increasing observation length, while strongly increasing for the log-likelihood.
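A one-dimensional caricature helps to see why a fixed step 1/L works for least squares but not for the log-likelihood. Assuming a constant intensity λ on [0, T] with N observed events (toy numbers below, our own sketch rather than the multivariate objectives of the paper), the least-squares objective λ²T − 2λN is quadratic with curvature L = 2T, so a single fixed step of size 1/L lands exactly on the minimizer N/T, while the negative log-likelihood λT − N log λ has curvature N/λ², which blows up near 0, so the same fixed-step scheme needs many iterations:

```python
T, N = 100.0, 230  # hypothetical observation length and event count; optimum is N/T = 2.3

# Least squares: f(lam) = lam^2 * T - 2 * lam * N, gradient 2*T*lam - 2*N, curvature L = 2*T.
lam = 0.1
lam -= (1 / (2 * T)) * (2 * T * lam - 2 * N)  # one fixed-step gradient step
print(round(lam, 6))  # 2.3: a 1D quadratic is minimized in a single step of size 1/L

# Negative log-likelihood: g(lam) = lam * T - N * log(lam), gradient T - N / lam.
# There is no global Lipschitz constant for this gradient; with the same
# fixed step 1/(2*T) the first iterate overshoots far past the optimum.
lam = 0.1
steps = 0
while abs(lam - N / T) > 1e-6:
    lam -= (1 / (2 * T)) * (T - N / lam)
    steps += 1
print(steps)  # dozens of iterations instead of one
```

This is of course a caricature, but it mirrors the behavior reported in Figure 6: with a step calibrated on a Lipschitz constant that does not exist globally, the log-likelihood iterations are far less efficient.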
In Figure 6 below, we compare the performance of ISTA and FISTA, with linesearch for automatic step-size tuning, both for least squares and the negative log-likelihood. This figure confirms that the number of iterations required for least squares is much smaller than for the negative log-likelihood. The gap is even stronger if we look at computation times, since each iteration is computationally faster with least squares, and even more so when the observation length increases. In this Section, we compared least squares and log-likelihood for the Hawkes process from a computational perspective only, and concluded that least squares is typically orders of magnitude faster. Now, let us compare the statistical performance of both approaches on the same simulation setting as before, with T = 20,000, using the metrics defined above, namely estimation error, AUC and Kendall. For this we simply use ℓ1 penalization on A, with a strength parameter tuned for each metric and for each goodness-of-fit. In Figure 7, we observe that both functionals roughly achieve the same performance measured by the Kendall coefficient, but that the negative log-likelihood achieves a slightly better AUC and estimation error than least squares, at a stronger computational cost. The slightly better statistical performance of maximum likelihood is not surprising, since vanilla maximum likelihood is known to be asymptotically statistically efficient for Hawkes processes, see Ogata (1978), while, to the best of our knowledge, the vanilla least-squares estimator is not. This leads to the conclusion that least squares is a very good alternative to maximum likelihood when dealing with a large number of events: statistical accuracy is only slightly deteriorated, but the computational cost is orders of magnitude smaller, and convergence is much more stable. In Figure 8, we observe the performance achieved by ℓ1 versus weighted-ℓ1 for the estimators based on the log-likelihood functional.
The point here is that we use the weights Ŵ from Equation (23), which are derived for the least-squares functional. We observe that these data-driven weights nevertheless strongly improve over the vanilla ℓ1-penalization for the negative log-likelihood estimator as well.

Figure 6: Convergence speed (distance to optimum versus iterations and versus time) of the least-squares and likelihood losses with the ISTA and FISTA optimization algorithms, on two simulations of a Hawkes process with parameters from Figure 2, with observation length T = 1000 (22,616 events, top) and T = 5000 (117,058 events, bottom). Once again, we observe that computations are much faster with least squares, in particular for a large observation length.

Figure 7: Metrics achieved by the least-squares and log-likelihood estimators after precomputations. We observe that the log-likelihood achieves a slightly better AUC and estimation error, but at a stronger computational cost (x-axes are on a logarithmic scale).

This behavior is actually expected, since both functionals are close to each other; the least-squares functional can even be understood as an approximation of the negative log-likelihood one, see Bacry et al. (2016a).
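This closeness can be checked on the simplest toy case of a constant intensity λ on [0, T] (our own sketch, not the functionals of the paper): the least-squares contrast λ²T − 2λN(T) and the negative log-likelihood λT − N(T) log λ are both minimized at the empirical rate N(T)/T, so the two fits agree where it matters:

```python
import math

T, n_events = 200.0, 500  # hypothetical observation length and event count

def least_squares(lam):
    # integral of lam^2 dt minus twice the sum of lam over the events
    return lam**2 * T - 2 * lam * n_events

def neg_log_lik(lam):
    # integral of lam dt minus the sum of log(lam) over the events
    return lam * T - n_events * math.log(lam)

grid = [i / 100 for i in range(1, 1000)]
ls_argmin = min(grid, key=least_squares)
ll_argmin = min(grid, key=neg_log_lik)
print(ls_argmin, ll_argmin)  # both 2.5, i.e. n_events / T
```

In the multivariate Hawkes case the minimizers no longer coincide exactly, but the same mechanism explains why weights calibrated for one functional remain useful for the other.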
Figure 8: Performance of ℓ1 versus weighted-ℓ1 for estimators based on the negative log-likelihood functional, where the data-driven weights used in the ℓ1 penalization are the ones derived for the least-squares functional. We observe that these weights significantly improve the performance of ℓ1-penalized estimators based on the log-likelihood functional, for all the considered metrics. This is expected, since both functionals are actually close to each other.

6.5. Sensitivity to the penalization level and weights

In Figure 9, we display the values of the metrics as a function of the penalization level, both for unweighted and weighted ℓ1 penalization. We observe that the weighted ℓ1-penalization is more sensitive to the penalization level than its unweighted counterpart, but still leads to much better performance even when the penalization level is not perfectly tuned. In Figure 10 we display the weights Ŵ from Equation (23) used in the weighted-ℓ1 penalization, for a single simulation from the first setting (corresponding to the tensor A displayed in the first row of Figure 2). We observe that these weights are far from uniform, and effectively induce a strongly varying scaling across kernels k = 1, 2, 3 and between nodes. Although this display is hard to interpret, it can be better understood when looked at together with the first row of Figure 2: we observe a similar-looking block structure, which means that these weights scale the penalization level roughly following the block structure of the adjacency tensor A and the intensity of the baseline vector µ.

7. Conclusion

In this paper we proposed a careful analysis of the generalization error of the multivariate Hawkes process. Our theoretical analysis required a new concentration inequality for matrix martingales in continuous time, with an observable variance term, which is a result of independent interest.
This analysis led to a new data-driven tuning of sparsity-inducing penalizations, which we assessed on numerical examples. Future works will focus on other theoretical results for non-convex matrix-factorization techniques applied to this problem.

Figure 9: Sensitivity of the metrics (top: AUC, middle: estimation error, bottom: Kendall) with respect to the penalization level, both for unweighted (left-hand side) and weighted (right-hand side) ℓ1 penalizations. The weighted ℓ1-penalization is more sensitive to the penalization level than its unweighted counterpart, but leads to much better performance even when not perfectly tuned.

Figure 10: Visualization of the weights used in the weighted-ℓ1 penalization for a single simulation from the first setting (corresponding to the tensor A displayed in the first row of Figure 2). This corresponds to the weights from Equation (23), namely Ŵ·,·,1 (left), Ŵ·,·,2 (middle) and Ŵ·,·,3 (right).

8. Proofs

This Section contains the proofs of all the results given in the paper. First, we prove the statements concerned with deviation inequalities, namely Theorems 1 and 3, Proposition 2 and Theorem 4. Then, we give the proof of Theorem 6, concerning the oracle inequality for the procedure.

8.1. Proof of Theorem 1

In Bacry et al. (2016b), a deviation inequality is proven in a slightly more general setting than the one considered in this paper. There are mainly two differences. This paper considers only counting processes with uniform jumps of size 1, whereas in Bacry et al. (2016b) jump sizes are controlled by a predictable process J. Therefore, it suffices to set J = 1 and Cs = 1 in Equations (2) and (3) of Bacry et al.
(2016b), where 1 stands for the all-ones matrices with relevant shapes. In Bacry et al. (2016b), the deviation inequality is proved in a general context where no symmetry is assumed on Ts. It forces to consider a symmetric version of W T(s) as in Eq. (9) increasing the dimension of the working space by a factor of 2, which leads to less precise deviation inequality. In this paper we consider both cases, symmetric and non symmetric, in order to obtain slightly better constants (see the definition of Km,n). With those two differences in mind, following carefully the proof of the concentration inequality in Bacry et al. (2016b) (see the beginning of Appendix B.1 herein) one gets P λmax(S (Zt)) φ ξJmax Cs max( Ts op; , T s op; )b 1 J2max Cs 2 max( Ts 2op; , T s 2op; ) W sds + x b T(t) b (m + n)e x, where ξ (0, 3) and λmax(S (Zt)) = Z op (see the beginning of Appendix B.1 in Bacry et al. (2016b)). Setting J = 1, C = 1 and taking care of the symmetric case at the same time as the non symmetric one, one gets: φ ξ max( Ts op; , T s op; )b 1 max( Ts 2op; , T s 2op; ) W sds + x b T(t) b Km,ne x, using the definitions Km,n and W s introduced previously (depending on the symmetric properties of the tensor Ts). Let us note that on {b T(t) b} one has max( Ts op; , T s op; )b 1 1 for any s [0, t]. Thus, since φ(xh) h2φ(x) for any h [0, 1] and x > 0, one gets ξb2 λmax Z t 0 W sds + x ξ , b T(t) b Km,ne x P Zt op φ(ξ) ξb λmax(V t) + xb ξ , b T(t) b Km,ne x which proves the first part of the Theorem. The second part (i.e., Inequality (13)) can be obtained following some standard tricks (see e.g. Massart (2007)): (i) on (0, 3), φ(ξ) ξ2 2(1 ξ/3) and SPARSE AND LOW-RANK MULTIVARIATE HAWKES PROCESSES (ii) minξ (0,1/c) aξ 1 cξ + x ξ = 2 ax + cx for any a, c, x > 0. Thus applying (i) leads to P Zt op ξ 2b(1 ξ/3)λmax(V t) + xb ξ , b T(t) b Km,ne x or equivalently P Zt op ξ 2b(1 ξ/3)v + xb ξ , λmax(V t) v, b T(t) b Km,ne x. 
Then, optimizing over ξ using (ii) with c = 1/3 and a = v/(2b²), one gets
\[ \mathbb{P}\Big[ \|Z_t\|_{\mathrm{op}} \ge \sqrt{2vx} + \frac{bx}{3},\ \lambda_{\max}(V_t) \le v,\ b_T(t) \le b \Big] \le K_{m,n} e^{-x}, \]
which concludes the proof of Theorem 1.

8.2. Proof of Proposition 2

This Proposition provides a deviation between λmax(V(t)) and λmax(V̂(t)). Let us notice that it is a generalization, to arbitrary matrices of dimensions m × n, of an analogous inequality originally proven by Hansen et al. (2012) for scalar martingales (i.e., in dimension 1). The proof below follows the same lines as these authors. It is based on the observation that the difference V_T(t) − V̂_T(t) can be written as a martingale Z_H(t):
\[ V_T(t) - \hat V_T(t) = Z_H(t) = \int_0^t H_s \, dM_s, \quad \text{where} \quad H_s = T_s^2 \tag{25} \]
when T_s is symmetric, while
\[ H_s = \begin{bmatrix} T_s T_s^\top & 0 \\ 0 & T_s^\top T_s \end{bmatrix} \]
if T_s is not symmetric. Then, applying Eq. (12) of Theorem 1 to the martingale Z_H(t) (we are in the symmetric case of the Theorem since H_s^⊤ = H_s), one gets
\[ \mathbb{P}\Big[ \|Z_H(t)\|_{\mathrm{op}} \ge \frac{\varphi(\xi)}{\xi b} \lambda_{\max}(V_H(t)) + \frac{xb}{\xi},\ b_H(t) \le b \Big] \le K_{m,n} e^{-x}, \tag{27} \]
where
\[ V_H(t) = \int_0^t H_s^2 \, \lambda_s \, ds. \tag{28} \]
Since ‖Z_H(t)‖op ≥ λmax(V_T(t)) − λmax(V̂_T(t)), this gives
\[ \mathbb{P}\Big[ \lambda_{\max}(V_T(t)) \ge \lambda_{\max}(\hat V_T(t)) + \frac{\varphi(\xi)}{\xi b} \lambda_{\max}(V_H(t)) + \frac{xb}{\xi},\ b_H(t) \le b \Big] \le K_{m,n} e^{-x}. \tag{29} \]
One can first notice that, from the definitions of H and b_T(t), one has b_H(t) ≤ b_T²(t). Moreover, since T_s T_s^⊤ ⪯ b_T²(s) I_m and T_s^⊤ T_s ⪯ b_T²(s) I_n for all s, we have from Eq. (28)
\[ V_H(t) \preceq b_T^2(t) \, V_T(t) \tag{30} \]
and therefore λmax(V_H(t)) ≤ b_T²(t) λmax(V_T(t)). Inequality (29) then gives
\[ \mathbb{P}\Big[ \lambda_{\max}(V_T(t)) \ge \lambda_{\max}(\hat V_T(t)) + \frac{\varphi(\xi)}{\xi} \lambda_{\max}(V_T(t)) + \frac{xb^2}{\xi},\ b_T(t) \le b \Big] \le K_{m,n} e^{-x}, \tag{31} \]
hence
\[ \mathbb{P}\Big[ \lambda_{\max}(V_T(t)) \ge \frac{\xi \, \lambda_{\max}(\hat V_T(t))}{\xi - \varphi(\xi)} + \frac{xb^2}{\xi - \varphi(\xi)},\ b_T(t) \le b \Big] \le K_{m,n} e^{-x}, \tag{32} \]
which proves the first inequality stated in Proposition 2. Now, an easy computation proves that the choice ξ = −W₋₁(−(2/3)e^{−2/3}) − 2/3 ≈ 0.762, where W₋₁ is the lower branch of the Lambert W function, provides the second desired inequality.

8.3. Proof of Theorem 3

Introduce the set Et = {λmax(V_T(t)) ≤ 2λmax(V̂_T(t)) + 2.62 b²x}. We know from Proposition 2 that P[Et^c, b_T(t) ≤ b] ≤ K_{m,n} e^{−x}.
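The numerical value ξ ≈ 0.762 used in the proof of Proposition 2 can be double-checked: unwinding the Lambert-W expression, ξ solves the fixed-point equation ξ = 2φ(ξ) with φ(ξ) = e^ξ − ξ − 1, which makes ξ/(ξ − φ(ξ)) = 2 and 1/(ξ − φ(ξ)) = 2/ξ ≈ 2.62, the constants appearing in the set E_t of Theorem 3. A short bisection (our own sanity check, equivalent to evaluating the Lambert-W formula):

```python
import math

def phi(x):
    # phi(x) = exp(x) - x - 1, the function used in Theorem 1
    return math.exp(x) - x - 1

# Solve xi = 2 * phi(xi) on (0, 3) by bisection on g(xi) = 2*phi(xi) - xi,
# which changes sign on [0.5, 1].
lo, hi = 0.5, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if 2 * phi(mid) - mid < 0:
        lo = mid
    else:
        hi = mid
xi = 0.5 * (lo + hi)

print(round(xi, 4))                   # 0.7627, reported as 0.762 in the text
print(round(xi / (xi - phi(xi)), 3))  # 2.0, the factor in front of lambda_max
print(round(1 / (xi - phi(xi)), 2))   # 2.62, the factor in front of b^2 x
```

Since ξ = 2φ(ξ) at the solution, ξ − φ(ξ) = ξ/2, so the two constants are exactly 2 and 2/ξ.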
Now, on the set Et {λmax( b V T(t)) v} {b T(t) b} ξb λmax(V (t)) + xb ξ + 2.62φ(3) for any ξ (0, 3), since ξ 7 φ(ξ)/ξ is increasing. Using again points (i) and (ii) from Section 8.1 proves that the minimum for ξ (0, 3) of the right hand size of this last inequality is equal to 2 vx + 2.62φ(3) + 1 3 xb 2 vx + cxb with c = 14.39. Now, the conclusion easily follows from the following decomposition: P ZT(t) op 2 vx + cbx, λmax( b V T(t)) v, b T(t) b P[E t , b T(t) b] + P ZT(t) op 2 vx + cbx, Et, λmax( b V T(t)) v, b T(t) b Km,ne x + P Zt op ξ 2b(1 ξ/3)λmax(V t) + xb ξ , b T(t) b where we used Equation (12) from Theorem 1 in the last inequality. 8.4. Proof of Theorem 4 In order to prove this theorem, we are going to use peeling arguments. For any ϵ > 0 and z > 0 we define the interval Iz,ε = [z, z(1 + ε)]. SPARSE AND LOW-RANK MULTIVARIATE HAWKES PROCESSES Let, v0, b0, ϵ > 0 and let us define vj = v0(1 + ε)j, bj = b0(1 + ε)j. Let us define also the events V 1 = {λmax( b V T(t)) v0}, B 1 = {b T(t) b0}, and Vj = {λmax( b V T(t)) Ivj,ε}, Bj = {b T(t) Ibj,ε} for any j N. We set v0 = w0x, then, from Equation (14), one gets successively P ZT(t) op x 2 w0 + cb0 , V 1 B 1 P ZT(t) op x 2 w0 + c(1 + ε)b T(t) , V 1 Bj P ZT(t) op 2 q λmax( b V T(t))(1 + ε)x + cxb0, Vi B 1 P ZT(t) op 2 q λmax( b V T(t))(1 + ε)x + c(1 + ε)xb T(t), Vi Bj for all i, j 0. If one denotes A = 2 w0/c + b0, previous inequalities entail, for any i, j 1: P ZT(t) op 2 q λmax( b V T(t))(1 + ε)x + c(1 + ε)(A + b T(t))x, Vi Bj 2Km,ne x. (33) Let α > 0 and define ℓx(t) = α log log λmax( b V T(t)) w0x (1 + ϵ)2 (1 + ϵ) + α log log b T(t) b0 (1 + ϵ)2 (1 + ϵ) . (34) Since, i, j 1, λmax( b V T(t)) xw0(1 + ε)i(1 δ 1,i) and b T(t) b0(1 + ε)j(1 δ 1,j) on Vi Bj, then one has ℓx(t) ℓi,j = log (i + 2)α(j + 2)α(log(1 + ϵ))2α on Vi Bj for any i, j 1. 
Then making the change of variable x x + ℓi,j in (33) gives P ZT(t) op 2 q λmax( b V T(t))(1 + ε)(x + ℓi,j) + c(1 + ε)(A + b T(t))(x + ℓi,j), Vi Bj 2Km,ne xe ℓi,j P ZT(t) op 2 q λmax( b V T(t))(1 + ε)(x + ℓx(t)) + c(1 + ε)(x + ℓx(t))(A + b T(t)), Vi Bj 2Km,n log(1 + ε) 2αe x (i + 2)(j + 2) α for any i, j 1. Since the whole probability space can be partitioned as S i,j 1 Vi Bj, one has finally P ZT(t) op 2 q λmax( b V T(t))(1 + ε)(x + ℓx(t)) + c(1 + ε)(x + ℓx(t))(A + b T(t)) i,j= 1 P ZT(t) op 2 q λmax( b V T(t))(1 + ε)(x + ℓx(t)) + c(1 + ε)(x + ℓx(t))(A + b T(t)), Vi Bj 2Km,n log(1 + ε) 2α X i=1 i α 2e x. BACRY, BOMPAIRE, GA IFFAS AND MUZY Finally, choosing ϵ = b0 = w0 = 1 and α = 2 leads to Equation (15) and concludes the proof of the Theorem. 8.5. Proof of Theorem 6 If A, B are vectors, matrices or tensors of matching dimensions, we denote by A B their entrywise product (Hadamard product). We recall also that Aj, the j-th row of a matrix A and recall that A ,2 = maxj Aj, 2. The proof is based on the proof of a sharp oracle inequality for trace norm penalization, see Koltchinskii et al. (2011) and Koltchinskii (2011). We endow the space Rd Rd d K with the inner product θ, θ = µ, µ + A, A , where θ = (µ, A) and θ = (µ , A ) with µ, µ = µ µ and 1 j,j d 1 k K Aj,j ,k A j,j ,k. We denote for short aj,j ,k = Aj,j ,k. For any θ, one has RT (ˆθ), ˆθ θ = 2 X 1 j d (ˆµj µj) RT (ˆθ) 1 j,j d 1 k K (ˆaj,j ,k aj,j ,k) RT (ˆθ) Let us recall that Hj,j ,k(t) = R (0,t) hj,j ,k(t s)d Nj (s). Since µj = 1 and λj,θ(t) aj,j ,k = Hj,j ,k(t), we have that the derivatives of the empirical risk are given by 0 λj,θ(t)dt Z T aj,j ,k = 2 0 Hj,j ,k(t)λj,θ(t)dt Z T 0 Hj,j ,k(t)d Nj(t) . It leads to RT (ˆθ), ˆθ θ = 2 0 (λj,ˆθ(t) d Nj(t))(ˆµj µj) 1 j,j d 1 k K 0 Hj,j ,k(t)(λj,ˆθ(t) d Nj(t))(ˆaj,j ,k aj,j ,k) 0 (λj,ˆθ(t) λj,θ(t))(λj,ˆθ(t)dt d Nj(t)). Let us remind that Mj(t) = Nj(t) R t 0 λj(s)ds are martingales coming from the Doob-Meyer decomposition, so that d Mj(t) = d Nj(t) λj(t)dt. 
So, recalling that [0,T ] fj(t)gj(t)dt, SPARSE AND LOW-RANK MULTIVARIATE HAWKES PROCESSES we obtain the decomposition RT (ˆθ), ˆθ θ = 2 λˆθ λθ, λˆθ λ T 2 0 (λj,ˆθ(t) λj,θ(t))d Mj(t). Namely, we end up with 2 λˆθ λθ, λˆθ λ T = RT (ˆθ), ˆθ θ + 2 0 (λj,ˆθ(t) λj,θ(t))d Mj(t). (35) The parallelogram identity gives 2 λˆθ λθ, λˆθ λ T = λˆθ λ 2 T + λˆθ λθ 2 T λθ λ 2 T , where we put f 2 T = f, f T . Let us point out that, in the case λˆθ λθ, λˆθ λ T < 0, one obtains λˆθ λ 2 T λθ λ 2 T , which directly implies the inequality of the Theorem. Thus, from now on, let us assume that λˆθ λθ, λˆθ λ T 0. (36) The first order condition for ˆθ argminθ{RT (θ) + pen(θ)} gives RT (ˆθ) pen(ˆθ). Let ˆθ = RT (ˆθ). Since the subdifferential is a monotone mapping, we have ˆθ θ, ˆθ θ 0 for any θ pen(θ). Thus from (35), one gets θ pen(θ), 2 λˆθ λθ, λˆθ λ T θ , ˆθ θ + 2 0 (λj,ˆθ(t) λj,θ(t))d Mj(t). (37) We need now to characterize the structure of the subdifferentials involved in pen(θ), to describe θ . If g1(µ) = Pd j=1 ˆwj|µj|, for ˆwj 0, we have g1(µ) = n ˆw sign(µ) + ˆw f : f 1, µ f = 0 o . (38) If g2(A) = P 1 j,j d,1 k K ˆWj,j ,k|Aj,j ,k|, for ˆWj,j ,k 0, we have g2(A) = n ˆW sign(A) + ˆW F : F 1, A F = 0 o . (39) Now let A = hstack(A) and ˆA = hstack(ˆA). Let us recall that if A = UΣV is the SVD of A, we have PA(B) = P UB + BP V P UBP V and P A(B) = (I P U)B(I P V ) (projection onto the column and row space of A and projection onto its orthogonal space). Now, for g3(A) = ˆτ A , we have g3(A) = n ˆτUV + ˆτP A(F ) : F op 1 o , (40) see for instance (Lewis, 1995). Now, write θ , ˆθ θ = µ , ˆµ µ A ,1, ˆA A A , , ˆA A BACRY, BOMPAIRE, GA IFFAS AND MUZY with µ g1(µ), A ,1 g2(A) and A , g3(A). 
Using Equation (38), (39) and (40), we can write θ , ˆθ θ = ˆw sign(µ), ˆµ µ ˆw f, ˆµ µ ˆW sign(A), ˆA A ˆW F1, ˆA A ˆτ UV , ˆA A ˆτ F , P A( ˆA A) , where by duality between the norms 1 and , and between and op, we can choose f, F1 and F such that ˆw f, ˆµ µ = (ˆµ µ)supp(µ) 1, ˆ w, ˆW F1, ˆA A = (ˆA A)supp(A) 1, ˆW and F , P A( ˆA A) = P A( ˆA A) , which leads to θ , ˆθ θ (ˆµ µ)supp(µ) 1, ˆ w (ˆµ µ)supp(µ) 1, ˆ w + (ˆA A)supp(A) 1, ˆW (ˆA A)supp(A) 1, ˆW + ˆτ PA( ˆA A) ˆτ P A( ˆA A) . Now, we decompose the noise term of (37): 0 (λj,ˆθ(t) λj,θ(t))d Mj(t) j=1 (ˆµj µj) Z T 0 d Mj(t) + 2 1 j,j d 1 k K (ˆaj,j ,k aj,j ,k) Z T 0 Hj,j ,k(t)d Mj(t) T ˆµ µ, M(T) + 2 T ˆA A, Z(T) , where M(T) = [M1(T) Md(T)] and where Z(T) is the d d K tensor with entries Zj,j ,k(T) = Z T 0 Hj,j ,k(t)d Mj(t). Recall that hstack is the horizontally stacking operator defined by (19). The following upper bounds | ˆµ µ, M(T) | j=1 |ˆµj µj||Mj(T)| | ˆA A, Z(T) | X 1 j,j d 1 k K |ˆAj,j ,k Aj,j ,k||Zj,j ,k(T)| | ˆA A, Z(T) | = hstack(ˆA A), hstack(Z(T)) hstack(Z(T)) op hstack(ˆA A) , entail that we need to upper bound the three terms |Mj(T)|, |Zj,j ,k(T)| and hstack(Z(T)) op by data-driven quantities. Let us start with hstack(Z(T)) op. Denote for short Z(t) = hstack(Z(t)) and H(t) = hstack(H(t)) where H(t) is defined by (18). We note that 0 diag(d M(s))H(s), SPARSE AND LOW-RANK MULTIVARIATE HAWKES PROCESSES (Z(t))j,j +(k 1)d = Z t 0 (H(t s))j,j ,kd Mj(s) for any 1 j, j d and 1 k K. We need the following corollary. Corollary 7 The following deviation inequality holds P Z(t) op 2 q λmax( b V (t))(x + log(2d) + ℓ(t)) + 14.39(x + log(2d) + ℓ(t))(1 + sup 0 s t H(s) ,2) 23.45e x, (41) λmax( b V (t)) = λmax Z t 0 H (s)H(s) diag(d N(s)) _ max j=1,...,d 0 Hj, (s) 2 2d Nj(s), ℓ(t) = 2 log log 4λmax( b V (t)) x 2 + 2 log log 4 sup 0 s t H(s) ,2 2 . The proof of Corollary 7 is given in Section 8.6 below. 
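The projectors P_A and P_A^⊥ used in the proof above decompose any matrix B exactly, since P_A(B) + P_A^⊥(B) = B. A small numerical sanity check of this decomposition (our own, using NumPy with an arbitrary rank-2 matrix A):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 5))  # arbitrary rank-2 matrix
B = rng.normal(size=(5, 5))                            # arbitrary test matrix

# SVD of A; keep only the non-negligible singular directions.
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))
PU = U[:, :r] @ U[:, :r].T   # projector onto the column space of A
PV = Vt[:r].T @ Vt[:r]       # projector onto the row space of A

P_A = PU @ B + B @ PV - PU @ B @ PV                 # P_A(B)
P_A_perp = (np.eye(5) - PU) @ B @ (np.eye(5) - PV)  # P_A^perp(B)

# The two projections decompose B exactly.
print(np.allclose(P_A + P_A_perp, B))  # True
```

Expanding P_A^⊥(B) = B − P_U B − B P_V + P_U B P_V makes the identity immediate, and it is what allows the noise term to be split into a low-rank part and its orthogonal complement in the proof.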
Corollary 7 proves that 1 T Z(t) op ˆτ 2 holds with probability 1 23.45e x, with λmax( b V (T)/T)(x + log(2d) + ℓ(T)) + 28.78x + log(2d) + ℓ(T))(1 + sup0 t T H(t) ,2) which leads to the choice of ˆτ given in Section 4. This entails that, on an event of probability larger than 1 23.45e x, we have 1 T | ˆA A, Z(T) | ˆτ 2 hstack(ˆA A) . Using again Corollary 7 with H(t) 1 (constant number equal to 1) and M = Mj gives that 1 T |Mj(T)| ˆ wj 3 for all j = 1, . . . , d with probability 1 23.45e x with (Nj(T)/T)(x + log d + ℓj(T)) T + 86.34x + log d + ℓj(T) with ℓj(T) = 2 log log( 4Nj(T ) x 2) + 2 log log 4. This entails that, on an event of probability larger than 1 23.45e x, we have 2 T | ˆµ µ, M(T) | 2 3 ˆµ µ 1, ˆ w. Using a last time Corollary 7 with H(t) = Hj,j ,k(t) and M = Mj gives 1 T |Zj,j ,k(T)| ˆWj,j ,k 2 uniformly for j, j , k for ˆWj,j ,k = 4 1 T R T 0 Hj,j ,k(t)2d Nj(t)(x + log(Kd2) + Lj,j ,k(T)) + 28.78(x + log(Kd2) + Lj,j ,k(T))(1 + sup0 t T |Hj,j ,k(t)|) BACRY, BOMPAIRE, GA IFFAS AND MUZY Lj,j ,k(T) = 2 log log 4 R T 0 Hj,j ,k(t)2d Nj(t) x 2 + 2 log log 4 sup 0 t T |Hj,j ,k(t)| 2 , which entails that on an event of probability larger than 1 23.45e x, we have 1 T | ˆA A, Z(T) | 1 2 ˆA A 1, ˆW. This entails that, with a probability larger than 1 3 23.45e x, one has 0 θ , ˆθ θ + 2 0 (λj,ˆθ(t) λj,θ(t))d Mj(t) 3 (ˆµ µ)supp(µ) 1, ˆ w 1 3 (ˆµ µ)supp(µ) 1, ˆ w 2 (ˆA A)supp(A) 1, ˆW 1 2 (ˆA A)supp(A) 1, ˆW 2 ˆτ PA( ˆA A) 1 2 ˆτ P A( ˆA A) , where we recall once again that A = hstack(A) and ˆA = hstack(ˆA). This matches the constraint of Definition 5 with µ = ˆµ µ and A = ˆA A, so that it entails (ˆµ µ)supp(µ) 2 (ˆA A)supp(A) F PA( ˆA A) F κ(θ) λˆθ λθ T . 
(42) Putting all this together gives θ ,ˆθ θ + 2 T ˆµ µ, M(T) + 2 T ˆA A, Z(T) 3 (ˆµ µ)supp(µ) 1, ˆ w 1 3 (ˆµ µ)supp(µ) 1, ˆ w 2 (ˆA A)supp(A) 1, ˆW 1 2 (ˆA A)supp(A) 1, ˆW 2 ˆτ PA( ˆA A) 1 2 ˆτ P A( ˆA A) 3 ( ˆw)supp(µ) 2 (ˆµ µ)supp(µ) 2 + 3 2 ( ˆW)supp(A) F (ˆA A)supp(A) F rank(A) PA( ˆA A) F , where we used Cauchy-Schwarz s inequality. This finally gives λˆθ λ 2 T λθ λ 2 T λˆθ λθ 2 T 3 ( ˆw)supp(µ) 2 + 3 2 ( ˆW)supp(A) F + 3 rank(A) λˆθ λθ T where we used (42). The conclusion of the proof of Theorem 6 follows from the fact that ax x2 a2/4 for any a, x > 0. 8.6. Proof of Corollary 7 We simply use Theorem 4. First, we remark that Z(t) = R t 0 T(s) diag(d M(s)) for the tensor T(t) of size d Kd d d given by (T(t))i,j;k,l = (I)i,k(H(t))l,j (43) SPARSE AND LOW-RANK MULTIVARIATE HAWKES PROCESSES for 1 i, k, l d and 1 j Kd. Note that we have T , ;k,l(t) = ek Hl, (t) and T , ;k,l(t) = Hl, (t)e k (44) where ek Rd stands for the k-th element of the canonical basis of Rd and where Hl, (t) RKd stands for the vector corresponding to the l-th row of the matrix H(t). Therefore, we have T , ;k,l(t)T , ;k,l(t) = Hl, (t) 2 2eke k and T , ;k,l(t)T , ;k,l(t) = Hl, (t)Hl, (t) and therefore T , ;k,l(t) op = q λmax(T , ;k,l(t)T , ;k,l(t)) = Hl, (t) 2 and T(t) op; = max 1 l d Hl, (t) 2 = H(t) ,2. One can prove in the same way that T (t) op; = H(t) ,2, so that for this choice of tensor T(t), we have b T(t) = H(t) ,2. Now, let us explicit what b V T(t) is for the tensor (43). First, let us remind that "R t 0 T(s)T (s) diag(d N(s)) 0 0 R t 0 T (s)T(s) diag(d N(s)) Using (44) we get (T(t)T(t) ) , ;,k,l = ek Hl, (t) Hl, (t)e k = Hl, (t) 2 2eke k so that R t 0(T(s)T (s)) diag(d N(s)) is the diagonal matrix with entries Z t 0 (T(s)T (s)) diag(d N(s)) 0 Hj, (s) 2 2d Nj(s), or equivalently Z t 0 (T(s)T (s)) diag(d N(s)) = Z t 0 diag(H (s)H(s)) diag(d N(s)). 
Using again (44), we get
\[ T_{\cdot,\cdot;k,l}(t)^\top T_{\cdot,\cdot;k,l}(t) = H_{l,\cdot}(t)^\top e_k^\top e_k H_{l,\cdot}(t) = H_{l,\cdot}(t)^\top H_{l,\cdot}(t), \]
so that ∫₀ᵗ (T^⊤(s)T(s)) · diag(dN(s)) is the matrix with entries
\[ \Big( \int_0^t (T^\top(s) T(s)) \cdot \operatorname{diag}(dN(s)) \Big)_{i,j} = \sum_{l=1}^d \int_0^t H_{l,i}(s) H_{l,j}(s) \, dN_l(s), \]
or equivalently
\[ \int_0^t (T^\top(s) T(s)) \cdot \operatorname{diag}(dN(s)) = \int_0^t H^\top(s) H(s) \cdot \operatorname{diag}(dN(s)). \]
Finally, we obtain that
\[ \lambda_{\max}(\hat V_t) = \lambda_{\max}\Big( \int_0^t H^\top(s) H(s) \cdot \operatorname{diag}(dN(s)) \Big) \vee \max_{j=1,\dots,d} \int_0^t \|H_{j,\cdot}(s)\|_2^2 \, dN_j(s). \]
This concludes the proof of the Corollary.

Acknowledgments

This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

References

E. Bacry, S. Delattre, M. Hoffmann, and J.-F. Muzy. Modelling microstructure noise with mutually exciting point processes. Quantitative Finance, 13(1):65–77, 2013.
E. Bacry, I. Mastromatteo, and J.-F. Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 01(01):1550005, 2015.
E. Bacry, S. Gaïffas, I. Mastromatteo, and J.-F. Muzy. Mean-field inference of Hawkes point processes. Journal of Physics A: Mathematical and Theoretical, 49(17):174006, 2016a.
E. Bacry, S. Gaïffas, and J.-F. Muzy. Concentration inequalities for matrix martingales in continuous time. Probability Theory and Related Fields, 170:525–553, 2016b.
E. Bacry, M. Bompaire, P. Deegan, S. Gaïffas, and S. V. Poulsen. tick: a Python library for statistical learning, with an emphasis on Hawkes processes and time-dependent models. Journal of Machine Learning Research, 18(214):1–5, 2018.
P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
C. Blundell, K.
A. Heller, and J. M. Beck. Modelling reciprocating relationships with Hawkes processes. In NIPS, pages 2609–2617, 2012.

M. Bompaire. Machine Learning based on Hawkes processes and Stochastic Optimization. PhD thesis, CMAP, École polytechnique, EDMH, 2018.

M. Bompaire, E. Bacry, and S. Gaïffas. Dual optimization for convex constrained objectives without the gradient-Lipschitz assumption. arXiv preprint arXiv:1807.03545, 2018.

E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2004.

E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5), 2009.

R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences, 105(41), 2008.

N. Daneshmand, M. Rodriguez, L. Song, and B. Schölkopf. Estimating diffusion network structure: Recovery conditions, sample complexity, and a soft-thresholding algorithm. In ICML, 2014.

M. Argollo de Menezes and A.-L. Barabási. Fluctuations in network dynamics. Phys. Rev. Lett., 92:028701, Jan 2004. doi: 10.1103/PhysRevLett.92.028701. URL http://link.aps.org/doi/10.1103/PhysRevLett.92.028701.

C. DuBois, C. Butts, and P. Smyth. Stochastic blockmodeling of relational event dynamics. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 238–246, 2013.

M. Gomez-Rodriguez, J. Leskovec, and B. Schölkopf. Modeling information propagation with survival theory. In ICML, 2013.

N. R. Hansen, P. Reynaud-Bouret, and V. Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Technical report, arXiv preprint, 2012.

A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

T. Iwata, A. Shah, and Z. Ghahramani.
Discovering latent influence in online social activities via shared cascade Poisson processes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 266–274. ACM, 2013.

V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Saint-Flour XXXVIII-2008, volume 2033. Springer, 2011.

V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.

J. Leskovec. Dynamics of large networks. PhD thesis, Machine Learning Department, Carnegie Mellon University, 2008.

J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD. ACM, 2009.

A. S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2(1):173–183, 1995.

S. W. Linderman and R. P. Adams. Discovering latent network structure in point process data. arXiv preprint arXiv:1402.0914, 2014.

P. Massart. Concentration inequalities and model selection, volume 1896. Springer, 2007.

G. O. Mohler, M. B. Short, P. J. Brantingham, F. P. Schoenberg, and G. E. Tita. Self-exciting point process modeling of crime. Journal of the American Statistical Association, 2011.

Y. Ogata. The asymptotic behaviour of maximum likelihood estimators for stationary point processes. Annals of the Institute of Statistical Mathematics, 30(1):243–261, 1978.

Y. Ogata. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.

Y. Ogata. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50(2):379–402, 1998.

M. R. Pino, L. Landesa, J. L. Rodriguez, F. Obelleiro, and R. J. Burkholder. The generalized forward-backward method for analyzing the scattering from targets on ocean-like rough surfaces.
IEEE Transactions on Antennas and Propagation, 47(6):961–969, 1999.

F. Ricci, L. Rokach, and B. Shapira. Introduction to recommender systems handbook. Springer, 2011.

E. Richard, S. Gaïffas, and N. Vayatis. Link prediction in graphs with autoregressive features. Journal of Machine Learning Research, 2014.

M. Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.

J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

S. van de Geer. Empirical Processes in M-estimation, volume 105. Cambridge University Press, 2000.

S.-H. Yang and H. Zha. Mixture of mutually exciting processes for viral diffusion. In ICML, 2013.

K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, volume 31, pages 641–649, 2013.