# online_learning_for_multivariate_hawkes_processes__02139f09.pdf Online Learning for Multivariate Hawkes Processes Yingxiang Yang Jalal Etesami Niao He Negar Kiyavash University of Illinois at Urbana-Champaign Urbana, IL 61801 {yyang172,etesami2,niaohe,kiyavash} @illinois.edu We develop a nonparametric and online learning algorithm that estimates the triggering functions of a multivariate Hawkes process (MHP). The approach we take approximates the triggering function fi,j(t) by functions in a reproducing kernel Hilbert space (RKHS), and maximizes a time-discretized version of the log-likelihood, with Tikhonov regularization. Theoretically, our algorithm achieves an O(log T) regret bound. Numerical results show that our algorithm offers a competing performance to that of the nonparametric batch learning algorithm, with a run time comparable to parametric online learning algorithms. 1 Introduction Multivariate Hawkes processes (MHPs) are counting processes where an arrival in one dimension can affect the arrival rates of other dimensions. They were originally proposed to statistically model the arrival patterns of earthquakes [16]. However, MHP s ability to capture mutual excitation between dimensions of a process also makes it a popular model in many other areas, including high frequency trading [3], modeling neural spike trains [24], and modeling diffusion in social networks [28], and capturing causality [12, 18]. For a p-dimensional MHP, the intensity function of the i-th dimension takes the following form: λi(t) = µi + 0 fi,j(t τ)d Nj(τ), (1) where the constant µi is the base intensity of the i-th dimension, Nj(t) counts the number of arrivals in the j-th dimension within [0, t], and fi,j(t) is the triggering function that embeds the underlying causal structure of the model. In particular, one arrival in the j-th dimension at time τ will affect the intensity of the arrivals in the i-th dimension at time t by the amount fi,j(t τ) for t > τ. Therefore, learning the triggering function is the key to learning an MHP model. In this work, we consider the problem of estimating the fi,j(t)s using nonparametric online learning techniques. 1.1 Motivations Why nonparametric? Most of existing works consider exponential triggering functions: fi,j(t) = αi,j exp{ βi,jt}1{t > 0}, (2) where αi,j is unknown while βi,j is given a priori. Under this assumption, learning fi,j(t) is equivalent to learning a real number, αi,j. However, there are many scenarios where (2) fails to Department of Electrical and Computer Engineering. Department of Industrial and Enterprise Systems Engineering. This work was supported in part by MURI grant ARMY W911NF-15-1-0479 and ONR grant W911NF-15-1-0479. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. describe the correct mutual influence pattern between dimensions. For example, [20] and [11] have reported delayed and bell-shaped triggering functions when applying the MHP model to neural spike train datasets. Moreover, when fi,j(t)s are not exponential, or when βi,js are inaccurate, formulation in (2) is prone to model mismatch [15]. Why online learning? There are many reasons to consider an online framework. (i) Batch learning algorithms do not scale well due to high computational complexity [15]. (ii) The data can be costly to observe, and can be streaming in nature, for example, in criminology. The above concerns motivate us to design an online learning algorithm in the nonparametric regime. 1.2 Related Works Earlier works on learning the triggering functions can be largely categorized into three classes. Batch and parametric. The simplest way to learn the triggering functions is to assume that they possess a parametric form, e.g. (2), and learn the coefficients. The most widely used estimators include the maximum likelihood estimator [23], and the minimum mean-square error estimator [2]. These estimators can also be generalized to the high dimensional case when the coefficient matrix is sparse and low-rank [2]. More generally, one can assume that fi,j(t)s lie within the span of a given set of basis functions S = {e1(t), . . . , e|S|(t)}: fi,j(t) = P|S| i=1 ciei(t), where ei(t)s have a given parametric form [13, 27]. The state-of-the-art of such algorithms is [27], where |S| is adaptively chosen, which sometimes requires a significant portion of the data to determine the optimal S. Batch and nonparametric. A more sophisticated approach towards finding the set S is explored in [29], where the coefficients and the basis functions are iteratively updated and refined. Unlike [27], where the basis functions take a predetermined form, [29] updates the basis functions by solving a set of Euler-Lagrange equations in the nonparametric regime. However, the formulation of [29] is nonconvex, and therefore the optimality is not guaranteed. The method also requires more than 105 arrivals for each dimension in order to obtain good results, on networks of less than 5 dimensions. Another way to estimate fi,j(t)s nonparametrically is proposed in [4], which solves a set of p Wiener Hopf systems, each of dimension at least p2. The algorithm works well on small dimensions; however, it requires inverting a p2 p2 matrix, which is costly, if not all together infeasible, when p is large. Online and parametric. To the best of our knowledge, learning the triggering functions in an online setting seems largely unexplored. Under the assumption that fi,j(t)s are exponential, [15] proposes an online algorithm using gradient descent, exploiting the evolutionary dynamics of the intensity function. The time axis is discretized into small intervals, and the updates are performed at the end of each interval. While the authors provide the online solution to the parametric case, their work cannot readily extend to the nonparametric setting where the triggering functions are not exponential, mainly because the evolutionary dynamics of the intensity functions no longer hold. Learning triggering functions nonparametrically remains an open problem. 1.3 Challenges and Our Contributions Designing an online algorithm in the nonparametric regime is not without its challenges: (i) It is not clear how to represent fi,j(t)s. In this work, we relate fi,j(t) to an RKHS. (ii) Although online learning with kernels is a well studied subject in other scenarios [19], a typical choice of loss function for learning an MHP usually involves the integral of fi,j(t)s, which prevents the direct application of the representer theorem. (iii) The outputs of the algorithm at each step require a projection step to ensure positivity of the intensity function. This requires solving a quadratic programming problem, which can be computationally expensive. How to circumvent this computational complexity issue is another challenge of this work. In this paper, we design, to the best of our knowledge, the first online learning algorithm for the triggering functions in the nonparametric regime. In particular, we tackle the challenges mentioned above, and the only assumption we make is that the triggering functions fi,j(t)s are positive, have a decreasing tail, and that they belong to an RKHS. Theoretically, our algorithm achieves a regret bound of O(log T), and numerical experiments show that our approach outperforms the previous approaches despite the fact that they are tailored to a less general setting. In particular, our algorithm attains a similar performance to the nonparametric batch learning maximum likelihood estimators while reducing the run time extensively. 1.4 Notations Prior to discussing our results, we introduce the basic notations used in the paper. Detailed notations will be introduced along the way. For a p-dimensional MHP, we denote the intensity function of the i-th dimension by λi(t). We use λ(t) to denote the vector of intensity functions, and we use F = [fi,j(t)] to denote the matrix of triggering functions. The i-th row of F is denoted by fi. The number of arrivals in the i-th dimension up to t is denoted by the counting process Ni(t). We set N(t) = Pp i=1 Ni(t). The estimates of these quantities are denoted by their hatted versions. The arrival time of the n-th event in the j-th dimension is denoted by τj,n. Lastly, define x y = y x/y . 2 Problem Formulation In this section, we introduce our assumptions and definitions followed by the formulation of the loss function. We omit the basics on MHPs and instead refer the readers to [22] for details. Assumption 2.1. We assume that the constant base intensity µi is lower bounded by a given threshold µmin > 0. We also assume bounded and stationary increments for the MHP [16, 9]: for any t, z > 0, Ni(t) Ni(t z) κz = O(z). See Appendix A for more details. Definition 2.1. Suppose that {tk} k=0 is an arbitrary time sequence with t0 = 0, and supk 1(tk tk 1) δ 1. Let εf : [0, ) [0, ) be a continuous and bounded function such that limt εf(t) = 0. Then, f(x) satisfies the decreasing tail property with tail function εf(t) if k=m (tk tk 1) sup x (tk 1,tk] |f(x)| εf(tm 1), m > 0. Assumption 2.2. Let H be an RKHS associated with a kernel K( , ) that satisfies K(x, x) 1. Let L1[0, ) be the space of functions for which the absolute value is Lebesgue integrable. For any i, j {1, . . . , p}, we assume that fi,j(t) H and fi,j(t) L1[0, ), with both fi,j(t) and dfi,j(t)/dt satisfying the decreasing tail property of Definition 2.1. Assumption 2.1 is common and has been adopted in existing literature [22]. It ensures that the MHP is not explosive by assuming that N(t)/t is bounded. Assumption 2.2 restricts the tail behaviors of both fi,j(t) and dfi,j(t)/dt. Complicated as it may seem, functions with exponentially decaying tails satisfy this assumption, as is illustrated by the following example (See Appendix B for proof): Example 1. Functions f1(t) = exp{ βt}1{t > 0} and f2(t) = exp{ (t γ)2}1{t > 0} satisfy Assumption 2.2 with tail functions β 1 exp{ β(t δ)} and 2 γ) exp{δ2/2}. 2.1 A Discretized Loss Function for Online Learning A common approach for learning the parameters of an MHP is to perform regularized maximum likelihood estimation. As such, we introduce a loss function comprised of the negative of the loglikelihood function and a penalty term to enforce desired structural properties, e.g. sparsity of the triggering matrix F or smoothness of the triggering functions (see, e.g., [2, 29, 27]). The negative of the log-likelihood function of an MHP over a time interval of [0, t] is given by 0 log λi(τ)d Ni(τ) Z t 0 λi(τ)dτ . (3) Let {τ1, ..., τN(t)} denote the arrival times of all the events within [0, t] and let {t0, . . . , t M(t)} be a finite partition of the time interval [0, t] such that t0 = 0 and tk+1 := minτi tk{ tk δ + δ, τi}. Using this partitioning, it is straightforward to see that the function in (3) can be written as tk 1 λi(τ)dτ xi,k log λi(tk) i=1 Li,t(λi), (4) where xi,k := Ni(tk) Ni(tk 1). By the definition of tk, we know that xi,k {0, 1}. In order to learn fi,j(t)s using an online kernel method, we require a similar result as the representer theorem in [25] that specifies the form of the optimizer. This theorem requires that the regularized version of the loss in (4) to be a function of only fi,j(t)s. However, due to the integral part, Lt(λ) is a function of both fi,j(t)s and their integrals, which prevents us from applying the representer theorem directly. To resolve this issue, several approaches can be applied such as adjusting the Hilbert space as proposed in [14] in context of Poisson processes, or approximating the log-likelihood function as in [15]. Here, we adopt a method similar to [15] and approximate (4) by discretizing the integral: L(δ) t (λ) := k=1 ((tk tk 1)λi(tk) xi,k log λi(tk)):= i=1 L(δ) i,t (λi). (5) Intuitively, if δ is small enough and the triggering functions are bounded, it is reasonable to expect that Li,t(λ) is close to L(δ) i,t (λ). Below, we characterize the accuracy of the above discretization and also truncation of the intensity function. First, we require the following definition. Definition 2.2. We define the truncated intensity function as follows λ(z) i (t) := µi + 0 1{t τ < z}fi,j(t τ)d Nj(τ). (6) Proposition 1. Under Assumptions 2.1 and 2.2, for any i {1, . . . , p}, we have L(δ) i,t (λ(z) i ) Li,t(λi) (1 + κ1µ 1 min)N(t z)ε(z) + δN(t)ε (0), where µmin is the lower bound for µi, κ1 is the upper bound for Ni(t) Ni(t 1) from Definition 2.1, while ε and ε are two tail functions that uniformly capture the decreasing tail property of all fi,j(t)s and all dfi,j(t)/dts, respectively. The first term in the bound characterizes the approximation error when one truncates λi(t) with λ(z) i (t). The second term describes the approximation error caused by the discretization. When z = , λi(t) = λ(z) i (t), and the approximation error is contributed solely by discretization. Note that, in many cases, a small enough truncation error can be obtained by setting a relatively small z. For example, for fi,j(t) = exp{ 3t}1{t > 0}, setting z = 10 would result in a truncation error less than 10 13. Meanwhile, truncating λi(t) greatly simplifies the procedure of computing its value. Hence, in our algorithm, we focus on λ(z) i instead of λi. In the following, we consider the regularized instantaneous loss function with the Tikhonov regularization for fi,j(t)s and µi: li,k(λi) := (tk tk 1)λi(tk) xi,k log λi(tk) + 1 2 fi,j 2 H, (7) and aim at producing a sequence of estimates {bλi(tk)}M(t) k=1 of λi(t) with minimal regret: k=1 li,k(bλi(tk)) min µi µmin,fi,j(t) 0 k=1 li,k(λi(tk)). (8) Each regularized instantaneous loss function in (7) is jointly strongly convex with respect to fi,js and µi. Combining with the representer theorem in [25], the minimizer to (8) is a linear combination of a finite set of kernels. In addition, by setting ζi,j = O(1), our algorithm achieves β-stability with β = O((ζi,jt) 1), which is typical for a learning algorithm in RKHS (Theorem 22 of [8]). 3 Online Learning for MHPs We introduce our Non Parametric On Line Estimation for MHP (NPOLE-MHP) in Algorithm 1. The most important components of the algorithm are (i) the computation of the gradients and (ii) the Algorithm 1 Non Parametric On Line Estimation for MHP (NPOLE-MHP) 1: input: a sequence of step sizes {ηk} k=1 and a set of regularization coefficients ζi,js, along with positive values of µmin, z and σ. output: bµ(M(t)) and b F (M(t)). 2: Initialize bf (0) i,j and bµ(0) i for all i, j. 3: for k = 0, ..., M(t) 1 do 4: Observe the interval [tk, tk+1), and compute xi,k for i {1, . . . , p}. 5: for i = 1, . . . , p do 6: Set bµ(k+1) i max n bµ(k) i ηk+1 µili,k λ(z) i (bµ(k) i , b f (k) i ) , µmin o . 7: for j = 1, . . . , p do 8: Set bf (k+ 1 2 ) i,j h bf (k) i,j ηk+1 fi,jli,k λ(z) i (bµ(k) i , b f (k) i ) i , and bf (k+1) i,j Π h bf (k+ 1 2 ) i,j i . 9: end for 10: end for 11: end for projections in lines 6 and 8. For the partial derivative with respect to µi, recall the definition of li,k in (7) and λ(z) i in (6). Since λ(z) i is a linear function of µi, we have µili,k λ(z) i (bµ(k) i , b f (k) i ) = (tk tk 1) xi,k h λ(z) i bµ(k) i , b f (k) i i 1 + ωibµ(k) i ρk + ωibµ(k) i , where ρk is the simplified notation for the first two terms. Upon performing gradient descent, the algorithm makes sure that bµ(k+1) i µmin, which further ensures that bλ(z) i (bµ(k+1) i , b f (k+1) i ) λmin. For the update step of bf (k) i,j (t), notice that the li,k is also a linear function with respect to fi,j. Since fi,jfi,j(x) = K(x, ), which holds true due to the reproducing property of the kernel, we thus have fi,jli,k λ(z) i (bµ(k) i , b f (k) i ) = ρk X τj,n [tk z,tk) K(tk τj,n, ) + ζi,j bf (k) i,j ( ). (9) Once again, a projection Π[ ] is necessary to ensure that the estimated triggering functions are positive. 3.1 Projection of the Triggering Functions For any kernel, the projection step for a triggering function can be executed by solving a quadratic programming problem: min f bf (k+ 1 2 ) i,j 2 H subject to f H and f(t) 0. Ideally, the positivity constraint has to hold for every t > 0, but in order to simplify computation, one can approximate the solution by relaxing the constraint such that f(t) 0 holds for only a finite set of ts within [0, z]. Semi-Definite Programming (SDP). When the reproducing kernel is polynomial, the problem is much simpler. The projection step can be formulated as an SDP problem [26] as follows: Proposition 2. Let S = r k{tr τj,n : tr z τj,n < tr} be the set of tr τj,ns. Let K(x, y) = (1+xy)2d and K (x, y) = (1+xy)d be two polynomial kernels with d 1. Furthermore, let K and G denote the Gramian matrices where the i, j-th element correspond to K(s, s ), with s and s being the i-th and j-th element in S. Suppose that a R|S| is the coefficient vector such that bf (k+ 1 2 ) i,j ( ) = P s S as K(s, ), and that the projection step returns bf (k+1) i,j ( ) = P s S b s K(s, ). Then the coefficient vector b can be obtained by b = argmin b R|S| 2a Kb + b Kb, s.t. G diag(b) + diag(b) G 0. (10) Non-convex approach. Alternatively, we can assume that fi,j(t) = g2 i,j(t) where gi,j(t) H. By minimizing the loss with respect to gi,j(t), one can naturally guarantee that fi,j(t) 0. This method was adopted in [14] for estimating the intensity function of non-homogeneous Poisson processes. While this approach breaks the convexity of the loss function, it works relatively well when the initialization is close to the global minimum. It is also interestingly related to a line of recent works in non-convex SDP [6], as well as phase retrieval with Wirtinger flow [10]. Deriving guarantees on regret bound and convergence performances is a future direction implied by the result of this work. 4 Theoretical Properties We now discuss the theoretical properties of NPOLE-MHP. We start with defining the regret. Definition 4.1. The regret of Algorithm 1 at time t is given by R(δ) t (λ(z) i (µi, fi)) := li,k(λ(z) i (bµ(k) i , b f (k) i )) li,k(λ(z) i (µi, fi)) , where bµ(k) i and b f (k) i denote the estimated base intensity and the triggering functions, respectively. Theorem 1. Suppose that the observations are generated from a p-dimensional MHP that satisfies Assumptions 2.1 and 2.2. Let ζ = mini,j{ζi,j, ωi}, and ηk = 1/(ζk + b) for some positive constants b. Then R(δ) t (λ(z) i (µi, fi)) C1(1 + log M(t)), where C1 = 2(1 + pκ2 z)ζ 1|δ µ 1 min|2. The regret bound of Theorem 1 resembles the regret bound for a typical online learning algorithm with strongly convex loss function (see for example, Theorem 3.3 of [17]). When δ, ζ and µ 1 min are fixed, C1 = O(p), which is intuitive as one needs to update p functions at a time. Note that the regret in Definition 4.1, encodes the performance of Algorithm 1 by comparing its loss with the approximated loss. Below, we compare the loss of Algorithm 1 with the original loss in (4). Corollary 1. Under the same assumptions as Theorem 1, we have li,k(λ(z) i (bµi, b f (k) i )) li,k(λi(µi, fi)) C1[1 + log M(t)] + C2N(t), (11) where C1 is defined in Theorem 1 and C2 = (1 + κ1µ 1 min)ε(z) + δε (0). Note that C3N(t) is due to discretization and truncation steps and it can be made arbitrary small for given t and setting small δ and large enough z. Computational Complexity. Since b fis can be estimated in parallel, we restrict our analysis to the case of a fixed i {1, . . . , p} in a single iteration. For each iteration, the computational complexity comes from evaluating the intensity function and projection. Since the number of arrivals within the interval [tk z, tk) is bounded by pκz and κz = O(1), evaluating the intensity costs O(p2) operations. For the projection in each step, one can truncate the number of kernels used to represent fi,j(t) to be O(1) with controllable error (Proposition 1 of [19]), and therefore the computation cost is O(1). Hence, the per iteration computation cost of NPOLE-MHP is O(p2). By comparison, parametric online algorithms (DMD, OGD of [15]) also require O(p2) operations for each iteration, while the batch learning algorithms (MLE-SGLP, MLE of [27]) require O(p2t3) operations. 5 Numerical Experiments We evaluate the performance of NPOLE-MHP on both synthetic and real data, from multiple aspects: (i) visual assessment of the goodness-of-fit comparing to the ground truth; (ii) the average L1 error defined as the average of Pp i=1 Pp j=1 fi,j bfi,j L1[0,z] over multiple trials; (iii) scalability over both dimension p and time horizon T. For benchmarks, we compare NPOLE-MHP s performance to that of online parametric algorithms (DMD, OGD of [15]) and nonparametric batch learning algorithms (MLE-SGLP, MLE of [27]). 5.1 Synthetic Data Consider a 5-dimensional MHP with µi = 0.05 for all dimensions. We set the triggering functions as e 2.5t 0 0 e 10(t 1)2 0 2 5t (1 + cos(πt))e t/2 e 5t 0 0 0 2e 3t 0 0 0 0 0 0 0.6e 3t2 + 0.4e 3(t 1)2 e 4t 0 0 te 5(t 1)2 0 e 3t 0 0.5 1 1.5 2 2.5 3 0 0 0.5 1 1.5 2 2.5 3 0 0 0.5 1 1.5 2 2.5 3 0 True fi,j(t) NPOLE-MHP DMD OGD MLE-SGLP MLE Figure 1: Performances of different algorithms for estimating F . Complete set of result can be found in Appendix F. For each subplot, the horizontal axis covers [0, z] and the vertical axis covers [0, 1]. The performances are similar between DMD and OGD, and between MLE and MLE-SGLP. The design of F allows us to test NPOLE-MHP s ability of detecting (i) exponential triggering functions with various decaying rate; (ii) zero functions; (iii) functions with delayed peaks and tail behaviors different from an exponential function. Goodness-of-fit. We run NPOLE-MHP over a set of data with T = 105 and around 4 104 events for each dimension. The parameters are chosen by grid search over a small portion of data, and the parameters of the benchmark algorithms are fine-tuned (see Appendix F for details). In particular, we set the discretization level δ = 0.05, the window size z = 3, the step size ηk = (kδ/20+100) 1, and the regularization coefficient ζi,j ζ = 10 8. The performances of NPOLE-MHP and benchmarks are shown in Figure 1. We see that NPOLE-MHP captures the shape of the function much better than the DMD and OGD algorithms with mismatched forms of the triggering functions. It is especially visible for f1,4(t) and f2,2(t). In fact, our algorithm scores a similar performance to the batch learning MLE estimator, which is optimal for any given set of data. We next plot the average loss per iteration for this dataset in Figure 2. In the left-hand side, the loss is high due to initialization. However, the effect of initialization quickly diminishes as the number of events increases. Run time comparison. The simulation of the DMD and OGD algorithms took 2 minutes combined on a Macintosh with two 6-core Intel Xeon processor at 2.4 GHz, while NPOLE-MHP took 3 minutes. The batch learning algorithms MLE-SGLP and MLE in [27] each took about 1.5 hours. Therefore, our algorithm achieves the performance similar to batch learning algorithms with a run time close to that of parametric online learning algorithms. Effects of the hyperparameters: δ, ζi,j, and ηk. We investigate the sensitivity of NPOLE-MHP with respect to the hyperparameters, measuring the averaged L1 error defined at the beginning of this section. We independently generate 100 sets of data with the same parameters, and a smaller T = 104 for faster data generation. The result is shown in Table 1. For NPOLE-MHP, we fix ηk = 1/(k/2000 + 10). MLE and MLE-SGLP score around 1.949 with 5/5 inner/outer rounds of iterations. NPOLE-MHP s performance is robust when the regularization coefficient and discretization level are sufficiently small. It surpasses MLE and MLE-SGLP on large datasets, in which case the iterations of MLE and MLE-SGLP are limited due to computational considerations. As ζ increases, the error decreases first before rising drastically, a phenomenon caused by the mismatch between the loss functions. For the step size, the error varies under different choice of ηk, which can be selected via grid-search on a small portion of the data like most other online algorithms. 5.2 Real Data: Inferencing Impact Between News Agencies with Memetracker Data We test the performance of NPOLE-MHP on the memetracker data [21], which collects from the internet a set of popular phrases, including their content, the time they were posted, and the url address of the articles that included them. We study the relationship between different news agencies, modeling the data with a p-dimensional MHP where each dimension corresponds to a news website. Unlike [15], which conducted a similar experiment where all the data was used, we focus on only 20 Regularization log10 ζ 8 6 4 2 0 0.01 1.83 1.83 1.84 4.15 4.64 0.05 1.86 1.86 1.86 3.10 4.64 0.1 1.92 1.92 1.88 2.73 4.64 0.5 4.80 4.80 4.64 2.19 4.62 1 5.73 5.73 5.58 2.38 4.59 Table 1: Effect of hyperparameters ζ and δ, measured by the average L1 error . Horizon T (days) 1.8 3.6 5.4 Dimension p 20 3.9 9.1 15.3 40 4.6 10.4 17.0 60 4.6 10.2 16.7 80 4.5 10.0 16.4 100 4.5 9.7 15.9 Table 2: Average CPU-time for estimating one triggering function (seconds). 0 2 4 6 8 10 Time Axis t 104 Average Loss per Iteration δ = 0.05, with true fi,j(t)s δ = 0.05, NPOLE-MHP δ = 0.10, NPOLE-MHP δ = 0.50, NPOLE-MHP δ = 0.05, DMD Figure 2: Effect of discretization in NPOLEMHP. 0 5 10 15 Time Axis t 105 Cumulative Loss NPOLE-MHP DMD OGD Figure 3: Cumulative loss on memetracker data of 20 dimensions. websites that are most active, using 18 days of data. We plot the cumulative losses in Figure 3, using a window size of 3 hours, an update interval δ = 0.2 seconds, and a step size ηk = 1/(kζ + 800) with ζ = 10 10 for NPOLE-MHP. For DMD and OGD, we set ηk = 5/ p T/δ. The result shows that NPOLE-MHP accumulates a smaller loss per step compared to OGD and DMD. Scalability and generalization error. Finally, we evaluate the scalability of NPOLE-MHP using the average CPU-time for estimating one triggering function. The result in Table 2 shows that the computation cost of NPOLE-MHP scales almost linearly with the dimension and data size. When scaling the data to 100 dimensions and 2 105 events, NPOLE-MHP scores an average 0.01 loss per iteration on both training and test data, while OGD and DMD scored 0.005 on training data and 0.14 on test data. This shows a much better generalization performance of NPOLE-MHP. 6 Conclusion We developed a nonparametric method for learning the triggering functions of a multivariate Hawkes process (MHP) given time series observations. To formulate the instantaneous loss function, we adopted the method of discretizing the time axis into small intervals of lengths at most δ, and we derived the corresponding upper bound for approximation error. From this point, we proposed an online learning algorithm, NPOLE-MHP, based on the framework of online kernel learning and exploits the interarrival time statistics under the MHP setup. Theoretically, we derived the regret bound for NPOLE-MHP, which is O(log T) when the time horizon T is known a priori, and we showed that the per iteration cost of NPOLE-MHP is O(p2). Numerically, we compared NPOLEMHP s performance with parametric online learning algorithms and nonparametric batch learning algorithms. Results on both synthetic and real data showed that we are able to achieve similar performance to that of the nonparametric batch learning algorithms with a run time comparable to the parametric online learning algorithms. [1] Emmanuel Bacry, Khalil Dayri, and Jean-Franc ois Muzy. Non-parametric kernel estimation for symmetric Hawkes processes. application to high frequency financial data. The European Physical Journal B-Condensed Matter and Complex Systems, 85(5):1 12, 2012. [2] Emmanuel Bacry, St ephane Ga ıffas, and Jean-Franc ois Muzy. A generalization error bound for sparse and low-rank multivariate Hawkes processes, 2015. [3] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-Franc ois Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015. [4] Emmanuel Bacry and Jean-Franc ois Muzy. Firstand second-order statistics characterization of Hawkes processes and non-parametric estimation. IEEE Transactions on Information Theory, 62(4):2184 2202, 2016. [5] J Andrew Bagnell and Amir-massoud Farahmand. Learning positive functions in a Hilbert space, 2015. [6] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. Conference on Learning Theory, pages 530 582, 2016. [7] Jacek Bochnak, Michel Coste, and Marie-Franc oise Roy. Real algebraic geometry, volume 36. Springer Science & Business Media, 2013. [8] Olivier Bousquet and Andr e Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499 526, 2002. [9] Pierre Br emaud and Laurent Massouli e. Stability of nonlinear Hawkes processes. The Annals of Probability, pages 1563 1588, 1996. [10] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985 2007, 2015. [11] Michael Eichler, Rainer Dahlhaus, and Johannes Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225 242, 2017. [12] Jalal Etesami and Negar Kiyavash. Directed information graphs: A generalization of linear dynamical graphs. In American Control Conference (ACC), 2014, pages 2563 2568. IEEE, 2014. [13] Jalal Etesami, Negar Kiyavash, Kun Zhang, and Kushagra Singhal. Learning network of multivariate Hawkes processes: A time series approach. Conference on Uncertainty in Artificial Intelligence, 2016. [14] Seth Flaxman, Yee Whye Teh, and Dino Sejdinovic. Poisson intensity estimation with reproducing kernels. International Conference on Artificial Intelligence and Statistics, 2017. [15] Eric C Hall and Rebecca M Willett. Tracking dynamic point processes on networks. IEEE Transactions on Information Theory, 62(7):4327 4346, 2016. [16] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83 90, 1971. [17] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends R in Optimization, 2(3-4):157 325, 2016. [18] Sanggyun Kim, Christopher J Quinn, Negar Kiyavash, and Todd P Coleman. Dynamic and succinct statistical analysis of neuroscience data. Proceedings of the IEEE, 102(5):683 698, 2014. [19] Jyrki Kivinen, Alexander J Smola, and Robert C Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165 2176, 2004. [20] Michael Krumin, Inna Reutsky, and Shy Shoham. Correlation-based analysis and generation of multiple spike trains using Hawkes models with an exogenous input. Frontiers in Computational Neuroscience, 4, 2010. [21] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. International Conference on Knowledge Discovery and Data Mining, pages 497 506, 2009. [22] Thomas Josef Liniger. Multivariate Hawkes processes. Ph D thesis, Eidgen ossische Technische Hochschule ETH Z urich, 2009. [23] Tohru Ozaki. Maximum likelihood estimation of Hawkes self-exciting point processes. Annals of the Institute of Statistical Mathematics, 31(1):145 155, 1979. [24] Patricia Reynaud-Bouret, Sophie Schbath, et al. Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38(5):2781 2822, 2010. [25] Bernhard Sch olkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. International Conference on Computational Learning Theory, pages 416 426, 2001. [26] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM review, 38(1):49 95, 1996. [27] Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes processes. International Conference on Machine Learning, 48:1717 1726, 2016. [28] Shuang-Hong Yang and Hongyuan Zha. Mixture of mutually exciting processes for viral diffusion. International Conference on Machine Learning, 28:1 9, 2013. [29] Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional Hawkes processes. International Conference on Machine Learning, 28:1301 1309, 2013.