Isotonic Hawkes Processes

Yichen Wang, Bo Xie, Nan Du {YICHEN.WANG, BO.XIE, DUNAN}@GATECH.EDU
Le Song LSONG@CC.GATECH.EDU
College of Computing, Georgia Institute of Technology, 266 Ferst Drive, Atlanta, GA 30332 USA

Abstract

Hawkes processes are powerful tools for modeling the mutual-excitation phenomena commonly observed in event data from a variety of domains, such as social networks, quantitative finance, and healthcare records. The intensity function of a Hawkes process is typically assumed to be linear in the sum of triggering kernels, rendering it inadequate to capture nonlinear effects present in real-world data. To address this shortcoming, we propose an Isotonic-Hawkes process whose intensity function is modulated by an additional nonlinear link function. We also develop a novel iterative algorithm which provably learns both the nonlinear link function and the other parameters. We show that Isotonic-Hawkes processes can fit a variety of nonlinear patterns which cannot be captured by conventional Hawkes processes, and achieve superior empirical performance in real-world applications.

1. Introduction

Temporal point processes are powerful tools for modeling the complex dynamics of event occurrences. In particular, Hawkes processes (Hawkes, 1971) are well suited to capture the phenomenon of mutual excitation between the occurrences of events, and have been applied successfully to modeling criminal retaliations (Mohler et al., 2011), online user behaviors (Farajtabar et al., 2014; 2015; Du et al., 2015), and opinion dynamics (Wang et al., 2016).

A Hawkes process is characterized by a linear intensity function, i.e., λ(t) = w^⊤ x_t, where x_t denotes time-dependent features and w is the weight vector. The intensity function parametrizes the likelihood of observing an event in the time window [t, t + dt) given that it has not occurred before t. Such linearity may be insufficient to model many real-world scenarios. For example, after purchasing a new album, users may be initially highly engaged and play the album over and over again. However, the engagement will saturate at some point as they become bored of the same album. Such a plateau pattern may not be captured by a simple linear relation. In another scenario, a recent hospital visit may trigger more future visits due to the progression of a disease into a more severe stage. Such cumulative influence from recent events may grow faster than a linear model can explain.

Nonlinear Hawkes processes (Brémaud & Massoulié, 1996) have been introduced to provide more flexibility in explaining these real-world phenomena. They apply a fixed nonlinear link function g to the linear combination, i.e., λ(t) = g(w^⊤ x_t). For computational considerations, g(·) is often assumed to take a simple parametric form, such as exp(u) or max{0, u} (Carstensen et al., 2010; Hansen et al., 2015). Although these models are more flexible, they are still restricted to a few nonlinear patterns with a fixed parametrization, which may not be correct for real-world data. Ideally, both g(·) and w should be learned from data. Unfortunately, this desideratum leads to a non-convex optimization problem, for which efficient algorithms with provable guarantees do not exist.
To address these challenges, we propose a novel model, referred to as the Isotonic-Hawkes process, where both g(·) and w can be directly learned from data. Rather than committing to a fixed parametric form, we instead use a non-parametric, monotonic nonlinear link function. The model is therefore extremely flexible and can capture different temporal dynamics without the need to select a fixed form in advance. To solve the non-convex learning problem with guarantees, we propose a different loss function than the typical log-likelihood for point processes. Moreover, by exploiting the problem structure, we are still able to provide theoretical guarantees on the computational and statistical performance. Our work makes the following contributions:

- We propose a novel method for nonlinear Hawkes processes that can learn both the link function and the other parameters directly from data.
- Although the learning involves a non-convex problem, our algorithm can provably recover the true link function and the model parameters. This also requires a novel analysis for non-i.i.d. observations.
- Our method achieves superior empirical performance, significantly outperforming alternatives on both synthetic and real-world datasets.

Related work. Prior work on nonlinear Hawkes processes focuses on theoretical properties (Brémaud & Massoulié, 1996; Zhu, 2015; Hansen et al., 2015). The link function is usually given, and discretization of time is needed in order to evaluate the integral of the intensity function. Hence, efficient algorithms are available only for specific link functions (Paninski, 2004; Truccolo et al., 2005; Carstensen et al., 2010). In contrast, our method is the first algorithm that can learn both the link function and the model parameters non-parametrically.

Our work is also closely related to Isotonic regression and the Single Index Model (SIM). Isotonic regression (Barlow et al., 1972; Robertson et al., 1988; Mair et al., 2009) is a well-studied technique for fitting an arbitrary monotonic 1-D function. SIM generalizes linear regression and estimates both the nonlinear link function and the feature weights. However, earlier works are usually heuristic and are not guaranteed to converge to a global optimum (Horowitz & Härdle, 1996; Hristache et al., 2001; Naik & Tsai, 2004; Delecroix et al., 2006). Only recently have algorithms with global convergence guarantees been proposed (Kalai & Sastry, 2009; Kakade et al., 2011; Acharyya & Ghosh, 2015). Unlike SIM, which focuses on regression, our work is concerned with learning a temporal point process where the response variable is not directly observed. At the same time, the observations are non-i.i.d., a setting significantly different from previous work. The added complexity of temporal point processes requires us to develop a new efficient algorithm and its analysis.

2. Preliminaries

In this section, we first provide some background on isotonic regression and point processes.

Isotonic Regression and Single Index Model. Given 1-D data points {(z_i, y_i)}_{i=1}^n, Isotonic regression solves a least-squares problem with monotonicity constraints:

    min_{ŷ} ∑_{i=1}^n (y_i − ŷ_i)²,  s.t.  ŷ_i ≤ ŷ_j if z_i ≤ z_j.    (1)

The problem can be solved efficiently by the Pool Adjacent Violators (PAV) algorithm (Mair et al., 2009) in O(n log n) time, and the input together with the solution {(z_i, ŷ_i)}_{i=1}^n implicitly defines a monotonic function with g(z_i) = ŷ_i.
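For readers unfamiliar with PAV, the following is a minimal sketch of the pool-adjacent-violators idea for problem (1), assuming unit weights and inputs already sorted by z; it is a simplified illustration rather than the PAVA implementation of Mair et al. (2009), and scikit-learn's IsotonicRegression provides an equivalent fit.

```python
import numpy as np

def pav(y):
    """Pool Adjacent Violators: project y onto non-decreasing sequences
    under squared loss (unit weights, y already ordered by its z values)."""
    sums, counts = [], []              # running statistics of the current blocks
    for v in np.asarray(y, dtype=float):
        sums.append(v)
        counts.append(1.0)
        # merge adjacent blocks while their means violate monotonicity
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    # expand each block mean back to the full-length fitted vector
    return np.concatenate([np.full(int(c), s / c) for s, c in zip(sums, counts)])

# toy usage: the fitted values implicitly define g(z_i) = yhat_i
z = np.array([0.1, 0.4, 0.5, 0.9, 1.3])
y = np.array([1.0, 0.5, 2.0, 1.8, 3.0])
print(pav(y[np.argsort(z)]))   # [0.75, 0.75, 1.9, 1.9, 3.0]
```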
The Single Index Model is a generalized linear model with the assumption E[y | x] = g(w^⊤ x), where g is the link function. The Isotron algorithm can provably recover w and g (Kalai & Sastry, 2009; Kakade et al., 2011) under the mild assumption that g is monotonic and Lipschitz continuous.

Temporal Point Processes. A temporal point process is a random process whose realization consists of a list of discrete temporal events {t_i}_{i=1}^n. It is equivalent to a counting process {N(t), t ≥ 0}, which records the cumulative number of events happening right before time t, and satisfies N(t′) ≥ N(t) for t′ ≥ t and N(0) = 0. A counting process is also a submartingale, i.e., E[N(t) | H_{t′}] ≥ N(t′) for all t > t′, where H_{t′} = {t_i | t_i < t′} denotes the history up to but not including time t′.

A useful characterization of temporal point processes is the intensity function. Specifically, according to the Doob-Meyer theorem (Aalen et al., 2008), N(t) has the unique decomposition N(t) = Λ(t) + M(t), where Λ(t) is an increasing predictable process called the compensator (or cumulative intensity) and M(t) is a zero-mean martingale. Alternatively, we have E[dM(t) | H_t] = 0 and

    E[dN(t) | H_t] = dΛ(t) := λ(t) dt,    (2)

where λ(t) is the intensity function. Intuitively, the larger λ(t), the greater the chance that an event happens in the time interval [t, t + dt). The functional form of the intensity characterizes the temporal point process. A particularly useful form is the intensity of a Hawkes process, which captures the mutual-excitation phenomena between events:

    λ(t) = λ_0 + α ∑_{t_i ∈ H_t} κ(t − t_i),    (3)

where λ_0 captures the long-term incentive to generate events, κ(t) > 0 models the temporal dependencies, and α > 0 quantifies the strength of the influence from each past event, making the intensity function depend on the history H_t. In a Hawkes process, past events affect the occurrence of future events, which makes it particularly useful for modeling clustered event patterns. However, the linear form of the intensity function may be insufficient to model many real-world scenarios. In the next section, we propose Isotonic-Hawkes processes with a flexible link function and a provable learning algorithm.

3. Isotonic Hawkes Processes

We propose a new family of nonlinear Hawkes processes: Isotonic-Hawkes processes. We present a moment-matching learning framework for the non-convex problem. To facilitate learning, we optimize the representation of the objective function by showing that the intensity integral in the objective can be computed exactly. We then present the overall algorithm, which applies an alternating minimization scheme to update the link function g and the weight parameters w.

3.1. Model Formulation

In Isotonic-Hawkes processes, we model the intensity as the composition of a monotonic, non-parametric link function and a linear Hawkes intensity.

Definition 1. An Isotonic-Hawkes process is a counting process N(t), with associated history H_t = {t_i | t_i < t}, such that the intensity function λ(t) can be written as

    λ(t) = g(w^⊤ x_t),    (4)

where κ(t): R_+ → R_+ is a continuous, monotonically decreasing triggering kernel capturing the influence of the history H_t, λ_0 > 0 is a baseline intensity independent of the history, and g ∈ G: R → R_+ is a monotonically increasing and G-Lipschitz link function, i.e.,
    0 ≤ g(b) − g(a) ≤ G(b − a)  for all  a ≤ b.    (5)

We set w = (λ_0, α)^⊤ and x_t = (1, ∑_{t_i ∈ H_t} κ(t − t_i))^⊤.

We require κ(t) to be monotonically decreasing, such as the exponential kernel exp(−t)·I[t ≥ 0], the Gaussian kernel, or the heavy-tailed log-logistic kernel. This property is useful for computing the integral of the intensity, as discussed later.

The linear Hawkes intensity alone is not sufficient to capture the general trends in real-world applications. For instance, linearity leads to an unbounded intensity, which is at odds with the saturation phenomenon. The nonlinear link function g enables the model to adapt to such nonlinearities in the data, hence achieving better performance. We assume g is non-parametric and monotonically increasing, which covers a wide range of functions and also preserves the properties of the composed intensity function.

3.2. Moment Matching Objective

Maximum Likelihood Estimation (MLE) is often used to learn Hawkes processes, yielding a convex problem with respect to w. The estimator has good statistical rates and is consistent and asymptotically efficient (Ozaki, 1979). However, if we want to learn g(·) and w jointly, MLE becomes non-convex, and we no longer have statistical guarantees. To solve this problem, we use a different learning criterion based on the moment-matching idea (Aalen et al., 2008). We can establish global convergence guarantees despite the non-convexity by establishing connections to Isotonic regression and SIM.

Let N_i = N(t_i). Since N(t) is a counting process, the count increases by one at each time t_i; hence, for a list of events {t_i}_{i=1}^n, we have N_i = i for i ∈ [n]. We also have

    E[N_i | H_{t_i}] = ∫_0^{t_i} g(w^⊤ x_t) dt.    (6)

Therefore, we can estimate the parameters g and w by matching the integral ∫_0^{t_i} λ(t) dt with the observations N_i, which leads to the following objective function:

    min_{g ∈ G, w} ∑_{i=1}^n ( N_i − ∫_0^{t_i} g(w^⊤ x_t) dt )².    (7)

[Figure 1. Illustration of the integral computation. (A) The function g has 3 pieces and is constant on the intervals I_1, I_2 and I_3. (B) The function z(t) = w^⊤ x_t restricted to the interval [t_1, t_2]; it is continuous and monotonically decreasing due to the property of the triggering kernel κ. The pre-image of I_2 is shown as the light yellow area on the t axis, and b_{22} is the intersection of [t_1, t_2] and z^{−1}(I_2). It is found by locating the pre-images of the endpoints: t′_2 and another point outside the interval [t_1, t_2] (not shown).]

Note that we need to optimize an integral with respect to a function g, which is challenging in terms of both representation and computation. Instead of optimizing over G, we replace it with F, the family of piecewise-constant non-decreasing functions whose jumps are defined only at the intensity value of each observed event. As shown in Theorem 2, the integral of such a function can be computed exactly as a weighted combination of the values g(w^⊤ x_{t_i}) at the observed time points t_i. For notational simplicity, we set x_i = x_{t_i}. The piecewise-constant function provides an increasingly good approximation to the original function as the number of training samples grows.
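As a minimal numerical sketch of the quantities being matched in (6)-(7): the code below evaluates the feature x_t, the intensity λ(t) = g(w^⊤ x_t), and a quadrature stand-in for the compensator ∫_0^{t_i} g(w^⊤ x_t) dt (Section 3.3 below computes this integral exactly). The event times, weights, and bounded link used here are illustrative values only, not values from the paper.

```python
import numpy as np

def hawkes_feature(t, history, kernel=lambda s: np.exp(-s)):
    """x_t = (1, sum_{t_i < t} kappa(t - t_i)) for a decaying triggering kernel."""
    past = np.array([ti for ti in history if ti < t])
    return np.array([1.0, kernel(t - past).sum() if past.size else 0.0])

def intensity(t, history, w, g):
    """lambda(t) = g(w^T x_t), with g any monotonically increasing link."""
    return g(w @ hawkes_feature(t, history))

def compensator(ti, history, w, g, grid=2000):
    """Trapezoidal stand-in for int_0^{t_i} g(w^T x_t) dt."""
    ts = np.linspace(0.0, ti, grid)
    vals = np.array([intensity(t, history, w, g) for t in ts])
    return np.sum((vals[:-1] + vals[1:]) / 2.0 * np.diff(ts))

# illustrative usage: the moment-matching objective compares N_i = i
# with the compensator evaluated at each observed event time t_i
events = [0.8, 1.5, 1.7, 2.6, 3.9]
w = np.array([0.6, 0.9])                         # (lambda_0, alpha)
g = lambda u: 2.0 / (1.0 + np.exp(-u))           # bounded, monotone link
for i, ti in enumerate(events, start=1):
    print(i, round(compensator(ti, events, w, g), 2))
```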
3.3. Integral Computation

We assume g is a piecewise-constant function whose jumps are defined at each time t_i. Then we have the following result.

Theorem 2. Assume g is piecewise constant. Then the integral on [0, t_i] is a weighted sum of the values y_j = g(w^⊤ x_j) with weights a_{ij}; that is, ∫_0^{t_i} g(w^⊤ x_t) dt = ∑_{j ∈ S_i} a_{ij} y_j.

To compute the a_{ij}'s efficiently, we first compute the integral on each interval [t_{i−1}, t_i], and then use a cumulative sum to obtain the final result. Set z(t) = w^⊤ x_t. Since g(·) is piecewise constant, we have

    g(z(t)) = ∑_j y_j · I[z(t) ∈ I_j],

where I[·] is the indicator function and I_j denotes the j-th interval on which g(·) is constant. Therefore, we can write the integral on [t_{i−1}, t_i] as

    ∫_{t_{i−1}}^{t_i} g(z(t)) dt = ∑_j y_j ∫_{t_{i−1}}^{t_i} I[t ∈ z^{−1}(I_j)] dt,

where z^{−1}(I_j) denotes the pre-image of the interval I_j. Next, we need to compute b_{ij} := ∫_{t_{i−1}}^{t_i} I[t ∈ z^{−1}(I_j)] dt. Since this is the length of the intersection of two intervals, we can compute b_{ij} by finding the endpoints of the pre-images z^{−1}(I_j).

Algorithm 1 COMPUTE-COEFFICIENT
1: Input: t_i, for i = 1, …, n
2: Output: a_{ij}
3: for j = 1, …, n do
4:   Compute t′_j that satisfies x_{t′_j} = x_{t_j}
5: end for
6: for j = 1, …, n do
7:   Set a_{0,j} = 0
8:   for i = 1, …, n do
9:     Compute b_{ij} = min(t′_{j−1}, t_i) − max(t′_j, t_{i−1})
10:    Compute a_{ij} = a_{(i−1),j} + b_{ij}
11:  end for
12: end for

To do so, we first state a property of z^{−1}(I_j). Restricted to [t_{i−1}, t_i], z(t) is a continuous and monotonically decreasing function, due to the monotonically decreasing triggering kernel (Figure 1(B)). Combined with the fact that the I_j are disjoint and share endpoints (Figure 1(A)), the pre-images z^{−1}(I_j) are also disjoint and share endpoints. With this property, we can compute b_{ij} easily. According to the definition of I_j, one endpoint of z^{−1}(I_j) is the pre-image of w^⊤ x_{t_j}, so we only need to find the other endpoint t′_j = z^{−1}(w^⊤ x_{t_j}), which amounts to solving the equation w^⊤ x_{t′_j} = w^⊤ x_{t_j}. Note that x_t has only two dimensions, the first of which is a constant; hence the equation does not depend on w, and it suffices to solve x_{t′_j} = x_{t_j}, where the left-hand side is a function of the unknown t′_j and the right-hand side is a function of the observed data. This can easily be solved by root-finding algorithms. We can then compute b_{ij} as

    b_{ij} = min(t′_{j−1}, t_i) − max(t′_j, t_{i−1}).

The min and max operators implement the interval intersection. Since z(t) is monotonically decreasing, we have t′_j ≤ t′_{j−1}. Figure 1 illustrates the procedure. Once the b_{ij} are computed, each a_{ij} is readily available as a_{ij} = ∑_{i′ ≤ i} b_{i′j}. The corresponding index sets S_i contain the indices of the nonzero a_{ij}'s. The detailed procedure is presented in Algorithm 1.

3.4. Overall Algorithm

With Theorem 2, we can replace the integral of the unknown function by a weighted sum of its values at the intensity of each observed event. Hence we can represent g ∈ F non-parametrically and reformulate the objective function as

    min_{g ∈ F, w} ∑_{i=1}^n ( N_i − ∑_{j ∈ S_i} a_{ij} g(w^⊤ x_j) )².    (8)

We optimize g and w alternately until convergence. The update rules for w and g are as follows.

Update ŵ. Given ĝ^k, the update rule for ŵ^{k+1} is

    ŵ^{k+1} = ŵ^k + (1/n) ∑_{i=1}^n ( N_i − ∑_{j ∈ S_i} a_{ij} ĝ^k(ŵ^{k⊤} x_j) ) x_i.    (9)

Similar to the Isotron algorithm (Kalai & Sastry, 2009), this update rule is parameter-free and Perceptron-like.

Update ĝ. Note that ĝ is a non-parametric function represented by its values ŷ^k_i at ŵ^{k⊤} x_i, so we only need to determine its values at the existing data points. Given ŵ^k, we first sort {ŵ^{k⊤} x_i}_{i=1}^n into an increasing sequence, i.e., we re-label the data points according to the sorted order. Then we solve the following quadratic program:

    min_{ŷ^k} ∑_{i=1}^n ( N_i − ∑_{j ∈ S_i} a_{ij} ŷ^k_j )²,    (10)
    s.t.  ŷ^k_i ≤ ŷ^k_{i+1},  1 ≤ i ≤ n − 1.    (11)

For simplicity we rewrite the problem in matrix notation. Denote N = (N_1, …, N_n)^⊤, ŷ = (ŷ_1, …, ŷ_n)^⊤, and A_{i,j} = a_{ij} if j ∈ S_i and A_{i,j} = 0 otherwise. The monotonicity constraint in (11) can be written as Bŷ ≥ 0, where B is the adjacent-difference operator with B_{i,i} = −1, B_{i,i+1} = 1, and all other entries zero. Then we arrive at the formulation

    min_{ŷ^k} ‖N − Aŷ^k‖²,  s.t.  Bŷ^k ≥ 0.    (12)
This is a convex problem and can be solved efficiently by projected gradient descent,

    ŷ^{t} = Π[ ŷ^{t−1} − η A^⊤(Aŷ^{t−1} − N) ],

where η is the step size and Π[·] is the operator that projects its argument onto the feasible set:

    Π[u] = argmin_x ‖x − u‖²,  s.t.  Bx ≥ 0.

The projection is exactly an Isotonic regression problem and can be solved by PAV (Mair et al., 2009) in O(n log n) time. In addition, the gradient computation is also efficient: since A is a sparse matrix, it takes O(n + nnz(A)) time, where nnz(A) is the number of nonzero elements. The number of iterations required to reach accuracy ε is O(1/ε), hence the overall complexity is O((n log n + nnz(A))/ε). This can be accelerated to O((n log n + nnz(A))/√ε) using Nesterov's acceleration scheme (Nesterov, 1983). The procedure is given in Algorithm 2, and the whole alternating minimization procedure is summarized in Algorithm 3. This procedure efficiently finds near-optimal ĝ and ŵ.

Algorithm 2 LEARN-ISOTONIC-FUNC
1: Input: {N_i}, {a_{ij}}
2: Output: ŷ
3: Initialize ŷ^0 randomly
4: Construct the matrices N, A from the input
5: t = 0
6: repeat
7:   t = t + 1
8:   ŷ^{t} = Π[ ŷ^{t−1} − η A^⊤(Aŷ^{t−1} − N) ]
9: until convergence

Algorithm 3 ISOTONIC-HAWKES
1: Input: sequence of events {t_i}_{i=1}^n
2: Output: ĝ, ŵ
3: Compute x_i = (1, ∑_{t_j ∈ H_{t_i}} κ(t_i − t_j))^⊤ for i ∈ [n]
4: {a_{ij}} = COMPUTE-COEFFICIENT({t_i})
5: Compute N_i = i for i ∈ [n]
6: Initialize ŵ^0, ĝ^0 randomly
7: k = 0
8: repeat
9:   k = k + 1
10:  Sort the data according to ŵ^{k⊤} x_i
11:  Update ĝ^k = LEARN-ISOTONIC-FUNC({N_i}, {a_{ij}})
12:  Update ŵ^{k+1} using (9)
13: until loss(ĝ, ŵ) ≤ ε
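The ĝ-update is thus a projected gradient loop whose projection Π is itself an isotonic regression, so it can be delegated to any PAV solver. Below is a minimal sketch in the spirit of Algorithm 2, not the authors' implementation: it uses scikit-learn's PAV-based IsotonicRegression as the projection and makes two simplifying assumptions, a dense matrix A (the paper exploits sparsity) and a fixed 1/L step size.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def learn_isotonic_func(N, A, n_iters=500):
    """Sketch of LEARN-ISOTONIC-FUNC:  min_y ||N - A y||^2  s.t.  y non-decreasing,
    solved by projected gradient with an isotonic (PAV) projection."""
    n = len(N)
    idx = np.arange(n)                                # positions used for the projection
    proj = IsotonicRegression(increasing=True)
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)  # 1/L with L = ||A||_2^2 (illustrative choice)
    y = np.zeros(n)
    for _ in range(n_iters):
        grad = A.T @ (A @ y - N)
        y = proj.fit_transform(idx, y - step * grad)  # projection = isotonic regression
    return y

# toy usage with a synthetic lower-triangular weight matrix (illustrative only)
rng = np.random.default_rng(0)
A = np.tril(rng.uniform(0.1, 0.3, size=(20, 20)))
y_true = np.sort(rng.uniform(0.5, 2.0, size=20))
print(np.round(learn_isotonic_func(A @ y_true, A), 2))
```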
4. Theoretical Guarantees

We now provide a theoretical analysis of the convergence properties. First we define the error as

    ε(ĝ^k, ŵ^k) = (1/n) ∑_{i=1}^n ( ĝ^k(ŵ^{k⊤} x_i) − g*(w*^⊤ x_i) )²,

where g*(·) and w* are the unknown true link function and model parameters, respectively. The goal is to analyze how quickly ε(ĝ^k, ŵ^k) decreases with k.

Notations. Set y*_i = g*(w*^⊤ x_i) to be the expected value of each y_i. Let N̄_i be the expected value of N_i; then N̄_i = ∑_{j ∈ S_i} a_{ij} y*_j. Clearly we do not have access to N̄_i. However, consider a hypothetical call to the algorithm with input {(x_i, N̄_i)}_{i=1}^n and suppose it returns ḡ^k. In this case, we define ȳ^k_i = ḡ^k(w̄^{k⊤} x_i).

We first bound the error using the squared distance ‖ŵ^k − w*‖² between two consecutive iterations:

Lemma 3. Suppose that ‖ŵ^k − w*‖ ≤ W, ‖x_i‖ ≤ 1, √c ≤ ∑_{j ∈ S_i} a_{ij} ≤ C, and y*_j ≤ M. Then

    ‖ŵ^k − w*‖² − ‖ŵ^{k+1} − w*‖² ≥ C₂ ε(ĝ^k, ŵ^k) − C₁(η₁ + η₂),

where η₁ and η₂ are the error terms bounded in Lemmas 4 and 5 below, C₁ = max{5CW, 4M√c + 2CW}, and C₂ = 2c/C.

The squared distance decreases at a rate depending on ε(ĝ^k, ŵ^k) and the upper bounds η₁ and η₂. The following two lemmas provide the concrete upper bounds.

Lemma 4 (Martingale Concentration Inequality). Suppose dM(t) ≤ K and V(t) ≤ k for all t > 0 and some K, k ≥ 0. Then, with probability at least 1 − δ, the martingale term η₁ satisfies a Bernstein-type bound in terms of K, k, and log(1/δ).

Note that N_i − N̄_i = M_i, the martingale evaluated at time t_i. A continuous martingale is a stochastic process such that E[M_t | {M_{s′}, s′ ≤ s}] = M_s for s ≤ t; that is, the conditional expectation of an observation at time t, given all observations up to time s ≤ t, equals the observation at time s. V(t) is the variation process. The martingale serves as the noise term in point processes (similar to Gaussian noise in regression) and can be bounded using a Bernstein-type concentration inequality.

Lemma 5 (Kakade et al., 2011). With probability at least 1 − δ, it holds for any k that

    (1/n) ∑_{j=1}^n ( ŷ^k_j − ȳ^k_j )² ≤ O( ( W² log(Wn/δ) / n )^{1/3} ).

Lemma 5 relates ŷ^k_j (the values we can actually compute) to ȳ^k_j (the values we could compute if we had the conditional means of the N_j). The proof of this lemma uses the covering-number technique of (Kakade et al., 2011).

We now state the main theorem:

Theorem 6. Suppose E[N_i | H_{t_i}] = ∫_0^{t_i} g*(w*^⊤ x_t) dt, where g* is monotonically increasing and 1-Lipschitz, and ‖w*‖ ≤ W. Then, with probability at least 1 − δ, there exists some iteration k < O( (Wn / log(Wn/δ))^{1/3} ) such that

    ε(ĝ^k, ŵ^k) ≤ O( ( W² log(Wn/δ) / n )^{1/3} ).

Theorem 6 implies that some iteration achieves ε(ĝ^k, ŵ^k) = O(1/∛n). It is plausible that this rate is sharp, based on the information-theoretic lower bounds in (Zhang, 2002).

Proof sketch. We telescope the bound of Lemma 3 and show that there are at most O(W²/(η₁ + η₂)) iterations before the error ε(ĝ^k, ŵ^k) drops below O(η₁ + η₂). Set η₁ and η₂ to be the right-hand sides of Lemmas 4 and 5. Since η₂ dominates η₁, we replace η₁ by η₂ in the final result. This completes the proof.

5. Extensions

We provide several extensions of Isotonic-Hawkes processes to more general settings.

General point processes. The idea and algorithm of Isotonic-Hawkes can easily be extended to other point processes. The time-dependent feature x_t in Isotonic-Hawkes is designed to capture the influence of the history. However, one can also incorporate and extend other features from prior work (Perry & Wolfe, 2013; Li & Zha, 2014), or design them from user experience and the application domain.

Learning monotonically decreasing functions. Our model can easily be extended to learn a monotonically decreasing link function: we only need to flip the sign of the inequality in (11). Theorem 6 still holds in this case.

[Figure 2. The sequence of events for each user-item pair is modeled as an Isotonic-Hawkes process.]

Low-rank Isotonic-Hawkes processes. We can also use our model to learn low-rank parameters. For example, in time-sensitive recommendation for online services (Du et al., 2015), we model user u's past consumption events on item i as an Isotonic-Hawkes process (Figure 2) and need to learn the parameters λ^{ui}_0, α^{ui}, g^{ui} for each user-item pair (u, i); that is, λ^{ui}(t) = g^{ui}(w^{ui⊤} x^{ui}_t). Since group structure often exists within users' preferences and items' attributes, we assume that both matrices Λ₀ = (λ^{ui}_0) and α = (α^{ui}) have low-rank structure, and factorize each as a product of two rank-r matrices: Λ₀ = X₀Y₀ and α = XY. We then formulate the learning problem by applying the objective function in (7) to each observed pair (u, i):

    min_{X₀, Y₀, X, Y, g} ∑_{(u,i) ∈ O} ∑_{k=1}^{n^{ui}} ( N^{ui}_k − ∑_{j ∈ S^{ui}_k} a^{ui}_{kj} g^{ui}(w^{ui⊤} x^{ui}_j) )²,    (13)

where w^{ui} = (λ^{ui}_0, α^{ui})^⊤, n^{ui} is the total number of events, and H^{ui} is the set of historical events for the user-item pair (u, i); O = {H^{ui}} is the collection of all observed sequences. We use alternating minimization to update X₀, Y₀, X, Y, and g: first keep g^k fixed and update the parameters to X₀^{k+1}, Y₀^{k+1}, X^{k+1}, Y^{k+1}; then keep them fixed and update g^{k+1}. For unobserved user-item pairs, after the algorithm stops, we obtain g^{ui} by averaging the user's link functions learned from the data.

Multi-dimensional Isotonic-Hawkes processes. We extend the Isotonic-Hawkes process to multiple dimensions, which is particularly useful for modeling information diffusion in social networks. It is defined by a U-dimensional point process N(t), with the intensity of the u-th dimension given by

    λ^u(t) = g^u( λ^u_0 + ∑_{t_i ∈ H_t} α_{u u_i} κ(t − t_i) ),

where α_{uu′} captures the degree of influence of the u′-th dimension on the u-th dimension. As for learning, the input data is a sequence of events observed in the form {(t_i, u_i)}, where each pair represents an event occurring in the u_i-th dimension at time t_i.
Hence, for each dimension u, set N^u_i = N^u(t_i), and we solve the problem

    min_{w^u, g^u} ∑_i ( N^u_i − ∑_{j ∈ S^u_i} a^u_{ij} g^u(w^{u⊤} x^u_j) )²,    (14)

where w^u = (λ^u_0, α_{u1}, …, α_{uU})^⊤ and the feature vector x^u_t stacks a constant 1 with the accumulated triggering kernels ∑_{t_i ∈ H_t : u_i = u′} κ(t − t_i) from each dimension u′. Our goal is to learn w = (w^u) and g = (g^u). From (14) we can see that learning in each dimension u is independent of the others; hence, under this framework, the w^u and g^u can be learned efficiently in parallel using Algorithm 3.

6. Experiments

We evaluate the performance of Isotonic-Hawkes on both synthetic and real-world datasets with respect to the following tasks:

- Convergence: investigate how well Isotonic-Hawkes learns the true parameters as the number of training samples increases.
- Fitting capability: study how well Isotonic-Hawkes explains real-world data, compared with the classic Hawkes process.
- Time-sensitive recommendation: demonstrate that Isotonic-Hawkes improves predictive performance in item recommendation and time prediction.
- Diffusion network modeling: evaluate how well Isotonic-Hawkes models information diffusion from cascades of temporal events.

6.1. Experiments on Synthetic Data

Experimental setup. Table 1 lists the ground-truth settings with four typical link functions g(·) and the respective model parameters w. The first three link functions (Linear, Exp, Sigmoid) are monotonically increasing, while the last one is strictly decreasing. For the Exp link function, we explore the performance of learning self-inhibition by setting α to be negative. Without loss of generality, we use the unit-rate exponentially decaying function as the triggering kernel. Then, based on the configuration in each row of Table 1, we simulate one million events using Ogata's thinning algorithm (Ogata, 1981); a sketch of the thinning idea is given below.

Table 1. Model configurations.
    Name     | Link function g                 | Weights w
    Linear   | g(x) = x                        | w = (1.2, 0.6)
    Exp      | g(x) = e^x                      | w = (0.5, −0.1)
    Sigmoid  | g(x) = 1/(1 + e^{−4(x−2)})      | w = (0.5, 1.2)
    Decrease | g(x) = 1 − 1/(1 + e^{−4(x−3)})  | w = (0.5, 1.2)
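The sketch below illustrates Ogata-style thinning for an Isotonic-Hawkes intensity λ(t) = g(λ₀ + α ∑ κ(t − t_i)); it is an illustration, not the authors' simulator. It assumes a monotonically increasing g, a decaying kernel, and α ≥ 0, so that the intensity evaluated just after the current time upper-bounds the intensity until the next event; the usage example reuses the Sigmoid configuration from Table 1.

```python
import numpy as np

def simulate_isotonic_hawkes(w, g, T, kernel=lambda s: np.exp(-s), seed=0):
    """Ogata-style thinning for lambda(t) = g(lambda0 + alpha * sum_i kappa(t - t_i)).
    Assumes g increasing, kappa decreasing, and alpha >= 0, so the intensity
    right after the current time bounds it until the next event."""
    rng = np.random.default_rng(seed)
    lam0, alpha = w
    events, t = [], 0.0
    while True:
        excite = sum(kernel(t - ti) for ti in events)
        lam_bar = g(lam0 + alpha * excite)             # valid upper bound on (t, next event]
        t += rng.exponential(1.0 / lam_bar)            # candidate point
        if t >= T:
            break
        excite = sum(kernel(t - ti) for ti in events)
        if rng.uniform() <= g(lam0 + alpha * excite) / lam_bar:
            events.append(t)                           # accept the candidate
    return np.array(events)

# Sigmoid configuration from Table 1: g(x) = 1/(1 + exp(-4(x - 2))), w = (0.5, 1.2)
g = lambda x: 1.0 / (1.0 + np.exp(-4.0 * (x - 2.0)))
print(len(simulate_isotonic_hawkes((0.5, 1.2), g, T=1000.0)))
```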
Convergence analysis. We first evaluate the convergence of our learning algorithm by increasing the number of samples from 1,000 to 1,000,000. For each dataset, we repeat the simulation ten times and report the averaged results. Figure 3(a) shows the Root Mean Squared Error (RMSE) between the values of the learned link function and those of the ground-truth link function, as a function of the training-set size; Figure 3(b) shows the RMSE of the learned model parameters. The x-axis is in log scale. Since the RMSE decreases consistently in all cases, Isotonic-Hawkes is robust regardless of the latent ground-truth link function. Furthermore, for the Exp link function, we compare the RMSE of our method against a likelihood-based approach, Exp-likelihood (Truccolo et al., 2005), which has access to the true link function and discretizes the time interval to compute the integral in the likelihood. Our method works better at estimating w. Finally, the ability to recover the linear link function verifies that Isotonic-Hawkes naturally includes the classic Hawkes process as a special case while being much more expressive in explaining the data.

[Figure 3. Convergence by number of samples. (a) RMSE for the link function g; (b) RMSE for the parameters w.]

Visualization of recovered link functions. We also plot each learned link function against the respective ground truth in Figure 4, trained with 1,000,000 events. In all cases, the algorithm reaches the global optimum and precisely recovers the true function.

[Figure 4. Comparison between the learned link function and the ground truth on the four synthetic datasets: (a) Linear, (b) Exp, (c) Sigmoid, (d) Decrease.]

6.2. Experiments on Time-sensitive Recommendation

Experimental setup. For the task of time-sensitive recommendation, we fit a low-rank Isotonic-Hawkes process with the alternating minimization technique from (13) to solve the two related problems posed in (Du et al., 2015): (1) how to recommend the most relevant item at the right moment, and (2) how to accurately predict the next returning time of users to existing services. We evaluate the predictive performance of our model on two real-world datasets. last.fm¹ consists of the music listening histories of around 1,000 users over 3,000 different albums; we use the events of the first three months for training and those of the next month for testing, giving around 20,000 user-album pairs with more than one million events. tmall.com² contains online shopping records, with around 100K events between 26,376 users and 2,563 stores; we use the events of the first four months for training and those of the last month for testing. The unit of time is an hour.

¹ http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html
² http://ijcai-15.org/index.php/repeat-buyers-prediction-competition

Better data fitting capability. Since the true temporal dynamics governing the observed point patterns are unknown, we first investigate whether our new model can better explain the data compared with the classic Hawkes process. According to the Time Change Theorem (Daley & Vere-Jones, 2007), given a sequence T = {t_i}_{i=1}^n and a point process with intensity λ(t), the samples {∫_{t_{i−1}}^{t_i} λ(t) dt}_{i=1}^n should conform to a unit-rate exponential distribution if T is truly sampled from the process. As a consequence, we compare the theoretical quantiles of the unit-rate exponential distribution with the empirical quantiles obtained under different models; the closer the slope of the QQ-plot is to one, the better a model matches the point patterns. (Du et al., 2015) has shown that the Hawkes process fits these data better than simpler processes. In Figure 5(a) and (c), we show that Isotonic-Hawkes achieves a much better fit; (b) and (d) visualize the learned link functions.

[Figure 5. Experimental results on two randomly picked sequences from the last.fm data. Panels (a-b) and (c-d) correspond to the two sequences: (a), (c) QQ-plots against the unit-rate exponential; (b), (d) learned link functions.]

In Figure 5(b), the link function captures the phenomenon that the user's interest tends to saturate in the long run, even though the user may have been excited about the item initially. Intuitively, we can also see this from (a), where the Hawkes process has larger sample quantiles than the theoretical ones, meaning that ∫_{t_{i−1}}^{t_i} λ(t) dt is larger than it should be; a saturating link function as in (b) therefore helps adjust the Hawkes intensity λ(t) downward. A minimal numerical sketch of this time-change check follows.
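The sketch below computes the compensator increments Λ(t_i) − Λ(t_{i−1}) by trapezoidal quadrature for an arbitrary intensity callable and pairs them with unit-rate exponential quantiles for a QQ comparison. The paper evaluates the integral exactly; the grid size and the constant-rate toy check here are illustrative only.

```python
import numpy as np

def rescaled_interarrivals(events, intensity, grid=200):
    """Time-change check: Lambda(t_i) - Lambda(t_{i-1}) = int_{t_{i-1}}^{t_i} lambda(t) dt
    should look like unit-rate exponential samples if the model fits the sequence."""
    out = []
    for a, b in zip(events[:-1], events[1:]):
        ts = np.linspace(a, b, grid)
        vals = np.array([intensity(t) for t in ts])
        out.append(np.sum((vals[:-1] + vals[1:]) / 2.0 * np.diff(ts)))  # trapezoid rule
    return np.array(out)

def qq_against_exp1(samples):
    """(theoretical, empirical) quantile pairs for a QQ-plot against Exp(1)."""
    emp = np.sort(samples)
    p = (np.arange(1, len(emp) + 1) - 0.5) / len(emp)
    return -np.log(1.0 - p), emp

# toy check: a unit-rate Poisson process should line up with the 45-degree line
rng = np.random.default_rng(1)
ev = np.cumsum(rng.exponential(1.0, size=500))
theo, emp = qq_against_exp1(rescaled_interarrivals(ev, lambda t: 1.0))
print(np.round(np.corrcoef(theo, emp)[0, 1], 3))
```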
In contrast, (d) presents the opposite trend, where the user was not very interested in the given item initially but later became addicted to it. Since the Hawkes sample quantiles are smaller than the theoretical ones in (c), the link function helps make λ(t) larger. Hence, learning the link function is important.

Recommendation improvements. We evaluate the predictive performance on two tasks following (Du et al., 2015): (1) Rank prediction. At each testing moment, we record the predicted rank of the target item based on the respective intensity function, and report the average rank over all test events; smaller values indicate better performance. (2) Arrival-time prediction. We predict the arrival time of the next test event and report the mean absolute error (MAE) between the predicted time and the true value. In addition to the Hawkes process, we also compare with the commonly used Poisson process, which is a relaxation of the Hawkes model assuming that each user-item pair has a constant base intensity independent of the history, as well as with the state-of-the-art tensor factorization method (Chi & Kolda, 2012), which applies Poisson factorization to fit the number of events in each discretized time slot and performs better than methods based on squared loss (Wang et al., 2015); for this baseline we use the parameters averaged over all time intervals to make predictions. The latent rank of the low-rank Isotonic-Hawkes process and of the tensor method is tuned to give the best performance.

[Figure 6. Time-sensitive recommendation results: predicted rank and predicted arrival time for Isotonic-Hawkes, Hawkes, Poisson, and Tensor on last.fm and tmall.com.]

We summarize the results in Figure 6. First, Hawkes outperforms the Poisson process, which means that considering the effects of history is helpful. Second, Isotonic-Hawkes outperforms the Hawkes process by a significant margin, thanks to the better data-fitting capability shown in Figure 5. For time prediction, since the MAE is measured in hours, the error difference between Isotonic-Hawkes and Hawkes is about three days. Online shopping services can benefit substantially from this improvement and make better demand predictions.

6.3. Experiments on Modeling Diffusion Networks

Finally, we apply the multi-dimensional Isotonic-Hawkes process with the model estimation procedure in (14) to recover the latent information diffusion network reflected by the nonzero pattern of the mutual-excitation matrix on the real Network dataset from (Farajtabar et al., 2014). This dataset comprises all tweets posted by 2,241 users over eight months; the network has 4,901 edges. We split the data into a training set (covering 85% of the total events) and a test set (covering the remaining 15%) according to time. Similar to the time-sensitive recommendation task, we report the average rank of all testing events and the MAE of arrival-time prediction with an increasing proportion of training events.

[Figure 7. Prediction results on the Network dataset: (a) average rank and (b) arrival-time MAE as the number of training events increases.]

Figure 7 verifies that Isotonic-Hawkes outperforms the Hawkes process consistently.

7. Conclusion

We have proposed a novel nonlinear Hawkes process, the Isotonic-Hawkes process, with a flexible nonlinear link function.
Along with the model, we have developed a computationally and statistically efficient algorithm to learn the link function and model parameters jointly, and have rigorously shown that, under mild monotonicity assumptions, our algorithm is guaranteed to converge to the globally optimal solution. Furthermore, our model is very general and can be extended in many directions, including monotonically decreasing link functions, low-rank Isotonic-Hawkes processes, and multi-dimensional Isotonic-Hawkes processes. Experiments on both synthetic and real-world datasets empirically verify the theoretical guarantees and demonstrate superior predictive performance compared to other baselines.

Acknowledgements. This project was supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983.

References

Aalen, Odd, Borgan, Ornulf, and Gjessing, Hakon. Survival and Event History Analysis: A Process Point of View. Springer, 2008.

Acharyya, Sreangsu and Ghosh, Joydeep. Parameter estimation of generalized linear models without assuming their link function. In AISTATS, 2015.

Barlow, R. E., Bartholomew, D., Bremner, J. M., and Brunk, H. D. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. J. Wiley, 1972.

Brémaud, Pierre and Massoulié, Laurent. Stability of nonlinear Hawkes processes. The Annals of Probability, pp. 1563-1588, 1996.

Carstensen, Lisbeth, Sandelin, Albin, Winther, Ole, and Hansen, Niels R. Multivariate Hawkes process models of the occurrence of regulatory elements. BMC Bioinformatics, 11(1):456, 2010.

Chi, Eric C and Kolda, Tamara G. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272-1299, 2012.

Daley, D. J. and Vere-Jones, D. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure, volume 2. Springer, 2007.

Delecroix, Michel, Hristache, Marian, and Patilea, Valentin. On semiparametric M-estimation in single-index regression. Journal of Statistical Planning and Inference, 2006.

Du, Nan, Wang, Yichen, He, Niao, and Song, Le. Time-sensitive recommendation from recurrent user activities. In NIPS, 2015.

Farajtabar, Mehrdad, Du, Nan, Gomez-Rodriguez, Manuel, Valera, Isabel, Zha, Hongyuan, and Song, Le. Shaping social activity by incentivizing users. In NIPS, 2014.

Farajtabar, Mehrdad, Wang, Yichen, Gomez-Rodriguez, Manuel, Li, Shuang, Zha, Hongyuan, and Song, Le. Coevolve: A joint point process model for information diffusion and network co-evolution. In NIPS, 2015.

Hansen, Niels Richard, Reynaud-Bouret, Patricia, Rivoirard, Vincent, et al. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1):83-143, 2015.

Hawkes, Alan G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83-90, 1971.

Horowitz, Joel L and Härdle, Wolfgang. Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association, 91(436):1632-1640, 1996.

Hristache, Marian, Juditsky, Anatoli, and Spokoiny, Vladimir. Direct estimation of the index coefficient in a single-index model. Annals of Statistics, 2001.

Kakade, Sham M, Kanade, Varun, Shamir, Ohad, and Kalai, Adam. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, 2011.

Kalai, Adam Tauman and Sastry, Ravi. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
Li, Liangda and Zha, Hongyuan. Learning parametric models for social infectivity in multi-dimensional Hawkes processes. In AAAI, 2014.

Liptser, Robert and Shiryayev, Albert Nikolaevich. Theory of Martingales, volume 49. Springer Science & Business Media, 2012.

Mair, Patrick, Hornik, Kurt, and de Leeuw, Jan. Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32(5):1-24, 2009.

Mohler, George O, Short, Martin B, Brantingham, P Jeffrey, Schoenberg, Frederic Paik, and Tita, George E. Self-exciting point process modeling of crime. Journal of the American Statistical Association, 106(493), 2011.

Naik, Prasad A and Tsai, Chih-Ling. Isotonic single-index model for high-dimensional database marketing. Computational Statistics & Data Analysis, 2004.

Nesterov, Yurii. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Soviet Math. Dokl., 269:543-547, 1983.

Ogata, Yosihiko. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23-31, 1981.

Ozaki, Tohru. Maximum likelihood estimation of Hawkes' self-exciting point processes. Annals of the Institute of Statistical Mathematics, 31(1):145-155, 1979.

Paninski, L. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50:2200-2203, 2004.

Perry, Patrick O and Wolfe, Patrick J. Point process modelling for directed interaction networks. Journal of the Royal Statistical Society: Series B, 75(5):821-849, 2013.

Robertson, Tim, Wright, F. T., and Dykstra, Richard L. Order Restricted Statistical Inference, volume 229. Wiley, New York, 1988.

Truccolo, Wilson, Eden, Uri T, Fellows, Matthew R, Donoghue, John P, and Brown, Emery N. A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93(2):1074-1089, 2005.

Wang, Yichen, Chen, Robert, Ghosh, Joydeep, Denny, Joshua C, Kho, Abel, Chen, You, Malin, Bradley A, and Sun, Jimeng. Rubik: Knowledge guided tensor factorization and completion for health data analytics. In KDD, 2015.

Wang, Yichen, Theodorou, Evangelos, Verma, Apurv, and Song, Le. A stochastic differential equation framework for guiding information diffusion. arXiv preprint arXiv:1603.09021, 2016.

Zhang, Cun-Hui. Risk bounds in isotonic regression. The Annals of Statistics, 30(2):528-555, 2002.

Zhu, Lingjiong. Large deviations for Markovian nonlinear Hawkes processes. The Annals of Applied Probability, 2015.