# Neural Jump-Diffusion Temporal Point Processes

Shuai Zhang 1, Chuan Zhou 1 2, Yang Liu 1, Peng Zhang 3, Xixun Lin 4, Zhi-Ming Ma 1

Abstract

We present a novel perspective on temporal point processes (TPPs) by reformulating their intensity processes as solutions to stochastic differential equations (SDEs). In particular, we first prove the equivalent SDE formulations of several classical TPPs, including Poisson processes, Hawkes processes, and self-correcting processes. Based on these proofs, we introduce a unified TPP framework called the Neural Jump-Diffusion Temporal Point Process (NJDTPP), whose intensity process is governed by a neural jump-diffusion SDE (NJDSDE) in which the drift, diffusion, and jump coefficient functions are parameterized by neural networks. Compared to previous works, NJDTPP exhibits model flexibility in capturing intensity dynamics without relying on any specific functional form, and provides theoretical guarantees regarding the existence and uniqueness of the solution to the proposed NJDSDE. Experiments on both synthetic and real-world datasets demonstrate that NJDTPP is capable of capturing the dynamics of intensity processes in different scenarios and significantly outperforms state-of-the-art TPP models in prediction tasks.

1. Introduction

Many real-world scenarios generate large numbers of asynchronous event sequences. Each event consists of a timestamp and a type mark, indicating when the event occurred and what it was. Examples include user activities on social media platforms (Farajtabar et al., 2017), electronic health records in healthcare (Liu & Hauskrecht, 2021), and transaction behaviors in e-commerce systems (Xue et al., 2022).
1 Academy of Mathematics and Systems Science, Chinese Academy of Sciences; 2 School of Cyber Security, University of Chinese Academy of Sciences; 3 Cyberspace Institute of Advanced Technology, Guangzhou University; 4 Institute of Information Engineering, Chinese Academy of Sciences. Correspondence to: Chuan Zhou. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Modeling such data has become increasingly important for tasks such as predicting the occurrence of future events (Du et al., 2016; Mei & Eisner, 2017; Zhang et al., 2020a; Zuo et al., 2020), detecting anomalies in event sequences (Liu & Hauskrecht, 2021; Shchur et al., 2021; Zhang et al., 2023), and performing causal inference on events (Xu et al., 2016; Zhang et al., 2020b; Gao et al., 2021).

Temporal point processes (TPPs) (Daley et al., 2003) serve as a useful mathematical tool for modeling sequences of discrete events in continuous time. Classical examples of TPPs include Poisson processes (Kingman, 1992), Hawkes processes (Hawkes, 1971), and self-correcting processes (Isham & Westcott, 1979). A central concept in TPPs is the intensity process^1 (Oakes, 1975), also known as the intensity function (Zhang et al., 2020b), which measures the expected rate of event occurrence given the historical events. While these classical models exhibit favorable statistical properties, the fixed parametric form of their intensity functions prevents them from capturing complicated dynamics.

To enhance the capability of TPP models, there has been a surge of interest in modeling the intensity function as a transformation of the hidden state of a neural network. Depending on the network structure, these TPP models can be divided into two categories: those based on RNNs or Transformers (Du et al., 2016; Zuo et al., 2020; Yang et al., 2022), and those based on continuous-depth neural networks (Jia & Benson, 2019; Chen et al., 2020).
While being more expressive than classical TPPs, the former models usually assume a specific functional form for the intensity function. For example, RMTPP (Du et al., 2016) assumes that the intensity exponentially decreases or increases between events. However, relying on such an assumption limits model expressiveness when the assumption deviates from reality (Omi et al., 2019). The latter models represent the hidden state as the solution to a neural jump stochastic differential equation (Jia & Benson, 2019). These models, however, provide no theoretical guarantee for the global existence and uniqueness of the solution.

In this paper, we provide a new view of TPPs by reformulating the intensity process as the solution to a stochastic differential equation (SDE) (Ikeda & Watanabe, 2014). Specifically, we first derive equivalent SDE formulations of the classical TPPs mentioned above. From these SDE formulations, we observe that the coefficient functions in the SDE play a key role in shaping the evolution of the intensity process over time and in revealing the influences between events. Based on these observations, we introduce the Neural Jump-Diffusion Temporal Point Process (NJDTPP), whose intensity process is governed by a neural jump-diffusion SDE (NJDSDE). The drift, diffusion, and jump coefficient functions in the NJDSDE are parameterized by three neural networks, i.e., the drift net, diffusion net, and jump net. Concretely, the drift net captures the intrinsic evolution of the intensity process, the diffusion net models Gaussian noise via Brownian motion (Wang et al., 2017; 2018), and the jump net captures the influences between events, such as excitatory and inhibitory influences. Remarkably, our NJDTPP model does not require a specific functional form for the intensity function.

^1 In this paper, we use intensity process and intensity function interchangeably.
Instead, by using the drift, diffusion, and jump nets, the solution to the NJDSDE can implicitly determine a free-form intensity process consistent with the observed event data. We summarize our contributions as follows:

- **Theoretical Analysis.** We prove the equivalent SDE formulations of several classical TPPs. For the SDE formulation, we provide a sufficient condition for the existence of a unique positive solution. Moreover, we theoretically analyze the existence and uniqueness of the solution to the proposed neural jump-diffusion SDE.
- **Unified Framework.** By viewing the intensity process as the solution to an SDE, we propose a unified TPP framework, NJDTPP, which can learn a free-form intensity process consistent with the observed data. A number of classical TPPs can be interpreted as special cases of our framework with simple coefficient functions.
- **Extensive Experiments.** We conduct experiments on three synthetic and six real-world datasets to evaluate the performance of NJDTPP. Experimental results show that NJDTPP successfully captures the dynamics of intensity processes and achieves state-of-the-art results in the tasks of likelihood evaluation and event prediction.

2. Related Work

Neural Temporal Point Processes. Neural TPPs, which combine TPPs with neural networks, have received considerable attention (Du et al., 2016; Mei & Eisner, 2017; Zhang et al., 2020a; Zuo et al., 2020; Lin et al., 2021; Yang et al., 2022). While being more expressive than classical parametric models, neural TPPs usually assume a specific functional form for the intensity function. For example, RMTPP (Du et al., 2016) assumes that the intensity exponentially decreases or increases between events; THP (Zuo et al., 2020) utilizes the softplus function so that the intensity between events is approximately linearly interpolated. However, relying on such an assumption can undermine model effectiveness if the assumption deviates from reality.
In addition to the dominant paradigm of parameterizing intensity functions, alternative methods model cumulative intensity functions (Omi et al., 2019) or conditional density functions (Shchur et al., 2019). However, these methods may not fully capture the dynamics of the intensity process. In contrast to existing studies, our model formulates the intensity process as the solution to an SDE without relying on any specific functional form.

Neural Differential Equations. Neural differential equations (NDEs) (Kidger et al., 2021a) are differential equations whose coefficient functions are parameterized by neural networks. Many NDEs, including neural ODEs and their variants (Chen et al., 2018; Rubanova et al., 2019; Kidger et al., 2020; Herrera et al., 2020), as well as neural SDEs (Li et al., 2020; Kong et al., 2020; Kidger et al., 2021a;b), have been proposed for modeling time series. However, there is a distinction between time series and event sequences (Xiao et al., 2017). In time series, time serves only as the index that orders the sequence of values of the target variable. In event sequences, time is a random variable representing the timestamp of an asynchronous event, and time itself is the subject of study. Therefore, many existing NDE-based models are not directly suitable for modeling event sequences. While Jia & Benson (2019) and Chen et al. (2020) utilize NDEs to model event sequences, they actually capture the dynamics of the hidden state of a neural network. Moreover, they focus solely on the jump term, neglecting the diffusion term associated with randomness driven by Brownian motion. In contrast, we incorporate Brownian motion to model Gaussian noise, and, more importantly, our proposed NJDSDE models the dynamics of the intensity process itself.

Equivalent SDE Formulations for TPPs. Wang et al. (2018) provided a jump-diffusion SDE framework for modeling user activities.
They introduced the diffusion term to model Gaussian noise, such as fluctuations in the dynamics caused by unobserved factors. However, their use of fixed linear coefficient functions in the SDE might not fully capture the actual intensity. In contrast, we employ neural networks to parameterize the coefficient functions, allowing for a more flexible model of the intensity that better aligns with the observed data. While De et al. (2016), Zarezade et al. (2017), and Wang et al. (2018) established the equivalent SDE formulation for Hawkes processes, we provide a distinct proof method. Besides, we derive equivalent SDE formulations for several other classical TPPs, such as Poisson processes and self-correcting processes. Moreover, for the SDE formulation, we provide a sufficient condition for the existence of a unique positive solution.

3. Background

In this section, we provide a brief overview of temporal point processes and jump-diffusion stochastic differential equations.

3.1. Temporal Point Processes

A temporal point process (TPP) (Daley et al., 2003) is a stochastic process $\{t_i\}_{i=1}^{\infty}$, in which the non-negative random variable $t_i$ represents the occurrence time of the $i$-th event and $t_i < t_{i+1}$. Such a process can be equivalently represented as a counting process $\{N_t\}_{t \ge 0}$, where $N_t$ is the number of events up to time $t$. The most common way to characterize a TPP is via its intensity process (Oakes, 1975), also known as the intensity function. Specifically, the intensity process of $\{N_t\}_{t \ge 0}$ is a left-continuous process with right limits, $\{\lambda(t \mid \mathcal{F}_{t-})\}_{t \ge 0}$, denoted for simplicity as $\{\lambda_t\}_{t \ge 0}$, where $\lambda_t$ measures the expected rate of events occurring in an infinitesimal window $(t, t+dt]$ given the historical events up to time $t$. Formally,

$$\lambda_t \, dt = \mathbb{P}(dN_t = 1 \mid \mathcal{F}_{t-}) = \mathbb{E}[dN_t \mid \mathcal{F}_{t-}], \quad (1)$$

where $\mathcal{F}_{t-} = \sigma(N_s : 0 \le s < t)$ and the jump size $dN_t = N_{t+dt} - N_t \in \{0, 1\}$.
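As a quick sanity check on Eq.(1), a homogeneous Poisson process with constant rate $\lambda$ satisfies $\mathbb{E}[N_T] = \lambda T$. The NumPy sketch below (our own illustration, not part of the paper) simulates such a process via i.i.d. exponential inter-arrival times and verifies the expected count empirically:

```python
import numpy as np

def simulate_homogeneous_poisson(lam, T, rng):
    """Draw event times of a homogeneous Poisson process with rate lam on [0, T]."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam)  # i.i.d. exponential inter-arrival times
        if t > T:
            return np.array(times)
        times.append(t)

rng = np.random.default_rng(0)
lam, T = 2.0, 10.0
counts = [len(simulate_homogeneous_poisson(lam, T, rng)) for _ in range(2000)]
mean_count = np.mean(counts)  # should be close to lam * T = 20
```

Averaging over many realizations, the empirical event count concentrates around $\lambda T$, which is the degenerate (history-independent) case of the conditional expectation in Eq.(1).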
In the following, we review several classical TPPs, whose intensity functions have a fixed parametric form.

Poisson processes (Kingman, 1992). The intensity function of the Poisson process $\{N_t\}_{t \ge 0}$ is independent of the event history. The simplest case is a homogeneous Poisson process, where the intensity is a positive constant:

$$\lambda_t = \lambda > 0. \quad (2)$$

For the more general inhomogeneous Poisson process, the intensity is a function varying over time:

$$\lambda_t = g(t) > 0. \quad (3)$$

Hawkes processes (Hawkes, 1971). The Hawkes process $\{N_t\}_{t \ge 0}$ with the widely used exponential kernel assumes that events are self-exciting. The arrival of a new event results in a sudden increase in intensity, and this influence decays exponentially:

$$\lambda_t = \mu + \alpha \sum_{i:\, t_i < t} e^{-\beta (t - t_i)}, \quad (4)$$

where $\mu > 0$, $\alpha > 0$ and $\beta > 0$.

Self-correcting processes (Isham & Westcott, 1979). In contrast to the Hawkes process, the self-correcting process $\{N_t\}_{t \ge 0}$ assumes that a new event inhibits future events, while the intensity grows exponentially over time:

$$\lambda_t = \exp\Big(\mu t - \sum_{i:\, t_i < t} \alpha\Big), \quad (5)$$

where $\mu > 0$ and $\alpha > 0$.

3.2. Jump-Diffusion Stochastic Differential Equations

One-dimensional autonomous jump-diffusion stochastic differential equations (JDSDEs) (Hanson, 2007) with initial conditions are of the form

$$dX_t = f(X_t) \, dt + g(X_t) \, dW_t + h(X_t) \, dN_t, \quad X_0 = x_0, \quad (6)$$

where $x_0 \in \mathbb{R}$ is the initial value, $f: \mathbb{R} \to \mathbb{R}$ is the drift coefficient function, $g: \mathbb{R} \to \mathbb{R}$ is the diffusion coefficient function, $h: \mathbb{R} \to \mathbb{R}$ is the jump coefficient function, $\{W_t\}_{t \ge 0}$ is a standard Brownian motion, and $\{N_t\}_{t \ge 0}$ is a counting process that jumps at times $\{t_i\}_{i=1}^{\infty}$. Suppose that $\{W_t\}_{t \ge 0}$ and $\{N_t\}_{t \ge 0}$ are independent. It is essential to highlight that the process $\{N_t\}_{t \ge 0}$ in Eq.(6) is the general counting process introduced in Section 3.1, in contrast to many previous works (Cyganowski et al., 2002; Hanson, 2007; Lamberton & Lapeyre, 2011) that focus on a Poisson process.
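To make the flow-and-jump behavior of Eq.(6) concrete, consider the case $f(\lambda) = \beta(\mu - \lambda)$, $g \equiv 0$, and $h \equiv \alpha$, which is exactly the jump SDE proved equivalent to the Hawkes process in Theorem 2 below. The NumPy sketch here (our own numerical check, not from the paper) solves that SDE piecewise, using the exact solution of the linear ODE between jumps, and confirms that it reproduces the closed-form intensity Eq.(4):

```python
import numpy as np

def hawkes_closed_form(t, events, mu, alpha, beta):
    """Eq.(4), evaluated as a left limit (only events strictly before t count)."""
    past = events[events < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def hawkes_via_jump_sde(t, events, mu, alpha, beta):
    """Piecewise solution of the jump SDE: between jumps,
    d lambda = beta (mu - lambda) dt has the exact solution
    lambda(s) = mu + (lambda_prev - mu) exp(-beta (s - s_prev));
    at each event time the intensity jumps by alpha."""
    lam, s_prev = mu, 0.0                                      # initial value lambda_0 = mu
    for ti in events[events < t]:
        lam = mu + (lam - mu) * np.exp(-beta * (ti - s_prev))  # flow to the jump time
        lam += alpha                                           # jump of size alpha
        s_prev = ti
    return mu + (lam - mu) * np.exp(-beta * (t - s_prev))      # flow to the query time

events = np.array([0.7, 1.3, 2.9])
for t in [0.5, 1.0, 2.0, 4.0]:
    a = hawkes_closed_form(t, events, 0.4, 0.8, 1.2)
    b = hawkes_via_jump_sde(t, events, 0.4, 0.8, 1.2)
    assert abs(a - b) < 1e-10
```

The two evaluations agree to floating-point precision, illustrating how a JDSDE reduces to an ODE (here with $g \equiv 0$) between jumps and applies an instantaneous increment of size $h$ at each event.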
The JDSDE Eq.(6) is interpreted as a stochastic integral equation (Cyganowski et al., 2002):

$$X_t = x_0 + \int_0^t f(X_s) \, ds + \int_0^t g(X_s) \, dW_s + \int_0^t h(X_s) \, dN_s,$$

where the first integral is a Riemann integral, the second is an Itô integral, and the third is a Riemann-Stieltjes integral. In fact, Eq.(6) behaves as an ordinary Itô SDE (Cyganowski et al., 2002; Hanson, 2007) between jumps of $\{N_t\}_{t \ge 0}$. This can be expressed as

$$dX_t = f(X_t) \, dt + g(X_t) \, dW_t, \quad t \in (t_{i-1}, t_i].$$

On the other hand, at a jump time $t_i$, $\{N_t\}_{t \ge 0}$ has a jump of size $\Delta N_{t_i} = 1$, which implies that the process $\{X_t\}_{t \ge 0}$ has a jump of size

$$\Delta X_{t_i} = X_{t_i+} - X_{t_i} = h(X_{t_i}) \, \Delta N_{t_i} = h(X_{t_i}),$$

where $X_{t_i+} = \lim_{s \downarrow t_i} X_s$. Then $X_{t_i+} = X_{t_i} + h(X_{t_i})$.

4. Equivalent SDE Formulations for TPPs

In this section, we derive equivalent SDE formulations of several classical TPPs, expressing each intensity process as the solution to a corresponding SDE. For the SDE formulation, we then provide a sufficient condition for the existence of a unique positive solution.

Theorem 1. The intensity processes of homogeneous and inhomogeneous Poisson processes can be equivalently expressed as solutions to the following ODEs, respectively (these ODEs can be viewed as degenerate forms of SDEs):

$$d\lambda_t = 0, \quad \lambda_0 = \lambda, \quad (7)$$

$$d\lambda_t = g'(t) \, dt, \quad \lambda_0 = g(0), \quad (8)$$

where $\lambda > 0$ and $g(t) > 0$ is assumed to be differentiable.

According to Eq.(2) and Eq.(3), Theorem 1 is evident. Subsequently, we establish equivalent SDE formulations for Hawkes processes and self-correcting processes.

Theorem 2. The intensity process $\{\lambda_t\}_{t \ge 0}$ of the Hawkes process $\{N_t\}_{t \ge 0}$ can be equivalently expressed as the solution to the jump SDE

$$d\lambda_t = \beta(\mu - \lambda_t) \, dt + \alpha \, dN_t, \quad \lambda_0 = \mu. \quad (9)$$

Proof. See Appendix A.1. The proof sketch is as follows. Taking inspiration from (Björk, 2021), we solve the above SDE. Let the jump times of $\{N_t\}_{t \ge 0}$ be $\{t_i\}_{i=1}^{\infty}$; then Eq.(9) behaves as the ODE $d\lambda_t = \beta(\mu - \lambda_t) \, dt$ between these jump times.
At a jump time $t_i$, the jump size is $\alpha$, leading to $\lambda_{t_i+} = \lambda_{t_i} + \alpha$. Iteratively solving this ODE between jumps with the initial value $\lambda_{t_{i-1}+}$, we establish that the intensity process Eq.(4) satisfies Eq.(9).

Theorem 3. The intensity process $\{\lambda_t\}_{t \ge 0}$ of the self-correcting process $\{N_t\}_{t \ge 0}$ can be equivalently expressed as the solution to the jump SDE

$$d\lambda_t = \mu \lambda_t \, dt + \big(e^{-\alpha} - 1\big) \lambda_t \, dN_t, \quad \lambda_0 = 1. \quad (10)$$

The proof of this theorem is similar to the previous one and can be found in Appendix A.2. The following result shows that, under certain conditions, there exists a unique positive solution to an SDE, which means that an SDE can determine the intensity process of a TPP.

Theorem 4. Assume that the ODE $dy_t = f(e^{y_t}) e^{-y_t} \, dt$, $t \ge 0$, $y_0 = y$, has a unique global solution for every $y \in \mathbb{R}$, and let $h: \mathbb{R} \to \mathbb{R}$ be a chosen function such that $h(x) + x > 0$ for $x > 0$. Then the jump SDE

$$d\lambda_t = f(\lambda_t) \, dt + h(\lambda_t) \, dN_t, \quad \lambda_0 = \lambda, \quad (11)$$

has a unique global positive solution for every $\lambda > 0$.

Appendix A.3 includes the detailed proof. In particular, according to this theorem, by setting $f(x) = \mu x$ and $h(x) = (e^{-\alpha} - 1)x$, it follows that Eq.(10) has a unique global positive solution.

From the above equivalent SDE formulations of several classical TPPs, we can clearly see that the coefficient functions within the SDE play a key role in shaping the evolution of the intensity process over time and in revealing the influences between events. For example, in Hawkes processes (Eq.(9)), there exist excitatory influences between events, where each occurrence of an event leads to an instantaneous increase in intensity by $\alpha$. This suggests that, by defining appropriate coefficient functions, it becomes feasible to construct an intensity process consistent with the observed data. These observations motivate our model, the Neural Jump-Diffusion Temporal Point Process.

5. Neural Jump-Diffusion TPPs

In this section, for notational simplicity and reader comprehension, we first model the intensity process of univariate TPPs.
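Before introducing the model, the equivalence in Theorem 3 can also be checked numerically: solving Eq.(10) by alternating deterministic flow and multiplicative jumps reproduces the closed-form self-correcting intensity Eq.(5), and the solution stays strictly positive, as Theorem 4 guarantees. A NumPy sketch (our own illustration, not from the paper):

```python
import numpy as np

def self_correcting_closed_form(t, events, mu, alpha):
    """Eq.(5): lambda_t = exp(mu t - alpha N_{t-}), with N_{t-} = #{t_i < t}."""
    return np.exp(mu * t - alpha * np.sum(events < t))

def self_correcting_via_jump_sde(t, events, mu, alpha):
    """Piecewise solution of Eq.(10): between jumps, d lambda = mu lambda dt
    gives exponential growth; at each event the jump (e^{-alpha} - 1) lambda
    multiplies the intensity by e^{-alpha}, so it remains positive."""
    lam, s_prev = 1.0, 0.0                   # initial value lambda_0 = 1
    for ti in events[events < t]:
        lam *= np.exp(mu * (ti - s_prev))    # deterministic flow to the jump time
        lam += (np.exp(-alpha) - 1.0) * lam  # jump: lam -> e^{-alpha} lam
        s_prev = ti
    return lam * np.exp(mu * (t - s_prev))   # flow to the query time

events = np.array([0.8, 1.5, 3.0])
for t in [0.5, 1.0, 2.0, 4.0]:
    a = self_correcting_closed_form(t, events, 0.6, 0.9)
    b = self_correcting_via_jump_sde(t, events, 0.6, 0.9)
    assert abs(a - b) < 1e-12
```

Note how the jump coefficient $h(x) = (e^{-\alpha} - 1)x$ satisfies the condition $h(x) + x = e^{-\alpha} x > 0$ of Theorem 4, which is what keeps the piecewise solution positive.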
Subsequently, we extend our method to multivariate TPPs, yielding a more comprehensive model.

5.1. Neural Jump-Diffusion Univariate Point Process

Unlike classical TPPs with known linear drift and jump coefficient functions, we consider a general problem where the dynamics of the intensity process are completely unknown. Specifically, we assume access to a large set of event sequences, each denoted as $S = \{t_i\}_{i=1}^n$, representing independent realizations of a counting process $\{N_t\}_{t \ge 0}$. The objective is to identify the unknown dynamics governing the intensity process $\{\lambda_t\}_{t \ge 0}$ of $\{N_t\}_{t \ge 0}$. To this end, we propose the Neural Jump-Diffusion Univariate Point Process, whose intensity process is governed by a neural jump-diffusion SDE (NJDSDE). The drift, diffusion, and jump coefficient functions in the NJDSDE are parameterized by three neural networks, called the drift net, diffusion net, and jump net, respectively. To ensure that the intensity $\{\lambda_t\}_{t \ge 0}$ remains positive, we introduce the log-intensity process $\eta_t := \log \lambda_t$. We formally present the NJDSDE for $\{\eta_t\}_{t \ge 0}$ as follows:

$$d\eta_t = \underbrace{f_{\theta_f}(\eta_t)}_{\text{drift net}} \, dt + \underbrace{g_{\theta_g}(\eta_t)}_{\text{diffusion net}} \, dW_t + \underbrace{h_{\theta_h}(\eta_t)}_{\text{jump net}} \, dN_t, \quad \eta_0 = \log \lambda_0, \quad (12)$$

where $\eta_0 \in \mathbb{R}$ is the initial value, $f_{\theta_f}: \mathbb{R} \to \mathbb{R}$, $g_{\theta_g}: \mathbb{R} \to \mathbb{R}$, $h_{\theta_h}: \mathbb{R} \to \mathbb{R}$, $\{W_t\}_{t \ge 0}$ is a standard Brownian motion (Le Gall, 2016), and $\{N_t\}_{t \ge 0}$ is the counting process mentioned above, which records the occurrence of events. Suppose that $\{W_t\}_{t \ge 0}$ and $\{N_t\}_{t \ge 0}$ are independent. We explain each term in Eq.(12) in detail:

- The drift term $f_{\theta_f}(\eta_t) \, dt$ captures the intrinsic evolution of $\{\eta_t\}_{t \ge 0}$.
- The diffusion term $g_{\theta_g}(\eta_t) \, dW_t$ models Gaussian noise via Brownian motion. Inspired by (Wang et al., 2018), we add the diffusion term to model the impact of noise on the intensity process.
- The jump term $h_{\theta_h}(\eta_t) \, dN_t$ represents the magnitude of the jump, capturing the influence of historical events up to time $t$. Its sign indicates whether the influence is excitatory or inhibitory.
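As an illustration of how the three coefficient nets might look, here is a minimal NumPy sketch of scalar MLPs for $f_{\theta_f}$, $g_{\theta_g}$, and $h_{\theta_h}$. The layer sizes and initialization are our own illustrative choices, not the paper's implementation; tanh is used because it is 1-Lipschitz, which keeps each net uniformly Lipschitz, a property that matters for the well-posedness result discussed below:

```python
import numpy as np

def make_mlp(sizes, rng):
    """A small tanh MLP R -> R. Since tanh is 1-Lipschitz, the whole network is
    uniformly Lipschitz, with constant bounded by the product of weight norms."""
    params = [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    def net(x):
        h = np.atleast_1d(np.asarray(x, dtype=float))
        for W, b in params[:-1]:
            h = np.tanh(h @ W + b)    # 1-Lipschitz nonlinearity
        W, b = params[-1]
        return float((h @ W + b)[0])  # linear output layer
    return net

rng = np.random.default_rng(0)
f_net = make_mlp([1, 16, 1], rng)  # drift net
g_net = make_mlp([1, 16, 1], rng)  # diffusion net
h_net = make_mlp([1, 16, 1], rng)  # jump net
```

Swapping tanh for other Lipschitz activations such as ReLU or sigmoid preserves this property.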
The proposed NJDSDE Eq.(12) is a general framework. When $f_{\theta_f}$ is set to $\beta(\mu e^{-\eta_t} - 1)$, $g_{\theta_g}$ is set to $0$, and $h_{\theta_h}$ is set to $\log(1 + \alpha e^{-\eta_t})$ in Eq.(12), the NJDSDE characterizes Hawkes processes. Similarly, when $f_{\theta_f}$ is set to $\mu$, $g_{\theta_g}$ is set to $0$, and $h_{\theta_h}$ is set to $-\alpha$, Eq.(12) characterizes self-correcting processes. The analogous results for Poisson processes are trivial. In other words, the proposed NJDSDE encompasses the classical TPPs mentioned above. In addition, a specific class (but not all) of log-Gaussian Cox processes (Møller et al., 1998) can also be incorporated into our modeling framework Eq.(12). The proofs of these conclusions are detailed in Appendix A.4.

We proceed to investigate the existence and uniqueness of the solution $\{\eta_t\}_{t \ge 0}$ to the proposed NJDSDE. The theoretical analysis in the following theorem provides insight into designing an effective network architecture for the drift net $f_{\theta_f}$, diffusion net $g_{\theta_g}$, and jump net $h_{\theta_h}$.

Theorem 5. Assume that $f_{\theta_f}(x)$, $g_{\theta_g}(x)$, $h_{\theta_h}(x)$ are measurable functions $\mathbb{R} \to \mathbb{R}$, $h_{\theta_h}(x)$ is continuous, and there exists a positive constant $C$ such that for all $x, y \in \mathbb{R}$,

$$|f_{\theta_f}(x) - f_{\theta_f}(y)| + |g_{\theta_g}(x) - g_{\theta_g}(y)| \le C |x - y|.$$

Then for every $\lambda_0 > 0$, there exists a unique adapted left-continuous process $\{\eta_t\}_{t \ge 0}$ with right limits that satisfies Eq.(12).

The proof is available in Appendix A.5. According to Theorem 5, if $f_{\theta_f}(x)$, $g_{\theta_g}(x)$ and $h_{\theta_h}(x)$ are uniformly Lipschitz continuous, then Eq.(12) has a unique strong solution. Thus, we utilize Lipschitz nonlinear activations, such as ReLU, sigmoid, and tanh, within the network architectures, as highlighted in previous works (Anil et al., 2019; Kong et al., 2020; Oh et al., 2024; Lin et al., 2024). Moreover, in this paper, the drift net, diffusion net, and jump net are implemented as three multi-layer perceptrons (MLPs).

Remarks.
We summarize the differences between our model and existing TPP models:

- Different from the SDE formulations of classical TPPs (e.g., Eq.(9)), the coefficient functions in our model are parameterized by neural networks rather than fixed functions. This enables a more flexible modeling of the complex dynamics of the intensity process.
- Compared to neural TPPs (Du et al., 2016; Zuo et al., 2020), our model eliminates the need to assume a specific functional form for the intensity function. Instead, based on the NJDSDE, our model formulates the time evolution of the intensity process in a general manner.
- Furthermore, our model differs from previous TPP models based on neural differential equations (Jia & Benson, 2019; Chen et al., 2020; Song et al., 2024). In addition to incorporating Brownian motion to model Gaussian noise, a key distinction lies in our proposed NJDSDE, which models the dynamics of the intensity process rather than the hidden state.

5.2. Model Training

To learn the model parameters in $f_{\theta_f}$, $g_{\theta_g}$, $h_{\theta_h}$, and the initial value $\eta_0$, we perform Maximum Likelihood Estimation (MLE). For an event sequence $S = \{t_i\}_{i=1}^n$ over the time interval $[0, T]$ with intensity $\lambda_t$, the log-likelihood function (Rasmussen, 2018) is

$$\ell(S) = \sum_{i=1}^n \log \lambda_{t_i} - \int_0^T \lambda_t \, dt = \sum_{i=1}^n \eta_{t_i} - \int_0^T e^{\eta_t} \, dt.$$

In general, the integral term has no closed form. Therefore, we apply numerical integration methods for approximate calculation, such as the trapezoidal rule (Zuo et al., 2020). This requires determining the value of $\eta_t$ at the divided time points. Note that the process $\{\eta_t\}_{t \ge 0}$ is governed by our proposed NJDSDE Eq.(12). That is, on the time interval $(t_{i-1}, t_i]$, $\eta_t$ is governed by the neural SDE

$$d\eta_t = f_{\theta_f}(\eta_t) \, dt + g_{\theta_g}(\eta_t) \, dW_t. \quad (13)$$

At a jump time $t_i$, the jump size of $\eta_t$ is given by

$$\Delta\eta_{t_i} = \eta_{t_i+} - \eta_{t_i} = h_{\theta_h}(\eta_{t_i}) \, \Delta N_{t_i} = h_{\theta_h}(\eta_{t_i}). \quad (14)$$

Then the right limit of $\eta_t$ at $t_i$ is

$$\eta_{t_i+} = \eta_{t_i} + h_{\theta_h}(\eta_{t_i}). \quad (15)$$

Since the solution of neural SDEs (e.g., Eq.(13)) is generally analytically intractable, numerical approximation methods are often required (Kong et al., 2020; Kidger et al., 2021b). We adopt the Euler-Maruyama scheme (Kloeden & Platen, 1992) with a fixed step size due to its computational efficiency. Under such a scheme, the time interval $(t_{i-1}, t_i]$ is divided into $N$ subintervals $t_{i-1} = \tau_0^i < \cdots < \tau_k^i < \cdots < \tau_N^i = t_i$ with step size $\Delta_k^i = \tau_{k+1}^i - \tau_k^i = (t_i - t_{i-1})/N$. Then we discretize Eq.(13) on $(t_{i-1}, t_i]$ by the recursive equation

$$\eta_{\tau_{k+1}^i} = \eta_{\tau_k^i} + f_{\theta_f}(\eta_{\tau_k^i}) \, \Delta_k^i + g_{\theta_g}(\eta_{\tau_k^i}) \, \Delta W_k^i, \quad (16)$$

for $k = 0, 1, \ldots, N-1$ with $\eta_{\tau_0^i} = \eta_{t_{i-1}+}$. Here, $\Delta W_k^i = W_{\tau_{k+1}^i} - W_{\tau_k^i}$ is sampled from $\mathcal{N}(0, \Delta_k^i)$ for numerical computation. The advantage of introducing the log-intensity $\eta_t = \log \lambda_t$ is that a numerical solution of Eq.(12) can be obtained over the entire real line, rather than being restricted to the positive reals. Iteratively applying Eq.(15) and Eq.(16), we can calculate the log-likelihood function as

$$\ell(S) \approx \sum_{i=1}^n \eta_{t_i} - \sum_{i=1}^{n+1} \sum_{k=1}^{N} \frac{\tau_k^i - \tau_{k-1}^i}{2} \Big( e^{\eta_{\tau_{k-1}^i}} + e^{\eta_{\tau_k^i}} \Big), \quad (17)$$

where $\tau_0^1 = 0$, $\tau_N^{n+1} = T$, $\eta_{\tau_0^i} = \eta_{t_{i-1}+}$ and $\eta_{\tau_N^i} = \eta_{t_i}$. The complete algorithm for model training is described in Algorithm 1 in Appendix B.

5.3. Neural Jump-Diffusion Multivariate Point Process

An important example of multivariate TPPs is the multivariate Hawkes process $\boldsymbol{N}_t = (N_t^1, \ldots, N_t^M)^{\top}$, whose intensity process $\boldsymbol{\lambda}_t = (\lambda_t^1, \ldots, \lambda_t^M)^{\top}$ characterizes the influence of past events on future ones in an excitatory manner (Hawkes, 1971): $\lambda_t^m = \mu_0^m + \sum_{i:\, t_i < t} \alpha_{m m_i} e^{-\beta (t - t_i)}$. Analogously to the univariate case, we model the multivariate log-intensity process $\boldsymbol{\eta}_t = \log \boldsymbol{\lambda}_t$ with a multivariate NJDSDE (Eq.(18)); between events, $\boldsymbol{\eta}_t$ evolves as a neural SDE, and at a jump time $t_n$ with event type $m_n$,

$$\boldsymbol{\eta}_{t_n+} = \boldsymbol{\eta}_{t_n} + h_{\theta_h}^{m_n}(\boldsymbol{\eta}_{t_n}). \quad (22)$$

Therefore, similar to the method discussed in Section 5.2 for computing the log-likelihood function, we utilize the Euler-Maruyama scheme to discretize Eq.(22), followed by numerical integration techniques to compute the integrals mentioned above.
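Returning to the univariate computation of Section 5.2, Eqs.(15)-(17) can be sketched compactly in NumPy. The toy coefficient functions below stand in for the trained drift, diffusion, and jump nets; all names and parameter values are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def njdtpp_loglik(events, T, f, g, h, eta0, N, rng):
    """Sketch of Eqs.(15)-(17): Euler-Maruyama steps (Eq.(16)) between events,
    the jump update (Eq.(15)) at each event, and the trapezoidal rule for the
    compensator integral of exp(eta) (Eq.(17))."""
    grid_times = np.concatenate([[0.0], events, [T]])
    eta, loglik, integral = eta0, 0.0, 0.0
    for a, b in zip(grid_times[:-1], grid_times[1:]):
        dt = (b - a) / N
        for _ in range(N):                 # Eq.(16) on (a, b]
            dW = rng.normal(0.0, np.sqrt(dt))
            eta_next = eta + f(eta) * dt + g(eta) * dW
            integral += 0.5 * dt * (np.exp(eta) + np.exp(eta_next))  # trapezoid
            eta = eta_next
        if b in events:                    # b = t_i: eta is the left limit eta_{t_i}
            loglik += eta                  # sum of log-intensities at event times
            eta = eta + h(eta)             # jump update, Eq.(15)
    return loglik - integral

rng = np.random.default_rng(0)
events = np.array([0.5, 1.2, 2.0])
ll = njdtpp_loglik(events, T=3.0,
                   f=lambda e: -0.1 * e,  # toy drift
                   g=lambda e: 0.1,       # toy diffusion
                   h=lambda e: 0.5,       # toy jump
                   eta0=0.0, N=20, rng=rng)
```

In training, the same computation would be run on the parameterized nets and $\ell(S)$ maximized by gradient ascent; with the diffusion set to zero and constant $\eta_t \equiv 0$, the sketch reduces to the exact value $n \cdot 0 - T$, which makes it easy to verify.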
Here, the value of $\eta_{t_n}$ required for discretizing Eq.(22) can be obtained by discretizing Eq.(18) over the interval $[0, t_n]$ using the Euler-Maruyama scheme and the historical events $\mathcal{H}_{t_{n+1}}$. Following previous works (Zuo et al., 2020; Shi et al., 2023; Xue et al., 2024), the next event type prediction is given by

$$\hat{m}_{n+1} = \arg\max_m \ \lambda_{t_{n+1}}^m / \lambda_{t_{n+1}}^g. \quad (23)$$

6. Experiments

We first test the flexibility of our NJDTPP model by recovering the ground-truth dynamics of the intensity process of classical TPPs. Then, we evaluate the modeling capability for event sequences and the prediction performance of NJDTPP on six real-world datasets. Our code is available at https://github.com/Zh-Shuai/NJDTPP.

6.1. Intensity Process Recovery for Classical TPPs

Synthetic Datasets. We consider the following classical TPPs: (i) Poisson Process: the intensity is given by $\lambda_t = \lambda_0$, where $\lambda_0 = 1.0$; (ii) Hawkes Process: the intensity is given by $\lambda_t = \mu + \alpha \sum_{i:\, t_i < t} e^{-\beta (t - t_i)}$