
Hawkes Process Based on Controlled Differential Equations

Minju Jo , Seungji Kook and Noseong Park Yonsei University, Seoul, South Korea {alflsowl12,202132139,noseong}@yonsei.ac.kr

Hawkes processes are a popular framework to model the occurrence of sequential events, i.e., occurrence dynamics, in several fields such as social diffusion. In real-world scenarios, the inter-arrival time among events is irregular. However, existing neural network-based Hawkes process models not only i) fail to capture such complicated irregular dynamics but also ii) resort to heuristics to calculate the log-likelihood of events since they are mostly based on neural networks designed for regular discrete inputs. To this end, we present the concept of Hawkes process based on controlled differential equations (HP-CDE), by adopting the neural controlled differential equation (neural CDE) technology which is an analogue to continuous RNNs. Since HP-CDE continuously reads data, i) irregular time-series datasets can be properly treated preserving their uneven temporal spaces, and ii) the log-likelihood can be exactly computed. Moreover, as both Hawkes processes and neural CDEs are first developed to model complicated human behavioral dynamics, neural CDE-based Hawkes processes are successful in modeling such occurrence dynamics. In our experiments with 4 real-world datasets, our method outperforms existing methods by non-trivial margins.

1 Introduction

Real-world phenomena typically correspond to the occurrence of sequential events with irregular time intervals and numerous event types, ranging from online social network activities to personalized healthcare and so on [Zhao et al., 2015; Enguehard et al., 2020; Stoyan and Penttinen, 2000; Mohler et al., 2011; Ogata, 1999]. Hawkes processes and Poisson point processes are typically used to model those sequential events [Hawkes, 1971; Miles, 1970; Streit, 2010]. However, their basic assumptions are too stringent to model such complicated dynamics, e.g., all past events should influence the occurrence of the current event. To this end, many advanced techniques have been proposed for the past several years, ranging from classical recurrent neural network (RNN) based models such as RMTPP [Du et al., 2016] and NHP [Mei and Eisner, 2017] to recent transformer models like SAHP [Zhang et al., 2020] and THP [Zuo et al., 2020]. Even so, they still do not treat data in a fully continuous way but resort to heuristics, which is sub-optimal in processing irregular events [Chen et al., 2018; Choi et al., 2021; Yildiz et al., 2019]. Likewise, their heuristic approaches to model the continuous time domain impede solving the multivariate integral of the log-likelihood calculation in Eq. (4), leading to approximation methods such as Monte Carlo sampling (cf. Table 1). As a consequence, the strict constraint and/or the inexact calculation of the log-likelihood may induce inaccurate predictions.

| Model | Exact log-likelihood | How to model dynamics |
|---|---|---|
| NHP, SAHP, THP | X | Discrete |
| HP-CDE | O (λ* is continuous.) | Continuous & robust to irregular dynamics |

Table 1: Comparison of neural network-based Hawkes process models. λ* denotes the conditional intensity function (cf. Eqs. (4), (6), and (7)).

In this work, therefore, we model the occurrence dynamics based on differential equations, not only directly handling the sequential events in a continuous time domain but also exactly solving the integral of the log-likelihood. One more inspiration for using differential equations is that they have shown several non-trivial successes in modeling human behavioral dynamics [Poli et al., 2019; Rubanova et al., 2019; Jeon et al., 2021]. In particular, we are interested in controlled differential equations. To our knowledge, we are the first to answer the question of whether occurrence dynamics can be modeled as controlled differential equations.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23)

Controlled differential equations (CDEs [Lyons et al., 2004]) are among the most suitable tools for building human behavioral models. CDEs were first developed by a financial mathematician to model complicated dynamics in financial markets, which are a typical application domain of Hawkes processes since financial transactions are temporal point processes. In particular, neural controlled differential equations (neural CDEs [Kidger et al., 2020]), whose initial value problem (IVP) is written as below, are a set of techniques to learn CDEs from data with neural networks:

$$h(t_b) = h(t_a) + \int_{t_a}^{t_b} f(h(t); \theta_f)\, dZ(t) = h(t_a) + \int_{t_a}^{t_b} f(h(t); \theta_f)\, \frac{dZ(t)}{dt}\, dt, \tag{1}$$

where f is a CDE function, and h(t) is a hidden vector at time t. Z(t) is a continuous path created from discrete sequential observations (or events) $\{(z_j, t_j)\}_{j=a}^{b}$ by an appropriate algorithm¹, where in our case, $z_j$ is a vector containing the information of the j-th occurrence, and $t_j \in [t_a, t_b]$ contains the time-point of the occurrence, i.e., $t_j < t_{j+1}$. Note that neural CDEs keep reading the time-derivative of Z(t) over time, denoted $\dot{Z}(t) := \frac{dZ(t)}{dt}$, and for this reason, neural CDEs are, in general, considered as continuous RNNs. In addition, neural CDEs are known to be superior in processing irregular time series [Lyons et al., 2004].

Given the neural CDE framework, we propose Hawkes Process based on Controlled Differential Equations (HP-CDE). We let $z_j$ be the sum of the event embedding and the positional embedding and create a path Z(t) with the linear interpolation method, which is a widely used interpolation algorithm for neural CDEs (cf. Figure 2). To get the exact log-likelihood, we use an ODE solver to calculate the non-event log-likelihood. Calculating the non-event log-likelihood involves the integral problem in Eq. (4), and our method can solve it exactly since the conditional intensity function λ*, which indicates an instantaneous probability of an event, is defined in a continuous manner over time by the neural CDE technology. In addition, we have three prediction layers to predict the event log-likelihood, the event type, and the event occurrence time (cf. Eqs. (8), (12), (13) and Figure 3).

We conduct event prediction experiments with 4 datasets and 4 baselines. Our method shows outstanding performance in all three aspects: i) event type prediction, ii) event time prediction, and iii) log-likelihood. Our contributions are as follows:

1. We model the continuous occurrence dynamics under the framework of neural CDE whose original theory was developed for describing irregular non-linear dynamics. Many real-world Hawkes process datasets have irregular inter-arrival times of events.

2. We then exactly solve the integral problem in Eq. (4) to calculate the non-event log-likelihood, which had been done typically through heuristic methods before our work.

2 Preliminaries

¹ One can use interpolation algorithms or neural networks for creating Z(t) from $\{(z_j, t_j)\}_{j=a}^{b}$ [Kidger et al., 2020].

2.1 Multivariate Point Processes

Multivariate point processes are a generative model of an event sequence $X = \{(k_j, t_j)\}_{j=1}^{N}$, where $x_j = (k_j, t_j)$ indicates the j-th event in the sequence. This event sequence is a subset of an event stream under a continuous time interval $[t_1, t_N]$, and an observation $x_j$ at time $t_j$ has an event type $k_j \in \{1, \dots, K\}$, where K is the total number of event types. The arrival times of events are ordered as $t_1 < t_2 < \dots < t_N$. The point process model learns a probability for every (k, t) pair, where $k \in \{1, \dots, K\}$ and $t \in [t_1, t_N]$. The key feature of multivariate point processes is the intensity function $\lambda_k(t)$, i.e., the probability that a type-k event occurs in the infinitesimal time interval [t, t + dt). The Hawkes process, one popular point process model, assumes that the intensity $\lambda_k(t)$ of type k can be calculated from past events before t, the so-called history $\mathcal{H}_t$, and its form is as follows:

$$\lambda^*_k(t) := \lambda_k(t \mid \mathcal{H}_t) = \mu_k + \sum_{j: t_j < t} \psi_k(t - t_j), \tag{2}$$

where $\lambda^*(t) = \sum_{k=1}^{K} \lambda^*_k(t)$, $\mu_k$ is the base intensity, and $\psi_k(\cdot)$ is a pre-determined decaying function for type k. We use * to represent conditioning on the history $\mathcal{H}_t$. According to the formula, all the past events affect the probability of a new event occurrence with different influences. However, the intensity converges to the base intensity as the decaying function approaches zero. More recently, deep learning has been applied to Hawkes processes by parameterizing the intensity function. For instance, RNNs are used in the neural Hawkes process (NHP) [Mei and Eisner, 2017], and its intensity function is defined as follows:

$$\lambda^*(t) = \sum_{k=1}^{K} \phi_k(w_k^\top h(t)), \quad t \in [t_1, t_N], \tag{3}$$

where $\phi_k(\cdot)$ is the softplus function, h(t) is a hidden state from RNNs, and $w_k$ is a weight for each event type. The softplus function keeps intensity values positive. However, one downside of NHP is that RNN-based models assume that events have regular intervals. Thus, one of the main issues in NHP is how to fit a model to a continuous irregular time domain.
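To make Eq. (2) concrete, the following minimal sketch evaluates the classical intensity for a single event type. The exponential kernel ψ(s) = α·exp(−δs), the parameter values, and the event times are illustrative assumptions, not taken from the paper.

```python
import math

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, delta=1.0):
    """Eq. (2) for one event type, with an assumed exponential kernel
    psi(s) = alpha * exp(-delta * s); mu is the base intensity."""
    return mu + sum(alpha * math.exp(-delta * (t - tj))
                    for tj in event_times if tj < t)

events = [1.0, 1.5, 4.0]  # hypothetical, irregular arrival times
for t in (0.5, 1.6, 3.9, 10.0):
    print(f"lambda({t}) = {hawkes_intensity(t, events):.4f}")
```

Right after the two close events the intensity is elevated, and long after the last event it falls back toward the base intensity µ, which is the behavior described above.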

2.2 Neural Network-based Hawkes Processes

Hawkes processes are a popular temporal prediction framework in various fields since they predict both when and which type of event will happen in a mathematically grounded way. They are especially widely used in sociology to capture the diffusion of information [Hardiman et al., 2013; Kobayashi and Lambiotte, 2016; Da Fonseca and Zaatour, 2014], in seismology to model when earthquakes and aftershocks occur, in medicine to track the status of patients [Choi et al., 2015; Garetto et al., 2021], and so on.

To enhance the performance of Hawkes processes, many deep learning approaches have been applied. The two basic approaches are the recurrent marked temporal point process (RMTPP [Du et al., 2016]) and the neural Hawkes process (NHP [Mei and Eisner, 2017]). RMTPP is the first model that combines RNNs with point processes, and NHP is a Hawkes process model with an RNN-parameterized intensity function. Based on NHP, the self-attentive Hawkes process (SAHP [Zhang et al., 2020]) attaches self-attention modules to reflect the relationships between events. Additionally, the transformer Hawkes process (THP [Zuo et al., 2020]) uses the transformer technology [Vaswani et al., 2017], one of the most popular structures in natural language processing, to capture both short-term and long-term temporal dependencies of event sequences.

One important issue of neural network-based Hawkes processes is how to handle irregular time-series datasets. To deal with this issue, NHP uses continuous-time LSTMs, whose memory cell exponentially decays. SAHP and THP both employ modified positional encoding schemes to represent irregular time intervals since the conventional encoding assumes regular spaces between events. However, all mentioned approaches still do not explicitly process irregular time-series. In contrast to them, our HP-CDE is robust to irregular time-series since the original motivation of neural CDEs is better processing of irregular time-series by constructing continuous RNNs.

Figure 1: Visualization of the continuous hidden state of the neural CDE model. (Diagram labels: discrete event sequence; continuous path, by an interpolation algorithm; continuous hidden representation.)

2.3 Neural Controlled Differential Equations as continuous RNNs

Neural controlled differential equations (neural CDEs) are normally regarded as a continuous analogue to RNNs since they process the time-derivative of the continuous path Z(t). In particular, neural CDEs retain their continuous properties by using the interpolated path Z made of discrete data $\{(z_j, t_j)\}_{j=a}^{b}$ and solving the Riemann-Stieltjes integral to get $h(t_b)$ from $h(t_a)$, as shown in Eq. (1). This problem of deriving $h(t_b)$ from the initial condition $h(t_a)$ is known as an initial value problem (IVP) (cf. Figure 1). First, to make the interpolated continuous path Z, linear interpolation or natural cubic spline interpolation is generally used among several interpolation methods. Then, we use existing ODE solvers to solve the Riemann-Stieltjes integral problem with $\dot{h}(t) := \frac{dh(t)}{dt} = f(h(t); \theta_f)\, \frac{dZ(t)}{dt}$.
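The two steps above (interpolate, then solve an ODE) can be sketched in a few lines. This is a toy illustration under stated assumptions: `cde_function` is a hand-written stand-in for the learned network f, the path has two channels (time and one feature), and a fixed-step explicit Euler loop stands in for a real ODE solver.

```python
import math

def cde_function(h):
    """Toy stand-in for f(h; theta_f): a (hidden x input) = (2 x 2) matrix."""
    return [[math.tanh(h[0]), 0.5],
            [0.3, math.tanh(h[1])]]

def solve_cde(h0, times, values, steps_per_unit=200):
    """Explicit Euler on dh/dt = f(h) dZ/dt along the piecewise-linear path
    through the observations (times[j], values[j])."""
    h = list(h0)
    for j in range(len(times) - 1):
        gap = times[j + 1] - times[j]
        # On a linearly interpolated segment, dZ/dt is constant.
        slope = [(b - a) / gap for a, b in zip(values[j], values[j + 1])]
        n = max(1, math.ceil(gap * steps_per_unit))
        dt = gap / n
        for _ in range(n):
            dh = [sum(m * s for m, s in zip(row, slope)) for row in cde_function(h)]
            h = [x + dt * d for x, d in zip(h, dh)]
    return h

times = [0.0, 0.4, 2.5, 2.6]  # irregular gaps (hypothetical)
values = [[0.0, 0.1], [0.4, 0.9], [2.5, -0.3], [2.6, 0.2]]  # [time, feature]
print(solve_cde([0.0, 0.0], times, values))
```

Because the solver integrates over each gap rather than stepping once per observation, uneven gaps need no special handling. When the path does not move, the hidden state stays where it is.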

2.4 Maximum Likelihood Estimation in Temporal Point Process

Figure 2: Our proposed HP-CDE architecture. (Diagram labels: event sequence; interpolation; continuous hidden representation; type prediction; time prediction; event log-likelihood; non-event log-likelihood.)

Most neural temporal point process frameworks choose maximum likelihood estimation (MLE) [Aitchison and Silvey, 1958] as one of the main training objectives. To enable MLE training, the log-probability of every sequence X is required, which consists of formulas using intensity functions conditioned on the history $\mathcal{H}_t = \{(k_j, t_j) : t_j < t\}$. Thus, the log-probability for any event sequence X whose events are observed in an interval $[t_1, t_N]$ is as follows:

$$\log p(X) = \sum_{j=1}^{N} \log \lambda^*(t_j) - \int_{t_1}^{t_N} \lambda^*(t)\, dt, \tag{4}$$

where $\sum_{j=1}^{N} \log \lambda^*(t_j)$ denotes the event log-likelihood and $\int_{t_1}^{t_N} \lambda^*(t)\, dt$ denotes the non-event log-likelihood. The non-event log-likelihood represents the sum of the infinite number of non-event log-probabilities in $[t_1, t_N]$, except the infinitesimal times when the events occur. The event log-likelihood is comparatively easy to compute, as the formula is simply a sum of log-intensities evaluated at the event times. However, it is challenging to compute the non-event log-likelihood, due to its integral computation. Because of this difficulty, NHP, SAHP, THP and many other models use approximation methods, such as Monte Carlo integration [Robert and Casella, 2005] and numerical integration methods [Stoer and Bulirsch, 2013], to get the value. However, since those methods do not exactly solve the integral problem, numerical errors are inevitable.
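The gap between an exact and a sampled non-event term can be seen on a case where the integral has a closed form. The sketch below uses a classical single-type Hawkes process with an assumed exponential kernel, not HP-CDE; the parameters, event times, and sample count are illustrative.

```python
import math
import random

MU, ALPHA, DELTA = 0.2, 0.8, 1.0  # assumed, illustrative parameters

def intensity(t, events):
    """Classical intensity with kernel ALPHA * exp(-DELTA * s)."""
    return MU + sum(ALPHA * math.exp(-DELTA * (t - tj)) for tj in events if tj < t)

def exact_non_event(events):
    """Closed-form integral of the intensity over [t_1, t_N]."""
    t1, tN = events[0], events[-1]
    decay = sum(ALPHA / DELTA * (1.0 - math.exp(-DELTA * (tN - tj))) for tj in events)
    return MU * (tN - t1) + decay

def monte_carlo_non_event(events, n_samples=20000, seed=0):
    """Uniform-sampling estimate of the same integral."""
    rng = random.Random(seed)
    t1, tN = events[0], events[-1]
    mean = sum(intensity(rng.uniform(t1, tN), events) for _ in range(n_samples)) / n_samples
    return (tN - t1) * mean

def log_likelihood(events, non_event):
    """Eq. (4): event term minus non-event term."""
    return sum(math.log(intensity(tj, events)) for tj in events) - non_event

events = [0.0, 0.7, 0.9, 3.0, 5.0]  # hypothetical event times
exact = exact_non_event(events)
mc = monte_carlo_non_event(events)
print("non-event term, exact vs Monte Carlo:", exact, mc)
print("log-likelihood, exact vs Monte Carlo:",
      log_likelihood(events, exact), log_likelihood(events, mc))
```

The sampled estimate changes with the seed and the number of samples, and that error carries directly into the log-likelihood used for training.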

3 Proposed Method

In this section, we describe our explicitly continuous Hawkes process model, called HP-CDE, based on the neural CDE framework, which is considered a continuous RNN. Owing to the continuous property of the proposed model, the exact log-likelihood, especially the non-event log-likelihood part with its challenging integral calculation, can also be computed through ODE solvers. That is, our proposed model reads event sequences with irregular inter-arrival times in a continuous manner, and exactly computes the log-likelihood.

3.1 Overall Workflow

Figure 2 shows the overall design of our proposed model, HP-CDE. The overall workflow is as follows:

1. Given the event sequence $X = \{(k_j, t_j)\}_{j=1}^{N}$, i.e., event type $k_j$ at time $t_j$, the embeddings $\{E_e(k_j), E_p(t_j)\}_{j=1}^{N}$ are made through the encoding processes, where $E_e(k_j)$ is an embedding of $k_j$ and $E_p(t_j)$ is a positional embedding of $t_j$.

2. Then we use $\{E_e(k_j) \oplus E_p(t_j)\}_{j=1}^{N}$ as the discrete hidden representations $\{z_j\}_{j=1}^{N}$. In other words, $z_j = E_e(k_j) \oplus E_p(t_j)$, i.e., the element-wise summation of the two embeddings.

3. An interpolation algorithm is used to create the continuous path Z(t) from $\{(z_j, t_j)\}_{j=1}^{N}$; we augment the time information $t_j$ to each $z_j$.

4. Using the continuous path Z(t), a neural CDE layer calculates the final continuous hidden representation h(t) for all t. At the same time, an ODE solver integrates the continuous intensity function λ*(t), which is calculated from h(t) (cf. Eq. (7)), to calculate the non-event log-likelihood. In addition, there are three prediction layers to predict the event type, time, and log-likelihood (cf. Figure 3).

We provide more detailed descriptions for each step in the following subsections with the well-posedness of our model.

3.2 Embedding

We embed both the type and time of each event into separate vectors and then add them. To be more specific, we map each event type to an embedding vector $E_e(k)$, which is trainable. With trigonometric functions, we embed the time information into a vector $E_p(t)$, which is called positional encoding in transformer language models (cf. Appendix A). We use the sum of the two embeddings, $\{E_e(k_j) \oplus E_p(t_j)\}_{j=1}^{N}$, as the discrete hidden representations $\{z_j\}_{j=1}^{N}$, i.e., $z_j = E_e(k_j) \oplus E_p(t_j)$.
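A minimal sketch of this embedding step follows. The exact trigonometric form is given in Appendix A, which is not reproduced here, so the standard transformer-style sine/cosine encoding is assumed. The dimension, the randomly initialized lookup table standing in for the trainable $E_e$, and the sample events are all hypothetical.

```python
import math
import random

DIM, NUM_TYPES = 4, 3  # illustrative sizes
_rng = random.Random(0)
# Stand-in for the trainable table E_e: one vector per event type.
type_table = [[_rng.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(NUM_TYPES)]

def time_encoding(t, dim=DIM):
    """Transformer-style trigonometric encoding of a timestamp (assumed form)."""
    return [math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(t / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def embed_sequence(events):
    """z_j = E_e(k_j) + E_p(t_j), element-wise, for each (k_j, t_j)."""
    return [[e + p for e, p in zip(type_table[k], time_encoding(t))]
            for k, t in events]

z = embed_sequence([(2, 0.0), (0, 1.3), (1, 1.35)])  # (type, time) pairs
for row in z:
    print([round(v, 3) for v in row])
```

Because the encoding takes the raw timestamp rather than the position index, two events that are close in time receive similar time encodings regardless of how many events lie between them.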

3.3 Occurrence Dynamics and Continuous Intensity Function

With $\{z_j\}_{j=1}^{N}$, we calculate the continuous hidden representation $h(t_j)$ for any arbitrary j, where $t_1 \le t_j$, based on the neural CDE framework as follows:

$$h(t_j) = h(t_1) + \int_{t_1}^{t_j} f(h(t); \theta_f)\, dZ(t), \tag{5}$$

where Z(t) is a continuous path created by an interpolation algorithm from $\{(z_j, t_j)\}_{j=1}^{N}$. The well-posedness² of neural CDEs is proved in [Lyons et al., 2004, Theorem 1.3] under the Lipschitz continuity requirement (cf. Appendix B). The neural CDE layer is able to generate the continuous hidden representation $h(t_j)$, where $t_1 \le t_j$, even when the sequence $\{(z_j, t_j)\}_{j=1}^{N}$ is an irregular time-series, i.e., the inter-arrival time varies from one case to another. This continuous property enables our model to exactly solve the integral problem of the non-event log-likelihood.

² The well-posedness of an initial value problem means that i) its unique solution, given an initial value, exists, and ii) its solutions continuously change as initial values change.

That is, the non-event log-likelihood can be re-written as the following ODE form:

$$a(t_N) = \int_{t_1}^{t_N} \lambda^*(t)\, dt, \tag{6}$$

where the conditional intensity function of Eqs. (2) and (3) is, in our case, the sum of the conditional intensity functions of all event types as follows:

$$\lambda^*(t) = \sum_{k=1}^{K} \lambda^*_k(t), \quad \lambda^*_k(t) = \phi_k(W^{\text{intst}}_k h(t_j)), \tag{7}$$

where $W^{\text{intst}}_k$ is a weight matrix of intensity for type k, and therefore, $W^{\text{intst}}_k h(t_j)$ is a linearly projected representation which has the history of events before time $t_j$. $\phi_k(x) := \beta_k \log(1 + \exp(x / \beta_k))$ is the softplus function with a parameter $\beta_k$ to be learned. The softplus function is used to restrict the intensity function to have only positive values. Therefore, the log-probability of HP-CDE for any event sequence X is redefined from Eq. (4) as:

$$\log p(X) = \sum_{j=1}^{N} \log \lambda^*(t_j) - a(t_N). \tag{8}$$

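As a result, we can naturally define the following augmented ODE, where h(t) and a(t) are combined:

$$\frac{d}{dt}\begin{bmatrix} h(t) \\ a(t) \end{bmatrix} = \begin{bmatrix} f(h(t); \theta_f)\, \frac{dZ(t)}{dt} \\ \lambda^*(t) \end{bmatrix}, \tag{9}$$

and

$$\begin{bmatrix} h(t_1) \\ a(t_1) \end{bmatrix} = \begin{bmatrix} \pi(z(t_1); \theta_\pi) \\ 0 \end{bmatrix}, \tag{10}$$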

where π is a fully connected layer. The neural network f is defined as follows:

$$f(h(t)) = \text{Tanh}(\pi_M(\text{ELU}(\cdots(\text{ELU}(\pi_1(h(t))))))), \tag{11}$$

which consists of fully connected layers with the ELU or the hyperbolic tangent activation. The number of layers M is a hyperparameter. In THP [Zuo et al., 2020], the hidden representations generated by the self-attention module of the transformer have discrete time stamps, and therefore, the associated intensity function definition is inevitably discrete. For that reason, THP relies on a heuristic method, e.g., the Monte Carlo method, to calculate the non-event log-likelihood. In our case, however, the physical time is modeled in a continuous manner and therefore, the exact non-event log-likelihood can be calculated as in Eq. (6).
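The idea of integrating the hidden state and the non-event term together can be sketched as follows. Everything here is a toy under stated assumptions: the weights in `W_INTST`, the hand-written `f`, the two-channel path, and the fixed-step Euler loop (standing in for the ODE solver) are illustrative, and β is fixed to 1 rather than learned.

```python
import math

def softplus(x, beta=1.0):
    """phi(x) = beta * log(1 + exp(x / beta)); always positive."""
    return beta * math.log1p(math.exp(x / beta))

def f(h):
    """Toy CDE function: a (hidden x input) = (2 x 2) matrix."""
    return [[math.tanh(h[0]), 0.5], [0.3, math.tanh(h[1])]]

W_INTST = [[1.0, -0.5], [0.2, 0.7], [-0.4, 0.1]]  # K = 3 types, assumed weights

def total_intensity(h):
    """Sum over event types of softplus(W_k h), in the spirit of Eq. (7)."""
    return sum(softplus(sum(w * x for w, x in zip(row, h))) for row in W_INTST)

def integrate(h0, times, values, steps_per_unit=200):
    """Euler steps on the augmented state (h, a), starting from a(t_1) = 0."""
    h, a = list(h0), 0.0
    for j in range(len(times) - 1):
        gap = times[j + 1] - times[j]
        slope = [(b - c) / gap for c, b in zip(values[j], values[j + 1])]
        n = max(1, math.ceil(gap * steps_per_unit))
        dt = gap / n
        for _ in range(n):
            lam = total_intensity(h)  # da/dt at the current state
            dh = [sum(m * s for m, s in zip(row, slope)) for row in f(h)]
            h = [x + dt * d for x, d in zip(h, dh)]
            a += dt * lam
    return h, a

times = [0.0, 0.4, 2.5, 2.6]
values = [[0.0, 0.1], [0.4, 0.9], [2.5, -0.3], [2.6, 0.2]]
h_end, a_end = integrate([0.0, 0.0], times, values)
print("h(t_N) =", h_end, " a(t_N) =", a_end)
```

Carrying a(t) in the solver state means the non-event term is accumulated along the same trajectory as h(t), with no separate sampling pass over the interval.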

3.4 Prediction Layer

Our model has three prediction layers as in other Hawkes process models: i) next event type, ii) next event time, and iii) the event log-likelihood (cf. Figure 3). We use Eq. (7) to calculate the event log-likelihood.


Figure 3: Prediction layer of HP-CDE. (Diagram labels: the continuous hidden representation feeds the type prediction, the time prediction (Eq. (13)), and the event log-likelihood (Eq. (8)).)

For the event type and time predictions, we predict $\{\hat{t}_j\}_{j=2}^{N+1}$ and $\{\hat{k}_j\}_{j=2}^{N+1}$ after reading $X = \{(k_j, t_j)\}_{j=1}^{N}$. For the event type prediction layer, we use the following method:

$$\hat{p}_{j+1} = \text{Softmax}(W^{\text{type}} h(t_j)), \quad \hat{k}_{j+1} = \arg\max_k \hat{p}_{j+1}(k), \tag{12}$$

where $W^{\text{type}}$ is a trainable parameter and $\hat{p}_{j+1}(k)$ is the probability of type k at time $t_{j+1}$. For the event time prediction layer, we use the following definition:

$$\hat{t}_{j+1} = W^{\text{time}} h(t_j), \tag{13}$$

where $W^{\text{time}}$ is a trainable parameter.
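Eqs. (12) and (13) are both linear readouts of the hidden state, which the sketch below spells out. The weight values and the hidden vector are made up for illustration; in the model they would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next(h, w_type, w_time):
    """Eq. (12): type probabilities and their argmax; Eq. (13): linear time readout."""
    probs = softmax([sum(w * x for w, x in zip(row, h)) for row in w_type])
    k_hat = max(range(len(probs)), key=probs.__getitem__)
    t_hat = sum(w * x for w, x in zip(w_time, h))
    return probs, k_hat, t_hat

h_tj = [1.0, -1.0]                              # hypothetical hidden state h(t_j)
w_type = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # K = 3 types, assumed weights
w_time = [0.3, 0.1]                             # assumed weights
probs, k_hat, t_hat = predict_next(h_tj, w_type, w_time)
print(probs, k_hat, t_hat)
```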

3.5 Training Algorithm

Our loss definition consists of three parts. The first part is the following MLE loss, i.e., maximizing the log-likelihood (cf. Eq. (8)):

$$\sum_{i=1}^{S} \log p(X_i), \tag{14}$$

where S is the number of training samples. While training, the log-intensity of each observed event increases and the non-event log-likelihood decreases in the whole interval $[t_1, t_N]$. The second loss is the event type loss function, which is a cross-entropy term as follows:

$$\mathcal{L}_{\text{type}}(X) = \sum_{j=2}^{N} -\mathbf{k}_j^\top \log(\hat{p}_j), \tag{15}$$

where $\mathbf{k}_j$ is a one-hot vector for the event type $k_j$. In the case of the event time loss, we use the inter-arrival time $\tau_j = t_j - t_{j-1}$ to compute the loss as follows:

$$\mathcal{L}_{\text{time}}(X) = \sum_{j=2}^{N} (\tau_j - \hat{\tau}_j)^2. \tag{16}$$

Therefore, the overall objective function of HP-CDE, which is minimized during training, can be written as follows:

$$\sum_{i=1}^{S} -\alpha_1 \log p(X_i) + \mathcal{L}_{\text{type}}(X_i) + \alpha_2 \mathcal{L}_{\text{time}}(X_i), \tag{17}$$

where $\alpha_1$ and $\alpha_2$ are hyperparameters.
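The three terms can be combined as in the sketch below for a single sequence. It assumes the objective is read as a quantity to minimize, so the log-likelihood enters with a negative sign; the function names, sample values, and default weights are hypothetical.

```python
import math

def type_loss(true_types, pred_probs):
    """Cross-entropy in the spirit of Eq. (15): minus the log-probability
    assigned to each true type."""
    return -sum(math.log(p[k]) for k, p in zip(true_types, pred_probs))

def time_loss(times, pred_gaps):
    """Eq. (16): squared error on inter-arrival times tau_j = t_j - t_{j-1}."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum((g - p) ** 2 for g, p in zip(gaps, pred_gaps))

def total_loss(log_p, l_type, l_time, alpha1=1.0, alpha2=0.01):
    """Eq. (17), read as a loss to minimize (sign convention assumed)."""
    return -alpha1 * log_p + l_type + alpha2 * l_time

times = [0.0, 1.0, 3.0]                  # hypothetical event times
l_time = time_loss(times, [1.0, 1.5])    # predicted gaps for events 2 and 3
l_type = type_loss([0, 1], [[0.5, 0.5], [0.25, 0.75]])
print(total_loss(log_p=-2.0, l_type=l_type, l_time=l_time))
```

Since inter-arrival times can be orders of magnitude larger than log-probabilities on some datasets, the weight on the time term controls how much it dominates the gradient.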

Algorithm 1 How to train HP-CDE

Input: Training data $D_{\text{train}}$, iteration number max_iter

1: Initialize all the parameters of the embedding and the neural CDE layer
2: iter ← 0
3: while iter < max_iter do
4:     Sample a mini-batch $\{X_i\}_{i=1}^{S} \subset D_{\text{train}}$
5:     Calculate the embedding vectors, i.e., $E_e(k_j)$ and $E_p(t_j)$
6:     Calculate the discrete hidden representation $z_j$, $\forall j$
7:     Calculate the continuous hidden representation h(t) using the neural CDE and compute the non-event log-likelihood using an ODE solver with Eq. (6) over time
8:     Update the parameters with Eq. (17)
9:     if the loss does not decrease for δ iterations then
10:        exit
11:    end if
12: end while
13: return the trained parameters

Algorithm 1 shows the training procedure. We first initialize all the parameters. From our training data, we randomly build a mini-batch $\{X_i\}_{i=1}^{S}$ in Line 4; the optimal mini-batch size varies from one dataset to another. After feeding the constructed mini-batch into our model, we calculate the discrete and continuous hidden representations in Lines 6 and 7. With the loss in Eq. (17), we train our model. We repeat the steps up to max_iter times, stopping early if the loss does not decrease for δ iterations.
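The control flow of Algorithm 1 can be sketched independently of the model. Here `train_step` is a hypothetical callable that performs one mini-batch update and returns the loss, and `patience` plays the role of δ. Counting "no decrease" relative to the best loss seen so far is an assumption about the stopping rule.

```python
def train(train_step, max_iter, patience):
    """Run train_step up to max_iter times; stop early when the loss has not
    improved on the best value for `patience` consecutive iterations."""
    best_loss = float("inf")
    stale = 0
    iterations = 0
    for it in range(max_iter):
        loss = train_step(it)  # hypothetical: one update, returns the loss
        iterations += 1
        if loss < best_loss:
            best_loss, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return iterations, best_loss

# Stand-in for real updates: replay a fixed, made-up loss curve.
fake_losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]
print(train(lambda it: fake_losses[it], max_iter=len(fake_losses), patience=3))
```

With this made-up curve, the loop stops after the sixth iteration and never reaches the later, lower loss, which shows how the choice of δ trades training time against the risk of stopping too early.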

4 Experiments

4.1 Experimental Environments

Experimental Settings

In this section, we compare the model performance of HP-CDE with 4 state-of-the-art baselines on 4 datasets. Each dataset is split into the training set and the testing set. The training set is used to tune the hyperparameters and the testing set is used to measure the model performance. We evaluate the models with three metrics: i) log-likelihood (LL) of $X = \{(k_j, t_j)\}_{j=1}^{N}$, ii) accuracy (ACC) on the event type prediction, and iii) root mean square error (RMSE) on the event time prediction. We train each model for 100 epochs and report the mean and standard deviation of the evaluation metrics over five trials with different random seeds. We compare our model with various baselines (cf. Section 2.2): Recurrent Marked Temporal Point Process (RMTPP)³, Neural Hawkes Process (NHP)⁴, Self-Attentive Hawkes Process (SAHP)⁵, and Transformer Hawkes Process (THP)⁶. More details including hyperparameter configurations are in Appendix C.

³ https://github.com/dunan/NeuralPointProcess

⁴ https://github.com/hongyuanmei/neural-hawkes-particle-smoothing

⁵ https://github.com/QiangAIResearcher/sahp_repo

⁶ https://github.com/SimiaoZuo/Transformer-Hawkes-Process


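| Dataset | Model | LL ↑ | ACC ↑ | RMSE ↓ | Memory usage (MB) | Training time (m) |
|---|---|---|---|---|---|---|
| MIMIC | RMTPP | -1.222 ± 0.080 | 0.823 ± 0.014 | 1.035 ± 0.023 | 3 | 0.004 |
| MIMIC | NHP | -0.647 ± 0.051 | 0.534 ± 0.015 | 0.976 ± 0.020 | 13 | 0.045 |
| MIMIC | SAHP | -0.859 ± 0.328 | 0.555 ± 0.171 | 1.138 ± 0.059 | 34 | 0.037 |
| MIMIC | THP | -0.233 ± 0.012 | 0.741 ± 0.021 | 0.856 ± 0.040 | 9 | 0.012 |
| MIMIC | HP-CDE | **2.573 ± 0.201** | **0.847 ± 0.007** | **0.726 ± 0.042** | 58 | 0.058 |
| MemeTracker | RMTPP | NaN | 0.006 ± 0.000 | NaN | 1,708 | 0.425 |
| MemeTracker | NHP | -9.395 ± 2.814 | 0.044 ± 0.003 | 441.293 ± 0.233 | 5,096 | 12.263 |
| MemeTracker | SAHP | 2.160 ± 0.324 | 0.009 ± 0.000 | 521.672 ± 4.071 | 32,894 | 6.642 |
| MemeTracker | THP | -5.717 ± 0.649 | 0.015 ± 0.000 | 446.477 ± 2.665 | 891 | 2.610 |
| MemeTracker | HP-CDE | **3.846 ± 0.626** | **0.151 ± 0.005** | **441.223 ± 3.480** | 3,669 | 3.817 |
| Retweet | RMTPP | NaN | 0.490 ± 0.000 | NaN | 210 | 0.044 |
| Retweet | NHP | -9.082 ± 0.125 | 0.547 ± 0.010 | 16,630.956 ± 0.217 | 750 | 17.820 |
| Retweet | SAHP | 1.904 ± 0.566 | 0.505 ± 0.067 | 16,648.339 ± 1.436 | 13,276 | 0.197 |
| Retweet | THP | -7.347 ± 0.268 | 0.499 ± 0.013 | **15,050.470 ± 26.712** | 1,582 | 0.142 |
| Retweet | HP-CDE | **6.844 ± 0.539** | **0.552 ± 0.009** | 15,849.218 ± 269.068 | 197 | 6.236 |
| StackOverFlow | RMTPP | -1.894 ± 0.002 | 0.429 ± 0.000 | 1.321 ± 0.002 | 27 | 0.040 |
| StackOverFlow | NHP | -7.726 ± 0.581 | 0.434 ± 0.015 | 1.027 ± 0.027 | 449 | 3.556 |
| StackOverFlow | SAHP | -0.431 ± 0.225 | 0.244 ± 0.002 | 4.525 ± 1.098 | 11,080 | 0.147 |
| StackOverFlow | THP | -0.554 ± 0.001 | 0.449 ± 0.001 | **0.973 ± 0.001** | 4,585 | 0.169 |
| StackOverFlow | HP-CDE | **7.348 ± 0.466** | **0.452 ± 0.001** | 0.996 ± 0.017 | 44 | 6.878 |

Table 2: Experimental results (mean ± standard deviation). ↑ (resp. ↓) denotes that the higher (resp. lower) the better, and we use boldface to denote the best score.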

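| Dataset | K | Min. sequence length | Avg. sequence length | Max. sequence length | # Events |
|---|---|---|---|---|---|
| MIMIC | 75 | 2 | 4 | 26 | 1,930 |
| MemeTracker | 5000 | 1 | 3 | 31 | 123,639 |
| Retweet | 3 | 50 | 109 | 264 | 2,173,533 |
| StackOverFlow | 22 | 41 | 72 | 720 | 345,116 |

Table 3: Characteristics of datasets used in experiments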

Datasets

To show the efficacy and applicability of our model, we evaluate it on various real-world datasets. MemeTracker [Leskovec and Krevl, June 2014], Retweet [Zhao et al., 2015], and StackOverFlow [Leskovec and Krevl, June 2014] are collected from web articles, Twitter, and Stack Overflow, respectively. We also use a medical dataset, called MIMIC [Johnson et al., 2016]. We deliberately choose datasets with various average sequence lengths and numbers of event types K to show the general efficacy of our model. The average sequence length ranges from 3 to 109, and the number of event types K ranges from 3 to 5000 (cf. Table 3). That is, we cover both simple and complicated datasets, and both short and long sequences. Details of datasets are in Appendix C.3.

4.2 Experimental Results

We show the experimental results of each model on MIMIC, MemeTracker, Retweet, and StackOverFlow in Table 2. We analyze the results in three aspects: i) the event prediction, ii) the log-likelihood, and iii) the model complexity. Ablation and sensitivity analyses are in Appendices D and E.

Event Prediction

HP-CDE outperforms other baselines with regard to both the event type and the event time prediction in most cases, as reported in Table 2. To be specific, in terms of accuracy, HP-CDE shows the best performance on every dataset. These results imply that processing data in a continuous manner is important when it is in a continuous time domain. Although HP-CDE shows the lowest RMSE only on the datasets with short sequence lengths, MIMIC and MemeTracker, we provide a solution to lower the RMSE of HP-CDE on datasets with long sequence lengths in Section 4.3.

For the imbalanced datasets MIMIC and MemeTracker, where only 20% of types account for 90% and 70% of events, respectively, we conduct the following additional analyses. Notably, HP-CDE attains an accuracy of 0.151 on MemeTracker, which is 243% higher than that of the best baseline (NHP, 0.044), and an RMSE of 0.726 on MIMIC, about 15% lower than that of the best baseline (THP, 0.856). Furthermore, we use the macro F1 score to measure the quality of type predictions. As shown in Table 4, our model shows the best F1 score on both of the imbalanced datasets. Especially for MemeTracker, models with attention modules have relatively low F1 scores, indicating that when there are too many classes and they are imbalanced, attention overfits to several frequently occurring classes. This phenomenon is also observed in Figure 4. In Figure 4, HP-CDE shows the most diverse predictions in terms of the number of predicted classes.

| Model | MIMIC | MemeTracker |
|---|---|---|
| RMTPP | 0.385 ± 0.037 | 0.000 ± 0.000 |
| NHP | 0.126 ± 0.018 | 0.011 ± 0.002 |
| SAHP | 0.108 ± 0.112 | 0.000 ± 0.000 |
| THP | 0.162 ± 0.016 | 0.000 ± 0.000 |
| HP-CDE | 0.452 ± 0.035 | 0.069 ± 0.004 |

Table 4: F1 score (↑) for imbalanced datasets
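To illustrate why macro F1 is informative on imbalanced data, the sketch below computes it from scratch on a made-up example where a model always predicts the majority class. Averaging over the union of true and predicted classes is an assumption; the paper does not state which class set it averages over.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0] * 9 + [1]   # hypothetical labels: class 0 dominates
y_pred = [0] * 10        # a model that only ever predicts class 0
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy, "macro F1:", macro_f1(y_true, y_pred))
```

In this example the accuracy looks high while the macro F1 is below one half, because the never-predicted minority class contributes an F1 of zero. This is the overfitting-to-frequent-classes pattern that Table 4 is meant to expose.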


Figure 4: The number of classes in test data vs. the number of classes in correct event type predictions, i.e., hits. HP-CDE provides not only accurate but also diverse predictions. (Legend: # classes in test data, # classes in hits by model; panel (b): MemeTracker.)

Figure 5: Training curves on Retweet and MemeTracker. HP-CDE shows the highest log-likelihood with the fastest convergence speed. (Panels: (a) Retweet, (b) MemeTracker; x-axis: epoch, 0 to 100; y-axis: event log-likelihood, log scale.)

In particular, in Figure 4 (b), HP-CDE makes correct predictions for 1,164 of the 2,604 classes in the test data, i.e., about 45% of them, whereas NHP, SAHP, and THP do so for only 217, 4, and 7 classes, respectively. Regardless of the characteristics of datasets, e.g., the number of types, the degree of imbalance, and so on, our model shows outstanding prediction results, which supports the importance of continuous processing and of computing the exact log-likelihood for more accurate learning of dynamics.

Log-likelihood Calculation

As shown in Table 2, our model always shows the best log-likelihood, outperforming others by large margins, on every dataset. One remarkable point is that our log-likelihood is always positive, while baselines show negative values in many cases. That is, in HP-CDE, the event log-likelihood exceeds the non-event log-likelihood at all times. Figure 5 shows the training curves of models fitted on Retweet and MemeTracker in log scale. First of all, HP-CDE shows the best log-likelihood at every training epoch. Overall, except for THP, the log-likelihood on MemeTracker tends to be more unstable than that on Retweet, since MemeTracker has about 1,700 times more event types than Retweet.

Memory Usage

Table 2 also summarizes the model complexity. Exactly calculating the non-event log-likelihood using ODE solvers incurs additional memory usage, so the model uses more memory than sampling-based methods such as Monte Carlo sampling. Especially when the number of event types K is large, i.e., on MIMIC and MemeTracker, the complexity of HP-CDE increases as we exactly compute the non-event log-likelihood for every event type. However, when K is relatively small, owing to the adjoint sensitivity method [Chen et al., 2018; Kidger et al., 2020], HP-CDE's memory footprint notably decreases. For example, when using Retweet with K = 3, the space complexity of HP-CDE is almost 1% of that of THP.

Figure 6: Additional study on long-sequence datasets, comparing accuracy and RMSE of HP-CDE-AT to HP-CDE and THP. (Panels: (a) Retweet, (b) StackOverFlow.)

4.3 Additional Study on the Long Sequence Length

While HP-CDE performs well on the datasets with relatively short sequence lengths, i.e., MIMIC and MemeTracker, its RMSE results on the others with longer sequence lengths, i.e., Retweet and StackOverFlow, are slightly larger than those of THP. Therefore, to effectively deal with long sequence datasets, we put the self-attention part of the transformer [Vaswani et al., 2017] right before the neural CDE layer and name the model HP-CDE-AT. Experimental results of HP-CDE-AT in comparison with HP-CDE and THP, which shows the highest score among baselines, are summarized in Figure 6. According to Figure 6 (a), HP-CDE-AT achieves the smallest RMSE, improving the performance of the original HP-CDE model. Remarkably, in Figure 6 (b), HP-CDE-AT even shows the best performance on StackOverFlow in both metrics, accuracy and RMSE. In conclusion, since HP-CDE-AT attains the overall best results on longer datasets, HP-CDE-AT is one good option for long sequence datasets (cf. Appendix F).

5 Conclusions

Temporal point processes are frequently used in real-world applications to model occurrence dynamics in various fields. In particular, deep learning-based Hawkes process models have been extensively studied. However, we identified two possible enhancements from the literature and presented HP-CDE to overcome the limitations. First, we use neural CDEs to model occurrence dynamics since one of their main application areas is to model uncertainties in human behaviors. Second, we exactly calculate the non-event log-likelihood, which is one important part of the training objective. Existing work uses heuristic methods for it, which sometimes makes the training process unstable. Consequently, in our experiments, our presented method significantly outperforms them and shows the most diverse predictions, i.e., the least overfitting.

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23)

Acknowledgements

Noseong Park is the corresponding author. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-01361, Artificial Intelligence Graduate School Program at Yonsei University, 10%), (No. 2022-0-01032, Development of Collective Collaboration Intelligence Framework for Internet of Autonomous Things, 45%), and (No. 2022-0-00113, Developing a Sustainable Collaborative Multi-modal Lifelong Learning Framework, 45%).

Ethical Statement

MIMIC contains much personal health information. However, it was released after removing observations, such as diagnostic reports and physician notes, using a rigorously evaluated deidentification system to protect the privacy of the patients who have contributed their information. Therefore, our work does not have any related ethical concerns.

Contribution Statement

Minju Jo and Seungji Kook contributed equally.

References

[Aitchison and Silvey, 1958] J. Aitchison and S. D. Silvey. Maximum-Likelihood Estimation of Parameters Subject to Restraints. The Annals of Mathematical Statistics, 29(3):813–828, 1958.

[Chen et al., 2018] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.

[Choi et al., 2015] Edward Choi, Nan Du, Robert Chen, Le Song, and Jimeng Sun. Constructing disease network and temporal progression model via context-sensitive hawkes process. In 2015 IEEE International Conference on Data Mining, pages 721–726. IEEE, 2015.

[Choi et al., 2021] Jeongwhan Choi, Jinsung Jeon, and Noseong Park. Lt-ocf: Learnable-time ode-based collaborative filtering. In CIKM, 2021.

[Da Fonseca and Zaatour, 2014] José Da Fonseca and Riadh Zaatour. Hawkes process: Fast calibration, application to trade clustering, and diffusive limit. Journal of Futures Markets, 34(6):548–579, 2014.

[Du et al., 2016] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. KDD '16, pages 1555–1564, New York, NY, USA, 2016. Association for Computing Machinery.

[Enguehard et al., 2020] Joseph Enguehard, Dan Busbridge, Adam Bozson, Claire Woodcock, and Nils Y. Hammerla. Neural temporal point processes for modelling electronic health records, 2020.

[Garetto et al., 2021] Michele Garetto, Emilio Leonardi, and Giovanni Luca Torrisi. A time-modulated hawkes process to model the spread of covid-19 and the impact of countermeasures. Annual reviews in control, 51:551–563, 2021.

[Hardiman et al., 2013] Stephen J Hardiman, Nicolas Bercot, and Jean-Philippe Bouchaud. Critical reflexivity in financial markets: a hawkes process analysis. The European Physical Journal B, 86(10):1–9, 2013.

[Hawkes, 1971] Alan G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

[Jeon et al., 2021] Jinsung Jeon, Soyoung Kang, Minju Jo, Seunghyeon Cho, Noseong Park, Seonghoon Kim, and Chiyoung Song. Lightmove: A lightweight next-poi recommendation for taxicab rooftop advertising. In CIKM, 2021.

[Johnson et al., 2016] Alistair Johnson, Tom Pollard, Lu Shen, Li-wei Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Celi, and Roger Mark. Mimic-iii, a freely accessible critical care database, 2016.

[Kidger et al., 2020] Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems, 33:6696–6707, 2020.

[Kobayashi and Lambiotte, 2016] Ryota Kobayashi and Renaud Lambiotte. Tideh: Time-dependent hawkes process for predicting retweet dynamics. In Tenth International AAAI Conference on Web and Social Media, 2016.

[Leskovec and Krevl, June 2014] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[Lyons et al., 2004] Terry Lyons, M. Caruana, and T. Lévy. Differential Equations Driven by Rough Paths. Springer, 2004. École d'Été de Probabilités de Saint-Flour XXXIV - 2004.

[Mei and Eisner, 2017] Hongyuan Mei and Jason Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, Long Beach, December 2017.

[Miles, 1970] Roger E Miles. On the homogeneous planar poisson point process. Mathematical Biosciences, 6:85–127, 1970.

[Mohler et al., 2011] George O Mohler, Martin B Short, P Jeffrey Brantingham, Frederic Paik Schoenberg, and George E Tita. Self-exciting point process modeling of crime. Journal of the American Statistical Association, 106(493):100–108, 2011.

[Ogata, 1999] Yosihiko Ogata. Seismicity analysis through point-process modeling: A review. Seismicity patterns, their statistical significance and physical meaning, pages 471–507, 1999.

[Poli et al., 2019] Michael Poli, Stefano Massaroli, Junyoung Park, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. Graph neural ordinary differential equations. arXiv preprint arXiv:1911.07532, 2019.

[Robert and Casella, 2005] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.

[Rubanova et al., 2019] Yulia Rubanova, Ricky T. Q. Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In NeurIPS, 2019.

[Stoer and Bulirsch, 2013] Josef Stoer and Roland Bulirsch. Introduction to numerical analysis, volume 12. Springer Science & Business Media, 2013.

[Stoyan and Penttinen, 2000] Dietrich Stoyan and Antti Penttinen. Recent applications of point process methods in forestry statistics. Statistical science, pages 61–78, 2000.

[Streit, 2010] Roy L Streit. The poisson point process. In Poisson Point Processes, pages 11–55. Springer, 2010.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, volume 30. Curran Associates, Inc., 2017.

[Yildiz et al., 2019] Cagatay Yildiz, Markus Heinonen, and Harri Lahdesmaki. Ode2vae: Deep generative second order odes with bayesian neural networks. In NeurIPS, 2019.

[Zhang et al., 2020] Qiang Zhang, Aldo Lipani, Omer Kirnap, and Emine Yilmaz. Self-attentive Hawkes process. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11183–11193. PMLR, 13–18 Jul 2020.

[Zhao et al., 2015] Qingyuan Zhao, Murat A. Erdogdu, Hera Y. He, Anand Rajaraman, and Jure Leskovec. Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1513–1522, 2015.

[Zuo et al., 2020] Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer hawkes process. ICML '20. JMLR.org, 2020.
