# intervalcensored_hawkes_processes__c94d808b.pdf

Journal of Machine Learning Research 23 (2022) 1-84 Submitted 8/21; Revised 6/22; Published 11/22

Interval-censored Hawkes processes

Marian-Andrei Rizoiu Marian-Andrei.Rizoiu@uts.edu.au University of Technology Sydney Ultimo NSW 2007, Australia Alexander Soen alexander.soen@anu.edu.au The Australian National University Canberra ACT 2601, Australia Shidi Li shidi.li@anu.edu.au The Australian National University Canberra ACT 2601, Australia Pio Calderon piogabrielle.b.calderon@student.uts.edu.au University of Technology Sydney Ultimo NSW 2007, Australia Leanne J. Dong leanne.dong@concordia.ca Concordia University Montr eal, Canada Aditya Krishna Menon adityakmenon@google.com Google Research Lexing Xie lexing.xie@anu.edu.au The Australian National University Canberra ACT 2601, Australia

Editor: Mohammad Emtiyaz Khan

Interval-censored data solely records the aggregated counts of events during speciﬁc time intervals such as the number of patients admitted to the hospital or the volume of vehicles passing traﬃc loop detectors and not the exact occurrence time of the events. It is currently not understood how to ﬁt the Hawkes point processes to this kind of data. Its typical loss function (the point process log-likelihood) cannot be computed without exact event times. Furthermore, it does not have the independent increments property to use the Poisson likelihood. This work builds a novel point process, a set of tools, and approximations for ﬁtting Hawkes processes within interval-censored data scenarios. First, we deﬁne the Mean Behavior Poisson process (MBPP), a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. We ﬁt MBPP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). We use the parameter equivalence to uncover the parameters of the associated Hawkes process. Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function for when the exogenous events are observed as event time and the latent homogeneous Poisson process exogenous function for when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBPP when no analytical solution exists. Fourth and ﬁnally, we

c 2022 Marian-Andrei Rizoiu, Alexander Soen, Shidi Li, Pio Calderon, Leanne Dong, Aditya Krishna Menon and Lexing Xie.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0917.html.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

connect the interval-censored loss of MBPP to a broader class of Bregman divergence-based functions. Using the connection, we show that the popularity estimation algorithm Hawkes Intensity Process (HIP) (Rizoiu et al., 2017b) is a particular case of the MBPP. We verify our models through empirical testing on synthetic data and real-world data. We ﬁnd that our MBPP outperforms HIP on real-world datasets for the task of popularity prediction. This work makes it possible to eﬃciently ﬁt the Hawkes process to interval-censored data.

Keywords: Hawkes process, Interval-censored, Mean Behavior Poisson process, Bregman divergence, popularity prediction, multi-impulse exogenous function, latent homogeneous Poisson process exogenous function

1. Introduction

Point processes are a class of well-understood mathematical instruments to model the occurrence of events in time (and optionally space). The Hawkes process (Hawkes, 1971) is a particular type of point process that can model self-excitation i.e., the occurrence of one event increases the likelihood of future events. The Hawkes process was successfully deployed for applications where the events are individually observable such as earthquakes (Ogata, 1988; Cox and Isham, 1980; Daley and Vere-Jones, 2003), ﬁnancial transactions (Bowsher, 2007; Hardiman et al., 2013), or social media postings (Zhao et al., 2015; Kobayashi and Lambiotte, 2016; Mishra et al., 2016). However, in many scenarios, one does not know the exact historical event times, but rather only the counts of events in speciﬁc time intervals. Such interval-censored (Bernoulli, 1760) scenarios could be due to data availability say, for epidemics when the exact infection time for each individual is unknown and the number of hospital admissions per day is known or due to data privacy e.g., e-commerce websites do not expose who bought a particular item, but they will report rankings of most bought items. Due to its lack of independent increments property (see Section 2.2), the Hawkes process cannot be directly ﬁtted in the interval-censored scenario. This work circumvents this shortcoming by building a novel point process, a set of tools, and approximations for ﬁtting Hawkes processes within interval-censored data scenario.

There is prior work that deals with interval-censored data and Hawkes processes. Here we highlight three families of approaches, and we present an in-depth review in Section 9.1. The ﬁrst family is the integer-valued auto-regressive (INAR) models (Kirchner, 2016, 2017; Manolakis et al., 2019). These have been shown to yield approximations of the Hawkes model parameters after appropriate scaling. However, these works do not expose the connection between the Hawkes process s event-time and interval-censored formulations. The second family of approaches refers to interval-censored data as panel count data (Sun and Zhao, 2013; Zhu et al., 2014; Ding et al., 2018; Moreno et al., 2020). These works typically concentrate on the non-parametric estimation of the intensity of Poisson processes from interval-censored data. As far as we are aware, they do not speciﬁcally consider the Hawkes process and are not directly link to the current work apart from the presentation of the data. The third family is the Hawkes Intensity Processes (HIP) (Rizoiu et al., 2017b). The HIP parameters are ﬁtted by minimizing the squared error of the observed counts to the expected intensity of a Hawkes process. However, HIP has several shortcomings (detailed in Section 2.4), e.g., its ﬁtting objective does not relate to the likelihood of a particular process, and it does not establish a link to the parameters of the underlying Hawkes process.

Interval-censored Hawkes processes

Data presentation:

Loss function:

Hawkes process Mean Behavior Poisson process (MBPP)

Point Process log-likelihood (PP-LL) Interval-censored log-likelihood (IC-LL)

Process intensity:

Figure 1: Illustration of the applicability of the Hawkes process and MBPP in event time (left gray bubble) and interval censored settings (right gray bubble). The Hawkes process is usable only when the data is in event time format, which can then be ﬁtted using the standard point process log-likelihood (PP-LL). On the other hand, the MBPP can be used with both event time and interval-censored data by switching between PP-LL and the interval-censored log-likelihood (IC-LL).

To our knowledge, no prior work has proposed a generalized treatment of Hawkes processes across event-time and interval-censored data.

In this paper, we build a new point process and a set of tools to operate with Hawkes processes on both event time and interval-censored data. First, we introduce the Mean Behavior Poisson (MBPP) a non-homogeneous Poisson process, with a deterministic intensity function equal to the expectation of a Hawkes process intensity, taken over all possible realizations. There is a one-to-one parameter equivalence between MBPP and the Hawkes process. Next, we propose two approximations for the event intensity and compensator (the integral of the intensity) of MBPP, and we use them in an interval-censored log-likelihood (IC-LL) which allows ﬁtting the process on interval-censored data. We show that IC-LL can be interpreted as a Bregman divergence, and we use it to draw a connection between the MBPP and the HIP (Rizoiu et al., 2017b) models. Last, we empirically show in Section 7 that MBPP allows one to retrieve the Hawkes model parameters for both interval-censored and event time data. Fig. 1 illustrates the relation between Hawkes and MBPP. When a realization is observed as event times, both Hawkes and MBPP can successfully ﬁt the model using a point process likelihood (PP-LL). When the same realization is observed interval-censored, only MBPP can be used to retrieve the original parameters using the interval-censored likelihood (IC-LL) (see Section 4).

We also study scenarios in which the exogenous events in the Hawkes process are separable from the endogenously generated events. This is often the case in real-world applications such as our popularity prediction exercise in Section 8 where the shares and tweets driving the views of You Tube videos are observed separately from the views process. In this setup, the exogenous events enter the process according to an unknown intensity function, and they can be observed either as event times or interval-censored. We construct two novel exogenous functions the multi-impulse function for when the exogenous events are event times and the latent homogeneous Poisson process function for when the exogenous

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

events are interval-censored and a new ﬁtting procedure that only accounts for endogenous events. Finally, we show that on real-world data our approaches outperform HIP (Rizoiu et al., 2017b), the current state of the art in popularity prediction. The main contributions of this work are as follows: C1: we show how to construct a non-homogeneous Mean Behavior Poisson process (MBPP) that approximates the mean behavior of the underlying Hawkes process (Section 3). The likelihood of this process allows the use of a broader class of Bregman divergence based loss functions (Section 5).

C2: we prove that the MBPP deﬁnes a causal, linear, continuous-time, time-invariant system with the impulse response computed as an inﬁnite sum of convolutions (Theorem 2). This allows ﬁnding both a closed-form solution for the MBPP intensity (where it exists) and a numerical approximation (where it does not exist).

C3: we propose a second approximation for the MBPP intensity (Section 4.3), which is numerically more eﬃcient, can exploit information about interval lengths and be used for predicting future event volumes (i.e., for forecasting).

C4: we introduce a set of tools (exogenous functions, loss functions) that allow ﬁtting the parameters of a Hawkes process using MBPP, when either the exogenous or the endogenous events (or both) are interval-censored (Section 6).

C5: on a real-world dataset containing information about tweets and You Tube views associated with You Tube videos, we show our approach outperforms HIP (Rizoiu et al., 2017b), the current state-of-the art in popularity prediction (Section 8).

2. Background and notations

In this section, we provide an overview of the diﬀerent mathematical objects used in this paper. First, in Section 2.1, we introduce point processes and Poisson processes using random measures, and we discuss processes with independent increments. Second, in Section 2.2, we introduce the Hawkes self-exciting point process, and we show that it does not have the independent increments property. Third, in Section 2.3, we discuss the parameter estimation of point processes. Last, in Section 2.4, we present the Hawkes intensity process a Hawkes-based interval-censored process and the previous work closest to the present paper. In this paper, we use the following elementary notations for more specialized notations concerning point processes, please refer to Table 1. We denote Jp K = 1 if p is true and Jp K = 0 if p is false, following notation in Knuth (1992). The Laplace transform and inverse Laplace transform are deﬁned as L{ } and L 1{ } respectively. The Dirac delta function δ(x) is deﬁned by R b a δ(x)f(x) dx = f(0) if f is continuous and a < 0 < b. The convolution of functions f and g is deﬁned as (f g)(t) = R f(τ)g(t τ) dτ.

2.1 Random measures, Point processes and Poisson processes

In this section, we brieﬂy introduce point processes and Poisson processes using the notion of random measures. For a more complete (yet accessible) introduction of point processes using random measures, we refer to (Grandell, 1977). We opted for this rigorous deﬁnition of point processes for completeness reasons. For a more intuitive and streamlined introduction

Interval-censored Hawkes processes

Table 1: Notation used in this paper. Throughout this paper, we denote observed quantities using sans-serif font face.

Symbol Meaning Deﬁned/Introduced

λ(t) Hawkes Process Intensity Eq. (3) ξ(t) Mean Behavior Poisson Intensity Eqs. (6) and (9) Λ(t) Hawkes Process Compensator Eq. (2) Ξ(t) Mean Behavior Poisson Compensator Eq. (10) N(t) Hawkes Counting Process Eq. (1) M(t) Mean Behavior Poisson Counting Process Eq. (18) C(x, y] Observed counts in interval (x, y] Eqs. (8) and (17) S(x, y] Observed immigrant counts in interval (x, y] Eq. (41) F(x, y] Observed oﬀspring counts in interval (x, y] Eq. (51) Bϕ(x) Bregman Divergence Eq. (28) smδ(t) Multi-impulse exogenous function Eq. (39) s (t) Latent Homogeneous Poisson process exog. func. Eq. (43) ΨT Set of immigrant events up to time T Eq. (50) ΥT Set of oﬀspring events up to time T Eq. (50)

using counting processes and intensity functions, we refer the reader to (Rizoiu et al., 2017a; Laub et al., 2015).

Point processes as random measures. Let (X, B) be a measurable state space and let (Ω, F, P) be a probability space. Let M = M(X) be the space of all σ-ﬁnite measures on the sigma algebra B. Let M denote the smallest σ-algebra on M with respect to which the function µ 7 µ(B) is measurable for all B B. A random measure (Benes and Rataj, 2007) on X is a measurable map η : (Ω, F, P) (M, M). Let N M be the space of all σ-ﬁnite counting measures on B. Furthermore, let N denote the restriction of M when only considering σ-ﬁnite counting measures on B, i.e., N = {S N | S M}. A point process (Benes and Rataj, 2007) on X is a measurable map ˆη : (Ω, F, P) (N, N). A temporal point process is a random measure on the real line taking values in the nonnegative integers or inﬁnity (Brillinger et al., 2002).

More intuitively, temporal point processes are a family of stochastic processes (tn)n N+, in which each random variable tn represents the occurrence time for an event (Klebaner, 2012). Such processes can also be speciﬁed as a counting process (N(t))t 0, where the random variable N(t) represents the number of events that have occurred up to and including time t (Ross, 1995, pg. 59).

The Poisson process. Let Λ be a positive measure such that Λ(A) < for every bounded set A. In the literature of martingale treatments of point processes (Jacobsen, 2006), Λ is often denoted as the compensator we also use this term throughout this paper. The Poisson process is a point process which is solely deﬁned by the compensator. In particular, the Poisson process is often characterised by (and can be deﬁned by) the evaluation of the probability on disjoint sets. That is for every collection of disjoint, bounded Borel measurable sets A1, , Ak, for a Poisson process, the N(Ai) are independent Poisson

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

random variables with rate Λ(Ai):

P(N(Ai) = ni, i = 1, , k) =

ni! e Λ(Ai). (1)

Next, we use the notion of counting processes to deﬁne the Poisson process, i.e., specifying the compensator Λ. Suppose that Λ(A) = λl(A), where l(A) is the Lebesgue measure of A and λ > 0 a positive constant. We deﬁne the homogeneous Poisson process (N(t))t 0 as

1. N(0) = 0 almost surely

2. for any s < t, N(t) N(s) Poisson(λ(t s))

3. for any s < t s < t , N(t) N(s) N(t ) N(s ),

where X Y denotes that X and Y are independent. We say a Poisson process is non-homogeneous when λ is no longer a constant, but a non-negative function. Here, the compensator is redeﬁned as Λ(A) = R

A λ(x) dx, and the non-homogeneous Poisson process (NHPP) is

(a) N(0) = 0 almost surely

(b) for any s < t, N(t) N(s) Poisson(Λ(s, t))

(c) for any s < t s < t , N(t) N(s) N(t ) N(s ).

Associated with the counting process are two additional sequences of interest. The ﬁrst is the sequence of event histories (H(t))t 0, where each random variable H(t) records the times of all events that have occurred up to and including time t. Strictly speaking H(t) is a ﬁltration, that is, an increasing sequence of σ-algebras (Laub et al., 2015). The second is the sequence of conditional intensities (λ(t | H(t )))t 0 denoted for simplicity as λ(t) , where each random variable λ(t) measures the expected rate of events occurring in an inﬁnitesimal window around the time t. Formally, (Cox and Lewis, 1972, Equation 3.1),

λ(t) .= lim h 0+ 1 h E N(t + h) N(t ) | H(t )

= lim h 0 1 h P N(t + h) N(t ) = 1 | H(t ) ,

where the second equality holds for simple processes wherein multiple events cannot occur at the same time, that is the jump size N(t) N(t ) at each time t can be either 0 or 1. Finally, we can generalize the deﬁnition of the compensator Λ(s, t) as the integral of the intensity between two moments of time:

Λ(s, t) .= Z t

s λ(τ)dτ, (2)

where 0 s < t. We further shorthand Λ(t) .= Λ(0, t).

Interval-censored Hawkes processes

Processes with independent increments and Markov processes. The process (Nt)t 0 is said to have the independent increments property when, for any u s t, the random variables Nt Ns and Nu are independent. The process N = (Nt)t 0 is said to be a Markov process with respect to H(t) if E[Xs | H(t)] = E[Xs | Xt] whenever s t. In simple terms, this means that the future of a Markov process is independent of its past. Moreover, if N is a Markov process and conditional stationary, i.e., the conditional distribution is invariant to time translation, it results that it is a homogeneous Markov process. That is, the transition probability function P(N(t + s) y | N(t) = x) is independent to t. Processes with independent increments are Markov processes. However, the converse is not true not every Markov process obeys the property of independent increments. One such example is the Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930), a Markov process with the return to mean property intuitively, the process is dragged towards zero when it is far away from zero. As a result, the increments are not independent and even negatively correlated. Given the deﬁnition in Eq. (1), the Poisson process has the independent increments property and, therefore, is a Markov process.

2.2 Hawkes process

The Hawkes process is a temporal point process with the property of self-excitation. Intuitively, self-excitation means that the occurrence of one event increases the likelihood of further events in the near future. The process was widely applied in analyzing social media (Kobayashi and Lambiotte, 2016; Cao et al., 2017; Zhang et al., 2019), earthquake aftershocks (Ogata, 1988; Helmstetter and Sornette, 2002a), nuclear physics (Snyder and Miller, 1991), neuronal activity (Apostolopoulou et al., 2019), online advertising (Parmar et al., 2017) and ﬁnance (Bacry et al., 2015). The Hawkes process is a doubly stochastic point process i.e., the event intensity function is also stochastic and depends upon the previous values of the process itself (Hawkes, 1971):

λ(t) = s(t) + Z t

0 φ(t s) d N(s)

= s(t) |{z} exogenous

ti<t φ(t ti)

| {z } endogenous

where s( ) is the (deterministic) background intensity function which controls the arrival of external events into the system dubbed as immigrants; s( ) is also known as the exogenous intensity. φ( ) is the kernel function which controls the arrival of oﬀspring events spawned through self-excitement by any one given event. The arrival of oﬀspring generated by any previous event is controlled by the second term of Eq. (3), also known as the endogenous intensity. {ti} denotes a realisation of the Hawkes process, with the immigrants and the oﬀspring are collectively denoted as events. In the rest of this paper, we use ti interchangeably to denote both events and event times. The Hawkes process is said to have an exponential kernel when φ(τ) = κθe θτ, with κ the excitation level and θ > 0 the decay rate; similarly, the Hawkes process has a power-law kernel when φ(τ) = κ(τ + c) (1+θ), with c a time-warping parameter to keep the kernel

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

bounded when τ approaches zero, and κ and θ have similar meanings as for the exponential kernel. Note that by the endogenous summation in Eq. (3), λ depends on the history of events that have occurred up to time t. Thus given a ﬁxed history H(t ), the intensity before time t is deterministically calculated by Eq. (3). For times greater of equal to t, the intensity is a random variable. Relation to the Poisson process and processes with independent increments. The Hawkes process is not a Poisson process but rather a non-Markovian extension of the Poisson process. A Hawkes process can also be viewed as a cluster of Poisson processes (Hawkes and Oakes, 1974), in which each event spawns oﬀspring following a nonhomogeneous Poisson process of intensity φ(t). The link between events and their oﬀspring induces the branching structure (see (Rizoiu et al., 2017a) for more details). Because the Hawkes process s intensity function depends on the previous events, it follows that the Hawkes process is not a Markov process and, as a result, does not have the independent increments property (unlike the Poisson point process). In particular, the number of jumps in a given interval is thus a function of the number of jumps that happened earlier. We will leverage this result in Section 4 for ﬁtting Hawkes processes in the interval-censored setting.

2.3 Parameter estimation

We are given a sequence H(T) .= {tn | tn T} of observed events up to an arbitrary maximum time T > 0, usually T .= maxn tn. We try to ﬁnd the set of parameters θ of a given point process which represent best the sequence. As shown by Daley and Vere-Jones (2003), this amounts to minimizing the negative log-likelihood with respect to θ, which in turn maximizes the probability of events occurrence at the observed times ΩT :

L (θ; T) .= X

tn ΩT log λ(tn; θ) + Z T

0 λ(u; θ) du. (4)

As we can see, the log-likelihood requires observing the timing of events, which may not be the case in real application scenarios in which only volumes of events are observed (see Section 4). We denote the standard negative log-likelihood loss in Eq. (4) as the point process log-likelihood (PP-LL) loss function for the remainder of the paper.

2.4 Hawkes intensity process

Rizoiu et al. (2017b) proposed the Hawkes intensity process (HIP), a Hawkes-derived process that operates in the interval-censored setting i.e., the set of observation times O = {oi | i = 0, . . . , m, oi T} partitions the time line into non-overlapping segments and only the number of events in each segment is observed (see Section 4.1 for more details). The HIP ﬁtting objective is

i=1 (C(oi 1, oi] ξ(oi; θ))2, (5)

where C(oi 1, oi] are the observed number of events having occurred in the ith interval (oi 1, oi] and ξ(t) .= E [λ(t)] is the expected intensity of the Hawkes process.

Interval-censored Hawkes processes

The key ingredient of HIP is being able to eﬃciently compute ξ: Theorem 2.1 of (Rizoiu et al., 2017b) provides a integral equation for ξ(t), namely

ξ(t) = s(t) + Z t

0 φ(s) ξ(t s) ds

= s(t) + Z t

0 φ(t s) ξ(s) ds. (6)

For completeness, we include a derivation of this result in Appendix A. To optimize Eq. (5), one must (approximately) solve this integral equation to compute each ξ(t; θ). When s(t) = µ0 + (µ1 µ0) e γ t, and φ corresponds to the exponential kernel, Dassios and Zhao (2013) showed that ξ(t) = µ1 e (γ α) t + µ0 γ

γ α (1 e (γ α) t).

For more general kernels, such as the power-law kernel, the closed-form solution for ξ often does not exist. Rizoiu et al. (2017b) proposed to approximately solve Eq. (6) by discretising time using the pre-deﬁned observation endpoints {oi}m i=0, yielding

ˆξ[oi; θ] = µ[oi] +

s=0 φ(oi os) ˆξ[o; θ], (7)

where the bracket notation ˆξ[ ] is used to explicate the discretization. One can now use Eq. (7) to compute each ˆξ[ ; θ], beginning with the base case ˆξ[0; θ] = a. Equipped with this, one solves the approximation to the HIP objective:

i=1 (C(oi 1, oi] ˆξ[oi; θ])2. (8)

Rizoiu et al. (2017b) showed that the solution to Eq. (8) has good empirical performance. Conceptually, however, the approach has a few limitations:

L1: There is a mismatch between the objective and the observed data; the observations are of event counts, while the objective measures closeness to the event rates. Intuitively, one should work with E [N(t)] rather than E [λ(t)].

L2: The HIP objective is not derived as the likelihood of a particular process, raising uncertainty as to precisely what sense it approximates the Hawkes likelihood. In the limiting case where each observation interval is made inﬁnitesimally small, for example, it is unclear whether the solution to Eq. (5) approaches that of the standard Hawkes likelihood.

L3: It relies on ˆξ, which is derived from an approximation to the underlying integral equation. This approximation is of unclear quality, and cannot handle intervals of unequal length.

3. The Mean Behavior Poisson process (MBPP)

In this section, we ﬁrst introduce the Mean Behavior Poisson process (MBPP) a Poisson process whose intensity function is the expected event intensity of a Hawkes process of

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

0 5 10 15 20 25 30

0.0 0.1 0.2 0.3 0.4

Impulse response: Hawkes vs MBPP

Event intensity

Hawkes (mean, 1000 realizations) Hawkes (95% conﬁdence interval) Mean behavior Poisson process

Exogenous impulse

Figure 2: Event intensity in Hawkes and MBPP as a result of a single exogenous event (shown in red). We generate 1000 Hawkes realizations using an exponential kernel (Eq. (13)) and the parameters κ = 0.9, θ = 1.15. We show the mean and the 95% conﬁdence interval of the Hawkes intensity λ(t) (in black), and the (deterministic) MBPP intensity ξ(t) (in blue). We observe that the mean Hawkes intensity over the 1000 realizations follows closely the MBPP intensity, empirically showing the relation between the two processes.

the same parameters (Section 3.1). Next, we show that its compensator Ξ(t) follows the same self-consistent equation as its intensity ξ(t) (Section 3.2). Finally, we develop two methods to solve the self-consistent equation of MBPP for arbitrary exogenous functions (Section 3.3).

3.1 Mean Behavior Poisson: model deﬁnition

The intensity of a Hawkes process (deﬁned in Eq. (3)) is a stochastic function i.e., at the same future time t the conditional intensity of two Hawkes realizations with the same parameters can take very diﬀerent values depending on the history of each realization. Here we introduce a new process dubbed the Mean Behavior Poisson process (MBPP) which captures the expected behavior of all the Hawkes realizations deﬁned by the same set of parameters. Formally, given a Hawkes process we deﬁne MBPP as the non-homogeneous Poisson process with a deterministic intensity ξ(t) the expected Hawkes intensity over all possible realizations (Eq. (6)):

ξ(t) .= EHt [λ(t)] = s(t) + Z t

0 ξ(τ)φ(t τ) dτ

= s(t) + (ξ φ)(t) (9)

where λ(t) denotes the event intensity in the Hawkes process (Eq. (3)); s(t) denotes exogenous intensity function; φ(t) denotes the kernel function of the Hawkes process and is the convolution operator. For the remainder of the paper, when convenient, we suppress the functional dependence on parameters θ for intensity functions (λ(t), ξ(t)) and compensators (Λ(t), Ξ(t)).

Interval-censored Hawkes processes

We observe that the same objects fully deﬁne the MBPP intensity ξ(t) as well as the Hawkes intensity λ(t); that is, the exogenous intensity s(t) and the kernel function φ(t). We, therefore, say that Hawkes processes have a one-to-one correspondence to MBPP. In Fig. 2 we visually denote the relation between the two processes by plotting the event intensity in Hawkes and in MBPP. We sample 10,000 realizations of Hawkes, and we plot its mean event intensity and its 95% conﬁdence interval. We also plot the intensity of the mean behavior Poisson, which visually overlaps closely with the mean Hawkes intensity. MBPP and HIP. Visibly, MBPP shares the same equation with HIP (Rizoiu et al., 2017b) (see Eq. (6)). However, their usages are diﬀerent: HIP discretizes Eq. (6) and aims to ﬁt volumes of events, while Eq. (9) deﬁnes the intensity of a novel point process that, as far as we are aware, has not been studied in the literature. In the next section, we study an important quantity of MBPP its compensator and we show that it satisﬁes a nearly identical integral equation to the intensity equation.

3.2 The compensator of MBPP

We deﬁne Ξ(t) as the expectation of the Hawkes compensator (see Eq. (2)) over all possible realizations: Ξ(t) = EHt [Λ(t)]

We show that the Ξ(t) is the compensator of the MBPP using the Fubini theorem and passing the expectation inside the integral since λ(t) is bounded (see Klebaner (2012, Theorem 8.4)):

0 EHt [λ(z)] dz cf.(9) = Z t

0 ξ(z y)φ(y) dy dz

0 s(z)dz + Z t

y=0 ξ(z y)φ(y) dy dz

0 s(z)dz + Z t

z=y ξ(z y)φ(y) dz dy

0 s(z)dz + Z t

y ξ(z y) dz dy

0 s(z)dz + Z t

0 φ(y) Z t y

0 ξ(x) dx dy

(c) = S(t) + Z t

0 Ξ(t y)φ(y) dy

= S(t) + (Ξ φ)(t) (10)

where (a) is obtained by reversing the order of the integrals and computing the new integration boundaries; (b) is obtained by performing the change of variable x = z y; in (c) we apply Ξ(t y) .= R t y 0 ξ(x) dx; and S(t) .= R t 0 s(z) dz is the cumulative immigrant intensity up to time t.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

A cursory observation of Eq. (10) shows a distinct resemblance with Eq. (9): MBPP s compensator Ξ(t) follows a very similar self-consistent equation as its intensity ξ(t), with the compensator s value at time t being a function of its values at every previous time τ < t, decayed by the corresponding kernel function value. The similarity in the functional form of the two equations implies that similar methods can be applied to deduce their analytical solutions, detailed in the next section. Computational requirements. Both MBPP s compensator and Hawkes process intensity functions are quadratic to compute. The Hawkes process intensity function (unraveling the summation in Eq. (3)) is quadratic in the number of events in the history Ht. The MBPP s compensator function is quadratic in the number of interval-censored intervals (see Section 4). However, calculating MBPP s compensator function is more computationally eﬀective than computing the intensity function of a Hawkes process. Typically, the number of observational intervals is signiﬁcantly smaller than the number of events.

3.3 Solving the Mean Behavior Poisson equation

To ﬁt the MBPP (for both the event time and interval-censored setups), we need to have its intensity available in a tractable form. In this section, we propose two methods to solve Eq. (9), for arbitrary s(t) and φ(t) functions. This type of equation is known as a Volterra integral equation of the second kind (Arfken, 1985), which admits a solution only in particular cases. The ﬁrst method is based on the Laplace transform and has the downside of applying only to functions for which the transform and its inverse exist. The second method is a novel solution to the Volterra equation of the second kind, which employs notions of distribution theory ﬁrst to compute the system s impulse response and afterward the solution. Laplace transform. One natural approach to solving the MBPP equation is using the Laplace transform. Let L{ } and L 1{ } denote the Laplace and the inverse of the Laplace transform respectively.

Theorem 1 Given the functions s(t) and φ(t), then

ξ(t) = L 1 L {s(t)} (ω) 1 L {φ(t)} (ω)

is a closed-form solution for Eq. (9) if the Laplace and inverse Laplace transform exist.

The derivation of Eq. (11) is shown in Appendix A. This Laplace transform-based method has several drawbacks. First, it requires that the Laplace transform exists for both s(t) and φ(t). This requirement is not always fulﬁlled by the exogenous intensity functions s(t) that we construct in Deﬁnition 12. These involve Dirac delta functions, for which the Laplace transform does not exist in the classical sense. Note that, while there are extensions for the Laplace transform to generalized functions (we refer the curious reader to van Dijk (2013, Ch. 8)), their direct application to solving Eq. (9) is not obvious, and they are outside the scope of this work. Second, the method requires that the inverse Laplace function exists, which does not hold when φ involves a power-law or a Rayleigh time decay function. This is particularly limiting since the power-law kernel (φ(t) = κ(t + c) (1+θ)) has been shown to be the best

Interval-censored Hawkes processes

performing in applications involving social media data (Rizoiu et al., 2017b; Mishra et al.,

2016); the Rayleigh function (φ(t) = κ t

2θ2 ) is widely used in epidemics modeling as it allows for a period of increasing intensity corresponding to disease incubation (Unwin et al., 2021). Lastly, this method involves re-computing Eq. (11) every time that s(t) changes with all the diﬃculties stated above occurring at each recomputation rendering the method diﬃcult to apply when the s(t) changes often, or when it is not known before-hand. The impulse response solution. We propose a novel method to solve the MBPP equation in Eq. (9), which deals with the shortcomings of the Laplace transform-based method. We ﬁrst show that the Eq. (9) deﬁnes a linear, continuous-time, time-invariant system (Phillips et al., 2003), and we discuss a method to derive its impulse response solution (Theorem 2). Next, we show how to use the impulse response to easily compute the solution for any arbitrary s(t) (Corollary 3).

Theorem 2 Let H be a Hawkes process whose behavior is deﬁned by the kernel function φ(t) satisfying R 0 φ(t)dt < 1, and let M be the mean behavior Poisson process (MBPP) associated (Eq. (9)) with H, with the event intensity denoted by ξ(t), and deﬁned in Eq. (9). Then the intensity of M deﬁnes a causal, linear, continuous-time, time-invariant system of input s(t) and output ξ(t). The system is completely characterized by its impulse response deﬁned by

E(t) = δ(t) + h(t), where h(t) =

where n signiﬁes n times convolution.

Proof We structure the proof into two parts. In the ﬁrst part, we show that MBPP is an LTI system, and in the second part, we derive its impulse response. For the ﬁrst part of the proof, Rizoiu et al. (2017b, Corollary 2.2) have shown that Eq. (9) deﬁnes a causal, linear, continuous-time, time-invariant system of input s(t) and output ξ(t). For completeness reasons, we outline this proof here. It is easy to see that linearity holds by multiplying both sides of Eq. (9) by the same constant. The time-invariance property states that the response to a time-delayed input is identical and similarly time-delayed: if s(t) ξ(t) then s(t t0) ξ(t t0). Using Eq. (9) we compute the response for s(t t0) as follows:

ξ(t t0) = s(t t0) + Z t

0 ξ(τ)φ(t t0 τ) dτ

(a) = s(t ) + Z t +t0

0 ξ(τ)φ(t τ) dτ

(b) = s(t ) + Z t

0 ξ(τ)φ(t τ) dτ + Z t +t0

t ξ(τ)φ(t τ) dτ | {z } =0

(c) = s(t ) + Z t

0 ξ(τ)φ(t τ) dτ

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

where at (a) we make a change of variable t = t t0; at line (b) we write the integral into two parts, i.e., (0, t) and (t , t + t0); at line (c) we observe that φ(t) is a causal function, i.e., φ(t) = 0 for t < 0, or φ(t τ) = 0 for τ > t , and the integral term from t to t +t0 disappears. ξ(t) is also a causal function, which together with the linearity and timeinvariance properties renders the MBPP intensity system a causal, linear time-invariant (LTI) system. In the second part of this proof, we concentrate on the impulse response function corresponding to the MBPP system. The impulse response of a dynamic system is its output when presented with a unit impulse input signal. The impulse response completely characterizes the system, as the response to a linear combination of time-delayed impulses is a linear combination of time-delayed impulse responses. For continuous time dynamic systems, the impulse is denoted by the Dirac delta function δ(t). We denote as E(t) the impulse response of MBPP, with E(t) = δ(t) + h(t). From Eq. (9), it follows that

E(t) = δ(t) + Z t

0 E(τ)φ(t τ) dτ

= δ(t) + (E φ)(t)

(a) = δ(t) + ((δ + h) φ)(t)

= δ(t) + (δ φ)(t) + (h φ)(t)

(b) = δ(t) + φ(t) + (h φ)(t) | {z } h(t)

where (a) is obtained by substituting E(t) with δ(t) + h(t); (b) is obtained by noticing that the Dirac δ function is the neutral element for the convolution operation. Consequently and by dropping the time notation, we obtain

h = φ + h φ

= φ + (φ + h φ) φ

= φ + φ 2 + h φ 2

= φ + φ 2 + φ 3 + h φ 3

For the above equation to hold, the inﬁnite sum must converge. That is, we require that lim n φ n = 0. Observe that

0 φ n(t)dt (a) lim n

where for (a) we use Young s convolution inequality on φ(t) under the L1 norm R 0 ( )dt, and for (b) we use R 0 φ(t)dt < 1. Since lim n R 0 φ n(t)dt 0, we have lim n R 0 φ n(t)dt = 0.

Interval-censored Hawkes processes

By L1 convergence, we then have lim n φ n = 0.

Corollary 3 Let M be a mean behavior Poisson process associated with a Hawkes process of kernel φ(t), and let E(t) = δ(t) + h(t) be its impulse response function. Let s(t) be an arbitrary generalized function denoting an external stimuli. Then ξ(t) = (E s)(t) is a solution to the MBPP equation (Eq. (9)).

ξ = E s = (δ + h) s = δ s + h s

(a) = s + E φ s = s + (E s) φ

where (a) is obtained by replacing h = E φ, knowing that E = δ + h and that E is the impulse response, i.e. a solution to E = δ + E φ.

For example. Suppose s(t) = κθ + (u0 κθ)e θt (12)

and φ(t) is exponentially time-decaying. These particular forms for the s(t) and φ(t) were proposed by Dassios and Zhao (2013), for which they also ﬁnd a closed-form solution. We deﬁne the exponential kernel as

φ(t) = κθe θt Jt > 0K. (13)

Following Theorem 2 we ﬁrst compute the impulse response as E(t) = δ(t) + h(t), with h L1(R) given by h = P n=1 φ n. For t 0 and φ deﬁned in Eq. (13) we have

φ 2 = (φ φ)(t)

0 κθe θτκθe θ(t τ) dτ

= κ2θ2e θtt

and one can easily show that

φ n(t) = κnθne θt tn 1

(n 1)! (14)

By summing up, we compute h(t) as

n=1 φ n(t) = e θtκθ

= e θtκθekθt

= κθe(κ 1)θt, for t 0

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Since φ(t < 0) = 0, we have h(t < 0) = 0. It follows that the impulse response for MBPP with the exponential kernel deﬁned in Eq. (13) is

E(t) = δ(t) + κθe(κ 1)θt Jt > 0K (15)

From Corollary 3 it immediately follows that the solution for Eq. (9) with s(t) = κθ +(u0 κθ)e θt is

ξ(t) = (E s)(t) = s(t) + (h s)(t)

1 e (1 κ)θt + u0e (1 κ)θt (16)

The full details of the derivation (and derivations of other exogenous functions) are shown in Appendix B.1. Note that the inﬁnite convolutions sum has a closed-form solution only for particular kernel functions (such as in Eq. (14)). For all the other kernels, one can apply an approximation based on ﬁnite sums. For example, knowing that lim n φ n = 0 when R 0 φ(t)dt < 1

(see above), we can use the approximation h PT n=1 φ n, where T controls the approximation error and the computation speed. Note also that an analytical solution for the convolution h h does not exist for most kernels, requiring numerical solutions.

0 5 10 15 20 25 30

0 2 4 6 8 10

Rectangle Exogenous Function

Hawkes process mean behavior Poisson exogenous function

0 5 10 15 20 25 30

0 1 2 3 4 5

Dassios Exogenous Function

0 5 10 15 20 25 30

0 5 10 15 20 25 30 35

Sine Exogenous Function

Figure 3: The intensity of 10,000 simulated realizations of a Hawkes process with exogenous function as a rectangle response, Dassios response (Dassios and Zhao, 2013), and sine response. Here the red part is the exogenous function both in the Hawkes process and MBPP. The blue line is the MBPP intensity, the solid black line is the mean intensity of the corresponding Hawkes process, and the black dashed lines indicate the 95% conﬁdence interval.

4. Interval-censored point processes

In this section, we discuss ﬁtting MBPP in the interval-censored setting (detailed in Section 4.1), also known in some areas of the literature as panel count data (Sun and Zhao,

Interval-censored Hawkes processes

2013; Ding et al., 2018; Moreno et al., 2020). The Hawkes process generally cannot be ﬁt in the interval-censored setting as it does not have the independent increment property. Given the correspondence in parameters between MBPP and the Hawkes process, we can instead ﬁt an MBPP using the interval-censored log-likelihood (IC-LL) as the MBPP does have the independent increment property. Finally, we present a method for numerically calculating the MBPP compensator function, which is needed to calculate the IC-LL loss function.

4.1 Interval-censored Hawkes processes

Suppose that we have the set of observation times O = {oi | i = 0, . . . , m, oi T} which partitions into non-overlapping segments the time extent that the point process is observed t (0, T]. Without loss of generality, we assume that the time points in O are ordered by index, o0 = 0, and om = T. Further, for each segment deﬁned by the half-open interval of consecutive points (oi 1, oi] we observe the number of events which occurs in the interval, denoted by C(oi 1, oi]. We deﬁne this setting as the interval-censored setting. Fig. 4 illustrates the interval-censored setting for a simulated Hawkes process realization. Panel A shows the realization as event times; panel D shows the same Hawkes process realization as interval-censored. If we want to estimate the parameters of a Hawkes process in this setting, we cannot use the standard point process log-likelihood in Eq. (4) as it is deﬁned for event times. Instead, we aim to maximize the joint probability of the observed event volumes in each interval:

max θ P(N(oi 1, oi] = C(oi 1, oi], i {1, . . . , m}), (17)

where N(oi 1, oi] = N(oi) N(oi 1) is deﬁned as the number of events in a realization on the half-open interval and θ are the Hawkes process s parameters. However, there are several complications for computing Eq. (17) for Hawkes processes. First, unlike the homogeneous and homogeneous Poisson processes, the Hawkes process does not have the independent increments property. As a result, we cannot evaluate Eq. (17) as we did in Eq. (1), i.e., the product of disjoint Poisson distributions with the expected number of events set to the point process compensator. An alternative method for breaking down the Hawkes process is to decompose the process into the event count of each generation of oﬀspring, formally known as a Galton-Watson branching process (Hawkes and Oakes, 1974). This provides a decomposition of Borel distributions (Borel, 1942). Kong et al. (2020b) derive the closed-form solution for the case where the oﬀspring distribution is evaluated at t . However, there is no known analytical solution for computing the Borel distributions at ﬁnite times. A Monte-Carlo solution is available, but the computation cost does not scale well (O Brien et al., 2020). Further, there is no known analytic or numeric solution for computing the conditional factorization of the joint probability of all oﬀspring generation Borel distributions. Thus, there is no closed-form solution for a Hawkes process likelihood function in the interval-censored setting in the general case. Using the MBPP equivalence. Given the above diﬃculties, we propose in this section a solution based on the MBPP. In a nutshell, we replace the Hawkes process with an MBPP of equivalent parameters and ﬁt the MBPP parameters in the interval-censored setting. Finally, we use the obtained MBPP parameters to approximate the Hawkes parameters. Our approach has two sources of information loss (that we study on synthetic

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

3 3 2 4 F 3 2

Figure 4: The same point process realization is shown in six scenarios of data presentation from event times, to separable event times and interval censored data. Each row corresponds to a setting in Table 2. Red components correspond to exogenous events; and blue components correspond to endogenous events. Black color indicates that we cannot distinguish between exogenous and endogenous. oi denote the observation times based on which the data is interval-censored. The lollipops markers represent event times, while the shaded areas represent event counts within intervals.

data in Section 7), which prevent obtaining the exact original Hawke parameters. First, the information contained in the interval-censored version of the data when the observation periods are not inﬁnitesimal in size will necessarily have less information than the original event time data. Second, while having parameter equivalence, MBPP and Hawkes are diﬀerent models for example, MBPP is not a self-exciting model. As a result, the ﬁtted MBPP parameters can only be approximations of the true Hawkes parameters.

Interval-censored Hawkes processes

4.2 Negative Poisson log-likelihood: Interval-censored loss

Unlike the Hawkes process, the MBPP introduced in Section 3 is an non-homogeneous Poisson process. Thus the MBPP has the independent increment property, allowing Eq. (17) to be evaluated as the likelihood of disjoint Poisson distributions (Daley and Vere-Jones, 2003, Chapter 2.4). The corresponding negative log-likelihood function of the MBPP is dubbed as the interval-censored log-likelihood (IC-LL). The IC-LL was used with non-homogeneous Poisson processes by prior literature (Ding et al., 2018). In this work, we use IC-LL as a building block to estimate the parameters of Hawkes models (via MBPP) and to build new loss functions via the Bregman divergence interpretation (see Section 5.1). The Poisson distributions which compose the negative Poisson log-likelihood are deﬁned with respect to the MBPP compensator function M(t), as per Eq. (1). Similar to the counts N(t), deﬁne M(oi 1, oi] = M(oi) M(oi 1). Note that M(t) is stochastic, as it is the counting process of the MBPP. Using this joint distribution, we can calculate the negative log-likelihood for an MBPP by decomposing over each observation interval.

L (θ) = log P(M(oi 1, oi] = C(oi 1, oi], i = 1, . . . , m)

i=1 P(M(oi 1, oi] = C(oi 1, oi])

Ξ(oi 1, oi]C(oi 1,oi]

C(oi 1, oi]! exp( Ξ(oi 1, oi])

i=1 (C(oi 1, oi] log Ξ(oi 1, oi] log(C(oi 1, oi]!) Ξ(oi 1, oi])

i=1 Ξ(oi 1, oi]

i=1 C(oi 1, oi] log Ξ(oi 1, oi] +

i=1 log(C(oi 1, oi]!).

Notably, as we are aiming to optimise a parameter set θ of MBPP, the constant which only depends on the event counts Pm i=1 log(C(oi 1, oi]!) can be ignored. As such, we further rewrite the above loss function to the following equivalent loss function when minimised.

Proposition 4 The interval-censored log-likelihood (IC-LL) for an MBPP with corresponding compensator Ξ under interval-censored data is

LIC LL(θ) =

i=1 Ξ(oi 1, oi; θ]

i=1 C(oi 1, oi] log Ξ(oi 1, oi; θ]. (19)

Note that partial similar results exist in literature. For example, (Ding et al., 2018, Sec. 2.1) derive a similar log-likelihood of a multivariate non-homogeneous point process. In this work, we further link IC-LL with the Bregman divergence (see Section 5.2), and we show how other loss functions can be obtained by varying the generator function.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

4.3 Calculating the negative Poisson log-likelihood

The loss functions proposed above both require the computation of the compensator Ξ(oi 1, oi], as deﬁned in Eq. (10). From the deﬁnition of the compensator over an interval, we have

Ξ(oi 1, oi] .= Z oi

oi 1 ξ(s) ds

0 ξ(s) ds Z oi 1

= Ξ(oi) Ξ(oi 1).

We explore the two cases where we either have an analytical solution for the MBPP compensator or when we have to use numeric approximation to evaluate it. Analytical solution. As we have foreshadowed in Section 3.2, the compensator of MBPP follows the same functional form as the MBPP intensity function ξ. Thus, we can use the same analytical methods for solving the compensator evaluated at a point. Speciﬁcally, we can apply Corollary 3 to derive an analytical solution:

Corollary 5 (to Theorem 2) Let φ(t) be the kernel of a Hawkes process and E(t) = δ(t) + h(t) be its impulse response function, then for S(t) a closed-form solution for the MBPP compensator over a half-open interval (x, y] is

Ξ(x, y] = (E S)(y) (E S)(x). (21)

Proof Follows immediately from application of Eq. (20) and Corollary 3.

Numerical approximation. The cases in which the compensator Ξ(oi 1, oi] cannot be solved analytically are generally due to the lack of a closed-form solution for the h term in Theorem 2. While we can approximate h using a ﬁnite sum of numerical convolutions (see discussion in Section 3.3), here we consider a numeric lower bound approximation of the MBPP compensator see discussion at the end of this section for additional advantages. Speciﬁcally, we approximate the MBPP counting process with a set of approximation points {dj}D j=0, where d0 = 0, dj 1 < dj, and 0 < dj < T for j {1, . . . , D}. Consider the following lower-bound on the MBPP counting process using D approximation points:

M(t) M D(t) .=

j=1 M(dj 1, dj] Jdj < t d DK. (22)

Considering that the event count M(t) is a monotonically increasing function, the lower bound M D(t) is intuitively a piecewise constant function, which gets updated at each dj, and is constant in between approximation points. For equidistant approximation points

it can be shown that we can become arbitrarily accurate to M(t) as the number of approximation points increases.

Interval-censored Hawkes processes

Proposition 6 As D , then M D(t) M(t).

Using this lower bound counting process, we can approximate the compensator of MBPP, which follows a self-consistent equation (as per Eq. (10)). We consider the following formulation of the compensator which uses the equivalence between compensator and expected counts Ξ(t) = E[M(t)] for non-homogeneous Poisson processes:

Ξ(t) = S(t) + Z t

0 Ξ(t y)φ(y)dy

= S(t) + Z t

0 Ξ(y)φ(t y)dy

= S(t) + Z t

0 E[M(y)]φ(t y)dy.

We now deﬁne the following lower-bound compensator using the counting process approximation M D(t):

Ξ (t) .= S(t) + Z t

0 E[M D(y)]φ(t y)dy

= S(t) + Z t

j=1 M(dj 1, dj] Jdj < y d DK

j=1 E [M(dj 1, dj]] Z t

0 Jdj < y d DKφ(t y)dy

(a) = S(t) +

j=1 E [M(dj 1, dj]] Z min(t,d D)

dj φ(t y)dy,

where D(t) = max{i {1, . . . , D} : di < t}. Step (a) follows from simultaneously changing the integral upper-bound and the summation bound with respect to the Iverson brackets. An upper-bound approximation equivalent is given by

Ξ+(t) = S(t) +

j=1 E [M(dj 1, dj]] Z min(t,d D)

dj 1 φ(t y)dy. (25)

The negative Poisson log-likelihood loss function requires to compute the compensator over intervals Ξ(oi 1, oi], which we approximate using the compensator lower-bound Ξ D:

Ξ(oi 1, oi] Ξ= D(oi 1, oi] = Ξ D(oi) Ξ D(oi 1). (26)

By using Eq. (24), we can compute this approximation as follows,

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Proposition 7 The numeric approximation of the compensator over an interval Ξ(oi 1, oi] Ξ= D(oi 1, oi] can be computed as

Ξ= D(oi 1, oi] = S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] Z oi dj

oi 1 dj φ(y)dy

j= D(oi 1)+1 E [M(dj 1, dj]] Z oi 1 dj

The proof of Proposition 7 can be found in Appendix A. Leveraging Proposition 7 has several advantages. First, it allows approximating the compensator s value for functional forms that do not allow for closed-form solutions of the inﬁnite convolution proposed in Section 3.3. Second, it allows leveraging observed history (the number of observed events in past time intervals) to compute future compensator values. We use the latter property in our real-world experiments in Section 8.

5. Bregman generalization of IC-LL

Previously in Section 4, we introduced the NPLL loss function for ﬁtting in interval-censored data settings. This section shows that the IC-LL loss function can be interpreted as a Bregman divergence a generalized notion of distances. We consider alternative loss functions that arise when changing the convex generator function of the Bregman divergence. This leads to a generalization and correction of the HIP loss function (Rizoiu et al., 2017b). In particular, we present speciﬁc conditions in which the generalization simpliﬁes to the HIP loss function.

5.1 Bregman divergences

We ﬁrst begin with a brief introduction of Bregman divergences. Bregman divergences are a widespread tool for designing and analyzing algorithms within machine learning. Bregman divergences have been used to construct generalizations of algorithms to use diﬀerent deﬁnitions of distances, i.e., clustering (Banerjee et al., 2005), boosting (Collins et al., 2002), and computational geometry (Nielsen et al., 2007). Similarly, we will use Bregman divergences to generalize IC-LL by interpreting the loss function as the generalized Kullback-Leibler divergence (KL-divergence) between compensator values and interval-censored event counts. A Bregman divergence is a generalized notion of distances between points. Suppose we are given a strictly convex generator ϕ : S R, where dom(ϕ) = S is a convex set S Rd and ϕ( ) is diﬀerentiable on the relative interior ri(S). Then the Bregman divergence Bϕ : S ri(S) [0, ) is deﬁned as (Banerjee et al., 2005):

Bϕ(x, y) .= ϕ(x) ϕ(y) (x y) ϕ(y). (28)

A wide selection of popular divergences can be obtained by picking the appropriate generator function. In particular, we consider two speciﬁc convex generators for Bregman divergences.

Interval-censored Hawkes processes

KL-divergence. The KL-divergence can be obtained by considering the generator function

i=1 xi log xi, (29)

yielding, for x, y R+,

KL(x, y) .= Bϕ(x, y) =

i=1 xi log xi

yi 1 (x y). (30)

Squared loss. The squared loss function is obtained if we consider a squared Euclidean norm generator function ˆϕ(x) .= x 2, (31)

yielding SSE(x, y) .= B ˆϕ(x, y) = x y 2. (32)

5.2 Connections to IC-LL

Given a single observation sequence, the IC-LL loss function is equivalent to the generalized KL-divergence with an additional constant factor. As a result, IC-LL is equivalent to the KL-divergence under minimization. First, let us deﬁne C .= (C(o0, o1], . . . , C(om 1, om]) and Ξ(θ) .= (Ξ(o0, o1], . . . , Ξ(om 1, om]) as an observed volume of interval-censored events and corresponding interval compensator values using parameters θ, respectively. Starting from a simple observation, we express the IC-LL using the KL-divergence.

Proposition 8 The IC-LL loss function is the given by

LIC LL(θ) = KL(C, Ξ(θ)) + Γ(C), (33)

where Γ(C) is a constant only dependent on the observed events. Furthermore,

arg min θ LIC LL(θ) = arg min θ KL(C, Ξ(θ)). (34)

The KL-divergence term in Eq. (33) can be rewritten as a Bregman divergence as follows:

Bϕ(C, Ξ(θ)) + Γ(C), (35)

where ϕ is given by Eq. (29). Thus the natural question arises of what happens when we switch the choice of Bregman divergence. We now consider switching the KL-divergence generator function to that of the squared loss function given by Eq. (31). This gives a sum of the squared error loss function:

LSSE = SSE(c, Ξ) + Γ(C)

i=1 (C(oi 1, oi] Ξ(oi 1, oi; θ])2 + Γ(C). (36)

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

This SSE loss function is actually a generalized version of the HIP loss function (Rizoiu et al., 2017b); diﬀering in only that the intensity function ξ is used instead of Ξ and the additional constant Γ(C). A notable connection is the bijection between (regular) Bregman divergences and (regular) exponential families (Banerjee et al., 2005, Theorem 6). In particular, the choice of Bregman divergence dictates what distribution the data is assumed to be sampled from for each observational interval. For example, by choosing the KL-divergence Eq. (33), each observational interval is assumed to be Poisson distributed, giving the IC-LL loss function. By considering SSE divergence Eq. (36), we are assuming that the event volume is Gaussian distributed; with constant standard deviation. One way of establishing a connection between the two loss functions is by considering the SSE loss function as an approximation of the KL-divergence, where each interval is Poisson distributed. Concretely, let us deﬁne a Poisson distribution with true rate Ξ(oi 1, oi] for an arbitrary event interval. Then suppose we have N many i.i.d. volume observations over this interval C1(oi 1, oi], . . . , CN(oi 1, oi]. It follows that given the empirical mean of these observations ˆC(oi 1, oi], by the central limit theorem, in the limit we have that

ˆC(oi 1, oi] N(Ξ(oi 1, oi], Ξ(oi 1, oi]/N)

as a Poisson distribution has variance equal to its rate. Thus, the averaging of event sequence provides an SSE loss function by the bijection between Bregman divergences and exponential families (Banerjee et al., 2005, Theorem 6). Consequently, we expect the KL loss and SSE loss functions to perform similarly when the number of samples is large. Notably, optimal compensator parameters for our Bregman loss functions are sensitive to the generator function chosen. This contrasts the unconstrained minimization of the expectation of Bregman divergence, which was shown to have a unique minimum invariant of the generator function (Banerjee et al., 2005, Proposition 1). Thus the functional form of the compensator (the chosen triggering kernel) can be considered an additional constraint to the Bregman divergence minimization problem.

5.3 HIP as an MBPP approximation

We note that the SSE loss function given by Eq. (36) is similar to the HIP loss function Rizoiu et al. (2017b). This section shows that the SSE loss function is a generalization of the HIP loss function, which is not theoretically justiﬁed in the general case. In particular, it incorrectly attempts to match the point process intensity with interval-censored volumes. However, by using the framework of Bregman divergence losses presented in Section 5.1, we prove that the HIP loss function is equivalent to an SSE loss function with three additional assumptions: (1) that observation intervals are unit sized; (2) the MBPP intensity function is constant over the interval; and (3) a discrete convolution approximation is used on the intensity function. One simple, practical method of using Hawkes processes with interval-censored data is to convert intervals into event times through uniformly sampling within each interval. The average intensity (over multiple samples of intervals) would amount to the MBPP intensity function (by Eq. (9)). In essence, this simpliﬁcation assumes that events are equally likely in the intervals; thus, the simpliﬁcation is equivalent to assuming that the underlying MBPP intensity function is constant over each interval.

Interval-censored Hawkes processes

In addition, by assuming that we have unit time observation intervals, we can show that the compensator function over that interval is equivalent to the intensity function of the MBPP process.

Lemma 9 Suppose that the MBPP intensity function is constant over unit interval (i 1, i], where i Z. Then Ξ(i 1, i] = ξ(i).

Proof As the MBPP intensity function is constant over the interval, let c the value it obtains. Thus,

Ξ(i 1, i] = Z i

i 1 ξ(s) ds = Z i

i 1 c ds = c = ξ(i).

Lemma 9 shows that under the speciﬁc conditions we have speciﬁed, the MBPP intensity function can be considered instead of the MBPP compensator which can be cumbersome to calculate/approximate for certain kernel functions chosen (as shown in Section 4.3). If, in addition, we use the SSE loss function, we recover the exact HIP loss function.

Theorem 10 Suppose that observation intervals are unit length with constant MBPP intensity. Then the SSE loss function in Eq. (36) with the MBPP intensity approximated with discrete convolution recovers the HIP loss function (Rizoiu et al., 2017b, Eq. (6)), i.e.,

i=1 (ξ(i) C(i 1, i])2. (37)

Proof Given the observation intervals are unit length and Lemma 9, the SSE loss function in Eq. (36) is given by

SSE(c, Ξ) =

i=1 (C(i 1, i] Ξ(i 1, i])2 =

i=1 (C(i 1, i] ξ(i))2,

which when minimised, is equivalent to Eq. (37).

With the HIP loss function recovered, to recover the HIP model we consider an MBPP with power-law kernel function. Additionally, we also approximate the intensity function by discrete convolution. This approximation step changes the integral in Eq. (9) to a summation

0 ξ(i)φ(t τ) dτ s[i] +

τ=1 ξ[i τ]φ[τ], (38)

where the square brackets indicate that the function is constant over the interval (i 1, i]. Leaving the choice of kernel aside, the approximation via discrete convolution provides an MBPP intensity function which adheres to the assumption in Theorem 10. Thus the entire HIP model can be realised using the loss function framework proposed in this section, generalising Eq. (7).

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Theorem 11 Suppose that we have unit length interval-censored event volumes. Suppose that we have an MBPP with power-law kernel and that the MBPP intensity is approximated with a discrete convolution, Eq. (38). Then by using the SSE loss function, Eq. (36), we have the HIP model from Rizoiu et al. (2017b).

Proof An MBPP with power-law kernel and discrete convolution approximation is equivalent to the MBPP introduced in (Rizoiu et al., 2017b, Eq. (4)). By deﬁnition of the discrete convolution in Eq. (38), the intensity function is constant over each interval (i 1, i]. Thus by Theorem 10, the SSE loss function (which uses the MBPP compensator) is equivalent to the HIP loss function (Rizoiu et al., 2017b, Eq. (6)).

Theorem 11 shows that our framework can generalise the method proposed by Rizoiu et al. (2017b). In particular, we can consider diﬀerent kernel functions or use the compensator approximation method in Section 4.3 when observation intervals are not uniformly unit length. Importantly, using this framework, we can provide a closed-form solution of HIP when the power-law kernel is replaced with the exponential kernel as the discrete convolution and uniform unit length observations are no longer required to approximate the MBPP compensator.

6. Processes with observed exogenous stimuli

In Sections 3 to 5, we used the underlying assumption that the exogenous events of the point process i.e., the events not generated through self-excitation are solely determined by a corresponding exogenous intensity function s(t). Furthermore, we did not diﬀerentiate between exogenous and endogenous events. However, in many real-world scenarios (including our real-world experiments in Section 8), the exogenous and endogenous are both separable and directly observed; while the underlying functional form of s(t) is unknown. We denote the setting in which the exogenous events can be diﬀerentiated as a separable scenario. Furthermore, in the separable setting, the exogenous and endogenous events can be observed at diﬀerent granularity levels (event time vs. interval-censored). For example, consider the problem of predicting ﬂu trends using Twitter data (Achrekar et al., 2011). The tweets are observed as event times, while ﬂu cases are recorded as aggregated volumes (for patient privacy reasons (Rizoiu et al., 2016)). Table 2 lists the six possible scenarios in which a point process can be observed. Fig. 4 exempliﬁes how the same realization is presented in each of these scenarios. Scenarios A and D are not separable, and events are observed as event times (ET) or interval-censored (IC), respectively (shown in black color). Scenarios B, C, E, and F are separable and deﬁne all the possible combinations of ET and IC for exogenous and endogenous events. For these separable scenarios, Fig. 4 exempliﬁes how the exogenous events (upper red events) and endogenous events (lower blue events) can be observed with either one or both types of events as interval-censored. This section introduces two exogenous functions to approximate s(t), which allow us to characterize the exogenous contribution of a Hawkes process, or MBPP, in the separable scenarios. We introduce the multi-impulse exogenous function and the latent homogeneous Poisson process exogenous function which can be used when the exogenous data is observed

Interval-censored Hawkes processes

Table 2: Tabulation of possible data scenarios. ET = event times and IC = intervalcensored . The last column speciﬁes if an endogenous loss function can be used.

Scenario Setting Point Process

Immigrant Oﬀspring Exog. Func. HP MBPP Loss Func. Endo.?

A - ET Any PP-LL B ET ET Multi-Impulse PP-LL C IC ET LHPP PP-LL D - IC Any IC-LL/SSE E ET IC Multi-Impulse IC-LL/SSE F IC IC LHPP IC-LL/SSE

as event times or interval-censored, respectively. We also build endogenous loss functions that account for only endogenous events, allowing us to ﬁt model parameters in all possible regimes of observed data. Table 2 summarises, for each scenario, the corresponding exogenous function, loss function, whether it is applicable for Hawkes, MBPP, or both, and whether an endogenous loss function should be used.

6.1 Exogenous Event Times Multi-Impulse

Consider a separable scenario in which we observe exogenous events as event times (scenarios B and E in Table 2). That is, we have a set of exogenous event times sz ΨT . We deﬁne the corresponding exogenous intensity as the sum of Dirac delta functions, one for each observed exogenous event, i.e., smδ(t) the multi-impulse function.

Deﬁnition 12 The multi-impulse exogenous function for a set of immigrant event times sz ΨT is deﬁned as smδ(t) = X

sz ΨT δ(t sz). (39)

Notably, the multi-impulse exogenous function is entirely deterministic for a set of immigrant event times and does not have any parameters to ﬁt. Furthermore, smδ(t) is not a proper function, as the Dirac delta function is a generalized function not directly computable, i.e., it typically is computed under an integral. Thus, loss functions which require the multi-impulse exogenous to be directly evaluated are incomputable. For these cases, we introduce in Section 6.3 the endogenous loss functions. We use the multi-impulse exogenous function deﬁned in Deﬁnition 12 in Section 7 for the scenarios where the immigrants are directly observed, and the intensity function generating them is unknown.

6.2 Exogenous Interval-Censored Latent Homogeneous Poisson Process

Here we consider the separable scenarios in which the immigrant events are observed as interval-censored data, with no assumption on the endogenous response of the process (scenarios C and F in Table 2). Let Q = {q0, . . . , qm} be a pre-speciﬁed set of observations times for exogenous events. For each interval (qi 1, qi] we only observe the total immigrant counts S(qi 1, qi], similarly to Section 4. Note that for scenario F (with interval-censored

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

endogenous response), the immigrant observation times qi Q can be distinct from the endogenous observation time oj O, although they are often identical in practice and in our experiments in Sections 7 and 8. The rest of this section builds on the more general case where we do not assume that the immigrant observation times Q are the same as the endogenous events O. To account for this scenario in which the immigrants are observed as interval-censored data, we assume that the volume of events for each observation interval (qi 1, qi] is determined by a Poisson process. In particular, we deﬁne a latent homogenous Poisson process (LHPP) with constant intensity s(t) = λi for each interval; which subsequently deﬁnes a piece-wise constant non-homogenous Poisson process as an exogenous function

i=1 λi Jqi 1 < t qi K. (40)

Notably, the rates λi are free parameters to be chosen. We consider the parameters of maximum likelihood.

Proposition 13 Given a latent homogeneous Poisson process deﬁned over observation intervals (qi 1, qi] and corresponding immigrant event counts S(qi 1, qi], the most likely intensity values are

λi = S(qi 1, qi]

qi qi 1 . (41)

Proof Suppose we have a homogeneous Poisson point process N(t) deﬁned over arbitrary half-open interval (x, y], x y with a realization of C many events occurring on that interval. The number of events over a given interval in a homogeneous Poisson process follows a Poisson distribution of parameter λ(y x). Thus, the most likely intensity value is given by

λ = argmax λ P(N(x, y] = C]; λ)

C! exp( λ(y x))

= argmax λ C log(λ(y x)) log(C!) λ(y x)

= argmax λ C log(λ(y x)) λ(y x).

The function we are maximizing is concave, thus by setting the derivative to zero we have,

λ = C y x. (42)

Thus for each of the intervals (qi 1, qi] and corresponding immigrant event counts S(qi 1, qi] we have Eq. (41).

Thus using the optimal parameters for the latent homogeneous Poisson process, we have the LHPP exogenous function.

Interval-censored Hawkes processes

Deﬁnition 14 Given the interval-censored exogenous volumes S(qi 1, qi] over observation times {q0, . . . , qm} Q, then the latent homogeneous Poisson process (LHPP) exogenous function is

S(qi 1, qi]

qi qi 1 Jqi 1 < t qi K. (43)

If more information is available about the process in which immigrant events are generated, an inductive bias can be added into the point process by changing the distribution of the exogenous function (for example, by considering a non-homogenous Poisson process in each interval). Furthermore, the LHPP exogenous function given in Eq. (43) can also be derived by ﬁrst considering the volume distribution of exogenous events for each interval (qj, qj+1] (see Appendix C). In particular, considering a uniform distribution in Appendix C results in the LHPP exogenous function.

6.3 Endogenous Loss Functions

When the exogenous and endogenous events are separable, both the intensity function and observed data can be separated in to endogenous and exogenous categories. As such, we propose the endogenous loss function, which is deﬁned as only the endogenous contribution of the point process to any loss function. The motivation to do so is two fold. Firstly conceptually, as the exogenous events are already observed we only need to account for the endogenous response of the point process. Secondly computationally, the endogenous loss function does not require us to compute the exogenous function when evaluating the loss function. As a side eﬀect, this allows us to circumvent the incomputability problems of the multi-impulse exogenous function, as per Section 6.1. Suppose that we have a valid point process with corresponding intensity function λ(t) .= λ(t; θ), where θ denotes the parameterisation of the intensity function. Then for a loss function L (θ; λ, Ψ Υ), the intensity function λ( ; θ) is compared and evaluated against both the immigrant events Ψ and the oﬀspring events Υ, whether event times or intervalcensored. Suppose that the data is separable. Then similar to the separability of the data, we can decompose the intensity function into a sum of exogenous and endogenous function components: λ(t; θ) = λexo(t; θexo) + λendo(t; θendo), (44)

where λexo(t) .= λexo(t; θexo) solely characterizes the immigrant events and λendo(t) .= λendo(t; θendo) solely characterizes the oﬀspring events. The exogenous parameters θexo

and endogenous parameters θendo corresponds to decomposition of the parameter θ as a result of Eq. (48).

Hawkes process. An example of a valid decomposition can be found in the deﬁnition of the standard Hawkes process intensity function, Eq. (3). Here, the Hawkes process intensity function is explicitly deﬁned as the addition of an exogenous function s(t) and endogenous intensity contribution P ti<t φ(t ti) determined by the past events and kernel function φ( ).

MBPP. For MBPP, we can deﬁne the exogenous-endogenous function decomposition as

ξ(t) = ξexo(t) + ξendo(t) (45)

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

ξexo(t) = s(t) (46)

ξendo(t) = Z t

0 ξ(y)φ(t y)dy. (47)

Additionally, if a closed-formed solution of the intensity function ξ(t) (as per Section 3.3) and exogenous function ξexo(t) exists, we are able to ﬁnd a closed-formed solution of the endogenous function ξendo(t) = ξ(t) ξexo(t). (48)

Notably, the endogenous function λendo(t) corresponds to the intensity function of a well deﬁned point process. As such, instead of evaluating a loss function on the entire data Ψ Υ and the intensity function λ(t), we can instead evaluate the endogenous intensity function λendo(t) and its corresponding events; which is exactly the endogenous events Υ we can identify from the separability of the data. Thus we deﬁne the endogenous loss function from the original loss function and point process intensity function:

L endo(θendo) .= L (θendo; λendo, Υ). (49)

In particular, for the rest of this paper, we will assume that θexo = , as per the exogenous function we have introduced in this section, i.e., the multi-impulse (Eq. (39)) and LHPP (Eq. (43)) exogenous functions. Thus, we have that the endogenous parameters account for all model parameters of the original point process, θendo = θ.

MBPP + PP-LL. For example, the endogenous loss function of MBPP using the PP-LL loss function, as per Eq. (4), is deﬁned as

L endo(θ) = X

fi ΥT log ξendo(fi) + Z T

0 ξendo(s) ds, (50)

where ΥT is the set of oﬀspring events. Notably, this endogenous loss function is appropriate when the oﬀspring data is in point format (i.e., scenarios B and C).

MBPP + IC-LL. In the scenarios when the oﬀspring data is in an interval-censored format (i.e., scenarios E and F), we can no longer use Eq. (50). Instead, we can consider the endogenous version of an IC-LL loss function introduced in Sections 4 and 5. Consider the Bregman divergence interpretation of a loss function, with convex generator ϕ(z). The endogenous version of the loss function is deﬁned as,

L endo(θ) =

i=1 Bϕ( ˆfi, Ξendo(oi 1, oi]), (51)

where Ξendo(t) = Z t

0 ξendo(s)ds

and ˆfi = F(oi 1, oi].

Here F(oi 1, oi] denotes oﬀspring volumes on the interval (oi 1, oi].

Interval-censored Hawkes processes

Table 2 outlines when using an endogenous loss function is suitable. In addition, experiments in Section 7 show that even across the diﬀerent settings where the endogenous loss function is not required, the performances to the non-endogenous standard loss functions are similar.

7. Synthetic Experiments

In this section, we empirically validate our proposals by reﬁtting a synthetically generated dataset with each of the six scenarios described in Table 2. In Section 7.1, we generate datasets of separable Hawkes realizations. We study in detail two parameter combinations one clearly subcritical (branching factor n 1) and another approaching criticality (n 1) and two exogenous functions. We also perform a broad parameter sweep and ﬁt over the parameter grid to analyze ﬁtting stability. Next, in Section 7.2, we selectively interval-censor types of events in the datasets to render them compatible with the diﬀerent scenarios in Table 2. Last, we present our results in Section 7.3.

7.1 Synthetic Datasets

We generate synthetic datasets usable in all scenarios (A-F in Table 2). We start from two exogenous functions and two sets of Hawkes parameters. We generate separable Hawkes realizations for each of the four combinations i.e., immigrants are distinguishable from the oﬀspring. This allows testing all deﬁned scenarios (for example, to ﬁt in scenario C we hide the exogenous generating function, and we interval-censor the immigrants, more details in Section 7.2). Exogenous functions. We select two functions for s(t) the exogenous events intensity. The ﬁrst function is a piece-wise constant function, inspired from the latent homogeneous Poisson process deﬁned in Section 6.2. We select this function to allow control over the number of exogenous events in each observation interval. We denote the synthetic datasets generated with this exogenous function as HP-pc, and the resulted sampling event intensity is deﬁned as:

S(qj 1, qj]

qj qj 1 Jqj 1 < t qj K + X

ti<t κθe θ(t ti), (52)

where qj QT = {0, 5, 10, 15}, n = 2; and S(q0, q1] = 7, S(q1, q2] = 6, and S(q2, q3] = 8. The second immigrant function is a sinusoidal function, which emulates the seasonality often observed in real-world processes. We denote this dataset as HP-sin, and realizations are sampled from the event intensity:

λ(t) = sin (t) + 2 + X

ti<t κθe θ(t ti) (53)

Hawkes kernel parametrization. For each of the above-deﬁned exogenous function, we consider a Hawkes process with exponential memory kernel φ(t) = κθe θt. We consider two parameter combinations. The ﬁrst combination deﬁnes a Hawkes process in a clearly subcritical regime (i.e. each event spawn signiﬁcantly less than one oﬀspring): κ = 0.6,

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

θ = 0.8, resulting in a branching factor n = 0.6. The second combination is in a nearlycritical regime, where each event spawns, on average, close to one oﬀspring. This results in realizations with more signiﬁcant numbers of events and pronounced clustering eﬀects. We use the parameter set: κ = 0.95, θ = 1.15, resulting in a branching factor n = 0.95. Simulation algorithm. To be usable with all scenarios in Table 2, we generate sequences in which the exogenous and endogenous event times are distinguishable. We use a simulation algorithm that leverages the cluster structure of Hawkes process (Hawkes and Oakes, 1974; Reinhart, 2018), by ﬁrst sampling the exogenous events and then all their direct and indirect oﬀspring. The steps for generating the synthetic event sequences are:

1. First, we sample immigrants from the ﬁxed exogenous function.

2. Second, we obtain oﬀspring by sampling a Hawkes process with no background intensity and with the immigrant as the ﬁrst event. We use rejection sampling (Dassios and Zhao, 2013). We denote the collection of an immigrant together with its oﬀspring as a cascade (Rizoiu et al., 2017b).

3. Finally, we combine all cascades to obtain a sequence, which is a realization of a Hawkes process with the benchmark parameters.

We combine the exogenous and endogenous time points for scenarios where we do not require the separation between immigrants and oﬀspring (i.e., A and D). For scenarios that require one or both dimensions interval-censored, we compute the corresponding volumes by counting the number of events over pre-speciﬁed observation intervals.

7.2 Experimental Setup

Pre-process synthetic realizations. In our experiments in Section 7.3, the aim is to recover the parameters κ and θ of the true model used to generate the synthetic datasets. We test over all scenarios in Table 2, and for each scenario, we use the exact same set of realizations constructed in Section 7.1. Prior to ﬁtting, for each scenario, we perform the following pre-processing on the realizations:

A. We remove the distinction between immigrants and oﬀspring, and we provide the immigrant intensity s(t);

B. We hide the immigrant intensity s(t), and we observe immigrants and oﬀspring separately;

C. We interval-censor the immigrants, and we observe oﬀspring as event times;

D. We remove the distinction between immigrants and oﬀspring, we interval-censor the resulting event stream, and we provide the immigrant intensity s(t);

E. We hide the immigrant intensity s(t), we interval-censor the oﬀsprings, and we observe immigrants (event times) and oﬀspring (interval-censored) separately;

F. We hide the immigrant intensity s(t) and we individually interval-censor the immigrant stream and the oﬀspring stream

Interval-censored Hawkes processes

Models and loss functions. For the scenarios where the oﬀspring events are observed as event times (scenarios A-C), we ﬁt both Hawkes and MBPP using the PP-LL loss (Eq. (4)). This allows estimating the performances of MBPP to estimate Hawkes parameters. For scenarios in which the oﬀspring are interval-censored (D-F), we only ﬁt MBPP using both IC-LL (Eq. (19)) and SSE (Eq. (36)) loss functions. This allows comparing the performances of the two loss functions. We prefer ﬁtting using the endogenous version of the loss function for the separable settings (B, C, E, F). However, we compare the standard point process and the endogenous version of the loss function for scenario C. Finally, for the non-separable scenarios (A and D), we use the non-endogenous loss functions as there is no distinction between immigrants and oﬀspring.

Closed-form vs. approximate loss functions. We test the ﬁtting performance gap between the closed-form and approximate versions of our models, using the approximation method speciﬁed in Section 4.3. Scenario E is used to test our numerically approximated models when a high number of discretization points are given, additionally presenting both endogenous and non-endogenous versions of the IC-LL and SSE loss function. Additionally, discretization tests which do not involve parameter estimation can be found in Appendix E.1 and further non-endogenous loss functions experiments can be found in Appendix E.2.

Additional setup. For Scenarios A-C, we consider 1, 000 event sequences, which are further split into 50 groups to estimate parameters. For Scenarios D-F, we consider 10, 000 sequences, also split into 50 groups to be jointly ﬁt. The additional volume of sequences is to account for the loss of information from interval censoring. For the scenarios where endogenous events are observed as event time (scenarios A-C), we consider both HP-pc and HP-sin datasets. For interval-censored endogenous events (scenarios D-F), only the more challenging HP-sin datasets are considered to limit computation time. In both cases, the sequences we generate are sampled until time-step T = 30. When we are testing intervalcensored settings, we use uniform observation intervals with 5, 10, 15, 30, 60, and 100 observation periods for example, with 10 observation intervals we calculate volumes over intervals (0, 3], (3, 6], . . . , (27, 30].

In each scenario and model tested we ﬁt 50 of that model corresponding to the 50 groups of synthetic sequences. We then calculate the average of the estimated parameters and compare the aggregated statistic against the true parameters.

7.3 Results

Our investigations in this section have ﬁve goals. First, we test the performance of MBPP for retrieving the Hawkes parameters used to generate the datasets for scenarios A-C (the Hawkes process cannot be used for scenarios D-F). Second, we seek to quantify the performance loss due to increasingly larger data granularity when moving from scenario A to F. Third, we investigate the eﬀects of using the endogenous loss introduced in Section 6.3 for the separable scenarios B, C, E, and F. Fourth, we study the eﬀects of using the lower bound approximation in Section 4.3 for scenarios D, E, F, and the impact of the number of discretization points. Fifth and last, we study the MBPP ﬁtting stability (the accuracy of retrieving generating parameters) over a wide parameter grid.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

7.3.1 Fitting for scenarios A to F

Scenario A. Table 3 shows the results of ﬁtting the Hawkes realizations with both the Hawkes process and MBPP for non-separable data and with knowledge of the parametrization of the exogenous function. Visibly, the parameters ﬁtted with Hawkes are closest to the true parameters, which is expected as the data is generated from a Hawkes model i.e., this is the upper bound of how well parameters can be retrieved on each dataset. Perhaps surprisingly, MBPP has very similar ﬁtting performances, and we only observe a degradation on the HP-sin dataset. This suggests that MBPP can be used as a drop-in alternative for ﬁtting the parameters of Hawkes processes.

Table 3: Synthetic experiments for Scenario A: non-separable data, event times.

True model True Params. Model Fit

κ = 0.6 θ = 0.8

HP κ 0.58 0.043 θ 0.80 0.087

MBPP κ 0.58 0.043 θ 0.78 0.145

κ = 0.95 θ = 1.15

HP κ 0.95 0.015 θ 1.17 0.073

MBPP κ 0.95 0.016 θ 1.21 0.246

κ = 0.6 θ = 0.8

HP κ 0.60 0.019 θ 0.80 0.098

MBPP κ 0.60 0.018 θ 0.75 0.192

κ = 0.95 θ = 1.15

HP κ 0.95 0.008 θ 1.14 0.062

MBPP κ 0.95 0.010 θ 1.22 0.227

Table 4: Synthetic experiments for Scenario B: separable data, exogenous and endogenous event times.

True model True Params. Model

κ = 0.6 θ = 0.8

HP-Endo κ 0.59 0.056 θ 0.81 0.135

MBPP-Endo κ 0.59 0.056 θ 0.8 0.093

κ = 0.95 θ = 1.15

HP-Endo κ 0.94 0.025 θ 1.17 0.067

MBPP-Endo κ 0.94 0.025 θ 1.37 0.228

κ = 0.6 θ = 0.8

HP-Endo κ 0.59 0.024 θ 0.74 0.11

MBPP-Endo κ 0.59 0.024 θ 0.75 0.098

κ = 0.95 θ = 1.15

HP-Endo κ 0.95 0.021 θ 1.07 0.28

MBPP-Endo κ 0.95 0.027 θ 1.1 0.124

Scenario B. We distinguish between immigrants and oﬀspring in this scenario, but we do not know the immigrant intensity. We ﬁt the two synthetic datasets using the endogenous versions of MBPP and Hawkes process with the multi-impulse exogenous function. Table 4 shows the obtained ﬁtted parameters indicating very similar performances for both the Hawkes and MBPP. The MBPP even shows a slight advantage for all combinations of parameter and data sets, except for HP-pc dataset and parameters κ = 0.95 and θ = 1.15. Furthermore, MBPP even shows a lower standard deviation indicating that it is more stable when predicting parameters. These results also show the eﬀectiveness of using the multiimpulse exogenous function in the presence of separable data, together with the endogenous loss function. Scenario C. Table 5 shows the results of ﬁtting in Scenario C, in which immigrants and oﬀspring are separable, immigrants are interval-censored, and oﬀspring are event times. Again, we can conclude that the MBPP and the Hawkes process have very similar performances Similar to Scenario B, the only degradation in approximation quality is for HP-pc with the parameters κ = 0.95 and θ = 1.15. Note that Table 5 also presents results for the non-endogenous version of the loss function, which we discuss in Section 7.3.2.

Interval-censored Hawkes processes

Table 5: Synthetic experiments for Scenario C: separable data, interval-censored exogenous events and endogenous event times.

True model True Params. Model

κ = 0.6 θ = 0.8

HP κ 0.57 0.064 θ 0.74 0.097

HP-Endo κ 0.59 0.056 θ 0.8 0.093

MBPP κ 0.59 0.059 θ 0.83 0.138

MBPP-Endo κ 0.59 0.056 θ 0.81 0.139

κ = 0.95 θ = 1.15

HP κ 0.94 0.025 θ 1.12 0.065

HP-Endo κ 0.94 0.025 θ 1.17 0.067

MBPP κ 0.94 0.025 θ 1.37 0.237

MBPP-Endo κ 0.94 0.025 θ 1.37 0.229

True model True Params. Model

κ = 0.6 θ = 0.8

HP κ 0.59 0.025 θ 0.52 0.08

HP-Endo κ 0.59 0.024 θ 0.75 0.098

MBPP κ 0.59 0.024 θ 0.72 0.138

MBPP-Endo κ 0.59 0.023 θ 0.73 0.107

κ = 0.95 θ = 1.15

HP κ 0.95 0.021 θ 0.99 0.117

HP-Endo κ 0.95 0.021 θ 1.1 0.124

MBPP κ 0.95 0.03 θ 1.07 0.296

MBPP-Endo κ 0.95 0.028 θ 1.07 0.282

Scenario D. In this scenario, we do not distinguish between immigrants and oﬀspring, and we observe the process as interval-censored. We use for the ﬁrst time the novel loss functions introduced in Section 4. I.e., we ﬁt the synthetic data using the IC-LL and SSE loss functions together with the MBPP. Note that we cannot use the standard Hawkes process when the events are observed as interval-censored (see Section 4). Table 6 shows the obtained ﬁtting results for an increasing number of observation intervals. We ﬁt using both the closed-form solution and the approximation introduced in Section 4.3. For the approximation, we use as many approximation points as the number of intervals.

We observe that the closed-form MBPP performs exceptionally well, with no noticeable diﬀerences between IC-LL and SSE loss. In particular, we obtain highly accurate parameter ﬁttings for closed-form models even for small numbers of observations. However, when we use the numeric approximation of compensators, we observe a decrease in the accuracy of the recovered parameters. The degradation is prevalent only for θ, which is consistently overestimated. Interestingly, an increase in observation intervals does not yield sizable improvements for the approximate models. Thus, for this scenario, knowing the functional form of the underlying interval-censored Hawkes process (the exogenous function and kernel function) appears more important than having ﬁner-grained observation periods.

Scenario E. Here, we separately observe immigrants (as event times) and oﬀspring (intervalcensored). We ﬁt using the multi-impulse exogenous function and the SSE and IC-LL loss functions, with both the closed-form solution and the numerical approximation. Table 7 shows that both loss functions achieve highly accurate parameter ﬁttings for the closed-form models, even for low numbers of observation intervals. However, the approximate models require higher numbers of observation intervals to recover the parameters. For κ = 0.6 and θ = 0.8, we require at least 30 observation intervals to converge close to the correct θ value. We need at least 60 intervals for the other parameter setting. Notably, the κ parameter is easier to ﬁt and can be approximated using fewer intervals. We further explore the dependence of parameter estimation for the numerically approximated models in Section 7.3.2.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Table 6: Synthetic experiments for Scenario D: non-separable data, interval-censored events. Note that only MBPP can ﬁt this kind of data.

True Params. Loss Func. Model Number of Observation Intervals

5 10 15 30 60 100

κ = 0.6 θ = 0.8

SSE Closed κ 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.006 0.6 0.006 θ 0.81 0.092 0.81 0.087 0.8 0.07 0.8 0.076 0.8 0.072 0.8 0.075

Approx κ 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.006 0.6 0.006 θ 0.85 0.099 0.85 0.096 0.84 0.077 0.84 0.082 0.84 0.079 0.84 0.081

IC-LL Closed κ 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.006 0.6 0.006 θ 0.82 0.091 0.81 0.086 0.81 0.072 0.8 0.076 0.8 0.073 0.8 0.076

Approx κ 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.007 0.6 0.006 0.6 0.006 θ 0.85 0.098 0.85 0.095 0.84 0.08 0.84 0.083 0.84 0.08 0.84 0.083

κ = 0.95 θ = 1.15

SSE Closed κ 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.005 θ 1.16 0.082 1.16 0.091 1.16 0.065 1.16 0.1 1.16 0.079 1.16 0.088

Approx κ 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.005 θ 1.24 0.093 1.24 0.103 1.24 0.074 1.24 0.114 1.24 0.09 1.24 0.1

IC-LL Closed κ 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.005 θ 1.16 0.074 1.16 0.078 1.16 0.06 1.16 0.09 1.16 0.07 1.16 0.082

Approx κ 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.005 θ 1.24 0.083 1.24 0.089 1.24 0.068 1.24 0.103 1.24 0.079 1.24 0.093

Table 7: Synthetic experiments for Scenario E: separable data, event time exogenous events and interval-censored endogenous events.

True Params. Loss Func. Model Number of Observation Intervals

5 10 15 30 60 100

κ = 0.6 θ = 0.8

SSE Closed κ 0.60 0.006 0.60 0.006 0.60 0.005 0.60 0.006 0.60 0.005 0.60 0.005 θ 0.79 0.055 0.79 0.047 0.79 0.042 0.79 0.039 0.79 0.036 0.79 0.036

Approx κ 10 0 0.62 0.006 0.61 0.006 0.60 0.006 0.60 0.006 0.60 0.006 θ 0.01 0 8.82 0.866 9.55 1.589 1.09 0.091 0.91 0.052 0.86 0.043

IC-LL Closed κ 0.60 0.006 0.60 0.005 0.60 0.005 0.60 0.005 0.60 0.005 0.60 0.005 θ 0.80 0.053 0.80 0.042 0.79 0.037 0.79 0.035 0.79 0.032 0.79 0.032

Approx κ 10 0 0.63 0.006 0.61 0.006 0.60 0.005 0.60 0.005 0.60 0.005 θ 0.01 0 8.22 1.192 9.98 0.103 1.17 0.089 0.94 0.045 0.87 0.038

κ = 0.95 θ = 1.15

SSE Closed κ 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 θ 1.15 0.076 1.15 0.075 1.15 0.073 1.15 0.073 1.15 0.072 1.15 0.072

Approx κ 10 0 1.14 0.007 1.03 0.005 0.96 0.003 0.95 0.004 0.95 0.004 θ 0.01 0 9.41 0.764 10 0 10 0 1.77 0.177 1.43 0.111

IC-LL Closed κ 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 0.95 0.004 θ 1.16 0.068 1.16 0.067 1.16 0.065 1.16 0.065 1.16 0.065 1.16 0.065

Approx κ 10 0 1.17 0.007 1.04 0.005 0.96 0.003 0.95 0.004 0.95 0.004 θ 0.02 0 9.95 0.032 10 0 10 0 1.82 0.164 1.46 0.101

Interval-censored Hawkes processes

Table 8: Synthetic experiments for Scenario F: separable data, interval-censored exogenous and endogenous events.

True Params. Loss Func. Model Number of Observation Intervals

5 10 15 30 60 100

κ = 0.6 θ = 0.8

SSE Closed κ 0.726 0.004 0.682 0.005 0.668 0.005 0.642 0.005 0.62 0.005 0.612 0.005 θ 0.647 0.021 0.641 0.024 0.801 0.031 0.995 0.052 0.915 0.052 0.865 0.045

Approx κ 10 0 0.621 0.006 0.605 0.006 0.601 0.006 0.601 0.006 0.601 0.006 θ 0.005 0 8.823 0.866 9.546 1.589 1.09 0.091 0.909 0.052 0.856 0.043

IC-LL Closed κ 0.729 0.004 0.69 0.005 0.672 0.005 0.642 0.005 0.621 0.005 0.613 0.005 θ 0.642 0.019 0.675 0.023 0.828 0.029 0.997 0.045 0.929 0.044 0.878 0.039

Approx κ 10 0 0.628 0.006 0.611 0.006 0.604 0.005 0.603 0.005 0.602 0.005 θ 0.006 0 8.223 1.192 9.982 0.103 1.174 0.089 0.938 0.045 0.872 0.038

κ = 0.95 θ = 1.15

SSE Closed κ 0.958 0.002 1.496 2.17 1.494 2.171 0.95 0.004 0.95 0.004 0.95 0.004 θ 1.765 0.098 1.362 0.358 1.301 0.344 1.283 0.089 1.216 0.08 1.19 0.076

Approx κ 10 0 1.139 0.007 1.033 0.005 0.958 0.003 0.95 0.004 0.95 0.004 θ 0.014 0 9.409 0.764 10 0 10 0 1.765 0.177 1.433 0.111

IC-LL Closed κ 0.959 0.002 0.953 0.003 0.95 0.003 0.949 0.003 0.95 0.004 0.95 0.004 θ 1.687 0.085 1.501 0.087 1.448 0.092 1.323 0.081 1.239 0.073 1.205 0.069

Approx κ 10 0 1.165 0.007 1.043 0.005 0.959 0.003 0.949 0.004 0.95 0.004 θ 0.019 0 9.953 0.032 10 0 10 0 1.823 0.164 1.456 0.101

Scenario F. This is a separable scenario in which both immigrants and oﬀspring are interval-censored. Table 8 shows the ﬁtting results, and we observe that the quality of parameter estimation improves as we increase the number of observation intervals. Interestingly, the performance improvement is more constant for the IC-LL than SSE, as the number of intervals grows. When using the SSE, the parameters are poorly estimated for 10 and 15 intervals. We also note that the numerical approximation requires between 60 and 100 intervals to approach the true parameters for this scenario. Knowing that these results are directly comparable with those for Scenario E (Table 7), we conclude that ﬁtting in Scenario F is the most challenging as most information gets lost when interval-censoring.

7.3.2 Endogenous loss and number of approximation points

Endogenous vs. non-endogenous loss. Table 5 includes results for both the endogenous and non-endogenous loss function for Scenario C. We see that the endogenous versions consistently outperform the non-endogenous version. This is most prominent in the HP-sin dataset with parameters κ = 0.6 and θ = 0.8, where the two Hawkes process models have drastic diﬀerences in the ﬁtted θ parameter. Table 9 further shows comparisons between endogenous and non-endogenous numeric approximation models for Scenario E, with many approximation points. Here we see little or no improvement between endogenous and nonendogenous when using the SSE loss function. For the IC-LL loss function, we see that the endogenous version is more stable. Together with Scenario F s conclusions (that IC-LL is more stable than SSE), we conjecture that using the endogenous IC-LL is preferable. Number of approximation points. All results previously reported in this section use the same number of approximation points for the numerical approximation as the number of observation intervals. In Table 9, we report on the numeric approximation for Scenario E when 300 approximation points are used irrespective of the number of observation intervals. Intuitively, a larger number of approximation points should yield better estimates of the compensator and better parameter ﬁts, with the cost of higher computation time. We see only a slight improvement for the normal IC-LL loss function as the number of

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Table 9: Large number of discretization points experiment. Scenario E: separable data, interval-censored exogenous and endogenous events.

Loss Func. Model Number of Observation Intervals

5 10 15 30 60 100

SSE Normal κ 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 θ 1.23 0.072 1.23 0.071 1.23 0.071 1.23 0.07 1.23 0.069 1.23 0.069

Endogenous κ 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 θ 1.23 0.072 1.23 0.071 1.23 0.071 1.23 0.07 1.23 0.069 1.23 0.069

IC-LL Normal κ 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 θ 0.95 0.054 0.96 0.071 0.97 0.071 0.98 0.06 1 0.055 1.03 0.055

Endogenous κ 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 0.95 0.003 θ 1.24 0.065 1.24 0.064 1.24 0.064 1.24 0.063 1.24 0.063 1.24 0.063

observation intervals increases. For all other settings, endogenous or not, the estimated parameters are surprisingly stable. This suggests that the performances of the approximation method are dependent on the number of approximation points rather than intervals as the number of approximation points increases, the error quickly decreases and converges (see Appendix E.1).

= Fitted ~ True # Obs. = 10 θ θ

# Obs. = 30

# Obs. = 100

1 0.6 0.2 0.2 0.61

Figure 5: Mean parameter θ ﬁtting error (i.e., ﬁtted θ true θ) over a parameter grid κ [0.1, 0.95] and θ [0.55, 1.5]. Realizations are sampled using the Hawkes process with the sinusoidal exogenous function and ﬁtted using MBPP in Scenario F with 10 (a), 30 (b) and 100 (c) observational intervals. The green shapes highlight the speciﬁc parameters described in Section 7.1: {κ = 0.6, θ = 0.8} (green triangle) and {κ = 0.95, θ = 1.15} (green diamond). For graphics for parameter κ and more observational invervals, see Appendix E.3.

7.3.3 Fitting MBPP with a parameter grid

The synthetic experiments here above only use two parameter combinations deﬁned in Section 7.1. Here, we verify the stability of the MBPP ﬁtting over a wide parameter grid κ [0.1, 0.95] (step 0.05) and θ [0.5, 1.55] (step 0.05). We sample realizations from a Hawkes process with a sinusoidal exogenous function (see Eq. (53)), and we ﬁt MBPP in Scenario F (fully interval-censored).

Interval-censored Hawkes processes

Fig. 5 shows the color map of the ﬁtting error of pameter θ over the grid. In every panel, we highlight the two speciﬁc parameter combinations used in the synthetic experiments sections above: {κ = 0.6, θ = 0.8} (green triangle) and {κ = 0.95, θ = 1.15} (green diamond). Parameter κ is consistently retrieved with only minor errors (see Appendix E.3). We notice that, for high values of θ the ﬁtting under-estimates the true θ for low values of κ (bottom-right corner) and over-estimates it for high values of κ (top-right corner). Note that high θ corresponds to a faster decay rate for the Hawkes kernel φ(t), which typically translates to oﬀspring events being tightly clustered close to their parent event. As θ decreases (slower kernel decay), the parameter is increasingly well recovered. As expected, the ﬁtting error decreases as the data granularity (i.e., the number of observation intervals) is increased from 10 (Fig. 5a), to 30 (Fig. 5b) and 100 (Fig. 5c). For more data granularities, see Appendix E.3. Finally, for high data granularity (100 intervals in Fig. 5c), we observe a slight over-estimation for θ due to the model mismatch between the generating model (Hawkes process) and the ﬁtting model (MBPP).

8. Real World Experiments

We evaluate our proposed MBPP and its variants on ACTIVE (Rizoiu et al., 2017b), a realworld social media dataset containing views to You Tube videos and the tweets that mention them. The task is to predict the future popularity of You Tube videos (their future views) by accounting for the exogenous inﬂuence of tweets and the endogenous ampliﬁcation within You Tube (for example, due to word-of-mouth processes). Both the exogenous (immigrants) and the endogenous (oﬀspring) are in interval-censored format counts per day. Therefore, we ﬁt using Scenario F (also denoted here below as IC-IC) on a subset of days, and we forecast the future. We compare our MBPP to the HIP process presented in Rizoiu et al. (2017b), the current state of the art in predicting popularity under exogenous inﬂuence. The predicted number of views produced by each model is evaluated using average percentile error (APE) and symmetric mean absolute percentage error (s MAPE).

8.1 Dataset and prediction setting

Dataset. We use the ACTIVE dataset introduced by Rizoiu et al. (2017b) and further studied in (Wu et al., 2019, 2020). The dataset contains information about more than 14,000 You Tube videos published between 2014-05-29 and 2014-12-26, and the corresponding tweets that share the video in the same time range. For each video, we consider the ﬁrst 120 days of views and tweets. We ﬁt model parameters on the ﬁrst 90 days and we use the later 30 days for evaluation. We denote the number of tweets at day i for a video as [#Tweets on day i] and the number of You Tube views for a video at day i as [#Views on day i]. Prediction setting. We replicate the evaluation setup introduced in (Rizoiu et al., 2017b). To evaluate our models, we predict the number of times a You Tube video is viewed from day 91 to 120. We consider that the number of tweets (the exogenous inﬂuence) is known in the prediction period. Compared approaches. We evaluate the approximate compensator MBPP, closed-form MBPP, and HIP process, which can be considered an approximation of the MBPP, as per Section 5.3. We also ﬁt and evaluate these models for both the IC-LL and SSE loss func-

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

tions; and the exponential and power-law trigger kernel when applicable (the closed-form MBPP only exists for the exponential kernel). Furthermore, we use the LHPP exogenous function as the ACTIVE dataset is in the IC-IC setting. It follows that the LHPP exogenous function simpliﬁes to the number of tweets per day as the ACTIVE dataset presents unit-sized observation intervals (in days). Notably, in (Rizoiu et al., 2017b), the HIP process uses only the power-law triggering kernel with the SSE loss function. Thus, the HIP process with exponential kernel and IC-LL loss function combinations are new to our MBPP framework. Three ﬁtting procedure augmentations. Following Rizoiu et al. (2017b), we introduce three augmentations to improve ﬁtting and forecasting performances. The ﬁrst two augmentations enhance the exogenous intensity function, while the third is a workaround for the local optima in the loss function. The ﬁrst two augmentations are required because the models we discuss in this work make the implicit assumption that the exogenous intensity s(t) completely describes all the immigrants that enter the process. For scenarios B, C, E, F, this implies that the observed exogenous events are the only possible parents for the oﬀspring. However, this assumption is unrealistic for real data. For example, the views of You Tube videos can also be driven by other sources of inﬂuences outside tweets, such as Facebook or the radio. We account for unobserved exogenous inﬂuence by augmenting the exogenous intensity s(t). First, we scale s(t) with a ﬁtted constant µ. Second, we add two learnable parameters (γ and ν) to account for unobserved external intensity. The augmented exogenous intensity for each day i is: ˆs[i] = γ Jt = 0K + ν Jt > 0K + µ [#Tweets on day i], (54)

where µ corresponds to the scaling constant of the observed exogenous inﬂuence; and γ and ν correspond to parameters accounting for unobserved external inﬂuence. The third augmentation deals with the non-convexity of the loss surface (see Appendix D for more details), which can cause the gradient descent minimizer to get stuck into local optima. We address this by reﬁtting each video 10 times, starting with randomly selected initial parameters. We chose as the ﬁnal parameters the combination with the lowest loss. Forecasting future views using MBPP. We forecast the expected views in the days 91 to 120 using the numeric approximation scheme for the compensators, found in Proposition 7, and leveraging the observed views in days 1 to 90. Speciﬁcally, the forecasted number of views on day i is:

Predicted[i] = ˆs[90 + i] +

j=1 [#Views on day j] Z i j

i j 1 φ(y + 90)dy

k=91 Ξ[k 1, k] Z i k

i k 1 φ(y + 90)dy, for i 91 (55)

where ˆs[j] is deﬁned as per Eq. (54). We obtain this prediction scheme by replacing the compensator values evaluated between days 1 and 90 with the observed views as the expectation of the compensator is equivalent to the expected number of events for a non-homogeneous point process. Thus, the ﬁtted compensator only needs to be used when extrapolating past days 90.

Interval-censored Hawkes processes

A-MBPP IC-LL

Model Setting

C-MBPP IC-LL

A-MBPP IC-LL

Model Setting

C-MBPP IC-LL

s MAPE Score

Figure 6: Evaluation of future popularity prediction using APE (a) and s MAPE (b). For each model, the boxplots aggregate the APE for all videos in the dataset. A-MBPP denotes the approximate compensator MBPP variant, and C-MBPP denotes the closed-form compensator MBPP variant. For each boxplot, the solid line denotes the median, and the dashed line denotes the mean. The combination HIP SSE PL is the baseline reported in (Rizoiu et al., 2017b)

Evaluation measures. We evaluate the prediction for days 91 to 120 using two evaluation metrics: (1) average percentile error (APE) and (2) symmetric mean absolute percentage error (s MAPE). APE was introduced in Rizoiu et al. (2017b) and is deﬁned as the absolute diﬀerence between the recorded You Tube views percentiles and the predicted You Tube views percentile. The measure accounts for the long-tail distribution of online popularity. Intuitively, the impact of errors of the same size should be more signiﬁcant for unpopular videos than for popular videos. s MAPE is deﬁned as a normalized error of the true You Tube views and the predicted You Tube views over the entire sequence. Noteworthy, s MAPE is a local error measure (only depends on one video). In contrast, APE is a global measure that depends on both the error on the current video and the errors made on all other videos.

8.2 Results

We measure the APE (Fig. 6a) and s MAPE (Fig. 6b) for all HIP and MBPP conﬁgurations, summarized as boxplots. We measure the impact of the choices of three components: the kernel (power-law vs. exponential), the loss functions (IC-LL vs. SSE), and the model (closed-form MBPP, approximate MBPP, and HIP). Note that the combination HIP, SSE, and PL kernel is the baseline reported in (Rizoiu et al., 2017b), for which we match the performances claimed by its authors.

We can see that the choice of the kernel does not change the APE value by much, with power-law slightly over-performing exponential (as previously reported in literature (Mishra et al., 2016)). Visibly, the choice of loss function has a more signiﬁcant impact on performances, with IC-LL constantly outperforming SSE. This is particularly visible for the approximate MBPP (A-MBPP) and also present for both HIP and the closed-form MBPP to a lesser extent. The result is not surprising as the IC-LL loss function comes from the minimization of the corresponding Poisson distribution deﬁned by the MBPP compensator,

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

whereas the SSE loss function comes from changing the convex generator of a Bregman divergence to obtain an alternative loss function. The choice of model is also signiﬁcant, with the closed-form MBPP performing best for each of the loss function choices. This was expected, as HIP and the approximate MBPP are both approximations to the closed-form MBPP (for power-law, MBPP not having a closed-form solution). Perhaps surprisingly, the HIP has lower APE values than approximate MBPP with the same loss function and triggering kernel. This is because the HIP approximation is designed explicitly for unit time intervals as presented in Section 5.3 and would perform better than the more general approximate MBPP. We draw very similar conclusions from studying s MAPE performance measure (Fig. 6b). The only main diﬀerence is that the choice of kernel engenders a larger performance difference; for example, the 75% quantile for HIP SSE is much lower for power-law than exponential; similarly, the approximate MBPP (A-MBPP) SSE with power-law has a lower 25% quantile than the exponential kernel. A-MBPP IC-LL performs slightly worse with a power-law triggering kernel than the exponential triggering kernel, similar to the APE boxplot. Summary. Overall, we summarize the three main ﬁndings shown by both the local and global error metrics. First, the IC-LL loss function outperforms SSE. Second, the closed-form MBPP outperforms numerical approximations (A-MBPP and HIP). This follows naturally, given both of those processes are approximations of the closed-form MBPP. Third, the HIP approximation (Rizoiu et al., 2017b) which is specialized for unit time intervals outperforms the general approximation presented in the compensator approximation MBPP.

9. Related Work

We structure the related work into three main themes. First, we focus on prior work dedicated to estimating the parameters of Hawkes processes in both event-time and intervalcensored settings. Second, we consider alternative theoretic modeling paradigms and augmented Hawkes processes, and we investigate their link to the Hawkes processes. Finally, we consider the applications of Hawkes processes and other models in various domains.

9.1 Hawkes model parameter estimation

Several techniques have been proposed for estimating the parameters of a Hawkes model from data. Here, we concentrate on techniques using event-time data (other than the typical maximum likelihood estimates detailed in Section 2.3) and interval-censored data, both parametric and non-parametric. Event-time models. We start with works that estimate Hawkes model parameters starting from event-time data in several settings. For examples, Piggott and Solo (2018) propose an online estimator for the parameters of a Hawkes process by using a Laguerre basis expansion; their work targets an online learning setting. The Red Queen algorithm (Zarezade et al., 2017) provides a framework for expressing a Hawkes process s counting process and intensity function as a jump stochastic diﬀerential equation, providing an online algorithm with provable guarantees (Zarezade et al., 2018). Several studies have linked the Hawkes process to the inﬁnite-server queues when the Hawkes process characterizes the arrivals into

Interval-censored Hawkes processes

the queue. Koops et al. (2018) obtain a system of diﬀerential equations that characterize the joint distribution of arrival intensity and the number of customers. Similarly, Daw and Pender (2018b) ﬁnd diﬀerential equations to characterize the moments in several diﬀerent settings. Finally, Gao and Zhu (2018) show that the queue length process is limited to a non-Markovian Gaussian process in the asymptotic regime of a stationary Hawkes process. More related to the current work, Xu et al. (2017) address the problem of ﬁtting a Hawkes process observed during a ﬁnite interval of time denoted as a double-censored event sequence. Note that their deﬁnition of censored is diﬀerent than our interval-censoring in Section 4.1, i.e., they observe all event times during the given interval and nothing before and after. To avoid edge eﬀects when the history is missing, Xu et al. (2017) simulate many possible preﬁxes for the given observed window. Most of the approaches mentioned above deﬁne the Hawkes process using a parametric family of kernels and estimate their parameters. In contrast, other works estimate the kernel in a non-parametric way. For example, Dion and Lemler (2020) use trigonometric basis vectors to provide a non-parametric estimator of one-dimensional diﬀusion processes characterized by Hawkes processes. Similarly, Li et al. (2017) use a Fourier transform to provide a non-parametric estimator for the multidimensional Hawkes process. Zhang et al. (2019, 2020a,b) propose Bayesian estimation procedures for the parameters of the Hawkes process kernel, which leverage the Hawkes process s cluster structure. The above approaches are typically limited to event time data and cannot estimate model parameters in the presence of interval-censored or mixed data. Our approach can estimate parameters seamlessly from both event time and interval-censored data and operate on mixtures of the two. Interval-censored models. A series of studies show that the Hawkes process can be estimated in the interval-censored setup using an integer-valued auto-regressive (INAR) model. Kirchner (2016), and in their follow-up work in (Kirchner, 2017), show that the distribution of the resulting bin-count sequences (i.e., number of events in each interval) can be approximated by the (multivariate) INAR(p) model. They establish a least-squares estimation of the INAR(p) model, which yields estimates for the underlying multivariate Hawkes process after appropriate scaling. The quality of the estimates is compared against maximum likelihood estimates (on event-times) in (Kirchner and Bercher, 2018). At ﬁrst glance, these studies are similar to our work, as the link to the INAR model provides a discrete-time estimator of the Hawkes process. However, they do not expose the connection between the Hawkes process s event-time and interval-censored formulations as shown in this paper in Sections 4.1 and 4.2. Therefore, they do not establish a correspondence between the discrete and continuous regimes. Additionally, the INAR model and its estimation have been developed under the assumption of a constant exogenous intensity s(t) = s. This is incompatible with the synthetic and real-world experiments in Section 7 and Section 8, respectively, since these experiments involve a non-constant s(t) driving the system. For other interval-censored settings, Lu (2019) presents several Taylor s expansion approximation algorithms for several thinning-based count processes, including the previously mentioned INAR model. Similarly, Manolakis et al. (2019) provide a survey of signal processing approaches for interval-censored processes, detailing many diﬀerent auto-regressive models. Like the INAR model, there is no theoretical correspondence between the various

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

discrete models and the underlying true continuous-time process. In the context of survival analysis, Hudgens et al. (2014) propose a scheme for estimating the parameters of the cumulative incidence function in interval-censored and partly interval-censored settings.

Closest to our work, the Hawkes intensity process (HIP) (Rizoiu et al., 2017b) deﬁnes an estimator as the Hawkes process intensity function s expectation. It ﬁts the new process in an interval-censored setting by using a sum of squared errors loss function. This study has a series of shortcomings already outlined in Section 2.4. In comparison, this work introduces a new point process (the non-homogeneous Mean Behavior Poisson process (MBPP)) whose intensity is the expected intensity of a Hawkes process of equivalent parameters. The MBPP allows ﬁtting both event-time and interval-censored data. We further show the speciﬁc assumption needed to recover the HIP process (see Section 5.3).

Among the non-parametric approaches, Shlomovich et al. (2020) propose the estimation of aggregated Hawkes process using a Monte Carlo Expectation-Maximization algorithm. In aggregated data, event times are rounded to a given precision for data capturing or storage reasons (e.g., only the hour and minute of arrival is recorded, the seconde and below are omitted). The aggregated process at each time point is the number of events with that rounded time-stamp in eﬀect a version of the interval-censored setting deﬁned in Section 4 and Table 2. A spectral method of approximating stationary Hawkes process, based on Whittle s approach, is established by considering a mixing condition with polynomial decay rates from their Poisson clustering structure (Cheysson and Lang, 2020). There are prior works that propose solutions using panel count data analysis an inherently nonparametric approach. Ding et al. (2018) propose a Gaussian process-modulated Poisson process that permits panel data observations (i.e., interval-censored). Later, Moreno et al. (2020) wrap several popular panel count inference methods to deal with incomplete data and the Poisson process assumption s mis-speciﬁcation. Note that these approaches estimate Poisson processes, not Hawkes processes, and are not directly linked to the current work apart from the presentation of the data. Because of the non-parametric estimation of the Poisson intensity, this type of approaches cannot be applied to MBPP either out of the box, as MBPP requires a parametric form to connect back to the Hawkes process as shown in Section 3.1.

Our study diﬀers from the above-mentioned by connecting the Hawkes process with a non-homogeneous Poisson process and providing a generalized treatment of the HIP models. Notably, our proposed mean behavior Poisson process relates to the Hawkes process by a direct parameter equivalence, which provides a bijection between the models which the previous literature did not have. Furthermore, we diﬀerentiate from prior studies by providing a singular model to ﬁt both interval-censored and event-time data; only the log-likelihood function changes to accommodate the type of data. Finally, our proposed exogenous functions allow a mixed dataset of interval-censored and event-time data.

9.2 Other modeling paradigms and their links to the Hawkes model

The Hawkes processes are not the only option for modeling events in continuous time and various other models have also been used. Here we present some of these alternatives and how they related to Hawkes.

Interval-censored Hawkes processes

Linking Hawkes to other models. In their earlier work, Lewis and Shedler (1979) propose a foundational connection between homogeneous and non-homogeneous Poisson processes. They designed a thinning procedure a simple and eﬃcient algorithm for simulating non-homogeneous Poisson processes from homogeneous Poisson processes through thinning. This method was also adapted for Hawkes processes and is the de-facto method of simulating Hawkes process event sequences (Ogata, 1981). More recently, the Hawkes process has been linked to epidemic models (Rizoiu et al., 2018b). More speciﬁcally, the expectation of the intensity of the new infection in a stochastic Susceptible-Infectious-Recovery (SIR) epidemic model over all possible recovery processes is equal to the event intensity in a ﬁnite population Hawkes process (Hawkes N). Hawkes N is a Hawkes process in which the total number of events is limited and for which the remaining number of events modulates the intensity. Kong et al. (2020a) presented a generalized connection between Hawkes processes and epidemic models for arbitrary recovery time distributions and established relationships between the (latent and arbitrary) recovery time distribution, recovery hazard function, and the infection kernel of self-exciting processes. Hawkes properties and augmented models. Other studies focus on analyzing other aspects of the Hawkes process that are not directly used to connect diﬀerent models and estimation procedures. Although these studies are only tangentially related to our work, we include some of these theoretical studies for completeness. Cordi et al. (2018) show that univariate symmetric Hawkes processes are only weakly causal. They ﬁnd that reversed event sequences have similar point process log-likelihood scores and sometimes even yield better ﬁts. Zino et al. (2018) investigate the epidemic threshold of Susceptible-Infected-Susceptible models in the presence and absence of self-excitation in time-varying networks. The paper additionally proposes a model where self-excitement govern the temporal evolution of an activity-driven network. Embrechts and Kirchner (2018) propose several graph-theoretic objects which summarize the branching structure of a multivariate Hawkes process. Daw and Pender (2018a) propose an ephemerally self-excited process which adds an additional stochastic component to the time in which excitement aﬀects the process. Intuitively, this is motivated by how past events in epidemics only inﬂuence the future as long as the event stays active (say, as long as a person is infectious). Finally, Dion et al. (2019) introduce a new class of diﬀusion processes where jumps are driven by multivariate nonlinear Hawkes processes. Other point process modeling paradigms. Other models have been used in event time modeling as an alternative to Hawkes processes. Before being used to estimate Hawkes processes (as described above), the integer auto-regressive (INAR) model was proposed as a discrete alternative to the continuous-valued auto-regressive process (Al-Osh and Alzaid, 1987). Renewal processes are also a common alternative to the Hawkes process. For example, Mishra et al. (2020) propose a generalized renewal process that arises from the expectation of an age-dependent branching process. The renewal process generalization accounts for exogenous events and the time varying-reproduction number for modeling epidemics; they tested their approach on COVID-19 case data in South Korea. Other point process modeling paradigms make use of Markov processes. For example, Soumar e and Tafolong (2017) propose a two-state Markov switching model to model risk-based pricing models. Stochastic diﬀerential equations are another method of characterizing the dynamics of event time processes, with links to the Hawkes process as per the prior mentioned

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Red Queen algorithm Zarezade et al. (2018). A stochastic diﬀerential equation framework has been proposed for modeling information diﬀusion over networks (Wang et al., 2016). Similarly, Zarezade et al. (2017) uses an optimal control problem for jump stochastic differential equations to predict when to post in a social network for visibility. Our study presents a novel variant of the non-homogeneous Poisson process, the mean behavior Poisson. Although we mainly use MBPP to approximate the Hawkes process in interval-censored settings, it can be used on its own as an alternative to the Hawkes process with nicer properties (see the independent increments discussion in Section 2). Furthermore, the exogenous functions that we introduce can be used for other point process intensity functions.

9.3 Applications of Hawkes processes

Earthquakes, ﬁnance, and epidemics. Hawkes processes have long been used for modeling events in various domains, from modeling the aftershocks of earthquakes to predicting social network interactions. One of the ﬁrst domains to adapt Hawkes processes was earthquake modeling. The epidemic-type aftershock (ETAS) (Ogata, 1988) model assumes that each earthquake generates aftershocks at a rate governed by a Hawkes model with a powerlaw decay rate. Later works show the connection to the Richter intensity scale and provide an analytical solution for Hawkes process variants (Ogata, 1999; Helmstetter and Sornette, 2002a,b). The Hawkes processes have also been used in ﬁnancial markets. One such work uses a multivariate Hawkes process to account for dynamics of market prices by tailoring four Hawkes kernels for their domain (Bacry and Muzy, 2014). Similarly, Hawkes processes have also been used to study epidemic spread. Zino et al. (2020) use Hawkes processes to study the bursty activity patterns in social, temporal networks to help determine where targeted behavioral changes could accelerate the erratic of epidemics. Information spread. With the rise of social media platforms, many studies use Hawkes processes to characterize how information develops and moves around social networks. These studies concentrate on either the users of the network or the individual pieces of content (like a post). Similarly, Kobayashi and Lambiotte (2016) predict the time evolution of retweet activity by taking into account the circadian nature of users and the aging of information. Hawkes has been eﬀectively applied to social media platforms other than Twitter. The Hawkes process was found to outperform reinforced Poisson processes for individual micro-blogs on Sina Weibo, a Chinese micro-blogging network. Zarezade et al. (2016) propose a spatial-temporal Hawkes process with a periodic decaying kernel, which considers location-based data of users checking in to leverage the observation that users are inﬂuenced by locations in which their close friends have recently visited. Wang and Resnick (2019) incorporate the periodic nature of users on social media by switching from a Hawkes process in the day into a non-homogeneous Poisson process at night. Cascade size. Cha et al. (2009) analyze the Flickr social network, ﬁnding that popular photos do not necessarily spread widely and popular photos spread slowly. The study additionally found that information exchanged between friends likely accounted for over 50% of content sharing, with signiﬁcant delays between any additional hops. Similarly, Goel et al. (2012) ﬁnd that in a variety of social network domains, most of the cascade sizes in network diﬀusions are small, terminating within one degree of an initial adapting seed .

Interval-censored Hawkes processes

For large cascades, a study found that in addition to temporal and structural features being key predictors, breadth rather than depth is a better indicator of a large cascade. Yu et al. (2017) describe a stochastic model based on survival analysis to unify the task of predicting cascade sizes at a given time and the time when a cascade reaches a speciﬁc size. The authors propose an estimation of the cascading process, and the behavioral dynamics are aggregated. Zhao et al. (2015) use a doubly stochastic extension of the Hawkes process to examine the Twitter social network and make predictions on a tweet s ﬁnal number of reshares by looking at its current reshare history. Mishra et al. (2016) predict the ﬁnal popularity of Twitter cascades by combining a marked Hawkes process with a power-law kernel and feature-driven methods. Finally, Mishra et al. (2018) emulate a Hawkes process unfolding using Recurrent Neural Networks that leverage the Hawkes process intensity function. User inﬂuence. In a Flickr dataset, Goyal et al. (2010) investigate the inﬂuence probabilities between users in a structured social network graph. The study presents three classes of models for the inﬂuence probabilities: the ﬁrst which assumes that the inﬂuence does not change with time; the second assumes inﬂuence is a function of continuous-time; the third proposes a discrete approximation of the continuous-time models. Matsubara et al. (2012) provide an economical alternative to the Hawkes process to model the rise and fall pattern of inﬂuence propagation used to investigate the fast propagation of news rumors in social networks. Kempe et al. (2003) propose a greedy strategy to ﬁnd the most inﬂuential node in a network to determine which nodes can be targeted to trigger a large cascade of idea adaptation. Rizoiu et al. (2018a) propose a method implemented as the software package birdspotter (Ram et al., 2021) to estimate user inﬂuence in retweet data sets when the direct retweeting relation (who retweets whom directly) is latent. They use a Hawkes-inspired retweeting probability and compute user inﬂuence as the expected number of retweets over all possible unfolding of retweet cascades. Ram and Rizoiu (2021) further extend this Hawkes process-based method to include social science grounded phenomena, such as network conductance and inﬂuence-capital distribution. Content popularity. Making predictions about the popularity of online content has been another focus of study in social networks. Szabo and Huberman (2010) present a method for accurately predicting the long-time popularity from early measurements of user access, with empirical results on You Tube and Digg datasets. Another study found that group-level popularity is a necessary and more practical measurement to predict the popularity of online content (Hoang et al., 2017). To predict online video popularity, Ding et al. (2015) propose a dual sentimental Hawkes process that incorporates sentimental features in addition to event counts. Rizoiu et al. (2017b) propose the HIP model already discussed in Section 2.4 and apply it to characterize and predict the popularity of online You Tube videos. They also use the same model to build optimal promotion schedules to boost online popularity (Rizoiu and Xie, 2017). Kong et al. (2018) build a dashboard web application that leverages HIP to analyze online videos. Fake users and news. There have also been studies on detecting diﬀerences in online content, such as diﬀerentiating between fake news and regular users. (Tschiatschek et al., 2018) is one example of such a study, where a Bayesian inference algorithm is used for detecting fake news and jointly learning about the accuracy of users who ﬂag such content over time. Kong et al. (2020b) propose a dual mixture model of Hawkes processes to jointly ﬁt a group of cascades (large and small) relating to content from the same source. They

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

use the resulted parameters to build diﬀusion embeddings for news sources i.e., a vector that describes a news source based on how the retweet cascades concerning produced articles propagate through Twitter. The authors show that the diﬀusion embeddings induce a space in which controversial and reputable news sources are separable. Software tools that implement Hawkes processes. There exist several tools that implement the Hawkes process. These are usually programming language-speciﬁc (see THAP (Xu and Zha, 2017) in Matlab, Po PPy (Xu, 2018) in Py Torch) and some allow extensions to the Hawkes process (pyhawkes (Linderman and Adams, 2014) implements network Hawkes models). The most popular package, tick (Bacry et al., 2017), has the most active community and supplies a comprehensive set of models and helper functions for general time-dependent modeling, including Hawkes processes. Recently, evently (Kong et al., 2021) was developed as a Hawkes process toolbox and designed with an emphasis on online information diﬀusion modeling. In this work, we address the application of popularity prediction therefore, we classify the present work under the heading content popularity here above. In our experiments in Section 8.2, we show that we outperform HIP (the current state of the art in content popularity prediction).

10. Conclusion and future work

This work studies the self-exciting Hawkes point process in an interval-censored setting i.e. when we do not observe individual event times but instead counts of events during predeﬁned observation intervals. As Hawkes processes lack the independent increment property, computing the volumes of events is intractable as it depends on all the past observed volumes. This research overcomes the above diﬃculties by making several contributions. First, we establish the Mean Behavior Poisson process (MBPP), a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. The event intensity function of the MBPP is the expected intensity over all possible Hawkes realizations with the same parameter set. We ﬁt MBPP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function when the exogenous events are observed as event time and the latent homogeneous Poisson process exogenous function when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBPP when no analytical solution exists. Fourth and ﬁnally, we connect the interval-censored loss of MBPP to a broader class of Bregman divergence-based functions. Using the connection, we show that the current state of the art in popularity estimation (Hawkes Intensity Process (HIP) (Rizoiu et al., 2017b)) is a particular case of the MBPP. We verify our models through empirical testing on synthetic data and real-world data. We ﬁnd that on real-world datasets that our MBPP outperforms HIP for the task of popularity prediction. Future work. This work only addresses the uni-variate Hawkes process. However, drawing from the experiments in Section 8, one could argue that tweets are not only exogenous sources, and they are inﬂuenced by the views also. In other words, tweets and views should be jointly modeled using a bi-variate Hawkes, where one dimension is interval-

Interval-censored Hawkes processes

censored. Future work will address this gap by investigating partially censored multivariate MBPP and marked variants of the Hawkes process. A second direction for future work is ﬁnding theoretical bounds for our approximation method for the MBPP compensator (Eq. (26)).

Acknowledgments

This research was partially funded by the Commonwealth of Australia (represented by the Defence Science and Technology Group) through a Defence Science Partnerships Agreement, and Australian Research Council Project DP180101985. We thanks the Ne CTAR Research Cloud for providing computational resources, an Australian research platform supported by the National Collaborative Research Infrastructure Strategy. Furthermore, this research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.

Harshavardhan Achrekar, Avinash Gandhe, Ross Lazarus, Ssu-Hsin Yu, and Benyuan Liu. Predicting ﬂu trends using twitter data. In 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 702 707, 2011. doi: 10.1109/INFCOMW.2011.5928903.

M. A. Al-Osh and A. A. Alzaid. First-order integer-valued autoregressive (INAR(1)) process. Journal of Time Series Analysis, 8(3):261 275, 1987. ISSN 14679892. doi: 10.1111/j. 1467-9892.1987.tb00438.x.

Iﬁgeneia Apostolopoulou, Scott Linderman, Kyle Miller, and Artur Dubrawski. Mutually regressive point processes. In Neur IPS, 2019.

G Arfken. Mathematical Methods for Physicists 3rd edn (San Diego, CA: Academic). 1985.

Emmanuel Bacry and Jean Fran cois Muzy. Hawkes model for price and trades highfrequency dynamics. Quantitative Finance, 14(7):1147 1166, 2014. ISSN 14697696. doi: 10.1080/14697688.2014.897000. URL https://www.tandfonline.com/doi/abs/ 10.1080/14697688.2014.897000.

Emmanuel Bacry, Iacopo Mastromatteo, and Jean-Fran cois Muzy. Hawkes processes in ﬁnance. Market Microstructure and Liquidity, 01(01), 2015.

Emmanuel Bacry, Martin Bompaire, Philip Deegan, St ephane Ga ıﬀas, and Søren V Poulsen. Tick: a python library for statistical learning, with an emphasis on hawkes processes and time-dependent models. JMLR, 2017.

Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with bregman divergences. Journal of machine learning research, 6(Oct):1705 1749, 2005.

Viktor Benes and Jan Rataj. Stochastic geometry: selected topics. Springer Science & Business Media, 2007.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Daniel Bernoulli. Essai d une nouvelle analyse de la mortalit e caus ee par la petite v erole, et des avantages de l inoculation pour la pr evenir. Histoire de l Acad., Roy. Sci.(Paris) avec Mem, pages 1 45, 1760.

E Borel. Sur l emploi du th eoreme de bernoulli pour faciliter le calcul d une inﬁnit e de coeﬃcients. application au probleme de l attentea un guichet. CR Acad. Sci. Paris, 214: 452 456, 1942.

Clive Bowsher. Modelling security market events in continuous time: Intensity based, multivariate point process models. Journal of Econometrics, 141(2):876 912, 2007.

D.R. Brillinger, P.M. Guttorp, and F.P. Schoenberg. Encyclopedia of environmetrics, volume 3. Wiley, 2002.

Qi Cao, Huawei Shen, Keting Cen, Wentao Ouyang, and Xueqi Cheng. Deep Hawkes: Bridging the gap between prediction and understanding of information cascades. In International Conference on Information and Knowledge Management, Proceedings, volume Part F131841, pages 1149 1158, New York, NY, USA, nov 2017. Association for Computing Machinery. ISBN 9781450349185. doi: 10.1145/3132847.3132973. URL https://dl.acm.org/doi/10.1145/3132847.3132973.

Meeyoung Cha, Alan Mislove, and Krishna P. Gummadi. A measurement-driven analysis of information propagation in the Flickr social network. In WWW 09 - Proceedings of the 18th International World Wide Web Conference, pages 721 730, 2009. ISBN 9781605584874. doi: 10.1145/1526709.1526806.

Felix Cheysson and Gabriel Lang. Strong mixing condition for Hawkes processes and application to Whittle estimation from count data. mar 2020. URL http://arxiv.org/ abs/2003.04314.

Michael Collins, Robert E Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 48(1-3):253 285, 2002.

Marcus Cordi, Damien Challet, and Ioane Muni Toke. Testing the causality of Hawkes processes with time reversal. Journal of Statistical Mechanics: Theory and Experiment, 2018(3):033408, mar 2018. ISSN 17425468. doi: 10.1088/1742-5468/aaac3f.

D. R. Cox and V. Isham. Point Processes. Chapman & Hall, 1980.

D. R. Cox and P. A. W. Lewis. Multivariate point processes. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 3: Probability Theory, pages 401 448, Berkeley, Calif., 1972. University of California Press.

D.J. Daley and D. Vere-Jones. An introduction to the theory of point processes. Vol. I. Probability and its Applications (New York). Springer-Verlag, New York, second edition, 2003.

Angelos Dassios and Hongbiao Zhao. Exact simulation of hawkes process with exponentially decaying intensity. Electronic Communications in Probability, 18:13 pp., 2013.

Interval-censored Hawkes processes

Andrew Daw and Jamol Pender. An Ephemerally Self-Exciting Point Process. nov 2018a. URL http://arxiv.org/abs/1811.04282.

Andrew Daw and Jamol Pender. Queues Driven by Hawkes Processes. Stochastic Systems, 8(3):192 229, sep 2018b. ISSN 1946-5238. doi: 10.1287/stsy.2018.0014.

Hongyi Ding, Young Lee, Issei Sato, and Masashi Sugiyama. Variational inference for Gaussian processes with panel count data. In 34th Conference on Uncertainty in Artiﬁcial Intelligence 2018, UAI 2018, volume 1, pages 290 299. Association For Uncertainty in Artiﬁcial Intelligence (AUAI), mar 2018. ISBN 9781510871601. URL http://arxiv. org/abs/1803.04232.

Wanying Ding, Yue Shang, Lifan Guo, Xiaohua Hu, Rui Yan, and Tingting He. Video popularity prediction by sentiment propagation via implicit network. In International Conference on Information and Knowledge Management, Proceedings, volume 19-23-Oct-2015, pages 1621 1630. Association for Computing Machinery, oct 2015. ISBN 9781450337946. doi: 10.1145/2806416.2806505.

Charlotte Dion and Sarah Lemler. Nonparametric drift estimation for diﬀusions with jumps driven by a Hawkes process. Statistical Inference for Stochastic Processes, 23(3):489 515, oct 2020. ISSN 15729311. doi: 10.1007/s11203-020-09213-5. URL https://link. springer.com/article/10.1007/s11203-020-09213-5.

Charlotte Dion, Sarah Lemler, and Eva L ocherbach. Exponential ergodicity for diﬀusions with jumps driven by a Hawkes process. apr 2019. URL http://arxiv.org/abs/1904. 06051.

P. Embrechts and M. Kirchner. Hawkes graphs. Theory of Probability and its Applications, 62(1):132 155, may 2018. ISSN 10957219. doi: 10.1137/S0040585X97T988538.

Xuefeng Gao and Lingjiong Zhu. Functional central limit theorems for stationary Hawkes processes and application to inﬁnite-server queues. Queueing Systems, 90(1-2):161 206, oct 2018. ISSN 15729443. doi: 10.1007/s11134-018-9570-5.

Sharad Goel, Duncan J. Watts, and Daniel G. Goldstein. The structure of online diﬀusion networks. In Proceedings of the ACM Conference on Electronic Commerce, pages 623 638, 2012. ISBN 9781450314152. doi: 10.1145/2229012.2229058.

Amit Goyal, Francesco Bonchi, and Laks V.S. Lakshmanan. Learning inﬂuence probabilities in social networks. In WSDM 2010 - Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pages 241 250, 2010. ISBN 9781605588896. doi: 10.1145/1718487.1718518.

Jan Grandell. Point processes and random measures. Advances in Applied Probability, pages 502 526, 1977.

Stephen J. Hardiman, Nicolas Bercot, and Jean-Philippe Bouchaud. Critical reﬂexivity in ﬁnancial markets: a Hawkes process analysis. European Physics Journal B, 86(10):442, 2013.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Alan G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83, 1971.

Alan G Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of Applied Probability, 11(3):493 503, 1974.

Agn es Helmstetter and Didier Sornette. Subcritical and supercritical regimes in epidemic models of earthquake aftershocks. Journal of Geophysical Research: Solid Earth, 107 (B10):ESE 10 1 ESE 10 21, oct 2002a. ISSN 2169-9356. doi: 10.1029/2001jb001580.

Agn es Helmstetter and Didier Sornette. Diﬀusion of epicenters of earthquake aftershocks, Omori s law, and generalized continuous-time random walk models. Phys. Rev. E, 66(6): 061104.1 061104.24, 2002b. URL http://link.aps.org/doi/10.1103/Phys Rev E.66. 061104.

Agnes Helmstetter and Didier Sornette. Subcritical and supercritical regimes in epidemic models of earthquake aftershocks. Journal of Geophysical Research: Solid Earth, 107 (B10), 2002c.

Minh X. Hoang, Xuan Hong Dang, Xiang Wu, Zhenyu Yan, and Ambuj K. Singh. GPOP: Scalable group-level popularity prediction for online content in social networks. In 26th International World Wide Web Conference, WWW 2017, pages 725 733. International World Wide Web Conferences Steering Committee, 2017. ISBN 9781450349130. doi: 10.1145/3038912.3052626.

Michael G Hudgens, Chenxi Li, and Jason P Fine. Parametric likelihood inference for interval censored competing risks data. Biometrics, 70(1):1 9, 2014.

Martin Jacobsen. Compensators and Martingales. In Point Process Theory and Applications, pages 33 102. Birkh auser-Verlag, Boston, 2006. doi: 10.1007/0-8176-4463-6 4. URL http://link.springer.com/10.1007/0-8176-4463-6_4.

David Kempe, Jon Kleinberg, and Eva Tardos. Maximizing the spread of inﬂuence through a social network. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137 146, 2003. doi: 10.1145/956750.956769.

M. Kirchner and A. Bercher. A nonparametric estimation procedure for the Hawkes process: comparison with maximum likelihood estimation. Journal of Statistical Computation and Simulation, 88(6):1106 1116, apr 2018. ISSN 15635163. doi: 10. 1080/00949655.2017.1422126. URL https://www.tandfonline.com/doi/abs/10.1080/ 00949655.2017.1422126.

Matthias Kirchner. Hawkes and INAR( ) processes. Stochastic Processes and their Applications, 126(8):2494 2525, aug 2016. ISSN 03044149. doi: 10.1016/j.spa.2016.02.008. URL https://linkinghub.elsevier.com/retrieve/pii/S0304414916000399.

Matthias Kirchner. An estimation procedure for the Hawkes process. Quantitative Finance, 17(4):571 595, apr 2017. ISSN 1469-7688. doi: 10.1080/14697688.2016.1211312. URL https://www.tandfonline.com/doi/full/10.1080/14697688.2016.1211312.

Interval-censored Hawkes processes

Fima C. Klebaner. Introduction to Stochastic Calculus with Applications. Imperial College Press, 3 edition, 2012.

Donald E. Knuth. Two notes on notation. The American Mathematical Monthly, 99(5): 403 422, 1992. doi: 10.1080/00029890.1992.11995869. URL https://doi.org/10.1080/ 00029890.1992.11995869.

Ryota Kobayashi and Renaud Lambiotte. Ti De H: Time-dependent Hawkes process for predicting retweet dynamics. In Proceedings of the 10th International Conference on Web and Social Media, ICWSM 2016, pages 191 200. AAAI Press, 2016. ISBN 9781577357582.

Quyu Kong, Marian Andrei Rizoiu, Siqi Wu, and Lexing Xie. Will This Video Go Viral: Explaining and Predicting the Popularity of Youtube Videos. In The Web Conference 2018 - Companion of the World Wide Web Conference, WWW 2018, pages 175 178, Lyon, France, 2018. ACM Press. ISBN 9781450356404. doi: 10.1145/3184558. 3186972. URL https://arxiv.org/abs/1801.04117http://dl.acm.org/citation. cfm?doid=3184558.3186972.

Quyu Kong, Marian-Andrei Rizoiu, and Lexing Xie. Modeling Information Cascades with Self-exciting Processes via Generalized Epidemic Models. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 286 294, New York, NY, USA, jan 2020a. ACM. ISBN 9781450368223. doi: 10.1145/3336191.3371821.

Quyu Kong, Marian-Andrei Rizoiu, and Lexing Xie. Describing and Predicting Online Items with Reshare Cascades via Dual Mixture Self-exciting Processes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 645 654, New York, NY, USA, oct 2020b. ACM. ISBN 9781450368599. doi: 10.1145/3340531.3411861.

Quyu Kong, Rohit Ram, and Marian-Andrei Rizoiu. Evently. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 1097 1100, New York, NY, USA, mar 2021. ACM. ISBN 9781450382977. doi: 10.1145/ 3437963.3441708. URL http://arxiv.org/abs/2006.06167https://dl.acm.org/doi/ 10.1145/3437963.3441708.

D. T. Koops, M. Saxena, O. J. Boxma, and M. Mandjes. Inﬁniteserver queues with Hawkes input. Journal of Applied Probability, 55(3):920 943, sep 2018. ISSN 00219002. doi: 10.1017/jpr.2018. 58. URL /core/journals/journal-of-applied-probability/article/ infiniteserver-queues-with-hawkes-input/29A985198A710AB1F7636DCEAEA9C0E6.

P. J. Laub, T. Taimre, and P. K. Pollett. Hawkes Processes. Ar Xiv e-prints, July 2015.

P. A.W. Lewis and G. S. Shedler. SIMULATION OF NONHOMOGENEOUS POISSON PROCESSES BY THINNING. Naval research logistics quarterly, 26(3):403 413, 1979. ISSN 00281441. doi: 10.1002/nav.3800260304.

Sha Li, Xiaofeng Gao, Weiming Bao, and Guihai Chen. FM-Hawkes: A Hawkes process based approach for modeling online activity correlations. In International Conference

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

on Information and Knowledge Management, Proceedings, volume Part F131841, pages 1119 1128, New York, NY, USA, nov 2017. Association for Computing Machinery. ISBN 9781450349185. doi: 10.1145/3132847.3132883. URL https://dl.acm.org/doi/10. 1145/3132847.3132883.

Scott Linderman and Ryan Adams. Discovering latent network structure in point process data. In ICML, 2014.

Yang Lu. The predictive distributions of thinning-based count processes. Scandinavian Journal of Statistics, page sjos.12438, dec 2019. ISSN 0303-6898. doi: 10.1111/sjos.12438. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/sjos.12438.

Dimitris Manolakis, Nicholas Bosowski, and Vinay K. Ingle. Count Time-Series Analysis: A Signal Processing Perspective. IEEE Signal Processing Magazine, 36(3):64 81, may 2019. ISSN 15580792. doi: 10.1109/MSP.2018.2885853.

Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, and Christos Faloutsos. Rise and fall patterns of information diﬀusion: Model and implications. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 6 14, 2012. ISBN 9781450314626. doi: 10.1145/2339530.2339537.

Swapnil Mishra, Marian-Andrei Rizoiu, and Lexing Xie. Feature Driven and Point Process Approaches for Popularity Prediction. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management - CIKM 16, pages 1069 1078, Indianapolis, IN, USA, 2016. ACM Press. ISBN 9781450340731. doi: 10.1145/2983323.2983812. URL http://dl.acm.org/citation.cfm?doid=2983323. 2983812http://arxiv.org/pdf/1608.04862.pdf.

Swapnil Mishra, Marian-Andrei Rizoiu, and Lexing Xie. Modeling Popularity in Asynchronous Social Media Streams with Recurrent Neural Networks. In International AAAI Conference on Web and Social Media (ICWSM 18), pages 1 10, Stanford, CA, USA, 2018. URL https://arxiv.org/pdf/1804.02101.pdf.

Swapnil Mishra, Tresnia Berah, Thomas A. Mellan, H. Juliette T. Unwin, Michaela A Vollmer, Kris V Parag, Axel Gandy, Seth Flaxman, and Samir Bhatt. On the derivation of the renewal equation from an age-dependent branching process: an epidemic modelling perspective. jun 2020. URL http://arxiv.org/abs/2006.16487.

Alexander Moreno, Zhenke Wu, Jamie Yap, David Wetter, Cho Lam, Inbal Nahum-Shani, Walter Dempsey, and James M Rehg. A Robust Functional EM Algorithm for Incomplete Panel Count Data. mar 2020.

Frank Nielsen, Jean-Daniel Boissonnat, and Richard Nock. Bregman voronoi diagrams: Properties, algorithms and applications. ar Xiv preprint ar Xiv:0709.2196, 2007.

Joseph D. O Brien, Alberto Aleta, Yamir Moreno, and James P. Gleeson. Quantifying Uncertainty in a Predictive Model for Popularity Dynamics. jan 2020. URL https: //arxiv.org/abs/2001.09490.

Interval-censored Hawkes processes

Yosihiko Ogata. On Lewis simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23 31, 1981.

Yosihiko Ogata. Statistical Models for Earthquake Occurrences and Residual Analysis for Point Processes. Journal of the American Statistical Association, 83(401):9 27, mar 1988. ISSN 0162-1459. doi: 10.1080/01621459.1988.10478560.

Yosihiko Ogata. Seismicity Analysis through Point-process Modeling: A Review. In Seismicity Patterns, their Statistical Signiﬁcance and Physical Meaning, volume 155, pages 471 507. Birkh auser Basel, Basel, aug 1999. doi: 10.1007/978-3-0348-8677-2 14.

Krunal Parmar, Samuel Bushi, Sourangshu Bhattacharya, and Surender Kumar. Forecasting ad-impressions on online retail websites using non-homogeneous hawkes processes. In CIKM, 2017.

Charles L Phillips, John M Parr, Eve Ann Riskin, and Telagarapu Prabhakar. Signals, systems, and transforms. Prentice Hall Upper Saddle River, 2003.

Marc Piggott and Victor Solo. Non-Negative Online Estimation for Hawkes Process Networks. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, volume 2018-April, pages 6333 6337. Institute of Electrical and Electronics Engineers Inc., sep 2018. ISBN 9781538646588. doi: 10.1109/ICASSP.2018. 8462138.

Rohit Ram and Marian-Andrei Rizoiu. Conductance and Social Capital: Modeling and Empirically Measuring Online Social Inﬂuence. oct 2021. URL http://arxiv.org/abs/ 2110.12569.

Rohit Ram, Quyu Kong, and Marian-Andrei Rizoiu. Birdspotter: A Tool for Analyzing and Labeling Twitter Users. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 918 921, New York, NY, USA, mar 2021. ACM. ISBN 9781450382977. doi: 10.1145/3437963.3441695. URL https://dl.acm.org/doi/ 10.1145/3437963.3441695.

Alex Reinhart. A Review of Self-Exciting Spatio-Temporal Point Processes and Their Applications. Statistical Science, 33(3):299 318, aug 2018. ISSN 0883-4237. doi: 10.1214/17-STS629. URL https://projecteuclid.org/euclid.ss/1534147221.

Marian-Andrei Rizoiu and Lexing Xie. Online Popularity under Promotion: Viral Potential, Forecasting, and the Economics of Time. In International AAAI Conference on Web and Social Media (ICWSM 17), pages 182 191, Montr eal, Qu ebec, Canada, 2017. URL https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/view/15553https:// arxiv.org/pdf/1703.01012.pdf.

Marian Andrei Rizoiu, Lexing Xie, Tiberio Caetano, and Manuel Cebrian. Evolution of Privacy Loss in Wikipedia. In WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining, pages 215 224, New York, New York, USA, dec 2016. ACM, ACM Press. ISBN 9781450337168. doi: 10.1145/2835776.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

2835798. URL http://dl.acm.org/citation.cfm?doid=2835776.2835798http:// arxiv.org/abs/1512.03523http://dx.doi.org/10.1145/2835776.2835798.

Marian-Andrei Rizoiu, Young Lee, and Swapnil Mishra. Hawkes processes for events in social media. Frontiers of Multimedia Research, pages 191 218, aug 2017a. doi: 10.1145/ 3122865.3122874. URL http://arxiv.org/abs/1708.06401.

Marian-Andrei Rizoiu, Lexing Xie, Scott Sanner, Manuel Cebrian, Honglin Yu, and Pascal Van Hentenryck. Expecting to be hip: Hawkes intensity processes for social media popularity. In Proceedings of the 26th International Conference on World Wide Web, WWW 17, pages 735 744, 2017b.

Marian-Andrei Rizoiu, Timothy Graham, Rui Zhang, Yifei Zhang, Robert Ackland, and Lexing Xie. #Debate Night: The Role and Inﬂuence of Socialbots on Twitter During the 1st 2016 U.S. Presidential Debate. In International AAAI Conference on Web and Social Media (ICWSM 18), pages 1 10, Stanford, CA, USA, 2018a.

Marian Andrei Rizoiu, Swapnil Mishra, Quyu Kong, Mark Carman, and Lexing Xie. SIRhawkes: Linking epidemic models and hawkes processes to model diﬀusions in ﬁnite populations. In The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018, pages 419 428, New York, New York, USA, apr 2018b. Association for Computing Machinery, Inc. ISBN 9781450356398. doi: 10.1145/3178876.3186108. URL http://dl.acm.org/citation.cfm?doid=3178876.3186108.

Sheldon M. Ross. Stochastic Processes. Wiley, 2nd edition, 1995. ISBN 0471120626.

Leigh Shlomovich, Edward Cohen, Niall Adams, and Lekha Patel. A Monte Carlo EM Algorithm for the Parameter Estimation of Aggregated Hawkes Processes. jan 2020. URL http://arxiv.org/abs/2001.07160.

Donald L. Snyder and Michael I. Miller. Random point processes in time and space. Springer Verlag, 2nd edition, 1991.

Issouf Soumar e and Ernest Tafolong. Risk-based capital for credit insurers with business cycles and dynamic leverage. Quantitative Finance, 17(4):597 612, apr 2017. ISSN 14697696. doi: 10.1080/14697688.2016.1206960.

Jianguo Sun and Xingqiu Zhao. Statistical Analysis of Panel Count Data, volume 80 of Statistics for Biology and Health. Springer New York, New York, NY, 2013. ISBN 9781-4614-8714-2. doi: 10.1007/978-1-4614-8715-9.

Gabor Szabo and Bernardo A. Huberman. Predicting the popularity of online content. Communications of the ACM, 53(8):80 88, aug 2010. ISSN 0001-0782. doi: 10.1145/ 1787234.1787254. URL https://dl.acm.org/doi/10.1145/1787234.1787254.

Sebastian Tschiatschek, Adish Singla, Manuel Gomez Rodriguez, Arpit Merchant, and Andreas Krause. Fake News Detection in Social Networks via Crowd Signals. In The Web Conference 2018 - Companion of the World Wide Web Conference, WWW

Interval-censored Hawkes processes

2018, pages 517 524, New York, New York, USA, apr 2018. Association for Computing Machinery, Inc. ISBN 9781450356404. doi: 10.1145/3184558.3188722. URL http://dl.acm.org/citation.cfm?doid=3184558.3188722.

George E Uhlenbeck and Leonard S Ornstein. On the theory of the brownian motion. Physical review, 36(5):823, 1930.

H Juliette T Unwin, Isobel Routledge, Seth Flaxman, Marian-Andrei Rizoiu, Shengjie Lai, Justin Cohen, Daniel J Weiss, Swapnil Mishra, and Samir Bhatt. Using Hawkes Processes to model imported and local malaria cases in near-elimination settings. PLOS Computational Biology, 17(4):e1008830, apr 2021. ISSN 1553-7358. doi: 10.1371/ journal.pcbi.1008830. URL http://medrxiv.org/content/early/2020/07/17/2020. 07.17.20156174.abstracthttps://dx.plos.org/10.1371/journal.pcbi.1008830.

Gerrit van Dijk. Distribution Theory Convolution, Fourier Transform, and Laplace Transform. De Gruyter Graduate Lectures, De Gruyter,, Berlin, 2013.

Tiandong Wang and Sidney I. Resnick. Common Growth Patterns for Regional Social Networks: a Point Process Approach. nov 2019. URL http://arxiv.org/abs/1911. 07902.

Yichen Wang, Evangelos Theodorou, Apurv Verma, and Le Song. A Stochastic Diﬀerential Equation Framework for Guiding Information Diﬀusion. ar Xiv preprint, pages 1 22, 2016. URL http://arxiv.org/abs/1603.09021.

Siqi Wu, Marian-Andrei Rizoiu, and Lexing Xie. Estimating Attention Flow in Online Video Networks. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW): 1 25, nov 2019. ISSN 25730142. doi: 10.1145/3359285. URL http://dl.acm.org/ citation.cfm?doid=3371885.3359285.

Siqi Wu, Marian Andrei Rizoiu, and Lexing Xie. Variation across scales: Measurement ﬁdelity under Twitter data sampling. In Proceedings of the 14th International AAAI Conference on Web and Social Media, ICWSM 2020, pages 715 725, mar 2020. ISBN 9781577357889. URL https://arxiv.org/abs/2003.09557.

Hongteng Xu. Poppy: A point process toolbox based on pytorch. ar Xiv, 2018.

Hongteng Xu and Hongyuan Zha. Thap: A matlab toolkit for learning with hawkes processes. ar Xiv, 2017.

Hongteng Xu, Dixin Luo, and Hongyuan Zha. Learning Hawkes Processes from Short Doubly-Censored Event Sequences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML 17, pages 3831 3840. JMLR.org, 2017.

Linyun Yu, Peng Cui, Fei Wang, Chaoming Song, and Shiqiang Yang. Uncovering and predicting the dynamic process of information cascades with survival model. Knowledge and Information Systems, 50(2):633 659, feb 2017. ISSN 02193116. doi: 10.1007/ s10115-016-0955-7.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Ali Zarezade, Sina Jafarzadeh, and Hamid R. Rabiee. Spatio-Temporal Modeling of Users Check-ins in Location-Based Social Networks. nov 2016. URL http://arxiv.org/abs/ 1611.07710.

Ali Zarezade, Utkarsh Upadhyay, Hamid R. Rabiee, and Manuel Gomez-Rodriguez. Red Queen: An online algorithm for smart broadcasting in social networks. In WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining, pages 51 60. Association for Computing Machinery, Inc, feb 2017. ISBN 9781450346757. doi: 10.1145/3018661.3018684.

Ali Zarezade, Abir De, Utkarsh Upadhyay, Hamid R. Rabiee, and Manuel Gomez-Rodriguez. Steering Social Activity: A Stochastic Optimal Control Point of View. Journal of Machine Learning Research, 18(205):1 35, 2018. URL http://jmlr.org/papers/v18/17-416. html.

Rui Zhang, Christian Walder, Marian-Andrei Rizoiu, and Lexing Xie. Eﬃcient Nonparametric Bayesian Hawkes Processes. In Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence, pages 4299 4305, California, aug 2019. International Joint Conferences on Artiﬁcial Intelligence Organization. ISBN 978-0-99924114-1. doi: 10.24963/ijcai.2019/597.

Rui Zhang, Christian Walder, Edwin V. Bonilla, Marian-Andrei Rizoiu, and Lexing Xie. Quantile Propagation for Wasserstein-Approximate Gaussian Processes. In H. Larochelle Lin, M. Ranzato, R. Hadsell, M. F. Balcan, and H., editors, Conference on Neural Information Processing Systems (Neur IPS 20), pages 21566 -21578. Curran Associates, Inc., 2020a. URL https://arxiv.org/abs/1912.10200.

Rui Zhang, Christian Walder, and Marian-Andrei Rizoiu. Variational Inference for Sparse Gaussian Process Modulated Hawkes Process. Proceedings of the AAAI Conference on Artiﬁcial Intelligence, 34(04):6803 6810, apr 2020b. ISSN 2374-3468. doi: 10.1609/aaai. v34i04.6160.

Qingyuan Zhao, Murat A. Erdogdu, Hera Y. He, Anand Rajaraman, and Jure Leskovec. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, volume 2015-August, pages 1513 1522. Association for Computing Machinery, aug 2015. ISBN 9781450336642. doi: 10.1145/2783258.2783401.

Liang Zhu, Xinwei Tong, Jianguo Sun, Manhua Chen, Deo Kumar Srivastava, Wendy Leisenring, and Leslie L Robison. Regression analysis of mixed recurrent-event and panelcount data. Biostatistics, 15(3):555 568, 2014.

Lorenzo Zino, Alessandro Rizzo, and Maurizio Porﬁri. Modeling memory eﬀects in activitydriven networks. SIAM Journal on Applied Dynamical Systems, 17(4):2830 2854, 2018. ISSN 15360040. doi: 10.1137/18M1171485.

Lorenzo Zino, Alessandro Rizzo, and Maurizio Porﬁri. Analysis and control of epidemics in temporal networks with self-excitement and behavioral changes. European Journal of Control, 54:1 11, jul 2020. ISSN 09473580. doi: 10.1016/j.ejcon.2019.12.007.

Interval-censored Hawkes processes

Contents (Appendix)

A Proofs and Derivations 59

B MBPP Calculations 67 B.1 MBPP Intensity Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 B.2 MBPP Compensator Functions . . . . . . . . . . . . . . . . . . . . . . . . . 72

C Latent Homogeneous Poisson Process Exogenous Function - Probabilistic Approach 76

D Convexity analysis 77 D.1 Hawkes process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 D.2 Mean Behaviour Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

E Experiments 80 E.1 Number of Approximation Points . . . . . . . . . . . . . . . . . . . . . . . . 80 E.2 Other Non-endogenous Experiments . . . . . . . . . . . . . . . . . . . . . . 80 E.3 Examine the MBPP ﬁtting bias . . . . . . . . . . . . . . . . . . . . . . . . . 80

Appendix A. Proofs and Derivations

Here, we present the proofs and derivations that we omitted in the main text for readability reasons. Proof [of Eq. (6) which derives ξ(t), the expected intensity of the Hawkes process and the event intensity of MBPP.] We shall show that for a Hawkes process,

ξ(t) = µ(t) + Z t

0 φ(s) ξ(t s) ds.

Similar identities also exist (Laub et al., 2015, pg. 8), (Helmstetter and Sornette, 2002c, Equation 9). Invoking an idea from stochastic calculus (see for instance, Lipster and Shiryaev or Klebaner), let us deﬁne the martingale

M( ) .= N( ) Λ( ),

where Λ( ) = R t 0 λ(s) ds is a compensator. We have:

ξ(t) = E [λ(t)]

0 φ(t s) d N(s)

0 φ(t s) [d N(s) λ(s)ds] + Z t

0 φ(t s)λ(s) ds

0 φ(t s) (d N(s) λ(s)ds) + Z t

0 φ(t s)λ(s) ds

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

0 φ(t s) (d N(s) λ(s)ds) + Z t

0 φ(t s)λ(s) ds

= µ(t) + E Z t

0 φ(t s)λ(s) ds .

Since the term is R t s φ(t s)(d N(s) λ(s))ds is a martingale and has expectation 0,

= µ(t) + Z t

0 φ(t s)E [λ(s)] ds

= µ(t) + Z t

0 φ(t s)ξ(s) ds

= µ(t) + Z t

0 φ(s)ξ(t s) ds.

Proof [of Theorem 1 which derives the solution for ξ(t) using the Laplace transform.] The closed form MBPP equation is given by

ξ(t) = s(t) + Z t

0 φ(t τ) ξ(τ) dτ = s(t) + (φ ξ)(t).

Applying the Laplace transform to both sides, we have,

L{ξ(t)}(ω) = L{s(t) + (φ ξ)(t)}(ω)

= L{s(t)}(ω) + L{(φ ξ)(t)}(ω)

= L{s(t)}(ω) + L{φ(t)}(ω) L{ξ(t)}(ω).

Solving for L{ξ(t)}(ω), we have

L{ξ(t)}(ω) = L{s(t)}(ω) 1 L{φ(t)}(ω).

ξ(t) = L 1 L{s(t)}(ω) 1 L{φ(t)}(ω)

Proof [of Proposition 8 which links the IC-LL loss function and the KL divergence.] Let ϕ(z) .= z log z deﬁne a convex function. For clarity, further deﬁne Ci .= C(oi 1, oi] and Ξi .= Ξ(oi 1, oi]. We may compute the loss composed of the sum of Bregman divergences with convex generator ϕ:

i=1 Bϕ(Ci, Ξi)

Interval-censored Hawkes processes

i=1 [ϕ(Ci) ϕ(Ξi) ϕ (Ξi) (Ci Ξi)]

i=1 [Ci log Ci Ξi log Ξi (log Ξi + 1) (Ci Ξi)]

i=1 [Ci log Ci Ci + Ξi Ci log Ξi]

i=1 Ci log Ξi +

i=1 Ci(log Ci 1).

As Pm i=1 Ci(log Ci 1) is just a constant with respect to θ, the minimisation of L (θ) with respect to θ, is equivalent to

LIC LL(θ) =

i=1 Ci log Ξi.

Proof [of Eq. (36) which shows the equivalence between IC-LL and SSE.] Let ˆϕ(z) .= z2 deﬁne a convex function. For clarity, further deﬁne Ci .= C(oi 1, oi] and Ξi .= Ξ(oi 1, oi].

i=1 B ˆϕ(Ci, Ξi) =

C2 i Ξ2 i 2Ξi(Ci Ξi) =

i=1 [Ci Ξi]2

which is exactly the SSE between Ci and Ξi.

Proof [of Proposition 6 both the lower-bound counting process and upper-bound counting process converge to the original counting process as the discretization points increase.] To consider the convergence of our counting process approximation, we further assume that the approximation points are equidistant apart

We explicitly add the superscript D is to denote the number of approximation points in the

discretization sequence. Further we let DD = n d D j o D

j=0 as the set of approximation points.

We consider the convergence of the counting functions in L1-space, as counting processes are simple functions by deﬁnition. In particular, for the lower-bound counting process we aim to show that

0 |M(t) M D(t)| dt = 0.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

However, as M D(t) is a lower-bound to M(t) and by linearity of integration we can instead consider Z T

0 |M(t) M D(t)|dt = Z T

0 M(t)dt Z T

0 M D(t)dt.

For the original counting process M(t) we assume that we have an arbitrary realisation T = {ti}i=1 over a time period ti (0, T]. Thus we consider the integral of the counting process M(t) as the following,

0 M(t)dt = Z T

s [0,T] M({s}) Js t K

s [0,T] M({s}) Z T

s [0,T] M({s}) (T s)

ti T δ(ti) (T ti).

For the lower-bound counting process, we ﬁrst introduce indexing sets ID j = {i : ti (d D j 1, d D j ]} which are the indices of the event times which occur in the half-open interval (d D j 1, d D j ]. Thus we have,

0 M D(t)dt = Z T

j=1 M(d D j 1, d D j ] Jd D j < t d DK

j=1 M(d D j 1, d D j ] Jd D j < t TK

j=1 M(d D j 1, d D j ] Z T

j=1 M(d D j 1, d D j ] (T d D j )

(T d D j ).

However, we know that (d D j 1, d D j ] will become a singleton set as D from the unifor-

mity of discretization points. Further we know that for any D, the union S2D j=1(d D j 1, d D j ] = (0, T]. Thus in the limit of D , we have {(d D j 1, d D j ] : i {1, 2, . . . , 2D}} {{t} : t

Interval-censored Hawkes processes

(0, T]}. Thus, we compute the limit as the following,

0 M D(t)dt = lim D

δ(ti) T d D j

t (0,T] δ(t) Jt T K (T t)

ti T δ(ti) (T ti)

0 |M(t) M D(t)|dt = lim D

0 M(t)dt Z T

0 M(t)dt lim D

0 M(t)dt lim D

0 M(t)dt Z T

The proof is identical when the upper-bound approximation M+ D(t) is chosen instead.

Proof [of Eq. (25) derivation of the upper bound approximation of the MBPP compensator Ξ(t).] We consider the following compensator generated by the upper-bound counting process M+ D(t):

Ξ+(t) .= S(t) + Z t

0 E[M+ D(y)]φ(t y)dy

= S(t) + Z t

j=1 M(dj 1, dj] Jdj 1 < y d DK

j=1 E [M(dj 1, dj]] Z t

0 Jdj 1 < y d DKφ(t y)dy

j=1 E [M(dj 1, dj]] Z min(t,d D)

dj 1 φ(t y)dy,

where D(t) = max{i {1, . . . , D} : di < t}. This is a similar equation to Eq. (24).

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Proof [of Proposition 7 derivation of Ξ= D(oi 1, oi] the approximation of the MBPP compensator over an interval] We assume that the anti-derivative of the kernel function φ = Φ exists. Thus we have

Ξ= D(oi 1, oi] = Ξ D(oi) Ξ D(oi 1)

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] Z min(oi,d D)

dj φ(oi y)dy

j=1 E [M(dj 1, dj]] Z min(oi 1,d D)

dj φ(oi 1 y)dy

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] Z oi

dj φ(oi y)dy

j=1 E [M(dj 1, dj]] Z oi 1

dj φ(oi 1 y)dy

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]]

dj φ(oi y)dy Z oi 1

dj φ(oi 1 y)dy

j= D(oi 1)+1 E [M(dj 1, dj]] Z oi

dj φ(oi y)dy

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] [ (Φ(0) Φ(oi dj)) + (Φ(0) + Φ(oi 1 dj))]

j= D(oi 1)+1 E [M(dj 1, dj]] [ Φ(0) + Φ(oi dj)]

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] [Φ(oi dj) Φ(oi 1 dj)]

j= D(oi 1)+1 E [M(dj 1, dj]] [Φ(oi dj) Φ(0)]

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] Z oi dj

oi 1 dj φ(y)dy

j= D(oi 1)+1 E [M(dj 1, dj]] Z oi dj

Interval-censored Hawkes processes

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] Z oi dj

oi 1 dj φ(y)dy

j= D(oi 1)+1 E [M(dj 1, dj]] Z oi dj

oi 1 dj φ(y)dy

j= D(oi 1)+1 E [M(dj 1, dj]] Z oi 1 dj

= S(oi) S(oi 1) +

j=1 E [M(dj 1, dj]] Z oi dj

oi 1 dj φ(y)dy

j= D(oi 1)+1 E [M(dj 1, dj]] Z oi 1 dj

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Table 10: Solutions for diﬀerent immigrant intensities s(t), and the exponentially time-decaying kernel in Eq. (13) of the main text. ΨT acts as a set of immigrant event times.

s(t) MBPP Intensity Function ξ(t)

I δ(t a) δ(t a) + κθ exp((k 1)θ(t a)) Jt > a K

II P sz ΨT δ(t sz) P sz ΨT

δ(t sz) + κθ exp((k 1)θ(t sz)) Jt > sz K

III Ja t < b K κ

k exp((k 1)θ(t a)) Ja < t b K + κ

1 κ exp((k 1)θ(t b)) exp((k 1)θ(t a)) Jt > b K

IV Pm i=1 S(qi 1,qi]

qi qi 1 Jqi 1 < t qi K Pm i=1 S(qi 1,qi]

k exp((k 1)θ(t qi 1)) Jqi 1 < t qi K + κ

1)θ(t qi)) exp((k 1)θ(t qi 1)) Jt > qi K i

V κθ + (u0 κθ) exp( θt) κθ

1 κ (1 exp( (1 κ)θt)) + u0 exp( (1 κ)θt)

VI sin(t) + α α

κ 1 α+αθ2 2ακθ2+ακ2θ2+θκ θ

1+θ2 2κθ2+κ2θ2 exp((k 1)θt) + sin(t)+θ2 sin(t) κθ2 sin(t) κθ cos(t)

1+θ2 2κθ2+κ2θ2

Interval-censored Hawkes processes

Appendix B. MBPP Calculations

B.1 MBPP Intensity Functions

We consider the closed form solutions of an MBPP intensity function ξ(t) with exponential kernel for various immigrant intensity functions. We use Corollary 3 where we have

h(t) = kθ exp((k 1)θt) Jt > 0K.

E(t) = δ(t) + h(t)

ξ(t) = (E s)(t) = s(t) + (h s)(t).

Speciﬁcally, we calculate each row of Table 10.

Row I. Consider s(t) = δ(t a) for a 0. Thus,

(h s)(t) = (h(t) δ(t a))

= h(t a) Jt > a K.

Together we have,

ξ(t) = δ(t a) + κθ exp((k 1)θ(t a)) Jt > a K.

Row II. Consider s(t) = P sz ΨT δ(t sz) for some set of events ΨT . As the MBPP intensity is a LTI (Rizoiu et al. (2017b)[Corollary 2.2]), we can simply express the MBPP intensity as the sum of Row I. Together we have,

δ(t sz) + κθ exp((k 1)θ(t sz)) Jt > sz K .

Row III. Consider s(t) = Ja < t b K an interval. Then,

(h s)(t) = (h(t) Ja < t b K).

We break the calculation into three cases: (1) t a; (2) a < t b; and (3) t > b. Case (1): Suppose that t a, then simply (h(t) Ja < t b K) = 0. Case (2): Suppose that a < t b. Then we have,

(h(t) Ja < t b K) = Z

0 Ja < τ b K h(t τ) dτ

a h(t τ) dτ

= κθ (1 κ)θ exp((k 1)θ(t τ)) τ=t

τ=a = κ 1 κ [1 exp((k 1)θ(t a))] .

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Case (3): Suppose that t > b. Then we have,

(h(t) Ja < t b K) = Z

0 Ja < τ b K h(t τ) dτ

a h(t τ) dτ

= κθ (1 κ)θ exp((k 1)θ(t τ)) τ=b

τ=a = κ 1 κ [exp((k 1)θ(t b)) exp((k 1)θ(t a))] .

Thus together we have that

(h(t) Ja < t b K) =

κ 1 κ [1 exp((k 1)θ(t a))] a < t b

κ 1 κ [exp((k 1)θ(t b)) exp((k 1)θ(t a))] t > b

ξ(t) = Ja < t b K + κ 1 κ [1 exp((k 1)θ(t a))] Ja < t b K

+ κ 1 κ [exp((k 1)θ(t b)) exp((k 1)θ(t a))] Jt > b K

k exp((k 1)θ(t a)) Ja < t b K

+ κ 1 κ [exp((k 1)θ(t b)) exp((k 1)θ(t a))] Jt > b K.

Row IV. Consider s(t) = Pm i=1 S(qi 1,qi]

qi qi 1 Jqi 1 < t qi K for a set of intervals with nonnegative immigrant volumes S(qi 1, qi]. As the MBPP intensity is an LTI (Rizoiu et al. (2017b)[Corollary 2.2]), we can simply express the MBPP intensity as the weighted sum of Row III.

S(qi 1, qi]

k exp((k 1)θ(t qi 1)) Jqi 1 < t qi K

+ κ 1 κ [exp((k 1)θ(t qi)) exp((k 1)θ(t qi 1))] Jt > qi K .

Row V. Proof shown in the main text.

Row VI. Consider s(t) = sin(t) + α. Then

(h s)(t) = (h α)(t) + (h sin)(t).

For (h α)(t) we have,

(h α)(t) = α Z t

Interval-censored Hawkes processes

= ακ κ 1 (exp((k 1)θt) 1)

= ακ κ 1 exp((k 1)θt) ακ κ 1.

For (h sin)(t) ﬁrst consider the indeﬁnite integral

= Z sin(τ) exp(a(t τ))dτ.

By integration by parts,

a sin(τ) exp(a(t τ)) + 1

Z cos(τ) exp(a(t τ))dτ.

Then by integration by parts again,

a sin(τ) exp(a(t τ)) + 1

a cos(τ) exp(a(t τ)) 1

Z sin(τ) exp(a(t τ))dτ

a sin(τ) exp(a(t τ)) 1

a2 cos(τ) exp(a(t τ)) 1

Z sin(τ) exp(a(t τ))dτ

= exp(a(t τ)) 1

a sin(τ) + 1

a2 cos(τ) 1

Z sin(τ) exp(a(t τ))dτ

= exp(a(t τ)) 1

a sin(τ) + 1

a2 cos(τ) 1

= exp(a(t τ)) 1

a sin(τ) + 1

a2 cos(τ) 1

( ) = exp(a(t τ)) [a sin(τ) + cos(τ)]

Therefore for (h sin)(t) we have

(h sin)(t) = κθ exp((k 1)θ(t τ)) [(k 1)θ sin(τ) + cos(τ)]

(k 1)2θ2 + 1

= κθ exp((k 1)θ(t τ)) [(k 1)θ sin(τ) + cos(τ)]

1 + θ2 2κθ2 + κ2θ2

= κθ[(k 1)θ sin(t) + cos(t)]

1 + θ2 2κθ2 + κ2θ2 + κθ exp((k 1)θt) 1 + θ2 2κθ2 + κ2θ2 .

Thus for the MBPP intensity,

ξ(t) = sin(t) + α + (h sin)(t)

= sin(t) + α + ακ κ 1 exp((k 1)θt) ακ κ 1

κθ[(k 1)θ sin(t) + cos(t)]

1 + θ2 2κθ2 + κ2θ2 + κθ exp((k 1)θt) 1 + θ2 2κθ2 + κ2θ2 .

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Now we collect the trigonometric terms, the exponential terms, and the other constants. The trigonometric terms:

sin(t) κθ[(k 1)θ sin(t) + cos(t)]

1 + θ2 2κθ2 + κ2θ2

= sin(t) κ2θ2 sin(t) κθ2 sin(t) + κθ cos(t)

1 + θ2 2κθ2 + κ2θ2

= sin(t) + θ2 sin(t) 2κθ2 sin(t) + κ2θ2 sin(t) κ2θ2 sin(t) + κθ2 sin(t) κθ cos(t)

1 + θ2 2κθ2 + κ2θ2

= sin(t) + θ2 sin(t) κθ2 sin(t) κθ cos(t)

1 + θ2 2κθ2 + κ2θ2 .

The exponential terms:

ακ κ 1 exp((k 1)θt) + κθ exp((k 1)θt) 1 + θ2 2κθ2 + κ2θ2

= κ κ 1 α + αθ2 2ακθ2 + ακ2θ2 + θκ θ

1 + θ2 2κθ2 + κ2θ2 exp((k 1)θt).

The constant terms:

α ακ κ 1 = α κ 1.

Thus together,

ξ(t) = α κ 1 + κ κ 1 α + αθ2 2ακθ2 + ακ2θ2 + θκ θ

1 + θ2 2κθ2 + κ2θ2 exp((k 1)θt)

+ sin(t) + θ2 sin(t) κθ2 sin(t) κθ cos(t)

1 + θ2 2κθ2 + κ2θ2 .

Interval-censored Hawkes processes

Table 11: Solutions for diﬀerent immigrant intensities s(t), and the exponentially time-decaying kernel in Eq. (13) of the main text. ΨT acts as a set of immigrant event times.

s(t) MBPP Compensator Function Ξ(t)

I δ(t a) Jt > a K h 1 + κ

κ 1 exp((κ 1)θ(t a)) 1 i

II P sz ΨT δ(t sz) P sz ΨT Jt > sz K h 1 + κ

κ 1 exp(κ 1)θ(t sz)) 1 i

III Ja t < b K (t a) κ

k exp((k 1)θ(t a)) i Ja < t b K + κt

1 κ h exp((k 1)θ(t b))

exp((k 1)θ(t a)) i Jt > b K+ κb

k exp((k 1)θ(t b)) i Jt > b K+ κa

1 κ h exp((k

1)θ(t a)) 1

k i Jt > b K

IV Pm i=1 S(qi 1,qi]

qi qi 1 Jqi 1 < t qi K (t qm (t))S(qm (t)+1 qm (t))

qm (t)+1 qm (t) + Pm (t) i=1 S(qi 1, qi] + Pm (t) i=1 S(qi 1,qi]

1 κ h exp((κ 1)θ(t

qi)) 1 exp((κ 1)θ(t qi 1))+exp((κ 1)θ(qi qi 1)) i + κ

1 κ h t qm (t) + 1

θ(κ 1) exp((k 1)θ(t qm (t))) dτ i , where m (t) = 1 + max{i | qi > t}

V κθ + (u0 κθ) exp( θt) κθt

(1 κ) + u0 1

(1 κ)θ(1 exp( (1 κ)θt))

VI sin(t) + α αt

(κ 1)2θ α+αθ2 2ακθ2+ακ2θ2+θκ θ

1+θ2 2κθ2+κ2θ2 (exp((k 1)θτ) 1) +

cos(t) θ2 cos(t)+κθ2 cos(t) κθ sin(t)+1+θ2 κθ2

1+θ2 2κθ2+κ2θ2

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

B.2 MBPP Compensator Functions

We similarly calculate the corresponding MBPP compensator function of the MBPP intensity function we looked at above. In particular we similarly calculate each row of Table 11.

Row I. Consider s(t) = δ(t a) for a 0. We simply integrate the result of Row I in Table 10,

0 (δ(τ a) + κθ exp((κ 1)θ(τ a)) Jτ > a K) dτ

0 δ(τ a)dτ + Z t

0 κθ exp((κ 1)θ(τ a)) Jτ > a Kdτ

= Jt > a K + Jt > a K Z t

a κθ exp((κ 1)θ(τ a))dτ

= Jt > a K + Jt > a K κ κ 1 exp((κ 1)θ(τ a))

τ=a = Jt > a K + Jt > a K κ κ 1 exp((κ 1)θ(t a)) 1

= Jt > a K 1 + κ κ 1 exp((κ 1)θ(t a)) 1 .

Row II. Consider s(t) = P sz ΨT δ(t sz) for some set of events ΨT . As MBPP compensator is LTI, we compute the weighted sum of Row II.

sz ΨT Jt > sz K 1 + κ κ 1 exp((κ 1)θ(t sz)) 1 .

Row III. Consider s(t) = Ja < t b K an interval. First we consider the integral of s(t).

0 s(τ)dτ = (t a) Ja < t b K + (b a) Jt > b K,

which can be simply computed by considering diﬀerent values of t. As the MBPP compensator is a LTI, the result can be calculated as the weighted sum of of Row III in Table 10. The ﬁrst term,

(t a) κ 1 κ

k exp((k 1)θ(t a)) i Ja < t b K

+ (t a) κ 1 κ

h exp((k 1)θ(t b)) exp((k 1)θ(t a)) i Jt > b K.

The second term,

(b a) κ 1 κ

k exp((k 1)θ(t b)) i Jt > b K.

Interval-censored Hawkes processes

Together we have

Ξ(t) = (t a) κ 1 κ

k exp((k 1)θ(t a)) i Ja < t b K

+ (t a) κ 1 κ

h exp((k 1)θ(t b)) exp((k 1)θ(t a)) i Jt > b K

+ (b a) κ 1 κ

k exp((k 1)θ(t b)) i Jt > b K

= (t a) κ 1 κ

k exp((k 1)θ(t a)) i Ja < t b K

h exp((k 1)θ(t b)) exp((k 1)θ(t a)) i Jt > b K

k exp((k 1)θ(t b)) i Jt > b K

h exp((k 1)θ(t a)) 1

Row IV. To calculate this row, one could utilise the LTI with Row III. However, we derive this diﬀerently, which provides an equation which is easier to implement. First, let us calculate the exogenous component.

S(qi 1, qi]

qi qi 1 Jqi 1 < τ qi K

S(qi 1, qi]

qi qi 1 Z t

0 Jqi 1 < τ qi K dτ

The integral within the summation is what we need to calculate now R t 0Jqi 1 < τ qi K dτ. We split up the calculation into three cases: (1) t qi 1; (2) qi 1 < t qi; and (3) t > qi. Case (1): Suppose that t qi 1, then simply R t 0Jqi 1 < τ qi K dτ = 0. Case (2): Suppose that qi 1 < t qi, Z t

0 Jqi 1 < τ qi K dτ = Z t

qi 1 1 dτ = t qi 1.

Case (3): Suppose that t > qi, Z t

0 Jqi 1 < τ qi K dτ = Z qi

qi 1 1 dτ = qi qi 1.

For arbitrary ﬁxed t, let m (t) = 1 + max{i | qi > t}. Then, by combining the three cases we have that,

S(t) = (t qm (t))S(qm (t)+1 qm (t))

qm (t)+1 qm (t) +

i=1 S(qi 1, qi].

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Similarly to the exogenous function, we consider the endogenous response in the same three cases. We directly use the piecewise derivations found in Row III of Table 10. Thus, we only need to compute R t 0(h s) dτ. First we consider R t 0(h(τ) Jqi 1 < τ qi K) dτ Case (1): Suppose that t qi 1, then simply R t 0(h(τ) Jqi 1 < τ qi K)Jt qi 1K dτ = 0. Case (2): Suppose that qi 1 < t qi,

0 (h(τ) Jqi 1 < τ qi K)Jqi 1 < τ qi K dτ

qi 1 (h s)(τ) dτ

κ 1 κ [1 exp((k 1)θ(τ qi 1))] dτ

qi 1 exp((k 1)θ(τ qi 1)) dτ

t qi 1 + 1 θ(κ 1) 1 θ(κ 1) exp((k 1)θ(t qi 1)) dτ .

Case (3): Suppose that t > qi,

0 (h(τ) Jqi 1 < τ qi K)Jτ > qi K dτ

qi (h s)(τ) dτ

κ 1 κ [exp((κ 1)θ(τ qi)) exp((κ 1)θ(τ qi 1))] dτ

= κ 1 κ [exp((κ 1)θ(t qi)) 1 exp((κ 1)θ(t qi 1))

+ exp((κ 1)θ(qi qi 1))] .

Thus, combining the three cases, we have that,

0 (h s)(τ) dτ

S(qi 1, qi]

0 (h(t) Jqi 1 < t qi K) dτ

S(qi 1, qi]

κ 1 κ [exp((κ 1)θ(t qi)) 1 exp((κ 1)θ(t qi 1))

+ exp((κ 1)θ(qi qi 1))] +

t qm (t) + 1 θ(κ 1) 1 θ(κ 1) exp((k 1)θ(t qm (t))) dτ .

Interval-censored Hawkes processes

Together, we can now compute the compensator function.

Ξ(t) = (t qm (t))S(qm (t)+1 qm (t))

qm (t)+1 qm (t) +

i=1 S(qi 1, qi]

S(qi 1, qi]

κ 1 κ [exp((κ 1)θ(t qi)) 1 exp((κ 1)θ(t qi 1))

+ exp((κ 1)θ(qi qi 1))]

t qm (t) + 1 θ(κ 1) 1 θ(κ 1) exp((k 1)θ(t qm (t))) dτ .

Row V. To calculate the MBPP compensator, we integrate Row V of Table 10.

κθ 1 κ (1 exp( (1 κ)θτ)) + u0 exp( (1 κ)θτ)dτ

τ + 1 (1 κ)θ exp( (1 κ)θτ) u0 (1 κ)θ exp( (1 κ)θτ)

= κθt (1 κ) + u0 1

(1 κ)θ(1 exp( (1 κ)θt)).

Row VI. To calculate the MBPP compensator, we integrate Row VI of Table 10.

0 ξ(τ)dτ = Z t

h α κ 1 + κ κ 1 α + αθ2 2ακθ2 + ακ2θ2 + θκ θ

1 + θ2 2κθ2 + κ2θ2 exp((k 1)θτ)

+ sin(τ) + θ2 sin(τ) κθ2 sin(τ) κθ cos(τ)

1 + θ2 2κθ2 + κ2θ2

We calculate the integral term by term. For the ﬁrst term, Z t

0 α κ 1dτ = αt

For the second term, Z t

κ κ 1 α + αθ2 2ακθ2 + ακ2θ2 + θκ θ

1 + θ2 2κθ2 + κ2θ2 exp((k 1)θτ)dτ

= κ κ 1 α + αθ2 2ακθ2 + ακ2θ2 + θκ θ

1 + θ2 2κθ2 + κ2θ2

0 exp((k 1)θτ)dτ

= κ (κ 1)2θ α + αθ2 2ακθ2 + ακ2θ2 + θκ θ

1 + θ2 2κθ2 + κ2θ2 (exp((k 1)θτ) 1).

For the third term, Z t

sin(τ) + θ2 sin(τ) κθ2 sin(τ) κθ cos(τ)

1 + θ2 2κθ2 + κ2θ2 dτ

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

= cos(τ) θ2 cos(τ) + κθ2 cos(τ) κθ sin(τ)

1 + θ2 2κθ2 + κ2θ2

= cos(t) θ2 cos(t) + κθ2 cos(t) κθ sin(t) + 1 + θ2 κθ2

1 + θ2 2κθ2 + κ2θ2 .

Thus together,

κ 1 + κ (κ 1)2θ α + αθ2 2ακθ2 + ακ2θ2 + θκ θ

1 + θ2 2κθ2 + κ2θ2 (exp((k 1)θτ) 1)

+ cos(t) θ2 cos(t) + κθ2 cos(t) κθ sin(t) + 1 + θ2 κθ2

1 + θ2 2κθ2 + κ2θ2 .

Appendix C. Latent Homogeneous Poisson Process Exogenous Function - Probabilistic Approach

The Latent Homogenous Poisson Process (LHPP) exogenous function presented in Section 6.2 can be considered in a probabilistic approach. Instead of assigning a point process to dictate the interval-censored exogenous events, we consider a probability distribtion p(s) corresponding to the probability an exogenous event occurs from (0, T]. Furthermore, we have the corresponding immigrant event times which are sampled from the probability distributions s ΨT . We ﬁrst consider the impulse response MBPP with the position of the impulse as a parameter,

ξδ(t; s) .= δ(t s) + Z t

0 φ(s) ξδ(t s; s)ds. (56)

Then, we can consider the MBPP corresponding to the expectation of the intensity function with respect to the position of the impulse:

Es p(s) [ξδ(t; s)] = Es p(s)

δ(t s) + Z t

0 φ(s) ξδ(t s; s)ds

= Es p(s) [δ(t s)] + Es p(s)

0 φ(s) ξδ(t s; s)ds

= Es p(s) [δ(t s)] + Z t

0 φ(s) Es p(s) [ξδ(t s; s)] ds

0 δ(t s) p(s)ds + Z t

0 φ(s) Es p(s) [ξδ(t s; s)] ds

= p(t) + Z t

0 φ(s) Es p(s) [ξδ(t s; s)] ds.

We now consider probability distributions pi(s) deﬁned as a uniform distribution over the exogenous event observation period (qi 1, qi], for i {q1, . . . , qm}:

( 1 qi qi 1 s (qi 1, qi]

0 otherwise

Interval-censored Hawkes processes

= 1 qi qi 1 Jqi 1 < s qi K.

The probabilistic interpretation of the LHPP exogenous function can now be expressed as the weighted sum of MBPP impulse expectations, where the weights are given by exogenous event volumes S(qi 1, qi] and the summation is over the probability distribution pi(s). That is, the intensity function is given by:

i=1 S(qi 1, qi] Es pi(s) [ξδ(t; s)]

i=1 S(qi 1, qi] pi(t) + Z t

0 φ(s) Es pi(s) [ξδ(t s; s)] ds

i=1 S(qi 1, qi] pi(t) +

i=1 S(qi 1, qi] Z t

0 φ(s) Es pi(s) [ξδ(t s; s)] ds

i=1 S(qi 1, qi] pi(t) + Z t

i=1 S(qi 1, qi] Es pi(s) [ξδ(t s; s)]

i=1 S(qi 1, qi] pi(t) + Z t

0 φ(s) ξ (t s)ds

S(qi 1, qi]

qi qi 1 Jqi 1 < t qi K + Z t

0 φ(s) ξ (t s)ds,

where the exogenous term of the intensity function is the LHPP exogenous function s (t), Eq. (43).

Appendix D. Convexity analysis

This section analyzes the loss functions of the Hawkes process and MBPP using the standard point process log-likelihood. We ﬁnd that with the exponential triggering kernel and multiimpulse response in Equation (39), both the Hawkes process and MBPP are convex in κ for ﬁxed θ. To account for exogenous functions, the exogenous events ΨT and the endogenous events ΥT are considered to be separable.

D.1 Hawkes process

It is trivial to verify that the negative log-likelihood functions corresponding to the exogenous functions below are convex in κ, as the terms consist of a negative-log function and linear functions which are convex functions of κ, but not θ.

Multiple impulse response: Equation (39)

L(θ; T) .= X

sj ΨT δ(tn sj) + X

ti<tn κθe θ(tn ti)

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

+ |ΨT | + κ X

ti ΩT (1 e θ(T ti)). (57)

Step response: s(t) = Jt 0K

L(θ; T) .= X

ti<tn κθe θ(tn ti) !

ti ΩT (1 e θ(T ti)). (58)

Dassios response: Equation (12)

L(θ; T) .= X

κθ + (u0 κθ)e θtn + X

ti<tn κθe θ(tn ti) !

+ κθT + u0 κθ

θ (1 e θT ) + κ X

ti ΩT (1 e θ(T ti)) (59)

D.2 Mean Behaviour Poisson

The negative log likelihood for the multiple impulse response is given by

L(θ; T) .= X

sj ΨT δ(tn sj) + κθe(κ 1)θ(tn sj) Jtn > sj K

+ |ΨT | + X

e(κ 1)θ(T s) 1 (60)

We will prove that for ﬁxed θ, L(θ, T) is convex with respect to κ over κ [0, 1). Our ﬁrst step is to prove convexity in κ of the NLL for the single impulse response. Suppose that s(t) = δ(t s) for s 0. The corresponding NLL is given by

l(ξ( ;κ)) z }| { X

tn ΩT log δ(tn s) + κθe(κ 1)θ(tn s) Jtn > s K

+ 1 + κ κ 1

e(κ 1)θ(T s) 1

| {z } Ξ(T;κ)

As we can see, the single impulse NLL can be written as a sum of two terms, a function of the conditional intensity l(ξ( ; κ)), where l : x 7 P log(x), and the compensator at the ﬁnal time Ξ(T; κ). To prove convexity of L(θ, T), we show that both components are convex in κ.

Interval-censored Hawkes processes

Given that the function l is non-decreasing and convex (since a sum of convex functions is still convex), it suﬃces to show that

ξ(tn; κ) = δ(tn s) + κθe(κ 1)θ(tn s) Jtn > s K

is convex in κ to prove that l(ξ( ; κ)) is convex in κ. This follows from the property that if f and g are convex and f is non-decreasing, then f g is convex. It can be shown that

κ2 = θ2(tn s)eθ(tn s)(κ 1)(κθ(tn s) + 2) Jtn > s K 0,

which proves convexity of ξ(tn; κ) for κ. Thus, l(ξ( ; κ)) is a convex function. We now show convexity of Ξ(T; κ) in κ. It can be shown that

κ2 = 2 e θ(T s)(1 κ)(2 + 2θ(T s)(1 κ) + (θ(T s))2κ(1 κ)2)

So that 2Ξ(T;κ)

κ2 0, we must show that

p(k) = e θ(T s)(1 κ)(2 + 2θ(T s)(1 κ) + (θ(T s))2κ(1 κ)2)(1 κ)3 2

whenever 0 κ < 1. Since

dp dk = θ2(T s)2(1 κ)2e θ(T s)(1 (θ(T s)))(θ(T s)κ + 3) > 0,

for 0 κ < 1, which means that p(k) is non-decreasing over this interval, and p(1) = 2, we have 2Ξ(T; κ)

for 0 κ < 1. Thus, Ξ(T; κ) is convex in κ over this range. Since l(ξ( ; κ)) and Ξ(T; κ) are both convex, their sum L(θ, T), the NLL for the single impulse response, is convex. Convexity of the NLL for the multi impulse response follows from the following observation: Observe that the conditional intensity ξ and the compensator Ξ for the multi impulse response can be expressed, respectively, as sums of ξ and Ξ of the single impulse response. Given that a nonnegative-weighted sum of convex functions is still convex, it follows that the NLL for the multi impulse response is convex as well. For θ, observe that

θ log δ(tn s) + κθe(κ 1)θ(tn s) Jtn > s K

is convex in θ but

κ κ 1e(κ 1)θ(T s) = κ(κ 1)(T s)2e(κ 1)θ(T s) < 0

for 0 < κ < 1. Thus 2L

θ2 0 does not hold, implying that the NLL is not convex in θ for ﬁxed κ.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

Appendix E. Experiments

E.1 Number of Approximation Points

Fig. 7 shows the RSS of the compensator values of the discrete compensator function versus the closed-form compensator function.

100 200 300 400 500

2000 4000 6000 8000 10000 12000

RSS of Compensator Discretization Obs: 15, Params: (K=0.95, Theta=1.15)

Number of Discretization Points

RSS of Compensator Value

Figure 7: Compensator approximation error with respect to the number of discretization points.

E.2 Other Non-endogenous Experiments

Fig. 8 shows the parameter ﬁttings resulting from a diﬀerent number of discretization points.

Table 12: Non-endogenous synthetic experiments for Scenario F.

trule True Params. Loss Func. Model Number of Observation Intervals

5 10 15 30 60 100

κ = 0.6 θ = 0.8

SSE Closed κ 0.726 0.004 0.682 0.005 0.668 0.005 0.642 0.005 0.62 0.005 0.612 0.005 θ 0.647 0.021 0.641 0.024 0.801 0.031 0.995 0.052 0.915 0.052 0.865 0.045

Approx κ 10 0 0.621 0.006 0.605 0.006 0.601 0.006 0.601 0.006 0.601 0.006 θ 0.005 0 8.823 0.866 9.546 1.589 1.09 0.091 0.909 0.052 0.856 0.043

IC-LL Closed κ 0.728 0.004 0.69 0.005 0.673 0.005 0.644 0.005 0.622 0.005 0.613 0.005 θ 0.644 0.019 0.67 0.023 0.828 0.031 1.014 0.054 0.931 0.05 0.876 0.045

Approx κ 10 0 0.618 0.007 0.599 0.006 0.594 0.006 0.595 0.006 0.596 0.006 θ 0.005 0 8.355 1.02 9.607 1.603 1.091 0.088 0.905 0.051 0.85 0.043

κ = 0.95 θ = 1.15

SSE Closed κ 0.958 0.002 1.496 2.17 1.494 2.171 0.95 0.004 0.95 0.004 0.95 0.004 θ 1.765 0.098 1.362 0.358 1.301 0.344 1.283 0.089 1.216 0.08 1.19 0.076

Approx κ 10 0 1.139 0.007 1.033 0.005 0.958 0.003 0.95 0.004 0.95 0.004 θ 0.014 0 9.409 0.764 10 0 10 0 1.765 0.177 1.433 0.111

IC-LL Closed κ 0.959 0.002 0.953 0.003 0.95 0.003 0.949 0.003 0.95 0.004 0.95 0.004 θ 1.703 0.086 1.52 0.091 1.451 0.096 1.314 0.083 1.233 0.075 1.202 0.071

Approx κ 10 0 1.161 0.007 1.041 0.005 0.958 0.003 0.95 0.004 0.95 0.004 θ 0.017 0 9.767 0.403 10 0 10 0 1.776 0.163 1.433 0.102

E.3 Examine the MBPP ﬁtting bias

Setup and protocol. In this section, we examine the parameter ﬁtting bias for the MBPP process (shown in Figs. 9 and 10). We achieve this by sampling the Hawkes process

Interval-censored Hawkes processes

100 200 300 400

0.948 0.950 0.952 0.954 0.956 0.958

K Discretisation Test IC LL, Obs: 15 , Params: (K=0.95, Theta=1.15)

Number of Discretisation Points

100 200 300 400

0.948 0.950 0.952 0.954 0.956 0.958

K Discretisation Test SSE, Obs: 15 , Params: (K=0.95, Theta=1.15)

Number of Discretisation Points

100 200 300 400

Theta Discretisation Test IC LL, Obs: 15 , Params: (K=0.95, Theta=1.15)

Number of Discretisation Points

Theta Param

100 200 300 400

Theta Discretisation Test SSE, Obs: 15 , Params: (K=0.95, Theta=1.15)

Number of Discretisation Points

Theta Param

Figure 8: Plots detailing the parameter ﬁts of synthetic data in setting D with varying numbers of discretization points. The plots are taken over 15 observation intervals, using κ = 0.95 and θ = 1.15 parameters. The red line indicates these true parameters of the synthetic data.

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

over a wide range of parameters, ﬁtting the MBPP process for each parameter combination in Scenario F as per Table 2 , and examining the diﬀerence between the true parameter (used for sampling) and the ﬁtted parameter. We sample synthetic Hawkes realizations using HP-sin (see Eq. (53)) with a parameter grid. We vary κ between 0.1 and 0.95 (step 0.05) and θ between 0.5 to 1.55 (step 0.05). For each parameter combination, we jointly ﬁt 50 groups of realizations, and we obtain 50 MBPP parameter estimates (see the main-text experiments section). We use six values for the number of observation intervals, controlling the width of the observation interval: 5 intervals (largest width, Fig. 9a), 10 intervals (Fig. 9b), 15 intervals (Fig. 9c), 30 intervals (Fig. 10a), 60 intervals (Fig. 10b), and 100 intervals (smallest width, Fig. 10c). We report the mean diﬀerence between ﬁtted and true values for κ (left panel of each ﬁgure) and θ (right panel). Furthermore we highlight the speciﬁc parameters used in the main-text by the green shapes, with the triangle signifying κ = 0.6 and θ = 0.8; and diamond signifying κ = 0.95 and θ = 1.15.

Results and interpretation There are several conclusions that emerge from Figs. 9 and 10. First, we observe that we retrieve parameter κ with only minor errors over every parameter conﬁguration. The occasional more consistent over-estimation is likely due to convergence failures of the optimization algorithm. Second, we observe that ﬁtting the θ parameter is signiﬁcantly more challenging. For high values of θ the ﬁtting under-estimates the true θ for low values of κ (bottom-right corner) and over-estimates it for high values of κ (top-right corner). Note that high θ corresponds to a faster decay rate for the Hawkes kernel φ(t), which typically translates to oﬀspring events being tightly clustered close to their parent event. As θ decreases (slower kernel decay), the parameter is increasingly well recovered. Third observation, there seems to be a breakpoint between regions where parameters are underestimated (blue) and overestimated (red) for the θ plots. Forth and as expected, for higher data granularity i.e., the higher the number of intervals the θ are recovered closer to their true values. Fifth and ﬁnally, for high data granularity (100 intervals in Fig. 10c) we observe a slight over-estimation for both κ and θ. We posit that this consistent over-estimation is due to the model mismatch between the generating model (Hawkes process) and the ﬁtting model (MBPP).

Interval-censored Hawkes processes

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

0.4 0.2 0 0.2 0.4

= Fitted κ~ True κ (#Obs = 5)

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

1 0.6 0.2 0.2 0.6 1

= Fitted θ~ True θ (#Obs = 5)

(a) 5 observational intervals

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

0.4 0.2 0 0.2 0.4

= Fitted κ~ True κ (#Obs = 10)

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

1 0.6 0.2 0.2 0.6 1

= Fitted θ~ True θ (#Obs = 10)

(b) 10 observational intervals

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

0.4 0.2 0 0.2 0.4

= Fitted κ~ True κ (#Obs = 15)

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

1 0.6 0.2 0.2 0.6 1

= Fitted θ~ True θ (#Obs = 15)

(c) 15 observational intervals

Figure 9: MBPP parameter error heatmap for κ (left) and θ (right).

Rizoiu, Soen, Li, Calderon, Dong, Menon, and Xie

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

0.4 0.2 0 0.2 0.4

= Fitted κ~ True κ (#Obs = 30)

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

1 0.6 0.2 0.2 0.6 1

= Fitted θ~ True θ (#Obs = 30)

(a) 30 observational intervals

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

0.4 0.2 0 0.2 0.4

= Fitted κ~ True κ (#Obs = 60)

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

1 0.6 0.2 0.2 0.6 1

= Fitted θ~ True θ (#Obs = 60)

(b) 60 observational intervals

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

0.4 0.2 0 0.2 0.4

= Fitted κ~ True κ (#Obs = 100)

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95

1 0.6 0.2 0.2 0.6 1

= Fitted θ~ True θ (#Obs = 100)

(c) 100 observational intervals

Figure 10: MBPP parameter error heatmap for κ (left) and θ (right).