Self-Adaptable Point Processes with Nonparametric Time Decays

Zhimeng Pan, Zheng Wang, Jeff M. Phillips, and Shandian Zhe
School of Computing, University of Utah
Salt Lake City, UT 84112
{z.pan,wzhut}@utah.edu, jeffp@cs.utah.edu, zhe@cs.utah.edu

Abstract

Many applications involve multi-type event data. Understanding the complex influences of the events on each other is critical to discover useful knowledge and to predict future events and their types. Existing methods either ignore or only partially account for these influences. Recent works use recurrent neural networks to model the event rate. While being highly expressive, they couple all the temporal dependencies in a black box and can hardly extract meaningful knowledge. More importantly, most methods assume an exponential time decay of the influence strength, which is over-simplified and can miss many important strength-varying patterns. To overcome these limitations, we propose SPRITE, a Self-adaptable Point pRocess wIth nonparametric Time dEcays, which can decouple the influences between every pair of events and capture various time decays of the influence strengths. Specifically, we use an embedding to represent each event type and model the event influence as an unknown function of the embeddings and time span. We derive a general construction that can cover all possible time-decaying functions. By placing Gaussian process (GP) priors over the latent functions and using Gauss-Legendre quadrature to obtain the integral in the construction, we can flexibly estimate all kinds of time-decaying influences, without restricting to any specific form or imposing derivative constraints that bring learning difficulties. We then use weight-space augmentation of GPs to develop an efficient stochastic variational learning algorithm. We show the advantages of our approach in both an ablation study and real-world applications.

1 Introduction

Events of multiple types are ubiquitous, such as in online shopping, social networking and biological signal transduction. Understanding the complex influences of those events on each other, including excitation, inhibition, and how the strength of these influences varies with time, is crucial to discover useful knowledge and to predict future events and their types, which can benefit many applications, e.g., marketing analysis, biological study, and early warnings of catastrophes. While many excellent works have been proposed for event modeling and analysis (Blundell et al., 2012; Linderman and Adams, 2014; Xu et al., 2016; Tan et al., 2016; Xu and Zha, 2017; Zhang et al., 2020a), most of them are inadequate to capture complex influences between the events. The classical Poisson processes simply ignore these influences and assume independent increments. Although Hawkes processes (HP) can estimate the mutual excitation between the events, they ignore the inhibition effect, which is common in the real world. More importantly, existing methods (Xu et al., 2016; Zhang et al., 2020a), even those considering both excitation and inhibition, mostly assume an exponential time decay of the influence strength, which is over-simplified and can miss many other decay patterns. While recent works have also used recurrent neural networks (RNNs) (Du et al., 2016; Mei and Eisner, 2017) to model the event rate and hence are highly expressive, they couple
all the temporal dependencies in a black box, from which it is difficult to distill meaningful and important knowledge, e.g., influence types and strengths.

To address these limitations, we propose SPRITE, a multi-variate self-adaptable point process with nonparametric time decays. Our method not only can discover the type of the influence (i.e., excitation or inhibition) between every pair of events, but also is flexible enough to estimate all kinds of influence strength decaying patterns, not limited to the exponential decay or other particular forms. Specifically, we first introduce an embedding to represent each event type, and model the influence between any two events as an unknown function of the event type embeddings and time span. In so doing, we avoid separately estimating the influence between every pair of events, and largely reduce the number of parameters. Next, we propose a general construction that covers all possible time-decaying functions. We assign Gaussian process (GP) priors over the free latent functions and use Gauss-Legendre quadrature to calculate the integral in the construction. In this way, we can flexibly estimate a variety of time-decaying influences while not needing to impose any derivative constraint, which can bring learning difficulties. For efficient inference, we use weight-space augmentation of GPs to avoid computing huge covariance matrices, and neural network feature mappings to enhance the learning capacity. We then use the reparameterization trick to develop a scalable stochastic variational learning algorithm.

For evaluation, we first examined SPRITE in an ablation study. We tested the case of a single event type with excitation effects only and three time decays, and a bi-type event case with mixed excitation and inhibition effects. In both cases, our method accurately recovered the rate function of each type of events and the influence between the events, performing better than state-of-the-art methods based on or extending Hawkes processes and RNNs. Next, we examined SPRITE on three real-world datasets and one synthetic benchmark. We examined the accuracy in predicting the occurrence time and type of future events. In both tasks, our method nearly always outperforms the competing approaches, often by a large margin. Finally, by looking into the embeddings and influence functions learned by our model, we found interesting patterns within the event types and their interactions.

2 Background

Temporal Point Processes. We denote a sequence of events and their types by $\Gamma = [(t_1, s_1), \ldots, (t_M, s_M)]$, where each $t_i$ is the time point when the $i$-th event occurred ($t_1 \le \ldots \le t_M$) and $s_i$ is the type of the $i$-th event. Suppose we have K types of events in total, so each $s_i \in \{1, \ldots, K\}$. Temporal point processes (TPPs) (Daley and Vere-Jones, 2007) provide a fundamental framework to characterize and model the events. A TPP is defined via the rate (or conditional intensity) function of the events, $\lambda_s(t)$. For example, a homogeneous Poisson process assumes each $\lambda_s(t)$ is a constant $\lambda_s^0$, and the past events $\{(t_j, s_j) \mid t_j < t\}$ do not have any influence on generating a new event at $t$. While being simple and convenient, Poisson processes completely ignore the interactions between the events. Hawkes processes (HPs) (Hawkes, 1971) overcome this issue by modeling the triggering effects between the events,

$\lambda_s(t) = \lambda_s^0 + \sum_{t_j < t} \gamma_{s_j \to s}(t - t_j),$

where $\gamma_{s' \to s}(\Delta) \ge 0$ is called the triggering kernel, and characterizes how strongly the previous event at $t_j$ encourages a new event of type $s$ to occur at $t$.
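For concreteness, the following minimal NumPy sketch evaluates an intensity of this additive form; the kernel callable and the toy parameters are illustrative assumptions we add here, not part of any released implementation.

```python
import numpy as np

def additive_intensity(t, events, base_rate, kernel):
    """Evaluate lambda_s(t) for all K types under an additive (Hawkes-style) model.

    events:    list of past events (t_j, s_j) with t_j < t
    base_rate: (K,) array of background rates lambda_s^0
    kernel:    callable (s_prev, s, dt) -> triggering effect gamma_{s_prev -> s}(dt)
    """
    K = len(base_rate)
    lam = base_rate.astype(float).copy()
    for t_j, s_j in events:
        if t_j < t:
            dt = t - t_j
            # accumulate the triggering effect of this past event on every type s
            lam += np.array([kernel(s_j, s, dt) for s in range(K)])
    return lam

# toy usage with 2 event types and an exponential kernel (illustrative parameters)
expo = lambda sp, s, dt: 0.3 * np.exp(-1.0 * dt)
print(additive_intensity(3.0, [(0.5, 0), (2.0, 1)], np.array([0.1, 0.2]), expo))
```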
Naturally, this effect is supposed to decay with the time span $\Delta = t - t_j$. The most commonly used choice is the exponential decay,

$\gamma_{s' \to s}(\Delta) = \alpha_{s' \to s} \exp(-\tau_{s' \to s} \Delta),$  (1)

where $\alpha_{s' \to s} > 0$ and $\tau_{s' \to s} > 0$. Given the definition of $\lambda_s(\cdot)$, the probability of the event sequence is

$p(\Gamma) = \prod_{k=1}^{K} \exp\big(-\int_0^T \lambda_k(t)\,\mathrm{d}t\big) \prod_{i=1}^{M} \lambda_{s_i}(t_i),$  (2)

where $T$ is the total time span of the events in $\Gamma$ and so $t_M \le T$.

RNN-based Point Processes. HPs only consider the triggering effects and hence can be quite restricted. In order to estimate arbitrarily varying rate functions, recent works (Du et al., 2016; Mei and Eisner, 2017) use recurrent neural networks (RNNs) to model the event dependencies and rates. The key idea is as follows. For each event $i$, we introduce a hidden state vector $\beta(t_i)$. The event dependency is modeled through the updating rule of the hidden states,

$\beta(t_{i+1}) = \rho\big(\beta(t_i), t_{i+1}\big),$  (3)

where $\rho(\cdot)$ is a complicated mapping, fulfilled by a combination of neural network operations, such as linear transformation, nonlinear activation, and gating. Note that the event type $s_i$ can also be incorporated into $\rho(\cdot)$. The rate function is then defined as a transformation of $\beta(t)$. We can therefore evaluate and maximize the point process likelihood (2) for model estimation.

While RNN-based methods are highly expressive and can capture arbitrarily complex dependencies, due to the black-box updating rule (see (3)), they can hardly distill meaningful knowledge, such as the influence type and strength between the events, which are critical for data analysis and knowledge discovery. On the other hand, while HPs enjoy excellent interpretability due to the additive structure in the intensity, they overlook the inhibition effect among the events, which is common in the real world. More importantly, nearly all the HP-based methods adopt exponential time-decaying kernels (or a combination of them) (see (1)), including the recent works (Mei and Eisner, 2017; Zhang et al., 2020a) that extend HPs to allow a negative triggering kernel ($\alpha_{s' \to s} < 0$ in (1)). However, an exponential decay can be overly simplistic for various, complex real-world applications. These methods can therefore miss other important decaying patterns (e.g., polynomial decays), resulting in inferior influence estimation and predictions.

3 Model

To overcome these issues, we propose SPRITE, a multi-variate self-adaptable point process model that can explicitly estimate the influence type and strength between every pair of events, and flexibly capture all kinds of time decays of the influence strength, not limited to a particular form. Specifically, we first define a raw rate function for each event type $s$ ($1 \le s \le K$) to characterize both the excitation and inhibition effects from the previous events,

$\tilde{\lambda}_s(t) = \lambda_s^0 + \sum_{i=1}^{M} \mathbf{1}(t_i < t)\, h_{s_i \to s}(t - t_i),$  (4)

where $\mathbf{1}(\cdot)$ is an indicator function, $\lambda_s^0$ is the background (or base) rate, and $h_{s_i \to s}(\cdot)$ is an influence term that characterizes the effect of a previous event of type $s_i$ that happened at $t_i$ on the occurrence of a new event of type $s$ at $t$. When $h_{s_i \to s} > 0$, it indicates a triggering effect, while $h_{s_i \to s} < 0$ means inhibition. The absolute value of $h_{s_i \to s}$ is the strength of the influence. Since the raw rate can be negative, we then apply a scaled soft-plus transformation (Mei and Eisner, 2017) to ensure we obtain a positive rate function, $\lambda_s(t) = \beta \log\big(1 + \exp(\tilde{\lambda}_s(t)/\beta)\big)$, where $\beta > 0$.
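As a concrete reading of (4) and the scaled soft-plus transform, here is a minimal NumPy sketch; the influence function and the toy parameters below are illustrative assumptions, not our actual implementation.

```python
import numpy as np

def scaled_softplus(x, beta=0.8):
    # beta * log(1 + exp(x / beta)): keeps the rate positive while
    # letting the raw rate in (4) go negative (inhibition)
    return beta * np.log1p(np.exp(x / beta))

def sprite_rate(t, events, base_rate, influence, beta=0.8):
    """Positive rate lambda_s(t) for all K types, following (4).

    influence(s_prev, s, dt) returns h_{s_prev -> s}(dt); it may be
    negative (inhibition) or positive (excitation).
    """
    K = len(base_rate)
    raw = base_rate.astype(float).copy()
    for t_i, s_i in events:
        if t_i < t:
            dt = t - t_i
            raw += np.array([influence(s_i, s, dt) for s in range(K)])
    return scaled_softplus(raw, beta)

# toy influence: type 0 excites type 1, type 1 inhibits type 0
h = lambda sp, s, dt: (max(1.0 - 0.05 * dt**2, 0.0) if (sp, s) == (0, 1)
                       else -0.5 * np.exp(-0.5 * dt) if (sp, s) == (1, 0)
                       else 0.0)
print(sprite_rate(2.0, [(0.3, 0), (1.1, 1)], np.array([0.2, 0.2]), h, beta=0.4))
```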
Next, we introduce a set of embeddings $\mathcal{U} = \{u_s\}_{s=1}^K$ to represent the event types. We sample these embeddings from the standard Gaussian distribution, $p(\mathcal{U}) = \prod_s N(u_s|0, I)$. Then we model the influence term as a function of the event type embeddings and time span,

$h_{s_i \to s}(\Delta) = h(u_{s_i}, u_s, \Delta).$  (5)

The advantage of doing so is that we only need to estimate one function $h$ to capture the influence among all the events (including those within the same type). The embeddings can further enable us to discover the hidden structures of the event types (e.g., clusters and outliers). Otherwise, if we estimate one function for each pair of event types, we have to learn $K(K+1)/2$ functions, which can be much more challenging and costly, especially when the event types are many (i.e., large K).

We want our model to flexibly learn the influence function $h(\cdot)$ and its strength decaying patterns with time, rather than restrict it to any pre-specified, parametric form, e.g., the exponential decay in (1). In other words, we only require that when $h > 0$, the time derivative $\partial h/\partial\Delta < 0$, and when $h < 0$, $\partial h/\partial\Delta > 0$, i.e., $h \cdot \partial h/\partial\Delta < 0$, to ensure $|h|$ decreases over time. While it is standard to assign a Gaussian process (GP) prior (Rasmussen and Williams, 2006) over $h$ to enable a flexible, nonparametric function estimation, incorporating the time derivative constraints is difficult. The derivative of a GP is still a GP (provided the kernel is differentiable). Since the support of the Gaussian distribution (GP finite projection) is the entire real space, imposing a monotonic constraint is infeasible (there are always nonzero probabilities for positive/negative derivatives). Existing work (Riihimäki and Vehtari, 2010) introduces a set of virtual points and encourages the derivative at these points to be as positive (or negative) as possible. This essentially is a soft regularization, not an actual constraint. To address this issue, we observe that the time-decaying influence functions have a general structure.

Lemma 3.1. $h \cdot \partial h/\partial\Delta < 0$ if and only if $h$ has the form

$h(u_{s_i}, u_s, \Delta) = g(u_{s_i}, u_s)\big(1 - f(u_{s_i}, u_s, \Delta)\big),$  (6)

where $g(u_{s_i}, u_s) = h(u_{s_i}, u_s, 0)$, $0 \le f(\cdot) < 1$, $\partial f/\partial\Delta > 0$, and $f(u_{s_i}, u_s, 0) = 0$.

We leave the proof in the Appendix. Next, to construct $f$, we represent $f(\cdot) = \tanh\big(z(u_{s_i}, u_s, \Delta)\big)$ where $z(\cdot)$ is another function. To satisfy the requirements for $f$ in (6), we need to ensure $z \ge 0$, $z(u_{s_i}, u_s, 0) = 0$, and $\partial z/\partial\Delta > 0$. According to these constraints, we can represent $z$ as an integral function, $z(u_{s_i}, u_s, \Delta) = \int_0^{\Delta} \exp\big(\eta(u_{s_i}, u_s, r)\big)\,\mathrm{d}r$, where $\eta(\cdot)$ is a free function. Combining with (6), we now derive a general construction of our influence function.

Corollary 3.1.1. $h \cdot \partial h/\partial\Delta < 0$ if and only if

$h(u_{s_i}, u_s, \Delta) = g(u_{s_i}, u_s)\Big(1 - \tanh\big(\int_0^{\Delta} \exp(\eta(u_{s_i}, u_s, r))\,\mathrm{d}r\big)\Big),$  (7)

where both $g(\cdot)$ and $\eta(\cdot)$^1 are free latent functions.

Note that our construction (7) also covers the special case that $h(\cdot) = 0$ (when $g = 0$), i.e., no influence from the previous event. To flexibly estimate $h(\cdot)$, we can place GP priors over $g$ and $\eta$. However, we no longer need to incorporate any derivative constraint, and hence the learning is much easier. Note that the prior over $h(\cdot)$ is not a GP anymore. One might be concerned that the integral in (7) is analytically intractable and will bring trouble in model estimation. However, since the integration is one-dimensional, we can use Gauss-Legendre quadrature to calculate the integral quite accurately and evaluate $h(\cdot)$. We will discuss the details in Sec. 4.2.
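As a sanity check of the construction in (7), the following short NumPy sketch evaluates $h$ with Gauss-Legendre quadrature; the stand-in choices for $g$ and $\eta$ are assumptions for illustration only.

```python
import numpy as np

def influence_h(g_val, eta_fn, delta, n_quad=10):
    """Time-decaying influence per construction (7):
    h = g * (1 - tanh( int_0^delta exp(eta(r)) dr )),
    with the integral computed by Gauss-Legendre quadrature.

    g_val:  scalar g(u_si, u_s), i.e., h at delta = 0
    eta_fn: callable r -> eta(u_si, u_s, r) (the free latent function)
    """
    # nodes/weights on [-1, 1], mapped to [0, delta]
    xi, omega = np.polynomial.legendre.leggauss(n_quad)
    r = 0.5 * delta * (xi + 1.0)
    integral = 0.5 * delta * np.sum(omega * np.exp(eta_fn(r)))
    return g_val * (1.0 - np.tanh(integral))

# example: a simple eta yields a smoothly decaying |h|, with h(0) = g
for d in [0.0, 1.0, 3.0]:
    print(d, influence_h(g_val=0.4, eta_fn=lambda r: -0.5 * r, delta=d))
```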
4 Algorithm

The model estimation is challenging. The GP prior over the latent function $g(\cdot)$ in (7) requires us to compute a multivariate Gaussian distribution of $\{g(u_s, u_k)\}_{1 \le s,k \le K}$ (the finite projection of $g$), and over $\eta(\cdot)$ a multivariate Gaussian distribution of $\{\eta(u_s, u_k, \Delta)\}_{1 \le s,k \le K,\, \Delta \in A}$, where $A$ is the set of time differences between every pair of events. Hence, even with a moderate number of events and event types (e.g., K = 100), the corresponding covariance (kernel) matrices will be huge and infeasible to calculate. To address this problem, we use the weight-space view of GPs (Rasmussen and Williams, 2006). That is, a GP model is equivalent to Bayesian linear regression after a (nonlinear) feature mapping. We use neural networks to construct a finite yet highly expressive mapping, and explicitly estimate the posterior of the (augmented) weight vector. In so doing, we can ease the inference and avoid computing the full covariance matrices. We then use Gauss-Legendre quadrature and the reparameterization trick to develop an efficient mini-batch stochastic variational learning algorithm.

4.1 GP Weight-Space Augmentation with Neural Network Feature Mapping

Specifically, we introduce $R$-dimensional weight vectors $w_g$ and $w_\eta$ for the latent functions $g(\cdot)$ and $\eta(\cdot)$ in (7). We then sample $g$ and $\eta$ in the following way:

$w_g \sim N(w_g|0, I), \quad w_\eta \sim N(w_\eta|0, I),$
$g(u_s, u_k) = w_g^\top \phi_g(u_s, u_k), \quad \eta(u_s, u_k, \Delta) = w_\eta^\top \phi_\eta(u_s, u_k, \Delta),$  (8)

where $\phi_g$ and $\phi_\eta$ are nonlinear feature mappings fulfilled by neural networks (NNs). It is known that if we marginalize out the weight vectors, we recover the GP model, where the function values at an arbitrary finite set of inputs follow a multivariate Gaussian distribution. The covariance (or kernel) function is an inner product of the mapped feature vectors. For example, the kernel for $g(\cdot)$ will be $\kappa_g([u_s, u_k], [u_{s'}, u_{k'}]) = \phi_g(u_s, u_k)^\top \phi_g(u_{s'}, u_{k'})$; the NN parameters in $\phi_g$ can be viewed as kernel parameters. Although the kernel does not correspond to an infinite feature mapping as in traditional kernels (e.g., RBF), it still can be highly expressive due to the capacity of neural networks. We will keep the weight vectors $w_g$ and $w_\eta$ in our model inference. In so doing, we never need to operate on the huge full covariance matrices. The dimension $R$ of the weight vectors is often set to be small, e.g., 16 or 32. Hence the computation is much more efficient, and the inference is easier and more convenient.

^1 Rigorously speaking, $\eta$ should be exponentially integrable, i.e., $\exp(\eta(\cdot))$ is integrable on $[0, \Delta]$. This is a pretty mild condition, which can be satisfied by, e.g., continuity of $\eta(\cdot)$.

Suppose we have observed a collection of $N$ independent event sequences, $\mathcal{D} = \{\Gamma_n\}_{1 \le n \le N}$, where each $\Gamma_n = [(t_1^n, s_1^n), \ldots, (t_{M_n}^n, s_{M_n}^n)]$ and $M_n$ is the number of events in sequence $n$. According to (2) and (8), the joint probability of our model is given by

$p(\mathcal{U}, w_g, w_\eta, \mathcal{D}) = \prod_s N(u_s|0, I)\, N(w_g|0, I)\, N(w_\eta|0, I) \prod_{n=1}^{N} \Big[\prod_{k=1}^{K} \exp\big(-\int_0^{T_n} \lambda_k(t)\,\mathrm{d}t\big) \prod_{i=1}^{M_n} \lambda_{s_i^n}(t_i^n)\Big],$  (9)

where $T_n$ is the total time span of the events in sequence $n$.
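The weight-space sampling in (8) can be sketched in a few lines of NumPy; the network width, initialization, and the single hidden layer below are illustrative assumptions, not the exact architecture used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R, H = 4, 16, 16          # embedding dim, weight dim, hidden width

def make_mlp(in_dim, out_dim):
    # one hidden layer with leaky ReLU, standing in for phi_g / phi_eta
    W1, b1 = rng.normal(size=(in_dim, H)) * 0.3, np.zeros(H)
    W2, b2 = rng.normal(size=(H, out_dim)) * 0.3, np.zeros(out_dim)
    def phi(x):
        h = x @ W1 + b1
        h = np.where(h > 0, h, 0.01 * h)   # leaky ReLU
        return h @ W2 + b2
    return phi

phi_g   = make_mlp(2 * d, R)        # input: [u_s, u_k]
phi_eta = make_mlp(2 * d + 1, R)    # input: [u_s, u_k, delta]

# weight-space view of the GP, eq. (8): w ~ N(0, I), f(.) = w^T phi(.)
w_g, w_eta = rng.normal(size=R), rng.normal(size=R)
u_s, u_k, delta = rng.normal(size=d), rng.normal(size=d), 1.5

g_val   = w_g   @ phi_g(np.concatenate([u_s, u_k]))
eta_val = w_eta @ phi_eta(np.concatenate([u_s, u_k, [delta]]))
print(g_val, eta_val)
```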
4.2 Stochastic Variational Learning

Given the joint probability (9), we aim to estimate the posterior distribution of the weight vectors $w_g$ and $w_\eta$, the event type embeddings $\mathcal{U}$, the parameters in the feature mappings $\phi_g$ and $\phi_\eta$, and the other parameters. Due to the intractable model evidence (i.e., normalizer), exact inference is infeasible. Therefore, we use variational inference (Wainwright and Jordan, 2008). Specifically, we introduce a Gaussian variational posterior for the weight vectors,

$q(w_g, w_\eta) = q(w_g)\, q(w_\eta) = N(w_g|\mu_g, \Sigma_g)\, N(w_\eta|\mu_\eta, \Sigma_\eta).$  (10)

We use the Cholesky decomposition to parameterize the posterior covariance matrices to ensure their positive definiteness, $\Sigma_g = L_g L_g^\top$ and $\Sigma_\eta = L_\eta L_\eta^\top$, where $L_g$ and $L_\eta$ are lower triangular matrices. We then derive a variational model evidence lower bound (ELBO),

$\mathcal{L} = -\mathrm{KL}\big(q(w_g)\,\|\,p(w_g)\big) - \mathrm{KL}\big(q(w_\eta)\,\|\,p(w_\eta)\big) + \log p(\mathcal{U}) + \sum_{n=1}^{N}\Big(-\sum_{k=1}^{K} \mathbb{E}_q\big[\textstyle\int_0^{T_n} \lambda_k(t)\,\mathrm{d}t\big] + \sum_{i=1}^{M_n} \mathbb{E}_q\big[\log \lambda_{s_i^n}(t_i^n)\big]\Big),$

where $p(w_g)$ and $p(w_\eta)$ are the standard Gaussian priors in (8), and $p(\mathcal{U}) = \prod_s N(u_s|0, I)$. To handle large $N$ and $K$ (i.e., the number of sequences and event types) and the intractable integration $\int_0^{T_n} \lambda_k(t)\,\mathrm{d}t$, we resort to stochastic optimization. Specifically, we partition all the event sequences into mini-batches of size $B$: $\{\mathcal{B}_1, \ldots, \mathcal{B}_{N/B}\}$, and the event types into mini-batches of size $C$: $\{\mathcal{C}_1, \ldots, \mathcal{C}_{K/C}\}$. We observe that the integration of the rate function is piece-wise, $\int_0^{T_n} \lambda_k(t)\,\mathrm{d}t = \sum_{j=0}^{M_n} \int_{t_j^n}^{t_{j+1}^n} \lambda_k(t)\,\mathrm{d}t$, where $t_0^n = 0$ and $t_{M_n+1}^n = T_n$. The rate $\lambda_k(t)$ in each interval $[t_j^n, t_{j+1}^n]$ is smooth (see (4)), and hence we use Gauss-Legendre quadrature to compute each $\int_{t_j^n}^{t_{j+1}^n} \lambda_k(t)\,\mathrm{d}t$. We now discuss how to compute a sample of $\lambda_k(t)$ for stochastic optimization, which uses Gauss-Legendre quadrature again. Specifically, we rearrange the ELBO so that it can be viewed as an expectation over mini-batches, $\mathcal{L} = \mathbb{E}_{p(m), p(l)}[\hat{\mathcal{L}}_{m,l}]$, where

$\hat{\mathcal{L}}_{m,l} = -\mathrm{KL}\big(q(w_g)\,\|\,p(w_g)\big) - \mathrm{KL}\big(q(w_\eta)\,\|\,p(w_\eta)\big) + \log p(\mathcal{U}) - \frac{N}{B}\frac{K}{C} \sum_{n \in \mathcal{B}_m} \sum_{k \in \mathcal{C}_l} \mathbb{E}_q\big[\textstyle\int_0^{T_n} \lambda_k(t)\,\mathrm{d}t\big] + \frac{N}{B} \sum_{n \in \mathcal{B}_m} \sum_{i=1}^{M_n} \mathbb{E}_q\big[\log \lambda_{s_i^n}(t_i^n)\big],$

and $p(m) = \frac{B}{N}$, $m \in \{1, \ldots, \frac{N}{B}\}$, $p(l) = \frac{C}{K}$, $l \in \{1, \ldots, \frac{K}{C}\}$. To conduct efficient stochastic optimization, at each step we first draw a mini-batch $\mathcal{B}_m$ and a mini-batch $\mathcal{C}_l$ to calculate $\hat{\mathcal{L}}_{m,l}$. We then use the reparameterization trick to draw from $q(w_g, w_\eta)$: $\hat{w}_g = \mu_g + L_g \epsilon_g$ and $\hat{w}_\eta = \mu_\eta + L_\eta \epsilon_\eta$, where $\epsilon_g \sim N(\cdot|0, I)$ and $\epsilon_\eta \sim N(\cdot|0, I)$. We substitute these samples into the corresponding terms in the expectations and obtain an unbiased stochastic estimate of $\hat{\mathcal{L}}_{m,l}$. We then calculate the gradient of this estimate, which is an unbiased estimate of the gradient of $\mathcal{L}$. Accordingly, we can perform stochastic optimization to maximize $\mathcal{L}$ so as to estimate $q$ and the other parameters.

However, a critical issue is that after we use the parameterized samples of $w_g$ and $w_\eta$ to obtain the latent functions $g(\cdot)$ and $\eta(\cdot)$, we still cannot compute the influence function $h(\cdot)$ in the rate function $\lambda_k(\cdot)$, due to the intractable integral in (7). To address this issue, we observe that the integration is only one-dimensional over time. Hence, we can use Gauss-Legendre quadrature to calculate the integral with an analytical form,

$\int_0^{\Delta} \exp\big(\eta(u_{s_i}, u_s, r)\big)\,\mathrm{d}r \approx \frac{\Delta}{2} \sum_j \omega_j \exp\Big(\eta\big(u_{s_i}, u_s, \tfrac{\Delta}{2}(\xi_j + 1)\big)\Big),$

where $\{\omega_j\}$ and $\{\xi_j\}$ are the quadrature weights and nodes, respectively. Note that this is based on the integral transform of the standard Gauss-Legendre quadrature, which requires that the integration interval is $[-1, 1]$. We provide more details in the Appendix.
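To illustrate the reparameterized sampling of $q(w_g)$, $q(w_\eta)$ and the Gaussian KL terms in the ELBO, here is a minimal NumPy sketch; the full objective additionally needs the quadrature-based rate integrals above, which are omitted here, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
R = 16

# variational parameters of q(w) = N(mu, L L^T), eq. (10)
mu = np.zeros(R)
L = np.eye(R) * 0.1            # lower-triangular Cholesky factor

def sample_w(mu, L):
    # reparameterization trick: w = mu + L eps, eps ~ N(0, I)
    eps = rng.normal(size=mu.shape)
    return mu + L @ eps

def kl_to_standard_normal(mu, L):
    # KL( N(mu, L L^T) || N(0, I) ), one of the KL terms in the ELBO
    sigma = L @ L.T
    return 0.5 * (np.trace(sigma) + mu @ mu - len(mu)
                  - 2.0 * np.sum(np.log(np.abs(np.diag(L)))))

w_hat = sample_w(mu, L)
print(w_hat[:4], kl_to_standard_normal(mu, L))
```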
4.3 Algorithm Complexity

The time complexity of our inference algorithm is $O(BCM + R^3)$, where $B$ and $C$ are the mini-batch sizes of the partitions of the event sequences and types, $M$ is the maximum length of the sequences, and $R$ is the dimension of the weight vectors $w_g$ and $w_\eta$. Therefore, the computational cost is proportional to the mini-batch sizes, rather than determined by the total number of sequences $N$ and event types $K$. The space complexity is $O(Kd + 2R^2)$, which is to store the covariances of $q(w_g)$ and $q(w_\eta)$ and the event type embeddings $\mathcal{U}$ ($d$ is the dimension of the embeddings).

5 Related Work

Many works use Poisson processes to analyze event data for their elegance and convenience, e.g., (Lloyd et al., 2015; Schein et al., 2015, 2016, 2019). However, Poisson processes assume independent increments in their counting processes and hence overlook the interactions between the events. Hawkes processes (HPs) (Hawkes, 1971) have therefore become popular, due to their capability of capturing the mutual excitations between the events. Many works propose HP-based models to estimate the temporal relationships from data, e.g., (Blundell et al., 2012; Tan et al., 2016; Linderman and Adams, 2014; Du et al., 2015; He et al., 2015; Wang et al., 2017; Yang et al., 2017a; Xu and Zha, 2017; Xu et al., 2018). Several methods have also been proposed to improve the inference of HPs, such as nonparametric triggering kernel (and base rate) estimation (Zhou et al., 2013; Zhang et al., 2020b; Zhou et al., 2020), Granger causality (Xu et al., 2016), short doubly-censored event sequences (Xu et al., 2017) and online estimation (Yang et al., 2017b). Another recent line of research (Zhe and Du, 2018; Pan et al., 2020; Wang et al., 2020) uses or extends HPs to learn the representations of event participants.

To overcome the limitation of HPs in only estimating the triggering effects, recent research efforts have been made to capture more complex temporal dependencies, e.g., inhibition. In (Mei and Eisner, 2017), a simple extension of HPs was proposed to allow a negative coefficient in the exponential triggering kernel (i.e., $\alpha_{s' \to s} < 0$ in (1)), and the intensity was transformed through a soft-plus function to ensure positiveness. Zhou et al. (2021) uses a similar extension, but models the kernel by a mixture of shifted Beta densities, and then uses a sigmoid transform scaled by an upper bound to obtain a positive rate. The second method in (Mei and Eisner, 2017) develops a continuous-time Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) model to capture the event dependencies through the nonlinear, black-box updates of the hidden states. The rate is a transform of the hidden state. Du et al. (2016) developed a discrete RNN to model each event sequence. The rate at any time point $t$ is obtained by taking the exponential of a linear combination of the hidden state of the closest event and the time span from that event. In addition, Omi et al. (2019) used feed-forward NNs to model the cumulative intensity function given the RNN states, so as to avoid computing the integration in the point process likelihood (see (2)), which is usually intractable. Shchur et al. (2019) modeled the time span between the events with a mixture of log-normal distributions parameterized by RNN states and other parameters. Despite the expressive power of RNN-based models, they encode all the temporal dependencies into the hidden state vectors and black-box updating mechanisms. We can hardly extract important knowledge, including the influence type and strength between a pair of events, how they vary with time, etc. The very recent work (Zhang et al., 2020a) uses the self-attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2014) to capture the correlation between the events. A similar idea appears in the concurrent work (Zuo et al., 2020).
Zhang et al. (2020a) modeled the rate as the soft-plus transformation of a base rate plus an exponential time-decaying kernel, where the base rate, the amplitude and the decaying rate are computed via the attention mechanism, which essentially calculates a weighted similarity with the representations of the past events. Although the model is more interpretable and highly expressive, like most HP models it adopts an exponential time decay for the influence strength, and hence can miss other decaying patterns in data. This (partly) motivates us to develop an interpretable yet flexible point process model to capture various time decays from data. There are other temporal models based on ODEs, e.g., (Rubanova et al., 2019).

6 Experiment

6.1 Synthetic Data

To confirm the effectiveness of SPRITE, we first conducted an ablation study with synthetic data. Specifically, we considered two cases: single-type events and bi-type events. For single-type events, we assumed that they only have excitation effects on each other. We examined three time-decaying excitations,

$h_1(\Delta) = \max(0.4 - 0.1\Delta^2, 0)$ (Quadratic), $\quad h_2(\Delta) = \max(0.4 - 0.2\Delta, 0)$ (Linear), $\quad h_3(\Delta) = 0.3\exp(-0.8\Delta)$ (Exponential).  (12)

We applied the scaled soft-plus transformation to ensure a positive rate. We set the scale $\beta = 0.8$. We used the thinning algorithm (Lewis and Shedler, 1979) to sample event sequences. For each case, we generated 10K sequences for training and 1K for validation. The length of each sequence is 32. For bi-type events, we assume type 0 events trigger type 1 events while type 1 inhibits type 0, and there are no effects between events of the same type:

$h_{0 \to 1}(\Delta) = \max(1.0 - 0.05\Delta^2, 0), \quad h_{1 \to 0}(\Delta) = -0.5\exp(-0.5\Delta).$  (13)

We set $\beta = 0.4$ and used the scaled soft-plus to obtain the positive rate function. Again, we sampled 10K sequences for training and 1K for validation. Each sequence includes 64 events.

We compared with the following state-of-the-art methods: (1) Hawkes processes (HP) with the standard exponentially decaying triggering kernel, (2) Recurrent Marked Temporal Point Processes (RMTPP) (Du et al., 2016), which use a discrete RNN to model the event sequences, and use hidden RNN states and other parameters to estimate an overall rate function (for all types of events), (3) Neural Hawkes Processes (Neural HP) (Mei and Eisner, 2017), which use a continuous LSTM to model the event sequences, and estimate the rate function for each event type, and (4) Self-Attentive Hawkes Process (SAHP) (Zhang et al., 2020a), which uses the self-attention mechanism to calculate the aggregated influence from the previous events on the current one. We implemented SPRITE and HP with TensorFlow (Abadi et al., 2016). We used the original implementations of Neural HP (https://github.com/HMEIatJHU/neurawkes) and SAHP (https://github.com/QiangAIResearcher/sahp_repo), and a high-quality open-source implementation of RMTPP (https://github.com/woshiyyya/ERPP-RMTPP). For SPRITE and HP, we set the mini-batch size of the event sequences to 16 and the learning rate to $10^{-3}$. We set the dimension of the embeddings to 4 for SPRITE. The nonlinear feature mappings $\phi_g$ and $\phi_\eta$ in (8) were both chosen as a single-layer feed-forward NN, with 16 neurons and leaky ReLU as the activation function. We used the default settings of all the other methods. We ran each method for 50 epochs (enough for convergence).
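The synthetic sequences above were drawn with the thinning algorithm; below is a minimal sketch of that sampler, assuming the caller supplies a valid upper bound `lam_max` on the total intensity over $[0, T]$, and using a toy constant rate only for illustration.

```python
import numpy as np

def sample_by_thinning(rate_fn, T, lam_max, rng=np.random.default_rng(0)):
    """Draw one event sequence on [0, T] by thinning (Lewis and Shedler, 1979).

    rate_fn(t, events) -> (K,) intensities given the history
    lam_max: an upper bound on the total intensity over [0, T]
    """
    events, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_max)       # candidate from bounding Poisson process
        if t > T:
            return events
        lam = rate_fn(t, events)
        if rng.uniform() < lam.sum() / lam_max:   # accept with prob lambda_total(t) / lam_max
            s = rng.choice(len(lam), p=lam / lam.sum())
            events.append((t, s))

# toy usage with a constant two-type rate
events = sample_by_thinning(lambda t, ev: np.array([0.5, 0.3]), T=10.0, lam_max=1.0)
print(len(events), events[:3])
```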
To avoid an unfair comparison caused by overfitting (especially for the RNN-based methods), we evaluated the likelihood on a validation dataset after each epoch and stopped training if there was no improvement (early stopping).

[Figure 1 (plots omitted): Rate function estimation for single-type events (a, c, e; panels include (a) Quadratic and (e) Exponential) and for bi-type events (b, d, f, h; panels include (d) Neural HP), and the event influence functions recovered by SPRITE for bi-type events (g). The estimated influences for single-type events are given in Fig. 1 of the Appendix. In all figures, the x-axis is time and the y-axis is the rate or influence value.]

Fig. 1 a, c, e (left) show the rate function estimations of all the methods for single-type events. As we can see, the estimations of SPRITE nearly always overlap with the ground-truth rate function, demonstrating that our method can adapt to all the cases and capture different time decays. By contrast, SAHP's and RMTPP's estimations significantly deviate from the ground truth. SAHP often severely over-estimates or under-estimates the rate (see Fig. 1c and e). RMTPP correctly captures the trends, but often mistakenly estimates the local details. Although this test favors HP, since only excitation effects were considered, HP still resulted in apparent deviations in the cases of quadratic and linear decays (see Fig. 1a and c), showing the limitation of the rigid exponential-decay assumption. Note that the estimation of HP overlaps with SPRITE and the ground truth for exponential decays (see Fig. 1e). While Neural HP also gives excellent rate estimations, in a few local regions it still fails to capture the details and/or gives a worse estimation than SPRITE, e.g., $t \in [16, 17]$ in Fig. 1a, $t \in [4.0, 4.5]$ and $[10.5, 11.5]$ in Fig. 1c, and over-estimations for $t$ around 4.0 and 19.0 in Fig. 1e.

Fig. 1 b, d, f, h (right) show the rate estimations for bi-type events. We did not compare with RMTPP, since it only produces an overall rate function. As shown in Fig. 1b, HP's estimation for event type 0 is completely wrong, because HP cannot capture the inhibition effects from type 1 events. In Fig. 1f, while SAHP indeed captures both the triggering and inhibition effects, its rate estimations severely deviate from the ground truth, and miss many local details (see the flat curves in many intervals). Finally, Neural HP and SPRITE can accurately estimate the rates for both event types. The estimation of SPRITE for event type 0 is slightly better, because it completely overlaps with the ground truth. Note that, however, SPRITE can further tell the influence type and strength between every pair of the events while Neural HP, based on a black-box LSTM, cannot. All these results demonstrate the advantage of our method in flexibly capturing various time decays from data.

We also examined the influence functions estimated by SPRITE. Fig. 1g shows the results for bi-type events. We can see that our learned decay is very accurate, nearly overlapping with the ground truth.
While the two influence functions ($h_{0 \to 1}$ and $h_{1 \to 0}$) are very different (one is convex and the other concave), they both can be accurately captured by SPRITE. The estimations for the single-type events are given in Fig. 1 of the Appendix, showing SPRITE accurately recovered all three types of influences. The results demonstrate the power of our nonparametric model. Note that due to their black-box nature, we cannot discover event influence functions from Neural HP, SAHP and RMTPP.

[Figure 2 (plots omitted): Prediction accuracy of the time (top row) and type (bottom row) of the future events, comparing SSTA, PP, HP, SAHP, Neural HP, RMTPP, and SPRITE; panels include (a) and (e) Retweet.]

[Figure 3 (plots omitted): The learned time-decaying influences between every two types of events ("small", "medium", "large") from the Retweet dataset: (a) initial strengths ($\Delta = 0$) and (b) influence functions. The clustering structure of the event type embeddings (c) and the initial strengths of the influence functions organized by event-type cluster (d) on the MIMIC dataset.]

6.2 Predictive Performance

Next, we examined the performance of SPRITE in predicting both the occurrence time and type of future events. To this end, we used the following real-world benchmark datasets. (1) Retweet (Zhao et al., 2015): 24,000 retweet event sequences from twitter.com. The events were generated by three types of users: "small" retweeters with fewer than 120 followers, "medium" retweeters with between 120 and 1,363 followers, and "large" retweeters with more than 1,363 followers. The time of the first event in each sequence is labeled as 0, and the times of the subsequent events are the time spans from the first event. (2) MIMIC (Du et al., 2016): clinical visit sequences of 650 anonymous patients over a seven-year period. Each visit is considered an event, and the event type is the diagnosis result. We have in total 75 event types. (3) SO (Du et al., 2016): 6,633 awarding event sequences from the question-and-answer site Stack Overflow. The users can be awarded based on both their questions and answers. There are 22 types of events, corresponding to 22 awards, including "Nice Question", "Good Answer", "Guru", "Great Answer", etc. We downloaded the preprocessed datasets from https://drive.google.com/drive/folders/0BwqmV0EcoUc8UklIR1BKV25YR1U. In addition, we tested the prediction accuracy on (4) SIM, the synthetic data of bi-type events in Sec. 6.1. In addition to the methods mentioned in Sec. 6.1, we compared with two extra baselines: (1) Simple Statistics (SSTA), which uses the average span between consecutive events to predict the future event time, and the most frequent event type to predict the future event type, and (2) a homogeneous Poisson process (PP), which assumes the event rate is a time-invariant constant.
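As a concrete reading of the SSTA baseline just described, here is a tiny sketch (an illustration only): the next event time is extrapolated from the mean inter-event gap, and the next type is the most frequent one in the history.

```python
import numpy as np

def ssta_predict(times, types):
    """Simple-statistics baseline: mean inter-event gap for the next time,
    most frequent type for the next type."""
    gaps = np.diff(times)
    next_time = times[-1] + gaps.mean()
    next_type = np.bincount(types).argmax()
    return next_time, next_type

print(ssta_predict(np.array([0.0, 1.2, 2.1, 3.9]), np.array([0, 1, 0, 0])))
```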
For Retweet, MIMIC and SO, we randomly split each dataset into 70% for training, 10% for validation, and 20% for testing. We used each method to predict when the last event in a test sequence occurred and what its type is. We calculated the root-mean-square error (RMSE) of the time predictions and the classification accuracy of the type predictions. We chose the dimension of the embeddings as the 4th root of the number of event types (the rule of thumb suggested by Google (TensorFlow Team, 2017)). We repeated the experiments 5 times with a 5-fold cross-validation splitting scheme, and report the mean and standard deviation of the RMSE and accuracy. For SIM, we randomly generated 10K sequences for training, 1K for validation, and 2K for testing in each experiment. The results of all the methods are shown in Fig. 2. As we can see, SPRITE outperforms all the competing approaches in all the cases, except that on SIM, its prediction accuracy of event types is slightly worse than Neural HP's. In many cases, SPRITE outperforms SAHP, RMTPP and/or Neural HP by a large margin, e.g., Fig. 2a, c, d, and f. The results show the advantage of SPRITE in predictive performance.

6.3 Pattern Discovery

Finally, we looked into the patterns discovered by our method. We first examined the influence functions learned from the Retweet dataset (see (5)). In Fig. 3, we show the initial strength ($\Delta = 0$) and the curve of the influence function for every pair of event types. We found interesting patterns. First, among the retweeting events there are only excitation effects, and there are no inhibition effects. Second, small users' retweeting can hardly cause or excite large users to retweet (the same message); the initial strength of small → large is 0.7 (Fig. 3a). By contrast, large users' retweeting can strongly drive small users to retweet (the initial strength of large → small is 18.8). This is reasonable, because large users have a great many followers, and most of these followers are small users. The result is also consistent with the fact that large users, especially twitter celebrities, often act as hub nodes and play a key role in information or opinion dissemination. It is interesting to see, though, that large users have much less influence on other large users. Third, we can see that across different pairs of event types, the influence varies considerably (Fig. 3b), implying the need for flexible models for their estimation. While approximating them with some exponential decays as in standard HPs seems feasible, it leads to much worse prediction accuracy compared with SPRITE (see Fig. 2a and e).

Next, we examined whether our method can discover hidden structures. To this end, we investigated the learned event type embeddings from the MIMIC dataset. We first performed Principal Component Analysis (PCA) to project the embeddings onto a plane, and then ran the k-means algorithm to find potential grouping structures. We used the elbow method (Ketchen and Shook, 1996) to determine the cluster number. As we can see from Fig. 3c, the (projected) embeddings reflect a clear structure within the event types. This is reasonable, because the event types are represented by the diagnosis results of the medical visits and there can be strong associations within those diagnosis results. Note that since the dataset is completely anonymized, we cannot verify the meaning of these clusters. Furthermore, we investigated the (initial) influence strengths between the event types within the same cluster and across different clusters. From Fig. 3d, we can see clear patterns.
For examples, the types of events in Cluster 4 mostly excite each other, but inhibit those of Cluster 0; the types of events in Cluster 0 nearly inhibit those across all the clusters, including themselves. We also found interesting structures of the embeddings and influence functions estimated from SO dataset. We leave the results in the Appendix. 7 Conclusion We have developed SPRITE, a self-adaptable point process that not only can explicitly estimate the influence type and strength between every pair of events, but also is flexible enough to capture a variety of influence time decays from data. Acknowledgments This work has been supported by NSF CAREER Award IIS-2046295 and NSF IIS-1619287. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pages 265 283. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. ar Xiv preprint ar Xiv:1409.0473. Blundell, C., Beck, J., and Heller, K. A. (2012). Modelling reciprocating relationships with hawkes processes. In Advances in Neural Information Processing Systems, pages 2600 2608. Daley, D. J. and Vere-Jones, D. (2007). An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media. Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., and Song, L. (2016). Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555 1564. Du, N., Farajtabar, M., Ahmed, A., Smola, A. J., and Song, L. (2015). Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 219 228. ACM. Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83 90. He, X., Rekatsinas, T., Foulds, J., Getoor, L., and Liu, Y. (2015). Hawkestopic: A joint model for network inference and topic modeling from text-based cascades. In International conference on machine learning, pages 871 880. Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735 1780. Ketchen, D. J. and Shook, C. L. (1996). The application of cluster analysis in strategic management research: an analysis and critique. Strategic management journal, 17(6):441 458. Lewis, P. W. and Shedler, G. S. (1979). Simulation of nonhomogeneous poisson processes by thinning. Naval research logistics quarterly, 26(3):403 413. Linderman, S. and Adams, R. (2014). Discovering latent network structure in point process data. In International Conference on Machine Learning, pages 1413 1421. Lloyd, C., Gunter, T., Osborne, M., and Roberts, S. (2015). Variational inference for gaussian process modulated poisson processes. In International Conference on Machine Learning, pages 1814 1822. PMLR. Mei, H. and Eisner, J. M. (2017). The neural hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6754 6764. Omi, T., Aihara, K., et al. (2019). Fully neural network based model for general temporal point processes. 
In Advances in Neural Information Processing Systems, pages 2122 2132. Pan, Z., Wang, Z., and Zhe, S. (2020). Scalable nonparametric factorization for high-order interaction events. In International Conference on Artificial Intelligence and Statistics, pages 4325 4335. PMLR. Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. Riihimäki, J. and Vehtari, A. (2010). Gaussian processes with monotonicity information. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 645 652. Rubanova, Y., Chen, R. T. Q., and Duvenaud, D. K. (2019). Latent ordinary differential equations for irregularly-sampled time series. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc. Schein, A., Linderman, S., Zhou, M., Blei, D., and Wallach, H. (2019). Poisson-randomized gamma dynamical systems. In Advances in Neural Information Processing Systems, pages 782 793. Schein, A., Paisley, J., Blei, D. M., and Wallach, H. (2015). Bayesian poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1045 1054. ACM. Schein, A., Zhou, M., Blei, D. M., and Wallach, H. (2016). Bayesian poisson tucker decomposition for learning the structure of international relations. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, pages 2810 2819. JMLR.org. Shchur, O., Bilovs, M., and Günnemann, S. (2019). Intensity-free learning of temporal point processes. In International Conference on Learning Representations. Tan, X., Naqvi, S. A., Qi, A. Y., Heller, K. A., and Rao, V. (2016). Content-based modeling of reciprocal relationships using hawkes and gaussian processes. In UAI. Tensor Flow Team (2017). Introducing tensorflow feature columns. https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30, pages 5998 6008. Curran Associates, Inc. Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Now Publishers Inc. Wang, Y., Ye, X., Zha, H., and Song, L. (2017). Predicting user activity level in point processes with mass transport equation. In Advances in Neural Information Processing Systems, pages 1644 1654. Wang, Z., Chu, X., and Zhe, S. (2020). Self-modulating nonparametric event-tensor factorization. In International Conference on Machine Learning, pages 9857 9867. PMLR. Xu, H., Farajtabar, M., and Zha, H. (2016). Learning granger causality for hawkes processes. In International Conference on Machine Learning, pages 1717 1726. Xu, H., Luo, D., Chen, X., and Carin, L. (2018). Benefits from superposed hawkes processes. In International Conference on Artificial Intelligence and Statistics, pages 623 631. PMLR. Xu, H., Luo, D., and Zha, H. (2017). Learning hawkes processes from short doubly-censored event sequences. In International Conference on Machine Learning, pages 3831 3840. Xu, H. and Zha, H. 
(2017). A dirichlet mixture model of hawkes processes for event sequence clustering. In Advances in Neural Information Processing Systems, pages 1354 1363. Yang, J., Rao, V. A., and Neville, J. (2017a). Decoupling homophily and reciprocity with latent space network models. In UAI. Yang, Y., Etesami, J., He, N., and Kiyavash, N. (2017b). Online learning for multivariate hawkes processes. In Advances in Neural Information Processing Systems, pages 4937 4946. Zhang, Q., Lipani, A., Kirnap, O., and Yilmaz, E. (2020a). Self-attentive hawkes process. In International Conference on Machine Learning, pages 11183 11193. PMLR. Zhang, R., Walder, C., and Rizoiu, M.-A. (2020b). Variational inference for sparse gaussian process modulated hawkes process. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6803 6810. Zhao, Q., Erdogdu, M. A., He, H. Y., Rajaraman, A., and Leskovec, J. (2015). Seismic: A selfexciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1513 1522. ACM. Zhe, S. and Du, Y. (2018). Stochastic nonparametric event-tensor decomposition. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 6857 6867. Zhou, F., Li, Z., Fan, X., Wang, Y., Sowmya, A., and Chen, F. (2020). Efficient inference for nonparametric hawkes processes using auxiliary latent variables. Journal of Machine Learning Research, 21(241):1 31. Zhou, F., Zhang, Y., and Zhu, J. (2021). Efficient inference of flexible interaction in spiking-neuron networks. In Proceedings of the International Conference on Learning Representations (ICLR). Zhou, K., Zha, H., and Song, L. (2013). Learning triggering kernels for multi-dimensional hawkes processes. In International Conference on Machine Learning, pages 1301 1309. Zuo, S., Jiang, H., Li, Z., Zhao, T., and Zha, H. (2020). Transformer hawkes process. In International Conference on Machine Learning, pages 11692 11702. PMLR.