# Transformer Hawkes Process

Simiao Zuo¹, Haoming Jiang¹, Zichong Li², Tuo Zhao¹ ³, Hongyuan Zha⁴ ⁵

¹Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA; ²School of the Gifted Young, University of Science and Technology of China, Hefei, China; ³Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA; ⁴School of Data Science, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China; ⁵Currently on leave from Georgia Institute of Technology. Correspondence to: Simiao Zuo, Tuo Zhao, Hongyuan Zha.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

## Abstract

Modern data acquisition routinely produces massive amounts of event sequence data in various domains, such as social media, healthcare, and financial markets. These data often exhibit complicated short-term and long-term temporal dependencies. However, most existing recurrent neural network based point process models fail to capture such dependencies and yield unreliable prediction performance. To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies while enjoying computational efficiency. Numerical experiments on various datasets show that THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin. Moreover, THP is quite general and can incorporate additional structural knowledge. We provide a concrete example in which THP achieves improved prediction performance for learning multiple point processes when their relational information is incorporated.

## 1. Introduction

Event sequence data are naturally observed in our daily life. Through social media such as Twitter and Facebook, we share our experiences and respond to other users' information (Yang et al., 2011). On these websites, each user has a sequence of events such as tweets and interactions. Hundreds of millions of users generate large amounts of tweets, which are essentially sequences of events at different time stamps. Besides social media, event data also exist in domains like financial transactions (Bacry et al., 2015) and personalized healthcare (Wang et al., 2018). For example, in electronic medical records, the tests and diagnoses of each patient can be treated as a sequence of events. Unlike other sequential data such as time series, event sequences tend to be asynchronous (Ross et al., 1996), which means the time intervals between events are just as important as their order for describing the dynamics. Moreover, depending on the specific application, event data exhibit sophisticated dependencies on their history.

Point processes are a powerful tool for modeling sequences of discrete events in continuous time, and the technique has been widely applied. The Hawkes process (Hawkes, 1971; Isham & Westcott, 1979) and the Poisson point process are traditional examples of point processes. However, their simplified assumptions about the complicated dynamics of point processes limit the models' practicality. As an example, the Hawkes process assumes that all past events have positive influences on the occurrence of current events. However, a user on Twitter may initiate tweets on different topics, and these events should be considered unrelated rather than mutually exciting.
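For illustration, the following minimal sketch computes a classical Hawkes intensity with an exponential excitation kernel; the parameter values and event times are arbitrary placeholders, not quantities from this paper. It makes the limitation above concrete: every past event adds a non-negative term, so history can only excite, never dampen, the intensity.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Classical Hawkes intensity with an exponential kernel.

    Every past event t_j < t contributes a non-negative term
    alpha * exp(-beta * (t - t_j)), so past events can only increase
    the intensity -- the assumption criticized above.
    """
    past = np.asarray(history)
    past = past[past < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

# A burst of (possibly unrelated) events all push the intensity up.
events = [1.0, 1.1, 1.2, 4.0]
for t in (1.3, 2.5, 4.1):
    print(f"lambda({t}) = {hawkes_intensity(t, events):.3f}")
```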
To alleviate these over-simplifications, likelihood-free methods (Xiao et al., 2017a; Li et al., 2018) and non-parametric models such as kernel methods and splines (Vere-Jones et al., 1990) have been proposed, but the increasing complexity and quantity of collected data call for more powerful models. With the development of neural networks, in particular deep neural networks, efforts have focused on incorporating these flexible models into classical point processes. Because of the sequential nature of event streams, existing methods rely heavily on Recurrent Neural Networks (RNNs). Neural networks are known for their ability to capture complicated high-level features; in particular, RNNs have the representation power to model the dynamics of event sequence data. In previous works, either the vanilla RNN (Du et al., 2016) or its variants (Mei & Eisner, 2017; Xiao et al., 2017b) have been used, and significant progress in terms of likelihood and event prediction has been achieved.

However, there are two significant drawbacks of RNN-based models. First, recurrent neural networks, even those equipped with forget gates, such as Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (Chung et al., 2014), are unlikely to capture long-term dependencies. In financial transactions, short-term effects such as policy changes are important for modeling the buy-sell behaviors of stocks. On the other hand, because of the delays in asset returns, stock transactions and prices often exhibit long-term dependencies on their history. As another example, in the medical domain, we are at times interested in examining short-term dependencies on symptoms such as fever and cough for acute diseases like pneumonia. But for certain types of chronic diseases such as diabetes, long-term dependencies on disease diagnoses and medications are more critical. Desirable models should be able to capture these long-term dependencies. Yet with recurrent structures, interactions between two events located far apart in the temporal domain are always weak (Hochreiter et al., 2001), even though in reality they may be highly correlated. The reason is that the probability of keeping information in a state that is far away from the current state decreases exponentially with distance.

The second drawback is the trainability of recurrent neural networks. Training deep RNNs (including LSTMs) is notoriously difficult because of gradient explosion and gradient vanishing (Pascanu et al., 2013). In practice, single-layer and two-layer RNNs are mostly used, and they may not successfully model sophisticated dependencies among data (Bengio et al., 1994). Additionally, inputs are fed into recurrent models sequentially, which means future states must be processed after the current state, rendering it impossible to process all the events in parallel. This limits the ability of RNNs to scale to large problems.

Recently, convolutional neural network variants tailored for analyzing sequential data (Oord et al., 2016; Gehring et al., 2017; Yin et al., 2017) have been proposed to better capture long-term effects. However, these models enforce many unnecessary dependencies. This downside, together with the increased computational burden, makes these models insufficient. To address the above concerns, we propose the Transformer Hawkes Process (THP) model, which is able to capture both short-term and long-term dependencies while enjoying computational efficiency.
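As a toy illustration of the first drawback (this is not THP code, nor a formal analysis), the sketch below tracks how, in a simple linear recurrence with a contractive transition matrix, the sensitivity of the current hidden state to an event far in the past shrinks roughly exponentially with distance; the state dimension and spectral norm are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Toy linear recurrence h_t = W h_{t-1} + x_t. We rescale a random W so its
# spectral norm is 0.9 (< 1), mimicking contractive dynamics that keep the
# recurrence stable. The sensitivity of h_T to an input "distance" steps in
# the past is governed by the Jacobian W^distance.
A = rng.normal(size=(d, d))
W = 0.9 * A / np.linalg.norm(A, 2)

jacobian = np.eye(d)
for distance in range(1, 51):
    jacobian = W @ jacobian  # equals W^distance
    if distance in (1, 10, 25, 50):
        print(f"distance {distance:2d}: "
              f"||dh_T/dh_(T-distance)|| = {np.linalg.norm(jacobian, 2):.1e}")
```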
Even though the Transformer (Vaswani et al., 2017) is widely adopted in natural language processing, it has rarely been used in other applications. We remark that such an architecture is not readily applicable to event sequences defined in a continuous-time domain. To the best of our knowledge, our proposed THP is the first of this type in the point process literature.

The building blocks of THP are self-attention modules (Bahdanau et al., 2014). These modules directly model dependencies among events by assigning attention scores: a large score between two events implies a strong dependency, and a small score implies a weak one. In this way, the modules are able to adaptively select events at any temporal distance from the current event. Therefore, THP has the ability to capture both short-term and long-term dependencies. Figure 1 demonstrates the dependency computation of different models.

*Figure 1. Illustration of dependency computation between the last event (the red triangle) and its history (the blue circles). The RNN-based NHP models dependencies through recursion. THP directly and adaptively models the event's dependencies on its history. Convolution-based models enforce static dependency patterns.*

The non-recurrent structure of THP facilitates efficient training of multi-layer models. Transformer-based architectures can be as deep as dozens of layers (Devlin et al., 2018; Radford et al., 2019), where deeper layers capture higher-order dependencies. The ability to capture such dependencies creates models that are more powerful than RNNs, which are often shallow. Also, THP allows full parallelism when calculating dependencies across all events, i.e., the computations for any two event pairs are independent of each other. This yields a highly efficient model.

Our proposed model is quite general and can incorporate additional structural knowledge to learn more complicated event sequence data, such as multiple point processes over a graph. In social networks, each user has her own sequence of events, such as tweets and comments. Sequences among users can be related; for example, a tweet from a user may trigger retweets from her followers. We can use graphs to model these follower-followee relationships (Zhou et al., 2013; Farajtabar et al., 2017), where each vertex corresponds to a specific user and each edge represents the connection between the two associated users. We propose an extension to THP that integrates these relational graphs (Borgatti et al., 2009; Linderman & Adams, 2014) into the self-attention module via a similarity metric among users. Such a metric can be learned by our proposed graph regularization.

We evaluate THP on five datasets in terms of both validation likelihood and event prediction accuracy. Our THP model exhibits superior performance to RNN-based models in all these experiments. We further test our structured-THP on two additional datasets, where the model achieves improved prediction performance for learning multiple point processes when incorporating their relational information. Our code is available at https://github.com/SimiaoZuo/Transformer-Hawkes-Process.

## 2. Background

We briefly review the Hawkes process (Hawkes, 1971), the Neural Hawkes Process (Mei & Eisner, 2017), and the Transformer (Vaswani et al., 2017) in this section.

The Hawkes process is a doubly stochastic point process whose intensity function is defined as

$$\lambda(t) = \mu + \sum_{j:\, t_j < t} \psi(t - t_j),$$

where μ ≥ 0 is the background intensity and ψ(·) ≥ 0 is an excitation function, commonly chosen as the exponential kernel ψ(τ) = α e^{-βτ}, so that each past event t_j contributes a non-negative, decaying excitation term.
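To fix ideas before the Transformer review, the following minimal sketch shows how scaled dot-product self-attention assigns pairwise dependency scores over event embeddings and computes all of them in one matrix product, which is the source of the parallelism noted in the introduction. The random projections, embedding dimension, and causal-mask convention below are illustrative placeholders, not the exact formulation used in THP.

```python
import numpy as np

def attention_scores(event_emb):
    """Pairwise dependency scores via scaled dot-product self-attention.

    event_emb: (L, d) array with one embedding per event (how the embedding
    encodes event type and time is omitted here). A causal mask prevents
    each event from attending to future events.
    """
    L, d = event_emb.shape
    rng = np.random.default_rng(0)
    # Illustrative random projections; in a trained model these are learned.
    W_q = rng.normal(scale=d ** -0.5, size=(d, d))
    W_k = rng.normal(scale=d ** -0.5, size=(d, d))

    Q, K = event_emb @ W_q, event_emb @ W_k
    scores = Q @ K.T / np.sqrt(d)              # all pairs at once, fully parallel
    scores[np.triu_indices(L, k=1)] = -np.inf  # mask out future events
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# A large weight[i, j] (j <= i) marks event j as an important part of event
# i's history, regardless of how far apart the two events are in time.
emb = np.random.default_rng(1).normal(size=(5, 8))
print(np.round(attention_scores(emb), 2))
```

In THP itself, the scores come from learned projections of embeddings that encode both event types and event times; the sketch above only conveys how the pairwise score computation works and why it parallelizes across events.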