# Transformer Hawkes Process

Simiao Zuo¹, Haoming Jiang¹, Zichong Li², Tuo Zhao¹ ³, Hongyuan Zha⁴ ⁵

¹Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA; ²School of the Gifted Young, University of Science and Technology of China, Hefei, China; ³Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA; ⁴School of Data Science, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China; ⁵Currently on leave from Georgia Institute of Technology. Correspondence to: Simiao Zuo, Tuo Zhao, Hongyuan Zha.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

## Abstract

Modern data acquisition routinely produces massive amounts of event sequence data in various domains, such as social media, healthcare, and financial markets. These data often exhibit complicated short-term and long-term temporal dependencies. However, most existing recurrent neural network based point process models fail to capture such dependencies and yield unreliable prediction performance. To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies while enjoying computational efficiency. Numerical experiments on various datasets show that THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin. Moreover, THP is quite general and can incorporate additional structural knowledge. We provide a concrete example in which THP achieves improved prediction performance for learning multiple point processes when their relational information is incorporated.

## 1. Introduction

Event sequence data are naturally observed in our daily life. Through social media such as Twitter and Facebook, we share our experiences and respond to other users' information (Yang et al., 2011). On these websites, each user has a sequence of events such as tweets and interactions. Hundreds of millions of users generate large amounts of tweets, which are essentially sequences of events at different time stamps. Besides social media, event data also exist in domains like financial transactions (Bacry et al., 2015) and personalized healthcare (Wang et al., 2018). For example, in electronic medical records, the tests and diagnoses of each patient can be treated as a sequence of events. Unlike other sequential data such as time series, event sequences tend to be asynchronous (Ross et al., 1996), which means the time intervals between events are just as important as their order for describing the dynamics. Moreover, depending on the specific application, event data exhibit sophisticated dependencies on their history.

Point processes are a powerful tool for modeling sequences of discrete events in continuous time, and the technique has been widely applied. The Hawkes process (Hawkes, 1971; Isham & Westcott, 1979) and the Poisson point process are traditional examples of point processes. However, their simplified assumptions about the complicated dynamics of point processes limit the models' practicality. As an example, the Hawkes process assumes that all past events have positive influences on the occurrence of current events. However, a user on Twitter may initiate tweets on different topics, and these events should be considered unrelated rather than mutually exciting.
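For illustration, the following minimal sketch computes a classical Hawkes intensity with an exponential excitation kernel; the parameter values and event times are arbitrary placeholders, not quantities from this paper. It makes the limitation above concrete: every past event adds a non-negative term, so history can only excite, never dampen, the intensity.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Classical Hawkes intensity with an exponential kernel.

    Every past event t_j < t contributes a non-negative term
    alpha * exp(-beta * (t - t_j)), so past events can only increase
    the intensity -- the assumption criticized above.
    """
    past = np.asarray(history)
    past = past[past < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

# A burst of (possibly unrelated) events all push the intensity up.
events = [1.0, 1.1, 1.2, 4.0]
for t in (1.3, 2.5, 4.1):
    print(f"lambda({t}) = {hawkes_intensity(t, events):.3f}")
```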
To alleviate these over-simplifications, likelihood-free methods (Xiao et al., 2017a; Li et al., 2018) and non-parametric models such as kernel methods and splines (Vere-Jones et al., 1990) have been proposed, but the increasing complexity and quantity of collected data call for more powerful models. With the development of neural networks, in particular deep neural networks, efforts have focused on incorporating these flexible models into classical point processes. Because of the sequential nature of event streams, existing methods rely heavily on Recurrent Neural Networks (RNNs). Neural networks are known for their ability to capture complicated high-level features; in particular, RNNs have the representation power to model the dynamics of event sequence data. In previous works, either the vanilla RNN (Du et al., 2016) or its variants (Mei & Eisner, 2017; Xiao et al., 2017b) have been used, and significant progress in terms of likelihood and event prediction has been achieved.

However, there are two significant drawbacks of RNN-based models. First, recurrent neural networks, even those equipped with forget gates, such as Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (Chung et al., 2014), are unlikely to capture long-term dependencies. In financial transactions, short-term effects such as policy changes are important for modeling the buy-sell behaviors of stocks. On the other hand, because of the delays in asset returns, stock transactions and prices often exhibit long-term dependencies on their history. As another example, in the medical domain, we are at times interested in examining short-term dependencies on symptoms such as fever and cough for acute diseases like pneumonia. But for certain types of chronic diseases such as diabetes, long-term dependencies on disease diagnoses and medications are more critical. Desirable models should be able to capture these long-term dependencies. Yet with recurrent structures, interactions between two events located far apart in the temporal domain are always weak (Hochreiter et al., 2001), even though in reality they may be highly correlated. The reason is that the probability of keeping information in a state that is far away from the current state decreases exponentially with distance.

The second drawback is the trainability of recurrent neural networks. Training deep RNNs (including LSTMs) is notoriously difficult because of gradient explosion and gradient vanishing (Pascanu et al., 2013). In practice, single-layer and two-layer RNNs are mostly used, and they may not successfully model sophisticated dependencies among data (Bengio et al., 1994). Additionally, inputs are fed into recurrent models sequentially, which means future states must be processed after the current state, rendering it impossible to process all the events in parallel. This limits the ability of RNNs to scale to large problems.

Recently, convolutional neural network variants tailored for analyzing sequential data (Oord et al., 2016; Gehring et al., 2017; Yin et al., 2017) have been proposed to better capture long-term effects. However, these models enforce many unnecessary dependencies. This downside, together with the increased computational burden, makes these models insufficient. To address the above concerns, we propose the Transformer Hawkes Process (THP) model, which is able to capture both short-term and long-term dependencies while enjoying computational efficiency.
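As a toy illustration of the first drawback (this is not THP code, nor a formal analysis), the sketch below tracks how, in a simple linear recurrence with a contractive transition matrix, the sensitivity of the current hidden state to an event far in the past shrinks roughly exponentially with distance; the state dimension and spectral norm are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Toy linear recurrence h_t = W h_{t-1} + x_t. We rescale a random W so its
# spectral norm is 0.9 (< 1), mimicking contractive dynamics that keep the
# recurrence stable. The sensitivity of h_T to an input "distance" steps in
# the past is governed by the Jacobian W^distance.
A = rng.normal(size=(d, d))
W = 0.9 * A / np.linalg.norm(A, 2)

jacobian = np.eye(d)
for distance in range(1, 51):
    jacobian = W @ jacobian  # equals W^distance
    if distance in (1, 10, 25, 50):
        print(f"distance {distance:2d}: "
              f"||dh_T/dh_(T-distance)|| = {np.linalg.norm(jacobian, 2):.1e}")
```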
Even though the Transformer (Vaswani et al., 2017) is widely adopted in natural language processing, it has rarely been used in other applications. We remark that such an architecture is not readily applicable to event sequences defined in a continuous-time domain. To the best of our knowledge, our proposed THP is the first of this type in the point process literature.

The building blocks of THP are self-attention modules (Bahdanau et al., 2014). These modules directly model dependencies among events by assigning attention scores: a large score between two events implies a strong dependency, and a small score implies a weak one. In this way, the modules are able to adaptively select events at any temporal distance from the current event. Therefore, THP has the ability to capture both short-term and long-term dependencies. Figure 1 demonstrates the dependency computation of different models.

*Figure 1. Illustration of dependency computation between the last event (the red triangle) and its history (the blue circles). The RNN-based NHP models dependencies through recursion. THP directly and adaptively models the event's dependencies on its history. Convolution-based models enforce static dependency patterns.*

The non-recurrent structure of THP facilitates efficient training of multi-layer models. Transformer-based architectures can be as deep as dozens of layers (Devlin et al., 2018; Radford et al., 2019), where deeper layers capture higher-order dependencies. The ability to capture such dependencies creates models that are more powerful than RNNs, which are often shallow. Also, THP allows full parallelism when calculating dependencies across all events, i.e., the computations for any two event pairs are independent of each other. This yields a highly efficient model.

Our proposed model is quite general and can incorporate additional structural knowledge to learn more complicated event sequence data, such as multiple point processes over a graph. In social networks, each user has her own sequence of events, such as tweets and comments. Sequences among users can be related; for example, a tweet from a user may trigger retweets from her followers. We can use graphs to model these follower-followee relationships (Zhou et al., 2013; Farajtabar et al., 2017), where each vertex corresponds to a specific user and each edge represents the connection between the two associated users. We propose an extension to THP that integrates these relational graphs (Borgatti et al., 2009; Linderman & Adams, 2014) into the self-attention module via a similarity metric among users. Such a metric can be learned by our proposed graph regularization.

We evaluate THP on five datasets in terms of both validation likelihood and event prediction accuracy. Our THP model exhibits superior performance to RNN-based models in all these experiments. We further test our structured-THP on two additional datasets, where the model achieves improved prediction performance for learning multiple point processes when incorporating their relational information. Our code is available at https://github.com/SimiaoZuo/Transformer-Hawkes-Process.

## 2. Background

We briefly review the Hawkes process (Hawkes, 1971), the Neural Hawkes Process (Mei & Eisner, 2017), and the Transformer (Vaswani et al., 2017) in this section.

The Hawkes process is a doubly stochastic point process whose intensity function is defined as

$$\lambda(t) = \mu + \sum_{j:\, t_j < t} \psi(t - t_j),$$

where μ ≥ 0 is the background intensity and ψ(·) ≥ 0 is an excitation function, commonly chosen as the exponential kernel ψ(τ) = α e^{-βτ}, so that each past event t_j contributes a non-negative, decaying excitation term.
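To fix ideas before the Transformer review, the following minimal sketch shows how scaled dot-product self-attention assigns pairwise dependency scores over event embeddings and computes all of them in one matrix product, which is the source of the parallelism noted in the introduction. The random projections, embedding dimension, and causal-mask convention below are illustrative placeholders, not the exact formulation used in THP.

```python
import numpy as np

def attention_scores(event_emb):
    """Pairwise dependency scores via scaled dot-product self-attention.

    event_emb: (L, d) array with one embedding per event (how the embedding
    encodes event type and time is omitted here). A causal mask prevents
    each event from attending to future events.
    """
    L, d = event_emb.shape
    rng = np.random.default_rng(0)
    # Illustrative random projections; in a trained model these are learned.
    W_q = rng.normal(scale=d ** -0.5, size=(d, d))
    W_k = rng.normal(scale=d ** -0.5, size=(d, d))

    Q, K = event_emb @ W_q, event_emb @ W_k
    scores = Q @ K.T / np.sqrt(d)              # all pairs at once, fully parallel
    scores[np.triu_indices(L, k=1)] = -np.inf  # mask out future events
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# A large weight[i, j] (j <= i) marks event j as an important part of event
# i's history, regardless of how far apart the two events are in time.
emb = np.random.default_rng(1).normal(size=(5, 8))
print(np.round(attention_scores(emb), 2))
```

In THP itself, the scores come from learned projections of embeddings that encode both event types and event times; the sketch above only conveys how the pairwise score computation works and why it parallelizes across events.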