# UNIPoint: Universally Approximating Point Processes Intensities

Alexander Soen, Alexander Mathews, Daniel Grixti-Cheng, Lexing Xie
The Australian National University
alexander.soen@anu.edu.au, alex.mathews@anu.edu.au, a500846@anu.edu.au, lexing.xie@anu.edu.au

Point processes are a useful mathematical tool for describing events over time, and so there are many recent approaches for representing and learning them. One notable open question is how to precisely describe the flexibility of point process models and whether there exists a general model that can represent all point processes. Our work bridges this gap. Focusing on the widely used event intensity function representation of point processes, we provide a proof that a class of learnable functions can universally approximate any valid intensity function. The proof connects the well-known Stone-Weierstrass Theorem for function approximation, the uniform density of non-negative continuous functions using a transfer function, the formulation of the parameters of a piece-wise continuous function as a dynamic system, and a recurrent neural network implementation for capturing the dynamics. Using these insights, we design and implement UNIPoint, a novel neural point process model, using recurrent neural networks to parameterise sums of basis functions upon each event. Evaluations on synthetic and real world datasets show that this simpler representation performs better than Hawkes process variants and more complex neural network-based approaches. We expect this result will provide a practical basis for selecting and tuning models, as well as furthering theoretical work on representational complexity and learnability.

## 1 Introduction

Temporal point processes (Daley and Vere-Jones 2007) are a preferred tool for describing events happening at irregular intervals, such as earthquake modelling (Ogata 1988), social media (Zhao et al. 2015), and finance (Embrechts, Liniger, and Lin 2011). One common variant is the self-exciting Hawkes process with a parametric kernel (Laub, Taimre, and Pollett 2015), which describes prior events triggering future events. However, misspecification of the kernel will likely result in poor performance (Mishra, Rizoiu, and Xie 2016). One may ask: what are the most flexible classes of point process intensity functions? How can they be implemented computationally? Does a flexible representation lead to good performance? There is a body of literature surrounding these three questions.

Figure 1: Overview of our method of universally approximating point processes. An RNN is used to parameterise a set of basis functions for each interarrival time $\tau_i$. The sum of basis functions is then used to approximate a continuous function, which is composed with a transfer function $f_+$ to universally approximate all valid intensity functions.

Multi-layer neural networks are well known for being flexible function approximators. They are able to approximate any Borel-measurable function on a compact domain (Cybenko 1989; Hornik, Stinchcombe, and White 1989). A number of neural architectures have been proposed for point processes. The Recurrent Marked Temporal Point Process model (RMTPP) (Du et al. 2016) uses Recurrent Neural Networks (RNNs) to encode event history, and defines the conditional intensity function by a parametric form.
Common choices of such parametric forms include an exponential function (Du et al. 2016; Upadhyay, De, and Rodriguez 2018) or a constant function (Li et al. 2018; Huang, Wang, and Mak 2019). Variants of the RNN have been explored, including Neural Hawkes (Mei and Eisner 2017), which makes the RNN state a function of time, as well as Transformer Hawkes (Zuo et al. 2020) and Self-attention Hawkes (Zhang et al. 2019), which use attention mechanisms instead of recurrent units. However, a conceptual gap on the flexibility of the neural point process representation still remains. Piece-wise exponential functions (Du et al. 2016; Upadhyay, De, and Rodriguez 2018) only encode intensities that are monotonic between events. The functional RNN representation (Mei and Eisner 2017) is flexible but uses many more parameters. Transformers (Zuo et al. 2020; Zhang et al. 2019) are generic sequence-to-sequence function approximators (Yun et al. 2020), but the functional form of the Transformer Hawkes point process intensity function is not a universal approximator. Furthermore, intensity functions are non-negative and discontinuous at event times, which means neural network approximation results cannot be applied directly.

Recent results shed light on alternative point process representations. Omi, Ueda, and Aihara (2019) use a positive-weight monotone neural network to learn the compensator (the integral of the intensity function). Although it is a generic approximator for compensators, it might assign non-zero probability to invalid inter-arrival times, as the compensator can be non-zero at time zero. Shchur, Biloš, and Günnemann (2020) represent inter-arrival times using normalising flow and mixture models, which can universally approximate any density. However, by defining the point process with the event density, the model cannot account for event sequences which stop naturally (see Section 3). These approaches are promising alternatives but are not a full replacement for intensity functions, which are preferred since they are intuitive and can be superimposed.

In this work, we propose a class of neural networks that can approximate any point process intensity function to arbitrary accuracy, along with a proof showing the role of three key constituents: a set of uniformly dense basis functions, a positive transfer function, and an approximator for arbitrary dynamic systems. We implement this proposal using RNNs, the output of which is used to parameterise a set of basis functions upon arrival of each event, as shown in Figure 1. Named UNIPoint, the proposed model performs well across synthetic and real world datasets in comparison to the Hawkes process and other neural variants. This work provides a general yet parsimonious representation for temporal point processes, and so forms a solid basis for future development in point process representations that incorporate rich contextual information into event models. Our primary contributions are:

- A novel architecture that can approximate any point process intensity function to arbitrary accuracy.
- A theoretical guarantee for the flexible point process representation that builds upon the theory of universally approximating continuous functions and dynamic systems.
- UNIPoint, the neural network implementation of the proposed architecture, with strong empirical results on both synthetic and real world datasets. Reference code is available online at https://github.com/alexandersoen/unipoint.
$C(X, Y)$ denotes the class of continuous functions mapping from domain $X$ to range $Y$. Denote $\mathbb{R}$ as the set of real numbers, $\mathbb{R}_+$ as the non-negative reals and $\mathbb{R}_{++}$ as the strictly positive reals. Define the composition of a function $f$ and a class of functions $\mathcal{F}$ as $f \circ \mathcal{F} = \{f \circ g : g \in \mathcal{F}\}$. The sigmoid function $[1 + \exp(-x)]^{-1}$ is denoted as $\sigma(x)$.

## 2 Preliminary: Temporal Point Processes

A temporal point process is an ordered set of event times $\{t_i\}_{i=0}^{N}$. We typically describe a point process by its conditional intensity function $\lambda(t \mid \mathcal{H}_t)$, which can be interpreted as the instantaneous probability of an event occurring at time $t$ given the event history $\mathcal{H}_t$, consisting of the set of all events before time $t$. This can be written as (Daley and Vere-Jones 2007):

$$\lambda(t \mid \mathcal{H}_t) := \lim_{h \to 0^+} \frac{P(N[t, t+h) > 0 \mid \mathcal{H}_t)}{h}, \qquad (1)$$

where $N[t_1, t_2)$ is the number of events occurring between two arbitrary times $t_1 < t_2$. Equation (1) restricts the conditional intensity function to non-negative functions. Given history $\mathcal{H}_t$, the conditional intensity is a deterministic function of time $t$. Following standard convention, we refer to the conditional intensity function as simply the intensity function, abbreviating $\lambda(t \mid \mathcal{H}_t)$ to $\lambda^*(t)$.

Point processes can be specified by choosing a functional form for the intensity function. For example, the Hawkes process, one of the simplest interacting point processes (Bacry, Mastromatteo, and Muzy 2015), can be defined as follows:

$$\lambda^*(t) = \mu + \sum_{t_i < t} \kappa(t - t_i),$$

where $\mu$ is a background intensity and $\kappa$ is a triggering kernel. A class of functions $\mathcal{F}$ universally approximates a class $\mathcal{G}$ with respect to the uniform metric $d$ if for all $\varepsilon > 0$ and $g \in \mathcal{G}$, there exists an $f \in \mathcal{F}$ such that $d(f, g) < \varepsilon$. An equivalent expression is: $\mathcal{F}$ is uniformly dense in $\mathcal{G}$.

Remark. Although we refer to Hawkes point processes as the primary example of a point process to approximate throughout the paper, following the example of (Mei and Eisner 2017), we note that as long as the point process has a continuous intensity function between events, our approximation analysis will hold. Thus, in addition to Hawkes processes, the methods proposed in our work can approximate point processes including self-correcting processes and non-homogeneous Poisson processes with continuous densities.

### Approximation Between Two Events

To approximate the time-shifted non-negative functions $u_i(\tau)$, that is, the intensity on the interval $(t_{i-1}, t_i]$ written as a function of the time since the last event $\tau = t - t_{i-1}$, we first introduce transfer functions $f_+$ (Definition 1). We then prove that the class of composed functions $f_+ \circ \mathcal{F}$ preserves uniform density (Theorem 1). Given this theorem, we provide a method for constructing uniformly dense classes with sums of basis functions $\Sigma(\phi)$ (Definition 2), which are in turn uniformly dense after composing with $f_+$ (Corollary 1). We further provide a set of suitable basis functions (Table 1).

Formally, we define the $M$-transfer functions, which map negative outputs of a function to positive values.

Definition 1. A function $f_+ : \mathbb{R} \to \mathbb{R}_+$ is an $M$-transfer function if it satisfies the following:
1. $f_+$ is $M$-Lipschitz continuous;
2. $\mathbb{R}_{++} \subseteq f_+[\mathbb{R}]$;
3. and $f_+$ is strictly increasing on $f_+^{-1}[\mathbb{R}_{++}]$.

Definition 1 permits a wide range of functions. In practice, it is convenient to use the softplus function $f_{\mathrm{SP}}(x) = \log(1 + \exp(x))$, which is a 1-transfer function commonly used in other neural point processes (Mei and Eisner 2017; Omi, Ueda, and Aihara 2019; Zuo et al. 2020). Alternatively, $f_+(x) = \max(0, x)$ could be used; however, this is not differentiable at $x = 0$, which can cause issues in practice. Intuitively, $M$-transfer functions are increasing functions which map onto all positive values and have bounded steepness.
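As a concrete illustration of Definition 1, the short Python snippet below (not part of the paper's reference code) numerically checks the three $M$-transfer conditions for the softplus function with $M = 1$; the grid of test points is an arbitrary choice made for the check.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

# Check the three M-transfer conditions empirically for M = 1 on a grid.
x = np.linspace(-20, 20, 100001)
y = softplus(x)

slopes = np.diff(y) / np.diff(x)
print("max slope (should be <= 1):", slopes.max())            # 1-Lipschitz
print("min value (approaches 0, never negative):", y.min())   # range covers the strictly positive reals
print("strictly increasing:", np.all(np.diff(y) > 0))
```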
When a Gaussian process is used to define an inhomogeneous Poisson process, link functions serve a similar role in ensuring valid intensity functions (Lloyd et al. 2015). However, many of these link functions violate the conditions of being an $M$-transfer function (Donner and Opper 2018): the exponential link function $f_+(x) = \exp(x)$ and the squared link function $f_+(x) = x^2$ are not $M$-Lipschitz continuous as they have unbounded derivatives, whereas the sigmoid link function $f_+(x) = \sigma(x)$ is a bounded function (violating condition 2).

Using $M$-transfer functions, we can show that a uniformly dense class of unbounded functions will be uniformly dense for strictly positive functions under composition. These functions are defined on a compact subset $K \subset \mathbb{R}$, which can be set as $K = [0, T]$ for intensity functions.

Theorem 1. Given a class of functions $\mathcal{F}$ which is uniformly dense in $C(K, \mathbb{R})$ and an $M$-transfer function $f_+$, the composed class of functions $f_+ \circ \mathcal{F}$ is uniformly dense in $C(K, \mathbb{R}_{++})$ for any compact subset $K \subset \mathbb{R}$.

Proof. Let $f \in C(K, \mathbb{R}_{++})$ and $\varepsilon > 0$ be arbitrary. Since $f_+$ is strictly increasing and continuous on the preimage of $\mathbb{R}_{++}$, the inverse $f_+^{-1}$ exists and is continuous when restricted to the subdomain $\mathbb{R}_{++}$. Thus, there exists some $g \in C(K, \mathbb{R})$ such that $f = f_+ \circ g$. As $\mathcal{F}$ is dense with respect to the uniform metric, for $\varepsilon / M$ there exists some $h \in \mathcal{F}$ such that $d(h, g) < \varepsilon / M$. Thus for any $x \in K$, $|(f_+ \circ h)(x) - f(x)| = |(f_+ \circ h)(x) - (f_+ \circ g)(x)| \le M |h(x) - g(x)| < \varepsilon$. We have $d(f_+ \circ h, f) < \varepsilon$.

To approximate $u_i(\tau)$ using Theorem 1, we need a family of functions which are able to approximate functions in $C(K, \mathbb{R})$. We consider the family of functions consisting of sums of basis functions $\phi(\cdot\,; p_j)$, where $p_j \in P$ denotes the parameterisation of the basis function $\phi$.

Definition 2. Denote $\Sigma(\phi)$ as the class of functions corresponding to sums of basis functions $\phi : \mathbb{R} \times P \to \mathbb{R}$, with parameter space $P$, as follows:

$$\Sigma(\phi) = \Big\{ \hat{u} : \mathbb{R} \to \mathbb{R} \;\Big|\; \hat{u}(x) = \sum_{j=1}^{J} \phi(x; p_j),\ p_j \in P,\ J \in \mathbb{N} \Big\}.$$

The parameter space $P$ of a basis function is determined by the parametric form of the chosen basis function $\phi(x; p_j)$. For example, the class composed of exponential basis functions could be defined with parameter space $P = \mathbb{R}^2$ and functions $\{\phi : \mathbb{R} \to \mathbb{R} \mid \phi(x) = \alpha \exp(\beta x),\ \alpha, \beta \in \mathbb{R}\}$. Definition 2 encompasses a wide range of function classes, including neural networks with sigmoid (Cybenko 1989; Hornik, Stinchcombe, and White 1989; Debao 1993) or rectified linear unit activations (Sonoda and Murata 2017).

The Stone-Weierstrass Theorem provides sufficient conditions for finding basis functions for universal approximation.

Theorem 2 (Stone-Weierstrass Theorem (Rudin et al. 1964; Royden and Fitzpatrick 1988)). Suppose a subalgebra $A$ of $C(K, \mathbb{R})$, where $K \subset \mathbb{R}$ is a compact subset, satisfies the following conditions:
1. For all distinct $x, y \in K$, there exists some $f \in A$ such that $f(x) \neq f(y)$;
2. For all $x_0 \in K$, there exists some $f \in A$ such that $f(x_0) \neq 0$.
Then $A$ is uniformly dense in $C(K, \mathbb{R})$.

Thus, by using Theorem 1 and the Stone-Weierstrass Theorem (Theorem 2), we arrive at Corollary 1, which gives sufficient conditions for basis functions $\phi$ to ensure that $f_+ \circ \Sigma(\phi)$ is a universal approximator for $C(K, \mathbb{R}_{++})$.

Corollary 1. For any compact subset $K \subset \mathbb{R}$ and for any $M$-transfer function $f_+$, if a basis function $\phi(\cdot\,; p)$ parametrised by $p \in P$ satisfies the following conditions:
1. $\Sigma(\phi)$ is closed under product;
2. for any distinct points $x, y \in K$, there exists some $p \in P$ such that $\phi(x; p) \neq \phi(y; p)$;
3. for all $x_0 \in K$, there exists some $p \in P$ such that $\phi(x_0; p) \neq 0$;
then $f_+ \circ \Sigma(\phi)$ is uniformly dense in $C(K, \mathbb{R}_{++})$.
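To make Corollary 1 concrete, here is a minimal sketch assuming an exponential basis with a fixed grid of decay rates and a hand-picked strictly positive target on $[0, T]$; the least-squares fit in the pre-transfer space is only an illustration of the function class $f_+ \circ \Sigma(\phi_{\mathrm{EXP}})$, not the training procedure described later in the paper.

```python
import numpy as np

def softplus(x):
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def inv_softplus(y):
    # Inverse of softplus on the strictly positive reals: x = log(exp(y) - 1).
    return y + np.log(-np.expm1(-y))

# Target: a strictly positive, non-monotonic "intensity segment" on [0, T].
T = 4.0
x = np.linspace(0.0, T, 400)
target = 0.5 + np.exp(-x) * (1.2 + np.sin(3 * x))

# Sum of exponential basis functions phi(x; a, b) = a * exp(b * x):
# fix a grid of decay rates b_j, solve for weights a_j by least squares in the
# pre-transfer space, then map through the softplus transfer function.
betas = np.linspace(-3.0, 1.0, 8)              # J = 8 basis functions
design = np.exp(np.outer(x, betas))            # shape (len(x), J)
alphas, *_ = np.linalg.lstsq(design, inv_softplus(target), rcond=None)

approx = softplus(design @ alphas)             # f_+ composed with a member of Sigma(phi_EXP)
print("max abs error:", np.abs(approx - target).max())
```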
The first condition of Corollary 1 ensures that the set of basis function sums $\Sigma(\phi)$ is a subalgebra of $C(K, \mathbb{R})$. The latter two conditions are the preconditions required for the Stone-Weierstrass Theorem to hold. Given the conditions of Corollary 1, some interesting choices for valid basis functions $\phi(x; p)$ are the exponential basis function $\phi_{\mathrm{EXP}}(x) = \alpha \exp(\beta x)$ and the power law basis function $\phi_{\mathrm{PL}}(x) = \alpha (1 + x)^{-\beta}$. These basis functions are similar to the exponential and power law Hawkes triggering kernels, which have seen widespread use in many domains (Ogata 1988; Bacry, Mastromatteo, and Muzy 2015; Laub, Taimre, and Pollett 2015; Rizoiu et al. 2017).

We note that the classes of intensity functions in Theorem 1 and Corollary 1 consist of strictly positive continuous functions. However, these results generalise to non-negative continuous functions, as our definition of intensity functions permits arbitrarily low intensity in $u_i(\tau)$, and switching from arbitrarily low intensities to zero intensity results in arbitrarily low error with respect to the uniform metric on $(0, T]$.

In Table 1, we provide a selection of interesting basis functions to universally approximate $u_i(\tau) \in C(K, \mathbb{R}_{++})$. One should note that Corollary 1 only provides sufficient conditions, and some of the basis functions in Table 1 do not satisfy its preconditions. For example, the sigmoid basis function $\phi_{\mathrm{SIG}}(x) = \alpha \sigma(\beta x + \delta)$, $(\alpha, \beta, \delta) \in \mathbb{R}^3$, does not allow $\Sigma(\phi_{\mathrm{SIG}})$ to be closed under product and thus does not satisfy the conditions of Corollary 1. However, the sum of sigmoid basis functions is equivalent to the class of single hidden layer neural networks (Hornik, Stinchcombe, and White 1989; Debao 1993). Thus, together with an appropriate transfer function, it does have the universal approximation property for non-negative continuous functions through Theorem 1. Additionally, other basis functions used to define point process intensity functions can be used, such as radial basis functions (Tabibian et al. 2017), which are not generally closed under product but have universal approximation properties (Park and Sandberg 1991).

| Basis Function | Functional Form $\phi$ | Parameter Space $P$ |
|---|---|---|
| $\phi_{\mathrm{EXP}}$ † | $\alpha \exp(\beta x)$ | $(\alpha, \beta) \in \mathbb{R}^2$ |
| $\phi_{\mathrm{PL}}$ † | $\alpha (1 + x)^{-\beta}$ | $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}_+$ |
| $\phi_{\mathrm{COS}}$ † | $\alpha \cos(\beta x + \delta)$ | $(\alpha, \beta, \delta) \in \mathbb{R}^3$ |
| $\phi_{\mathrm{SIG}}$ ‡ | $\alpha \sigma(\beta x + \delta)$ | $(\alpha, \beta, \delta) \in \mathbb{R}^3$ |
| $\phi_{\mathrm{ReLU}}$ § | $\max(0, \alpha x + \beta)$ | $(\alpha, \beta) \in \mathbb{R}^2$ |

Table 1: Basis function universal approximators for intensity functions between two consecutive events. † indicates functions that satisfy Corollary 1; ‡ one proven in (Cybenko 1989); § one proven in (Sonoda and Murata 2017).

### Approximation for Event Sequences

The approximations to $u_i(\tau)$ use a set of parameters, e.g. $(\alpha, \beta, \delta)$ in Table 1. We denote these parameter vectors as $p_i \in P$, and the approximated function segment as $\hat{u}_i(\tau; p_i)$. Since each segment $\hat{u}_i(\tau; p_i)$ is uniquely determined by $p_i$, and the union of all segments approximates $\lambda^*(t)$, we only need to capture the dynamics in $p_i$. We express $p_i$ as the output of a dynamic system:

$$s_{i+1} = g(s_i, t_i), \qquad p_i = \nu(s_i), \qquad (5)$$

where $s_{i+1}$ is the internal state of the dynamic system, $g$ updates the internal state at each step, and $\nu$ maps from the internal state to the output.

Theorem 3 (RNN Universal Approximation (Schäfer and Zimmermann 2007)). Let $g : \mathbb{R}^J \times \mathbb{R}^I \to \mathbb{R}^J$ be measurable and $\nu : \mathbb{R}^J \to \mathbb{R}^n$ be continuous, with external inputs $x_i \in \mathbb{R}^I$, inner states $s_i \in \mathbb{R}^J$, and outputs $p_i \in \mathbb{R}^n$ (for $i = 1, \ldots, N$). Then any open dynamic system of the form of Eq. (5) can be approximated by an RNN, with sigmoid activation function, to arbitrary accuracy.
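The sketch below shows the dynamic system of Eq. (5) realised as an Elman-style RNN cell whose linear read-out produces per-event basis parameters. The dimensions, the exponential-basis parameterisation, and the randomly initialised weights standing in for learned ones are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: hidden state size M, number of basis functions J, parameters per basis P_dim.
M, J, P_dim = 16, 4, 2          # e.g. exponential basis phi(x; a, b) = a * exp(b * x)

# Randomly initialised weights standing in for learned ones (illustrative only).
W = rng.normal(scale=0.1, size=(M, M))
v = rng.normal(scale=0.1, size=(M, 1))
b = np.zeros((M, 1))
A = rng.normal(scale=0.1, size=(J * P_dim, M))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(h_prev, tau_prev):
    """One step of the dynamic system s_{i+1} = g(s_i, t_i), p_i = nu(s_i),
    realised as an Elman-style RNN cell followed by a linear read-out."""
    h = sigmoid(W @ h_prev + v * tau_prev + b)        # state update g
    p = (A @ h).reshape(J, P_dim)                     # read-out nu: basis parameters
    return h, p

# Roll the recurrence over a toy sequence of inter-arrival times.
taus = [0.5, 1.2, 0.3, 2.0]
h = np.zeros((M, 1))
for i, tau in enumerate(taus, start=1):
    h, p = step(h, tau)
    print(f"event {i}: basis parameters\n{p}")
```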
Given that RNNs approximate $p_i$, we use a continuity condition on the basis $\phi$, and in turn on $\hat{u}$, to show how to universally approximate an intensity function with an RNN.

Theorem 4. Let $\{t_i\}_{i=0}^{N}$ be a sequence of events with $t_i \in [0, T]$ and $\lambda^*(t)$ be an intensity function. Suppose a parametric family of functions $\mathcal{F} = \{\hat{u}(\cdot\,; p) : p \in P\}$ is uniformly dense in $C([0, T], \mathbb{R}_{++})$ and $\hat{u}(x; p)$ is continuous with respect to $p$ for all $x \in [0, T]$. Then there exists a recurrent neural network

$$h_i = \sigma(W h_{i-1} + v t_{i-1} + b), \quad \hat{p}_i = A h_i, \quad \hat{\lambda}(t) = \hat{u}(\tau; \hat{p}_i) \ \text{with}\ \tau = t - t_{i-1} \ \text{for}\ t \in (t_{i-1}, t_i], \qquad (6)$$

where $\sigma$ is a sigmoid activation function and $[W, v, b, A]$ are weights of appropriate shapes, such that $\hat{\lambda}(t)$ approximates $\lambda^*(t)$ with arbitrary precision for all $t \in (0, T]$.

Proof. Let $\varepsilon > 0$ be arbitrary. For any interval $(t_{i-1}, t_i]$, we know from the uniform density of $\mathcal{F}$ that there exists a $p_i$ such that

$$\sup_{\tau \in [0, T]} |\hat{u}_i(\tau; p_i) - u_i(\tau)| < \varepsilon / 2. \qquad (7)$$

By the continuity conditions on $\hat{u}$, it follows that for each $p_i$ and any $\tau \in [0, T]$ there exists $\delta_i$ such that

$$\|p_i - \hat{p}_i\| < \delta_i \implies |\hat{u}(\tau; p_i) - \hat{u}(\tau; \hat{p}_i)| < \varepsilon / 2, \qquad (8)$$

by taking the minimum over the $\delta_\tau$'s in the $(\varepsilon/2, \delta_\tau)$-condition of continuity for all $\tau \in [0, T]$ (where the subscript emphasises the range of $\tau$ for fixed $i$). The left-hand side of Eq. (8) is the precision needed in our RNN approximator for each interval $(t_{i-1}, t_i]$. We take the minimum approximation discrepancy over the sequence of $\hat{p}_i$'s, $\delta := \min_i \delta_i$, and use an RNN with precision $\delta$ to bound the approximation error due to the $\hat{p}_i$'s using Theorem 3,

$$\sup_{\tau \in [0, T]} |\hat{u}(\tau; p_i) - \hat{u}(\tau; \hat{p}_i)| < \varepsilon / 2. \qquad (9)$$

Using the triangle inequality of the uniform metric, we can combine and bound the discrepancies due to $\hat{u}$ in Eq. (7) and those due to $\hat{p}_i$ in Eq. (9),

$$\sup_{\tau \in [0, T]} |u_i(\tau) - \hat{u}(\tau; \hat{p}_i)| < \varepsilon. \qquad (10)$$

Eq. (10) holds for all $i \in \{1, \ldots, N\}$. Thus the uniform density condition for $\lambda^*(t)$ also holds for the piece-wise approximator $\hat{\lambda}(t)$ given by Eq. (6) over the entire sequence.

From Theorem 4 and Corollary 1, universal approximation with respect to the uniform metric follows immediately when using basis functions which are continuous with respect to their parameter space, for example those in Table 1.

Extensions and discussions. While the original work on learning the compensator function (Omi, Ueda, and Aihara 2019) does not provide theoretical backing for its proposal, we note that Theorem 4, combined with the universal approximation capabilities of monotone neural networks (Sill 1998), can be used to show that the class of monotonic (increasing) neural networks provides universal approximation for compensator functions.

The guarantee described here does not explicitly account for additional dimensions or marks. To extend Theorem 4 in this manner, we consider replacing basis functions $\phi(x)$, which have domain $\mathbb{R}$, with basis functions with extended domain $\mathbb{R} \times K$, where $K$ is a compact set. For example, $K$ can be a set of discrete finite marks in the case of approximating marked temporal point processes. The universal approximation property would then generalise as long as $\Sigma(\phi)$ is dense in $C([0, T] \times K, \mathbb{R}_{++})$ and continuous in the parameter space of the basis functions. Likewise, if we want to approximate a spatial point process, we let $K = \mathbb{R}^2$ and find an appropriate set of basis functions with domain $\mathbb{R} \times \mathbb{R}^2$.

It is worth mentioning two distinctions from the intensity-free approach (Shchur, Biloš, and Günnemann 2020). First, although density approximation allows for direct event time sampling, the log-normal mixture representation assumes that an event will always occur on $\mathbb{R}_+$; specifically, events cannot naturally stop.
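The construction in Theorem 4 can be made concrete with a small sketch that assembles the piece-wise intensity of Eq. (6) from per-interval basis parameters. The parameters here are random placeholders for RNN outputs, and the exponential basis with a softplus transfer is an assumed choice.

```python
import numpy as np

def softplus(x):
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def piecewise_intensity(t_grid, event_times, params):
    """Evaluate the piece-wise intensity of Eq. (6): for t in (t_{i-1}, t_i],
    lambda_hat(t) = softplus(sum_j a_ij * exp(b_ij * tau)) with tau = t - t_{i-1}.
    `params[i]` holds rows (a_ij, b_ij) for interval i; t_grid must lie in (0, event_times[-1]]."""
    lam = np.empty_like(t_grid)
    idx = np.searchsorted(event_times, t_grid, side="left")   # interval index for each t
    for k, t in enumerate(t_grid):
        i = idx[k]
        tau = t - (event_times[i - 1] if i > 0 else 0.0)
        a, b = params[i][:, 0], params[i][:, 1]
        lam[k] = softplus(np.sum(a * np.exp(b * tau)))
    return lam

# Toy example: 3 events, J = 2 exponential basis functions per interval,
# parameters drawn at random in place of RNN outputs (illustrative only).
rng = np.random.default_rng(1)
events = np.array([1.0, 2.5, 4.0])
params = [rng.normal(size=(2, 2)) for _ in range(len(events))]
grid = np.linspace(0.01, 4.0, 10)
print(piecewise_intensity(grid, events, params))
```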
Instead, the intensity function representation allows for events to stop with probability $1 - P(\tau < \infty) = \exp(-\Lambda^*(\infty))$. In other words, $1 - P(\tau < \infty)$ is the probability of no event occurring in finite time, which is non-zero when the intensity function decays to and stays at zero. Furthermore, the intensity-free approach proposes one functional form (the log-normal mixture) for approximating densities, whereas we show that a variety of basis functions all fulfil the goal of universal approximation.

## 4 Implementation with Neural Networks

We propose UNIPoint, a neural network architecture implementing a fully flexible intensity function. Let $\{t_i\}_{i=0}^{N}$ be a sequence of events with corresponding interarrival times $\tau_i = t_i - t_{i-1}$. Let $M$ be the size of the hidden state of the RNN, and $\phi(\cdot\,;\cdot)$ be the chosen basis function with parameter space $P$; let $|P|$ denote the dimension of the parameter space. The approximation guarantees (given in Corollary 1) hold in the limit of an infinite number of basis functions; in practice the number of basis functions is a hyper-parameter, denoted $J$. The network has four key components.

Recurrent Neural Network. We use a simple RNN cell (Elman 1990), though other popular variants, e.g., the LSTM or GRU, would also work. The recurrent unit produces the hidden state vector $h_i$ from the previous hidden state $h_{i-1}$ and the normalised interarrival time $\tau_{i-1}$ (divided by its standard deviation):

$$h_i = f(W h_{i-1} + v \tau_{i-1} + b). \qquad (11)$$

Here $W$, $v$, $b$, and $h_0$ are learnable parameters, and $f$ is any activation function compatible with RNN universal approximation, e.g., the sigmoid $\sigma$ (Schäfer and Zimmermann 2007).

Basis Function Parameters are generated using a linear transformation that maps the hidden state vector of the RNN, $h_i \in \mathbb{R}^M$, to parameters $p_i = (p_{i1}, \ldots, p_{iJ})$:

$$p_{ij} = A_j h_i + B_j, \qquad t \in (t_{i-1}, t_i], \quad j \in \{1, \ldots, J\}. \qquad (12)$$

Here $A_j$ and $B_j$ are learnable parameters and $p_{ij} \in P$. Eq. (11) and Eq. (12) define the RNN which approximates a point process's underlying dynamic system. The error contribution of these two equations is upper bounded by the sum of their individual contributions (Schäfer and Zimmermann 2007, Theorem 2).

Intensity Function. Using parameters $p_{i1}, \ldots, p_{iJ}$, the intensity function with respect to the time since the last event, $\tau = t - t_{i-1}$, is defined as:

$$\hat{\lambda}(\tau) = f_{\mathrm{SP}}\Big(\sum_{j=1}^{J} \phi(\tau; p_{ij})\Big), \qquad \tau \in (0, t_i - t_{i-1}], \qquad (13)$$

where $f_{\mathrm{SP}}(x) = \log(1 + \exp(x))$ is the softplus function.

Loss Function. We use the point process negative log-likelihood, as per Eq. (3). In most cases the integral cannot be calculated analytically, so instead we calculate it numerically using Monte Carlo integration (Press et al. 2007); see the training settings and the online appendix (Soen et al. 2020, Section F). A minimal code sketch combining these components is given below.

Our use of RNNs to encode event history is similar to other neural point process architectures. We note that (Du et al. 2016) only supports monotonic intensities. Our representation is more parsimonious than (Mei and Eisner 2017) since the hidden states need not be functions over time, yet the output can still universally approximate any intensity function. (Omi, Ueda, and Aihara 2019) produce monotonically increasing compensator functions but can have invalid inter-arrival times.

## 5 Evaluation

We compare the performance of UNIPoint models to various simple temporal point processes and neural network based models on three synthetic datasets and three real world datasets.
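The following minimal PyTorch sketch puts the four components of Section 4 together (Eqs. 11-13) with an exponential basis and a Monte Carlo estimate of the compensator integral. The hidden size, number of basis functions, sample count, and the toy input sequence are illustrative assumptions rather than the settings of the released implementation linked above.

```python
import torch
import torch.nn as nn

class UNIPointSketch(nn.Module):
    """Minimal UNIPoint-style model: RNN cell -> basis parameters -> softplus intensity."""

    def __init__(self, hidden_size=32, num_basis=8):
        super().__init__()
        # tanh cell for stability; the theory in Theorem 3 uses a sigmoid activation.
        self.rnn = nn.RNNCell(input_size=1, hidden_size=hidden_size, nonlinearity="tanh")
        # Linear map from hidden state to (alpha_j, beta_j) for each of the J basis functions.
        self.to_params = nn.Linear(hidden_size, num_basis * 2)
        self.num_basis = num_basis
        self.hidden_size = hidden_size

    def intensity(self, tau, params):
        # tau: (..., 1), params: (..., 2J). Exponential basis alpha * exp(beta * tau),
        # summed over j and passed through the softplus transfer (Eq. 13).
        alpha, beta = params.chunk(2, dim=-1)
        return nn.functional.softplus((alpha * torch.exp(beta * tau)).sum(-1))

    def neg_log_likelihood(self, taus, num_mc=20):
        """taus: (N,) inter-arrival times of one sequence. Returns the NLL, with the
        compensator integral on each interval estimated by Monte Carlo sampling."""
        h = torch.zeros(1, self.hidden_size)
        log_lam, integral = 0.0, 0.0
        for tau in taus:
            params = self.to_params(h)                        # basis parameters for this interval
            tau = tau.view(1, 1)
            # Log-intensity at the event time (tau since the last event).
            log_lam = log_lam + torch.log(self.intensity(tau, params) + 1e-8).sum()
            # Monte Carlo estimate of the integral of lambda over (0, tau].
            u = torch.rand(1, num_mc, 1) * tau.unsqueeze(-1)  # uniform samples on [0, tau)
            lam_u = self.intensity(u, params.unsqueeze(1))    # broadcast params over samples
            integral = integral + tau.squeeze() * lam_u.mean()
            # Advance the RNN state with the observed inter-arrival time (Eq. 11).
            h = self.rnn(tau, h)
        return -(log_lam - integral)

# Toy usage with a made-up sequence of inter-arrival times.
torch.manual_seed(0)
model = UNIPointSketch()
taus = torch.rand(16) * 0.5
loss = model.neg_log_likelihood(taus)
loss.backward()
print("NLL:", loss.item())
```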
For the simple temporal point processes we consider self-exciting intensity functions which are piece-wise monotonic (the Self-Correcting process (Isham and Westcott 1979) and the Exponential Hawkes process (Hawkes 1971)) and non-monotonic (the Decaying Sine Hawkes process). The details of dataset preprocessing, model settings and parameter sizes can be found in the appendix (Soen et al. 2020, Sections A and B).

### Synthetic Datasets

We synthesise datasets from simple temporal point process models, generating 2,048 event sequences each containing 128 events. This results in roughly 262,000 events, which is of the same magnitude as tested in (Omi, Ueda, and Aihara 2019). Self-correcting process and exponential Hawkes process datasets have previously been used in other neural point process studies (Du et al. 2016; Omi, Ueda, and Aihara 2019; Shchur, Biloš, and Günnemann 2020). We consider a decaying sine Hawkes process to test whether the models capture non-monotonic self-exciting intensity functions. The following synthetic datasets are used:

Self-Correcting Process. The intensity function is $\lambda^*(t) = \exp\big(\nu t - \gamma N[0, t)\big)$, where $\nu = 1$ and $\gamma = 1$.

Exponential Hawkes Process. The intensity function is a Hawkes process with an exponentially decaying triggering kernel, given by $\lambda^*(t) = \mu + \alpha \beta \sum_{t_i < t} \exp(-\beta (t - t_i))$.
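For reference, an exponential Hawkes dataset of this kind can be simulated with Ogata's thinning algorithm. The sketch below is not the paper's data-generation code, and the parameter values are placeholders rather than those used for the reported datasets.

```python
import numpy as np

def simulate_exp_hawkes(mu=0.2, alpha=0.8, beta=1.0, T=100.0, seed=0):
    """Simulate an exponential Hawkes process on [0, T] by Ogata's thinning algorithm,
    with intensity lambda*(t) = mu + alpha * beta * sum_{t_i < t} exp(-beta * (t - t_i)).
    Parameter values here are illustrative, not those used for the paper's datasets."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0

    def intensity(s):
        return mu + alpha * beta * sum(np.exp(-beta * (s - ti)) for ti in events if ti < s)

    while t < T:
        lam_bar = intensity(t) + alpha * beta   # upper bound on the intensity until the next event
        t = t + rng.exponential(1.0 / lam_bar)  # candidate point from a dominating Poisson process
        if t < T and rng.uniform() <= intensity(t) / lam_bar:
            events.append(t)                    # accept with probability lambda(t) / lam_bar
    return np.array(events)

events = simulate_exp_hawkes()
print(f"simulated {len(events)} events, first few: {events[:5]}")
```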