# UNIPoint: Universally Approximating Point Processes Intensities

Alexander Soen, Alexander Mathews, Daniel Grixti-Cheng, Lexing Xie
The Australian National University
alexander.soen@anu.edu.au, alex.mathews@anu.edu.au, a500846@anu.edu.au, lexing.xie@anu.edu.au

Point processes are a useful mathematical tool for describing events over time, and so there are many recent approaches for representing and learning them. One notable open question is how to precisely describe the flexibility of point process models and whether there exists a general model that can represent all point processes. Our work bridges this gap. Focusing on the widely used event intensity function representation of point processes, we provide a proof that a class of learnable functions can universally approximate any valid intensity function. The proof connects the well-known Stone-Weierstrass Theorem for function approximation, the uniform density of non-negative continuous functions using a transfer function, the formulation of the parameters of a piece-wise continuous function as a dynamic system, and a recurrent neural network implementation for capturing the dynamics. Using these insights, we design and implement UNIPoint, a novel neural point process model, using recurrent neural networks to parameterise sums of basis functions upon each event. Evaluations on synthetic and real world datasets show that this simpler representation performs better than Hawkes process variants and more complex neural network-based approaches. We expect this result will provide a practical basis for selecting and tuning models, as well as furthering theoretical work on representational complexity and learnability.

## 1 Introduction

Temporal point processes (Daley and Vere-Jones 2007) are a preferred tool for describing events happening at irregular intervals, such as earthquake modelling (Ogata 1988), social media (Zhao et al. 2015), and finance (Embrechts, Liniger, and Lin 2011). One common variant is the self-exciting Hawkes process with a parametric kernel (Laub, Taimre, and Pollett 2015), which describes prior events triggering future events. However, misspecification of the kernel will likely result in poor performance (Mishra, Rizoiu, and Xie 2016). One may ask: what are the most flexible classes of point process intensity functions? How can they be implemented computationally? Does a flexible representation lead to good performance? There is a body of literature surrounding these three questions.

Figure 1: Overview of our method of universally approximating point processes. An RNN is used to parameterise a set of basis functions for each interarrival time $\tau_i$. The sum of basis functions is then used to approximate a continuous function, which is composed with a transfer function $f_+$ to universally approximate all valid intensity functions.

Multi-layer neural networks are well known for being flexible function approximators. They are able to approximate any Borel-measurable function on a compact domain (Cybenko 1989; Hornik, Stinchcombe, and White 1989). A number of neural architectures have been proposed for point processes. The Recurrent Marked Temporal Point Process model (RMTPP) (Du et al. 2016) uses Recurrent Neural Networks (RNNs) to encode event history, and defines the conditional intensity function by a parametric form.
Common choices of such parametric forms include an exponential function (Du et al. 2016; Upadhyay, De, and Rodriguez 2018) or a constant function (Li et al. 2018; Huang, Wang, and Mak 2019). Variants of the RNN have been explored, including Neural Hawkes (Mei and Eisner 2017), which makes the RNN state a function of time, as well as Transformer Hawkes (Zuo et al. 2020) and Self-attention Hawkes (Zhang et al. 2019), which use attention mechanisms instead of recurrent units. However, a conceptual gap on the flexibility of the neural point process representation still remains. Piece-wise exponential functions (Du et al. 2016; Upadhyay, De, and Rodriguez 2018) only encode intensities that are monotonic between events. The functional RNN representation (Mei and Eisner 2017) is flexible but uses many more parameters. Transformers (Zuo et al. 2020; Zhang et al. 2019) are generic sequence-to-sequence function approximators (Yun et al. 2020), but the functional form of the Transformer Hawkes point process intensity function is not a universal approximator. Furthermore, intensity functions are non-negative and discontinuous at event times, which means neural network approximation results cannot be applied directly.

Recent results shed light on alternative point process representations. Omi, Ueda, and Aihara (2019) use a positive-weight monotone neural network to learn the compensator (the integral of the intensity function). Although it is a generic approximator for compensators, it might assign non-zero probability to invalid inter-arrival times, as the compensator can be non-zero at time zero. Shchur, Biloš, and Günnemann (2020) represent inter-arrival times using normalising flow and mixture models, which can universally approximate any density. However, by defining the point process with the event density, the model cannot account for event sequences which stop naturally (see Section 3). These approaches are promising alternatives but are not a full replacement for intensity functions, which are preferred since they are intuitive and can be superimposed.

In this work, we propose a class of neural networks that can approximate any point process intensity function to arbitrary accuracy, along with a proof showing the role of three key constituents: a set of uniformly dense basis functions, a positive transfer function, and an approximator for arbitrary dynamic systems. We implement this proposal using RNNs, the output of which is used to parameterise a set of basis functions upon arrival of each event, as shown in Figure 1. Named UNIPoint, the proposed model performs well across synthetic and real world datasets in comparison to the Hawkes process and other neural variants. This work provides a general yet parsimonious representation for temporal point processes, and so forms a solid basis for future development in point process representations that incorporate rich contextual information into event models. Our primary contributions are:

- A novel architecture that can approximate any point process intensity function to arbitrary accuracy.
- A theoretical guarantee for the flexible point process representation that builds upon the theory of universally approximating continuous functions and dynamic systems.
- UNIPoint, the neural network implementation of the proposed architecture, with strong empirical results on both synthetic and real world datasets. Reference code is available online at https://github.com/alexandersoen/unipoint.
$C(X, Y)$ denotes the class of continuous functions mapping from domain $X$ to range $Y$. Denote $\mathbb{R}$ as the set of real numbers, $\mathbb{R}_+$ as the non-negative reals and $\mathbb{R}_{++}$ as the strictly positive reals. Define the composition of a function $f$ and a class of functions $\mathcal{F}$ as $f \circ \mathcal{F} = \{f \circ g : g \in \mathcal{F}\}$. The sigmoid function $[1 + \exp(-x)]^{-1}$ is denoted as $\sigma(x)$.

## 2 Preliminary: Temporal Point Processes

A temporal point process is an ordered set of event times $\{t_i\}_{i=0}^{N}$. We typically describe a point process by its conditional intensity function $\lambda(t \mid \mathcal{H}_t)$, which can be interpreted as the instantaneous probability of an event occurring at time $t$ given the event history $\mathcal{H}_t$, consisting of the set of all events before time $t$. This can be written as (Daley and Vere-Jones 2007):

$$\lambda(t \mid \mathcal{H}_t) := \lim_{h \to 0^+} \frac{P(N[t, t+h) > 0 \mid \mathcal{H}_t)}{h}, \qquad (1)$$

where $N[t_1, t_2)$ is the number of events occurring between two arbitrary times $t_1 < t_2$. Equation (1) restricts the conditional intensity function to non-negative functions. Given history $\mathcal{H}_t$, the conditional intensity is a deterministic function of time $t$. Following standard convention, we refer to the conditional intensity function as simply the intensity function, abbreviating $\lambda(t \mid \mathcal{H}_t)$ to $\lambda^*(t)$.

Point processes can be specified by choosing a functional form for the intensity function. For example, the Hawkes process, one of the simplest interacting point processes (Bacry, Mastromatteo, and Muzy 2015), can be defined as follows:

$$\lambda^*(t) = \mu + \sum_{t_i < t} \kappa(t - t_i),$$

where $\mu$ is a background intensity and $\kappa$ is a triggering kernel. A class of functions $\mathcal{F}$ universally approximates a class $\mathcal{G}$ with respect to the uniform metric $d$ if for all $\varepsilon > 0$ and $g \in \mathcal{G}$, there exists an $f \in \mathcal{F}$ such that $d(f, g) < \varepsilon$. An equivalent expression is: $\mathcal{F}$ is uniformly dense in $\mathcal{G}$.

Remark. Although we refer to Hawkes point processes as the primary example of a point process to approximate throughout the paper, following the example of (Mei and Eisner 2017), we note that as long as the point process has a continuous intensity function between events, our approximation analysis will hold. Thus, in addition to Hawkes processes, the methods proposed in our work can approximate point processes including self-correcting processes and non-homogeneous Poisson processes with continuous densities.

### Approximation Between Two Events

To approximate the time-shifted non-negative functions $u_i(\tau)$, that is, the intensity on the interval $(t_{i-1}, t_i]$ written as a function of the time since the last event $\tau = t - t_{i-1}$, we first introduce transfer functions $f_+$ (Definition 1). We then prove that the class of composed functions $f_+ \circ \mathcal{F}$ preserves uniform density (Theorem 1). Given this theorem, we provide a method for constructing uniformly dense classes with sums of basis functions $\Sigma(\phi)$ (Definition 2), which are in turn uniformly dense after composing with $f_+$ (Corollary 1). We further provide a set of suitable basis functions (Table 1).

Formally, we define the $M$-transfer functions, which map negative outputs of a function to positive values.

Definition 1. A function $f_+ : \mathbb{R} \to \mathbb{R}_+$ is an $M$-transfer function if it satisfies the following:
1. $f_+$ is $M$-Lipschitz continuous;
2. $\mathbb{R}_{++} \subseteq f_+[\mathbb{R}]$;
3. and $f_+$ is strictly increasing on $f_+^{-1}[\mathbb{R}_{++}]$.

Definition 1 permits a wide range of functions. In practice, it is convenient to use the softplus function $f_{\mathrm{SP}}(x) = \log(1 + \exp(x))$, which is a 1-transfer function commonly used in other neural point processes (Mei and Eisner 2017; Omi, Ueda, and Aihara 2019; Zuo et al. 2020). Alternatively, $f_+(x) = \max(0, x)$ could be used; however, this is not differentiable at $x = 0$, which can cause issues in practice. Intuitively, $M$-transfer functions are increasing functions which map onto all positive values and have bounded steepness.
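As a concrete illustration of Definition 1, the short Python snippet below (not part of the paper's reference code) numerically checks the three $M$-transfer conditions for the softplus function with $M = 1$; the grid of test points is an arbitrary choice made for the check.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

# Check the three M-transfer conditions empirically for M = 1 on a grid.
x = np.linspace(-20, 20, 100001)
y = softplus(x)

slopes = np.diff(y) / np.diff(x)
print("max slope (should be <= 1):", slopes.max())            # 1-Lipschitz
print("min value (approaches 0, never negative):", y.min())   # range covers the strictly positive reals
print("strictly increasing:", np.all(np.diff(y) > 0))
```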
When a Gaussian process is used to define an inhomogeneous Poisson process, link functions serve a similar role in ensuring valid intensity functions (Lloyd et al. 2015). However, many of these link functions violate the conditions of being an $M$-transfer function (Donner and Opper 2018): the exponential link function $f_+(x) = \exp(x)$ and the squared link function $f_+(x) = x^2$ are not $M$-Lipschitz continuous as they have unbounded derivatives, whereas the sigmoid link function $f_+(x) = \sigma(x)$ is a bounded function (violating condition 2).

Using $M$-transfer functions, we can show that a uniformly dense class of unbounded functions will be uniformly dense for strictly positive functions under composition. These functions are defined on a compact subset $K \subset \mathbb{R}$, which can be set as $K = [0, T]$ for intensity functions.

Theorem 1. Given a class of functions $\mathcal{F}$ which is uniformly dense in $C(K, \mathbb{R})$ and an $M$-transfer function $f_+$, the composed class of functions $f_+ \circ \mathcal{F}$ is uniformly dense in $C(K, \mathbb{R}_{++})$ for any compact subset $K \subset \mathbb{R}$.

Proof. Let $f \in C(K, \mathbb{R}_{++})$ and $\varepsilon > 0$ be arbitrary. Since $f_+$ is strictly increasing and continuous on the preimage of $\mathbb{R}_{++}$, the inverse $f_+^{-1}$ exists and is continuous when restricted to the subdomain $\mathbb{R}_{++}$. Thus, there exists some $g \in C(K, \mathbb{R})$ such that $f = f_+ \circ g$. As $\mathcal{F}$ is dense with respect to the uniform metric, for $\varepsilon / M$ there exists some $h \in \mathcal{F}$ such that $d(h, g) < \varepsilon / M$. Thus for any $x \in K$, $|(f_+ \circ h)(x) - f(x)| = |(f_+ \circ h)(x) - (f_+ \circ g)(x)| \le M |h(x) - g(x)| < \varepsilon$. We have $d(f_+ \circ h, f) < \varepsilon$.

To approximate $u_i(\tau)$ using Theorem 1, we need a family of functions which are able to approximate functions in $C(K, \mathbb{R})$. We consider the family of functions consisting of sums of basis functions $\phi(\cdot\,; p_j)$, where $p_j \in P$ denotes the parameterisation of the basis function $\phi$.

Definition 2. Denote $\Sigma(\phi)$ as the class of functions corresponding to sums of basis functions $\phi : \mathbb{R} \times P \to \mathbb{R}$, with parameter space $P$, as follows:

$$\Sigma(\phi) = \Big\{ \hat{u} : \mathbb{R} \to \mathbb{R} \;\Big|\; \hat{u}(x) = \sum_{j=1}^{J} \phi(x; p_j),\ p_j \in P,\ J \in \mathbb{N} \Big\}.$$

The parameter space $P$ of a basis function is determined by the parametric form of the chosen basis function $\phi(x; p_j)$. For example, the class composed of exponential basis functions could be defined with parameter space $P = \mathbb{R}^2$ and functions $\{\phi : \mathbb{R} \to \mathbb{R} \mid \phi(x) = \alpha \exp(\beta x),\ \alpha, \beta \in \mathbb{R}\}$. Definition 2 encompasses a wide range of function classes, including neural networks with sigmoid (Cybenko 1989; Hornik, Stinchcombe, and White 1989; Debao 1993) or rectified linear unit activations (Sonoda and Murata 2017).

The Stone-Weierstrass Theorem provides sufficient conditions for finding basis functions for universal approximation.

Theorem 2 (Stone-Weierstrass Theorem (Rudin et al. 1964; Royden and Fitzpatrick 1988)). Suppose a subalgebra $A$ of $C(K, \mathbb{R})$, where $K \subset \mathbb{R}$ is a compact subset, satisfies the following conditions:
1. For all distinct $x, y \in K$, there exists some $f \in A$ such that $f(x) \neq f(y)$;
2. For all $x_0 \in K$, there exists some $f \in A$ such that $f(x_0) \neq 0$.
Then $A$ is uniformly dense in $C(K, \mathbb{R})$.

Thus, by using Theorem 1 and the Stone-Weierstrass Theorem (Theorem 2), we arrive at Corollary 1, which gives sufficient conditions for basis functions $\phi$ to ensure that $f_+ \circ \Sigma(\phi)$ is a universal approximator for $C(K, \mathbb{R}_{++})$.

Corollary 1. For any compact subset $K \subset \mathbb{R}$ and for any $M$-transfer function $f_+$, if a basis function $\phi(\cdot\,; p)$ parametrised by $p \in P$ satisfies the following conditions:
1. $\Sigma(\phi)$ is closed under product;
2. for any distinct points $x, y \in K$, there exists some $p \in P$ such that $\phi(x; p) \neq \phi(y; p)$;
3. for all $x_0 \in K$, there exists some $p \in P$ such that $\phi(x_0; p) \neq 0$;
then $f_+ \circ \Sigma(\phi)$ is uniformly dense in $C(K, \mathbb{R}_{++})$.
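To make Corollary 1 concrete, here is a minimal sketch assuming an exponential basis with a fixed grid of decay rates and a hand-picked strictly positive target on $[0, T]$; the least-squares fit in the pre-transfer space is only an illustration of the function class $f_+ \circ \Sigma(\phi_{\mathrm{EXP}})$, not the training procedure described later in the paper.

```python
import numpy as np

def softplus(x):
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def inv_softplus(y):
    # Inverse of softplus on the strictly positive reals: x = log(exp(y) - 1).
    return y + np.log(-np.expm1(-y))

# Target: a strictly positive, non-monotonic "intensity segment" on [0, T].
T = 4.0
x = np.linspace(0.0, T, 400)
target = 0.5 + np.exp(-x) * (1.2 + np.sin(3 * x))

# Sum of exponential basis functions phi(x; a, b) = a * exp(b * x):
# fix a grid of decay rates b_j, solve for weights a_j by least squares in the
# pre-transfer space, then map through the softplus transfer function.
betas = np.linspace(-3.0, 1.0, 8)              # J = 8 basis functions
design = np.exp(np.outer(x, betas))            # shape (len(x), J)
alphas, *_ = np.linalg.lstsq(design, inv_softplus(target), rcond=None)

approx = softplus(design @ alphas)             # f_+ composed with a member of Sigma(phi_EXP)
print("max abs error:", np.abs(approx - target).max())
```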
The first condition of Corollary 1 ensures that the set of basis function sums $\Sigma(\phi)$ is a subalgebra of $C(K, \mathbb{R})$. The latter two conditions are the preconditions required for the Stone-Weierstrass Theorem to hold. Given the conditions of Corollary 1, some interesting choices for valid basis functions $\phi(x; p)$ are the exponential basis function $\phi_{\mathrm{EXP}}(x) = \alpha \exp(\beta x)$ and the power law basis function $\phi_{\mathrm{PL}}(x) = \alpha (1 + x)^{-\beta}$. These basis functions are similar to the exponential and power law Hawkes triggering kernels, which have seen widespread use in many domains (Ogata 1988; Bacry, Mastromatteo, and Muzy 2015; Laub, Taimre, and Pollett 2015; Rizoiu et al. 2017).

We note that the classes of intensity functions in Theorem 1 and Corollary 1 consist of strictly positive continuous functions. However, these results generalise to non-negative continuous functions, as our definition of intensity functions permits arbitrarily low intensity in $u_i(\tau)$, and switching from arbitrarily low intensities to zero intensity results in arbitrarily low error with respect to the uniform metric on $(0, T]$.

In Table 1, we provide a selection of interesting basis functions to universally approximate $u_i(\tau) \in C(K, \mathbb{R}_{++})$. One should note that Corollary 1 only provides sufficient conditions, and some of the basis functions in Table 1 do not satisfy its preconditions. For example, the sigmoid basis function $\phi_{\mathrm{SIG}}(x) = \alpha \sigma(\beta x + \delta)$, $(\alpha, \beta, \delta) \in \mathbb{R}^3$, does not allow $\Sigma(\phi_{\mathrm{SIG}})$ to be closed under product and thus does not satisfy the conditions of Corollary 1. However, the sum of sigmoid basis functions is equivalent to the class of single hidden layer neural networks (Hornik, Stinchcombe, and White 1989; Debao 1993). Thus, together with an appropriate transfer function, it does have the universal approximation property for non-negative continuous functions through Theorem 1. Additionally, other basis functions used to define point process intensity functions can be used, such as radial basis functions (Tabibian et al. 2017), which are not generally closed under product but have universal approximation properties (Park and Sandberg 1991).

| Basis Function | Functional Form $\phi$ | Parameter Space $P$ |
|---|---|---|
| $\phi_{\mathrm{EXP}}$ † | $\alpha \exp(\beta x)$ | $(\alpha, \beta) \in \mathbb{R}^2$ |
| $\phi_{\mathrm{PL}}$ † | $\alpha (1 + x)^{-\beta}$ | $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}_+$ |
| $\phi_{\mathrm{COS}}$ † | $\alpha \cos(\beta x + \delta)$ | $(\alpha, \beta, \delta) \in \mathbb{R}^3$ |
| $\phi_{\mathrm{SIG}}$ ‡ | $\alpha \sigma(\beta x + \delta)$ | $(\alpha, \beta, \delta) \in \mathbb{R}^3$ |
| $\phi_{\mathrm{ReLU}}$ § | $\max(0, \alpha x + \beta)$ | $(\alpha, \beta) \in \mathbb{R}^2$ |

Table 1: Basis function universal approximators for intensity functions between two consecutive events. † indicates functions that satisfy Corollary 1; ‡ one proven in (Cybenko 1989); § one proven in (Sonoda and Murata 2017).

### Approximation for Event Sequences

The approximations to $u_i(\tau)$ use a set of parameters, e.g. $(\alpha, \beta, \delta)$ in Table 1. We denote these parameter vectors as $p_i \in P$, and the approximated function segment as $\hat{u}_i(\tau; p_i)$. Since each segment $\hat{u}_i(\tau; p_i)$ is uniquely determined by $p_i$, and the union of all segments approximates $\lambda^*(t)$, we only need to capture the dynamics in $p_i$. We express $p_i$ as the output of a dynamic system:

$$s_{i+1} = g(s_i, t_i), \qquad p_i = \nu(s_i), \qquad (5)$$

where $s_{i+1}$ is the internal state of the dynamic system, $g$ updates the internal state at each step, and $\nu$ maps from the internal state to the output.

Theorem 3 (RNN Universal Approximation (Schäfer and Zimmermann 2007)). Let $g : \mathbb{R}^J \times \mathbb{R}^I \to \mathbb{R}^J$ be measurable and $\nu : \mathbb{R}^J \to \mathbb{R}^n$ be continuous, with external inputs $x_i \in \mathbb{R}^I$, inner states $s_i \in \mathbb{R}^J$, and outputs $p_i \in \mathbb{R}^n$ (for $i = 1, \ldots, N$). Then any open dynamic system of the form of Eq. (5) can be approximated by an RNN, with sigmoid activation function, to arbitrary accuracy.
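The sketch below shows the dynamic system of Eq. (5) realised as an Elman-style RNN cell whose linear read-out produces per-event basis parameters. The dimensions, the exponential-basis parameterisation, and the randomly initialised weights standing in for learned ones are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: hidden state size M, number of basis functions J, parameters per basis P_dim.
M, J, P_dim = 16, 4, 2          # e.g. exponential basis phi(x; a, b) = a * exp(b * x)

# Randomly initialised weights standing in for learned ones (illustrative only).
W = rng.normal(scale=0.1, size=(M, M))
v = rng.normal(scale=0.1, size=(M, 1))
b = np.zeros((M, 1))
A = rng.normal(scale=0.1, size=(J * P_dim, M))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(h_prev, tau_prev):
    """One step of the dynamic system s_{i+1} = g(s_i, t_i), p_i = nu(s_i),
    realised as an Elman-style RNN cell followed by a linear read-out."""
    h = sigmoid(W @ h_prev + v * tau_prev + b)        # state update g
    p = (A @ h).reshape(J, P_dim)                     # read-out nu: basis parameters
    return h, p

# Roll the recurrence over a toy sequence of inter-arrival times.
taus = [0.5, 1.2, 0.3, 2.0]
h = np.zeros((M, 1))
for i, tau in enumerate(taus, start=1):
    h, p = step(h, tau)
    print(f"event {i}: basis parameters\n{p}")
```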
Given that RNNs approximate $p_i$, we use a continuity condition on the basis $\phi$, and in turn on $\hat{u}$, to show how to universally approximate an intensity function with an RNN.

Theorem 4. Let $\{t_i\}_{i=0}^{N}$ be a sequence of events with $t_i \in [0, T]$ and $\lambda^*(t)$ be an intensity function. Suppose a parametric family of functions $\mathcal{F} = \{\hat{u}(\cdot\,; p) : p \in P\}$ is uniformly dense in $C([0, T], \mathbb{R}_{++})$ and $\hat{u}(x; p)$ is continuous with respect to $p$ for all $x \in [0, T]$. Then there exists a recurrent neural network

$$h_i = \sigma(W h_{i-1} + v t_{i-1} + b), \quad \hat{p}_i = A h_i, \quad \hat{\lambda}(t) = \hat{u}(\tau; \hat{p}_i) \ \text{with}\ \tau = t - t_{i-1} \ \text{for}\ t \in (t_{i-1}, t_i], \qquad (6)$$

where $\sigma$ is a sigmoid activation function and $[W, v, b, A]$ are weights of appropriate shapes, such that $\hat{\lambda}(t)$ approximates $\lambda^*(t)$ with arbitrary precision for all $t \in (0, T]$.

Proof. Let $\varepsilon > 0$ be arbitrary. For any interval $(t_{i-1}, t_i]$, we know from the uniform density of $\mathcal{F}$ that there exists a $p_i$ such that

$$\sup_{\tau \in [0, T]} |\hat{u}_i(\tau; p_i) - u_i(\tau)| < \varepsilon / 2. \qquad (7)$$

By the continuity conditions on $\hat{u}$, it follows that for each $p_i$ and any $\tau \in [0, T]$ there exists $\delta_i$ such that

$$\|p_i - \hat{p}_i\| < \delta_i \implies |\hat{u}(\tau; p_i) - \hat{u}(\tau; \hat{p}_i)| < \varepsilon / 2, \qquad (8)$$

by taking the minimum over the $\delta_\tau$'s in the $(\varepsilon/2, \delta_\tau)$-condition of continuity for all $\tau \in [0, T]$ (where the subscript emphasises the range of $\tau$ for fixed $i$). The left-hand side of Eq. (8) is the precision needed in our RNN approximator for each interval $(t_{i-1}, t_i]$. We take the minimum approximation discrepancy over the sequence of $\hat{p}_i$'s, $\delta := \min_i \delta_i$, and use an RNN with precision $\delta$ to bound the approximation error due to the $\hat{p}_i$'s using Theorem 3,

$$\sup_{\tau \in [0, T]} |\hat{u}(\tau; p_i) - \hat{u}(\tau; \hat{p}_i)| < \varepsilon / 2. \qquad (9)$$

Using the triangle inequality of the uniform metric, we can combine and bound the discrepancies due to $\hat{u}$ in Eq. (7) and those due to $\hat{p}_i$ in Eq. (9),

$$\sup_{\tau \in [0, T]} |u_i(\tau) - \hat{u}(\tau; \hat{p}_i)| < \varepsilon. \qquad (10)$$

Eq. (10) holds for all $i \in \{1, \ldots, N\}$. Thus the uniform density condition for $\lambda^*(t)$ also holds for the piece-wise approximator $\hat{\lambda}(t)$ given by Eq. (6) over the entire sequence.

From Theorem 4 and Corollary 1, universal approximation with respect to the uniform metric follows immediately when using basis functions which are continuous with respect to their parameter space, for example those in Table 1.

Extensions and discussions. While the original work on learning the compensator function (Omi, Ueda, and Aihara 2019) does not provide theoretical backing for its proposal, we note that Theorem 4, combined with the universal approximation capabilities of monotone neural networks (Sill 1998), can be used to show that the class of monotonic (increasing) neural networks provides universal approximation for compensator functions.

The guarantee described here does not explicitly account for additional dimensions or marks. To extend Theorem 4 in this manner, we consider replacing basis functions $\phi(x)$, which have domain $\mathbb{R}$, with basis functions with extended domain $\mathbb{R} \times K$, where $K$ is a compact set. For example, $K$ can be a set of discrete finite marks in the case of approximating marked temporal point processes. The universal approximation property would then generalise as long as $\Sigma(\phi)$ is dense in $C([0, T] \times K, \mathbb{R}_{++})$ and continuous in the parameter space of the basis functions. Likewise, if we want to approximate a spatial point process, we let $K = \mathbb{R}^2$ and find an appropriate set of basis functions with domain $\mathbb{R} \times \mathbb{R}^2$.

It is worth mentioning two distinctions from the intensity-free approach (Shchur, Biloš, and Günnemann 2020). First, although density approximation allows for direct event time sampling, the log-normal mixture representation assumes that an event will always occur on $\mathbb{R}_+$; specifically, events cannot naturally stop.
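The construction in Theorem 4 can be made concrete with a small sketch that assembles the piece-wise intensity of Eq. (6) from per-interval basis parameters. The parameters here are random placeholders for RNN outputs, and the exponential basis with a softplus transfer is an assumed choice.

```python
import numpy as np

def softplus(x):
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def piecewise_intensity(t_grid, event_times, params):
    """Evaluate the piece-wise intensity of Eq. (6): for t in (t_{i-1}, t_i],
    lambda_hat(t) = softplus(sum_j a_ij * exp(b_ij * tau)) with tau = t - t_{i-1}.
    `params[i]` holds rows (a_ij, b_ij) for interval i; t_grid must lie in (0, event_times[-1]]."""
    lam = np.empty_like(t_grid)
    idx = np.searchsorted(event_times, t_grid, side="left")   # interval index for each t
    for k, t in enumerate(t_grid):
        i = idx[k]
        tau = t - (event_times[i - 1] if i > 0 else 0.0)
        a, b = params[i][:, 0], params[i][:, 1]
        lam[k] = softplus(np.sum(a * np.exp(b * tau)))
    return lam

# Toy example: 3 events, J = 2 exponential basis functions per interval,
# parameters drawn at random in place of RNN outputs (illustrative only).
rng = np.random.default_rng(1)
events = np.array([1.0, 2.5, 4.0])
params = [rng.normal(size=(2, 2)) for _ in range(len(events))]
grid = np.linspace(0.01, 4.0, 10)
print(piecewise_intensity(grid, events, params))
```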
Instead, the intensity function representation allows for events to stop with probability $1 - P(\tau < \infty) = \exp(-\Lambda^*(\infty))$. In other words, $1 - P(\tau < \infty)$ is the probability of no event occurring in finite time, which is non-zero when the intensity function decays to and stays at zero. Furthermore, the intensity-free approach proposes one functional form (the log-normal mixture) for approximating densities, whereas we show that a variety of basis functions all fulfil the goal of universal approximation.

## 4 Implementation with Neural Networks

We propose UNIPoint, a neural network architecture implementing a fully flexible intensity function. Let $\{t_i\}_{i=0}^{N}$ be a sequence of events with corresponding interarrival times $\tau_i = t_i - t_{i-1}$. Let $M$ be the size of the hidden state of the RNN, and $\phi(\cdot\,;\cdot)$ be the chosen basis function with parameter space $P$; let $|P|$ denote the dimension of the parameter space. The approximation guarantees (given in Corollary 1) hold in the limit of an infinite number of basis functions; in practice the number of basis functions is a hyper-parameter, denoted $J$. The network has four key components.

Recurrent Neural Network. We use a simple RNN cell (Elman 1990), though other popular variants, e.g., the LSTM or GRU, would also work. The recurrent unit produces the hidden state vector $h_i$ from the previous hidden state $h_{i-1}$ and the normalised interarrival time $\tau_{i-1}$ (divided by its standard deviation):

$$h_i = f(W h_{i-1} + v \tau_{i-1} + b). \qquad (11)$$

Here $W$, $v$, $b$, and $h_0$ are learnable parameters, and $f$ is any activation function compatible with RNN universal approximation, e.g., the sigmoid $\sigma$ (Schäfer and Zimmermann 2007).

Basis Function Parameters are generated using a linear transformation that maps the hidden state vector of the RNN, $h_i \in \mathbb{R}^M$, to parameters $p_i = (p_{i1}, \ldots, p_{iJ})$:

$$p_{ij} = A_j h_i + B_j, \qquad t \in (t_{i-1}, t_i], \quad j \in \{1, \ldots, J\}. \qquad (12)$$

Here $A_j$ and $B_j$ are learnable parameters and $p_{ij} \in P$. Eq. (11) and Eq. (12) define the RNN which approximates a point process's underlying dynamic system. The error contribution of these two equations is upper bounded by the sum of their individual contributions (Schäfer and Zimmermann 2007, Theorem 2).

Intensity Function. Using parameters $p_{i1}, \ldots, p_{iJ}$, the intensity function with respect to the time since the last event, $\tau = t - t_{i-1}$, is defined as:

$$\hat{\lambda}(\tau) = f_{\mathrm{SP}}\Big(\sum_{j=1}^{J} \phi(\tau; p_{ij})\Big), \qquad \tau \in (0, t_i - t_{i-1}], \qquad (13)$$

where $f_{\mathrm{SP}}(x) = \log(1 + \exp(x))$ is the softplus function.

Loss Function. We use the point process negative log-likelihood, as per Eq. (3). In most cases the integral cannot be calculated analytically, so instead we calculate it numerically using Monte Carlo integration (Press et al. 2007); see the training settings and the online appendix (Soen et al. 2020, Section F). A minimal code sketch combining these components is given below.

Our use of RNNs to encode event history is similar to other neural point process architectures. We note that (Du et al. 2016) only supports monotonic intensities. Our representation is more parsimonious than (Mei and Eisner 2017) since the hidden states need not be functions over time, yet the output can still universally approximate any intensity function. (Omi, Ueda, and Aihara 2019) produce monotonically increasing compensator functions but can have invalid inter-arrival times.

## 5 Evaluation

We compare the performance of UNIPoint models to various simple temporal point processes and neural network based models on three synthetic datasets and three real world datasets.
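The following minimal PyTorch sketch puts the four components of Section 4 together (Eqs. 11-13) with an exponential basis and a Monte Carlo estimate of the compensator integral. The hidden size, number of basis functions, sample count, and the toy input sequence are illustrative assumptions rather than the settings of the released implementation linked above.

```python
import torch
import torch.nn as nn

class UNIPointSketch(nn.Module):
    """Minimal UNIPoint-style model: RNN cell -> basis parameters -> softplus intensity."""

    def __init__(self, hidden_size=32, num_basis=8):
        super().__init__()
        # tanh cell for stability; the theory in Theorem 3 uses a sigmoid activation.
        self.rnn = nn.RNNCell(input_size=1, hidden_size=hidden_size, nonlinearity="tanh")
        # Linear map from hidden state to (alpha_j, beta_j) for each of the J basis functions.
        self.to_params = nn.Linear(hidden_size, num_basis * 2)
        self.num_basis = num_basis
        self.hidden_size = hidden_size

    def intensity(self, tau, params):
        # tau: (..., 1), params: (..., 2J). Exponential basis alpha * exp(beta * tau),
        # summed over j and passed through the softplus transfer (Eq. 13).
        alpha, beta = params.chunk(2, dim=-1)
        return nn.functional.softplus((alpha * torch.exp(beta * tau)).sum(-1))

    def neg_log_likelihood(self, taus, num_mc=20):
        """taus: (N,) inter-arrival times of one sequence. Returns the NLL, with the
        compensator integral on each interval estimated by Monte Carlo sampling."""
        h = torch.zeros(1, self.hidden_size)
        log_lam, integral = 0.0, 0.0
        for tau in taus:
            params = self.to_params(h)                        # basis parameters for this interval
            tau = tau.view(1, 1)
            # Log-intensity at the event time (tau since the last event).
            log_lam = log_lam + torch.log(self.intensity(tau, params) + 1e-8).sum()
            # Monte Carlo estimate of the integral of lambda over (0, tau].
            u = torch.rand(1, num_mc, 1) * tau.unsqueeze(-1)  # uniform samples on [0, tau)
            lam_u = self.intensity(u, params.unsqueeze(1))    # broadcast params over samples
            integral = integral + tau.squeeze() * lam_u.mean()
            # Advance the RNN state with the observed inter-arrival time (Eq. 11).
            h = self.rnn(tau, h)
        return -(log_lam - integral)

# Toy usage with a made-up sequence of inter-arrival times.
torch.manual_seed(0)
model = UNIPointSketch()
taus = torch.rand(16) * 0.5
loss = model.neg_log_likelihood(taus)
loss.backward()
print("NLL:", loss.item())
```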
For the simple temporal point processes we consider self-exciting intensity functions which are piece-wise monotonic (the Self-Correcting process (Isham and Westcott 1979) and the Exponential Hawkes process (Hawkes 1971)) and non-monotonic (the Decaying Sine Hawkes process). The details of dataset preprocessing, model settings and parameter sizes can be found in the appendix (Soen et al. 2020, Sections A and B).

### Synthetic Datasets

We synthesise datasets from simple temporal point process models, generating 2,048 event sequences each containing 128 events. This results in roughly 262,000 events, which is of the same magnitude as tested in (Omi, Ueda, and Aihara 2019). Self-correcting process and exponential Hawkes process datasets have previously been used in other neural point process studies (Du et al. 2016; Omi, Ueda, and Aihara 2019; Shchur, Biloš, and Günnemann 2020). We consider a decaying sine Hawkes process to test whether the models capture non-monotonic self-exciting intensity functions. The following synthetic datasets are used:

Self-Correcting Process. The intensity function is $\lambda^*(t) = \exp\big(\nu t - \gamma N[0, t)\big)$, where $\nu = 1$ and $\gamma = 1$.

Exponential Hawkes Process. The intensity function is a Hawkes process with an exponentially decaying triggering kernel, given by $\lambda^*(t) = \mu + \alpha \beta \sum_{t_i < t} \exp(-\beta (t - t_i))$.
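For reference, an exponential Hawkes dataset of this kind can be simulated with Ogata's thinning algorithm. The sketch below is not the paper's data-generation code, and the parameter values are placeholders rather than those used for the reported datasets.

```python
import numpy as np

def simulate_exp_hawkes(mu=0.2, alpha=0.8, beta=1.0, T=100.0, seed=0):
    """Simulate an exponential Hawkes process on [0, T] by Ogata's thinning algorithm,
    with intensity lambda*(t) = mu + alpha * beta * sum_{t_i < t} exp(-beta * (t - t_i)).
    Parameter values here are illustrative, not those used for the paper's datasets."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0

    def intensity(s):
        return mu + alpha * beta * sum(np.exp(-beta * (s - ti)) for ti in events if ti < s)

    while t < T:
        lam_bar = intensity(t) + alpha * beta   # upper bound on the intensity until the next event
        t = t + rng.exponential(1.0 / lam_bar)  # candidate point from a dominating Poisson process
        if t < T and rng.uniform() <= intensity(t) / lam_bar:
            events.append(t)                    # accept with probability lambda(t) / lam_bar
    return np.array(events)

events = simulate_exp_hawkes()
print(f"simulated {len(events)} events, first few: {events[:5]}")
```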