# META TEMPORAL POINT PROCESSES

Published as a conference paper at ICLR 2023

Wonho Bae (University of British Columbia & Borealis AI), whbae@cs.ubc.ca
Mohamed Osama Ahmed (Borealis AI), mohamed.o.ahmed@borealisai.com
Frederick Tung (Borealis AI), frederick.tung@borealisai.com
Gabriel L. Oliveira (Borealis AI), gabriel.oliveira@borealisai.com

ABSTRACT

A temporal point process (TPP) is a stochastic process whose realization is a sequence of discrete events in time. Recent work models TPPs with neural networks in a supervised learning framework, where the training set is a collection of all the sequences. In this work, we propose to train TPPs in a meta learning framework, where each sequence is treated as a different task, via a novel framing of TPPs as neural processes (NPs). We introduce context sets to model TPPs as an instantiation of NPs. Motivated by attentive NPs, we also introduce local history matching to help learn more informative features. We demonstrate the potential of the proposed method on popular public benchmark datasets and tasks, and compare with state-of-the-art TPP methods.

1 INTRODUCTION

With the advancement of deep learning, there has been growing interest in modeling temporal point processes (TPPs) using neural networks. Although the community has developed many innovations in how neural TPPs encode the history of past events (Biloš et al., 2021) or how they decode these representations into predictions of the next event (Shchur et al., 2020; Lin et al., 2022), the general training framework for TPPs has been supervised learning, where a model is trained on a collection of all the available sequences. However, supervised learning is susceptible to overfitting, especially in high-noise environments, and generalization to new tasks can be challenging.

In recent years, meta learning has emerged as an alternative to supervised learning, as it aims to adapt or generalize well to new tasks, which resembles how humans can learn new skills from a few examples. Inspired by this, we propose to train TPPs in a meta learning framework. To this end, we treat each sequence as a task, since it is a realization of a stochastic process with its own characteristics. For instance, consider the pickup times of taxis in a city. The dynamics of these event sequences are governed by many factors such as location, weather, and the routine of a taxi driver, which implies that the pattern of each sequence can be significantly different from the others. Under the supervised learning framework, a trained model tends to capture the patterns seen in training sequences well, but it easily breaks on unseen patterns.

As the goal of modeling TPPs is to estimate the true probability distribution of the next event time given the previous event times, we employ neural processes (NPs), a family of model-based meta learning methods with stochasticity, to explain TPPs. In this work, we formulate neural TPPs as NPs by satisfying certain conditions of NPs, and term the result the Meta TPP. Inspired by the NP literature, we further propose the Meta TPP with a cross-attention module, which we refer to as the Attentive TPP. We demonstrate the strong potential of the proposed method through extensive experiments. Our contributions can be summarized as follows:

- To the best of our knowledge, this is the first work that formulates the TPP problem in a meta learning framework, opening up a new research direction in neural TPPs.
- Inspired by the NP literature, we present a conditional Meta TPP formulation, followed by a latent-path extension, culminating in our proposed Attentive TPP model.
- The experimental results show that our proposed Attentive TPP model achieves state-of-the-art results on four widely used TPP benchmark datasets, and is more successful in capturing periodic patterns on three additional datasets compared to previous methods.
- We demonstrate that our meta learning TPP approach can be more robust in practical deployment scenarios such as noisy sequences and distribution drift.

2 PRELIMINARIES

Neural processes. A general form of the optimization objective in supervised learning is

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{B \sim p(D)} \sum_{(x, y) \in B} \log p_\theta(y \mid x), \tag{1}$$

where $D := \{(x^{(i)}, y^{(i)})\}_{i=1}^{|D|}$ for an input $x$ and label $y$, and $B$ denotes a mini-batch of $(x, y)$ data pairs. Here, the goal is to learn a model $f$ parameterized by $\theta$ that maps $x$ to $y$ as $f_\theta : x \to y$.

In recent years, meta learning has emerged as an alternative to supervised learning as it aims to adapt or generalize well to new tasks (Santoro et al., 2016), which resembles how humans learn new skills from few examples. In meta learning, we define a meta dataset, a set of different tasks, as $M := \{D^{(i)}\}_{i=1}^{|M|}$. Here, $D^{(i)}$ is the dataset of the $i$-th task, consisting of a context set and a target set as $D := C \cup T$. The objective of meta learning is then

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{B_D \sim p(M)} \sum_{(C, T) \in B_D} \log p_\theta(Y_T \mid X_T, C), \tag{2}$$

where $B_D$ denotes a mini-batch of tasks, and $X_T$ and $Y_T$ represent the inputs and labels of a target set, respectively. Unlike supervised learning, the goal is to learn a mapping from $x$ to $y$ given $C$: more formally, $f_\theta(\cdot, C) : x \to y$.

Although meta learning is a powerful framework for learning fast adaptation to new tasks, it does not provide uncertainty for its predictions, which is becoming increasingly important in modern machine learning as a measure of the reliability of a model. To take uncertainty into account in meta learning, neural processes (NPs) have been proposed (Garnelo et al., 2018a;b). Instead of finding point estimates as regular meta learning models do, NPs learn a probability distribution of a label $y$ given an input $x$ and a context set $C$: $p_\theta(y \mid x, C)$.

In this work, we frame TPPs as meta learning instead of supervised learning, for the first time. To this end, we employ NPs to incorporate the stochastic nature of TPPs. In Section 3.1, we propose a simple modification of TPPs that connects them to NPs, which enables us to bring a rich line of work on NPs to TPPs, as described in Section 3.2 and Section 3.3.
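As a concrete illustration of the objective in Equation (2), the following is a minimal sketch, not the paper's implementation, of a conditional NP-style model: a permutation-invariant encoder pools the context set into a single representation, and a decoder maps a target input together with that representation to a predictive distribution $p_\theta(y \mid x, C)$. The module names, hidden sizes, and the Gaussian output head are illustrative assumptions.

```python
# Illustrative conditional NP sketch (assumed design, not the paper's code).
import torch
import torch.nn as nn

class ConditionalNP(nn.Module):
    def __init__(self, x_dim, y_dim, hid=128):
        super().__init__()
        # Encodes each (x, y) context pair independently before pooling.
        self.context_enc = nn.Sequential(
            nn.Linear(x_dim + y_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        # Maps (target input, pooled context) to Gaussian parameters.
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + hid, hid), nn.ReLU(), nn.Linear(hid, 2 * y_dim))

    def forward(self, xc, yc, xt):
        # xc: [Nc, x_dim], yc: [Nc, y_dim], xt: [Nt, x_dim]
        # Mean pooling makes the model invariant to the ordering of the context set.
        r = self.context_enc(torch.cat([xc, yc], dim=-1)).mean(dim=0)
        r = r.expand(xt.size(0), -1)
        mu, log_sigma = self.decoder(torch.cat([xt, r], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_sigma.exp().clamp(min=1e-3))

# Meta-learning objective for one task: maximize log p(Y_T | X_T, C).
model = ConditionalNP(x_dim=1, y_dim=1)
xc, yc = torch.randn(10, 1), torch.randn(10, 1)   # context set C
xt, yt = torch.randn(5, 1), torch.randn(5, 1)     # target set T
loss = -model(xc, yc, xt).log_prob(yt).sum()
loss.backward()
```

In practice the negative log-likelihood above would be averaged over a mini-batch of tasks, matching the expectation over $B_D \sim p(M)$ in Equation (2).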
Neural temporal point processes. TPPs are stochastic processes whose realizations are sequences of discrete events in time. In notation, a collection of event time sequences is defined as $D := \{s^{(i)}\}_{i=1}^{|D|}$, where $s^{(i)} = (\tau_1^{(i)}, \tau_2^{(i)}, \ldots, \tau_{L_i}^{(i)})$ and $L_i$ denotes the length of the $i$-th sequence. The study of TPPs started decades ago (Daley & Vere-Jones, 2003), but in this work we focus on neural TPPs, where TPPs are modeled using neural networks (Shchur et al., 2021). As described in Figure 1a, a general neural TPP consists of an encoder, which takes the sequence of previous event times and outputs a history embedding, and a decoder, which takes the history embedding and outputs the probability distribution of the time of the next event. Previous neural TPPs are modeled auto-regressively in a supervised learning framework. More formally, the objective of neural TPPs is

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{B \sim p(D)} \sum_{s^{(i)} \in B} \sum_{l=1}^{L_i - 1} \log p_\theta(\tau_{l+1}^{(i)} \mid \tau_{\le l}^{(i)}), \tag{3}$$

where $B \sim p(D)$ denotes a mini-batch of event time sequences. To frame TPPs as NPs, we need to define a target input and a context set, as in Equation (2), from the event time history $\tau_{\le l}$, which will be described in the following section.

Figure 1: Overall architectures of TPP models: (a) Neural TPP (encoder and decoder), (b) Conditional Meta TPP, and (c) Attentive TPP (with a latent path, cross attention, and weight sharing).

3 META TEMPORAL POINT PROCESS AND ITS VARIANTS

3.1 TEMPORAL POINT PROCESSES AS NEURAL PROCESSES

To frame TPPs as NPs, we treat each event time sequence $s$ as a task for meta learning, which intuitively makes sense since each sequence is a realization of a stochastic process. For instance, the transaction times of different account holders are very different from each other due to many factors, including an account holder's financial status and characteristics.

With this new definition of tasks, we define a target input and a context set for the conditional probability distribution of meta learning shown in Equation (2), using the previous event times $\tau_{\le l}$. There are many ways to define them, but the target input and the context set need to be semantically aligned, since the target input will become an element of the context set for the next event time prediction. Hence, we define the target input for $\tau_{l+1}$ as the latest local history $\tau_{l-k+1:l}$, where $k$ is the window size of the local history. Similarly, the context set for $\tau_{l+1}$ is defined as $C_l := \{\tau_{t-k+1:t}\}_{t=1}^{l-1}$. Here, if $t - k \le 0$, we include event times starting from $\tau_1$. With a Transformer structure, it is easy to efficiently compute the feature embeddings for the context set $C$. Figure 1b shows a schematic of the Conditional Meta TPP with a mask (shaded) used for an example case of 5 event times with a local history window size of $k = 3$. The feature embedding $r_l$ then contains the information of $\tau_{l-k+1:l}$.

With these notations for target inputs and context sets, we propose the objective of TPPs in a meta learning framework as

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{B \sim p(D)} \sum_{s^{(i)} \in B} \sum_{l=1}^{L_i - 1} \log p_\theta(\tau_{l+1}^{(i)} \mid \tau_{l-k+1:l}^{(i)}, C_l^{(i)}). \tag{4}$$

Note that we have only one target label $\tau_{l+1}^{(i)}$ to predict per event, unlike the general meta learning objective in Equation (2), where usually $|T| > 1$. This is because TPP models are in general trained to predict the next event time. Modeling TPPs to predict multiple future event times would be interesting future work, but it is out of the scope of this work.
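To make the construction above concrete, here is a small sketch, illustrative rather than the paper's code, of how the local-history target input $\tau_{l-k+1:l}$, the context set $C_l$, and the label $\tau_{l+1}$ could be assembled from a single event sequence. The function names and the toy sequence are assumptions.

```python
# Illustrative construction of target input, context set, and label (assumed helper names).
from typing import List, Tuple

def local_history(times: List[float], t: int, k: int) -> List[float]:
    """Return tau_{t-k+1:t} (1-indexed), truncated at the start of the sequence."""
    return times[max(t - k, 0):t]

def build_task(times: List[float], l: int, k: int) -> Tuple[List[float], List[List[float]], float]:
    """Target input, context set C_l, and label for predicting tau_{l+1}."""
    target_input = local_history(times, l, k)                    # tau_{l-k+1:l}
    context = [local_history(times, t, k) for t in range(1, l)]  # {tau_{t-k+1:t}}_{t=1}^{l-1}
    label = times[l]                                             # tau_{l+1}
    return target_input, context, label

times = [0.4, 1.1, 1.9, 2.5, 3.2, 4.0]   # toy event time sequence
x, C, y = build_task(times, l=4, k=3)
print(x)  # [1.1, 1.9, 2.5]                      (tau_2, tau_3, tau_4)
print(C)  # [[0.4], [0.4, 1.1], [0.4, 1.1, 1.9]] (windows truncated at tau_1)
print(y)  # 3.2                                  (tau_5)
```

In practice these windows need not be materialized explicitly; as noted above, a Transformer with an appropriate mask computes the feature embeddings for the context set efficiently.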
Requirements for neural processes. Let $X_T := \{x_i\}_{i=1}^{|T|}$ and $Y_T := \{y_i\}_{i=1}^{|T|}$ be a set of target inputs and labels, respectively, and let $\pi$ be an arbitrary permutation of a set. To design NP models, the following two conditions must be satisfied.

Condition 3.1 (Consistency over a target set). A probability distribution $p_\theta$ is consistent if it is consistent under permutation, $p_\theta(Y_T \mid X_T, C) = p_\theta(\pi(Y_T) \mid \pi(X_T), C)$, and under marginalization, $p_\theta(y_{1:m} \mid X_T, C) = \int p_\theta(y_{1:n} \mid X_T, C) \, dy_{m+1:n}$ for any positive integers $m < n$.

Condition 3.2 (Permutation invariance over a context set). $p_\theta(Y_T \mid X_T, C) = p_\theta(Y_T \mid X_T, \pi(C))$.

According to the Kolmogorov extension theorem (Oksendal, 2013), a collection of finite-dimensional distributions defines a stochastic process if Condition 3.1 is satisfied. In the NP literature, Condition 3.1 is satisfied through factorization: target labels are assumed to be independent of each other given a target input and a context set $C$, in other words, $p_\theta(Y_T \mid X_T, C) = \prod_{i=1}^{|T|} p_\theta(y_i \mid x_i, C)$