# A Multitask Point Process Predictive Model

Wenzhao Lian1 WL89@DUKE.EDU
Ricardo Henao1 R.HENAO@DUKE.EDU
Vinayak Rao2 VARAO@PURDUE.EDU
Joseph Lucas3 JOE@STAT.DUKE.EDU
Lawrence Carin1 LCARIN@DUKE.EDU

1Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA
2Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
3Center for Predictive Medicine, Duke Clinical Research Institute, Durham, NC 27708, USA

Abstract

Point process data are commonly observed in fields like healthcare and the social sciences. Designing predictive models for such event streams is an under-explored problem, due to often scarce training data. In this work we propose a multitask point process model, leveraging information from all tasks via a hierarchical Gaussian process (GP). Nonparametric functions implemented by a GP, mapping from past events to future rates, allow analysis of flexible arrival patterns. To facilitate efficient inference, we propose a sparse construction for this hierarchical model, and derive a variational Bayes method for learning and inference. Experimental results are shown on both synthetic data and real electronic health-records data.

1. Introduction

Point process data have seen increased attention in fields like biomedical research (Rad & Paninski, 2011; Lian et al., 2014), electronic commerce (Xu et al., 2014), and healthcare analysis (Lasko, 2014). One thread of work focuses on learning arrival rates by imposing smoothness on a latent rate function (Adams et al., 2009; Rao & Teh, 2011; Lloyd et al., 2014). Another consists of predicting future arrivals as a direct function of past observations (Pillow et al., 2008; Gunawardana et al., 2011). Taking healthcare analysis as a motivating example, we focus on the latter problem: given a patient's hospital visit history up to time t, (i) when will the next visit happen? and (ii) how many visits will the patient have in [t, t + L]? Answering such questions provides a quantitative evaluation of the patient's risk, which helps to make treatment plans and allocate hospital resources efficiently (Amarasingham et al., 2010). Similar problems also arise in other fields, such as predicting purchasing behavior for individual customers, or predicting failures in distributed computer systems.

A few works have explored the prediction problem in point processes by learning a functional mapping from history features to the current intensity rate (Pillow et al., 2008; Rajaram et al., 2005; Gunawardana et al., 2011). In Gunawardana et al. (2011), the intensity function is constrained to be piecewise-constant, learned using decision trees, and used for prediction. Our proposed multitask point process model builds upon this piecewise-constant intensity model. To allow for flexibility of the mapping from history features to future rate, and to capture the uncertainty of estimation, we use a nonparametric method, imposing a Gaussian process (GP) prior on the intensity rate. However, when building such predictive models for event arrival processes, one difficulty is that the available training data are scarce for each subject/task.
Therefore, we treat each individual arrival process as a task and follow a multitask learning approach to share information from all tasks in a hierarchical manner. Methods for learning GPs from multiple tasks have been proposed (Yu et al., 2005), but involve a shared global mean function, inferred at all observed inputs (history features) across all the tasks. The posterior of this function cannot be directly updated due to non-conjugacy of the point process likelihood to GP priors. One approximation method, variational Bayes, is often applied, leading to the number of unknown parameters scaling as O(N²), where N is the number of unique features from all tasks. Borrowing from the framework of pseudo inputs in the GP literature (Snelson & Ghahramani, 2006; Titsias, 2009), we constrain the rate functions to an M-dimensional latent space, where M ≪ N, reducing the parameter space to O(M² + MP), where P is the GP input dimension. By adjusting the locations of the pseudo inputs, we can effectively share data across tasks and efficiently represent these functions. Learning the history-to-rate mapping functions allows one to predict future events by analytical integration or forward sampling. We consider both, evaluating our model and inference methodology on both synthetic data and a real Electronic Health Records (EHR) dataset. The latter involves different categories of health problems, and we demonstrate that future hospital visits for some types of diseases are predictable even with simple history features.

Our work has two main contributions: (i) providing an efficient approach to share data/parameters in hierarchical/multitask GP models; and (ii) building a predictive model for arrival data from multiple event streams, using point processes in a multitask scheme.

2. Related Work

GP-modulated point processes are a popular approach for modeling event streams (Adams et al., 2009; Rao & Teh, 2011; Lloyd et al., 2014; Lasko, 2014). Assuming a smoothly varying intensity function, the intensity rate, as well as its uncertainty, can be estimated from observed streams. Extrapolation can be used for short-term prediction. There are two main limitations of these models. First, the smoothness assumption does not hold in many scenarios. Sudden rate changes often happen upon event arrivals, e.g., the risk of a patient's hospital visit might change significantly after a single visit. Second, these models involve a common modulating GP, so that multiple streams have to be aligned, something not always appropriate or possible. For example, similar arrival patterns might appear at different periods of time for different streams, which cannot be captured by such models.

Another relevant line of work on point process predictive modeling comes from the neural decoding literature (Kulkarni & Paninski, 2007; Rad & Paninski, 2011; Pillow et al., 2008), where multiple subjects/neurons are affected by common stimuli (features), generating event streams. A generalized linear model can be trained on the stimuli or spiking history to learn the rate function, and further predict future spiking events. Meanwhile, the network structure across neurons can be inferred, which in turn helps with prediction. However, temporal alignment is also assumed in these models, which is valid in the setting of neural decoding, but not in the cases we consider.
Also related is the work of Weiss & Page (2013), which integrates the multiplicative forest Continuous Time Bayesian Network (mfCTBN) with the piecewise-constant intensity model of Gunawardana et al. (2011). More precisely, they learn forests mapping from demographic and event history features to intensity rates, but under the assumption that all subjects share the same function, i.e., each event stream is a realization trajectory of one underlying model. In our work, we consider a different setting, where the variability across subjects cannot be ignored. This difference is crucial in scenarios such as healthcare analysis and electronic commerce, where variability among members of a population affects arrival pattern discovery and predictive performance.

3. Piecewise-constant Conditional Intensity Model

Multi-task point process observations are sequences of arrival time stamps $\{y^u_n\}$, with $n = 1, \dots, D^u$ and $u = 1, \dots, U$, where $y^u_n$ represents the time stamp of the $n$-th arrival of subject/task $u$. Our goal is to model the sequences and make predictions for future event arrivals. The event streams can be naturally modeled using an intensity model, with a hazard rate function $\gamma(t)$. In an infinitesimal time interval of length $\Delta$, the probability of an event occurring in this interval is given by $\gamma(t)\Delta$. To build the temporal dependency between past observations and future events, we choose the hazard rate as a function of past observations $h(t)$, denoted as $\gamma(t\,|\,h(t))$, in which $h(t)$ summarizes the history of past observations (Gunawardana et al., 2011); details on the potential form of $h(t)$ are discussed subsequently. Denoting $\mathcal{Y}^u = \{y^u_1, \dots, y^u_{D^u}\}$ as the set of arrival time stamps in task $u$, the likelihood is

$$\prod_{n=1}^{D^u} \gamma^u(y^u_n\,|\,h^u(y^u_n)) \, \exp\Big(-\int \gamma^u(\tau\,|\,h^u(\tau))\, d\tau\Big)\,.$$

Assuming $\gamma^u(t\,|\,h^u(t))$ is piecewise constant with $N^u$ change points (pieces) at $\{t^u_i\}_{i=1}^{N^u}$, and piece lengths $\Delta^u_i = t^u_{i+1} - t^u_i$, we have the likelihood

$$\prod_{i=1}^{N^u} \gamma^u(t^u_i\,|\,h^u(t^u_i))^{\mathbb{I}(t^u_i \in \mathcal{Y}^u)} \, \exp\big\{-\Delta^u_i\, \gamma^u(t^u_i\,|\,h^u(t^u_i))\big\}\,. \qquad (1)$$

Many approaches exist to extract features $h(t)$ from past event arrivals. One possible feature-construction approach uses empirical rates at recent time points (Rajaram et al., 2005), i.e., $h^u(t^u_i) \in \mathbb{R}^P_+$ with $h^u_p(t^u_i) = \hat{\gamma}(t^u_i - L_p)$ for predefined lengths of memory $L_1, L_2, \dots, L_P$. Here $\hat{\gamma}(t^u_i - L_p)$ refers to the empirically computed rate at time $t^u_i - L_p$. Keeping the algorithm simple, we adopt a construction similar in spirit to Gunawardana et al. (2011): $h^u(t^u_i)$ is $P$-dimensional, where its $p$-th element denotes the count of arrivals in $[t^u_i - L_p, t^u_i]$. Therefore, $h^u(t)$ is a piecewise-constant function. Revisiting the likelihood in (1), we observe that $N^u$ is the number of change points (pieces) of the feature function $h^u(t)$ going through the whole event stream $\{y^u_n\}_{n=1}^{D^u}$, and that $N^u$ scales linearly with $D^u$ (the number of events). Side information in the form of covariates can also be considered as a natural extension, by augmenting the feature space. Constructing richer features from history observations is left as an interesting direction for future work (a brief sketch of the count-based construction used here is given below).
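For illustration only, the following minimal sketch (not from the paper) computes the count-based features h(t) for a single event stream; the helper name and the memory lengths are assumptions made for the example.

```python
import numpy as np

def history_features(event_times, t, memory_lengths=(1.0, 5.0)):
    """Count-based history features: the p-th entry is the number of
    arrivals in [t - L_p, t], following the construction in Section 3.
    `memory_lengths` is an illustrative choice of L_1, ..., L_P."""
    ev = np.asarray(event_times)
    return np.array([np.sum((ev >= t - L) & (ev <= t)) for L in memory_lengths],
                    dtype=float)

# Example: a toy stream with arrivals at the listed times.
stream = [0.4, 2.1, 2.3, 5.7, 6.0]
print(history_features(stream, t=6.5))   # counts in [5.5, 6.5] and [1.5, 6.5] -> [2., 4.]
```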
To complete the intensity model, the functional mapping from feature space to intensity rate must be specified. Define the space of the history features $h^u(t^u_i)$ as $\mathcal{H}$, where each point $h^u(t^u_i) \equiv h^u_i \in \mathcal{H}$ is a possible feature vector. Here we impose a GP prior on a set of functions $f^u(\cdot) : \mathcal{H} \rightarrow \mathbb{R}$, followed by a transformation to ensure the non-negativity of intensity rates. We use a square transformation ($\gamma^u(\cdot) = \{f^u(\cdot)\}^2$) rather than the common $\gamma^u(\cdot) = \exp\{f^u(\cdot)\}$, because the uncertainty of $f^u(\cdot)$ cannot be properly estimated in the latter construction, as discussed by Lloyd et al. (2014); this issue is addressed thoroughly in Section 5. To resolve the ambiguity caused by $\{f^u(\cdot)\}^2 = \{-f^u(\cdot)\}^2$, a prior specification favoring the positive half-space is imposed and works well in practice.

4. Multitask Point Process Predictive Model

As mentioned in Section 1, sharing information across tasks is necessary when learning the rate functions from sparse data. Accordingly, we consider a hierarchical GP construction. Denote the rate at feature vector $h^u_i$ as $\gamma^u_i$, with the corresponding transformed rate denoted as $f^u_{N,i}$. The generative process from history features to rates may be described as follows:

$$\mu^0_N \sim \mathcal{GP}\big(g, \tfrac{1}{\xi} K_{NN}\big)\,, \qquad (2)$$
$$f^u_N \sim \mathcal{N}\big(\mu^0_N, K_{NN}\big)\,, \qquad (3)$$
$$\gamma^u_i = \{f^u_{N,i}\}^2\,. \qquad (4)$$

In (2), $g$ is a prior mean, which may be set according to prior knowledge or to the empirical average intensity rate over all tasks. The parameter $\xi$ controls the complexity of the hyperprior. The covariance matrix $K_{NN}$ can be obtained from a squared-exponential or Matérn covariance function with automatic relevance determination (ARD), for example,

$$K_{NN,ij} = \tau^2 \exp\Big(-\sum_{p=1}^{P} \frac{(h_{pi} - h_{pj})^2}{2\lambda_p^2}\Big) + \sigma^2\, \mathbb{I}(i = j)\,, \qquad (5)$$

where $h_i \in \mathcal{H}$ may be any possible history feature. In (2), $\mu^0_N$ determines the global mean function of the transformed rate functions, $f^u_N$, from all $U$ tasks. For each individual task, the task-specific transformed rate function, $f^u_N$, is a noisy version of the global mean function $\mu^0_N$. Our construction is inspired by Yu et al. (2005), and can be shown to be analogous to a hierarchical construction of the multitask linear regression model

$$w^0 \sim \mathcal{N}\big(v, \tfrac{1}{\xi} I\big)\,, \qquad (6)$$
$$w^u \sim \mathcal{N}\big(w^0, I\big)\,, \qquad (7)$$
$$f^u_{N,i} \sim \mathcal{N}\big(w^{u\top} h_i, \sigma^2\big)\,, \qquad (8)$$

where $w^0$ and $w^u$ are the global and task-specific regression parameters, respectively, and $v$ is the hypermean for $w^0$. Choosing a linear kernel for (2) and (3) ($K_{NN,ij} = h_i^\top h_j$), and letting $v = g = 0$, the construction through (2)-(3) is equivalent to the multitask linear regression model in (6)-(8) on a finite observed dataset.

If there exists enough training data, i.e., $\{h^u_i\}_{i=1}^{N^u}$ densely covers the feature space $\mathcal{H}$, $f^u_N$ is mainly determined by the data in task $u$. Otherwise, $f^u_N$ is significantly affected by the prior $\mu^0_N$, which pools together information from the rate functions $\{f^u_N\}$ of all tasks.
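To make (2)-(4) concrete, the short sketch below draws a global mean and three task-specific transformed rate vectors at a finite set of history features, then squares them to obtain intensities. This is an illustration, not the paper's implementation; the ARD kernel form and all hyperparameter values are assumptions.

```python
import numpy as np

def ard_se_kernel(H1, H2, tau=1.2, lam=(4.0, 4.0), sigma2=0.0):
    """Squared-exponential ARD kernel between feature sets H1 (N1 x P) and H2 (N2 x P)."""
    d = (H1[:, None, :] - H2[None, :, :]) / np.asarray(lam)
    K = tau**2 * np.exp(-0.5 * np.sum(d**2, axis=2))
    if sigma2 > 0 and H1 is H2:
        K += sigma2 * np.eye(len(H1))        # diagonal noise term, as in eq. (5)
    return K

rng = np.random.default_rng(0)
H = rng.integers(0, 6, size=(20, 2)).astype(float)   # 20 count-valued history features, P = 2
K = ard_se_kernel(H, H, sigma2=1e-4)
g, xi = 0.5, 2.0                                      # prior mean and hyperprior strength (illustrative)

mu0 = rng.multivariate_normal(g * np.ones(len(H)), K / xi)   # global mean function, eq. (2)
rates = []
for u in range(3):                                           # three tasks
    f_u = rng.multivariate_normal(mu0, K)                    # task-specific function, eq. (3)
    rates.append(f_u**2)                                     # squared link gives intensities, eq. (4)
```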
However, a problem with the hierarchical GP construction using (2) and (3) is that the dimension of $\mu^0_N$ and $f^u_N$ is typically massive; in fact, it is no smaller than the number of unique history features from all tasks, i.e., $|\cup_u \{h^u_i\}_{i=1}^{N^u}|$, where $|S|$ refers to the cardinality of a set $S$. Thus, the number of unknown parameters in the model scales as $O(U\sum_{u=1}^{U} N^u)$, or $O(U(\sum_{u=1}^{U} N^u)^2)$ if variance estimation is also considered. In point process models in particular, this is impractical because the influence of the likelihood is weak and noise levels are high. Therefore, we propose an alternative approach inspired by the pseudo-input framework for GPs (Snelson & Ghahramani, 2006; Titsias, 2009). Specifically, we propose a two-step process by introducing an $M$-dimensional vector $f^u_M$ for each task, where $M \ll |\cup_u \{h^u_i\}_{i=1}^{N^u}|$. Then, using a so-called conditional Gaussian process, we define the generative process as

$$\mu^0_M \sim \mathcal{N}\big(g\mathbf{1}, \tfrac{1}{\xi} K_{MM}\big)\,, \qquad (9)$$
$$f^u_M \sim \mathcal{N}\big(\mu^0_M, K_{MM}\big)\,, \qquad (10)$$
$$f^u_N \,|\, f^u_M \sim \mathcal{GP}\Big(g + K^u_{NM} K_{MM}^{-1}(f^u_M - g),\; \big(1 + \tfrac{1}{\xi}\big)\big(K^u_{NN} - K^u_{NM} K_{MM}^{-1} K^{u\top}_{NM}\big)\Big)\,, \qquad (11)$$

where $f^u_N$ contains the transformed rates at all possible feature vectors $\{h^u_i\}_{i=1}^{N^u}$ in task $u$, while $f^u_M$ is an $M$-dimensional vector consisting of the transformed rates at $M$ feature-vector locations $\{s_i\}_{i=1}^{M}$. In (11), for each task $u$, only the function values at feature vectors appearing in task $u$ are required, while in (3), for each task, the function values at all feature vectors appearing in all tasks need to be specified.

Figure 1. Graphical model representation (filled circles are observed, empty circles denote latent variables, rectangles refer to model parameters, and the rest are free hyper-parameters).

Following the GP literature (Titsias, 2009), we refer to $\{s_i\}$ as pseudo inputs. These need not appear in any of the tasks (or even in $\mathcal{H}$); we only require that a distance can be properly defined between $s_i$ and $h^u_j$. Similar to (5), $K_{MM}$ and $K^u_{NM}$ can be obtained through the squared-exponential or Matérn ARD kernel, with $h^u_j$ and $s_i$ as covariates. Specifically,

$$K_{MM,ij} = \tau^2 \exp\Big(-\sum_{p=1}^{P} \frac{(s_{pi} - s_{pj})^2}{2\lambda_p^2}\Big) + \sigma^2\, \mathbb{I}(i = j)\,, \qquad
K^u_{NM,ij} = \tau^2 \exp\Big(-\sum_{p=1}^{P} \frac{(h^u_{pi} - s_{pj})^2}{2\lambda_p^2}\Big)\,.$$

It can be shown that the construction through (9)-(11) results in the same marginal prior distribution for $f^u_N$ as (2)-(3) (see the supplemental material), while having a significantly reduced computational cost. Overfitting can also be alleviated because model complexity is reduced. This is especially beneficial when sophisticated features are constructed, resulting in very large history feature spaces, with the rate function nonetheless likely living on a low-dimensional manifold. As a result, $\{s_i, f^u_{M,i}\}_{i=1}^{M}$ is assumed to capture the characteristics of the function $\{h^u_i, f^u_{N,i}\}_{i=1}^{N^u}$, while achieving computational savings and improved estimation accuracy. The proposed full generative model is summarized by equations (9)-(11), (4) and (1), and is depicted in Figure 1.

5. Inference

Model parameters $\Theta$ include the pseudo-input locations, the global mean of transformed rates, and the GP hyperparameters: $\Theta = \{\{s_m\}_{m=1}^{M}, \mu^0_M, \sigma^2, \tau^2, \{\lambda_p\}_{p=1}^{P}\}$. Obtaining a full posterior distribution for $\mu^0_M$ is straightforward due to local conjugacy. However, we observed no significant difference in performance compared to a point estimate and, for easier interpretability, focus on the latter. Maintaining $p(f^u_M, f^u_N\,|\,\mathcal{Y}, \Theta)$, the full posterior distribution over the intensity functions of each task, is important for transfer learning across tasks with different numbers of observations. However, this is not straightforward due to the non-conjugate point process likelihood; borrowing ideas from variational learning for sparse GPs (Lloyd et al., 2014; Titsias, 2009), we propose a variational form to approximate the posterior $p(f^u_M, f^u_N\,|\,\mathcal{Y}, \Theta)$. Letting $\mathcal{Y}$ refer to the complete collection of event streams $\mathcal{Y}^1, \dots, \mathcal{Y}^U$, we use

$$q(f^u_N, f^u_M) = p(f^u_N\,|\,f^u_M)\, q(f^u_M) = p(f^u_N\,|\,f^u_M)\, \mathcal{N}(f^u_M; \mu^u, \Sigma^u)\,. \qquad (12)$$

Note that we allow a free-form Gaussian distribution for $f^u_M$, the task-specific transformed rates at the pseudo inputs. However, the transformed rates evaluated at all features in each task, $f^u_N$, are constrained by the low-dimensional function $f^u_M$ via (11). In the following, we use the simplified notation $q^u$ to denote $q(f^u_M) = \mathcal{N}(f^u_M; \mu^u, \Sigma^u)$.
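Both the prior in (11) and the variational form in (12) rely on the same conditional Gaussian. The following minimal sketch computes its moments under the covariance reconstructed in (11); the kernel hyperparameters and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def ard_se(X1, X2, tau, lam):
    """Squared-exponential ARD kernel (same convention as eq. (5), no noise term)."""
    d = (X1[:, None, :] - X2[None, :, :]) / np.asarray(lam)
    return tau**2 * np.exp(-0.5 * np.sum(d**2, axis=2))

def conditional_moments(H_u, S, f_M, g, xi, tau=1.0, lam=(1.0, 1.0)):
    """Mean and covariance of f_N | f_M in eq. (11):
    mean = g + K_NM K_MM^{-1} (f_M - g),
    cov  = (1 + 1/xi) (K_NN - K_NM K_MM^{-1} K_NM^T)."""
    K_MM = ard_se(S, S, tau, lam) + 1e-6 * np.eye(len(S))   # jitter for numerical stability
    K_NM = ard_se(H_u, S, tau, lam)
    K_NN = ard_se(H_u, H_u, tau, lam)
    A = np.linalg.solve(K_MM, K_NM.T).T                     # K_NM K_MM^{-1}
    mean = g + A @ (f_M - g)
    cov = (1.0 + 1.0 / xi) * (K_NN - A @ K_NM.T)
    return mean, cov
```

Only kernel evaluations between a task's observed features and the M pseudo inputs are needed, which is where the computational savings of the sparse construction come from.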
Because of the special factorized form in (12), $\{\mu^u, \Sigma^u\}$ are the only variational parameters in the algorithm. The inference objective is to maximize the variational lower bound (Beal, 2003), the so-called evidence lower bound (ELBO):

$$\log p(\mathcal{Y}, \Theta) \geq \log p(\Theta) + \sum_{u=1}^{U} \mathbb{E}_{q(f^u_N, f^u_M)}\Big[\log \frac{p(\mathcal{Y}^u, f^u_N, f^u_M)}{q(f^u_N, f^u_M)}\Big]\,. \qquad (13)$$

Note that in (13), $q(f^u_N, f^u_M)$ and $p(\mathcal{Y}^u, f^u_N, f^u_M)$ implicitly depend on $\Theta$. Since we only impose a Gaussian prior on $\mu^0_M$ in (9), $\log p(\Theta)$ simplifies to $\log p(\mu^0_M)$. Priors on other model parameters may also be imposed, e.g., a log-normal prior on the GP hyper-parameters, but here we learn their maximum likelihood estimates (MLE) instead. To show explicit dependence, we denote the ELBO as $\mathcal{F}(q^1, \dots, q^U, \Theta)$. We maximize $\mathcal{F}(q^1, \dots, q^U, \Theta)$ in (13), giving a variational Expectation Maximization (EM) algorithm guaranteed to converge to a local optimum (Beal, 2003). In practice, we alternate between a variational E-step, where $\Theta$ is fixed and $\mathcal{F}$ is maximized w.r.t. $\{q^u\}_{u=1}^{U}$, and a variational M-step, where $\{q^u\}_{u=1}^{U}$ is fixed and $\mathcal{F}$ is maximized w.r.t. $\Theta$. Derivation details are standard; we list the key steps below:

$$\mathcal{F}(q^1, \dots, q^U, \Theta) = \underbrace{\sum_{u=1}^{U} \mathbb{E}_{q^u}\Big[\mathbb{E}_{p(f^u_N|f^u_M)}[\log p(\mathcal{Y}^u\,|\,f^u_N)]\Big]}_{F_1} + \underbrace{\sum_{u=1}^{U} \mathbb{E}_{q^u}\Big[\log \frac{p(f^u_M\,|\,\mu^0_M)}{q(f^u_M\,|\,\mu^u, \Sigma^u)}\Big]}_{F_2} + \underbrace{\log p(\mu^0_M)}_{F_3}\,. \qquad (14)$$

The first term, $F_1$ in (15), measures how well the functions specified by $q^u$ (determining the distribution of function values at the pseudo inputs) fit the observations. Because of the sparsity assumption of the conditional Gaussian process, we can integrate out $f^u_N$ and leave $F_1$ as a function of only $\{\mu^u, \Sigma^u\}$ and $\Theta$:

$$F_1 = \sum_{u=1}^{U} \int q(f^u_N\,|\,\mu^u, \Sigma^u)\, \log p(\mathcal{Y}^u\,|\,f^u_N)\, df^u_N = \sum_{u=1}^{U} \sum_{i=1}^{N^u} \Big\{\mathbb{I}(t^u_i \in \mathcal{Y}^u)\, \mathbb{E}[\log (f^u_{N,i})^2] - \Delta^u_i\, \mathbb{E}[(f^u_{N,i})^2]\Big\}\,. \qquad (15)$$

The expectation is w.r.t. $q(f^u_N\,|\,\mu^u, \Sigma^u) = \mathcal{N}(b^u, B^u)$, with parameters given in (16) and (17):

$$b^u = g + K^u_{NM} K_{MM}^{-1} (\mu^u - g)\,, \qquad (16)$$
$$B^u = \big(1 + \tfrac{1}{\xi}\big)\big(K^u_{NN} - K^u_{NM} K_{MM}^{-1} K^{u\top}_{NM}\big) + K^u_{NM} K_{MM}^{-1} \Sigma^u \big(K^u_{NM} K_{MM}^{-1}\big)^{\top}\,. \qquad (17)$$

$\mathbb{E}[\log (f^u_{N,i})^2]$ in (15) can be calculated using confluent hypergeometric functions (Lloyd et al., 2014). Further details can be found in the Supplemental Material, where we also provide a robust approximation to tackle the well-known numerical instability in confluent hypergeometric function evaluations (Ancarani & Gasaneo, 2008). As a side note on why we prefer the squared transformation over the exponential: when using the latter, $F_1$ is modified as in (18), denoted $\tilde{F}_1$:

$$\tilde{F}_1 = \sum_{u=1}^{U} \sum_{i=1}^{N^u} \Big\{\mathbb{I}(t^u_i \in \mathcal{Y}^u)\, \mathbb{E}[f^u_{N,i}] - \Delta^u_i\, \mathbb{E}[\exp(f^u_{N,i})]\Big\}\,. \qquad (18)$$

Now the variance of $f^u_{N,i}$ does not affect $\mathbb{E}[f^u_{N,i}]$ at event times $t^u_i \in \mathcal{Y}^u$ (Lloyd et al., 2014), but only contributes to the second term through $\mathbb{E}[\exp(f^u_{N,i})]$. This leads to instability issues during inference. We also implemented the algorithm using this transformation, but the estimated function $f^u$ diverged even with strong prior constraints.

$F_2$, as specified in (19), penalizes the task-specific function's deviation from the global mean function $\mu^0_M$. This is especially important for tasks with little training data, e.g., short event streams or few event arrivals:

$$F_2 = -\frac{U}{2}\log|K_{MM}| + \frac{1}{2}\sum_{u=1}^{U}\log|\Sigma^u| - \sum_{u=1}^{U}\frac{1}{2}\,\mathrm{tr}\Big[K_{MM}^{-1}\big(\mu^u\mu^{u\top} + \Sigma^u + \mu^0_M\mu^{0\top}_M - 2\mu^u\mu^{0\top}_M\big)\Big]\,. \qquad (19)$$

$F_3$ involves the hyper-prior on the global function $\mu^0_M$, with $\xi$ controlling the belief strength:

$$F_3 = \frac{1}{2}\Big\{\log\big|\xi K_{MM}^{-1}\big| - \mathrm{tr}\big(\xi K_{MM}^{-1}(\mu^0_M - g)(\mu^0_M - g)^{\top}\big)\Big\}\,.$$

Having defined the variational objective, we can maximize it alternately w.r.t. $\{\mu^u, \Sigma^u\}$ and $\Theta$.
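For intuition, the sketch below assembles one task's contribution to F1 from the marginal moments in (16)-(17). E[(f)²] = b² + B_ii is exact; E[log f²] is approximated here by Monte Carlo rather than by the confluent hypergeometric expression used in the paper. All variable names are illustrative.

```python
import numpy as np

def f1_term(b, B_diag, is_event, delta, n_mc=20000, rng=None):
    """One task's contribution to F1 in (15), given b_i = E[f_{N,i}] and
    B_ii = Var[f_{N,i}] from (16)-(17). `is_event` marks pieces whose left
    endpoint is an arrival; `delta` holds the piece lengths."""
    rng = rng or np.random.default_rng(0)
    e_f2 = b**2 + B_diag                                   # exact second moment
    samples = b + np.sqrt(B_diag) * rng.standard_normal((n_mc, len(b)))
    e_logf2 = np.log(samples**2).mean(axis=0)              # Monte Carlo stand-in
    return float(np.sum(np.where(is_event, e_logf2, 0.0) - delta * e_f2))

# Toy usage with two pieces:
f1 = f1_term(b=np.array([0.8, 1.1]), B_diag=np.array([0.2, 0.3]),
             is_event=np.array([True, False]), delta=np.array([0.5, 1.0]))
```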
In the variational E-step, we update the variational parameters $\mu^u$ and $\Sigma^u$ using gradient descent methods. As a practical consideration, to preserve the positive definiteness of $\Sigma^u$, we instead optimize over its lower Cholesky factor $L^u$, where $\Sigma^u = L^u L^{u\top}$, with positivity constraints on the diagonal elements. In the variational M-step, updates for $\mu^0_M$ are obtained in closed form due to local conjugacy. Gradient methods are needed for updating the other parameters, including the locations of the pseudo inputs and the GP hyperparameters (see the Supplemental Material for details). When the number of tasks is large, the algorithm can be readily distributed over a cluster, with the E-step optimization procedures for the tasks distributed over individual cores/machines.

6. Experiments

We evaluate our model on both synthetic data and an EHR dataset. At any time t, the basic challenge is to predict the event arrival patterns in the interval [t, t + L]. One baseline is a simplified model where each task is learned independently (fixing $\mu^0_M$ in (10) to $g$), used to demonstrate the benefit of sharing across tasks. We refer to our proposed model as MTPP (multitask point process model), and to the simplified model as IPP (independent point process model). Another baseline is Poisson regression (PoiR), specialized for the prediction problem considered. Here, at each time stamp t, using observations in a window of length $L_P$ (the maximum memory length), we calculate the history features h(t) and treat them as predictors for the number of arrivals in the succeeding window [t, t + L]. While this is not a generative process, the Poisson regression model can be trained using observations and tested on prediction tasks.

6.1. Synthetic Experiments

For the first experiment, we synthesize a rate function for each of U = 10 tasks using a two-step process. First, the rates at M = 10 feature locations (pseudo inputs) are generated from a common Gaussian distribution. The rate function is then drawn conditioned on these, as specified by (11). The features constructed at time t are the event counts in the intervals [t-5, t] and [t-1, t]. The GP hyperparameters are set as λ1 = λ2 = 4, τ = 1.2, and σ = 0.01, to produce moderate rate functions, e.g., the empirical average rate over each event stream is between 0.05 and 1.5. Using the conditional intensity model introduced in Section 3, we generate event streams for each of the U tasks. We train the models using observations up to time T, and perform forward prediction.

For inference, the pseudo points were initialized using K-means clustering over the history features appearing in the training set. The GP hyperparameters are initialized by setting the length-scale parameters, {λp}, to the standard deviation of the observed feature vectors; the magnitude, τ, to the standard deviation of the square roots of the empirical rates (computed via binning); and the noise term, σ², to a small value, 0.01τ². The algorithm converges within 20 iterations, after which the relative successive increase of the variational objective is negligible.
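A minimal sketch of the initialization just described, assuming scikit-learn's KMeans is available; the function name and the source of the binned empirical rates are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def initialize(H_train, empirical_rates, M=10, seed=0):
    """Initialize pseudo inputs and GP hyperparameters as in Section 6.1.
    H_train: (N, P) history features from the training set;
    empirical_rates: binned rate estimates, used only to set scales."""
    s = KMeans(n_clusters=M, n_init=10, random_state=seed).fit(H_train).cluster_centers_
    lam = H_train.std(axis=0)                     # length scales: feature std devs
    tau = np.sqrt(empirical_rates).std()          # magnitude: std of sqrt empirical rates
    sigma2 = 0.01 * tau**2                        # small noise term
    return s, lam, tau, sigma2
```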
Figure 2. Left: Predictive log-likelihood per unit time; Right: mean normalized estimation error of intensity rates.

Model fit: We first evaluate the model and inference by comparing MTPP and IPP with the ground-truth model. We vary the observation length T of the training set; the experiments for each setting are repeated for 15 runs, with means and standard deviations reported. The left panel of Figure 2 compares the predictive log-likelihood per unit time (data log-likelihood discounted by event stream length) on unseen streams. We see that with an increasing length of training observations, the predictive log-likelihood of both MTPP and IPP gradually approaches the log-likelihood obtained with the true model. When T is small, which is the case in many practical scenarios, IPP cannot fit the data well while MTPP learns the model accurately. This is again illustrated in the right panel of Figure 2, where for both MTPP and IPP we compute the rate estimation error (ℓ2-norm), normalized w.r.t. the true rates and averaged over all history features occurring in the data. We observe that MTPP outperforms IPP, especially in scarce-data scenarios, demonstrating the benefit of sharing across tasks.

Forward prediction: We evaluate the prediction performance using two metrics (mirroring the two questions posed in Section 1). The first is a binary classification problem on whether at least one event occurs in [t, t + L]; the second is estimating the distribution of event arrival counts in [t, t + L]. For the former, at time t, the probability of no event occurring until t + L can be analytically computed as $\exp\big(-\int_t^{t+L} \gamma(\tau\,|\,h(\tau))\, d\tau\big)$. This is easily solvable because h(τ) is piecewise-constant. For the latter, because the intensity model is trained given observations, i.e., the history-to-rate functions are learned, at time t we can generate sample paths in a forward manner. Using Monte Carlo sampling, we estimate the quantities of interest, e.g., the distribution of event arrival counts in [t, t + L]; a minimal sketch of both computations is given below.
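The sketch below illustrates both prediction computations for the piecewise-constant model. It assumes a learned mapping rate_fn from history features to a point estimate of the intensity, and uses a small-step numerical approximation rather than exact piecewise simulation; all names and values are illustrative.

```python
import numpy as np

def history_features(events, t, memory_lengths):
    """Counts of past arrivals in [t - L_p, t], as in Section 3."""
    ev = np.asarray(events)
    return np.array([np.sum((ev >= t - L) & (ev <= t)) for L in memory_lengths], dtype=float)

def prob_no_event(events, t, L, rate_fn, memory_lengths, dt=0.1):
    """P(no event in [t, t+L]) = exp(-integral of gamma(s | h(s)) ds). With no new
    arrivals the history features evolve deterministically, so a grid sum suffices."""
    grid = np.arange(t, t + L, dt)
    rates = np.array([rate_fn(history_features(events, s, memory_lengths)) for s in grid])
    return float(np.exp(-np.sum(rates * dt)))

def sample_count(events, t, L, rate_fn, memory_lengths, dt=0.1, rng=None):
    """One forward-sampled count of arrivals in [t, t+L], using a small-step
    Bernoulli approximation of the conditional intensity."""
    rng = rng or np.random.default_rng()
    ev, count, s = list(events), 0, t
    while s < t + L:
        lam = rate_fn(history_features(ev, s, memory_lengths))
        if rng.random() < lam * dt:          # event in [s, s+dt) with prob ~ lam*dt
            ev.append(s)
            count += 1
        s += dt
    return count

# Example: Monte Carlo count distribution under an illustrative learned rate function.
stream, rate_fn = [0.4, 2.1, 2.3, 5.7, 6.0], lambda h: 0.1 + 0.05 * h[0]
counts = [sample_count(stream, t=6.5, L=5.0, rate_fn=rate_fn, memory_lengths=(1.0, 5.0))
          for _ in range(100)]
```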
Figure 3. Left: Predicted empirical distributions of event counts in [t, t+5]; Right: KL divergence between predicted distributions of event counts in [t, t+5] using the learned and the true model.

To better illustrate how the model performs forward prediction, the left panel of Figure 3 shows, for one testing instance, the empirical distribution of event arrival counts obtained from 100 Monte Carlo sample paths. Qualitatively, the distribution predicted by MTPP matches the one predicted by the true model best. The actual observed count can be considered as a draw from the distribution produced by the true model. To quantitatively measure performance, we compute the KL divergence between the distributions predicted by the learned models and by the true model, varying the observation length T. As indicated in the right panel of Figure 3, the KL divergence between the distributions predicted using IPP and the true model is large for short streams, decreasing only with longer streams. For MTPP, the divergence is much smaller (decreasing to a value around 0.75).

We next compare prediction results from MTPP, IPP, and PoiR, where for PoiR all tasks are trained independently. The testing instances are constructed by taking a sequence of snapshots on the unseen streams. In particular, we start at T with a sliding window of length L, extract history features, record the event count, and move forward with step size L/2 until the end of the stream. Figure 4 demonstrates the prediction performance as a function of the prediction window L.

Figure 4. Left: AUC of binary prediction on whether at least one event occurs in [t, t + L]; Right: MAE of event arrival counts in [t, t + L].

Shown in the left panel is the Area Under the Curve (AUC) for the binary problem of whether at least one event occurs in [t, t + L]. MTPP performs best, and is closest to the AUC achieved using the true model. The independent models perform poorly due to insufficient training data. One interesting phenomenon is that the prediction accuracy increases first and starts decreasing after a change point. This results from a trade-off between two effects. First, for small L, the probability of no event occurring decreases from a moderate value (around 0.5) to 0, leading to an easier prediction problem. Second, as L increases further, the accumulating Poisson noise makes prediction harder. When L is small, the first effect dominates, leading to a higher AUC; after a change point, the accumulated Poisson noise dominates, driving prediction accuracy down. The right panel of Figure 4 shows the Mean Absolute Error (MAE) of event arrival counts in [t, t + L]. Unlike the AUC for binary prediction, the MAE for all methods increases monotonically as the prediction window length increases. MTPP outperforms the other baselines, achieving the MAE closest to the one obtained by the true model. The increase is linear because of the linearly accumulating Poisson noise.

6.2. Applications on Electronic Health Records

We now consider a real-world application involving an EHR dataset. We use the New Zealand national minimum dataset (http://www.health.govt.nz/nz-health-statistics), covering the years 2007 through 2011 (inclusive). The data contain approximately 3.3 million inpatient visits from 1.5 million unique individuals with ages from 18 to 65. Available variables include ICD-10-AM (Australian Modification) diagnosis and procedure codes, which are grouped into 22 broad categories (World Health Organization, 2010). We focus on hospital visits for each disease category, associated with a block of ICD-10-AM billing codes (e.g., billing codes in the range C00 to D48 all refer to neoplasms). Treating clinical data as point process observations is, to the best of our knowledge, a relatively novel approach, and has only been adopted in Lasko (2014). In our experiments, for each disease category, we record all patients' visits associated with billing codes belonging to it, filter out patients with infrequent visits (fewer than 50), and split visit sequences for each patient into training and testing sets (split at the time stamp when half of the visits have been observed). After preprocessing, we have visit streams of multiple patients for each disease category, with the number of patients varying from 36 to 118, and the number of visits per patient varying from 51 to 98. In total, we analyze 6 disease types, listed in Table 1, mostly related to chronic diseases. For example, neoplasms include malignant and benign ones, and metabolic problems include type-I and type-II diabetes. Note we are not exploring the correlation across visits for different disease types, which is also interesting and left as future work.
The aim of this application to EHR data is to show that, using the proposed multitask point process model with a simple feature construction approach, the visit patterns for some disease types are reasonably predictable. For each patient's visit stream of a disease category, the features constructed include the number of his/her hospital visits for this disease in the previous week, month, and six months. For inference, we set the number of pseudo inputs to M = 15, and the initialization procedure is the same as in the synthetic experiments. The algorithm generally converges in tens of iterations.

Similar to the synthetic experiments above, we evaluate the algorithm from two perspectives: model fit and predictive ability. For model fit, we evaluate the data log-likelihood, comparing with two state-of-the-art approaches using GP-modulated renewal processes: the direct inference method in Lasko (2014) and the thinning approach in Rao & Teh (2011). For the prediction tasks, unlike the synthetic experiments, we do not have access to the ground-truth model, i.e., knowledge about heterogeneity across the population/tasks. Thus, we use the test dataset to compare MTPP with PoiR, using the same model for all patients for the latter. We also evaluated IPP, but because the patients' visits are sparse, the individual models learned with this scarce training data perform poorly. Hence only results of pooled MTPP and PoiR are reported.

Model Fit: GP-modulated renewal processes are trained using both direct inference as in Lasko (2014) and the thinning approach as in Rao & Teh (2011), for each patient's visit sequence in the neoplasm category. We then compare these two methods with MTPP on the data log-likelihood per time unit (day) over all sequences. As shown in the top-left panel of Figure 5, MTPP is comparable with the other two methods, with a smaller variation due to the sharing across patients. Shown in the top-right and bottom panels are the mean intensity functions inferred for anonymous patients (error bars are omitted for clarity), each associated with the raster plot of the hospital visits at the bottom. Relative dates, instead of calendar dates, are used for confidentiality.

Figure 5. Model fit results. Top-left: Comparison of data log-likelihood per day of MTPP, direct inference, and the thinning approach; Top-right and bottom row: Intensity functions inferred for anonymous patients' arrival sequences.

As can be seen, the two temporal GP methods infer a smooth function modeling the rate, which fails to capture sudden changes; for example, see the small-to-large rate change around the 33rd month in the top-right panel. Also note, in the bottom-left panel, that MTPP performs well at the beginning and end of the sequence, with none of the boundary effects of GPs. Such differences are consistent over the dataset. Note that the intensity function inferred via our method is piecewise-constant, with change points located where the recent history changes. The goodness of fit suggests that this dependence of the current rate on history enables forward prediction.

Forward Prediction: Figure 6 (left panel) shows the learned history-to-rate mapping function globally shared by all patients in the neoplasm category.
In particular, we plot the value of the function in terms of two features, namely, the number of arrivals in the windows [t-7, t] and [t-30, t] (the previous week and month). As indicated, a patient with many visits in a month and few visits in a week has a larger rate of revisit.

Figure 6. Neoplasms results. Left: Global rate function inferred by MTPP as a function of the number of arrivals in windows [t-7, t] and [t-30, t]; Right: MAE as a function of the prediction window length, L, in days.

The right panel of Figure 6 shows that, for neoplasms, the MAE increases monotonically as the prediction window length increases up to 90 days. This is consistent with the results from synthetic data shown in Figure 4. Interestingly, because PoiR requires a separate model for each prediction interval, no noise variance accumulates in the model, and the increase in MAE results from the data itself. On the other hand, we train MTPP just once and generate sample paths for different window lengths, resulting in a Poisson noise that increases linearly with the prediction interval. As a result, for MTPP the MAE grows faster than for PoiR as L increases.

Results in Table 1 show AUC values for the 6 disease categories considered. Two settings for the length of the prediction window, namely 1 week and 1 month, are evaluated. We see that, consistent with the results for artificial data in Figure 4, AUCs tend to increase for moderate sizes of L. We also verified (results not shown) that further increasing the prediction window size has a correspondingly negative impact on the AUCs. In fact, predictions become no better than random as L approaches 6 months for some disease types. When comparing MTPP to PoiR, we observe that the former outperforms the latter in 4 of 6 disease types. This supports the hypothesis that the changes of hospital visit patterns among the population are disease specific. However, correlations across diseases and clustering of patient subpopulations could be exploited, and are left as future work.

Table 1. AUC of binary predictions of events occurring in weekly and monthly windows for 6 disease types.

| Disease type | MTPP (1 week) | PoiR (1 week) | MTPP (1 month) | PoiR (1 month) |
|---|---|---|---|---|
| Neoplasms   | 0.7379 | 0.7249 | 0.8136 | 0.8058 |
| Metabolic   | 0.6807 | 0.6170 | 0.6778 | 0.6195 |
| Nervous     | 0.6926 | 0.7241 | 0.7978 | 0.7167 |
| Circulatory | 0.6807 | 0.6778 | 0.6170 | 0.6195 |
| Respiratory | 0.5733 | 0.6302 | 0.6308 | 0.6322 |
| Digestive   | 0.6050 | 0.5562 | 0.6555 | 0.6170 |

7. Conclusions and Future Work

We have considered the problem of analyzing multiple streaming point processes in a multi-task setting, and have proposed a simple predictive strategy. The proposed model, using hierarchical GPs, leverages information across the tasks in a nonparametric manner. Exploring multi-task marked point processes, designing richer feature construction approaches for predictive models, and clustering tasks based on their arrival patterns are left as interesting open problems.

Acknowledgements

The research reported here was supported in part by ARO, DARPA, DOE, NGA and ONR.

References

Adams, Ryan P., Murray, Iain, and MacKay, David J. C. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th International Conference on Machine Learning, 2009.
Amarasingham, Ruben, Moore, Billy J., Tabak, Ying P., Drazner, Mark H., Clark, Christopher A., Zhang, Song, Reed, W. Gary, Swanson, Timothy S., Ma, Ying, and Halm, Ethan A. An automated model to identify heart failure patients at risk for 30-day readmission or death using electronic medical record data. Medical Care, 48(11):981–988, 2010.

Ancarani, L. U. and Gasaneo, G. Derivatives of any order of the confluent hypergeometric function. Journal of Mathematical Physics, 49, 2008.

Beal, Matthew James. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.

Gunawardana, Asela, Meek, Christopher, and Xu, Puyang. A model for temporal dependencies in event streams. In Advances in Neural Information Processing Systems, pp. 1962–1970, 2011.

Kulkarni, Jayant E. and Paninski, Liam. Common-input models for multiple neural spike-train data. Network: Computation in Neural Systems, 18(4):375–407, 2007.

Lasko, Thomas A. Efficient inference of Gaussian process modulated renewal processes with application to medical event data. In Uncertainty in Artificial Intelligence, 2014.

Lian, Wenzhao, Rao, Vinayak, Eriksson, Brian, and Carin, Lawrence. Modeling correlated arrival events with latent semi-Markov processes. In Proceedings of the 31st International Conference on Machine Learning, pp. 396–404, 2014.

Lloyd, Chris, Gunter, Tom, Osborne, Michael A., and Roberts, Stephen J. Variational inference for Gaussian process modulated Poisson processes. arXiv:1411.0254, 2014.

Pillow, Jonathan W., Shlens, Jonathon, Paninski, Liam, Sher, Alexander, Litke, Alan M., Chichilnisky, E. J., and Simoncelli, Eero P. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207):995–999, 2008.

Rad, Kamiar R. and Paninski, Liam. Information rates and optimal decoding in large neural populations. In Advances in Neural Information Processing Systems, pp. 846–854, 2011.

Rajaram, Shyamsundar, Graepel, Thore, and Herbrich, Ralf. Poisson-networks: A model for structured point processes. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 277–284, 2005.

Rao, Vinayak and Teh, Yee Whye. Gaussian process modulated renewal processes. In Advances in Neural Information Processing Systems, pp. 2474–2482, 2011.

Snelson, Ed and Ghahramani, Zoubin. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2006.

Titsias, Michalis. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009.

Weiss, Jeremy C. and Page, David. Forest-based point process for event prediction from electronic health records. In Machine Learning and Knowledge Discovery in Databases, pp. 547–562, 2013.

World Health Organization, Geneva. International classification of diseases, 10th revision (ICD-10). 2010.

Xu, Lizhen, Duan, Jason A., and Whinston, Andrew. Path to purchase: A mutually exciting point process model for online advertising and conversion. Management Science, 2014.

Yu, Kai, Tresp, Volker, and Schwaighofer, Anton. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning, pp. 1012–1019, 2005.