# Learning Registered Point Processes from Idiosyncratic Observations

Hongteng Xu (Department of ECE, Duke University; Infinia ML Inc., Durham, NC, USA), Lawrence Carin (Department of ECE, Duke University), Hongyuan Zha (College of Computing, Georgia Institute of Technology). Correspondence to: Hongteng Xu.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.*

**Abstract.** A parametric point process model is developed, based on the assumption that sequential observations often share latent phenomena while also possessing idiosyncratic effects. An alternating optimization method is proposed to learn a registered point process that accounts for the shared structure, together with warping functions that characterize the idiosyncratic aspects of each observed sequence. Under reasonable constraints, in each iteration we update the sample-specific warping functions by solving a set of constrained nonlinear programming problems in parallel, and update the model by maximum likelihood estimation. The justifiability, complexity and robustness of the proposed method are investigated in detail, and the influence of sequence stitching on the learning results is discussed empirically. Experiments on both synthetic and real-world data demonstrate that the method yields explainable point process models, achieving encouraging results compared to state-of-the-art methods.

## 1. Introduction

The behavior of real-world entities may often be recorded as event sequences; for example, the interactions of participants in a social network, the admissions of patients, and the job-hopping behavior of employees. In practice, these behaviors are governed by complicated mechanisms, which can be captured approximately by an appropriate parametric temporal point process model. While the observed event sequences associated with a given process (e.g., a disease) may share common (*standard*) attributes, there are often subject-specific factors that impact the observed data. For example, the admission records of different patients are invariably personalized: even if the patients suffer from the same disease, they may spend unequal amounts of time on recovery because their medications, histories and environmental conditions may be distinct. Another typical example is the job-hopping behavior of employees: employees in the same company often make very different career plans depending on their age, family situation and the unobserved status of the job market.

*Figure 1. Illustration of the concepts in our work (panels: registered temporal point process with its intensity function; original timeline and realizations; warped idiosyncratic observations; warping functions 1–3). The dotted parts (parametric point process model, unwarped realizations and warping functions) are what we aim to learn.*

The examples above reveal that event sequences sharing an underlying temporal point process linked to a given phenomenon of interest may be personalized by hidden idiosyncratic factors, yielding a subject-specific warping along the timeline, as shown in Fig. 1. Such data often have a negative influence on the learning of the target point process, i.e., they increase the uncertainty of the model.
The complexity of a model can be increased to fit the personalized observations well, e.g., the locally-stationary point processes in (Roueff et al., 2016; Mammen, 2017; Xu et al., 2017a). However, from the viewpoint of model registration, it is desirable to separate the essential mechanism of the model from the idiosyncratic aspects of the data, such that the final model is registered and characterizes the shared phenomena, while also inferring what is sample-specific.

Learning registered point processes from idiosyncratic observations is a challenging problem, requiring one to jointly learn a shared point process model and a set of sample-specific warping functions corresponding to the observed event sequences. To solve this problem, we propose a novel and effective learning method based on alternating optimization. Specifically, in each iteration we first apply the inverses of the estimated warping functions (i.e., the unwarping functions) to unwarp the observed event sequences and learn the parameter of a registered point process by maximum likelihood estimation; we then update the warping functions of the event sequences based on the estimated registered point process. The new functions are applied to update the model in the next iteration. In particular, we approximate the warping/unwarping functions of the event sequences by piecewise linear models, and learn their parameters by solving a set of constrained nonlinear programming problems in parallel.

We analyze the justifiability and the complexity of our method in detail, and investigate the meaning and effects of the regularizers and constraints used in the method. Furthermore, we consider improving the learning results by randomly stitching warped sequences together and learning from the stitched sequences, and verify the feasibility of this data-processing strategy empirically. Experimental results show that the proposed method outperforms its competitors on both synthetic and real-world data.

## 2. Proposed Model

Denote a temporal point process as $N$. Its event sequence consists of multiple events $\{(t_i, c_i)\}_{i=1}^I$ with time stamps $t_i \in [0, T]$ and event types $c_i \in \mathcal{C} = \{1, \dots, C\}$, and can be represented as $\{N_c(t)\}_{c=1}^C$, where $N_c(t)$ is the number of type-$c$ events occurring at or before time $t$. A temporal point process is characterized by its intensity functions $\{\lambda_c(t)\}_{c=1}^C$, where $\lambda_c(t) = \mathbb{E}[dN_c(t) \mid \mathcal{H}_t^{\mathcal{C}}]/dt$ and $\mathcal{H}_t^{\mathcal{C}} = \{(t_i, c_i) \mid t_i < t,\, c_i \in \mathcal{C}\}$ collects the historical events before time $t$. Each $\lambda_c(t)$ represents the expected instantaneous rate of type-$c$ events at time $t$, and is parametrized by a parameter $\theta$. In this work, we assume:

1. **Exponential-like intensity:** each $\lambda_c(t)$ is an exponential-like function $\sum_{j=1}^J \exp_{t_j}(f_j(t; \theta, \mathcal{H}_t^{\mathcal{C}}))$, where $J \geq 1$, the $f_j$'s are linear functions of time related to $\theta$ and the historical observations, and $\exp_{t_j}(f_j(t)) = \exp(f_j(t))$ if $t \geq t_j$ and $0$ otherwise.

Note that many important point processes, e.g., the Hawkes process (Hawkes & Oakes, 1974) and the self-correcting process (Isham & Westcott, 1979), satisfy this assumption, as shown in the supplementary file.
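As a concrete instance of Assumption 1, consider a 1D Hawkes process with an exponential decay kernel. The sketch below is our illustration (not the authors' code; the parameter values are arbitrary): every summand of the intensity is an exponential of a linear function of $t$ that "switches on" at an event time, matching the form $\sum_j \exp_{t_j}(f_j(t))$.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    """lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).

    Each summand exp(log(alpha) - beta * (t - t_i)) is an exponential of a
    function linear in t that is switched on at t_i, i.e., the
    exponential-like form sum_j exp_{t_j}(f_j(t; theta, history)).
    """
    past = np.asarray([ti for ti in history if ti < t])
    return mu + float(np.sum(alpha * np.exp(-beta * (t - past))))

# Intensity right after two events at t = 0.5 and t = 1.5:
print(hawkes_intensity(2.0, [0.5, 1.5]))  # baseline mu plus both excitations
```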
The sequences of a parametric point process $N_\theta$ may be warped in $[0, T]$ by a set of continuous and invertible warping functions. Denote the sequences and the corresponding warping functions as $\{S_m\}_{m=1}^M$ and $\{W_m\}_{m=1}^M$, respectively. Each $S_m = \{(t_i^m, c_i^m)\}_{i=1}^{I_m}$ contains $I_m$ events, whose time stamps are deformed from a standard timeline by the corresponding warping function $W_m: [0, T] \mapsto [0, T]$. Accordingly, the unwarping functions are denoted $\{W_m^{-1}\}_{m=1}^M$. We assume that for $m = 1, \dots, M$:

2. **Unbiasedness:** $\mathbb{E}[W_m^{-1}(t)] = t$ on $[0, T]$.
3. **Regularity:** $W_m^{-1}(t)$ is monotone increasing on $[0, T]$.

Taking the warping functions into account, the likelihood of an (unobserved) unwarped sequence can be formulated from the intensity functions (Daley & Vere-Jones, 2007):

$$
L(\theta; W_m^{-1}(S_m)) = \prod_{i=1}^{I_m} \lambda_{c_i^m}\big(W_m^{-1}(t_i^m)\big) \cdot \exp\Big(-\sum_{c=1}^C \int_0^T \lambda_c\big(W_m^{-1}(s)\big)\, ds\Big), \tag{1}
$$

where $W_m^{-1}(S_m) = \{(W_m^{-1}(t_i^m), c_i^m)\}_{i=1}^{I_m}$ is the unwarped event sequence. Warping caused by idiosyncratic effects generally harms the maximum likelihood estimation of the target point process, except in some trivial cases (see our supplementary file):

**Proposition 2.1.** For a temporal point process $N_\theta$ satisfying Assumption 3, let $\hat{\theta}$ and $\hat{\theta}_W$ denote its maximum likelihood estimates based on the original data and on the warped data, respectively. Then $\hat{\theta} = \hat{\theta}_W$ if and only if 1) the warping functions are translations, or 2) $N_\theta$ is a homogeneous Poisson process.

Our problem is thus: given the warped observations $\{S_m\}_{m=1}^M$, learn a registered model $\theta$, together with the sample-specific warping functions $\{W_m\}_{m=1}^M$ (or, equivalently, the unwarping functions $\{W_m^{-1}\}_{m=1}^M$).

## 3. Learning Registered Point Processes

### 3.1. Maximizing the likelihood

We develop a learning method based on maximum likelihood estimation (MLE). Considering the assumptions on the unwarping functions and the likelihood in (1), we formulate the optimization problem as

$$
\min_{\theta, \{W_m\}} \; \sum_{m=1}^M -\log L\big(\theta; W_m^{-1}(S_m)\big) + \gamma R\big(\{W_m^{-1}\}\big)
\quad \text{s.t.} \;\; 1)\; W_m^{-1}(0) = 0,\; W_m^{-1}(T) = T, \;\; 2)\; (W_m^{-1})'(t) > 0, \;\; \text{for } m = 1, \dots, M, \tag{2}
$$

where $(W_m^{-1})'(t) = dW_m^{-1}(t)/dt$ is the derivative of the unwarping function. The first term of the objective is the negative log-likelihood of the unwarped event sequences, and the second term is a regularizer imposed on the unwarping functions. For each unwarping function, the first constraint fixes its range and the second enforces the regularity assumption. Furthermore, following the unbiasedness assumption, we apply the regularizer

$$
R\big(\{W_m^{-1}\}\big) = \int_0^T \Big( \frac{1}{M}\sum_{m=1}^M W_m^{-1}(s) - s \Big)^2 ds. \tag{3}
$$

The optimization problem in (2) is non-convex and has a large number of unknown variables; solving it directly is intractable. Fortunately, for parametric point processes with exponential-like intensity functions we can design an effective alternating optimization method that solves the problem iteratively, after parameterizing the warping functions as piecewise linear functions. In each iteration, we first maximize the likelihood of the unwarped sequences given the current estimates of the warping functions, and then optimize the warping functions given the estimated model. Specifically, in the $k$-th iteration, given the warping functions $\{W_m^{k-1}\}_{m=1}^M$ estimated in the previous iteration, we learn the target point process by

$$
\theta^k = \arg\min_\theta \sum_{m=1}^M -\log L\big(\theta; (W_m^{k-1})^{-1}(S_m)\big). \tag{4}
$$

Depending on the point process, various optimization methods apply to this subproblem. For example, Hawkes processes can be learned within an expectation-maximization (EM) framework (Lewis & Mohler, 2011; Zhou et al., 2013), which is equivalent to a projected-gradient-ascent algorithm; other parametric point processes, e.g., self- and mutually-correcting processes, can be learned by gradient descent or stochastic gradient descent (SGD).
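For intuition, the negative log-likelihood in (1) can be evaluated numerically once an unwarping function is fixed. Below is a minimal sketch of ours for the 1D Hawkes model introduced earlier (the compensator integral is approximated on a grid rather than computed in closed form; all names and parameter values are illustrative):

```python
import numpy as np

def neg_log_lik(unwarp, times, mu=0.5, alpha=0.8, beta=1.0, T=100.0, grid=2000):
    """-log L(theta; W^{-1}(S)) of Eq. (1) for a 1D Hawkes process.

    `unwarp` is a scalar callable for W^{-1}; `times` are warped stamps.
    """
    u = np.sort(np.asarray([unwarp(ti) for ti in times]))  # unwarped stamps
    lam = lambda t: mu + alpha * np.sum(np.exp(-beta * (t - u[u < t])))
    log_term = sum(np.log(lam(ti)) for ti in u)
    s = np.linspace(0.0, T, grid)
    # int_0^T lambda(W^{-1}(s)) ds, approximated by the trapezoid rule:
    compensator = np.trapz([lam(unwarp(si)) for si in s], s)
    return compensator - log_term

print(neg_log_lik(lambda t: t, [1.0, 3.0, 7.0], T=10.0))  # identity unwarping
```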
### 3.2. Learning warping/unwarping functions

Given $\theta^k$, we next update the warping/unwarping functions. To simplify the problem and accelerate our learning method, we take advantage of the warping functions $\{W_m^{k-1}\}_{m=1}^M$ estimated in the previous iteration and decompose the problem into $M$ independent problems: for $m = 1, \dots, M$, $W_m^k$ is the solution of

$$
\min_{W_m} \; -\log L\big(\theta^k; W_m^{-1}(S_m)\big) + \gamma \int_0^T \Big( \frac{W_m^{-1}(s) + \sum_{m' \neq m} (W_{m'}^{k-1})^{-1}(s)}{M} - s \Big)^2 ds
\quad \text{s.t.} \;\; W_m^{-1}(0) = 0,\; W_m^{-1}(T) = T,\; (W_m^{-1})'(t) > 0. \tag{5}
$$

Solving these problems is nontrivial and requires a further parameterization of the warping functions $\{W_m\}_{m=1}^M$ or, equivalently, the unwarping functions $\{W_m^{-1}\}_{m=1}^M$. For convenience of mathematical derivation and computation, we fit the unwarping functions with piecewise linear models. Specifically, given $L$ landmarks $\{t_1, \dots, t_L\} \subset [0, T]$ with $t_1 = 0$, $t_L = T$ and $t_l < t_{l+1}$, we model $W_m^{-1}$ for $m = 1, \dots, M$ as

$$
W_m^{-1}(t) = a_l^m t + b_l^m, \quad \text{if } t \in [t_l, t_{l+1}). \tag{6}
$$

Denoting the model parameters $\bm{a}^m = \{a_l^m\}_{l=1}^{L-1}$ and $\bm{b}^m = \{b_l^m\}_{l=1}^{L-1}$, we rewrite the regularizer and the constraints on $W_m^{-1}$ as

$$
\begin{aligned}
\gamma \int_0^T (\cdot)^2\, ds \;&\rightarrow\; \gamma\Big( \big\|\tfrac{1}{M}\bm{a}^m + \bar{\bm{a}}^m\big\|_2^2 + \big\|\tfrac{1}{M}\bm{b}^m + \bar{\bm{b}}^m\big\|_2^2 \Big), \\
W_m^{-1}(0) = 0 \;&\rightarrow\; b_1^m = 0, \\
W_m^{-1}(T) = T \;&\rightarrow\; a_{L-1}^m T + b_{L-1}^m = T, \\
(W_m^{-1})'(t) > 0 \;&\rightarrow\; a_l^m > 0 \;\text{ for } l = 1, \dots, L-1,
\end{aligned} \tag{7}
$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector, $\bar{\bm{a}}^m = \frac{1}{M}\sum_{m' \neq m} \bm{a}^{m',k-1} - \bm{1}$ and $\bar{\bm{b}}^m = \frac{1}{M}\sum_{m' \neq m} \bm{b}^{m',k-1}$; here $\bm{a}^{m',k-1}$ and $\bm{b}^{m',k-1}$ were estimated in the previous iteration. To guarantee the continuity of $W_m^{-1}$, we further impose the following constraints on $\bm{a}^m$ and $\bm{b}^m$: for $l = 1, \dots, L-2$,

$$
a_l^m t_{l+1} + b_l^m = a_{l+1}^m t_{l+1} + b_{l+1}^m. \tag{8}
$$

Based on the piecewise linear model and the exponential-like intensity assumption, we propose a tight upper bound of the negative log-likelihood in (5):

$$
\begin{aligned}
-\log L\big(\theta^k; W_m^{-1}(S_m)\big) &= \sum_{c=1}^C \int_0^T \lambda_c\big(W_m^{-1}(s)\big)\, ds - \sum_{i=1}^{I_m} \log \lambda_{c_i^m}\big(W_m^{-1}(t_i^m)\big) \\
&\leq \sum_{c=1}^C \int_0^T \lambda_c(s)\, dW_m(s) - \sum_{i=1}^{I_m} \sum_{j=1}^J q_{ij}^m \log\frac{\exp\!\big(f_j(W_m^{-1}(t_i^m))\big)}{q_{ij}^m} \\
&\leq \sum_{l=1}^{L-1} \bigg[ p_l^m a_l^m - \sum_{t_i^m \in [t_l, t_{l+1})} \sum_{j=1}^J q_{ij}^m f_j\big(a_l^m t_i^m + b_l^m\big) \bigg] + C \;=:\; Q(\bm{a}^m, \bm{b}^m),
\end{aligned} \tag{9}
$$

where $\lambda_{c_i^m}(W_m^{-1}(t_i^m)) = \sum_{j=1}^J \exp\big(f_j(W_m^{-1}(t_i^m); \theta^k)\big)$, the coefficients are $p_l^m = \sum_c \int_{W_m^{-1}(t_l)}^{W_m^{-1}(t_{l+1})} \lambda_c(s)\, ds$ and $q_{ij}^m = \exp\big(f_j(W_m^{-1}(t_i^m))\big) / \lambda_{c_i^m}\big(W_m^{-1}(t_i^m)\big)$, and $C$ is a constant independent of $W_m^{-1}$. The inequality is based on Jensen's inequality, and the $\{p_l^m, q_{ij}^m\}$ are calculated from the parameters estimated in the previous iteration. The detailed derivation and the implementation for the Hawkes process are given in the supplementary file.

Considering (7), (8) and (9) together, we propose the following surrogate of problem (5):

$$
\begin{aligned}
\min_{\bm{a}^m, \bm{b}^m} \;& Q(\bm{a}^m, \bm{b}^m) + \gamma\Big( \big\|\tfrac{1}{M}\bm{a}^m + \bar{\bm{a}}^m\big\|_2^2 + \big\|\tfrac{1}{M}\bm{b}^m + \bar{\bm{b}}^m\big\|_2^2 \Big) \\
\text{s.t.} \;& 1)\; b_1^m = 0 \text{ and } a_{L-1}^m T + b_{L-1}^m = T; \quad 2)\; a_l^m > 0 \text{ for } l = 1, \dots, L-1; \\
& 3)\; a_l^m t_{l+1} + b_l^m = a_{l+1}^m t_{l+1} + b_{l+1}^m \text{ for } l = 1, \dots, L-2.
\end{aligned} \tag{10}
$$

Problem (10) is a typical constrained nonlinear programming problem, to which many optimization methods apply, e.g., sequential quadratic programming or an interior-point method. Note that after obtaining the optimal $\bm{a}^m$ and $\bm{b}^m$, we re-calculate the $\{p_l^m, q_{ij}^m\}$ in $Q$ and solve (10) iteratively until the objective converges. Repeating the two steps above, we estimate the model and the warping/unwarping functions effectively.
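The piecewise linear parameterization in (6)–(8) is straightforward to implement. The sketch below is ours, with illustrative names and values: it evaluates $W_m^{-1}$ and builds intercepts satisfying $b_1^m = 0$ (so $W_m^{-1}(0) = 0$) and the continuity constraints (8); the boundary condition $W_m^{-1}(T) = T$ and the regularized objective in (10) would still be handled by a constrained solver such as `scipy.optimize.minimize` with an SQP- or interior-point-style method.

```python
import numpy as np

def piecewise_unwarp(t, landmarks, a, b):
    """W^{-1}(t) = a_l * t + b_l for t in [t_l, t_{l+1}), as in Eq. (6).

    `landmarks` holds t_1 = 0, ..., t_L = T; `a`, `b` hold the L-1 slopes
    and intercepts.
    """
    l = np.clip(np.searchsorted(landmarks, t, side="right") - 1, 0, len(a) - 1)
    return a[l] * t + b[l]

def continuity_intercepts(landmarks, a):
    """Intercepts implied by b_1 = 0 and the continuity constraints (8):
    b_{l+1} = b_l + (a_l - a_{l+1}) * t_{l+1}."""
    b = np.zeros_like(np.asarray(a, dtype=float))
    for l in range(1, len(b)):
        b[l] = b[l - 1] + (a[l - 1] - a[l]) * landmarks[l]
    return b

landmarks = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
a = np.array([0.8, 1.1, 1.2, 0.9])   # positive slopes (regularity assumption)
b = continuity_intercepts(landmarks, a)
# These slopes happen to also satisfy W^{-1}(T) = T: 0.9 * 100 + 10 = 100.
print(piecewise_unwarp(60.0, landmarks, a, b))
```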
### 3.3. Analysis

**Justifiability.** The reasons for applying piecewise linear models to the warping functions are twofold. First, our learning method involves computing the unwarping function $W_m^{-1}$ and the derivative of the warping function, $W_m'$. With the piecewise linear model, both the warping and the unwarping functions have explicit representations. If we used other basis functions, e.g., a Gaussian basis, to represent $W_m$ (or $W_m^{-1}$), the corresponding $W_m^{-1}$ (or $W_m'$) might not be representable in closed form. Second, compared to the finite element analysis used in functional optimization and differential equations, which discretizes functions onto many grid points, our piecewise linear model requires far fewer parameters, which reduces both the risk of over-fitting and the computational complexity.

**Complexity.** Consider a $C$-dimensional Hawkes process as an example, where we implement the MLE step and the update of the unwarping functions via an EM-based framework (Zhou et al., 2013) and an interior-point method (Potra & Wright, 2000), respectively. Given $M$ sequences with $I$ events each, the worst-case per-iteration complexity of our method is $\mathcal{O}(MI^2 + C^2 + ML^3)$: the $\mathcal{O}(MI^2)$ and $\mathcal{O}(C^2)$ terms correspond to the E-step and the M-step, and the $\mathcal{O}(ML^3)$ term to solving $M$ nonlinear programs with $2L$ variables each. Because we update the unwarping functions by solving $M$ independent optimization problems in parallel, the time complexity can be reduced to $\mathcal{O}(MI^2 + C^2 + L^3)$.

**Convergence.** Our learning method converges in each step. For parametric point processes like Hawkes processes, the likelihood function is convex, so convergence of the MLE step is guaranteed. Further, the objective in (10) is convex, as shown in the supplementary file, so the update of the unwarping functions also converges well.

Compared with existing point process registration methods, e.g., the Wasserstein learning-based registration method (WLR) (Bigot et al., 2012; Panaretos & Zemel, 2016; Zemel & Panaretos, 2017) and the multi-task learning-based method (MTL) (Luo et al., 2015), our method (denoted RPP below) has several advantages. First, both WLR and MTL require learning a specific model for each event sequence; for complicated multi-dimensional point processes they need a large number of events per sequence to learn reliable models independently, which may be unavailable in practice. Our method has far fewer parameters, and thus much lower computational complexity and a lower risk of over-fitting. Second, both WLR and MTL decompose the learning of the model and of the warping functions into two independent steps, so the estimation error of the first step propagates to the second. In contrast, our method optimizes the model and the warping functions alternately with guaranteed convergence, so the estimation error is suppressed iteratively.

*Figure 2. In 100 trials, 40 sequences of length $T$ are generated by a 1D Hawkes process and warped by a function with a given $\|W(t) - t\|_\infty$; the plot shows the relative estimation error against the distortion $\max_t |W(t)-t| / T$. The correlation between the estimation errors and the distortions is 0.75.*
## 4. Potential Improvement Based on Stitching

Empirically, the influence of warped data on the learning results is correlated with the distortion of the warping function relative to the identity function. The distortion should be a measurement that depends not only on the difference between the warping function and the identity function, but also on the scale of the time window, because a distortion at a certain scale becomes negligible when observed and analyzed at a larger scale. We therefore define the distortion as

$$
D = \frac{\|W(t) - t\|_\infty}{T},
$$

where $\|W(t) - t\|_\infty = \max_{t \in [0,T]} |W(t) - t|$ represents the most severe deviation achieved by the warping function and $T$ is the length of the time window. Fig. 2 shows that the distortion under this definition is highly correlated with the relative estimation error $\|\theta - \hat{\theta}\|_2 / \|\theta\|_2$, where $\theta$ is the ground truth and $\hat{\theta}$ is the estimate. This relationship, $\|\theta - \hat{\theta}\|_2 / \|\theta\|_2 \propto D$, implies a potential strategy to further improve our learning results.

Suppose we have two warped sequences $S_1 = \{(t_i^1, c_i^1)\}_{i=1}^{I_1}$ and $S_2 = \{(t_i^2, c_i^2)\}_{i=1}^{I_2}$ observed in the time window $[0, T]$, with distortions $D_1$ and $D_2$, respectively. If we stitch these two sequences together, i.e., $S = S_{1 \oplus 2} = \{(t_1^1, c_1^1), \dots, (t_{I_1}^1, c_{I_1}^1), (t_1^2 + T, c_1^2), \dots\}$, the distortion of the new sequence on $[0, 2T]$ becomes $D' = \frac{1}{2}\max\{D_1, D_2\}$. By the relationship above, learning from the stitched sequence may yield a lower estimation error than learning from the two sequences separately.

Note that for memoryless point processes like Poisson processes, such a stitching-based learning strategy causes no model misspecification, because the stitched sequence obeys the same model as the original short sequences. For more complicated models like Hawkes processes or self-correcting processes, however, the stitching operation may introduce nonexistent triggering patterns; in this situation the stitching-based strategy suppresses the influence of the warping functions at the price of a higher risk of model misspecification. Fortunately, as discussed in (Xu et al., 2017a), when the intensity function is exponential-like, the model misspecification problem is negligible for a small number of stitching operations. The experiments in Section 6 further verify the feasibility of this strategy.
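Both the distortion $D$ and the stitching operation are easy to compute. A small sketch of ours, under the convention that a sequence is a list of `(time, type)` pairs on $[0, T]$:

```python
import numpy as np

def distortion(warp, T, grid=1000):
    """D = max_{t in [0, T]} |W(t) - t| / T, approximated on a grid."""
    t = np.linspace(0.0, T, grid)
    return float(np.max(np.abs(warp(t) - t))) / T

def stitch(seq1, seq2, T):
    """Stitch S1 and S2 on [0, T] into one sequence on [0, 2T] by shifting
    the second sequence's time stamps by T. The worst-case deviation is
    unchanged while the window doubles, so D' = max(D1, D2) / 2."""
    return list(seq1) + [(t + T, c) for (t, c) in seq2]

print(distortion(lambda t: np.minimum(1.2 * t, 100.0), T=100.0))
print(stitch([(1.0, 1), (3.5, 2)], [(0.5, 1)], T=100.0))
```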
## 5. Related Work

### 5.1. Temporal point processes

Point processes have proven useful in many applications, e.g., financial analysis (Bacry et al., 2012), social network analysis (Zhou et al., 2013; Zhao et al., 2015), information system analysis (Xu et al., 2016) and clinical data analysis (Xu et al., 2017b). However, most existing work does not consider registering warped parametric point processes. The methods in (Lewis & Mohler, 2011; Yan et al., 2015) estimate time-scaling parameters for their point processes, but they are only applicable to Hawkes processes whose event sequences share the same linear transformation of time. The work in (Luo et al., 2015) jointly learns different Hawkes processes by multi-task learning, but it neither registers its learning results nor learns sample-specific warping functions.

### 5.2. Data registration and model registration

As noted above, the idiosyncratic aspects of sequential data may be viewed in terms of a sample-specific warping of a common latent phenomenon, which can be registered based on learned or predefined transformations. Typical methods include dynamic time warping (DTW) (Berndt & Clifford, 1994; Moeckel & Murray, 1997) and its variants (Wang et al., 2016; Cuturi & Blondel, 2017; Ramsay & Li, 1998), the self-modeling registration method (SMR) (Gervini & Gasser, 2004), the moment-based method (MBM) (James, 2007), the pairwise curve synchronization method (PACE) (Tang & Müller, 2008), and the functional convex averaging (FCA) method (Liu & Müller, 2004). These methods fall into the same framework: the registered curves and the corresponding warping functions are learned alternately based on a nonlinear least-squares criterion. Instead of the Euclidean metric, the more recent work in (Srivastava et al., 2011) obtains better data registration results using the Fisher–Rao metric (FRM). For nonparametric models like Gaussian processes, warping the data improves the robustness of the learning methods (Snelson et al., 2004; Cunningham et al., 2012; Snoek et al., 2014; Herlands et al., 2016).

The work in (Panaretos & Zemel, 2016; Zemel & Panaretos, 2017) proposes a model-registration method: the unregistered distributions of the warped observations are first estimated by nonparametric models; the registered point process is then estimated as the barycenter of those distributions in Wasserstein space (Muskulus & Verduyn-Lunel, 2011); finally, the warping function between each unregistered distribution and the registered one is learned as an optimal transport (Anderes et al., 2016). However, all the methods above focus on warping/unwarping continuous curves nonparametrically, which makes them ill-suited to registering parametric point processes from idiosyncratic event sequences. The recent combination of Wasserstein learning and neural networks (Arjovsky et al., 2017; Xiao et al., 2017) achieves encouraging improvements in learning robust generative models from imperfect observations, but the neural network-based model requires many time-consuming simulation steps in the learning phase and cannot, in general, learn explicit warping functions.

## 6. Experiments

We denote our point process registration method and its variant assisted by the stitching operation as RPP and RPP-Stitch, respectively. To demonstrate the feasibility and effectiveness of the proposed methods, we compare them with existing point process learning and registration methods on both synthetic and real-world datasets. In particular, we compare against: pure maximum likelihood estimation based on the warped observations (Warped); the multi-task learning-based method (MTL) (Luo et al., 2015); and the Wasserstein learning-based registration method (WLR) (Panaretos & Zemel, 2016). The MTL method learns specific parametric point processes jointly from the warped event sequences with low-rank and sparse regularizers, and averages the learned parameters over all event sequences in Euclidean space. WLR is the state-of-the-art model registration method for point processes and their warped event sequences. To apply WLR to learning parametric point process models, we follow the work in (Panaretos & Zemel, 2016): we learn the densities of the observed events by kernel density estimation (KDE) (Sheather & Jones, 1991), and learn the warping functions by finding the optimal transport between the densities and their barycenter in Wasserstein space. Finally, we apply the reversed warping functions to unwarp the observations and learn a parametric point process.
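For reference, the first step of the WLR baseline as we run it amounts to a kernel density estimate over each sequence's event times. A minimal sketch of ours, using scipy's default bandwidth in place of the Sheather–Jones rule cited above:

```python
import numpy as np
from scipy.stats import gaussian_kde

def event_density(times, T, grid=200):
    """Estimate the density of event times on [0, T] by Gaussian KDE,
    renormalized to integrate to one over the observation window."""
    s = np.linspace(0.0, T, grid)
    p = gaussian_kde(np.asarray(times, dtype=float))(s)
    return s, p / np.trapz(p, s)

s, p = event_density([5.0, 7.5, 20.0, 42.0, 44.5, 90.0], T=100.0)
```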
### 6.1. Synthetic data

We simulate a 1D inhomogeneous Poisson process and a 4-dimensional Hawkes process. For each synthetic dataset, we generate 200 event sequences in the time window $[0, 100]$ using Ogata's thinning method (Ogata, 1981) and divide them equally into a training set and a testing set. The intensity function of the Poisson process is $\sum_{j=1}^{5} \exp_{t_j}(-(t - t_j))$, where each $t_j$ is uniformly sampled from $[0, 100]$, while the intensity function of the Hawkes process follows the model in (Zhou et al., 2013). Each sequence in the training set is modified by a sequence-specific warping function. The warping functions are visualized in Fig. 3(a), where each colored curve represents a warping function and the black bold curve represents their average. The generation method for the warping functions is given in the supplementary file; it ensures that both the warping and the unwarping functions are monotone increasing and that the averaged warping and unwarping functions are close to the identity function.

*Figure 3. Comparison of various methods on synthetic data: (a) the warping functions (original vs. warped timeline); (b) inhomogeneous Poisson process and (c) Hawkes process, showing relative estimation error and testing log-likelihood against the number of training sequences (20–100) for Real, Warped, MTL, WLR, RPP, RPP-Stitch1 and RPP-Stitch2; (d) convergence of an estimated warping function (real, initial, and iterations 1, 3, 5, 7).*
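For completeness, here is a sketch of ours showing how Poisson sequences of this kind can be generated by thinning (Ogata, 1981). For the intensity above, each of the 5 terms is at most 1, so $\bar{\lambda} = 5$ is a valid upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_by_thinning(intensity, lam_max, T):
    """Thinning: propose arrivals from a homogeneous process with rate
    lam_max and accept each proposal t with probability intensity(t)/lam_max."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)   # next candidate arrival
        if t > T:
            return events
        if rng.random() < intensity(t) / lam_max:
            events.append(t)

# Intensity of the synthetic Poisson process (Assumption 1 form):
# lambda(t) = sum_j exp(-(t - t_j)) for t >= t_j, hence lambda(t) <= 5.
t_j = rng.uniform(0.0, 100.0, size=5)
lam = lambda t: float(np.sum(np.exp(-(t - t_j[t_j <= t]))))
sequence = simulate_by_thinning(lam, lam_max=5.0, T=100.0)
```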
Given the training data, we learn registered point process models with the different methods and evaluate their performance on 1) the relative estimation error and 2) the log-likelihood of the testing set. Each method is run for 5 trials on each of the two datasets, and the averaged results are shown in Figs. 3(b) and 3(c). The black bold curves correspond to MLE based on the unwarped data, which achieves the best performance (i.e., the lowest estimation error and the highest log-likelihood), while the black dotted curves correspond to MLE based on the warped data. A good registration method should perform much better than the dotted curves and approach the bold curves.

Our RPP method achieves performance superior to MTL and WLR. MTL performs even worse than applying MLE to the warped data directly, especially with little training data. This implies that 1) the sparse and low-rank structure imposed in the multi-task learning phase cannot reflect the actual influence of the warped data on the distribution of the parameters, and 2) the average of the parameters in Euclidean space does not converge well to the ground truth. The performance of WLR is comparable to applying MLE to the warped data directly, which verifies our claim that WLR is unsuitable for learning complicated point processes when the observations are insufficient. Both MTL and WLR rely on learning a specific model for each event sequence and then averaging the models in a predefined space; this strategy ignores the fact that the number of events in a single event sequence is often insufficient to learn a reliable model in practice. Our RPP method, by contrast, learns a single registered model and all warping functions jointly in an iterative manner, rather than in independent steps. As a result, our method suppresses the risk of over-fitting and achieves much better results.

Furthermore, Fig. 3(d) illustrates the learning process of a warping function and verifies the convergence of our RPP method. The black bold curve corresponds to the ground truth and the blue line to the initialization of our estimate. The learning result converges after 7 iterations, and the final estimate of the warping function approaches the ground truth.

The usefulness of the stitching strategy is tested as well. RPP-StitchK means that each event sequence is randomly stitched with K other event sequences, after which our RPP method is applied to the 200 stitched sequences in the time window $[0, 100(K+1)]$. For both the Poisson and the Hawkes processes, RPP-Stitch1 obtains better results than the original RPP method, verifying the improvement brought by the stitching strategy. Another advantage of the stitching strategy is the improved stability of the learning results, especially with small training sets: given 20 training sequences, the standard deviation of the estimation errors over 5 trials is 0.053 for the original RPP, 0.037 for RPP-Stitch1 and 0.033 for RPP-Stitch2. For Poisson processes the improvement can be further enhanced by applying the stitching operation multiple times (i.e., K = 2), whereas for Hawkes processes the additional improvement is negligible. As discussed in Section 4, applying too many stitching operations to history-dependent point processes may cause model misspecification and counteract the benefit of suppressing distortions.

### 6.2. Real-world data

We test our methods and compare them with WLR on two real-world datasets: the MIMIC III dataset (Johnson et al., 2016) and the LinkedIn dataset (Xu et al., 2017a). The MIMIC III dataset contains over ten thousand patient admission records spanning ten years; each admission record is a sequence of admission time stamps with the ICD-9 codes of the diagnosed diseases. Following (Xu et al., 2017a), we assume that there are triggering patterns between different diseases, which can be modeled by a Hawkes process. We focus on modeling the triggering patterns among diseases of the circulatory system, which are grouped into 8 categories, and extract 1,129 admission records related to these categories as the training set. Each record can be viewed as an event sequence warped from a standard record because of the idiosyncratic nature of individual patients. From the LinkedIn dataset, we extract 709 users with working experience in 7 IT companies. The timelines of different users likewise differ, because their working experience and personal conditions differ, as does the status of the job market at the times they change jobs. We want to learn a standard Hawkes process that measures the relationships among the companies while excluding these uncertain factors.

We apply the different model registration methods to learn registered Hawkes processes from the two real-world datasets. The evaluation is challenging because the ground truth of neither the model nor the warping functions is known.
Fortunately, we can use the learning results to evaluate the risks of under- and over-registration for the different methods in an empirical way. Given the unwarped event sequences estimated by each method, we learn the parameter of the model, $\theta^*$, and estimate its variance $\mathrm{var}(\theta^*)$ by parametric bootstrapping (Wasserman, 2006). A method with a lower risk of under-registration should yield a more stable learning result with a smaller estimated variance, so we use the estimated variance as the metric for the risk of under-registration, i.e., $\mathrm{risk}_{\mathrm{under}} = \mathrm{var}(\theta^*)$. To evaluate the risk of over-registration, we define

$$
\mathrm{risk}_{\mathrm{over}} = \frac{\int_0^T |s - \bar{W}(s)|^2\, ds}{\frac{1}{M}\sum_{m=1}^M \int_0^T |W_m(s) - \bar{W}(s)|^2\, ds}, \qquad \bar{W}(s) = \frac{1}{M}\sum_{m=1}^M W_m(s).
$$

The numerator is the distance between the mean of the warping functions and the identity function, and the denominator is the variance of the warping functions. When the estimated warping functions have a small variance (i.e., they are similar to one another) but are very distinct from the identity function (i.e., the bias of the warping functions is large), the corresponding method over-registers.

Side information about the datasets also helps evaluate the appropriateness of the learning results. As shown in Fig. 4(a), most admission records in the MIMIC III dataset come from relatively old patients, and the incidence of circulatory-system diseases is mainly correlated with patient age. When a standard patient model is learned from a dataset dominated by old patients, we expect the admission record of an old patient to be more similar to that of the standard patient, and the corresponding warping function to be closer to the identity function. Therefore, given the deviations between the learned warping functions and the identity function, we can compute Kendall's rank correlation between the warping deviations and the ages of the patients. Similarly, as shown in Fig. 4(b), most samples in the LinkedIn dataset come from young users with 4 or fewer working years, so these young users' behaviors should be close to the standard job-hopping model learned from the data, and the warping deviations should be correlated with the working years.

*Table 1. Comparison of the various methods (the first block corresponds to MIMIC III, the second to LinkedIn).*

| Data | Method | risk_under | risk_over | Rank Corr. |
| --- | --- | --- | --- | --- |
| MIMIC III | WLR | 0.018 | 0.055 | 0.025 |
| | RPP | 0.011 | 0.009 | 0.053 |
| | RPP-Stitch1 | 0.003 | 0.002 | 0.053 |
| LinkedIn | WLR | 0.029 | 0.657 | 0.344 |
| | RPP | 0.025 | 0.010 | 0.375 |
| | RPP-Stitch1 | 0.005 | 0.006 | 0.387 |

Table 1 compares our methods (RPP and RPP-Stitch1) with the WLR method on these two datasets. Our RPP method outperforms WLR consistently across the different metrics and datasets, obtaining lower risks of under- and over-registration and higher rank correlation. In particular, the low risk of under-registration means that the parameter $\theta^*$ learned by our method is stable, and the low risk of over-registration means that the warping/unwarping functions we learn have good diversity and low bias. The high rank correlation supports the justifiability of our method: the warping deviations of the dominant samples (the old patients in MIMIC III and the young employees in LinkedIn) are smaller than those of the minority samples (the young patients and the old employees). As in the synthetic case, applying the stitching strategy once further improves the learning results.
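Both empirical risks and the rank correlation are simple to compute from the learned warping functions. A sketch of ours, with `warps` a list of vectorized callables on $[0, T]$; the under-registration risk would additionally require the parametric bootstrap described above:

```python
import numpy as np
from scipy.stats import kendalltau

def risk_over(warps, T, grid=1000):
    """Bias of the mean warping function vs. the identity, divided by the
    variance of the warping functions (both integrated over [0, T])."""
    s = np.linspace(0.0, T, grid)
    W = np.stack([w(s) for w in warps])          # M x grid evaluations
    w_bar = W.mean(axis=0)
    bias = np.trapz((s - w_bar) ** 2, s)
    var = np.trapz(((W - w_bar) ** 2).mean(axis=0), s)
    return bias / var

def warp_deviation(warp, T, grid=1000):
    """Deviation of one warping function from the identity."""
    s = np.linspace(0.0, T, grid)
    return np.trapz((warp(s) - s) ** 2, s)

# Kendall's rank correlation between per-patient deviations and ages
# (`ages` is hypothetical side information aligned with `warps`):
# tau, _ = kendalltau([warp_deviation(w, T) for w in warps], ages)
```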
Figures 4(c) and 4(d) compare the infectivity matrices of the registered Hawkes processes and the warping functions learned by WLR and by our RPP-Stitch1 on the two datasets. (The infectivity matrix is $\Psi = [\psi_{cc'}]$, whose element $\psi_{cc'} = \int_0^T \phi_{cc'}(s)\, ds$ is the integral of the impact function over time.) These results further verify the superiority of our method. First, the warping/unwarping functions we learn have good diversity, and their bias is lower than that of the functions learned by WLR. Second, the infectivity matrices learned by our RPP-Stitch1 are denser and more informative, reflecting reasonable phenomena that WLR does not find. For the MIMIC III data, the infectivity matrix of WLR reflects only the self-triggering patterns of the disease categories, while ours is more informative: the 5th row of our matrix (the bottom-left subfigure in Fig. 4(c)) corresponds to the category "other forms of heart disease" (ICD-9 codes 420–429), which contains many miscellaneous heart diseases and has complicated relationships with the other categories; accordingly, the 5th row of our infectivity matrix contains many non-zero elements. For the LinkedIn data, the infectivity matrix of our method reveals more than the self-triggering patterns: 1) the Facebook–Google and Google–Facebook values imply that job-hopping happens frequently between Facebook and Google, reflecting the fierce competition between these companies; 2) the Facebook–Nvidia and Google–Nvidia values reflect the fact that in recent years many Nvidia employees have moved to Google and Facebook to develop AI hardware. More detailed analysis is given in the supplementary file.

*Figure 4. Experimental results of our method on the real-world datasets: (a) age distribution of MIMIC III patients; (b) working-year distribution of LinkedIn users; (c) MIMIC III infectivity matrices (over the ICD-9 groups 393–398, 401–405, 410–414, 415–417, 420–429, 430–439, 440–449, 450–459) and warping functions; (d) LinkedIn infectivity matrices (over the IT companies, e.g., Apple, Facebook, Google, Microsoft, Nvidia, Twitter) and warping functions. In (c) and (d), the first row corresponds to the infectivity matrix and the warping functions learned by WLR, and the second row to those learned by our RPP-Stitch1; the black bold curves are the averages of the warping functions.*
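The infectivity matrix has a closed form once the impact functions are specified. A minimal sketch of ours, assuming exponential impact functions $\phi_{cc'}(s) = \alpha_{cc'} e^{-\beta s}$ (one common parameterization of multi-dimensional Hawkes processes; the numbers are illustrative):

```python
import numpy as np

def infectivity(alpha, beta, T):
    """psi_{cc'} = int_0^T alpha_{cc'} * exp(-beta * s) ds
                 = alpha_{cc'} * (1 - exp(-beta * T)) / beta."""
    return np.asarray(alpha) * (1.0 - np.exp(-beta * T)) / beta

alpha = np.array([[0.5, 0.1], [0.2, 0.4]])   # illustrative 2x2 excitation weights
print(infectivity(alpha, beta=1.0, T=100.0))
```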
### 6.3. Robustness analysis

We investigate the robustness of our method to variations of its hyperparameters: the regularizer weight $\gamma$ and the number of landmarks $L$. In particular, we learn models from the synthetic data under different configurations and visualize the estimation errors with respect to these two parameters in Fig. 5. The weight $\gamma$ controls the importance of the regularizer, which corresponds to the strictness of the unbiasedness assumption: the larger $\gamma$, the more similar the unwarping functions are forced to be to the identity function. Fig. 5(a) shows that our method is robust to changes of $\gamma$ over a wide range; only when $\gamma$ is too small (i.e., $\gamma = 10^{-3}$) does the estimation error increase, because the regularizer is then too weak to prevent over-registration.

The number of landmarks $L$ affects the representation power of our method. In Fig. 5(b), the lowest estimation error is achieved at $L = 20$. When $L$ is too small, our piecewise linear model is over-simplified and cannot fit complicated warping functions well; when $L$ is too large, (10) has too many variables and the update of the warping functions suffers from over-fitting.

*Figure 5. Illustration of robustness: relative estimation error as a function of (a) the regularizer weight $\gamma$, from $10^{-3}$ to $10$, and (b) the number of landmarks $L$, from 5 to 100.*

## 7. Conclusions and Future Work

We have proposed an alternating optimization method to learn parametric point processes from idiosyncratic observations, and demonstrated its justifiability and advantages relative to existing methods. Additionally, we have examined the influence of the stitching operation on the learning results and shown its potential benefits empirically. Our method has potential in many applications, including admission-record analysis and job-hopping behavior analysis. In the future, we plan to extend our method to more complicated point process models and to analyze the influence of the stitching operation theoretically.

## 8. Acknowledgments

This work was supported in part by DARPA, DOE, NIH, ONR and NSF (IIS-1717916, CMMI-1745382). We thank Yoav Zemel for discussions and comments about this work.

## References

Anderes, E., Borgwardt, S., and Miller, J. Discrete Wasserstein barycenters: optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2):389–409, 2016.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Bacry, E., Dayri, K., and Muzy, J.-F. Non-parametric kernel estimation for symmetric Hawkes processes. Application to high frequency financial data. The European Physical Journal B, 85(5):1–12, 2012.

Berndt, D. J. and Clifford, J. Using dynamic time warping to find patterns in time series. In KDD Workshop, 1994.

Bigot, J., Klein, T., et al. Consistent estimation of a population barycenter in the Wasserstein space. arXiv e-prints, 2012.

Cunningham, J., Ghahramani, Z., and Rasmussen, C. E. Gaussian processes for time-marked time-series data. In AISTATS, 2012.

Cuturi, M. and Blondel, M. Soft-DTW: a differentiable loss function for time-series. arXiv preprint arXiv:1703.01541, 2017.

Daley, D. J. and Vere-Jones, D. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.

Gervini, D. and Gasser, T. Self-modelling warping functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4):959–971, 2004.

Hawkes, A. G. and Oakes, D. A cluster process representation of a self-exciting process. Journal of Applied Probability, 11(3):493–503, 1974.

Herlands, W., Wilson, A., Nickisch, H., Flaxman, S., Neill, D., Van Panhuis, W., and Xing, E. Scalable Gaussian processes for characterizing multidimensional change surfaces. In AISTATS, 2016.

Isham, V. and Westcott, M. A self-correcting point process. Stochastic Processes and Their Applications, 8(3):335–347, 1979.

James, G. M. Curve alignment by moments. The Annals of Applied Statistics, pp. 480–501, 2007.

Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

Lewis, E. and Mohler, G. A nonparametric EM algorithm for multiscale Hawkes processes. Journal of Nonparametric Statistics, 1(1):1–20, 2011.
Liu, X. and Müller, H.-G. Functional convex averaging and synchronization for time-warped random curves. Journal of the American Statistical Association, 99(467):687–699, 2004.

Luo, D., Xu, H., Zhen, Y., Ning, X., Zha, H., Yang, X., and Zhang, W. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In IJCAI, 2015.

Mammen, E. Nonparametric estimation of locally stationary Hawkes processes. arXiv preprint arXiv:1707.04469, 2017.

Moeckel, R. and Murray, B. Measuring the distance between time series. Physica D: Nonlinear Phenomena, 102(3-4):187–194, 1997.

Muskulus, M. and Verduyn-Lunel, S. Wasserstein distances in the analysis of time series and dynamical systems. Physica D: Nonlinear Phenomena, 240(1):45–58, 2011.

Ogata, Y. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.

Panaretos, V. M. and Zemel, Y. Amplitude and phase variation of point processes. The Annals of Statistics, 44(2):771–812, 2016.

Potra, F. A. and Wright, S. J. Interior-point methods. Journal of Computational and Applied Mathematics, 124(1):281–302, 2000.

Ramsay, J. O. and Li, X. Curve registration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(2):351–363, 1998.

Roueff, F., Von Sachs, R., and Sansonnet, L. Locally stationary Hawkes processes. Stochastic Processes and their Applications, 126(6):1710–1743, 2016.

Sheather, S. J. and Jones, M. C. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), pp. 683–690, 1991.

Snelson, E., Ghahramani, Z., and Rasmussen, C. E. Warped Gaussian processes. In NIPS, 2004.

Snoek, J., Swersky, K., Zemel, R., and Adams, R. Input warping for Bayesian optimization of non-stationary functions. In ICML, 2014.

Srivastava, A., Wu, W., Kurtek, S., Klassen, E., and Marron, J. Registration of functional data using Fisher-Rao metric. arXiv preprint arXiv:1103.3817, 2011.

Tang, R. and Müller, H.-G. Pairwise curve synchronization for functional data. Biometrika, 95(4):875–889, 2008.

Wang, Y., Miller, D. J., Poskanzer, K., Wang, Y., Tian, L., and Yu, G. Graphical time warping for joint alignment of multiple curves. In NIPS, 2016.

Wasserman, L. All of Nonparametric Statistics. Springer Science+Business Media, New York, 2006.

Xiao, S., Farajtabar, M., Ye, X., Yan, J., Song, L., and Zha, H. Wasserstein learning of deep generative point process models. arXiv preprint arXiv:1705.08051, 2017.

Xu, H., Farajtabar, M., and Zha, H. Learning Granger causality for Hawkes processes. In ICML, 2016.

Xu, H., Luo, D., and Zha, H. Learning Hawkes processes from short doubly-censored event sequences. In ICML, 2017a.

Xu, H., Wu, W., Nemati, S., and Zha, H. Patient flow prediction via discriminative learning of mutually-correcting processes. IEEE Transactions on Knowledge and Data Engineering, 29(1):157–171, 2017b.

Yan, J., Zhang, C., Zha, H., Gong, M., Sun, C., Huang, J., Chu, S., and Yang, X. On machine learning towards predictive sales pipeline analytics. In AAAI, 2015.

Zemel, Y. and Panaretos, V. M. Fréchet means and Procrustes analysis in Wasserstein space. arXiv preprint arXiv:1701.06876, 2017.

Zhao, Q., Erdogdu, M. A., He, H. Y., Rajaraman, A., and Leskovec, J. SEISMIC: A self-exciting point process model for predicting tweet popularity. In KDD, 2015.
Zhou, K., Zha, H., and Song, L. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, 2013.