Interaction Point Processes via Infinite Branching Model

Peng Lin, Bang Zhang, Ting Guo, Yang Wang, Fang Chen
NICTA, Australian Technology Park, 13 Garden Street, Eveleigh NSW 2015, Australia
School of Computer Science and Engineering, The University of New South Wales, Australia
{peng.lin, bang.zhang, ting.guo, yang.wang, fang.chen}@nicta.com.au

Abstract

Many natural and social phenomena can be modeled by interaction point processes (IPPs) (Diggle et al. 1994), stochastic point processes that consider the interaction between points. In this paper, we propose the infinite branching model (IBM), a Bayesian statistical model that can generalize and extend some popular IPPs, e.g., the Hawkes process (Hawkes 1971; Hawkes and Oakes 1974). It treats an IPP as a mixture of basis point processes with the aid of a distance dependent prior over branching structure that describes the relationship between points. The IBM can estimate point event intensity, interaction mechanism and branching structure simultaneously. A generic Metropolis-within-Gibbs sampling method is also developed for model parameter inference. Experiments on synthetic and real-world data demonstrate the superiority of the IBM.

Introduction

The evolution of our world can be regarded as a series of events, many of which are generally non-independent. One event may cause or repel the occurrence of others. Examples can be readily found in various areas. For instance, many biological phenomena compete for local resources and hence exhibit spatial over-dispersion. Strong clustering patterns are often observed by seismologists (Marsan and Lengliné 2008) and epidemiologists (Yang and Zha 2013), as earthquakes and epidemics are well-known diffusible events. Buy and sell trades in financial markets also arrive in clusters (Hewlett 2006).
Information propagation in social networks shows contagious and clustering traits (Yu, Xie, and Sanner 2015). All these events exhibit strongly interactive behavior. Understanding their characteristics can help us categorize, predict and manipulate these events, thereby making positive impacts in our physical and social world.

Despite the high diversity of the aforementioned areas, there are three common tasks for understanding these interactive events: (1) Event intensity estimation, which aims at predicting the number of events for a specific time period. It helps to gain insight into the temporal trends in events. (2) Interaction mechanism estimation, which tries to reveal the triggering or repelling mechanism of events. It provides informative hints for dissemination control and influence manipulation. (3) Branching structure¹ estimation, in which the relationship between events is inferred. It helps to determine the connection of events, understand the underlying causal structure and support event grouping. These three tasks are entangled with one another, making the overall problem complex. As a result, most existing approaches only consider one or two of these tasks.

Stochastic point processes (Vere-Jones 1988) provide a generic yet adaptable tool for modeling series of events occurring at random locations and times. A point process considers a random collection of points falling in some space. When modeling purely temporal events, each point represents the time of an event and the space in which the points fall is simply a portion of the real line. A variety of point processes has been developed with distinct modeling purposes. In this paper, we mainly focus on interaction point processes (IPPs) (Diggle et al. 1994; Ripley 1977) that model not only the generation of points but also their interactions.

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Specifically, a Bayesian statistical model, which can generalize and extend some popular IPPs, e.g., the Hawkes process (Hawkes 1971; Hawkes and Oakes 1974), is proposed with the consideration of the aforementioned three tasks.

Many statistical methods exist for modeling events in spatial and temporal space, and most of them make an exchangeability assumption on a certain component of the overall model (Orbanz and Roy 2013). For instance, random walk models (Bacallado et al. 2013) and the infinite hidden Markov model (Beal, Ghahramani, and Rasmussen 2001), which are widely used for discrete time series, assume Markov exchangeability (Diaconis and Freedman 1980; Zabell 1995), implying that the joint probability only depends on the initial state and the number of transitions. The Lévy process, the continuous-time analog of the random walk and the foundation of many other widely used continuous-time models, assumes exchangeability over increments. Another popular example is the dependent Dirichlet process (DDP) (MacEachern 1999), which is marginally exchangeable (Orbanz and Roy 2013), meaning that the DDP is a random measurable mapping whose output is an exchangeable random structure. However, in general, the observed events in spatial and temporal space are seldom exchangeable, especially in their dependencies.

The distance dependent Chinese restaurant process (ddCRP) (Blei and Frazier 2011) is a simple yet flexible class of distributions over partitions that allows non-exchangeability. It can be used for directly modeling dependencies between data points in infinite clustering models, and the dependencies can be across space and time. In this paper, we adapt the ddCRP as a prior over the branching structure of spatial and temporal events.

¹The formal definition of branching structure will be given later. It can be understood as the relationship between events for now.

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)
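The ddCRP draws a link for each point rather than a cluster label: a point links to another point with probability proportional to a decay function of their distance, or to itself with probability proportional to a self-link mass η, and clusters arise as the connected components of these links. A minimal sketch of this link-sampling step follows; the function name and the dict-free interface are our own illustration, not the paper's implementation.

```python
import numpy as np

def ddcrp_links(D, f, eta, rng=None):
    """Sample customer-to-customer links under a ddCRP(eta, f, D) prior.

    D is an (n x n) distance matrix. Customer i links to j (j != i) with
    probability proportional to f(D[i, j]) and to itself with probability
    proportional to eta. Clusters are the connected components of links.
    """
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    links = np.empty(n, dtype=int)
    for i in range(n):
        weights = f(D[i])      # decayed distance to every other customer
        weights[i] = eta       # self-link mass replaces the i-to-i entry
        links[i] = rng.choice(n, p=weights / weights.sum())
    return links
```

For branching structure in temporal data, a self-link corresponds to an immigrant point and a link to an earlier point corresponds to a parent-child (offspring) relation.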
With its support, a Bayesian statistical model is proposed that treats an IPP as a mixture of basis point processes (bPPs). It allows discovery of a potentially unbounded number of mixing bPPs, while simultaneously estimating the branching structure. We therefore call our approach the infinite branching model (IBM).

The IBM is also related to the infinite relational model (IRM). The IRM aims at inferring meaningful latent structure within an observed graph or network. An unbounded number of blocks of nodes with similar behavior can be automatically revealed with the support of the CRP prior on node partitions. The IBM, in contrast, is interested in discovering the implicit branching structure of a collection of spatial and temporal points based on their positions and distances in space. In terms of the adopted prior for partitions, the IBM can be regarded as a distance dependent version of the IRM, but for discovering latent branching structure in spatial and temporal space.

Interaction Point Processes

Interaction point processes (Diggle et al. 1994; Ripley 1977) are a broad range of stochastic point processes that can model various interaction mechanisms, e.g., the Hawkes process (Hawkes 1971; Hawkes and Oakes 1974), cascades of Poisson processes (Simma and Jordan 2010) and the Neyman-Scott process (Neyman and Scott 1958). The work in (Adams 2009) provides a brief summary of some popular IPPs. In this section, we use the Hawkes process as an illustration to introduce IPPs and show that many popular IPPs can be defined via a Poisson cluster process (Hawkes and Oakes 1974), inspired by which we can later define a mixture model with Poisson processes as bPPs for generalizing and extending these IPPs.

Hawkes Processes

The Hawkes process is one of the most general and flexible IPPs. Its formal definition can be given via a conditional intensity function. Let $X = \{t_i\}_{i=1}^{N}$ be a stochastic point process on temporal space, where $t_i \in \mathbb{R}$ indicates the time of a point.
The Hawkes process is a family of point processes with a conditional intensity function of the form

$\lambda(t) = \mu(t) + \sum_{t_i < t} \alpha_i \beta_i(t - t_i)$,

where $\mu(t)$ is the immigrant (background) intensity, $\alpha_i$ is the total offspring intensity of point $i$ and $\beta_i(\cdot)$ is its normalized offspring intensity. Building on this representation, the generative process of the IBM can be summarized as:

1. Sample immigrant intensity $\mu \sim \mathrm{Exponential}(\lambda_\mu)$.
2. Sample $t_1$ from $\mathrm{PP}(\mu)$, sample its total offspring intensity $\alpha_1 \sim \mathrm{Exponential}(\lambda_\alpha)$ and sample the inverse scale parameter for its normalized offspring intensity $\lambda_{\beta,1} \sim \mathrm{Gamma}(\alpha', \beta')$.
3. For $n > 1$:
(a) Sample $t_n > t_{n-1}$ from $\mathrm{PP}(\mu + \sum_{i=1}^{n-1} \alpha_i \beta_i(t - t_i))$.
(b) Sample point assignment $c_n \sim \mathrm{ddCRP}(\eta, f, D)$. It indirectly determines the cluster assignment and point types: $R(c_n)$, $R'(c_n)$ and $R''(c_n)$.
(c) If $t_n$ is an offspring, set $\alpha_n = \alpha_{R(c_n)}$ and $\lambda_{\beta,n} = \lambda_{\beta,R(c_n)}$. Otherwise, for a new cluster, sample its total offspring intensity $\alpha_{R(c_n)} \sim \mathrm{Exponential}(\lambda_\alpha)$ and sample the inverse scale parameter for its normalized offspring intensity $\lambda_{\beta,R(c_n)} \sim \mathrm{Gamma}(\alpha', \beta')$.

In the above, $\lambda_\mu$, $\lambda_\alpha$, $\alpha'$ and $\beta'$ are hyper-parameters, and $\mathrm{PP}(\cdot)$ denotes a Poisson process. Samples can be drawn from an inhomogeneous Poisson process by utilizing a thinning process, a point process variant of rejection sampling. Specifically, Ogata's modified thinning (Ogata 1981) can be used, as summarized in Algorithm 7.5.IV of (Vere-Jones 1988). The model can be readily simplified to mimic the traditional Hawkes process. Although we adopt an exponential distribution form for the normalized offspring intensity, other distribution forms or combinations can be used for modeling different interaction mechanisms, e.g., spatiotemporal interaction.

It has been noted that the CRP can be regarded as a special case of the ddCRP. As a result, the branching structure prior in the IBM becomes exchangeable in terms of clustering when the ddCRP prior degrades to the CRP, meaning that the probability of a point belonging to a cluster then only depends on the number of points already in the cluster.

Hierarchical Model

It is often desirable to discover latent hierarchical structure from data. For IPPs, it is beneficial to reveal the relationship between point clusters. For instance, finding similar clusters of buy and sell trades in a financial market can be insightful for devising trading strategies.
Hence, we extend the IBM to a hierarchical model in which similar point clusters can form a hyper-cluster sharing the same offspring intensity. For defining a concrete extension, we again assume $\mu$ and $\alpha$ are constant variables drawn from exponential distributions. This time, however, we let the normalized offspring intensity take a Weibull distribution form to show its capability of capturing different interaction mechanisms:

$\beta(t) = \frac{k_\beta}{\lambda_\beta} \left(\frac{t}{\lambda_\beta}\right)^{k_\beta - 1} e^{-(t/\lambda_\beta)^{k_\beta}}$.

The hierarchical model is described as follows:

1. Sample immigrant intensity $\mu \sim \mathrm{Exponential}(\lambda_\mu)$.
2. Sample $t_1$ from $\mathrm{PP}(\mu)$, sample its total offspring intensity $\alpha_1 \sim \mathrm{Exponential}(\lambda_\alpha)$ and sample the inverse scale parameter for its normalized offspring intensity $\lambda_{\beta,1} \sim \mathrm{Gamma}(\alpha', \beta')$.
3. For $n > 1$:
(a) Sample $t_n > t_{n-1}$ from $\mathrm{PP}(\mu + \sum_{i=1}^{n-1} \alpha_i \beta_i(t - t_i))$.
(b) Sample point assignment $c_n \sim \mathrm{ddCRP}(\eta, f, D)$. It indirectly determines the cluster assignment and point types: $R(c_n)$, $R'(c_n)$ and $R''(c_n)$.
(c) Sample hyper-cluster assignment $h_{R(c_n)} \sim \mathrm{CRP}(\gamma)$, where $\gamma$ is the concentration parameter of the CRP.
(d) If $R(c_n)$ belongs to an existing hyper-cluster, set $\alpha_n = \alpha_{h_{R(c_n)}}$ and $\beta_n = \beta_{h_{R(c_n)}}$. Otherwise, for a new hyper-cluster, sample its total offspring intensity $\alpha_h \sim \mathrm{Exponential}(\lambda_\alpha)$ and sample the scale parameter for its normalized offspring intensity $\lambda_{\beta,h} \sim \mathrm{InverseGamma}(\alpha', \beta')$.

In the above, $\lambda_\mu$, $\lambda_\alpha$, $\alpha'$, $\beta'$ and $k_\beta$ are hyper-parameters. It is worth noting that this hierarchical model extends the IBM in a similar way that the Chinese restaurant franchise (CRF) process (Teh et al. 2004) extends the CRP. The main difference is that the point clustering in our model is achieved via a ddCRP instead of a CRP, with the consideration of branching structure. This hierarchical model can automatically discover point clusters that share the same triggering scheme even when they are disjoint in spatiotemporal space.
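The inhomogeneous Poisson sampling in step 3(a) of either generative scheme relies on thinning. Below is a minimal sketch for the special case of a constant immigrant intensity $\mu$ and an exponential offspring kernel, a simplification of ours for illustration, not the paper's full Ogata procedure; it exploits the fact that for a decaying kernel the intensity just after the current time bounds the intensity until the next event.

```python
import numpy as np

def sample_hawkes_thinning(mu, alpha, beta, T, rng=None):
    """Thinning sampler for a Hawkes process on [0, T] with intensity
    lambda(t) = mu + sum_{t_i < t} alpha * beta * exp(-beta * (t - t_i))."""
    rng = np.random.default_rng(rng)
    points, t = [], 0.0
    while t < T:
        # With an exponential kernel the intensity decays between events,
        # so lambda evaluated at the current time is a valid upper bound.
        lam_bar = mu + sum(alpha * beta * np.exp(-beta * (t - ti)) for ti in points)
        t += rng.exponential(1.0 / lam_bar)   # candidate from the bounding rate
        if t >= T:
            break
        lam_t = mu + sum(alpha * beta * np.exp(-beta * (t - ti)) for ti in points)
        if rng.uniform() <= lam_t / lam_bar:  # accept with prob lambda(t)/lam_bar
            points.append(t)
    return points
```

With $\alpha < 1$ the branching ratio is subcritical and the sampler terminates with a finite point set almost surely.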
Inference with Generic Metropolis-within-Gibbs Sampling

The purpose of the inference is to estimate the posteriors of the branching structure, immigrant intensity and offspring intensity given observed points. Since this is not tractable analytically, we adopt a Markov chain Monte Carlo (MCMC) algorithm. Assume we have observed a set of points, $X = \{t_i\}_{i=1}^{N}$, over a time period $[0, T]$. We do not consider edge effects in this work, hence no point exists before time 0. As described in (Vere-Jones 1988), with the support of the local Janossy density, the likelihood function for a realization $X$ of a regular point process can be represented as:

$p(X) = \left[\prod_{i=1}^{N} \lambda(t_i)\right] \exp\left(-\int_0^T \lambda(t)\,dt\right)$, (3)

where $\lambda(t)$ denotes the conditional intensity function. Unlike the traditional Hawkes process, the conditional intensity function in the IBM can be written separately for immigrants and offspring. Furthermore, directly modeling the branching structure grants us the computational simplicity to decompose the likelihood function into independent parts:

$p(X|\mu, \alpha, \beta, C) = p(I|\mu, C) \prod_{i=1}^{N} p(O_i|\alpha_{R(c_i)}, \beta_{R(c_i)}, C)$, (4)

where $I$ represents the immigrants and $O_i$ denotes the offspring whose parent is point $i$. The likelihood functions for $I$ and $O_i$ can be written as:

$p(I|\mu, C) = \left[\prod_{t_j \in I} \mu(t_j)\right] \exp\left(-\int_0^T \mu(t)\,dt\right)$, (5)

$p(O_i|\alpha_{R(c_i)}, \beta_{R(c_i)}, C) = \left[\prod_{t_j \in O_i} \alpha_{R(c_i)}\, \beta_{R(c_i)}(t_j - t_i)\right] \exp\left(-\alpha_{R(c_i)} \int_{t_i}^{T} \beta_{R(c_i)}(t - t_i)\,dt\right)$. (6)

In some cases, in which the conditional distributions of the parameters are tractable, Gibbs sampling can be used for inference. Here, however, we present a generic Metropolis-within-Gibbs algorithm (Neal 2000) that works regardless of the specific form of the intensities. In the Metropolis-within-Gibbs approach, each inference iteration updates the parameters alternately, as Gibbs sampling does, while a Metropolis-Hastings step is used for each parameter's update. In order to update a parameter $w$, a proposal distribution $q(\cdot)$ is used to generate a candidate value $w'$. Its acceptance probability is defined as

$\min\left(1, \frac{q(w|w')\,\tau(w')}{q(w'|w)\,\tau(w)}\right)$,

where $\tau(\cdot)$ can be any un-normalized measure for the parameter $w$.
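The update scheme just described can be sketched as follows, assuming scalar parameters and a symmetric Gaussian random-walk proposal, so the $q$ terms in the acceptance ratio cancel. The function names and the dict-based interface are our own illustration of the generic scheme, not the paper's implementation.

```python
import numpy as np

def mh_update(w, log_tau, step, rng=None):
    """One Metropolis-Hastings update of a scalar parameter w.

    log_tau: log of any un-normalized target measure tau(w).
    The Gaussian random-walk proposal is symmetric, so only the
    ratio tau(w_new) / tau(w) enters the acceptance probability."""
    rng = np.random.default_rng(rng)
    w_new = w + step * rng.standard_normal()
    if np.log(rng.uniform()) < log_tau(w_new) - log_tau(w):
        return w_new
    return w

def mwg_sweep(params, log_taus, step, rng=None):
    """One Metropolis-within-Gibbs sweep: update each parameter in turn
    with an MH step, conditioning on the current values of the others."""
    rng = np.random.default_rng(rng)
    params = dict(params)
    for name in params:
        conditional = lambda v, n=name: log_taus[n](v, params)
        params[name] = mh_update(params[name], conditional, step, rng)
    return params
```

In the IBM, `log_tau` for each parameter would be the log of the corresponding prior times the relevant likelihood factor from the decomposed likelihood above.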
The second input of the min function is called the Hastings ratio. In the following, we give the Hastings ratio for each parameter's update in the IBM:

$A_\mu = \frac{p(\hat\mu)}{p(\mu)} \left[\prod_{t_j \in I} \frac{\hat\mu(t_j)}{\mu(t_j)}\right] \exp\left(\int_0^T \mu(t)\,dt - \int_0^T \hat\mu(t)\,dt\right)$, (7)

$A_{\alpha_{R(c_i)}} = \frac{p(\hat\alpha_{R(c_i)})}{p(\alpha_{R(c_i)})} \left(\frac{\hat\alpha_{R(c_i)}}{\alpha_{R(c_i)}}\right)^{|R''(c_i)|} \exp\left(\left(\alpha_{R(c_i)} - \hat\alpha_{R(c_i)}\right) B_{R(c_i)}\right)$, (8)

$A_{\beta_{R(c_i)}} = \frac{p(\hat\beta_{R(c_i)})}{p(\beta_{R(c_i)})} \left[\prod_{t_j \in R''(c_i)} \frac{\hat\beta_{R(c_i)}(t_j - t_{c_j})}{\beta_{R(c_i)}(t_j - t_{c_j})}\right] \exp\left(\alpha_{R(c_i)} U\right)$. (9)

For the above Hastings ratios, we assume the prior distributions of the parameters are independent. In Eq. 8 and Eq. 9, intermediate variables are defined for notational simplicity: $B_{R(c_i)} = \sum_{t_j} \int_{t_j}^{T} \beta_{R(c_i)}(t - t_j)\,dt$, $\hat{B}_{R(c_i)} = \sum_{t_j} \int_{t_j}^{T} \hat\beta_{R(c_i)}(t - t_j)\,dt$, and $U = B_{R(c_i)} - \hat{B}_{R(c_i)}$. Variables $\hat\mu$, $\hat\alpha_{R(c_i)}$ and $\hat\beta_{R(c_i)}$ indicate the candidate values drawn from proposal distributions, e.g., a Gaussian distribution.

For updating the branching structure variables, the branching structure prior defined by Eq. 2 is used as the proposal distribution. The conditional prior and the proposal distribution cancel when calculating the Hastings ratios, and only the likelihood ratio is left. There are three different cases for a branching structure variable update: (1) update from immigrant to offspring, (2) update from offspring to immigrant, and (3) change parent. For the first case, the Hastings ratio can be represented as:

$A^{I \to O}_{c_i} = \frac{\alpha_{R(\hat{c}_i)}\, \beta_{R(\hat{c}_i)}(t_i - t_{\hat{c}_i})}{\mu(t_i)} \left[\prod_{t_j \in R''(c_i)} V\right] \exp\left(\sum_{t_j \in R''(c_i)} W\right)$, (10)

where $V = \frac{\alpha_{R(\hat{c}_i)}\, \beta_{R(\hat{c}_i)}(t_j - t_{c_j})}{\alpha_{R(c_i)}\, \beta_{R(c_i)}(t_j - t_{c_j})}$ and $W = \alpha_{R(c_i)} B_{R(c_i)} - \alpha_{R(\hat{c}_i)} B_{R(\hat{c}_i)}$ are intermediate variables for notational simplicity. The first part of Eq. 10 represents the likelihood ratio for point $i$, and the second part represents the likelihood ratio for all of its offspring, indicated by $t_j \in R''(c_i)$. Similarly, we can obtain the Hastings ratio for the second case:

$A^{O \to I}_{c_i} = \frac{\mu(t_i)}{\alpha_{R(c_i)}\, \beta_{R(c_i)}(t_i - t_{c_i})} \left[\prod_{t_j \in R''(\hat{c}_i)} V\right] \exp\left(\sum_{t_j \in R''(\hat{c}_i)} W\right)$, (11)

where $V$ and $W$ are as defined before. The second part of Eq. 11 also represents the likelihood ratio for all the offspring of point $i$, which are indicated by $t_j \in R''(\hat{c}_i)$.
For the third case, we have the Hastings ratio:

$A^{O \to \hat{O}}_{c_i} = \frac{\alpha_{R(\hat{c}_i)}\, \beta_{R(\hat{c}_i)}(t_i - t_{\hat{c}_i})}{\alpha_{R(c_i)}\, \beta_{R(c_i)}(t_i - t_{c_i})} \left[\prod_{t_j \in R''(c_i)\,\triangle\,R''(\hat{c}_i)} V\right] \exp\left(\sum_{t_j \in R''(c_i)\,\triangle\,R''(\hat{c}_i)} W\right)$, (12)

where $V$ and $W$ are as defined before. For the second part of Eq. 12, we use the notation $t_j \in R''(c_i)\,\triangle\,R''(\hat{c}_i)$ to represent point $i$'s offspring that change clusters when $c_i$ is changed. Each of these Metropolis-Hastings updates can be performed several times before combining via Gibbs sampling. The Metropolis-within-Gibbs inference algorithm for the extended hierarchical IBM can be derived from the above derivations with the consideration of the hyper-cluster assignment.

Experiments

We conduct experiments on both synthetic and real-world data to evaluate the proposed IBM. State-of-the-art approaches are compared against it to demonstrate its superiority.

Method:  EMLL | MISD | BHawk | IBM(CRP) | IBM(Wind)
Diff:    0.46 | 0.41 | 0.45  | 0.40     | 0.36
Log Lik: 1063 | 917  | 1121  | 862      | 736

Table 1: Results of Diff and Log Lik.

Figure 1: Estimated branching structure matrices.

Synthetic Data

In this section, we use synthetic data generated from the traditional Hawkes process to evaluate the IBM. Two triggering kernels, exponential and Weibull, are used to generate the data. Immigrant intensities are set to 0.8 for both kernels. For each kernel, 130 synthetic temporal samples are generated on the time interval [0, 20]. The simplified IBM, with all points sharing the same offspring intensity, is applied to the first 100 samples. Both the traditional CRP and the ddCRP with a window decay function are adopted as branching structure priors for the IBM. Bayesian model averaging is applied to the estimated models obtained from the first 100 samples. The final model is used to (1) measure the difference between the true and estimated triggering kernels, and (2) measure the log-likelihood on the remaining 30 samples. A relative distance called Diff, defined by (Zhou, Zha, and Song 2013), is used to measure the difference between kernels.
Three state-of-the-art approaches are compared with the proposed method: the Hawkes process with expectation maximization on a lower bound of the log-likelihood function (EMLL) (Yan et al. 2013), model independent stochastic declustering (MISD) (Marsan and Lengliné 2008) and a Bayesian inference approach for the Hawkes process (BHawk) (Rasmussen 2013). The comparison results for Diff and log-likelihood are given in Table 1. As we can see, the IBM achieves the best performance for both Diff and log-likelihood. The ddCRP prior with window decay function outperforms the traditional CRP prior.

Besides, we select a synthetic sample from the exponential kernel to visualize and demonstrate the IBM's performance on branching structure estimation. A matrix called the branching structure matrix is used to demonstrate the estimation of the branching structure. Fig. 1 (1) and Fig. 1 (2) show the branching structure matrices for the ddCRP prior and the CRP prior respectively. For these matrices, column indices represent child points and row indices represent parent points. The element in row $i$ and column $j$ represents the estimated probability of the parent-child relationship $c_j = i$. Bright color in the matrices indicates higher probability.

Figure 2: Estimated point types.

Method: HPP  | SGCP | EMLL | MISD | BHawk | CPP  | IBM
MSE:    69.3 | 64.5 | 60.3 | 57.5 | 60.8  | 57.0 | 52.8
F1:     -    | -    | 0.70 | 0.75 | 0.71  | 0.76 | 0.79
SC:     -    | -    | -    | -    | -     | -    | 0.76

Table 2: Results of MSE and F1.

As we can see, both matrices show strong clustering behaviors. The ddCRP prior gives more clusters with fewer points in each cluster, while the CRP prior gives fewer clusters with more points in each cluster. Correspondingly, Fig. 2 (1) and Fig. 2 (2) show the results of point type estimation. In these figures, the vertical bars on the time line denote the simulated points and the lines show the overall intensity.
The circles on the bars indicate that the points are estimated as offspring, and the circles at higher positions indicate that the points are estimated as immigrants. As we can see in Fig. 2 (2), the CRP prior tends to underestimate the number of immigrants and exhibits strong rich-get-richer behavior. The ddCRP prior, shown in Fig. 2 (1), gives a more accurate estimation of the number of immigrants. Both priors can detect temporal clustering behaviors. The CRP prior tends to find coarser clusters, while the ddCRP prior tends to find finer clusters.

Real-world Application

For the real-world application, we apply our method to the water pipe failure prediction problem. Domain experts have observed that water pipe failures exhibit strong spatiotemporal clustering behaviors (Kleiner and Rajani 2001; Yan et al. 2013; Li et al. 2014; Lin et al. 2015; Li et al. 2015). One failure can cause other failures in adjacent spatiotemporal space. As a result, pipe failures can be categorized into two types: background failures that occur due to material fatigue or corrosion, and triggered failures that are caused by other failures. It is desirable for water utilities to accurately estimate both the type and amount of pipe failures.

In this experiment, we collected 922 failures from a metropolitan water supply network. The failures occurred in one district over 8 years. We treat each pipe failure as a point in spatiotemporal space. Hence, we can use our model for both failure type estimation and failure amount prediction. For the branching structure prior, we adopt a decay function considering both spatial and temporal distances:

$f(d_S, d_T) = \frac{I(d_S \le a_S)\, I(d_T \le a_T)}{2\pi\sigma^2} \exp\left(-\frac{d_S^2}{2\sigma^2}\right)$,

where $d_S$ and $d_T$ represent spatial and temporal distances respectively, and $\sigma$ and $\rho$ are pre-determined parameters. $I(\cdot)$ denotes an indicator function that returns 1 if the input condition is satisfied and 0 otherwise. $a_S$ and $a_T$ are constants that determine the window sizes for the spatial and temporal spaces.
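This decay function can be sketched as follows. The hard windows and the spatial Gaussian follow the description above; the temporal term $\exp(-d_T/\rho)$ is our own illustrative assumption, since the text only states that $\rho$ is a pre-determined parameter without giving its exact role.

```python
import numpy as np

def window_decay(d_s, d_t, sigma, rho, a_s, a_t):
    """Spatiotemporal decay for the ddCRP branching prior.

    Hard windows I(d_s <= a_s) and I(d_t <= a_t) zero out unrealistic
    parent-child links; inside the window, spatial distance decays as a
    Gaussian with scale sigma. The temporal factor exp(-d_t / rho) is an
    assumed form for illustration only."""
    in_window = (d_s <= a_s) & (d_t <= a_t)
    spatial = np.exp(-d_s**2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    temporal = np.exp(-d_t / rho)          # assumed temporal decay
    return np.where(in_window, spatial * temporal, 0.0)
```

The indicator windows act exactly as the hard constraints described next: any candidate parent outside the spatial or temporal window receives zero prior mass and is never proposed.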
They can be set by domain experts as hard constraints to quickly filter out unrealistic branching structures. The hierarchical IBM is used for modeling the failures. It can automatically discover the failure clusters that share similar failure triggering patterns.

In addition to EMLL, MISD and BHawk, the homogeneous Poisson process (HPP), the sigmoidal Gaussian Cox process (SGCP) (Adams, Murray, and MacKay 2009) and cascades of Poisson processes (CPP) (Simma and Jordan 2010) are also compared with the IBM for failure amount prediction. We use 4, 5, 6 and 7 years of data for training, and the obtained models are used to predict the amount of failures in the coming year. The mean square error (MSE) is used to measure the difference between the true and predicted failure amounts. For failure type categorization, the IBM, EMLL, MISD, BHawk and CPP are applied to all the failures to estimate their types. The F1 score is used to measure the performance. Additionally, we use the silhouette coefficient (SC) (Rousseeuw 1987) to measure the clustering performance of the hierarchical IBM's hyper-clusters. The spatial and temporal distances between the triggering and triggered failures are used to calculate the silhouette coefficient. The other approaches do not have the ability to discover the hidden hierarchical structure. The results for MSE, F1 and SC are shown in Table 2. As we can see, the proposed method outperforms the others for both MSE and F1 score, and it also achieves an accurate clustering on top of the failure clusters.

Conclusion and Future Directions

In this paper, we proposed the IBM, a Bayesian statistical model that generalizes and extends popular IPPs. It considers point intensity, interaction mechanism and branching structure simultaneously. The experimental results on both synthetic and real-world data demonstrate its superiority. There are also many potential avenues for future work.
It will be interesting to consider higher-order point interactions (Baddeley and Van Lieshout 1995), the connection between branching structure and causality measures of point processes (Kim et al. 2011), and the extension to multivariate IPPs.

Acknowledgement

National ICT Australia (NICTA) is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

Adams, R. P.; Murray, I.; and MacKay, D. J. C. 2009. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In ICML, 9-16. ACM.
Adams, R. P. 2009. Kernel methods for nonparametric Bayesian inference of probability densities and point processes. Ph.D. Dissertation.
Bacallado, S.; Favaro, S.; Trippa, L.; et al. 2013. Bayesian nonparametric analysis of reversible Markov chains. The Annals of Statistics 41(2):870-896.
Baddeley, A. J., and Van Lieshout, M. 1995. Area-interaction point processes. Annals of the Institute of Statistical Mathematics 47(4):601-619.
Beal, M. J.; Ghahramani, Z.; and Rasmussen, C. E. 2001. The infinite hidden Markov model. In NIPS, 577-584.
Blei, D. M., and Frazier, P. I. 2011. Distance dependent Chinese restaurant processes. The Journal of Machine Learning Research 12:2461-2488.
Diaconis, P., and Freedman, D. 1980. De Finetti's theorem for Markov chains. The Annals of Probability 115-130.
Diggle, P. J.; Fiksel, T.; Grabarnik, P.; Ogata, Y.; Stoyan, D.; and Tanemura, M. 1994. On parameter estimation for pairwise interaction point processes. International Statistical Review 99-117.
Hawkes, A. G., and Oakes, D. 1974. A cluster process representation of a self-exciting process. Journal of Applied Probability 493-503.
Hawkes, A. G. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1):83-90.
Hewlett, P. 2006.
Clustering of order arrivals, price impact and trade path optimisation. In Workshop on Financial Modeling with Jump Processes, Ecole Polytechnique, 6-8.
Kim, S.; Putrino, D.; Ghosh, S.; and Brown, E. N. 2011. A Granger causality measure for point process models of ensemble neural spiking activity. PLoS Computational Biology 7(3):e1001110.
Kleiner, Y., and Rajani, B. 2001. Comprehensive review of structural deterioration of water mains: statistical models. Urban Water 3(3):131-150.
Li, Z.; Zhang, B.; Wang, Y.; Chen, F.; Taib, R.; Whiffin, V.; and Wang, Y. 2014. Water pipe condition assessment: a hierarchical beta process approach for sparse incident data. Machine Learning 95(1):11-26.
Li, B.; Zhang, B.; Li, Z.; Wang, Y.; Chen, F.; and Vitanage, D. 2015. Prioritising water pipes for condition assessment with data analytics. OzWater.
Lin, P.; Zhang, B.; Wang, Y.; Li, Z.; Li, B.; Wang, Y.; and Chen, F. 2015. Data driven water pipe failure prediction: A Bayesian nonparametric approach. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM '15, 193-202. New York, NY, USA: ACM.
MacEachern, S. N. 1999. Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, 50-55.
Marsan, D., and Lengliné, O. 2008. Extending earthquakes' reach through cascading. Science 319(5866):1076-1079.
Neal, R. M. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9(2):249-265.
Neyman, J., and Scott, E. L. 1958. Statistical approach to problems of cosmology. Journal of the Royal Statistical Society. Series B (Methodological) 1-43.
Ogata, Y. 1981. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory 27:23-31.
Orbanz, P., and Roy, D. M. 2013. Bayesian models of graphs, arrays and other exchangeable random structures. arXiv preprint arXiv:1312.7857.
Rasmussen, J. G. 2013.
Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability 15(3):623-642.
Ripley, B. D. 1977. Modelling spatial patterns. Journal of the Royal Statistical Society. Series B (Methodological) 172-212.
Rousseeuw, P. J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53-65.
Simma, A., and Jordan, M. I. 2010. Modeling events with cascades of Poisson processes. In UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, 546-555.
Snoek, J.; Zemel, R.; and Adams, R. P. 2013. A determinantal point process latent variable model for inhibition in neural spiking data. In NIPS, 1932-1940.
Teh, Y. W.; Jordan, M. I.; Beal, M. J.; and Blei, D. M. 2004. Sharing clusters among related groups: Hierarchical Dirichlet processes. In NIPS.
Vere-Jones, D. 1988. An Introduction to the Theory of Point Processes. Springer Series in Statistics, Springer, New York.
Yan, J.; Wang, Y.; Zhou, K.; Huang, J.; Tian, C.; Zha, H.; and Dong, W. 2013. Towards effective prioritizing water pipe replacement and rehabilitation. 2931-2937.
Yang, S.-H., and Zha, H. 2013. Mixture of mutually exciting processes for viral diffusion. In ICML, 1-9.
Yu, H.; Xie, L.; and Sanner, S. 2015. The lifecycle of a YouTube video: Phases, content and popularity. In AAAI Conference on Web and Social Media.
Zabell, S. L. 1995. Characterizing Markov exchangeable sequences. Journal of Theoretical Probability 8(1):175-178.
Zhou, K.; Zha, H.; and Song, L. 2013. Learning triggering kernels for multi-dimensional Hawkes processes. In ICML, 1301-1309.