# Row-clustering of a Point Process-valued Matrix

Lihao Yin^{1,2}, Ganggang Xu^{3}, Huiyan Sang^{2}, Yongtao Guan^{3}

^{1}Institute of Statistics and Big Data, Renmin University, Beijing, China; ^{2}Department of Statistics, Texas A&M University, College Station, Texas; ^{3}University of Miami, Coral Gables, Florida. {lihao, huiyan}@stat.tamu.edu, {gangxu, yguan}@bus.miami.edu

**Abstract.** Structured point process data harvested from various platforms poses new challenges to the machine learning community. To cluster repeatedly observed marked point processes, we propose a novel mixture model of multi-level marked point processes for identifying potential heterogeneity in the observed data. Specifically, we study a matrix whose entries are marked log-Gaussian Cox processes and cluster rows of such a matrix. An efficient semi-parametric Expectation-Solution (ES) algorithm combined with functional principal component analysis (FPCA) of point processes is proposed for model estimation. The effectiveness of the proposed framework is demonstrated through simulation studies and real data analyses.

## 1 Introduction

Large-scale, high-resolution, and irregularly scattered event time data have attracted enormous research interest recently in many applications, including medical visiting records [13], financial transaction ledgers [28], and server logs [11]. Given a collection of event time sequences, one research thread is to identify groups displaying similar patterns. In practice, this task arises in many scenarios. For example, matching users with similar activity patterns on social media platforms is beneficial for ad recommendations; clustering patients by their visiting records may help predict the course of disease progression.

Our study is motivated by a dataset we collected from Twitter, which consists of the posting times of 500 university official accounts from April 15 to May 14, 2021. Figure 1 displays the posting time stamps of seven selected accounts over five consecutive days. While the daily posting patterns vary across accounts, the posting date also seems to play an important role. Specifically, all accounts cascade a barrage of postings on April 16th, while few postings appear on April 18th. Lastly, each posting is associated with a specific type of activity, namely, tweet, retweet, or reply. Our main interest is to cluster these multi-category, dynamic posting patterns into subgroups.

To characterize the highly complex posting patterns, we propose a mixture model of Multi-level Marked Point Processes (MM-MPP). We assume that the event sequences from each cluster are realizations of a multi-level log-Gaussian Cox process (LGCP) [16], which has been demonstrated to be useful for modeling repeatedly observed event sequences [28]. We extend that work to the case of mixture models and propose a semi-parametric Expectation-Solution algorithm to learn the underlying cluster structure. The proposed learning algorithm avoids iterative numerical optimizations within each ES step and hence is computationally efficient. In particular, the expectation step is carried out using MCMC samples based on the FPCA of the latent Gaussian processes, which imposes minimal parametric assumptions on the proposed model. Finally, we design an algorithm that can take advantage of array programming and GPU acceleration to further speed up computation.
Figure 1: The activities of selected accounts on Twitter.

## 2 Related Work

**Modelling of Event Sequences.** Point processes have been widely used to model temporal events [5], although existing work rarely focuses on repeatedly observed event sequences. One prominent example is the Hawkes process [8, 32, 33], which accounts for temporal dependence among events through a self-triggering mechanism. However, existing Hawkes processes may fail to describe our setting for two reasons. First, many human activities naturally exhibit day-to-day discontinuity, so it is unclear how to define a triggering mechanism across days with Hawkes processes. Second, the multivariate Hawkes process characterizes an overall rate of events across different days. Clustering methods based on the Hawkes process [15, 14, 30] are therefore more likely to distinguish individuals by their overall event frequency rather than by their intra-day behavioral patterns. In our motivating example, the event sequences exhibit variation at multiple levels, from both the individual and the day. One way to account for variations from multiple sources is to exploit Cox process models, whose intensities are modeled by latent random functions. One popular class of Cox processes is the log-Gaussian Cox process (LGCP) [16], whose latent intensity functions are assumed to be transformed Gaussian processes. Recently, [28] proposed a multi-level LGCP model to account for different sources of variation in repeatedly observed event data. However, clustering of repeatedly observed marked event time data was not considered in their work.

**Clustering of Event Sequences.** Extensive research has been done on this topic. To our knowledge, clustering models for point processes fall into two major categories: distance-based clustering [3, 4, 18] and distribution-based clustering [30, 15]. The former measures the closeness between event sequences based on some extracted features and then uses classical distance-based clustering algorithms such as k-means [4, 19] or EM algorithms [27]. The second approach, also referred to as model-based clustering, assumes that event sequences are generated from a parametric mixture model of point processes. One notable thread is the mixture of Hawkes point processes. For example, [30] proposed a Dirichlet mixture of Hawkes processes (DMHP) under the Expectation-Maximization (EM) framework to identify clusters. However, existing EM algorithms for event sequence clustering share a common issue: they typically require iterative numerical optimizations within each M-step, which drastically overburdens the computation. This computational issue is accentuated when event data are repeatedly observed and carry marks.

## 3 Model-based Row-clustering for a Matrix of Marked Point Processes

**Notation.** Suppose that we observe daily event sequences from $n$ accounts during $m$ days. For account $i$ on day $j$, let $N_{i,j}$ denote the total number of events, $t_{i,j,l} \in (0, T]$ denote the $l$-th event time stamp, and $r_{i,j,l} \in \{1, \dots, R\}$ denote the corresponding event type (mark). The activities of account $i$ on day $j$ can be summarized by a set $S_{i,j} = \{(t_{i,j,l}, r_{i,j,l})\}_{l=1}^{N_{i,j}}$, recording the time stamps and types of all $N_{i,j}$ events. This general notation can also describe other marked event sequences that are repeatedly observed on $m$ non-overlapping time slots. We represent the collection of all marked daily event sequences as an $n \times m$ matrix $S$, whose $(i,j)$-th entry is the marked event sequence $S_{i,j}$.
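To fix ideas, the following minimal Python sketch (ours, not from the paper's released code; all names and toy sizes are illustrative) shows one natural way to lay out the matrix $S$ in memory, with each entry a list of (time stamp, mark) pairs:

```python
import numpy as np

# A minimal sketch of the data layout in Section 3: S is an n-by-m matrix
# whose (i, j) entry is the marked event sequence
# S_{i,j} = {(t_{i,j,l}, r_{i,j,l})}_{l=1}^{N_{i,j}}.
n, m, R, T = 3, 2, 3, 24.0  # toy sizes; the Twitter data has n=500, m=30, R=3

rng = np.random.default_rng(0)

def random_marked_sequence(rate=5.0):
    """Draw a toy homogeneous Poisson sequence with i.i.d. marks in {1,...,R}."""
    n_events = rng.poisson(rate)
    times = np.sort(rng.uniform(0.0, T, size=n_events))
    marks = rng.integers(1, R + 1, size=n_events)
    return list(zip(times.tolist(), marks.tolist()))

# S[i][j] is the list of (time stamp, mark) pairs for account i on day j.
S = [[random_marked_sequence() for _ in range(m)] for _ in range(n)]

def S_r(S, i, j, r):
    """S^r_{i,j}: time stamps of type-r events for account i on day j."""
    return [t for (t, mark) in S[i][j] if mark == r]
```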
We aim to cluster the rows of $S$ to identify potential heterogeneity in account activity patterns, while taking into account the dependence across rows and columns so as to characterize the complex event patterns and interactions among accounts, days, and event types.

### 3.1 A Mixture of Multi-level Marked LGCP Model

Given a matrix of daily event sequences $S$, we can separate each matrix entry $S_{i,j}$ according to the marks (event types). Let $S^r_{i,j} = \{t_{i,j,l} \mid r_{i,j,l} = r\}$ be the collection of time stamps of event type $r \in \{1, \dots, R\}$. We model each $S^r_{i,j}$ by an inhomogeneous Poisson point process conditional on a latent intensity function $\lambda^r_{i,j}(t \mid \Lambda^r_{i,j}) = \exp\{\Lambda^r_{i,j}(t)\}$, where $\Lambda^r_{i,j}(t): [0, T] \mapsto \mathbb{R}$ is the random log-intensity function on $[0, T]$. Following [28], we assume a multi-level model for $\Lambda^r_{i,j}(t)$:

$$\Lambda^r_{i,j}(t) = X^r_i(t) + Y^r_j(t) + Z^r_{i,j}(t), \quad t \in [0, T], \quad (1)$$

for $i = 1, \dots, n$, $j = 1, \dots, m$ and $r = 1, \dots, R$. In model (1), $X^r_i(t)$, $Y^r_j(t)$ and $Z^r_{i,j}(t)$ are random functions on $[0, T]$, characterizing the account-level variation, the day-level variation, and the residual deviations, respectively. We also take into account the dependence across event types when modelling $X^r_i(t)$, $Y^r_j(t)$ and $Z^r_{i,j}(t)$, while assuming independence across accounts and days; that is, for any $(r, r')$, $X^r_i(t)$ and $X^{r'}_{i'}(t)$ are independent when $i \neq i'$, $Y^r_j(t)$ and $Y^{r'}_{j'}(t)$ are independent when $j \neq j'$, and $Z^r_{i,j}(t)$ and $Z^{r'}_{i',j'}(t)$ are independent if $(i,j) \neq (i',j')$.

We assume that $X_i(t) = \{X^r_i(t)\}_{r=1}^R$ is a mixture of multivariate Gaussian processes with $C$ components in order to detect heterogeneous clusters. We introduce a binary vector $\omega_i = \{\omega_{1,i}, \dots, \omega_{C,i}\}$ to encode the cluster membership of account $i$, where $\omega_{c,i} = 1$ if account $i$ belongs to the $c$-th cluster and $0$ otherwise. In analogy to other model-based clustering approaches, the unobserved cluster memberships $\omega_i$ are treated as missing data and assumed to follow a categorical distribution with parameter $\pi = \{\pi_1, \dots, \pi_C\}$, where $\pi_c$ is the probability that an account belongs to the $c$-th cluster. Conditional on $\pi$, we assume that the $X^r_i(t)$'s in different clusters have heterogeneous behavioral patterns, characterized by their corresponding cluster-specific multivariate Gaussian processes with mean functions $\mu^r_{x,c}(t) = \mathbb{E}[X^r_i(t) \mid \omega_{c,i} = 1]$ and cross-covariance functions $\Gamma^{r,r'}_{x,c}(s,t) = \mathrm{Cov}[X^r_i(s), X^{r'}_i(t) \mid \omega_{c,i} = 1]$, for $s, t \in [0, T]$ and $r, r' = 1, \dots, R$. Here, $\mu^r_{x,c}(t)$ characterizes the cluster-specific first-order intensity function, and $\Gamma^{r,r'}_{x,c}(s,t)$ describes the temporal dependence patterns within and across event types in the same cluster $c$, $c = 1, \dots, C$.

Similarly, we assume that $Y_j(t) = \{Y^r_j(t)\}_{r=1}^R$ and $Z_{i,j}(t) = \{Z^r_{i,j}(t)\}_{r=1}^R$ are both mean-zero multivariate Gaussian processes, accounting for the dependence of the day-level and residual random effects within and across event types, respectively. The covariance functions take the forms $\Gamma^{r,r'}_y(s,t) = \mathrm{Cov}[Y^r_j(s), Y^{r'}_j(t)]$ and $\Gamma^{r,r'}_z(s,t) = \mathrm{Cov}[Z^r_{i,j}(s), Z^{r'}_{i,j}(t)]$. As the heterogeneity patterns are assumed to be mainly explained by the account-level effect $X$, both $\Gamma^{r,r'}_y(s,t)$ and $\Gamma^{r,r'}_z(s,t)$ are assumed to be homogeneous across all clusters.

**A Single-level Special Case.** When $m = 1$, our data matrix $S$ has only one column of event sequences. The multi-level model in (1) reduces to a single-level model:

$$\lambda^r_{i,1}(t \mid \Lambda^r_{i,1}) = \exp\{\Lambda^r_{i,1}(t)\}, \quad \Lambda^r_{i,1}(t) = X^r_i(t), \quad t \in [0, T], \quad (2)$$

where $X_i(t) = \{X^r_i(t)\}_{r=1}^R$ has the same model specification as in the multi-level case described earlier.
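As a concrete illustration of model (2), the following sketch simulates one type-$r$ sequence for a single account: the latent Gaussian process $X^r_i$ is drawn on a fine grid and events are then generated from the discretized inhomogeneous Poisson process. The squared-exponential covariance and all parameter values are our illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(1)

T, D = 1.0, 200                        # time horizon and grid resolution
grid = np.linspace(0.0, T, D)
dt = T / D

def sq_exp_cov(s, t, sigma2=0.5, ell=0.1):
    """Squared-exponential covariance, a stand-in for Gamma^{r,r}_{x,c}(s,t)."""
    return sigma2 * np.exp(-((s - t) ** 2) / (2 * ell ** 2))

mu = np.full(D, 1.5)                   # cluster-specific mean mu^r_{x,c}(t)
Cov = sq_exp_cov(grid[:, None], grid[None, :]) + 1e-8 * np.eye(D)

# Draw the latent log-intensity Lambda^r_{i,1} = X^r_i and exponentiate.
X = rng.multivariate_normal(mu, Cov)
lam = np.exp(X)

# Discretized inhomogeneous Poisson sampling: event counts in each small bin.
counts = rng.poisson(lam * dt)
events = np.repeat(grid, counts)       # one realization of S^r_{i,1}
```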
We remark that this special case, which has also been studied in the literature [29], is still important to consider, as even in this simpler setting limited work has been done on the clustering of repeatedly observed marked point processes.

### 3.2 Likelihood Function

We denote the parameters concerning $X_i(t)$ in cluster $c$ by $\Theta_{x,c}$, and the parameters concerning $Y_j(t)$ and $Z_{i,j}(t)$ by $\Theta_y$ and $\Theta_z$, respectively. Therefore, the parameters in model (1) consist of $\Omega = \{\pi, \Theta_y, \Theta_z, \Theta_{x,c}, c = 1, \dots, C\}$. When $m = 1$, $\Omega = \{\pi, \Theta_{x,c}, c = 1, \dots, C\}$, representing the parameters involved in model (2). The complete data $D$ consists of the observed data $S$ and the unobserved latent variables $\{\omega_i\}_{i=1}^n$ and $\mathcal{L}$, where $\mathcal{L} = \{\{X_i(t)\}, \{Y_j(t)\}, \{Z_{i,j}(t)\}\}$ for model (1) and $\mathcal{L} = \{\{X_i(t)\}\}$ for model (2).

Let $S_i$ be the $i$-th row of $S$, representing the activities of the $i$-th account. In our mixture model, the probability of the observed data $S$ can be written as

$$p(S; \Omega) = \mathbb{E}_{\omega} \mathbb{E}_{\mathcal{L}} \left[ \prod_{i=1}^n \mathrm{PP}(S_i \mid \mathcal{L}) \;\Big|\; \{\omega_i\}_{i=1}^n; \Omega \right], \quad (3)$$

where the expectations are taken with respect to the conditional distributions of the latent variables $\mathcal{L}$ and the $\omega_i$'s, and $\mathrm{PP}(S_i \mid \mathcal{L})$ is the conditional probability of a Poisson point process,

$$\mathrm{PP}(S_i \mid \mathcal{L}) = \prod_{j=1}^m \prod_{r=1}^R \Big\{ \prod_{t \in S^r_{i,j}} \lambda^r_{i,j}(t \mid \Lambda^r_{i,j}) \Big\} \exp\Big( -\int_0^T \lambda^r_{i,j}(s \mid \Lambda^r_{i,j})\, ds \Big), \quad (4)$$

where, conditional on $\mathcal{L}$, $\Lambda^r_{i,j}(t)$ has the form in (1) for $m > 1$ and in (2) for $m = 1$.

## 4 Row-clustering Algorithms

Existing mixture model-based clustering methods typically rely on likelihood-based Expectation-Maximization algorithms [1] that treat the unobserved latent variables, $\{\omega_i\}_{i=1}^n$ and $\mathcal{L}$ in our case, as missing data. However, standard EM algorithms are computationally intractable for the models considered here. One computational bottleneck is the numerical optimization involved in the M-step, which requires many iterations due to the lack of closed-form solutions when updating parameters. Moreover, the computational burden is severely aggravated by the fact that the expectations in the E-step (see (3) for an example) involve an intractable multivariate integration. In Section 4.1, we describe a novel, efficient semi-parametric Expectation-Solution algorithm for the single-level model in (2) that bypasses the computational challenges described above. We then show in Section 4.2 that the learning task for the multi-level model in (1) can be transformed and solved by an algorithm similar to that for single-level models.

### 4.1 Learning of Single-level Models

The ES algorithm [6] is a general iterative approach to solving estimating equations involving missing data or latent variables. The algorithm proceeds by first constructing estimating equations based on a complete-data summary statistic, which may arise from a likelihood, a quasi-likelihood, or other generalized estimating equations. Similar to the EM algorithm, the ES algorithm then iterates between an expectation (E)-step and a solution (S)-step until convergence to obtain parameter estimates. The detailed steps of the general ES framework are included in Supplementary S.2. The EM framework is a special case of ES in which the estimating equations are constructed from the full likelihood and the complete data serve as the summary statistic. Due to the lack of a closed form for the likelihood function (3), we opt to design our algorithm under the more flexible and general ES framework for parameter estimation of the single-level model in (2), i.e., $m = 1$. The algorithm is summarized in Algorithm 1 and detailed below.
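Before the details, a schematic of the generic ES iteration may help. This is our sketch of the control flow that Algorithm 1 below instantiates, not the authors' implementation; the `e_step` and `s_step` callables stand in for equations (9) and (12)-(14) derived below.

```python
import numpy as np

def expectation_solution(S, init_params, e_step, s_step, max_iter=100, tol=1e-6):
    """Generic ES iteration: alternate conditional expectation and solution."""
    params = init_params
    for _ in range(max_iter):
        resp = e_step(S, params)       # E-step: E[w_{c,i} | S; params], shape (C, n)
        new_params = s_step(S, resp)   # S-step: solve the expected estimating equations
        # Simple convergence check on the mixing proportions pi.
        if np.max(np.abs(new_params["pi"] - params["pi"])) < tol:
            return new_params
        params = new_params
    return params
```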
As a preliminary, the expectation of the conditional intensity function given the cluster membership is

$$\rho^r_c(t) = \mathbb{E}[\lambda^r_{i,1}(t) \mid \omega_{c,i} = 1] = \exp[\mu^r_{x,c}(t) + \Gamma^{r,r}_{x,c}(t,t)/2]. \quad (5)$$

The second-order conditional intensity function takes the form

$$\rho^{r,r'}_c(s,t) = \mathbb{E}[\lambda^r_i(s) \lambda^{r'}_i(t) \mid \omega_{c,i} = 1] = \mathbb{E}\{\exp[X^r_i(s) + X^{r'}_i(t)] \mid \omega_{c,i} = 1\} = \rho^r_c(s)\, \rho^{r'}_c(t) \exp[\Gamma^{r,r'}_{x,c}(s,t)], \quad (6)$$

for $i = 1, \dots, n$ and $r, r' = 1, \dots, R$, where the last equality follows from the moment generating function of a Gaussian random variable.

**Estimating Equations.** We carefully construct estimating equations for the unknown parameters with three considerations in mind: (1) the expectation of the estimating equations over the complete data should be zero; (2) the conditional expectation of the estimating equations can be solved efficiently in the S-step; (3) the estimating equations should be fast to calculate. Let $K(\cdot)$ be a kernel function and $K_h(t) = h^{-1} K(t/h)$ with a bandwidth $h > 0$. We define

$$A^{r,r'}_c(s,t;h) = \sum_{i=1}^n \omega_{c,i}\, a^{r,r'}_i(s,t;h), \quad \text{where } a^{r,r'}_i(s,t;h) = \sum_{u \in S^r_i,\, v \in S^{r'}_i} \frac{K_h(s-u)\, K_h(t-v)}{n\, g(s;h)\, g(t;h)};$$

$$B^r_c(t;h) = \sum_{i=1}^n \omega_{c,i}\, b^r_i(t;h), \quad \text{where } b^r_i(t;h) = \sum_{u \in S^r_i} \frac{K_h(t-u)}{n\, g(t;h)}, \quad (7)$$

for $c = 1, \dots, C$ and $r, r' = 1, \dots, R$, where $g(x;h) = \int_0^T K_h(x-t)\, dt$. Using Campbell's Theorem [17] and the moment generating function of the normal distribution, it is straightforward to show that $\mathbb{E}[A^{r,r'}_c(s,t;h) \mid \omega] \approx \pi_c\, \rho^r_c(s)\, \rho^{r'}_c(t) \exp[\Gamma^{r,r'}_{x,c}(s,t)]$ and that $\mathbb{E}[B^r_c(t;h) \mid \omega] \approx \pi_c\, \rho^r_c(t)$, provided that $h$ is sufficiently small. This motivates us to consider the following estimating equations:

$$A^{r,r'}_c(s,t;h) - \pi_c\, \rho^r_c(s)\, \rho^{r'}_c(t) \exp[\Gamma^{r,r'}_{x,c}(s,t)] = 0,$$
$$B^r_c(t;h) - \pi_c\, \rho^r_c(t) = 0, \quad (8)$$
$$n^{-1} \sum_{i=1}^n \omega_{c,i} - \pi_c = 0.$$

**Expectation (E-step).** Given the observed data $S$ and a current parameter estimate $\Omega^*$, we calculate the conditional expectation of the estimating equations in (8). Note that, conditional on $S$, the three estimating equations are all linear in $\{\omega_{c,i}, c = 1, \dots, C, i = 1, \dots, n\}$. Therefore, the conditional expectations of the estimating equations are obtained by replacing $\omega_{c,i}$ with its conditional expectation $\mathbb{E}_\omega[\omega_{c,i} \mid S; \Omega^*]$, which has the form

$$\mathbb{E}_\omega[\omega_{c,i} \mid S; \Omega^*] = \frac{\pi^*_c\, f(S_i \mid \omega_{c,i} = 1; \Omega^*)}{\sum_{c'=1}^C \pi^*_{c'}\, f(S_i \mid \omega_{c',i} = 1; \Omega^*)}, \quad (9)$$

where $f(S_i \mid \omega_{c,i} = 1; \Omega^*) = \mathbb{E}_{\mathcal{L}}[\mathrm{PP}(S_i \mid \mathcal{L}) \mid \omega_{c,i} = 1; \Omega^*]$. Here $\mathrm{PP}(\cdot)$ is the conditional distribution function of $S_i$ given $\omega_{c,i}$ and $\Omega^*$, as defined in (4). We propose to approximate $f(S_i \mid \omega_{c,i} = 1; \Omega^*)$ by its Monte Carlo counterpart,

$$\hat f(S_i \mid \omega_{c,i} = 1; \Omega^*) = Q^{-1} \sum_{q=1}^Q \widehat{\mathrm{PP}}(S_i \mid X^{(q)}_c(t)), \quad (10)$$

where $Q$ is the Monte Carlo sample size, the $X^{(q)}_c(t)$'s are independent samples from the multivariate Gaussian process with parameters $\Theta^*_{x,c}$ (see details below), and $\widehat{\mathrm{PP}}(\cdot)$ is a numerical quadrature approximation of $\mathrm{PP}(\cdot)$ following [2]:

$$\widehat{\mathrm{PP}}(S_i \mid X(t)) = \exp\Big\{ \sum_{r=1}^R \sum_{u \in \tilde S^r_{i,1}} v_u \big[ y_u X^r(u) - \exp X^r(u) \big] \Big\}. \quad (11)$$

In the above, $\tilde S^r_{i,1}$ is the union of $S^r_{i,1}$ and a set of regular grid points, $v_u$ is the quadrature weight corresponding to each $u$, and $y_u = v_u^{-1} \delta_u$, where $\delta_u$ indicates whether $u$ is an observation ($\delta_u = 1$) or a grid point ($\delta_u = 0$).

**Solution (S-step).** In this step, we update the parameters by finding the solutions to the expected estimating equations from the E-step. For $c = 1, \dots, C$ and $r, r' = 1, \dots, R$, the solutions take the following closed forms:

$$\pi^*_c = n^{-1} \sum_{i=1}^n \mathbb{E}_\omega[\omega_{c,i} \mid S; \Omega^*], \quad (12)$$

$$\Gamma^{r,r'}_{x,c}(s,t) = \log \frac{\pi^*_c\, \mathbb{E}_\omega[A^{r,r'}_c(s,t;h) \mid S; \Omega^*]}{\mathbb{E}_\omega[B^r_c(s;h) \mid S; \Omega^*]\; \mathbb{E}_\omega[B^{r'}_c(t;h) \mid S; \Omega^*]}, \quad (13)$$

$$\mu^r_{x,c}(t) = \log\Big\{ \pi^{*-1}_c\, \mathbb{E}_\omega[B^r_c(t;h) \mid S; \Omega^*] \big/ \exp[\Gamma^{r,r}_{x,c}(t,t)/2] \Big\}. \quad (14)$$
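To make the pre-computation of the statistics in (7) concrete, here is a small sketch under the assumption of a Gaussian kernel $K$; the edge-correction $g(x;h)$ is evaluated by simple numerical integration. Function names and the quadrature resolution are our choices, not the paper's code.

```python
import numpy as np

def K_h(x, h):
    """Gaussian kernel with bandwidth h (an illustrative choice of K)."""
    return np.exp(-0.5 * (x / h) ** 2) / (h * np.sqrt(2 * np.pi))

def g(x, h, T, n_quad=400):
    """g(x; h) = int_0^T K_h(x - t) dt, via a simple Riemann approximation."""
    tq = np.linspace(0.0, T, n_quad)
    return K_h(x[:, None] - tq[None, :], h).mean(axis=1) * T

def b_i(events_r, grid, h, T, n):
    """b^r_i(t; h) = sum_{u in S^r_i} K_h(t - u) / (n g(t; h)) on the grid."""
    k = K_h(grid[:, None] - np.asarray(events_r)[None, :], h)
    return k.sum(axis=1) / (n * g(grid, h, T))

def a_i(events_r, events_rp, grid, h, T, n):
    """a^{r,r'}_i(s, t; h) on the grid, using the factorized form of Section 4.3:
    a product of two one-dimensional kernel sums rather than a double sum."""
    br = b_i(events_r, grid, h, T, 1)    # sum_u K_h(s - u) / g(s; h)
    brp = b_i(events_rp, grid, h, T, 1)  # sum_v K_h(t - v) / g(t; h)
    return np.outer(br, brp) / n
```

The factorization in `a_i` is what makes the pre-computation cost linear in the number of events per grid dimension, matching the complexity analysis in Section 4.3.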
**Sampling Strategies.** The multi-dimensional functional form of $X^{(q)}_c$ renders the sampling procedure in (10) intractable. Given $\Theta_{x,c}$, one solution is to find a low-rank representation of $X_i$ with functional principal component analysis (FPCA) [20]. Specifically, we approximate the latent Gaussian process $X_i$ in cluster $c$ nonparametrically using the Karhunen-Loève expansion [25] as $X^r_i(t) = \mu^r_{x,c}(t) + \xi_i^{\mathsf{T}} \phi^r(t)$, for $r = 1, \dots, R$, where $\xi_i$ is a vector of normal random variables and $\phi^r(t)$ is a vector of orthogonal eigenfunctions. Using FPCA, we can obtain samples of $X_i$ by sampling $\xi_i$ instead; a code sketch at the end of this subsection illustrates the idea. A more detailed sampling procedure via FPCA is given in Supplementary S.2.

**Remarks.** The most significant advantage of our method is that it avoids expensive iterations inside each E-step and S-step, unlike other EM algorithms for mixture point process models [30]. The elements $a^{r,r'}_i(s,t;h)$ and $b^r_i(t;h)$ in (7) can be pre-calculated before the E-S iterations to save computation. Moreover, the S-step is fast to execute thanks to the closed-form solutions. We analyze the overall computational complexity of the learning algorithm in Section 4.3.

**Algorithm 1** Learning of the single-level model in (2)
- **Input:** $S = \{S_i\}_{i=1}^n$, the number of clusters $C$, the bandwidth $h$;
- **Output:** Estimates of the model parameters, $\hat\pi$, $\hat\mu^r_{x,c}(t)$, $\hat\Gamma^{r,r'}_{x,c}(s,t)$, for $c = 1, \dots, C$, $r, r' = 1, \dots, R$;
- Calculate the components $a^{r,r'}_i(s,t;h)$ and $b^r_i(t;h)$ given in (7);
- Initialize $\Omega^* = \{\pi^*, \Theta^*_{x,c}, c = 1, \dots, C\}$ randomly;
- **Repeat:**
  - **E-Step:** Calculate $\mathbb{E}_\omega[\omega_{c,i} \mid S; \Omega^*]$ as in (9); calculate $\mathbb{E}_\omega[A^{r,r'}_c(s,t) \mid S; \Omega^*]$ and $\mathbb{E}_\omega[B^r_c(t) \mid S; \Omega^*]$ as linear combinations of the $\mathbb{E}_\omega[\omega_{c,i} \mid S; \Omega^*]$'s;
  - **S-Step:** Update $\pi^*$, $\mu^r_{x,c}(t)$ and $\Gamma^{r,r'}_{x,c}(s,t)$ according to (12), (13) and (14);
- **Until** the convergence criteria are reached;
- Return $\hat\pi = \pi^*$, $\hat\mu^r_{x,c}(t) = \mu^r_{x,c}(t)$ and $\hat\Gamma^{r,r'}_{x,c}(s,t) = \Gamma^{r,r'}_{x,c}(s,t)$.

**Model Selection.** Algorithm 1 requires choosing a proper number of clusters $C$ and bandwidth $h$. In model-based clustering, one popular method for choosing the number of clusters is the Bayesian information criterion (BIC) [22], which can be readily computed for our model since the probability $f(S \mid \omega)$ is already calculated in each iteration. The choice of bandwidth $h$ also plays an important role in model estimation: a small $h$ may produce unstable clustering results, while a large $h$ would dampen the characteristics of each cluster. With the $a^{r,r'}_i(s,t;h)$'s and $b^r_i(t;h)$'s in (7) pre-calculated for different candidate values of $h$, we can adaptively choose the $h$ that maximizes the likelihood in each iteration in a computationally efficient manner.
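To make the sampling strategy above concrete, here is our sketch of drawing the Monte Carlo samples $X^{(q)}_c(t)$ needed in (10) via a truncated Karhunen-Loève expansion of a discretized covariance; the truncation level and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_fpca(mu, Cov, Q, n_comp=10):
    """Draw Q approximate GP sample paths on the grid underlying mu and Cov,
    keeping the leading n_comp principal components."""
    evals, evecs = np.linalg.eigh(Cov)            # eigen-decomposition of Cov
    order = np.argsort(evals)[::-1][:n_comp]      # leading components first
    lam = np.clip(evals[order], 0.0, None)        # guard tiny negative values
    phi = evecs[:, order]                         # discretized eigenfunctions
    xi = rng.standard_normal((Q, n_comp)) * np.sqrt(lam)  # FPC scores xi_i
    return mu[None, :] + xi @ phi.T               # shape (Q, D)
```

In the E-step, each sampled path is plugged into the quadrature approximation (11) to evaluate the Monte Carlo average (10).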
### 4.2 Learning of Multi-level Models

We now develop the learning algorithm for the multi-level model (1), assuming we repeatedly observe $R$ types of events from $n$ accounts on $m$ days with $m > 1$. Below, we transform the learning task for the multi-level model into a two-step procedure whose second step is mathematically equivalent to a single-level model and hence can be conveniently solved by an algorithm similar to Algorithm 1.

For a given account $i$, consider the aggregated event sequence $S^r_i = \cup_{j=1}^m S^r_{i,j}$ for each row of $S$ and event type $r$. If we assume a multi-level model for each $S^r_{i,j}$ as in (1), then conditional on the latent variables $\mathcal{L}$, $S^r_i$ is a superposition of $m$ independent Poisson processes and hence can be viewed as a new Poisson process with intensity function $\lambda^r_i(t \mid \mathcal{L}) = \sum_{j=1}^m \exp\{\Lambda^r_{i,j}(t)\}$.

We approximate the distribution of $S^r_i$ by a Poisson process with the marginal intensity function

$$\bar\lambda^r_i(t) = \mathbb{E}_{Y,Z}\{\lambda^r_i(t \mid \mathcal{L}) \mid X_i(t)\} = m \exp\{\bar X^r_i(t)\}, \quad (15)$$

where $\bar X_i = \{\bar X^1_i, \dots, \bar X^R_i\}$ is a new multivariate mixture Gaussian process with mean function $\bar\mu^r_{x,c}(t) = \mu^r_{x,c}(t) + \Gamma^{r,r}_y(t,t)/2 + \Gamma^{r,r}_z(t,t)/2$ and covariance function $\bar\Gamma^{r,r'}_{x,c}(s,t) = \Gamma^{r,r'}_{x,c}(s,t)$ if account $i$ belongs to cluster $c$; the mean shift follows from the Gaussian moment generating function, since $\mathbb{E}[e^{Y^r_j(t)}] = e^{\Gamma^{r,r}_y(t,t)/2}$ and $\mathbb{E}[e^{Z^r_{i,j}(t)}] = e^{\Gamma^{r,r}_z(t,t)/2}$. When $m$ is large, we expect the above approximation to be accurate.

Note that the model in (15) for the aggregated event sequence $S^r_i$ is inherently a single-level model. It allows us to separate the inference of the multi-level model in (1) into two steps: (Step I) learn the parameters in $\Theta_y$ and $\Theta_z$, and denote the estimated parameters by $\hat\Gamma^{r,r'}_y(s,t)$ and $\hat\Gamma^{r,r'}_z(s,t)$; (Step II) learn the clusters of the single-level model in (15) and estimate the parameters $\pi$, $\bar\mu^r_{x,c}$ and $\bar\Gamma^{r,r'}_{x,c}$. Afterwards, the parameters in $\Theta_{x,c}$ can be obtained via $\hat\mu^r_{x,c}(t) = \hat{\bar\mu}^r_{x,c}(t) - \hat\Gamma^{r,r}_y(t,t)/2 - \hat\Gamma^{r,r}_z(t,t)/2$ and $\hat\Gamma^{r,r'}_{x,c}(s,t) = \hat{\bar\Gamma}^{r,r'}_{x,c}(s,t)$.

For the learning task in Step I, [28] developed a semi-parametric algorithm for repeatedly observed event sequences. In analogy to their work, we propose a similar inference framework to estimate $\Theta_y$ and $\Theta_z$ in our mixture multi-level model (1) and provide the details in Supplementary S.1. For Step II, we resort to the single-level algorithm described in Section 4.1.

### 4.3 Computational Complexity and Acceleration

Assume that the training event sequences belong to $n$ accounts and $C$ clusters and are repeatedly observed on $m$ time slots. We also assume that the data contain $R$ types of events and that each sequence consists of $I$ time stamps on average. Let $Q$ be the sample size used in the Monte Carlo integration in (10). In the numerical implementation, we divide the interval $[0, T]$ into $D$ equally spaced grid points $\mathcal{D} = \{0 = u_1 < \dots < u_D = T\}$. In Step I, estimating $\Theta_y$ and $\Theta_z$ requires $O(nmR^2D^2)$ computations, according to [28]. The computational complexity of pre-calculating the $a^{r,r'}_i(s,t;h)$'s and $b^r_i(t;h)$'s in (7) for all $s, t \in \mathcal{D}$ is of order $O(nmR^2D^2)$ if we decompose $a^{r,r'}_i(s,t;h)$ as

$$a^{r,r'}_i(s,t;h) = \frac{1}{n} \left[ \sum_{u \in S^r_i} \frac{K_h(s-u)}{g(s;h)} \right] \left[ \sum_{v \in S^{r'}_i} \frac{K_h(t-v)}{g(t;h)} \right].$$

In Step II, each E-S iteration requires $O(CQR^3)$ computations for sampling and $O(nCIQR^2)$ for the other calculations. Therefore, the overall computational complexity is $O(R^2(nmD^2 + CQR + nCIQ))$. To further reduce computation, we use array programming and GPU acceleration to calculate the high-dimensional integration in the Monte Carlo EM framework [26] and thereby reduce the runtime of (9). The details are included in Supplementary S.2, and a numerical demonstration is given in Section 5.1.

## 5 Numerical Examples

We examine the performance of our MM-MPP framework for clustering event sequences via synthetic data examples and real-world applications, and compare the proposed method with two other state-of-the-art methods. The first competing method is the discrete Fréchet distance-based method (DF) of [18]; unlike other distance-based clustering methods, DF can characterize interactions among events. The second is a model-based clustering method based on the Dirichlet mixture of Hawkes processes (DMHP) of [30]. DMHP is chosen as a competitor because it accounts for complex point patterns while performing clustering and admits an efficient variational Bayesian inference algorithm under a nested EM framework.

### 5.1 Synthetic Data

**Setting.**
We generate synthetic data from the proposed mixture models of log-Gaussian Cox processes in (1) and (2), with $R = 5$ event types and daily time stamps residing in $[0, 2]$. We vary the number of clusters $C$ from 2 to 5 and set the number of accounts in each cluster to 500. We experiment with an increasing number of replicates ($m = 1$, 20 or 100) to check the convergence of our method. When $m = 1$, we generate event sequences from the single-level model in (2) without day-level variations; in this case, we compare the clustering results of DF and DMHP with those of the single-level model (MS-MPP). When $m = 20$ or 100, we generate data from the multi-level model in (1) and use the MM-MPP method to model the scenario where event sequences are repeatedly observed. The two competing methods, DF and DMHP, are not directly applicable to repeated event sequences; therefore, in this case, we concatenate $\{S^r_{i,j}\}_{j=1}^m$ sequentially into a new event sequence $S^r_i$ on $[0, mT]$ and then apply DF and DMHP to this new sequence. The detailed settings of the $X^r_i(t)$'s, $Y^r_j(t)$'s and $Z^r_{i,j}(t)$'s and other details of the synthetic data examples are given in Supplementary S.3.

**Results.** We evaluate the clustering performance of each method over 100 repeated experiments under each setting, using clustering purity [21] as the evaluation metric. Table 1 reports the average clustering purity of each method on the synthetic data. When $m = 1$, MS-MPP obtains the best clustering results in terms of purity consistently across different numbers of clusters. Especially when $C$ increases, in which case there is more overlap among clusters, the advantage of MS-MPP becomes more prominent. When $m = 20$ or 100, MM-MPP still significantly outperforms the two competitors. It is also noticeable that the performance of DF and DMHP in general deteriorates as $m$ increases, even though more repeated event sequences offer more information for clustering. One explanation is that both DF and DMHP may incur bias by ignoring the different sources of variation in repeatedly observed event times. Another reason may be that many existing Hawkes process models, such as DMHP, assume a triggering function that is constant over time, which may not be flexible enough to characterize the data generated from models (1) and (2).

Table 1: Clustering purity on synthetic data.

| C | DF (m=1) | DMHP (m=1) | MS-MPP (m=1) | DF (m=20) | DMHP (m=20) | MM-MPP (m=20) | DF (m=100) | DMHP (m=100) | MM-MPP (m=100) |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.597 | 0.537 | 0.831 | 0.536 | 0.513 | 0.947 | 0.532 | 0.522 | 0.988 |
| 3 | 0.514 | 0.466 | 0.767 | 0.465 | 0.423 | 0.902 | 0.477 | 0.394 | 0.967 |
| 4 | 0.443 | 0.421 | 0.714 | 0.422 | 0.356 | 0.874 | 0.436 | 0.285 | 0.944 |
| 5 | 0.379 | 0.354 | 0.675 | 0.351 | 0.298 | 0.835 | 0.333 | 0.276 | 0.919 |

Our code can be accessed at https://github.com/LihaoYin/MMMPP. To show the computational advantage of the proposed ES algorithm over the EM algorithm, Table 2 gives the computation times of the CPU-based EM, CPU-based ES, and GPU-based ES algorithms for 20 iterations when estimating model (1) with $n = 500$ or $1000$, $m = 20$, $R = 5$ and $C = 3$. In each iteration, 10,000 MCMC samples are drawn to approximate (10). Table 2 demonstrates that with GPU acceleration, the computation time of the proposed ES algorithm is reduced by roughly 20-fold in this scenario compared to the EM algorithm, which is not suitable for array programming [7].
Table 2: Running time (in seconds) on synthetic data.

| Method and device | n = 500 | n = 1000 |
|---|---|---|
| GPU-ES (RTX 8000 48G GPU) | 30.09 | 51.42 |
| CPU-ES (i7-7700HQ CPU) | 275.87 | 505.07 |
| CPU-EM (i7-7700HQ CPU) | 568.36 | 1105.46 |

### 5.2 Real-world Data

In this section, we apply our method to the following real-world datasets.

**Twitter Dataset.** The Twitter dataset consists of the postings of the official accounts of America's top 500 universities from April 15, 2021, to May 14, 2021. The dataset was scraped from Twitter with the API rtweet [12]. It involves three categories of postings (tweet, retweet, and reply), so $R = 3$ in this study. In total, the dataset contains $n = 500$ Twitter accounts over $m = 30$ consecutive days with 233,465 time stamps.

**Chicago City Taxi Dataset.** The City of Chicago has collected information on all taxi rides in the city since 2013 (https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew). Each trip record in the dataset consists of the driver's encrypted ID, pick-up/drop-off time stamps, and locations (in the form of latitude/longitude coordinates). We gathered the trips of 9,000 randomly selected taxi drivers from January 1 to December 31, 2016, yielding more than 19 million trip records. We mapped the pick-up coordinates to their corresponding zoning types according to the Chicago Zoning Map dataset (https://data.cityofchicago.org/), which divides the city into nine basic zoning districts (https://secondcityzoning.org/zones/), including Residence (R), Business (B), Commercial (C), Manufacturing (M), etc. For this dataset, we have $n = 9000$, $m = 366$, and $R = 9$.

**Credit Card Transaction Dataset.** The dataset contains 641,914 transaction records of 5,000 European credit card customers ($n = 5000$) during the period from January 1 to December 31, 2016 ($m = 366$). We applied the univariate model ($R = 1$) without event marks to this dataset.

As there are no ground-truth clustering labels, we evaluate and compare clustering stability using a measure called clustering consistency via $K$-trial cross-validation [23, 24]. The detailed definition of clustering consistency and other details of the real data examples are included in Supplementary S.4.

**Results.** We compare the performance of DF, DMHP, and MM-MPP in terms of clustering consistency on the three datasets with $K = 100$ trials. The results in Table 3 suggest that MM-MPP outperforms its competitors notably, demonstrating that our model better characterizes the posting patterns and offers more stable and consistent clustering than the other methods. Figure 2 shows the histograms of the number of learned clusters for each method. For the Twitter dataset, the median numbers of learned clusters are 3, 5, and 8 for MM-MPP, DMHP, and DF, respectively. Moreover, the distribution of the number of clusters from our method appears to be the least variable, indicating robustness in clustering. The robustness of our method may be partly attributed to the flexibility of the latent conditional intensity functions, which account for multi-level deviations within each account. In contrast, methods that fail to account for different sources of deviation may treat them as sources of heterogeneity and consequently produce more clusters.

Table 3: Clustering consistency on real-world datasets.

| Dataset | DF | DMHP | MM-MPP |
|---|---|---|---|
| Twitter | 0.096 | 0.275 | 0.394 |
| Credit Card | 0.102 | 0.331 | 0.378 |
| Chicago Taxi | 0.045 | 0.142 | 0.153 |
Figure 2: Histograms of the number of learned clusters. Left: Twitter dataset; Right: Credit Card dataset.

More stories can be told by the estimated posting patterns. Given the predicted membership of account $i$, $c_i = \arg\max_c \mathbb{E}_\omega[\omega_{c,i} \mid S; \hat\Omega]$, Figure 3 displays the estimated curves of $\hat\mu^r_{x,c}$ for tweet events ($r = 1$), retweet events ($r = 2$) and reply events ($r = 3$) with $C = 3$. Recall that $\hat\mu^r_{x,c}$ is interpreted as the baseline of the intensity functions. The figure shows three distinct activity modes for the selected Twitter accounts. The universities in cluster 1, marked by red curves in Figure 3, in general have a lower frequency of posting retweets and replies, especially during the daytime. This group includes some of the top-ranked universities in America, such as MIT, Harvard, and Stanford. In contrast, the accounts in cluster 2 are relatively more active in all three types of postings. We further find that many accounts in this cluster belong to mid-ranked universities.

Figure 3: Curves of $\hat\mu^r_{x,c}(t)$. Left: tweet events; Mid: retweet events; Right: reply events.

We further applied the proposed MM-MPP to the Chicago Taxi dataset. As suggested by BIC, the 9,000 taxi drivers are clustered into 9 groups, whose average daily pick-up log-intensity functions are illustrated in Figure 4(a). The taxi drivers are clustered not only by their pick-up frequency but also by their working schedules. For example, the black curve in Figure 4(a) corresponds to the most dominant group, which comprises 23.2% of the sample. Figure 4(b) displays the curves of the average log-intensity (the black line) and the log-intensity for each driver (gray lines) in the selected cluster. Figures 4(c-e) show the estimated $\hat\mu_x(t)$ for pick-ups in commercial, residence and manufacturing districts, respectively. While pick-up events are more likely to occur in commercial districts for this group during the daytime, drivers also tend to pick up passengers in residential districts in the morning and to appear in manufacturing districts in the afternoon. These patterns are consistent with the schedules of passengers who commute between home and work. More results and discussion of the credit card dataset are included in our Supplementary file.

Figure 4: Left: Overall log-intensities for all clusters; Right: Log-intensity for one selected cluster.

## 6 Concluding Remarks

We propose a mixture of multi-level marked point processes to cluster repeatedly observed marked event sequences. A novel and efficient learning algorithm is developed based on a semi-parametric ES algorithm. The proposed method is demonstrated to significantly outperform competing methods in simulation experiments and real data analyses. The current model only concerns events over temporal domains. However, the clustering of spatial patterns on 2- or 3-dimensional domains has also attracted much research interest [10, 31, 9]; extending the current model to such settings is an interesting topic for future research. This work has no foreseeable negative societal impacts, but users should be cautious when interpreting clustering results to avoid misleading conclusions.

**Acknowledgements.** We thank the anonymous reviewers for their constructive comments, which helped improve the manuscript significantly. Xu's research is supported by NSF grant SES-1902195, Guan's research is supported by NSF grant SES-1758575, and Sang's research is supported by NSF grant DMS-1854155.

## References

[1] Murray Aitkin and Donald B. Rubin. Estimation and hypothesis testing in finite mixture models.
Journal of the Royal Statistical Society: Series B (Methodological), 47(1):67-75, 1985.

[2] Mark Berman and T. Rolf Turner. Approximating point process likelihoods with GLIM. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(1):31-38, 1992.

[3] Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pages 359-370, 1994.

[4] Paul S. Bradley and Usama M. Fayyad. Refining initial points for k-means clustering. In ICML, volume 98, pages 91-99. Citeseer, 1998.

[5] Daryl J. Daley and David Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer, 2003.

[6] Michael Elashoff and Louise Ryan. An EM algorithm for estimating equations. Journal of Computational and Graphical Statistics, 13(1):48-65, 2004.

[7] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, et al. Array programming with NumPy. Nature, 585(7825):357-362, 2020.

[8] Alan G. Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of Applied Probability, pages 493-503, 1974.

[9] Kristian Bjørn Hessellund, Ganggang Xu, Yongtao Guan, and Rasmus Waagepetersen. Semi-parametric multinomial logistic regression for multivariate point pattern data. Journal of the American Statistical Association, pages 1-16, 2021.

[10] Anders Hildeman, David Bolin, Jonas Wallin, and Janine B. Illian. Level set Cox processes. Spatial Statistics, 28:169-193, 2018.

[11] Husna Sarirah Husin, Lishan Cui, Herny Ramadhani Husny Hamid, and Norhaiza Ya Abdullah. Time series analysis of web server logs for an online newspaper. In Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, pages 1-4, 2013.

[12] Michael W. Kearney. rtweet: Collecting and analyzing Twitter data. Journal of Open Source Software, 4(42):1829, 2019.

[13] Thomas A. Lasko. Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 2014, page 469. NIH Public Access, 2014.

[14] Liangda Li and Hongyuan Zha. Dyadic event attribution in social networks with mixtures of Hawkes processes. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1667-1672, 2013.

[15] Dixin Luo, Hongteng Xu, Yi Zhen, Xia Ning, Hongyuan Zha, Xiaokang Yang, and Wenjun Zhang. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[16] Jesper Møller, Anne Randi Syversveen, and Rasmus Plenge Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451-482, 1998.

[17] Jesper Møller and Rasmus Plenge Waagepetersen. Statistical Inference and Simulation for Spatial Point Processes. CRC Press, 2003.

[18] Tao Pei, Xi Gong, Shih-Lung Shaw, Ting Ma, and Chenghu Zhou. Clustering of temporal event processes. International Journal of Geographical Information Science, 27(3):484-510, 2013.

[19] Jie Peng and Hans-Georg Müller. Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions. Annals of Applied Statistics, 2(3):1056-1077, 2008.
[20] J. O. Ramsay and B. W. Silverman. Principal components analysis for functional data. In Functional Data Analysis, pages 147-172. Springer, 2005.

[21] Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. Introduction to Information Retrieval. Cambridge University Press, 2008.

[22] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464, 1978.

[23] Robert Tibshirani and Guenther Walther. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511-528, 2005.

[24] Ulrike von Luxburg. Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3):235-274, 2009.

[25] Satosi Watanabe. Karhunen-Loève expansion and factor analysis: Theoretical remarks and applications. In Transactions of the 4th Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes, pages 635-660, 1965.

[26] Hong-Zhong Wu, Jun-Jie Zhang, Long-Gang Pang, and Qun Wang. ZMCintegral: A package for multi-dimensional Monte Carlo integration on multi-GPUs. Computer Physics Communications, 248:106962, 2020.

[27] Weichang Wu, Junchi Yan, Xiaokang Yang, and Hongyuan Zha. Discovering temporal patterns for event sequence clustering via policy mixture model. IEEE Transactions on Knowledge and Data Engineering, 2020.

[28] Ganggang Xu, Ming Wang, Jiangze Bian, Hui Huang, Timothy Burch, Sandro Andrade, Jingfei Zhang, and Yongtao Guan. Semi-parametric learning of structured temporal point processes. Journal of Machine Learning Research, 2020.

[29] Ganggang Xu, Chong Zhao, Abdollah Jalilian, Rasmus Waagepetersen, Jingfei Zhang, and Yongtao Guan. Nonparametric estimation of the pair correlation function of replicated inhomogeneous point processes. Electronic Journal of Statistics, 14(2):3730-3765, 2020.

[30] Hongteng Xu and Hongyuan Zha. A Dirichlet mixture model of Hawkes processes for event sequence clustering. arXiv preprint arXiv:1701.09177, 2017.

[31] Fan Yin, Guanyu Hu, and Weining Shen. Analysis of professional basketball field goal attempts via a Bayesian matrix clustering approach. arXiv preprint arXiv:2010.08495, 2020.

[32] R. Zhang, C. Walder, M.-A. Rizoiu, and L. Xie. Efficient non-parametric Bayesian Hawkes processes. In IJCAI International Joint Conference on Artificial Intelligence, 2019.

[33] Feng Zhou, Zhidong Li, Xuhui Fan, Yang Wang, Arcot Sowmya, and Fang Chen. Efficient inference for nonparametric Hawkes processes using auxiliary latent variables. Journal of Machine Learning Research, 21(241):1-31, 2020.