# Concurrent Multi-Label Prediction in Event Streams

Xiao Shou¹,², Tian Gao³, Dharmashankar Subramanian³, Debarun Bhattacharjya³, Kristin P. Bennett¹,²

¹ Department of Mathematical Sciences, Rensselaer Polytechnic Institute
² Department of Computer Science, Rensselaer Polytechnic Institute
³ Research AI, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Streams of irregularly occurring events are commonly modeled as a marked temporal point process. Many real-world datasets such as e-commerce transactions and electronic health records often involve events where multiple event types co-occur, e.g. multiple items purchased or multiple diseases diagnosed simultaneously. In this paper, we tackle multi-label prediction in such a problem setting, and propose a novel Transformer-based Conditional Mixture of Bernoulli Network (TCMBN) that leverages neural density estimation to capture complex temporal dependence as well as probabilistic dependence between concurrent event types. We also propose potentially incorporating domain knowledge in the objective by regularizing the predicted probability. To represent the probabilistic dependence of concurrent event types graphically, we design a two-step approach that first learns the mixture of Bernoulli network and then solves a least-squares semi-definite constrained program to numerically approximate the sparse precision matrix from a learned covariance matrix. This approach proves to be effective for event prediction while also providing an interpretable and possibly non-stationary structure for insights into event co-occurrence. We demonstrate the superior performance of our approach compared to existing baselines on multiple synthetic and real benchmarks.

## Introduction

Various types of human activities consist of irregularly occurring events over a period of time. For example, online customer transaction records involve purchases at a particular time for an account associated with an individual, and electronic health records (EHRs) keep track of a patient's health history, including diagnoses and treatments, throughout their life. Temporal point processes (TPPs) provide a suitable continuous-time mathematical tool for modeling event streams, where discrete events happen irregularly (Daley and Vere-Jones 2003). A classic approach to modeling event sequences as TPPs is the Hawkes process, in which a simple parametric form is used to capture temporal dependence among events (Hawkes 1971). In the past few years, many researchers have developed neural TPP models that have achieved fruitful results on standard benchmarks for predictive tasks, because neural networks are capable of capturing more complex dependencies (Du et al. 2016; Mei and Eisner 2016; Xiao et al. 2017; Upadhyay, De, and Gomez-Rodriguez 2018; Omi, Ueda, and Aihara 2019; Shchur, Biloš, and Günnemann 2019; Zuo et al. 2020; Zhang et al. 2020a; Shchur et al. 2020; Boyd et al. 2020; Gao et al. 2020; Gu 2021).

Many real-world applications involve event streams with concurrent labels, i.e. multiple labels that occur simultaneously in an event (for the recorded temporal granularity). For example, in the aforementioned applications of e-commerce and healthcare, multiple items can be purchased at the time of a transaction, and multiple diseases can be diagnosed during a single provider visit.
Importantly, concurrent event label occurrences can be highly correlated. For instance, Amazon's recommender system offers a "frequently bought together" option, and comorbidities such as type 2 diabetes and Alzheimer's disease in the elderly (Chatterjee and Mudher 2018) are very common in healthcare. Although many proposed neural TPPs excel at solving prediction problems, most are not directly applicable to concurrent multi-label event streams. A notable exception is a recent approach for modeling EHR data using attention-based neural TPP models (Enguehard et al. 2020). This class of model captures long-term, non-sequential dependencies of contexts (Bahdanau, Cho, and Bengio 2014) and also jointly models the dependence of time with the associated labels. As far as we are aware, however, existing approaches fail to provide meaningful structural dependence among coinciding event labels at a given timestamp, and are unable to incorporate such dependence for label prediction. We note that although there is some prior work on graphical representations of TPPs (Didelez 2008; Bhattacharjya, Subramanian, and Gao 2018), these are directed graphs that capture historical dependence when each event is associated with exactly one event label. In contrast, our approach involves an undirected graph for representing relations between concurrent labels; it is conceptually similar to prior work on time series graphs (Eichler 1999).

In this paper, we propose a new approach for modeling concurrent event labels using neural TPPs. Our main contributions are as follows:

- We formalize the multi-label prediction problem in event streams and propose a general framework for modeling concurrent labels in event streams. Crucially, our method allows for both complex temporal dependence as well as probabilistic dependence between concurrent labels.
- We enable potentially incorporating domain knowledge for the prediction task through an approach that regularizes the predicted probability from our model. This can improve model reliability whenever applicable.
- To offer meaningful insights, we propose a two-step procedure for discovering an undirected graphical structure among concurrent event labels and illustrate its effectiveness through a case study on transaction data.
- We conduct an extensive empirical investigation including ablation studies and demonstrate superior performance of our proposed model as compared to state-of-the-art baselines for next event multi-label prediction.

## Related Work and Background

### Temporal Point Processes

A marked temporal point process is a stochastic process that generates not only a timestamp but also a label associated with it: for a generated event sequence $E_l = \{(t_i, y_i)\}_{i=1}^{n_l}$, each event epoch is a tuple of a timestamp and its label. Each timestamp $t_i \in \mathbb{R}^+$ is the time of occurrence, and each label $y_i$ belongs to the label set $L$, whose cardinality is $M$. One common approach to characterize a TPP is through the conditional intensity function $\lambda^*(t)$. A classic form of the conditional intensity function is the Hawkes process (also called the self-exciting point process), which has been applied to model many phenomena in social networks, financial systems and Internet Protocol television (IPTV) systems (Hawkes 1971; Zhou, Zha, and Song 2013; Bacry, Mastromatteo, and Muzy 2015; Luo et al. 2015).
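To make the classical parametric form concrete, below is a minimal sketch (not taken from the paper) of an exponential-kernel Hawkes conditional intensity; the parameter names `mu`, `alpha`, `beta` and their values are illustrative assumptions.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Exponential-kernel Hawkes conditional intensity:
    lambda*(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))."""
    past = np.asarray([ti for ti in history if ti < t])
    return mu + alpha * np.sum(np.exp(-beta * (t - past)))

# Example: the intensity is elevated shortly after a burst of past events.
print(hawkes_intensity(5.0, history=[1.0, 4.2, 4.8]))
```

Each past event adds an exponentially decaying contribution to the intensity; this self-exciting behavior is what the neural models below generalize.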
### Neural Temporal Point Processes

The expressive power of neural networks has enriched the TPP literature. Researchers have applied deep neural networks to model TPPs since the recurrent marked temporal point process (RMTPP) (Du et al. 2016); a review of neural TPP models appeared recently (Shchur et al. 2021). Neural TPPs are able to capture more complex dependencies among events than their parametric counterparts. The main idea in RMTPP is to use an RNN (or one of its modern variants) to capture the historical dependency of the events via the history embedding or context. The history embedding $h_i$ for the $i$th event is modeled through a recurrent relation:

$$h_i = \psi(W^t t_i + W^y y_i + W^h h_{i-1} + b_h) \tag{1}$$

where $W^t$, $W^y$, $W^h$, $b_h$ denote the weights for the time and mark at the $i$th event and the weights and bias for the history embedding, respectively; $h_{i-1}$ is the history embedding at the $(i-1)$th event, and $\psi$ is an activation function. The conditional intensity function at the $i$th event can be modeled with an exponential intensity, $\lambda^*(t_i) = \exp(w(t_i - t_{i-1}) + v^\top h_{i-1} + b)$, where $w$, $v$ and $b$ are weights and a bias. The label probabilities are modeled independently from time and obtained via softmax:

$$p^*(y_i = m) = \frac{\exp(V^y_m h_{i-1} + b^y_m)}{\sum_{m'=1}^{M} \exp(V^y_{m'} h_{i-1} + b^y_{m'})} \tag{2}$$

where $V^y$ and $b^y$ are label weights and biases, and subscript $m$ represents the $m$th row of $V^y$ and the $m$th entry of $b^y$.

The neural Hawkes process extends the classical Hawkes process with neural networks so that it is self-modulating: past events can not only excite but also inhibit future events (Mei and Eisner 2016). Instead of modeling the instantaneous conditional intensity function, FullyNN directly models the cumulative intensity through a feed-forward network (Omi, Ueda, and Aihara 2019). Shchur, Biloš, and Günnemann (2019) propose intensity-free modeling of TPPs. This approach characterizes TPPs with inter-event times $\tau_i = t_i - t_{i-1} \in \mathbb{R}^+$. A mixture of log-normal distributions is used to capture the conditional density of $\tau_i$. The history dependence of $\tau$ can be modeled through a neural density network (Bishop 1994; Rezende and Mohamed 2015), in which the parameters depend on the history embedding.

### Transformers for Event Streams

Attention and transformer models have been used to model event data in recent years (Xiao et al. 2019; Zhang et al. 2020a; Zuo et al. 2020; Gu 2021). The self-attention mechanism, in this context, relates different event instances of a single sequence in order to compute a representation of the sequence. The architecture of transformers for TPPs consists of an embedding layer and a self-attention layer. In the Transformer Hawkes Process (THP) (Zuo et al. 2020), for example, the temporal embedding for $t_i$ in dimension $c$ is

$$[z(t_i)]_c = \begin{cases} \cos\!\big(t_i / 10000^{\frac{c-1}{d}}\big) & \text{if } c \text{ is odd} \\ \sin\!\big(t_i / 10000^{\frac{c}{d}}\big) & \text{if } c \text{ is even} \end{cases} \tag{3}$$

where $d$ is the dimension of the encoding and subscript $c$ denotes the $c$th dimension. The time embedding and one-hot encoded types are combined to form the embedded input $X$ to be fed into the attention module. The dot-product attention is computed as

$$S = A_s V, \qquad A_s = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_K}}\right) \tag{4}$$

where $Q$, $K$, $V$ are the query, key and value matrices; they are linear transformations of $X$. $A_s$ is the attention score matrix. The output $S$ is then fed into a pointwise feed-forward neural network (FFN) to learn a high-level representation of the sequence for modeling the conditional intensity function.
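As a minimal, self-contained sketch (not the authors' released code), Eq. 3 and Eq. 4 can be written in PyTorch as follows; the 1-based dimension index and the $\sqrt{d_K}$ scaling follow the standard scaled dot-product formulation, and causal masking over future events is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def temporal_encoding(t, d):
    """Sinusoidal temporal encoding of Eq. 3.
    t: event times of shape (batch, seq_len); returns (batch, seq_len, d)."""
    c = torch.arange(1, d + 1)                              # 1-based dimension index
    odd = (c % 2 == 1)
    exponent = torch.where(odd, (c - 1) / d, c / d)         # (c-1)/d if c odd, c/d if c even
    angles = t.unsqueeze(-1) / torch.pow(10000.0, exponent)
    return torch.where(odd, torch.cos(angles), torch.sin(angles))

def dot_product_attention(Q, K, V):
    """Eq. 4: A_s = softmax(Q K^T / sqrt(d_K)), S = A_s V."""
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5
    A_s = F.softmax(scores, dim=-1)
    return A_s @ V
```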
### Multi-Label Prediction

Multi-label classification is an extensively studied machine learning task (Tsoumakas and Katakis 2007). Many successful approaches have been proposed in the setting of tabular data (Li et al. 2016; Yang et al. 2019) and also for set prediction (Zhang, Hare, and Prugel-Bennett 2019; Xie et al. 2017), particularly in image classification. A recent study considers the use of an LSTM for multi-label classification in time series for fault detection (Zhang et al. 2020b). Enguehard et al. (2020) apply encoder-decoder models for multi-label prediction in TPPs without considering potential correlation among concurrently occurring event labels; we address this important aspect in our proposed approach.

Figure 1: Concurrent multi-label prediction in event streams with a transformer architecture. The history embeddings $H_i$ are used as features for next event label prediction. Two groups of labels are observed for each epoch; an example is a cryptocurrency transaction where group 1 consists of different actions and group 2 of coin types. The structure of the two groups may change over time.

## Task and Model Description

### Concurrent Multi-Label Prediction

We bridge point process models with multi-label classification and formalize concurrent multi-label prediction in event streams. Consider a concurrent multi-label event sequence dataset where each sequence $E_l$ consists of event epochs $\{(t_i, y_i)\}_{i=1}^{n_l}$, where $t_i$ is a timestamp and $y_i \in \{0,1\}^M$ is a binary $M$-dimensional vector. We formally define the problem as:

**Definition 1** Given a set of label candidates $Y = \{1, 2, \dots, M\}$ and a time-stamped event dataset with events of the form $(t_i, y_i)$, the multi-label prediction task in event streams aims to map any prior history $h_j = \{(t_i, y_i)\}_{i=1}^{j}$ to a subset of the label set as its next event labels $y_{j+1}$, for $j \in \{1, 2, 3, \dots, n_l - 1\}$.

An illustrative example is shown in Figure 1. While state-of-the-art TPP models exist for multi-class prediction, as modeled via Eq. 2, a simple augmentation of labels would result in an exponential number of classes. Multi-label classification models such as the conditional Bernoulli mixture model (Li et al. 2016) are, however, not suitable for our setting, because the underlying dynamics of event streams result in a non-i.i.d. pattern for each label occurrence. To capture both inter-epoch label interaction and intra-epoch label dependence, we propose a novel Transformer-based Conditional Mixture of Bernoulli Network (TCMBN) model for concurrent multi-label prediction in event streams.

Our approach embeds event epochs as temporal encodings (Zuo et al. 2020) through Eq. 3. However, we allow multi-hot encoded labels, combined with the temporal encoding, to form an embedded input $X = UY + Z$, where $Z \in \mathbb{R}^{d \times n_l}$ is the temporal encoding, $U \in \mathbb{R}^{d \times M}$ is a trainable weight matrix, and $Y \in \{0,1\}^{M \times n_l}$ stacks the multi-hot label encodings over a maximum sequence length $n_l$ (see the code sketch below). The label-label attention across epochs is captured by the dot-product attention via Eq. 4. The output after $B$ blocks of attention layers and FFN with residual connections for event epoch $i$, which we denote as $H_i$, can be considered a high-level representation of the past up to $t_i$.

The expressiveness of transformers for sequence modeling with position encoding has been thoroughly examined in prior work (Yun et al. 2019). With temporal encoding, the following holds:

**Theorem 1** Transformers with temporal encodings are universal approximators for any continuous sequence-to-sequence function with compact domain, i.e. they approximate any continuous function $f: X \to H$ with $\epsilon$ error w.r.t. the $p$-norm, where $1 \le p < \infty$ and $X, H \in \mathbb{R}^{d \times n_l}$.

The universal approximation property of the temporally encoded transformer for event sequence-to-sequence mapping is crucial in continuous time, which differentiates it significantly from its time series (or natural language sequence) counterpart.
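Continuing the sketch above, the embedded input $X = UY + Z$ and the history embeddings $H_i$ can be obtained, for example, with a standard PyTorch transformer encoder. The module below is an illustrative stand-in (assumed layer sizes, `nn.TransformerEncoder` in place of the paper's $B$ attention/FFN blocks) rather than the authors' implementation, and it reuses the `temporal_encoding` sketch from the previous section.

```python
import torch
import torch.nn as nn

class EventEncoder(nn.Module):
    """Form X = U Y + Z from multi-hot labels Y and temporal encodings Z,
    then produce history embeddings H with a causally masked encoder."""
    def __init__(self, num_labels, d_model, nhead=4, num_blocks=2):
        super().__init__()
        self.U = nn.Linear(num_labels, d_model, bias=False)    # trainable U in R^{d x M}
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_blocks)

    def forward(self, times, labels):
        # times: (batch, seq_len); labels: (batch, seq_len, M) multi-hot in {0, 1}
        Z = temporal_encoding(times, self.U.out_features)       # (batch, seq_len, d)
        X = self.U(labels.float()) + Z                          # embedded input X = UY + Z
        L = times.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        return self.blocks(X, mask=causal)                      # H_i attends only to epochs up to t_i
```

Each row $H_i$ of the output then serves as the feature for predicting the next epoch's labels, as described next.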
Learned history features can be used for any downstream task. Given $H_i$ for each event epoch, our problem reduces to a static multi-label prediction of the next epoch.

**Remark.** Given a set of label candidates $Y = \{1, 2, \dots, M\}$ and data points with history-embedded features $(H_i, y_{i+1})$, TCMBN aims to learn a classifier that maps each historical feature $H_i$ to a subset of the label set for prediction.

In order to distinguish the history between different input epochs at $t_i$ and $t_j$ with their respective embeddings $X_i$ and $X_j$, we show there exists a transformer that separates the two, which gives a feasibility guarantee for our multi-label classification problem; otherwise, any classifier would fail to distinguish two labels with identical features. The following is a consequence of Theorem 1.

**Theorem 2** There exists a transformer $g$ with temporal encoding that separates two points $(X_i, H_i)$ and $(X_j, H_j)$, i.e. $g(X)_i = H_i \neq H_j = g(X)_j$ for some $X_i \neq X_j$, where $X_i, X_j, H_i, H_j \in \mathbb{R}^d$.

While many multi-label neural classifiers are available (Read et al. 2021; Liu et al. 2021), we propose the marriage of neural density estimation (Bishop 1994; Rezende and Mohamed 2015) and a conditional Bernoulli mixture model (Li et al. 2016) for multi-label classification given a history embedding, due to the flexibility of neural mixture models and the non-diagonal covariance of a multivariate Bernoulli mixture with $K$ components ($K > 1$), which naturally encodes label correlation. The responsibility network $\pi$ and mean network $\mu$, which make up the conditional mixture of Bernoulli network (CMBN), have the following structure:

$$\pi = \mathrm{softmax}(\mathrm{MLP}_\pi(H_i)) \tag{5}$$

$$\mu_{m,k} = \mathrm{sigmoid}(\mathrm{MLP}_\mu(H_i))_{m,k} \tag{6}$$

where MLP denotes a multi-layer perceptron and the sigmoid is applied component-wise for all $K$ components, each with dimension $M$. In particular, for a 2-layer MLP we use the ELU activation (Clevert, Unterthiner, and Hochreiter 2015):

$$\mathrm{MLP}(H_i) := W_2\,\mathrm{ELU}(W_1 H_i + b_1) + b_2 \tag{7}$$

We use separate sets of weights and biases $W_1, W_2, b_1, b_2$ for the $\pi$ and $\mu$ networks, respectively, for flexible learning. We emphasize that $\pi$ and $\mu$ are functions of the history-embedded features and can be used to compute the covariance matrix in the sections to follow. Thus, with a focus on next event label prediction (time prediction can be tackled with either a mixture of log-normals (Shchur, Biloš, and Günnemann 2019) or RMTPP (Du et al. 2016)¹), our generative approach for concurrent multi-label prediction aims to optimize:

$$p(y_{i+1} \mid H_i) = \sum_{k=1}^{K} \pi_k \, \mathrm{BER}(y_{i+1}; \mu_{:,k}) \tag{8}$$

where $\mathrm{BER}(y; \mu_{:,k}) = \prod_{m=1}^{M} \mathrm{Ber}(y_m; \mu_{m,k})$; BER signifies the multivariate Bernoulli distribution and Ber the univariate Bernoulli parametrized by the $\mu_{m,k}$ network. Alternatively, our approach can be viewed as elegantly solving a neural-network-parameterized linear system of equations:

$$\mu(H_i)\,\pi(H_i) = y_{i+1} \tag{9}$$

¹ Results are given in the Appendix.
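The CMBN head of Eqs. 5-8 can be sketched as follows; the hidden size and the use of `nn.Sequential` two-layer ELU MLPs are assumptions for illustration, and the function below returns the per-epoch negative log-likelihood term that enters the training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMBNHead(nn.Module):
    """Responsibility network pi (Eq. 5) and mean network mu (Eq. 6),
    both conditioned on the history embedding H_i."""
    def __init__(self, d_model, num_labels, num_components, hidden=128):
        super().__init__()
        self.M, self.K = num_labels, num_components
        self.pi_net = nn.Sequential(nn.Linear(d_model, hidden), nn.ELU(),
                                    nn.Linear(hidden, self.K))
        self.mu_net = nn.Sequential(nn.Linear(d_model, hidden), nn.ELU(),
                                    nn.Linear(hidden, self.K * self.M))

    def forward(self, H):
        pi = F.softmax(self.pi_net(H), dim=-1)                            # (..., K)
        mu = torch.sigmoid(self.mu_net(H)).reshape(*H.shape[:-1], self.K, self.M)
        return pi, mu

def mixture_bernoulli_nll(pi, mu, y, eps=1e-8):
    """-log p(y | H) for the mixture of Bernoullis in Eq. 8."""
    y = y.float().unsqueeze(-2)                                            # (..., 1, M)
    log_ber = (y * torch.log(mu + eps)
               + (1 - y) * torch.log(1 - mu + eps)).sum(-1)                # log BER per component
    return -torch.logsumexp(torch.log(pi + eps) + log_ber, dim=-1)
```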
### Background Knowledge Injection

We enable potentially incorporating background knowledge into the concurrent multi-label prediction problem. Our approach differs from prior use of a semantic loss function for multi-class prediction, where only one class is predicted (Xu et al. 2018). In the multi-label classification setting, where the classes of events are not mutually exclusive, we consider a practical case involving exhaustive groups that comprise all class labels, such that within each group only $q$ events can happen. For example, $q = 1$ indicates mutually exclusive classes within each group. Such a partition into groups is not uncommon in disease management and trip itineraries, where diseases are categorized and flights have certain hubs and routes. In crypto-transactions, exactly one action in the transaction group will take place and simultaneously one type of coin will be traded at each epoch.

Let $\hat p_{c_j}$ denote the predicted probability for the $c$th class in the $j$th group (with a total of $o_j$ classes in group $j$ and $G$ groups). We propose the following background knowledge loss for an instance of prediction:

$$L_{BK} = \sum_{j=1}^{G} \Big( \sum_{c_j=1}^{o_j} \hat p_{c_j} - q \Big)^2 \tag{10}$$

This least-squares loss can be considered a soft regularization of the allocation of probabilities. Such a loss promotes our model to find the mode of the joint distribution $p(y_{i+1} \mid H_i)$. TCMBN seeks to minimize the following objective given $N$ sequences $E_l = \{(t_i, y_i)\}_{i=1}^{n_l}$ for $l \in \{1, 2, \dots, N\}$:

$$L_{LL} + \lambda L_{BK} = \sum_{l=1}^{N} \sum_{i} \big( -\log p^*(y_{i+1}) \big) + \lambda \sum_{l=1}^{N} \sum_{i} \sum_{j=1}^{G} \Big( \sum_{c_j=1}^{o_j} \hat p_{c_j, i+1, l} - q \Big)^2 \tag{11}$$

where $\lambda$ trades off between the negative log-likelihood loss and the background knowledge loss (Dash et al. 2022).

### Capturing Structure among Concurrent Labels

We propose a novel approach to learn the label structure as an undirected graph $G(i) = (V, E_i)$ at each epoch $t_i$, where the vertices $V$ are the labels and the edges $E_i$ encode label dependencies which may change over time, by numerically approximating the precision matrix of the multivariate Bernoulli distribution (Banerjee, El Ghaoui, and d'Aspremont 2008; Bai et al. 2019; Ravikumar, Wainwright, and Lafferty 2010).

| Dataset | # Classes | # Seqs. | Avg. Len. | Data type |
|---|---|---|---|---|
| Synthea | 232 | 2500 | 43 | Simulated EHR |
| Dunnhumby | 24 | 2500 | 93 | Real Transaction |
| MIMIC III | 169 | 6644 | 3 | Real EHR |
| Defi | 39 | 5000 | 27 | Real Finance |

Table 1: Properties of four benchmark event datasets.

Since we learn the covariance in closed form at each epoch from TCMBN, we can compute its inverse numerically. A more attractive option is to learn a sparse precision matrix, as in the multivariate Gaussian case (Friedman, Hastie, and Tibshirani 2008). In our setting, jointly approximating the precision matrix with a sparsity constraint while performing concurrent multi-label classification is challenging because of the symmetric positive definiteness of the matrix. Hence we propose a two-step procedure for algorithmic practicality, analogous to learning Gaussian graphical models (Giraud 2015):

**Step 1:** Minimize the negative log-likelihood term for event streams with TCMBN and obtain the learned covariance matrices in closed form:

$$\mathrm{Cov}(y_{i+1}) = \sum_{k=1}^{K} \pi_k \big[ \Sigma_k + \mu_{:,k} (\mu_{:,k})^\top \big] - E(y_{i+1}) E(y_{i+1})^\top \tag{12}$$

where $E(y) = \sum_{k=1}^{K} \pi_k \mu_{:,k}$ is the weighted mean of all the component means, and $\Sigma_k = \mathrm{diag}(\mu_{m,k}(1 - \mu_{m,k}))$ for $k \le K$ collects the variances of the univariate Bernoullis with means $\mu_{m,k}$. Note that all terms above depend on $H_i$; we drop this dependence to avoid notational clutter.

**Step 2:** For each learned covariance matrix $\hat C$, solve the following least-squares sparse approximation (LSSA) to obtain the estimated precision matrix:

$$\min_{P} \; \frac{1}{2} \| \hat C P - I \|_F^2 + \gamma \| F \odot P \|_1 \quad \text{s.t.} \; P \in S^n_{++} \tag{13}$$

where $P$ is the precision matrix to optimize, $I$ the identity matrix, $\mathbf{1}$ the matrix of ones, $F = \mathbf{1} - I$, $\gamma$ the shrinkage parameter, $\|\cdot\|_F$ the Frobenius norm, $S^n_{++}$ the set of symmetric positive definite matrices, and $\odot$ the Hadamard product of two matrices. Note that LSSA is a convex optimization problem and can be solved by efficient solvers such as cvxopt (Andersen et al. 2013). We emphasize that our approach, TCMBN-LSSA, is completely data-driven and capable of capturing the non-stationarity of labels (Trivedi et al. 2019).
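The two steps can be sketched as follows. The covariance of Eq. 12 is available in closed form from $\pi$ and $\mu$; for the LSSA program of Eq. 13 the paper uses cvxopt, while the sketch below uses cvxpy (an assumed substitute) for brevity, with `gamma` and the positive-definiteness margin `eps` as illustrative settings.

```python
import numpy as np
import cvxpy as cp

def mixture_covariance(pi, mu):
    """Closed-form covariance of the Bernoulli mixture (Eq. 12).
    pi: (K,) mixture weights; mu: (K, M) component means, for one epoch."""
    mean = pi @ mu                                                # E[y]
    second = sum(p * (np.diag(m * (1 - m)) + np.outer(m, m))      # E[y y^T]
                 for p, m in zip(pi, mu))
    return second - np.outer(mean, mean)

def lssa_precision(C_hat, gamma=0.1, eps=1e-6):
    """Least-squares sparse approximation (Eq. 13): sparse, symmetric
    positive-definite P with C_hat @ P close to I and an l1 penalty on
    the off-diagonal entries (F = 1 - I masks out the diagonal)."""
    M = C_hat.shape[0]
    P = cp.Variable((M, M), symmetric=True)
    F = np.ones((M, M)) - np.eye(M)
    objective = cp.Minimize(0.5 * cp.sum_squares(C_hat @ P - np.eye(M))
                            + gamma * cp.norm1(cp.multiply(F, P)))
    cp.Problem(objective, [P >> eps * np.eye(M)]).solve(solver=cp.SCS)
    return P.value
```

The nonzero off-diagonal pattern of the estimated $P$ then defines the edge set $E_i$ of the undirected graph at epoch $t_i$.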
## Experiments

### Model Setting, Baselines & Evaluation Metrics

We implement and train our model with PyTorch and report results using 64 Bernoulli mixture components for all experiments. The hyper-parameter $\lambda$ is chosen from {0.1, 0.01, 0.001} and is only used if domain knowledge is injected; otherwise it is set to 0. Further details and code are included in Appendix A of the supplementary material. We compare our model with encoder-decoder based TPP models (Enguehard et al. 2020) including RMTPP (Du et al. 2016), the intensity-free model (Shchur, Biloš, and Günnemann 2019), and attention-based models (Xiao et al. 2019). For the baselines, the best-performing results are selected from running a few default settings according to the package provided by the authors.² Some potential baselines are not suitable for the concurrent multi-label setting (Zhang et al. 2020a; Gu 2021; Xu, Farajtabar, and Zha 2016; Wu et al. 2019, 2020), while employing others (Wu et al. 2018; Li et al. 2017) would require significant, non-trivial alterations of the original models. We run a transformer model with a logistic regression head for each class (dubbed Transformer-LG), adapted from the Transformer Hawkes Process (Zuo et al. 2020), and an RNN model with CMBN (dubbed RNN-CMBN) for further comparison.

² https://github.com/babylonhealth/neuralTPPs

We evaluate the performance of the different models on the task of multi-label prediction on the test data using several evaluation metrics. Specifically, we consider the weighted ROC-AUC score (Biloš, Charpentier, and Günnemann 2019; Enguehard et al. 2020) as well as the weighted F1 score and the Hamming loss (Sorower 2010).

### Synthetic Concurrent Multi-Label Events

We generate two synthetic datasets, Poisson-MBN (PM) and Hawkes-MBN (HM), with $M = 5$ labels, where timestamps are generated using a Poisson process and an exponential-kernel Hawkes process, respectively. For each timestamp, we partition the 5 labels into 2 groups, one with 3 classes and the other with 2. For each class in each group, we count the number of historical occurrences of the class and normalize by their total sum; this is the probability with which each class generates a label for the next event epoch (a generator sketch is given at the end of this subsection). Thus, our generation procedure naturally induces long history dependence and interaction among labels in every epoch (both pairwise and third-order interactions). We generate 5 simulations, each consisting of a total of 1000 sequences, randomly split into 60-20-20 training-dev-test subsets. Further details regarding data generation are supplied in Appendix B.

**Results.** Table 2 compares 7 different models on PM and HM. Our model achieves the best performance over all baselines on all three metrics. Note that the encoder-decoder models do not incorporate potential correlation among labels at each timestamp. A close competitor is RNN-CMBN: while it is able to capture the correlation, it may miss the complex history dependence. Table 3 summarizes the performance of the models for each class. TCMBN consistently achieves much better ROC-AUC scores for all classes in both experiments compared to the baselines. Interestingly, all models predict better when the label interaction is pairwise, in Group 1 (classes 1 and 2), than when it is third order, in Group 2 (classes 3, 4 and 5).
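Below is a minimal sketch of the label-generation mechanism described above for the Poisson-MBN variant; the rate, sequence length, group assignment and the unit initialization of the historical counts are illustrative assumptions rather than the exact settings of Appendix B.

```python
import numpy as np

def generate_pm_sequence(length=50, rate=1.0, groups=((0, 1), (2, 3, 4)), seed=0):
    """Poisson timestamps; each class fires with probability proportional to
    its historical count within its group (counts initialized at one)."""
    rng = np.random.default_rng(seed)
    times = np.cumsum(rng.exponential(1.0 / rate, size=length))
    counts = np.ones(5)                                   # smoothed historical counts per class
    labels = np.zeros((length, 5), dtype=int)
    for i in range(length):
        for group in groups:
            idx = list(group)
            probs = counts[idx] / counts[idx].sum()       # normalized historical frequency
            labels[i, idx] = rng.binomial(1, probs)       # concurrent labels for this epoch
        counts += labels[i]
    return times, labels
```

Because the firing probabilities depend on the full count history, the labels at each epoch depend on all past epochs, which induces the long-range dependence the transformer encoder is meant to capture.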
### Real-World Concurrent Multi-Label Events

We consider the following 4 event datasets for the task of multi-label prediction. Their properties are summarized in Table 1. We incorporate domain knowledge only on the Defi dataset, where each event label consists of exactly one coin type and one transaction type.

**Synthea.** This is a simulated EHR dataset that closely mimics real EHR data (Walonoski et al. 2018). Each module generates a patient population with events that can occur in the medical history of a synthetic patient.³

³ https://github.com/synthetichealth/synthea

**Dunnhumby.** We extract this dataset from Kaggle's "Dunnhumby - The Complete Journey" dataset.⁴ We use the transaction file with household-level transactions over two years from a group of 2,500 households frequently shopping at a retailer. To roll up the item types, each item is mapped to its department category based on the product file.

⁴ https://www.kaggle.com/frtgnn/dunnhumby-the-completejourney

**MIMIC III.** The MIMIC III database provides patient-level de-identified health-related data associated with the Beth Israel Deaconess Medical Center between 2001 and 2012 (Johnson, Pollard, and Mark 2016; Johnson et al. 2016; Goldberger et al. 2000). We extract hospital admission records for each patient to include admission time and ICD-9 codes, which were mapped to CCS codes as labels.

**Defi.** This dataset provides user-level cryptocurrency trading history under a specific protocol called Aave.⁵ The data includes the timestamp, transaction type and coin type for each transaction. Coupled marks that concatenate the transaction and coin types are used as labels for each event.

⁵ aave.com

**Results.** Table 4 summarizes the performance of the 7 different models on the benchmarks. TCMBN achieves overall superior performance compared to all baselines. In particular, it achieves the best results on 6 out of 12 evaluations, and close-to-best performance on the other 6; none of the other models shows comparably consistent, robust performance on this task. A suggested strong baseline, GRU-CP (Enguehard et al. 2020), achieves noticeably inferior results compared to our model.

### Ablation I: Mixture Components

Figure 2 shows the effect of varying the number of mixture components. Overall, we did not observe dramatic changes in predictive performance when varying the number of mixture components. Most of the change is with respect to the Hamming loss, while all models perform roughly the same with respect to the weighted ROC-AUC and F1 scores. TCMBN with 32 mixture components appears to perform best, with our choice of 64 components next. The lack of noticeable variation in performance for neural mixtures of Bernoulli models differs from the classical EM-type mixture of experts, where performance is usually boosted with more components, as observed in tabular multi-label classification tasks (Li et al. 2016).
The optimization mechanism underlies this difference: the likelihood in the classical EM approach is bound to increase with more mixture components, while stochastic gradient descent in our neural models does not guarantee this increase. Nevertheless, even with a relatively small number of mixture components, our approach shows competitive predictive strength.

| Data | Metric | GRU-CP | GRU-RMTPP | GRU-LNM | GRU-ATTN | Transformer-LG | RNN-CMBN | TCMBN (ours) |
|---|---|---|---|---|---|---|---|---|
| PM | ROC-AUC | 0.751(0.012) | 0.733(0.026) | 0.724(0.028) | 0.669(0.038) | 0.568(0.013) | 0.752(0.009) | **0.764(0.011)** |
| PM | F1 | 0.676(0.005) | 0.651(0.015) | 0.662(0.012) | 0.639(0.009) | 0.620(0.005) | 0.677(0.007) | **0.686(0.004)** |
| PM | Hamming | 0.345(0.008) | 0.416(0.039) | 0.409(0.055) | 0.467(0.058) | 0.561(0.006) | 0.342(0.008) | **0.337(0.006)** |
| HM | ROC-AUC | 0.754(0.006) | 0.751(0.011) | 0.755(0.005) | 0.650(0.019) | 0.585(0.003) | 0.755(0.011) | **0.765(0.005)** |
| HM | F1 | 0.675(0.004) | 0.665(0.016) | 0.675(0.004) | 0.640(0.009) | 0.619(0.001) | 0.679(0.006) | **0.683(0.002)** |
| HM | Hamming | 0.345(0.013) | 0.378(0.041) | 0.338(0.006) | 0.506(0.011) | 0.563(0.002) | 0.342(0.006) | **0.336(0.013)** |

Table 2: Overall predictive performance on the PM & HM datasets as measured by weighted ROC-AUC, weighted F1 and Hamming loss. Mean values are shown along with standard deviations in parentheses. Best results are in bold.

| Data | Class | GRU-CP | GRU-RMTPP | GRU-LNM | GRU-ATTN | Transformer-LG | RNN-CMBN | TCMBN (ours) |
|---|---|---|---|---|---|---|---|---|
| PM | G1C1 | 0.771(0.016) | 0.774(0.018) | 0.780(0.016) | 0.769(0.019) | 0.580(0.024) | 0.775(0.016) | **0.783(0.018)** |
| PM | G1C2 | 0.778(0.023) | 0.780(0.022) | 0.786(0.020) | 0.777(0.017) | 0.581(0.020) | 0.778(0.013) | **0.791(0.019)** |
| PM | G2C1 | 0.728(0.019) | 0.691(0.058) | 0.636(0.086) | 0.584(0.082) | 0.560(0.009) | 0.733(0.011) | **0.738(0.015)** |
| PM | G2C2 | 0.732(0.012) | 0.678(0.054) | 0.682(0.064) | 0.574(0.064) | 0.554(0.016) | 0.728(0.013) | **0.747(0.020)** |
| PM | G2C3 | 0.724(0.012) | 0.704(0.017) | 0.685(0.057) | 0.552(0.081) | 0.551(0.016) | 0.722(0.010) | **0.740(0.009)** |
| HM | G1C1 | 0.786(0.014) | 0.790(0.010) | 0.791(0.011) | 0.793(0.015) | 0.602(0.011) | 0.783(0.017) | **0.799(0.011)** |
| HM | G1C2 | 0.781(0.014) | 0.784(0.018) | 0.786(0.016) | 0.781(0.017) | 0.598(0.010) | 0.780(0.017) | **0.790(0.016)** |
| HM | G2C1 | 0.722(0.014) | 0.714(0.015) | 0.720(0.016) | 0.520(0.016) | 0.571(0.013) | 0.730(0.008) | **0.735(0.010)** |
| HM | G2C2 | 0.729(0.013) | 0.713(0.023) | 0.729(0.012) | 0.531(0.031) | 0.570(0.015) | 0.730(0.013) | **0.741(0.013)** |
| HM | G2C3 | 0.725(0.009) | 0.719(0.011) | 0.720(0.012) | 0.524(0.021) | 0.569(0.009) | 0.729(0.011) | **0.734(0.010)** |

Table 3: ROC-AUC for each class on PM & HM. Best results are in bold. G: Group, C: Class.

Figure 2: TCMBN results with varying mixture components.

### Ablation II: Background Knowledge Constraint

We train TCMBN without the background knowledge loss on the synthetic datasets and Defi. A comparison of the two cases in Table 5 shows that prediction without the additional loss is consistently worse in terms of the weighted F1 score, while fluctuating with respect to the Hamming loss. This discrepancy is likely due to class imbalance in our setting.

### Structure Discovery for Synthetic Binary I.I.D. Data (without Timestamps)

We further test our two-step CMBN-LSSA approach for learning label dependency from i.i.d. data by generating binary data with 5 and 8 variables based on the Ising model:

$$\mathrm{Prob}(x_1, x_2, \dots, x_M) \propto \exp\Big( \Theta_0 + \sum_i \Theta_i x_i + \sum_{i<j} \Theta_{ij} x_i x_j \Big)$$
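A minimal sketch of drawing such binary data with a Gibbs sampler is shown below; $\Theta_0$ only shifts the normalizing constant and is therefore omitted, the parameter shapes and burn-in are illustrative assumptions, and in practice one would thin or restart chains to approximate i.i.d. samples.

```python
import numpy as np

def ising_gibbs(theta, Theta, n_samples=1000, burn_in=500, seed=0):
    """Gibbs sampler for Prob(x) proportional to
    exp(sum_i theta_i x_i + sum_{i<j} Theta_ij x_i x_j), with x_i in {0, 1}.
    theta: (M,) unary weights; Theta: (M, M) upper-triangular pairwise weights."""
    rng = np.random.default_rng(seed)
    M = len(theta)
    W = np.triu(Theta, 1) + np.triu(Theta, 1).T           # symmetric, zero diagonal
    x = rng.integers(0, 2, size=M)
    samples = []
    for step in range(burn_in + n_samples):
        for i in range(M):
            logit = theta[i] + W[i] @ x                   # conditional log-odds of x_i = 1
            x[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
        if step >= burn_in:
            samples.append(x.copy())
    return np.array(samples)
```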