Published as a conference paper at ICLR 2023

WEIGHTED CLOCK LOGIC POINT PROCESS

Ruixuan Yan¹, Yunshi Wen¹, Debarun Bhattacharjya², Ronny Luss², Tengfei Ma², Achille Fokoue², and Agung Julius¹
¹Rensselaer Polytechnic Institute  ²IBM T.J. Watson Research Center

ABSTRACT

Datasets involving multivariate event streams are prevalent in numerous applications. We present a novel framework for modeling temporal point processes called clock logic neural networks (CLNN), which learn weighted clock logic (wCL) formulas as interpretable temporal rules by which some events promote or inhibit other events. Specifically, CLNN models temporal relations between events using conditional intensity rates informed by a set of wCL formulas, which are more expressive than related prior work. Unlike conventional approaches that search for generative rules through expensive combinatorial optimization, we design smooth activation functions for the components of wCL formulas that enable a continuous relaxation of the discrete search space and efficient learning of wCL formulas using gradient-based methods. Experiments on synthetic datasets demonstrate our model's ability to recover the ground-truth rules and improve computational efficiency. In addition, experiments on real-world datasets show that our models perform competitively when compared with state-of-the-art models.

1 INTRODUCTION AND RELATED WORK

Multivariate event streams are an emerging type of data that involves occurrences of different types of events in continuous time. Event streams are observed in a wide range of applications, including but not limited to finance (Bacry et al., 2015), politics (O'Brien, 2010), system maintenance (Gunawardana et al., 2011), healthcare (Weiss & Page, 2013), and social networks (Farajtabar et al., 2015). As opposed to time series data, which typically comprises continuous-valued variables evolving at regular discrete time stamps, event streams involve events occurring irregularly and asynchronously in continuous time. Modeling the dynamics in event streams is important for a wide range of scientific and industrial processes, such as predicting the occurrence of events of interest or understanding why some deleterious events occur so as to possibly prevent their occurrence.

A (multivariate) temporal point process (TPP) provides a formal mathematical framework for representing event streams, where a conditional intensity rate for each event measures its occurrence rate at any time given the historical events in the stream (Daley & Vere-Jones, 2003; Aalen et al., 2008). There has been a proliferation of research around TPPs in recent years, particularly around the use of neural networks for modeling conditional intensity rates as a function of historical occurrences (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017; Xu et al., 2017; Gao et al., 2020; Zhang et al., 2020; Zuo et al., 2020). One stream of research studies graphical event models (GEMs) as a compact and interpretable graphical representation for TPPs, where the conditional intensity rate for any particular event depends only on the history of a subset of the events (Didelez, 2008; Gunawardana & Meek, 2016). While any TPP can be represented as a GEM, various models make assumptions about the parametric form of conditional intensity rates for the sake of learnability, for instance that rates are piece-wise constant with respect to occurrences within historical windows (Gunawardana et al., 2011; Bhattacharjya et al., 2018).
Ordinal GEMs (OGEMs) (Bhattacharjya et al., 2020; 2021) are recent models from this family in which a conditional intensity rate depends on the order in which parent events occur within the most recent historical time period. A temporal logic point process (TLPP) framework was proposed as an alternative way to lend some interpretability to TPPs by modeling intensity rates using temporal logic rules (Li et al., 2020). Although the initial work pre-specified temporal logic rules, recent work has introduced a temporal logic rule learner (TELLER) for automatically discovering rules (Li et al., 2021). There is, however, the issue of scalability, since TELLER exploits an expensive branch-and-price algorithm to search for temporal logic rules in a discrete space. Another important limitation of this work is that TELLER's rules are not informative enough to explain how the interval length between ordered events impacts the conditional intensity rate. For instance, when predicting the occurrence of diabetes, the rule "insulin injection happens 20 minutes before eating a meal" is more informative, and more accurate in predicting that blood glucose remains normal, than the rule "insulin injection happens before eating a meal", as the latter rule cannot expose the interval between the insulin injection and the meal. To tackle the above limitations, we propose novel atomic predicates enriching the expressiveness of temporal logic rules, as well as a differentiable framework to learn rules in an end-to-end manner.

This work introduces a differentiable neuro-symbolic framework, the clock logic neural network (CLNN), to model TPPs by learning weighted clock logic (wCL) formulas as explanations. First, event streams are converted into continuous-time clock signals representing the time interval between the last occurrence of an event and the current time. Next, we propose a novel wCL to describe the underlying temporal relations with relative interval lengths, enabling the design of a CLNN to learn the generative mechanisms. Instead of searching for temporal logic rules in some vast discrete space, CLNN associates every neuron with an order representation or a logical operator and assigns weights to edges to reflect the importance of various inputs, which relaxes the search space to be continuous. Moreover, architecture weights are introduced into CLNN to make the formula structure search differentiable. wCL formula-informed intensity rates are carefully designed so that the parameters appearing in the rules can be learned through maximum likelihood estimation using gradient-based approaches. CLNN is tested on synthetic datasets to show that it can recover the ground-truth rules, as well as on real-world datasets to demonstrate its model-fitting performance.

2 PRELIMINARIES

2.1 NOTATION & BACKGROUND

Let L denote the set of event labels, and M = |L| denote the number of event labels. An event stream is a sequence of events including time stamps, denoted as D = {(l_1, t_1), (l_2, t_2), ..., (l_N, t_N)}, where t_i ∈ R_+ denotes a time stamp between the beginning time t_0 = 0 and end time t_{N+1} = T, and l_i ∈ L is the event label that occurs at t_i. We refer to event label and label interchangeably. Every event label l ∈ L has an associated conditional intensity rate describing the occurrence rate of label l at time t given the history up to t. In multivariate temporal point processes, conditional intensity rates describe the dynamics of events.
Let H_t = {(l_i, t_i) : t_i < t} denote the historical events up to time t. The conditional intensity rate of event label l is denoted as λ_l(t|H_t). Specifically, λ_l(t|H_t) describes the expected number of occurrences of event label l in an infinitesimal interval [t, t + Δt] given the history H_t, i.e., λ_l(t|H_t) = lim_{Δt→0} E[N_l(t + Δt) - N_l(t) | H_t] / Δt, where N_l(t) denotes the number of occurrences of event label l up to t.

Example 1 A running example of an event stream with 11 events of 4 labels is shown in Figure 1(a).

[Figure 1 graphic not reproduced.]
Figure 1: (a): An event stream example with N = 11 events of M = 4 event labels over T = 30 days. (Integer-valued time stamps are used for easy interpretation; note that the proposed approach also works for t_i ∈ R.) (b): The overall workflow of the proposed method (POC: paired order cell, SOC: singleton order cell, AC: architecture cell; details presented in Sections 2.2 to 3.3).

2.2 ORDER REPRESENTATIONS FOR EVENT STREAMS

The overall workflow of the proposed framework is visualized in Figure 1(b). The raw event streams first go through a masking function to generate masked event streams, which are then transformed into event clocks using a clocking function. The event clocks are given as inputs to the clock logic neural network (CLNN) to learn interpretable wCL formulas and the intensity rates of event occurrences. The following sections provide a detailed explanation of each module in Figure 1(b).

We are interested in exploring the effect of the temporal ordering between event labels, and of the occurrences of causal event labels within a historical window, on the occurrence rate of a particular event label, where the generative mechanism is expressed as interpretable formulas. An event stream up to t may include multiple occurrences of the same event label, thus a masking function is required to mask out duplicated event labels in the history so that the ordering information can be accessed at any t. Here we adopt a technique similar to Bhattacharjya et al. (2020) for extracting distinct event labels from H_t.

Definition 1 (Masking Function) A masking function Γ(·) takes an event stream as input and returns a new event stream that is a subset of the input stream and contains no duplicated event labels. Mathematically, Γ(·) is applied to H_t = {(l_i, t_i)} and converts it into a new stream H'_t = {(l_j, t_j) ∈ H_t : l_j ≠ l_{j'} if j ≠ j'}.

We consider the following two masking functions, as per Bhattacharjya et al. (2020), due to their simplicity: first masking and last masking. The first (resp. last) masking function keeps the first (resp. last) occurrence of each event label in an event stream.

Example 1 (cont.) Let H_13 = {(A, 1), (B, 3), (A, 6), (D, 8), (C, 10), (D, 12)}. The first masking function converts it to H'_13 = {(A, 1), (B, 3), (D, 8), (C, 10)}, and the last masking function converts it to H'_13 = {(B, 3), (A, 6), (C, 10), (D, 12)}.

With the masked event history H'_t, we define two order representations: one for the order relationship between any two event labels, and one for the occurrence of an event within a historical window of t.

Definition 2 (Paired Order Representation (POR)) A paired order representation is defined as [l_i, l_j] ∈ [L]², where [L]² denotes the two-element permutations of a subset of L.
A paired order representation for H'_t can be obtained by arranging any two distinct labels in H'_t in sequential order.

Definition 3 (Singleton Order Representation (SOR)) A singleton order representation is denoted as [l_j, u_{l_j}] ∈ L × R_+, representing that event label l_j ∈ L occurred within the past u_{l_j} time units, where u_{l_j} is a variable learned through a process explained in Section 3.3.

Example 1 (cont.) With first masking, an example of a paired order representation for H'_13 is [A, B], representing "A happens before B", or [B, C], representing "B happens before C". The overall order representation for H'_13 is expressed as [A, B, D, C], which can be derived from the paired order representations [A, B], [B, D], [D, C]. A singleton order representation example for H'_13 is [B, 10.5], meaning B happened within the past 10.5 days.

2.3 WEIGHTED CLOCK LOGIC FORMULA

To adapt H'_t to continuous-time signals that can be described by logical statements, we extract clock signals from H'_t that describe the time passed since the last occurrence of a label. A clocking function is introduced to convert t_j into a clock signal c_j denoting the length of the time interval between t_j and t.

Definition 4 (Clocking Function) A clocking function Ξ(·) converts H'_t into a vector of clock signals C'(t) = [c_1(t), c_2(t), ..., c_M(t)]^T ∈ R^M_+, with c_i(t) denoting the clock signal for event label i ∈ L, where c_i(t) = t - t_j if (l_j, t_j) ∈ H'_t and l_j = i, and c_i(t) = Z otherwise. Note that Z is a user-defined, large positive number indicating that event label i does not occur in H'_t.

Example 1 (cont.) Taking the first masked event stream H'_13 = {(A, 1), (B, 3), (D, 8), (C, 12)} as an example, the event clocks are extracted as C'(13) = [12, 10, 1, 5]^T.

The event clocks essentially provide the ordering between any two event labels, in that the difference between the clock signals of any two event labels reflects which event label happens first. As shown in the diabetes prediction example in the Introduction, the time interval between ordered events is notably important in explaining and predicting an event label's occurrence. In contrast to Li et al. (2020; 2021), which only learn the temporal ordering relation between event labels, we define a paired order predicate (POP) with a learnable parameter u_{l_i l_j} to describe the time interval between two ordered event labels l_i and l_j, and a singleton order predicate (SOP) with a learnable parameter u_{l_j} to describe the occurrence of label l_j within a historical window u_{l_j}, as follows.

Definition 5 (Paired Order Predicate) A POP describes the order between two labels l_i, l_j ∈ L, l_i ≠ l_j, denoted as π^{l_i l_j}_pop := g(c_{l_i}, c_{l_j}) = c_{l_i} - c_{l_j} > u_{l_i l_j}, where u_{l_i l_j} ∈ R is a parameter to learn. A positive u_{l_i l_j} means l_i happened before l_j by at least u_{l_i l_j} time units, and a negative u_{l_i l_j} means l_j happened before l_i by at most |u_{l_i l_j}| time units. A POP is used in the POC of Figure 1(b).

Definition 6 (Singleton Order Predicate) An SOP describes a causal label l_j ∈ L occurring within the past u_{l_j} time units, defined as π^{l_j}_sop := c_{l_j} - u_{l_j} < 0, where u_{l_j} ∈ R_+ is a learnable parameter.
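To make Definitions 1 and 4 concrete, the following is a minimal Python sketch of the first/last masking functions and the clocking function, applied to the event stream of Example 1. This is not the authors' released code; the function and variable names are our own.

```python
def mask(history, mode="first"):
    """Definition 1: return a sub-stream with no duplicated event labels,
    keeping the first or the last occurrence of each label."""
    kept = {}
    for label, t in history:           # history is a time-ordered list of (label, time) pairs
        if mode == "first":
            kept.setdefault(label, t)  # keep the earliest time stamp per label
        else:
            kept[label] = t            # keep the latest time stamp per label
    return sorted(kept.items(), key=lambda item: item[1])

def clocks(masked, labels, t, Z):
    """Definition 4: c_i(t) = t - t_i if label i occurs in the masked history, Z otherwise."""
    last = dict(masked)
    return [t - last[l] if l in last else Z for l in labels]

# Example 1: H_13 = {(A,1), (B,3), (A,6), (D,8), (C,10), (D,12)}
H13 = [("A", 1), ("B", 3), ("A", 6), ("D", 8), ("C", 10), ("D", 12)]
print(mask(H13, "first"))   # [('A', 1), ('B', 3), ('D', 8), ('C', 10)]
print(mask(H13, "last"))    # [('B', 3), ('A', 6), ('C', 10), ('D', 12)]
print(clocks(mask(H13, "first"), ["A", "B", "C", "D"], t=13, Z=45.0))
```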
Instead of taking a heuristic approach to the underlying combinatorial search problem over a given set of temporal predicates (Bhattacharjya et al., 2020; 2021; Li et al., 2021) to uncover the effective order relations, this work proposes a differentiable learning model that learns suitable singleton and paired order predicates among all possible choices of order predicates through a gradient-based approach. The scheme of weighted signal temporal logic (wSTL) in Yan et al. (2021; 2022) is exploited to build weighted clock logic (wCL) formulas that are logical compositions of singleton and paired order predicates. The syntax of wCL is recursively defined as (Mehdipour et al., 2021):

ϕ := π^{l_i l_j}_pop | π^{l_j}_sop | ¬ϕ | ϕ_1^{w_1} ∧ ϕ_2^{w_2} ∧ ... ∧ ϕ_k^{w_k} | ϕ_1^{w_1} ∨ ϕ_2^{w_2} ∨ ... ∨ ϕ_k^{w_k},   (1)

where ϕ_1, ..., ϕ_k are wCL formulas, ¬ denotes negation, ∧ denotes logical conjunction, ∨ denotes logical disjunction, and w_j ≥ 0, j = 1, ..., k, denotes the non-negative weight assigned to ϕ_j in the conjunction and disjunction operations. A wCL formula can describe the characteristics of H_t, thus the conditional intensity rate of event l given H_t can be equivalently denoted as λ_{l|ϕ}(t).

Remark 7 The syntax above means each wCL formula can be built from the predicates π^{l_i l_j}_pop or π^{l_j}_sop by recursively applying the ¬, ∧, or ∨ operations.

Example 1 (cont.) A wCL formula example is ϕ = (c_A - c_B > 1)^1 ∧ (c_C < 3)^{0.05}. The first and second clauses read "A happened before B by at least one day" and "C happened less than 3 days ago", respectively. Note that ϕ is satisfied by the event stream up to t = 13 in Figure 1(a). The two clauses have weights of 1 and 0.05, reflecting that the first clause is more important than the second one.

3 WEIGHTED CLOCK LOGIC POINT PROCESSES

3.1 TRUTH DEGREE OF WEIGHTED CLOCK LOGIC

To quantitatively measure the satisfaction degree of a wCL formula ϕ over the event clocks C'(t), i.e., how well ϕ describes the underlying patterns of C'(t), we propose smooth activation functions (AFs) to compute the truth degree, denoted p(C', ϕ, t) ∈ [0, 1], defined as (Riegel et al., 2020):

p(C', π^{l_i l_j}_pop, t) = sigmoid(c_{l_i}(t) - c_{l_j}(t) - u_{l_i l_j}),   (2)
p(C', π^{l_j}_sop, t) = sigmoid(u_{l_j} - c_{l_j}(t)),   (3)
p(C', ¬ϕ, t) = 1 - p(C', ϕ, t).   (4)

In contrast to the combinatorial search over temporal logic predicates in Li et al. (2021), the smooth design of the AFs in (2)-(4) benefits the maximum likelihood estimation problem shown later in Section 3.6 by allowing the parameters in the POPs and SOPs to be learned through gradient-based methods.

Next, we present the design of the activation function (AF) for the ∧ operator. Here we use a 2-ary conjunction operator to motivate the design. Let p = p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) ∈ [0, 1]. Intuitively, p is low when either input is low, and p is high when both inputs are high. Here we adopt an idea similar to Sen et al. (2022) for capturing "low" and "high". A user-defined hyperparameter α ∈ [1/2, 1] is introduced to aid the interpretability of low and high, such that p represents "high" if p ∈ [α, 1] and "low" if p ∈ [0, 1 - α]. Considering the importance weights, a low input with a zero weight should not impact the output, which implies p should be required to be low only when both inputs are low. With these considerations, the AF for the ∧ operator is defined as follows (see Appendix A for more details):

p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2} ∧ ... ∧ ϕ_k^{w_k}, t) = f(β - Σ_{j=1}^{k} w_j (1 - p(C', ϕ_j, t))),   (5)
subject to β - Σ_{j=1}^{k} w_j (1 - α) ≥ α,  β - Σ_{j=1}^{k} w_j α ≤ 1 - α,

where f(z) = max{0, min{z, 1}} clamps the truth degree into [0, 1], and w_j ≥ 0 and β ≥ 0 are parameters to learn.
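As an illustration of (2)-(5), here is a small Python sketch of the predicate truth degrees and the weighted conjunction activation function with its clamp f. It is our own illustrative code, not the authors' implementation, and it reuses the clock values of Example 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def truth_pop(c_i, c_j, u_ij):
    # Eq. (2): truth degree of the paired order predicate c_i - c_j > u_ij
    return sigmoid(c_i - c_j - u_ij)

def truth_sop(c_j, u_j):
    # Eq. (3): truth degree of the singleton order predicate c_j - u_j < 0
    return sigmoid(u_j - c_j)

def truth_not(p):
    # Eq. (4): negation
    return 1.0 - p

def clamp(z):
    # f(z) = max{0, min{z, 1}}
    return max(0.0, min(z, 1.0))

def truth_and(ps, ws, beta):
    # Eq. (5): weighted conjunction; a zero-weight input cannot move the output
    return clamp(beta - sum(w * (1.0 - p) for w, p in zip(ws, ps)))

# Example 1 (cont.): phi = (c_A - c_B > 1)^1 AND (c_C < 3)^0.05 at t = 13
p1 = truth_pop(c_i=12.0, c_j=10.0, u_ij=1.0)  # clocks c_A = 12, c_B = 10
p2 = truth_sop(c_j=1.0, u_j=3.0)              # clock c_C = 1
beta = (1.0 + 0.05 + 1.0) / 2                 # satisfies the alpha = 0.5 constraint w1 + w2 = 2*beta - 1
print(truth_and([p1, p2], ws=[1.0, 0.05], beta=beta))
```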
By De Morgan's law (Hurley, 2014), the AF for the ∨ operator is defined as

p(C', ϕ_1^{w_1} ∨ ϕ_2^{w_2} ∨ ... ∨ ϕ_k^{w_k}, t) = f(1 - β + Σ_{j=1}^{k} w_j p(C', ϕ_j, t)),   (6)
subject to 1 - β + Σ_{j=1}^{k} w_j α ≥ α,  1 - β + Σ_{j=1}^{k} w_j (1 - α) ≤ 1 - α.

An event stream with M event labels generates P(M,2) = M!/(M-2)! paired order predicates and M singleton order predicates. If a conjunction or disjunction operator takes these predicates as inputs, how it recognizes the order predicates that are effective in describing the event dynamics becomes a critical issue. By carefully designing the AFs in (5)-(6), the logical operators exhibit the following properties that allow them to recognize effective inputs. This is a critical advantage over Bhattacharjya et al. (2020; 2021) and Li et al. (2021) in that it allows a differentiable search for suitable predicates among all possible choices of order predicates in an end-to-end manner. Here we illustrate the properties for ∧ with two inputs, which can be generalized to k-ary inputs (see Appendix B for more details).

Theorem 8 The AF for the ∧ operator with two inputs exhibits the following properties. 1) Non-impact for zero weights: if w_j = 0, j ∈ {1, 2}, then p(C', ϕ_j, t) has no impact on p(C', ϕ_1 ∧ ϕ_2, t). 2) Impact ordering: if p(C', ϕ_1, t) = p(C', ϕ_2, t) and w_1 ≥ w_2, then |∂p(C', ϕ_1 ∧ ϕ_2, t) / ∂p(C', ϕ_1, t)| ≥ |∂p(C', ϕ_1 ∧ ϕ_2, t) / ∂p(C', ϕ_2, t)|. 3) Monotonicity: f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))) ≤ f(β - Σ_{j=1}^{2} w_j (1 - (p(C', ϕ_j, t) + d))) for d ≥ 0.

[Figure 2 graphic not reproduced; it shows clock signals c_A(t), c_B(t), c_C(t) feeding paired order cells and singleton order cells with paired-order weights w_pop, singleton-order weights w_sop, a POP selection node, an SOP selection node, and an architecture cell with architecture weights α_arc and a logical selection node producing p(C', ϕ, t).]
Figure 2: CLNN structure. (a): Continuous relaxation of the search space using weights. (b): The learned discrete model structure for ϕ = (π^{A,B}_pop ∨ π^{B,C}_pop) ∧ π^A_sop.

3.2 LEARNING OF PAIRED ORDER REPRESENTATION

With the smooth AFs designed in (2)-(6), a neuro-symbolic model called the clock logic neural network (CLNN) can be designed for any given wCL formula ϕ, in which every neuron has a corresponding symbolic representation. A typical CLNN for ϕ = (π^{A,B}_pop ∨ π^{B,C}_pop) ∧ π^A_sop is visualized in Figure 2(b), which can be considered the discrete structure obtained by learning the parameters of the model in Figure 2(a) and keeping the dominant components. Here ϕ can be interpreted as "(A happens before B by at least u_{AB} time units, or B happens before C by at least u_{BC} time units) and A happens within the past u_A time units". This part describes the continuous relaxation of the search space by designing a paired order cell, a singleton order cell, and an architecture cell for learning the paired order representations, the singleton order representations, and the formula structure, respectively.

Paired Order Cell (POC). A POC is a directed acyclic graph (DAG) comprising two paired order predicate (POP) nodes and one logical node for the ∨ operator, shown as an orange block in Figure 2(a). The two POP nodes represent π^{l_i,l_j}_pop and π^{l_j,l_i}_pop, sharing the same parameter u_{l_i,l_j}, where π^{l_i,l_j}_pop denotes "l_i happened before l_j by at least u_{l_i,l_j} time units" and π^{l_j,l_i}_pop denotes "l_j happened before l_i by at least u_{l_i,l_j} time units". Each POP has an associated weight w^{l_i,l_j}_pop or w^{l_j,l_i}_pop to be learned, and the ∨ operator forces one of the two weight parameters to dominate the other so that the learned POR is consistent with the event stream.
For example, the POC in Figure 2(a) aims to learn the POR between A and B, whose discretized version would be either π^{A,B}_pop or π^{B,A}_pop. An event stream with M event labels can generate P(M,2) = M!/(M-2)! PORs between any two event labels, resulting in P(M,2)/2 POCs. Similar to learning the POR between any two events, the discrete order representations for the entire history H_t can be learned using a POP selection node (shown in Figure 2(a)) that takes the outputs of all the POCs as input and identifies the important PORs. Learning the POCs thus amounts to learning the w, β in (5) for the POCs and the POP selection node, as well as u_{l_i l_j} in (2) for the POPs, through back propagation. The discrete PORs can be acquired by keeping the top-k strongest POCs and the dominant POPs.

3.3 LEARNING OF SINGLETON ORDER REPRESENTATION

Singleton Order Cell (SOC). The learning of SORs is accomplished by an SOC, displayed as a green block in Figure 2(a). An SOC is a DAG comprising M singleton order predicate (SOP) nodes and one SOP selection node for the ∧ operator. An SOP node represents π^{l_j}_sop; it takes c_{l_j}(t) as input and returns the truth degree of π^{l_j}_sop over c_{l_j}(t). The SOP selection node has the same functionality as the POP selection node. The ∧ operator in the SOP selection node assigns a non-negative weight to every SOP node and learns the importance weights w and β to extract the dominant SORs affecting the conditional intensity rate the most. Learning the SOC thus amounts to learning the w, β in (5) for the SOP selection node and u_{l_j} in (3) for the SOPs through back propagation. The discrete SORs can be determined by keeping the top-k strongest SOPs.

3.4 LEARNING OF FORMULA STRUCTURE

Architecture Cell (AC). For a given set of PORs or SORs, their conjunction and disjunction behave differently and have distinct meanings. For instance, given two causal formulas ϕ_1 = (c_A - c_B > 1)^1 ∧ (c_C < 5)^1 and ϕ_2 = (c_A - c_B > 1)^1 ∨ (c_C < 5)^1 for the occurrence of event label D, ϕ_1 means "(A happens before B by at least 1 time unit) and (C happens within the past 5 time units) simultaneously will cause D to happen", whereas ϕ_2 means "(A happens before B by at least 1 time unit) or (C happens within the past 5 time units) alternatively will cause D to happen". The aforementioned cells can learn the order representations. Nevertheless, whether their outputs should be connected by the ∧ or the ∨ operator needs to be determined. Here we consider the outputs of the POCs and SOCs as having two choices of being connected, by a ∧ or a ∨ operator, each of which is associated with an architecture weight α^∧_arc or α^∨_arc that enables continuous learning over the two choices; this is also called differentiable architecture search (Liu et al., 2019). An architecture cell is introduced for learning the model architecture; it comprises two logical nodes representing a ∧ operator and a ∨ operator, as well as a logical selection node (LSN), shown as the blue block in Figure 2(a). Let p = {p_1, ..., p_k} denote the set of inputs for each logical operator. Then the conjunction operator takes p as input and returns p^∧ = f(β^∧ - Σ_{j=1}^{k} w^∧_j (1 - p_j)), and the disjunction operator takes p as input and returns p^∨ = f(1 - β^∨ + Σ_{j=1}^{k} w^∨_j p_j). The LSN takes p^∧ and p^∨ as inputs and returns their weighted sum, where the weights are computed using the softmax of the architecture weights as shown below:

p = p(C', ϕ, t) = Σ_{m ∈ {∧,∨}} (e^{α^m_arc} / Σ_{m' ∈ {∧,∨}} e^{α^{m'}_arc}) p^m.   (7)
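The following Python sketch illustrates how an architecture cell can combine the two logical nodes via (7). It is our own illustrative code under the definitions above, not the released implementation.

```python
import math

def clamp(z):
    return max(0.0, min(z, 1.0))

def and_node(ps, ws, beta):
    # conjunction node: f(beta - sum_j w_j (1 - p_j)), as in Eq. (5)
    return clamp(beta - sum(w * (1.0 - p) for w, p in zip(ws, ps)))

def or_node(ps, ws, beta):
    # disjunction node: f(1 - beta + sum_j w_j p_j), as in Eq. (6)
    return clamp(1.0 - beta + sum(w * p for w, p in zip(ws, ps)))

def architecture_cell(ps, and_params, or_params, alpha_arc):
    """Eq. (7): a softmax over the architecture weights (alpha_and, alpha_or)
    mixes the conjunction and disjunction outputs."""
    p_and = and_node(ps, *and_params)
    p_or = or_node(ps, *or_params)
    e_and, e_or = math.exp(alpha_arc[0]), math.exp(alpha_arc[1])
    z = e_and + e_or
    return (e_and / z) * p_and + (e_or / z) * p_or

# After training, the operator with the larger architecture weight is retained.
print(architecture_cell([0.9, 0.2],
                        and_params=([1.0, 1.0], 1.5),
                        or_params=([1.0, 1.0], 1.5),
                        alpha_arc=(2.0, -1.0)))
```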
The task of architecture search then reduces to learning the architecture weights α^∧_arc, α^∨_arc and the w, β in (5)-(6) for the two logical operators, which can be executed simultaneously while learning the parameters in the POCs and SOCs. The outcome of the architecture search process is a discrete architecture obtained by retaining the logical operator with the strongest architecture weight.

3.5 WCL-INFORMED INTENSITY FUNCTION

[Figure 3 graphic not reproduced; it shows the clock signals C' for L feeding n CLNNs (CLNN-ϕ_1, ..., CLNN-ϕ_n), whose truth degrees p(C', ϕ_1, t), ..., p(C', ϕ_n, t) enter the log-likelihood LL_l, with forward and backward propagation.]
Figure 3: The overall learning framework for n wCL formulas.

The output of a CLNN is the truth degree of ϕ over C' at t, which is incorporated into modeling the conditional intensity rates. The modeling process aims to discover the generative mechanism as wCL formulas for every l ∈ L. In other words, a larger value of p(C', ϕ, t) should reflect that ϕ has a greater impact on the occurrence of a particular label. For example, if the wCL formula affecting the occurrence of event label D is given as ϕ = (π^{A,B}_pop)^{w_1} ∧ (π^C_sop)^{w_2}, then whenever ϕ is satisfied, or the truth degree of ϕ is high, ϕ has a strong impact on the occurrence of D, where the impact can be promoting or inhibiting. In terms of the relation between the truth degree and the conditional intensity rate, the higher the truth degree p(C', ϕ, t), the greater its impact on λ_{D|ϕ}. Note that the occurrence of one event label may depend on multiple wCL formulas. This work follows the assumption that the impacts of multiple formulas are additive in predicting the intensity rate, similar to Li et al. (2020). To incorporate a set of wCL formulas Φ = {ϕ_1, ϕ_2, ..., ϕ_n} into the modeling of the conditional intensity rate, we define a wCL formula-informed conditional intensity rate as

λ_{l|Φ}(t) = exp(Σ_{i=1}^{n} w_{ϕ_i} p(C', ϕ_i, t) + ρ),   (8)

where w_{ϕ_i} is the weight of ϕ_i, and ρ is a bias term that allows for spontaneous occurrences without influence from Φ.

3.6 MAXIMUM LIKELIHOOD ESTIMATION

Suppose event stream D contains n_l occurrences of event label l, whose occurrence time stamps are denoted t^l_1, t^l_2, ..., t^l_{n_l}. Let t^l_0 = 0 and t^l_{n_l+1} = T. Based on the conditional intensity function in (8), the likelihood for label l over the event stream is calculated as (Daley & Vere-Jones, 2003):

p_l(D) = [Π_{i=1}^{n_l} λ_{l|Φ}(t^l_i)] exp(-∫_0^T λ_{l|Φ}(s) ds).   (9)

The corresponding log-likelihood for event label l is LL_l = -∫_0^T λ_{l|Φ}(s) ds + Σ_{i=1}^{n_l} log λ_{l|Φ}(t^l_i). The total log-likelihood of all the events in D is thus LL_D = Σ_{l ∈ L} LL_l. During the training process, we train the model parameters for each event label separately. Specifically, the maximum likelihood estimation problem for event label l can be formulated as follows:

min -LL_l   (10)
s.t. ∀ϕ ∈ Φ, 1 ≤ k ≤ K^∧_ϕ: β_k - Σ_{i ∈ I_k} w_{i,k} (1 - α) ≥ α,  β_k - Σ_{i ∈ I_k} w_{i,k} α ≤ 1 - α,   (11)
∀ϕ ∈ Φ, 1 ≤ k' ≤ K^∨_ϕ: 1 - β_{k'} + Σ_{i ∈ I_{k'}} w_{i,k'} α ≥ α,  1 - β_{k'} + Σ_{i ∈ I_{k'}} w_{i,k'} (1 - α) ≤ 1 - α,   (12)
w_{i,k} ≥ 0, β_k ≥ 0, w_{i,k'} ≥ 0, β_{k'} ≥ 0, u_{l_j} ≥ 0,

where K^∧_ϕ (resp. K^∨_ϕ) is the number of ∧ (resp. ∨) operators in ϕ, and I_k (resp. I_{k'}) denotes the inputs to the k-th ∧ (resp. k'-th ∨) operator. Please see Appendix A for more details about the above formulation. The overall learning framework is shown in Figure 3, in which the forward propagation computes LL_l using n CLNNs, each of which learns a wCL formula ϕ_i, and the backward propagation updates the parameters in the n CLNNs using projected gradient descent.
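The following Python sketch shows how the wCL-informed intensity in (8) and the log-likelihood LL_l can be evaluated, with the integral approximated numerically on a time grid. It is illustrative only; the names and the numerical-integration choice are ours, not the paper's.

```python
import numpy as np

def intensity(truth_degrees, w_phi, rho):
    # Eq. (8): lambda_{l|Phi}(t) = exp(sum_i w_phi_i * p(C', phi_i, t) + rho)
    return np.exp(np.dot(w_phi, truth_degrees) + rho)

def log_likelihood(event_times, grid, truth_on_grid, w_phi, rho):
    """LL_l = -int_0^T lambda(s) ds + sum_i log lambda(t_i).
    truth_on_grid[i, j] holds p(C', phi_j, s) at grid point s = grid[i]."""
    lam_grid = np.array([intensity(p, w_phi, rho) for p in truth_on_grid])
    integral = np.trapz(lam_grid, grid)               # numerical integral over [0, T]
    lam_events = np.interp(event_times, grid, lam_grid)
    return -integral + np.sum(np.log(lam_events))
```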
4 EXPERIMENTS

We conduct several experiments on synthetic and real-world datasets to demonstrate the efficacy of our proposed model and compare it with state-of-the-art (SOTA) models. The experiments are run using the AdamW optimizer in PyTorch (1.10.2) on a Windows 10 desktop with a 16-core CPU (i7, 3.60 GHz) and 32 GB RAM. Our code is available at https://ICLR-CLNN.

4.1 BASELINES

Multivariate Hawkes Process (MHP) (Bacry et al., 2017): A conventional multivariate Hawkes process utilizing an exponential kernel function to describe the conditional intensity rate, involving a decay rate and an infectivity matrix characterizing the inter-dependence among events. This model is implemented in the tick¹ library, where the learning problem is posed as a convex quadratic programming problem with a fixed decay rate.

¹https://x-datainitiative.github.io/tick/modules/hawkes.html

Proximal Graphical Event Model (PGEM) (Bhattacharjya et al., 2018): A type of GEM that models event data by considering whether a parent in some underlying graph happens in a proximal (recent) window.

Ordinal Graphical Event Model (OGEM) (Bhattacharjya et al., 2020; 2021): An ordinal GEM that models the impact of the order of events on the conditional intensity rate. OGEM-tab (resp. OGEM-tree) refers to an OGEM that adopts a tabular (resp. tree) representation of orders.

Temporal Logic Rule Learner (TELLER)² (Li et al., 2021): A method to learn first-order temporal logic rules explaining the generative mechanism of TPPs. The rule discovery process is formulated as a maximum likelihood estimation problem solved by a branch-and-price algorithm.

²https://github.com/FengMingquan-sjtu/Logic_Point_Processes_ICLR

4.2 SYNTHETIC DATASETS

The first part of this experiment demonstrates CLNN's capability of recovering ground-truth rules using three synthetic datasets generated by CLNN with pre-specified formula structures and parameters, including u_{l_i l_j} in π^{l_i,l_j}_pop, the importance weights w and bias β in (5) for the logical operators, and w_ϕ and ρ in (8) for the conditional intensity rate.

Experimental Setting. Each synthetic dataset contains 1,000 event streams partitioned into three sets: training (70%), validation (15%), and test (15%). Every dataset is generated using a wCL formula with w_ϕ = 3 and ρ = 5. The truth value threshold is set as α = 0.5, and the clock signal representing an event not occurring in H'_t is set as Z = 1.5 T_max, where T_max is the maximal ending time among all the event streams. During the training process, we initialize the parameters using four approaches (see Appendix C.5 for more details) and report the best one; CLNN aims to recover the manually set parameters.

Results. The ground-truth rule ϕ̂_1 for generating the first synthetic dataset (Syn-1), with L = {A, B, C, D}, together with the rules discovered by CLNN, TELLER, and OGEM-tab, are summarized in Table 1. Results for the other synthetic datasets are presented in Appendix C. The rules are learned using the last masking method, which was also used for data generation.

Table 1: Comparison of rule discovery for CLNN, TELLER, and OGEM-tab on the Syn-1 dataset.
  Ground truth:    ϕ̂_1 = (c_A - c_B > 1)^1 ∧ (c_A - c_C > 3)^1
  CLNN's rule:     (c_A - c_B > 1.21)^{1.52} ∧ (c_A - c_C > 3.00)^{1.41} ∧ (c_A - c_D > 0.82)^{0.33} ∧ (c_B - c_C > 4.33)^0 ∧ (c_B - c_D > 10.69)^0 ∧ (c_D - c_C > 6.57)^{0.16}
  TELLER's rule:   A before D; B before D; C before D; A before D and C before D
  OGEM-tab's rule: Excitation: [B], [C, B], [B, C]; Inhibitory: [A], [C, A], [A, C]
The experimental results show accurate recovery by CLNN in terms of both order representation recovery and parameter identification. The unweighted version of the ground-truth rule reads: "If A happens before B by at least 1 time unit and A happens before C by at least 3 time units, then D will happen". TELLER's rule only reflects the temporal relation between events A, B, C and D but is unable to capture the temporal relation between A and B or between A and C, and thus does not match the ground-truth rule. In OGEM-tab's rule, [l] denotes a single parent. We show the top 3 excitation and inhibitory rules from OGEM-tab, where excitation (resp. inhibitory) means λ_{l|Φ} is higher (resp. lower) than the λ_{l|Φ} obtained with all w_{ϕ_i} = 0. The excitation rules of OGEM-tab do not match the ground-truth rule. In contrast, the rule discovered by CLNN (ϕ_1) assigns larger weights to the paired order predicates π^{A,B}_pop = (c_A - c_B > 1.21) and π^{A,C}_pop = (c_A - c_C > 3.00) and small weights to the other predicates, where the interval values of 1.21 and 3.00 are both learned. By ignoring the small weights, ϕ_1 can be interpreted as "If A happens before B by at least 1.21 time units and A happens before C by at least 3.00 time units, then D will happen", meaning the paired order representations discovered by CLNN match the ground truth well. Moreover, CLNN's rules are more expressive than those of TELLER and OGEM, as they provide a detailed interval length between two ordered labels.

Table 2: Runtime (s) for CLNN and TELLER on the synthetic datasets.
  wCL formula:  ϕ_1      ϕ_2      ϕ_{3,1}   ϕ_{3,2}   Average
  CLNN:         5.20     4.60     4.95      7.73      5.62
  TELLER:       252.91   286.83   925.58    1078.66   635.99

To show the computational efficiency of our gradient-based learning, we compare the runtimes of CLNN and TELLER on the synthetic datasets in Table 2. Notably, CLNN not only recovers the correct order representations but is also two orders of magnitude faster on average (5.62 s vs. 635.99 s). In addition, CLNN can learn more expressive order representations that describe both the order relation between two events and their interval length.

4.3 REAL-WORLD DATASETS

LinkedIn (Xu et al., 2017). An event dataset related to the job-hopping records of 3,000 LinkedIn users across 82 IT companies. Each event stream records a user's check-in time stamps for different companies or the time stamps of role changes within the same company. We filter the dataset to popular companies as per Bhattacharjya et al. (2020), resulting in 1,000 users.

MIMIC II (Saeed et al., 2011). An event dataset concerning the health records of patients with Intensive Care Unit (ICU) visits over 7 years. A patient's event stream records each visit's time stamp and the corresponding diagnosis. We filter out sequences with few visits, resulting in 650 patients.

Stack Overflow (Grant & Betts, 2013). An event dataset related to the badges awarded to users of the question-answering website Stack Overflow. Each user's event stream records the badges that he/she receives at various time stamps. We keep the event streams with one or more of 20 types of badges and sample 1,000 users from the dataset used in Du et al. (2016).

Experimental Setup. Each dataset is partitioned into three sets: training (70%), validation (15%), and test (15%). For simplicity, the u_{l_i l_j} are set to 0 to study the ordering representations.
The truth value threshold is α = 0.5 and Z = 1.5 T_max, the same as the setting for the synthetic datasets; the number of subformulas is n = 5, and the parameters are initialized as random numbers from a uniform distribution on [0, 1). CLNN is trained on the training set, and the validation set is utilized for model selection during training. Model fit is evaluated using the log-likelihood on the test set.

Results. We follow Bhattacharjya et al. (2018; 2020; 2021) in using the log-likelihood to evaluate model performance. The log-likelihood on the real-world datasets is reported in Table 3, where DR denotes the difference ratio: the difference between CLNN and the best SOTA divided by the absolute value of the best SOTA. CLNN's result is chosen as the better one between first and last masking. Notably, CLNN outperforms the baseline models on the LinkedIn dataset (13.40% advantage) and achieves a competitive result on the MIMIC II dataset (only a 1.63% loss). PGEM achieves a better result on the Stack Overflow dataset. In Stack Overflow, one type of badge can be awarded only when a user receives a particular badge multiple times; for example, the Epic badge is awarded only after earning 200 daily reputation points 50 times, which depends on the Mortarboard badge acquired while answering or asking questions. CLNN and the OGEMs apply masking methods to the data, which may not capture the above dependence. In contrast, PGEM models the data without masking, making it more suitable for this dataset.

Table 3: Dataset information and log-likelihood for all models on the real-world datasets.
  Dataset         N (# events)  M (labels)  MHP     PGEM    OGEM-tab  OGEM-tree  TELLER  CLNN    DR
  LinkedIn        2932          10          -1593   -1462   -1478     -1418      -1548   -1228   13.40%
  MIMIC II        2419          15          -567    -500    -474      -429       -645    -436    -1.63%
  Stack Overflow  71254         20          -52543  -48323  -49344    -49192     -71101  -50981  -5.50%

Table 4: Formulas and their effect as learned by CLNN, TELLER, and OGEM-tab on company F of LinkedIn.
  CLNN:
    ϕ_1 = (c_D > c_H)^{0.90} ∧ (c_I > c_J)^{0.72}    Inhibitory
    ϕ_2 = (c_B < 0.45)^{0.58} ∧ (c_D < 0.05)^{0.66}  Excitation
    ϕ_3 = (c_B > c_F)^{0.50} ∧ (c_J > c_D)^{0.47}    Inhibitory
    ϕ_4 = (c_A < 0.84)^{0.76} ∧ (c_H < 1.09)^{0.50}  Inhibitory
  TELLER:   [A, F], [C, F], [E, F], [B, F], [D, F]   Excitation
  OGEM-tab: [F], [F, A]  Excitation;  [A]  Inhibitory

Case Study. The primary strength of CLNN over the SOTA models is that it can describe the generative mechanism as wCL formulas, which are more expressive and potentially provide more detailed information. CLNN can therefore be deployed as a valuable tool for assisting domain specialists in knowledge discovery from event data. Here we showcase this strength with an illustrative example. We select the experimental result on company F of the LinkedIn dataset to demonstrate the expressivity of CLNN's rules, shown in Table 4. Here we specify the model to learn five formulas; four are inhibitory and one exhibits excitation. One inhibitory formula has a weight of 0.05 and is thus not reported in Table 4. Each formula shows the dominant singleton or paired order predicates. Notably, CLNN learns expressive wCL formulas that describe how the logical composition of paired order predicates and/or singleton order predicates affects a role change in company F. CLNN's rules are more expressive than TELLER's and as expressive as OGEM-tab's for describing the occurrence of a causal event within a specific historical window.
5 CONCLUSION

In this paper, we proposed a novel neuro-symbolic model, CLNN, to learn interpretable wCL formulas from multivariate event data. Experimental results using synthetic and real-world datasets demonstrate CLNN's expressiveness in recovering ground-truth rules in multivariate temporal point processes. Further, CLNN can be trained using gradient-based methods, which improves the learning speed compared to the SOTA.

6 ACKNOWLEDGEMENT

This research is sponsored by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network; the National Science Foundation under Grant CMMI-1936578; and the Defense Advanced Research Projects Agency (DARPA) through Cooperative Agreement D20AC00004 awarded by the U.S. Department of the Interior (DOI), Interior Business Center. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

REFERENCES

Odd Aalen, Ornulf Borgan, and Hakon Gjessing. Survival and Event History Analysis: A Process Point of View. Springer Science & Business Media, 2008.
Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.
Emmanuel Bacry, Martin Bompaire, Philip Deegan, Stéphane Gaïffas, and Søren V. Poulsen. tick: A Python library for statistical learning, with an emphasis on Hawkes processes and time-dependent models. The Journal of Machine Learning Research, 18(1):7937-7941, 2017.
Debarun Bhattacharjya, Dharmashankar Subramanian, and Tian Gao. Proximal graphical event models. Advances in Neural Information Processing Systems (NeurIPS), 31:8147-8156, 2018.
Debarun Bhattacharjya, Tian Gao, and Dharmashankar Subramanian. Order-dependent event models for agent interactions. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1977-1983, 2020.
Debarun Bhattacharjya, Tian Gao, and Dharmashankar Subramanian. Ordinal historical dependence in graphical event models with tree representations. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 6759-6767, 2021.
Yuanda Chen. Thinning algorithms for simulating point processes. Florida State University, Tallahassee, FL, 2016.
Daryl J. Daley and David Vere-Jones. An Introduction to the Theory of Point Processes, Volume I: Elementary Theory and Methods. Springer, 2003.
Vanessa Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):245-264, 2008.
Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555-1564, 2016.
Mehrdad Farajtabar, Yichen Wang, Manuel Gomez Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. COEVOLVE: A joint point process model for information diffusion and network coevolution. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, pp. 1954-1962, 2015.
Tian Gao, Dharmashankar Subramanian, Karthikeyan Shanmugam, Debarun Bhattacharjya, and Nicholas Mattei. A multi-channel neural graphical event model with negative evidence. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 3946-3953, 2020.
Scott Grant and Buddy Betts.
Encouraging user behaviour with achievements: An empirical study. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR '13, pp. 65-68. IEEE Press, 2013.
Asela Gunawardana and Chris Meek. Universal models of multivariate temporal point processes. In Artificial Intelligence and Statistics, pp. 556-563. PMLR, 2016.
Asela Gunawardana, Christopher Meek, and Puyang Xu. A model for temporal dependencies in event streams. Advances in Neural Information Processing Systems (NeurIPS), 24, 2011.
Patrick J. Hurley. A Concise Introduction to Logic. Cengage Learning, 2014.
Shuang Li, Lu Wang, Ruizhi Zhang, Xiaofu Chang, Xuqin Liu, Yao Xie, Yuan Qi, and Le Song. Temporal logic point processes. In International Conference on Machine Learning, pp. 5990-6000. PMLR, 2020.
Shuang Li, Mingquan Feng, Lu Wang, Abdelmajid Essofi, Yufeng Cao, Junchi Yan, and Le Song. Explaining point processes by learning interpretable temporal logic rules. In International Conference on Learning Representations, 2021.
Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
Noushin Mehdipour, Cristian-Ioan Vasile, and Calin Belta. Specifying user preferences using weighted signal temporal logic. IEEE Control Systems Letters, 5(6):2006-2011, 2021.
Hongyuan Mei and Jason M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. Advances in Neural Information Processing Systems (NeurIPS), 30:6757-6767, 2017.
Sean P. O'Brien. Crisis early warning and decision support: Contemporary approaches and thoughts on future research. International Studies Review, 12(1):87-104, 2010.
Ryan Riegel, Alexander Gray, Francois Luus, Naweed Khan, Ndivhuwo Makondo, Ismail Yunus Akhalwaya, Haifeng Qian, Ronald Fagin, Francisco Barahona, Udit Sharma, et al. Logical neural networks. arXiv preprint arXiv:2006.13155, 2020.
Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H. Kyaw, Benjamin Moody, and Roger G. Mark. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access intensive care unit database. Critical Care Medicine, 39(5):952, 2011.
Prithviraj Sen, Breno W. S. R. de Carvalho, Ryan Riegel, and Alexander Gray. Neuro-symbolic inductive logic programming with logical neural networks. In Proceedings of the Conference on Artificial Intelligence (AAAI), volume 36, pp. 8212-8219, 2022.
Jeremy C. Weiss and David Page. Forest-based point process for event prediction from electronic health records. In Machine Learning and Knowledge Discovery in Databases, pp. 547-562, 2013.
Shuai Xiao, Junchi Yan, Xiaokang Yang, Hongyuan Zha, and Stephen Chu. Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the Conference on Artificial Intelligence (AAAI), volume 31, pp. 1597-1603, 2017.
Hongteng Xu, Dixin Luo, and Hongyuan Zha. Learning Hawkes processes from short doubly-censored event sequences. In International Conference on Machine Learning, pp. 3831-3840. PMLR, 2017.
Ruixuan Yan, Agung Julius, Maria Chang, Achille Fokoue, Tengfei Ma, and Rosario Uceda-Sosa. STONE: Signal temporal logic neural network for time series classification. In 2021 International Conference on Data Mining Workshops (ICDMW), pp. 778-787. IEEE, 2021.
Ruixuan Yan, Tengfei Ma, Achille Fokoue, Maria Chang, and Agung Julius. Neuro-symbolic models for interpretable time series classification using temporal logic description. In 2022 IEEE International Conference on Data Mining (ICDM), pp. 618-627, 2022. doi: 10.1109/ICDM54844.2022.00072.
Qiang Zhang, Aldo Lipani, Omer Kirnap, and Emine Yilmaz. Self-attentive Hawkes process. In International Conference on Machine Learning, pp. 11183-11193. PMLR, 2020.
Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer Hawkes process. In International Conference on Machine Learning, pp. 11692-11702. PMLR, 2020.

A FORMULATION OF LOGICAL CONSTRAINTS & OBJECTIVE FUNCTION

The optimization problem in (10) is formulated by maximizing the log-likelihood subject to the logical constraints for the ∧ and ∨ operators. This section discusses the details of the formulation of the two logical constraints and how to formulate the optimization problem while considering them. Without loss of generality, we illustrate the formulation of the constraints for the ∧ operator; the constraints for the ∨ operator can be derived from those for the ∧ operator using De Morgan's law.

Logical constraints for the ∧ operator. Let x, y ∈ [0, 1] denote the inputs of the ∧ operator, and f(x, y) denote the quantitative satisfaction of ∧. The conventional characteristics of the ∧ operator are: 1) f(x, y) is low when either input is low, and 2) f(x, y) is high when both inputs are high. However, we associate each input with a non-negative weight, implying that an input with a zero weight should not affect the output. In other words, if a low input has a zero weight, it should not affect the output of f(x, y). Therefore, we require the ∧ operator to exhibit the following characteristics: 1) f(x, y) is low when both inputs are low, and 2) f(x, y) is high when both inputs are high. Here we introduce a user-defined hyperparameter α ∈ [1/2, 1] to capture low vs. high: x ∈ [0, 1 - α) represents low and x ∈ [α, 1] represents high. According to the above characteristics, we have (Sen et al., 2022)

f(x, y) ≤ 1 - α, ∀x, y ∈ [0, 1 - α),  f(x, y) ≥ α, ∀x, y ∈ [α, 1].   (13)

Here we follow a specific choice of f by using a triangular norm (t-norm) and define the quantitative satisfaction function of ∧ as (Riegel et al., 2020)

p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) = f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))),   (14)
subject to β - Σ_{j=1}^{2} w_j (1 - α) ≥ α,  β - Σ_{j=1}^{2} w_j α ≤ 1 - α,   (15)

where f(z) = max{0, min{z, 1}} is introduced to clamp the truth value into the range [0, 1].

Logical constraints for the ∨ operator. By De Morgan's law, we can derive the quantitative satisfaction function and the logical constraints for the ∨ operator with 2 inputs as follows:

p(C', ϕ_1^{w_1} ∨ ϕ_2^{w_2}, t) = f(1 - β + Σ_{j=1}^{2} w_j p(C', ϕ_j, t)),   (16)
subject to 1 - β + Σ_{j=1}^{2} w_j α ≥ α,  1 - β + Σ_{j=1}^{2} w_j (1 - α) ≤ 1 - α.   (17)

[Figure 4 graphic not reproduced.]
Figure 4: Plot of truth degree for (a) the ∧ operator with α = 0.7, (b) the ∧ operator with α = 0.9, (c) the ∨ operator with α = 0.7.

Here we show the characteristics of the activation functions for the ∧ and ∨ operators using Figure 4. Figure 4(a) shows the truth value of the ∧ operator with α = 0.7. Figure 4(b) shows the truth value of the ∧ operator with α = 0.9. It can be distinctly observed that f(x, y) is close to 0 when both x and y are low, and f(x, y) is close to 1 when both x and y are high. In addition, the unconstrained region for α = 0.9 is larger than the unconstrained region for α = 0.7. Figure 4(c) shows the truth value of the ∨ operator with α = 0.7. It is likewise clear that f(x, y) is close to 0 when both x and y are low, and f(x, y) is close to 1 when both x and y are high.
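A quick numerical check of (13)-(15) can be made in a few lines of Python. This is a sketch under our own parameter choices, not the paper's code: with α = 0.7 and unit weights, β = 1.5 satisfies the conjunction constraints, and the clamped output is low on the low-low corner and high on the high-high corner.

```python
import numpy as np

f = lambda z: np.clip(z, 0.0, 1.0)        # f(z) = max{0, min{z, 1}}

# beta chosen to satisfy (15): 1.5 - 2*(1 - 0.7) >= 0.7 and 1.5 - 2*0.7 <= 0.3
alpha, w1, w2, beta = 0.7, 1.0, 1.0, 1.5
x, y = np.meshgrid(np.linspace(0, 1, 101), np.linspace(0, 1, 101))
conj = f(beta - w1 * (1 - x) - w2 * (1 - y))

high = (x >= alpha) & (y >= alpha)        # both inputs "high"
low = (x <= 1 - alpha) & (y <= 1 - alpha) # both inputs "low"
print(conj[high].min())   # >= alpha, as required by (13)
print(conj[low].max())    # <= 1 - alpha, as required by (13)
```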
In general, we can extend the quantitative satisfaction of the ∧ and ∨ operators in (14)-(17) to k-ary conjunction and k-ary disjunction. The k-ary conjunction formulation is expressed as follows:

p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2} ∧ ... ∧ ϕ_k^{w_k}, t) = f(β - Σ_{j=1}^{k} w_j (1 - p(C', ϕ_j, t))),   (18)
subject to β - Σ_{j=1}^{k} w_j (1 - α) ≥ α,  β - Σ_{j=1}^{k} w_j α ≤ 1 - α.   (19)

The k-ary disjunction formulation is expressed as follows:

p(C', ϕ_1^{w_1} ∨ ϕ_2^{w_2} ∨ ... ∨ ϕ_k^{w_k}, t) = f(1 - β + Σ_{j=1}^{k} w_j p(C', ϕ_j, t)),   (20)
subject to 1 - β + Σ_{j=1}^{k} w_j α ≥ α,  1 - β + Σ_{j=1}^{k} w_j (1 - α) ≤ 1 - α.   (21)

With the above constraints, we can formulate the maximum likelihood estimation problem as

min -LL_l   (22)
s.t. ∀ϕ ∈ Φ, 1 ≤ k ≤ K^∧_ϕ: β_k - Σ_{i ∈ I_k} w_{i,k} (1 - α) ≥ α,  β_k - Σ_{i ∈ I_k} w_{i,k} α ≤ 1 - α,   (23)
∀ϕ ∈ Φ, 1 ≤ k' ≤ K^∨_ϕ: 1 - β_{k'} + Σ_{i ∈ I_{k'}} w_{i,k'} α ≥ α,  1 - β_{k'} + Σ_{i ∈ I_{k'}} w_{i,k'} (1 - α) ≤ 1 - α.

In this paper, we set α = 0.5, so the constraints in (19) become

Σ_{i=1}^{k} w_i ≤ 2β - 1,  Σ_{i=1}^{k} w_i ≥ 2β - 1,  2β - 1 ≥ 0,  w_i ≥ 0.

Reformulating the above constraints, we have

Σ_{i=1}^{k} w_i = 2β - 1,   (26)
β ≥ 0.5,  w_i ≥ 0.   (27)

The above constraints hold for each conjunction operator in ϕ. Therefore, we can incorporate the equality constraints in (26) into the objective function, which becomes

min -LL_l + Σ_{ϕ ∈ Φ} Σ_{k=1}^{K^∧_ϕ} (Σ_{i ∈ I_k} w_{i,k} - 2β_k + 1)²,   (28)

subject to w_{i,k} ≥ 0, β_k ≥ 0.5, ∀i ∈ I_k, 1 ≤ k ≤ K^∧_ϕ, ϕ ∈ Φ.   (29)

Similarly, we propose a set of logical constraints for the ∨ operator as in (21). If we set α = 0.5, the constraints in (21) become

Σ_{i=1}^{k} w_i ≥ 2β - 1,  Σ_{i=1}^{k} w_i ≤ 2β - 1,  2β - 1 ≥ 0,  w_i ≥ 0.

Reformulating the above constraints, we have

Σ_{i=1}^{k} w_i = 2β - 1,   (31)
β ≥ 0.5,  w_i ≥ 0.   (32)

The above constraints hold for each disjunction operator in ϕ. Therefore, we can incorporate the constraints in (31) into the objective function. The maximum likelihood estimation problem then becomes

min -LL_l + Σ_{ϕ ∈ Φ} Σ_{k=1}^{K^∧_ϕ} (Σ_{i ∈ I_k} w_{i,k} - 2β_k + 1)² + Σ_{ϕ ∈ Φ} Σ_{k'=1}^{K^∨_ϕ} (Σ_{i ∈ I_{k'}} w_{i,k'} - 2β_{k'} + 1)²,   (33)

subject to w_{i,k} ≥ 0, β_k ≥ 0.5, ∀i ∈ I_k, 1 ≤ k ≤ K^∧_ϕ, ϕ ∈ Φ, and w_{i,k'} ≥ 0, β_{k'} ≥ 0.5, ∀i ∈ I_{k'}, 1 ≤ k' ≤ K^∨_ϕ, ϕ ∈ Φ.

B PROOF OF THEOREM 8

The activation function designed for the ∧ operator satisfies the properties of non-impact for zero weights, impact ordering, and monotonicity. Without loss of generality, we present the proof for the ∧ operator connecting two clauses, which can be generalized to the ∧ operator connecting k-ary clauses.

Proof 1 Here we show that the activation function for the ∧ operator satisfies each property mentioned above.

Non-impact for zero weights. This means that if w_j = 0, j ∈ {1, 2}, then p(C', ϕ_j, t) should have no impact on p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t). Without loss of generality, suppose w_1 = 0. Then we have

p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) = f(β - 0 · (1 - p(C', ϕ_1, t)) - w_2 (1 - p(C', ϕ_2, t))) = f(β - w_2 (1 - p(C', ϕ_2, t))),   (34)

meaning p(C', ϕ_1, t) has no impact on p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t).

Impact ordering. This means the truth degree of the subformula with the higher weight has a greater impact on p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t). Mathematically, we need to prove that if p(C', ϕ_1, t) = p(C', ϕ_2, t) and w_1 ≥ w_2, then

|∂p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) / ∂p(C', ϕ_1, t)| ≥ |∂p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) / ∂p(C', ϕ_2, t)|.   (35)

As f(x) = max{0, min{x, 1}}, we have

f'(x) = 0 if x < 0;  1 if 0 < x < 1;  0 if x > 1.   (36)

If β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t)) < 0 or β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t)) > 1, then we have

∂p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) / ∂p(C', ϕ_1, t) = ∂p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) / ∂p(C', ϕ_2, t) = 0.   (37)

Also, if 0 < β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t)) < 1, then we have

∂f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))) / ∂p(C', ϕ_1, t) = w_1 f'(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))),   (38)

∂f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))) / ∂p(C', ϕ_2, t) = w_2 f'(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))).   (39)
As w_1 ≥ w_2, the following holds:

|∂p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) / ∂p(C', ϕ_1, t)| ≥ |∂p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) / ∂p(C', ϕ_2, t)|,   (40)

which proves that the impact ordering property holds.

Monotonicity. This means p(C', ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) increases monotonically with p(C', ϕ_j, t), i.e.,

f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))) ≤ f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t) - d)) for d ≥ 0.   (41)

First, note that β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t)) can be rewritten as

β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t)) = β - w_1 - w_2 + w_1 p(C', ϕ_1, t) + w_2 p(C', ϕ_2, t).   (42)

This implies the argument of f is monotonically non-decreasing in p(C', ϕ_1, t) and p(C', ϕ_2, t). Since f(x) = max{0, min{x, 1}} is itself monotonically non-decreasing, we obtain

f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t))) ≤ f(β - Σ_{j=1}^{2} w_j (1 - p(C', ϕ_j, t) - d)), d ≥ 0.   (43)

Thus the property of monotonicity is satisfied.

C EXPERIMENT RESULTS OF SYNTHETIC DATASETS

Dataset Generation. In the experiments on synthetic datasets, we manually generate 3 synthetic datasets under different settings; the details and results for the first synthetic dataset are reported in Section 4.2. Each setting considers a different order representation, a different number of event labels, or a different intensity for the causal event labels.

[Figure 5 graphic not reproduced; the CLNN takes c_A(t), c_B(t), c_C(t) as inputs, assigns weight 1 to π^{A,B}_pop := c_A - c_B > 1 and π^{A,C}_pop := c_A - c_C > 3 and weight 0 to all other predicates, and outputs p(C', ϕ̂_1, t).]
Figure 5: Model structure of ϕ̂_1 for generating the first synthetic dataset.

C.1 SYNTHETIC DATASET 1 (SYN-1).

Generation process. The first synthetic dataset contains 4 event labels: A, B, C, and D, where D is the event for prediction and A, B, C are causal events. The wCL formula used to generate event D in the first synthetic dataset is set as

ϕ̂_1 = (c_A - c_B > 1)^1 ∧ (c_A - c_C > 3)^1,   (44)

whose unweighted version reads as "If A happens before B by at least 1 time unit and A happens before C by at least 3 time units, then D will happen". Here we consider event labels A, B, C as free predicates, whose occurrences are generated by a homogeneous Poisson process. The homogeneous intensity rates for A, B, C are set as λ_A = 0.2, λ_B = 0.2, and λ_C = 0.2. The algorithm used to generate instances of A, B, C is described as Algorithm 1 (Chen, 2016).

Algorithm 1 Simulation of a homogeneous Poisson process with intensity rate λ.
Input: intensity rate λ, simulation horizon T
Output: occurrence time stamps T = {t_k}
1: Initialize n = 0, t_0 = 0;
2: while True do
3:   Generate u ~ uniform(0, 1);
4:   Let w = -ln(u)/λ;
5:   Set t_{n+1} = t_n + w;
6:   if t_{n+1} > T then
7:     return T = {t_k}_{k=1,2,...,n};
8:   else
9:     Set n = n + 1;
10:  end if
11: end while

With the above algorithm, we can generate the occurrences of event labels A, B, and C.
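A minimal Python version of Algorithm 1 is given below. It is our own sketch (the paper does not release this exact code); the function name is illustrative.

```python
import math
import random

def simulate_homogeneous_poisson(lam, T, rng=random):
    """Algorithm 1: sample occurrence times of a rate-lam homogeneous Poisson
    process on [0, T] via exponential inter-arrival times w = -ln(u)/lam."""
    times, t = [], 0.0
    while True:
        u = rng.random()              # u ~ uniform(0, 1)
        t += -math.log(u) / lam
        if t > T:
            return times
        times.append(t)

# e.g., occurrences of the free labels A, B, C with rate 0.2 over T = 30
print(simulate_homogeneous_poisson(lam=0.2, T=30.0))
```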
Next, we build a CLNN for ϕ̂_1 = (c_A - c_B > 1)^1 ∧ (c_A - c_C > 3)^1 to calculate the conditional intensity rate λ_{D|ϕ̂_1}, whose model structure is shown in Figure 5. After obtaining λ_{D|ϕ̂_1}(t), we use Algorithm 2 (Chen, 2016) to generate the occurrences of D.

Algorithm 2 Simulation of an inhomogeneous Poisson process with intensity rate λ(t).
Input: intensity rate λ(t), simulation horizon T
Output: occurrence time stamps T = {t_k}
1: Initialize n = m = 0, t_0 = s_0 = 0, λ̄ = sup_{0 ≤ t ≤ T} λ(t);
2: while s_m < T do
3:   Generate a uniform random variable u ~ uniform(0, 1);
4:   Let w = -ln(u)/λ̄;
5:   Set s_{m+1} = s_m + w;
6:   Generate D ~ uniform(0, 1);
7:   if D ≤ λ(s_{m+1})/λ̄ then
8:     t_{n+1} = s_{m+1};
9:     n = n + 1;
10:  end if
11:  m = m + 1;
12: end while
13: if t_n ≤ T then
14:   return {t_k}_{k=1,2,...,n}
15: else
16:   return {t_k}_{k=1,2,...,n-1}
17: end if

Results. The rules learned by CLNN, TELLER, and OGEM-tab on the first synthetic dataset are presented in Table 5, where the paired order predicate with the higher weight among the two candidates is presented.

Table 5: Comparison of rule discovery for CLNN and TELLER on the Syn-1 dataset.
  Dataset:         Syn-1
  Event labels:    M = 4, L = {A, B, C, D}
  Ground truth:    ϕ̂_1 = (c_A - c_B > 1)^1 ∧ (c_A - c_C > 3)^1
  CLNN's rule:     (c_A - c_B > 1.21)^{1.52} ∧ (c_A - c_C > 3.00)^{1.41} ∧ (c_A - c_D > 0.82)^{0.33} ∧ (c_B - c_C > 4.33)^0 ∧ (c_B - c_D > 10.69)^0 ∧ (c_D - c_C > 6.57)^{0.16}
  TELLER's rule:   A before D; B before D; C before D; A before D and C before D
  OGEM-tab's rule: Excitation: [B], [C], [C, B], [B, C], [A, C, B], [A, B, C]; Inhibitory: [A], [B, A], [B, A, C], [C, B, A], [A, B], [A, C], [B, C, A], [C, A, B], [C, A]

It can be clearly observed that by truncating the predicates with small weights, we obtain the formula

ϕ_1 = (c_A - c_B > 1.21)^{1.52} ∧ (c_A - c_C > 3.00)^{1.41},   (45)

which matches the ground-truth rule well. However, TELLER cannot capture the paired order representation between A and B or between A and C. OGEM-tab captures the order representations [A, B] and [A, C] as inhibitory causes, which contradicts the ground-truth rule.

C.2 SYNTHETIC DATASET 2 (SYN-2).

Generation Process. The second synthetic dataset contains 5 event labels: A, B, C, D, and E, where E is the event for prediction and A, B, C, D are causal events. The wCL formula used to generate the occurrence of event E in the second synthetic dataset is set as

ϕ̂_2 = (c_A - c_B > 0.5)^1 ∧ (c_A - c_C > 1.5)^1 ∧ (c_C - c_D > 2)^1,   (46)

whose unweighted version reads as "If A happens before B by at least 0.5 time units, A happens before C by at least 1.5 time units, and C happens before D by at least 2 time units, then E will happen".

[Figure 6 graphic not reproduced; the model assigns weight 1 to the three ground-truth paired order predicates and weight 0 to the other predicates, and outputs p(C', ϕ̂_2, t).]
Figure 6: Model structure of ϕ̂_2 for generating the second synthetic dataset.

The occurrences of events A, B, C, and D are generated using Algorithm 1, in which λ_A = λ_B = λ_C = λ_D = 0.2. After obtaining the occurrences of A, B, C, and D, we simulate the generation of event label E using Algorithm 2, in which the intensity rate λ_{E|ϕ̂_2}(t) is computed using the model shown in Figure 6.

Results. The rules learned by CLNN, TELLER, and OGEM-tab on the second synthetic dataset are presented in Table 6, where the paired order predicate with the higher weight is presented. It can be clearly observed that by truncating the predicates with small weights, CLNN learns the wCL formula

ϕ_2 = (c_A - c_B > 0.77)^{1.27} ∧ (c_A - c_C > 2.09)^{1.15} ∧ (c_C - c_D > 2.60)^{1.06},   (47)

whose order representations match the ground-truth rule well. Nevertheless, TELLER's rules only capture the ordering between A, B, and E, whereas the orderings between A and B, B and C, and C and D are not learned. OGEM-tab's rules only capture that event label D can excite the occurrence of event label E, and are unable to capture the dependence of event label E's occurrence on the order relations between A and B, B and C, and C and D.
C.3 SYNTHETIC DATASET 3 (SYN-3).

The third synthetic dataset is generated using a more interesting scheme that combines the generation schemes of the first and second synthetic datasets. It includes five event labels: A, B, C, D, and E. Here we consider A, B, and C as the causal events for the occurrence of D, and A, B, C, and D as the causal events for the occurrence of E. The occurrences of events A, B, C are generated using Algorithm 1, with λ_A = 0.2, λ_B = 0.2, and λ_C = 0.2. The wCL formula used to generate the occurrence of event D is set as

ϕ̂_{3,1} = (c_B − c_A > 2)^1 ∧ (c_C − c_A > 5)^1,   (48)

whose unweighted version reads as "If B happens before A by at least 2 time units and C happens before A by at least 5 time units, then D will happen." The generation of D's occurrences follows Algorithm 2, where λ_{D|ϕ̂_{3,1}}(t) is computed using the model shown in Figure 7. We call the third synthetic dataset at this step Syn-3.1.

Figure 7: Model structure of ϕ̂_{3,1} for generating the occurrence of D in the Syn-3 dataset.

After obtaining the occurrences of events A, B, C, and D, we simulate the occurrence of E using the following formula:

ϕ̂_{3,2} = (c_B − c_A > 5)^1 ∧ (c_C − c_B > 4)^1 ∧ (c_D − c_C > 3)^1.   (49)

Similarly, the generation of E's occurrences follows Algorithm 2, where the intensity rate λ_{E|ϕ̂_{3,2}}(t) is computed using the model shown in Figure 8. We call the third synthetic dataset at this step Syn-3.2.

Figure 8: Model structure of ϕ̂_{3,2} for generating the occurrence of E in the Syn-3 dataset.

The rules learned by CLNN, TELLER, and OGEM-tab on the cause of event D in the third synthetic dataset are presented in Table 7, where the paired order predicate with the higher weight among the two candidates is reported. By truncating the predicates with small weights, CLNN learns the wCL formula

ϕ_{3,1} = (c_B − c_A > 1.85)^{1.72} ∧ (c_C − c_A > 3.90)^{1.59},   (50)

whose order representations match the ground-truth rule well. On the other hand, TELLER's rule only reveals the temporal relation between event labels A, B, C, and D; it does not capture the temporal relations between A and B or between A and C. In addition, we observe that OGEM-tab does not capture that C is a parent event of D.

Dataset: Syn-3.1
# event labels: 5; L = {A, B, C, D, E}
Ground truth: ϕ̂_{3,1} = (c_B − c_A > 2)^1 ∧ (c_C − c_A > 5)^1
CLNN's rule: (c_B − c_A > 1.85)^{1.72} ∧ (c_C − c_A > 3.90)^{1.59} ∧ (c_D − c_A > 16.25)^{0.33} ∧ (c_C − c_B > 3.01)^{0} ∧ (c_D − c_B > 7.37)^{0.02} ∧ (c_D − c_C > 7.55)^{0}
TELLER's rule: A before D; B before D; C before D
OGEM-tab's rule: Excitation: [A], [A, B, D], [B, D, A], [D, A], [D, A, B], [B, A], [A, D], [D], [B, A, D], [D, B, A]. Inhibitory: [A, B], [B, D], [B], [A, D, B], [D, B]
Table 7: Comparison of rule discovery of ϕ_{3,1} for CLNN, TELLER, and OGEM-tab on the Syn-3.1 dataset.
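The discrete rules reported here are read off from the learned predicates by dropping low-weight conjuncts. A minimal sketch of this read-out, using the Syn-3.1 values from Table 7, is shown below; the 0.5 weight cutoff is an illustrative choice, as the paper does not state an explicit threshold.

```python
# Learned paired order predicates for Syn-3.1 (Table 7),
# as (first label, second label, threshold, weight) tuples.
learned = [
    ("B", "A", 1.85, 1.72),
    ("C", "A", 3.90, 1.59),
    ("D", "A", 16.25, 0.33),
    ("C", "B", 3.01, 0.00),
    ("D", "B", 7.37, 0.02),
    ("D", "C", 7.55, 0.00),
]

def extract_rule(predicates, cutoff=0.5):
    """Keep only conjuncts whose weight exceeds the cutoff and print the resulting discrete wCL formula."""
    kept = [p for p in predicates if p[3] > cutoff]
    return " AND ".join(f"(c_{a} - c_{b} > {th:.2f})^{w:.2f}" for a, b, th, w in kept)

print(extract_rule(learned))
# (c_B - c_A > 1.85)^1.72 AND (c_C - c_A > 3.90)^1.59
```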
The rules learned by CLNN, TELLER, and OGEM-tab on the cause of event E in the third synthetic dataset are presented in Table 8, in which the discrete wCL formula learned by CLNN is

ϕ_{3,2} = (c_B − c_A > 3.94)^{1.49} ∧ (c_C − c_B > 3.02)^{2.03} ∧ (c_D − c_C > 2.00)^{1.92}.   (51)

Clearly, ϕ_{3,2} recovers the temporal relations between A and B, B and C, and C and D. However, TELLER's rules only reflect the temporal relation between A, B, C, and E, and give no information about the temporal relations between A and B, B and C, or C and D. OGEM-tab's rules indicate that it considers event labels A, D, and E as the parent events of E, which does not match the ground-truth parent set.

Dataset: Syn-3.2
# event labels: 5; L = {A, B, C, D, E}
Ground truth: ϕ̂_{3,2} = (c_B − c_A > 5)^1 ∧ (c_C − c_B > 4)^1 ∧ (c_D − c_C > 3)^1
CLNN's rule: (c_B − c_A > 3.94)^{1.49} ∧ (c_C − c_A > 9.12)^{0.25} ∧ (c_D − c_A > 1.42)^{0.13} ∧ (c_E − c_A > 3.88)^{0.15} ∧ (c_C − c_B > 3.02)^{2.03} ∧ (c_D − c_B > 6.27)^{0.02} ∧ (c_E − c_B > 7.30)^{0.04} ∧ (c_D − c_C > 2.00)^{1.92} ∧ (c_E − c_C > 5.30)^{0.09} ∧ (c_E − c_D > 1.57)^{0.01}
TELLER's rule: A before E; B before E; C before E
OGEM-tab's rule: Excitation: [A, D], [D, A], [D, E], [E], [A, D, E], [D, E, A], [E, A], [A, E], [E, A, D], [A, E, D], [D, A, E], [E, D, A]. Inhibitory: [A], [D], [E, D]
Table 8: Comparison of rule discovery of ϕ_{3,2} for CLNN, TELLER, and OGEM-tab on the Syn-3.2 dataset.

C.4 QUANTITATIVE COMPARISON OF CLNN'S RULES WITH GROUND TRUTH

To quantitatively evaluate the difference between the ground-truth rules and the rules learned by CLNN, we adopt the Jaccard similarity score to assess the learned formulas against the ground truth. Let G denote the set of paired order representations in the ground-truth rule and C denote the set of paired order representations in the learned rule; the Jaccard similarity score is calculated as J = |C ∩ G| / |C ∪ G|. For TELLER and OGEM-tab, the order representations are extracted from the excitation rules.

Figure 9: Comparison of ground-truth rules with CLNN's rules in terms of Jaccard similarity score for (a) Syn-1, (b) Syn-2, (c) Syn-3.1, (d) Syn-3.2.

The comparison of Jaccard similarity scores for the synthetic datasets is shown in Figure 9, where a Jaccard similarity score of 0 is manually set to the minimum threshold of 0.05 for clarity. The Jaccard similarity scores for CLNN are higher than those for TELLER and OGEM, implying that the rules discovered by CLNN are more consistent with the ground truth.
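The Jaccard score reduces to a set computation over the paired order representations. A minimal sketch, representing each ordering as a label pair (a convention of ours), is:

```python
def jaccard(ground_truth_pairs, learned_pairs):
    """Jaccard similarity J = |C ∩ G| / |C ∪ G| between the paired order representations
    of the ground-truth rule (G) and the learned rule (C)."""
    G, C = set(ground_truth_pairs), set(learned_pairs)
    return len(C & G) / len(C | G) if (C | G) else 0.0

# Syn-1: the ground truth orders A before B and A before C; CLNN's truncated
# rule recovers both pairs, giving a score of 1.0.
print(jaccard({("A", "B"), ("A", "C")}, {("A", "B"), ("A", "C")}))
```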
C.5 STABILITY ANALYSIS OF CLNN'S RULES WITH RESPECT TO INITIALIZATION

To further validate the model's stability in learning wCL rules, we repeat the learning with different parameter initialization methods:
1. rand: parameters initialized as random numbers from a uniform distribution on the interval [0, 1);
2. randn: random numbers from a normal distribution with mean 0 and variance 1;
3. ones: constant values of 1;
4. xavier: random numbers from a uniform distribution on the interval [−1/√n, 1/√n], where n is the dimension of the parameter.

The rules learned by CLNN for the above parameter initializations are summarized in Table 9. Inspecting the rules for the different initialization methods shows that CLNN still recovers the correct paired order representations even when the learning process is initialized from different starting points. Moreover, the learned logic formulas are stable, as the variance of the learned parameters is relatively small.

Dataset: Syn-1
  Ground truth: ϕ̂ = (c_A − c_B > 1)^1 ∧ (c_A − c_C > 3)^1
  rand: ϕ = (c_A − c_B > 1.21)^{1.52} ∧ (c_A − c_C > 3.00)^{1.41}
  randn: ϕ = (c_A − c_B > 1.21)^{1.58} ∧ (c_A − c_C > 3.32)^{1.56}
  ones: ϕ = (c_A − c_B > 1.17)^{1.59} ∧ (c_A − c_C > 3.14)^{1.32}
  xavier: ϕ = (c_A − c_B > 1.12)^{1.45} ∧ (c_A − c_C > 3.20)^{1.33}
Dataset: Syn-2
  Ground truth: ϕ̂ = (c_A − c_B > 0.5)^1 ∧ (c_A − c_C > 1.5)^1 ∧ (c_C − c_D > 2)^1
  rand: ϕ = (c_A − c_B > 0.77)^{1.27} ∧ (c_A − c_C > 2.09)^{1.15} ∧ (c_C − c_D > 2.60)^{1.06}
  randn: ϕ = (c_A − c_B > 0.80)^{1.97} ∧ (c_A − c_C > 1.92)^{1.62} ∧ (c_C − c_D > 1.74)^{1.45}
  ones: ϕ = (c_A − c_B > 1.03)^{1.63} ∧ (c_A − c_C > 1.92)^{1.50} ∧ (c_C − c_D > 2.03)^{1.44}
  xavier: ϕ = (c_A − c_B > 0.97)^{1.92} ∧ (c_A − c_C > 2.07)^{1.63} ∧ (c_C − c_D > 1.97)^{1.62}
Dataset: Syn-3.1
  Ground truth: ϕ̂ = (c_B − c_A > 2)^1 ∧ (c_C − c_A > 5)^1
  rand: ϕ = (c_B − c_A > 1.85)^{1.72} ∧ (c_C − c_A > 3.90)^{1.59}
  randn: ϕ = (c_B − c_A > 1.98)^{1.51} ∧ (c_C − c_A > 3.89)^{1.68}
  ones: ϕ = (c_B − c_A > 1.94)^{1.84} ∧ (c_C − c_A > 3.68)^{2.33}
  xavier: ϕ = (c_B − c_A > 1.89)^{1.54} ∧ (c_C − c_A > 3.92)^{1.62}
Dataset: Syn-3.2
  Ground truth: ϕ̂ = (c_B − c_A > 5)^1 ∧ (c_C − c_B > 4)^1 ∧ (c_D − c_C > 3)^1
  rand: ϕ = (c_B − c_A > 3.94)^{1.49} ∧ (c_C − c_B > 3.02)^{2.03} ∧ (c_D − c_C > 2.00)^{1.92}
  randn: ϕ = (c_B − c_A > 3.79)^{1.71} ∧ (c_C − c_B > 3.04)^{1.89} ∧ (c_D − c_C > 1.68)^{1.65}
  ones: ϕ = (c_B − c_A > 3.53)^{1.66} ∧ (c_C − c_B > 3.09)^{1.88} ∧ (c_D − c_C > 1.25)^{1.81}
  xavier: ϕ = (c_B − c_A > 3.71)^{1.53} ∧ (c_C − c_B > 3.09)^{2.04} ∧ (c_D − c_C > 1.86)^{1.73}
Table 9: Comparison of rules learned by CLNN for different parameter initialization methods.
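The four schemes can be reproduced with a few lines of NumPy. The helper below is an illustrative sketch; the function name and the choice of the trailing dimension as n are ours.

```python
import numpy as np

def init_param(shape, scheme, rng=None):
    """Parameter initialization schemes compared in Table 9."""
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "rand":        # uniform on [0, 1)
        return rng.random(shape)
    if scheme == "randn":       # normal with mean 0 and variance 1
        return rng.standard_normal(shape)
    if scheme == "ones":        # constant values of 1
        return np.ones(shape)
    if scheme == "xavier":      # uniform on [-1/sqrt(n), 1/sqrt(n)]
        n = shape[-1] if isinstance(shape, tuple) else shape
        bound = 1.0 / np.sqrt(n)
        return rng.uniform(-bound, bound, shape)
    raise ValueError(f"unknown initialization scheme: {scheme}")

w0 = init_param((6,), "xavier")  # e.g. weights over six paired order predicates
```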
C.6 ANALYSIS OF LOGICAL CONSTRAINTS ON THE LL

In this part, we investigate the effect of interpretability through an experiment on the impact of logical constraints on the model's performance. The log-likelihood (LL) on the synthetic datasets for CLNN with and without logical constraints is summarized in Table 10. The log-likelihood for CLNN with logical constraints is higher than that for CLNN without constraints, implying that the interpretability enforced by the logical constraints helps improve performance.

Dataset | CLNN with constraints | CLNN without constraints
Syn-1 | -7821 | -8716
Syn-2 | -6075 | -6942
Syn-3.1 | -10898 | -11583
Syn-3.2 | -10919 | -11230
Table 10: Comparison of LL for CLNN with and without logical constraints.

D EXPERIMENT RESULTS OF REAL-WORLD DATASETS

D.1 LINKEDIN DATASET

The LinkedIn dataset is a collection of job-hopping records between 82 IT companies for 3,000 LinkedIn users. Each event stream represents a user's check-in time stamps for different companies or role changes within the same company. We select 1,000 users' event streams to compose the dataset by filtering out event streams with uncommon companies, resulting in 10 event labels: L = {A, B, C, D, E, F, G, H, I, J}. We set the number of formulas to 5, i.e., Φ = {ϕ_1, ϕ_2, ϕ_3, ϕ_4, ϕ_5}, each of which follows the model structure shown in Figure 2(a), and CLNN learns the parameters of each formula.

The weight parameters in the paired order cells and the singleton order cells are initialized as random variables following a Gaussian distribution, and the bias terms of the conjunction and disjunction operators are initialized to 1. The architecture weights, the formula impact weights, and the formula bias are also initialized as Gaussian random variables. The log-likelihood for each event label is summarized in Table 11.

Event Label | Log-likelihood
A | -180.59
B | -177.80
C | -89.49
D | -140.31
E | -132.83
F | -76.63
G | -106.23
H | -103.33
I | -95.51
J | -125.45
Table 11: Log-likelihood for each event label in the LinkedIn dataset.

D.2 MIMIC II DATASET

The MIMIC II dataset is obtained from an intensive care unit research database consisting of 25,328 intensive care unit stays. The records include laboratory data and therapeutic intervention profiles such as nursing progress notes, discharge summaries, and others. We restrict the event types to patient diagnoses and filter out short event sequences with few visits, ending up with 650 patients and 15 event labels: L = {1, 2, 8, 9, 11, 12, 14, 20, 21, 22, 23, 26, 27, 42, 47}. The parameter initialization follows the same setting as for the LinkedIn dataset. The log-likelihood for each event label is presented in Table 12.

Event Label | Log-likelihood
1 | -72.14
2 | -62.33
8 | -5.98
9 | -51.34
11 | -43.64
12 | -25.81
14 | -69.73
20 | -5.96
21 | -6.08
22 | -10.47
23 | -10.64
26 | -27.08
27 | -27.42
42 | -5.95
47 | -10.54
Table 12: Log-likelihood for each event label in the MIMIC II dataset.

D.3 STACK OVERFLOW DATASET

Stack Overflow is a question-and-answer website spanning a wide range of domains. A badge rewarding scheme is used to encourage users to participate in questioning and answering activities. The badge system of Stack Overflow comprises 81 types of non-topical badges, including badges that can be awarded only once and badges that can be awarded several times. The dataset in (Du et al., 2016) was obtained by first filtering out the badges that can be awarded only once, then restricting to users who acquired at least 40 badges between 2012-01-01 and 2014-01-01, from which the badges awarded more than 100 times were selected for the final dataset. Our dataset was acquired by retaining the event streams containing one or more of the 20 specified badge types and then randomly sampling 1,000 users to obtain 1,000 event streams. The log-likelihood for each event label in the Stack Overflow dataset is summarized in Table 13.

Event Label | Log-likelihood
1 | -3791
2 | -1451
3 | -538
4 | -17656
5 | -3574
6 | -3559
7 | -1381
8 | -1330
9 | -10961
10 | -1105
11 | -189
12 | -2012
13 | -673
14 | -1340
15 | -406
16 | -117
17 | -186
18 | -330
19 | -282
20 | -100
Table 13: Log-likelihood for each event label in the Stack Overflow dataset.

D.4 ANALYSIS OF EXPRESSIVENESS ON MODEL'S PERFORMANCE

In this part, we train CLNN without the singleton order cell (SOC) on the real-world datasets to show the effectiveness of the singleton order predicates. The comparison of log-likelihood for CLNN with and without the SOC is summarized in Table 14.

Dataset | CLNN with SOC | CLNN without SOC
LinkedIn | -1228 | -1344
MIMIC II | -436 | -480
Stack Overflow | -50981 | -51195
Table 14: Comparison of log-likelihood for CLNN with and without the SOC on the real-world datasets.
As evidenced by Table 14, the log-likelihood of CLNN with the SOC is higher than that of CLNN without the SOC, indicating that enriching the expressiveness of wCL formulas helps better explain the generative mechanism of events.