Imputing Missing Events in Continuous-Time Event Streams

Hongyuan Mei [1]  Guanghui Qin [2]  Jason Eisner [1]

[1] Department of Computer Science, Johns Hopkins University, USA. [2] Department of Physics, Peking University, China.

Abstract

Events in the world may be caused by other, unobserved events. We consider sequences of events in continuous time. Given a probability model of complete sequences, we propose particle smoothing, a form of sequential importance sampling, to impute the missing events in an incomplete sequence. We develop a trainable family of proposal distributions based on a type of bidirectional continuous-time LSTM. Bidirectionality lets the proposals condition on future observations, not just on the past as in particle filtering. Our method can sample an ensemble of possible complete sequences (particles), from which we form a single consensus prediction that has low Bayes risk under our chosen loss metric. We experiment in multiple synthetic and real domains, using different missingness mechanisms, and modeling the complete sequences in each domain with a neural Hawkes process (Mei & Eisner, 2017). On held-out incomplete sequences, our method is effective at inferring the ground-truth unobserved events, with particle smoothing consistently improving upon particle filtering.

1. Introduction

Event streams of discrete events in continuous time are often partially observed. We would like to impute the missing events z. Suppose we know the prior distribution pmodel of complete event streams, as well as the missingness mechanism pmiss(z | complete stream), which stochastically determines which of the events will not be observed. One can then use Bayes' Theorem, as spelled out in equation (1) below, to define the posterior distribution p(z | x) given just the observed events x.[1]

Why is this important? The ability to impute z is useful in many applied domains, for example:
Correspondence to: Hongyuan Mei. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Medical records. Some patients record detailed symptoms, self-administered medications, diet, and sleep. Imputing these events for other patients would produce an augmented medical record that could improve diagnosis, prognosis, treatment, and counseling. Similar remarks apply to users of life-tracking apps (e.g., MyFitnessPal) who forget to log some of their daily activities (e.g., meals, sleep, and exercise).

Competitive games. In poker or StarCraft, a player lacks full information about what her opponents have acquired (cards) or done (built mines and trained soldiers). Accurately imputing hidden actions from what the player did and what she observed others doing can help her make good decisions. Similar remarks apply to practical scenarios (e.g., military) where multiple actors compete and/or cooperate.

User interface interactions. Cognitive events are usually unobserved. For example, users of an online news provider (e.g., Bloomberg Terminal) may have read and remembered a displayed headline whether or not they clicked on it. Such events are expensive to observe (e.g., via gaze tracking or asking the user). Imputing them given the observed events (e.g., other clicks) would facilitate personalization. Other partially observed event streams arise in online shopping, social media, etc.

Why is it challenging? It is computationally difficult to reason about the posterior distribution p(z | x). Even for a simple pmodel like a Hawkes process (Hawkes, 1971), Markov chain Monte Carlo (MCMC) methods are needed, and these methods obtain an efficient transition kernel only by exploiting special properties of the process (Shelton et al., 2018). Unfortunately, such properties no longer hold for the more flexible neural models that we will use in this paper (Du et al., 2016; Mei & Eisner, 2017).

What is our contribution?
We are, to the best of our knowledge, the first to develop general sequential Monte Carlo (SMC) methods to approximate the posterior distribution over incompletely observed draws from a neural point process. We begin by sketching the approach.

[1] Bayes' Theorem can be applied even if pmiss is a missing-not-at-random (MNAR) mechanism, as is common in this setting. MNAR is only tricky if we know neither pmodel nor pmiss.

Figure 1. Stochastically imputing a taxi's pick-up events given its observed drop-off events. At this stage, we are trying to determine the next event after the drop-off at time t1: either an unobserved event at t1,1 ∈ (t1, t2) or the next observed event at t2. (a1) Both intensities are low (i.e., passengers are scarce at this time of day), so no event happens to be proposed in (t1, t2). (a2) Specifically, the next proposed event would be somewhere after t2, without bothering to determine its time precisely. (a3) Thus, the next event is the observed drop-off at t2; we feed it into the LSTM, preempting the later proposal, which is discarded (line 37 of Algorithm 1).

(a) Particle filtering (section 3.1). We show part of the process of drawing one particle. Above left, the neural Hawkes process's LSTM has already read the proposed and observed events at times ≤ t1. Its resulting state determines the model intensities of the two event types (pick-up and drop-off), from which the sampler (Algorithm 1 in Appendix C) determines that there is no unobserved event in (t1, t2). Above right, we continue to extend the particle by feeding the drop-off at t2 into the LSTM and proposing subsequent events based on the new intensities after t2. But because the drop-off intensity was low at t2, the observed drop-off at t2 was unexpected, and that results in downweighting the particle (line 45 of Algorithm 1).
Downweighting recognizes belatedly that proposing no event in (t1, t2) has committed us to a particle that will be improbable under the posterior, because its complete sequence includes consecutive drop-offs (at t1 and t2) far apart in time.

(b1) Since a drop-off at t2 strongly suggests a pick-up before t2, considering the future increases the proposal intensity of pick-up on (t1, t2) above the model intensity (while decreasing that of drop-off). (b2) Consequently, the next proposed event is more likely to be a pick-up in (t1, t2) than it was in Figure 1a. If we stochastically generate such an event at t1,1, it is fed into the original LSTM. (b3) The updated state determines the new model intensities, and also combines with the right-to-left state to determine the new proposal intensities, which are used to sample the next event.

(b) Particle smoothing (section 3.2) samples from a better-informed proposal distribution: a second LSTM (Appendix D) reads the future observations from right to left, and its state is used together with the left-to-right state to determine the proposal intensities.

Mei & Eisner (2017) give an algorithm to sample a complete sequence from a neural Hawkes process. Each event in turn is sampled given the complete history of previous events. However, this algorithm only samples from the prior over complete sequences. We first adapt it into a particle filtering algorithm that samples from the posterior given all the observed events. The basic idea (Figure 1a) is to draw the events in sequence as before, but now we force any observed events to be drawn at the appropriate times. That is, we add the observed events to the sequence as they happen (and they duly affect the distribution of subsequent events).
There is an associated cost: if we are forced to draw an observed event that is improbable given its past history, we must downweight the resulting complete sequence accordingly, because evidently the particular past history that we sampled was inconsistent with the observed event, and hence cannot be part of a high-likelihood complete sequence. Using this method, we sample many sequences (or particles) of different relative weights. This method applies to any temporal point process.[2] Linderman et al. (2017) apply it to the classical Hawkes process.

Alas, this approach is computationally inefficient. Sampling a complete sequence that is actually probable under the posterior requires great luck, as the proposal distribution must have the good fortune to draw only events that happen to be consistent with future observations. Such lucky particles would appropriately get a high weight relative to other particles. The problem is that we will rarely get such particles at all (unless we sample very many). To get a more accurate picture of the posterior, this paper draws each event from a smarter distribution that is conditioned on the future observations (rather than drawing the event in ignorance of the future and then downweighting the particle if the future does not turn out as hoped).

[2] As long as the number of events is finite with probability 1, and it is tractable to compute the log-likelihood of a complete sequence and to estimate the log-likelihoods of its prefixes.

This idea is called particle smoothing (Doucet & Johansen, 2009). How does it work in our setting? The neural Hawkes process defines the distribution of the next event using the state of a continuous-time LSTM that has read the past history from left to right.
When sampling a proposed event, we now use a modified distribution (Figure 1b) that also considers the state of a second continuous-time LSTM that has read the future observations from right to left. As this modified distribution is still imperfect (merely a proposal distribution), we still have to reweight our particles to match the actual posterior under the model. But this reweighting is not as drastic as for particle filtering, because the new proposal distribution was constructed and trained to resemble the actual posterior. Our proposal distribution could also be used with other point process models, by replacing the left-to-right LSTM state with other informative statistics of the past history.

What other contributions? We introduce an appropriate evaluation loss metric for event stream reconstruction, and then design a consensus decoder that outputs a single low-risk prediction of the missing events by combining the sampled particles (instead of picking one of them).

2. Preliminaries[3]

2.1. Partially Observed Event Streams

We consider a missing-data setting (Little & Rubin, 1987). We are given a fixed time interval [0, T) over which events can be observed. An event of type k ∈ {1, 2, . . . , K} at time t ∈ [0, T) is denoted by an ordered pair written mnemonically as k@t. Each possible outcome in our probability distributions is a complete event sequence in which each event is designated as either observed or missing. We observe only the observed events, denoted by x = {k1@t1, k2@t2, . . . , kI@tI}, where 0 = t0 < t1 < t2 < . . . < tI < tI+1 = T. We are given the observation interval [0, T) in the form of two boundary events k0@t0 and kI+1@tI+1 at its endpoints, where k0 = 0 and kI+1 = K+1. Let ki,0@ti,0 be an alternative notation for the observed event ki@ti. Following this observed event (for any 0 ≤ i ≤ I), there are Ji ≥ 0 unobserved events z = {ki,1@ti,1, ki,2@ti,2, . . . , ki,Ji@ti,Ji}, where ti,0 < ti,1 < . . . < ti,Ji < ti+1.
We must guess this unobserved sequence, including its length Ji. Let ⊔ denote disjoint union. Our hypothesized complete event sequence x ⊔ z is thus {ki,j@ti,j : 0 ≤ i ≤ I + 1, 0 ≤ j ≤ Ji}, where ti,j increases strictly with the pair (i, j) in lexicographic order.[4]

[3] Our conventions regarding capitalization, boldface, etc. are inherited from the notation of Mei & Eisner (2017, section 2).

[4] In general we should allow ti,j to increase non-strictly with (i, j). But equality happens to have probability 0 under the neural Hawkes model. So it is convenient to exclude it here, simplifying notation by allowing x, z, H(t) to be sets, not sequences.

In this paper, we will attempt to guess all of z jointly by sampling it from the posterior distribution p(Z = z | X = x) ∝ pmodel(Y = x ⊔ z) · pmiss(Z = z | Y = x ⊔ z) of a process that first generates the complete sequence x ⊔ z from a complete-data model pmodel (given [0, T)), and then determines which events to censor with the possibly stochastic missingness mechanism pmiss. The random variables X, Z, and Y refer respectively to the sets of observed events, missing events, and all events over [0, T). Thus Y = X ⊔ Z. Under the distributions we will consider, |Y| is almost surely finite. Notice that z denotes the set of missing events in Y and Z = z denotes the fact that they are missing. That said, we will abbreviate our notation above in the standard way:

p(z | x) ∝ pmodel(x ⊔ z) · pmiss(z | x ⊔ z)    (1)

Note that x ⊔ z is simply an undifferentiated sequence of k@t pairs; the subscripts i, j are in effect assigned by pmiss, which partitions x ⊔ z into x and z. To explain a sequence of 50 observed events, one hypothesis is that pmodel generated 73 events and then pmiss selected 23 of them to be missing (as z), leaving the 50 observed events (as x). In many missing-data settings, the second factor of equation (1) can be ignored because (for the given x) it is known to be a constant function of z. Then the missing data are said to be missing at random (MAR).
For event streams, however, the second factor is generally not constant in z but varies with the number of missing events |z|. Thus, our unobserved events are normally missing not at random (MNAR). See discussion in section 5.1 and Appendix A.

2.2. Choice of pmodel

We need a multivariate point process model pmodel(x ⊔ z). We choose the neural Hawkes process (Mei & Eisner, 2017), which has proven flexible and effective at modeling many real-world event streams. Whether an event happens at time t ∈ [0, T) depends on the history H(t) := {k′@t′ ∈ x ⊔ z : t′ < t}, the set of all observed and unobserved events before t. Given this history, the neural Hawkes process defines an intensity λk(t | H(t)) ∈ ℝ≥0, which may be thought of as the instantaneous rate at time t of events of type k:

λk(t | H(t)) = fk(v_k⊤ h(t))    (2)

Here fk is a softplus function with a k-specific scaling parameter. The vector h(t) ∈ (−1, 1)^D summarizes (H(t), t). It is the hidden state at time t of a continuous-time LSTM that previously read the events in H(t) as they happened. The state of such an LSTM evolves endogenously as it waits between events, so the state h(t) reflects not only the sequence of past events but also their timing, including the gap between the last event in H(t) and t. As Mei & Eisner (2017) explain, the probability of an event of type k in the interval [t, t+dt), divided by dt, approaches (2) as dt → 0+. Thus, λk is similar to the intensity function of an inhomogeneous Poisson process. Yet it is not a fixed parameter: the λk function for times t is affected by the previously sampled events H(t). See Appendix B.1.

3. Particle Methods

It is often intractable to sample exactly from p(z | x), because x and z can be interleaved with each other. As an alternative, we can use normalized importance sampling, drawing many z values from a proposal distribution q(z | x) and weighting them in proportion to p(z | x) / q(z | x).
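As a concrete illustration (a generic sketch, not the paper's Algorithm 1), normalized importance sampling can be written as follows; `propose`, `log_p`, and `log_q` are hypothetical stand-ins for the sequential sampler, the unnormalized log-posterior, and the proposal log-density.

```python
import math
import random

def normalized_importance_sample(propose, log_p, log_q, M, rng=random):
    # Draw M particles z_m ~ q and weight each in proportion to
    # p(z_m | x) / q(z_m | x), normalizing so the weights sum to 1.
    particles = [propose(rng) for _ in range(M)]
    log_w = [log_p(z) - log_q(z) for z in particles]
    mx = max(log_w)                              # subtract max for numerical stability
    w = [math.exp(lw - mx) for lw in log_w]
    total = sum(w)
    return particles, [wi / total for wi in w]

def expectation(particles, weights, f):
    # E_{p(z|x)}[f(z)] is approximated by sum_m w_m f(z_m)
    return sum(wm * f(zm) for wm, zm in zip(weights, particles))
```

If q equals p, every weight comes out as 1/M; otherwise, particles that q under-proposes receive correspondingly larger weights.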
Figure 1 shows the key ideas in terms of an example. Full details are spelled out in Algorithm 1 in Appendix C. Algorithm 1 is a sequential Monte Carlo (SMC) approach (Moral, 1997; Liu & Chen, 1998; Doucet et al., 2000; Doucet & Johansen, 2009). It returns an ensemble of weighted particles {(z_m, w_m)}_{m=1}^M. Each particle z_m is sampled from the proposal distribution q(z | x), which is defined to support sampling via a sequential procedure that draws one unobserved event at a time. The corresponding w_m are importance weights, which are defined as follows (and built up factor-by-factor in Algorithm 1):

w_m ∝ pmodel(x ⊔ z_m) · pmiss(z_m | x ⊔ z_m) / q(z_m | x) ≥ 0    (3)

where the normalizing constant is chosen to make Σ_{m=1}^M w_m = 1. Equations (1) and (3) imply that we would have w_m = 1/M if we could set q(z | x) equal to p(z | x), so that the particles were IID samples from the desired posterior distribution. In practice, q will not equal p, but will be easier than p to sample from. To correct for the mismatch, the importance weights w_m are higher for particles that q proposes less often than p would have proposed them. The distribution implicitly formed by the ensemble, p̂(z), approaches p(z | x) as M → ∞ (Doucet & Johansen, 2009). Thus, for large M, the ensemble may be used to estimate the expectation of any function f(z), via

E_{p(z|x)}[f(z)] ≈ E_{p̂}[f(z)] = Σ_{m=1}^M w_m f(z_m)    (4)

Here f(z) may be a function that summarizes properties of the complete stream x ⊔ z on [0, T), or predicts future events on [T, ∞) using the sufficient statistic H(T) = x ⊔ z. In the subsections below, we will describe two specific proposal distributions q that are appropriate for the neural Hawkes process, as we sketched in section 1. These distributions define intensity functions λ^q over time intervals. The trickiest part of Algorithm 1 (at line 31) is to sample the next unobserved event from the proposal distribution q. Here we use the thinning algorithm (Lewis & Shedler, 1979; Liniger, 2009; Mei & Eisner, 2017).
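A minimal sketch of thinning, assuming a constant dominating rate `lam_max` that upper-bounds the proposal intensity `lam` on the interval (these names are illustrative, not the paper's):

```python
import random

def thinning_sample(lam, t0, t1, lam_max, rng=random):
    # First event of an inhomogeneous Poisson process with intensity
    # lam(t) <= lam_max on (t0, t1), or None if the interval elapses
    # first (e.g., preempted by the next observed event at t1).
    t = t0
    while True:
        t += rng.expovariate(lam_max)          # homogeneous Poisson proposal
        if t >= t1:
            return None
        if rng.random() * lam_max < lam(t):    # accept with prob lam(t)/lam_max
            return t
```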
Briefly, this is a rejection sampling algorithm whose own proposal distribution uses a constant intensity λ*, making it a homogeneous Poisson process (which is easy to sample from). An event proposed by the Poisson process at time t is accepted with probability λ^q(t)/λ* ≤ 1. If it is rejected, we move on to the next event proposed by the Poisson process, continuing until we either accept such an unobserved event or are preempted by the arrival of the next observed event.

After each step, one may optionally resample a new set of particles from {z_m}_{m=1}^M (the RESAMPLE procedure in Algorithm 1). This trick tends to discard low-weight particles and clone high-weight particles, so that the algorithm can explore multiple continuations of the high-weight particles.

3.1. Particle Filtering

We already have a neural Hawkes process pmodel that was trained on complete data. This model uses a neural net to define an intensity function λ^p_k(t | H(t)) for any history H(t) of events before t and each event type k. The simplest proposal distribution uses this intensity function to draw the unobserved events. More precisely, for each i = 0, 1, . . . , I, for each j = 0, 1, 2, . . ., let the next event ki,j+1@ti,j+1 be the first event generated by any of the K intensity functions λk(t | H(t)) over the interval t ∈ (ti,j, ti+1), where H(t) consists of all observed and unobserved events up through ki,j@ti,j. If no event is generated on this interval, then the next event is ki+1@ti+1. This is implemented by Algorithm 1 with smooth = false.

3.2. Particle Smoothing

As motivated in section 1, we would rather draw each unobserved event according to λk(t | H(t), F(t)), where the future F(t) := {ki@ti : t < ti ≤ T} consists of all observed events that happen after t. Note the asymmetry with H(t), which includes observed but also unobserved events. We use a right-to-left continuous-time LSTM to summarize the future F(t) for any time t into another hidden state vector h̄(t) ∈ (−1, 1)^D.
Then we parameterize the proposal intensity using an extended variant of equation (2):

λ^q_k(t | H(t), F(t)) = fk(v_k⊤ (h(t) + B h̄(t)))    (5)

where h̄(t) is the right-to-left LSTM's hidden state. This extra machinery is used by Algorithm 1 when smooth = true. Intuitively, the left-to-right h(t), as explained in Mei & Eisner (2017), reads the history H(t) and computes sufficient statistics for predicting events at times ≥ t given H(t). But we wish to predict these events given H(t) and F(t). Equation (5) approximates this Bayesian update using the right-to-left h̄(t), which is trained to carry back relevant information about future observations F(t). This is a kind of neuralized forward-backward algorithm. Lin & Eisner (2018) treat the discrete-time analogue, explaining why a neural forward pmodel no longer admits tractable exact proposals as does a hidden Markov model (Rabiner, 1989) or linear dynamical system (Rauch et al., 1965). Like them, we fall back on training an approximate proposal distribution. Regardless of pmodel, particle smoothing is to particle filtering as Kalman smoothing is to Kalman filtering (Kalman, 1960; Kalman & Bucy, 1961).

Our right-to-left LSTM has the same architecture as the left-to-right LSTM used in our pmodel (section 2.2), but a separate parameter vector. For any time t ∈ [0, T), it arrives at h̄(t) by reading only the observed events {ki@ti : t < ti ≤ T}, i.e., F(t), in reverse chronological order. Formulas are given in Appendix D. This architecture seemed promising for reading an incomplete sequence of events from right to left, as Mei & Eisner (2017, section 6.3) had already found that this architecture is predictive when used to read incomplete sequences from left to right.

3.2.1. Training the Proposal Distribution

The particle-smoothing proposer q can be trained to approximate p(z | x) by minimizing a Kullback-Leibler (KL) divergence.
Its left-to-right LSTM is fixed at pmodel's, so its trainable parameters φ are just the parameters of the right-to-left LSTM together with the matrix B from equation (5). Though p(z | x) is unknown, the gradient of the inclusive KL divergence between q(z | x) and p(z | x) is

∇φ KL(p ‖ q) = E_{z∼p(z|x)}[−∇φ log q(z | x)]    (6)

and the gradient of the exclusive KL divergence is

∇φ KL(q ‖ p) = E_{z∼q}[∇φ (1/2)(log q(z | x) − b)²]    (7a)
b = log pmodel(x ⊔ z) + log pmiss(z | x ⊔ z)    (7b)

where log pmodel(x ⊔ z) is given in Appendix B.1, log q(z | x) is given in Appendix C.1, and pmiss(z | x ⊔ z) is assumed to be known to us for any given pair of x and z.

Minimizing the inclusive KL divergence aims at high recall: q(z | x) is adjusted to assign high probabilities to all of the good hypotheses (according to p(z | x)). Conversely, minimizing the exclusive KL divergence aims at high precision: q(z | x) is adjusted to assign low probabilities to poor reconstructions, so that they will not be proposed. We seek to minimize the linearly combined divergence

Div = β KL(p ‖ q) + (1 − β) KL(q ‖ p), with β ∈ [0, 1]    (8)

and training is early-stopped when the divergence stops decreasing on the held-out development set.

But how do we measure these divergences between q(z | x) and p(z | x)? Of course, we actually want the expected divergence when the observed sequence x is drawn from the true distribution. Thus, we sample x by starting with a fully observed sequence from our training examples and then sampling a partition (x, z) from the known missingness mechanism pmiss.[5] The inclusive expectation in (6) uses this x and z. For the exclusive expectation in (7), we keep this x but sample a new z from our proposal distribution q(· | x). Notice that minimizing the exclusive divergence here is essentially the REINFORCE algorithm (Williams, 1992), which is known to have large variance. In practice, when tuning our hyperparameters (Appendix G.2), β = 1 in (8) gave the best results.
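Since equation (6) is just the expected gradient of −log q(z | x) at ground-truth completions, inclusive-KL training reduces to maximum likelihood on sampled (x, z) pairs. A schematic SGD step, with `grad_log_q` a hypothetical user-supplied gradient of log q:

```python
def inclusive_kl_step(grad_log_q, pairs, phi, lr=0.1):
    # One SGD step on E[-log q(z|x)] over (x, z) pairs obtained by
    # applying pmiss to fully observed training sequences (equation (6)).
    g = [0.0] * len(phi)
    for x, z in pairs:
        for i, gi in enumerate(grad_log_q(phi, x, z)):
            g[i] += gi
    # ascend the average log-likelihood gradient
    return [p + lr * gi / len(pairs) for p, gi in zip(phi, g)]
```

For instance, if log q were a unit-variance Gaussian in a scalar z with mean φ₀, then grad_log_q would return [z − φ₀] and the step would pull φ₀ toward the mean of the observed z values.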
That is perhaps unsurprising: our experiments effectively avoided REINFORCE altogether and placed all the weight on the inclusive KL, which has no variance issue. More training details, including a bias and variance discussion, can be found in Appendix G.2. Appendix H discusses situations where training on incomplete data by EM is possible.

4. A Loss Function and Decoding Method

It is often useful to find a single hypothesis ẑ that minimizes the Bayes risk, i.e., the expected loss with respect to the unknown ground truth z*. This procedure is called minimum Bayes risk (MBR) decoding and can be approximated with our ensemble of weighted particles:

ẑ = argmin_z Σ_{z′} p(z′ | x) L(z, z′)    (9a)
  ≈ argmin_z Σ_{m=1}^M w_m L(z, z_m)    (9b)

where L(z, z′) is the loss of z with respect to z′. This procedure for combining the particles into a single prediction is sometimes called consensus decoding. We now propose a specific loss function L and an approximate decoder.

4.1. Optimal Transport Distance

The loss of z is defined as the minimum cost of editing z into the ground truth z*. To accomplish this edit, we must identify the best alignment: a one-to-one partial matching a of the events in the two sequences. We require any two aligned events to have the same type k. We define a as a collection of alignment edges (t, t*), where t and t* are the times of the aligned events in z and z* respectively. An alignment edge between a predicted event at time t (in z) and a true event at time t* (in z*) incurs a cost of |t − t*| to move the former to the correct time. Each unaligned event in z incurs a deletion cost of Cdelete, and each unaligned event in z* incurs an insertion cost of Cinsert. Now

L(z, z*) = min_{a ∈ A(z, z*)} D(z, z*, a)    (10)

[5] To get more data for training q, we could sample more partitions of the fully observed sequence. In this paper, we only sample one partition. Note that the fully observed sequence is a real observation from the true complete-data distribution (not the model).
where A(z, z*) is the set of all possible alignments between z and z*, and D(z, z*, a) is the total cost given the alignment a. Notice that if |z| ≠ |z*|, any alignment leaves some events unaligned; also, rather than align two faraway events, it is cheaper to leave them unaligned if Cdelete + Cinsert < |t − t*|. Algorithm 2 in Appendix E uses dynamic programming to compute the loss (10) and its corresponding alignment a, similar to edit distance (Levenshtein, 1965) or dynamic time warping (Sakoe & Chiba, 1971; Listgarten et al., 2005). In practice we symmetrize the loss by specifying equal costs Cinsert = Cdelete = C.

4.2. Consensus Decoding

Since aligned events must have the same type, consensus decoding (9b) decomposes into separately choosing a set ẑ(k) of type-k events for each k = 1, 2, . . . , K, based on the particles' sets z(k)_m of type-k events. Thus, we simplify the presentation by omitting (k) throughout this section. The loss function L defined in section 4.1 warrants:

Theorem 1. Given {z_m}_{m=1}^M, if we define z̄ = ⊔_{m=1}^M z_m, then there exists ẑ ⊆ z̄ such that Σ_{m=1}^M w_m L(ẑ, z_m) = min_z Σ_{m=1}^M w_m L(z, z_m).

That is to say, there exists one subsequence of z̄ that achieves the minimum Bayes risk. The proof is given in Appendix F: it shows that if ẑ minimizes the Bayes risk but is not a subsequence of z̄, then we can modify it to either improve its Bayes risk (a contradiction) or keep the same Bayes risk while making it a subsequence of z̄ as desired. Now we have reduced this decoding problem to a combinatorial optimization problem:

ẑ = argmin_{z ⊆ z̄} Σ_{m=1}^M w_m L(z, z_m)    (11)

which is probably NP-hard, by analogy with the Steiner string problem (Gusfield, 1997). Our heuristic (Algorithm 3 of Appendix F) seeks to iteratively improve ẑ by (1) using Algorithm 2 to find the optimal alignment a_m of ẑ with each z_m, and then (2) repeating the following sequence of 3 phases until ẑ does not change.
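Before the phases themselves, the alignment loss (10) that the heuristic repeatedly consults can be sketched for a single event type; since both time lists are sorted and the match cost is |t − t*|, an optimal alignment can be taken non-crossing, so an edit-distance-style dynamic program (in the spirit of Algorithm 2, though not its exact code) suffices:

```python
def transport_loss(z, z_star, c_del=1.0, c_ins=1.0):
    # Minimum-cost partial matching between two sorted lists of event
    # times: aligning t to t* costs |t - t*|; leaving an event unaligned
    # costs c_del (in z) or c_ins (in z_star).
    n, m = len(z), len(z_star)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:      # leave z[i] unaligned (deletion)
                D[i+1][j] = min(D[i+1][j], D[i][j] + c_del)
            if j < m:      # leave z_star[j] unaligned (insertion)
                D[i][j+1] = min(D[i][j+1], D[i][j] + c_ins)
            if i < n and j < m:  # align the pair, paying the move cost
                D[i+1][j+1] = min(D[i+1][j+1], D[i][j] + abs(z[i] - z_star[j]))
    return D[n][m]
```

Note that faraway events stay unaligned: aligning times 1.0 and 5.0 would cost 4.0, so with C = 1 the DP prefers one deletion plus one insertion (cost 2.0) instead.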
Each phase tries to update ẑ to decrease the weighted distance Σ_{m=1}^M w_m D(ẑ, z_m, a_m), which by Theorem 1 is an upper bound on the Bayes risk Σ_{m=1}^M w_m L(ẑ, z_m):[6]

Move Phase. For each event in ẑ, move its time to the weighted median (using weights w_m) of the times of all events that a_m aligns it to (if any), while keeping the alignment edges. This selects the new time that minimizes Σ_{m=1}^M w_m D(ẑ, z_m, a_m).

Delete Phase. For each event in ẑ, delete it (together with any related edges in each a_m) if this decreases Σ_{m=1}^M w_m D(ẑ, z_m, a_m).

Insert Phase. If we inserted time t into ẑ, we would also modify each a_m to align t to the closest unaligned event in z_m (if any), provided that this decreased D(ẑ, z_m, a_m). Let Δ(t) be the resulting reduction in Σ_{m=1}^M w_m D(ẑ, z_m, a_m). Let t* = argmax_{t ∈ z̄, t ∉ ẑ} Δ(t). While Δ(t*) > 0, insert t*.

The move or delete phase can consider events in any order, or in parallel; this does not change the result.

[6] Note these phases compute D(ẑ, z_m, a_m) but not L(ẑ, z_m), so they need not call the dynamic programming algorithm.

5. Experiments

We compare our particle smoothing method with the strong particle filtering baseline (our neural version of Linderman et al. (2017)'s Hawkes process particle filter) on multiple real-world and synthetic datasets. See Appendix G for training details (e.g., hyperparameter selection). PyTorch code can be found at https://github.com/HMEIatJHU/neural-hawkes-particle-smoothing.

5.1. Missing-Data Mechanisms

We experiment with missingness mechanisms of the form

pmiss(z | x ⊔ z) = ∏_{ki@ti ∈ z} ρ_{ki} · ∏_{ki@ti ∈ x} (1 − ρ_{ki})    (12)

meaning that each event in the complete stream x ⊔ z is independently censored with a probability ρ_k that depends only on its event type k.[7] We consider both deterministic and stochastic missingness mechanisms. For the deterministic experiments, we set ρ_k for each k to be either 0 or 1, so that some event types are always observed while others are always missing.
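The mechanism in equation (12) is easy to simulate when preparing (x, z) pairs from fully observed sequences: each event is independently kept or censored by a coin flip with type-specific probability ρk. A minimal sketch (the function name and data layout are illustrative):

```python
import random

def split_by_missingness(events, rho, rng=random):
    # events: chronological list of (k, t) pairs; rho: dict mapping each
    # event type k to its censoring probability rho_k (equation (12)).
    x, z = [], []
    for k, t in events:
        (z if rng.random() < rho[k] else x).append((k, t))
    return x, z
```

With ρk ∈ {0, 1} this reproduces the deterministic setting; with ρk = ρ for all k, a given split (x, z) has probability (1 − ρ)^|x| ρ^|z|.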
Then pmiss(z | x ⊔ z) = 1 if z consists of precisely the events in x ⊔ z that ought to go missing, and 0 otherwise. For our stochastic experiments, we simply set ρ_k = ρ regardless of the event type k, and experiment with ρ = 0.1, 0.3, 0.5, 0.7, 0.9. Then equation (12) can be written as pmiss(z | x ⊔ z) = (1 − ρ)^|x| ρ^|z|, whose value decreases exponentially in the number of missing events |z|. As this depends on z, the stochastic setting is definitely MNAR (not MCAR, as one might have imagined).

5.2. Datasets

The datasets that we use in this paper range from short sequences with mean length 15 to long ones with mean length > 300. For each of the datasets, we possess fully observed data that we use to train the model and the proposal distribution.[8] For each dev and test example, we censored out some events from the fully observed sequence, so we present the x part as input to the proposal distribution but we also know the z part for evaluation purposes.

[7] Appendix H discusses how ρ could be imputed when complete and incomplete data are both available.

[8] The focus of this paper is on inference (imputation) under a given model, so training the model is simply a preparatory step. However, inference could be used to help train on incomplete data via the EM algorithm, provided that the missingness mechanism is known; see Appendix H for discussion.

Fully replicable details of the dataset preparation can be found in Appendix G, including how event types are defined and which event types are missing in the deterministic settings.

Synthetic Datasets. We first checked that we could successfully impute unobserved events that are generated from known distributions. That is, when the generating distribution actually is a neural Hawkes process, could our method outperform particle filtering in practice? Is the performance consistent over multiple datasets drawn from different processes? To investigate this, we synthesized 10 datasets, each of which was drawn from a different neural Hawkes process with randomly sampled parameters.

Elevator System Dataset (Crites & Barto, 1996).
A multi-floor building is often equipped with multiple elevator cars that follow cooperative strategies to transport passengers between floors (Lewis, 1991; Bao et al., 1994; Crites & Barto, 1996). In this dataset, the events record which elevator car stops at which floor. The deterministic case of this domain is representative of many real-world cooperative (or competitive) scenarios: observing the activities of some players and imputing those of the others.

New York City Taxi Dataset (Whong, 2014). Each medallion taxi in New York City has a sequence of time-stamped pick-up and drop-off events, where different locations have different event types. Figure 1 shows how we impute the pick-up events given the drop-off events (the deterministic missingness case).

5.3. Data Fitting Results

First, as an internal check, we measure how probable each ground-truth reference z* is under the proposal distribution constructed by each method, i.e., log q(z* | x). As shown in Figure 2, the improvement from particle smoothing is remarkably robust across 12 datasets, improving nearly every sequence in each dataset. The plots for the deterministic missingness mechanisms are so boringly similar that we only show them in Appendix G.6 (Figure 4).

5.4. Decoding Results

For each x, we now make a prediction by sampling an ensemble of M = 50 particles (section 3)[9] and constructing their consensus sequence ẑ (section 4.2). We use multinomial resampling, since otherwise the effective sample size is very low (only 1-2 on some datasets).[10]

[9] Increasing M would increase both effective sample size (ESS) and runtime.

Figure 2. Scatterplots of neural Hawkes particle smoothing (y-axis) vs. particle filtering (x-axis) on (a) Synthetic Data, (b) Elevator System, and (c) NYC Taxi, with a stochastic missingness mechanism (ρ = 0.5).
Each point represents a single test sequence, and compares the values of log q(z | x) / |z |. Larger values mean that the proposal distribution is better at proposing the ground truth z . Each dataset s scatterplot is converted to a cloud using kernel density estimation, with the centroid denoted by a black dot. A double-arrowed line indicates the improvement of particle smoothing over filtering. For the synthetic datasets, we draw ten clouds on the same figure and show the line for the dataset where smoothing improves the most. As we can see, the density is always well concentrated above y = x. That is, this is not merely an average improvement: nearly every ground truth z gets higher proposal probability! Particle smoothing performs well even on datasets where particle filtering performs badly. is very low (only 1 2 on some datasets).10 We evaluate ˆz by its optimal transport distance (section 4.1) to the ground truth z . Note that a, we can decompose D(ˆz, z , a) as C ( |ˆz| + |z | 2|a| | {z } total insertions+deletions (t,t ) a |t t | | {z } total distance moved Letting a be the alignment that minimizes D(ˆz, z , a), the former term measures how well ˆz predicts which events happened, and the latter measures how well ˆz predicts when those events happened. Different choices of C yield different ˆz with different trade-offs between these two terms. Intuitively, when C 0, the decoder is free to insert and delete event tokens; as C increases, ˆz will tend to insert/delete fewer event tokens and move more of them. Figure 3 plots the performance of particle smoothing ( ) vs. particle filtering ( ) for the stochastic missingness mechanisms, showing the two terms above as the x and y coordinates. The very similar plots for the deterministic missingness mechanisms are in Appendix G.6 (Figure 5).11 5.5. Sensitivity to Missingness Mechanism For the stochastic missingness mechanisms, we also did experiments with different values of missing rate ρ = 0.1, 0.3, 0.7, 0.9. 
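The optimal transport distance of section 4.1 minimizes D(ẑ, z*, a) over alignments a, and a small edit-distance-style dynamic program suffices to compute it. The sketch below is our own reading, not the authors' released code: it assumes monotone alignments over time-sorted sequences that pair only same-type events, charging C per unmatched event and |t − t′| per aligned pair:

```python
def ot_distance(zhat, zstar, C):
    """Minimize C*(|zhat| + |zstar| - 2|a|) + sum_{(t,t') in a} |t - t'|
    over monotone alignments a that pair events of the same type.
    Each sequence is a time-sorted list of (time, event_type) pairs."""
    INF = float("inf")
    m, n = len(zhat), len(zstar)
    # D[i][j] = cheapest cost of aligning zhat[:i] with zstar[:j]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if D[i][j] == INF:
                continue
            if i < m:  # leave zhat[i] unmatched (a deletion), cost C
                D[i + 1][j] = min(D[i + 1][j], D[i][j] + C)
            if j < n:  # leave zstar[j] unmatched (an insertion), cost C
                D[i][j + 1] = min(D[i][j + 1], D[i][j] + C)
            if i < m and j < n and zhat[i][1] == zstar[j][1]:
                # align two same-type events: pay the time they must move
                D[i + 1][j + 1] = min(D[i + 1][j + 1],
                                      D[i][j] + abs(zhat[i][0] - zstar[j][0]))
    return D[m][n]
```

This mirrors the trade-off described above: with C = 1.0, aligning a pair of same-type events 0.5 apart costs 0.5, cheaper than the 2C for deleting one and inserting the other; with C = 0.1, deletion plus insertion (cost 0.2) wins instead.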
Our particle smoothing method consistently outperforms the filtering baseline in all the experiments (Figure 6 in Appendix G.7), similar to Figure 3.

¹⁰ Any multinomial resampling step drives the ESS metric to M. This cannot guarantee better samples in general, but resampling did improve our decoding performance on all datasets.
¹¹ We show the 2 real datasets only. The figures for the 10 synthetic datasets are boringly similar to these.

(a) Elevator System (b) NYC Taxi
Figure 3. Optimal transport distance of particle smoothing vs. particle filtering on test data with a stochastic missingness mechanism (ρ = 0.5). In each figure, the x-axis is the total number of deletions and insertions in the test dataset, Σ_{n=1}^N (|ẑ_n| + |z*_n| − 2|a_n|), and the y-axis is the total movement cost, Σ_{n=1}^N Σ_{(t,t′)∈a_n} |t − t′|. Both axes are normalized by the true total number of missing events Σ_{n=1}^N |z*_n|, so the x-axis shows a fraction and the y-axis shows an average time difference. On each dataset, we show one point per value of C. According to equation (13), the vector (C, 1) turns out to be the gradient of Σ_{n=1}^N D(ẑ_n, z*_n, a_n) at this point. An arrow shows the actual improvement obtained by switching to particle smoothing (which is, indeed, an improvement because it has positive dot product with the gradient). The Pareto frontier (convex hull) of the particle-smoothing points dominates the Pareto frontier of the particle-filtering points, lying everywhere to its left, which means that our particle smoothing method outperforms the filtering baseline.

5.6. Runtime

The theoretical runtime complexity is O(MI) where M is the number of particles and I is the number of observed events. In practice, we generate the particles in parallel, leading to acceptable speeds of 300-400 milliseconds per event for the final method. More details about the wall-clock runtime can be found in Appendix G.8.

6.
Discussion and Related Work

To our knowledge, this is the first time a bidirectional recurrent neural network has been extended to predict events in continuous time. Bidirectional architectures have proven effective at predicting linguistic words and their properties given their left and right contexts (Graves et al., 2013; Bahdanau et al., 2015; Peters et al., 2018; Devlin et al., 2018): in particular, Lin & Eisner (2018) recently applied them to particle smoothing for discrete-time sequence tagging. Previous work that infers unobserved events in continuous time exploits special properties of simpler models, including Markov jump processes (Rao & Teh, 2012; 2013), continuous-time Bayesian networks (Fan et al., 2010), and Hawkes processes (Shelton et al., 2018). Such properties no longer hold for our more expressive neural model, necessitating our approximate inference method.

Metropolis-Hastings would be an alternative to our particle smoothing method. The transition kernel could propose a single-event change to z (insert, delete, or move). Unfortunately, this would be quite slow for a neural model like ours, because any proposed change early in the sequence would affect the LSTM state and hence the probability of all subsequent events. Thus, a single move takes O(|x ∪ z|) time to evaluate. Furthermore, the Markov chain may mix slowly because a move that changes only one event may often lead to an incoherent sequence that will be rejected. The point of our particle smoothing is essentially to avoid such rejection by proposing a coherent sequence of events, left to right but considering future x events, from an approximation q(z | x) to the true posterior. (One might build a better Metropolis-Hastings algorithm by designing a transition kernel that makes use of our current proposal distribution, e.g., via particle Gibbs (Chopin & Singh, 2015).)

We also introduced an optimal transport distance between event sequences, which is a valid metric.
It essentially regards each event sequence as a 0-1 function over times, and applies a variant of Wasserstein distance (Villani, 2008) or Earth Mover's distance (Kantorovitch, 1958; Levina & Bickel, 2001). Such variants are under active investigation (Benamou, 2003; Chizat et al., 2015; Frogner et al., 2015; Chizat et al., 2018). Our version allows event insertion and deletion during alignment, where these operations can only apply to an entire event: we cannot align half of an event and delete the other half. Due to these constraints, dynamic programming rather than a linear programming relaxation is needed to find the optimal transport. Xiao et al. (2017) also proposed an optimal transport distance between event sequences that allows event insertion and deletion; however, their insertion and deletion costs turn out to depend on the timing of the events in (we feel) a peculiar way.

We also gave a method to find a single consensus reconstruction with small average distance to our particles. This problem is related to the Steiner string problem (Gusfield, 1997), which is usually reduced to multiple sequence alignment (MSA) (Mount, 2004) and heuristically solved by progressive alignment construction using a guide tree (Feng & Doolittle, 1987; Larkin et al., 2007; Notredame et al., 2000) and by iterative realignment of the initial sequences with addition of new sequences to the growing MSA (Hirosawa et al., 1995; Gotoh, 1996). These methods might also be tried in our setting. For us, however, the ith event of type k is not simply a character in a finite alphabet such as {A,C,G,T} but a time that falls in the infinite set [0, T). The substitution cost between two events of type k is then their time difference.

On multiple synthetic and real-world datasets, our method turns out to be effective at inferring the ground truth sequence of unobserved events.
The improvement of particle smoothing upon particle filtering is substantial and consistent, showing the benefit of training a proposal distribution.

Acknowledgments

We are grateful to Bloomberg L.P. for enabling this work through a Ph.D. Fellowship Award to the first author. The work was also supported by an Amazon Research Award and by the National Science Foundation under Grant No. 1718846. We thank the anonymous reviewers, Hongteng Xu at Duke University, and our lab group at Johns Hopkins University's Center for Language and Speech Processing for helpful comments. We also thank NVIDIA Corporation for kindly donating two Titan X Pascal GPUs and the state of Maryland for the Maryland Advanced Research Computing Center. The first version of this work appeared on OpenReview in September 2018.

References

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Bao, G., Cassandras, C. G., Djaferis, T. E., Gandhi, A. D., and Looze, D. P. Elevator dispatchers for down-peak traffic, 1994.

Baran, I., Demaine, E. D., and Katz, D. A. Optimally adaptive integration of univariate Lipschitz functions. Algorithmica, 50(2):255-278, February 2008. URL https://link.springer.com/article/10.1007/s00453-007-9093-7.

Benamou, J.-D. Numerical resolution of an unbalanced mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis, 2003.

Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. Unbalanced optimal transport: Geometry and Kantorovich formulation. arXiv preprint arXiv:1508.05216, 2015.

Chizat, L., Peyré, G., Schmitzer, B., and Vialard, F.-X. An interpolating distance between optimal transport and Fisher-Rao metrics. Foundations of Computational Mathematics, 2018.

Chopin, N. and Singh, S. S. On particle Gibbs sampling. Bernoulli, 2015.

Crites, R. H. and Barto, A. G.
Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems, 1996.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1977.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Doucet, A. and Johansen, A. M. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 2009.

Doucet, A., Godsill, S., and Andrieu, C. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 2000.

Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., and Song, L. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

Fan, Y., Xu, J., and Shelton, C. R. Importance sampling for continuous-time Bayesian networks. Journal of Machine Learning Research, 2010.

Feng, D.-F. and Doolittle, R. F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 1987.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, 2015.

Gotoh, O. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. Journal of Molecular Biology, 1996.

Graves, A., Jaitly, N., and Mohamed, A.-R. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013.

Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. 1997.

Hawkes, A. G.
Spectra of some self-exciting and mutually exciting point processes. Biometrika, 1971.

Hirosawa, M., Totoki, Y., Hoshida, M., and Ishikawa, M. Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics, 1995.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.

Huang, Z., Xu, W., and Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.

Kalman, R. E. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960.

Kalman, R. E. and Bucy, R. S. New results in linear filtering and prediction theory. Journal of Basic Engineering, 1961.

Kantorovitch, L. On the translocation of masses. Management Science, 1958.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., et al. Clustal W and Clustal X version 2.0. Bioinformatics, 2007.

Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. In Doklady Akademii Nauk, 1965.

Levina, E. and Bickel, P. The Earth Mover's distance is the Mallows distance: Some insights from statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), 2001.

Lewis, J. A Dynamic Load Balancing Approach to the Control of Multiserver Polling Systems with Applications to Elevator System Dispatching. PhD thesis, University of Massachusetts, Amherst, 1991.

Lewis, P. A. and Shedler, G. S. Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics Quarterly, 1979.

Lin, C.-C. and Eisner, J. Neural particle smoothing for sampling from conditional sequence models.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.

Linderman, S. W., Wang, Y., and Blei, D. M. Bayesian inference for latent Hawkes processes. In Advances in Approximate Bayesian Inference Workshop, 31st Conference on Neural Information Processing Systems, 2017.

Liniger, T. J. Multivariate Hawkes processes. Diss., Eidgenössische Technische Hochschule ETH Zürich, Nr. 18403, 2009.

Listgarten, J., Neal, R. M., Roweis, S. T., and Emili, A. Multiple alignment of continuous time series. In Advances in Neural Information Processing Systems (NIPS), 2005.

Little, R. J. A. and Rubin, D. B. Statistical Analysis with Missing Data. 1987.

Liu, J. S. and Chen, R. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 1998.

McLachlan, G. and Krishnan, T. The EM Algorithm and Extensions. 2007.

Mei, H. and Eisner, J. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems (NIPS), 2017.

Mohan, K. and Pearl, J. Graphical models for processing missing data. arXiv preprint arXiv:1801.03583, 2018.

Moral, P. D. Nonlinear filtering: Interacting particle resolution. Comptes Rendus de l'Académie des Sciences, Série I, Mathématique, 1997.

Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2004.

Notredame, C., Higgins, D. G., and Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 2000.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.

Rabiner, L. R.
A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, 1989.

Rao, V. and Teh, Y. W. MCMC for continuous-time discrete-state systems. In Advances in Neural Information Processing Systems, 2012.

Rao, V. and Teh, Y. W. Fast MCMC sampling for Markov jump processes and extensions. The Journal of Machine Learning Research, 2013.

Rauch, H. E., Striebel, C. T., and Tung, F. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 1965.

Sakoe, H. and Chiba, S. A dynamic programming approach to continuous speech recognition. In Proceedings of the Seventh International Congress on Acoustics, Budapest, 1971.

Shelton, C. R., Qin, Z., and Shetty, C. Hawkes process inference with missing data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Spall, J. C. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. 2005.

Villani, C. Optimal Transport: Old and New. 2008.

Wei, G. C. G. and Tanner, M. A. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 1990.

Whong, C. FOILing NYC's taxi trip data, 2014.

Williams, R. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

Xiao, S., Farajtabar, M., Ye, X., Yan, J., Yang, X., Song, L., and Zha, H. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems (NIPS), 2017.