# Segmenting Hybrid Trajectories using Latent ODEs

Ruian Shi 1 2, Quaid Morris 1 2 3. 1University of Toronto, 2Vector Institute, Toronto, 3Memorial Sloan Kettering Cancer Center. Correspondence to: Ruian Shi. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Smooth dynamics interrupted by discontinuities are known as hybrid systems and arise commonly in nature. Latent ODEs allow for powerful representation of irregularly sampled time series but are not designed to capture trajectories arising from hybrid systems. Here, we propose the Latent Segmented ODE (LatSegODE), which uses Latent ODEs to perform reconstruction and changepoint detection within hybrid trajectories featuring jump discontinuities and switching dynamical modes. Where it is possible to train a Latent ODE on the smooth dynamical flows between discontinuities, we apply the pruned exact linear time (PELT) algorithm to detect changepoints where latent dynamics restart, thereby maximizing the joint probability of a piece-wise continuous latent dynamical representation. We propose usage of the marginal likelihood as a score function for PELT, circumventing the need for model-complexity-based penalization. The LatSegODE outperforms baselines in reconstructive and segmentation tasks, including synthetic data sets of sine waves, Lotka-Volterra dynamics, and UCI Character Trajectories.

1. Introduction

The complexity of modelling time-series data increases when accounting for discontinuous changes in dynamical behavior. As a motivating example, consider the Lotka-Volterra equations, a simplified model of predator-prey interactions. The system is described by the pair of ordinary differential equations (ODEs):

$$\frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y \tag{1}$$

where x and y are the population sizes of prey and predators, respectively.
Coefficients α, β, δ, γ describe interaction characteristics, such as the rate of encounter and the rate of successful predation per encounter. When these parameters are fixed, modelling this system from observed population trajectories is straightforward. However, external factors can perturb the system. Additional predators can suddenly be introduced via migration midway through an observed population trajectory, causing a jump discontinuity in the trajectory. The coefficients describing predator-prey interaction may also abruptly change, instantaneously changing the dynamical mode of the system.

Systems featuring smooth dynamical flows (SDFs) interrupted by discontinuities are known as hybrid systems (Van Der Schaft & Schumacher, 2000). These discontinuities can arise as discrete jumps or instantaneous switches in dynamical mode (Ackerson & Fu, 1970), shown in Figure 1 at times (a) and (b) respectively. We propose a method to model the hybrid trajectories which arise from hybrid systems.

Figure 1. A Lotka-Volterra hybrid trajectory composed of three smooth dynamical flows. The plot shows populations of predators and prey over time. At time (a), a jump discontinuity occurs. At time (b), a distributional shift in dynamical coefficients occurs.

Recently, the Latent ODE architecture (Rubanova et al., 2019) has been introduced to represent time series using latent dynamical trajectories. However, Latent ODEs are not designed to model discontinuous latent dynamics and, thus, represent hybrid trajectories poorly. Here, we propose the Latent Segmented ODE (LatSegODE), an extension of the Latent ODE explicitly designed for hybrid trajectories. Given a base model Latent ODE trained on the segments of SDFs between discontinuities, we apply the Pruned Exact Linear Time (PELT) search algorithm (Killick et al., 2012) to model hybrid trajectories as a sequence of samples from the base model, each with a different initial state.
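As a concrete illustration of a hybrid trajectory like the one in Figure 1, the following sketch simulates a Lotka-Volterra system with SciPy, introduces a jump discontinuity (predator migration) midway, and then switches the interaction coefficients. All parameter values here are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(alpha, beta, delta, gamma):
    # dx/dt = alpha*x - beta*x*y (prey x), dy/dt = delta*x*y - gamma*y (predator y)
    def rhs(t, state):
        x, y = state
        return [alpha * x - beta * x * y, delta * x * y - gamma * y]
    return rhs

# First SDF: baseline dynamics on t in [0, 7].
seg1 = solve_ivp(lotka_volterra(1.0, 0.4, 0.1, 0.4), (0.0, 7.0), [10.0, 5.0],
                 t_eval=np.linspace(0.0, 7.0, 71))

# Jump discontinuity, as at time (a): predators suddenly migrate in.
x7, y7 = seg1.y[:, -1]

# Second SDF with new coefficients, as at time (b): a switch in dynamical mode.
seg2 = solve_ivp(lotka_volterra(1.0, 0.2, 0.1, 0.6), (7.0, 14.0), [x7, y7 + 4.0],
                 t_eval=np.linspace(7.0, 14.0, 71))

# The hybrid trajectory is the concatenation of the smooth flows.
t = np.concatenate([seg1.t, seg2.t])
traj = np.concatenate([seg1.y, seg2.y], axis=1)
```

Each segment is smooth on its own; the discontinuity exists only at the boundary between them, which is exactly the structure the LatSegODE is designed to detect.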
The LatSegODE detects the positions where the latent ODE dynamics are restarted with a new initial state, thus modelling hybrid trajectories using a piece-wise continuous latent trajectory. We provide a novel way to use deep architectures in conjunction with offline changepoint detection (CPD) methods. Using the marginal likelihood under the Latent ODE as a score function, we find that the Bayesian Occam's Razor (MacKay, 1992) effect automatically prevents over-segmentation in CPD methods.

We evaluate the LatSegODE on data sets of 1D sine wave hybrid trajectories, Lotka-Volterra hybrid trajectories, and a synthetically composed UCI Character Trajectories data set. We demonstrate that the LatSegODE interpolates, extrapolates, and finds the changepoints in hybrid trajectories with high accuracy compared to current baseline methods.

2. Background

2.1. Latent ODEs

The Latent ODE architecture (Rubanova et al., 2019) is an extension of the Neural ODE method (Chen et al., 2018), which provides memory-efficient gradient computation without back-propagation through ODE solve operations. Neural ODEs represent trajectories as the solution to the initial value problem:

$$\frac{dh(t)}{dt} = f_\theta(h(t), t) \tag{2}$$

$$h_{0:N} = \mathrm{ODESolve}(f_\theta, h_0, t_{0:N}) \tag{3}$$

where f_θ is parameterized by a neural network, and h(t) represents hidden dynamics. The continuous dynamical representation allows Neural ODEs to natively incorporate irregularly sampled time series.

Latent ODEs arrange Neural ODEs in an encoder-decoder architecture. Observed trajectories are encoded using a GRU-ODE architecture (Brouwer et al., 2019; Rubanova et al., 2019). The GRU-ODE combines a Neural ODE with a gated recurrent unit (GRU) (Cho et al., 2014). Observed trajectories are encoded by the GRU into a hidden state, which is continuously evolved between observations by a Neural ODE parameterized by neural network f_θ.
The GRU-ODE encodes the observed data sequence into parameters for a variational posterior. Using the reparameterization trick (Kingma & Welling, 2014), a differentiable sample of the latent initial state z_0 is obtained. A Neural ODE parameterized by neural network f_Ψ deterministically solves a latent trajectory from the latent initial state. Finally, a neural network f_Φ decodes the latent trajectory into data space. The Latent ODE architecture can thus be represented as:

$$\mu_{z_0}, \sigma^2_{z_0} = \mathrm{GRUODE}_{f_\theta}(x_{1:N}, t_{1:N}) \tag{4}$$

$$z_0 \sim q(z_0 \mid x_{1:N}) = \mathcal{N}(\mu_{z_0}, \sigma^2_{z_0}) \tag{5}$$

$$z_{1:N} = \mathrm{ODESolve}(f_\Psi, z_0, t_{1:N}) \tag{6}$$

$$x_i \sim \mathcal{N}(f_\Phi(z_i), \sigma^2) \quad \text{for } i = 1, \ldots, N \tag{7}$$

where σ² is a fixed variance term. The Latent ODE is trained by maximizing the evidence lower bound (ELBO). Letting X = x_{1:N}, the ELBO is:

$$\mathbb{E}_{z_0 \sim q(z_0 \mid X)}\left[\log p(X \mid z_0)\right] - \mathrm{KL}\left[q(z_0 \mid X) \,\|\, p(z_0)\right] \tag{8}$$

2.2. Representational Limitations of the Neural ODE

Latent ODEs use Neural ODEs to represent latent dynamics, and thus inherit their representational limitations. The accuracy of an ODE solver used by a Neural ODE depends on the smoothness of the solution; the local error of the solution can exceed ODE solver tolerances when a jump discontinuity occurs (Calvo et al., 2008). At a jump, adaptive ODE solvers will continuously reduce step size in response to increased error, possibly until numerical underflow occurs. Even if integration is possible across the jump, it is slow, and the global error of the solution can be adversely affected (Calvo et al., 2003). Typically, these issues can be easily avoided by restarting ODE solutions at the discontinuity, but this requires these positions to be known. Classical methods use the increase in local error or adaptive rejections associated with a jump discontinuity as criteria to restart solutions (Calvo et al., 2008). Recently, Neural Event ODEs (Chen et al., 2020) use a similar paradigm of discontinuity detection, using an event function parameterized by a neural network to detect locations to restart the ODE solution.
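The restart-at-discontinuity strategy described above can be demonstrated with a small SciPy experiment (an illustration of the general numerical issue, not an experiment from the paper): integrating across a jump in the vector field in a single call forces the adaptive solver to shrink its steps near the discontinuity, while restarting at the known jump keeps each sub-problem smooth.

```python
from scipy.integrate import solve_ivp

def rhs(t, y):
    # Piecewise-constant vector field with a jump at t = 1.
    return [1.0 if t < 1.0 else -1.0]

# Single solve across the jump: the adaptive step-size controller reacts
# to the discontinuity by rejecting and shrinking steps near t = 1.
whole = solve_ivp(rhs, (0.0, 2.0), [0.0], rtol=1e-10, atol=1e-12)

# Restarted solve: each sub-interval is smooth, so integration is cheap
# and the endpoint values y(1) = 1 and y(2) = 0 are recovered accurately.
left = solve_ivp(rhs, (0.0, 1.0), [0.0], rtol=1e-10, atol=1e-12)
right = solve_ivp(rhs, (1.0, 2.0), left.y[:, -1], rtol=1e-10, atol=1e-12)
```

On examples like this, the single solve typically spends several times more right-hand-side evaluations (`whole.nfev`) than the two restarted solves combined, which is precisely why restart positions are valuable when they are known.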
With all event detection approaches, failure to accurately detect a jump discontinuity causes the local error bound to degrade to a lower order (Stewart, 2011). Hybrid trajectories with discontinuous change in the dynamical coefficients present different but still hard modeling challenges due to the representational limitations of Neural ODEs. Latent ODEs do not circumvent these limitations, and cannot generalize to hybrid trajectories. When a hybrid trajectory is encountered, the Latent ODE can only encode the exact sequence of SDFs into a single latent representation. Should a permutation of these SDFs arise at test time, the Latent ODE will not be able to reconstruct the test trajectory.

The LatSegODE detects positions of jump discontinuity or switching dynamical mode by representing a hybrid trajectory as a piece-wise combination of samples from a learned base model Latent ODE. At each changepoint, the latent dynamics of the base model are restarted from a new initial state. We apply the PELT algorithm to efficiently search through all possible positions to restart ODE dynamics, and return changepoints that correspond to the positions of restart which maximize the joint probability of a hybrid trajectory. This avoids the need to train an event detector, and guarantees optimal segmentation, but the LatSegODE requires the availability of a training data set of SDFs on which the base model can be trained.

Figure 2. Schematic of the LatSegODE reconstructing a hybrid trajectory. Arrows indicate computation flow. Data in each segment is encoded into parameters for the variational posterior, from which a latent initial state is sampled. Each latent segment is solved using the shared latent dynamics f_Ψ, which continue until the next point of change. The latent trajectory is decoded into data space. At evaluation time, an arbitrary number of changepoints can be detected by the PELT algorithm.
Plot adapted from (Rubanova et al., 2019).

3.1. Extension to Hybrid Trajectories

We first define the class of hybrid trajectories which can be represented by the LatSegODE. Consider a sequential series of data X = x_1, x_2, ..., x_N and associated times of observation T = t_1, t_2, ..., t_N. We represent a hybrid trajectory as a piece-wise sequence of C continuous dynamical segments. Each observed data point can only belong to a single segment. Each segment is bounded by starting index s_i and ending index e_i, where 0 ≤ i ≤ C, s_0 = 1, and e_C = N. Segments are sequential and do not intersect, i.e., s_{i+1} = e_i + 1. The boundaries of segments represent locations of jump discontinuity or switch in dynamical mode. The trajectory within each segment is represented by a sample from the base model Latent ODE.

The LatSegODE can be applied to hybrid trajectories containing an unknown number and order of SDFs. The LatSegODE aims to approximate each SDF using a segment. Using offline CPD, the LatSegODE detects positions of jump discontinuity or switching dynamical mode, and introduces a latent discontinuity at the timepoint indexed by s_i. At these timepoints, the latent dynamics are restarted from a new latent initial condition z_{0i}, which is obtained from the Latent ODE encoder network acting on segment data points x_{s_i:e_i}. The latent dynamics are solved using the same latent Neural ODE parameterized by f_Ψ. We provide a schematic visualizing LatSegODE hybrid trajectory reconstruction in Figure 2. The example hybrid trajectory is represented by a sequence of base model Latent ODE reconstructions, each starting from a new initial latent state which can discontinuously jump from the previous dynamic. An arbitrary number of restarts can be detected at test time.
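The piece-wise representation can be seen in a toy setting, with an assumed closed-form exponential-decay flow standing in for the learned latent ODE and the segment's first observation standing in for the encoder's inferred initial state. Restarting the shared dynamics at each changepoint reproduces a trajectory that a single continuous solve cannot.

```python
import numpy as np

def piecewise_reconstruct(t, x_obs, changepoints, rate=-0.5):
    # Within each segment, the shared flow dx/dt = rate * x restarts from
    # a fresh initial state (here, the segment's first observation). In the
    # LatSegODE, this initial state would instead be a latent z0 inferred
    # by the encoder and evolved by the learned latent Neural ODE.
    bounds = [0] + list(changepoints) + [len(t)]
    recon = np.empty_like(x_obs)
    for s, e in zip(bounds[:-1], bounds[1:]):
        recon[s:e] = x_obs[s] * np.exp(rate * (t[s:e] - t[s]))  # closed-form solve
    return recon

# A hybrid trajectory: two exponential-decay SDFs with a jump at index 5.
t = np.linspace(0.0, 4.0, 9)
x = np.concatenate([2.0 * np.exp(-0.5 * t[:5]),
                    5.0 * np.exp(-0.5 * (t[5:] - t[5]))])

with_restart = piecewise_reconstruct(t, x, changepoints=[5])
no_restart = piecewise_reconstruct(t, x, changepoints=[])
```

With the restart at index 5 the reconstruction matches the data exactly; without it, a single flow from the first observation cannot track the post-jump segment.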
To finish the problem formulation, we define I as the unknown ground-truth set of segment boundaries and latent initial states, such that each hybrid trajectory is associated with the set:

$$I = \left\{ (s_i, e_i, z_{0i}) : 0 \le i \le C \right\} \tag{9}$$

Where Z = z_{0:C}, the joint log probability of an observed hybrid trajectory can be represented as:

$$\log p(X, Z \mid s_{1:C}, e_{1:C}) = \sum_{i=0}^{C} \log p(x_{s_i:e_i}, z_{0i}) \tag{10}$$

This formulation assumes independence between observations in separate segments, such that $x_{s_i:e_i} \perp (X \setminus x_{s_i:e_i})$. While this assumption can be limiting in trajectories with long-term dependencies, it also allows for increased reconstruction performance in the absence of inter-segment dependency. In these situations, given a trajectory with two dynamical modes, allowing latent dynamics to completely restart at the time of modal change allows for a better representation. In comparison, methods which cannot account for shifts in latent dynamics will be forced to adopt an averaged representation between the two dynamical modes. This intuition is later demonstrated in the experimental section. We note that the LatSegODE does not represent the location of changepoints using a random process. Since event detection is non-probabilistic, the method is not suitable for hybrid trajectories which self-excite or otherwise change dynamical mode past the observed trajectory.

3.2. Optimal Segmentation

Given this formulation of hybrid trajectories, the key challenge is finding the unknown set I which maximizes the joint probability of an observed hybrid trajectory. We propose the application of optimized search algorithms from the field of offline changepoint detection (CPD) to recover locations of jump discontinuity and switches in dynamical mode, and consequently I. Through complexity penalization, these search algorithms can automatically determine the optimal number and location of segments without prior specification. Offline CPD methods attempt to discover changepoints which define segment boundaries.
A combination of segments which reconstructs a trajectory is referred to as a segmentation. We allow each observed timepoint to be a potential changepoint. Thus, the space of all possible segmentations is formed by all combinations of an arbitrary number of changepoints. At either extreme, placing no changepoints or a changepoint at each time of observation are both valid segmentations. The space of all possible segmentations grows exponentially (2^N) with the number of observations (N).

The optimal partitioning method (Jackson et al., 2005) uses dynamic programming to search through this large space of solutions. Where C is a cost function, m is the number of changepoints, and τ is a set of changepoints such that τ_0 = 0 and τ_{m+1} = n, it minimizes

$$\sum_{i=1}^{m+1} \left[ C(x_{\tau_{i-1}+1:\tau_i}) + \beta \right] \tag{11}$$

with respect to τ using dynamic programming. Of all possible segmentations up to data index t, we let F(t) represent the one which results in the minimal cost. This result is memoized. For a new data index s > t, we can extend the optimal solution via the recursion

$$F(s) = \min_{t} \left[ F(t) + C(x_{(t+1):s}) + \beta \right] \tag{12}$$

Thus, we begin by solving for F(1), and incrementally extend the solution until F(N), at which point the optimal segmentation is returned. The memoization of previous optimal sub-solutions allows a quadratic runtime with respect to the number of observations. The full algorithm is provided in Appendix A.

The β term penalizes over-segmentation, and typically scales with the number of parameters introduced by each additional changepoint. When a maximum likelihood cost function is used without a β penalty, optimal partitioning degenerates by placing a changepoint at each possible index. The presence of β enforces a trade-off between accuracy and model complexity.
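The recursion of equations (11)-(12) can be sketched in a few lines of Python. Here an assumed squared-error cost stands in for the paper's marginal-likelihood score, purely to make the search concrete.

```python
import math

def optimal_partition(x, cost, beta):
    # Exact search over segmentations by dynamic programming (eqs. 11-12):
    # F[s] holds the minimal penalized cost of segmenting the prefix x[:s].
    n = len(x)
    F = [0.0] + [math.inf] * n
    last = [0] * (n + 1)
    for s in range(1, n + 1):
        for t in range(s):
            cand = F[t] + cost(x[t:s]) + beta
            if cand < F[s]:
                F[s], last[s] = cand, t
    # Backtrack segment boundaries from the memoized predecessors.
    cps, s = [], n
    while s > 0:
        cps.append(s)
        s = last[s]
    return F[n], sorted(cps)

def sse(seg):
    # Toy cost: within-segment squared deviation from the segment mean.
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

score, cps = optimal_partition([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], sse, beta=0.5)
# cps marks segment end indexes; the level shift is recovered at index 3.
```

The double loop makes the quadratic runtime explicit: each new index s re-scores every candidate predecessor t against the memoized F[t].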
With an appropriate β, this formulation also conveniently recovers the segmentation which minimizes the Bayesian Information Criterion (BIC) (Schwarz et al., 1978) through minimization of equation (11). The choice of β is a key challenge in using CPD methods with deep architectures. It is not always clear how many effective parameters are introduced by each additional segment, though this number is upper bounded by the dimensionality of the latent initial state. Additionally, the theoretical assumptions required by the BIC are violated by neural network architectures (Watanabe, 2013).

The LatSegODE circumvents these challenges by using the marginal likelihood under the Latent ODE as the score function for each segment. We compute a Monte Carlo estimate of the marginal likelihood by importance sampling, using a variational approximation to the posterior over the initial state:

$$\log p(x_{s:e}) = \log \int p(x_{s:e} \mid z_0)\, p(z_0)\, dz_0 \tag{13}$$

$$= \log \mathbb{E}_{z_0 \sim q(z_0 \mid x_{s:e})} \left[ p(x_{s:e} \mid z_0) \frac{p(z_0)}{q(z_0 \mid x_{s:e})} \right] \tag{14}$$

$$\approx \log \frac{1}{M} \sum_{j=1}^{M} \mathcal{N}(x_{s:e} \mid \hat{x}_{s:e}^{(j)}, \sigma^2) \, \frac{\mathcal{N}(z_{0j} \mid 0, 1)}{\mathcal{N}(z_{0j} \mid \mu_{z_0}, \sigma^2_{z_0})} \tag{15}$$

where $\hat{x}_{s:e}^{(j)}$ is the output of the Latent ODE base model decoded from $z_{0j}$, $\mu_{z_0}, \sigma^2_{z_0}$ are obtained by the GRU-ODE encoder, and $z_{0j}$ is sampled from $\mathcal{N}(\mu_{z_0}, \sigma^2_{z_0})$. The variance σ² is fixed, and set to the same value used to compute the ELBO during training. We take M samples for the Monte Carlo estimate.

Because we use the marginal likelihood, the complexity of the recovered segmentation is implicitly regularized by the Bayesian Occam's Razor (MacKay, 1992). Reflecting this, in our experiments, we show that the penalization term β can be set to 0 without over-segmentation. Thus, we can simply set C in equation (11) to be the marginal likelihood computed by equation (15), and solve for the set of changepoints τ which maximizes the joint probability of the entire trajectory using optimal partitioning (the original objective is a minimization, but this can trivially be switched to maximization).
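The estimator in equations (13)-(15) can be checked on a toy linear-Gaussian model where the marginal likelihood is available in closed form; all toy values below (proposal mean and scale, observation noise) are assumptions for illustration, with an identity map standing in for the decoder.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def log_marginal_is(x, decoder_mean, mu_q, sigma_q, sigma_x, M=4000):
    # Importance-sampling estimate of log p(x), as in eqs. (13)-(15):
    # sample z0 from the variational posterior q(z0|x) = N(mu_q, sigma_q^2),
    # score the reconstruction under N(x | decoder(z0), sigma_x^2), and
    # reweight by the prior/posterior ratio N(z0|0,1) / q(z0|x).
    z0 = mu_q + sigma_q * rng.standard_normal(M)
    log_terms = (norm.logpdf(x, loc=decoder_mean(z0), scale=sigma_x)
                 + norm.logpdf(z0, 0.0, 1.0)
                 - norm.logpdf(z0, mu_q, sigma_q))
    m = log_terms.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_terms - m)))

# Toy check: z0 ~ N(0,1) and x | z0 ~ N(z0, sigma_x^2)
# imply the exact marginal x ~ N(0, 1 + sigma_x^2).
sigma_x, x = 0.5, 0.8
est = log_marginal_is(x, lambda z: z, mu_q=0.6, sigma_q=0.7, sigma_x=sigma_x)
exact = norm.logpdf(x, 0.0, np.sqrt(1.0 + sigma_x ** 2))
```

Because the proposal covers the true posterior well here, a few thousand samples suffice; with a learned encoder the same log-mean-exp form applies per segment.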
The quadratic runtime of optimal partitioning can be reduced to between O(N) and O(N²) by the pruned exact linear time (PELT) algorithm (Killick et al., 2012). Using an identical search algorithm, PELT introduces a pruning condition which allows the removal of sub-solutions from consideration. Given the existence of K such that for all changepoint indexes s, t, T with t < s < T:

$$C(x_{(t+1):s}) + C(x_{(s+1):T}) + K \le C(x_{(t+1):T}) \tag{16}$$

then if

$$F(t) + C(x_{(t+1):s}) + K \ge F(s) \tag{17}$$

we are able to discard the changepoint t from future consideration, asymptotically reducing the number of operations required.

Due to noise in the estimates of the score function, finding an analytic method to determine K is an area for further research. If K is set too low, sub-optimal solutions are recovered. In practice, this issue is not limiting, as setting K to a sufficiently high value allows for near-optimal solutions at the cost of higher runtime. This trade-off is documented in Appendix B. The computation of F(t), the optimal segmentation up to length t, and the Monte Carlo estimate of the marginal likelihood can all be batch parallelized using GPU computation. An implementation is available at: https://github.com/IanShi1996/LatentSegmentedODE.

3.3. When can I use this method?

The LatSegODE requires a Latent ODE base model trained on a family of SDFs. We propose two scenarios where SDFs may be available. First, the LatSegODE is applicable when a training set of hybrid trajectories with labelled changepoints exists. In this case, given a training set of N hybrid trajectories $X = \{(x^{(i)}, t^{(i)})\}_{i=1}^{N}$, each with C labelled SDF boundaries $\{(s_k^{(i)}, e_k^{(i)})\}_{k=0}^{C}$, we treat each $x_{s_k:e_k}$ as an independent training trajectory, and train on the union of all SDFs. The LatSegODE can also be applied when physical simulation is available.
In these scenarios, the base model can be trained on trajectories simulated over the range of dynamical modes which we expect in hybrid trajectories at test time. These two use cases are illustrated in the first two experiments.

4. Related work

Switching Dynamical Systems: Hybrid trajectories have previously been modelled as Switching Linear Dynamical Systems (SLDS). We provide a non-exhaustive summary of these methods. Typically, trajectories are represented by a Bayesian network containing a sequence of latent variables, from which observations are emitted. Latent variables are updated linearly, while a higher order of latent variable represents the current dynamical mode. Structured VAEs (Johnson et al., 2016) introduce a discrete latent variable to control dynamical mode, and use a VAE observation model. GP-HSMMs (Nakamura et al., 2017) use a Gaussian Process observation model within a hidden semi-Markov model. Kalman VAEs integrate a Kalman Filter with a VAE observation model (Fraccaro et al., 2017). Models in this class are generally trained via an inference procedure (Dong et al., 2020), while several are fully differentiable (Kipf et al., 2019). These methods are unsupervised, requiring no training data with labelled changepoint locations. In contrast, the LatSegODE requires a base model to be trained on SDFs. It does not model dependency between segments, unlike methods such as rSLDS (Linderman et al., 2017). At evaluation time, the LatSegODE operates without specification of the number of segments or dynamical modes. This is an advantage compared to previously discussed works, where performance is sensitive to these hyperparameters (Dong et al., 2020).

The Neural Event ODE (Chen et al., 2020) is closely related to the LatSegODE. It represents observed dynamics using a Neural ODE and trains a neural network to detect the positions and update values of a switching dynamical system.
The Neural Event ODE can be trained in an unsupervised fashion, without prior knowledge of changepoint locations in training data. When extrapolating past observed data, it is able to introduce additional changepoints, which the LatSegODE cannot model. However, the Neural Event ODE inherits the same limitations as the Neural ODE: it cannot model a data set which cannot be described by a single ODE function in data space. So, for example, two different dynamics cannot start from the same observed point. This issue is elaborated in Appendix C. The LatSegODE circumvents these limitations by modelling the data using an ODE in latent space.

Offline Changepoint Detection: The LatSegODE closely relates to offline CPD, and we refer to Truong et al. (2020) for an in-depth review. The LatSegODE leverages search algorithms from offline CPD, but represents the behavior within segments using a complex generative model, as opposed to a simple statistical cost function. The use of the Latent ODE allows for higher representational power and extrapolation/interpolation within segments. However, training data is required to fit the base model and, as such, its total runtime is significantly higher. Other methods have incorporated deep architectures with CPD search methods (Lee et al., 2018), but use a sliding-window search with a predefined window size, and use a feature distance metric to determine boundaries, as opposed to the marginal likelihood used by the LatSegODE.

Miscellaneous: A distantly related class of methods classifies individual observations into class labels, which can be seen as segmentation (Supratak et al., 2017). These approaches are distinct as they do not explicitly model dynamics, and require a fixed segment size and trajectory length, a limitation which the LatSegODE does not have.
The LatSegODE does not treat positions of jump discontinuity or switching dynamical mode as random variables, unlike methods that model these jumps as a random process (Mei & Eisner, 2017; Jia & Benson, 2019).

5. Experiments

Here we investigate the LatSegODE's ability to simultaneously perform accurate reconstruction and segmentation on synthetic and semi-synthetic data sets. When training the base model, we mask observations from the last 20% of the timepoints and 25% of internal timepoints; this 25% is shared across all training and test examples. When evaluating the model on the test set, we use the 55% of unmasked timepoints to infer the initial states and perform segmentation, and then attempt to reconstruct the observations at the masked timepoints. We report the mean squared error (MSE) between ground truth and predicted observations on test trajectories.

We benchmark against auto-regressive and vanilla Latent ODE baselines for reconstructive tasks. We augment the input data for the vanilla Latent ODE with a binary-valued time series denoting changepoint positions. This ensures it has access to the same information as the LatSegODE. We report performance on an extrapolation region which assumes the last observed dynamical mode continues. We attempted to benchmark against Neural ODEs and Neural Event ODEs, but found that their training did not converge on any of our benchmarks (see Appendix C).

We benchmark the segmentation performance of the LatSegODE against classic CPD algorithms using Gaussian kernelized mean change (Arlot et al., 2019), auto-regressive (Bai et al., 2000), and Gaussian Process (Lavielle & Teyssiere, 2006) cost functions. These are denoted RPT-RBF, RPT-AR, and RPT-NORM respectively. Segmentation performance is measured using the Rand Index (Rand, 1971), the Hausdorff metric (Rockafellar & Wets, 2009), and the F1 score.
The Rand Index measures the overlap between the predicted segmentation and the ground truth segmentation. Given data points x_{1:N}, a membership matrix A is defined such that A_{ij} = 1 if x_i and x_j are in the same segment. Otherwise, A_{ij} = 0. Membership matrices are generated for the ground truth segmentation (A) and the predicted segmentation (A'). Using these two matrices, the Rand Index is calculated as:

$$\mathrm{RI} = \frac{\sum_{i<j} \mathbb{1}\left[A_{ij} = A'_{ij}\right]}{\binom{N}{2}}$$
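A direct implementation of this membership-matrix comparison might look as follows (pure Python, written for clarity rather than speed; the pairwise form below is equivalent to comparing the upper triangles of A and A').

```python
import itertools

def rand_index(true_cps, pred_cps, n):
    # Changepoints are segment end indexes (exclusive). Each observation's
    # segment label is recovered first; the membership-matrix entries
    # A[i][j] = (label[i] == label[j]) are then compared over all pairs.
    def labels(cps):
        lab, seg = [], 0
        for i in range(n):
            lab.append(seg)
            if seg < len(cps) and i + 1 == cps[seg]:
                seg += 1
        return lab
    la, lb = labels(sorted(true_cps)), labels(sorted(pred_cps))
    agree = sum((la[i] == la[j]) == (lb[i] == lb[j])
                for i, j in itertools.combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)
```

For example, a perfect prediction scores 1.0, while shifting one boundary of a six-point trajectory by a single index drops the score to 10/15, since five of the fifteen pairs change their same-segment status.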