# Graph Switching Dynamical Systems

Yongtuo Liu¹, Sara Magliacane¹,², Miltiadis Kofinas¹, Efstratios Gavves¹

¹University of Amsterdam ²MIT-IBM Watson AI Lab

**Abstract.** Dynamical systems with complex behaviours, e.g. immune system cells interacting with a pathogen, are commonly modelled by splitting the behaviour into different regimes, or modes, each with simpler dynamics, and then learning the switching behaviour from one mode to another. Switching Dynamical Systems (SDS) are a powerful tool that automatically discovers these modes and mode-switching behaviour from time series data. While effective, these methods focus on independent objects, where the modes of one object are independent of the modes of the other objects. In this paper, we focus on the more general interacting-object setting for switching dynamical systems, where the per-object dynamics also depends on an unknown and dynamically changing subset of other objects and their modes. To this end, we propose a novel graph-based approach for switching dynamical systems, GRAph Switching dynamical Systems (GRASS), in which we use a dynamic graph to characterize interactions between objects and learn both intra-object and inter-object mode-switching behaviour. We introduce two new datasets for this setting, a synthesized ODE-driven Particle dataset and a real-world Salsa Couple Dancing dataset. Experiments show that GRASS consistently outperforms previous state-of-the-art methods.

## 1. Introduction

Complex time series are pervasive both in daily life and scientific research, usually consisting of sophisticated behaviours and interactions between entities or objects (Pavlovic et al., 2000; Shi et al., 2021). Consider for example emotion contagion in a crowd and how it might affect the crowd dynamics (Xu et al., 2021), or the differentiation of T cells, a crucial type of immune cell, into different subtypes with different roles after interacting with certain pathogens.

A common way of modelling complex behaviour, e.g. represented by a discontinuous function, is by considering it as a sequence of simpler modes, e.g. represented by a set of smooth functions. For example, the behaviour of a ball bouncing on the floor can be represented by two simple modes of falling and bouncing back. In many cases, the challenge is to identify the mode at each time point based on observations. The state-of-the-art approaches for this task are Switching Linear Dynamical Systems (SLDS) (Ackerson & Fu, 1970; Ghahramani & Hinton, 2000; Oh et al., 2005) and their non-linear extensions, e.g. Switching Non-linear Dynamical Systems (SNLDS) (Dong et al., 2020) and REDSDS (Ansari et al., 2021). While effective, these approaches either model the mode of a single object, including modelling different objects as a super object (Dong et al., 2020; Glaser et al., 2020), or assume independent objects, i.e. they model the mode of each object as independent from the others, e.g. the dancing bees in (Ansari et al., 2021). In this paper, we focus on the more general setting in which there are multiple interacting objects, and in which the mode of an object can be influenced by the modes of the other objects. This is a more realistic setting for modelling many real-world systems, from crowds of people, to groups of immune cells, to swarms.
For this setting, we propose GRAph Switching dynamical Systems (GRASS) (described at a high level in Fig. 1), a framework that learns a dynamic graph to model interactions between objects and their modes across time, and can be combined with previously developed independent-object switching dynamical systems methods. To evaluate this new setting, we also propose two new datasets for benchmarking interacting-object systems: a synthetic ODE-driven Particle dataset, and a Salsa Couple Dancing dataset, inspired by real-world benchmarks (Dong et al., 2020). Experiments show that GRASS outperforms the baselines and identifies mode-switching behaviours with higher accuracy and fewer switching errors.

*Figure 1. Illustration of Graph Switching Dynamical Systems (GRASS). As opposed to independent-object Switching Dynamical Systems, where objects are processed independently, Graph Switching Dynamical Systems discover modes and mode-switching behaviours that can depend on object interactions. Interactions are modelled by a latent dynamic graph, which is inferred jointly with the other variables by maximizing the evidence lower bound. Activated interaction edges and mode switching are highlighted with red arrows, while inactive edges (no interactions) are visualized with grayed-out dashed lines in the interaction graph at each timestep.*

## 2. Multi-Object Switching Dynamical Systems

We start from a collection of time series of observations $y := y^{1:N}_{1:T}$ for $T$ time steps and $N$ objects. The $N$ objects move and their motions can be categorized into one out of $K$ possible modes. For instance, an object might be moving in a spiral trajectory (mode 1) or it might be bouncing off a wall (mode 2). The $N$ objects interact with each other, and their motions change according to these interactions. For instance, after a collision, an object might switch from a spiral to a sinusoidal motion. The dynamics of these objects are governed by three types of variables: mode variables, count variables and state variables.

Mode variables are categorical variables $z := z^{1:N}_{1:T} = \{z^1_t, \dots, z^N_t\}_{t=1}^{T}$, where $z^n_t \in \{0, \dots, K-1\}$ denotes the mode for each time step $t \in (1, \dots, T)$ and each object $n \in (1, \dots, N)$. For instance, $z^{n=2}_{t=10} = 3$ and $z^{n=5}_{t=10} = 4$ mean that, at time step 10, the second object moves according to the third dynamic mode (for instance a spiral trajectory), while the fifth object moves according to the fourth dynamic mode (for instance a sinusoidal trajectory).

Count variables are categorical variables $c := c^{1:N}_{1:T} = \{c^1_t, \dots, c^N_t\}_{t=1}^{T}$, where each $c^n_t \in (1, \dots, M)$ explicitly models the durations between switching modes for each object $n$ and each timestep $t$, and $M$ is the maximum number of steps before a switch. These variables help us avoid frequent mode switching, caused by the fact that durations typically follow a geometric distribution, biasing unfavourably towards shorter durations (Ansari et al., 2021).

State variables are continuous variables $x := x^{1:N}_{1:T} = \{x^1_t, \dots, x^N_t\}_{t=1}^{T}$, where each $x^n_t \in \mathbb{R}^d$ encodes the dynamics content per object and time step. For instance, at time step $t$, $x^n_t$ could encode the position and velocity of the trajectory of the $n$-th object.
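To make the notation concrete, the following is a minimal sketch of the three variable types as tensors; the dimensions (`N`, `T`, `K`, `M`, `d`) are illustrative placeholders, not the values used in our experiments.

```python
import torch

# Illustrative dimensions (hypothetical, not the experimental settings):
N, T, K, M, d = 3, 100, 3, 50, 4   # objects, time steps, modes, max count, state dim

# Mode variables z: one categorical mode per object and time step.
z = torch.randint(0, K, (N, T))       # z[n, t] in {0, ..., K-1}
# Count variables c: steps remaining until the next possible switch, in {1, ..., M}.
c = torch.randint(1, M + 1, (N, T))
# Continuous state variables x: dynamics content per object and time step,
# e.g. the position and velocity of the n-th object's trajectory.
x = torch.randn(N, T, d)
# Observations y, here given the same dimensionality as x for simplicity.
y = torch.randn(N, T, d)
```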
### 2.1. Interactions between all objects

We formulate a probabilistic graphical model to describe our system of multiple interacting objects. We first start with a formulation in which the modes of each object are affected by the modes of all other objects. Then we extend our system with a dynamic graph, with which we can learn at which time steps interactions exist and between which objects, as described in Section 3, with the corresponding inference described in Section 4.1.

Assuming Markovian dynamics and extending the standard Switching Dynamical Systems paradigm (Linderman et al., 2016; Ansari et al., 2021) to the case of $N$ objects, we assume the joint probability distribution is

$$
p(y, x, z, c) = \prod_{n=1}^{N} \underbrace{p(y^n_1 \mid x^n_1)\, p(x^n_1 \mid z^n_1)\, p(z^n_1)}_{\text{Initial States}} \;\prod_{t=2}^{T} \underbrace{p(z^{1:N}_t \mid z^{1:N}_{t-1}, x^{1:N}_{t-1}, c^{1:N}_t)}_{\text{Interacting Modes}} \prod_{n=1}^{N} \underbrace{p(y^n_t \mid x^n_t)\, p(x^n_t \mid x^n_{t-1}, z^n_t)\, p(c^n_t \mid z^n_{t-1}, c^n_{t-1})}_{\text{Per-object dynamics}} \tag{1}
$$

We start by describing the per-object dynamics. In this case, we model for each object $n$ an observation probability $p(y^n_t \mid x^n_t)$, a state transition probability $p(x^n_t \mid x^n_{t-1}, z^n_t)$ and a count transition probability $p(c^n_t \mid z^n_{t-1}, c^n_{t-1})$. The observation probability $p(y^n_t \mid x^n_t)$ models how the continuous state variables for this object $x^n_t$ map into the observations $y^n_t$. The state transition probability $p(x^n_t \mid x^n_{t-1}, z^n_t)$ models how the continuous state variables at time $t$ are influenced by their previous values at time $t-1$, conditioned on the mode variable for this object $z^n_t$. The count transition probability $p(c^n_t \mid z^n_{t-1}, c^n_{t-1})$ models how the count variables at time $t$ depend on their previous values at time $t-1$ and on the mode for this object at the previous time step $z^n_{t-1}$. The initial states have a similar setup, but in this case the state transition probability does not have an input from the previous timestep and the count variables are initialized at 1.

The mode transition probability $p(z^{1:N}_t \mid z^{1:N}_{t-1}, x^{1:N}_{t-1}, c^{1:N}_t)$ models how the modes of objects are affected by the modes of all other objects $z^{1:N}_{t-1}$, conditioned on the state variables $x^{1:N}_{t-1}$ and count variables $c^{1:N}_t$. In the absence of any knowledge on what interactions take place, this probability considers that all objects may potentially influence all other objects. In Eq. (1), except for the mode transition probability in the Interacting Modes term, all other terms $p(y^n_1 \mid x^n_1)$, $p(x^n_1 \mid z^n_1)$, $p(z^n_1)$, $p(y^n_t \mid x^n_t)$, $p(x^n_t \mid x^n_{t-1}, z^n_t)$, $p(c^n_t \mid z^n_{t-1}, c^n_{t-1})$ are factorized per object and are thus similar to independent-object dynamical systems that treat all $N$ objects independently. We refer to (Dong et al., 2020; Ansari et al., 2021) for details.

### 2.2. Learning an amortized transition dynamics

To simplify the modelling of switching dynamics, we assume that the current dynamics for each object at time $t$ is independent from other objects given the complete latent state at $t-1$. By further adopting a mixture representation for the marginal transition probabilities (Raftery, 1985; Saul & Jordan, 1999), we assume we can explicitly model pairwise mode-to-mode and object-to-object effects:

$$
p(z^{1:N}_t \mid z^{1:N}_{t-1}, x^{1:N}_{t-1}, c^{1:N}_t) = \prod_{n=1}^{N} p(z^n_t \mid z^{1:N}_{t-1}, x^{1:N}_{t-1}, c^n_t) = \prod_{n=1}^{N} \sum_{m=1}^{N} w^{m\to n}_t\, p(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t), \tag{2}
$$

where $x^{m,n}_{t-1} = f_e(x^m_{t-1}, x^n_{t-1})$ is a representation that aggregates the continuous states of objects $m$ and $n$, for instance by concatenation, and $w^{m\to n}_t$ is the local dynamic factor for objects $m$ and $n$, which satisfies $w^{m\to n}_t \geq 0$ and $\sum_{m=1}^{N} w^{m\to n}_t = 1$. This mixture assumption implies that the dynamics of object $n$ at time $t$ depend only on pairwise interactions with the objects $m = 1, \dots, N$ at time $t-1$, ignoring higher-order interactions.
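As a minimal sketch of Eq. (2), the mixture over senders can be written as a single tensor contraction; the tensors below are random placeholders standing in for the per-pair transition probabilities and the local dynamic factors.

```python
import torch

def mixture_mode_transition(trans_nm, w):
    """Pairwise mixture mode transition, cf. Eq. (2).

    trans_nm: (N, N, K) tensor, where trans_nm[n, m] is the categorical
              distribution p(z_t^n | z_{t-1}^m, x_{t-1}^{m,n}, c_t^n).
    w:        (N, N) local dynamic factors; w[n] sums to 1 over the senders m.
    Returns an (N, K) tensor: the mixed distribution over z_t^n per object n.
    """
    # p(z_t^n | ...) = sum_m w[m -> n] * p(z_t^n | z_{t-1}^m, x^{m,n}, c_t^n)
    return torch.einsum('nm,nmk->nk', w, trans_nm)

# Shape-level example with random placeholder inputs.
N, K = 3, 4
trans_nm = torch.softmax(torch.randn(N, N, K), dim=-1)
w = torch.softmax(torch.randn(N, N), dim=-1)
p_z = mixture_mode_transition(trans_nm, w)   # each row sums to 1
```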
The local dynamic factors allow dropping interactions between objects when none exist, since in multi-object systems, objects often affect one another at sparse points in time and space. The amortized transition dynamics benefit our modelling, because they allow us to model a larger number of objects and their switching dynamics (whether interactions exist or not) by simply extending the respective products and sums. In the next section, we show how we can learn and use these local dynamic factors to ensure interaction sparsity more effectively when we learn a dynamic graph.

## 3. Graph Switching Dynamical Systems

Since our system consists of multiple objects, which may or may not interact at random points in time, we can model the system with a dynamic graph $G_t = (V_t, E_t)$, whose structure and information vary across time. The nodes $V_t$ are all latent variables and observations related to each object $m$ at time step $t$, that is $v^m_t = \{z^m_t, x^m_t, y^m_t, c^m_t\} \in V_t$. The edges $e^{m\to n}_t \in E_t$ denote whether there is an interaction between objects $m$ and $n$ at time $t$, and include self-loops, i.e., $e^{m\to m}_t \in E_t$ for all $m \in (1, \dots, N)$.

Embedding the switching dynamical system into a graph topology, we want messages to be passed between graph nodes $v^m$ and $v^n$ via edges to signal interactions between objects. Since we cannot know when interactions take place, how they take place, and between what objects, we set the latent edge variables to be one-hot vectors of $L+1$ dimensions, $e^{m\to n}_t = [e^{m\to n}_{t,1}, \dots, e^{m\to n}_{t,L+1}]$, where $e^{m\to n}_{t,l} \in \{0, 1\}$. Setting the $l$-th dimension to 1, i.e. $e^{m\to n}_{t,l} = 1$, indicates that the $l$-th type of interaction is active between objects $m$ and $n$ at time $t$, with $e^{m\to n}_{t,l=1} = 1$ standing for "no interaction". Further, we set the prior edge distribution $p_\theta(e_t) = \prod_{m \neq n} p_\theta(e^{m\to n}_t)$ to be a factorized object-to-object distribution over edge types. We set the prior probability to be higher for "no interaction" edges, thus encouraging sparse graphs.

We enable two types of messages to be passed via the edges. First, we want the latent edges to signal whether there is an interaction between two objects. Thus, for objects $m$ and $n$ we set the unnormalized local dynamic factor $\tilde{w}^{m\to n}_t$ to be the sum over the $L$ possible types of interaction:

$$
\tilde{w}^{m\to n}_t = \sum_{l=2}^{L+1} e^{m\to n}_{t,l}, \qquad w^{m\to n}_t = \frac{\tilde{w}^{m\to n}_t}{\sum_{m=1}^{N} \tilde{w}^{m\to n}_t}. \tag{3}
$$

Note that since the sum starts from $l = 2$ ($l = 1$ stands for "no interaction"), $\tilde{w}^{m\to n}_t$ is either 0 (no interaction) or 1. $\tilde{w}^{m\to n}_t$ is a local influence weight from object $m$ to object $n$. For the local dynamic factor, we normalize the weights over $m$ to obtain the weighted influence from all objects $m$ to $n$ that we use in the Interacting Modes term of Eq. (2).

We also want the edges to influence how the continuous state of a pair of objects $x^{m,n}_{t-1}$ changes in case of an interaction. To attain this, rather than simply concatenating the features $x^m_{t-1}$ and $x^n_{t-1}$ as in Eq. (2), we use the edges as weights:

$$
x^{m,n}_{t-1} = \sum_{l} e^{m\to n}_{t,l}\, f^l_e([x^m_{t-1}, x^n_{t-1}]), \tag{4}
$$

where $f^l_e$ is a function for edge type $l$ that aggregates the continuous states of any object pair into a single representation. These $L$ functions represent different interaction types, indexed by the edge types $l = 2, \dots, L+1$, similar to Kipf et al. (2018). Note that there is no need for a specific function for the "no interaction" case.
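A minimal sketch of Eqs. (3)-(4), assuming one-hot edge tensors with the "no interaction" type in the first slot; the linear pair functions and the epsilon-clamp guarding against objects with no incoming edges are implementation choices of this sketch, not prescribed by the model.

```python
import torch
import torch.nn as nn

def edge_factors_and_pair_states(e, x, f_e):
    """Local dynamic factors and edge-weighted pair states, cf. Eqs. (3)-(4).

    e:   (N, N, L+1) one-hot edge types; e[m, n, 0] = 1 means "no interaction".
    x:   (N, d) continuous states at time t-1.
    f_e: list of L modules, one per interaction type l = 2, ..., L+1,
         each mapping a concatenated pair (2d,) -> (h,).
    """
    N, _, L1 = e.shape
    # Eq. (3): the unnormalized factor sums the "interacting" entries, which
    # is 0 or 1 because e is one-hot; then normalize over the senders m.
    w_tilde = e[..., 1:].sum(-1)                             # (N, N)
    w = w_tilde / w_tilde.sum(0, keepdim=True).clamp(min=1e-8)
    # Eq. (4): aggregate each pair's states, weighted by the edge type.
    pair = torch.cat([x[:, None].expand(N, N, -1),           # x^m broadcast
                      x[None, :].expand(N, N, -1)], dim=-1)  # x^n broadcast
    x_pair = sum(e[..., l + 1, None] * f_e[l](pair) for l in range(L1 - 1))
    return w, x_pair                                         # (N, N), (N, N, h)

# Hypothetical instantiation: L = 2 interaction types, d = 4, h = 8, N = 3.
L, d, h, N = 2, 4, 8, 3
f_e = nn.ModuleList(nn.Linear(2 * d, h) for _ in range(L))
e = nn.functional.one_hot(torch.randint(0, L + 1, (N, N)), L + 1).float()
w, x_pair = edge_factors_and_pair_states(e, torch.randn(N, d), f_e)
```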
*Figure 2. (a) Generative model of GRASS. (b) Left: Amortized approximate inference for the continuous states $x^{1:N}_t$ and discrete edge variables $e^{1:N^2}_t$ by inference networks. Temporal dependence is modelled by an intermediate latent embedding $h^{1:N}_t$ given by directional RNNs. Right: Exact inference of the discrete mode and count variables $z^{1:N}_t$ and $c^{1:N}_t$ based on the approximate pseudo-observations and pseudo-interactions $\tilde{x}^{1:N}_t$ and $\tilde{e}^{1:N^2}_t$. Orange circles denote observations or approximate pseudo-observations.*

Taking into account the latent edge variables that are part of our probabilistic model, the joint probability becomes:

$$
p(y, x, z, c, e) = \prod_{n=1}^{N} \underbrace{p(y^n_1 \mid x^n_1)\, p(x^n_1 \mid z^n_1)\, p(z^n_1)}_{\text{Initial States}} \;\prod_{t=2}^{T} \underbrace{\prod_{n=1}^{N}\sum_{m=1}^{N} w^{m\to n}_t\, p(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t)}_{\text{Pairwise Interacting Modes}} \prod_{n=1}^{N}\underbrace{p(y^n_t \mid x^n_t)\, p(x^n_t \mid x^n_{t-1}, z^n_t)\, p(c^n_t \mid z^n_{t-1}, c^n_{t-1})}_{\text{Per-object dynamics}} \tag{5}
$$

The overall generative model and inference stages of GRASS are detailed in Fig. 2. We show a more detailed version with the complete factorization in App. A.2.

## 4. Neural Network Implementation

We use neural networks to model the terms in the joint likelihoods of our Switching Dynamical Systems, specifically of Eq. (1) for the Multiple-Object Switching Dynamical System (MOSDS) of Section 2, and of Eq. (5) for Graph Switching Dynamical Systems (GRASS) of Section 3. Since the mode variables $z^{1:N}_t$ take one out of $K$ possible values for the dynamic modes, we model them as categorical variables, parameterized by transition probabilities $T_t$. Specifically, for pairs of objects in our system, we have:

$$
p(z^n_t \mid z^n_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t) =
\begin{cases}
\delta_{z^n_t = z^n_{t-1}} & \text{if } c^n_t > 1 \\
\mathrm{Cat}(z^n_t; T_t) & \text{if } c^n_t = 1,
\end{cases} \tag{6}
$$

where we resample the dynamic modes of objects or preserve them via a Kronecker $\delta$ function, depending on whether our count variable is reset or not. For MOSDS, we model the parameters $T_t$ of the categorical distributions in Eq. (6) with a neural network $T_t = f_z(x^{1:N}_t)$ that takes as input the continuous states of all objects. In this case, the neural network returns an $NK \times NK$ transition matrix per time step $t$, where rows correspond to past modes $z^{1:N}_{t-1}$ and columns correspond to current modes $z^{1:N}_t$. The matrix has shape $NK \times NK$ because the neural network must predict in one forward pass the likelihoods for all possible combinations of (object $m$, object $n$, mode $i$, mode $j$). Clearly, such a neural network is prohibitively expensive, as its output scales quadratically with the number of objects $N$ and modes $K$, and it is also wasteful to optimize, as it assumes object pairs do not share any dynamics at all. So for GRASS, we instead model the parameters $T_t$ in Eq. (6) with an amortized neural network $T_t = f^l_z(x^{m,n}_{t-1})$ that takes as input only pairs of continuous states (the weights of the neural network are shared for all pairs of objects). For both MOSDS and GRASS, the neural network $f_z$ is a simple MLP. To satisfy the positivity constraint $T_{t,i,j} > 0$ for all $i, j = 1, \dots, K$ and the $\ell_1$ constraints $\sum_j T_{t,i,j} = 1$ for all $i = 1, \dots, K$ on $T_t$, we apply a tempered softmax on $f_z$, i.e. $S_\tau \circ f_z(\cdot)$.
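A sketch of the amortized transition network of Eq. (6) with a tempered softmax; the MLP width, the class name, and the handling of a single pair at a time are illustrative simplifications, not the actual architecture (which is listed in App. A.2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseModeTransition(nn.Module):
    """Sketch of the amortized transition T_t = f_z(x^{m,n}_{t-1}) with a
    tempered softmax S_tau, cf. Eq. (6). The hidden size is illustrative."""

    def __init__(self, pair_dim, K, tau=1.0):
        super().__init__()
        self.K, self.tau = K, tau
        self.f_z = nn.Sequential(nn.Linear(pair_dim, 64), nn.ReLU(),
                                 nn.Linear(64, K * K))

    def forward(self, x_pair, z_prev, c_t):
        # One K x K matrix per pair embedding: rows index the previous mode,
        # columns the current mode; the tempered softmax makes each row a
        # positive distribution that sums to 1.
        T = torch.softmax(self.f_z(x_pair).view(self.K, self.K) / self.tau, -1)
        if c_t > 1:   # count not reset: keep the previous mode (Kronecker delta)
            return F.one_hot(torch.tensor(z_prev), self.K).float()
        return T[z_prev]  # count reset: resample, conditioned on z_{t-1}

# Usage with hypothetical sizes: pair embedding of dim 8, K = 3 modes.
net = PairwiseModeTransition(pair_dim=8, K=3)
p_z = net(torch.randn(8), z_prev=1, c_t=1)   # categorical over 3 modes
```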
The latent edges also take one out of $L+1$ possible values for the different types of interactions. Thus, we model them by an $(L+1)$-way categorical distribution as well.

### 4.1. Inference

Due to the exponential complexity of the state space, exact inference of the latent variables in Switching Dynamical Systems is intractable. Similar to Ansari et al. (2021), we resort to approximate variational inference with neural networks for the continuous latent variables. Furthermore, we modify the original forward-backward algorithm by Yu (2010) to perform exact inference for the discrete mode and count variables, as we detail below. The variational approximation of the true posterior is

$$
p(x, e, z, c \mid y) \approx q(x, e, z, c \mid y) = q_{\phi_x}(x \mid y)\, q_{\phi_e}(e \mid x)\, p_\theta(z, c \mid y, x, e).
$$

Here $q_{\phi_x}$ and $q_{\phi_e}$ correspond to neural networks for the approximate inference of the continuous state and discrete edge variables, respectively, parameterized accordingly. We now describe the exact and approximate inference for each variable. To summarize our setup, we provide a flowchart of the inference algorithm of GRASS in App. A.1. The network architecture and implementation details are in App. A.2.

**Approximate inference of the continuous state $x$.** Following (Dong et al., 2020; Ansari et al., 2021), we factorize the approximate posterior of $x$ as $q_{\phi_x}(x^{1:N}_{1:T} \mid y^{1:N}_{1:T}) = \prod_{n=1}^{N} q_{\phi_x}(x^n_{1:T} \mid y^n_{1:T})$. In particular, we first process the observations $y^n_{1:T}$ with a bi-RNN to accumulate temporally smoothed embeddings $h^n_{1:T}$. Then, we feed the embeddings of the bi-RNN into a causal (i.e. forward uni-directional) RNN, which outputs the per-object posterior distribution $q_{\phi_x}(x^n_{1:T} \mid y^n_{1:T}) = \prod_t q_{\phi_x}(x^n_t \mid x^n_{1:t-1}, h^n_{1:t-1})$.

**Approximate inference of the discrete edges $e$.** Given the inferred $\tilde{x}^{1:N}_{1:T} \sim q_{\phi_x}(x^{1:N}_{1:T} \mid y^{1:N}_{1:T})$, we next infer the latent interaction structure of our graph $G_t$. We use a graph neural network $f_{\phi_z}(\tilde{x}^{1:N}_{1:T})$, which is potentially fully connected and has self-loops, where the node embeddings are the sampled continuous states $\tilde{x}^m_t$. We obtain relational edge embeddings $h^2_{m\to n}$ by two rounds of message passing (a code sketch of this encoder is shown at the end of this section):

$$
h^1_m = f^{\text{emb}}_{\phi_z}(\tilde{x}^m_t) \tag{7}
$$

$$
v \to e:\quad h^1_{m\to n} = f^{e,1}_{\phi_z}([h^1_m, h^1_n]) \tag{8}
$$

$$
e \to v:\quad h^2_m = f^{v,1}_{\phi_z}\Big(\sum_{n=1}^{N} h^1_{n\to m}\Big) \tag{9}
$$

$$
v \to e:\quad h^2_{m\to n} = f^{e,2}_{\phi_z}([h^2_m, h^2_n]) \tag{10}
$$

Assuming conditional independence between edges given all the inferred states, the approximate posterior for edge types becomes $q_{\phi_e}(e^{1:N^2}_{1:T} \mid \tilde{x}^{1:N}_{1:T}) = \prod_t q_{\phi_e}(e^{1:N^2}_t \mid \tilde{x}^{1:N}_{1:t}) = \prod_t \prod_{m,n} q_{\phi_e}(e^{m\to n}_t \mid \tilde{x}^{1:N}_{1:t}) = \prod_t \prod_{m,n} \mathrm{softmax}\big((h^2_{m\to n} + g)/\tau\big)$, where $g$ is a vector sampled from a $\mathrm{Gumbel}(0, 1)$ distribution for the reparametrization trick and $\tau$ is a temperature that controls the relaxation smoothness (Maddison et al., 2016).

**Exact inference of the discrete mode $z$ and count $c$.** Given the inferred states $\tilde{x}^{1:N}_{1:T} \sim q_{\phi_x}(x^{1:N}_{1:T} \mid y^{1:N}_{1:T})$ and edges $\tilde{e}^{1:N^2}_{1:T} \sim q_{\phi_e}(e^{1:N^2}_{1:T} \mid \tilde{x}^{1:N}_{1:T})$, we do exact inference of the discrete mode and count variables $p_\theta(z^{1:N}_{1:T}, c^{1:N}_{1:T} \mid y^{1:N}_{1:T}, \tilde{x}^{1:N}_{1:T}, \tilde{e}^{1:N^2}_{1:T})$. We modify the forward-backward algorithm used with hidden Markov models (Collins, 2013) by introducing the additional continuous state $\tilde{x}^{1:N}_{1:T}$ and discrete edge $\tilde{e}^{1:N^2}_{1:T}$ variables, where the forward variable $\alpha_t$ and backward variable $\beta_t$ are defined as:

$$
\alpha_t(z_t, c_t) = p(y_{1:t}, \tilde{x}_{1:t}, \tilde{e}_{1:t}, z_t, c_t) \tag{11}
$$

$$
\beta_t(z_t, c_t) = p(y_{t+1:T}, \tilde{x}_{t+1:T} \mid \tilde{x}_t, \tilde{e}_t, z_t, c_t), \tag{12}
$$

where for clarity we drop the superscripts from $z_t$, $c_t$, $y_t$, $\tilde{x}_t$, and $\tilde{e}_t$. We describe the details in App. A.3.3.

### 4.2. Learning

The overall network is jointly learned by maximizing the evidence lower bound (Kingma & Welling, 2013),

$$
\log p_\theta(y) \;\geq\; \log p_\theta(y) - D_{KL}\left[q_\phi(x, z, c, e \mid y)\,\|\,p_\theta(x, z, c, e \mid y)\right] = \mathbb{E}_{q_\phi(x|y)}\left[\log p_\theta(x, y)\right] + H(q_\phi(x \mid y)). \tag{13}
$$

The joint likelihood $p(x, y)$ is computed by marginalizing $z, c, e$ from the forward variable $\alpha_t(z_t, c_t)$, and the approximate posterior distribution $q(x \mid y)$ is computed by the amortized inference network. The detailed training objective of GRASS is described in App. A.3.
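To make the edge inference of Section 4.1 concrete, the following is a minimal sketch of the encoder of Eqs. (7)-(10) with a Gumbel-softmax edge posterior; the layer types and sizes are illustrative (the actual architecture is listed in App. A.2), and `torch.nn.functional.gumbel_softmax` implements the reparameterized relaxation of Maddison et al. (2016).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeInference(nn.Module):
    """Sketch of the relational edge encoder, cf. Eqs. (7)-(10)."""

    def __init__(self, d, h, num_edge_types, tau=0.5):
        super().__init__()
        self.tau = tau
        self.f_emb = nn.Linear(d, h)                  # Eq. (7)
        self.f_e1 = nn.Linear(2 * h, h)               # Eq. (8), v -> e
        self.f_v1 = nn.Linear(h, h)                   # Eq. (9), e -> v
        self.f_e2 = nn.Linear(2 * h, num_edge_types)  # Eq. (10), v -> e

    def forward(self, x):                             # x: (N, d) sampled states
        N = x.shape[0]
        pair = lambda h: torch.cat([h[:, None].expand(N, N, -1),
                                    h[None, :].expand(N, N, -1)], -1)
        h1 = self.f_emb(x)                            # node embeddings (N, h)
        h1_edge = self.f_e1(pair(h1))                 # edge embeddings (N, N, h)
        h2 = self.f_v1(h1_edge.sum(0))                # sum of incoming messages
        logits = self.f_e2(pair(h2))                  # (N, N, L+1) edge logits
        # Reparameterized one-hot edge-type sample via the Gumbel-softmax.
        e = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        return e, torch.softmax(logits, -1)           # sample and posterior

enc = EdgeInference(d=4, h=16, num_edge_types=3)
e, q_e = enc(torch.randn(5, 4))                       # 5 objects
```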
## 5. Experiments

Most datasets for switching dynamical systems focus on scenarios where a single object switches dynamics, such as a one-dimensional bouncing ball, Dubins paths, a single dancer in Salsa Dancing from CMU MoCap (Dong et al., 2020), and a 3-mode system (Ansari et al., 2021). While there are a few cases with multiple objects, these objects do not interact with one another. For instance, the dancing bees by Ansari et al. (2021) are considered a single super object comprising all objects simultaneously. The two-dimensional reacher task by Dong et al. (2020) and the neural populations by Glaser et al. (2020) are similarly constructed. By contrast, we focus on the generalized setting of having multiple objects that interact with one another, where interacting objects are considered simultaneously and depend on one another, with the objective of discovering dynamic modes and switching behaviours. To evaluate the proposed methods and compare against baselines, we introduce two datasets for benchmarking, inspired by the single-object literature: the synthesized ODE-driven Particle dataset, and the Salsa Couple Dancing dataset. The code and datasets are available at https://github.com/yongtuoliu/Graph-Switching-Dynamical-Systems.

**ODE-driven Particle dataset.** We introduce three Ordinary Differential Equation (ODE) systems as the three modes to generate time-evolving trajectories of particles, i.e., the Lotka-Volterra, Spiral, and Bouncing Ball ODEs:

$$
\text{Lotka-Volterra:}\quad \dot{x} = x - xy; \quad \dot{y} = -y + xy \tag{14}
$$

$$
\text{Spiral:}\quad \dot{x} = -0.1x^3 + 2y^3; \quad \dot{y} = -2x^3 - 0.1y^3 \tag{15}
$$

$$
\text{Bouncing Ball:}\quad \dot{x} = 0; \quad \dot{y} = -2 \;\;(\text{or } \dot{y} = 2) \tag{16}
$$

To simulate trajectories, we draw balls with radius $r$, randomly initialized and driven by different ODEs, on a square 2D canvas of size 64×64. Specifically, we consider three particle balls driven by three different ODE modes, unless stated otherwise (e.g., in the experiments increasing the number of particles or the number of modes). Numerical values of the ODEs are mapped to the canvas. For mode-switching interactions among objects, we switch the driving ODE modes of two objects when they collide on the canvas. Each sample has 100 time steps at 10 frames per second. We follow the sample splitting proportion of the synthesized datasets in REDSDS (Ansari et al., 2021) (i.e. test data is around 5% of the training data) and create 4,928 samples for training, 191 samples for validation, and 204 samples for testing. Analyses on a new splitting strategy (i.e. test data around 10% of training data) and a larger dataset are in App. B.1. A sample visualization of this dataset is shown in Fig. 3.

*Figure 3. Visualization of the ODE-driven Particle dataset. The yellow and blue balls in the third frame switch their equations when they collide.*
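A simplified sketch of how such trajectories can be generated, assuming a plain Euler integrator and mode swapping on collision; the real dataset additionally maps the ODE values onto the 64×64 canvas and handles the bouncing-ball sign flip, which this sketch omits.

```python
import numpy as np

# Right-hand sides of the three mode ODEs, cf. Eqs. (14)-(16); the bouncing
# ball is reduced here to a constant downward velocity.
MODES = [
    lambda x, y: (x - x * y, -y + x * y),                           # Lotka-Volterra
    lambda x, y: (-0.1 * x**3 + 2 * y**3, -2 * x**3 - 0.1 * y**3),  # Spiral
    lambda x, y: (0.0, -2.0),                                       # Bouncing ball
]

def simulate(n_particles=3, T=100, dt=0.1, radius=0.08, seed=0):
    """Euler rollout of particles that swap their driving ODEs on collision."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1, 1, size=(n_particles, 2))
    mode = rng.permutation(n_particles) % len(MODES)
    traj, modes = [pos.copy()], [mode.copy()]
    for _ in range(T - 1):
        vel = np.array([MODES[m](*p) for m, p in zip(mode, pos)])
        pos = pos + dt * vel
        for i in range(n_particles):                 # pairwise collision check
            for j in range(i + 1, n_particles):
                if np.linalg.norm(pos[i] - pos[j]) < 2 * radius:
                    mode[i], mode[j] = mode[j], mode[i]   # swap ODE modes
        traj.append(pos.copy()); modes.append(mode.copy())
    return np.stack(traj), np.stack(modes)           # (T, N, 2) and (T, N)

y, z = simulate()    # observations and ground-truth mode labels
```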
**Salsa Couple Dancing dataset.** Dong et al. (2020) experiment with salsa dancing sequences from CMU MoCap which, however, feature only a single dancer. We collect 17 real-world Salsa dancing videos from the Internet, containing 8,672 frames. Among them, 3 videos are for testing and the remaining videos are for training. We extract 3D skeletons of the dancers with a pretrained model (Moon et al., 2019) and conduct temporal Gaussian smoothing afterwards. As in Dong et al. (2020), we annotate four modes, i.e., "moving forward", "moving backward", "clockwise turning", and "counter-clockwise turning". Each sample has 100 time steps at 5 frames per second. We have 1,321 samples for training and 156 samples for testing. The coordinates of the 3D skeletal joints serve as input for each dancer, and the modes of each dancer at each time step are the output. In Fig. 4 we show the 3D skeletons extracted from the videos.

*Figure 4. 3D skeletons in the Salsa Couple Dancing dataset.*

**Evaluation metrics.** Following Dong et al. (2020); Ansari et al. (2021), we evaluate using frame-wise segmentation accuracy, i.e. accuracy and F1 after matching the labels using the Hungarian algorithm (Kuhn, 1955), as well as Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) to measure the similarity between two labellings; a concrete sketch of this protocol is shown after Table 2. We conduct each experiment for five random seeds and report the average performance and standard deviation of the results.

**Baselines.** We compare MOSDS and GRASS with three state-of-the-art methods: rSLDS (Linderman et al., 2016), SNLDS (Dong et al., 2020), and REDSDS (Ansari et al., 2021). For our implementation, we use REDSDS (Ansari et al., 2021) as the base for MOSDS and GRASS. We include in the comparisons GRASS-GT as an upper-bound oracle method, for which we use the ground-truth graph edges rather than learned ones to model the mode transition behaviours.

### 5.1. ODE-driven Particle

We summarize the results for the ODE-driven particles in Table 1. We see that just by considering interactions between multiple objects with MOSDS, we achieve significant and consistent performance increases across all metrics. When further using graphs to model the switching dynamics in our interacting system of objects, GRASS improves by 9-10% over the previous state of the art, REDSDS, across all metrics. We also observe that GRASS performs similarly to GRASS-GT, which uses ground-truth edges, showcasing the accuracy of inferring the latent object-to-object interactions. In Fig. 5, we also show qualitative results of GRASS compared to REDSDS, the top-performing baseline. GRASS discovers mode-switching behaviours between objects effectively and with fewer switching errors.

*Figure 5. Qualitative results of our GRASS model compared to the previous state-of-the-art method REDSDS (Ansari et al., 2021). Each row contains three sub-rows which denote the mode segmentation of multiple objects. We can see that with explicit interaction modelling by GRASS, mode-switching behaviours among objects are discovered effectively, with fewer switching errors and better segmentation results.*

Table 1. Comparisons on the ODE-driven Particle dataset.

| Method | NMI | ARI | Accuracy | F1 |
|---|---|---|---|---|
| rSLDS | 0.257±0.023 | 0.231±0.016 | 0.450±0.033 | 0.443±0.041 |
| SNLDS | 0.368±0.027 | 0.349±0.021 | 0.681±0.067 | 0.664±0.053 |
| REDSDS | 0.418±0.016 | 0.397±0.028 | 0.708±0.037 | 0.702±0.027 |
| MOSDS (this paper) | 0.469±0.020 | 0.474±0.015 | 0.766±0.045 | 0.757±0.032 |
| GRASS (this paper) | 0.528±0.014 | 0.519±0.008 | 0.794±0.030 | 0.790±0.021 |
| GRASS-GT (Oracle) | 0.537±0.012 | 0.526±0.010 | 0.805±0.028 | 0.801±0.016 |

### 5.2. Salsa Couple Dancing

We summarize the results on the Salsa Couple Dancing dataset in Table 2. We observe similar findings on this real-world video dataset as with the ODE-driven particles. GRASS achieves significantly higher scores across all metrics than all baselines, including REDSDS and our simpler method MOSDS.

Table 2. Comparisons on the Salsa Couple Dancing dataset.

| Method | NMI | ARI | Accuracy | F1 |
|---|---|---|---|---|
| rSLDS | 0.118±0.028 | 0.102±0.043 | 0.373±0.066 | 0.360±0.053 |
| SNLDS | 0.145±0.047 | 0.133±0.031 | 0.420±0.113 | 0.413±0.096 |
| REDSDS | 0.156±0.032 | 0.152±0.036 | 0.504±0.052 | 0.467±0.074 |
| MOSDS (this paper) | 0.162±0.053 | 0.165±0.072 | 0.537±0.091 | 0.508±0.063 |
| GRASS (this paper) | 0.174±0.031 | 0.176±0.043 | 0.569±0.065 | 0.524±0.046 |
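To make the evaluation protocol above concrete, a minimal sketch of the Hungarian-matched metrics using `scipy` and `scikit-learn`; the macro averaging of F1 is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, f1_score)

def matched_accuracy(true, pred, K):
    """Frame-wise segmentation accuracy after matching predicted mode labels
    to ground-truth labels with the Hungarian algorithm (Kuhn, 1955)."""
    cost = np.zeros((K, K))
    for i in range(K):                 # overlap counts between label pairs
        for j in range(K):
            cost[i, j] = -np.sum((true == i) & (pred == j))
    rows, cols = linear_sum_assignment(cost)   # maximizes total overlap
    mapping = dict(zip(cols, rows))            # predicted label -> true label
    remapped = np.array([mapping[p] for p in pred])
    return (remapped == true).mean(), remapped

true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([1, 1, 2, 2, 0, 0])            # a perfect, relabeled segmentation
acc, remapped = matched_accuracy(true, pred, K=3)
nmi = normalized_mutual_info_score(true, pred) # already permutation-invariant
ari = adjusted_rand_score(true, pred)
f1 = f1_score(true, remapped, average='macro')
```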
### 5.3. Ablation experiments

Due to limited space, we report the average performance in each table; results with standard deviations are in App. B.2.

**Sensitivity to the number of interactions.** We evaluate how sensitive GRASS is in the presence of an increasing number of interactions. First, we extend the normal ODE-driven Particle dataset to more particles, i.e. 3 particles, 5 particles, and 10 particles. The number of interactions naturally increases with the number of particles on a space-constrained canvas. For the different numbers of particles, we count the average number of interactions per object per time series: 2.3 interactions for 3 particles, 6.1 for 5 particles, and 12.5 for 10 particles. We present the results in Table 3, from which we conclude that GRASS is not adversely affected by an increasing number of objects and interactions.

Table 3. Analyses of different numbers of objects on the ODE-driven Particle dataset, while increasing the average number of interactions per object per time series, i.e., 2.3 interactions for 3 particles, 6.1 for 5, and 12.5 for 10. Entries denote NMI / F1.

| Method | 3 particles | 5 particles | 10 particles |
|---|---|---|---|
| rSLDS | 0.257 / 0.443 | 0.252 / 0.437 | 0.246 / 0.430 |
| SNLDS | 0.368 / 0.664 | 0.361 / 0.656 | 0.354 / 0.651 |
| REDSDS | 0.418 / 0.701 | 0.411 / 0.692 | 0.405 / 0.687 |
| MOSDS (this paper) | 0.469 / 0.757 | 0.461 / 0.752 | 0.456 / 0.748 |
| GRASS (this paper) | 0.528 / 0.790 | 0.524 / 0.784 | 0.519 / 0.781 |

**Sensitivity to the number of objects.** We further test increasing the number of objects, while fixing the number of interactions. We achieve this by controlling the sizes of the objects, as with smaller balls we have fewer collisions (and thus interactions). We roughly fix the number of interactions per object per time series to 2.3 and change the number of objects to 3, 5, and 10 as in the previous trial. We present the results in Table 4. GRASS is robust to different numbers of objects, whether or not we fix the number of interactions.

Table 4. Analyses of different numbers of objects on the ODE-driven Particle dataset, while fixing the average number of interactions per object per time series to 2.3. Entries denote NMI / F1.

| Method | 3 particles | 5 particles | 10 particles |
|---|---|---|---|
| rSLDS | 0.257 / 0.443 | 0.262 / 0.444 | 0.253 / 0.437 |
| SNLDS | 0.368 / 0.664 | 0.365 / 0.666 | 0.362 / 0.659 |
| REDSDS | 0.418 / 0.701 | 0.423 / 0.706 | 0.413 / 0.694 |
| MOSDS (this paper) | 0.469 / 0.757 | 0.471 / 0.763 | 0.464 / 0.754 |
| GRASS (this paper) | 0.528 / 0.790 | 0.530 / 0.792 | 0.524 / 0.786 |

Table 5. Analyses of robustness to datasets without interactions on the ODE-driven Particle dataset. Entries denote NMI / F1.

| Method | dataset w/ interaction | dataset w/o interaction |
|---|---|---|
| rSLDS | 0.257 / 0.443 | 0.471 / 0.686 |
| SNLDS | 0.368 / 0.664 | 0.534 / 0.772 |
| REDSDS | 0.418 / 0.701 | 0.579 / 0.838 |
| MOSDS (this paper) | 0.469 / 0.757 | 0.563 / 0.817 |
| GRASS (this paper) | 0.528 / 0.790 | 0.573 / 0.826 |

Table 6. Analyses of robustness to different maximal numbers of predefined modes. Entries denote NMI / F1.

| Method | 3 modes | 5 modes | 10 modes |
|---|---|---|---|
| rSLDS | 0.257 / 0.443 | 0.253 / 0.438 | 0.248 / 0.436 |
| SNLDS | 0.368 / 0.664 | 0.365 / 0.661 | 0.362 / 0.657 |
| REDSDS | 0.418 / 0.701 | 0.415 / 0.696 | 0.413 / 0.694 |
| MOSDS (this paper) | 0.469 / 0.757 | 0.466 / 0.759 | 0.462 / 0.754 |
| GRASS (this paper) | 0.528 / 0.790 | 0.532 / 0.794 | 0.527 / 0.784 |

**Sensitivity to absence of interactions.** GRASS is built for systems of multiple objects that interact with one another. We test whether the method generalizes even in the case when the objects are independent and do not interact, as with single-object Switching Dynamical Systems.
We create a dataset with three particles driven by three different ODEs, and set them up so that they do not interact with each other. We present the results in Table 5. In the presence of interactions, GRASS is considerably more accurate than REDSDS, while in the absence of interactions, it scores comparably. In this case, MOSDS shows a larger drop in accuracy. The reason is that with its dynamic graph, GRASS can still correctly predict that there exist no interaction edges between objects, while MOSDS always assumes all objects interact.

**Sensitivity to the number of dynamic modes.** Like previous methods (Linderman et al., 2016; Dong et al., 2020; Ansari et al., 2021), GRASS requires a predefined maximum number of modes. We test its robustness to different maximum numbers of modes, that is 3, 5, and 10, while the true number of modes is 3. We present the results in Table 6. We observe that GRASS is impervious to this misspecification, which suggests that we can set a large number of possible modes and GRASS will still use only those needed.

## 6. Related Work

Switching Linear Dynamical Systems (SLDS) (Ackerson & Fu, 1970; Ghahramani & Hinton, 2000; Oh et al., 2005) introduce both discrete states to represent motion modes and continuous states to characterize the motion dynamics of each mode, but assume linear state transitions. Switching Non-linear Dynamical Systems, implemented by neural networks, extend these methods to the nonlinear case, providing better expressiveness for complex system dynamics. Among them, SNLDS (Dong et al., 2020) and REDSDS (Ansari et al., 2021) are two representative methods that consistently outperform their linear counterparts. While effective, previous methods and datasets are usually limited to single-object scenarios where only one object exists. When multiple objects exist, they are processed independently or considered as one single super-object with a single mode. For example, in (Glaser et al., 2020), multiple neural populations exist in the brain, while only the mode behaviours of the whole brain are modelled and discovered. By contrast, in this paper we focus on the general setting where our systems comprise multiple objects that interact and change their behaviour accordingly.

Graph Neural Networks are the de facto choice for learning relational representations over graphs. Recently, several methods have focused on neural relational inference (Kipf et al., 2018; Graber & Schwing, 2020; Kofinas et al., 2021) over temporal sequences, whose dynamics are encoded by continuous latent states. These methods focus on systems with multiple objects whose dynamics, however, do not change over time, and they are therefore not a good fit for discovering mode-switching behaviours over time. In this work, we start from the framework of Switching Dynamical Systems and integrate it within a graph neural network formalism. In particular, we extend neural relational graphs and relational inference (Kipf et al., 2018; Graber & Schwing, 2020) to incorporate latent interaction variables, one per pair of objects, and model the potential dynamic interactions between objects. The proposed Graph Switching Dynamical Systems can thus handle systems of increased complexity with significantly better accuracy. This is true even in the presence of interactions that are sparse in both space and time, which cause sudden and complex dynamic mode switches.
## 7. Conclusion and Future Work

We investigate the setting of interacting-object switching dynamical systems, where objects interact with each other and influence each other's modes. We propose a graph-based approach for these systems, GRAph Switching dynamical Systems (GRASS), in which we use a dynamic graph to model interactions and mode-switching behaviours between objects. We also introduce two datasets, i.e. a synthesized ODE-driven Particle dataset and a real-world Salsa Couple Dancing dataset. Experiments show that GRASS considerably improves over the state of the art. Future work includes exploring learning switching dynamical systems with multiple objects directly from videos.

**Acknowledgements.** This work is financially supported by NWO TIMING VI.Vidi.193.129. We also thank SURF for the support in using the National Supercomputer Snellius.

## References

Ackerson, G. and Fu, K. On state estimation in switching environments. IEEE Transactions on Automatic Control, 15(1):10-17, 1970.

Ansari, A. F., Benidis, K., Kurle, R., Turkmen, A. C., Soh, H., Smola, A. J., Wang, B., and Januschowski, T. Deep explicit duration switching models for time series. Advances in Neural Information Processing Systems, 34:29949-29961, 2021.

Collins, M. The forward-backward algorithm. Columbia University, 2013.

Dong, Z., Seybold, B., Murphy, K., and Bui, H. Collapsed amortized variational inference for switching nonlinear dynamical systems. In International Conference on Machine Learning, pp. 2638-2647, 2020.

Ghahramani, Z. and Hinton, G. E. Variational learning for switching state-space models. Neural Computation, 12(4):831-864, 2000.

Glaser, J., Whiteway, M., Cunningham, J. P., Paninski, L., and Linderman, S. Recurrent switching dynamical systems models for multiple interacting neural populations. Advances in Neural Information Processing Systems, 33:14867-14878, 2020.

Graber, C. and Schwing, A. G. Dynamic neural relational inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2688-2697, 2018.

Kofinas, M., Nagaraja, N., and Gavves, E. Roto-translated local coordinate frames for interacting dynamical systems. Advances in Neural Information Processing Systems, 34:6417-6429, 2021.

Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83-97, 1955.

Linderman, S. W., Miller, A. C., Adams, R. P., Blei, D. M., Paninski, L., and Johnson, M. J. Recurrent switching linear dynamical systems. arXiv preprint arXiv:1610.08466, 2016.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Moon, G., Chang, J. Y., and Lee, K. M. Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10133-10142, 2019.

Oh, S. M., Ranganathan, A., Rehg, J. M., and Dellaert, F. A variational inference method for switching linear dynamic systems. Technical report, Georgia Institute of Technology, 2005.

Pavlovic, V., Rehg, J. M., and MacCormick, J. Learning switching linear models of human motion. Advances in Neural Information Processing Systems, 13, 2000.
Raftery, A. E. A model for high-order Markov chains. Journal of the Royal Statistical Society: Series B (Methodological), 47(3):528-539, 1985.

Saul, L. K. and Jordan, M. I. Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37(1):75-87, 1999.

Shi, C., Schwartz, S., Levy, S., Achvat, S., Abboud, M., Ghanayim, A., Schiller, J., and Mishne, G. Learning disentangled behavior embeddings. Advances in Neural Information Processing Systems, 34:22562-22573, 2021.

Xu, M., Xie, X., Lv, P., Niu, J., Wang, H., Li, C., Zhu, R., Deng, Z., and Zhou, B. Crowd behavior simulation with emotional contagion in unexpected multihazard situations. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(3):1567-1581, 2021. doi: 10.1109/TSMC.2019.2899047.

Yu, S.-Z. Hidden semi-Markov models. Artificial Intelligence, 174(2):215-243, 2010.

## A. More details of the GRASS model

### A.1. Inference Algorithm of GRASS

The inference algorithm of GRASS is given in Alg. 1. As inputs, we have a time series $y_{1:T}$ and an interaction edge prior distribution $p(e_{1:T})$. First, we initialize the distributions of the continuous state and discrete mode variables as $p(x_1)$ and $p(z_1)$; besides, the range of the discrete count variable is initialized as $\{d_{\min}, \dots, d_{\max}\}$. For each time step $t$ in the time series, the continuous state and discrete edges are first inferred by posterior approximation, i.e. $\tilde{x}_t \sim q_{\phi_x}(x_t \mid y_{1:T})$ and $\tilde{e}_t \sim q_{\phi_e}(e_t \mid \tilde{x}_t)$. Then we calculate the continuous state and discrete mode transition probabilities, i.e. $p_{\theta_{xtr}}(x_t \mid \tilde{x}_{t-1}, z_t)$ and $p_{\theta_{ztr}}(z^{1:N}_t \mid z^{1:N}_{t-1}, \tilde{x}^{1:N}_{t-1}, c^{1:N}_t)$, which are used for the exact inference of the discrete mode and count variables by calculating the forward and backward variables $\alpha_t(z_t, c_t)$ and $\beta_t(z_t, c_t)$ in the forward-backward algorithm. Besides, two consistency losses are introduced by calculating the log-likelihoods between $\tilde{x}_t$ and $\hat{x}_t$, and between $y_t$ and $\tilde{y}_t$. We finally derive the ELBO optimization objective to optimize the parameters of the networks. Details of the derivation of the ELBO are in Section A.3. An illustration of the inference stage is in Fig. 6. Besides, the overall generative model and inference stages of GRASS, factorized over objects, are detailed in Fig. 7.

**Algorithm 1: Inference algorithm for GRASS.**

Input: time series $y_{1:T}$, interaction edge prior distribution $p(e_{1:T})$. Output: learned parameters $\phi$ and $\theta$.

1. Initialize the prior continuous state and discrete mode distributions $p(x_1)$ and $p(z_1)$; initialize the range of the discrete count variable $\{d_{\min}, \dots, d_{\max}\}$.
2. For each $t \in [1, \dots, T]$:
   1. *(State inference)* Infer the continuous state $\tilde{x}_t \sim q_{\phi_x}(x_t \mid y_{1:T})$ and the discrete edges $\tilde{e}_t \sim q_{\phi_e}(e_t \mid \tilde{x}_t)$.
   2. *(Continuous state transition)* Calculate $\hat{x}_t \sim p_{\theta_{xtr}}(x_t \mid \tilde{x}_{t-1}, z_t)$.
   3. *(Discrete mode transition)* For each $n, m \in [1, \dots, N]$: calculate the interaction weights $\tilde{w}^{m\to n}_t = \sum_{l=2}^{L+1} e^{m\to n}_{t,l}$, the pair states $x^{m,n}_{t-1} = \sum_l e^{m\to n}_{t,l}\, f^l_e([\tilde{x}^m_{t-1}, \tilde{x}^n_{t-1}])$, and $p_{\theta_{ztr}}(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t)$; then calculate the discrete mode transition $p_{\theta_{ztr}}(z^{1:N}_t \mid z^{1:N}_{t-1}, \tilde{x}^{1:N}_{t-1}, c^{1:N}_t) = \prod_{n=1}^{N} \sum_{m=1}^{N} w^{m\to n}_t\, p_{\theta_{ztr}}(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t)$.
   4. *(Reconstruct input)* Emit the reconstructed input $\tilde{y}_t \sim p_{\theta_y}(y_t \mid \tilde{x}_t)$.
   5. *(Log-likelihood calculation)* Calculate $\mathrm{LogLikelihood}(\tilde{y}_t, y_t)$ and $\mathrm{LogLikelihood}(\tilde{x}_t, \hat{x}_t)$.
   6. *(Exact inference of discrete mode and count)* Calculate the forward variable $\alpha_t(z_t, c_t) = p(y_{1:t}, \tilde{x}_{1:t}, \tilde{e}_{1:t}, z_t, c_t)$ and the backward variable $\beta_t(z_t, c_t) = p(y_{t+1:T}, \tilde{x}_{t+1:T} \mid \tilde{x}_t, \tilde{e}_t, z_t, c_t)$.
3. *(ELBO optimization)* $\operatorname{arg\,max}_{\phi, \theta}\; \log p_\theta(y) - D_{KL}\left[q_\phi(x, z, c, e \mid y)\,\|\,p_\theta(x, z, c, e \mid y)\right]$.

*Figure 6. Illustration of the inference algorithm of Graph Switching Dynamical Systems. After the approximate inference of the continuous states $\tilde{x}_{1:T}$ and discrete edges $\tilde{e}_{1:T}$, we further calculate the continuous state transition probability $p_{\theta_{xtr}}(x_t \mid \tilde{x}_{t-1}, z_t)$, the discrete mode transition probability $p_{\theta_{ztr}}(z_t \mid z_{t-1}, \tilde{x}_{t-1}, c_t, e_t)$, and the discrete count transition probability $p_{\theta_c}(c_t \mid c_{t-1}, z_{t-1})$, which are utilized by the forward-backward algorithm to conduct exact inference of the discrete modes $z_{1:T}$ and counts $c_{1:T}$ and finally derive the ELBO optimization objective.*

*Figure 7. (a) Generative model of GRASS. (b) Left: Amortized approximate inference for the continuous states (e.g. $x^1_t$ and $x^2_t$) and discrete edge variables (e.g. $e^{1\to 2}_t$ and $e^{2\to 1}_t$) by inference networks. Temporal dependence is modelled by intermediate latent embeddings (e.g. $h^1_t$ and $h^2_t$) given by directional RNNs. Right: Exact inference of the discrete mode variables (e.g. $z^1_t$ and $z^2_t$) and count variables (e.g. $c^1_t$ and $c^2_t$) based on the approximate pseudo-observations (e.g. $\tilde{x}^1_t$ and $\tilde{x}^2_t$) and pseudo-interactions (e.g. $\tilde{e}^{1\to 2}_t$ and $\tilde{e}^{2\to 1}_t$). Orange circles denote observations or approximate pseudo-observations. Here, we assume there exist two objects in the scenario.*
### A.2. Implementation Details

In the following, we give the network details as well as the embedding dimensions. biGRU[a] denotes a bidirectional GRU with a single layer of a hidden units. MLP[b] denotes a single-layer MLP with b hidden units and ReLU non-linearity. RNN[c] denotes a single-layer RNN with c hidden units.

- Inference networks for the continuous state $x$: biGRU[4], RNN[16], and MLP[8];
- Inference networks for the discrete edge $e$: MLP[128] (i.e. $f^{\text{emb}}_{\phi_z}$), MLP[128] (i.e. $f^{e,1}_{\phi_z}$), MLP[128] (i.e. $f^{v,1}_{\phi_z}$), and MLP[2] for the ODE-driven Particle dataset or MLP[5] for the Salsa Couple Dancing dataset (i.e. $f^{e,2}_{\phi_z}$);
- Continuous transition network: MLP[8] (i.e. $p(x^n_t \mid x^n_{t-1}, z^n_t)$);
- Discrete transition network: MLP[22] for the ODE-driven Particle dataset or MLP[44] for the Salsa Couple Dancing dataset (i.e. $p(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t)$);
- Emission network: MLP[2] for the ODE-driven Particle dataset or MLP[45] for the Salsa Couple Dancing dataset (i.e. $p(y^n_t \mid x^n_t)$).
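A sketch of the continuous-state inference stack with the sizes listed above (biGRU[4], RNN[16], MLP[8]); the observation and state dimensionalities are illustrative, and for brevity the causal RNN here consumes only the bi-GRU embeddings rather than additionally conditioning on the previously sampled states.

```python
import torch
import torch.nn as nn

obs_dim, x_dim, T = 4, 2, 100
bi_gru = nn.GRU(obs_dim, 4, batch_first=True, bidirectional=True)  # biGRU[4]
causal_rnn = nn.RNN(8, 16, batch_first=True)                       # RNN[16]
head = nn.Sequential(nn.Linear(16, 8), nn.ReLU(),                  # MLP[8]
                     nn.Linear(8, 2 * x_dim))  # Gaussian mean and log-variance

y = torch.randn(1, T, obs_dim)             # one object's observations
h, _ = bi_gru(y)                           # (1, T, 8) smoothed embeddings
g, _ = causal_rnn(h)                       # forward (causal) pass
mu, logvar = head(g).chunk(2, dim=-1)      # per-step posterior parameters
x = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
```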
We train on both datasets with a fixed batch size of 20 for 60,000 training steps. We use the Adam optimizer with $10^{-5}$ weight decay and clip the gradient norm to 10. The learning rate is warmed up linearly from $5 \times 10^{-5}$ to $2 \times 10^{-4}$ for the first 2,000 steps, and then decays following a cosine schedule with a rate of 0.99. Each experiment runs on one Nvidia GeForce RTX 3090 GPU.

### A.3. Detailed Optimization Objective of GRASS

#### A.3.1. Derivation of the ELBO

The evidence lower bound objective (ELBO) of Graph Switching Dynamical Systems (GRASS) is defined as follows. For brevity, $x$, $y$, $z$, $c$, and $e$ represent $x^{1:N}_{1:T}$, $y^{1:N}_{1:T}$, $z^{1:N}_{1:T}$, $c^{1:N}_{1:T}$, and $e^{1:N^2}_{1:T}$, respectively. $N$ is the number of objects and $T$ is the number of timestamps.

$$
\begin{aligned}
\mathrm{ELBO} &= \log p_\theta(y) - D_{KL}\left[q_\phi(x, z, c, e \mid y)\,\|\,p_\theta(x, z, c, e \mid y)\right] \\
&= \int q_\phi(x, z, c, e \mid y) \log p_\theta(y)\, d(x, z, c, e) - \int q_\phi(x, z, c, e \mid y) \log \frac{q_\phi(x, z, c, e \mid y)}{p_\theta(x, z, c, e \mid y)}\, d(x, z, c, e) \\
&= \int q_\phi(x, z, c, e \mid y) \left[\log p_\theta(x, z, c, e, y) - \log q_\phi(x, z, c, e \mid y)\right] d(x, z, c, e) \\
&= \mathbb{E}_{q_\phi(x, z, c, e \mid y)}\left[\log p_\theta(x, z, c, e, y) - \log q_\phi(x, z, c, e \mid y)\right] \\
&= \mathbb{E}_{q_\phi(x|y)\, q_\phi(e|x)\, p_\theta(z, c \mid x, y, e)}\left[\log p_\theta(x, y)\, q_\phi(e \mid x)\, p_\theta(z, c \mid x, y, e) - \log q_\phi(x \mid y)\, q_\phi(e \mid x)\, p_\theta(z, c \mid x, y, e)\right] \\
&= \mathbb{E}_{q_\phi(x|y)\, q_\phi(e|x)\, p_\theta(z, c \mid x, y, e)}\left[\log p_\theta(x, y) - \log q_\phi(x \mid y)\right] \\
&= \mathbb{E}_{q_\phi(x|y)}\left[\log p_\theta(x, y) - \log q_\phi(x \mid y)\right] \\
&= \mathbb{E}_{q_\phi(x|y)}\left[\log p_\theta(x, y)\right] + H(q_\phi(x \mid y)),
\end{aligned}
$$

where the first term is the model likelihood, and the second term is the conditional entropy of the variational posterior over the continuous latent state $x$. With the assumption of conditional independence of the continuous latent states among objects, the conditional entropy is expanded through space and time as:

$$
\begin{aligned}
H(q_\phi(x \mid y)) &= H(q_\phi(x^{1:N}_{1:T} \mid y^{1:N}_{1:T})) = H\Big(\prod_{n=1}^{N} q_\phi(x^n_{1:T} \mid y^n_{1:T})\Big) = \sum_{n=1}^{N} H(q_\phi(x^n_{1:T} \mid y^n_{1:T})) \\
&= \sum_{n=1}^{N} H\Big(q_\phi(x^n_1 \mid y^n_1) \prod_{t=2}^{T} q_\phi(x^n_t \mid \tilde{x}^n_{1:t-1}, y^n_t)\Big) = \sum_{n=1}^{N}\Big[H(q_\phi(x^n_1 \mid y^n_1)) + \sum_{t=2}^{T} H(q_\phi(x^n_t \mid \tilde{x}^n_{1:t-1}, y^n_t))\Big],
\end{aligned}
$$

where $\tilde{x}^n_{1:t-1}$ contains $\tilde{x}^n_1, \tilde{x}^n_2, \dots, \tilde{x}^n_{t-1}$, in which $\tilde{x}^n_{t-1} \sim q_\phi(x^n_{t-1} \mid \tilde{x}^n_{1:t-2}, y^n_{t-1})$ is sampled from the variational posterior distribution. In practice, we utilize a causal RNN to model the temporal dependence.

#### A.3.2. Training of the ELBO

For training, we utilize a mini-batch stochastic gradient descent algorithm. The gradients of the ELBO with respect to $\theta$ and $\phi$ are calculated as:

$$
\nabla_\theta \mathrm{ELBO} = \nabla_\theta\, \mathbb{E}_{q_\phi(x|y)}\left[\log p_\theta(x, y)\right] = \mathbb{E}_{q_\phi(x|y)}\left[\nabla_\theta \log p_\theta(x, y)\right],
$$

$$
\begin{aligned}
\nabla_\phi \mathrm{ELBO} &= \nabla_\phi\Big(\mathbb{E}_{q_\phi(x|y)}\left[\log p_\theta(x, y)\right] + H(q_\phi(x \mid y))\Big) \\
&= \nabla_\phi\, \mathbb{E}_{q_\phi(x|y)}\left[\log p_\theta(x, y)\right] + \nabla_\phi H(q_\phi(x \mid y)) \\
&= \mathbb{E}_{\epsilon \sim \mathcal{N}}\left[\nabla_\phi \log p_\theta(x_\phi(y, \epsilon), y)\right] + \nabla_\phi H(q_\phi(x \mid y)),
\end{aligned}
$$

where we use the reparameterization trick (Kingma & Welling, 2013) to calculate the gradient $\nabla_\phi\, \mathbb{E}_{q_\phi(x|y)}\left[\log p_\theta(x, y)\right]$. Analyzing both $\nabla_\theta \mathrm{ELBO}$ and $\nabla_\phi \mathrm{ELBO}$, the challenging part is $\nabla_{\theta,\phi} \log p_\theta(x, y)$.
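A sketch of one ELBO ascent step under the reparameterization trick; `encode` and `log_joint` are hypothetical callables standing in for the amortized inference network and for the marginalization of $z, c, e$ via the forward variable.

```python
import torch

def elbo_step(y, encode, log_joint, optimizer):
    """One ascent step on ELBO = E_q[log p(x, y)] + H(q); `encode` and
    `log_joint` are hypothetical stand-ins, for illustration only."""
    mu, logvar = encode(y)                       # q(x | y) parameters
    std = (0.5 * logvar).exp()
    x = mu + std * torch.randn_like(std)         # x = mu + sigma * eps
    # Entropy of a diagonal Gaussian: sum of 0.5 * (1 + log(2*pi) + logvar).
    entropy = 0.5 * (1 + torch.log(torch.tensor(2 * torch.pi)) + logvar).sum()
    loss = -(log_joint(x, y) + entropy)          # minimize the negative ELBO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()
```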
Following (Ansari et al., 2021), the derivative of the log-joint likelihood $\log p(x, y)$ is calculated as:

$$
\begin{aligned}
\nabla \log p(x, y) &= \mathbb{E}_{p(z, c, e \mid x, y)}\left[\nabla \log p(x, y)\right] \\
&= \mathbb{E}_{p(z, c, e \mid x, y)}\left[\nabla \log p(x, y, z, c, e)\right] - \mathbb{E}_{p(z, c, e \mid x, y)}\left[\nabla \log p(z, c, e \mid x, y)\right] \\
&= \mathbb{E}_{p(z, c, e \mid x, y)}\left[\nabla \log p(x, y, z, c, e)\right],
\end{aligned}
$$

where the term $\mathbb{E}_{p(z, c, e \mid x, y)}\left[\nabla \log p(z, c, e \mid x, y)\right]$ vanishes because

$$
\mathbb{E}_{p(z, c, e \mid x, y)}\left[\nabla \log p(z, c, e \mid x, y)\right] = \int p(z, c, e \mid x, y)\, \frac{\nabla p(z, c, e \mid x, y)}{p(z, c, e \mid x, y)}\, d(z, c, e) = \nabla \int p(z, c, e \mid x, y)\, d(z, c, e) = \nabla 1 = 0.
$$

With the Markovian property, we rewrite $\log p(x, y, z, c, e)$ as:

$$
\begin{aligned}
\log p(x, y, z, c, e) &= \log p(x^{1:N}_{1:T}, y^{1:N}_{1:T}, z^{1:N}_{1:T}, c^{1:N}_{1:T}, e^{1:N^2}_{1:T}) \\
&= \log p(y^{1:N}_1 \mid x^{1:N}_1)\, p(x^{1:N}_1 \mid z^{1:N}_1)\, p(z^{1:N}_1) + \sum_{t=2}^{T} \log p(y^{1:N}_t \mid x^{1:N}_t)\, p(x^{1:N}_t \mid x^{1:N}_{t-1}, z^{1:N}_t) \\
&\quad + \sum_{t=2}^{T} \log\left[p(z^{1:N}_t \mid z^{1:N}_{t-1}, x^{1:N}_{t-1}, c^{1:N}_{t-1}, e^{1:N^2}_{t-1})\, p(e^{1:N^2}_t \mid e^{1:N^2}_{t-1}, z^{1:N}_t, x^{1:N}_t)\, p(c^{1:N}_t \mid c^{1:N}_{t-1}, z^{1:N}_{t-1})\right] \\
&= \log\Big[\prod_{n=1}^{N} p(y^n_1 \mid x^n_1) \prod_{n=1}^{N} p(x^n_1 \mid z^n_1)\, p(z^{1:N}_1)\Big] + \sum_{t=2}^{T} \log\Big[\prod_{n=1}^{N} p(y^n_t \mid x^n_t) \prod_{n=1}^{N} p(x^n_t \mid x^n_{t-1}, z^n_t)\Big] \\
&\quad + \sum_{t=2}^{T} \log\Big[\prod_{n=1}^{N}\sum_{m=1}^{N} w^{m\to n}_t\, p(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t) \prod_{m,n} p(e^{m\to n}_t \mid e^{m\to n}_{t-1}, z^{m,n}_t, x^{m,n}_t) \prod_{n=1}^{N} p(c^n_t \mid c^n_{t-1}, z^n_{t-1})\Big],
\end{aligned}
$$

where we model the interactions among objects via $p(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t)$ without instantaneous dependences. Thus, $\nabla \log p(x, y)$ can be written as:

$$
\begin{aligned}
\nabla \log p(x, y) &= \mathbb{E}_{p(z^{1:N}_{1:T}, c^{1:N}_{1:T}, e^{1:N^2}_{1:T} \mid x^{1:N}_{1:T}, y^{1:N}_{1:T})}\left[\nabla \log p(x^{1:N}_{1:T}, y^{1:N}_{1:T}, z^{1:N}_{1:T}, c^{1:N}_{1:T}, e^{1:N^2}_{1:T})\right] \\
&= \sum_{k} p(z^{1:N}_1 = k \mid x^{1:N}_{1:T}, y^{1:N}_{1:T})\, \nabla \log\Big[\prod_{n=1}^{N} p(y^n_1 \mid x^n_1) \prod_{n=1}^{N} p(x^n_1 \mid z^n_1)\, p(z^{1:N}_1 = k)\Big] \\
&\quad + \sum_{k,j,q,p,s,t} \xi(k, j, q, p, s, t)\, \nabla \log\Big[\prod_{n=1}^{N} p(y^n_t \mid x^n_t) \prod_{n=1}^{N} p(x^n_t \mid x^n_{t-1}, z^n_t = k^n)\Big] \\
&\quad + \sum_{k,j,q,p,s,t} \xi(k, j, q, p, s, t)\, \nabla \log\Big[\prod_{n=1}^{N}\sum_{m=1}^{N} w^{m\to n}_t\, p(z^n_t = k^n \mid z^m_{t-1} = j^m, x^{m,n}_{t-1}, c^n_t = q^n, e^{m\to n}_t = s^{m\to n})\Big] \\
&\quad + \sum_{k,j,q,p,s,t} \xi(k, j, q, p, s, t)\, \nabla \log\Big[\prod_{m,n} p(e^{m\to n}_t = s^{m\to n} \mid e^{m\to n}_{t-1} = t^{m\to n}, z^{m,n}_t = j^{m,n}, x^{m,n}_t)\Big] \\
&\quad + \sum_{k,j,q,p,s,t} \xi(k, j, q, p, s, t)\, \nabla \log\Big[\prod_{n=1}^{N} p(c^n_t = q^n \mid c^n_{t-1} = p^n, z^n_{t-1} = j^n)\Big] \\
&= \sum_{k} \gamma(k)\, \nabla \log\left[B_1(k)\, \pi(k)\right] + \sum_{k,j,q,p,s,t} \xi(k, j, q, p, s, t)\, \nabla \log\left[B_t(k)\, A_t(k, j, q, s)\, E_t(j, s, t)\, C_t(q, p, j)\right],
\end{aligned}
$$

with

$$
\begin{aligned}
\pi(k) &= p(z^{1:N}_1 = k), \qquad \gamma(k) = p(z^{1:N}_1 = k \mid x^{1:N}_{1:T}, y^{1:N}_{1:T}), \\
\xi(k, j, q, p, s, t) &= p(z^{1:N}_t = k,\, z^{1:N}_{t-1} = j,\, c^{1:N}_t = q,\, c^{1:N}_{t-1} = p,\, e^{1:N^2}_t = s,\, e^{1:N^2}_{t-1} = t \mid x^{1:N}_{1:T}, y^{1:N}_{1:T}), \\
B_t(k) &= \prod_{n=1}^{N} p(y^n_t \mid x^n_t) \prod_{n=1}^{N} p(x^n_t \mid x^n_{t-1}, z^n_t = k^n), \\
A_t(k, j, q, s) &= \prod_{n=1}^{N}\sum_{m=1}^{N} w^{m\to n}_t\, p(z^n_t = k^n \mid z^m_{t-1} = j^m, x^{m,n}_{t-1}, c^n_t = q^n, e^{m\to n}_t = s^{m\to n}), \\
E_t(j, s, t) &= \prod_{m,n} p(e^{m\to n}_t = s^{m\to n} \mid e^{m\to n}_{t-1} = t^{m\to n}, z^{m,n}_t = j^{m,n}, x^{m,n}_t), \\
C_t(q, p, j) &= \prod_{n=1}^{N} p(c^n_t = q^n \mid c^n_{t-1} = p^n, z^n_{t-1} = j^n).
\end{aligned}
$$

$\pi(k)$ is the initial joint discrete mode probability; $\prod_{n} p(y^n_1 \mid x^n_1)$ and $\prod_{n} p(y^n_t \mid x^n_t)$ are the emission probabilities; $\prod_{n} p(x^n_1 \mid z^n_1)$ and $\prod_{n} p(x^n_t \mid x^n_{t-1}, z^n_t = k^n)$ are the continuous state transition probabilities conditioned on the different types of discrete modes $k^n$; $A_t(k, j, q, s)$ is the discrete mode transition probability. Besides, $p(z^{1:N}_1 = k \mid x^{1:N}_{1:T}, y^{1:N}_{1:T})$ and $\xi(k, j, q, p, s, t)$ can be calculated similarly to the forward and backward algorithm in HMMs (Collins, 2013), which is detailed in the next section.

#### A.3.3. Forward and Backward Algorithm

In this section, we calculate the posterior probability of the discrete mode, count, and edge variables $z$, $c$, and $e$ conditioned on the observations $y$ and the approximate continuous states $\tilde{x}$:

$$
p(z_t, c_t, e_t \mid \tilde{x}_{1:T}, y_{1:T}) \propto p(z_t, c_t, e_t, \tilde{x}_{1:T}, y_{1:T}) = \underbrace{p(z_t, c_t, e_t, \tilde{x}_{1:t}, y_{1:t})}_{\text{Forward}}\; \underbrace{p(\tilde{x}_{t+1:T}, y_{t+1:T} \mid \tilde{x}_t, z_t, c_t, e_t)}_{\text{Backward}} = \alpha_t(z_t, c_t)\, \beta_t(z_t, c_t).
$$
The forward variable $\alpha_t(z_t, c_t)$ can be expanded as:

$$
\begin{aligned}
\alpha_1(z_1, c_1) &= p(z_1, c_1, e_1, \tilde{x}_1, y_1) = p(z^{1:N}_1, c^{1:N}_1, e^{1:N^2}_1, \tilde{x}^{1:N}_1, y^{1:N}_1) \\
&= \delta_{c^{1:N}_1 = 1}\, p(z^{1:N}_1)\, p(e^{1:N^2}_1)\, p(\tilde{x}^{1:N}_1 \mid z^{1:N}_1)\, p(y^{1:N}_1 \mid \tilde{x}^{1:N}_1) \\
&= \delta_{c^{1:N}_1 = 1}\, p(z^{1:N}_1)\, p(e^{1:N^2}_1) \prod_{n=1}^{N} p(\tilde{x}^n_1 \mid z^n_1) \prod_{n=1}^{N} p(y^n_1 \mid \tilde{x}^n_1),
\end{aligned}
$$

$$
\begin{aligned}
\alpha_t(z_t, c_t) &= p(z_t, c_t, e_t, \tilde{x}_{1:t}, y_{1:t}) = p(z^{1:N}_t, c^{1:N}_t, e^{1:N^2}_t, \tilde{x}^{1:N}_{1:t}, y^{1:N}_{1:t}) \\
&= \sum_{z^{1:N}_{t-1},\, c^{1:N}_{t-1}} p(z^{1:N}_t, c^{1:N}_t, e^{1:N^2}_t, \tilde{x}^{1:N}_{1:t}, y^{1:N}_{1:t}, z^{1:N}_{t-1}, c^{1:N}_{t-1}) \\
&= \sum_{z^{1:N}_{t-1},\, c^{1:N}_{t-1}} p(z^{1:N}_{t-1}, c^{1:N}_{t-1}, e^{1:N^2}_{t-1}, \tilde{x}^{1:N}_{1:t-1}, y^{1:N}_{1:t-1})\, p(c^{1:N}_t \mid c^{1:N}_{t-1}, z^{1:N}_{t-1})\, p(z^{1:N}_t \mid z^{1:N}_{t-1}, \tilde{x}^{1:N}_{t-1}, c^{1:N}_{t-1}, e^{1:N^2}_{t-1}) \\
&\qquad\qquad \times p(e^{1:N^2}_t \mid e^{1:N^2}_{t-1}, z^{1:N}_t, \tilde{x}^{1:N}_t)\, p(\tilde{x}^{1:N}_t \mid \tilde{x}^{1:N}_{t-1}, z^{1:N}_t)\, p(y^{1:N}_t \mid \tilde{x}^{1:N}_t) \\
&= \sum_{z^{1:N}_{t-1},\, c^{1:N}_{t-1}} \alpha_{t-1}(z_{t-1}, c_{t-1}) \prod_{n=1}^{N} p(c^n_t \mid c^n_{t-1}, z^n_{t-1}) \prod_{n=1}^{N}\sum_{m=1}^{N} w^{m\to n}_t\, p(z^n_t \mid z^m_{t-1}, x^{m,n}_{t-1}, c^n_t, e^{m\to n}_t) \\
&\qquad\qquad \times \prod_{m,n} p(e^{m\to n}_t \mid e^{m\to n}_{t-1}, z^{m,n}_t, x^{m,n}_t) \prod_{n=1}^{N} p(\tilde{x}^n_t \mid \tilde{x}^n_{t-1}, z^n_t) \prod_{n=1}^{N} p(y^n_t \mid \tilde{x}^n_t),
\end{aligned}
$$

so $\alpha_t(z_t, c_t)$ can be expressed through $\alpha_{t-1}(z_{t-1}, c_{t-1})$ recursively via the variable transitions and emissions.

The backward variable $\beta_t(z_t, c_t)$ can be expanded as:

$$
\beta_T(z_T, c_T) = 1,
$$

$$
\begin{aligned}
\beta_t(z_t, c_t) &= p(\tilde{x}_{t+1:T}, y_{t+1:T} \mid \tilde{x}_t, z_t, c_t, e_t) = p(\tilde{x}^{1:N}_{t+1:T}, y^{1:N}_{t+1:T} \mid \tilde{x}^{1:N}_t, z^{1:N}_t, c^{1:N}_t, e^{1:N^2}_t) \\
&= \sum_{z^{1:N}_{t+1},\, c^{1:N}_{t+1}} p(\tilde{x}^{1:N}_{t+1:T}, y^{1:N}_{t+1:T}, z^{1:N}_{t+1}, c^{1:N}_{t+1} \mid \tilde{x}^{1:N}_t, z^{1:N}_t, c^{1:N}_t, e^{1:N^2}_t) \\
&= \sum_{z^{1:N}_{t+1},\, c^{1:N}_{t+1}} p(c^{1:N}_{t+1} \mid c^{1:N}_t, z^{1:N}_t)\, p(z^{1:N}_{t+1} \mid z^{1:N}_t, \tilde{x}^{1:N}_t, c^{1:N}_t, e^{1:N^2}_t)\, p(\tilde{x}^{1:N}_{t+1} \mid \tilde{x}^{1:N}_t, z^{1:N}_{t+1}) \\
&\qquad\qquad \times p(e^{1:N^2}_{t+1} \mid e^{1:N^2}_t, z^{1:N}_{t+1}, \tilde{x}^{1:N}_{t+1})\, p(y^{1:N}_{t+1} \mid \tilde{x}^{1:N}_{t+1})\, p(\tilde{x}^{1:N}_{t+2:T}, y^{1:N}_{t+2:T} \mid \tilde{x}^{1:N}_{t+1}, z^{1:N}_{t+1}, c^{1:N}_{t+1}, e^{1:N^2}_{t+1}) \\
&= \sum_{z^{1:N}_{t+1},\, c^{1:N}_{t+1}} \prod_{n=1}^{N} p(c^n_{t+1} \mid c^n_t, z^n_t) \prod_{n=1}^{N}\sum_{m=1}^{N} w^{m\to n}_{t+1}\, p(z^n_{t+1} \mid z^m_t, x^{m,n}_t, c^n_{t+1}, e^{m\to n}_{t+1}) \prod_{n=1}^{N} p(\tilde{x}^n_{t+1} \mid \tilde{x}^n_t, z^n_{t+1}) \\
&\qquad\qquad \times \prod_{m,n} p(e^{m\to n}_{t+1} \mid e^{m\to n}_t, z^{m,n}_{t+1}, x^{m,n}_{t+1}) \prod_{n=1}^{N} p(y^n_{t+1} \mid \tilde{x}^n_{t+1})\; \beta_{t+1}(z_{t+1}, c_{t+1}),
\end{aligned}
$$

so $\beta_t(z_t, c_t)$ can be computed from $\beta_{t+1}(z_{t+1}, c_{t+1})$ recursively via the variable transitions and emissions.
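The recursions above follow the standard forward-backward pattern; below is a generic log-space sketch over a single collapsed discrete index (standing in for the joint $(z_t, c_t)$ state), with per-step transition matrices as used above.

```python
import torch

def forward_backward(log_init, log_trans, log_emit):
    """Generic log-space forward-backward over a discrete chain.

    log_init:  (S,) log p(s_1) plus the log-likelihood of observation 1
    log_trans: (T-1, S, S) per-step log transition matrices (rows: s_{t-1})
    log_emit:  (T-1, S) log-likelihood of observation t+1 under each state
    """
    T, S = log_emit.shape[0] + 1, log_init.shape[0]
    alpha = [log_init]
    for t in range(T - 1):   # alpha_{t+1} = emit + logsumexp over s_t
        alpha.append(log_emit[t] +
                     torch.logsumexp(alpha[-1][:, None] + log_trans[t], dim=0))
    beta = [torch.zeros(S)]  # beta_T = log 1 = 0
    for t in reversed(range(T - 1)):
        beta.insert(0, torch.logsumexp(
            log_trans[t] + (log_emit[t] + beta[0])[None, :], dim=1))
    log_evidence = torch.logsumexp(alpha[-1], dim=0)
    # Smoothed posterior p(s_t | all observations) from alpha_t * beta_t.
    posterior = torch.softmax(torch.stack(alpha) + torch.stack(beta), dim=-1)
    return log_evidence, posterior

# Toy usage: T = 5 steps, S = 4 collapsed (mode, count) states.
T, S = 5, 4
log_ev, post = forward_backward(torch.log_softmax(torch.randn(S), 0),
                                torch.log_softmax(torch.randn(T - 1, S, S), -1),
                                torch.randn(T - 1, S))
```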
### A.4. Further Modelling of Interactions between Continuous Variables

In the main paper, we model interactions between objects through their dependence on the discrete mode variables only. This means that, given the derived discrete mode transition, the continuous state transition $p(x^n_t \mid x^n_{t-1}, z^n_t)$ and the observation emission $p(y^n_t \mid x^n_t)$ are per-object dynamics only, without interactions. However, in some real-world scenarios, the interactions between objects also act on the continuous variables. For example, within each motion type, object A may still influence the detailed motion of object B. We show some preliminary results in this direction and leave more comprehensive experiments as future work.

## B. More Experiments

### B.1. New splitting and larger ODE-driven Particle datasets

In our original ODE-driven Particle dataset, we used around 5k samples for training and around 200 samples each for validation and testing. We tested the scalability of our method by scaling to one larger (approximately 20x larger) dataset. The original dataset takes 37,000 epochs to converge, and the final performance of our GRASS model is 0.528, 0.519, 0.794, and 0.790 for NMI, ARI, Accuracy, and F1, respectively. The 20x larger dataset takes 39,000 epochs, and the final performance of our GRASS model is 0.525, 0.531, 0.814, and 0.802. We find that the training time before convergence and the performance of our model are almost the same, which shows the scalability of our method to larger datasets.

The splitting strategy of the synthesized dataset follows the recent state-of-the-art method, REDSDS (Ansari et al., 2021). REDSDS has 10,000 and 500 samples for training and testing of the 3-mode system (test data is around 5% of the training data). We follow this proportion and have 4,928 samples for training and 204 samples for testing (around 5%). For the ODE-driven Particle dataset, we also conduct a new splitting (4200/420/420 for training/validation/testing). The results of our GRASS model on the new splitting are 0.522, 0.518, 0.809, and 0.805 for NMI, ARI, Accuracy, and F1, respectively, showing almost the same performance as the original splitting in the main paper.

### B.2. Ablation studies with standard deviations

Ablation studies with standard deviations are in Tables 7, 8, 9, and 10. We can see that the conclusions remain the same as in the main paper for the ablation studies on different numbers of objects, different numbers of interactions, with or without interactions, and different numbers of predefined modes. Note that in Table 7 and Table 8, we can see that with different numbers of objects or interactions, GRASS has consistently better performance with the lowest variances.

Table 7. Analyses of different numbers of objects on the ODE-driven Particle dataset, while increasing the average number of interactions per object per time series, i.e., 2.3 interactions for 3 particles, 6.1 for 5, and 12.5 for 10. Entries denote NMI / F1.

| Method | 3 particles | 5 particles | 10 particles |
|---|---|---|---|
| rSLDS | 0.257±0.023 / 0.443±0.041 | 0.252±0.033 / 0.437±0.039 | 0.246±0.027 / 0.430±0.045 |
| SNLDS | 0.368±0.027 / 0.664±0.053 | 0.361±0.031 / 0.656±0.042 | 0.354±0.035 / 0.651±0.059 |
| REDSDS | 0.418±0.016 / 0.701±0.027 | 0.411±0.023 / 0.692±0.029 | 0.405±0.024 / 0.687±0.022 |
| MOSDS (this paper) | 0.469±0.020 / 0.757±0.032 | 0.461±0.024 / 0.752±0.027 | 0.456±0.029 / 0.748±0.035 |
| GRASS (this paper) | 0.528±0.014 / 0.790±0.021 | 0.524±0.019 / 0.784±0.025 | 0.519±0.021 / 0.781±0.018 |

Table 8. Analyses of different numbers of objects on the ODE-driven Particle dataset, while fixing the average number of interactions per object per time series to 2.3. Entries denote NMI / F1.

| Method | 3 particles | 5 particles | 10 particles |
|---|---|---|---|
| rSLDS | 0.257±0.023 / 0.443±0.041 | 0.262±0.034 / 0.444±0.037 | 0.253±0.028 / 0.437±0.042 |
| SNLDS | 0.368±0.027 / 0.664±0.053 | 0.365±0.030 / 0.666±0.047 | 0.362±0.028 / 0.659±0.051 |
| REDSDS | 0.418±0.016 / 0.701±0.027 | 0.423±0.023 / 0.706±0.031 | 0.413±0.022 / 0.694±0.028 |
| MOSDS (this paper) | 0.469±0.020 / 0.757±0.032 | 0.471±0.025 / 0.763±0.036 | 0.464±0.021 / 0.754±0.035 |
| GRASS (this paper) | 0.528±0.014 / 0.790±0.021 | 0.530±0.012 / 0.792±0.019 | 0.524±0.017 / 0.786±0.024 |

Table 9. Analyses of robustness to datasets without interactions on the ODE-driven Particle dataset. Entries denote NMI / F1.

| Method | dataset w/ interaction | dataset w/o interaction |
|---|---|---|
| rSLDS | 0.257±0.023 / 0.443±0.041 | 0.471±0.024 / 0.686±0.035 |
| SNLDS | 0.368±0.027 / 0.664±0.053 | 0.534±0.032 / 0.772±0.046 |
| REDSDS | 0.418±0.016 / 0.701±0.027 | 0.579±0.013 / 0.838±0.022 |
| MOSDS (this paper) | 0.469±0.020 / 0.757±0.032 | 0.563±0.027 / 0.817±0.039 |
| GRASS (this paper) | 0.528±0.014 / 0.790±0.021 | 0.573±0.008 / 0.826±0.018 |
Table 10. Analyses of robustness to different maximal numbers of predefined modes. Entries denote NMI / F1.

| Method | 3 modes | 5 modes | 10 modes |
|---|---|---|---|
| rSLDS | 0.257±0.023 / 0.443±0.041 | 0.253±0.025 / 0.438±0.043 | 0.248±0.032 / 0.436±0.047 |
| SNLDS | 0.368±0.027 / 0.664±0.053 | 0.365±0.032 / 0.661±0.047 | 0.362±0.036 / 0.657±0.059 |
| REDSDS | 0.418±0.016 / 0.701±0.027 | 0.415±0.023 / 0.696±0.035 | 0.413±0.026 / 0.694±0.031 |
| MOSDS (this paper) | 0.469±0.020 / 0.757±0.032 | 0.466±0.028 / 0.759±0.037 | 0.462±0.033 / 0.754±0.042 |
| GRASS (this paper) | 0.528±0.014 / 0.790±0.021 | 0.532±0.020 / 0.794±0.025 | 0.527±0.022 / 0.784±0.026 |