# on_the_identifiability_of_switching_dynamical_systems__b01671a3.pdf On the Identifiability of Switching Dynamical Systems Carles Balsells-Rodas 1 Yixin Wang 2 Yingzhen Li 1 The identifiability of latent variable models has received increasing attention due to its relevance in interpretability and out-of-distribution generalisation. In this work, we study the identifiability of Switching Dynamical Systems, taking an initial step toward extending identifiability analysis to sequential latent variable models. We first prove the identifiability of Markov Switching Models, which commonly serve as the prior distribution for the continuous latent variables in Switching Dynamical Systems. We present identification conditions for first-order Markov dependency structures, whose transition distribution is parametrised via non-linear Gaussians. We then establish the identifiability of the latent variables and non-linear mappings in Switching Dynamical Systems up to affine transformations, by leveraging identifiability analysis techniques from identifiable deep latent variable models. We finally develop estimation algorithms for identifiable Switching Dynamical Systems. Throughout empirical studies, we demonstrate the practicality of identifiable Switching Dynamical Systems for segmenting high-dimensional time series such as videos, and showcase the use of identifiable Markov Switching Models for regime-dependent causal discovery in climate data. 1. Introduction State-space models (SSMs) are well-established sequence modelling techniques where their linear versions have been extensively studied (Lindgren, 1978; Poritz, 1982; Hamilton, 1989). Meanwhile, recurrent neural networks (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) have gained high popularity for sequence modelling thanks to their abilities in capturing non-linear and long-term dependencies. Never- 1Imperial College London 2University of Michigan. Correspondence to: Carles Balsells-Rodas . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). theless, significant progress has been made on fusing neural networks with SSMs, with Gu et al. (2022) as one of the latest examples. Many of these advances focus on building sequential latent variable models (LVMs) as flexible deep generative models (Chung et al., 2015; Li & Mandt, 2018; Babaeizadeh et al., 2018; Saxena et al., 2021), where SSMs have been incorporated as latent dynamical priors (Johnson et al., 2016; Linderman et al., 2016; 2017; Fraccaro et al., 2017; Dong et al., 2020; Ansari et al., 2021; Smith et al., 2023). Despite the efforts in designing flexible SSM priors and developing stable training schemes, theoretical properties such as identifiability for sequential generative models are less studied, contrary to early literature on linear SSMs. Identifiability, in general, establishes a one-to-one correspondence between the data likelihood and the model parameters (or latent variables), or an equivalence class of the latter. In causal discovery (Peters et al., 2017), identifiability refers to whether the underlying causal structure can be correctly pinpointed from infinite observational data. In independent component analysis (ICA, Comon (1994)), identifiability analysis focuses on both the latent sources and the mapping from the latents to the observed. 
While general non-linear ICA is ill-defined (Hyvärinen & Pajunen, 1999), recent results show that identifiability can be achieved using conditional priors (Khemakhem et al., 2020). For deep (non-temporal) latent variable models, the required access to auxiliary variables can be relaxed using a finite mixture prior (Kivva et al., 2022). Recent works have attempted to extend these results to sequential models using non-linear ICA (Hyvarinen & Morioka, 2017; Hyvarinen et al., 2019; Hälvä & Hyvarinen, 2020; Hälvä et al., 2021), or latent causal processes (Yao et al., 2022a;b; Lippe et al., 2023b;a). In this work, we develop an identifiability analysis for Switching Dynamical Systems (SDSs) a class of sequential LVMs with SSM priors that allow regime-switching behaviours, where their associated inference methods have been explored recently (Dong et al., 2020; Ansari et al., 2021). Our approach differs fundamentally from non-linear ICA since the latent variables are no longer independent. To address this challenge, we first construct the latent dynamical prior using Markov Switching Models (MSMs) (Hamilton, 1989) an extension of Hidden Markov Models (HMMs) with autoregressive connections, for which we provide the first identifiability analysis of this model On the Identifiability of Switching Dynamical Systems family in non-linear transition settings. This also allows regime-dependent causal discovery (Saggioro et al., 2020) thanks to having access to the transition derivatives. We then extend the identifiability results to SDSs using recent identifiability analysis techniques for the non-temporal deep latent variable models (Kivva et al., 2022). Importantly, the identifiability conditions we provide are less restrictive, significantly broadening their applicability. In contrast,Yao et al. (2022a;b); Lippe et al. (2023a) require auxiliary information to handle regime-switching effects or distribution shifts, and Hälvä et al. (2021) imposes assumptions on the temporal correlations such as stationarity (Hyvarinen & Morioka, 2017). Moreover, unlike the majority of existing works (Hälvä et al., 2021; Yao et al., 2022a;b), our identifiability analysis does not require the injectivity of the decoder mapping. Below we summarise our main contributions in both theoretical and empirical forms: We present conditions in which first-order MSMs with non-linear Gaussian transitions are identifiable up to permutations (Section 3.1). We further provide the first analysis of identifiability conditions for non-parametric first-order MSMs in Appendix B. We further provide conditions for SDS s identifiability up to affine transformations of the latent variables and non-linear emission (Section 3.2). We demonstrate the effectiveness of identifiable SDSs on causal discovery and sequence modelling tasks. These include discovery of time-dependent causal structures in climate data (Section 6.2), and time-series segmentation in videos (Section 6.3). 2. Background 2.1. Identifiable Latent Variable Models In the non-temporal case, many works explore identifiability of latent variable models (Khemakhem et al., 2020; Kivva et al., 2022). Specifically, consider a generative model where its latent variables z Rm are drawn from a Gaussian mixture prior with K components (K < + ). Then z is transformed via a (noisy) piece-wise linear mapping to obtain the observation x Rn, n m: x = f(z) + ϵ, ϵ N(0, Σ), (1) k=1 p(s = k)N (z|µk, Σk) . (2) Kivva et al. 
(2022) established that if the transformation f is weakly injective (see definition below), the prior distribution p(z) is identifiable up to affine transformations1 from the observations. If we further assume f is continuous and injective, both prior distribution p(z) and non-linear mapping 1See Def. 2.2 in Kivva et al. (2022) or the equivalent adapted to SDSs (Def. 3.4). Figure 1: The generative model considered in this work, where the MSM is indicated in green and the SDS is indicated in red. The dashed arrows indicate additional dependencies which are accommodated by our theoretical results. f are identifiable up to affine transformations. In Section 3.2 we extend these results to establish identifiability for SDSs using similar proof strategies. Definition 2.1. A mapping f : Rm Rn is said to be weakly injective if (i) there exsists x0 Rn and δ > 0 s.t. |f 1({x})| = 1 for every x B(x0, δ) f(Rm), and (ii) {x Rn : |f 1({x})| > 1} f(Rm) has measure zero with respect to the Lebesgue measure on f(Rm).2. Here the notations are extended to sets, i.e., for a set B, f(B) := {f(x) : x B} and f 1(B) := {x : f(x) B}. Definition 2.2. f is injective if |f 1({x})| = 1, x f(Rm). 2.2. Switching Dynamical Systems A Switching Dynamical System (SDS), with an example graphical model illustrated in Figure 1, is a sequential latent variable model with its dynamics governed by both discrete and continuous latent states, st {1, . . . , K}, zt Rm, respectively. At each time step t, the discrete state st determines the regime of the dynamical prior that the current continuous latent variable zt should follow, and the observation xt is generated from zt using a (noisy) non-linear transformation. This gives the following probabilistic model for the states and the observation variables: pθ(x1:T , z1:T , s1:T ) = pθ(s1:T )pθ(z1:T |s1:T ) t=1 pθ(xt|zt). (3) As pθ(x1:T ) is intractable, recent works (Dong et al., 2020; Ansari et al., 2021) have developed inference techniques based on variational inference (Kingma & Welling, 2014). These works consider an additional dependency on the switch st from xt for more expressive segmentations, which is not considered in our theoretical analysis for simplicity. 2B(x0, δ) = {x Rn : ||x x0|| < δ} On the Identifiability of Switching Dynamical Systems For the latent dynamic prior pθ(z1:T ), we consider a Markov Switching Model (MSM) which has also been referred to as autoregressive HMM (Ephraim & Roberts, 2005). This type of prior uses the latent switch st to condition the distribution of zt at each time-step, and the conditional dynamic model of z1:T given s1:T follows an autoregressive process. Under first-order Markov assumption for this conditional auto-regressive processes, this leads to the following probabilistic model for the prior: pθ(z1:T ) = X s1:T pθ(s1:T )pθ(z1:T |s1:T ), (4) pθ(z1:T |s1:T ) = pθ(z1|s1) t=2 pθ(zt|zt 1, st). (5) Note that the structure of the discrete latent state prior pθ(s1:T ) is not specified, and the identifiability results presented in the next section do not require further assumptions herein. As illustrated in Figure 1 (solid lines), in experiments we use a first-order Markov process for pθ(s1:T ), described by a transition matrix Q RK K such that pθ(st = j|st 1 = i) = Qij, and an initial distribution pθ(s1). In such case we also provide identifiability guarantees for the Q matrix and the initial distribution. 3. Identifiability Theory for MSMs & SDSs This section establishes the identifiability of the SDS model (Eq. 
(3)) with MSM latent prior (Eq. (4)) under suitable assumptions. We address this challenge by leveraging ideas from Kivva et al. (2022) which uses finite mixture prior for static data generative models; importantly, this theory relies on the use of identifiable finite mixture priors (up to mixture component permutations). Inspired by this result, we first establish in Section 3.1 the identifiability of the Markov switching model pθ(z1:T ) as a finite mixture prior, which then allows us to extend the results of Kivva et al. (2022) to the temporal setting in Section 3.2 to prove identifiability of the SDS model. We drop the subscript θ for simplicity. 3.1. Identifiable Markov Switching Models The MSM p(z1:T ) has an equivalent formulation as a finite mixture model. Suppose the discrete states satisfy st {1, ..., K} with K < + , then for any given T < + , one can define a bijective path indexing function φ : {1, ..., KT } {1, ..., K}T such that each i {1, ..., KT } uniquely retrieves a set of states s1:T = φ(i). Then we can use ci = p(s1:T = φ(i)) to represent the joint probability of the states s1:T = φ(i) under p. Let us further define the family of initial and transition distributions for the continuous states zt: ΠA := {pa(z1)|a A}, PA := {pa(zt|zt 1)|a A}. (6) where A is an index set satisfying mild measure-theoretic conditions (Appendix A). Note PA assumes first-order Markov dynamics. Then, we can construct the family of first-order MSMs as MT (ΠA, PA) := i=1 cipai 1(z1) t=2 pai t(zt|zt 1) | K < + , pai 1 ΠA, pai t PA, t 2, ai t A, ai 1:T = aj 1:T , i = j, Since Eq. (7) requires ai 1:T = aj 1:T for any i = j, this also builds an injective mapping ϕ(i) = ai 1:T from i {1, ..., KT } to A. Combined with the path indexing function, this establishes an injective mapping ϕ φ 1 to uniquely map a set of states s1:T to the a1:T indices, and we can view pai 1(z1) and pai t(zt|zt 1) as equivalent notations of p(z1|s1) and p(zt|zt 1, st) respectively for s1:T = φ(i). This notation shows that the MSM extends finite mixture models to temporal settings as a finite mixture of KT trajectories composed by distributions in ΠA and PA. Having established the finite mixture model view of Markov switching models, we will use this notation of MSM in the rest of Section 3.1, as we will use finite mixture modelling techniques to establish its identifiability. In detail, we first define the identification of MT (ΠA, PA) as follows. Definition 3.1. The family of first-order MSMs MT (ΠA, PA) is said to be identifiable up to permutations, when for p1, p2 MT (ΠA, PA) with p1(z1:T ) = PKT i=1 cipai 1(z1) QT t=2 pai t(zt|zt 1), p2(z1:T ) = P ˆ KT i=1 ˆcipˆai 1(z1) QT t=2 pˆai t(zt|zt 1), p1(z1:T ) = p2(z1:T ), z1:T RT m, iff K = ˆK and for each 1 i KT there is some 1 j ˆKT s.t. 1. ci = ˆcj; 2. if ai t1 = ai t2 for t1, t2 2 and t1 = t2, then ˆaj t1 = ˆaj t2; 3. pai t(zt|zt 1) = pˆaj t(zt|zt 1), t 2, zt Rm; 4. pai 1(z1) = pˆaj 1(z1), z1 Rm. The 2nd requirement eliminates the permutation equivalence of e.g., s1:4 = (1, 2, 3, 2); ˆs1:4 = (3, 1, 2, 3) which would be valid in the finite mixture case with vector indexing. For the purpose of building deep generative models, we seek to define identifiable parametric families and defer the study of the non-parametric case in Appendix B. 
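To make the finite-mixture view above concrete, the following is a minimal NumPy sketch (all parameter values are hypothetical) that evaluates p(z_{1:T}) by explicitly enumerating the K^T state trajectories through the path indexing function φ, as in Eq. (7). It is purely illustrative: the enumeration is exponential in T and only feasible for tiny K and T (Section 4 instead uses the forward algorithm), and the particular transition means below are one arbitrary analytic choice; the parametric family actually used in the remainder of this section is introduced next.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, T, m = 2, 3, 2                                  # states, sequence length, latent dim

# Illustrative per-state parameters (hypothetical): initial Gaussians pi_k(z1)
# and non-linear Gaussian transitions p_k(z_t | z_{t-1}) with analytic means.
mu0 = rng.normal(size=(K, m))
W = rng.normal(size=(K, m, m))
sigma2 = 0.1                                       # shared isotropic covariance

def init_logpdf(k, z1):
    return multivariate_normal.logpdf(z1, mean=mu0[k], cov=sigma2 * np.eye(m))

def trans_logpdf(k, z_prev, z_t):
    mean = np.tanh(z_prev @ W[k])                  # analytic, state-dependent mean
    return multivariate_normal.logpdf(z_t, mean=mean, cov=sigma2 * np.eye(m))

# One valid choice for p(s_{1:T}): a first-order Markov chain
# (the identifiability theory places no restriction on this prior).
Q = np.full((K, K), 1.0 / K)
p_s1 = np.full(K, 1.0 / K)

def path_prob(path):                               # c_i = p(s_{1:T} = phi(i))
    logp = np.log(p_s1[path[0]])
    for s_prev, s_t in zip(path[:-1], path[1:]):
        logp += np.log(Q[s_prev, s_t])
    return np.exp(logp)

def msm_density(z):                                # z has shape (T, m)
    # Finite-mixture view: sum over all K**T state trajectories phi(i).
    total = 0.0
    for path in itertools.product(range(K), repeat=T):   # plays the role of phi
        logp = init_logpdf(path[0], z[0])
        for t in range(1, T):
            logp += trans_logpdf(path[t], z[t - 1], z[t])
        total += path_prob(path) * np.exp(logp)
    return total

z = rng.normal(size=(T, m))
print(f"p(z_1:T) as a mixture of {K**T} trajectories: {msm_density(z):.4e}")
```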
In particular we use a non-linear Gaussian transition family as follows: GA = {pa(zt|zt 1) = N(zt; m(zt 1, a), Σ(zt 1, a)) | a A, zt, zt 1 Rm}, (8) On the Identifiability of Switching Dynamical Systems where m(zt 1, a) and Σ(zt 1, a) are non-linear w.r.t. zt 1 and denote the mean and covariance matrix, respectively. We further require the unique indexing assumption: a = a A, zt 1 Rd, s.t. m(zt 1, a) = m(zt 1, a ) or Σ(zt 1, a) = Σ(zt 1, a ). (9) In other words, for such zt 1, pa(zt|zt 1) and pa (zt|zt 1) are two different Gaussian distributions. We also introduce a family of initial distributions with unique indexing: IA := {pa(z1) = N(z1; µ(a), Σ1(a)) | a A}, (10) a = a A µ(a) = µ(a ) or Σ1(a) = Σ1(a ). (11) The above Gaussian families paired with unique indexing assumptions satisfy conditions which favour identifiability of first-order MSMs under non-linear Gaussian transitions. Theorem 3.2. Define the following first-order Markov switching model family under the non-linear Gaussian families, MT NL = MT (IA, GA) with GA, IA defined by Eqs. (8), (10) respectively. Then, the MSM is identifiable in terms of Def. 3.1 under the following assumptions: (a1) Unique indexing for GA and IA: Eqs. (9), (11) hold; (a2) For any a A, the mean and covariance in GA, m( , a) : Rm Rm and Σ( , a) : Rm Rm m, are analytic functions. The proof strategy can be summarised in 4 steps. (1) Under the finite mixture model view, it suffices to show {pai 1:T (z1:T ) | ai 1:T A A} contains linearly independent functions (Yakowitz & Spragins, 1968). (2) Since pai 1:T (z1:T ) is first-order Markov, we just need to find conditions for the linear independence of {pai 1:2(z1:2)}, and prove T 3 cases by induction. (3) For non-parametric {pai 1:2(z1:2)} case, we specify conditions on PA = {pa(zt|zt 1)} and ΠA = {pa(z1)} to ensure linear independence of {pai 1:2(z1:2)}. (4) We show the MSM family with (non-linear) Gaussian transitions MT NL = MT (IA, GA) is a special case of (3) when assuming (a1 - a2). Intuitively, switches can be identified as discontinuities in the dynamic model and are fully controlled by the discrete states (assumptions a1, a2). Also assumption (a2) means m( , a) and m( , a ) with a = a can only intercept at a set of points with zero Lebesgue measure (similarly for Σ( , a)), and this smoothness property contributes to the identifiability for the continuous-state transitions pa(zt|zt 1). The central part of our proof strategy involves showing linear independence in products of consecutive variables; e.g., ΠA PA PA = pa1(z1)pa2(z2|z1)pa3(z3|z2) , Figure 2: Illustration of the intuition behind Lemma B.3, where linear independence holds if, for any pair of functions (shown in green and purple), the intersection in the domain of the conditioned variable (zt 1) is zero-measured. where the overlapping variable (indicated with different colours) challenges linear independence for any consecutive product of linearly independent functions. Figure 2 illustrates the intuition behind our main linear independence result, Lemma B.3, showing that linear independence of the product function can be achieved, as long as for the function family of each component, linear dependence only happens in zero-measure sets. Our result only requires the above to hold in at least a non-zero measured set of the overlapping variable space, and it connects to non-linear Gaussians via assumption (d3) in Theorem B.8. 
However, the parametric assumption (a2) is stronger, as it ensures the non-zero measure intersection is satisfied almost everywhere. The following indications of Thm. 3.2, when combined, show the profound expressivity of identifiable MSMs: Assumption (a2) enables parameterising pa(zt|zt 1) with e.g., polynomials and neural networks (NNs) with analytic activations (e.g. Soft Plus). For NNs, identifiability applies to the functional form only, since NN weights do not uniquely index NN functions. The identifiability result holds independently of the choice of p(s1:T ), allowing e.g., non-stationarity and higher-order Markov dependencies. Our experiments consider p(s1:T ) to be stationary and firstorder Markov, where its state transitions are also identifiable (Appendix C). We leave other cases to future work. Remark. Key to the full proof is the establishment of step (3), which is proved via a new theoretical result on linear independence (Lemma B.3), and it can be of independent interest. Unlike Hidden Markov Models (HMMs), in MSMs zt and zt+1 are not conditionally independent given the discrete states. Therefore, seminal HMM identifiability results (Allman et al., 2009; Gassiat et al., 2016), which exploit conditional independence of zt and zt+1 in HMMs and leverage Kruskal s 3-way arrays technique (Kruskal, 1977), do not apply to non-parametric MSMs, rendering our development of new theory (Lemma B.3) necessary. On the Identifiability of Switching Dynamical Systems Kivva et al (2022) Khemakhem et al. (2020) Figure 3: We assume zt is transformed via f with noise ϵt at each time-step t independently. We view this as a transformation on z1:T via a factored F with noise E. 3.2. Identifiable Switching Dynamical Systems Based on identifiable MSMs, our analysis of Switching Dynamical Systems extends the setup from Kivva et al. (2022) (Section 2.1) to the temporal case. Assume the prior dynamic model p(z1:T ) belongs to the non-linear Gaussian MSM family MT NL specified by Thm. 3.2. At each time step t, the latent zt Rm generates observed data xt Rn via a piece-wise linear transformation f. The generation process of such SDS model can be expressed as follows: xt = f(zt) + ϵt, ϵt N(0, Σ) (13) z1:T p(z1:T ) MT NL. (14) Note that we can also write the decoding process as x1:T = F(z1:T ) + E with E = (ϵ1, ..., ϵT ) , where F is a factored transformation composed by f as defined below. Importantly, this notion allows F to inherit e.g., piece-wise linearity and (weakly) injectivity properties from f. Definition 3.3. We say that a function F : Rm T Rn T is factored if it is composed by f : Rm Rn, such that for any z1:T Rm T , F(z1:T ) = (f(z1), . . . , f(z T )) . The identifiability analysis of the SDS model (Eq. (13)) follows two steps (see Figure 3). (i) Following Kivva et al. (2022), we establish the identifiability for a prior p(z1:T ) MT NL from (F#p)(x1:T ) in the noiseless case. Here F#p denotes the pushforward measure of p by F. (ii) When the noise E distribution is known, F#p is identifiable from (F#p) E using convolution tricks from Khemakhem et al. (2020). In detail, we first define the notion of identifiability for p(z1:T ) MT (ΠA, PA) given noiseless observations. Definition 3.4. Given a family of factored transformations F, for F F the prior p MT (ΠA, PA) is said to be identifiable up to affine transformations, when for any F F and p MT (ΠA, PA) s.t. F#p = F #p , there exists an invertible factored affine mapping H : Rm T Rm T composed by h : Rm Rm, where p H#p . 
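Before discussing the factored-affine equivalence further, the generative process in Eqs. (13)-(14) with a factored decoder (Def. 3.3) can be sketched as follows. This is a hedged NumPy illustration with hypothetical parameters: the decoder is a random LeakyReLU network, which is piecewise linear by construction, but the (weak) injectivity required by the theorem stated next is an additional assumption that random weights do not automatically guarantee.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, m, n = 3, 50, 2, 10              # states, length, latent dim, observed dim

# Hypothetical MSM prior parameters (non-linear Gaussian transitions, Eq. (14)).
Q = rng.dirichlet(np.ones(K), size=K)          # p(s_t | s_{t-1}), one row per state
mu0 = rng.normal(size=(K, m))                  # initial means
W = rng.normal(size=(K, m, m))
sigma_z, sigma_x = 0.1, 0.01

def transition_mean(k, z_prev):                # analytic state-dependent mean m(z, a)
    return np.cos(z_prev @ W[k])

# A random piecewise-linear decoder f (LeakyReLU MLP); injectivity is assumed, not enforced.
W1, W2 = rng.normal(size=(m, 16)), rng.normal(size=(16, n))
def f(z):
    h = z @ W1
    return np.where(h > 0, h, 0.2 * h) @ W2    # LeakyReLU hidden layer, linear readout

def F(z_seq):                                  # factored transformation (Def. 3.3)
    return np.stack([f(z_t) for z_t in z_seq])

# Ancestral sampling from the SDS (Eq. (13)).
s = np.zeros(T, dtype=int)
z = np.zeros((T, m))
s[0] = rng.integers(K)
z[0] = mu0[s[0]] + sigma_z * rng.normal(size=m)
for t in range(1, T):
    s[t] = rng.choice(K, p=Q[s[t - 1]])
    z[t] = transition_mean(s[t], z[t - 1]) + sigma_z * rng.normal(size=m)
x = F(z) + sigma_x * rng.normal(size=(T, n))   # x_{1:T} = F(z_{1:T}) + E
print(x.shape, s[:10])
```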
For the factored F, identifiability up to affine transformations extends from Def. 2.2.1 in Kivva et al. (2022), which is defined on f. I.e., for factored mappings F, F composed by f, f , respectively, if f = (f h), where h : Rm Rm is an invertible affine transformation, then F = (F H) with H a factored mapping composed by h. Now we can state the following identifiability result for (non-linear) SDS (proof in Appendix D). Theorem 3.5. Assume the observations x1:T are generated from a latent variable model following Eq. (13), whose prior p(z1:T ) belongs to MT NL = MT (IA, GA) and satisfies (a1 - a2), and f is a piece-wise linear mapping. Then: (i) If f, which composes F, is weakly injective (Def. 2.1), the prior p(z1:T ) is identifiable up to affine transformations as defined in Def. 3.4. (ii) If f, which composes F, is continuous and injective (Def. 2.2), both prior p(z1:T ) and f are identifiable up to affine transformations (Def. 3.4). Thm. 3.5 allows identifiable SDSs to employ e.g., Soft Plus networks for pθ(zt|zt 1, st), and certain (Leaky) Re LU networks for f (see Kivva et al. (2022)). Again, for neural networks identifiability refers only to their functional form. Remark. The MSM family is closed under factored affine transformations (Prop. C.2) and the proved MSM identifiability result (Thm. 3.2) is up to permutation equivalence. This means for p1, p2 MT NL, one can show that p1 = H#p2 with an invertible factored affine transformation H composed by h(z) = Az + b, z Rm. It indicates the following relations for mi( , a) and Σi( , a), i = 1, 2, with σ( ) a permutation over indices a A: m1(z, a) = Am2 A 1(z b), σ(a) + b, (15) Σ1(z, a) = AΣ2 A 1 (z b) , σ(a) AT . (16) 3.3. On Identifiability Conditions for Sequential LVMs Key to the design of identifiable models is the specification of constraints to eliminate unwanted equivalences or symmetries. For deep sequential latent variable models, these conditions are required for (i) the temporal dependencies of the latent dynamic model p(z1:T ) , and (ii) the emission model p(x1:T |z1:T ) to translate the latent variables to observations. We compare our proposed identifiability conditions with existing work, with flexible/restrictive conditions highlighted in different colours. Identifiable SDS (ours): An MSM dynamic model p(z1:T ) with smoothness conditions on p(zt|zt 1, st) but no restrictions on p(s1:T ). The emission model is restricted to use (weakly) injective mappings. HMM-ICA (Hälvä & Hyvarinen, 2020): Latent and stationary HMM priors following Gassiat et al. (2016), which requires stationarity and bijective emissions. SNICA (Hälvä et al., 2021): Includes SDS under general conditions on latent temporal dependencies (weak On the Identifiability of Switching Dynamical Systems stationarity) but restricts the emission to use injective mappings. Identifiability does not cover transitions. IIA-HMM (Morioka et al., 2021): Introduces a recurrent structure in the emissions with bijectivity of the demixing function. Allows non-stationary innovations using regime-switching sources from a stationary HMM Gassiat et al. (2016). Mechanism Sparsity (Lachapelle et al., 2022; 2024): A temporal model p(z1:T |u1:T ) with auxiliary observations u1:T and sparse interactions within z1:T . The emission model is restricted to use diffeomorphisms. Overall, often relaxed constraints for one part of the latent dynamic model need to be compensated with restrictive conditions on the other components. 
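The affine equivalence class characterised in the remark of Section 3.2 (Eqs. (15)-(16)) can also be checked numerically. The sketch below uses hypothetical transition means and the identity permutation σ for simplicity: it constructs a second model whose transition means are the pushforward of the first under h(z) = Az + b and verifies Eq. (15) at random points; the analogous check for the covariances in Eq. (16) follows the same pattern.

```python
import numpy as np

rng = np.random.default_rng(2)
K, m = 3, 2

# Hypothetical transition means m1(z, a) of a first model p1.
W = rng.normal(size=(K, m, m))
def m1(z, a):
    return np.cos(z @ W[a])

# An invertible affine map h(z) = A z + b (identity permutation sigma for simplicity).
A = rng.normal(size=(m, m)) + 3 * np.eye(m)    # almost surely invertible
b = rng.normal(size=m)
A_inv = np.linalg.inv(A)

# Transition means of p2, defined so that p1 = H#p2 as in the remark.
def m2(z, a):
    return A_inv @ (m1(A @ z + b, a) - b)

# Check Eq. (15): m1(z, a) == A m2(A^{-1}(z - b), sigma(a)) + b at random points.
for _ in range(5):
    z, a = rng.normal(size=m), rng.integers(K)
    lhs = m1(z, a)
    rhs = A @ m2(A_inv @ (z - b), a) + b
    assert np.allclose(lhs, rhs), "Eq. (15) violated"
print("Eq. (15) holds at the sampled points (up to numerical precision).")
```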
All of the works compared above share a limitation: the emission model must use at least (weakly) injective mappings. This is a common issue in identifiable DGM research since Khemakhem et al. (2020) and requires substantial future work to address.

4. Model Parameter Estimation

We consider two modelling choices for N sequences D = {x_{1:T}} of length T: (a) the MSM model (Eq. (4) with z_{1:T} replaced by x_{1:T}) and (b) the SDS model with MSM latent prior (Eqs. (3), (4)). Although our theory imposes no restrictions on the discrete latent state distribution, the methods are implemented with a first-order stationary Markov chain, i.e., $p_\theta(s_{1:T}) = p_\theta(s_1) \prod_{t=2}^{T} p_\theta(s_t|s_{t-1})$.

Markov Switching Models We use expectation maximisation (EM) for efficient estimation of mixture models (Bishop, 2006). Below we discuss the M-step update of the transition distribution parametrised by neural networks; more details (including polynomial parametrisations) can be found in Appendix E. When considering analytic neural networks, we take a Generalised EM (GEM) (Dempster et al., 1977) approach, where a gradient ascent step is performed in the direction of
$$\sum_{t}\sum_{k=1}^{K} \gamma_{t,k}\,\nabla_\theta \log p_\theta(x_t|x_{t-1}, s_t = k), \quad (17)$$
where $\gamma_{t,k} = p_\theta(s_t = k|x_{1:T})$ and the update can be computed using back-propagation. The equation shows the update for a single sample; we use mini-batch stochastic gradient ascent when N is large. This approach is guaranteed to converge to a local maximum of the likelihood (Bengio & Frasconi, 1994).

Switching Dynamical Systems We adopt variational inference as presented in Ansari et al. (2021). The parameters are learned by maximising the evidence lower bound (ELBO) (Kingma & Welling, 2014), and the proposed approximate posterior over the latent variables {z_{1:T}, s_{1:T}} incorporates the exact posterior of the discrete latent states given the continuous latent variables:
$$q_{\phi,\theta}(z_{1:T}, s_{1:T}|x_{1:T}) = q_\phi(z_1|x_{1:T}) \prod_{t=2}^{T} q_\phi(z_t|z_{1:t-1}, x_{1:T})\, p_\theta(s_{1:T}|z_{1:T}). \quad (18)$$
As in Ansari et al. (2021); Dong et al. (2020), the variational posterior over the continuous variables $q_\phi(z_{1:T}|x_{1:T})$ simulates a smoothing process by first running a bi-directional RNN on x_{1:T}, and then a forward RNN on the resulting embeddings. By introducing an exact posterior, the discrete latent variables can be marginalised from the ELBO objective (see Appendix E.2 for details),
$$\log p_\theta(x_{1:T}) \geq \mathbb{E}_{q_\phi(z_{1:T}|x_{1:T})}\big[ \log p_\theta(x_{1:T}|z_{1:T}) \big] - \mathrm{KL}\big( q_\phi(z_{1:T}|x_{1:T})\,||\,p_\theta(z_{1:T}) \big). \quad (19)$$
We use Monte Carlo estimation with samples $z_{1:T} \sim q_\phi(z_{1:T}|x_{1:T})$ and the reparametrisation trick for back-propagation (Kingma & Welling, 2014), and jointly learn the parameters using stochastic gradient ascent on the ELBO objective. The prior $p_\theta(z_{1:T})$ can be computed exactly using the forward algorithm (Bishop, 2006) with messages $\{\alpha_{t,k}(z_{1:t}) = p_\theta(z_{1:t}, s_t = k)\}$ by marginalising out s_{1:T}:
$$p_\theta(z_{1:T}) = \sum_{k=1}^{K} \alpha_{T,k}(z_{1:T}), \quad (20)$$
$$\alpha_{t,k}(z_{1:t}) = p_\theta(z_t|z_{t-1}, s_t = k) \sum_{k'=1}^{K} p_\theta(s_t = k|s_{t-1} = k')\, \alpha_{t-1,k'}(z_{1:t-1}). \quad (21)$$
An alternative approach for SDS inference is presented in Dong et al. (2020), although Ansari et al. (2021) outlines some of its disadvantages (see Appendix E.2 for a discussion). Note that estimating parameters with ELBO maximisation has no consistency guarantees in general, and we leave additional analyses regarding consistency for future work.
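As a concrete illustration of the forward recursion in Eqs. (20)-(21), the following NumPy sketch (hypothetical Gaussian transition parameters) evaluates log p_θ(z_{1:T}) by propagating the messages α_{t,k} in log space for numerical stability. It is a minimal sketch of the computation, not the released implementation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
K, T, m = 3, 100, 2

# Hypothetical MSM parameters.
logQ = np.log(rng.dirichlet(np.ones(K), size=K))   # log p(s_t = j | s_{t-1} = i)
log_p_s1 = np.log(np.full(K, 1.0 / K))
mu0 = rng.normal(size=(K, m))
W = rng.normal(size=(K, m, m))
sigma2 = 0.1

def log_init(k, z1):                               # log p(z_1 | s_1 = k)
    return multivariate_normal.logpdf(z1, mean=mu0[k], cov=sigma2 * np.eye(m))

def log_trans(k, z_prev, z_t):                     # log p(z_t | z_{t-1}, s_t = k)
    mean = np.cos(z_prev @ W[k])
    return multivariate_normal.logpdf(z_t, mean=mean, cov=sigma2 * np.eye(m))

def msm_log_prob(z):                               # z: (T, m); Eqs. (20)-(21) in log space
    log_alpha = np.array([log_p_s1[k] + log_init(k, z[0]) for k in range(K)])
    for t in range(1, T):
        log_alpha = np.array([
            log_trans(k, z[t - 1], z[t]) + logsumexp(logQ[:, k] + log_alpha)
            for k in range(K)
        ])
    return logsumexp(log_alpha)                    # log p(z_{1:T}) via Eq. (20)

z = rng.normal(size=(T, m))
print(f"log p(z_1:T) = {msm_log_prob(z):.2f}")
```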
5. Related Work

Sequential generative models based on non-linear SDSs have gained interest over recent years. Notable works include structured VAEs (Johnson et al., 2016), soft-switching Kalman filters (Fraccaro et al., 2017), and recurrent SDSs (Linderman et al., 2016; Dong et al., 2020), which can include explicit duration models on the switches (Ansari et al., 2021). However, identifiability in such highly non-linear scenarios is rarely studied. A similar situation holds for MSMs, which were first introduced by Poritz (1982) and were studied decades ago for speech analysis (Poritz, 1982; Ephraim & Roberts, 2005) and economics (Hamilton, 1989). Although maximum likelihood estimation and its asymptotic properties have been studied for some first-order and stationary MSMs (Frühwirth-Schnatter & Frèuhwirth-Schnatter, 2006), identifiability for general high-order autoregressive MSMs has not been proved. The main complication arises from the explicit dependency on the observed variables, which makes it challenging to prove linear independence of p(z_{1:T}|s_{1:T}). To our knowledge, identifiability in MSMs has been explicitly studied only in An et al. (2013), which assumes discrete z_t states and uses the joint probability of four consecutive observations.

Regarding non-linear ICA for time series, identifiability results extend from Khemakhem et al. (2020), where past values are used as auxiliary information (Hyvarinen & Morioka, 2017; Hyvarinen et al., 2019; Klindt et al., 2021). Specifically, Yao et al. (2022a;b) recover temporal latent causal variables and allow non-stationarity via distribution shifts or time-dependent effects across observed regimes. Another line of work leverages intervention targets (Lippe et al., 2023b) or unknown binary interactions with regime information (Lippe et al., 2023a). Again, these identifiability results require at least injectivity of the decoder function; see Section 3.3 for further discussion of this limitation.

6. Experiments

We evaluate the identifiable MSMs and SDSs with three experiments: (1) simulation studies with ground truth available for verification of the identifiability results; (2) regime-dependent causal discovery in climate data with identifiable MSMs; and (3) segmentation of high-dimensional sequences of salsa dancing using MSMs and SDSs. Additional results are presented in Appendix F. Code for inference and data generation can be found at https://github.com/charlio23/identifiable-SDS.

Figure 4: Synthetic experiment results on MSMs. (a) L2 distance error using different transition functions with varying T. (b) L2 distance error and (c) averaged F1 score of non-linear data (cosine activations) with increasing states and dimensions.

6.1. Synthetic Experiments

MSMs We generate data using ground-truth MSMs and evaluate the estimated functions against them. We parametrise the transition distribution means using random cubic polynomials or networks with cosine/SoftPlus activations, and use fixed transition covariances. When increasing dimensions, we use locally connected networks (Zheng et al., 2018) to construct regime-dependent causal structures for the ground-truth MSMs, which encourages sparsity and stable data generation. The estimation error is computed as the L2 distance between ground-truth and estimated transition mean functions. We also estimate the causal structure by thresholding the Jacobian of the estimated transition function, and compute the F1 score to evaluate structure estimation accuracy. Both evaluation metrics are averaged over K components after accounting for permutation equivalence. See Appendices F.1 to F.4 for experimental details.
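A minimal sketch of this evaluation step is shown below: it builds a toy "estimated" transition mean with a known sparse structure, recovers a regime-dependent graph by thresholding a numerically estimated Jacobian, and scores it with F1. The threshold value, the finite-difference Jacobian, and the toy function are all hypothetical stand-ins; the exact protocol is described in Appendices F.1 to F.4.

```python
import numpy as np

rng = np.random.default_rng(4)
m, thr = 3, 0.05

# Hypothetical ground-truth adjacency and a sparse "estimated" transition mean for one state.
A_true = rng.random((m, m)) < 0.4                      # edge j -> i iff A_true[i, j]
W_hat = rng.normal(size=(m, m)) * A_true               # weights respect the sparse structure
def m_hat(z):                                          # estimated mean \hat{m}(z, k)
    return np.tanh(W_hat @ z)

def mean_abs_jacobian(fn, dim, n_samples=200, eps=1e-4):
    """Average |d fn_i / d z_j| over random inputs via central differences."""
    J = np.zeros((dim, dim))
    for _ in range(n_samples):
        z = rng.normal(size=dim)
        for j in range(dim):
            e = np.zeros(dim)
            e[j] = eps
            J[:, j] += np.abs(fn(z + e) - fn(z - e)) / (2 * eps)
    return J / n_samples

A_est = mean_abs_jacobian(m_hat, m) > thr              # threshold the Jacobian
tp = np.sum(A_est & A_true)
prec = tp / max(A_est.sum(), 1)
rec = tp / max(A_true.sum(), 1)
f1 = 2 * prec * rec / max(prec + rec, 1e-12)
print(f"F1 score for the recovered regime-dependent graph: {f1:.2f}")
```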
Figure 4a shows that increasing the sequence length generally reduces the L2 estimation error. The polynomials are estimated with higher error for short sequences, which could be caused by the high-frequency components from the cubic terms. Meanwhile, the smoothness of the softplus networks allows the MSM to achieve consistently better parameter estimates. Regarding scalability, Figure 4b shows low estimation errors when increasing the number of dimensions and components. For structure estimation accuracy, Figure 4c shows that the MSM with non-linear transitions maintains high F1 scores, despite the differences in L2 distance when increasing dimensions and states (Figure 4b). Although the approach is restricted by first-order Markov assumptions, the synthetic setting shows promising directions for high-dimensional regime-dependent causal discovery.

SDSs We use data from 2D MSMs with cosine activations for the transition means, fixed covariances, and K = 3 components, and a random LeakyReLU network to generate observations (with no additive noise). For evaluation, we generate 1000 test samples and compute the F1 score of the state posterior with respect to the ground-truth component. We also compute the L2 distance of m(z, ·) using Eq. (15). Again, both metrics are computed after accounting for permutation and affine transform equivalence (Appendix F.1). Results in Table 6 show that the method obtains high F1 scores as well as a low L2 distance between the estimated and ground-truth transition mean functions.

Figure 5: Reconstruction and segmentation (with ground truth) of a video generated from 2D latent variables sampled from an MSM (frame size 32 × 32).

Figure 6: Synthetic experiment on SDSs for increasing x_t dimension (L2 distance of the transition means and state F1 score).

Table 1: (Weak) MCC scores (mean and 95% CI on 5 datasets) of iSDS in comparison to other non-linear ICA baselines.
Model         MCC-strong       MCC-weak
iVAE*         0.642 ± 0.104    0.770 ± 0.059
∆-SNICA       0.940 ± 0.041    0.979 ± 0.043
iSDS (ours)   0.936 ± 0.041    0.992 ± 0.008

We further apply identifiable SDSs to synthetic videos generated by treating the 2D latents as positions of a moving ball, and show a reconstruction with the corresponding segmentation in Figure 5. This high-dimensional setting increases the difficulty of accurate estimation, as the reconstruction term of the ELBO outweighs the KL term for learning the latent MSM (Appendix F.3). This results in an increased L2 distance from the ground-truth MSM. Still, the estimated model achieves high-fidelity reconstructions (with an averaged pixel MSE of 8.89 × 10⁻⁵), and accurate structure estimation as indicated by the F1 score.
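To illustrate how the affine equivalence class is accounted for during evaluation, the sketch below aligns (hypothetical) estimated latents to the ground truth with a least-squares affine fit before computing errors. This is a simplified stand-in for the procedure in Appendix F.1, shown only to make the idea of "up to affine transformations" operational.

```python
import numpy as np

rng = np.random.default_rng(5)
T, m = 1000, 2

# Hypothetical ground-truth latents and estimates that differ by an affine map plus noise.
z_true = rng.normal(size=(T, m))
A_gt = rng.normal(size=(m, m)) + 2 * np.eye(m)        # almost surely invertible
b_gt = rng.normal(size=m)
z_hat = z_true @ A_gt.T + b_gt + 0.01 * rng.normal(size=(T, m))

# Least-squares fit of z_true ~ [z_hat, 1] @ coef (affine alignment).
X = np.hstack([z_hat, np.ones((T, 1))])               # append a bias column
coef, *_ = np.linalg.lstsq(X, z_true, rcond=None)
z_aligned = X @ coef

mse_before = np.mean((z_hat - z_true) ** 2)
mse_after = np.mean((z_aligned - z_true) ** 2)
print(f"latent MSE before alignment: {mse_before:.3f}, after: {mse_after:.5f}")
```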
In Table 1 we compare our identifiable SDS (i SDS) with i VAE (Khemakhem et al., 2020), and -SNICA (Hälvä et al., 2021) in the nonlinear ICA domain. For i VAE, which requires auxiliary observations, we use the ground-truth discrete states to create a strong non-temporal baseline (i VAE*). Using 50-dimensional test observations generated from the 2D latent MSMs, we compute the mean coefficient correlation (MCC-strong) as described in Khemakhem et al. (2020). Since our results achieve identifiability up to affine transformations, we also report the MCC scores after affine alignment of the latent variables (MCC-weak), as used in Kivva et al. (2022). Our approach achieves high MCC scores in both strong and weak settings. For -SNICA, we observe successful disentanglement despite the linear prior dynamics differing from the groundtruth MSM structure based on cosine networks. This is due to the meanfield variational inference design from Johnson et al. (2016) and overparametrised linear SDSs, which allow flexible parametrisations but lose transition identifiability. 6.2. Regime-Dependent Causal Discovery We explore regime-dependent causal discovery using climate data from Saggioro et al. (2020). The data consists (a) linear effects 0.24 (b) non-linear effects Figure 7: Regime-dependent graphs generated assuming (a) linear and (b) non-linear effects. Green and blue lines indicate effects in summer and winter months respectively. on monthly observations of El Niño Southern Oscillation (ENSO) and All India Rainfall (AIR) from 1871 to 2016. We follow Saggioro et al. (2020) and train identifiable MSMs with linear and non-linear (softplus networks) transitions. Figure 7 shows the regime-dependent graphs extracted from both models, where the edge weights denote the absolute value of the corresponding entries in the estimated transition function s Jacobian (we keep edges with weights 0.05). The MSMs capture regimes based on seasonality, as one component is assigned to summer months (May - September), and the other is assigned to winter months (October - April). In the linear case, the MSM discovers an effect from ENSO to AIR which occurs only during Summer. This suggests that ENSO has a direct effect on AIR during summer, but not in winter, which is consistent with Saggioro et al. (2020) and Webster & Palmer (1997). For the non-linear case (Figure 7b), additional links are discovered which are yet to be supported by evidence in climate science. But as the flexibility of non-linear MSMs allows capturing correlations as causal effects in disguise, these additional links may imply the presence of confounders influencing both variables a likely result for cases with few observations. 6.3. Segmentation of Dancing Patterns We consider salsa dancing sequences from CMU mocap data and Hip-Hop videos from AIST Dance DB (Tsuchida et al., 2019) to demonstrate our models ability in segmenting high-dimensional time-series. See Appendix for details on the data (F.6), training (F.3), and additional results (F.7). Key-point data We present a sample using key-point data in Figure 9 with both the identifiable MSM (i MSM; softplus networks) and KVAE (Fraccaro et al., 2017). The i MSM On the Identifiability of Switching Dynamical Systems (a) Synthetic salsa videos (pixel MSE 2.26 10 4 per frame). (b) AIST Hip-Hop videos (pixel MSE 7.85 10 3 per frame). Figure 8: Reconstructions and segmentations of (a) salsa dancing and (b) AIST Hip-Hop videos, where colour boxes indicate different components. 
The MSE results are computed on 560 and 100 test videos for salsa and Hip-Hop, respectively. i MSM (ours) Forwardbackward Turning around Double spin Standing in front Figure 9: Posterior probability of a salsa dancing sequence of i MSM and KVAE (Fraccaro et al., 2017) along with several patterns distinguished in the example. assigns different patterns to different states, e.g., the major pattern (forward-backward movements) is assigned to state 1. The i MSM also identifies this pattern at the end, which KVAE fails to recognise. Additionally, KVAE assigns a turning pattern into component 2, while i MSM treats turning as in state 1 patterns, but then jumps to component 2 consistently after observing it. The i MSM also classifies other dancing patterns into state 0. For limitations, the soft-switching mechanism restricts KVAE from confident component assignments. The i MSM s first-order Markov transitions hinders learning e.g., acceleration that would provide richer features for better segmentation. Dancing videos We show reconstructions and segmentation results in Figure 8 to demonstrate identifiable SDS s applicability to videos. Here, we generate salsa videos by rendering 3D meshes from key-points using the renderer from Mahmood et al. (2019) (Figure 8a). Again in this experiment, one component is more prominent and used for the majority of the forward and backward patterns; and the other components are used to model spinning and other dancing patterns. Meanwhile segmentation results on AIST videos (Figure 8b) show more diverse patterns, illustrating a correlation between the discovered regimes and the moving position of the dancer and background information. The pixel MSE results also support identifiable SDSs as flexible sequential LVMs for video modelling. 7. Conclusion We present identifiability analysis for Markov Switching Models and Switching Dynamical Systems. Key to our innovation is the establishment of MSM identifiability using Gaussian transitions with analytic functions, independently of the discrete state prior. We further extend the results to develop identifiable SDSs fully parameterised by neural networks. We verify our theory with synthetic experiments, and apply our models to regime-dependent causal discovery and high-dimensional time-series segmentation. While our work focuses on identifiability, accurate estimation is also key to the success of causal discovery. Current estimation methods require intensive hyper-parameter tuning and are prone to state collapse in high-dimensional settings. Also variational learning for deep LVMs has no consistency guarantees unless assuming universal approximations of the q posterior (Gong et al., 2023), which disagrees with the popular use of Gaussian encoders. Future work should design efficient estimation methods, and extend the identifiability and estimation innovations to higher-order MSMs/SDSs. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgments Yixin Wang was supported in part by the Office of Naval Research under grant number N00014-23-1-2590 and the National Science Foundation under Grant No. 2231174 and No. 2310831. On the Identifiability of Switching Dynamical Systems Allman, E. S., Matias, C., and Rhodes, J. A. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099 3132, 2009. 
An, Y., Hu, Y., Hopkins, J., and Shum, M. Identifiability and inference of hidden markov models. Technical report, Technical report, 2013. Ansari, A. F., Benidis, K., Kurle, R., Turkmen, A. C., Soh, H., Smola, A. J., Wang, B., and Januschowski, T. Deep explicit duration switching models for time series. Advances in Neural Information Processing Systems, 34: 29949 29961, 2021. Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum? id=rk49Mg-CW. Bengio, Y. and Frasconi, P. An input output hmm architecture. Advances in neural information processing systems, 7, 1994. Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738. Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103 111, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/W14-4012. URL https: //aclanthology.org/W14-4012. Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. Advances in neural information processing systems, 28, 2015. Comon, P. Independent component analysis, a new concept? Signal processing, 36(3):287 314, 1994. Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1 22, 1977. Dong, Z., Seybold, B., Murphy, K., and Bui, H. Collapsed amortized variational inference for switching nonlinear dynamical systems. In International Conference on Machine Learning, pp. 2638 2647. PMLR, 2020. Ephraim, Y. and Roberts, W. J. Revisiting autoregressive hidden markov modeling of speech signals. IEEE Signal processing letters, 12(2):166 169, 2005. Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. Advances in neural information processing systems, 30, 2017. Frühwirth-Schnatter, S. and Frèuhwirth-Schnatter, S. Finite mixture and Markov switching models, volume 425. Springer, 2006. Gassiat, É., Cleynen, A., and Robin, S. Inference in finite state space non parametric hidden markov models and applications. Statistics and Computing, 26(1):61 71, 2016. Glymour, C., Zhang, K., and Spirtes, P. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019. Gong, W., Jennings, J., Zhang, C., and Pawlowski, N. Rhino: Deep causal temporal relationship learning with historydependent noise. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=i_1rbq8y FWC. Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), 2022. Hälvä, H. and Hyvarinen, A. Hidden markov nonlinear ica: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pp. 939 948. PMLR, 2020. Hälvä, H., Le Corff, S., Lehéricy, L., So, J., Zhu, Y., Gassiat, E., and Hyvarinen, A. Disentangling identifiable features from noisy data with structured nonlinear ica. 
Advances in Neural Information Processing Systems, 34: 1624 1633, 2021. Hamilton, J. D. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica: Journal of the econometric society, pp. 357 384, 1989. Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735 1780, 1997. Hyvarinen, A. and Morioka, H. Nonlinear ica of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp. 460 469. PMLR, 2017. Hyvärinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429 439, 1999. On the Identifiability of Switching Dynamical Systems Hyvarinen, A., Sasaki, H., and Turner, R. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859 868. PMLR, 2019. Johnson, M. J., Duvenaud, D. K., Wiltschko, A., , Datta, S. R., and Adams, R. P. Composing graphical models with neural networks for structured representations and fast inference. Advances in neural information processing systems, 29, 2016. Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207 2217. PMLR, 2020. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and Le Cun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Bengio, Y. and Le Cun, Y. (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/ abs/1312.6114. Kivva, B., Rajendran, G., Ravikumar, P., and Aragam, B. Identifiability of deep generative models without auxiliary information. Advances in Neural Information Processing Systems, 35:15687 15701, 2022. Klindt, D. A., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=Eb IDj Byn YJ8. Kruskal, J. B. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear algebra and its applications, 18(2):95 138, 1977. Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. In Conference on Causal Learning and Reasoning, pp. 428 484. PMLR, 2022. Lachapelle, S., López, P. R., Sharma, Y., Everett, K., Priol, R. L., Lacoste, A., and Lacoste-Julien, S. Nonparametric partial disentanglement via mechanism sparsity: Sparse actions, interventions and sparse temporal dependencies. ar Xiv preprint ar Xiv:2401.04890, 2024. Li, Y. and Mandt, S. Disentangled sequential autoencoder. In International Conference on Machine Learning, 2018. Linderman, S., Johnson, M., Miller, A., Adams, R., Blei, D., and Paninski, L. Bayesian learning and inference in recurrent switching linear dynamical systems. In Artificial Intelligence and Statistics, pp. 914 922. PMLR, 2017. Linderman, S. W., Miller, A. C., Adams, R. P., Blei, D. 
M., Paninski, L., and Johnson, M. J. Recurrent switching linear dynamical systems. ar Xiv preprint ar Xiv:1610.08466, 2016. Lindgren, G. Markov regime models for mixed distributions and switching regressions. Scandinavian Journal of Statistics, pp. 81 91, 1978. Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. BISCUIT: Causal representation learning from binary interactions. In Evans, R. J. and Shpitser, I. (eds.), Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine Learning Research, pp. 1263 1273. PMLR, 31 Jul 04 Aug 2023a. URL https://proceedings. mlr.press/v216/lippe23a.html. Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. Causal representation learning for instantaneous and temporal effects in interactive systems. In The Eleventh International Conference on Learning Representations, 2023b. URL https: //openreview.net/forum?id=it Z6ggv Mnz S. Maddison, C. J., Lawson, J., Tucker, G., Heess, N., Norouzi, M., Mnih, A., Doucet, A., and Teh, Y. Filtering variational objectives. Advances in Neural Information Processing Systems, 30, 2017. Mahmood, N., Ghorbani, N., F. Troje, N., Pons-Moll, G., and Black, M. J. Amass: Archive of motion capture as surface shapes. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019. URL https: //amass.is.tue.mpg.de. Mityagin, B. The zero set of a real analytic function. ar Xiv preprint ar Xiv:1512.07276, 2015. Morioka, H., Hälvä, H., and Hyvarinen, A. Independent innovation analysis for nonlinear vector autoregressive process. In International Conference on Artificial Intelligence and Statistics, pp. 1549 1557. PMLR, 2021. Pamfil, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., and Aragam, B. Dynotears: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pp. 1595 1605. PMLR, 2020. On the Identifiability of Switching Dynamical Systems Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017. Poritz, A. Linear predictive hidden markov models and the speech signal. In ICASSP 82. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 7, pp. 1291 1294. IEEE, 1982. Saggioro, E., de Wiljes, J., Kretschmer, M., and Runge, J. Reconstructing regime-dependent causal relationships from observational time series. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(11):113115, 2020. Saxena, V., Ba, J., and Hafner, D. Clockwork variational autoencoders. In Neural Information Processing Systems, 2021. Smith, J. T., Warrington, A., and Linderman, S. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=Ai8Hw3AXqks. Tank, A., Covert, I., Foti, N., Shojaie, A., and Fox, E. B. Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1 1, 2021. 
doi: 10.1109/TPAMI.2021.3065601. Tsuchida, S., Fukayama, S., Hamasaki, M., and Goto, M. Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, Netherlands, November 2019. Webster, P. J. and Palmer, T. N. The past and the future of el niño. Nature, 390(6660):562 564, 1997. Yakowitz, S. J. and Spragins, J. D. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39(1):209 214, 1968. Yao, W., Chen, G., and Zhang, K. Temporally disentangled representation learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022a. URL https: //openreview.net/forum?id=Vi-s ZWNA_Ue. Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022b. URL https://openreview.net/ forum?id=RDl LMj LJXdq. Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018. On the Identifiability of Switching Dynamical Systems A. Non-parametric finite mixture models We use the following existing result on identifying finite mixtures (Yakowitz & Spragins, 1968), which introduces the concept of linear independence to the identification of finite mixtures. Specifically, consider a distribution family that contains functions defined on x Rd: FA := {Fa(x)|a A} (22) where Fa(x) is an d-dimensional CDF and A is a measurable index set such that Fa(x) as a function of (x, a) is measurable on Rd A. In this paper, we assume this measure theoretic assumption on A is satisfied. Now consider the following finite mixture distribution family by linearly combining the CDFs in F: HA := {H(x) = i=1 ci Fai(x)|N < + , ai A, ai = aj, i = j, i=1 ci = 1}. (23) Then we specify the definition of identifiable finite mixture family as follows: Definition A.1. The finite mixture family H is said to be identifiable up to permutations, when for any two finite mixtures H1(x) = PM i=1 ci Fai(x) and H2(x) = PM i=1 ˆci Fˆai(x), H1(x) = H2(x) for all x Rd, if and only if M = N and for each 1 i N there is some 1 j M such that ci = ˆcj and Fai(x) = Fˆaj(x) for all x Rd. Then Yakowitz & Spragins (1968) proved the identifiability results for finite mixtures. To see this, we first introduce the concept of linearly independent functions under finite mixtures as follows. Definition A.2. A family of functions F = {fa(x)|a A} is said to contain linearly independent functions under finite mixtures, if for any A0 A such that |A0| < + , the functions in {fa(x)|a A0} are linearly independent. This is a weaker requirement of linear independence on function classes as it allows linear dependency by representing one function as the linear combination of infinitely many other functions. With this relaxed definition of linear independence we state the identifiability result of finite mixture models as follows. Proposition A.3. (Yakowitz & Spragins, 1968) The finite mixture distribution family H is identifiable up to permutations, iff. functions in F are linearly independent under the finite mixture model. B. Proof of theorem 3.2 We follow the strategy described in the main text. B.1. 
Identifiability via linear independence Proposition A.3 can be directly generalised to CDFs defined on z1:T RT m. Furthermore, if we have a family of PDFs3, e.g. PT A := ΠA ( T t=2PA), with linearly independent components, then their corresponding Tm-dimensional CDFs are also linearly independent (and vice versa). Therefore we have the following result as a direct extension of Proposition A.3. Proposition B.1. Consider the distribution family given by Eq. 7. Then the joint distribution in MT (ΠA, PA) is identifiable up to permutations if and only if functions in PT A are linearly independent under finite mixtures. The above assumption of linear independence under finite mixtures over the joint distribution implies the following identifiability result. Theorem B.2. Assume the functions in PT A := ΠA ( T t=2PA) are linearly independent under finite mixtures, then the distribution family MT (ΠA, PA) is identifiable as defined in Definition 3.1. Proof. From proposition B.1 we see that, PT A being linearly independent implies identifiability up to permutation for MT (ΠA, PA) in the finite mixture sense (Definition A.1). This means for p1(z1:T ) and p2(z1:T ) defined in Definition 3.1, we have K = ˆK and for every 1 i KT , there exists 1 j ˆKT such that ci = ˆcj and t=2 pai t(zt|zt 1) = pˆaj 1(z1) t=2 pˆaj t(zt|zt 1), z1:T RT m. 3In this case we assume that the probability measures are dominated by the Lebesgue measure on RT m and the CDFs are differentiable. On the Identifiability of Switching Dynamical Systems This also indicates that pai t(zt|zt 1) = pˆaj t(zt|zt 1) for all t 2, zt, zt 1 Rm, which can be proved by noticing that pa(zt|zt 1) are conditional PDFs. To see this, notice that as the joint distributions on z1:T are equal, then the marginal distributions on z1:T 1 are also equal: t=2 pai t(zt|zt 1) = pˆaj 1(z1) t=2 pˆaj t(zt|zt 1), z1:T 1 R(T 1)m, which immediately implies pai T (z T |z T 1) = pˆaj T (z T |z T 1), z T 1, z T Rm. Similar logic applies to the other time indices t 1, which also implies pai 1(z1) = pˆaj 1(z1) for all z1 Rm. Lastly, if there exists t1 = t2 such that ai t1 = ai t2 but ˆaj t1 = ˆaj t2, then the proved fact that, for any α, β Rm, pˆaj t1(zt1 = β|zt1 1 = α) = pai t1(zt1 = β|zt1 1 = α) = pai t2(zt2 = β|zt2 1 = α) = pˆaj t2(zt2 = β|zt2 1 = α), implies linear dependence of PA, which contradicts to the assumption that PT A are linearly independent under finite mixtures. We show the contradiction by assuming the case where Pt 1 A is linearly independent for some t > 1, and then we consider the linear independence on Pt A. We should have i,j γijpai 1:t 1(z1:t 1)paj t(zt|zt 1) = 0, z1:t R(t 1)m Rm, with γij = 0, i, j. We can swap the summations to observe that from linear dependence of PA, we have γij = 0, for some i and j such that P j γijpaj t(zt|zt 1) = 0. j γijpaj t(zt|zt 1) pai 1:t 1(z1:t 1) = 0, z1:t R(t 1)m Rm, which satisfies the equation with γij = 0 for some i and j and thus contradicts with the linear independence of Pt A. B.2. Linear independence for T=2 Following the strategy as described in the main text, the next step requires us to start from linear independence results for T = 2, and then extend to T > 2. We therefore prove the following linear independence result. Lemma B.3. Consider two families UI := {ui(y, x)|i I} and VJ := {vj(z, y)|j J} with x X, y Rdy and z Rdz. We further assume the following assumptions: (b1) Positive function values: ui(y, x) > 0 for all i I, (y, x) Rdy X. 
Similar positive function values assumption applies to VJ: vj(z, y) > 0 for all j J, (z, y) Rdz Rdy. (b2) Unique indexing: for UI, i = i I x, y s.t. ui(x, y) = ui (x, y). Similar unique indexing assumption applies to VJ; (b3) Linear independence under finite mixtures on specific non-zero measure subsets for UI: for any non-zero measure subset Y Rdy, UI contains linearly independent functions under finite mixtures on (y, x) Y X. (b4) Linear independence under finite mixtures on specific non-zero measure subsets for VJ: there exists a non-zero measure subset Y Rdy, such that for any non-zero measure subsets Y Y and Z Rdz, VJ contains linearly independent functions under finite mixtures on (z, y) Z Y ; On the Identifiability of Switching Dynamical Systems (b5) Linear dependence under finite mixtures for subsets of functions in VJ implies repeating functions: for any β Rdy, any non-zero measure subset Z Rdz and any subset J0 J such that |J0| < + , {vj(z, y = β)|j J0} contains linearly dependent functions on z Z only if j = j J0 such that vj(z, β) = vj (z, β) for all z Rdz. (b6) Continuity for VJ: for any j J, vj(z, y) is continuous in y Rdy. Then for any non-zero measure subset Z Rdz, UI VJ := {vj(z, y)ui(y, x)|i I, j J} contains linear indepedent functions under finite mixtures defined on (x, y, z) X Rdy Z. Proof. Assume this sufficiency statement is false, then there exist a non-zero measure subset Z Rdz, S0 I J with |S0| < + and a set of non-zero values {γij R|(i, j) S0}, such that X (i,j) S0 γijvj(z, y)ui(y, x) = 0, (x, y, z) X Rdy Z. (24) Note that the choices of S0 and γij are independent of any x, y, z values, but might be dependent on Z. By assumptions (b1), the index set S0 contains at least 2 different indices (i, j) and (i , j ). In particular, S0 contains at least 2 different indices (i, j) and (i , j ) with j = j , otherwise we can extract the common term vj(z, y) out: (i,j) S0 γijvj(z, y)ui(y, x) = vj(z, y) i:(i,j) S0 γijui(y, x) = 0, (x, y, z) X Rdy Z, and as there exist at least 2 different indices (i , j) and (i, j) in S0, we have at least one i = i, and the above equation contradicts to assumptions (b1) - (b3). Now define J0 = {j A| (i, j) S0} the set of all possible j indices that appear in S0, and from |S0| < + we have |J0| < + as well. We rewrite the linear combination equation (Eq. (24)) for any β Rdy as i:(i,j) S0 γijui(y = β, x) vj(z, y = β) = 0, (x, z) X Z. (25) From assumption (b3) we know that the set Y0 := {β Rdy| P i:(i,j) S0 γijui(y = β, x) = 0, x X} can only have zero measure in Rdy. Write Y Rdy the non-zero measure subset defined by assumption (b4), we have Y1 := Y\Y0 Y also has non-zero measure and satisfies assumption (b4). Combined with assumption (b1), we have for each β Y1, there exists x X such that P i:(i,j) S0 γijui(y = β, x) = 0 for at least two j indices in J0. This means for each β Y1, {vj(z, y = β)|j J0} contains linearly dependent functions on z Z. Now under assumption (b5), we can split the index set J0 into subsets indexed by k K(β) as follows, such that within each index subset Jk(β) the functions with the corresponding indices are equal: J0 = k K(β)Jk(β), Jk(β) Jk (β) = , k = k K(β), j = j Jk(β) vj(z, y = β) = vj (z, y = β), z Z. (26) Then we can rewrite Eq. (25) for any β Y1 as i:(i,j) S0 γijui(y = β, x)vj(z, y = β) = 0, (x, z) X Z. (27) Recall from Eq. (26) that vj(z, y = β) and vj (z, y = β) are the same functions on z Z iff. j = j are in the same index set Jk(β). This means if Eq. 
(24) holds, then for any β Y1, under assumptions (b1) and (b5), X i:(i,j) S0 γijui(y = β, x) = 0, x Rd, k K(β). (28) Define C(β) = mink |Jk(β)| the minimum cardinality count for j indices in the Jk(β) subsets. Choose β arg minβ Y1 C(β): On the Identifiability of Switching Dynamical Systems 1. We have C(β ) < |J0| and |K(β )| 2. Otherwise for all j = j J0 we have vj(z, y = β) = vj (z, y = β) for all z Z and β Y1, so that they are linearly dependent on (z, y) Z Y1, a contradiction to assumption (b4) by setting Y = Y1. 2. Now assume |J1(β )| = C(β ) w.l.o.g.. From assumption (b5), we know that for any j J1(β ) and j J0\J1(β ), vj(z, y = β) = vj (z, y = β) only on zero measure subset of Z at most. Then as |J0| < + and Z Rdz has non-zero measure, there exist z0 Z and δ > 0 such that |vj(z = z0, y = β ) vj (z = z0, y = β )| δ, j J1(β ), j J0\J1(β ). Under assumption (b6), there exists ϵ(j) > 0 such that we can construct an ϵ-ball Bϵ(j)(β ) using ℓ2-norm, such that |vj(z = z0, y = β ) vj(z = z0, y = β)| δ/3, β Bϵ(j)(β ). Choosing a suitable 0 < ϵ minj J0 ϵ(j) (note that minj J0 ϵ(j) > 0 as |J0| < + ) and constructing an ℓ2norm-based ϵ-ball Bϵ(β ) Y1, we have for all j J1(β ), j J0\J1(β ), j / J1(β) for all β Bϵ(β ) due to |vj(z = z0, y = β) vj (z = z0, y = β)| δ/3, β Bϵ(β ). So this means for the split {Jk(β)} of any β Bϵ(β ), we have J1(β) J1(β ) and therefore |J1(β)| |J1(β )|. Now by definition of β arg minβ Y C(β) and |J1(β )| = C(β ), we have J1(β) = J1(β ) for all β Bϵ(β ). 3. One can show that |J1(β )| = 1, otherwise by definition of the split (Eq. (26)) and the above point, there exists j = j J1(β ) such that vj(z, y = β) = vj (z, y = β) for all z Z and β Bϵ(β ), a contradiction to assumption (b4) by setting Y = Bϵ(β ). Now assume that j J1(β ) is the only index in the subset, then the fact proved in the above point that J1(β) = J1(β ) for all β Bϵ(β ) means i:(i,j) S0 γijui(y = β, x) = 0, x X, β Bϵ(β ), again a contradiction to assumption (b3) by setting Y = Bϵ(β ). The above 3 points indicate that Eq. (28) cannot hold for all β Y1 (and therefore for all β Y) under assumptions (b3) - (b6), therefore a contradiction is reached. B.3. Linear independence in the non-parametric case The previous result can be used to show conditions for the linear independence of the joint distribution family PT A in the non-parametric case. Theorem B.4. Define the following joint distribution family pa1,a2:T (z1:T ) = pa1(z1) t=2 pat(zt|zt 1), pa1 ΠA, pat PA, t = 2, ..., T and assume ΠA and PA satisfy assumptions (b1)-(b6) as follows, (c1) ΠA and PA satisfy (b1) and (b2): positive function values and unique indexing, (c2) ΠA satisfies (b3), and (c3) PA satisfies (b4)-(b6). Then the following statement holds: For any T 2 and any subset Z Rm The joint distribution family contains linearly independent distributions for (z1:T 1, z T ) R(T 1)m Z. On the Identifiability of Switching Dynamical Systems Proof. We proceed to prove the statement by induction as follows. Here we set I = J = A. (1) T = 2: The result can be proved using Lemma B.3 by setting in the proof, ui(y = z1, x = z0) = πi(z1), i A and vj(z = z2, y = z1) = pj(z2|z1), j A. (2) T > 2: Assume the statement holds for the joint distribution family when T = τ 1. Note that we can write pa1:τ (z1:τ) as pa1:τ (z1:τ) = pa1,a2:τ 1(x1:τ 1)paτ (zτ|zτ 1). Then the statement for T = τ can be proved using Lemma B.3 by setting ui(y = zτ 1, x = z1:τ 2) = pa1:τ 1(z1:τ 1), i = a1:τ 1, and vj(z = zτ, y = zτ 1) = paτ (zτ|zτ 1), j = aτ. 
Note that the family spanned with pa1:τ 1(z1:τ 1), i = a1:τ 1 satisfies (b1) and (b2) from ΠA and PA directly, and (b3) from the induction hypothesis. With the result above, one can construct identifiable Markov Switching Models as long as the initial and transition distributions are consistent with assumptions (c1)-(c3). B.4. Linear independence in the non-linear Gaussian case As described, in the final step of the proof we explore properties of the Gaussian transition and initial distribution families (Eqs. (8) and (10) respectively). The unique indexing assumption of the Gaussian transition family (Eq. (9)) implies linear independence as shown below. Proposition B.5. Functions in GA are linearly independent on variables (zt, zt 1) if the unique indexing assumption (Eq. (9)) holds. Proof. Assume the statement is false, then there exists A0 A and a set of non-zero values {γa|a A0}, such that X a A0 γa N(zt; m(zt 1, a), Σ(zt 1, a)) = 0, zt, zt 1 Rm. In particular, this equality holds for any zt 1 Rm, meaning that a weighted sum of Gaussian distributions (defined on zt) equals to zero. Note that Yakowitz & Spragins (1968) proved that multivariate Gaussian distributions with different means and/or covariances are linearly independent. Therefore the equality above implies for any zt 1 m(zt 1, a) = m(zt 1, a ) and Σ(zt 1, a) = Σ(zt 1, a ) a, a A0, a = a , a contradiction to the unique indexing assumption. We now draw some connections from the previous Gaussian families to assumptions (b1-b6) in Lemma B.3. Proposition B.6. The conditional Gaussian distribution family GA (Eq. (8), under the unique indexing assumption (Eq. (9), satisfies assumptions (b1), (b2) and (b5) in Lemma B.3, if we define VJ := GA, z := zt and y := zt 1. Proposition B.7. The initial Gaussian distribution family IA (Eq. (10), under the unique indexing assumption (Eq. (11), satisfies assumptions (b1), (b2) and (b3) in Lemma B.3, if we define UI := IA, y := z1 and x = X = . To see why GA satisfies (b5), notice Gaussian densities are analytic in zt. Similar ideas apply to show that IA satisfies (b3). With the previous results, we can rewrite the previous result for the non-linear Gaussian case. Theorem B.8. Define the following joint distribution family under the non-linear Gaussian model pa1,a2:T (z1:T ) = pa1(z1) t=2 pat(zt|zt 1), at A, pa1 IA, pat GA, t = 2, ..., T with GA, IA defined by Eqs. (8), (10) respectively. Assume: (d1) Unique indexing for GA and IA: Eqs. (9), (11) hold; (d2) Continuity for the conditioning input: distributions in GA are continuous w.r.t. zt 1 Rm; On the Identifiability of Switching Dynamical Systems (d3) Zero-measure intersection in certain region: there exists a non-zero measure set X0 Rm s.t. {zt 1 X0|m(zt 1, a) = m(zt 1, a ), Σ(zt 1, a) = Σ(zt 1, a )} has zero measure, for any a = a ; Then, the joint distribution family contains linearly independent distributions for (z1:T 1, z T ) R(T 1)m Rm. Proof. Note that assumptions (b1) - (b3) and (b5) are satisfied due to Propositions B.6 and B.7, and assumptions (b6) and (d2) are equivalent, and assumption (b4) holds due to assumption (d3). To show (d3) = (b4), We first define VJ := GA, z := zt, and y := zt 1 from Prop. B.6. From (d3), Y := X0 and note that VJ contains linear independent functions on (z, y) M Z Y if M = Z D, where D denotes the set where intersection of moments happen within Y. Also by (d3), D has measure zero and thus, (b4) holds since Y is a non-zero measure set. Then, the statement holds by Theorem B.4. B.5. 
Concluding the proof Below we formally state the proof for Theorem 3.2 by further assuming parametrisations of the Gaussian moments via analytic functions, i.e. assumption (a2). Proof. (a1) and (d1) are equivalent. Following (a2), let m( , a) : Rm Rm be a multivariate analytic function, which allows a multivariate Taylor expansion. The corresponding Taylor expansion of m( , a) implies (d2). Similar logic applies to Σ( , a). To show (d3), we note for any a = a the set of intersection of moments, i.e. {z Rm|m(z, a) = m(z, a ), Σ(zt 1, a) = Σ(zt 1, a )} can be separated as the intersection of the sets {z Rm|m(x, a) = m(z, a )} and {z Rm|Σ(zt 1, a) = Σ(zt 1, a )}. Wlog, the set {z Rm|m(z, a) = m(z, a )} is the zero set of an analytic function f := m( , a) m( , a ). Proposition 0 in Mityagin (2015) shows that the zero set of a real analytic function on Rm has zero measure unless f is identically zero. Hence, the intersection of moments has zero measure from our premise of unique indexing. Since (d1-d3) are satisfied, by Theorem B.8 we have linear independence of the joint distribution family, which by Theorem B.2 implies identifiability of the MSM in the sense of Def. 3.1. C. Properties of the MSM In this section, we present some results involving MSMs for convenience. First, we start with the result on first-order stationary Markov chains presented in section 3.1. Corollary C.1. Consider an identifiable MSM from Def. 3.1, where the prior distribution of the states p(s1:T ) follows a first-order stationary Markov chain, i.e p(s1:T ) = πs1Qs1,s2 . . . Qs T 1,s T , where π denotes the initial distribution: p(s1 = k) = πk, and Q denotes the transition matrix: p(st = k|st 1 = l) = Ql,k. Then, π and Q are identifiable up to permutations. Proof. From Def. 3.1, we have K = ˆK and for every 1 i KT there is some 1 j ˆKT such that ci = ˆcj. Now writing s1:T = (si 1, ..., si T ) = φ(i) and ˆs1:T = (ˆsj 1, ..., ˆsj T ) = φ(j), we have ci = πsi 1Qsi 1,si 2 . . . Qsi T 1,si T = ˆπˆsj 1 ˆQˆsj 1,ˆsj 2 . . . ˆQˆsj T 1,ˆsj T = ˆcj. Since the joint distributions are equal on s1:T , they must also be equal on s1:T 1. Therefore, we have Qsi T 1,si T = ˆQˆsj T 1,ˆsj T . Similar logic applies to t 1, which also implies πsi 1 = ˆπˆsj 1. For t = 1, the above implies that for each i {1, ..., K}, there exists some j {1, ..., K}, such that πi = ˆπj. This indicates permutation equivalence. We denote σ( ) as such permutation function, so that for all i {1, ..., K} and the corresponding j, πi = πj = πσ(j). For t > 1, the previous implication gives us that for i, j {1, . . . , K}, k, l {1, . . . , K} such that Qi,j = ˆQk,l. Following the previous logic, we can define permutations that match Q and ˆQ: i = σ1(k), j = σ2(l). We observe from the second requirement in Def. 3.1 that if i = j, then k = l and since σ1(k) = σ2(l), we have that the permutations σ1( ) and σ2( ) must be equal. Therefore, we have Qi,j = ˆQk,l = Qσ1(k),σ1(l). On the Identifiability of Switching Dynamical Systems Finally, we can use the second requirement in Def. 3.1 to see that σ( ) and σ1( ) must be equal. Proposition C.2. The joint distribution of the Markov Switching Model with Gaussian analytic transitions and Gaussian initial distributions is closed under factored invertible affine transformations, z 1:T = H(z1:T ): z t = Azt + b, 1 t T. Proof. 
Consider the following affine transformation z t = Azt + b for 1 t T, and the joint distribution of a Markov Switching Model with T timesteps p(z1:T ) = X s1:T p(s1:T )p(z1|s1) t=2 p(zt|zt 1, st), where we denote the initial distribution as p(z1|s1 = i) = N(z1; µ(i), Σ1(i)) and the transition distribution as p(zt|zt 1, st = i) = N(zt; m(zt 1, i), Σ(zt 1, i)). We need to show that the distribution still consists of Gaussian initial distributions and Gaussian analytic transitions. Let us consider the change of variables rule, which we apply to p(z1:T ) pz 1:T (z 1:T ) = pz1:T A 1 1:T (z 1:T b1:T ) det(A1:T ) , where we use the subscript z 1:T to indicate the probability distribution in terms of z 1:T , but we drop it for simplicity. Note that the inverse of a block diagonal matrix can be computed as the inverse of each block, and we use similar properties for the determinant, i.e. det(A1:T ) = det(A) det(A). The distribution in terms of the transformed variable is expressed as follows: p(z 1:T ) = X s1:T p(s1:T )p A 1 (z 1 b) |s1 p A 1 (z t b) | A 1 z t 1 b , st i1,...,i T p (s1:T = {i1, . . . , i T }) N z 1; Aµ(i1) + b, AΣ1(i1)AT (2π)m det AΣ A 1 z t 1 b , it AT A 1 (z t b) m A 1 z t 1 b , it T Σ A 1 z t 1 b , it 1 A 1 (z t b) m A 1 z t 1 b , it ! i1,...,i T p (s1:T = {i1, . . . , i T }) N z 1; Aµ(i1) + b, AΣ1(i1)AT (2π)m det AΣ A 1 z t 1 b , it AT z t Am A 1 z t 1 b , it b T A 1Σ A 1 z t 1 b , it 1 A T z t Am A 1 z t 1 b , it b ! i1,...,i T p (s1:T = {i1, . . . , i T }) N z 1; Aµ(i1) + b, AΣ1(i1)AT t=2 N z t; Am A 1 z t 1 b , it + b, AΣ A 1 z t 1 b , it AT On the Identifiability of Switching Dynamical Systems We observe that the resulting distribution is a Markov Switching Model with changes in the Gaussian initial and transition distributions, where the analytic transitions are transformed as follows: bmm (z t 1, it) = Am A 1 z t 1 b , it + b, and Σ (z t 1, it) = AΣ A 1 z t 1 b , it AT for any it {1, . . . , K}. D. Proof of SDS identifiability D.1. Preliminaries We need to introduce some definitions and results that will be used in the proof. These have been previously defined in Kivva et al. (2022). Definition D.1. Let D0 D Rn be open sets. Let f0 : D0 R. We say that an analytic function f : D R is an analytic continuation of f0 onto D if f(x) = f0(x) for every x D0. Definition D.2. Let x0 Rm and δ > 0. Let p : B(x0, δ) R. Define Ext(p) : Rm R to be the unique analytic continuation of p on the entire space Rm if such a continuation exists, and to be 0 otherwise. Definition D.3. Let D0 D and p : D R be a function. We define p|D0 : D R to be a restriction of p to D0, namely a function that satisfies p|D0(x) = p(x) for every x D0. Definition D.4. Let f : Rm Rn be a piece-wise affine function. We say that a point x f(Rm) Rn is generic with respect to f if the pre-image f 1({x}) is finite and there exists δ > 0, such that f : B(z, δ) Rn is affine for every z f 1({x}). Lemma D.5. If f : Rm Rn is a piece-wise affine function such that {x Rn : |f 1({x})| > 1} f(Rm) has measure zero with respect to the Lebesgue measure on f(Rm), then dim(f(Rm)) = m and almost every point in f(Rm) is generic with respect to f. D.2. Proof of theorem 3.5.(i) We extend the results from Kivva et al. (2022) to using our MSM family as prior distribution for z1:T . The strategy requires finding some open set where the transformations F and G from two equally distributed SDSs are invertible, and then use analytic function properties to establish the identifiability result. 
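Since the affine-closure property just established is used repeatedly in the remainder of this section, we restate its conclusion in compact form below; this is only a cleaned-up summary of the transformed moments derived in the proof of Proposition C.2, using the same notation.

```latex
% Transformed MSM under the factored affine map z'_t = A z_t + b (Prop. C.2):
\begin{align*}
  p(z'_1 \mid s_1 = i) &= \mathcal{N}\!\big(z'_1;\; A\mu(i) + b,\; A\Sigma_1(i)A^{\top}\big), \\
  m'(z'_{t-1}, i)      &= A\, m\!\big(A^{-1}(z'_{t-1} - b),\, i\big) + b, \\
  \Sigma'(z'_{t-1}, i) &= A\, \Sigma\!\big(A^{-1}(z'_{t-1} - b),\, i\big)\, A^{\top},
\end{align*}
% i.e. the transformed process is again an MSM with Gaussian analytic transitions.
```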
First, we need to show that the points in the pre-image of a piece-wise factored mapping F can be computed using the MSM prior. Lemma D.6. Consider a random variable z1:T which follows a Markov Switching Model distribution. Let us consider f : Rm Rm, a piece-wise affine mapping which generates the random variable x1:T = F(z1:T ) as xt = f(zt), 1 t T. Also, consider x(0) Rm a generic point with respect to f. Then, x(0) 1:T = {x(0), . . . , x(0)} RT m is also a generic point with respect to F and the number of points in the pre-image F 1({x(0) 1:T }) can be computed as F 1 n x(0) 1:T o = lim δ 0 x1:T RT m Ext p|B(x(0) 1:T ,δ) Proof. If x(0) Rm is a generic point with respect to f, x(0) 1:T is also a generic point with respect to F since the preimage is F({x(0) 1:T }) now larger but still finite. In other words, F({x(0) 1:T }) is the Cartesian product Z Z Z, where Z = {z1, z2, . . . , zn} are the points in the pre-image f({x(0)}). Considering this, we have well defined affine mappings Gi1,...,i T : B({zi1, . . . , zi T }, ϵ) Rm, it {1, . . . , n} for 1 t T, such that Gi1,...,i T = F(z1:T ), z1:T B({zi1, . . . , zi T }, ϵ). This affine mapping Gi1,...,i T is factored as follows: git(zt) = f(zt), zt B(zi, ϵ) Gi1,...,i T = Ai1 . . . 0 ... ... ... 0 . . . Ai T bi1 ... bi T On the Identifiability of Switching Dynamical Systems Let δ0 > 0 such that B(x(0) 1:T , δ0) i1,...,i T Gi1,...,i T (B({zi1, . . . , zi T }, ϵ)) we can compute the likelihood for every x1:T B(x(0) 1:T , δ ) with 0 < δ < δ0 using Prop. C.2 where the MSM is closed under factored affine transformations. p|B(x(0) 1:T ,δ) = j1,...,j T p (s1:T = {j1, . . . , j T }) N x1; Ai1µ(j1) + bi1, Ai1Σ1(j1)AT i1 t=2 N xt; Aitm A 1 it 1 xt 1 bit 1 , jt + bit, AitΣ A 1 it xt 1 bit 1 , jt AT it Where the previous density is an analytic function which is defined on an open neighbourhood of x(0) 1:T . Then from the identity theorem of analytic functions the resulting density defines the analytic extension of p|B(x(0) 1:T ,δ) on Rm. Then, we have Z x1:T RT m Ext p|B(x(0) 1:T ,δ) j1,...,j T p (s1:T = {j1, . . . , j T }) N x1; Ai1µ(j1) + bi1, Ai1Σ1(j1)AT i1 xt; Aitm A 1 it 1 xt 1 bit 1 , jt + bit, AitΣ A 1 it 1 xt 1 bit 1 , jt AT it j1,...,j T p (s1:T = {j1, . . . , j T }) N x1; Ai1µ(j1) + bi1, Ai1Σ1(j1)AT i1 xt; Aitm A 1 it 1 xt 1 bit 1 , jt + bit, AitΣ A 1 it 1 xt 1 bit 1 , jt AT it i1,...,i T 1 = n T = F 1({x(0) 1:T }) We can deduce the following corollary as in Kivva et al. (2022). Corollary D.7. Let F, G : RT m RT n be factored piece-wise affine mappings, with xt := f(zt) and x t := g(z t), for 1 t T. Assume f and g are weakly-injective (Def. 2.1). Let z1:T and z 1:T be distributed according to the identifiable MSM family. Assume F(z1:T ) and G(z 1:T ) are equally distributed. Assume that for x0 Rn and δ > 0, f is invertible on B(x0, 2δ) f(Rm). Then, for x(0) 1:T = {x0, . . . , x0} RT n there exists x(1) 1:T B(x(0) 1:T , δ) and δ1 > 0 such that both F and G are invertible on B(x(1) 1:T , δ1) F(RT m). Proof. First, we observe that since F is a factored mapping, if f is invertible on B(x0, 2δ) f(Rm), we can compute the inverse of F on B(x(0) 1:T , 2δ) F(Rm) for x(0) 1:T = {x0, . . . , x0} RT n as F 1(x1:T ) = {f 1(x1), . . . , f 1(x T )} RT m, for xt B(x0, 2δ) f(Rm), 1 t T. Then, F is invertible on B(x(0) 1:T , 2δ) F(Rm). By Lemma D.5, almost every point x B(x(0), δ) f(Rm) is generic with respect to f and g, as both mappings are weakly injective. 
As discussed previously, if x(0) f(Rm) is a generic point with respect to f, the point x(0) 1:T = {x(0), . . . , x(0)} On the Identifiability of Switching Dynamical Systems F(RT m) is also generic with respect to F, as the finite points in the preimage f 1({x(0)}) extend to finite points in the preimage F 1({x(0) 1:T }). Then, almost every point x1:T B(x(0) 1:T , δ) F(RT m) is generic with respect to F and G. Consider now x(1) 1:T = {x(1), . . . , x(1)} B(x(0) 1:T , δ) such a generic point. From the invertibility of F on B(x(1) 1:T , δ), we have |F 1({x(1) 1:T })| = 1. By Lemma D.6, we have that |G 1({x(1) 1:T })| = 1, as x(1) 1:T is generic with respect to F and G. Then, there exists, δ > δ1 > 0 such that on B(x(1) 1:T , δ1) F(RT m) B(x(0) 1:T , 2δ) F(RT m) the function G is invertible. We need an additional result to prepare the proof for Theorem 3.5.(i). Theorem D.8. Let F, G : RT m RT n be factored piece-wise affine mappings, with xt := f(zt) and x t := g(z t), for 1 t T. Let z1:T and z 1:T be distributed according to the identifiable MSM family. Assume F(z1:T ) and G(z 1:T ) are equally distributed, and that there exists x(0) 1:T RT n and δ > 0 such that F and G are invertible on B(x(0) 1:T , δ) f(RT m). Then there exists an invertible factored affine transformation H such that H(z1:T ) = z 1:T . Proof. From the invertibility of F and G in B(x(0) 1:T , δ) f(RT m) we can find a Tm-dimensional affine subspace B(x(1) 1:T , δ1) L, where δ1 > 0, B(x(1) 1:T , δ1) B(x(0) 1:T , δ), and L RT n such that HF, HG : RT m L are a pair of invertible affine functions where H 1 F and H 1 G coincide with F 1 and G 1 on B(x1, δ1) L respectively. The fact that F and G are factored implies that HF, HG are also factored. To see this, we observe that the inverse of a block diagonal matrix is the inverse of each block, as an example for F, we first have that H 1 F must be forcibly factored since it needs to coincide with F 1. H 1 F (x1:T ) = Af . . . 0 ... ... ... 0 . . . Af f 1(x1) ... f 1(x T ) then we can take the inverse to obtain the factored HF. A 1 f . . . 0 ... ... ... 0 . . . A 1 f A 1 f . . . 0 ... ... ... 0 . . . A 1 f Af . . . 0 ... ... ... 0 . . . Af , where Af = A 1 f , and bf = A 1 f bf Since F(z1:T ) and G(z 1:T ) are equally distributed, we have that HF(z1:T ) and HG(z 1:T ) are equally distributed on B(x(1) 1:T , δ1) L. From Prop C.2, we know that HF(z1:T ) and HG(z 1:T ) are distributed according to the identifiable MSM family, which implies HF(z1:T ) = HG(z 1:T ), and also H 1 G (HF(z1:T )) = z 1:T , where H = H 1 G HF is an affine transformation. From the previous result and Theorem 3.2, there exists a permutation σ( ), such that mfg(z , k) = m (z , σ(k)) for 1 k K. m (z , σ(k)) = mfg(z , k) = A 1 g mf (Agz + bg, k) A 1 g bg = A 1 g Afm A 1 f Agz + A 1 f (bg bf) , k + A 1 g (bf bg) = Am A 1(z b), k + b, where A = A 1 g Af and b = A 1 g (bf bg). Similar implications apply for Σ(z, a), a A. Now we have all the elements to prove Theorem 3.5.(i). Proof. We assume there exists another model that generates the same distribution from Eq.(13), whith a prior p MT NL under assumptions (a1-a2), and non-linear emmision F , composed by f which is weakly injective and piece-wise linear: i.e. (F#p)(x1:T ) = (F #p )(x1:T ). On the Identifiability of Switching Dynamical Systems From weakly-injectiveness, at least for some x0 Rn and δ > 0, f is invertible on B(x0, 2δ) f(Rm). 
This satisfies the preconditions from Corollary D.7, which implies there exists x(1) 1:T B(x(0) 1:T , δ) and δ1 > 0 such that both F and F are invertible on B(x(1) 1:T , δ1) F(RT m). Thus, by Theorem D.8, there exists an affine transformation H such that H(z1:T ) = z 1:T , which means that p MT NL is identifiable up to affine transformations. D.3. Proof of theorem 3.5.(ii) So far we have proved the identifiability of the transition function on the latent MSM distribution up to affine transformations. By further assuming injectivity of the piece-wise mapping F, we can prove identifiability of F up to affine transformations by re-using results from Kivva et al. (2022). We begin by stating the following known result. Lemma D.9. Let Z PJ j=1 λj N(µj, Σj) where Z is a GMM (in reduced form). Assume that f : Rm Rm is a continuous piecewise affine function such that f(Z) Z. Then f is affine. We can state the identification of F. Theorem D.10. Let F, G : Rm T Rn T be continuous invertible factored piecewise affine functions. Let z1:T , z 1:T be random variables distributed according to MSMs. Suppose that F(z1:T ) and G(z 1:T ) are equally distributed. Then there exists a factored affine transformation H : Rm T Rm T such that H(z1:T ) = z 1:T and G = F H 1. Proof. From Theorem D.8, there exists an invertible affine transformation H1 : Rm T Rm T such that H1(z1:T ) = z 1:T . Then, F(z1:T ) G(H1(z1:T )). From the invertibility of G, we have z1:T (H 1 1 G 1 F)(z1:T ). We note that H1, G, F are factored mappings, and structured as follows H 1 1 G 1 F (z1:T ) = h 1 1 g 1 f (z1) ... h 1 1 g 1 f (z T ) (z1) ... (z T ) where the inverse of H1 is also factored, as observed from previous results. Since the transformation is equal for 1 t T, we can proceed for t = 1 considering z1 is distributed as a GMM (in reduced form), as it corresponds to the initial distribution of the MSM. Then, by Lemma D.9, there exists an affine mapping h2 : Rm Rm such that h 1 1 g 1 f = h2. Then, g h1 h2 ... g h1 h2 where h = h1 h2. Considering the invertibility of G and the fact that F(z1:T ) and G(z 1:T ) are equally distributed, we also have H(z1:T ) = z 1:T . We use the previous result to prove Theorem 3.5.(ii). Proof. We assume there exists another model that generates the same distribution from Eq.(13), whith a prior p MT NL under assumptions (a1-a2), and non-linear emmision F , composed by f which is continuous, injective and piece-wise linear: i.e. (F#p)(x1:T ) = (F #p )(x1:T ). These are the preconditions to satisfy Theorem D.10, which implies there exists an affine transformation H such that H(z1:T ) = z 1:T and F = F H. In other words, the prior p MT NL, and f which composes F are identifiable up to affine transformations. E. Estimation details We provide additional details from the descriptions in the main text. On the Identifiability of Switching Dynamical Systems E.1. Expectation Maximisation on MSMs For convenience, the expressions below are computed from samples {zb 1:T }B b=1 for a batch of size B. Recall we formulate our method in terms of the expectation maximisation (EM) algorithm. Given some arrangement of the parameter values (θ ), the E-step computes the posterior distribution of the latent variables pθ (s1:T |z1:T ). This can then be used to compute the expected log-likelihood of the complete data (latent variables and observations), L(θ, θ ) := 1 b=1 Epθ (sb 1:T |zb 1:T ) log pθ(zb 1:T , sb 1:T ) . 
(30)

Given a first-order stationary Markov chain, we denote the posterior probability p_θ(s^b_t = k | z^b_{1:T}) as γ^b_{t,k}, and the joint posterior of two consecutive states p_θ(s^b_t = k, s^b_{t-1} = l | z^b_{1:T}) as ξ^b_{t,k,l}. For this case, the result is equivalent to the HMM case and can be found in the literature, e.g. Bishop (2006). We can then compute a more explicit form of Eq. (30):

L(θ, θ') = (1/B) \sum_{b=1}^{B} [ \sum_{k=1}^{K} γ^b_{1,k} log π_k + \sum_{t=2}^{T} \sum_{k=1}^{K} \sum_{l=1}^{K} ξ^b_{t,k,l} log Q_{lk} + \sum_{k=1}^{K} γ^b_{1,k} log p_θ(z^b_1 | s^b_1 = k) + \sum_{t=2}^{T} \sum_{k=1}^{K} γ^b_{t,k} log p_θ(z^b_t | z^b_{t-1}, s^b_t = k) ],  (31)

where π and Q denote the initial and transition distributions of the Markov chain. In the M-step, the previous expression is maximised to obtain the update rules for the parameters, i.e. θ_new = arg max_θ L(θ, θ'). The updates for π and Q are obtained using standard results for HMM inference (again, see Bishop (2006)). Assuming Gaussian initial and transition densities, we can also use standard results from the literature to update the initial mean and covariance. For the transition densities, we consider a family with fixed covariance matrices, where only the means m_θ(·, k) depend on the previous observation. In this case, the standard results can also be used to update the covariances of the transition distributions. We drop the subscript θ for convenience. The updates for the mean parameters depend on the function class we choose. For multivariate polynomials of degree P, we can recover an exact M-step by transforming the mapping into a matrix-vector operation:

m(z_{t-1}, k) = \sum_{c=1}^{C} A_{k,c} ẑ_{c,t-1},   ẑ_{t-1}^T = (1, z_{t-1,1}, ..., z_{t-1,d}, z_{t-1,1}^2, z_{t-1,1} z_{t-1,2}, ...),  (32)

where ẑ_{t-1} ∈ R^C denotes the polynomial features of z_{t-1} up to degree P. The total number of features is C = \binom{P+d}{d}, and the exact update for A_k is

A_k = ( \sum_{b=1}^{B} \sum_{t=2}^{T} γ^b_{t,k} z^b_t (ẑ^b_{t-1})^T ) ( \sum_{b=1}^{B} \sum_{t=2}^{T} γ^b_{t,k} ẑ^b_{t-1} (ẑ^b_{t-1})^T )^{-1}.

For exact updates such as the one above, we require B to be sufficiently large to ensure consistent updates during training. In the main text, we already discussed the case where the transition means are parametrised by neural networks.

E.2. Variational Inference for SDSs

We provide more details on the ELBO objective for SDSs:

log p_θ(x_{1:T}) = log ∫ \sum_{s_{1:T}} p_θ(x_{1:T}, z_{1:T}, s_{1:T}) dz_{1:T}  (34)
≥ E_{q_{φ,θ}(z_{1:T}, s_{1:T} | x_{1:T})} [ log ( p_θ(x_{1:T}, z_{1:T} | s_{1:T}) p_θ(s_{1:T}) / q_φ(z_{1:T}, s_{1:T} | x_{1:T}) ) ]  (35)
= E_{q_φ(z_{1:T} | x_{1:T})} [ log ( p_θ(x_{1:T} | z_{1:T}) / q_φ(z_{1:T} | x_{1:T}) ) + E_{p_θ(s_{1:T} | z_{1:T})} [ log ( p_θ(z_{1:T} | s_{1:T}) p_θ(s_{1:T}) / p_θ(s_{1:T} | z_{1:T}) ) ] ]  (36)
= E_{q_φ(z_{1:T} | x_{1:T})} [ log ( p_θ(x_{1:T} | z_{1:T}) / q_φ(z_{1:T} | x_{1:T}) ) + log p_θ(z_{1:T}) ]  (37)
≈ log p_θ(x_{1:T} | z_{1:T}) + log p_θ(z_{1:T}) − log q_φ(z_{1:T} | x_{1:T}),   z_{1:T} ∼ q_φ,  (38)

where, as mentioned, we compute the ELBO objective using Monte Carlo integration with samples z_{1:T} from q_φ, and p_θ(z_{1:T}) is computed using Eq. (20).
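To make the Monte Carlo estimate of Eq. (38) concrete, below is a minimal PyTorch-style sketch; the callables `posterior`, `decoder_log_prob` and `msm_log_prob` are hypothetical placeholders for the amortised encoder, the emission likelihood, and the MSM marginal of Eq. (20), not names from our implementation.

```python
import torch

def elbo_single_sample(x, posterior, decoder_log_prob, msm_log_prob):
    """One-sample Monte Carlo estimate of Eq. (38).

    posterior(x)     -> a torch.distributions object q_phi(z_{1:T} | x_{1:T})
                        supporting rsample() and log_prob().
    decoder_log_prob -> callable returning log p_theta(x_{1:T} | z_{1:T}).
    msm_log_prob     -> callable returning log p_theta(z_{1:T}) via Eq. (20),
                        i.e. marginalising the discrete states of the MSM prior.
    """
    q = posterior(x)                           # amortised variational posterior
    z = q.rsample()                            # reparametrised sample z_{1:T} ~ q_phi
    log_px_given_z = decoder_log_prob(x, z)    # reconstruction term
    log_pz = msm_log_prob(z)                   # MSM prior marginal likelihood
    log_qz = q.log_prob(z)                     # entropy term (enters negatively)
    return log_px_given_z + log_pz - log_qz    # ELBO estimate, Eq. (38)
```

Because the sample is drawn with rsample(), gradients with respect to the encoder parameters flow through z when this quantity is maximised.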
Alternatively, Dong et al. (2020) propose computing the gradients of the latent MSM using the following rule,

∇_θ log p_θ(x_{1:T}, z_{1:T}) = E_{p_θ(s_{1:T} | z_{1:T})} [ ∇_θ log p_θ(x_{1:T}, z_{1:T}, s_{1:T}) ],  (39)

where the objective is similar to Eq. (17). Below we reflect on the main aspects of each method.

Dong et al. (2020) compute the parameters of the latent MSM using a loss term similar to Eq. (17). Although we need to compute the exact posteriors explicitly, we only take the gradient with respect to log p_θ(z_t | z_{t-1}, s_t = k), which is relatively efficient. Unfortunately, the approach is prone to state collapse, and additional loss terms with annealing schedules need to be implemented. Ansari et al. (2021) do not require computing exact posteriors, as the parameters of the latent MSM are optimised using the forward algorithm. The main disadvantage is that back-propagation must flow through the forward computations, which is less efficient. Despite this, the objective used is less prone to state collapse and optimisation becomes simpler. Although both approaches show good performance empirically, we observed that training is a difficult task and requires careful hyper-parameter tuning and multiple initialisations. Note that the methods are not (a priori) theoretically consistent with the previous identifiability results. Since exact inference is not tractable in SDSs, one cannot design a consistent estimator such as MLE. Future developments should focus on combining the presented methods with tighter variational bounds (Maddison et al., 2017) to design consistent estimators for such generative models.

F. Experiment details

We provide additional details to the descriptions in the main text.

F.1. Metrics

Markov Switching Models. Consider K components, where, as described, the evaluation is performed by computing the averaged sum of the distances between the estimated function components. Since we have identifiability of the function forms up to permutations, we need to compute distances over all permutation configurations to resolve this indeterminacy. Therefore, we quantify the estimation error as

err := min_{(k_1,...,k_K) = perm({1,...,K})} (1/K) \sum_{i=1}^{K} d( m(·, i), m̂(·, k_i) ),  (40)

where d(·,·) denotes the L2 distance between functions. We compute an approximate L2 distance by evaluating the functions on points sampled from a random region of R^m and averaging the Euclidean distance; more specifically, we sample 10^5 points in the [−1, 1]^d interval for each evaluation:

d(f, g) := ∫ sqrt( ||f(x) − g(x)||^2 ) dx  (41)
≈ (1/10^5) \sum_{i=1}^{10^5} sqrt( ||f(x^{(i)}) − g(x^{(i)})||^2 ),   x^{(i)} ∼ Uniform([−1, 1]^m).  (42)

Note that resolving the permutation indeterminacy has a cost of O(K!), which for K > 5 already poses problems for both monitoring performance during training and testing. To alleviate this computational cost, we take a greedy approach, where each estimated function component is paired with the ground-truth function with the lowest L2 distance. Note that this can return a suboptimal result when the functions are not estimated accurately, but the computational cost is reduced to O(K^2).

Switching Dynamical Systems. To compute the L2 distance for the transition means in SDSs, we first need to resolve the linear transformation in Eq. (15). Thus, we compute

d( f, f̂ ∘ h ),  (43)

where f and f̂ compose the ground-truth and estimated non-linear emissions F and F̂ respectively, and h denotes the affine transformation. We compute the above L2 norm using 500 generated observations from a held-out test dataset. Finally, we compute the error as in Eq. (40), with the following L2 norm,

d( m(·, i), A m̂( A^{-1}(z − b), σ(i) ) + b ),  (44)

which is taken from Eq. (15). Similarly, we compute the norm using samples of the ground-truth latent variables, generated from the held-out test dataset. To resolve the permutation σ(·), we first compute the F1 score on the segmentation task as indicated, by counting the total true positives, false negatives, and false positives. Then, σ(·) is determined from the permutation with the highest F1 score.
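As an illustration of the distance and matching computations above, here is a minimal NumPy sketch, assuming callables for the ground-truth and estimated transition means; the greedy pairing shown removes each matched ground-truth component, which is one reading of the O(K^2) procedure described above, and all function and variable names are ours.

```python
import numpy as np

def l2_distance(f, g, m, n_samples=100_000, rng=None):
    """Approximate L2 distance of Eqs. (41)-(42) on Uniform([-1, 1]^m) samples."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-1.0, 1.0, size=(n_samples, m))
    return np.mean(np.linalg.norm(f(x) - g(x), axis=-1))

def greedy_matching_error(true_means, est_means, m):
    """Greedy variant of Eq. (40): pair each estimated component with the
    closest remaining ground-truth component (O(K^2) distance evaluations)."""
    K = len(true_means)
    remaining = list(range(K))
    total = 0.0
    for est in est_means:
        dists = [l2_distance(true_means[i], est, m) for i in remaining]
        j = int(np.argmin(dists))
        total += dists[j]
        remaining.pop(j)          # each ground-truth component is used once
    return total / K
```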
F.2. Averaged Jacobian and causal structure computation

Regarding regime-dependent causal discovery, our approach can be considered a functional causal model-based method (see Glymour et al. (2019) for the complete taxonomy). In such methods, the causal structure is usually estimated by inspecting the parameters that encode the dependencies between variables, rather than by performing independence tests (Tank et al., 2021). In the linear case, we can threshold the transition matrix to obtain an estimate of the causal structure (Pamfil et al., 2020). The non-linear case is more complex, since the transition functions are not separable among variables and the Jacobian can differ considerably for different input values. With the help of locally connected networks, Zheng et al. (2018) aim to encode the variable dependencies in the first layer and perform similar thresholding as in the linear case. To encourage the causal structure to be captured in the first layer, and to prevent it from being captured in the following ones, the weights of the first layer are regularised with an L1 loss to encourage sparsity, and all the weights in the network are regularised with an L2 loss. In our experiments, we observe that this approach requires extensive fine-tuning, with the potential to sacrifice the flexibility of the network. Instead, we estimate the causal structure by thresholding the averaged absolute-valued Jacobian with respect to a set of samples. We denote the Jacobian of m̂(z, k) as J_{m̂(·,k)}(z). To ensure that the Jacobian captures the effects of the regime of interest, we use samples from the dataset and classify them with the posterior distribution. In other words, we create K sets of variables, where each set Z_k with size N_k = |Z_k| contains the variables selected using the posterior, i.e. z^{(i)} ∈ Z_k if k = arg max p_θ(s^{(i)} | z_{1:T}), where the index i denotes that z^{(i)} is associated with s^{(i)}. Then, for a given regime k, the matrix that encodes the causal structure, Ĝ_k, is expressed as

Ĝ_k = 1( (1/N_k) \sum_{z^{(i)} ∈ Z_k} | J_{m̂(·,k)}(z^{(i)}) | > τ ),  (45)

where 1(·) is an element-wise indicator function which equals 1 if the argument is true and 0 otherwise. We set τ = 0.05 in our experiments. Finally, the estimated K regime-dependent causal graphs are evaluated in terms of the average F1-score over components.
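A minimal PyTorch sketch of Eq. (45) is given below; `trans_mean` is a hypothetical callable standing in for the learned transition mean m̂(·, k), and the regime assignment of the samples is assumed to have been done beforehand via the posterior arg-max.

```python
import torch

def regime_graph(trans_mean, samples, k, tau=0.05):
    """Estimate the regime-k causal graph by thresholding the averaged
    absolute Jacobian of the transition mean, cf. Eq. (45).

    trans_mean(z, k) -> torch.Tensor of shape (m,), the mean m_hat(z, k).
    samples          -> tensor of shape (N_k, m): latents assigned to regime k.
    """
    jac_sum = None
    for z in samples:
        J = torch.autograd.functional.jacobian(lambda v: trans_mean(v, k), z)
        jac_sum = J.abs() if jac_sum is None else jac_sum + J.abs()
    avg_abs_jac = jac_sum / len(samples)       # averaged absolute-valued Jacobian
    return (avg_abs_jac > tau).int()           # adjacency matrix G_hat_k
```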
F.3. Training specifications

All the experiments are implemented in PyTorch (Paszke et al., 2019) and carried out on NVIDIA RTX 2080Ti GPUs, except for the experiments with videos (synthetic and salsa), where we use NVIDIA RTX A6000 GPUs.

Markov Switching Models. When training polynomials (including the linear case), we use the exact batched M-step updates with batch size 500, train for a maximum of 100 epochs, and stop when the likelihood plateaus. When considering updates in the form of Eq. (17), e.g. for neural networks, we use the ADAM optimiser (Kingma & Ba, 2015) with an initial learning rate of 7·10^-3 and decrease it by a factor of 0.5 on likelihood plateau, up to 2 times. We vary the batch size and maximum training time depending on the number of states and dimensions. For instance, for K = 3 and d = 3, we use a batch size of 256 and train for a maximum of 25 epochs. For other configurations, we decrease the batch size and increase the maximum training time to meet GPU memory requirements. Similar to related approaches (Hälvä & Hyvarinen, 2020), we use random restarts to achieve better parameter estimates.

Switching Dynamical Systems. Since training SDSs requires careful hyperparameter tuning for each setting, we provide details for each setting separately. For the synthetic experiments, we use batch size 100 and train for 100 epochs. We use the ADAM optimiser (Kingma & Ba, 2015) with an initial learning rate of 5·10^-4, and decrease it by a factor of 0.5 every 30 epochs. To avoid state collapse, we perform an initial warm-up phase of 5 epochs, during which we train with fixed discrete-state parameters π and Q, which we set to uniform distributions. We run multiple seeds and select the best model on the ELBO objective. Regarding the network architecture, we estimate the transition means using two-layer networks with cosine activations and 16 hidden dimensions, and the non-linear emission using two-layer networks with Leaky ReLU activations and 64 hidden dimensions. For the inference network, the bi-directional RNN has 2 hidden layers and 64 hidden dimensions, and the forward RNN has an additional 2 layers with 64 hidden dimensions. We use LSTMs for the RNN updates.

For the synthetic videos, we vary some of the above configurations. We use batch size 64 and train for 200 epochs with the same optimiser and learning rate, but we now decrease it by a factor of 0.8 every 80 epochs. Instead of running an initial warm-up phase, we devise a three-stage training scheme. First, we pre-train the encoder (emission) and decoder networks, where for 10 epochs the objective ignores the terms from the MSM prior. The second phase is inspired by Ansari et al. (2021), where we use a softmax with temperature on the logits of π and Q. To illustrate, we use the softmax with temperature as follows:

Q_{k,:} = p(s_t | s_{t-1} = k) = Softmax(o_k / τ),   t ∈ {2, ..., T},  (46)

where o_k are the logits of p(s_t | s_{t-1} = k) and τ is the temperature. We start with τ = 10 and decay it exponentially by a factor of 0.99 every 50 iterations after the pre-training stage. The third stage begins when τ = 1, where we train the model as usual. Again, we run multiple seeds and select the best model on the ELBO. The network architecture is similar, except that we use additional CNNs for inference and generation. For inference, we use a 5-layer CNN with 64 channels, kernel size 3, padding 1, and Leaky ReLU activations, and we alternate between strides 2 and 1. We then run a 2-layer MLP with Leaky ReLU activations and 64 hidden dimensions and forward the embedding to the same RNN inference network described before. For generation, we use a similar network, starting with a 2-layer MLP with Leaky ReLU activations and 64 hidden dimensions, and using transposed convolutions instead of convolutions (with the same configuration as before).

For the salsa dancing videos, we use batch size 8 and train for a maximum of 400 epochs. We use the same optimiser with an initial learning rate of 10^-4 and stop on ELBO plateau. We use a similar three-stage training scheme as before, where we pre-train the encoder-decoder networks for 10 epochs. For the second stage, we start with τ = 100 and decay it by a factor of 0.975 every 50 iterations after the pre-training phase. As always, we run multiple seeds and select the best model on the ELBO. For the network architectures, we use the same as in the previous synthetic video experiment, but with 7-layer CNNs, all network sizes increased to 128, and a latent MSM of 256 dimensions with K = 3 components. The transitions of the continuous latent variables are parametrised with 2-layer MLPs with Softplus activations and 256 hidden dimensions.
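For reference, a short sketch of the tempered transition distribution in Eq. (46) and its decay schedule (shown here with the synthetic-video values) is given below; the logits tensor and the training-loop body are illustrative placeholders rather than our actual training code.

```python
import torch
import torch.nn.functional as F

def transition_matrix(logits, tau):
    """Eq. (46): row k of Q is Softmax(o_k / tau); tau -> 1 recovers the
    unsmoothed transition distribution."""
    return F.softmax(logits / tau, dim=-1)

# Example schedule for the synthetic-video setting: tau starts at 10 and is
# decayed by a factor of 0.99 every 50 iterations until it reaches 1.
tau, decay, every = 10.0, 0.99, 50
logits = torch.randn(3, 3)             # K x K unnormalised transition logits
for it in range(1, 20_001):
    Q = transition_matrix(logits, tau)
    # ... compute the ELBO with this smoothed Q and take a gradient step ...
    if it % every == 0 and tau > 1.0:
        tau = max(1.0, tau * decay)
```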
F.4. Synthetic experiments

For data generation, we sample N = 10000 sequences of length T = 200 from a stationary first-order Markov chain with K states. The transition matrix Q is set to maintain the same state with probability 90% and switch to the next state with probability 10%, and the initial distribution π is the stationary distribution of Q. The initial distributions are Gaussian components with means sampled from N(0, 0.7^2 I) and covariance matrix 0.1^2 I. The covariance matrices of the transition distributions are fixed to 0.05^2 I, and the transition means m(z, k), k = 1, ..., K, are parametrised using polynomials of degree 3 with random weights, random networks with cosine activations, or random networks with softplus activations. For the locally connected networks (Zheng et al., 2018), we use cosine-activation networks, and the sparsity is set to allow 3 interactions per element on average. All neural networks consist of two-layer MLPs with 16 hidden units. When experimenting with SDSs, we use K = 3, T = 200, and N = 5000 sequences of a 2D latent MSM with cosine network transitions and fixed covariances. We then generate observations using two-layer Leaky ReLU networks with 8 hidden units. For synthetic videos, we render a ball on 32×32 coloured images from the 2D coordinates of the latent variables. When rendering images, the MSM trajectories are scaled to ensure the ball is always contained in the image canvas.

In Table 2 and Table 3 we show the mean and 95%-CI from the experiments reported in Figure 4a and Figure 4b respectively, where we use datasets generated with 5 different seeds.

Table 2: Mean and 95%-CI of the L2 distances reported in Figure 4a (computed across 5 datasets), where we experiment with increasing sequence lengths and different network types.

Network type | T = 10        | T = 100       | T = 1000
Cubic        | 4.466 ± 0.814 | 0.795 ± 0.448 | 0.077 ± 0.115
Cosine       | 0.276 ± 0.238 | 0.041 ± 0.032 | 0.037 ± 0.020
Softplus     | 2.249 ± 1.045 | 0.078 ± 0.137 | 0.005 ± 0.002

Table 3: Mean and 95%-CI of the L2 distances reported in Figure 4b (computed across 5 datasets), where we experiment with increasing dimensions and states.

States (K) | m = 3         | m = 5         | m = 10        | m = 20        | m = 30        | m = 50
3          | 0.003 ± 0.005 | 0.004 ± 0.004 | 0.004 ± 0.002 | 0.015 ± 0.005 | 0.012 ± 0.006 | 0.064 ± 0.010
5          | 0.004 ± 0.002 | 0.008 ± 0.003 | 0.009 ± 0.005 | 0.023 ± 0.007 | 0.032 ± 0.005 | 0.052 ± 0.029
10         | 0.011 ± 0.008 | 0.014 ± 0.007 | 0.013 ± 0.003 | 0.031 ± 0.017 | 0.042 ± 0.005 | 0.054 ± 0.023

In Figure 10, we show visualisations of some function responses for the experiment considering increasing variables and states (Figs. 4b and 4c). For K = 3 states and m = 3 dimensions, we achieve a 3·10^-3 L2 distance error on average, and the responses in Figure 10a show low discrepancies with respect to the ground truth. Similar observations can be made with K = 10 states in Figure 10b, where the L2 distance error is 1.1·10^-2 on average. Furthermore, Table 4 shows details of the results reported in Figure 6.
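As a reference for the generation protocol described at the beginning of this subsection, here is a minimal NumPy sketch of sampling a single latent sequence; the callable `trans_means` stands in for one of the three random transition parametrisations, and all function and variable names are ours.

```python
import numpy as np

def sample_msm(K, m, T, trans_means, rng):
    """Sample one sequence from the synthetic MSM of Appendix F.4: a sticky
    Markov chain (90% stay, 10% move to the next state), Gaussian initial
    components, and fixed-covariance Gaussian transitions."""
    Q = 0.9 * np.eye(K) + 0.1 * np.roll(np.eye(K), 1, axis=1)  # transition matrix
    mu0 = rng.normal(0.0, 0.7, size=(K, m))                    # initial means
    s = rng.integers(K)            # Q is doubly stochastic, so pi is uniform
    z = rng.normal(mu0[s], 0.1)    # initial covariance 0.1^2 I
    states, latents = [s], [z]
    for _ in range(T - 1):
        s = rng.choice(K, p=Q[s])                  # next discrete state
        z = rng.normal(trans_means(z, s), 0.05)    # transition covariance 0.05^2 I
        states.append(s)
        latents.append(z)
    return np.array(states), np.stack(latents)
```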
Table 4: Mean and 95%-CI of the L2 distance and state F1 scores reported in Figure 6 (computed across 5 datasets), where we experiment with increasing observed dimensions.

Metric   | Obs. dim. 2   | 5             | 10            | 50            | 100           | Image
L2 dist. | 0.083 ± 0.083 | 0.036 ± 0.043 | 0.033 ± 0.063 | 0.082 ± 0.136 | 0.054 ± 0.083 | 0.384 ± 0.257
State F1 | 0.991 ± 0.008 | 0.999 ± 0.001 | 0.998 ± 0.003 | 0.994 ± 0.015 | 0.999 ± 0.001 | 0.978 ± 0.019

Figure 10: Function responses using 3 dimensions where we vary x2 and x3. Column i shows the response with respect to the i-th dimension. The blue grid shows the ground-truth function for (a) 3 states, showing component 1, and (b) 10 states, showing component 6.

F.5. ENSO-AIR experiment

The data consists of monthly observations of El Niño Southern Oscillation (ENSO) and All India Rainfall (AIR), from 1871 to 2016. Following the setting in Saggioro et al. (2020), we more specifically use the indicators Niño 3.4 SST Index (extracted from https://psl.noaa.gov/gcos_wgsp/Timeseries/Nino34/) and All-India Rainfall precipitation (extracted from https://climexp.knmi.nl/getindices.cgi?STATION=All-India_Rainfall&TYPE=p&WMO=IITMData/ALLIN) respectively. In total, we have N = 1 sample with T = 1752 time steps, and we consider K = 2 components.

Figure 11: Posterior distribution grouped by month.

In the main text, we claim that our approach captures regimes based on seasonality. To visualise this, we group the posterior distribution by month (Fig. 11), where similar groupings arise from both models, and observe that one component is assigned to the Summer months (from May to September) and the other is assigned to the Winter months (from October to April). To better illustrate the seasonal dependence present in this data, we show the function responses assuming linear and non-linear (softplus network) transitions in Figures 12a and 12b respectively. In the linear case, we observe that the function responses of the ENSO variable are invariant across regimes. However, the response of the AIR variable varies across regimes, as we observe that the slope with respect to the ENSO input is zero in Winter and increases slightly in Summer. This visualisation is consistent with the results reported in the regime-dependent graph (Fig. 7). In the non-linear case, we now observe that the responses of the ENSO variable are slightly different, but the slope differences in the responses of the AIR variable with respect to the ENSO input are harder to visualise. The noticeable difference is that the self-dependency of the AIR variable changes non-linearly across regimes, contrary to the linear case where the slope with respect to the AIR input was constant.

Figure 12: Function responses of the ENSO-AIR experiment assuming (a) linear and (b) non-linear softplus networks. Each row shows the function responses for Winter and Summer respectively.

F.6. Salsa experiment

Mocap sequences. The data we consider for this experiment consists of 28 salsa dancing sequences from the CMU mocap data. Each trial consists of a sequence of varying length, where the observations represent the 3D positions of 41 joints of both participants. Following related approaches (Dong et al., 2020), we use the information of one of the participants, which should be sufficient for capturing the dynamics, with a total of 41×3 observations per frame. Then, we subsample the data by a factor of 4, normalise the data, and clip each sequence to T = 200.

Video sequences. To train SDSs, we generate 64×64 video sequences of length T = 200. The dancing sequences are originally obtained from the same CMU mocap data, but they are processed into human meshes using Mahmood et al. (2019) and are available in the AMASS dataset. The total number of processed samples from the CMU salsa dancing sequences is 14.
To generate videos, we subsample the sequences by a factor of 8 and augment the data by rendering human meshes with rotated perspectives and by offsetting the subsampled trajectories. To do so, we adapt the available code from Mahmood et al. (2019), and generate 10080 train samples and 560 test samples. In Figure 14 we show examples of reconstructed salsa videos from the test dataset using our identifiable SDS, where we observe that the method achieves high-fidelity reconstructions.

Figure 13: Forecasting averaged pixel MSE and standard deviation (vertical lines) of our iSDS on test data sequences, plotted against the number of steps since the last input, with a black-image baseline for reference.

Additionally, in Figure 13 we provide forecasting results on the test data for 50 future frames from the last observation. As a reference, at each frame we compare the iSDS predictions with a black image (indicated as the baseline). As we observe, the prediction error increases rapidly. More specifically, the averaged pixel MSE for the first 20 predicted frames is 0.0148, which is high compared to the reconstruction error (2.26·10^-4). Nonetheless, this result is expected for the following reasons. First, the discrete transitions are independent of the observations, which can trigger dynamical changes that are not aligned with the ground-truth transitions. Second, the errors accumulate over time, which, combined with the previous point, can rapidly cause disparities in the predictions. Finally, our formulation does not explicitly train the model for prediction. Although the predicted frames are not aligned with the ground truth, in Figure 15 we show that our iSDS is able to produce plausible future sequences, despite some exceptions (see the 6th forecast). In general, we observe that our proposed model can be used to generate reliable future dancing sequences. We note that, despite the restrictions assumed to achieve identifiability guarantees (e.g. removed feedback from observations in comparison to Dong et al. (2020)), our iSDS serves as a generative model for high-dimensional sequences.

Figure 14: Reconstruction and ground truth of salsa dancing videos from the test dataset.

Figure 15: Forecasts and corresponding ground truth of future T = 50 frames from salsa test samples.
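To illustrate how such forecasts can be produced from the generative model, a rough sketch is given below; it assumes a rollout that samples the discrete chain from the learned Q, propagates the continuous latents through the learned transition means, and decodes each frame, which is a simplification of the actual sampling procedure (e.g. it ignores the transition noise), and all names are ours.

```python
import torch

@torch.no_grad()
def forecast(z_last, s_last, Q, trans_mean, decoder, horizon=50):
    """Roll the generative model forward for `horizon` frames.

    z_last, s_last : last inferred continuous latent and discrete state.
    Q              : K x K learned transition matrix of the discrete chain.
    trans_mean(z,k): learned transition mean of the continuous latents.
    decoder(z)     : learned emission network mapping latents to frames.
    Note that the discrete chain evolves independently of the observations,
    which is one reason forecasts can drift from the ground truth.
    """
    z, s, frames = z_last, s_last, []
    for _ in range(horizon):
        s = torch.multinomial(Q[s], 1).item()   # sample next discrete state
        z = trans_mean(z, s)                    # propagate the mean dynamics
        frames.append(decoder(z))               # decode a predicted frame
    return torch.stack(frames)
```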
We process each sequence as follows: (i) we subsample the video by 4, (ii) we crop each frame to center the dancer position, (iii) we resize each frame to 64 64, and (iv) we crop the length of the video to T = 200 frames. For training, we adopt the same architecture and hyper-parameters as in the salsa dancing video experiment (See Appendix F.3; except we set a batch size of 16 this time). Here we also include a pre-training stage for the encoder-decoder networks, but in this case we include all the available dancing sequences (all genres, except the test samples). In the second stage where the transitions are learned, we use only the Middle Hip Hop sequences. We note that training the i SDS in this dataset is particularly challenging, as we find that the issue of state collapsed reported by related works (Dong et al., 2020; Ansari et al., 2021) is more prominent in this scenario. To mitigate this problem, we train our model using a combination of the KL annealing schedule proposed in Dong et al. (2020) and the temperature coefficient proposed in Ansari et al. (2021), which we already included previously. We start with a KL annealing term of 104 and decay it by a factor of 0.95 every 50 iterations, and with τ = 103, where we decay it by a factor of 0.975 every 100 iterations. We run this second phase for a maximum of 1000 epochs. We show reconstruction and segmentation results in Figure 16, where we observe our i SDS learns different components from the dancing sequences. In general we observe that the sequences are segmented according to different dancing moves, except for some cases where only a prominent mode is present. We also note that different prominent modes are present depending on background information. For example, the blue mode is prominent for white background, and the teal mode is prominent for combined black and white backgrounds. Such findings indicate the possibility of having data artifacts, in which the model performs segmentation based on the background information as they could be correlated with the dancing dynamics. Furthermore, our i SDS reconstructs the video sequences with high fidelity, except for fine-grained details such as the hands. Quantitatively, our approach reconstructs the sequences with an averaged pixel MSE of 7.85 10 3. We consider this quantity is reasonable as it is an order of magnitude higher in comparison to the previous salsa videos, where the sequences were rendered on black backgrounds. On the Identifiability of Switching Dynamical Systems Reconstruction Ground truth Reconstruction Ground truth Reconstruction Ground truth Reconstruction Ground truth Reconstruction Ground truth Reconstruction Ground truth Reconstruction Ground truth Reconstruction Ground truth Figure 16: Reconstruction, ground truth, and segmentation of dancing videos from the AIST Dance DB test set.