# Causal Discovery from Conditionally Stationary Time Series

Carles Balsells-Rodas¹, Xavier Sumba¹, Tanmayee Narendra², Ruibo Tu³, Gabriele Schweikert², Hedvig Kjellström³, Yingzhen Li¹

¹Imperial College London, ²University of Dundee, ³KTH Royal Institute of Technology. Correspondence to: Carles Balsells-Rodas.

*Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).*

## Abstract

Causal discovery, i.e., inferring underlying causal relationships from observational data, is highly challenging for AI systems. In a time series modeling context, traditional causal discovery methods mainly consider constrained scenarios with fully observed variables and/or data from stationary time series. We develop a causal discovery approach to handle a wide class of nonstationary time series that are *conditionally stationary*, where the nonstationary behaviour is modeled as stationarity conditioned on a set of latent state variables. Named State-Dependent Causal Inference (SDCI), our approach is able to recover the underlying causal dependencies, with provable identifiability for the state-dependent causal structures. Empirical experiments on nonlinear particle interaction data and gene regulatory networks demonstrate SDCI's superior performance over baseline causal discovery methods. Improved results over non-causal RNNs on modeling NBA player movements demonstrate the potential of our method and motivate the use of causality-driven methods for forecasting.

## 1. Introduction

Deep learning has achieved profound success in vision and language modelling (Brown et al., 2020; Nichol et al., 2022). Still, it remains a grand challenge for deep neural networks to perform causal discovery (Yi et al., 2020; Girdhar & Ramanan, 2020; Sauer & Geiger, 2021), which is critical for interpretability, generalization, and robustness (Lake et al., 2017; Schölkopf et al., 2021). Theoretically, structure identifiability is key to ensure a unique correspondence between observations and the underlying causal structures (Peters et al., 2017). Practically, better algorithms are needed to accurately extract the causal structure from data.

In time series analysis, causal discovery identifies the underlying temporal causal structure of the observed sequences. Historical approaches rely on a restrictive assumption: stationarity (Granger, 1969; Peters et al., 2017; Tank et al., 2021; Löwe et al., 2022), while real-world data is often nonstationary with potential hidden confounders. To address this issue, three major strategies have been recently proposed: (1) modelling nonstationary noise (Huang et al., 2020; Gong et al., 2023); (2) introducing time-dependent effects with a fixed causal structure (Huang et al., 2015; 2019); and (3) using discrete latent variables to capture structural changes over time (Saggioro et al., 2020). Despite these advances, causal discovery in nonstationary time series under realistic assumptions remains an open challenge.

This work addresses causal discovery for nonstationary time series under a much weaker assumption: *conditionally stationary time series*, where the dynamics of the observed system change depending on a set of discrete state variables. These causal structures can change not only over time, but also across samples (Löwe et al., 2022).
This assumption holds for many real-world scenarios, such as individuals whose decisions depend on mood, past experiences, or interactions with others. The causal discovery task for such conditionally stationary time series poses different challenges depending on the observability and dependency of the states, which we classify into three scenarios:

- **States observed/independent:** The states are observed and/or independent of the observations (Fig. 1a). Structure identifiability can be established for both cases by Peters et al. (2013) and Balsells-Rodas et al. (2024), respectively.
- **States determined:** The states are hidden but depend directly on the observations. In Fig. 1b, the states are determined by the balls' positions (pink vs. purple regions).
- **States recurrent:** The states depend on historical events. E.g., in Fig. 1c, particles change state upon collision; likewise, in a football game a player acts differently based on earlier actions of the others.

Figure 1: Graphical representations of the data generation processes, for (a) states observed, (b) states determined, and (c) states recurrent. $x_t$ represents the observations of a time series, and $s_t$ denotes the state variables. The states affect observations by changing the causal structure $f_t$ for different state values.

We propose a novel framework to tackle both the theoretical and algorithmic challenges of causal discovery from conditionally stationary time series in the states determined and states recurrent cases. Our contributions, which advance both identifiability and estimation, are summarised as follows:

- We introduce the *conditional summary graph* as a compact representation of the underlying causal graph structure for conditionally stationary time series. This efficiently addresses the exponential complexity of the full time graph in summarising the causal structure of time series data with nonstationary interactions between variables.
- We establish identifiability of the conditional summary graph and related structural properties for conditionally stationary time series satisfying the states determined (Fig. 1b) or states recurrent (Fig. 1c) assumptions.
- We propose State-Dependent Causal Inference (SDCI) as a practical algorithm to extract conditional summary graphs and model state dependencies, based on discrete latent variable models and graph neural networks.
- We validate SDCI on semi-synthetic data based on physical and biological systems, and on real-world datasets. Compared to baselines including GNN- and RNN-based approaches, SDCI achieves superior performance in identifying the underlying nonstationary structures in gene regulatory networks, and in forecasting future trajectories from observations of NBA player movements.

## 2. Background

### 2.1. Causal Discovery in Stationary Time Series

Causally-driven time series are often modelled using structural causal models (SCMs; Pearl, 2009; Peters et al., 2017) to describe data generation processes. Consider $N$ sequences of length $T$, denoted by $x_{1:T}$, where $x_t^{(i)} \in \mathbb{R}^d$ denotes the features of the $i$-th sequence at time $t$, which can incorporate higher-order moments, e.g., velocity and acceleration. The data dependencies in a temporal SCM are structured via a causal graph, known as the *full time graph*, $\mathcal{G}^{FT}_{1:T}$. For simplicity, we assume no hidden confounders, no instantaneous effects, and a first-order Markov property. In stationary time series, the full time graph is static across time steps.
This allows us to define the *summary graph* $\mathcal{G}^S = \{V, E^S\}$, where $V = \{1, \dots, N\}$ and an edge $i \to j$ exists in $E^S$ if there is an edge from $x_t^{(i)}$ to $x_{t'}^{(j)}$ in $\mathcal{G}^{FT}_{1:T}$ for some $t < t'$. The identifiability of both $\mathcal{G}^{FT}_{1:T}$ and $\mathcal{G}^S$ is guaranteed under Time Series Models with Independent Noise (TiMINo; Peters et al. (2013)). By further assuming an additive noise model (ANM), the SCM becomes:

$$x_t^{(j)} = f^{(j)}\big(\{x_{t-1}^{(i)} \,|\, x_{t-1}^{(i)} \in \mathrm{PA}^{(j)}(t-1)\}\big) + \varepsilon_t^{(j)}, \tag{1}$$

where $\mathrm{PA}^{(j)}(t-1)$ denotes the parents of $x_t^{(j)}$ (the notation $(t-1)$ indicates that the parents of variable $j$ are considered at the previous time step) and $\varepsilon_t^{(j)} \sim p_\varepsilon$ represents independent noise. In this setting, $\mathrm{PA}^{(j)}(t-1)$ is constant in time and aligns with the summary graph $\mathcal{G}^S$.

### 2.2. Markov Switching Models

Markov Switching Models (MSMs; Hamilton (1989)) extend time series modeling by introducing discrete latent variables $u_t \in \{1, \dots, U\}$ that condition the autoregressive process at each time step $t$. For regime-dependent time series (Saggioro et al., 2020), the full time graph $\mathcal{G}^{FT}_{1:T}$ is time-dependent, but causally stationary (Assaad et al., 2022) given the discrete latent variables $u_t$, $t \in \{1, \dots, T\}$. This leads to defining the *regime-dependent graph*, a set of graphs $\mathcal{G}^{RD}_{1:U} := \{\mathcal{G}^{RD}_u \,|\, 1 \le u \le U\}$, where $\mathcal{G}^{RD}_u$ encodes the causal effects of $x_t$ for $u_t = u$. Our work utilises MSMs under the following key assumptions (m1-m3):

**(m1) Conditional first-order Markov transitions**, controlled by $u_{t-1}$ only: for any $t \in \{2, \dots, T\}$,

$$p(x_t \,|\, x_{1:t-1}, u_{1:t-1}) = p(x_t \,|\, x_{t-1}, u_{t-1}). \tag{2}$$

**(m2) Conditional stationarity**: the transition distributions do not change over time: for any $u \in \{1, \dots, U\}$, any $t \neq t'$, and any $\beta, \gamma \in \mathbb{R}^{Nd}$,

$$p(x_t = \beta \,|\, x_{t-1} = \gamma, u_{t-1} = u) = p(x_{t'} = \beta \,|\, x_{t'-1} = \gamma, u_{t'-1} = u). \tag{3}$$

**(m3) Factorisation based on a Gaussian ANM**:

$$p(x_t \,|\, x_{t-1}, u_{t-1}) = \prod_{j=1}^N \mathcal{N}\Big(x_t^{(j)};\ f^{(j)}_{u_{t-1}}\big(\{x_{t-1}^{(i)} \,|\, x_{t-1}^{(i)} \in \mathrm{PA}^{(j)}_{u_{t-1}}(t-1)\}\big),\ \sigma^2 I\Big), \tag{4}$$

for any $t \in \{2, \dots, T\}$, where $\mathrm{PA}^{(j)}_{u_{t-1}}(t-1)$ denotes the parents of variable $j$ at time $t-1$, as described by $\mathcal{G}^{RD}_{u_{t-1}}$.
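For concreteness, here is a minimal sketch of the data-generating process in (m1)-(m3), using per-variable states as introduced in Section 3.1 below. The linear edge functions, the threshold rule for the states, and all names are our illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, T, sigma = 3, 2, 100, 0.1    # variables, state values, length, noise scale

# One linear coefficient per (state, sender, receiver): A[k, i, j] scales the
# effect of x_{t-1}^{(i)} on x_t^{(j)} when sender i is in state k.
# Zeros encode absent edges, so each A[k] plays the role of one summary graph.
A = rng.normal(size=(K, N, N)) * (rng.random((K, N, N)) < 0.5)

def transition(x_prev, s_prev):
    """One Gaussian-ANM step (m3): the mean of x_t^{(j)} sums the
    state-selected linear effects of every sender i, as in Eq. (4)."""
    mean = np.array([sum(A[s_prev[i], i, j] * x_prev[i] for i in range(N))
                     for j in range(N)])
    return mean + sigma * rng.normal(size=N)

x = [rng.normal(size=N)]
for _ in range(T - 1):
    s = (x[-1] > 0).astype(int)    # hypothetical "states determined" rule (Fig. 1b)
    x.append(transition(x[-1], s))
x = np.stack(x)                    # shape (T, N): one conditionally stationary sample
```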
## 3. State-Dependent Causal Inference (SDCI)

### 3.1. Conditionally Stationary Time Series

We focus on nonstationary time series where each time step $t$ is associated with $N$ latent variables $s_t = \{s_t^{(1)}, \dots, s_t^{(N)}\}$. Each $s_t^{(i)} \in \{1, \dots, K\}$ controls the causal effects of $x_t^{(i)}$ on future observations $x_{t+1}$. In other words, the causal influences of $x_t^{(i)}$ are dynamically modified by the state $s_t^{(i)}$. This can be viewed as a general MSM under assumptions (m1-m3) by setting $u_t = \varphi(s_t)$ with $U = K^N$ global states, for some injective $\varphi : \{1, \dots, K\}^N \to \{1, \dots, U\}$.

Figure 2: (a) Full time graph $\mathcal{G}^{FT}_{1:T}$ of a sample under our problem setting, (b) regime-dependent graph, and (c) conditional summary graph $\mathcal{G}^{CS}_{1:K}$ of the corresponding sample for $K = 2$. We denote states using numbers $\{1, 2\}$ inside each element. Different colors (red and blue) denote effects caused by different states.

Fig. 2a exemplifies a full time graph illustrating these assumptions. Here, the causal effects vary across time as $s_1 \neq s_2 \neq s_3$, leading to nonstationarity. From an MSM view, we can extract a regime-dependent graph with $K^N$ regimes (Fig. 2b). Although the regime-dependent graph captures the general structure, it becomes inefficient, as it scales exponentially with $N$. Furthermore, a summary graph (aggregating all the regimes) can be non-informative, due to its inability to distinguish state-dependent effects. Instead, we define the *conditional summary graph*.

**Definition 3.1** (Conditional summary graph, first-order Markov setting). The conditional summary graph is a set of $K$ graphs, $\mathcal{G}^{CS}_{1:K} = \{\mathcal{G}^{CS}_k \,|\, 1 \le k \le K\}$, where $K$ is the number of possible state values. Each summary graph $\mathcal{G}^{CS}_k := \{V, E^{CS}_k\}$ has the same vertices $V = \{1, \dots, N\}$. An edge $i \to j$ exists in $E^{CS}_k$ if there exists $t$ such that $s_t^{(i)} = k$ and $x_t^{(i)}$ connects to $x_{t+1}^{(j)}$ in $\mathcal{G}^{FT}_{1:T}$.

For simplicity, we omit self-connections in Fig. 2c (black edges in Fig. 2a), where we show the conditional summary graph extracted from the full time graph. For $k = 1$ we observe $s_1^{(3)} = 1$, and a red edge connects $x_1^{(3)}$ and $x_2^{(2)}$, placing a red edge in $E^{CS}_1$. Similar reasoning applies for $\mathcal{G}^{CS}_2$. Compared to summary or regime-dependent graphs (Fig. 2b), the conditional summary graph offers a compact and informative representation of state-dependent causal structures. In summary, the SEM for conditionally stationary time series can be formalised as follows:

$$x_t^{(j)} = f^{(j)}_{s_{t-1}}\big(\{x_{t-1}^{(i)} \,|\, x_{t-1}^{(i)} \in \mathrm{PA}^{(j)}_{s_{t-1}}(t-1)\}\big) + \varepsilon_t^{(j)}. \tag{5}$$

This introduces the following assumption:

**(m4) Multi-state dependence**: The parents $\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$ of variable $j$ at time $t$ are defined by $\mathcal{G}^{CS}_{1:K}$ as:

$$\mathrm{PA}^{(j)}_{s_{t-1}}(t-1) := \big\{x_{t-1}^{(i)} \,|\, j \in C^{(i)}_k,\ k = s_{t-1}^{(i)},\ 1 \le i \le N\big\}. \tag{6}$$

$C^{(i)}_k \subseteq V$ denotes the children of variable $i$ in state $k$, as specified by $\mathcal{G}^{CS}_k$. To illustrate, in Fig. 2a the parents of $x_2^{(2)}$ are $\{x_1^{(1)}, x_1^{(2)}, x_1^{(3)}\}$, as $2 \in C^{(1)}_{s_1^{(1)}}$ and $2 \in C^{(3)}_{s_1^{(3)}}$.

### 3.2. Amortisation of Conditional Effects

The conditional summary graph introduces linear scaling with $K$. However, indexing the SEM in Eq. (5) by $s_{t-1}$ requires $K^N$ functions. Taking inspiration from interactive systems (Kipf et al., 2018; Löwe et al., 2022), we propose a two-stage interaction and aggregation framework, formalised as the following assumption:

**(m5) Global function-type dependence**: The effect between any pair of variables is determined by $n_\epsilon$ functional effects $\mathcal{F} := \{f_0, \dots, f_{n_\epsilon - 1}\}$, where $f_0(\cdot) = 0$ represents the absence of an edge. Each functional effect is state-dependent, i.e., the effects from variable $i$ at time $t$ are determined by its state $s_t^{(i)}$. These effects are collected in the *labels of conditional effects*:

$$\mathcal{W} := \big\{w_{i,j,k} \in \{0, \dots, n_\epsilon - 1\} \,\big|\, 1 \le i, j \le N,\ 1 \le k \le K,\ w_{i,j,k} = 0 \Leftrightarrow j \notin C^{(i)}_k\big\}, \tag{7}$$

where $w_{i,j,k}$ specifies the edge-type $i \to j$ at state $k$, associated with variable $i$. In the aggregation stage, the interactions are combined using a permutation-invariant function $g$:

$$f^{(j)}_{s_{t-1}}(x_{t-1}) := g\Big(x_{t-1}^{(j)},\ \big\{f_{e_i}\big(x_{t-1}^{(i)}, x_{t-1}^{(j)}\big) \,\big|\, i \neq j\big\}\Big), \quad \text{where } e_i = w_{i,j,s_{t-1}^{(i)}}. \tag{8}$$

The function $g$ aggregates interactions towards variable $j$, capturing complex state-dependent dynamics. This formulation effectively reduces the exponential complexity of conditionally stationary time series, and can be efficiently implemented with graph neural networks (GNNs) (Kipf et al., 2018). Furthermore, it aligns with well-established mechanisms in physical systems, such as aggregating forces from pair-wise interactions.
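As a concrete (but hypothetical) rendering of Eq. (8), the sketch below looks up each sender's edge-type from $\mathcal{W}$ according to its state and aggregates pairwise messages with a sum; MLP edge functions and all names are our illustrative assumptions:

```python
import torch

d, N, K, n_e = 4, 3, 2, 3                      # feature dim, variables, states, edge-types
f_e = torch.nn.ModuleList([                    # f_1, ..., f_{n_e-1}; f_0 = 0 is implicit
    torch.nn.Sequential(torch.nn.Linear(2 * d, d), torch.nn.ELU())
    for _ in range(n_e - 1)
])
g = torch.nn.Linear(2 * d, d)                  # aggregation over summed messages

def f_j(x_prev, s_prev, W, j):
    """Evaluate f^(j)_{s_{t-1}}(x_{t-1}) as in Eq. (8). x_prev: (N, d);
    s_prev: (N,) integer states; W: (N, N, K) integer labels w_{i,j,k}."""
    msg = torch.zeros(d)
    for i in range(N):
        if i == j:
            continue
        e = int(W[i, j, s_prev[i]])            # edge-type chosen by sender i's state
        if e > 0:                              # e == 0 means j is not a child of i
            msg = msg + f_e[e - 1](torch.cat([x_prev[i], x_prev[j]]))
    return g(torch.cat([msg, x_prev[j]]))      # sum-then-transform aggregation
```

Summing the messages before applying `g` is what makes the aggregation permutation invariant, mirroring how forces from pairwise interactions accumulate in physical systems.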
### 3.3. Implementation

We propose State-Dependent Causal Inference (SDCI), a probabilistic approach to infer the underlying causal structure from time series. Given a dataset $\mathcal{D}$, each sample $x_{1:T} \in \mathcal{D}$ is driven by $\mathcal{W}$, which may differ across samples. SDCI models the joint dynamics of states, edge-types, and observations (Fig. 3), and builds on graph neural networks (GNNs) and interactive systems (Kipf et al., 2018).

Figure 3: SDCI extracts the labels of conditional effects that describe state-dependent interactions in a sample.

**Generative model.** The joint distribution for conditionally stationary time series depends on $s_{1:T}$, $\mathcal{W}$, and $\psi := \{\psi_w, \psi_x, \psi_s\}$. For observed states, we define:

$$p_\psi(x_{1:T}, \mathcal{W} \,|\, s_{1:T}) = p_{\psi_x}(x_{1:T} \,|\, s_{1:T}, \mathcal{W})\, p_{\psi_w}(\mathcal{W}), \tag{9}$$

with a factorised prior on the edge labels, $p_{\psi_w}(\mathcal{W}) = \prod_{k=1}^K \prod_{i,j} p_{\psi_w}(w_{ijk})$. We can guide $\psi_w$ through domain knowledge (e.g., sparsity). Given $\mathcal{W} \sim p_{\psi_w}(\mathcal{W})$ and $s_{1:T}$, a sequence $x_{1:T}$ is generated as

$$p_{\psi_x}(x_{1:T} \,|\, s_{1:T}, \mathcal{W}) = \prod_{t=1}^{T-1} \prod_{j=1}^N p_{\psi_x}\big(x_{t+1}^{(j)} \,|\, x_t, s_t, \mathcal{W}\big).$$

To compute $p_{\psi_x}(x_{t+1}^{(j)} \,|\, x_t, s_t, \mathcal{W})$, the model queries the edge-types $e_t^{(ij)} = w_{ijk}$ for $s_t^{(i)} = k$, retrieves the pairwise interactions $f_e(x_t^{(i)}, x_t^{(j)})$, and aggregates them:

$$h_t^{(ij)} = \sum_{e > 0} \mathbb{1}\big[e_t^{(ij)} = e\big]\, f_e\big(x_t^{(i)}, x_t^{(j)}\big), \qquad \bar{x}_{t+1}^{(j)} = x_t^{(j)} + g\Big(\sum_{i \neq j} h_t^{(ij)},\ x_t^{(j)}\Big), \tag{10}$$

where $\mathcal{F} := \{f_e\}_{e=1}^{n_\epsilon - 1}$ and $g$ are parametrisable functions, similar to Eq. (8). $\bar{x}_{t+1}^{(j)}$ denotes the mean of $x_{t+1}^{(j)}$, which is Gaussian distributed with covariance $\sigma^2 I$. For latent states with influence from $x_{1:T}$ (the determined and recurrent cases; Figs. 1b, 1c), the joint distribution extends to:

$$p_\psi(x_{1:T}, s_{1:T}, \mathcal{W}) = p_{\psi_w}(\mathcal{W}) \prod_{t=1}^T p_{\psi_x}(x_t \,|\, x_{t-1}, s_{t-1}, \mathcal{W})\, p_{\psi_s}(s_t \,|\, x_{t:t-L_x}, s_{t-1:t-L_s}),$$

where $L_x$ and $L_s$ are the maximum lags. The determined case fixes $L_x = 0$ and $L_s = 0$, and we assume the states are autonomous with shared parameters $\psi_s$:

$$p_{\psi_s}(s_t \,|\, x_{t:t-L_x}, s_{t-1:t-L_s}) = \prod_{i=1}^N p_{\psi_s}\big(s_t^{(i)} \,|\, x_{t:t-L_x}^{(i)}, s_{t-1:t-L_s}^{(i)}\big). \tag{11}$$

**Inference.** Building on VAE-based approaches (Kingma & Welling, 2014; Löwe et al., 2022), we introduce a variational posterior parametrised by $\phi := \{\phi_w, \phi_s\}$, which approximates the posterior over $\mathcal{W}$ and $s_{1:T}$ as follows:

$$q_\phi(\mathcal{W}, s_{1:T} \,|\, x_{1:T}) = q_{\phi_w}(\mathcal{W} \,|\, x_{1:T})\, q_{\phi_s}(s_{1:T} \,|\, x_{1:T}), \tag{12}$$

where $q_{\phi_w}$ is factorised across edges:

$$q_{\phi_w}(w_{ijk} \,|\, x_{1:T}) = \mathrm{softmax}(\phi_{ijk}/\tau), \qquad \phi_{ij} = f_{\phi_w}(x_{1:T})_{ij} \in \mathbb{R}^{K \times n_\epsilon}. \tag{13}$$

The embedding $\phi_{ij}$ represents the state-dependent causal interaction for any pair $i \to j$, and the architecture of $f_{\phi_w}(x_{1:T})$ is based on (Chen et al., 2021); see Appendix B.3 for details. In the determined case, exact inference on the states is tractable (see Appendix B.2), and we set $q_{\phi_s}(s_{1:T} \,|\, x_{1:T}) = p_{\psi_s}(s_{1:T} \,|\, x_{1:T})$, with

$$q_{\phi_s}\big(s_t^{(i)} \,|\, x_t^{(i)}\big) = \mathrm{softmax}\big(\hat{s}_t^{(i)}/\gamma\big), \qquad \hat{s}_t^{(i)} = f_{\phi_s}\big(x_t^{(i)}\big). \tag{14}$$

In the recurrent case, $q_{\phi_s}(s_{1:T} \,|\, x_{1:T}) = \prod_{t=1}^T q_{\phi_s}(s_t \,|\, x_{1:T})$ is implemented via a bidirectional RNN-GNN that approximates smoothing (details in Appendix B.3).

**Objective.** SDCI optimises the ELBO (Kingma & Welling, 2014) (see Appendix B.1 for the derivation):

$$\log p_\psi(x_{1:T}) \geq -\mathrm{KL}\big(q_{\phi_w}(\mathcal{W} \,|\, x_{1:T}) \,\|\, p_{\psi_w}(\mathcal{W})\big) - \sum_{t=1}^T \mathrm{KL}\big(q_{\phi_s}(s_t \,|\, x_{1:T}) \,\|\, p_{\psi_s}(s_t \,|\, x_{t:t-L_x}, s_{t-1:t-L_s})\big) + \mathbb{E}_{q_\phi(\mathcal{W}, s_{1:T} \,|\, x_{1:T})}\big[\log p_{\psi_x}(x_{1:T} \,|\, s_{1:T}, \mathcal{W})\big], \tag{15}$$

where reparametrisation tricks (Jang et al., 2017) ensure backpropagation through the discrete samples $\mathcal{W}, s_{1:T} \sim q_\phi(\mathcal{W}, s_{1:T} \,|\, x_{1:T})$. To reduce variance, we use straight-through Gumbel-softmax and fix gradients to pass through unperturbed marginals (Ahmed et al., 2023). During training, Eq. (10) becomes:

$$h_t^{(ij)} = \sum_{k=1}^K \mathbb{1}\big[s_t^{(i)} = k\big] \sum_{e > 0} \mathbb{1}\big[w_{ijk} = e\big]\, f_e\big(x_t^{(i)}, x_t^{(j)}\big), \tag{16}$$

and we optimise $\phi$ and $\psi$ using mini-batch gradient ascent.
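For the discrete samples in Eq. (15), a minimal straight-through Gumbel-softmax step can be sketched as follows (PyTorch also ships this as `torch.nn.functional.gumbel_softmax`; the variance-reduction of Ahmed et al. (2023) is not reproduced here):

```python
import torch
import torch.nn.functional as F

def sample_edge_labels(phi, tau=0.5):
    """phi: (N, N, K, n_e) unnormalised scores from the encoder f_{phi_w}.
    Returns one-hot edge labels W whose forward pass is discrete while the
    backward pass uses the relaxed softmax sample (straight-through)."""
    u = torch.rand_like(phi).clamp(1e-20, 1.0 - 1e-7)
    gumbel = -torch.log(-torch.log(u))                   # Gumbel(0, 1) noise
    soft = F.softmax((phi + gumbel) / tau, dim=-1)       # relaxed sample
    hard = F.one_hot(soft.argmax(dim=-1), phi.shape[-1]).to(soft.dtype)
    return hard + soft - soft.detach()                   # gradient flows through soft
```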
## 4. Theoretical Considerations

We provide identifiability guarantees for the labels of conditional effects $\mathcal{W}$ driving conditionally stationary time series under assumptions (m1-m5). A detailed analysis of identifiability for regime-dependent graphs under (m1-m3) and conditional summary graphs under (m1-m4) is provided in Appendix A. We first define partial identifiability for $\mathcal{W}$.

**Definition 4.1** (Partial identifiability of conditional effects). Given observations $x_{1:T}$ and a family of models $\mathcal{M}$ satisfying (m1-m5), we say $\mathcal{M}$ is partially identifiable with respect to the labels of conditional effects if for any $p, \hat{p} \in \mathcal{M}$, with corresponding $\mathcal{W}$ and $\widehat{\mathcal{W}}$, such that $p(x_{1:T}) = \hat{p}(x_{1:T})$: $K = \hat{K}$ and there exist permutations $\pi \in S_{n_\epsilon}$ and $\sigma^{(i)} \in S_K$ such that for any $i, j \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$:

$$w_{i,j,k} = \hat{w}_{i,j,\sigma^{(i)}(k)} \quad \text{if } w_{i,j,k} = 0, \tag{17}$$
$$w_{i,j,k} = \pi\big(\hat{w}_{i,j,\sigma^{(i)}(k)}\big) \quad \text{if } w_{i,j,k} \neq 0. \tag{18}$$

**Remark 4.2.** Mixture models can only be identified up to permutations (Yakowitz & Spragins, 1968). Contrary to previous work (Gassiat et al., 2016; Hälvä & Hyvärinen, 2020; Song et al., 2024), our results do not require knowledge of the number of components $K$.

Eq. (17) establishes equivalence of $\mathcal{G}^{CS}_{1:K}$ up to element-wise permutations $\sigma^{(i)}$ of the outgoing-edge sets $\{C^{(i)}_1, \dots, C^{(i)}_K\}$. Eq. (18) ensures permutation equivalence of the edge labels $1, \dots, n_\epsilon$. This partial identifiability definition excludes the state distribution, as no restrictions on the state dynamics are imposed; it cannot generally be achieved under (m1-m5) without further assumptions. However, our main focus in this work is to recover state-dependent structures, and we leave state distribution identifiability for future work. Below, we list sufficient conditions for identifiability.

**(a1) Unique indexing of outgoing structure.** For each state value $s_{t-1}^{(i)} \in \{1, \dots, K\}$, the graph representing the direct causes from $i$ is unique. That is, for any $i \in \{1, \dots, N\}$:

$$\forall k, k' \in \{1, \dots, K\}, \quad k \neq k' \Rightarrow C^{(i)}_k \neq C^{(i)}_{k'}. \tag{19}$$

This is stronger than the typical unique indexing assumption for mixture models (Balsells-Rodas et al., 2024), where $p(x_t \,|\, x_{t-1}, s_{t-1} = (k_1, \dots, k_N)) \neq p(x_t \,|\, x_{t-1}, s_{t-1} = (k'_1, \dots, k'_N))$ for $(k_1, \dots, k_N) \neq (k'_1, \dots, k'_N)$.

**(a2) Unique indexing of function types.** Given (m1-m5):

- **(a2.1):** For any $j \in \{1, \dots, N\}$, and given $g(x^{(j)}, \{h^{(i,j)} \,|\, i \neq j\})$, where $h^{(i,j)} = f_e(x^{(i)}, x^{(j)})$ for some $e \in \{0, \dots, n_\epsilon - 1\}$, the partial derivative $\partial g / \partial h^{(i,j)}$ is non-zero almost everywhere for any $i \in \{1, \dots, N\}$.
- **(a2.2):** Edge-types differ almost everywhere: $f_0 := 0$, and for $e \neq e' \in \{0, \dots, n_\epsilon - 1\}$, the following set has zero measure:

$$\mathcal{X}_{e,e'} := \Big\{x_1 \in \mathbb{R}^d \,:\, \forall x_2 \in \mathbb{R}^d,\ \tfrac{\partial f_e(x_1, x_2)}{\partial x_1} = \tfrac{\partial f_{e'}(x_1, x_2)}{\partial x_1}\Big\}.$$

These assumptions ensure that pairwise interactions are not cancelled during aggregation, and that the edge-types remain sufficiently distinct. For example, SDCI naturally satisfies (a2.1) through GNN message passing. Additionally, implementing edge-type functions as analytic functions guarantees (a2.2).
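As a toy illustration of (a1) (our example, not the paper's): take $N = 2$ and $K = 2$, and consider the outgoing-edge sets of variable 1.

```latex
% (a1) holds for variable 1: its two states induce different children,
C^{(1)}_1 = \{2\}, \qquad C^{(1)}_2 = \emptyset
\qquad \Rightarrow \qquad C^{(1)}_1 \neq C^{(1)}_2.
% (a1) fails if both states induce identical outgoing edges,
C^{(1)}_1 = C^{(1)}_2 = \{2\},
% since the two states of variable 1 are then indistinguishable
% from the structure alone.
```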
We now state the following identifiability result.

**Theorem 4.3.** Conditionally stationary time series satisfying (m1-m5) are partially identifiable with respect to conditional effects (Def. 4.1) if they meet the assumptions of unique indexing of (a1) outgoing structure and (a2) function types.

*Proof sketch.* The full proof is presented in Appendix A.5, and can be divided into three major steps.

(i) Write $p(x_t \,|\, x_{1:t-1})$ as a mixture model. Under (m1-m3) and (a1-a2), the regime-dependent graph is identifiable up to a permutation $\tau \in S_{K^N}$ (Theorem A.6). Under state dependencies on $x_{1:t-1}$, the key is to show that the permutation equivalence of the transition distribution is preserved almost everywhere due to (a1-a2), regardless of the overlap on $x_{t-1}$.

(ii) Using (m4) and (a1), regime-dependent graph identifiability transfers to conditional summary graph identifiability (Theorem A.7).

(iii) By (m5) and (a1-a2), partial derivatives in the aggregation step reveal permutation equivalence on the edge-types (Corollary A.8).

While several prior works (Hälvä & Hyvärinen, 2020; Song et al., 2024; Balsells-Rodas et al., 2024) introduce nonstationarity via regime switching, our approach offers three key advantages:

- **Unknown number of state values $K$:** Most prior work, including HMM-ICA (Hälvä & Hyvärinen, 2020) and CtrlNS (Song et al., 2024), assumes that the number of latent regimes is known. Our approach leverages Yakowitz & Spragins (1968), which enables identifiability without knowing the true number of regimes. This allows identification of $K$ by model selection (McLachlan & Peel, 2000) (assuming convergence to the MLE), which is not theoretically supported when $K$ is required to be known.
- **No assumption on state dependency:** Prior works typically assume either the states independent case (Hälvä & Hyvärinen, 2020; Balsells-Rodas et al., 2024; Rahmani & Frossard, 2025), or the states determined case under strong structural constraints (Song et al., 2024). Our identifiability results make no assumptions about state dependencies, allowing feedback from observations.
- **Regime-dependent identifiability:** Our theoretical results (Thms. A.6-A.7; Cor. A.8) build from general Markov Switching Models up to our specific implementation, SDCI. Theorem A.6 introduces a novel proof strategy that extends Yakowitz & Spragins (1968), providing a general theoretical foundation for regime-dependent causal discovery (Saggioro et al., 2020).

As mentioned, SDCI considers samples $x_{1:T} \in \mathcal{D}$, each generated under a different structure. The presented results remain valid, as identifiability holds even when considering the distribution of a single sample. However, the consistency of SDCI, along with other approaches within the same family, such as ACD (Löwe et al., 2022), remains an open challenge. Recent works (Gong et al., 2023; Geffner et al., 2024) have shown that the validity of variational objectives for causal discovery relies on strong assumptions, including universal approximation properties of $q$, absence of model misspecification, and the infinite data limit. These works assume fixed structures across samples, to which SDCI can be adapted by setting $q_{\phi_w}(\mathcal{W} \,|\, x_{1:T}) = q_{\phi_w}(\mathcal{W})$.

## 5. Related Work

Causal discovery for time series traditionally extends methods for non-temporal data (Assaad et al., 2022). These approaches include constraint-based methods (Entner & Hoyer, 2010; Runge, 2018), score-based methods such as DYNOTEARS (Pamfil et al., 2020) based on Dynamic Bayesian Networks (Murphy et al., 2002), and functional causal model-based methods (Shimizu et al., 2006; Zhang & Hyvärinen, 2009; Peters et al., 2014) with constrained dynamics assumptions (Peters et al., 2013). The functional causal model-based approaches also motivate the use of deep generative models for causal discovery in high dimensions (Yu et al., 2019).

Our work focuses on modelling nonstationary time series with discrete latent variables that condition the underlying structure. It is related to Amortized Causal Discovery (ACD; Löwe et al. (2022)), which assumes stationarity and amortizes summary graph discovery across samples with different graphs but shared dynamics. We extend ACD by allowing the causal structure to vary according to latent state variables, thus enabling conditional stationarity.
When observed auxiliary variables are present, Monti et al. (2020) use time-contrastive learning (Hyvärinen & Morioka, 2016) and nonlinear ICA for causal discovery, while Yao et al. (2022) establish identifiability conditions in the presence of piecewise stationary disturbances. Regime-dependent dynamics assume no access to the states (Saggioro et al., 2020); Balsells-Rodas et al. (2024) prove identifiability through Markov Switching Models, and Rahmani & Frossard (2025) do so via a score-based approach. Varambally et al. (2024) also prove identifiability when the discrete latents represent global states. Our states determined model is similar to Song et al. (2024), where identifiability is established by assuming mechanism sparsity and knowledge of the number of regimes. Other works focus on fixed causal structures, modelling nonstationarity through noise distributions (Huang et al., 2020; Gong et al., 2023) or time-varying effects (Huang et al., 2015; Zhang et al., 2017; Ghassami et al., 2018; Huang et al., 2019), such as Huang et al. (2020) with distribution shifts, Gong et al. (2023) with history-dependent noise, or Huang et al. (2015) with time-varying functional causal models.

## 6. Experiments

### 6.1. Springs and Magnets

We evaluate SDCI on spring data adapted from Kipf et al. (2018) and Löwe et al. (2022), consisting of particles in a box connected by springs with directed impact: e.g., particle $i$ can affect particle $j$ with a force through a connecting spring while particle $i$ itself remains unaffected by this spring force. We further modify the dataset to introduce repulsive magnetic interactions, resulting in three edge-types: none, spring, and magnetic. We assess SDCI for determined and recurrent states. In the determined case, the state is modeled as a function of the particles' positions, while the recurrent case involves state transitions when particles collide with the walls.

Figure 4: F1 score of the edge labels $\mathcal{W}$ and states for samples with a shared underlying structure with $K = 2$: (a) states determined, (b) states recurrent.

Figure 5: F1 score of the edge labels $\mathcal{W}$ and states with structures varying across samples with $K = 2$: (a) states determined, (b) states recurrent.

Details on data generation, model hyperparameters, and additional results (including the observed case) are provided in Appendices C.1, B.4, and E.1, respectively.

To empirically verify the consistency of SDCI in estimating state-dependent structures, we begin with a fixed-graph setting, where the graph remains constant across samples, and use an amortised posterior $q_{\phi_w}(\mathcal{W} \,|\, x_{1:T}) = q_{\phi_w}(\mathcal{W})$ for inference. Fig. 4 demonstrates that SDCI achieves perfect structure estimation in both state settings. However, accurately estimating the states in the recurrent case is more challenging, likely due to the exponential computational cost of exact inference. Our formulation considers graphs that vary across samples, and we present results again for both state cases in Fig. 5. Here, recovering state-dependent structures is significantly more challenging. While SDCI successfully captures the state dynamics in the determined case, the increased complexity from variable graph structures makes recovering state-dependent structures in the recurrent case notably difficult, even with only 3 variables.

We compare SDCI with causal discovery baselines in extracting graphs from conditionally stationary time series.
Specifically, we measure the F1 score for summary graph identification on determined-states data with fixed and variable graph structures, benchmarking against ACD (Löwe et al., 2022), Rhino (Gong et al., 2023), and Neural Granger Causality (N-GC; Tank et al. (2021)). For variable-graph datasets, Rhino and N-GC require re-training for each sample, adding computational overhead.

Figure 6: Summary graph ($\mathcal{G}^S$) F1 score on determined-states data with 5, 10, and 20 variables, for fixed and variable structures across samples.

Figure 7: Test reconstruction MSE vs. number of states $K$ for data with ground truth $K = 2$ (red) and $K = 4$ (blue).

Results in Fig. 6 show a clear advantage of SDCI in terms of summary graph estimation. We observe that approaches assuming causal stationarity struggle to aggregate the full structure into a coherent summary graph, limiting their performance. However, N-GC in particular achieves strong results for variable graphs, surpassing ACD despite its simplicity. Overall, SDCI decomposes the nonstationary dynamics into conditionally stationary components while accurately capturing the state transition dynamics.

**Selecting the number of states.** Determining the number of states $K$, or the number of regimes in regime-switching dynamics, is a fundamental challenge in real-world data. As discussed in Section 4, SDCI is also identifiable in terms of $K$. Assuming the consistency of SDCI, $K$ can in theory be recovered by maximising the data log-likelihood, which we approach by model selection (McLachlan & Peel, 2000). However, in practice, overestimating $K$ may lead to overfitting or redundant regimes. To overcome this issue, standard selection heuristics such as the elbow method (Thorndike, 1953) can be applied. Figure 7 shows the reconstruction error for two settings: states recurrent with $N = 5$ and $K = 2$ (in red), and states determined with $N = 10$ and $K = 4$ (in blue). In both cases, the MSE plateaus when $K$ reaches its ground-truth value, thus providing a principled heuristic to identify the true number of states.
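In code, this heuristic amounts to sweeping $K$ and stopping once the held-out reconstruction error plateaus. The sketch below assumes a hypothetical `train_sdci` routine exposing a test MSE, and the tolerance is an arbitrary choice:

```python
def select_num_states(data, candidates=(1, 2, 3, 4, 5, 6, 7, 8), rel_tol=0.05):
    """Elbow heuristic over the number of states K (cf. Fig. 7).
    `train_sdci` is a hypothetical stand-in for the actual training loop."""
    mses = [train_sdci(data, num_states=k).test_mse for k in candidates]
    for idx in range(1, len(candidates)):
        improvement = mses[idx - 1] - mses[idx]
        if improvement < rel_tol * mses[idx - 1]:  # MSE has plateaued
            return candidates[idx - 1], mses       # keep the smaller K
    return candidates[-1], mses
```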
### 6.2. Gene Regulatory Networks

We extend our evaluation to the inference of Gene Regulatory Networks (GRNs) during cellular differentiation, reprogramming, and development, which are defined by dynamically changing gene interactions with nonstationary and nonlinear dynamics (Schiebinger et al., 2019; Yeo et al., 2021; Kamimoto et al., 2023; Bhaskar et al., 2024). We simulate gene expression data using DynGen (Cannoodt et al., 2021), a semi-synthetic simulation engine that models dynamic regulatory interactions. (We do not use the widely-used DREAM3 benchmark (Marbach et al., 2010), since it is limited to simulating steady-state and individual gene perturbation conditions.) In DynGen, an underlying state network introduces nonstationarity by activating or deactivating groups of genes, leading to time-varying structures. Fig. 8 illustrates an example of a bifurcating network, where the system transitions through different states: activating genes in A, then transitioning to B, and ending by activating genes from C or D. See App. C.2 for details. We simulate 1000 cells with $N = 49$ genes for $T = 50$ time steps, under a bifurcating state network with 4 regimes.

Figure 8: Example of a bifurcating network (b) from DynGen, where in (a) gray nodes denote target genes, and other colors indicate genes activated by their corresponding states.

We find SDCI is well suited to GRN setups, since the state variables ($K = 2$) naturally activate or deactivate gene expression. We evaluate SDCI in the three scenario classes, and we leverage ground-truth information to manually activate or deactivate genes in the observed case. In Table 1 we report results in comparison to causal discovery baselines in terms of estimating the underlying GRN structure, where we observe SDCI shows superior performance in terms of F1 score. In the observed case, we find that state auxiliary information does not provide a significant advantage, as gene deactivation is typically gradual rather than abrupt. The recurrent case suffers from the underlying exponential cost of exact inference (approximated by $q_\phi$), which is critical in scenarios with many variables. In contrast, the determined case proves to be a simple yet effective approach that adapts to the time-varying dynamics in GRNs.

Table 1: AUROC and F1 score of the GRN structure.

| Method | AUROC | F1 score |
| --- | --- | --- |
| R-PCMCI (Saggioro et al., 2020) | 0.887 | 0.062 |
| N-GC (Tank et al., 2021) | 0.556 | 0.093 |
| ACD (Löwe et al., 2022) | 0.714 | 0.187 |
| Rhino (Gong et al., 2023) | 0.933 | 0.081 |
| SDCI Observed | 0.805 | 0.182 |
| SDCI Determined | 0.936 | 0.347 |
| SDCI Recurrent | 0.855 | 0.276 |

Figure 9: Conditional summary graphs extracted by SDCI. Genes are grouped by states, and gray nodes denote targets.

We visualise the estimated structure in Fig. 9, where genes are grouped by their associated states. SDCI successfully detects deactivated gene interactions, except for gene groups B and C, and thus provides a robust framework for realistic GRNs with direct applications to tasks such as cell-type-specific regulatory network inference.

### 6.3. NBA Player Trajectories

We consider NBA player movements (Linou, 2016), a real-world multi-agent trajectory dataset with highly nonlinear and nonstationary dynamics. See Appendix C.3 for details of the dataset, including our design for auxiliary states (as ground truth is unavailable). We evaluate SDCI under observed and determined states. To simplify the task, we model the trajectories of the players (position and velocity) conditioned on the ball's position and velocity. This is achieved by modifying Eq. (10) to include the ball features in the message-passing aggregation of the decoder. In addition to ACD, we compare against the non-causal baselines VRNN (Chung et al., 2015) and NRI-NPM (Chen et al., 2021) (Appendix B.4). All the models are trained on sequences of length $T = 100$.
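A hypothetical sketch of this modification, reusing `f_e`, `g`, `d`, and `N` from the Section 3.2 sketch: each player's aggregated message is augmented with a learned message from the ball before the update (the names and concrete wiring are our assumptions, not the paper's exact decoder):

```python
import torch

f_ball = torch.nn.Sequential(torch.nn.Linear(2 * d, d), torch.nn.ELU())

def f_j_with_ball(x_prev, ball_prev, s_prev, W, j):
    """Player update conditioned on the ball. ball_prev: (d,) ball features;
    f_e, g, d, N are the definitions from the Section 3.2 sketch."""
    msg = torch.zeros(d)
    for i in range(N):                           # pairwise player messages, Eq. (10)
        if i == j:
            continue
        e = int(W[i, j, s_prev[i]])
        if e > 0:
            msg = msg + f_e[e - 1](torch.cat([x_prev[i], x_prev[j]]))
    msg = msg + f_ball(torch.cat([ball_prev, x_prev[j]]))  # ball enters the aggregation
    return x_prev[j] + g(torch.cat([msg, x_prev[j]]))      # mean of the next player state
```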
Figure 10: Test forecast error over 100 time steps for SDCI Obs ($K = 4$), SDCI Unobs ($K = 2$ and $K = 4$), ACD, VRNN, and NRI-NPM. The dashed line represents the mean value.

Fig. 10 shows the average forecast error (MSE) of the player positions on a held-out dataset. Overall, GNN-based methods outperform VRNN. Moreover, ACD and SDCI achieve better long-term forecasts, despite using first-order Markov transitions. This is likely due to the no-edge interaction reducing error accumulation through time. The hidden-state setting gives SDCI a slight advantage over ACD thanks to adapting player interactions through time.

Figure 11: Illustration of learned structures from SDCI ($K = 4$) given a sample (a), where (b) and (c) correspond to $\mathcal{G}^{CS}_2$ and $\mathcal{G}^{CS}_4$, respectively.

Fig. 11 visualizes the extracted conditional summary graphs from SDCI for a sample trajectory (additional visualizations in Appendix E.2). We observe that SDCI extracts interpretable structures relevant to basketball plays: players 2 and 5 receive the majority of interactions (Fig. 11b) as player 2 passes the ball to player 5. Also, interactions from the red team (defense) to the blue team (offense) are more prominent. Notably, SDCI adapts to nonstationary dynamics by controlling interaction sparsity and switching regimes throughout sequences.

Figure 12: Learned regimes from SDCI on the NBA dataset for (a) $K = 2$ and (b) $K = 4$ state values. The colour maps refer to the state posterior distribution $q_\phi(s_t^{(i)} = k \,|\, x_t^{(i)})$, $k \in \{1, \dots, K\}$.

While our identifiability theory does not require a predetermined number of state values $K$, in practice accurate estimation of $K$ remains a challenge, where different $K$ may return different results. For instance, Fig. 12 shows the state posterior as a function of player positions on the court. For $K = 2$, states change near the mid-court line, reflecting common basketball strategies where players transition between defense and offense. The $K = 4$ case returns a more fine-grained result by further partitioning the court, with boundaries positioned near the three-point line. Notably, these insights arise despite model simplifications, where the states are made dependent solely on positions, excluding velocities. This highlights how SDCI can adapt to diverse modelling choices while extracting interpretable patterns in complex, nonstationary systems.

## 7. Conclusions

We have developed SDCI for causal discovery in conditionally stationary time series. The key innovation is the new concept of conditional summary graphs, which are identifiable under a broad class of underlying state dependencies. Evaluations on semi-synthetic data show SDCI's superior performance in extracting the underlying causal graph and its potential for biology applications. Importantly, the improvement of SDCI over VRNN and NRI on modeling NBA player movements demonstrates the promise of causality-driven methods for forecasting and data interpretability.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Ahmed, K., Zeng, Z., Niepert, M., and Van den Broeck, G. SIMPLE: A gradient estimator for k-subset sampling. In The Eleventh International Conference on Learning Representations, 2023.

Ansari, A. F., Benidis, K., Kurle, R., Turkmen, A. C., Soh, H., Smola, A. J., Wang, B., and Januschowski, T. Deep explicit duration switching models for time series. Advances in Neural Information Processing Systems, 34:29949-29961, 2021.

Assaad, C. K., Devijver, E., and Gaussier, E. Survey and evaluation of causal discovery methods for time series. Journal of Artificial Intelligence Research, 73:767-819, 2022.

Balsells-Rodas, C., Wang, Y., and Li, Y. On the identifiability of switching dynamical systems. In Forty-first International Conference on Machine Learning, 2024.

Bhaskar, D., Magruder, D. S., Morales, M., De Brouwer, E., Venkat, A., Wenkel, F., Wolf, G., and Krishnaswamy, S. Inferring dynamic regulatory interaction graphs from time series data with perturbations. In Learning on Graphs Conference, pp. 22-1. PMLR, 2024.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.
Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Cannoodt, R., Saelens, W., Deconinck, L., and Saeys, Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nature Communications, 12(1):3942, 2021.

Chen, S., Wang, J., and Li, G. Neural relational inference with efficient message passing mechanisms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7055-7063, 2021.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. Advances in Neural Information Processing Systems, 28, 2015.

Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In Bengio, Y. and LeCun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

Entner, D. and Hoyer, P. O. On causal discovery from time series data using FCI. Probabilistic Graphical Models, pp. 121-128, 2010.

Gassiat, E., Cleynen, A., and Robin, S. Inference in finite state space non parametric hidden Markov models and applications. Statistics and Computing, 26(1):61-71, 2016.

Geffner, T., Antoran, J., Foster, A., Gong, W., Ma, C., Kiciman, E., Sharma, A., Lamb, A., Kukla, M., Pawlowski, N., Hilmkil, A., Jennings, J., Scetbon, M., Allamanis, M., and Zhang, C. Deep end-to-end causal inference. Transactions on Machine Learning Research, 2024. ISSN 2835-8856.

Ghassami, A., Kiyavash, N., Huang, B., and Zhang, K. Multi-domain causal structure learning in linear systems. Advances in Neural Information Processing Systems, 31, 2018.

Girdhar, R. and Ramanan, D. CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning. In ICLR, 2020.

Gong, W., Jennings, J., Zhang, C., and Pawlowski, N. Rhino: Deep causal temporal relationship learning with history-dependent noise. In The Eleventh International Conference on Learning Representations, 2023.

Granger, C. W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pp. 424-438, 1969.

Hälvä, H. and Hyvärinen, A. Hidden Markov nonlinear ICA: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pp. 939-948. PMLR, 2020.

Hamilton, J. D. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica: Journal of the Econometric Society, pp. 357-384, 1989.

Huang, B., Zhang, K., and Schölkopf, B. Identification of time-dependent causal model: A Gaussian process treatment. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Huang, B., Zhang, K., Gong, M., and Glymour, C. Causal discovery and forecasting in nonstationary environments with state-space models. In International Conference on Machine Learning, pp. 2901-2910. PMLR, 2019.

Huang, B., Zhang, K., Zhang, J., Ramsey, J. D., Sanchez-Romero, R., Glymour, C., and Schölkopf, B. Causal discovery from heterogeneous/nonstationary data. Journal of Machine Learning Research, 21(89):1-53, 2020.

Hyvärinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems, 29, 2016.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax.
In International Conference on Learning Representations, 2017.

Kamimoto, K., Stringa, B., Hoffmann, C. M., Jindal, K., Solnica-Krezel, L., and Morris, S. A. Dissecting cell identity via network inference and in silico gene perturbation. Nature, 614(7949):742-751, 2023.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2688-2697. PMLR, 2018.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Linou, K. NBA player movements. https://github.com/linouk23/NBA-Player-Movements, 2016. Last accessed: 2022-08-06.

Löwe, S., Madras, D., Zemel, R., and Welling, M. Amortized causal discovery: Learning to infer causal graphs from time-series data. In Conference on Causal Learning and Reasoning, pp. 509-525. PMLR, 2022.

Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., and Stolovitzky, G. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14):6286-6291, 2010.

McLachlan, G. J. and Peel, D. Finite mixture models. John Wiley & Sons, 2000.

Monti, R. P., Zhang, K., and Hyvärinen, A. Causal discovery with general non-linear relationships using non-linear ICA. In Uncertainty in Artificial Intelligence, pp. 186-195. PMLR, 2020.

Murphy, K. P. et al. Dynamic Bayesian networks. Probabilistic Graphical Models, M. Jordan, 7:431, 2002.

Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16784-16804. PMLR, 17-23 Jul 2022.

Pamfil, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., and Aragam, B. DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pp. 1595-1605. PMLR, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

Pearl, J. Causality. Cambridge University Press, 2009.

Peters, J., Janzing, D., and Schölkopf, B. Causal inference on time series using restricted structural equation models. Advances in Neural Information Processing Systems, 26, 2013.

Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B.
Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15(58):2009-2053, 2014.

Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.

Rahmani, A. and Frossard, P. Causal temporal regime structure learning. In The 28th International Conference on Artificial Intelligence and Statistics, 2025.

Runge, J. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7):075310, 2018.

Saggioro, E., de Wiljes, J., Kretschmer, M., and Runge, J. Reconstructing regime-dependent causal relationships from observational time series. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(11):113115, 2020.

Sauer, A. and Geiger, A. Counterfactual generative networks. In International Conference on Learning Representations, 2021.

Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928-943, 2019.

Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Toward causal representation learning. Proceedings of the IEEE, 109(5):612-634, 2021.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., and Jordan, M. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.

Song, X., Li, Z., Chen, G., Zheng, Y., Fan, Y., Dong, X., and Zhang, K. Causal temporal representation learning with nonstationary sparse transition. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Tank, A., Covert, I., Foti, N., Shojaie, A., and Fox, E. B. Neural Granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1, 2021. doi: 10.1109/TPAMI.2021.3065601.

Thorndike, R. L. Who belongs in the family? Psychometrika, 18(4):267-276, 1953.

Varambally, S., Ma, Y., and Yu, R. Discovering mixtures of structural causal models from time series data. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 49171-49202. PMLR, 21-27 Jul 2024.

Yakowitz, S. J. and Spragins, J. D. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39(1):209-214, 1968.

Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022.

Yeo, G. H. T., Saksena, S. D., and Gifford, D. K. Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions. Nature Communications, 12(1):3222, 2021.

Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., and Tenenbaum, J. B. CLEVRER: Collision events for video representation and reasoning. In ICLR, 2020.

Yu, Y., Chen, J., Gao, T., and Yu, M. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pp. 7154-7163. PMLR, 2019.

Zhang, K. and Hyvärinen, A. On the identifiability of the post-nonlinear causal model.
In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, pp. 647-655, Arlington, Virginia, USA, 2009. AUAI Press. ISBN 9780974903958.

Zhang, K., Huang, B., Zhang, J., Glymour, C., and Schölkopf, B. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference, volume 2017, pp. 1347. NIH Public Access, 2017.

## A. Identifiability Proofs

We have introduced conditionally stationary time series in Section 3. Here, we revise the model assumptions introduced in the main text and present our identifiability theory, starting from general Markov Switching Models (m1-m3), continuing with conditional summary graph identification under (m4), and finishing with identifiability of the labels of conditional effects (m5) utilised in SDCI.

### A.1. Preliminaries

We review the main model assumptions that characterise SDCI. We start with the assumptions that apply to general MSMs, where we have a single global state at each time step $t$. In this case we denote the global state as $u_t := \varphi(s_t) \in \{1, \dots, U\}$ for some injective $\varphi : \{1, \dots, K\}^N \to \{1, \dots, U\}$ (we use $s_t$ when referring to the multiple states in SDCI).

**(m1) Conditional first-order Markov transitions** controlled by $u_{t-1}$ only:

$$p(x_t \,|\, x_{1:t-1}, u_{1:t-1}) = p(x_t \,|\, x_{t-1}, u_{t-1}), \quad \forall t \in \{2, \dots, T\}. \tag{21}$$

**(m2) Conditional stationarity**: the conditional transition distribution does not change over time: for any $u \in \{1, \dots, U\}$, any $t \neq t'$, and any $\beta, \gamma \in \mathbb{R}^{Nd}$,

$$p(x_t = \beta \,|\, x_{t-1} = \gamma, u_{t-1} = u) = p(x_{t'} = \beta \,|\, x_{t'-1} = \gamma, u_{t'-1} = u). \tag{22}$$

**(m3) Factorisation structure based on the ANM model with Gaussian noise**:

$$p(x_t \,|\, x_{t-1}, u_{t-1}) = \prod_{j=1}^N \mathcal{N}\Big(x_t^{(j)};\ f^{(j)}_{u_{t-1}}\big(\{x_{t-1}^{(i)} \,|\, x_{t-1}^{(i)} \in \mathrm{PA}^{(j)}_{u_{t-1}}(t-1)\}\big),\ \sigma^2 I\Big), \quad \forall t \in \{2, \dots, T\}. \tag{23}$$

Next, we formalise the assumptions specific to conditionally stationary data, where the total state configuration is denoted by grouping all the individual states, $s_t = \{s_t^{(1)}, \dots, s_t^{(N)}\} \in \{1, \dots, K\}^N$. Note that this defines a specific structure in a general MSM with $U := K^N$ global states.

**(m4) Multi-state dependence**: For each variable $x_t^{(i)}$, $i \in \{1, \dots, N\}$, at time $t \in \{1, \dots, T\}$, there exists a state variable $s_t^{(i)} \in \{1, \dots, K\}$, such that $s_{t-1} := (s_{t-1}^{(1)}, \dots, s_{t-1}^{(N)}) \in \{1, \dots, K\}^N$. For any variable $j \in \{1, \dots, N\}$, $\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$ is defined as follows:

$$\mathrm{PA}^{(j)}_{s_{t-1}}(t-1) := \big\{x_{t-1}^{(i)} \,|\, j \in C^{(i)}_k,\ k = s_{t-1}^{(i)},\ 1 \le i \le N\big\}, \tag{24}$$

where $C^{(i)}_k \subseteq V$ denotes the outgoing-edge structure (children) of variable $i \in \{1, \dots, N\}$ corresponding to a state $k \in \{1, \dots, K\}$.

**(m5) Global function-type dependence**: We allow $n_\epsilon$ functional effects (including the no-edge effect). We define the labels of conditional effects as

$$\mathcal{W} := \big\{w_{i,j,k} \in \{0, \dots, n_\epsilon - 1\} : i, j \in \{1, \dots, N\},\ k \in \{1, \dots, K\},\ w_{i,j,k} = 0 \Leftrightarrow j \notin C^{(i)}_k\big\}, \tag{25}$$

where each $w_{i,j,k}$ denotes an edge-type from variable $i$ to $j$ at state $k$, associated with variable $i$. Each edge-type corresponds to a different function in the set of interaction functions $\mathcal{F} = \{f_0, \dots, f_{n_\epsilon - 1}\}$, where $f_0(\cdot) = 0$ from Eq. (25). Given a state configuration $s_{t-1}$, $f^{(j)}_{s_{t-1}} : \mathbb{R}^{Nd} \to \mathbb{R}^d$ in Eq. (23) is defined as follows for any $x_{t-1} \in \mathbb{R}^{Nd}$:

$$f^{(j)}_{s_{t-1}}(x_{t-1}) := g\Big(x_{t-1}^{(j)},\ \big\{f_{e_i}\big(x_{t-1}^{(i)}, x_{t-1}^{(j)}\big) \,\big|\, i \neq j\big\}\Big), \quad \text{where } e_i = w_{i,j,s_{t-1}^{(i)}} \in \{0, \dots, n_\epsilon - 1\}, \tag{26}$$

and $e_i$, $i \in \{1, \dots, N\}$, denotes the interaction between variables $i$ and $j$; these interactions are aggregated by a permutation-invariant function $g$.
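For illustration (our choice; the theory only requires injectivity), one concrete such mapping $\varphi$ is the base-$K$ positional encoding:

```latex
\varphi(k_1, \dots, k_N) \;=\; 1 + \sum_{i=1}^{N} (k_i - 1)\, K^{\,i-1},
\qquad
\varphi : \{1, \dots, K\}^N \to \{1, \dots, K^N\},
```

which is a bijection, so every joint state configuration corresponds to exactly one global regime $u_t = \varphi(s_t)$ and vice versa.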
### A.2. Partial Identifiability

We wish to analyse the identifiability of the labels $\mathcal{W}$ for SDCI (m1-m5) in the case of unobserved states. However, we first need to develop identifiability for regime-dependent and state-dependent structures, i.e., $\mathcal{G}^{RD}_{1:U}$ and $\mathcal{G}^{CS}_{1:K}$, respectively. We define identifiability of the regime-dependent structures for a general MSM (m1-m3).

**Definition A.1** (Partial identifiability of regime-dependent graph). Given observations $x_{1:T}$ and a family of models $\mathcal{M}$ satisfying (m1-m3), we say $\mathcal{M}$ is partially identifiable with respect to its regime-dependent graph if for any $p, \hat{p} \in \mathcal{M}$, with corresponding $\mathrm{PA}^{(i)}_u(t-1)$ and $\widehat{\mathrm{PA}}^{(i)}_{\hat{u}}(t-1)$ for any $i \in \{1, \dots, N\}$, $u \in \{1, \dots, U\}$, and $\hat{u} \in \{1, \dots, \hat{U}\}$, such that $p(x_{1:T}) = \hat{p}(x_{1:T})$: $U = \hat{U}$ and there exists a permutation $\tau \in S_U$ such that $\mathrm{PA}^{(i)}_u(t-1) = \widehat{\mathrm{PA}}^{(i)}_{\tau(u)}(t-1)$ for $i \in \{1, \dots, N\}$ and $u \in \{1, \dots, U\}$.

We also define the identifiability of the outgoing edges $C^{(i)}_k$ in terms of $\mathcal{G}^{CS}_{1:K}$, given a model with $N$ state variables and multi-state dependence (m1-m4).

**Definition A.2** (Partial identifiability of outgoing edge structure). Given observations $x_{1:T}$ and a family of models $\mathcal{M}$ satisfying (m1-m4), we say $\mathcal{M}$ is partially identifiable with respect to the outgoing edge structure if for any $p, \hat{p} \in \mathcal{M}$, with corresponding $C^{(i)}_k$ and $\hat{C}^{(i)}_{\hat{k}}$ for any $i \in \{1, \dots, N\}$, $k \in \{1, \dots, K\}$, and $\hat{k} \in \{1, \dots, \hat{K}\}$, such that $p(x_{1:T}) = \hat{p}(x_{1:T})$: $K = \hat{K}$ and there exist permutations $\sigma^{(i)} \in S_K$ such that $C^{(i)}_k = \hat{C}^{(i)}_{\sigma^{(i)}(k)}$ for $i \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$.

Finally, we further define identifiability of the labels of conditional effects under global function-type dependence (m5).

**Definition A.3** (Partial identifiability of conditional effects). Given observations $x_{1:T}$ and a family of models $\mathcal{M}$ satisfying (m1-m5), we say $\mathcal{M}$ is partially identifiable with respect to the labels of conditional effects if for any $p, \hat{p} \in \mathcal{M}$, with corresponding $\mathcal{W}$ and $\widehat{\mathcal{W}}$, such that $p(x_{1:T}) = \hat{p}(x_{1:T})$: $K = \hat{K}$ and there exist permutations $\pi \in S_{n_\epsilon}$ and $\sigma^{(i)} \in S_K$ such that for any $i, j \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$:

$$w_{i,j,k} = \hat{w}_{i,j,\sigma^{(i)}(k)} \quad \text{if } w_{i,j,k} = 0, \tag{27}$$
$$w_{i,j,k} = \pi\big(\hat{w}_{i,j,\sigma^{(i)}(k)}\big) \quad \text{if } w_{i,j,k} \neq 0. \tag{28}$$

As mentioned in the main text, these definitions consider identifiability that is partial, since they do not cover the structure capturing the interactions between $x_{1:T}$ and $s_{1:T}$, nor those within $s_{1:T}$.

### A.3. Sufficiency Conditions for Identifiability

The defined partial graph identifiability cannot be achieved without further assumptions on (m1-m5). Below we list sufficient conditions for such identifiability, some of which we already introduced in the main text.

**(a1) Unique indexing of outgoing structure.** Here we assume that for each state value $s_{t-1}^{(i)} \in \{1, \dots, K\}$, $i \in \{1, \dots, N\}$, the underlying graph representing the direct causes from variable $i$ to the rest of the variables is unique:

$$\forall k, k' \in \{1, \dots, K\}, \quad k \neq k' \Rightarrow C^{(i)}_k \neq C^{(i)}_{k'}, \qquad \forall i \in \{1, \dots, N\}. \tag{29}$$

This assumption has an equivalent for MSMs, i.e., when considering all state variables $s_{t-1} \in \{1, \dots, K\}^N$ as a global state $u_{t-1} \in \{1, \dots, K^N\}$.

**(b1) Unique indexing of regime-dependent graph structure.** For each possible state $u_{t-1} \in \{1, \dots, U\}$, the underlying graph representing the direct causes between variables is unique. In other words, the resulting $U$ graphs are different. This implies that the following holds:

$$\forall u, u' \in \{1, \dots, U\}, \quad u \neq u' \Rightarrow \exists j \in \{1, \dots, N\} \ \text{s.t.}\ \mathrm{PA}^{(j)}_{u_{t-1}=u}(t-1) \neq \mathrm{PA}^{(j)}_{u_{t-1}=u'}(t-1). \tag{30}$$

Assume a bijective mapping $\varphi : \{1, \dots, K\}^N \to \{1, \dots, K^N\}$
such that we can map a specific state configuration $(k_1, \dots, k_N)$ to an assigned value $k_{1:N} \in \{1, \dots, K^N\}$, that is, $k_{1:N} = \varphi(k_1, \dots, k_N)$. With this view, we can write the state variables in terms of one global state variable with $K^N$ regimes. We connect assumption (a1) to assumption (b1), defined for MSMs, as follows.

**Proposition A.4.** Assumption (a1) implies assumption (b1).

*Proof.* Recall the definition of $\mathrm{PA}^{(j)}_{s_{t-1}}$:

$$\mathrm{PA}^{(j)}_{s_{t-1}}(t-1) := \big\{x_{t-1}^{(i)} \,|\, j \in C^{(i)}_k,\ k = s_{t-1}^{(i)},\ 1 \le i \le N\big\}. \tag{31}$$

Given two state configurations $(k_1, \dots, k_N), (k'_1, \dots, k'_N) \in \{1, \dots, K\}^N$ that differ in at least one $k_i$, $i \in \{1, \dots, N\}$, under (a1) we have $C^{(i)}_{k_i} \neq C^{(i)}_{k'_i}$. Again from (a1), $\exists j \in \{1, \dots, N\}$ such that $j \in C^{(i)}_{k_i}$ and $j \notin C^{(i)}_{k'_i}$ (or vice versa). From Eq. (31), and by writing $\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$ as an MSM with one global state ($u_t := s_t$), we have

$$\exists j \in \{1, \dots, N\} \ \text{s.t.}\ x_{t-1}^{(i)} \in \mathrm{PA}^{(j)}_{u_{t-1}=\varphi(k_1,\dots,k_N)}(t-1) \ \text{and}\ x_{t-1}^{(i)} \notin \mathrm{PA}^{(j)}_{u_{t-1}=\varphi(k'_1,\dots,k'_N)}(t-1)$$
$$\Rightarrow \exists j \in \{1, \dots, N\} \ \text{s.t.}\ \mathrm{PA}^{(j)}_{u_{t-1}=\varphi(k_1,\dots,k_N)}(t-1) \neq \mathrm{PA}^{(j)}_{u_{t-1}=\varphi(k'_1,\dots,k'_N)}(t-1). \tag{32}$$

Therefore, assumption (b1) also holds. ∎

We now revisit the functional assumptions introduced in the main text, and establish connections to MSMs.

**(a2) Unique indexing of function types.** Under the model assumptions (m1-m5), we assume the following:

- **(a2.1):** For any $j \in \{1, \dots, N\}$, and given $g(x^{(j)}, \{h^{(i,j)} \,|\, i \neq j\})$, where $h^{(i,j)} = f_e(x^{(i)}, x^{(j)})$ for some $e \in \{0, \dots, n_\epsilon - 1\}$, the partial derivative $\partial g / \partial h^{(i,j)}$ is non-zero almost everywhere for any $i \in \{1, \dots, N\}$.
- **(a2.2):** Edge-types differ almost everywhere: $f_0 := 0$, and for $e \neq e' \in \{0, \dots, n_\epsilon - 1\}$, the following set has zero measure:

$$\mathcal{X}_{e,e'} := \Big\{x_1 \in \mathbb{R}^d \,:\, \forall x_2 \in \mathbb{R}^d,\ \tfrac{\partial f_e(x_1, x_2)}{\partial x_1} = \tfrac{\partial f_{e'}(x_1, x_2)}{\partial x_1}\Big\}. \tag{33}$$

**(b2) Functional faithfulness.** This condition concerns the functional properties of $f^{(j)}_{s_{t-1}}$ in terms of its faithfulness to the graph structure. We require the following sense of faithfulness of the functional behaviour for all $s_{t-1} \in \{1, \dots, K\}^N$:

- **(b2.1)** $f^{(j)}_{s_{t-1}}$ is differentiable w.r.t. $\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$ almost everywhere. Moreover, all the entries of the Jacobian matrix $\mathrm{d}f^{(j)}_{s_{t-1}} / \mathrm{d}\,\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$, when well defined, are non-zero almost everywhere w.r.t. $\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$.
- **(b2.2)** If $x_{t-1}^{(i)} \notin \mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$, then $f^{(j)}_{s_{t-1}}$ is constant w.r.t. $x_{t-1}^{(i)}$.

Intuitively, this faithfulness condition requires the output of the function $f^{(j)}_{s_{t-1}}$ to vary if and only if $\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$ varies. We also connect (a2) to (b2).

**Proposition A.5.** (a2) implies (b2).

To see this, note that (a2.1) and (a2.2) force the derivatives w.r.t. $\mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$ to be non-zero almost everywhere, which implies (b2.1). Furthermore, this is only violated on a zero-measure set of points, or when variable $i$ interacts with $j$ via $f_0$; but the latter is only possible when $x_{t-1}^{(i)} \notin \mathrm{PA}^{(j)}_{s_{t-1}}(t-1)$, which implies (b2.2).

### A.4. Identifiability of Markov Switching Models

With the above assumptions we can now state the following identifiability result:

**Theorem A.6.** A Markov Switching Model (m1-m3) under assumptions (b1) and (b2) is partially identifiable with respect to its regime-dependent graph (Def. A.1).

*Proof.* Assume there exist two MSMs satisfying $p(x_{1:T}) = \hat{p}(x_{1:T})$. Since, w.l.o.g., $p(x_{1:t}) = p(x_t \,|\, x_{1:t-1})\, p(x_{1:t-1})$, the fact that $p(x_{1:T}) = \hat{p}(x_{1:T})$ and that $p(x_t \,|\, x_{1:t-1})$ is a probability distribution imply that

$$p(x_t \,|\, x_{1:t-1}) = \hat{p}(x_t \,|\, x_{1:t-1}), \quad t = 2, \dots, T. \tag{34}$$
Now, since the model assumes (m1) a conditional first-order Markov structure controlled by the previous state $s_{t-1}$ (Eq. (21)), we can show that the conditional distribution is a finite mixture:

$p(x_t|x_{1:t-1}) = \sum_{u=1}^{U} p(u_{t-1} = u|x_{1:t-1})\, p(x_t|x_{t-1}, u_{t-1} = u)$. (35)

We assume (b1) and functional faithfulness (b2). Since (m3) $p(x_t|x_{t-1}, u_{t-1} = u)$ is Gaussian (Eq. (23)), using the identifiability of finite Gaussian mixtures (Yakowitz & Spragins, 1968) we have $U = \hat{U}$, and for almost any given $x_{t-1} = \alpha \in \mathbb{R}^{Nd}$ (modulo some zero-measure sets) the following holds: for any $u \in \{1, \dots, U\}$, there exists $\hat{u}(\alpha, u) \in \{1, \dots, U\}$ such that

$p(u_{t-1} = u|x_{1:t-2}, x_{t-1} = \alpha) = \hat{p}(u_{t-1} = \hat{u}(\alpha, u)|x_{1:t-2}, x_{t-1} = \alpha)$,
$p(x_t|x_{t-1} = \alpha, u_{t-1} = u) = \hat{p}(x_t|x_{t-1} = \alpha, u_{t-1} = \hat{u}(\alpha, u))$. (36)

To clarify, Proposition 2 in Yakowitz & Spragins (1968) establishes that finite mixtures of multivariate Gaussians are identifiable provided the component means or covariances are distinct; our assumptions (b1) and (b2) ensure distinct means for almost any $x_{t-1} = \alpha \in \mathbb{R}^{Nd}$. The (m3) factorised Gaussian structure (Eq. (23)) and the above identification result further indicate that

$p(x^{(j)}_t|x_{t-1} = \alpha, u_{t-1} = u) = \hat{p}(x^{(j)}_t|x_{t-1} = \alpha, u_{t-1} = \hat{u}(\alpha, u)), \quad j = 1, \dots, N$. (37)

Again w.l.o.g., let us denote the mean function of $p(x^{(j)}_t|x_{t-1} = \alpha, u_{t-1} = u)$ as $f^{(j)}_u$ for $j = 1, \dots, N$ and $u = 1, \dots, U$, and use the short notation $\mathrm{PA}^{(j)}_u(t-1)$ for the input variables of $f^{(j)}_u$. Under (b2.1) of the (b2) functional faithfulness assumption, and given that $U, N < +\infty$, there exists an open set $X \subset \mathbb{R}^{Nd}$ with $\mu(X) = \mu(\mathbb{R}^{Nd})$, where $\mu(\cdot)$ denotes the Lebesgue measure, such that both Eq. (37) and the following condition hold (note that $\mathrm{PA}^{(j)}_u(t-1) \subseteq x_{t-1}$):

all the entries of $\mathrm{d}f^{(j)}_u / \mathrm{d}\,\mathrm{PA}^{(j)}_u(t-1)\big|_{x_{t-1}=\alpha}$ are non-zero, $\quad \forall \alpha \in X, \ j \in \{1, \dots, N\}$. (38)

To see this: under (b2.1), for each $u \in \{1, \dots, U\}$ there exists an open set $X_u$ in the subspace of $\mathrm{PA}^{(j)}_u(t-1)$ with $\mu(X_u) = \mu(\mathrm{PA}^{(j)}_u(t-1))$ such that the partial derivatives computed within this set are non-zero. Then we can construct $X$ as follows:

$X = \bar{X} \cap \hat{X}$, $\quad \bar{X} = \bigcap_{u=1}^{U} \big(X_u \times \{x^{(i)}_{t-1} \in \mathbb{R}^d \mid x^{(i)}_{t-1} \notin \mathrm{PA}^{(j)}_u(t-1)\}\big)$, $\quad \hat{X} = \bigcap_{u=1}^{U} \big(\hat{X}_u \times \{x^{(i)}_{t-1} \in \mathbb{R}^d \mid x^{(i)}_{t-1} \notin \widehat{\mathrm{PA}}^{(j)}_u(t-1)\}\big)$. (39)

We can show, w.l.o.g., that $\mu(X_u \times \{x^{(i)}_{t-1} \in \mathbb{R}^d \mid x^{(i)}_{t-1} \notin \mathrm{PA}^{(j)}_u(t-1)\}) = \mu(\mathbb{R}^{Nd})$, since the Lebesgue measure on $\mathbb{R}^{Nd}$ is a product measure. Therefore, using the union bound (on the complements), we also have $\mu(X) = \mu(\mathbb{R}^{Nd})$.

Now we show that $\hat{u}(\alpha, u)$ is constant for almost every $\alpha \in X$, and that those $\alpha$ values satisfy $\hat{u}(\alpha, u) = \hat{u}(u)$ and $\mathrm{PA}^{(j)}_u(t-1) = \widehat{\mathrm{PA}}^{(j)}_{\hat{u}(u)}(t-1)$ for all $j \in \{1, \dots, N\}$.

(i) By the model definition (m3) (Eq. (23)) and assumption (b2.1), within $x_{t-1} = \alpha \in X$ we have $p(x^{(j)}_t|x_{t-1}, u_{t-1} = u) = p(x^{(j)}_t|\mathrm{PA}^{(j)}_u(t-1), u_{t-1} = u)$. Similarly, for $\hat{p}$ we have $\hat{p}(x^{(j)}_t|x_{t-1}, u_{t-1} = \hat{u}(\alpha, u)) = \hat{p}(x^{(j)}_t|\widehat{\mathrm{PA}}^{(j)}_{\hat{u}(\alpha, u)}(t-1), u_{t-1} = \hat{u}(\alpha, u))$. Then, from Eq. (37) and assumption (b2.2), there exists at most a zero-measure set $X_0 \subset X$ of $\alpha$ values such that $\widehat{\mathrm{PA}}^{(j)}_{\hat{u}(\alpha, u)}(t-1) \neq \mathrm{PA}^{(j)}_u(t-1)$. Otherwise, we can show that there exist two $\alpha \neq \alpha' \in X_0$ with $\widehat{\mathrm{PA}}^{(j)}_{\hat{u}(\alpha, u)}(t-1) = \widehat{\mathrm{PA}}^{(j)}_{\hat{u}(\alpha', u)}(t-1) \neq \mathrm{PA}^{(j)}_u(t-1)$, which contradicts Eq. (37) and (b2.2). This means, by Eq. (23) again, and noticing that $\mu(X \setminus X_0) = \mu(X)$:

$\widehat{\mathrm{PA}}^{(j)}_{\hat{u}(\alpha, u)}(t-1) = \mathrm{PA}^{(j)}_u(t-1), \quad \forall \alpha \in X \setminus X_0, \ j \in \{1, \dots, N\}$.

(ii) Under (b1), unique indexing of the regime-dependent graph structure, $\hat{u}(\alpha, u)$ is constant w.r.t. $\alpha \in X \setminus X_0$.
To see this, if there existed $\alpha, \alpha' \in X \setminus X_0$ with $\hat{u}(\alpha, u) \neq \hat{u}(\alpha', u)$, then from (b1) there would exist $j \in \{1, \dots, N\}$ such that $\widehat{\mathrm{PA}}^{(j)}_{\hat{u}(\alpha, u)}(t-1) \neq \widehat{\mathrm{PA}}^{(j)}_{\hat{u}(\alpha', u)}(t-1)$, a contradiction to point (i) above. We now write $\hat{u}(\alpha, u) = \hat{u}(u)$ w.l.o.g.; then we have $\mathrm{PA}^{(j)}_u(t-1) = \widehat{\mathrm{PA}}^{(j)}_{\hat{u}(u)}(t-1)$ for all $\alpha \in X \setminus X_0$ and $j \in \{1, \dots, N\}$.

In summary, we have shown that there exists a permutation $\tau \in S_U$ such that $\hat{u} = \tau(u)$ and the following equivalence holds for all $j \in \{1, \dots, N\}$ and almost everywhere for $x_{t-1} \in \mathbb{R}^{Nd}$:

$p(x^{(j)}_t|x_{t-1}, u_{t-1} = u) = \hat{p}(x^{(j)}_t|x_{t-1}, u_{t-1} = \tau(u))$, $\quad \mathrm{PA}^{(j)}_u(t-1) = \widehat{\mathrm{PA}}^{(j)}_{\tau(u)}(t-1)$, $\quad \forall j \in \{1, \dots, N\}$, $x_{t-1} \in X \setminus X_0$, $\mu(X \setminus X_0) = \mu(\mathbb{R}^{Nd})$. (40)

A.5. Identifiability of Conditionally Stationary Time Series

We start by introducing assumption (m4) and establish identifiability of the conditional summary graph under unique indexing of outgoing edges (a1) and functional faithfulness (b2).

Theorem A.7. The multi-state dependent model (m1-m4) is partially identifiable w.r.t. the outgoing edge structure (Def. A.2) up to permutation equivalence of the states, if it satisfies the assumptions of (a1) unique indexing of outgoing structure and (b2) functional faithfulness.

Proof. Assume there exist two models under (m1-m4) satisfying $p(x_{1:T}) = \hat{p}(x_{1:T})$. For simplicity, we first view the above as a mixture model of $K^N$ regimes with one global state. Assume again a bijective mapping $\varphi : \{1, \dots, K\}^N \to \{1, \dots, K^N\}$ mapping a specific state configuration $(k_1, \dots, k_N)$ to an assigned value $k_{1:N} = \varphi(k_1, \dots, k_N) \in \{1, \dots, K^N\}$. Therefore, for any $t \in \{1, \dots, T\}$ we can set $u_t = \varphi(s_t)$ to reduce the multi-state model to a general MSM. Given that assumption (a1) implies (b1), from Theorem A.6 there exists a permutation $\tau \in S_{K^N}$ such that $\hat{k}_{1:N} = \tau(k_{1:N})$ and the following equivalence holds for all $j \in \{1, \dots, N\}$ and almost everywhere for $x_{t-1} \in \mathbb{R}^{Nd}$:

$p(x^{(j)}_t|x_{t-1}, u_{t-1} = k_{1:N}) = \hat{p}(x^{(j)}_t|x_{t-1}, u_{t-1} = \tau(k_{1:N}))$, $\quad \mathrm{PA}^{(j)}_{k_{1:N}}(t-1) = \widehat{\mathrm{PA}}^{(j)}_{\tau(k_{1:N})}(t-1)$, $\quad \forall j \in \{1, \dots, N\}$, $x_{t-1} \in X \setminus X_0$, $\mu(X \setminus X_0) = \mu(\mathbb{R}^{Nd})$. (41)

We can revert the above to the multi-state model notation (m1-m4), obtaining the following equivalence for the causal effects:

$\mathrm{PA}^{(j)}_{(k_1, \dots, k_N)}(t-1) = \widehat{\mathrm{PA}}^{(j)}_{(\varphi^{-1} \circ \tau \circ \varphi)(k_1, \dots, k_N)}(t-1)$, (42)

where the permutation $\tau$ is general and can permute both state indices and values. From (m4), recall the definition of $\mathrm{PA}^{(j)}_{(k_1, \dots, k_N)}(t-1)$ for some $(k_1, \dots, k_N) \in \{1, \dots, K\}^N$, and consider the equivalence with $\widehat{\mathrm{PA}}^{(j)}_{(\hat{k}_1, \dots, \hat{k}_N)}(t-1)$ for some $(\hat{k}_1, \dots, \hat{k}_N) \in \{1, \dots, K\}^N$:

$\mathrm{PA}^{(j)}_{s_{t-1}=(k_1, \dots, k_N)}(t-1) = \{x^{(i)}_{t-1} \mid j \in C^{(i)}_{k_i}\} = \{x^{(i)}_{t-1} \mid j \in \hat{C}^{(i)}_{\hat{k}_i}\} = \widehat{\mathrm{PA}}^{(j)}_{(\hat{k}_1, \dots, \hat{k}_N)}(t-1)$. (43)

Then, for a given $i \in \{1, \dots, N\}$ and by fixing $k = k_i \in \{1, \dots, K\}$, we can recover the outgoing edge structure $C^{(i)}_k$ from $\mathrm{PA}^{(j)}_{(k_1, \dots, k_N)}(t-1)$ for all $j \in \{1, \dots, N\}$. Equivalently, by fixing $\hat{k} = \hat{k}_i \in \{1, \dots, K\}$, we can recover $\hat{C}^{(i)}_{\hat{k}}$ from $\widehat{\mathrm{PA}}^{(j)}_{(\hat{k}_1, \dots, \hat{k}_N)}(t-1)$ for all $j \in \{1, \dots, N\}$:

$C^{(i)}_k = \big\{j \mid x^{(i)}_{t-1} \in \mathrm{PA}^{(j)}_{(k_1, \dots, k_{i-1}, k, k_{i+1}, \dots, k_N)}(t-1)\big\} = \big\{j \mid x^{(i)}_{t-1} \in \widehat{\mathrm{PA}}^{(j)}_{(\hat{k}_1, \dots, \hat{k}_{i-1}, \hat{k}, \hat{k}_{i+1}, \dots, \hat{k}_N)}(t-1)\big\} = \hat{C}^{(i)}_{\hat{k}}$.

Given the unique indexing assumption (a1), every $k$ necessarily maps to a different $\hat{k}$. Therefore, $\hat{k}$ is obtained from a permutation of $k$, which may differ among variables: there exists $\sigma^{(i)} \in S_K$ such that $C^{(i)}_k = \hat{C}^{(i)}_{\sigma^{(i)}(k)}$ for all $i \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$. This implies that the multi-state dependent model (m1-m4) is partially identifiable with respect to its outgoing edge structure.
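The recovery step above is constructive and can be mirrored directly in code. Below is a minimal sketch (our illustration; the helper `parents`, mapping a child $j$ and a state configuration to the set of its parents' indices, is hypothetical and stands in for the identified parent sets) that recovers each $C^{(i)}_k$ by fixing $k_i = k$ and collecting the children $j$ whose parent set contains variable $i$:

```python
def outgoing_edges(parents, N, K, base_config):
    """Recover the outgoing edge sets C^(i)_k from state-dependent parent sets.

    parents(j, config): hypothetical oracle returning the set of indices i
    such that x^(i)_{t-1} is a parent of x^(j)_t under the state
    configuration `config` (a tuple of length N with entries in 0..K-1).
    By Eq. (43), only the i-th entry of the configuration matters when
    recovering C^(i)_k, so the remaining entries can stay arbitrary.
    """
    C = {}
    for i in range(N):
        for k in range(K):
            config = list(base_config)
            config[i] = k  # fix k_i = k; other entries are kept as given
            C[(i, k)] = {j for j in range(N) if i in parents(j, tuple(config))}
    return C
```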
Corollary A.8. Conditionally stationary time series (m1-m5) are partially identifiable w.r.t. the labels of conditional effects (Def. A.3) up to permutations, if they satisfy (a1) unique indexing of outgoing structure and (a2) unique indexing of function types.

Proof. Under (m1-m4) and assuming (a1-a2), from Theorem A.7 the following equivalence holds for all $j \in \{1, \dots, N\}$ and almost everywhere for $x_{t-1} \in \mathbb{R}^{Nd}$:

$p\big(x^{(j)}_t|x_{t-1}, s_{t-1} = (k_1, \dots, k_N)\big) = \hat{p}\big(x^{(j)}_t|x_{t-1}, s_{t-1} = (\sigma^{(1)}(k_1), \dots, \sigma^{(N)}(k_N))\big)$, $\quad \forall j \in \{1, \dots, N\}$, $x_{t-1} \in X \setminus X_0$, $\mu(X \setminus X_0) = \mu(\mathbb{R}^{Nd})$. (44)

From Gaussian identifiability (Yakowitz & Spragins, 1968), we have $f^{(j)}_{s_{t-1}}(x_{t-1}) = \hat{f}^{(j)}_{\hat{s}_{t-1}}(x_{t-1})$ for each $j \in \{1, \dots, N\}$ and any $x_{t-1} \in X \setminus X_0$. From (m5), we have

$f^{(j)}_{s_{t-1}}(x_{t-1}) = g\big(f_{e^{(1,j)}_{t-1}}(x^{(1)}_{t-1}, x^{(j)}_{t-1}), \dots, f_{e^{(N,j)}_{t-1}}(x^{(N)}_{t-1}, x^{(j)}_{t-1})\big) = \hat{g}\big(\hat{f}_{\hat{e}^{(1,j)}_{t-1}}(x^{(1)}_{t-1}, x^{(j)}_{t-1}), \dots, \hat{f}_{\hat{e}^{(N,j)}_{t-1}}(x^{(N)}_{t-1}, x^{(j)}_{t-1})\big) = \hat{f}^{(j)}_{\hat{s}_{t-1}}(x_{t-1})$, (45)

where $e^{(i,j)}_{t-1} := w_{i,j,s^{(i)}_{t-1}}$. We can directly establish a relation from $e^{(i,j)}_{t-1}$ to $\hat{e}^{(i,j)}_{t-1}$ using the partial derivative of the mean of $x^{(j)}_t$ with respect to $x^{(i)}_{t-1}$:

$\frac{\partial f^{(j)}_{s_{t-1}}(x_{t-1})}{\partial x^{(i)}_{t-1}} = \frac{\partial g}{\partial h^{(i,j)}} \frac{\partial f_{e^{(i,j)}_{t-1}}}{\partial x^{(i)}_{t-1}} = \frac{\partial \hat{g}}{\partial \hat{h}^{(i,j)}} \frac{\partial \hat{f}_{\hat{e}^{(i,j)}_{t-1}}}{\partial x^{(i)}_{t-1}} = \frac{\partial \hat{f}^{(j)}_{\hat{s}_{t-1}}(x_{t-1})}{\partial x^{(i)}_{t-1}}$, (46)

where the derivative of $g$ is independent of $s_{t-1}$, and notably also of $j$ (which will be of great importance later). Under assumption (a2.2), the set $\mathcal{X}_{\mathrm{func}} = \bigcup_{e \neq e'} \big(\mathcal{X}_{e,e'} \cup \hat{\mathcal{X}}_{e,e'}\big)$ has measure zero, where $\hat{\mathcal{X}}_{e,e'}$ denote the sets in (a2.2) for the other model $\hat{p}$ under (m1-m5) satisfying (a1-a2). Let $X_0$ also contain the points where the partial derivative of $g$ with respect to $h^{(i,j)}$ is zero, a measure-zero set by (a2.1). Therefore, for any $x_0 \in X \setminus (X_0 \cup \mathcal{X}^N_{\mathrm{func}})$, where $\mathcal{X}^N_{\mathrm{func}}$ denotes the (zero-measure) set of points $x_{t-1} \in \mathbb{R}^{Nd}$ with some coordinate $x^{(i)}_{t-1} \in \mathcal{X}_{\mathrm{func}}$:

$\frac{\partial g}{\partial h^{(i,j)}} \frac{\partial f_{e^{(i,j)}_{t-1}}}{\partial x^{(i)}_{t-1}}\bigg|_{x^{(i)}_{t-1} = x^{(i)}_0} = \frac{\partial \hat{g}}{\partial \hat{h}^{(i,j)}} \frac{\partial \hat{f}_{\hat{e}^{(i,j)}_{t-1}}}{\partial x^{(i)}_{t-1}}\bigg|_{x^{(i)}_{t-1} = x^{(i)}_0}$. (47)

Note that $X \setminus (X_0 \cup \mathcal{X}^N_{\mathrm{func}})$ has full measure. Given that $e^{(i,j)}_{t-1} \in \{0, \dots, n_\epsilon - 1\}$, from (a2.1-a2.2) we know that if $e^{(i,j)}_{t-1} = 0$ then $\hat{e}^{(i,j)}_{t-1} = 0$, since $\hat{f}_e$ has non-zero derivatives for any $e \neq 0$. Furthermore, from (a2.2), for any $e \neq e'$ the derivatives of $f_e$ and $f_{e'}$ are not equal. Then any $e^{(i,j)}_{t-1} \neq 0$ must map uniquely to another $\hat{e}^{(i,j)}_{t-1} \neq 0$; otherwise, under assumptions (a2.1-a2.2) we reach a contradiction with Eq. (47), as it would imply a repetition of edge-types. Therefore, for each $i$ and $j$ there exists a permutation $\pi^{(i,j)} \in S_{n_\epsilon - 1}$ such that $\hat{e}^{(i,j)}_{t-1} = \pi^{(i,j)}(e^{(i,j)}_{t-1})$ for $e^{(i,j)}_{t-1} \in \{1, \dots, n_\epsilon - 1\}$. This does not imply any direct identifiability of the set of functions $f_1, \dots, f_{n_\epsilon - 1}$, but only of the labels of conditional effects $W$.

Due to the permutation invariance of the aggregation $g$, the partial derivative terms $\partial g / \partial h^{(i,j)}$ are equal across $i$ and $j$ for fixed $x_0 \in X \setminus (X_0 \cup \mathcal{X}^N_{\mathrm{func}})$. This forces the permutations $\pi^{(i,j)}$ to be equal, as otherwise equal edge-types $e^{(i,j)}_{t-1} \neq 0$ mapping to different $\hat{e}^{(i,j)}_{t-1}$ would violate Eq. (47) under assumptions (a2.1-a2.2). Therefore, there exists a single $\pi \in S_{n_\epsilon - 1}$ such that $\hat{e}^{(i,j)}_{t-1} = \pi(e^{(i,j)}_{t-1})$ for $e^{(i,j)}_{t-1} \in \{1, \dots, n_\epsilon - 1\}$. Now recall the definition of $W$:

$W := \big\{w_{i,j,k} \in \{0, \dots, n_\epsilon - 1\} : i, j \in \{1, \dots, N\}, \ k \in \{1, \dots, K\}, \ w_{i,j,k} = 0 \iff j \notin C^{(i)}_k\big\}$. (48)

From Theorem A.7, indexing w.r.t. $k$ in the above equation is equivalent up to a permutation $\sigma^{(i)}$ on $\hat{W}$. Considering $e^{(i,j)}_{t-1} := w_{i,j,s^{(i)}_{t-1}}$, we have the following equivalence for $\hat{W}$ when $w_{i,j,k} \neq 0$, for any $i, j \in \{1, \dots, N\}$ and $k \in \{1, \dots, K\}$:
$w_{i,j,k} = \pi\big(\hat{w}_{i,j,\sigma^{(i)}(k)}\big), \quad w_{i,j,k} \in \{1, \dots, n_\epsilon - 1\}$. (49)

This implies partial identifiability with respect to the labels of conditional effects $W$ (Def. A.3).

B. Implementation Details

B.1. ELBO Derivation

Below we derive the ELBO objective for SDCI, expressed in Eq. (15). We start with the likelihood term $p_\psi(x_{1:T})$ and write it in terms of the joint distribution $p_\psi(x_{1:T}, s_{1:T}, W)$:

$\log p_\psi(x_{1:T}) = \log \sum_{s_{1:T}} \sum_{W} p_\psi(x_{1:T}, s_{1:T}, W)$ (50)
$\geq \mathbb{E}_{q_\phi(W, s_{1:T}|x_{1:T})}\left[\log \frac{p_\psi(x_{1:T}, s_{1:T}, W)}{q_\phi(W, s_{1:T}|x_{1:T})}\right]$ (51)
$= \mathbb{E}_{q_\phi(W, s_{1:T}|x_{1:T})}\left[\log \frac{p_\psi(x_{1:T}, s_{1:T}|W)}{q_{\phi_s}(s_{1:T}|x_{1:T})}\right] + \mathbb{E}_{q_{\phi_w}(W|x_{1:T})}\left[\log \frac{p_{\psi_w}(W)}{q_{\phi_w}(W|x_{1:T})}\right]$ (52)
$= \mathbb{E}_{q_\phi(W, s_{1:T}|x_{1:T})}\left[\log p_\psi(x_{1:T}|s_{1:T}, W)\right] - \mathrm{KL}\big(q_{\phi_w}(W|x_{1:T}) \,\|\, p_{\psi_w}(W)\big)$ (53)
$\quad + \mathbb{E}_{q_{\phi_s}(s_{1:T}|x_{1:T})}\left[\log \frac{\prod_{t=1}^{T} p_{\psi_s}(s_t|x_{t:t-L_x}, s_{t-1:t-L_s})}{q_{\phi_s}(s_{1:T}|x_{1:T})}\right]$ (54)
$= \mathbb{E}_{q_\phi(W, s_{1:T}|x_{1:T})}\left[\log p_{\psi_x}(x_{1:T}|s_{1:T}, W)\right] - \mathrm{KL}\big(q_{\phi_w}(W|x_{1:T}) \,\|\, p_{\psi_w}(W)\big)$ (55)
$\quad - \sum_{t=1}^{T} \mathrm{KL}\big(q_{\phi_s}(s_t|x_{1:T}) \,\|\, p_{\psi_s}(s_t|x_{t:t-L_x}, s_{t-1:t-L_s})\big)$ (56)

B.2. Exact Inference on State Variables

Exact inference on state variables via sum-product implementations requires computing the posterior over all combinations of states at each time step, thus introducing $O(K^N)$ complexity. However, in the determined-states case, the state variable is directly determined by the observations, which implies $p(s^{(i)}_{t-1}|x_{1:t-1}) = p(s^{(i)}_{t-1}|x^{(i)}_{t-1})$. To compute the reconstruction term in Eq. (15), consider the likelihood distribution $p(x_{1:T}|W) = \prod_{t=1}^{T} p(x_t|x_{1:t-1}, W)$, where we assume the states have been marginalised. Following our implementation, consider the likelihood term $p(x_t|x_{1:t-1}, W)$ and assume the message $h^{(ij)}_t$ is a delta-distributed variable. Expanding the marginalisation over the state variables results in the following:

$p_\psi(x_t|x_{1:t-1}, W) = \prod_j p_{\psi_x}(x^{(j)}_t|h^{(1j)}_t, \dots, h^{(Nj)}_t, x^{(j)}_{t-1}) \prod_i \sum_{s^{(i)}_{t-1}=1}^{K} p_{\psi_x}(h^{(ij)}_t|x^{(i)}_{t-1}, x^{(j)}_{t-1}, W, s^{(i)}_{t-1})\, p_{\psi_s}(s^{(i)}_{t-1}|x^{(i)}_{t-1})$. (57)

Note that we can marginalise the states element-wise, thus reducing the exponential cost to $O(NK)$. Given that the above equation is equivalent to Eq. (16), we can set $q_{\phi_s} = p_{\psi_s}$ for exact inference on the state variables.

For the recurrent-states case, we cannot simplify the forward state posterior $p_\psi(s^{(i)}_{t-1}|x_{1:t-1})$. To see why exact inference results in exponential cost here, consider computing $p_\psi(x_t|x_{1:t-1}, W)$, where we can again try to marginalise the states:

$p_\psi(x_t|x_{1:t-1}, W) = \prod_j p_{\psi_x}(x^{(j)}_t|h^{(1j)}_t, \dots, h^{(Nj)}_t, x^{(j)}_{t-1}) \prod_i \sum_{s^{(i)}_{t-1}=1}^{K} p_{\psi_x}(h^{(ij)}_t|x^{(i)}_{t-1}, x^{(j)}_{t-1}, W, s^{(i)}_{t-1})\, p_\psi(s^{(i)}_{t-1}|x_{1:t-1})$, (58)

where the state posterior is obtained using the forward algorithm:

$p_\psi(s^{(i)}_{t-1}|x_{1:t-1}) = \sum_{s^{(i)}_{t-2}} p_\psi(s^{(i)}_{t-1}, s^{(i)}_{t-2}|x_{1:t-1}) = \sum_{s^{(i)}_{t-2}} p_{\psi_s}(s^{(i)}_{t-1}|s^{(i)}_{t-2}, x_{t-1:t-M})\, p_\psi(s^{(i)}_{t-2}|x_{1:t-1})$, (59)

where $p_{\psi_s}(s^{(i)}_{t-1}|s^{(i)}_{t-2}, x_{t-1:t-M})$ is given by the state decoder, and $p_\psi(s^{(i)}_{t-2}|x_{1:t-1})$ is computed as follows:

$p_\psi(s^{(i)}_{t-2}|x_{1:t-1}) = \frac{p_\psi(s^{(i)}_{t-2}, x_{1:t-2})\, p_\psi(x_{t-1}|s^{(i)}_{t-2}, x_{t-2})}{p_\psi(x_{1:t-2})\, p_\psi(x_{t-1}|x_{1:t-2})} = \frac{p_\psi(s^{(i)}_{t-2}|x_{1:t-2})\, p_\psi(x_{t-1}|s^{(i)}_{t-2}, x_{t-2})}{p_\psi(x_{t-1}|x_{1:t-2})}$, (60)

and we find the recursive terms $p_\psi(x_{t-1}|x_{1:t-2})$ and $p_\psi(s^{(i)}_{t-2}|x_{1:t-2})$. Unfortunately, computing $p_\psi(x_{t-1}|s^{(i)}_{t-2}, x_{t-2})$ requires marginalising $(s^{(1)}_{t-2}, \dots, s^{(N)}_{t-2})$ over all possible tuple combinations, thus introducing a computational cost of $O(K^N)$. The corresponding joint recursion reads:

$p_\psi(x_{1:t}, s_{t-1}|W) = p(x_t|x_{t-1}, s_{t-1}, W) \sum_{k} p(s_{t-1}|s_{t-2} = k, x_{t-1})\, p(s_{t-2} = k, x_{1:t-1}|W)$. (61)
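To make the cost reduction in Eq. (57) concrete, the following minimal NumPy sketch (our illustration; names are not from the SDCI codebase) averages the per-edge messages under each sender's state posterior, marginalising the $N$ state variables independently ($O(NK)$ terms) instead of enumerating all $K^N$ joint configurations:

```python
import numpy as np

def marginalise_messages(msg, state_post):
    """Element-wise state marginalisation for the determined-states case.

    msg:        (N, N, K, d) array; msg[i, j, k] is the edge message h^(ij)
                produced when the sender state is s^(i) = k.
    state_post: (N, K) array; state_post[i, k] = p(s^(i)_{t-1} = k | x^(i)_{t-1}).

    Returns an (N, N, d) array of expected messages. Each sender i is
    marginalised on its own, avoiding the K**N joint enumeration.
    """
    return np.einsum("ijkd,ik->ijd", msg, state_post)

# Toy example: N = 4 variables, K = 3 states, d = 2 message dimensions.
rng = np.random.default_rng(0)
msg = rng.normal(size=(4, 4, 3, 2))
post = rng.random(size=(4, 3))
post /= post.sum(axis=-1, keepdims=True)  # normalise each node's posterior
expected = marginalise_messages(msg, post)  # shape (4, 4, 2)
```

This is the soft (expected-message) variant; the exact $K$-mixture of Eq. (57) would instead keep the $K$ likelihood terms per edge before taking the product over $i$.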
B.3. Encoder Architecture

Below we provide details of our encoder architecture $q_\phi(W, s_{1:T}|x_{1:T}) = q_{\phi_w}(W|x_{1:T})\, q_{\phi_s}(s_{1:T}|x_{1:T})$.

Fixed graph encoder. As mentioned, we consider an amortised encoder on $W$ for state-dependent structures that are fixed across samples: $q_{\phi_w}(W|x_{1:T}) = q_{\phi_w}(W)$. We modify Eq. (13) and directly parametrise the logits $\phi_{ij} \in \mathbb{R}^{K \times n_\epsilon}$ independently of $x_{1:T}$.

Variable graph encoder. Our architecture extends directly from ACD (Löwe et al., 2022), and we incorporate the additional edge-to-edge layers proposed in Chen et al. (2021) only for SDCI under the recurrent-states case. The logits $\phi_{ij}$ for the distribution $q_{\phi_w}(W|x_{1:T})$ are obtained as follows. First, the model computes a latent embedding $z^{(i)}_1$ for each node $i$ using the whole sequence:

$z^{(i)}_1 = f_{\phi_1}(x^{(i)}_{1:T})$. (62)

Then each embedding is updated using a graph neural network (GNN) that captures the correlations between nodes. Specifically, the message passing procedure follows the two equations below:

$z^{(ij)}_1 = f_{\phi_2}(z^{(i)}_1, z^{(j)}_1)$, (63)
$z^{(j)}_2 = f_{\phi_3}\Big(\sum_{i \neq j} z^{(ij)}_1\Big)$. (64)

For the recurrent-states case, we incorporate the edge-to-edge message passing operations proposed by Chen et al. (2021) after computing the edge embeddings in Eq. (63). The edge-to-edge message passing treats the set of edges as a sequence, and we implement the mapping using self-attention; see Chen et al. (2021) for additional details. Finally, we obtain the softmax logits $\phi_{ij} \in \mathbb{R}^{K \times n_\epsilon}$ for every pair $i \neq j$ and every state value $1 \leq k \leq K$:

$\phi_{ij} = f_{\phi_4}(z^{(i)}_2, z^{(j)}_2)$. (65)

The above network architecture is visualised in Figure 13; the resulting logits parametrise $q_{\phi_w}$ according to Eq. (13).

[Figure 13: Illustration of the implementation of the SDCI encoder, adapted from ACD (Löwe et al., 2022), which allows conditioning on states (node embedding, pairwise concatenation, pairwise embedding, and edge-type distribution). The example considers 2 states.]

The architecture settings follow the designs in Löwe et al. (2022) and Chen et al. (2021). Each embedding step $f_{\phi_i}$, including the self-attention network in the edge-to-edge message passing, uses two layers of 256 units and ELU (Clevert et al., 2016) activations, followed by batch normalization. $f_{\phi_4}$ uses skip connections, and we modify its output size to generate a pairwise embedding for each of the $K$ states. For the fully-observed states case, the architecture for $q_{\phi_w}(W|x_{1:T}, s_{1:T})$ follows a similar structure, except that for the first layer we use $z^{(i)}_1 = f_{\phi_1}(\mathrm{concat}(x^{(i)}_{1:T}, \hat{s}^{(i)}_{1:T}))$, where $\hat{s}^{(i)}_{1:T}$ is a one-hot encoding of the states $\{s^{(i)}_t\}_{t=1}^{T}$.

State encoder. For determined states, we reuse the state decoder explained in the next paragraph. For recurrent states, we combine edge embeddings and the state posterior approximation from Ansari et al. (2021) to implement a GNN-RNN network. First, we generate edge embeddings with temporal components: this is equivalent to Eq. (62), where we use $x^{(i)}_t$ to obtain $z^{(i)}_{1,t}$. Then we obtain edge embeddings $z^{(ij)}_{1,t}$ as in Eq. (63), using the same implementation design. The temporal edge embeddings are forwarded to a 3-layer bi-directional RNN with GRU cells and 256 dimensions, followed by a 1-layer forward RNN with GRU cells and 256 dimensions. Finally, the resulting embeddings are aggregated following Eq. (65) with a similar implementation design.

State Decoder. The state decoder is implemented as a two-layer MLP with 256 units, where the input depends on the requirements of each state-dependency case: $x^{(i)}_t$ in the determined case, and $(x^{(i)}_t, x^{(i)}_{t-1}, s^{(i)}_{t-1})$ in the recurrent case.
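For concreteness, here is a minimal PyTorch sketch of the variable graph encoder pipeline in Eqs. (62)-(65); the class and argument names are ours, and the skip connections of $f_{\phi_4}$ as well as the edge-to-edge attention layers of the recurrent case are omitted for brevity:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Two-layer ELU MLP with batch norm, mirroring the embedding blocks f_phi."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ELU(),
            nn.Linear(hidden, out_dim), nn.ELU(),
            nn.BatchNorm1d(out_dim),
        )
    def forward(self, x):
        shape = x.shape  # BatchNorm1d expects (B, C); flatten leading dims
        return self.net(x.reshape(-1, shape[-1])).reshape(*shape[:-1], -1)

class VariableGraphEncoder(nn.Module):
    """Sketch of q_{phi_w}(W | x_{1:T}) following Eqs. (62)-(65)."""
    def __init__(self, T, d, K, n_edge_types, hidden=256):
        super().__init__()
        self.f1 = MLP(T * d, hidden)                  # Eq. (62): node embedding
        self.f2 = MLP(2 * hidden, hidden)             # Eq. (63): edge embedding
        self.f3 = MLP(hidden, hidden)                 # Eq. (64): node update
        self.f4 = MLP(2 * hidden, K * n_edge_types)   # Eq. (65): per-state logits
        self.K, self.n_edge_types = K, n_edge_types

    def forward(self, x):
        # x: (B, N, T, d) observed sequences.
        B, N = x.shape[:2]
        z1 = self.f1(x.reshape(B, N, -1))                       # (B, N, H)
        pairs = torch.cat([z1.unsqueeze(2).expand(-1, -1, N, -1),
                           z1.unsqueeze(1).expand(-1, N, -1, -1)], dim=-1)
        z_edge = self.f2(pairs)                                 # (B, N, N, H)
        mask = 1.0 - torch.eye(N, device=x.device).view(1, N, N, 1)
        z2 = self.f3((z_edge * mask).sum(dim=1))                # aggregate i != j
        pairs2 = torch.cat([z2.unsqueeze(2).expand(-1, -1, N, -1),
                            z2.unsqueeze(1).expand(-1, N, -1, -1)], dim=-1)
        logits = self.f4(pairs2)                                # (B, N, N, K*n_e)
        return logits.view(B, N, N, self.K, self.n_edge_types)

# Usage: 8 sequences of length 50, N = 5 variables, d = 4, K = 2 states.
enc = VariableGraphEncoder(T=50, d=4, K=2, n_edge_types=2)
logits = enc(torch.randn(8, 5, 50, 4))  # (8, 5, 5, 2, 2)
```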
Observation Decoder. We implement the set of functions $F = \{f_1, \dots, f_{n_\epsilon - 1}\}$ using two-layer MLPs of 256 units and leaky-ReLU activations with slope 0.1. The mapping $g$ uses skip connections and the same design as the edge-type functions above. For gene regulatory networks, we incorporate node embeddings into the aggregator $g$ following Gong et al. (2023), which increases network expressivity; the node embeddings have 16 dimensions. For NBA data, we instead incorporate the ball's features at each time step, $b_t \in \mathbb{R}^6$, denoting its 3D position and velocity.

B.4. Training Specifications

Our method is implemented in PyTorch (Paszke et al., 2019), and the experiments have been performed on NVIDIA RTX 2080 Ti GPUs. Code for our experiments is available at https://github.com/charlio23/SDCI.

SDCI and ACD. All SDCI and ACD (Löwe et al., 2022) models have been trained using the following scheme. Following Kipf et al. (2018), the models are trained using the Adam optimizer (Kingma & Ba, 2015). In all datasets, the learning rate of the edge-label encoder is $5 \times 10^{-4}$ for variable graphs and $5 \times 10^{-3}$ for fixed graphs. The learning rate of the decoder is $5 \times 10^{-4}$. The learning rate is decayed by a factor of 0.5 every 200 epochs. The duration of the training phase differs between datasets and state-dependence settings. We monitor the reconstruction error on validation data and use early stopping. We list additional specifications for each dataset below.

On springs and magnets data with determined states, we train for 1000 epochs using batch sizes of 100, 50, and 20 for $N$ equal to 5, 10, and 20, respectively. For recurrent states, the GNN-RNN network restricts GPU capacity, and we adapt the number of epochs and batch size to perform 200k iterations, decaying the learning rate every 60k iterations. On gene regulatory network data, in the determined-states setting we reduce the batch size to 10 to meet GPU capacity and train for 1000 epochs. For recurrent states, we lower the batch size to 1 due to the increased number of variables, and train for 200 epochs (200k iterations), similarly to the springs and magnets case. In all these settings, we use 2 edge-types and a sparsity regularisation of 0.9 via $p_{\psi_w}$. On NBA data, we train for a total of 450k iterations due to the large number of samples, with a batch size of 100, decaying the learning rate every 200k iterations. We find that both SDCI and ACD perform best using 2 edge-types and a sparsity regularisation of 0.8 via $p_{\psi_w}$.

The decoder is trained with teacher forcing every 10 time-steps, i.e., it receives the ground truth as input every 10 time-steps. The variance of the decoder's Gaussian distribution is $\sigma^2 = 5 \times 10^{-5}$. The temperature term of the edge-label encoder is $\tau = 0.5$. The state encoder temperature $\gamma$ follows a schedule similar to Ansari et al. (2021), which prevents state collapse, i.e., the model ignoring the states: we first set $\gamma = 5$ and decrease the temperature every epoch by a factor of 0.8 until $\gamma = 0.5$.

VRNN. The experiments with NBA player trajectories consider a VRNN as a non-causal baseline to compare forecasting performance. Below we specify the network architecture and training scheme. To allow a fair comparison between SDCI, ACD, NRI, and VRNN, we modify the decoder defined in Chung et al. (2015) to condition the player positions on the ball features, similarly to the previous models: $p_\theta(x_t|x$