# NEURAL STRUCTURE LEARNING WITH STOCHASTIC DIFFERENTIAL EQUATIONS

Published as a conference paper at ICLR 2024

Benjie Wang (University of California, Los Angeles), Joel Jennings (DeepMind), Wenbo Gong (Microsoft Research Cambridge)

Equal contribution. Correspondence to benjiewang@ucla.edu, wenbogong@microsoft.com. Work done during an internship at Microsoft Research Cambridge. Code: https://github.com/microsoft/causica/tree/main/research_experiments/scotch

ABSTRACT

Discovering the underlying relationships among variables from temporal observations has been a longstanding challenge in numerous scientific disciplines, including biology, finance, and climate science. The dynamics of such systems are often best described using continuous-time stochastic processes. Unfortunately, most existing structure learning approaches assume that the underlying process evolves in discrete time and/or that observations occur at regular time intervals. These mismatched assumptions can often lead to incorrect learned structures and models. In this work, we introduce a novel structure learning method, SCOTCH, which combines neural stochastic differential equations (SDEs) with variational inference to infer a posterior distribution over possible structures. This continuous-time approach can naturally handle both learning from and predicting observations at arbitrary time points. Theoretically, we establish sufficient conditions for SDEs and SCOTCH to be structurally identifiable, and prove consistency in the infinite data limit. Empirically, we demonstrate that our approach leads to improved structure learning performance on both synthetic and real-world datasets compared to relevant baselines, under both regular and irregular sampling intervals.

1 INTRODUCTION

Time-series data is ubiquitous in the real world, often comprising a series of data points recorded at varying time intervals. Understanding the underlying structure between variables associated with temporal processes is of paramount importance for numerous real-world applications (Spirtes et al., 2000; Berzuini et al., 2012; Peters et al., 2017). Although randomised experiments are considered the gold standard for unveiling such relationships, they are frequently hindered by factors such as cost and ethical concerns. Structure learning seeks to infer hidden structures from purely observational data, offering a powerful approach for a wide array of applications (Bellot et al., 2022; Löwe et al., 2022; Runge, 2018; Tank et al., 2021; Pamfil et al., 2020; Gong et al., 2022).

However, many existing structure learning methods for time series are discrete in time, assuming that the underlying temporal process is discretized and that sampling intervals are uniform. Consequently, these models face two limitations: (i) they may misrepresent the true underlying process when it is continuous, potentially leading to incorrect inferred relationships; and (ii) they struggle to handle irregular sampling intervals, which frequently arise in fields such as single-cell biology (Trapnell et al., 2014; Qiu et al., 2017), healthcare (Qian et al., 2020), and climate science (Bracco et al., 2018; Raia, 2008).

To address these challenges, we introduce a novel framework, Structure learning with COntinuous Time stoCHastic models (SCOTCH), which employs stochastic differential equations (SDEs) for structure learning in temporal processes. SCOTCH can naturally handle irregularly sampled time series and accurately represent and learn continuous-time processes. We make the following key contributions:

1. We introduce a novel latent stochastic differential equation (SDE) formulation for modelling structure in continuous-time observational time-series data.
To effectively train our proposed model, which we denote as SCOTCH, we adapt the variational inference framework of Li et al. (2020); Tzen & Raginsky (2019a) to approximate the posterior over both the underlying graph structure and the latent variables. We show that, in contrast to a prior approach based on ordinary differential equations (ODEs) (Bellot et al., 2022), our model is capable of accurately learning the underlying dynamics from trajectories exhibiting multimodality and non-Gaussian distributions.

2. We provide a rigorous theoretical analysis to support the proposed methodology. Specifically, we prove that when SDEs are employed directly to model the observational process, the resulting SDEs are structurally identifiable under global Lipschitz and diagonal noise assumptions. We also prove that our model maintains structural identifiability under certain conditions even when adopting the latent formulation, and that variational inference combined with the latent formulation can, in the infinite data limit, recover the ground truth graph structure and mechanisms under specific assumptions.

3. Empirically, we conduct extensive experiments on both synthetic and real-world datasets, showing that SCOTCH improves upon existing structure learning methods, including when the data is irregularly sampled.

2 PRELIMINARIES

In the rest of this paper, we use $X_t \in \mathbb{R}^D$ to denote the D-dimensional observation vector at time t, with $X_{t,d}$ representing the d-th variable of the observation. A time series is a set of I observations $X = \{X_{t_i}\}_{i=1}^{I}$, where $\{t_i\}_{i=1}^{I}$ are the observation times. When we have multiple (N) i.i.d. time series, we use $X^{(n)}$ to denote the n-th time series.

Bayesian structure learning. In structure learning, the aim is to infer from data the underlying graph representing the relationships between variables. In the Bayesian approach, given time series data $\{X^{(n)}\}_{n=1}^{N}$, we define a joint distribution over graphs and data:

$$p(G, X^{(1)}, \ldots, X^{(N)}) = p(G) \prod_{n=1}^{N} p(X^{(n)} \mid G) \quad (1)$$

where $p(G)$ is the graph prior and $p(X^{(n)} \mid G)$ is the likelihood term. The goal is then to compute the graph posterior $p(G \mid X^{(1)}, \ldots, X^{(N)})$. However, analytic computation is intractable in high-dimensional settings. Therefore, variational inference (Zhang et al., 2018) and sampling methods (Welling & Teh, 2011; Gong et al., 2018; Annadani et al., 2023) are commonly used for inference.

Structural equation models (SEMs). Given a time series X and a graph $G \in \{0, 1\}^{D \times D}$, we can use an SEM to describe the structural relationships between variables:

$$X_{t,d} = f_{t,d}\big(\mathrm{Pa}^G_d(<t), \epsilon_{t,d}\big) \quad (2)$$

where $\mathrm{Pa}^G_d(<t)$ specifies the lagged parents of $X_{t,d}$ at previous times and the $\epsilon_{t,d}$ are mutually independent noise variables. Such a model requires discrete time steps, which are usually assumed to follow a regular sampling interval, i.e. $t_{i+1} - t_i$ is constant for all $i = 1, \ldots, I-1$. Most existing models can be regarded as special cases of this framework.

Itô diffusion. A time-homogeneous Itô diffusion is a stochastic process $X_t$ of the form

$$dX_t = f(X_t)\,dt + g(X_t)\,dW_t \quad (3)$$

where $f: \mathbb{R}^D \to \mathbb{R}^D$ and $g: \mathbb{R}^D \to \mathbb{R}^{D \times D}$ are time-homogeneous drift and diffusion functions, respectively, and $W_t$ is a Brownian motion under the measure P. It is known that under global Lipschitz guarantees (Assumption 1) this has a unique strong solution (Øksendal, 2003).

Euler discretization and Euler SEM. For most Itô diffusions, the analytic solution $X_t$ is intractable, especially with non-linear drift and diffusion functions. Thus, we often simulate the trajectory by discretization. One common discretization method is the Euler-Maruyama (EM) scheme. With a fixed step size $\Delta$, EM simulates the trajectory as

$$X^{\Delta}_{t+1} = X^{\Delta}_{t} + f(X^{\Delta}_{t})\,\Delta + g(X^{\Delta}_{t})\,\eta_t \quad (4)$$

where $X^{\Delta}_{t}$ is the random variable induced by the discretization and $\eta_t \sim \mathcal{N}(0, \Delta I)$. Notice that eq. (4) is a special case of eq. (2). If we define the graph G as follows: $X^{\Delta}_{t,i} \to X^{\Delta}_{t+1,j}$ is in G iff $\partial f_j(X^{\Delta}_t)/\partial X^{\Delta}_{t,i} \neq 0$ or $\exists k,\ \partial g_{j,k}(X^{\Delta}_t)/\partial X^{\Delta}_{t,i} \neq 0$, and assume g only outputs a diagonal matrix, then the above EM scheme induces a temporal SEM, called the Euler SEM (Hansen & Sokol, 2014), which provides a useful analysis tool for continuous-time processes.
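For concreteness, the following is a minimal sketch (not the authors' code) of the Euler-Maruyama scheme in eq. (4) for a diffusion with diagonal noise; the two-dimensional drift and diffusion functions at the bottom are illustrative placeholders.

```python
import numpy as np

def euler_maruyama(f, g, x0, delta, num_steps, rng=None):
    """Simulate eq. (4): X_{t+1} = X_t + f(X_t)*delta + g(X_t)*eta_t,
    where eta_t ~ N(0, delta*I) and g(.) returns the diagonal of the
    diffusion matrix (Assumption 2)."""
    rng = rng or np.random.default_rng(0)
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        x = xs[-1]
        eta = rng.normal(scale=np.sqrt(delta), size=x.shape)
        xs.append(x + f(x) * delta + g(x) * eta)
    return np.stack(xs)  # shape (num_steps + 1, D)

# Illustrative 2-D system: X_0 drives X_1, so the induced Euler SEM graph
# contains the edge 0 -> 1 (plus self-loops).
f = lambda x: np.array([-x[0], np.tanh(x[0]) - x[1]])
g = lambda x: np.array([0.1, 0.1])   # diagonal diffusion, constant here
traj = euler_maruyama(f, g, x0=[1.0, 0.0], delta=0.01, num_steps=500)
```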
3 SCOTCH: BAYESIAN STRUCTURE LEARNING FOR CONTINUOUS TIME SERIES

We consider a dynamical system in which there is both intrinsic stochasticity in the evolution of the state and independent measurement noise in the observed data. For example, in healthcare, a patient's condition progresses with randomness rather than deterministically; further, measurements of the patient's condition are affected by the accuracy of the equipment, and this noise is independent of the intrinsic stochasticity. To account for this behaviour, we propose to use the latent SDE formulation (Li et al., 2020; Tzen & Raginsky, 2019a):

$$dZ_t = f_\theta(Z_t)\,dt + g_\theta(Z_t)\,dW_t \quad \text{(latent process)}$$
$$X_t = Z_t + \epsilon_t \quad \text{(noisy observations)} \quad (5)$$

where $Z_t \in \mathbb{R}^D$ is the latent variable representing the internal state of the dynamical system, $X_t \in \mathbb{R}^D$ is the observation with the same dimension, $\epsilon_t$ is additive, dimension-wise independent noise, $f_\theta: \mathbb{R}^D \to \mathbb{R}^D$ is the drift function, $g_\theta: \mathbb{R}^D \to \mathbb{R}^{D \times D}$ is the diffusion function, and $W_t$ is the Wiener process. For SCOTCH, we require the following two assumptions:

Assumption 1 (Global Lipschitz). We assume that the drift and diffusion functions in eq. (5) satisfy the global Lipschitz constraint; namely,

$$|f_\theta(x) - f_\theta(y)| + |g_\theta(x) - g_\theta(y)| \le C\,|x - y| \quad (6)$$

for some constant C and all $x, y \in \mathbb{R}^D$, where $|\cdot|$ denotes the L2 norm for vector-valued functions and the corresponding matrix norm for matrix-valued functions.

Assumption 2 (Diagonal diffusion). We assume that the diffusion function $g_\theta$ outputs a non-zero diagonal matrix. That is, it can be simplified to a vector-valued function $g_\theta: \mathbb{R}^D \to \mathbb{R}^D$.

The former is a standard assumption required in most of the SDE literature to ensure the existence of a strong solution. The key distinction is the latter assumption of a non-zero diagonal diffusion function $g_\theta$, rather than a full diffusion matrix, which enables structural identifiability as we show in the next section. Please refer to Appendix A.1 for more detailed explanations.

Now, in accordance with the graph defined for Euler SEMs (Section 2), we define the signature graph G as follows: edge $i \to j$ is present in G iff there exists t such that either $\partial f_j(Z_t)/\partial Z_{t,i} \neq 0$ or $\partial g_j(Z_t)/\partial Z_{t,i} \neq 0$. Note that there is no requirement for the graph to be acyclic. Intuitively, the graph G describes the structural dependence between variables.
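As an illustration of this definition (not part of the SCOTCH algorithm itself), the signature graph of a known drift and diagonal diffusion can be probed numerically by checking which Jacobian entries are non-zero at a set of sampled states. The sketch below assumes PyTorch and treats derivatives below a small tolerance as zero.

```python
import torch

def estimate_signature_graph(f, g, states, tol=1e-6):
    """Estimate the signature graph: edge i -> j iff df_j/dz_i or dg_j/dz_i
    is non-zero (above `tol`) at some probed state."""
    D = states.shape[-1]
    G = torch.zeros(D, D, dtype=torch.bool)
    for z in states:
        Jf = torch.autograd.functional.jacobian(f, z)  # Jf[j, i] = df_j / dz_i
        Jg = torch.autograd.functional.jacobian(g, z)
        G |= (Jf.abs() > tol).T | (Jg.abs() > tol).T   # transpose so G[i, j] marks i -> j
    return G.int()

# Toy example: the drift of z_1 depends only on z_0, so we expect edges
# 0 -> 0, 0 -> 1 and 1 -> 1 (the last from the state-dependent diffusion), but not 1 -> 0.
f = lambda z: torch.stack([-z[0], torch.tanh(z[0])])
g = lambda z: 0.1 * torch.sigmoid(z)   # diagonal diffusion (Assumption 2)
print(estimate_signature_graph(f, g, torch.randn(64, 2)))
```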
We now instantiate our Bayesian structure learning scheme with prior and likelihood components.

Prior over graphs. Leveraging Geffner et al. (2022); Annadani et al. (2023), our graph prior is designed as

$$p(G) \propto \exp\big(-\lambda_s \|G\|_F^2\big) \quad (7)$$

where $\lambda_s$ is the graph sparsity coefficient and $\|\cdot\|_F$ is the Frobenius norm.

Prior process. Since the latent process induces a distribution over latent trajectories $p_\theta(Z)$ before seeing any observations, we also call it the prior process. We propose to use neural networks for the drift and diffusion functions, $f_\theta: \mathbb{R}^D \times \{0,1\}^{D \times D} \to \mathbb{R}^D$ and $g_\theta: \mathbb{R}^D \times \{0,1\}^{D \times D} \to \mathbb{R}^D$, which explicitly take the latent state and the graph as inputs. Although the signature graph is defined through the function derivatives, we explicitly use the graph G as an input to enforce the constraint. We will interchangeably use the notation $f_G$ and $g_G$ to denote $f_\theta(\cdot, G)$ and $g_\theta(\cdot, G)$. For the graph-dependent drift and diffusion, we leverage the design of Geffner et al. (2022) and propose

$$f_{G,d}(Z_t) = \zeta\Big(\sum_{i=1}^{D} G_{i,d}\, l(Z_{t,i}, e_i),\ e_d\Big) \quad (8)$$

for both $f_G$ and $g_G$, where $\zeta, l$ are neural networks and $e_i$ is a trainable node embedding for the i-th node. The corresponding prior process is

$$dZ_t = f_\theta(Z_t, G)\,dt + g_\theta(Z_t, G)\,dW_t \quad \text{(prior process)} \quad (9)$$

Likelihood of time series. Given a time series $X = \{X_{t_i}\}_{i=1}^{I}$, the likelihood is defined as

$$p\big(\{X_{t_i}\}_{i=1}^{I} \mid \{Z_{t_i}\}_{i=1}^{I}, G\big) = \prod_{i=1}^{I} \prod_{d=1}^{D} p_{\epsilon_d}\big(X_{t_i,d} - Z_{t_i,d}\big) \quad (10)$$

where $p_{\epsilon_d}$ is the observation noise distribution for the d-th dimension.

3.1 VARIATIONAL INFERENCE

Suppose that we are given multiple time series $\{X^{(n)}\}_{n=1}^{N}$ as observed data from the system. The goal is then to compute the posterior over graph structures $p(G \mid \{X^{(n)}\}_{n=1}^{N})$, which is intractable. Thus, we leverage variational inference to simultaneously approximate both the graph posterior and a latent posterior process over $Z^{(n)}$ for every observed time series $X^{(n)}$. Given N i.i.d. time series $\{X^{(n)}\}_{n=1}^{N}$, we propose the variational approximation $q_\phi(G) \approx p(G \mid X^{(1)}, \ldots, X^{(N)})$. With the standard trick from variational inference, we obtain the following evidence lower bound (ELBO):

$$\log p(X^{(1)}, \ldots, X^{(N)}) \ge \mathbb{E}_{q_\phi(G)}\Big[\sum_{n=1}^{N} \log p_\theta(X^{(n)} \mid G)\Big] - D_{\mathrm{KL}}\big(q_\phi(G) \,\|\, p(G)\big) \quad (11)$$

Unfortunately, the distribution $p_\theta(X^{(n)} \mid G)$ remains intractable due to the marginalization over the latent Itô diffusion $Z^{(n)}$. Therefore, we leverage the variational framework proposed in Tzen & Raginsky (2019a); Li et al. (2020) to approximate the true posterior $p(Z^{(n)} \mid X^{(n)}, G)$. For each $n = 1, \ldots, N$, the variational posterior $q_\psi(\tilde{Z}^{(n)} \mid X^{(n)}, G)$ is defined as the solution to

$$\tilde{Z}^{(n)}_0 \sim \mathcal{N}\big(\mu_\psi(G, X^{(n)}),\ \Sigma_\psi(G, X^{(n)})\big) \quad \text{(posterior initial state)}$$
$$d\tilde{Z}^{(n)}_t = h_\psi\big(\tilde{Z}^{(n)}_t, t; G, X^{(n)}\big)\,dt + g_G\big(\tilde{Z}^{(n)}_t\big)\,dW_t \quad \text{(posterior process)} \quad (12)$$

For the initial latent state, $\mu_\psi, \Sigma_\psi$ are posterior mean and covariance functions implemented as neural networks. For the SDE, we use the same diffusion function $g_G$ for both the prior and posterior processes, but train a separate neural drift function $h_\psi$ for the posterior, which takes the time series $X^{(n)}$ as input. The posterior drift function differs from the prior in two key ways. First, the posterior drift depends on time; this is necessary because conditioning on the observed data creates this dependence even when the prior process is time-homogeneous. Second, while $h_\psi$ takes the graph G as an input, its design is not constrained to have a matching signature graph like $f_G$. More details on the implementation of $h_\psi, \mu_\psi, \Sigma_\psi$ can be found in Appendix B.
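For concreteness, a minimal PyTorch sketch of one way to realise the graph-conditioned drift of eq. (8) is given below; the hidden sizes, embedding dimension, and module names are our own illustrative choices rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class GraphConditionedDrift(nn.Module):
    """f_{G,d}(Z_t) = zeta( sum_i G[i, d] * l(Z_{t,i}, e_i), e_d ), cf. eq. (8)."""
    def __init__(self, num_nodes, embed_dim=16, hidden_dim=64):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_nodes, embed_dim))
        self.l = nn.Sequential(nn.Linear(1 + embed_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim))
        self.zeta = nn.Sequential(nn.Linear(hidden_dim + embed_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, 1))

    def forward(self, z, G):
        # z: (batch, D) latent states; G: (D, D) binary adjacency, G[i, d] = 1 iff edge i -> d.
        B, D = z.shape
        e = self.embeddings.expand(B, -1, -1)                       # (B, D, E)
        per_node = self.l(torch.cat([z.unsqueeze(-1), e], dim=-1))  # l(Z_{t,i}, e_i): (B, D, H)
        # Masked aggregation over the parents of each node d: sum_i G[i, d] * l(z_i, e_i).
        agg = torch.einsum("id,bih->bdh", G.float(), per_node)      # (B, D, H)
        return self.zeta(torch.cat([agg, e], dim=-1)).squeeze(-1)   # (B, D)

drift = GraphConditionedDrift(num_nodes=5)
dz = drift(torch.randn(8, 5), torch.bernoulli(torch.full((5, 5), 0.3)))
```

The same masked design can be reused for the diagonal diffusion $g_G$ (e.g. with a positive output activation), so that variables outside the parent set of node d cannot influence its drift or diffusion.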
Assume that for each time series $X^{(n)}$ we have observation times $t_i$, $i = 1, \ldots, I$, within the time range $[0, T]$. Then we have the following evidence lower bound for $\log p(X^{(n)} \mid G)$ (Li et al., 2020):

$$\log p(X^{(n)} \mid G) \ge \mathbb{E}_{q_\psi}\Big[\sum_{i=1}^{I} \log p\big(X^{(n)}_{t_i} \mid \tilde{Z}^{(n)}_{t_i}, G\big) - \int_0^T \tfrac{1}{2}\big\|u^{(n)}(\tilde{Z}^{(n)}_t)\big\|^2\,dt\Big] \quad (13)$$

where $\tilde{Z}^{(n)}_t$ is the posterior process modelled by eq. (12) and $u^{(n)}(\tilde{Z}^{(n)}_t)$ is given by

$$u^{(n)}(\tilde{Z}^{(n)}_t) = g_G(\tilde{Z}^{(n)}_t)^{-1}\Big(h_\psi\big(\tilde{Z}^{(n)}_t, t; G, X^{(n)}\big) - f_G\big(\tilde{Z}^{(n)}_t\big)\Big) \quad (14)$$

Algorithm 1: SCOTCH training
Input: i.i.d. time series $\{X^{(n)}\}_{n=1}^{N}$; drift functions $f_G, h_\psi$; diffusion function $g_G$; SDE solver Solver; initial condition $\tilde{Z}^{(n)}_0$; training iterations L
for l = 1, ..., L do
    Sample a time series mini-batch $\{X^{(n)}\}_{n=1}^{S}$ with batch size S
    for n = 1, ..., S do
        Draw graph $G \sim q_\phi(G)$
        Draw initial latent state $\tilde{Z}^{(n)}_0 \sim \mathcal{N}(\mu_\psi(G, X^{(n)}), \Sigma_\psi(G, X^{(n)}))$
        Solve (sample from) the posterior process: $(\tilde{Z}^{(n)}, L) = \text{Solver}\big((\tilde{Z}^{(n)}_0, 0), f_G, h_\psi, g_G\big)$
    end for
    Maximize the ELBO (eq. (15)) w.r.t. $\phi, \psi, \theta$
end for

[Figure 1: Comparison between NGM and SCOTCH for a simple SDE (note vertical axis scale); each panel plots "Dimension 0" over time.]

By combining eq. (11) and eq. (13), we derive the overall ELBO:

$$\log p_\theta(X^{(1)}, \ldots, X^{(N)}) \ge \mathbb{E}_{q_\phi(G)}\bigg[\sum_{n=1}^{N} \mathbb{E}_{q_\psi}\Big[\sum_{i=1}^{I} \log p\big(X^{(n)}_{t_i} \mid \tilde{Z}^{(n)}_{t_i}, G\big) - \int_0^T \tfrac{1}{2}\big\|u^{(n)}(\tilde{Z}^{(n)}_t)\big\|^2\,dt\Big]\bigg] - D_{\mathrm{KL}}\big(q_\phi(G)\,\|\,p(G)\big) \quad (15)$$

In practice, we approximate the ELBO (and its gradients) using a Monte Carlo approximation. The inner expectation can be approximated by simulating from an augmented version of eq. (12) in which an extra variable L is added with drift $\tfrac{1}{2}|u^{(n)}(\tilde{Z}^{(n)}_t)|^2$ and zero diffusion (Li et al., 2020). Algorithm 1 summarizes the training procedure of SCOTCH.

3.2 STOCHASTICITY AND CONTINUOUS-TIME MODELLING

Stochasticity. Bellot et al. (2022) proposed a structure learning method, called NGM, to learn from a single time series generated by an SDE. NGM uses a neural ODE to model the mean process $f_\theta$ and extracts the graphical structure from the first layer of $f_\theta$. However, NGM assumes that the observed single series X follows a multivariate Gaussian distribution, which only holds for linear SDEs. If this assumption is violated, optimizing their proposed squared loss cannot recover the underlying system. SCOTCH does not have this limitation and can handle more flexible, state-dependent drifts and diffusions. Another drawback of NGM is its inability to handle multiple time series (N > 1). Learning from multiple series is important when dealing with SDEs exhibiting multimodal behaviour. We propose a simple bimodal 1-D failure case: $dX = X\,dt + 0.01\,dW_t$ with $X_0 = 0$, whose signature graph contains a self-loop. Figure 1 shows the bimodal trajectories (upwards and downwards) sampled from this SDE. The optimal ODE mean process in this case is the constant $f_\theta = 0$ with an empty graph, as confirmed by the learned mean process of NGM (black line in fig. 1b). In contrast, SCOTCH can learn the underlying SDE and simulate the correct trajectories (fig. 1c).

Discrete vs. continuous time. Gong et al. (2022) proposed a flexible discretized temporal SEM, called Rhino, that is capable of modelling (1) lagged parents, (2) instantaneous effects, and (3) history-dependent noise. Rhino's SEM is given by $X_{t,d} = f_d\big(\mathrm{Pa}^G_d(<t), \mathrm{Pa}^G_d(t)\big) + g_d\big(\mathrm{Pa}^G_d(<t)\big)\,\epsilon_{t,d}$, and its similarity to SCOTCH is clear. If $f_d$ has a residual structure $f_d(\cdot) = X_{t-1,d} + r_d(\cdot)$ and we assume no instantaneous effects ($\mathrm{Pa}^G_d(t)$ is empty), the Rhino SEM is equivalent to the Euler SEM of the latent process (eq. (9)) with drift r, step size $\Delta$ and diagonal diffusion g. Thus, analogous to the relationship between ResNets (He et al., 2016) and Neural ODEs (Chen et al., 2018), SCOTCH can be interpreted as the continuous-time analogue of Rhino.
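Before turning to the theoretical analysis, the following self-contained sketch shows how a single Monte-Carlo ELBO term can be computed in the spirit of eqs. (13)-(15) and Algorithm 1. It assumes regularly spaced observations for readability, uses a hand-rolled Euler-Maruyama solve with the augmented KL state rather than a dedicated SDE-solver library, and all function names are placeholders for the neural networks described above.

```python
import torch

def elbo_single_series(x_obs, dt, f_prior, h_post, g_diff, z0_mean, z0_std, obs_std, G):
    """One Monte-Carlo ELBO sample for a single series (cf. eq. (13)), assuming
    observations x_obs of shape (I, D) spaced dt apart. f_prior/h_post/g_diff map
    the latent state (and graph) to drift/diffusion vectors of shape (D,)."""
    I, D = x_obs.shape
    z = z0_mean + z0_std * torch.randn(D)      # posterior initial state, eq. (12)
    kl = torch.zeros(())                       # augmented state accumulating the KL integral
    log_lik = torch.zeros(())
    for i in range(I):
        # Observation term: Gaussian noise model X_t = Z_t + eps_t (eqs. (5), (10)).
        log_lik = log_lik + torch.distributions.Normal(z, obs_std).log_prob(x_obs[i]).sum()
        # Euler-Maruyama step of the posterior SDE (eq. (12)), sharing the diffusion g_G.
        t = i * dt
        h, f, g = h_post(z, t, G, x_obs), f_prior(z, G), g_diff(z, G)
        u = (h - f) / g                        # eq. (14) for diagonal diffusion
        kl = kl + 0.5 * (u ** 2).sum() * dt    # drift of the augmented variable
        z = z + h * dt + g * (dt ** 0.5) * torch.randn(D)
    return log_lik - kl                        # per-series contribution to eq. (15)
```

A full training step averages this quantity over a mini-batch of series and graphs sampled from $q_\phi(G)$, subtracts $D_{\mathrm{KL}}(q_\phi(G)\,\|\,p(G))$, and ascends the result with respect to $\phi, \psi, \theta$.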
4 THEORETICAL CONSIDERATIONS OF SCOTCH

In this section, we address three important theoretical questions regarding the Itô diffusion proposed in Section 3. For notational simplicity, we consider the single time series setting. First, we examine when a general Itô diffusion is structurally identifiable. Second, we consider structural identifiability in the latent formulation of eq. (5). Finally, we consider whether optimising the ELBO (eq. (15)) can recover the true graph and mechanism given infinite observations of a single time series within a fixed time range [0, T]. All detailed proofs, definitions, and assumptions can be found in Appendix A.

4.1 STRUCTURE IDENTIFIABILITY

Suppose that the observational process is given by an Itô diffusion:

$$dX_t = f_G(X_t)\,dt + g_G(X_t)\,dW_t \quad (16)$$

We may then ask what sufficient conditions make the model structurally identifiable, that is, ensure that there does not exist $\tilde{G} \neq G$ inducing the same observational distribution.

Theorem 4.1 (Structural identifiability of the observational process). Given eq. (16), define another process $\tilde{X}_t$ with $\tilde{G} \neq G$, $f_{\tilde{G}}$, $g_{\tilde{G}}$ and $W_t$. Then, under Assumptions 1-2, and with the same initial condition $X(0) = \tilde{X}(0) = x_0$, the solutions $X_t$ and $\tilde{X}_t$ have different distributions.

Next, we show that structural identifiability is preserved, under certain conditions, even in the latent formulation where the SDE solution is not directly observed.

Theorem 4.2 (Structural identifiability with latent formulation). Consider the distributions $p, \tilde{p}$ defined by the latent model in eq. (5) with $(G, Z, X, f_G, g_G)$ and $(\tilde{G}, \tilde{Z}, \tilde{X}, f_{\tilde{G}}, g_{\tilde{G}})$ respectively, where $\tilde{G} \neq G$. Further, let $t_1, \ldots, t_I$ be the observation times. Then, under Assumptions 1 and 2:
1. if $t_{i+1} - t_i = \Delta$ for all $i \in 1, \ldots, I-1$, then $p_\Delta(X_{t_1}, \ldots, X_{t_I}) \neq \tilde{p}_\Delta(\tilde{X}_{t_1}, \ldots, \tilde{X}_{t_I})$, where $p_\Delta$ is the density generated by the Euler-discretized eq. (9) for $Z_t$;
2. if we have a fixed time range [0, T], then the path probabilities satisfy $p(X_{t_1}, \ldots, X_{t_I}) \neq \tilde{p}(\tilde{X}_{t_1}, \ldots, \tilde{X}_{t_I})$ in the limit of infinite data ($I \to \infty$).

4.2 CONSISTENCY

Building upon structural identifiability, we can prove the consistency of the variational formulation: in the infinite data limit, one can recover the ground truth graph and mechanism by maximizing the ELBO with a sufficiently expressive posterior process and a correctly specified model.

Theorem 4.3 (Consistency of the variational formulation). Suppose Assumptions 1-4 are satisfied for the latent formulation (eq. (5)). Then, for a fixed observation time range [0, T], as the number of observations $I \to \infty$, when the ELBO (eq. (15)) is maximised, $q_\phi(G) = \delta(G^*)$, where $G^*$ is the ground truth graph, and the latent formulation recovers the underlying ground truth mechanism.

5 RELATED WORK

Discrete-time causal models. The majority of existing approaches are inherently discrete in time; Assaad et al. (2022) provide a comprehensive overview. There are three main types of discovery methods: (1) Granger causality; (2) structural equation model (SEM) based; and (3) constraint-based methods. Granger causality assumes that no instantaneous effects are present and that the causal direction cannot flow backward in time. Wu et al. (2020); Shojaie & Michailidis (2010); Siggiridou & Kugiumtzis (2015); Amornbunchornvej et al.
(2019) leverage vector-autoregressive models to predict future observations, while Löwe et al. (2022); Tank et al. (2021); Bussmann et al. (2021); Dang et al. (2019); Xu et al. (2019); Khanna & Tan (2019) utilise deep neural networks for prediction. Recently, Cheng et al. (2023) introduced a deep-learning-based Granger causality method that can handle irregularly sampled data, treating it as a missing-data problem and proposing a joint framework for data imputation and graph fitting.

SEM-based approaches assume an explicit causal model for the temporal process. Hyvärinen et al. (2010) leverage the identifiability of additive noise models (Hoyer et al., 2008) to build a linear autoregressive SEM with non-Gaussian noise. Pamfil et al. (2020) utilise the NOTEARS framework (Zheng et al., 2018) to continuously relax the DAG constraint for fully differentiable structure learning. Recently, Gong et al. (2022) extended the DECI framework (Geffner et al., 2022) to handle time series data, modelling both instantaneous effects and history-dependent noise.

Constraint-based approaches use conditional independence tests to determine causal structures. Runge et al. (2019) combine the PC algorithm (Spirtes et al., 2000) with momentary conditional independence tests for the lagged parents. PCMCI+ (Runge, 2020) can additionally detect instantaneous effects, and LPCMCI (Reiser, 2022) can further handle latent confounders. CD-NOD (Zhang et al., 2017) is designed to handle non-stationary, heterogeneous time series data. However, all constraint-based approaches can only identify the graph up to its Markov equivalence class, without the functional relationships between variables.

Continuous-time causal models. In terms of using differential equations to model continuous temporal processes, Hansen & Sokol (2014) proposed using stochastic differential equations to describe temporal causal systems; they proved identifiability with respect to intervention distributions, but did not show how to learn a corresponding SDE. Penalised regression has been explored for linear models, for which parameter consistency has been established (Ramsay et al., 2007; Chen et al., 2017; Wu et al., 2014). Recently, NGM (Bellot et al., 2022) uses ODEs to model the temporal process, with both identifiability and consistency results. As discussed in previous sections, SCOTCH is based on SDEs rather than ODEs and can model the intrinsic stochasticity of the causal system, whereas NGM assumes deterministic state transitions.

6 EXPERIMENTS

Baselines and metrics. We benchmark our method against a representative sample of baselines: (i) VARLiNGaM (Hyvärinen et al., 2010), a linear SEM-based approach; (ii) PCMCI+ (Runge, 2018; 2020), a constraint-based method for time series; (iii) CUTS (Cheng et al., 2023), a Granger causality approach that can handle irregular time series; (iv) Rhino (Gong et al., 2022), a non-linear SEM-based approach with history-dependent noise and instantaneous effects; and (v) NGM (Bellot et al., 2022), a continuous-time ODE-based structure learner. Since most methods require a threshold to determine the graph, we use the threshold-free area under the ROC curve (AUROC) as the performance metric. In Appendix D, we also report the F1 score, true positive rate (TPR), and false discovery rate (FDR).

Setup. Both the synthetic datasets (Lorenz-96, Glycolysis) and the real-world datasets (DREAM3, Netsim) consist of multiple time series. However, it is not trivial to modify NGM and CUTS to support multiple time series.
For a fair comparison, we use the concatenation of multiple time series, which we found empirically to improve performance. We also mimic irregularly sampled data by randomly dropping observations, which VARLiNGaM, PCMCI, and Rhino cannot handle; in these cases, we impute the missing data for these methods using a zero-order hold (ZOH). For SCOTCH, we use pathwise gradient estimators with Euler discretization for solving the SDE (see Appendix D.1 for a discussion of this choice). Further experimental details can be found in Appendices B, C, and D.

6.1 SYNTHETIC EXPERIMENTS: LORENZ AND GLYCOLYSIS

First, we evaluate SCOTCH on synthetic benchmarks including the Lorenz-96 (Lorenz, 1996) and Glycolysis (Daniels & Nemenman, 2015) datasets, which model continuous-time dynamical systems. The Lorenz model is a well-known example of chaotic systems observed in biology (Goldberger & West, 1987; Heltberg et al., 2019). To mimic irregularly sampled data, we follow the setup of Cheng et al. (2023) and randomly drop observations with missing probability p. We also simulate another dataset from a biological model describing the metabolic reactions that break down glucose in cells. This dataset, called Glycolysis, consists of an SDE with 7 variables. As a preprocessing step, we standardise this dataset to avoid large differences in variable scales. Both datasets consist of N = 10 time series with sequence length I = 100 (before random drops) and have dimensionality 10 and 7, respectively. Note that we choose a large data sampling interval because we want to test settings where observations are fairly sparse and the difficulty of correctly modelling continuous-time dynamics increases. This data setup differs from Bellot et al. (2022); Cheng et al. (2023), who use a single series with I = 1000 observations, which is more informative than our sparse setting. Refer to Appendix D.3 and Appendix D.4 for details.

The left two columns in Table 1 compare the AUROC of SCOTCH and the baselines on Lorenz-96. SCOTCH handles the irregularly sampled data more effectively than the other baselines: compared to NGM and CUTS, it achieves much better results with small missingness and performs competitively with larger missingness. Rhino, VARLiNGaM, and PCMCI+ perform poorly in comparison, as they assume regularly sampled observations and are discrete in nature. From the right column in Table 1, SCOTCH outperforms the baselines by a large margin on Glycolysis. In particular, compared to the ODE-based NGM, SCOTCH clearly demonstrates the advantage of the proposed SDE framework in multiple time series settings. As anticipated from the discussion in Section 3.2, NGM can produce an incorrect model when multiple time series are sampled from a given SDE system. Another interesting observation is that SCOTCH is more robust than NGM to data with different variable scales (refer to Appendix D.4.3). This robustness is due to the stochastic nature of the SDE compared to the deterministic ODE, which can easily overshoot and exhibit less stable training behaviour. We also see that SCOTCH has a significant advantage over both CUTS and Rhino, which do not model continuous-time dynamics.

| Method | Lorenz-96 (p = 0.3) | Lorenz-96 (p = 0.6) | Glycolysis (Full) |
|---|---|---|---|
| VARLiNGaM | 0.5102 ± 0.025 | 0.4876 ± 0.032 | 0.5082 ± 0.009 |
| PCMCI+ | 0.4990 ± 0.008 | 0.4952 ± 0.021 | 0.4607 ± 0.031 |
| NGM | 0.6788 ± 0.009 | 0.6329 ± 0.008 | 0.5953 ± 0.018 |
| CUTS | 0.6042 ± 0.015 | 0.6418 ± 0.012 | 0.580 ± 0.007 |
| Rhino | 0.5714 ± 0.026 | 0.5123 ± 0.025 | 0.520 ± 0.015 |
| SCOTCH (ours) | 0.7279 ± 0.017 | 0.6453 ± 0.014 | 0.7113 ± 0.012 |

Table 1: AUROC on the synthetic datasets for SCOTCH and baselines. p denotes the missing probability, and Full means complete data without missingness. Each number is the average over 5 runs.
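As a concrete illustration of the irregular-sampling protocol described above (our own sketch, not the released experiment code), the snippet below drops observations with probability p and applies a zero-order-hold imputation of the kind used for the grid-based baselines; for grid points before the first retained observation it simply reuses the first retained value.

```python
import numpy as np

def random_drop(times, values, p, rng=None):
    """Keep each observation independently with probability 1 - p."""
    rng = rng or np.random.default_rng(0)
    keep = rng.random(len(times)) >= p
    return times[keep], values[keep]

def zero_order_hold(grid, obs_times, obs_values):
    """Impute values on a regular grid by carrying the last observation forward."""
    idx = np.searchsorted(obs_times, grid, side="right") - 1
    idx = np.clip(idx, 0, len(obs_times) - 1)   # reuse the first value before the first obs
    return obs_values[idx]

times = np.arange(100, dtype=float)                       # regular sampling times
values = np.random.default_rng(1).normal(size=(100, 7))   # placeholder series (I=100, D=7)
t_irr, x_irr = random_drop(times, values, p=0.3)          # irregular series seen by SCOTCH
x_zoh = zero_order_hold(times, t_irr, x_irr)              # what ZOH-imputed baselines see
```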
We also evaluate SCOTCH on the DREAM3 datasets (Prill et al., 2010; Marbach et al., 2009), which have been widely adopted for assessing structure learning performance (Tank et al., 2021; Pamfil et al., 2020; Gong et al., 2022). These datasets contain in silico measurements of gene expression levels for 5 different network structures. Each dataset corresponds to a particular gene expression network and contains N = 46 time series of 100-dimensional variables, with I = 21 observations per series. The goal is to infer the underlying structure from each dataset. Following the same setup as Gong et al. (2022); Khanna & Tan (2019), we ignore all self-connections by setting their edge probabilities to 0, and use AUROC as the performance metric. Appendix D.5 details the experimental setup, selected hyperparameters, and additional plots. We do not include VARLiNGaM since it cannot support time series where the dimensionality (100) is greater than the length (21). Also, due to the short time series length, we decided not to test with irregularly sampled data. We use the numbers reported for Rhino and PCMCI+ in Gong et al. (2022), as the experimental setup is identical. For CUTS, we failed to reproduce the number reported in their paper, but we cite it for a fair comparison.

Table 2 shows the AUROC of SCOTCH and the baselines. We clearly observe that SCOTCH outperforms the other baselines by a large margin. This indicates the advantage of the SDE formulation over ODEs and discretized temporal models, even when the data is complete and regularly sampled. A more interesting observation comes from comparing Rhino with SCOTCH: as discussed before, SCOTCH is the continuous-time version of Rhino, so the advantage comes from the continuous formulation and the corresponding training objective (eq. (15)).

| Method | EColi1 | EColi2 | Yeast1 | Yeast2 | Yeast3 | Mean |
|---|---|---|---|---|---|---|
| PCMCI+ | 0.530 ± 0.002 | 0.519 ± 0.002 | 0.530 ± 0.003 | 0.510 ± 0.001 | 0.512 ± 0 | 0.520 ± 0.004 |
| NGM | 0.611 ± 0.002 | 0.595 ± 0.005 | 0.597 ± 0.006 | 0.563 ± 0.006 | 0.535 ± 0.004 | 0.580 ± 0.007 |
| CUTS | 0.543 ± 0.003 | 0.555 ± 0.005 | 0.545 ± 0.003 | 0.518 ± 0.007 | 0.511 ± 0.002 | 0.534 ± 0.008 (0.591) |
| Rhino | 0.685 ± 0.003 | 0.680 ± 0.007 | 0.664 ± 0.006 | 0.585 ± 0.004 | 0.567 ± 0.003 | 0.636 ± 0.022 |
| SCOTCH (ours) | 0.752 ± 0.008 | 0.705 ± 0.003 | 0.712 ± 0.003 | 0.622 ± 0.004 | 0.594 ± 0.001 | 0.677 ± 0.026 |

Table 2: AUROC for SCOTCH and baselines on the DREAM3 100-dimensional datasets. Results are averaged over 5 runs. The CUTS performance reported in the original paper is cited in parentheses.

| Method | Full | p = 0.1 | p = 0.2 |
|---|---|---|---|
| VARLiNGaM | 0.84 ± 0 | 0.723 ± 0.001 | 0.719 ± 0.003 |
| PCMCI | 0.83 ± 0 | 0.81 ± 0.001 | 0.79 ± 0.006 |
| NGM | 0.89 ± 0.009 | 0.86 ± 0.009 | 0.85 ± 0.007 |
| CUTS | 0.89 ± 0.010 | 0.87 ± 0.008 | 0.87 ± 0.011 |
| Rhino+NoInst | 0.95 ± 0.001 | 0.93 ± 0.005 | 0.90 ± 0.012 |
| SCOTCH (ours) | 0.95 ± 0.006 | 0.91 ± 0.007 | 0.89 ± 0.007 |
| Rhino | 0.99 ± 0.001 | 0.98 ± 0.004 | 0.97 ± 0.003 |

Table 3: AUROC on the Netsim dataset (subjects 2-6). Results are averaged over 5 runs.

Netsim consists of blood-oxygenation-level-dependent imaging data. Following the same setup as Gong et al.
(2022), we use subjects 2-6 to form the dataset, which consists of 5 time series. Each contains 15-dimensional observations with I = 200. The goal is to infer the underlying connectivity between different brain regions. Unlike DREAM3, we include the self-connection edges for all methods. To evaluate performance under irregularly sampled data, we follow the same setup as for Lorenz-96 and Cheng et al. (2023), randomly dropping observations with a given missing probability. Since it is very important to model instantaneous effects in Netsim (Gong et al., 2022), which SCOTCH cannot handle, we replace Rhino with Rhino+NoInst and PCMCI+ with PCMCI for a fair comparison. Table 3 shows the performance comparison. We observe that SCOTCH significantly outperforms the other baselines and performs on par with Rhino+NoInst, which demonstrates its robustness to smaller datasets and its balance between true and false positive rates. Again, this confirms the modelling power of our approach compared to NGM and the other baselines. Interestingly, Rhino-based approaches perform particularly well on the Netsim dataset. We suspect that the underlying generation mechanism is better modelled by a discretized rather than a continuous system.

7 CONCLUSION

We propose SCOTCH, a flexible continuous-time temporal structure learning method based on latent Itô diffusions. We leverage a variational inference framework to infer the posterior over latent states and the graph. Theoretically, we validate our approach by proving the structural identifiability of the Itô diffusion and the latent formulation, and the consistency of the proposed variational framework. Empirically, we extensively evaluate our approach on synthetic and semi-synthetic datasets, where SCOTCH outperforms the baselines on both regularly and irregularly sampled data.

There are three limitations that require further investigation. The first is the inability to handle instantaneous effects, which can arise due to data aggregation. Another, computational, drawback is that the cost scales linearly with the series length; this could potentially be addressed by incorporating an encoder network to infer latent states at arbitrary time points. Last but not least, the current formulation of SCOTCH cannot handle non-stationary systems due to the time-homogeneous drift and diffusion functions; direct incorporation of time embeddings may break the theoretical guarantees without additional assumptions, so new theory and methodology may be needed for this scenario. We leave these challenges for future work.

ACKNOWLEDGEMENTS

We thank the members of the Causica team at Microsoft Research for helpful discussions. We thank Colleen Tyler, Maria Defante, and Lisa Parks for conversations on real-world use cases that inspired this work. This work was done in part while Benjie Wang was visiting the Simons Institute for the Theory of Computing.

REFERENCES

Chainarong Amornbunchornvej, Elena Zheleva, and Tanya Y Berger-Wolf. Variable-lag Granger causality for time series analysis. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 21-30. IEEE, 2019.

Yashas Annadani, Nick Pawlowski, Joel Jennings, Stefan Bauer, Cheng Zhang, and Wenbo Gong. BayesDAG: Gradient-based posterior sampling for causal discovery. arXiv preprint arXiv:2307.13917, 2023.

Charles K Assaad, Emilie Devijver, and Eric Gaussier. Survey and evaluation of causal discovery methods for time series.
Journal of Artificial Intelligence Research, 73:767-819, 2022.

Alexis Bellot, Kim Branson, and Mihaela van der Schaar. Neural graphical modelling in continuous-time: consistency guarantees and algorithms. In International Conference on Learning Representations, 2022.

Carlo Berzuini, Philip Dawid, and Luisa Bernardinelli. Causality: Statistical perspectives and applications. John Wiley & Sons, 2012.

Michelle Boué and Paul Dupuis. A variational representation for certain functionals of Brownian motion. The Annals of Probability, 26(4):1641-1659, 1998.

Annalisa Bracco, Fabrizio Falasca, Athanasios Nenes, Ilias Fountalis, and Constantine Dovrolis. Advancing climate science with knowledge-discovery through data mining. npj Climate and Atmospheric Science, 1(1):20174, 2018.

Bart Bussmann, Jannes Nys, and Steven Latré. Neural additive vector autoregression models for causal discovery in time series. In Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, October 11-13, 2021, Proceedings 24, pp. 446-460. Springer, 2021.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

Shizhe Chen, Ali Shojaie, and Daniela M Witten. Network reconstruction from high-dimensional ordinary differential equations. Journal of the American Statistical Association, 112(520):1697-1707, 2017.

Yuxiao Cheng, Runzhao Yang, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. CUTS: Neural causal discovery from irregular time-series data. arXiv preprint arXiv:2302.07458, 2023.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Andrea Cini, Ivan Marisca, and Cesare Alippi. Filling the gaps: Multivariate time series imputation by graph neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=kOu3-S3wJ7.

Xuan-Hong Dang, Syed Yousaf Shah, and Petros Zerfos. seq2graph: Discovering dynamic nonlinear dependencies from multivariate time series. In 2019 IEEE International Conference on Big Data (Big Data), pp. 1774-1783. IEEE, 2019.

Bryan C Daniels and Ilya Nemenman. Efficient inference of parsimonious phenomenological models of cellular dynamics using S-systems and alternating regression. PLoS ONE, 10(3):e0119821, 2015.

Paul Dupuis and Richard S Ellis. A weak convergence approach to the theory of large deviations. John Wiley & Sons, 2011.

Tomas Geffner, Javier Antoran, Adam Foster, Wenbo Gong, Chao Ma, Emre Kiciman, Amit Sharma, Angus Lamb, Martin Kukla, Nick Pawlowski, et al. Deep end-to-end causal inference. arXiv preprint arXiv:2202.02195, 2022.

Ary L Goldberger and Bruce J West. Applications of nonlinear dynamics to clinical cardiology. Annals of the New York Academy of Sciences, 504:195-213, 1987.

Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-learning for stochastic gradient MCMC. arXiv preprint arXiv:1806.04522, 2018.

Wenbo Gong, Joel Jennings, Cheng Zhang, and Nick Pawlowski. Rhino: Deep causal temporal relationship learning with history-dependent noise. arXiv preprint arXiv:2210.14706, 2022.

Niels Hansen and Alexander Sokol. Causal interpretation of stochastic differential equations. 2014.
Ali Hasan, Joao M Pereira, Sina Farsiu, and Vahid Tarokh. Identifying latent stochastic differential equations. IEEE Transactions on Signal Processing, 70:89-104, 2021.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Mathias L Heltberg, Sandeep Krishna, and Mogens H Jensen. On chaotic dynamics in transcription factors and the associated effects in differential gene regulation. Nature Communications, 10(1):71, 2019.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems, 21, 2008.

Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O Hoyer. Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11(5), 2010.

Saurabh Khanna and Vincent YF Tan. Economy statistical recurrent units for inferring nonlinear Granger causality. arXiv preprint arXiv:1911.09879, 2019.

Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207-2217. PMLR, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Xuechen Li, Ting-Kam Leonard Wong, Ricky TQ Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pp. 3870-3882. PMLR, 2020.

Edward N Lorenz. Predictability: A problem partly solved. In Proc. Seminar on Predictability, volume 1. Reading, 1996.

Sindy Löwe, David Madras, Richard Zemel, and Max Welling. Amortized causal discovery: Learning to infer causal graphs from time-series data. In Conference on Causal Learning and Reasoning, pp. 509-525. PMLR, 2022.

Daniel Marbach, Thomas Schaffter, Claudio Mattiussi, and Dario Floreano. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 16(2):229-239, 2009.

Bernt Øksendal. Stochastic differential equations. Springer, 2003.

Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam. DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pp. 1595-1605. PMLR, 2020.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.

Robert J Prill, Daniel Marbach, Julio Saez-Rodriguez, Peter K Sorger, Leonidas G Alexopoulos, Xiaowei Xue, Neil D Clarke, Gregoire Altan-Bonnet, and Gustavo Stolovitzky. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE, 5(2):e9202, 2010.

Zhaozhi Qian, Ahmed Alaa, Alexis Bellot, Mihaela Schaar, and Jem Rashbass. Learning dynamic and personalized comorbidity networks from event data using deep diffusion processes. In International Conference on Artificial Intelligence and Statistics, pp. 3295-3305. PMLR, 2020.
Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah A Pliner, and Cole Trapnell. Reversed graph embedding resolves complex single-cell trajectories. Nature Methods, 14(10):979-982, 2017.

Federica Raia. Causality in complex dynamic systems: A challenge in earth systems science education. Journal of Geoscience Education, 56(1):81-94, 2008.

Jim O Ramsay, Giles Hooker, David Campbell, and Jiguo Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(5):741-796, 2007.

Christian Reiser. Causal discovery for time series with latent confounders. arXiv preprint arXiv:2209.03427, 2022.

Jakob Runge. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7), 2018.

Jakob Runge. Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. In Conference on Uncertainty in Artificial Intelligence, pp. 1388-1397. PMLR, 2020.

Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019.

Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.

Ali Shojaie and George Michailidis. Discovering graphical Granger causality using the truncating lasso penalty. Bioinformatics, 26(18):i517-i523, 2010.

Elsa Siggiridou and Dimitris Kugiumtzis. Granger causality in multivariate time series using a time-ordered restricted vector autoregressive model. IEEE Transactions on Signal Processing, 64(7):1759-1773, 2015.

Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT Press, 2000.

Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B Fox. Neural Granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267-4279, 2021.

Cole Trapnell, Davide Cacchiarelli, Jonna Grimsby, Prapti Pokharel, Shuqiang Li, Michael Morse, Niall J Lennon, Kenneth J Livak, Tarjei S Mikkelsen, and John L Rinn. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology, 32(4):381-386, 2014.

Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019a.

Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pp. 3084-3114. PMLR, 2019b.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688. Citeseer, 2011.

Hulin Wu, Tao Lu, Hongqi Xue, and Hua Liang. Sparse additive ordinary differential equations for dynamic gene regulatory network modeling. Journal of the American Statistical Association, 109(506):700-716, 2014.

Tailin Wu, Thomas Breuel, Michael Skuhersky, and Jan Kautz. Discovering nonlinear relations with minimum predictive information regularization. arXiv preprint arXiv:2001.01885, 2020.

Chenxiao Xu, Hao Huang, and Shinjae Yoo.
Scalable causal graph learning through a deep neural network. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1853-1862, 2019.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008-2026, 2018.

Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference, volume 2017, pp. 1347. NIH Public Access, 2017.

Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.