# Causal Discovery in Semi-Stationary Time Series

Shanyun Gao (Purdue University) gao565@purdue.edu
Raghavendra Addanki (Adobe Research) raddanki@adobe.com
Tong Yu (Adobe Research) tyu@adobe.com
Ryan A. Rossi (Adobe Research) ryrossi@adobe.com
Murat Kocaoglu (Purdue University) mkocaoglu@purdue.edu

Discovering causal relations from observational time series without making the stationarity assumption is a significant challenge. In practice, this challenge is common in many areas, such as retail sales, transportation systems, and medical science. Here, we consider this problem for a class of non-stationary time series. The structural causal model (SCM) of this type of time series, called the semi-stationary time series, exhibits a finite number of different causal mechanisms that occur sequentially and periodically across time. This model holds considerable practical utility because it can represent periodicity, including common occurrences such as seasonality and diurnal variation. We propose a constraint-based, non-parametric algorithm for discovering causal relations in this setting. The resulting algorithm, PCMCIΩ, can capture the alternating and recurring changes in the causal mechanisms and then identify the underlying causal graph with conditional independence (CI) tests. We show that this algorithm is sound in identifying causal relations on discrete-valued time series. We validate the algorithm with extensive experiments on continuous and discrete simulated data. We also apply our algorithm to a real-world climate dataset.

1 Introduction

In modern sciences, causal discovery aims to identify the collection of causal relations from observational data, as in Pearl [1980], Peters et al. [2017] and Spirtes et al. [2000]. One of the most popular causal discovery approaches is the so-called constraint-based method. Constraint-based approaches assume that the probability distribution of the variables is causally Markov and faithful to a directed acyclic graph called the causal graph. Given large enough data, they can then recover the corresponding Markov equivalence class by exploiting conditional independence relationships among the variables. See Peters et al. [2017]. There are many constraint-based algorithms, such as the PC and FCI algorithms of Spirtes et al. [2000]. The standard assumption of these approaches is that data samples are independent and identically distributed, which makes it possible to perform CI tests. See Bergsma [2004], Zhang et al. [2012] and Shah and Peters [2020]. Recently, there have been numerous efforts to extend such constraint-based algorithms to accommodate time series data. For instance, PCMCI in Runge et al. [2019] and LPCMCI in Gerhardus and Runge [2020] are PC-based algorithms for time series. Inspired by FCI, approaches designed for time series include ANLSTM in Chu and Glymour [2008], tsFCI in Entner and Hoyer [2010] and SVAR-FCI in Malinsky and Spirtes [2018]. This setup is relevant in several industrial applications since many data points have an associated time-point, such as root-cause analysis in Ikram et al. [2022]. Most of the existing causal discovery algorithms make the stationarity assumption. See Chu and Glymour [2008], Hyvärinen et al. [2010], Entner and Hoyer [2010], Peters et al. [2013], Malinsky and Spirtes [2018], Runge et al. [2019], Pamfil et al. [2020] and Assaad et al. [2022].
Non-stationary temporal data makes causal discovery more challenging since the statistics are time-variant, and it is unreasonable to expect that the underlying causal structure is time-invariant. Identifying causal relations from non-stationary time series without imposing any restriction on the data is difficult. Here, we focus on a specific class of non-stationary time series, called semi-stationary time series, whose structural causal model (SCM) exhibits a finite number of different causal mechanisms that occur sequentially and periodically across time. One example is illustrated in Fig. 1, where the time series $X^1$ has three different causal mechanisms across time, shown as red, green, and blue edges, respectively. Similarly, time series $X^2$ has two alternating causal mechanisms. This setting holds considerable practical utility. Periodic behavior is commonly observed in many real-world time series. See Han et al. [2002], Nakamura et al. [2003], Carskadon et al. [2005] and Komarzynski et al. [2018]. Here are a few additional intuitive examples: poor traffic conditions often coincide with commute times and weekends; household electricity consumption typically follows a pattern of being higher at night and lower during the daytime. Consequently, it is reasonable to expect periodic changes in the causal relations underlying this type of time series without assuming stationarity. Here, the constraint-based methods in Chu and Glymour [2008], Entner and Hoyer [2010], Malinsky and Spirtes [2018] and Runge et al. [2019], designed for stationary time series, may fail. Given observational data with periodically changing causal structures, it is hard to apply CI tests directly. Most of the other algorithms designed for non-stationary time series rely heavily on model assumptions, as in Gong et al. [2015], Pamfil et al. [2020], and Huang et al. [2019]. These algorithms are discussed further in the related work.

In this paper, we propose an algorithm to address this problem, namely non-parametric causal discovery in time series data with semi-stationary SCMs. The key contributions of our work are:

- We develop an algorithm to discover the causal structure from semi-stationary time series data where the underlying causal structures change periodically. Our algorithm systematically uses the PCMCI algorithm proposed for the stationary setting in Runge et al. [2019]. The resulting algorithm is hence named PCMCIΩ, where Ω denotes the periodicity.
- We validate our method with synthetic simulations on both continuous-valued and discrete-valued time series, showing that our method can correctly learn the periodicity and causal mechanisms of the synthetic time series.
- We apply our method to a real-world climate application. The result reveals the potential existence of periodicity in those time series, suggesting that the stationarity assumption made by previous works can be relaxed in some practical situations.

1.1 Related Work

PCMCI has been applied in diverse domains: to investigate biosphere-atmosphere interactions, as demonstrated in Krich et al. [2020]; global wildfires, as explored in Qu et al. [2021]; water usage, as studied in Zou et al. [2022]; ultra-processed food manufacturing, as examined in Menegozzo et al. [2020]; and causal feature selection, as discussed in Peterson [2022], among other applications. See Arvind et al. [2021], Gerhardus and Runge [2020], Castri et al. [2023a,b]. While PCMCI has achieved considerable success, it is not without its limitations.
One notable assumption that can be challenged is that of causal stationarity, i.e., that the causal relations are time-invariant. PCMCI exhibits robustness when applied to linear models with an added non-stationary trend; see Runge et al. [2019]. However, there is an ongoing effort to extend its performance to a wider range of non-stationary settings. Although not as extensively as the stationary case, causal discovery in non-stationary time series has been studied by several authors. However, many of those algorithms rely on parametric assumptions, such as the vector autoregressive models in Gong et al. [2015] and Malinsky and Spirtes [2019], and the linear and nonlinear state-space models in Huang et al. [2019]. One non-parametric algorithm in the literature is CD-NOD, proposed by Huang et al. [2020], which has been extended to recover time-varying instantaneous and lagged causal relationships. In very recent work, Fujiwara et al. [2023] proposed the algorithm JIT-LiNGAM, which obtains a locally approximated linear causal model by combining the LiNGAM algorithm with the just-in-time (JIT) framework for non-linear and non-stationary data. To the best of our knowledge, no other non-parametric approach can discover causal relations underlying time series without assuming stationarity while also allowing for sudden changes in causal mechanisms. Our proposed approach does not directly enforce the stationarity assumption on the time series. The SCM instead integrates a finite set of causal mechanisms that exhibit periodic variation over time.

Figure 1: Partial causal graph for a 3-variate time series $V = \{X^1, X^2, X^3\}$ with a Semi-Stationary SCM where $\tau_{\max} = 3$, $\omega_1 = 3$, $\omega_2 = 2$, $\omega_3 = 1$, $\Omega = 6$ and $\delta = 6$. The first 3 ($= \tau_{\max}$) time slices $\{X_t\}_{1 \le t \le 3}$ are the starting points. Edges of the same color represent the same causal mechanism. E.g., for $X^1$ there are 3 ($= \omega_1$) time partition subsets $\{\Pi^1_k\}_{1 \le k \le 3}$. The time points $t$ of nodes $X^1_t$ sharing the same filling color are in the same time partition subset. The time points $t$ of nodes $X^1_t$ sharing both the same filling color and the same outline shape are in the same homogenous time partition subset (the definitions are in the supplementary material). There are 6 ($= \delta$) different Markov chains in this multivariate time series $V$, and the first elements of these 6 Markov chains are shown as $\{Z^q_1\}_{1 \le q \le 6}$, tinted with a gradient of blue hues. The superscript $q$ of $Z^q_i$ indexes the Markov chain, whereas the subscript $i$ denotes the running index within that specific chain. For instance, $Z^1_1$ and $Z^1_2$ denote the first two elements of the first Markov chain, while $Z^2_1$ and $Z^2_2$ denote the first two elements of the second Markov chain.

2 PCMCIΩ: Capturing Periodicity of the Causal Structure

In this section, we formulate the problem of learning the causal graph on multivariate time series data when the SCM exhibits periodicity in its causal mechanisms. In Section 2.1, we present the necessary definitions and provide an overview of the problem setting. In Section 2.2, we introduce the required assumptions.

2.1 Preliminaries

Let $G(V, E)$ denote the underlying causal graph, and for each variable $X \in V$, denote the set of all incoming neighbors, the parents, by $Pa(X)$. For any two variables $X, Y \in V$ and $S \subseteq V$, we write the CI relation "$X$ is independent of $Y$ conditioned on $S$" as $X \perp Y \mid S$. For simplicity's sake, define the sets $[b] := \{1, 2, \ldots, b\}$ and $[a, b] := \{a, a+1, \ldots, b\}$, where $a, b \in \mathbb{N}$.
In the time series setting, let $X^j_t \in \mathbb{R}$ denote the variable of the $j$th time series at time $t$, let $X^j = \{X^j_t\}_{t \in [T]} \in \mathbb{R}^T$ denote a univariate time series, and let $X_t = \{X^j_t\}_{j \in [n]} \in \mathbb{R}^n$ denote the slice of all variables at time point $t$. $V = \{X^j\}_{j \in [n]} = \{X_t\}_{t \in [T]} \in \mathbb{R}^{n \times T}$ denotes an $n$-variate time series. By default, we assume $n > 1$, hence $X^j \subset V$, and $p(V) \neq 0$, where $p(\cdot)$ denotes the probability or probability density.

Definition 2.1 (Non-Stationary SCM). A Non-Stationary Structural Causal Model (SCM) is a tuple $M = \langle V, F, E, P \rangle$ where there exists a $\tau_{\max} \in \mathbb{N}^+$, defined as

$\tau_{\max} := \arg\max_{\tau} \{\tau : X^i_{t-\tau} \in Pa(X^j_t), \ \exists\, i, j \in [n]\}$,  (1)

such that with this $\tau_{\max}$, each variable $X^j_{t > \tau_{\max}} \in V$ is a deterministic function of its parent set $Pa(X^j_t) \subset V$ and an unobserved (exogenous) variable $\epsilon^j_t \in E$:

$X^j_t = f_{j,t}(Pa(X^j_t), \epsilon^j_t), \quad \forall j \in [n], \ t \in [\tau_{\max}+1, T]$,  (2)

and there exist at least two different time points $t_0, t_1 \in [\tau_{\max}+1, T]$ satisfying

$f_{j,t_0} \neq f_{j,t_1}, \quad \exists\, j \in [n], \ \{t_0, t_1\} \subset [\tau_{\max}+1, T]$,  (3)

where $f_{j,t}, f_{j,t_0}, f_{j,t_1} \in F$ and the $\{\epsilon^j_t\}_{t \in [T]}$ are jointly independent with probability measure $P$. $\tau_{\max}$ is the finite maximal lag in terms of the causal graph $G$.

Definition 2.2 (Semi-Stationary SCM). A Semi-Stationary SCM is a Non-Stationary SCM that additionally satisfies the following conditions. For each $j \in [n]$, there exists an $\omega \in \mathbb{N}^+$ such that

a) $f_{j,t} = f_{j,t+N\omega}$,  (4)
b) $Pa(X^j_{t+N\omega}) = \{X^i_{s+N\omega} : X^i_s \in Pa(X^j_t), \ i \in [n]\}$,  (5)
c) $\epsilon^j_t, \epsilon^j_{t+N\omega}$ are i.i.d.,  (6)

for all $t \in [\tau_{\max}+1, T]$ and $N \in \{N : N \in \mathbb{N}, \ t + N\omega \le T\}$. This means that a finite number of causal mechanisms are repeated periodically for every univariate time series $X^j$ in $V$. One example of this model is illustrated in Fig. 1. For $X^1$ in Fig. 1, three causal mechanisms recur periodically with $\omega_1 = 3$, represented by red, green, and blue edges, respectively. The minimum value satisfying the above conditions for $X^j$ is defined as the periodicity of $X^j$, denoted by $\omega_j$. Furthermore, for an $n$-variate time series $V$, $\Omega$ denotes the periodicity of $V$ as a whole. The number of causal mechanisms occurring sequentially and periodically in the univariate time series $X^j$ is $\omega_j$, and the corresponding number for the multivariate time series $V$ is $\Omega$. For $X^j$, the causal mechanisms are associated with each variable $X^j_t$; for $V$, the causal mechanisms are associated with each time slice vector $X_t$ as a whole. The relationship between $\Omega$ and the $\omega_j$ is captured by

$\Omega = \mathrm{LCM}(\{\omega_j : j \in [n]\})$,  (7)

where $\mathrm{LCM}(\cdot)$ is the operation that finds the least common multiple of two or more numbers. Here, $\Omega$ is the smallest common multiple among $\{\omega_1, \ldots, \omega_n\}$. In Fig. 1, $\Omega = \mathrm{LCM}(3, 2, 1) = 6$.

Definition 2.3 (Time Partition). A time partition $\Pi^j(T)$ of a univariate time series $X^j$ in a Semi-Stationary SCM with periodicity $\omega_j$ is a way of dividing all time points $t \in [T]$ into a collection of non-overlapping, non-empty subsets $\{\Pi^j_k(T)\}_{k \in [\omega_j]}$ such that

$\Pi^j_k(T) := \{t : \tau_{\max} + 1 \le t \le T, \ (t \bmod \omega_j) + 1 = k\}$,  (8)

where mod denotes the modulo operation; for instance, $5 \bmod 3 = 2$. We can observe that the variables in $\{X^j_t\}_{t \in \Pi^j_k(T)}$ share the same causal mechanism. Since the number of potentially different causal mechanisms of variables in $X^j$ is $\omega_j$, the number of such time partition subsets is $\omega_j$. For simplicity, the notations $\Pi^j$ and $\Pi^j_k$ are used instead of $\Pi^j(T)$ and $\Pi^j_k(T)$. In Fig. 1, $\Pi^1_1 = \{4, 7, 10, 13, \ldots, 4+3N, \ldots\}$, $\Pi^1_2 = \{5, 8, 11, 14, \ldots, 5+3N, \ldots\}$ and $\Pi^1_3 = \{6, 9, 12, 15, \ldots, 6+3N, \ldots\}$, where $N \in \mathbb{N}^+$. The nodes $X^1_t$ are classified into their associated time partition subsets by the matching colors.
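To make the partition bookkeeping concrete, here is a minimal Python sketch (our own illustration, not the paper's released code) of the time partition of Definition 2.3 and the overall periodicity $\Omega$ of Eq. (7); note that the subset labels produced by $(t \bmod \omega_j) + 1$ are only fixed up to a cyclic shift, which is immaterial for the algorithm.

```python
import math
from functools import reduce

def time_partition(T, tau_max, omega_j):
    """Split the time points {tau_max+1, ..., T} into omega_j subsets so
    that points omega_j steps apart share a subset (Definition 2.3)."""
    partition = {k: [] for k in range(1, omega_j + 1)}
    for t in range(tau_max + 1, T + 1):
        k = (t % omega_j) + 1   # subset label, arbitrary up to cyclic shift
        partition[k].append(t)
    return partition

def overall_periodicity(omegas):
    """Omega = LCM of the component periodicities, Eq. (7)."""
    return reduce(math.lcm, omegas)

print(time_partition(15, 3, 3))        # three subsets, each 3 steps apart
print(overall_periodicity([3, 2, 1]))  # 6, matching Fig. 1
```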
Definition 2.4 (Illusory Parent Sets). For a univariate time series $X^j \in V$ with a Semi-Stationary SCM having periodicity $\omega_j > 1$, the parent set index $pInd^j_k$, $k \in [\omega_j]$, is defined as

$pInd^j_k := \{(y_i, \tau_i)\}_{i \in [n']}$, given $Pa(X^j_t) = \{X^{y_1}_{t-\tau_1}, X^{y_2}_{t-\tau_2}, \ldots, X^{y_{n'}}_{t-\tau_{n'}}\}$, $t \in \Pi^j_k$,  (9)

where $n' = |Pa(X^j_t)|$, $\tau_i$ is the time lag and $y_i$ is the variable index. Given $pInd^j_k$, the Illusory Parent Sets are defined as

$Pa_k(X^j_t) = \{X^{y_i}_{t-\tau_i} : (y_i, \tau_i) \in pInd^j_k\}, \quad \forall k \in \{k : t \notin \Pi^j_k\}$.  (10)

Put simply, the illusory parent sets of $X^j_t$ are time-shifted versions of the parent sets of the other variables in $X^j$ that have a different causal mechanism from $X^j_t$. Note that illusory parent sets are constructed specifically for the Semi-Stationary SCM; for a stationary SCM, no illusory parent set is needed. To maintain consistency of notation, for time points $t \in \Pi^j_k$, the notation $Pa_k(X^j_t)$ is extended to encompass the true parent set of $X^j_t$:

$Pa_k(X^j_t) := Pa(X^j_t), \quad t \in \Pi^j_k$.  (11)

By doing so, $Pa(X^j_t) \subseteq \cup_{k \in [\omega_j]} Pa_k(X^j_t)$. In Fig. 1, by observing $Pa(X^1_7)$ we have $pInd^1_1 = \{(1,1), (2,2)\}$; by observing $Pa(X^1_8)$ we have $pInd^1_2 = \{(1,1), (3,1)\}$; and finally, by observing $Pa(X^1_9)$, $pInd^1_3 = \{(1,1), (1,2)\}$. Based on these indices, $Pa_1(X^1_7) = \{X^1_{7-1}, X^2_{7-2}\} = \{X^1_6, X^2_5\}$, $Pa_2(X^1_7) = \{X^1_{7-1}, X^3_{7-1}\} = \{X^1_6, X^3_6\}$ and $Pa_3(X^1_7) = \{X^1_{7-1}, X^1_{7-2}\} = \{X^1_6, X^1_5\}$. The first is the true parent set of $X^1_7$, and the latter two are illusory parent sets. The order of these parent sets is not important.

Lastly, we need to define a series of Markov chains that are tightly associated with the Semi-Stationary SCM. The existence and characteristics of these Markov chains are thoroughly examined in the supplementary material. The motivation for constructing such Markov chains is to place assumptions on them rather than directly on the original data $V$.

Definition 2.5 (Time Series as a Markov Chain). For a time series $V$ with a Semi-Stationary SCM, there are (potentially) $\delta$ different Markov chains $\{Z^q_n\}_{n \in N'}$, $q \in [\delta]$:

$Z^q_n = \{X_{\tau_{\max}+q+(n-1)\delta}, X_{\tau_{\max}+q+1+(n-1)\delta}, \ldots, X_{\tau_{\max}+q-1+n\delta}\}$,

where $N' := \{n \in \mathbb{N}^+ : \tau_{\max}+q-1+n\delta \le T\}$ and $\delta = \lceil (\tau_{\max}+1)/\Omega \rceil\, \Omega$. Note that in $\{Z^q_n\}$, $q$ is used to indicate a specific Markov chain, while $n$ serves as the running index of that particular chain. Such a Markov chain $\{Z^q_n\}$, $q \in [\delta]$, exists as long as $Pa(Z^q_n) \subseteq Z^q_n \cup Z^q_{n-1}$ for all $n$. It is a finite-state Markov chain if all time series in $V$ are discrete-valued. The state space of $\{Z^q_n\}$ is the set containing all possible realizations of $\{X_{\tau_{\max}+q+(i-1)+(n-1)\delta}\}_{i \in [\delta], n \in N'}$. The transition probabilities between the states are products of the associated causal mechanisms, based on Assumption A2, as elaborated with an example in Section C.1 (Eq. (10)-(11)) of the supplementary material.
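The block construction in Definition 2.5 is mechanical enough to spell out in code. The sketch below is our own illustration (with a hypothetical array layout of shape `(n, T)` and 1-indexed time points as in the paper); it collects the elements $Z^q_n$ of the $q$-th chain as consecutive non-overlapping windows of width $\delta$.

```python
import numpy as np

def markov_chain_blocks(X, tau_max, delta, q):
    """Return the elements Z^q_n of the q-th block Markov chain
    (Definition 2.5): windows of width delta starting at time
    tau_max + q, tau_max + q + delta, ... X has shape (n_vars, T)."""
    n_vars, T = X.shape
    blocks = []
    start = tau_max + q            # 1-indexed first time point of Z^q_1
    while start + delta - 1 <= T:
        blocks.append(X[:, start - 1 : start - 1 + delta])  # 0-indexed cols
        start += delta
    return blocks                  # blocks[n-1] corresponds to Z^q_n
```

For the SCM of Fig. 1 ($\tau_{\max} = 3$, $\delta = 6$), `markov_chain_blocks(X, 3, 6, 1)` would yield $Z^1_1 = \{X_4, \ldots, X_9\}$, $Z^1_2 = \{X_{10}, \ldots, X_{15}\}$, and so on.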
2.2 Assumptions for PCMCIΩ

A1. Sufficiency: There are no unobserved confounders.

A2. Causal Markov Condition: Each variable $X$ is independent of all its non-descendants, given its parents $Pa(X)$ in $G$.

A3. Faithfulness Condition (Pearl [1980]): Let $P$ be a probability distribution generated by $G$. $\langle G, P \rangle$ satisfies the Faithfulness Condition if and only if every conditional independence relation true in $P$ is entailed by the Causal Markov Condition applied to $G$.

A4. No Contemporaneous Causal Effects: Edges between variables at the same time are not allowed.

A5. Temporal Priority: Causal relations that point from the future to the past are not permitted.

A6. Hard Mechanism Change: If at time points $t_1$ and $t_2$ the causal mechanisms of $X^j_{t_1}$ and $X^j_{t_2}$ are different, then their corresponding parent sets cannot be transformed into each other by a time shift:

$f_{j,t_1} \neq f_{j,t_2} \implies Pa(X^j_{t_2}) \neq \{X^i_{s+(t_2-t_1)} : X^i_s \in Pa(X^j_{t_1}), \ i \in [n]\}$.

A7. Irreducible and Aperiodic Markov Chain: The Markov chains $\{Z_n\}$ of $V$ are assumed to be irreducible (Serfozo [2009]): for all states $i$ and $j$ of $\{Z_n\}$, there exists $n$ such that

$p^{(n)}_{ij} := p(Z_{n+1} = j \mid Z_1 = i) > 0$,  (12)

and aperiodic (Karlin [2014]): for every state $i$ of $\{Z_n\}$, $d(i) = 1$, where the period $d(i)$ of state $i$ is the greatest common divisor of all integers $n$ for which $p^{(n)}_{ii} > 0$.

Assumptions A1-A5 are conventional and commonly employed in causal discovery methods for time series data. Our approach additionally requires Assumptions A6-A7. To clarify, A6 is essential because our method may fail to distinguish distinct causal mechanisms for variables in $\{X^j_t\}_{t \in [T]}$ if they share identical parent sets after time shifts. A7 plays a crucial role in establishing the soundness of our algorithm.

3 PCMCIΩ Algorithm

In this section, we propose an algorithm called PCMCIΩ, and in Section 3.1 we present a theorem demonstrating the soundness of PCMCIΩ and its ability to recover the causal graph. Our algorithm builds on the PCMCI algorithm of Runge et al. [2019]; additional details about PCMCI are provided in the supplementary material.

Algorithm 1 PCMCIΩ
1: Input: An $n$-variate time series $V = (X^1, X^2, \ldots, X^n)$, periodicity upper bound $\omega_{ub}$, time lag upper bound $\tau_{ub}$. By default, we assume $\tau_{ub}$ and $\omega_{ub}$ are larger than their true values.
2: Obtain a superset of each parent set using PCMCI with $\tau_{ub}$; denote it by $\widehat{SPa}(X^j_t)$, $\forall j, t$.
3: for $X^j$ where $j \in [n]$ do
4:   for a guess $\omega \in [\omega_{ub}]$ of $\omega_j$ do
5:     Let $\hat\Pi^j := \{\hat\Pi^j_k \mid k \in [\omega]\}$ where $\hat\Pi^j_k = \{2\tau_{ub}+k, \ 2\tau_{ub}+\omega+k, \ 2\tau_{ub}+2\omega+k, \ldots\}$.
6:     for $k \in [\omega]$ do
7:       Initialize the parent set for $X^j_t$, $t \in \{t : t \ge 2\tau_{ub}, \ t \in \hat\Pi^j_k\}$ (with guess $\omega$), denoted by $\widehat{Pa}_\omega(X^j_t) \leftarrow \widehat{SPa}(X^j_t)$.
8:       Consider $X^i_{t-\tau} \in \widehat{Pa}_\omega(X^j_t)$. Remove $X^i_{t-\tau}$ from $\widehat{Pa}_\omega(X^j_t)$ if $X^i_{t-\tau} \perp X^j_t \mid \widehat{SPa}(X^j_t) \cup \widehat{SPa}(X^i_{t-\tau}) \setminus \{X^i_{t-\tau}\}$, using a CI test with samples $t \in \{t : t \ge 2\tau_{ub}, \ t \in \hat\Pi^j_k\}$.
9:       Store $\widehat{Pa}_\omega(X^j_t)$ for $X^j_t$, $t \in \{t : t \ge 2\tau_{ub}, \ t \in \hat\Pi^j_k\}$.
10:      end for
11:    end for
12:    $\hat\omega_j \leftarrow \arg\min_{\omega \in [\omega_{ub}]} \max_{k \in [\omega]} |\widehat{Pa}_\omega(X^j_{t \in \hat\Pi^j_k})|$.
13:    Set $\widehat{Pa}(X^j_t) \leftarrow \widehat{Pa}_{\hat\omega_j}(X^j_t)$ for $X^j_t$, $t \in \{t : t \ge 2\tau_{ub}\}$.
14: end for
15: return $\hat\omega_j$ and $\widehat{Pa}(X^j_t)$, $\forall j \in [n]$, $t \ge 2\tau_{ub}$.

Overview of Algorithm 1 (PCMCIΩ). We assume that the periodicity and the time lag are upper bounded by $\omega_{ub}$ and $\tau_{ub}$, respectively. Using PCMCI (Runge et al. [2019]), we obtain a superset of parents for every variable $X^j_t$, denoted by $\widehat{SPa}(X^j_t)$ (line 2). Our goal is to identify the correct set of parents along with its periodicity for every variable in $V$. For a variable $X^j_t$, we guess its periodicity $\omega$ by iterating over all possible values in $[\omega_{ub}]$. Next, we construct time partition subsets $\hat\Pi^j_k$, $k \in [\omega]$, based on the guessed periodicity $\omega$. In each time partition subset, we maintain a parent set, denoted by $\widehat{Pa}_\omega(X^j_t)$, initializing it with the superset $\widehat{SPa}(X^j_t)$. Then we test the causal relations between $X^i_{t-\tau} \in \widehat{Pa}_\omega(X^j_t)$ and $X^j_t$ using a CI test on the samples $t \in \hat\Pi^j_k$ (lines 6-10). For each guess $\omega$, every variable in $X^j$ obtains an estimated parent set (line 9), and there are in total $\omega$ potentially different parent set indices $pInd^j_k$, $k \in [\omega]$, for $X^j$. We return an estimate $\hat\omega_j$ that maximizes the sparsity of the causal graph (Lemma 3.4).
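For intuition, here is a heavily simplified Python rendering of the main loop of Algorithm 1. It is a sketch under our own conventions, not the authors' implementation: `spa[j]` stands in for the PCMCI-estimated superset of lagged parents of $X^j$ (pairs `(i, tau)`), `ci_test` for any CI test that is consistent on i.i.d. samples, and the lag bookkeeping for the conditioning set is deliberately simplified.

```python
def pcmci_omega(n, T, spa, ci_test, tau_ub, omega_ub):
    """Sketch of Algorithm 1 (lines 3-13). `ci_test(j, i, tau, cond, times)`
    should return True when X^i_{t-tau} is judged independent of X^j_t
    given `cond`, using only the samples at the listed time points."""
    omega_hat, parents = {}, {}
    for j in range(n):
        pa_by_omega, worst_size = {}, {}
        for omega in range(1, omega_ub + 1):
            pa_k = []
            for k in range(1, omega + 1):
                # guessed time partition subset (line 5)
                times = list(range(2 * tau_ub + k, T + 1, omega))
                pa = set(spa[j])                      # line 7
                for (i, tau) in list(pa):             # line 8
                    # conditioning set: both supersets minus the candidate
                    # (lag shifts for spa[i] are glossed over here)
                    cond = (set(spa[j]) | set(spa[i])) - {(i, tau)}
                    if ci_test(j, i, tau, cond, times):
                        pa.discard((i, tau))
                pa_k.append(pa)                       # line 9
            pa_by_omega[omega] = pa_k
            worst_size[omega] = max(len(p) for p in pa_k)
        # line 12: the sparsest worst-case parent set wins; ties go to
        # the smallest omega, so the true periodicity beats its multiples
        omega_hat[j] = min(worst_size, key=worst_size.get)
        parents[j] = pa_by_omega[omega_hat[j]]        # line 13
    return omega_hat, parents
```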
Therefore, we select the value of $\omega \in [\omega_{ub}]$ that minimizes the maximum value of $|\widehat{Pa}_\omega(X^j_t)|$, $t \in [T]$, as the estimator of $\omega_j$ (line 12).

3.1 Theoretical Guarantees

Our main theorem shows that PCMCIΩ recovers the true causal graph on discrete data. There are three important lemmas; we provide all the detailed proofs in the supplementary material.

Theorem 3.1. Let $\hat G$ be the estimated graph using the algorithm PCMCIΩ. Under Assumptions A1-A7 and with an oracle (infinite sample size limit), we have that

$\hat G = G$  (13)

almost surely.

Lemma 3.2 and Lemma 3.3 jointly state that if CI tests are conducted on samples generated by different causal mechanisms, the obtained parent sets $\widehat{SPa}(X^j_t)$ should be a superset of the union of the true and illusory parent sets $\cup_k Pa_k(X^j_t)$; that is, the estimated graph should be denser than the correct graph. The true parent set can then be obtained by directly testing the independence relations between the target variable $X^j_t$ and the variables in $\widehat{SPa}(X^j_t)$, assuming a consistent CI test. Note that the CI tests in our algorithm are assumed to be consistent given i.i.d. samples; we do not assume the consistency of CI tests with respect to semi-stationary data. Therefore, any CI test that maintains consistency on i.i.d. samples can be seamlessly integrated into our algorithm.

Lemma 3.2. Let $\{Pa_k(X^j_t)\}_{k \in [\omega_j]}$ contain the true and illusory parent sets, where $\omega_j$ is the true periodicity of $X^j$. For any random variable $X^j_t$ with large enough $t$, under Assumptions A1-A7 and with an oracle (infinite sample size limit), we have:

$p\big(p(X^j_t \mid \cup_{k=1}^{\omega_j} Pa_k(X^j_t)) \neq p(X^j_t \mid \cup_{k=1}^{\omega_j} Pa_k(X^j_t) \setminus y)\big) = 1, \quad \forall y \in \cup_{k=1}^{\omega_j} Pa_k(X^j_t)$.  (14)

Here, $p(X^j_t \mid \cup_{k=1}^{\omega_j} Pa_k(X^j_t)) = \lim_{T \to \infty} \hat p(X^j_t \mid \cup_{k=1}^{\omega_j} Pa_k(X^j_t))$, where $\hat p(X^j_t \mid \cup_{k=1}^{\omega_j} Pa_k(X^j_t))$ is an estimated conditional distribution using all samples $t \in [\tau_{\max}+1, T]$:

$\hat p(X^j_t \mid \cup_{k=1}^{\omega_j} Pa_k(X^j_t)) = \frac{\sum_t \mathbb{1}(X^j_t, \cup_{k=1}^{\omega_j} Pa_k(X^j_t))}{\sum_t \mathbb{1}(\cup_{k=1}^{\omega_j} Pa_k(X^j_t))}$.  (15)

Proof sketch. We argue that the estimated conditional distribution in Eq. (15) can be written as a linear combination of $\hat p(X^j_t \mid Pa(X^j_t))$, $t \in \Pi^j_k$, $k \in [\omega_j]$, i.e., as a mixture of conditional distributions. The coefficients of the linear combination, say $\alpha_k$, $k \in [\omega_j]$, can be further decomposed based on a finer time partition called the homogenous time partition, which consists of subsets constructed according to the Markov chains $\{Z^q_n\}_{q \in [\delta]}$ corresponding to the time series. Based on Assumption A7, the Markov chains are stationary and ergodic. Therefore, after sufficiently many time steps, the distribution of $\{Z^q_n\}_{q \in [\delta]}$ becomes invariant across $n$ as it achieves its unique equilibrium. With this type of stationary sample, we can express the $\alpha_k$ by joint distributions instead of the indicators. Then, we can complete the proof of the inequality claim in Eq. (14) using Assumption A2 and Bayes' theorem.

Lemma 3.3. Let $\widehat{SPa}(X^j_t)$ denote the estimated superset of the parent set for $X^j \in V$ obtained from Algorithm 1 (line 2), and let $\{Pa_k(X^j_t)\}_{k \in [\omega_j]}$ contain the true and illusory parent sets, where $\omega_j$ is the true periodicity of $X^j$. Under Assumptions A1-A7 and with an oracle (infinite sample size limit), we have that

$\cup_{k=1}^{\omega_j} Pa_k(X^j_t) \subseteq \widehat{SPa}(X^j_t), \quad \forall t \in [\tau_{\max}+1, T]$

almost surely.

Proof sketch. Assume the contrary, i.e., there exists $s \in \cup_k Pa_k(X^j_t) \setminus \widehat{SPa}(X^j_t)$. From Lemma 3.2, we have $X^j_t \not\perp s \mid \cup_{k=1}^{\omega_j} Pa_k(X^j_t) \setminus s$. By Definition 2.4, we have $Pa(X^j_t) \subseteq \cup_{k=1}^{\omega_j} Pa_k(X^j_t)$.
If s Pa(Xj t ), by the causal Markov property (Assumption A2), the dependence relation can not be true because s is a non-descendant of Xj t . If s Pa(Xj t ), our Algorithm would have concluded that Xj t s d SPa(Xj t ) (line 2), evident from the causal Markov property, contradicting our assumption. Hence, the lemma. Based on Lemma 3.2 and Lemma 3.3, we can identify the true ωj for Xj through Lemma 3.4. Lemma 3.4. Let ωj denote the true periodicity for Xj V and c Paω(Xj t Πj k) denote the estimated parent set for Xj t obtained from Algorithm 1 line 9, where t Πj k. Define: bωj = arg min ω [ωub] max k [ω] |c Paω(Xj t Πj k)| (16) Under assumptions A1-A7 and with an oracle (infinite sample limit), we have that ˆωj = ωj, j [n] almost surely. Proof sketch. Assume the contrary that ˆωj = ωj, then in the Algorithm 1, we have an incorrect time partition bΠj. Hence, CI tests that are performed use samples with different causal mechanisms. 1 2 3 4 5 6 7 8 9 10 Accuracy of (a) Accuracy of ˆω 1 2 3 4 5 6 7 8 9 10 Time (seconds) (b) Runtime (in sec.) when ωmax varies Figure 2: PCMCIΩis tested on 5-variate time series with τmax = 5. Set τub = 15, ωub = 15 for all variables. Every line corresponds to a different time series length. Every marker corresponds to the average accuracy rate or average running time over 100 trials. a) The accuracy rate of ˆω for different time series lengths and different ωmax. b) Illustration of Runtime (in sec.) when ωmax varies. ˆp(Xj t | ωj k=1 Pak(Xj t )) in Eq.(15) is estimated from a mixture of two or more time partition subsets, say Πj 1 and Πj 2. We can apply Lemma 3.2 with 2 k=1Pak(Xj t ). With Lemma 3.3, 2 k=1Pak(Xj t ) c Paˆωj(Xj t ). Hence, for ˆωj, |c Paˆωj(Xj t )| | 2 k=1 Pak(Xj t )| using mixture samples t 2 k=1Πj k. For true ωj, we have |c Paωj(Xj t )| = |Pa(Xj t )|. With Assumption A6 the Hard Mechanism Change, | 2 k=1 Pak(Xj t )| > |Pa(Xj t )| so that ωj always leads to a smaller size of estimated parent sets than ˆωj, contrary to the definition of ˆωj. Hence, ˆωj = ωj. 4 Experiments 4.1 Experiments on Continuous-valued Time Series To validate the correctness and effectiveness of our algorithm, we perform a series of experiments. The Python code is provided at https://github.com/Causal ML-Lab/PCMCI-Omega. In this section, we test four algorithms1, PCMCIΩ, PCMCI Runge et al. [2019], VARLi NGAM Hyvärinen et al. [2010] and DYNOTEARS Pamfil et al. [2020], on continuous-valued time series with Gaussian noise. The experiments for continuous-valued time series with exponential noise and binary-valued time series are in the supplementary material. Following Runge et al. [2019], we generate the continuous-valued time series in three steps: 1. Construct an n-variate time series V with length T using independent and identical (Standard Gaussian or Exponential) noise temporarily. Determine τmax and ωmax where ωmax = max{ωj}j {n}. After making sure that one univariate time series, say Xj, has periodicity ωmax, the periodicity of the remaining time series Xi, i = j is randomly selected from {1, , ωmax} respectively. 2. Randomly generate ωj binary edge matrices with shape (n, τmax) for each time series Xj, j [n]. 1 denotes an edge and 0 denotes no edge. Each binary matrix represent one parent set index p Indj k, k [ωj]. Randomly generate ωj coefficient matrices with shape (n, τmax) for each time series Xj, j [n]. One binary edge matrix and one coefficient matrix jointly determine one causal mechanism. Hence, total ωj causal mechanisms are constructed. 
Here, we make sure that $V$ satisfies Assumption A6.

3. Starting from time point $t > \tau_{\max}$, generate the vector $X_t$ over time according to all the causal mechanisms of $V$, until $t$ reaches $T$. (A minimal generator following these three steps is sketched at the end of this subsection.)

¹ We did not conduct experiments on JIT-LiNGAM because it comes from a very recent paper (Fujiwara et al. [2023]) and is considered concurrent work per NeurIPS policy.

Figure 3: Four algorithms are tested on 5-variate time series; $\tau_{ub} = 15$ and $\omega_{ub} = 15$ for all variables. Every line corresponds to a different algorithm; every marker corresponds to the average performance over 100 trials. (a) Gaussian noise ($n = 5$, $\tau_{\max} = 5$). (b) Performance when $\tau_{\max}$ or $n$ varies ($T = 2000$).

Following the previous work of Huang et al. [2020], the F1 score, adjacency precision, and adjacency recall are used to measure the performance of the algorithms. The details of these metrics are described in the Appendix. All performance statistics are averaged over 100 trials, and the standard error of each averaged statistic is displayed either by color filling or by error bars.

A correct estimator $\hat\omega$ is a prerequisite for obtaining the correct causal graph. Fig. 2(a) shows the accuracy rate of $\hat\omega$ for different time lengths $T$. Here, the elements of $\{N\omega_j\}$, where $N \in [\lfloor \omega_{ub}/\omega_j \rfloor]$, are all treated as correct estimates: by Definitions 2.2 and 2.3, a multiple of $\omega_j$ is still associated with a correct causal graph. However, it leads to a finer time partition $\Pi^j$, decreasing the sample size used in each CI test from approximately $T/\omega_j$ to approximately $T/(N\omega_j)$. The accuracy rate is sensitive to $\omega_{\max}$ for small $T$. This result verifies that PCMCIΩ has the capacity to detect the true periodicity of each $X^j \in V$ given a large enough time length.

We evaluate the performance of PCMCIΩ on continuous-valued time series with Gaussian noise, shown in Fig. 3(a). As $T$ increases, it is natural to see a continuous improvement in performance. The sub-figures show that all three evaluation metrics decrease as $\omega_{\max}$ gets larger. The precision of PCMCIΩ is always far better than that of the other algorithms when $\omega_{\max}$ is not equal to 1. Given that the parent sets $\widehat{Pa}(X^j_t)$ obtained from PCMCIΩ are subsets of the parent supersets $\widehat{SPa}(X^j_t)$ estimated by PCMCI, the recall rate of PCMCI should be an upper bound on the recall rate of PCMCIΩ. This assertion is verified: the red recall line of PCMCIΩ stays below the blue recall line of PCMCI as $T$ increases. In Fig. 3(a), the recall of PCMCIΩ is worse than that of PCMCI for $T = 500$. In this regime, the accuracy rate of $\hat\omega$ is low, shown as the dark blue line in Fig. 2(a). Small sample sizes in CI tests may result in a sparser causal graph, so the number of true positive edges may decrease. This is a common problem for many constraint-based algorithms, but it hurts PCMCIΩ the most because in PCMCIΩ the sample size in each CI test is approximately $T/\hat\omega$ instead of $T$. As $T$ increases, the recall of PCMCIΩ pushes toward that of PCMCI. The high adjacency precision and recall at large $T$ verify that PCMCIΩ can identify the correct causal graph. We also examine the performance of our algorithm as $\tau_{\max}$ and $n$ vary in Fig. 3(b): while the performance of PCMCIΩ is consistent across $n$-variate time series with different $n$, a large $\tau_{\max}$ may lead to lower precision and recall.
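To make the simulation setup concrete, here is a minimal linear-Gaussian generator following the three steps above. It is our own sketch rather than the released experiment code: the coefficient range is arbitrary, stability is not controlled, and Assumption A6 is not explicitly enforced.

```python
import numpy as np
rng = np.random.default_rng(0)

def generate_semistationary(n, T, tau_max, omegas, p_edge=0.3):
    """Steps 1-3: omegas[j] is the periodicity of X^j; mechanism k of X^j
    is a binary edge mask plus a coefficient matrix, both (n, tau_max)."""
    mech = [[(rng.random((n, tau_max)) < p_edge,            # step 2: edges
              rng.uniform(-0.5, 0.5, (n, tau_max)))         # step 2: coefs
             for _ in range(omegas[j])] for j in range(n)]
    X = rng.standard_normal((n, T))                         # step 1: noise
    for t in range(tau_max, T):                             # step 3
        for j in range(n):
            edges, coef = mech[j][(t - tau_max) % omegas[j]]
            past = X[:, t - tau_max:t][:, ::-1]             # lags 1..tau_max
            X[j, t] = np.sum(edges * coef * past) + rng.standard_normal()
    return X

V = generate_semistationary(n=3, T=500, tau_max=2, omegas=[3, 2, 1])
```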
4.2 Case Study

Here, we construct an experiment with a real-world climate time series dataset. In Runge et al. [2019], the authors tested dependencies among monthly surface pressure anomalies in the West Pacific and surface air temperature anomalies in the Central Pacific, East Pacific, and tropical Atlantic from 1948 to 2012. Our application explores the causal relations among the monthly means of the same set of variables from 1948 to 2022, spanning 900 months. Let $X^{wp}_t$ denote the monthly mean surface pressure in the West Pacific, and let $X^{cp}_t$, $X^{ep}_t$ and $X^{ta}_t$ denote the monthly mean air temperature in the Central Pacific, East Pacific, and tropical Atlantic, respectively. The parent sets for each variable obtained from the PCMCIΩ algorithm are shown in Table 1; the true and illusory parent sets of a variable at time $t$ are each enclosed in curly braces and separated by semicolons. For instance, variable $X^{wp}_t$ with $\hat\omega_{wp} = 1$ means that the causal mechanism of the surface pressure in the West Pacific remains invariant over time, with the estimated parent set $\{X^{wp}_{t-1}, X^{wp}_{t-2}, X^{ep}_{t-1}, X^{ta}_{t-1}\}$. Only the time series $X^{cp}$ has three different parent sets, comprising one true parent set and two illusory parent sets, which appear periodically over time. The three parent sets of $X^{cp}_t$ imply that the causal effect of the tropical Atlantic air temperature $X^{ta}_{t-1}$ on the Central Pacific air temperature $X^{cp}_t$ would disappear every quarter of a year. Note that we do not have ground truth in this case, and we do not possess the necessary domain knowledge, so the significance of these results remains under-explored. More discussion of this application can be found in the supplementary material.

Table 1: Climate application results estimated by PCMCIΩ.

| $X$ | $\hat\omega$ | $\{\widehat{Pa}_k\}_{k \in [\hat\omega]}$: true and illusory parent sets |
| --- | --- | --- |
| $X^{wp}_t$ | 1 | $\{X^{wp}_{t-1}, X^{wp}_{t-2}, X^{ep}_{t-1}, X^{ta}_{t-1}\}$ |
| $X^{cp}_t$ | 3 | $\{X^{cp}_{t-1}\}$; $\{X^{cp}_{t-1}, X^{cp}_{t-2}, X^{ta}_{t-1}\}$; $\{X^{cp}_{t-1}, X^{ta}_{t-1}\}$ |
| $X^{ep}_t$ | 1 | $\{X^{ep}_{t-1}, X^{ta}_{t-1}, X^{ta}_{t-2}, X^{cp}_{t-1}\}$ |
| $X^{ta}_t$ | 1 | $\{X^{ta}_{t-1}, X^{wp}_{t-1}\}$ |

5 Conclusions

In this paper, we propose a non-parametric, constraint-based causal discovery algorithm, PCMCIΩ, designed for semi-stationary time series data, in which a finite number of causal mechanisms are repeated periodically. We establish the soundness of our algorithm and assess its effectiveness on continuous-valued and discrete-valued time series data. PCMCIΩ has the capacity to reveal the existence of periodicity in the causal mechanisms of real-world datasets.

6 Acknowledgements

This research has been supported in part by NSF CAREER 2239375 and Adobe Research. We wish to convey our heartfelt gratitude to the anonymous reviewers for their invaluable and constructive feedback, which significantly contributed to enhancing the quality of the manuscript.

References

Judea Pearl. Causality: models, reasoning, and inference, 1980.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.

Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT Press, 2000.

Wicher Pieter Bergsma. Testing conditional independence for continuous random variables. Citeseer, 2004.

Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.

Rajen D Shah and Jonas Peters.
The hardness of conditional independence testing and the generalised covariance measure. 2020.

Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019.

Andreas Gerhardus and Jakob Runge. High-recall causal discovery for autocorrelated time series with latent confounders. Advances in Neural Information Processing Systems, 33:12615–12625, 2020.

Tianjiao Chu and Clark Glymour. Search for additive nonlinear time series causal models. Journal of Machine Learning Research, 9(5), 2008.

Doris Entner and Patrik O Hoyer. On causal discovery from time series data using FCI. Probabilistic Graphical Models, pages 121–128, 2010.

Daniel Malinsky and Peter Spirtes. Causal structure learning from multivariate time series in settings with unmeasured confounding. In Proceedings of the 2018 ACM SIGKDD Workshop on Causal Discovery, pages 23–47. PMLR, 2018.

Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. Root cause analysis of failures in microservices through causal discovery. Advances in Neural Information Processing Systems, 35:31158–31170, 2022.

Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O Hoyer. Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11(5), 2010.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Causal inference on time series using restricted structural equation models. Advances in Neural Information Processing Systems, 26, 2013.

Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam. DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR, 2020.

Charles K Assaad, Emilie Devijver, and Eric Gaussier. Survey and evaluation of causal discovery methods for time series. Journal of Artificial Intelligence Research, 73:767–819, 2022.

Fang Han, Shyam Subramanian, Edwin R Price, Joseph Nadeau, and Kingman P Strohl. Periodic breathing in the mouse. Journal of Applied Physiology, 92(3):1133–1140, 2002.

Akira Nakamura, Yasuichiro Fukuda, and Tomoyuki Kuwaki. Sleep apnea and effect of chemostimulation on breathing instability in mice. Journal of Applied Physiology, 94(2):525–532, 2003.

Mary A Carskadon, William C Dement, et al. Normal human sleep: an overview. Principles and Practice of Sleep Medicine, 4(1):13–23, 2005.

Sandra Komarzynski, Qi Huang, Pasquale F Innominato, Monique Maurice, Alexandre Arbaud, Jacques Beau, Mohamed Bouchahda, Ayhan Ulusakarya, Nicolas Beaumatin, Gabrièle Breda, et al. Relevance of a mobile internet platform for capturing inter- and intrasubject variabilities in circadian coordination during daily routine: pilot study. Journal of Medical Internet Research, 20(6):e204, 2018.

Mingming Gong, Kun Zhang, Bernhard Schoelkopf, Dacheng Tao, and Philipp Geiger. Discovering temporal causal relations from subsampled data. In International Conference on Machine Learning, pages 1898–1906. PMLR, 2015.

Biwei Huang, Kun Zhang, Mingming Gong, and Clark Glymour. Causal discovery and forecasting in nonstationary environments with state-space models. In International Conference on Machine Learning, pages 2901–2910. PMLR, 2019.
Christopher Krich, Jakob Runge, Diego G Miralles, Mirco Migliavacca, Oscar Perez-Priego, Tarek El-Madany, Arnaud Carrara, and Miguel D Mahecha. Estimating causal networks in biosphere-atmosphere interaction with the PCMCI approach. Biogeosciences, 17(4):1033–1061, 2020.

Yuquan Qu, Carsten Montzka, and Harry Vereecken. Causation discovery of weather and vegetation condition on global wildfire using the PCMCI approach. In 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, pages 8644–8647. IEEE, 2021.

Liangfeng Zou, Yuanyuan Zha, Yuqing Diao, Chi Tang, Wenquan Gu, and Dongguo Shao. Coupling the causal inference and informer networks for short-term forecasting in irrigation water usage. Water Resources Management, pages 1–23, 2022.

Giovanni Menegozzo, Diego Dall'Alba, and Paolo Fiorini. Causal interaction modeling on ultra-processed food manufacturing. In 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), pages 200–205. IEEE, 2020.

Sayeh Peterson. Comparison of Lasso Granger and PCMCI for Causal Feature Selection in Multivariate Time Series. PhD thesis, The University of Arizona, 2022.

DK Arvind, Sharan Maiya, and P Andreu Sedeño. Identifying causal relationships in time-series data from a pair of wearable sensors. In 2021 IEEE 17th International Conference on Wearable and Implantable Body Sensor Networks (BSN), pages 1–4. IEEE, 2021.

Luca Castri, Sariah Mghames, Marc Hanheide, and Nicola Bellotto. Causal discovery of dynamic models for predicting human spatial interactions. In Social Robotics: 14th International Conference, ICSR 2022, Florence, Italy, December 13–16, 2022, Proceedings, Part I, pages 154–164. Springer, 2023a.

Luca Castri, Sariah Mghames, Marc Hanheide, Nicola Bellotto, et al. Enhancing causal discovery from robot sensor data in dynamic scenarios. 2023b.

Daniel Malinsky and Peter Spirtes. Learning the structure of a nonstationary vector autoregression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2986–2994. PMLR, 2019.

Biwei Huang, Kun Zhang, Jiji Zhang, Joseph D Ramsey, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Schölkopf. Causal discovery from heterogeneous/nonstationary data. Journal of Machine Learning Research, 21(89):1–53, 2020.

Daigo Fujiwara, Kazuki Koyama, Keisuke Kiritoshi, Tomomi Okawachi, Tomonori Izumitani, and Shohei Shimizu. Causal discovery for non-stationary non-linear time series data using just-in-time modeling. In 2nd Conference on Causal Learning and Reasoning, 2023.

Richard Serfozo. Basics of applied stochastic processes. Springer Science & Business Media, 2009.

Samuel Karlin. A first course in stochastic processes. Academic Press, 2014.

A PCMCI Algorithm

The PCMCI algorithm was proposed by Runge et al. [2019] to detect time-lagged causal relations in a window causal graph. PCMCI has two stages: the condition-selection stage and the causal discovery stage. In the first stage, unnecessary edges are removed, based on conditional independencies, from an initialized partially connected graph in which Assumptions A4-A5 are satisfied. In the second stage, momentary conditional independence (MCI) tests are used to further remove false positive edges caused by autocorrelation in the time series data. More specifically, these two steps can be briefly formalized as follows:

PC1 in Algorithm A1: condition-selection stage. PC1 is a variant of the skeleton-discovery part of the PC algorithm, in a more robust version named stable-PC (Le et al. [2016]).
The goal in this stage is to obtain a superset of the parents $\widehat{Pa}(X^j_t)$ for all variables $X^j_t \in V$, $j \in [n]$, $t \in [\tau_{\max}+1, T]$. Initialize $\widehat{Pa}(X^j_t) = \{X^i_{t-\tau}\}_{i \in [n], \tau \in [\tau_{\max}]}$. $X^i_{t-\tau}$ is removed from $\widehat{Pa}(X^j_t)$ if

$X^i_{t-\tau} \perp X^j_t \mid \widehat{Pa}(X^j_t) \setminus \{X^i_{t-\tau}\}$.  (1)

MCI in Algorithm A2: causal discovery stage. In this stage, MCI tests are performed for all variable pairs $(X^i_{t-\tau}, X^j_t)$ with $i, j \in [n]$ and time delays $\tau \in [\tau_{\max}]$:

$\mathrm{MCI}(X^i_{t-\tau}, X^j_t \mid \widehat{Pa}(X^j_t) \setminus \{X^i_{t-\tau}\}, \widehat{Pa}(X^i_{t-\tau}))$,  (2)

where $\widehat{Pa}(X^j_t)$ and $\widehat{Pa}(X^i_{t-\tau})$ are estimated in the PC1 stage.

Note that $\tau_{\max}$ in this section is the same as $\tau_{ub}$ in the main paper, serving as the upper bound on the time lag that can exhibit causal effects. In contrast, $\tau_{\max}$ in the main paper denotes the maximum time lag observed within the multivariate time series. Essentially, in the main paper, $\tau_{ub}$ is a parameter that must be fed to the algorithm, while $\tau_{\max}$ is observed from the true causal graph. As a default, we assume $\tau_{ub}$ is configured with a value greater than $\tau_{\max}$, ensuring that the algorithm uncovers the correct causal relations. See Fig. 1 for more detail.

Figure 1: With $\tau_{ub}$ set to 5, all parent candidates of variables at $t = 15$ are included in the large orange box, ranging from $t = 10$ to $t = 14$. Consequently, the algorithm only examines causal effects with a time lag not exceeding 5. In the causal graph, $\tau_{\max}$ is 3, representing the maximum time lag observed among the 3-variate time series. Specifically, the maximum time lags of the component time series are $\tau_1 = 2$, $\tau_2 = 3$, $\tau_3 = 1$, respectively, and $\tau_{\max}$ is the largest of these three values.

Algorithm A1 PC1
1: Input: An $n$-variate time series $V = (X^1, X^2, \ldots, X^n)$, target time series $X^j$, maximum time lag $\tau_{\max}$, significance threshold $\alpha_{PC}$, maximum condition dimension $p_{\max}$ (default $p_{\max} = n\tau_{\max}$), maximum number of combinations $q_{\max}$ (default $q_{\max} = 1$), conditional independence test function CI
2: function CI($X$, $Y$, $Z$)
3:   Test $X \perp Y \mid Z$ using test statistic measure $I$
4:   return $p$-value, test statistic value $I$
5: Initialize the preliminary set of parents $\widehat{Pa}(X^j_t) = \{X^i_{t-\tau} : i \in [n], \tau \in [\tau_{\max}]\}$
6: Initialize the dictionary of test statistic values $I_{\min}(X^i_{t-\tau} \to X^j_t) = \infty$ for all $X^i_{t-\tau} \in \widehat{Pa}(X^j_t)$
7: for $p = 0, 1, 2, \ldots, p_{\max}$ do
8:   if $|\widehat{Pa}(X^j_t)| - 1 < p$ then
9:     Break for-loop
10:  end if
11:  for all $X^i_{t-\tau}$ in $\widehat{Pa}(X^j_t)$ do
12:    $q = 0$
13:    for all lexicographically chosen subsets $S \subseteq \widehat{Pa}(X^j_t) \setminus \{X^i_{t-\tau}\}$ with $|S| = p$ do
14:      $q = q + 1$
15:      if $q > q_{\max}$ then
16:        Break from inner for-loop
17:      end if
18:      Run CI test to obtain ($p$-value, $I$) $\leftarrow$ CI($X^i_{t-\tau}$, $X^j_t$, $S$)
19:      if $|I| < I_{\min}(X^i_{t-\tau} \to X^j_t)$ then   ▷ Store min. $|I|$ of parent among all tests
20:        $I_{\min}(X^i_{t-\tau} \to X^j_t) = |I|$
21:      end if
22:      if $p$-value $> \alpha_{PC}$ then   ▷ Removed only after all $X^i_{t-\tau}$ have been tested
23:        Mark $X^i_{t-\tau}$ for removal from $\widehat{Pa}(X^j_t)$
24:        Break from inner for-loop
25:      end if
26:    end for
27:  end for
28:  Remove non-significant parents from $\widehat{Pa}(X^j_t)$
29:  Sort parents in $\widehat{Pa}(X^j_t)$ by $I_{\min}(X^i_{t-\tau} \to X^j_t)$ from largest to smallest
30: end for
31: return $\widehat{Pa}(X^j_t)$

Algorithm A2 MCI
1: Input: An $n$-variate time series $V = (X^1, X^2, \ldots, X^n)$, sorted parents $\widehat{Pa}(X^j_t)$ for all variables $X^j$ estimated with Algorithm A1, maximum time lag $\tau_{\max}$, maximum number $p_X$ of parents per variable $X^i$, and conditional independence test function CI
2: for all $(X^i_{t-\tau}, X^j_t)$ with $i, j \in [n]$, $\tau \in [0, \tau_{\max}]$, excluding $(X^j_t, X^j_t)$ do
3:   Remove $X^i_{t-\tau}$ from $\widehat{Pa}(X^j_t)$ if necessary
4:   Define $\widehat{Pa}_{p_X}(X^i_{t-\tau})$ as the first $p_X$ parents from $\widehat{Pa}(X^i_t)$, shifted by $\tau$
5:   Run MCI test to obtain ($p$-value, $I$) $\leftarrow$ CI($X^i_{t-\tau}$, $X^j_t$, $Z = \{\widehat{Pa}(X^j_t), \widehat{Pa}_{p_X}(X^i_{t-\tau})\}$)
6: end for
7: Optionally adjust the $p$-values of all links by the False Discovery Rate approach (FDR)
8: return $p$-values and MCI test statistic values

For simplicity's sake, define the sets $[b] := \{1, 2, \ldots, b\}$ and $[a, b] := \{a, a+1, \ldots, b\}$, where $a, b \in \mathbb{N}$.

Algorithm B1 PCMCIΩ
1-11: Same as lines 1-11 of Algorithm 1.
12: ......
13: if there exists a set of turning points $S_j$, $S_j \subseteq [\omega_{ub}]$ then
14:   $\hat\omega_j \leftarrow \min S_j$
15: else
16: ......
17:   $\hat\omega_j \leftarrow \arg\min_{\omega \in [\omega_{ub}]} \max_{k \in [\omega]} |\widehat{Pa}_\omega(X^j_{t \in \hat\Pi^j_k})|$
18: end if
19: Set $\widehat{Pa}(X^j_t) \leftarrow \widehat{Pa}_{\hat\omega_j}(X^j_t)$ for $X^j_t$, $t \in \{t : t \ge 2\tau_{ub}\}$
20: end for
21: return $\hat\omega_j$ and $\widehat{Pa}(X^j_t)$, $\forall j \in [n]$, $t \ge 2\tau_{ub}$

C Soundness of PCMCIΩ

C.1 Stationary Markov Chain

Claim: Any discrete-valued time series $V$ with a Semi-Stationary Structural Causal Model (SCM) satisfying Assumptions A1, A2, A4, A5 can be written as a Markov chain $\{Z_n\}$, as long as this Markov chain satisfies $Pa(Z_n) \subseteq Z_n \cup Z_{n-1}$ for all $n$, where $Z_n$ is a set of variables in $V$. This Markov chain has a finite number of states if all time series in $V$ are discrete-valued.

Note that when the notation $n$ relates to a Markov chain $Z_n$, it means the running index; in the context of $X^j_t$, $j \in [n]$, $n$ represents the number of component time series within the $n$-variate time series. To simplify, assume that one associated Markov chain of $V = \{X, Y\}$ has $Z_n = \{X_t, Y_t, X_{t-1}, Y_{t-1}\}$ with $t \in \{t \in \mathbb{N}^+ : t \le T\}$, satisfying $Pa(Z_n) \subseteq Z_n \cup Z_{n-1}$.
Here, the notation for the time points of variables is simplified as $t$ and $t-1$, even though it should be a function of $n$, the running index of the Markov chain. Note that $Z_{n-1} = \{X_{t-2}, Y_{t-2}, X_{t-3}, Y_{t-3}\}$ rather than $\{X_{t-1}, Y_{t-1}, X_{t-2}, Y_{t-2}\}$, as the simplified notation could erroneously suggest. A simple proof through the Markov assumption (A2) is shown below.

Figure 2: Partial causal graph for a 3-variate time series $V = \{X^1, X^2, X^3\}$ with a Semi-Stationary SCM where $\tau_{\max} = 3$, $\omega_1 = 3$, $\omega_2 = 2$, $\omega_3 = 1$, $\Omega = 6$ and $\delta = 6$. The first 3 ($= \tau_{\max}$) time slices $\{X_t\}_{1 \le t \le 3}$ are the starting points. Edges of the same color denote the same causal mechanism. E.g., for $X^1$ there are 3 ($= \omega_1$) time partition subsets $\{\Pi^1_k\}_{1 \le k \le 3}$. The time points $t$ of nodes $X^1_t$ sharing the same filling color are in the same time partition subset; those sharing both the same filling color and the same outline shape are in the same homogenous time partition subset. There are 6 ($= \delta$) different Markov chains in this multivariate time series $V$, and the first elements of these 6 Markov chains are shown as $\{Z^q_1\}_{1 \le q \le 6}$, tinted with a gradient of blue hues. $Z^1_1$ and $Z^1_2$ denote the first two elements of the first Markov chain, while $Z^2_1$ and $Z^2_2$ denote the first two elements of the second Markov chain.

$p(Z_n \mid Z_{n-1}, Z_{n-2}, \ldots)$  (3)
$= p(X_t, Y_t, X_{t-1}, Y_{t-1} \mid Z_{n-1}, Z_{n-2}, \ldots)$  (4)
$= p(X_t \mid Z_n \cup Z_{n-1} \setminus X_t, Z_{n-2}, \ldots)\; p(Y_t \mid Z_n \cup Z_{n-1} \setminus (X_t \cup Y_t), Z_{n-2}, \ldots) \cdots$  (5)
$= p(X_t \mid Pa(X_t), Z_n \cup Z_{n-1} \setminus (X_t \cup Pa(X_t)))\; p(Y_t \mid Pa(Y_t), Z_n \cup Z_{n-1} \setminus (X_t \cup Y_t \cup Pa(Y_t)))\; p(X_{t-1} \mid Pa(X_{t-1}), Z_n \cup Z_{n-1} \setminus (X_t \cup Y_t \cup X_{t-1} \cup Pa(X_{t-1}))) \cdots$  (6)
$= p(X_t \mid Z_n \cup Z_{n-1} \setminus X_t)\; p(Y_t \mid Z_n \cup Z_{n-1} \setminus (X_t \cup Y_t)) \cdots$  (7)
$= p(X_t, Y_t, X_{t-1}, Y_{t-1} \mid Z_{n-1})$  (8)
$= p(Z_n \mid Z_{n-1})$  (9)

Assume that the state spaces of both $X_t$ and $Y_t$, $t \le T$, are $\{1, 2\}$. There are in total $2^4 = 16$ states of the Markov chain $\{Z_n\} = \{\{X_t, Y_t, X_{t-1}, Y_{t-1}\}\}$. The transition probability $P$ of this Markov chain is a $16 \times 16$ matrix whose rows and columns are indexed by states such as $(1,1,1,1)$ and $(2,1,1,1)$, with entries $p_{1,1}, p_{1,2}, \ldots$, where $(1,1,1,1)$ means $X_t = 1, Y_t = 1, X_{t-1} = 1, Y_{t-1} = 1$. Each row of this transition probability matrix is a conditional distribution of $Z_n$ given one realization of $Z_{n-1}$; each entry is the probability of one specific realization of $Z_n$ given one realization of $Z_{n-1}$. This probability can be decomposed into conditional distributions based on the Markov assumption (A2). Take $p_{1,1}$ as an example:

$p_{1,1} = p(X_t = 1, Y_t = 1, X_{t-1} = 1, Y_{t-1} = 1 \mid X_{t-2} = 1, Y_{t-2} = 1, X_{t-3} = 1, Y_{t-3} = 1)$  (10)
$= p(X_t = 1 \mid Pa(X_t))\, p(Y_t = 1 \mid Pa(Y_t))\, p(X_{t-1} = 1 \mid Pa(X_{t-1}))\, p(Y_{t-1} = 1 \mid Pa(Y_{t-1}))$,  (11)

where the $Pa(\cdot)$ here are realizations, not random variables.

For a time series $V$ with a Semi-Stationary SCM, there are (potentially) $\delta$ different Markov chains $\{Z^q_n\}$, $q \in [\delta]$: $Z^q_n = \{X_{\tau_{\max}+q+(n-1)\delta}, X_{\tau_{\max}+q+1+(n-1)\delta}, \ldots, X_{\tau_{\max}+q-1+n\delta}\}$, where $n \in \{n \in \mathbb{N}^+ : \tau_{\max}+q-1+n\delta \le T\}$ and $\delta = \lceil (\tau_{\max}+1)/\Omega \rceil\, \Omega$. As proved in the claim, such a Markov chain exists as long as $Pa(Z^q_n) \subseteq Z^q_n \cup Z^q_{n-1}$ for all $n$. The value of $\delta$ guarantees the existence of such a Markov chain because $\delta$ is at least $\tau_{\max}+1$ and is a multiple of $\Omega$, hence a multiple of all $\{\omega_j\}_{j \in [n]}$. By this choice, $Pa(Z^q_n) \subseteq Z^q_n \cup Z^q_{n-1}$ is satisfied; for any variable $X^j_t$, there exist $q \in [\delta]$ and $n \in \mathbb{N}^+$ such that the variable $X^j_t$ and its parent set $Pa(X^j_t)$ are included in $Z^q_n$; and the causal mechanism generating $Z^q_n$ is invariant across $n$. The state space of $\{Z^q_n\}$ is the set containing all possible realizations of $\{X_{\tau_{\max}+q+(i-1)+(n-1)\delta}\}_{i \in [\delta], n \in N'}$. The transition probabilities between the states are products of the associated causal mechanisms, based on the Markov assumption (A2).
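As an illustration of how one entry of this transition matrix decomposes into a product of causal mechanisms (Eq. (10)-(11)), consider the following toy sketch. The conditional probability tables `p_x` and `p_y` are hypothetical, and for readability a single fixed mechanism per variable is used (a semi-stationary SCM would alternate between several such tables):

```python
import numpy as np
from itertools import product

# Hypothetical binary mechanisms with Pa(X_t) = {X_{t-1}} and
# Pa(Y_t) = {X_{t-1}, Y_{t-1}}; each table gives [P(v=0), P(v=1)].
p_x = {0: [0.7, 0.3], 1: [0.4, 0.6]}
p_y = {(0, 0): [0.8, 0.2], (0, 1): [0.5, 0.5],
       (1, 0): [0.6, 0.4], (1, 1): [0.1, 0.9]}

def transition_entry(nxt, prv):
    """Entry of the 16 x 16 block-chain transition matrix: the product of
    per-variable mechanisms, factorized by Assumption A2 (Eq. (11)).
    States are tuples (x_t, y_t, x_{t-1}, y_{t-1})."""
    xt, yt, xtm1, ytm1 = nxt
    xp, yp, _, _ = prv          # prv = (x_{t-2}, y_{t-2}, x_{t-3}, y_{t-3})
    return (p_x[xtm1][xt] * p_y[(xtm1, ytm1)][yt]
            * p_x[xp][xtm1] * p_y[(xp, yp)][ytm1])

states = list(product([0, 1], repeat=4))
P = np.array([[transition_entry(s2, s1) for s2 in states] for s1 in states])
assert np.allclose(P.sum(axis=1), 1.0)   # every row is a distribution
```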
Determined by the starting slice $X_t$, $\tau_{\max} < t \le \tau_{\max} + \delta$, there are $\delta$ potentially different Markov chains $\{Z^q_n\}$, $1 \le q \le \delta$. More specifically, these Markov chains are:

Markov Chain 1: $Z^1_n = \{X_{\tau_{\max}+1+(n-1)\delta}, X_{\tau_{\max}+2+(n-1)\delta}, \ldots, X_{\tau_{\max}+n\delta}\}$,  (12)
where $n \in \{n \in \mathbb{N}^+ : \tau_{\max}+n\delta \le T\}$.

Markov Chain 2: $Z^2_n = \{X_{\tau_{\max}+2+(n-1)\delta}, X_{\tau_{\max}+3+(n-1)\delta}, \ldots, X_{\tau_{\max}+1+n\delta}\}$,  (13)
where $n \in \{n \in \mathbb{N}^+ : \tau_{\max}+1+n\delta \le T\}$.

...

Markov Chain $\delta$: $Z^\delta_n = \{X_{\tau_{\max}+n\delta}, X_{\tau_{\max}+1+n\delta}, \ldots, X_{\tau_{\max}-1+(n+1)\delta}\}$,  (14)
where $n \in \{n \in \mathbb{N}^+ : \tau_{\max}-1+(n+1)\delta \le T\}$.

Given the Irreducible and Aperiodic Markov Chain assumption (A7), each discrete-time Markov chain $\{Z^q_n\}$ on its finite state space is ergodic  (15)
and converges to a unique stationary (equilibrium) distribution: there exists $n'$ such that the distribution of $Z^q_n$ is invariant for all $n > n'$.  (16)

Returning from the stationary and ergodic Markov chains $\{Z^q_n\}$, $q \in [\delta]$, back to the original data $V$ through Eq. (12) to Eq. (14), the distribution of the original data $V$ must adhere to the following condition:

$p(X_{\tau_{\max}+q+n_1\delta}, X_{\tau_{\max}+q+1+n_1\delta}, \ldots, X_{\tau_{\max}+q+\delta-1+n_1\delta}) = p(X_{\tau_{\max}+q+n_2\delta}, X_{\tau_{\max}+q+1+n_2\delta}, \ldots, X_{\tau_{\max}+q+\delta-1+n_2\delta})$  (17)

for any $q \in [\delta]$ and $n_1, n_2 > n'$. Given these clarifications, we can naturally introduce a more refined time partition that is based on, yet finer than, the time partition defined in Definition 2.3 of the main paper.

Definition C.1 (Homogenous Time Partition). For a univariate time series $X^j$ in a Semi-Stationary SCM with periodicity $\omega_j$, the time partition $\Pi^j_k$ of $X^j$ can be further divided into a series of non-overlapping, non-empty subsets $\{\pi^j_{(k,s)}\}_{1 \le s \le \delta/\omega_j}$. For each $t \in [\tau_{\max}+1, T]$, there exists $k \in [\omega_j]$ such that $t \in \Pi^j_k$, and further there exists $s \in [\delta/\omega_j]$ such that $t \in \pi^j_{(k,s)}$; time points of $\Pi^j_k$ that are congruent modulo $\delta$ share the same subset, which can be written as

$\pi^j_{(k,s)} := \{t : \tau_{\max}+1 \le t \le T, \ (t \bmod \omega_j) + 1 = k, \ \lfloor (t \bmod \delta)/\omega_j \rfloor + 1 = s\}$.  (18)

With this definition, we have $\cup_{s=1}^{\delta/\omega_j} \pi^j_{(k,s)} = \Pi^j_k$. While the time partition $\Pi^j_k$ guarantees that all variables in $\{X^j_t\}_{t \in \Pi^j_k}$ share the same causal mechanism, the homogenous time partition $\pi^j_{(k,s)}$ guarantees that all variables in $\{X^j_t\}_{t > t', t \in \pi^j_{(k,s)}}$ share the same distribution, where $t'$ represents the steps needed by the associated Markov chain to achieve equilibrium.

Fig. 2 shows a partial causal graph for a 3-variate time series with a Semi-Stationary SCM. $\tau_{\max} = 3$ means that the causal mechanisms start from $t = 4$, and the random variables with $t \in \{1, 2, 3\}$ are random noises. For the first time series $X^1$, the periodicity $\omega_1$ is 3, and the periodicities of the time series $X^2$ and $X^3$ are 2 and 1, respectively. The periodicity of the whole time series $V$ is obtained by $\mathrm{LCM}(3, 2, 1) = 6$, and $\delta = \lceil (\tau_{\max}+1)/\Omega \rceil\, \Omega = \lceil 4/6 \rceil \cdot 6 = 1 \cdot 6 = 6$. In Fig. 2, periodicity $\omega_1 = 3$ means that the causal mechanisms repeat every three time points, and hence there are three time partition subsets $\Pi^1_k$, $k \in [3]$. More specifically, $\Pi^1_1 = \{4, 7, 10, 13, \ldots, 4+3N, \ldots\}$, $\Pi^1_2 = \{5, 8, 11, 14, \ldots, 5+3N, \ldots\}$, $\Pi^1_3 = \{6, 9, 12, 15, \ldots, 6+3N, \ldots\}$, where $N \in \mathbb{N}^+$. Random variables $\{X^1_t\}$ with $t$ in the same time partition subset share the same causal mechanism; however, they may not share the same marginal distribution. Still in Fig. 2, based on the definition of the homogenous time partition, the time partition subset $\Pi^1_1$ for $X^1$ can be further decomposed as $\pi^1_{(k=1,s=1)} = \{4, 10, \ldots, 4+\delta N, \ldots\}$ and $\pi^1_{(k=1,s=2)} = \{7, 13, \ldots, 7+\delta N, \ldots\}$, where $s \in [\delta/\omega_1]$. After a long run $n$, $Z^1_n$ and $Z^1_{n+1}$ will eventually share the same distribution; that is, all the variables inside $Z^q_n$ will share the same joint and marginal distributions as the corresponding variables inside $Z^q_{n+1}$.
To illustrate this, we assume that this Markov chain has already achieved its equilibrium at time point $t = 4$. Based on Eq. (12) and Eq. (17), we have:

$p(X_4, X_5, \ldots, X_9) = p(X_{10}, X_{11}, \ldots, X_{15}) = p(X_{16}, X_{17}, \ldots, X_{21}) = \cdots$  (19)

From the identical joint distributions, we can further have:

$p(X^1_4) = p(X^1_{10}) = p(X^1_{16}) = \cdots$  (20)

as $X^1_4 \in X_4$, $X^1_{10} \in X_{10}$ and $X^1_{16} \in X_{16}$. Therefore, for sufficiently large values of $t$ ensuring that $Z^1_n$ has reached its stationary distribution, all variables within $\{X^j_t\}_{t \in \pi^j_{(k,s)}}$ will share the same distribution. In Fig. 2, there are 6 ($= \delta$) potentially different Markov chains $\{Z^q_n\}$, $q \in [\delta]$, in $V$. For any time window of length $\delta$, $\{X_t, \ldots, X_{t+\delta-1}\}$, there exist $q \in [\delta]$ and $n \in \mathbb{N}^+$ such that this time window can be completely included in $Z^q_n$. For instance, the set $\{X_5, X_6, \ldots, X_{10}\}$ is in $Z^2_1$, which is the first element of the Markov chain $\{Z^2_n\}$. Constructing Markov chains and applying the Irreducible and Aperiodic Markov Chain assumption (A7) enables us to obtain consistent estimators for the conditional and joint distributions of interest.

C.2 Consistent Estimator

The conditional distributions for variables in $\{X^j_t\}_{t \in \Pi^j_k}$ are the same, that is, $p(x^j_{t_1} \mid Pa(x^j_{t_1})) = p(x^j_{t_2} \mid Pa(x^j_{t_2}))$, $\forall t_1, t_2 \in \Pi^j_k$. For simplicity, denote

$p_{t \in \Pi^j_k}(x^j_t \mid Pa(x^j_t)) := p(x^j_t \mid Pa(x^j_t)), \quad t \in \Pi^j_k$.  (21)

Consider an indicator function such that $\mathbb{1}(x^j_t, Pa(x^j_t)) = 1$ if the configuration $(x^j_t, Pa(x^j_t))$ is realized, and $\mathbb{1}(x^j_t, Pa(x^j_t)) = 0$ otherwise. Since any two points in $\pi^j_{(k,s)}$ are $N\delta$ steps apart, $N \in \mathbb{N}^+$, and there must exist $q \in [\delta]$ and $n_1 \in \mathbb{N}^+$ such that $\{x^j_t, Pa(x^j_t)\} \subseteq Z^q_{n_1}$, then for the same $q$ there must exist another $n_2$ such that $\{x^j_{t+N\delta}, Pa(x^j_{t+N\delta})\} \subseteq Z^q_{n_2}$. Hence, we have $\{\mathbb{1}(x^j_t, Pa(x^j_t))\}_{t \in \pi^j_{(k,s)}} = \{f(Z^{q(t)}_{n_1(t)})\}_{t \in \pi^j_{(k,s)}}$ for some function $f : \mathbb{R}^{n \times \delta} \to \mathbb{R}$ satisfying $\mathbb{E}|f(Z^{q(t)}_{n_1(t)})| < \infty$. Since the value of $t$ determines $q$ and $n_1$, we write $q(t)$ and $n_1(t)$ to emphasize these relations. For large enough $t > t'$, the $\{\mathbb{1}(x^j_t, Pa(x^j_t))\}_{t > t', t \in \pi^j_{(k,s)}}$ are identically distributed samples, where $t'$ is the time point needed by the associated Markov chain to achieve its equilibrium after $n_1(t')$ steps. Without loss of generality, we assume $T$ is always a multiple of $\delta$. We can construct an estimator of $p(x^j_t, Pa(x^j_t))$, for large enough $t$, as:

$\hat p(x^j_t, Pa(x^j_t)) = \frac{\delta}{T} \sum_{t \in \pi^j_{(k,s)}} \mathbb{1}(x^j_t, Pa(x^j_t))$,  (22)

where $(k, s)$ is determined by $t$: there exists one and only one pair $(k, s)$ satisfying $t \in \pi^j_{(k,s)}$.
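The estimator in Eq. (22), like the conditional version in Eq. (37) below, is a plain indicator-count ratio. A minimal sketch, assuming a hypothetical array layout in which `X_vals` (shape `(T,)`) holds the realizations of $X^j$ and `pa_vals` (shape `(T, d)`) holds each time point's realized parent configuration:

```python
import numpy as np

def empirical_conditional(X_vals, pa_vals, x0, pa0, times):
    """Indicator-count estimator of p(x_t = x0 | Pa(x_t) = pa0), restricted
    to the supplied time points (e.g. one time partition subset)."""
    times = np.asarray(times)
    pa_match = np.all(pa_vals[times] == pa0, axis=1)   # parent config realized
    joint = np.sum(pa_match & (X_vals[times] == x0))   # joint config realized
    denom = np.sum(pa_match)
    return joint / denom if denom > 0 else np.nan
```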
Decomposing Eq.(22) at $t^*$ and rescaling the second term (Eqs.(23)–(28)):
$$\hat{p}(x^j_t, Pa(x^j_t)) = \frac{\delta}{T} \sum_{t \le t^*,\, t \in \pi^j_{(k,s)}} \mathbb{1}(x^j_t, Pa(x^j_t)) + \frac{T - t^*}{T} \cdot \frac{\delta}{T - t^*} \sum_{t > t^*,\, t \in \pi^j_{(k,s)}} \mathbb{1}(x^j_t, Pa(x^j_t)).$$

Taking the limit as $T \to \infty$ (Eqs.(30)–(31)), the first term vanishes and we obtain:
$$\lim_{T \to \infty} \hat{p}(x^j_t, Pa(x^j_t)) = 0 + \lim_{T \to \infty} \frac{T - t^*}{T} \cdot \frac{1}{n_1(T) - n_1(t^*)} \sum_{n_1(t) > n_1(t^*)} f(Z^{q(t)}_{n_1(t)}), \quad t > t^*, \ t \in \pi^j_{(k,s)} \quad (32)$$
$$\overset{\text{Birkhoff's Ergodic Theorem}}{=} 0 + E\big[f(Z^{q(t)}_{n_1(t)})\big] \quad (33)$$
$$= E\big[\mathbb{1}(x^j_t, Pa(x^j_t))\big], \quad t > t^*, \ t \in \pi^j_{(k,s)} \quad (34)$$
$$= p(x^j_t, Pa(x^j_t)), \quad t > t^*, \ t \in \pi^j_{(k,s)}. \quad (35)$$
Denote
$$p_{t \in \pi^j_{(k,s)}}(x^j_t, Pa(x^j_t)) := p(x^j_t, Pa(x^j_t)), \quad t > t^*, \ t \in \pi^j_{(k,s)}. \quad (36)$$

Based on the definitions of the homogeneous time partition and the time partition, $p_{t \in \pi^j_{(k,s)}}(x^j_t|Pa(x^j_t)) = p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t))$ for all $s \in [\delta/\omega_j]$. Similar to Eq.(22), one estimator of $p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t))$, $k \in [\omega_j]$, is
$$\hat{p}_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) = \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(x^j_t, Pa(x^j_t))}{\sum_{t \in \Pi^j_k} \mathbb{1}(Pa(x^j_t))} = \frac{\sum_{s=1}^{\delta/\omega_j} \sum_{t \in \pi^j_{(k,s)}} \mathbb{1}(x^j_t, Pa(x^j_t))}{\sum_{s=1}^{\delta/\omega_j} \sum_{t \in \pi^j_{(k,s)}} \mathbb{1}(Pa(x^j_t))}. \quad (37)\text{–}(39)$$

Taking the limit of Eq.(37), we have:
$$\lim_{T \to \infty} \hat{p}_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) \overset{\text{Eq.(35)}}{=} \frac{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(x^j_t, Pa(x^j_t))}{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(Pa(x^j_t))} \quad (41)$$
$$= \frac{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(x^j_t|Pa(x^j_t))\, p_{t \in \pi^j_{(k,s)}}(Pa(x^j_t))}{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(Pa(x^j_t))} \quad (42)$$
$$\overset{p_{t \in \pi^j_{(k,s)}}(x^j_t|Pa(x^j_t)) \text{ equal for all } s}{=} p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) \cdot \frac{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(Pa(x^j_t))}{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(Pa(x^j_t))} \quad (43)$$
$$= p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)). \quad (44)$$

Hence, $\hat{p}_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t))$ is a consistent estimator of $p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t))$. Similarly, we construct an estimator of $p(x^j_t|\cup_h Pa_h(x^j_t))$ for $t \in [T]$:
$$\hat{p}(x^j_t|\cup_h Pa_h(x^j_t)) = \frac{\sum_t \mathbb{1}(x^j_t, \cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))}. \quad (45)\text{–}(46)$$
We will prove in Lemma D.2 that this estimator converges as $T$ goes to infinity; hence, it is a consistent estimator.

In this section, we have proved that $\hat{p}(x^j_t, Pa(x^j_t))$ in Eq.(22) is a consistent estimator of $p(x^j_t, Pa(x^j_t))$ using samples with $t$ in the same homogeneous time partition subset, and that $\hat{p}(x^j_t|Pa(x^j_t))$ in Eq.(37) is a consistent estimator of $p(x^j_t|Pa(x^j_t))$ using samples with $t$ in the same time partition subset.
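To ground the two estimators, here is a minimal Python sketch of the counting forms of Eq.(22) and Eq.(37) for discrete-valued data. It assumes the $n$ series are stacked in a $(T, n)$ integer array `data` with 0-indexed time, parents given as (variable, lag) pairs, and the relevant time subsets ($\pi^j_{(k,s)}$ or $\Pi^j_k$) supplied as index lists; all names are illustrative.

```python
def indicator(data, j, t, parents, x_val, pa_vals):
    """1(x^j_t, Pa(x^j_t)): 1 iff X^j_t = x_val and each lagged parent
    X^i_{t-tau} takes its prescribed value (t is 0-indexed)."""
    if data[t, j] != x_val:
        return 0
    return int(all(data[t - tau, i] == v
                   for (i, tau), v in zip(parents, pa_vals)))

def joint_estimate(data, j, subset, parents, x_val, pa_vals, delta):
    """Eq.(22): (delta / T) times the indicator count over one homogeneous
    subset pi^j_(k,s); consistency rests on the ergodicity of the
    associated Markov chain (assumption A7)."""
    T = data.shape[0]
    count = sum(indicator(data, j, t, parents, x_val, pa_vals)
                for t in subset)
    return delta * count / T

def conditional_estimate(data, j, subset, parents, x_val, pa_vals):
    """Eq.(37): ratio of joint-configuration counts to parent-configuration
    counts over one time partition subset Pi^j_k; Eqs.(41)-(44) are what
    justify pooling across the phases s."""
    num = den = 0
    for t in subset:
        if all(data[t - tau, i] == v
               for (i, tau), v in zip(parents, pa_vals)):
            den += 1
            num += int(data[t, j] == x_val)
    return num / den if den > 0 else float("nan")
```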
Theorem D.1. Let $\hat{G}$ be the graph estimated by Algorithm PCMCIΩ. Under assumptions A1–A7 and with an oracle (infinite sample size limit), we have
$$\hat{G} = G \quad (47)$$
almost surely.

Lemma D.2. Let $\{Pa_k(X^j_t)\}_{k \in [\omega_j]}$ contain the true and illusory parent sets, where $\omega_j$ is the true periodicity of $X^j$. For any random variable $X^j_t$ with large enough $t$, under assumptions A1–A7 and with an oracle (infinite sample size limit), we have:
$$P\Big( p\big(X^j_t \,\big|\, \cup_{k=1}^{\omega_j} Pa_k(X^j_t)\big) = p\big(X^j_t \,\big|\, \cup_{k=1}^{\omega_j} Pa_k(X^j_t) \setminus y\big) \Big) = 1, \quad \forall y \in \cup_{k=1}^{\omega_j} Pa_k(X^j_t). \quad (48)$$
Here, $p(X^j_t|\cup_{k=1}^{\omega_j} Pa_k(X^j_t)) = \lim_{T \to \infty} \hat{p}(X^j_t|\cup_{k=1}^{\omega_j} Pa_k(X^j_t))$.

Proof. We first prove that there exists a sequence of coefficients $\{\alpha_k\}_{k \in [\omega_j]}$ satisfying $\sum_{k=1}^{\omega_j} \alpha_k = 1$ such that, for any configuration of $\cup_h Pa_h(x^j_t)$:
$$\hat{p}(x^j_t|\cup_h Pa_h(x^j_t)) = \sum_{k=1}^{\omega_j} \alpha_k\, \hat{p}_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)). \quad (49)$$
If this holds, then $\hat{p}(x^j_t|\cup_h Pa_h(x^j_t))$ is a consistent estimator of $p(x^j_t|\cup_h Pa_h(x^j_t))$.

Based on Eq.(46), we have:
$$\hat{p}(x^j_t|\cup_h Pa_h(x^j_t)) = \frac{\sum_t \mathbb{1}(x^j_t, \cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))} \quad (50)\text{–}(51)$$
$$= \frac{\sum_{k=1}^{\omega_j} \sum_t \mathbb{1}(x^j_t, \cup_h Pa_h(x^j_t))\, \mathbb{1}(t \in \Pi^j_k)}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))} \quad (52)\text{–}(53)$$
$$= \sum_{k=1}^{\omega_j} \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(x^j_t, \cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))} \quad (54)$$
$$= \sum_{k=1}^{\omega_j} \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(x^j_t, \cup_h Pa_h(x^j_t))}{\sum_{t \in \Pi^j_k} \mathbb{1}(\cup_h Pa_h(x^j_t))} \cdot \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(\cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))} \quad (55)$$
$$= \sum_{k=1}^{\omega_j} \hat{p}_{t \in \Pi^j_k}(x^j_t|\cup_h Pa_h(x^j_t)) \cdot \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(\cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))} \quad (56)$$
$$= \sum_{k=1}^{\omega_j} \hat{p}_{t \in \Pi^j_k}\big(x^j_t \,\big|\, Pa(x^j_t), \cup_h Pa_h(x^j_t) \setminus Pa(x^j_t)\big) \cdot \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(\cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))} \quad (57)\text{–}(58)$$
$$= \sum_{k=1}^{\omega_j} \hat{p}_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) \cdot \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(\cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))} \quad (59)$$
$$= \sum_{k=1}^{\omega_j} \alpha_k(T)\, \hat{p}_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)), \quad (60)$$
where
$$\alpha_k(T) = \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(\cup_h Pa_h(x^j_t))}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t))}. \quad (61)$$

Using the same logic as in Eq.(30)–(35), we decompose the numerator and denominator of $\alpha_k$ with the homogeneous time partition until each component converges to a stationary distribution (Eqs.(62)–(63)), giving:
$$\lim_{T \to \infty} \alpha_k(T) = \frac{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t))}{\sum_{k'=1}^{\omega_j} \sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k',s)}}(\cup_h Pa_h(x^j_t))}. \quad (64)$$

Without loss of generality, assume $y \in Pa(x^j_t)$ for $t \in \Pi^j_1$ and $y \notin Pa(x^j_t)$ for $t \notin \Pi^j_1$. Then we have:
$$\hat{p}(x^j_t|\cup_h Pa_h(x^j_t) \setminus y) = \frac{\sum_t \mathbb{1}(x^j_t, \cup_h Pa_h(x^j_t) \setminus y)}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t) \setminus y)} \quad (65)$$
$$= \sum_{k=2}^{\omega_j} \beta_k(T)\, \hat{p}_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) + \beta_1(T)\, \hat{p}_{t \in \Pi^j_1}(x^j_t|Pa(x^j_t) \setminus y), \quad (66)\text{–}(67)$$
where
$$\beta_k(T) = \frac{\sum_{t \in \Pi^j_k} \mathbb{1}(\cup_h Pa_h(x^j_t) \setminus y)}{\sum_t \mathbb{1}(\cup_h Pa_h(x^j_t) \setminus y)}. \quad (68)$$
Similarly (Eqs.(69)–(70)):
$$\lim_{T \to \infty} \beta_k(T) = \frac{\sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t) \setminus y)}{\sum_{k'=1}^{\omega_j} \sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k',s)}}(\cup_h Pa_h(x^j_t) \setminus y)}. \quad (71)$$

Proving $p(x^j_t|\cup_h Pa_h(x^j_t)) = p(x^j_t|\cup_h Pa_h(x^j_t) \setminus y)$ is equivalent to proving:
$$p(x^j_t|\cup_h Pa_h(x^j_t)) - p(x^j_t|\cup_h Pa_h(x^j_t) \setminus y) = 0. \quad (72)$$

Substituting Eq.(60) and Eq.(67) into Eq.(72) and taking limits, we obtain (Eqs.(73)–(75)):
$$\sum_{k=1}^{\omega_j} \frac{\sum_s p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t))}{\sum_{k'} \sum_s p_{t \in \pi^j_{(k',s)}}(\cup_h Pa_h(x^j_t))}\, p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) - \sum_{k=2}^{\omega_j} \frac{\sum_s p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t) \setminus y)}{\sum_{k'} \sum_s p_{t \in \pi^j_{(k',s)}}(\cup_h Pa_h(x^j_t) \setminus y)}\, p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) - \frac{\sum_s p_{t \in \pi^j_{(1,s)}}(\cup_h Pa_h(x^j_t) \setminus y)}{\sum_{k'} \sum_s p_{t \in \pi^j_{(k',s)}}(\cup_h Pa_h(x^j_t) \setminus y)}\, p_{t \in \Pi^j_1}(x^j_t|Pa(x^j_t) \setminus y).$$

After equating the denominators, the numerator is:
$$\sum_{k=1}^{\omega_j} \sum_s p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t))\, p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) \cdot \sum_{k'} \sum_s p_{t \in \pi^j_{(k',s)}}(\cup_h Pa_h(x^j_t) \setminus y) - \Big[ \sum_{k=2}^{\omega_j} \sum_s p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t) \setminus y)\, p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)) + \sum_s p_{t \in \pi^j_{(1,s)}}(\cup_h Pa_h(x^j_t) \setminus y)\, p_{t \in \Pi^j_1}(x^j_t|Pa(x^j_t) \setminus y) \Big] \cdot \sum_{k'} \sum_s p_{t \in \pi^j_{(k',s)}}(\cup_h Pa_h(x^j_t)). \quad (76)$$

For the sake of simplicity, denote:
$$a_k := \sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t)), \quad (77)$$
$$b_k := \sum_{s=1}^{\delta/\omega_j} p_{t \in \pi^j_{(k,s)}}(\cup_h Pa_h(x^j_t) \setminus y), \quad (78)$$
$$c_k := p_{t \in \Pi^j_k}(x^j_t|Pa(x^j_t)), \quad (79)$$
$$c'_1 := p_{t \in \Pi^j_1}(x^j_t|Pa(x^j_t) \setminus y). \quad (80)$$

Substituting these notations into Eq.(76):
$$\Big(\sum_{k=1}^{\omega_j} a_k c_k\Big)\Big(\sum_{k=1}^{\omega_j} b_k\Big) - \Big(\sum_{k=2}^{\omega_j} b_k c_k + b_1 c'_1\Big)\Big(\sum_{k=1}^{\omega_j} a_k\Big) \quad (81)$$
$$= \sum_{k=1}^{\omega_j} (c_k - c'_1)\, a_k b_1 + \sum_{k=1}^{\omega_j} \sum_{i > 1,\, i \ne k} (c_k - c_i)\, a_k b_i. \quad (82)\text{–}(83)$$

Denote
$$V_t = \{X_{t'} \mid 0 < t' < t\}; \quad (84)$$
that is, $V_t$ contains all the nodes before time point $t$.
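The backbone of this derivation is the mixture identity of Eq.(60): in the limit, the pooled conditional is a convex combination of the per-partition conditionals with weights $\alpha_k$ from Eq.(61). A toy numerical check of this bookkeeping, with all numbers hypothetical:

```python
import numpy as np

# p_{t in Pi^j_k}(x | Pa) for omega_j = 3 partitions (made-up values).
p_k = np.array([0.3, 0.7, 0.5])
# How often the conditioning configuration occurs in each partition subset.
counts = np.array([40, 25, 35])
alpha = counts / counts.sum()   # alpha_k(T) of Eq.(61); sums to 1
pooled = float(alpha @ p_k)     # ^p(x | union of parent sets), Eq.(60)
print(alpha.sum(), pooled)      # -> 1.0 and 0.47, inside [0.3, 0.7]
```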
Denote $\{b_{t_i}\}_{i \in [n]} = \cup_h Pa_h(x^j_t)$ and assume $\{b_{t_i}\}_{1 \le i \le n_1} \subseteq Pa(X^j_t)$, so that $\omega_j$ always leads to a smaller size of estimated parent sets than $\hat{\omega}_j$, contrary to the definition of $\hat{\omega}_j$. Hence, $\hat{\omega}_j = \omega_j$.

With those lemmas, we can prove Theorem 1.

Proof. Assuming that the correct $\omega_j$ has already been obtained, from Lemma D.4 we have $\widehat{Pa}(X^j_t) = Pa(X^j_t)$ for all $t \ge 2\tau_{ub}$, $j \in [n]$. From Lemma D.5, we know that the correct $\omega_j$ must be obtained with consistent CI tests, that is, $\hat{\omega}_j = \omega_j$ for all $j \in [n]$. Therefore, from Algorithm B1, we have $\widehat{Pa}(X^j_t) = Pa(X^j_t)$ for all $t \ge 2\tau_{ub}$, $j \in [n]$.

Figure 3: In this illustration of the "turning point," the sizes of the parent sets for different estimates $\hat{\omega}_j$ are depicted as $|\widehat{Pa}_k(X^j_t)|$, $k \in [\hat{\omega}_j]$. Note that $\widehat{Pa}_k(X^j_t)$ represents either the true parent set or an illusory parent set of $X^j_t$; in this context, we are interested only in the sizes of these sets. The first "turning point" occurs at $\hat{\omega}_j = 3$, since the parent-set sizes obtained at $\hat{\omega}_j = 2$ and $\hat{\omega}_j = 4$ are both larger than the corresponding size at $\hat{\omega}_j = 3$. The term "turning point" denotes that, as $\hat{\omega}_j$ increases, the size of the parent set first decreases and then starts increasing once the local minimum is reached. This pattern arises because, as long as $\hat{\omega}_j$ is not a multiple of the true $\omega_j$, the estimated time partition subsets under $\hat{\omega}_j$ must be a mixture of several correct time partition subsets under $\omega_j$. It is therefore reasonable to use this trick rather than looking at the maximum size of the parent sets $\widehat{Pa}_k(X^j_t)$, $k \in [\hat{\omega}_j]$ (line 17 in Algorithm B1).

If the causal mechanism is fixed across time, i.e., $\omega_j = 1$ for all $j \in [n]$, the proof of PCMCI in Runge et al. [2019] showed that, for all $X^j \in V$:
$$X^i_{t-\tau} \to X^j_t \notin G \implies X^i_{t-\tau} \to X^j_t \notin \hat{G}, \qquad X^i_{t-\tau} \to X^j_t \in G \implies X^i_{t-\tau} \to X^j_t \in \hat{G}.$$
Therefore $\hat{G} = G$. If $\omega_j > 1$, we can simply separate the whole graph $G$ into subgraphs $\{G^{\omega_j}_k\}_{k \in [\omega_j]}$, each consisting of only the target variable $X^j_t$ with $t \in \Pi^j_k$ and the parent variables $X^i_{t'} \in Pa(X^j_t)$. Focusing on one time partition subset $\Pi^j_k$, $k \in [\omega_j]$, we have
$$\hat{G}^{\omega_j}_k = G^{\omega_j}_k \quad (112)$$
for any $k \in [\omega_j]$ and $j \in [n]$, based on the proof of Proposition 1 in the supplementary materials of Runge et al. [2019]. Each subgraph $G^{\omega_j}_k$ includes only the variable $X^j_t$, the edges entering $X^j_t$ for time points $t \in \Pi^j_k$, and the corresponding parent variables $X^i_{t'} \in Pa(X^j_t)$. Given $\Pi^j = \cup_{k \in [\omega_j]} \Pi^j_k$ and $V = \cup_{j \in [n]} X^j$, we have:
$$\hat{G} = \cup_{j \in [n],\, k \in [\omega_j]} \hat{G}^{\omega_j}_k, \quad (113)$$
$$G = \cup_{j \in [n],\, k \in [\omega_j]} G^{\omega_j}_k. \quad (114)$$
On the basis of Eq.(112), we finally have $\hat{G} = G$.

E Turning Points

Given infinite samples, our estimate $\hat{\omega}_j$ (line 17 in Algorithm B1) is the exact value $\omega_j$ (see Lemma D.5). For finite samples, however, estimating $\omega_j$ by the equation in line 17 of Algorithm B1 does not perform well when $T$ is small. During the search, larger guesses $\omega$ lead to finer time partitions of $\Pi^j$, resulting in smaller subsets $\Pi^j_k$ (see line 5 in Algorithm B1). Due to the limited power of CI tests on the smaller samples given by $\Pi^j_k$, the number of false-negative edges increases. To address this issue, we introduce turning points. A turning point is a guess $\hat{\omega}$ satisfying:
$$\max_t |\widehat{Pa}_{\hat{\omega}}(X^j_t)| < \min\Big\{ \max_t |\widehat{Pa}_{\hat{\omega}-1}(X^j_t)|, \ \max_t |\widehat{Pa}_{\hat{\omega}+1}(X^j_t)| \Big\},$$
where $\widehat{Pa}_{\hat{\omega}}(X^j_t)$ is the estimated parent set of $X^j_t$ under the periodicity guess $\hat{\omega}$. See line 19 in Algorithm B1. We illustrate this with an example in Fig.3. If there are several turning points, then $\hat{\omega}_j$ is the first turning point. If there is no turning point, we obtain $\hat{\omega}_j$ using line 17 of Algorithm B1.
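The rule in the display above is easy to operationalize. Below is a minimal sketch; the function and the example sizes are hypothetical, not the paper's Algorithm B1 implementation.

```python
def first_turning_point(max_parent_sizes):
    """Return the first guess w whose max parent-set size is strictly
    smaller than that of both neighbors w-1 and w+1, or None if no
    turning point exists (then line 17 of Algorithm B1 applies instead).
    max_parent_sizes[w] = max_t |Pa-hat_w(X^j_t)| for w = 1..omega_ub;
    index 0 is an unused placeholder."""
    for w in range(2, len(max_parent_sizes) - 1):
        if max_parent_sizes[w] < min(max_parent_sizes[w - 1],
                                     max_parent_sizes[w + 1]):
            return w
    return None

# Fig.3-style example: the size dips at w = 3, so omega-hat_j = 3.
sizes = [None, 6, 5, 2, 4, 3, 2]   # guesses w = 1..6 (hypothetical)
print(first_turning_point(sizes))  # -> 3
```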
The concept of the turning point is not based on any formal theorem but rather on experimental observations. In experiments, the turning point often corresponds to a multiple of the true periodicity when $T$ is not large, owing to the limited power of CI tests on finite samples. In such cases, the causal graph can still be correct because the estimated time partition remains accurate. In these experiments, the accuracy rate is therefore calculated by counting the multiples $\{N\omega_j : N \in \mathbb{N}^+, N\omega_j \le \omega_{ub}\}$ as correct estimations.

F Computational Complexity

Executing the PCMCI algorithm on the entire time series constitutes the initial phase of the proposed approach (Algorithm B1, line 2). Its worst-case overall computational complexity is $O(n^3 \tau_{ub}^2) + O(n^2 \tau_{ub})$, as discussed in Runge et al. [2019]; here $n$ denotes the number of univariate component series and $\tau_{ub}$ the upper bound on time lags. The subsequent computational load stemming from the remaining components of our algorithm has complexity $O(\omega_{ub}^2 n^2 \tau_{ub})$. This encompasses the $O(n^2 \tau_{ub})$ complexity of conducting Momentary Conditional Independence (MCI) tests on all $n$ univariate time series; the factor $\omega_{ub}^2$ arises from the search procedure over $\omega$, iterating through values from 1 to $\omega_{ub}$ for all $n$ univariate time series. The runtime is further influenced by the scaling behavior of the CI test with the dimensionality of the conditioning set and the series length $T$. For further details, see section 5.1 in Runge et al. [2019].

G Experiments

All experiments, including those detailed in the main paper, are conducted on a single node with one core and 512 GB of memory on the Gilbreth cluster at Purdue University. Here, we describe how the metrics (F1 score, Adjacency Precision, and Adjacency Recall) are calculated in our setting. For stationary time series, the output of a causal discovery algorithm is typically an adjacency matrix with dimensions $[n, n, \tau_{\max}+1]$: within this three-dimensional binary array, a 1 signifies an edge pointing from one variable to another at a specific time lag, while a 0 indicates the absence of an edge. For instance, if element $[i, j, k]$ is 1, there is an edge from $X^i_{t-k}$ to $X^j_t$. For semi-stationary time series, due to the presence of multiple causal mechanisms, the binary edge matrix is a four-dimensional array with dimensions $[n, \Omega, n, \tau_{\max}+1]$, where $\Omega$ is defined in Eq.(7) of the main paper. This expanded binary matrix is constructed from the edge matrix of each variable $X^j_t$, $j \in [n]$, through repetition. For instance, if $\Omega = 2\omega_j$, setting the third dimension of the large binary matrix to $j$ should yield $\omega_j$ potentially different parent sets (including illusory and true parent sets), each appearing twice. We thus have two such binary arrays: the ground truth with dimensions $[n, \Omega, n, \tau_{\max}+1]$ and the algorithm's output with dimensions $[n, \hat{\Omega}, n, \tau_{\max}+1]$. If the estimate $\hat{\Omega}$ is incorrect, the two arrays have different sizes and cannot be compared directly. To solve this problem, we compute the least common multiple $\mathrm{LCM}(\Omega, \hat{\Omega})$ and create two four-dimensional binary arrays with dimensions $[n, \mathrm{LCM}(\Omega, \hat{\Omega}), n, \tau_{\max}+1]$ from the true edge array and the estimated edge array, respectively, again through repetition. The metrics are then computed by comparing the values in these two arrays.
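A sketch of this alignment-and-scoring step, assuming NumPy arrays, Python 3.9+ for `math.lcm`, and our own function names:

```python
import numpy as np
from math import lcm

def align_edge_arrays(true_edges, est_edges):
    """Expand binary edge arrays of shapes [n, Omega, n, tau_max+1] and
    [n, Omega_hat, n, tau_max+1] to a common second dimension
    LCM(Omega, Omega_hat) by repetition, so they compare entrywise."""
    L = lcm(true_edges.shape[1], est_edges.shape[1])
    t = np.tile(true_edges, (1, L // true_edges.shape[1], 1, 1))
    e = np.tile(est_edges, (1, L // est_edges.shape[1], 1, 1))
    return t, e

def adjacency_metrics(true_edges, est_edges):
    """F1 score, Adjacency Precision, and Adjacency Recall over the
    aligned binary arrays."""
    t, e = align_edge_arrays(true_edges, est_edges)
    tp = int(np.logical_and(t == 1, e == 1).sum())
    fp = int(np.logical_and(t == 0, e == 1).sum())
    fn = int(np.logical_and(t == 1, e == 0).sum())
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1, prec, rec
```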
G.1 More Discussion regarding the Case Study

As stated in the main paper, we cannot comment on the significance of the case study results. We open a door for the relevant experts: if assumptions A1–A7 are satisfied, the stationarity assumption may not hold in this real-world dataset, and such periodicity exists. If, however, the finding is incorrect from an expert's viewpoint, the following assumptions may be violated:

Assumption A4 (No Contemporaneous Causal Effects): there may be contemporaneous causal effects from $X^{ta}_t$ to $X^{cp}_t$ that the algorithm is unable to capture.

Assumption A6 (Hard Mechanism Change), combined with the limited power of CI tests: if the mechanism change in the variables is soft, the reliability of the CI test between two variables given their parents will be influenced by the skewed distribution of the parent variables. This effect is exacerbated by the fact that the sample size shrinks by a factor of $\hat{\omega}$.

We provide a sound and robust algorithm for experts in various fields who are interested in validating the presence of periodicity within the causal mechanisms specific to their domain.

G.2 Experiments on Continuous-valued Time Series with Exponential Noise

Since VARLiNGAM is a temporal extension of LiNGAM, and LiNGAM is designed for non-Gaussian data, we follow the work in Pamfil et al. [2020] and also construct experiments on continuous-valued time series with Exponential noise. As shown in Fig.4(a), the performance of PCMCIΩ, PCMCI, and VARLiNGAM is quite similar to their performance under Gaussian noise. The recall rate of DYNOTEARS, however, degrades with Exponential noise.

G.3 Experiments on Binary Time Series

Similar to the generation of continuous-valued time series, the generation of binary time series involves three steps; the main difference lies in the last two. In the third step, we simulate the conditional distribution of each child variable for every combination of parent-variable values, and then randomly draw the value of the child variable from the conditional distribution given its parents. Discrete-valued time series require a longer time length. To evaluate performance, we conduct a series of experiments following the same methodology as described in section 4.1. Fig.4(b) illustrates how overall performance varies with $\omega_{\max}$. PCMCIΩ demonstrates performance similar to PCMCI in terms of F1 score, indicating a well-balanced trade-off between precision and recall. This outcome is expected, since discrete-valued time series demand larger sample sizes and increases in $\omega_{\max}$ negatively impact the power of the MCI tests. This observation is further supported by Fig.5(a): increasing the time length $T$ from 4000 to 12000 does not lead to a significant improvement in the accuracy rate of $\hat{\omega}$, while the accuracy decreases rapidly for higher values of $\omega_{\max}$. Compared with the experiments on continuous-valued time series, it is evident that the demand for effective samples is even greater for binary time series, and the influence of increasing $\omega_{\max}$ on performance is more pronounced. Fig.5(b) shows how the performance of the algorithm varies across $\tau_{\max}$; the same trade-off between recall and precision appears.
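For concreteness, here is a hypothetical generator in the spirit of the procedure sketched in G.3; the lags, period, and conditional probability tables below are illustrative choices, not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binary_series(T, cpts, omega, tau_max):
    """Generate one binary series whose causal mechanism cycles with
    period omega: phase k uses its own conditional probability table
    over the values of two lagged parents (lags 1 and tau_max, chosen
    purely for illustration). cpts[k] maps a tuple of parent values
    to P(X_t = 1)."""
    x = rng.integers(0, 2, size=T)       # first tau_max points: pure noise
    for t in range(tau_max, T):
        k = t % omega                    # which mechanism is active
        pa = (x[t - 1], x[t - tau_max])  # parent configuration
        x[t] = rng.random() < cpts[k].get(pa, 0.5)
    return x

# Two alternating mechanisms (omega = 2) with hypothetical CPTs.
cpts = [{(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.7, (1, 1): 0.1},
        {(0, 0): 0.3, (0, 1): 0.8, (1, 0): 0.4, (1, 1): 0.6}]
x = sample_binary_series(T=4000, cpts=cpts, omega=2, tau_max=2)
print(x[:20])
```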
G.4 More Experiments on Continuous-valued Time Series

In this section, we conduct more experiments on continuous-valued time series with Gaussian noise. In Fig.6(a), we test our algorithm with and without the turning point rule; see lines 13–14 in Algorithm B1 and section E for more information on the rule. Let PCMCIΩ-TP denote the version of PCMCIΩ in which the turning point rule is used to choose $\omega$, and PCMCIΩ-non-TP the version in which the rule is not applied and $\omega$ is chosen directly according to Lemma D.5.

Figure 4: (a) F1 Score, Adjacency Precision, and Adjacency Recall as $\omega_{\max}$ varies, for experiments on continuous-valued time series with Exponential noise, lengths $T \in \{500, 2000, 8000\}$, $\tau_{\max} = 5$, and $n = 5$. (b) F1 Score, Adjacency Precision, and Adjacency Recall as $\omega_{\max}$ varies, for experiments on binary time series with lengths $T \in \{4000, 8000, 12000\}$, $\tau_{\max} = 3$, and $n = 3$.

Figure 5: PCMCIΩ is tested on 3-variate binary time series; every marker corresponds to the average accuracy rate or average running time over 100 trials. (a) Accuracy rate of $\hat{\omega}$ for different time series lengths and different $\omega_{\max}$. (b) F1 Score, Adjacency Precision, and Adjacency Recall as $\tau_{\max}$ varies, with time series length $T = 4000$, $\omega_{\max} = 3$, and $n = 3$.

Fig.6(a) shows that PCMCIΩ-non-TP and PCMCIΩ-TP perform similarly across values of $T$ and $\omega_{\max}$. With $T = 500$, PCMCIΩ-non-TP yields slightly larger standard errors on these metrics than PCMCIΩ-TP. As the time length $T$ increases, the performance of PCMCIΩ-non-TP improves consistently and becomes even slightly better than that of PCMCIΩ-TP. The consistent performance of PCMCIΩ under different rules for choosing $\omega$ supports our theoretical result: the correct periodicity leads to the sparsest causal graph.

In Fig.6(b), non-stationary time series are generated instead of semi-stationary ones, so the causal mechanisms of each univariate time series no longer appear sequentially and periodically. The proposed method performs slightly better in terms of F1 score and precision; however, its recall rate is the worst among the baselines. In Fig.6(c), we conduct experiments in the nonlinear setting; the proposed algorithms PCMCIΩ-TP and PCMCIΩ-non-TP perform best. In Fig.6(d), with $\omega_{ub} < \omega_{\max}$, the performance of the proposed algorithm is significantly worse than in the scenario where $\omega_{ub} > \omega_{\max}$; even so, the proposed algorithm can still detect a sparser graph than the other baselines. Based on these outcomes, it is advisable to keep $\omega_{ub}$ slightly high without significantly reducing the number of effective samples used in each CI test.
Figure 6: Multiple algorithms are tested on 5-variate time series ($n = 5$, $\tau_{\max} = 5$) with different time lengths $T$. Every line corresponds to a different algorithm; every marker corresponds to the average performance over 50 trials. (a) Performance with and without the turning point rule; the consistent performance of PCMCIΩ under different rules for choosing $\omega$ supports our theoretical result that the correct periodicity $\omega$ leads to the sparsest causal graph. (b) Data sets in a non-stationary setting without periodicity. (c) Non-linear structural causal model (SCM). (d) PCMCIΩ tested under $\omega_{ub} > \omega_{\max}$ and $\omega_{ub} < \omega_{\max}$, respectively.

References

Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019.

Thuc Duy Le, Tao Hoang, Jiuyong Li, Lin Liu, Huawen Liu, and Shu Hu. A fast PC algorithm for high dimensional causal discovery with multi-core PCs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(5):1483–1495, 2016.

Dimitri Bertsekas and John N. Tsitsiklis. Introduction to Probability, volume 1. Athena Scientific, 2008.

Samuel Karlin. A First Course in Stochastic Processes. Academic Press, 2014.

Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam. DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR, 2020.