Published as a conference paper at ICLR 2025

STANDARDIZING STRUCTURAL CAUSAL MODELS

Weronika Ormaniec, ETH Zürich, Switzerland, wormaniec@ethz.ch
Scott Sussex, ETH Zürich, Switzerland, ssussex@ethz.ch
Lars Lorch, ETH Zürich, Switzerland, llorch@ethz.ch
Bernhard Schölkopf, MPI for Intelligent Systems, Tübingen, Germany, bs@tuebingen.mpg.de
Andreas Krause, ETH Zürich, Switzerland, krausea@ethz.ch

ABSTRACT

Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like Var-sortability and R2-sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not Var-sortable. We also find empirical evidence that they are mostly not R2-sortable for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here. Our code is publicly available at: https://github.com/werkaaa/iscm.

1 INTRODUCTION

Predicting the effects of interventions and policy decisions requires reasoning about causality. Consequently, scientific fields ranging from biology and earth sciences to economics and statistics are interested in modeling causal structure (Pearl, 2009; Maathuis et al., 2010; Imbens and Rubin, 2015; Runge et al., 2019). A wide array of causal discovery algorithms has been proposed with the goal of inferring causal structure from data (e.g., Squires and Uhler, 2022; Vowels et al., 2022). However, benchmarking these algorithms is challenging, since real-world datasets with an agreed-upon, ground-truth causal structure are rare (e.g., Sachs et al., 2005; see Mooij et al., 2020). The community predominantly relies on synthetic data for evaluating structure learning algorithms, where observations are generated according to a predetermined causal structure and system mechanisms. The inferred causal structures can then be directly compared to the ground truth. To generate synthetic data, it is common practice to sample from structural causal models with additive noise (SCMs) (Reisach et al., 2021). Unless stated otherwise, this work considers SCMs in which the variance scale of the additive noise is the same for all variables, a typical simplification made in benchmarking.

Under common benchmarking practices, synthetic datasets generated by SCMs contain patterns that are directly exploitable to make structure discovery easier. We will refer to such patterns as artifacts. In SCMs, the pairwise correlations between variables tend to increase along the causal ordering, since variance builds up downstream and, as a result, the proportion of the variance driven by the additive noise vanishes (Figure 1a).
Reisach et al. (2024) characterize this phenomenon through an increase of the coefficients of determination (R2) of the variables regressed on all others. Crucially, this artifact occurs both in the raw data and when shifting and scaling (standardizing) the variables to have zero mean and unit variance. One of the implications is that downstream causal dependencies in SCMs become effectively deterministic, especially in large-scale systems. As Reisach et al. (2024) demonstrate, simple causal discovery baselines can perform competitively on benchmarks of this kind by directly exploiting this phenomenon. This makes SCMs alone, in their general definition, possibly insufficient for benchmarking. Ultimately, evaluating on synthetic data with these patterns could lead to conclusions that do not generalize as expected to real-world scenarios.

Figure 1: Standardizing SCMs two ways. Generative process for a chain graph of (a) standard SCMs, with data x standardized post-hoc, and (b) SCMs with standardization performed during the generative process (iSCMs). Dashed arrows indicate z-standardization. Solid arrows indicate linear functions with weights from Unif[0.5, 2.0] and additive noise from N(0, 1). We report absolute correlations |ρ| of two consecutive observed variables, (a) x^s_j and x^s_{j+1}, or (b) x̃_j and x̃_{j+1}, averaged over 100,000 models. In standard SCMs (a), correlations tend to increase along the causal ordering (|ρ| = 0.75, 0.86, 0.98, 0.98 along the chain), whereas in iSCMs (b) they remain constant (|ρ| = 0.75, 0.75, 0.75, 0.75).

In this work, we propose a simple modification of SCMs that stabilizes the data-generating process and thereby removes exploitable covariance artifacts. Our models, denoted internally-standardized SCMs (iSCMs), introduce a standardization operation at each variable during the generative process (Figure 1b). In Section 4, we provide a theoretical motivation for this idea by studying linear iSCMs. We prove that, contrary to SCMs, the causal dependencies of iSCMs under mild assumptions never collapse to deterministic mechanisms as the graph size becomes large. Moreover, we formalize the correlation artifact commonly observed in benchmarks by proving that linear SCM structures in a Markov equivalence class (MEC) are partially identifiable for certain graph classes, given weak prior knowledge on the weight distribution of the ground-truth SCM. Most importantly, we show that this is not the case for the corresponding iSCMs. In Section 5, we empirically demonstrate that the baselines proposed in Reisach et al. (2021; 2024) are unable to exploit covariance artifacts in iSCMs, while practical classes of causal discovery algorithms are still able to learn causal structures in both linear and nonlinear systems. Our findings reveal that SCM artifacts affect structure learning both positively and negatively, making iSCMs a practical tool, alongside SCMs, for disentangling the drivers of causal discovery performance of different algorithms in practice.

2 BACKGROUND AND RELATED WORK

We begin by introducing structural causal models and the problem of causal structure learning, before discussing how synthetic data is often generated for evaluating structure learning algorithms. We then review existing works that study identifiability and patterns frequently present in synthetic data.
Structural causal models. A structural causal model (SCM) (Peters et al., 2017) of d variables x = {x_1, ..., x_d} consists of a collection of structural assignments, each given by

x_i := f_i(x_{pa(i)}, ε_i) ,    (SCM)

where x_{pa(i)} ⊆ x \ {x_i} are called the parents of x_i. Here, f_i are arbitrary functions, and ε_i are independent random variables that model exogenous noise (or unexplained variation). Together, they entail a joint probability distribution p(x) over the variables x. It is common to consider SCMs with additive noise, e.g., with linear functions f_i, as given by

f_i(x_{pa(i)}, ε_i) = w_i^⊤ x_{pa(i)} + ε_i ,    (1)

where w_{i,j} ∈ ℝ denotes the weight from j ∈ pa(i) to i. The structural assignments in (SCM) induce a causal graph G = (V, E) over the variables x_i, which is assumed to be acyclic. Specifically, the directed acyclic graph (DAG) G has vertices v_i ∈ V for every x_i ∈ x and a directed edge (i, j) ∈ E if x_i ∈ x_{pa(j)}. We will explicitly distinguish this DAG G and its vertices V from the variables x. The skeleton of G denotes G with all edges undirected. If the skeleton of G is acyclic, we call G a forest.

Structure learning and benchmarking. Given a set of i.i.d. observations from the probability distribution p(x) induced by an unknown SCM, causal structure learning aims to infer the causal graph G underlying the SCM. In this work, we focus on structure learning from observational data and only consider SCMs with no latent confounders. Because it is difficult to obtain the true G for many real-world datasets, it is common to evaluate structure learning algorithms on synthetic data where G is known. A ubiquitous approach is to sample a DAG G, then SCM functions defined over G, and finally a dataset from this SCM, with the goal of later recovering G from the data. It is common to consider ε_i with mean 0 and fixed variance (often 1), and, for linear systems, to sample each w_{i,j} uniformly and i.i.d. with support bounded away from 0 (Shimizu et al., 2011; Peters and Bühlmann, 2014; Zheng et al., 2018; Yu et al., 2019; Lachapelle et al., 2020; Zheng et al., 2020; Ng et al., 2020; Reisach et al., 2021; Lorch et al., 2022; Reisach et al., 2024). There exist alternative benchmarking strategies with domain-specific simulators (Schaffter et al., 2011; Dibaeinia and Sinha, 2020).

Data standardization and artifacts of SCMs. Previous work shows that generating data as described above can lead to strong artifacts. Reisach et al. (2021) observe that the variance of variables tends to increase along the topological ordering of G. This leads to the Var-SORTNREGRESS baseline, which sorts variables based on their empirical variance and then performs sparse regression to infer G. Seng et al. (2024) show that structure learning algorithms minimizing an MSE-based loss (e.g., Zheng et al., 2018) can identify G under similar conditions. Therefore, Reisach et al. (2021) propose using standardization (Figure 1a) to remove this variance artifact from benchmarks. Specifically, they first sample all x_i according to a standard SCM and then post-hoc transform the variables as

x^s_i := (x_i − E[x_i]) / √Var[x_i] ,    (Standardized SCM)

such that our observations correspond to samples from p(x^s). Standardization, however, only removes the variance artifact. Even in standardized SCMs, the fraction of a variable's variance that is explained by all others, measured by the coefficient of determination R2, tends to increase along the topological ordering (Reisach et al., 2024). R2-SORTNREGRESS exploits this correlation artifact analogously to Var-SORTNREGRESS.
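To make these two artifacts concrete, the following minimal sketch (our illustration, not taken from the paper's repository; it assumes only numpy) samples a linear chain SCM with weights from Unif[0.5, 2.0] and unit-variance Gaussian noise, and compares the marginal variances and the R2 of each variable regressed on all others, before and after post-hoc standardization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 50_000
w = rng.uniform(0.5, 2.0, size=d - 1)        # weight on edge x_i -> x_{i+1}

x = rng.normal(size=(n, d))                  # start from the unit-variance noise terms
for i in range(1, d):
    x[:, i] += w[i - 1] * x[:, i - 1]        # linear chain SCM with additive noise

def r2_on_all_others(data):
    """R2 of each variable regressed (OLS) on all remaining variables."""
    k = data.shape[1]
    r2 = np.empty(k)
    for i in range(k):
        y = data[:, i]
        X = np.column_stack([np.ones(len(data)), np.delete(data, i, axis=1)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        r2[i] = 1.0 - (y - X @ coef).var() / y.var()
    return r2

x_std = (x - x.mean(0)) / x.std(0)           # post-hoc standardization

print("marginal variances:", np.round(x.var(0), 2))          # tend to grow along the chain
print("R2 (raw data):     ", np.round(r2_on_all_others(x), 3))
print("R2 (standardized): ", np.round(r2_on_all_others(x_std), 3))
# Standardization removes the variance artifact, but the R2 values still tend to
# increase along the causal ordering (the R2-sortability artifact).
```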
Existing heuristics aiming to avoid the variance accumulation adjust the sampling process of f_i, but they ultimately limit the causal dependencies that can be modeled, e.g., to certain levels of correlations among the observed x (Mooij et al., 2020) or a constant proportion of variance explained by the parents x_{pa(i)} (Squires et al., 2022), and they fail to induce data free from both artifacts (Appendix D.1). To our knowledge, there are currently no general methods for generating SCM data without strong correlation artifacts or significant limitations on the functions f_i and noise ε_i.

Identifiability. Given a class of SCMs, there may be several SCMs with different causal graphs G that entail the same distribution p(x) (Peters et al., 2017). Thus, even with infinite observations from p(x), we may be unable to identify the causal graph G that generated the observations. However, some identifiability results are known depending on the class of functions and noise distributions of the SCMs considered. For example, among all linear SCMs (1) with Gaussian noise ε_i ∼ N(0, σ²_i), the graph G can only be uniquely identified up to its MEC (Verma and Pearl, 2013). However, if the noise is Gaussian with equal variances σ²_i = σ² (Peters and Bühlmann, 2014) or the noise is non-Gaussian (Shimizu et al., 2006), G can be uniquely identified given p(x). In this work, we present, to our knowledge, the first (partial) identifiability result for standardized SCMs in the linear Gaussian case. Since standardization affects the implied noise scales, existing linear Gaussian identification results, which rely on σ²_i = σ², no longer hold when observing p(x^s). Other identifiability results, e.g., based on non-Gaussian noise, do continue to hold for standardized SCMs (e.g., Shimizu et al., 2006). Our result concerns a setting with prior knowledge on the magnitudes of w in Equation (1), an assumption underlying common benchmarking practices. Under this setup, we show a stark difference in the identifiability of standardized SCMs and the iSCMs we propose, which provides a novel explanation for what we empirically observe in benchmarks.

3 SCMS WITH INTERNAL STANDARDIZATION

3.1 DEFINITION

We propose internally-standardized SCMs (iSCMs) as a modification to the standard data-generating process of SCMs. An iSCM (S, P_ε) consists of d pairs of assignments, where for each i ∈ {1, ..., d},

x_i := f_i(x̃_{pa(i)}, ε_i)   and   x̃_i := (x_i − E[x_i]) / √Var[x_i] ,    (iSCM)

with parents x̃_{pa(i)} ⊆ x̃ \ {x̃_i} of x̃_i in the underlying DAG. In the above, f_i are general functions, and the exogenous noise variables ε = [ε_1, ..., ε_d] ∼ P_ε are jointly independent, as for SCMs. The variables x_i are latent, and the variables x̃_i are observed. Figure 2 illustrates the generative process. Algorithm 1 summarizes how to sample from (iSCM). If computing the population expectations and variances of x_i is intractable, the empirical statistics obtained from n samples can be used for standardization at each loop iteration of Algorithm 1.

Figure 2: Causal mechanisms in iSCMs. The function f_i modeling x_i depends on the standardized x̃_{pa(i)}. Dashing indicates z-standardization.

Algorithm 1: Sampling from an iSCM
Input: DAG G, noise distribution P_ε, functions {f_1, ..., f_d}
  π ← topological ordering of G
  for i = 1 to d do
    ε_{π_i} ∼ P_{ε_{π_i}}
    x_{π_i} ← f_{π_i}(x̃_{pa(π_i)}, ε_{π_i})
    x̃_{π_i} ← (x_{π_i} − E[x_{π_i}]) / √Var[x_{π_i}]
  return x̃_1, ..., x̃_d
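The following is a minimal Python sketch of Algorithm 1 using empirical standardization statistics; the function and variable names are ours and not part of the paper's released code.

```python
import numpy as np

def sample_iscm(parents, mechanisms, noise, n, rng):
    """Draw n joint samples from an iSCM (Algorithm 1, empirical standardization).

    parents:    dict node -> list of parent nodes, keys given in a topological order
    mechanisms: dict node -> f_i(parent_matrix_or_None, eps) -> samples of latent x_i
    noise:      dict node -> callable(n, rng) returning n exogenous noise samples
    """
    x_tilde = {}
    for i, pa in parents.items():                    # assumes topological order of keys
        eps = noise[i](n, rng)
        pa_data = np.column_stack([x_tilde[j] for j in pa]) if pa else None
        x_i = mechanisms[i](pa_data, eps)            # latent x_i
        x_tilde[i] = (x_i - x_i.mean()) / x_i.std()  # internal standardization
    return x_tilde

# Example: the chain x1 -> x2 -> x3 with linear mechanisms and N(0, 1) noise.
rng = np.random.default_rng(0)
parents = {"x1": [], "x2": ["x1"], "x3": ["x2"]}
mechanisms = {
    "x1": lambda pa, eps: eps,
    "x2": lambda pa, eps: 1.0 * pa[:, 0] + eps,
    "x3": lambda pa, eps: 2.0 * pa[:, 0] + eps,
}
noise = {k: (lambda n, r: r.normal(size=n)) for k in parents}
data = sample_iscm(parents, mechanisms, noise, n=100_000, rng=rng)
print({k: round(float(v.var()), 3) for k, v in data.items()})    # all close to 1.0
```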
Motivation. By construction, iSCMs model observed variables with zero mean and unit marginal variance. Contrary to standard SCMs, iSCMs avoid, through the standardization operation, the accumulation of variance downstream in the causal ordering that can occur in standard SCMs (see Figure 1). Because each variable x_i only depends on the standardized variables x̃_{pa(i)}, the relative scales of the noise distribution P_{ε_i} and the causal mechanisms f_i are the same everywhere in the system and do not change, for example, downstream in the causal ordering. The causal mechanisms of iSCMs are thus scale-free, in that the local interaction of mechanism f_i and noise ε_i occurs at a scale independent of the position of x_i in the global ordering. This property makes iSCMs particularly useful for benchmarking, where random ground-truth models are commonly generated from a fixed distribution over functions f_i and noise ε_i. Contrary to existing heuristics (Section 2), iSCMs model arbitrarily strong or weak causal dependencies and levels of cause-explained variance.

Interventions. Analogous to standard SCMs, interventions in iSCMs can be defined as modifications of the structural assignments f_i in (iSCM) (Figure 2), while keeping the standardization operation based on the observational distribution. When the population statistics for standardization are intractable, we first sample observational data to obtain empirical statistics. Since we do not study interventions in this work, we defer a further discussion of interventions in iSCMs to Appendix B.

Units. When modeling a physical system, the functional mechanisms in standard SCMs have to account for the difference in units between the variables for the model to be unit-covariant (see Villar et al., 2023). A side-effect of internal standardization is that variables of iSCMs become unit-less, so iSCMs obey the passive symmetry of unit covariance by construction. Therefore, iSCMs naturally model both unit-less quantities and variables measured in different units, which can make them useful beyond benchmarking. Learned iSCMs would be invariant to the units chosen by the experimenter, similar to the physical world being independent of the mathematical models chosen to describe it.

3.2 IMPLIED SCMS

It is natural to investigate whether SCMs can generate the same observations as standardized SCMs or iSCMs, given the same causal graph G and exogenous variables ε. In other words, can standardized SCMs and iSCMs be written as SCMs? For both models, the answer is yes. Specifically, we can express the generative process of x^s in (Standardized SCM) and x̃ in (iSCM) as

x^s_i = g^s_i(x^s_{pa(i)}) + θ^s_i ε_i   and   x̃_i = g̃_i(x̃_{pa(i)}) + θ̃_i ε_i ,    (2)

respectively, by moving the standardization operations into the causal mechanisms of the observables but leaving the DAG G and the variables ε unchanged. Appendix A describes how to construct these implied causal mechanisms g^s_i and g̃_i and implied noise scales θ^s_i and θ̃_i. We refer to the above SCM form of a standardized SCM or an iSCM with additive noise as their implied (SCM) model. Correspondingly, the variables of the implied SCMs have zero mean and unit variance. The notion of implied SCMs is powerful, because it enables us to analyze standardized SCMs and iSCMs as SCMs, and it sheds light on the performance of structure learning algorithms that assume unstandardized SCMs to underlie the generative process of the data (e.g., Shimizu et al., 2011; Zheng et al., 2018; Yu et al., 2019; Lachapelle et al., 2020; Zheng et al., 2020).
To provide a first characterization of standardized SCMs and iSCMs, our theoretical analyses focus on systems where the f_i are linear functions with additive, zero-mean noise, as given by Equation (1). As a stepping stone for this analysis, we use an analytical expression for the covariance of linear SCMs whose variables have unit marginal variance by construction, without any form of standardization:

Lemma 1 (Covariance in linear SCMs with unit marginal variances). Let x be modeled by a linear SCM defined by (1) with DAG G that satisfies Var[x_i] = 1. Then, the covariance Cov[x_i, x_j] is the sum of products of the weights along all unblocked paths between the nodes of x_i and x_j in G. Specifically, for any i, j ∈ {1, ..., d} such that i ≠ j, it holds that

Cov[x_i, x_j] = Σ_{p_{j→i} ∈ P_{j→i}} ∏_{(l,m) ∈ p_{j→i}} w_{l,m} ,    (3)

where P_{j→i} is the set of all unblocked paths from x_j to x_i in G, and (l, m) ∈ p_{j→i} indicates that the directed edge (l, m) is part of the path p_{j→i}.

This lemma, also called the trek rule, is originally due to Wright (1934). We give a proof in Appendix C.2. Since the implied SCMs of linear standardized SCMs and iSCMs are linear SCMs, the setting of Lemma 1 applies precisely to the SCM forms of both models. Thus, Lemma 1 enables us to study the covariances in standardized SCMs and iSCMs and, as we show next, derive conditions for the (non)identifiability of their DAGs G from the observational distribution.

4 THEORETICAL ANALYSIS

In this section, we give two theoretical results that support the suitability of iSCMs over standard SCMs for causal discovery benchmarking. First, we prove the general case of Figure 1: contrary to standardized SCMs, iSCMs do not degenerate towards deterministic implied SCM mechanisms in deep graphs. Moreover, we prove that the DAGs of linear iSCMs cannot be identified beyond their MEC, assuming the DAG is a forest, even if the support of w is known. Crucially, we also show that this is not generally true for standardized SCMs. This suggests that algorithms can less easily game benchmarks based on linear iSCMs when knowing the data-generating process. For all results, we consider linear SCMs (1) with zero-mean additive noise and equal noise variances. All results are at the population level, so we assume we know p(x^s) or p(x̃). Proofs are given in Appendix C.

4.1 BEHAVIOR WITH INCREASING GRAPH DEPTH

Standardized SCMs tend towards increasing correlations between adjacent nodes down the topological ordering. This correlation artifact makes standardized SCMs problematic for benchmarking, because it may not be a property we expect to underlie real data. Reisach et al. (2024) show, under some assumptions on w, that the dependencies in standardized SCMs become deterministic with increasing graph depth. This implies that any exogenous variation ε_i vanishes lower down in the system. Unless prior domain knowledge leads us to assume this holds in applications of interest, it may not be desirable to implicitly bias structure learning benchmarks towards such systems. For example, if the causal ordering represents time (Pamfil et al., 2020), the mechanisms of standardized SCMs are unable to model or characterize time-invariant or stable processes. Moreover, if we expect causal mechanisms to be independent (Schölkopf, 2022), the qualitative behavior of a causal mechanism should not provide information about its position in the topological ordering relative to other mechanisms, as it would in SCMs. Reisach et al. (2024) show that baselines like R2-SORTNREGRESS can perform competitively on benchmarks by exploiting this artifact (Section 2).
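As a small illustration of this depth behavior (our own sketch, assuming numpy), the following simulates post-hoc standardized linear chain SCMs of increasing depth with weight magnitudes above 1 and reports the correlation between the last two variables, which approaches 1 as the chain gets deeper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
for depth in [2, 5, 10, 20]:
    w = rng.uniform(1.0, 2.0, size=depth - 1)      # weight magnitudes above 1
    x = rng.normal(size=(n, depth))                # noise terms with Var = 1
    for i in range(1, depth):
        x[:, i] += w[i - 1] * x[:, i - 1]          # unstandardized linear chain SCM
    xs = (x - x.mean(0)) / x.std(0)                # post-hoc standardization
    rho = np.corrcoef(xs[:, -2], xs[:, -1])[0, 1]  # correlation is unchanged by scaling
    print(f"depth {depth:2d}: |corr| between last two variables = {abs(rho):.4f}")
# The correlation approaches 1 with depth, i.e., the exogenous noise of the last
# variable accounts for a vanishing fraction of its variance.
```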
iSCMs do not tend towards determinism with increasing graph depth (Figure 1b). In standardized SCMs, the correlations increase downstream because the marginal variances of the underlying SCM increase with node depth, while the noise variance is fixed (Reisach et al., 2021). Thus, for large i, the variance scale of x_{i−1} becomes large relative to the scale of ε_i, and the correlation of x_i and x_{i−1} tends towards 1. Since x^s_i and x^s_{i−1} are just standardized versions of these variables, they maintain the same correlation. iSCMs avoid this by standardizing internally, which scales the variance of any parents in a mechanism f_i to 1, modulating the relative variance of ε_i and x_{pa(i)}. In the following, we formalize this result for general graphs by bounding the fraction of cause-explained variance (CEV). The fraction of CEV for x_i is the proportion of Var[x_i] explained by its causal parents and given by

CEV_f[x_i] = 1 − Var[x_i − E[x_i | x_{pa(i)}]] / Var[x_i] .    (4)

The following result shows that we can bound the fraction of CEV for any variable in a linear iSCM:

Theorem 2 (Bound on CEV_f in linear iSCMs). Let x be modeled by a linear iSCM (1) with DAG G and additive noise of equal variances Var[ε_i] = σ². Suppose any node in G has at most m parents, and let w = max_{i,j ∈ {1,...,d}} |w_{i,j}|. Then, for any i ∈ {1, ..., d}, the fraction of CEV for x̃_i is bounded as

CEV_f[x̃_i] ≤ 1 − σ² / (m²w² + σ²) .

Since the fraction of CEV is bounded, iSCMs are guaranteed not to collapse to determinism in large systems, alleviating several of the concerns with (standardized) SCMs discussed above.
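A quick numerical check of Theorem 2 for the special case of a chain (m = 1 parent per node); this is our own sketch under the stated assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, depth, sigma2 = 500_000, 30, 1.0
w = rng.uniform(0.5, 2.0, size=depth - 1)           # |w_{i,j}| <= w_max
w_max = np.abs(w).max()

x_tilde = rng.normal(size=n)                         # root variable (unit variance)
cev = []
for i in range(1, depth):
    eps = np.sqrt(sigma2) * rng.normal(size=n)
    x_i = w[i - 1] * x_tilde + eps                   # latent x_i of the iSCM
    cev.append(1.0 - eps.var() / x_i.var())          # fraction of cause-explained variance
    x_tilde = (x_i - x_i.mean()) / x_i.std()         # internal standardization

bound = 1.0 - sigma2 / (1 ** 2 * w_max ** 2 + sigma2)   # Theorem 2 with m = 1
print(f"largest empirical CEV fraction: {max(cev):.3f}, Theorem 2 bound: {bound:.3f}")
# Up to sampling noise, every CEV fraction stays below the bound, so no variable
# becomes a deterministic function of its parents, regardless of depth.
```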
4.2 IDENTIFIABILITY

Figure 3: iSCMs with the same covariance matrix. (a) DAGs in an MEC with the same edge weights α and β: (i) x_1 → x_2 → x_3, (ii) x_1 ← x_2 → x_3, (iii) x_1 ← x_2 ← x_3. (b) Covariance matrix for all linear iSCMs in (a) when α = 1, β = 2:

  [ 1.00  0.71  0.63 ]
  [ 0.71  1.00  0.89 ]
  [ 0.63  0.89  1.00 ]

Figure 1a illustrates that the pairwise correlations in SCMs over chain graphs depend on the position in the topological ordering. This can allow algorithms like R2-SORTNREGRESS to infer the graph. By contrast, Figure 1b shows that iSCMs do not exhibit this pattern, with correlations between variables not increasing the identifiability of any part of the system. In the following, we formalize this phenomenon for forests, that is, all DAGs with acyclic skeletons (Section 2). Specifically, we prove two results concerning the identifiability of the DAG G from the observational distribution, for standardized SCMs and iSCMs. To our knowledge, this makes our finding the first identifiability result for standardized SCMs. While not every DAG is a forest, DAGs have forests as subgraphs and resemble forests as sparsity increases, thus providing us with intuition for generally sparse systems (e.g., Alon and Spencer, 2016, Chapter 11).

Our first result leverages the observation that, for standardized SCMs, many DAGs in an MEC are infeasible given p(x^s) when their edge directions are not consistent with the direction of increasing absolute covariance. To illustrate this idea, suppose our goal is to distinguish between the DAGs in the MEC G = {(i), (ii), (iii)} in Figure 3a. We overload notation and denote the weights of the edges by α and β regardless of orientation. For standardized SCMs, we can apply Lemma 1 to the implied SCM of graph (i) to obtain the covariances

Cov[x^s_1, x^s_2] = α / √(α² + 1)   and   Cov[x^s_2, x^s_3] = β √( (α² + 1) / (β²(α² + 1) + 1) ) ;

see Appendix C.4.1. Together, both expressions imply that standardized SCMs with DAG (i) satisfy

|Cov[x^s_1, x^s_2]| < |Cov[x^s_2, x^s_3]|   ⟺   α² / (α² + 1) < β² .    (5)

If |β| ≥ 1, then the right-hand side of Equation (5) is always true. In this case, the absolute covariance increases from x_1 to x_3 in all standardized SCMs with DAG (i). By symmetry, the covariance in SCMs with DAG (iii) increases from x_3 to x_1 when |α| ≥ 1. Therefore, if both weights are greater than 1, the absolute covariance increases downstream in all SCMs of (i) and (iii). This implies that, among (i) and (iii), only the DAG whose edges align with the covariance ordering in p(x^s) can induce p(x^s). Irrespectively, the DAG (ii) remains plausible. We can extend the intuition of this 3-variable example to identify almost all edges in any forest MEC:

Theorem 3 (Partial identifiability of standardized linear SCMs with forest DAGs). Let x^s be modeled by a standardized linear SCM (1) with forest DAG G, additive noise of equal variances Var[ε_i] = σ², and |w_{i,j}| > 1 for all i ∈ pa(j). Then, given p(x^s) and the partially directed graph representing the MEC of G, we can identify all but at most one edge of the true DAG G in each undirected connected component of the MEC.

Our proof of Theorem 3 considers each undirected component separately from the rest of the MEC. Hence, the identifiability result extends to undirected tree components of arbitrary, non-forest MECs as well. Theorem 3 shows that, when using standardized SCM data for benchmarking, algorithms can use pairwise correlations to orient additional edges correctly. The weights assumption of Theorem 3 is relevant to causal discovery benchmarking, because weights are often sampled i.i.d. from intervals bounded away from 0 (Section 2). Hence, empirical evaluations may render standardized linear SCMs identifiable only through the design of their weight distribution. In the following, we show that, under similar conditions, iSCMs are more difficult to identify from their MEC. In the 3-variable example above, we can show that the observational distribution of iSCMs is the same for all DAGs (i), (ii), and (iii) when the weights α and β are shared over the corresponding edges in the MEC (Figure 3b; see Appendix C.4.1). This result generalizes to forests:

Theorem 4 (Nonidentifiability of linear Gaussian iSCMs with forest DAGs). Let x̃ be modeled by a linear iSCM (1) with forest DAG G and additive Gaussian noise of equal variances Var[ε_i] = σ². Then, for every DAG G′ in the MEC of G, there exists a linear iSCM with DAG G′ that has the same observational distribution as x̃, the same noise variances, and the same weights on the corresponding edges in the MEC.

Our proof consists of showing that the covariance matrices of these systems are equal. For linear Gaussian iSCMs, this then implies that their observational distributions are identical. Theorem 4 thus shows that additional knowledge of the weight distribution in a benchmark does not allow identifying any additional edges beyond the MEC. By contrast, Theorem 3 shows that, for standardized SCMs, lower-bounding the weight magnitudes is sufficient for identifying most of the graph from its MEC. Without standardization, G is fully identified from its observational distribution under even weaker assumptions (Peters and Bühlmann, 2014). Importantly, Theorem 4 does not generalize to arbitrary graphs beyond forests. Appendix C.4.2 provides a counterexample involving a 3-node skeleton. As we study in the next section, this implies that causal structure can still be learned from nontrivial iSCMs. However, DAGs in benchmarks are often sparse, so we expect the implications of our identifiability results to capture relevant parts of empirical phenomena in benchmarking settings.
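The 3-node example of Figure 3 can be checked numerically. The sketch below (ours, assuming numpy) simulates linear Gaussian iSCMs for the chain, fork, and reversed chain with shared edge weights α = 1 and β = 2, and prints their empirical covariance matrices, which all match the matrix in Figure 3b.

```python
import numpy as np

def z(v):                                            # empirical z-standardization
    return (v - v.mean()) / v.std()

def chain_iscm(n, rng, a, b):                        # (i)   x1 -> x2 -> x3
    x1 = z(rng.normal(size=n))
    x2 = z(a * x1 + rng.normal(size=n))
    x3 = z(b * x2 + rng.normal(size=n))
    return np.column_stack([x1, x2, x3])

def fork_iscm(n, rng, a, b):                         # (ii)  x1 <- x2 -> x3
    x2 = z(rng.normal(size=n))
    x1 = z(a * x2 + rng.normal(size=n))
    x3 = z(b * x2 + rng.normal(size=n))
    return np.column_stack([x1, x2, x3])

def rev_chain_iscm(n, rng, a, b):                    # (iii) x1 <- x2 <- x3
    x3 = z(rng.normal(size=n))
    x2 = z(b * x3 + rng.normal(size=n))
    x1 = z(a * x2 + rng.normal(size=n))
    return np.column_stack([x1, x2, x3])

rng = np.random.default_rng(3)
models = [("(i) chain", chain_iscm), ("(ii) fork", fork_iscm), ("(iii) reversed chain", rev_chain_iscm)]
for name, model in models:
    cov = np.cov(model(500_000, rng, a=1.0, b=2.0), rowvar=False)
    print(name, "\n", np.round(cov, 2))
# All three print approximately [[1, .71, .63], [.71, 1, .89], [.63, .89, 1]],
# matching Figure 3b: the MEC members are indistinguishable from covariances alone.
```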
5 EXPERIMENTAL RESULTS

Our analyses suggest that iSCMs address shortcomings of naive standardization, in particular when sampling each f_i and ε_i from the same distribution, as is common in benchmarking. In this section, we now also provide empirical evidence that iSCMs do not contain the covariance artifacts of SCMs. This makes iSCMs a useful tool for disentangling, alongside SCMs, which data patterns drive causal discovery in practice. To show this, we benchmark the SORTNREGRESS baselines and a suite of popular structure learning algorithms to gain insights into how their performance varies when benchmarked on standardized SCMs and iSCMs. Appendix E provides complete details of the experimental setup.

5.1 R2-SORTABILITY

Reisach et al. (2024) introduce the R2-sortability metric to evaluate the correlation artifact underlying a dataset. R2-sortability measures (rescaled to [0, 1]) the association between the variables' causal ordering and the R2 coefficients obtained from regressing each variable onto all others (Appendix D.2). An R2-sortability close to 0.5 suggests that the R2 coefficients from regression contain no information about the causal ordering. Conversely, an R2-sortability of 0 or 1 implies that the causal ordering can be completely identified from this information. The metric gives rise to the R2-SORTNREGRESS baseline described in Section 2. Reisach et al. (2024) show that R2-sortability in SCMs is driven by an interplay of graph connectivity and the weight distribution of f_i.

Figure 4: R2-sortability for different graph sizes, shown for (a) ER(d, 2) and (b) SF(d, 2) graphs. Linear standardized SCMs and iSCMs with ε_i ∼ N(0, 1) and weights drawn from uniform distributions with supports given above each plot. For every model, we evaluate 100 systems with n = 1000 samples each. Lines and shaded regions denote mean and standard deviation. Datasets that satisfy R2-sortability = 0.5 (dashed) are not R2-sortable.

Figure 4 summarizes the R2-sortability statistics for linear SCM and iSCM data. We write ER(d, k) and SF(d, k) to denote Erdős-Rényi and scale-free graphs of size d and (expected) degree k, respectively (see Appendix E.2 for details). We find that iSCMs generate datasets that are not R2-sortable (R2-sortability ≈ 0.5) and thus artifact-free while sampling over common graph structures (e.g., Zheng et al., 2018; Yu et al., 2019; Reisach et al., 2021). Conversely, standardized SCMs generate datasets that are strongly R2-sortable (|R2-sortability − 0.5| ≫ 0). Since R2-sortability can be exploited for causal discovery, iSCM data serves as a test for evaluating whether algorithms utilize any data properties beyond the association between R2 and the causal ordering in SCMs. Our results do not exclude the possibility of iSCM configurations that still produce R2-sortable datasets. However, we show empirically that, for commonly-used G, P_ε, and w, iSCM datasets are not R2-sortable with high probability. Appendix D.1 reports the sortability metrics of the existing heuristics in Section 2, showing that neither mitigates both Var- and R2-sortability. Appendix F provides results for denser graphs.
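For intuition, the sketch below (ours) implements a simplified edge-level proxy of R2-sortability; the exact, path-based metric is defined in Reisach et al. (2024) and Appendix D.2 and is what Figure 4 reports. The example compares a post-hoc standardized chain SCM with an iSCM built from the same weights and noise.

```python
import numpy as np

def r2_scores(data):
    """R2 of each column regressed (OLS) on all other columns."""
    k = data.shape[1]
    scores = np.empty(k)
    for i in range(k):
        y = data[:, i]
        X = np.column_stack([np.ones(len(data)), np.delete(data, i, axis=1)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        scores[i] = 1.0 - (y - X @ coef).var() / y.var()
    return scores

def edge_r2_sortability(data, edges):
    """Fraction of directed edges (i, j) with R2_i < R2_j; 0.5 means no signal."""
    r2 = r2_scores(data)
    return float(np.mean([r2[i] < r2[j] for i, j in edges]))

# Compare a post-hoc standardized chain SCM with an iSCM sharing weights and noise.
rng = np.random.default_rng(4)
n, d = 50_000, 15
w = rng.uniform(0.5, 2.0, size=d - 1)
edges = [(i, i + 1) for i in range(d - 1)]

scm = np.empty((n, d)); iscm = np.empty((n, d))
root = rng.normal(size=n)
scm[:, 0], iscm[:, 0] = root, (root - root.mean()) / root.std()
for i in range(1, d):
    eps = rng.normal(size=n)
    scm[:, i] = w[i - 1] * scm[:, i - 1] + eps
    latent = w[i - 1] * iscm[:, i - 1] + eps
    iscm[:, i] = (latent - latent.mean()) / latent.std()
scm = (scm - scm.mean(0)) / scm.std(0)               # post-hoc standardization

print("standardized SCM:", edge_r2_sortability(scm, edges))   # typically close to 1
print("iSCM:            ", edge_r2_sortability(iscm, edges))  # typically close to 0.5
```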
5.2 STRUCTURE LEARNING

Under the same weight and noise distributions, standardized SCMs and iSCMs have different implied SCMs and generate qualitatively different datasets. Here, we study how this affects causal structure learning in practice. For this, we evaluate Var- and R2-SORTNREGRESS (SR) (Reisach et al., 2021; 2024) as well as an SR variant that uses a random ordering (Random SR). In addition, we evaluate representative algorithms from several approaches to learning structure from (co)variance information. NOTEARS by Zheng et al. (2018) leverages continuous optimization to minimize an MSE loss, which is affected by noise scaling (Loh and Bühlmann, 2014; Seng et al., 2024). GOLEM (Ng et al., 2020), similar to NOTEARS, formulates causal discovery as a continuous optimization problem; its EV and NV versions assume equal and potentially unequal noise scales, respectively. CAM (Bühlmann et al., 2014) searches over causal orderings and performs sparse nonlinear regression to find the parents, while also estimating the noise scales. PC (Spirtes and Glymour, 1991) and GES (Chickering, 2002) are approaches based on statistical independence testing and greedy search, respectively. Finally, AVICI by Lorch et al. (2022) predicts graphs using a model pretrained on simulated data and is thus optimized to exploit any artifacts that improve predictive accuracy. To investigate its susceptibility to artifacts, we evaluate the public model checkpoints trained on standardized SCMs.

Figure 5 summarizes the results for linear and nonlinear systems with Gaussian noise (see Figure 18, Appendix F for non-Gaussian systems). The nonlinear mechanisms f_i are samples from a Gaussian process with squared exponential kernel. As expected, Var-SORTNREGRESS performs best when SCMs are not standardized. Likewise, R2-SORTNREGRESS performs better on SCMs and standardized SCMs, as iSCMs have R2-sortability close to 0.5 (Section 5.1). AVICI shows the same trend, suggesting it may indeed be exploiting the correlation artifacts present in its training distribution. Like Reisach et al. (2021), we find that NOTEARS performs best on unstandardized data. However, and more interestingly, NOTEARS also performs better on iSCMs than on standardized SCMs, especially in linear and larger systems. As we investigate later on, this gap may be explained by the fact that the implied models of standardized SCMs violate the assumptions of NOTEARS more strongly than those of iSCMs. Overall, we find that GOLEM-EV shows the same patterns as NOTEARS, severely underperforming on standardized SCMs and slightly improving in predictive accuracy on iSCMs. CAM and GOLEM-NV, which do not assume equal noise scales, perform equally well or better on standardized data, respectively, and generally better on iSCMs. The poor performance of GOLEM-NV on unstandardized SCMs was also observed by Reisach et al. (2021).

Figure 5: Structure learning performance on SCM and iSCM data. F1 scores for recovering the edges of the true graph (legend: SCM, standardized SCM, iSCM). Box plots show median and interquartile range (IQR). Whiskers extend to the largest value inside 1.5 IQR from the boxes. Left (right) column shows results for linear (nonlinear) causal mechanisms with additive noise ε_i ∼ N(0, 1) and w_{i,j} ∼ Unif[0.5, 2.0] (Appendix E). For every model, we evaluate 20 systems each using n = 1000 data points.
In addition, for approaches based on discrete search, we find that, in particular on large systems, the PC and GES algorithms perform better on iSCMs. Overall, performance differences tend to be more pronounced for linear systems, where the downstream variance accumulation in SCMs is unbounded. Appendix F reports the results for the structural Hamming distance (SHD) and different weight ranges.

Figure 6: Implied noise. Bottom panel shows the distribution over inverse implied noise scales 1/θ²_i in the implied SCMs for ER(100, 2) graphs (kernel density estimates; legend: original SCM, implied noise scales of the standardized SCM, implied noise scales of the iSCM; x-axis: 1/θ²_i on a logarithmic scale). Lines and shading denote mean and standard deviation. Top panel shows the performance of NOTEARS on systems with these noise scale statistics but the same Var-sortability as SCMs (see Appendix E.2 and E.5).

Properties of the implied SCMs. When standardizing SCM data, the implied SCM corresponds to the SCM that could have generated the observations. Therefore, algorithms assuming that unstandardized SCMs generated the data will be susceptible to any assumption violations of the implied SCM, such as assumptions about the exogenous noise. Figure 6 (bottom) shows the distribution of inverse implied noise scales 1/θ²_i for the variables of the implied models (see Equation 2). Since Var[ε_i] = 1 in our experiments, these inverse squared noise scales are equal to the inverse variances of the full additive noise terms. We find that standardized SCMs induce inverse noise scales that are orders of magnitude greater than those of iSCMs. This distribution is essentially the footprint of the determinism in the depth limit discussed in Section 4.1. This observation also provides empirical support for our earlier explanation for the improved performance of the PC algorithm on iSCM data. The modes at 1/θ²_i = 1 and at 1/θ²_i > 1 in the iSCM plot correspond to root and non-root nodes, respectively.

Figure 6 (top) shows the performance of NOTEARS when isolating the noise properties of the implied models from the fact that standardized SCMs and iSCMs are not Var-sortable. For this, we construct SCMs that have the marginal variances (and Var-sortability, here 0.99 on average) of unstandardized SCMs but the noise variances of the implied models, by correcting their weights (see Appendix E.5). NOTEARS performs better in such systems, suggesting that (i) the noise statistics may indeed explain the performance difference on iSCM data, and (ii) Var-sortability may not be the only reason why NOTEARS performs significantly worse on standardized data (Reisach et al., 2021). Conversely, when the weight ranges of (standardized) SCMs are smaller, the phenomenon of exploding marginal variances is less pronounced (Figure 19 in Appendix F.3). In this case, we indeed find that NOTEARS performs similarly on standardized SCMs and iSCMs (Figure 15, left, in Appendix F.1). This sheds light on previous benchmarking results in which MSE-based algorithms perform below expectations, even though those evaluations perhaps did not intend to assess the algorithms under model mismatch (e.g., Reisach et al., 2021; Kaiser and Sipos, 2021). For the MSE loss, Loh and Bühlmann (2014) and Seng et al. (2024) show that smaller ratios of noise variances increase the magnitude of weights required for the true DAG to be the unique minimizer. The MSE loss ultimately does not account for the inverse variance factor in the Gaussian noise likelihood.
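To illustrate the quantity plotted in Figure 6 (bottom), the following sketch (ours) evaluates the closed-form implied noise scales of Appendix A for a linear chain with unit-variance noise: 1/θ²_i equals Var[x_i] of the corresponding latent variable, which explodes with depth for the standardized SCM but stays bounded for the iSCM.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 100
w = rng.uniform(0.5, 2.0, size=d - 1)                   # chain weights, Var[eps_i] = 1

var_scm, var_iscm = np.ones(d), np.ones(d)
for i in range(1, d):
    var_scm[i] = w[i - 1] ** 2 * var_scm[i - 1] + 1.0   # variance keeps accumulating
    var_iscm[i] = w[i - 1] ** 2 * 1.0 + 1.0             # parent already has unit variance

# Implied noise scale theta_i = 1 / sqrt(Var[x_i])  =>  1 / theta_i^2 = Var[x_i]
print("max 1/theta^2, standardized SCM:", f"{var_scm.max():.2e}")
print("max 1/theta^2, iSCM:            ", f"{var_iscm.max():.2f}")
# The standardized SCM's implied noise variances shrink towards zero downstream,
# while those of the iSCM stay within a fixed range (here between 1/5 and 1).
```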
Overall, the statistics of the implied models of standardized SCMs are empirically further from SCMs with equal noise variances than their iSCM counterparts.

6 CONCLUSIONS

We describe the iSCM, a one-line modification of the SCM that modulates the scale of interaction between the causal mechanism f_i and the noise ε_i at each variable x_i. Through several theoretical and experimental results, we study its properties in relation to standard SCMs and its ramifications for benchmarking causal discovery algorithms. To conclude, we highlight the following key takeaways:

Standardizing during the generative process removes sortability artifacts. When the functions f_i and the noise ε_i are, for example, sampled i.i.d. for each variable x_i, SCMs exhibit artifacts that are not removed when shifting and scaling the generated data. Our results in Section 5 show that iSCMs are effective at removing Var- and R2-sortability. This makes iSCMs a useful complement to structure learning benchmarks with SCMs, enabling a specific evaluation of the ability of algorithms to transfer to real-world settings that do not exhibit R2 artifacts. Despite the removed sortability artifacts, causal discovery algorithms are able to infer nontrivial structure from iSCM data (Figure 5).

Standardizing post-hoc can lead to partial identifiability and degenerate implied SCMs. Scaling the units of SCM data is not innocuous. Theorem 3 shows that mild knowledge of the distribution of f_i can identify edges in standardized SCMs that are typically not identifiable from observational data. To our knowledge, our result is the first concerning the identifiability of G from the standardized observational distribution of linear SCMs. This may make benchmarks, where similar assumptions on f_i often hold, trivial under standardized SCMs. Moreover, Figure 6 shows that standard SCMs can collapse to modeling near-zero exogenous noise. Theorems 2 and 4 demonstrate that neither property appears in the analogous iSCMs. Ultimately, (non)identifiability may be either a feature or a bug, depending on whether assumptions are verifiable in practice or known a priori during evaluation.

iSCMs are stable and scale-free, making them useful models beyond benchmarking. Beyond data generation, the stable generative process of iSCMs might also provide insights for modeling, e.g., large, temporal (Kilian, 2013; Pamfil et al., 2020), or physical systems. In iSCMs, the scale of a causal mechanism f_i and its unexplained variation ε_i are both unit-less and independent of its position in the causal ordering (Section 3). If we think of each structural assignment as a physical mechanism, energy conservation must be respected, since a mechanism can only output as much energy as it receives from its inputs (including unexplained noise). Standardization may thus not be completely unrealistic, since it naturally bounds the output scale of every mechanism. Since each iSCM implies a standard SCM, iSCMs can also be viewed as a reparameterization of SCMs that facilitates modeling and learning the functions f_i on the same scale, e.g., under a shared prior or level of regularization. Conceptually, iSCMs are related to batch normalization (Ioffe and Szegedy, 2015), a technique that stabilizes the optimization of neural networks, which, like SCMs, compose sequences of functions, by adding internal standardization. Overall, these properties may make the iSCM a useful structural equation model beyond the benchmarking problem studied here.
Published as a conference paper at ICLR 2025 REPRODUCIBILITY STATEMENT To facilitate reproducibility, we provide code, configuration files, and the commands used to obtain all the experimental results in this manuscript as supplementary material. They are also available at: https://github.com/werkaaa/iscm. In Appendix E , we describe the experimental setup, including the computational resources and wall time used to produce the results. Finally, we provide detailed proofs of our theoretical results in Appendix C. ACKNOWLEDGEMENTS This research was supported by the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation program grant agreement no. 815943 and the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545. This work was also supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, EXC number 2064/1, project number 390727645. Alon, N. and Spencer, J. H. (2016). The probabilistic method. John Wiley & Sons. 6 Andersson, S. A., Madigan, D., and Perlman, M. D. (1997). A characterization of markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505 541. 25 Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. science, 286(5439):509 512. 31 Bühlmann, P., Peters, J., and Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. 8 Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507 554. 8 Dibaeinia, P. and Sinha, S. (2020). SERGIO: a single-cell expression simulator guided by gene regulatory networks. Cell systems, 11(3):252 271. 3 Erd os, P. and Rényi, A. (1959). On random graphs. Publicationes Mathematicae, 6:290 297. 31 Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge university press. 1 Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448 456. pmlr. 10 Kaiser, M. and Sipos, M. (2021). Unsuitability of NOTEARS for causal graph discovery. ar Xiv preprint ar Xiv:2104.05441. 10 Kalainathan, D., Goudet, O., and Dutta, R. (2020). Causal discovery toolbox: Uncovering causal relationships in python. Journal of Machine Learning Research, 21(37):1 5. 32 Kilian, L. (2013). Structural vector autoregressions. In Handbook of research methods and applications in empirical macroeconomics, pages 515 554. Edward Elgar Publishing. 10 Lachapelle, S., Brouillard, P., Deleu, T., and Lacoste-Julien, S. (2020). Gradient-based neural DAG learning. In International Conference on Learning Representations. 3, 5 Loh, P.-L. and Bühlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065 3105. 8, 10 Lorch, L., Sussex, S., Rothfuss, J., Krause, A., and Schölkopf, B. (2022). Amortized inference for causal structure learning. Advances in Neural Information Processing Systems, 35:13104 13118. 3, 8, 30, 31, 32 Published as a conference paper at ICLR 2025 Maathuis, M. H., Colombo, D., Kalisch, M., and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature methods, 7(4):247 248. 1 Meek, C. (1995). 
Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 403 410. 25 Mooij, J. M., Magliacane, S., and Claassen, T. (2020). Joint causal inference from multiple contexts. The Journal of Machine Learning Research, 21(1):3919 4026. 1, 3, 29, 30 Ng, I., Ghassami, A., and Zhang, K. (2020). On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943 17954. 3, 8, 32 Pamfil, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., and Aragam, B. (2020). DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595 1605. Pmlr. 5, 10 Pearl, J. (2009). Causality. Cambridge university press. 1 Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219 228. 3, 7 Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms. The MIT Press. 2, 3, 17 Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Advances in neural information processing systems, 20. 31 Reisach, A., Seiler, C., and Weichwald, S. (2021). Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34:27772 27784. 1, 2, 3, 5, 8, 9, 10, 30, 32, 33 Reisach, A., Tami, M., Seiler, C., Chambaz, A., and Weichwald, S. (2024). A scale-invariant sorting criterion to find a causal order in additive noise models. Advances in Neural Information Processing Systems, 36. 1, 2, 3, 5, 7, 8, 29, 30, 32 Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., et al. (2019). Inferring causation from time series in earth system sciences. Nature communications, 10(1):2553. 1 Sachs, K., Perez, O., Pe er, D., Lauffenburger, D. A., and Nolan, G. P. (2005). Causal proteinsignaling networks derived from multiparameter single-cell data. Science, 308(5721):523 529. 1 Schaffter, T., Marbach, D., and Floreano, D. (2011). Gene Net Weaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263 2270. 3 Schölkopf, B. (2022). Causality for machine learning. In Probabilistic and causal inference: The works of Judea Pearl, pages 765 804. 5 Seng, J., Zeˇcevi c, M., Dhami, D. S., and Kersting, K. (2024). Learning large DAGs is harder than you think: Many losses are minimal for the wrong DAG. In The Twelfth International Conference on Learning Representations. 3, 8, 10 Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., and Jordan, M. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10). 3, 39 Shimizu, S., Inazumi, T., Sogawa, Y., Hyvarinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., Bollen, K., and Hoyer, P. (2011). Direct Li NGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research-JMLR, 12(Apr):1225 1248. 3, 5 Spirtes, P. and Glymour, C. (1991). An algorithm for fast recovery of sparse causal graphs. Social science computer review, 9(1):62 72. 8 Published as a conference paper at ICLR 2025 Squires, C. and Uhler, C. (2022). Causal structure learning: A combinatorial perspective. 
Foundations of Computational Mathematics, 23(5):1781 1815. 1 Squires, C., Yun, A., Nichani, E., Agrawal, R., and Uhler, C. (2022). Causal structure discovery between clusters of nodes induced by latent factors. In Conference on Causal Learning and Reasoning, pages 669 687. PMLR. 3, 29, 30 Verma, T. S. and Pearl, J. (2013). On the equivalence of causal models. 3 Villar, S., Hogg, D. W., Yao, W., Kevrekidis, G. A., and Schölkopf, B. (2023). Towards fully covariant machine learning. ar Xiv preprint ar Xiv:2301.13724. 4 Vowels, M. J., Camgoz, N. C., and Bowden, R. (2022). D ya like DAGs? A survey on structure learning and causal discovery. ACM Computing Surveys, 55(4):1 36. 1 Wienöbst, M., Luttermann, M., Bannach, M., and Liskiewicz, M. (2023). Efficient enumeration of markov equivalent dags. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12313 12320. 17 Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5(3):161 215. 5 Yu, Y., Chen, J., Gao, T., and Yu, M. (2019). DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154 7163. PMLR. 3, 5, 8 Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. Advances in neural information processing systems, 31. 3, 5, 8, 30, 32 Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. (2020). Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pages 3414 3425. Pmlr. 3, 5 Published as a conference paper at ICLR 2025 A Implied Models 15 A.1 Implied Model of a Standardized SCM . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2 Implied Model of an i SCM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.3 Weights of the Implied Model of a Linear i SCM . . . . . . . . . . . . . . . . . . . 16 B Interventions in i SCMs 17 C Proofs 17 C.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.2 Explicit Covariance in Linear SCMs with Unit Marginal Variances . . . . . . . . . 17 C.3 Bound on the Fraction of CEV . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C.4 Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 D Background on Related Work 28 D.1 Heuristics for Mitigating Variance Accumulation . . . . . . . . . . . . . . . . . . 28 D.2 Sortability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 D.3 Structure Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 E Experimental Setup 31 E.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 E.2 Experiment Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 E.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 E.4 Hyperparameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.5 Transferring Noise Variances While Keeping Var-Sortability Unchanged . . . . . . 33 E.6 Compute Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 F Additional Experimental Results 39 F.1 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 F.2 R2-Sortability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 F.3 Implied Noise Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 40 F.4 Covariance Matrices for Figure 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Published as a conference paper at ICLR 2025 A IMPLIED MODELS In this section, we describe how to express the assignments of the observed variables of standardized SCMs and i SCMs with a general additive noise mechanism fi(x, εi) = fi(x) + εi , (6) in the form of (SCM), while sharing the same causal graph G and exogenous noise variables ε. We obtain the SCM form by moving the standardization steps into the causal mechanisms by linearly rescaling fi and εi, such that each observed variable is only a function of observed variables and the noise εi. Throughout this work, the implied (SCM) model denotes the specific construction given in the following two subsections. For this, we assume that we can express the first two moments of the system in closed form. Similar to the main text, we overload notation for both standardized SCMs and i SCMs and write µi := E[xi] and si := p Var[xi] . We also derive analytic expressions for the weights of the implied models of linear i SCMs defined by Equation (1), which we later use in our proofs. A.1 IMPLIED MODEL OF A STANDARDIZED SCM Let xs be modeled by (Standardized SCM) with causal mechanisms defined by Equation (6). We recall that xs are the observations obtained after standardizing x. Thus, we can rearrange xs i as xi = sixs i + µi and substitute every unstandardized variable xi by a function of its standardized parents xs pa(i) as xs i = xi µi si = fi(xpa(i)) + εi µi si = fi(xs pa(i) spa(i) + µpa(i)) µi where denotes elementwise multiplication, and µpa(i) and spa(i) are the vectors of the parent means and standard deviations before standardization. Thus, the assignments of xs in a standardized SCM can be written as the SCM given by xs i = gs i (xs pa(i)) + θs i εi , with implied noise scales θs i := 1/si and implied causal mechanisms gs i (xs pa(i)) := fi(xs pa(i) spa(i) + µpa(i)) µi si if i is a non-root variable, and si if i is a root variable. A.2 IMPLIED MODEL OF AN ISCM Let ex be modeled by (i SCM) with causal mechanisms defined by Equation (6). In an i SCM, ex are the observed variables and x are the latent variables. We can express every observation exi in terms of its observed parents expa(i) as exi = xi µi si = fi(expa(i)) + εi µi si = fi(expa(i)) µi Thus, the assignments of ex in a i SCM can be written as the SCM given by exi = egi(expa(i)) + eθiεi , with implied noise scales eθi := 1/si and implied causal mechanisms egi(expa(i)) := fi(expa(i)) µi si if i is a non-root variable, and si if i is a root variable. Published as a conference paper at ICLR 2025 A.3 WEIGHTS OF THE IMPLIED MODEL OF A LINEAR ISCM Here, we derive the analytical form for the mechanisms of the implied model of a linear i SCM with zero-centered, additive noise εi. This i SCM is given by xi := w T i expa(i) + εi and exi := xi p where εi satisfies E[εi] = 0 and Var[εi] = σ2 i . We can write the above as exi = w T i expa(i) + εi p j pa(i) wj,iexj + εi p Var[xi] = X Var[xi] exj + 1 p Var[xi] εi . It follows that the implied SCM of a linear i SCM is also linear, with weights and noise variances given by ewj,i = wj,i p Var[xi] and eσ2 i = σ2 i Var[xi] . (7) In the above, we can write the variance of xi explicitly as Var[xi] = Var X j pa(i) wj,iexj + εi j pa(i) wj,iexj j pa(i) Cov[wk,iexk, wj,iexj] + σ2 i j pa(i) wk,iwj,i Cov[exk, exj] + σ2 i , where 1 follows from Bienaymé s identity and 2 from covariance being bilinear. 
Substituting the variance into the expressions for the weights and noise variances, we obtain ewj,i = wj,i q P j pa(i) wk,iwj,i Cov[exk, exj] + σ2 i , (9) eσ2 i = σ2 i P j pa(i) wk,iwj,i Cov[exk, exj] + σ2 i . (10) Finally, by construction, the variables ex of an i SCM have unit marginal variances. Thus, when the parents of exi are pairwise independent, Equation (10) simplifies to ewj,i = wj,i q P j pa(i) w2 j,i + σ2 i . (11) This independence condition always holds when the DAG G is a forest. Efficient computation We can efficiently compute the implied model weights using a bottom-up dynamic programming approach. This allows sampling data directly from the exact implied model of an i SCM without resorting to empirical standardization statistics. Algorithm 2 describes the procedure. We iteratively compute the weights and noise variances of the implied model following Equations (9) and (10). At each iteration, we update the covariance matrix according to Lemma 1. The algorithm processes the nodes in topological order, mirroring the proof by induction of Lemma 1. Published as a conference paper at ICLR 2025 Algorithm 2 Computing the Implied Model Parameters of Linear i SCMs Input: DAG G, weight matrix [W]i,j := wi,j, noise variances σ2 Rd + f W 0d d Σ Id π topological ordering of G for i = 1 to d do w W:,πi Edge weights ingoing to πi Var[xπi] w Σw + σ2 πi Equation (8) f W:,πi w/ p Var[xπi] Equation (9) eσ2 πi σ2 πi/ Var[xπi] Equation (10) for j = 1 to i do Σπj,πi (Σπj,:) f W:,πi Σπi,πj Σπj,πi return implied weights f W, implied noise variances eσ2 B INTERVENTIONS IN ISCMS For an i SCM (S, Pε), we can formalize interventions as changes to its causal mechanisms fi, analogous to the common definition for SCMs (Peters et al., 2017). Specifically, let µi := E[xi] and si := p Var[xi] be the mean and standard deviation of the latent variable xi. We define an intervention as replacing one (or several) of the assignments to the latent variables as xi := hi(expa(i), εi), for some function hi. Importantly, the statistics µi and si used for the standardization operation exi := xi µi remain unchanged. Thus, if we intervene on mechanisms of i SCMs, the variables ex may no longer have zero mean and unit variance, and the perturbations of xi propagate downstream through the causal mechanisms. We note that, under the above definition, intervening on an i SCM through a new mechanism hi is equivalent to intervening on the implied SCM of an i SCM with the mechanism ehi(x, ε) = hi(x, ε) µi Appendix A.2 provides details on the implied models of i SCMs. C.1 DEFINITIONS We define the key concepts used throughout our analysis. A path pj i between vi and vj is a set of directed edges that allows reaching vi from vj (and vice versa), not taking into account edge directionality, and that joins unique vertices. We call a node a collider in a path if the node has two ingoing directed edges in the path. We say that a path between vi and vj is unblocked if and only if there is no node vk that is a collider in the path (see Figure 10a). Finally, we use the term undirected connected component to refer to any maximal subgraph of G in which any two nodes are connected by a path containing only undirected edges (Wienöbst et al., 2023). C.2 EXPLICIT COVARIANCE IN LINEAR SCMS WITH UNIT MARGINAL VARIANCES Lemma 1 (Covariance in linear SCMs with unit marginal variances). Let x be modeled by a linear SCM defined by (1) with DAG G that satisfies Var[xi] = 1. 
Then, the covariance Cov[xi, xj] is the sum of products of the weights along all unblocked paths between the nodes of xi and xj in G. Published as a conference paper at ICLR 2025 Figure 7: Lemma 1 inductive step. If vj is before vi in the topological ordering, then all unblocked paths from vj to vi must contain a parent of vi as the second to last node. To see this, suppose an unblocked path from vj to vi would instead contain a child of vi as the last node. Then, there either exists a collider on the path to vj, contradicting that the path is unblocked, or all edges in the path point away from vi, implying that vj is a descendant of vi and contradicting the topological ordering. Dotted lines represent unblocked paths (which may have common nodes). Solid lines represent edges. vj may or may not be a parent of vi, which we illustrate with a blue arrow. Specifically, for any i, j {1, ..., d} such that i = j, it holds that Cov[xi, xj] = X (l,m) pj i wl,m , (3) where Pj i are all unblocked paths from xj to xi in G, and (l, m) pj i indicates that the directed edge (l, m) is part of the path pj i. Proof. We will give a proof by induction on the number of vertices d = |V| in the DAG G. Without loss of generality, we assume that the indices of the nodes are ordered according to some fixed topological ordering π, so π(j) < π(i) if j < i. By the unit marginal variance assumption, Cov[xi, xi] = Var[xi] = 1 . (12) From now on and without loss of generality, we consider two arbitrary indices j < i. The covariance between xi and xj is symmetric. Base case (d = 2) If vj is not an ancestor of vi in graph G, they both must be root nodes, because the edge vi vj is the only possible edge when π(j) < π(i). Since xi and xj are root nodes, they are independent and Cov[xi, xj] = 0. Since a path of one edge cannot contain a collider, there are no unblocked paths between vi and vj, so the RHS of Equation (3) is also 0. Conversely, if vj is an ancestor of vi in graph G, vj is the only parent and ancestor of vi. This implies that Cov[xi, xj] = Cov[wj,ixj + εi, xj] = wj,i Cov[xj, xj] = wj,i , where the last equality follows from Equation (12). This is exactly Equation (3) for a two-node graph. Induction step (d > 2) Let us assume that Equation (3) holds for all graphs of size d 1, and let G have d nodes. We will apply the inductive hypothesis to the subgraph of the first d 1 nodes in G and show that the full DAG G including the d-th vertex still satisfies Equation (3). First, we note that, since the d-th vertex is last in the topological ordering, it has no outgoing edges. Because the node has no outgoing edges, it is not visited on any unblocked paths between vj and vi for i, j < d, as vd must be a collider in any path. Second, adding the node vd to a subsystem containing x1, . . . , xd 1 results in no change to the joint distribution of xi, xj. Therefore, it has no effect on the covariance between xi, xj. Hence, both sides of Equation (3) are unchanged by the presence of a node vd for all i, j < d and the equation still holds for all i, j < d. We want to show that Equation (3) also holds for i = d and any j < i. For this, we first construct all unblocked paths from vj to vi. First, we note that any unblocked path must go through the parents Published as a conference paper at ICLR 2025 k pa(i), because j < i in the topological ordering (see Figure 7). Moreover, for any k pa(i), appending k i to an unblocked path pj k between vj and vk, creates a new unblocked path between vj and vi. 
Hence, for i = d and any j < i, it holds that Cov[xi, xj] = Cov[ X k pa(i) wk,ixk + εi, xj] k pa(i) wk,i Cov[xk, xj] 1 = wj,i Cov[xj, xj] + X k pa(i)\j wk,i Cov[xk, xj] 2 = wj,i + X k pa(i)\j wk,i X (l,m) pj k wl,m pj k Pj k wk,i Y (l,m) pj k wl,m 1[k = j]wj,i + 1[k = j] pj k Pj k wk,i Y (l,m) pj k wl,m (l,m) pj i wl,m . For step 1 , consider two cases. If j / pa(i), then wj,i = 0 and the equality trivially holds. If j pa(i), then it holds by pulling the term for j out of the sum in the previous line. In 2 , we apply the inductive hypothesis to express the covariances in terms of a sum of products of weights. In 3 , we rearrange terms to pull the wj,i term into the sum over parents. In 4 , we use the fact that the set of unblocked paths from vj to vi corresponds to all paths from vj to any parent of vi, which is vk here, with an extra edge k i appended, and a possible single-edge path directly connecting vj with vi (if j pa(i)). This completes the induction step and the proof. C.3 BOUND ON THE FRACTION OF CEV Theorem 2 (Bound on CEVf in linear i SCMs). Let x be modeled by a linear i SCM (1) with DAG G and additive noise of equal variances Var[εi] = σ2. Suppose any node in G has at most m parents and w = maxi,j {1,...,d}|wi,j|. Then, for any i {1, ..., d}, the fraction of CEV for exi is bounded as CEVf[exi] 1 σ2 m2w2 + σ2 . Proof. We begin by bounding the variance of the latent variables xi in i SCMs. Starting from Equation (8), we can bound the covariances with a product of unit variances as Var[xi] = X j pa(i) wk,iwj,i Cov[exj, exk] + σ2 j pa(i) wk,iwj,i + σ2 j pa(i) wj,i 2 + σ2 2 m2w2 + σ2 , Published as a conference paper at ICLR 2025 where 1 uses Cov[exj, exk] 1 since Var[exj] = 1 and Var[exk] = 1, and 2 applies the Cauchy Schwartz inequality. Since we obtain exi from xi just by shifting and scaling the latter, we observe that CEVf[ exi] = CEVf[xi]. Using the upper bound on the variance of xi and the definition of the fraction of cause-explained variance in Equation (4)), we get CEVf[ exi] = CEVf[xi] = 1 Var[xi E[xi|xpa(i)]] Var[xi] = 1 Var[xi w i xpa(i)] Var[xi] = 1 Var[εi] Var[xi] = 1 σ2 Var[xi] 1 σ2 m2w2 + σ2 . C.4 IDENTIFIABILITY In this section, we prove Theorems 3 and 4. We begin by deriving the covariances for the 3-node example in Section 4.2 and then give the general proofs for forests. The proofs of both theorems share the same underlying argument. We first derive the SCM forms of the original models, i.e., standardized SCMs in Theorem 3 and i SCMs in Theorem 4. By showing that the standardized SCMs and i SCMs are SCMs with the same causal graphs G and observational distributions p(x), we can leverage Lemma 1 to obtain the covariances between the observed variables in both model classes. Ultimately, these covariances allow us to derive (non)identifiability conditions for the DAGs G in an MEC underlying the original models. Theorems 3 and 4 assume that the exogenous noise is sampled from a zero-centered distribution with equal variance across variables. Since the results are based on the analysis of covariances, they also hold with the assumption that E[εi] = 0, but the zero-mean assumption simplifies notation. To derive the results for i SCMs, we additionally assume that the noise is Gaussian (see Theorem 4) . When referring to an undirected edge between nodes vi, vj, for example, in an MEC, we still denote the edge with (vi, vj), but the ordering of the nodes is arbitrary. C.4.1 3-NODE CASE We begin by studying the 3-node example of Figure 3 in Section 4.2. 
Let αi, βi, γi, λi R be linear function weights, and consider the following three causal graphs G belonging to the same MEC, along with their corresponding SCMs and i SCMs. G SCM i SCM v1 v2 v3 x1 := ε1 x2 := α1x1 + ε2 x3 := β1x2 + ε3 (13) x1 := ε1 x2 := γ1ex1 + ε2 x3 := λ1ex2 + ε3 v1 v2 v3 x1 := α2x2 + ε1 x2 := ε2 x3 := β2x2 + ε3 (15) x1 := γ2ex1 + ε1 x2 := ε2 x3 := λ2ex2 + ε3 v1 v2 v3 x1 := α3x2 + ε1 x2 := β3x3 + ε2 x3 := ε3 x1 := γ3ex2 + ε1 x2 := λ3ex3 + ε2 x3 := ε3 In the following subsections, we derive the covariance matrices of each of the three systems, respectively. This leads us to the equivalence presented in Equation (5) for standardized SCMs. Moreover, we show that, for i SCMs, all three systems induce exactly the same observational distribution if and only if λ1 = λ2 = λ3 and γ1 = γ2 = γ3. These are the 3-node special cases of Theorems 3 and 4. Published as a conference paper at ICLR 2025 STANDARDIZED SCM To obtain the covariances between the observed variables in the standardized SCMs of Equations (13), (15), and (17), we first show that the assignments to the observed variables in standardized SCMs can be written in the form of linear SCMs over the same causal graph, which allows us to use Lemma 1. In all three systems, every vertex has at most one parent. When the node vj is the only parent of vi, under our assumptions on the noise, we have xj = p Var[xj]xs j, so the assignment of xs i can be written in the form of an SCM over xs as xs i := xi p Var[xi] = wj,ixj + εi p Var[xi] = wj,i p Var[xj]xs j + εi p Var[xi] = wj,i Var[xj] Var[xi] xs j+ εi p Var[xi] . (19) To use Equation (19), we first need to compute the marginal variances of the unstandardized observations xi. For the standardized SCMs, these marginal variances are, respectively: for Equation (13): for Equation (15): for Equation (17): Var[x1] = σ2 Var[x1] = (α2 2 + 1)σ2 Var[x1] = (α2 3(β2 3 + 1) + 1)σ2 Var[x2] = (α2 1 + 1)σ2 Var[x2] = σ2 Var[x2] = (β2 3 + 1)σ2 Var[x3] = (β2 1(α2 1 + 1) + 1)σ2 Var[x3] = (β2 2 + 1)σ2 Var[x3] = σ2 Given Equation (19) and the marginal variances, we know the weights of all three implied SCMs explicitly. Since all implied SCMs are linear, have unit marginal variances, and share the same causal graph, we can apply Lemma 1 and obtain the covariances of the observational distributions in the original models: for Equation (13): for Equation (15): for Equation (17): Cov[xs 1, xs 2] = α1 α2 1+1 Cov[xs 1, xs 2] = α2 α2 2+1 Cov[xs 1, xs 2] = α3 β2 3+1 α2 3(β2 3+1)+1 Cov[xs 1, xs 3] = α1β1 β2 1(α2 1+1)+1 Cov[xs 1, xs 3] = α2β2 (α2 2+1)(β2 2+1) Cov[xs 1, xs 3] = α3 α2 3(β2 3+1)+1 Cov[xs 2, xs 3] = β1 α2 1+1 β2 1(α2 1+1)+1 Cov[xs 2, xs 3] = β2 β2 2+1 Cov[xs 2, xs 3] = β3 In the standardized SCM (13), the causal graph is v1 v2 v3. Hence, the edge directions of the DAG G are consistent with the direction of increasing absolute covariance if and only if |Cov[xs 1, xs 2]| < |Cov[xs 2, xs 3]| α1 α2 1+1 β2 1(α2 1+1)+1 α2 1 α2 1+1 < β2 1 α2 1+1 β2 1(α2 1+1)+1 α2 1(β2 1(α2 1 + 1) + 1) < β2 1(α2 1 + 1)2 β2 1α4 1 + β2 1α2 1 + α2 1 < β2 1α4 1 + 2β2 1α2 1 + β2 1 α2 1 < β2 1(α2 1 + 1) α2 1 α2 1+1 < β2 1 . In the above equivalences, we always multiply or divide by quantities greater than 0, so the direction of the inequality does not change, and transformations are equivalent. 
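As a quick numerical sanity check of this condition (our own illustration with arbitrarily chosen weights, not part of the paper's experiments), one can sample from the chain SCM of Equation (13), standardize the data post-hoc, and compare the empirical covariance ordering to the analytic criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
alpha1, beta1, sigma = 1.4, 1.1, 1.0   # example weights with magnitude > 1

# Chain SCM of Equation (13): x1 -> x2 -> x3
x1 = sigma * rng.standard_normal(n)
x2 = alpha1 * x1 + sigma * rng.standard_normal(n)
x3 = beta1 * x2 + sigma * rng.standard_normal(n)

# Post-hoc standardization (standardized SCM)
xs = np.stack([x1, x2, x3])
xs = (xs - xs.mean(axis=1, keepdims=True)) / xs.std(axis=1, keepdims=True)

cov12 = abs(np.mean(xs[0] * xs[1]))
cov23 = abs(np.mean(xs[1] * xs[2]))

# Analytic condition for |Cov[x1^s, x2^s]| < |Cov[x2^s, x3^s]|
analytic = alpha1**2 / (alpha1**2 + 1) < beta1**2
print(cov12 < cov23, analytic)   # both True: covariance increases along the chain
```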
For the standardized SCM (17) with causal graph v1 v2 v3, we get an analogous condition for the edges to be aligned with the order of increasing absolute covariance when following the same algebraic manipulations: |Cov[xs 3, xs 2]| < |Cov[xs 2, xs 1]| β2 3 β2 3+1 < α2 3. Published as a conference paper at ICLR 2025 We make use of both of these conditions in Section 4. Since z/(z + 1) < 1 for any z > 0, the right-hand sides of both conditions are true if all weights are greater than 1. In this case, the absolute covariance increases downstream in all SCMs of Equations (13) and (17). Hence, among these two systems, only the DAG G whose edges aligns with the covariance ordering in the observed p(xs) can induce p(xs), and we can conclude that the other DAG is not the true causal graph. To derive the observational distributions of the i SCMs in Equations (14), (16), and (18), we proceed in the same way as we did for standardized SCMs. We first show that the i SCM is an SCM with a specific set of mechanisms and then apply Lemma 1 to obtain the covariances between the observed variables. To see this, we write the assignment of exi as exi := xi p Var[xi] = wj,i exj + εi p Var[xi] = wj,i p Var[xi] exj + εi p Var[xi] (21) As before, using Equation (21) requires first computing the marginal variances of the latent variables xi. For the i SCMs defined by Equations (14), (16), and (18), they are given by for Equation (14): for Equation (16): for Equation (18): Var[x1] = σ2 Var[x1] = γ2 2 + σ2 Var[x1] = γ2 3 + σ2 Var[x2] = γ2 1 + σ2 Var[x2] = σ2 Var[x2] = λ2 3 + σ2 Var[x3] = λ2 1 + σ2 Var[x3] = λ2 2 + σ2 Var[x3] = σ2 Given Equation (21) and the marginal variances, we obtain an explicit form for the weights of all three implied SCMs. Since the implied SCMs are linear, have unit marginal variances, and share the same causal graph, we can apply Lemma 1 and obtain the covariances of the observational distributions in the original models. It turns out that the observational distribution of all three ground-truth systems (ex1, ex2, ex3) in Equations (14), (16), and (18) is a multivariate Gaussian with the same covariance matrix, with the diagonal elements equal to 1 and the off-diagonal elements given by Cov[ex1, ex2] = γi p Cov[ex1, ex3] = γiλi p (λ2 i + σ2)(γ2 i + σ2) Cov[ex2, ex3] = λi p Since the observational distribution of all three SCMs is a zero-centered multivariate Gaussian, the distributions are equal if and only if their their covariance matrices are identical. The covariances are equal if and only if λ1 = λ2 = λ3 and γ1 = γ2 = γ3, because the function f(z) = z/ z2 + σ2 appearing in Cov[ex1, ex2] and Cov[ex2, ex3] of Equation (22) is injective for any σ > 0, which means that distinct weights z are mapped to distinct covariances. Therefore, the three node linear i SCMs in the above MEC share the same observational distribution if and only if they also share the same weights for each edge, regardless of edge orientation. This implies that the three DAGs G in the MEC of Equations (14), (16), and (18) are not identifiable from p(ex): given p(ex) induced by an i SCM with DAG in this 3-node MEC, the two other DAGs with the same linear function weights induce the same distribution p(ex). C.4.2 FORESTS In this section, we generalize the above partial identifiability result for standardized SCMs to arbitrary forest DAGs (Theorem 3). After that, we similarly generalize the nonidentifiability of i SCMs to forests (Theorem 4). 
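Before proceeding to the general case, we note that the 3-node nonidentifiability above can also be verified numerically. The following sketch (our own illustration, using empirical standardization and arbitrarily chosen weights) samples from the iSCMs of Equations (14) and (18), which reverse the chain orientation while sharing the weights assigned to the skeleton edges, and confirms that the two empirical covariance matrices agree, as predicted by Equation (22):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, lam, sigma = 500_000, 1.5, 0.8, 1.0

def standardize(v):
    return (v - v.mean()) / v.std()

# iSCM of Equation (14): v1 -> v2 -> v3
x1 = standardize(sigma * rng.standard_normal(n))
x2 = standardize(gamma * x1 + sigma * rng.standard_normal(n))
x3 = standardize(lam * x2 + sigma * rng.standard_normal(n))
cov_forward = np.cov(np.stack([x1, x2, x3]))

# iSCM of Equation (18): v1 <- v2 <- v3, same weights on the skeleton edges
y3 = standardize(sigma * rng.standard_normal(n))
y2 = standardize(lam * y3 + sigma * rng.standard_normal(n))
y1 = standardize(gamma * y2 + sigma * rng.standard_normal(n))
cov_backward = np.cov(np.stack([y1, y2, y3]))

print(np.round(cov_forward, 3))
print(np.round(cov_backward, 3))   # matches cov_forward up to sampling noise
```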
Our results concern the identification edge directions in an MEC represented by its partially directed graph G = (V, E), where E contains both directed and undirected edges. Published as a conference paper at ICLR 2025 vi vi+1 vi+2 (a) Subsystem 1 vi vi+1 vi+2 (b) Subsystem 2 vi vi+1 vi+2 (c) Subsystem 3 Figure 8: Proof subcases of Lemma 5. Three possible subgraphs in a chain without a collider. STANDARDIZED SCM Before proving the main theorem, we extend the 3-node example to chains of arbitrary length. We show that all but at most one edge in the MEC can be correctly oriented from observational data using the assumption on the support of the weights. Analogous to the three node case, we then use this to prove a similar result for forest graphs. Lemma 5 (Orientation of edges in undirected chains of standardized SCMs). Let xs be modeled by a standardized linear SCM (1) with chain DAG G = (V, E) , where Var[εi] = σ2 for non-root nodes and |wi,j| > 1 for all i pa(j). Additionally, suppose G contains no colliders. Then, given p(xs) and the partially directed graph G representing the MEC of G, we can identify all but at most one edge (vi, vj) of the true DAG G in each undirected connected component of the MEC G. The possible undirected edge has the smallest absolute covariance of all variables connected by edges in the MEC, satisfying |Cov[xs i, xs j]| < |Cov[xs k, xs l ]| for all (k, l) E \ (i, j). Proof. Throughout the proof, we label the nodes vi V such that vi 1 and vi+1 are its neighbors for i {2, . . . , d 1}. We start with the analysis of three arbitrary, consecutive vertices in a chain graph. The three possible subgraphs are depicted in Figure 8. We can always find p R such that the variance of the latent root of this directed subgraph is p2σ2. This relaxed assumption on specifically the root node allows for the root of the subgraph to have potential parents outside the subgraph, or to be the root of the whole chain, when later using this lemma to prove the main theorem. We will follow similar derivations as in Section C.4.1. Specifically, we first write the observed variables of the standardized SCM in SCM form, and then invoke Lemma 1 to obtain the covariances of the observed variables. To use Equation (19), we again need to compute the marginal variances of the variables before standardization. For the subsystems in Figures 8a and 8b, these are, respectively: for Figure 8a: for Figure 8b: Var[xi] = p2σ2 Var[xi] = (w2 i+1,ip2 + 1)σ2 Var[xi+1] = (w2 i,i+1p2 + 1)σ2 Var[xi+1] = p2σ2 Var[xi+2] = (w2 i+1,i+2(w2 i,i+1p2 + 1) + 1)σ2 Var[xi+2] = (w2 i+1,i+2p2 + 1)σ2 By substituting the expressions for the marginal variances into Equation (19), we obtain the weights of the implied models of the standardized SCM. Using Lemma 1, we obtain the covariances between the observed variables xs i 1, xs i, xs i+1. By construction, the marginal variances of the observed variables are equal to 1. We treat each subsystem separately: Subsystem 1 (Figure 8a) Given the marginal variances and Lemma 1, the covariances are Cov[xs i, xs i+1] = wi,i+1p q w2 i,i+1p2 + 1 Cov[xs i+1, xs i+2] = wi+1,i+2 w2 i,i+1p2 + 1 w2 i+1,i+2(w2 i,i+1p2 + 1) + 1 Published as a conference paper at ICLR 2025 Following the same algebraic manipulations as in Equation (20), substituting α1 := wi,i+1p and β1 := wi+1,i+2 in the derivation, we obtain Cov[xs i, xs i+1] < Cov[xs i+1, xs i+2] w2 i,i+1p2 w2 i,i+1p2 + 1 < w2 i+1,i+2 . 
(23) The left-hand side of the right-hand inequality in Equation (23) is upper-bounded by 1, similar to the 3-node case. Therefore, if we assume that |wi+1,i+2| 1, it must hold that |Cov[xs i, xs i+1]| < |Cov[xs i+1, xs i+2]| for any choice of p. Subsystem 2 (Figure 8b) Given the marginal variances and Lemma 1, the covariances are Cov[xs i, xs i+1] = wi+1,ip q w2 i+1,ip2 + 1 Cov[xs i+1, xs i+2] = wi+1,i+2p q w2 i+1,i+2p2 + 1 . The ordering of the covariances in this case depends on the specific choice of the weights. Subsystem 3 (Figure 8c) Following steps analogous to the symmetric subsystem 1, we conclude that, if |wi+1,i| 1, it must hold that |Cov[xs i, xs i+1]| > |Cov[xs i+1, xs i+2]| for any p. Given the above, we can now study the relationship between the underlying DAG G and the absolute covariance magnitudes under the assumption that |wi,i+1| > 1. We will use the fact that, if the chain does not contain a collider, then there can be at most one node contained in edges pointing in opposite directions. First, we treat the case where there exists a vertex vi such that |Cov[xs i 1, xs i]| = |Cov[xs i, xs i+1]|, that is, where some neighboring covariances are equal. If this occurs in a 3-node subsystem, only subsystem 2 can describe the true graph. To be consistent with the assumption that there are no colliders in the graph (see Lemma 5), all other edges must be oriented in a direction away from vi, which completely identifies the graph G in the MEC. In the second case, |Cov[xs j 1, xs j]| = |Cov[xs j, xs j+1]| holds for all nodes vj that have two neighbors in the path. Let xs i, xs i+1 be the unique pair of consecutive variables in the chain that minimizes |Cov[xs i, xs i+1]|. We can show that this pair is the unique minimizer using a proof by contradiction. Suppose there exist two pairs xs i, xs i+1 and xs j, xs j+1 such that |Cov[xs i, xs i+1]| = |Cov[xs j, xs j+1]| is the minimum covariance. Without loss of generality, let j + 1 < i. Then, the triple xs i 1, xs i, xs i+1 is consistent with only subsystems 2 or 3 based on their relative covariances, which implies that we must have vi 1 vi. Using the fact that we have no colliders, we can then orient all edges vk 1 vk for 1 < k < i. Thus, we can find a subsystem containing vj, vj+1, vj+2, which has been already oriented as subsystem 3, meaning |Cov[xs j, xs j+1]| > |Cov[xs j+1, xs j+2]|, a contradiction. Given xs i, xs i+1 is the unique pair of consecutive variables that minimizes |Cov[xs i, xs i+1]|, we now show that we can orient all edges except (vi, vi+1). We will do this in two parts. First, we show that one can orient all edges (vj, vj+1) with j < i, and then we show that we can do the same for all edges (vj, vj+1) with j > i. If i > 1, consider the subsystem vi 1, vi, vi+1. Since |Cov[xs i 1, xs i]| > |Cov[xs i, xs i+1]|, only subsystems 2 and 3 are possible for this subgraph. We can therefore orient vi 1 vi. Similarly, if i < d 1, by a symmetric argument on vi, vi+1, vi+2, we can orient vi+1 vi+2. Since the graph cannot contain colliders, all other edges must be oriented as vj vj+1 for j < i, and vj vj+1 for j > i. In other words, all edges except (vi, vi+1) point away from the two vertices vi, vi+1, and one of the two variables must be the root of the chain. Therefore, if |Cov[xs j 1, xs j]| = |Cov[xs j, xs j+1]| holds for all vertices vj that have two neighbors, then there exists a unique covariance minimizing pair xs i, xs i+1, and all edges except (vi, vi+1) are oriented. 
The two cases above are exhaustive, and in the worst case at most one edge (v_j, v_{j+1}) is left unoriented in the chain. This edge always corresponds to the minimizer of |Cov[x^s_j, x^s_{j+1}]|. This completes the proof.

Remark From the proof of Lemma 5, it follows that if we are able to orient all the edges in the chain, then the root of the chain is the node joining the two edges with minimum absolute covariance. When we orient all but one edge (v_i, v_{i+1}), the root node of the chain is either v_i or v_{i+1}.

We can extend Lemma 5 to forest graphs. For this, we will make use of the first Meek rule (Meek, 1995). The first Meek rule concerns an MEC G containing the undirected edges (v_i, v_j), (v_j, v_k) but not the edge (v_i, v_k). It states that, if one can orient v_i → v_j, we must have v_j → v_k.

Theorem 3 (Partial identifiability of standardized linear SCMs with forest DAGs). Let x^s be modeled by a standardized linear SCM (1) with forest DAG G, additive noise of equal variances Var[ε_i] = σ², and |w_{i,j}| > 1 for all i ∈ pa(j). Then, given p(x^s) and the partially directed graph G representing the MEC of G, we can identify all but at most one edge of the true DAG G in each undirected connected component of the MEC G.

Proof. The undirected parts of an MEC G are disjoint undirected connected components. Orienting the edges in all these undirected connected components without introducing a v-structure produces a valid DAG in the MEC G (Andersson et al., 1997). Each undirected connected component represents a Markov equivalence class of its own (Andersson et al., 1997). Thus, to prove the theorem, we consider these undirected connected components independently of the rest of the graph and show how to orient the edges in each undirected connected component.¹ In the following argument, we therefore consider G to be a single undirected connected component, with no directed edges by definition, and show that we can orient all but one edge in G. This argument then extends to all undirected connected components of the original MEC G, implying the statement made in Theorem 3.

If G is an undirected connected component with no directed edges, we only have to consider SCMs with a ground-truth DAG G that are members of this MEC G to distinguish among possible edge orientations in G. In the case of undirected trees, the ground-truth DAG G must be a tree with no colliders and the same skeleton as G, since any other DAG would belong to a different MEC.

We give a proof by strong induction on the number of vertices |V| in the MEC G. The base case of the induction argument is an MEC with |V| = 2 nodes. This case holds trivially, since this MEC can contain at most one undirected edge. For the inductive step, we consider an undirected tree MEC G with |V| = d and assume that we can orient all but one edge of undirected tree MECs with |V| < d. Our argument will proceed by considering the longest chain of the undirected tree G. We will use Lemma 5 to orient all but at most one edge in this chain and then apply the first Meek rule to possibly orient additional edges in G outside the chain. After orienting these edges, we show that we have reduced the original problem of orienting all but one edge in G with |V| = d to orienting all but one edge in a single undirected connected component that has strictly fewer than d nodes. This allows us to apply the inductive hypothesis and complete the proof (see Figure 9).
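The chain-orientation step provided by Lemma 5, which the inductive argument below applies repeatedly, is constructive. A minimal sketch of the procedure (our own illustration, ignoring the equal-covariance special case treated in the proof of Lemma 5) is:

```python
import numpy as np

def orient_chain_from_covariances(abs_covs):
    """Orientation step underlying Lemma 5 (illustrative sketch only).

    abs_covs : list of |Cov[x^s_i, x^s_{i+1}]| for the consecutive pairs of an
               undirected chain v_1 - v_2 - ... - v_k that contains no collider.
    Returns one entry per chain edge: +1 means v_i -> v_{i+1}, -1 means
    v_i <- v_{i+1}, and 0 means the edge is left unoriented.
    """
    i = int(np.argmin(abs_covs))       # pair with the smallest absolute covariance
    orientations = []
    for j in range(len(abs_covs)):
        if j < i:
            orientations.append(-1)    # edges before the minimizer point towards v_1
        elif j > i:
            orientations.append(+1)    # edges after the minimizer point towards v_k
        else:
            orientations.append(0)     # the minimizing edge may remain unoriented
    return orientations

# Example: absolute covariances increasing away from the minimizing pair,
# as in the proof of Lemma 5.
print(orient_chain_from_covariances([0.9, 0.7, 0.5, 0.8]))   # [-1, -1, 0, 1]
```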
Consider a longest undirected chain GC = (VC, EC) that is a subgraph of the undirected tree G. Let GC refer to the directed subgraph of the DAG G induced by considering only the vertices VC. We label the k vertices in VC as v1, ..., vk, with undirected edges (vi, vi+1) E for all i {1, . . . , k 1}. The nodes v1, vk can have no undirected neighbours in G outside the chain, because otherwise we could construct a longer chain in G. The only vertex in VC that can have a parent in the DAG G outside the chain GC, that is, in V\VC, is the unique root of GC. To see this, we first note that all nodes vi have at most one parent in G, because any vi with |pa(vi))| > 1 in G would be a collider, but G contains no colliders. Since non-root nodes in GC have an in-chain parent, they cannot have a parent outside of VC. Therefore, besides the root node of GC via its potential outside parent, GC is a completely disconnected subgraph from the rest of G. This implies that we may treat GC as a separate standardized SCM with undirected chain MEC, in which the potential parent of the root of GC is modeled as part of the exogenous noise of the root. This allows us to apply Lemma 5 to the variables of the subgraph GC. 1Orienting edges of an undirected connected component that touch a directed edge in G never introduces an additional v-structure. If a directed edge pointed into the undirected connected component, the undirected edge downstream would have had to already be directed in G by the first Meek rule. Hence, all directed edges bordering the undirected connected component must be oriented away from it, and none of the possible undirected edge orientations creates a new collider at the border node. This implies that all undirected connected components in G are upstream of the colliders and directed subgraphs of G. Published as a conference paper at ICLR 2025 v1 vi 1 vi vi+1 vi+2 vk Figure 9: Inductive step of the proof of Theorem 3. Ground-truth DAG G underlying an undirected connected component G in some given MEC. The nodes VC = {v1, . . . , vk} are a longest chain in G. Using Lemma 5, we can orient all edges in GC except possibly (vi, vi+1) (blue). Edges like (vi 1, u) are oriented by the first Meek rule. After Lemma 5, we are left with either the single undirected tree of vi (left shaded tree) or the single undirected tree consisting of (vi, vi+1) (blue) and both undirected trees of vi and vi+1 (both shaded trees). Either vi or vi+1 must be the root of GC. In this specific example, vi is the root of GC and is therefore the only node that can have a parent outside GC. Any node in G can have directed, outgoing edges to children in a (possibly non-forest) MEC the undirected connected component G may be a subgraph of. By applying Lemma 5 to GC, we can orient all but at most one undirected edge in GC. We split the resulting analysis into the two cases of Lemma 5 leaving either 0 or 1 undirected edge. In the first case, we can orient all edges in GC with Lemma 5. In this case, we know that the root of GC is the node vi (see Remark of Lemma 5). By the first Meek rule, we can recursively orient all additional edges in G outside of GC away from vi, except for the subtrees of G connected to vi itself (Figure 9). This leaves at most a single connected undirected subtree containing vi and strictly less than d vertices. In the second case, we orient all but one edge (vi, vi+1) in GC by applying Lemma 5. In this case, we know that the root of GC is either the node vi or vi+1 (see Remark of Lemma 5). 
Similar to the first case, we can recursively use the first Meek rule to orient all additional edges in G pointing away from vi and vi+1, except for the subtrees of G connected to vi and vi+1 itself. Since vi and vi+1 are connected by an undirected edge, we are left with a single connected subtree containing the undirected edge (vi, vi+1) that is strictly smaller than before. In both cases, we orient at least one undirected edge of G, because the longest undirected chain in G with |V| > 2 has at least length 2. We always obtain at most a single undirected connected tree component with strictly less than d vertices, allowing us to apply the inductive hypothesis and complete the proof. Theorem 4 (Nonidentifiability of linear Gaussian i SCMs with forest DAGs). Let ex be modeled by a linear i SCM (1) with forest DAG G and additive Gaussian noise of equal variances Var[εi]. Then, for every DAG G in the MEC of G, there exists a linear i SCM with DAG G that has the same observational distribution as ex, the same noise variances, and the same weights on the corresponding edges in the MEC. Published as a conference paper at ICLR 2025 (a) First subcase (b) Second subcase (More than one parent in G ) vi vl vk vj (c) Second subcase (A single parent in G ) Figure 10: Proof subcases of Theorem 4. (a) Path with a collider. In other words, a path blocked by an empty set. In the case of forests, this configuration implies that vi and vj are d-separated. (b) Unblocked path connecting vi and vj with one of the path nodes having a parent both in the path and outside the path. The weight wp,k influences the weight ewl,k in the implied model of the i SCM. If this structure is present in a forest, it has to be present in other graphs in the same MEC. (c) Unblocked path connecting vi and vj with the only parent of vk being part of the considered path. The weight ewl,k depends only on wl,k, irrespective of the edge direction. Proof. Because we consider linear i SCMs with Gaussian noise, the implied model is a linear SCM with additive Gaussian noise (see Appendix A.2). Hence, the observational distribution is a multivariate Gaussian with mean zero. In i SCMs, the marginal variance of an observed variable is always 1. Hence, we prove the statement if we show that for all exi, exj in the i SCM with graph G, and the corresponding ex i, ex j in the i SCM with graph G = (V, E ), Cov[exi, exj] = Cov[ex i, ex j]. Let ex i and ex j be the random variables associated with the nodes vi and vj from G , respectively. We consider two cases. First, if there is no path between vi and vj in the skeleton of G then there is no path between vi and vj in the skeleton of G and hence Cov[exi, exj] = Cov[ex i, ex j] = 0. In the second case, there is a path between vi and vj in the skeleton of G , so there also exists a path in the skeleton of G, as both graphs have the same skeleton. Due to the acyclicity of the skeleton in forests, this path is the only one connecting vi and vj in both G and G . We further break this second case into two subcases. In the first subcase, this path contains a collider in G as shown in Figure 10a. Because the skeleton cannot have undirected cycles under the forest assumption, this collider forms a v-structure. G G implies that the same v-structure must be present in G. Hence, vi and vj are d-separated in both G and G . By the global Markov condition, this implies that ex i and ex j are independent, and that exi and exj are independent. This implies that both Cov[ex i, ex j] = Cov[exi, exi] = 0. 
In the second subcase, there exists an unblocked path between vi and vj in both G and G . Here, we denote the weight matrix associated with both i SCMs by W := [wi,j], with W being symmetric, so that wi,j = wj,i is the linear weight of the edge (i, j) regardless of its orientation in the graph. We now derive the analogous weights f W, f W in the implied SCMs for G, G respectively. Ultimately, we will demonstrate that the implied SCMs have the same weights. Specifically, we will show that ewk,l = ew k,l. Given this, Lemma 1 implies that both i SCMs have the same covariance matrix over the observed variables. Without loss of generality, since the node labelling is arbitrary, let vk have at least as many incoming edges as vl in G . We divide the analysis into two cases: vk having only 1 parent in G , and vk having more than 1 parent. The node vk must have at least one parent, since at least one of vk, vl have an incoming edge in G , and we chose vk to have at least as many incoming edges as vl. More than one parent in G We know that any collider in G will appear as part of a v-structure in G due to the forest assumption, and therefore will also be a collider in G. Therefore, if vk has more than one parent in G (see Figure 10b), all pairs of edges incoming to vk will form v-structures, so vk must have exactly the same set of parents in G. Published as a conference paper at ICLR 2025 1 2 3 4 5 6 1 2 3 4 5 6 1.00 0.71 0.63 0.60 0.39 0.00 0.71 1.00 0.89 0.85 0.55 0.00 0.63 0.89 1.00 0.95 0.62 0.00 0.60 0.85 0.95 1.00 0.59 0.00 0.39 0.55 0.62 0.59 1.00 0.77 0.00 0.00 0.00 0.00 0.77 1.00 Figure 11: Illustrating Theorem 4 for trees in the same MEC. Covariance matrix of observed i SCM variables for two example forests belonging to the same MEC with the same weights assigned to the edges of the skeleton. Moreover, any two parents of vk are d-separated in G and G by the forest assumption, since the blocked path going through vk is the only path connecting them. By the global Markov condition, the parents are pairwise independent. Hence, we can use Equation (11) to compute ewk,l, ew k,l. Since the parent sets are the same between the two graphs, and W is shared between the two i SCMs, the weight associated with the edge (l, k) in both graphs in the implied models is given by ewl,k = ew l,k = wl,k q P u pa(k) w2 u,k + σ2 . (24) A single parent in G Let (l, k) be the only incoming edge to vk in G , as depicted in Figure 10c. Then, the edge connecting vl and vk in G is either the only incoming edge to vk or the only incoming edge to vl. To see this, suppose that it was not the only incoming edge to vk or vl in G. This would make vk or vl a collider that would be common to both graphs, implying that vk or vl would have at least two parents in G . We operate under the assumption that vk has at least as many parents as vl, so it would imply that vk has more than one parent, contradicting the assumption we made for case we consider in this paragraph. Irrespective of the direction, the weight associated with the edge (l, k) in the skeleton of both graphs in the implied model is, similar to Equation (21), given by ewl,k = ew l,k = wl,k q w2 l,k + σ2 . (25) Equations (24) and (25) show that, for the SCM form of each i SCM, the edges connecting the same nodes irrespective of their direction in G and G have the same weights. By Lemma 1, the covariance between any exi and exj can be expressed as a product of the weights in the implied SCM corresponding to the edges on the path between vi, vj. 
Hence, Cov[exi, exj] = Cov[ex i, ex j]. Figure 11 shows an example for Theorem 4 for two trees from the same MEC. Remark In Figure 12, we empirically demonstrate that Theorem 4 no longer holds if we drop the forest assumption. For data generated from an i SCM and two graphs from the same G with the same weights assigned to the skeleton edges, we observe that the estimated covariances differ. The two systems entail different observational distributions. D BACKGROUND ON RELATED WORK D.1 HEURISTICS FOR MITIGATING VARIANCE ACCUMULATION Here, we review existing heuristics for avoiding the exploding variance in structure learning benchmarking with linear SCMs as defined in Equation (1). We also describe how these heuristics limit the causal dependencies that can be modeled in terms of the correlations among the SCM variables or their cause-explained variance, both of which do not occur in linear i SCMs. Finally, in Figure 13 we show that the heuristics fail to induce data that is both not Var-sortable and not R2-sortable. Published as a conference paper at ICLR 2025 1.00 0.71 0.93 0.71 1.00 0.87 0.93 0.87 1.00 1.00 0.93 0.95 0.93 1.00 0.94 0.95 0.94 1.00 Figure 12: Non-forest counterexample for Theorem 4. Covariance matrix of observed i SCM variables for two non-forests belonging to the same MEC with the same weights assigned to the edges of the skeleton. Scaling weights by the inverse weight norm Mooij et al. (2020, Section 5.2) sample the edge weights in linear SCMs as wi,j Unif [0.5, 1.5]. To achieve a comparable variance of each variable xj in the SCM, they propose re-scaling the sampled weights prior to the data-generating process as wi,j wi,j q i pa(j) w2 i,j . If all parents of xj are i.i.d. Gaussian with variance 1, this adjustment ensures that the variance of xj is similar for all xj. However, this approximation does not take into account the covariances of the parents. Moreover, since Var[εj] is unchanged, the scaling limits the strength of the causal effect that parents can have on xj. For example, when x1 = ε1 and x2 = wx1 + ε2 with Var[εj] = 1 as for Mooij et al. (2020), the adjusted weight is w = w/ 1 + w2 < 1. Thus, for any w = 0, we have |Corr[x1, x2]| = |Cov[ε1, w ε1 + ε2]| p Var[ε1] Var[w ε1 + ε2] = |w | w 2 + 1 < 1 This is the maximum correlation between neighbouring variables that any SCM can model under the proposed re-scaling when Var[εj] = 1, since additional parents decrease the parent-child correlations. By contrast, i SCMs can model any level of correlation by sampling arbitrary values of wi,j, while guaranteeing unit-variance observations xj. Intuitively, i SCMs achieve this by standardizing xj after the exogenous noise εj is added to the endogenous contributions of the parents xpa(j), while weight scaling is done before εj is added to xj. Scaling weights by the incoming variance Squires et al. (2022, Section 5.1) sample the weights of linear SCMs as wi,j Unif [0.25, 1.0]. Given the initial edge weights, they propose adjusting the weights during the generative process by first estimating the total variance ˆσ2 j that the parents of xj contribute to xj from samples drawn under an initial level of additive noise with Var[εj] = 1 and then re-scaling the weights as wi,j wi,j q When using additive noise with Var[εj] = 0.5 to generate the actual samples, this scaling results in Var[xj] = 1 with a constant fraction of cause-explained variance CEVf[xi] = 0.5. In benchmarks, however, we may be interested in evaluating SCMs with arbitrary levels of cause-explained variance. 
iSCMs allow this by construction. Contrary to Squires et al. (2022), iSCMs scale the variables x_j rather than the weights w_{i,j} while leaving the exogenous noise ε_j unchanged, which enables modeling arbitrarily small or large levels of unexplained variation.

D.2 SORTABILITY METRICS

In this section, we describe the definition of a sortability metric as introduced by Reisach et al. (2024), which we use in Section 5. For a function τ, τ-sortability assigns a scalar in [0, 1] to the variables x and graph G (with weight matrix W_G) as

( Σ_{i=1}^{d} Σ_{p_{s→t} ∈ W^i_G} incr(τ(x, s), τ(x, t)) ) / ( Σ_{i=1}^{d} Σ_{p_{s→t} ∈ W^i_G} 1 ) ,

where incr(a, b) = 1 if a < b, 1/2 if a = b, and 0 if a > b, W^i_G is the i-th power of the adjacency matrix W_G, and p_{s→t} ∈ W^i_G if and only if at least one directed path from v_s to v_t of length i exists in G. If τ(x, t) = Var[x_t], we obtain Var-sortability from Reisach et al. (2021). If τ(x, t) = R²[x_t] = 1 − Var[x_t − E[x_t | x_{{1,...,d}\{t}}]] / Var[x_t], we obtain R²-sortability. Estimating R²[x_t] requires performing a regression of x_t onto x_{{1,...,d}\{t}}.

Figure 13: Sortabilities for data generated according to heuristics that aim to remove artifacts, for (a) ER(d, 2) and (b) SF(d, 2) graphs. Weight ranges were assumed as in the original papers: w_{i,j} ∼ Unif[0.5, 1.5] for Mooij et al. (2020), w_{i,j} ∼ Unif[0.25, 1.0] for Squires et al. (2022). For every model, we evaluate 100 systems, each using n = 1000 samples. Lines and shaded regions denote mean and standard deviation, respectively.

D.3 STRUCTURE LEARNING ALGORITHMS

To complement the interpretation of the results in Section 5, we provide some background on the structure learning methods we evaluate.

NOTEARS (Zheng et al., 2018) NOTEARS uses continuous optimization to minimize the regularized mean-squared error (MSE) between the variables modeled by a linear SCM and the observations, while enforcing a differentiable acyclicity constraint. The objective function of NOTEARS is given by F(W) = ‖X − XW‖²_F / (2n) + λ‖W‖₁, where ‖·‖_F and ‖·‖₁ are the Frobenius and ℓ1 norms, respectively. When the objective is minimized, weights below a fixed threshold are set to zero.

AVICI (Lorch et al., 2022) AVICI is an amortized variational inference method that approximates the posterior distribution over causal structures given a dataset through a pretrained inference model. The variational approximation of AVICI uses a fully-factored product of Bernoulli distributions for every possible graph edge. The inference model is a neural network that predicts the variational parameters of the Bernoulli distributions by minimizing the expected forward KL divergence between the true posterior and the approximation. To train the inference model, AVICI can be optimized on any training distribution of (synthetic) dataset-graph pairs. Lorch et al. (2022) publish the pretrained parameters of inference models trained on standardized SCMs with linear and nonlinear mechanisms, which we evaluate in this work.

SortnRegress methods (Reisach et al., 2021; 2024) The SORTNREGRESS methods order the vertices by a chosen statistic and sparsely regress every node on all of its predecessors in the obtained order. They use Lasso regression with the Bayesian Information Criterion to learn the regression function for a given variable. Var-SORTNREGRESS uses estimated marginal variances as the sorting criterion.
R2-SORTNREGRESS uses R2 coefficient of determination estimated after performing a regression of every variable onto all remaining variables. RAND-SORTNREGRESS orders the vertices randomly. Published as a conference paper at ICLR 2025 E EXPERIMENTAL SETUP Causal mechanisms We consider systems with additive noise, where fi(x, εi) = hi(x) + εi, for a chosen function hi. The LINEAR systems used in this experiments have causal mechanisms as defined in Equation (1). To model nonlinear systems, we use smooth nonlinear functional mechanisms as used by Lorch et al. (2022). Specifically, the function hi that models the relationship between xi and its parents is sampled from a Gaussian Process hi GP(0, ki) , where k is a squared exponential kernel ki(x, x ) = c2 i exp ||x x ||2 2/2l2 i with output and length scales ci and li respectively. We can approximately express the function sample hi analytically using random Fourier features (Rahimi and Recht, 2007) by sampling hi(x) = ci q j=1 α(i) cos ω(i) x where α(i) N(0, 1), ω(i) N(0, I), and δ(i) Unif [0, 2π]. In this work, we use M = 100. Generating a random model Following prior work (Section 2), we sample random systems in any simulation performed in this work by first drawing a graph G from the specified random graph distribution. Given the graph G, we sample function parameters of the structural mechanisms over G. For linear systems, we sample wi,j Unif [a, b], where a, b are fixed, i.i.d. for every graph edge. Similarly, for nonlinear systems, for every graph vertex, we draw the length scales li Unif[a1, b1] and output scales ci Unif[a2, b2] with predefined a1, b1, a2, b2. Sampling data from a model Given a graph G, noise distribution Pε, and a set of functions {f1, ...fd}, we sample n datapoints from an SCM by traversing G in a topological ordering. For every vertex vi, we draw a noise sample εi Pn εi. The sample for xi is then deterministically computed by fi from the exogenous εi and the parents of xi. To sample from a Standardized SCM, we draw a dataset from an SCM and standardize it. To sample from an i SCM, we use Algorithm 1. E.2 EXPERIMENT CONFIGURATIONS Sortability For Figures 4a, 16a, and 17a we generate Erd os-Rényi graphs ER(d, k) (Erd os and Rényi, 1959), with d denoting the graph size and k the expected node degree. For Figures 4b, 16b, and 17b we generate undirected scale-free graphs SF(d, k) (Barabási and Albert, 1999), where d is the graph size and k the number of outgoing edges generated for each vertex. Then, we orient the edges in the graph according to a random topological ordering. We do not sample directed scale-free graphs initially to avoid high sortability by in-degree, which may confound the results. For all four figures, we generate LINEAR systems with weights sampled from three possible distributions wi,j Unif [0.3, 1.8], wi,j Unif [0.5, 2.0] or wi,j Unif [1.3, 3.0] and noise sampled from εi N(0, 1). For every model configuration, we sample 100 systems and n =1000 data points each. To create Figures 4 and 16 we sampled graphs of sizes {20, 60, 100, 140, 180, 220}. To obtain Figure 17, we sampled graphs with k {4, 8, 12, 16, 20}. Structure Learning (Section 5.2) For Figures 5 and 14, we sample LINEAR systems with weights wi,j Unif [0.5, 2.0]. Following Lorch et al. (2022), NONLINEAR mechanisms have length scales li Unif[7.0, 10.0] and output scales ci Unif[10.0, 20.0]. Both mechanisms are defined in Appendix E.1. 
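For concreteness, the NONLINEAR mechanism sampling of Appendix E.1 can be sketched as follows (our own illustration, not the exact implementation; we let the length scale enter through the frequency scaling, which is the standard random-Fourier-feature construction for the squared-exponential kernel):

```python
import numpy as np

def sample_rff_mechanism(rng, num_parents, length_scale, output_scale, M=100):
    """Draw one nonlinear mechanism h_i as a random-Fourier-feature GP sample.

    Sketch of the construction in Appendix E.1:
    h_i(x) = c_i * sqrt(2/M) * sum_j alpha_j * cos(omega_j^T x + delta_j),
    with alpha_j ~ N(0, 1), delta_j ~ Unif[0, 2*pi], and frequencies scaled by
    the length scale l_i (omega_j ~ N(0, I) / l_i), matching a squared-exponential kernel.
    """
    alpha = rng.standard_normal(M)
    omega = rng.standard_normal((M, num_parents)) / length_scale
    delta = rng.uniform(0.0, 2.0 * np.pi, size=M)

    def h(x_parents):
        # x_parents: (..., num_parents) array of parent values
        features = np.cos(x_parents @ omega.T + delta)          # shape (..., M)
        return output_scale * np.sqrt(2.0 / M) * features @ alpha

    return h

rng = np.random.default_rng(0)
h = sample_rff_mechanism(rng, num_parents=2, length_scale=8.0, output_scale=15.0)
print(h(rng.standard_normal((5, 2))))   # mechanism values for 5 parent configurations
```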
For Figures 15a and 15b, we generate LINEAR systems with weights wi,j Unif [0.3, 0.8] and wi,j Unif [1.3, 3.0]. For all four figures, we sample random ER(20, 2) and ER(100, 2) graphs with noise εi N(0, 1). For every model configuration, we sample 20 systems and n = 1000 data points each. Published as a conference paper at ICLR 2025 Noise Transfer For Figure 6 (top), we sample SCMs, standardized SCMs, and i SCMs with exactly the same underlying graph and weights sampled from wi,j Unif [0.5, 2.0]. The noise variables are drawn from εi N(0, 1). Then, for every triple of SCM, standardized SCM, and i SCM that shares a graph and weights, we create two more SCMs with the same marginal variances as the SCM, but with the noise variances of the implied models of the standardized SCM and i SCM, respectively. Appendix E.5 provides a motivation and detailed explanation of this procedure. Figure 6 (top) shows the performance of NOTEARS on the original SCMs and the two SCMs with transferred noise. For Figure 6 (bottom), we sample multiple instances of standardized SCMs, and i SCMs with weights drawn from wi,j Unif [0.5, 2.0] and noise from εi N(0, 1). For every model instance, we approximate the density of the inverse of their implied noise variances using kernel density estimation. The figure shows the mean and standard deviation of the p.d.f. values over 100 systems. For both figures, we use ER(100, 2) graphs. E.3 METHODS NOTEARS (Zheng et al., 2018) To run NOTEARS, we use the original implementation provided by the authors of Zheng et al. (2018) (Apache-2.0 license). Before benchmarking NOTEARS, we run a hyperparameter search to calibrate the weight penalty (λ) and threshold on held-out instances of each data generation method. The hyperparameters can be found in Appendix E.4. AVICI (Lorch et al., 2022) To evaluate AVICI, we use the code and model checkpoints provided by the authors of the method (MIT license). Specifically, we use the model trained on linear data to benchmark the method on LINEAR systems and the model trained on nonlinear data to benchmark on NONLINEAR systems. We score an edge as predicted if the probability prediction by AVICI is greater than 0.5. Since the parameters are pretrained, the method has otherwise no tuneable hyperparameters. Sortabilities and SORTNREGRESS methods (Reisach et al., 2021; 2024) To compute the sortability metrics and run the SORTNREGRESS baselines, we use the Causal Disco library (BSD-3-Clause license) created by the authors of the method. The algorithms require no tuneable hyperparameters. GOLEM (Ng et al., 2020) For GOLEM-EV, we tune λ1 (sparsity penalty coefficient), λ2 (acyclicity penalty coefficient) and the threshold for zeroing weights. For GOLEM-NV, we tune the same hyperparameters as for GOLEM-EV. We do not initialize the model with the solution returned by GOLEM-EV, as done in the original paper, since we want to evaluate a method that does not assume equal noise variances at any point. Not initializing with the GOLEM-EV weights is consistent with the benchmarking approach of Reisach et al. (2021). We use the implementation of the original work (Ng et al., 2020). PC Algorithm For linear data, we use a Gaussian conditional independence test. For nonlinear data, we use the Hilbert-Schmidt Independence Criterion (HSIC) gamma test. We treat the test significance level as a hyperparameter that we tune. We use the implementation by the Causal Discovery Toolbox (Kalainathan et al., 2020). 
GES GES uses the linear Gaussian BIC score function and does not require hyperparameter tuning. We use the implementation by the Causal Discovery Toolbox (Kalainathan et al., 2020). CAM CAM estimates a causal ordering using maximum likelihood and then performs sparse nonlinear regression using splines on the possible parents in this ordering. We use the implementation from the dodiscover library (MIT license) and include the preliminary neighbor search option to make the algorithm scale to large graphs. We tune the cutoff value α for variable selection with hypothesis testing over regression coefficients, and the number and order of splines to use for the feature function. LINGAM LINGAM uses independent component analysis, an algorithm for source separation, to find a causal ordering, which is identifiable in linear systems if the additive noise in an SCM is non-Gaussian. We use the implementation from the cdt (Causal Discovery Toolbox) library (MIT license). Published as a conference paper at ICLR 2025 E.4 HYPERPARAMETER SELECTION For all algorithms that require hyperparameter tuning, we perform the search on separate, held-out systems that follow the same configurations as the ones we present in our final experimental results. We run the algorithms 20 times per configuration and choose the median F1 score as the criterion for selecting the best hyperparameters. To run NOTEARS, we need to specify the regularisation strength λ and a weight threshold η for thresholding the final weights for graph structure prediction. To select these hyperparameters, we run a parameter search with λ {0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3} and three possible values of the weight threshold {0.1, 0.2, 0.3}. Table 1 presents all final hyperparameter configurations for NOTEARS. For some hyperparameter configurations, 1 in 20 runs experienced numerical issues caused by the acyclicity constraint. However, this never occurs for the selected, optimal hyperparameters, neither when performing the hyperparameter search nor when running the reported experiments. To run the PC algorithm, one needs to choose a test significance level α. During the hyperparamter search we consider α {0.01, 0.001, 0.0001}. Table 4 presents all final hyperparameter configurations for the PC algorithm. To run GOLEM-EV and GOLEM-NV we need to tune sparsity penalty coefficient λ1, acyclicity penalty coefficient λ2 and the weight threshold η. We consider λ1 {0.02, 0.002, 0.0002}, λ2 {2.0, 5.0, 8.0} and η {0.1, 0.2, 0.3}. Tables 2 and 3 present the best configurations. To run CAM we need to tune the cutoff value α {0.05, 0.10, 0.15} for variable selection with hypothesis testing over regression coefficients and the number and order of splines to use for the feature function for which we consider sets {5, 10} and {2, 3} respectively. Table 5 presents the best configurations. E.5 TRANSFERRING NOISE VARIANCES WHILE KEEPING Var-SORTABILITY UNCHANGED Reisach et al. (2021) show that post-hoc standardization of SCM data strongly impairs the performance of NOTEARS. When comparing the performance of NOTEARS between data sampled from i SCMs and standardized SCMs, there are at least two factors that can affect the performance of NOTEARS, low Var-sortability and the violation of the equal noise variance assumption. Our experiments in Figure 6 of Section 5 aim at isolating the effect of the latter. 
Specifically, we investigate whether NOTEARS performs better on Var-sortable datasets that have the noise scale patterns implied when assuming SCMs generated the data when in fact the data was sampled from i SCMs or standardized SCMs. To achieve this, we ensure that the Var-sortability metrics of the data sampled from the models is the same, here close to 1. Given two linear SCMs Sa and Sb with the same underlying graph G, our goal is to construct a system St with the same marginal variances as Sa (condition 1) and the same noise variances as Sb (condition 2). For this task to be well-defined, we assume that the noise variances of the root variables in Sa and Sb are the same. The first step in constructing St is to copy the noise variances from Sb, so that for every i {1, ..., d}. σ2 i t := σ2 i b . This satisfies condition 2. Given this, we define xt i as v u u t Var[xa i ] σ2 i b Var[wa i T xt pa(i)]wa i T xt pa(i) + εt i , where εt i has variance σ2 i t. By construction, the condition of St sharing the noise variances with Sb and the marginal variances with Sa is fulfilled for the root variables. For all the remaining variables, Published as a conference paper at ICLR 2025 Table 1: NOTEARS hyperparameters for all experiments. Final settings for the regularization strength λ and the weight threshold η after hyperparameter tuning on the respective models and data-generating processes together with the F1 (median) validation scores achieved by NOTEARS. (a) ER(20, 2) DAGs, LINEAR mechanisms Weight Distribution Model λ η F1 (median) Unif [0.3, 0.8] SCM 0.05 0.20 0.97 Unif [0.3, 0.8] Standardized SCM 0.15 0.10 0.59 Unif [0.3, 0.8] i SCM 0.15 0.10 0.57 Unif [0.5, 2.0] SCM 0.00 0.30 0.98 Unif [0.5, 2.0] Standardized SCM 0.15 0.20 0.30 Unif [0.5, 2.0] i SCM 0.15 0.10 0.50 Unif [1.3, 3.0] SCM 0.05 0.30 0.98 Unif [1.3, 3.0] Standardized SCM 0.25 0.10 0.24 Unif [1.3, 3.0] i SCM 0.20 0.10 0.40 (b) ER(100, 2) DAGs, LINEAR mechanisms Weight Distribution Model λ η F1 (median) Unif [0.3, 0.8] SCM 0.10 0.10 0.99 Unif [0.3, 0.8] Standardized SCM 0.10 0.10 0.83 Unif [0.3, 0.8] i SCM 0.10 0.10 0.84 Unif [0.5, 2.0] SCM 0.05 0.30 0.94 Unif [0.5, 2.0] Standardized SCM 0.15 0.10 0.47 Unif [0.5, 2.0] i SCM 0.15 0.10 0.76 Unif [1.3, 3.0] SCM 0.10 0.30 0.82 Unif [1.3, 3.0] Standardized SCM 0.20 0.10 0.30 Unif [1.3, 3.0] i SCM 0.15 0.10 0.70 (c) ER(20, 2) DAGs, NONLINEAR mechanisms Model λ η F1 (median) SCM 0.15 0.30 0.58 Standardized SCM 0.15 0.10 0.33 i SCM 0.15 0.20 0.42 (d) ER(100, 2) DAGs, NONLINEAR mechanisms Model λ η F1 (median) SCM 0.30 0.30 0.50 Standardized SCM 0.15 0.10 0.43 i SCM 0.15 0.10 0.61 (e) Noise transfer experiment: ER(100, 2) DAGs, LINEAR mechanisms wij Unif [0.5, 2.0] Model λ η F1 (median) Original 0.05 0.30 0.96 Noise from standardized SCM 0.10 0.30 0.72 Noise from i SCM 0.05 0.30 0.82 Published as a conference paper at ICLR 2025 Table 2: GOLEM-EV hyperparameters for all experiments. Final settings for the sparsity penalty coefficient λ1, acyclicity penalty coefficient λ2 and the weight threshold η after hyperparameter tuning on the respective models and data-generating processes together with the F1 (median) validation scores achieved by GOLEM-EV. 
(a) ER(20, 2) DAGs, LINEAR mechanisms Weight Distribution Model λ1 λ2 η F1 (median) Unif [0.5, 2.0] SCM 0.002 5.00 0.30 1.00 Unif [0.5, 2.0] Standardized SCM 0.020 8.00 0.10 0.15 Unif [0.5, 2.0] i SCM 0.020 2.00 0.10 0.36 Unif [1.3, 3.0] SCM 0.002 5.00 0.30 1.00 Unif [1.3, 3.0] Standardized SCM 0.020 8.00 0.10 0.12 Unif [1.3, 3.0] i SCM 0.020 5.00 0.10 0.34 Unif [0.3, 0.8] SCM 0.020 2.00 0.10 1.00 Unif [0.3, 0.8] Standardized SCM 0.020 5.00 0.10 0.33 Unif [0.3, 0.8] i SCM 0.020 5.00 0.10 0.36 (b) ER(100, 2) DAGs, LINEAR mechanisms Weight Distribution Model λ1 λ2 η F1 (median) Unif [0.5, 2.0] SCM 0.020 2.00 0.20 1.00 Unif [0.5, 2.0] Standardized SCM 0.020 8.00 0.10 0.13 Unif [0.5, 2.0] i SCM 0.020 5.00 0.10 0.24 Unif [1.3, 3.0] SCM 0.020 8.00 0.30 0.90 Unif [1.3, 3.0] Standardized SCM 0.020 5.00 0.10 0.08 Unif [1.3, 3.0] i SCM 0.020 5.00 0.10 0.19 Unif [0.3, 0.8] SCM 0.020 2.00 0.20 1.00 Unif [0.3, 0.8] Standardized SCM 0.020 5.00 0.10 0.30 Unif [0.3, 0.8] i SCM 0.020 2.00 0.10 0.40 (c) ER(20, 2) DAGs, NONLINEAR mechanisms Model λ1 λ2 η F1 (median) SCM 0.020 8.00 0.30 0.39 Standardized SCM 0.002 8.00 0.10 0.20 i SCM 0.020 2.00 0.10 0.25 (d) ER(100, 2) DAGs, NONLINEAR mechanisms Model λ1 λ2 η F1 (median) SCM 0.020 8.00 0.10 0.27 Standardized SCM 0.020 8.00 0.10 0.14 i SCM 0.020 5.00 0.10 0.14 Published as a conference paper at ICLR 2025 Table 3: GOLEM-NV hyperparameters for all experiments. Final settings for the sparsity penalty coefficient λ1, acyclicity penalty coefficient λ2 and the weight threshold η after hyperparameter tuning on the respective models and data-generating processes together with the F1 (median) validation scores achieved by GOLEM-NV. (a) ER(20, 2) DAGs, LINEAR mechanisms Weight Distribution Model λ1 λ2 η F1 (median) Unif [0.5, 2.0] SCM 0.0002 2.00 0.20 0.16 Unif [0.5, 2.0] Standardized SCM 0.0200 2.00 0.10 0.38 Unif [0.5, 2.0] i SCM 0.0200 2.00 0.20 0.45 Unif [1.3, 3.0] SCM 0.0002 2.00 0.10 0.20 Unif [1.3, 3.0] Standardized SCM 0.0002 5.00 0.10 0.37 Unif [1.3, 3.0] i SCM 0.0200 2.00 0.20 0.37 Unif [0.3, 0.8] SCM 0.0020 5.00 0.20 0.13 Unif [0.3, 0.8] Standardized SCM 0.0200 2.00 0.20 0.55 Unif [0.3, 0.8] i SCM 0.0200 2.00 0.10 0.58 (b) ER(100, 2) DAGs, LINEAR mechanisms Weight Distribution Model λ1 λ2 η F1 (median) Unif [0.5, 2.0] SCM 0.002 5.00 0.20 0.10 Unif [0.5, 2.0] Standardized SCM 0.020 2.00 0.10 0.32 Unif [0.5, 2.0] i SCM 0.020 2.00 0.10 0.51 Unif [1.3, 3.0] SCM 0.002 2.00 0.10 0.21 Unif [1.3, 3.0] Standardized SCM 0.002 5.00 0.10 0.18 Unif [1.3, 3.0] i SCM 0.020 2.00 0.10 0.46 Unif [0.3, 0.8] SCM 0.020 2.00 0.10 0.18 Unif [0.3, 0.8] Standardized SCM 0.020 2.00 0.20 0.65 Unif [0.3, 0.8] i SCM 0.020 2.00 0.10 0.67 (c) ER(20, 2) DAGs, NONLINEAR mechanisms Model λ1 λ2 η F1 (median) SCM 0.002 8.00 0.20 0.07 Standardized SCM 0.020 2.00 0.20 0.30 i SCM 0.020 2.00 0.10 0.41 (d) ER(100, 2) DAGs, NONLINEAR mechanisms Model λ1 λ2 η F1 (median) SCM 0.002 5.00 0.20 0.07 Standardized SCM 0.020 2.00 0.10 0.24 i SCM 0.020 2.00 0.10 0.36 Published as a conference paper at ICLR 2025 Table 4: PC hyperparameters for all experiments. Final settings for the significance level α after hyperparameter tuning on the respective models and data-generating processes together with the F1 (median) validation scores achieved by PC. 
Table 4: PC hyperparameters for all experiments. Final settings for the significance level α after hyperparameter tuning on the respective models and data-generating processes, together with the F1 (median) validation scores achieved by PC.

(a) ER(20, 2) DAGs, LINEAR mechanisms
Weight Distribution   Model              α      F1 (median)
Unif [0.3, 0.8]       SCM                0.01   0.71
Unif [0.3, 0.8]       Standardized SCM   0.01   0.70
Unif [0.3, 0.8]       i SCM              0.01   0.72
Unif [0.5, 2.0]       SCM                0.01   0.47
Unif [0.5, 2.0]       Standardized SCM   0.01   0.46
Unif [0.5, 2.0]       i SCM              0.01   0.58
Unif [1.3, 3.0]       SCM                0.01   0.35
Unif [1.3, 3.0]       Standardized SCM   0.01   0.38
Unif [1.3, 3.0]       i SCM              0.01   0.48

(b) ER(100, 2) DAGs, LINEAR mechanisms
Weight Distribution   Model              α      F1 (median)
Unif [0.3, 0.8]       SCM                0.01   0.82
Unif [0.3, 0.8]       Standardized SCM   0.01   0.85
Unif [0.3, 0.8]       i SCM              0.01   0.86
Unif [0.5, 2.0]       SCM                0.01   0.62
Unif [0.5, 2.0]       Standardized SCM   0.01   0.57
Unif [0.5, 2.0]       i SCM              0.01   0.79
Unif [1.3, 3.0]       SCM                0.01   0.42
Unif [1.3, 3.0]       Standardized SCM   0.01   0.43
Unif [1.3, 3.0]       i SCM              0.01   0.71

(c) ER(20, 2) DAGs, NONLINEAR mechanisms
Model              α      F1 (median)
SCM                0.01   0.53
Standardized SCM   0.01   0.54
i SCM              0.01   0.65

(d) ER(100, 2) DAGs, NONLINEAR mechanisms
Model              α      F1 (median)
SCM                0.01   0.53
Standardized SCM   0.01   0.63
i SCM              0.01   0.68

Table 5: CAM hyperparameters for all experiments. Final settings for the cutoff value α for variable selection with hypothesis testing over regression coefficients, the number and order of splines to use for the feature function, together with the F1 (median) validation scores achieved by CAM.

(a) ER(20, 2) DAGs, LINEAR mechanisms
Weight Distribution   Model        α      Number of Splines   Spline Order   F1 (median)
Unif [0.3, 0.8]       SCM          0.05   5                   3              0.49
Unif [0.3, 0.8]       Stand. SCM   0.05   5                   2              0.46
Unif [0.3, 0.8]       i SCM        0.05   5                   3              0.57
Unif [0.5, 2.0]       SCM          0.10   10                  3              0.31
Unif [0.5, 2.0]       Stand. SCM   0.10   10                  2              0.23
Unif [0.5, 2.0]       i SCM        0.10   5                   2              0.53
Unif [1.3, 3.0]       SCM          0.05   10                  2              0.24
Unif [1.3, 3.0]       Stand. SCM   0.05   10                  3              0.27
Unif [1.3, 3.0]       i SCM        0.05   5                   2              0.42

(b) ER(100, 2) DAGs, LINEAR mechanisms
Weight Distribution   Model        α      Number of Splines   Spline Order   F1 (median)
Unif [0.3, 0.8]       SCM          0.05   10                  3              0.54
Unif [0.3, 0.8]       Stand. SCM   0.05   10                  2              0.57
Unif [0.3, 0.8]       i SCM        0.05   5                   3              0.61
Unif [0.5, 2.0]       SCM          0.05   10                  2              0.39
Unif [0.5, 2.0]       Stand. SCM   0.05   5                   2              0.39
Unif [0.5, 2.0]       i SCM        0.05   5                   3              0.62
Unif [1.3, 3.0]       SCM          0.05   5                   3              0.27
Unif [1.3, 3.0]       Stand. SCM   0.05   10                  3              0.26
Unif [1.3, 3.0]       i SCM        0.05   5                   3              0.61

(c) ER(20, 2) DAGs, NONLINEAR mechanisms
Model              α      Number of Splines   Spline Order   F1 (median)
SCM                0.05   10                  2              0.50
Standardized SCM   0.05   5                   2              0.52
i SCM              0.05   5                   2              0.57

(d) ER(100, 2) DAGs, NONLINEAR mechanisms
Model              α      Number of Splines   Spline Order   F1 (median)
SCM                0.05   10                  3              0.50
Standardized SCM   0.05   10                  2              0.51
i SCM              0.05   10                  3              0.57

E.6 COMPUTE RESOURCES

Our experiments were run on an internal cluster.
All experiments in this work were computed using CPUs with 3 GB of memory per CPU, with the exception of the AVICI runs on graphs with 100 vertices, which used 12 GB per CPU. The data generation takes less than a few minutes on a single CPU, with the exception of the sortability results (Section 5.1). For the sortability results, it takes around 30 minutes to generate the datasets for a single graph specification across all weight supports and graph sizes. This is due to a larger number of configurations and repetitions than in the other experiments. For a single graph specification and across all weight supports and graph sizes, it takes around 6 hours to compute the sortability statistics on a single CPU. All benchmarked methods take no longer than a few minutes per small graph (d = 20) and no longer than half an hour per large graph (d = 100). The SORTNREGRESS baselines run in less than one minute per graph.

F ADDITIONAL EXPERIMENTAL RESULTS

F.1 STRUCTURE LEARNING

Figure 14 summarizes the structural Hamming distance (SHD) between the predicted and true graphs for the same datasets and algorithms as in Figure 5. In Figures 15a and 15b, we present the F1 scores and SHD attained by the structure learning algorithms on data from LINEAR i SCMs, SCMs, and standardized SCMs across different weight distribution supports and graph sizes. We find that the difference in performance of NOTEARS on data sampled from i SCMs and standardized SCMs is larger for larger weight magnitudes and for larger graphs. For smaller weights, the difference in the mean F1 score of NOTEARS between the two standardization approaches is smaller, which is in line with our proposed explanation in Section 5.2 regarding the shift of the implied noise variance distribution. In Figure 15a, we also find that, when weight magnitudes are below 1, R2-SORTNREGRESS performs similarly on standardized SCMs and i SCMs. We also observe this for AVICI. Meanwhile, for larger weights with support extending above 1, these algorithms achieve significantly higher F1 scores on standardized SCMs. This suggests that our condition of |wi,j| > 1 for all edges (vi, vj) in the statement of Theorem 3, concerning the identifiability of linear standardized SCMs, may have more fundamental practical significance rather than being merely an artifact of the analysis. In Figure 18, we report results for the setting in which the additive noise in the ground-truth SCMs is non-Gaussian. In this setting, the causal graphs of SCMs are identifiable from observational data (see Section 2). Here, we also benchmark LINGAM (Shimizu et al., 2006), which is designed for this setting. While LINGAM performs very well as expected, it performs significantly worse on standardized SCMs, possibly because independent component analysis suffers in practice under the very low noise scales implied by post-hoc standardization. This would be in line with our discussion in Section 5.2.

Figure 14: SHD to the true causal graph for LINEAR and NONLINEAR mechanisms. Box plots show median and interquartile range (IQR). Whiskers extend to the largest value inside 1.5 IQR from the boxes. Left (right) column shows results for linear (nonlinear) causal mechanisms with additive noise εi ∼ N(0, 1). LINEAR mechanisms have weights wi,j ∼ Unif [0.5, 2.0].

F.2 R2-SORTABILITY

Figure 16 reports the R2-sortability statistics across varying graph sizes and weight distributions, but now for the denser graphs ER(d, 4) and SF(d, 4). We again observe R2-sortability very close to 0.5 for datasets sampled from i SCMs and high degrees of R2-sortability for data drawn from standardized SCMs. Moreover, in Figure 17, we show the R2-sortability for varying expected node degrees in the graph. Data sampled from i SCMs remains close to not R2-sortable for denser graphs drawn from the graph families considered here. We omit standard SCMs from the plots, as the datasets of SCMs and their standardized versions have the same R2-sortability, since the R2 coefficient is scale invariant.
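For reference, the statistic reported in these figures can be approximated along the following lines. This is our own simplified sketch (function names are ours, R2 is estimated with ordinary least squares, and pairs connected by a directed path are counted once per path length), intended only to illustrate why the measure is scale invariant and why a value of 0.5 indicates no exploitable ordering; the exact definition follows Reisach et al. (2021; 2024).

```python
# Illustrative sketch (not our implementation) of an R2-sortability-style statistic:
# regress each variable on all others, record R^2, and measure how often R^2 increases
# along directed paths of the ground-truth DAG A (binary adjacency, A[i, j] = 1 for i -> j).
# Values near 1 mean R^2 sorts the causal order; values near 0.5 mean it is uninformative.
# Since R^2 is unchanged when individual variables are rescaled, SCM data and its
# post-hoc standardized version yield the same value.
import numpy as np

def r2_scores(X):
    """R^2 of regressing each column of X on all remaining columns (OLS, intercept via centering)."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    r2 = np.zeros(d)
    for i in range(d):
        others = np.delete(np.arange(d), i)
        beta, *_ = np.linalg.lstsq(Xc[:, others], Xc[:, i], rcond=None)
        resid = Xc[:, i] - Xc[:, others] @ beta
        r2[i] = 1.0 - resid.var() / Xc[:, i].var()
    return r2

def sortability(values, A):
    """Fraction of directed paths whose end node has a larger value than its start node (ties 1/2)."""
    d = A.shape[0]
    num, den = 0.0, 0
    Ak = np.eye(d)
    for _ in range(d - 1):              # path lengths 1, ..., d-1
        Ak = Ak @ A
        for i, j in zip(*np.nonzero(Ak)):
            num += 1.0 if values[j] > values[i] else 0.5 if values[j] == values[i] else 0.0
            den += 1
    return num / den if den else 0.5

# r2_sortability = sortability(r2_scores(X), A)
```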
F.3 IMPLIED NOISE SCALES

Figure 19 shows the inverse implied noise scales of standardized SCMs and i SCMs for linear models with smaller weight magnitudes than in Figure 6 of the main text. In this setting with smaller weights, the distributions of the implied noise scales of standardized SCMs and i SCMs show significantly greater overlap than in Figure 6. Since the weights are smaller, the effect of exploding marginal variances, and thus of collapsing implied noise scales, is weaker in the SCMs. In Figure 15 (left), we evaluate the algorithms considered in Section 5 on these systems with smaller weights. We see that, in this setting, NOTEARS performs very similarly on standardized SCMs and i SCMs, with NOTEARS slightly outperforming on i SCMs for larger graphs, since small weight magnitudes do not remove the growing-variance problem completely. This is in line with our reasoning in Section 5.2.

F.4 COVARIANCE MATRICES FOR FIGURE 1

Figure 20 visualizes the full mean absolute covariance (correlation) matrices of the systems presented in Figure 1. The matrices show that the pattern of increasing mean absolute covariance in standardized SCMs is not only a feature of neighboring nodes but also occurs for vertex pairs further apart, though less strongly. This is not the case for i SCMs, where any two pairs of equally spaced vertices have the same covariance in expectation over the weight sampling distribution; a small simulation sketch after the Figure 20 caption below illustrates this.

Figure 15: Structure learning results for different LINEAR weight ranges. Panels: (a) F1 scores, (b) SHD to the true causal graph, each for weight supports [0.3, 0.8] (left) and [1.3, 3.0] (right). Results for LINEAR causal mechanisms with additive noise εi ∼ N(0, 1) and weights sampled uniformly from the support indicated above each column. Box plots show median and interquartile range (IQR). Whiskers extend to the largest value inside 1.5 IQR from the boxes. For every model, we sample 20 systems and n = 1000 data points each.

Figure 16: R2-sortability for different graph sizes. Panels: (a) ER(d, 4), (b) SF(d, 4). Linear standardized SCMs and i SCMs with εi ∼ N(0, 1) and weights drawn from uniform distributions with supports given above each plot. For every model, we sample 100 systems and n = 1000 data points each. Lines and shaded regions denote mean and standard deviation of R2-sortability across runs. Datasets that satisfy R2-sortability = 0.5 (dashed) are not R2-sortable.

Figure 17: R2-sortability for different (expected) node degrees. Panels: (a) ER(100, k), (b) SF(100, k). Linear standardized SCMs and i SCMs with εi ∼ N(0, 1) and weights drawn from uniform distributions with supports given above each plot. For every model, we evaluate 100 systems and n = 1000 samples each. Lines and shaded regions denote mean and standard deviation. Datasets that satisfy R2-sortability = 0.5 (dashed) are not R2-sortable.
Figure 18: Structure learning results for non-Gaussian noise distributions. Causal mechanisms have additive noise εi ∼ Unif [−√3, √3], which induces Var[εi] = 1, and LINEAR mechanisms with weights wi,j ∼ Unif [0.5, 2.0]. Graphs are sampled from ER(20, 2). To obtain the results, we use the same hyperparameters as for the top-left panel of Figure 5. Box plots show median and interquartile range (IQR). Whiskers extend to the largest value inside 1.5 IQR from the boxes.

Figure 19: Distribution over inverse implied noise scales 1/θi² in the implied SCMs for ER(100, 2) graphs with smaller weights wi,j ∼ Unif [0.3, 0.8], estimated with kernel density estimation. Lines and shading denote mean and standard deviation, respectively.

Standardized SCM (left panel of Figure 20):
        1     2     3     4     5     6     7     8     9    10
 1   1.00  0.75  0.64  0.59  0.56  0.53  0.52  0.50  0.49  0.49
 2   0.75  1.00  0.86  0.78  0.74  0.70  0.68  0.66  0.65  0.64
 3   0.64  0.86  1.00  0.91  0.85  0.81  0.79  0.77  0.75  0.74
 4   0.59  0.78  0.91  1.00  0.93  0.89  0.86  0.84  0.82  0.81
 5   0.56  0.74  0.85  0.93  1.00  0.95  0.92  0.89  0.87  0.86
 6   0.53  0.70  0.81  0.89  0.95  1.00  0.96  0.93  0.91  0.90
 7   0.52  0.68  0.79  0.86  0.92  0.96  1.00  0.97  0.95  0.93
 8   0.50  0.66  0.77  0.84  0.89  0.93  0.97  1.00  0.98  0.96
 9   0.49  0.65  0.75  0.82  0.87  0.91  0.95  0.98  1.00  0.98
10   0.49  0.64  0.74  0.81  0.86  0.90  0.93  0.96  0.98  1.00

i SCM (right panel of Figure 20):
        1     2     3     4     5     6     7     8     9    10
 1   1.00  0.75  0.56  0.41  0.31  0.23  0.17  0.13  0.09  0.07
 2   0.75  1.00  0.75  0.56  0.41  0.31  0.23  0.17  0.13  0.09
 3   0.56  0.75  1.00  0.75  0.56  0.41  0.31  0.23  0.17  0.13
 4   0.41  0.56  0.75  1.00  0.75  0.55  0.41  0.31  0.23  0.17
 5   0.31  0.41  0.56  0.75  1.00  0.74  0.56  0.41  0.31  0.23
 6   0.23  0.31  0.41  0.55  0.74  1.00  0.75  0.56  0.41  0.31
 7   0.17  0.23  0.31  0.41  0.56  0.75  1.00  0.74  0.55  0.41
 8   0.13  0.17  0.23  0.31  0.41  0.56  0.74  1.00  0.75  0.56
 9   0.09  0.13  0.17  0.23  0.31  0.41  0.55  0.75  1.00  0.75
10   0.07  0.09  0.13  0.17  0.23  0.31  0.41  0.56  0.75  1.00

Figure 20: Mean absolute covariance matrices for the models in Figure 1. Linear standardized SCMs (left) and i SCMs (right) with 10-variable chain DAGs from x1 to x10, weights wi,j ∼ Unif [0.5, 2.0], and additive noise from N(0, 1). Mean covariances are estimated from n = 100,000 datapoints and averaged over 100,000 models. Since both models have unit marginal variances, covariance equals correlation.
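As a sanity check on this pattern, the chain model from Figure 1 can be simulated directly. The sketch below is illustrative only (function names are ours, and the internal standardization here uses sample statistics of the generated batch rather than the model's exact scale); averaged over weight draws, its first off-diagonal settles near the 0.75 reported above, with correlations depending only on the distance between nodes.

```python
# Illustrative sketch (not our released code): mean absolute correlation matrix of a
# 10-variable chain generated with standardization during the generative process
# (i SCM-style), averaged over randomly drawn weight vectors. The result is approximately
# Toeplitz, unlike the standardized-SCM panel of Figure 20.
import numpy as np

def chain_iscm_abs_corr(d=10, n=2_000, n_models=200, low=0.5, high=2.0, seed=0):
    rng = np.random.default_rng(seed)
    acc = np.zeros((d, d))
    for _ in range(n_models):
        w = rng.uniform(low, high, size=d - 1)        # chain weights w_{i,i+1} ~ Unif[low, high]
        X = np.zeros((n, d))
        X[:, 0] = rng.normal(size=n)                  # root variable, standard normal
        for i in range(1, d):
            raw = w[i - 1] * X[:, i - 1] + rng.normal(size=n)
            X[:, i] = (raw - raw.mean()) / raw.std()  # standardize during generation
        acc += np.abs(np.corrcoef(X, rowvar=False))
    return acc / n_models

print(np.round(chain_iscm_abs_corr(), 2))             # first off-diagonal close to 0.75
```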