# Structure Learning for Directed Trees

Journal of Machine Learning Research 23 (2022) 1-97. Submitted 9/21; Revised 3/22; Published 5/22.

Martin Emil Jakobsen (m.jakobsen@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
Rajen D. Shah (r.shah@statslab.cam.ac.uk), Statistical Laboratory, University of Cambridge, Cambridge, UK
Peter Bühlmann (buhlmann@stat.math.ethz.ch), Seminar for Statistics, ETH Zurich, Zurich, Switzerland
Jonas Peters (jonas.peters@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

Editor: Aapo Hyvärinen

Abstract

Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large, continuous, and nonlinear models, these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method based on Chu-Liu-Edmonds' algorithm, which we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce two methods for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods.

Keywords: causality, restricted causal models, structure learning, directed trees, hypothesis testing.

(c) 2022 Martin Emil Jakobsen, Rajen Shah, Peter Bühlmann and Jonas Peters. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1159.html.

1. Introduction

Learning the underlying causal structure of a stochastic system involving the random vector X = (X1, . . . , Xp) is an important problem in economics, industry, and science. Knowing the causal structure allows researchers to understand whether Xi causes Xj (or vice versa) and how a system reacts under an intervention. However, it is not generally possible to learn the causal structure (or parts thereof) from the observational data of a system alone. Without further restrictions on the system of interest, there might exist another system with a different causal structure inducing the same observational distribution, i.e., the structure might not be identifiable from observed data. Common structure learning methods using observational data are constraint-based (e.g., Pearl, 2009; Spirtes et al., 2000), score-based (e.g., Chickering, 2002), or a mix thereof (e.g., Nandy et al., 2018). Each of these approaches requires different assumptions to ensure identifiability of the causal structure and consistency of the approach. In structural causal models, one assumes that there are (causal) functions f1, . . .
, fp such that for all 1 i p : Xi := fi(XPA(i), Ni), for subsets PA(i) {1, ..., p} and jointly independent noise variables N = (N1, ..., Np) PN (see Definition 1 for a precise definition including further restrictions). The causal graph is constructed as follows: for each variable Xi one adds directed edges from its direct causes or parents PA(i) into i. For such models, system assumptions concerning the causal functions can make the causal graph identified from the observational distribution. Specific assumptions that guarantee identifiability of the causal graph have been studied for, e.g., linear additive Gaussian noise models with equal noise variance (Peters and B uhlmann, 2014), linear additive non-Gaussian noise models (Shimizu et al., 2006), nonlinear additive noise models (Hoyer et al., 2008; Peters et al., 2014), post-nonlinear additive noise models (Zhang and Hyv arinen, 2009), partially-linear additive Gaussian noise models (Rothenh ausler et al., 2018) and discrete models (Peters et al., 2011). Score-based structure learning usually starts with a function ℓassigning a population score to causal structures. Depending on the assumed model class, this function is minimized by the true structure. For example, when considering directed acylic graph (DAGs), the true causal DAG G satisfy G arg min G : G is a DAG ℓ( G). (1) The idea is then to estimate the score from a finite sample and minimize the empirical score over all DAGs. As the cardinality of the space of all DAGs grows super-exponentially in the number of nodes p (Chickering, 2002), brute-force minimization becomes computationally infeasible even for moderately large systems.1 For linear additive Gaussian noise models, assuming the Markov conditions and faithfulness, one can recover the correct Markov equivalence class (MEC) of G, which can be represented by a unique completed partially directed acyclic graph (CPDAG) (Pearl, 2009). The optimization can be done greedily over MECs with greedy equivalent search (GES, Chickering, 2002) or over DAGs (Tsamardinos et al., 2006) and in the former case, the 1. For example, there are over 10275 distinct directed acyclic graphs over 40 nodes (Sloane, 2021). Structure Learning for Directed Trees method is known to be consistent. More specifically, the output of GES search is not guaranteed, for a fixed sample size, to solve the empirical version of Equation (1) but it solves the problem with probability tending to one in the large sample limit. Chickering (1996) showed that, in general, solving the problem in Equation (1) is an NP-hard problem, even if we restrict the search to MECs for structures with fixed causal indegree of K > 2. Several exact exponential runtime algorithms have been proposed, for example, A search (Yuan et al., 2011; Yuan and Malone, 2013) and CPBayes (van Beek and Hoffmann, 2015) for discrete systems, algorithms based on integer linear programming (Jaakkola et al., 2010; Cussens et al., 2017; Cussens, 2011), and algorithms based on dynamic programming (Koivisto and Sood, 2004; Silander and Myllym aki, 2006; Parviainen and Koivisto, 2009). In the nonlinear additive Gaussian noise case, B uhlmann et al. (2014) show that nonparametric maximum-likelihood estimation consistently estimates the correct causal order. However, the greedy search algorithm minimizing the score function does not come with any theoretical guarantees. 
Other heuristic approaches (for discrete or linear Gaussian systems) include acyclic selection ordering-based search (Scanagatta et al., 2015), memetic insert neighbourhood ordering-based search (Lee and Beek, 2017), and max-min hill-climb (Tsamardinos et al., 2006). Recently, methods have been proposed that perform continuous, non-convex optimization (Zheng et al., 2018) but such methods are without guarantees and it is currently debated whether they exploit some artifacts in simulated data (Reisach et al., 2021). Thus, for nonlinear models, there is currently no score-based method that provably guarantees recovery of the true causal graph with high probability. In this paper we focus on models of reduced complexity, namely models with directed trees as causal graphs. This complexity reduction allow for polynomial runtime minimization of the score-function using the Chu Liu Edmonds algorithm (proposed independently by Chu and Liu, 1965; Edmonds, 1967) and it allows for the derivation of hypothesis testing theory. As such the structure learning problem remains computationally feasible even for very large systems. Our method is called causal additive trees (CAT). The method is easy to implement and consists of two steps. In the first step, we employ user-specified (univariate) regression methods to estimate the conditional expectations x 7 E[Xi|Xj = x] for all i = j. We then use these to construct edge weights as inputs to the Chu Liu Edmonds algorithm. This algorithm then outputs a directed tree with minimal edge weight, corresponding to a directed tree minimizing the score in Equation (1). 1.1 Contributions We now highlight four main contributions of the paper: (i) Computational feasibility: Assuming an identifiable model class, such as additive noise, allows us to infer the causal DAG by minimizing Equation (1) for a suitable score function. However, even for trees, the cardinality of the search space grows super-exponentially in the number of variables p. Hence, brute-force minimization (exhaustive search) in Equation (1) remains computationally infeasible for large systems. We propose the score-based method CAT (based on Chu Liu Edmonds algorithm) and prove that it recovers the causal tree with a run-time complexity of O(p2). This method can be useful even when not restricting onself to the class of directed trees: e.g., when using a heuristic method such as Jakobsen, Shah, B uhlmann and Peters greedy search for aiming to find an optimal scoring DAG, one can use the score of the optimal scoring tree as a sanity check or the corresponding tree for initialization. (ii) Consistency: We prove that CAT is pointwise consistent in an identified additive Gaussian noise setup. That is, we recover the causal directed tree with probability tending to one as the sample size increases. Consistency only requires that the regression methods for estimating the conditional mean functions have mean squared prediction error converging to zero in probability. This property that is satisfied by many nonparametric regression methods such as nearest neighbors, neural networks, or kernel methods (see e.g. Gy orfiet al., 2002). Moreover, the vanishing estimation error is only required for causal edges for which the conditional means coincide with the causal functions. We also derive sufficient conditions that ensure consistency in an asymptotic setup with vanishing identifiability. 
Specifically, we show that consistency is retained even when the identifiability gap decreases at a rate qn with q 1 n = o( n) as long as the conditional expectation mean squared prediction error corresponding to the causal edges vanishes at a rate op(qn). (iii) Hypothesis testing: We provide two algorithms for performing hypothesis tests concerning the presence and absence of substructures, such as particular edges, in the true causal graph. The type I error is controlled asymptotically when the mean squared prediction error of the regression corresponding to the true causal edges decays at a relatively slow op(n 1/2) rate. The tests are valid post-selection, that is, the hypotheses to be tested may be chosen after the graph has been estimated, and when multiple tests are performed, the family-wise error rate is controlled for any number of tests. Furthermore, one of the two proposed testing procedures is valid in the non-identified setting. (iv) Identifiability analysis: We analyze the identifiability gap, that is, the smallest population score difference between an alternative graph and the causal graph. The reduced system complexity, due to the restriction to trees, allows us to derive simple yet informative lower bounds. For additive Gaussian noise models, for example, the lower bound can be computed using only local properties of the underlying model: it is based on a first term that considers the minimal score gap between individual edge reversals and a second term involving the minimal mutual information of two neighboring nodes, when conditioning on another neighbor of the parent node. 1.2 Related Constraint-based Approaches As an alternative to score-based methods, constraint-based methods such as PC or FCI (Spirtes et al., 2000) test for conditional independences statements in PX and use these results to infer (parts of) the causal structure. Such methods usually assume that PX is both Markov and faithful with respect to the causal graph G. Under these assumptions, the Markov equivalence class of the causal graph G is identified. In a jointly Gaussian setting (e.g. linear additive Gaussian noise models), consistency of constraint-based approaches relies on faithfulness, whereas uniform consistency requires strong faithfulness (see, e.g., Zhang and Spirtes, 2002; Kalisch and B uhlman, 2007) a condition that has been shown to be strong (Uhler et al., 2013). In nonlinear settings, corresponding guarantees do not exist. This may at least partially be due to the fact that conditional independence testing is known to be a hard statistical problem (Shah and Peters, 2020). Structure Learning for Directed Trees Constraint-based methods have also been studied for polytrees. A polytree is a DAG whose undirected graph is a tree. Polytrees, unlike directed trees, allow for multiple root nodes as well as nodes with multiple parents. Rebane and Pearl (1987), inspired by the work of Chow and Liu (1968), propose a constraint-based structure learning method for polytrees over discrete variables that can identify the correct skeleton and causal basins, structures constructed from nodes with at least two parents. More precisely, the skeleton is determined by the maximum weight spanning tree (MWST) algorithm with mutual information measure weights, while the directionality of edges is inferred by conditional independence constraints implied by the observed distribution. 
In the case of causal trees this constraintbased structure learning method cannot direct any edges because causal basins do not exist (Rebane and Pearl, 1987). Dominguez et al. (2013) and Ouerd (2000) extend the Rebane and Pearl (1987) algorithm for causal discovery to multivariate Gaussian polytree distributions. Friedman et al. (1997) propose a similar algorithm to learn tree Bayesian networks by finding a MWST with mutual information weights. This recovers the skeleton of the causal graph, after which an arbitrary root node is selected and all edges are oriented away from said root node. As such, the method of Friedman et al. (1997) is only guaranteed to recover a directed tree that is Markov equivalent to the causal directed tree. In this work, we employ Chu Liu Edmonds algorithm, a directed analogue of the MWST algorithm, to not only recover the skeleton but also the direction of all edges in the causal graph. This is possible since we consider restricted causal models, e.g., nonlinear additive Gaussian noise models. More specifically, these restricted causal models allow us to define edge weights that, unlike the mutual information weights, preserve directionality information. In fact, when discarding information that allows us to infer directionality of the edges, one recovers the mutual information weights of Rebane and Pearl (1987), see Remark 1 in Appendix B for details. 1.3 Organization of the Paper In Section 2, we define the setup and relevant score functions. We further strengthen existing identifiability results for nonlinear additive noise models. In Section 3, we propose CAT, an algorithm solving the score-based structure learning problem that is based on Chu Liu Edmonds algorithm. We prove consistency of CAT for a fixed distribution and for a setup with vanishing identifiability. In Section 4, we provide results on asymptotic normality of the scores, construct confidence regions and propose feasible testing procedures. Section 5, we analyzes the identifiability gap. Section 6 shows the results of various simulation experiments. All proofs can be found in Appendix D. 2. Score-based Learning and Identifiability of Trees In the remainder of this work we use of the following graph terminology (a more detailed introduction can be found in Appendix A, see also Koller and Friedman, 2009). A directed graph G = (V, E) consists of p N>0 vertices (or nodes) V = {1, . . . , p} and a collection of directed edges E {(i j) (i, j) : i, j V, i = j}. A directed acyclic graph (DAG) is a directed graph that does not contain any directed cycles. A directed tree is a connected DAG in which all nodes have at most one parent. The unique node of a directed tree G Jakobsen, Shah, B uhlmann and Peters with no parents is called the root node and is denoted by rt(G). We let Tp denote the set of directed trees over p N>0 nodes. 2.1 Identifiability of Causal Additive Tree Models We now revisit and strengthen known identifiability results on restricted structural causal models. Consider a distribution that is induced by a structural causal model (SCM) with additive noise. Then, there are only special cases (such as linear additive Gaussian noise models) for which alternative models with a different causal structure exist that generate the same distribution (see Peters et al., 2017, for an overview). To state and strengthen these results formally, we introduce the following notation. 
For any k N we define the following classes of functions from R to R: M denotes all measurable functions, Dk denotes the set of all k times differentiable functions and Ck denotes the k times continuously differentiable functions. We let P denote the set of mean zero probability measures on R that have a density with respect to Lebesgue measure. P+ P denotes the subset for which a density is strictly positive. For any function class F {f|f : R R}, PF P denotes the subset with a density function in F. As a special case, we let PG P+C := P+ PC denote the subset of Gaussian probability measures. For any set P of probability measures, Pp denotes all p-dimensional product measures on Rp with marginals in P. We now define structural causal additive tree models (or causal additive tree models, for short) as SCMs with a tree structure.2 Definition 1 (Structural causal additive tree models) Consider a class Tp Mp Pp. Any tuple (G, (fi), PN) Tp Mp Pp induces a structural causal model over X = (X1, . . . , Xp) given by the following structural assignments Xi := fi(Xpa G(i)) + Ni, for all 1 i p, where frt(G) 0 and N = (N1, . . . , Np) PN, which we call a structural causal additive tree model. By slight abuse of notation, we write Q Tp Mp Pp for a probability distribution that is induced by a structural causal additive tree model. Furthermore, we define the set of restricted structural causal additive tree models. We will see later that for these models, the causal graph is identifiable from the observable distribution of the system. When the causal graph of a sufficiently nice additive noise SCM is not identifiable, then certain differential equations must hold (see the proof of Proposition 4 for details). The definition of restricted structural causal additive tree models ensures that this does not happen. Definition 2 (Restricted structural causal additive tree models) The collection of restricted structural causal additive tree models ΘR Tp Dp 3 Pp +C3 is given by all models θ = (G, (fi), PN) Tp Dp 3 Pp +C3 satisfying the following conditions for all i {1, . . . , p} \ {rt(G)}: (i) fi is nowhere constant, i.e., it is not constant on any non-empty open set, and 2. This model class comes with the strong assumption on additive noise, which excludes certain types of hidden confounding, for example. Structure Learning for Directed Trees (ii) the induced log-density ξ of Xpa G(i), noise log-density ν of Ni and causal function fi are such that there exists x, y R with ν (y fi(x))f i(x) = 0 such that ξ = ξ f i f i ν f i ν 2ν f i f i + ν f i + ν ν f i f i ν ν (f i )2 where the derivatives of ξ, ν and fi are evaluated in x, y fi(x) and x, respectively. The following lemma, due to Hoyer et al. (2008), shows that for causal additive tree models with Gaussian noise, the differential equation constraints of Definition 2 simplify.3 Lemma 3 Let θ = (G, (fi), PN) Tp Dp 3 Pp G. Assume that for all i {1, . . . , p}\{rt(G)} the following two conditions hold (a) fi is nowhere constant and (b) fi is not linear. Then, θ ΘR. Existing identifiability results for causal graphs in restricted SCMs (Hoyer et al., 2008; Peters et al., 2014) are stated and proven in terms of the ability to distinguish the induced distributions of two restricted structural causal models: For all θ = (G, . . .) ΘR and θ = ( G, . . .) ΘR, if G = G, then L(Xθ) = L(X θ) (where L denotes the distribution of a random variable), that is, Xθ and X θ do not have the same distribution. 
We now prove a stronger identifiability result that does not assume that θ is a restricted causal model. Proposition 4 (Identifiability of causal additive tree models) Suppose that Xθ and X θ are generated by the SCMs θ = (G, (fi), PN) ΘR Tp Dp 3 Pp +C3 and θ = ( G, ( fi), PN) Tp Dp 1 Pp C0, respectively. It holds that L(Xθ) = L(X θ) = G = G. We prove Proposition 4 using the techniques of Peters et al. (2014). While we prove the statement only for restricted causal additive tree models, which suffices for this work, we conjecture that a similar extension holds for restricted structural causal DAG models. The extension of Proposition 4 is important for the following reason. Given a finite data set, practical methods usually assume that the true distribution is induced by an underlying restricted SCM. One can then fit different causal structures and output the structure that fits the data best. The above extension accounts for the fact that regression methods hardly represent all such restrictions: e.g., most nonlinear regression techniques can also fit linear models. 2.2 Score Functions We now define population score functions which are later used to recover the causal tree. We henceforth assume that X Rp is a random vector with distribution PX generated by a restricted causal additive tree model θ = (G, (fi), PN) ΘR Tp Dp 3 Pp +C3 with G = (V, E) Tp such that E X 2 2 < . Thus, G denotes the causal tree. We use G Tp to denote an arbitrary, different (directed) tree. For the remainder of this paper, we assume 3. For completeness, we include the proof of Lemma 3 in Appendix D, using the approach of Zhang and Hyv arinen (2009) but expressed in our notation. Jakobsen, Shah, B uhlmann and Peters that for any i = j it holds that Xi E[Xi|Xj] has a density with respect to Lebesgue measure.4 We often refer to one of the following two scenarios: either, (i), we have limited a priori information that PN Pp +C3, or, (ii), we know that the noise innovations are Gaussian, that is, PN Pp G. Definition 5 For any graph G Tp we define for each node i V the (i) local Gaussian score as ℓG( G, i) := log Var Xi E h Xi|Xpa G(i) i /2, (ii) local entropy score as ℓE( G, i) := h Xi E h Xi|Xpa G(i) i , (iii) local conditional entropy score as ℓCE( G, i) := h Xi|Xpa G(i) . Here, we use the convention that E(Xi| ) = 0 and h(Xi| ) = h(Xi); the functions h( ), h( | ), and h( , ) (used below) denote the differential entropy, conditional entropy, and cross entropy, respectively. The Gaussian, entropy and conditional entropy score of G are, respectively, given by the sum of local scores: i=1 ℓG( G, i), ℓE( G) := i=1 ℓE( G, i), ℓCE( G) := i=1 ℓCE( G, i). (See Polyanskiy and Wu (2019) or Cover and Thomas (2006) for the basic informationtheoretic concepts used in this paper.) Similar scores have been considered by B uhlmann et al. (2014) and Mooij et al. (2016), for example. For linear additive Gaussian noise systems, the Gaussian score of Definition 5 is proportional to the large sample limit of the Gaussian log-likelihood score function commonly used in for Bayesian network learning (see, e.g., Koller and Friedman, 2009). The following lemma shows that the Gaussian score of the graph G Tp arises naturally as a translated infimum cross entropy between PX and all Q induced by causal additive tree models with Gaussian noise. Similarly, the entropy score can be seen as an infimum cross entropy between PX and all Q induced by another class of SCMs. 
Lemma 6 For any G Tp it holds that ℓG( G) = inf Q { G} Dp 1 Pp G h(PX, Q) p log( Furthermore, with F( G) := (Fi( G))1 i p, where Fi( G) := {x 7 E[Xi|Xpa G(i) = x]} for all 1 i p, it holds that ℓE( G) = inf Q { G} F( G) Pp h(PX, Q). 4. This ensures that the entropy score function introduced in Definition 5 below is well-defined and that the analysis of the identifiability gap in Section 5 is valid. Structure Learning for Directed Trees Score-based methods identify the underlying structure by evaluating the score functions (or estimates thereof) on different graphs and choosing the best scoring graph. The difference between the score ℓ (G) of the true graph and the score ℓ ( G) of the best scoring alternative graph G is an important property of the problem: e.g., if it would be zero, we could not identify the true graph from the scores. We, therefore, refer to expressions of the form min G Tp\{G} ℓ ( G) ℓ (G) as the identifiability gap. In the remainder of this paper, we refer to strict positivity of the identifiability gap as Assumption 1. Assumption 1 If θ ΘR Tp Dp 3 Pp G or θ ΘR Tp Dp 3 Pp +C3 it holds that min G Tp\{G} ℓG( G) ℓG(G) > 0 or min G Tp\{G} ℓE( G) ℓE(G) > 0, (3) respectively. Assumption 1 does not trivially follow from the results further above. By arguments similar to those in Lemma 6 we have that, if the true data-generating model is a restricted causal additive tree model with Gaussian noise, θ ΘR Tp Dp 3 Pp G, then ℓG(G) = h(PX) p log( 2πe). Hence, the Gaussian score gap between G and the causal graph G equals ℓG( G) ℓG(G) = inf Q { G} Dp 1 Pp G h(PX, Q) h(PX) = inf Q { G} Dp 1 Pp G DKL(PX Q), where DKL denotes the Kullback-Leibler divergence measure. Proposition 4 implies that G = G, Q { G} Dp 1 Pp G : DKL(PX Q) > 0. However, this does not immediately imply that the identifiability gap (where we take the infimum over such Q) is strictly positive. Similar considerations5 hold for the entropy score gap ℓE( G) ℓE(G) = inf Q { G} F( G) Pp DKL(PX Q). In Section 5 we derive informative lower bounds on the Gaussian and entropy identifiability gaps (i.e., the infimum KL-divergence) of Equation (3). It is possible to enforce Assumption 1 indirectly by the assumptions and modifications detailed in the following lemma. Lemma 7 Assumption 1 holds if one of the following conditions is satisfied. (a) We have a restricted causal additive tree model with Gaussian noise θ ΘR Tp Dp 3 Pp G and for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version. (b) We have a restricted causal additive tree model with Gaussian noise θ ΘR Tp Dp 3 Pp G and for all 1 i p it holds that the causal function fi is contained within a function class Fi D1 which satisfies arg minf Fi E[(Xi f (Xj))2] Fi for all j = i, and we consider a modified Gaussian score function ℓG.mod : Tp R with local score given by ℓG.mod( G, i) := log(min f Fi E[(Xi f(Xpa G(i)))2])/2. 5. In fact, Proposition 4 does not immediately imply that DKL(PX Q) > 0 for Q { G} F( G) Pp as it does not necessarily hold that the causal functions in F( G) are differentiable or that the noise innovation densities in Pp are continuous. Jakobsen, Shah, B uhlmann and Peters (c) We have a restricted causal additive tree model θ ΘR Tp Dp 3 Pp +C3, for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version and for all i = j it holds that Xi E[Xi|Xj] has a continuous density. The modified Gaussian score function and restrictions of condition (b) in Lemma 7 coincides with the working conditions of B uhlmann et al. (2014). 
Alternative information-theoretic conditions guaranteeing that Assumption 1 holds are derived in Section 5. If Assumption 1 is satisfied, then we can use the score functions to identify the true causal graph of a restricted structural model: in the Gaussian noise setting, for example, we have
$$G = \arg\min_{\tilde G \in \mathcal{T}_p} \ell_G(\tilde G). \tag{4}$$
In practice, we consider estimates of the above quantities and optimize the corresponding empirical loss function. Solving Equation (4) (or its empirical counterpart) using exhaustive search is computationally intractable already for moderately large choices of p.6 We now introduce CAT, a computationally efficient method that solves the optimization exactly.

6. In the context of linear Gaussian noise models, Chickering (2002) proves consistency of greedy equivalence search towards the correct Markov equivalence class. This, however, does not imply that the optimization problem in Equation (4) is solved: for a given sample, the method is not guaranteed to find the optimal scoring graph (but the output will converge to the correct graph).

3. Causal Additive Trees (CAT)

We introduce the population version of our algorithm CAT in Section 3.1 and discuss its finite sample version and asymptotic properties in Sections 3.2 and 3.3.

3.1 An Oracle Algorithm

As in the case of DAGs, the problem in Equation (4) is a combinatorial optimization problem for which the cardinality of the search space grows super-exponentially with p. Indeed, the number of undirected trees on p labelled nodes is $p^{p-2}$ (Cayley, 1889), and therefore $p^{p-1}$ is the corresponding number of labelled directed trees. For the class of DAGs (which includes directed trees), existing structure learning methods such as Bühlmann et al. (2014) propose a greedy search technique that iteratively selects the lowest scoring directed edge under the constraint that no cycle is introduced in the resulting graph. In general, greedy search procedures do not come with any guarantees, and there are indeed situations in which they fail. By exploiting the assumption of a tree structure, we will see that the optimization problem of Equation (4) can be solved computationally efficiently, without the need for heuristic optimization techniques. Provided with a connected directed graph with edge weights, Chu-Liu-Edmonds' algorithm finds a minimum edge weight directed spanning tree, provided that such a directed tree exists. That is, for a connected directed graph $H = (V, E_H)$ on the nodes $V = \{1, \dots, p\}$ with edge weights $w := \{w_{ji} : (j \to i) \in E_H\}$, Chu-Liu-Edmonds' algorithm recovers a minimum edge weight directed spanning tree (MWDST) subgraph of H,
$$\arg\min_{\tilde G = (V, \tilde E) \in \mathcal{T}_p^H} \;\sum_{(j \to i) \in \tilde E} w_{ji},$$
where $\mathcal{T}_p^H$ denotes the set of all directed spanning trees of H. The runtime of the original algorithms of Chu and Liu (1965) and Edmonds (1967) is $O(|E_H|\,p) \subseteq O(p^3)$. Karp (1971) presented an alternative proof of the correctness of the algorithm of Edmonds (1967). Tarjan (1977) devised a modification (corrected by Camerini et al., 1979) with runtime $O(\min\{|E_H|\log(p), p^2\})$.7 Gabow et al. (1986) devised yet another modification with runtime $O(p\log p + |E_H|)$ and noted that no further improvements to the algorithm can be made (since it uses only binary decisions and can be used to sort p numbers).
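As a small illustration of this subproblem, the following sketch (not the authors' code) recovers a MWDST from toy edge weights using the NetworkX implementation of Edmonds' algorithm mentioned below; the weight values are arbitrary placeholders.

```python
import networkx as nx

# Toy edge weights w[(j, i)] on a fully connected directed graph over 4 nodes;
# in CAT these would be the Gaussian or entropy edge weights defined below.
w = {(0, 1): -0.9, (1, 0): -0.1, (0, 2): -0.2, (2, 0): -0.3,
     (1, 2): -1.1, (2, 1): -0.4, (0, 3): -0.5, (3, 0): -0.1,
     (1, 3): -0.2, (3, 1): -0.3, (2, 3): -1.0, (3, 2): -0.2}

H = nx.DiGraph()
for (j, i), wji in w.items():
    H.add_edge(j, i, weight=wji)

# Edmonds' algorithm: minimum edge weight directed spanning tree (arborescence) of H.
mwdst = nx.minimum_spanning_arborescence(H, attr="weight")
print(sorted(mwdst.edges()))
```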
In our experiments, we use the C++ implementation of Tarjan's modification by Tofigh and Sjölund (2007), which is contained in the R package RBGL (Carey et al., 2021), and the Python implementation of Edmonds' version from the Python package NetworkX (Hagberg et al., 2022). The causal graph recovery problem in Equation (4) is equivalently solved by finding a minimum edge weight directed tree, i.e., a minimum edge weight directed spanning tree of the fully connected graph on the nodes V. For example, finding the minimum of the Gaussian score function is equivalent to minimizing a translated version of the Gaussian score function,
$$\arg\min_{\tilde G \in \mathcal{T}_p} \ell_G(\tilde G) = \arg\min_{\tilde G \in \mathcal{T}_p} \sum_{i=1}^p \left\{ \tfrac{1}{2}\log\big(\mathrm{Var}(X_i - E[X_i \mid X_{\mathrm{pa}_{\tilde G}(i)}])\big) - \tfrac{1}{2}\log\big(\mathrm{Var}(X_i)\big) \right\} = \arg\min_{\tilde G \in \mathcal{T}_p} \sum_{i=1}^p \tfrac{1}{2}\log\frac{\mathrm{Var}(X_i - E[X_i \mid X_{\mathrm{pa}_{\tilde G}(i)}])}{\mathrm{Var}(X_i)}. \tag{5}$$
Since the summand for the root node in Equation (5) equals zero, we only need to sum over all nodes with an incoming edge in $\tilde G$. Now define the Gaussian edge weights $w^G := (w^G_{ji})_{j\neq i}$ by
$$w^G_{ji} := \tfrac{1}{2}\log\frac{\mathrm{Var}(X_i - E[X_i \mid X_j])}{\mathrm{Var}(X_i)}, \quad \text{for all } j \neq i.$$
Hence, for a causal additive tree model with Gaussian noise satisfying Assumption 1, the causal directed tree is given by the MWDST with respect to the Gaussian edge weights,
$$G = \arg\min_{\tilde G \in \mathcal{T}_p} \ell_G(\tilde G) = \arg\min_{\tilde G = (V,\tilde E) \in \mathcal{T}_p} \sum_{(j\to i)\in \tilde E} w^G_{ji}.$$
Similarly, the minimum of the entropy score function is given by the MWDST with respect to the entropy edge weights $w^E := (w^E_{ji})_{j\neq i}$ given by
$$w^E_{ji} := h(X_i - E[X_i \mid X_j]) - h(X_i), \quad \text{for all } j \neq i.$$
We will henceforth denote the methods that apply Chu-Liu-Edmonds' algorithm to find the MWDST with respect to the Gaussian and entropy edge weights by CAT.G and CAT.E, respectively.

7. The algorithms presented in both Edmonds (1967) and Tarjan (1977) find minimum branchings of H, i.e., directed forest spanning subgraphs of H with minimum edge weight. Note that the MWDST problem is invariant to identical translation of all edge weights. If H is a fully connected graph and we translate all edge weights $w'_{ji} := w_{ji} - \varepsilon \max\{w_{ji} : j \neq i\}$ for $\varepsilon > 1$, then a minimum branching using the edge weights $(w'_{ji})$ is a MWDST subgraph of H. For testing purposes, we also need to be able to find MWDST subgraphs of non-fully connected graphs H; hence, as noted by Edmonds (1967), if we translate all edge weights $w'_{ji} := w_{ji} - \sum_{j\neq i}|w_{ji}|$, then a minimum branching using the edge weights $(w'_{ji})$ is a MWDST subgraph of H.

3.2 Finite Sample Algorithm

Given an $n \times p$ data matrix $\mathbf{X}_n$, representing n i.i.d. copies of X = (X1, . . . , Xp), we estimate the edge weights by simple plug-in estimators. Let us denote the conditional expectation function and its estimate by
$$\varphi_{ji}(x) := E[X_i \mid X_j = x], \qquad \hat\varphi_{ji}(x) := \hat E[X_i \mid X_j = x], \tag{7}$$
for all $j \neq i$. The empirical Gaussian edge weights $\hat w^G = (\hat w^G_{ji})_{j\neq i}$ are then given by
$$\hat w^G_{ji} := \tfrac{1}{2}\log\frac{\widehat{\mathrm{Var}}(X_i - \hat\varphi_{ji}(X_j))}{\widehat{\mathrm{Var}}(X_i)}, \quad \text{for all } j \neq i, \tag{8}$$
where $\widehat{\mathrm{Var}}(\cdot)$ denotes a variance estimator using the sample $\mathbf{X}_n$. We now propose to combine the Chu-Liu-Edmonds algorithm described above with the Gaussian score, as detailed in Algorithm 1. It is also possible to combine CAT with standard pruning techniques (see, e.g., Bühlmann et al., 2014) that, e.g., based on approximate p-values, remove insignificant edges and output directed forests. An R implementation of CAT with options for cross-fitting and pruning is available on GitHub.8

8. https://github.com/MartinEmilJakobsen/CAT

Algorithm 1 Causal additive trees (CAT)
1: procedure CAT($\mathbf{X}_n$, regression method)
2:   Run the regression method to obtain $\hat\varphi_{ji}$ for all $j \neq i$.
3:   Compute the empirical edge weights $\hat w^G$, see Equation (8).
4:   Apply Chu-Liu-Edmonds' algorithm to find the MWDST with respect to $\hat w^G$.
5:   return MWDST $\hat G$.
6: end procedure
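To make the steps concrete, here is a minimal Python sketch of Algorithm 1 (an illustration only, not the authors' R implementation; the k-nearest-neighbour regression, the simulated toy data, and all constants are placeholder choices).

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import KNeighborsRegressor

def cat_gaussian(X, n_neighbors=10):
    """Simplified sketch of Algorithm 1 with empirical Gaussian edge weights.

    X: (n, p) data matrix of i.i.d. observations. Returns a networkx.DiGraph."""
    n, p = X.shape
    H = nx.DiGraph()
    for j in range(p):
        for i in range(p):
            if i == j:
                continue
            # Step 2: regression estimate of phi_ji(x) = E[X_i | X_j = x].
            reg = KNeighborsRegressor(n_neighbors=n_neighbors)
            reg.fit(X[:, [j]], X[:, i])
            resid = X[:, i] - reg.predict(X[:, [j]])
            # Step 3: empirical Gaussian edge weight, Equation (8).
            H.add_edge(j, i, weight=0.5 * np.log(resid.var() / X[:, i].var()))
    # Step 4: Chu-Liu-Edmonds / Edmonds' algorithm for the MWDST.
    return nx.minimum_spanning_arborescence(H, attr="weight")

# Toy usage: data simulated from the causal tree 0 -> 1 -> 2 with additive noise.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = np.sin(3 * x0) + 0.3 * rng.normal(size=n)
x2 = x1 ** 2 + 0.3 * rng.normal(size=n)
G_hat = cat_gaussian(np.column_stack([x0, x1, x2]))
print(sorted(G_hat.edges()))  # ideally [(0, 1), (1, 2)]
```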
By default, we suggest using the empirical Gaussian edge weights as described in Algorithm 1. However, it is also possible to run Chu-Liu-Edmonds' algorithm on the empirical entropy edge weights $\hat w^E = (\hat w^E_{ji})_{j\neq i}$ given by
$$\hat w^E_{ji} := \hat h(X_i - \hat\varphi_{ji}(X_j)) - \hat h(X_i), \quad \text{for all } j \neq i,$$
where $\hat h(\cdot)$ denotes a user-specified entropy estimator using the observed data $\mathbf{X}_n$. Estimating differential entropy is a difficult statistical problem, but the simulation experiments in Section 6 demonstrate that it can be beneficial to use the estimated entropy edge weights when the additive noise distributions are highly non-Gaussian. Under suitable conditions on the (possibly nonparametric) regression technique, we now show that the proposed algorithm consistently recovers the true causal graph of a causal additive tree model with Gaussian noise using the empirical Gaussian edge weights.

3.3 Consistency

We study a version of the CAT.G algorithm applied to a causal additive tree model with Gaussian noise where the regression estimates are trained on auxiliary data, simplifying the theoretical analysis. We believe that consistency without sample splitting holds but may require somewhat stronger conditions (in the experimental section, we do not use sample splitting). As such, we view the sample splitting only as a theoretical device for simplifying proofs and do not recommend it in practical applications. For each n we let $\mathbf{X}_n = ((X_{1,i})_{1\le i\le p}, \dots, (X_{n,i})_{1\le i\le p})$ and $\tilde{\mathbf{X}}_n = ((\tilde X_{1,i})_{1\le i\le p}, \dots, (\tilde X_{n,i})_{1\le i\le p})$ denote independent datasets, each consisting of n i.i.d. random variables with distribution identical to that of $X = (X_1, \dots, X_p) \in \mathbb{R}^p$. We suppose that the regression estimates $\hat\varphi_{ji}$ have been trained on $\tilde{\mathbf{X}}_n$ and then compute the edge weights using $\mathbf{X}_n$ as in step 3 of Algorithm 1:
$$\hat w^G_{ji} := \frac{1}{2}\log\frac{\frac{1}{n}\sum_{k=1}^n \big(X_{k,i} - \hat\varphi_{ji}(X_{k,j})\big)^2}{\frac{1}{n}\sum_{k=1}^n X_{k,i}^2 - \big(\frac{1}{n}\sum_{k=1}^n X_{k,i}\big)^2}. \tag{9}$$
The consistency results may be extended to cross-fitted edge weight estimators formed as an average of estimators of the form in (9) with the roles of the $\mathbf{X}_n$ and $\tilde{\mathbf{X}}_n$ samples interchanged, which would make full use of the available data. The following result shows pointwise consistency of CAT.G whenever the conditional mean estimation is weakly consistent.

Theorem 8 (Pointwise consistency) Suppose that for all $j \neq i$ the following two conditions hold: (a) if $(j\to i) \in E$, then $E[(\hat\varphi_{ji}(X_j) - \varphi_{ji}(X_j))^2 \mid \tilde{\mathbf{X}}_n] \xrightarrow{P} 0$; (b) if $(j\to i) \notin E$, then $E[(\hat\varphi_{ji}(X_j) - \bar\varphi_{ji}(X_j))^2 \mid \tilde{\mathbf{X}}_n] \xrightarrow{P} 0$ for some fixed $\bar\varphi_{ji} : \mathbb{R} \to \mathbb{R}$, where $\varphi_{ji}$ and $\hat\varphi_{ji}$ are defined in Equation (7). Furthermore, suppose that Assumption 1 holds. Then, in the large sample limit, we recover the causal graph with probability tending to one, that is, $P(\hat G = G) \to 1$ as $n \to \infty$, where $\hat G$ is the output of Algorithm 1 using the weights $\hat w^G$ given by Equation (9).

Theorem 8 states that, under the given assumptions, the estimated graph converges to the true causal graph with probability tending to one. In fact, the assumptions are fairly weak: we only require weakly consistent estimation of the conditional means for edges that are present in the causal graph; these represent causal relationships and are often assumed to be smooth. This distinction allows us to employ regression techniques that are consistent only for those function classes that we consider reasonable for modeling the causal mechanisms. For non-causal edges, $(j\to i) \notin E$, the estimator $\hat\varphi_{ji}$ only needs to converge to some function $\bar\varphi_{ji}$, which does not necessarily need to be the conditional mean.
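Before turning to the vanishing-identifiability regime, here is a small sketch of the sample-split weight in Equation (9) and of the cross-fitted variant mentioned above (the k-nearest-neighbour regressor is a placeholder choice, not the authors' implementation).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def split_weight(X_train, X_eval, j, i, n_neighbors=10):
    """Sample-split Gaussian edge weight of Equation (9): the regression for
    phi_ji is fitted on X_train (the auxiliary sample) and the weight is
    evaluated on X_eval."""
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train[:, [j]], X_train[:, i])
    resid = X_eval[:, i] - reg.predict(X_eval[:, [j]])
    return 0.5 * np.log(np.mean(resid ** 2) / X_eval[:, i].var())

def crossfit_weight(Xa, Xb, j, i):
    """Cross-fitted weight: average of the two sample-split estimators with the
    roles of the two halves interchanged, as discussed above."""
    return 0.5 * (split_weight(Xa, Xb, j, i) + split_weight(Xb, Xa, j, i))
```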
3.3.1 Consistency under Vanishing Identifiability

We now consider an asymptotic regime involving a sequence $(\theta_n)_{n\in\mathbb{N}}$ of SCMs with potentially changing conditional mean functions $\varphi_{ji}$ and a vanishing identifiability gap. We have the following result.

Theorem 9 (Consistency under vanishing identifiability) Let $(\theta_n)_{n\in\mathbb{N}}$ be a sequence of SCMs on $p\in\mathbb{N}$ nodes, all with the same causal directed tree $G = (V, E)$, such that
(i) for $q_n := \min_{\tilde G\in\mathcal{T}_p\setminus\{G\}} \ell_G(\tilde G) - \ell_G(G)$ (the identifiability gap of model $\theta_n$), we have $q_n^{-1} = o(\sqrt{n})$;
(ii) for all $(j\to i)\in E$ and $\varepsilon > 0$, $P_{\theta_n}\big(q_n^{-1}\, E_{\theta_n}[(\varphi_{ji}(X_j) - \hat\varphi_{ji}(X_j))^2 \mid \tilde{\mathbf{X}}_n] > \varepsilon\big) \to 0$;
(iii) for all $j\neq i$ and $\varepsilon > 0$, $P_{\theta_n}\big(q_n^{-2} n^{-1}\, E_{\theta_n}[(\varphi_{ji}(X_j) - \hat\varphi_{ji}(X_j))^4 \mid \tilde{\mathbf{X}}_n] > \varepsilon\big) \to 0$; and
(iv) there exists $C > 0$ such that for all $j\neq i$, $\inf_n P_{\theta_n}(\mathrm{Var}_{\theta_n}(X_i\mid X_j) \ge C) = 1$ and $\sup_n E_{\theta_n}\|X\|_2^4 < \infty$.
Then it holds that $P(\hat G = G) \to 1$ as $n\to\infty$.

Condition (i) asks that the identifiability gap $q_n$ goes to zero more slowly than the standard $1/\sqrt{n}$ convergence rate of estimators in regular parametric models. Such a requirement would be necessary in almost any structure identification problem. Condition (ii) requires the mean squared error of the regression estimates corresponding to true causal edges to be $o_P(q_n)$. We regard this as a fairly mild assumption: indeed, the minimax rate of estimation of regression functions in Hölder balls with smoothness $\beta$ is $n^{-2\beta/(2\beta+1)}$ (Tsybakov, 2009). Thus, we can expect that if the causal regression functions have smoothness $\beta > 1/2$ and all lie in a Hölder ball, (ii) can be satisfied for any $q_n$ satisfying (i). Condition (iii) allows the fourth moments of the estimation errors to increase at any rate slower than $nq_n^2$; of course, we would typically expect this error to decay, at least for the causal edges.

4. Hypothesis Testing

This section presents two procedures to test any substructure hypothesis regarding the causal directed tree of a causal additive tree model with Gaussian noise. We continue our analysis using the sample split estimators of Equation (9), where the conditional expectations are estimated on an auxiliary dataset. Our approach makes use of the fact that the estimated weights in Equation (9) are logarithms of ratios of averages of i.i.d. quantities, and thus the joint distribution of the estimated edge weights should, with appropriate centering and scaling, be asymptotically Gaussian; see Lemma D.4 in Appendix D for the precise statement. This allows us to create a (biased) confidence region for the true edge weights, which in turn gives a confidence set for the true graph. This confidence set of graphs is not necessarily straightforward to compute and list. However, we show that it can be queried to test hypotheses of interest, such as the presence or absence of a particular edge. As these hypothesis tests are derived from a confidence region, they are valid even when the hypothesis to be tested has been chosen after examining the data. Similar to the results in the previous sections, we avoid making assumptions on the performance of the regressions corresponding to non-causal edges. Unlike in the consistency analysis, however, here we do not, in general, require identifiability of the true graph. In order to state our results and assumptions, we introduce the following notation.
For a collection of variables $(K_{ji})_{j\neq i}$, we let $K_i := (K_{1i}, \dots, K_{(i-1)i}, K_{(i+1)i}, \dots, K_{pi})^\top \in \mathbb{R}^{p-1}$; furthermore, for any collection $(K_i)_{1\le i\le p}$, we let $K := (K_1^\top, \dots, K_p^\top)^\top$. With this notation, let, for all $k \in \{1, \dots, n\}$, the vectors of squared residuals and squared centered observations be given by
$$\hat M_k := \big\{(X_{k,i} - \hat\varphi_{ji}(X_{k,j}))^2\big\}_{j\neq i} \in \mathbb{R}^{p(p-1)}, \qquad \hat V_k := \Big\{\big(X_{k,i} - \tfrac{1}{n}\textstyle\sum_{l=1}^n X_{l,i}\big)^2\Big\}_{1\le i\le p} \in \mathbb{R}^{p}.$$
Further let
$$\hat\mu := \frac{1}{n}\sum_{k=1}^n \hat M_k, \qquad \hat\nu := \frac{1}{n}\sum_{k=1}^n \hat V_k.$$
Note that with this notation, the empirical Gaussian edge weight for $j \to i$ is given by $\log(\hat\mu_{ji}/\hat\nu_i)/2$. Let us denote by $\hat\Sigma_M \in \mathbb{R}^{p(p-1)\times p(p-1)}$, $\hat\Sigma_V \in \mathbb{R}^{p\times p}$ and $\hat\Sigma_{MV} \in \mathbb{R}^{p(p-1)\times p}$ the empirical variances of the $\hat M_k$ and $\hat V_k$ and their empirical covariance, respectively, so
$$\begin{pmatrix} \hat\Sigma_M & \hat\Sigma_{MV} \\ \hat\Sigma_{MV}^\top & \hat\Sigma_V \end{pmatrix} := \frac{1}{n}\sum_{k=1}^n \begin{pmatrix} \hat M_k \hat M_k^\top - \hat\mu\hat\mu^\top & \hat M_k \hat V_k^\top - \hat\mu\hat\nu^\top \\ \hat V_k \hat M_k^\top - \hat\nu\hat\mu^\top & \hat V_k \hat V_k^\top - \hat\nu\hat\nu^\top \end{pmatrix}.$$
With this, we may now present our construction of confidence intervals for the edge weights. (For simplicity, all proofs in this section assume the variables to have mean zero.)

4.1 Confidence Region for the Causal Tree

We use the delta method to estimate the variances of the $\hat w^G_{ji}$, and a simple Bonferroni correction to ensure simultaneous coverage of the confidence intervals we develop. Writing $z_\alpha$ for the upper $\alpha/\{2p(p-1)\}$ quantile of a standard normal distribution, we set
$$\hat u_{ji}, \hat l_{ji} := \frac{1}{2}\log\frac{\hat\mu_{ji}}{\hat\nu_i} \pm \frac{z_\alpha \hat\sigma_{ji}}{2\sqrt{n}} = \hat w^G_{ji} \pm \frac{z_\alpha \hat\sigma_{ji}}{2\sqrt{n}}, \qquad \hat\sigma^2_{ji} := \frac{\hat\Sigma_{M,ji,ji}}{\hat\mu^2_{ji}} + \frac{\hat\Sigma_{V,i,i}}{\hat\nu^2_i} - \frac{2\hat\Sigma_{MV,ji,i}}{\hat\mu_{ji}\hat\nu_i}. \tag{10}$$
We treat $[\hat l_{ji}, \hat u_{ji}]$ as a confidence interval for the true edge weight $w^G_{ji}$ and define the following region of directed trees, formed of minimizers of the score with edge weights in the confidence hyperrectangle:
$$\hat C_{\mathrm{Bon}} := \hat C(\hat l, \hat u) := \left\{ \arg\min_{\tilde G = (V,\tilde E)\in\mathcal{T}_p} \sum_{(j\to i)\in\tilde E} w'_{ji} \;:\; \forall\, j\neq i,\; w'_{ji} \in [\hat l_{ji}, \hat u_{ji}] \right\}.$$
We have the following coverage guarantee for $\hat C_{\mathrm{Bon}}$.

Theorem 10 (Confidence region) Suppose the following conditions hold:
(i) there exists $\xi > 0$ such that $E\|X\|_2^{4+\xi} < \infty$;
(ii) there exists $\xi > 0$ such that for all $j\neq i$, $E[|\hat\varphi_{ji}(X_j) - \varphi_{ji}(X_j)|^{4+\xi} \mid \tilde{\mathbf{X}}_n] = O_p(1)$;
(iii) $\mathrm{Var}((\hat M_1^\top, \hat V_1^\top)^\top \mid \tilde{\mathbf{X}}_n) \xrightarrow{P} \Sigma$, where $\Sigma$ is constant with strictly positive diagonal;
(iv) for $(j\to i)\in E$, $\sqrt{n}\, E[(\hat\varphi_{ji}(X_{k,j}) - \varphi_{ji}(X_{k,j}))^2 \mid \tilde{\mathbf{X}}_n] \xrightarrow{P} 0$.
Then $\liminf_{n\to\infty} P\big(G \in \hat C_{\mathrm{Bon}}\big) \ge 1 - \alpha$.

The second condition requires little more than fourth moments for the absolute errors in the regression (they do not need to converge to zero). Condition (iv) requires that the mean squared prediction errors corresponding to the true causal edges decay faster than a relatively slow $1/\sqrt{n}$ rate. If the causal graph is unidentifiable, then when (iv) holds for all edges corresponding to population score minimizing graphs, $\hat C_{\mathrm{Bon}}$ covers every such graph with a probability of at least $1-\alpha$.

4.2 Testing of Substructures

Whilst the confidence region $\hat C_{\mathrm{Bon}}$ has attractive coverage properties, it will typically not be possible to compute it in practice (due to the continuum of weight vectors $w'$ one would need to try). We now introduce two computationally feasible schemes for querying whether $\hat C_{\mathrm{Bon}}$ satisfies certain constraints, such as containing or not containing a given substructure. More precisely, we propose a conservative exact query scheme called CheckC (for "check confidence region"), and an asymptotically valid query scheme called ConvB (for "converging bounds"), which the simulation experiments show to be less conservative. The ConvB test gains power at the expense of generality: while the CheckC test works in both the identified and the non-identified setup, the ConvB test needs both identifiability and stronger assumptions in order to hold level. The idea is as follows: by Theorem 10, the confidence region $\hat C_{\mathrm{Bon}}$ contains the causal graph with probability tending to at least $1-\alpha$. Thus, if we can verify that no graph in $\hat C_{\mathrm{Bon}}$ contains a certain substructure, we are able to test the hypothesis that the causal graph satisfies said substructure with asymptotically valid $1-\alpha$ level control.
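For a single ordered pair (j, i), the interval in Equation (10) can be computed from the out-of-sample residuals as in the following sketch (a simplified illustration of the delta-method and Bonferroni steps; the inputs and the level are hypothetical, and this is not the authors' implementation).

```python
import numpy as np
from scipy.stats import norm

def edge_weight_ci(resid_ji, x_i, p, alpha=0.05):
    """Bonferroni-corrected confidence interval [l_ji, u_ji] for w^G_ji.

    resid_ji: residuals X_i - hat_phi_ji(X_j) on the evaluation sample (length n),
    x_i:      the corresponding observations of X_i,
    p:        number of nodes (Bonferroni correction over all p(p-1) weights)."""
    n = len(x_i)
    M = resid_ji ** 2                     # squared residuals (entries of hat_M_k)
    V = (x_i - x_i.mean()) ** 2           # squared centred observations (hat_V_k)
    mu, nu = M.mean(), V.mean()
    w_hat = 0.5 * np.log(mu / nu)         # empirical Gaussian edge weight
    # Delta-method variance for log(mu/nu), as in Equation (10):
    sigma2 = M.var() / mu**2 + V.var() / nu**2 \
        - 2 * np.cov(M, V, bias=True)[0, 1] / (mu * nu)
    z = norm.ppf(1 - alpha / (2 * p * (p - 1)))   # upper alpha/{2p(p-1)} quantile
    half = z * np.sqrt(sigma2) / (2 * np.sqrt(n))
    return w_hat - half, w_hat + half
```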
Thus, if we can verify that no graph in ˆCBon contains a certain substructure, we are able to test the hypothesis that the causal graph satisfies said substructure with asymptotically valid 1 α level control. 4.2.1 Substructure Hypotheses A substructure restriction R = (ER, Emiss R , r) on the nodes V contains specified sets ER of existing edges, Emiss R of missing edges, and a specific root node r (any of such restrictions may be void, too). For example, a substructure restriction could be that a single edge is present (such as X1 X2), or that a single edge is not present (such as X1 X2); the restriction can also specify a directed tree. Our approach allows us to conclude that at least one of the constraints in R does not hold for the true graph G = (V, E). More precisely, we propose a test for the null hypothesis H0(R) : ER \ E = , E \ Emiss R = E, r = rt(G), i.e, that all constraints in a substructure restriction R are satisfied in the causal graph. We henceforth assume that a proposed substructure R has no internal inconsistencies, i.e., that there exists at least one directed tree over the nodes V satisfying all conditions of H0(R). Example 1 illustrates how substructure restrictions allow us to test various hypotheses about the causal graph. Structure Learning for Directed Trees Example 1 In Figure 1 we illustrate a true causal graph and five examples of substructure hypotheses that we can test. Hypothesis 1 (true) consists of the restriction R = ER, where ER := {(X4 X5)}. This substructure restriction specifies that (X4 X5) is present in the causal graph. Hypothesis 2 (false) consists of the restriction R = Emiss R , where Emiss R := {(X6 X3)}. This restriction specifies that (X6 X3) is not in the causal graph. Hypothesis 3 (true) consists of the restriction R := (ER, Emiss R ) with multiple present edges and a single missing edge. Here, the substructure restriction specifies that all edges in ER := {(X3 X2), (X4 X5), (X4 X7), (X6 X3)} are present, and that the edge in Emiss R := {(X8 X9)} is not present in the causal graph. Hypothesis 4 (false) consists of the restriction R := (ER, Emiss R ) with multiple present edges and multiple missing edges. This substructure restriction specifies that all edges in ER := {(X1 X2), (X1 X4), (X5 X5)} are present, and that all edges in Emiss R := {(X3 X6), (X8 X7)} are not present in the causal graph. Hypothesis 5 (false) contains the substructure R := ER with multiple present edges, specifying that all edges in ER := {(X1 X2), (X2 X3), (X1 X4), (X4 X5), (X4 X7), (X5 X6), (X5 X8), (X6 X9)} are present in the causal graph. This substructure restriction uniquely specifies a specific complete directed tree. Hypothesis 1 Hypothesis 2 Hypothesis 3 Hypothesis 4 Hypothesis 5 Figure 1: Illustration of six graphs, see Example 1. Colored edges represent testing the presence of edges (green, ER) or whether edges are missing (red, Emiss R ). Jakobsen, Shah, B uhlmann and Peters 4.2.2 Checking the Confidence Region In order to present the first method, we introduce some notation. For any non-empty subset of directed trees T Tp, let ST (w) be the score attained by the minimum edge weight directed tree recovered by Chu Liu Edmonds algorithm with input edge weights w := (wji)j =i, when restricting the search to all directed trees in T . 
That is, if we denote the minimum edge weight directed spanning tree (MWDST) as recovered by Chu Liu Edmonds algorithm, when searching over all directed trees in T by G T (w) := arg min G=(V, E) T (j i) E wji, (11) then with G T (w) = (V, E T (w)) the associated score is given by ST (w) := X (j i) E T (w) wji. (12) Now let Tp(R) Tp be the set of all directed trees satisfying the substructure restriction R and suppose that the causal directed tree G satisfies R, i.e., G Tp(R). Hence with probability tending to at least 1 α we know that there exists a graph in ˆCBon satisfying the substructure restriction R. That is, there exist edge weights w = (w ji)j =i, with ˆlji w ˆuji for all j = i, such that G Tp(w ) satisfies the substructure restriction R. Hence, it must hold that STp(R)(w ) = STp(w ). Since the score function is weakly monotone, we have, with probability tending to at least 1 α, that STp(R)(ˆl) STp(R)(w ) = STp(w ) STp(ˆu). On the other hand, if STp(R)(ˆl) > STp(ˆu), then we know for certain that ˆCBon does not contain any graph satisfying the substructure restriction R. We thus define our Check C test function as ψCheck C R := 0 if STp(R)(ˆl) STp(ˆu) 1 otherwise. (13) Recall that Chu Liu Edmonds algorithm recovers a minimum edge weight directed spanning tree subgraph of a connected graph H. We can construct a specific connected graph H for which the set of directed spanning tree subgraphs coincides with Tp(R). In pseudoalgorithm of Algorithm 2 we detail how to test substructure hypotheses with Check C test. This testing procedure is conservative as seen by the simulation experiments in Section 6.3. While Theorem 11 proves that hypothesis testing using the Check C test achieves pointwise asymptotic level, the simulation experiments show that the finite sample power of the test is low for small to moderately large sample sizes. For example, if max{ˆlji : j = i} min{ˆuji : j = i}, then no false substructure hypothesis can be rejected. In Section 4.2.3 we propose an alternative test which exhibits improved finite sample power. 4.2.3 Converging Bounds We now present the Conv B test which is based on an asymptotically valid query scheme, that is, with probability increasing to one (in the large sample limit) it makes a valid choice on Structure Learning for Directed Trees Algorithm 2 Hypothesis testing of H0(R) using the Check C test 1: procedure Check C(R = (ER, Emiss R , r), ˆl = (ˆlji)j =i, ˆu = (ˆuji)j =i) 2: Initialize fully connected graph H := {(j i) : i, j V, j = i}. 3: For each (j i) ER, delete from H the edges {(k i) : k V \ {j}} {i j}. 4: For each (j i) Emiss R , delete from H the edge (j i). 5: If root r R, delete from H the edges {(j r) : j V }. 6: Apply Chu Liu Edmonds algorithm to find STp(R)(ˆl) and G Tp(R)(ˆl), the minimum ˆu-weighted directed spanning subtree of H. 7: Apply Chu Liu Edmonds algorithm to find STp(ˆu) and G Tp(ˆu), the minimum ˆlweighted directed spanning subtree of the fully connected graph. 8: If STp(R)(ˆl) STp(ˆu), then set ψCheck C R := 0, otherwise set ψCheck C R := 1. 9: return ψCheck C R . 10: end procedure whether any graph in ˆCBon satisfies a substructure restriction R. We call this test Conv B for converging bounds because it requires that all lower edge weight bounds converge towards the Gaussian population edge weights. Consider a true null hypothesis H0(R), i.e., a substructure restriction R which is satisfied by the causal graph G. 
Suppose that G ˆCBon, which implies the existence of edge weights w = (w ji)j =i, with ˆlji w ji ˆuji for all j = i, such that the minimum edge weight directed spanning tree, G Tp(w ) := arg min G=(V, E) Tp (j i) E w ji, satisfies the restrictions R. The intuition for our approach is as follows: We propose a method that helps all edge weights that are not in direct disagreement with R, and penalizes all edge weights that are in disagreement with R, more precisely, we define the edge weights ˇw = ( ˇwji)j =i by ˇwji = ˆuji if [ k = j : (k i) ER] [(i j) ER] [(j i) Emiss R ] [i = r], ˆlji otherwise, . We can then expect that G Tp( ˇw) still satisfies the restriction R (with probability tending to one, see Theorem 11). Conversely, the probability that G Tp( ˇw) does not satisfy the restriction R is, in the large sample limit, bounded by the probability that G is not in the confidence region ˆCBon. We may set our test function ψConv B R = 0, if G Tp( ˇw) satisfies R 1, otherwise. The pseudo-algorithm in Algorithm 3 details how to test any substructure hypothesis H0(R) using the asymptotic query scheme of the Conv B test. Our Git Hub repository (see Footnote 8) contains R implementations of both testing procedures. The following theorem shows that both substructure hypothesis tests achieve pointwise asymptotic level. Any number of null hypotheses may be tested simultaneously, Jakobsen, Shah, B uhlmann and Peters Algorithm 3 Hypothesis testing of H0(R) using the Conv B test 1: procedure Conv B(R = (ER, Emiss R , r), ˆl = (ˆlji)j =i, ˆu = (ˆuji)j =i) 2: Initialize ˇw := ˆl. 3: For each (j i) ER and all k V \ {j}, set ˇwki := ˆuki. 4: For each (j i) Emiss R , set ˇwji := ˆuji. 5: If root r R, then for all j V , set ˇwjr := ˆulr. 6: Apply Chu Liu Edmonds algorithm to find G Tp( ˇw). 7: If G Tp( ˇw) satisfies R, then set ψConv B R := 0, otherwise set ψConv B R := 1. 8: return ψConv B R . 9: end procedure without the need for any multiple testing correction. This is because the tests may be viewed as simply querying the properties of the single confidence region of Theorem 10, which has coverage of the truth with probability at least 1 α. Theorem 11 (Pointwise asymptotic level) Let α (0, 1) and let R1, R2, . . . be any collection of potentially data-dependent substructure restrictions. Suppose that conditions of Theorem 10 are satisfied. If either (a) ψRk = ψCheck C Rk for all k 1, or (b) ψRk = ψConv B Rk for all k 1, Assumption 1 holds, and for all (j i) E it holds that n E[( ˆϕji(Xk,j) ϕji(Xk,j))2| Xn] P 0, then it holds that lim sup n P k : H0(Rk) is true (ψRk = 1) The Conv B test requires stronger conditions than the Check C test. Additionally to the assumptions made by the Check C test, it requires identifiability of the causal graph and n-convergence of the mean squared estimation error for the non-causal edges. On the other hand, it would be possible to give uniform asymptotic level guarantees for the Check C test as it only relies on the coverage properties of confidence intervals for the true weights. 5. Bounding the Identifiability Gap We have seen in Section 3.3 that the identifiability gap, that is, the smallest score difference between the causal tree G and any alternative graph G Tp \ {G}, plays an important role when identifying causal trees from observational data. 
It provides information about whether the causal graph is identifiable through the corresponding score function, for example, if we can establish that the smallest Gaussian score gap is strictly positive, i.e., min G Tp\{G} ℓG( G) ℓG(G) = inf Q { G} F( G) Pp DKL(PX Q) > 0, (14) Structure Learning for Directed Trees then G is identified by the Gaussian score function. Lemma 7 lists conditions guaranteeing that Assumption 1, i.e., Equation (14) holds. However, postitivity of the identifiability gap for a single model is not sufficient for uniform consistency or consistency under vanishing identifiability. For consistency under vanishing identifiability we need to ensure that the identifiability gap vanishes at a slower rate than 1/ n; see Theorem 9. Similarly, for uniform consistency over a class of causal additive noise models Θ Tp Mp Pp, one needs the existence of a strictly positive constant c > 0 uniformly lower bounding the identifiability gap, i.e., inf θ Θ min G Tp\{G} ℓG( G) ℓG(G) > c. (15) The identifiability gap is an involved quantity. In this section, we derive a lower bound that is based on local properties of the underlying structural causal models (such as the ability to reverse edges), using information-theoretic quantities. We first consider the special cases of bivariate models (Section 5.1) and multivariate Markov equivalent trees (Section 5.2) and then turn to general trees (Section 5.3). However, before we venture into the derivation of the specific lower bounds we first examine the connection between the identifiability gaps associated with the different score functions. In this section, we assume that X PX is generated by a structural causal additive tree model with E X 2 < such that the local Gaussian, entropy and conditional entropy scores are well-defined. We neither assume that θ is a restricted structural causal additive model, i.e., θ ΘR, nor strict positivity of the identifiability gap, i.e., Assumption 1. The following result shows that the local node-wise score gaps associated with the different score functions are ordered. Lemma 12 For any G Tp and for all i V ℓCE( G, i) ℓCE(G, i) ℓE( G, i) ℓE(G, i). If the underlying model is an causal additive tree model with Gaussian noise, then ℓE( G, i) ℓE(G, i) ℓG( G, i) ℓG(G, i). It follows that the full graph score gaps and identifiability gaps associated with the different score functions satisfy a similar ordering. Thus, given that the underlying model is an causal additive tree model with Gaussian noise, a strictly positive entropy identifiability gap implies that the Gaussian identifiability gap is strictly positive. It is, however, not possible to establish strict positivity of the conditional entropy identifiability gap; see Remark 1 in Appendix B. Therefore, we focus on establishing a lower bound for the entropy identifiability gap that is tighter than that given by the conditional entropy identifiability gap. In general, we cannot use node-wise comparisons of the scores of two graphs to bound the identifiability gap (the reason is that in general a node receives a better score in a graph, where it has a parent, compared to a graph, where it does not; see Example 3 in Appendix B for a formal argument). We start by analyzing the identifiability gap in models with two variables. Jakobsen, Shah, B uhlmann and Peters 5.1 Bivariate Models We now consider two nodes V = {X, Y }, and graphs T2 = {(X Y ), (Y X)}. 
Without loss of generality assume that (X, Y ) L2(P) is generated by an additive noise SCM θ = (G, (fi), PN) with causal graph G = (X Y ) T2 to which the only alternative graph is G = (Y X). That is, X := NX, Y := f(X) + NY , (16) where (NX, NY ) PN P2. The bivariate entropy identifiability gap, which we will later refer to as the edge reversal entropy score gap, is defined as ℓE(X L99 Y ) : = ℓE( G) ℓE(G) = h(Y ) + h(X E[X|Y ]) h(X) h(Y E[Y |X]), where the fully drawn arrow symbolizes the true causal relationship and the dashed arrow the alternative. The following lemma simplifies the bivariate entropy identifiability gap to a single mutual information between the effect and the residual of the minimum mean squared prediction error regression of cause on the effect. Lemma 13 Consider the bivariate setup of Equation (16) and assume that f(X) has density. It holds that ℓE(X L99 Y ) = I(X E[X|Y ]; Y ) 0. Thus, the causal graph is identified in a bivariate setting if one maintains dependence between the predictor and minimum mean squared error regression residual in the anticausal direction. This result is in accordance with the previous identifiability results. For example, in the linear additive Gaussian noise case, I(X E[X|Y ]; Y ) = 0. Consequently, the causal graph is not identified from the entropy score function. Whenever the conditional mean in the anti-causal direction vanishes, e.g., with symmetric causal function and symmetric noise distribution, it is possible to derive a more explicit lower bound with more intuitive sufficient conditions for identifiability of the causal graph. Proposition 14 Consider the bivariate setup of Equation (16) and assume that f(X) has density. If the reversed direction conditional mean E[X|Y ] almost surely vanishes (e.g., because f, X and NY are symmetric), then ℓE(X L99 Y ) = I(X; f(X) + NY ), which is strictly positive if and only if X f(X) + NY . In addition, we have the following statements. (a) Let f(X)G and NG Y be independently normally distributed with the same mean and variance as f(X) and NY , respectively. If DKL(f(X) f(X)G) DKL(NY NG Y ), then ℓE(X L99 Y ) 1 2 log 1 + Var(f(X)) Structure Learning for Directed Trees (b) If the density of f(X) + NY is log-concave, then ℓE(X L99 Y ) 1 πe Var(f(X)) This lower bound is non-trivial only if Var(f(X)) > (πe/2 1)Var(NY ) 3.27Var(NY ). Thus, if the conditional mean E[X|Y ] in the anti-causal direction vanishes, then under certain conditions, the causal direction is identified by the entropy score function (as long as Var(f(X)) is sufficiently large relative to Var(NY )). The edge reversal score gap for the Gaussian score is given by ℓG(X L99 Y ) := 1 2 log Var(X E[X|Y ]) 2 log Var(Y E[Y |X]) 2 log Var(X E[X|Y ]) 2 log 1 + Var(f(X)) which reduces to the lower bound in point (a) of Proposition 14 if the conditional mean E[X|Y ] in the anti-causal direction vanishes. Example 2 Consider the bivariate setup of Equation (16). Suppose that the causal function f is a quadratic function f(x) = αx2 + β for some α, β, R and that NX N(0, σ2 X) and NY N(0, σ2 Y ). It holds that E[X|Y ] vanishes, and the bivariate Gaussian identifiability gap reduces to ℓG(X L99 Y ) = 1 2 log 1 + 2α2 σ4 X σ2 Y 5.2 Multivariate Markov Equivalent Trees Two Markov equivalent trees differ in precisely one directed path that is reversed in one graph relative to the other.9 The entropy score gap of Markov equivalent trees therefore reduces to the binary case. 
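The edge reversal gap is exactly the building block reused for Markov equivalent trees in Proposition 15 below. As a quick numerical illustration, the following R sketch (illustrative parameter values; mgcv used for the conditional means, as in Section 6) estimates the Gaussian edge-reversal gap in the quadratic setting of Example 2 and compares it with the closed form 1/2·log(1 + 2α²σX⁴/σY²); the empirical gap is computed as the difference of the two candidate graphs' Gaussian scores.

```r
# Simulation check of Example 2 (quadratic causal function, Gaussian noise).
# Assumed closed form: 0.5 * log(1 + 2 * alpha^2 * sigma_x^4 / sigma_y^2).
library(mgcv)
set.seed(1)
n <- 20000; alpha <- 1; beta <- 0; sigma_x <- 1; sigma_y <- 0.5
x <- rnorm(n, 0, sigma_x)                       # X := N_X
y <- alpha * x^2 + beta + rnorm(n, 0, sigma_y)  # Y := f(X) + N_Y

# Thin plate regression spline estimates of the two conditional means.
res_yx <- residuals(gam(y ~ s(x, bs = "tp")))   # Y - E-hat[Y|X]  (causal direction)
res_xy <- residuals(gam(x ~ s(y, bs = "tp")))   # X - E-hat[X|Y]  (anti-causal direction)

# Empirical Gaussian score gap for reversing the edge X -> Y:
# [0.5 log Var(Y) + 0.5 log Var(X - E[X|Y])] - [0.5 log Var(X) + 0.5 log Var(Y - E[Y|X])].
gap_hat    <- 0.5 * (log(var(y)) + log(var(res_xy)) - log(var(x)) - log(var(res_yx)))
gap_theory <- 0.5 * log(1 + 2 * alpha^2 * sigma_x^4 / sigma_y^2)
c(estimated = gap_hat, closed_form = gap_theory)
```

With these parameter values the closed form equals 0.5·log(9) ≈ 1.10, and the plug-in estimate should land close to it for moderate sample sizes.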
Proposition 15 Consider any G Tp \ {G} that is Markov equivalent to the causal tree G. Let c1 cr be the unique directed path in G that is reversed in G. Then ℓE( G) ℓE(G) = i=1 ℓE(ci L99 ci+1) min 1 i r 1 ℓE(ci L99 ci+1). Thus, a lower bound of the entropy score gap that holds uniformly over the Markov equivalence class is given by the smallest possible edge reversal in the causal directed graph: min G MEC(G)\{G} ℓE( G) ℓE(G) min (j i) E ℓE(j L99 i). 9. To see this, note that any two directed trees are Markov equivalent if and only if they satisfy the exact same d-separations or equivalently they share the same skeleton (there are no v-structures in directed trees). Distinct directed trees sharing the same skeleton must have distinct root nodes. Consequently, there exist exactly one directed path in G from rt(G) to rt( G) that is reversed in G; see Lemma D.6 Jakobsen, Shah, B uhlmann and Peters 5.3 General Multivariate Trees We now derive a lower bound of the entropy identifiability gap, i.e., a lower bound of the entropy score gap that holds uniformly over all alternative trees Tp \ {G}. To do so, we exploit a graph reduction technique (introduced by Peters et al., 2014) which enables us to reduce the analysis to three distinct scenarios. This graph reduction works as follows. Fix any alternative graph G Tp \ {G}, and iteratively remove any node (from both G and G) that has no children and the same parents in both G and G. The score gap is unaffected by the graph reduction.10 Applying this iteration scheme, until no such node can be found, results in two reduced graphs GR = (VR, ER) and GR = (VR, ER). These reduced graphs cannot be empty, for that would only happen if G = G. Further, they have identical vertices but different edges. And they can be categorized into one of three cases. To do so, consider a node L that is a sink node, i.e., a node without children, in GR and consider its parent in GR. Now, considering GR, one of the following conditions must hold: the parent is also a parent of L in GR (we then call it Z), the parent is not connected to L in GR (we then call it W), or the parent is a child of L in GR (we then call it Y ). Figure 2 visualizes these three scenarios. an GR(W) an GR(Y ) an GR(Z) subgraph of GR an GR(D) an GR(Z) de GR(Y ) de GR(O1) de GR(Ok) subgraph of GR Figure 2: Schematic illustration of parts of two reduced graphs produced by the graph reduction technique described in Section 5.3. Consider a sink node L in GR. Its parent (in GR) must either be a parent in GR, too, it must be a child in GR, or it is unconnected to L in GR. Thus, exactly one of the sets Z, Y , and W is non-empty. This case distinction is used to compute the three bounds in Theorem 16. D, O1, . . . , Ok denote further (possibly existing) nodes in GR. 10. All removed nodes V \ VR have identical incoming edges in both graphs and therefore have identical local scores. That is, for any loss function l {ℓCE, ℓE, ℓG} we have that l( G) ℓ(G) = P i VR ℓ( G, i) ℓ(G, i) + P i V \VR ℓ( G, i) ℓ(G, i) = P i VR ℓ( G, i) ℓ(G, i) = ℓ( GR) l(GR). Structure Learning for Directed Trees We can now obtain bounds for each of the three case individually. For the case with a node Z (a staying parent ), define ΠZ(G) := (z, l, o) V 3 s.t. (z l) E and o nd G(l) \ {z, l} . The score gap can then be lower bounded by min(z,l,o) ΠZ(G) I(Xz; Xo|Xl) (see Lemma D.7). Intuitively, I(Xz; Xo|Xl) quantifies the strength of the connection between z and o, when conditioning on l (which does not lie on the path between z and o). 
This is a non-local bound in that it does not constrain the length of the path connecting z and o. Analyzing or bounding this term might be difficult. We will see in Section 5.4 that this part is not needed for causal additive tree models with Gaussian noise. For the case with a node W ( removing parent ), define ΠW (G) := (w, l, o) V 3 s.t. (w l) E and o (ch G(w) \ {l}) pa G(w) . This case results in the lower bound min(w,l,o) ΠW (G) I(Xw; Xl|Xo) (see Lemma D.8). Here, w is a parent of l and o is directly connected to w. Intuitively, I(Xw; Xl|Xo) quantifies the strength of the edge w l. We condition on o but that node is not directly connected to l (only via w). For the first two cases, faithfulness (Spirtes et al., 2000) implies that these terms are non-zero and bounding them away from zero reminds of strong faithfulness (Zhang and Spirtes, 2002). However, in the second case, one considers individual edges, which reminds more of a strong version of causal minimality (Spirtes et al., 2000; Peters et al., 2017). For the case with a node Y ( parent to child ), a lower bound is given by the minimal edge reversal score gap min(j i) E ℓE(j L99 i) (see Lemma D.9). The term ℓE(j L99 i) measures the identifiability of the direction of an individual edge. It is zero in the linear additive Gaussian noise case, for example. We provide more details on the reduced graphs and on the arguments in the three cases in Section D.4.2 of Appendix D. Combining the three bounds from above, we obtain the following theorem. Theorem 16 It holds that min G Tp\{G} ℓE( G) ℓE(G) min min (z,l,o) ΠZ(G) I(Xz; Xo|Xl), min (w,l,o) ΠW (G) I(Xw; Xl|Xo), min (j i) E ℓE(j L99 i) . (17) This result lower bounds the identifiability gap using information-theoretic quantities. Corresponding results for the Gaussian score follow immediately by Lemma 12. The last two terms are local properties of the underlying structural causal model; the first term is not. As seen in Section 5.2, the last term on the right-hand side is required when considering only Markov equivalent trees; if it is non-zero, it allows us to orient all edges in the skeleton. The first two terms (non-zero under faithfulness) are additionally required when the considered trees are not Markov equivalent. We now turn to the case of causal additive tree models with Gaussian noise innovations. Here, the first term is not needed; the bound then depends only on local properties of the structural causal model. Jakobsen, Shah, B uhlmann and Peters 5.4 Gaussian Multivariate Trees The score gap lower bound in Equation (17) consists of local dependence properties except for the node tuples ΠZ(G) (Lemma D.7) that arise when considering alternative graphs that result in reduced graphs with a node Z ( staying parents ). However, we show that for additive Gaussian noise models, the score gap for such alternative graphs can be lower bounded by the score gaps already considered in alternative graphs with a node Y ( parent to child ) and a node W ( removing parent ). Thus, we have the following theorem, with a bound consisting only of local properties of the model. Theorem 17 (Gaussian localization of the identifiability gap) For causal additive tree models with Gaussian noise, we have that min G Tp\{G} ℓG( G) ℓG(G) min min (w,l,o) ΠW (G) I(Xw; Xl | Xo), min (j i) E ℓE(j L99 i) . 6. Simulation Experiments In this section, we investigate the finite-sample performance of CAT and perform simulation experiments investigating the identifiability gap and its lower bound. 
In Section 6.1 we compare the performance of CAT to CAM of B uhlmann et al. (2014) for causal additive tree models with Gaussian and non-Gaussian noise. In Section 6.2 we compare the CAT and CAM for causal discovery on non-tree DAG models (CAT always outputs a directed tree). In Section 6.3 we investigate the finite sample power and level of the proposed hypothesis testing procedures. In Section 6.4 we perform simulation experiments that highlight the behavior of the identifiability gap and its corresponding lower bound derived in Section 5. The code scripts (R) for the simulation experiments, empirical applications and the implementation of CAT and the two testing procedures are available on Git Hub (see Footnote 8). 6.1 Causal Structure Learning for Trees In this section, we compare the performance of the structure learning methods CAT and CAM when employed on additive noise models with causal graphs given by directed trees. 6.1.1 Tree Generation Schemes We employ two different random directed tree generation schemes: Type 1 (many leaf nodes) and Type 2 (many branch nodes). Figure 3 illustrates two directed trees generated in accordance with the two generation schemes. For more details, see Algorithms 4 and 5 in Section C.1 of Appendix C. 6.1.2 Gaussian Experiment In this experiment, we generate data similarly to the experimental setup of B uhlmann et al. (2014). For any given directed tree we generate causal functions by sample paths of Gaussian processes with radial basis function (RBF) kernel and bandwidth parameter of one. Sample paths of Gaussian processes with radial basis function kernels are almost Structure Learning for Directed Trees Type 1 Type 2 Figure 3: Illustration of Type 1 (many leaf nodes) and Type 2 (many branch nodes) directed trees over p = 100 nodes. The green nodes are leaf nodes, the brown nodes are branch nodes, and the black nodes are root nodes. The Type 1 tree contains 70 leaf nodes, while the Type 2 tree only contains 49 leaf nodes. surely infinitely continuous differentiable (e.g., Kanagawa et al., 2018), non-constant and nonlinear, so they satisfy the requirements of Lemma 3. See Figure 10 in Section C.2 of Appendix C for illustrations of random draws of such functions. Root nodes are mean zero Gaussian variables with standard deviation sampled uniformly on (1, 2). Furthermore, for each fixed tree and set of causal functions, we introduce at each non-root node additive Gaussian noise with mean zero and standard deviation sampled uniformly on (1/5, 2/5). We first compare our method CAT with Gaussian score function (CAT.G) against the method CAM of B uhlmann et al. (2014) on the previously detailed nonlinear additive Gaussian noise tree setup. We use CAT.G without both cross-fitting and pruning. Note that with cross-fitting the results do not change much but, as expected, cross-fitting yields slightly worse results for small sample sizes (see Figure 11 in Appendix C). We use the R-package mgcv (Mixed GAM Computation Vehicle, Wood, 2022) with default settings to construct a thin plate regression spline estimate of the conditional expectations (Wood, 2003). We use the implementation of Chu Liu Edmonds algorithm from the R-package RBGL.11 CAM is employed with a maximum number of parents set to one (restricting the output to directed trees), without preliminary neighborhood selection and subsequent pruning. 
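Since the construction above is entirely regression-based, it is straightforward to sketch. The following minimal R sketch (not the paper's implementation; the weight convention and helper names are ours) computes, for every ordered pair, a Gaussian-score edge weight as one half the log ratio of the residual variance of Xi given Xj to the marginal variance of Xi, using thin plate regression splines from mgcv; the final minimum edge weight directed spanning tree step is only indicated by a hypothetical chu_liu_edmonds() placeholder standing in for, e.g., the RBGL solver mentioned above.

```r
# Minimal sketch: Gaussian-score edge weights from thin plate regression splines.
# X is an n x p data matrix; the returned matrix has entry [j, i] equal to the
# estimated weight of the candidate edge j -> i.
library(mgcv)

gaussian_edge_weights <- function(X) {
  p <- ncol(X)
  w <- matrix(NA_real_, p, p)
  for (j in seq_len(p)) {
    for (i in seq_len(p)) {
      if (i == j) next
      d   <- data.frame(xi = X[, i], xj = X[, j])
      fit <- gam(xi ~ s(xj, bs = "tp"), data = d)      # estimate E[X_i | X_j]
      w[j, i] <- 0.5 * log(mean(residuals(fit)^2)) -   # log residual variance
                 0.5 * log(mean((d$xi - mean(d$xi))^2))  # minus log marginal variance
    }
  }
  w
}

# w_hat    <- gaussian_edge_weights(X)
# tree_hat <- chu_liu_edmonds(w_hat)   # hypothetical solver call (e.g., via RBGL)
```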
We measure the performance of the methods by computing the Structural Hamming Distance (SHD, Tsamardinos et al., 2006) and the Structural Intervention Distance (SID, Peters and Bühlmann, 2015) to the causal tree.

11. The RBGL implementation finds maximum edge weight directed trees and requires all positive edge weights. As such, we take the negative of our edge weights and shift them all by the absolute value of the smallest edge weight. If an edge weight is set to zero, this edge cannot be chosen.

For each system size p ∈ {16, 32, 64, 128} we generate a causal tree, corresponding causal functions and noise variances, and sample n ∈ {50, 100, 200, 500} observations. This is repeated 200 times and the SHD results are summarized in the boxplots of Figure 4.

Figure 4: Causal additive tree models with Gaussian noise: Boxplots of the SHD performance of CAM and CAT.G (Gaussian score) for varying sample sizes, system sizes, and tree types. CAT.G outperforms CAM in a wide range of scenarios. Both methods perform better on trees of Type 2 than on trees of Type 1.

CAT.G outperforms CAM in terms of SHD to the true graph, both in median distance and in IQR length and position, for all sample sizes, system sizes, and tree types. Considering the SID to the causal tree yields similar conclusions; see Figure 12 in Section C.2 of Appendix C. In their default versions, CAM and CAT.G use different estimation techniques for the conditional expectations, but this does not seem to be the source of the performance difference: Figure 13 in Section C.2 of Appendix C illustrates a similar SHD performance difference when forcing CAT.G to use the edge weights produced by the CAM implementation.

6.1.3 Non-Gaussian Experiment

We now compare the performance of CAM and CAT with Gaussian (CAT.G) and entropy (CAT.E) score functions in a setup with varying noise distributions. The entropy edge weights used by CAT.E are estimated with the differential entropy estimator of Berrett et al. (2019) as implemented in the CRAN R-package IndepTest (Berrett et al., 2018). We use the same simulation setup as in Section 6.1.2, but now we only consider trees of Type 1 and parameterize the setup by α > 0, which controls the deviation of the additive noise innovations from a Gaussian distribution. More precisely, we generate the additive noise variables Ni(α) as Ni(α) = sign(Zi)|Zi|^α, where Zi ∼ N(0, σi²) with σi sampled uniformly on (1/5, 2/5), or uniformly on (1, 2) if i = rt(G). For α = 1 this yields Gaussian noise, while for α ≠ 1 the noise is non-Gaussian. We conduct the experiment for all combinations of α ∈ {0.1, 0.2, . . . , 2, 2.5, 3, 3.5, 4} and sample sizes n ∈ {50, 500} for a fixed system size of p = 32. Each setting is repeated 500 times and the results are illustrated in Figure 5.

Figure 5: Deviations from Gaussianity: The parameter α controls the noise deviation from the Gaussian distribution. CAT.G and CAT.E are instances of CAT with edge weights derived from the Gaussian and entropy score functions, respectively. The solid lines represent the median SHD and the shaded (dashed) region represents the interquartile range.
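The noise transformation Ni(α) is easy to reproduce; the following is a small R sketch of it, with the standard deviation ranges taken from the setup above (the function name and its signature are ours).

```r
# Sketch of the non-Gaussian noise N_i(alpha) = sign(Z_i) * |Z_i|^alpha described above;
# sigma is drawn on (1/5, 2/5) for non-root nodes and on (1, 2) for the root node.
noise_alpha <- function(n, alpha, is_root = FALSE) {
  sigma <- if (is_root) runif(1, 1, 2) else runif(1, 1/5, 2/5)
  z <- rnorm(n, mean = 0, sd = sigma)
  sign(z) * abs(z)^alpha        # alpha = 1 recovers Gaussian noise
}
```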
Using the entropy score yields better results for noise distributions that deviate strongly from Gaussian noise. For Gaussian noise, both CAM and CAT.G outperform CAT.E. This can (at least) be attributed to two factors: (i) CAT.E does not, unlike CAM and CAT.G, explicitly use the Gaussian noise specification, and (ii) differential entropy estimation is a difficult statistical problem (see, e.g., Paninski, 2003; Han et al., 2020). For small and moderate deviations from Gaussianity, CAT.G outperforms both CAM and CAT.E. For larger deviations, CAT.E outperforms both CAT.G and CAM in terms of median SHD. Finally, we note that CAT.G always outperforms CAM in terms of median SHD.

6.2 Robustness: CAT on DAGs

This experiment analyzes how CAT performs compared to CAM and the max-min hill-climbing (MMHC, Tsamardinos et al., 2006) structure learning method using the Bayesian Gaussian equivalent score (BGe, Geiger and Heckerman, 1994; Heckerman and Geiger, 1995); the latter method is not expected to work well in our setting, as it does not exploit the additional identifiability. We compare the performance of these structure learning methods when applied to data generated from an additive Gaussian noise model with a non-tree DAG as a causal graph. More specifically, we analyze the behavior on single-rooted DAGs. For any fixed p ∈ N we generate a directed tree of Type 1 and, for each zero in the upper triangular part of the adjacency matrix, we add an edge with 5% probability. The causal functions and Gaussian noise innovations are generated according to the specifications given in the experiment of Section 6.4.2. The structural assignment for each node is additive in each causal parent, i.e., for all i ∈ {1, . . . , p}, Xi := Σ_{j ∈ pa_G(i)} f_ji(Xj) + Ni, with (N1, . . . , Np) mutually independent Gaussian distributed noise innovations.

For each p ∈ {16, 32, 64} and sample size n ∈ {50, 250, 500} we randomly generate 200 single-rooted Gaussian additive models according to the above specifications. For this experiment, we employ CAM with preliminary neighborhood selection and subsequent pruning. As CAT.G outputs trees, we do not expect it to output the correct graph. In Figure 15 of Section C.2 of Appendix C we illustrate boxplot comparisons of the SHD between the estimated and true graph for CAM, CAT.G, and MMHC with the BGe score (MMHC.BGe). We see a clear ranking of the methods in terms of SHD performance: the best performance is seen for CAM, followed by CAT.G, and the worst performing method is MMHC.BGe. Note that the BGe score (and various other Bayesian network learning scores) is only suitable for jointly Gaussian data, e.g., for linear additive Gaussian noise systems.

Figure 6 illustrates the performance in terms of ancestor relations. For small to moderately sized systems (p ∈ {16, 32}), CAM slightly outperforms CAT.G in terms of median precision (TP/(TP + FP)) when classifying causal ancestors. However, for large systems (p = 64), CAT.G outperforms CAM in median precision. On the other hand, CAM is not limited to trees, which allows it to find a larger proportion of the true ancestors, as seen by the median recall (TP/P) performance. MMHC.BGe shows subpar performance in terms of ancestor classification, except for large systems and sample sizes when considering recall. CAT.G thus seems to be a viable alternative for practical non-tree applications where precision is more important than recall for classifying causal ancestor relations.
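For concreteness, here is a minimal R sketch of the single-rooted DAG generation just described: a Type 1 tree (following the Bernoulli(0.1) fill-in of Algorithm 4 in Appendix C) whose remaining strict upper-triangular zeros are turned into edges with 5% probability. The function names are ours, and the sketch covers only the graph, not the causal functions or noise.

```r
# Entry A[j, i] = 1 encodes the edge j -> i; node 1 is the root.
random_type1_tree <- function(p) {
  A <- matrix(0, p, p)
  for (j in seq_len(p)) {
    for (i in seq(j + 1, length.out = max(0, p - j))) {
      if (sum(A[, i]) == 0) {                    # node i has no parent yet
        A[j, i] <- if (i == j + 1) 1 else rbinom(1, 1, 0.1)
      }
    }
  }
  A
}

random_single_rooted_dag <- function(p, prob = 0.05) {
  A <- random_type1_tree(p)
  extra <- upper.tri(A) & A == 0 & matrix(runif(p^2), p, p) < prob
  A[extra] <- 1                                   # add extra edges with 5% probability
  A
}
```

Adding edges only in the strict upper triangle keeps the graph acyclic and never gives the first node a parent, so the result remains single-rooted.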
Figure 14 in Section C.2 of Appendix C illustrates similar comparisons when focusing on causal edges. The precision of CAT.G is larger than that of CAM only for small sample sizes, while the opposite is true for large sample sizes. As expected, and as seen for ancestor relations, CAM outperforms both CAT.G and MMHC.BGe in terms of recall. Finally, while both methods are computationally efficient, CAT has a slightly lower runtime than the greedy search algorithm of CAM. The average runtime of CAM and CAT.G in this experiment for p = 64 and n = 500 was 288 and 199 seconds, respectively. For both methods, the most time consuming part is estimating the conditional expectations that are used to compute the edge weights. Structure Learning for Directed Trees p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Ancestor Classification: Precision p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Sample Size Ancestor Classification: Recall Figure 6: Evaluating the robustness of CAT by estimating ancestor relations in non-tree DAGs, see Section 6.2. CAT.G slightly outperforms CAM in terms of true positive rates for large graphs (top) but finds less ancestor relationships (bottom) due to fitting a tree. As expected, CAT.G and CAM outperform MMHC in terms of precision. 6.3 Hypothesis Testing In this experiment, we experimentally analyze the finite sample size and power properties of the two substructure hypothesis testing procedures proposed in Section 4.2. We generate the underlying models and data similarly to the experimental setup of the Gaussian noise experiment of Section 6.1.2. We generate a random tree of Type 2 (see Section 6.1.1) of size p with Gaussian process causal functions and Gaussian noise innovations generated in accordance with the description in Section 6.1.2. Given a finite sample of size n we use the first n/2 observations to estimate all possible conditional mean functions x 7 E[Xi|Xj = x] for j = i with thin plate regression splines (R-package mgcv with default settings). The remaining n n/2 observations are used to estimate the upper and lower Bonferroni corrected confidence bounds ˆl = (ˆlji)j =i and ˆu = (ˆuji)j =i as defined in Equation (10) of Section 4. Using the two testing procedures proposed in Algorithms 2 and 3 of Section 4.2, with a significance level of 5%, we test all simple hypotheses, i.e., all hypotheses of the form H0 : (j i) and H0 : (j i) for all j = i. We repeat this procedure 400 times to observe the average behavior of the testing procedure Jakobsen, Shah, B uhlmann and Peters for the previously mentioned system generation scheme. We do this for all combinations of sample sizes n {500, 1000, 5000, 10000, 20000} and system sizes p {2, 4, 6, 8, 16}. Figure 7 illustrates the resulting power properties of the two tests Check C and Conv B. Both testing procedures have better small sample power when testing a false hypothesis of the form H0 : (j i) compared to testing a false hypothesis of the form H0 : (j i). Furthermore, the finite sample power of Check C is inferior to the Conv B. The power of Conv B is only slightly negatively affected by an increase in system size p, when testing a false hypothesis of the form H0 : (j i) . On the other hand, Check C suffers for both types of hypotheses when increasing the system size. For example, the Check C method has almost zero power when the system size is 16 and the samplesize is 20000. 
H0 : (j i) H0 : (j i) 0 5000 10000 15000 20000 0 5000 10000 15000 20000 Figure 7: This figure illustrates the power of the proposed testing procedure for simple hypotheses. The left plot shows the empirical probability of rejecting a false hypothesis of the form H0 : (j i) as a function of the sample size n. Similarly, the right plot shows the empirical probability of rejecting a false hypothesis of the form H0(j i). For both tests the power increases with growing sample size with Conv B outperforming Check C. In Table 1, we further detail the power and level achieved by the Conv B test in the above experiment. Both tests seems to hold level in all settings. For false hypotheses of the form H0 : (j i) we have split the hypotheses into three groups based on Distance(j, i) being negative , positive or no path . If j is a descendant of i, then Distance(j, i) is negative , if j is a non-parent ancestor of i, then Distance(j, i) is positive , and if there is no directed path between j and i, then Distance(j, i) equals no path . For moderately large sample sizes the test exhibits high power. However, the power of the test for a false hypothesis of the form (j i) when j is a non-parent ancestor of i is relatively low. 6.4 Identifiability Gap We now investigate the behavior of the identifiability gap in bivariate models (Section 6.4.1) and evalute the lower bound derived in Section 5 empirically for multivariate models (Section 6.4.2). Structure Learning for Directed Trees Property: Power of test Size of test H0: (j i) (j i) (j i) (j i) Distance(j, i) p n Negative Positive No Path Total Total Total Total 2 500 0.68 0.68 0.68 0.00 0.00 2 1000 0.82 0.82 0.82 0.00 0.00 2 5000 0.97 0.97 0.97 0.00 0.00 2 10000 0.99 0.99 0.99 0.00 0.00 2 20000 0.99 0.99 0.99 0.00 0.00 4 500 0.69 0.32 0.85 0.69 0.34 0.01 0.01 4 1000 0.79 0.54 0.92 0.80 0.54 0.00 0.00 4 5000 0.93 0.86 0.98 0.94 0.87 0.00 0.01 4 10000 0.97 0.93 0.99 0.97 0.95 0.00 0.00 4 20000 0.99 0.96 0.99 0.99 0.97 0.00 0.00 8 500 0.73 0.38 0.85 0.75 0.18 0.01 0.01 8 1000 0.78 0.53 0.91 0.83 0.35 0.01 0.01 8 5000 0.92 0.86 0.98 0.95 0.79 0.01 0.01 8 10000 0.96 0.94 0.99 0.97 0.90 0.00 0.01 8 20000 0.98 0.97 0.99 0.99 0.96 0.00 0.00 16 500 0.77 0.40 0.86 0.79 0.09 0.01 0.01 16 1000 0.80 0.54 0.92 0.86 0.21 0.01 0.01 16 5000 0.91 0.85 0.98 0.96 0.70 0.01 0.01 16 10000 0.95 0.93 0.99 0.98 0.85 0.01 0.01 16 20000 0.97 0.97 0.99 0.99 0.93 0.00 0.00 Table 1: This table contains further details on the average power and size of the Conv B hypothesis test under the data generation described in Section 6.3. 6.4.1 Bivariate Identifiability Gap In this experiment, we investigate the behavior of the bivariate identifiability gap and analyze setups with both Gaussian and non-Gaussian noise innovations. Let us consider an additive noise model over (X, Y ) with causal graph X Y . The causal functions will be chosen from the following function class. For any λ [0, 1], define fλ : R R as fλ(x) = (1 λ)x3 + λx. That is, λ 7 fλ interpolates between a cubic function x 7 x3 and a linear function x 7 x. For any (α, λ) (0, ) [0, 1] we consider the following bivariate structural causal additive model X := sign(NX)|NX|α, Y := fλ(X) + NY , Jakobsen, Shah, B uhlmann and Peters where NX, NY are independent standard normal distributed random variables. Recall that the bivariate identifiability gap is given by ℓE(Y X) ℓE(X Y ) = h(X E[X|Y ]) + h(Y ) h(X E[X|Y ], Y ) = I(X E[X|Y ]; Y ), by Lemma 13. 
Thus, the causal graph X Y is identified by the entropy score function if I(X E[X|Y ]; Y ) > 0. For any fixed λ and α we now estimate the identifiability gap; we also calculate the p-value associated with the null hypothesis that the identifiability gap is zero (based on 50000 observations). Similarly to the previous experiment, we estimate the conditional expectations using thin-plate spline regression. We estimate (without sample splitting) the identifiability gap and construct p-values using the CRAN R-package Indep Test (Berrett et al., 2018). More specifically, we use the differential entropy estimator of Berrett et al. (2019) and the mutual information based independence test of Berrett and Samworth (2019), respectively. The heatmap of Figure 8 illustrates the behavior of the identifiability gap for all combinations of λ {0, 0.05, . . . , 1} and α {0.3, 0.4, . . . , 1.7}. It suggests that the identifiability gap only tends to zero when we approach the linear additive Gaussian noise setup. Only in the models closest to the linear additive Gaussian noise setup are we unable to reject the null-hypothesis of a vanishing identifiability gap. This is also what the theory predicts, namely that for bivariate linear additive Gaussian noise models, the causal direction is not identified. It is known that for linear models, non-Gaussianity is helpful for identifiability. The empirical results indicate that the same holds for nonlinear models, i.e., that the identifiability gap increases with the degree of non-Gaussianity of the noise innovations. 0.00 0.25 0.50 0.75 1.00 Linearity: l Non-Gaussianity: a p-value ³ 0.05 Identifiability Gap Figure 8: Heatmap of the identifiability gap for varying λ and α. Tiles with a red boundary correspond to the models for which the mutual information based independence test cannot reject the null hypothesis of a vanishing identifiability gap. Structure Learning for Directed Trees 6.4.2 Multivariate Identifiability Gap In this experiment, we investigate the identifiability gap and its relation to the lower bounds established in Theorem 17. For a causal additive tree model with Gaussian noise, it holds that min G Tp\{G} ℓG( G) ℓG(G) min min (w,l,o) ΠW (G) I(Xw; Xl | Xo), min i j E ℓE(i L99 j) . In other words, the identifiability gap is lower bounded by the minimum of the smallest local faithfulness measures and the smallest edge-reversal score difference. We now investigate empirically how important the first term is for the inequality to hold. More specifically, for a given model generation scheme, we quantify how often the minimum edge reversal is sufficiently small to establish the lower bound without the conditional mutual information term, that is, how often the identifiability constant min G Tp\{G} ℓG( G) ℓG(G) is larger than the minimum edge reversal. The minimum edge reversal can be estimated using the same conditional expectation and entropy estimators of the experiment in Section 6.4.1. However, estimating the identifiability gap between the second-best scoring tree and the causal tree needs further elaboration. We know that the best scoring (causal) tree can be found by Chu Liu Edmonds (a directed MWST) algorithm. The second-best scoring tree differs from the best scoring tree in at least one edge. Thus, given the best scoring graph, we remove one of the p 1 edges of the best scoring tree from the pool of possible edges and rerun Chu Liu Edmonds algorithm. 
We do this for each of the p − 1 edges in the best scoring tree, which leaves us with p − 1 possibly different sub-optimal trees, of which the minimum score is attained by the second-best scoring graph.

For the experiment, we randomly sample data generating models similarly to the experiment in Section 6.1.2. However, we change the causal functions from explicit sample paths of a Gaussian process to a thin-plate spline regression model estimating the sample paths, due to memory constraints when generating large sample sizes. Figure 9 illustrates, for p ∈ {8, 16}, boxplots of the difference between the identifiability gap and the minimum edge reversal for 100 randomly generated causal additive tree models with Gaussian noise. For each model, the identifiability gap and the corresponding minimum edge reversal are estimated from 200000 independent and identically distributed observations. The illustration suggests that it is in general necessary to also consider the conditional mutual information term in order to establish a lower bound. However, it also shows that in the majority (90%) of the models, the minimum edge reversal is indeed a lower bound for the identifiability gap.

7. Empirical Application

We consider the well-known non-synthetic bio-informatics data set considered by Sachs et al. (2005). The data set contains simultaneous measurements of expression levels of 11 different phosphorylated proteins and phospholipids of human immune system cells under both observational and interventional experimental settings. Sachs et al. (2005) present (based on expert consensus and experiments) a causal directed acyclic graph with 11 nodes and 20 edges for the 11 phosphorylated proteins and phospholipids.

Figure 9: Empirical analysis of the lower bound on the identifiability gap, see Section 6.4.2. In most of the simulated settings, we see that the estimated identifiability gap is larger than the smallest edge-reversal score difference. This suggests that in many cases, the latter term is sufficient for establishing a lower bound on the identifiability gap.

We have also implemented CAT.G and CAT.E with the heuristic pruning procedure introduced in Bühlmann et al. (2014). We compare our structure learning methods CAT.G and CAT.E with the score-based methods CAM (Bühlmann et al., 2014), GES (Chickering, 2002), and NoTears (Zheng et al., 2018), and the mixed method MMHC (Tsamardinos et al., 2006). The structure learning methods are applied to observational data (853 observations using reagents anti-CD3 and anti-CD28). The results of the structure learning methods can be seen in Table 2. Learning causal structure from observational data is a difficult problem, but several methods seem to outperform estimating an empty graph or a random graph. CAM is superior in terms of SHD, SID, and recall of edge and root predictions, suggesting that in this data set one may indeed exploit nonlinearities for identifying causal structure. However, we also see that CAT.G shows competitive performance and ranks in first or second place with respect to all reported performance measures. Interestingly, even though CAT.G approximates the non-tree causal DAG by a directed tree, it outperforms various DAG structure learning methods such as the classical approaches GES and MMHC and the more recent continuous optimization approach NoTears.
CAT.E does not perform well on these data, witnessing that estimating entropies is a difficult statistical problem. Finally, we also evaluate the proposed hypothesis testing procedures on this data set, even though the asymptotic guarantees of the hypothesis tests derived in Section 4 are not guaranteed to hold as the true underlying graph is not a directed tree. We test every possible simple hypothesis of the form H0(j i) and H0(j i). The results can be seen in Table 3 (the Check C test holds level but has zero power). The Conv B test shows reasonable power against false hypothesis of the form H0(j i), however, it has no power against the false hypotheses of the form H0(j i). Rejection rates of the true hypotheses of the form H0(j i) are larger than the asymptotically guaranteed rate of 0.05, possibly because of Structure Learning for Directed Trees the model violation; this phenomenon is not as expressed for true hypotheses of the form H0(j i). Method Prune Score SHD SHD-C SID Precision Recall CAM Yes ℓG 14.00 15.00 72.0 0.571 0.381 CAT No ℓG 14.00 14.00 79.0 0.636 0.333 CAT Yes ℓG 15.00 16.00 83.0 0.545 0.286 MMHC BGe 15.00 14.00 84.0 0.417 0.238 MMHC BIC 15.00 14.00 84.0 0.417 0.238 GES BIC 17.00 16.00 107.0 0.231 0.143 CAT Yes ℓE 18.00 19.00 92.0 0.273 0.143 No Tears 19.00 17.00 99.0 0.182 0.095 Empty Graph 20.00 20.00 94.0 0.091 0.048 Random Graph 22.32 21.93 94.7 0.271 0.170 CAT No ℓE 24.00 25.00 104.0 0.273 0.143 Table 2: Results of the empirical application of various structure learning methods to the data set of Sachs et al. (2005). Here we report the structural hamming distance (SHD), structural hamming distance of the respective CPDAGs (SHD-C), and structural intervention distance (SID) between the causal graph and the estimated graph. The latter two columns show the precision and recall for edge and root classification. The methods Empty Graph always outputs the empty graph and the method Random Graph outputs a random single-rooted tree generated according to the generation scheme outlined in Section 6.2. We have implemented CAT.G and CAT.E both with and without the heuristic pruning technique introduced in B uhlmann et al. (2014) Nulls incorrect Nulls correct H0 : (j i) (j i) (j i) (j i) Distance(j, i) Negative Positive No Path Total Total Total Total Rejection rates: 0.58 0.53 0.66 0.58 0.00 0.30 0.02 N: 46 26 18 90 20 20 90 Table 3: Further details on the average power and level of the Conv B test with a significance level of 0.05. Here, we have tested every simple hypothesis of the Sachs et al. (2005) data set; see Section 6.3 for further explanations of the distance metric. N denotes the number of hypothesis tests that have been averaged. Jakobsen, Shah, B uhlmann and Peters 8. Summary and Future Work This paper shows that exact structure learning is possible for systems of lesser complexity, i.e., for restricted structural causal models with additive noise and causal graphs given by directed trees. We propose the method CAT, which is guaranteed to consistently recover the causal directed tree of a causal additive tree model with Gaussian noise under mild assumptions on the regression methods used to estimate conditional means. Furthermore, we argue that CAT is consistent in an asymptotic setup with vanishing identifiability. We present a computationally feasible procedure to test substructure hypotheses and provide an analysis of the identifiability gap. 
Simulation experiments show that CAT outperforms other (more general) structure learning methods for the specific task of recovering the causal graph in additive noise structural causal models when the causal structure is given by directed trees. The proof of Proposition 4 is based on the fact that the causal functions of alternative models are differentiable and that the noise densities are continuous. We conjecture that it is possible to get even stronger identifiability statements under weaker assumptions; proving such a result necessitates new proof strategies. We believe that it could be possible to prove uniform consistency under suitable conditions when requiring that the infimum of the identifiability gap is strictly positive and that the mean squared errors of the regression estimates converge uniformly. Furthermore, we believe that it could inspire future research on more general identifiability conditions (e.g., relaxing the smoothness assumptions) for directed trees and DAGs under the assumption of additive noise. Furthermore, it should be possible to use a wild bootstrap approach to construct a simultaneous hyperrectangle confidence region for the Gaussian edge weights. This would, however, require a sufficiently fast convergence rate of the estimation error of the conditional expectations corresponding to non-causal edges. Compared to the Bonferroni correction, this approach could increase the power of the test. We hypothesize that the Conv B test holds level even in many generic, non-identifiable settings. Acknowledgments We thank Phillip Bredahl Mogensen and Thomas Berrett for helpful discussions on the entropy score and its estimation. JP thanks Martin Wainwright for discussions on greedy search methods (and when they fail) and inequalities in additive noise models during his visit at UC Berkeley in 2013. PB and JP thank David B urge and Jan Ernest for helpful discussions on exploiting Chu Liu Edmonds algorithm for causal discovery during the early stages of this project. MEJ and JP were supported by the Carlsberg Foundation; JP was, in addition, supported by a research grant (18968) from VILLUM FONDEN. RDS was supported by EPSRC grant EP/N031938/1. PB received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement No. 786461). Appendix A. Graph Terminology A directed graph G = (V, E) consists of p N>0 vertices (nodes) V = {1, . . . , p} and a collection of directed edges E {(j i) (j, i) : i, j V, i = j}. For any graph G = (V, E) Structure Learning for Directed Trees we let pa G(i) := {v V : (v, i) E} and ch G(i) := {v V : (j, v) E} denote the parents and children of node i V and we define root nodes rt(G) := {v V : pa G(i) = } as nodes with no parents (that is, no incoming edges). A path in G between two nodes i1, ik V consists of a sequence (i1, i2), . . . , (ik 1, ik) of pairs of nodes such that for all j {1, . . . , k 1}, we have either (ij ij+1) E or (ij+1 ij) E. A directed path in G between two nodes i1, ik V consists of a sequence (i1, i2), . . . , (ik 1, ik) of pairs of nodes such that for all j {1, . . . , k 1}, we have (ij ij+1) E. Furthermore, we let an G(i) and de G(i) denote the ancestors and descendants of node i V , consisting of all nodes j V for which there exists a directed path to and from i, respectively. We let nd G(i) denote the non-descendants of i. 
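As an illustrative convention (ours, not the paper's implementation), these notions can be computed directly from an adjacency matrix representation; the following small R helpers assume A[j, i] = 1 encodes the edge j → i.

```r
# Helpers for the graph terminology above, for a directed graph stored as a
# p x p adjacency matrix A with A[j, i] = 1 meaning the edge j -> i.
parents  <- function(A, i) which(A[, i] == 1)
children <- function(A, i) which(A[i, ] == 1)
roots    <- function(A) which(colSums(A) == 0)   # nodes with no incoming edges

# Ancestors of i: all nodes with a directed path into i, found by repeatedly
# taking parents until no new nodes appear.
ancestors <- function(A, i) {
  anc <- integer(0)
  frontier <- parents(A, i)
  while (length(frontier) > 0) {
    anc <- union(anc, frontier)
    frontier <- setdiff(unlist(lapply(frontier, function(v) parents(A, v))), anc)
  }
  sort(anc)
}
descendants <- function(A, i) ancestors(t(A), i)  # reverse all edges and reuse
```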
A directed acyclic graph (DAG) is a directed graph that does not contain any directed cycles, i.e., directed paths visiting the same node twice. We say that a graph is connected if a (possibly undirected) path exists between any two nodes. A directed tree is a connected DAG in which all nodes have at most one parent. More specifically, every node has a unique parent except the root node, which has no parent. The root node rt(G) is the unique node such there exists a directed path from rt(G) to any other node in the directed tree. In graph theory, a directed tree is also called an arborescence, a directed rooted tree, and a rooted out-tree. A graph G = (V , E ) is a subgraph of another graph G = (V, E) if V V , E E and for all (j i) E it holds that j, i V . A subgraph is spanning if V = V . For any DAG G = (V, E) and three mutually distinct subsets A, B, C V we let A GB | C denote that A and B are d-separated by C in G (see, e.g., Pearl, 2009). Appendix B. Further Details on Section 5 Remark 1 The conditional entropy score gap is not strictly positive when considering the alternative graphs G that are Markov equivalent to the causal graph G, G MEC(G). A simple translation of the conditional entropy score function reveals that ℓCE( G) + C = X (j i) E h(Xi|Xj) h(Xi) = X (j i) E I(Xi; Xj), for a constant C R. By symmetry of the mutual information, it holds that ℓCE( G) = ℓCE(G), for any G MEC(G), since G and G share the same skeleton. Thus, the conditional entropy score function can, at most, identify the Markov equivalence class of the causal graph. In fact, the polytree causal structure learning method of Rebane and Pearl (1987) uses the above translated conditional entropy score function to recover the skeleton of the causal graph. Example 3 (Negative local Gaussian score gap) Consider two graphs G and G with different root nodes, i.e., rt(G) = rt( G). If x 7 E[Xrt(G)|Xpa G(rt(G)) = x] is not almost surely constant, then it holds that ℓG( G, rt(G)) ℓG(G, rt(G)) = E[(Xrt(G) E[Xrt(G)|Xpa G(rt(G))])2] Var(Xrt(G)) = E[Var(Xrt(G)|Xpa G(rt(G)))] Var(Xrt(G)) = Var(E[Xrt(G)|Xpa G(rt(G))]) < 0. Jakobsen, Shah, B uhlmann and Peters Appendix C. Further Details on the Simulation Experiments This section contains further details on the simulation experiments. C.1 Tree Generation Algorithms The following two algorithms, Algorithm 4 (many leaf nodes) and Algorithm 5 (many branch nodes), details how the Type 1 and Type 2 trees are generated, respectively. Algorithm 4 Generating type 1 trees procedure Type1(p) A := 0 Rp p for j {1, . . . , p} do for i {j + 1, . . . , p} do if Pp k=1 Aki = 0 then if i = j + 1 then Aji := 1 else Aji := Binomial(success = 0.1) end if else Aji := 0 end if end for end for return A end procedure Algorithm 5 Generating type 2 trees procedure Type2(p) for i {2, . . . , p} do j := sample({1, . . . , i 1}) Aji := 1 end for return A end procedure C.2 Additional Illustrations This section contains some additional illustrations of the simulation experiments. Structure Learning for Directed Trees Figure 10: Four causal functions as modeled by the RBF kernel Gaussian Process. 50 100 200 500 50 100 200 500 0 Sample Size SHD to true graph Figure 11: Boxplot illustrating the SHD performance of CAM and CAT for varying sample sizes, system sizes and tree types in the experiment of Section 6.1.2 with 200 repetitions. CAT.G.cf is the CAT.G method with cross-fitted edge weights. We see that cross-fitting has no positive impact on the performance. 
It seems to worsen the performance for small sample sizes. Jakobsen, Shah, B uhlmann and Peters 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 Sample Size SID to true graph Figure 12: Boxplot illustrating the SID performance of CAM and CAT for varying sample sizes, system sizes and tree types in the experiment of Section 6.1.2. CAT.G is CAT with edge weights derived from the Gaussian score function. 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 Sample Size SHD to true graph Figure 13: Boxplot illustrating the SHD performance of CAM and CAT for varying sample sizes, system sizes and tree types in the experiment of Section 6.1.2. Here CAT.G is run on the CAM edge weights , so that any difference in nonparametric regression technique is ruled out as the source of the performance difference. Structure Learning for Directed Trees p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Edge Classification: Precision p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Sample Size Edge Classification: Recall Figure 14: Boxplot of edge relations for the experiment in Section 6.2. p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Sample Size Figure 15: Boxplot of SHD for the experiment in Section 6.2. Jakobsen, Shah, B uhlmann and Peters Appendix D. Proofs This section contains the proofs of all results presented in the main text. D.1 Proofs of Section 2 Proof of Lemma 3. Let θ = (G, (fi), PN) Tp Dp 3 Pp G. Furthermore, let all causal functions (fi) be nowhere constant and nonlinear. The additive noise is Gaussian, so the log density of Ni for all i {1, . . . , p} is given by 2 log(2πσ2 i ) x2 2σ2 i , ν i(x) = x σ2 i , ν i (x) = 1 σ2 i , ν i (x) = 0. By assumption we have that condition (i) of Definition 2 is satisfied, hence assume for contradiction that condition (ii) of Definition 2 is not satisfied. That is, we assume that there exists an i {1, ..., p} \ {rt(G)} such that for all (x, y) J := {(x, y) R2 : ν i (y fi(x))f i(x) = 0} = {(x, y) R2 : f i(x) = 0}, it holds that ξ (x) ξ (x)f i (x) f i(x) 2f i (x)f i(x) σ2 = y fi(x) f i (x) (f i (x))2 Henceforth, suppress the subscript i of fi and σi. First note that {x R : f (x) = 0} is closed by continuity of f . The complement is open, hence there exists a countable collection of mutually disjoint open intervals (Ok)k Z such that {x R : f (x) = 0} = k ZOk. Since f is nowhere constant we know that {x R : f (x) = 0} has empty interior which implies that k ZOk = R. Now let (Ok)k Z be indexed by Z such that for any k, j Z with k < j and x Ok, y Oj it holds that x < y. As the left-hand side of Equation (18) is constant in y it must hold that 0 = f (x) (f (x))2 x f (x) f (x) f (x) x (f (x))2 = i.e., f (x)/f (x) is constant, for all x k ZOk. On each Ok we have that / x log(sign(f (x))f (x)) = ck,1 log(sign(f (x))f (x)) = ck,1x + ck,2 sign(f (x))f (x) = exp(ck,1x + ck,2) f (x) = exp(ck,1x + ck,2). Recall that we have assumed continuous differentiability of f . That is, for any k Z and tk := sup(Ok) = inf(Ok+1) we have limx tk f (x) = limx tk f (x) and limx tk f (x) = limx tk f (x). Assume without loss of generality that f (x) = exp(ck,1x + ck,2) for all x Ok and k Z. These conditions impose the restrictions (ck,1 ck+1,1)tk = ck+1,2 ck,2 and log(ck,1/ck+1,1) + (ck,1 ck+1,1)tk = ck+1,2 ck,2 which entails that ck,1 = ck+1,1 and ck,2 = ck+1,2. This proves that there exists c1, c2 R such that f (x) = exp(c1x + c2) for all x R. 
Thus, the differential equation holds for all x R, 0 = ξ (x) ξ (x)f (x) f (x) 2f (x)f (x) Structure Learning for Directed Trees by division with f (x). By integration this implies that 0 = ξ (x)/f (x) 2f (x)/σ2 + c3 such that ξ (x) = 2 exp(2c1x+2c2)/σ2 c3 exp(c1x+c2) and ξ (x) = exp(2c1x+2c2)/c1σ2 c3 exp(c1x + c2)/c1 + c4 and ξ(x) = exp(2c1x + 2c2) 2c2 1σ2 c3 exp(c1x + c2) c2 1 + c4x + c5. We see that ξ(x) p Xpa G(i)(x) as x sign(c1) , in contradiction with the assumption that p Xpa G(i)(x) is a probability density function if c1 = 0. Thus, it must hold that f (x)/f (x) = 0 for all x R, or equivalently, that f is a linear function, yielding a contradiction. This proves that whenever fi D3 is a nowhere constant and nonlinear function and the additive noise is Gaussian then condition (ii) of Definition 2 is satisfied, so θ ΘR. Proof of Proposition 4. First, we consider the bivariate setting. Let (X, Y ) be generated by an additive noise SCM θ ΘR T2 D2 3 P2 C3 given by X := NX and Y := f(X) + NY with PX = p X λ and PNY = p NY λ having three times differentiable strictly positive densities and f is a three times differentiable nowhere constant function such that condition (ii) of Definition 2 holds. Assume for contradiction that we do not have observational identifiability of the causal structure G = (V = {X, Y }, E = {(X Y )}). That is, there exists θ T2 Dp 1 Pp C0 with causal graph G = G or, equivalently, a differentiable function g and noise distributions P NX = p NX λ and P NY = p NY λ with continuous densities such that the structural assignments Y := NY and X := g( Y ) + NX induce the same distribution, i.e., PX,Y = P X, Y . (19) By the additive noise structural assignments we know that both PX,Y and P X, Y have densities with respect to λ2 given by p X,Y (x, y) = p X(x)p NY (y f(x)), p X, Y (x, y) = p NX(x g(y))p Y (y), for all (x, y) R2. By the equality of distributions in Equation (19) and strict positivity of p X and p NY we especially have that for λ2-almost all (x, y) R2 0 < p X,Y (x, y) = p X, Y (x, y). (20) However, as both p X,Y and p X, Y are continuous we realize that the inequality in Equation (20) holds for all (x, y) R (if they were not everywhere equal there would exists a non-empty open ball in R2 on which they differ in contradiction with λ2-almost everywhere equality). Furthermore, by the assumption that f is three times differentiable and p X, p NY are three times continuously differentiable we have that 3π/ x3 and 3π/ x2 y are well-defined partial-derivatives of π(x, y) := log p X,Y (x, y) = log p X(x) + log p NY (y f(x)) =: ξ(x) + ν(y f(x)), Jakobsen, Shah, B uhlmann and Peters With π(x, y) := log p X, Y we have that π(x, y) = log p NX(x g(y)) + log p Y (y) =: ξ(x g(y)) + ν(y). Since it holds that π = π by Equation (20) the partial-derivatives 3 π/ x3 and 3 π/ x2 y are also well-defined. Now note that for any x, y R 0 = lim h 0 | π(x + h, y) π(x, y)|/h = lim h 0 | ξ(x g(y) + h) ξ(x g(y))|/h, implying that ξ is differentiable in x g(y) for any x, y R or, equivalently, ξ is everywhere differentiable. Similar arguments yield that ξ is at least three times differentiable. We conclude that 2 π(x, y)/ x2 = ξ (x g(y)) and 2 π(x, y)/ x y = ξ (x g(y))g (y) and for any (x, y) R2 such that 2 π(x, y)/ x y = 0 or, equivalently, (x, y) J := (x, y) : 2π(x, y) x y = ν (y f(x))f (x) = 0 , it holds that 2 x2 π(x, y) 2 x y π(x, y) It is worth noting that J = to ensure that the following derivations are not void of meaning. 
(This can be seen by noting that f is nowhere constant, i.e., f (x) = 0 for λ-almost all x R. Hence, J = if and only if p NY is a density such that {(x, y) R2 : f (x) = 0} (x, y) 7 ν (y f(x)) is constantly zero or, equivalently, R y 7 ν (y) is constantly zero. This holds if and only if p NY is either exponentially decreasing or exponentially increasing everywhere, which is a contradiction as no continuously differentiable function integrating to one has this property.) For any (x, y) J we also have that 2 x2 π(x, y) 2 x yπ(x, y) ξ (x) + ν (y f(x))f (x)2 ν (y f(x))f (x) ν (y f(x))f (x) ν f + ν ν f (ν )2 (f )2ν ν (f )2 + f ξ which implies that 2ν f f + ν f + ν ν f f in contradiction with the assumption that condition (ii) of Definition 2 holds. We conclude that PX,Y = P X, Y . Now consider a multivariate restricted causal model θ ΘR over X = (X1, . . . , Xp) with causal directed tree graph G = (V, E). Assume for contradiction that there exists an Structure Learning for Directed Trees alternative SCM θ = ( G, ( fi), P N) Tp Dp 1 Pp C0 inducing X = ( X1, . . . , Xp) with causal graph G = (V, E) = G, such that PX = P X. Any SCM induced distribution is Markov with respect to the underlying causal graph. As such, we have that PX is Markov with respect to both G and G. Furthermore, since (in θ) the causal functions are non-constant and the noise innovations have strictly positive density, we have, by Proposition 17 of Peters et al. (2014), that PX satisfies causal minimality with respect to causal graph G of θ, i.e., it is globally Markov with respect to G but not any proper subgraph of G. If PX also satisfies causal minimality with respect to G, then, by Proposition 29 of Peters et al. (2014), there exist i, j V such that (j i) E and (i j) E. Assume for contradiction that PX does not satisfy causal minimality with respect to G. By Proposition 4 of Peters et al. (2014), we have that there exists (j i ) E such that Xj Xi . Define A := nd G(j ) {j } and B := de G(i ) {i }. It holds that A G(B \ {i }) | i , i.e., A and B \ {i } are d-separated by i in the directed tree G. Since PX is Markov with respect G it holds that XA XB\{i } | Xi , hence XA XB | Xi . Similarly, it holds that XA XB | Xj which implies that XA Xi | Xj . By applying the contraction property of conditional independence, we get that XA Xi | Xj and Xi Xj = XA Xi , and XA XB | Xi and XA Xi = XA XB. Since A B = V, A B = and G is a directed tree (that spans V ) there exist either an edge (j i ) E with j A and i B or j B and i A. In either case, we have that Xi Xj , which contradicts PX satisfying causal minimality with respect to G. We conclude that PX also satisfies causal minimality with respect to the alternative graph G. Hence, the following structural equations hold for (Xi, Xj) and ( Xi, Xj) Xi = fi(Xj) + Ni, with Xj Ni, Xj = fj( Xi) + Nj, with Xi Nj, with PXj,Xi = P Xj, Xi. We can apply the same arguments as in the bivariate setup if we can argue that a density of Xj is three times differentiable and that a density of Xi is a continuous density. To this end, note that the density p Xj is given by the convolution of two densities p Xj(y) = Z pfj(Xpa G(j))(t)p Nj(y t) dt, (21) as Xj := fj(Xpa G(j)) + Nj with Xpa G(j) Nj. Here we used that fj(Xpa G(j)) has density with respect to the Lebesgue measure. Jakobsen, Shah, B uhlmann and Peters To realize this note that fj C3 and it is nowhere constant. By arguments similar to those in the proof of Lemma 3, this implies that f (x) = 0 at only countably many points (dk). 
Now let (Ok) be the countable collection of mutually disjoint open intervals that cover R except for the points (dk). By continuity of f we know that f (x) is either strictly positive or strictly negative on each Ok. That is, f is continuously differentiable and strictly monotone on each Ok. Thus, f has a continuously differentiable inverse on each Ok by, e.g., the inverse function theorem. This ensures that fj(Xpa G(j)) has a density with respect to the Lebesgue measure whenever Xpa G(j) does. By starting at the root node Xrt(G) = Nrt(G), which by assumption has a density, we can iteratively apply the above argumentation down the directed path from rt(G) to j in order to conclude that any Xj for j {1, . . . , p} has a density with respect to the Lebesgue measure. Since p Nj is assumed strictly positive three times continuous differentiable, the representation in Equation (21) furthermore yields that p Xj is three times differentiable; see, e.g., Theorem 11.4 and 11.5 of Schilling (2017). Now we argue that Xi has a continuous density. First note that PXi at least has a continuous density p Xi by arguments similar to those applied for Equation (21). By the assumption that PX = P X we especially have that PXi = P Xi which implies that also Xi has a continuous density. By virtue of the arguments for the bivariate setup we arrive at a contradiction, so it must hold that PX = P X. Proof of Lemma 6. Consider an SCM θ = ( G, ( fi), P N) { G} Dp 1 Pp G with G = G and let Q θ be the induced distribution. As Q θ is Markov with respect to G and generated by an additive noise model the density q θ factorizes as i=1 q θ(xi|xpa G(i)) = i=1 q Ni(xi fi(xpa G(i))). The cross entropy between PX and Q θ is then given by h(PX, Q θ) := E log q θ(X) i=1 E h log q Ni Xi fi(Xpa G(i)) i i=1 h Xi fi(Xpa G(i)), Ni , where the latter is a sum of the cross entropies between the distribution of Xi fi(Xpa G(i)) and the distribution of Ni. As Q θ is generated by a causal additive tree model with Gaussian Structure Learning for Directed Trees noise, we have for all 1 i p that Ni N(0, σ2 i ) for some σ2 i > 0. Hence for all 1 i p, h Xi fi(Xpa G(i)), Ni = E Xi fi(Xpa G(i)) 2 2π σi) + E Xi fi(Xpa G(i)) 2 Thus, for given set of causal functions ( fi) and a fixed i, the noise variance that minimizes the cross entropy is given by E Xi fi(Xpa G(i)) 2 . We thus have 2π σi) + E Xi fi(Xpa G(i)) 2 2 log E Xi fi(Xpa G(i)) 2 + 1 We conclude that inf Q { G} Dp 1 Pp G h(PX, Q) inf fi D1 E Xi fi(Xpa G(i)) 2 ! Finally, as D1 is dense in L2(PXpa G(i)), we have that inf fi D1 E Xi fi(Xpa G(i)) 2 = E Xi E[Xi|Xpa G(i)]) 2 + inf fi D1 E E[Xi|Xpa G(i)] fi(Xpa G(i)) 2 = E Xi E[Xi|Xpa G(i)]) 2 . Here we used that Xpa G(i) has density with respect to the Lebesgue measure, PXpa G(i) λ, and that the density is differentiable (see proof of Proposition 4). This concludes the first part of the proof. Jakobsen, Shah, B uhlmann and Peters For the second statement, we note that for any Q { G} F( G) Pp there exists some noise innovation distribution P N P such that Q is the distribution of X generated by structural assignments Xi := fi(Xpa G(i)) + Ni = E[Xi|Xpa G(i)] + Ni, for all 1 j p and mutually independent noise innovations N = ( N1, . . . , Np) P N Pp. Let q denote the density of Q with respect to the Lebesgue measure and let q Ni denote the density of Ni for all 1 i p. As Q is Markov with respect to G and generated by an additive noise model the density factorizes as i=1 q(xi|xpa G(i)) = i=1 q Ni(xi E[Xi|Xpa G(i) = xpa G(i)]). 
The cross entropy between PX and Q is given by h(PX, Q) = E [ log (q(X))] i=1 E h log q(Xi|Xpa G(i)) i i=1 E h log q Ni Xi E h Xi|Xpa G(i) i i i=1 h Xi E h Xi|Xpa G(i) i , Ni . Note that h(P, Q) = h(P)+DKL(P Q) h(P) with equality if and only if Q = P. Thus, the infimum is attained at noise innovations that are equal in distribution to Xi E[Xi|Xpa G(i)] (which has a density by assumption). That is, inf Q { G} F( G) Pp h(PX, Q) = i=1 inf Nj P Nj P h Xi E h Xi|Xpa G(i) i , Ni i=1 h Xi E h Xi|Xpa G(i) i Proof of Lemma 7. Let θ ΘR Tp Dp 3 Pp G and assume that condition (a) is satisfied, i.e., that for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version. Note that ℓG( G) ℓG(G) = inf Q { G} Dp 1 Pp G h(PX, Q) h(PX). (22) Structure Learning for Directed Trees Furthermore, by the considerations in the proof of Lemma 6 the infimum in Equation (22) is attained for Q , where the functions are given by the conditional expectation functionals. When condition (a) is satisfied we therefore know that Q { G} Dp 1 Pp G. Finally, ℓG( G) ℓG(G) = h(PX, Q ) h(PX) = DKL(PX Q ) > 0, where the last strict inequality follows from Proposition 4. Now let θ ΘR Tp Dp 3 Pp G. Assume that condition (b) is satisfied, i.e., for all 1 i p it holds that the causal function fi is contained within a function class Fi D1, which for all j = i satisfies arg min fi Fi E Xi fi(Xj) 2 Fi. (23) Define the modified Gaussian score function ℓG.mod( G) := 1 2 log Var Xi fpa G(i)i(Xpa G(i)) , where fji : R R is given by fji := arg min f Fi E Xi f(Xj) 2 , for all i = j. Now, for any G Tp \ {G}, it holds that ℓG.mod( G) ℓG.mod(G) = inf Q { G} (Fi)1 i p Pp G h(PX, Q) h(PX) = h(PX, Q ) h(PX) = DKL(PX Q ) > 0. Here we used the closedness in Equation (23) to argue that the infimum is attained for Q { G} (Fi)1 i p Pp G. Finally, since (Fi)1 i p Dp 1, Proposition 4 guarantees the strict inequality. Now let θ ΘR Tp Dp 3 Pp +C3. Assume that for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version, and assume that for all i = j it holds that Xi E[Xi|Xj] has a continuous density. With these assumptions we note that for any G Tp \ {G} it holds, by the arguments in the proof of Lemma 6, that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp h(PX, Q) h(PX) = h(PX, Q ) h(PX), where Q is generated by an additive noise model G (fi) (P Ni)1 i p with causal graph G Tp, with causal functions fi x 7 E[Xi|Xpa G(i) = x] D1 and noise innovations given by Ni D= Xi E[Xi|Xj] PNi PC0, i.e., noise innovations with continuous densities. Proposition 4 now yields that ℓE( G) ℓE(G) = DKL(PX Q ) > 0, Jakobsen, Shah, B uhlmann and Peters since PX is induced by a restricted causal additive tree model and Q is induced by a causal additive tree model { G} (fi) (P Ni)1 i p Tp Dp 1 Pp C0. D.2 Proofs of Section 3 Proof of Theorem 8. Assume that θ = (G, (fi), PN) ΘR with PN Pp G and G = (V, E). For simplicity of the proof, we assume that E[X] = 0 such that the Gaussian edge weight estimators simplify to ˆwji := ˆw G ji = 1 n Pn k=1 (Xk,i ˆϕji(Xk,j))2 1 n Pn k=1 X2 k,i for all j = i. Furthermore, define the Gaussian population (for i = j) and auxiliary (for (j i) E) edge weights by 2 log E[(Xi ϕji(Xj))2] , w ji := 1 2 log E[(Xi ϕji(Xj))2] respectively, where ϕji : R R is a fixed function satisfying E[( ˆϕji(Xj) ϕji(Xj))2| Xn] P n 0. Furthermore, for any G = (V, E) Tp denote ˆw( G) := X (j i) E ˆwji, w( G) := X (j i) E wji, w ( G) := X (j i) E\E w ji + X (j i) E E wji, as the total estimated, population and auxiliary edge weights for G. 
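(As a computational aside, not part of the proof: the estimated edge weights and the minimizing directed spanning tree defined above are straightforward to compute. The following Python sketch is illustrative only and is not the authors' implementation; it assumes centred data, a user-supplied regression routine `fit_regression` standing in for the conditional mean estimate phi_hat_ji, ideally fitted on the auxiliary sample, and it uses the Chu-Liu-Edmonds routine from NetworkX to minimize the total edge weight.)

```python
# Illustrative sketch (not the paper's code): Gaussian edge weights
#   w_hat[j, i] = 0.5 * log( mean((X_i - phi_hat_ji(X_j))^2) / mean(X_i^2) )
# for centred data, and the minimum edge weight directed spanning tree.
import numpy as np
import networkx as nx

def gaussian_edge_weights(X, fit_regression):
    """X: (n, p) array with centred columns; fit_regression(x, y) returns a
    callable estimate of t -> E[Y | X = t] (placeholder for any consistent
    regression method, ideally fitted on an auxiliary sample)."""
    n, p = X.shape
    W = np.full((p, p), np.inf)  # diagonal stays np.inf and is never used
    for j in range(p):
        for i in range(p):
            if i == j:
                continue
            phi_hat = fit_regression(X[:, j], X[:, i])
            resid = X[:, i] - phi_hat(X[:, j])
            W[j, i] = 0.5 * np.log(np.mean(resid ** 2) / np.mean(X[:, i] ** 2))
    return W

def minimum_weight_directed_tree(W):
    """Edges of the directed spanning tree minimizing the summed edge weights,
    computed with Chu-Liu-Edmonds (NetworkX's minimum spanning arborescence)."""
    p = W.shape[0]
    G = nx.DiGraph()
    for j in range(p):
        for i in range(p):
            if i != j and np.isfinite(W[j, i]):
                G.add_edge(j, i, weight=W[j, i])
    T = nx.minimum_spanning_arborescence(G, attr="weight")
    return sorted(T.edges())
```

The returned edge set corresponds to the estimator that minimizes the total estimated edge weight over all directed spanning trees.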
As the conditional expectation minimizes the MSPE among measurable functions, i.e., ϕji = arg minf:R R E[(Xi f(Xj))2], we especially have, for any i = j, that E[(Xi ϕji(Xj))2] E[(Xi ϕji(Xj))2]. This construction entails, for any G Tp, that w ( G) w( G), and w (G) = w(G). (24) Assumption 1 implies that there exists an m > 0 such that min G Tp\{G} ℓG( G) ℓG(G) = m > 0. (25) Thus, for any G Tp \ {G} it holds that by the identifiability assumption of Equation (25). Now note that ℓG( G) = w( G) + C with C = Pp i=1 log(E[X2 i ])/2 for all G Tp. Hence, we have, for all G Tp \ {G}, that 2 = w(G) + m Structure Learning for Directed Trees by the equality and inequalities in (26) and (24). Thus, we have that P( ˆG = G) = P arg min G=(V, E) Tp (j i) E ˆwji = G | ˆw( G) w ( G)| < m We conclude that it suffices to show that sup G Tp | ˆw( G) w ( G)| P n 0. To this end, let E := {(j i) : i, j V, i = j} \ E and note that sup G Tp | ˆw( G) w ( G)| 2 log E[(Xi ϕji(Xj))2] 2 log E[(Xi ϕji(Xj))2] 2 log E[(Xi ϕji(Xj))2] 2 log E[(Xi ϕji(Xj))2] Now consider a fixed term (j i) E in the second sum of (27). We can upper bound the absolute difference by 2 log E[(Xi ϕji(Xj))2] k=1 (Xk,i ˆϕji(Xk,j))2 ! log E[(Xi ϕji(Xj))2] log(E[X2 i ]) log In the upper bound of (28), the last absolute difference vanishes in probability due to the law of large numbers and the continuous mapping theorem. The first absolute difference Jakobsen, Shah, B uhlmann and Peters also vanishes by the following arguments. Note that, k=1 (Xk,i ˆϕji(Xk,j))2 k=1 (Xk,i ϕji(Xk,j))2 + 1 k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 k=1 (Xk,i ϕji(Xk,j)) (ϕji(Xk,j) ˆϕji(Xk,j)) . Hence, it holds that 1 n k=1 (Xk,i ˆϕji(Xk,j))2 1 k=1 (Xk,j ϕji(Xk,j))2 k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 k=1 (Xk,j ϕji(Xk,j)) (ϕji(Xk,j) ˆϕji(Xk,j)) k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 k=1 (Xk,j ϕji(Xk,j))2 k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2, (29) by Cauchy-Schwarz inequality. By the law of large numbers, we have that the first factor of the second term of (29) converges in probability to a constant, k=1 (Xk,j ϕji(Xk,j))2 P n E[X1,i ϕji(X1,j))2]. The first term and latter factor of the second term of Equation (29) vanish in probability by assumption. That is, for any ε > 0 we have that k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 > ε k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 ε > ε E h 1 n Pn k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 ε i E h E h (ϕji(X1,j) ˆϕji(X1,j))2 Xn i ε i Structure Learning for Directed Trees using conditional Jensen s inequality (x 7 min(x, ε) = x ε is concave) and the dominated convergence theorem. This proves that k=1 (Xk,j ˆϕji(Xk,j))2 P n E[X1,i ϕji(X1,j))2]. Thus, we have shown that the second term of (27) converges to zero in probability. Finally, the above arguments apply similarly to the first term of Equation (27) by exchanging every ϕji with ϕji. We have shown that sup G Tp | ˆw( G) w ( G)| P n 0, which concludes the proof. Proof of Theorem 9. Assume that for each sample size n N that θn = (G, ...) ΘR with G = (V, E), additive Gaussian noise, and identifiability gap min G Tp\{G} ℓG(G) ℓG( G) = qn > 0, with q 1 n = o( n). For simplicity of the proof, we assume that Eθn[X] = 0 such that the edge weight estimators simplify to ˆwji := ˆw G ji = ˆw G ji(Xn, Xn) = 1 n Pn k=1 (Xk,i ˆϕji(Xk,j))2 1 n Pn k=1 X2 k,i Furthermore, we continue with the notation and population quantities introduced in the proof of Theorem 8, i.e., wji = log(Eθn[(Xi E[Xi|Xj])2])/Eθn[X2 i ])/2, where we notionally have suppressed the dependence on n. We know that for each SCM θn it holds that ℓG(G) + qn ℓG( G), hence w(G) + qn w( G), for all G Tp \ {G}. 
Thus, arg min G=(V, E) Tp (j i) E ˆwji = G | ˆw(G) w(G)| < qn ˆw( G) w( G) qn For any G = (V, E) Tp we have that ˆw( G) w( G) = X (j i) E E ˆwji wji + X (j i) E\E ˆwji wji, where ˆwji and wji denote the estimated and population Gaussian weights for the edge (j i), respectively. Hence, it suffices to show that (j i) E, ε > 0 : Pθn(| ˆwji wji| < qnε) n 1, (j i) E, ε > 0 : Pθn ( ˆwji wji qnε) n 1. Jakobsen, Shah, B uhlmann and Peters To see this, note that if the above statements hold, then Pθn | ˆw(G) w(G)| < qn (j i) E | ˆwji wji| < qn | ˆwji wji| < qn 2(p 1) and for any G = (V, E) Tp Pθn ˆw( G) w( G) qn (j i) E E ˆwji wji + X (j i) E\E ˆwji wji qn | ˆwji wji| qn 2(p 1) ˆwji wji qn 2(p 1) hence the probability of the intersections also converges to one. The causal edges: Now fix (j i) E. We want to show that for all ε > 0 it holds that Pθn(| ˆwji wji| < qnε) n 1. First note that | ˆwji wji| 1 k=1 (Xk,i ˆϕji(Xk,j))2 ! log Eθn[(Xi ϕji(Xj))2] log(Eθn[X2 i ]) log where ˆϕji for each n is the estimated conditional expectation x 7 Eθn[Xi|Xj = x] based on samples from the auxiliary data set. It suffices to show the desired convergence in probability for each of the above terms. Furthermore, for all sequences of positive random variables (Zn) and positive constants c > 0 and for all ε > 0 there exists δ > 0 such that (q 1 n | log(Zn) log(c)| ε) (q 1 n |Zn c| δ), for sufficiently large n. To see this, note that if q 1 n (log(Zn) log(c)) ε, then Zn > exp(log(c) + qnε) = c exp(qnε) c(1 + qnε), so q 1 n (Zn c) cε. On the other hand, if q 1 n (log(Zn) log(c)) ε, then Zn c exp( εqn) c(1 εqn + ε2q2 n), so q 1 n (Zn c) Structure Learning for Directed Trees cε + cε2qn. In summary, if q 1 n | log(Zn) log(c)| ε, then q 1 n |Zn c| cε cε2qn > cε(1 M) =: δ where 1 > M > εqn for sufficiently large n. We conclude that it suffices to show that for all ε > 0 it holds that k=1 (Xk,i ˆϕji(Xk,j))2 Eθn[(Xi ϕji(Xj))2] k=1 X2 k,i Eθn[X2 i ] Equation (31) is satisfied as the summands are mean zero i.i.d. Therefore, with k=1 X2 k,i Eθn[X2 i ], where Eθn[q 1 n Wn] = 0, we have that Eθn[q 2 n W 2 n] = q 2 n n Eθn[(X2 i Eθn[X2 i ])2], hence Pθn(q 1 n Wn ε) q 2 n Eθn[W 2 n] ε2 q 2 n n supn N Eθn[(X2 i Eθn[X2 i ])2] ε2 for any ε > 0 as supn N Eθn X 4 2 < and q 1 n = o( n). Now we show Equation (30). First, we simplify the notation by letting Zk := Xk,i, Yk := Xk,j f := ϕji and ˆf := ˆϕji for all k N. Note that we have suppressed the dependence of f = ϕji on θn. We have that Zk ˆf(Yk) 2 = 1 k=1 (Zk f(Yk))2 + 1 k=1 (f(Yk) ˆf(Yk))2 k=1 (Zk f(Yk))(f(Yk) ˆf(Yk)) =: T1,n + T2,n + T3,n. It suffices to show that for all ε > 0 it holds that (a) Pθn |T1,n Eθn[(Z1 f(Y1))2]| qnε n 0, (b) Pθn (|T2,n| qnε) n 0, and (c) Pθn (|T3,n| qnε) n 0. Jakobsen, Shah, B uhlmann and Peters First we show (a). Each term in the sum of T1,n Eθn[(Z1 f(Y1))2] is mean zero and i.i.d., i.e., q 1 n Eθn[(Zk f(Yk))2 Eθn[(Z1 f(Y1))2]] = 0. Furthermore, Varθn(q 1 n (T1,n Eθn[(Z1 f(Y1))2])) k=1 (Zk f(Yk))2 Eθn[(Z1 f(Y1))2] k=1 Varθn (Zk f(Yk))2 Eθn[(Z1 f(Y1))2] q 2 n n sup n N Varθn (Z1 f(Y1))2 since q 1 n = o( n) and supn N Eθn X 4 2 < . Hence, Pθn |q 1 n (T1,n E[(Z1 f(Y1))2])| ε Varθn(q 1 n (T,n E[(Z1 f(Y1))2])) by Chebyshev s inequality, proving (a). Now we show (b). To that end, note that the terms of T2,n is i.i.d. conditional on Xn. For a fixed 1 > ε > 0 we have Pθn |q 1 n T2,n| ε = Eθn h Pθn q 1 n T2,n ε| Xn 1 i Eθn h Eθn h q 1 n T2,n| Xn i 1 i = Eθn h q 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i where we used the conditional Markov s inequality. 
Now fix 1 > δ > 0 and define An,δ := (q 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i > δ) and note that by assumption there exists an Nδ N such that n Nδ : Pθn(An,δ) < δ. Hence, for n Nδ we have that Eθn h q 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i = Eθn h 1An,δq 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i + Eθn h 1Ac n,δq 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i Eθn h 1An,δq 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i + Eθn h 1Ac n,δδ i Eθn 1An,δ + δ = Pθn(An,δ) + δ < 2δ, (32) Structure Learning for Directed Trees hence lim supn Pθn |q 1 n T2,n| ε < 2δ/ε, i.e., Pθn |q 1 n T2,n| ε 0 as δ > 0 was chosen arbitrarily, proving (b). Now we prove (c). To this end, recall that k=1 (Zk f(Yk))(f(Yk) ˆf(Yk)), is, conditional on X, an i.i.d. sum with conditional mean zero Eθn[T3,n| Xn] = 2Eθn[(Zk f(Yk))(f(Yk) ˆf(Yk))| Xn] = 2Eθn[(Eθn[Zk|Yk, Xn] f(Yk))(f(Yk) ˆf(Yk))| Xn] = 2Eθn[(f(Yk) f(Yk))(f(Yk) ˆf(Yk))| Xn] = 0, and conditional second moment given by Eθn[T 2 3,n| Xn] = 4 k=1 Eθn[(Zk f(Yk))2(f(Yk) ˆf(Yk))2| Xn] n Eθn h (Zk f(Yk))2(f(Yk) ˆf(Yk))2| Xn i n Eθn h Eθn h (Zk f(Yk))2| Xn, Yk i (f(Yk) ˆf(Yk))2| Xn i n Eθn h Varθn(Zk|Yk)(f(Yk) ˆf(Yk))2| Xn i n Eθn h (f(Yk) ˆf(Yk))2| Xn i , Pθn-almost surely. Hence, w.l.o.g. assume that 0 < ε < 1 and note that the conditional Markov s inequality yields Pθn(|q 1 n T3,n| ε) = Eθn[Pθn(|q 1 n T3,n| ε| Xn) 1] ε2 Eθn h Eθn h q 2 n T 2 3,n| Xn i 1 i (33) q 2 n n Eθn h (f(Yk) ˆf(Yk))2| Xn i 1 . By conditional Jensen s inequality, we have that Eθn h (f(Yk) ˆf(Yk))2| Xn i 1 + Eθn h (f(Yk) ˆf(Yk))2| Xn i2 1 + Eθn h (f(Yk) ˆf(Yk))4| Xn i . Fix δ > 0 and let An,δ := q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i > δ and note that Pθn(An,δ) n 0, hence there exists an Nδ N such that n Nδ : Pθn(An,δ) < δ. Furthermore, as q 1 n = o( n) there exists an N N such that q 2 n /n < δ for all n N. Similar to the Jakobsen, Shah, B uhlmann and Peters arguments in Equation (32) we then have that C Pθn(|q 1 n T3,n| ε) Eθn 1 + Eθn h (f(Yk) ˆf(Yk))4| Xn i 1 q 2 n n + Eθn q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i 1 q 2 n n + Eθn[1An,δ] + Eθn[1Ac n,δδ] < δ + Pθn(An,δ) + δ < 3δ, for any n Nδ N, so Pθn(q 1 n T3,n ε) n 0, proving (c). The non-causal edges: Now fix (j i) E, we want to show, for any ε > 0 that Pθn( ˆwji wji qnε) n 1, ˆwji wji = 1 k=1 (Xk,i ˆϕji(Xk,j))2 ! log E[(Xi ϕji(Xj))2] # log(E[X2 i ]) log 2(D1,n + D2,n). We have that Pθn( ˆwji wji qnε) Pθn ((D1,n qnε) (|D2,n| < qnε)), where the second event has already been shown to have probability converging to one in Equation (31). Thus, it suffices to show that Pθn (D1,n qnε) n 1. By similar arguments as above we have for any sequence of positive random variables (Kn)n 1 and a positive constant K that for all ε > 0 there exists an δ > 0 such that Pθn (log(Kn) log(K) < qnε) Pθn(Kn K < qnδ), for sufficiently large n N. To see this, note that if log(Kn) log(K) < qnε, then Kn < K exp( εqn) K(1 εqn + ε2q2 n), so q 1 n (Kn K) < Kε + Kε2qn < Kε(1 M) =: δ where 1 > M > εqn for sufficiently large n, since qn 0. Thus, it suffices to show that for any ε > 0 it holds that k=1 (Xk,i ˆϕji(Xk,j))2 Eθn[(Xi ϕji(Xj))2] qnε Structure Learning for Directed Trees Again, we simplify the notation Zk := Xk,i, Yk := Xk,j, f = ϕji and ˆf := ˆϕji for all k N. Now define the following terms Zk ˆf(Yk) 2 = 1 k=1 (Zk f(Yk))2 k=1 {(f(Yk) ˆf(Yk))2 δ2 n,θn} k=1 {(Zk f(Yk))(f(Yk) ˆf(Yk)) + δ2 n,θn/2} =: T1,n + T2,n + T3,n, where δ2 n,θn := Eθn[(f(Y1) ˆf(Y1))2| Xn] = Eθn[(ϕji(Xj) ˆϕji(Xj))2| Xn]. 
It suffices to show that for all ε > 0 it holds that (d) Pθn |T1,n Eθn[(Z1 f(Y1))2]| qnε n 0, (e) Pθn | T2,n| qnε n 0, and (f) Pθn T3,n qnε n 1. Condition (d) holds by arguments similar to (a) for the causal edges. Now we prove (e). The expansion, conditional on Xn, is a sum of mean zero i.i.d. terms, hence Eθn q 2 n T 2 2,n Xn = q 2 n n Eθn h {(f(Yk) ˆf(Yk))2 δ2 n,θn}2| Xn i = q 2 n n Eθn h (f(Yk) ˆf(Yk))4 + (δ2 n,θn)2 2(f(Yk) ˆf(Yk))2δ2 n,θn| Xn i Eθn h (f(Yk) ˆf(Yk))4| Xn i (δ2 n,θn)2 q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i , using that (δ2 n,θn)2 0. Fix 1 > δ > 0 and let An,δ := q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i > δ and note that there exists an Nδ N such that n Nδ : Pθn(An,δ) < δ. Similar to the previous arguments we have for any 1 > ε > 0 and n Nδ that Pθn T2,n qnε = Eθn h Pθn q 1 n T2,n ε Xn 1 i ε2 Eθn h Eθn h q 2 n T 2 2,n| Xn i 1 i q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i 1 Eθn[1An,δ] + Eθn[1Ac n,δδ] < 2δ Jakobsen, Shah, B uhlmann and Peters by the conditional Markov s inequality. Since δ > 0 was chosen arbitrarily, we conclude that (e) holds. Finally we show (f). Recall that in the analysis of the causal edges, we defined k=1 (Zk f(Yk))(f(Yk) ˆf(Yk)). Hence, we have that T3,n = T3,n + δ2 n,θn. We realize that for any 0 < ε < 1 Pθn( T3,n < qnε) Pθn(T3,n + δ2 n,θn qnε) = Pθn T3,n qnε + δ2 n,θn Pθn T 2 3,n qnε + δ2 n,θn 2 Pθn T 2 3,n (qnε)2 = Pθn q 2 n T 2 3,n ε2 = Eθn h Pθn q 2 n T 2 3,n ε2| Xn 1 i ε2 Eθn h Eθn h q 2 n T 2 3,n Xn i 1 i where we used the convergence shown in the proof of (c); see Equation (33). To see that the former arguments apply to non-causal edges, simply note that they did not use any conditions restricted to causal edges. This concludes the proof. D.3 Proofs of Section 4 Lemma D.1 Consider an i.i.d. sequence (Xm)m 1 of random variables with Xm Rd independent from a random infinite sequence X Q i=1 Rd. Let (ψn)n 1 be a sequence of measurable functions s.t. for all n 1, ψn : Rd (Q i=1 Rd) Rq satisfies the following conditions: (a) E[ψn(Xm, X)| X] = 0 almost surely, (b) Σ Rq q : Pn m=1 Var(ψn(Xm, X)| X) P n Σ, and (c) ε > 0 : Pn m=1 E[ ψn(Xm, X) 2+ε 2 | X] P n 0. It holds that m=1 ψn(Xm, X) D n N(0, Σ), Structure Learning for Directed Trees Proof of Lemma D.1. Let the random sequences be defined on a common probability space (Ω, F, P) and define Anm := E[ψn(Xm, X)| X], m=1 Var(ψn(Xm, X)| X), m=1 E[ ψn(Xm, X) 2+ε 2 | X]. By assumption we have that P( n,m(Anm = 0)) = 1, Bn P 0 and Cn P 0 as n . First, note that for any subsequence (nk)k 1 of the positive integers, there exists a subsequence (nkl)l N such that P( lim l Bnkl = 0) = 1 for ( lim l Bnkl = 0) := {ω Ω: lim l Bnkl(ω) = 0}, P( lim l Cnkl = 0) = 1 for ( lim l Cnkl = 0) := {ω Ω: lim l Cnkl(ω) = 0}. Thus, define G := ( n,m(Anm = 0) ( lim l Bnkl = 0) ( lim l Cnkl = 0)) Ω, with P(G) = 1. Now fix x X(G) := { X(ω) Q j=1 Rd : ω G} and note that l 1, 1 m nkl : E[ψnkl(Xm, x)] = 0, m=1 Var(ψnkl(Xm, x)) l Σ, and m=1 E[ ψnkl(Xm, x) 2+ε 2 ] l 0. Furthermore, for any l 1 ψnkl(X1, x), ..., ψnkl(Xnkl, x), are jointly independent, hence by Lyapunov s central limit theorem for triangular arrays (see, e.g., Van der Vaart, 2000, Proposition 2.27, and recall that Lyapunov s condition implies the Lindeberg Feller condition) that m=1 ψnkl(Xm, x) D l Z N(0, Σ). Jakobsen, Shah, B uhlmann and Peters The above convergence in distribution is equivalent to the following statement: for any continuous bounded function g : Rq R it holds that m=1 ψnkl(Xm, x) = E [g(Z)] . 
Fix a continuous and bounded g and note that the above convergence holds for all x X(G) with P(G) = 1. Thus, it must hold that m=1 ψnkl(Xm, X) a.s. l E [g(Z)] . Finally, as (nkl)l 1 is a subsequence of an arbitrary subsequence of positive integers, we have that m=1 ψn(Xm, x) P n E [g(Z)] , and since g is bounded the dominated convergence theorem yields that m=1 ψn(Xm, X) m=1 ψn(Xm, X) n E [g(Z)] . As g was chosen arbitrarily, the above convergence holds for any continuous bounded g. We conclude that m=1 ψn(Xm, X) D n N(0, Σ), proving the theorem. Lemma D.2 (Shah and Peters, 2020, Lemma 19) Let P be a family of distributions for a random variable ζ R and suppose ζ1, ζ2, . . . are i.i.d. copies of ζ. For each n N let Sn = n 1 Pn i=1 ζi. Suppose that for all P P we have EP (ζ) = 0 and EP |ζ|1+η < c for some η, c > 0. We have that for all ε > 0, lim n sup P P P (|Sn| > ε) = 0. Lemma D.3 Let U be a random element and let (Zn)n 1 be an i.i.d. sequence of random variables such that U (Zn)n 1 and let ((Wnm)m n)n 1 be a triangular array of random variables and (gn)n 1 be measurable mappings with the following properties: (a) n 1, m n : Wnm = gn(Zm, U), Structure Learning for Directed Trees (b) η > 0 : E |Wn1|1+η | U = Op(1), as n . Then, writing Wn := Pn m=1 Wnm/n, we have Wn E (Wn1 | U) P n 0. Proof of Lemma D.3. Denote jn(Zm, U) := gn(Zm, U) E[gn(Z1, U)|U], for any n 1 and m n. Let δ > 0 be given. Pick M > 0 and N N such that the events Ωn := n E h |gn(Z1, U)|1+η | U i M o , satisfy P (Ωc n) < δ for n N. Notice that U(Ωn) = n un : E h |gn(Z1, un)|1+ηi M o , since U (Zn)n 1. Fix ε > 0. Then, for all n N P Wn E (Wn | U) > ε = P m=1 jn(Zm, U) m=1 jn(Zm, U) By the dominated convergence theorem, the first term on the RHS converges to 0 if m=1 jn(Zm, U) = sup un U(Ωn) P m=1 jn(Zm, un) which implies the desired statement as δ > 0 was chosen arbitrarily. Now note that for any n N, un U(Ωn) and all m N it holds that E[|jn(Zm, un)|1+η] = E[|gn(Zm, un) E[gn(Z1, un)]|1+η] 2η E[|gn(Zm, un)|1+η] + |E[gn(Z1, un)]|1+η 2η E[|gn(Zm, un)|1+η] + E[|gn(Z1, un)|1+η] < 2η+1M =: c by the cr and Jensen s inequalities, and E[jn(Zm, un)] = 0. Jakobsen, Shah, B uhlmann and Peters For any n N, define the following set of pushforward measures Pn := {P = (jn(Z1, un))(P) : un U(Ωn)}. For any P Pn, let (Ym)m 1 be a sequence of i.i.d. random variables such that Y1 D= jn(Z1, un) for some un U(Ωn). Notice that for all n N and P Pn it holds that EP |Y1|1+η < c and EP [Y1] = 0. Thus, sup un U(Ωn) P m=1 jn(Zm, un) = sup P Pn P 1 n sup P k Pk P 1 n by the weak uniform law of large numbers, Lemma D.2. Lemma D.4 (Asymptotic normality of edge weight components) Let for each sample size n N, ˆϕn ji denote the estimated conditional mean function ϕji based on the auxiliary sample Xn. For any j = i and m n, define ˆRnm,ji := {Xm,i ˆϕn ji(Xm,j)}, ˆµn,ji := 1 m=1 ˆR2 nm,ji, Rm,ji := {Xm,i ϕji(Xm,j)}, µji := E[R2 1,ji], !2 , ˆνn,i := 1 νi := Var(X1,i), δ2 n,ji := E[( ˆϕn ji(X1,j) ϕji(X1,j))2| Xn]. " bΣn,R bΣn,RV bΣ n,RV bΣn,V ˆR2 nm( ˆR2 nm) ˆµnˆµ n ˆR2 nm ˆV m ˆµnˆν n ˆVm( ˆR2 nm) ˆνnˆµ n ˆVm ˆV m ˆνnˆν n denote the p2 p2 matrix empirical covariance matrix, where the squaring of vectors means that each entry is squared. Suppose there exists ξ > 0 such that for all j = i, the following three conditions hold: (i) E X 4+ξ < . (ii) E[| ˆϕn ji(Xj) ϕji(Xj)|4+ξ| Xn] = Op(1), as n . (iii) Σ Rp2 p2 : Var ˆR2 n1 δ2 n µ ˆV1 ν P n Σ, where Σ is constant. 
Structure Learning for Directed Trees Then we have that bΣn P Σ Rp2 p2 and ˆR2 nm δ2 n µ ˆVm ν = n ˆµn δ2 n µ ˆνn ν D N(0, Σ). (34) Proof of Lemma D.4. We prove the lemma under the assumption that E[X] = 0 under which the variance estimator simplify to ˆVm,i := X2 m,i and ˆνn,i := 1 n Pn m=1 ˆVm,i for all 1 i p. The proof only gets more notionally cumbersome without this assumption. It should follow in all generality by applying expansion techniques and Slutsky s theorem similar to the standard arguments showing asymptotic normality of the regular sample variance. Let X denote the auxilliary i.i.d. process such that Xn is the first n-coordinates of said process. Note that conditioning ˆϕn ji on X it is equivalent to conditioning on Xn by the i.i.d. structure of X and that ˆϕn ji only depends on Xn. First, we define for all j = i, n N and m n the following conditional expectation regression error ˆδnm,ji := {ϕji(Xm,j) ˆϕn ji(Xm,j)}. Furthermore, for each n N and m n define Ψn(Xm, X) := ˆR2 nm δ2 n µ ˆVm ν where only Xn (containing the first n coordinates of X) is used, and ψn(Xm, X) := 1 nΨn(Xm, X). Note that the desired conclusion of Equation (34) follows by verifying condition (a), (b) and (c) of Lemma D.1. First, we show (a), the conditional mean zero condition. To that end, note that for any i {1, . . . , p} and j {1, . . . , p} \ {i} it holds that ˆR2 nm,ji = (Xm,i ϕji(Xm,j) + ϕji(Xm,j) ˆϕn ji(Xm,j))2 = (Rm,ji + ˆδnm,ji)2 = R2 m,ji + ˆδ2 nm,ji + 2Rm,jiˆδnm,ji. Hence, we have that ˆR2 nm,ji µji δ2 n,ji = (R2 m,ji µji) + (ˆδ2 nm,ji δ2 n,ji) + 2Rm,jiˆδnm,ji. (35) The terms of Equation (35) are mean zero conditionally on X, since E[R2 m,ji| X] = E[R2 m,ji] = µji, E[ˆδ2 nm,ji| X] = δ2 n,ji and E[Rm,jiˆδnm,ji| X] = E[E[Rm,jiˆδnm,ji| X, Xm,j]| X] = E[E[Xm,i ϕji(Xm,j)| X, Xm,j]ˆδnm,ji| X] = E[(E[Xm,i|Xm,j] ϕji(Xm,j))ˆδnm,ji| X] Jakobsen, Shah, B uhlmann and Peters as ϕji(Xm,j) = E[Xm,i|Xm,j] almost surely. Furthermore, E[X2 m,i Var(Xi)| X] = E[X2 m,i] Var(Xi) = 0. We conclude that E[ψn(Xm, X)| X] = 1 n E ˆR2 nm δ2 n µ ˆVm ν almost surely. With respect to (b), convergence of the sum of variances, we have, by assumption, that Σn := Σn,R Σn,RV Σ n,RV Σn,V := Var Ψn(X1, X)| X P n Σ, where Σ is a positive semi-definite matrix. Furthermore, we have that (Xm)m 1 is an i.i.d. sequence independent of X. Therefore, m=1 Var(ψn(Xm, X)| X) = 1 n Var(Ψn(Xm, X)| X) Finally, we show that condition (c), a conditional Lindeberg-Feller condition, is fulfilled. To this end, note that with ε := ξ/2 > 0 we have that E h ψn(Xm, X) 2+ε 2 X i ˆR2 nm δ2 n µ ˆVm ν ˆR2 nm δ2 n µ ˆVm ν i =j E h | ˆR2 nm,ji µji δ2 n,ji|2+ε| X i i=1 E|X2 m,i Var(Xi)|2+ε , (36) by the cr inequality. We now realize that the second factor of Equation (36) is stochastically bounded. To see this, note that for any j = i it holds that E h | ˆR2 nm,ji µji δ2 n,ji|2+ε| X i 21+ε(E[| ˆRnm,ji|4+2ε| X] + µ2+ε ji + E[|δ2 n,ji( X)|2+ε| X]). Structure Learning for Directed Trees The first term of the upper bound in Equation (37) is Op(1), E[| ˆRnm,ji|4+2ε| X] = E[|Xm,i ˆϕn ji(Xm,j)|4+2ε| X] 23+2ε(E|Xm,i ϕji(Xm,j)|4+2ε + E[|ϕji(Xm,i) ˆϕn ji(Xm,j)|4+2ε| X]) = 23+2ε(E[|Rm,ji|4+ξ] + E[|ˆδnm,ji|4+ξ| X]) = Op(1), as E X 4+ξ 2 < and E[|ˆδnm,ji|4+ξ| X] = Op(1). This holds because Rm,ji = {Xm,i E[Xm,i|Xm,j]} and both terms are in L4+ξ(P) if Xm,i L4+ξ(P) which is guaranteed as E X 4+ξ 2 < . 
For the third term in the upper bound of Equation (37), we note that by the conditional Jensen s inequality, we have that E[|δ2 n,ji|2+ε| X] E[|ϕji(Xm,i) ˆϕn ji(Xm,j)|4+2ε| X] = E[|ˆδnm,ji|4+ξ| X] = Op(1), by assumption. Therefore, we have that m=1 E h Ψn(Xm, X) 2+ε 2 X i n 2 Op(1) = n ε/2Op(1) P n 0, proving the conditional Lindeberg-Feller condition. By Lemma D.1 it holds that m=1 ψn(Xm, X) D n N(0, Σ). Now it only remains to prove that bΣn Σn P 0, or, equivalently, that each entry converges to zero in probability. For example, for the entries of the first block matrix with j = i and l = r we prove that |bΣn,R,ji,lr Σn,R,ji,lr| P 0. Now note that the observable estimated covariance matrix entry is given by bΣn,R,ji,lr = 1 m=1 ˆR2 nm,ji ˆR2 nm,lr ˆµn,jiˆµn,lr, while the unobservable conditional covariance matrix is given by Σn,R,ji,lr = E[( ˆR2 nm,ji µji δ2 n,ji)( ˆR2 nm,lr µlr δ2 n,lr)| X] = E[ ˆR2 nm,ji ˆR2 nm,lr| X] (µji + δ2 n,ji)(µlr + δ2 n,lr) = E[ ˆR2 nm,ji ˆR2 nm,lr| X] E[ ˆR2 nm,ji| X]E[ ˆR2 nm,lr| X], Jakobsen, Shah, B uhlmann and Peters where we have used that E[ ˆR2 nm,ji| X] = µji + δ2 n,ji; see Equation (35) and its discussion. Note that the second term of the covariance matrix estimator expands to ˆµn,jiˆµn,lr = m=1 ˆR2 nm,ji m=1 ˆR2 nm,lr m=1 ˆR2 nm,ji E[ ˆR2 nm,ji] m=1 ˆR2 nm,lr E[ ˆR2 nm,lr] E[ ˆR2 nm,ji]E[ ˆR2 nm,lr] m=1 ˆR2 nm,ji E[ ˆR2 nm,lr] m=1 ˆR2 nm,lr E[ ˆR2 nm,ji], |bΣn,R,ji,lr Σn,R,ji,lr| m=1 ( ˆR2 nm,ji ˆR2 nm,lr E[ ˆR2 nm,ji ˆR2 nm,lr| X]) m=1 ˆR2 nm,ji E[ ˆR2 nm,ji| X] m=1 ˆR2 nm,lr E[ ˆR2 nm,lr| X] m=1 ( ˆR2 nm,ji E[ ˆR2 nm,lr| X] E[ ˆR2 nm,ji| X]E[ ˆR2 nm,lr| X]) m=1 ( ˆR2 nm,lr E[ ˆR2 nm,ji| X] E[ ˆR2 nm,ji| X]E[ ˆR2 nm,lr| X]) . (38) Each of these terms tends to zero in probability by Lemma D.3. For example, for the first term of Equation (38) it suffices to show that E h | ˆR2 nm,ji ˆR2 nm,lr|1+ε| X i = Op(1), for some ε > 0. Fix ε = ξ/4 and note, by the cr-inequality, that ˆR2 nm,ji ˆR2 nm,lr = (Xm,i ˆϕn ji(Xm,j))2(Xm,r ˆϕn lr(Xm,l))2 4(R2 m,ji + ˆδ2 nm,ji)(R2 m,lr + ˆδ2 nm,lr). Structure Learning for Directed Trees Thus, by the cr-inequality and the conditional Cauchy-Schwarz inequality we have, with c = 41+ε22ε, that c 1E[| ˆR2 nm,ji ˆR2 nm,lr|1+ε| X] c 141+εE[|R2 m,ji + ˆδ2 nm,ji|1+ε|R2 m,lr + ˆδ2 nm,lr|1+ε| X] E[(|Rm,ji|2+2ε + |ˆδnm,ji|2+2ε)(|Rm,lr|2+2ε + |ˆδnm,lr|2+2ε)| X] E[|Rm,ji|2+2ε|Rm,lr|2+2ε| X] + E[|Rm,ji|2+2ε|ˆδnm,lr|2+2ε| X] + E[|ˆδnm,ji|2+2ε|Rm,lr|2+2ε| X] + E[|ˆδnm,ji|2+2ε|ˆδnm,lr|2+2ε| X] E[|Rm,ji|4+ξ]E[|Rm,lr|4+ξ] + E[|Rm,ji|4+ξ]E[|ˆδnm,lr|4+ξ| X] + E[|ˆδnm,ji|4+ξ| X]E[|Rm,lr|4+ξ] + E[|ˆδnm,ji|4+ξ| X]E[|ˆδnm,lr|4+ξ| X] as E[|ˆδnm,ji|4+ξ| X] = Op(1) for all j = i by assumption and E[|Rm,ji|4+ξ] < since E X 4+ξ 2 < . Similar arguments show convergence in probability of the entries in the other block submatrices of bΣn less Σn, yielding the desired conclusion. Proof of Theorem 10. We prove the theorem under the simplifying assumption that E[X] = 0 for which we can simplify the variance estimator by ˆVm,i := X2 m,i and ˆνn,i := 1 n Pn m=1 ˆVm,i for all 1 i p. First, note (using the notation introduced in Lemma D.4) that ˆ M1 = { ˆR2 n1,ji}j =i, ˆµ = ˆµn, ˆν = ˆνn and bΣ = bΣn. The conditional mean of ˆ M1 given Xn is given by E[ ˆ M1| Xn] = E[{ ˆR2 n1,ji}j =i| Xn] = µ + δ2 n, see Equation (35). Similarly we have that E[ ˆV1| Xn] = E[ ˆV1] = ν. 
Subtracting a constant (conditional on Xn) does not change the conditional variance, hence Var ˆR2 n1 δ2 n µ ˆV1 ν = Var ( ˆ M 1 , ˆV 1 ) Xn Σ is constant and positive semi-definite with strictly positive diagonal. As such, the conditions of Lemma D.4 is satisfied, which yields that ˆR2 nm δ2 n µ ˆVm ν = n ˆµ δ2 n µ ˆν ν D n N(0, Σ), (39) " bΣM bΣMV bΣ MV bΣV P Σ =: ΣM ΣMV Σ MV ΣV Jakobsen, Shah, B uhlmann and Peters For any j = i we denote ˆµji δ2 n,ji ˆνi where the latter is a shorthand notation for the Gaussian edge weight w G ji. Fix α (0, 1). First, consider (j i) E and note that n ˆµji µji ˆνi νi ˆµji δ2 n,ji µji ˆνi νi = n δ2 n,ji 0 = n E[ˆδ2 nm,ji| Xn] 0 P n 0, (40) by assumption (iv). Hence, Equation (39), Equation (40) and the delta method yields that n ( ˆwji wji) = n log ˆµji = n(log(ˆµji) log(µji) log(ˆνi) + log(νi)) D n N(0, σ2 ji), ˆσ2 ji := bΣM,ji ˆµ2 ji + bΣV,i ˆν2 i 2 bΣMV,ji,i P σ2 ji := ΣM,ji µ2 ji + ΣV,i ν2 i 2ΣMV,ji,i Here bΣM,ji and bΣV,i and their limits use a shorthand notation that denote the corresponding diagonal element, e.g., bΣM,ji := bΣM,ji,ji. An asymptotically valid marginal confidence interval for wji with level α is, by virtue of the above convergence in distribution, given by ˆwji ˆσji q(1 α where q(1 α 2 ) is the 1 α/2 quantile of the standard normal distribution. That is, P ˆwji ˆσji q(1 α 2 ) 2 n wji ˆwji + ˆσji q(1 α On the other hand, for any (j i) E we have, by similar arguments, except that no assumption guarantees that nδ2 n,ji vanishes, that P wji σji q(1 α 2 ) 2 n wji wji + σji q(1 α σ2 ji := bΣM,ji (ˆµji δ2 n,ji)2 + bΣV,i ˆν2 i 2 bΣMV,ji,i (ˆµji δ2 n,ji)ˆνi P σ2 ji := ΣM,ji µ2 ji + ΣV,i ν2 i 2ΣMV,ji,i Structure Learning for Directed Trees by the convergence in Equation (39). Note that σ2 ji is not observable since δ2 n,ji is not observable. Now define ˆuα,ji, ˆlα,ji := ˆwji ˆσji q 1 α 2p(p 1) uα,ji, lα,ji := wji σji q 1 α 2p(p 1) for all j = i. Thus, we have the following Bonferroni corrected simultaneous confidence interval for the Gaussian edge weights lim inf n P wji h ˆlα,ji, ˆuα,ji i \ wji h lα,ji, uα,ji i The above confidence region has the correct asymptotic level, but it is infeasible to compute in that wji, σji and E are not directly observable from data. Furthermore, define C(ˆlα, lα, ˆuα, uα) := arg min G=(V, E) Tp (j i) E w ji : (j i) E, w ji [ˆlα,ji, ˆuα,ji], (j i) E, w ji [ lα,ji, uα,ji] , and note that this is an unobservable confidence region for the causal graph. That is, lim inf n P(G C(ˆlα, lα, ˆuα, uα)) lim inf n P (j i) E (wji [ˆlα,ji, ˆuα,ji]) \ (j i) E (wji [ lα,ji, uα,ji]) Our proposed confidence region has the form ˆC := C(ˆlα, ˆuα) := arg min G=(V, E) Tp (j i) E w ji : j = i, w ji [ˆlα,ji, ˆuα,ji] , which corresponds to the biased but computable confidence region j =i [ˆlα,ji, ˆuα,ji] = Y ˆwji ˆσji q 1 α 2p(p 1) for the Gaussian edge weights, where the product is over all combinations of possible edges 1 j = i p. The biased confidence region Q j =i[ˆlα,ji, ˆuα,ji] does not necessarily contain the population Gaussian edge weights with a probability of at least 1 α in the large sample limit. However, it can be used to construct a conservative confidence region for the causal Jakobsen, Shah, B uhlmann and Peters graph. To see this, note that by further penalizing the wrong (non-causal) edge weights, the causal graph still yields the minimum edge weight directed spanning tree. 
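(As a computational aside, not part of the proof: the simultaneous bounds and their conservative use described above are easy to evaluate. The sketch below is illustrative only; `W_hat` and `SE_hat` are assumed to hold the estimated edge weights and their estimated standard errors, and the restricted-score routine is one possible way of encoding a substructure restriction R = (E_R, E^miss_R, r) by removing incompatible edges before rerunning Chu-Liu-Edmonds, in the spirit of the Check C test of Section 4.)

```python
# Illustrative sketch (not the paper's code): Bonferroni-corrected bounds
# [l_hat, u_hat] for the Gaussian edge weights, and a conservative check of a
# substructure restriction that rejects when the best restricted score under
# the lower bounds exceeds the unrestricted score under the upper bounds.
import numpy as np
import networkx as nx
from scipy.stats import norm

def edge_weight_bounds(W_hat, SE_hat, n, alpha=0.05):
    """W_hat, SE_hat: (p, p) arrays; set W_hat's diagonal to np.inf and
    SE_hat's diagonal to 0 (diagonal entries are never used)."""
    p = W_hat.shape[0]
    q = norm.ppf(1 - alpha / (2 * p * (p - 1)))  # Bonferroni over p(p-1) edges
    half = q * SE_hat / np.sqrt(n)
    return W_hat - half, W_hat + half

def arborescence_score(W):
    """Minimum total edge weight over all directed spanning trees."""
    p = W.shape[0]
    G = nx.DiGraph()
    for j in range(p):
        for i in range(p):
            if i != j and np.isfinite(W[j, i]):
                G.add_edge(j, i, weight=W[j, i])
    T = nx.minimum_spanning_arborescence(G, attr="weight")
    return sum(d["weight"] for _, _, d in T.edges(data=True))

def substructure_rejected(L, U, forced_edges=(), missing_edges=(), root=None):
    """Conservative rejection rule: compare the restricted score under the
    lower bounds L with the unrestricted score under the upper bounds U.
    Assumes the restriction still admits a directed spanning tree."""
    L_res = L.copy()
    for (j, i) in forced_edges:   # encode (j -> i) by forbidding other parents of i
        L_res[:, i] = np.inf
        L_res[j, i] = L[j, i]
    for (j, i) in missing_edges:  # forbid the edge (j -> i)
        L_res[j, i] = np.inf
    if root is not None:          # forbid any edge into the hypothesized root
        L_res[:, root] = np.inf
    return arborescence_score(L_res) > arborescence_score(U)
```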
Hence, lim inf n P(G C(ˆlα, ˆuα)) lim inf n P (j i) E (wji [ˆlα,ji, ˆuα,ji]) \ (j i) E (wji [ lα,ji, uα,ji]) \ (j i) E ( uα,ji ˆuα,ji) as P ( uα,ji ˆuα,ji) n 1 for all (j i) E by Lemma D.5 below. Lemma D.5 Suppose that the assumptions of Lemma D.4 hold. It holds that (j i) E, α (0, 1) : P ( uα,ji ˆuα,ji) n 1. Proof of Lemma D.5. Fix any (j i) E and α (0, 1) and note that we want to show that uα,ji ˆuα,ji wji + c σji n ˆwji + c ˆσji n 0 log (ˆµji) + c ˆσji n log ˆµji δ2 n,ji c σji n holds with probability converging to one, where c is a strictly positive constant. It suffices to show that an even smaller quantity is non-negative with probability converging to one. That is, it suffices to show that 0 log (ˆµji) + c ˆσji n log ˆµji δ2 n,ji c σ ji n, with increasing probability, where bΣM,ji (ˆµji δ2 n,ji)2 + bΣV,i ˆν2 i + 2 |bΣMV,ji,i| (ˆµji δ2 n,ji)ˆνi σji, with P( σ ji > 0) n 1. Let dn(t) : [0, ) R denote the random function given by dn(t) := log (ˆµji) + c ˆσji n log (ˆµji t) bΣM,ji (ˆµji t)2 + bΣV,i ˆν2 i + 2 |bΣMV,ji,i| (ˆµji t)ˆνi . It holds that dn(0) = 0 surely, so by the mean value theorem, the desired conclusion holds if it with probability one (as n tends to infinity) holds, for all t [0, δ2 n,ji], that d n(t) 0 . Structure Learning for Directed Trees Now fix η > 0 and choose Mη, ε1, . . . , ε5 > 0 such that the constant lower bounds in the following inequalities are strictly positive Ωn(1) : = (ˆµji Mη), Ωn(2) : = (ΣM,ji ε1 bΣM,ji ΣM,ji + ε1), Ωn(3) : = (ΣV,i ε2 bΣV,i ΣV,i + ε2), Ωn(4) : = (0 |bΣMV,ji,i| |ΣMV,ji,i| + ε3), Ωn(5) : = (µji ε4 ˆµji δ2 n,ji µji + ε4), Ωn(6) : = (νi ε5 ˆνi νi + ε5), and lim infn P(Ωn(1)) > 1 η. This is possible as ˆµji δ2 n,ji P n µji > 0 and δ2 n,ji = E[|ˆδnm,ji|2| X] = E[|ˆδnm,ji| 4+ξ 2+ξ/2 | X] E[|ˆδnm,ji|4+ξ| X] 1 2+ξ/2 = Op(1), by the conditional Jensen s inequality and concavity of [0, ) x 7 x 1 2+ξ/2 , which implies that ˆµji = (ˆµji ˆδ2 n,ji µji) + (ˆδ2 n,ji + µji) = op(1) + Op(1) = Op(1). Furthermore, as bΣM,ji P n ΣM,ji > 0, bΣV,i P n ΣV,i > 0, |bΣMV,ji,i| P n |ΣMV,ji,i| 0, ˆνi P n νi > 0, it holds that lim sup n P 1 k 6 Ωn(k)c 1 k 6 lim sup n P(Ωn(k)c) = lim sup n P(Ωn(1)c) η. Here we used that the diagonal elements of the limit covariance matrix are assumed strictly positive. That µji, νi > 0 follows from the fact that Xi E[Xi|Xj] is assumed to have a density (w.r.t. Lebesgue measure) and that the variables are non-degenerate νi = Var(Xi) > 0. Thus, we have that lim inf n P 1 k 6 Ωn(k) Now consider a fixed ω T 1 k 6 Ωn(k) and note that with gn : [0, δ2 n,ji] R given by gn(t) = ˆµji t we have that gn is decreasing and that gn([0, δ2 n,ji]) [µji ε4, ˆµji] (0, Mη] Jakobsen, Shah, B uhlmann and Peters We have for any t [0, δ2 n,ji] that d n(t) = 1 ˆµji t c n bΣM,ji (ˆµji t)2 + bΣV,i ˆν2 i + 2|bΣMV,ji,i| (ˆµji t)ˆνi bΣM,ji (ˆµji t)3 + |bΣMV,ji,i| (ˆµji t)2ˆνi d n(t) = 1 ˆµji t c n gn(t)2 + bΣV,i ˆν2 i + 2|bΣMV,ji,i| gn(t)3 + |bΣMV,ji,i| ˆµ2 ji + bΣV,i bΣM,ji (ˆµji δ2 n,ji)3 + |bΣMV,ji,i| (ˆµji δ2 n,ji)2ˆνi M2η + ΣV,i ε2 (µji ε4)3 + |ΣMV,ji,i| + ε3 (µji ε4)2(νi ε5) =: 1 Mη CMη,ε1,ε2,ε3,ε4,ε5 n for n (CMη,ε1,ε2,ε3,ε4,ε5Mη)2. We conclude that for n (CMη,ε1,ε2,ε3,ε4,ε5Mη)2 P ( uα,ji ˆuα,ji) = P 0 log (ˆµji) + c ˆσji n log ˆµji δ2 n,ji c σji n P t [0, δ2 n,ji] : d n(t) 0 1 k 6 Ωn(k) lim inf n P ( uα,ji ˆuα,ji) lim inf n P 1 k 6 Ωn(k) and as η > 0 was chosen arbitrarily, we have the desired conclusion P ( uα,ji ˆuα,ji) n 1. Structure Learning for Directed Trees Proof of Theorem 11. 
Consider a collection of arbitrary and possibly data-dependent substructures R1, R2, ... and level α (0, 1). First, we note that the score associated with two sets of edge weights w1 and w2 is weakly monotone, that is, STp(w1) STp(w2) if w1 and w2 satisfy the component-wise partial ordering w1 w2. Furthermore, the restricted score function w 7 STp(R)(w) is also weakly monotone for any set of restrictions R. Let k N and suppose that the null hypothesis H0(Rk) : ERk \ E = , E \ Emiss Rk = , rk = rt(G), corresponding to the restriction Rk = (ERk, Emiss Rk , rk) is true. If there is a graph in ˆCBon := ˆC(ˆlα, ˆuα) satisfying the restrictions imposed by the substructure Rk, then there exist ˆlα w ˆuα such that STp(w ) attains its minimum value in a graph satisfying Rk. Penalizing (or removing) edges that are not present in the minimum edge weight directed tree does not affect the score of the minimum edge weigh directed tree. Hence, it holds that STp(Rk)(w ) = STp(w ). Monotonicity of STp(Rk) and STp in the edge weights imply that STp(Rk)(ˆlα) STp(Rk)(w ) = STp(w ) STp(ˆuα). Hence, STp(Rk)(ˆlα) > STp(ˆuα) entails that no graph in ˆC satisfies the restrictions of Rk. (This is a slightly conservative criterion as STp(Rk)(ˆlα) STp(ˆuα) does not necessarily guarantee that a graph in ˆCBon satisfies the restrictions of Rk.) Therefore, if ψCheck C Rk = 1, then we know that there is no graph in ˆCBon satisfying the restrictions of Rk. As the causal graph G satisfies the restriction Rk we conclude that G is not contained in ˆCBon. Thus for any true Rk we have that (ψCheck C Rk = 1) (G ˆCBon). Since this holds for any true Rk, the conclusion follows by noting that lim sup n P k:H0(Rk) is true (ψCheck C Rk = 1) lim sup n P(G ˆCBon) α, where we used Theorem 10. For the claim about the level guarantee of the Conv B test, let k N and consider a true substructure restriction Rk = (ERk, Emiss Rk , rk). Suppose that G ˆCBon. This implies that there exist ˆlα w ˆuα such that STp(w ) attains its minimum value in a graph satisfying Rk. Now let w = (w ji)j =i be given by w ji = ˆuji if [ l = j : (l i) ERk] [(i j) ERk] [(j i) Emiss Rk ] [i = r], w ji otherwise, Jakobsen, Shah, B uhlmann and Peters where we penalize edges that are in disagreement with the substructure restriction Rk. It is clear that the MWDST using the edge weights w and w , i.e., G Tp(w ) and G Tp(w ), both satisfy the substructure restriction Rk. However, as w is unknown, so is w . We lower bound the unknown w by ˆl and define ˇw = ( ˇwji)j =i as ˇwji = ˆuji if [ l = j : (l i) ERk] [(i j) ERk] [(j i) Emiss Rk ] [i = r], ˆlji otherwise, Now, the MWDST G Tp( w) may use edges that are in disagreement with Rk or not satisfy Rk. (For example, consider a three node causal graph V = {1, 2, 3} with edges 1 2 3 and consider the substructure restriction ERk = {(1 2)}. Now it may happen that ˆl12 + ˆl23 > ˆl13 + ˆu32 or ˆl12 + ˆl23 > ˆl23 + ˆl31, that is, the MWDST G Tp( w) does not satisfy the substructure restriction.) We now argue that this happens with probability tending to zero. By the assumed identifiability, i.e., that Assumption 1 holds, we have that := min G Tp\G ℓG( G) ℓG(G) > 0. (41) Now consider the events (An)n N (that are independent of k) given by |ˆlji w G ji| < p 1 We realize that on An it must hold that the MWDST G Tp( w) satisfies Rk. Thus, for a true substructure restriction Rk we have that An (G ˆCBon) (ψConv B Rk = 0) Ac n (G ˆCBon) (ψConv B Rk = 1), for all n N. 
Hence, we have that lim sup n P k:H0(Rk) is true (ψConv B Rk = 1) = lim sup n P (G ˆCBon) Ac n lim sup n P(G ˆCBon) + lim sup n P(Ac n) by Theorem 10, proving the claim. It only remains to argue that lim supn P(Ac n) = 0. To that end, note that by Theorem 10 it holds that ˆΣ P n Σ and that n ˆµ δ2 n ˆν D n N(0, Σ). By the strengthened assumptions, i.e., the n-convergence for the non-causal edges, we have that (see the arguments for the causal edges from Theorem 10) D n N(0, Σ). Structure Learning for Directed Trees Thus, for any j = i, that ˆuji, ˆlji = 1 zα ˆσji 2 n P n 1 2 log µji = w G ji, (42) since ˆµ P n µ, ˆν P n ν, and ˆσji P n σji. The convergence statements in Equation (42) obviously implies that P(An) n 1, since is strictly positive, see Equation (41)). This concludes the proof. D.4 Proofs of Section 5 D.4.1 Proofs of first results in Section 5 Proof of Lemma 12. As conditioning reduces entropy we always have that ℓCE( G, i) = h(Xi|Xpa G(i)) = h(Xi E[Xi|Xpa G(i)]|Xpa G(i)) h(Xi E[Xi|Xpa G(i)]) = ℓE( G, i). Furthermore, note that when conditioning we throw out dependence information captured by the mutual information I(Xi E[Xi|Xpa G(i)]; Xpa G(i)), which is zero if and only if Xi E[Xi|Xpa G(i)] Xpa G(i). This is especially the case for the true graph, i.e., Xi E[Xi|Xpa G(i)] Xpa G(i), implying that ℓCE(G, i) = ℓE(G, i). Consequently, we have that the local conditional entropy score gap lower bounds the local entropy score gap, ℓCE( G, i) ℓCE(G, i) ℓE( G, i) ℓE(G, i). Furthermore, from the arguments in the proof of Lemma 6 we have that ℓE( G, i) = inf Ni P Ni P h Xi E h Xi|Xpa G(i) i , Ni inf Ni P Ni PG h Xi E h Xi|Xpa G(i) i , Ni = ℓG( G, i) + log( If X is generated by a causal additive tree model with Gaussian noise, i.e., with generating SCM θ = (G, (fi), PN) with PN Pp G, then ℓE(G, i) = h(Xi E[Xi|Xpa G(i)]) = h(Ni) = log( 2πeσi) = log( 2 log(E[N2 i ]) = log( 2πe) + l G(G, i), in which case the local entropy score gap lower bounds the local Gaussian score gap ℓE( G, i) ℓE(G, i) ℓG( G, i) ℓG(G, i). Jakobsen, Shah, B uhlmann and Peters Proof of Lemma 13. Note that E[Y |X] = E[f(X) + NY |X] = f(X) + E[NY |X] = f(X) + E[NY ], since NY NX = X. Hence, the score difference can be written as ℓE( G) ℓE(G) = ℓE( G, X) ℓE(G, X) + ℓE( G, Y ) ℓE(G, Y ) = h(X E(X|Y )) h(X) + h(Y ) h(Y E(Y |X)) = h(X E(X|Y )) h(X) + h(Y ) h(NY + E[NY ]) = h(X E(X|Y )) h(X) + h(Y ) h(NY ), as the differential entropy is translation invariant. Now note that as NY NX it holds that NY f(X), so conditioning on f(X) yields that h(Y ) = h(Y |f(X)) + I(Y ; f(X)) = h(f(X) + NY |f(X)) + I(Y ; f(X)) = h(NY ) + I(Y ; f(X)). Similarly, conditioning on X yields that h(Y ) = h(Y |X) + I(Y ; X) = h(NY ) + I(Y ; X), which proves that I(Y ; f(X)) = I(Y ; X). This equality is normally derived by restricting f to be bijective, but here it holds regardless by the structural assignment form, as Y only depends on X through f(X). Furthermore, we have that h(X E[X|Y ]) = I(X E[X|Y ]; Y ) + h(X E[X|Y ]|Y ) = I(X E[X|Y ]; Y ) + h(X|Y ). h(X E[X|Y ]) h(X) = I(X E[X|Y ]; Y ) + h(X|Y ) h(X) = I(X E[X|Y ]; Y ) I(Y ; X). ℓE( G) ℓE(G) = h(X E[X|Y ]) h(X) + h(Y ) h(NY ) = I(X E[X|Y ]; Y ) I(Y ; X) + h(NY ) + I(Y ; f(X)) h(NY ) = I(X E[X|Y ]; Y ) I(Y ; X) + I(Y ; f(X)) = I(X E[X|Y ]; Y ), proving the claim. Structure Learning for Directed Trees Proof of Proposition 14. As the conditional mean E[X|Y ] vanishes, we have that ℓE( G) ℓE(G) = I(X E(X|Y ); Y ) = I(Y ; f(X)), where the last equality was derived in the proof of Lemma 13. 
Now let f(X)G and NG Y be independent normal distributed random variables with the same mean and variance as f(X) and NY . That is, f(X)G N(E[f(X)], Var(f(X))), NG Y N(E[NY ], Var(NY )) with NG Y f(X)G such that f(X)G + NG Y N(E[f(X)] + E[NY ], Var(f(X)) + Var(NY )). (a) If DKL(f(X) f(X)G) DKL(NY NG Y ) then by Lemma C.1 of Silva (2009) we have, since X NY , that I(Y ; f(X)) = I(f(X) + NY ; f(X)) I(f(X)G + NG Y ; f(X)G), Note, we have equality if and only if f(X) and NY are jointly Gaussian. Furthermore, I(f(X)G + NG Y ; f(X)G) = h(f(X)G + NG Y ) h(f(X)G + NG Y |f(X)G) = h(f(X)G + NG Y ) h(NG Y ) 2π(Var(f(X)) + Var(NY ))) log( p 2πVar(NY )) 2 log Var(f(X)) + Var(NY ) 2 log 1 + Var(f(X)) (b) If f(X)+NY is log-concave distributed, then by Theorem 3 of Marsiglietti and Kostina (2018) we have that h(f(X) + NY ) 1 2 log (4Var(f(X) + NY )) = 1 2 log (4(Var(f(X)) + Var(NY )) . Furthermore, it is well known that for fixed variance, the normal distribution maximizes entropy, hence h(NY ) h(NG Y ) = 1 2 log (2πVar(NY )) . Therefore, we get that I(Y ; f(X)) = I(f(X) + NY ; f(X)) = h(f(X) + NY ) h(f(X) + NY |f(X)) = h(f(X) + NY ) h(NY ) 2 log (4(Var(f(X)) + Var(NY )) 1 2 log (2πe Var(NY )) πe Var(f(X)) Jakobsen, Shah, B uhlmann and Peters which yields a strictly positive lower bound if and only if πe Var(f(X)) Var(NY ) > 1 Var(f(X)) Var(NY ) > πe Lemma D.6 Two different but Markov equivalent trees G and ˆG share the exact same edges except for a single reversed directed path between the two root nodes of the graphs, ˆG : c1 c2 cr 1 cr, G : cr cr 1 c2 c1, with c1 = rt( ˆG) and cr = rt( G). Proof of Lemma D.6. First, note that there always exists a unique directed path in ˆG from rt( ˆG) to rt( G) ˆG : rt( ˆG) = c1 cr 1 cr = rt( G). Since G and ˆG are Markov equivalent, they share the same skeleton, so in G the above path must be reversed. That is, there exists a unique directed path in G from rt( G) to rt( ˆG) given by G : rt( G) = cr cr 1 c1 = rt( ˆG), If r = p we are done, so assume r < p. As ˆG is a directed tree there must exists a node z2 which is not a part of the above path but is a child of a node in the path. That is, there exists a node z1 {c1, . . . , cr} such that ˆG contains the edge ˆG : z1 z2. Furthermore, by equality of skeleton, this edge must also be present in G, Assume for contradiction that z2 z1 in G. As such, it must hold that z1 = cr = rt( G) for otherwise if z1 {c1, . . . , cr 1} then z1 would have two parents in G, a contradiction since G is a directed tree. However, if z1 = cr = rt( G) then there is an incoming edge into the root node, a contradiction. We conclude that the directed edge z1 z2 also is present in G. Any paths further out on this branch will coincide in both graphs for otherwise there exists nodes with two parents. These arguments show that any paths branching out from the main reversed path will coincide in both ˆG and G. Thus, the two graphs coincide up to a directed path between root nodes that is reversed. Structure Learning for Directed Trees Proof of Proposition 15. By Lemma D.6 there exists a path reversal G : rt(G) = c1 c2 cr 1 cr = rt( G), G : rt( G) = cr cr 1 c2 c1 = rt(G), while all other edges in G = (V, E) and G = (V, E) coincide. Hence, the entropy score difference reduces to ℓE( G) ℓE(G) = h(Xrt( G)) + X (j,i) E h(Xi E[Xi|Xj]) h(Xrt(G)) X (j,i) E h(Xi E[Xi|Xj]) i=1 h(Xci E[Xci|Xci+1]) i=2 h(Xci E[Xci|Xci 1]). h(Xcr) h(Xc1) = i=1 h(Xci) = i=1 h(Xci+1) h(Xci). 
ℓE( G) ℓE(G) i=1 h(Xci E[Xci|Xci+1]) + h(Xci+1) h(Xci+1 E[Xci+1|Xci]) h(Xci) i=1 ℓE(ci L99 ci+1) min 1 i r 1 ℓE(ci L99 ci+1), which concludes the proof. D.4.2 Proof of Theorem 16 We first describe the graphs that result from the reduction technique described in 5.3. To do so, define L(G, G) := {L VR : ch GR(L) = (pa GR(L) = pa GR(L) ch GR(L) = )}, Jakobsen, Shah, B uhlmann and Peters containing the sink nodes in GR that are either not sink nodes in GR or sink nodes in GR with different parents: pa GR(L) = pa GR(L). Now fix any L L(G, G) VR and note that its only parent in GR, pa GR(L), is either also a parent of L, a child of L or not adjacent to L, in GR. That is, one and only one of the following sets is non-empty Z(L) : = pa GR(L) pa GR(L), ( staying parents ) Y (L) : = pa GR(L) ch GR(L), ( parents to children ) W(L) : = pa GR(L) (V \ {L ch GR(L) pa GR(L)}) ( removing parents ) We define the GR parent and children of L that are not adjacent to L in GR as D(L) : = pa GR(L) (V \ {L ch GR(L) pa GR(L)}), and O(L) : = ch GR(L) (V \ {L ch GR(L) pa GR(L)}), respectively. All such sets contain at most one node and by slight abuse of notation, we use the same letters to refer to the nodes. We will henceforth suppress the dependence on L if the choice is clear from the context. Figure 2 visualizes the above sets. Now partition Tp \ {G} into the three following disjoint partitions for which there exists a reduced graph sink node L L(G, G) such that W(L), Y (L) and Z(L) is non-empty, respectively. That is, we define Tp(G, W) : = { G Tp \ {G} : L L(G, G) s.t. W(L) = }, Tp(G, Y ) : = { G Tp \ {G} : L L(G, G) s.t. Y (L) = } \ Tp(G, W), Tp(G, Z) : = { G Tp \ {G} : L L(G, G) s.t. Z(L) = } \ (Tp(G, W) Tp(G, Y )). Using that Tp(G, W) Tp(G, Y ) Tp(G, Z) = Tp(G), we can now find a lower bound for the score gap that holds uniformly over all alternative directed tree graphs Tp \ {G}: min G Tp\{G} ℓE( G) ℓE(G) = min min G Tp(G,Z) ℓE( G) ℓE(G), min G Tp(G,W) ℓE( G) ℓE(G), min G Tp(G,Y ) ℓE( G) ℓE(G) . We now turn to each of these three terms individually and first consider alternative graphs in the partitioning Tp(G, Z). The following lower bound consists of possibly non-localized conditional dependence properties of the observable distribution PX. (That is, the bound may involve nodes that are not close to each other in the graph G.) Lemma D.7 Let ΠZ(G) denote all tuples (z, l, o) V 3 of adjacent nodes (z l) E for which there exists a node o nd G(l) \ {z, l}. It holds that min G Tp(G,Z) ℓE( G) ℓE(G) min (z,l,o) ΠZ(G) I(Xz; Xo|Xl). The next result proves a lower bound that holds uniformly over all alternative graphs in Tp(G, W). The lower bound consists only of local conditional dependence properties. That Structure Learning for Directed Trees is, for any subgraph of the causal graph G of the form Xo Xw Xl or Xo Xw Xl we measure, by means of conditional mutual information, the conditional dependence of the two adjacent nodes Xw and Xl conditional on Xo, I(Xw; Xl|Xo). The lower bound consists of the smallest of all such local conditional dependence measures. Lemma D.8 Let ΠW (G) denote all tuples (w, l, o) V 3 of adjacent nodes (w l) E and o (ch G(w) \ {l}) pa G(w). It holds that that min G Tp(G,W) ℓE( G) ℓE(G) min (w,l,o) ΠW (G) I(Xw; Xl|Xo). A uniform lower bound of the score gap over all alternative graphs in the final partition Tp(G, Y ) is given by the smallest edge-reversal of any edge in the causal graph G. Lemma D.9 It holds that min G Tp(G,Y ) ℓE( G) ℓE(G) min (j i) E ℓE(j L99 i). 
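(Before turning to the proofs, we note that the index sets appearing in Lemmas D.7 and D.8 are easy to enumerate for a given candidate causal tree. The following sketch is illustrative only and not part of the paper's code; it lists the triples in Pi_Z(G) and Pi_W(G) for a tree represented as a NetworkX DiGraph.)

```python
# Illustrative sketch: enumerating the node triples entering the
# identifiability-gap lower bounds of Lemmas D.7 and D.8 for a directed tree G.
import networkx as nx

def pi_Z(G):
    """Triples (z, l, o) with (z -> l) an edge of G and o a non-descendant
    of l with o not in {z, l} (Lemma D.7)."""
    triples = []
    for z, l in G.edges():
        non_desc = set(G.nodes()) - nx.descendants(G, l) - {l}
        for o in non_desc - {z}:
            triples.append((z, l, o))
    return triples

def pi_W(G):
    """Triples (w, l, o) with (w -> l) an edge of G and o another child of w
    or the parent of w (Lemma D.8)."""
    triples = []
    for w, l in G.edges():
        others = (set(G.successors(w)) - {l}) | set(G.predecessors(w))
        for o in others:
            triples.append((w, l, o))
    return triples

# Example: the chain 1 -> 2 -> 3.
G = nx.DiGraph([(1, 2), (2, 3)])
print(pi_Z(G))  # [(2, 3, 1)]: I(X_2; X_1 | X_3) enters the bound of Lemma D.7
print(pi_W(G))  # [(2, 3, 1)]: I(X_2; X_3 | X_1) enters the bound of Lemma D.8
```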
An immediate consequence of Lemmas D.7 to D.9 is that the entropy identifiability gap is given by the smallest of the lower bounds derived for each partition, see Theorem 16. Thus, it only remains to prove Lemmas D.7 to D.9. Proof of Lemma D.7. Let G ΠZ(G) such that Z = . This implies that Y = W = as L can only have one parent in G. Furthermore, D = as L can only have one parent in G and O = for otherwise L would have been deleted by the deletion procedure in Section 5. Assume without loss of generality that O = {O1, . . . , Ok} for some k N. The two subgraphs are illustrated in Figure 16. V1 = {Z, L}c V1 = {Z, L, O, A1, . . . , Ak}c Figure 16: Illustration of the reduced form graphs GR and GR for the case G ΠZ(G). A1, . . . , Ak are possibly empty sets of nodes, and dashed rectangle nodes denotes a possibly multi-node subgraph over the variables enclosed. The bi-directed edges means that the edge can be directed in both directions. An edge pointing into the multi-node subgraph, can possibly be multiple edges into distinct nodes of the subgraph. Jakobsen, Shah, B uhlmann and Peters For ease of notation, fix any 1 i k and denote O := Oi. We note that in G the following d-separation holds Thus, we have for all probability measures Q { G} F( G) Pp over nodes V that Z O | L (as the path between Z and O is blocked by L and all probability measures generated in accordance with an SCM are Markovian with respect to the generating graph G). Recall that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp DKL(PX Q) = inf Q { G} F( G) Pp h(PX, Q) h(PX). Now fix Q = q λp { G} F( G) Pp and note that it factorizes as Q = QA|Z,O,LQZ|LQO|LQL, i.e., the density q factorizes as q(x) = q A|Z,O,L(a|z, o, l)q Z,O,L(z, o, l) = q A|Z,O,L(a|z, o, l)q Z|L(z|l)q O|L(o|l)q L(l), for λp-almost all x = (a, z, o, l) Rp where A = V \ {Z, O, L}. Hence, the cross entropy splits additively into h(PX, Q) E[ log(q A|Z,O,L(A|Z, O, L))] + E[ log(q Z|L(Z|L))] + E[ log(q O|L(O|L))] + E[ log(q L(L))]. (43) Now note, e.g., that for a conditional distribution (Markov kernel) QZ|L it holds that 0 DKL(PZ|LPL QZ|LPL) = E log q Z|L(Z|L)p L(L) p Z|L(Z|L)p L(L) = E[ log(q Z|L(Z|L))] E[ log(p Z|L(Z|L))], proving that E[ log(q Z|L(Z|L))] E[ log(p Z|L(Z|L))]. By similar arguments, we get that the three other terms in the lower bound of Equation (43) are bounded below by E[ log(q A|Z,O,L(A|Z, O, L))] E[ log(p A|Z,O,L(A|Z, O, L))], E[ log(q O|L(O|L))] E[ log(p O|L(O|L))], E[ log(q L(L))] E[ log(p L(L))]. This implies that inf Q { G} F( G) Pp h(PX, Q) h(PX, Q ), Structure Learning for Directed Trees where Q = PA|Z,O,LPZ|LPO|LPL. On the other hand, we know that PX factorizes as PX = PA|Z,O,LPZ,O|LPL. Thus we have the following entropy score gap lower bound ℓE( G) ℓE(G) h(PX, Q ) h(PX) = DKL(PX Q ) = DKL(PA|Z,O,LPZ,O|LPL PA|Z,O,LPZ|LPO|LPL) = DKL(PZ,O|LPL PZ|LPO|LPL) = DKL(PZ,O|L PZ|LPO|L|PL) = I(Z; O|L). ΠZ(G) denotes all tuples (z, l, o) V 3 of adjacent nodes (z l) E for which there exists a node o nd G(l) \ {z, l}. For any graph G Tp(G, Z) we can, by the above considerations, find a tuple (z, l, o) ΠZ(G) such that ℓE( G) ℓE(G) I(Xo; Xz | Xl). We conclude that min G Tp(G,Z) ℓE( G) ℓE(G) min (z,l,o) ΠZ(G) I(Xo; Xz|Xl). Proof of Lemma D.8. Fix any G Tp(G, W) and L with W = such that Z = Y = . We have illustrated the subgraph GR in Figure 17 and the possible subgraphs GR in Figure 18. {W, L}c W L Figure 17: Illustrations of the GR subgraph for for G Tp(G, W). 
Note that for any of the three possible local graph structures presented in Figure 18 there exists an A {O1, . . . , Ok, D} such that L GRW | A, i.e., A blocks the path between L and W. Thus, for all probability measures Q { G} F( G) Pp over nodes V = {1, .., p} it always holds that L W | A. By arguments similar to those in the proof of Lemma D.7, we note that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp h(PX, Q) h(PX), and that inf Q { G} F( G) Pp h(PX, Q) h(PX, Q ), Jakobsen, Shah, B uhlmann and Peters GR: D = , O = {L, D}c D L GR: D = , O = {D, L, O, A1, . . . , Ak}c D L ... GR: D = , O = Figure 18: Illustrations of the possible GR subgraphs for G Tp(G, W). for PX = PK|W,L,APW,L|APA and Q = PK|W,L,APL|APW|APA where K = V \{W, L, A}. To that end, we now have that ℓE( G) ℓE(G) h(PX, Q ) h(PX) = DKL(PX Q ) = DKL(PK|W,L,APW,L|APA PK|W,L,APL|APW|APA) = DKL(PW,L|APA PL|APW|APA) = DKL(PW,L|A PL|APW|A|PA) = I(W; L|A). Let ˆΠW (G) denote all tuples (w, l, a) V 3 of adjacent nodes (w l) E for which there exists a node a nd G(l) \ {w}. Now note that for any graph G Tp(G, W) we can, by the above considerations, find a tuple (w, l, a) ˆΠW (G) such that ℓE( G) ℓE(G) I(Xw; Xl | Xa). (44) (Conversely for any tuple (w, l, a) ˆΠW (G) we can construct a graph G Tp(G, W) such that (44) holds. To see this, fix (w, l, a) ˆΠW (G) and construct G such that the subtree with root node l is identical in both G and G and a blocks the path between l and w in G.) Therefore, the following lower bound holds (and it is not unnecessarily small). min G Tp(G,W) ℓE( G) ℓE(G) min (w,l,a) ˆΠW (G) I(Xw; Xl | Xa). For any (w, l, a) ˆΠW (G) it either holds that a (ch G(w) \ {l}) pa G(w) or that there exists an o (ch G(w) \ {l}) pa G(w) blocking the path between a and l in G such that Structure Learning for Directed Trees Xl Xa|Xo. Furthermore, we note that as Xl (Xo, Xa) | Xw we have that I(Xw; Xl|Xa) = h(Xl|Xa) h(Xl|Xa, Xw) = h(Xl|Xa) h(Xl|Xw) = h(Xl|Xa) h(Xl|Xo, Xw) h(Xl|Xa, Xo) h(Xl|Xo, Xw) = h(Xl|Xo) h(Xl|Xo, Xw) = I(Xw; Xl | Xo), as further conditioning reduces conditional entropy. Let ΠW (G) denote all tuples (w, l, o) V 3 of adjacent nodes (w l) E and o (ch G(w) \ {l}) pa G(w). By the above considerations we conclude that min G Tp(G,W) ℓE( G) ℓE(G) min (w,l,o) ΠW (G) I(Xw; Xl | Xo). Proof of Lemma D.9. Fix G Tp(G, Y ) and L such that Y = . It holds that W = Z = . We have illustrated the GR in Figure 19 and the three possible subgraphs GR in Figure 20. {Y, L}c Y L Figure 19: Illustrations of the GR subgraph for G Tp(G, Y ). Note that for any of the three possible local graph structures of GR illustrated in Figure 20 we have that for all probability measures Q { G} F( G) Pp factorizes as QA|L,Y QL,Y , where A = V \ {L, Y }. It always holds that QL,Y is the distribution of ( L, Y ) generated in accordance with a structural causal model of the form Y := f Y ( L) + NY , (45) where f Y (l) = E[Y |L = l] for all l R, and any L( NY ), L( L) P with NY L. Now recall that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp h(PX, Q) h(PX), and notice that by arguments similar to those in the proof of Lemma D.7 we get h(PX, Q) = h(PX, QA|L,Y QL,Y ) = E[ log(q A|L,Y (A|L, Y ))] + h(PL,Y , QL,Y ) E[ log(p A|L,Y (A|L, Y ))] + h(PL,Y , QL,Y ), Jakobsen, Shah, B uhlmann and Peters GR: D = , O = {Y, L, D}c D L Y GR: D = , O = {D, L, Y, O, A, B}c GR: D = , O = Figure 20: Illustrations of the possible GR subgraphs for G Tp(G, Y ). and that h(PX) = E[ log(p A|L,Y (A|L, Y ))] + h(PL,Y ). Thus, we have that ℓE( G) ℓE(G) inf Q { G} F( G) Pp h(PL,Y , QL,Y ) h(PL,Y ). 
For any Q = QA|L,Y QL,Y { G} F( G) Pp we have that QL,Y is uniquely determined by a marginal distribution QL P and the noise distribution of NY q NY λ P from the additive noise structural assignment in Equation (45) for Y and the causal function f Y . Thus, the density q L,Y of QL,Y is given by q L,Y (l, y) = q Y |L(y|l)q L(l) = q NY (y f Y (l))q L(l) = q NY (y E[Y |L = l])q L(l). h(PL,Y , QL,Y ) = E [ log (q L,Y (L, Y ))] = E log q Y |L(Y |L) + E [ log (q L(L))] = E h log q NY (Y E[Y |L]) i + h(PL, QL) = h(Y E[Y |L], NY ) + h(PL, QL) h(Y E[Y |L]) + h(L), where we used that h(P, Q) = DKL(P, Q) + h(P) h(P). Thus, we have that ℓE( G) ℓE(G) inf Q { G} F( G) Pp h(PL,Y , QL,Y ) h(PL,Y ) h(Y E[Y |L]) + h(L) h(L E[L|Y ]) h(Y ) = ℓE(Y L99 L). Structure Learning for Directed Trees We conclude that min G Tp(G,Y ) ℓE( G) ℓE(G) min (i j) E ℓE(j L99 i). D.4.3 Remaining proof of Section 5 Proof of Theorem 17. Consider a graph G Tp(G, Z) and let GR,1 = (ER,1, VR,1) and GR,1 = ( ER,1, VR,1) be the reduced graphs after the initial edge and node deletion procedure of Section 5.3. The deletion procedure does not change the score gap, that is, ℓG( G) ℓG(G) = ℓG( GR,1) ℓG(GR,1). For any i 1 and fixed GR,i and GR,i we define LR,i := {L VR,i : ch GR,i(L) = (pa GR,i(L) = pa GR,i(L) ch GR,i(L) = )}. Now fix L1 LR,1 such that Z1 = , where Y1, Z1, W1, D1 and O1 are defined similarly to the variables in Section 5. Let O1 = {O1,1, . . . , O1,k1}, for some k1 N. Assume that there exists an i {1, . . . , k1} such that (Z1 O1,i) ER,1 in which case we have the following two paths in GR,1 and GR,1 GR,1 : O1,i Z1 L1, and GR,1 : Z1 L1 O1,i. Since O1,i GR,1Z1 | L1, an entropy score gap lower bound is given by ℓG( GR,1) ℓG(GR,1) ℓE( GR,1) ℓE(GR,1) = inf Q { G} F( G) Pp h(PX, Q) PX = I(O1,i; Z1|L1), with PX = PK|O,Z,LPO,Z|LPL and Q = PK|O,Z,LPZ|LPO|LPL for K = V \ {O, Z, L}, by arguments similar to those from the proof of Lemma D.8. Now note that (Z1, O1,i, L1) ΠW (GR,1) ΠW (G) as (Z1 O1,i) ER,1 and L1 ch GR,1(Z1) \ {O1,i} (ch GR,1(Z1) \ {O1,i}) pa GR,1(Z1). Hence, ℓG( GR,1) ℓG(GR,1) min (w,l,o) ΠW (G) I(Xw; Xl|Xo). (46) Assume now that for all i {1, .., k1} we have (Z1 O1,i) ER,1. Let ˆGR,1 = ( ˆER,1, VR,1) denote an intermediate graph where ˆER,1 is identical to ER,1 except the edges {(L1 O1,i) : 1 i k1} ER,1 are replaced by the edges {(Z1 O1,i) : 1 i k1}. It holds that ℓG( GR,1) ℓG(GR,1) = ℓG( GR,1) ℓG( ˆGR,1) + ℓG( ˆGR,1) ℓG(GR,1) ℓG( ˆGR,1) ℓG(GR,1). Jakobsen, Shah, B uhlmann and Peters Note that this score gap lower bound is still strictly positive as ˆGR,1 = GR,1. To realize the last inequality, simply note that as O1,i L1 | Z1 we have for all i {1, . . . , k1} that 2ℓG( GR,1, O1,i) = log E[(O1,i E[O1,i|L1])2] log E[(O1,i E[O1,i|Z1, L1])2] = log E[(O1,i E[O1,i|Z1])2] = 2ℓG( ˆGR,1, O1,i). (47) Now since all edges in GR,1 and ˆGR,1 coincide except the incoming edges into O1,1, . . . , O1,k1 we get that ℓG( GR,1) ℓG( ˆGR,1) = i=1 ℓG( GR,1, O1,i) ℓG( ˆGR,1, O1,i) 0, where the inequality follows from Equation (47). Now both ˆGR,1 and GR,1 have a childless node L1 with the same parent Z1, so we let GR,2 and GR,2 denote these two graphs where the node L1 and its incoming edge are deleted. This deletion does not change the graph scores, i.e., ℓG( ˆGR,1) ℓG(GR,1) = ℓG( GR,2) ℓG(GR,2). Now fix L2 LR,2 and define Y2, Z2, W2, D2 and O2 = {O2,1, . . . , O2,k2} accordingly. If either Y2 or W2 is non-empty, we use the score gap lower bound previously discussed in Lemma D.8 and Lemma D.9. 
If $Z_2$ is non-empty, we can repeat the above procedure and iteratively move edges and delete nodes until we arrive at the first $i \in \mathbb{N}$, with $\tilde{\mathcal{G}}_{R,i}$ and $\mathcal{G}_{R,i}$ being the iteratively reduced graphs and $L_i \in \mathcal{L}_{R,i}$, for which either

i) $Y_i$ or $W_i$ is non-empty. In this case, $\ell_G(\tilde{\mathcal{G}}) - \ell_G(\mathcal{G})$ is lower bounded by a bound of the same form as in Lemma D.8 or Lemma D.9. That is,
$$\ell_G(\tilde{\mathcal{G}}_{R,i}) - \ell_G(\mathcal{G}_{R,i}) \ge \ell_E(\tilde{\mathcal{G}}_{R,i}) - \ell_E(\mathcal{G}_{R,i}) \ge \min\Big\{\min_{(j \to i) \in \mathcal{E}} \ell_E(i \dashleftarrow j),\ \min_{(w,l,o) \in \Pi_W(\mathcal{G})} I(X_w; X_l \mid X_o)\Big\}.$$

ii) $Z_i$ is non-empty and there exists a $j \in \{1, \ldots, k_i\}$ such that $(Z_i \to O_{i,j})$ is an edge in $\mathcal{G}_{R,i}$. As previously argued, the score gap lower bound of Equation (46) applies. That is,
$$\ell_G(\tilde{\mathcal{G}}_{R,i}) - \ell_G(\mathcal{G}_{R,i}) \ge \ell_E(\tilde{\mathcal{G}}_{R,i}) - \ell_E(\mathcal{G}_{R,i}) \ge \min_{(w,l,o) \in \Pi_W(\mathcal{G})} I(X_w; X_l \mid X_o).$$

Note that whenever we do not meet scenario i) or ii), we remove a node in both graphs that is a sink node in the reduced true causal graph $\mathcal{G}_{R,i}$ and in the intermediate graph $\hat{\mathcal{G}}_{R,i}$. After at most $p - 2$ graph reduction iterations without encountering scenario i) or ii) we are left with two different graphs on two nodes, in which case the score gap is that of an edge reversal. We conclude that
$$\ell_G(\tilde{\mathcal{G}}) - \ell_G(\mathcal{G}) \ge \min\Big\{\min_{(i \to j) \in \mathcal{E}} \ell_E(j \dashleftarrow i),\ \min_{(w,l,o) \in \Pi_W(\mathcal{G})} I(X_w; X_l \mid X_o)\Big\}.$$
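The quantities entering these lower bounds can be made concrete in a small example. The following minimal sketch, which is not part of the paper, evaluates the conditional mutual information terms from the proof of Lemma D.8 in closed form for a linear Gaussian SCM on the chain $A \to O \to W \to L$; the coefficients and the helper function `gaussian_cmi` are hypothetical choices made here purely for illustration. It checks that $I(X_w; X_l \mid X_a) \ge I(X_w; X_l \mid X_o) > 0$, so that the lower bound over $\Pi_W(\mathcal{G})$ is strictly positive for this model.

```python
import numpy as np

# Hypothetical linear Gaussian SCM on the chain A -> O -> W -> L (unit noise variances),
# so that (w, l) = (W, L) is a true edge, a = A is a non-descendant of L other than W,
# and o = O in pa(W) blocks the path between A and L, as in the proof of Lemma D.8.
idx = {"A": 0, "O": 1, "W": 2, "L": 3}
B = np.zeros((4, 4))            # B[j, i] = coefficient of X_i in the assignment of X_j
B[idx["O"], idx["A"]] = 0.8     # O := 0.8 * A + N_O
B[idx["W"], idx["O"]] = 0.7     # W := 0.7 * O + N_W
B[idx["L"], idx["W"]] = 1.0     # L := 1.0 * W + N_L
I4 = np.eye(4)
Sigma = np.linalg.inv(I4 - B) @ np.linalg.inv(I4 - B).T   # covariance of X = (I - B)^{-1} N

def gaussian_cmi(S, x, y, z):
    """Conditional mutual information I(X_x; X_y | X_z) in nats for a Gaussian vector."""
    ld = lambda ix: np.linalg.slogdet(S[np.ix_(ix, ix)])[1]
    return 0.5 * (ld([x] + z) + ld([y] + z) - ld(z) - ld([x, y] + z))

w, l, a, o = idx["W"], idx["L"], idx["A"], idx["O"]
i_a = gaussian_cmi(Sigma, w, l, [a])   # approx 0.456 nats
i_o = gaussian_cmi(Sigma, w, l, [o])   # approx 0.347 nats
print(f"I(W; L | A) = {i_a:.3f},  I(W; L | O) = {i_o:.3f}")
# Matches the reduction step in the proof of Lemma D.8: I(W; L | A) >= I(W; L | O) > 0.
assert i_a >= i_o > 0.0
```

Replacing the chain by another directed tree only changes how the covariance matrix is assembled; the closed-form Gaussian expression for the conditional mutual information is unchanged.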
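The monotonicity step in Equation (47) can be checked in the same closed form. The sketch below (again with hypothetical coefficients, not from the paper) uses a true structure with $Z \to L$ and $Z \to O$, so that $O \perp\!\!\!\perp L \mid Z$, and compares the population log mean squared errors of regressing $O$ on $L$ (its parent in $\tilde{\mathcal{G}}_{R,1}$) and on $Z$ (its parent in $\hat{\mathcal{G}}_{R,1}$).

```python
import numpy as np

# Hypothetical linear Gaussian SCM with Z -> L and Z -> O (unit noise variances):
#   Z := N_Z,  L := 0.9 * Z + N_L,  O := 0.6 * Z + N_O,  so O _||_ L | Z.
var_Z = 1.0
var_L = 0.9**2 * var_Z + 1.0
var_O = 0.6**2 * var_Z + 1.0
cov_ZO = 0.6 * var_Z
cov_LO = 0.9 * 0.6 * var_Z                      # L and O are both children of Z

# Population MSE of the conditional-mean (here: linear) predictor of O from one regressor.
mse_O_given_L = var_O - cov_LO**2 / var_L       # E[(O - E[O|L])^2]
mse_O_given_Z = var_O - cov_ZO**2 / var_Z       # E[(O - E[O|Z])^2]

# Equation (47): E[(O - E[O|L])^2] >= E[(O - E[O|Z,L])^2] = E[(O - E[O|Z])^2] since O _||_ L | Z,
# so the local Gaussian score of O does not increase when its parent is moved from L to Z.
print(np.log(mse_O_given_L), np.log(mse_O_given_Z))
assert np.log(mse_O_given_L) >= np.log(mse_O_given_Z)
```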
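Similarly, the edge-reversal term $\ell_E(j \dashleftarrow i)$ appearing in the proof of Lemma D.9 and in the concluding bound of Theorem 17 admits a simple closed form when the pair of variables on the reversed edge is jointly Gaussian. The short calculation below is not part of the paper's argument; it only uses that, for a jointly Gaussian pair $(L, Y)$ with correlation $\rho$, the residuals $Y - E[Y \mid L]$ and $L - E[L \mid Y]$ are Gaussian with variances $\sigma_Y^2(1-\rho^2)$ and $\sigma_L^2(1-\rho^2)$:
$$\ell_E(Y \dashleftarrow L) = h(Y - E[Y \mid L]) + h(L) - h(L - E[L \mid Y]) - h(Y) = \tfrac{1}{2}\log\big(2\pi e\,\sigma_Y^2(1-\rho^2)\big) + \tfrac{1}{2}\log\big(2\pi e\,\sigma_L^2\big) - \tfrac{1}{2}\log\big(2\pi e\,\sigma_L^2(1-\rho^2)\big) - \tfrac{1}{2}\log\big(2\pi e\,\sigma_Y^2\big) = 0.$$
Hence, for a linear Gaussian cause-effect pair the edge-reversal term vanishes and the lower bound of Theorem 17 becomes uninformative, in line with the fact that the edge direction of a bivariate linear Gaussian model is not identifiable from the observational distribution; nonlinear conditional means or non-Gaussian residuals are needed for the bound to be strictly positive.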