# Structure Learning for Directed Trees

Journal of Machine Learning Research 23 (2022) 1-97. Submitted 9/21; Revised 3/22; Published 5/22.

Martin Emil Jakobsen (m.jakobsen@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
Rajen D. Shah (r.shah@statslab.cam.ac.uk), Statistical Laboratory, University of Cambridge, Cambridge, UK
Peter Bühlmann (buhlmann@stat.math.ethz.ch), Seminar for Statistics, ETH Zurich, Zurich, Switzerland
Jonas Peters (jonas.peters@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

Editor: Aapo Hyvärinen

Abstract

Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large, continuous, and nonlinear models, these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method based on Chu-Liu-Edmonds' algorithm, which we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce two methods for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods.

Keywords: causality, restricted causal models, structure learning, directed trees, hypothesis testing.

(c) 2022 Martin Emil Jakobsen, Rajen Shah, Peter Bühlmann and Jonas Peters. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1159.html.

1. Introduction

Learning the underlying causal structure of a stochastic system involving the random vector X = (X1, . . . , Xp) is an important problem in economics, industry, and science. Knowing the causal structure allows researchers to understand whether Xi causes Xj (or vice versa) and how a system reacts under an intervention. However, it is not generally possible to learn the causal structure (or parts thereof) from the observational data of a system alone. Without further restrictions on the system of interest, there might exist another system with a different causal structure inducing the same observational distribution, i.e., the structure might not be identifiable from observed data. Common structure learning methods using observational data are constraint-based (e.g., Pearl, 2009; Spirtes et al., 2000), score-based (e.g., Chickering, 2002), or a mix thereof (e.g., Nandy et al., 2018). Each of these approaches requires different assumptions to ensure identifiability of the causal structure and consistency of the approach. In structural causal models, one assumes that there are (causal) functions f1, . . .
, fp such that for all 1 i p : Xi := fi(XPA(i), Ni), for subsets PA(i) {1, ..., p} and jointly independent noise variables N = (N1, ..., Np) PN (see Definition 1 for a precise definition including further restrictions). The causal graph is constructed as follows: for each variable Xi one adds directed edges from its direct causes or parents PA(i) into i. For such models, system assumptions concerning the causal functions can make the causal graph identified from the observational distribution. Specific assumptions that guarantee identifiability of the causal graph have been studied for, e.g., linear additive Gaussian noise models with equal noise variance (Peters and B uhlmann, 2014), linear additive non-Gaussian noise models (Shimizu et al., 2006), nonlinear additive noise models (Hoyer et al., 2008; Peters et al., 2014), post-nonlinear additive noise models (Zhang and Hyv arinen, 2009), partially-linear additive Gaussian noise models (Rothenh ausler et al., 2018) and discrete models (Peters et al., 2011). Score-based structure learning usually starts with a function ℓassigning a population score to causal structures. Depending on the assumed model class, this function is minimized by the true structure. For example, when considering directed acylic graph (DAGs), the true causal DAG G satisfy G arg min G : G is a DAG ℓ( G). (1) The idea is then to estimate the score from a finite sample and minimize the empirical score over all DAGs. As the cardinality of the space of all DAGs grows super-exponentially in the number of nodes p (Chickering, 2002), brute-force minimization becomes computationally infeasible even for moderately large systems.1 For linear additive Gaussian noise models, assuming the Markov conditions and faithfulness, one can recover the correct Markov equivalence class (MEC) of G, which can be represented by a unique completed partially directed acyclic graph (CPDAG) (Pearl, 2009). The optimization can be done greedily over MECs with greedy equivalent search (GES, Chickering, 2002) or over DAGs (Tsamardinos et al., 2006) and in the former case, the 1. For example, there are over 10275 distinct directed acyclic graphs over 40 nodes (Sloane, 2021). Structure Learning for Directed Trees method is known to be consistent. More specifically, the output of GES search is not guaranteed, for a fixed sample size, to solve the empirical version of Equation (1) but it solves the problem with probability tending to one in the large sample limit. Chickering (1996) showed that, in general, solving the problem in Equation (1) is an NP-hard problem, even if we restrict the search to MECs for structures with fixed causal indegree of K > 2. Several exact exponential runtime algorithms have been proposed, for example, A search (Yuan et al., 2011; Yuan and Malone, 2013) and CPBayes (van Beek and Hoffmann, 2015) for discrete systems, algorithms based on integer linear programming (Jaakkola et al., 2010; Cussens et al., 2017; Cussens, 2011), and algorithms based on dynamic programming (Koivisto and Sood, 2004; Silander and Myllym aki, 2006; Parviainen and Koivisto, 2009). In the nonlinear additive Gaussian noise case, B uhlmann et al. (2014) show that nonparametric maximum-likelihood estimation consistently estimates the correct causal order. However, the greedy search algorithm minimizing the score function does not come with any theoretical guarantees. 
Other heuristic approaches (for discrete or linear Gaussian systems) include acyclic selection ordering-based search (Scanagatta et al., 2015), memetic insert neighbourhood ordering-based search (Lee and Beek, 2017), and max-min hill-climb (Tsamardinos et al., 2006). Recently, methods have been proposed that perform continuous, non-convex optimization (Zheng et al., 2018) but such methods are without guarantees and it is currently debated whether they exploit some artifacts in simulated data (Reisach et al., 2021). Thus, for nonlinear models, there is currently no score-based method that provably guarantees recovery of the true causal graph with high probability. In this paper we focus on models of reduced complexity, namely models with directed trees as causal graphs. This complexity reduction allow for polynomial runtime minimization of the score-function using the Chu Liu Edmonds algorithm (proposed independently by Chu and Liu, 1965; Edmonds, 1967) and it allows for the derivation of hypothesis testing theory. As such the structure learning problem remains computationally feasible even for very large systems. Our method is called causal additive trees (CAT). The method is easy to implement and consists of two steps. In the first step, we employ user-specified (univariate) regression methods to estimate the conditional expectations x 7 E[Xi|Xj = x] for all i = j. We then use these to construct edge weights as inputs to the Chu Liu Edmonds algorithm. This algorithm then outputs a directed tree with minimal edge weight, corresponding to a directed tree minimizing the score in Equation (1). 1.1 Contributions We now highlight four main contributions of the paper: (i) Computational feasibility: Assuming an identifiable model class, such as additive noise, allows us to infer the causal DAG by minimizing Equation (1) for a suitable score function. However, even for trees, the cardinality of the search space grows super-exponentially in the number of variables p. Hence, brute-force minimization (exhaustive search) in Equation (1) remains computationally infeasible for large systems. We propose the score-based method CAT (based on Chu Liu Edmonds algorithm) and prove that it recovers the causal tree with a run-time complexity of O(p2). This method can be useful even when not restricting onself to the class of directed trees: e.g., when using a heuristic method such as Jakobsen, Shah, B uhlmann and Peters greedy search for aiming to find an optimal scoring DAG, one can use the score of the optimal scoring tree as a sanity check or the corresponding tree for initialization. (ii) Consistency: We prove that CAT is pointwise consistent in an identified additive Gaussian noise setup. That is, we recover the causal directed tree with probability tending to one as the sample size increases. Consistency only requires that the regression methods for estimating the conditional mean functions have mean squared prediction error converging to zero in probability. This property that is satisfied by many nonparametric regression methods such as nearest neighbors, neural networks, or kernel methods (see e.g. Gy orfiet al., 2002). Moreover, the vanishing estimation error is only required for causal edges for which the conditional means coincide with the causal functions. We also derive sufficient conditions that ensure consistency in an asymptotic setup with vanishing identifiability. 
Specifically, we show that consistency is retained even when the identifiability gap decreases at a rate qn with q 1 n = o( n) as long as the conditional expectation mean squared prediction error corresponding to the causal edges vanishes at a rate op(qn). (iii) Hypothesis testing: We provide two algorithms for performing hypothesis tests concerning the presence and absence of substructures, such as particular edges, in the true causal graph. The type I error is controlled asymptotically when the mean squared prediction error of the regression corresponding to the true causal edges decays at a relatively slow op(n 1/2) rate. The tests are valid post-selection, that is, the hypotheses to be tested may be chosen after the graph has been estimated, and when multiple tests are performed, the family-wise error rate is controlled for any number of tests. Furthermore, one of the two proposed testing procedures is valid in the non-identified setting. (iv) Identifiability analysis: We analyze the identifiability gap, that is, the smallest population score difference between an alternative graph and the causal graph. The reduced system complexity, due to the restriction to trees, allows us to derive simple yet informative lower bounds. For additive Gaussian noise models, for example, the lower bound can be computed using only local properties of the underlying model: it is based on a first term that considers the minimal score gap between individual edge reversals and a second term involving the minimal mutual information of two neighboring nodes, when conditioning on another neighbor of the parent node. 1.2 Related Constraint-based Approaches As an alternative to score-based methods, constraint-based methods such as PC or FCI (Spirtes et al., 2000) test for conditional independences statements in PX and use these results to infer (parts of) the causal structure. Such methods usually assume that PX is both Markov and faithful with respect to the causal graph G. Under these assumptions, the Markov equivalence class of the causal graph G is identified. In a jointly Gaussian setting (e.g. linear additive Gaussian noise models), consistency of constraint-based approaches relies on faithfulness, whereas uniform consistency requires strong faithfulness (see, e.g., Zhang and Spirtes, 2002; Kalisch and B uhlman, 2007) a condition that has been shown to be strong (Uhler et al., 2013). In nonlinear settings, corresponding guarantees do not exist. This may at least partially be due to the fact that conditional independence testing is known to be a hard statistical problem (Shah and Peters, 2020). Structure Learning for Directed Trees Constraint-based methods have also been studied for polytrees. A polytree is a DAG whose undirected graph is a tree. Polytrees, unlike directed trees, allow for multiple root nodes as well as nodes with multiple parents. Rebane and Pearl (1987), inspired by the work of Chow and Liu (1968), propose a constraint-based structure learning method for polytrees over discrete variables that can identify the correct skeleton and causal basins, structures constructed from nodes with at least two parents. More precisely, the skeleton is determined by the maximum weight spanning tree (MWST) algorithm with mutual information measure weights, while the directionality of edges is inferred by conditional independence constraints implied by the observed distribution. 
In the case of causal trees this constraintbased structure learning method cannot direct any edges because causal basins do not exist (Rebane and Pearl, 1987). Dominguez et al. (2013) and Ouerd (2000) extend the Rebane and Pearl (1987) algorithm for causal discovery to multivariate Gaussian polytree distributions. Friedman et al. (1997) propose a similar algorithm to learn tree Bayesian networks by finding a MWST with mutual information weights. This recovers the skeleton of the causal graph, after which an arbitrary root node is selected and all edges are oriented away from said root node. As such, the method of Friedman et al. (1997) is only guaranteed to recover a directed tree that is Markov equivalent to the causal directed tree. In this work, we employ Chu Liu Edmonds algorithm, a directed analogue of the MWST algorithm, to not only recover the skeleton but also the direction of all edges in the causal graph. This is possible since we consider restricted causal models, e.g., nonlinear additive Gaussian noise models. More specifically, these restricted causal models allow us to define edge weights that, unlike the mutual information weights, preserve directionality information. In fact, when discarding information that allows us to infer directionality of the edges, one recovers the mutual information weights of Rebane and Pearl (1987), see Remark 1 in Appendix B for details. 1.3 Organization of the Paper In Section 2, we define the setup and relevant score functions. We further strengthen existing identifiability results for nonlinear additive noise models. In Section 3, we propose CAT, an algorithm solving the score-based structure learning problem that is based on Chu Liu Edmonds algorithm. We prove consistency of CAT for a fixed distribution and for a setup with vanishing identifiability. In Section 4, we provide results on asymptotic normality of the scores, construct confidence regions and propose feasible testing procedures. Section 5, we analyzes the identifiability gap. Section 6 shows the results of various simulation experiments. All proofs can be found in Appendix D. 2. Score-based Learning and Identifiability of Trees In the remainder of this work we use of the following graph terminology (a more detailed introduction can be found in Appendix A, see also Koller and Friedman, 2009). A directed graph G = (V, E) consists of p N>0 vertices (or nodes) V = {1, . . . , p} and a collection of directed edges E {(i j) (i, j) : i, j V, i = j}. A directed acyclic graph (DAG) is a directed graph that does not contain any directed cycles. A directed tree is a connected DAG in which all nodes have at most one parent. The unique node of a directed tree G Jakobsen, Shah, B uhlmann and Peters with no parents is called the root node and is denoted by rt(G). We let Tp denote the set of directed trees over p N>0 nodes. 2.1 Identifiability of Causal Additive Tree Models We now revisit and strengthen known identifiability results on restricted structural causal models. Consider a distribution that is induced by a structural causal model (SCM) with additive noise. Then, there are only special cases (such as linear additive Gaussian noise models) for which alternative models with a different causal structure exist that generate the same distribution (see Peters et al., 2017, for an overview). To state and strengthen these results formally, we introduce the following notation. 
For any k N we define the following classes of functions from R to R: M denotes all measurable functions, Dk denotes the set of all k times differentiable functions and Ck denotes the k times continuously differentiable functions. We let P denote the set of mean zero probability measures on R that have a density with respect to Lebesgue measure. P+ P denotes the subset for which a density is strictly positive. For any function class F {f|f : R R}, PF P denotes the subset with a density function in F. As a special case, we let PG P+C := P+ PC denote the subset of Gaussian probability measures. For any set P of probability measures, Pp denotes all p-dimensional product measures on Rp with marginals in P. We now define structural causal additive tree models (or causal additive tree models, for short) as SCMs with a tree structure.2 Definition 1 (Structural causal additive tree models) Consider a class Tp Mp Pp. Any tuple (G, (fi), PN) Tp Mp Pp induces a structural causal model over X = (X1, . . . , Xp) given by the following structural assignments Xi := fi(Xpa G(i)) + Ni, for all 1 i p, where frt(G) 0 and N = (N1, . . . , Np) PN, which we call a structural causal additive tree model. By slight abuse of notation, we write Q Tp Mp Pp for a probability distribution that is induced by a structural causal additive tree model. Furthermore, we define the set of restricted structural causal additive tree models. We will see later that for these models, the causal graph is identifiable from the observable distribution of the system. When the causal graph of a sufficiently nice additive noise SCM is not identifiable, then certain differential equations must hold (see the proof of Proposition 4 for details). The definition of restricted structural causal additive tree models ensures that this does not happen. Definition 2 (Restricted structural causal additive tree models) The collection of restricted structural causal additive tree models ΘR Tp Dp 3 Pp +C3 is given by all models θ = (G, (fi), PN) Tp Dp 3 Pp +C3 satisfying the following conditions for all i {1, . . . , p} \ {rt(G)}: (i) fi is nowhere constant, i.e., it is not constant on any non-empty open set, and 2. This model class comes with the strong assumption on additive noise, which excludes certain types of hidden confounding, for example. Structure Learning for Directed Trees (ii) the induced log-density ξ of Xpa G(i), noise log-density ν of Ni and causal function fi are such that there exists x, y R with ν (y fi(x))f i(x) = 0 such that ξ = ξ f i f i ν f i ν 2ν f i f i + ν f i + ν ν f i f i ν ν (f i )2 where the derivatives of ξ, ν and fi are evaluated in x, y fi(x) and x, respectively. The following lemma, due to Hoyer et al. (2008), shows that for causal additive tree models with Gaussian noise, the differential equation constraints of Definition 2 simplify.3 Lemma 3 Let θ = (G, (fi), PN) Tp Dp 3 Pp G. Assume that for all i {1, . . . , p}\{rt(G)} the following two conditions hold (a) fi is nowhere constant and (b) fi is not linear. Then, θ ΘR. Existing identifiability results for causal graphs in restricted SCMs (Hoyer et al., 2008; Peters et al., 2014) are stated and proven in terms of the ability to distinguish the induced distributions of two restricted structural causal models: For all θ = (G, . . .) ΘR and θ = ( G, . . .) ΘR, if G = G, then L(Xθ) = L(X θ) (where L denotes the distribution of a random variable), that is, Xθ and X θ do not have the same distribution. 
We now prove a stronger identifiability result that does not assume that θ is a restricted causal model. Proposition 4 (Identifiability of causal additive tree models) Suppose that Xθ and X θ are generated by the SCMs θ = (G, (fi), PN) ΘR Tp Dp 3 Pp +C3 and θ = ( G, ( fi), PN) Tp Dp 1 Pp C0, respectively. It holds that L(Xθ) = L(X θ) = G = G. We prove Proposition 4 using the techniques of Peters et al. (2014). While we prove the statement only for restricted causal additive tree models, which suffices for this work, we conjecture that a similar extension holds for restricted structural causal DAG models. The extension of Proposition 4 is important for the following reason. Given a finite data set, practical methods usually assume that the true distribution is induced by an underlying restricted SCM. One can then fit different causal structures and output the structure that fits the data best. The above extension accounts for the fact that regression methods hardly represent all such restrictions: e.g., most nonlinear regression techniques can also fit linear models. 2.2 Score Functions We now define population score functions which are later used to recover the causal tree. We henceforth assume that X Rp is a random vector with distribution PX generated by a restricted causal additive tree model θ = (G, (fi), PN) ΘR Tp Dp 3 Pp +C3 with G = (V, E) Tp such that E X 2 2 < . Thus, G denotes the causal tree. We use G Tp to denote an arbitrary, different (directed) tree. For the remainder of this paper, we assume 3. For completeness, we include the proof of Lemma 3 in Appendix D, using the approach of Zhang and Hyv arinen (2009) but expressed in our notation. Jakobsen, Shah, B uhlmann and Peters that for any i = j it holds that Xi E[Xi|Xj] has a density with respect to Lebesgue measure.4 We often refer to one of the following two scenarios: either, (i), we have limited a priori information that PN Pp +C3, or, (ii), we know that the noise innovations are Gaussian, that is, PN Pp G. Definition 5 For any graph G Tp we define for each node i V the (i) local Gaussian score as ℓG( G, i) := log Var Xi E h Xi|Xpa G(i) i /2, (ii) local entropy score as ℓE( G, i) := h Xi E h Xi|Xpa G(i) i , (iii) local conditional entropy score as ℓCE( G, i) := h Xi|Xpa G(i) . Here, we use the convention that E(Xi| ) = 0 and h(Xi| ) = h(Xi); the functions h( ), h( | ), and h( , ) (used below) denote the differential entropy, conditional entropy, and cross entropy, respectively. The Gaussian, entropy and conditional entropy score of G are, respectively, given by the sum of local scores: i=1 ℓG( G, i), ℓE( G) := i=1 ℓE( G, i), ℓCE( G) := i=1 ℓCE( G, i). (See Polyanskiy and Wu (2019) or Cover and Thomas (2006) for the basic informationtheoretic concepts used in this paper.) Similar scores have been considered by B uhlmann et al. (2014) and Mooij et al. (2016), for example. For linear additive Gaussian noise systems, the Gaussian score of Definition 5 is proportional to the large sample limit of the Gaussian log-likelihood score function commonly used in for Bayesian network learning (see, e.g., Koller and Friedman, 2009). The following lemma shows that the Gaussian score of the graph G Tp arises naturally as a translated infimum cross entropy between PX and all Q induced by causal additive tree models with Gaussian noise. Similarly, the entropy score can be seen as an infimum cross entropy between PX and all Q induced by another class of SCMs. 
Lemma 6 For any G Tp it holds that ℓG( G) = inf Q { G} Dp 1 Pp G h(PX, Q) p log( Furthermore, with F( G) := (Fi( G))1 i p, where Fi( G) := {x 7 E[Xi|Xpa G(i) = x]} for all 1 i p, it holds that ℓE( G) = inf Q { G} F( G) Pp h(PX, Q). 4. This ensures that the entropy score function introduced in Definition 5 below is well-defined and that the analysis of the identifiability gap in Section 5 is valid. Structure Learning for Directed Trees Score-based methods identify the underlying structure by evaluating the score functions (or estimates thereof) on different graphs and choosing the best scoring graph. The difference between the score ℓ (G) of the true graph and the score ℓ ( G) of the best scoring alternative graph G is an important property of the problem: e.g., if it would be zero, we could not identify the true graph from the scores. We, therefore, refer to expressions of the form min G Tp\{G} ℓ ( G) ℓ (G) as the identifiability gap. In the remainder of this paper, we refer to strict positivity of the identifiability gap as Assumption 1. Assumption 1 If θ ΘR Tp Dp 3 Pp G or θ ΘR Tp Dp 3 Pp +C3 it holds that min G Tp\{G} ℓG( G) ℓG(G) > 0 or min G Tp\{G} ℓE( G) ℓE(G) > 0, (3) respectively. Assumption 1 does not trivially follow from the results further above. By arguments similar to those in Lemma 6 we have that, if the true data-generating model is a restricted causal additive tree model with Gaussian noise, θ ΘR Tp Dp 3 Pp G, then ℓG(G) = h(PX) p log( 2πe). Hence, the Gaussian score gap between G and the causal graph G equals ℓG( G) ℓG(G) = inf Q { G} Dp 1 Pp G h(PX, Q) h(PX) = inf Q { G} Dp 1 Pp G DKL(PX Q), where DKL denotes the Kullback-Leibler divergence measure. Proposition 4 implies that G = G, Q { G} Dp 1 Pp G : DKL(PX Q) > 0. However, this does not immediately imply that the identifiability gap (where we take the infimum over such Q) is strictly positive. Similar considerations5 hold for the entropy score gap ℓE( G) ℓE(G) = inf Q { G} F( G) Pp DKL(PX Q). In Section 5 we derive informative lower bounds on the Gaussian and entropy identifiability gaps (i.e., the infimum KL-divergence) of Equation (3). It is possible to enforce Assumption 1 indirectly by the assumptions and modifications detailed in the following lemma. Lemma 7 Assumption 1 holds if one of the following conditions is satisfied. (a) We have a restricted causal additive tree model with Gaussian noise θ ΘR Tp Dp 3 Pp G and for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version. (b) We have a restricted causal additive tree model with Gaussian noise θ ΘR Tp Dp 3 Pp G and for all 1 i p it holds that the causal function fi is contained within a function class Fi D1 which satisfies arg minf Fi E[(Xi f (Xj))2] Fi for all j = i, and we consider a modified Gaussian score function ℓG.mod : Tp R with local score given by ℓG.mod( G, i) := log(min f Fi E[(Xi f(Xpa G(i)))2])/2. 5. In fact, Proposition 4 does not immediately imply that DKL(PX Q) > 0 for Q { G} F( G) Pp as it does not necessarily hold that the causal functions in F( G) are differentiable or that the noise innovation densities in Pp are continuous. Jakobsen, Shah, B uhlmann and Peters (c) We have a restricted causal additive tree model θ ΘR Tp Dp 3 Pp +C3, for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version and for all i = j it holds that Xi E[Xi|Xj] has a continuous density. The modified Gaussian score function and restrictions of condition (b) in Lemma 7 coincides with the working conditions of B uhlmann et al. (2014). 
Alternative information-theoretic conditions guaranteeing that Assumption 1 holds are derived in Section 5. If Assumption 1 is satisfied, then we can use the score functions to identify the true causal graph of a restricted structural model: in the Gaussian noise setting, for example, we have
$$G = \arg\min_{\tilde G \in \mathcal{T}_p} \ell_G(\tilde G). \tag{4}$$
In practice, we consider estimates of the above quantities and optimize the corresponding empirical loss function. Solving Equation (4) (or its empirical counterpart) using exhaustive search is computationally intractable already for moderately large choices of p.6 We now introduce CAT, a computationally efficient method that solves the optimization exactly.

6. In the context of linear Gaussian noise models, Chickering (2002) proves consistency of greedy equivalence search towards the correct Markov equivalence class. This, however, does not imply that the optimization problem in Equation (4) is solved: for a given sample, the method is not guaranteed to find the optimal scoring graph (but the output will converge to the correct graph).

3. Causal Additive Trees (CAT)

We introduce the population version of our algorithm CAT in Section 3.1 and discuss its finite sample version and asymptotic properties in Sections 3.2 and 3.3.

3.1 An Oracle Algorithm

As in the case of DAGs, the problem in Equation (4) is a combinatorial optimization problem for which the cardinality of the search space grows super-exponentially with p. Indeed, the number of undirected trees on p labelled nodes is $p^{p-2}$ (Cayley, 1889), and therefore $p^{p-1}$ is the corresponding number of labelled directed trees. For the class of DAGs (which includes directed trees), existing structure learning methods such as Bühlmann et al. (2014) propose a greedy search technique that iteratively selects the lowest scoring directed edge under the constraint that no cycle is introduced in the resulting graph. In general, greedy search procedures do not come with any guarantees, and there are indeed situations in which they fail. By exploiting the assumption of a tree structure, we will see that the optimization problem of Equation (4) can be solved computationally efficiently, without the need for heuristic optimization techniques. Provided with a connected directed graph with edge weights, Chu-Liu-Edmonds' algorithm finds a minimum edge weight directed spanning tree, provided that such a directed tree exists. That is, for a connected directed graph $H = (V, E_H)$ on the nodes $V = \{1, \dots, p\}$ with edge weights $w := \{w_{ji} : (j \to i) \in E_H\}$, Chu-Liu-Edmonds' algorithm recovers a minimum edge weight directed spanning tree (MWDST) subgraph of H,
$$\arg\min_{\tilde G = (V, \tilde E) \in \mathcal{T}_p^H} \;\sum_{(j \to i) \in \tilde E} w_{ji},$$
where $\mathcal{T}_p^H$ denotes the set of all directed spanning trees of H. The runtime of the original algorithms of Chu and Liu (1965) and Edmonds (1967) is $O(|E_H|\,p) \subseteq O(p^3)$. Karp (1971) presented an alternative proof of the correctness of the algorithm of Edmonds (1967). Tarjan (1977) devised a modification (corrected by Camerini et al., 1979) with runtime $O(\min\{|E_H|\log(p), p^2\})$.7 Gabow et al. (1986) devised yet another modification with runtime $O(p\log p + |E_H|)$ and noted that no further improvements to the algorithm can be made (since it uses only binary decisions and can be used to sort p numbers).
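As a small illustration of this subproblem, the following sketch (not the authors' code) recovers a MWDST from toy edge weights using the NetworkX implementation of Edmonds' algorithm mentioned below; the weight values are arbitrary placeholders.

```python
import networkx as nx

# Toy edge weights w[(j, i)] on a fully connected directed graph over 4 nodes;
# in CAT these would be the Gaussian or entropy edge weights defined below.
w = {(0, 1): -0.9, (1, 0): -0.1, (0, 2): -0.2, (2, 0): -0.3,
     (1, 2): -1.1, (2, 1): -0.4, (0, 3): -0.5, (3, 0): -0.1,
     (1, 3): -0.2, (3, 1): -0.3, (2, 3): -1.0, (3, 2): -0.2}

H = nx.DiGraph()
for (j, i), wji in w.items():
    H.add_edge(j, i, weight=wji)

# Edmonds' algorithm: minimum edge weight directed spanning tree (arborescence) of H.
mwdst = nx.minimum_spanning_arborescence(H, attr="weight")
print(sorted(mwdst.edges()))
```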
In our experiments, we use the C++ implementation of Tarjan's modification by Tofigh and Sjölund (2007), which is contained in the R package RBGL (Carey et al., 2021), and the Python implementation of Edmonds' version from the Python package NetworkX (Hagberg et al., 2022). The causal graph recovery problem in Equation (4) is equivalently solved by finding a minimum edge weight directed tree, i.e., a minimum edge weight directed spanning tree of the fully connected graph on the nodes V. For example, finding the minimum of the Gaussian score function is equivalent to minimizing a translated version of the Gaussian score function,
$$\arg\min_{\tilde G \in \mathcal{T}_p} \ell_G(\tilde G) = \arg\min_{\tilde G \in \mathcal{T}_p} \sum_{i=1}^p \left\{ \tfrac{1}{2}\log\big(\mathrm{Var}(X_i - E[X_i \mid X_{\mathrm{pa}_{\tilde G}(i)}])\big) - \tfrac{1}{2}\log\big(\mathrm{Var}(X_i)\big) \right\} = \arg\min_{\tilde G \in \mathcal{T}_p} \sum_{i=1}^p \tfrac{1}{2}\log\frac{\mathrm{Var}(X_i - E[X_i \mid X_{\mathrm{pa}_{\tilde G}(i)}])}{\mathrm{Var}(X_i)}. \tag{5}$$
Since the summand for the root node in Equation (5) equals zero, we only need to sum over all nodes with an incoming edge in $\tilde G$. Now define the Gaussian edge weights $w^G := (w^G_{ji})_{j\neq i}$ by
$$w^G_{ji} := \tfrac{1}{2}\log\frac{\mathrm{Var}(X_i - E[X_i \mid X_j])}{\mathrm{Var}(X_i)}, \quad \text{for all } j \neq i.$$
Hence, for a causal additive tree model with Gaussian noise satisfying Assumption 1, the causal directed tree is given by the MWDST with respect to the Gaussian edge weights,
$$G = \arg\min_{\tilde G \in \mathcal{T}_p} \ell_G(\tilde G) = \arg\min_{\tilde G = (V,\tilde E) \in \mathcal{T}_p} \sum_{(j\to i)\in \tilde E} w^G_{ji}.$$
Similarly, the minimum of the entropy score function is given by the MWDST with respect to the entropy edge weights $w^E := (w^E_{ji})_{j\neq i}$ given by
$$w^E_{ji} := h(X_i - E[X_i \mid X_j]) - h(X_i), \quad \text{for all } j \neq i.$$
We will henceforth denote the methods that apply Chu-Liu-Edmonds' algorithm to find the MWDST with respect to the Gaussian and entropy edge weights by CAT.G and CAT.E, respectively.

7. The algorithms presented in both Edmonds (1967) and Tarjan (1977) find minimum branchings of H, i.e., directed forest spanning subgraphs of H with minimum edge weight. Note that the MWDST problem is invariant to identical translation of all edge weights. If H is a fully connected graph and we translate all edge weights $w'_{ji} := w_{ji} - \varepsilon \max\{w_{ji} : j \neq i\}$ for $\varepsilon > 1$, then a minimum branching using the edge weights $(w'_{ji})$ is a MWDST subgraph of H. For testing purposes, we also need to be able to find MWDST subgraphs of non-fully connected graphs H; hence, as noted by Edmonds (1967), if we translate all edge weights $w'_{ji} := w_{ji} - \sum_{j\neq i}|w_{ji}|$, then a minimum branching using the edge weights $(w'_{ji})$ is a MWDST subgraph of H.

3.2 Finite Sample Algorithm

Given an $n \times p$ data matrix $\mathbf{X}_n$, representing n i.i.d. copies of X = (X1, . . . , Xp), we estimate the edge weights by simple plug-in estimators. Let us denote the conditional expectation function and its estimate by
$$\varphi_{ji}(x) := E[X_i \mid X_j = x], \qquad \hat\varphi_{ji}(x) := \hat E[X_i \mid X_j = x], \tag{7}$$
for all $j \neq i$. The empirical Gaussian edge weights $\hat w^G = (\hat w^G_{ji})_{j\neq i}$ are then given by
$$\hat w^G_{ji} := \tfrac{1}{2}\log\frac{\widehat{\mathrm{Var}}(X_i - \hat\varphi_{ji}(X_j))}{\widehat{\mathrm{Var}}(X_i)}, \quad \text{for all } j \neq i, \tag{8}$$
where $\widehat{\mathrm{Var}}(\cdot)$ denotes a variance estimator using the sample $\mathbf{X}_n$. We now propose to combine the Chu-Liu-Edmonds algorithm described above with the Gaussian score, as detailed in Algorithm 1. It is also possible to combine CAT with standard pruning techniques (see, e.g., Bühlmann et al., 2014) that, e.g., based on approximate p-values, remove insignificant edges and output directed forests. An R implementation of CAT with options for cross-fitting and pruning is available on GitHub.8

8. https://github.com/MartinEmilJakobsen/CAT

Algorithm 1 Causal additive trees (CAT)
1: procedure CAT($\mathbf{X}_n$, regression method)
2:   Run the regression method to obtain $\hat\varphi_{ji}$ for all $j \neq i$.
3:   Compute the empirical edge weights $\hat w^G$, see Equation (8).
4:   Apply Chu-Liu-Edmonds' algorithm to find the MWDST with respect to $\hat w^G$.
5:   return MWDST $\hat G$.
6: end procedure
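To make the steps concrete, here is a minimal Python sketch of Algorithm 1 (an illustration only, not the authors' R implementation; the k-nearest-neighbour regression, the simulated toy data, and all constants are placeholder choices).

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import KNeighborsRegressor

def cat_gaussian(X, n_neighbors=10):
    """Simplified sketch of Algorithm 1 with empirical Gaussian edge weights.

    X: (n, p) data matrix of i.i.d. observations. Returns a networkx.DiGraph."""
    n, p = X.shape
    H = nx.DiGraph()
    for j in range(p):
        for i in range(p):
            if i == j:
                continue
            # Step 2: regression estimate of phi_ji(x) = E[X_i | X_j = x].
            reg = KNeighborsRegressor(n_neighbors=n_neighbors)
            reg.fit(X[:, [j]], X[:, i])
            resid = X[:, i] - reg.predict(X[:, [j]])
            # Step 3: empirical Gaussian edge weight, Equation (8).
            H.add_edge(j, i, weight=0.5 * np.log(resid.var() / X[:, i].var()))
    # Step 4: Chu-Liu-Edmonds / Edmonds' algorithm for the MWDST.
    return nx.minimum_spanning_arborescence(H, attr="weight")

# Toy usage: data simulated from the causal tree 0 -> 1 -> 2 with additive noise.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = np.sin(3 * x0) + 0.3 * rng.normal(size=n)
x2 = x1 ** 2 + 0.3 * rng.normal(size=n)
G_hat = cat_gaussian(np.column_stack([x0, x1, x2]))
print(sorted(G_hat.edges()))  # ideally [(0, 1), (1, 2)]
```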
By default, we suggest using the empirical Gaussian edge weights as described in Algorithm 1. However, it is also possible to run Chu-Liu-Edmonds' algorithm on the empirical entropy edge weights $\hat w^E = (\hat w^E_{ji})_{j\neq i}$ given by
$$\hat w^E_{ji} := \hat h(X_i - \hat\varphi_{ji}(X_j)) - \hat h(X_i), \quad \text{for all } j \neq i,$$
where $\hat h(\cdot)$ denotes a user-specified entropy estimator using the observed data $\mathbf{X}_n$. Estimating differential entropy is a difficult statistical problem, but the simulation experiments in Section 6 demonstrate that it can be beneficial to use the estimated entropy edge weights when the additive noise distributions are highly non-Gaussian. Under suitable conditions on the (possibly nonparametric) regression technique, we now show that the proposed algorithm consistently recovers the true causal graph of a causal additive tree model with Gaussian noise using the empirical Gaussian edge weights.

3.3 Consistency

We study a version of the CAT.G algorithm applied to a causal additive tree model with Gaussian noise where the regression estimates are trained on auxiliary data, simplifying the theoretical analysis. We believe that consistency without sample splitting holds but may require somewhat stronger conditions (in the experimental section, we do not use sample splitting). As such, we view the sample splitting only as a theoretical device for simplifying proofs and do not recommend it in practical applications. For each n we let $\mathbf{X}_n = ((X_{1,i})_{1\le i\le p}, \dots, (X_{n,i})_{1\le i\le p})$ and $\tilde{\mathbf{X}}_n = ((\tilde X_{1,i})_{1\le i\le p}, \dots, (\tilde X_{n,i})_{1\le i\le p})$ denote independent datasets, each consisting of n i.i.d. random variables with distribution identical to that of $X = (X_1, \dots, X_p) \in \mathbb{R}^p$. We suppose that the regression estimates $\hat\varphi_{ji}$ have been trained on $\tilde{\mathbf{X}}_n$ and then compute the edge weights using $\mathbf{X}_n$ as in step 3 of Algorithm 1:
$$\hat w^G_{ji} := \frac{1}{2}\log\frac{\frac{1}{n}\sum_{k=1}^n \big(X_{k,i} - \hat\varphi_{ji}(X_{k,j})\big)^2}{\frac{1}{n}\sum_{k=1}^n X_{k,i}^2 - \big(\frac{1}{n}\sum_{k=1}^n X_{k,i}\big)^2}. \tag{9}$$
The consistency results may be extended to cross-fitted edge weight estimators formed as an average of estimators of the form in (9) with the roles of the $\mathbf{X}_n$ and $\tilde{\mathbf{X}}_n$ samples interchanged, which would make full use of the available data. The following result shows pointwise consistency of CAT.G whenever the conditional mean estimation is weakly consistent.

Theorem 8 (Pointwise consistency) Suppose that for all $j \neq i$ the following two conditions hold: (a) if $(j\to i) \in E$, then $E[(\hat\varphi_{ji}(X_j) - \varphi_{ji}(X_j))^2 \mid \tilde{\mathbf{X}}_n] \xrightarrow{P} 0$; (b) if $(j\to i) \notin E$, then $E[(\hat\varphi_{ji}(X_j) - \bar\varphi_{ji}(X_j))^2 \mid \tilde{\mathbf{X}}_n] \xrightarrow{P} 0$ for some fixed $\bar\varphi_{ji} : \mathbb{R} \to \mathbb{R}$, where $\varphi_{ji}$ and $\hat\varphi_{ji}$ are defined in Equation (7). Furthermore, suppose that Assumption 1 holds. Then, in the large sample limit, we recover the causal graph with probability tending to one, that is, $P(\hat G = G) \to 1$ as $n \to \infty$, where $\hat G$ is the output of Algorithm 1 using the weights $\hat w^G$ given by Equation (9).

Theorem 8 states that, under the given assumptions, the estimated graph converges to the true causal graph with probability tending to one. In fact, the assumptions are fairly weak: we only require weakly consistent estimation of the conditional means for edges that are present in the causal graph; these represent causal relationships and are often assumed to be smooth. This distinction allows us to employ regression techniques that are consistent only for those function classes that we consider reasonable for modeling the causal mechanisms. For non-causal edges, $(j\to i) \notin E$, the estimator $\hat\varphi_{ji}$ only needs to converge to some function $\bar\varphi_{ji}$, which does not necessarily need to be the conditional mean.
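Before turning to the vanishing-identifiability regime, here is a small sketch of the sample-split weight in Equation (9) and of the cross-fitted variant mentioned above (the k-nearest-neighbour regressor is a placeholder choice, not the authors' implementation).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def split_weight(X_train, X_eval, j, i, n_neighbors=10):
    """Sample-split Gaussian edge weight of Equation (9): the regression for
    phi_ji is fitted on X_train (the auxiliary sample) and the weight is
    evaluated on X_eval."""
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train[:, [j]], X_train[:, i])
    resid = X_eval[:, i] - reg.predict(X_eval[:, [j]])
    return 0.5 * np.log(np.mean(resid ** 2) / X_eval[:, i].var())

def crossfit_weight(Xa, Xb, j, i):
    """Cross-fitted weight: average of the two sample-split estimators with the
    roles of the two halves interchanged, as discussed above."""
    return 0.5 * (split_weight(Xa, Xb, j, i) + split_weight(Xb, Xa, j, i))
```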
3.3.1 Consistency under Vanishing Identifiability

We now consider an asymptotic regime involving a sequence $(\theta_n)_{n\in\mathbb{N}}$ of SCMs with potentially changing conditional mean functions $\varphi_{ji}$ and a vanishing identifiability gap. We have the following result.

Theorem 9 (Consistency under vanishing identifiability) Let $(\theta_n)_{n\in\mathbb{N}}$ be a sequence of SCMs on $p\in\mathbb{N}$ nodes, all with the same causal directed tree $G = (V, E)$, such that
(i) for $q_n := \min_{\tilde G\in\mathcal{T}_p\setminus\{G\}} \ell_G(\tilde G) - \ell_G(G)$ (the identifiability gap of model $\theta_n$), we have $q_n^{-1} = o(\sqrt{n})$;
(ii) for all $(j\to i)\in E$ and $\varepsilon > 0$, $P_{\theta_n}\big(q_n^{-1}\, E_{\theta_n}[(\varphi_{ji}(X_j) - \hat\varphi_{ji}(X_j))^2 \mid \tilde{\mathbf{X}}_n] > \varepsilon\big) \to 0$;
(iii) for all $j\neq i$ and $\varepsilon > 0$, $P_{\theta_n}\big(q_n^{-2} n^{-1}\, E_{\theta_n}[(\varphi_{ji}(X_j) - \hat\varphi_{ji}(X_j))^4 \mid \tilde{\mathbf{X}}_n] > \varepsilon\big) \to 0$; and
(iv) there exists $C > 0$ such that for all $j\neq i$, $\inf_n P_{\theta_n}(\mathrm{Var}_{\theta_n}(X_i\mid X_j) \ge C) = 1$ and $\sup_n E_{\theta_n}\|X\|_2^4 < \infty$.
Then it holds that $P(\hat G = G) \to 1$ as $n\to\infty$.

Condition (i) asks that the identifiability gap $q_n$ goes to zero more slowly than the standard $1/\sqrt{n}$ convergence rate of estimators in regular parametric models. Such a requirement would be necessary in almost any structure identification problem. Condition (ii) requires the mean squared error of the regression estimates corresponding to true causal edges to be $o_P(q_n)$. We regard this as a fairly mild assumption: indeed, the minimax rate of estimation of regression functions in Hölder balls with smoothness $\beta$ is $n^{-2\beta/(2\beta+1)}$ (Tsybakov, 2009). Thus, we can expect that if the causal regression functions have smoothness $\beta > 1/2$ and all lie in a Hölder ball, (ii) can be satisfied for any $q_n$ satisfying (i). Condition (iii) allows the fourth moments of the estimation errors to increase at any rate slower than $nq_n^2$; of course, we would typically expect this error to decay, at least for the causal edges.

4. Hypothesis Testing

This section presents two procedures to test any substructure hypothesis regarding the causal directed tree of a causal additive tree model with Gaussian noise. We continue our analysis using the sample split estimators of Equation (9), where the conditional expectations are estimated on an auxiliary dataset. Our approach makes use of the fact that the estimated weights in Equation (9) are logarithms of ratios of averages of i.i.d. quantities, and thus the joint distribution of the estimated edge weights should, with appropriate centering and scaling, be asymptotically Gaussian; see Lemma D.4 in Appendix D for the precise statement. This allows us to create a (biased) confidence region for the true edge weights, which in turn gives a confidence set for the true graph. This confidence set of graphs is not necessarily straightforward to compute and list. However, we show that it can be queried to test hypotheses of interest, such as the presence or absence of a particular edge. As these hypothesis tests are derived from a confidence region, they are valid even when the hypothesis to be tested has been chosen after examining the data. Similar to the results in the previous sections, we avoid making assumptions on the performance of the regressions corresponding to non-causal edges. Unlike in the consistency analysis, however, here we do not, in general, require identifiability of the true graph. In order to state our results and assumptions, we introduce the following notation.
For a collection of variables $(K_{ji})_{j\neq i}$, we let $K_i := (K_{1i}, \dots, K_{(i-1)i}, K_{(i+1)i}, \dots, K_{pi})^\top \in \mathbb{R}^{p-1}$; furthermore, for any collection $(K_i)_{1\le i\le p}$, we let $K := (K_1^\top, \dots, K_p^\top)^\top$. With this notation, let, for all $k \in \{1, \dots, n\}$, the vectors of squared residuals and squared centered observations be given by
$$\hat M_k := \big\{(X_{k,i} - \hat\varphi_{ji}(X_{k,j}))^2\big\}_{j\neq i} \in \mathbb{R}^{p(p-1)}, \qquad \hat V_k := \Big\{\big(X_{k,i} - \tfrac{1}{n}\textstyle\sum_{l=1}^n X_{l,i}\big)^2\Big\}_{1\le i\le p} \in \mathbb{R}^{p}.$$
Further let
$$\hat\mu := \frac{1}{n}\sum_{k=1}^n \hat M_k, \qquad \hat\nu := \frac{1}{n}\sum_{k=1}^n \hat V_k.$$
Note that with this notation, the empirical Gaussian edge weight for $j \to i$ is given by $\log(\hat\mu_{ji}/\hat\nu_i)/2$. Let us denote by $\hat\Sigma_M \in \mathbb{R}^{p(p-1)\times p(p-1)}$, $\hat\Sigma_V \in \mathbb{R}^{p\times p}$ and $\hat\Sigma_{MV} \in \mathbb{R}^{p(p-1)\times p}$ the empirical variances of the $\hat M_k$ and $\hat V_k$ and their empirical covariance, respectively, so
$$\begin{pmatrix} \hat\Sigma_M & \hat\Sigma_{MV} \\ \hat\Sigma_{MV}^\top & \hat\Sigma_V \end{pmatrix} := \frac{1}{n}\sum_{k=1}^n \begin{pmatrix} \hat M_k \hat M_k^\top - \hat\mu\hat\mu^\top & \hat M_k \hat V_k^\top - \hat\mu\hat\nu^\top \\ \hat V_k \hat M_k^\top - \hat\nu\hat\mu^\top & \hat V_k \hat V_k^\top - \hat\nu\hat\nu^\top \end{pmatrix}.$$
With this, we may now present our construction of confidence intervals for the edge weights. (For simplicity, all proofs in this section assume the variables to have mean zero.)

4.1 Confidence Region for the Causal Tree

We use the delta method to estimate the variances of the $\hat w^G_{ji}$, and a simple Bonferroni correction to ensure simultaneous coverage of the confidence intervals we develop. Writing $z_\alpha$ for the upper $\alpha/\{2p(p-1)\}$ quantile of a standard normal distribution, we set
$$\hat u_{ji}, \hat l_{ji} := \frac{1}{2}\log\frac{\hat\mu_{ji}}{\hat\nu_i} \pm \frac{z_\alpha \hat\sigma_{ji}}{2\sqrt{n}} = \hat w^G_{ji} \pm \frac{z_\alpha \hat\sigma_{ji}}{2\sqrt{n}}, \qquad \hat\sigma^2_{ji} := \frac{\hat\Sigma_{M,ji,ji}}{\hat\mu^2_{ji}} + \frac{\hat\Sigma_{V,i,i}}{\hat\nu^2_i} - \frac{2\hat\Sigma_{MV,ji,i}}{\hat\mu_{ji}\hat\nu_i}. \tag{10}$$
We treat $[\hat l_{ji}, \hat u_{ji}]$ as a confidence interval for the true edge weight $w^G_{ji}$ and define the following region of directed trees, formed of minimizers of the score with edge weights in the confidence hyperrectangle:
$$\hat C_{\mathrm{Bon}} := \hat C(\hat l, \hat u) := \left\{ \arg\min_{\tilde G = (V,\tilde E)\in\mathcal{T}_p} \sum_{(j\to i)\in\tilde E} w'_{ji} \;:\; \forall\, j\neq i,\; w'_{ji} \in [\hat l_{ji}, \hat u_{ji}] \right\}.$$
We have the following coverage guarantee for $\hat C_{\mathrm{Bon}}$.

Theorem 10 (Confidence region) Suppose the following conditions hold:
(i) there exists $\xi > 0$ such that $E\|X\|_2^{4+\xi} < \infty$;
(ii) there exists $\xi > 0$ such that for all $j\neq i$, $E[|\hat\varphi_{ji}(X_j) - \varphi_{ji}(X_j)|^{4+\xi} \mid \tilde{\mathbf{X}}_n] = O_p(1)$;
(iii) $\mathrm{Var}((\hat M_1^\top, \hat V_1^\top)^\top \mid \tilde{\mathbf{X}}_n) \xrightarrow{P} \Sigma$, where $\Sigma$ is constant with strictly positive diagonal;
(iv) for $(j\to i)\in E$, $\sqrt{n}\, E[(\hat\varphi_{ji}(X_{k,j}) - \varphi_{ji}(X_{k,j}))^2 \mid \tilde{\mathbf{X}}_n] \xrightarrow{P} 0$.
Then $\liminf_{n\to\infty} P\big(G \in \hat C_{\mathrm{Bon}}\big) \ge 1 - \alpha$.

The second condition requires little more than fourth moments for the absolute errors in the regression (they do not need to converge to zero). Condition (iv) requires that the mean squared prediction errors corresponding to the true causal edges decay faster than a relatively slow $1/\sqrt{n}$ rate. If the causal graph is unidentifiable, then when (iv) holds for all edges corresponding to population score minimizing graphs, $\hat C_{\mathrm{Bon}}$ covers every such graph with a probability of at least $1-\alpha$.

4.2 Testing of Substructures

Whilst the confidence region $\hat C_{\mathrm{Bon}}$ has attractive coverage properties, it will typically not be possible to compute it in practice (due to the continuum of weight vectors $w'$ one would need to try). We now introduce two computationally feasible schemes for querying whether $\hat C_{\mathrm{Bon}}$ satisfies certain constraints, such as containing or not containing a given substructure. More precisely, we propose a conservative exact query scheme called CheckC (for "check confidence region"), and an asymptotically valid query scheme called ConvB (for "converging bounds"), which the simulation experiments show to be less conservative. The ConvB test gains power at the expense of generality: while the CheckC test works in both the identified and the non-identified setup, the ConvB test needs both identifiability and stronger assumptions in order to hold level. The idea is as follows: by Theorem 10, the confidence region $\hat C_{\mathrm{Bon}}$ contains the causal graph with probability tending to at least $1-\alpha$. Thus, if we can verify that no graph in $\hat C_{\mathrm{Bon}}$ contains a certain substructure, we are able to test the hypothesis that the causal graph satisfies said substructure with asymptotically valid $1-\alpha$ level control.
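For a single ordered pair (j, i), the interval in Equation (10) can be computed from the out-of-sample residuals as in the following sketch (a simplified illustration of the delta-method and Bonferroni steps; the inputs and the level are hypothetical, and this is not the authors' implementation).

```python
import numpy as np
from scipy.stats import norm

def edge_weight_ci(resid_ji, x_i, p, alpha=0.05):
    """Bonferroni-corrected confidence interval [l_ji, u_ji] for w^G_ji.

    resid_ji: residuals X_i - hat_phi_ji(X_j) on the evaluation sample (length n),
    x_i:      the corresponding observations of X_i,
    p:        number of nodes (Bonferroni correction over all p(p-1) weights)."""
    n = len(x_i)
    M = resid_ji ** 2                     # squared residuals (entries of hat_M_k)
    V = (x_i - x_i.mean()) ** 2           # squared centred observations (hat_V_k)
    mu, nu = M.mean(), V.mean()
    w_hat = 0.5 * np.log(mu / nu)         # empirical Gaussian edge weight
    # Delta-method variance for log(mu/nu), as in Equation (10):
    sigma2 = M.var() / mu**2 + V.var() / nu**2 \
        - 2 * np.cov(M, V, bias=True)[0, 1] / (mu * nu)
    z = norm.ppf(1 - alpha / (2 * p * (p - 1)))   # upper alpha/{2p(p-1)} quantile
    half = z * np.sqrt(sigma2) / (2 * np.sqrt(n))
    return w_hat - half, w_hat + half
```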
Thus, if we can verify that no graph in ˆCBon contains a certain substructure, we are able to test the hypothesis that the causal graph satisfies said substructure with asymptotically valid 1 α level control. 4.2.1 Substructure Hypotheses A substructure restriction R = (ER, Emiss R , r) on the nodes V contains specified sets ER of existing edges, Emiss R of missing edges, and a specific root node r (any of such restrictions may be void, too). For example, a substructure restriction could be that a single edge is present (such as X1 X2), or that a single edge is not present (such as X1 X2); the restriction can also specify a directed tree. Our approach allows us to conclude that at least one of the constraints in R does not hold for the true graph G = (V, E). More precisely, we propose a test for the null hypothesis H0(R) : ER \ E = , E \ Emiss R = E, r = rt(G), i.e, that all constraints in a substructure restriction R are satisfied in the causal graph. We henceforth assume that a proposed substructure R has no internal inconsistencies, i.e., that there exists at least one directed tree over the nodes V satisfying all conditions of H0(R). Example 1 illustrates how substructure restrictions allow us to test various hypotheses about the causal graph. Structure Learning for Directed Trees Example 1 In Figure 1 we illustrate a true causal graph and five examples of substructure hypotheses that we can test. Hypothesis 1 (true) consists of the restriction R = ER, where ER := {(X4 X5)}. This substructure restriction specifies that (X4 X5) is present in the causal graph. Hypothesis 2 (false) consists of the restriction R = Emiss R , where Emiss R := {(X6 X3)}. This restriction specifies that (X6 X3) is not in the causal graph. Hypothesis 3 (true) consists of the restriction R := (ER, Emiss R ) with multiple present edges and a single missing edge. Here, the substructure restriction specifies that all edges in ER := {(X3 X2), (X4 X5), (X4 X7), (X6 X3)} are present, and that the edge in Emiss R := {(X8 X9)} is not present in the causal graph. Hypothesis 4 (false) consists of the restriction R := (ER, Emiss R ) with multiple present edges and multiple missing edges. This substructure restriction specifies that all edges in ER := {(X1 X2), (X1 X4), (X5 X5)} are present, and that all edges in Emiss R := {(X3 X6), (X8 X7)} are not present in the causal graph. Hypothesis 5 (false) contains the substructure R := ER with multiple present edges, specifying that all edges in ER := {(X1 X2), (X2 X3), (X1 X4), (X4 X5), (X4 X7), (X5 X6), (X5 X8), (X6 X9)} are present in the causal graph. This substructure restriction uniquely specifies a specific complete directed tree. Hypothesis 1 Hypothesis 2 Hypothesis 3 Hypothesis 4 Hypothesis 5 Figure 1: Illustration of six graphs, see Example 1. Colored edges represent testing the presence of edges (green, ER) or whether edges are missing (red, Emiss R ). Jakobsen, Shah, B uhlmann and Peters 4.2.2 Checking the Confidence Region In order to present the first method, we introduce some notation. For any non-empty subset of directed trees T Tp, let ST (w) be the score attained by the minimum edge weight directed tree recovered by Chu Liu Edmonds algorithm with input edge weights w := (wji)j =i, when restricting the search to all directed trees in T . 
That is, if we denote the minimum edge weight directed spanning tree (MWDST) as recovered by Chu Liu Edmonds algorithm, when searching over all directed trees in T by G T (w) := arg min G=(V, E) T (j i) E wji, (11) then with G T (w) = (V, E T (w)) the associated score is given by ST (w) := X (j i) E T (w) wji. (12) Now let Tp(R) Tp be the set of all directed trees satisfying the substructure restriction R and suppose that the causal directed tree G satisfies R, i.e., G Tp(R). Hence with probability tending to at least 1 α we know that there exists a graph in ˆCBon satisfying the substructure restriction R. That is, there exist edge weights w = (w ji)j =i, with ˆlji w ˆuji for all j = i, such that G Tp(w ) satisfies the substructure restriction R. Hence, it must hold that STp(R)(w ) = STp(w ). Since the score function is weakly monotone, we have, with probability tending to at least 1 α, that STp(R)(ˆl) STp(R)(w ) = STp(w ) STp(ˆu). On the other hand, if STp(R)(ˆl) > STp(ˆu), then we know for certain that ˆCBon does not contain any graph satisfying the substructure restriction R. We thus define our Check C test function as ψCheck C R := 0 if STp(R)(ˆl) STp(ˆu) 1 otherwise. (13) Recall that Chu Liu Edmonds algorithm recovers a minimum edge weight directed spanning tree subgraph of a connected graph H. We can construct a specific connected graph H for which the set of directed spanning tree subgraphs coincides with Tp(R). In pseudoalgorithm of Algorithm 2 we detail how to test substructure hypotheses with Check C test. This testing procedure is conservative as seen by the simulation experiments in Section 6.3. While Theorem 11 proves that hypothesis testing using the Check C test achieves pointwise asymptotic level, the simulation experiments show that the finite sample power of the test is low for small to moderately large sample sizes. For example, if max{ˆlji : j = i} min{ˆuji : j = i}, then no false substructure hypothesis can be rejected. In Section 4.2.3 we propose an alternative test which exhibits improved finite sample power. 4.2.3 Converging Bounds We now present the Conv B test which is based on an asymptotically valid query scheme, that is, with probability increasing to one (in the large sample limit) it makes a valid choice on Structure Learning for Directed Trees Algorithm 2 Hypothesis testing of H0(R) using the Check C test 1: procedure Check C(R = (ER, Emiss R , r), ˆl = (ˆlji)j =i, ˆu = (ˆuji)j =i) 2: Initialize fully connected graph H := {(j i) : i, j V, j = i}. 3: For each (j i) ER, delete from H the edges {(k i) : k V \ {j}} {i j}. 4: For each (j i) Emiss R , delete from H the edge (j i). 5: If root r R, delete from H the edges {(j r) : j V }. 6: Apply Chu Liu Edmonds algorithm to find STp(R)(ˆl) and G Tp(R)(ˆl), the minimum ˆu-weighted directed spanning subtree of H. 7: Apply Chu Liu Edmonds algorithm to find STp(ˆu) and G Tp(ˆu), the minimum ˆlweighted directed spanning subtree of the fully connected graph. 8: If STp(R)(ˆl) STp(ˆu), then set ψCheck C R := 0, otherwise set ψCheck C R := 1. 9: return ψCheck C R . 10: end procedure whether any graph in ˆCBon satisfies a substructure restriction R. We call this test Conv B for converging bounds because it requires that all lower edge weight bounds converge towards the Gaussian population edge weights. Consider a true null hypothesis H0(R), i.e., a substructure restriction R which is satisfied by the causal graph G. 
Suppose that G ˆCBon, which implies the existence of edge weights w = (w ji)j =i, with ˆlji w ji ˆuji for all j = i, such that the minimum edge weight directed spanning tree, G Tp(w ) := arg min G=(V, E) Tp (j i) E w ji, satisfies the restrictions R. The intuition for our approach is as follows: We propose a method that helps all edge weights that are not in direct disagreement with R, and penalizes all edge weights that are in disagreement with R, more precisely, we define the edge weights ˇw = ( ˇwji)j =i by ˇwji = ˆuji if [ k = j : (k i) ER] [(i j) ER] [(j i) Emiss R ] [i = r], ˆlji otherwise, . We can then expect that G Tp( ˇw) still satisfies the restriction R (with probability tending to one, see Theorem 11). Conversely, the probability that G Tp( ˇw) does not satisfy the restriction R is, in the large sample limit, bounded by the probability that G is not in the confidence region ˆCBon. We may set our test function ψConv B R = 0, if G Tp( ˇw) satisfies R 1, otherwise. The pseudo-algorithm in Algorithm 3 details how to test any substructure hypothesis H0(R) using the asymptotic query scheme of the Conv B test. Our Git Hub repository (see Footnote 8) contains R implementations of both testing procedures. The following theorem shows that both substructure hypothesis tests achieve pointwise asymptotic level. Any number of null hypotheses may be tested simultaneously, Jakobsen, Shah, B uhlmann and Peters Algorithm 3 Hypothesis testing of H0(R) using the Conv B test 1: procedure Conv B(R = (ER, Emiss R , r), ˆl = (ˆlji)j =i, ˆu = (ˆuji)j =i) 2: Initialize ˇw := ˆl. 3: For each (j i) ER and all k V \ {j}, set ˇwki := ˆuki. 4: For each (j i) Emiss R , set ˇwji := ˆuji. 5: If root r R, then for all j V , set ˇwjr := ˆulr. 6: Apply Chu Liu Edmonds algorithm to find G Tp( ˇw). 7: If G Tp( ˇw) satisfies R, then set ψConv B R := 0, otherwise set ψConv B R := 1. 8: return ψConv B R . 9: end procedure without the need for any multiple testing correction. This is because the tests may be viewed as simply querying the properties of the single confidence region of Theorem 10, which has coverage of the truth with probability at least 1 α. Theorem 11 (Pointwise asymptotic level) Let α (0, 1) and let R1, R2, . . . be any collection of potentially data-dependent substructure restrictions. Suppose that conditions of Theorem 10 are satisfied. If either (a) ψRk = ψCheck C Rk for all k 1, or (b) ψRk = ψConv B Rk for all k 1, Assumption 1 holds, and for all (j i) E it holds that n E[( ˆϕji(Xk,j) ϕji(Xk,j))2| Xn] P 0, then it holds that lim sup n P k : H0(Rk) is true (ψRk = 1) The Conv B test requires stronger conditions than the Check C test. Additionally to the assumptions made by the Check C test, it requires identifiability of the causal graph and n-convergence of the mean squared estimation error for the non-causal edges. On the other hand, it would be possible to give uniform asymptotic level guarantees for the Check C test as it only relies on the coverage properties of confidence intervals for the true weights. 5. Bounding the Identifiability Gap We have seen in Section 3.3 that the identifiability gap, that is, the smallest score difference between the causal tree G and any alternative graph G Tp \ {G}, plays an important role when identifying causal trees from observational data. 
It provides information about whether the causal graph is identifiable through the corresponding score function, for example, if we can establish that the smallest Gaussian score gap is strictly positive, i.e., min G Tp\{G} ℓG( G) ℓG(G) = inf Q { G} F( G) Pp DKL(PX Q) > 0, (14) Structure Learning for Directed Trees then G is identified by the Gaussian score function. Lemma 7 lists conditions guaranteeing that Assumption 1, i.e., Equation (14) holds. However, postitivity of the identifiability gap for a single model is not sufficient for uniform consistency or consistency under vanishing identifiability. For consistency under vanishing identifiability we need to ensure that the identifiability gap vanishes at a slower rate than 1/ n; see Theorem 9. Similarly, for uniform consistency over a class of causal additive noise models Θ Tp Mp Pp, one needs the existence of a strictly positive constant c > 0 uniformly lower bounding the identifiability gap, i.e., inf θ Θ min G Tp\{G} ℓG( G) ℓG(G) > c. (15) The identifiability gap is an involved quantity. In this section, we derive a lower bound that is based on local properties of the underlying structural causal models (such as the ability to reverse edges), using information-theoretic quantities. We first consider the special cases of bivariate models (Section 5.1) and multivariate Markov equivalent trees (Section 5.2) and then turn to general trees (Section 5.3). However, before we venture into the derivation of the specific lower bounds we first examine the connection between the identifiability gaps associated with the different score functions. In this section, we assume that X PX is generated by a structural causal additive tree model with E X 2 < such that the local Gaussian, entropy and conditional entropy scores are well-defined. We neither assume that θ is a restricted structural causal additive model, i.e., θ ΘR, nor strict positivity of the identifiability gap, i.e., Assumption 1. The following result shows that the local node-wise score gaps associated with the different score functions are ordered. Lemma 12 For any G Tp and for all i V ℓCE( G, i) ℓCE(G, i) ℓE( G, i) ℓE(G, i). If the underlying model is an causal additive tree model with Gaussian noise, then ℓE( G, i) ℓE(G, i) ℓG( G, i) ℓG(G, i). It follows that the full graph score gaps and identifiability gaps associated with the different score functions satisfy a similar ordering. Thus, given that the underlying model is an causal additive tree model with Gaussian noise, a strictly positive entropy identifiability gap implies that the Gaussian identifiability gap is strictly positive. It is, however, not possible to establish strict positivity of the conditional entropy identifiability gap; see Remark 1 in Appendix B. Therefore, we focus on establishing a lower bound for the entropy identifiability gap that is tighter than that given by the conditional entropy identifiability gap. In general, we cannot use node-wise comparisons of the scores of two graphs to bound the identifiability gap (the reason is that in general a node receives a better score in a graph, where it has a parent, compared to a graph, where it does not; see Example 3 in Appendix B for a formal argument). We start by analyzing the identifiability gap in models with two variables. Jakobsen, Shah, B uhlmann and Peters 5.1 Bivariate Models We now consider two nodes V = {X, Y }, and graphs T2 = {(X Y ), (Y X)}. 
Without loss of generality assume that (X, Y ) L2(P) is generated by an additive noise SCM θ = (G, (fi), PN) with causal graph G = (X Y ) T2 to which the only alternative graph is G = (Y X). That is, X := NX, Y := f(X) + NY , (16) where (NX, NY ) PN P2. The bivariate entropy identifiability gap, which we will later refer to as the edge reversal entropy score gap, is defined as ℓE(X L99 Y ) : = ℓE( G) ℓE(G) = h(Y ) + h(X E[X|Y ]) h(X) h(Y E[Y |X]), where the fully drawn arrow symbolizes the true causal relationship and the dashed arrow the alternative. The following lemma simplifies the bivariate entropy identifiability gap to a single mutual information between the effect and the residual of the minimum mean squared prediction error regression of cause on the effect. Lemma 13 Consider the bivariate setup of Equation (16) and assume that f(X) has density. It holds that ℓE(X L99 Y ) = I(X E[X|Y ]; Y ) 0. Thus, the causal graph is identified in a bivariate setting if one maintains dependence between the predictor and minimum mean squared error regression residual in the anticausal direction. This result is in accordance with the previous identifiability results. For example, in the linear additive Gaussian noise case, I(X E[X|Y ]; Y ) = 0. Consequently, the causal graph is not identified from the entropy score function. Whenever the conditional mean in the anti-causal direction vanishes, e.g., with symmetric causal function and symmetric noise distribution, it is possible to derive a more explicit lower bound with more intuitive sufficient conditions for identifiability of the causal graph. Proposition 14 Consider the bivariate setup of Equation (16) and assume that f(X) has density. If the reversed direction conditional mean E[X|Y ] almost surely vanishes (e.g., because f, X and NY are symmetric), then ℓE(X L99 Y ) = I(X; f(X) + NY ), which is strictly positive if and only if X f(X) + NY . In addition, we have the following statements. (a) Let f(X)G and NG Y be independently normally distributed with the same mean and variance as f(X) and NY , respectively. If DKL(f(X) f(X)G) DKL(NY NG Y ), then ℓE(X L99 Y ) 1 2 log 1 + Var(f(X)) Structure Learning for Directed Trees (b) If the density of f(X) + NY is log-concave, then ℓE(X L99 Y ) 1 πe Var(f(X)) This lower bound is non-trivial only if Var(f(X)) > (πe/2 1)Var(NY ) 3.27Var(NY ). Thus, if the conditional mean E[X|Y ] in the anti-causal direction vanishes, then under certain conditions, the causal direction is identified by the entropy score function (as long as Var(f(X)) is sufficiently large relative to Var(NY )). The edge reversal score gap for the Gaussian score is given by ℓG(X L99 Y ) := 1 2 log Var(X E[X|Y ]) 2 log Var(Y E[Y |X]) 2 log Var(X E[X|Y ]) 2 log 1 + Var(f(X)) which reduces to the lower bound in point (a) of Proposition 14 if the conditional mean E[X|Y ] in the anti-causal direction vanishes. Example 2 Consider the bivariate setup of Equation (16). Suppose that the causal function f is a quadratic function f(x) = αx2 + β for some α, β, R and that NX N(0, σ2 X) and NY N(0, σ2 Y ). It holds that E[X|Y ] vanishes, and the bivariate Gaussian identifiability gap reduces to ℓG(X L99 Y ) = 1 2 log 1 + 2α2 σ4 X σ2 Y 5.2 Multivariate Markov Equivalent Trees Two Markov equivalent trees differ in precisely one directed path that is reversed in one graph relative to the other.9 The entropy score gap of Markov equivalent trees therefore reduces to the binary case. 
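The edge reversal gap is exactly the building block reused for Markov equivalent trees in Proposition 15 below. As a quick numerical illustration, the following R sketch (illustrative parameter values; mgcv used for the conditional means, as in Section 6) estimates the Gaussian edge-reversal gap in the quadratic setting of Example 2 and compares it with the closed form 1/2·log(1 + 2α²σX⁴/σY²); the empirical gap is computed as the difference of the two candidate graphs' Gaussian scores.

```r
# Simulation check of Example 2 (quadratic causal function, Gaussian noise).
# Assumed closed form: 0.5 * log(1 + 2 * alpha^2 * sigma_x^4 / sigma_y^2).
library(mgcv)
set.seed(1)
n <- 20000; alpha <- 1; beta <- 0; sigma_x <- 1; sigma_y <- 0.5
x <- rnorm(n, 0, sigma_x)                       # X := N_X
y <- alpha * x^2 + beta + rnorm(n, 0, sigma_y)  # Y := f(X) + N_Y

# Thin plate regression spline estimates of the two conditional means.
res_yx <- residuals(gam(y ~ s(x, bs = "tp")))   # Y - E-hat[Y|X]  (causal direction)
res_xy <- residuals(gam(x ~ s(y, bs = "tp")))   # X - E-hat[X|Y]  (anti-causal direction)

# Empirical Gaussian score gap for reversing the edge X -> Y:
# [0.5 log Var(Y) + 0.5 log Var(X - E[X|Y])] - [0.5 log Var(X) + 0.5 log Var(Y - E[Y|X])].
gap_hat    <- 0.5 * (log(var(y)) + log(var(res_xy)) - log(var(x)) - log(var(res_yx)))
gap_theory <- 0.5 * log(1 + 2 * alpha^2 * sigma_x^4 / sigma_y^2)
c(estimated = gap_hat, closed_form = gap_theory)
```

With these parameter values the closed form equals 0.5·log(9) ≈ 1.10, and the plug-in estimate should land close to it for moderate sample sizes.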
Proposition 15 Consider any G Tp \ {G} that is Markov equivalent to the causal tree G. Let c1 cr be the unique directed path in G that is reversed in G. Then ℓE( G) ℓE(G) = i=1 ℓE(ci L99 ci+1) min 1 i r 1 ℓE(ci L99 ci+1). Thus, a lower bound of the entropy score gap that holds uniformly over the Markov equivalence class is given by the smallest possible edge reversal in the causal directed graph: min G MEC(G)\{G} ℓE( G) ℓE(G) min (j i) E ℓE(j L99 i). 9. To see this, note that any two directed trees are Markov equivalent if and only if they satisfy the exact same d-separations or equivalently they share the same skeleton (there are no v-structures in directed trees). Distinct directed trees sharing the same skeleton must have distinct root nodes. Consequently, there exist exactly one directed path in G from rt(G) to rt( G) that is reversed in G; see Lemma D.6 Jakobsen, Shah, B uhlmann and Peters 5.3 General Multivariate Trees We now derive a lower bound of the entropy identifiability gap, i.e., a lower bound of the entropy score gap that holds uniformly over all alternative trees Tp \ {G}. To do so, we exploit a graph reduction technique (introduced by Peters et al., 2014) which enables us to reduce the analysis to three distinct scenarios. This graph reduction works as follows. Fix any alternative graph G Tp \ {G}, and iteratively remove any node (from both G and G) that has no children and the same parents in both G and G. The score gap is unaffected by the graph reduction.10 Applying this iteration scheme, until no such node can be found, results in two reduced graphs GR = (VR, ER) and GR = (VR, ER). These reduced graphs cannot be empty, for that would only happen if G = G. Further, they have identical vertices but different edges. And they can be categorized into one of three cases. To do so, consider a node L that is a sink node, i.e., a node without children, in GR and consider its parent in GR. Now, considering GR, one of the following conditions must hold: the parent is also a parent of L in GR (we then call it Z), the parent is not connected to L in GR (we then call it W), or the parent is a child of L in GR (we then call it Y ). Figure 2 visualizes these three scenarios. an GR(W) an GR(Y ) an GR(Z) subgraph of GR an GR(D) an GR(Z) de GR(Y ) de GR(O1) de GR(Ok) subgraph of GR Figure 2: Schematic illustration of parts of two reduced graphs produced by the graph reduction technique described in Section 5.3. Consider a sink node L in GR. Its parent (in GR) must either be a parent in GR, too, it must be a child in GR, or it is unconnected to L in GR. Thus, exactly one of the sets Z, Y , and W is non-empty. This case distinction is used to compute the three bounds in Theorem 16. D, O1, . . . , Ok denote further (possibly existing) nodes in GR. 10. All removed nodes V \ VR have identical incoming edges in both graphs and therefore have identical local scores. That is, for any loss function l {ℓCE, ℓE, ℓG} we have that l( G) ℓ(G) = P i VR ℓ( G, i) ℓ(G, i) + P i V \VR ℓ( G, i) ℓ(G, i) = P i VR ℓ( G, i) ℓ(G, i) = ℓ( GR) l(GR). Structure Learning for Directed Trees We can now obtain bounds for each of the three case individually. For the case with a node Z (a staying parent ), define ΠZ(G) := (z, l, o) V 3 s.t. (z l) E and o nd G(l) \ {z, l} . The score gap can then be lower bounded by min(z,l,o) ΠZ(G) I(Xz; Xo|Xl) (see Lemma D.7). Intuitively, I(Xz; Xo|Xl) quantifies the strength of the connection between z and o, when conditioning on l (which does not lie on the path between z and o). 
This is a non-local bound in that it does not constrain the length of the path connecting z and o. Analyzing or bounding this term might be difficult. We will see in Section 5.4 that this part is not needed for causal additive tree models with Gaussian noise. For the case with a node W ( removing parent ), define ΠW (G) := (w, l, o) V 3 s.t. (w l) E and o (ch G(w) \ {l}) pa G(w) . This case results in the lower bound min(w,l,o) ΠW (G) I(Xw; Xl|Xo) (see Lemma D.8). Here, w is a parent of l and o is directly connected to w. Intuitively, I(Xw; Xl|Xo) quantifies the strength of the edge w l. We condition on o but that node is not directly connected to l (only via w). For the first two cases, faithfulness (Spirtes et al., 2000) implies that these terms are non-zero and bounding them away from zero reminds of strong faithfulness (Zhang and Spirtes, 2002). However, in the second case, one considers individual edges, which reminds more of a strong version of causal minimality (Spirtes et al., 2000; Peters et al., 2017). For the case with a node Y ( parent to child ), a lower bound is given by the minimal edge reversal score gap min(j i) E ℓE(j L99 i) (see Lemma D.9). The term ℓE(j L99 i) measures the identifiability of the direction of an individual edge. It is zero in the linear additive Gaussian noise case, for example. We provide more details on the reduced graphs and on the arguments in the three cases in Section D.4.2 of Appendix D. Combining the three bounds from above, we obtain the following theorem. Theorem 16 It holds that min G Tp\{G} ℓE( G) ℓE(G) min min (z,l,o) ΠZ(G) I(Xz; Xo|Xl), min (w,l,o) ΠW (G) I(Xw; Xl|Xo), min (j i) E ℓE(j L99 i) . (17) This result lower bounds the identifiability gap using information-theoretic quantities. Corresponding results for the Gaussian score follow immediately by Lemma 12. The last two terms are local properties of the underlying structural causal model; the first term is not. As seen in Section 5.2, the last term on the right-hand side is required when considering only Markov equivalent trees; if it is non-zero, it allows us to orient all edges in the skeleton. The first two terms (non-zero under faithfulness) are additionally required when the considered trees are not Markov equivalent. We now turn to the case of causal additive tree models with Gaussian noise innovations. Here, the first term is not needed; the bound then depends only on local properties of the structural causal model. Jakobsen, Shah, B uhlmann and Peters 5.4 Gaussian Multivariate Trees The score gap lower bound in Equation (17) consists of local dependence properties except for the node tuples ΠZ(G) (Lemma D.7) that arise when considering alternative graphs that result in reduced graphs with a node Z ( staying parents ). However, we show that for additive Gaussian noise models, the score gap for such alternative graphs can be lower bounded by the score gaps already considered in alternative graphs with a node Y ( parent to child ) and a node W ( removing parent ). Thus, we have the following theorem, with a bound consisting only of local properties of the model. Theorem 17 (Gaussian localization of the identifiability gap) For causal additive tree models with Gaussian noise, we have that min G Tp\{G} ℓG( G) ℓG(G) min min (w,l,o) ΠW (G) I(Xw; Xl | Xo), min (j i) E ℓE(j L99 i) . 6. Simulation Experiments In this section, we investigate the finite-sample performance of CAT and perform simulation experiments investigating the identifiability gap and its lower bound. 
In Section 6.1 we compare the performance of CAT to CAM of B uhlmann et al. (2014) for causal additive tree models with Gaussian and non-Gaussian noise. In Section 6.2 we compare the CAT and CAM for causal discovery on non-tree DAG models (CAT always outputs a directed tree). In Section 6.3 we investigate the finite sample power and level of the proposed hypothesis testing procedures. In Section 6.4 we perform simulation experiments that highlight the behavior of the identifiability gap and its corresponding lower bound derived in Section 5. The code scripts (R) for the simulation experiments, empirical applications and the implementation of CAT and the two testing procedures are available on Git Hub (see Footnote 8). 6.1 Causal Structure Learning for Trees In this section, we compare the performance of the structure learning methods CAT and CAM when employed on additive noise models with causal graphs given by directed trees. 6.1.1 Tree Generation Schemes We employ two different random directed tree generation schemes: Type 1 (many leaf nodes) and Type 2 (many branch nodes). Figure 3 illustrates two directed trees generated in accordance with the two generation schemes. For more details, see Algorithms 4 and 5 in Section C.1 of Appendix C. 6.1.2 Gaussian Experiment In this experiment, we generate data similarly to the experimental setup of B uhlmann et al. (2014). For any given directed tree we generate causal functions by sample paths of Gaussian processes with radial basis function (RBF) kernel and bandwidth parameter of one. Sample paths of Gaussian processes with radial basis function kernels are almost Structure Learning for Directed Trees Type 1 Type 2 Figure 3: Illustration of Type 1 (many leaf nodes) and Type 2 (many branch nodes) directed trees over p = 100 nodes. The green nodes are leaf nodes, the brown nodes are branch nodes, and the black nodes are root nodes. The Type 1 tree contains 70 leaf nodes, while the Type 2 tree only contains 49 leaf nodes. surely infinitely continuous differentiable (e.g., Kanagawa et al., 2018), non-constant and nonlinear, so they satisfy the requirements of Lemma 3. See Figure 10 in Section C.2 of Appendix C for illustrations of random draws of such functions. Root nodes are mean zero Gaussian variables with standard deviation sampled uniformly on (1, 2). Furthermore, for each fixed tree and set of causal functions, we introduce at each non-root node additive Gaussian noise with mean zero and standard deviation sampled uniformly on (1/5, 2/5). We first compare our method CAT with Gaussian score function (CAT.G) against the method CAM of B uhlmann et al. (2014) on the previously detailed nonlinear additive Gaussian noise tree setup. We use CAT.G without both cross-fitting and pruning. Note that with cross-fitting the results do not change much but, as expected, cross-fitting yields slightly worse results for small sample sizes (see Figure 11 in Appendix C). We use the R-package mgcv (Mixed GAM Computation Vehicle, Wood, 2022) with default settings to construct a thin plate regression spline estimate of the conditional expectations (Wood, 2003). We use the implementation of Chu Liu Edmonds algorithm from the R-package RBGL.11 CAM is employed with a maximum number of parents set to one (restricting the output to directed trees), without preliminary neighborhood selection and subsequent pruning. 
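Since the construction above is entirely regression-based, it is straightforward to sketch. The following minimal R sketch (not the paper's implementation; the weight convention and helper names are ours) computes, for every ordered pair, a Gaussian-score edge weight as one half the log ratio of the residual variance of Xi given Xj to the marginal variance of Xi, using thin plate regression splines from mgcv; the final minimum edge weight directed spanning tree step is only indicated by a hypothetical chu_liu_edmonds() placeholder standing in for, e.g., the RBGL solver mentioned above.

```r
# Minimal sketch: Gaussian-score edge weights from thin plate regression splines.
# X is an n x p data matrix; the returned matrix has entry [j, i] equal to the
# estimated weight of the candidate edge j -> i.
library(mgcv)

gaussian_edge_weights <- function(X) {
  p <- ncol(X)
  w <- matrix(NA_real_, p, p)
  for (j in seq_len(p)) {
    for (i in seq_len(p)) {
      if (i == j) next
      d   <- data.frame(xi = X[, i], xj = X[, j])
      fit <- gam(xi ~ s(xj, bs = "tp"), data = d)      # estimate E[X_i | X_j]
      w[j, i] <- 0.5 * log(mean(residuals(fit)^2)) -   # log residual variance
                 0.5 * log(mean((d$xi - mean(d$xi))^2))  # minus log marginal variance
    }
  }
  w
}

# w_hat    <- gaussian_edge_weights(X)
# tree_hat <- chu_liu_edmonds(w_hat)   # hypothetical solver call (e.g., via RBGL)
```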
We measure the performance of the methods by computing the Structural Hamming Distance (SHD, Tsamardinos et al., 2006) and the Structural Intervention Distance (SID, Peters and Bühlmann, 2015) to the causal tree.

11. The RBGL implementation finds maximum edge weight directed trees and requires all positive edge weights. As such, we take the negative of our edge weights and shift them all by the absolute value of the smallest edge weight. If an edge weight is set to zero, this edge cannot be chosen.

For each system size p ∈ {16, 32, 64, 128} we generate a causal tree, corresponding causal functions and noise variances, and sample n ∈ {50, 100, 200, 500} observations. This is repeated 200 times and the SHD results are summarized in the boxplots of Figure 4.

Figure 4: Causal additive tree models with Gaussian noise: Boxplots of the SHD performance of CAM and CAT.G (Gaussian score) for varying sample sizes, system sizes, and tree types. CAT.G outperforms CAM in a wide range of scenarios. Both methods perform better on trees of Type 2 than on trees of Type 1.

CAT.G outperforms CAM in terms of SHD to the true graph, both in median distance and in IQR length and position, for all sample sizes, system sizes, and tree types. Considering the SID to the causal tree yields similar conclusions; see Figure 12 in Section C.2 of Appendix C. In their default versions, CAM and CAT.G use different estimation techniques for the conditional expectations, but this does not seem to be the source of the performance difference: Figure 13 in Section C.2 of Appendix C illustrates a similar SHD performance difference when forcing CAT.G to use the edge weights produced by the CAM implementation.

6.1.3 Non-Gaussian Experiment

We now compare the performance of CAM and CAT with Gaussian (CAT.G) and entropy (CAT.E) score functions in a setup with varying noise distributions. The entropy edge weights used by CAT.E are estimated with the differential entropy estimator of Berrett et al. (2019) as implemented in the CRAN R-package IndepTest (Berrett et al., 2018). We use the same simulation setup as in Section 6.1.2, but now we only consider trees of Type 1 and parameterize the setup by α > 0, which controls the deviation of the additive noise innovations from a Gaussian distribution. More precisely, we generate the additive noise variables Ni(α) as Ni(α) = sign(Zi)|Zi|^α, where Zi ∼ N(0, σi²) with σi sampled uniformly on (1/5, 2/5), or uniformly on (1, 2) if i = rt(G). For α = 1 this yields Gaussian noise, while for α ≠ 1 the noise is non-Gaussian. We conduct the experiment for all combinations of α ∈ {0.1, 0.2, . . . , 2, 2.5, 3, 3.5, 4} and sample sizes n ∈ {50, 500} for a fixed system size of p = 32. Each setting is repeated 500 times and the results are illustrated in Figure 5.

Figure 5: Deviations from Gaussianity: The parameter α controls the noise deviation from the Gaussian distribution. CAT.G and CAT.E are instances of CAT with edge weights derived from the Gaussian and entropy score functions, respectively. The solid lines represent the median SHD and the shaded (dashed) region represents the interquartile range.
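The noise transformation Ni(α) is easy to reproduce; the following is a small R sketch of it, with the standard deviation ranges taken from the setup above (the function name and its signature are ours).

```r
# Sketch of the non-Gaussian noise N_i(alpha) = sign(Z_i) * |Z_i|^alpha described above;
# sigma is drawn on (1/5, 2/5) for non-root nodes and on (1, 2) for the root node.
noise_alpha <- function(n, alpha, is_root = FALSE) {
  sigma <- if (is_root) runif(1, 1, 2) else runif(1, 1/5, 2/5)
  z <- rnorm(n, mean = 0, sd = sigma)
  sign(z) * abs(z)^alpha        # alpha = 1 recovers Gaussian noise
}
```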
Using the entropy score yields better results for noise distributions that deviate strongly from Gaussian noise. For Gaussian noise, both CAM and CAT.G outperform CAT.E. This can (at least) be attributed to two factors: (i) CAT.E does not, unlike CAM and CAT.G, explicitly use the Gaussian noise specification, and (ii) differential entropy estimation is a difficult statistical problem (see, e.g., Paninski, 2003; Han et al., 2020). For small and moderate deviations from Gaussianity, CAT.G outperforms both CAM and CAT.E. For larger deviations, CAT.E outperforms both CAT.G and CAM in terms of median SHD. Finally, we note that CAT.G always outperforms CAM in terms of median SHD.

6.2 Robustness: CAT on DAGs

This experiment analyzes how CAT performs compared to CAM and the max-min hill-climbing (MMHC, Tsamardinos et al., 2006) structure learning method using the Bayesian Gaussian equivalent score (BGe, Geiger and Heckerman, 1994; Heckerman and Geiger, 1995); the latter method is not expected to work well in our setting, as it does not exploit the additional identifiability. We compare the performance of these structure learning methods when applied to data generated from an additive Gaussian noise model with a non-tree DAG as a causal graph. More specifically, we analyze the behavior on single-rooted DAGs. For any fixed p ∈ N we generate a directed tree of Type 1 and, for each zero in the upper triangular part of the adjacency matrix, we add an edge with 5% probability. The causal functions and Gaussian noise innovations are generated according to the specifications given in the experiment of Section 6.4.2. The structural assignment for each node is additive in each causal parent, i.e., for all i ∈ {1, . . . , p}, Xi := Σ_{j ∈ pa_G(i)} f_ji(Xj) + Ni, with (N1, . . . , Np) mutually independent Gaussian distributed noise innovations.

For each p ∈ {16, 32, 64} and sample size n ∈ {50, 250, 500} we randomly generate 200 single-rooted Gaussian additive models according to the above specifications. For this experiment, we employ CAM with preliminary neighborhood selection and subsequent pruning. As CAT.G outputs trees, we do not expect it to output the correct graph. In Figure 15 of Section C.2 of Appendix C we illustrate boxplot comparisons of the SHD between the estimated and true graph for CAM, CAT.G, and MMHC with the BGe score (MMHC.BGe). We see a clear ranking of the methods in terms of SHD performance: the best performance is seen for CAM, followed by CAT.G, and the worst performing method is MMHC.BGe. Note that the BGe score (and various other Bayesian network learning scores) is only suitable for jointly Gaussian data, e.g., for linear additive Gaussian noise systems.

Figure 6 illustrates the performance in terms of ancestor relations. For small to moderately sized systems (p ∈ {16, 32}), CAM slightly outperforms CAT.G in terms of median precision (TP/(TP + FP)) when classifying causal ancestors. However, for large systems (p = 64), CAT.G outperforms CAM in median precision. On the other hand, CAM is not limited to trees, which allows it to find a larger proportion of the true ancestors, as seen by the median recall (TP/P) performance. MMHC.BGe shows subpar performance in terms of ancestor classification, except for large systems and sample sizes when considering recall. CAT.G thus seems to be a viable alternative for practical non-tree applications where precision is more important than recall for classifying causal ancestor relations.
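For concreteness, here is a minimal R sketch of the single-rooted DAG generation just described: a Type 1 tree (following the Bernoulli(0.1) fill-in of Algorithm 4 in Appendix C) whose remaining strict upper-triangular zeros are turned into edges with 5% probability. The function names are ours, and the sketch covers only the graph, not the causal functions or noise.

```r
# Entry A[j, i] = 1 encodes the edge j -> i; node 1 is the root.
random_type1_tree <- function(p) {
  A <- matrix(0, p, p)
  for (j in seq_len(p)) {
    for (i in seq(j + 1, length.out = max(0, p - j))) {
      if (sum(A[, i]) == 0) {                    # node i has no parent yet
        A[j, i] <- if (i == j + 1) 1 else rbinom(1, 1, 0.1)
      }
    }
  }
  A
}

random_single_rooted_dag <- function(p, prob = 0.05) {
  A <- random_type1_tree(p)
  extra <- upper.tri(A) & A == 0 & matrix(runif(p^2), p, p) < prob
  A[extra] <- 1                                   # add extra edges with 5% probability
  A
}
```

Adding edges only in the strict upper triangle keeps the graph acyclic and never gives the first node a parent, so the result remains single-rooted.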
Figure 14 in Section C.2 of Appendix C illustrates similar comparisons when focusing on causal edges. The precision of CAT.G is larger than that of CAM only for small sample sizes, while the opposite is true for large sample sizes. As expected, and as seen for ancestor relations, CAM outperforms both CAT.G and MMHC.BGe in terms of recall. Finally, while both methods are computationally efficient, CAT has a slightly lower runtime than the greedy search algorithm of CAM. The average runtime of CAM and CAT.G in this experiment for p = 64 and n = 500 was 288 and 199 seconds, respectively. For both methods, the most time consuming part is estimating the conditional expectations that are used to compute the edge weights. Structure Learning for Directed Trees p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Ancestor Classification: Precision p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Sample Size Ancestor Classification: Recall Figure 6: Evaluating the robustness of CAT by estimating ancestor relations in non-tree DAGs, see Section 6.2. CAT.G slightly outperforms CAM in terms of true positive rates for large graphs (top) but finds less ancestor relationships (bottom) due to fitting a tree. As expected, CAT.G and CAM outperform MMHC in terms of precision. 6.3 Hypothesis Testing In this experiment, we experimentally analyze the finite sample size and power properties of the two substructure hypothesis testing procedures proposed in Section 4.2. We generate the underlying models and data similarly to the experimental setup of the Gaussian noise experiment of Section 6.1.2. We generate a random tree of Type 2 (see Section 6.1.1) of size p with Gaussian process causal functions and Gaussian noise innovations generated in accordance with the description in Section 6.1.2. Given a finite sample of size n we use the first n/2 observations to estimate all possible conditional mean functions x 7 E[Xi|Xj = x] for j = i with thin plate regression splines (R-package mgcv with default settings). The remaining n n/2 observations are used to estimate the upper and lower Bonferroni corrected confidence bounds ˆl = (ˆlji)j =i and ˆu = (ˆuji)j =i as defined in Equation (10) of Section 4. Using the two testing procedures proposed in Algorithms 2 and 3 of Section 4.2, with a significance level of 5%, we test all simple hypotheses, i.e., all hypotheses of the form H0 : (j i) and H0 : (j i) for all j = i. We repeat this procedure 400 times to observe the average behavior of the testing procedure Jakobsen, Shah, B uhlmann and Peters for the previously mentioned system generation scheme. We do this for all combinations of sample sizes n {500, 1000, 5000, 10000, 20000} and system sizes p {2, 4, 6, 8, 16}. Figure 7 illustrates the resulting power properties of the two tests Check C and Conv B. Both testing procedures have better small sample power when testing a false hypothesis of the form H0 : (j i) compared to testing a false hypothesis of the form H0 : (j i). Furthermore, the finite sample power of Check C is inferior to the Conv B. The power of Conv B is only slightly negatively affected by an increase in system size p, when testing a false hypothesis of the form H0 : (j i) . On the other hand, Check C suffers for both types of hypotheses when increasing the system size. For example, the Check C method has almost zero power when the system size is 16 and the samplesize is 20000. 
H0 : (j i) H0 : (j i) 0 5000 10000 15000 20000 0 5000 10000 15000 20000 Figure 7: This figure illustrates the power of the proposed testing procedure for simple hypotheses. The left plot shows the empirical probability of rejecting a false hypothesis of the form H0 : (j i) as a function of the sample size n. Similarly, the right plot shows the empirical probability of rejecting a false hypothesis of the form H0(j i). For both tests the power increases with growing sample size with Conv B outperforming Check C. In Table 1, we further detail the power and level achieved by the Conv B test in the above experiment. Both tests seems to hold level in all settings. For false hypotheses of the form H0 : (j i) we have split the hypotheses into three groups based on Distance(j, i) being negative , positive or no path . If j is a descendant of i, then Distance(j, i) is negative , if j is a non-parent ancestor of i, then Distance(j, i) is positive , and if there is no directed path between j and i, then Distance(j, i) equals no path . For moderately large sample sizes the test exhibits high power. However, the power of the test for a false hypothesis of the form (j i) when j is a non-parent ancestor of i is relatively low. 6.4 Identifiability Gap We now investigate the behavior of the identifiability gap in bivariate models (Section 6.4.1) and evalute the lower bound derived in Section 5 empirically for multivariate models (Section 6.4.2). Structure Learning for Directed Trees Property: Power of test Size of test H0: (j i) (j i) (j i) (j i) Distance(j, i) p n Negative Positive No Path Total Total Total Total 2 500 0.68 0.68 0.68 0.00 0.00 2 1000 0.82 0.82 0.82 0.00 0.00 2 5000 0.97 0.97 0.97 0.00 0.00 2 10000 0.99 0.99 0.99 0.00 0.00 2 20000 0.99 0.99 0.99 0.00 0.00 4 500 0.69 0.32 0.85 0.69 0.34 0.01 0.01 4 1000 0.79 0.54 0.92 0.80 0.54 0.00 0.00 4 5000 0.93 0.86 0.98 0.94 0.87 0.00 0.01 4 10000 0.97 0.93 0.99 0.97 0.95 0.00 0.00 4 20000 0.99 0.96 0.99 0.99 0.97 0.00 0.00 8 500 0.73 0.38 0.85 0.75 0.18 0.01 0.01 8 1000 0.78 0.53 0.91 0.83 0.35 0.01 0.01 8 5000 0.92 0.86 0.98 0.95 0.79 0.01 0.01 8 10000 0.96 0.94 0.99 0.97 0.90 0.00 0.01 8 20000 0.98 0.97 0.99 0.99 0.96 0.00 0.00 16 500 0.77 0.40 0.86 0.79 0.09 0.01 0.01 16 1000 0.80 0.54 0.92 0.86 0.21 0.01 0.01 16 5000 0.91 0.85 0.98 0.96 0.70 0.01 0.01 16 10000 0.95 0.93 0.99 0.98 0.85 0.01 0.01 16 20000 0.97 0.97 0.99 0.99 0.93 0.00 0.00 Table 1: This table contains further details on the average power and size of the Conv B hypothesis test under the data generation described in Section 6.3. 6.4.1 Bivariate Identifiability Gap In this experiment, we investigate the behavior of the bivariate identifiability gap and analyze setups with both Gaussian and non-Gaussian noise innovations. Let us consider an additive noise model over (X, Y ) with causal graph X Y . The causal functions will be chosen from the following function class. For any λ [0, 1], define fλ : R R as fλ(x) = (1 λ)x3 + λx. That is, λ 7 fλ interpolates between a cubic function x 7 x3 and a linear function x 7 x. For any (α, λ) (0, ) [0, 1] we consider the following bivariate structural causal additive model X := sign(NX)|NX|α, Y := fλ(X) + NY , Jakobsen, Shah, B uhlmann and Peters where NX, NY are independent standard normal distributed random variables. Recall that the bivariate identifiability gap is given by ℓE(Y X) ℓE(X Y ) = h(X E[X|Y ]) + h(Y ) h(X E[X|Y ], Y ) = I(X E[X|Y ]; Y ), by Lemma 13. 
Thus, the causal graph X Y is identified by the entropy score function if I(X E[X|Y ]; Y ) > 0. For any fixed λ and α we now estimate the identifiability gap; we also calculate the p-value associated with the null hypothesis that the identifiability gap is zero (based on 50000 observations). Similarly to the previous experiment, we estimate the conditional expectations using thin-plate spline regression. We estimate (without sample splitting) the identifiability gap and construct p-values using the CRAN R-package Indep Test (Berrett et al., 2018). More specifically, we use the differential entropy estimator of Berrett et al. (2019) and the mutual information based independence test of Berrett and Samworth (2019), respectively. The heatmap of Figure 8 illustrates the behavior of the identifiability gap for all combinations of λ {0, 0.05, . . . , 1} and α {0.3, 0.4, . . . , 1.7}. It suggests that the identifiability gap only tends to zero when we approach the linear additive Gaussian noise setup. Only in the models closest to the linear additive Gaussian noise setup are we unable to reject the null-hypothesis of a vanishing identifiability gap. This is also what the theory predicts, namely that for bivariate linear additive Gaussian noise models, the causal direction is not identified. It is known that for linear models, non-Gaussianity is helpful for identifiability. The empirical results indicate that the same holds for nonlinear models, i.e., that the identifiability gap increases with the degree of non-Gaussianity of the noise innovations. 0.00 0.25 0.50 0.75 1.00 Linearity: l Non-Gaussianity: a p-value ³ 0.05 Identifiability Gap Figure 8: Heatmap of the identifiability gap for varying λ and α. Tiles with a red boundary correspond to the models for which the mutual information based independence test cannot reject the null hypothesis of a vanishing identifiability gap. Structure Learning for Directed Trees 6.4.2 Multivariate Identifiability Gap In this experiment, we investigate the identifiability gap and its relation to the lower bounds established in Theorem 17. For a causal additive tree model with Gaussian noise, it holds that min G Tp\{G} ℓG( G) ℓG(G) min min (w,l,o) ΠW (G) I(Xw; Xl | Xo), min i j E ℓE(i L99 j) . In other words, the identifiability gap is lower bounded by the minimum of the smallest local faithfulness measures and the smallest edge-reversal score difference. We now investigate empirically how important the first term is for the inequality to hold. More specifically, for a given model generation scheme, we quantify how often the minimum edge reversal is sufficiently small to establish the lower bound without the conditional mutual information term, that is, how often the identifiability constant min G Tp\{G} ℓG( G) ℓG(G) is larger than the minimum edge reversal. The minimum edge reversal can be estimated using the same conditional expectation and entropy estimators of the experiment in Section 6.4.1. However, estimating the identifiability gap between the second-best scoring tree and the causal tree needs further elaboration. We know that the best scoring (causal) tree can be found by Chu Liu Edmonds (a directed MWST) algorithm. The second-best scoring tree differs from the best scoring tree in at least one edge. Thus, given the best scoring graph, we remove one of the p 1 edges of the best scoring tree from the pool of possible edges and rerun Chu Liu Edmonds algorithm. 
We do this for each of the p − 1 edges in the best scoring tree, which leaves us with p − 1 possibly different sub-optimal trees, of which the minimum score is attained by the second-best scoring graph.

For the experiment, we randomly sample data generating models similarly to the experiment in Section 6.1.2. However, we change the causal functions from explicit sample paths of a Gaussian process to a thin-plate spline regression model estimating the sample paths, due to memory constraints when generating large sample sizes. Figure 9 illustrates, for p ∈ {8, 16}, boxplots of the difference between the identifiability gap and the minimum edge reversal for 100 randomly generated causal additive tree models with Gaussian noise. For each model, the identifiability gap and the corresponding minimum edge reversal are estimated from 200000 independent and identically distributed observations. The illustration suggests that it is in general necessary to also consider the conditional mutual information term in order to establish a lower bound. However, it also shows that in the majority (90%) of the models, the minimum edge reversal is indeed a lower bound for the identifiability gap.

7. Empirical Application

We consider the well-known non-synthetic bio-informatics data set considered by Sachs et al. (2005). The data set contains simultaneous measurements of expression levels of 11 different phosphorylated proteins and phospholipids of human immune system cells under both observational and interventional experimental settings. Sachs et al. (2005) present (based on expert consensus and experiments) a causal directed acyclic graph with 11 nodes and 20 edges for the 11 phosphorylated proteins and phospholipids.

Figure 9: Empirical analysis of the lower bound on the identifiability gap, see Section 6.4.2. In most of the simulated settings, we see that the estimated identifiability gap is larger than the smallest edge-reversal score difference. This suggests that in many cases, the latter term is sufficient for establishing a lower bound on the identifiability gap.

We have also implemented CAT.G and CAT.E with the heuristic pruning procedure introduced in Bühlmann et al. (2014). We compare our structure learning methods CAT.G and CAT.E with the score-based methods CAM (Bühlmann et al., 2014), GES (Chickering, 2002), and NoTears (Zheng et al., 2018), and the mixed method MMHC (Tsamardinos et al., 2006). The structure learning methods are applied to observational data (853 observations using reagents anti-CD3 and anti-CD28). The results of the structure learning methods can be seen in Table 2. Learning causal structure from observational data is a difficult problem, but several methods seem to outperform estimating an empty graph or a random graph. CAM is superior in terms of SHD, SID, and recall of edge and root predictions, suggesting that in this data set one may indeed exploit nonlinearities for identifying causal structure. However, we also see that CAT.G shows competitive performance and ranks in first or second place with respect to all reported performance measures. Interestingly, even though CAT.G approximates the non-tree causal DAG by a directed tree, it outperforms various DAG structure learning methods such as the classical approaches GES and MMHC and the more recent continuous optimization approach NoTears.
CAT.E does not perform well on these data, witnessing that estimating entropies is a difficult statistical problem. Finally, we also evaluate the proposed hypothesis testing procedures on this data set, even though the asymptotic guarantees of the hypothesis tests derived in Section 4 are not guaranteed to hold as the true underlying graph is not a directed tree. We test every possible simple hypothesis of the form H0(j i) and H0(j i). The results can be seen in Table 3 (the Check C test holds level but has zero power). The Conv B test shows reasonable power against false hypothesis of the form H0(j i), however, it has no power against the false hypotheses of the form H0(j i). Rejection rates of the true hypotheses of the form H0(j i) are larger than the asymptotically guaranteed rate of 0.05, possibly because of Structure Learning for Directed Trees the model violation; this phenomenon is not as expressed for true hypotheses of the form H0(j i). Method Prune Score SHD SHD-C SID Precision Recall CAM Yes ℓG 14.00 15.00 72.0 0.571 0.381 CAT No ℓG 14.00 14.00 79.0 0.636 0.333 CAT Yes ℓG 15.00 16.00 83.0 0.545 0.286 MMHC BGe 15.00 14.00 84.0 0.417 0.238 MMHC BIC 15.00 14.00 84.0 0.417 0.238 GES BIC 17.00 16.00 107.0 0.231 0.143 CAT Yes ℓE 18.00 19.00 92.0 0.273 0.143 No Tears 19.00 17.00 99.0 0.182 0.095 Empty Graph 20.00 20.00 94.0 0.091 0.048 Random Graph 22.32 21.93 94.7 0.271 0.170 CAT No ℓE 24.00 25.00 104.0 0.273 0.143 Table 2: Results of the empirical application of various structure learning methods to the data set of Sachs et al. (2005). Here we report the structural hamming distance (SHD), structural hamming distance of the respective CPDAGs (SHD-C), and structural intervention distance (SID) between the causal graph and the estimated graph. The latter two columns show the precision and recall for edge and root classification. The methods Empty Graph always outputs the empty graph and the method Random Graph outputs a random single-rooted tree generated according to the generation scheme outlined in Section 6.2. We have implemented CAT.G and CAT.E both with and without the heuristic pruning technique introduced in B uhlmann et al. (2014) Nulls incorrect Nulls correct H0 : (j i) (j i) (j i) (j i) Distance(j, i) Negative Positive No Path Total Total Total Total Rejection rates: 0.58 0.53 0.66 0.58 0.00 0.30 0.02 N: 46 26 18 90 20 20 90 Table 3: Further details on the average power and level of the Conv B test with a significance level of 0.05. Here, we have tested every simple hypothesis of the Sachs et al. (2005) data set; see Section 6.3 for further explanations of the distance metric. N denotes the number of hypothesis tests that have been averaged. Jakobsen, Shah, B uhlmann and Peters 8. Summary and Future Work This paper shows that exact structure learning is possible for systems of lesser complexity, i.e., for restricted structural causal models with additive noise and causal graphs given by directed trees. We propose the method CAT, which is guaranteed to consistently recover the causal directed tree of a causal additive tree model with Gaussian noise under mild assumptions on the regression methods used to estimate conditional means. Furthermore, we argue that CAT is consistent in an asymptotic setup with vanishing identifiability. We present a computationally feasible procedure to test substructure hypotheses and provide an analysis of the identifiability gap. 
Simulation experiments show that CAT outperforms other (more general) structure learning methods for the specific task of recovering the causal graph in additive noise structural causal models when the causal structure is given by directed trees. The proof of Proposition 4 is based on the fact that the causal functions of alternative models are differentiable and that the noise densities are continuous. We conjecture that it is possible to get even stronger identifiability statements under weaker assumptions; proving such a result necessitates new proof strategies. We believe that it could be possible to prove uniform consistency under suitable conditions when requiring that the infimum of the identifiability gap is strictly positive and that the mean squared errors of the regression estimates converge uniformly. Furthermore, we believe that it could inspire future research on more general identifiability conditions (e.g., relaxing the smoothness assumptions) for directed trees and DAGs under the assumption of additive noise. Furthermore, it should be possible to use a wild bootstrap approach to construct a simultaneous hyperrectangle confidence region for the Gaussian edge weights. This would, however, require a sufficiently fast convergence rate of the estimation error of the conditional expectations corresponding to non-causal edges. Compared to the Bonferroni correction, this approach could increase the power of the test. We hypothesize that the Conv B test holds level even in many generic, non-identifiable settings. Acknowledgments We thank Phillip Bredahl Mogensen and Thomas Berrett for helpful discussions on the entropy score and its estimation. JP thanks Martin Wainwright for discussions on greedy search methods (and when they fail) and inequalities in additive noise models during his visit at UC Berkeley in 2013. PB and JP thank David B urge and Jan Ernest for helpful discussions on exploiting Chu Liu Edmonds algorithm for causal discovery during the early stages of this project. MEJ and JP were supported by the Carlsberg Foundation; JP was, in addition, supported by a research grant (18968) from VILLUM FONDEN. RDS was supported by EPSRC grant EP/N031938/1. PB received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement No. 786461). Appendix A. Graph Terminology A directed graph G = (V, E) consists of p N>0 vertices (nodes) V = {1, . . . , p} and a collection of directed edges E {(j i) (j, i) : i, j V, i = j}. For any graph G = (V, E) Structure Learning for Directed Trees we let pa G(i) := {v V : (v, i) E} and ch G(i) := {v V : (j, v) E} denote the parents and children of node i V and we define root nodes rt(G) := {v V : pa G(i) = } as nodes with no parents (that is, no incoming edges). A path in G between two nodes i1, ik V consists of a sequence (i1, i2), . . . , (ik 1, ik) of pairs of nodes such that for all j {1, . . . , k 1}, we have either (ij ij+1) E or (ij+1 ij) E. A directed path in G between two nodes i1, ik V consists of a sequence (i1, i2), . . . , (ik 1, ik) of pairs of nodes such that for all j {1, . . . , k 1}, we have (ij ij+1) E. Furthermore, we let an G(i) and de G(i) denote the ancestors and descendants of node i V , consisting of all nodes j V for which there exists a directed path to and from i, respectively. We let nd G(i) denote the non-descendants of i. 
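As an illustrative convention (ours, not the paper's implementation), these notions can be computed directly from an adjacency matrix representation; the following small R helpers assume A[j, i] = 1 encodes the edge j → i.

```r
# Helpers for the graph terminology above, for a directed graph stored as a
# p x p adjacency matrix A with A[j, i] = 1 meaning the edge j -> i.
parents  <- function(A, i) which(A[, i] == 1)
children <- function(A, i) which(A[i, ] == 1)
roots    <- function(A) which(colSums(A) == 0)   # nodes with no incoming edges

# Ancestors of i: all nodes with a directed path into i, found by repeatedly
# taking parents until no new nodes appear.
ancestors <- function(A, i) {
  anc <- integer(0)
  frontier <- parents(A, i)
  while (length(frontier) > 0) {
    anc <- union(anc, frontier)
    frontier <- setdiff(unlist(lapply(frontier, function(v) parents(A, v))), anc)
  }
  sort(anc)
}
descendants <- function(A, i) ancestors(t(A), i)  # reverse all edges and reuse
```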
A directed acyclic graph (DAG) is a directed graph that does not contain any directed cycles, i.e., directed paths visiting the same node twice. We say that a graph is connected if a (possibly undirected) path exists between any two nodes. A directed tree is a connected DAG in which all nodes have at most one parent. More specifically, every node has a unique parent except the root node, which has no parent. The root node rt(G) is the unique node such there exists a directed path from rt(G) to any other node in the directed tree. In graph theory, a directed tree is also called an arborescence, a directed rooted tree, and a rooted out-tree. A graph G = (V , E ) is a subgraph of another graph G = (V, E) if V V , E E and for all (j i) E it holds that j, i V . A subgraph is spanning if V = V . For any DAG G = (V, E) and three mutually distinct subsets A, B, C V we let A GB | C denote that A and B are d-separated by C in G (see, e.g., Pearl, 2009). Appendix B. Further Details on Section 5 Remark 1 The conditional entropy score gap is not strictly positive when considering the alternative graphs G that are Markov equivalent to the causal graph G, G MEC(G). A simple translation of the conditional entropy score function reveals that ℓCE( G) + C = X (j i) E h(Xi|Xj) h(Xi) = X (j i) E I(Xi; Xj), for a constant C R. By symmetry of the mutual information, it holds that ℓCE( G) = ℓCE(G), for any G MEC(G), since G and G share the same skeleton. Thus, the conditional entropy score function can, at most, identify the Markov equivalence class of the causal graph. In fact, the polytree causal structure learning method of Rebane and Pearl (1987) uses the above translated conditional entropy score function to recover the skeleton of the causal graph. Example 3 (Negative local Gaussian score gap) Consider two graphs G and G with different root nodes, i.e., rt(G) = rt( G). If x 7 E[Xrt(G)|Xpa G(rt(G)) = x] is not almost surely constant, then it holds that ℓG( G, rt(G)) ℓG(G, rt(G)) = E[(Xrt(G) E[Xrt(G)|Xpa G(rt(G))])2] Var(Xrt(G)) = E[Var(Xrt(G)|Xpa G(rt(G)))] Var(Xrt(G)) = Var(E[Xrt(G)|Xpa G(rt(G))]) < 0. Jakobsen, Shah, B uhlmann and Peters Appendix C. Further Details on the Simulation Experiments This section contains further details on the simulation experiments. C.1 Tree Generation Algorithms The following two algorithms, Algorithm 4 (many leaf nodes) and Algorithm 5 (many branch nodes), details how the Type 1 and Type 2 trees are generated, respectively. Algorithm 4 Generating type 1 trees procedure Type1(p) A := 0 Rp p for j {1, . . . , p} do for i {j + 1, . . . , p} do if Pp k=1 Aki = 0 then if i = j + 1 then Aji := 1 else Aji := Binomial(success = 0.1) end if else Aji := 0 end if end for end for return A end procedure Algorithm 5 Generating type 2 trees procedure Type2(p) for i {2, . . . , p} do j := sample({1, . . . , i 1}) Aji := 1 end for return A end procedure C.2 Additional Illustrations This section contains some additional illustrations of the simulation experiments. Structure Learning for Directed Trees Figure 10: Four causal functions as modeled by the RBF kernel Gaussian Process. 50 100 200 500 50 100 200 500 0 Sample Size SHD to true graph Figure 11: Boxplot illustrating the SHD performance of CAM and CAT for varying sample sizes, system sizes and tree types in the experiment of Section 6.1.2 with 200 repetitions. CAT.G.cf is the CAT.G method with cross-fitted edge weights. We see that cross-fitting has no positive impact on the performance. 
It seems to worsen the performance for small sample sizes. Jakobsen, Shah, B uhlmann and Peters 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 Sample Size SID to true graph Figure 12: Boxplot illustrating the SID performance of CAM and CAT for varying sample sizes, system sizes and tree types in the experiment of Section 6.1.2. CAT.G is CAT with edge weights derived from the Gaussian score function. 50 100 200 500 50 100 200 500 50 100 200 500 50 100 200 500 Sample Size SHD to true graph Figure 13: Boxplot illustrating the SHD performance of CAM and CAT for varying sample sizes, system sizes and tree types in the experiment of Section 6.1.2. Here CAT.G is run on the CAM edge weights , so that any difference in nonparametric regression technique is ruled out as the source of the performance difference. Structure Learning for Directed Trees p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Edge Classification: Precision p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Sample Size Edge Classification: Recall Figure 14: Boxplot of edge relations for the experiment in Section 6.2. p: 16 p: 32 p: 64 50 250 500 50 250 500 50 250 500 Sample Size Figure 15: Boxplot of SHD for the experiment in Section 6.2. Jakobsen, Shah, B uhlmann and Peters Appendix D. Proofs This section contains the proofs of all results presented in the main text. D.1 Proofs of Section 2 Proof of Lemma 3. Let θ = (G, (fi), PN) Tp Dp 3 Pp G. Furthermore, let all causal functions (fi) be nowhere constant and nonlinear. The additive noise is Gaussian, so the log density of Ni for all i {1, . . . , p} is given by 2 log(2πσ2 i ) x2 2σ2 i , ν i(x) = x σ2 i , ν i (x) = 1 σ2 i , ν i (x) = 0. By assumption we have that condition (i) of Definition 2 is satisfied, hence assume for contradiction that condition (ii) of Definition 2 is not satisfied. That is, we assume that there exists an i {1, ..., p} \ {rt(G)} such that for all (x, y) J := {(x, y) R2 : ν i (y fi(x))f i(x) = 0} = {(x, y) R2 : f i(x) = 0}, it holds that ξ (x) ξ (x)f i (x) f i(x) 2f i (x)f i(x) σ2 = y fi(x) f i (x) (f i (x))2 Henceforth, suppress the subscript i of fi and σi. First note that {x R : f (x) = 0} is closed by continuity of f . The complement is open, hence there exists a countable collection of mutually disjoint open intervals (Ok)k Z such that {x R : f (x) = 0} = k ZOk. Since f is nowhere constant we know that {x R : f (x) = 0} has empty interior which implies that k ZOk = R. Now let (Ok)k Z be indexed by Z such that for any k, j Z with k < j and x Ok, y Oj it holds that x < y. As the left-hand side of Equation (18) is constant in y it must hold that 0 = f (x) (f (x))2 x f (x) f (x) f (x) x (f (x))2 = i.e., f (x)/f (x) is constant, for all x k ZOk. On each Ok we have that / x log(sign(f (x))f (x)) = ck,1 log(sign(f (x))f (x)) = ck,1x + ck,2 sign(f (x))f (x) = exp(ck,1x + ck,2) f (x) = exp(ck,1x + ck,2). Recall that we have assumed continuous differentiability of f . That is, for any k Z and tk := sup(Ok) = inf(Ok+1) we have limx tk f (x) = limx tk f (x) and limx tk f (x) = limx tk f (x). Assume without loss of generality that f (x) = exp(ck,1x + ck,2) for all x Ok and k Z. These conditions impose the restrictions (ck,1 ck+1,1)tk = ck+1,2 ck,2 and log(ck,1/ck+1,1) + (ck,1 ck+1,1)tk = ck+1,2 ck,2 which entails that ck,1 = ck+1,1 and ck,2 = ck+1,2. This proves that there exists c1, c2 R such that f (x) = exp(c1x + c2) for all x R. 
Thus, the differential equation holds for all x R, 0 = ξ (x) ξ (x)f (x) f (x) 2f (x)f (x) Structure Learning for Directed Trees by division with f (x). By integration this implies that 0 = ξ (x)/f (x) 2f (x)/σ2 + c3 such that ξ (x) = 2 exp(2c1x+2c2)/σ2 c3 exp(c1x+c2) and ξ (x) = exp(2c1x+2c2)/c1σ2 c3 exp(c1x + c2)/c1 + c4 and ξ(x) = exp(2c1x + 2c2) 2c2 1σ2 c3 exp(c1x + c2) c2 1 + c4x + c5. We see that ξ(x) p Xpa G(i)(x) as x sign(c1) , in contradiction with the assumption that p Xpa G(i)(x) is a probability density function if c1 = 0. Thus, it must hold that f (x)/f (x) = 0 for all x R, or equivalently, that f is a linear function, yielding a contradiction. This proves that whenever fi D3 is a nowhere constant and nonlinear function and the additive noise is Gaussian then condition (ii) of Definition 2 is satisfied, so θ ΘR. Proof of Proposition 4. First, we consider the bivariate setting. Let (X, Y ) be generated by an additive noise SCM θ ΘR T2 D2 3 P2 C3 given by X := NX and Y := f(X) + NY with PX = p X λ and PNY = p NY λ having three times differentiable strictly positive densities and f is a three times differentiable nowhere constant function such that condition (ii) of Definition 2 holds. Assume for contradiction that we do not have observational identifiability of the causal structure G = (V = {X, Y }, E = {(X Y )}). That is, there exists θ T2 Dp 1 Pp C0 with causal graph G = G or, equivalently, a differentiable function g and noise distributions P NX = p NX λ and P NY = p NY λ with continuous densities such that the structural assignments Y := NY and X := g( Y ) + NX induce the same distribution, i.e., PX,Y = P X, Y . (19) By the additive noise structural assignments we know that both PX,Y and P X, Y have densities with respect to λ2 given by p X,Y (x, y) = p X(x)p NY (y f(x)), p X, Y (x, y) = p NX(x g(y))p Y (y), for all (x, y) R2. By the equality of distributions in Equation (19) and strict positivity of p X and p NY we especially have that for λ2-almost all (x, y) R2 0 < p X,Y (x, y) = p X, Y (x, y). (20) However, as both p X,Y and p X, Y are continuous we realize that the inequality in Equation (20) holds for all (x, y) R (if they were not everywhere equal there would exists a non-empty open ball in R2 on which they differ in contradiction with λ2-almost everywhere equality). Furthermore, by the assumption that f is three times differentiable and p X, p NY are three times continuously differentiable we have that 3π/ x3 and 3π/ x2 y are well-defined partial-derivatives of π(x, y) := log p X,Y (x, y) = log p X(x) + log p NY (y f(x)) =: ξ(x) + ν(y f(x)), Jakobsen, Shah, B uhlmann and Peters With π(x, y) := log p X, Y we have that π(x, y) = log p NX(x g(y)) + log p Y (y) =: ξ(x g(y)) + ν(y). Since it holds that π = π by Equation (20) the partial-derivatives 3 π/ x3 and 3 π/ x2 y are also well-defined. Now note that for any x, y R 0 = lim h 0 | π(x + h, y) π(x, y)|/h = lim h 0 | ξ(x g(y) + h) ξ(x g(y))|/h, implying that ξ is differentiable in x g(y) for any x, y R or, equivalently, ξ is everywhere differentiable. Similar arguments yield that ξ is at least three times differentiable. We conclude that 2 π(x, y)/ x2 = ξ (x g(y)) and 2 π(x, y)/ x y = ξ (x g(y))g (y) and for any (x, y) R2 such that 2 π(x, y)/ x y = 0 or, equivalently, (x, y) J := (x, y) : 2π(x, y) x y = ν (y f(x))f (x) = 0 , it holds that 2 x2 π(x, y) 2 x y π(x, y) It is worth noting that J = to ensure that the following derivations are not void of meaning. 
(This can be seen by noting that f is nowhere constant, i.e., f (x) = 0 for λ-almost all x R. Hence, J = if and only if p NY is a density such that {(x, y) R2 : f (x) = 0} (x, y) 7 ν (y f(x)) is constantly zero or, equivalently, R y 7 ν (y) is constantly zero. This holds if and only if p NY is either exponentially decreasing or exponentially increasing everywhere, which is a contradiction as no continuously differentiable function integrating to one has this property.) For any (x, y) J we also have that 2 x2 π(x, y) 2 x yπ(x, y) ξ (x) + ν (y f(x))f (x)2 ν (y f(x))f (x) ν (y f(x))f (x) ν f + ν ν f (ν )2 (f )2ν ν (f )2 + f ξ which implies that 2ν f f + ν f + ν ν f f in contradiction with the assumption that condition (ii) of Definition 2 holds. We conclude that PX,Y = P X, Y . Now consider a multivariate restricted causal model θ ΘR over X = (X1, . . . , Xp) with causal directed tree graph G = (V, E). Assume for contradiction that there exists an Structure Learning for Directed Trees alternative SCM θ = ( G, ( fi), P N) Tp Dp 1 Pp C0 inducing X = ( X1, . . . , Xp) with causal graph G = (V, E) = G, such that PX = P X. Any SCM induced distribution is Markov with respect to the underlying causal graph. As such, we have that PX is Markov with respect to both G and G. Furthermore, since (in θ) the causal functions are non-constant and the noise innovations have strictly positive density, we have, by Proposition 17 of Peters et al. (2014), that PX satisfies causal minimality with respect to causal graph G of θ, i.e., it is globally Markov with respect to G but not any proper subgraph of G. If PX also satisfies causal minimality with respect to G, then, by Proposition 29 of Peters et al. (2014), there exist i, j V such that (j i) E and (i j) E. Assume for contradiction that PX does not satisfy causal minimality with respect to G. By Proposition 4 of Peters et al. (2014), we have that there exists (j i ) E such that Xj Xi . Define A := nd G(j ) {j } and B := de G(i ) {i }. It holds that A G(B \ {i }) | i , i.e., A and B \ {i } are d-separated by i in the directed tree G. Since PX is Markov with respect G it holds that XA XB\{i } | Xi , hence XA XB | Xi . Similarly, it holds that XA XB | Xj which implies that XA Xi | Xj . By applying the contraction property of conditional independence, we get that XA Xi | Xj and Xi Xj = XA Xi , and XA XB | Xi and XA Xi = XA XB. Since A B = V, A B = and G is a directed tree (that spans V ) there exist either an edge (j i ) E with j A and i B or j B and i A. In either case, we have that Xi Xj , which contradicts PX satisfying causal minimality with respect to G. We conclude that PX also satisfies causal minimality with respect to the alternative graph G. Hence, the following structural equations hold for (Xi, Xj) and ( Xi, Xj) Xi = fi(Xj) + Ni, with Xj Ni, Xj = fj( Xi) + Nj, with Xi Nj, with PXj,Xi = P Xj, Xi. We can apply the same arguments as in the bivariate setup if we can argue that a density of Xj is three times differentiable and that a density of Xi is a continuous density. To this end, note that the density p Xj is given by the convolution of two densities p Xj(y) = Z pfj(Xpa G(j))(t)p Nj(y t) dt, (21) as Xj := fj(Xpa G(j)) + Nj with Xpa G(j) Nj. Here we used that fj(Xpa G(j)) has density with respect to the Lebesgue measure. Jakobsen, Shah, B uhlmann and Peters To realize this note that fj C3 and it is nowhere constant. By arguments similar to those in the proof of Lemma 3, this implies that f (x) = 0 at only countably many points (dk). 
Now let (Ok) be the countable collection of mutually disjoint open intervals that cover R except for the points (dk). By continuity of f we know that f (x) is either strictly positive or strictly negative on each Ok. That is, f is continuously differentiable and strictly monotone on each Ok. Thus, f has a continuously differentiable inverse on each Ok by, e.g., the inverse function theorem. This ensures that fj(Xpa G(j)) has a density with respect to the Lebesgue measure whenever Xpa G(j) does. By starting at the root node Xrt(G) = Nrt(G), which by assumption has a density, we can iteratively apply the above argumentation down the directed path from rt(G) to j in order to conclude that any Xj for j {1, . . . , p} has a density with respect to the Lebesgue measure. Since p Nj is assumed strictly positive three times continuous differentiable, the representation in Equation (21) furthermore yields that p Xj is three times differentiable; see, e.g., Theorem 11.4 and 11.5 of Schilling (2017). Now we argue that Xi has a continuous density. First note that PXi at least has a continuous density p Xi by arguments similar to those applied for Equation (21). By the assumption that PX = P X we especially have that PXi = P Xi which implies that also Xi has a continuous density. By virtue of the arguments for the bivariate setup we arrive at a contradiction, so it must hold that PX = P X. Proof of Lemma 6. Consider an SCM θ = ( G, ( fi), P N) { G} Dp 1 Pp G with G = G and let Q θ be the induced distribution. As Q θ is Markov with respect to G and generated by an additive noise model the density q θ factorizes as i=1 q θ(xi|xpa G(i)) = i=1 q Ni(xi fi(xpa G(i))). The cross entropy between PX and Q θ is then given by h(PX, Q θ) := E log q θ(X) i=1 E h log q Ni Xi fi(Xpa G(i)) i i=1 h Xi fi(Xpa G(i)), Ni , where the latter is a sum of the cross entropies between the distribution of Xi fi(Xpa G(i)) and the distribution of Ni. As Q θ is generated by a causal additive tree model with Gaussian Structure Learning for Directed Trees noise, we have for all 1 i p that Ni N(0, σ2 i ) for some σ2 i > 0. Hence for all 1 i p, h Xi fi(Xpa G(i)), Ni = E Xi fi(Xpa G(i)) 2 2π σi) + E Xi fi(Xpa G(i)) 2 Thus, for given set of causal functions ( fi) and a fixed i, the noise variance that minimizes the cross entropy is given by E Xi fi(Xpa G(i)) 2 . We thus have 2π σi) + E Xi fi(Xpa G(i)) 2 2 log E Xi fi(Xpa G(i)) 2 + 1 We conclude that inf Q { G} Dp 1 Pp G h(PX, Q) inf fi D1 E Xi fi(Xpa G(i)) 2 ! Finally, as D1 is dense in L2(PXpa G(i)), we have that inf fi D1 E Xi fi(Xpa G(i)) 2 = E Xi E[Xi|Xpa G(i)]) 2 + inf fi D1 E E[Xi|Xpa G(i)] fi(Xpa G(i)) 2 = E Xi E[Xi|Xpa G(i)]) 2 . Here we used that Xpa G(i) has density with respect to the Lebesgue measure, PXpa G(i) λ, and that the density is differentiable (see proof of Proposition 4). This concludes the first part of the proof. Jakobsen, Shah, B uhlmann and Peters For the second statement, we note that for any Q { G} F( G) Pp there exists some noise innovation distribution P N P such that Q is the distribution of X generated by structural assignments Xi := fi(Xpa G(i)) + Ni = E[Xi|Xpa G(i)] + Ni, for all 1 j p and mutually independent noise innovations N = ( N1, . . . , Np) P N Pp. Let q denote the density of Q with respect to the Lebesgue measure and let q Ni denote the density of Ni for all 1 i p. As Q is Markov with respect to G and generated by an additive noise model the density factorizes as i=1 q(xi|xpa G(i)) = i=1 q Ni(xi E[Xi|Xpa G(i) = xpa G(i)]). 
The cross entropy between PX and Q is given by h(PX, Q) = E [ log (q(X))] i=1 E h log q(Xi|Xpa G(i)) i i=1 E h log q Ni Xi E h Xi|Xpa G(i) i i i=1 h Xi E h Xi|Xpa G(i) i , Ni . Note that h(P, Q) = h(P)+DKL(P Q) h(P) with equality if and only if Q = P. Thus, the infimum is attained at noise innovations that are equal in distribution to Xi E[Xi|Xpa G(i)] (which has a density by assumption). That is, inf Q { G} F( G) Pp h(PX, Q) = i=1 inf Nj P Nj P h Xi E h Xi|Xpa G(i) i , Ni i=1 h Xi E h Xi|Xpa G(i) i Proof of Lemma 7. Let θ ΘR Tp Dp 3 Pp G and assume that condition (a) is satisfied, i.e., that for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version. Note that ℓG( G) ℓG(G) = inf Q { G} Dp 1 Pp G h(PX, Q) h(PX). (22) Structure Learning for Directed Trees Furthermore, by the considerations in the proof of Lemma 6 the infimum in Equation (22) is attained for Q , where the functions are given by the conditional expectation functionals. When condition (a) is satisfied we therefore know that Q { G} Dp 1 Pp G. Finally, ℓG( G) ℓG(G) = h(PX, Q ) h(PX) = DKL(PX Q ) > 0, where the last strict inequality follows from Proposition 4. Now let θ ΘR Tp Dp 3 Pp G. Assume that condition (b) is satisfied, i.e., for all 1 i p it holds that the causal function fi is contained within a function class Fi D1, which for all j = i satisfies arg min fi Fi E Xi fi(Xj) 2 Fi. (23) Define the modified Gaussian score function ℓG.mod( G) := 1 2 log Var Xi fpa G(i)i(Xpa G(i)) , where fji : R R is given by fji := arg min f Fi E Xi f(Xj) 2 , for all i = j. Now, for any G Tp \ {G}, it holds that ℓG.mod( G) ℓG.mod(G) = inf Q { G} (Fi)1 i p Pp G h(PX, Q) h(PX) = h(PX, Q ) h(PX) = DKL(PX Q ) > 0. Here we used the closedness in Equation (23) to argue that the infimum is attained for Q { G} (Fi)1 i p Pp G. Finally, since (Fi)1 i p Dp 1, Proposition 4 guarantees the strict inequality. Now let θ ΘR Tp Dp 3 Pp +C3. Assume that for all i = j it holds that x 7 E[Xi|Xj = x] has a differentiable version, and assume that for all i = j it holds that Xi E[Xi|Xj] has a continuous density. With these assumptions we note that for any G Tp \ {G} it holds, by the arguments in the proof of Lemma 6, that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp h(PX, Q) h(PX) = h(PX, Q ) h(PX), where Q is generated by an additive noise model G (fi) (P Ni)1 i p with causal graph G Tp, with causal functions fi x 7 E[Xi|Xpa G(i) = x] D1 and noise innovations given by Ni D= Xi E[Xi|Xj] PNi PC0, i.e., noise innovations with continuous densities. Proposition 4 now yields that ℓE( G) ℓE(G) = DKL(PX Q ) > 0, Jakobsen, Shah, B uhlmann and Peters since PX is induced by a restricted causal additive tree model and Q is induced by a causal additive tree model { G} (fi) (P Ni)1 i p Tp Dp 1 Pp C0. D.2 Proofs of Section 3 Proof of Theorem 8. Assume that θ = (G, (fi), PN) ΘR with PN Pp G and G = (V, E). For simplicity of the proof, we assume that E[X] = 0 such that the Gaussian edge weight estimators simplify to ˆwji := ˆw G ji = 1 n Pn k=1 (Xk,i ˆϕji(Xk,j))2 1 n Pn k=1 X2 k,i for all j = i. Furthermore, define the Gaussian population (for i = j) and auxiliary (for (j i) E) edge weights by 2 log E[(Xi ϕji(Xj))2] , w ji := 1 2 log E[(Xi ϕji(Xj))2] respectively, where ϕji : R R is a fixed function satisfying E[( ˆϕji(Xj) ϕji(Xj))2| Xn] P n 0. Furthermore, for any G = (V, E) Tp denote ˆw( G) := X (j i) E ˆwji, w( G) := X (j i) E wji, w ( G) := X (j i) E\E w ji + X (j i) E E wji, as the total estimated, population and auxiliary edge weights for G. 
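(As a computational aside, not part of the proof: the estimated edge weights and the minimizing directed spanning tree defined above are straightforward to compute. The following Python sketch is illustrative only and is not the authors' implementation; it assumes centred data, a user-supplied regression routine `fit_regression` standing in for the conditional mean estimate phi_hat_ji, ideally fitted on the auxiliary sample, and it uses the Chu-Liu-Edmonds routine from NetworkX to minimize the total edge weight.)

```python
# Illustrative sketch (not the paper's code): Gaussian edge weights
#   w_hat[j, i] = 0.5 * log( mean((X_i - phi_hat_ji(X_j))^2) / mean(X_i^2) )
# for centred data, and the minimum edge weight directed spanning tree.
import numpy as np
import networkx as nx

def gaussian_edge_weights(X, fit_regression):
    """X: (n, p) array with centred columns; fit_regression(x, y) returns a
    callable estimate of t -> E[Y | X = t] (placeholder for any consistent
    regression method, ideally fitted on an auxiliary sample)."""
    n, p = X.shape
    W = np.full((p, p), np.inf)  # diagonal stays np.inf and is never used
    for j in range(p):
        for i in range(p):
            if i == j:
                continue
            phi_hat = fit_regression(X[:, j], X[:, i])
            resid = X[:, i] - phi_hat(X[:, j])
            W[j, i] = 0.5 * np.log(np.mean(resid ** 2) / np.mean(X[:, i] ** 2))
    return W

def minimum_weight_directed_tree(W):
    """Edges of the directed spanning tree minimizing the summed edge weights,
    computed with Chu-Liu-Edmonds (NetworkX's minimum spanning arborescence)."""
    p = W.shape[0]
    G = nx.DiGraph()
    for j in range(p):
        for i in range(p):
            if i != j and np.isfinite(W[j, i]):
                G.add_edge(j, i, weight=W[j, i])
    T = nx.minimum_spanning_arborescence(G, attr="weight")
    return sorted(T.edges())
```

The returned edge set corresponds to the estimator that minimizes the total estimated edge weight over all directed spanning trees.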
As the conditional expectation minimizes the MSPE among measurable functions, i.e., ϕji = arg minf:R R E[(Xi f(Xj))2], we especially have, for any i = j, that E[(Xi ϕji(Xj))2] E[(Xi ϕji(Xj))2]. This construction entails, for any G Tp, that w ( G) w( G), and w (G) = w(G). (24) Assumption 1 implies that there exists an m > 0 such that min G Tp\{G} ℓG( G) ℓG(G) = m > 0. (25) Thus, for any G Tp \ {G} it holds that by the identifiability assumption of Equation (25). Now note that ℓG( G) = w( G) + C with C = Pp i=1 log(E[X2 i ])/2 for all G Tp. Hence, we have, for all G Tp \ {G}, that 2 = w(G) + m Structure Learning for Directed Trees by the equality and inequalities in (26) and (24). Thus, we have that P( ˆG = G) = P arg min G=(V, E) Tp (j i) E ˆwji = G | ˆw( G) w ( G)| < m We conclude that it suffices to show that sup G Tp | ˆw( G) w ( G)| P n 0. To this end, let E := {(j i) : i, j V, i = j} \ E and note that sup G Tp | ˆw( G) w ( G)| 2 log E[(Xi ϕji(Xj))2] 2 log E[(Xi ϕji(Xj))2] 2 log E[(Xi ϕji(Xj))2] 2 log E[(Xi ϕji(Xj))2] Now consider a fixed term (j i) E in the second sum of (27). We can upper bound the absolute difference by 2 log E[(Xi ϕji(Xj))2] k=1 (Xk,i ˆϕji(Xk,j))2 ! log E[(Xi ϕji(Xj))2] log(E[X2 i ]) log In the upper bound of (28), the last absolute difference vanishes in probability due to the law of large numbers and the continuous mapping theorem. The first absolute difference Jakobsen, Shah, B uhlmann and Peters also vanishes by the following arguments. Note that, k=1 (Xk,i ˆϕji(Xk,j))2 k=1 (Xk,i ϕji(Xk,j))2 + 1 k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 k=1 (Xk,i ϕji(Xk,j)) (ϕji(Xk,j) ˆϕji(Xk,j)) . Hence, it holds that 1 n k=1 (Xk,i ˆϕji(Xk,j))2 1 k=1 (Xk,j ϕji(Xk,j))2 k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 k=1 (Xk,j ϕji(Xk,j)) (ϕji(Xk,j) ˆϕji(Xk,j)) k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 k=1 (Xk,j ϕji(Xk,j))2 k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2, (29) by Cauchy-Schwarz inequality. By the law of large numbers, we have that the first factor of the second term of (29) converges in probability to a constant, k=1 (Xk,j ϕji(Xk,j))2 P n E[X1,i ϕji(X1,j))2]. The first term and latter factor of the second term of Equation (29) vanish in probability by assumption. That is, for any ε > 0 we have that k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 > ε k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 ε > ε E h 1 n Pn k=1 (ϕji(Xk,j) ˆϕji(Xk,j))2 ε i E h E h (ϕji(X1,j) ˆϕji(X1,j))2 Xn i ε i Structure Learning for Directed Trees using conditional Jensen s inequality (x 7 min(x, ε) = x ε is concave) and the dominated convergence theorem. This proves that k=1 (Xk,j ˆϕji(Xk,j))2 P n E[X1,i ϕji(X1,j))2]. Thus, we have shown that the second term of (27) converges to zero in probability. Finally, the above arguments apply similarly to the first term of Equation (27) by exchanging every ϕji with ϕji. We have shown that sup G Tp | ˆw( G) w ( G)| P n 0, which concludes the proof. Proof of Theorem 9. Assume that for each sample size n N that θn = (G, ...) ΘR with G = (V, E), additive Gaussian noise, and identifiability gap min G Tp\{G} ℓG(G) ℓG( G) = qn > 0, with q 1 n = o( n). For simplicity of the proof, we assume that Eθn[X] = 0 such that the edge weight estimators simplify to ˆwji := ˆw G ji = ˆw G ji(Xn, Xn) = 1 n Pn k=1 (Xk,i ˆϕji(Xk,j))2 1 n Pn k=1 X2 k,i Furthermore, we continue with the notation and population quantities introduced in the proof of Theorem 8, i.e., wji = log(Eθn[(Xi E[Xi|Xj])2])/Eθn[X2 i ])/2, where we notionally have suppressed the dependence on n. We know that for each SCM θn it holds that ℓG(G) + qn ℓG( G), hence w(G) + qn w( G), for all G Tp \ {G}. 
Thus, arg min G=(V, E) Tp (j i) E ˆwji = G | ˆw(G) w(G)| < qn ˆw( G) w( G) qn For any G = (V, E) Tp we have that ˆw( G) w( G) = X (j i) E E ˆwji wji + X (j i) E\E ˆwji wji, where ˆwji and wji denote the estimated and population Gaussian weights for the edge (j i), respectively. Hence, it suffices to show that (j i) E, ε > 0 : Pθn(| ˆwji wji| < qnε) n 1, (j i) E, ε > 0 : Pθn ( ˆwji wji qnε) n 1. Jakobsen, Shah, B uhlmann and Peters To see this, note that if the above statements hold, then Pθn | ˆw(G) w(G)| < qn (j i) E | ˆwji wji| < qn | ˆwji wji| < qn 2(p 1) and for any G = (V, E) Tp Pθn ˆw( G) w( G) qn (j i) E E ˆwji wji + X (j i) E\E ˆwji wji qn | ˆwji wji| qn 2(p 1) ˆwji wji qn 2(p 1) hence the probability of the intersections also converges to one. The causal edges: Now fix (j i) E. We want to show that for all ε > 0 it holds that Pθn(| ˆwji wji| < qnε) n 1. First note that | ˆwji wji| 1 k=1 (Xk,i ˆϕji(Xk,j))2 ! log Eθn[(Xi ϕji(Xj))2] log(Eθn[X2 i ]) log where ˆϕji for each n is the estimated conditional expectation x 7 Eθn[Xi|Xj = x] based on samples from the auxiliary data set. It suffices to show the desired convergence in probability for each of the above terms. Furthermore, for all sequences of positive random variables (Zn) and positive constants c > 0 and for all ε > 0 there exists δ > 0 such that (q 1 n | log(Zn) log(c)| ε) (q 1 n |Zn c| δ), for sufficiently large n. To see this, note that if q 1 n (log(Zn) log(c)) ε, then Zn > exp(log(c) + qnε) = c exp(qnε) c(1 + qnε), so q 1 n (Zn c) cε. On the other hand, if q 1 n (log(Zn) log(c)) ε, then Zn c exp( εqn) c(1 εqn + ε2q2 n), so q 1 n (Zn c) Structure Learning for Directed Trees cε + cε2qn. In summary, if q 1 n | log(Zn) log(c)| ε, then q 1 n |Zn c| cε cε2qn > cε(1 M) =: δ where 1 > M > εqn for sufficiently large n. We conclude that it suffices to show that for all ε > 0 it holds that k=1 (Xk,i ˆϕji(Xk,j))2 Eθn[(Xi ϕji(Xj))2] k=1 X2 k,i Eθn[X2 i ] Equation (31) is satisfied as the summands are mean zero i.i.d. Therefore, with k=1 X2 k,i Eθn[X2 i ], where Eθn[q 1 n Wn] = 0, we have that Eθn[q 2 n W 2 n] = q 2 n n Eθn[(X2 i Eθn[X2 i ])2], hence Pθn(q 1 n Wn ε) q 2 n Eθn[W 2 n] ε2 q 2 n n supn N Eθn[(X2 i Eθn[X2 i ])2] ε2 for any ε > 0 as supn N Eθn X 4 2 < and q 1 n = o( n). Now we show Equation (30). First, we simplify the notation by letting Zk := Xk,i, Yk := Xk,j f := ϕji and ˆf := ˆϕji for all k N. Note that we have suppressed the dependence of f = ϕji on θn. We have that Zk ˆf(Yk) 2 = 1 k=1 (Zk f(Yk))2 + 1 k=1 (f(Yk) ˆf(Yk))2 k=1 (Zk f(Yk))(f(Yk) ˆf(Yk)) =: T1,n + T2,n + T3,n. It suffices to show that for all ε > 0 it holds that (a) Pθn |T1,n Eθn[(Z1 f(Y1))2]| qnε n 0, (b) Pθn (|T2,n| qnε) n 0, and (c) Pθn (|T3,n| qnε) n 0. Jakobsen, Shah, B uhlmann and Peters First we show (a). Each term in the sum of T1,n Eθn[(Z1 f(Y1))2] is mean zero and i.i.d., i.e., q 1 n Eθn[(Zk f(Yk))2 Eθn[(Z1 f(Y1))2]] = 0. Furthermore, Varθn(q 1 n (T1,n Eθn[(Z1 f(Y1))2])) k=1 (Zk f(Yk))2 Eθn[(Z1 f(Y1))2] k=1 Varθn (Zk f(Yk))2 Eθn[(Z1 f(Y1))2] q 2 n n sup n N Varθn (Z1 f(Y1))2 since q 1 n = o( n) and supn N Eθn X 4 2 < . Hence, Pθn |q 1 n (T1,n E[(Z1 f(Y1))2])| ε Varθn(q 1 n (T,n E[(Z1 f(Y1))2])) by Chebyshev s inequality, proving (a). Now we show (b). To that end, note that the terms of T2,n is i.i.d. conditional on Xn. For a fixed 1 > ε > 0 we have Pθn |q 1 n T2,n| ε = Eθn h Pθn q 1 n T2,n ε| Xn 1 i Eθn h Eθn h q 1 n T2,n| Xn i 1 i = Eθn h q 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i where we used the conditional Markov s inequality. 
Now fix 1 > δ > 0 and define An,δ := (q 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i > δ) and note that by assumption there exists an Nδ N such that n Nδ : Pθn(An,δ) < δ. Hence, for n Nδ we have that Eθn h q 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i = Eθn h 1An,δq 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i + Eθn h 1Ac n,δq 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i Eθn h 1An,δq 1 n Eθn h (f(Y1) ˆf(Y1))2| Xn i 1 i + Eθn h 1Ac n,δδ i Eθn 1An,δ + δ = Pθn(An,δ) + δ < 2δ, (32) Structure Learning for Directed Trees hence lim supn Pθn |q 1 n T2,n| ε < 2δ/ε, i.e., Pθn |q 1 n T2,n| ε 0 as δ > 0 was chosen arbitrarily, proving (b). Now we prove (c). To this end, recall that k=1 (Zk f(Yk))(f(Yk) ˆf(Yk)), is, conditional on X, an i.i.d. sum with conditional mean zero Eθn[T3,n| Xn] = 2Eθn[(Zk f(Yk))(f(Yk) ˆf(Yk))| Xn] = 2Eθn[(Eθn[Zk|Yk, Xn] f(Yk))(f(Yk) ˆf(Yk))| Xn] = 2Eθn[(f(Yk) f(Yk))(f(Yk) ˆf(Yk))| Xn] = 0, and conditional second moment given by Eθn[T 2 3,n| Xn] = 4 k=1 Eθn[(Zk f(Yk))2(f(Yk) ˆf(Yk))2| Xn] n Eθn h (Zk f(Yk))2(f(Yk) ˆf(Yk))2| Xn i n Eθn h Eθn h (Zk f(Yk))2| Xn, Yk i (f(Yk) ˆf(Yk))2| Xn i n Eθn h Varθn(Zk|Yk)(f(Yk) ˆf(Yk))2| Xn i n Eθn h (f(Yk) ˆf(Yk))2| Xn i , Pθn-almost surely. Hence, w.l.o.g. assume that 0 < ε < 1 and note that the conditional Markov s inequality yields Pθn(|q 1 n T3,n| ε) = Eθn[Pθn(|q 1 n T3,n| ε| Xn) 1] ε2 Eθn h Eθn h q 2 n T 2 3,n| Xn i 1 i (33) q 2 n n Eθn h (f(Yk) ˆf(Yk))2| Xn i 1 . By conditional Jensen s inequality, we have that Eθn h (f(Yk) ˆf(Yk))2| Xn i 1 + Eθn h (f(Yk) ˆf(Yk))2| Xn i2 1 + Eθn h (f(Yk) ˆf(Yk))4| Xn i . Fix δ > 0 and let An,δ := q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i > δ and note that Pθn(An,δ) n 0, hence there exists an Nδ N such that n Nδ : Pθn(An,δ) < δ. Furthermore, as q 1 n = o( n) there exists an N N such that q 2 n /n < δ for all n N. Similar to the Jakobsen, Shah, B uhlmann and Peters arguments in Equation (32) we then have that C Pθn(|q 1 n T3,n| ε) Eθn 1 + Eθn h (f(Yk) ˆf(Yk))4| Xn i 1 q 2 n n + Eθn q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i 1 q 2 n n + Eθn[1An,δ] + Eθn[1Ac n,δδ] < δ + Pθn(An,δ) + δ < 3δ, for any n Nδ N, so Pθn(q 1 n T3,n ε) n 0, proving (c). The non-causal edges: Now fix (j i) E, we want to show, for any ε > 0 that Pθn( ˆwji wji qnε) n 1, ˆwji wji = 1 k=1 (Xk,i ˆϕji(Xk,j))2 ! log E[(Xi ϕji(Xj))2] # log(E[X2 i ]) log 2(D1,n + D2,n). We have that Pθn( ˆwji wji qnε) Pθn ((D1,n qnε) (|D2,n| < qnε)), where the second event has already been shown to have probability converging to one in Equation (31). Thus, it suffices to show that Pθn (D1,n qnε) n 1. By similar arguments as above we have for any sequence of positive random variables (Kn)n 1 and a positive constant K that for all ε > 0 there exists an δ > 0 such that Pθn (log(Kn) log(K) < qnε) Pθn(Kn K < qnδ), for sufficiently large n N. To see this, note that if log(Kn) log(K) < qnε, then Kn < K exp( εqn) K(1 εqn + ε2q2 n), so q 1 n (Kn K) < Kε + Kε2qn < Kε(1 M) =: δ where 1 > M > εqn for sufficiently large n, since qn 0. Thus, it suffices to show that for any ε > 0 it holds that k=1 (Xk,i ˆϕji(Xk,j))2 Eθn[(Xi ϕji(Xj))2] qnε Structure Learning for Directed Trees Again, we simplify the notation Zk := Xk,i, Yk := Xk,j, f = ϕji and ˆf := ˆϕji for all k N. Now define the following terms Zk ˆf(Yk) 2 = 1 k=1 (Zk f(Yk))2 k=1 {(f(Yk) ˆf(Yk))2 δ2 n,θn} k=1 {(Zk f(Yk))(f(Yk) ˆf(Yk)) + δ2 n,θn/2} =: T1,n + T2,n + T3,n, where δ2 n,θn := Eθn[(f(Y1) ˆf(Y1))2| Xn] = Eθn[(ϕji(Xj) ˆϕji(Xj))2| Xn]. 
It suffices to show that for all ε > 0 it holds that (d) Pθn |T1,n Eθn[(Z1 f(Y1))2]| qnε n 0, (e) Pθn | T2,n| qnε n 0, and (f) Pθn T3,n qnε n 1. Condition (d) holds by arguments similar to (a) for the causal edges. Now we prove (e). The expansion, conditional on Xn, is a sum of mean zero i.i.d. terms, hence Eθn q 2 n T 2 2,n Xn = q 2 n n Eθn h {(f(Yk) ˆf(Yk))2 δ2 n,θn}2| Xn i = q 2 n n Eθn h (f(Yk) ˆf(Yk))4 + (δ2 n,θn)2 2(f(Yk) ˆf(Yk))2δ2 n,θn| Xn i Eθn h (f(Yk) ˆf(Yk))4| Xn i (δ2 n,θn)2 q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i , using that (δ2 n,θn)2 0. Fix 1 > δ > 0 and let An,δ := q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i > δ and note that there exists an Nδ N such that n Nδ : Pθn(An,δ) < δ. Similar to the previous arguments we have for any 1 > ε > 0 and n Nδ that Pθn T2,n qnε = Eθn h Pθn q 1 n T2,n ε Xn 1 i ε2 Eθn h Eθn h q 2 n T 2 2,n| Xn i 1 i q 2 n n Eθn h (f(Yk) ˆf(Yk))4| Xn i 1 Eθn[1An,δ] + Eθn[1Ac n,δδ] < 2δ Jakobsen, Shah, B uhlmann and Peters by the conditional Markov s inequality. Since δ > 0 was chosen arbitrarily, we conclude that (e) holds. Finally we show (f). Recall that in the analysis of the causal edges, we defined k=1 (Zk f(Yk))(f(Yk) ˆf(Yk)). Hence, we have that T3,n = T3,n + δ2 n,θn. We realize that for any 0 < ε < 1 Pθn( T3,n < qnε) Pθn(T3,n + δ2 n,θn qnε) = Pθn T3,n qnε + δ2 n,θn Pθn T 2 3,n qnε + δ2 n,θn 2 Pθn T 2 3,n (qnε)2 = Pθn q 2 n T 2 3,n ε2 = Eθn h Pθn q 2 n T 2 3,n ε2| Xn 1 i ε2 Eθn h Eθn h q 2 n T 2 3,n Xn i 1 i where we used the convergence shown in the proof of (c); see Equation (33). To see that the former arguments apply to non-causal edges, simply note that they did not use any conditions restricted to causal edges. This concludes the proof. D.3 Proofs of Section 4 Lemma D.1 Consider an i.i.d. sequence (Xm)m 1 of random variables with Xm Rd independent from a random infinite sequence X Q i=1 Rd. Let (ψn)n 1 be a sequence of measurable functions s.t. for all n 1, ψn : Rd (Q i=1 Rd) Rq satisfies the following conditions: (a) E[ψn(Xm, X)| X] = 0 almost surely, (b) Σ Rq q : Pn m=1 Var(ψn(Xm, X)| X) P n Σ, and (c) ε > 0 : Pn m=1 E[ ψn(Xm, X) 2+ε 2 | X] P n 0. It holds that m=1 ψn(Xm, X) D n N(0, Σ), Structure Learning for Directed Trees Proof of Lemma D.1. Let the random sequences be defined on a common probability space (Ω, F, P) and define Anm := E[ψn(Xm, X)| X], m=1 Var(ψn(Xm, X)| X), m=1 E[ ψn(Xm, X) 2+ε 2 | X]. By assumption we have that P( n,m(Anm = 0)) = 1, Bn P 0 and Cn P 0 as n . First, note that for any subsequence (nk)k 1 of the positive integers, there exists a subsequence (nkl)l N such that P( lim l Bnkl = 0) = 1 for ( lim l Bnkl = 0) := {ω Ω: lim l Bnkl(ω) = 0}, P( lim l Cnkl = 0) = 1 for ( lim l Cnkl = 0) := {ω Ω: lim l Cnkl(ω) = 0}. Thus, define G := ( n,m(Anm = 0) ( lim l Bnkl = 0) ( lim l Cnkl = 0)) Ω, with P(G) = 1. Now fix x X(G) := { X(ω) Q j=1 Rd : ω G} and note that l 1, 1 m nkl : E[ψnkl(Xm, x)] = 0, m=1 Var(ψnkl(Xm, x)) l Σ, and m=1 E[ ψnkl(Xm, x) 2+ε 2 ] l 0. Furthermore, for any l 1 ψnkl(X1, x), ..., ψnkl(Xnkl, x), are jointly independent, hence by Lyapunov s central limit theorem for triangular arrays (see, e.g., Van der Vaart, 2000, Proposition 2.27, and recall that Lyapunov s condition implies the Lindeberg Feller condition) that m=1 ψnkl(Xm, x) D l Z N(0, Σ). Jakobsen, Shah, B uhlmann and Peters The above convergence in distribution is equivalent to the following statement: for any continuous bounded function g : Rq R it holds that m=1 ψnkl(Xm, x) = E [g(Z)] . 
Fix a continuous and bounded g and note that the above convergence holds for all x X(G) with P(G) = 1. Thus, it must hold that m=1 ψnkl(Xm, X) a.s. l E [g(Z)] . Finally, as (nkl)l 1 is a subsequence of an arbitrary subsequence of positive integers, we have that m=1 ψn(Xm, x) P n E [g(Z)] , and since g is bounded the dominated convergence theorem yields that m=1 ψn(Xm, X) m=1 ψn(Xm, X) n E [g(Z)] . As g was chosen arbitrarily, the above convergence holds for any continuous bounded g. We conclude that m=1 ψn(Xm, X) D n N(0, Σ), proving the theorem. Lemma D.2 (Shah and Peters, 2020, Lemma 19) Let P be a family of distributions for a random variable ζ R and suppose ζ1, ζ2, . . . are i.i.d. copies of ζ. For each n N let Sn = n 1 Pn i=1 ζi. Suppose that for all P P we have EP (ζ) = 0 and EP |ζ|1+η < c for some η, c > 0. We have that for all ε > 0, lim n sup P P P (|Sn| > ε) = 0. Lemma D.3 Let U be a random element and let (Zn)n 1 be an i.i.d. sequence of random variables such that U (Zn)n 1 and let ((Wnm)m n)n 1 be a triangular array of random variables and (gn)n 1 be measurable mappings with the following properties: (a) n 1, m n : Wnm = gn(Zm, U), Structure Learning for Directed Trees (b) η > 0 : E |Wn1|1+η | U = Op(1), as n . Then, writing Wn := Pn m=1 Wnm/n, we have Wn E (Wn1 | U) P n 0. Proof of Lemma D.3. Denote jn(Zm, U) := gn(Zm, U) E[gn(Z1, U)|U], for any n 1 and m n. Let δ > 0 be given. Pick M > 0 and N N such that the events Ωn := n E h |gn(Z1, U)|1+η | U i M o , satisfy P (Ωc n) < δ for n N. Notice that U(Ωn) = n un : E h |gn(Z1, un)|1+ηi M o , since U (Zn)n 1. Fix ε > 0. Then, for all n N P Wn E (Wn | U) > ε = P m=1 jn(Zm, U) m=1 jn(Zm, U) By the dominated convergence theorem, the first term on the RHS converges to 0 if m=1 jn(Zm, U) = sup un U(Ωn) P m=1 jn(Zm, un) which implies the desired statement as δ > 0 was chosen arbitrarily. Now note that for any n N, un U(Ωn) and all m N it holds that E[|jn(Zm, un)|1+η] = E[|gn(Zm, un) E[gn(Z1, un)]|1+η] 2η E[|gn(Zm, un)|1+η] + |E[gn(Z1, un)]|1+η 2η E[|gn(Zm, un)|1+η] + E[|gn(Z1, un)|1+η] < 2η+1M =: c by the cr and Jensen s inequalities, and E[jn(Zm, un)] = 0. Jakobsen, Shah, B uhlmann and Peters For any n N, define the following set of pushforward measures Pn := {P = (jn(Z1, un))(P) : un U(Ωn)}. For any P Pn, let (Ym)m 1 be a sequence of i.i.d. random variables such that Y1 D= jn(Z1, un) for some un U(Ωn). Notice that for all n N and P Pn it holds that EP |Y1|1+η < c and EP [Y1] = 0. Thus, sup un U(Ωn) P m=1 jn(Zm, un) = sup P Pn P 1 n sup P k Pk P 1 n by the weak uniform law of large numbers, Lemma D.2. Lemma D.4 (Asymptotic normality of edge weight components) Let for each sample size n N, ˆϕn ji denote the estimated conditional mean function ϕji based on the auxiliary sample Xn. For any j = i and m n, define ˆRnm,ji := {Xm,i ˆϕn ji(Xm,j)}, ˆµn,ji := 1 m=1 ˆR2 nm,ji, Rm,ji := {Xm,i ϕji(Xm,j)}, µji := E[R2 1,ji], !2 , ˆνn,i := 1 νi := Var(X1,i), δ2 n,ji := E[( ˆϕn ji(X1,j) ϕji(X1,j))2| Xn]. " bΣn,R bΣn,RV bΣ n,RV bΣn,V ˆR2 nm( ˆR2 nm) ˆµnˆµ n ˆR2 nm ˆV m ˆµnˆν n ˆVm( ˆR2 nm) ˆνnˆµ n ˆVm ˆV m ˆνnˆν n denote the p2 p2 matrix empirical covariance matrix, where the squaring of vectors means that each entry is squared. Suppose there exists ξ > 0 such that for all j = i, the following three conditions hold: (i) E X 4+ξ < . (ii) E[| ˆϕn ji(Xj) ϕji(Xj)|4+ξ| Xn] = Op(1), as n . (iii) Σ Rp2 p2 : Var ˆR2 n1 δ2 n µ ˆV1 ν P n Σ, where Σ is constant. 
Structure Learning for Directed Trees Then we have that bΣn P Σ Rp2 p2 and ˆR2 nm δ2 n µ ˆVm ν = n ˆµn δ2 n µ ˆνn ν D N(0, Σ). (34) Proof of Lemma D.4. We prove the lemma under the assumption that E[X] = 0 under which the variance estimator simplify to ˆVm,i := X2 m,i and ˆνn,i := 1 n Pn m=1 ˆVm,i for all 1 i p. The proof only gets more notionally cumbersome without this assumption. It should follow in all generality by applying expansion techniques and Slutsky s theorem similar to the standard arguments showing asymptotic normality of the regular sample variance. Let X denote the auxilliary i.i.d. process such that Xn is the first n-coordinates of said process. Note that conditioning ˆϕn ji on X it is equivalent to conditioning on Xn by the i.i.d. structure of X and that ˆϕn ji only depends on Xn. First, we define for all j = i, n N and m n the following conditional expectation regression error ˆδnm,ji := {ϕji(Xm,j) ˆϕn ji(Xm,j)}. Furthermore, for each n N and m n define Ψn(Xm, X) := ˆR2 nm δ2 n µ ˆVm ν where only Xn (containing the first n coordinates of X) is used, and ψn(Xm, X) := 1 nΨn(Xm, X). Note that the desired conclusion of Equation (34) follows by verifying condition (a), (b) and (c) of Lemma D.1. First, we show (a), the conditional mean zero condition. To that end, note that for any i {1, . . . , p} and j {1, . . . , p} \ {i} it holds that ˆR2 nm,ji = (Xm,i ϕji(Xm,j) + ϕji(Xm,j) ˆϕn ji(Xm,j))2 = (Rm,ji + ˆδnm,ji)2 = R2 m,ji + ˆδ2 nm,ji + 2Rm,jiˆδnm,ji. Hence, we have that ˆR2 nm,ji µji δ2 n,ji = (R2 m,ji µji) + (ˆδ2 nm,ji δ2 n,ji) + 2Rm,jiˆδnm,ji. (35) The terms of Equation (35) are mean zero conditionally on X, since E[R2 m,ji| X] = E[R2 m,ji] = µji, E[ˆδ2 nm,ji| X] = δ2 n,ji and E[Rm,jiˆδnm,ji| X] = E[E[Rm,jiˆδnm,ji| X, Xm,j]| X] = E[E[Xm,i ϕji(Xm,j)| X, Xm,j]ˆδnm,ji| X] = E[(E[Xm,i|Xm,j] ϕji(Xm,j))ˆδnm,ji| X] Jakobsen, Shah, B uhlmann and Peters as ϕji(Xm,j) = E[Xm,i|Xm,j] almost surely. Furthermore, E[X2 m,i Var(Xi)| X] = E[X2 m,i] Var(Xi) = 0. We conclude that E[ψn(Xm, X)| X] = 1 n E ˆR2 nm δ2 n µ ˆVm ν almost surely. With respect to (b), convergence of the sum of variances, we have, by assumption, that Σn := Σn,R Σn,RV Σ n,RV Σn,V := Var Ψn(X1, X)| X P n Σ, where Σ is a positive semi-definite matrix. Furthermore, we have that (Xm)m 1 is an i.i.d. sequence independent of X. Therefore, m=1 Var(ψn(Xm, X)| X) = 1 n Var(Ψn(Xm, X)| X) Finally, we show that condition (c), a conditional Lindeberg-Feller condition, is fulfilled. To this end, note that with ε := ξ/2 > 0 we have that E h ψn(Xm, X) 2+ε 2 X i ˆR2 nm δ2 n µ ˆVm ν ˆR2 nm δ2 n µ ˆVm ν i =j E h | ˆR2 nm,ji µji δ2 n,ji|2+ε| X i i=1 E|X2 m,i Var(Xi)|2+ε , (36) by the cr inequality. We now realize that the second factor of Equation (36) is stochastically bounded. To see this, note that for any j = i it holds that E h | ˆR2 nm,ji µji δ2 n,ji|2+ε| X i 21+ε(E[| ˆRnm,ji|4+2ε| X] + µ2+ε ji + E[|δ2 n,ji( X)|2+ε| X]). Structure Learning for Directed Trees The first term of the upper bound in Equation (37) is Op(1), E[| ˆRnm,ji|4+2ε| X] = E[|Xm,i ˆϕn ji(Xm,j)|4+2ε| X] 23+2ε(E|Xm,i ϕji(Xm,j)|4+2ε + E[|ϕji(Xm,i) ˆϕn ji(Xm,j)|4+2ε| X]) = 23+2ε(E[|Rm,ji|4+ξ] + E[|ˆδnm,ji|4+ξ| X]) = Op(1), as E X 4+ξ 2 < and E[|ˆδnm,ji|4+ξ| X] = Op(1). This holds because Rm,ji = {Xm,i E[Xm,i|Xm,j]} and both terms are in L4+ξ(P) if Xm,i L4+ξ(P) which is guaranteed as E X 4+ξ 2 < . 
For the third term in the upper bound of Equation (37), we note that by the conditional Jensen s inequality, we have that E[|δ2 n,ji|2+ε| X] E[|ϕji(Xm,i) ˆϕn ji(Xm,j)|4+2ε| X] = E[|ˆδnm,ji|4+ξ| X] = Op(1), by assumption. Therefore, we have that m=1 E h Ψn(Xm, X) 2+ε 2 X i n 2 Op(1) = n ε/2Op(1) P n 0, proving the conditional Lindeberg-Feller condition. By Lemma D.1 it holds that m=1 ψn(Xm, X) D n N(0, Σ). Now it only remains to prove that bΣn Σn P 0, or, equivalently, that each entry converges to zero in probability. For example, for the entries of the first block matrix with j = i and l = r we prove that |bΣn,R,ji,lr Σn,R,ji,lr| P 0. Now note that the observable estimated covariance matrix entry is given by bΣn,R,ji,lr = 1 m=1 ˆR2 nm,ji ˆR2 nm,lr ˆµn,jiˆµn,lr, while the unobservable conditional covariance matrix is given by Σn,R,ji,lr = E[( ˆR2 nm,ji µji δ2 n,ji)( ˆR2 nm,lr µlr δ2 n,lr)| X] = E[ ˆR2 nm,ji ˆR2 nm,lr| X] (µji + δ2 n,ji)(µlr + δ2 n,lr) = E[ ˆR2 nm,ji ˆR2 nm,lr| X] E[ ˆR2 nm,ji| X]E[ ˆR2 nm,lr| X], Jakobsen, Shah, B uhlmann and Peters where we have used that E[ ˆR2 nm,ji| X] = µji + δ2 n,ji; see Equation (35) and its discussion. Note that the second term of the covariance matrix estimator expands to ˆµn,jiˆµn,lr = m=1 ˆR2 nm,ji m=1 ˆR2 nm,lr m=1 ˆR2 nm,ji E[ ˆR2 nm,ji] m=1 ˆR2 nm,lr E[ ˆR2 nm,lr] E[ ˆR2 nm,ji]E[ ˆR2 nm,lr] m=1 ˆR2 nm,ji E[ ˆR2 nm,lr] m=1 ˆR2 nm,lr E[ ˆR2 nm,ji], |bΣn,R,ji,lr Σn,R,ji,lr| m=1 ( ˆR2 nm,ji ˆR2 nm,lr E[ ˆR2 nm,ji ˆR2 nm,lr| X]) m=1 ˆR2 nm,ji E[ ˆR2 nm,ji| X] m=1 ˆR2 nm,lr E[ ˆR2 nm,lr| X] m=1 ( ˆR2 nm,ji E[ ˆR2 nm,lr| X] E[ ˆR2 nm,ji| X]E[ ˆR2 nm,lr| X]) m=1 ( ˆR2 nm,lr E[ ˆR2 nm,ji| X] E[ ˆR2 nm,ji| X]E[ ˆR2 nm,lr| X]) . (38) Each of these terms tends to zero in probability by Lemma D.3. For example, for the first term of Equation (38) it suffices to show that E h | ˆR2 nm,ji ˆR2 nm,lr|1+ε| X i = Op(1), for some ε > 0. Fix ε = ξ/4 and note, by the cr-inequality, that ˆR2 nm,ji ˆR2 nm,lr = (Xm,i ˆϕn ji(Xm,j))2(Xm,r ˆϕn lr(Xm,l))2 4(R2 m,ji + ˆδ2 nm,ji)(R2 m,lr + ˆδ2 nm,lr). Structure Learning for Directed Trees Thus, by the cr-inequality and the conditional Cauchy-Schwarz inequality we have, with c = 41+ε22ε, that c 1E[| ˆR2 nm,ji ˆR2 nm,lr|1+ε| X] c 141+εE[|R2 m,ji + ˆδ2 nm,ji|1+ε|R2 m,lr + ˆδ2 nm,lr|1+ε| X] E[(|Rm,ji|2+2ε + |ˆδnm,ji|2+2ε)(|Rm,lr|2+2ε + |ˆδnm,lr|2+2ε)| X] E[|Rm,ji|2+2ε|Rm,lr|2+2ε| X] + E[|Rm,ji|2+2ε|ˆδnm,lr|2+2ε| X] + E[|ˆδnm,ji|2+2ε|Rm,lr|2+2ε| X] + E[|ˆδnm,ji|2+2ε|ˆδnm,lr|2+2ε| X] E[|Rm,ji|4+ξ]E[|Rm,lr|4+ξ] + E[|Rm,ji|4+ξ]E[|ˆδnm,lr|4+ξ| X] + E[|ˆδnm,ji|4+ξ| X]E[|Rm,lr|4+ξ] + E[|ˆδnm,ji|4+ξ| X]E[|ˆδnm,lr|4+ξ| X] as E[|ˆδnm,ji|4+ξ| X] = Op(1) for all j = i by assumption and E[|Rm,ji|4+ξ] < since E X 4+ξ 2 < . Similar arguments show convergence in probability of the entries in the other block submatrices of bΣn less Σn, yielding the desired conclusion. Proof of Theorem 10. We prove the theorem under the simplifying assumption that E[X] = 0 for which we can simplify the variance estimator by ˆVm,i := X2 m,i and ˆνn,i := 1 n Pn m=1 ˆVm,i for all 1 i p. First, note (using the notation introduced in Lemma D.4) that ˆ M1 = { ˆR2 n1,ji}j =i, ˆµ = ˆµn, ˆν = ˆνn and bΣ = bΣn. The conditional mean of ˆ M1 given Xn is given by E[ ˆ M1| Xn] = E[{ ˆR2 n1,ji}j =i| Xn] = µ + δ2 n, see Equation (35). Similarly we have that E[ ˆV1| Xn] = E[ ˆV1] = ν. 
Subtracting a constant (conditional on Xn) does not change the conditional variance, hence Var ˆR2 n1 δ2 n µ ˆV1 ν = Var ( ˆ M 1 , ˆV 1 ) Xn Σ is constant and positive semi-definite with strictly positive diagonal. As such, the conditions of Lemma D.4 is satisfied, which yields that ˆR2 nm δ2 n µ ˆVm ν = n ˆµ δ2 n µ ˆν ν D n N(0, Σ), (39) " bΣM bΣMV bΣ MV bΣV P Σ =: ΣM ΣMV Σ MV ΣV Jakobsen, Shah, B uhlmann and Peters For any j = i we denote ˆµji δ2 n,ji ˆνi where the latter is a shorthand notation for the Gaussian edge weight w G ji. Fix α (0, 1). First, consider (j i) E and note that n ˆµji µji ˆνi νi ˆµji δ2 n,ji µji ˆνi νi = n δ2 n,ji 0 = n E[ˆδ2 nm,ji| Xn] 0 P n 0, (40) by assumption (iv). Hence, Equation (39), Equation (40) and the delta method yields that n ( ˆwji wji) = n log ˆµji = n(log(ˆµji) log(µji) log(ˆνi) + log(νi)) D n N(0, σ2 ji), ˆσ2 ji := bΣM,ji ˆµ2 ji + bΣV,i ˆν2 i 2 bΣMV,ji,i P σ2 ji := ΣM,ji µ2 ji + ΣV,i ν2 i 2ΣMV,ji,i Here bΣM,ji and bΣV,i and their limits use a shorthand notation that denote the corresponding diagonal element, e.g., bΣM,ji := bΣM,ji,ji. An asymptotically valid marginal confidence interval for wji with level α is, by virtue of the above convergence in distribution, given by ˆwji ˆσji q(1 α where q(1 α 2 ) is the 1 α/2 quantile of the standard normal distribution. That is, P ˆwji ˆσji q(1 α 2 ) 2 n wji ˆwji + ˆσji q(1 α On the other hand, for any (j i) E we have, by similar arguments, except that no assumption guarantees that nδ2 n,ji vanishes, that P wji σji q(1 α 2 ) 2 n wji wji + σji q(1 α σ2 ji := bΣM,ji (ˆµji δ2 n,ji)2 + bΣV,i ˆν2 i 2 bΣMV,ji,i (ˆµji δ2 n,ji)ˆνi P σ2 ji := ΣM,ji µ2 ji + ΣV,i ν2 i 2ΣMV,ji,i Structure Learning for Directed Trees by the convergence in Equation (39). Note that σ2 ji is not observable since δ2 n,ji is not observable. Now define ˆuα,ji, ˆlα,ji := ˆwji ˆσji q 1 α 2p(p 1) uα,ji, lα,ji := wji σji q 1 α 2p(p 1) for all j = i. Thus, we have the following Bonferroni corrected simultaneous confidence interval for the Gaussian edge weights lim inf n P wji h ˆlα,ji, ˆuα,ji i \ wji h lα,ji, uα,ji i The above confidence region has the correct asymptotic level, but it is infeasible to compute in that wji, σji and E are not directly observable from data. Furthermore, define C(ˆlα, lα, ˆuα, uα) := arg min G=(V, E) Tp (j i) E w ji : (j i) E, w ji [ˆlα,ji, ˆuα,ji], (j i) E, w ji [ lα,ji, uα,ji] , and note that this is an unobservable confidence region for the causal graph. That is, lim inf n P(G C(ˆlα, lα, ˆuα, uα)) lim inf n P (j i) E (wji [ˆlα,ji, ˆuα,ji]) \ (j i) E (wji [ lα,ji, uα,ji]) Our proposed confidence region has the form ˆC := C(ˆlα, ˆuα) := arg min G=(V, E) Tp (j i) E w ji : j = i, w ji [ˆlα,ji, ˆuα,ji] , which corresponds to the biased but computable confidence region j =i [ˆlα,ji, ˆuα,ji] = Y ˆwji ˆσji q 1 α 2p(p 1) for the Gaussian edge weights, where the product is over all combinations of possible edges 1 j = i p. The biased confidence region Q j =i[ˆlα,ji, ˆuα,ji] does not necessarily contain the population Gaussian edge weights with a probability of at least 1 α in the large sample limit. However, it can be used to construct a conservative confidence region for the causal Jakobsen, Shah, B uhlmann and Peters graph. To see this, note that by further penalizing the wrong (non-causal) edge weights, the causal graph still yields the minimum edge weight directed spanning tree. 
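(As a computational aside, not part of the proof: the simultaneous bounds and their conservative use described above are easy to evaluate. The sketch below is illustrative only; `W_hat` and `SE_hat` are assumed to hold the estimated edge weights and their estimated standard errors, and the restricted-score routine is one possible way of encoding a substructure restriction R = (E_R, E^miss_R, r) by removing incompatible edges before rerunning Chu-Liu-Edmonds, in the spirit of the Check C test of Section 4.)

```python
# Illustrative sketch (not the paper's code): Bonferroni-corrected bounds
# [l_hat, u_hat] for the Gaussian edge weights, and a conservative check of a
# substructure restriction that rejects when the best restricted score under
# the lower bounds exceeds the unrestricted score under the upper bounds.
import numpy as np
import networkx as nx
from scipy.stats import norm

def edge_weight_bounds(W_hat, SE_hat, n, alpha=0.05):
    """W_hat, SE_hat: (p, p) arrays; set W_hat's diagonal to np.inf and
    SE_hat's diagonal to 0 (diagonal entries are never used)."""
    p = W_hat.shape[0]
    q = norm.ppf(1 - alpha / (2 * p * (p - 1)))  # Bonferroni over p(p-1) edges
    half = q * SE_hat / np.sqrt(n)
    return W_hat - half, W_hat + half

def arborescence_score(W):
    """Minimum total edge weight over all directed spanning trees."""
    p = W.shape[0]
    G = nx.DiGraph()
    for j in range(p):
        for i in range(p):
            if i != j and np.isfinite(W[j, i]):
                G.add_edge(j, i, weight=W[j, i])
    T = nx.minimum_spanning_arborescence(G, attr="weight")
    return sum(d["weight"] for _, _, d in T.edges(data=True))

def substructure_rejected(L, U, forced_edges=(), missing_edges=(), root=None):
    """Conservative rejection rule: compare the restricted score under the
    lower bounds L with the unrestricted score under the upper bounds U.
    Assumes the restriction still admits a directed spanning tree."""
    L_res = L.copy()
    for (j, i) in forced_edges:   # encode (j -> i) by forbidding other parents of i
        L_res[:, i] = np.inf
        L_res[j, i] = L[j, i]
    for (j, i) in missing_edges:  # forbid the edge (j -> i)
        L_res[j, i] = np.inf
    if root is not None:          # forbid any edge into the hypothesized root
        L_res[:, root] = np.inf
    return arborescence_score(L_res) > arborescence_score(U)
```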
Hence, lim inf n P(G C(ˆlα, ˆuα)) lim inf n P (j i) E (wji [ˆlα,ji, ˆuα,ji]) \ (j i) E (wji [ lα,ji, uα,ji]) \ (j i) E ( uα,ji ˆuα,ji) as P ( uα,ji ˆuα,ji) n 1 for all (j i) E by Lemma D.5 below. Lemma D.5 Suppose that the assumptions of Lemma D.4 hold. It holds that (j i) E, α (0, 1) : P ( uα,ji ˆuα,ji) n 1. Proof of Lemma D.5. Fix any (j i) E and α (0, 1) and note that we want to show that uα,ji ˆuα,ji wji + c σji n ˆwji + c ˆσji n 0 log (ˆµji) + c ˆσji n log ˆµji δ2 n,ji c σji n holds with probability converging to one, where c is a strictly positive constant. It suffices to show that an even smaller quantity is non-negative with probability converging to one. That is, it suffices to show that 0 log (ˆµji) + c ˆσji n log ˆµji δ2 n,ji c σ ji n, with increasing probability, where bΣM,ji (ˆµji δ2 n,ji)2 + bΣV,i ˆν2 i + 2 |bΣMV,ji,i| (ˆµji δ2 n,ji)ˆνi σji, with P( σ ji > 0) n 1. Let dn(t) : [0, ) R denote the random function given by dn(t) := log (ˆµji) + c ˆσji n log (ˆµji t) bΣM,ji (ˆµji t)2 + bΣV,i ˆν2 i + 2 |bΣMV,ji,i| (ˆµji t)ˆνi . It holds that dn(0) = 0 surely, so by the mean value theorem, the desired conclusion holds if it with probability one (as n tends to infinity) holds, for all t [0, δ2 n,ji], that d n(t) 0 . Structure Learning for Directed Trees Now fix η > 0 and choose Mη, ε1, . . . , ε5 > 0 such that the constant lower bounds in the following inequalities are strictly positive Ωn(1) : = (ˆµji Mη), Ωn(2) : = (ΣM,ji ε1 bΣM,ji ΣM,ji + ε1), Ωn(3) : = (ΣV,i ε2 bΣV,i ΣV,i + ε2), Ωn(4) : = (0 |bΣMV,ji,i| |ΣMV,ji,i| + ε3), Ωn(5) : = (µji ε4 ˆµji δ2 n,ji µji + ε4), Ωn(6) : = (νi ε5 ˆνi νi + ε5), and lim infn P(Ωn(1)) > 1 η. This is possible as ˆµji δ2 n,ji P n µji > 0 and δ2 n,ji = E[|ˆδnm,ji|2| X] = E[|ˆδnm,ji| 4+ξ 2+ξ/2 | X] E[|ˆδnm,ji|4+ξ| X] 1 2+ξ/2 = Op(1), by the conditional Jensen s inequality and concavity of [0, ) x 7 x 1 2+ξ/2 , which implies that ˆµji = (ˆµji ˆδ2 n,ji µji) + (ˆδ2 n,ji + µji) = op(1) + Op(1) = Op(1). Furthermore, as bΣM,ji P n ΣM,ji > 0, bΣV,i P n ΣV,i > 0, |bΣMV,ji,i| P n |ΣMV,ji,i| 0, ˆνi P n νi > 0, it holds that lim sup n P 1 k 6 Ωn(k)c 1 k 6 lim sup n P(Ωn(k)c) = lim sup n P(Ωn(1)c) η. Here we used that the diagonal elements of the limit covariance matrix are assumed strictly positive. That µji, νi > 0 follows from the fact that Xi E[Xi|Xj] is assumed to have a density (w.r.t. Lebesgue measure) and that the variables are non-degenerate νi = Var(Xi) > 0. Thus, we have that lim inf n P 1 k 6 Ωn(k) Now consider a fixed ω T 1 k 6 Ωn(k) and note that with gn : [0, δ2 n,ji] R given by gn(t) = ˆµji t we have that gn is decreasing and that gn([0, δ2 n,ji]) [µji ε4, ˆµji] (0, Mη] Jakobsen, Shah, B uhlmann and Peters We have for any t [0, δ2 n,ji] that d n(t) = 1 ˆµji t c n bΣM,ji (ˆµji t)2 + bΣV,i ˆν2 i + 2|bΣMV,ji,i| (ˆµji t)ˆνi bΣM,ji (ˆµji t)3 + |bΣMV,ji,i| (ˆµji t)2ˆνi d n(t) = 1 ˆµji t c n gn(t)2 + bΣV,i ˆν2 i + 2|bΣMV,ji,i| gn(t)3 + |bΣMV,ji,i| ˆµ2 ji + bΣV,i bΣM,ji (ˆµji δ2 n,ji)3 + |bΣMV,ji,i| (ˆµji δ2 n,ji)2ˆνi M2η + ΣV,i ε2 (µji ε4)3 + |ΣMV,ji,i| + ε3 (µji ε4)2(νi ε5) =: 1 Mη CMη,ε1,ε2,ε3,ε4,ε5 n for n (CMη,ε1,ε2,ε3,ε4,ε5Mη)2. We conclude that for n (CMη,ε1,ε2,ε3,ε4,ε5Mη)2 P ( uα,ji ˆuα,ji) = P 0 log (ˆµji) + c ˆσji n log ˆµji δ2 n,ji c σji n P t [0, δ2 n,ji] : d n(t) 0 1 k 6 Ωn(k) lim inf n P ( uα,ji ˆuα,ji) lim inf n P 1 k 6 Ωn(k) and as η > 0 was chosen arbitrarily, we have the desired conclusion P ( uα,ji ˆuα,ji) n 1. Structure Learning for Directed Trees Proof of Theorem 11. 
Consider a collection of arbitrary and possibly data-dependent substructures R1, R2, ... and level α (0, 1). First, we note that the score associated with two sets of edge weights w1 and w2 is weakly monotone, that is, STp(w1) STp(w2) if w1 and w2 satisfy the component-wise partial ordering w1 w2. Furthermore, the restricted score function w 7 STp(R)(w) is also weakly monotone for any set of restrictions R. Let k N and suppose that the null hypothesis H0(Rk) : ERk \ E = , E \ Emiss Rk = , rk = rt(G), corresponding to the restriction Rk = (ERk, Emiss Rk , rk) is true. If there is a graph in ˆCBon := ˆC(ˆlα, ˆuα) satisfying the restrictions imposed by the substructure Rk, then there exist ˆlα w ˆuα such that STp(w ) attains its minimum value in a graph satisfying Rk. Penalizing (or removing) edges that are not present in the minimum edge weight directed tree does not affect the score of the minimum edge weigh directed tree. Hence, it holds that STp(Rk)(w ) = STp(w ). Monotonicity of STp(Rk) and STp in the edge weights imply that STp(Rk)(ˆlα) STp(Rk)(w ) = STp(w ) STp(ˆuα). Hence, STp(Rk)(ˆlα) > STp(ˆuα) entails that no graph in ˆC satisfies the restrictions of Rk. (This is a slightly conservative criterion as STp(Rk)(ˆlα) STp(ˆuα) does not necessarily guarantee that a graph in ˆCBon satisfies the restrictions of Rk.) Therefore, if ψCheck C Rk = 1, then we know that there is no graph in ˆCBon satisfying the restrictions of Rk. As the causal graph G satisfies the restriction Rk we conclude that G is not contained in ˆCBon. Thus for any true Rk we have that (ψCheck C Rk = 1) (G ˆCBon). Since this holds for any true Rk, the conclusion follows by noting that lim sup n P k:H0(Rk) is true (ψCheck C Rk = 1) lim sup n P(G ˆCBon) α, where we used Theorem 10. For the claim about the level guarantee of the Conv B test, let k N and consider a true substructure restriction Rk = (ERk, Emiss Rk , rk). Suppose that G ˆCBon. This implies that there exist ˆlα w ˆuα such that STp(w ) attains its minimum value in a graph satisfying Rk. Now let w = (w ji)j =i be given by w ji = ˆuji if [ l = j : (l i) ERk] [(i j) ERk] [(j i) Emiss Rk ] [i = r], w ji otherwise, Jakobsen, Shah, B uhlmann and Peters where we penalize edges that are in disagreement with the substructure restriction Rk. It is clear that the MWDST using the edge weights w and w , i.e., G Tp(w ) and G Tp(w ), both satisfy the substructure restriction Rk. However, as w is unknown, so is w . We lower bound the unknown w by ˆl and define ˇw = ( ˇwji)j =i as ˇwji = ˆuji if [ l = j : (l i) ERk] [(i j) ERk] [(j i) Emiss Rk ] [i = r], ˆlji otherwise, Now, the MWDST G Tp( w) may use edges that are in disagreement with Rk or not satisfy Rk. (For example, consider a three node causal graph V = {1, 2, 3} with edges 1 2 3 and consider the substructure restriction ERk = {(1 2)}. Now it may happen that ˆl12 + ˆl23 > ˆl13 + ˆu32 or ˆl12 + ˆl23 > ˆl23 + ˆl31, that is, the MWDST G Tp( w) does not satisfy the substructure restriction.) We now argue that this happens with probability tending to zero. By the assumed identifiability, i.e., that Assumption 1 holds, we have that := min G Tp\G ℓG( G) ℓG(G) > 0. (41) Now consider the events (An)n N (that are independent of k) given by |ˆlji w G ji| < p 1 We realize that on An it must hold that the MWDST G Tp( w) satisfies Rk. Thus, for a true substructure restriction Rk we have that An (G ˆCBon) (ψConv B Rk = 0) Ac n (G ˆCBon) (ψConv B Rk = 1), for all n N. 
Hence, we have that lim sup n P k:H0(Rk) is true (ψConv B Rk = 1) = lim sup n P (G ˆCBon) Ac n lim sup n P(G ˆCBon) + lim sup n P(Ac n) by Theorem 10, proving the claim. It only remains to argue that lim supn P(Ac n) = 0. To that end, note that by Theorem 10 it holds that ˆΣ P n Σ and that n ˆµ δ2 n ˆν D n N(0, Σ). By the strengthened assumptions, i.e., the n-convergence for the non-causal edges, we have that (see the arguments for the causal edges from Theorem 10) D n N(0, Σ). Structure Learning for Directed Trees Thus, for any j = i, that ˆuji, ˆlji = 1 zα ˆσji 2 n P n 1 2 log µji = w G ji, (42) since ˆµ P n µ, ˆν P n ν, and ˆσji P n σji. The convergence statements in Equation (42) obviously implies that P(An) n 1, since is strictly positive, see Equation (41)). This concludes the proof. D.4 Proofs of Section 5 D.4.1 Proofs of first results in Section 5 Proof of Lemma 12. As conditioning reduces entropy we always have that ℓCE( G, i) = h(Xi|Xpa G(i)) = h(Xi E[Xi|Xpa G(i)]|Xpa G(i)) h(Xi E[Xi|Xpa G(i)]) = ℓE( G, i). Furthermore, note that when conditioning we throw out dependence information captured by the mutual information I(Xi E[Xi|Xpa G(i)]; Xpa G(i)), which is zero if and only if Xi E[Xi|Xpa G(i)] Xpa G(i). This is especially the case for the true graph, i.e., Xi E[Xi|Xpa G(i)] Xpa G(i), implying that ℓCE(G, i) = ℓE(G, i). Consequently, we have that the local conditional entropy score gap lower bounds the local entropy score gap, ℓCE( G, i) ℓCE(G, i) ℓE( G, i) ℓE(G, i). Furthermore, from the arguments in the proof of Lemma 6 we have that ℓE( G, i) = inf Ni P Ni P h Xi E h Xi|Xpa G(i) i , Ni inf Ni P Ni PG h Xi E h Xi|Xpa G(i) i , Ni = ℓG( G, i) + log( If X is generated by a causal additive tree model with Gaussian noise, i.e., with generating SCM θ = (G, (fi), PN) with PN Pp G, then ℓE(G, i) = h(Xi E[Xi|Xpa G(i)]) = h(Ni) = log( 2πeσi) = log( 2 log(E[N2 i ]) = log( 2πe) + l G(G, i), in which case the local entropy score gap lower bounds the local Gaussian score gap ℓE( G, i) ℓE(G, i) ℓG( G, i) ℓG(G, i). Jakobsen, Shah, B uhlmann and Peters Proof of Lemma 13. Note that E[Y |X] = E[f(X) + NY |X] = f(X) + E[NY |X] = f(X) + E[NY ], since NY NX = X. Hence, the score difference can be written as ℓE( G) ℓE(G) = ℓE( G, X) ℓE(G, X) + ℓE( G, Y ) ℓE(G, Y ) = h(X E(X|Y )) h(X) + h(Y ) h(Y E(Y |X)) = h(X E(X|Y )) h(X) + h(Y ) h(NY + E[NY ]) = h(X E(X|Y )) h(X) + h(Y ) h(NY ), as the differential entropy is translation invariant. Now note that as NY NX it holds that NY f(X), so conditioning on f(X) yields that h(Y ) = h(Y |f(X)) + I(Y ; f(X)) = h(f(X) + NY |f(X)) + I(Y ; f(X)) = h(NY ) + I(Y ; f(X)). Similarly, conditioning on X yields that h(Y ) = h(Y |X) + I(Y ; X) = h(NY ) + I(Y ; X), which proves that I(Y ; f(X)) = I(Y ; X). This equality is normally derived by restricting f to be bijective, but here it holds regardless by the structural assignment form, as Y only depends on X through f(X). Furthermore, we have that h(X E[X|Y ]) = I(X E[X|Y ]; Y ) + h(X E[X|Y ]|Y ) = I(X E[X|Y ]; Y ) + h(X|Y ). h(X E[X|Y ]) h(X) = I(X E[X|Y ]; Y ) + h(X|Y ) h(X) = I(X E[X|Y ]; Y ) I(Y ; X). ℓE( G) ℓE(G) = h(X E[X|Y ]) h(X) + h(Y ) h(NY ) = I(X E[X|Y ]; Y ) I(Y ; X) + h(NY ) + I(Y ; f(X)) h(NY ) = I(X E[X|Y ]; Y ) I(Y ; X) + I(Y ; f(X)) = I(X E[X|Y ]; Y ), proving the claim. Structure Learning for Directed Trees Proof of Proposition 14. As the conditional mean E[X|Y ] vanishes, we have that ℓE( G) ℓE(G) = I(X E(X|Y ); Y ) = I(Y ; f(X)), where the last equality was derived in the proof of Lemma 13. 
Now let f(X)G and NG Y be independent normal distributed random variables with the same mean and variance as f(X) and NY . That is, f(X)G N(E[f(X)], Var(f(X))), NG Y N(E[NY ], Var(NY )) with NG Y f(X)G such that f(X)G + NG Y N(E[f(X)] + E[NY ], Var(f(X)) + Var(NY )). (a) If DKL(f(X) f(X)G) DKL(NY NG Y ) then by Lemma C.1 of Silva (2009) we have, since X NY , that I(Y ; f(X)) = I(f(X) + NY ; f(X)) I(f(X)G + NG Y ; f(X)G), Note, we have equality if and only if f(X) and NY are jointly Gaussian. Furthermore, I(f(X)G + NG Y ; f(X)G) = h(f(X)G + NG Y ) h(f(X)G + NG Y |f(X)G) = h(f(X)G + NG Y ) h(NG Y ) 2π(Var(f(X)) + Var(NY ))) log( p 2πVar(NY )) 2 log Var(f(X)) + Var(NY ) 2 log 1 + Var(f(X)) (b) If f(X)+NY is log-concave distributed, then by Theorem 3 of Marsiglietti and Kostina (2018) we have that h(f(X) + NY ) 1 2 log (4Var(f(X) + NY )) = 1 2 log (4(Var(f(X)) + Var(NY )) . Furthermore, it is well known that for fixed variance, the normal distribution maximizes entropy, hence h(NY ) h(NG Y ) = 1 2 log (2πVar(NY )) . Therefore, we get that I(Y ; f(X)) = I(f(X) + NY ; f(X)) = h(f(X) + NY ) h(f(X) + NY |f(X)) = h(f(X) + NY ) h(NY ) 2 log (4(Var(f(X)) + Var(NY )) 1 2 log (2πe Var(NY )) πe Var(f(X)) Jakobsen, Shah, B uhlmann and Peters which yields a strictly positive lower bound if and only if πe Var(f(X)) Var(NY ) > 1 Var(f(X)) Var(NY ) > πe Lemma D.6 Two different but Markov equivalent trees G and ˆG share the exact same edges except for a single reversed directed path between the two root nodes of the graphs, ˆG : c1 c2 cr 1 cr, G : cr cr 1 c2 c1, with c1 = rt( ˆG) and cr = rt( G). Proof of Lemma D.6. First, note that there always exists a unique directed path in ˆG from rt( ˆG) to rt( G) ˆG : rt( ˆG) = c1 cr 1 cr = rt( G). Since G and ˆG are Markov equivalent, they share the same skeleton, so in G the above path must be reversed. That is, there exists a unique directed path in G from rt( G) to rt( ˆG) given by G : rt( G) = cr cr 1 c1 = rt( ˆG), If r = p we are done, so assume r < p. As ˆG is a directed tree there must exists a node z2 which is not a part of the above path but is a child of a node in the path. That is, there exists a node z1 {c1, . . . , cr} such that ˆG contains the edge ˆG : z1 z2. Furthermore, by equality of skeleton, this edge must also be present in G, Assume for contradiction that z2 z1 in G. As such, it must hold that z1 = cr = rt( G) for otherwise if z1 {c1, . . . , cr 1} then z1 would have two parents in G, a contradiction since G is a directed tree. However, if z1 = cr = rt( G) then there is an incoming edge into the root node, a contradiction. We conclude that the directed edge z1 z2 also is present in G. Any paths further out on this branch will coincide in both graphs for otherwise there exists nodes with two parents. These arguments show that any paths branching out from the main reversed path will coincide in both ˆG and G. Thus, the two graphs coincide up to a directed path between root nodes that is reversed. Structure Learning for Directed Trees Proof of Proposition 15. By Lemma D.6 there exists a path reversal G : rt(G) = c1 c2 cr 1 cr = rt( G), G : rt( G) = cr cr 1 c2 c1 = rt(G), while all other edges in G = (V, E) and G = (V, E) coincide. Hence, the entropy score difference reduces to ℓE( G) ℓE(G) = h(Xrt( G)) + X (j,i) E h(Xi E[Xi|Xj]) h(Xrt(G)) X (j,i) E h(Xi E[Xi|Xj]) i=1 h(Xci E[Xci|Xci+1]) i=2 h(Xci E[Xci|Xci 1]). h(Xcr) h(Xc1) = i=1 h(Xci) = i=1 h(Xci+1) h(Xci). 
ℓE( G) ℓE(G) i=1 h(Xci E[Xci|Xci+1]) + h(Xci+1) h(Xci+1 E[Xci+1|Xci]) h(Xci) i=1 ℓE(ci L99 ci+1) min 1 i r 1 ℓE(ci L99 ci+1), which concludes the proof. D.4.2 Proof of Theorem 16 We first describe the graphs that result from the reduction technique described in 5.3. To do so, define L(G, G) := {L VR : ch GR(L) = (pa GR(L) = pa GR(L) ch GR(L) = )}, Jakobsen, Shah, B uhlmann and Peters containing the sink nodes in GR that are either not sink nodes in GR or sink nodes in GR with different parents: pa GR(L) = pa GR(L). Now fix any L L(G, G) VR and note that its only parent in GR, pa GR(L), is either also a parent of L, a child of L or not adjacent to L, in GR. That is, one and only one of the following sets is non-empty Z(L) : = pa GR(L) pa GR(L), ( staying parents ) Y (L) : = pa GR(L) ch GR(L), ( parents to children ) W(L) : = pa GR(L) (V \ {L ch GR(L) pa GR(L)}) ( removing parents ) We define the GR parent and children of L that are not adjacent to L in GR as D(L) : = pa GR(L) (V \ {L ch GR(L) pa GR(L)}), and O(L) : = ch GR(L) (V \ {L ch GR(L) pa GR(L)}), respectively. All such sets contain at most one node and by slight abuse of notation, we use the same letters to refer to the nodes. We will henceforth suppress the dependence on L if the choice is clear from the context. Figure 2 visualizes the above sets. Now partition Tp \ {G} into the three following disjoint partitions for which there exists a reduced graph sink node L L(G, G) such that W(L), Y (L) and Z(L) is non-empty, respectively. That is, we define Tp(G, W) : = { G Tp \ {G} : L L(G, G) s.t. W(L) = }, Tp(G, Y ) : = { G Tp \ {G} : L L(G, G) s.t. Y (L) = } \ Tp(G, W), Tp(G, Z) : = { G Tp \ {G} : L L(G, G) s.t. Z(L) = } \ (Tp(G, W) Tp(G, Y )). Using that Tp(G, W) Tp(G, Y ) Tp(G, Z) = Tp(G), we can now find a lower bound for the score gap that holds uniformly over all alternative directed tree graphs Tp \ {G}: min G Tp\{G} ℓE( G) ℓE(G) = min min G Tp(G,Z) ℓE( G) ℓE(G), min G Tp(G,W) ℓE( G) ℓE(G), min G Tp(G,Y ) ℓE( G) ℓE(G) . We now turn to each of these three terms individually and first consider alternative graphs in the partitioning Tp(G, Z). The following lower bound consists of possibly non-localized conditional dependence properties of the observable distribution PX. (That is, the bound may involve nodes that are not close to each other in the graph G.) Lemma D.7 Let ΠZ(G) denote all tuples (z, l, o) V 3 of adjacent nodes (z l) E for which there exists a node o nd G(l) \ {z, l}. It holds that min G Tp(G,Z) ℓE( G) ℓE(G) min (z,l,o) ΠZ(G) I(Xz; Xo|Xl). The next result proves a lower bound that holds uniformly over all alternative graphs in Tp(G, W). The lower bound consists only of local conditional dependence properties. That Structure Learning for Directed Trees is, for any subgraph of the causal graph G of the form Xo Xw Xl or Xo Xw Xl we measure, by means of conditional mutual information, the conditional dependence of the two adjacent nodes Xw and Xl conditional on Xo, I(Xw; Xl|Xo). The lower bound consists of the smallest of all such local conditional dependence measures. Lemma D.8 Let ΠW (G) denote all tuples (w, l, o) V 3 of adjacent nodes (w l) E and o (ch G(w) \ {l}) pa G(w). It holds that that min G Tp(G,W) ℓE( G) ℓE(G) min (w,l,o) ΠW (G) I(Xw; Xl|Xo). A uniform lower bound of the score gap over all alternative graphs in the final partition Tp(G, Y ) is given by the smallest edge-reversal of any edge in the causal graph G. Lemma D.9 It holds that min G Tp(G,Y ) ℓE( G) ℓE(G) min (j i) E ℓE(j L99 i). 
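(Before turning to the proofs, we note that the index sets appearing in Lemmas D.7 and D.8 are easy to enumerate for a given candidate causal tree. The following sketch is illustrative only and not part of the paper's code; it lists the triples in Pi_Z(G) and Pi_W(G) for a tree represented as a NetworkX DiGraph.)

```python
# Illustrative sketch: enumerating the node triples entering the
# identifiability-gap lower bounds of Lemmas D.7 and D.8 for a directed tree G.
import networkx as nx

def pi_Z(G):
    """Triples (z, l, o) with (z -> l) an edge of G and o a non-descendant
    of l with o not in {z, l} (Lemma D.7)."""
    triples = []
    for z, l in G.edges():
        non_desc = set(G.nodes()) - nx.descendants(G, l) - {l}
        for o in non_desc - {z}:
            triples.append((z, l, o))
    return triples

def pi_W(G):
    """Triples (w, l, o) with (w -> l) an edge of G and o another child of w
    or the parent of w (Lemma D.8)."""
    triples = []
    for w, l in G.edges():
        others = (set(G.successors(w)) - {l}) | set(G.predecessors(w))
        for o in others:
            triples.append((w, l, o))
    return triples

# Example: the chain 1 -> 2 -> 3.
G = nx.DiGraph([(1, 2), (2, 3)])
print(pi_Z(G))  # [(2, 3, 1)]: I(X_2; X_1 | X_3) enters the bound of Lemma D.7
print(pi_W(G))  # [(2, 3, 1)]: I(X_2; X_3 | X_1) enters the bound of Lemma D.8
```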
An immediate consequence of Lemmas D.7 to D.9 is that the entropy identifiability gap is given by the smallest of the lower bounds derived for each partition, see Theorem 16. Thus, it only remains to prove Lemmas D.7 to D.9. Proof of Lemma D.7. Let G ΠZ(G) such that Z = . This implies that Y = W = as L can only have one parent in G. Furthermore, D = as L can only have one parent in G and O = for otherwise L would have been deleted by the deletion procedure in Section 5. Assume without loss of generality that O = {O1, . . . , Ok} for some k N. The two subgraphs are illustrated in Figure 16. V1 = {Z, L}c V1 = {Z, L, O, A1, . . . , Ak}c Figure 16: Illustration of the reduced form graphs GR and GR for the case G ΠZ(G). A1, . . . , Ak are possibly empty sets of nodes, and dashed rectangle nodes denotes a possibly multi-node subgraph over the variables enclosed. The bi-directed edges means that the edge can be directed in both directions. An edge pointing into the multi-node subgraph, can possibly be multiple edges into distinct nodes of the subgraph. Jakobsen, Shah, B uhlmann and Peters For ease of notation, fix any 1 i k and denote O := Oi. We note that in G the following d-separation holds Thus, we have for all probability measures Q { G} F( G) Pp over nodes V that Z O | L (as the path between Z and O is blocked by L and all probability measures generated in accordance with an SCM are Markovian with respect to the generating graph G). Recall that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp DKL(PX Q) = inf Q { G} F( G) Pp h(PX, Q) h(PX). Now fix Q = q λp { G} F( G) Pp and note that it factorizes as Q = QA|Z,O,LQZ|LQO|LQL, i.e., the density q factorizes as q(x) = q A|Z,O,L(a|z, o, l)q Z,O,L(z, o, l) = q A|Z,O,L(a|z, o, l)q Z|L(z|l)q O|L(o|l)q L(l), for λp-almost all x = (a, z, o, l) Rp where A = V \ {Z, O, L}. Hence, the cross entropy splits additively into h(PX, Q) E[ log(q A|Z,O,L(A|Z, O, L))] + E[ log(q Z|L(Z|L))] + E[ log(q O|L(O|L))] + E[ log(q L(L))]. (43) Now note, e.g., that for a conditional distribution (Markov kernel) QZ|L it holds that 0 DKL(PZ|LPL QZ|LPL) = E log q Z|L(Z|L)p L(L) p Z|L(Z|L)p L(L) = E[ log(q Z|L(Z|L))] E[ log(p Z|L(Z|L))], proving that E[ log(q Z|L(Z|L))] E[ log(p Z|L(Z|L))]. By similar arguments, we get that the three other terms in the lower bound of Equation (43) are bounded below by E[ log(q A|Z,O,L(A|Z, O, L))] E[ log(p A|Z,O,L(A|Z, O, L))], E[ log(q O|L(O|L))] E[ log(p O|L(O|L))], E[ log(q L(L))] E[ log(p L(L))]. This implies that inf Q { G} F( G) Pp h(PX, Q) h(PX, Q ), Structure Learning for Directed Trees where Q = PA|Z,O,LPZ|LPO|LPL. On the other hand, we know that PX factorizes as PX = PA|Z,O,LPZ,O|LPL. Thus we have the following entropy score gap lower bound ℓE( G) ℓE(G) h(PX, Q ) h(PX) = DKL(PX Q ) = DKL(PA|Z,O,LPZ,O|LPL PA|Z,O,LPZ|LPO|LPL) = DKL(PZ,O|LPL PZ|LPO|LPL) = DKL(PZ,O|L PZ|LPO|L|PL) = I(Z; O|L). ΠZ(G) denotes all tuples (z, l, o) V 3 of adjacent nodes (z l) E for which there exists a node o nd G(l) \ {z, l}. For any graph G Tp(G, Z) we can, by the above considerations, find a tuple (z, l, o) ΠZ(G) such that ℓE( G) ℓE(G) I(Xo; Xz | Xl). We conclude that min G Tp(G,Z) ℓE( G) ℓE(G) min (z,l,o) ΠZ(G) I(Xo; Xz|Xl). Proof of Lemma D.8. Fix any G Tp(G, W) and L with W = such that Z = Y = . We have illustrated the subgraph GR in Figure 17 and the possible subgraphs GR in Figure 18. {W, L}c W L Figure 17: Illustrations of the GR subgraph for for G Tp(G, W). 
Note that for any of the three possible local graph structures presented in Figure 18 there exists an A {O1, . . . , Ok, D} such that L GRW | A, i.e., A blocks the path between L and W. Thus, for all probability measures Q { G} F( G) Pp over nodes V = {1, .., p} it always holds that L W | A. By arguments similar to those in the proof of Lemma D.7, we note that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp h(PX, Q) h(PX), and that inf Q { G} F( G) Pp h(PX, Q) h(PX, Q ), Jakobsen, Shah, B uhlmann and Peters GR: D = , O = {L, D}c D L GR: D = , O = {D, L, O, A1, . . . , Ak}c D L ... GR: D = , O = Figure 18: Illustrations of the possible GR subgraphs for G Tp(G, W). for PX = PK|W,L,APW,L|APA and Q = PK|W,L,APL|APW|APA where K = V \{W, L, A}. To that end, we now have that ℓE( G) ℓE(G) h(PX, Q ) h(PX) = DKL(PX Q ) = DKL(PK|W,L,APW,L|APA PK|W,L,APL|APW|APA) = DKL(PW,L|APA PL|APW|APA) = DKL(PW,L|A PL|APW|A|PA) = I(W; L|A). Let ˆΠW (G) denote all tuples (w, l, a) V 3 of adjacent nodes (w l) E for which there exists a node a nd G(l) \ {w}. Now note that for any graph G Tp(G, W) we can, by the above considerations, find a tuple (w, l, a) ˆΠW (G) such that ℓE( G) ℓE(G) I(Xw; Xl | Xa). (44) (Conversely for any tuple (w, l, a) ˆΠW (G) we can construct a graph G Tp(G, W) such that (44) holds. To see this, fix (w, l, a) ˆΠW (G) and construct G such that the subtree with root node l is identical in both G and G and a blocks the path between l and w in G.) Therefore, the following lower bound holds (and it is not unnecessarily small). min G Tp(G,W) ℓE( G) ℓE(G) min (w,l,a) ˆΠW (G) I(Xw; Xl | Xa). For any (w, l, a) ˆΠW (G) it either holds that a (ch G(w) \ {l}) pa G(w) or that there exists an o (ch G(w) \ {l}) pa G(w) blocking the path between a and l in G such that Structure Learning for Directed Trees Xl Xa|Xo. Furthermore, we note that as Xl (Xo, Xa) | Xw we have that I(Xw; Xl|Xa) = h(Xl|Xa) h(Xl|Xa, Xw) = h(Xl|Xa) h(Xl|Xw) = h(Xl|Xa) h(Xl|Xo, Xw) h(Xl|Xa, Xo) h(Xl|Xo, Xw) = h(Xl|Xo) h(Xl|Xo, Xw) = I(Xw; Xl | Xo), as further conditioning reduces conditional entropy. Let ΠW (G) denote all tuples (w, l, o) V 3 of adjacent nodes (w l) E and o (ch G(w) \ {l}) pa G(w). By the above considerations we conclude that min G Tp(G,W) ℓE( G) ℓE(G) min (w,l,o) ΠW (G) I(Xw; Xl | Xo). Proof of Lemma D.9. Fix G Tp(G, Y ) and L such that Y = . It holds that W = Z = . We have illustrated the GR in Figure 19 and the three possible subgraphs GR in Figure 20. {Y, L}c Y L Figure 19: Illustrations of the GR subgraph for G Tp(G, Y ). Note that for any of the three possible local graph structures of GR illustrated in Figure 20 we have that for all probability measures Q { G} F( G) Pp factorizes as QA|L,Y QL,Y , where A = V \ {L, Y }. It always holds that QL,Y is the distribution of ( L, Y ) generated in accordance with a structural causal model of the form Y := f Y ( L) + NY , (45) where f Y (l) = E[Y |L = l] for all l R, and any L( NY ), L( L) P with NY L. Now recall that ℓE( G) ℓE(G) = inf Q { G} F( G) Pp h(PX, Q) h(PX), and notice that by arguments similar to those in the proof of Lemma D.7 we get h(PX, Q) = h(PX, QA|L,Y QL,Y ) = E[ log(q A|L,Y (A|L, Y ))] + h(PL,Y , QL,Y ) E[ log(p A|L,Y (A|L, Y ))] + h(PL,Y , QL,Y ), Jakobsen, Shah, B uhlmann and Peters GR: D = , O = {Y, L, D}c D L Y GR: D = , O = {D, L, Y, O, A, B}c GR: D = , O = Figure 20: Illustrations of the possible GR subgraphs for G Tp(G, Y ). and that h(PX) = E[ log(p A|L,Y (A|L, Y ))] + h(PL,Y ). Thus, we have that ℓE( G) ℓE(G) inf Q { G} F( G) Pp h(PL,Y , QL,Y ) h(PL,Y ). 
For any Q = QA|L,Y QL,Y { G} F( G) Pp we have that QL,Y is uniquely determined by a marginal distribution QL P and the noise distribution of NY q NY λ P from the additive noise structural assignment in Equation (45) for Y and the causal function f Y . Thus, the density q L,Y of QL,Y is given by q L,Y (l, y) = q Y |L(y|l)q L(l) = q NY (y f Y (l))q L(l) = q NY (y E[Y |L = l])q L(l). h(PL,Y , QL,Y ) = E [ log (q L,Y (L, Y ))] = E log q Y |L(Y |L) + E [ log (q L(L))] = E h log q NY (Y E[Y |L]) i + h(PL, QL) = h(Y E[Y |L], NY ) + h(PL, QL) h(Y E[Y |L]) + h(L), where we used that h(P, Q) = DKL(P, Q) + h(P) h(P). Thus, we have that ℓE( G) ℓE(G) inf Q { G} F( G) Pp h(PL,Y , QL,Y ) h(PL,Y ) h(Y E[Y |L]) + h(L) h(L E[L|Y ]) h(Y ) = ℓE(Y L99 L). Structure Learning for Directed Trees We conclude that min G Tp(G,Y ) ℓE( G) ℓE(G) min (i j) E ℓE(j L99 i). D.4.3 Remaining proof of Section 5 Proof of Theorem 17. Consider a graph G Tp(G, Z) and let GR,1 = (ER,1, VR,1) and GR,1 = ( ER,1, VR,1) be the reduced graphs after the initial edge and node deletion procedure of Section 5.3. The deletion procedure does not change the score gap, that is, ℓG( G) ℓG(G) = ℓG( GR,1) ℓG(GR,1). For any i 1 and fixed GR,i and GR,i we define LR,i := {L VR,i : ch GR,i(L) = (pa GR,i(L) = pa GR,i(L) ch GR,i(L) = )}. Now fix L1 LR,1 such that Z1 = , where Y1, Z1, W1, D1 and O1 are defined similarly to the variables in Section 5. Let O1 = {O1,1, . . . , O1,k1}, for some k1 N. Assume that there exists an i {1, . . . , k1} such that (Z1 O1,i) ER,1 in which case we have the following two paths in GR,1 and GR,1 GR,1 : O1,i Z1 L1, and GR,1 : Z1 L1 O1,i. Since O1,i GR,1Z1 | L1, an entropy score gap lower bound is given by ℓG( GR,1) ℓG(GR,1) ℓE( GR,1) ℓE(GR,1) = inf Q { G} F( G) Pp h(PX, Q) PX = I(O1,i; Z1|L1), with PX = PK|O,Z,LPO,Z|LPL and Q = PK|O,Z,LPZ|LPO|LPL for K = V \ {O, Z, L}, by arguments similar to those from the proof of Lemma D.8. Now note that (Z1, O1,i, L1) ΠW (GR,1) ΠW (G) as (Z1 O1,i) ER,1 and L1 ch GR,1(Z1) \ {O1,i} (ch GR,1(Z1) \ {O1,i}) pa GR,1(Z1). Hence, ℓG( GR,1) ℓG(GR,1) min (w,l,o) ΠW (G) I(Xw; Xl|Xo). (46) Assume now that for all i {1, .., k1} we have (Z1 O1,i) ER,1. Let ˆGR,1 = ( ˆER,1, VR,1) denote an intermediate graph where ˆER,1 is identical to ER,1 except the edges {(L1 O1,i) : 1 i k1} ER,1 are replaced by the edges {(Z1 O1,i) : 1 i k1}. It holds that ℓG( GR,1) ℓG(GR,1) = ℓG( GR,1) ℓG( ˆGR,1) + ℓG( ˆGR,1) ℓG(GR,1) ℓG( ˆGR,1) ℓG(GR,1). Jakobsen, Shah, B uhlmann and Peters Note that this score gap lower bound is still strictly positive as ˆGR,1 = GR,1. To realize the last inequality, simply note that as O1,i L1 | Z1 we have for all i {1, . . . , k1} that 2ℓG( GR,1, O1,i) = log E[(O1,i E[O1,i|L1])2] log E[(O1,i E[O1,i|Z1, L1])2] = log E[(O1,i E[O1,i|Z1])2] = 2ℓG( ˆGR,1, O1,i). (47) Now since all edges in GR,1 and ˆGR,1 coincide except the incoming edges into O1,1, . . . , O1,k1 we get that ℓG( GR,1) ℓG( ˆGR,1) = i=1 ℓG( GR,1, O1,i) ℓG( ˆGR,1, O1,i) 0, where the inequality follows from Equation (47). Now both ˆGR,1 and GR,1 have a childless node L1 with the same parent Z1, so we let GR,2 and GR,2 denote these two graphs where the node L1 and its incoming edge are deleted. This deletion does not change the graph scores, i.e., ℓG( ˆGR,1) ℓG(GR,1) = ℓG( GR,2) ℓG(GR,2). Now fix L2 LR,2 and define Y2, Z2, W2, D2 and O2 = {O2,1, . . . , O2,k2} accordingly. If either Y2 or W2 is non-empty, we use the score gap lower bound previously discussed in Lemma D.8 and Lemma D.9. 
If $Z_2$ is non-empty, we can repeat the above procedure and iteratively move edges and delete nodes until we arrive at the first $i \in \mathbb{N}$, with $\tilde{\mathcal{G}}_{R,i}$ and $\mathcal{G}_{R,i}$ being the iteratively reduced graphs and $L_i \in \mathcal{L}_{R,i}$, for which either

i) $Y_i$ or $W_i$ is non-empty. In this case, $\ell_G(\tilde{\mathcal{G}}) - \ell_G(\mathcal{G})$ is lower bounded by a bound of the same form as in Lemma D.8 or Lemma D.9. That is,
$$\ell_G(\tilde{\mathcal{G}}_{R,i}) - \ell_G(\mathcal{G}_{R,i}) \ge \ell_E(\tilde{\mathcal{G}}_{R,i}) - \ell_E(\mathcal{G}_{R,i}) \ge \min\Big\{\min_{(j \to i) \in \mathcal{E}} \ell_E(i \dashleftarrow j),\ \min_{(w,l,o) \in \Pi_W(\mathcal{G})} I(X_w; X_l \mid X_o)\Big\}.$$

ii) $Z_i$ is non-empty and there exists a $j \in \{1, \ldots, k_i\}$ such that $(Z_i \to O_{i,j})$ is an edge in $\mathcal{G}_{R,i}$. As previously argued, the score gap lower bound of Equation (46) applies. That is,
$$\ell_G(\tilde{\mathcal{G}}_{R,i}) - \ell_G(\mathcal{G}_{R,i}) \ge \ell_E(\tilde{\mathcal{G}}_{R,i}) - \ell_E(\mathcal{G}_{R,i}) \ge \min_{(w,l,o) \in \Pi_W(\mathcal{G})} I(X_w; X_l \mid X_o).$$

Note that whenever we do not meet scenario i) or ii), we remove a node in both graphs that is a sink node in the reduced true causal graph $\mathcal{G}_{R,i}$ and in the intermediate graph $\hat{\mathcal{G}}_{R,i}$. After at most $p - 2$ graph reduction iterations without encountering scenario i) or ii) we are left with two different graphs on two nodes, in which case the score gap is that of an edge reversal. We conclude that
$$\ell_G(\tilde{\mathcal{G}}) - \ell_G(\mathcal{G}) \ge \min\Big\{\min_{(i \to j) \in \mathcal{E}} \ell_E(j \dashleftarrow i),\ \min_{(w,l,o) \in \Pi_W(\mathcal{G})} I(X_w; X_l \mid X_o)\Big\}.$$
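The quantities entering these lower bounds can be made concrete in a small example. The following minimal sketch, which is not part of the paper, evaluates the conditional mutual information terms from the proof of Lemma D.8 in closed form for a linear Gaussian SCM on the chain $A \to O \to W \to L$; the coefficients and the helper function `gaussian_cmi` are hypothetical choices made here purely for illustration. It checks that $I(X_w; X_l \mid X_a) \ge I(X_w; X_l \mid X_o) > 0$, so that the lower bound over $\Pi_W(\mathcal{G})$ is strictly positive for this model.

```python
import numpy as np

# Hypothetical linear Gaussian SCM on the chain A -> O -> W -> L (unit noise variances),
# so that (w, l) = (W, L) is a true edge, a = A is a non-descendant of L other than W,
# and o = O in pa(W) blocks the path between A and L, as in the proof of Lemma D.8.
idx = {"A": 0, "O": 1, "W": 2, "L": 3}
B = np.zeros((4, 4))            # B[j, i] = coefficient of X_i in the assignment of X_j
B[idx["O"], idx["A"]] = 0.8     # O := 0.8 * A + N_O
B[idx["W"], idx["O"]] = 0.7     # W := 0.7 * O + N_W
B[idx["L"], idx["W"]] = 1.0     # L := 1.0 * W + N_L
I4 = np.eye(4)
Sigma = np.linalg.inv(I4 - B) @ np.linalg.inv(I4 - B).T   # covariance of X = (I - B)^{-1} N

def gaussian_cmi(S, x, y, z):
    """Conditional mutual information I(X_x; X_y | X_z) in nats for a Gaussian vector."""
    ld = lambda ix: np.linalg.slogdet(S[np.ix_(ix, ix)])[1]
    return 0.5 * (ld([x] + z) + ld([y] + z) - ld(z) - ld([x, y] + z))

w, l, a, o = idx["W"], idx["L"], idx["A"], idx["O"]
i_a = gaussian_cmi(Sigma, w, l, [a])   # approx 0.456 nats
i_o = gaussian_cmi(Sigma, w, l, [o])   # approx 0.347 nats
print(f"I(W; L | A) = {i_a:.3f},  I(W; L | O) = {i_o:.3f}")
# Matches the reduction step in the proof of Lemma D.8: I(W; L | A) >= I(W; L | O) > 0.
assert i_a >= i_o > 0.0
```

Replacing the chain by another directed tree only changes how the covariance matrix is assembled; the closed-form Gaussian expression for the conditional mutual information is unchanged.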
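The monotonicity step in Equation (47) can be checked in the same closed form. The sketch below (again with hypothetical coefficients, not from the paper) uses a true structure with $Z \to L$ and $Z \to O$, so that $O \perp\!\!\!\perp L \mid Z$, and compares the population log mean squared errors of regressing $O$ on $L$ (its parent in $\tilde{\mathcal{G}}_{R,1}$) and on $Z$ (its parent in $\hat{\mathcal{G}}_{R,1}$).

```python
import numpy as np

# Hypothetical linear Gaussian SCM with Z -> L and Z -> O (unit noise variances):
#   Z := N_Z,  L := 0.9 * Z + N_L,  O := 0.6 * Z + N_O,  so O _||_ L | Z.
var_Z = 1.0
var_L = 0.9**2 * var_Z + 1.0
var_O = 0.6**2 * var_Z + 1.0
cov_ZO = 0.6 * var_Z
cov_LO = 0.9 * 0.6 * var_Z                      # L and O are both children of Z

# Population MSE of the conditional-mean (here: linear) predictor of O from one regressor.
mse_O_given_L = var_O - cov_LO**2 / var_L       # E[(O - E[O|L])^2]
mse_O_given_Z = var_O - cov_ZO**2 / var_Z       # E[(O - E[O|Z])^2]

# Equation (47): E[(O - E[O|L])^2] >= E[(O - E[O|Z,L])^2] = E[(O - E[O|Z])^2] since O _||_ L | Z,
# so the local Gaussian score of O does not increase when its parent is moved from L to Z.
print(np.log(mse_O_given_L), np.log(mse_O_given_Z))
assert np.log(mse_O_given_L) >= np.log(mse_O_given_Z)
```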
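Similarly, the edge-reversal term $\ell_E(j \dashleftarrow i)$ appearing in the proof of Lemma D.9 and in the concluding bound of Theorem 17 admits a simple closed form when the pair of variables on the reversed edge is jointly Gaussian. The short calculation below is not part of the paper's argument; it only uses that, for a jointly Gaussian pair $(L, Y)$ with correlation $\rho$, the residuals $Y - E[Y \mid L]$ and $L - E[L \mid Y]$ are Gaussian with variances $\sigma_Y^2(1-\rho^2)$ and $\sigma_L^2(1-\rho^2)$:
$$\ell_E(Y \dashleftarrow L) = h(Y - E[Y \mid L]) + h(L) - h(L - E[L \mid Y]) - h(Y) = \tfrac{1}{2}\log\big(2\pi e\,\sigma_Y^2(1-\rho^2)\big) + \tfrac{1}{2}\log\big(2\pi e\,\sigma_L^2\big) - \tfrac{1}{2}\log\big(2\pi e\,\sigma_L^2(1-\rho^2)\big) - \tfrac{1}{2}\log\big(2\pi e\,\sigma_Y^2\big) = 0.$$
Hence, for a linear Gaussian cause-effect pair the edge-reversal term vanishes and the lower bound of Theorem 17 becomes uninformative, in line with the fact that the edge direction of a bivariate linear Gaussian model is not identifiable from the observational distribution; nonlinear conditional means or non-Gaussian residuals are needed for the bound to be strictly positive.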