Published in Transactions on Machine Learning Research (03/2025)

Learning Linear Polytree Structural Equation Model

Xingmei Lou (xmlou@ucdavis.edu), Department of Statistics, University of California, Davis, Davis, CA 95616, USA

Yu Hu (mahy@ust.hk), Department of Mathematics and Division of Life Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong S.A.R.

Xiaodong Li (xdgli@ucdavis.edu), Department of Statistics, University of California, Davis, Davis, CA 95616, USA

Reviewed on OpenReview: https://openreview.net/forum?id=N28FdYO2sH

Abstract

We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under Gaussian polytree models, we study sufficient conditions on the sample size for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, which is uniquely represented by a CPDAG. We also derive necessary conditions on the sample size for both skeleton and CPDAG recovery in the form of information-theoretic lower bounds; these match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of the two tasks. We further consider the problem of inverse correlation matrix estimation under linear polytree models, and establish an estimation error bound in terms of the dimension and the total number of v-structures. We also consider an extension to group linear polytree models, in which each node represents a group of variables. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of polytree learning when the true graphical structure can only be approximated by a polytree.

1 Introduction

Over the past three decades, the problem of learning directed graphical models from i.i.d. observations of a multivariate distribution has received an enormous amount of attention, since such models provide a compact and flexible way to represent the joint distribution of the data, especially when the associated graph is a directed acyclic graph (DAG), i.e., a directed graph with no directed cycles. DAG models are popular in practice, with applications in biology, genetics, machine learning, and causal inference (Sachs et al., 2005; Zhang et al., 2013; Koller & Friedman, 2009; Spirtes et al., 2000). There exists extensive literature on learning the graph structure from i.i.d. observations under DAG models; for a summary, see the survey papers by Drton & Maathuis (2017) and Heinze-Deml et al. (2018). Existing approaches generally fall into two categories: constraint-based methods (Spirtes et al., 2000; Pearl, 2009) and score-based methods (Chickering, 2002b). Constraint-based methods utilize conditional independence tests to determine whether there exists an edge between two nodes, and then orient the edges so that the resulting graph is compatible with the conditional independencies determined in the data. Score-based methods formulate structure learning as optimizing a score function of the unknown graph given the data. A polytree is a connected DAG that contains no cycles even when the directions of all edges are ignored.
Polytrees are practically useful due to their tractability in both structure learning and inference. To the best of our knowledge, structure learning of polytree models was first studied in Rebane & Pearl (1987), in which the skeleton of the polytree is estimated by applying the Chow-Liu algorithm (Chow & Liu, 1968) to pairwise mutual information quantities, a method widely used in the Markov random field literature to fit undirected tree models. Polytree graphical models have since received a significant amount of research interest, both empirical and theoretical; see, e.g., Dasgupta (1999); Cheng et al. (2002), and recent efforts such as Chatterjee & Vidyasagar (2022); Tramontano et al. (2022).

This paper studies sample size conditions for the method essentially proposed in Rebane & Pearl (1987): recovering polytree structures by applying the Chow-Liu algorithm to pairwise sample correlations, in the case of Gaussian linear structural equation models (SEM). We establish sufficient conditions on the sample size for consistent recovery of both the skeleton and the equivalence class of the underlying polytree. We also establish necessary conditions on the sample size for these two tasks through information-theoretic lower bounds. Our sufficient and necessary conditions match in order in a broad regime of model parameters, and thereby characterize the difficulty of these two tasks in polytree learning.

A relevant line of research is structure learning for tree-structured undirected graphical models, including both discrete cases (Heinemann & Globerson, 2014; Bresler & Karzand, 2020; Netrapalli et al., 2010; Anandkumar et al., 2012b;a) and Gaussian cases (Tan et al., 2010; Tavassolipour et al., 2018; Nikolakakis et al., 2019; Katiyar et al., 2019). In particular, sample size conditions for undirected tree structure learning via the Chow-Liu algorithm have been studied for both Ising and Gaussian models (Bresler & Karzand, 2020; Tavassolipour et al., 2018; Nikolakakis et al., 2019), and the analyses usually rely crucially on a so-called correlation decay property over the true undirected tree. The correlation decay can usually be quantified explicitly by the pairwise population correlations on the edges of the underlying tree. Based on this property, together with perturbation bounds relating pairwise sample correlations to their population counterparts, sufficient conditions on the sample size for undirected tree recovery with the Chow-Liu algorithm follow straightforwardly.

In order to apply this technical framework to polytree learning, a natural question is whether a similar correlation decay phenomenon holds for polytree models. In fact, this is suggested in the seminal paper Rebane & Pearl (1987): under some non-degeneracy assumptions, it is shown there (see their Eq. 13) that mutual information decays along the skeleton of the polytree. In broad terms, the mutual information decay is a direct implication of the well-known data processing inequality in information theory (Thomas & Joy, 2006). Restricted to the special case of Gaussian linear SEMs, mutual information decay is indeed equivalent to population correlation decay.
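To make this equivalence concrete, the following worked identity (a standard Gaussian fact, added here for illustration; it is not part of the original derivation) shows why the two decay notions coincide:

```latex
% For jointly Gaussian variables, mutual information is a strictly
% increasing function of the absolute correlation:
I(X;Y) = -\tfrac{1}{2}\log\bigl(1-\rho_{XY}^2\bigr).
% Hence, on a chain X \to Y \to Z with edge correlations
% \rho_{XY}, \rho_{YZ}, the decay rule \rho_{XZ} = \rho_{XY}\,\rho_{YZ} gives
|\rho_{XZ}| \le \min\{|\rho_{XY}|,\, |\rho_{YZ}|\}
\quad\Longleftrightarrow\quad
I(X;Z) \le \min\{I(X;Y),\, I(Y;Z)\},
% which is exactly the data processing inequality in this special case.
```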
To obtain a meaningful sample complexity result, we need to quantify this correlation decay explicitly, as has been done in the study of the Chow-Liu algorithm for undirected tree models (Bresler & Karzand, 2020; Tavassolipour et al., 2018; Nikolakakis et al., 2019). The mutual information decay in Rebane & Pearl (1987) holds for general polytree models, but one can expect a sharper quantification under more specific models. In fact, if we restrict the polytree model to a linear SEM, then by applying the well-known Wright's formula (Wright, 1960; Nowzohour et al., 2017; Foygel et al., 2012), the population correlation decay can be quantified by the pairwise correlations on the tree edges. With such quantification of correlation decay over the polytree skeleton, we can apply ideas from undirected tree structure learning to establish sufficient sample size conditions for polytree skeleton recovery via the Chow-Liu algorithm. In broad terms, if the maximum absolute correlation over the polytree skeleton is uniformly bounded below 1, the Chow-Liu algorithm recovers the skeleton exactly with high probability whenever the sample size satisfies $n \gtrsim (\log p)/\rho_{\min}^2$, where $p$ is the number of variables and $\rho_{\min}$ is the minimum absolute population correlation over the skeleton.

To determine the directions of the polytree over the skeleton, the concept of a CPDAG (Verma & Pearl, 1991) captures the equivalence class of polytrees. We consider the CPDAG recovery procedure introduced in Verma & Pearl (1992) and Meek (1995), a polynomial time algorithm based on identifying all the v-structures (Verma & Pearl, 1991). Conditional on exact recovery of the skeleton, recovering the CPDAG is therefore equivalent to recovering all v-structures. In a non-degenerate polytree model, a pair of adjacent edges forms a v-structure if and only if the two non-adjacent node variables in the triplet are independent, so we consider a natural v-structure identification procedure: thresholding the pairwise sample correlations over all adjacent pairs of edges with an appropriate threshold. In analogy to the skeleton result, we show that the CPDAG of the polytree is exactly recovered with high probability if the sample size satisfies $n \gtrsim (\log p)/\rho_{\min}^4$. Furthermore, using Fano's method, we show that $n \gtrsim (\log p)/\rho_{\min}^2$ is necessary for skeleton recovery, while $n \gtrsim (\log p)/\rho_{\min}^4$ is necessary for CPDAG recovery. In other words, we sharply characterize the difficulty of the two tasks.

We briefly note studies on linear SEMs that do not assume a polytree structure, such as Peters & Bühlmann (2014); Ghoshal & Honorio (2018). In these works, the authors make alternative assumptions to ensure identifiability of the DAG, for instance equal noise variances. For a discussion of the general SEM literature, we refer the reader to Ghoshal & Honorio (2018).

The paper is organized as follows. In Section 2, we review the concepts of linear polytree SEMs, Markov equivalence and CPDAGs, and the polytree learning method based on the Chow-Liu algorithm. In Section 3, we give optimal sample size conditions for both skeleton and CPDAG recovery, particularly in terms of the minimum correlation over the tree skeleton.
In Section 4, we introduce a version of the PC algorithm adapted to linear polytree models, and establish the same sample size conditions. In Section 5, we discuss a method for estimating the inverse correlation matrix of a linear polytree model, and establish an estimation error bound in the entry-wise ℓ1 norm. In Section 6, we extend the framework to group linear polytree models. Our theoretical findings are empirically demonstrated in Section 7, along with numerical results on benchmark simulated data from the DAG learning literature. A brief summary of our work and some potential future research directions are discussed in Section 8.

2 Linear Polytree Models and Learning

This section gives an overview of linear polytree SEMs, equivalence classes characterized by CPDAGs, and the Chow-Liu algorithm for polytree learning. Most of this material is not new, but we give a self-contained introduction to these important concepts and methods so that the main results in subsequent sections are accessible to a wider audience.

2.1 Linear Polytree Models

Let $G = (V, E)$ be a directed graph with vertex set $V = \{1, 2, \dots, p\}$ and edge set $E$. We write $i \to j \in E$ to denote a directed edge from node $i$ to node $j$ in $G$. A directed graph with no directed cycles is referred to as a directed acyclic graph (DAG). The parent set of node $j$ in $G$ is denoted $\mathrm{Pa}(j) := \{i \in V : i \to j \in E\}$; correspondingly, $\mathrm{Ch}(j) := \{k : j \to k \in E\}$ denotes the children set of $j$.

Let $x = [X_1, \dots, X_p]^\top$ be a random vector where each variable $X_j$ corresponds to a node $j \in V$. The edge set $E$ usually encodes the causal relationships among the variables. The random vector $x$ is said to be Markov on a DAG $G$ if its joint density (or mass) function $p(x)$ factorizes according to $G$ as $p(x) = \prod_{j=1}^p p(x_j \mid x_{\mathrm{Pa}(j)})$, where $p(x_j \mid x_{\mathrm{Pa}(j)})$ is the conditional density/probability of $X_j$ given its parents $X_{\mathrm{Pa}(j)} := \{X_i : i \in \mathrm{Pa}(j)\}$. We usually refer to $(G, p(x))$ as a DAG model.

Throughout this work, we restrict our discussion to an important sub-class of DAG models: linear structural equation models (SEM), in which the dependence of each $X_j$ on its parents is linear with additive noise. The parameterization of the linear SEM with directed graph $G = (V, E)$ is

$$X_j = \sum_{i=1}^p \beta_{ij} X_i + \epsilon_j = \sum_{i \in \mathrm{Pa}(j)} \beta_{ij} X_i + \epsilon_j, \quad j = 1, \dots, p, \qquad (1)$$

where $\beta_{ij} \neq 0$ if and only if $i \to j \in E$, and all $\epsilon_j$'s are independent with zero mean. Let $B = (\beta_{ij}) \in \mathbb{R}^{p \times p}$ and $\epsilon = [\epsilon_1, \dots, \epsilon_p]^\top$. Then the SEM can be represented as

$$x = B^\top x + \epsilon. \qquad (2)$$

Denote $\mathrm{Cov}(x) = \Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ and $\mathrm{Cov}(\epsilon) = \Omega = \mathrm{Diag}(\omega_{11}, \dots, \omega_{pp})$. Here $\Omega$ is diagonal since all additive noise variables are assumed mutually independent.

For any DAG, if we ignore the directions of all its directed edges, the resulting undirected graph is referred to as the skeleton of the DAG. A polytree is a connected DAG whose skeleton does not possess any undirected cycles. The model (2) is referred to as a linear polytree SEM if the underlying DAG is a polytree $T = (V, E)$. In this paper, we focus on the case of independent Gaussian noise $\epsilon_j$, so the model (2) can also be referred to as a Gaussian linear polytree SEM. A major purpose of this paper is to study the problem of polytree learning, i.e., the recovery of the equivalence class of the polytree $T = (V, E)$ under the model (2) from a finite sample of observations $x_1, \dots, x_n$, or equivalently the $n \times p$ data matrix $X = [x_1, \dots, x_n]^\top$. We explain the concept of Markov equivalence classes in the next subsection.
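To fix ideas, here is a minimal sketch (ours, not the authors' code) of sampling from a standardized Gaussian linear polytree SEM of the form (2); the helper names `random_polytree` and `sample_sem` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_polytree(p):
    """A random tree skeleton with uniformly random edge directions."""
    edges = []
    for j in range(1, p):
        i = int(rng.integers(0, j))           # attach node j to an earlier node
        edges.append((i, j) if rng.random() < 0.5 else (j, i))
    return edges                               # list of directed (parent, child)

def sample_sem(edges, p, n, rho=0.3):
    """n i.i.d. draws from the standardized SEM with beta_ij = rho on all edges."""
    B = np.zeros((p, p))
    for i, j in edges:
        B[i, j] = rho                          # beta_ij != 0  iff  i -> j in E
    omega = 1.0 - (B ** 2).sum(axis=0)         # noise variances, cf. Eq. (5)
    assert omega.min() > 0, "max in-degree too large for this rho"
    eps = rng.normal(size=(n, p)) * np.sqrt(omega)
    return eps @ np.linalg.inv(np.eye(p) - B)  # row-wise x = (I - B)^{-T} eps

X = sample_sem(random_polytree(30), p=30, n=500)
print(np.round(X.std(axis=0)[:5], 2))          # should all be close to 1
```

Because the noise variances are chosen via Eq. (5), every $X_j$ has unit variance, so $\Sigma$ is directly the correlation matrix.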
2.2 Markov Equivalence and CPDAG

Let us briefly review the concept of Markov equivalence of DAGs. Each DAG $G$ entails a list of conditional independence statements, which are satisfied by any joint distribution Markov to $G$. Two DAGs are equivalent if they entail the same list of conditional independencies. In the present paper, the recovery of the equivalence class of a DAG hinges on a well-known result of Verma & Pearl (1991): two DAGs are Markov equivalent if and only if they have the same skeleton and the same set of v-structures, where a v-structure is a node triplet $i \to k \leftarrow j$ in which $i$ and $j$ are non-adjacent.

An important concept that intuitively captures an equivalence class of DAGs is the completed partially directed acyclic graph (CPDAG): a graph $K$ with both directed and undirected edges represents the Markov equivalence class of a DAG $G$ if (1) $K$ and $G$ have the same skeleton, and (2) $K$ contains a directed edge $i \to j$ if and only if every DAG $G'$ that is Markov equivalent to $G$ contains the directed edge $i \to j$. The CPDAG of $G$ is denoted $K = \mathcal{C}_G$. It has been shown in Chickering (2002a) that two DAGs have the same CPDAG if and only if they belong to the same Markov equivalence class. The following result provides some intuition on the CPDAG for polytree models.

Proposition 1. The undirected sub-graph consisting of the undirected edges of the CPDAG of a polytree forms a forest. All equivalent DAGs can be obtained by orienting each undirected tree of the forest into a rooted tree, that is, by selecting any node as the root and orienting all edges away from it.

Proof. Each connected component of the undirected edges is a sub-graph of the polytree $G$'s skeleton, and is thus a tree. If a node of such a tree also has directed edges, they must be outgoing according to Line 6 of Algorithm 2 (Rule 1 in Meek (1995)). This means that when we convert each undirected tree into a rooted tree, no additional v-structures are created in the resulting DAG $G'$. So the original CPDAG is also the CPDAG of $G'$, i.e., $G'$ is equivalent to $G$. On the other hand, if $G'$ is an equivalent DAG, then for each undirected tree $T'$ in the CPDAG, let $i$ be a source node of $T'$ according to $G'$. Then $T'$ in $G'$ must be a rooted tree with root $i$, for otherwise a v-structure would appear within $T'$, contradicting that $G'$ shares the same CPDAG. This shows that all members of the equivalence class can be obtained by orienting undirected trees into rooted trees, and completes the proof.

2.3 Polytree Learning

The procedure of polytree learning considered in this paper was in principle introduced in Rebane & Pearl (1987). The key idea is to first recover the skeleton of the polytree by applying the Chow-Liu algorithm (Chow & Liu, 1968) to the pairwise sample correlations of the data matrix. After the skeleton is recovered, we propose to recover the set of all v-structures via a simple thresholding approach applied to pairwise sample correlations. Finally, we recover the CPDAG by applying Rule 1 introduced in Verma & Pearl (1992) and justified theoretically in Meek (1995).

2.3.1 Chow-Liu Algorithm for Skeleton Recovery

The Chow-Liu tree associated with pairwise correlations, which serves as the estimated skeleton of the underlying polytree, is defined below.

Definition 2 (Chow-Liu tree associated with pairwise sample correlations). Consider the linear polytree model (2) associated with a polytree $T = (V, E)$, whose skeleton is denoted $\overline{T} = (V, \overline{E})$. Let $\mathcal{T}_p$ denote the set of undirected trees over $p$ nodes.
Given the data matrix $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times p}$, we obtain the sample correlation $\hat\rho_{ij}$ between $X_i$ and $X_j$ for all $1 \le i < j \le p$. The Chow-Liu tree associated with the pairwise sample correlations is defined as the maximum-weight spanning tree over the $p$ nodes, where the weights are the absolute values of the sample correlations:

$$\widehat{\overline{T}} = \arg\max_{T' = (V, E') \in \mathcal{T}_p} \sum_{(i,j) \in E'} |\hat\rho_{ij}|. \qquad (3)$$

For tree-structured undirected graphical models, it was established in Chow & Liu (1968) that the maximum likelihood estimate of the underlying tree structure is the Chow-Liu tree associated with the empirical mutual information quantities (used as the weights of the maximum-weight spanning tree). The rationale for applying the Chow-Liu algorithm to polytree learning is carefully explained in Rebane & Pearl (1987), to which interested readers are referred. The step of skeleton recovery is summarized in Algorithm 1.

Algorithm 1 Chow-Liu algorithm
Input: The data matrix $X = [x_1, \dots, x_n]^\top$.
Output: Estimated skeleton $\widehat{\overline{T}}$.
1: Compute the pairwise sample correlations $\hat\rho_{ij}$ for all $1 \le i < j \le p$;
2: Construct a maximum-weight spanning tree using the $|\hat\rho_{ij}|$ as edge weights, i.e., $\widehat{\overline{T}}$ defined in (3).

It is noteworthy that Algorithm 1 can be implemented efficiently by applying Kruskal's algorithm (Kruskal, 1956) to the pairwise sample correlations $|\hat\rho_{ij}|$ to construct the maximum-weight spanning tree. The computational complexity of Kruskal's algorithm is $O(p^2 \log p)$, which is generally no larger than the $O(p^2 n)$ cost of computing the sample correlations.

2.3.2 CPDAG Recovery

In the second part of the polytree learning procedure, we aim to estimate the CPDAG of the polytree model. Intuitively, this amounts to identifying all the edges whose orientations can be determined. The first step is to identify all the v-structures. Under the polytree model, any pair of non-adjacent nodes $i$ and $j$ with common neighbor $k$ forms a v-structure $i \to k \leftarrow j$ if and only if $X_i$ and $X_j$ are mutually independent. We thus declare a v-structure $i \to k \leftarrow j$ when the sample correlation satisfies $|\hat\rho_{ij}| < \rho_{\mathrm{crit}} = \gamma_{\mathrm{crit}} \sqrt{(\log p)/n}$, where the choice of the threshold (critical value) is discussed in subsequent sections. After recovering all the v-structures, as mentioned above, it is guaranteed in Meek (1995) that the CPDAG of the polytree can be recovered by iteratively applying the four rules originally introduced in Verma & Pearl (1992). However, since our discussion is restricted to polytree models, Rules 2, 3, and 4 of Verma & Pearl (1992) and Meek (1995) do not apply; we only need to apply Rule 1 repeatedly. This rule can be stated as follows: orient any undirected edge $j - k$ into $j \to k$ whenever there is a directed edge $i \to j$ coming from a third node $i$. These two steps of the second part of polytree structure learning are summarized as Algorithm 2.

Algorithm 2 Extending the skeleton to a CPDAG
Input: Estimated skeleton $\widehat{\overline{T}}$, sample correlations $\hat\rho_{ij}$, tuning parameter $\gamma_{\mathrm{crit}}$.
Output: Estimated CPDAG $\widehat{\mathcal{C}}_T$.
1: for each pair of non-adjacent variables $i, j$ with common neighbor $k$ in $\widehat{\overline{T}}$ do
2:   if $|\hat\rho_{ij}| < \gamma_{\mathrm{crit}} \sqrt{(\log p)/n}$ then
3:     replace $i - k - j$ with the v-structure $i \to k \leftarrow j$
4:   end if
5: end for
6: In the resulting graph, orient as many undirected edges as possible by repeatedly applying the rule: orient an undirected edge $j - k$ into $j \to k$ whenever there is a directed edge $i \to j$ for some $i$.
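The following is a compact sketch (ours, under stated assumptions) of the full pipeline of Algorithms 1 and 2: the maximum-weight spanning tree is obtained by running a minimum spanning tree on the weights $1 - |\hat\rho_{ij}|$, v-structures are detected by thresholding, and Meek's Rule 1 is applied to a fixed point. The default `gamma_crit` is an illustrative placeholder, not the paper's tuned value:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_skeleton(X):
    """Algorithm 1: max-weight spanning tree on |corr| via min tree on 1 - |corr|."""
    W = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(W, 0.0)
    mst = minimum_spanning_tree(W).tocoo()
    return [tuple(sorted((int(i), int(j)))) for i, j in zip(mst.row, mst.col)]

def orient(X, skel, gamma_crit=2.0):
    """Algorithm 2: v-structures by |corr_ij| < gamma_crit*sqrt(log p / n), then Rule 1."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    thr = gamma_crit * np.sqrt(np.log(p) / n)
    nbrs = {v: set() for v in range(p)}
    for i, j in skel:
        nbrs[i].add(j); nbrs[j].add(i)
    directed = set()                               # (i, k) means i -> k
    for k in range(p):                             # detect v-structures i -> k <- j
        for i in nbrs[k]:
            for j in nbrs[k]:
                if i < j and j not in nbrs[i] and abs(R[i, j]) < thr:
                    directed |= {(i, k), (j, k)}
    changed = True                                 # Meek Rule 1: i -> j - k  =>  j -> k
    while changed:                                 # (i, k non-adjacency is automatic
        changed = False                            #  in a tree skeleton)
        for i, j in list(directed):
            for k in nbrs[j] - {i}:
                if (j, k) not in directed and (k, j) not in directed:
                    directed.add((j, k)); changed = True
    return directed    # skeleton edges not in `directed` stay undirected in the CPDAG
```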
3 Main Results for Polytree Learning

In this section, we discuss sample size conditions for the recovery of the skeleton and the CPDAG under a Gaussian linear polytree model $T = (V, E)$. We first establish a correlation decay property on the polytree skeleton (Lemma 3) by applying the famous Wright's formula.

3.1 Preliminaries

First, the polytree learning method introduced in the previous section depends solely on the marginal correlation coefficients, and is thereby invariant to scaling. Without loss of generality, we can therefore assume that each $X_j$ has unit variance for all $j \in V$, i.e., $\Sigma$ is the correlation matrix. The standardized version of a linear SEM is still a linear SEM, and the two share the same polytree structure. In this case, denoting the pairwise correlations $\rho_{ij} := \mathrm{corr}(X_i, X_j)$, we have $\sigma_{ij} = \rho_{ij}$ for all $1 \le i, j \le p$.

Under the linear SEM, $B$ is permutationally similar to a strictly upper triangular matrix, which implies that all eigenvalues of $I - B$ equal 1, and hence that $I - B$ is invertible. Then $(I - B)^\top x = \epsilon$ implies $x = (I - B)^{-\top} \epsilon$, so $x$ has mean zero and covariance $\Sigma = (I - B)^{-\top} \Omega (I - B)^{-1}$. This suggests that we can represent the entries of $\Sigma$ in terms of $(\beta_{ij})$ and $(\omega_{jj})$; this can be conveniently achieved using Wright's path tracing formula (Wright, 1960). We first introduce some necessary definitions.

A trek connecting nodes $i$ and $j$ in a directed graph $G = (V, E)$ is a sequence of non-colliding consecutive edges connecting $i$ and $j$ of the form
$$\tau:\; i = v^L_\ell \leftarrow v^L_{\ell-1} \leftarrow \cdots \leftarrow v^L_1 \leftarrow v_0 \rightarrow v^R_1 \rightarrow \cdots \rightarrow v^R_{r-1} \rightarrow v^R_r = j.$$
We define the left-hand side of $\tau$ as $\mathrm{Left}(\tau) = v^L_\ell \leftarrow \cdots \leftarrow v_0$, the right-hand side as $\mathrm{Right}(\tau) = v_0 \rightarrow \cdots \rightarrow v^R_r$, and the head of $\tau$ as $H_\tau = v_0$. A trek $\tau$ is said to be a simple trek if $\mathrm{Left}(\tau)$ and $\mathrm{Right}(\tau)$ have no common edges. In the polytree case, any two nodes $(i, j)$ are connected by a unique path, and Wright's famous formula takes a simple form:

Lemma 3. Consider the linear polytree model (2) with associated polytree $T = (V, E)$ over $p$ nodes, and assume that $X_j$ has unit variance for all $j \in V$. Then $\rho_{ij} = \beta_{ij}$ for all $i \to j \in E$. Furthermore, for each pair $(i, j)$, the population correlation coefficient satisfies

$$\rho_{ij} = \begin{cases} \prod_{(s,t) \in \tau_{ij}} \rho_{st} & \text{if the path } \tau_{ij} \text{ connecting } i \text{ and } j \text{ is a simple trek}, \\ 0 & \text{otherwise}. \end{cases} \qquad (4)$$

Also, the noise variances satisfy

$$\omega_{jj} = 1 - \sum_{i \in \mathrm{Pa}(j)} \rho_{ij}^2, \quad j = 1, \dots, p. \qquad (5)$$

Remark 4. The assumption that $X_1, \dots, X_p$ have unit variances is unnecessary for (4) alone, since correlation coefficients are invariant under standardization.

Remark 5. Eq. (5) can be derived by the following simple argument: since $T$ is a polytree, all variables in $\mathrm{Pa}(j)$ are mutually independent and are also independent of $\epsilon_j$. Evaluating the variance on both sides of $X_j = \sum_{i \in \mathrm{Pa}(j)} \beta_{ij} X_i + \epsilon_j$ leads to (5).
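A quick numeric sanity check (ours, not from the paper) of the decay formula (4): on the chain $1 \to 2 \to 3$ the path is a simple trek and $\rho_{13} = \rho_{12}\rho_{23}$, while for the collider $1 \to 2 \leftarrow 3$ the path is not a trek and $\rho_{13} = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r12, r23 = 1_000_000, 0.6, 0.7

x1 = rng.normal(size=n)                                        # chain 1 -> 2 -> 3
x2 = r12 * x1 + np.sqrt(1 - r12**2) * rng.normal(size=n)
x3 = r23 * x2 + np.sqrt(1 - r23**2) * rng.normal(size=n)
print(np.corrcoef(x1, x3)[0, 1], r12 * r23)                    # both ~ 0.42

y1, y3 = rng.normal(size=n), rng.normal(size=n)                # collider 1 -> 2 <- 3
y2 = r12 * y1 + r23 * y3 + np.sqrt(1 - r12**2 - r23**2) * rng.normal(size=n)
print(np.corrcoef(y1, y3)[0, 1])                               # ~ 0
```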
We now introduce the following definitions.

Definition 6. In a standardized linear polytree model (2), let $\rho_{\min}$ and $\rho_{\max}$ be the minimum and maximum absolute correlation over the tree skeleton, that is,
$$\rho_{\min} := \min_{i \to j \in E} |\rho_{ij}|, \qquad \rho_{\max} := \max_{i \to j \in E} |\rho_{ij}|.$$

It is noteworthy that in general we cannot assume $\rho_{\min}$ to be independent of $n$ or $p$. In fact, Eq. (5) yields the following relationship between the noise variance and the correlations with the parents of each node: $\sum_{i \in \mathrm{Pa}(j)} \rho_{ij}^2 < 1$, which implies the following corollary.

Corollary 7. Let $d^*$ denote the highest in-degree of the polytree. Then $\rho_{\min} < 1/\sqrt{d^*}$.

Remark 8. In contrast, it is reasonable to assume $\rho_{\min}$ to be a positive constant independent of $p$ under an undirected tree-structured Gaussian graphical model, since after transforming the tree to a rooted tree as in Section 2.1, the highest in-degree satisfies $d^* = 1$.

A key lemma is the following well-known convergence rate for estimating the population correlation matrix.

Lemma 9. Consider a Gaussian linear SEM (2) with $n \ge C_0 \log p$ for some numerical constant $C_0$. Then, on an event $\mathcal{E}$ with probability at least $1 - 1/p^3$, the following inequality holds for some absolute constant $C$:
$$\|\hat\rho - \rho\|_{\max} < C \sqrt{\frac{\log p}{n}},$$
where $\rho$ and $\hat\rho$ denote the population and sample correlation matrices, respectively, and $\|\cdot\|_{\max}$ denotes the entry-wise supremum norm.

Proof. This is a well-known result, obtained by combining Remark 5.40 of Vershynin (2012) and Lemma 1 of Kalisch & Bühlman (2007) (reproduced in Appendix A). See also the proof of the generalized version of this result, Lemma 26.

3.2 Skeleton Recovery

First, we introduce an important result for analyzing the Chow-Liu algorithm.

Lemma 10 (e.g. Bresler & Karzand (2020), Lemma 6.1 and Lemma 8.8). Let $\overline{T}$ be the skeleton of the true polytree $T = (V, E)$ and $\widehat{\overline{T}}$ be the tree estimated by the Chow-Liu algorithm (3). If an edge $(w, w') \in \overline{T}$ but $(w, w') \notin \widehat{\overline{T}}$, i.e., this edge is incorrectly missed, then there exists an edge $(v, v') \in \widehat{\overline{T}}$ with $(v, v') \notin \overline{T}$ such that $(w, w') \in \mathrm{path}_{\overline{T}}(v, v')$ and $(v, v') \in \mathrm{path}_{\widehat{\overline{T}}}(w, w')$. On such an error event, we have $|\hat\rho_{vv'}| \ge |\hat\rho_{ww'}|$.

We now give a sufficient condition on the sample size for skeleton recovery under the Gaussian linear polytree model, in which the independent noise variables satisfy $\epsilon_j \sim N(0, \omega_{jj})$ for $j = 1, \dots, p$. Then, by $x = (I - B)^{-\top}\epsilon$, the vector $x$ is also multivariate Gaussian. This fact allows us to quantify the discrepancy between population and sample pairwise correlations as in Lemma 9.

Theorem 11. Consider a Gaussian linear SEM (2) associated with a polytree $T = (V, E)$ such that $\rho_{\max} \le 1 - \delta$. Denote by $\widehat{\overline{T}}$ the skeleton estimated by the Chow-Liu algorithm (Algorithm 1), and by $\overline{T}$ the true polytree skeleton. Then, on the event $\mathcal{E}$ with probability at least $1 - 1/p^3$ defined in Lemma 9, we have exact skeleton recovery $\widehat{\overline{T}} = \overline{T}$ as long as

$$n > \frac{4C^2}{\delta^2 \rho_{\min}^2} \log p, \qquad (6)$$

where $C$ is as defined in Lemma 9.

Proof. Consider any undirected edge $(w, w') \in \overline{T}$ and any non-adjacent pair $(v, v')$ such that $(w, w') \in \mathrm{path}_{\overline{T}}(v, v')$, where $\mathrm{path}_{\overline{T}}(v, v')$ is the path connecting $v$ and $v'$ in the skeleton $\overline{T}$. If this path is a simple trek, then the correlation decay property, Lemma 3, implies that $\rho_{vv'}$ is a product of several edge correlations including $\rho_{ww'}$; hence $|\rho_{vv'}| \le |\rho_{ww'}|\,\rho_{\max}$. On the other hand, if the path is not a simple trek, then $\rho_{vv'} = 0$. In both cases, we obtain the upper bound
$$|\rho_{vv'}| - |\rho_{ww'}| \le |\rho_{ww'}|(\rho_{\max} - 1) \le -\delta \rho_{\min}.$$
Then, on the event $\mathcal{E}$, uniformly over all such pairs,
$$|\hat\rho_{vv'}| - |\hat\rho_{ww'}| \le |\hat\rho_{vv'} - \rho_{vv'}| + |\hat\rho_{ww'} - \rho_{ww'}| + |\rho_{vv'}| - |\rho_{ww'}| \le 2C\sqrt{\frac{\log p}{n}} - \delta\rho_{\min} < 0,$$
where the last inequality is due to condition (6). Lemma 10 then implies $\widehat{\overline{T}} = \overline{T}$ on the event $\mathcal{E}$.

Remark 12. Our argument for skeleton recovery follows the standard correlation-decay arguments in the literature on undirected tree learning, e.g., Nikolakakis et al. (2019); Tavassolipour et al. (2018); Bresler & Karzand (2020).
On the other hand, treating undirected tree models as rooted polytree models, our sample size condition for skeleton recovery in Theorem 11 becomes $n = O(\log p)$, which is the optimal sample size condition for undirected tree recovery.

Remark 13. The above condition implies some dependence of the sample size on the maximum in-degree $d^*$. Indeed, together with Corollary 7, the sample size condition is essentially $n \gtrsim d^* \log p$ when $\rho_{\min} \asymp 1/\sqrt{d^*}$.

3.3 CPDAG Recovery

As described in Section 2.3.2, after obtaining the estimated skeleton, the next step is to identify all v-structures by comparing $|\hat\rho_{ij}|$, for all node triplets $i - k - j$ in the skeleton, with a threshold $\rho_{\mathrm{crit}}$. Then the orientation propagation rule in Algorithm 2 is applied iteratively to orient as many undirected edges as possible. If both the skeleton and the v-structures are correctly identified, the orientation rule recovers the true CPDAG, i.e., the equivalence class (Meek, 1995).

Theorem 14. Consider a Gaussian linear SEM (2) associated with a polytree $T = (V, E)$ such that $\rho_{\max} \le 1 - \delta$. Denote by $\widehat{\overline{T}}$ and $\widehat{\mathcal{C}}_T$ the estimated skeleton from Algorithm 1 and the estimated CPDAG from Algorithm 2 with threshold $\gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$, and by $\overline{T}$ and $\mathcal{C}_T$ the true polytree skeleton and CPDAG, respectively. Then, on the event $\mathcal{E}$ with probability at least $1 - 1/p^3$ defined in Lemma 9, we have exact skeleton recovery $\widehat{\overline{T}} = \overline{T}$ as well as exact CPDAG recovery $\widehat{\mathcal{C}}_T = \mathcal{C}_T$, as long as

$$\gamma_{\mathrm{crit}} > C \quad \text{and} \quad n > C_0(\delta)\, \gamma_{\mathrm{crit}}^2\, \frac{\log p}{\rho_{\min}^4}, \qquad (7)$$

where $C$ is as defined in Lemma 9, and $C_0(\delta)$ is a constant depending only on $\delta$.

Proof. Since skeleton recovery is guaranteed by Theorem 11, it suffices to show that under condition (7) all v-structures are correctly identified on the event $\mathcal{E}$. Consider all node triplets $i - k - j$ in $\overline{T}$. If the ground truth is $i \to k \leftarrow j$, we know $\rho_{ij} = 0$; then, by (7), on $\mathcal{E}$ we have
$$|\hat\rho_{ij}| = |\hat\rho_{ij} - \rho_{ij}| < C\sqrt{(\log p)/n} < \gamma_{\mathrm{crit}}\sqrt{(\log p)/n},$$
so the v-structure is identified by Algorithm 2. In contrast, if the ground truth is $i \to k \to j$, $i \leftarrow k \leftarrow j$, or $i \leftarrow k \to j$, the correlation decay property (Lemma 3) implies $|\rho_{ij}| = |\rho_{ik}||\rho_{kj}| \ge \rho_{\min}^2$. Then, on $\mathcal{E}$,
$$|\hat\rho_{ij}| \ge |\rho_{ij}| - C\sqrt{(\log p)/n} \ge \rho_{\min}^2 - C\sqrt{(\log p)/n} > \gamma_{\mathrm{crit}}\sqrt{(\log p)/n},$$
where the last inequality is also due to (7). This means the triplet is correctly identified as a non-v-structure. In sum, we identify all v-structures exactly, and the CPDAG of $T$ is then exactly recovered by Algorithm 2 as guaranteed in Meek (1995).

Remark 15. It is worth noting the difference between the sample size conditions in Theorems 11 and 14. In particular, if $\rho_{\min} \asymp 1/\sqrt{d^*}$, the sufficient sample size condition for CPDAG recovery is essentially $n \gtrsim (d^*)^2 \log p$, while that for skeleton recovery is $n \gtrsim d^* \log p$. This dependence on the maximum in-degree appears to be a particular property of polytree learning, given that most existing theory on general sparse DAG recovery typically requires the sample size to exceed the maximum neighborhood size, e.g., Theorem 2 of Kalisch & Bühlman (2007).

3.4 Information-theoretic Lower Bounds on the Sample Size

In this subsection, we establish necessary conditions on the sample size for both skeleton and CPDAG recovery under Gaussian linear polytree models. In particular, we use Fano's method to derive information-theoretic bounds.

Theorem 16. Let $\mathcal{T}(\rho_{\min})$ be a collection of Gaussian linear polytree models such that $\rho_{\min} := \min_{i \to j \in E} |\rho_{ij}|$ is fixed and satisfies $0 < \rho_{\min} < 1/\sqrt{p}$.
In each model of this class, assume that $\rho_{\max} := \max_{i \to j \in E} |\rho_{ij}| < 1/2$, and assume $p \ge 10$. Suppose $\mathcal{T}(\rho_{\min})$ is indexed by $\theta$, with corresponding polytree $T_\theta$, covariance matrix $\Sigma_\theta$, tree skeleton $\overline{T}_\theta$, and CPDAG $\mathcal{C}_{T_\theta}$. Then for any skeleton estimator $\widehat{\overline{T}}$, there holds
$$\sup_{\theta \in \mathcal{T}(\rho_{\min})} P_{\Sigma_\theta}\bigl(\widehat{\overline{T}}(X) \neq \overline{T}_\theta\bigr) \ge 1/2 \quad \text{whenever} \quad n < \frac{1}{\rho_{\min}^2}\bigl(\log(p-2) - 2\bigr).$$
Moreover, for any CPDAG estimator $\widehat{\mathcal{C}}$, there holds
$$\sup_{\theta \in \mathcal{T}(\rho_{\min})} P_{\Sigma_\theta}\bigl(\widehat{\mathcal{C}}(X) \neq \mathcal{C}_{T_\theta}\bigr) \ge 1/2 \quad \text{whenever} \quad n < \frac{1}{5\rho_{\min}^4}\Bigl(\log\frac{(p-1)(p-2)}{2} - 2\Bigr).$$

Proof. The key idea is to apply Fano's method to appropriate sub-classes of $\mathcal{T}(\rho_{\min})$ to establish the intended information-theoretic lower bounds for both skeleton and CPDAG recovery. Generally speaking, let $\mathcal{T}_M = \{T_1, \dots, T_M\}$ be a sub-class of polytree models in $\mathcal{T}(\rho_{\min})$ with covariance matrices $\Sigma(T_1), \dots, \Sigma(T_M)$, and let the model index $\theta$ be chosen uniformly at random from $\{1, \dots, M\}$. Given the observations $X \in \mathbb{R}^{n \times p}$, a decoder $\psi$ estimates the underlying polytree structure, with maximal probability of decoding error
$$p_{\mathrm{err}}(\psi) = \max_{1 \le j \le M} P_{\Sigma(T_j)}\bigl(\psi(X) \neq T_j\bigr).$$
Fano's inequality (Thomas & Joy, 2006) shows that the maximal probability of error over $\mathcal{T}_M$ can be lower bounded as
$$\inf_\psi p_{\mathrm{err}}(\psi) \ge 1 - \frac{I(\theta; X) + 1}{\log M}.$$
Since all involved distributions are multivariate Gaussian, we apply the following entropy-based bound on the mutual information, which can be found in Wang et al. (2010):
$$I(\theta; X) \le \frac{n}{2} F(\mathcal{T}), \quad \text{where} \quad F(\mathcal{T}) := \log\det(\bar\Sigma) - \frac{1}{M}\sum_{j=1}^M \log\det(\Sigma(T_j)) \qquad (8)$$
and $\bar\Sigma := \frac{1}{M}\sum_{j=1}^M \Sigma(T_j)$ is the averaged covariance matrix.

Lower Bound for Skeleton Recovery. Consider a class of polytree models $\mathcal{T}_M = \{T_1, \dots, T_M\}$ with $M = p - 2$. These polytrees share the $p - 2$ common directed edges $1 \to (p-1),\ 2 \to (p-1),\ \dots,\ (p-2) \to (p-1)$. For the $(p-1)$-th directed edge, we let $p \to 1$ in $T_1$, $p \to 2$ in $T_2$, ..., $p \to (p-2)$ in $T_{p-2}$. Also, assume that all variables have variance one and the correlation coefficients on the skeleton all equal $\rho$ with $0 < \rho < 1/\sqrt{p}$; here we write $\rho = \rho_{\min}$ for simplicity. The polytrees in this sub-class of $\mathcal{T}(\rho)$ (defined in the statement of Theorem 16) have distinct skeletons, so
$$\inf_{\widehat{\overline{T}}} \sup_{\theta \in \mathcal{T}(\rho_{\min})} P_{\Sigma_\theta}\bigl(\widehat{\overline{T}}(X) \neq \overline{T}_\theta\bigr) \ge \inf_\psi \max_{1 \le j \le M} P_{\Sigma(T_j)}\bigl(\psi(X) \neq T_j\bigr).$$
We can easily obtain each covariance $\Sigma(T_j)$, $j = 1, \dots, M$, using Lemma 3. For example, ordering the nodes as $(1, \dots, p-2, p-1, p)$,
$$\Sigma(T_1) = \begin{pmatrix} 1 & 0 & \cdots & 0 & \rho & \rho \\ 0 & 1 & \cdots & 0 & \rho & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & \rho & 0 \\ \rho & \rho & \cdots & \rho & 1 & \rho^2 \\ \rho & 0 & \cdots & 0 & \rho^2 & 1 \end{pmatrix}.$$
Partitioning $\Sigma(T_1)$ with top-left block $A = I_{p-2}$, the Schur complement of $A$ is
$$D - B^\top A^{-1} B = \begin{pmatrix} 1 - (p-2)\rho^2 & 0 \\ 0 & 1 - \rho^2 \end{pmatrix},$$
so $\det(\Sigma(T_1)) = \det(A)\det(D - B^\top A^{-1}B) = (1-\rho^2)(1 - (p-2)\rho^2)$. Similarly, for all $j = 1, \dots, p-2$, there holds $\det(\Sigma(T_j)) = (1-\rho^2)(1 - (p-2)\rho^2)$. On the other hand, the averaged covariance is
$$\bar\Sigma = \frac{1}{M}\sum_{j=1}^M \Sigma(T_j) = \begin{pmatrix} 1 & 0 & \cdots & 0 & \rho & \rho/(p-2) \\ 0 & 1 & \cdots & 0 & \rho & \rho/(p-2) \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & \rho & \rho/(p-2) \\ \rho & \rho & \cdots & \rho & 1 & \rho^2 \\ \rho/(p-2) & \rho/(p-2) & \cdots & \rho/(p-2) & \rho^2 & 1 \end{pmatrix}.$$
As above, a Schur complement computation gives $\det(\bar\Sigma) = \bigl(1 - \rho^2/(p-2)\bigr)\bigl(1 - (p-2)\rho^2\bigr)$. Plugging these results into (8), we have
$$F(\mathcal{T}) = \log\Bigl(1 + \frac{(p-3)\rho^2}{(p-2)(1-\rho^2)}\Bigr) \le \frac{(p-3)\rho^2}{(p-2)(1-\rho^2)} \le \frac{(p-3)\rho^2}{(p-2)(1 - 1/p)} \le \rho^2.$$
Then $p_{\mathrm{err}} \ge 1 - (\frac{n}{2}\rho^2 + 1)/\log(p-2)$. To ensure $p_{\mathrm{err}} > 1/2$, it suffices to require $1 - (\frac{n}{2}\rho^2 + 1)/\log(p-2) > 1/2$, which is equivalent to $n < (\log(p-2) - 2)/\rho^2$.

Lower Bound for CPDAG Recovery. Let us now consider another class of polytree models $\mathcal{T}_M = \{T_1, \dots, T_M\}$ with $M = \binom{p-1}{2}$. All polytrees in this class are stars with hub node $p$, where $p$ points to all but two of the nodes in $\{1, \dots, p-1\}$. In $T_1$, the directed edges are $1 \to p$, $2 \to p$, $p \to 3$, $p \to 4$, ..., $p \to (p-1)$.
In $T_2$, the directed edges are $1 \to p$, $p \to 2$, $3 \to p$, $p \to 4$, ..., $p \to (p-1)$; and so on, until in $T_M$ the directed edges are $p \to 1$, $p \to 2$, ..., $p \to (p-3)$, $(p-2) \to p$, $(p-1) \to p$. Also, assume that all variables have variance one and the correlation coefficients on the skeleton all equal $\rho$ with $0 < \rho < 1/2$; again, we write $\rho = \rho_{\min}$ for simplicity. Although the polytrees in this sub-class of $\mathcal{T}(\rho)$ have the same skeleton, they have distinct CPDAGs since they have distinct sets of v-structures. Therefore,
$$\inf_{\widehat{\mathcal{C}}} \sup_{\theta \in \mathcal{T}(\rho_{\min})} P_{\Sigma_\theta}\bigl(\widehat{\mathcal{C}}(X) \neq \mathcal{C}_{T_\theta}\bigr) \ge \inf_\psi \max_{1 \le j \le M} P_{\Sigma(T_j)}\bigl(\psi(X) \neq T_j\bigr).$$
Again, each covariance $\Sigma(T_j)$, $j = 1, \dots, M$, follows from Lemma 3. For example,
$$\Sigma(T_1) = \begin{pmatrix} 1 & 0 & \rho^2 & \cdots & \rho^2 & \rho \\ 0 & 1 & \rho^2 & \cdots & \rho^2 & \rho \\ \rho^2 & \rho^2 & 1 & \cdots & \rho^2 & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \rho^2 & \rho^2 & \rho^2 & \cdots & 1 & \rho \\ \rho & \rho & \rho & \cdots & \rho & 1 \end{pmatrix}.$$
Recall that in a linear polytree model there holds $\Sigma = (I - B)^{-\top}\Omega(I - B)^{-1}$. Since $B$ can be transformed to a strictly upper triangular matrix by permuting the $p$ nodes, we have $\det(I - B) = 1$, and hence
$$\det(\Sigma) = \det(\Omega) = \prod_{j=1}^p \Bigl(1 - \sum_{i \in \mathrm{Pa}(j)} \rho_{ij}^2\Bigr).$$
Then, for $j = 1, \dots, M$, there holds $\det(\Sigma(T_j)) = (1-\rho^2)^{p-3}(1 - 2\rho^2)$, which implies $\log\det(\Sigma(T_j)) = (p-3)\log(1-\rho^2) + \log(1 - 2\rho^2)$. On the other hand, the averaged covariance is
$$\bar\Sigma = \frac{1}{M}\sum_{j=1}^M \Sigma(T_j) = \begin{pmatrix} 1 & \frac{M-1}{M}\rho^2 & \cdots & \frac{M-1}{M}\rho^2 & \rho \\ \frac{M-1}{M}\rho^2 & 1 & \cdots & \frac{M-1}{M}\rho^2 & \rho \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \frac{M-1}{M}\rho^2 & \frac{M-1}{M}\rho^2 & \cdots & 1 & \rho \\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix}.$$
The Schur complement of the $(p, p)$ entry $D = 1$ is
$$\bar\Sigma / D = A - BD^{-1}B^\top = \begin{pmatrix} 1 - \rho^2 & -\rho^2/M & \cdots & -\rho^2/M \\ -\rho^2/M & 1 - \rho^2 & \cdots & -\rho^2/M \\ \vdots & \vdots & \ddots & \vdots \\ -\rho^2/M & -\rho^2/M & \cdots & 1 - \rho^2 \end{pmatrix}.$$
It is easy to obtain all the eigenvalues of $\bar\Sigma/D$: $1 - \frac{p+1}{p-1}\rho^2$ with multiplicity 1, and $1 - \frac{p(p-3)}{(p-1)(p-2)}\rho^2$ with multiplicity $p - 2$. Plugging these results into (8), we have
$$F(\mathcal{T}) = \log\Bigl(1 - \frac{p+1}{p-1}\rho^2\Bigr) + (p-2)\log\Bigl(1 - \frac{p(p-3)}{(p-1)(p-2)}\rho^2\Bigr) - (p-3)\log(1-\rho^2) - \log(1 - 2\rho^2).$$
A direct calculation using $\log(1+x) \le x$, together with the assumptions $\rho^2 < 1/4$ and $p \ge 10$, then yields $F(\mathcal{T}) < 5\rho^4$. As in the case of skeleton recovery, we conclude that $p_{\mathrm{err}} > 1/2$ as long as
$$n < \frac{1}{5\rho^4}\Bigl(\log\frac{(p-1)(p-2)}{2} - 2\Bigr).$$

Comparing Theorem 16 with Theorems 11 and 14, we conclude that our sufficient conditions on the sample sizes for the recovery of both the skeleton and the CPDAG are sharp.

4 PC Algorithm Adapted to Polytree Models

In this section, we introduce another algorithm to recover the skeleton and CPDAG of the polytree, adapted from the PC algorithm but tailored to linear polytree structures. To implement a PC algorithm adapted to polytrees, we consider an early stopping version of the algorithm explained in Kalisch & Bühlman (2007). The following lemma gives important properties of marginal and conditional correlations on a polytree.

Lemma 17. Consider a Gaussian linear SEM (2) associated with a polytree $T = (V, E)$ such that $\rho_{\max} \le 1 - \delta$. Then we have the following:
1. For any $(i, j) \in \overline{T}$, $|\rho_{ij}| \ge \rho_{\min}$ and $|\rho_{ij|k}| \ge \delta\rho_{\min}$ for any $k \notin \{i, j\}$.
2. For any $(i, j) \notin \overline{T}$, if the path connecting $i$ and $j$ is not a trek, then $\rho_{ij} = 0$.
3. For any $(i, j) \notin \overline{T}$, if the path connecting $i$ and $j$ is a trek, then there exists some $k \in \bigl(\mathrm{adj}(i, \overline{T}) \cup \mathrm{adj}(j, \overline{T})\bigr) \setminus \{i, j\}$ such that $\rho_{ij|k} = 0$.

Proof. We discuss the three cases separately.
1. In this case, $|\rho_{ij}| \ge \rho_{\min}$ by definition. For any $k \notin \{i, j\}$, since $(i, j) \in \overline{T}$, the tree structure implies that either $i$ lies on the path connecting $j$ and $k$, or $j$ lies on the path connecting $i$ and $k$; without loss of generality, assume the former.
Then, by the correlation decay property (Lemma 3), there holds $|\rho_{jk}| \le \rho_{\max}|\rho_{ij}|$, which implies $|\rho_{ij} - \rho_{ik}\rho_{jk}| \ge |\rho_{ij}|(1 - \rho_{\max}^2)$. Then
$$|\rho_{ij|k}| = \frac{|\rho_{ij} - \rho_{ik}\rho_{jk}|}{\sqrt{(1-\rho_{ik}^2)(1-\rho_{jk}^2)}} \ge \rho_{\min}(1 - \rho_{\max}^2) \ge \delta\rho_{\min}.$$
2. This is directly implied by Lemma 3.
3. When the path connecting $i$ and $j$ is a trek, there exists some $k \in \bigl(\mathrm{adj}(i, \overline{T}) \cup \mathrm{adj}(j, \overline{T})\bigr) \setminus \{i, j\}$ on this path. Lemma 3 implies $\rho_{ij} = \rho_{ik}\rho_{jk}$, which implies $\rho_{ij|k} = 0$.

Besides the population correlations $\rho_{ij}$, for any distinct $i$, $j$, and $k$, the population and sample partial correlations can be expressed through marginal correlations as

$$\rho_{ij|k} = \frac{\rho_{ij} - \rho_{ik}\rho_{jk}}{\sqrt{(1-\rho_{ik}^2)(1-\rho_{jk}^2)}} \quad \text{and} \quad \hat\rho_{ij|k} = \frac{\hat\rho_{ij} - \hat\rho_{ik}\hat\rho_{jk}}{\sqrt{(1-\hat\rho_{ik}^2)(1-\hat\rho_{jk}^2)}}. \qquad (9)$$

Although the concept of partial correlations is not used elsewhere in this section, we will need it in later sections. The relationship between population and sample marginal and partial correlations is characterized by the following lemma, a simplified version of Corollary 1 in Kalisch & Bühlman (2007).

Lemma 18. Consider a Gaussian linear SEM (2) with $n \ge C_0 \log p$ for some numerical constant $C_0$. Then, on an event $\mathcal{E}$ with probability at least $1 - 1/p^3$, the following inequalities hold for some absolute constant $C$:
$$\|\hat\rho - \rho\|_{\max} < C\sqrt{\frac{\log p}{n}} \quad \text{and} \quad |\hat\rho_{ij|k} - \rho_{ij|k}| < C\sqrt{\frac{\log p}{n}} \quad \text{for all } i < j,\ k \notin \{i, j\}.$$

Proof. This is an easy generalization of Lemma 9 via the relationship between the cumulative distribution functions of sample partial correlations and sample correlations under multivariate normal distributions, established in Fisher (1924).

In light of the above lemma, it is natural to consider the early-stopping PC algorithm with tuning parameter $\gamma_{\mathrm{crit}}$ described in Algorithm 3; the estimated skeleton is denoted $\widehat{\overline{T}}$.

Algorithm 3 Estimating the polytree skeleton by the simplified PC algorithm
Input: The $n \times p$ data matrix $X$; tuning parameter $\gamma_{\mathrm{crit}}$.
Output: Estimated skeleton $\widehat{\overline{T}}$.
1: Compute the sample correlations $\hat\rho_{ij}$ for all $1 \le i < j \le p$;
2: Compute the sample partial correlations $\hat\rho_{ij|k}$ for all $1 \le i < j \le p$ and any $k \notin \{i, j\}$;
3: The complete undirected graph over the $p$ nodes is denoted $G_0$;
4: for each unordered pair $(i, j)$ do
5:   if $|\hat\rho_{ij}| < \gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$ then
6:     remove $(i, j)$ from $G_0$
7:   else if $|\hat\rho_{ij|k}| < \gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$ for some $k \notin \{i, j\}$ then
8:     remove $(i, j)$ from $G_0$
9:   end if
10: end for
11: The resulting graph is denoted $\widehat{\overline{T}}$.

Our consistency result for polytree skeleton recovery by the simplified PC algorithm of Algorithm 3 consists of two parts. In the first part, we show that $\widehat{\overline{T}} \subset \overline{T}$ as long as the tuning parameter $\gamma_{\mathrm{crit}}$ in the threshold is chosen large enough; in the second part, we show that $\widehat{\overline{T}} = \overline{T}$ if we further assume that $\rho_{\min}$ satisfies a lower bound condition.
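Before stating the formal guarantee (Theorem 19), here is a compact sketch (ours) of Algorithm 3: an edge $(i, j)$ survives only if both the marginal correlation and every one-node partial correlation, computed via Eq. (9), stay above the threshold. The default `gamma_crit` is an illustrative assumption:

```python
import numpy as np
from itertools import combinations

def pc_polytree_skeleton(X, gamma_crit=2.0):
    """Algorithm 3: early-stopped PC, conditioning sets of size at most one."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    thr = gamma_crit * np.sqrt(np.log(p) / n)
    edges = set()
    for i, j in combinations(range(p), 2):
        if abs(R[i, j]) < thr:
            continue                              # removed by the marginal test
        def pcorr(k):                             # partial correlation, Eq. (9)
            num = R[i, j] - R[i, k] * R[j, k]
            den = np.sqrt((1 - R[i, k] ** 2) * (1 - R[j, k] ** 2))
            return num / den
        if all(abs(pcorr(k)) >= thr for k in range(p) if k not in (i, j)):
            edges.add((i, j))
    return edges
```

This brute-force version costs $O(p^3)$ partial-correlation evaluations; it is meant only to make the removal rule explicit, not to be an optimized implementation.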
Theorem 19. Consider a Gaussian linear SEM (2) associated with a polytree $T = (V, E)$ such that $\rho_{\max} \le 1 - \delta$. Let $\widehat{\overline{T}}$ be the estimated skeleton from the simplified PC algorithm given in Algorithm 3 with threshold $\gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$. If the tuning parameter satisfies $\gamma_{\mathrm{crit}} > C$, where $C$ is as defined in Lemma 18, then, on the event $\mathcal{E}$ defined in Lemma 18 with probability at least $1 - 1/p^3$, we have $\widehat{\overline{T}} \subset \overline{T}$. In addition, if

$$n > \frac{4\gamma_{\mathrm{crit}}^2}{\delta^2 \rho_{\min}^2} \log p, \qquad (10)$$

we have $\widehat{\overline{T}} = \overline{T}$ on $\mathcal{E}$, i.e., exact recovery of the polytree skeleton.

Proof. Let the event $\mathcal{E}$ be defined as in Lemma 18. Consider any $(i, j) \notin \overline{T}$. If the path connecting $i$ and $j$ is not a trek, we have $\rho_{ij} = 0$ by Lemma 17, and Lemma 18 implies $|\hat\rho_{ij}| < C\sqrt{(\log p)/n} < \gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$, so $(i, j)$ is excluded from $G_0$ by Algorithm 3. On the other hand, if the path connecting $i$ and $j$ is a trek, Lemma 17 implies that there exists some $k \notin \{i, j\}$ such that $\rho_{ij|k} = 0$; then Lemma 18 implies $|\hat\rho_{ij|k}| < C\sqrt{(\log p)/n} < \gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$, so $(i, j)$ is also excluded from $G_0$. Hence any $(i, j) \notin \overline{T}$ is removed from $G_0$, whether or not the path connecting the pair is a trek; thus $\widehat{\overline{T}} \subset \overline{T}$.

Further, condition (10) implies $\delta\rho_{\min} > 2\gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$. For any $(i, j) \in \overline{T}$, Lemmas 18 and 17 imply
$$|\hat\rho_{ij}| \ge |\rho_{ij}| - C\sqrt{(\log p)/n} \ge \rho_{\min} - C\sqrt{(\log p)/n} > \gamma_{\mathrm{crit}}\sqrt{(\log p)/n},$$
and, for any $k \notin \{i, j\}$,
$$|\hat\rho_{ij|k}| \ge |\rho_{ij|k}| - C\sqrt{(\log p)/n} \ge \delta\rho_{\min} - C\sqrt{(\log p)/n} > \gamma_{\mathrm{crit}}\sqrt{(\log p)/n}.$$
This implies $(i, j) \in \widehat{\overline{T}}$, so $\widehat{\overline{T}} = \overline{T}$.

Remark 20. Here we show that the skeleton $\overline{T}$ can be exactly recovered with high probability under the sample size condition (10), which is comparable to the condition (6) for skeleton recovery by the Chow-Liu algorithm. Note that our simplified PC algorithm is aimed only at recovering the skeleton, rather than identifying all marginal/conditional independence relationships among the variables. This is why the sample size condition (10) can be smaller than the necessary condition for CPDAG recovery established in Theorem 16, and this is the crucial difference between Theorem 19 and standard CPDAG recovery results for the PC algorithm in sparse DAG learning, e.g., Kalisch & Bühlman (2007). To understand why Algorithm 3 may achieve consistent skeleton recovery even without identifying the marginal/conditional independence relationships correctly, consider a trek $i \leftarrow k \to j$ in the true polytree with $|\rho_{ik}| = |\rho_{jk}| = \rho_{\min}$. In this case, Algorithm 3 could remove the edge between $i$ and $j$ simply because $|\hat\rho_{ij}| < \gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$, as long as $|\rho_{ij}| = |\rho_{ik}||\rho_{jk}| = \rho_{\min}^2$ is sufficiently small. Following the idea of the PC algorithm, one might then record $(X_i, X_j)$ as an independent pair of variables, incorrectly; nevertheless, this may still lead to correct skeleton recovery.

To recover the CPDAG, we can naturally apply Algorithm 2 after Algorithm 3. The following result is immediate and we omit the proof.

Theorem 21. Under the assumptions of Theorem 19, if we further assume the sample size condition (7) holds, then Algorithms 3 and 2 recover the true CPDAG exactly with probability at least $1 - 1/p^3$.

5 Inverse Correlation Matrix Estimation

In this section, we are interested in recovering the inverse correlation matrix of the polytree model from a recovered CPDAG. This is particularly useful for likelihood calculation (see, e.g., van de Geer & Bühlmann (2013)), and could thus be useful for choosing the values of tuning parameters with likelihood-based cross-validation. To estimate the inverse correlation matrix, due to the scaling invariance of population and sample correlations, we assume without loss of generality that all $X_i$'s have unit variances. Then the inverse correlation matrix is $\Theta := \Sigma^{-1} = (I - B)\Omega^{-1}(I - B)^\top$. The major goal of this section is to study how well we can estimate $\Theta$. It is noteworthy that the error bound we obtain (Theorem 24) depends on the total number of v-structures, in addition to the usual dimension and sample size.

5.1 Inverse Correlation Matrix and CPDAG

First, let us choose one realization from the equivalence class represented by this CPDAG, and still refer to it as $T$ when there is no confusion.
By $\Theta = (I - B)\Omega^{-1}(I - B)^\top$ and the fact that $\beta_{ij} = \rho_{ij}$ for each $i \to j \in T$ (due to the unit variances), we can represent the entries of the inverse correlation matrix through the correlation coefficients over the polytree as

$$\theta_{ij} = \begin{cases} -\rho_{ij}/\omega_{jj} & \text{if } i \to j \in T, \\ -\rho_{ji}/\omega_{ii} & \text{if } j \to i \in T, \\ \rho_{ik}\rho_{jk}/\omega_{kk} & \text{if } i \to k \leftarrow j \in T, \\ 0 & \text{otherwise}, \end{cases} \quad \text{for } i \neq j, \qquad (11)$$

$$\theta_{jj} = \frac{1}{\omega_{jj}} + \sum_{k \in \mathrm{Ch}(j)} \frac{\rho_{jk}^2}{\omega_{kk}}, \quad j = 1, \dots, p, \qquad (12)$$

where $\omega_{jj} = 1 - \sum_{i \in \mathrm{Pa}(j)} \rho_{ij}^2$ for $j = 1, \dots, p$. Notice that the common child $k$ in $i \to k \leftarrow j \in T$ must be unique in a polytree.

A natural question is whether we can represent the inverse correlation matrix through the CPDAG $\mathcal{C}_T$ alone. This question is important, given that we can only hope to recover $\mathcal{C}_T$ by the algorithms introduced in Sections 2.3.1 and 2.3.2. We first give a useful lemma, which explains for which nodes $j$ the noise variance $\omega_{jj} = 1 - \sum_{i \in \mathrm{Pa}(j)} \rho_{ij}^2$ is well-defined on the CPDAG $\mathcal{C}_T$, i.e., invariant to the particular polytree chosen from the equivalence class.

Lemma 22. Denote by $\mathcal{C}_T$ the true CPDAG of the polytree $T$. Let $V_m$ be the collection of nodes $j$ with at least one undirected edge $i - j$ in $\mathcal{C}_T$, and let $V_d$ be the collection of nodes $j$ all of whose neighbors are connected to $j$ by a directed edge in $\mathcal{C}_T$; thus $V_m$ and $V_d$ form a partition of all nodes. Then we have the following properties:
1. For each $j \in V_m$, there is no $i$ satisfying $i \to j \in \mathcal{C}_T$.
2. For each $j \in V_m$ and any polytree $T'$ within the equivalence class of $\mathcal{C}_T$, $j$ has at most one parent in $T'$.
3. For each $j \in V_d$, since the set of parents of $j$ is determined by the CPDAG $\mathcal{C}_T$, the corresponding noise variance $\omega_{jj} = 1 - \sum_{i \in \mathrm{Pa}(j)} \rho_{ij}^2$ is well-defined.
4. Combining the third property with the contrapositive of the first, for each $i \to j \in \mathcal{C}_T$ we have $j \in V_d$, and the corresponding noise variance $\omega_{jj}$ is thereby well-defined.

We omit the proof, since the result follows directly from the fact that v-structures are kept unchanged in all polytrees within the equivalence class determined by $\mathcal{C}_T$. The following result shows that the inverse correlation matrix can be represented through the pairwise correlations on the skeleton together with the CPDAG.

Lemma 23. Let $V_m$ and $V_d$ be the partition of nodes defined in Lemma 22. Then the inverse correlation matrix can be represented as

$$\theta_{ij} = \begin{cases} -\rho_{ij}/\omega_{jj} & \text{if } i \to j \in \mathcal{C}_T, \\ -\rho_{ji}/\omega_{ii} & \text{if } j \to i \in \mathcal{C}_T, \\ -\rho_{ij}/(1 - \rho_{ij}^2) & \text{if } i - j \in \mathcal{C}_T, \\ \rho_{ik}\rho_{jk}/\omega_{kk} & \text{if } i \to k \leftarrow j \in \mathcal{C}_T, \\ 0 & \text{otherwise}, \end{cases} \quad \text{for } i \neq j,$$

$$\theta_{jj} = \begin{cases} \dfrac{1}{\omega_{jj}} + \sum\limits_{k:\, j \to k \in \mathcal{C}_T} \dfrac{\rho_{jk}^2}{\omega_{kk}}, & j \in V_d, \\[2ex] 1 + \sum\limits_{k:\, j - k \in \mathcal{C}_T} \dfrac{\rho_{jk}^2}{1 - \rho_{jk}^2} + \sum\limits_{k:\, j \to k \in \mathcal{C}_T} \dfrac{\rho_{jk}^2}{\omega_{kk}}, & j \in V_m. \end{cases}$$

Here $\omega_{jj} = 1 - \sum_{i \in \mathrm{Pa}(j)} \rho_{ij}^2$ is well-defined in all of the above formulas, since $\mathrm{Pa}(j)$ is well-defined whenever it is invoked.

Proof. Let $T'$ be a polytree in the equivalence class of $\mathcal{C}_T$. For $\theta_{ij}$, the listed cases are exhaustive by Lemma 22; the only case that needs checking is $i - j \in \mathcal{C}_T$. Suppose the edge is oriented as $i \to j$ in $T'$. By property 2 of Lemma 22, $i$ is the only parent of $j$, so $\omega_{jj} = 1 - \rho_{ij}^2$, and by Eq. (11), $\theta_{ij} = -\rho_{ij}/(1 - \rho_{ij}^2)$. It is easy to see that the result is identical if the undirected edge is oriented as $j \to i$ in some polytree.

For $\theta_{jj}$ in the non-obvious case $j \in V_m$, the undirected edges $j - k$ incident to $j$ in $\mathcal{C}_T$ are, in any polytree $T'$, either all oriented as $j \to k$, or all but one so oriented (Lemma 22). In the first case, $\omega_{jj} = 1$, and Eq. (12) applied in $T'$ becomes
$$\theta_{jj} = 1 + \sum_{k:\, j - k \in \mathcal{C}_T} \frac{\rho_{jk}^2}{\omega_{kk}} + \sum_{k:\, j \to k \in \mathcal{C}_T} \frac{\rho_{jk}^2}{\omega_{kk}} = 1 + \sum_{k:\, j - k \in \mathcal{C}_T} \frac{\rho_{jk}^2}{1 - \rho_{jk}^2} + \sum_{k:\, j \to k \in \mathcal{C}_T} \frac{\rho_{jk}^2}{\omega_{kk}},$$
where $\omega_{kk} = 1 - \rho_{jk}^2$ for the undirected neighbors $k$, according to the result derived above for undirected edges. In the second case, suppose $k_1$ is the single parent of $j$ in $T'$. Then $\omega_{jj} = 1 - \rho_{jk_1}^2$, and Eq. (12) reads
$$\theta_{jj} = 1 + \frac{\rho_{jk_1}^2}{1 - \rho_{jk_1}^2} + \sum_{k \neq k_1:\, j - k \in \mathcal{C}_T} \frac{\rho_{jk}^2}{1 - \rho_{jk}^2} + \sum_{k:\, j \to k \in \mathcal{C}_T} \frac{\rho_{jk}^2}{\omega_{kk}},$$
using $1/\omega_{jj} = 1 + \rho_{jk_1}^2/(1 - \rho_{jk_1}^2)$.
It is easy to see that $\theta_{jj}$ is identical in the two cases, which completes the proof.

5.2 Inverse Correlation Matrix Estimation

By Lemma 23, we can estimate the inverse correlation matrix through the estimated CPDAG $\widehat{\mathcal{C}}_T$, the sample correlations over the estimated tree skeleton, and an estimated noise variance $\hat\omega_{jj}$ for each $j \in \widehat{V}_d$:

$$\hat\theta_{ij} = \begin{cases} -\hat\rho_{ij}/\hat\omega_{jj} & \text{if } i \to j \in \widehat{\mathcal{C}}_T, \\ -\hat\rho_{ji}/\hat\omega_{ii} & \text{if } j \to i \in \widehat{\mathcal{C}}_T, \\ -\hat\rho_{ij}/(1 - \hat\rho_{ij}^2) & \text{if } i - j \in \widehat{\mathcal{C}}_T, \\ \hat\rho_{ik}\hat\rho_{jk}/\hat\omega_{kk} & \text{if } i \to k \leftarrow j \in \widehat{\mathcal{C}}_T, \\ 0 & \text{otherwise}, \end{cases} \quad \text{for } i \neq j, \qquad (13)$$

$$\hat\theta_{jj} = \begin{cases} \dfrac{1}{\hat\omega_{jj}} + \sum\limits_{k:\, j \to k \in \widehat{\mathcal{C}}_T} \dfrac{\hat\rho_{jk}^2}{\hat\omega_{kk}}, & j \in \widehat{V}_d, \\[2ex] 1 + \sum\limits_{k:\, j - k \in \widehat{\mathcal{C}}_T} \dfrac{\hat\rho_{jk}^2}{1 - \hat\rho_{jk}^2} + \sum\limits_{k:\, j \to k \in \widehat{\mathcal{C}}_T} \dfrac{\hat\rho_{jk}^2}{\hat\omega_{kk}}, & j \in \widehat{V}_m. \end{cases} \qquad (14)$$

Here $\widehat{V}_d$ and $\widehat{V}_m$ are defined through the estimated CPDAG $\widehat{\mathcal{C}}_T$ in the same way as in Lemma 22. The CPDAG can be estimated through Algorithms 1 and 2, and the sample correlations over the estimated skeleton are defined naturally. The remaining question is how to estimate the noise variances for $j \in \widehat{V}_d$. One natural method is to estimate $\omega_{jj}$ based on (5): $\hat\omega_{jj} = 1 - \sum_{i \in \widehat{\mathrm{Pa}}(j)} \hat\rho_{ij}^2$ for each $j \in \widehat{V}_d$, where $\widehat{\mathrm{Pa}}(j)$ is the estimated parent set. However, the statistical properties of this estimator are not easy to derive. Instead, we propose to estimate the noise variance through the standard unbiased mean squared error $\hat\omega_{jj} = \mathrm{MSE}_j$ of the least squares fit of the linear model $X_j = \sum_{i \in \widehat{\mathrm{Pa}}(j)} \beta_{ij} X_i + \epsilon_j$. Under the Gaussian assumption, if $\widehat{\mathcal{C}}_T = \mathcal{C}_T$, we have $\widehat{V}_d = V_d$ and $\widehat{\mathrm{Pa}}(j) = \mathrm{Pa}(j)$ for each $j \in V_d$; then it is well known that the least squares MSE $\hat\omega_{jj}$ is an unbiased estimate of $\omega_{jj}$. In fact, there holds

$$\hat\omega_{jj} \overset{d}{=} \omega_{jj}\, \frac{\chi^2_{\,n - d^{\mathrm{in}}_j}}{n - d^{\mathrm{in}}_j}, \quad j \in V_d, \qquad (15)$$

where $d^{\mathrm{in}}_j = |\mathrm{Pa}(j)|$ is the in-degree of node $j$.
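The following is a sketch (ours) of the plug-in estimator (13)-(14). It assumes the estimated CPDAG is given as a set `directed` of pairs $(i, j)$ meaning $i \to j$, a set `undirected` of frozensets $\{i, j\}$, and a dict `omega_hat` holding the least-squares MSE of Eq. (15) for each node in $\widehat{V}_d$; these container names are illustrative assumptions:

```python
import numpy as np

def inverse_corr_estimate(R, directed, undirected, omega_hat):
    """Plug-in inverse correlation matrix per Eqs. (13)-(14).
    R: p x p sample correlation matrix (only skeleton entries are used)."""
    p = R.shape[0]
    Theta = np.zeros((p, p))
    parents = {j: [i for i, k in directed if k == j] for j in range(p)}
    for i, j in directed:                       # first two cases of (13)
        Theta[i, j] = Theta[j, i] = -R[i, j] / omega_hat[j]
    for e in undirected:                        # third case of (13)
        i, j = tuple(e)
        Theta[i, j] = Theta[j, i] = -R[i, j] / (1 - R[i, j] ** 2)
    for k in range(p):                          # v-structure case of (13)
        for i in parents[k]:
            for j in parents[k]:
                if i < j:
                    Theta[i, j] = Theta[j, i] = R[i, k] * R[j, k] / omega_hat[k]
    for j in range(p):                          # diagonal entries, Eq. (14)
        und = [tuple(e - {j})[0] for e in undirected if j in e]
        ch = [k for i, k in directed if i == j]
        base = 1.0 if und else omega_hat[j]     # j in V_m  vs  j in V_d
        Theta[j, j] = 1.0 / base \
            + sum(R[j, k] ** 2 / (1 - R[j, k] ** 2) for k in und) \
            + sum(R[j, k] ** 2 / omega_hat[k] for k in ch)
    return Theta
```

Note that `omega_hat` is never accessed for nodes in $\widehat{V}_m$: by Lemma 22, every directed edge points into $\widehat{V}_d$, so all denominators above are well-defined.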
Finally, we present our result on the estimation error of the inverse correlation matrix estimator defined above.

Theorem 24. Consider the linear polytree SEM (2) associated with a polytree $T = (V, E)$, where all variables have unit variances. Denote the minimum noise variance $\omega_{\min} = \min\{\omega_{jj} : j \in V_d\} \wedge \min\{1 - \rho_{ij}^2 : i - j \in \mathcal{C}_T\}$, and let $\nu$ be the total number of v-structures. It is easy to verify that all of these quantities depend only on the CPDAG $\mathcal{C}_T$. Assume that $\rho_{\max} \le 1 - \delta$ and $\omega_{\min} \ge \delta$ for some constant $\delta > 0$. Assume the estimated CPDAG $\widehat{\mathcal{C}}_T$ is obtained by Algorithms 1 and 2 with threshold $\gamma_{\mathrm{crit}}\sqrt{(\log p)/n}$, and that (7) in Theorem 14 holds, i.e., $\gamma_{\mathrm{crit}} > C$ and $n > \widetilde{C}\,(\log p)/\rho_{\min}^4$, where $C$ is as defined in Lemma 9. Then, with probability at least $1 - p^{-2}$, the estimated inverse correlation matrix defined in (13) and (14) satisfies

$$\|\widehat\Theta - \Theta\|_{\ell_1} \le \widetilde{C}\,(p + \nu)\,\sqrt{\frac{\log p}{n}}. \qquad (16)$$

Here and below, $\widetilde{C}$ represents a constant depending only on $\delta$ and $\gamma_{\mathrm{crit}}$, whose value may change from line to line, while $C$ represents an absolute constant whose value may also change from line to line.

Proof. Given that (7) in Theorem 14 holds, with probability at least $1 - p^{-3}$ we have $\|\hat\rho - \rho\|_{\max} < C\sqrt{(\log p)/n}$ and the true CPDAG is exactly recovered, i.e., $\widehat{\mathcal{C}}_T = \mathcal{C}_T$; consequently, the estimated noise variances satisfy (15). Denote the maximum in-degree $d^* = \max\{d^{\mathrm{in}}_j : j \in V_d\} \vee 1$. From Corollary 7, we have $\rho_{\min} < 1/\sqrt{d^*}$, so the assumption $n > \widetilde{C}\,(\log p)/\rho_{\min}^4$ implies $n > \widetilde{C}\,(d^*)^2 \log p$. Then, by standard concentration inequalities for chi-square random variables, (15) implies that with probability at least $1 - p^{-3}$,
$$\max_{j \in V_d} |\hat\omega_{jj} - \omega_{jj}| \le C\sqrt{\frac{\log p}{n}}$$
for some absolute constant $C$.

With the above concentration bounds for the $\hat\omega_{jj}$'s and $\hat\rho_{ij}$'s, the formula (13) and the assumption $\omega_{\min} \ge \delta$ give, for any $i \neq j$,

$$|\hat\theta_{ij} - \theta_{ij}| \le \begin{cases} \widetilde{C}\sqrt{\dfrac{\log p}{n}} & \text{if } (i, j) \in \overline{T} \text{ or } i \to k \leftarrow j \in \mathcal{C}_T \text{ for some } k, \\ 0 & \text{otherwise}, \end{cases} \qquad (17)$$

which further implies
$$\sum_{i \neq j} |\hat\theta_{ij} - \theta_{ij}| \le \widetilde{C}\,(p + \nu)\sqrt{\frac{\log p}{n}},$$
since the skeleton has $p - 1$ edges and each v-structure contributes one additional non-zero pair. Further, (14) implies, for each $j = 1, \dots, p$,
$$|\hat\theta_{jj} - \theta_{jj}| \le \widetilde{C}\,(1 + d_j)\sqrt{\frac{\log p}{n}},$$
where $d_j$ is the degree of node $j$ in the skeleton $\overline{T}$. Since $\sum_{j=1}^p d_j = 2(p-1)$, this implies
$$\sum_{j=1}^p |\hat\theta_{jj} - \theta_{jj}| \le \widetilde{C}\,p\sqrt{\frac{\log p}{n}}.$$
Putting the above results together, we get (16).

6 Extension to Group Polytree Linear Structural Equation Models

In this section, we consider an extension of the linear polytree structural equation model (1) to the case of variable groups. It will become clear that this is a natural and straightforward extension of our theory, in which substituting the correlation with a multivariate counterpart establishes the correlation decay property needed to prove sufficient sample size conditions for correct graph recovery. Such group polytree (or DAG) models may arise when certain variables are closely related or driven by common latent variables.

Assume the random vector $x = [X_1, \dots, X_p]^\top$ is partitioned into $p$ groups, $x = [x_1^\top, \dots, x_p^\top]^\top$, where $x_i$ is an $l_i$-dimensional random vector. We consider the following group linear polytree structural equation model over $T = (V, E)$:

$$x_j = \sum_{i=1}^p B_{ij}^\top x_i + \epsilon_j = \sum_{i \in \mathrm{Pa}(j)} B_{ij}^\top x_i + \epsilon_j, \quad j = 1, \dots, p, \qquad (18)$$

where the $l_i \times l_j$ matrix $B_{ij} \neq 0$ if and only if $i \to j \in E$. Also, assume all $\epsilon_j$'s are independent multivariate normal random vectors. If we denote

$$B = \begin{pmatrix} 0 & B_{12} & \cdots & B_{1p} \\ B_{21} & 0 & \cdots & B_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ B_{p1} & B_{p2} & \cdots & 0 \end{pmatrix}$$

and $\epsilon = [\epsilon_1^\top, \dots, \epsilon_p^\top]^\top$, then the SEM can still be represented as $x = B^\top x + \epsilon$. Denote the covariance matrices

$$\mathrm{Cov}(x) = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1p} \\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \Sigma_{p1} & \Sigma_{p2} & \cdots & \Sigma_{pp} \end{pmatrix} \quad \text{and} \quad \mathrm{Cov}(\epsilon) = \Omega = \begin{pmatrix} \Omega_{11} & 0 & \cdots & 0 \\ 0 & \Omega_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Omega_{pp} \end{pmatrix};$$

we still have $\Sigma = (I - B)^{-\top}\Omega(I - B)^{-1}$. Again, our goal is to recover the polytree structure $T = (V, E)$ from $n$ i.i.d. observations $x_1, \dots, x_n$ of $x$.

Our algorithm for group polytree learning is similar to that introduced in Section 2, with the pairwise sample correlations $\hat\rho_{ij}$ replaced by the leading sample canonical correlations between $x_i$ and $x_j$. To be specific, denote $R_{ij} = \Sigma_{ii}^{-1/2}\Sigma_{ij}\Sigma_{jj}^{-1/2} \in \mathbb{R}^{l_i \times l_j}$; then $\rho_{ij} := \|R_{ij}\|$ is the leading population canonical correlation coefficient between $x_i$ and $x_j$. Correspondingly, denote the sample version $\widehat{R}_{ij} = \widehat\Sigma_{ii}^{-1/2}\widehat\Sigma_{ij}\widehat\Sigma_{jj}^{-1/2}$, with leading sample canonical correlation coefficient $\hat\rho_{ij} := \|\widehat{R}_{ij}\|$. We then apply Algorithm 1 to recover the polytree skeleton and Algorithm 2 to recover the CPDAG, with the $\hat\rho_{ij}$'s now representing leading sample canonical correlations. In analogy, we also denote by $\eta_{ij} = \sigma_{\min}(R_{ij})$ the smallest pairwise population canonical correlation coefficient. These quantities are not employed in the algorithms, but are used in our theoretical result on CPDAG recovery.
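As a sketch (ours, under the stated assumptions) of the group-model edge weight: the leading sample canonical correlation is the largest singular value of the whitened cross-covariance block, and the $1/n$ normalizations cancel. `Xi` and `Xj` denote the $n \times l_i$ and $n \times l_j$ column blocks of the data matrix:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def leading_canonical_corr(Xi, Xj):
    """Largest singular value of R_hat_ij = S_ii^{-1/2} S_ij S_jj^{-1/2}.
    Requires n > max(l_i, l_j) so the within-group covariances are invertible."""
    Xi = Xi - Xi.mean(0)
    Xj = Xj - Xj.mean(0)
    Sii, Sjj, Sij = Xi.T @ Xi, Xj.T @ Xj, Xi.T @ Xj   # unnormalized; n's cancel
    R = mpow(Sii, -0.5) @ Sij @ mpow(Sjj, -0.5)
    s = np.linalg.svd(R, compute_uv=False)
    return s[0]            # s[-1] would estimate the smallest value eta_ij
```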
In the sequel, we extend the consistency results, Theorems 11 and 14, to the group polytree SEM. Since sample canonical correlations are invariant under linear transformations, in the theory we can assume $x_i \sim N(0, I_{l_i})$ for each $i = 1, \dots, p$ without loss of generality. Indeed, if we replace $x_i$ with $\tilde{x}_i = \Sigma_{ii}^{-1/2} x_i$, then
$$\tilde{x}_j = \sum_{i \in \mathrm{Pa}(j)} \Sigma_{jj}^{-1/2} B_{ij}^\top \Sigma_{ii}^{1/2}\, \tilde{x}_i + \Sigma_{jj}^{-1/2}\epsilon_j, \quad j = 1, \dots, p,$$
which shares the same polytree structure. Before presenting our consistency results, we also need two lemmas. The first is an extension of the correlation decay property, Lemma 3.

Lemma 25. Consider the Gaussian group linear polytree model (18) with associated polytree $T = (V, E)$. For each pair $(i, j)$, if the path connecting $i$ and $j$ is not a trek, then $\rho_{ij} = 0$; if the path $\tau_{ij}$ connecting $i$ and $j$ is a trek, then
$$\prod_{(s,t) \in \tau_{ij}} \eta_{st} \;\le\; \rho_{ij} \;\le\; \prod_{(s,t) \in \tau_{ij}} \rho_{st}.$$

Proof. Since population canonical correlations are invariant under linear transformations, we can assume $x_i \sim N(0, I_{l_i})$ for each $i = 1, \dots, p$ without loss of generality. Then, for each $i \to j \in E$, one easily obtains $R_{ij} = B_{ij}$ and $R_{ji} = B_{ij}^\top$. Furthermore, an induction argument shows that for each pair $(i, j)$: if the path connecting $i$ and $j$ is not a trek, then $R_{ij} = 0$ and thereby $\rho_{ij} = 0$; if the path connecting $i$ and $j$ is a trek, denoted
$$\tau_{ij}:\; i = v^L_\ell \leftarrow v^L_{\ell-1} \leftarrow \cdots \leftarrow v^L_1 \leftarrow v_0 \rightarrow v^R_1 \rightarrow \cdots \rightarrow v^R_{r-1} \rightarrow v^R_r = j,$$
then
$$R_{ij} = B_{v^L_{\ell-1} v^L_\ell}^\top B_{v^L_{\ell-2} v^L_{\ell-1}}^\top \cdots B_{v_0 v^L_1}^\top\, B_{v_0 v^R_1} \cdots B_{v^R_{r-2} v^R_{r-1}} B_{v^R_{r-1} v^R_r},$$
which further implies
$$R_{ij} = R_{v^L_\ell v^L_{\ell-1}} R_{v^L_{\ell-1} v^L_{\ell-2}} \cdots R_{v^L_1 v_0}\, R_{v_0 v^R_1} \cdots R_{v^R_{r-1} v^R_r}.$$
The fact $\rho_{ij} = \|R_{ij}\|$, together with the submultiplicativity of the operator norm, implies $\rho_{ij} \le \prod_{(s,t) \in \tau_{ij}} \rho_{st}$; similarly, since $\sigma_{\min}(AB) \ge \sigma_{\min}(A)\sigma_{\min}(B)$, we obtain the lower bound $\rho_{ij} \ge \prod_{(s,t) \in \tau_{ij}} \eta_{st}$.

The second lemma is an extension of Lemma 9.

Lemma 26. Consider a Gaussian group linear SEM (18) with $n \ge C_0(l_{\max} + \log p)$ for some absolute constant $C_0$, where $l_{\max} = \max_{1 \le i \le p} l_i$. Then, on an event $\mathcal{E}$ with probability at least $1 - 1/p^3$, the following inequality holds for some absolute constant $C$:
$$\max_{1 \le i < j \le p} \bigl\|\widehat{R}_{ij} - R_{ij}\bigr\| < C\sqrt{\frac{l_{\max} + \log p}{n}}.$$

With these two lemmas, the consistency results extend as follows.

Theorem 27. Consider a Gaussian group linear SEM (18) associated with a polytree $T = (V, E)$ such that $\rho_{\max} \le 1 - \delta$. Then, with probability at least $1 - 1/p^3$, Algorithm 1 applied with leading sample canonical correlations exactly recovers the skeleton as long as
$$n > C_0(\delta)\, \frac{l_{\max} + \log p}{\rho_{\min}^2},$$
where $C_0(\delta)$ is a constant depending only on $\delta$. Further, denote by $\mathcal{C}_T$ the true polytree CPDAG, and by $\widehat{\mathcal{C}}_T$ the estimated CPDAG from Algorithm 2 with threshold $\gamma_{\mathrm{crit}}\sqrt{(l_{\max} + \log p)/n}$. If $\eta_{\min} > 0$, then, with probability at least $1 - 1/p^3$, we have $\widehat{\mathcal{C}}_T = \mathcal{C}_T$ as long as
$$\gamma_{\mathrm{crit}} > C \quad \text{and} \quad n > C_0(\delta)\,\gamma_{\mathrm{crit}}^2\, \frac{l_{\max} + \log p}{\eta_{\min}^4},$$
where $C$ is an absolute constant and $C_0(\delta)$ is a constant depending only on $\delta$.

Proof. The proof is exactly the same as those of Theorems 11 and 14. Note that the sample size condition for CPDAG recovery relies on the minimum least population canonical correlation coefficient: if the ground truth of a triplet is $i \to k \to j$, $i \leftarrow k \leftarrow j$, or $i \leftarrow k \to j$, Lemma 25 only implies that $\rho_{ij} \ge \eta_{\min}^2$.

Remark 28. Our CPDAG recovery result for the group polytree SEM relies on the assumption $\eta_{\min} > 0$. At this moment, we do not know whether this is a necessary condition, and we plan to investigate this in future work.

7 Numerical Experiments

To illustrate the feasibility and quantitative performance of the polytree learning method based on the Chow-Liu algorithm, we implement Algorithms 1 and 2 in Python and test them on simulated data (Section 7.1). We further test on commonly used benchmark datasets (Section 7.2) to assess the robustness and applicability to real-world data. In all experiments, we set the threshold $\rho_{\mathrm{crit}}$ (Algorithm 2) for rejecting independence of a pair of nodes based on testing for zero correlation under Gaussian distributions. Specifically,
$$\rho_{\mathrm{crit}} = \sqrt{1 - \frac{1}{1 + t_{\alpha/2}^2/(n-2)}},$$
where $t_{\alpha/2}$ is the $1 - \alpha/2$ quantile of a t-distribution with $n - 2$ degrees of freedom, and we use $\alpha = 0.1$. For comparison, we run the same data through two basic and representative structure learning methods: score-based hill climbing (Gámez et al., 2011) and the constraint-based PC algorithm (Spirtes et al., 2000), along with the latter's early-stopping adaptation to polytrees (Algorithm 3). We use the R implementations of hill climbing and the PC algorithm from the bnlearn and pcalg packages, respectively, with all default options and parameters. We implemented the polytree-adapted PC algorithm in Python.
For comparison, we run the same data through two basic and representative structure learning methods: the score-based hill climbing (Gámez et al., 2011) and the constraint-based PC algorithm (Spirtes et al., 2000), along with its early-stopping adaptation to polytrees (Algorithm 3). We use the R implementations of hill climbing and the PC algorithm from the bnlearn and pcalg packages, respectively, with all default options and parameters; we implemented the polytree-adapted PC algorithm in Python. An $\alpha = 0.01$ is used for the PC algorithm, as recommended in Kalisch & Bühlman (2007). All code is available at https://github.com/huyu00/linear-polytree-SEM. As is the case with all SEMs, caution should be taken when interpreting the algorithms' outputs in practical applications, as they represent potential causal interactions rather than definitive proof.

We assess the results by comparing the true and inferred CPDAGs $\mathcal{C}$ and $\hat{\mathcal{C}}$. On the skeleton level, there can be edges in $\mathcal{C}$ that are missing in $\hat{\mathcal{C}}$, and conversely $\hat{\mathcal{C}}$ can have extra edges. For the CPDAG, we consider a directed edge to be correct if it occurs with the same direction in both CPDAGs; an undirected edge needs to be undirected in both CPDAGs to be considered correct. Any other edge that occurs in both CPDAGs is considered to have the wrong direction. With these notions, we calculate the False Discovery Rate (FDR) for the skeleton as $|\mathrm{extra}|/|\hat{\mathcal{C}}|$, and for the CPDAG as $(|\mathrm{extra}| + |\mathrm{wrong\ direction}|)/|\hat{\mathcal{C}}|$. Here $|\mathrm{extra}|$ is the number of extra edges, $|\hat{\mathcal{C}}|$ is the number of edges in $\hat{\mathcal{C}}$, and so on. To quantify the overall similarity and take the true positives into account, we calculate the Jaccard index (JI), which is
$$
\frac{|\mathrm{correct}| + |\mathrm{wrong\ direction}|}{|\mathcal{C} \cup \hat{\mathcal{C}}|} = \frac{|\mathrm{correct}| + |\mathrm{wrong\ direction}|}{|\mathrm{missing}| + |\hat{\mathcal{C}}|}
$$
for the skeleton, and $|\mathrm{correct}| \,/\, (|\mathcal{C}| + |\hat{\mathcal{C}}| - |\mathrm{correct}|)$ for the CPDAG. A minimal sketch of these computations is given below.
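The following sketch (ours, not the authors' evaluation code) computes all four metrics from edge sets; a CPDAG is represented as a dict mapping the unordered pair `frozenset({u, v})` to `(u, v)` for a directed edge $u \rightarrow v$, or to `None` for an undirected edge:

```python
# Minimal sketch: skeleton/CPDAG FDR and Jaccard index as defined above.
def cpdag_metrics(true_cpdag, est_cpdag):
    common = true_cpdag.keys() & est_cpdag.keys()          # shared skeleton edges
    correct = sum(true_cpdag[e] == est_cpdag[e] for e in common)
    wrong = len(common) - correct                          # right edge, wrong direction
    extra = len(est_cpdag.keys() - true_cpdag.keys())
    missing = len(true_cpdag.keys() - est_cpdag.keys())
    n_est, n_true = len(est_cpdag), len(true_cpdag)
    return {
        "FDR_skeleton": extra / n_est,
        "JI_skeleton": (correct + wrong) / (missing + n_est),
        "FDR_CPDAG": (extra + wrong) / n_est,
        "JI_CPDAG": correct / (n_true + n_est - correct),
    }
```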
7.1 Testing on Simulated Polytree Data

Here we briefly describe how we generate linear polytree SEMs; additional implementation details can be found in Section 7.3. First, we generate a polytree by randomly assigning directions to a random undirected tree. Next, the standardized SEM parameters $\beta_{ij}$ are randomly chosen within a range, which in turn determines the $\omega_{ii}$ (Eq. (5)). Motivated by the theoretical results (Theorems 11, 14 and 16), we make sure that in the above procedure the generated SEM satisfies $\rho_{\min} \leq |\beta_{ij}| \leq \rho_{\max}$, that the maximum in-degree equals a prescribed value $d_0$, and that $\omega_{ii} \geq \omega_{\min}$ for all $i \in V$. Here $\rho_{\min}$, $\rho_{\max}$, $d_0$, and $\omega_{\min}$ are pre-specified constants (the values used are listed in the caption of Fig. 1).

Figures 1 and 2 show the performance for $p = 100$ and $n$ ranging from 50 to 1000. We see that the Chow-Liu algorithm performs much better than hill climbing, and overall has an accuracy similar to or better than that of PC and early-stopping PC. At small sample sizes (below 400), PC and early-stopping PC have a smaller FDR for skeleton recovery than Chow-Liu, but this likely comes at the expense of the true positive rate, as reflected by the similar or lower JI of PC compared to Chow-Liu (Panels B, D of Figs. 1 and 2). At larger sample sizes, Chow-Liu has better accuracy in recovering the CPDAG. The early-stopping PC has a similar accuracy in recovering the skeleton as the regular PC and a better accuracy in recovering the CPDAG (Panels C, D of Figs. 1 and 2). This is likely because it applies only Meek's Rule 1 (which is all that is needed for a polytree) to orient the edges (Section 2.3.2). As $\rho_{\min}$ becomes smaller or as the maximum in-degree increases, the accuracy of Chow-Liu decreases, which is consistent with the theory (Theorems 11, 14 and 16). For hill climbing and PC, the accuracy is less affected by $\rho_{\min}$ or the in-degree (Fig. 1 vs. Fig. 2). Comparing skeleton recovery (Panels A, B in Figs. 1 and 2) with CPDAG recovery (Panels C, D), the accuracy of skeleton recovery is higher, which is likewise consistent with our theoretical results (Theorems 11, 14 and 16).

Figure 1: Performance on the polytree simulated data at $p = 100$ with maximum in-degree $d^{\mathrm{in}}_{\max} = 10$. The results of the algorithms are shown as solid lines with dot markers (polytree), dashed lines with triangle markers (hill climbing), solid lines with square markers (PC), and dash-dot lines with square markers (PC early stopped). Colors correspond to three different values of $\rho_{\min}$. The remaining SEM parameters are $\rho_{\max} = 0.8$ and $\omega_{\min} = 0.1$. Panels A, C show the FDR (smaller is better) for skeleton and CPDAG recovery; Panels B, D show the Jaccard index (larger is better). For each combination of SEM parameters, we randomly generate a polytree; the detailed generation of the $\beta_{ij}$'s and $\omega_{ii}$'s is described in Section 7.3. We then draw i.i.d. samples from the SEM at different sizes ($x$-axis; $n = 50, 100, 200, 400, 600, 800, 1000$). The entire process is repeated 100 times. Each point on the curves shows the average over the 100 repeats, and the error bars are 1.96 times the standard error of the mean (many are smaller than the markers).

7.1.1 Running Time Comparison

Interestingly, the running time of the PC algorithm is significantly affected by the maximum in-degree: it increases 40-fold when $d^{\mathrm{in}}_{\max}$ changes from 10 to 20 (Table 1), and the code may even fail to stop (running for more than 8 hours) when $d^{\mathrm{in}}_{\max} = 40$ (data not shown). This phenomenon can be explained by the relationship between the maximal number of neighbors and the maximal number of iterations in the PC algorithm; see Proposition 1 of Kalisch & Bühlman (2007). Chow-Liu, on the other hand, is significantly more favorable in terms of running time, similar to the early-stopped PC, which avoids the issue of the maximal number of neighbors above. The Chow-Liu algorithm is up to 80 times faster than the slowest alternative algorithm (Table 1) and, importantly, has a running time that is constant across the SEM parameters (this also holds for all other experiments described later).

Figure 2: Same as Fig. 1 but for a maximum in-degree of $d^{\mathrm{in}}_{\max} = 20$.

(Unit: sec)         Polytree p = 100,   Polytree p = 100,   ASIA p = 8   ALARM p = 37
                    d_in_max = 10       d_in_max = 20
Chow-Liu            0.01                0.01                0.0003       0.01
Hill climbing       0.87                1.00                0.012        1.48
PC                  0.07                2.86                0.03         0.53
PC early stopped    0.03                0.03                0.001        0.38

Table 1: Running time comparison. The columns correspond to the SEM data in Figs. 1 and 2 (polytree), Table 2 (ALARM), and Table 3 (ASIA); $n = 5000$ for ASIA and ALARM. The running time is for one inference (averaged across trials/bootstraps when applicable). All computations were done on a 2019 Intel i7 quad-core CPU desktop computer.
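The favorable running time is unsurprising: the skeleton step of Algorithm 1 amounts to a maximum-weight spanning tree over the pairwise absolute sample correlations (computable, e.g., with Kruskal's algorithm (Kruskal, 1956)), which costs roughly $O(p^2 \log p)$ after the $O(np^2)$ correlation computation. Below is a minimal sketch (ours, not the released implementation), assuming numpy and scipy:

```python
# Minimal sketch: the skeleton step of Algorithm 1 as a spanning tree problem.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_skeleton(X):
    """X: (n, p) data matrix. Returns the estimated skeleton as a list of edges."""
    R = np.corrcoef(X, rowvar=False)       # pairwise sample correlations
    W = -np.abs(R) - 1e-12                 # negate so that a *minimum* spanning tree
    np.fill_diagonal(W, 0.0)               # maximizes total |rho|; the tiny offset
    T = minimum_spanning_tree(W).tocoo()   # keeps exact-zero correlations as edges
    return list(zip(T.row.tolist(), T.col.tolist()))
```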
7.2 Testing on DAG Benchmark Data

The ALARM dataset (Beinlich et al., 1989) is a widely used benchmark. Its true DAG (Fig. 3) is not a polytree and has 37 nodes and 46 edges. In fact, a three-phase algorithm initialized by Chow-Liu has been demonstrated to be effective on these data (Cheng et al., 2002). We simply ran the Chow-Liu algorithm (Algorithms 1 and 2) and found that it still performs better than hill climbing and PC (including the early-stopping version) in terms of our metrics (Table 2), as well as visually in the inferred graphs (Fig. 3). At $n = 5000$, it even achieves the best skeleton recovery that Chow-Liu can possibly achieve: at least $46 - (37 - 1) = 10$ edges must be missing in any inferred polytree.

n = 500             Correct   Wrong d.   Missing   Extra   FDR sk.   JI sk.   FDR CPDAG   JI CPDAG
Chow-Liu            28        4          14        4       0.11      0.64     0.22        0.52
Hill climbing       24        17         5         60      0.59      0.39     0.76        0.2
PC                  14        17         15        13      0.3       0.53     0.68        0.18
PC early stopped    24        8          14        20      0.38      0.48     0.54        0.32

n = 5000            Correct   Wrong d.   Missing   Extra   FDR sk.   JI sk.   FDR CPDAG   JI CPDAG
Chow-Liu            25        11         10        0       0.0       0.78     0.31        0.44
Hill climbing       27        18         1         62      0.58      0.42     0.75        0.21
PC                  24        17         5         12      0.23      0.71     0.55        0.32
PC early stopped    42        1          3         39      0.48      0.51     0.49        0.49

Table 2: Performance on the ALARM data. See text for the details of the accuracy measures: the number of correct, missing, extra, and wrong-direction edges, and the FDR and Jaccard index for the skeleton and CPDAG. The best results across the algorithms are in bold.

Figure 3: Comparison of the true CPDAG of the ALARM data and the ones inferred by the four algorithms at n = 5000. There are 37 nodes and 46 edges in the true CPDAG.

Another benchmark we test is the ASIA dataset (Lauritzen & Spiegelhalter, 1988), a simulated DAG dataset with eight nodes. Note that the ground truth is sparse but not exactly a polytree. At $n = 500$ samples, the performance of Chow-Liu is comparable to that of hill climbing and the PC algorithm, while hill climbing gives the best result at $n = 5000$ (Table 3). We illustrate the comparison by plotting the most likely inference outcome of each algorithm across the bootstrap trials in Fig. 4 (where we resample $n$ observations from the original 5000 samples). Note that the graph inferred by Chow-Liu (occurring in 23% of the trials) is the best possible result it can achieve: at least one edge must be missing since the output is a polytree, and the v-structure involving B, E, and D can no longer be identified once the edge E-D is missing, which leaves B-D undirected.

n = 500             Correct      Wrong d.     Missing      Extra
Chow-Liu            4.0(1.1)     1.8(1.25)    2.2(0.6)     1.2(0.6)
Hill climbing       4.2(1.54)    2.3(1.35)    1.5(0.67)    1.5(1.57)
PC                  3.3(1.19)    1.7(0.9)     3.0(0.63)    0.1(0.3)
PC early stopped    2.9(1.3)     2.1(1.14)    3.0(0.63)    0.1(0.3)

n = 5000            Correct      Wrong d.     Missing      Extra
Chow-Liu            4.19(1.49)   2.3(1.47)    1.51(0.51)   0.51(0.51)
Hill climbing       6.96(0.9)    0.34(0.69)   0.7(0.57)    0.87(0.93)
PC                  4.59(1.06)   1.79(1.04)   1.62(0.56)   0.15(0.39)
PC early stopped    3.82(0.83)   3.53(1.09)   0.65(0.54)   1.05(0.57)

(continued)

n = 500             FDR sk.      JI sk.       FDR CPDAG    JI CPDAG
Chow-Liu            0.17(0.09)   0.64(0.11)   0.43(0.16)   0.38(0.14)
Hill climbing       0.17(0.14)   0.7(0.14)    0.46(0.22)   0.39(0.23)
PC                  0.02(0.06)   0.62(0.09)   0.36(0.23)   0.35(0.14)
PC early stopped    0.02(0.06)   0.62(0.09)   0.44(0.25)   0.3(0.16)

n = 5000            FDR sk.      JI sk.       FDR CPDAG    JI CPDAG
Chow-Liu            0.07(0.07)   0.77(0.11)   0.4(0.21)    0.41(0.19)
Hill climbing       0.1(0.1)     0.83(0.11)   0.13(0.15)   0.78(0.18)
PC                  0.02(0.05)   0.78(0.08)   0.29(0.17)   0.48(0.14)
PC early stopped    0.12(0.06)   0.82(0.08)   0.54(0.11)   0.31(0.09)

Table 3: Performance on the ASIA data. The accuracy measures (the number of correct, missing, extra, and wrong-direction edges, and the FDR and Jaccard index for skeleton and CPDAG; see text) are averaged over 1000 bootstraps (resampling n observations out of 100,000), with standard deviations in parentheses. The best results across the algorithms are in bold.

Figure 4: The true CPDAG and the typical inferred CPDAG for the ASIA data with n = 5000 samples. We plot the most likely inferred graph across 1000 bootstraps for each algorithm, which occurs in 23% (Chow-Liu), 44% (hill climbing), 42% (PC), and 50% (early-stopping PC) of the trials.

Lastly, we study another simulated benchmark dataset, EARTHQUAKE (Korb & Nicholson, 2010), whose ground-truth graph is a polytree (Fig. 5). At $n = 500$ samples, the performance of Chow-Liu is comparable to that of hill climbing and the PC algorithm, while Chow-Liu performs best in the overall recovery of both the skeleton and the CPDAG at $n = 2000$ (FDR and Jaccard index in Table 4). As with the previous data, we plot the most likely inference outcome of each algorithm across trials in Fig. 5. At $n = 2000$, the Chow-Liu algorithm perfectly recovers the true DAG in 90% of the trials.

n = 500             Correct      Wrong d.     Missing      Extra
Chow-Liu            2.87(1.55)   0.83(1.27)   0.3(0.67)    0.3(0.67)
Hill climbing       3.38(1.29)   0.46(1.07)   0.17(0.48)   0.85(0.75)
PC                  2.52(1.45)   1.16(1.13)   0.32(0.56)   0.66(0.64)
PC early stopped    2.63(1.53)   1.09(1.22)   0.28(0.57)   0.76(0.74)

n = 2000            Correct      Wrong d.     Missing      Extra
Chow-Liu            3.62(1.17)   0.38(1.16)   0.01(0.08)   0.01(0.08)
Hill climbing       3.88(0.64)   0.12(0.64)   0.0(0.0)     0.61(0.61)
PC                  3.86(0.47)   0.14(0.47)   0.0(0.0)     0.62(0.61)
PC early stopped    3.68(0.78)   0.32(0.78)   0.0(0.0)     0.79(0.79)

(continued)

n = 500             FDR sk.      JI sk.       FDR CPDAG    JI CPDAG
Chow-Liu            0.08(0.17)   0.89(0.22)   0.28(0.39)   0.68(0.4)
Hill climbing       0.17(0.14)   0.81(0.17)   0.26(0.3)    0.72(0.31)
PC                  0.14(0.13)   0.81(0.18)   0.43(0.33)   0.51(0.35)
PC early stopped    0.15(0.14)   0.8(0.18)    0.42(0.35)   0.54(0.36)

n = 2000            FDR sk.      JI sk.       FDR CPDAG    JI CPDAG
Chow-Liu            0.0(0.02)    1.0(0.04)    0.08(0.27)   0.91(0.27)
Hill climbing       0.13(0.11)   0.87(0.11)   0.16(0.19)   0.84(0.19)
PC                  0.13(0.11)   0.87(0.11)   0.16(0.14)   0.84(0.16)
PC early stopped    0.17(0.14)   0.83(0.14)   0.24(0.24)   0.74(0.26)

Table 4: Performance on the EARTHQUAKE data. The accuracy measures (the number of correct, missing, extra, and wrong-direction edges, and the FDR and Jaccard index for skeleton and CPDAG; see text) are averaged over 1000 bootstraps (resampling n observations from a total of 100,000 samples), with standard deviations in parentheses. The best results across the four algorithms are in bold.

Figure 5: The true CPDAG and the most frequently inferred CPDAG for the EARTHQUAKE data with n = 2000 samples over 1000 trials. The graph shown occurs in 90% of the trials for Chow-Liu, 47% for hill climbing, 46% for PC, and 41% for early-stopping PC.
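The bootstrap protocol behind Tables 3 and 4 and Figs. 4 and 5 is easy to restate in code. A minimal sketch (ours), assuming numpy and reusing the CPDAG dict representation from the metrics sketch in Section 7; `learn_cpdag` stands for any of the four algorithms:

```python
# Minimal sketch: tally the most frequently inferred CPDAG over bootstrap trials.
import numpy as np
from collections import Counter

def bootstrap_modal_cpdag(X, n, learn_cpdag, trials=1000, seed=0):
    """X: the full data pool; each trial resamples n of its rows."""
    rng = np.random.default_rng(seed)
    tally = Counter()
    for _ in range(trials):
        rows = rng.choice(X.shape[0], size=n, replace=False)
        tally[frozenset(learn_cpdag(X[rows]).items())] += 1
    (cpdag, count), = tally.most_common(1)
    return dict(cpdag), count / trials       # modal CPDAG and its frequency
```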
7.3 Details on polytree data generation

In the simulated polytree data, we draw i.i.d. samples from a Gaussian linear SEM with a polytree structure. First, we generate an undirected tree with $p$ nodes from a random Prüfer sequence. The Prüfer sequence, which is in one-to-one correspondence with the labeled trees on $p$ nodes, is obtained by sampling $p - 2$ numbers with replacement from $\{1, 2, \ldots, p\}$. Next, a polytree is obtained by randomly orienting the edges of the undirected tree. We also ensure that one of the nodes has a specified large in-degree $d_0$. This is done by making a node $i$ occur at least $d_0 - 1$ times in the Prüfer sequence, so that this node has degree at least $d_0$ in the undirected tree; we then orient the edges connected to $i$ by selecting $d_0$ of them to be incoming. The rest of the edges in the tree are oriented randomly as before. A minimal sketch of this construction is given after this paragraph.
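The following is a minimal self-contained sketch (ours; nodes are 0-indexed, and `random_polytree` is a hypothetical name) of the decoding and orientation steps just described:

```python
# Minimal sketch: random polytree from a Prufer sequence with a forced hub in-degree.
import heapq
import numpy as np

def tree_from_prufer(seq, p):
    """Decode a Prufer sequence (entries in 0..p-1) into undirected tree edges."""
    deg = [1] * p
    for v in seq:
        deg[v] += 1
    leaves = [v for v in range(p) if deg[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        edges.append((heapq.heappop(leaves), v))   # attach the smallest leaf to v
        deg[v] -= 1
        if deg[v] == 1:
            heapq.heappush(leaves, v)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def random_polytree(p, d0, rng):
    hub = int(rng.integers(p))
    seq = rng.integers(0, p, size=p - 2).tolist()
    seq[: d0 - 1] = [hub] * (d0 - 1)               # hub gets degree >= d0 in the tree
    edges = tree_from_prufer(seq, p)
    hub_edges = [e for e in edges if hub in e]
    directed = []
    for k, idx in enumerate(rng.permutation(len(hub_edges))):
        u, v = hub_edges[idx]
        other = u if v == hub else v
        directed.append((other, hub) if k < d0 else (hub, other))  # d0 edges into hub
    for u, v in edges:
        if hub not in (u, v):                      # remaining edges oriented at random
            directed.append((u, v) if rng.random() < 0.5 else (v, u))
    return directed
```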
In the next step, we choose the values of the standardized $\beta_{ij}$, which correspond to the correlations. Note that once the $\beta_{ij}$'s are given, the $\omega_{ii}$ are determined by Eq. (5). Motivated by the theoretical conditions on $n, p$ such as those in Theorems 11 and 14, we choose the $\beta_{ij}$ according to pre-specified values $\rho_{\min}$ and $\rho_{\max}$ and study the effects of these parameters on the recovery accuracy. To avoid ill-conditioned cases, we require that $\omega_{ii} \geq \omega_{\min}$, where $\omega_{\min}$ is another parameter. For each node $j$, this adds the constraint $\sum_{i \in Pa(j)} \beta_{ij}^2 \leq 1 - \omega_{\min}$, in addition to $\rho_{\min} \leq |\beta_{ij}| \leq \rho_{\max}$. We sample the $\beta_{ij}^2$ uniformly over the set of non-negative values satisfying the above inequality constraints. This sampling is implemented by drawing the $\beta_{ij}^2$ (corresponding to all the edges in the polytree) sequentially in a random order as $\min(\rho_{\max}^2, \rho_{\min}^2 + v_j x)$, where $x$ is drawn from the Beta distribution $B(1, \tilde d_j^{\,\mathrm{in}})$. Here $\tilde d_j^{\,\mathrm{in}}$ is the number of incoming edges of node $j$ (out of its $d_j^{\,\mathrm{in}}$ incoming edges in total) whose $\beta^2$ has not yet been chosen, and
$$
v_j = 1 - \omega_{\min} - \tilde d_j^{\,\mathrm{in}} \rho_{\min}^2 - \sum_k \beta_{kj}^2,
$$
where the sum runs over all edges $k \rightarrow j$ whose $\beta_{kj}^2$ has already been chosen. The use of the Beta distribution here is based on the distribution of the order statistics of independent uniformly distributed random variables.

As an exception, we first set two $|\beta_{ij}|$ values to attain equality in the constraints given by $\rho_{\min}$ and $\rho_{\max}$, before choosing the rest of the $\beta_{ij}$'s according to the above sampling procedure. For $\rho_{\max}$, we randomly choose a node $i$ satisfying $\rho_{\min}^2 (d_i^{\,\mathrm{in}} - 1) + \rho_{\max}^2 \leq 1 - \omega_{\min}$ and $d_i^{\,\mathrm{in}} > 0$ (such a node always exists if $\rho_{\max}^2 + \omega_{\min} \leq 1$ and the minimum nonzero in-degree is 1), and set one of its incoming edges to have $|\beta_{ji}| = \rho_{\max}$. For $\rho_{\min}$, we choose a node $k$ among the remaining nodes with $d_k^{\,\mathrm{in}} > 0$ and set $|\beta_{lk}| = \rho_{\min}$ for one of its incoming edges. Lastly, a positive or negative sign is assigned to each $\beta_{ij}$ with equal probability. After the $\beta_{ij}$'s (i.e., the matrix $B$) are chosen (and hence $\Omega$), the samples $x_1, \ldots, x_n$ are drawn according to $x = (I - B^\top)^{-1} \epsilon$, where $\epsilon$ is a zero-mean Gaussian vector with covariance $\Omega$. A sketch of the sequential draw for one node is given below.
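The sequential draw for the incoming edges of a single node is compact in code. A minimal sketch (ours; this is one reading of the procedure above, assuming feasible parameters with $d^{\,\mathrm{in}} \rho_{\min}^2 \leq 1 - \omega_{\min}$, and with the two values pinned at $\rho_{\min}$ and $\rho_{\max}$ handled separately as described in the text):

```python
# Minimal sketch: draw beta_ij^2 for the incoming edges of one node.
import numpy as np

def sample_beta_sq(d_in, rho_min, rho_max, omega_min, rng):
    """Respects rho_min^2 <= beta^2 <= rho_max^2 and sum(beta^2) <= 1 - omega_min."""
    chosen = []
    for k in range(d_in, 0, -1):              # k = number of edges not yet chosen
        v = 1 - omega_min - k * rho_min**2 - sum(chosen)   # remaining slack
        x = rng.beta(1, k)                    # order-statistics trick for uniformity
        chosen.append(min(rho_max**2, rho_min**2 + v * x))
    return chosen

# e.g. sample_beta_sq(5, 0.2, 0.8, 0.1, np.random.default_rng(0))
```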
8 Discussion

This paper studies the problem of polytree learning, a special case of DAG learning in which the skeleton of the directed graph is a tree. This model has been widely used in the literature for both prediction and structure learning. We consider the linear polytree model and the Chow-Liu based algorithm (Chow & Liu, 1968) proposed in Rebane & Pearl (1987) for polytree learning. The major contribution of this theoretical paper is the study of the sample size conditions under which the polytree learning algorithm recovers the skeleton and the CPDAG exactly. Under certain mild assumptions on the correlation coefficients over the polytree skeleton, we show that the skeleton can be exactly recovered with high probability if the sample size satisfies $n \gtrsim (\log p)/\rho_{\min}^2$, and the CPDAG of the polytree can be exactly recovered with high probability if $n \gtrsim (\log p)/\rho_{\min}^4$. We also establish necessary conditions on the sample size for both skeleton and CPDAG recovery, which match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these two tasks. In addition, we study inverse correlation matrix estimation under the linear polytree SEM: under the component-wise $\ell_1$ metric, we give an estimation error bound characterized by the dimension, the sample size, and the total number of v-structures.

A number of questions remain for future study. It would be interesting to study how to relax the polytree assumption. In fact, the benchmark data analysis (Section 7.2) is instructive here: it shows that the considered Chow-Liu based CPDAG recovery algorithm, which seemingly relies heavily on the polytree assumption, can still deliver reasonably accurate structure learning when the ground truth deviates from a polytree to some degree. This inspires us to consider the robustness of the proposed approach against such structural assumptions. For example, if the ground truth can only be approximated by a polytree, can the structure learning method described in Sections 2.3.1 and 2.3.2 lead to an approximate recovery of the ground-truth CPDAG with theoretical guarantees? Similarly, if the sample size is not large enough for the CPDAG to be recovered exactly, can we still obtain an accurate estimate of the inverse correlation matrix? As mentioned above, polytree modeling is often used in practice only as an initialization, and post-processing can give better structural recovery results. A well-known method of this type is given in Cheng et al. (2002), without theoretical guarantees. An interesting future research direction is to incorporate such post-processing steps into our theoretical analysis, so that our structure learning results (e.g., Theorem 14) hold for more general sparse DAGs.

Acknowledgment

X. Li and X. Lou acknowledge support from the NSF via the Career Award DMS-1848575. Y. Hu was partly supported by an Early Career Scheme grant 26303921 from the Hong Kong Research Grants Council. We would like to thank J. Peng for helpful discussions.

References

Animashree Anandkumar, Daniel J. Hsu, Furong Huang, and Sham M. Kakade. Learning mixtures of tree graphical models. In NIPS, pp. 1061–1069, 2012a.

Animashree Anandkumar, Vincent Y. F. Tan, Furong Huang, and Alan S. Willsky. High-dimensional structure estimation in Ising models: Local separation criterion. The Annals of Statistics, 40(3):1346–1375, 2012b.

I. Beinlich, H. J. Suermondt, R. M. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proc. 2nd European Conference on Artificial Intelligence in Medicine, pp. 247–256, 1989.

Guy Bresler and Mina Karzand. Learning a tree-structured Ising model in order to make predictions. Annals of Statistics, 48(2):713–737, 2020.

Sourav Chatterjee and Mathukumalli Vidyasagar. Estimating large causal polytrees from small samples. arXiv preprint, September 2022. URL http://arxiv.org/abs/2209.07028.

Jie Cheng, Russell Greiner, Jonathan Kelly, David Bell, and Weiru Liu. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1-2):43–90, 2002.

David Maxwell Chickering. Learning equivalence classes of Bayesian-network structures. The Journal of Machine Learning Research, 2:445–498, 2002a.

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002b.

C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

Sanjoy Dasgupta. Learning polytrees. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 134–141, 1999.

Mathias Drton and Marloes H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017.

R. A. Fisher. The distribution of the partial correlation coefficient. Metron, 3:329–332, 1924.
Rina Foygel, Jan Draisma, and Mathias Drton. Half-trek criterion for generic identifiability of linear structural equation models. The Annals of Statistics, pp. 1682–1713, 2012.

José A. Gámez, Juan L. Mateo, and José M. Puerta. Learning Bayesian networks by hill climbing: Efficient methods based on progressive restriction of the neighborhood. Data Mining and Knowledge Discovery, 22(1-2):106–148, 2011. doi: 10.1007/s10618-010-0178-6.

Asish Ghoshal and Jean Honorio. Learning linear structural equation models in polynomial time and sample complexity. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84, 2018.

Uri Heinemann and Amir Globerson. Inferning with high girth graphical models. In International Conference on Machine Learning, pp. 1260–1268. PMLR, 2014.

Christina Heinze-Deml, Marloes H. Maathuis, and Nicolai Meinshausen. Causal structure learning. Annual Review of Statistics and Its Application, 5:371–391, 2018.

Markus Kalisch and Peter Bühlman. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(3), 2007.

Ashish Katiyar, Jessica Hoffmann, and Constantine Caramanis. Robust estimation of tree structured Gaussian graphical models. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, pp. 3292–3300, 2019. URL https://proceedings.mlr.press/v97/katiyar19a.html.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Kevin B. Korb and Ann E. Nicholson. Bayesian Artificial Intelligence. CRC Press, 2010.

Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.

S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2):157–224, 1988. URL http://www.jstor.org/stable/2345762.

Zhuang Ma and Xiaodong Li. Subspace perspective on canonical correlation analysis: Dimension reduction and minimax rates. Bernoulli, 26(1):432–470, 2020.

Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 403–410, 1995.

Praneeth Netrapalli, Siddhartha Banerjee, Sujay Sanghavi, and Sanjay Shakkottai. Greedy learning of Markov network structure. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1295–1302. IEEE, 2010.

Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, and Anand D. Sarwate. Learning tree structures from noisy data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1771–1782. PMLR, 2019.

Christopher Nowzohour, Marloes H. Maathuis, Robin J. Evans, and Peter Bühlmann. Distributional equivalence and structure learning for bow-free acyclic path diagrams. Electronic Journal of Statistics, 11(2):5342–5374, 2017.

Judea Pearl. Causality. Cambridge University Press, 2009.

J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, pp. 219–228, 2014.

George Rebane and Judea Pearl. The recovery of causal poly-trees from statistical data. In Proceedings of the Third Conference on Uncertainty in Artificial Intelligence, pp. 222–228, 1987.
Karen Sachs, Omar Perez, Dana Pe'er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.

Peter Spirtes, Clark N. Glymour, Richard Scheines, and David Heckerman. Causation, Prediction, and Search. MIT Press, 2000.

Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning Gaussian tree models: Analysis of error exponents and extremal structures. IEEE Transactions on Signal Processing, 58(5):2701–2714, 2010.

Mostafa Tavassolipour, Seyed Abolfazl Motahari, and Mohammad-Taghi Manzuri Shalmani. Learning of tree-structured Gaussian graphical models on distributed data under communication constraints. IEEE Transactions on Signal Processing, 67(1):17–28, 2018.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. 2006.

Daniele Tramontano, Anthea Monod, and Mathias Drton. Learning linear non-Gaussian polytree models. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, PMLR, pp. 1960–1969, 2022.

Sara van de Geer and Peter Bühlmann. $\ell_0$-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536–567, 2013.

Thomas Verma and Judea Pearl. Equivalence and synthesis of causal models. Technical report, UCLA, Computer Science Department, 1991.

Thomas Verma and Judea Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Uncertainty in Artificial Intelligence, pp. 323–330. Elsevier, 1992.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, pp. 210–268, 2012.

Wei Wang, Martin J. Wainwright, and Kannan Ramchandran. Information-theoretic bounds on model selection for Gaussian Markov random fields. In 2010 IEEE International Symposium on Information Theory, pp. 1373–1377. IEEE, 2010.

Sewall Wright. Path coefficients and path regressions: alternative or complementary concepts? Biometrics, 16(2):189–202, 1960.

Bin Zhang, Chris Gaiteri, Liviu-Gabriel Bodea, Zhi Wang, Joshua McElwee, Alexei A. Podtelezhnikov, Chunsheng Zhang, Tao Xie, Linh Tran, Radu Dobrin, et al. Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 153(3):707–720, 2013.

A Preliminaries

Lemma 1 in Kalisch & Bühlman (2007). Let $\hat\rho_{ij}$ and $\rho_{ij}$ be the sample and population correlations between $X_i$ and $X_j$ as in Section 2. Consider the Gaussian linear polytree SEM (1) with $\sup_{n,\, i \neq j} |\rho_{ij}| \leq M < 1$. Then for any $0 < \gamma \leq 2$, the following inequality holds:
$$
\sup_{i \neq j} P\left( |\hat\rho_{ij} - \rho_{ij}| > \gamma \right) \leq C_1 (n-2) \exp\left( (n-4) \log \frac{4 - \gamma^2}{4 + \gamma^2} \right),
$$
where $0 < C_1 < \infty$ depends only on $M$.

Remark 5.40 in Vershynin (2012). Assume that $A$ is an $N \times n$ matrix whose rows $A_i$ are independent sub-Gaussian random vectors in $\mathbb{R}^n$ with second moment matrix $\Sigma$. Then for every $t \geq 0$, the following inequality holds with probability at least $1 - 2\exp(-ct^2)$:
$$
\left\| \frac{1}{N} A^\top A - \Sigma \right\| \leq \max(\delta, \delta^2), \qquad \text{where } \delta = C \sqrt{\frac{n}{N}} + \frac{t}{\sqrt{N}}.
$$
Here $C = C_K$ and $c = c_K > 0$ depend only on the sub-Gaussian norm $K = \max_i \|A_i\|_{\psi_2}$ of the rows.
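As a quick numerical illustration (ours, not from the paper) of the $\sqrt{(\log p)/n}$ scale that the two results above yield for the maximal correlation error:

```python
# Minimal sketch: the maximal correlation error shrinks like sqrt(log p / n).
import numpy as np

rng = np.random.default_rng(0)
p = 50
Sigma = 0.5 * np.eye(p) + 0.5          # equicorrelated Gaussian, rho = 0.5
for n in (250, 1000, 4000):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    err = np.abs(np.corrcoef(X, rowvar=False) - Sigma)
    np.fill_diagonal(err, 0.0)
    print(n, round(err.max(), 3), round(np.sqrt(np.log(p) / n), 3))
```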