# Learning Gaussian DAGs from Network Data

Journal of Machine Learning Research 25 (2024) 1-52. Submitted 7/21; Revised 8/23; Published 12/24.

Hangjian Li (lihangjian123@ucla.edu), Oscar Hernan Madrid Padilla (oscar.madrid@stat.ucla.edu), Qing Zhou (zhou@stat.ucla.edu)
Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA 90095, USA

Editor: Joris M. Mooij

Abstract

Structural learning of directed acyclic graphs (DAGs), or Bayesian networks, has been studied extensively under the assumption that observations are independent. We propose a new Gaussian DAG model for dependent data in which the observations are correlated according to an undirected network. Under this model, we develop a method to estimate the DAG structure given a topological ordering of the nodes. The proposed method jointly estimates the Bayesian network and the correlations among observations by optimizing a scoring function based on a penalized likelihood. We show that under some mild conditions the method produces consistent estimators after one iteration. Extensive numerical experiments also demonstrate that, by jointly estimating the DAG structure and the sample correlation, our method achieves substantially higher accuracy in structure learning. When the node ordering is unknown, we show through experiments on synthetic and real data that our algorithm can still be used to estimate the correlations between samples, with which we can de-correlate the dependent data and significantly improve the performance of classical DAG learning methods.

Keywords: Bayesian networks, causal discovery, matrix normal distribution, network data

1. Introduction

Bayesian networks (BNs) with structure given by a directed acyclic graph (DAG) are a popular class of graphical models in statistical learning and causal inference. Extensive research has been devoted to methods and theory for estimating DAG structures and their parameters from data. In this work, we focus on the Gaussian DAG model defined as follows. Let $G = (V, E)$ be a DAG that represents the structure of a BN for $p$ random variables $X_1, \ldots, X_p$. The vertex set $V = \{1, \ldots, p\}$ indexes the random variables and the edge set $E = \{(j, i) \in V \times V : j \to i\}$ contains the directed edges of $G$. Let $\Pi_i = \{j \in V : (j, i) \in E\}$ denote the parent set of vertex $i$. A data matrix $X \in \mathbb{R}^{n \times p}$ is generated by the following Gaussian linear structural equations induced by $G$:

$$X_j = \sum_{k \in \Pi_j} \beta_{kj} X_k + \varepsilon_j, \qquad \varepsilon_j = (\varepsilon_{1j}, \ldots, \varepsilon_{nj})^\top \sim N_n(0, \omega_j^2 I_n), \qquad (1)$$

for $j = 1, \ldots, p$, where $X_j$ is the $j$th column of $X$, $\omega_j^2$ is the error variance, and $B = (\beta_{kj})_{p \times p}$ is the weighted adjacency matrix (WAM) of $G$, such that $\beta_{kj} \ne 0$ if and only if $(k, j) \in E$, and $\beta_{jj} = 0$. The errors $\{\varepsilon_j\}$ are mutually independent and $\varepsilon_j$ is independent of $X_k$ for $k \in \Pi_j$. The goal is to estimate the structure of $G$ from $X$, which is equivalent to estimating the support of $B$. A key assumption in (1) is that the rows of $X$ are jointly independent, since the covariance matrix of each $\varepsilon_j$ is diagonal. Under this i.i.d. assumption, many structure learning algorithms for DAGs have been developed; they fall largely into three groups: score-based, constraint-based, and hybrids of the two.
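To make the data-generating process in (1) concrete, here is a minimal Python sketch (not from the paper; the graph, weights, and variances are illustrative assumptions) that simulates an $n \times p$ data matrix from a Gaussian linear SEM whose columns are in topological order.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_iid_sem(B, omega2, n, rng):
    """Simulate X (n x p) from the Gaussian linear SEM (1).

    B[k, j] is the weight of edge k -> j (strictly upper triangular when
    columns are in topological order); omega2[j] is the error variance.
    """
    p = B.shape[0]
    X = np.zeros((n, p))
    for j in range(p):                      # columns assumed topologically sorted
        eps = rng.normal(0.0, np.sqrt(omega2[j]), size=n)
        X[:, j] = X @ B[:, j] + eps         # parents of j have smaller index
    return X

# illustrative example: edges 1 -> 3 and 2 -> 3 with weights 0.8 and -0.5
p = 3
B = np.zeros((p, p))
B[0, 2], B[1, 2] = 0.8, -0.5
omega2 = np.array([1.0, 1.0, 0.5])
X = simulate_iid_sem(B, omega2, n=500, rng=rng)
print(np.cov(X, rowvar=False).round(2))
```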
Score-based methods search for the optimal DAG by maximizing a scoring function such as minimum description length (Roos, 2017), BIC (Schwarz, 1978), or Bayesian scores (Heckerman et al., 1995; Cooper and Herskovits, 1992), using various search strategies such as order-based search (Scanagatta et al., 2016; Schmidt et al., 2007; Ye et al., 2021), greedy search (Ramsey et al., 2017; Chickering, 2003), and coordinate descent (Fu and Zhou, 2013; Aragam and Zhou, 2015; Gu et al., 2019). Constraint-based methods, such as the PC algorithm of Spirtes et al. (2000), perform conditional independence tests among variables to construct a skeleton and then orient some of the edges. There are also hybrid methods, such as Tsamardinos et al. (2006) and Gasse et al. (2014), that combine the two approaches.

In real applications, however, it is common for observations to be dependent, as in network data, which violates the i.i.d. assumption underlying the aforementioned methods. For example, when modeling the characteristics of individuals in a social network, the observed characteristics of different individuals can be dependent because they belong to the same social group, such as friends, family, and colleagues, who often share similar features. Another example arises when modeling a gene regulatory network from individuals who are potentially genetically linked. When estimating brain functional networks, we often have a matrix of fMRI measurements for each individual, $X \in \mathbb{R}^{T \times \nu}$, across $T$ time points and $\nu$ brain regions of interest. The existence of correlations across both time points and brain regions often renders the estimates of standard graphical modeling methods inaccurate (Kundu and Risk, 2020).

Motivated by these applications, we are interested in developing a Gaussian DAG model that takes into account the dependence between observations. Based on this model, we develop a learning algorithm that simultaneously infers the DAG structure and the sample dependencies. Moreover, since many real-world networks are sparse, we want our method to learn a sparse DAG and to scale to a large number of vertices. A sparsity constraint on the estimated DAG also effectively prevents overfitting and greatly improves computational efficiency. Lastly, we would like theoretical guarantees on the consistency and finite-sample accuracy of the estimators. With these requirements in mind, we seek to

1. develop a novel Bayesian network model for network data;
2. develop a method that can jointly estimate a sparse DAG and the sample dependencies under the model;
3. establish finite-sample error bounds and consistency of our estimators;
4. achieve good empirical performance on both synthetic and real data sets.

When the data are no longer i.i.d., we need a more general model than (1) to capture between-individual dependence. We therefore propose a combined model in which directed edges describe the relationships among variables within a sample and undirected edges describe the relationships across samples. Because $X$ is defined by the Gaussian noise vectors $\varepsilon_j$ through the structural equations in (1), dependence among the rows of $X$ can be introduced by modeling the covariance structure of the variables $\varepsilon_{1j}, \ldots, \varepsilon_{nj}$ in $\varepsilon_j$. Based on this observation, we use an undirected graph $H$ to define the sparsity pattern of the precision matrix of $\varepsilon_j$.
When $H$ is an empty graph, the variables in $\varepsilon_j$ are independent, as in the classical Gaussian DAG model. When $H$ is not empty, however, $X$ follows a more complex matrix normal distribution whose covariance is the Kronecker product of two covariance matrices, one associated with the DAG $G$ and the other with the undirected graph $H$. One can therefore view the proposal as a more general model consisting of both the directed graph $G$ in (1) and the undirected graph $H$, where cross-sample dependence is characterized by $H$ and the features within a sample follow the distribution specified by (1).

Estimating the structure of the DAG $G$, as well as the other model parameters, under sparsity regularization on both graphs is a challenging task. We start by assuming that a topological ordering $\pi$ of $G$ is given, so that the search space of DAGs is greatly reduced. However, due to the presence of the second graph for network data, the usual likelihood-based objective function used in traditional score-based methods is nonconvex. Constraint-based methods do not naturally extend to network data either, because the dependence between individuals in $X$ complicates conditional independence testing. To find a suitable objective function and develop an optimization algorithm, we exploit the biconvex nature of a regularized likelihood score and develop an effective blockwise coordinate descent algorithm with a desirable convergence property. If the topological ordering of the DAG is unknown, it is impossible to identify a unique DAG from the data because of Markov equivalence (Chickering, 2003). Moreover, the lack of independence makes it very difficult to estimate the equivalence class of $G$ directly. In this case, we take advantage of an invariance property of the matrix normal distribution: under a sparsity constraint on $G$, we show that even with a random ordering we can still obtain a good estimate of the covariance of $\varepsilon_j$, which can be used to de-correlate $X$ so that existing DAG learning algorithms can be applied to estimate an equivalence class of $G$.

The remainder of the paper is structured as follows. In Section 2 we introduce a novel Gaussian DAG model for network data and discuss its connections to existing models. We propose a structural learning algorithm for the model and describe its details in Section 3. Section 4 is devoted to our main theoretical results and their implications under various high-dimensional asymptotic settings. Section 5 reports numerical results with detailed comparisons to competing methods on simulated data. Section 6 presents an application of our method to a real single-cell RNA sequencing data set. All proofs are deferred to the Appendix.

Notations. For the convenience of the reader, we summarize some notation used throughout the paper. We write $G^*$ and $H^*$ for the true DAG and the true undirected graph, respectively. Let $\Omega^* := \mathrm{diag}(\omega_j^{*2})$ be the $p \times p$ diagonal matrix of error variances, $B^*$ the true WAM of $G^*$, and $s := \sup_j \|\beta_j^*\|_0$ the maximum number of parents of any node in $G^*$. Furthermore, $X_j$ denotes the $j$th column of $X$ for $j = 1, \ldots, p$, and $x_i$ denotes the $i$th row of $X$ for $i = 1, \ldots, n$. Given two sequences $f_n$ and $g_n$, we write $f_n \lesssim g_n$ if $f_n = O(g_n)$, and $f_n \asymp g_n$ if $f_n \lesssim g_n$ and $g_n \lesssim f_n$. Denote by $[p]$ the index set $\{1, \ldots, p\}$. For $x \in \mathbb{R}^n$, we denote by $\|x\|_q$ its $\ell_q$ norm for $q \in [0, \infty]$.
For $A \in \mathbb{R}^{n \times m}$, $\|A\|_2 = \sup_v \{\|Av\|_2 : \|v\|_2 \le 1, v \in \mathbb{R}^m\}$ is the operator norm of $A$, $\|A\|_F$ is the Frobenius norm of $A$, $\|A\|_\infty = \max_{i,j} |a_{ij}|$ is the element-wise maximum norm of $A$, and $|||A|||_\infty = \max_{i \in [n]} \sum_{j=1}^m |a_{ij}|$ is the maximum row-wise $\ell_1$ norm of $A$. Denote by $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$, respectively, the smallest and largest singular values of a matrix $A$. Let $|S|$ be the size of a set $S$.

2. A Novel DAG Model for Dependent Data

We model sample dependence through an undirected graph $H$ on $n$ vertices, with each vertex representing an observation $x_i$, $i \in [n]$, and the edges representing the conditional dependence relations among them. More explicitly, let $A(H)$ be the edge set of $H$, so that $(i, j) \notin A(H) \iff x_i \perp x_j \mid x_{[n] \setminus \{i, j\}}$ for $i \ne j$. Suppose we observe not only the dependent samples $\{x_i\}_{i=1}^n$ but also the graph (network) $H$. We generalize the structural equation model (SEM) in (1) to

$$X_j = \sum_{k \in \Pi_j} \beta_{kj} X_k + \varepsilon_j, \qquad \varepsilon_j = (\varepsilon_{1j}, \ldots, \varepsilon_{nj})^\top \sim N_n(0, \omega_j^2 \Sigma), \qquad (2)$$

where $\Sigma \in \mathbb{R}^{n \times n}$ is positive definite. The support of the precision matrix $\Theta = \Sigma^{-1}$ is restricted by $\mathrm{supp}(\Theta) \subseteq A(H)$, where $\mathrm{supp}(\Theta) = \{(i, j) : \Theta_{ij} \ne 0\}$ and, for notational brevity, we put $(i, i) \in A(H)$ for all $i \in [n]$. When $\Sigma = I_n$, the SEM (2) reduces to (1); hence the classical Gaussian DAG model (1) is a special case of the proposed model (2). Under the more general model (2) we face a more challenging structural learning problem: given dependent data $X$ generated from a DAG $G$ and the undirected graph $H$ that encodes the sample dependencies, we want to estimate the DAG coefficients $B$, the noise variances $\Omega = \mathrm{diag}(\omega_j^2)$, and the precision matrix $\Theta$ of the samples. Before introducing our method, let us first look at some useful properties of model (2).

2.1 Related Graphical Models

Let us illustrate the distinction between (1) and (2) using a data matrix $X = (x_{ij})_{n \times p}$ with $n = 5$ units and $p = 4$ variables. Under SEM (1), we model $x_1 = (x_{11}, x_{12}, x_{13}, x_{14}), \ldots, x_5 = (x_{51}, x_{52}, x_{53}, x_{54})$ with the same DAG, as shown in Figure 1a, and assume they are independent. In contrast, the model in (2) allows units to be dependent by relaxing the independence assumption among $x_{1j}, x_{2j}, x_{3j}, x_{4j}, x_{5j}$ for $j \in \{1, 2, 3, 4\}$. The dependence between units is induced by the dependence between the background variables $\varepsilon_{1j}, \ldots, \varepsilon_{5j}$ and is modeled by an undirected graph, as shown in Figure 1b. In general, the variables $x_{i1}, \ldots, x_{ip}$ in each unit satisfy the same conditional independence constraints defined by a DAG, while the background variables $\varepsilon_{1j}, \ldots, \varepsilon_{nj}$ across the $n$ units are dependent. When estimating the DAG structure from such data, the correlations among individuals reduce the effective sample size; we therefore need to take the distribution of the correlated errors into account. Note that our model excludes edges between variables of two different units: we do not consider an edge of any type between $x_{ij}$ and $x_{k\ell}$ for $i \ne k$ and $j \ne \ell$.

Figure 1: Graphical representation of the model (2) as a product of a DAG over features and an undirected graph over units (individuals). (a) The same DAG structure is shared by the features of every individual. (b) An undirected network of five individuals representing cross-unit dependence.

The model can also be viewed as a special case of a chain graph (Lauritzen, 1996; Lauritzen and Richardson, 2002), where the same variable across all units, $x_{1j}, \ldots, x_{nj}$, forms a chain component.
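The following Python sketch (illustrative only; the graph, $\Sigma$, and weights are assumptions made here for demonstration) extends the earlier i.i.d. generator to model (2) by drawing each noise column $\varepsilon_j$ from $N_n(0, \omega_j^2 \Sigma)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_network_sem(B, omega2, Sigma, rng):
    """Simulate X (n x p) from model (2): eps_j ~ N_n(0, omega_j^2 * Sigma)."""
    n, p = Sigma.shape[0], B.shape[0]
    L_sigma = np.linalg.cholesky(Sigma)           # Sigma = L_sigma L_sigma^T
    X = np.zeros((n, p))
    for j in range(p):                            # columns in topological order
        eps_j = np.sqrt(omega2[j]) * (L_sigma @ rng.standard_normal(n))
        X[:, j] = X @ B[:, j] + eps_j
    return X

# toy example: 10 units with Markov-chain (Toeplitz) dependence among rows
n, p = 10, 3
Sigma = np.array([[0.9 ** abs(i - k) for k in range(n)] for i in range(n)])
B = np.zeros((p, p)); B[0, 1], B[1, 2] = 0.7, -0.6
X = simulate_network_sem(B, np.ones(p), Sigma, rng)
print(X.shape)
```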
However, the key structure in our model is that the graph over all variables and all units, $\{x_{ij} : i \in [n], j \in [p]\}$, is a Kronecker product of a DAG over the $p$ variables and an undirected graph over the $n$ units (more details in Section 2.2). This special structure is exploited both in our estimation algorithm in Section 3 and in the theoretical analysis in Section 4. If we instead regarded the problem as general structure learning of a chain graph over all $np$ nodes, we would have only a sample of size one, from which one may not be able to construct consistent estimators in general.

Another related model, studied for interference analysis, is the segregated graph model of Shpitser (2015), which represents the equivalence classes of latent variable chain graphs. Bhattacharya et al. (2020) considered the closely related problem of estimating causal effects under partial interference (Hudgens and Halloran, 2008). They assume the unit-level causal DAG is known and propose a method to estimate causal effects under unit dependence. In comparison, we assume that the DAG structure is unknown (see Section 3.2) and that the network structure among units is a subgraph of a given undirected graph. In fact, we only need to know a block structure among the units; the network structure within each block can then be learned from the data. Owing to these different assumptions, the structure learning algorithms in this work are very different from the one proposed by Bhattacharya et al. (2020). Although there is some similarity between the model in (2) and the models studied in the interference literature, the motivation for our work is quite different: we regard the structural equation model (1) at the single-unit level as a map from background variables $\varepsilon_1, \ldots, \varepsilon_p$ to the observed variables $X_1, \ldots, X_p$, and we then introduce association (symmetric and undirected) among the background variables based on a network (undirected graph).

2.2 Matrix Normal Distribution

Our model (2) defines a matrix normal distribution for $X$. To see this, note that $\varepsilon = (\varepsilon_{ij})_{n \times p}$ in (2) follows a matrix normal distribution:

$$\varepsilon \sim N_{n,p}(0, \Sigma, \Omega) \iff \mathrm{vec}(\varepsilon) \sim N_{np}(0, \Omega \otimes \Sigma),$$

where $\mathrm{vec}(\cdot)$ is the vectorization operator and $\otimes$ is the Kronecker product. Consequently, the random matrix $X$ satisfies

$$X \sim N_{n,p}(0, \Sigma, \Psi), \qquad (3)$$

where $\Psi = (I - B)^{-\top} \Omega (I - B)^{-1}$. We fix $\omega_1 = 1$ so that the parameters $(\Omega, \Sigma)$ are identifiable under model (2). From the properties of the matrix normal distribution, we can easily prove the following lemma, which will be useful when estimating the row covariance matrix $\Sigma$ from different orderings of the nodes. Given a permutation $\pi$ of the set $[p]$, define $P_\pi$ as the permutation matrix such that $hP_\pi = (h_{\pi^{-1}(1)}, \ldots, h_{\pi^{-1}(p)})$ for any row vector $h = (h_1, \ldots, h_p)$.

Lemma 1. If $X$ follows model (2), then for any permutation $\pi$ of $[p]$ we have $XP_\pi \sim N_{n,p}(0, \Sigma, P_\pi^\top \Psi P_\pi)$.

Although matrix normal distributions have been studied extensively, the structural learning problem considered here is quite distinct. First, previous studies of the matrix normal model usually assume that $m$ copies of $X$ are observed and that the MLE exists when $m \ge \max\{p/n, n/p\} + 1$ (Dutilleul, 1999). In our case, we observe only one copy of $X$, so the MLE does not exist without additional sparsity constraints.
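As a quick sanity check of (3), the sketch below (illustrative; the parameter values are arbitrary assumptions) builds $\Psi = (I - B)^{-\top}\Omega(I - B)^{-1}$ and verifies empirically that $\mathrm{vec}(X)$ has covariance $\Psi \otimes \Sigma$; repeating the check with permuted columns would illustrate Lemma 1 in the same way.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 4, 3, 20_000

# arbitrary illustrative parameters
B = np.zeros((p, p)); B[0, 1], B[0, 2], B[1, 2] = 0.8, -0.4, 0.5
Omega = np.diag([1.0, 0.6, 1.3])
Sigma = np.array([[0.8 ** abs(i - k) for k in range(n)] for i in range(n)])

Ipb = np.linalg.inv(np.eye(p) - B)             # (I - B)^{-1}
Psi = Ipb.T @ Omega @ Ipb                      # column covariance in (3)
L_sigma = np.linalg.cholesky(Sigma)

# draw replicates of X = eps (I - B)^{-1}, with eps ~ N_{n,p}(0, Sigma, Omega)
Xs = np.empty((reps, n * p))
for r in range(reps):
    eps = L_sigma @ rng.standard_normal((n, p)) @ np.sqrt(Omega)
    Xs[r] = (eps @ Ipb).flatten(order="F")     # vec() stacks columns

emp_cov = np.cov(Xs, rowvar=False)
print(np.abs(emp_cov - np.kron(Psi, Sigma)).max())  # small, up to Monte Carlo error
```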
Allen and Tibshirani (2010) proposed using $\ell_1$ regularization to estimate the covariance matrices when $m = 1$, but their estimation relies on the assumption that the model is transposable, meaning that the two components $(\Sigma, \Psi)$ of the covariance are symmetric and can be estimated in a symmetric way. In model (2), however, the two covariance components have different structural constraints and cannot be estimated in the same way. Moreover, practitioners are often interested in estimating large Bayesian networks with hundreds or more nodes under sparsity assumptions on the WAM $B$. For methods that minimize a score function over the covariances, adding a sparsity penalty on $\Psi = (I - B)^{-\top} \Omega (I - B)^{-1}$ does not necessarily lead to a sparse estimate of $B$. In this paper, we propose a new DAG estimation method under the assumption that both the underlying undirected network among individuals and the Bayesian network are sparse. We are not interested in estimating $\Psi$ itself but rather a sparse factorization of $\Psi$ represented by the WAM $B$; this requires imposing sparsity constraints on $B$ directly instead of on $\Psi$, which distinguishes our work from the recent work on the Kronecker graphical lasso by Tsiligkaridis et al. (2013), Allen and Tibshirani (2010), and Zhou (2014).

2.3 Score-equivalence

The likelihood function of the proposed model (2) also satisfies the desirable score-equivalence property. To see this, let $\beta_j = (\beta_{1j}, \ldots, \beta_{pj})^\top$ be the $j$th column of the WAM $B$. Define an $n \times n$ sample covariance matrix of $\varepsilon_1/\omega_1, \ldots, \varepsilon_p/\omega_p$ from $X$ as

$$S(\Omega, B) = \frac{1}{p} \sum_{j=1}^p \frac{1}{\omega_j^2} (X_j - X\beta_j)(X_j - X\beta_j)^\top. \qquad (4)$$

Then the log-likelihood $\ell(B, \Omega, \Theta \mid X)$ under (2) satisfies, up to an additive constant,

$$-2\ell(B, \Omega, \Theta \mid X) = n \log\det\Omega - p \log\det\Theta + p\,\mathrm{tr}\big(\Theta S(\Omega, B)\big). \qquad (5)$$

Due to the dependence among observations, it is not obvious that the well-known score-equivalence property for Gaussian DAGs (Chickering, 2003) still holds for our model. Let $(\hat B(G), \hat\Omega(G), \hat\Theta(G))$ denote the MLE of $(B, \Omega, \Theta)$ given a DAG $G$ and the support restriction on $\Theta$. The following theorem confirms the score-equivalence property for our DAG model.

Theorem 2 (Score equivalence). Suppose $G_1$ and $G_2$ are two Markov equivalent DAGs on the same set of $p$ nodes. If the MLEs $(\hat B(G_m), \hat\Omega(G_m), \hat\Theta(G_m))$, $m = 1, 2$, exist for the matrix $X = (x_{ij})_{n \times p}$, then they attain the same log-likelihood value,

$$\ell(\hat B(G_1), \hat\Omega(G_1), \hat\Theta(G_1) \mid X) = \ell(\hat B(G_2), \hat\Omega(G_2), \hat\Theta(G_2) \mid X).$$

This property justifies evaluating estimated DAGs with a common model selection criterion such as AIC or BIC. For example, we show in Section 5 that BIC scores can be used to select the optimal penalty level for our DAG estimation algorithm.

3. Structure Learning

We have discussed the properties of our DAG model for dependent data and the unique challenges of the associated structural learning task. In this section, we develop a new method to estimate the parameters of model (2). Our estimator is defined as the minimizer of a score function derived from a penalized log-likelihood. To explain the method, we start from the penalized negative log-likelihood

$$f(B, \Omega, \Theta) := -2\ell(B, \Omega, \Theta \mid X) + \rho_1(B) + \rho_2(\Theta), \qquad B \in \mathcal{D}, \qquad (6)$$

where $\mathcal{D}$ is the space of WAMs of DAGs and $\rho_1$ and $\rho_2$ are penalty functions. This loss function is difficult to minimize because of the non-convexity of $\ell$ and the exponentially large search space of DAGs. One way to reduce the search space is to assume a given topological ordering.
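A minimal NumPy sketch (not from the paper) of how the quantities in (4) and (5) can be evaluated; `B`, `omega2`, `Theta`, and `X` are assumed to be given, e.g. from the generator sketched earlier.

```python
import numpy as np

def neg2_loglik(X, B, omega2, Theta):
    """Evaluate -2 * log-likelihood (5), dropping the additive constant."""
    n, p = X.shape
    R = X - X @ B                              # residuals; column j = X_j - X beta_j
    S = (R / omega2) @ R.T / p                 # sample covariance (4) of eps_j / omega_j
    sign, logdet_theta = np.linalg.slogdet(Theta)
    return (n * np.sum(np.log(omega2))         #  n log det(Omega)
            - p * logdet_theta                 # -p log det(Theta)
            + p * np.trace(Theta @ S))
```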
Recall that a permutation $\pi$ of $p$ elements is a bijective map from $[p]$ to $[p]$. A topological ordering of a DAG $G$ with $p$ vertices is a permutation $\pi$ such that $\pi^{-1}(i) < \pi^{-1}(j)$ for every edge $(i, j) \in E(G)$. Recall also that the WAM $B = (\beta_{kj})_{p \times p}$ satisfies $\beta_{kj} \ne 0$ if and only if $(k, j) \in E(G)$; therefore, given a topological ordering $\pi$, we can define the set $\mathcal{D}(\pi)$ of WAMs compatible with $\pi$, such that every $B \in \mathcal{D}(\pi)$ is strictly upper triangular after permuting its rows and columns according to $\pi$. Given a topological ordering $\pi$, the loss function (6) becomes

$$f(B, \Omega, \Theta) = -p \log\det\Theta + n \log\det\Omega + \sum_{j=1}^p \frac{1}{\omega_j^2} \|LX_j - LX\beta_j\|_2^2 + \rho_1(B) + \rho_2(\Theta), \quad B \in \mathcal{D}(\pi), \qquad (7)$$

where $L$ is the Cholesky factor of $\Theta$ (i.e., $\Theta = L^\top L$). If $\rho_1(\cdot)$ and $\rho_2(\cdot)$ are convex penalty functions and the noise covariance matrix $\Omega = \mathrm{diag}(\omega_j^2)$ is known, then (7) is a bi-convex function in $(B, \Theta)$, which can be minimized by iterative methods such as coordinate descent. Tseng (2001) showed that coordinate descent for bi-convex problems converges to a stationary point. Inspired by this observation, we propose the following two-step algorithm:

Step 1: Pre-estimate $\Omega$ to obtain $\hat\Omega = \mathrm{diag}(\hat\omega_j^2)$.

Step 2: Estimate $\hat B$ and $\hat\Theta$ by minimizing a biconvex score function derived from the penalized negative log-likelihood, conditional on $\hat\omega_j$.

Many existing noise-variance estimation methods for high-dimensional linear models can be used in Step 1, such as the scaled lasso/MCP (Sun and Zhang, 2012), the natural lasso (Yu and Bien, 2019), and refitted cross-validation (Fan et al., 2012). We present the natural estimator of $\Omega$ and discuss alternatives in Section 4.2. Importantly, the statistical properties of the chosen estimator $\hat\Omega$ in Step 1 affect the properties of the $\hat\Theta$ and $\hat B$ obtained in Step 2, so the estimator must be chosen carefully; we defer the detailed discussion of the theoretical properties of $\hat\Omega$ and their implications to Section 4. Given $\hat\Omega$, we propose the following estimator for Step 2:

$$\big(\hat\Theta, \hat B(\pi)\big) = \underset{\Theta \succ 0,\; B \in \mathcal{D}(\pi)}{\arg\min} \;\; -p \log\det\Theta + \sum_{j=1}^p \frac{1}{\hat\omega_j^2} \|LX_j - LX\beta_j\|_2^2 + \lambda_1 \sum_{j=1}^p \frac{\|\beta_j\|_1}{\hat\omega_j^2} + \lambda_2 \|\Theta\|_1. \qquad (8)$$

The $\ell_1$ regularization on $\beta_j/\hat\omega_j^2$ not only promotes sparsity in the estimated DAG but also prevents overfitting on variables with small variances. The $\ell_1$ regularization on $\Theta$ ensures that the estimator is unique and can improve the accuracy of $\hat\Theta$ by controlling the error carried over from the previous step; we discuss how these estimation errors are controlled in more detail in Section 4.

In Section 3.1, we assume a topological ordering $\pi$ of the true DAG $G^*$ is known. In this case, we order the columns of $X$ according to $\pi$ so that, for each $j$, only the first $j - 1$ entries of $\beta_j$ can be nonzero. When minimizing (8), we fix $\beta_{kj} = 0$ for $k \ge j$, and the resulting $\hat B$ is guaranteed to be upper triangular. If $\pi$ is unknown, we show in Section 3.2 that the score function (8) is still useful for estimating $\Theta$, and we describe a de-correlation procedure so that standard DAG learning methods can be applied to the de-correlated data.

3.1 Block Coordinate Descent

We denote an estimate of the true precision matrix $\Theta^*$ at iteration $t$ by $\hat\Theta^{(t)}$, and we write $\hat L^{(t)}$ and $L^*$ for the Cholesky factors of $\hat\Theta^{(t)}$ and $\Theta^*$, respectively. Since (8) is biconvex, it can be solved by iteratively minimizing over $\Theta$ and $B$, i.e., by block coordinate descent. Consider the $t$th iteration of block coordinate descent.
Fixing $\hat\Theta^{(t)}$, the optimization problem in (8) reduces to a standard Lasso problem (Tibshirani, 1996) for each $j$:

$$\hat\beta_j^{(t+1)} = \arg\min_{\beta_j} \; \frac{1}{2n} \|\hat L^{(t)} X_j - \hat L^{(t)} X \beta_j\|_2^2 + \lambda_n \|\beta_j\|_1, \qquad \lambda_n = \lambda_1/(2n), \qquad (9)$$

where $\hat\Theta^{(t)} = \hat L^{(t)\top} \hat L^{(t)}$ is the Cholesky decomposition. Since the columns of $X$ are ordered according to $\pi$, we can set $\hat\beta_{ij}^{(t+1)} = 0$ for $i = j, j+1, \ldots, p$ and reduce the dimension of the feasible $\beta_j$ in (9) to $j - 1$; in particular, $\hat\beta_1^{(t+1)}$ is always the zero vector. Fixing $\hat B^{(t+1)}$, solving for $\hat\Theta^{(t+1)}$ is equivalent to a graphical Lasso problem with fixed support (Ravikumar et al., 2011):

$$\hat\Theta^{(t+1)} = \underset{\Theta \succ 0,\; \mathrm{supp}(\Theta) \subseteq A(H^*)}{\arg\min} \; -\log\det\Theta + \mathrm{tr}\big(\hat S^{(t+1)} \Theta\big) + \lambda_p \|\Theta\|_1, \qquad (10)$$

where $\hat S^{(t+1)} = S(\hat\Omega, \hat B^{(t+1)})$ and $\lambda_p = \lambda_2/p$. The details of the method are given in Algorithm 1.

Algorithm 1: Block coordinate descent (BCD) algorithm
Input: $X$, $\hat\Theta^{(0)}$, $\hat\Omega$, $\rho$, $A(H^*)$, $T$
while $\max\big\{\|\hat\Theta^{(t+1)} - \hat\Theta^{(t)}\|_F,\; \|\hat B^{(t+1)} - \hat B^{(t)}\|_F\big\} > \rho$ and $t < T$ do
    for $j = 1, \ldots, p$ do
        $\hat\beta_j^{(t+1)} \leftarrow$ Lasso regression (9)
    end
    $\hat\Theta^{(t+1)} \leftarrow$ graphical Lasso with support restriction (10)
    $t \leftarrow t + 1$
end
Output: $\hat B \leftarrow \hat B^{(t)}$, $\hat\Theta \leftarrow \hat\Theta^{(t)}$

As shown in Proposition 3, Algorithm 1 converges to a stationary point of the objective function (8), where a stationary point is a point at which all directional derivatives are nonnegative (Tseng, 2001).

Proposition 3. Let $\{(\hat B^{(t)}, \hat\Theta^{(t)}) : t = 1, 2, \ldots\}$ be a sequence generated by Algorithm 1 for any $\lambda_1, \lambda_2 > 0$. Then, for almost all $X \in \mathbb{R}^{n \times p}$, every cluster point of $\{(\hat B^{(t)}, \hat\Theta^{(t)})\}$ is a stationary point of the objective function in (8).

3.2 Estimating DAGs with Unknown Ordering

Given any permutation $\pi$ of $[p]$, there exists a DAG $G_\pi$ such that (i) $\pi$ is a topological sort of $G_\pi$ and (ii) the joint distribution $P$ of the $p$ random variables admits a recursive factorization according to $G_\pi$ (van de Geer and Bühlmann, 2013; Lauritzen, 1996, Chap. 3). Under the assumption that the true DAG $G^*$ is sparse, i.e., the number of nonzero entries in $\beta_j^*$ is at most $s$ for all $j$, the DAG $G_{\pi'}$ corresponding to any random ordering $\pi'$ we choose is also likely to be sparse, so that the number of parents of each node is bounded by some positive constant $s'$. Formally, we assume that $s' \lesssim n/\log p$ uniformly over all permutations $\pi'$. Similar assumptions have been made in other works on DAG learning from high-dimensional data, such as Condition 3.4 in van de Geer and Bühlmann (2013) and Theorem 4.1 in Aragam et al. (2019a). Let us randomly pick a permutation $\pi'$ and apply Algorithm 1 to $X_{\pi'} := XP_{\pi'}$, where $(XP_{\pi'})_{ij} = X_{i\,\pi'^{-1}(j)}$. Under the above assumption, the sparsity $s'$ is small compared to the sample size $n$, and therefore the estimates $\hat\beta_{ij}$ obtained from solving the Lasso problem (9) will be consistent as well (we discuss the error bound on $\hat\beta_{ij}$ in detail in Section 4, Eq. (15)). Moreover, since the row precision matrix $\Theta^*$ is invariant to column permutations by Lemma 1, the resulting estimate $\hat\Theta$ under the random ordering $\pi'$ will also be a good estimate of $\Theta^*$ (Eq. (18)). Note that the DAG $G_{\pi'}$ is not necessarily Markov equivalent to the true DAG $G^*$; we merely assume that $G_{\pi'}$ is sparse so that $\Theta^*$ can be accurately estimated even with a random order. With the Cholesky factor $\hat L$ of $\hat\Theta$, we de-correlate the rows of $X$ and treat

$$\hat X = \hat L X \qquad (11)$$

as the new data. Because the row correlations in $\hat X$ vanish, we can apply existing structure learning methods that require independent observations to learn the underlying DAG.
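Below is a compact sketch of one possible implementation of Algorithm 1 using scikit-learn; it is illustrative rather than the authors' code. It assumes the columns of `X` are already in topological order and a pre-estimated `omega2_hat`, and it calls `graphical_lasso` without the support restriction of (10) (a simplification), so its output will differ somewhat from the exact estimator.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import graphical_lasso

def bcd(X, omega2_hat, lam1, lam2, T=20, tol=1e-4):
    """Illustrative block coordinate descent for (8); columns of X topologically sorted."""
    n, p = X.shape
    Theta = np.eye(n)                      # Theta_hat^(0) = I_n
    B = np.zeros((p, p))
    for t in range(T):
        B_old, Theta_old = B.copy(), Theta.copy()
        L = np.linalg.cholesky(Theta).T    # Theta = L^T L
        Xd = L @ X                         # de-correlated design
        # --- Lasso updates (9), one per column ---
        B = np.zeros((p, p))
        for j in range(1, p):
            fit = Lasso(alpha=lam1 / (2 * n), fit_intercept=False, max_iter=5000)
            fit.fit(Xd[:, :j], Xd[:, j])
            B[:j, j] = fit.coef_
        # --- graphical Lasso update (10), here without the support restriction ---
        R = X - X @ B
        S = (R / omega2_hat) @ R.T / p     # sample covariance (4); may be singular
        _, Theta = graphical_lasso(S, alpha=lam2 / p)
        if max(np.linalg.norm(B - B_old), np.linalg.norm(Theta - Theta_old)) < tol:
            break
    return B, Theta
```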
We find that this de-correlation step substantially improves the accuracy of structure learning by well-known state-of-the-art methods, such as greedy equivalence search (GES) (Chickering, 2003) and the PC algorithm (Spirtes et al., 2000). See Section 5 for more details.

4. Main Theoretical Results

In this section, we present our main theoretical results for Algorithm 1 assuming a true ordering is given. We state our main theorem, Theorem 4, in Section 4.1. Section 4.2 is devoted to the error bounds of $\hat\Omega$ using a natural estimator, along with some important corollaries. Finally, in Section 4.3, we compare the error rates of our estimators with those in related problems.

Before we start, let us introduce some additional notation used in this section. Define the estimation errors $\hat\Delta_{\mathrm{chol}} := \hat L - L^*$ and $\hat\Delta_{\mathrm{prec}} := \hat\Theta - \Theta^*$, and let $\hat\Delta_j := \hat\beta_j - \beta_j^* \in \mathbb{R}^p$ denote the estimation error of the $j$th column of $B^*$. Let

$$\beta_\infty = \sup_{1 \le i, j \le p} |\beta^*_{ij}|, \qquad \omega_\infty = \sup_{1 \le j \le p} \omega^*_j, \qquad \psi_\infty^2 = \sup_{1 \le j \le p} \Psi^*_{jj}, \qquad \text{where } \Psi^* = (I - B^*)^{-\top} \Omega^* (I - B^*)^{-1}.$$

In the proofs we also use $X_{i\cdot}$ and $X_{\cdot j}$ to denote the $i$th row and $j$th column of $X$, respectively. Let $\tilde X = L^* X \sim N(0, \Psi^* \otimes I_n)$ and $\tilde\varepsilon = L^* \varepsilon$. Then the rows of $\tilde X$, i.e., $\tilde x_i$, $i \in [n]$, are i.i.d. draws from $N(0, \Psi^*)$.

Recall from Section 3 that $\hat\Omega$ is pre-estimated in Step 1 of our two-step learning procedure. As we discuss in more detail in Section 4.1, the accuracy of the $\hat\Theta$ obtained in Step 2 by Algorithm 1 depends on the accuracy of $\hat\Omega$. We therefore define $r(\hat\Omega)$ as a measure of the estimation error of $1/\hat\omega_j^2$:

$$r(\hat\Omega) := \sup_{1 \le j \le p} \left| \frac{1}{\hat\omega_j^2} - \frac{1}{\omega_j^{*2}} \right|. \qquad (12)$$

Existing methods for estimating the error variance in linear models, such as the scaled Lasso (Sun and Zhang, 2012), the square-root lasso (Belloni et al., 2011), and the natural lasso (Yu and Bien, 2019), typically assume independence among samples, which need not hold in our network setting. We discuss additional assumptions, as well as possible choices of consistent estimators of $\Omega^*$ under these assumptions, in Section 4.2. Our main result in Section 4.1, in particular the consistency of the estimated DAG parameters $\hat B$, does not require these additional assumptions (see Remark 10 for more discussion).

4.1 Error Bounds and Consistency of $\hat B^{(1)}$ and $\hat\Theta^{(1)}$

Applying Algorithm 1 with a pre-estimated $\hat\Omega$ as input yields $\hat B^{(t)}$ and $\hat\Theta^{(t)}$. In this section, we study their error bounds and consistency. Although Algorithm 1 is computationally efficient and has the desirable convergence behavior described in Proposition 3, establishing the consistency of $\hat B^{(\infty)}$ and $\hat\Theta^{(\infty)}$ after convergence is technically difficult because of the dependence between $(\hat B^{(t)}, \hat\Theta^{(t)})$ across iterations. However, we show that as long as we have a suitable initial estimate satisfying Assumption 1, Algorithm 1 produces consistent estimators after one iteration, i.e., $(\hat B^{(1)}, \hat\Theta^{(1)})$ is consistent.

Assumption 1. There exists a constant $0 < M \le \sigma_{\min}^2(L^*)$ such that the initial estimate $\hat\Theta^{(0)}$ satisfies $\|\hat\Theta^{(0)} - \Theta^*\|_2 \le M$.

Assumption 1 states that the initial estimate $\hat\Theta^{(0)}$ lies inside an operator-norm ball centered at the true parameter $\Theta^*$ with radius at most a constant $M$, where $M$ is no larger than the smallest eigenvalue of $\Theta^*$. Since $\sigma_{\min}^2(L^*)$ is a constant, Assumption 1 requires that the initial estimate $\hat\Theta^{(0)}$ is not far from the truth. This is much weaker than, for example, assuming that $\hat\Theta^{(0)}$ is a consistent estimator.
This condition can be written as $\|\hat\Theta^{(0)} - \Theta^*\|_2 \le c\,\lambda_{\min}(\Theta^*)$ for some $c \in (0, 1]$. Intuitively, $\lambda_{\min}(\Theta^*)$ can be understood as the minimum radius of $\Theta^*$ over all directions, so the assumption gives a rough scale for how accurate the initial estimate needs to be. The assumption may not be easy to verify in practice, but since it only requires $\hat\Theta^{(0)}$ to lie within an $\ell_2$-ball of constant radius around $\Theta^*$, it is not difficult to satisfy when $\Theta^*$ is sparse and normalized.

Under Assumption 1 we can establish finite-sample error bounds for $(\hat B^{(1)}, \hat\Theta^{(1)})$. Let $m$ denote the maximum degree of the undirected graph $H^*$ and $s = \sup_{1 \le j \le p} \|\beta_j^*\|_0$. Define

$$R(s, p, n) := \max\left\{ \frac{6\,\omega_\infty}{b}\, r(\hat\Omega),\; \frac{72\,\omega_\infty \psi_\infty s}{b} \sqrt{\frac{\log p \, \log^2(\max\{n, p\})}{n}} \right\}, \qquad (13)$$

where $r(\hat\Omega)$ is defined in (12) and $b > 0$ is a constant (see Lemma 7). Following the setup in Ravikumar et al. (2011), the set of nonzero entries of the precision matrix is denoted by $\mathrm{supp}(\Theta^*) := \{(i, j) : \Theta^*_{ij} \ne 0\}$, and we use the shorthand $S$ and $S^c$ for the support and its complement in $[n] \times [n]$, respectively. Define the quantities

$$\kappa_{\Sigma^*} := |||\Sigma^*|||_\infty = \max_{1 \le i \le n} \sum_{j=1}^n |\Sigma^*_{ij}|, \qquad \kappa_{\Gamma^*} := |||(\Gamma^*_{SS})^{-1}|||_\infty, \qquad \Gamma^* = \Theta^{*-1} \otimes \Theta^{*-1}. \qquad (14)$$

The quantities $\kappa_{\Gamma^*}$ and $\kappa_{\Sigma^*}$ in (14) measure, respectively, the scale of the entries of the inverse Hessian $(\Gamma^*_{SS})^{-1}$ of the graphical Lasso log-likelihood (10) and of $\Sigma^*$. The $\ell_2$ error bound of $\hat\Theta$ in Theorem 4 depends on these constants. In general they may scale with $n$ and $p$; to simplify the asymptotic results we assume they remain bounded as $n, p \to \infty$ (Corollaries 8 and 9). See Ravikumar et al. (2011) for a related discussion.

Theorem 4. Consider a sample matrix $X$ from model (2). Let $(\hat\Theta^{(1)}, \hat B^{(1)})$ be the estimates after one iteration of Algorithm 1, given an initial estimator $\hat\Theta^{(0)}$ satisfying Assumption 1, and let $b > 0$ be as defined in Lemma 7. Pick the regularization parameters in (9) and (10) such that

$$\lambda_n \gtrsim \sqrt{\frac{2\log 2 + 4\log p + \tau\log n}{n}}, \qquad \lambda_p \gtrsim \sqrt{\frac{\log(4p)}{p}} + R(s, p, n),$$

where $\tau > 2$ and $r(\hat\Omega)$ is defined in (12). Let $\kappa = \sigma_{\min}(\Psi^*)$. Then for some positive constant $c_1$ we have

$$\sup_j \|\hat\beta^{(1)}_j - \beta^*_j\|_2 \le \frac{c_1 \sqrt{s}}{\kappa}\, \lambda_n \qquad (15)$$

with probability at least $(1 - 2/p)^2 - \big[1/(\exp\{n/32\} - 1) + 1/n^{\tau - 2} + 5n^2/\max\{n, p\}^4\big]$. If, in addition, $n$ and $p$ satisfy

$$3200\, \log(4 n^\tau)\, \max\{160,\; 24 m C\} \le p, \qquad (16)$$

and the following holds,

$$\max\left\{ r(\hat\Omega),\; 4 s\, \omega_\infty^3 \sqrt{\frac{\log p\,\log^2 \max\{n, p\}}{n}} \right\} \le \frac{1}{24 C}, \qquad (17)$$

where $C = \max\{\kappa_{\Sigma^*} \kappa_{\Gamma^*},\; \kappa_{\Sigma^*}^3 \kappa_{\Gamma^*}^2\}$, then we also have

$$\|\hat\Theta^{(1)} - \Theta^*\|_2 \le 4 \kappa_{\Gamma^*}\, m\, \lambda_p \qquad (18)$$

with the same probability.

We leave the detailed proof of Theorem 4 to the Appendix. Note that $\hat\Omega$ is assumed to be pre-specified in this result. If, in addition, $\kappa$, $\psi_\infty$, and $\omega_\infty$ stay bounded, then under the conditions of Theorem 4 there are fixed positive constants $c_2, c_3$ such that, with high probability,

$$\sup_j \|\hat\beta^{(1)}_j - \beta^*_j\|_2^2 \le c_2\, \frac{s \log p}{n}, \qquad \|\hat\Theta^{(1)} - \Theta^*\|_2 \le c_3\, m \left( \sqrt{\frac{\log n}{p}} + r(\hat\Omega) + s \sqrt{\frac{\log p\, \log^2 \max\{n, p\}}{n}} \right). \qquad (19)$$

Remark 5. The error bounds in Theorem 4 hold for any finite sample and for any values of the sparsity measures $s$ (maximum number of parents in the DAG) and $m$ (maximum degree of the undirected graph). If the DAG is dense so that $s = p$, the corresponding error bounds are given by the right-hand sides of (19) with $s = p$; for example, $\sup_j \|\hat\beta^{(1)}_j - \beta^*_j\|_2 \le (c_2\, p \log p / n)^{1/2}$. This bound is small only if $p \ll n$, which is expected when the DAG is not sparse. Similarly, we obtain the corresponding error bounds when $m$ is comparable to $n$ (i.e., when the undirected graph is dense).

4.2 Error Bound on $\hat\Omega$ under Block-diagonal $\Theta^*$

The error bound for $\hat\Theta$ in (19) depends on $r(\hat\Omega)$, the convergence rate of the error variance estimator $\hat\Omega$.
To achieve the desired rate for a consistent estimator of $\Theta^*$, we need additional sparsity assumptions on $\Theta^*$. We therefore restrict our attention mainly to sparse undirected graphs $H^*$ consisting of $N$ connected components, which implies that the row precision matrix $\Theta^*$ is block-diagonal:

$$\Theta^* = \begin{pmatrix} \Theta_1 & & & \\ & \Theta_2 & & \\ & & \ddots & \\ & & & \Theta_N \end{pmatrix}. \qquad (20)$$

The support of $\Theta^*$ inside each diagonal block $\Theta_i$ may be dense. This type of network often arises in applications where individuals form clusters: nodes in the same cluster are densely connected, while nodes from different clusters tend to be nearly independent. The underlying network $H^*$ is sparse when the individuals come from a large number of small clusters; in other words, the sparsity of $H^*$ depends mainly on the number of diagonal blocks in $\Theta^*$. More general network structures are considered in the numerical experiments in Section 5.

If the network of samples is block-diagonal and the samples form many small clusters, we can collect independent samples from different clusters. This intuition suggests that existing methods are readily applicable in our setting to obtain consistent estimates of $\Omega^*$, as long as there are enough independent clusters in the undirected network $H^*$. Given the block-diagonal structure of $\Theta^*$ in (20), there are several ways to estimate $\Omega^*$; we use the natural estimator introduced by Yu and Bien (2019) and estimate $\hat\omega_j^2$ using independent samples in $X$ selected according to the block structure of $\Theta^*$. Let $\Upsilon \subseteq [n]$ be a row index set and let $A^\Upsilon$ denote the submatrix formed by selecting the rows of a matrix $A \in \mathbb{R}^{n \times m}$ with indices $i \in \Upsilon$. We draw one sample from each block and form a smaller $N \times p$ design matrix $X^\Upsilon$. It is not difficult to see that $X^\Upsilon_j = X^\Upsilon \beta^*_j + \varepsilon^\Upsilon_j$. The natural estimator of $\omega_j^{*2}$ for $j \in [p]$ is then defined as in Yu and Bien (2019):

$$\hat\omega_j^2 = \min_{\beta_j} \left\{ \frac{1}{N} \|X^\Upsilon_j - X^\Upsilon \beta_j\|_2^2 + 2\lambda_N \|\beta_j\|_1 \right\}, \qquad (21)$$

where $\lambda_N > 0$ is a tuning parameter. Alternative methods, such as the scaled lasso (Sun and Zhang, 2012) and Stein's estimator (Bayati et al., 2013), can also be used to estimate $\omega_j^2$. Formally, suppose that $\Theta^*$ has the block-diagonal structure in (20) with $N$ blocks, which we assume hereafter for simplicity. If we use the natural estimator (21) to obtain $\hat\omega_j$, then we have the following error bound.

Lemma 6. Let $X$ be generated from (2) and assume $\Theta^*$ is block-diagonal with $N$ blocks. Recall that $s = \sup_j \|\beta^*_j\|_0$. Let $\hat\Omega$ be the natural estimator defined in (21) with

$$\lambda_N = 12\, \psi_\infty \omega_\infty \sqrt{\frac{2\log 2 + 4\log p}{N}};$$

then, with probability at least $1 - 3/p + 1/p^2$ (for $p \ge 2$),

$$\sup_{1 \le j \le p} \big|\hat\omega_j^2 - \omega_j^{*2}\big| \le \lambda_N\, s\, \beta_\infty + 5\,\omega_\infty \sqrt{\frac{\log 2 + \log p}{N}}.$$

Lemma 6 shows that the maximum error of $\hat\omega_j$ is upper bounded by $C s \sqrt{\log p / N}$ for a constant $C > 0$, and hence $\hat\Omega$ is consistent as long as $s\sqrt{\log p / N} \to 0$. When $p$ is fixed and $N$ increases, there are more diagonal blocks in $\Theta^*$, i.e., more independent samples, and thus $\hat\omega_j^2$ becomes more accurate. When $N = n$, $\Theta^*$ is diagonal and the rows of $X$ are i.i.d. We can also easily show the following lemma.

Lemma 7. Suppose $\inf_{1 \le j \le p} \omega^*_j \ge 1$ and $\frac{1}{2} \inf_{1 \le j \le p} \omega_j^{*2} - \sup_{1 \le j \le p} |\hat\omega_j^2 - \omega_j^{*2}| > b > 0$. Then

$$r(\hat\Omega) \le b^{-4} \sup_{1 \le j \le p} \big|\hat\omega_j^2 - \omega_j^{*2}\big|.$$

When $s(\log p / N)^{1/2}$ is small enough, Lemma 6 implies that $b$ can be taken to be a positive constant with high probability. If $\hat\Omega$ satisfies the convergence rate specified in Lemmas 6 and 7, i.e., $r(\hat\Omega) \lesssim s\sqrt{\log p / N}$, then the sample-size constraints in (16) are satisfied as long as

$$m \log n \lesssim p, \qquad s^2 \log p \lesssim N, \qquad s^2 \log^3 \max\{n, p\} \lesssim n. \qquad (22)$$
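A minimal sketch of Step 1 with the natural estimator (21); it assumes the blocks of $\Theta^*$ are known, uses one row per block, and relies on scikit-learn's Lasso rather than the exact tuning prescribed in Lemma 6, so it is illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def natural_variance_estimates(X, blocks, lam):
    """Natural estimator (21) of omega_j^2, using one row drawn from each block.

    `blocks` is a list of row-index arrays (the diagonal blocks of Theta*);
    `lam` plays the role of lambda_N and is a user-chosen tuning parameter.
    """
    rows = [blk[0] for blk in blocks]       # one representative row per block
    XU = X[rows, :]                          # N x p design matrix X^Upsilon
    N, p = XU.shape
    omega2_hat = np.empty(p)
    for j in range(p):
        # regress X_j on the other columns; with a known topological ordering,
        # one would instead restrict the design to the columns preceding j
        y, Z = XU[:, j], np.delete(XU, j, axis=1)
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(Z, y)
        resid = y - Z @ fit.coef_
        omega2_hat[j] = resid @ resid / N + 2 * lam * np.abs(fit.coef_).sum()
    return omega2_hat
```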
Here, $p$, $s$, and $m$ may all grow to infinity as $n \to \infty$. As a result, we have the following two asymptotic results based on Theorem 4. The first considers the scaling $p \gtrsim n$, under which DAG estimation is high-dimensional; the second considers the case $n \gtrsim p$, so that the estimation of $\Theta^*$ is a high-dimensional problem.

Corollary 8. Suppose that the sample size and the number of blocks satisfy $p \gtrsim N \log^2 p$ and $n \asymp N \log p$. Assume $\beta_\infty, \omega_\infty, \psi_\infty, \kappa_{\Gamma^*}, \kappa_{\Sigma^*} < \infty$ as $n, p \to \infty$ and $r(\hat\Omega) \lesssim s\sqrt{\log p / N} \lesssim 1$. Then, under the same assumptions as Theorem 4, $\sup_j \|\hat\beta^{(1)}_j - \beta^*_j\|_2^2$ and $\|\hat\Theta^{(1)} - \Theta^*\|_2$ converge to zero in probability at the rates given in (19).

Corollary 9. Suppose that the sample size and the number of blocks satisfy $n \gtrsim s^2 p \log p \log n$ and $N \gtrsim s^2 p$. Assume $\beta_\infty, \omega_\infty, \psi_\infty, \kappa_{\Gamma^*}, \kappa_{\Sigma^*} < \infty$ as $n, p \to \infty$ and $r(\hat\Omega) \lesssim s\sqrt{\log p / N} \lesssim 1$. Then, under the same assumptions as Theorem 4, $\sup_j \|\hat\beta^{(1)}_j - \beta^*_j\|_2^2$ and $\|\hat\Theta^{(1)} - \Theta^*\|_2$ converge to zero in probability at the rates given in (19).

Remark 10. Although we derived the consistency of $\hat\Theta^{(1)}$ and $\hat B^{(1)}$ in the above two corollaries in the setting where $\Theta^*$ is block-diagonal, these consistency properties still hold even when $\Theta^*$ is not block-diagonal. The main purpose of the block-diagonal setting is to provide an example in which the error of $\hat\Omega$ can be conveniently controlled, but in practice $\Theta^*$ does not have to be block-diagonal. In particular, we did not assume any block-diagonal structure of $\Theta^*$ for the error bounds in Theorem 4. The error bound for $\hat B^{(1)}$ does not depend on the error of $\hat\Omega$ at all, so the accuracy of $\hat\Omega$ has no impact on the accuracy of $\hat B^{(1)}$. The error bound for $\hat\Theta^{(1)}$ in (19) is determined by the trade-off among three terms, one of which is the error rate $r(\hat\Omega)$ in (12). This is also supported by our numerical results: in Section 5 we demonstrate, with both simulated and real networks in which $\Theta^*$ is not block-diagonal, that the proposed BCD method still estimates $\Theta^*$ and $B^*$ accurately whenever a relatively accurate $\hat\Omega$ is provided.

4.3 Comparison to Other Results

If the data matrix $X$ consists of i.i.d. samples generated from the Gaussian linear SEM (1) and the topological sort of the vertices is known, the DAG estimation problem in (8) reduces to solving the standard Lasso regression (9) with $\hat L^{(t)} = I_n$, independently of the initial estimator $\hat\Theta^{(0)}$. Under the restricted eigenvalue condition and with $\lambda_n \asymp \sqrt{\log p / n}$, the Lasso estimator is known to attain the optimal $\ell_2$ rate (van de Geer and Bühlmann, 2009)

$$\sup_j \|\hat\beta_j - \beta^*_j\|_2^2 = O_p\!\left(\frac{s \log p}{n}\right).$$

When the data are dependent, Theorem 4 shows that the estimator from Algorithm 1 achieves the same optimal rate under the extra assumptions above; in particular, all that is needed is a reasonably good initial estimate $\hat\Theta^{(0)}$ such that $\|\hat\Theta^{(0)} - \Theta^*\|_2 \le M$ for a constant $M > 0$. On the other hand, if the underlying DAG is an empty graph and $\Omega^* = I_p$, the problem of estimating $\Theta^*$ can be solved by the graphical Lasso in (10), because the data (the columns of $X$) are i.i.d., and the sample covariance $\hat S$ is an unbiased estimator of $\Sigma^*$. In this case, Ravikumar et al. (2011) showed that

$$\|\hat\Theta - \Theta^*\|_2 = O_p\!\left(m \sqrt{\frac{\log n}{p}}\right).$$

This result does not require knowing $\mathrm{supp}(\Theta^*)$, but it assumes a mutual incoherence condition on the Hessian of the log-likelihood function. In our case, $\hat S^{(1)}$ is biased because of the errors accumulated from the preceding Lasso estimates and from $\hat\Omega$. As a result, there is an extra bias term $R(s, n, p)$ in $\hat S^{(1)} - \Sigma^*$ (see Lemma 22 in Appendix A.3) compared to the i.i.d.
setting:

$$\|\hat S^{(1)} - \Sigma^*\|_\infty \lesssim \delta_f(p, n^\tau) + R(s, n, p),$$

where $\delta_f(p, n^\tau) \asymp \sqrt{\log n / p}$ is the classical graphical Lasso error rate and

$$R(n, p, s) \lesssim \max\left\{ r(\hat\Omega),\; s \sqrt{\frac{\log p\, \log^2 \max\{n, p\}}{n}} \right\}$$

depends on the estimation errors of $\hat B^{(1)}$ and $\hat\Omega$. When $n \gtrsim p$ and $r(\hat\Omega)$ is dominated by $\sqrt{\log n / p}$, we obtain the same rate for the $\ell_2$ consistency of $\hat\Theta$ (Corollary 9), under the slightly stricter sample-size constraint (22). If $n \lesssim p$, the $\ell_2$ error rate is determined by $\max\{r(\hat\Omega),\, s(\log^3 p / n)^{1/2}\}$. Suppose $\Theta^*$ is block-diagonal. If the number of blocks $N$ is much smaller than $n$, then the $\ell_2$ rate is dominated by $r(\hat\Omega) \lesssim s\sqrt{\log p / N}$, which can be slower than the optimal graphical Lasso rate. This is expected, given the error introduced by the DAG estimates $\hat B^{(1)}$ and $\hat\Omega$.

5. Numerical Experiments

Assuming that the observations generated from a DAG model are dependent, we evaluate the performance of the block coordinate descent (BCD) algorithm, i.e., Algorithm 1, in recovering the DAG, compared with traditional methods that treat the data as independent. We expect the BCD method to give more accurate structural estimation than the baselines by taking the dependence information into account. When a topological ordering of the true DAG is known, we can identify a DAG from data using BCD. When the ordering is unknown, the BCD algorithm may still give an accurate estimate of the row correlations, which are invariant to node-wise permutations by Lemma 1. The estimated row correlation matrix can then be used to de-correlate the data so that traditional DAG learning algorithms become applicable. We demonstrate this de-correlation idea with numerical results as well.

5.1 Simulated Networks

We first perform experiments on simulated networks for both the ordered and the unordered case. To apply the BCD algorithm, we need to set the values of $\lambda_1$ and $\lambda_2$ in (8). Since the support of $\Theta$ is restricted to the edge set of $H^*$, we simply fixed $\lambda_2$ to a small value ($\lambda_2 = 0.01$) in all experiments. For each data set, we computed a solution path from the largest value $\lambda_1^{\max}$, for which the estimated DAG is empty, down to $\lambda_1^{\min} = \lambda_1^{\max}/100$. The optimal $\lambda_1$ was then chosen by minimizing the BIC score over the DAGs on the solution path.

We generated random DAGs with $p$ nodes and fixed the total number of edges $s_0$ in each DAG to $2p$. The entries of the weighted adjacency matrix $B^*$ of each DAG were drawn uniformly from $[-1, -0.1] \cup [0.1, 1]$, and the $\omega^*_j$ were sampled uniformly from $[0.1, 2]$. In our simulations of $\Theta^*$, we first considered networks with a clustering structure, i.e., $\Theta^*$ was block-diagonal as in (20). We fixed the size of the clusters to 20 or 30, and within each cluster the individuals were correlated according to one of the following four covariance structures:

- Toeplitz: $\Sigma^*_{ij} = 0.3^{|i-j|/5}$.
- Equal correlation: $\Sigma^*_{ij} = 0.7$ for $i \ne j$, and $\Sigma^*_{ii} = 1$.
- Star-shaped: $\Theta^*_{1j} = \Theta^*_{i1} = a$ for $i, j \ge 2$, with $a \in (0, 1)$, and $\Theta^*_{ii} = 1$.
- Autoregressive (AR): $\Theta^*_{ij} = 0.7^{|i-j|}$ if $|i - j| \le b/4$, and $\Theta^*_{ij} = 0$ otherwise, where $b$ is the cluster size.

The Toeplitz structure implies that the observations are correlated as in a Markov chain, while the equal correlation structure represents the case in which all observations in a cluster are fully connected. The star-shaped and AR structures capture intermediate levels of dependence.
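A small sketch (NumPy/SciPy; parameter values follow the description above, and the star-shaped coupling `a` is an illustrative choice kept small for positive definiteness) of how the four within-cluster structures and a block-diagonal row covariance could be assembled.

```python
import numpy as np
from scipy.linalg import block_diag

def toeplitz_sigma(b):        # Sigma*_ij = 0.3^{|i-j|/5}
    idx = np.arange(b)
    return 0.3 ** (np.abs(idx[:, None] - idx[None, :]) / 5)

def equal_corr_sigma(b):      # Sigma*_ij = 0.7 off-diagonal, 1 on the diagonal
    return 0.7 * np.ones((b, b)) + 0.3 * np.eye(b)

def star_theta(b, a=0.2):     # Theta*_1j = Theta*_i1 = a, Theta*_ii = 1
    T = np.eye(b)
    T[0, 1:] = T[1:, 0] = a   # a must be small enough to keep T positive definite
    return T

def ar_theta(b):              # Theta*_ij = 0.7^{|i-j|} within a band of width b/4
    idx = np.arange(b)
    D = np.abs(idx[:, None] - idx[None, :])
    return np.where(D <= b // 4, 0.7 ** D, 0.0)

# if a particular construction is not positive definite, the adjustment in (23) applies
# block-diagonal row covariance: 5 clusters of size 20, Toeplitz within each cluster
Sigma_star = block_diag(*[toeplitz_sigma(20) for _ in range(5)])
Theta_star = np.linalg.inv(Sigma_star)
print(Sigma_star.shape, Theta_star.shape)
```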
In addition to these block-diagonal covariances, we also considered a more general covariance structure defined through a stochastic block model (SBM), in which $H^*$ consists of several clusters and nodes within a cluster have a higher probability of being connected than nodes from different clusters. More explicitly, we generated $\Theta^*$ as follows:

1. Let $B_1, \ldots, B_L$ be $L$ clusters of varying sizes that form a partition of $\{1, \ldots, n\}$, where the number of clusters $L$ ranges from 5 to 10 in our experiments. Define a probability matrix $P \in \mathbb{R}^{n \times n}$ with $P_{ij} = 0.5$ if $i, j \in B_l$ for some $l \in \{1, \ldots, L\}$, and $P_{ij} = 0.1$ otherwise.
2. Construct the adjacency matrix $A$ of $H^*$ by drawing $A_{ij} \sim \mathrm{Bern}(P_{ij})$.
3. Sample $\Theta_{ij} \sim \mathrm{Unif}[-5, 5]$ if $A_{ij} = 1$, and set $\Theta_{ij} = 0$ otherwise.

To ensure a positive-definite $\Theta^*$, we then apply the transformations

$$\tilde\Theta = (\Theta + \Theta^\top)/2, \qquad \Theta^* = \tilde\Theta - \big(\sigma_{\min}(\tilde\Theta) - 0.01\big) I_n. \qquad (23)$$

In the stochastic block model, two nodes from different clusters in $H^*$ are connected with probability 0.1, so $\Theta^*$ is in general not block-diagonal. As explained in Remark 10, our proposed BCD algorithm does not require $\Theta^*$ to be block-diagonal in practice to produce accurate estimates of $B^*$ and $\Theta^*$; the numerical experiments below support this point and demonstrate the robustness of the BCD method.

We compared the BCD algorithm with its competitors in both the high-dimensional ($p > n$) and the low-dimensional ($p < n$) setting with respect to DAG learning. For each $(n, p)$ and each type of covariance, we simulated 10 random DAGs and generated one data set from equation (2) for each DAG. Thus, we had 10 results for each of the $2 \times 5 = 10$ simulation settings, and we averaged the results over the 10 simulations under each setting for comparison.

5.1.1 Learning with given ordering

Assuming the nodes in a DAG are sorted according to a given topological order, we compared our BCD algorithm against a baseline that fixes $\Theta = I_n$; in other words, the baseline ignores the dependence between observations when estimating the DAG with BCD. The block sizes in $\Theta^*$ were set to 20 in all cases except the SBM, whose block sizes ranged from 5 to 25. Among other estimates, both algorithms return an estimated weighted adjacency matrix $\hat B$ for the optimal $\lambda_1$ selected by BIC. For the BCD algorithm, $\hat B$ and $\hat\Theta$ denote $\hat B^{(\infty)}$ and $\hat\Theta^{(\infty)}$ after convergence (see Algorithm 1). Note that since $\hat\Theta^{(0)}$ is initialized to $I_n$ by default in the BCD algorithm, the estimate $\hat B$ from the baseline algorithm coincides with the estimate $\hat B^{(1)}$ obtained from BCD after one iteration.

We also included three other DAG learning algorithms in our comparison. The first is the Kronecker graphical Lasso (KGLasso) algorithm (Allen and Tibshirani, 2010; Tsiligkaridis et al., 2013) mentioned in Section 2, which estimates $\hat\Psi$ and $\hat\Theta$ via graphical Lasso in an alternating fashion. When estimating $\Theta^*$, KGLasso also makes use of its block-diagonal structure. After KGLasso converges, we perform a Cholesky factorization of $\hat\Psi = (I - \hat B)^{-\top} \hat\Omega (I - \hat B)^{-1}$ according to the given order to obtain $\hat B$ and $\hat\Omega$. A key distinction between BCD and KGLasso is that KGLasso imposes sparsity regularization on $\Psi$ instead of on $B$, so the comparison between the two highlights the importance of imposing sparsity directly on the Cholesky factor. The other two algorithms are PC (Spirtes et al., 2000) and Greedy Equivalence Search (Chickering et al., 2004), both of which were applied with the node ordering given as prior knowledge.
Note that GES in this case is similar to a greedy hill-climbing algorithm, since, restricted to a given ordering, an equivalence class reduces to a single DAG. Given the estimate $\hat B$ from a method, we hard-thresholded the entries of $\hat B$ at a threshold value $\tau$ to obtain an estimated DAG. To compare the methods, we chose $\tau$ such that they predicted roughly the same number of edges (E). We then calculated the number of true positives (TP), false positives (FP), and false negatives (FN, missing edges), and two overall accuracy metrics: the Jaccard index (JI = TP / (FP + $s_0$)) and the structural Hamming distance (SHD = FP + FN); a short sketch of how these metrics can be computed is given below. Note that there were no reversed edges (i.e., estimated edges with incorrect orientation) because the ordering of the nodes was given. Detailed comparisons are summarized in Table 1 and Table 2.

Our DAG model (2) defines a matrix normal distribution in (3), for which the row-wise inverse covariance matrix is $\Theta^* = (\Sigma^*)^{-1}$. The matrix normal matches exactly the model assumption of KGLasso, so KGLasso is expected to be accurate in estimating $\Theta^*$ and $\Psi^*$. However, KGLasso showed lower accuracy (JI and SHD) in estimating the DAG ($B^*$), because it does not impose sparsity on the DAG coefficients $B$. In general, the BCD algorithm outperformed the competitors, with more true positives and fewer false positives in all cases. Because KGLasso does not impose sparsity directly on the DAG structure, it suffered from too many false negatives after thresholding when $p > n$. PC and GES performed better than KGLasso in terms of FDR, SHD, and Jaccard index. When $p < n$, the correlations between observations had a more pronounced impact on the estimation accuracy for DAGs; as a result, by taking the correlations into account, BCD performed better than the baseline, PC, and GES. In particular, BCD substantially reduced the number of missing edges (FNs) and the FDR compared to the baseline. Both BCD and KGLasso yielded accurate estimates of $\hat\Theta$ when $n < p$, except for the SBM. When $n > p$, as the effective sample size $p$ for estimating $\Theta^* \in \mathbb{R}^{n \times n}$ decreased relative to the dimension $n$, $\hat\Theta$ became less accurate. The difference in accuracy between $\hat\Theta = \hat\Theta^{(\infty)}$ and $\hat\Theta^{(1)}$ was not significant.

Figure 2 shows the ROC curves of all methods over a sequence of $\tau$ values under the 10 settings; the $\tau$ sequence contains 30 equally spaced values in $[0, 0.5]$. The BCD algorithm uniformly outperformed the others in terms of the area under the curve (AUC), with substantial margins when $n < p$. When $n > p$, BCD still performed better than the other four methods most of the time, but its lead over KGLasso was less significant in some cases. This was largely due to insufficient regularization on $\hat\Theta$: fixing $\lambda_2 = 0.01$ in this case implies $\lambda_p = 0.01/p = 0.0001$ in the graphical Lasso step (10) of BCD, resulting in severe overestimation of the magnitudes of the entries of $\Theta^*$. After we increased $\lambda_p$ to 0.1, which is still quite small, BCD outperformed the other methods by much larger margins. KGLasso also performed much better when $n > p$, as shown in Table 2 and Figure 2.
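Before turning to the detailed tables, here is a minimal sketch (a hypothetical helper, not the authors' evaluation code) of how the counts and summary metrics above can be computed from an estimated and a true weighted adjacency matrix after hard-thresholding at $\tau$.

```python
import numpy as np

def structure_metrics(B_hat, B_true, tau=0.1):
    """TP/FP/FN, Jaccard index and SHD for a thresholded DAG estimate (ordering given)."""
    est = np.abs(B_hat) > tau            # estimated edge set
    true = B_true != 0                   # true edge set; s0 = true.sum()
    tp = int(np.sum(est & true))
    fp = int(np.sum(est & ~true))
    fn = int(np.sum(~est & true))
    s0 = int(true.sum())
    return {"E": int(est.sum()), "TP": tp, "FP": fp, "FN": fn,
            "FDR": fp / max(tp + fp, 1),
            "JI": tp / (fp + s0),        # Jaccard index as defined above
            "SHD": fp + fn}              # no reversed edges since the ordering is given
```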
| (n, p, s0) | Θ-Network | Method | E | FN | TP | FDR | JI | SHD | err(Θ̂) (err(Θ̂^(1))) |
|---|---|---|---|---|---|---|---|---|---|
| (150, 300, 600) | equi-cor | BCD | 686.2 (98.9) | 214.0 (19.3) | 386.0 (19.3) | **0.355 (0.065)** | **0.443 (0.021)** | **514.2 (65.8)** | 0.00034 (0.00032) |
| | | Baseline | 642.4 (79.0) | 240.3 (20.5) | 359.7 (20.5) | 0.383 (0.052) | 0.410 (0.01) | 523.0 (40.5) | |
| | | KGLasso | 756.2 (146.7) | 504.9 (6.3) | 95.1 (6.3) | 0.822 (0.035) | 0.080 (0.006) | 1166.0 (135.8) | 0.00019 |
| | | PC | 333.1 (6.3) | 426.0 (3.6) | 174.0 (3.6) | 0.476 (0.012) | 0.229 (0.005) | 585.1 (7.7) | |
| | | GES | 734.8 (80.9) | 232.4 (18.8) | 367.6 (18.8) | 0.472 (0.028) | 0.383 (0.007) | 599.6 (45.7) | |
| (200, 400, 800) | toeplitz | BCD | 535.0 (6.2) | 306.4 (4.7) | 493.6 (4.7) | **0.077 (0.004)** | **0.586 (0.005)** | **347.8 (4.3)** | 0.00143 (0.00833) |
| | | Baseline | 549.5 (9.0) | 425.2 (4.3) | 374.8 (4.3) | 0.317 (0.008) | 0.384 (0.004) | 599.9 (6.2) | |
| | | KGLasso | 550.6 (11.2) | 634.3 (3.3) | 165.7 (3.3) | 0.698 (0.004) | 0.139 (0.002) | 1019.2 (7.8) | 0.01617 |
| | | PC | 536.2 (6.7) | 554.4 (3.1) | 245.6 (3.1) | 0.542 (0.006) | 0.225 (0.003) | 845.0 (6.6) | |
| | | GES | 534.0 (5.5) | 496.8 (3.1) | 303.2 (3.1) | 0.432 (0.005) | 0.294 (0.003) | 727.6 (4.9) | |
| (200, 400, 800) | star | BCD | 543.1 (3.7) | 301.1 (2.4) | 498.9 (2.4) | **0.081 (0.003)** | **0.591 (0.003)** | **345.3 (2.7)** | 0.00006 (0.00051) |
| | | Baseline | 515.7 (6.7) | 338.3 (2.1) | 461.7 (2.1) | 0.103 (0.01) | 0.541 (0.003) | 392.3 (5.2) | |
| | | KGLasso | 495.8 (13.9) | 504.3 (7.5) | 295.7 (7.5) | 0.403 (0.006) | 0.295 (0.006) | 704.4 (5.7) | 0.00233 |
| | | PC | 544.1 (4.0) | 375.7 (2.3) | 424.3 (2.3) | 0.220 (0.005) | 0.461 (0.003) | 495.5 (3.9) | |
| | | GES | 543.6 (3.5) | 329.7 (2.8) | 470.3 (2.8) | 0.135 (0.005) | 0.539 (0.004) | 403.0 (4.5) | |
| (100, 200, 400) | AR(5) | BCD | 253.1 (5.7) | 193.8 (2.9) | 206.2 (2.9) | **0.184 (0.008)** | **0.461 (0.004)** | **240.7 (1.5)** | 0.00247 (0.00219) |
| | | Baseline | 245.8 (8.5) | 208.9 (2.6) | 191.1 (2.6) | 0.217 (0.018) | 0.420 (0.004) | 263.6 (4.6) | |
| | | KGLasso | 270.4 (9.3) | 271.1 (3.1) | 128.9 (3.1) | 0.521 (0.007) | 0.237 (0.004) | 412.6 (4.1) | 0.02587 |
| | | PC | 253.5 (6.1) | 237.9 (1.8) | 162.1 (1.8) | 0.358 (0.015) | 0.330 (0.005) | 329.3 (6.1) | |
| | | GES | 314.6 (2.8) | 218.2 (2.2) | 181.8 (2.2) | 0.422 (0.007) | 0.341 (0.005) | 351.0 (4.2) | |
| (100, 300, 600) | SBM | BCD | 343.3 (16.9) | 313.6 (8.5) | 286.4 (8.5) | **0.159 (0.016)** | **0.435 (0.008)** | **370.5 (10.2)** | 0.51266 (0.52297) |
| | | Baseline | 344.6 (16.4) | 338.4 (6.5) | 261.6 (6.5) | 0.233 (0.02) | 0.383 (0.008) | 421.4 (10.2) | |
| | | KGLasso | 301.3 (18.9) | 510.7 (3.0) | 89.3 (3.0) | 0.696 (0.016) | 0.110 (0.003) | 722.7 (16.5) | 0.46201 |
| | | PC | 302.7 (4.1) | 438.3 (1.9) | 161.7 (1.9) | 0.465 (0.008) | 0.218 (0.003) | 579.3 (4.9) |  |
| | | GES | 411.3 (5.8) | 335.2 (3.0) | 264.8 (3.0) | 0.356 (0.006) | 0.355 (0.004) | 481.7 (4.5) | |

Table 1: Results for ordered DAGs on simulated data when n < p. E is the number of estimated edges in the DAG; FN is the number of false negatives; TP is the number of true positives; FDR = FP/(FP + TP) is the false discovery rate; JI = TP/(FP + s0) is the Jaccard index; SHD = FP + FN is the structural Hamming distance. The best performance in terms of FDR, JI, and SHD is highlighted in boldface. The last column shows the ℓ2-estimation errors of Θ̂ and Θ̂^(1) normalized by the true support size; the numbers in brackets are the errors after one iteration of BCD. Each number corresponds to the average (standard error) over 10 simulations.
| (n, p, s0) | Θ-Network | Method | E | FN | TP | FDR | JI | SHD | err(Θ̂) (err(Θ̂^(1))) |
|---|---|---|---|---|---|---|---|---|---|
| (200, 100, 200) | equi-cor | BCD | 143.0 (9.4) | 75.1 (6.4) | 124.9 (6.4) | **0.119 (0.014)** | **0.570 (0.022)** | **93.2 (3.7)** | 0.00211 (0.00199) |
| | | Baseline | 143.6 (9.5) | 102.8 (4.2) | 97.2 (4.2) | 0.311 (0.024) | 0.393 (0.010) | 149.2 (3.0) | |
| | | KGLasso | 142.7 (9.3) | 129.5 (1.8) | 70.5 (1.8) | 0.490 (0.030) | 0.260 (0.007) | 201.7 (7.6) | 0.00207 |
| | | PC | 141.8 (8.7) | 135.9 (2.5) | 64.1 (2.5) | 0.539 (0.019) | 0.231 (0.007) | 213.6 (6.1) | |
| | | GES | 191.2 (3.4) | 103.3 (1.3) | 96.7 (1.3) | 0.493 (0.008) | 0.329 (0.005) | 197.8 (3.4) | |
| (200, 100, 200) | toeplitz | BCD | 167.1 (7.3) | 56.7 (4.0) | 143.3 (4.0) | **0.137 (0.015)** | **0.639 (0.010)** | **80.5 (1.4)** | 0.51735 (0.68703) |
| | | Baseline | 167.3 (7.2) | 104.8 (2.1) | 95.2 (2.1) | 0.426 (0.015) | 0.350 (0.005) | 176.9 (4.0) | |
| | | KGLasso | 167.1 (7.2) | 68.8 (2.2) | 131.2 (2.2) | 0.206 (0.023) | 0.557 (0.007) | 104.7 (3.7) | 0.014 |
| | | PC | 162.6 (5.8) | 130.0 (1.4) | 70.0 (1.4) | 0.566 (0.013) | 0.240 (0.005) | 222.6 (4.9) | |
| | | GES | 235.2 (4.8) | 109.7 (1.3) | 90.3 (1.3) | 0.615 (0.008) | 0.262 (0.005) | 254.6 (4.8) | |
| (200, 100, 200) | star | BCD | 186.9 (7.3) | 50.6 (4.0) | 149.4 (4.0) | **0.199 (0.015)** | **0.629 (0.010)** | **88.1 (1.4)** | 0.54769 (0.39316) |
| | | Baseline | 184.3 (7.2) | 64.4 (2.1) | 135.6 (2.1) | 0.263 (0.015) | 0.546 (0.005) | 113.1 (4.0) | |
| | | KGLasso | 186.1 (7.2) | 66.1 (2.2) | 133.9 (2.2) | 0.279 (0.023) | 0.531 (0.007) | 118.3 (3.7) | 0.18472 |
| | | PC | 170.0 (2.5) | 91.5 (1.3) | 108.5 (1.3) | 0.361 (0.009) | 0.415 (0.006) | 153.0 (2.5) | |
| | | GES | 187.4 (3.9) | 68.9 (2.2) | 131.1 (2.2) | 0.300 (0.005) | 0.511 (0.006) | 125.2 (1.8) | |
| (200, 100, 200) | AR(5) | BCD | 185.9 (6.2) | 52.2 (1.5) | 147.8 (1.5) | **0.200 (0.018)** | **0.622 (0.009)** | **90.3 (4.0)** | 0.01644 (0.01184) |
| | | Baseline | 187.7 (6.9) | 63.8 (1.2) | 136.2 (1.2) | 0.268 (0.021) | 0.544 (0.011) | 115.3 (5.8) | |
| | | KGLasso | 186.0 (6.7) | 59.8 (1.5) | 140.2 (1.5) | 0.240 (0.02) | 0.572 (0.01) | 105.6 (4.7) | 0.01038 |
| | | PC | 175.0 (1.8) | 90.7 (1.7) | 109.3 (1.7) | 0.375 (0.009) | 0.412 (0.008) | 156.4 (3.1) | |
| | | GES | 185.9 (6.6) | 78.9 (2.3) | 121.1 (2.3) | 0.345 (0.011) | 0.457 (0.005) | 143.7 (3.0) | |
| (300, 100, 200) | SBM | BCD | 140.1 (3.5) | 72.2 (2.4) | 127.8 (2.4) | **0.086 (0.008)** | **0.602 (0.009)** | **84.5 (1.7)** | 0.31189 (0.31448) |
| | | Baseline | 139.9 (4.0) | 85.7 (2.4) | 114.3 (2.4) | 0.181 (0.01) | 0.507 (0.009) | 111.3 (2.1) | |
| | | KGLasso | 128.8 (11.0) | 161.1 (3.5) | 38.9 (3.5) | 0.689 (0.023) | 0.134 (0.011) | 251.0 (8.5) | 0.32263 |
| | | PC | 139.5 (3.3) | 113.8 (1.6) | 86.2 (1.6) | 0.379 (0.017) | 0.341 (0.009) | 167.1 (4.4) | |
| | | GES | 146.4 (2.5) | 90.6 (1.95) | 109.4 (1.95) | 0.252 (0.011) | 0.462 (0.009) | 127.6 (2.9) | |

Table 2: Results for ordered DAGs on simulated data when n > p.

This is expected because, when $n$ is large compared to $p$, the dependence between individuals has a greater impact on the accuracy of DAG estimation. Since KGLasso is designed to iteratively estimate $\Theta^*$ and $\Psi^*$, its more accurate estimates of $\Theta^*$ reported in Table 2 compensated for its relatively inaccurate $\hat B$.

We also compared the test data log-likelihood of the five methods. Specifically, in each setting we generated a test sample matrix $X_{\text{test}}$ from the true distribution for each of the 10 repeated simulations and computed $\ell(\hat B, \hat\Theta, \hat\Omega \mid X_{\text{test}})$ using the estimates from each method, following equation (5). Figure 3 shows boxplots of the test data log-likelihood, normalized by $np$ after subtracting the median of the baseline method:

$$\ell_{\text{plot}} = \big(\ell_0 - \mathrm{median}(\ell_0^{\text{baseline}})\big)/(np),$$

where $\ell_0$ is the original test data log-likelihood. The top row shows the test log-likelihood when $n < p$; we did not include KGLasso in four of these cases because its test data log-likelihood values were too small to fit in the same plot. The bottom row shows the results for $n > p$.
For both cases, we see that the test data log-likelihood of the BCD method (in green) is consistently higher than that of the other methods.

Figure 2: ROC curves of BCD, baseline, and KGLasso on simulated and sorted DAGs: the x-axis reports the number of false positive edges and the y-axis the number of true positive edges. Top row: n < p. Bottom row: n > p. Each data point in the ROC curves corresponds to the average over 10 simulations.

Figure 3: Normalized test data log-likelihood of BCD and alternative methods on simulated sorted DAGs. Top row: n < p. Bottom row: n > p. Bar height represents the median log-likelihood from the 10 repeated experiments, with error bars for the 25% and 75% quantiles.

5.1.2 Learning with de-correlation

When the true ordering is unknown, we focus on estimating the row-wise covariance Σ∗. Given $\hat\Sigma$, we can de-correlate the data by Equation (11) and apply existing structural learning methods. In this study, we compared the performances of three structure learning methods before and after de-correlation: GES (Chickering, 2003) and sparsebn (Aragam et al., 2019b), which are score-based methods implemented respectively in the R packages rcausal (Ramsey et al., 2017) and sparsebn, and PC (Spirtes et al., 2000), which is a constraint-based method implemented in pcalg (Kalisch et al., 2012). All three methods rely on the independent-data assumption, so we expect the de-correlation step to improve their performances significantly. Unlike the previous comparison, the ordering of the nodes is unknown, so GES and PC return an estimated CPDAG (completed partially directed acyclic graph) instead of a DAG. Thus, in the following comparisons, we converted both the estimated DAG from sparsebn and the true DAG to CPDAGs, so that all the reported metrics are computed with respect to CPDAGs.

Figure 4: Decrease in SHD (top row) and increase in Jaccard index (bottom row) via de-correlation on simulated unsorted DAGs, with the x-axis reporting the value of (n, p). In each panel, the three boxplots on the left and the three on the right correspond to the cases of n < p and n > p, respectively. Each boxplot contains 10 data points from 10 simulations.
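Equation (11) itself is not reproduced in this excerpt; the following minimal sketch (ours, in Python/NumPy) illustrates the standard Cholesky-based whitening that such a de-correlation step amounts to, assuming the estimated row-wise precision is factored as $\hat\Theta = \hat L \hat L^\top$.

```python
import numpy as np

def decorrelate(X, Theta_hat):
    """Whiten the rows of X given an estimate Theta_hat of the row-wise
    precision matrix (Sigma*)^{-1}. Writing Theta_hat = L L^T (Cholesky),
    the transformed data L^T X has (approximately) uncorrelated rows, so
    standard DAG learners such as PC, GES, or sparsebn can be applied to it."""
    L = np.linalg.cholesky(Theta_hat)   # lower-triangular Cholesky factor
    return L.T @ X
```

The de-correlated matrix can then be passed to PC, GES, or sparsebn as if its rows were independent, which is exactly how the comparisons below are run.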
As before, we divide the cases into n < p and n > p. The block size for the four block-diagonal Θ∗ was fixed to 30. The estimated Cholesky factor $\hat L$ of $\hat\Theta$ used for de-correlating X in (11) was calculated by our BCD algorithm with the tuning parameter λ1 selected by BIC. Figure 4 shows the decrease in SHD and the increase in the Jaccard index via de-correlation for GES, PC, and sparsebn on 10 random DAGs, generated under each row-covariance structure and each sample size. For almost all types of covariances and (n, p) settings we considered, there is a significant improvement of all three methods in estimating the CPDAG structures after de-correlation. Additional tables with detailed results can be found in the Supplementary Material. Before de-correlation, GES and sparsebn, both score-based methods, tend to significantly overestimate the number of edges, resulting in high false positives, as does PC in some of the cases. After de-correlation, both GES and sparsebn had significant improvements and outperformed PC, as long as $\hat\Theta$ was accurately estimated. The test data log-likelihood (normalized by np) of all three algorithms also increased significantly after de-correlation, as shown in Figure 5.

Figure 5: Increase in the normalized test data log-likelihood after de-correlation on simulated unsorted DAGs. Top row: n < p. Bottom row: n > p.

5.2 Experiments on Real Network Structures

In this section, we examine the performance of the BCD algorithm on real network structures. We took four real DAGs from the bnlearn repository (Scutari, 2010): Andes, Hailfinder, Barley, and Hepar2, and two real undirected networks from tnet (Opsahl, 2009): facebook (Opsahl and Panzarasa, 2009) and celegans n306 (Watts and Strogatz, 1998). Only the structures (supports) of these real networks were used, and the parameters of the edges were simulated as follows. Given a DAG structure, we sampled the coefficients β∗_j uniformly from [−1, −0.1] ∪ [0.1, 1]. Given the support of Θ∗, we generated Θ∗_ij uniformly from [−5, 5]. Then, we applied the transformations in (23) to get Θ∗. In order to increase the size of the underlying DAG and show the scalability of the algorithm, we duplicated the DAGs above to form larger networks. In Sections 5.2.1 and 5.2.2, we again consider undirected networks consisting of several disconnected subgraphs, corresponding to a block-diagonal structure in Θ∗. Each of the subgraphs was sub-sampled from the original real network. In Section 5.2.3 we present experiments on more general Θ∗ without a block-diagonal structure. The ω∗_j were uniformly sampled from [0.1, 2] as before. With these parameters, we generated observational samples X following the structural equation (2).

5.2.1 Learning with given ordering

Similar to the previous section, we first look at the results on ordered DAGs. We considered four different combinations of network structures, as shown in Table 3, with both n > p and n < p. Because KGLasso does not make use of the given ordering in its estimation of the column-wise covariance matrix, it became very slow when p is large. Therefore, we did not include it in our comparisons. GES performed significantly worse than the others in some cases, so we removed it from those plots. Note that when n is much smaller than p, as in the case of (Andes, facebook) in Table 3, the BIC score that GES maximizes may not be a good scoring function, resulting in much worse estimation.
BCD continued to outperform the baseline method by modeling the sample correlation. This improvement was more prominent when n > p, where BCD significantly reduced the number of false positive edges, achieving higher JI and lower SHD compared to the baseline and to itself in the n < p case. Figure 6 compares the test data log-likelihood across 10 simulations, and BCD scored significantly higher test data log-likelihood in all the cases. The ROC curves are provided in a figure in the Supplementary Material. Both figures indicate that the BCD method indeed gives better DAG estimates than the baseline and the other standard methods.

| DAG | Θ∗-Network | (n, p, s0) | Method | E | FN | TP | FDR | JI | SHD |
|---|---|---|---|---|---|---|---|---|---|
| Andes (2) | facebook | (100, 446, 676) | BCD | 440.9 (15.9) | 314.7 (7.8) | 361.3 (7.8) | 0.176 (0.014) | 0.478 (0.006) | 394.3 (3.1) |
| | | | Baseline | 436.6 (15.2) | 334.2 (4.8) | 341.8 (4.8) | 0.211 (0.022) | 0.444 (0.006) | 429.0 (9.8) |
| | | | PC | 441.8 (16.2) | 385.2 (6.5) | 290.8 (6.5) | 0.338 (0.012) | 0.351 (0.005) | 536.2 (6.6) |
| | | | GES | 895.7 (10.4) | 504.4 (2.5) | 171.6 (2.5) | 0.808 (0.002) | 0.123 (0.001) | 1228.5 (7.2) |
| | facebook | (500, 446, 676) | BCD | 500.0 (5.9) | 197.3 (5.5) | 478.7 (5.5) | 0.043 (0.002) | 0.686 (0.008) | 218.6 (5.3) |
| | | | Baseline | 499.1 (5.7) | 206.4 (1.9) | 469.6 (1.9) | 0.058 (0.008) | 0.666 (0.003) | 235.9 (3.1) |
| | | | PC | 499.2 (6.6) | 252.2 (3.3) | 423.8 (3.3) | 0.150 (0.006) | 0.564 (0.003) | 327.6 (3.1) |
| | | | GES | 507.6 (2.9) | 295.8 (1.4) | 380.2 (1.4) | 0.251 (0.003) | 0.473 (0.002) | 423.2 (2.5) |
| Hailfinder (4) | celegans n306 | (100, 224, 264) | BCD | 178.5 (4.5) | 113.9 (2.5) | 150.1 (2.5) | 0.157 (0.009) | 0.513 (0.006) | 142.3 (1.7) |
| | | | Baseline | 179.1 (4.8) | 117.8 (1.8) | 146.2 (1.8) | 0.180 (0.016) | 0.493 (0.006) | 150.7 (3.1) |
| | | | PC | 178.5 (5.0) | 154.6 (2.4) | 109.4 (2.4) | 0.386 (0.005) | 0.328 (0.005) | 223.7 (1.2) |
| | | | GES | 435.4 (8.6) | 128.1 (1.7) | 135.9 (1.7) | 0.687 (0.007) | 0.242 (0.005) | 427.6 (8.9) |
| | celegans n306 | (500, 224, 264) | BCD | 198.6 (1.3) | 72.7 (1.0) | 191.3 (1.0) | 0.037 (0.002) | 0.705 (0.003) | 80.0 (0.98) |
| | | | Baseline | 199.0 (1.5) | 74.7 (1.2) | 189.3 (1.2) | 0.049 (0.005) | 0.692 (0.005) | 84.4 (1.6) |
| | | | PC | 198.6 (1.3) | 95.6 (1.2) | 168.4 (1.2) | 0.152 (0.006) | 0.573 (0.006) | 125.8 (2.3) |
| | | | GES | 198.2 (1.5) | 93.8 (1.4) | 170.2 (1.4) | 0.141 (0.006) | 0.583 (0.006) | 121.8 (2.3) |
| Barley (4) | facebook | (100, 192, 336) | BCD | 261.7 (6.7) | 133.8 (2.7) | 202.2 (2.7) | 0.225 (0.012) | 0.511 (0.005) | 193.3 (3.4) |
| | | | Baseline | 261.8 (8.3) | 140.3 (2.6) | 195.7 (2.6) | 0.247 (0.019) | 0.488 (0.009) | 206.4 (6.9) |
| | | | PC | 219.3 (2.4) | 180.0 (1.7) | 156.0 (1.7) | 0.288 (0.006) | 0.391 (0.004) | 243.3 (2.5) |
| | | | GES | 314.5 (4.9) | 159.3 (1.5) | 176.7 (1.5) | 0.437 (0.008) | 0.373 (0.004) | 297.1 (4.7) |
| | facebook | (500, 192, 336) | BCD | 260.5 (4.1) | 91.3 (3.9) | 244.7 (3.9) | 0.061 (0.003) | 0.696 (0.011) | 107.1 (3.9) |
| | | | Baseline | 260.2 (4.7) | 97.6 (2.5) | 238.4 (3.9) | 0.083 (0.008) | 0.666 (0.004) | 119.4 (1.6) |
| | | | PC | 259.9 (4.2) | 135.6 (1.0) | 200.4 (1.0) | 0.227 (0.013) | 0.507 (0.007) | 195.1 (4.9) |
| | | | GES | 260.5 (4.1) | 104.2 (3.2) | 231.8 (3.2) | 0.110 (0.005) | 0.636 (0.008) | 132.9 (2.9) |
| Hepar2 (4) | celegans n306 | (100, 280, 492) | BCD | 366.7 (13.7) | 234.5 (6.2) | 257.5 (6.2) | 0.295 (0.010) | 0.428 (0.006) | 343.7 (3.2) |
| | | | Baseline | 373.9 (14.1) | 238.0 (2.7) | 254.0 (2.7) | 0.314 (0.022) | 0.416 (0.007) | 357.9 (10.9) |
| | | | PC | 308.4 (3.4) | 327.6 (2.4) | 164.4 (2.4) | 0.466 (0.009) | 0.259 (0.005) | 471.6 (5.4) |
| | | | GES | 555.8 (7.3) | 236.6 (1.4) | 255.4 (1.4) | 0.540 (0.004) | 0.322 (0.002) | 537.0 (5.3) |
| | celegans n306 | (500, 280, 492) | BCD | 417.5 (3.4) | 122.2 (3.1) | 369.8 (3.1) | 0.114 (0.003) | 0.685 (0.006) | 169.9 (3.3) |
| | | | Baseline | 417.1 (4.0) | 126.0 (2.1) | 366.0 (2.1) | 0.122 (0.006) | 0.674 (0.005) | 177.1 (3.3) |
| | | | PC | 411.4 (2.9) | 225.9 (1.3) | 266.1 (1.3) | 0.353 (0.005) | 0.418 (0.003) | 371.2 (3.3) |
| | | | GES | 418.6 (3.4) | 114.3 (2.6) | 377.7 (2.6) | 0.098 (0.003) | 0.709 (0.004) | 155.2 (2.3) |
Table 3: Results for ordered DAGs on real network data. Block size is 20 for n < p and 50 for n > p. The number under each DAG reports the number of times it is duplicated to form a large DAG. All numbers represent the average (standard error) over 10 simulations.

Figure 6: Test data log-likelihood normalized by np on real sorted DAGs. Top row: n < p. Bottom row: n > p.

5.2.2 Learning with de-correlation

When the ordering of the DAG nodes is not given, we compared the effect of de-correlation as in Section 5.1.2. All network parameters were generated in the same way as before, but we randomly shuffled the columns of X. The decrease in the structural Hamming distance and the increase in the Jaccard index from de-correlation over 10 simulations are summarized as boxplots in Figure 7. PC performed uniformly better after de-correlation compared to before. GES and sparsebn also improved after de-correlation in most cases. The changes in the test data log-likelihood are shown in Figure 8, which are positive for almost all data sets, except two outliers (removed from the plots) of sparsebn in the second and fourth panels in the top row.

Figure 7: Experiments on real unsorted DAGs. Decrease in SHD (top row) and increase in JI (bottom row) via de-correlation for real networks, where the x-axis reports the value of (n, p). In each panel, the three boxplots on the left and the three on the right correspond to cases of n < p and n > p, respectively.

5.2.3 Learning under general covariance structure

In the following experiments, we generate the support of Θ∗ without the block-diagonal constraint by directly sampling the real undirected networks facebook and celegans n306. In other words, the underlying undirected network may have only one connected subgraph in which all individuals are dependent. This setup poses a major challenge, particularly for the estimation of Θ∗, because its support becomes much larger and, as a result, we will need to impose stronger regularization in the graphical Lasso step when n > p. Moreover, we use the Lasso regression loss (Hastie et al., 2015) to estimate Ω:
$$\hat\omega_j^2 = \min_{\beta} \; \frac{1}{2n}\left\|X_j - X_{\Pi_j}\beta\right\|_2^2 + \lambda \|\beta\|_1,$$
as if the data were i.i.d., where $\lambda = \sqrt{\log(p)/n}$ and $\Pi_j = \{1, \ldots, j-1\}$ is the index set of the potential parent nodes of j. This approximation performed well in our experiments. For simplicity, we focus on the setting p > n so that we can still fix λ2 = 0.01. Proceeding as before, we generate B∗ from real DAGs with duplications: Andes, Hailfinder, Barley, and Hepar2.
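As an illustration of the error-variance estimate displayed above (with λ = √(log(p)/n) and candidate parents Π_j = {1, …, j−1}), the following sketch uses scikit-learn's Lasso, whose objective matches the displayed one; it is our rough approximation of the step, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_error_variances(X):
    """Plug-in estimates of the error variances omega_j^2 via the penalized
    Lasso objective displayed above, with lambda = sqrt(log(p)/n) and
    candidate parents Pi_j = {1, ..., j-1} under the assumed node ordering."""
    n, p = X.shape
    lam = np.sqrt(np.log(p) / n)
    omega_sq = np.zeros(p)
    for j in range(p):
        y = X[:, j]
        if j == 0:
            # no candidate parents: objective value at beta = 0
            omega_sq[j] = np.mean(y ** 2) / 2.0
            continue
        Z = X[:, :j]
        # scikit-learn's Lasso minimizes (1/(2n))||y - Z b||_2^2 + alpha ||b||_1,
        # which matches the displayed objective with alpha = lambda.
        fit = Lasso(alpha=lam, fit_intercept=False).fit(Z, y)
        resid = y - Z @ fit.coef_
        omega_sq[j] = np.sum(resid ** 2) / (2.0 * n) + lam * np.sum(np.abs(fit.coef_))
    return omega_sq
```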
| DAG | Θ∗-Network | (n, p, s0) | Method | E | FN | TP | FDR | JI | SHD |
|---|---|---|---|---|---|---|---|---|---|
| Andes (2) | facebook | (100, 446, 676) | BCD | 424.1 (16.4) | 319.9 (6.9) | 356.1 (6.9) | 0.155 (0.016) | 0.478 (0.004) | 387.9 (3.6) |
| | | | Baseline | 422.2 (16.2) | 326.2 (5.6) | 349.8 (5.6) | 0.165 (0.019) | 0.467 (0.003) | 398.6 (6.2) |
| | | | PC | 424.1 (16.4) | 378.6 (4.9) | 297.4 (4.9) | 0.293 (0.016) | 0.371 (0.003) | 505.3 (7.9) |
| | | | GES | 1782.0 (11.9) | 314.5 (2.2) | 361.5 (2.2) | 0.797 (0.002) | 0.173 (0.002) | 1735.0 (13.4) |
| Hailfinder (4) | celegans n306 | (100, 224, 264) | BCD | 140.8 (6.5) | 153.3 (2.8) | 110.7 (2.8) | 0.206 (0.019) | 0.376 (0.007) | 183.4 (2.5) |
| | | | Baseline | 139.0 (5.8) | 154.8 (2.2) | 109.2 (2.2) | 0.205 (0.028) | 0.372 (0.009) | 184.6 (4.9) |
| | | | PC | 141.5 (6.3) | 160.0 (3.4) | 104.0 (3.4) | 0.261 (0.013) | 0.344 (0.009) | 197.5 (2.4) |
| | | | GES | 312.3 (3.2) | 136.6 (1.3) | 127.4 (1.3) | 0.592 (0.004) | 0.284 (0.003) | 321.5 (2.9) |
| Barley (4) | facebook | (100, 192, 336) | BCD | 252.6 (10.1) | 147.3 (4.6) | 188.7 (4.6) | 0.248 (0.014) | 0.471 (0.007) | 211.2 (3.0) |
| | | | Baseline | 249.4 (10.0) | 155.1 (3.0) | 180.9 (3.0) | 0.268 (0.020) | 0.448 (0.005) | 223.6 (5.1) |
| | | | PC | 206.7 (2.7) | 187.2 (1.1) | 148.8 (1.1) | 0.279 (0.009) | 0.378 (0.004) | 245.1 (2.6) |
| | | | GES | 288.4 (4.2) | 161.2 (1.3) | 174.8 (1.3) | 0.393 (0.006) | 0.389 (0.002) | 274.8 (2.5) |
| Hepar2 (6) | celegans n306 | (100, 420, 738) | BCD | 492.3 (12.7) | 386.7 (6.2) | 351.3 (6.2) | 0.285 (0.008) | 0.399 (0.005) | 527.7 (4.5) |
| | | | Baseline | 490.8 (12.9) | 397.8 (4.1) | 340.2 (4.1) | 0.303 (0.018) | 0.384 (0.007) | 548.4 (12.9) |
| | | | PC | 454.1 (4.0) | 467.8 (1.7) | 270.2 (1.7) | 0.405 (0.007) | 0.293 (0.003) | 651.7 (5.5) |
| | | | GES | 970.7 (7.3) | 358.3 (2.6) | 379.7 (2.6) | 0.609 (0.004) | 0.286 (0.003) | 949.3 (8.5) |

Table 4: Results for ordered DAGs on real network data without block structure.

First, we assume that the DAG ordering is known and compare the BCD method against the baseline, PC, and GES methods. In the four cases that we considered, BCD gave better estimates of the DAG structure in terms of the Jaccard index and structural Hamming distance, as shown in Table 4. Next, without assuming a known DAG ordering, we compare the performance of GES, PC, and sparsebn before and after de-correlation. The top row in Figure 9 shows the increase in the normalized test data log-likelihood after de-correlation, and the increases are positive across the 10 simulations for each of the four scenarios. The bottom row shows the distribution of the decrease in SHD across 10 simulations after de-correlation. In most cases, all three methods gave much more accurate estimates after de-correlation. We defer the additional tables and figures containing more detailed results to the Supplementary Material.

Figure 8: Increases in test data log-likelihood on real unsorted DAGs. Top row: n < p. Bottom row: n > p. Some outliers in the top panels are not shown for a better view of the boxplots.

Figure 9: Results on real unsorted DAGs with general Θ. Top row: increase in normalized test data log-likelihood after de-correlation. Bottom row: decrease in SHD after de-correlation.
The above results confirm that our methods can indeed improve the accuracy of DAG estimation even though Θ∗ is not block-diagonal, as suggested by the theoretical results in Section 4.

6. Application on RNA-seq Data

Gene regulatory networks (GRNs) enable biologists to examine the causal relations in gene expression during different biological processes, and are usually estimated from gene expression data. Recent advances in single-cell RNA sequencing technology have made it possible to trace cellular lineages during differentiation and to identify new cell types by measuring the gene expression of thousands of individual cells. A key question now is whether we can discover the GRN that controls cellular differentiation and drives transitions from one cell type to another using this type of data. Such GRNs can be interpreted as causal networks among genes, where nodes correspond to different genes, and a directed edge encodes a direct causal effect of one gene on another. The RNA-seq data set used in this section was generated by Chu et al. (2016) and is accessible through the Gene Expression Omnibus (GEO) series accession number GSE75748. The data set contains measurements of gene expression of around 20,000 genes from n = 1018 cells. The cells are progenitor cells differentiated from human embryonic stem (ES) cells. The cell types correspond to different lineage-specific progenitors, undifferentiated ES cells, and foreskin fibroblasts (as a control). As lineage-specific progenitor cells were differentiated from the same population of ES cells, it is reasonable to assume dependence among the cells. This was also confirmed with the principal component analysis by Chu et al. (2016) (their Figure 1b). Before conducting the experiment, we processed the data according to Li and Li (2018) by imputing missing values and applying a log transformation. In this study, we focus on estimating a GRN among p = 51 target genes selected by Chu et al. (2016), while the rest of the genes, which we call background genes, are used to estimate an undirected network for all 1018 cells.

6.1 Pre-estimate the Undirected Network

An essential input to Algorithm 1 is H∗, the undirected network of observations (cells in this case). In the dataset, the progenitor cells differentiated from human ES cells included neuronal progenitor cells (NPCs, n = 173), endoderm derivative cells (DE, n = 138), endothelial cells (ECs, n = 105), and trophoblast-like cells (TBs, n = 69). Undifferentiated H1 (n = 212) and H9 (n = 162) human ES cells and human foreskin fibroblasts (HFFs, n = 159) as a control were also included. In total, the 1018 single cells in the dataset consist of the above seven distinct cell types. It is reasonable to assume that the similarity between cells of different types is minimal, and we posit that the network of the 1018 cells consists of at least 7 connected components, i.e., N ≥ 7, where N denotes the number of diagonal blocks in Θ∗ as in Section 4. Although cells from different blocks may not be completely independent, their dependence should be much weaker than that within the same clusters. We used the block structure as an approximation, which greatly reduced the model and computational complexities. Since it is unlikely that all cells of the same type are strongly associated with one another, we further divided each type of cells into smaller clusters by applying a classical clustering algorithm to the background genes. More specifically, we randomly selected 8000 genes from the background genes and applied hierarchical clustering to each type of cells. In this experiment, we used hierarchical clustering with complete linkage and a distance metric between two cells defined as 1 − ρ, where ρ is the correlation between their observed gene expression levels. We verified our choice of clustering algorithm by applying it to the entire data set, and it clearly grouped the cells into 7 groups, coinciding with the 7 cell types. At the end of the hierarchical clustering step, we needed to pick cut-off levels in order to finish clustering. Because the levels of dependence among cells are quite different across cell types, as shown in Figure 10, we picked the cut-off points separately for each cell type. Generally, we chose the cut-off thresholds such that the largest cluster is smaller than p = 51, so λ2 can be set to 0.01. By shifting the cut-off levels, we also obtained different numbers N of blocks in Θ∗. In the end, this clustering process returned an adjacency matrix A of the estimated network defined by the N clusters. In our experiments, we compared results from three choices of N ∈ {383, 519, 698}, much larger than the number of cell types. These refined small clusters will capture more subtle dependence relations among cells. The cluster size varied from 1 (singleton clusters) to 43 across the three values of N.

Figure 10: Cluster dendrograms of H1 and DEC cells from hierarchical clustering. The y-axis represents 1 − ρ and the leaf nodes are individual cells.
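A minimal sketch of this clustering step is given below (Python/SciPy); the function name and the specific rule for choosing the cut height are ours, whereas in the paper the cut-offs were tuned per cell type by inspecting the dendrograms.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_cells(E, max_cluster_size=51):
    """Refine one cell type into smaller blocks. E is an (n_cells x n_genes)
    expression matrix over the selected background genes. Cells are clustered
    by complete-linkage hierarchical clustering with distance 1 - rho, where
    rho is the correlation between two cells' expression profiles. The cut
    height is increased until the next cut would create a cluster of size
    >= max_cluster_size (= p here)."""
    n_cells = E.shape[0]
    D = 1.0 - np.corrcoef(E)                       # pairwise 1 - correlation
    Z = linkage(squareform(D, checks=False), method="complete")
    labels = np.arange(1, n_cells + 1)             # start from singleton clusters
    for h in np.unique(Z[:, 2]):                   # candidate cut heights
        cut = fcluster(Z, t=h, criterion="distance")
        if np.bincount(cut)[1:].max() >= max_cluster_size:
            break
        labels = cut
    return labels
```

Concatenating the resulting cluster labels across cell types yields the adjacency matrix A of the estimated undirected network, with one diagonal block per cluster.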
6.2 Model Evaluation

In this experiment, the input to the BCD algorithm is a data matrix X ∈ R^{1018×51} and the network A estimated by the above hierarchical clustering. The matrix of error variances $\hat\Omega$ was obtained following the method described in Section 4.2. The output is a solution path of ($\hat B$, $\hat\Theta$) over a range of values of λ1. We computed the corresponding MLEs ($\hat B^{\mathrm{MLE}}$, $\hat\Omega^{\mathrm{MLE}}$) given the support of each $\hat B$ on the solution path. We picked the ($\hat B^{\mathrm{MLE}}$, $\hat\Omega^{\mathrm{MLE}}$) with the smallest BIC from the solution path, as in Section 5, and used the corresponding $\hat\Theta$ to de-correlate X. Table 5 shows the results of GES, PC, and sparsebn before and after de-correlation. In each case, we computed the BIC of the estimated GRN and used the likelihood ratio (LR) test to determine whether the increase in log-likelihood from de-correlation is significant or not.

| N | Method | Δ(BIC) | LR χ² | df | p-value | Z-score |
|---|---|---|---|---|---|---|
| 383 | GES | 6762.79 | 10148.79 | 3386 | < 10⁻⁵ | 82.2 |
| | PC | 18171.20 | 21596.19 | 3425 | < 10⁻⁵ | 219.5 |
| | sparsebn | 9341.59 | 12722.59 | 3381 | < 10⁻⁵ | 113.6 |
| 519 | GES | −1723.65 | 8.34 | 1732 | > 0.99 | −29.2 |
| | PC | 4635.81 | 6388.81 | 1753 | < 10⁻⁵ | 78.3 |
| | sparsebn | 244.61 | 1976.61 | 1732 | 3.3 × 10⁻⁵ | 4.2 |
| 698 | GES | −114.97 | 573.03 | 688 | > 0.99 | −3.1 |
| | PC | 2856.79 | 3552.79 | 696 | < 10⁻⁵ | 76.6 |
| | sparsebn | 1198.33 | 1875.33 | 677 | < 10⁻⁵ | 32.5 |

Table 5: Compare goodness-of-fit of standard DAG learning methods with and without de-correlation of the data. N denotes the number of clusters among cells. Δ(BIC) = BIC(baseline) − BIC(decor) is the change in BIC scores before and after de-correlation (a positive value means improved model fit). LR χ² is the likelihood ratio statistic (24), followed by its degrees of freedom (df). p-values are computed based on the χ² test. We also include the Z-score = (LR − df)/√(2df).
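To make the quantities in Table 5 concrete, the LR statistic, its χ² p-value, and the Z-score can be computed as in the sketch below (function and argument names are ours; `df` is the degrees-of-freedom count defined together with equation (24)).

```python
import numpy as np
from scipy.stats import chi2

def lr_test_summary(loglik_decor, loglik_baseline, df):
    """LR statistic, chi-square p-value, and standardized Z-score as reported
    in Table 5. loglik_decor and loglik_baseline are the maximized
    log-likelihoods of the de-correlated and baseline models; df is the
    difference in the number of free parameters (see equation (24))."""
    lr = 2.0 * (loglik_decor - loglik_baseline)
    p_value = chi2.sf(lr, df)                  # P(chi2_df >= LR)
    z = (lr - df) / np.sqrt(2.0 * df)
    return {"LR": lr, "df": df, "p-value": p_value, "Z": z}
```

For instance, the first row of Table 5 has LR ≈ 10148.79 with df = 3386, which gives Z ≈ (10148.79 − 3386)/√(2 · 3386) ≈ 82.2.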
The LR test statistic is defined as follows:
$$\mathrm{LR} = 2\left[\log p\left(X \mid \hat\Theta, \hat B^{\mathrm{MLE}}_{\mathrm{decor}}, \hat\Omega^{\mathrm{MLE}}_{\mathrm{decor}}\right) - \log p\left(X \mid I_n, \hat B^{\mathrm{MLE}}_{\mathrm{baseline}}, \hat\Omega^{\mathrm{MLE}}_{\mathrm{baseline}}\right)\right], \qquad (24)$$
where $(\hat B^{\mathrm{MLE}}_{\mathrm{baseline}}, \hat\Omega^{\mathrm{MLE}}_{\mathrm{baseline}})$ and $(\hat B^{\mathrm{MLE}}_{\mathrm{decor}}, \hat\Omega^{\mathrm{MLE}}_{\mathrm{decor}})$ denote the MLEs given the estimated graph structures from GES, PC, and sparsebn before and after de-correlation, respectively. If the baseline model (before de-correlation) is true, then the LR statistic follows approximately a χ² distribution with degrees of freedom
$$\mathrm{df} = \frac{|\mathrm{supp}(\hat\Theta)| - n}{2} + \left|\mathrm{supp}\left(\hat B^{\mathrm{MLE}}_{\mathrm{decor}}\right)\right| - \left|\mathrm{supp}\left(\hat B^{\mathrm{MLE}}_{\mathrm{baseline}}\right)\right|.$$
As reported in Table 5, for most cases, we saw significant improvements, in terms of both the BIC and the χ² statistic, in all three DAG estimation methods by de-correlating X using the estimated $\hat\Theta$ from the BCD algorithm. To better quantify the significance, we also include a standardized Z-score of the LR statistic, Z = (LR − df)/√(2df), as df and 2df are the mean and the variance of the LR statistic under the null. Since most standard DAG structure learning methods ignore data dependence, this application shows that considering such dependence would improve the goodness of fit of DAG models. This confirms the dependence among individual cells and implies that our proposed network model fits this real-world data better. Figure 11 shows the estimated CPDAGs after de-correlation for the case N = 383, which corresponds to the minimum BIC for all three methods in our experiments. It is interesting to note that a directed edge NANOG → POU5F1, between the two master regulators in embryonic stem cells, appears in all three estimated CPDAGs, consistent with previously reported gene regulatory networks (Chen et al., 2008; Zhou et al., 2007).

Figure 11: Estimated gene regulatory networks (CPDAGs) after de-correlation, with E edges and U undirected edges colored in red. Panels: (a) GES (E = 131, U = 5); (b) PC (E = 119, U = 1); (c) sparsebn (E = 90, U = 0).

7. Discussion

In this paper our main goal is to generalize the existing Gaussian DAG model to dependent data. We proposed to model the covariance between observations by assuming a non-diagonal covariance structure of the noise vectors. Our main contributions include the development of a consistent structural learning method for the DAG and the sample network under sparsity assumptions, as well as finite-sample guarantees for the estimators. Our proposed BCD algorithm is built upon existing Lasso regression and graphical Lasso covariance estimation methods. When a topological ordering of the true DAG is known, it estimates the covariance between the observations Σ∗ and the WAM of the DAG B∗ in an iterative way. The method is fast and often converges in a few iterations. Our theoretical analysis shows that the estimates after one iteration are ℓ2-consistent under various asymptotic frameworks, including both n ≫ p and n ≪ p, assuming a proper initialization of the precision matrix $\hat\Theta^{(0)}$. The estimate of the DAG WAM $\hat B^{(1)}$ achieves the optimal rate of Lasso estimators.
The estimate of the precision matrix bΘ(1) achieves the same optimal rate as the graphical Lasso method when n p and there are sufficiently many independent subgroups within the data. Otherwise, it has a slightly worse rate due to the bias of the sample covariance matrix. We have shown that covariance Σ is invariant under permutations of the DAG nodes. Therefore, when the ordering is unknown but the true DAG is sparse, our BCD algorithm can still give a good estimate of Θ which can be used to de-correlate the data. In addition to the theoretical analysis, we conducted extensive experiments on both synthetic and real data to compare our method with existing methods. When a true ordering of the DAG was given, the BCD algorithm significantly improved the structural estimation accuracy compared to the baseline method which ignored the sample dependency. When the ordering was unknown, classical DAG learning methods, such as GES, PC, and sparsebn, can all be greatly improved with respect to structural learning of CPDAGs by using our proposed de-correlation method based on the BCD algorithm. In all cases, our estimation methods under the proposed network Gaussian DAG model yielded significantly higher test data log-likelihood compared to other competing methods, indicating better predictive modeling performance. There are several unexplored directions from our research. First, the current error bounds and consistency results are based on a known topological ordering of the true DAG. In practice, however, it can be hard to obtain the order in advance. It would be interesting to see if we can combine the method of estimating partial orders such as Niinim aki et al. (2016) with our method and extend the theoretical results. Second, part of the reason the current model relies on a known DAG ordering is the lack of experimental data. From purely observational data, it is impossible to orient some of the edges and find a topological ordering of the true DAG. In the next step, we would like to extend our method to handle both observational and experimental data sets. Finally, there are recent methods that use continuous optimization for DAG learning without imposing the acyclicity constraint, such as NOTEARS (Zheng et al., 2018). It is a promising future direction to incorporate such ideas into DAG learning on network data. Learning Gaussian DAGs from Network Data Acknowledgments This work was supported by NSF grants DMS-1952929 and DMS-2305631 (to Q.Z.). Appendix A. Technical Details of Section 4 A.1 Some Auxiliary Results Here we introduce four lemmas that we use to establish the error bounds of bΘ(1) and b B(1). Let us start by deriving an upper bound on the ℓ2 deviation of b L(0) from L under Assumption 1. Lemma 11 Suppose Assumption 1 holds and let L and b L(t) be the Cholesky factors of Θ and bΘ(t), respectively. Let b (t) chol = b L(t) L . Then, b (0) chol 2 M 2σmin(L ), where σmin(L ) is the smallest singular value of L , and M is from Assumption 1. To generalize the basic bound on ˆβlasso β j 2 from Buhlmann and van de Geer (2011) to dependent data, we need to control the ℓ -norm of an empirical process component 2X bΘ(0)εj/n. Let us start with the case when the data are independent. Define the following events, where e X = L X represents the independent data as explained in Section 4: e Xk 2 6 ψ n . (25) Then the following lemma follows from e X being sub-Gaussian. Lemma 12 Let α 2 be an integer. If n > 2 2α log p, then the event E defined in (25) holds with probability at least 1 1/pα 1. 
Next, let e X[j 1] denote the first j 1 columns in e X and define the following events that depend on λn Tj := n 2 e X [j 1] εj /n λn o , j = 1, . . . , p j=1 Tj, (26) where εj = L εj for j [p]. Lemma 13 Let e Xk consist of n i.i.d sub-Gaussian random variables with parameter ψ2 for k = 1, . . . , p. If n > 4 log p and λn = 12 ψ ω 2 log 2 + 4 log p then the probability of T satisfies Li, Madrid-Padilla, and Zhou Lemma 13 implies that if λn q n , then the error terms will be uniformly under control with high probability, especially when both n and p are large. Lemma 14 (Maximal inequality) Let xi = (xi1, . . . , xip) be a random vector where each element xij is sub-Gaussian with parameter ψ2, then From model (3), it is clear that each row in e X is sub-Gaussian with parameter ψ2. By Lemma 14, we have xi log p w.h.p. A.2 Error Bound of b B(1) The estimation error bound for the classical Lasso problem where samples are i.i.d. was established by choosing a penalty parameter that dominates the measurement error term. Specifically, as shown in Lemma 13, this can be achieved with high probability by setting n . In order to prove the consistency of b B(1) from Algorithm 1, we need to control a similar error term which depends on bΘ(0). Notably, such error can be controlled under the same rate as λn: Theorem 15 (Control the empirical process) Let λn be the same as in Lemma 13. Suppose the initial estimator bΘ(0) satisfies Assumption 1. Then for n > 4 log p, sup j [p] 2 X bΘ(0)εj /n λn Next, we show that the random matrix b L(0)X satisfies the Restricted Eigenvalue (RE) condition (Wainwright, 2019) w.h.p. Towards that end, we define the event K as in Theorem 7.16 from Wainwright (2019) given as K := e Xβ 2 2/n c1 Ψ β 2 2 c2ρ2(Ψ )log p where ρ2(Ψ ) is the maximum diagonal entry of Ψ , e X = L X is the de-correlated data, and c1 < 1 < c2 are positive constants. Lemma 16 (Restricted eigenvalue condition) Consider a random matrix X Rn p drawn from Nn p(0, Σ , Ψ ). Let bΘ(0) be the initial estimate of Θ = Σ 1 satisfying Assumption 1, b L(0) be the Cholesky factor of bΘ(0), and ρ2(Ψ ) be the maximum diagonal entry of Ψ . Then under event K defined in (28), there are universal positive constants c1 < 1 < c2 such that b L(0)Xβ 2 2 n c1 Ψ β 2 2 c2ρ2(Ψ )log p n β 2 1, (29) for all β Rp. Learning Gaussian DAGs from Network Data Theorem 7.16 from Wainwright (2019) shows that the event K happens with high probability. This event is a restriction on the design matrix e X and it holds with high probability for a variety of matrix ensembles. With Theorem 15 and Lemma 16, it is possible to prove an oracle inequality for the dependent Lasso problem, which yields a family of upper bounds on the estimation error. Theorem 17 (Lasso oracle inequality) Consider the Lasso problem in (9) for t = 0. Suppose the inequality (29) and the event in (27) hold. Let κ = σmin(Ψ ). For j [p] and any β j Rp, if 2 log 2 + 4 log p then any optimal solution ˆβ(1) j satisfies: ˆβ(1) j β j 2 2 144λ2 n c2 1 κ2 |S| + 32λn c1 κ β j,Sc 1 + 128c2 n β j,Sc 2 1, (30) for any subset S with cardinality |S| c1 64c2 κ ρ2(Ψ ) n log p. Let b L(0) be the Cholesky factor of bΘ(0). Then, b L(0)X(ˆβ(1) j β j ) 2 2 6λn β j 1. (31) Theorem 17 implies, with high probability, that supj [p] ˆβ(1) j β j 2 2 768λ2 n c2 1 κ2 s slog p n , where s is the maximum in-degree of the true DAG. A.3 Error Bounds of bΘ(1) Recall that s denotes the maximum number of nonzero entries in β j for j [p]. 
In order to control bΘ(1) Θ 2, we need to rely on certain type of error bound on b S(1) Σ , where b S(1) is the sample covariance defined in (10) for t = 0. Therefore, we adopt the definition of tail condition on the sample covariance from Ravikumar et al. (2011). Definition 18 (Tail conditions) We say the n p random matrix X from model (2) satisfies tail condition T (f, v ) if there exists a constant v (0, ] and a function f : N (0, ) (0, ) such that for any (i, j) [n] [n], P h |b S(1) ij Σ ij| δ i 1/f(p, δ) δ (0, 1/v ]. We require f(p, δ) to be monotonically increasing in p, so for a fixed δ > 0, define the inverse function pf(δ; r) := arg max {p | f(p, δ) r} . Similarly, f should be increasing in δ for each fixed p, so we define an inverse function in the second argument: δf(p; r) := arg max {δ | f(p, δ) r} . (32) Under the setting of a Gaussian DAG model, we can derive a sub-Gaussian tail bound. Li, Madrid-Padilla, and Zhou Lemma 19 Let X be a sample from our Gaussian DAG model (2). The sample covariance matrix Xj Xβ j Xj Xβ j (33) satisfies the tail bound sup i,j |bΣij Σ ij| > δ for all δ (0, 40). Corollary 20 If f(p, δ) = 4 exp n pδ2 3200 o , then the inverse function δf(p; nτ) takes the following form, δf(p; nτ) = 40 τ log n log 4 Based on the tail bound in Corollary 20, we can control the sampling noise bΣ Σ as in Lemma 21. Lemma 21 (Lemma 8 in Ravikumar et al. 2011) Define event A = n bΣ Σ δf(p; nτ) o , (34) where δf(p; nτ) = 40 τ log n log 4 p . For any τ > 2 and (n, p) such that δf(p; nτ) 1/40, we have P [Ac] 1 nτ 2 0, as n . Recall r(bΩ) defined in (12) and the constant b defined in Lemma 7. Lemma 22 Suppose b > 0 and sup j ˆβ(1) j β j 2 2 c slog p n c s2 log p n < 1, r(bΩ) 1, (35) for a fixed positive constant c. Let c W (1) := b S(1) Σ . Then, we have 6 ωr(bΩ), 144 ω ψs c log p log2 max{n, p} with probability at least 1 1/nτ 2 5n2/ max{n, p}4. Learning Gaussian DAGs from Network Data Theorem 23 Assume b B(1) satisfies (35) and b > 0. Let R(s, p, n) = max 6 ωr(bΩ), 144 ω ψs log p log2 max{n, p} δf(p; nτ) = 40 τ log n log 4 Consider the graphical Lasso estimate bΘ(1) from Algorithm 1 with λp = δf(p; nτ) + R for τ > 2. Assume pf (1/ max {160, 24m C} ; nτ) p and R 1 24m C , (36) where C = max κΣ κΓ , κ3 Σ κ2 Γ , and m is the maximum degree of the undirected network H . Then, with probability at least 1 1/nτ 2 5n2/ max{n, p}4, we have bΘ(1) Θ 4κΓ δf(p; nτ) + R , bΘ(1) Θ 2 4κΓ (m + 1) δf(p; nτ) + R . Appendix B. Proofs of Main Results We collect the proofs of our main theoretical results here, including Theorem 2 in Section 2, Proposition 3 in Section 3, and Theorem 4, Corollary 8 and Corollary 9 in Section 4. B.1 Proof of Proposition 3 Proof First, notice that, with probability one, X has full column rank. Also, because the Cholesky factor b L(t) is always positive definite for each iteration, b L(t)X is in general position a.s. Note that the two terms of (8) are differentiable (regular) and the whole function is continuous. Furthermore, solving (8) with respect to each variable gives a unique coordinatewise minimum. Therefore, by Theorem 4.1 (c) in Tseng (2001), the block coordinate descent converges to a stationary point. B.2 Proof of Theorem 2 Proof According to the matrix normal presentation (3), we write L(B, Ω, Θ) = L(Ψ(B, Ω), Θ) as the likelihood, where Ψ(B, Ω) = (I B) Ω(I B) 1. Since G1 and G2 are Markov equivalent, they represent the same set of Gaussian distributions, each parameterized by a covariance matrix Ψ. Let ( b Bm, bΩm, bΘm) be any MLE given Gm for m = 1, 2. 
Then, there exists ( e B2, eΩ2), where e B2 is a WAM for G2, such that Ψ( b B1, bΩ1) = Ψ( e B2, eΩ2). Therefore, L( b B1, bΩ1, bΘ1) = L( e B2, eΩ2, bΘ1) L( b B2, bΩ2, bΘ2). By symmetry, L( b B2, bΩ2, bΘ2) L( b B1, bΩ1, bΘ1), and thus, L( b B1, bΩ1, bΘ1) = L( b B2, bΩ2, bΘ2). Li, Madrid-Padilla, and Zhou B.3 Proof of Theorem 4 Proof We first prove the consistency in b B(1). Under Assumption 1, Theorem 15 shows that for the given λn, the empirical process term of the noises can be uniformly bounded with high probability. Therefore, in order to obtain the conclusion in Theorem 17, we only need the inequality (29) in Lemma 16 to hold. Since the event K in (28) holds with high probability by Theorem 7.16 in Wainwright (2019), (29) holds by Lemma 16. Next, we show bΘ(1) is consistent by invoking Theorem 23. For the chosen λp and under the constraint on (n, p) specified in (16) and (17), the sample size requirement in (36) is satisfied. Therefore, the results follow from Theorem 23. Combining Theorems 17 and 23, we get the desired results. B.4 Proof of Corollary 8 Proof The rate of supj ˆβ(1) j β j 2 2 follows directly from the choice of λn q bΘ(1) Θ 2 = Op (N log2 p n). B.5 Proof of Corollary 9 Proof The rate of supj ˆβ(1) j β j 2 2 can be derived in the same way as in the proof of Corollary 8. Since r(bΩ) = Op bΘ(1) Θ 2 = Op log p log2 n log p log2 n Learning Gaussian DAGs from Network Data Since N s2p = q N and n s2p log p log n = q s2 log p log2 n n , we arrive at bΘ(1) Θ 2 = Op Appendix C. Proofs of Intermediate Results for Theorem 4 We include the proofs for all the intermediates results that lead to Theorem 4 in this section. C.1 Proof of Theorem 15 Proof For any j = 1, . . . , p, 2 n X bΘ(0)εj = 2 n X (Θ + b (0) prec)εj 2 n e X εj + 2 n X b (0) precεj n e X εj + 2 n e X L b (0) prec L 1 εj . Let b K(0) = L b (0) prec L 1. Then following Assumption 1, b K(0) 2 b (0) prec 2/σ2 min(L ) M/σ2 min(L ). Under event E defined in (25), for j [p], b K(0) e Xj 2 6 ψM n/σ2 min(L ) 6 ψ n. For j [p], define the event T j := n 2 e X b K(0) εj /n < λn/2 o . Similar to the proof of Lemma 13, we can show P p j=1T c j E 1 and together with Lemma 12 we have T \ T | E P (E ) 1 2 where T is defined in (26). Li, Madrid-Padilla, and Zhou C.2 Proof of Lemma 16 Proof We observe that b L(0)Xθ 2/ n L Xθ 2 n b (0) chol Xθ 2/ n = L Xθ 2/ n b (0) chol L 1L Xθ 2/ n 1 b (0) chol 2/σmin (L ) L Xθ 2/ n 1 M 2σ2 min (L ) e Xθ 2/ n (By Assumption 1 and Lemma 11) 1 2 n e Xθ 2, when n is sufficiently large. Since event K defined in (28) holds, by Theorem 7.16 in Wainwright (2019), we have b L(0)Xθ 2 2 n 1 Ψ θ 2 2 c2ρ2(Ψ )log p Ψ θ 2 2 c2ρ2(Ψ )log p where cj = cj/4, j = 1, 2. C.3 Proof of Theorem 17 Proof Consider the penalized negative likelihood function from (9): ℓ(βj, λn) = 1 2n b L(0)Xj b L(0)Xβj 2 2 + λn βj 1. For simplicity, we drop the superscript (t) in ˆβ(1) and b L(0). Let ρ stand for ρ(Ψ ), β j Rp , and b j = ˆβj β j . We start from the basic inequality (Wainwright, 2019): ℓ(ˆβj, λn) ℓ(β j , λn) = 1 2n b Lεj 2 2 + λn β j 1. After rearranging some terms, 2n b LX b j 2 2 ε j bΘX b j n + λn β j 1 ˆβj 1 . (37) Next, for any subset S [p], we have β j 1 ˆβj 1 = β j,S 1 + β j,Sc 1 β j,S + b j,S 1 b j,Sc + β j,Sc 1. 
(38) Learning Gaussian DAGs from Network Data Combining (37) with (38), and applying triangle and H older s inequalities, 2n b LX b j 2 2 1 nε j bΘX b j + λn b j,S 1 b j,Sc 1 + 2 β j,Sc 1 X bΘεj /n b j 1 + λn b j,S 1 b j,Sc 1 + 2 β j,Sc 1 b j 1 + 2 b j,S 1 2 b j,Sc 1 + 4 β j,Sc 1 h 3 b j,S 1 b j,Sc 1 + 4 β j,Sc 1 i , (39) and so b j 1 4 b j,S 1 + β j,Sc 1 . This inequality implies (by Cauchy-Schwarz inequality) b j 2 1 4 b j,S 1 + 4 β j,Sc 1 2 32 |S| b j 2 2 + β j,Sc 2 1 . (40) Next, from (29) and (40), we know, b LX b j 2 2 n c1 κ 32c2ρ2|S|log p b j 2 2 32c2ρ2 log p n β j,Sc 2 1 2 b j 2 2 32c2ρ2 log p n β j,Sc 2 1, (41) where the last inequality comes from the condition |S| c1 64c2 κ ρ2(Ψ ) n log p. Now we analyze the following two cases regarding (41): Case 1 Suppose that c1 κ 4 b j 2 2 32c2ρ2 log p n β j,Sc 2 1, then from (39) we can get 4 b j 2 2 λn 3 p |S| b j 2 + 4 β j,Sc 1 . Solving for the zeros of this quadratic form in b j 2 yields b j 2 2 144λ2 n c2 1 κ2 |S| + 32λn β j,Sc 1 c1 κ . Case 2 Otherwise, we have c1 κ 4 b j 2 2 32c2ρ2 log p n β j,Sc 2 1. After combining the two cases, we obtain the claimed bound in (30). To get the prediction bound in (31), we first show b j 1 4 β j 1. From basic inequality, we have 2n b LX b j 2 2 ε j bΘX b j n + λn β j 1 ˆβj 1 . By H older s inequality and Theorem 15, with high probability, we have that ε j bΘX b j β j 1 + ˆβj 1 . Li, Madrid-Padilla, and Zhou Combine the two inequalities above, we get which implies ˆβj 1 3 β j 1. Consequently, we have b j 1 ˆβj 1 + β j 1 4 β j 1. Return to the basic inequality, we have b LX b j 2 2 2n λn 2 b j 1 + λn β j 1 β j + b j 1 2λn b j 1 6λn β j 1. C.4 Proof of Lemma 19 Proof The proof follows a similar approach as the proof for Lemma 1 in Ravikumar et al. (2011). C.5 Proof of Corollary 20 Proof A little calculation using Lemma 19 and Definition 18 shows that the corresponding inverse functions for data from the Gaussian DAG model (2) are: δf(p; r) = 40 p , and pf(δ; r) = 3200 log(r/4) Setting r = nτ yields the desired result. C.6 Proof of Lemma 22 Proof Let b S(t) ij denote the (i, j) entry of the sample variance matrix b S(t) defined in (10). Let Xi and X j denote the ith row and jth column of X, respectively. Let ε ik := Xik Xi β k where β k is the kth column of B , ρ k = 1/ω 2 k , ˆρk = 1/ˆω2 k, b (t) k := ˆβ(t) k β k, and ˆδk = ˆρk ρ k. Learning Gaussian DAGs from Network Data b S(1) ij = 1 k=1 ˆρk Xik Xi ˆβ(1) k Xjk Xj ˆβ(1) k k=1 ˆρk ε ik Xi b (1) k ε jk Xj b (1) k k=1 ρ k ε ik Xj b (1) k ε jk Xi b (1) k + Xi b (1) k Xj b (1) k k=1 ˆδk ε ik Xi b (1) k ε jk Xj b (1) k k=1 ˆρk ε ik Xj b (1) k ε jk Xi b (1) k + Xi b (1) k Xj b (1) k + 1 k=1 ˆδkε ikε jk. If we let Rij = 1 p Pp k=1 ˆρk ε ik Xj b (1) k ε jk Xi b (1) k + Xi b (1) k Xj b (1) k + 1 p Pp k=1 ˆδkε ikε jk, we can upper bound |Rij| by dividing it into three terms and controlling each term separately. Part 1. Define the following events: n ε i 6 ω p log max{n, p} o , ε k 2 6 ω p , log max{n, p} . Under event B1, B2, and B3, 1 p k=1 ˆρkε ik Xj b (1) k b sup k |ε ik Xj b (1) k | (By Lemma 7) log max{n, p} sup k Xj b (1) k 1 (By H older s Inequality and B1) c log p log max{n, p} b n Xj (From b (1) k 1 4 2s b (1) k 2) c log p log2 max{n, p} n (By event B3). b log max{n, p} where b > 0 is the same constant defined in Lemma 7. The second last inequality comes from b (1) 1 4 2s b (1) 2 in the proof of Theorem 17. 
Li, Madrid-Padilla, and Zhou Part 2 Notice that 1 p k=1 ˆρk Xi b (1) k Xj b (1) k b sup k |Xj b (1) k ||Xi b (1) k | c log2 p log2 max{n, p} b log max{n, p} cs2 log p Since s2 log p n 1 from the assumption, E.q. (42) dominates E.q. (43). Part 3 By definition, ˆδ = supk |ˆρk ρ k| = supk |1/ˆω2 k 1/ω 2 k | = r(bΩ). Combining with B2, we have 1 p k=1 ˆδkε ikε jk k=1 |ε ikε jk| 1 p δ ε i 2 ε j 2 6 ω ˆδ = 6 ωr(bΩ). Combine all three parts, we have 6 ωr(bΩ), 144 ω ψs c log p log2 max{n, p} Using Lemma 14 and Lemma 12, we can derive the upper bound for the probabilities of B1, B2, B3: P (B1) 2/ max{n, p}4, P (B2) 1/ max{n, p}4 if n > 2 10 log max{n, p}, P (B2) 2/ max{n, p}4, 5/ max{n, p}4. Applying union bound, 6 ωr(bΩ), 144 ω ψs c log p log2 max{n, p} with probability at least 1 5n2 max{n,p}4 . Take event A from Lemma 21 into account and apply union bound one more time, we arrive at the desired conclusion. Learning Gaussian DAGs from Network Data C.7 Proof of Theorem 23 Proof Let R(s, p, n) and δf(p; nτ) be defined as stated, then the monotonicity of the inverse tail function (32) and condition (36) on (n, p) implies that δf(p; nτ) 1/40. Lemma 21 and Lemma 22 imply that the event A defined in (34) and the events B1,B2,B3 defined in the proof of Lemma 22 hold with high probability. Assume A, B1,B2,B3 hold. Then, c W (1) δf(p; nτ) + R(s, p, n). Choose λp = δf(p; nτ) + R. By Lemma 22, condition (36), and the monotonicity: p > pf(δ, r) = δf(p, r) δ for δ > 0, we have that 2κΓ c W (1) + λp 4κΓ δf(p; nτ) + R min 1 3κΣ m, 1 3κ3 Σ 2κΓ m Applying Lemma 6 in Ravikumar et al. (2011) we obtain bΘ(1) Θ 4κΓ δf(p; nτ) + R , bΘ(1) Θ 2 A 2 bΘ(1) Θ (m + 1) bΘ(1) Θ . Appendix D. Proofs of Other Auxiliary Results This section includes the proofs for Lemma 6 and 7 as well as the four lemmas introduced in Section A.1. D.1 Proof of Lemma 6 Proof Notice that ˆω2 j ω 2 j ˆω2 j ε(B) j 2 2 N + ε(B) j 2 2 N ω 2 j ˆω2 j ε(B) j 2 2 N ε(B) j 2 2 N ω 2 j We will show separately that with the desired probability: ˆω2 j ε(B) j 2 2 N λNs β, (44) ε(B) j 2 2 N ω 2 j log 2 + log p Li, Madrid-Padilla, and Zhou From Lemma 1 in Yu and Bien (2019) we know that if λN N 1 X(B) ε(B) j , then ˆω2 j N 1 ε(B) j 2 2 2λN β j 1 λNs β. Pick λN = 12 ψ ω q 2 log 2+6 log p , then applying the tail bound for maxima of sub-Gaussian random variables (e.g., see proof of Lemma 13) we can show that λN sup j N 1 X(B) ε(B) j holds with probability at least (1 1 p)2. This proves the inequality in (44). To prove the second inequality, notice that, from χ2 concentration inequality (e.g. Wainwright 2019 Example 2.11), ω 2 j 1 N ε(B) j 2 2 log 2 + 3 log p N with probability at most 1/p3, N ε(B) j 2 2 log 2 + 3 log p N with probability at most 1/p2. Combining all the inequalities, we can show that the second inequality holds with probability at least (1 1/p)2 1/p. D.2 Proof of Lemma 7 Proof Simply notice that 1 ˆω2 j 1 ω 2 j = sup 1 j p 1 ˆω2 j ω 2 j ω 2 j ˆω2 j b4 sup 1 j p |ω 2 j ˆω2 j |. D.3 Proof of Lemma 11 Proof The claim follows since b prec 2 = bΘ Θ 2 = L + b chol L + b chol L L 2 = L b chol + b chol L + b chol b chol 2 max x Sn 1 x L b chol + b chol L + b chol b chol x max x Sn 1 x L b chol + b chol L x (as b chol b chol 0.) = L b chol + b chol L 2 2σmin (L ) b chol 2. Learning Gaussian DAGs from Network Data D.4 Proof of Lemma 12 Proof Notice that e Xk Rn is a sub-Gaussian random vector with variance smaller than ψ. By Theorem 1.19 in Rigollet (2015), we have that P e Xk 2 > 4 ψ n + 2 ψ p 2 log(1/δ) δ δ > 0. 
Setting δ = 1/pα and using union bound we obtain the desired conclusion. D.5 Proof of Lemma 13 Proof Lemma 12 implies that with probability at least 1 1/p, e Xk 2 4 ψ n + 2 ψ p 2 log p 6 ψ n, (46) for all k. Under the event E defined in (25), e X [j 1] εj /n corresponds to the absolute maximum of j 1 zero-mean Gaussian variables, each with variance at most 36 ψ2 ω2/n. Next, we calculate the probability of the event T E , where δ = 1/p2. We also let 2 log 2 + 4 log p λn = 12 ψ ω Because both e X and ε are random, we use the equivalence: p(y) = Ep(x) [p(y | x)] to apply the properties of fixed-design Lasso: Let X[j 1] denote the first j 1 columns in X, 1 P(Tj | E ) = EX[j 1]P n 2 e X [j 1] εj /n > λn | X[j 1], E o e X [j 1] εj /n > 6 ψ ω ! X[j 1], E where in the last inequality we apply the tail bound on the maxima of sub-Gaussian random variables (Duchi, 2017): If Xj sub Gaussian(σ2) are i.i.d., then P max 1 j n Xj p 2σ2 log n + t exp t2 1 P (T | E ) = 1 P p j=1Tj E = P p j=1T c j E 1 Finally, by Lemma 12 with α = 2 we get P(T ) P(E )P (T | E ) 1 1 Li, Madrid-Padilla, and Zhou D.6 Proof of Lemma 14 Proof By the sub-Gaussian maximal inequality (e.g., Theorem 1.14 in Rigollet 2015), we know that if X1, . . . XN are random variables such that Xi sub-Gaussian with parameter σ2, then for any t > 0, P max 1 i N |Xi| t 2N exp t2 Letting t = p 4 ψ2 log p and taking σ2 = ψ2, we arrive at the desired result. Genevera I Allen and Robert Tibshirani. Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics, 4(2):764, 2010. Bryon Aragam and Qing Zhou. Concave penalized estimation of sparse gaussian Bayesian networks. Journal of Machine Learning Research, 16:2273 2328, 2015. URL http: //jmlr.org/papers/v16/aragam15a.html. Bryon Aragam, Arash Amini, and Qing Zhou. Globally optimal score-based learning of directed acyclic graphs in high-dimensions. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a. Bryon Aragam, Jiaying Gu, and Qing Zhou. Learning large-scale Bayesian networks with the sparsebn package. Journal of Statistical Software, Articles, 91(11):1 38, 2019b. ISSN 1548-7660. doi: 10.18637/jss.v091.i11. URL https://www.jstatsoft.org/v091/i11. Mohsen Bayati, Murat A Erdogdu, and Andrea Montanari. Estimating lasso risk and noise level. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/ file/2b8a61594b1f4c4db0902a8a395ced93-Paper.pdf. A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791 806, 2011. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/23076172. Rohit Bhattacharya, Daniel Malinsky, and Ilya Shpitser. Causal inference under interference and network uncertainty. In Uncertainty in Artificial Intelligence, pages 1028 1038. PMLR, 2020. Peter Buhlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Publishing Company, Incorporated, 1st edition, 2011. ISBN 3642201911. Learning Gaussian DAGs from Network Data Xi Chen, Han Xu, Ping Yuan, Fang Fang, Mikael Huss, Vinsensius B Vega, Eleanor Wong, Yuriy L Orlov, Weiwei Zhang, Jianming Jiang, et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. 
Cell, 133(6): 1106 1117, 2008. David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507 554, March 2003. ISSN 1532-4435. doi: 10.1162/ 153244303321897717. URL https://doi.org/10.1162/153244303321897717. David Maxwell Chickering, David Heckerman, and Christopher Meek. Large-sample learning of Bayesian networks is np-hard. Journal of Machine Learning Research, 5(Oct):1287 1330, 2004. Li-Fang Chu, Ning Leng, Jue Zhang, Zhonggang Hou, Daniel Mamott, David T. Vereide, Jeea Choi, Christina Kendziorski, Ron Stewart, and James A. Thomson. Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biology, 17(1):173, 2016. doi: 10.1186/s13059-016-1033-x. URL https://doi.org/10.1186/s13059-016-1033-x. Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309 347, Oct 1992. ISSN 1573-0565. doi: 10.1007/BF00994110. URL https://doi.org/10.1007/BF00994110. John Duchi. Concentration inequalities and tail bounds. https://web.stanford.edu/ class/cs229t/2017/Lectures/concentration-slides.pdf, 2017. [Online; accessed 04-August-2023]. Pierre Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2):105 123, 1999. doi: 10.1080/00949659908811970. URL http://dx.doi.org/10.1080/00949659908811970. Gideon E. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6, 03 1978. doi: 10.1214/aos/1176344136. Jianqing Fan, Shaojun Guo, and Ning Hao. Variance estimation using refitted crossvalidation in ultrahigh dimensional regression. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 74(1):37 65, 2012. ISSN 13697412, 14679868. URL http://www.jstor.org/stable/41430928. Fei Fu and Qing Zhou. Learning sparse causal gaussian networks with experimental intervention: regularization and coordinate descent. Journal of the American Statistical Association, 108(501):288 300, 2013. Maxime Gasse, Alex Aussem, and Haytham Elghazel. A hybrid algorithm for Bayesian network structure learning with application to multi-label learning. Expert Syst. Appl., 41 (15):6755 6772, November 2014. ISSN 0957-4174. doi: 10.1016/j.eswa.2014.04.032. URL https://doi.org/10.1016/j.eswa.2014.04.032. Jiaying Gu, Fei Fu, and Qing Zhou. Penalized estimation of directed acyclic graphs from discrete data. Statistics and Computing, 29(1):161 176, 2019. Li, Madrid-Padilla, and Zhou Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015. ISBN 1498712169, 9781498712163. David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197 243, Sep 1995. ISSN 1573-0565. doi: 10.1023/A:1022623210503. URL https://doi.org/10.1023/A: 1022623210503. Michael G. Hudgens and M. Elizabeth Halloran. Toward causal inference with interference. Journal of the American Statistical Association, Jun 2008. doi: 10.1198/ 016214508000000292. Markus Kalisch, Martin M achler, Diego Colombo, Marloes H. Maathuis, and Peter B uhlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1 26, 2012. URL http://www.jstatsoft.org/v47/i11/. Suprateek Kundu and Benjamin B. Risk. 
Scalable Bayesian matrix normal graphical models for brain functional networks. Biometrics, n/a(n/a), 2020. doi: https://doi.org/10.1111/ biom.13319. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13319. Steffen L. Lauritzen. Graphical Models. Oxford University Press, 1996. ISBN 0-19-852219-3. Steffen L. Lauritzen and TS Richardson. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 64:321 348, 2002. ISSN 1369-7412. doi: 10.1111/1467-9868.00340. Wei Vivian Li and Jingyi Jessica Li. An accurate and robust imputation method scimpute for single-cell rna-seq data. Nature Communications, 9(1):997, 2018. doi: 10.1038/ s41467-018-03405-7. URL https://doi.org/10.1038/s41467-018-03405-7. Teppo Niinim aki, Pekka Parviainen, and Mikko Koivisto. Structure discovery in Bayesian networks by sampling partial orders. The Journal of Machine Learning Research, 17(1): 2002 2048, 2016. Tore Opsahl. Structure and Evolution of Weighted Networks. University of London (Queen Mary College), London, UK, 2009. URL http://toreopsahl.com/publications/ thesis/. Tore Opsahl and Pietro Panzarasa. Clustering in weighted networks. Social Networks, 31(2): 155 163, 2009. ISSN 0378-8733. doi: https://doi.org/10.1016/j.socnet.2009.02.002. URL http://www.sciencedirect.com/science/article/pii/S0378873309000070. Joseph Ramsey, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning highdimensional graphical causal models, with an application to functional magnetic resonance images. International Journal of Data Science and Analytics, 3(2):121 129, Mar 2017. ISSN 2364-4168. doi: 10.1007/s41060-016-0032-z. URL https://doi.org/10.1007/ s41060-016-0032-z. Learning Gaussian DAGs from Network Data Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5(none):935 980, 2011. doi: 10.1214/11-EJS631. URL https: //doi.org/10.1214/11-EJS631. Philippe Rigollet. High dimensional statistics lecture notes. https://ocw.mit.edu/ courses/mathematics/18-s997-high-dimensional-statistics-spring-2015/ lecture-notes/MIT18_S997S15_Course Notes.pdf, 2015. [Online; accessed 10December-2020]. Teemu Roos. Minimum Description Length Principle, pages 823 827. Springer US, Boston, MA, 2017. ISBN 978-1-4899-7687-1. doi: 10.1007/978-1-4899-7687-1 894. URL https: //doi.org/10.1007/978-1-4899-7687-1_894. Mauro Scanagatta, Giorgio Corani, Cassio P de Campos, and Marco Zaffalon. Learning treewidth-bounded Bayesian networks with thousands of variables. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips. cc/paper/2016/file/e2a2dcc36a08a345332c751b2f2e476c-Paper.pdf. Mark Schmidt, Alexandru Niculescu-Mizil, and Kevin Murphy. Learning graphical model structure using l1-regularization paths. In Proceedings of the 22Nd National Conference on Artificial Intelligence - Volume 2, AAAI 07, pages 1278 1283. AAAI Press, 2007. ISBN 978-1-57735-323-2. URL http://dl.acm.org/citation.cfm?id=1619797.1619850. Marco Scutari. Learning Bayesian networks with the bnlearn r package. Journal of Statistical Software, Articles, 35(3):1 22, 2010. ISSN 1548-7660. doi: 10.18637/jss.v035.i03. URL https://www.jstatsoft.org/v035/i03. 
Ilya Shpitser. Segregated graphs and marginals of chain graph models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/ 9ac403da7947a183884c18a67d3aa8de-Paper.pdf. Peter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper, and Thomas Richardson. Causation, Prediction, and Search. MIT press, 2000. Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879 898, 2012. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/41720740. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267 288, 1996. Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn., 65(1):31 78, October 2006. ISSN 0885-6125. doi: 10.1007/s10994-006-6889-7. URL https://doi.org/10.1007/ s10994-006-6889-7. Li, Madrid-Padilla, and Zhou P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475 494, Jun 2001. ISSN 1573-2878. doi: 10.1023/A:1017501703105. URL https://doi.org/10.1023/A: 1017501703105. Theodoros Tsiligkaridis, Alfred O Hero III, and Shuheng Zhou. On convergence of kronecker graphical lasso algorithms. IEEE transactions on signal processing, 61(7):1743 1755, 2013. Sara van de Geer and Peter B uhlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536 567, 2013. doi: 10.1214/13-AOS1085. URL https://doi.org/10.1214/13-AOS1085. Sara A. van de Geer and Peter B uhlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3(none):1360 1392, 2009. doi: 10.1214/ 09-EJS506. URL https://doi.org/10.1214/09-EJS506. Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019. doi: 10.1017/9781108627771. Duncan J Watts and Steven H Strogatz. Collective dynamics of small-world networks. nature, 393(6684):440, 1998. Qiaoling Ye, Arash Amini, and Qing Zhou. Optimizing regularized cholesky score for order-based learning of Bayesian networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43:3555 3572, 2021. doi: 10.1109/TPAMI.2020.2990820. Guo Yu and Jacob Bien. Estimating the error variance in a high-dimensional linear model. Biometrika, 106(3):533 546, 05 2019. ISSN 0006-3444. doi: 10.1093/biomet/asz017. URL https://doi.org/10.1093/biomet/asz017. Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Advances in Neural Information Processing Systems, 2018. Qing Zhou, Hiram Chipperfield, Douglas A. Melton, and Wing Hung Wong. A gene regulatory network in mouse embryonic stem cells. Proceedings of the National Academy of Sciences, 104(42):16438 16443, 2007. ISSN 0027-8424. doi: 10.1073/pnas.0701014104. URL https://www.pnas.org/content/104/42/16438. Shuheng Zhou. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532 562, 2014.