Multi-task Learning of Order-Consistent Causal Graphs

Xinshi Chen (Georgia Institute of Technology, xinshi.chen@gatech.edu); Haoran Sun (Georgia Institute of Technology, haoransun@gatech.edu); Caleb Ellington (Carnegie Mellon University, cellingt@cs.cmu.edu); Eric Xing (Carnegie Mellon University & MBZUAI, eric.xing@mbzuai.ac.ae); Le Song (BioMap & MBZUAI, le.song@mbzuai.ac.ae)

Abstract

We consider the problem of discovering K related Gaussian directed acyclic graphs (DAGs), where the involved graph structures share a consistent causal order and sparse unions of supports. Under the multi-task learning setting, we propose an $\ell_1/\ell_2$-regularized maximum likelihood estimator (MLE) for learning K linear structural equation models. We theoretically show that the joint estimator, by leveraging data across related tasks, can achieve a better sample complexity for recovering the causal order (or topological order) than separate estimations. Moreover, the joint estimator is able to recover non-identifiable DAGs by estimating them together with some identifiable DAGs. Lastly, our analysis also shows the consistency of union support recovery of the structures. To allow practical implementation, we design a continuous optimization problem whose optimizer is the same as the joint estimator and can be approximated efficiently by an iterative algorithm. We validate the theoretical analysis and the effectiveness of the joint estimator in experiments.

1 Introduction

Estimating causal effects among a set of random variables is of fundamental importance in many disciplines such as genomics, epidemiology, health care, and finance [1, 2, 3, 4, 5, 6]. Therefore, designing and understanding methods for causal discovery is of great interest in machine learning. Causal discovery from finite observable data is often formulated as a directed acyclic graph (DAG) estimation problem in graphical models. A major class of DAG estimation methods are score-based, which search over the space of all DAGs for the best-scoring one. However, DAG estimation remains a very challenging problem from both the computational and statistical perspectives [7]. On the one hand, the number of possible DAG structures grows super-exponentially in the number of random variables, whereas the observational sample size is normally small. On the other hand, some DAGs are non-identifiable from observational data even with infinitely many samples.

Fortunately, very often multiple related DAG structures need to be estimated from data, which allows us to leverage their similarity to improve the estimator. For instance, in bioinformatics, gene expression levels are often measured over patients with different subtypes [8, 9] or under various experimental conditions [10]. In neuroinformatics, fMRI signals are often recorded for multiple subjects for studying the brain connectivity network [11, 12]. In these scenarios, multiple datasets will be collected, and their associated DAGs are likely to share similar characteristics. Intuitively, it may be beneficial to estimate these DAGs jointly.

(Work done partially during the visit at MBZUAI, Mohamed bin Zayed University of Artificial Intelligence. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).)
In this paper, we focus on the analysis of the multi-task DAG estimation problem, where the DAG structures are related by a consistent causal order and a (partially) shared sparsity pattern, but are allowed to have different connection strengths, different edges, and differently distributed variables. In this setting, we propose a joint estimator for recovering multiple DAGs based on a group-norm regularization. We prove that the joint $\ell_1/\ell_2$-penalized maximum likelihood estimator (MLE) can recover the causal order better than individual estimators. Intuitively, it is not surprising that joint estimation is beneficial. However, our results provide a quantitative characterization of the improvement in sample complexity and the conditions under which such improvement can hold. We show that:

- For identifiable DAGs, if the shared sparsity pattern (union support) size s is of order O(1) in K, where K is the number of tasks (DAGs), then the effective sample size for order recovery will be nK, where n is the sample size in each problem. Furthermore, as long as s is of order $o(\sqrt{K})$ in K, the joint estimator with group-norm regularization leads to an improvement in sample complexity.
- A non-identifiable DAG cannot be distinguished by single-task estimators even with infinitely many observational data. However, non-identifiable DAGs can be recovered by our joint estimator if they are estimated together with other identifiable DAGs.

Apart from the theoretical guarantee, we design an efficient algorithm for approximating the joint estimator through a reformulation of the combinatorial search problem as a continuous program. This continuous formulation contains a novel design of a learnable masking matrix, which plays an important role in ensuring the acyclicity and shared order of the estimates in all tasks. An interesting aspect of our design is that we can learn the masking matrix by differentiable search over a continuous space, yet the optimum must be contained in a discrete space of cardinality p! (read: p factorial, where p is the number of random variables). We conduct a set of synthetic experiments to demonstrate the effectiveness of the algorithm and validate the theoretical results. Furthermore, we apply our algorithm to more realistic single-cell RNA-sequencing expression data generated by SERGIO [13] based on real gene regulatory networks.

The remainder of the paper is organized as follows. In Section 2, we introduce the linear structural equation model (SEM) interpretation of Gaussian DAGs and its properties. Section 3 is devoted to the statement of our main results, with some discussion of their consequences and implications. In Section 4, we present the efficient algorithm for approximating the joint estimator. Section 5 summarizes related theoretical and practical works. Experimental validations are provided in Section 6.

2 Background

A substantial body of work has focused on the linear SEM interpretation of Gaussian DAGs [14, 15, 16, 17]. Let $X = (X_1, \dots, X_p)$ be a p-dimensional random variable. Then a linear SEM reads
$$X = \tilde{G}_0^\top X + W, \qquad W \sim \mathcal{N}(0, \Omega_0), \qquad (1)$$
where $\Omega_0$ is a $p \times p$ positive diagonal matrix containing the variances of the noise $W$, and $\tilde{G}_0 \in \mathbb{R}^{p \times p}$ is the connection strength matrix, or adjacency matrix. Each nonzero entry $\tilde{G}_{0,ij}$ represents the direct causal effect of $X_i$ on $X_j$. This model implies that $X$ is Gaussian, $X \sim \mathcal{N}(0, \Sigma)$, where
$$\Sigma := (I - \tilde{G}_0)^{-\top}\, \Omega_0\, (I - \tilde{G}_0)^{-1}. \qquad (2)$$
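To make Eq. 1 and Eq. 2 concrete, the following is a minimal NumPy sketch (not the paper's code) that draws samples from a linear Gaussian SEM with a given connection strength matrix and computes the implied covariance; the function names and the row-wise sample layout are our own illustrative choices.

```python
import numpy as np

def sample_linear_sem(G, omega, n, seed=None):
    """Draw n i.i.d. samples from X = G^T X + W with W ~ N(0, diag(omega)) (Eq. 1)."""
    rng = np.random.default_rng(seed)
    p = G.shape[0]
    W = rng.normal(scale=np.sqrt(omega), size=(n, p))   # rows are noise samples
    # From X = G^T X + W it follows (row-wise) that X = W (I - G)^{-1}.
    return W @ np.linalg.inv(np.eye(p) - G)

def sem_covariance(G, omega):
    """Implied covariance Sigma = (I - G)^{-T} diag(omega) (I - G)^{-1} (Eq. 2)."""
    M = np.linalg.inv(np.eye(G.shape[0]) - G)
    return M.T @ np.diag(omega) @ M

# Example: a 3-node chain X1 -> X2 -> X3 with unit noise variances.
G = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])
X = sample_linear_sem(G, omega=np.ones(3), n=500, seed=0)
```

The empirical covariance of `X` then approaches `sem_covariance(G, np.ones(3))` as n grows, which is the identity used throughout the analysis.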
Causal order π0. The nonzero entries of $\tilde{G}_0$ define its causal order (also called topological order), which constrains the possible parents of each variable. A causal order can be represented by a permutation $\pi_0$ over $[p] := \{1, 2, \dots, p\}$. We say $\tilde{G}_0$ is consistent with $\pi_0$ if and only if
$$\tilde{G}_{0,ij} \neq 0 \;\Rightarrow\; \pi_0(i) < \pi_0(j). \qquad (3)$$
There can exist more than one permutation consistent with a DAG structure $\tilde{G}_0$, so we denote the set of permutations satisfying Eq. 3 by $\Pi_0$. Once a causal order $\pi_0$ is identified, the connection strengths $\tilde{G}_0$ can be estimated by ordinary least squares regression, which is a comparatively easier problem.

Identifiability and equivalence class. However, estimating the true causal order $\pi_0$ is very challenging, largely due to the existence of the equivalence class described below. Let $S_p$ be the set of all permutations over $[p]$. For every $\pi \in S_p$, there exists a connection strength matrix $\tilde{G}(\pi)$ that is consistent with $\pi$, and a diagonal matrix $\Omega(\pi) = \mathrm{diag}\big(\sigma_1(\pi)^2, \dots, \sigma_p(\pi)^2\big)$, such that the covariance $\Sigma$ in Eq. 2 equals
$$\Sigma = (I - \tilde{G}(\pi))^{-\top}\, \Omega(\pi)\, (I - \tilde{G}(\pi))^{-1}, \qquad (4)$$
and therefore the random variable $X$ in Eq. 1 can be equivalently described by the model [15]:
$$X = \tilde{G}(\pi)^\top X + W(\pi), \qquad W(\pi) \sim \mathcal{N}(0, \Omega(\pi)). \qquad (5)$$
Without further assumptions, the true underlying DAG, $\tilde{G}_0 = \tilde{G}(\pi_0)$, cannot be identified from the equivalence class
$$\{\tilde{G}(\pi) : \pi \in S_p\} \qquad (6)$$
based on the distribution of $X$, even if infinitely many samples are observed.

3 Joint estimation of multiple DAGs

In the multi-task setting, we consider K linear SEMs:
$$X^{(k)} = \tilde{G}_0^{(k)\top} X^{(k)} + W^{(k)}, \qquad k = 1, \dots, K.$$
The superscript $(k)$ indicates the k-th model. As mentioned in Section 2, each model defines a random variable $X^{(k)} \sim \mathcal{N}(0, \Sigma^{(k)})$ with $\Sigma^{(k)} = (I - \tilde{G}_0^{(k)})^{-\top}\, \Omega_0^{(k)}\, (I - \tilde{G}_0^{(k)})^{-1}$.

3.1 Assumptions

Similarity. In Condition 3.1 and Definition 3.1, we make a few assumptions about the similarity of these models. Condition 3.1 ensures that the causal orders of the K DAGs do not conflict. Definition 3.1 measures the similarity of the sparsity patterns of the K DAGs: if the sparsity patterns strongly overlap, then the size of the support union s will be small. We do not enforce a strict constraint on the support union, but we will see in the later theorem that a smaller s leads to better recovery performance.

Condition 3.1 (Consistent causal orders). There exists a nonempty set of permutations $\Pi_0 \subseteq S_p$ such that for every $\pi_0 \in \Pi_0$, it holds for all $k \in [K]$ that $\tilde{G}^{(k)}_{0,ij} \neq 0 \Rightarrow \pi_0(i) < \pi_0(j)$.

Definition 3.1 (Support union). Recall $\tilde{G}(\pi)$ in Eq. 6. The support union of the K DAGs associated with permutation $\pi$ is $S(\pi) := \{(i, j) : \exists k \in [K] \text{ s.t. } \tilde{G}^{(k)}_{ij}(\pi) \neq 0\}$. The support union of the K true DAGs $\tilde{G}^{(k)}_0$ is $S_0 := S(\pi_0)$. We further denote $s_0 := |S_0|$ and $s := \sup_{\pi \in S_p} |S(\pi)|$.

Identifiability. To ensure the consistency of the estimator based on least-squares loss, we first assume that the DAGs to be recovered are minimum-trace DAGs.

Condition 3.2 (Minimum trace). Recall the equivalence class defined in Eq. 4–Eq. 6. Assume for all $k = 1, \dots, K$ that $\mathrm{trace}(\Omega^{(k)}_0) = \min_{\pi \in S_p} \mathrm{trace}(\Omega^{(k)}(\pi))$.

However, the minimum-trace DAG may not be unique without further assumptions, making the true DAG indistinguishable from other minimum-trace DAGs. Therefore, we consider the equal-variance condition in Condition 3.3, which ensures the uniqueness of the minimum-trace DAG. In this paper, we assume the first $K' \le K$ models satisfy this condition, so they are identifiable. We do not make such an assumption on the remaining $K - K'$ models, so they may not be identifiable.
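As an illustration of Condition 3.1 and Definition 3.1 above, the sketch below checks the order consistency of Eq. 3 and computes the support union for a collection of adjacency matrices. It is a brute-force illustration under our own naming, not part of the estimator, and the enumeration of permutations is only feasible for very small p.

```python
import numpy as np
from itertools import permutations

def consistent_with(G, pos, tol=0.0):
    """Eq. 3: G is consistent with an order iff G[i, j] != 0 implies pos[i] < pos[j],
    where pos[i] is the position of node i under the permutation."""
    rows, cols = np.nonzero(np.abs(G) > tol)
    return all(pos[i] < pos[j] for i, j in zip(rows, cols))

def support_union(G_list, tol=0.0):
    """Definition 3.1: boolean matrix of S = {(i, j) : some task k has G^(k)[i, j] != 0}."""
    union = np.zeros(G_list[0].shape, dtype=bool)
    for G in G_list:
        union |= np.abs(G) > tol
    return union                     # s0 is union.sum()

def shared_orders(G_list):
    """Brute-force enumeration of Pi_0 (Condition 3.1); feasible only for small p."""
    p = G_list[0].shape[0]
    valid = []
    for perm in permutations(range(p)):   # perm lists the nodes in causal order
        pos = np.argsort(perm)            # pos[i] = position of node i
        if all(consistent_with(G, pos) for G in G_list):
            valid.append(perm)
    return valid
```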
Condition 3.3 (Equal variance). For all $k = 1, \dots, K'$ with $K' \le K$, the noise $W^{(k)} \sim \mathcal{N}(0, \Omega^{(k)}_0)$ has equal variances, with $\Omega^{(k)}_0 = \sigma^{(k)2}_0 I_p$.

3.2 $\ell_1/\ell_2$-penalized joint estimator

Denote by $X^{(k)}$ the sample matrix whose row vectors are n i.i.d. samples from $\mathcal{N}(0, \Sigma^{(k)})$. Based on the task-similarity assumptions, we propose the following $\ell_1/\ell_2$-penalized joint maximum likelihood estimator (MLE) for jointly estimating the connection strength matrices $\{\tilde{G}^{(k)}_0\}_{k=1}^K$:
$$\hat{\pi}, \{\hat{G}^{(k)}\} = \arg\min_{\pi \in S_p,\, \{G^{(k)} \in \mathcal{D}(\pi)\}} \;\sum_{k=1}^{K}\frac{1}{2n}\big\|X^{(k)} - X^{(k)} G^{(k)}(\pi)\big\|_F^2 + \lambda\,\big\|G^{(1:K)}(\pi)\big\|_{\ell_1/\ell_2}. \qquad (7)$$
Similar to the notation in Eq. 4, $G^{(k)}(\pi)$ indicates its consistency with $\pi$, and $\mathcal{D}(\pi)$ denotes the space of all DAGs that are consistent with $\pi$. It is notable that a single $\pi$ shared across all K tasks is optimized in Eq. 7, which respects Condition 3.1. The group norm over the set of K matrices is defined as $\|G^{(1:K)}(\pi)\|_{\ell_1/\ell_2} := \sum_{i=1}^{p}\sum_{j=1}^{p}\|G^{(1:K)}_{ij}\|_2$, where $G^{(1:K)}_{ij} := [G^{(1)}_{ij}(\pi), \dots, G^{(K)}_{ij}(\pi)]$. It penalizes the size of the union support in a soft way. When $K = 1$, this joint estimator reduces to the $\ell_1$-penalized maximum likelihood estimator.

Remark 3.1. The optimization in Eq. 7 is used for analysis only. A continuous program with the same optimizer will be discussed in Section 4 for practical implementation.

3.3 Main result: causal order recovery

We start with a few pieces of notation and definitions; the theorem statement then follows.

Definition 3.2. Let $\tilde{g}^{(k)}_j(\pi)$ denote the j-th column of $\tilde{G}^{(k)}(\pi)$. Let
$$d := \sup_{j \in [p],\, \pi \in S_p}\Big|\bigcup_{k \in [K]}\mathrm{supp}\big(\tilde{g}^{(k)}_j(\pi)\big)\Big|, \qquad g_{\max} := \sup_{\pi \in S_p,\, (i,j) \in S(\pi)} \big\|\tilde{G}^{(1:K)}_{ij}(\pi)\big\|_2 / \sqrt{K}.$$
In Definition 3.2, d is the maximal number of parents (in union) across the DAGs $\tilde{G}^{(k)}(\pi)$, which is also a measure of the sparsity level, and $g_{\max}$ is bounded by the maximal entry value of the matrices $\tilde{G}^{(k)}(\pi)$.

Condition 3.4 (Bounded spectrum). Assume for all $k = 1, \dots, K$ that the covariance matrix $\Sigma^{(k)}$ is positive definite, and that there exist constants $0 < \Lambda_{\min} \le \Lambda_{\max} < \infty$ such that for all $k = 1, \dots, K$: (a) all eigenvalues of $\Sigma^{(k)}$ are upper bounded by $\Lambda_{\max}$; (b) all eigenvalues of $\Sigma^{(k)}$ are lower bounded by $\Lambda_{\min}$.

Condition 3.5 (Omega-min). There exists a constant $\eta_w > 0$ such that for any permutation $\pi \notin \Pi_0$,
$$\sum_{k=1}^{K'}\sum_{j=1}^{p}\big(\sigma^{(k)}_j(\pi)^2 - \sigma^{(k)2}_0\big)^2 > \frac{1}{\eta_w}.$$
Condition 3.5 with $K = K' = 1$ is called the omega-min condition in [14], so we follow this terminology. In some sense, when $\eta_w$ is larger, $\sigma_j(\pi)$ with $\pi \notin \Pi_0$ is allowed to deviate less from the true variance $\sigma_j(\pi_0) = \sigma_0$ with $\pi_0 \in \Pi_0$, which makes it more difficult to separate the set $\Pi_0$ from its complement in a finite-sample scenario. Ideally, we should allow $\eta_w$ to be large, so that the recovery is not restricted to easy problems.

Now we are ready to present the recovery guarantee for the causal order. Theorem 3.1 is a specific statement for the case when the regularization parameter $\lambda$ follows the classic choice in (group) Lasso problems. A more general statement allowing other choices of $\lambda$ is given in Appendix B along with the proof.

Theorem 3.1 (Causal order recovery). Suppose we solve the joint optimization in Eq. 7 with the regularization parameter $\lambda$ specified as this classic choice, for a set of K problems that satisfy Conditions 3.1, 3.2, 3.3, 3.4 (a), and 3.5.
If the following conditions are satisfied:
$$\theta(n, K, K', p, s) := \frac{1}{s}\sqrt{\frac{n\,K'^2}{K\, p \log p}} > \kappa_1 \eta_w, \qquad (8)$$
$$n \ge c_1 \log K + c_2 (d+1)\log p, \qquad (9)$$
$$K \le \kappa_2\, p \log p, \qquad (10)$$
then the following statements hold true:

(a) With probability at least $1 - c_3\exp(-\kappa_4(d+1)\log p) - \exp(-c_4 n)$, it holds that $\hat{\pi} \in \Pi_0$.

(b) If in addition n satisfies $n \ge \kappa_5\,\hat{d}\,(\log K + \log p)$ with $\hat{d} := \max_{j \in [p],\, k \in [K]}\|\hat{g}^{(k)}_j - \tilde{g}^{(k)}_j\|_0$, and Condition 3.4 (b) holds, then with probability at least $1 - c_3\exp(-\kappa_4(d+1)\log p) - \exp(-c_4 n) - \exp(-\log p \log K)$,
$$\frac{1}{K}\sum_{k=1}^{K}\big\|\hat{G}^{(k)} - \tilde{G}^{(k)}_0\big\|_F^2 = O\big(s_0\sqrt{\cdots}\big).$$
In this statement, $c_1, c_2, c_3, c_4$ are universal constants (i.e., independent of $n, p, K, s, \Sigma^{(k)}$); $\kappa_1, \kappa_2, \kappa_3, \kappa_4$ are constants depending on $g_{\max}$ and $\Lambda_{\max}$; and $\kappa_5$ is a constant depending on $\Lambda_{\min}$.

Discussion on Eq. 8–Eq. 10. (i) In Eq. 8, Theorem 3.1 identifies a sample complexity parameter $\theta(n, K, K', p, s)$. Following the terminology in [18], our use of the term sample complexity for $\theta$ reflects the dominant role it plays in our analysis as the rate at which the sample size must grow in order to obtain a consistent causal order. More precisely, for scalings $(n, K, K', p, s)$ such that $\theta(n, K, K', p, s)$ exceeds a fixed critical threshold $\kappa_1\eta_w$, we show that the causal order can be correctly recovered with high probability. (ii) The additional condition on the sample size n in Eq. 9 is the sample requirement for an ordinary linear regression problem. It is in general much weaker than Eq. 8, unless K grows very large. (iii) The last condition, Eq. 10 on K, could be relaxed if a tighter analysis of the distributional properties of the chi-squared distribution were available. However, it is notable that this restriction on the size of K is already weaker than in many related works on multi-task $\ell_1$ sparse recovery, which either implicitly treat K as a constant [18, 19] or assume $K = o(\log p)$ [20, 21].

Recovering non-identifiable DAGs. A direct consequence of Theorem 3.1 is that as long as the number of identifiable DAGs $K'$ is non-zero, the joint estimator can recover the causal order of the non-identifiable DAGs with high probability. This is not achievable by separate estimation, even with infinitely many samples. We thereby show how the information from identifiable DAGs helps recover the non-identifiable ones.

Effective sample size. As indicated by $\theta(n, K, K', p, s)$ in Eq. 8, the effective sample size for recovering the correct causal order is $nK'^2/K$ if the support union size s is of order O(1) in K. To show the improvement in sample complexity, it is fairer to consider the scenario where all DAGs are identifiable, i.e., $K' = K$. In this case, it is clear that the parameter $\theta(n, K, K, p, s)$ indicates a lower sample complexity relative to separate estimation as long as s is of order $o(\sqrt{K})$.

Separate estimation. Consider the special case of single-task estimation with $K = K' = 1$, in which the joint estimator reduces to the $\ell_1$-penalized MLE. We discuss how our result recovers previously known results for single-DAG recovery. Unfortunately, existing analyses were conducted under different frameworks with different conditions. [14] and [22] are the most comparable, since they were based on the same omega-min condition (i.e., Condition 3.5), but they chose a smaller regularization parameter $\lambda$. In our proof, the sample complexity parameter is derived as $\theta(n, K, K', p, s) = \sqrt{K'}/(s\lambda\sqrt{K})$, and Eq. 8 reads $\sqrt{K'}/(s\lambda\sqrt{K}) > \kappa_1\eta_w$. When $K = K' = 1$, this condition matches the one identified in [14] and [22] for recovering the order of a single DAG.

Error of $\hat{G}^{(k)}$.
To compare the estimate $\hat{G}^{(k)}$ to the true DAG $\tilde{G}^{(k)}_0$, Theorem 3.1 (b) says that the averaged error in Frobenius norm goes to zero as nK grows, and it decreases in K as long as $s = o(K)$.

To summarize, Theorem 3.1 establishes the causal order consistency of the joint estimator in Eq. 7. Order recovery is the most challenging component of DAG estimation: after $\hat{\pi} \in \Pi_0$ has been identified, the DAG estimation becomes p linear regression problems that can be solved separately. Theorem 3.1 (b) only shows the estimation error of the connection matrices in Frobenius norm. To characterize the structural error, additional conditions are required.

3.4 Support union recovery

Theorem 3.1 has shown that $\hat{\pi} = \pi_0 \in \Pi_0$ holds with high probability. Consequently, the support recovery analysis in this section is conditioned on this event. In fact, given the true order $\pi_0$, what remains is a set of p separate $\ell_1/\ell_2$-penalized group Lasso problems, in each of which the order $\pi_0$ plays the role of constraining the support set to $S_j(\pi_0) := \{i : \pi_0(i) < \pi_0(j)\}$. However, we need to solve p such problems simultaneously where p is large; a careful analysis is required, and directly combining existing results would not give a high recovery probability. In the following, we impose a set of conditions and definitions, which are standard in many $\ell_1$ sparse recovery analyses [18, 19], after which the theorem statement follows. See Appendix C for the proof.

Definition 3.3. The union support of the j-th columns of $\{\tilde{G}^{(k)}_0\}_{k \in [K]}$ is denoted by $RS_j := \{i \in [p] : \exists k \in [K] \text{ s.t. } \tilde{G}^{(k)}_{0,ij} \neq 0\}$. The maximal cardinality is $r_{\max} := \sup_{j \in [p]}|RS_j|$.

Definition 3.4. $\rho_u := \sup_{k \in [K],\, j \in [p],\, S = RS_j}\max_{i \in S^c}\big(\Sigma^{(k)}_{S^cS^c|S}\big)_{ii}$ is the maximal diagonal entry of the conditional covariance matrices, where $\Sigma^{(k)}_{S^cS^c|S} := \Sigma^{(k)}_{S^cS^c} - \Sigma^{(k)}_{S^cS}\big(\Sigma^{(k)}_{SS}\big)^{-1}\Sigma^{(k)}_{SS^c}$.

Definition 3.5. $g_{\min} := \inf_{(i,j) \in S_0}\big\|\tilde{G}^{(1:K)}_{0,ij}\big\|_2/\sqrt{K}$ represents the signal strength.

Condition 3.6 (Irrepresentable condition). There exists a fixed parameter $\gamma \in (0, 1]$ such that $\sup_{j \in [p],\, S = RS_j}\|A(S)\|_\infty \le 1 - \gamma$, where $A(S)_{ij} := \sup_{k \in [K]}\big|\big[\Sigma^{(k)}_{S^cS}(\Sigma^{(k)}_{SS})^{-1}\big]_{ij}\big|$.

Theorem 3.2 (Union support recovery). Assume we are on the subset of the probability space where $\{\hat{\pi} \in \Pi_0\}$ holds, and assume Condition 3.6. Suppose the following conditions are satisfied:
$$n \ge \kappa_6\, r_{\max}\log p, \qquad (11)$$
$$K \le c_7\log p, \qquad (12)$$
$$\sqrt{\frac{8\Lambda_{\max}\log p}{\Lambda_{\min}\, n}} + \frac{2}{\Lambda_{\min}\sqrt{K}} = o(g_{\min}), \qquad (13)$$
where $\kappa_6$ is a constant depending on $\gamma, \Lambda_{\min}, \rho_u, \sigma_{\max}$, and $c_7$ is a universal constant. Then, with probability at least $1 - r_{\max}\exp(-c_5 K\log p) - \exp(-c_6\log p) - c_8 K\exp(-c_9(n - r_{\max}\log p))$, the support union of $\hat{G}^{(1:K)}$ is the same as that of $\tilde{G}^{(1:K)}_0$, and moreover $\|\hat{G}^{(1:K)} - \tilde{G}^{(1:K)}_0\|_{\ell_\infty/\ell_2}/\sqrt{K} = o(g_{\min})$.

Discussion on Eq. 11 and Eq. 12. (i) Eq. 11 poses a sample size condition. The value $r_{\max}$ is the sparsity overlap defined in Definition 3.3. It takes values in the interval $[d, \min\{s_0, p, dK\}]$, depending on the similarity of the sparsity patterns. (ii) The restriction on K in Eq. 12 plays a similar role to Eq. 10 in Theorem 3.1. It is a stronger restriction, but it also guarantees the stronger result of support recovery. Existing analyses of the $\ell_1/\ell_2$-penalized group Lasso were not able to relax this constraint either, and some of them treat K as a constant in the analysis [18, 19]. Recall that in Theorem 3.1 we were able to allow $K = O(p\log p)$. Technically, this was achieved because in the proof of Theorem 3.1 we avoid analyzing the general recovery of the group Lasso and only use its null-consistency (i.e., the special case of true structures having zero support), for which a tighter bound can be derived and which is sufficient for order recovery.
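For concreteness, the sketch below computes the quantities $r_{\max}$, $\rho_u$, and $g_{\min}$ of Definitions 3.3–3.5 from the true DAGs and covariances. It is an illustrative implementation under our own naming, following the conditional-covariance formula of Definition 3.4 as reconstructed above; it is not part of the estimator.

```python
import numpy as np

def support_quantities(G0_list, Sigma_list, tol=0.0):
    """Compute r_max, rho_u, g_min (Definitions 3.3-3.5) from true DAGs and covariances."""
    K = len(G0_list)
    p = G0_list[0].shape[0]
    stack = np.stack(G0_list)                      # shape (K, p, p)
    union = (np.abs(stack) > tol).any(axis=0)      # support union S_0

    # r_max: largest column-wise union support RS_j (Definition 3.3)
    r_max = int(union.sum(axis=0).max())

    # g_min: weakest group signal over S_0 (Definition 3.5)
    group_norms = np.linalg.norm(stack, axis=0)    # ||G0_{ij}^{(1:K)}||_2 for every (i, j)
    g_min = group_norms[union].min() / np.sqrt(K)

    # rho_u: max diagonal entry of the conditional covariances (Definition 3.4)
    rho_u = 0.0
    for j in range(p):
        S = np.where(union[:, j])[0]
        Sc = np.where(~union[:, j])[0]
        if len(Sc) == 0:
            continue
        for Sig in Sigma_list:
            if len(S) == 0:
                cond = Sig[np.ix_(Sc, Sc)]
            else:
                ScS = Sig[np.ix_(Sc, S)]
                cond = Sig[np.ix_(Sc, Sc)] - ScS @ np.linalg.solve(Sig[np.ix_(S, S)], ScS.T)
            rho_u = max(rho_u, float(cond.diagonal().max()))
    return r_max, rho_u, g_min
```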
Benefit of joint estimation. Eq. 13 plays a similar role to Eq. 8 in Theorem 3.1: it specifies a rate at which the sample size must grow for successful union support recovery. As long as $r_{\max}$ is of order $o(\sqrt{K})$, K effectively reduces the second term in Eq. 13. Apart from that, the recovery probability specified in Theorem 3.2 grows with K.

4 Algorithm

Solving the optimization in Eq. 7 by searching over all permutations $\pi \in S_p$ is intractable due to the large combinatorial space. Inspired by the smooth characterization of acyclic graphs [16], we propose a continuous optimization problem whose optimizer is the same as the estimator in Eq. 7. Furthermore, we design an efficient iterative algorithm to approximate the solution.

4.1 Continuous program

We convert Eq. 7 to the following constrained continuous program:
$$\min_{T,\, G^{(1)},\dots,G^{(K)} \in \mathbb{R}^{p\times p}} \;\sum_{k=1}^{K}\big\|X^{(k)} - X^{(k)}\bar{G}^{(k)}\big\|_F^2 + \lambda\big\|\bar{G}^{(1:K)}\big\|_{\ell_1/\ell_2} + \rho\,\big\|1_{p\times p} - T\big\|_F^2 \qquad (14)$$
$$\text{subject to} \quad h(T) := \mathrm{trace}\big(e^{T\circ T}\big) - p = 0, \qquad (15)$$
where $\bar{G}^{(k)} := G^{(k)} \circ T$ is the element-wise multiplication of $G^{(k)}$ and $T$, and $1_{p\times p}$ is a $p\times p$ matrix with all entries equal to one. Eq. 15 is the smooth DAG-ness constraint proposed by NOTEARS [16], which ensures that T is acyclic. One can also use $h(T) := \mathrm{trace}\big((I + T\circ T)^p\big) - p$, proposed in [23].

We would like to highlight the novel and interesting design of the matrix T in Eq. 14. What makes Eq. 7 difficult to solve is the requirement that the $\{G^{(k)}\}$ must be DAGs and share the same order. A straightforward idea is to apply the smooth acyclicity constraint to every $G^{(k)}$, but it is then unclear how to enforce a consistent topological order. Our formulation achieves this with a single matrix T.

Algorithm 1: Joint Estimation Algorithm
    Hyperparameters: ρ, α, λ, t, δ
    Initialize G^(1:K), T randomly
    for itr = 1, …, M do
        for itr' = 1, …, M' do
            [G^(1:K), T] ← GradOptStep(f; G^(1:K), T, β)                                  ▷ gradient-based update on f
            ∀ i, j ∈ [p]:  G^(1:K)_ij ← (G^(1:K)_ij / ‖G^(1:K)_ij‖₂) · max{0, ‖G^(1:K)_ij‖₂ − tλ|T_ij|}   ▷ proximal step
            ∀ i, j ∈ [p]:  T_ij ← sign(T_ij) · max{0, |T_ij| − tλ‖G^(1:K)_ij‖₂}            ▷ proximal step
        β ← β + τ h(T)                                                                     ▷ dual ascent
        α ← α (1 + δ)                                                                      ▷ typical rule [23]

[Illustration: two example binary masks T, one for π = (1, 2, 3, 4) and one for π = (4, 2, 1, 3).]

To better understand the design rationale of T, recall from Eq. 3 that a matrix G is a DAG of order π if and only if its support set is contained in $\{(i, j) : \pi(i) < \pi(j)\}$. The matrix T plays the role of restricting the support set of each $G^{(k)}$ by masking its entries. Two examples are shown above. However, unlike the learning of masks in other papers, which allows T to have any combination of nonzero entries, here we need T to exactly represent the support set of some permutation π. That is, T comes from a space $\mathcal{T}_p$ with p! elements:
$$\mathcal{T}_p := \big\{T \in \{0, 1\}^{p\times p} : \exists\,\pi \in S_p \text{ s.t. } T_{ij} = 1 \Leftrightarrow \pi(i) < \pi(j)\big\}.$$
Now a key question arises: how can we perform a continuous and differentiable search over $\mathcal{T}_p$? The following observation motivates our design:
$$T \in \mathcal{T}_p \iff T \in \arg\min_{T \in \mathbb{R}^{p\times p}}\big\|1_{p\times p} - T\big\|_F^2 \quad \text{subject to } h(T) = 0.$$
In other words, T is a continuous projection of $1_{p\times p}$ onto the space of DAGs. We can therefore optimize the mask T over the continuous space $\mathbb{R}^{p\times p}$, yet the optimal solution must be an element of the discrete space $\mathcal{T}_p$. This observation also naturally leads to the design of Eq. 14. Finally, we want to emphasize that it is important for the optimal T to have binary entries. Without this property, any nonzero value c could rescale the (G, T) pair to give an equivalent masked DAG, i.e., $G \circ T = (cG)\circ(\tfrac{1}{c}T)$. This scaling equivalence would make the optimization hard to solve in practice. Proofs of the above arguments and of the equivalence between Eq. 7 and Eq. 14 are given in Appendix F.
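The following is a minimal NumPy/SciPy sketch of one outer iteration in the spirit of Algorithm 1; the smooth term f and the dual/penalty updates follow the augmented-Lagrangian form spelled out in Section 4.2 below. The hyperparameter names follow the text, while the single plain gradient step on f and the SciPy matrix exponential are our own implementation choices, not the paper's code.

```python
import numpy as np
from scipy.linalg import expm

def h(T):
    """Smooth DAG-ness penalty of NOTEARS [16]: h(T) = trace(exp(T o T)) - p (Eq. 15)."""
    return np.trace(expm(T * T)) - T.shape[0]

def grad_h(T):
    """Gradient of h(T): 2 * exp(T o T)^T o T."""
    return 2.0 * expm(T * T).T * T

def group_soft_threshold(G, thresh):
    """Closed-form prox of the l1/l2 norm: group-wise soft-threshold over the task axis.
    G has shape (K, p, p); thresh has shape (p, p)."""
    norms = np.linalg.norm(G, axis=0)
    scale = np.maximum(0.0, norms - thresh) / np.maximum(norms, 1e-12)
    return G * scale

def one_iteration(X_list, G, T, beta, alpha, rho, lam, t, tau, delta):
    """One primal-dual step in the spirit of Algorithm 1 (single gradient step on f,
    then the two proximal steps, dual ascent, and penalty update)."""
    grad_G = np.zeros_like(G)
    grad_T = 2.0 * rho * (T - 1.0) + (beta + 2.0 * alpha * h(T)) * grad_h(T)
    for k, X in enumerate(X_list):
        R = X @ (G[k] * T) - X                         # residual X^(k) Gbar^(k) - X^(k)
        grad_G[k] = 2.0 * (X.T @ R) * T
        grad_T += 2.0 * (X.T @ R) * G[k]
    G, T = G - t * grad_G, T - t * grad_T              # gradient step on the smooth term f
    G = group_soft_threshold(G, t * lam * np.abs(T))   # proximal step for G^(1:K)
    T = np.sign(T) * np.maximum(0.0, np.abs(T) - t * lam * np.linalg.norm(G, axis=0))  # prox for T
    beta = beta + tau * h(T)                           # dual ascent
    alpha = alpha * (1.0 + delta)                      # penalty update rule as in [23]
    return G, T, beta, alpha
```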
4.2 Iterative algorithm

We derive an efficient iterative algorithm using the Lagrangian method with a quadratic penalty, which converts Eq. 14 to an unconstrained problem:
$$\min_{T,\, G^{(1)},\dots,G^{(K)} \in \mathbb{R}^{p\times p}}\;\max_{\beta \ge 0}\; L(G^{(1:K)}, T; \beta) := f(G^{(1:K)}, T; \beta) + \lambda\big\|\bar{G}^{(1:K)}\big\|_{\ell_1/\ell_2},$$
$$\text{where} \quad f(G^{(1:K)}, T; \beta) := \sum_{k=1}^{K}\big\|X^{(k)} - X^{(k)}\bar{G}^{(k)}\big\|_F^2 + \rho\big\|1_{p\times p} - T\big\|_F^2 + \beta h(T) + \alpha h(T)^2,$$
$\beta$ is the dual variable, $\alpha$ is the coefficient of the quadratic penalty, and f is the smooth term in the objective. We solve this min-max problem by alternating primal updates on $(G^{(1:K)}, T)$ and dual updates on $\beta$. Due to the non-smoothness of the group norm, the primal update is based on the proximal-gradient method, where the proximal operator with respect to the $\ell_1/\ell_2$ norm has a closed form:
$$\Big[\arg\min_{Z^{(1:K)} \in \mathbb{R}^{K\times p\times p}}\frac{1}{2}\sum_{k=1}^{K}\big\|Z^{(k)} - X^{(k)}\big\|_F^2 + c\,\big\|Z^{(1:K)}\big\|_{\ell_1/\ell_2}\Big]^{(1:K)}_{ij} = \frac{X^{(1:K)}_{ij}}{\|X^{(1:K)}_{ij}\|_2}\max\big\{0, \|X^{(1:K)}_{ij}\|_2 - c\big\},$$
which is a group-wise soft-threshold. Since $G^{(k)}$ and T are multiplied element-wise inside the group norm, the proximal operator is applied to each of them separately. Together with the dual update $\beta \leftarrow \beta + \tau h(T)$, where $\tau$ is the step size, the overall procedure for solving Eq. 14 is summarized in Algorithm 1.

5 Related work

Single DAG estimation. Unlike the large body of research on undirected graphical models [24, 25, 26, 27], statistical guarantees for score-based DAG estimators have become available only in recent years. [14, 15, 17] have shown DAG estimation consistency in high dimensions, but they do not consider joint estimation. Nevertheless, some techniques in [14, 15] are useful for our derivations.

Multi-task learning. (i) Undirected graph estimation. There have been extensive studies on the joint estimation of multiple undirected Gaussian graphical models [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]. (ii) DAG estimation. In contrast, not much theoretical work has been done on joint DAG learning. A few recent works address certain statistical aspects of multi-task DAG estimation [9, 20, 21], but [20, 21] tackle the fewer-task regime with $K = o(\log p)$, and [9] assumes the causal order is given. Another related work that we noticed after the paper submission is [40], which also assumes the DAGs have a consistent causal order, but it focuses on estimating the difference between two DAGs. (iii) Linear regression. Multi-task linear regression is also a related topic [41, 42, 43, 44, 45, 46, 47], because the techniques for analyzing the group Lasso are used in our analysis [18, 19].

Practical algorithms. Practical algorithm design for efficiently solving the score-based optimization has been actively studied [48, 49, 50, 51]. Our algorithm is most related to recent methods exploiting a smooth characterization of acyclicity, including NOTEARS [16] and several subsequent works [23, 52, 53, 54], but they only apply to single-task DAG estimation. Although algorithms for the multi-task counterpart were proposed a decade ago [11, 55, 56, 57], none of them leverage recent advances in characterizing DAGs or provide theoretical guarantees.

6 Experiments

6.1 Synthetic data

This set of experiments is designed to reveal the effective sample size predicted by Theorem 3.1 and to demonstrate the effectiveness of the proposed algorithm. In the simulations, we randomly sample a causal order π and a union support set S0. Then we randomly generate multiple DAGs that follow the order π and have edges contained in the set S0. For each DAG, we construct a linear SEM with standard Gaussian noise and sample n data points from it.
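A sketch of this data-generating process is shown below; the edge-retention probability and the Uniform edge-weight distribution are our own illustrative choices, since the text does not specify them here.

```python
import numpy as np

def make_related_tasks(p, s0, K, n, keep_prob=0.7, seed=None):
    """Draw a random order and a union support S0 of size s0, then K DAGs whose edges
    lie in S0 and respect the order, each paired with n samples from a linear SEM
    with standard Gaussian noise."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(p)                        # order[m] = m-th node in the causal order
    pos = np.argsort(order)                           # pos[i] = position of node i
    candidates = [(i, j) for i in range(p) for j in range(p) if pos[i] < pos[j]]
    picks = rng.choice(len(candidates), size=s0, replace=False)
    S0 = [candidates[m] for m in picks]

    Gs, Xs = [], []
    for _ in range(K):
        G = np.zeros((p, p))
        for (i, j) in S0:
            if rng.random() < keep_prob:              # each task keeps a random subset of S0
                G[i, j] = rng.uniform(0.5, 1.5) * rng.choice([-1.0, 1.0])
        W = rng.normal(size=(n, p))                   # standard Gaussian noise
        Xs.append(W @ np.linalg.inv(np.eye(p) - G))   # X = G^T X + W  =>  X = W (I - G)^{-1}
        Gs.append(G)
    return Gs, Xs, order, S0
```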
On tasks with different combinations of (p, n, s, K), we examine the behavior of the joint estimator, estimated by Algorithm 1, over 64 independent runs, and report the resulting statistics below for evaluation. In this experiment we take $K' = K$, so that all the DAGs are identifiable; this simpler case makes it easier to verify the proposed algorithm and the rates in the theorem.

[Figure 1: Success probability of order recovery versus log θ, with curves for p = 32, 64, 128, 256.]

Success probability for order recovery. For each fixed tuple (p, n, s, K), we measure the sample complexity in terms of the parameter θ specified by Theorem 3.1. Fig. 1 plots the success probability $\Pr[\hat{\pi} \in \Pi_0]$ versus $\theta = \frac{1}{s}\sqrt{nK/(p\log p)}$ on a logarithmic scale. Theorem 3.1 predicts that the success probability should transition to 1 once θ exceeds a critical threshold. The curves in Fig. 1 indeed show sharp, step-function-like transitions; the sharpness is moderated by the logarithmic scale of the x-axis. Moreover, by scaling the sample size n using θ, the curves align well, as predicted by the theory, and share a similar transition point, even though they correspond to very different model dimensions p.

Fig. 2 shows the success probability in the form of heat maps, where the rows indicate an increase in the per-task sample size n, and the columns indicate an increase in the number of tasks K. The results show that increases in these two quantities have a similar effect on the success probability.

[Figure 2: Heat maps of success probability over the number of tasks K ∈ {1, 2, 4, 8, 16, 32} and the number of samples per task n ∈ {10, 20, 40, 80, 160, 320}, for four settings. Darker colors indicate lower success probability, and lighter colors higher.]

[Figure 3: False discovery rate (FDR, top row) and true positive rate (TPR, bottom row) of the recovered edges versus the number of samples n ∈ {2^4, …, 2^8}, for K ∈ {1, 2, 4, 8, 16, 32}.]

Effectiveness of K in structure recovery. In this experiment, we aim to verify the effectiveness of the joint estimation algorithm for recovering the structures. For brevity, we only report the false discovery rate (FDR) and true positive rate (TPR) of the edges in Fig. 3; figures for additional metrics can be found in Appendix G. In Fig. 3, we observe consistent improvements when increasing the number of tasks K. When the per-task sample size is small, the improvement is more pronounced.
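For reference, the sketch below computes the edge-level FDR, TPR, FPR, and SHD used in Fig. 3 and the tables. The exact counting conventions (e.g., how reversed edges enter SHD) are our assumptions, as the text does not spell them out here.

```python
import numpy as np

def edge_metrics(G_hat, G_true, tol=1e-8):
    """Edge-level FDR, TPR, FPR, and SHD of an estimated DAG against the truth."""
    p = G_true.shape[0]
    E_hat = np.abs(G_hat) > tol
    E_true = np.abs(G_true) > tol
    tp = np.sum(E_hat & E_true)
    fp = np.sum(E_hat & ~E_true)
    fn = np.sum(~E_hat & E_true)
    tn = np.sum(~E_hat & ~E_true) - p          # exclude the diagonal (no self-loops)
    fdr = fp / max(tp + fp, 1)
    tpr = tp / max(tp + fn, 1)
    fpr = fp / max(fp + tn, 1)
    # SHD: skeleton mismatches plus wrongly oriented edges (one common convention)
    skel_hat, skel_true = E_hat | E_hat.T, E_true | E_true.T
    reversed_edges = np.sum(E_hat & E_true.T & ~E_true)
    shd = np.sum(skel_hat != skel_true) // 2 + reversed_edges
    return fdr, tpr, fpr, shd
```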
Comparison with other joint estimators. In this experiment, we compare our method, MultiDAG, with JointGES [21] on models with p = 32 and p = 64. The results in Table 1 show that the two algorithms have similar performance when K = 1. However, as K increases, our method returns consistently better structures in terms of SHD.

Table 1: Comparison of MultiDAG (MD) and JointGES (JG) in SHD.

|           | p=32, n=10 | p=32, n=20 | p=32, n=80 | p=32, n=320 | p=64, n=10 | p=64, n=20 | p=64, n=80 | p=64, n=320 |
|-----------|------------|------------|------------|-------------|------------|------------|------------|-------------|
| MD (k=1)  | 39 ± 5 | 25 ± 5 | 10 ± 4 | 6 ± 3 | 104 ± 7 | 77 ± 8 | 29 ± 6 | 19 ± 8 |
| MD (k=2)  | 37 ± 5 | 22 ± 5 | 8 ± 3  | 4 ± 3 | 103 ± 7 | 68 ± 8 | 22 ± 6 | 13 ± 6 |
| MD (k=8)  | 29 ± 5 | 13 ± 3 | 4 ± 2  | 2 ± 1 | 85 ± 7  | 42 ± 6 | 13 ± 3 | 6 ± 3  |
| MD (k=32) | 23 ± 4 | 11 ± 3 | 3 ± 2  | 1 ± 1 | 66 ± 5  | 35 ± 5 | 10 ± 3 | 3 ± 2  |
| JG (k=1)  | 31 ± 5 | 19 ± 5 | 8 ± 4  | 6 ± 5 | 100 ± 11 | 53 ± 11 | 18 ± 11 | 18 ± 9 |
| JG (k=2)  | 32 ± 4 | 19 ± 5 | 9 ± 5  | 7 ± 5 | 99 ± 10 | 51 ± 10 | 20 ± 9 | 21 ± 10 |
| JG (k=8)  | 30 ± 5 | 19 ± 5 | 12 ± 4 | 10 ± 4 | 82 ± 10 | 42 ± 10 | 20 ± 5 | 27 ± 9 |
| JG (k=32) | 26 ± 4 | 18 ± 4 | 12 ± 3 | 9 ± 3 | 57 ± 9  | 36 ± 6 | 19 ± 5 | 26 ± 6 |

6.2 Recovery of gene regulatory network

We investigate how our joint estimator works on more realistic models by conducting a set of experiments on realistic gene expression data generated by SERGIO [13], which models the additive effect of cooperative transcription factor binding across multiple gene regulators, in parallel with protein degradation and noisy expression.

[Figure 4: Visualization of the recovered DAG structures. (a) True network: E. coli 100; (b) K = 1, n = 1000 (FDR 0.11, TPR 0.47); (c) K = 10, n = 100 (FDR 0.06, TPR 0.41); (d) K = 1, n = 100 (FDR 0.18, TPR 0.29). Each light green pixel at position (i, j) indicates an edge from node i to j.]

We conduct this experiment on the E. coli 100 network, which includes 100 known genes and 137 known regulatory interactions. To evaluate our algorithm, we generate multiple networks by rearranging and re-weighting 5 edges at random in this network without violating the topological order. We simulate the gene expression from each network using SERGIO with a universal non-cooperative hill coefficient of 0.05, which works well with our linear recovery algorithm.

Fig. 4 provides a visual comparison of the recovered structures. It can be seen from the true network that there are a few key transcription factors that highlight several rows in the figure. These transcription factors are better identified by the two structures in (b) and (c), but are not as clear in (d). Combining this observation with the more quantitative results in Table 2, we see that the combination (K = 10, n = 100) achieves comparable performance to K = 1 with the same total number of samples, and outperforms the single-task estimation with n = 100.

Table 2: Recovery across 25 independent initializations of SERGIO for each experiment. FPR and SHD stand for false positive rate and structural Hamming distance, respectively.

| K  | n    | FDR | TPR | FPR | SHD |
|----|------|-----|-----|-----|-----|
| 1  | 100  | 0.18 ± 3.8e−3 | 0.30 ± 1.2e−3 | 0.001 ± 7.3e−7 | 104.61 ± 3.2e1 |
| 10 | 100  | 0.07 ± 1.2e−3 | 0.42 ± 5.1e−4 | 0.001 ± 2.8e−7 | 84.0 ± 1.5e1 |
| 1  | 1000 | 0.09 ± 2.9e−3 | 0.50 ± 1.4e−3 | 0.002 ± 9.2e−7 | 76.4 ± 5.9e1 |

7 Conclusion and discussion

In this paper, we have analyzed the behavior of the $\ell_1/\ell_2$-penalized joint MLE for multiple DAG estimation tasks. Our main result shows that its performance in recovering the causal order is governed by the sample complexity parameter $\theta(n, K, K', p, s)$ in Eq. 8. Besides, we have proposed an efficient algorithm for approximating the joint estimator via a novel continuous programming formulation, and demonstrated its effectiveness experimentally. The current work applies to DAGs that have a certain similarity in their sparsity patterns.
It will be interesting to consider whether joint estimation without the group norm (and without the union support assumption) can also lead to a similar improvement in causal order recovery.

Acknowledgements

Xinshi Chen is supported by the Google PhD Fellowship. This work was done partially during a visit at MBZUAI. We are grateful for the computing resources provided by the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology, Atlanta. We are thankful for PACE Research Scientist Fang (Cherry) Liu's excellent HPC consulting.

References

[1] Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.
[2] James M. Robins, Miguel Angel Hernan, and Babette Brumback. Marginal structural models and causal inference in epidemiology, 2000.
[3] Pedro Aguilera Aguilera, Antonio Fernández, Rosa Fernández, Rafael Rumí, and Antonio Salmerón. Bayesian networks in environmental modelling. Environmental Modelling & Software, 26(12):1376–1388, 2011.
[4] A. Nicholson, F. Cozman, M. Velikova, J. T. Van Scheltinga, P. J. Lucas, and M. Spaanderman. Applications of Bayesian networks exploiting causal functional relationships in Bayesian network modelling for personalised healthcare. International Journal of Approximate Reasoning, 55(1):59–73, 2014.
[5] Bin Zhang, Chris Gaiteri, Liviu-Gabriel Bodea, Zhi Wang, Joshua McElwee, Alexei A. Podtelezhnikov, Chunsheng Zhang, Tao Xie, Linh Tran, Radu Dobrin, et al. Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 153(3):707–720, 2013.
[6] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017.
[7] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
[8] T. Tony Cai, Hongzhe Li, Weidong Liu, and Jichun Xie. Joint estimation of multiple high-dimensional precision matrices. Statistica Sinica, 26(2):445, 2016.
[9] Jianyu Liu, Wei Sun, and Yufeng Liu. Joint skeleton estimation of multiple directed acyclic graphs for heterogeneous population. Biometrics, 75(1):36–47, 2019.
[10] Teppei Shimamura, Seiya Imoto, Rui Yamaguchi, Masao Nagasaki, and Satoru Miyano. Inferring dynamic gene networks under varying conditions for transcriptomic network comparison. Bioinformatics, 26(8):1064–1072, 2010.
[11] Shohei Shimizu. Joint estimation of linear non-Gaussian acyclic models. Neurocomputing, 81:104–107, 2012.
[12] Aiying Zhang, Gemeng Zhang, Vince D. Calhoun, and Yu-Ping Wang. Causal brain network in schizophrenia by a two-step Bayesian network analysis. In Medical Imaging 2020: Imaging Informatics for Healthcare, Research, and Applications, volume 11318, page 1131817. International Society for Optics and Photonics, 2020.
[13] Payam Dibaeinia and Saurabh Sinha. SERGIO: A single-cell expression simulator guided by gene regulatory networks. Cell Systems, 11(3):252–271, 2020.
[14] Sara van de Geer and Peter Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.
[15] Bryon Aragam, Arash A. Amini, and Qing Zhou. Learning directed acyclic graphs with penalized neighbourhood regression, 2017.
[16] Xun Zheng, Bryon Aragam, Pradeep K. Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning.
Advances in Neural Information Processing Systems, 31:9472–9483, 2018.
[17] Bryon Aragam, Arash Amini, and Qing Zhou. Globally optimal score-based learning of directed acyclic graphs in high-dimensions. In Advances in Neural Information Processing Systems, volume 32, 2019.
[18] Guillaume Obozinski, Martin J. Wainwright, and Michael I. Jordan. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1–47, 2011.
[19] Weiguang Wang, Yingbin Liang, and Eric P. Xing. Collective support recovery for multi-design multi-response linear regression. IEEE Transactions on Information Theory, 61(1):513–534, 2014.
[20] Yuhao Wang, Santiago Segarra, and Caroline Uhler. High-dimensional joint estimation of multiple directed Gaussian graphical models. Electronic Journal of Statistics, 14(1):2439–2483, 2020.
[21] Kyoungjae Lee and Xuan Cao. Bayesian joint inference for multiple directed acyclic graphs. arXiv preprint arXiv:2008.06190, 2020.
[22] Magali Champion, Victor Picheny, and Matthieu Vignes. Inferring large graphs using ℓ1-penalized likelihood. Statistics and Computing, 28:905, 2018.
[23] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR, 2019.
[24] D. R. Cox and Nanny Wermuth. Multivariate Dependencies: Models, Analysis and Interpretation, volume 67. CRC Press, 1996.
[25] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.
[26] Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
[27] Pradeep Ravikumar, Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In NIPS, pages 1329–1336, 2008.
[28] Le Song, Mladen Kolar, and Eric P. Xing. KELLER: Estimating time-varying interactions between genes. Bioinformatics, 25(12):i128–i136, 2009.
[29] Jean Honorio and Dimitris Samaras. Multi-task learning of Gaussian graphical models. In ICML, 2010.
[30] Jian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Joint estimation of multiple graphical models. Biometrika, 98(1):1–15, 2011.
[31] Tony Cai, Weidong Liu, and Xi Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
[32] Diane Oyen and Terran Lane. Leveraging domain knowledge in multitask Bayesian network structure learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, 2012.
[33] Patrick Danaher, Pei Wang, and Daniela M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 76(2):373, 2014.
[34] Mladen Kolar, Le Song, Amr Ahmed, and Eric P. Xing. Estimating time-varying networks. Annals of Applied Statistics, 4(1):94–123, 2010.
[35] Karthik Mohan, Palma London, Maryam Fazel, Daniela Witten, and Su-In Lee. Node-based learning of multiple Gaussian graphical models. The Journal of Machine Learning Research, 15(1):445–488, 2014.
[36] Christine Peterson, Francesco C. Stingo, and Marina Vannucci. Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association, 110(509):159–174, 2015.
[37] Sen Yang, Zhaosong Lu, Xiaotong Shen, Peter Wonka, and Jieping Ye. Fused multiple graphical lasso. SIAM Journal on Optimization, 25(2):916–943, 2015.
[38] André R. Gonçalves, Fernando J. Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning with Gaussian copula models. The Journal of Machine Learning Research, 17(1):1205–1234, 2016.
[39] Burak Varici, Saurabh Sihag, and Ali Tajer. Learning shared subgraphs in Ising model pairs. In International Conference on Artificial Intelligence and Statistics, pages 3952–3960. PMLR, 2021.
[40] Asish Ghoshal, Kevin Bello, and Jean Honorio. Direct learning with guarantees of the difference DAG between structural equation models. arXiv preprint arXiv:1906.12024, 2019.
[41] Francis R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(6), 2008.
[42] Sahand Negahban and Martin J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1/ℓ∞-regularization. Advances in Neural Information Processing Systems, 21:1161–1168, 2008.
[43] Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.
[44] Karim Lounici, Massimiliano Pontil, Alexandre B. Tsybakov, and Sara van de Geer. Taking advantage of sparsity in multi-task learning. arXiv preprint arXiv:0903.1468, 2009.
[45] Junzhou Huang and Tong Zhang. The benefit of group sparsity. The Annals of Statistics, 38(4):1978–2004, 2010.
[46] Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 42–50, 2011.
[47] Lu Tian, Pan Xu, and Quanquan Gu. Forward backward greedy algorithms for multi-task learning with faster rates. In UAI, 2016.
[48] Mauro Scanagatta, Antonio Salmerón, and Fabio Stella. A survey on Bayesian network structure learning from data. Progress in Artificial Intelligence, 8(4):425–439, 2019.
[49] Ignavier Ng, Zhuangyan Fang, Shengyu Zhu, Zhitang Chen, and Jun Wang. Masked gradient-based causal structure learning. arXiv preprint arXiv:1910.08527, 2019.
[50] Shengyu Zhu, Ignavier Ng, and Zhitang Chen. Causal discovery with reinforcement learning. In International Conference on Learning Representations, 2020.
[51] Hasan Manzour, Simge Küçükyavuz, Hao-Hsiang Wu, and Ali Shojaie. Integer programming for learning directed acyclic graphs from continuous data. INFORMS Journal on Optimization, 3(1):46–73, 2021.
[52] Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural DAG learning. In International Conference on Learning Representations, 2020.
[53] Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric Xing. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pages 3414–3425. PMLR, 2020.
[54] Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam. DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR, 2020.
[55] Robert Tillman and Peter Spirtes. Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables.
In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 3–15. JMLR Workshop and Conference Proceedings, 2011.
[56] Robert E. Tillman. Structure learning with independent non-identically distributed data. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1041–1048, 2009.
[57] Lele Xu, Tingting Fan, Xia Wu, Kewei Chen, Xiaojuan Guo, Jiacai Zhang, and Li Yao. A pooling-LiNGAM algorithm for effective connectivity analysis of fMRI data. Frontiers in Computational Neuroscience, 8:125, 2014.