Published as a conference paper at ICLR 2020

# GRADIENT-BASED NEURAL DAG LEARNING

Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu & Simon Lacoste-Julien
Mila & DIRO, Université de Montréal
Canada CIFAR AI Chair. Correspondence to: sebastien.lachapelle@umontreal.ca

We propose a novel score-based approach to learning a directed acyclic graph (DAG) from observational data. We adapt a recently proposed continuous constrained optimization formulation to allow for nonlinear relationships between variables using neural networks. This extension allows modeling complex interactions while avoiding the combinatorial nature of the problem. In addition to comparing our method to existing continuous optimization methods, we provide missing empirical comparisons to nonlinear greedy search methods. On both synthetic and real-world data sets, this new method outperforms current continuous methods on most tasks, while being competitive with existing greedy search methods on important metrics for causal inference.

## 1 INTRODUCTION

Structure learning and causal inference have many important applications in different areas of science such as genetics (Koller & Friedman, 2009; Peters et al., 2017), biology (Sachs et al., 2005) and economics (Pearl, 2009). Bayesian networks (BN), which encode conditional independencies using directed acyclic graphs (DAG), are powerful models which are both interpretable and computationally tractable. Causal graphical models (CGM) (Peters et al., 2017) are BNs which support interventional queries like: What will happen if someone external to the system intervenes on variable X? Recent work suggests that causality could partially solve challenges faced by current machine learning systems such as robustness to out-of-distribution samples, adaptability and explainability (Pearl, 2019; Magliacane et al., 2018). However, structure and causal learning are daunting tasks due to both the combinatorial nature of the space of structures (the number of DAGs grows super-exponentially with the number of nodes) and the question of structure identifiability (see Section 2.2). Nevertheless, these graphical models' known qualities and promises of improvement for machine intelligence render the quest for structure/causal learning appealing.

The typical motivation for learning a causal graphical model is to predict the effect of various interventions. A CGM is best estimated when given interventional data, but interventions are often costly or impossible to obtain. As an alternative, one can use exclusively observational data and rely on different assumptions which make the graph identifiable from the distribution (see Section 2.2). This is the approach employed in this paper.

We propose a score-based method (Koller & Friedman, 2009) for structure learning named GraN-DAG which makes use of a recent reformulation of the original combinatorial problem of finding an optimal DAG into a continuous constrained optimization problem. In the original method, named NOTEARS (Zheng et al., 2018), the directed graph is encoded as a weighted adjacency matrix which represents coefficients in a linear structural equation model (SEM) (Pearl, 2009) (see Section 2.3), and acyclicity is enforced using a constraint which is both efficiently computable and easily differentiable, thus allowing the use of numerical solvers. This continuous approach improved upon popular methods while avoiding the design of greedy algorithms based on heuristics.

Our first contribution is to extend the framework of Zheng et al.
(2018) to deal with nonlinear relationships between variables using neural networks (NN) (Goodfellow et al., 2016). To adapt the acyclicity constraint to our nonlinear model, we use an argument similar to the one used in Zheng et al. (2018) and apply it first at the level of neural network paths and then at the level of graph paths. Although GraN-DAG is general enough to deal with a large variety of parametric families of conditional probability distributions, our experiments focus on the special case of nonlinear Gaussian additive noise models since, under specific assumptions, it provides appealing theoretical guarantees easing the comparison to other graph search procedures (see Sections 2.2 & 3.3). On both synthetic and real-world tasks, we show GraN-DAG often outperforms other approaches which leverage the continuous paradigm, including DAG-GNN (Yu et al., 2019), a recent nonlinear extension of Zheng et al. (2018) which uses an evidence lower bound as score. Our second contribution is to provide a missing empirical comparison to existing methods that support nonlinear relationships but tackle the optimization problem in its discrete form using greedy search procedures, namely CAM (Bühlmann et al., 2014) and GSF (Huang et al., 2018). We show that GraN-DAG is competitive on the wide range of tasks we considered, while using pre- and post-processing steps similar to CAM. We provide an implementation of GraN-DAG here.

## 2 BACKGROUND

Before presenting GraN-DAG, we review concepts relevant to structure and causal learning.

### 2.1 CAUSAL GRAPHICAL MODELS

We suppose the natural phenomenon of interest can be described by a random vector $X \in \mathbb{R}^d$ entailed by an underlying CGM $(P_X, G)$ where $P_X$ is a probability distribution over $X$ and $G = (V, E)$ is a DAG (Peters et al., 2017). Each node $j \in V$ corresponds to exactly one variable in the system. Let $\pi_j^G$ denote the set of parents of node $j$ in $G$ and let $X_{\pi_j^G}$ denote the random vector containing the variables corresponding to the parents of $j$ in $G$. Throughout the paper, we assume there are no hidden variables. In a CGM, the distribution $P_X$ is said to be Markov to $G$, i.e. we can write the probability density function (pdf) of $P_X$ as $p(x) = \prod_{j=1}^{d} p_j(x_j \mid x_{\pi_j^G})$ where $p_j(x_j \mid x_{\pi_j^G})$ is the conditional pdf of variable $X_j$ given $X_{\pi_j^G}$. A CGM can be thought of as a BN in which directed edges are given a causal meaning, allowing it to answer queries regarding interventional distributions (Koller & Friedman, 2009).

### 2.2 STRUCTURE IDENTIFIABILITY

In general, it is impossible to recover $G$ given only samples from $P_X$, i.e. without interventional data. It is, however, customary to rely on a set of assumptions to render the structure fully or partially identifiable.

**Definition 1** Given a set of assumptions $A$ on a CGM $M = (P_X, G)$, its graph $G$ is said to be identifiable from $P_X$ if there exists no other CGM $\tilde{M} = (\tilde{P}_X, \tilde{G})$ satisfying all assumptions in $A$ such that $\tilde{G} \neq G$ and $\tilde{P}_X = P_X$.

There are many examples of graph identifiability results for continuous variables (Peters et al., 2014; Peters & Bühlmann, 2014; Shimizu et al., 2006; Zhang & Hyvärinen, 2009) as well as for discrete variables (Peters et al., 2011). These results are obtained by assuming that the conditional densities belong to a specific parametric family.
For example, if one assumes that the distribution $P_X$ is entailed by a structural equation model of the form

$$X_j := f_j(X_{\pi_j^G}) + N_j \quad \text{with } N_j \sim \mathcal{N}(0, \sigma_j^2) \;\; \forall j \in V \qquad (1)$$

where $f_j$ is a nonlinear function satisfying some mild regularity conditions and the noises $N_j$ are mutually independent, then $G$ is identifiable from $P_X$ (see Peters et al. (2014) for the complete theorem and its proof). This is a particular instance of additive noise models (ANM). We will make use of this result in our experiments in Section 4.

One can consider weaker assumptions such as faithfulness (Peters et al., 2017). This assumption allows one to identify, not $G$ itself, but the Markov equivalence class to which it belongs (Spirtes et al., 2000). A Markov equivalence class is a set of DAGs which encode exactly the same set of conditional independence statements and can be characterized by a graphical object named a completed partially directed acyclic graph (CPDAG) (Koller & Friedman, 2009; Peters et al., 2017). Some algorithms we use as baselines in Section 4 output only a CPDAG.

### 2.3 NOTEARS: CONTINUOUS OPTIMIZATION FOR STRUCTURE LEARNING

Structure learning is the problem of learning $G$ using a data set of $n$ samples $\{x^{(1)}, \ldots, x^{(n)}\}$ from $P_X$. Score-based approaches cast this problem as an optimization problem, i.e. $\hat{G} = \arg\max_{G \in \text{DAG}} S(G)$ where $S(G)$ is a regularized maximum likelihood under graph $G$. Since the number of DAGs is super-exponential in the number of nodes, most methods rely on various heuristic greedy search procedures to approximately solve the problem (see Section 5 for a review). We now present the work of Zheng et al. (2018) which proposes to cast this combinatorial optimization problem into a continuous constrained one.

To do so, the authors propose to encode the graph $G$ on $d$ nodes as a weighted adjacency matrix $U = [u_1 | \cdots | u_d] \in \mathbb{R}^{d \times d}$ which represents (possibly negative) coefficients in a linear SEM of the form $X_j := u_j^\top X + N_j \;\; \forall j$, where $N_j$ is a noise variable. Let $G_U$ be the directed graph associated with the SEM and let $A_U$ be the (binary) adjacency matrix associated with $G_U$. One can see that the following equivalence holds:

$$(A_U)_{ij} = 0 \iff U_{ij} = 0 \qquad (2)$$

To make sure $G_U$ is acyclic, the authors propose the following constraint on $U$:

$$\mathrm{Tr}\, e^{U \odot U} - d = 0 \qquad (3)$$

where $e^M \triangleq \sum_{k=0}^{\infty} \frac{M^k}{k!}$ is the matrix exponential and $\odot$ is the Hadamard product. To see why this constraint characterizes acyclicity, first note that $(A_U^k)_{jj}$ is the number of cycles of length $k$ passing through node $j$ in graph $G_U$. Clearly, for $G_U$ to be acyclic, we must have $\mathrm{Tr}\, A_U^k = 0$ for $k = 1, 2, \ldots, \infty$. By equivalence (2), this is true when $\mathrm{Tr}\,(U \odot U)^k = 0$ for $k = 1, 2, \ldots, \infty$. From there, one can simply apply the definition of the matrix exponential to see why constraint (3) characterizes acyclicity (see Zheng et al. (2018) for the full development).

The authors propose to use a regularized negative least squares score (the maximum likelihood for a Gaussian noise model). The resulting continuous constrained problem is

$$\max_{U} \; S(U, \mathbf{X}) \triangleq -\frac{1}{2n} \|\mathbf{X} - \mathbf{X}U\|_F^2 - \lambda \|U\|_1 \quad \text{s.t.} \quad \mathrm{Tr}\, e^{U \odot U} - d = 0 \qquad (4)$$

where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the design matrix containing all $n$ samples. The nature of the problem has been drastically changed: we went from a combinatorial to a continuous problem. The difficulties of combinatorial optimization have been replaced by those of non-convex optimization, since the feasible set is non-convex.
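To make Equations (3) and (4) concrete, here is a minimal NumPy/SciPy sketch of the acyclicity value and the regularized least-squares score for a candidate weight matrix $U$; the helper names and the toy example are illustrative, not the NOTEARS implementation.

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(U):
    """h(U) = Tr(exp(U * U)) - d; equals 0 iff the graph induced by U is acyclic."""
    d = U.shape[0]
    return np.trace(expm(U * U)) - d  # Hadamard square inside the matrix exponential

def notears_score(U, X, lam=0.1):
    """Regularized negative least-squares score of Equation (4) (to be maximized)."""
    n = X.shape[0]
    residual = X - X @ U
    return -0.5 / n * np.sum(residual ** 2) - lam * np.abs(U).sum()

# Tiny usage example on a 3-node chain 0 -> 1 -> 2 (hypothetical weights).
rng = np.random.default_rng(0)
U = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
X = rng.normal(size=(100, 3))
print(notears_acyclicity(U))          # ~0 for this acyclic U
print(notears_acyclicity(U + U.T))    # > 0 once a cycle is introduced
print(notears_score(U, X))
```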
Nevertheless, a standard numerical solver for constrained optimization such as an augmented Lagrangian method (Bertsekas, 1999) can be applied to get an approximate solution, hence there is no need to design a greedy search procedure. Moreover, this approach is more global than greedy methods in the sense that the whole matrix $U$ is updated at each iteration. Continuous approaches to combinatorial optimization have sometimes demonstrated improved performance over discrete approaches in the literature (see for example Alayrac et al. (2018, Section 5.2) where they solve the multiple sequence alignment problem with a continuous optimization method).

## 3 GRAN-DAG: GRADIENT-BASED NEURAL DAG LEARNING

We propose a new nonlinear extension to the framework presented in Section 2.3. For each variable $X_j$, we learn a fully connected neural network with $L$ hidden layers parametrized by $\phi_{(j)} \triangleq \{W^{(1)}_{(j)}, \ldots, W^{(L+1)}_{(j)}\}$ where $W^{(\ell)}_{(j)}$ is the $\ell$th weight matrix of the $j$th NN (biases are omitted for clarity). Each NN takes as input $X_{-j} \in \mathbb{R}^d$, i.e. the vector $X$ with the $j$th component masked to zero, and outputs $\theta_{(j)} \in \mathbb{R}^m$, the $m$-dimensional parameter vector of the desired distribution family for variable $X_j$.¹ The fully connected NNs have the following form

$$\theta_{(j)} \triangleq W^{(L+1)}_{(j)}\, g(\ldots g(W^{(2)}_{(j)}\, g(W^{(1)}_{(j)} X_{-j})) \ldots) \quad \forall j \qquad (5)$$

where $g$ is a nonlinearity applied element-wise. Note that the evaluation of all NNs can be parallelized on GPU. Distribution families need not be the same for each variable. Let $\phi \triangleq \{\phi_{(1)}, \ldots, \phi_{(d)}\}$ represent all parameters of all $d$ NNs. Without any constraint on its parameters $\phi_{(j)}$, neural network $j$ models the conditional pdf $p_j(x_j \mid x_{-j}; \phi_{(j)})$. Note that the product $\prod_{j=1}^{d} p_j(x_j \mid x_{-j}; \phi_{(j)})$ does not integrate to one (i.e. it is not a joint pdf), since it does not decompose according to a DAG. We now show how one can constrain $\phi$ to make sure the product of all conditionals outputted by the NNs is a joint pdf. The idea is to define a new weighted adjacency matrix $A_\phi$ similar to the one encountered in Section 2.3, which can be directly used inside the constraint of Equation (3) to enforce acyclicity.

### 3.1 NEURAL NETWORK CONNECTIVITY

Before defining the weighted adjacency matrix $A_\phi$, we need to focus on how one can make some NN outputs unaffected by some inputs. Since we will discuss properties of a single NN, we drop the NN subscript $(j)$ to improve readability. We will use the term neural network path to refer to a computation path in a NN. For example, in a NN with two hidden layers, the sequence of weights $(W^{(1)}_{h_1 i}, W^{(2)}_{h_2 h_1}, W^{(3)}_{k h_2})$ is a NN path from input $i$ to output $k$. We say that a NN path is inactive if at least one weight along the path is zero. We can loosely interpret the path product $|W^{(1)}_{h_1 i}||W^{(2)}_{h_2 h_1}||W^{(3)}_{k h_2}| \geq 0$ as the strength of the NN path, where a path product is equal to zero if and only if the path is inactive. Note that if all NN paths from input $i$ to output $k$ are inactive (i.e. the sum of their path products is zero), then output $k$ does not depend on input $i$ anymore, since the information in input $i$ will never reach output $k$. The sum of all path products from input $i$ to output $k$, for all inputs $i$ and outputs $k$, can be easily computed by taking the following matrix product:

$$C \triangleq |W^{(L+1)}| \cdots |W^{(2)}| |W^{(1)}| \in \mathbb{R}^{m \times d}_{\geq 0} \qquad (6)$$

where $|W|$ is the element-wise absolute value of $W$. Let us name $C$ the neural network connectivity matrix. It can be verified that $C_{ki}$ is the sum of all NN path products from input $i$ to output $k$.
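As an illustration of Equation (6), the following minimal NumPy sketch computes the connectivity matrix of a small NN; the layer shapes and the `connectivity_matrix` helper are hypothetical and not the released implementation. Zeroing the column of the first weight matrix associated with an input deactivates every path leaving that input.

```python
import numpy as np

def connectivity_matrix(weights):
    """Equation (6): C = |W^(L+1)| ... |W^(1)|, so C[k, i] sums all path
    products from input i to output k (zero iff every such path is inactive)."""
    C = np.abs(weights[0])
    for W in weights[1:]:
        C = np.abs(W) @ C
    return C

# Hypothetical 2-hidden-layer NN with d = 4 inputs and m = 2 outputs.
rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(8, 4)), rng.normal(size=(16, 8)), rng.normal(size=(2, 16))
W1[:, 2] = 0.0                      # deactivate every path leaving input 2
C = connectivity_matrix([W1, W2, W3])
print(C.shape)                      # (2, 4)
print(C[:, 2])                      # zeros: both outputs are now independent of input 2
```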
This means it is sufficient to have $C_{ki} = 0$ to render output $k$ independent of input $i$.

Remember that each NN in our model outputs a parameter vector $\theta$ for a conditional distribution and that we want the product of all conditionals to be a valid joint pdf, i.e. we want its corresponding directed graph to be acyclic. With this in mind, we see that it could be useful to make a certain parameter $\theta$ not dependent on certain inputs of the NN. To have $\theta$ independent of variable $X_i$, it is sufficient to have $\sum_{k=1}^{m} C_{ki} = 0$.

### 3.2 A WEIGHTED ADJACENCY MATRIX

We now define a weighted adjacency matrix $A_\phi$ that can be used in the constraint of Equation (3):

$$(A_\phi)_{ij} \triangleq \begin{cases} \sum_{k=1}^{m} \big(C_{(j)}\big)_{ki}, & \text{if } j \neq i \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where $C_{(j)}$ denotes the connectivity matrix of the NN associated with variable $X_j$. As the notation suggests, $A_\phi \in \mathbb{R}^{d \times d}_{\geq 0}$ depends on all weights of all NNs. Moreover, it can effectively be interpreted as a weighted adjacency matrix similarly to what we presented in Section 2.3, since we have that

$$(A_\phi)_{ij} = 0 \implies \theta_{(j)} \text{ does not depend on variable } X_i \qquad (8)$$

We define $G_\phi$ to be the directed graph entailed by parameter $\phi$. We can now write our adapted acyclicity constraint:

$$h(\phi) \triangleq \mathrm{Tr}\, e^{A_\phi} - d = 0 \qquad (9)$$

Note that we can compute the gradient of $h(\phi)$ w.r.t. $\phi$ (except at points of non-differentiability arising from the absolute value function, similar to standard neural networks with ReLU activations (Glorot et al., 2011); these points did not appear problematic in our experiments using SGD).

¹ Not all parameter vectors need to have the same dimensionality, but to simplify the notation, we suppose $m_j = m \;\; \forall j$.

### 3.3 A DIFFERENTIABLE SCORE AND ITS OPTIMIZATION

We propose solving the maximum likelihood optimization problem

$$\max_{\phi} \; \mathbb{E}_{X \sim P_X} \sum_{j=1}^{d} \log p_j(X_j \mid X_{\pi_j^\phi}; \phi_{(j)}) \quad \text{s.t.} \quad \mathrm{Tr}\, e^{A_\phi} - d = 0 \qquad (10)$$

where $\pi_j^\phi$ denotes the set of parents of node $j$ in graph $G_\phi$. Note that $\sum_{j=1}^{d} \log p_j(X_j \mid X_{\pi_j^\phi}; \phi_{(j)})$ is a valid log-likelihood function when constraint (9) is satisfied.

As suggested in Zheng et al. (2018), we apply an augmented Lagrangian approach to get an approximate solution to program (10). Augmented Lagrangian methods consist of optimizing a sequence of subproblems for which the exact solutions are known to converge to a stationary point of the constrained problem under some regularity conditions (Bertsekas, 1999). In our case, each subproblem is

$$\max_{\phi} \; \mathcal{L}(\phi, \lambda_t, \mu_t) \triangleq \mathbb{E}_{X \sim P_X} \sum_{j=1}^{d} \log p_j(X_j \mid X_{\pi_j^\phi}; \phi_{(j)}) - \lambda_t h(\phi) - \frac{\mu_t}{2} h(\phi)^2 \qquad (11)$$

where $\lambda_t$ and $\mu_t$ are the Lagrangian and penalty coefficients of the $t$th subproblem, respectively. These coefficients are updated after each subproblem is solved. Since GraN-DAG rests on neural networks, we propose to approximately solve each subproblem using a well-known stochastic gradient algorithm, popular for NNs in part for its implicit regularizing effect (Poggio et al., 2018). See Appendix A.1 for details regarding the optimization procedure.

In the current section, we presented GraN-DAG in a general manner without specifying explicitly which distribution family is parameterized by $\theta_{(j)}$. In principle, any distribution family could be employed as long as its log-likelihood can be computed and differentiated with respect to its parameter $\theta$. However, it is not always clear whether the exact solution of problem (10) recovers the ground truth graph $G$. It will depend on both the modelling choice of GraN-DAG and the underlying CGM $(P_X, G)$.

**Proposition 1** Let $\phi^*$ and $G_{\phi^*}$ be the optimal solution to (10) and its corresponding graph, respectively.
Let $\mathcal{M}(A)$ be the set of CGMs $(P', G')$ for which the assumptions in $A$ are satisfied and let $\mathcal{C}$ be the set of CGMs $(P', G')$ which can be represented by the model (e.g. NNs outputting a Gaussian distribution). If the underlying CGM $(P_X, G) \in \mathcal{C}$ and $\mathcal{C} \subseteq \mathcal{M}(A)$ for a specific set of assumptions $A$ such that $G$ is identifiable from $P_X$, then $G_{\phi^*} = G$.

Proof: Let $P_{\phi}$ be the joint distribution entailed by parameter $\phi$. Note that the population log-likelihood $\mathbb{E}_{X \sim P_X} \log p_\phi(X)$ is maximal iff $P_\phi = P_X$. We know this maximum can be achieved by a specific parameter $\phi$ since, by hypothesis, $(P_X, G) \in \mathcal{C}$. Since $G$ is identifiable from $P_X$, we know there exists no other CGM $(\tilde{P}_X, \tilde{G}) \in \mathcal{C}$ such that $\tilde{G} \neq G$ and $\tilde{P}_X = P_X$. Hence $G_{\phi^*}$ has to be equal to $G$.

In Section 4.1, we empirically explore the identifiable setting of nonlinear Gaussian ANMs introduced in Section 2.2. In practice, one should keep in mind that solving (10) exactly is hard since the problem is non-convex (the augmented Lagrangian converges only to a stationary point) and, moreover, we only have access to the empirical log-likelihood (Proposition 1 holds for the population case).

### 3.4 THRESHOLDING

The solution outputted by the augmented Lagrangian will satisfy the constraint only up to numerical precision, thus several entries of $A_\phi$ might not be exactly zero and require thresholding. To do so, we mask the inputs of each NN $j$ using a binary matrix $M_{(j)} \in \{0, 1\}^{d \times d}$ initialized to have $(M_{(j)})_{ii} = 1 \;\; \forall i \neq j$ and zeros everywhere else. Having $(M_{(j)})_{ii} = 0$ means the input $i$ of NN $j$ has been thresholded. This mask is integrated in the product of Equation (6) by computing $C_{(j)} \triangleq |W^{(L+1)}_{(j)}| \cdots |W^{(1)}_{(j)}| M_{(j)}$ without changing the interpretation of $C_{(j)}$ ($M_{(j)}$ can be seen simply as an extra layer in the NN). During optimization, if the entry $(A_\phi)_{ij}$ is smaller than the threshold $\epsilon = 10^{-4}$, the corresponding mask entry $(M_{(j)})_{ii}$ is set to zero, permanently. The masks $M_{(j)}$ are never updated via gradient descent. We also add an iterative thresholding step at the end to ensure the estimated graph $G_\phi$ is acyclic (described in Appendix A.2).

### 3.5 OVERFITTING

In practice, we maximize an empirical estimate of the objective of problem (10). It is well known that this maximum likelihood score is prone to overfitting in the sense that adding edges can never reduce the maximal likelihood (Koller & Friedman, 2009). GraN-DAG gets around this issue in four ways. First, as we optimize a subproblem, we evaluate its objective on a held-out data set and declare convergence once it has stopped improving. This approach is known as early stopping (Prechelt, 1997). Second, to optimize (11), we use a stochastic gradient algorithm variant which is now known to have an implicit regularizing effect (Poggio et al., 2018). Third, once we have thresholded our graph estimate to be a DAG, we apply a final pruning step identical to what is done in CAM (Bühlmann et al., 2014) to remove spurious edges. This step performs a regression of each node against its parents and uses a significance test to decide which parents should be kept or not. Fourth, for graphs of 50 nodes or more, we apply a preliminary neighbors selection (PNS) before running the optimization procedure, as was also recommended in Bühlmann et al. (2014). This procedure selects a set of potential parents for each variable. See Appendix A.3 for details on PNS and pruning.
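To tie Sections 3.1 to 3.4 together, here is a hedged NumPy/SciPy sketch, under illustrative network sizes and helper names that are not the released implementation, of how the per-network connectivity matrices yield $A_\phi$ (Equation (7)), how $h(\phi)$ of Equation (9) is evaluated, and how mask entries are frozen once $(A_\phi)_{ij}$ falls below the threshold.

```python
import numpy as np
from scipy.linalg import expm

def connectivity(weights, mask):
    """C_(j) = |W^(L+1)| ... |W^(1)| M_(j)  (Equation (6) with the mask of Section 3.4)."""
    C = np.abs(weights[0]) @ mask
    for W in weights[1:]:
        C = np.abs(W) @ C
    return C

def weighted_adjacency(all_weights, all_masks, d):
    """Equation (7): (A_phi)_ij sums the connectivity from input i to every output of NN j."""
    A = np.zeros((d, d))
    for j in range(d):
        C_j = connectivity(all_weights[j], all_masks[j])
        A[:, j] = C_j.sum(axis=0)        # column j gathers the candidate parents of variable j
        A[j, j] = 0.0
    return A

def acyclicity(A):
    """Equation (9): h(phi) = Tr(exp(A_phi)) - d."""
    return np.trace(expm(A)) - A.shape[0]

def update_masks(A, all_masks, eps=1e-4):
    """Section 3.4: permanently mask input i of NN j once (A_phi)_ij falls below eps."""
    d = A.shape[0]
    for j in range(d):
        for i in range(d):
            if i != j and A[i, j] < eps:
                all_masks[j][i, i] = 0.0
    return all_masks

# Tiny illustration with d = 3 variables and one 1-hidden-layer NN per variable (m = 1 output).
rng = np.random.default_rng(0)
d, hidden, m = 3, 5, 1
all_weights = [[rng.normal(size=(hidden, d)), rng.normal(size=(m, hidden))] for _ in range(d)]
all_masks = [np.diag([1.0 if i != j else 0.0 for i in range(d)]) for j in range(d)]
A = weighted_adjacency(all_weights, all_masks, d)
print(acyclicity(A))                     # > 0: the unconstrained networks still entail a cyclic graph
all_masks = update_masks(A, all_masks)   # with random weights, typically nothing is masked yet
```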
Many score-based approaches control overfitting by penalizing the number of edges in their score. For example, NOTEARS includes the L1 norm of its weighted adjacency matrix $U$ in its objective. GraN-DAG regularizes using PNS and pruning for ease of comparison to CAM, the most competitive approach in our experiments. The importance of PNS and pruning and their ability to reduce overfitting are illustrated in an ablation study presented in Appendix A.3. The study shows that PNS and pruning are both very important for the performance of GraN-DAG in terms of SHD, but do not have a significant effect in terms of SID. In these experiments, we also present NOTEARS and DAG-GNN with PNS and pruning, without noting a significant improvement.

### 3.6 COMPUTATIONAL COMPLEXITY

To learn a graph, GraN-DAG relies on the proper training of the neural networks on which it is built. We thus propose using a stochastic gradient method, which is a standard choice when it comes to NN training because it scales well with both the sample size and the number of parameters and it implicitly regularizes learning. Similarly to NOTEARS, GraN-DAG requires the evaluation of the matrix exponential of $A_\phi$ at each iteration, costing $O(d^3)$. NOTEARS justifies the use of a batch proximal quasi-Newton algorithm by the low number of $O(d^3)$ iterations required to converge. Since GraN-DAG uses a stochastic gradient method, one would expect it to require more iterations to converge. However, in practice we observe that GraN-DAG performs fewer iterations than NOTEARS before the augmented Lagrangian converges (see Table 4 of Appendix A.1). We hypothesize this is due to early stopping, which avoids having to wait until the full convergence of the subproblems, hence limiting the total number of iterations. Moreover, for the graph sizes considered in this paper ($d \leq 100$), the evaluation of $h(\phi)$ in GraN-DAG, which includes the matrix exponentiation, does not dominate the cost of each iteration (about 4% for 20-node and 13% for 100-node graphs). Evaluating the approximate gradient of the log-likelihood (costing $O(d^2)$ assuming a fixed minibatch size, NN depth and width) appears to be of greater importance for $d \leq 100$.

## 4 EXPERIMENTS

In this section, we compare GraN-DAG to various baselines in the continuous paradigm, namely DAG-GNN (Yu et al., 2019) and NOTEARS (Zheng et al., 2018), and also in the combinatorial paradigm, namely CAM (Bühlmann et al., 2014), GSF (Huang et al., 2018), GES (Chickering, 2003) and PC (Spirtes et al., 2000). These methods are discussed in Section 5. In all experiments, each NN learned by GraN-DAG outputs the mean of a Gaussian distribution $\hat{\mu}_{(j)}$, i.e. $\theta_{(j)} := \hat{\mu}_{(j)}$ and $X_j \mid X_{\pi_j^G} \sim \mathcal{N}(\hat{\mu}_{(j)}, \hat{\sigma}^2_{(j)}) \;\; \forall j$. The parameters $\hat{\sigma}^2_{(j)}$ are learned as well, but do not depend on the parent variables $X_{\pi_j^G}$ (unless otherwise stated). Note that this modelling choice matches the nonlinear Gaussian ANM introduced in Section 2.2. We report the performance of random graphs sampled using the Erdős-Rényi (ER) scheme described in Appendix A.5 (denoted by RANDOM). For each approach, we evaluate the estimated graph on two metrics: the structural Hamming distance (SHD) and the structural interventional distance (SID) (Peters & Bühlmann, 2013). The former simply counts the number of missing, falsely detected or reversed edges. The latter is especially well suited for causal inference since it counts the number of couples $(i, j)$ such that the interventional distribution $p(x_j \mid do(X_i = x))$ would be miscalculated if we were to use the estimated graph to form the parent adjustment set.
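As a concrete reference for the first metric, below is a minimal NumPy sketch of SHD between two binary adjacency matrices, following the common convention, consistent with the description above, that a reversed edge counts as a single error; this helper is illustrative and not the evaluation code used for the tables.

```python
import numpy as np

def shd(A_true, A_pred):
    """Structural Hamming distance between two binary DAG adjacency matrices (no self-loops).
    Missing and extra edges count once each; a reversed edge counts as a single error."""
    diff = (A_true != A_pred).astype(int)
    # A reversed edge differs at both (i, j) and (j, i); count that pair only once.
    both = ((diff + diff.T) == 2) & ((A_true + A_true.T) == 1) & ((A_pred + A_pred.T) == 1)
    n_reversed = both.sum() // 2
    return diff.sum() - n_reversed

A_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])
A_pred = np.array([[0, 0, 0],
                   [1, 0, 1],
                   [0, 0, 0]])   # edge 0->1 reversed, edge 1->2 kept
print(shd(A_true, A_pred))       # 1
```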
Note that GSF, GES and PC output only a CPDAG, hence the need to report a lower and an upper bound on the SID. See Appendix A.7 for more details on SHD and SID. All experiments were run with publicly available code from the authors' websites. See Appendix A.8 for the details of their hyperparameters. In Appendix A.9, we explain how one could use a held-out data set to select the hyperparameters of score-based approaches and report the results of such a procedure on almost every setting discussed in the present section.

### 4.1 SYNTHETIC DATA

We have generated different data set types which vary along four dimensions: data generating process, number of nodes, level of edge sparsity and graph type. We consider two graph sampling schemes: Erdős-Rényi (ER) and scale-free (SF) (see Appendix A.5 for details). For each data set type, we sampled 10 data sets of 1000 examples as follows: first, a ground truth DAG $G$ is randomly sampled following either the ER or the SF scheme; then, the data is generated according to a specific sampling scheme.

The first data generating process we consider is the nonlinear Gaussian ANM (Gauss-ANM) introduced in Section 2.2, in which data is sampled following $X_j := f_j(X_{\pi_j^G}) + N_j$ with mutually independent noises $N_j \sim \mathcal{N}(0, \sigma_j^2) \;\; \forall j$, where the functions $f_j$ are independently sampled from a Gaussian process with a unit bandwidth RBF kernel and with $\sigma_j^2$ sampled uniformly. As mentioned in Section 2.2, we know $G$ to be identifiable from the distribution. Proposition 1 indicates that the modelling choice of GraN-DAG together with this synthetic data ensure that solving (10) to optimality would recover the correct graph. Note that NOTEARS and CAM also make the correct Gaussian noise assumption, but do not have enough capacity to represent the $f_j$ functions properly.

We considered graphs of 10, 20, 50 and 100 nodes. Tables 1 & 2 present results only for 10 and 50 nodes since the conclusions do not change with graphs of 20 or 100 nodes (see Appendix A.6 for these additional experiments). We consider graphs of $d$ and $4d$ edges (respectively denoted by ER1 and ER4 in the case of ER graphs). We report the performance of the popular GES and PC in Appendix A.6 since they are almost never on par with the best methods presented in this section.
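The Gauss-ANM generation described above can be sketched as follows. This illustrative NumPy snippet draws each $f_j$ as a single Gaussian-process function evaluated at the observed parent values (RBF kernel, unit bandwidth), with the noise and source-node ranges of Appendix A.5; it is an assumption-laden sketch, not the authors' data generator.

```python
import numpy as np

def rbf_gram(Z, bandwidth=1.0):
    """RBF kernel Gram matrix between the rows of Z."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def sample_gauss_anm(A, n=500, seed=0):
    """Sample n points of X_j := f_j(X_pa(j)) + N_j, with each f_j a single draw from a GP
    prior (unit-bandwidth RBF kernel) evaluated at the observed parent values.
    A is a binary adjacency matrix assumed strictly upper triangular (nodes already in
    topological order), so variables can be generated left to right."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    X = np.zeros((n, d))
    for j in range(d):
        parents = np.flatnonzero(A[:, j])
        if parents.size == 0:
            X[:, j] = rng.normal(0.0, np.sqrt(rng.uniform(1.0, 2.0)), size=n)   # source node
        else:
            K = rbf_gram(X[:, parents]) + 1e-6 * np.eye(n)                      # jitter for stability
            f_j = rng.multivariate_normal(np.zeros(n), K)                       # one GP function draw
            X[:, j] = f_j + rng.normal(0.0, np.sqrt(rng.uniform(0.4, 0.8)), size=n)
    return X

# Chain 0 -> 1 -> 2 as a toy ground-truth DAG.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
X = sample_gauss_anm(A)
print(X.shape)   # (500, 3)
```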
Table 1: Results for ER and SF graphs of 10 nodes with Gauss-ANM data

| Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID |
|---|---|---|---|---|---|---|---|---|
| GraN-DAG | 1.7 ± 2.5 | 1.7 ± 3.1 | 8.3 ± 2.8 | 21.8 ± 8.9 | 1.2 ± 1.1 | 4.1 ± 6.1 | 9.9 ± 4.0 | 16.4 ± 6.0 |
| DAG-GNN | 11.4 ± 3.1 | 37.6 ± 14.4 | 35.1 ± 1.5 | 81.9 ± 4.7 | 9.9 ± 1.1 | 29.7 ± 15.8 | 20.8 ± 1.9 | 48.4 ± 15.6 |
| NOTEARS | 12.2 ± 2.9 | 36.6 ± 13.1 | 32.6 ± 3.2 | 79.0 ± 4.1 | 10.7 ± 2.2 | 32.0 ± 15.3 | 20.8 ± 2.7 | 49.8 ± 15.6 |
| CAM | 1.1 ± 1.1 | 1.1 ± 2.4 | 12.2 ± 2.7 | 30.9 ± 13.2 | 1.4 ± 1.6 | 5.4 ± 6.1 | 9.8 ± 4.3 | 19.3 ± 7.5 |
| GSF | 6.5 ± 2.6 | [6.2 ± 10.8, 17.7 ± 12.3] | 21.7 ± 8.4 | [37.2 ± 19.2, 62.7 ± 14.9] | 1.8 ± 1.7 | [2.0 ± 5.1, 6.9 ± 6.2] | 8.5 ± 4.2 | [13.2 ± 6.8, 20.6 ± 12.1] |
| RANDOM | 26.3 ± 9.8 | 25.8 ± 10.4 | 31.8 ± 5.0 | 76.6 ± 7.0 | 25.1 ± 10.2 | 24.5 ± 10.5 | 28.5 ± 4.0 | 47.2 ± 12.2 |

Table 2: Results for ER and SF graphs of 50 nodes with Gauss-ANM data

| Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID |
|---|---|---|---|---|---|---|---|---|
| GraN-DAG | 5.1 ± 2.8 | 22.4 ± 17.8 | 102.6 ± 21.2 | 1060.1 ± 109.4 | 25.5 ± 6.2 | 90.0 ± 18.9 | 111.3 ± 12.3 | 271.2 ± 65.4 |
| DAG-GNN | 49.2 ± 7.9 | 304.4 ± 105.1 | 191.9 ± 15.2 | 2146.2 ± 64 | 49.8 ± 1.3 | 182.8 ± 42.9 | 144.9 ± 13.3 | 540.8 ± 151.1 |
| NOTEARS | 62.8 ± 9.2 | 327.3 ± 119.9 | 202.3 ± 14.3 | 2149.1 ± 76.3 | 57.7 ± 3.5 | 195.7 ± 54.9 | 153.7 ± 11.8 | 558.4 ± 153.5 |
| CAM | 4.3 ± 1.9 | 22.0 ± 17.9 | 98.8 ± 20.7 | 1197.2 ± 125.9 | 24.1 ± 6.2 | 85.7 ± 31.9 | 111.2 ± 13.3 | 320.7 ± 152.6 |
| GSF | 25.6 ± 5.1 | [21.1 ± 23.1, 79.2 ± 33.5] | 81.8 ± 18.8 | [906.6 ± 214.7, 1030.2 ± 172.6] | 31.6 ± 6.7 | [85.8 ± 29.9, 147.3 ± 49.9] | 120.2 ± 10.9 | [284.7 ± 80.2, 379.9 ± 98.3] |
| RANDOM | 535.7 ± 401.2 | 272.3 ± 125.5 | 708.4 ± 234.4 | 1921.3 ± 203.5 | 514.0 ± 360.0 | 381.3 ± 190.3 | 660.6 ± 194.9 | 1198.9 ± 304.6 |

We now examine Tables 1 & 2 (the error bars represent the standard deviation across data sets per task). We can see that, across all settings, GraN-DAG and CAM are the best performing methods, both in terms of SHD and SID, while GSF is not too far behind. The poor performance of NOTEARS can be explained by its inability to model nonlinear functions. In terms of SHD, DAG-GNN rarely performs better than NOTEARS, while in terms of SID it performs similarly to RANDOM in almost all cases except in scale-free networks of 50 nodes or more. Its poor performance might be due to its incorrect modelling assumptions and because its architecture uses a strong form of parameter sharing between the $f_j$ functions, which is not justified in a setup like ours. GSF always performs better than DAG-GNN and NOTEARS but performs as well as CAM and GraN-DAG only about half the time. Among the continuous approaches considered, GraN-DAG is the best performing on these synthetic tasks.

Even though CAM (wrongly) assumes that the functions $f_j$ are additive, i.e. $f_j(x_{\pi_j^G}) = \sum_{i \in \pi_j^G} f_{ij}(x_i) \;\; \forall j$, it manages to compete with GraN-DAG, which does not make this incorrect modelling assumption.² This might partly be explained by a bias-variance trade-off. CAM is biased but has a lower variance than GraN-DAG due to its restricted capacity, resulting in both methods performing similarly. In Appendix A.4, we present an experiment showing that GraN-DAG can outperform CAM in higher sample size settings, suggesting this explanation is reasonable.

Having confirmed that GraN-DAG is competitive on the ideal Gauss-ANM data, we experimented with settings better adjusted to other models to see whether GraN-DAG remains competitive. We considered linear Gaussian data (better adjusted to NOTEARS) and nonlinear Gaussian data with additive functions (better adjusted to CAM), named LIN and ADD-FUNC, respectively. See Appendix A.5 for the details of their generation. We report the results of GraN-DAG and other baselines in Tables 12 & 13 of the appendix.
On linear Gaussian data, most methods score poorly in terms of SID, which is probably due to the unidentifiability of the linear Gaussian model (when the noise variances are unequal). GraN-DAG and CAM perform similarly to NOTEARS in terms of SHD. On ADD-FUNC, CAM dominates all methods on most graph types considered (GraN-DAG is on par only for the 10-node ER1 graph). However, GraN-DAG outperforms all the other methods, which can be explained by the fact that the conditions of Proposition 1 are satisfied (supposing the functions $\sum_{i \in \pi_j^G} f_{ij}(X_i)$ can be represented by the NNs).

We also considered synthetic data sets which do not satisfy the additive Gaussian noise assumption present in GraN-DAG, NOTEARS and CAM. We considered two kinds of post nonlinear causal models (Zhang & Hyvärinen, 2009), PNL-GP and PNL-MULT (see Appendix A.5 for details about their generation). A post nonlinear model has the form $X_j := g_j(f_j(X_{\pi_j^G}) + N_j)$ where $N_j$ is a noise variable. Note that GraN-DAG (with the current modelling choice) and CAM do not have the representational power to express these conditional distributions, hence violating an assumption of Proposition 1. However, these data sets differ from the previous additive noise setup only by the nonlinearity $g_j$, hence offering a case of mild model misspecification. The results are reported in Table 14 of the appendix. GraN-DAG and CAM outperform DAG-GNN and NOTEARS except in SID for certain data sets where all methods score similarly to RANDOM. GraN-DAG and CAM have similar performance on all data sets except one where CAM is better. GSF performs worse than GraN-DAG (in both SHD and SID) on PNL-GP but not on PNL-MULT, where it performs better in SID.

### 4.2 REAL AND PSEUDO-REAL DATA

We have tested all methods considered so far on a well-known data set which measures the expression level of different proteins and phospholipids in human cells (Sachs et al., 2005). We trained only on the $n = 853$ observational samples. This data set and its ground truth graph proposed in Sachs et al. (2005) (11 nodes and 17 edges) are often used in the probabilistic graphical model literature (Koller & Friedman, 2009). We also consider pseudo-real data sets sampled from the SynTReN generator (Van den Bulcke et al., 2006). This generator was designed to create synthetic transcriptional regulatory networks and produces simulated gene expression data that approximates experimental data. See Appendix A.5 for details of the generation.

² Although it is true that GraN-DAG does not wrongly assume that the functions $f_j$ are additive, it is not clear whether its neural networks can exactly represent functions sampled from the Gaussian process.

In applications, it is not clear whether the conditions of Proposition 1 hold since we do not know whether $(P_X, G) \in \mathcal{C}$. This departure from identifiable settings is an occasion to explore a different modelling choice for GraN-DAG. In addition to the model presented at the beginning of this section, we consider an alternative, denoted GraN-DAG++, which allows the variance parameters $\hat{\sigma}^2_{(i)}$ to depend on the parent variables $X_{\pi_i^G}$ through the NN, i.e. $\theta_{(i)} := (\hat{\mu}_{(i)}, \log \hat{\sigma}^2_{(i)})$. Note that this violates the additive noise assumption (in ANMs, the noise is independent of the parent variables). In addition to the metrics used in Section 4.1, we also report the SHD-C. To compute the SHD-C between two DAGs, we first map each of them to their corresponding CPDAG and measure the SHD between the two.
This metric is useful to compare algorithms which only output a CPDAG, like GSF, GES and PC, to other methods which output a DAG. Results are reported in Table 3.

Table 3: Results on real and pseudo-real data (SynTReN data sets use 20-node graphs)

| Method | Protein SHD | Protein SHD-C | Protein SID | SynTReN SHD | SynTReN SHD-C | SynTReN SID |
|---|---|---|---|---|---|---|
| GraN-DAG | 13 | 11 | 47 | 34.0 ± 8.5 | 36.4 ± 8.3 | 161.7 ± 53.4 |
| GraN-DAG++ | 13 | 10 | 48 | 33.7 ± 3.7 | 39.4 ± 4.9 | 127.5 ± 52.8 |
| DAG-GNN | 16 | 21 | 44 | 93.6 ± 9.2 | 97.6 ± 10.3 | 157.5 ± 74.6 |
| NOTEARS | 21 | 21 | 44 | 151.8 ± 28.2 | 156.1 ± 28.7 | 110.7 ± 66.7 |
| CAM | 12 | 9 | 55 | 40.5 ± 6.8 | 41.4 ± 7.1 | 152.3 ± 48 |
| GSF | 18 | 10 | [44, 61] | 61.8 ± 9.6 | 63.3 ± 11.4 | [76.7 ± 51.1, 109.9 ± 39.9] |
| GES | 26 | 28 | [34, 45] | 82.6 ± 9.3 | 85.6 ± 10 | [157.2 ± 48.3, 168.8 ± 47.8] |
| PC | 17 | 11 | [47, 62] | 41.2 ± 5.1 | 42.4 ± 4.6 | [154.8 ± 47.6, 179.3 ± 55.6] |
| RANDOM | 21 | 20 | 60 | 84.7 ± 53.8 | 86.7 ± 55.8 | 175.8 ± 64.7 |

First, all methods perform worse than what was reported for graphs of similar size in Section 4.1, both in terms of SID and SHD. This might be due to the lack of identifiability guarantees we face in applications. On the protein data set, GraN-DAG outperforms CAM in terms of SID (which differs from the general trend of Section 4.1) and arrives almost on par in terms of SHD and SHD-C. On this data set, DAG-GNN has a reasonable performance, beating GraN-DAG in SID, but not in SHD. On SynTReN, GraN-DAG obtains the best SHD but not the best SID. Overall, GraN-DAG is always competitive with the best methods on each task.

## 5 RELATED WORK

Most methods for structure learning from observational data make use of some identifiability results similar to the ones raised in Section 2.2. Roughly speaking, there are two classes of methods: independence-based and score-based methods. GraN-DAG falls into the second class.

Score-based methods (Koller & Friedman, 2009; Peters et al., 2017) cast the problem of structure learning as an optimization problem over the space of structures (DAGs or CPDAGs). Many popular algorithms tackle the combinatorial nature of the problem by performing a form of greedy search. GES (Chickering, 2003) is a popular example. It usually assumes a linear parametric model with Gaussian noise and greedily searches the space of CPDAGs in order to optimize the Bayesian information criterion. GSF (Huang et al., 2018) is based on the same search algorithm as GES, but uses a generalized score function which can model nonlinear relationships. Other greedy approaches rely on parametric assumptions which render $G$ fully identifiable. For example, Peters & Bühlmann (2014) rely on a linear Gaussian model with equal variances to render the DAG identifiable. RESIT (Peters et al., 2014) assumes nonlinear relationships with additive Gaussian noise and greedily maximizes an independence-based score. However, RESIT does not scale well to graphs of more than 20 nodes. CAM (Bühlmann et al., 2014) decouples the search for the optimal node ordering from the parent selection for each node and assumes an additive noise model (ANM) (Peters et al., 2017) in which the nonlinear functions are additive. As mentioned in Section 2.3, NOTEARS, proposed in Zheng et al. (2018), tackles the problem of finding an optimal DAG as a continuous constrained optimization program. This is a drastic departure from previous combinatorial approaches which enables the application of well-studied numerical solvers for continuous optimization. Recently, Yu et al. (2019) proposed DAG-GNN, a graph neural network architecture (GNN) which can be used to learn DAGs via the maximization of an evidence lower bound.
By design, a GNN makes use of parameter sharing, which we hypothesize is not well suited for most DAG learning tasks. To the best of our knowledge, DAG-GNN is the first approach extending the NOTEARS algorithm for structure learning to support nonlinear relationships. Although Yu et al. (2019) provide empirical comparisons to linear approaches, namely NOTEARS and FGS (a faster extension of GES) (Ramsey et al., 2017), comparisons to greedy approaches supporting nonlinear relationships such as CAM and GSF are missing. Moreover, GraN-DAG significantly outperforms DAG-GNN on our benchmarks. There exist certain score-based approaches which use integer linear programming (ILP) (Jaakkola et al., 2010; Cussens, 2011) and internally solve continuous linear relaxations. Connections between such methods and the continuous constrained approaches are yet to be explored.

When used with the additive Gaussian noise assumption, the theoretical guarantee of GraN-DAG rests on the identifiability of nonlinear Gaussian ANMs. Analogously to CAM and NOTEARS, this guarantee holds only if the correct identifiability assumptions hold in the data and if the score maximization problem is solved exactly (which is not the case in all three algorithms). DAG-GNN provides no theoretical justification for its approach. NOTEARS and CAM are designed to handle what is sometimes called the high-dimensional setting, in which the number of samples is significantly smaller than the number of nodes. Bühlmann et al. (2014) provide consistency results for CAM in this setting. GraN-DAG and DAG-GNN were not designed with this setting in mind and would most likely fail if confronted with it. Solutions for fitting a neural network on fewer data points than input dimensions are not common in the NN literature.

Methods for causal discovery using NNs have already been proposed. SAM (Kalainathan et al., 2018) learns conditional NN generators using adversarial losses but does not enforce acyclicity. CGNN (Goudet et al., 2018), when used for multivariate data, requires an initial skeleton to learn the different functional relationships. GraN-DAG has strong connections with MADE (Germain et al., 2015), a method used to learn distributions using a masked NN which enforces the so-called autoregressive property. The autoregressive property and acyclicity are in fact equivalent. MADE does not learn the weight masking; it fixes it at the beginning of the procedure. GraN-DAG could be used with a unique NN taking as input all variables and outputting parameters for all conditional distributions. In this case, it would be similar to MADE, except the variable ordering would be learned from data instead of fixed a priori.

## 6 CONCLUSION

The continuous constrained approach to structure learning has the advantage of being more global than other approximate greedy methods (since it updates all edges at each step based on the gradient of the score but also the acyclicity constraint) and allows replacing task-specific greedy algorithms with appropriate off-the-shelf numerical solvers. In this work, we have introduced GraN-DAG, a novel score-based approach for structure learning supporting nonlinear relationships while leveraging a continuous optimization paradigm. The method rests on a novel characterization of acyclicity for NNs based on the work of Zheng et al. (2018).
We showed GraN-DAG outperforms other gradient-based approaches, namely NOTEARS and its recent nonlinear extension DAG-GNN, on the synthetic data sets considered in Section 4.1, while being competitive on the real and pseudo-real data sets of Section 4.2. Compared to greedy approaches, GraN-DAG is competitive across all data sets considered. To the best of our knowledge, GraN-DAG is the first approach leveraging the continuous paradigm introduced in Zheng et al. (2018) which has been shown to be competitive with state-of-the-art methods supporting nonlinear relationships.

## ACKNOWLEDGMENTS

This research was partially supported by the Canada CIFAR AI Chair Program and by a Google Focused Research award. The authors would like to thank Rémi Le Priol, Tatjana Chavdarova, Charles Guille-Escuret, Nicolas Gagné and Yoshua Bengio for insightful discussions, as well as Alexandre Drouin and Florian Bordes for technical support. The experiments were in part enabled by computational resources provided by Calcul Québec, Compute Canada and Element AI.

## REFERENCES

J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien. Learning from narrated instruction videos. TPAMI, 40(9):2194-2208, Sep. 2018.

A.-L. Barabási. Scale-free networks: a decade and beyond. Science, 2009.

A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 1999.

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.

D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics, 2014.

D.M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 2003.

A. Clauset, C.R. Shalizi, and M.E.J. Newman. Power-law distributions in empirical data. SIAM Review, 2009.

J. Cussens. Bayesian network learning with cutting planes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.

M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.

X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, and M. Sebag. Learning functional causal models with generative neural networks. In Explainable and Interpretable Models in Computer Vision and Machine Learning. Springer International Publishing, 2018.

Biwei Huang, Kun Zhang, Yizhu Lin, Bernhard Schölkopf, and Clark Glymour. Generalized score functions for causal discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.

T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian network structure using LP relaxations.
In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.

D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. SAM: Structural agnostic model, causal discovery and penalized adversarial learning. arXiv preprint arXiv:1803.04929, 2018.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. MIT Press, 2009.

S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J.M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems 31, 2018.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.

J. Pearl. The seven tools of causal inference, with reflections on machine learning. Commun. ACM, 2019.

J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 2014.

J. Peters and P. Bühlmann. Structural intervention distance (SID) for evaluating causal graphs. Neural Computation, 2013.

J. Peters and P. Bühlmann. Structural intervention distance (SID) for evaluating causal graphs. Neural Computation, 2015.

J. Peters, D. Janzing, and B. Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 2014.

J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference - Foundations and Learning Algorithms. MIT Press, 2017.

T. A. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. Mhaskar. Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173, 2018.

L. Prechelt. Early stopping - but when? In Neural Networks: Tricks of the Trade, volume 1524 of LNCS, chapter 2, 1997.

J. Ramsey, M. Glymour, R. Sanchez-Romero, and C. Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International Journal of Data Science and Analytics, 2017.

K. Sachs, O. Perez, D. Pe'er, D.A. Lauffenburger, and G.P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 2005.

S. Shimizu, P.O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 2006.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.

T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Tim Van den Bulcke et al. SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. 2006.

Y. Yu, J. Chen, T. Gao, and M. Yu. DAG-GNN: DAG structure learning with graph neural networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.

K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

Kun Zhang, Zhikun Wang, Jiji Zhang, and Bernhard Schölkopf.
On estimation of functional causal models: General results and application to the post-nonlinear causal model. ACM Trans. Intell. Syst. Technol., 2015.

X. Zheng, B. Aragam, P.K. Ravikumar, and E.P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31, 2018.

## A.1 OPTIMIZATION

Let us recall the augmented Lagrangian:

$$\max_{\phi} \; \mathcal{L}(\phi, \lambda_t, \mu_t) \triangleq \mathbb{E}_{X \sim P_X} \sum_{i=1}^{d} \log p_i(X_i \mid X_{\pi_i^\phi}; \phi_{(i)}) - \lambda_t h(\phi) - \frac{\mu_t}{2} h(\phi)^2 \qquad (12)$$

where $\lambda_t$ and $\mu_t$ are the Lagrangian and penalty coefficients of the $t$th subproblem, respectively. In all our experiments, we initialize those coefficients using $\lambda_0 = 0$ and $\mu_0 = 10^{-3}$. We approximately solve each non-convex subproblem using RMSprop (Tieleman & Hinton, 2012), a stochastic gradient descent variant popular for NNs. We use the gradient estimate $\nabla_\phi \mathcal{L}(\phi, \lambda_t, \mu_t) \approx \nabla_\phi \hat{\mathcal{L}}_B(\phi, \lambda_t, \mu_t)$ with

$$\hat{\mathcal{L}}_B(\phi, \lambda_t, \mu_t) \triangleq \frac{1}{|B|} \sum_{x \in B} \sum_{i=1}^{d} \log p_i(x_i \mid x_{\pi_i^\phi}; \phi_{(i)}) - \lambda_t h(\phi) - \frac{\mu_t}{2} h(\phi)^2 \qquad (13)$$

where $B$ is a minibatch sampled from the data set and $|B|$ is the minibatch size. The gradient estimate $\nabla_\phi \hat{\mathcal{L}}_B(\phi, \lambda_t, \mu_t)$ can be computed using standard deep learning libraries. We consider that a subproblem has converged when $\hat{\mathcal{L}}_H(\phi, \lambda_t, \mu_t)$ evaluated on a held-out data set $H$ stops increasing. Let $\phi_t^*$ be the approximate solution to subproblem $t$. Then, $\lambda_t$ and $\mu_t$ are updated according to the following rule:

$$\lambda_{t+1} \leftarrow \lambda_t + \mu_t h(\phi_t^*) \qquad \mu_{t+1} \leftarrow \begin{cases} \eta \mu_t, & \text{if } h(\phi_t^*) > \gamma\, h(\phi_{t-1}^*) \\ \mu_t, & \text{otherwise} \end{cases} \qquad (14)$$

with $\eta = 10$ and $\gamma = 0.9$. Each subproblem $t$ is initialized using the previous subproblem solution $\phi_{t-1}^*$. The augmented Lagrangian method stops when $h(\phi) \leq 10^{-8}$.

Total number of iterations before the augmented Lagrangian converges: In GraN-DAG and NOTEARS, every subproblem is approximately solved using an iterative algorithm. Let $T$ be the number of subproblems solved before the convergence of the augmented Lagrangian. For a given subproblem $t$, let $K_t$ be the number of iterations executed to approximately solve it. Let $I = \sum_{t=1}^{T} K_t$ be the total number of iterations before the augmented Lagrangian converges. Table 4 reports the total number of iterations $I$ for GraN-DAG and NOTEARS, averaged over ten data sets. Note that the matrix exponential is evaluated once per iteration. Even though GraN-DAG uses a stochastic gradient algorithm, it requires fewer iterations than NOTEARS, which uses a batch proximal quasi-Newton method. We hypothesize early stopping avoids having to wait until full convergence before moving to the next subproblem, hence reducing the total number of iterations. Note that GraN-DAG's total run time is still larger than that of NOTEARS due to its gradient requiring more computation to evaluate (total runtime of roughly 10 minutes against 1 minute for 20-node graphs and 4 hours against 1 hour for 100-node graphs). GraN-DAG's runtime on 100-node graphs can be roughly halved when executed on GPU.

Table 4: Total number of iterations (×10³) before the augmented Lagrangian converges on Gauss-ANM data.

| Method | 20 nodes ER1 | 20 nodes ER4 | 100 nodes ER1 | 100 nodes ER4 |
|---|---|---|---|---|
| GraN-DAG | 27.3 ± 3.6 | 30.4 ± 4.2 | 23.1 ± 0.7 | 23.1 ± 0.8 |
| NOTEARS | 67.1 ± 35.3 | 72.3 ± 24.3 | 243.6 ± 12.3 | 232.4 ± 12.9 |
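To make the schedule of Equations (12)-(14) concrete, the toy sketch below runs the same outer loop on a two-parameter surrogate problem. The quadratic score, the nonnegative constraint and `solve_subproblem` are stand-ins (the actual method trains the NNs of Section 3 with RMSprop and early stopping); only the coefficient updates mirror Appendix A.1.

```python
import numpy as np
from scipy.optimize import minimize

target = np.array([2.0, -1.0])
score = lambda th: -np.sum((th - target) ** 2)   # stands in for the expected log-likelihood
h = lambda th: th[0] ** 2                        # stands in for h(phi) >= 0, zero iff "acyclic"

def solve_subproblem(theta0, lam, mu):
    """Placeholder inner solver (the paper uses RMSprop with early stopping instead):
    maximize score(theta) - lam*h(theta) - mu/2*h(theta)^2, i.e. Equation (12)."""
    obj = lambda th: -(score(th) - lam * h(th) - 0.5 * mu * h(th) ** 2)
    return minimize(obj, theta0).x

lam, mu, eta, gamma = 0.0, 1e-3, 10.0, 0.9       # lambda_0, mu_0, eta and gamma from Appendix A.1
theta, h_prev = np.array([1.0, 0.0]), np.inf
for t in range(100):                             # outer loop over subproblems
    theta = solve_subproblem(theta, lam, mu)
    if h(theta) <= 1e-8:                         # stopping criterion from Appendix A.1
        break
    lam = lam + mu * h(theta)                    # Equation (14): Lagrangian coefficient update
    if h(theta) > gamma * h_prev:                # constraint not reduced enough:
        mu = eta * mu                            #   increase the penalty coefficient
    h_prev = h(theta)

print(theta)   # approximately (0, -1): feasible (h ~ 0) and score-maximizing
```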
## A.2 THRESHOLDING TO ENSURE ACYCLICITY

The augmented Lagrangian outputs $\phi_T^*$ where $T$ is the number of subproblems solved before declaring convergence. Note that the weighted adjacency matrix $A_{\phi_T^*}$ will most likely not represent an acyclic graph, even if we threshold as we learn, as explained in Section 3.4. We need to remove additional edges to obtain a DAG (edges are removed using the mask presented in Section 3.4). One option would be to remove edges one by one until a DAG is obtained, starting from the edge $(i, j)$ with the lowest $(A_{\phi_T^*})_{ij}$ up to the edge with the highest $(A_{\phi_T^*})_{ij}$. This amounts to gradually increasing the threshold $\epsilon$ until $A_{\phi_T^*}$ is acyclic. However, this approach has the following flaw: it is possible to have $(A_{\phi_T^*})_{ij}$ significantly higher than zero while having $\theta_{(j)}$ almost completely independent of variable $X_i$. This can happen for at least two reasons. First, the NN paths from input $i$ to output $k$ might end up cancelling each other, rendering the input $i$ inactive. Second, some neurons of the NNs might always be saturated for the observed range of inputs, rendering some NN paths effectively inactive without being inactive in the sense described in Section 3.1. Those two observations illustrate the fact that having $(A_{\phi_T^*})_{ij} = 0$ is only a sufficient condition to have $\theta_{(j)}$ independent of variable $X_i$, and not a necessary one.

To avoid this issue, we consider the following alternative. Consider the function $\mathcal{L} : \mathbb{R}^d \mapsto \mathbb{R}^d$ which maps all $d$ variables to their respective conditional likelihoods, i.e. $\mathcal{L}_i(X) \triangleq p_i(X_i \mid X_{\pi_i^{\phi_T^*}}) \;\; \forall i$. We consider the expected Jacobian matrix $\mathcal{J} \triangleq \mathbb{E}_{X \sim P_X} \left| \frac{\partial \mathcal{L}}{\partial X}(X) \right|$, where $\frac{\partial \mathcal{L}}{\partial X}(X)$ is the Jacobian matrix of $\mathcal{L}$ evaluated at $X$, in absolute value (element-wise). Similarly to $(A_{\phi_T^*})_{ij}$, the entry $\mathcal{J}_{ij}$ can be loosely interpreted as the strength of edge $(i, j)$. We propose removing edges starting from the lowest $\mathcal{J}_{ij}$ to the highest, stopping as soon as acyclicity is achieved. We believe $\mathcal{J}$ is better than $A_{\phi_T^*}$ at capturing which NN inputs are effectively inactive since it takes into account NN paths cancelling each other and saturated neurons. Empirically, we found that using $\mathcal{J}$ instead of $A_{\phi_T^*}$ yields better results, and thus we report the results with $\mathcal{J}$ in this paper.

## A.3 PRELIMINARY NEIGHBORHOOD SELECTION AND DAG PRUNING

PNS: For graphs of 50 nodes or more, GraN-DAG performs a preliminary neighborhood selection (PNS) similar to what has been proposed in Bühlmann et al. (2014). This procedure applies a variable selection method to get a set of possible parents for each node. This is done by fitting extremely randomized trees (Geurts et al., 2006) (using ExtraTreesRegressor from scikit-learn) for each variable against all the other variables. For each node, a feature importance score based on the gain of purity is calculated. Only nodes that have a feature importance score higher than 0.75·mean are kept as potential parents, where mean is the mean of the feature importance scores of all nodes. Although the use of PNS in CAM was motivated by gains in computation time, GraN-DAG uses it to avoid overfitting, without reducing the computation time.

Pruning: Once the thresholding is performed and a DAG is obtained as described in A.2, GraN-DAG performs a pruning step identical to CAM (Bühlmann et al., 2014) in order to remove spurious edges. We use the implementation of Bühlmann et al. (2014) based on the R function gamboost from the mboost package. For each variable $X_i$, a generalized additive model is fitted against the current parents of $X_i$ and a significance test of covariates is applied. Parents with a p-value higher than 0.001 are removed from the parent set. Similarly to what Bühlmann et al. (2014) observed, this pruning phase generally has the effect of greatly reducing the SHD without considerably changing the SID.
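A minimal scikit-learn sketch of the PNS step described above is given below; the number of trees and other settings are illustrative and may differ from the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def preliminary_neighborhood_selection(X, ratio=0.75, n_estimators=100, seed=0):
    """For each variable j, fit extremely randomized trees on all other variables and keep
    as candidate parents those whose feature importance exceeds ratio * mean importance.
    Returns a boolean (d, d) matrix where entry (i, j) marks i as a potential parent of j."""
    n, d = X.shape
    candidates = np.zeros((d, d), dtype=bool)
    for j in range(d):
        others = [i for i in range(d) if i != j]
        reg = ExtraTreesRegressor(n_estimators=n_estimators, random_state=seed)
        reg.fit(X[:, others], X[:, j])
        imp = reg.feature_importances_
        keep = imp > ratio * imp.mean()
        candidates[np.array(others)[keep], j] = True
    return candidates

# Toy usage on data where only X0 drives X2 (hypothetical).
rng = np.random.default_rng(0)
X0 = rng.normal(size=500)
X1 = rng.normal(size=500)
X2 = np.sin(X0) + 0.1 * rng.normal(size=500)
mask = preliminary_neighborhood_selection(np.column_stack([X0, X1, X2]))
print(mask[:, 2])   # typically [True, False, False]: only X0 survives as a candidate parent of X2
```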
Ablation study: In Table 5, we present an ablation study which shows the effect of adding PNS and pruning to GraN-DAG on different performance metrics and on the negative log-likelihood (NLL) of the training and validation sets. Note that, before computing both NLLs, we reset all parameters of GraN-DAG except the masks and retrained the model on the training set without any acyclicity constraint (acyclicity is already ensured by the masks at this point). This retraining procedure is important since the pruning removes edges (i.e. some additional NN inputs are masked) and this affects the likelihood of the model (hence the need to retrain).

Table 5: PNS and pruning ablation study for GraN-DAG (averaged over 10 data sets from ER1 with 50 nodes)

| PNS | Pruning | SHD | SID | NLL (train) | NLL (validation) |
|---|---|---|---|---|---|
| False | False | 1086.8 ± 48.8 | 31.6 ± 23.6 | 0.36 ± 0.07 | 1.44 ± 0.21 |
| True | False | 540.4 ± 70.3 | 17.4 ± 16.7 | 0.52 ± 0.08 | 1.16 ± 0.17 |
| False | True | 11.8 ± 5.0 | 39.7 ± 25.5 | 0.78 ± 0.12 | 0.84 ± 0.12 |
| True | True | 6.1 ± 3.3 | 29.3 ± 19.5 | 0.78 ± 0.13 | 0.83 ± 0.12 |

A first observation is that adding PNS and pruning improves the NLL on the validation set while deteriorating the NLL on the training set, showing that those two steps are indeed reducing overfitting. Secondly, the effect on SHD is really important while the effect on SID is almost nonexistent. This can be explained by the fact that SID has more to do with the ordering of the nodes than with false positive edges. For instance, if we have a complete DAG with a node ordering coherent with the ground truth graph, the SID is zero, but the SHD is not, due to all the false positive edges. Without the regularizing effect of PNS and pruning, GraN-DAG manages to find a DAG with a good ordering but with many spurious edges (explaining the poor SHD, the good SID and the big gap between the NLL of the training set and validation set). PNS and pruning help reduce the number of spurious edges, hence improving the SHD.

We also implemented PNS and pruning for NOTEARS and DAG-GNN to see whether their performance could also be improved. Table 6 reports an ablation study for DAG-GNN and NOTEARS. First, the SHD improvement is not as important as for GraN-DAG and is hardly statistically significant. The improved SHD does not come close to the performance of GraN-DAG. Second, PNS and pruning do not have a significant effect on SID, as was the case for GraN-DAG. The lack of improvement for those methods is probably due to the fact that they are not overfitting like GraN-DAG, as the training and validation (unregularized) scores show. NOTEARS captures only linear relationships, thus it will have a hard time overfitting nonlinear data, and DAG-GNN uses a strong form of parameter sharing between its conditional densities, which possibly causes underfitting in a setup where all the parameters of the conditionals are sampled independently. Moreover, DAG-GNN and NOTEARS aggressively threshold their respective weighted adjacency matrices at the end of training (with the default parameters used in the code), which also acts as a form of heavy regularization and allows them to remove many spurious edges. GraN-DAG without PNS and pruning does not threshold as strongly by default, which explains the high SHD of Table 5. To test this explanation, we removed all edges $(i, j)$ for which $(A_\phi)_{ij} < 0.3$³ for GraN-DAG and obtained an SHD of 29.4 ± 15.9 and an SID of 85.6 ± 45.7, showing a significant improvement over NOTEARS and DAG-GNN, even without PNS and pruning.
Table 6: PNS and pruning ablation study for DAG-GNN and NOTEARS (averaged over 10 datasets from ER1 with 50 nodes)

Algorithm | PNS | Pruning | SHD | SID | Score (train) | Score (validation)
DAG-GNN | False | False | 56.8 ± 11.1 | 322.9 ± 103.8 | -2.8 ± 1.5 | -2.2 ± 1.6
DAG-GNN | True | False | 55.5 ± 10.2 | 314.5 ± 107.6 | -2.1 ± 1.6 | -2.1 ± 1.7
DAG-GNN | False | True | 49.4 ± 7.8 | 325.1 ± 103.7 | -1.8 ± 1.1 | -1.8 ± 1.2
DAG-GNN | True | True | 47.7 ± 7.3 | 316.5 ± 105.6 | -1.9 ± 1.6 | -1.9 ± 1.6
NOTEARS | False | False | 64.2 ± 9.5 | 327.1 ± 110.9 | -23.1 ± 1.8 | -23.2 ± 2.1
NOTEARS | True | False | 54.1 ± 10.9 | 321.5 ± 104.5 | -25.2 ± 2.7 | -25.4 ± 2.8
NOTEARS | False | True | 49.5 ± 8.8 | 327.7 ± 111.3 | -26.7 ± 2.0 | -26.8 ± 2.1
NOTEARS | True | True | 49.0 ± 7.6 | 326.4 ± 106.9 | -26.23 ± 2.2 | -26.4 ± 2.4

A.4 LARGE SAMPLE SIZE EXPERIMENT

In this section, we test the bias-variance hypothesis which attempts to explain why CAM is on par with Gra N-DAG on Gauss-ANM data even though its model wrongly assumes that the $f_j$ functions are additive. Table 7 reports the performance of Gra N-DAG and CAM for different sample sizes. We can see that, as the sample size grows, Gra N-DAG ends up outperforming CAM in terms of SID while staying on par in terms of SHD. We explain this observation by the fact that a larger sample size reduces variance for Gra N-DAG, thus allowing it to leverage its greater capacity against CAM, which is stuck with its modelling bias. Both algorithms were run with their respective default hyperparameter combinations. This experiment suggests Gra N-DAG could be an appealing option in settings where the sample size is substantial. The present paper focuses on sample sizes typically encountered in the structure/causal learning literature and leaves this question for future work.

Table 7: Effect of sample size - Gauss-ANM, 50 nodes, ER4 (averaged over 10 datasets)

Sample size | Method | SHD | SID
500 | CAM | 123.5 ± 13.9 | 1181.2 ± 160.8
500 | Gra N-DAG | 130.2 ± 14.4 | 1246.4 ± 126.1
1000 | CAM | 103.7 ± 15.2 | 1074.7 ± 125.8
1000 | Gra N-DAG | 104.4 ± 15.3 | 942.1 ± 69.8
5000 | CAM | 74.1 ± 13.2 | 845.0 ± 159.8
5000 | Gra N-DAG | 71.9 ± 15.9 | 554.1 ± 117.9
10000 | CAM | 66.3 ± 16.0 | 808.1 ± 142.9
10000 | Gra N-DAG | 65.9 ± 19.8 | 453.4 ± 171.7

A.5 DETAILS ON DATA SETS GENERATION

Synthetic data sets: For each data set type, 10 data sets are sampled with 1000 examples each. As with the synthetic data introduced in Section 4.1, for each data set a ground truth DAG G is randomly sampled following the ER scheme and then the data is generated. Unless otherwise stated, all root variables are sampled from $U[-1, 1]$.

Gauss-ANM is generated following $X_j := f_j(X_{\pi_j^G}) + N_j\ \forall j$ with mutually independent noises $N_j \sim \mathcal{N}(0, \sigma_j^2)\ \forall j$, where the functions $f_j$ are independently sampled from a Gaussian process with a unit bandwidth RBF kernel and $\sigma_j^2 \sim U[0.4, 0.8]$. Source nodes are Gaussian with zero mean and variance sampled from $U[1, 2]$.

LIN is generated following $X_j \mid X_{\pi_j^G} \sim w_j^\top X_{\pi_j^G} + 0.2 \cdot \mathcal{N}(0, \sigma_j^2)\ \forall j$, where $\sigma_j^2 \sim U[1, 2]$ and $w_j$ is a vector of $|\pi_j^G|$ coefficients, each sampled from $U[0, 1]$.

ADD-FUNC is generated following $X_j \mid X_{\pi_j^G} \sim \sum_{i \in \pi_j^G} f_{j,i}(X_i) + 0.2 \cdot \mathcal{N}(0, \sigma_j^2)\ \forall j$, where $\sigma_j^2 \sim U[1, 2]$ and the functions $f_{j,i}$ are independently sampled from a Gaussian process with bandwidth one. This model is adapted from Bühlmann et al. (2014).

PNL-GP is generated following $X_j \mid X_{\pi_j^G} \sim \sigma\big(f_j(X_{\pi_j^G}) + \mathrm{Laplace}(0, l_j)\big)\ \forall j$, with the functions $f_j$ independently sampled from a Gaussian process with bandwidth one and $l_j \sim U[0, 1]$. In the two-variable case, this model is identifiable following Corollary 9 from Zhang & Hyvärinen (2009).
To get identifiability according to this corollary, it is important to use non-Gaussian noise, explaining our design choices.

PNL-MULT is generated following $X_j \mid X_{\pi_j^G} \sim \exp\big(\log\big(\sum_{i \in \pi_j^G} X_i\big) + |\mathcal{N}(0, \sigma_j^2)|\big)\ \forall j$, where $\sigma_j^2 \sim U[0, 1]$. Root variables are sampled from $U[0, 2]$. This model is adapted from Zhang et al. (2015).

SynTReN: Ten datasets have been generated using the SynTReN generator (http://bioinformatics.intec.ugent.be/kmarchal/SynTReN/index.html) with the software's default parameters, except for the probability of complex 2-regulator interactions, which was set to 1; the random seeds used were 0 to 9. Each dataset contains 500 samples and comes from a 20-node graph.

Graph types: Erdős-Rényi (ER) graphs are generated by randomly sampling a topological order and by adding directed edges, where allowed, independently with probability $p = \frac{2e}{d^2 - d}$, where $e$ is the expected number of edges in the resulting DAG. Scale-free (SF) graphs were generated using the Barabási-Albert model (Barabási & Albert, 1999), which is based on preferential attachment: nodes are added one by one, and between each new node and the existing nodes, $m$ edges are added (with $m$ chosen so that the expected number of edges is $d$ or $4d$, i.e. $m = 1$ or $m = 4$). An existing node $i$ is chosen with probability $p(k_i) = \frac{k_i}{\sum_j k_j}$, where $k_i$ is the degree of node $i$. While ER graphs have a degree distribution following a Poisson distribution, SF graphs have a degree distribution following a power law: a few nodes, often called hubs, have a high degree. Barabási (2009) has argued that these types of graphs have properties similar to real-world networks found in many different fields, although these claims remain controversial (Clauset et al., 2009). A minimal code sketch of the ER and Gauss-ANM generation schemes described in this appendix is given just before Table 8.

A.6 SUPPLEMENTARY EXPERIMENTS

Gauss-ANM: The results for 20 and 100 nodes are presented in Tables 8 and 9, using the same Gauss-ANM data set types introduced in Section 4.1. The conclusions drawn remain similar to those of the 10- and 50-node experiments. For GES and PC, the SHD and SID are presented in Tables 10 and 11, respectively. Their performance does not compare favorably to that of Gra N-DAG or CAM. Figure 1 shows the entries of the weighted adjacency matrix $A_\phi$ as training proceeds in a typical run for 10 nodes.

LIN & ADD-FUNC: Experiments with LIN and ADD-FUNC data are reported in Tables 12 & 13. The details of their generation are given in Appendix A.5.

PNL-GP & PNL-MULT: Table 14 contains the performance of Gra N-DAG and other baselines on the post nonlinear data discussed in Section 4.1.
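Returning to the data generation of Appendix A.5, the following is a minimal sketch (not the authors' code) of sampling an ER DAG and drawing Gauss-ANM data from it. The probabilities and distributions follow the description above; the function names, the jitter on the Gram matrix and the use of numpy are illustrative choices made for clarity rather than efficiency.

```python
import numpy as np

def sample_er_dag(d, e, rng):
    """Random topological order, then each allowed edge is kept independently
    with probability p = 2e / (d^2 - d), so e is the expected number of edges."""
    p = 2 * e / (d ** 2 - d)
    order = rng.permutation(d)
    adj = np.zeros((d, d), dtype=bool)
    for a in range(d):
        for b in range(a + 1, d):
            adj[order[a], order[b]] = rng.random() < p  # edge from earlier to later node
    return adj, order

def rbf_gram(Z, bandwidth=1.0):
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def sample_gauss_anm(adj, order, n, rng):
    """Generate n samples of X_j := f_j(X_pa(j)) + N_j; each f_j is one joint draw
    from a GP prior (unit-bandwidth RBF kernel) evaluated at the n parent vectors."""
    d = adj.shape[0]
    X = np.zeros((n, d))
    for j in order:  # topological order: parents are filled in before children
        parents = np.flatnonzero(adj[:, j])
        if parents.size == 0:
            # Source node: zero-mean Gaussian with variance drawn from U[1, 2].
            X[:, j] = rng.normal(0.0, np.sqrt(rng.uniform(1.0, 2.0)), size=n)
        else:
            K = rbf_gram(X[:, parents]) + 1e-6 * np.eye(n)  # small jitter for conditioning
            f_j = rng.multivariate_normal(np.zeros(n), K)
            X[:, j] = f_j + rng.normal(0.0, np.sqrt(rng.uniform(0.4, 0.8)), size=n)
    return X

rng = np.random.default_rng(0)
adj, order = sample_er_dag(d=10, e=10, rng=rng)    # ER1 graph with 10 nodes
X = sample_gauss_anm(adj, order, n=1000, rng=rng)  # 1000 samples, as in Appendix A.5
```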
Table 8: Results for ER and SF graphs of 20 nodes with Gauss-ANM data

Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID
Gra N-DAG | 4.0 ± 3.4 | 17.9 ± 19.5 | 45.2 ± 10.7 | 165.1 ± 21.0 | 7.6 ± 2.5 | 28.8 ± 10.4 | 36.8 ± 5.1 | 62.5 ± 18.8
DAG-GNN | 25.6 ± 7.5 | 109.1 ± 53.1 | 75.0 ± 7.7 | 344.8 ± 17.0 | 19.5 ± 1.8 | 60.1 ± 12.8 | 49.5 ± 5.4 | 115.2 ± 33.3
NOTEARS | 30.3 ± 7.8 | 107.3 ± 47.6 | 79.0 ± 8.0 | 346.6 ± 13.2 | 23.9 ± 3.5 | 69.4 ± 19.7 | 52.0 ± 4.5 | 120.5 ± 32.5
CAM | 2.7 ± 1.8 | 10.6 ± 8.6 | 41.0 ± 11.9 | 157.9 ± 41.2 | 5.7 ± 2.6 | 23.3 ± 18.0 | 35.6 ± 4.5 | 59.1 ± 18.8
GSF | 12.3 ± 4.6 | [15.0 ± 19.9, 45.6 ± 22.9] | 41.8 ± 13.8 | [153.7 ± 49.4, 201.6 ± 37.9] | 7.4 ± 3.5 | [5.7 ± 7.1, 27.3 ± 13.2] | 38.6 ± 3.6 | [54.9 ± 14.4, 86.7 ± 24.2]
RANDOM | 103.0 ± 39.6 | 94.3 ± 53.0 | 117.5 ± 25.9 | 298.5 ± 28.7 | 105.2 ± 48.8 | 81.1 ± 54.4 | 121.5 ± 28.5 | 204.8 ± 38.5

Table 9: Results for ER and SF graphs of 100 nodes with Gauss-ANM data

Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID
Gra N-DAG | 15.1 ± 6.0 | 83.9 ± 46.0 | 206.6 ± 31.5 | 4207.3 ± 419.7 | 59.2 ± 7.7 | 265.4 ± 64.2 | 262.7 ± 19.6 | 872.0 ± 130.4
DAG-GNN | 110.2 ± 10.5 | 883.0 ± 320.9 | 379.5 ± 24.7 | 8036.1 ± 656.2 | 97.6 ± 1.5 | 438.6 ± 112.7 | 316.0 ± 14.3 | 1394.6 ± 165.9
NOTEARS | 125.6 ± 12.1 | 913.1 ± 343.8 | 387.8 ± 25.3 | 8124.7 ± 577.4 | 111.7 ± 5.4 | 484.3 ± 138.4 | 327.2 ± 15.8 | 1442.8 ± 210.1
CAM | 17.3 ± 4.5 | 124.9 ± 65.0 | 186.4 ± 28.8 | 4601.9 ± 482.7 | 52.7 ± 9.3 | 230.3 ± 36.9 | 255.6 ± 21.7 | 845.8 ± 161.3
GSF | 66.8 ± 7.3 | [104.7 ± 59.5, 238.6 ± 59.3] | > 12 hours (footnote 4) | > 12 hours (footnote 4) | 71.4 ± 11.2 | [212.7 ± 71.1, 325.3 ± 105.2] | 275.9 ± 21.0 | [793.9 ± 152.5, 993.4 ± 149.2]
RANDOM | 1561.6 ± 1133.4 | 1175.3 ± 547.9 | 2380.9 ± 1458.0 | 7729.7 ± 1056.0 | 2222.2 ± 1141.2 | 1164.2 ± 593.3 | 2485.0 ± 1403.9 | 4206.4 ± 1642.1

Footnote 4: Note that GSF results are missing for two data set types in Tables 9 and 14. This is because the search algorithm could not finish within 12 hours, even when the maximal in-degree was limited to 5. All other methods could run in less than 6 hours.

Table 10: SHD for GES and PC (against Gra N-DAG for reference) with Gauss-ANM data

Method | 10 nodes ER1 | 10 nodes ER4 | 20 nodes ER1 | 20 nodes ER4 | 50 nodes ER1 | 50 nodes ER4 | 100 nodes ER1 | 100 nodes ER4
Gra N-DAG | 1.7 ± 2.5 | 8.3 ± 2.8 | 4.0 ± 3.4 | 45.2 ± 10.7 | 5.1 ± 2.8 | 102.6 ± 21.2 | 15.1 ± 6.0 | 206.6 ± 31.5
GES | 13.8 ± 4.8 | 32.3 ± 4.3 | 43.3 ± 12.4 | 94.6 ± 9.8 | 106.6 ± 24.7 | 254.4 ± 39.3 | 292.9 ± 33.6 | 542.6 ± 51.2
PC | 8.4 ± 3 | 34 ± 2.6 | 20.136.4 6.5 | 73.1 ± 5.8 | 44.0 ± 11.6 | 183.6 ± 20 | 95.2 ± 9.1 | 358.8 ± 20.6

Method | 10 nodes SF1 | 10 nodes SF4 | 20 nodes SF1 | 20 nodes SF4 | 50 nodes SF1 | 50 nodes SF4 | 100 nodes SF1 | 100 nodes SF4
Gra N-DAG | 1.2 ± 1.1 | 9.9 ± 4.0 | 7.6 ± 2.5 | 36.8 ± 5.1 | 25.5 ± 6.2 | 111.3 ± 12.3 | 59.2 ± 7.7 | 262.7 ± 19.6
GES | 8.1 ± 2.4 | 17.4 ± 4.5 | 26.2 ± 7.5 | 50.7 ± 6.2 | 73.9 ± 7.4 | 178.8 ± 16.5 | 190.3 ± 22 | 408.7 ± 24.9
PC | 4.8 ± 2.4 | 16.4 ± 2.8 | 13.6 ± 2.1 | 44.4 ± 4.6 | 43.1 ± 5.7 | 135.4 ± 10.7 | 97.6 ± 6.6 | 314.2 ± 17.5

Table 11: Lower and upper bounds on the SID for GES and PC (against Gra N-DAG for reference) with Gauss-ANM data. See Appendix A.7 for details on how to compute the SID for CPDAGs.
Method | 10 nodes ER1 | 10 nodes ER4 | 20 nodes ER1 | 20 nodes ER4 | 50 nodes ER1 | 50 nodes ER4 | 100 nodes ER1 | 100 nodes ER4
Gra N-DAG | 1.7 ± 3.1 | 21.8 ± 8.9 | 17.9 ± 19.5 | 165.1 ± 21.0 | 22.4 ± 17.8 | 1060.1 ± 109.4 | 83.9 ± 46.0 | 4207.3 ± 419.7
GES | [24.1 ± 17.3, 27.2 ± 17.5] | [68.5 ± 10.5, 75 ± 7] | [62.1 ± 44, 65.7 ± 44.5] | [301.9 ± 19.4, 319.3 ± 13.3] | [150.9 ± 72.7, 155.1 ± 74] | [1996.6 ± 73.1, 2032.9 ± 88.7] | [582.5 ± 391.1, 598.9 ± 408.6] | [8054 ± 524.8, 8124.2 ± 470.2]
PC | [22.6 ± 15.5, 27.3 ± 13.1] | [78.1 ± 7.4, 79.2 ± 5.7] | [80.9 ± 51.1, 94.9 ± 46.1] | [316.7 ± 25.7, 328.7 ± 25.6] | [222.7 ± 138, 256.7 ± 127.3] | [2167.9 ± 88.4, 2178.8 ± 80.8] | [620.7 ± 240.9, 702.5 ± 255.8] | [8236.9 ± 478.5, 8265.4 ± 470.2]

Method | 10 nodes SF1 | 10 nodes SF4 | 20 nodes SF1 | 20 nodes SF4 | 50 nodes SF1 | 50 nodes SF4 | 100 nodes SF1 | 100 nodes SF4
Gra N-DAG | 4.1 ± 6.1 | 16.4 ± 6.0 | 28.8 ± 10.4 | 62.5 ± 18.8 | 90.0 ± 18.9 | 271.2 ± 65.4 | 265.4 ± 64.2 | 872.0 ± 130.4
GES | [11.6 ± 9.2, 16.4 ± 11.7] | [39.3 ± 11.2, 43.9 ± 14.9] | [54.9 ± 23.1, 57.9 ± 24.6] | [89.5 ± 38.4, 105.1 ± 44.3] | [171.6 ± 70.1, 182.7 ± 77] | [496.3 ± 154.1, 529.7 ± 184.5] | [511.5 ± 257.6, 524 ± 252.2] | [1421.7 ± 247.4, 1485.4 ± 233.6]
PC | [8.3 ± 4.6, 16.8 ± 12.3] | [36.5 ± 6.2, 41.7 ± 6.9] | [42.2 ± 14, 59.7 ± 14.9] | [95.6 ± 37, 118.5 ± 30] | [124.2 ± 38.3, 167.1 ± 41.4] | [453.2 ± 115.9, 538 ± 143.7] | [414.5 ± 124.4, 486.5 ± 120.9] | [1369.2 ± 259.9, 1513.7 ± 296.2]

Figure 1: Entries of the weighted adjacency matrix $A_\phi$ as training proceeds in Gra N-DAG for a synthetic data set ER4 with 10 nodes. Green curves represent edges which appear in the ground truth graph while red ones represent edges which do not. The horizontal dashed line at $10^{-4}$ is the threshold $\epsilon$ introduced in Section 3.4. We can see that Gra N-DAG successfully recovers most edges correctly while keeping few spurious edges.

Table 12: LIN

Method | 10 nodes ER1 SHD | 10 nodes ER1 SID | 10 nodes ER4 SHD | 10 nodes ER4 SID | 50 nodes ER1 SHD | 50 nodes ER1 SID | 50 nodes ER4 SHD | 50 nodes ER4 SID
Gra N-DAG | 7.2 ± 2.0 | 27.3 ± 8.1 | 30.7 ± 3.3 | 75.8 ± 6.9 | 33.9 ± 8.6 | 255.8 ± 158.4 | 181.9 ± 24.0 | 2035.8 ± 137.2
DAG-GNN | 10.3 ± 3.5 | 39.6 ± 14.7 | 18.9 ± 4.8 | 63.7 ± 8.9 | 54.1 ± 9.2 | 330.4 ± 117.1 | 130.3 ± 17.3 | 1937.5 ± 89.8
NOTEARS | 9.0 ± 3.0 | 35.3 ± 13.4 | 27.9 ± 4.3 | 72.1 ± 7.9 | 45.5 ± 7.8 | 310.7 ± 125.9 | 126.1 ± 13.0 | 1971.1 ± 134.3
CAM | 10.2 ± 6.3 | 31.2 ± 10.9 | 33.6 ± 3.3 | 77.5 ± 2.3 | 36.2 ± 5.8 | 234.8 ± 105.1 | 182.5 ± 17.6 | 1948.7 ± 113.5
GSF | 9.2 ± 2.9 | [19.5 ± 14.6, 31.6 ± 17.3] | 38.6 ± 3.7 | [73.8 ± 7.6, 85.2 ± 8.3] | 46.7 ± 4.1 | [176.4 ± 98.8, 215.0 ± 108.9] | > 12 hours | > 12 hours
RANDOM | 22.0 ± 2.9 | 30.0 ± 13.8 | 34.4 ± 2.4 | 78.8 ± 5.5 | 692.6 ± 7.5 | 360.3 ± 141.4 | 715.9 ± 16.0 | 1932.7 ± 40.2

Table 13: ADD-FUNC

Method | 10 nodes ER1 SHD | 10 nodes ER1 SID | 10 nodes ER4 SHD | 10 nodes ER4 SID | 50 nodes ER1 SHD | 50 nodes ER1 SID | 50 nodes ER4 SHD | 50 nodes ER4 SID
Gra N-DAG | 2.8 ± 2.5 | 7.5 ± 7.7 | 14.5 ± 5.2 | 52.6 ± 10.8 | 16.6 ± 5.3 | 103.6 ± 52.9 | 86.4 ± 21.6 | 1320.6 ± 145.8
DAG-GNN | 10.1 ± 3.4 | 23.3 ± 11.5 | 18.3 ± 3.6 | 56.4 ± 6.1 | 45.5 ± 7.9 | 261.1 ± 88.8 | 224.3 ± 31.6 | 1741.0 ± 138.3
NOTEARS | 11.1 ± 5.0 | 16.9 ± 11.3 | 20.3 ± 4.9 | 53.5 ± 10.5 | 53.7 ± 9.5 | 276.1 ± 96.8 | 201.8 ± 22.1 | 1813.6 ± 148.4
CAM | 2.5 ± 2.0 | 7.9 ± 6.4 | 6.0 ± 5.6 | 29.3 ± 19.3 | 9.6 ± 5.1 | 39.0 ± 34.1 | 42.9 ± 6.6 | 857.0 ± 184.5
GSF | 9.3 ± 3.9 | [13.9 ± 8.3, 24.1 ± 12.5] | 29.5 ± 4.3 | [60.3 ± 11.6, 75.0 ± 4.5] | 49.5 ± 5.1 | [151.5 ± 73.8, 213.9 ± 82.5] | > 12 hours | > 12 hours
RANDOM | 23.0 ± 2.2 | 26.9 ± 18.1 | 33.5 ± 2.3 | 76.0 ± 6.2 | 689.7 ± 6.1 | 340.0 ± 113.6 | 711.5 ± 9.0 | 1916.2 ± 65.8

Table 14: Synthetic post nonlinear data sets

10 nodes ER1
Method | PNL-GP SHD | PNL-GP SID | PNL-MULT SHD | PNL-MULT SID
Gra N-DAG | 1.6 ± 3.0 | 3.9 ± 8.0 | 13.1 ± 3.8 | 35.7 ± 12.3
DAG-GNN | 11.5 ± 6.8 | 32.4 ± 19.3 | 17.900 ± 6.2 | 40.700 ± 14.743
NOTEARS | 10.7 ± 5.5 | 34.4 ± 19.1 | 14.0 ± 4.0 | 38.6 ± 11.9
CAM | 1.5 ± 2.6 | 6.8 ± 12.1 | 12.0 ± 6.4 | 36.3 ± 17.7
GSF | 6.2 ± 3.3 | [7.7 ± 8.7, 18.9 ± 12.4] | 10.7 ± 3.0 | [9.8 ± 11.9, 25.3 ± 11.5]
RANDOM | 23.8 ± 2.9 | 36.8 ± 19.1 | 23.7 ± 2.9 | 37.7 ± 20.7

10 nodes ER4
Method | PNL-GP SHD | PNL-GP SID | PNL-MULT SHD | PNL-MULT SID
Gra N-DAG | 10.9 ± 6.8 | 39.8 ± 21.1 | 32.1 ± 4.5 | 77.7 ± 5.9
DAG-GNN | 32.3 ± 4.3 | 75.8 ± 9.3 | 37.0 ± 3.1 | 82.7 ± 6.4
NOTEARS | 34.1 ± 3.2 | 80.8 ± 5.5 | 37.7 ± 3.0 | 81.700 ± 7.258
CAM | 8.4 ± 4.8 | 30.5 ± 20.0 | 34.4 ± 3.9 | 79.6 ± 3.8
GSF | 25.0 ± 6.0 | [44.3 ± 14.5, 66.1 ± 10.1] | 31.3 ± 5.4 | [58.6 ± 8.1, 76.4 ± 9.9]
RANDOM | 35.0 ± 3.3 | 80.0 ± 5.1 | 33.6 ± 3.5 | 76.2 ± 7.3

50 nodes ER1
Method | PNL-GP SHD | PNL-GP SID | PNL-MULT SHD | PNL-MULT SID
Gra N-DAG | 16.5 ± 7.0 | 64.1 ± 35.4 | 38.2 ± 11.4 | 213.8 ± 114.4
DAG-GNN | 56.5 ± 11.1 | 334.3 ± 80.3 | 83.9 ± 23.8 | 507.7 ± 253.4
NOTEARS | 50.1 ± 9.9 | 319.1 ± 76.9 | 78.5 ± 21.5 | 425.7 ± 197.0
CAM | 5.1 ± 2.6 | 10.7 ± 12.4 | 44.9 ± 9.9 | 284.3 ± 124.9
GSF | 31.2 ± 6.0 | [59.5 ± 34.1, 122.4 ± 32.0] | 46.3 ± 12.1 | [65.8 ± 62.2, 141.6 ± 72.6]
RANDOM | 688.4 ± 4.9 | 307.0 ± 98.5 | 691.3 ± 7.3 | 488.0 ± 247.8

50 nodes ER4
Method | PNL-GP SHD | PNL-GP SID | PNL-MULT SHD | PNL-MULT SID
Gra N-DAG | 68.7 ± 17.0 | 1127.0 ± 188.5 | 211.7 ± 12.6 | 2047.7 ± 77.7
DAG-GNN | 203.8 ± 18.9 | 2173.1 ± 87.7 | 246.7 ± 16.1 | 2239.1 ± 42.3
NOTEARS | 189.5 ± 16.0 | 2134.2 ± 125.6 | 220.0 ± 9.9 | 2175.2 ± 58.3
CAM | 48.2 ± 10.3 | 899.5 ± 195.6 | 208.1 ± 14.8 | 2029.7 ± 55.4
GSF | 105.2 ± 15.5 | [1573.7 ± 121.2, 1620 ± 102.8] | > 12 hours | > 12 hours
RANDOM | 722.3 ± 9.0 | 1897.4 ± 83.7 | 710.2 ± 9.5 | 1935.8 ± 56.9

A.7 METRICS

SHD takes two partially directed acyclic graphs (PDAGs) and counts the number of node pairs for which the edge type differs between the two PDAGs. There are four edge types: $i \rightarrow j$, $i \leftarrow j$, $i - j$, and no edge between $i$ and $j$. Since this distance is defined over the space of PDAGs, we can use it to compare DAGs with DAGs, DAGs with CPDAGs and CPDAGs with CPDAGs. When comparing a DAG with a CPDAG, having $i - j$ instead of $i \rightarrow j$ counts as a mistake.

SHD-C is very similar to SHD. The only difference is that both DAGs are first mapped to their respective CPDAGs before measuring the SHD.

Introduced by Peters & Bühlmann (2015), SID counts the number of interventional distributions of the form $p(x_i \mid do(x_j = \hat{x}_j))$ that would be miscalculated using the parent adjustment formula (Pearl, 2009) if we were to use the predicted DAG instead of the ground truth DAG to form the parent adjustment set. Some care needs to be taken to evaluate the SID for methods outputting a CPDAG, such as GES and PC. Peters & Bühlmann (2015) propose to report the SID of the DAGs which have approximately the minimal and the maximal SID in the Markov equivalence class given by the CPDAG. See Peters & Bühlmann (2015) for more details.

A.8 HYPERPARAMETERS

All Gra N-DAG runs up to this point were performed using the following set of hyperparameters. We used RMSprop as optimizer with a learning rate of $10^{-2}$ for the first subproblem and $10^{-4}$ for all subsequent subproblems. Each NN has two hidden layers with 10 units (except for the real and pseudo-real data experiments of Table 3, which use only one hidden layer). Leaky-ReLU is used as activation function. The NNs are initialized using the initialization scheme proposed in Glorot & Bengio (2010), also known as Xavier initialization. We used minibatches of 64 samples. This hyperparameter combination was selected via a small-scale experiment in which many hyperparameter combinations were tried manually on a single data set of type ER1 with 10 nodes until one yielding a satisfactory SHD was obtained. Of course, in practice one cannot select hyperparameters in this way since we do not have access to the ground truth DAG. In Appendix A.9, we explain how one could use a held-out data set to select the hyperparameters of score-based approaches and report the results of such a procedure on almost all settings presented in this paper.

For NOTEARS, DAG-GNN and GSF, we used the default hyperparameters found in the authors' code. It (rarely) happens that NOTEARS and DAG-GNN return a cyclic graph. In those cases, we removed edges starting from the weakest to the strongest (according to their respective weighted adjacency matrices), stopping as soon as acyclicity was achieved (similarly to what is explained in Appendix A.2 for Gra N-DAG). For GES and PC, we used the default hyperparameters of the pcalg R package.
For CAM, we used the default hyperparameters found in the CAM R package, with default PNS and DAG pruning.

A.9 HYPERPARAMETER SELECTION VIA HELD-OUT SCORE

Most structure/causal learning algorithms have hyperparameters which must be selected prior to learning. For instance, NOTEARS and GES have a regularizing term in their score controlling the sparsity level of the resulting graph, while CAM has a thresholding level for its pruning phase (also controlling the sparsity of the DAG). Gra N-DAG and DAG-GNN have many hyperparameters such as the learning rate and the architecture choice for the neural networks (i.e. number of hidden layers and hidden units per layer). One approach to selecting hyperparameters in practice consists in trying multiple hyperparameter combinations and keeping the one yielding the best score evaluated on a held-out set (Koller & Friedman, 2009, p. 960). By doing so, one can hopefully avoid finding a DAG which is too dense or too sparse since, if the estimated graph contains many spurious edges, the score on the held-out data set should be penalized. In this section, we experiment with this approach on almost all settings and all methods covered in the present paper.

Experiments: We explored multiple hyperparameter combinations using random search (Bergstra & Bengio, 2012). Tables 15 to 23 report results for each data set type. Each table reports the SHD and SID averaged over 10 data sets and, for each data set, we tried 50 hyperparameter combinations sampled randomly (see Table 24 for the sampling schemes and the sketch after Table 24 for the selection loop). The hyperparameter combination yielding the best held-out score among all 50 runs is selected per data set (i.e. the averaged SHD and SID scores correspond to potentially different hyperparameter combinations on different data sets). 80% of the data was used for training and 20% was held out (Gra N-DAG uses the same data for early stopping and hyperparameter selection). Note that the held-out score is always evaluated without the regularizing term (e.g. the held-out score of NOTEARS was evaluated without its L1 regularizer). The symbols ++ and + indicate that the hyperparameter search improved performance against the default hyperparameter runs by more than one standard deviation and within one standard deviation, respectively; analogously, -- and - indicate a performance reduction. Additional flags indicate that, on average, fewer than 10, between 10 and 25, or between 25 and 45 of the 50 hyperparameter combinations tried allowed the method to converge in less than 12 hours.

Discussion: Gra N-DAG and DAG-GNN are the methods benefiting the most from the hyperparameter selection procedure (although rarely significantly). This might be explained by the fact that neural networks are in general very sensitive to the choice of hyperparameters. However, not all methods improved their performance, and no method improves its performance in all settings. GES and GSF, for instance, often have significantly worse results. This might be due to some degree of model misspecification, which renders the held-out score a poor proxy for graph quality. Moreover, for some methods the gain from hyperparameter tuning might be outweighed by the loss due to the 20% reduction in training samples.

Additional implementation details for held-out score evaluation: Gra N-DAG makes use of a final pruning step to remove spurious edges.
One could simply mask the inputs of the NNs corresponding to removed edges and evaluate the held-out score. However, doing so yields an unrepresentative score, since some masked inputs had an important role in the learned function and, once these inputs are masked, the quality of the fit might greatly suffer. To avoid this, we retrained the whole model from scratch on the training set with the masking fixed to the one recovered after pruning. Then, we evaluated the held-out score with this retrained architecture. During this retraining phase, the estimated graph is fixed; only the conditional densities are relearned. Since NOTEARS and DAG-GNN are not always guaranteed to return a DAG (although they almost always do), some extra thresholding might be needed, as mentioned in Appendix A.8. Similarly to Gra N-DAG's pruning phase, this step can seriously reduce the quality of the fit. To avoid this, we also perform a retraining phase for NOTEARS and DAG-GNN. The model of CAM is also retrained after its pruning phase prior to evaluating its held-out score.

Table 15: Gauss-ANM - 10 nodes with hyperparameter search

Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID
Gra N-DAG | 1.0 ± 1.6+ | 0.4 ± 1.3++ | 5.5 ± 2.8+ | 9.7 ± 8.0++ | 1.3 ± 1.8 | 3.0 ± 3.4+ | 9.6 ± 4.5+ | 15.1 ± 6.1+
DAG-GNN | 10.9 ± 2.6+ | 35.5 ± 13.6+ | 38.3 ± 2.9 | 84.4 ± 3.5 | 9.9 ± 1.7+ | 30.3 ± 18.8 | 21.4 ± 2.1 | 44.0 ± 15.5+
NOTEARS | 26.7 ± 6.9 | 35.2 ± 10.6+ | 20.9 ± 6.6++ | 62.0 ± 6.7++ | 20.4 ± 9.6 | 38.8 ± 16.7 | 26.9 ± 7.4 | 61.1 ± 13.8
CAM | 3.0 ± 4.2 | 2.2 ± 5.7 | 7.7 ± 3.1++ | 23.2 ± 14.7+ | 2.4 ± 2.5 | 5.2 ± 5.5+ | 9.6 ± 3.1+ | 20.1 ± 6.8
GSF | 5.3 ± 3.3+ | [8.3 ± 13.2+, 15.4 ± 13.5] | 23.1 ± 7.9 | [56.1 ± 20.4, 65.1 ± 19.3] | 3.3 ± 2.5 | [7.0 ± 11.6, 12.2 ± 11.0] | 14.2 ± 5.6 | [26.2 ± 11.1, 36.9 ± 21.6]
GES | 38.6 ± 2.1 | [20.3 ± 15.4+, 28.3 ± 18.4] | 33.0 ± 3.4 | [66.2 ± 7.0+, 76.6 ± 4.3] | 38.3 ± 2.4 | [8.8 ± 5.2, 25.5 ± 18.2] | 33.6 ± 4.8 | [32.7 ± 12.7, 52.0 ± 14.0]

Table 16: Gauss-ANM - 50 nodes with hyperparameter search

Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID
Gra N-DAG | 3.8 ± 3.3+ | 15.0 ± 14.0+ | 105.6 ± 16.5 | 1131.7 ± 91.0 | 24.7 ± 6.4+ | 86.5 ± 34.6+ | 112.7 ± 15.5 | 268.3 ± 85.8+
DAG-GNN | 47.0 ± 7.8+ | 268.1 ± 118.0+ | 196.2 ± 14.4 | 1972.8 ± 110.6++ | 51.8 ± 5.6 | 166.5 ± 48.9+ | 144.2 ± 11.6+ | 473.4 ± 105.4+
NOTEARS | 193.5 ± 77.3 | 326.0 ± 99.1+ | 369.5 ± 81.9 | 2062.0 ± 107.7+ | 104.8 ± 22.4 | 290.3 ± 136.8 | 213.0 ± 35.1 | 722.7 ± 177.3
CAM | 4.0 ± 2.7+ | 21.1 ± 22.1+ | 105.6 ± 20.9 | 1225.9 ± 205.7 | 23.8 ± 6.0+ | 81.5 ± 15.3+ | 112.2 ± 14.0 | 333.8 ± 156.0
GSF | 24.9 ± 7.4+ | [40.0 ± 26.3, 77.5 ± 45.3] | 129.3 ± 20.4 | [1280.8 ± 202.3, 1364.1 ± 186.7] | 35.3 ± 6.9 | [99.7 ± 41.7, 151.9 ± 59.7] | 121.6 ± 11.7 | [310.8 ± 108.1, 391.9 ± 93.3]
GES | 1150.1 ± 9.8 | [112.7 ± 71.1+, 132.0 ± 89.0] | 1066.1 ± 11.7 | [1394.3 ± 81.8++, 1464.8 ± 63.8] | 1161.7 ± 7.0 | [322.8 ± 211.1, 336.0 ± 215.4] | 1116.1 ± 14.2 | [1002.7 ± 310.9, 1094.0 ± 345.1]

Table 17: Gauss-ANM - 20 nodes with hyperparameter search

Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID
Gra N-DAG | 2.7 ± 2.3+ | 9.6 ± 10.3+ | 35.9 ± 11.8+ | 120.4 ± 37.0++ | 6.5 ± 2.4+ | 17.5 ± 6.3++ | 35.6 ± 4.1+ | 54.8 ± 14.3+
DAG-GNN | 21.0 ± 6.1+ | 98.8 ± 42.2+ | 77.2 ± 6.5 | 345.6 ± 18.6 | 19.1 ± 0.7+ | 55.0 ± 20.1+ | 50.2 ± 5.4 | 118.7 ± 33.2
NOTEARS | 101.5 ± 39.6 | 100.4 ± 47.0+ | 124.0 ± 16.3 | 267.0 ± 46.5++ | 55.0 ± 28.2 | 87.6 ± 26.9 | 66.7 ± 8.3 | 154.6 ± 43.0
CAM | 2.8 ± 2.2 | 11.5 ± 10.2 | 64.3 ± 29.3 | 121.7 ± 73.1+ | 5.5 ± 1.6+ | 19.3 ± 7.8+ | 36.0 ± 5.1 | 66.3 ± 28.6
GSF | 11.6 ± 3.0+ | [26.4 ± 13.3, 49.8 ± 26.5] | 46.2 ± 12.6 | [172.7 ± 40.8, 213.5 ± 38.6] | 12.8 ± 2.1 | [32.1 ± 14.0, 56.2 ± 13.8] | 42.3 ± 5.1 | [68.9 ± 27.7, 95.1 ± 33.8]
GES | 169.9 ± 5.0 | [45.4 ± 29.2+, 57.2 ± 36.6] | 142.8 ± 7.7 | [223.3 ± 33.6++, 254.7 ± 22.0] | 168.1 ± 3.3 | [46.7 ± 21.7+, 53.3 ± 20.0] | 162.2 ± 10.4 | [151.1 ± 57.4, 195.8 ± 57.4]

Table 18: Gauss-ANM - 100 nodes with hyperparameter search

Method | ER1 SHD | ER1 SID | ER4 SHD | ER4 SID | SF1 SHD | SF1 SID | SF4 SHD | SF4 SID
Gra N-DAG | 15.1 ± 7.5+ | 65.1 ± 33.2+ | 191.6 ± 17.8+ | 4090.7 ± 418.0+ | 51.6 ± 10.2+ | 210.6 ± 51.9++ | 255.7 ± 21.1+ | 790.5 ± 159.7+
DAG-GNN | 103.9 ± 9.1+ | 757.6 ± 215.0+ | 387.1 ± 25.3 | 7741.9 ± 522.5+ | 103.5 ± 8.2 | 391.7 ± 60.0+ | 314.8 ± 16.3+ | 1257.3 ± 185.2+
NOTEARS | 421.3 ± 207.0 | 945.7 ± 339.7 | 631.1 ± 136.6 | 8272.4 ± 444.2 | 244.3 ± 63.8 | 815.6 ± 346.5 | 482.3 ± 114.1 | 1929.7 ± 363.1
CAM | 12.3 ± 4.9++ | 128.0 ± 66.3 | 198.8 ± 22.2 | 4602.2 ± 523.7 | 51.1 ± 9.4+ | 233.6 ± 62.3 | 255.7 ± 22.2 | 851.4 ± 206.0
GSF | 100.2 ± 9.9 | [719.8 ± 242.1, 721.1 ± 242.9] | 387.6 ± 23.9 | [7535.1 ± 595.2, 7535.1 ± 595.2] | 67.3 ± 14.0+ | [254.5 ± 35.4, 340.4 ± 70.4] | 315.1 ± 16.7 | [1214.0 ± 156.4, 1214.0 ± 156.4]
GES | 4782.5 ± 22.9 | [362.3 ± 267.7+, 384.1 ± 293.6] | 4570.1 ± 27.9 | [5400.7 ± 299.2++, 5511.5 ± 308.5] | 4769.1 ± 26.7 | [1311.1 ± 616.6, 1386.2 ± 713.9] | 4691.3 ± 47.3 | [3882.7 ± 1010.6, 3996.7 ± 1075.7]

Table 19: PNL-GP with hyperparameter search

Method | 10 nodes ER1 SHD | 10 nodes ER1 SID | 10 nodes ER4 SHD | 10 nodes ER4 SID | 50 nodes ER1 SHD | 50 nodes ER1 SID | 50 nodes ER4 SHD | 50 nodes ER4 SID
Gra N-DAG | 1.2 ± 2.2+ | 1.9 ± 4.2+ | 9.8 ± 4.9+ | 29.0 ± 17.6+ | 12.8 ± 4.9+ | 55.3 ± 24.2+ | 73.9 ± 16.8 | 1107.2 ± 144.7+
DAG-GNN | 10.6 ± 4.9+ | 35.8 ± 19.6 | 38.6 ± 2.0 | 82.2 ± 5.7 | 48.1 ± 8.4+ | 330.4 ± 69.9+ | 192.5 ± 19.2+ | 2079.5 ± 120.9+
NOTEARS | 20.6 ± 11.4 | 30.5 ± 18.8+ | 24.2 ± 6.5++ | 66.4 ± 6.9++ | 102.1 ± 27.3 | 299.8 ± 85.8+ | 660.0 ± 258.2 | 1744.0 ± 232.9++
CAM | 2.7 ± 4.0 | 6.4 ± 11.8+ | 8.7 ± 4.5 | 30.9 ± 20.4 | 4.0 ± 2.4+ | 10.7 ± 12.4+ | 52.3 ± 8.5 | 913.9 ± 209.3
GSF | 12.9 ± 3.9 | [10.5 ± 8.7, 53.6 ± 23.8] | 40.7 ± 1.3 | [79.2 ± 3.8, 79.2 ± 3.8] | 48.8 ± 3.9 | [281.6 ± 70.7, 281.6 ± 70.7] | 199.9 ± 15.2 | [1878.0 ± 122.4, 1948.4 ± 139.6]

Table 20: PNL-MULT with hyperparameter search

Method | 10 nodes ER1 SHD | 10 nodes ER1 SID | 10 nodes ER4 SHD | 10 nodes ER4 SID | 50 nodes ER1 SHD | 50 nodes ER1 SID | 50 nodes ER4 SHD | 50 nodes ER4 SID
Gra N-DAG | 10.0 ± 4.5+ | 29.1 ± 9.7+ | 32.9 ± 3.3 | 76.7 ± 4.1+ | 59.8 ± 28.2 | 213.6 ± 97.3+ | 272.1 ± 69.4 | 2021.6 ± 185.8+
DAG-GNN | 14.6 ± 3.1++ | 36.9 ± 10.6+ | 38.9 ± 2.0 | 85.8 ± 1.2 | 64.3 ± 27.8+ | 508.8 ± 317.2 | 212.5 ± 12.3++ | 2216.9 ± 95.6+
NOTEARS | 28.8 ± 9.1 | 30.3 ± 11.8+ | 35.4 ± 3.8+ | 78.4 ± 7.5+ | 160.2 ± 67.5 | 443.5 ± 205.1 | 229.2 ± 25.4 | 2158.8 ± 70.3+
CAM | 17.2 ± 8.0 | 33.7 ± 14.4+ | 32.3 ± 6.5+ | 76.6 ± 8.2+ | 97.5 ± 71.1 | 282.3 ± 123.8+ | 251.0 ± 25.9 | 2026.2 ± 58.2+
GSF | 15.6 ± 4.4 | [10.0 ± 6.3, 60.1 ± 17.2] | 39.3 ± 2.2 | [76.0 ± 9.6, 79.9 ± 5.3] | 66.4 ± 14.4 | [145.1 ± 96.1, 618.8 ± 257.0] | > 12 hours | > 12 hours

Table 21: LIN with hyperparameter search

Method | 10 nodes ER1 SHD | 10 nodes ER1 SID | 10 nodes ER4 SHD | 10 nodes ER4 SID | 50 nodes ER1 SHD | 50 nodes ER1 SID | 50 nodes ER4 SHD | 50 nodes ER4 SID
Gra N-DAG | 10.1 ± 3.9 | 28.7 ± 14.7 | 34.7 ± 2.9 | 79.5 ± 4.4 | 40.8 ± 10.3 | 236.3 ± 101.7+ | 256.9 ± 55.7 | 2151.4 ± 144.3
DAG-GNN | 9.0 ± 2.7++ | 35.6 ± 11.4 | 19.6 ± 4.6+ | 63.9 ± 7.5 | 48.3 ± 6.8+ | 381.7 ± 145.4 | 149.7 ± 17.2++ | 2070.7 ± 51.9
NOTEARS | 14.0 ± 4.1 | 32.2 ± 7.9+ | 20.7 ± 5.1++ | 63.1 ± 8.0++ | 87.7 ± 44.3 | 294.3 ± 99.3+ | 200.3 ± 67.1 | 1772.7 ± 143.7++
CAM | 8.8 ± 6.0+ | 25.8 ± 13.5+ | 33.9 ± 2.8 | 77.1 ± 4.5+ | 34.8 ± 7.0+ | 221.2 ± 98.3+ | 202.2 ± 14.3 | 1990.8 ± 97.5
GSF | 10.7 ± 3.5 | [15.8 ± 8.4, 45.2 ± 20.2] | 33.4 ± 3.3++ | [71.7 ± 11.5+, 77.3 ± 6.1] | 54.4 ± 6.5 | [158.1 ± 115.9, 560.9 ± 220.7] | 195.6 ± 9.9 | [2004.9 ± 85.2, 2004.9 ± 85.2]

Table 22: ADD-FUNC with hyperparameter search

Method | 10 nodes ER1 SHD | 10 nodes ER1 SID | 10 nodes ER4 SHD | 10 nodes ER4 SID | 50 nodes ER1 SHD | 50 nodes ER1 SID | 50 nodes ER4 SHD | 50 nodes ER4 SID
Gra N-DAG | 2.6 ± 2.4+ | 4.3 ± 4.3+ | 7.0 ± 3.1++ | 37.1 ± 12.4++ | 13.2 ± 6.7+ | 72.1 ± 55.2+ | 90.1 ± 25.6 | 1241.7 ± 289.8+
DAG-GNN | 8.7 ± 2.8++ | 22.3 ± 9.4+ | 25.3 ± 3.8++ | 63.6 ± 8.6++ | 44.7 ± 9.7++ | 306.9 ± 114.7+ | 194.0 ± 20.4+ | 1949.3 ± 107.1+
NOTEARS | 21.2 ± 11.5 | 15.5 ± 9.9+ | 13.3 ± 4.3++ | 41.3 ± 11.5++ | 186.8 ± 83.0 | 276.9 ± 92.1 | 718.4 ± 170.4 | 1105.9 ± 250.1++
CAM | 3.0 ± 2.2 | 8.1 ± 6.3 | 6.2 ± 5.5 | 28.5 ± 21.5+ | 10.0 ± 4.6 | 44.2 ± 32.1 | 46.6 ± 9.5 | 882.5 ± 186.5
GSF | 5.5 ± 4.1+ | [7.5 ± 12.3+, 16.3 ± 12.9] | 19.1 ± 7.0++ | [44.5 ± 19.7+, 60.4 ± 16.5] | 29.8 ± 7.6++ | [44.6 ± 42.6++, 96.8 ± 46.7] | 140.4 ± 31.7 | [1674.4 ± 133.9, 1727.6 ± 145.2]

Table 23: Results for real and pseudo-real data sets with hyperparameter search

Method | Protein SHD | Protein SHD-C | Protein SID | SynTReN (20 nodes) SHD | SynTReN (20 nodes) SHD-C | SynTReN (20 nodes) SID
Gra N-DAG | 12.0+ | 9.0+ | 48.0 | 41.2 ± 9.6 | 43.7 ± 8.3 | 144.3 ± 61.3+
Gra N-DAG++ | 14.0 | 11.0 | 57.0 | 46.9 ± 14.9 | 49.5 ± 14.7 | 158.4 ± 61.5
DAG-GNN | 16.0 | 14.0+ | 59.0 | 32.2 ± 5.0++ | 32.3 ± 5.6++ | 194.2 ± 50.2
NOTEARS | 15.0+ | 14.0+ | 58.0 | 44.2 ± 27.5++ | 45.8 ± 27.7++ | 183.1 ± 48.4
CAM | 11.0+ | 9.0 | 51.0+ | 101.7 ± 37.2 | 105.6 ± 36.6 | 111.5 ± 25.3++
GSF | 20.0 | 14.0 | [37.0+, 60.0] | 27.8 ± 5.4++ | 27.8 ± 5.4++ | [207.6 ± 55.4, 209.6 ± 59.1]
GES | 47.0 | 50.0 | [37.0+, 47.0] | 167.5 ± 5.6 | 172.2 ± 7.0 | [75.3 ± 24.4++, 97.6 ± 30.8]

Table 24: Hyperparameter search spaces for each algorithm

Algorithm | Hyperparameter | Search space
Gra N-DAG | Log(learning rate), first subproblem | U[-2, -3]
Gra N-DAG | Log(learning rate), other subproblems | U[-3, -4]
Gra N-DAG | ϵ | U{10^-3, 10^-4, 10^-5}
Gra N-DAG | Log(pruning cutoff) | U{-5, -4, -3, -2, -1}
Gra N-DAG | # hidden units | U{4, 8, 16, 32}
Gra N-DAG | # hidden layers | U{1, 2, 3}
Gra N-DAG | Constraint convergence tolerance | U{10^-6, 10^-8, 10^-10}
Gra N-DAG | PNS threshold | U{0.5, 0.75, 1, 2}
DAG-GNN | Log(learning rate) | U[-4, -2]
DAG-GNN | # hidden units in encoder | U{16, 32, 64, 128, 256}
DAG-GNN | # hidden units in decoder | U{16, 32, 64, 128, 256}
DAG-GNN | Bottleneck dimension (dimension of Z) | U{1, 5, 10, 50, 100}
DAG-GNN | Constraint convergence tolerance | U{10^-6, 10^-8, 10^-10}
NOTEARS | L1 regularizer coefficient | U{0.001, 0.005, 0.01, 0.05, 0.1, 0.5}
NOTEARS | Final threshold | U{0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1}
NOTEARS | Constraint convergence tolerance | U{10^-6, 10^-8, 10^-10}
CAM | Log(pruning cutoff) | U[-6, 0]
GSF | Log(RKHS regression regularizer) | U[-4, 4]
GES | Log(regularizer coefficient) | U[-4, 4]
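To make the selection procedure of Appendix A.9 concrete, here is a minimal sketch of random search with held-out score selection; it is not tied to any particular method, and `sample_hyperparameters`, `train_model` and `heldout_score` are hypothetical stand-ins for a method's own sampling scheme (Table 24), training routine and unregularized score.

```python
import numpy as np

def select_by_heldout_score(X, sample_hyperparameters, train_model, heldout_score,
                            n_trials=50, train_fraction=0.8, seed=0):
    """Draw random hyperparameter combinations, train on 80% of the data and keep
    the run with the best (unregularized) score on the held-out 20%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_train = int(train_fraction * X.shape[0])
    X_train, X_heldout = X[idx[:n_train]], X[idx[n_train:]]

    best_model, best_score = None, -np.inf
    for _ in range(n_trials):
        hparams = sample_hyperparameters(rng)    # e.g. draw from the spaces of Table 24
        model = train_model(X_train, hparams)    # fit the structure learner on the 80% split
        score = heldout_score(model, X_heldout)  # unregularized score on the held-out 20%
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

# Tiny demo with stand-in callables.
if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(1000, 10))
    sample_hp = lambda rng: {"lr": 10 ** rng.uniform(-4, -2)}
    train = lambda X_tr, hp: {"mean": X_tr.mean(axis=0), "hp": hp}
    score = lambda model, X_ho: -np.mean((X_ho - model["mean"]) ** 2)
    model, best = select_by_heldout_score(X, sample_hp, train, score, n_trials=5)
    print(best)
```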